Distributed Diagnostics Assistant
A specification-first diagnostic assistant for distributed systems incidents. Builds bounded diagnostic state across iterations instead of producing free-form answers.
This project started as an attempt to treat distributed systems diagnosis as an iterative reasoning problem, not just a retrieval or answer-generation task. I wanted to explore what happens if an assistant does not immediately jump to a root-cause claim, but instead works more like a careful investigation: grounding itself in precedent, keeping competing explanations alive, and focusing on the next check that would actually reduce uncertainty.
That pushed the project toward a more structured design than a typical RAG system. The assistant is built around a small diagnostic state: a current problem understanding, a leading explanation, a competing explanation, and one discriminating check. It also treats follow-up observations as part of the same evolving investigation rather than as unrelated chat turns, so the system can update its hypotheses as new evidence appears.
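The diagnostic state described above can be sketched as a small immutable record. This is a minimal illustration, not the project's actual schema: the names `Hypothesis`, `DiagnosticState`, and `apply_observation`, the numeric confidence field, and the fixed-shift update rule are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Hypothesis:
    """One candidate explanation with a rough confidence in [0, 1]."""
    summary: str
    confidence: float


@dataclass(frozen=True)
class DiagnosticState:
    """Bounded state carried across iterations of one investigation."""
    problem: str                         # current problem understanding
    leading: Hypothesis                  # leading explanation
    competing: Hypothesis                # competing explanation kept alive
    next_check: str                      # the one discriminating check
    observations: Tuple[str, ...] = ()   # follow-up evidence seen so far
    iteration: int = 0


def _clamp(value: float) -> float:
    """Keep a confidence value inside [0, 1]."""
    return min(1.0, max(0.0, value))


def apply_observation(
    state: DiagnosticState,
    observation: str,
    supports_leading: bool,
    shift: float = 0.2,
    next_check: Optional[str] = None,
) -> DiagnosticState:
    """Return a new state for the same investigation after one observation.

    Hypothetical update rule: move `shift` confidence toward whichever
    explanation the observation supports, then swap roles if the competing
    explanation now has the higher confidence.
    """
    delta = shift if supports_leading else -shift
    lead = replace(state.leading, confidence=_clamp(state.leading.confidence + delta))
    comp = replace(state.competing, confidence=_clamp(state.competing.confidence - delta))
    if comp.confidence > lead.confidence:
        lead, comp = comp, lead
    return replace(
        state,
        leading=lead,
        competing=comp,
        next_check=next_check if next_check is not None else state.next_check,
        observations=state.observations + (observation,),
        iteration=state.iteration + 1,
    )
```

Because the record is frozen, each follow-up produces a new state rather than mutating the old one, which keeps every iteration of an investigation available for inspection. A real update would presumably derive the confidence change and the next check from refreshed retrieval rather than from a fixed `shift`.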
At the center of the project is a simple idea: the most useful first response in an incident is often not a confident conclusion, but a disciplined next move. To support that, the system retrieves both incident precedents and mechanism-level theory, separates primary and alternative context, and packs only the evidence that helps explain the match, preserve ambiguity, or shape the next check.
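One way to picture the packing step is as a filter followed by a budgeted selection. The sketch below is an assumption-laden illustration: the `Evidence` fields, the three role labels, the relevance `score`, the size `cost`, and the greedy strategy are all hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the three purposes evidence may serve.
ROLES = ("explains_match", "preserves_ambiguity", "shapes_check")


@dataclass(frozen=True)
class Evidence:
    text: str
    source: str            # "precedent" or "theory"
    role: Optional[str]    # one of ROLES, or None if it serves no purpose
    score: float           # retrieval relevance, higher is better
    cost: int              # rough size, for example a token count


def pack_evidence(candidates: List[Evidence], budget: int) -> List[Evidence]:
    """Greedily pack the highest-scoring useful evidence within a size budget."""
    # Drop anything that does not serve one of the three purposes.
    useful = [e for e in candidates if e.role in ROLES]
    useful.sort(key=lambda e: e.score, reverse=True)

    packed: List[Evidence] = []
    used = 0
    for item in useful:
        # Skip items that would overflow the budget; smaller ones may still fit.
        if used + item.cost <= budget:
            packed.append(item)
            used += item.cost
    return packed
```

Greedy selection by score is the simplest choice here. A packer that reserves part of the budget for alternative context would do more to guarantee that the competing explanation stays represented.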
What makes the project interesting to me is that it treats diagnosis as something that should be inspectable end to end. Rather than asking only whether the final answer sounds good, we can ask more useful questions:
- did the system structure the problem correctly?
- did it retrieve the right precedent and meaningful competing context?
- did the next check actually discriminate between live explanations?
- did the continuation update learn from the new observation or just rephrase the previous answer?
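The four questions above map naturally onto per-stage scores. The following is a minimal sketch of that idea, assuming each stage has already been graded on a 0 to 1 scale; the `ChainScores` fields and the `summarize` helper are hypothetical names, and how each score is produced is out of scope here.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class ChainScores:
    """One score in [0, 1] per stage of the diagnostic chain."""
    structuring: float            # was the problem structured correctly?
    retrieval: float              # right precedent and competing context?
    check_discrimination: float   # does the check separate live explanations?
    continuation_update: float    # did the update learn from the observation?


def summarize(scores: ChainScores) -> Dict[str, Any]:
    """Report the mean score and the weakest stage of one diagnostic run."""
    per_stage = {f.name: getattr(scores, f.name) for f in fields(scores)}
    weakest = min(per_stage, key=per_stage.get)
    return {
        "mean": sum(per_stage.values()) / len(per_stage),
        "weakest_stage": weakest,
        "per_stage": per_stage,
    }
```

Surfacing the weakest stage alongside the mean makes it clearer where a run went wrong, which a single final-answer score cannot show.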
The current implementation includes a stateful runtime, structured query interpretation, precedent and theory retrieval, bounded evidence packing, continuation handling, and an evaluation layer that scores not just final responses but the whole diagnostic chain. The broader goal is to make incident reasoning more grounded, more testable, and easier to improve systematically.
Case Study
The Amazon RDS reader stale-reads case follows one diagnostic run across three iterations: competing explanations → new observation → refreshed retrieval → updated hypothesis confidence → more targeted check.