Root Cause Analysis
Go past the symptom to the underlying cause so a problem stays fixed.
Restarting a service stops the bleeding, but if you don’t find why it failed, it will happen again. Root cause analysis (RCA) digs past the symptom to the systemic cause and the fix that prevents recurrence.
A simple, powerful technique is the 5 Whys: ask “why?” repeatedly until you reach something fixable in the system, not a person.
- Symptom: the site went down
- Why? The server ran out of disk
- Why? Logs filled the disk
- Why? Log rotation wasn’t configured
- Why? It wasn’t in the provisioning template → fix the template
Incident: API latency spiked to 10s for 30 minutes.
Contributing factors (not just one "root"):
- A slow query was deployed (trigger)
- No query timeout existed (it could run unbounded)
- No alert fired on p99 latency (slow detection)
Action items:
- Add an index + statement timeout (prevent)
- Add a p99 latency SLO alert (detect faster)
- Add query review to the PR checklist (catch earlier) - Take a past incident and run the 5 Whys until you hit a systemic cause.
- For that incident, write three action items: prevent, detect, mitigate.
- Rewrite a “human error” conclusion as a systemic, blameless one.
- Explain why a single “root cause” is often really several contributing factors.
Cheat Sheet▾
| Concept | Meaning |
|---|---|
| Symptom vs cause | What you saw vs why it happened |
| 5 Whys | Ask “why” until systemic/fixable |
| Contributing factors | Usually several, not one |
| Blameless | Fix systems, not people |
| Action items | Prevent / detect / mitigate |
| Postmortem | Written record + follow-ups |
Common Interview Questions▾
What is root cause analysis?
A structured look past the immediate symptom to the underlying systemic cause, so the fix prevents recurrence rather than just restoring service.
What are the 5 Whys?
A technique of asking “why?” iteratively (about five times) to move from a surface symptom to a deeper, fixable cause in the system.
Why should postmortems be blameless?
Blame makes people hide information, which blocks learning. Assuming good intent surfaces the systemic gaps that actually need fixing and keeps reporting honest.