Root Cause Analysis

Advanced · ⏱ 18 min · #troubleshooting #rca #postmortem

Go past the symptom to the underlying cause so a problem stays fixed.

📖 Theory

Restarting a service stops the bleeding, but unless you find out why it failed, the failure is likely to recur. Root cause analysis (RCA) looks past the symptom to the systemic cause and identifies the fix that prevents recurrence.

A simple, powerful technique is the 5 Whys: ask “why?” repeatedly until you reach a cause that can be fixed in the system, rather than stopping at a person to blame.

  • Symptom: the site went down
  • Why? The server ran out of disk
  • Why? Logs filled the disk
  • Why? Log rotation wasn’t configured
  • Why? It wasn’t in the provisioning template → fix the template
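
The chain above can also be written down as data, which makes it easy to paste into a postmortem. This is a minimal sketch only: the `WhyChain` class and its method names are hypothetical, and it assumes the last answer recorded is the systemic cause you intend to fix.

```python
from dataclasses import dataclass


@dataclass
class WhyChain:
    """A symptom plus the ordered answers to each successive "why?"."""
    symptom: str
    answers: list[str]

    def systemic_cause(self) -> str:
        # Assumption: the last answer is the deepest cause reached so far.
        return self.answers[-1]

    def render(self) -> str:
        # One line for the symptom, then one numbered line per "why?".
        lines = [f"Symptom: {self.symptom}"]
        lines += [f"Why #{i}? {a}" for i, a in enumerate(self.answers, start=1)]
        return "\n".join(lines)


chain = WhyChain(
    symptom="the site went down",
    answers=[
        "The server ran out of disk",
        "Logs filled the disk",
        "Log rotation wasn't configured",
        "It wasn't in the provisioning template",
    ],
)
print(chain.render())
print("Fix target:", chain.systemic_cause())
```

Note that the chain stops after four answers, not five: the number is a guideline, and the stopping point is whatever can be fixed in the system (here, the template).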

🌍 Real-World Example

Incident: API latency spiked to 10s for 30 minutes.

Contributing factors (not just one "root"):
  - A slow query was deployed (trigger)
  - No query timeout existed (it could run unbounded)
  - No alert fired on p99 latency (slow detection)

Action items:
  - Add an index + statement timeout       (prevent)
  - Add a p99 latency SLO alert            (detect faster)
  - Add query review to the PR checklist   (catch earlier)
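
To illustrate the "detect faster" item, here is a rough sketch of a p99 latency check. It assumes latencies are collected per time window as a list of milliseconds and uses the nearest-rank definition of a percentile. The function names, the sample window, and the 500 ms threshold are all hypothetical; a real alert would normally live in your monitoring system, not in application code.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct is in the range 0-100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest rank: smallest position covering at least pct% of the samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_breaches_slo(latencies_ms: list[float], threshold_ms: float) -> bool:
    """True when the window's p99 latency exceeds the (hypothetical) threshold."""
    return percentile(latencies_ms, 99) > threshold_ms


# Hypothetical window: most requests are fast, a few hit the slow query.
window = [120.0] * 95 + [10_000.0] * 5

print("p50:", percentile(window, 50))
print("p99:", percentile(window, 99))
print("alert:", p99_breaches_slo(window, threshold_ms=500.0))
```

In this made-up window the median stays at 120 ms while the p99 is 10 s, which is why an alert on the tail catches the incident when a median-based alert would not.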

✍️ Hands-On Exercise

  1. Take a past incident and run the 5 Whys until you hit a systemic cause.
  2. For that incident, write three action items: prevent, detect, mitigate.
  3. Rewrite a “human error” conclusion as a systemic, blameless one.
  4. Explain why a single “root cause” is often really several contributing factors.
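
For step 2, one way to check your own list is to tag each action item with its kind and see which kinds are missing. This is only a sketch: the `missing_kinds` helper and the tuple layout are hypothetical, and the two sample items are borrowed from the latency incident in this lesson.

```python
REQUIRED_KINDS = ("prevent", "detect", "mitigate")


def missing_kinds(action_items: list[tuple[str, str]]) -> list[str]:
    """Return the required kinds that no action item covers, in a fixed order."""
    seen = {kind for kind, _description in action_items}
    return [kind for kind in REQUIRED_KINDS if kind not in seen]


# Each item is (kind, description).
items = [
    ("prevent", "Add an index + statement timeout"),
    ("detect", "Add a p99 latency SLO alert"),
]

print(missing_kinds(items))  # → ['mitigate']
```

An empty result means every kind has at least one action item; it says nothing about whether the items are good ones.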

🧾 Cheat Sheet

Concept               Meaning
Symptom vs cause      What you saw vs why it happened
5 Whys                Ask “why” until systemic/fixable
Contributing factors  Usually several, not one
Blameless             Fix systems, not people
Action items          Prevent / detect / mitigate
Postmortem            Written record + follow-ups

💬 Common Interview Questions

What is root cause analysis?

A structured look past the immediate symptom to the underlying systemic cause, so the fix prevents recurrence rather than just restoring service.

What are the 5 Whys?

A technique of asking “why?” iteratively (about five times) to move from a surface symptom to a deeper, fixable cause in the system.

Why should postmortems be blameless?

Blame makes people hide information, which blocks learning. Assuming good intent surfaces the systemic gaps that actually need fixing and keeps reporting honest.
