Incident Management

Intermediate ⭐ 80 XP ⏱ 18 min #sre #incidents #on-call

Respond to outages calmly with clear roles, communication, and blameless follow-up.

📖Theory

Incident management is the structured response to an outage. A repeatable process beats heroics:

  1. Detect — alerts (or a report) surface the problem
  2. Declare & assign roles — an Incident Commander coordinates; others handle ops and communications
  3. Mitigate first — restore service before root-causing (roll back, fail over, scale up) to stop the bleeding
  4. Communicate — keep stakeholders updated on status and ETA
  5. Resolve & review — a blameless postmortem with concrete action items
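
The five phases above can be sketched as a tiny state machine. This is a minimal illustration, not a real incident tool: the `Phase` and `Incident` names, the role keys, and the strictly linear ordering are all assumptions made for the example (in practice, communication usually runs alongside mitigation).

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Phase(IntEnum):
    """The five phases from the list above, in order (hypothetical names)."""
    DETECT = 1
    DECLARE = 2
    MITIGATE = 3
    COMMUNICATE = 4
    RESOLVE_AND_REVIEW = 5


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DETECT
    roles: dict[str, str] = field(default_factory=dict)

    def declare(self, commander: str, ops: str, comms: str) -> None:
        """Move from DETECT to DECLARE and record who holds each role."""
        if self.phase is not Phase.DETECT:
            raise ValueError("incident was already declared")
        self.roles = {"incident_commander": commander, "ops": ops, "comms": comms}
        self.phase = Phase.DECLARE

    def advance(self) -> Phase:
        """Step to the next phase; refuse to skip the declaration step."""
        if self.phase is Phase.DETECT:
            raise ValueError("declare the incident and assign roles first")
        if self.phase is Phase.RESOLVE_AND_REVIEW:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase + 1)
        return self.phase


# Usage: roles must be assigned before any mitigation work is tracked.
incident = Incident("checkout error rate 30%")
incident.declare(commander="alice", ops="bob", comms="carol")
incident.advance()  # the incident is now in Phase.MITIGATE
```

Making `advance` fail until `declare` has run encodes the point that roles come before action: nobody starts fixing things until someone is coordinating.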

Clear roles prevent chaos, and mitigating before diagnosing is the key instinct: users want the service back, not an explanation.
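
As a rough sketch of the mitigate-first instinct, the function below picks one of the three mitigations named above from a few yes/no signals. The `Signals` fields, the function name, and especially the order of the checks are assumptions for illustration; a real team would encode its own runbook.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical facts a responder might know a few minutes into an incident."""
    recent_deploy: bool       # a deploy landed shortly before the errors began
    standby_healthy: bool     # a replica or secondary region can take traffic
    capacity_saturated: bool  # CPU, memory, or connection pools are exhausted


def choose_mitigation(s: Signals) -> str:
    """Return the first mitigation that applies; the ordering is an assumption."""
    if s.recent_deploy:
        return "roll back"
    if s.standby_healthy:
        return "fail over"
    if s.capacity_saturated:
        return "scale up"
    # Nothing cheap and reversible applies, so diagnosis has to continue.
    return "keep investigating"


print(choose_mitigation(Signals(recent_deploy=True,
                                standby_healthy=True,
                                capacity_saturated=False)))  # prints: roll back
```

Note that the function never asks why the deploy broke things. Knowing that a deploy happened is enough to act, and the deeper question waits for the postmortem.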

🌍Real-World Example
Incident timeline:
  14:02  Alert: error rate 30%         (Detect)
  14:04  IC declared, roles assigned   (Coordinate)
  14:06  Last deploy identified        (Diagnose enough to act)
  14:08  Rolled back → errors drop     (Mitigate)
  14:10  Status update sent            (Communicate)
  Next day: blameless postmortem + action items (Review)
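
A postmortem often starts by turning a timeline like this into durations. The sketch below computes two of them from the timestamps above; the dictionary keys and the `minutes_between` helper are names made up for this example, and the clock starts at the alert because the timeline does not say when the failure actually began.

```python
from datetime import datetime

# Timestamps copied from the timeline above (same day, 24-hour clock).
TIMELINE = {
    "detect": "14:02",
    "declare": "14:04",
    "diagnose": "14:06",
    "mitigate": "14:08",
    "communicate": "14:10",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM stamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_declare = minutes_between(TIMELINE["detect"], TIMELINE["declare"])
time_to_mitigate = minutes_between(TIMELINE["detect"], TIMELINE["mitigate"])
print(time_to_declare, time_to_mitigate)  # prints: 2 6
```

Six minutes from alert to rollback is what the mitigate-first approach buys: the team diagnosed only enough to know which action to take.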
✍️Hands-On Exercise
  1. List the phases of incident response in order.
  2. Explain the Incident Commander role.
  3. Why mitigate before fully diagnosing?
  4. What should a postmortem produce besides a narrative?
🧾Cheat Sheet
Phase        Action
Detect       Alert/report surfaces issue
Declare      Assign Incident Commander + roles
Mitigate     Restore service first
Communicate  Update stakeholders
Resolve      Confirm recovery
Review       Blameless postmortem + actions
💬Common Interview Questions
What are the key steps of incident management?

Detect, declare and assign roles (Incident Commander), mitigate to restore service, communicate status, then resolve and run a blameless postmortem with action items.

Why mitigate before finding the root cause?

Users care about the service being restored, not the explanation. Rolling back or failing over stops the impact quickly; root-cause analysis can follow once users are no longer affected.

What does an Incident Commander do?

Coordinates the response — assigns roles, drives decisions, and keeps the effort organized — without necessarily doing the hands-on fixing themselves.
