Incident Management

Intermediate ⭐ 80 XP ⏱ 18 min #sre #incidents #on-call

Respond to outages calmly with clear roles, communication, and blameless follow-up.

📖Theory

Incident management is the structured response to an outage. A repeatable process beats heroics:

1. Detect: an alert or a user report surfaces the problem.
2. Declare and assign roles: an Incident Commander coordinates while others handle operations and communications.
3. Mitigate first: stop the bleeding by restoring service (roll back, fail over, scale up) before hunting for the root cause.
4. Communicate: keep stakeholders updated on status and expected time to recovery.
5. Resolve and review: confirm recovery, then hold a blameless postmortem that produces concrete action items.

Clear roles prevent chaos. The key instinct is to mitigate before diagnosing: during an outage, users want the service back, not an explanation.
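The phases and roles described above can be sketched as a small data model. This is only an illustration: the `Phase` and `Incident` names, the three role keys, and the rule that an incident can only move forward are assumptions made for the example, not part of any particular incident tool. In practice, communication usually runs alongside mitigation rather than strictly after it.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Phase(IntEnum):
    """Phases in the order this lesson lists them (hypothetical model)."""
    DETECT = 1
    DECLARE = 2
    MITIGATE = 3
    COMMUNICATE = 4
    RESOLVE = 5
    REVIEW = 6


@dataclass
class Incident:
    title: str
    roles: dict[str, str] = field(default_factory=dict)
    phase: Phase = Phase.DETECT

    def declare(self, commander: str, ops: str, comms: str) -> None:
        """Name the Incident Commander and the supporting roles."""
        self.roles = {"commander": commander, "ops": ops, "comms": comms}
        self.phase = Phase.DECLARE

    def advance(self, target: Phase) -> None:
        """Move to a later phase; refuse to proceed before roles exist."""
        if not self.roles:
            raise RuntimeError("declare the incident and assign roles first")
        if target <= self.phase:
            raise ValueError(f"already at or past {target.name}")
        self.phase = target


# Example usage with made-up names.
incident = Incident("checkout errors")
incident.declare(commander="alice", ops="bob", comms="carol")
incident.advance(Phase.MITIGATE)
```

Requiring `declare` before any other step mirrors the point that clear roles come before action; a real workflow would likely be looser about ordering.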

🌍Real-World Example

Incident timeline:
  14:02     Alert: error rate at 30%              (Detect)
  14:04     IC declared, roles assigned           (Declare)
  14:06     Last deploy identified                (Diagnose enough to act)
  14:08     Rolled back → errors drop             (Mitigate)
  14:10     Status update sent                    (Communicate)
  Next day  Blameless postmortem + action items   (Review)
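As a worked example, the timestamps in this timeline can be turned into the response durations a postmortem would typically report. This is a minimal sketch: the dictionary keys and the `minutes_between` helper are hypothetical names, and durations are measured from the first alert because the timeline does not say when the fault actually began.

```python
from datetime import datetime

# Timestamps copied from the timeline above (same day assumed).
timeline = {
    "detect": "14:02",
    "declare": "14:04",
    "diagnose": "14:06",
    "mitigate": "14:08",
    "communicate": "14:10",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from one HH:MM timestamp to a later one."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_declare = minutes_between(timeline["detect"], timeline["declare"])
time_to_mitigate = minutes_between(timeline["detect"], timeline["mitigate"])
time_to_first_update = minutes_between(timeline["detect"], timeline["communicate"])

print(time_to_declare, time_to_mitigate, time_to_first_update)  # 2 6 8
```

Here the rollback restored service 6 minutes after the alert, which is the number users actually experienced; the root-cause work happened the next day.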

✍️Hands-On Exercise

List the phases of incident response in order.
Explain the Incident Commander role.
Why mitigate before fully diagnosing?
What should a postmortem produce besides a narrative?

🧾Cheat Sheet

Phase	Action
Detect	Alert/report surfaces issue
Declare	Assign Incident Commander + roles
Mitigate	Restore service first
Communicate	Update stakeholders
Resolve	Confirm recovery
Review	Blameless postmortem + actions

💬Common Interview Questions

What are the key steps of incident management?

Detect, declare and assign roles (Incident Commander), mitigate to restore service, communicate status, then resolve and run a blameless postmortem with action items.

Why mitigate before finding the root cause?

Users care that the service is restored, not why it broke. Rolling back or failing over usually stops the impact quickly; root-cause analysis can then proceed without users still being affected.

What does an Incident Commander do?

Coordinates the response — assigns roles, drives decisions, and keeps the effort organized — without necessarily doing the hands-on fixing themselves.

📚Official Documentation

↗ Google SRE — Managing Incidents
