Incident Management
Respond to outages calmly with clear roles, communication, and blameless follow-up.
Incident management is the structured response to an outage. A repeatable process beats heroics:
- Detect — alerts (or a report) surface the problem
- Declare & assign roles — an Incident Commander coordinates; others handle ops and communications
- Mitigate first — restore service before root-causing (roll back, failover, scale up) — stop the bleeding
- Communicate — keep stakeholders updated on status and ETA
- Resolve & review — a blameless postmortem with concrete action items
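The ordering of the phases above can be sketched as a small state machine. This is a minimal illustration, not a standard tool: the `Phase` and `Incident` names are hypothetical. Only backward moves are rejected. Status updates are a special case because they recur throughout an incident, so they are logged without changing the current phase.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import IntEnum


class Phase(IntEnum):
    # Integer values encode the order of the phases listed above.
    DETECT = 1
    DECLARE = 2
    MITIGATE = 3
    COMMUNICATE = 4
    RESOLVE = 5
    REVIEW = 6


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DETECT
    log: list = field(default_factory=list)  # (timestamp, phase, note) entries

    def record(self, phase: Phase, note: str, at: datetime) -> None:
        if phase is Phase.COMMUNICATE:
            # Status updates go out repeatedly, so they are logged
            # without moving the incident to a new phase.
            self.log.append((at, phase, note))
            return
        if phase < self.phase:
            # Refuse to step backwards, e.g. declaring again after mitigation.
            raise ValueError(f"cannot go from {self.phase.name} back to {phase.name}")
        self.phase = phase
        self.log.append((at, phase, note))


inc = Incident("error rate spike")
inc.record(Phase.DETECT, "alert: error rate 30%", datetime.now())
inc.record(Phase.DECLARE, "IC declared, roles assigned", datetime.now())
inc.record(Phase.MITIGATE, "rolled back last deploy", datetime.now())
inc.record(Phase.COMMUNICATE, "status update sent", datetime.now())
print(inc.phase.name)  # MITIGATE
```

Skipping ahead (for example straight from Detect to Mitigate) is allowed on purpose, since small incidents may not need every step spelled out.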
Clear roles prevent chaos, and mitigating before diagnosing is the key instinct: users want the service back, not an explanation.
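To make "mitigate before diagnose" concrete, here is a minimal runbook-style sketch. It checks a few cheap signals and picks a fast, reversible action without waiting for a root cause. The `Signals` fields, the 30-minute deploy window, and the order of the checks are illustrative assumptions, not a standard policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    # Hypothetical quick checks; none of them requires knowing the root cause.
    minutes_since_last_deploy: Optional[float]  # None if there is no recent deploy
    region_healthy: bool
    cpu_saturated: bool


def pick_mitigation(s: Signals, deploy_window_min: float = 30.0) -> str:
    """Return the first fast, reversible mitigation whose trigger matches."""
    recent = s.minutes_since_last_deploy
    if recent is not None and recent <= deploy_window_min:
        return "roll back last deploy"
    if not s.region_healthy:
        return "fail over to a healthy region"
    if s.cpu_saturated:
        return "scale up"
    return "escalate: no quick mitigation matched"


print(pick_mitigation(Signals(minutes_since_last_deploy=10, region_healthy=True, cpu_saturated=False)))
# roll back last deploy
```

A recent deploy is checked first because rolling it back is usually the cheapest change to undo; diagnosis can continue once users are no longer affected.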
Incident timeline:
14:02 Alert: error rate 30% (Detect)
14:04 IC declared, roles assigned (Coordinate)
14:06 Last deploy identified (Diagnose enough to act)
14:08 Rolled back → errors drop (Mitigate)
14:10 Status update sent (Communicate)
Next day: blameless postmortem + action items (Review)
Check your understanding:
- List the phases of incident response in order.
- Explain the Incident Commander role.
- Why mitigate before fully diagnosing?
- What should a postmortem produce besides a narrative?
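The example timeline above can also be turned into response durations, which postmortems often track. This is a sketch: the metric names are hypothetical, times are treated as same-day `HH:MM` values (an incident crossing midnight is not handled), and everything is measured from the 14:02 alert, so any time before the alert fired is invisible.

```python
from datetime import datetime

# The example timeline above as (time, event) pairs.
TIMELINE = [
    ("14:02", "detect"),
    ("14:04", "declare"),
    ("14:06", "diagnose"),
    ("14:08", "mitigate"),
    ("14:10", "communicate"),
]


def minutes_between(start: str, end: str) -> int:
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def response_metrics(timeline: list[tuple[str, str]]) -> dict[str, int]:
    times = {event: ts for ts, event in timeline}
    detect = times["detect"]
    return {
        "time_to_declare": minutes_between(detect, times["declare"]),
        "time_to_mitigate": minutes_between(detect, times["mitigate"]),
        "time_to_first_update": minutes_between(detect, times["communicate"]),
    }


print(response_metrics(TIMELINE))
# {'time_to_declare': 2, 'time_to_mitigate': 6, 'time_to_first_update': 8}
```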
Cheat Sheet
| Phase | Action |
|---|---|
| Detect | Alert/report surfaces issue |
| Declare | Assign Incident Commander + roles |
| Mitigate | Restore service first |
| Communicate | Update stakeholders |
| Resolve | Confirm recovery |
| Review | Blameless postmortem + actions |
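The Review row says a postmortem should produce actions, not just a narrative. One way to enforce that is a simple readiness check. This is a minimal sketch: the `Postmortem` and `ActionItem` fields (including the `due` date) are a hypothetical template, not a prescribed format. Owners are recorded so that follow-ups get done, not to assign blame.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str  # team or person accountable for the follow-up, not blamed for the incident
    due: date


@dataclass
class Postmortem:
    summary: str
    timeline: list[str]
    contributing_factors: list[str]
    action_items: list[ActionItem] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return reasons the postmortem is not ready to publish (empty list means ready)."""
        issues = []
        if not self.action_items:
            issues.append("no action items: a narrative alone does not prevent recurrence")
        for item in self.action_items:
            if not item.owner.strip():
                issues.append(f"action item without an owner: {item.description!r}")
        return issues
```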
Common Interview Questions
What are the key steps of incident management?
Detect, declare and assign roles (Incident Commander), mitigate to restore service, communicate status, then resolve and run a blameless postmortem with action items.
Why mitigate before finding the root cause?
Users care about having the service restored, not about the explanation. Rolling back or failing over stops the impact quickly; root-cause analysis can then follow without users bearing the outage in the meantime.
What does an Incident Commander do?
Coordinates the response — assigns roles, drives decisions, and keeps the effort organized — without necessarily doing the hands-on fixing themselves.
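The Incident Commander's job of assigning roles can be sketched as a tiny helper. This is illustrative only: the role names and the fallback rule are assumptions. When enough responders are available, the IC keeps coordination and hands ops and comms to others. With too few people, the IC covers the missing roles, which matches "not necessarily" doing hands-on work.

```python
def assign_roles(ic: str, responders: list[str]) -> dict[str, str]:
    """Give ops and comms to other responders; the IC covers any role left unfilled."""
    others = [r for r in responders if r != ic]
    return {
        "incident_commander": ic,
        "ops_lead": others[0] if others else ic,
        "comms_lead": others[1] if len(others) > 1 else ic,
    }


print(assign_roles("ana", ["ana", "bo", "cy"]))
# {'incident_commander': 'ana', 'ops_lead': 'bo', 'comms_lead': 'cy'}
```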