Logging & Alerting
Turn telemetry into actionable alerts without drowning in noise.
Collecting telemetry is useless if no one acts on it. Good alerting turns signals into timely, actionable pages and avoids alert fatigue, where noisy alerts get ignored and real incidents slip through.
Principles for alerts that work:
- Alert on symptoms, not causes — page on user-facing pain (high error rate, latency, SLO burn), not every internal metric like CPU
- Make them actionable — every alert should require a human action and link to a runbook
- Tier severity — page for urgent, ticket/Slack for the rest
- Reduce noise — group, deduplicate, and set sensible thresholds/durations
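As a rough illustration of the tiering and noise-reduction principles above, the sketch below shows one possible way to require a sustained breach before firing, deduplicate repeats, and split alerts into pages and tickets. The `Alert` fields, the `"page"`/`"ticket"` labels, and the function names are hypothetical and not taken from any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str         # e.g. "api_5xx_rate_high" (hypothetical name)
    service: str
    severity: str     # "page" or "ticket" (hypothetical tier labels)
    runbook_url: str  # every alert is expected to carry a runbook link


def sustained(samples, threshold, for_seconds):
    """True only if the value stayed above threshold for the whole trailing window.

    samples: (timestamp_seconds, value) tuples, oldest first.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - for_seconds
    if samples[0][0] > window_start:
        return False  # not enough history yet, so a brief spike cannot fire
    return all(value > threshold for ts, value in samples if ts >= window_start)


def route(alerts):
    """Deduplicate by (name, service) and split into pages and tickets."""
    seen, pages, tickets = set(), [], []
    for alert in alerts:
        key = (alert.name, alert.service)
        if key in seen:
            continue  # repeat of an alert that was already routed
        seen.add(key)
        if not alert.runbook_url:
            raise ValueError(f"{alert.name} has no runbook link")
        (pages if alert.severity == "page" else tickets).append(alert)
    return pages, tickets
```

Real alerting systems usually express the same ideas declaratively (a duration on the rule, grouping keys on the router) rather than in application code. The point here is only that a short spike and a duplicate firing are both filtered out before anyone is paged.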
Logs and metrics both feed alerts; the Four Golden Signals (latency, traffic, errors, saturation) are a great starting set.
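To make the Four Golden Signals concrete, here is a minimal sketch that summarizes one window of request records into the four numbers. The `Request` type, the choice of p99 for latency, treating 5xx statuses as errors, and measuring saturation as in-flight requests over capacity are all assumptions made for illustration. Real services pick their own definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float  # time taken to serve the request, in seconds
    status: int       # HTTP status code


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one window of requests into the four golden signals."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = sorted(r.latency_s for r in requests)
    # Nearest-rank 99th percentile: the smallest value covering 99% of requests.
    rank = max(1, math.ceil(99 * len(latencies) / 100))
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_s": latencies[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,
    }
```

A symptom alert would then compare values such as `error_rate` or `latency_p99_s` against a threshold over several consecutive windows, rather than paging on a single reading.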
Good (symptom, actionable):
"API 5xx error rate > 2% for 5 min" → page on-call, link runbook
"p99 latency > 1s for 10 min" → page
Noisy (cause, not actionable alone):
"CPU > 80%" → maybe fine under load; ticket at most
"single pod restarted" → expected; don't page

- Rewrite a CPU-threshold alert as a user-facing symptom alert.
- List the Four Golden Signals.
- Explain alert fatigue and one way to reduce it.
- What should every alert link to?
Cheat Sheet
| Principle | Detail |
|---|---|
| Symptoms > causes | Alert on user impact |
| Actionable | Requires a human action |
| Runbook | Link from every alert |
| Severity tiers | Page vs ticket |
| Golden signals | Latency, traffic, errors, saturation |
| Reduce noise | Group, dedupe, tune thresholds |
Common Interview Questions
What makes a good alert?
It’s tied to a user-facing symptom, is actionable, links to a runbook, and is tuned to avoid noise. Alerts that fire on causes (like CPU) or that need no human action create fatigue.
What are the Four Golden Signals?
Latency, traffic, errors, and saturation — a concise, high-value set of signals to monitor and alert on for any user-facing service.
What is alert fatigue and why is it dangerous?
Alert fatigue sets in when frequent noisy or non-actionable alerts desensitize responders, who then start ignoring the pager. The danger is that a genuine incident gets missed amid the noise.