Logging & Alerting

Intermediate ⭐ 80 XP ⏱ 16 min #observability #alerting #on-call

Turn telemetry into actionable alerts without drowning in noise.

📖Theory

Collecting telemetry is useless if no one acts on it. Good alerting turns signals into timely, actionable pages and guards against alert fatigue, where responders learn to ignore noisy alerts and real incidents slip through.

Principles for alerts that work:

Alert on symptoms, not causes — page on user-facing pain (high error rate, latency, SLO burn), not every internal metric like CPU
Make them actionable — every alert should require a human action and link to a runbook
Tier severity — page for urgent, ticket/Slack for the rest
Reduce noise — group, deduplicate, and set sensible thresholds/durations
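The severity-tier and noise-reduction principles above can be sketched in a few lines of Python. This is a minimal illustration, not any particular alerting tool's API: the `Alert` class, the `"page"`/`"ticket"` tier names, the `group_alerts` and `route` helpers, and the `runbooks.example` URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str         # e.g. "HighErrorRate" (hypothetical alert name)
    service: str      # e.g. "api"
    severity: str     # "page" or "ticket" (hypothetical tier names)
    runbook_url: str  # every alert is expected to carry a runbook link


def group_alerts(alerts):
    """Collapse alerts sharing (name, service) into one entry plus a count."""
    groups = {}
    for alert in alerts:
        key = (alert.name, alert.service)
        if key in groups:
            first, count = groups[key]
            groups[key] = (first, count + 1)
        else:
            groups[key] = (alert, 1)
    # Dicts keep insertion order, so groups come back in first-seen order.
    return list(groups.values())


def route(alert):
    """Page only for urgent alerts; everything else goes to a ticket queue."""
    if not alert.runbook_url:
        raise ValueError(f"alert {alert.name!r} has no runbook link")
    return "pager" if alert.severity == "page" else "ticket-queue"


# Three duplicate pages and one low-severity alert (made-up sample data).
incoming = [
    Alert("HighErrorRate", "api", "page", "https://runbooks.example/high-error-rate"),
    Alert("HighErrorRate", "api", "page", "https://runbooks.example/high-error-rate"),
    Alert("HighErrorRate", "api", "page", "https://runbooks.example/high-error-rate"),
    Alert("DiskFilling", "db", "ticket", "https://runbooks.example/disk-filling"),
]

for alert, count in group_alerts(incoming):
    print(f"{alert.name} x{count} -> {route(alert)} ({alert.runbook_url})")
```

Grouping before routing means the on-call engineer receives one notification per distinct problem rather than one per firing, and rejecting alerts without a runbook link enforces the "actionable" principle at the point of delivery.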

Logs and metrics both feed alerts; the Four Golden Signals (latency, traffic, errors, saturation) are a great starting set.
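As a rough illustration of the Four Golden Signals, the sketch below derives them from one window of request records. The `golden_signals` function, the `(latency_seconds, status_code)` record shape, and the choice of CPU utilization as the saturation measure are assumptions made for this example; real systems usually compute these in a metrics backend rather than in application code.

```python
def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one time window of traffic.

    requests: list of (latency_seconds, status_code) tuples (assumed shape).
    cpu_utilization: fraction between 0 and 1, used here as a stand-in for
    saturation; other resources (memory, queue depth) could serve instead.
    """
    total = len(requests)
    if total == 0:
        return {
            "latency_p99": None,
            "traffic_rps": 0.0,
            "error_rate": 0.0,
            "saturation": cpu_utilization,
        }

    latencies = sorted(latency for latency, _ in requests)
    # Nearest-rank p99 using integer math: rank = ceil(0.99 * total).
    rank = (99 * total + 99) // 100
    server_errors = sum(1 for _, status in requests if status >= 500)

    return {
        "latency_p99": latencies[rank - 1],
        "traffic_rps": total / window_seconds,
        "error_rate": server_errors / total,
        "saturation": cpu_utilization,
    }


# Made-up sample: 98 fast successes and 2 slow server errors in 10 seconds.
sample = [(0.1, 200)] * 98 + [(2.0, 500)] * 2
signals = golden_signals(sample, window_seconds=10, cpu_utilization=0.45)
print(signals)
```

Counting only 5xx responses as errors is a simplification; whether 4xx responses count depends on the service.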

🌍Real-World Example

Good (symptom, actionable):
  "API 5xx error rate > 2% for 5 min"  → page on-call, link runbook
  "p99 latency > 1s for 10 min"        → page

Noisy (cause, not actionable alone):
  "CPU > 80%"                          → maybe fine under load; ticket at most
  "single pod restarted"               → expected; don't page
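The "for 5 min" part of a rule like the first one is what keeps a brief spike from paging anyone. A minimal sketch of that check follows, similar in spirit to the `for` clause in Prometheus alerting rules but not taken from any tool; the `should_fire` function and the `(timestamp_seconds, value)` sample shape are assumptions for illustration.

```python
def should_fire(samples, threshold, duration_seconds):
    """Return True only if the value stayed above threshold for the full duration.

    samples: list of (timestamp_seconds, value), oldest first (assumed shape).
    """
    if not samples:
        return False

    now = samples[-1][0]
    window_start = now - duration_seconds

    # Without history covering the whole window we cannot claim a sustained breach.
    if samples[0][0] > window_start:
        return False

    window = [value for timestamp, value in samples if timestamp >= window_start]
    return all(value > threshold for value in window)


# Error-rate samples taken every 60 s (made-up data).
sustained = [(t, 0.03) for t in range(0, 301, 60)]
with_dip = [(0, 0.03), (60, 0.03), (120, 0.01), (180, 0.03), (240, 0.03), (300, 0.03)]

print(should_fire(sustained, threshold=0.02, duration_seconds=300))
print(should_fire(with_dip, threshold=0.02, duration_seconds=300))
```

A single sample at or below the threshold resets the condition, so the dip at 120 s prevents the second series from firing even though most of its samples are above 2%.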

✍️Hands-On Exercise

Rewrite a CPU-threshold alert as a user-facing symptom alert.
List the Four Golden Signals.
Explain alert fatigue and one way to reduce it.
What should every alert link to?

🧾Cheat Sheet

Principle	Detail
Symptoms > causes	Alert on user impact
Actionable	Requires a human action
Runbook	Link from every alert
Severity tiers	Page vs ticket
Golden signals	Latency, traffic, errors, saturation
Reduce noise	Group, dedupe, tune thresholds

💬Common Interview Questions

What makes a good alert?

It’s tied to a user-facing symptom, is actionable, links to a runbook, and is tuned to avoid noise. Alerts that fire on causes (such as CPU) or that require no action create fatigue.

What are the Four Golden Signals?

Latency, traffic, errors, and saturation — a concise, high-value set of signals to monitor and alert on for any user-facing service.

What is alert fatigue and why is it dangerous?

Alert fatigue sets in when frequent noisy or non-actionable alerts desensitize responders. They start ignoring the pager, so a genuine incident gets missed amid the noise.

📚Official Documentation

↗ Google SRE — Monitoring distributed systems
