Logging & Alerting

Intermediate ⭐ 80 XP ⏱ 16 min #observability #alerting #on-call

Turn telemetry into actionable alerts without drowning in noise.

📖Theory

Telemetry has no value if no one acts on it. Good alerting turns signals into timely, actionable pages while avoiding alert fatigue: when alerts are noisy, responders start ignoring them and real incidents slip through.

Principles for alerts that work:

  • Alert on symptoms, not causes — page on user-facing pain (high error rate, latency, SLO burn), not every internal metric like CPU
  • Make them actionable — every alert should require a human action and link to a runbook
  • Tier severity — page for urgent, ticket/Slack for the rest
  • Reduce noise — group, deduplicate, and set sensible thresholds/durations
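The tiering and noise-reduction principles above can be sketched in a few lines. This is a minimal illustration, not any particular tool's behavior: the `Alert` fields, the two tier names ("page" and "ticket"), the grouping key of (name, service), and the example runbook URLs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str         # e.g. "HighErrorRate" (hypothetical alert name)
    service: str      # used together with name as the grouping key
    severity: str     # "page" or "ticket" (assumed tiers)
    runbook_url: str  # every alert is expected to carry one


def route(alerts):
    """Deduplicate alerts by (name, service) and split them by severity tier."""
    seen = set()
    pages, tickets = [], []
    for alert in alerts:
        if not alert.runbook_url:
            # Enforce the "link to a runbook" principle at routing time.
            raise ValueError(f"{alert.name} has no runbook link")
        key = (alert.name, alert.service)
        if key in seen:
            continue  # same alert already routed; don't notify twice
        seen.add(key)
        if alert.severity == "page":
            pages.append(alert)
        else:
            tickets.append(alert)
    return pages, tickets


if __name__ == "__main__":
    incoming = [
        Alert("HighErrorRate", "api", "page", "https://runbooks.example/high-error-rate"),
        Alert("HighErrorRate", "api", "page", "https://runbooks.example/high-error-rate"),
        Alert("HighCPU", "api", "ticket", "https://runbooks.example/high-cpu"),
    ]
    pages, tickets = route(incoming)
    print("page:", [a.name for a in pages])
    print("ticket:", [a.name for a in tickets])
```

Rejecting alerts without a runbook link is one possible design choice; a real setup might instead flag them for review.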

Logs and metrics both feed alerts; the Four Golden Signals (latency, traffic, errors, saturation) are a great starting set.
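As a rough sketch of how the Four Golden Signals might be derived from one window of request data: the `Request` record, the nearest-rank p99, and the use of in-flight requests divided by capacity as a saturation proxy are assumptions chosen for illustration. Real systems usually get these values from a metrics backend.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float  # time to serve the request, in seconds
    status: int       # HTTP status code


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one time window into latency, traffic, errors, and saturation."""
    saturation = in_flight / capacity  # assumed proxy: concurrency vs. limit
    if not requests:
        return {"latency_p99_s": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}

    latencies = sorted(r.latency_s for r in requests)
    n = len(latencies)
    # Nearest-rank p99: smallest sample with at least 99% of samples at or below it.
    rank = (99 * n + 99) // 100  # integer ceil(0.99 * n)
    p99 = latencies[rank - 1]

    server_errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "latency_p99_s": p99,
        "traffic_rps": n / window_s,
        "error_rate": server_errors / n,
        "saturation": saturation,
    }


if __name__ == "__main__":
    window = [Request(0.1, 200)] * 98 + [Request(2.0, 500)] * 2
    print(golden_signals(window, window_s=10, in_flight=40, capacity=100))
```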

🌍Real-World Example
Good (symptom, actionable):
  "API 5xx error rate > 2% for 5 min"  → page on-call, link runbook
  "p99 latency > 1s for 10 min"        → page

Noisy (cause, not actionable alone):
  "CPU > 80%"                          → maybe fine under load; ticket at most
  "single pod restarted"               → expected; don't page
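The "for 5 min" part of the good alerts above is what keeps a brief spike from paging anyone. A minimal sketch of that check, assuming samples arrive as (timestamp in seconds, value) pairs sorted by time; the function name `should_fire` and the sample layout are hypothetical:

```python
def should_fire(samples, threshold, for_seconds):
    """Return True only if the value stayed above threshold for the whole window.

    samples: list of (timestamp_seconds, value), sorted by timestamp ascending.
    """
    if not samples:
        return False
    now = samples[-1][0]
    window_start = now - for_seconds
    # Without data reaching back to the start of the window we cannot
    # claim the condition was sustained, so do not fire.
    if samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    # One sample per minute for 10 minutes, error rate steady at 3%.
    sustained = [(t, 0.03) for t in range(0, 601, 60)]
    # Same series, but the error rate dipped to 1% two minutes ago.
    dipped = [(t, 0.01 if t == 480 else 0.03) for t in range(0, 601, 60)]
    print(should_fire(sustained, threshold=0.02, for_seconds=300))
    print(should_fire(dipped, threshold=0.02, for_seconds=300))
```

Requiring every sample in the window to breach the threshold is a strict reading of "for 5 min"; a softer variant could tolerate a small fraction of non-breaching samples.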
✍️Hands-On Exercise
  1. Rewrite a CPU-threshold alert as a user-facing symptom alert.
  2. List the Four Golden Signals.
  3. Explain alert fatigue and one way to reduce it.
  4. What should every alert link to?
🧾Cheat Sheet
  Principle          Detail
  Symptoms > causes  Alert on user impact
  Actionable         Requires a human action
  Runbook            Link from every alert
  Severity tiers     Page vs ticket
  Golden signals     Latency, traffic, errors, saturation
  Reduce noise       Group, dedupe, tune thresholds
💬Common Interview Questions
What makes a good alert?

It’s tied to a user-facing symptom, is actionable, links to a runbook, and is tuned to avoid noise. Alerts that fire on causes (like CPU) or that require no action create fatigue.

What are the Four Golden Signals?

Latency, traffic, errors, and saturation — a concise, high-value set of signals to monitor and alert on for any user-facing service.

What is alert fatigue and why is it dangerous?

Alert fatigue sets in when frequent noisy or non-actionable alerts desensitize responders. They start ignoring the pager, so a genuine incident gets missed amid the noise.
