SLI, SLO & SLA

💤0
Lv 10 XP
← 🛡️ Platform Engineering & SRE · Reliability

SLI, SLO & SLA

Intermediate ⭐ 80 XP ⏱ 18 min #sre#slo#error-budget

Define what reliability means and budget for it with SLIs, SLOs, SLAs, and error budgets.

📖Theory

These three terms form a hierarchy:

  • SLI (Indicator) — a measured metric of health, e.g. the percentage of requests served under 300ms, or the success rate
  • SLO (Objective) — your internal target for an SLI, e.g. “99.9% of requests succeed over 30 days”
  • SLA (Agreement) — an external contract with customers, often with penalties if the SLO is missed; usually looser than the SLO

The error budget = 1 − SLO. A 99.9% SLO permits 0.1% failure — about 43 minutes/month. You spend it on risky deploys and experiments; if it’s exhausted, you freeze features and focus on reliability. This makes the speed-vs-stability trade-off objective.

🌍Real-World Example
SLI:  proportion of HTTP requests with status < 500 and latency < 300ms
SLO:  99.9% over a rolling 30 days
SLA:  99.5% (looser) — service credits if breached

Error budget (99.9%):
  0.1% of 30 days ≈ 43 minutes of allowed failure per month
  Spent deliberately (risky release) vs accidentally (recurring bug)
✍️Hands-On Exercise
  1. Define SLI, SLO, and SLA and how they relate.
  2. Calculate the monthly error budget for a 99.95% SLO.
  3. Explain what to do when the error budget is exhausted.
  4. Why should an SLA be looser than the internal SLO?
🧾Cheat Sheet
TermMeaning
SLIMeasured indicator (e.g. success %)
SLOInternal target for the SLI
SLAExternal contract + penalties
Error budget1 − SLO (allowed failure)
99.9%≈43 min/month budget
Budget spentFreeze features, fix reliability
💬Common Interview Questions
Explain SLI, SLO, and SLA.

An SLI is a measured indicator of service health; an SLO is the internal target for that indicator; an SLA is an external contract (often with penalties) that’s usually looser than the SLO.

What is an error budget and how is it used?

It’s 1 minus the SLO — the allowed amount of unreliability. Teams spend it on risky changes; if it runs out, they freeze feature work and prioritize reliability.

Why should an SLA be looser than the SLO?

The SLO provides a safety margin: you want to detect and fix problems internally (SLO breach) well before you violate the customer-facing contract (SLA).

📚Official Documentation

📝 My notes on this topic

Auto-saves as you type