SLI, SLO & SLA
Define what reliability means and budget for it with SLIs, SLOs, SLAs, and error budgets.
These three terms form a hierarchy:
- SLI (Indicator) — a measured metric of health, e.g. the percentage of requests served under 300ms, or the success rate
- SLO (Objective) — your internal target for an SLI, e.g. “99.9% of requests succeed over 30 days”
- SLA (Agreement) — an external contract with customers, often with penalties if the agreed target is missed; that target is usually looser than the SLO
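The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the `Request` record, the `sli` helper, and the sample traffic are hypothetical, and the 300 ms threshold and the 99.9% / 99.5% targets are the example values used on this page.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def sli(requests, max_latency_ms=300.0):
    """Fraction of 'good' requests: status < 500 and latency under the threshold."""
    if not requests:
        return 1.0  # assumption: no traffic is treated as meeting the objective
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms < max_latency_ms
    )
    return good / len(requests)


SLO = 0.999  # internal target
SLA = 0.995  # looser external commitment

# Hypothetical sample: 997 good requests, 2 server errors, 1 slow response.
sample = [Request(200, 120.0)] * 997 + [Request(500, 80.0)] * 2 + [Request(200, 450.0)]

measured = sli(sample)
print(f"SLI = {measured:.3%}")
print("SLO met:", measured >= SLO)
print("SLA met:", measured >= SLA)
```

With this sample the measured SLI falls between the two targets: the internal SLO is missed while the SLA still holds, which is the situation the gap between them is meant to create.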
The error budget = 1 − SLO. A 99.9% SLO permits 0.1% failure — about 43 minutes/month. You spend it on risky deploys and experiments; if it’s exhausted, you freeze features and focus on reliability. This makes the speed-vs-stability trade-off objective.
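The arithmetic behind the budget can be checked with a short sketch. It assumes a time-based budget over a 30-day window, as in the 43-minute figure above; the function names and the freeze rule at the end are illustrative, not a standard API or policy.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed failure time in minutes: (1 - SLO) times the window length."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, bad_minutes, window_days=30):
    """Fraction of the budget still unspent; zero or below means it is exhausted."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


for slo in (0.99, 0.999, 0.9995):
    print(f"{slo:.2%} SLO: {error_budget_minutes(slo):.1f} min per 30 days")

# Hypothetical policy: stop feature releases once the budget is used up.
remaining = budget_remaining(0.999, bad_minutes=50)
print("freeze features" if remaining <= 0 else "ok to ship")
```

Each extra nine in the SLO cuts the budget by a factor of ten, so the same 50 minutes of failure that exhausts a 99.9% budget would use only a small part of a 99% one.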
SLI: proportion of HTTP requests with status < 500 and latency < 300ms
SLO: 99.9% over a rolling 30 days
SLA: 99.5% (looser) — service credits if breached
Error budget (99.9%):
0.1% of 30 days ≈ 43 minutes of allowed failure per month
Spent deliberately (risky release) vs accidentally (recurring bug)
- Define SLI, SLO, and SLA and how they relate.
- Calculate the monthly error budget for a 99.95% SLO.
- Explain what to do when the error budget is exhausted.
- Why should an SLA be looser than the internal SLO?
Cheat Sheet
| Term | Meaning |
|---|---|
| SLI | Measured indicator (e.g. success %) |
| SLO | Internal target for the SLI |
| SLA | External contract + penalties |
| Error budget | 1 − SLO (allowed failure) |
| 99.9% | ≈43 min/month budget |
| Budget exhausted | Freeze features, fix reliability |
Common Interview Questions
Explain SLI, SLO, and SLA.
An SLI is a measured indicator of service health; an SLO is the internal target for that indicator; an SLA is an external contract (often with penalties) that’s usually looser than the SLO.
What is an error budget and how is it used?
It’s 1 minus the SLO — the allowed amount of unreliability. Teams spend it on risky changes; if it runs out, they freeze feature work and prioritize reliability.
Why should an SLA be looser than the SLO?
The gap between the two is a safety margin: breaching the stricter internal SLO alerts you to a problem, and gives you time to fix it, well before the customer-facing SLA is violated.