Reliability Principles
The core ideas of Site Reliability Engineering — treating operations as a software problem.
Theory
Site Reliability Engineering (SRE) is Google’s approach to operations: apply software engineering to operational problems. Its guiding ideas:
- Reliability is a feature — measured with SLIs/SLOs and budgeted, not assumed
- Embrace risk — 100% reliability is the wrong target; an error budget defines acceptable unreliability and pace of change
- Eliminate toil — automate repetitive manual work so engineers build, not firefight
- Blameless culture — learn from incidents, fix systems not people
- Measure everything — you can’t improve what you don’t observe
The famous tension SRE resolves: developers want to ship fast, ops wants stability. The error budget turns that into a shared, data-driven agreement.
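The error-budget arithmetic behind that agreement is simple enough to sketch in a few lines of Python. This is a minimal illustration, not a standard tool: the function names, the 30-day window, and the rule "freeze once the whole budget is spent" are assumptions chosen for the example.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


def _check_slo(slo: float) -> None:
    # 100% is rejected on purpose: it would leave no budget at all.
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage that an availability SLO permits over the window."""
    _check_slo(slo)
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


@dataclass
class BudgetStatus:
    allowed_failures: float   # failed requests the SLO tolerates at this traffic volume
    consumed_fraction: float  # share of the budget already used (can exceed 1.0)
    action: str               # what the (assumed) policy says to do next


def budget_status(slo: float, total_requests: int, failed_requests: int) -> BudgetStatus:
    """Compare observed failures against the budget implied by a request-based SLO."""
    _check_slo(slo)
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = (1.0 - slo) * total_requests
    consumed = failed_requests / allowed
    # Assumed policy: ship while budget remains, freeze once it is spent.
    action = "ship features" if consumed < 1.0 else "freeze features, focus on reliability"
    return BudgetStatus(allowed, consumed, action)


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(0.999), 1))      # 43.2
    print(budget_status(0.999, 1_000_000, 400).action)    # ship features
    print(budget_status(0.999, 1_000_000, 1200).action)   # freeze features, focus on reliability
```

In practice teams often act before the budget is fully spent, for example by slowing releases at a chosen burn rate. That threshold is a policy decision rather than something the arithmetic dictates.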
Real-World Example
SRE in practice:
Define SLO → 99.9% of requests succeed
Error budget → 0.1% of allowed failure (≈43 min of full outage in a 30-day month)
Budget healthy → ship features freely
Budget exhausted → freeze features, focus on reliability
Toil > 50% → invest in automation to reduce it
Hands-On Exercise
- Explain why 100% reliability is not the goal.
- Define toil and give an example of eliminating it.
- Describe how an error budget aligns dev and ops incentives.
- Why is blameless culture important to reliability?
Cheat Sheet
| Principle | Detail |
|---|---|
| Reliability as a feature | Measure + budget it |
| Embrace risk | Error budgets, not 100% |
| Eliminate toil | Automate repetitive work |
| Blameless | Fix systems, not people |
| Measure everything | Observability first |
| Dev vs ops | Balanced by error budget |
Common Interview Questions
What is SRE in one sentence?
Applying software engineering practices to operations — making reliability a measured, budgeted, automated concern rather than ad-hoc firefighting.
What is toil and why eliminate it?
Manual, repetitive, automatable work that scales linearly with service growth and produces no lasting value. Automating it frees engineers to improve systems and reduces human error.
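To make the 50% toil threshold from the Real-World Example concrete, a team can log hours by category and compute the toil share. The sketch below assumes a hypothetical weekly time log and a hand-picked set of categories counted as toil. Neither comes from a real tracking tool.

```python
TOIL_THRESHOLD = 0.5  # the "Toil > 50%" trigger from the example above


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Share of logged hours spent on categories classified as toil."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(h for cat, h in hours_by_category.items() if cat in toil_categories)
    return toil / total


if __name__ == "__main__":
    # Hypothetical week for one engineer (hours).
    week = {
        "manual deploys": 10,
        "ticket triage": 12,
        "automation work": 14,
        "design review": 4,
    }
    share = toil_fraction(week, {"manual deploys", "ticket triage"})
    if share > TOIL_THRESHOLD:
        print(f"toil is {share:.0%} of the week: invest in automation")
    else:
        print(f"toil is {share:.0%} of the week: within the threshold")
```

Which categories count as toil is a judgment call. The definition above (manual, repetitive, automatable, no lasting value) is the test to apply to each one.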