High Availability
Design systems that use redundancy and failover to keep running through failures.
Theory
High availability (HA) keeps a service running through routine failures by removing single points of failure. The core techniques:
- Redundancy — run multiple instances across availability zones so losing one keeps you up
- Load balancing + health checks — route only to healthy instances
- Failover — automatically promote a standby when the primary dies (e.g. database replicas)
- Statelessness — keep app servers stateless so any instance can serve any request
- Graceful degradation — shed non-essential features under stress rather than failing entirely
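As an illustration of the load balancing and health check techniques above, here is a minimal sketch of a round-robin balancer that skips unhealthy instances. The `Instance` and `LoadBalancer` classes and the instance names are hypothetical, and the `healthy` flag stands in for whatever a real health probe would report.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True  # assumption: updated elsewhere by a periodic health probe


class LoadBalancer:
    """Round-robin balancer that skips instances marked unhealthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._ring = cycle(self.instances)

    def pick(self):
        # Look at each instance at most once per call.
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


pool = [Instance("app-1", "az-a"), Instance("app-2", "az-b")]
lb = LoadBalancer(pool)

first = lb.pick().name
second = lb.pick().name

# Simulate losing AZ-a: its instance starts failing the health check.
pool[0].healthy = False
after_failure = [lb.pick().name for _ in range(3)]
```

With both instances healthy, requests alternate between the two zones. Once `app-1` is marked unhealthy, every request goes to `app-2`, which is the behaviour redundancy plus health checks is meant to provide.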
Availability is often quoted in nines: 99.9% ≈ 8.76 h/year of downtime, 99.99% ≈ 52.6 min/year. Each additional nine typically costs far more than the last, so match the target to what the service actually needs.
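The downtime figures follow from simple arithmetic: multiply the unavailable fraction by the number of hours in a year. A small sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; ignores leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the engineering effort rises so quickly.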
Real-World Example
HA web architecture:
Load balancer (multi-AZ) → app servers in AZ-a AND AZ-b (stateless)
Database: primary + synchronous replica with automatic failover
Health checks remove unhealthy instances from rotation
Lose any one AZ → service continues on the other
Hands-On Exercise
- Identify single points of failure in a single-server web app and fix them.
- Explain why app servers should be stateless for HA.
- Describe automatic database failover.
- Convert 99.99% availability into approximate downtime per year.
Cheat Sheet
| Technique | Purpose |
|---|---|
| Redundancy (multi-AZ) | Survive instance/zone loss |
| Load balancing | Route to healthy instances |
| Health checks | Detect + remove bad instances |
| Failover | Promote standby on failure |
| Statelessness | Any instance serves any request |
| Graceful degradation | Shed load, don’t collapse |
| 99.99% | ≈52.6 min/year downtime |
Common Interview Questions
How do you design for high availability?
Remove single points of failure via redundancy across availability zones, load balancing with health checks, automatic failover, stateless app tiers, and graceful degradation under stress.
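The automatic failover part of that answer can be sketched as a promotion decision. This is a simplified illustration, not how any particular database implements it: the `Database` class, node names, and role strings are hypothetical, and real systems also need failure detection, fencing, and usually a quorum to avoid two nodes both acting as primary.

```python
class Database:
    def __init__(self, name, role):
        self.name = name
        self.role = role   # "primary", "replica", or "failed"
        self.alive = True  # assumption: set by an external failure detector


def failover_if_needed(primary, replica):
    """Return the node that should serve writes, promoting the replica if needed."""
    if primary.alive:
        return primary
    if not replica.alive:
        raise RuntimeError("primary and replica are both down")
    replica.role = "primary"  # promote the standby
    primary.role = "failed"   # mark the old primary so it is not used for writes
    return replica


primary = Database("db-1", "primary")
replica = Database("db-2", "replica")

writer = failover_if_needed(primary, replica)  # primary is healthy
primary.alive = False                          # simulate a crash
writer_after_crash = failover_if_needed(primary, replica)
```

With a synchronous replica, the promoted node already holds every committed write, so promotion does not lose acknowledged data.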
Why must app servers be stateless for HA?
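So any instance can handle any request, and instances can be added, removed, or replaced freely. State held only on one instance is lost when that instance fails, and it forces a user's requests to be pinned to that instance, which undermines load balancing.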
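A minimal sketch of the idea, assuming session data is kept in a shared external store rather than in server memory. `SharedSessionStore` is a hypothetical in-memory stand-in for something like a cache or database, and `AppServer` is an illustrative name.

```python
class SharedSessionStore:
    """Hypothetical stand-in for an external store shared by all app servers."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """Keeps no per-user state of its own; everything lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SharedSessionStore()
server_a = AppServer("app-1", store)
server_b = AppServer("app-2", store)

server_a.add_to_cart("user-42", "book")
# Suppose app-1 fails here; the next request lands on app-2 and the cart survives.
cart = server_b.add_to_cart("user-42", "pen")
```

The trade-off is that the shared store becomes the component that must itself be made highly available.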