High Availability

Advanced ⭐ 120 XP ⏱ 18 min #sre #high-availability #resilience

Design systems that keep running through failures with redundancy and failover.

📖 Theory

High availability (HA) keeps a service running through routine failures by removing single points of failure. The core techniques:

  • Redundancy — run multiple instances across availability zones so losing one instance or zone does not take the service down
  • Load balancing + health checks — route only to healthy instances
  • Failover — automatically promote a standby when the primary dies (e.g. database replicas)
  • Statelessness — keep app servers stateless so any instance can serve any request
  • Graceful degradation — shed non-essential features under stress rather than failing entirely
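
To make the load balancing and health check bullets concrete, here is a minimal sketch of round-robin routing that skips unhealthy instances. The `Instance` and `HealthAwareBalancer` names, the instance names, and the zone labels are hypothetical and chosen for illustration; a real load balancer would run the health checks itself rather than read a flag.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True  # assumption: set by some external health checker


class HealthAwareBalancer:
    """Round-robin over instances, skipping any marked unhealthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0  # index where the next search starts

    def pick(self):
        n = len(self.instances)
        for offset in range(n):
            candidate = self.instances[(self._next + offset) % n]
            if candidate.healthy:
                # Resume after the chosen instance on the next call.
                self._next = (self._next + offset + 1) % n
                return candidate
        raise RuntimeError("no healthy instances available")


pool = [
    Instance("app-1", "az-a"),
    Instance("app-2", "az-b"),
    Instance("app-3", "az-a"),
]
lb = HealthAwareBalancer(pool)
first_round = [lb.pick().name for _ in range(3)]

# Simulate losing an entire zone: every instance in az-a fails its check.
for inst in pool:
    if inst.zone == "az-a":
        inst.healthy = False

after_zone_loss = [lb.pick().name for _ in range(3)]
print(first_round, after_zone_loss)
```

After the simulated zone loss, every request lands on the one surviving instance in the other zone. That is why redundancy has to span zones and not just instances.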

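Availability is often quoted in nines: 99.9% allows about 8.76 hours of downtime per year, and 99.99% about 52.6 minutes. Each additional nine cuts the allowed downtime tenfold and typically costs far more to achieve, so match the target to real business need.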
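
The conversion from nines to downtime is simple arithmetic. A minimal sketch, assuming a 365-day year (8,760 hours) and a hypothetical helper name:

```python
HOURS_PER_YEAR = 365 * 24  # 8760; leap years ignored for simplicity


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

Under these assumptions, 99.9% works out to about 8.76 hours per year and 99.99% to about 52.6 minutes per year.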

🌍 Real-World Example
HA web architecture:
  Load balancer (multi-AZ) → app servers in AZ-a AND AZ-b (stateless)
  Database: primary + synchronous replica with automatic failover
  Health checks remove unhealthy instances from rotation
  Lose any one AZ → service continues on the other
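
The database failover step in this architecture can be sketched as a monitor that promotes the replica after the primary misses several consecutive health checks. The `DbNode` and `FailoverMonitor` names, the node names, and the threshold of 3 are assumptions for illustration. Production failover also has to fence the old primary and agree on the new one to avoid split-brain, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class DbNode:
    name: str
    role: str  # "primary", "replica", or "failed"
    alive: bool = True


class FailoverMonitor:
    """Promote a replica once the primary misses `threshold` checks in a row."""

    def __init__(self, nodes, threshold=3):
        self.nodes = nodes
        self.threshold = threshold
        self.missed = 0

    @property
    def primary(self):
        return next(n for n in self.nodes if n.role == "primary")

    def check(self):
        """Run one health-check cycle and return the current primary's name."""
        current = self.primary
        if current.alive:
            self.missed = 0
            return current.name
        self.missed += 1
        if self.missed < self.threshold:
            return current.name  # tolerate brief blips before failing over
        standby = next(
            (n for n in self.nodes if n.role == "replica" and n.alive), None
        )
        if standby is None:
            raise RuntimeError("primary is down and no healthy replica exists")
        current.role = "failed"
        standby.role = "primary"
        self.missed = 0
        return standby.name


nodes = [DbNode("db-a", "primary"), DbNode("db-b", "replica")]
monitor = FailoverMonitor(nodes, threshold=3)
nodes[0].alive = False  # the primary dies
history = [monitor.check() for _ in range(4)]
print(history)
```

Requiring several missed checks trades a slightly longer outage for fewer spurious failovers caused by a single dropped probe.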

✍️ Hands-On Exercise
  1. Identify single points of failure in a single-server web app and fix them.
  2. Explain why app servers should be stateless for HA.
  3. Describe automatic database failover.
  4. Convert 99.99% availability into approximate downtime per year.

🧾 Cheat Sheet
  Technique               Purpose
  Redundancy (multi-AZ)   Survive instance/zone loss
  Load balancing          Route to healthy instances
  Health checks           Detect + remove bad instances
  Failover                Promote standby on failure
  Statelessness           Any instance serves any request
  Graceful degradation    Shed load, don’t collapse
  99.99%                  ≈ 52.6 min/year downtime

💬 Common Interview Questions
How do you design for high availability?

Remove single points of failure via redundancy across availability zones, load balancing with health checks, automatic failover, stateless app tiers, and graceful degradation under stress.

Why must app servers be stateless for HA?

So any instance can handle any request and instances can be added, removed, or replaced freely. State held locally would be lost when an instance fails, and it would force the load balancer to pin each user to a single instance.
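
A minimal sketch of this idea: two app servers keep no per-user state and read and write sessions through a shared store. `SessionStore` is a hypothetical in-memory stand-in for an external, replicated store, and the server names and session ID are illustrative.

```python
class SessionStore:
    """Stand-in for a shared external store (assumed replicated in practice)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {"cart": []})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """Keeps no per-user state locally; every request goes through the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        session["cart"].append(item)
        self.store.put(session_id, session)
        return {"served_by": self.name, "cart": list(session["cart"])}


store = SessionStore()
server_a = AppServer("app-a", store)
server_b = AppServer("app-b", store)

first = server_a.add_to_cart("sess-1", "book")
# Suppose app-a now fails and the load balancer routes the user to app-b.
second = server_b.add_to_cart("sess-1", "pen")
print(first, second)
```

The second request succeeds on a different server with the cart intact, because the session never lived on the instance that failed.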
