High Availability

Advanced ⭐ 120 XP ⏱ 18 min #sre #high-availability #resilience

Design systems that keep running through failures with redundancy and failover.

📖Theory

High availability (HA) keeps a service running through routine failures by removing single points of failure. The core techniques:

Redundancy — run multiple instances across availability zones so losing one keeps you up
Load balancing + health checks — route only to healthy instances
Failover — automatically promote a standby when the primary dies (e.g. database replicas)
Statelessness — keep app servers stateless so any instance can serve any request
Graceful degradation — shed non-essential features under stress rather than failing entirely
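As a minimal sketch of how load balancing and health checks work together, the toy balancer below rotates requests across only those instances currently marked healthy. The `Instance` and `RoundRobinBalancer` names are hypothetical, and the `healthy` flag stands in for the result of a real periodic health probe.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True  # assumption: updated elsewhere by a periodic health probe


class RoundRobinBalancer:
    """Round-robin over the instances that currently pass their health check."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._counter = count()

    def pick(self):
        # Re-evaluate health on every request so failed instances drop out of rotation.
        healthy = [i for i in self.instances if i.healthy]
        if not healthy:
            raise RuntimeError("no healthy instances available")
        return healthy[next(self._counter) % len(healthy)]


pool = [Instance("app-1", "az-a"), Instance("app-2", "az-b")]
lb = RoundRobinBalancer(pool)

first_two = [lb.pick().name for _ in range(2)]      # traffic alternates across zones

pool[0].healthy = False                             # simulate app-1 failing its check
after_failure = [lb.pick().name for _ in range(3)]  # only app-2 receives traffic
```

Because the instances are spread across two zones, marking one unhealthy leaves the other serving every request. A production load balancer would also handle probe timeouts, retries, and gradual re-admission of recovered instances.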

Availability is often quoted in nines: 99.9% ≈ 8.76 h/year of downtime, 99.99% ≈ 52.6 min/year. Each additional nine tends to cost far more than the previous one, so match the target to what the service actually needs.
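The arithmetic behind the nines can be sketched in a few lines. The function name `downtime_minutes_per_year` is illustrative, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Each extra nine divides the downtime budget by ten: about 525.6 minutes (8.76 hours) at 99.9%, about 52.6 minutes at 99.99%, and about 5.3 minutes at 99.999%.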

🌍Real-World Example

HA web architecture:
  Load balancer (multi-AZ) → app servers in AZ-a AND AZ-b (stateless)
  Database: primary + synchronous replica with automatic failover
  Health checks remove unhealthy instances from rotation
  Lose any one AZ → service continues on the other
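The database part of this architecture can be illustrated with a toy model of a primary and a synchronous replica. This is a sketch under simplifying assumptions, not how any particular database implements failover: the `DatabaseCluster` class and the node names `db-a` and `db-b` are hypothetical, and failure detection is reduced to a boolean flag.

```python
class DatabaseCluster:
    """Toy model: one primary, one synchronous standby, automatic promotion."""

    def __init__(self):
        self.nodes = {"db-a": {}, "db-b": {}}      # node name -> stored data
        self.alive = {"db-a": True, "db-b": True}  # assumption: set by a failure detector
        self.primary = "db-a"

    def write(self, key, value):
        self._ensure_primary()
        # Synchronous replication: apply the write to every live node
        # before acknowledging it to the caller.
        for name, data in self.nodes.items():
            if self.alive[name]:
                data[key] = value

    def read(self, key):
        self._ensure_primary()
        return self.nodes[self.primary][key]

    def _ensure_primary(self):
        """Promote a live standby if the current primary is down."""
        if self.alive[self.primary]:
            return
        standbys = [name for name, up in self.alive.items() if up]
        if not standbys:
            raise RuntimeError("no live database node")
        self.primary = standbys[0]


cluster = DatabaseCluster()
cluster.write("order:1", "paid")

cluster.alive["db-a"] = False                   # simulate losing the primary's AZ
value_after_failover = cluster.read("order:1")  # served by the promoted replica
```

Because the replica received the write synchronously, no acknowledged data is lost when it is promoted. Real failover also has to guard against split-brain (two nodes both acting as primary), which typically requires a quorum or fencing mechanism that this sketch omits.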

✍️Hands-On Exercise

Identify single points of failure in a single-server web app and fix them.
Explain why app servers should be stateless for HA.
Describe automatic database failover.
Convert 99.99% availability into approximate downtime per year.

🧾Cheat Sheet

Technique	Purpose
Redundancy (multi-AZ)	Survive instance/zone loss
Load balancing	Route to healthy instances
Health checks	Detect + remove bad instances
Failover	Promote standby on failure
Statelessness	Any instance serves any request
Graceful degradation	Shed load, don’t collapse
99.99%	≈52.6 min/year downtime

💬Common Interview Questions

How do you design for high availability?

Remove single points of failure via redundancy across availability zones, load balancing with health checks, automatic failover, stateless app tiers, and graceful degradation under stress.

Why must app servers be stateless for HA?

So any instance can handle any request, and instances can be added, removed, or replaced freely. State held locally on one server is lost when that server fails, and it forces the load balancer to keep sending each user to the same instance.
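One common way to get there is to move session data out of the app servers into a shared store. The sketch below assumes that approach; `SessionStore` is a hypothetical in-memory stand-in for an external store, and `AppServer` is an illustrative name.

```python
class SessionStore:
    """Stand-in for an external store shared by all app servers (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # no per-user state is kept on the server itself

    def add_to_cart(self, session_id, item):
        # Load the session, update it, and write it back to the shared store.
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
server_a = AppServer("app-1", store)
server_b = AppServer("app-2", store)

server_a.add_to_cart("s-42", "book")
cart = server_b.add_to_cart("s-42", "pen")  # a different server continues the same session
```

Either server can be removed without losing the cart, because the only copy of the session lives in the shared store. That store then becomes a dependency that needs its own redundancy.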

📚Official Documentation

↗ AWS — Reliability Pillar
