Disaster Recovery

Advanced ⭐ 120 XP ⏱ 18 min #sre #disaster-recovery #backups

Plan how to recover from major outages, set recovery targets with RTO and RPO, and test the plan.

📖Theory

Where high availability handles routine failures, disaster recovery (DR) plans for major events — a region outage, data corruption, ransomware. Two metrics define your target:

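RTO (Recovery Time Objective): the maximum acceptable time from the start of an outage until service is restored
RPO (Recovery Point Objective): the maximum acceptable data loss, measured as time (the gap between the last recoverable backup or replica and the moment of failure)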
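
To make the two metrics concrete, here is a minimal sketch that measures one incident against its targets. The function names and the incident timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_good_copy, outage_start, service_restored):
    """Return (data_loss_window, downtime) for a single incident."""
    # Everything written after the last recoverable copy is lost.
    data_loss_window = outage_start - last_good_copy
    # Downtime runs from the start of the outage until service is back.
    downtime = service_restored - outage_start
    return data_loss_window, downtime

def meets_targets(data_loss_window, downtime, rpo, rto):
    """Return (rpo_met, rto_met) as a pair of booleans."""
    return data_loss_window <= rpo, downtime <= rto

# Hypothetical incident: nightly backup at 02:00, outage at 14:30, back at 17:00.
loss, down = recovery_metrics(
    last_good_copy=datetime(2024, 1, 10, 2, 0),
    outage_start=datetime(2024, 1, 10, 14, 30),
    service_restored=datetime(2024, 1, 10, 17, 0),
)
rpo_met, rto_met = meets_targets(
    loss, down, rpo=timedelta(hours=24), rto=timedelta(hours=4)
)
print(f"data loss window: {loss}, downtime: {down}")
print(f"RPO met: {rpo_met}, RTO met: {rto_met}")
```

In this example the data loss window is 12.5 hours and the downtime is 2.5 hours, so both targets are met. The same incident would miss an RPO of one hour, which is the signal to back up or replicate more often.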

DR strategies trade cost against RTO/RPO:

Backup & restore: cheapest, slowest (recovery typically takes hours)
Pilot light: data kept replicated with minimal standby infrastructure, scaled up on disaster
Warm standby: scaled-down but fully running copy, scaled up on disaster
Multi-site active/active: near-zero RTO/RPO, most expensive
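
The cost trade-off in this list can be sketched as "pick the cheapest strategy that still meets both targets". The achievable RTO/RPO figures below are illustrative assumptions, not vendor numbers; substitute the values you have measured for your own systems.

```python
from datetime import timedelta
from typing import NamedTuple

class Strategy(NamedTuple):
    name: str
    achievable_rto: timedelta  # assumed best-case recovery time
    achievable_rpo: timedelta  # assumed best-case data loss window

# Ordered cheapest first. All figures are assumptions for illustration.
STRATEGIES = [
    Strategy("backup & restore", timedelta(hours=8), timedelta(hours=24)),
    Strategy("pilot light", timedelta(hours=2), timedelta(minutes=15)),
    Strategy("warm standby", timedelta(minutes=30), timedelta(minutes=5)),
    Strategy("multi-site active/active", timedelta(0), timedelta(0)),
]

def cheapest_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Return the cheapest strategy whose assumed figures meet both targets."""
    if rto < timedelta(0) or rpo < timedelta(0):
        raise ValueError("RTO and RPO must be non-negative")
    for strategy in STRATEGIES:
        if strategy.achievable_rto <= rto and strategy.achievable_rpo <= rpo:
            return strategy.name
    # Not reached: active/active is modeled as zero RTO and RPO.
    return STRATEGIES[-1].name

print(cheapest_strategy(timedelta(hours=24), timedelta(hours=24)))
print(cheapest_strategy(timedelta(hours=1), timedelta(minutes=5)))
print(cheapest_strategy(timedelta(seconds=30), timedelta(seconds=30)))
```

Keeping the list ordered by cost means the first match is also the cheapest, so tightening either target can only move the answer toward the more expensive end.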

The non-negotiable rule: test your DR plan and your restores. Untested backups fail exactly when you need them.
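
A restore test can be as simple as restoring into a scratch location and comparing checksums with the source. The sketch below uses a local zip archive as a stand-in for a real backup system; the function names and the sample file are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def diff_restore(source: Path, restored: Path) -> list:
    """Return the relative paths that are missing or differ after a restore."""
    expected, actual = checksums(source), checksums(restored)
    return sorted(path for path in expected if actual.get(path) != expected[path])

def restore_drill(source: Path, workdir: Path) -> list:
    """Back up source, restore it into workdir/restored, and report mismatches."""
    workdir.mkdir(parents=True, exist_ok=True)
    # workdir must sit outside source so the archive does not include itself.
    archive = shutil.make_archive(str(workdir / "backup"), "zip", root_dir=source)
    restored = workdir / "restored"
    shutil.unpack_archive(archive, restored)
    return diff_restore(source, restored)

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data"
    source.mkdir()
    (source / "orders.csv").write_text("id,total\n1,9.99\n")
    mismatches = restore_drill(source, Path(tmp) / "work")
    print("restore OK" if not mismatches else f"restore FAILED: {mismatches}")
```

A real drill would also exercise the parts this sketch skips: fetching the backup from its actual storage, timing the restore against the RTO, and starting the application on the restored data.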

🌍Real-World Example

RTO/RPO drives the strategy:
  RTO 24h,  RPO 24h   → nightly backup & restore (cheap)
  RTO 1h,   RPO 5min  → warm standby + frequent replication
  RTO ~0,   RPO ~0    → active/active multi-region (costly)

Always: backups in a separate region/account, immutable, and restore-tested.
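
These three rules are easy to check automatically. The sketch below audits a single backup record; the `BackupRecord` fields, the 90-day freshness limit, and the reading of "separate" as "differs in region or account" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class BackupRecord:
    name: str
    region: str
    account: str
    immutable: bool
    last_restore_test: Optional[datetime]  # None means never tested

def audit(backup, prod_region, prod_account, now, max_test_age=timedelta(days=90)):
    """Return a list of policy problems; an empty list means the backup passes."""
    problems = []
    # Assumption: a backup counts as separate if region or account differs.
    if backup.region == prod_region and backup.account == prod_account:
        problems.append("stored in the same region and account as production")
    if not backup.immutable:
        problems.append("not immutable")
    if backup.last_restore_test is None:
        problems.append("never restore-tested")
    elif now - backup.last_restore_test > max_test_age:
        problems.append("restore test is older than the allowed age")
    return problems

# Hypothetical record that breaks all three rules.
record = BackupRecord("orders-db", "eu-west-1", "prod", False, None)
for problem in audit(record, "eu-west-1", "prod", now=datetime(2024, 6, 1)):
    print(f"{record.name}: {problem}")
```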

✍️Hands-On Exercise

Define RTO and RPO and explain how they differ.
Match a strategy (backup, pilot light, warm standby, active/active) to an RTO/RPO.
Explain why backups must be tested and stored separately.
Explain how DR differs from high availability.

🧾Cheat Sheet

Term / strategy	Detail
RTO	How fast you recover
RPO	How much data loss is OK
Backup & restore	Cheap, slow
Pilot light	Minimal standby
Warm standby	Scaled-down running copy
Active/active	Near-zero RTO/RPO, costly
Test restores	Non-negotiable

💬Common Interview Questions

What's the difference between RTO and RPO?

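RTO is the maximum acceptable time to restore service after a disaster; RPO is the maximum acceptable data loss, expressed as a window of time. RPO therefore dictates how often you must back up or replicate: the actual loss in an incident is the gap between the failure and your last recoverable copy.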

How does disaster recovery differ from high availability?

HA keeps a system running through routine failures with redundancy; DR is the plan to recover from a major event (region loss, corruption) and is measured by RTO/RPO.

Why must backups be tested?

Because untested backups frequently turn out to be incomplete or unrestorable, and without testing you only find that out during a real disaster. Regular restore drills prove the backups work and show whether the restore fits within the RTO.

📚Official Documentation

AWS: Disaster recovery options
