Disaster Recovery

Advanced ⭐ 120 XP ⏱ 18 min #sre #disaster-recovery #backups

Plan how to recover from major outages, set RTO and RPO targets for that recovery, and test the plan.

📖Theory

Where high availability handles routine failures, disaster recovery (DR) plans for major events — a region outage, data corruption, ransomware. Two metrics define your target:

  • RTO (Recovery Time Objective): the maximum acceptable time to restore service after a disaster
  • RPO (Recovery Point Objective): the maximum acceptable data loss, measured as a window of time; your backup or replication gap must fit within it
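
To make the RPO definition concrete, here is a minimal sketch that checks whether a backup schedule can meet an RPO target. The function names are hypothetical, and it assumes each backup captures the state at the moment it starts, so the worst case is a disaster striking just before the next backup finishes.

```python
from datetime import timedelta


def worst_case_data_loss(
    backup_interval: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> timedelta:
    """Upper bound on lost data for periodic backups.

    Assumption: a backup reflects the state at its start time. If disaster
    strikes just before the current backup completes, the newest usable
    backup started one interval plus one backup duration ago.
    """
    return backup_interval + backup_duration


def meets_rpo(
    backup_interval: timedelta,
    rpo: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> bool:
    """True if the worst-case data loss stays within the RPO target."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    day = timedelta(hours=24)
    # A nightly backup that takes an hour to complete misses a 24h RPO.
    print(meets_rpo(day, rpo=day, backup_duration=timedelta(hours=1)))
    # Replicating every 5 minutes (near-instant copy) meets a 5 minute RPO.
    print(meets_rpo(timedelta(minutes=5), rpo=timedelta(minutes=5)))
```

The point of the sketch is that the time a backup takes to complete counts against the RPO, not just the interval between backups.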

DR strategies trade cost against RTO/RPO:

  • Backup & restore: cheapest and slowest; recovery typically takes hours
  • Pilot light: a minimal standby environment kept ready and scaled up when disaster strikes
  • Warm standby: a scaled-down copy that is always running
  • Multi-site active/active: near-zero RTO/RPO, most expensive

The non-negotiable rule: test your DR plan and your restores. Untested backups fail exactly when you need them.
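
A restore test can be as simple as restoring into a scratch location and comparing the result with what was backed up. The sketch below illustrates the idea for a directory of files using only the Python standard library. The helper names (`file_manifest`, `back_up`, `verify_restore`) are hypothetical, and a real drill would also restore databases, check application health, and time the whole run against the RTO.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def back_up(source_dir: Path, archive_base: Path) -> tuple[Path, dict[str, str]]:
    """Archive source_dir and record a manifest of what was backed up.

    archive_base must live outside source_dir so the archive does not
    include itself.
    """
    manifest = file_manifest(source_dir)
    archive = shutil.make_archive(str(archive_base), "gztar", root_dir=str(source_dir))
    return Path(archive), manifest


def verify_restore(archive: Path, manifest: dict[str, str]) -> bool:
    """Restore into a scratch directory and compare against the manifest."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        return file_manifest(Path(scratch)) == manifest
```

Running `verify_restore` on a schedule, rather than only after creating the backup, is what turns this from a backup script into a restore drill.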

🌍Real-World Example
RTO and RPO drive the strategy:
  RTO 24h,  RPO 24h   → nightly backup & restore (cheap)
  RTO 1h,   RPO 5min  → warm standby + frequent replication
  RTO ~0,   RPO ~0    → active/active multi-region (costly)

Always: backups in a separate region/account, immutable, and restore-tested.
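
The mapping above can be sketched as a lookup that returns the cheapest strategy able to meet both targets. The figures for backup & restore and warm standby come from the example; the pilot light figures (4h RTO, 15 min RPO) are illustrative assumptions, not from this lesson, and near-zero is modeled as zero. Real capabilities depend on your platform and data volume.

```python
from datetime import timedelta
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    rto: timedelta  # recovery time this tier can typically achieve
    rpo: timedelta  # data-loss window this tier can typically achieve


# Ordered cheapest first.
TIERS = [
    Tier("backup & restore", timedelta(hours=24), timedelta(hours=24)),
    # Assumed figures for illustration only; not taken from the lesson.
    Tier("pilot light", timedelta(hours=4), timedelta(minutes=15)),
    Tier("warm standby", timedelta(hours=1), timedelta(minutes=5)),
    # "Near-zero" is modeled as zero for simplicity.
    Tier("active/active", timedelta(0), timedelta(0)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Return the first (cheapest) tier that satisfies both targets."""
    for tier in TIERS:
        if tier.rto <= rto and tier.rpo <= rpo:
            return tier.name
    return TIERS[-1].name  # nothing cheaper fits; use the strongest tier


if __name__ == "__main__":
    print(cheapest_strategy(timedelta(hours=24), timedelta(hours=24)))
    print(cheapest_strategy(timedelta(hours=1), timedelta(minutes=5)))
    print(cheapest_strategy(timedelta(0), timedelta(0)))
```

Note that the stricter of the two targets decides the tier: a relaxed RTO with a tight RPO still forces a more expensive strategy.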

✍️Hands-On Exercise
  1. Define RTO and RPO and how they differ.
  2. Match a strategy (backup, pilot light, warm standby, active/active) to an RTO/RPO.
  3. Explain why backups must be tested and stored separately.
  4. How does DR differ from high availability?

🧾Cheat Sheet

Term / strategy    Detail
RTO                How fast you recover
RPO                How much data loss is OK
Backup & restore   Cheap, slow
Pilot light        Minimal standby
Warm standby       Scaled-down running copy
Active/active      Near-zero RTO/RPO, costly
Test restores      Non-negotiable

💬Common Interview Questions
What's the difference between RTO and RPO?

RTO is the maximum acceptable time to restore service after a disaster; RPO is the maximum acceptable data loss, expressed as a window of time. RPO is a target you set; the data you actually lose depends on how recent your last recoverable backup or replica is.

How does disaster recovery differ from high availability?

HA keeps a system running through routine failures with redundancy; DR is the plan to recover from a major event (region loss, corruption) and is measured by RTO/RPO.

Why must backups be tested?

Because untested backups frequently turn out to be incomplete or unrestorable — discovered only during a real disaster. Regular restore drills prove they work.
