Disaster Recovery
Plan how to recover from major outages, set targets with RTO and RPO, and test the plan.
Where high availability handles routine failures, disaster recovery (DR) plans for major events — a region outage, data corruption, ransomware. Two metrics define your target:
- RTO (Recovery Time Objective) — how fast you must be back up
- RPO (Recovery Point Objective) — how much data loss is acceptable (the backup/replication gap)
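To make the two metrics concrete, here is a minimal sketch that scores one incident against its objectives. The names `Objectives` and `evaluate_incident` and the timestamps are hypothetical, chosen only for illustration: downtime is compared with the RTO, and the age of the last recoverable copy at the moment of failure is compared with the RPO.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def evaluate_incident(last_recovery_point: datetime,
                      failure_at: datetime,
                      restored_at: datetime,
                      objectives: Objectives) -> dict:
    """Compare what actually happened in an incident with the RTO/RPO targets."""
    downtime = restored_at - failure_at            # judged against RTO
    data_loss = failure_at - last_recovery_point   # judged against RPO
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss <= objectives.rpo,
    }


# Illustrative incident: nightly backup at 02:00, failure at 14:30,
# service restored at 15:15, targets of RTO 1h and RPO 5min.
result = evaluate_incident(
    last_recovery_point=datetime(2024, 1, 10, 2, 0),
    failure_at=datetime(2024, 1, 10, 14, 30),
    restored_at=datetime(2024, 1, 10, 15, 15),
    objectives=Objectives(rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
)
print(result["rto_met"], result["rpo_met"])  # True False
```

In this made-up incident the service came back within the hour, so the RTO holds, but twelve and a half hours of data written since the nightly backup are gone, so the RPO is missed. The two targets are independent and have to be checked separately.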
DR strategies trade cost against RTO/RPO:
- Backup & restore — cheapest, slowest (hours)
- Pilot light — minimal standby, scaled up on disaster
- Warm standby — scaled-down running copy
- Multi-site active/active — near-zero RTO/RPO, most expensive
The non-negotiable rule: test your DR plan and your restores. Untested backups fail exactly when you need them.
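A restore drill can be as small as the sketch below: record a checksum manifest at backup time, restore the archive into a scratch directory, and confirm every file comes back byte-for-byte. The helper names (`tree_digests`, `make_backup`, `restore_drill`) and the use of a local zip archive are assumptions made for illustration; a real drill would restore from your actual backup system and also start the application against the restored data.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def make_backup(source: Path, archive: Path) -> dict[str, str]:
    """Archive `source` and return the manifest recorded at backup time.

    The archive must live outside `source`, or it would be swept into itself.
    """
    with zipfile.ZipFile(archive, "w") as zf:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(source).as_posix())
    return tree_digests(source)


def restore_drill(archive: Path, manifest: dict[str, str]) -> bool:
    """Restore into a scratch directory and compare against the manifest."""
    with tempfile.TemporaryDirectory() as scratch:
        try:
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(scratch)
        except (zipfile.BadZipFile, OSError):
            return False  # an unreadable archive is a failed drill
        return tree_digests(Path(scratch)) == manifest
```

Running `restore_drill` on a schedule, rather than only checking that the backup job exited successfully, is what catches truncated, corrupted, or incomplete archives before a real disaster does.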
RTO and RPO drive the strategy:
RTO 24h, RPO 24h → nightly backup & restore (cheap)
RTO 1h, RPO 5min → warm standby + frequent replication
RTO ~0, RPO ~0 → active/active multi-region (costly)
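The mapping above can be written as a small decision function. This is a rough sketch, not a sizing rule: the function name `choose_strategy` and every threshold in it (one minute for "near zero", one hour, eight hours, twenty-four hours) are illustrative assumptions picked to reproduce the three rows above, and real cut-offs depend on your workload and budget.

```python
from datetime import timedelta

# Illustrative thresholds only; tune them for your own systems.
NEAR_ZERO = timedelta(minutes=1)


def choose_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Pick the cheapest DR strategy that plausibly meets both objectives."""
    if rto <= NEAR_ZERO and rpo <= NEAR_ZERO:
        return "active/active"      # every site already serves traffic
    if rto <= timedelta(hours=1):
        return "warm standby"       # a running copy only needs scaling up
    if rto <= timedelta(hours=8) or rpo < timedelta(hours=24):
        return "pilot light"        # data is replicated, compute starts on demand
    return "backup & restore"       # nightly backups satisfy both targets


print(choose_strategy(timedelta(hours=24), timedelta(hours=24)))   # backup & restore
print(choose_strategy(timedelta(hours=1), timedelta(minutes=5)))   # warm standby
print(choose_strategy(timedelta(0), timedelta(0)))                 # active/active
```

The checks run from most to least expensive so that a strategy is chosen only when the objectives actually demand it. Pilot light fills the gap the three example rows leave open: an RTO of a few hours, or an RPO tighter than a nightly backup can provide.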
Always: backups in a separate region/account, immutable, and restore-tested.
- Define RTO and RPO and how they differ.
- Match a strategy (backup, pilot light, warm standby, active/active) to an RTO/RPO.
- Explain why backups must be tested and stored separately.
- How does DR differ from high availability?
Cheat Sheet
| Term / strategy | Detail |
|---|---|
| RTO | How fast you recover |
| RPO | How much data loss is OK |
| Backup & restore | Cheap, slow |
| Pilot light | Minimal standby |
| Warm standby | Scaled-down running copy |
| Active/active | Near-zero RTO/RPO, costly |
| Test restores | Non-negotiable |
Common Interview Questions
What's the difference between RTO and RPO?
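RTO is the maximum acceptable time to restore service after a disaster. RPO is the maximum acceptable data loss, expressed as the time between the last recoverable copy (backup or replica) and the failure. RPO is a target: backups or replication must run at least that often to meet it.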
How does disaster recovery differ from high availability?
HA keeps a system running through routine failures with redundancy; DR is the plan to recover from a major event (region loss, corruption) and is measured by RTO/RPO.
Why must backups be tested?
Because untested backups frequently turn out to be incomplete or unrestorable — discovered only during a real disaster. Regular restore drills prove they work.