Disaster Recovery Strategies
DR strategy comparison (Backup & Restore, Pilot Light, Warm Standby, Multi-Site Active/Active) with RTO and RPO definitions. Core SAA/SAP topic.
Overview
Disaster Recovery (DR) encompasses strategies and processes used to restore IT systems and data after a failure, outage, or disaster — whether natural or man-made. The goal is to minimize downtime and data loss.
Two key metrics define any DR strategy:
- Recovery Time Objective (RTO) — Maximum acceptable duration of downtime.
- Recovery Point Objective (RPO) — Maximum acceptable data loss, measured as time.
DR Strategy Comparison
| AWS Strategy | Alias | Description | RTO | RPO | Relative Cost |
|---|---|---|---|---|---|
| Backup & Restore | Cold Site / Cold Standby | Data is backed up periodically and restored to a new or existing environment after a failure. No standby infrastructure runs in the meantime. | Hours – Days | Hours – Days | Lowest |
| Pilot Light | Warm Site (some vendors) | A minimal core (typically DB replication only) runs continuously in a secondary region. On failure, provision additional resources to complete the system. | Minutes – Hours | Minutes | Low |
| Warm Standby | Warm Site | A scaled-down but fully functional copy of the production system runs in a second region. On failure, scale up and redirect traffic. | Minutes | Seconds – Minutes | Medium |
| Multi-Site Active/Active | Hot Standby / Hot Site / Active-Active | Equal-scale active instances run simultaneously across multiple regions. Traffic is already distributed, so failover only requires routing requests away from the failed region. | Near-zero | Near-zero | Highest |
SAA/SAP Tip: Map strategies to the scenario's RTO/RPO constraints. Exam questions often pair a budget or time requirement with a strategy — narrow the answer using the cost vs. recovery speed trade-off.
Exam Trap: Pilot Light ≠ Warm Standby. Pilot Light keeps only the minimum critical core running (e.g., DB replication). Warm Standby runs a complete but scaled-down version of the full system.
Exam Trap: "Hot Standby", "Warm Site", and "Cold Site" are informal industry terms — not AWS exam answers. AWS uses the four strategy names in the table above. Map vendor terminology to AWS names before answering.
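The cost vs. recovery speed trade-off in the table can be sketched as a small lookup: pick the lowest-cost strategy whose worst-case RTO and RPO still fit the requirement. This is a minimal sketch, not an official AWS mapping. The numeric bounds in `STRATEGIES` are hypothetical planning numbers chosen to fall inside the qualitative ranges of the table, and the names `Strategy` and `cheapest_strategy` are made up for this illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_rto: timedelta  # assumed upper bound on downtime
    worst_rpo: timedelta  # assumed upper bound on data loss
    cost_rank: int        # 1 = lowest relative cost


# Hypothetical numbers for illustration only; real values depend on the workload.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(days=1), timedelta(days=1), 1),
    Strategy("Pilot Light", timedelta(hours=2), timedelta(minutes=30), 2),
    Strategy("Warm Standby", timedelta(minutes=30), timedelta(minutes=5), 3),
    Strategy("Multi-Site Active/Active", timedelta(seconds=30), timedelta(seconds=5), 4),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[str]:
    """Return the lowest-cost strategy that satisfies both targets, or None."""
    fits = [s for s in STRATEGIES if s.worst_rto <= rto and s.worst_rpo <= rpo]
    if not fits:
        return None
    return min(fits, key=lambda s: s.cost_rank).name


# Requirement: back online within 1 hour, lose at most 10 minutes of data.
print(cheapest_strategy(timedelta(hours=1), timedelta(minutes=10)))  # Warm Standby
```

With these assumed bounds, Pilot Light is rejected because its worst-case RTO of 2 hours exceeds the 1-hour target, so the next cheapest option that fits is Warm Standby.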
Recovery Time Objective (RTO)
The maximum time a system can remain offline after a failure before business impact becomes unacceptable.
Example: RTO = 15 minutes → the system must be restored and operational within 15 minutes of failure.
| Implication | Detail |
|---|---|
| Lower RTO | Requires active redundancy, automated failover, pre-provisioned capacity → higher cost |
| Higher RTO | Cold restores from backup are acceptable → lower cost |
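As a worked check of the 15-minute example, the sketch below compares the measured downtime of an incident or DR drill against the RTO. The function names and the timestamps are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta


def downtime(failed_at: datetime, restored_at: datetime) -> timedelta:
    """Time the system was offline."""
    if restored_at < failed_at:
        raise ValueError("restored_at must not precede failed_at")
    return restored_at - failed_at


def meets_rto(failed_at: datetime, restored_at: datetime, rto: timedelta) -> bool:
    """True if the system came back within the RTO."""
    return downtime(failed_at, restored_at) <= rto


rto = timedelta(minutes=15)
failed = datetime(2024, 1, 1, 10, 0)

print(meets_rto(failed, datetime(2024, 1, 1, 10, 12), rto))  # True: 12 minutes offline
print(meets_rto(failed, datetime(2024, 1, 1, 10, 20), rto))  # False: 20 minutes offline
```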
Recovery Point Objective (RPO)
The maximum amount of data loss a business can tolerate, expressed as the time gap between the last recovery point and the moment of failure.
Example: RPO = 15 minutes → the business accepts losing up to 15 minutes of transactions.
| Implication | Detail |
|---|---|
| Lower RPO | Requires continuous replication or high-frequency snapshots → higher cost |
| Higher RPO | Infrequent backups are acceptable → lower cost, higher data-loss risk |
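The definition above can be turned into a small calculation: the data lost in an incident is the gap between the failure and the most recent recovery point taken before it. The following is a minimal sketch with hypothetical function names and timestamps. It treats every snapshot as usable, which is an assumption; a corrupted or incomplete backup would widen the gap.

```python
from datetime import datetime, timedelta
from typing import Iterable


def data_loss_window(recovery_points: Iterable[datetime], failed_at: datetime) -> timedelta:
    """Gap between the failure and the latest recovery point taken before it."""
    usable = [p for p in recovery_points if p <= failed_at]
    if not usable:
        raise ValueError("no recovery point exists before the failure")
    return failed_at - max(usable)


def meets_rpo(recovery_points: Iterable[datetime], failed_at: datetime, rpo: timedelta) -> bool:
    """True if the data lost in this incident stays within the RPO."""
    return data_loss_window(recovery_points, failed_at) <= rpo


rpo = timedelta(minutes=15)
failed = datetime(2024, 1, 1, 10, 40)

# Snapshots every 15 minutes: the last one is 10 minutes old at failure time.
frequent = [datetime(2024, 1, 1, 10, m) for m in (0, 15, 30)]
print(meets_rpo(frequent, failed, rpo))  # True

# Hourly snapshots: the last one is 40 minutes old at failure time.
hourly = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)]
print(meets_rpo(hourly, failed, rpo))  # False
```

For periodic backups the worst case is a failure just before the next backup runs, so the backup interval itself has to be no longer than the RPO.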
RTO vs. RPO at a Glance
| Metric | Measures | Driven By |
|---|---|---|
| RTO | How fast is recovery? (downtime duration) | Failover speed, automation level, pre-provisioned capacity |
| RPO | How much data loss is acceptable? | Backup frequency, replication lag |
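The "Driven By" column suggests a rough way to estimate both metrics for a planned design. The sketch below assumes a simplified model that is not an AWS formula: downtime is the sum of failure detection, provisioning or scale-up, and traffic switch time, and worst-case data loss is the staleness of the freshest available copy (the replication lag if a replica exists, otherwise the backup interval). The function name, parameters, and sample durations are all hypothetical.

```python
from datetime import timedelta
from typing import Optional, Tuple


def estimate_recovery(
    detection: timedelta,
    provisioning: timedelta,
    traffic_switch: timedelta,
    backup_interval: Optional[timedelta] = None,
    replication_lag: Optional[timedelta] = None,
) -> Tuple[timedelta, timedelta]:
    """Return (estimated downtime, worst-case data loss) under the simplified model."""
    if backup_interval is None and replication_lag is None:
        raise ValueError("need a backup interval or a replication lag")
    # Downtime: the recovery steps are assumed to run one after another.
    estimated_downtime = detection + provisioning + traffic_switch
    # Data loss: recovery uses the freshest copy that is available.
    staleness = [t for t in (backup_interval, replication_lag) if t is not None]
    return estimated_downtime, min(staleness)


# Assumed example: replica lagging 5 seconds, plus daily backups as a fallback.
est_downtime, est_loss = estimate_recovery(
    detection=timedelta(minutes=2),
    provisioning=timedelta(minutes=20),
    traffic_switch=timedelta(minutes=1),
    backup_interval=timedelta(hours=24),
    replication_lag=timedelta(seconds=5),
)
print(est_downtime)  # 0:23:00
print(est_loss)      # 0:00:05
```

Comparing the two results against the target RTO and RPO shows which driver to invest in: pre-provisioned capacity shrinks the provisioning term, while replication shrinks the data-loss term.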
AWS DR Implementation Options
| Strategy | AWS Services |
|---|---|
| Backup & Restore | AWS Backup, S3 + Lifecycle policies, RDS automated backups and snapshots |
| Pilot Light | RDS cross-region read replica, Amazon Machine Image (AMI) copy, Route 53 health checks |
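| Warm Standby | EC2 Auto Scaling (min capacity > 0), RDS cross-region read replica or Aurora Global Database, Elastic Load Balancing (ELB). Note: RDS Multi-Availability Zone (Multi-AZ) provides high availability within one region only. |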
| Multi-Site Active/Active | Route 53 latency/geolocation routing, Aurora Global Database, DynamoDB Global Tables |
Related Services / See Also
- Amazon RDS Multi-AZ — synchronous standby replica for automatic failover
- Aurora Global Database — cross-region replication with sub-second RPO
- AWS Backup — centralized, policy-based backup across AWS services
- AWS Elastic Disaster Recovery (DRS) — continuous block-level server replication