Disaster Recovery Strategies

Comparison of the four AWS DR strategies (Backup & Restore, Pilot Light, Warm Standby, Multi-Site Active/Active), with RTO and RPO definitions. Core topic for the AWS Solutions Architect Associate (SAA) and Professional (SAP) exams.

Overview

Disaster Recovery (DR) encompasses strategies and processes used to restore IT systems and data after a failure, outage, or disaster — whether natural or man-made. The goal is to minimize downtime and data loss.

Two key metrics define any DR strategy:

Recovery Time Objective (RTO) — Maximum acceptable duration of downtime.
Recovery Point Objective (RPO) — Maximum acceptable data loss, measured as time.

DR Strategy Comparison

AWS Strategy	Alias	Description	RTO	RPO	Relative Cost
Backup & Restore	Cold Site / Cold Standby	Data is backed up periodically and restored into a new or existing environment after a failure. No standby infrastructure runs in the meantime.	Hours – Days	Hours – Days	Lowest
Pilot Light	Warm Site (some vendors)	A minimal core (typically DB replication only) runs continuously in a secondary region. On failure, provision additional resources to complete the system.	Minutes – Hours	Minutes	Low
Warm Standby	Warm Site / Warm Standby	A scaled-down but fully functional copy of the production system runs in a second region. On failure, scale up and redirect traffic.	Minutes	Seconds – Minutes	Medium
Multi-Site Active/Active	Hot Standby / Hot Site / Active-Active	Full-scale deployments actively serve traffic in multiple regions at the same time. Because traffic is already distributed, failover amounts to routing requests away from the failed region.	Near-zero	Near-zero	Highest

SAA/SAP Tip: Map strategies to the scenario's RTO/RPO constraints. Exam questions often pair a budget or time requirement with a strategy — narrow the answer using the cost vs. recovery speed trade-off.

Exam Trap: Pilot Light ≠ Warm Standby. Pilot Light keeps only the minimum critical core running (e.g., DB replication). Warm Standby runs a complete but scaled-down version of the full system.

Exam Trap: "Hot Standby", "Warm Site", and "Cold Site" are informal industry terms, not the names AWS uses in exam answers. AWS uses the four strategy names in the table above, so map vendor terminology to the AWS names before answering.
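
The mapping from a scenario's RTO/RPO constraints to a strategy can be sketched in Python. This is a minimal illustration, not an AWS tool: the `Strategy` class, the `cheapest_strategy` function, and the numeric thresholds are assumptions chosen to fall inside the ranges in the comparison table ("Near-zero" is modeled as 0 minutes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_min: float  # assumed representative RTO, in minutes
    typical_rpo_min: float  # assumed representative RPO, in minutes
    cost_rank: int          # 1 = lowest relative cost


# Illustrative values only; real figures depend on the workload.
STRATEGIES = [
    Strategy("Backup & Restore", 24 * 60, 24 * 60, 1),
    Strategy("Pilot Light", 60, 10, 2),
    Strategy("Warm Standby", 10, 1, 3),
    Strategy("Multi-Site Active/Active", 0, 0, 4),
]


def cheapest_strategy(required_rto_min: float, required_rpo_min: float) -> str:
    """Return the lowest-cost strategy whose assumed RTO and RPO both fit."""
    if required_rto_min < 0 or required_rpo_min < 0:
        raise ValueError("objectives must be non-negative")
    for s in sorted(STRATEGIES, key=lambda s: s.cost_rank):
        if s.typical_rto_min <= required_rto_min and s.typical_rpo_min <= required_rpo_min:
            return s.name
    # Unreachable for non-negative inputs, because Active/Active is modeled as 0/0.
    raise ValueError("no strategy satisfies the objectives")


if __name__ == "__main__":
    print(cheapest_strategy(required_rto_min=120, required_rpo_min=15))
```

Walking the list from cheapest to most expensive mirrors the exam heuristic: pick the least costly strategy that still satisfies both objectives.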

Recovery Time Objective (RTO)

The maximum time a system can remain offline after a failure before business impact becomes unacceptable.

Example: RTO = 15 minutes → the system must be restored and operational within 15 minutes of failure.

Implication	Detail
Lower RTO	Requires active redundancy, automated failover, pre-provisioned capacity → higher cost
Higher RTO	Cold restores from backup are acceptable → lower cost
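
To see why pre-provisioned capacity and automated failover lower the achievable RTO, recovery time can be treated as the sum of sequential phases. The sketch below assumes a simple additive model; the phase names and durations are hypothetical, not measurements.

```python
from typing import Mapping


def estimated_recovery_minutes(phases: Mapping[str, float]) -> float:
    """Total recovery time, assuming the phases run one after another."""
    return sum(phases.values())


def meets_rto(phases: Mapping[str, float], rto_minutes: float) -> bool:
    """True if the estimated recovery time fits inside the RTO."""
    return estimated_recovery_minutes(phases) <= rto_minutes


# Hypothetical durations in minutes.
cold_restore = {
    "detect_failure": 10,
    "provision_infrastructure": 45,
    "restore_data_from_backup": 120,
    "redirect_traffic": 5,
}
automated_failover = {
    "detect_failure": 2,
    "provision_infrastructure": 0,  # capacity is already running
    "restore_data_from_backup": 0,  # data is already replicated
    "redirect_traffic": 3,
}

if __name__ == "__main__":
    for label, phases in [("cold restore", cold_restore), ("automated failover", automated_failover)]:
        print(label, estimated_recovery_minutes(phases), meets_rto(phases, rto_minutes=15))
```

Under these assumed numbers, only the automated path fits a 15-minute RTO, because provisioning and data restore dominate the cold path.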

Recovery Point Objective (RPO)

The maximum amount of data loss a business can tolerate, expressed as the time gap between the last recovery point and the moment of failure.

Example: RPO = 15 minutes → the business accepts losing up to 15 minutes of transactions.

Implication	Detail
Lower RPO	Requires continuous replication or high-frequency snapshots → higher cost
Higher RPO	Infrequent backups are acceptable → lower cost, higher data-loss risk
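
The link between backup frequency and RPO can be sketched as a worst-case calculation. The model below is an assumption for illustration: a backup captures state at its start time and becomes usable only when it finishes, so the worst case is a failure just before the next backup completes. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(
    backup_interval: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> timedelta:
    """Worst-case gap between the last usable recovery point and a failure."""
    return backup_interval + backup_duration


def meets_rpo(
    backup_interval: timedelta,
    rpo: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(minutes=15)
    print(meets_rpo(timedelta(hours=24), rpo))   # daily backups
    print(meets_rpo(timedelta(minutes=5), rpo))  # 5-minute snapshots
```

With continuous replication, the same check applies with the replication lag in place of the backup interval.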

RTO vs. RPO at a Glance

Metric	Measures	Driven By
RTO	How long can the system be down? (downtime duration)	Failover speed, automation level, pre-provisioned capacity
RPO	How much data loss is acceptable?	Backup frequency, replication lag
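
The two metrics can be read off an incident timeline: data loss runs backward from the failure to the last recovery point, and downtime runs forward from the failure to restoration. The sketch below uses hypothetical timestamps and function names to check an incident against 15-minute objectives.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Time between the last recovery point and the failure (compare to RPO)."""
    return failure_time - last_recovery_point


def downtime(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Time between the failure and restored service (compare to RTO)."""
    return service_restored - failure_time


if __name__ == "__main__":
    # Hypothetical incident timeline.
    last_recovery_point = datetime(2024, 1, 1, 9, 50)
    failure_time = datetime(2024, 1, 1, 10, 0)
    service_restored = datetime(2024, 1, 1, 10, 12)

    objective = timedelta(minutes=15)
    print("RPO met:", data_loss_window(last_recovery_point, failure_time) <= objective)
    print("RTO met:", downtime(failure_time, service_restored) <= objective)
```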

AWS DR Implementation Options

Strategy	AWS Services
Backup & Restore	AWS Backup, S3 + Lifecycle policies, RDS automated snapshots
Pilot Light	RDS cross-region read replica, Amazon Machine Image (AMI) copy, Route 53 health checks
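Warm Standby	EC2 Auto Scaling (min capacity > 0), Elastic Load Balancing (ELB), RDS cross-region read replica (RDS Multi-Availability Zone (Multi-AZ) protects against an Availability Zone failure only, not a regional outage)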
Multi-Site Active/Active	Route 53 latency/geolocation routing, Aurora Global Database, DynamoDB Global Tables

Amazon RDS Multi-AZ: synchronous standby replica in another Availability Zone of the same region, with automatic failover
Aurora Global Database: cross-region replication with a typical lag of under one second
AWS Backup: centralized, policy-based backup across AWS services
AWS Elastic Disaster Recovery (DRS): continuous block-level server replication
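
For the Multi-Site Active/Active row, Route 53 latency routing is configured with one record per region that shares a name and type. The sketch below only builds the `ChangeBatch` payload in plain Python. The domain, IP addresses, regions, and helper names are hypothetical, and the boto3 call that would submit the payload is left as a comment because it needs credentials and a real hosted zone.

```python
from typing import Any, Dict, Optional


def latency_record(
    name: str,
    region: str,
    ip_address: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build one latency-routed A record for a single region."""
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{region}",  # must be unique per record
        "Region": region,                     # enables latency-based routing
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        # With a health check attached, Route 53 stops returning this record
        # while the endpoint is unhealthy.
        record["HealthCheckId"] = health_check_id
    return record


def latency_change_batch(name: str, endpoints: Dict[str, str]) -> Dict[str, Any]:
    """Build a ChangeBatch that upserts one latency record per region."""
    return {
        "Comment": "Active/active latency routing (illustrative)",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": latency_record(name, region, ip)}
            for region, ip in endpoints.items()
        ],
    }


if __name__ == "__main__":
    batch = latency_change_batch(
        "app.example.com.",
        {"us-east-1": "192.0.2.10", "eu-west-1": "198.51.100.10"},
    )
    print(batch)
    # Not executed here; submitting would look like:
    # boto3.client("route53").change_resource_record_sets(
    #     HostedZoneId="<hosted zone id>", ChangeBatch=batch)
```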