Xoxoftware - XOXO Creative Studio | Web & Mobile App Development | Fred Cheung | Hong Kong

Disaster Recovery Strategies

Compares the four AWS disaster recovery strategies (Backup & Restore, Pilot Light, Warm Standby, Multi-Site Active/Active) and defines RTO and RPO. A core topic for the AWS Solutions Architect Associate (SAA) and Professional (SAP) exams.

Overview

Disaster Recovery (DR) encompasses strategies and processes used to restore IT systems and data after a failure, outage, or disaster — whether natural or man-made. The goal is to minimize downtime and data loss.

Two key metrics define any DR strategy:

  • Recovery Time Objective (RTO) — Maximum acceptable duration of downtime.
  • Recovery Point Objective (RPO) — Maximum acceptable data loss, measured as time.

DR Strategy Comparison

| AWS Strategy | Alias | Description | RTO | RPO | Relative Cost |
| --- | --- | --- | --- | --- | --- |
| Backup & Restore | Cold Site / Cold Standby | Backs up data periodically; restore to a new or existing environment on failure. No standby infrastructure running. | Hours – Days | Hours – Days | Lowest |
| Pilot Light | Warm Site (some vendors) | A minimal core (typically DB replication only) runs continuously in a secondary region. On failure, provision additional resources to complete the system. | Minutes – Hours | Minutes | Low |
| Warm Standby | Warm Site / Warm Standby | A scaled-down but fully functional copy of the production system runs in a second region. On failure, scale up and redirect traffic. | Minutes | Seconds – Minutes | Medium |
| Multi-Site Active/Active | Hot Standby / Hot Site / Active-Active | Equal-scale active instances run simultaneously across multiple regions. Traffic is already distributed; failover is instant. | Near-zero | Near-zero | Highest |

SAA/SAP Tip: Map strategies to the scenario's RTO/RPO constraints. Exam questions often pair a budget or time requirement with a strategy — narrow the answer using the cost vs. recovery speed trade-off.
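The cost vs. recovery speed trade-off can be sketched as a small lookup: walk the strategies from cheapest to most expensive and stop at the first one whose worst-case RTO and RPO fit the scenario. The numeric bounds below are illustrative assumptions chosen to fall inside the qualitative ranges in the table (for example, "Hours – Days" is taken as 24 hours); they are not AWS-published figures, and `cheapest_strategy` is a hypothetical helper name.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_rto: timedelta  # assumed worst-case recovery time
    worst_rpo: timedelta  # assumed worst-case data-loss window


# Ordered from lowest to highest relative cost.
# All durations are illustrative assumptions, not AWS guarantees.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", timedelta(hours=2), timedelta(minutes=15)),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=1)),
    Strategy("Multi-Site Active/Active", timedelta(seconds=30), timedelta(seconds=5)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[str]:
    """Return the cheapest strategy whose assumed worst case meets both targets."""
    for strategy in STRATEGIES:
        if strategy.worst_rto <= rto and strategy.worst_rpo <= rpo:
            return strategy.name
    return None  # no listed strategy satisfies the requirement


# Scenario: recover within 4 hours, lose at most 30 minutes of data.
print(cheapest_strategy(timedelta(hours=4), timedelta(minutes=30)))  # Pilot Light
```

Ordering the list by cost means the first match is also the cheapest acceptable answer, which mirrors how the exam scenarios are usually narrowed down.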

Exam Trap: Pilot Light ≠ Warm Standby. Pilot Light keeps only the minimum critical core running (e.g., DB replication). Warm Standby runs a complete but scaled-down version of the full system.

Exam Trap: "Hot Standby", "Warm Site", and "Cold Site" are informal industry terms — not AWS exam answers. AWS uses the four strategy names in the table above. Map vendor terminology to AWS names before answering.


Recovery Time Objective (RTO)

The maximum time a system can remain offline after a failure before business impact becomes unacceptable.

Example: RTO = 15 minutes → the system must be restored and operational within 15 minutes of failure.

| Implication | Detail |
| --- | --- |
| Lower RTO | Requires active redundancy, automated failover, pre-provisioned capacity → higher cost |
| Higher RTO | Cold restores from backup are acceptable → lower cost |
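One way to see why a lower RTO pushes toward automation and pre-provisioned capacity is to add up the recovery phases and compare the total with the target. The phase names and durations below are hypothetical figures for illustration only, and `total_recovery_time` is an assumed helper, not part of any AWS tool.

```python
from datetime import timedelta
from typing import Mapping


def total_recovery_time(phases: Mapping[str, timedelta]) -> timedelta:
    """Sum the duration of every recovery phase."""
    return sum(phases.values(), timedelta())


# Hypothetical cold restore: everything is built after the failure.
cold_restore = {
    "detect failure": timedelta(minutes=10),
    "provision infrastructure": timedelta(minutes=45),
    "restore data from backup": timedelta(hours=2),
    "redirect traffic": timedelta(minutes=5),
}

# Hypothetical automated failover: capacity already exists in the second region.
automated_failover = {
    "detect failure": timedelta(minutes=1),
    "redirect traffic": timedelta(minutes=2),
}

rto = timedelta(minutes=15)
print(total_recovery_time(cold_restore) <= rto)        # False (3 hours in total)
print(total_recovery_time(automated_failover) <= rto)  # True (3 minutes in total)
```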

Recovery Point Objective (RPO)

The maximum amount of data loss a business can tolerate, expressed as the time gap between the last recovery point and the moment of failure.

Example: RPO = 15 minutes → the business accepts losing up to 15 minutes of transactions.

| Implication | Detail |
| --- | --- |
| Lower RPO | Requires continuous replication or high-frequency snapshots → higher cost |
| Higher RPO | Infrequent backups are acceptable → lower cost, higher data-loss risk |
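For periodic backups, the link between RPO and backup frequency can be worked out directly. The sketch below assumes that the worst-case data loss equals the interval between two consecutive backups (a failure just before the next backup) and that each backup completes instantly; `backups_per_day` is a hypothetical helper name.

```python
import math
from datetime import timedelta


def backups_per_day(rpo: timedelta) -> int:
    """Minimum number of evenly spaced backups per day so the gap never exceeds the RPO."""
    if rpo <= timedelta(0):
        raise ValueError("RPO must be positive; a zero RPO needs continuous replication")
    return math.ceil(timedelta(days=1) / rpo)


print(backups_per_day(timedelta(hours=24)))    # 1
print(backups_per_day(timedelta(minutes=15)))  # 96
```

Moving from a 24-hour RPO to a 15-minute RPO multiplies the number of backups by 96, which is why very low RPO targets usually shift from snapshots to continuous replication.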

RTO vs. RPO at a Glance

| Metric | Measures | Driven By |
| --- | --- | --- |
| RTO | How fast is recovery? (downtime duration) | Failover speed, automation level, pre-provisioned capacity |
| RPO | How much data loss is acceptable? | Backup frequency, replication lag |
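The two metrics measure different intervals on the same incident timeline: RPO looks backward from the failure to the last recovery point, while RTO looks forward from the failure to restoration. A minimal sketch, using a hypothetical `Incident` record with made-up timestamps:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass(frozen=True)
class Incident:
    last_recovery_point: datetime  # most recent backup or replicated state
    failed_at: datetime            # moment of failure
    restored_at: datetime          # moment service is operational again

    @property
    def downtime(self) -> timedelta:
        """Interval compared against the RTO."""
        return self.restored_at - self.failed_at

    @property
    def data_loss_window(self) -> timedelta:
        """Interval compared against the RPO."""
        return self.failed_at - self.last_recovery_point


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> Tuple[bool, bool]:
    """Return (rto_met, rpo_met) for one incident."""
    return incident.downtime <= rto, incident.data_loss_window <= rpo


incident = Incident(
    last_recovery_point=datetime(2024, 1, 1, 9, 50),
    failed_at=datetime(2024, 1, 1, 10, 0),
    restored_at=datetime(2024, 1, 1, 10, 12),
)

# 12 minutes of downtime and 10 minutes of lost data.
print(meets_objectives(incident, rto=timedelta(minutes=15), rpo=timedelta(minutes=15)))  # (True, True)
print(meets_objectives(incident, rto=timedelta(minutes=15), rpo=timedelta(minutes=5)))   # (True, False)
```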

AWS DR Implementation Options

| Strategy | AWS Services |
| --- | --- |
| Backup & Restore | AWS Backup, S3 + Lifecycle policies, RDS automated snapshots |
| Pilot Light | RDS cross-region read replica, Amazon Machine Image (AMI) copy, Route 53 health checks |
| Warm Standby | EC2 Auto Scaling (min capacity > 0), RDS Multi-Availability Zone (Multi-AZ), Elastic Load Balancing (ELB) |
| Multi-Site Active/Active | Route 53 latency/geolocation routing, Aurora Global Database, DynamoDB Global Tables |

Built by Fred Cheung @CookedRicer · Powered by Fumadocs & Github Copilot
