
AWS Disaster Recovery — RTO, RPO, and the 4 DR Strategies

7 min read
Goel Academy
DevOps & Cloud Learning Hub

It's 2 AM. Your primary region (us-east-1) is experiencing a major outage. Your CEO is calling. Customers are tweeting. And you're realizing that "we'll figure out DR later" was not a viable strategy. Disaster recovery isn't about preventing failures — AWS regions go down, AZs have issues, services degrade. DR is about how fast you recover and how much data you can afford to lose.

RTO vs RPO — The Two Numbers That Define Your DR

Every DR plan comes down to two metrics:

Recovery Time Objective (RTO): How long can your application be down? If your RTO is 1 hour, you must be back online within 60 minutes of a disaster.

Recovery Point Objective (RPO): How much data can you lose? If your RPO is 1 hour, you can lose up to 60 minutes of data (the last hour of transactions, uploads, etc.).

These numbers drive every decision. A lower RTO/RPO costs more because it requires more active infrastructure in your DR region.

Timeline of a disaster:

Last backup               Disaster occurs             Recovery complete
     |_________________________|___________________________|
                RPO                         RTO
     (data you could lose)          (time you're down)
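To make the RPO number concrete, here's a back-of-envelope sketch; the write rate is an assumed figure for illustration, not from any real workload:

```shell
# Hypothetical workload: 250 writes/second (assumed figure)
writes_per_second=250
rpo_seconds=3600   # 1-hour RPO

# Worst case: every write since the last replication point is lost
at_risk=$(( writes_per_second * rpo_seconds ))
echo "A ${rpo_seconds}s RPO puts up to ${at_risk} writes at risk"
```

Run the same arithmetic against your own traffic numbers before committing to an RPO; "one hour of data" sounds abstract until it becomes hundreds of thousands of lost writes.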

The 4 DR Strategies

AWS defines four DR strategies with increasing cost and decreasing RTO/RPO:

| Strategy | RTO | RPO | Cost | Active Infra in DR Region |
|---|---|---|---|---|
| Backup & Restore | 24+ hours | 24 hours (last backup) | $ (lowest) | None — just backups |
| Pilot Light | 1-4 hours | Minutes-1 hour | $$ | Core databases only |
| Warm Standby | 10-30 minutes | Seconds-minutes | $$$ | Scaled-down full copy |
| Multi-Site Active/Active | Near zero | Near zero | $$$$ (highest) | Full production copy |

Strategy 1: Backup and Restore

The simplest approach. You take regular backups and store them cross-region. When disaster strikes, you provision infrastructure from scratch and restore from backups:

# Cross-region S3 replication for backups
aws s3api put-bucket-replication \
  --bucket my-backups-us-east-1 \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {
        "Bucket": "arn:aws:s3:::my-backups-us-west-2",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'

# Automated RDS snapshot copy to DR region
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:123456789012:snapshot:mydb-daily-2025-09-20 \
  --target-db-snapshot-identifier mydb-dr-copy-2025-09-20 \
  --source-region us-east-1 \
  --region us-west-2

# Automated with EventBridge rule (runs daily)
# Event: RDS snapshot completed → Lambda → copies to us-west-2

Pros: Cheapest. Simple to implement. Cons: Long recovery time. Hours to spin up infrastructure and restore data.
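The EventBridge automation mentioned above might be wired up like this; the rule name, event pattern, and Lambda ARN are illustrative assumptions, and the Lambda itself (which would call copy-db-snapshot into us-west-2) is left out:

```shell
# Illustrative sketch: match RDS snapshot events and hand them to a
# Lambda (ARN assumed) that copies the snapshot to the DR region
aws events put-rule \
  --name rds-snapshot-dr-copy \
  --event-pattern '{"source": ["aws.rds"], "detail-type": ["RDS DB Snapshot Event"]}'

aws events put-targets \
  --rule rds-snapshot-dr-copy \
  --targets '[{"Id": "copy-to-dr", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:copy-snapshot-to-dr"}]'
```

Driving the copy from an event rather than a cron schedule means a late or failed snapshot never produces a silent gap in your DR copies.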

Strategy 2: Pilot Light

Keep the core data layer running in the DR region (databases, replicating continuously), but nothing else. When disaster strikes, provision the application servers around the already-running database:

# RDS read replica in DR region (continuously replicating)
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:mydb-primary \
  --region us-west-2 \
  --db-instance-class db.r6g.large

# DynamoDB Global Table (automatic multi-region replication)
aws dynamodb update-table \
  --table-name user-sessions \
  --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]'

# During disaster: promote the read replica to standalone
aws rds promote-read-replica \
  --db-instance-identifier mydb-dr-replica \
  --region us-west-2

# Then launch application infrastructure from CloudFormation
aws cloudformation create-stack \
  --stack-name app-dr-recovery \
  --template-url https://my-backups-us-west-2.s3.amazonaws.com/cfn/app-stack.yaml \
  --parameters ParameterKey=Environment,ParameterValue=dr-recovery \
  --region us-west-2

The database is already warm with current data. You only need to provision compute (EC2, ECS, Lambda) and update DNS.

Strategy 3: Warm Standby

A scaled-down but fully functional copy of your production environment runs in the DR region at all times. It can serve traffic immediately — you just need to scale it up:

# DR region has a smaller Auto Scaling Group (running 24/7)
# Production: min=6, desired=10
# DR Standby: min=1, desired=2

# During disaster: scale up the DR environment
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name app-dr-asg \
  --min-size 6 --max-size 12 --desired-capacity 10 \
  --region us-west-2
# (max-size must be raised too, or the desired-capacity increase is rejected)

# Scale up RDS (if using a smaller instance in DR)
aws rds modify-db-instance \
  --db-instance-identifier mydb-dr-replica \
  --db-instance-class db.r6g.2xlarge \
  --apply-immediately \
  --region us-west-2

The DR environment is always running and tested. Failover is just a scaling operation plus DNS switch.

Strategy 4: Multi-Site Active/Active

Both regions serve production traffic simultaneously. There's no failover — both regions are primary. If one goes down, the other absorbs all traffic:

# Route 53 weighted routing across both regions
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "us-east-1",
        "Weight": 50,
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "alb-east.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }, {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "us-west-2",
        "Weight": 50,
        "AliasTarget": {
          "HostedZoneId": "Z1H1FL5HABSF5",
          "DNSName": "alb-west.us-west-2.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

Active/active requires your application to handle multi-region writes, conflict resolution, and data consistency. DynamoDB Global Tables and Aurora Global Database make this feasible on AWS.
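On the relational side, Aurora Global Database can be layered onto an existing cluster; the cluster identifiers and engine choice below are illustrative assumptions, and the secondary would normally need an engine version matching the primary:

```shell
# Illustrative: wrap an existing Aurora cluster in a global cluster,
# then add a read-only secondary cluster in us-west-2
aws rds create-global-cluster \
  --global-cluster-identifier app-global \
  --source-db-cluster-identifier arn:aws:rds:us-east-1:123456789012:cluster:app-primary

aws rds create-db-cluster \
  --db-cluster-identifier app-secondary \
  --global-cluster-identifier app-global \
  --engine aurora-mysql \
  --region us-west-2
```

The secondary region serves reads locally; writes still flow through the primary unless you fail over, which sidesteps most conflict-resolution problems at the cost of true multi-region writes.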

Route 53 Failover Routing

For Pilot Light and Warm Standby, Route 53 health checks trigger automatic DNS failover:

# Create a health check for the primary region
HEALTH_CHECK_ID=$(aws route53 create-health-check \
  --caller-reference "primary-$(date +%s)" \
  --health-check-config '{
    "FullyQualifiedDomainName": "primary-alb.us-east-1.elb.amazonaws.com",
    "Port": 443,
    "Type": "HTTPS",
    "RequestInterval": 10,
    "FailureThreshold": 3
  }' \
  --query 'HealthCheck.Id' --output text)

# Configure failover routing
# Primary record (us-east-1)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "HealthCheckId": "'$HEALTH_CHECK_ID'",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "alb-east.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

When the health check fails three consecutive times (roughly 30 seconds at a 10-second interval), Route 53 automatically routes traffic to the secondary record, which you configure the same way with "Failover": "SECONDARY" and no health check requirement.
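A quick sanity check on end-to-end failover time: detection follows directly from the health-check settings, while the DNS TTL below is an assumed value (Route 53's distributed checkers make the detection figure approximate in practice):

```shell
request_interval=10   # seconds, matching the health check config
failure_threshold=3   # consecutive failures before unhealthy
dns_ttl=60            # assumed client-side DNS TTL in seconds

detection=$(( request_interval * failure_threshold ))
worst_case=$(( detection + dns_ttl ))
echo "Detection in ~${detection}s; clients cut over within ~${worst_case}s"
```

Clients that ignore TTLs (some JVMs, some corporate resolvers) can take much longer, which is one reason active/active architectures avoid relying on DNS failover at all.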

AWS Elastic Disaster Recovery (DRS)

For lift-and-shift workloads, AWS Elastic Disaster Recovery continuously replicates your servers to a staging area in the DR region at block level:

# Install the replication agent on your source server
wget -O aws-replication-installer.py \
  https://aws-elastic-disaster-recovery-us-west-2.s3.amazonaws.com/latest/linux/aws-replication-installer-init
sudo python3 aws-replication-installer.py \
  --region us-west-2 \
  --aws-access-key-id AKIA... \
  --aws-secret-access-key ...

# Monitor replication status
aws drs describe-source-servers \
  --region us-west-2 \
  --query 'items[*].{Server: sourceProperties.identificationHints.hostname, Status: dataReplicationInfo.dataReplicationState}'

DRS continuously replicates at the block level (not snapshot-based), giving you RPOs of seconds. When you initiate recovery, it launches full instances from the replicated data in minutes.

Testing Your DR Plan

A DR plan that hasn't been tested is just a hypothesis. Schedule regular DR drills:

# DRS: Launch test instances (doesn't affect production or replication)
aws drs start-recovery \
  --source-servers '[{"sourceServerID": "s-abc123"}]' \
  --is-drill \
  --region us-west-2

# After testing, terminate the test instances
aws drs terminate-recovery-instances \
  --recovery-instance-i-ds ri-test123 \
  --region us-west-2

Document your runbook: who triggers failover, what's automated vs manual, how to fail back, and how to validate the recovery. Test quarterly at minimum.

Cost Comparison

| Strategy | Monthly DR Cost (example) | What's Running |
|---|---|---|
| Backup & Restore | ~$50 (S3 storage + snapshot copies) | Nothing |
| Pilot Light | ~$400 (RDS replica + DynamoDB Global) | Database layer only |
| Warm Standby | ~$2,000 (scaled-down full stack) | Everything, at reduced scale |
| Active/Active | ~$8,000 (full production duplicate) | Everything, at full scale |

The right strategy depends on your business requirements. An e-commerce site losing $50,000/hour in revenue can easily justify a $2,000/month warm standby. A development blog cannot.
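That trade-off is easy to sanity-check with arithmetic; the RTO figures below are the illustrative ones from the strategy table, and the revenue number is the example above:

```shell
loss_per_hour=50000    # assumed revenue lost per hour of downtime
rto_backup_min=1440    # ~24h RTO (Backup & Restore), in minutes
rto_warm_min=30        # ~30min RTO (Warm Standby), in minutes

# Revenue protected by the faster recovery in a single incident
avoided=$(( (rto_backup_min - rto_warm_min) * loss_per_hour / 60 ))
echo "One incident: warm standby protects ~\$${avoided} of revenue"
```

Against a $2,000/month standby bill, a single avoided day-long outage pays for decades of warm standby, which is why the decision is a business calculation rather than a technical one.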

What's Next

DR ensures your architecture survives failures. But how do you know if your architecture is well-designed in the first place? In the next post, we'll walk through the AWS Well-Architected Framework — the six pillars that separate solid infrastructure from expensive mistakes.