Azure Site Recovery — DR Strategy for Production Workloads

Goel Academy
DevOps & Cloud Learning Hub

It is 2 AM. Your primary Azure region is experiencing a major outage. Your CEO is on Slack asking when the website will be back. Your answer depends entirely on whether you set up disaster recovery last quarter or kept pushing it to "next sprint." Azure Site Recovery makes DR achievable without maintaining a fully hot standby: you replicate, you test, and when disaster strikes, you fail over with confidence.

DR Fundamentals — RTO and RPO

Every disaster recovery strategy starts with two numbers:

  • RTO (Recovery Time Objective): How long can you be down? If your RTO is 1 hour, your system must be operational within 60 minutes of a failure.
  • RPO (Recovery Point Objective): How much data can you lose? If your RPO is 15 minutes, you need replicated data that is no more than 15 minutes old at the point of failure.

| Scenario | Typical RTO | Typical RPO | Strategy |
| --- | --- | --- | --- |
| Static website | 4+ hours | 24 hours | Backup + Redeploy |
| Internal business app | 1-4 hours | 1 hour | ASR replication |
| Customer-facing web app | 15-60 min | 5-15 min | ASR + Traffic Manager |
| Financial transaction system | < 5 min | 0 (zero data loss) | Active-Active + SQL Always On |
| E-commerce platform | 15-30 min | 5 min | ASR + SQL geo-replication + CDN |

Your RTO and RPO dictate your cost. Zero RPO requires synchronous replication, which means running compute in two regions simultaneously. A 1-hour RPO with ASR costs a fraction of that.
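To make an RPO tangible, it can help to translate it into business terms. The sketch below is a hypothetical helper, not part of any Azure tooling: the function name `max_writes_lost` and the rate of 200 orders per minute are illustrative assumptions. It estimates the worst-case number of writes lost for a given RPO.

```shell
#!/usr/bin/env bash
# Hypothetical helper: worst-case writes lost = RPO (minutes) x write rate (per minute).
# Both inputs are assumptions you supply; nothing here queries Azure.
max_writes_lost() {
  local rpo_minutes=$1 writes_per_minute=$2
  echo $(( rpo_minutes * writes_per_minute ))
}

# A 15-minute RPO at an assumed 200 orders per minute
max_writes_lost 15 200   # prints 3000
```

If that number is unacceptable to the business, the RPO (and the budget) needs to change before the architecture does.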

Azure Site Recovery (ASR) for VMs

ASR continuously replicates your Azure VMs to a secondary region. The replication is asynchronous, typically achieving an RPO of 30 seconds to a few minutes. You pay for the replicated storage and ASR licensing, but you do not pay for compute in the target region until you actually fail over.

# Create a Recovery Services vault in the target region
az backup vault create \
--resource-group rg-dr-westus \
--name rsv-dr-westus \
--location westus2

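# Enable replication for an Azure VM (East US → West US 2)
# Requires the CLI extension: az extension add --name site-recovery
# Values in <...> are placeholders; confirm flags with: az site-recovery protected-item create --help
az site-recovery protected-item create \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--fabric-name "azure-eastus" \
--protection-container "asr-a2a-default-eastus" \
--name "vm-webapp-01-repl" \
--policy-id "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.RecoveryServices/vaults/rsv-dr-westus/replicationPolicies/24-hour-retention" \
--provider-details '{
"a2a": {
"fabric-object-id": "/subscriptions/<sub-id>/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachines/vm-webapp-01",
"recovery-container-id": "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.RecoveryServices/vaults/rsv-dr-westus/replicationFabrics/azure-westus2/replicationProtectionContainers/asr-a2a-default-westus2",
"recovery-resource-group-id": "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus",
"vm-managed-disks": [
{
"disk-id": "<os-disk-id>",
"primary-staging-azure-storage-account-id": "<cache-storage-account-id>",
"recovery-resource-group-id": "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus"
}
]
}
}'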

What gets replicated:

  • OS disk and all data disks
  • VM settings mirrored to the target (size, virtual network and subnet mapping)
  • Tags and metadata

What does NOT get replicated (you must configure separately):

  • VM extensions (reinstall them on the failed-over VM)
  • Managed identity assignments (the failed-over VM is a new resource, so reassign identities and roles)
  • NSG rules, public IPs, load balancer configuration
  • Azure Firewall rules, DNS records
  • Application-level configuration that references region-specific endpoints

Replication from On-Premises to Azure

ASR also supports replicating on-premises VMware VMs, Hyper-V VMs, and physical servers to Azure. This is the migration and DR path for hybrid environments.

# The on-premises workflow:
# 1. Deploy the ASR Configuration Server (VMware) or Hyper-V Site
# 2. Install the Mobility Service agent on source VMs (VMware and physical servers; Hyper-V VMs need no in-guest agent)
# 3. Configure replication policy (RPO, retention, app-consistent snapshots)
# 4. Enable replication for each VM
# 5. Monitor replication health in the vault

The replication policy controls:

  • Recovery point retention: How long to keep recovery points (default 24 hours)
  • App-consistent snapshot frequency: How often to create application-consistent snapshots (default 4 hours)
  • Replication frequency: How often delta changes are sent (a configurable interval such as 30 seconds for Hyper-V; Azure-to-Azure and VMware replication is continuous)
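As a quick sanity check on those defaults, the retention window and the snapshot interval together determine how many app-consistent recovery points you can choose from at failover time. This is a minimal arithmetic sketch; the function name `app_consistent_points` is illustrative and nothing here calls Azure.

```shell
#!/usr/bin/env bash
# Number of app-consistent recovery points kept = retention window / snapshot interval.
# Inputs are in hours; integer division, so partial intervals are not counted.
app_consistent_points() {
  local retention_hours=$1 snapshot_interval_hours=$2
  echo $(( retention_hours / snapshot_interval_hours ))
}

# Defaults named above: 24-hour retention, snapshot every 4 hours
app_consistent_points 24 4   # prints 6
```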

Recovery Plans with Runbooks

A recovery plan orchestrates the failover of multiple VMs in a specific order. You group VMs into tiers — database first, then application, then web — and add automation runbooks between groups.

# Create a recovery plan
az site-recovery recovery-plan create \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
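--primary-fabric-id "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.RecoveryServices/vaults/rsv-dr-westus/replicationFabrics/azure-eastus" \
--recovery-fabric-id "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.RecoveryServices/vaults/rsv-dr-westus/replicationFabrics/azure-westus2" \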
--groups '[
{
"groupType": "Boot",
"replicationProtectedItems": [
{"id": "<sql-vm-replication-id>"}
],
"startGroupActions": [],
"endGroupActions": [
{
"actionName": "wait-for-sql",
"failoverTypes": ["PlannedFailover", "UnplannedFailover"],
"failoverDirections": ["PrimaryToRecovery"],
"customDetails": {
"instanceType": "AutomationRunbookActionDetails",
"runbookId": "<runbook-id>",
"fabricLocation": "Recovery"
}
}
]
},
{
"groupType": "Boot",
"replicationProtectedItems": [
{"id": "<app-vm-1-replication-id>"},
{"id": "<app-vm-2-replication-id>"}
]
},
{
"groupType": "Boot",
"replicationProtectedItems": [
{"id": "<web-vm-1-replication-id>"},
{"id": "<web-vm-2-replication-id>"}
]
}
]'

The runbook between Group 1 (SQL) and Group 2 (App) can verify the database is responsive, update connection strings, or perform any custom logic. This ensures your application tier does not start before the database is ready.
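The core of a "wait for the database" step is a bounded retry loop. Azure Automation runbooks are typically written in PowerShell or Python, so treat the following as a shell sketch of the logic rather than a drop-in runbook. The function name `wait_until`, the commented `sql_port_open` check, and the IP address are all hypothetical.

```shell
#!/usr/bin/env bash
# Generic readiness loop: retry a check command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "not ready after ${max_attempts} attempt(s)" >&2
  return 1
}

# Hypothetical check (bash-specific /dev/tcp, GNU timeout): is the SQL port reachable?
# sql_port_open() { timeout 5 bash -c 'exec 3<>/dev/tcp/10.1.0.4/1433'; }
# wait_until 30 10 sql_port_open
```

Bounding the attempts matters: a check that never succeeds should fail the recovery plan step visibly instead of hanging the failover.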

Test Failover

Test failover is the most important feature of ASR — and the most underused. It creates a copy of your replicated VMs in an isolated network without disrupting production replication or your primary site.

# Perform a test failover for the recovery plan
az site-recovery recovery-plan test-failover \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
--failover-direction PrimaryToRecovery \
--network-id "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.Network/virtualNetworks/vnet-test-failover" \
--network-type VmNetworkAsInput

# After testing, clean up
az site-recovery recovery-plan test-failover-cleanup \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
--comments "Q3 DR test passed — all services operational in 12 minutes"

Run test failovers quarterly at minimum. Document the results. Measure the actual RTO. If your recovery plan takes 45 minutes and your SLA promises 30 minutes, you have a problem to fix before disaster strikes — not during it.
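Recording each drill as a pass or fail against the SLA keeps the comparison honest. A minimal sketch, assuming you time the drill yourself and pass both values in minutes; the function name `rto_check` is illustrative.

```shell
#!/usr/bin/env bash
# Compare a measured recovery time against the RTO promised in the SLA (both in minutes).
rto_check() {
  local measured=$1 target=$2
  if [ "$measured" -le "$target" ]; then
    echo "PASS: recovered in ${measured} min (target ${target} min)"
  else
    echo "FAIL: recovered in ${measured} min, $(( measured - target )) min over target"
    return 1
  fi
}

rto_check 12 30          # PASS: recovered in 12 min (target 30 min)
rto_check 45 30 || true  # FAIL: recovered in 45 min, 15 min over target
```

The non-zero exit status on failure makes it easy to wire the check into a scheduled pipeline that flags a drill that missed its target.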

Planned vs Unplanned Failover

| Type | When To Use | Data Loss | Source VMs |
| --- | --- | --- | --- |
| Test Failover | DR drills, validation | None (isolated copy) | Keep running |
| Planned Failover | Scheduled region migration, maintenance | Zero (waits for replication sync) | Shut down first |
| Unplanned Failover | Region outage, emergency | Possible (up to RPO) | May be inaccessible |

# Planned failover — use when primary region is still accessible
az site-recovery recovery-plan planned-failover \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
--failover-direction PrimaryToRecovery

# Unplanned failover — use during actual disaster
az site-recovery recovery-plan unplanned-failover \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
--failover-direction PrimaryToRecovery \
--source-site-operations NotRequired

After failover, commit the failover to finalize it, then re-protect the VMs to start replication in the reverse direction (for failing back later).

Geo-Redundant Storage

Azure Storage offers built-in geo-redundancy:

| Redundancy | Copies | Regions | Read Access in Secondary |
| --- | --- | --- | --- |
| LRS | 3 copies | 1 region | No |
| ZRS | 3 copies | 1 region (across zones) | No |
| GRS | 6 copies | 2 regions | No (failover required) |
| GZRS | 6 copies | 2 regions (primary zonal) | No |
| RA-GRS | 6 copies | 2 regions | Yes (read-only) |
| RA-GZRS | 6 copies | 2 regions (primary zonal) | Yes (read-only) |

# Create a storage account with read-access geo-zone-redundant storage (RA-GZRS)
az storage account create \
--name stprodgeo2025 \
--resource-group rg-prod \
--location eastus \
--sku Standard_RAGZRS \
--kind StorageV2

# Check replication status
az storage account show \
--name stprodgeo2025 \
--query "{Primary:primaryLocation, Secondary:secondaryLocation, Status:statusOfSecondary}" \
--output table

RA-GZRS is the gold standard for critical data — zone-redundant in the primary region, geo-replicated to a secondary region, and readable in both locations.

Azure Backup vs ASR

These serve different purposes:

| Feature | Azure Backup | Azure Site Recovery |
| --- | --- | --- |
| Purpose | Point-in-time restore | Continuous replication + failover |
| RTO | Hours (restore from backup) | Minutes (pre-replicated VMs) |
| RPO | Last backup (daily/hourly) | Seconds to minutes |
| Use case | Accidental deletion, data corruption | Region outage, disaster |
| Cost | Storage cost for backups | Replication + licensing per VM |
| Target | Same region (or vault) | Different region |

Use both together. Azure Backup protects against "oops I deleted the database" scenarios. ASR protects against "the entire East US region is down" scenarios.

SQL Always On for Database DR

For databases with zero or near-zero RPO requirements, SQL Always On availability groups replicate transactions synchronously (within the same region) or asynchronously (across regions).

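-- Add an asynchronous secondary replica in another region to an existing availability group
-- (run on the primary replica; the endpoint host is a placeholder for the DR SQL Server VM's FQDN)
ALTER AVAILABILITY GROUP [ag-prod-db]
ADD REPLICA ON 'sql-dr-westus2'
WITH (
ENDPOINT_URL = 'TCP://sql-dr-westus2.contoso.com:5022',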
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
);

For Azure SQL Database (PaaS), use active geo-replication or auto-failover groups instead:

# Create a failover group for Azure SQL Database (--grace-period is in hours)
az sql failover-group create \
--name fog-prod-db \
--resource-group rg-prod \
--server sql-prod-eastus \
--partner-server sql-dr-westus2 \
--partner-resource-group rg-dr-westus \
--add-db proddb \
--failover-policy Automatic \
--grace-period 1

Traffic Manager for DNS Failover

Traffic Manager routes users to healthy endpoints using DNS. In a DR scenario, it automatically redirects traffic from your failed primary region to the secondary.

# Create a Traffic Manager profile with priority routing
az network traffic-manager profile create \
--resource-group rg-global \
--name tm-webapp \
--routing-method Priority \
--unique-dns-name goel-webapp \
--ttl 60 \
--protocol HTTPS \
--port 443 \
--path "/health"

# Add primary endpoint (East US)
az network traffic-manager endpoint create \
--resource-group rg-global \
--profile-name tm-webapp \
--name primary-eastus \
--type azureEndpoints \
--target-resource-id "<webapp-eastus-id>" \
--priority 1

# Add DR endpoint (West US 2)
az network traffic-manager endpoint create \
--resource-group rg-global \
--profile-name tm-webapp \
--name dr-westus2 \
--type azureEndpoints \
--target-resource-id "<webapp-westus2-id>" \
--priority 2

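With a TTL of 60 seconds, the switch is fast but not instant: Traffic Manager must first see enough consecutive health probe failures to mark the primary endpoint as degraded, and clients must then let their cached DNS answer expire. With the default probe settings (30-second interval, 3 tolerated failures), expect new DNS lookups to resolve to the secondary endpoint roughly 2 to 3 minutes after the primary fails. Users are redirected without any client-side change, although connections already open to the primary are dropped.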
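A common rule of thumb for DNS-based failover time is the DNS TTL plus the number of tolerated probe failures multiplied by the probing interval. The sketch below applies that approximation. The 60-second TTL comes from the profile above, while the 3 tolerated failures and 30-second interval are Traffic Manager defaults assumed here because the profile does not override them. The function name `tm_failover_seconds` is illustrative, and the result ignores client-side and resolver caching beyond the TTL.

```shell
#!/usr/bin/env bash
# Rough estimate: DNS TTL + tolerated failures x probing interval (all in seconds).
tm_failover_seconds() {
  local ttl=$1 tolerated_failures=$2 probe_interval=$3
  echo $(( ttl + tolerated_failures * probe_interval ))
}

# TTL 60 s, 3 tolerated failures, 30 s probing interval
tm_failover_seconds 60 3 30   # prints 150
```

Lowering the TTL or the probing interval shortens this window, at the price of more DNS queries and more probe traffic.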

Cost Planning

DR is an insurance policy. Budget accordingly:

| Component | Monthly Cost (approx.) |
| --- | --- |
| ASR license per VM | ~$25/VM |
| Replicated managed disks (Standard HDD) | ~$0.05/GB |
| Recovery Services vault | Free (pay for storage) |
| Compute in DR region (only during failover) | Same as primary |
| GRS/GZRS storage premium over LRS | ~2x LRS cost |
| Traffic Manager profile | ~$0.75/million queries |
| Test failover compute (during DR drills) | Hours of VM cost |

For a 10-VM production stack with 500 GB of disks, expect roughly $300-400/month for DR readiness — far less than the cost of unplanned downtime.
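The base of that estimate can be reproduced from the two per-unit figures in the table. This is a rough sketch using the approximate prices quoted above (about $25 per VM and $0.05 per GB), not current Azure list prices; the function name `dr_monthly_cost` is illustrative.

```shell
#!/usr/bin/env bash
# Approximate monthly DR-readiness cost: ASR fee per VM plus replicated disk storage.
# Prices are the article's approximations, hard-coded for illustration.
dr_monthly_cost() {
  local vm_count=$1 disk_gb=$2
  awk -v vms="$vm_count" -v gb="$disk_gb" \
    'BEGIN { printf "%.2f\n", vms * 25 + gb * 0.05 }'
}

dr_monthly_cost 10 500   # prints 275.00
```

The gap between this $275 base and the $300-400 range is where the geo-redundant storage premium, Traffic Manager queries, and quarterly test failover compute land.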

Wrapping Up

Disaster recovery is not something you set up once and forget. It is a practice. Configure ASR replication for your critical VMs. Build recovery plans with proper boot ordering and automation runbooks. Run test failovers every quarter and measure your actual RTO against your SLA. Use geo-redundant storage for data and Traffic Manager for DNS failover. The organizations that recover from outages in minutes are the ones that rehearsed for it in advance.


Next up: We will deep dive into Azure RBAC — roles, permissions, custom role definitions, Conditional Access policies, and Privileged Identity Management for securing who can do what in your Azure environment.