Azure Site Recovery — DR Strategy for Production Workloads
It is 2 AM. Your primary Azure region is experiencing a major outage. Your CEO is on Slack asking when the website will be back. Your answer depends entirely on whether you set up disaster recovery last quarter or kept pushing it to "next sprint." Azure Site Recovery makes DR achievable without maintaining a fully hot standby: you replicate, you test, and when disaster strikes, you fail over with confidence.
DR Fundamentals — RTO and RPO
Every disaster recovery strategy starts with two numbers:
- RTO (Recovery Time Objective): How long can you be down? If your RTO is 1 hour, your system must be operational within 60 minutes of a failure.
- RPO (Recovery Point Objective): How much data can you lose? If your RPO is 15 minutes, you need replicated data that is no more than 15 minutes old at the point of failure.
| Scenario | Typical RTO | Typical RPO | Strategy |
|---|---|---|---|
| Static website | 4+ hours | 24 hours | Backup + Redeploy |
| Internal business app | 1-4 hours | 1 hour | ASR replication |
| Customer-facing web app | 15-60 min | 5-15 min | ASR + Traffic Manager |
| Financial transaction system | < 5 min | 0 (zero data loss) | Active-Active + SQL Always On |
| E-commerce platform | 15-30 min | 5 min | ASR + SQL geo-replication + CDN |
Your RTO and RPO dictate your cost. Zero RPO requires synchronous replication, which means running compute in two regions simultaneously. A 1-hour RPO with ASR costs a fraction of that.
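To make the RPO definition concrete, here is a minimal shell sketch that compares the age of the newest recovery point with an RPO target. The function name `rpo_met` and the sample timestamps are illustrative assumptions, not part of any Azure tooling; in practice the recovery point time would come from your vault or replication monitoring.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the newest recovery point is within the RPO.
# Arguments: <failure_epoch_seconds> <last_recovery_point_epoch_seconds> <rpo_minutes>
rpo_met() {
  local failure_ts=$1 recovery_ts=$2 rpo_minutes=$3
  local age_seconds=$(( failure_ts - recovery_ts ))
  (( age_seconds <= rpo_minutes * 60 ))
}

# Sample values (assumed): the last recovery point is 600 s older than the failure.
if rpo_met 10000 9400 15; then
  echo "RPO met: data is 10 minutes old, target is 15 minutes"
fi
```

The same check with a recovery point 1000 seconds old would fail a 15-minute RPO, which is the signal to tighten replication or revisit the target.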
Azure Site Recovery (ASR) for VMs
ASR continuously replicates your Azure VMs to a secondary region. The replication is asynchronous, typically achieving an RPO of 30 seconds to a few minutes. You pay for the replicated storage and ASR licensing, but you do not pay for compute in the target region until you actually fail over.
# Create a Recovery Services vault in the target region
az backup vault create \
--resource-group rg-dr-westus \
--name rsv-dr-westus \
--location westus2
# Enable replication for an Azure VM (East US → West US 2)
# Command and flag names follow the site-recovery CLI extension; confirm the payload
# shape (for example, per-disk settings) with: az site-recovery protected-item create --help
az site-recovery protected-item create \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--fabric-name "azure-eastus" \
--protection-container "asr-a2a-default-eastus" \
--name "vm-webapp-01-repl" \
--policy-id "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.RecoveryServices/vaults/rsv-dr-westus/replicationPolicies/24-hour-retention" \
--provider-details '{
"a2a": {
"fabric-object-id": "/subscriptions/<sub-id>/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachines/vm-webapp-01",
"recovery-container-id": "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.RecoveryServices/vaults/rsv-dr-westus/replicationFabrics/azure-westus2/replicationProtectionContainers/asr-a2a-default-westus2",
"recovery-resource-group-id": "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus"
}
}'
What gets replicated:
- OS disk and all data disks
- VM configuration (size, networking, extensions)
- Managed identity assignments
- Tags and metadata
What does NOT get replicated (you must configure separately):
- NSG rules, public IPs, load balancer configuration
- Azure Firewall rules, DNS records
- Application-level configuration that references region-specific endpoints
Replication from On-Premises to Azure
ASR also supports replicating on-premises VMware VMs, Hyper-V VMs, and physical servers to Azure. This is the migration and DR path for hybrid environments.
# The on-premises workflow:
# 1. Deploy the ASR Configuration Server (VMware) or Hyper-V Site
# 2. Install the Mobility Service agent on source VMs
# 3. Configure replication policy (RPO, retention, app-consistent snapshots)
# 4. Enable replication for each VM
# 5. Monitor replication health in the vault
The replication policy controls:
- Recovery point retention: How long to keep recovery points (default 24 hours)
- App-consistent snapshot frequency: How often to create application-consistent snapshots (default 4 hours)
- Replication frequency: How often delta changes are sent (30 seconds or 5 minutes for Hyper-V; replication is continuous for VMware and Azure-to-Azure)
Recovery Plans with Runbooks
A recovery plan orchestrates the failover of multiple VMs in a specific order. You group VMs into tiers — database first, then application, then web — and add automation runbooks between groups.
# Create a recovery plan
az site-recovery recovery-plan create \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
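--primary-fabric-id "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.RecoveryServices/vaults/rsv-dr-westus/replicationFabrics/azure-eastus" \
--recovery-fabric-id "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.RecoveryServices/vaults/rsv-dr-westus/replicationFabrics/azure-westus2" \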
--groups '[
{
"groupType": "Boot",
"replicationProtectedItems": [
{"id": "<sql-vm-replication-id>"}
],
"startGroupActions": [],
"endGroupActions": [
{
"actionName": "wait-for-sql",
"failoverTypes": ["PlannedFailover", "UnplannedFailover"],
"failoverDirections": ["PrimaryToRecovery"],
"customDetails": {
"instanceType": "AutomationRunbookActionDetails",
"runbookId": "<runbook-id>",
"fabricLocation": "Recovery"
}
}
]
},
{
"groupType": "Boot",
"replicationProtectedItems": [
{"id": "<app-vm-1-replication-id>"},
{"id": "<app-vm-2-replication-id>"}
]
},
{
"groupType": "Boot",
"replicationProtectedItems": [
{"id": "<web-vm-1-replication-id>"},
{"id": "<web-vm-2-replication-id>"}
]
}
]'
The runbook between Group 1 (SQL) and Group 2 (App) can verify the database is responsive, update connection strings, or perform any custom logic. This ensures your application tier does not start before the database is ready.
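The gating logic such a runbook performs can be sketched as a retry loop. Azure Automation runbooks are normally written in PowerShell or Python, so treat this shell version as an illustration of the logic only; the helper name `wait_until` and the stand-in probe are assumptions made for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical gate: retry a probe command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <delay_seconds> <probe command...>
wait_until() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      echo "probe succeeded on attempt ${attempt}"
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if (( attempt < max_attempts )); then
      sleep "$delay"
    fi
  done
  echo "probe still failing after ${max_attempts} attempts" >&2
  return 1
}

# Stand-in probe for this sketch; a real gate might test the SQL port instead,
# for example: wait_until 30 10 nc -z <sql-host> 1433
wait_until 3 0 true
```

Returning a non-zero status when the probe never succeeds matters here: the point of the gate is to stop the application tier from starting against a database that is not ready.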
Test Failover
Test failover is the most important feature of ASR — and the most underused. It creates a copy of your replicated VMs in an isolated network without disrupting production replication or your primary site.
# Perform a test failover for the recovery plan
az site-recovery recovery-plan test-failover \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
--failover-direction PrimaryToRecovery \
--network-id "/subscriptions/<sub-id>/resourceGroups/rg-dr-westus/providers/Microsoft.Network/virtualNetworks/vnet-test-failover" \
--network-type VmNetworkAsInput
# After testing, clean up
az site-recovery recovery-plan test-failover-cleanup \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
--comments "Q3 DR test passed — all services operational in 12 minutes"
Run test failovers quarterly at minimum. Document the results. Measure the actual RTO. If your recovery plan takes 45 minutes and your SLA promises 30 minutes, you have a problem to fix before disaster strikes — not during it.
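One way to make "measure the actual RTO" routine is to record timestamps around each drill and compare the elapsed time with the target. The sketch below assumes you capture the start and end times yourself (for example with `date +%s` before the failover and after your health check passes); the function name `check_rto` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical drill check: compare the measured recovery time with the RTO target.
# Arguments: <start_epoch_seconds> <end_epoch_seconds> <rto_target_minutes>
check_rto() {
  local start_ts=$1 end_ts=$2 target_minutes=$3
  # Round up so a recovery of 30 min 1 s counts as 31 minutes.
  local actual_minutes=$(( (end_ts - start_ts + 59) / 60 ))
  if (( actual_minutes <= target_minutes )); then
    echo "PASS: recovered in ${actual_minutes} min (target ${target_minutes} min)"
  else
    echo "FAIL: recovered in ${actual_minutes} min (target ${target_minutes} min)"
    return 1
  fi
}

# The 45-minute drill against a 30-minute SLA from the text:
check_rto 0 2700 30 || echo "Recovery plan needs work before the next drill"
```

Logging the PASS or FAIL line alongside the cleanup comment gives you a drill history you can show to auditors and use to spot regressions.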
Planned vs Unplanned Failover
| Type | When To Use | Data Loss | Source VMs |
|---|---|---|---|
| Test Failover | DR drills, validation | None (isolated copy) | Keep running |
| Planned Failover | Scheduled region migration, maintenance | Zero (waits for replication sync) | Shut down first |
| Unplanned Failover | Region outage, emergency | Possible (up to RPO) | May be inaccessible |
# Planned failover — use when primary region is still accessible
az site-recovery recovery-plan planned-failover \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
--failover-direction PrimaryToRecovery
# Unplanned failover — use during actual disaster
az site-recovery recovery-plan unplanned-failover \
--resource-group rg-dr-westus \
--vault-name rsv-dr-westus \
--name "rp-webapp-full" \
--failover-direction PrimaryToRecovery \
--source-site-operations NotRequired
After failover, commit the failover to finalize it, then re-protect the VMs to start replication in the reverse direction (for failing back later).
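The commit and re-protect steps can be scripted in the same style as the failover commands. This sketch assumes the `site-recovery` extension exposes `failover-commit` and `reprotect` subcommands under `recovery-plan`; confirm with `az site-recovery recovery-plan --help` before relying on it. The `run` and `post_failover` helpers are hypothetical, and the script defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the post-failover steps. DRY_RUN=1 (the default here) only prints
# the commands; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Arguments: <resource_group> <vault_name> <recovery_plan_name>
post_failover() {
  local rg=$1 vault=$2 plan=$3
  # 1. Commit: finalize the failover.
  run az site-recovery recovery-plan failover-commit \
    --resource-group "$rg" --vault-name "$vault" --name "$plan"
  # 2. Re-protect: start replicating in the reverse direction for a later failback.
  run az site-recovery recovery-plan reprotect \
    --resource-group "$rg" --vault-name "$vault" --name "$plan"
}

post_failover rg-dr-westus rsv-dr-westus rp-webapp-full
```

Keeping the dry run as the default is a deliberate choice for DR scripts: the printed commands can be reviewed during a drill without touching the vault.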
Geo-Redundant Storage
Azure Storage offers built-in geo-redundancy:
| Redundancy | Copies | Regions | Read Access in Secondary |
|---|---|---|---|
| LRS | 3 copies | 1 region | No |
| ZRS | 3 copies | 1 region (across zones) | No |
| GRS | 6 copies | 2 regions | No (failover required) |
| GZRS | 6 copies | 2 regions (primary zonal) | No |
| RA-GRS | 6 copies | 2 regions | Yes (read-only) |
| RA-GZRS | 6 copies | 2 regions (primary zonal) | Yes (read-only) |
# Create a storage account with read-access geo-zone-redundant storage (RA-GZRS)
az storage account create \
--name stprodgeo2025 \
--resource-group rg-prod \
--location eastus \
--sku Standard_RAGZRS \
--kind StorageV2
# Check replication status
az storage account show \
--name stprodgeo2025 \
--query "{Primary:primaryLocation, Secondary:secondaryLocation, Status:statusOfSecondary}" \
--output table
RA-GZRS is the gold standard for critical data — zone-redundant in the primary region, geo-replicated to a secondary region, and readable in both locations.
Azure Backup vs ASR
These serve different purposes:
| Feature | Azure Backup | Azure Site Recovery |
|---|---|---|
| Purpose | Point-in-time restore | Continuous replication + failover |
| RTO | Hours (restore from backup) | Minutes (pre-replicated VMs) |
| RPO | Last backup (daily/hourly) | Seconds to minutes |
| Use case | Accidental deletion, data corruption | Region outage, disaster |
| Cost | Storage cost for backups | Replication + licensing per VM |
| Target | Recovery Services vault, usually in the same region | Secondary region |
Use both together. Azure Backup protects against "oops I deleted the database" scenarios. ASR protects against "the entire East US region is down" scenarios.
SQL Always On for Database DR
For databases with zero or near-zero RPO requirements, SQL Always On availability groups replicate transactions synchronously (within the same region) or asynchronously (across regions).
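-- Add an asynchronous secondary replica in another region to an existing availability group
-- ENDPOINT_URL must be the FQDN of the replica's SQL Server VM; database.windows.net belongs to Azure SQL Database (PaaS)
ALTER AVAILABILITY GROUP [ag-prod-db]
ADD REPLICA ON 'sql-dr-westus2'
WITH (
ENDPOINT_URL = 'TCP://sql-dr-westus2.<your-domain>:5022',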
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
);
For Azure SQL Database (PaaS), use active geo-replication or auto-failover groups instead:
# Create a failover group for Azure SQL Database
az sql failover-group create \
--name fog-prod-db \
--resource-group rg-prod \
--server sql-prod-eastus \
--partner-server sql-dr-westus2 \
--partner-resource-group rg-dr-westus \
--add-db proddb \
--failover-policy Automatic \
--grace-period 1
Traffic Manager for DNS Failover
Traffic Manager routes users to healthy endpoints using DNS. In a DR scenario, it automatically redirects traffic from your failed primary region to the secondary.
# Create a Traffic Manager profile with priority routing
az network traffic-manager profile create \
--resource-group rg-global \
--name tm-webapp \
--routing-method Priority \
--unique-dns-name goel-webapp \
--ttl 60 \
--protocol HTTPS \
--port 443 \
--path "/health"
# Add primary endpoint (East US)
az network traffic-manager endpoint create \
--resource-group rg-global \
--profile-name tm-webapp \
--name primary-eastus \
--type azureEndpoints \
--target-resource-id "<webapp-eastus-id>" \
--priority 1
# Add DR endpoint (West US 2)
az network traffic-manager endpoint create \
--resource-group rg-global \
--profile-name tm-webapp \
--name dr-westus2 \
--type azureEndpoints \
--target-resource-id "<webapp-westus2-id>" \
--priority 2
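Once health probes mark the primary endpoint as degraded, Traffic Manager starts returning the secondary endpoint in its DNS responses. Clients pick up the change as their cached records expire, so with a TTL of 60 seconds most users are redirected within 1-2 minutes of detection. Detection itself takes additional time that depends on the probe interval and the number of tolerated failures. No client-side change is needed, although connections that were open to the primary will fail and must be retried.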
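For planning purposes, the end-to-end DNS failover time can be approximated as detection time plus TTL. The sketch below is a rough worst-case estimate, assuming an endpoint is marked unhealthy after one more consecutive failed probe than the tolerated number of failures, and ignoring the per-probe timeout. The 30-second interval and 3 tolerated failures are assumed monitoring settings (the profile command above does not set them), and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Rough worst-case estimate of DNS failover time, in seconds.
# Assumes the endpoint is marked unhealthy after (tolerated_failures + 1)
# consecutive failed probes and ignores the per-probe timeout.
# Arguments: <dns_ttl_seconds> <probe_interval_seconds> <tolerated_failures>
estimate_failover_seconds() {
  local ttl=$1 interval=$2 tolerated=$3
  echo $(( (tolerated + 1) * interval + ttl ))
}

# TTL 60 s from the profile above; 30 s interval and 3 tolerated failures assumed.
echo "Estimated failover time: $(estimate_failover_seconds 60 30 3) s"
# Faster probing (10 s interval, 1 tolerated failure) shortens detection.
echo "With fast probing: $(estimate_failover_seconds 60 10 1) s"
```

If the estimate does not fit inside your RTO, the levers are a shorter TTL, a shorter probe interval, or fewer tolerated failures, each at the cost of more DNS queries or more sensitivity to transient errors.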
Cost Planning
DR is an insurance policy. Budget accordingly:
| Component | Monthly Cost (approx.) |
|---|---|
| ASR license per VM | ~$25/VM |
| Replicated managed disks (Standard HDD) | ~$0.05/GB |
| Recovery Services vault | Free (pay for storage) |
| Compute in DR region (only during failover) | Same as primary |
| GRS/GZRS storage premium over LRS | ~2x LRS cost |
| Traffic Manager profile | ~$0.75/million queries |
| Test failover compute (during DR drills) | Hours of VM cost |
For a 10-VM production stack with 500 GB of disks, expect roughly $300-400/month for DR readiness — far less than the cost of unplanned downtime.
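The core of that estimate can be reproduced with the approximate rates from the table. This is a back-of-the-envelope sketch, not a pricing calculator: the rates are the rounded figures above, the function name is hypothetical, and actual prices vary by region and over time.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope DR cost using the approximate rates from the table above.
# Work in cents to stay within shell integer arithmetic.
# Arguments: <vm_count> <replicated_disk_gb>
dr_baseline_cents() {
  local vm_count=$1 disk_gb=$2
  local asr_cents_per_vm=2500   # ~$25 per protected VM per month
  local disk_cents_per_gb=5     # ~$0.05 per GB per month (Standard HDD)
  echo $(( vm_count * asr_cents_per_vm + disk_gb * disk_cents_per_gb ))
}

cents=$(dr_baseline_cents 10 500)
printf 'ASR + replicated disks: $%d.%02d/month\n' $(( cents / 100 )) $(( cents % 100 ))
```

For 10 VMs and 500 GB this comes to $275 per month for licensing and replicated disks alone; the rest of the $300-400 range leaves room for the geo-redundant storage premium, Traffic Manager queries, and compute used during DR drills.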
Wrapping Up
Disaster recovery is not something you set up once and forget. It is a practice. Configure ASR replication for your critical VMs. Build recovery plans with proper boot ordering and automation runbooks. Run test failovers every quarter and measure your actual RTO against your SLA. Use geo-redundant storage for data and Traffic Manager for DNS failover. The organizations that recover from outages in minutes are the ones that rehearsed in advance.
Next up: We will take a deep dive into Azure RBAC: roles, permissions, custom role definitions, Conditional Access policies, and Privileged Identity Management for securing who can do what in your Azure environment.
