Azure Site Recovery
No matter how well designed an infrastructure is, disasters happen: data center fires, major power outages, regional floods, or catastrophic hardware failures can take down an entire location. Disaster Recovery (DR) is the plan for bringing systems back online after such events. Azure Site Recovery (ASR) is a fully managed disaster recovery service that replicates workloads from a primary location to a secondary location and enables rapid failover when disaster strikes.
Two Key Concepts: RTO and RPO
Every disaster recovery plan is defined by two measurements:
- RTO (Recovery Time Objective): The maximum acceptable time for systems to be restored after a disaster. "Our RTO is 4 hours" means the business can survive at most 4 hours of downtime.
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss, measured in time. "Our RPO is 1 hour" means the business can tolerate losing at most 1 hour of data.
RTO and RPO Diagram
Time ──────────────────────────────────────────────────────────────►

Last backup /                     Disaster                   Systems
Replication point  ◄─── RPO ───►  Occurs     ◄─── RTO ───►   Back Online
(Data we don't lose)              (Outage starts)            (Recovery complete)

Example:
9:00 AM: Last replication completed
10:00 AM: Disaster occurs
RPO = 1 hour (data from 9–10 AM is lost)
RTO = 2 hours → Systems back online by 12:00 PM
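The arithmetic in the example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `assess_recovery` and the calendar date are assumptions made for the example, not part of any Azure tooling.

```python
from datetime import datetime, timedelta

def assess_recovery(last_replication, disaster, back_online, rpo, rto):
    """Compare the actual data loss and downtime against the RPO and RTO targets."""
    data_loss = disaster - last_replication  # writes made in this window are lost
    downtime = back_online - disaster        # how long the outage lasted
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }

day = datetime(2024, 1, 1)  # arbitrary date; only the times of day matter here
result = assess_recovery(
    last_replication=day.replace(hour=9),
    disaster=day.replace(hour=10),
    back_online=day.replace(hour=12),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=2),
)
print(result["data_loss"], result["rpo_met"])  # 1:00:00 True
print(result["downtime"], result["rto_met"])   # 2:00:00 True
```

Both targets are met exactly in this scenario. If the last replication had completed at 8:30 AM instead, the data loss would be 90 minutes and the 1-hour RPO would be missed.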
What Azure Site Recovery Does
ASR continuously replicates virtual machines and physical servers from the primary site to a secondary Azure region (or another location). When a failover is triggered, the replicated workloads are brought up at the secondary site, typically within minutes; the actual time depends on how many machines are involved and on the steps in the recovery plan.
Supported Scenarios
| From | To | Use Case |
|---|---|---|
| Azure VMs (Primary Region) | Azure VMs (Secondary Region) | DR for Azure-native workloads — most common |
| On-premises VMware VMs | Azure | Move on-premises DR to cloud (no secondary data center needed) |
| On-premises Hyper-V VMs | Azure | DR for Windows Server Hyper-V environments |
| Physical Servers (Windows/Linux) | Azure | DR for physical servers without a secondary physical site |
How Azure Site Recovery Works
For Azure VM Replication
Primary Region: East US               Secondary Region: West US
┌─────────────────────────┐           ┌─────────────────────────┐
│ VM: WebServer-1         │           │ VM: WebServer-1 (copy)  │
│ VM: DBServer-1          │──replicate│ VM: DBServer-1 (copy)   │
│ (Running - serving      │──────────►│ (Replica disks only,    │
│  live traffic)          │           │  no compute billing)    │
└─────────────────────────┘           └─────────────────────────┘
             │ Disaster!
             ▼
Primary becomes unavailable
             │
             ▼ Failover triggered (minutes)
┌─────────────────────────┐
│ VM: WebServer-1 (copy)  │ ← VM is created from replica disks and serves traffic
│ VM: DBServer-1 (copy)   │ ← Resumes from last replication point
└─────────────────────────┘
Replication Process
- The ASR Mobility Service agent (installed on VMs) captures disk writes and sends them to a cache storage account in the source region.
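- Data is transferred from the cache to replica managed disks in the target region (or to a target storage account when the source VM uses unmanaged disks).
- Recovery points are created every few minutes (crash-consistent points every five minutes by default, app-consistent points at the frequency set in the replication policy). These are the snapshots that can be used during failover.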
- Replication is continuous — the secondary site stays synchronized with changes at the primary site.
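To make the role of recovery points concrete, the sketch below generates recovery points at a fixed interval and picks the newest one available when a disaster hits. It is a simplified model, not ASR code: the function names are hypothetical and the five-minute interval is an assumption chosen for illustration.

```python
from datetime import datetime, timedelta

def recovery_points(start, end, interval):
    """Return the timestamps at which recovery points would be created."""
    points = []
    t = start
    while t <= end:
        points.append(t)
        t += interval
    return points

def latest_point_before(points, disaster):
    """Pick the newest recovery point taken at or before the disaster."""
    usable = [p for p in points if p <= disaster]
    if not usable:
        raise ValueError("no recovery point exists before the disaster")
    return max(usable)

start = datetime(2024, 1, 1, 9, 0)
disaster = datetime(2024, 1, 1, 9, 48)
interval = timedelta(minutes=5)  # assumed interval, for illustration only

points = recovery_points(start, disaster, interval)
chosen = latest_point_before(points, disaster)
print(chosen.time(), disaster - chosen)  # 09:45:00 0:03:00
```

The gap between the chosen recovery point and the disaster (three minutes here) is the data that would be lost, which is why a shorter interval between recovery points supports a tighter RPO.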
Recovery Plans
A Recovery Plan defines the ordered sequence for failing over multiple VMs together. It ensures the right VMs start in the right order — for example, the database server must start before the application server, and the application server before the web server.
Example Recovery Plan
Recovery Plan: DR-EcommerceApp
│
├── Group 1 (starts first):
│   └── DBServer-1 (SQL Database VM)
│   Wait: 5 minutes after DB is up
│
├── Group 2 (starts second):
│   ├── AppServer-1 (API Layer VM)
│   └── AppServer-2 (API Layer VM)
│   Wait: 3 minutes
│
└── Group 3 (starts last):
    ├── WebServer-1 (Frontend VM)
    └── WebServer-2 (Frontend VM)
    Update DNS to point to secondary region IP
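The ordering logic of a recovery plan can be modeled as a list of groups that is walked from first to last. The sketch below mirrors the DR-EcommerceApp example; the `Group` class and `failover_steps` function are hypothetical names for illustration and do not correspond to the ASR API.

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    name: str
    vms: list
    post_actions: list = field(default_factory=list)  # steps run after the group's VMs start

def failover_steps(groups):
    """Flatten the plan into the ordered list of steps a failover would go through."""
    steps = []
    for group in groups:  # groups are processed strictly in list order
        for vm in group.vms:  # listed one by one; parallel startup is not modeled here
            steps.append(f"start {vm}")
        steps.extend(group.post_actions)
    return steps

plan = [
    Group("Group 1", ["DBServer-1"], ["wait 5 minutes"]),
    Group("Group 2", ["AppServer-1", "AppServer-2"], ["wait 3 minutes"]),
    Group("Group 3", ["WebServer-1", "WebServer-2"], ["update DNS to secondary region IP"]),
]

for step in failover_steps(plan):
    print(step)
```

Because a later group only begins after every step of the earlier group has been emitted, the database is always started (and its wait completed) before any application server, and the DNS update comes last.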
Types of Failover
| Type | Description | When Used |
|---|---|---|
| Test Failover | Starts VMs in the secondary region in an isolated network — primary site continues running normally. Used to verify the DR plan works without any production impact. | Regular DR drills (recommended quarterly) |
| Planned Failover | Cleanly migrates to the secondary region with zero data loss. Primary site is shut down gracefully before failover. Full synchronization happens before the switch. | Planned maintenance, regional migration |
| Unplanned Failover | Immediately starts VMs at the secondary region using the latest available recovery point. Some data loss may occur (limited to the time since the last recovery point). | Actual disaster — primary site is down |
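The choice between the three failover types comes down to two questions: is this a drill, and is the primary site still reachable? A minimal sketch of that decision, using hypothetical names (`FailoverType`, `choose_failover`) that are not part of any Azure SDK:

```python
from enum import Enum

class FailoverType(Enum):
    TEST = "Test Failover"
    PLANNED = "Planned Failover"
    UNPLANNED = "Unplanned Failover"

def choose_failover(is_drill, primary_available):
    """Map the situation to the failover type described in the table above."""
    if is_drill:
        return FailoverType.TEST       # isolated network, production untouched
    if primary_available:
        return FailoverType.PLANNED    # graceful shutdown and full sync first
    return FailoverType.UNPLANNED      # primary is down, use latest recovery point

print(choose_failover(is_drill=True, primary_available=True).value)    # Test Failover
print(choose_failover(is_drill=False, primary_available=True).value)   # Planned Failover
print(choose_failover(is_drill=False, primary_available=False).value)  # Unplanned Failover
```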
Failback
After a disaster is resolved and the primary site is restored, failback is the process of moving workloads back from the secondary site to the primary site. ASR supports failback to both Azure and on-premises environments. The failed-over machines are first reprotected so that the secondary site replicates changes back to the primary; a planned failover then brings traffic back to the original region.
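The failover and failback cycle can be viewed as a small state machine in which each action is valid only from one particular state. The sketch below is an illustrative model; the state and action names are assumptions made for this example rather than ASR terminology.

```python
# (current state, action) -> next state
TRANSITIONS = {
    ("primary_active", "failover"): "secondary_active",
    ("secondary_active", "replicate_back"): "secondary_replicating_to_primary",
    ("secondary_replicating_to_primary", "planned_failover"): "primary_active",
}

def next_state(state, action):
    """Return the state reached by applying an action, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} while {state}") from None

state = "primary_active"
for action in ["failover", "replicate_back", "planned_failover"]:
    state = next_state(state, action)
    print(action, "->", state)
```

The model makes one constraint explicit: the planned failover back to the primary is only possible once changes made at the secondary site are replicating back, so skipping that step raises an error.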
ASR vs Azure Backup
| Feature | Azure Site Recovery | Azure Backup |
|---|---|---|
| Purpose | Continuous replication for fast disaster recovery | Point-in-time backups for data protection |
| RPO | Minutes (continuous replication) | Hours to days (backup frequency) |
| RTO | Minutes (VMs boot from replica) | Hours (restore from backup) |
| Cost | Higher (secondary site resources) | Lower (compressed backup storage only) |
| Best For | Mission-critical systems requiring minimal downtime | Data protection, accidental deletion, ransomware recovery |
Key Takeaways
- Azure Site Recovery continuously replicates VMs and servers to a secondary location for fast disaster recovery.
- RTO is the maximum acceptable downtime; RPO is the maximum acceptable data loss — both guide DR design.
- Recovery Plans define the ordered startup sequence for multiple VMs, ensuring dependencies are respected during failover.
- Test Failover validates the DR plan with no production impact; Unplanned Failover activates immediately during a real disaster.
- After recovery, Failback moves workloads back to the primary site once it is restored.
