Azure Site Recovery

No matter how well designed an infrastructure is, disasters happen: data center fires, major power outages, regional floods, or catastrophic hardware failures can take down an entire location. Disaster recovery (DR) is the plan for bringing systems back online after such events. Azure Site Recovery (ASR) is a fully managed disaster recovery service that replicates workloads from a primary location to a secondary location and enables rapid failover when disaster strikes.

Two Key Concepts: RTO and RPO

Every disaster recovery plan is defined by two targets:

RTO (Recovery Time Objective): The maximum acceptable time for systems to be restored after a disaster. "Our RTO is 4 hours" means the business can survive at most 4 hours of downtime.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time. "Our RPO is 1 hour" means the business can tolerate losing at most 1 hour of data.

RTO and RPO Diagram

  Time ──────────────────────────────────────────────────────►

  Last backup / ◄─── RPO ──────► Disaster   ◄──── RTO ─────► Systems
  Replication point              Occurs               Back Online
  (Data we don't lose)          (Outage starts)      (Recovery complete)

  Example:
  9:00 AM — Last replication completed
  10:00 AM — Disaster occurs
  Data loss = 1 hour (writes from 9:00 to 10:00 AM are lost), which meets an RPO of 1 hour
  Downtime = 2 hours (systems back online at 12:00 PM), which meets an RTO of 2 hours
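
The arithmetic in this example can be checked with a short Python sketch. The function name evaluate_recovery and the calendar date are assumptions made for illustration; only the clock times come from the example above.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_replication, disaster, back_online, rpo, rto):
    """Compare the actual data loss and downtime of an incident with its targets."""
    data_loss = disaster - last_replication   # writes made after the last replication
    downtime = back_online - disaster         # how long systems were unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


day = datetime(2024, 1, 1)  # arbitrary date (assumption); only the times matter
result = evaluate_recovery(
    last_replication=day.replace(hour=9),
    disaster=day.replace(hour=10),
    back_online=day.replace(hour=12),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=2),
)
print(result)
```

Both targets are met exactly here. If the last replication had finished any earlier than 9:00 AM, the same incident would have exceeded the 1-hour RPO.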

What Azure Site Recovery Does

ASR continuously replicates virtual machines, physical servers, and workloads from the primary site to a secondary Azure region (or another location). When a failover is triggered, the replicated workloads start at the secondary site within minutes.

Supported Scenarios

From	To	Use Case
Azure VMs (Primary Region)	Azure VMs (Secondary Region)	DR for Azure-native workloads — most common
On-premises VMware VMs	Azure	Move on-premises DR to cloud (no secondary data center needed)
On-premises Hyper-V VMs	Azure	DR for Windows Server Hyper-V environments
Physical Servers (Windows/Linux)	Azure	DR for physical servers without a secondary physical site

How Azure Site Recovery Works

For Azure VM Replication

  Primary Region: East US                Secondary Region: West US
  ┌─────────────────────────┐           ┌─────────────────────────┐
  │  VM: WebServer-1        │           │  VM: WebServer-1 (copy) │
  │  VM: DBServer-1         │──replicate│  VM: DBServer-1 (copy)  │
  │  (Running - serving     │──────────►│  (Stopped - standby,    │
  │   live traffic)         │           │   no compute billing)   │
  └─────────────────────────┘           └─────────────────────────┘
              │ Disaster!
              ▼
  Primary becomes unavailable
              │
              ▼ Failover triggered (minutes)
  ┌─────────────────────────┐
  │  VM: WebServer-1 (copy) │ ← Now starts and serves traffic
  │  VM: DBServer-1 (copy)  │ ← Resumes from last replication point
  └─────────────────────────┘

Replication Process

The ASR Mobility Service agent (installed on VMs) captures disk writes and sends them to a cache storage account in the source region.
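Data is transferred from the cache to replica managed disks (or to a replica storage account for unmanaged disks) in the target region.
Crash-consistent recovery points are created every five minutes by default, and app-consistent points are added at the frequency set in the replication policy. These are the snapshots that can be used during failover.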
Replication is continuous — the secondary site stays synchronized with changes at the primary site.
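
To see how recovery points bound data loss, the following Python sketch models them as minute offsets and picks the latest one taken before a disaster. It is a toy model and does not call any Azure API; the function names and the 5-minute interval are assumptions for illustration.

```python
from bisect import bisect_right


def recovery_points(start_min, end_min, interval_min=5):
    """Minute offsets at which a recovery point exists (interval is an assumption)."""
    return list(range(start_min, end_min + 1, interval_min))


def latest_point_before(points, disaster_min):
    """Return the most recent recovery point taken at or before the disaster."""
    idx = bisect_right(points, disaster_min)
    if idx == 0:
        raise ValueError("no recovery point exists before the disaster")
    return points[idx - 1]


points = recovery_points(0, 60, interval_min=5)
disaster_min = 43
chosen = latest_point_before(points, disaster_min)
print(f"fail over to minute {chosen}, losing {disaster_min - chosen} minutes of writes")
```

With points every 5 minutes, the loss in this model is always below 5 minutes, which is why a short recovery point interval translates into a low achievable RPO.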

Recovery Plans

A Recovery Plan defines the ordered sequence for failing over multiple VMs together. It ensures the right VMs start in the right order — for example, the database server must start before the application server, and the application server before the web server.

Example Recovery Plan

  Recovery Plan: DR-EcommerceApp
  │
  ├── Group 1 (starts first):
  │   └── DBServer-1 (SQL Database VM)
  │       Wait: 5 minutes after DB is up
  │
  ├── Group 2 (starts second):
  │   └── AppServer-1 (API Layer VM)
  │   └── AppServer-2 (API Layer VM)
  │       Wait: 3 minutes
  │
  └── Group 3 (starts last):
      └── WebServer-1 (Frontend VM)
      └── WebServer-2 (Frontend VM)
          Update DNS to point to secondary region IP
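
The ordering logic of this plan can be sketched in Python as a list of groups processed in sequence. This is an illustrative model, not the ASR API; the Group class, its fields, and failover_sequence are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One boot group in a recovery plan (hypothetical structure)."""
    name: str
    vms: list
    wait_minutes: int = 0
    post_actions: list = field(default_factory=list)


def failover_sequence(groups):
    """Flatten the plan into the ordered steps a failover would run."""
    steps = []
    for group in groups:
        for vm in group.vms:
            steps.append(f"start {vm}")
        if group.wait_minutes:
            steps.append(f"wait {group.wait_minutes} min")
        steps.extend(group.post_actions)
    return steps


plan = [
    Group("Group 1", ["DBServer-1"], wait_minutes=5),
    Group("Group 2", ["AppServer-1", "AppServer-2"], wait_minutes=3),
    Group("Group 3", ["WebServer-1", "WebServer-2"],
          post_actions=["update DNS to secondary region IP"]),
]
steps = failover_sequence(plan)
for step in steps:
    print(step)
```

Because groups are processed strictly in list order, no application server starts before the database wait has elapsed, and the DNS update runs only after the web servers have started.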

Types of Failover

Type	Description	When Used
Test Failover	Starts VMs in the secondary region in an isolated network — primary site continues running normally. Used to verify the DR plan works without any production impact.	Regular DR drills (recommended quarterly)
Planned Failover	Cleanly migrates to the secondary region with zero data loss. Primary site is shut down gracefully before failover. Full synchronization happens before the switch.	Planned maintenance, regional migration
Unplanned Failover	Immediately starts VMs at the secondary region using the latest available recovery point. Some data loss may occur (limited to the time since the last recovery point).	Actual disaster — primary site is down
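
The data-loss behavior in the table can be summarized in a few lines of Python. This is a sketch of the rules as described above, with hypothetical names, not a model of how ASR is implemented.

```python
from enum import Enum


class FailoverType(Enum):
    TEST = "test"
    PLANNED = "planned"
    UNPLANNED = "unplanned"


def expected_data_loss_minutes(kind, minutes_since_last_recovery_point):
    """Return the production data loss implied by each failover type in the table."""
    if kind is FailoverType.UNPLANNED:
        # Writes made after the last recovery point cannot be recovered.
        return minutes_since_last_recovery_point
    # Test failover leaves production running; planned failover syncs fully first.
    return 0


losses = {kind.name: expected_data_loss_minutes(kind, 4) for kind in FailoverType}
print(losses)
```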

Failback

After a disaster is resolved and the primary site is restored, failback is the process of moving workloads back from the secondary site to the primary site. ASR supports failback to both Azure and on-premises environments. The workloads running at the secondary site are first reprotected so that their changes replicate back to the primary, and a planned failover then brings traffic back to the original region.
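
The failover and failback cycle can be pictured as a small state machine. The Python sketch below uses state and action names invented for this illustration; they are not ASR terms or API calls.

```python
# Hypothetical state and action names chosen for this illustration.
TRANSITIONS = {
    ("primary_active", "failover"): "secondary_active",
    ("secondary_active", "replicate_back"): "secondary_active_synced_to_primary",
    ("secondary_active_synced_to_primary", "planned_failover"): "primary_active",
}


def next_state(state, action):
    """Return the state reached by an action, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} while in state {state!r}") from None


state = "primary_active"
for action in ["failover", "replicate_back", "planned_failover"]:
    state = next_state(state, action)
    print(f"{action} -> {state}")
```

The table of transitions makes the ordering explicit: a planned failover back to the primary is only allowed once changes made at the secondary have been replicated back.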

ASR vs Azure Backup

Feature	Azure Site Recovery	Azure Backup
Purpose	Continuous replication for fast disaster recovery	Point-in-time backups for data protection
RPO	Minutes (continuous replication)	Hours to days (backup frequency)
RTO	Minutes (VMs boot from replica)	Hours (restore from backup)
Cost	Higher (secondary site resources)	Lower (compressed backup storage only)
Best For	Mission-critical systems requiring minimal downtime	Data protection, accidental deletion, ransomware recovery

Key Takeaways

Azure Site Recovery continuously replicates VMs and servers to a secondary location for fast disaster recovery.
RTO is the maximum acceptable downtime; RPO is the maximum acceptable data loss — both guide DR design.
Recovery Plans define the ordered startup sequence for multiple VMs, ensuring dependencies are respected during failover.
Test Failover validates the DR plan with no production impact; Unplanned Failover activates immediately during a real disaster.
After recovery, Failback moves workloads back to the primary site once it is restored.
