SOA-C03 Reliability & DR: AWS Backup, RTO/RPO, Multi-Region Patterns

Step 2 of 5 · updated 2026


Step 2: Reliability & Business Continuity

Ask any experienced operations engineer what actually gets tested during an outage, and it’s never the architecture diagram. It’s whether the backup you’ve been trusting for eight months actually restores cleanly, and whether your team can hit the recovery time you promised the business. This step is about the operational discipline behind resilience, not just the AWS services that enable it.


RTO and RPO: The Two Numbers That Drive Every Decision

Before picking a DR pattern, you need two numbers from the business, not from engineering:

                   Disaster strikes
                          │
       ◄──────────────────┼──────────────────►
           RPO window            RTO window
           (data loss            (time to
            tolerance)            recover)
       │                                     │
  Last backup                        Systems back up
  before disaster                    and serving traffic

An RPO of 15 minutes means your last backup or replication checkpoint can be at most 15 minutes old when disaster hits; anything written after that checkpoint is lost. An RTO of 1 hour means you have 60 minutes from the moment of failure to full recovery. These two numbers, not personal preference, are what should point you toward a specific DR pattern, because tighter numbers cost more money and add operational complexity.
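The two definitions reduce to simple time arithmetic. The sketch below is illustrative only: the function names and the timestamps are assumptions made up for this example, not part of any AWS API.

```python
from datetime import datetime, timedelta, timezone


def rpo_met(last_recovery_point: datetime, disaster_at: datetime, rpo: timedelta) -> bool:
    """True if the data-loss window (disaster minus last recovery point) fits inside the RPO."""
    return disaster_at - last_recovery_point <= rpo


def rto_met(disaster_at: datetime, service_restored_at: datetime, rto: timedelta) -> bool:
    """True if the outage (restoration minus disaster) fits inside the RTO."""
    return service_restored_at - disaster_at <= rto


# Hypothetical incident timeline.
disaster = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
last_backup = disaster - timedelta(minutes=20)   # newest recovery point is 20 minutes old
restored = disaster + timedelta(minutes=45)      # traffic is served again after 45 minutes

rpo_ok = rpo_met(last_backup, disaster, timedelta(minutes=15))  # 20 min of lost data vs 15 min RPO
rto_ok = rto_met(disaster, restored, timedelta(hours=1))        # 45 min outage vs 60 min RTO
print(f"RPO met: {rpo_ok}, RTO met: {rto_ok}")
```

In this made-up timeline the team recovers within the RTO but misses the RPO, which is the kind of result that argues for more frequent backups or replication rather than faster failover.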


The Four Classic DR Patterns

These come up constantly, and the exam expects you to match a scenario’s RTO/RPO requirements to the cheapest pattern that satisfies them, not the fanciest one.

| Pattern | RTO | RPO | Cost | What’s actually running in DR |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Nothing; you restore from backups on demand |
| Pilot Light | Tens of minutes | Minutes | $$ | Core DB replicating; everything else off, scaled up on failover |
| Warm Standby | Minutes | Seconds to minutes | $$$ | Scaled-down full stack running, sized up during failover |
| Multi-Site Active/Active | Near-zero | Near-zero | $$$$ | Full production stack live in both regions simultaneously |

Backup & Restore     Pilot Light        Warm Standby       Active/Active
─────────────────    ───────────────    ───────────────    ───────────────
Region A (live)      Region A (live)    Region A (live)    Region A (live)
Region B: backups    Region B: DB       Region B: small    Region B (live)
only                 replica only       full stack         full stack
                     (rest is off)      (scaled down)

Backup & Restore is the cheapest and slowest, appropriate when a business can genuinely tolerate hours of downtime, which is more common than architects like to admit. Pilot Light keeps your data tier warm and ready but your compute tier cold, so failover means scaling up infrastructure that already has current data. Warm Standby runs a smaller version of the full stack continuously, so failover is a scaling event rather than a build-from-scratch event. Multi-Site Active/Active runs full capacity in more than one region simultaneously. It is the only pattern that gets you near-zero RTO/RPO, and the only one where you’re paying for redundant peak capacity around the clock.

The trap candidates fall into: assuming “more expensive” is always “more correct.” If the scenario says the workload tolerates a few hours of downtime, Warm Standby is over-engineering, and on a cost-optimization question it is the wrong answer.
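The "cheapest pattern that satisfies the requirement" rule can be sketched as a filter followed by a minimum. The numeric RTO/RPO figures below are assumptions chosen only to make the qualitative labels (hours, tens of minutes, and so on) comparable; real values depend entirely on the workload.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrPattern:
    name: str
    relative_cost: int          # 1 = "$" through 4 = "$$$$"
    achievable_rto: timedelta   # assumed illustrative figure, not an AWS guarantee
    achievable_rpo: timedelta   # assumed illustrative figure, not an AWS guarantee


PATTERNS = [
    DrPattern("Backup & Restore", 1, timedelta(hours=4), timedelta(hours=4)),
    DrPattern("Pilot Light", 2, timedelta(minutes=30), timedelta(minutes=5)),
    DrPattern("Warm Standby", 3, timedelta(minutes=5), timedelta(minutes=1)),
    DrPattern("Multi-Site Active/Active", 4, timedelta(seconds=30), timedelta(seconds=5)),
]


def cheapest_pattern(required_rto: timedelta, required_rpo: timedelta) -> DrPattern:
    """Return the lowest-cost pattern whose assumed RTO and RPO both meet the requirement."""
    candidates = [
        p for p in PATTERNS
        if p.achievable_rto <= required_rto and p.achievable_rpo <= required_rpo
    ]
    if not candidates:
        raise ValueError("no pattern in this table meets the stated RTO/RPO")
    return min(candidates, key=lambda p: p.relative_cost)


# A workload that tolerates several hours of downtime and data loss.
choice = cheapest_pattern(timedelta(hours=6), timedelta(hours=6))
print(choice.name)
```

With a six-hour tolerance, every pattern qualifies, so the selection falls to cost alone; Warm Standby would satisfy the requirement too, but it is not the cheapest option that does.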


AWS Backup: Centralizing What Used to Be Fragmented

Before AWS Backup existed, ops teams cobbled together EBS snapshot lifecycle rules, RDS automated backups, DynamoDB point-in-time recovery, and separate EFS backup jobs, each with its own schedule, retention, and monitoring. AWS Backup unifies this into one policy-driven service across EBS, RDS, DynamoDB, EFS, FSx, Storage Gateway, and EC2 as a whole.

The core building blocks:

Backup Plan: "daily-prod"
├── Schedule: daily @ 03:00 UTC
├── Lifecycle: move to cold storage after 30 days, expire after 365
├── Resource selection: tag Environment=prod
└── Copy actions:
    ├── → Backup Vault (us-west-2) [region DR]
    └── → Backup Vault (backup account) [ransomware isolation]
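As a rough sketch, the plan above maps onto the request shapes that boto3's AWS Backup client accepts for `create_backup_plan` and `create_backup_selection`. The vault names, rule name, selection name, account IDs, and role ARN are placeholders invented for illustration, and the code only builds the dictionaries; it does not call AWS.

```python
def build_backup_plan(dr_vault_arn: str, isolated_vault_arn: str) -> dict:
    """Build a BackupPlan document mirroring the "daily-prod" plan."""
    lifecycle = {"MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365}
    # AWS Backup keeps cold-storage recovery points for at least 90 days,
    # so expiry must be at least 90 days after the cold-storage transition.
    if lifecycle["DeleteAfterDays"] < lifecycle["MoveToColdStorageAfterDays"] + 90:
        raise ValueError("DeleteAfterDays must be >= MoveToColdStorageAfterDays + 90")
    return {
        "BackupPlanName": "daily-prod",
        "Rules": [
            {
                "RuleName": "daily-0300-utc",            # hypothetical rule name
                "TargetBackupVaultName": "prod-vault",   # hypothetical source vault
                "ScheduleExpression": "cron(0 3 * * ? *)",  # daily at 03:00 UTC
                "Lifecycle": lifecycle,
                "CopyActions": [
                    # Cross-Region copy for regional DR.
                    {"DestinationBackupVaultArn": dr_vault_arn, "Lifecycle": dict(lifecycle)},
                    # Cross-account copy for ransomware isolation.
                    {"DestinationBackupVaultArn": isolated_vault_arn, "Lifecycle": dict(lifecycle)},
                ],
            }
        ],
    }


def build_backup_selection(iam_role_arn: str) -> dict:
    """Select every supported resource tagged Environment=prod."""
    return {
        "SelectionName": "prod-by-tag",  # hypothetical selection name
        "IamRoleArn": iam_role_arn,
        "ListOfTags": [
            {"ConditionType": "STRINGEQUALS", "ConditionKey": "Environment", "ConditionValue": "prod"}
        ],
    }


plan = build_backup_plan(
    "arn:aws:backup:us-west-2:111111111111:backup-vault:dr-vault",        # placeholder ARN
    "arn:aws:backup:us-east-1:222222222222:backup-vault:isolated-vault",  # placeholder ARN
)
selection = build_backup_selection("arn:aws:iam::111111111111:role/backup-role")  # placeholder ARN

# With credentials configured, these would be passed along roughly as:
#   client = boto3.client("backup")
#   plan_id = client.create_backup_plan(BackupPlan=plan)["BackupPlanId"]
#   client.create_backup_selection(BackupPlanId=plan_id, BackupSelection=selection)
```

Keeping the documents as plain dictionaries makes them easy to review and unit-test before anything touches a real account.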

Vault Lock deserves particular attention operationally. Once a compliance-mode lock takes effect (after its cooling-off period ends), not even the root user can shorten retention or delete recovery points before they expire. That’s a one-way door, and the exam likes testing whether you understand it can’t be undone.
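A hedged sketch of how the two lock modes differ at the API level: in boto3's `put_backup_vault_lock_configuration`, supplying `ChangeableForDays` (minimum 3) starts a cooling-off period after which the lock becomes immutable, while omitting it leaves a governance-mode lock that sufficiently privileged users can still remove. The helper names, vault name, and retention values below are assumptions for illustration; the code only builds request parameters.

```python
from typing import Optional


def vault_lock_request(
    vault_name: str,
    min_retention_days: int,
    max_retention_days: int,
    changeable_for_days: Optional[int] = None,
) -> dict:
    """Build parameters for put_backup_vault_lock_configuration (no AWS call is made)."""
    if min_retention_days > max_retention_days:
        raise ValueError("min_retention_days cannot exceed max_retention_days")
    if changeable_for_days is not None and changeable_for_days < 3:
        raise ValueError("ChangeableForDays must be at least 3")
    params = {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_retention_days,
        "MaxRetentionDays": max_retention_days,
    }
    if changeable_for_days is not None:
        # Presence of this key is what makes the lock compliance mode.
        params["ChangeableForDays"] = changeable_for_days
    return params


def lock_mode(params: dict) -> str:
    """Classify a request as compliance or governance mode."""
    return "compliance" if "ChangeableForDays" in params else "governance"


governance = vault_lock_request("prod-vault", 30, 365)
compliance = vault_lock_request("prod-vault", 30, 365, changeable_for_days=3)
print(lock_mode(governance), lock_mode(compliance))
```

Rehearsing with the governance-mode request first is a reasonable way to validate retention settings before committing to the irreversible variant.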

Restore testing isn’t optional busywork. AWS Backup includes restore testing automation specifically because untested backups are a liability that only reveals itself during the worst possible moment. Schedule periodic automated restores into an isolated environment and validate them; don’t wait for an actual incident to discover a backup job silently failed for three weeks.
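Restore testing validates that a backup is usable; a cheaper companion check is noticing when successful backups have stopped arriving at all. The sketch below assumes job records have already been fetched (for example from the AWS Backup job listing) and flattened into dictionaries; the `state` and `completed_at` field names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def last_success(jobs: list) -> Optional[datetime]:
    """Return the completion time of the newest COMPLETED job, or None if there is none."""
    completed = [job["completed_at"] for job in jobs if job["state"] == "COMPLETED"]
    return max(completed) if completed else None


def backup_is_stale(jobs: list, now: datetime, max_age: timedelta) -> bool:
    """True if no job has completed successfully within max_age of now."""
    latest = last_success(jobs)
    return latest is None or now - latest > max_age


now = datetime(2026, 2, 1, 6, 0, tzinfo=timezone.utc)
# One good backup three weeks ago, then a failure every day since.
jobs = [{"state": "COMPLETED", "completed_at": now - timedelta(days=22)}]
jobs += [{"state": "FAILED", "completed_at": now - timedelta(days=d)} for d in range(21, 0, -1)]

stale = backup_is_stale(jobs, now, max_age=timedelta(days=2))
print(f"alert: {stale}")
```

A check like this catches the silent three-week failure on day two, but it says nothing about whether the last good recovery point actually restores, so it complements scheduled restore tests rather than replacing them.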


Multi-AZ vs Multi-Region: Two Different Problems

It’s easy to conflate these, but they solve different failure modes:

A production system that’s Multi-AZ but single-Region is a completely reasonable, common posture. Multi-Region is for workloads where a full-region outage (rare, but it happens) or strict compliance geography genuinely requires it. Treat it as an added cost/complexity decision, not a default.


Fault-Tolerant Operational Practices

A few practices that show up repeatedly in both real operations and exam scenarios:


Exam Focus: What Questions Test From This Step