Step 2: Reliability & Business Continuity
Ask any experienced operations engineer what actually gets tested during an outage, and it's never the architecture diagram. It's whether the backup you've been trusting for eight months actually restores cleanly, and whether your team can hit the recovery time you promised the business. This step is about the operational discipline behind resilience, not just the AWS services that enable it.
RTO and RPO: The Two Numbers That Drive Every Decision
Before picking a DR pattern, you need two numbers from the business, not from engineering:
- RTO (Recovery Time Objective): how long can the system be down before it's unacceptable?
- RPO (Recovery Point Objective): how much data can you afford to lose, measured in time?
Timeline: last backup before disaster -> RPO window (data loss tolerance) -> disaster strikes -> RTO window (time to recover) -> systems back up and serving traffic

An RPO of 15 minutes means your last backup or replication checkpoint can be at most 15 minutes old when disaster hits; anything written after that checkpoint is lost. An RTO of 1 hour means you have 60 minutes from the moment of failure to full recovery. These two numbers, not personal preference, are what should point you toward a specific DR pattern, because tighter numbers cost more money and operational complexity.
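To make the arithmetic concrete, here is a minimal sketch that compares an incident timeline against RPO and RTO targets. The timestamps, targets, and function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Data written after the last recovery point is lost; this gap is the achieved RPO."""
    return failure_time - last_recovery_point


def downtime(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Elapsed time from failure to full recovery; this is the achieved RTO."""
    return service_restored - failure_time


# Hypothetical targets from the business, matching the example figures in the text.
rpo_target = timedelta(minutes=15)
rto_target = timedelta(hours=1)

# Hypothetical incident timeline (illustrative values only).
last_backup = datetime(2024, 1, 10, 11, 50, tzinfo=timezone.utc)
failure = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
restored = datetime(2024, 1, 10, 13, 20, tzinfo=timezone.utc)

achieved_rpo = data_loss_window(last_backup, failure)
achieved_rto = downtime(failure, restored)

rpo_met = achieved_rpo <= rpo_target
rto_met = achieved_rto <= rto_target

print(f"RPO: lost {achieved_rpo} of data, target {rpo_target}, met={rpo_met}")
print(f"RTO: down for {achieved_rto}, target {rto_target}, met={rto_met}")
```

In this made-up incident the 10-minute data loss fits inside the 15-minute RPO, but the 80-minute recovery misses the 1-hour RTO. The two objectives are evaluated independently.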
The Four Classic DR Patterns
These come up constantly, and the exam expects you to match a scenario's RTO/RPO requirements to the cheapest pattern that satisfies them, not the fanciest one.
| Pattern | RTO | RPO | Cost | What's actually running in DR |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Nothing; you restore from backups on demand |
| Pilot Light | Tens of minutes | Minutes | $$ | Core DB replicating; everything else off, scaled up on failover |
| Warm Standby | Minutes | Seconds to minutes | $$$ | Scaled-down full stack running, sized up during failover |
| Multi-Site Active/Active | Near-zero | Near-zero | $$$$ | Full production stack live in both regions simultaneously |
| Pattern | Region A | Region B |
|---|---|---|
| Backup & Restore | Live | Backups only |
| Pilot Light | Live | DB replica only (rest is off) |
| Warm Standby | Live | Small full stack (scaled down) |
| Active/Active | Live | Live full stack |

Backup & Restore is the cheapest and slowest. It is appropriate when a business can genuinely tolerate hours of downtime, which is more common than architects like to admit. Pilot Light keeps your data tier warm and ready but your compute tier cold, so failover means scaling up infrastructure that already has current data. Warm Standby runs a smaller version of the full stack continuously, so failover is a scaling event rather than a build-from-scratch event. Multi-Site Active/Active runs full capacity in more than one region simultaneously. It is the only pattern that gets you near-zero RTO/RPO, and the only one where you're paying for redundant peak capacity around the clock.
The trap candidates fall into: assuming "more expensive" is always "more correct." If the scenario says the workload tolerates a few hours of downtime, Warm Standby is over-engineering and, on cost-optimization grounds, the wrong answer.
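The "cheapest sufficient pattern" rule can be sketched as a small selection function. This is only an illustration: the numeric RTO/RPO figures attached to each pattern below are assumptions standing in for the table's qualitative "hours", "tens of minutes", and so on, and `DrPattern` and `cheapest_sufficient_pattern` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DrPattern:
    name: str
    relative_cost: int         # 1 = cheapest ($), 4 = most expensive ($$$$)
    achievable_rto: timedelta  # ASSUMED typical recovery time for this pattern
    achievable_rpo: timedelta  # ASSUMED typical data-loss window for this pattern


# ASSUMPTION: these numbers are illustrative, not AWS-published guarantees.
PATTERNS = [
    DrPattern("Backup & Restore", 1, timedelta(hours=4), timedelta(hours=4)),
    DrPattern("Pilot Light", 2, timedelta(minutes=30), timedelta(minutes=5)),
    DrPattern("Warm Standby", 3, timedelta(minutes=5), timedelta(minutes=1)),
    DrPattern("Multi-Site Active/Active", 4, timedelta(seconds=30), timedelta(seconds=5)),
]


def cheapest_sufficient_pattern(rto: timedelta, rpo: timedelta) -> Optional[DrPattern]:
    """Return the lowest-cost pattern whose assumed RTO and RPO both fit the requirement."""
    candidates = [
        p for p in PATTERNS
        if p.achievable_rto <= rto and p.achievable_rpo <= rpo
    ]
    # min() with default=None covers requirements that no pattern can satisfy.
    return min(candidates, key=lambda p: p.relative_cost, default=None)


# A workload that tolerates several hours of downtime and data loss.
relaxed = cheapest_sufficient_pattern(timedelta(hours=6), timedelta(hours=6))
# A workload that needs recovery within an hour and at most 15 minutes of data loss.
moderate = cheapest_sufficient_pattern(timedelta(hours=1), timedelta(minutes=15))
print(relaxed.name)
print(moderate.name)
```

Under these assumed figures, the relaxed workload lands on Backup & Restore even though every costlier pattern would also satisfy it, which is exactly the exam's point.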
AWS Backup: Centralizing What Used to Be Fragmented
Before AWS Backup existed, ops teams cobbled together EBS snapshot lifecycle rules, RDS automated backups, DynamoDB point-in-time recovery, and separate EFS backup jobs, each with its own schedule, retention, and monitoring. AWS Backup unifies this into one policy-driven service across EBS, RDS, DynamoDB, EFS, FSx, Storage Gateway, and EC2 as a whole.
The core building blocks:
- Backup plans: define schedule, lifecycle (when to transition to cold storage, when to expire), and which resources are in scope
- Resource assignment: usually done by tag, so newly launched resources matching a tag automatically inherit backup coverage without manual onboarding
- Backup vaults: logical containers for recovery points, with optional vault lock for compliance-grade immutability
- Cross-region and cross-account copy: a backup plan can automatically copy recovery points to another region or a dedicated backup account, which is the pattern for ransomware resilience. An isolated account with restrictive access limits the blast radius if production credentials are compromised
Backup Plan: "daily-prod"
- Schedule: daily @ 03:00 UTC
- Lifecycle: move to cold storage after 30 days, expire after 365
- Resource selection: tag Environment=prod
- Copy actions:
  - -> Backup Vault (us-west-2) [region DR]
  - -> Backup Vault (backup account) [ransomware isolation]

Vault Lock deserves particular attention operationally. Once a compliance-mode lock is applied and its cooling-off period has passed, not even the root user can shorten retention or delete recovery points before they expire. That's a one-way door, and the exam likes testing whether you understand it can't be undone.
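The "daily-prod" plan outlined above could be expressed roughly as the request payloads that boto3's AWS Backup client accepts. This is a sketch under assumptions: the vault names, ARNs, account IDs, and IAM role are hypothetical placeholders, and the boto3 calls are left as comments so the snippet runs without AWS credentials.

```python
from typing import Any, Dict

# ASSUMPTION: every ARN, account ID, vault name, and role below is a made-up placeholder.
DR_REGION_VAULT_ARN = "arn:aws:backup:us-west-2:111111111111:backup-vault:prod-dr"
ISOLATED_VAULT_ARN = "arn:aws:backup:us-east-1:222222222222:backup-vault:isolated"


def build_daily_prod_plan() -> Dict[str, Any]:
    """Payload shaped for the BackupPlan argument of create_backup_plan."""
    return {
        "BackupPlanName": "daily-prod",
        "Rules": [
            {
                "RuleName": "daily-0300-utc",
                "TargetBackupVaultName": "prod-primary",
                "ScheduleExpression": "cron(0 3 * * ? *)",  # daily at 03:00 UTC
                "Lifecycle": {
                    "MoveToColdStorageAfterDays": 30,
                    "DeleteAfterDays": 365,
                },
                "CopyActions": [
                    {"DestinationBackupVaultArn": DR_REGION_VAULT_ARN},  # region DR
                    {"DestinationBackupVaultArn": ISOLATED_VAULT_ARN},   # ransomware isolation
                ],
            }
        ],
    }


def build_tag_selection() -> Dict[str, Any]:
    """Payload shaped for the BackupSelection argument of create_backup_selection."""
    return {
        "SelectionName": "prod-by-tag",
        "IamRoleArn": "arn:aws:iam::111111111111:role/backup-service-role",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "Environment",
                "ConditionValue": "prod",
            }
        ],
    }


def lifecycle_is_valid(lifecycle: Dict[str, int]) -> bool:
    """Cold-storage recovery points must be kept at least 90 days after the transition."""
    return lifecycle["DeleteAfterDays"] >= lifecycle["MoveToColdStorageAfterDays"] + 90


plan = build_daily_prod_plan()
selection = build_tag_selection()
print(lifecycle_is_valid(plan["Rules"][0]["Lifecycle"]))

# With credentials configured, the payloads would be submitted roughly like this:
#   import boto3
#   client = boto3.client("backup")
#   created = client.create_backup_plan(BackupPlan=plan)
#   client.create_backup_selection(
#       BackupPlanId=created["BackupPlanId"], BackupSelection=selection
#   )
```

Keeping the payloads as plain dictionaries makes them easy to review and unit-test before anything touches a real account. Note that cold storage transitions apply only to resource types that support them.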
Restore testing isn't optional busywork. AWS Backup includes restore testing automation specifically because untested backups are a liability that only reveals itself at the worst possible moment. Schedule periodic automated restores into an isolated environment and validate them; don't wait for an actual incident to discover a backup job silently failed for three weeks.
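Restore testing validates that recovery points are usable. A simpler complementary check is to flag resources whose most recent successful backup is older than expected, which catches the "silently failed for three weeks" case. The sketch below works on hypothetical job records; the tuple layout, resource IDs, and function name are assumptions, with state strings modeled on AWS Backup job states such as COMPLETED and FAILED.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (resource_id, job_state, completion_time)
JobRecord = Tuple[str, str, datetime]


def resources_without_recent_backup(
    jobs: Iterable[JobRecord],
    expected_resources: Iterable[str],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return expected resources with no COMPLETED job within max_age of now."""
    newest_success: Dict[str, datetime] = {}
    for resource_id, state, completed_at in jobs:
        if state != "COMPLETED":
            continue  # failed or aborted jobs do not count as coverage
        previous = newest_success.get(resource_id)
        if previous is None or completed_at > previous:
            newest_success[resource_id] = completed_at

    stale = []
    for resource_id in expected_resources:
        last = newest_success.get(resource_id)
        if last is None or now - last > max_age:
            stale.append(resource_id)
    return sorted(stale)


# Illustrative data only.
now = datetime(2024, 3, 22, tzinfo=timezone.utc)
jobs = [
    ("db-1", "COMPLETED", datetime(2024, 3, 21, 3, 0, tzinfo=timezone.utc)),
    ("db-2", "COMPLETED", datetime(2024, 3, 1, 3, 0, tzinfo=timezone.utc)),
    ("db-2", "FAILED", datetime(2024, 3, 21, 3, 0, tzinfo=timezone.utc)),
]
stale = resources_without_recent_backup(
    jobs, ["db-1", "db-2", "fs-1"], now, max_age=timedelta(days=2)
)
print(stale)
```

Here `db-2` is flagged because its only recent job failed, and `fs-1` because it has no jobs at all. A check like this does not replace an actual restore, since a job can complete and still produce a recovery point that fails validation.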
Multi-AZ vs Multi-Region: Two Different Problems
It's easy to conflate these, but they solve different failure modes:
- Multi-AZ protects against a data center-level failure within a single Region: power, cooling, or networking issues confined to one facility. RDS Multi-AZ, ALB spanning subnets, and ASGs across AZs are the default posture for any production workload, not an optional upgrade.
- Multi-Region protects against a Region-wide event or is driven by compliance/latency needs for a geographically distributed user base. It's a bigger operational lift: data replication lag, DNS failover via Route 53 health checks, and keeping two environments in configuration parity.
A production system that's Multi-AZ but single-Region is a completely reasonable, common posture. Multi-Region is for workloads where a full-region outage (rare, but it happens) or strict compliance geography genuinely requires it. Treat it as an added cost/complexity decision, not a default.
Fault-Tolerant Operational Practices
A few practices that show up repeatedly in both real operations and exam scenarios:
- Health checks that actually check health: a load balancer health check hitting the / path and getting a 200 doesn't confirm the database connection pool is healthy. Health check endpoints should verify the dependencies that matter, not just process liveness.
- Auto Scaling replacing unhealthy instances: combine ASG health checks with ELB health checks so instances failing application-level checks get cycled automatically, not just ones that stop responding to ping.
- Chaos testing in non-production: periodically terminating instances or injecting latency in a controlled environment is how you find out your "fault-tolerant" architecture has an untested assumption before a real event finds it for you.
- Route 53 health checks and failover routing: for multi-region setups, DNS-level failover needs health checks configured against the actual application endpoint, with a sensible failure threshold so you don't fail over on a single transient blip.
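As an illustration of the first practice, a health endpoint can aggregate dependency probes and return a non-200 status when any of them fails. This is a minimal, framework-agnostic sketch: `evaluate_health` and the probe functions are hypothetical names, and real probes would run something like a lightweight database query instead of the stand-ins used here.

```python
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency probe; return (HTTP status, per-dependency result)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A probe that raises (timeout, refused connection) counts as unhealthy.
            results[name] = f"error: {type(exc).__name__}"
    # 503 tells the load balancer to stop routing traffic to this instance.
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


# Stand-in probes for illustration only.
def database_probe() -> bool:
    return True  # e.g. a trivial query against the connection pool succeeded


def cache_probe() -> bool:
    raise ConnectionError("connection refused")


healthy_status, healthy_body = evaluate_health({"database": database_probe})
degraded_status, degraded_body = evaluate_health(
    {"database": database_probe, "cache": cache_probe}
)
print(healthy_status, healthy_body)
print(degraded_status, degraded_body)
```

One design caution: include only dependencies whose failure should actually take the instance out of rotation. If every instance shares the same failing dependency, an overly strict check can mark the whole fleet unhealthy at once.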
Exam Focus: What Questions Test From This Step
- Matching a scenario's stated RTO/RPO to the cheapest sufficient DR pattern (Backup & Restore through Active/Active)
- Explaining what's actually running (or not running) in each DR pattern during normal operations
- AWS Backup plan components: schedules, lifecycle transitions, resource assignment by tag, cross-account/cross-region copy
- Vault Lock compliance mode as an irreversible retention guarantee
- Why untested backups are treated as a documented operational risk, and how restore testing addresses it
- Multi-AZ as the default HA posture vs Multi-Region as a deliberate, costlier decision
- Recognizing when a health check design fails to reflect true application health