Step 2 — Reliability & Business Continuity

Ask any experienced operations engineer what actually gets tested during an outage, and it’s never the architecture diagram — it’s whether the backup you’ve been trusting for eight months actually restores cleanly, and whether your team can hit the recovery time you promised the business. This step is about the operational discipline behind resilience, not just the AWS services that enable it.

RTO and RPO: The Two Numbers That Drive Every Decision

Before picking a DR pattern, you need two numbers from the business, not from engineering:

RTO (Recovery Time Objective) — how long can the system be down before it’s unacceptable?
RPO (Recovery Point Objective) — how much data can you afford to lose, measured in time?

        Disaster strikes
              │
   ◄──────────┼───────────────►
   RPO window          RTO window
   (data loss           (time to
    tolerance)           recover)
       │                     │
  Last backup           Systems back up
  before disaster       and serving traffic

An RPO of 15 minutes means your last backup or replication checkpoint can be at most 15 minutes old when disaster hits — anything more recent is lost. An RTO of 1 hour means you have 60 minutes from the moment of failure to full recovery. These two numbers, not personal preference, are what should point you toward a specific DR pattern, because tighter numbers cost more money and operational complexity.
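The arithmetic behind those two numbers can be sketched in a few lines of Python. This is a simplified model, not an AWS formula: it assumes backups start on a fixed interval, that each backup captures data as of its start time, and that recovery steps run one after another. The function names and the example step names are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Age of the newest usable recovery point in the worst case.

    Assumed model: if disaster hits just before the in-flight backup
    completes, the newest usable copy is the previous one, which started
    one full interval earlier.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              backup_duration: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


def meets_rto(recovery_steps: dict[str, timedelta], rto: timedelta) -> bool:
    """True if the sequential recovery steps fit inside the RTO."""
    total = sum(recovery_steps.values(), timedelta())
    return total <= rto
```

Under this model, hourly snapshots that take 10 minutes cannot satisfy a 15-minute RPO, while a 5-minute replication checkpoint can. For RTO, summing detection, decision, restore, and validation time is often more revealing than the restore time alone.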

The Four Classic DR Patterns

These come up constantly, and the exam expects you to match a scenario’s RTO/RPO requirements to the cheapest pattern that satisfies them — not the fanciest one.

Pattern	RTO	RPO	Cost	What’s actually running in DR
Backup & Restore	Hours	Hours	$	Nothing — you restore from backups on demand
Pilot Light	Tens of minutes	Minutes	$$	Core DB replicating; everything else off, scaled up on failover
Warm Standby	Minutes	Seconds–minutes	$$$	Scaled-down full stack running, sized up during failover
Multi-Site Active/Active	Near-zero	Near-zero	$$$$	Full production stack live in both regions simultaneously

Backup & Restore        Pilot Light           Warm Standby         Active/Active
─────────────────      ─────────────         ─────────────        ─────────────
 Region A (live)        Region A (live)       Region A (live)      Region A (live)
 Region B: backups      Region B: DB          Region B: small      Region B (live)
           only         replica only          full stack           full stack
                        (rest is off)          (scaled down)

Backup & Restore is the cheapest and slowest — appropriate when a business can genuinely tolerate hours of downtime, which is more common than architects like to admit. Pilot Light keeps your data tier warm and ready but your compute tier cold, so failover means scaling up infrastructure that already has current data. Warm Standby runs a smaller version of the full stack continuously, so failover is a scaling event rather than a build-from-scratch event. Multi-Site Active/Active runs full capacity in more than one region simultaneously — the only pattern that gets you near-zero RTO/RPO, and the only one where you’re paying for redundant peak capacity around the clock.

The trap candidates fall into: assuming “more expensive” is always “more correct.” If the scenario says the workload tolerates a few hours of downtime, Warm Standby is over-engineered, and on a cost-optimization question that makes it the wrong answer.
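The "cheapest pattern that satisfies the requirement" rule can be written as a small lookup. The numeric thresholds below are assumptions chosen only to make the table's qualitative ranges ("Hours", "Tens of minutes", and so on) concrete. They are not AWS-published limits, and `DrPattern` and `cheapest_sufficient_pattern` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrPattern:
    name: str
    best_rto: timedelta  # tightest RTO this pattern is assumed to achieve
    best_rpo: timedelta  # tightest RPO this pattern is assumed to achieve


# Ordered cheapest first. Thresholds are illustrative assumptions.
PATTERNS = [
    DrPattern("Backup & Restore", timedelta(hours=4), timedelta(hours=4)),
    DrPattern("Pilot Light", timedelta(minutes=30), timedelta(minutes=10)),
    DrPattern("Warm Standby", timedelta(minutes=5), timedelta(minutes=1)),
    DrPattern("Multi-Site Active/Active", timedelta(0), timedelta(0)),
]


def cheapest_sufficient_pattern(rto: timedelta, rpo: timedelta) -> str:
    """Return the first (cheapest) pattern that meets both objectives."""
    if rto < timedelta(0) or rpo < timedelta(0):
        raise ValueError("RTO and RPO must be non-negative")
    for pattern in PATTERNS:
        if pattern.best_rto <= rto and pattern.best_rpo <= rpo:
            return pattern.name
    # Unreachable with the table above, because Active/Active accepts any
    # non-negative requirement; kept as a guard if PATTERNS is edited.
    raise LookupError("no pattern satisfies the stated RTO/RPO")
```

Because the list is walked from cheapest to most expensive, a workload that tolerates eight hours of downtime resolves to Backup & Restore even though every other pattern would also satisfy it.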

AWS Backup: Centralizing What Used to Be Fragmented

Before AWS Backup existed, ops teams cobbled together EBS snapshot lifecycle rules, RDS automated backups, DynamoDB point-in-time recovery, and separate EFS backup jobs — each with its own schedule, retention, and monitoring. AWS Backup unifies this into one policy-driven service across EBS, RDS, DynamoDB, EFS, FSx, Storage Gateway, and EC2 as a whole.

The core building blocks:

Backup plans — define schedule, lifecycle (when to transition to cold storage, when to expire), and which resources are in scope
Resource assignment — usually done by tag, so newly launched resources matching a tag automatically inherit backup coverage without manual onboarding
Backup vaults — logical containers for recovery points, with optional vault lock for compliance-grade immutability
Cross-region and cross-account copy — a backup plan can automatically copy recovery points to another region or a dedicated backup account, which is the pattern for ransomware resilience — an isolated account with restrictive access limits blast radius if production credentials are compromised

Backup Plan: "daily-prod"
  ├── Schedule: daily @ 03:00 UTC
  ├── Lifecycle: move to cold storage after 30 days, expire after 365
  ├── Resource selection: tag Environment=prod
  └── Copy actions:
        ├── → Backup Vault (us-west-2)     [region DR]
        └── → Backup Vault (backup account) [ransomware isolation]
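As a rough sketch, the plan in the diagram could be expressed with boto3's AWS Backup client along the following lines. The vault name `prod-vault`, the rule and selection names, and all ARNs passed in are hypothetical placeholders. Cross-account copy also has prerequisites (such as access policies on the destination vault) that this sketch does not set up.

```python
def build_daily_prod_plan(dr_vault_arn: str, isolated_vault_arn: str) -> dict:
    """Build the BackupPlan document for the "daily-prod" plan."""
    lifecycle = {"MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365}
    return {
        "BackupPlanName": "daily-prod",
        "Rules": [
            {
                "RuleName": "daily-0300-utc",            # hypothetical name
                "TargetBackupVaultName": "prod-vault",   # hypothetical vault
                "ScheduleExpression": "cron(0 3 * * ? *)",  # daily, 03:00 UTC
                "Lifecycle": lifecycle,
                "CopyActions": [
                    # Region DR copy
                    {"DestinationBackupVaultArn": dr_vault_arn,
                     "Lifecycle": lifecycle},
                    # Isolated backup-account copy
                    {"DestinationBackupVaultArn": isolated_vault_arn,
                     "Lifecycle": lifecycle},
                ],
            }
        ],
    }


def build_prod_selection(iam_role_arn: str) -> dict:
    """Select resources by tag so new prod resources are covered automatically."""
    return {
        "SelectionName": "env-prod",  # hypothetical name
        "IamRoleArn": iam_role_arn,
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "Environment",
                "ConditionValue": "prod",
            }
        ],
    }


def apply_plan(backup_client, plan: dict, selection: dict) -> str:
    """Create the plan, attach the tag-based selection, return the plan ID.

    `backup_client` is expected to be boto3.client("backup").
    """
    plan_id = backup_client.create_backup_plan(BackupPlan=plan)["BackupPlanId"]
    backup_client.create_backup_selection(
        BackupPlanId=plan_id, BackupSelection=selection
    )
    return plan_id
```

Passing the client in as a parameter keeps the plan definition separate from the AWS call, so the plan document can be reviewed or unit-tested without credentials.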

Vault Lock deserves particular attention operationally. Once a compliance-mode lock takes effect (after the cooling-off period you set when applying it, which AWS requires to be at least three days), not even the root user can shorten retention or delete recovery points before they expire. That’s a one-way door, and the exam likes testing whether you understand it can’t be undone.
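For illustration, applying a compliance-mode lock with boto3 might look like the sketch below. The helper name `lock_vault` and its defaults are assumptions. The `ChangeableForDays` value is the cooling-off window during which the lock can still be removed; after it elapses the configuration is immutable, so this is something to rehearse against a throwaway vault first.

```python
def lock_vault(backup_client, vault_name: str,
               min_retention_days: int, max_retention_days: int,
               cooling_off_days: int = 3) -> None:
    """Apply a compliance-mode Vault Lock to an existing backup vault.

    `backup_client` is expected to be boto3.client("backup").
    Omitting ChangeableForDays would create a governance-mode lock instead,
    which suitably privileged users can still remove.
    """
    if cooling_off_days < 3:
        raise ValueError("cooling-off period must be at least 3 days")
    if min_retention_days > max_retention_days:
        raise ValueError("min retention cannot exceed max retention")
    backup_client.put_backup_vault_lock_configuration(
        BackupVaultName=vault_name,
        MinRetentionDays=min_retention_days,
        MaxRetentionDays=max_retention_days,
        ChangeableForDays=cooling_off_days,
    )
```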

Restore testing isn’t optional busywork. AWS Backup includes restore testing automation specifically because untested backups are a liability that only reveals itself at the worst possible moment. Schedule periodic automated restores into an isolated environment and validate them; don’t wait for an actual incident to discover a backup job silently failed for three weeks.
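The "silently failed for three weeks" case is the easiest part to catch with a scheduled check. The sketch below is not a restore test and does not replace one. It only lists recent failed backup jobs through boto3's `list_backup_jobs`, and the function name and the seven-day lookback are assumptions.

```python
from datetime import datetime, timedelta, timezone


def failed_backup_jobs(backup_client, lookback_days: int = 7) -> list[tuple[str, str, str]]:
    """Return (job_id, resource_arn, status_message) for recent failed jobs.

    `backup_client` is expected to be boto3.client("backup").
    """
    since = datetime.now(timezone.utc) - timedelta(days=lookback_days)
    kwargs = {"ByState": "FAILED", "ByCreatedAfter": since}
    failures = []
    while True:
        page = backup_client.list_backup_jobs(**kwargs)
        for job in page.get("BackupJobs", []):
            failures.append((
                job["BackupJobId"],
                job.get("ResourceArn", "unknown"),
                job.get("StatusMessage", ""),
            ))
        token = page.get("NextToken")
        if not token:
            return failures
        kwargs["NextToken"] = token  # fetch the next page of results
```

Run on a schedule and wired to an alert, a non-empty result surfaces a broken job within days rather than during an incident.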

Multi-AZ vs Multi-Region: Two Different Problems

It’s easy to conflate these, but they solve different failure modes:

Multi-AZ protects against the failure of a single Availability Zone (one or more data centers) within a Region: power, cooling, or networking issues confined to that zone. RDS Multi-AZ, ALBs spanning subnets in multiple AZs, and ASGs across AZs are the default posture for any production workload, not an optional upgrade.
Multi-Region protects against a Region-wide event or is driven by compliance/latency needs for a geographically distributed user base. It’s a bigger operational lift — data replication lag, DNS failover via Route 53 health checks, and keeping two environments in configuration parity.

A production system that’s Multi-AZ but single-Region is a completely reasonable, common posture. Multi-Region is for workloads where a full-region outage (rare, but it happens) or strict compliance geography genuinely requires it — treat it as an added cost/complexity decision, not a default.

Fault-Tolerant Operational Practices

A few practices that show up repeatedly in both real operations and exam scenarios:

Health checks that actually check health — a load balancer health check hitting / and getting a 200 doesn’t confirm the database connection pool is healthy. Health check endpoints should verify the dependencies that matter, not just process liveness.
Auto Scaling replacing unhealthy instances: enable ELB health checks on the ASG in addition to its default EC2 status checks, so instances failing application-level checks get cycled automatically, not just ones whose underlying instance or host has failed.
Chaos testing in non-production — periodically terminating instances or injecting latency in a controlled environment is how you find out your “fault-tolerant” architecture has an untested assumption before a real event finds it for you.
Route 53 health checks and failover routing — for multi-region setups, DNS-level failover needs health checks configured against the actual application endpoint, with a sensible failure threshold so you don’t fail over on a single transient blip.
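The first practice above, a health check that verifies dependencies rather than process liveness, can be sketched independently of any web framework. The function name `deep_health` and the idea of passing probes as callables are assumptions made for illustration. In a real service each probe would run a cheap query or ping against the dependency, with a short timeout.

```python
from typing import Callable, Dict, Tuple


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency probe and derive an HTTP status for the endpoint.

    Returns 200 only if every probe succeeds; otherwise 503, so the load
    balancer takes the instance out of rotation.
    """
    results: Dict[str, str] = {}
    for name, probe in checks.items():
        try:
            results[name] = "ok" if probe() else "failed"
        except Exception as exc:  # a crashing probe counts as unhealthy
            results[name] = f"error: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

A health endpoint handler would call this with probes for the database pool, cache, and any other hard dependency, then return the status code and the per-dependency results as the response body. Keeping probes fast matters, since a slow health check can itself cause healthy instances to be marked unhealthy.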

Exam Focus: What Questions Test From This Step

Matching a scenario’s stated RTO/RPO to the cheapest sufficient DR pattern (Backup & Restore through Active/Active)
Explaining what’s actually running (or not running) in each DR pattern during normal operations
AWS Backup plan components: schedules, lifecycle transitions, resource assignment by tag, cross-account/cross-region copy
Vault Lock compliance mode as an irreversible retention guarantee
Why untested backups are treated as a documented operational risk, and how restore testing addresses it
Multi-AZ as the default HA posture vs Multi-Region as a deliberate, costlier decision
Recognizing when a health check design fails to reflect true application health

Written by NPBlue Cloud Team, Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.