Step 5 — High Availability, Disaster Recovery & Security Automation

This is the step where everything from the previous four converges. High availability, disaster recovery, and security automation aren’t separate disciplines from CI/CD and observability — they’re what CI/CD and observability are for. A pipeline that deploys safely, alarms that page the right person, and remediation that runs itself all exist in service of a system that keeps running and stays compliant without constant human intervention. Let’s finish the picture, then talk about the exam itself.

High Availability Architecture Patterns at the Professional Level

Associate-level HA stops at “Multi-AZ RDS, ALB across zones.” Professional-level HA reasons about failure domains explicitly and designs for graceful degradation, not just redundancy.

                        Route 53 (health-checked failover / latency routing)
                                        │
                ┌───────────────────────┼───────────────────────┐
                ▼                       ▼                       ▼
          Region: us-east-1       Region: us-west-2       Region: eu-west-1
          (primary, active)       (active, active-active)  (DR, warm standby)
                │                       │                       │
          AZ-a / AZ-b / AZ-c      AZ-a / AZ-b / AZ-c      AZ-a / AZ-b
                │                       │                       │
          ALB → ASG → Aurora      ALB → ASG → Aurora      ALB → ASG (scaled down)
                            Global Database replication ──────► (read replica,
                                                                  promotable)

Static stability is the concept the exam leans on most heavily and the one most engineers underrate until they’ve lived through a regional event: a highly available system should survive the failure of a dependency (including a control-plane API) without needing that dependency to recover first. A classic example — an Auto Scaling Group that depends on calling an external configuration service during instance launch will fail to scale precisely when that service is degraded, which is often exactly when you need to scale. Statically stable designs pre-provision enough capacity, or cache configuration locally, so that a control-plane outage doesn’t cascade into a data-plane outage.
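
As a rough illustration of the "cache configuration locally" idea, the sketch below prefers a remote configuration service but falls back to the last known-good copy on disk when that service is degraded. The function names, the cache file name, and the config keys are all hypothetical; a real bootstrap would also want a default baked into the image for the very first launch.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_config(fetch_remote: Callable[[], dict], cache_path: Path) -> dict:
    """Return config, preferring the remote service but surviving its outage."""
    try:
        config = fetch_remote()
    except Exception:
        # Dependency is degraded: fall back to the last known-good copy
        # instead of failing the instance launch.
        return json.loads(cache_path.read_text())
    # Remote call succeeded: refresh the local cache for the next outage.
    cache_path.write_text(json.dumps(config))
    return config


def healthy_fetch() -> dict:
    # Hypothetical stand-in for a call to an external configuration service.
    return {"feature_x": True, "pool_size": 20}


def degraded_fetch() -> dict:
    # Simulates the same service timing out during an incident.
    raise TimeoutError("config service unavailable")


if __name__ == "__main__":
    cache = Path(tempfile.gettempdir()) / "app-config-cache.json"
    print(load_config(healthy_fetch, cache))   # populates the cache
    print(load_config(degraded_fetch, cache))  # served from the cache
```

The point of the design is that the launch path never blocks on the dependency recovering: the degraded call returns the same configuration the healthy call cached earlier.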

Multi-Region strategies, ranked by cost and recovery speed:

Strategy	Typical RPO / RTO	Cost profile	Description
Backup & Restore	Hours	Lowest	Snapshots/backups shipped to another region; infra provisioned from IaC only after disaster declared
Pilot Light	Tens of minutes	Low-moderate	Core data replicated continuously (e.g., RDS/Aurora replica); compute scaled to zero/minimal, scaled up on failover
Warm Standby	Minutes	Moderate-high	Scaled-down but fully functional replica stack running continuously; scale up on failover
Active-Active (Multi-Region)	Near zero	Highest	Full production capacity live in 2+ regions simultaneously, traffic distributed by Route 53 or Global Accelerator

The exam consistently frames this as a cost-versus-recovery-time tradeoff, and the correct answer depends entirely on the RTO/RPO requirement stated in the scenario. Memorizing the table above won’t get you through the harder scenario questions unless you can also map “we can tolerate 4 hours of downtime” to Backup & Restore, or “sub-minute failover, cost is not the primary constraint” to Active-Active.
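
One way to practice that mapping is to treat it as "pick the cheapest strategy whose typical recovery time still meets the requirement." The sketch below does exactly that. The minute values are illustrative assumptions chosen to match the rough bands in the table, not figures published by AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_minutes: float  # assumed illustrative figure, not an AWS number


# Ordered cheapest first, mirroring the strategy table.
STRATEGIES = [
    Strategy("Backup & Restore", 240),
    Strategy("Pilot Light", 30),
    Strategy("Warm Standby", 5),
    Strategy("Active-Active", 0.5),
]


def cheapest_strategy(required_rto_minutes: float) -> str:
    """Pick the lowest-cost strategy whose typical RTO meets the requirement."""
    for strategy in STRATEGIES:
        if strategy.typical_rto_minutes <= required_rto_minutes:
            return strategy.name
    # Nothing cheaper qualifies, so fall through to the fastest option.
    return STRATEGIES[-1].name


if __name__ == "__main__":
    for rto in (45, 15, 1):
        print(f"RTO {rto} min -> {cheapest_strategy(rto)}")
```

Under these assumed thresholds, a 15-minute requirement skips past Backup & Restore and Pilot Light and lands on Warm Standby, which is the same filtering you would do by eye on an exam question.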

Automated Disaster Recovery Orchestration

Declaring a disaster and manually clicking through a console to bring up a DR region is itself a resilience failure — the professional pattern automates the failover sequence so it executes correctly under the stress of an actual outage.

CloudWatch/Route 53 health check fails on primary region
        │
        ▼
EventBridge rule / Route 53 Application Recovery Controller
        │
        ▼
Step Functions: "RegionalFailover"
   Step 1: Promote Aurora Global Database secondary → primary
   Step 2: Update Route 53 records (or ARC routing control) to
           direct traffic to DR region
   Step 3: Scale DR region ASGs from warm-standby to full capacity
   Step 4: Validate health checks pass in DR region
   Step 5: Notify stakeholders, open incident ticket automatically
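
The five steps above could be expressed as a Step Functions state machine. The sketch below builds an Amazon States Language definition as a Python dictionary, with one Task state per step. The Lambda function names, account ID, and Region in the ARNs are hypothetical placeholders, and the retry settings are arbitrary starting values rather than recommendations.

```python
import json

# Hypothetical Lambda ARNs; one function per step of the failover sequence.
ARN_PREFIX = "arn:aws:lambda:eu-west-1:111122223333:function:"

STEPS = [
    ("PromoteDatabase", "promote-aurora-secondary"),
    ("ShiftTraffic", "update-routing"),
    ("ScaleUpCompute", "scale-dr-asgs"),
    ("ValidateHealth", "validate-health-checks"),
    ("NotifyStakeholders", "notify-and-open-ticket"),
]


def build_failover_definition() -> dict:
    """Build a state machine definition that runs STEPS strictly in order."""
    states = {}
    for index, (state_name, function_name) in enumerate(STEPS):
        state = {
            "Type": "Task",
            "Resource": ARN_PREFIX + function_name,
            # Retry failed tasks so one flaky call doesn't abort the failover.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
        }
        if index + 1 < len(STEPS):
            state["Next"] = STEPS[index + 1][0]  # chain to the following step
        else:
            state["End"] = True  # last step terminates the execution
        states[state_name] = state
    return {
        "Comment": "RegionalFailover",
        "StartAt": STEPS[0][0],
        "States": states,
    }


if __name__ == "__main__":
    print(json.dumps(build_failover_definition(), indent=2))
```

The ARNs deliberately point at the recovery Region: an orchestrator that lives in the Region being evacuated would itself be a dependency on the failing infrastructure.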

Route 53 Application Recovery Controller (ARC) is worth knowing specifically because it decouples the failover action from control-plane changes to DNS records. Instead of editing records mid-incident, you flip routing controls, which act as on/off health checks and are served by a highly available cluster of ARC’s own, designed so the failover mechanism itself doesn’t depend on the infrastructure that might be failing. Clients still honor record TTLs, so keep those short. Recognize ARC as the answer whenever a scenario emphasizes that the failover mechanism must remain available even during a broad regional impairment.
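
A minimal sketch of flipping a routing control, assuming the ARC cluster exposes several Regional endpoints and that any one of them can accept the update. The helper name and the injected `make_client` factory are illustrative. In real use the factory would be something like `lambda region, url: boto3.client("route53-recovery-cluster", region_name=region, endpoint_url=url)`, and the endpoint list would come from your own cluster's description.

```python
from typing import Callable, Iterable


def set_routing_control(
    make_client: Callable[[str, str], object],
    cluster_endpoints: Iterable[tuple[str, str]],
    routing_control_arn: str,
    state: str,
) -> str:
    """Try each cluster endpoint in turn; return the Region that accepted."""
    last_error = None
    for region, endpoint_url in cluster_endpoints:
        try:
            client = make_client(region, endpoint_url)
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,  # "On" or "Off"
            )
            return region
        except Exception as error:
            # This endpoint may sit in the impaired Region: try the next one.
            last_error = error
    raise RuntimeError("no cluster endpoint accepted the update") from last_error
```

Looping over endpoints is the part that matters: a failover tool that only knows one endpoint inherits that endpoint's Region as a single point of failure.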

Aurora Global Database underpins most of these patterns for relational data: continuous replication to a secondary region with replication lag that is typically under a second, and a managed failover API that promotes the secondary to a fully writable primary. Promotion is much faster than restoring from a snapshot, which is why it wins over backup-and-restore for anything with a meaningful RTO requirement.
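
For orientation, a promotion call might look like the sketch below. It assumes a boto3 RDS client recent enough that `failover_global_cluster` accepts `AllowDataLoss`. The wrapper name and both identifiers are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
def promote_secondary(rds_client, global_cluster_id: str, target_cluster_arn: str) -> dict:
    """Ask RDS to promote a secondary cluster of an Aurora global database."""
    return rds_client.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_cluster_arn,
        # Unplanned failover: accept losing writes that had not replicated yet.
        AllowDataLoss=True,
    )
```

In a real orchestrator this would be called with `boto3.client("rds", region_name=...)` for the recovery Region. For a planned, no-data-loss move between Regions, a switchover operation is the more appropriate choice than this unplanned-failover form.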

The core testable idea: DR orchestration should be tested regularly (game days again, same FIS-driven discipline from Step 4) and triggered automatically or with a single controlled action, never a multi-hour manual runbook assembled from memory during the actual event.

Security Automation: Compliance as Code

Manually auditing hundreds of accounts for compliance doesn’t scale, and it doesn’t hold up as an exam answer either. The professional pattern is continuous, automated compliance evaluation with automated remediation for well-understood violations.

AWS Config Rule (managed or custom, e.g. "s3-bucket-public-read-prohibited")
        │
        ▼  evaluates resource configuration continuously
   NON_COMPLIANT finding
        │
        ▼
EventBridge (Config compliance change event)
        │
        ▼
SSM Automation document: "RemediatePublicS3Bucket"
   - Automatically attached as a Config remediation action
   - Applies the bucket's public access block
   - Tags the resource with remediation timestamp
   - Notifies via SNS

Config supports automatic remediation directly on a rule — you attach an SSM Automation document to a Config rule, and non-compliant resources get remediated without a separate EventBridge rule at all, though the EventBridge path is useful when the response needs to be more elaborate than a single Automation document handles (notify a specific team based on resource tags, escalate repeated violations differently than first-time ones).
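
To make the direct-attachment path concrete, the sketch below builds the request body you might pass to AWS Config's `put_remediation_configurations` call. The document name `RemediatePublicS3Bucket` comes from the diagram above; its parameter names (`BucketName`, `AutomationAssumeRole`), the role ARN, and the retry values are assumptions made for illustration.

```python
def build_remediation_config(rule_name: str, automation_role_arn: str) -> dict:
    """Describe an automatic SSM Automation remediation for one Config rule."""
    return {
        "ConfigRuleName": rule_name,
        "TargetType": "SSM_DOCUMENT",
        "TargetId": "RemediatePublicS3Bucket",  # document name from the diagram
        "Automatic": True,                      # remediate without a human click
        "MaximumAutomaticAttempts": 3,          # assumed value
        "RetryAttemptSeconds": 60,              # assumed value
        "Parameters": {
            # Hypothetical parameter names for the custom document.
            "AutomationAssumeRole": {
                "StaticValue": {"Values": [automation_role_arn]}
            },
            # RESOURCE_ID is substituted with the non-compliant resource's ID.
            "BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
        },
    }


if __name__ == "__main__":
    config = build_remediation_config(
        "s3-bucket-public-read-prohibited",
        "arn:aws:iam::111122223333:role/ConfigRemediationRole",  # hypothetical
    )
    # With credentials available, this could be applied roughly as:
    #   boto3.client("config").put_remediation_configurations(
    #       RemediationConfigurations=[config])
    print(config["TargetId"])
```

Note that nothing here involves EventBridge: the rule itself carries the remediation, which is why this is the lower-overhead answer when a single Automation document is enough.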

Secrets rotation automation is the other pillar here. Secrets Manager’s native rotation rotates credentials on a schedule without application downtime. It is backed by a rotation Lambda function: one that Secrets Manager provides for RDS, Redshift, and DocumentDB, or a custom one you supply for anything else. Either way, the function follows the four-step rotation contract (createSecret, setSecret, testSecret, finishSecret), which keeps both the old and new secret valid during the transition window. The exam wants you to recognize this as the default answer whenever a scenario mentions database credentials, API keys, or any credential that needs periodic rotation without manual intervention. It also wants you to know that Parameter Store has no native rotation: you would have to build the scheduling and Lambda invocation yourself via EventBridge, whereas Secrets Manager has rotation scheduling built in.
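
The four-step contract is easier to remember once you have seen the dispatch shape. The sketch below is a simplified model, not a working rotation function: `FakeSecretStore` is an in-memory stand-in for Secrets Manager's staging labels, the extra `store` argument on the handler exists only so the example runs offline, and the database calls are left as comments.

```python
import secrets


class FakeSecretStore:
    """In-memory stand-in for Secrets Manager staging labels (illustration only)."""

    def __init__(self, current_value: str):
        self.versions = {"v1": current_value}
        self.labels = {"AWSCURRENT": "v1"}


def step_create(store, token):
    # Generate the new credential and stage it as AWSPENDING.
    store.versions[token] = secrets.token_urlsafe(24)
    store.labels["AWSPENDING"] = token


def step_set(store, token):
    # A real function would set the pending password in the database here.
    pass


def step_verify(store, token):
    # A real function would log in with the pending credential here.
    if not store.versions.get(store.labels["AWSPENDING"]):
        raise ValueError("pending secret is missing or empty")


def step_finish(store, token):
    # Promote pending to current; the old version stays reachable as AWSPREVIOUS.
    store.labels["AWSPREVIOUS"] = store.labels["AWSCURRENT"]
    store.labels["AWSCURRENT"] = token
    store.labels.pop("AWSPENDING", None)  # simplification in this sketch


STEP_HANDLERS = {
    "createSecret": step_create,
    "setSecret": step_set,
    "testSecret": step_verify,
    "finishSecret": step_finish,
}


def lambda_handler(event, context, store):
    """Dispatch on the Step field sent with each rotation invocation."""
    STEP_HANDLERS[event["Step"]](store, event["ClientRequestToken"])


if __name__ == "__main__":
    store = FakeSecretStore("old-password")
    for step in ("createSecret", "setSecret", "testSecret", "finishSecret"):
        lambda_handler({"SecretId": "demo", "ClientRequestToken": "v2", "Step": step}, None, store)
    print(store.labels)
```

Until the final step runs, AWSCURRENT still points at the old version, so applications reading the secret mid-rotation keep working. That is the zero-downtime property the exam is probing for.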

Concern	Config + SSM Automation	Secrets Manager rotation
What it fixes	Misconfigured resources (public buckets, open security groups, missing encryption)	Stale/long-lived credentials
Trigger	Continuous rule evaluation	Time-based schedule (e.g., every 30 days)
Remediation actor	SSM Automation document	Managed or custom rotation Lambda
Zero-downtime requirement	N/A (config change)	Yes — old and new secret both valid during rotation window

Exam Domains and How to Study Them

The DOP-C02 exam is organized into six domains, and the weighting itself tells you where to spend your study time:

Domain	Approx. weighting	Core theme
SDLC Automation	~22%	CI/CD pipeline design, deployment strategies, artifact management
Configuration Management & IaC	~17%	CloudFormation/CDK/StackSets, SSM, immutable infrastructure
Resilient Cloud Solutions	~15%	HA architecture, multi-region DR, fault tolerance
Monitoring & Logging	~15%	CloudWatch, X-Ray, centralized logging, alerting design
Incident & Event Response	~14%	EventBridge automation, remediation, runbooks
Security & Compliance	~17%	Compliance as code, secrets management, least privilege automation

Notice that SDLC Automation (~22%), Configuration Management & IaC (~17%), and Security & Compliance (~17%) together account for roughly 56% of the exam. This maps directly to Steps 1, 2, and 5 of this guide, and it’s why those areas deserve disproportionate study time relative to a naive “six domains, study equally” approach.

Common Professional-Level Traps

A few patterns recur often enough to call out explicitly:

Picking a technically correct but operationally wrong answer. Several answer choices in a given question will “work.” The exam is testing which one matches AWS best practice for cost, blast radius, or operational overhead — not merely which one is technically feasible. If an option involves manual console steps repeated regularly, it’s almost never the intended answer.
Ignoring stated RTO/RPO constraints. A question that specifies “recovery within 15 minutes” is eliminating Backup & Restore and Pilot Light before you even read the answer choices. Read the constraint first, filter the strategy table, then evaluate options.
Assuming full automation with no human checkpoint is always better. For destructive or security-sensitive actions, the exam frequently rewards designs that keep a human approval step, provided everything else is automated — full unattended automation isn’t automatically the “more mature” answer when the blast radius is high.
Conflating detection with remediation. GuardDuty detects; it doesn’t fix anything by itself. Config evaluates compliance; it doesn’t remediate without an attached Automation document or a separate workflow. Expect questions that describe only half a solution and ask what’s missing.
Underestimating cross-account/cross-region plumbing. A correct architecture that “should” work often has a missing detail buried in the scenario — an un-shared KMS key, a Config rule that isn’t organization-aggregated, a StackSet using self-managed permissions when the org has hundreds of accounts joining continuously. These details are usually the actual point of the question.

Study Approach

Spend real time in the console (or with CDK/CloudFormation locally) actually building a cross-account pipeline, a Config remediation rule, and a canary CodeDeploy deployment — this exam punishes purely theoretical preparation more than the associate-level exams do, because the scenarios assume you’ve felt the friction of these systems firsthand. Read AWS’s own Well-Architected “Operational Excellence” and “Reliability” pillar whitepapers once, closely, near the end of your prep — by that point you’ll recognize almost every principle from something you already built while working through this guide, and that recognition is what makes the exam’s judgment-based questions go quickly instead of becoming coin flips.

Exam Focus: What Questions Test From This Step

Static stability as a design principle — surviving control-plane degradation without needing it to recover first
Matching a stated RTO/RPO to the correct multi-region DR strategy (Backup & Restore, Pilot Light, Warm Standby, Active-Active)
Route 53 Application Recovery Controller versus plain DNS failover, and why ARC’s own availability matters
Aurora Global Database promotion as the fast path for relational DR versus snapshot restore
Config rules with attached SSM Automation remediation versus a separate EventBridge-triggered remediation workflow
Secrets Manager’s native rotation contract (create/set/test/finish) versus building rotation yourself for Parameter Store
Recognizing “detection without remediation” as an incomplete architecture in scenario questions
Weighting study time toward SDLC Automation, Configuration Management/IaC, and Security — the highest-weighted domains
Choosing operationally sound answers over merely technically-feasible ones, especially where blast radius or manual toil is involved

Written by NPBlue Cloud Team: Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.