Step 5: High Availability, Disaster Recovery & Security Automation
This is the step where everything from the previous four converges. High availability, disaster recovery, and security automation aren't separate disciplines from CI/CD and observability; they're what CI/CD and observability are for. A pipeline that deploys safely, alarms that page the right person, and remediation that runs itself all exist in service of a system that keeps running and stays compliant without constant human intervention. Let's finish the picture, then talk about the exam itself.
High Availability Architecture Patterns at the Professional Level
Associate-level HA stops at "Multi-AZ RDS, ALB across zones." Professional-level HA reasons about failure domains explicitly and designs for graceful degradation, not just redundancy.
Route 53 (health-checked failover / latency routing) sits in front of three regional stacks:

| Region | Role | Availability Zones | Stack |
|---|---|---|---|
| us-east-1 | primary, active | AZ-a / AZ-b / AZ-c | ALB → ASG → Aurora |
| us-west-2 | active, active-active | AZ-a / AZ-b / AZ-c | ALB → ASG → Aurora |
| eu-west-1 | DR, warm standby | AZ-a / AZ-b | ALB → ASG (scaled down) |

Aurora Global Database replication → (read replica, promotable)

Static stability is the concept the exam leans on most heavily and the one most engineers underrate until they've lived through a regional event: a highly available system should survive the failure of a dependency (including a control-plane API) without needing that dependency to recover first. A classic example: an Auto Scaling Group that depends on calling an external configuration service during instance launch will fail to scale precisely when that service is degraded, which is often exactly when you need to scale. Statically stable designs pre-provision enough capacity, or cache configuration locally, so that a control-plane outage doesn't cascade into a data-plane outage.
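The "cache configuration locally" idea can be illustrated with a small sketch. This is only one possible shape, not a prescribed implementation: `load_config`, the two fetch functions, and the cache file name are all hypothetical names invented for the example, and the remote configuration service is simulated by plain functions.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_config(fetch_remote: Callable[[], dict], cache_path: Path) -> dict:
    """Return fresh config when the remote service answers, else the last cached copy."""
    try:
        config = fetch_remote()
    except Exception:
        # The dependency is degraded: fall back to the last known-good copy on disk.
        if cache_path.exists():
            return json.loads(cache_path.read_text())
        raise  # No cache yet, so there is nothing safe to fall back to.
    # Refresh the cache only after a successful fetch.
    cache_path.write_text(json.dumps(config))
    return config


def healthy_fetch() -> dict:
    # Hypothetical stand-in for a call to an external configuration service.
    return {"feature_flags": {"new_checkout": True}}


def degraded_fetch() -> dict:
    # Simulates the same service timing out during an incident.
    raise TimeoutError("config service unavailable")


cache_file = Path(tempfile.mkdtemp()) / "config-cache.json"
first = load_config(healthy_fetch, cache_file)    # populates the local cache
second = load_config(degraded_fetch, cache_file)  # served from the local cache
```

The second call succeeds even though the simulated service is down, which is the property that matters: instance launch no longer depends on the configuration service recovering first.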
Multi-Region strategies, ranked by cost and recovery speed:
| Strategy | RPO / RTO | Cost profile | Description |
|---|---|---|---|
| Backup & Restore | Hours (RPO/RTO) | Lowest | Snapshots/backups shipped to another region; infra provisioned from IaC only after disaster declared |
| Pilot Light | Tens of minutes | Low-moderate | Core data replicated continuously (e.g., RDS/Aurora replica); compute scaled to zero/minimal, scaled up on failover |
| Warm Standby | Minutes | Moderate-high | Scaled-down but fully functional replica stack running continuously; scale up on failover |
| Active-Active (Multi-Region) | Near zero | Highest | Full production capacity live in 2+ regions simultaneously, traffic distributed by Route 53 or Global Accelerator |
The exam consistently frames this as a cost-versus-recovery-time tradeoff, and the correct answer depends entirely on the RTO/RPO requirement stated in the scenario. Memorizing the table above without being able to map "we can tolerate 4 hours of downtime" to Backup & Restore, or "sub-minute failover, cost is not the primary constraint" to Active-Active, won't get you through the harder scenario questions.
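The filtering step can be sketched as "pick the cheapest strategy whose recovery time fits the stated RTO." The numeric RTO values below are illustrative assumptions chosen to match the table's rough bands (hours, tens of minutes, minutes, near zero); they are not AWS-published figures, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_minutes: float  # illustrative assumption, not an AWS figure
    relative_cost: int          # 1 = cheapest, 4 = most expensive


STRATEGIES = [
    Strategy("Backup & Restore", 4 * 60, 1),
    Strategy("Pilot Light", 30, 2),
    Strategy("Warm Standby", 5, 3),
    Strategy("Active-Active", 0.5, 4),
]


def cheapest_strategy(required_rto_minutes: float) -> Strategy:
    """Return the lowest-cost strategy whose typical RTO meets the requirement."""
    candidates = [
        s for s in STRATEGIES if s.typical_rto_minutes <= required_rto_minutes
    ]
    if not candidates:
        raise ValueError("no listed strategy meets this RTO")
    return min(candidates, key=lambda s: s.relative_cost)


four_hours = cheapest_strategy(4 * 60)   # downtime of hours is tolerable
fifteen_minutes = cheapest_strategy(15)  # rules out the two cheapest options
sub_minute = cheapest_strategy(1)        # only the most expensive option fits
```

Under these assumed thresholds, a 15-minute requirement lands on Warm Standby because both Backup & Restore and Pilot Light are filtered out before cost is even considered.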
Automated Disaster Recovery Orchestration
Declaring a disaster and manually clicking through a console to bring up a DR region is itself a resilience failure. The professional pattern automates the failover sequence so it executes correctly under the stress of an actual outage.
CloudWatch/Route 53 health check fails on primary region → EventBridge rule / Route 53 Application Recovery Controller → Step Functions: "RegionalFailover", which runs:

- Step 1: Promote Aurora Global Database secondary → primary
- Step 2: Update Route 53 records (or ARC routing control) to direct traffic to DR region
- Step 3: Scale DR region ASGs from warm-standby to full capacity
- Step 4: Validate health checks pass in DR region
- Step 5: Notify stakeholders, open incident ticket automatically

Route 53 Application Recovery Controller (ARC) is worth knowing specifically because it decouples the failover decision from DNS propagation delay: it uses routing controls backed by a highly available cluster of its own, designed so the failover mechanism itself doesn't depend on the infrastructure that might be failing. Recognize ARC as the answer whenever a scenario emphasizes that the failover mechanism must remain available even during a broad regional impairment.
Aurora Global Database underpins most of these patterns for relational data: continuous replication to a secondary region with typically low replication lag, and a managed promotion API that turns the secondary into a fully writable primary during failover. That is much faster than restoring from snapshot, which is why it dominates over backup-and-restore for anything with a meaningful RTO requirement.
The core testable idea: DR orchestration should be tested regularly (game days again, same FIS-driven discipline from Step 4) and triggered automatically or with a single controlled action, never a multi-hour manual runbook assembled from memory during the actual event.
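One way to make the "RegionalFailover" sequence a single controlled action is to express it as a Step Functions state machine definition in Amazon States Language. The sketch below only builds that definition as a Python dictionary; it does not call AWS. The state names, the Lambda ARNs, and the account ID are hypothetical placeholders, and a real workflow would likely add retries and failure handling.

```python
import json
from typing import List, Tuple

# Hypothetical (state name, Lambda ARN) pairs, one per step of the sequence.
FAILOVER_STEPS: List[Tuple[str, str]] = [
    ("PromoteAuroraSecondary", "arn:aws:lambda:us-west-2:111122223333:function:promote-aurora"),
    ("RedirectTraffic", "arn:aws:lambda:us-west-2:111122223333:function:redirect-traffic"),
    ("ScaleUpDrCapacity", "arn:aws:lambda:us-west-2:111122223333:function:scale-up-asg"),
    ("ValidateHealthChecks", "arn:aws:lambda:us-west-2:111122223333:function:validate-health"),
    ("NotifyStakeholders", "arn:aws:lambda:us-west-2:111122223333:function:notify"),
]


def build_state_machine(steps: List[Tuple[str, str]]) -> dict:
    """Chain Task states in order; the last state ends the execution."""
    states = {}
    for index, (name, resource) in enumerate(steps):
        state = {"Type": "Task", "Resource": resource}
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]
        else:
            state["End"] = True
        states[name] = state
    return {
        "Comment": "RegionalFailover (illustrative sketch)",
        "StartAt": steps[0][0],
        "States": states,
    }


definition = build_state_machine(FAILOVER_STEPS)
definition_json = json.dumps(definition, indent=2)
```

Because each Task state has no error handler in this sketch, a failed step fails the whole execution and later steps do not run, so traffic is never redirected to a database that was not promoted.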
Security Automation: Compliance as Code
Manually auditing hundreds of accounts for compliance doesn't scale, and it doesn't hold up as an exam answer either. The professional pattern is continuous, automated compliance evaluation with automated remediation for well-understood violations.
AWS Config Rule (managed or custom, e.g. "s3-bucket-public-read-prohibited") evaluates resource configuration continuously → NON_COMPLIANT finding → EventBridge (Config compliance change event) → SSM Automation document: "RemediatePublicS3Bucket", which:

- Is automatically attached as a Config remediation action
- Applies the bucket's public access block
- Tags the resource with remediation timestamp
- Notifies via SNS

Config supports automatic remediation directly on a rule: you attach an SSM Automation document to a Config rule, and non-compliant resources get remediated without a separate EventBridge rule at all. The EventBridge path is useful when the response needs to be more elaborate than a single Automation document handles (notify a specific team based on resource tags, escalate repeated violations differently than first-time ones).
Secrets rotation automation is the other pillar here. Secrets Manager's native rotation (backed by a Lambda function it manages for RDS, Redshift, and DocumentDB, or a custom rotation Lambda you supply for anything else) rotates credentials on a schedule without application downtime. It uses the four-step rotation Lambda contract (createSecret, setSecret, testSecret, finishSecret), which keeps both the old and new secret valid during the transition window. The exam wants you to recognize this as the default answer whenever a scenario mentions database credentials, API keys, or any credential that needs periodic rotation without manual intervention. It also wants you to know that Parameter Store has no built-in rotation, so you have to build the scheduling and Lambda invocation yourself via EventBridge, whereas Secrets Manager has rotation scheduling built in natively.
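The "more elaborate response" case on the EventBridge path can be sketched as a small decision function of the kind a Lambda target might run. The sample event is a trimmed, assumed shape of a Config compliance change event (only the fields used here), and `plan_response`, `REPEAT_THRESHOLD`, and the violation counter are hypothetical names for illustration.

```python
from typing import Optional

# Assumed policy: escalate once a resource has violated the rule this many times.
REPEAT_THRESHOLD = 2

# Trimmed sample event; field names are an assumption about the event shape.
SAMPLE_EVENT = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "configRuleName": "s3-bucket-public-read-prohibited",
        "resourceType": "AWS::S3::Bucket",
        "resourceId": "example-bucket",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}


def plan_response(event: dict, prior_violations: int) -> Optional[dict]:
    """Decide what to do with a compliance change event; None means no action."""
    detail = event.get("detail", {})
    result = detail.get("newEvaluationResult", {})
    if result.get("complianceType") != "NON_COMPLIANT":
        return None  # Compliant or not applicable: nothing to remediate.
    # First-time violations are auto-remediated; repeat offenders are escalated.
    action = "escalate" if prior_violations >= REPEAT_THRESHOLD else "remediate"
    return {
        "action": action,
        "rule": detail.get("configRuleName"),
        "resource_id": detail.get("resourceId"),
    }


first_time = plan_response(SAMPLE_EVENT, prior_violations=0)
repeat = plan_response(SAMPLE_EVENT, prior_violations=3)
```

In a real deployment the "remediate" branch would start the SSM Automation document and the "escalate" branch would notify the owning team; both are left out here so the routing logic stays visible.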
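The four-step contract is easier to remember after tracing it once. The sketch below simulates it entirely in memory: `FakeSecretStore` and `FakeDatabase` are hypothetical stand-ins, not the Secrets Manager API, and a real rotation Lambda would also receive `SecretId` and `ClientRequestToken` in its event. The staging label names (AWSCURRENT, AWSPENDING, AWSPREVIOUS) follow the labels Secrets Manager uses.

```python
import secrets


class FakeSecretStore:
    """In-memory stand-in for staged secret versions (not the real API)."""

    def __init__(self, current_password: str) -> None:
        self.labels = {"AWSCURRENT": current_password}


class FakeDatabase:
    """Accepts any password in its set, so old and new can overlap."""

    def __init__(self, password: str) -> None:
        self.accepted = {password}

    def login(self, password: str) -> bool:
        return password in self.accepted


def step_create(store: FakeSecretStore, db: FakeDatabase) -> None:
    # Generate the new credential once; reruns of this step keep the same value.
    if "AWSPENDING" not in store.labels:
        store.labels["AWSPENDING"] = secrets.token_urlsafe(24)


def step_set(store: FakeSecretStore, db: FakeDatabase) -> None:
    # Install the pending credential in the database alongside the old one.
    db.accepted.add(store.labels["AWSPENDING"])


def step_test(store: FakeSecretStore, db: FakeDatabase) -> None:
    # Verify the pending credential works before promoting it.
    if not db.login(store.labels["AWSPENDING"]):
        raise RuntimeError("pending credential rejected")


def step_finish(store: FakeSecretStore, db: FakeDatabase) -> None:
    # Move the labels: the old current becomes previous, pending becomes current.
    store.labels["AWSPREVIOUS"] = store.labels["AWSCURRENT"]
    store.labels["AWSCURRENT"] = store.labels.pop("AWSPENDING")


STEP_HANDLERS = {
    "createSecret": step_create,
    "setSecret": step_set,
    "testSecret": step_test,
    "finishSecret": step_finish,
}


def handle(event: dict, store: FakeSecretStore, db: FakeDatabase) -> None:
    """Dispatch on the Step field, as a rotation function would."""
    STEP_HANDLERS[event["Step"]](store, db)


store = FakeSecretStore("old-password")
db = FakeDatabase("old-password")
for step in ("createSecret", "setSecret", "testSecret", "finishSecret"):
    handle({"Step": step}, store, db)
```

After the loop, both the new current credential and the previous one still log in, which is the overlap that lets applications holding the old value keep working until they refresh.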
| Concern | Config + SSM Automation | Secrets Manager rotation |
|---|---|---|
| What it fixes | Misconfigured resources (public buckets, open security groups, missing encryption) | Stale/long-lived credentials |
| Trigger | Continuous rule evaluation | Time-based schedule (e.g., every 30 days) |
| Remediation actor | SSM Automation document | Managed or custom rotation Lambda |
| Zero-downtime requirement | N/A (config change) | Yes: old and new secret both valid during rotation window |
Exam Domains and How to Study Them
The DOP-C02 exam is organized into six domains, and the weighting itself tells you where to spend your study time:
| Domain | Approx. weighting | Core theme |
|---|---|---|
| SDLC Automation | ~22% | CI/CD pipeline design, deployment strategies, artifact management |
| Configuration Management & IaC | ~17% | CloudFormation/CDK/StackSets, SSM, immutable infrastructure |
| Resilient Cloud Solutions | ~15% | HA architecture, multi-region DR, fault tolerance |
| Monitoring & Logging | ~15% | CloudWatch, X-Ray, centralized logging, alerting design |
| Incident & Event Response | ~14% | EventBridge automation, remediation, runbooks |
| Security & Compliance | ~17% | Compliance as code, secrets management, least privilege automation |
Notice that SDLC Automation and the combined Configuration Management/Security domains together account for over half the exam (22% + 17% + 17% = 56%). This maps directly to Steps 1, 2, and 5 of this guide, and it's why those areas deserve disproportionate study time relative to a naive "six domains, study equally" approach.
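The weighting arithmetic can be checked, and turned into a rough study budget, with a few lines. The weights come from the table above; `allocate_hours` and the 50-hour total are hypothetical, and proportional allocation is just one reasonable heuristic.

```python
# Approximate domain weightings (percent), taken from the table above.
WEIGHTS = {
    "SDLC Automation": 22,
    "Configuration Management & IaC": 17,
    "Resilient Cloud Solutions": 15,
    "Monitoring & Logging": 15,
    "Incident & Event Response": 14,
    "Security & Compliance": 17,
}

# Combined share of the three highest-weighted domains.
top_three_share = (
    WEIGHTS["SDLC Automation"]
    + WEIGHTS["Configuration Management & IaC"]
    + WEIGHTS["Security & Compliance"]
)


def allocate_hours(total_hours: float) -> dict:
    """Split a study budget in proportion to each domain's weighting."""
    return {domain: total_hours * pct / 100 for domain, pct in WEIGHTS.items()}


plan = allocate_hours(50)  # hypothetical 50-hour study budget
```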
Common Professional-Level Traps
A few patterns recur often enough to call out explicitly:
- Picking a technically correct but operationally wrong answer. Several answer choices in a given question will "work." The exam is testing which one matches AWS best practice for cost, blast radius, or operational overhead, not merely which one is technically feasible. If an option involves manual console steps repeated regularly, it's almost never the intended answer.
- Ignoring stated RTO/RPO constraints. A question that specifies "recovery within 15 minutes" is eliminating Backup & Restore and Pilot Light before you even read the answer choices. Read the constraint first, filter the strategy table, then evaluate options.
- Assuming full automation with no human checkpoint is always better. For destructive or security-sensitive actions, the exam frequently rewards designs that keep a human approval step, provided everything else is automated. Full unattended automation isn't automatically the "more mature" answer when the blast radius is high.
- Conflating detection with remediation. GuardDuty detects; it doesn't fix anything by itself. Config evaluates compliance; it doesn't remediate without an attached Automation document or a separate workflow. Expect questions that describe only half a solution and ask what's missing.
- Underestimating cross-account/cross-region plumbing. A correct architecture that "should" work often has a missing detail buried in the scenario: an un-shared KMS key, a Config rule that isn't organization-aggregated, a StackSet using self-managed permissions when the org has hundreds of accounts joining continuously. These details are usually the actual point of the question.
Study Approach
Spend real time in the console (or with CDK/CloudFormation locally) actually building a cross-account pipeline, a Config remediation rule, and a canary CodeDeploy deployment. This exam punishes purely theoretical preparation more than the associate-level exams do, because the scenarios assume you've felt the friction of these systems firsthand. Read AWS's own Well-Architected "Operational Excellence" and "Reliability" pillar whitepapers once, closely, near the end of your prep. By that point you'll recognize almost every principle from something you already built while working through this guide, and that recognition is what makes the exam's judgment-based questions go quickly instead of becoming coin flips.
Exam Focus: What Questions Test From This Step
- Static stability as a design principle: surviving control-plane degradation without needing it to recover first
- Matching a stated RTO/RPO to the correct multi-region DR strategy (Backup & Restore, Pilot Light, Warm Standby, Active-Active)
- Route 53 Application Recovery Controller versus plain DNS failover, and why ARC's own availability matters
- Aurora Global Database promotion as the fast path for relational DR versus snapshot restore
- Config rules with attached SSM Automation remediation versus a separate EventBridge-triggered remediation workflow
- Secrets Manager's native rotation contract (create/set/test/finish) versus building rotation yourself for Parameter Store
- Recognizing "detection without remediation" as an incomplete architecture in scenario questions
- Weighting study time toward SDLC Automation, Configuration Management/IaC, and Security, the highest-weighted domains
- Choosing operationally sound answers over merely technically-feasible ones, especially where blast radius or manual toil is involved