Step 5: High Availability, Disaster Recovery & Security Automation

This is the step where everything from the previous four converges. High availability, disaster recovery, and security automation aren't separate disciplines from CI/CD and observability; they're what CI/CD and observability are for. A pipeline that deploys safely, alarms that page the right person, and remediation that runs itself all exist in service of a system that keeps running and stays compliant without constant human intervention. Let's finish the picture, then talk about the exam itself.


High Availability Architecture Patterns at the Professional Level

Associate-level HA stops at "Multi-AZ RDS, ALB across zones." Professional-level HA reasons about failure domains explicitly and designs for graceful degradation, not just redundancy.

                Route 53 (health-checked failover / latency routing)
                                          │
            ┌─────────────────────────────┼─────────────────────────────┐
            ▼                             ▼                             ▼
   Region: us-east-1             Region: us-west-2             Region: eu-west-1
   (primary, active)             (active, active-active)       (DR, warm standby)
            │                             │                             │
   AZ-a / AZ-b / AZ-c            AZ-a / AZ-b / AZ-c            AZ-a / AZ-b
            │                             │                             │
   ALB → ASG → Aurora            ALB → ASG → Aurora            ALB → ASG (scaled down)
   Global Database replication ──────────────────────────────► (read replica,
                                                                promotable)

Static stability is the concept the exam leans on most heavily and the one most engineers underrate until they've lived through a regional event: a highly available system should survive the failure of a dependency (including a control-plane API) without needing that dependency to recover first. A classic example: an Auto Scaling Group that depends on calling an external configuration service during instance launch will fail to scale precisely when that service is degraded, which is often exactly when you need to scale. Statically stable designs pre-provision enough capacity, or cache configuration locally, so that a control-plane outage doesn't cascade into a data-plane outage.
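The pre-provisioning idea can be made concrete with a little arithmetic. The sketch below assumes a hypothetical peak of 12 instances and asks how many instances each AZ must already be running so that the surviving AZs can carry peak load with no scaling action at all. The function name and the numbers are illustrative, not taken from any AWS API.

```python
import math


def per_az_capacity(peak_instances: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Instances to keep running in each AZ so the survivors can carry peak load alone."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    # Round up: a fractional instance still means one more real instance.
    return math.ceil(peak_instances / surviving)


peak = 12  # hypothetical number of instances needed at peak
for azs in (2, 3):
    each = per_az_capacity(peak, azs)
    total = each * azs
    print(f"{azs} AZs: {each} per AZ, {total} total, {total - peak} spare at steady state")
```

The spare capacity is the price of not depending on the Auto Scaling control plane during the event. Spreading across more AZs reduces that overhead, which is one reason three-AZ layouts are common.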

Multi-Region strategies, ranked by cost and recovery speed:

| Strategy | RPO / RTO | Cost profile | Description |
|---|---|---|---|
| Backup & Restore | Hours (RPO/RTO) | Lowest | Snapshots/backups shipped to another region; infra provisioned from IaC only after disaster declared |
| Pilot Light | Tens of minutes | Low-moderate | Core data replicated continuously (e.g., RDS/Aurora replica); compute scaled to zero/minimal, scaled up on failover |
| Warm Standby | Minutes | Moderate-high | Scaled-down but fully functional replica stack running continuously; scale up on failover |
| Active-Active (Multi-Region) | Near zero | Highest | Full production capacity live in 2+ regions simultaneously, traffic distributed by Route 53 or Global Accelerator |

The exam consistently frames this as a cost-versus-recovery-time tradeoff, and the correct answer depends entirely on the stated RTO/RPO requirement in the scenario: pick the cheapest strategy that still meets it. Memorizing the table above without being able to map "we can tolerate 4 hours of downtime" to Backup & Restore, "recovery within tens of minutes at low cost" to Pilot Light, or "sub-minute failover, cost is not the primary constraint" to Active-Active won't get you through the harder scenario questions.
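The mapping from a stated requirement to a strategy can be sketched as a small lookup: walk the strategies from cheapest to most expensive and stop at the first one that meets both targets. The minute values below are illustrative assumptions chosen to match the rough bands in the table (hours, tens of minutes, minutes, near zero); they are not AWS-published figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed achievable recovery time
    rpo_minutes: float  # assumed achievable data-loss window


# Ordered cheapest first; all numbers are illustrative assumptions.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=240, rpo_minutes=240),
    Strategy("Pilot Light", rto_minutes=30, rpo_minutes=5),
    Strategy("Warm Standby", rto_minutes=5, rpo_minutes=1),
    Strategy("Active-Active", rto_minutes=1, rpo_minutes=1),
]


def cheapest_strategy(required_rto: float, required_rpo: float) -> Optional[str]:
    """Return the lowest-cost strategy meeting both targets, or None if none does."""
    for s in STRATEGIES:
        if s.rto_minutes <= required_rto and s.rpo_minutes <= required_rpo:
            return s.name
    return None


print(cheapest_strategy(required_rto=480, required_rpo=480))  # generous targets
print(cheapest_strategy(required_rto=30, required_rpo=15))    # tens of minutes
print(cheapest_strategy(required_rto=1, required_rpo=1))      # sub-minute failover
```

The point of the ordering is that a more expensive strategy is never the right answer when a cheaper one already satisfies the requirement.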


Automated Disaster Recovery Orchestration

Declaring a disaster and manually clicking through a console to bring up a DR region is itself a resilience failure. The professional pattern automates the failover sequence so it executes correctly under the stress of an actual outage.

CloudWatch/Route 53 health check fails on primary region
        │
        ▼
EventBridge rule / Route 53 Application Recovery Controller
        │
        ▼
Step Functions: "RegionalFailover"
    Step 1: Promote Aurora Global Database secondary → primary
    Step 2: Update Route 53 records (or ARC routing control) to
            direct traffic to DR region
    Step 3: Scale DR region ASGs from warm-standby to full capacity
    Step 4: Validate health checks pass in DR region
    Step 5: Notify stakeholders, open incident ticket automatically
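
The five steps above could be expressed as an Amazon States Language definition. The sketch below builds one as a Python dictionary, assuming each step is implemented by a Lambda function. The function names, account ID, and region in the ARNs are hypothetical placeholders, and a real workflow would likely add error handling (for example a Catch that pages a human).

```python
import json

# Hypothetical placeholder: account ID, region, and function names are assumptions.
LAMBDA_PREFIX = "arn:aws:lambda:eu-west-1:111122223333:function:"

# (state name, hypothetical Lambda function name), in execution order.
STEPS = [
    ("PromoteAuroraSecondary", "dr-promote-aurora"),
    ("ShiftTraffic", "dr-shift-traffic"),
    ("ScaleUpAsgs", "dr-scale-asgs"),
    ("ValidateHealth", "dr-validate-health"),
    ("NotifyStakeholders", "dr-notify"),
]


def build_failover_definition(steps=STEPS):
    """Chain the steps into a linear state machine definition."""
    states = {}
    for i, (state_name, function_name) in enumerate(steps):
        state = {
            "Type": "Task",
            "Resource": LAMBDA_PREFIX + function_name,
            # Retry transient task failures a couple of times before giving up.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 10,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }],
        }
        if i + 1 < len(steps):
            state["Next"] = steps[i + 1][0]
        else:
            state["End"] = True
        states[state_name] = state
    return {
        "Comment": "RegionalFailover (illustrative sketch)",
        "StartAt": steps[0][0],
        "States": states,
    }


definition = build_failover_definition()
print(json.dumps(definition, indent=2))
```

Keeping the sequence in a state machine rather than a script means each step's outcome is recorded, and a game day exercises exactly the same path as a real event.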

Route 53 Application Recovery Controller (ARC) is worth knowing specifically because it decouples the failover decision from DNS record edits made through the Route 53 control plane. It uses routing controls (simple on/off switches that drive Route 53 health checks) backed by a highly available cluster of its own, designed so the failover mechanism itself doesn't depend on the infrastructure that might be failing. Recognize ARC as the answer whenever a scenario emphasizes that the failover mechanism must remain available even during a broad regional impairment.
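
A minimal sketch of flipping a routing control while tolerating unreachable cluster endpoints follows. It assumes `clients` would be boto3 `route53-recovery-cluster` clients, one per cluster endpoint, whose `update_routing_control_state` call takes `RoutingControlArn` and `RoutingControlState`. To keep the example self-contained, a stub class stands in for those clients, and the ARN is a made-up placeholder.

```python
class StubClusterClient:
    """Stand-in for a boto3 'route53-recovery-cluster' client (demo assumption)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.calls = []

    def update_routing_control_state(self, RoutingControlArn, RoutingControlState):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        self.calls.append((RoutingControlArn, RoutingControlState))
        return {}


def set_routing_control(clients, routing_control_arn, state):
    """Try each cluster endpoint in turn; return the client that accepted the update."""
    if state not in ("On", "Off"):
        raise ValueError("state must be 'On' or 'Off'")
    failures = []
    for client in clients:
        try:
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,
            )
            return client
        except Exception as exc:  # any endpoint failure: move on to the next one
            failures.append(exc)
    raise RuntimeError(f"all {len(failures)} cluster endpoints failed: {failures}")


# Hypothetical placeholder ARN, not a real routing control.
ARN = "arn:aws:route53-recovery-control::111122223333:controlpanel/EXAMPLE/routingcontrol/EXAMPLE"
endpoints = [StubClusterClient("endpoint-1", healthy=False), StubClusterClient("endpoint-2")]
used = set_routing_control(endpoints, ARN, "On")
print("updated via", used.name)
```

Iterating over several endpoints is the design point: the update should succeed as long as any one of them is reachable.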

Aurora Global Database underpins most of these patterns for relational data: continuous replication to a secondary region with replication lag that is typically under a second, and a managed promotion API that turns the secondary into a fully writable primary during failover. That is much faster than restoring from a snapshot, which is why it dominates over backup-and-restore for anything with a meaningful RTO requirement.

The core testable idea: DR orchestration should be tested regularly (game days again, same FIS-driven discipline from Step 4) and triggered automatically or with a single controlled action, never a multi-hour manual runbook assembled from memory during the actual event.


Security Automation: Compliance as Code

Manually auditing hundreds of accounts for compliance doesn't scale, and it doesn't hold up as an exam answer either. The professional pattern is continuous, automated compliance evaluation with automated remediation for well-understood violations.

AWS Config Rule (managed or custom, e.g. "s3-bucket-public-read-prohibited")
        │
        ▼  evaluates resource configuration continuously
NON_COMPLIANT finding
        │
        ▼
EventBridge (Config compliance change event)
        │
        ▼
SSM Automation document: "RemediatePublicS3Bucket"
    - Automatically attached as a Config remediation action
    - Applies the bucket's public access block
    - Tags the resource with remediation timestamp
    - Notifies via SNS
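
The EventBridge hop in this flow hinges on an event pattern that selects only the relevant compliance changes. The sketch below shows such a pattern as a Python dictionary, along with a deliberately simplified matcher (exact-value lists and nested objects only, far less than EventBridge actually supports) so the behaviour can be checked locally. The sample event is hand-written for illustration and the bucket name is hypothetical.

```python
EVENT_PATTERN = {
    "source": ["aws.config"],
    "detail-type": ["Config Rules Compliance Change"],
    "detail": {
        "configRuleName": ["s3-bucket-public-read-prohibited"],
        "newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]},
    },
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must exist and hold an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hand-written sample event; the bucket name is a hypothetical placeholder.
sample_event = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "configRuleName": "s3-bucket-public-read-prohibited",
        "resourceType": "AWS::S3::Bucket",
        "resourceId": "example-public-bucket",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}

print(matches(EVENT_PATTERN, sample_event))
```

Filtering on `complianceType` in the pattern keeps the remediation target from being invoked when a resource moves back to COMPLIANT.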

Config supports automatic remediation directly on a rule: you attach an SSM Automation document to a Config rule, and non-compliant resources get remediated without a separate EventBridge rule at all. The EventBridge path is still useful when the response needs to be more elaborate than a single Automation document can handle, for example notifying a specific team based on resource tags, or escalating repeated violations differently from first-time ones.
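
As a sketch of the direct-attachment path, the dictionary below has the shape accepted by the Config `PutRemediationConfigurations` API (in boto3, `put_remediation_configurations(RemediationConfigurations=[...])`). It swaps in the AWS-managed Automation document `AWS-DisableS3BucketPublicReadWrite` rather than the custom "RemediatePublicS3Bucket" document from the diagram, and the role ARN is a hypothetical placeholder. No AWS call is made here.

```python
import json

# Hypothetical placeholder role that SSM Automation would assume.
AUTOMATION_ROLE_ARN = "arn:aws:iam::111122223333:role/ConfigRemediationRole"


def build_remediation(rule_name: str, role_arn: str) -> dict:
    """Build one remediation configuration entry for a Config rule."""
    return {
        "ConfigRuleName": rule_name,
        "TargetType": "SSM_DOCUMENT",
        # AWS-managed document, used here in place of a custom one.
        "TargetId": "AWS-DisableS3BucketPublicReadWrite",
        "Parameters": {
            "AutomationAssumeRole": {"StaticValue": {"Values": [role_arn]}},
            # RESOURCE_ID is substituted with the non-compliant resource's ID.
            "S3BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
        },
        "Automatic": True,              # remediate without a manual trigger
        "MaximumAutomaticAttempts": 3,  # illustrative retry budget
        "RetryAttemptSeconds": 60,
    }


remediation = build_remediation("s3-bucket-public-read-prohibited", AUTOMATION_ROLE_ARN)
print(json.dumps(remediation, indent=2))
```

Setting `Automatic` to `False` instead would leave the same document attached but require someone to trigger it, which can be a sensible first stage before trusting a remediation to run unattended.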

Secrets rotation automation is the other pillar here. Secrets Manager's native rotation rotates credentials on a schedule without application downtime. It is backed by a rotation Lambda function: one that Secrets Manager provides for RDS, Redshift, and DocumentDB, or a custom one you supply for anything else. The function follows a four-step contract (createSecret, setSecret, testSecret, finishSecret) that keeps both the old and new secret valid during the transition window. The exam wants you to recognize this as the default answer whenever a scenario mentions database credentials, API keys, or any credential that needs periodic rotation without manual intervention. It also expects you to know that Parameter Store has no built-in rotation: you would have to build the scheduling and Lambda invocation yourself via EventBridge, whereas Secrets Manager has rotation scheduling built in.
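
The four-step contract is easier to follow when simulated. The sketch below is an in-memory model, not the Secrets Manager API: `FakeSecretStore` and the `database` set of accepted passwords are assumptions made for illustration, and a real rotation function receives `(event, context)` and calls Secrets Manager instead. The event keys `Step` and `ClientRequestToken` and the staging labels AWSCURRENT, AWSPENDING, and AWSPREVIOUS do mirror the real contract.

```python
import secrets


class FakeSecretStore:
    """In-memory stand-in for secret versions and staging labels (illustrative only)."""

    def __init__(self, initial_value):
        self.versions = {"v1": initial_value}
        self.labels = {"AWSCURRENT": "v1"}

    def current(self):
        return self.versions[self.labels["AWSCURRENT"]]


def step_create(store, token, database):
    # createSecret: generate a new value and stage it as AWSPENDING (idempotent).
    store.versions.setdefault(token, secrets.token_urlsafe(16))
    store.labels["AWSPENDING"] = token


def step_set(store, token, database):
    # setSecret: make the target system accept the pending credential.
    database.add(store.versions[token])


def step_test(store, token, database):
    # testSecret: verify the pending credential actually works.
    if store.versions[token] not in database:
        raise RuntimeError("pending credential rejected by database")


def step_finish(store, token, database):
    # finishSecret: promote pending to current, keep the old one as previous.
    store.labels["AWSPREVIOUS"] = store.labels["AWSCURRENT"]
    store.labels["AWSCURRENT"] = token
    store.labels.pop("AWSPENDING", None)


STEPS = {
    "createSecret": step_create,
    "setSecret": step_set,
    "testSecret": step_test,
    "finishSecret": step_finish,
}


def handle(event, store, database):
    """Dispatch one rotation step, as the rotation function is invoked once per step."""
    STEPS[event["Step"]](store, event["ClientRequestToken"], database)


store = FakeSecretStore("old-password")
database = {"old-password"}  # set of passwords the fake database accepts
for step in ("createSecret", "setSecret", "testSecret", "finishSecret"):
    handle({"Step": step, "ClientRequestToken": "v2"}, store, database)

print("old credential still accepted:", "old-password" in database)
print("current version:", store.labels["AWSCURRENT"])
```

In this model the old password stays valid after rotation, which is what lets clients holding a cached credential keep working until they refresh.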

| Concern | Config + SSM Automation | Secrets Manager rotation |
|---|---|---|
| What it fixes | Misconfigured resources (public buckets, open security groups, missing encryption) | Stale/long-lived credentials |
| Trigger | Continuous rule evaluation | Time-based schedule (e.g., every 30 days) |
| Remediation actor | SSM Automation document | Managed or custom rotation Lambda |
| Zero-downtime requirement | N/A (config change) | Yes: old and new secret both valid during rotation window |

Exam Domains and How to Study Them

The DOP-C02 exam is organized into six domains, and the weighting itself tells you where to spend your study time:

| Domain | Approx. weighting | Core theme |
|---|---|---|
| SDLC Automation | ~22% | CI/CD pipeline design, deployment strategies, artifact management |
| Configuration Management & IaC | ~17% | CloudFormation/CDK/StackSets, SSM, immutable infrastructure |
| Resilient Cloud Solutions | ~15% | HA architecture, multi-region DR, fault tolerance |
| Monitoring & Logging | ~15% | CloudWatch, X-Ray, centralized logging, alerting design |
| Incident & Event Response | ~14% | EventBridge automation, remediation, runbooks |
| Security & Compliance | ~17% | Compliance as code, secrets management, least privilege automation |

Notice that SDLC Automation and the combined Configuration Management/Security domains together account for over half the exam (roughly 22% + 17% + 17% = 56%). This maps directly to Steps 1, 2, and 5 of this guide, and it's why those areas deserve disproportionate study time relative to a naive "six domains, study equally" approach.

Common Professional-Level Traps

A few patterns recur often enough to call out explicitly:

Study Approach

Spend real time in the console (or with CDK/CloudFormation locally) actually building a cross-account pipeline, a Config remediation rule, and a canary CodeDeploy deployment. This exam punishes purely theoretical preparation more than the associate-level exams do, because the scenarios assume you've felt the friction of these systems firsthand. Read AWS's own Well-Architected "Operational Excellence" and "Reliability" pillar whitepapers once, closely, near the end of your prep. By that point you'll recognize almost every principle from something you already built while working through this guide, and that recognition is what makes the exam's judgment-based questions go quickly instead of becoming coin flips.


Exam Focus: What Questions Test From This Step