Step 2: Configuration Management & Infrastructure as Code
Ask ten engineers what "infrastructure as code" means and you'll get ten answers that all stop short of what this exam wants. A single template that provisions a VPC is not the skill being tested here. The skill being tested is: can you manage infrastructure as code across dozens of accounts, keep it from drifting, and roll changes out without someone fat-fingering a stack update in production. Let's get into it.
CloudFormation Beyond the Basics
You already know the fundamentals: templates, resources, parameters, outputs. At the professional level, the exam cares about how templates compose and fail at scale.
Nested Stacks
A nested stack is a stack created as a resource inside a parent stack (AWS::CloudFormation::Stack), pointing at a child template stored in S3. The reason to use them isn't aesthetics; it's the 500-resource-per-stack limit and the desire to reuse common building blocks (a standard VPC, a standard logging setup) across many top-level stacks without copy-pasting YAML.
Parent Stack (app-stack.yaml)
- Resource: NetworkStack → nested-templates/vpc.yaml
- Resource: DatabaseStack → nested-templates/rds.yaml
- Resource: ComputeStack → nested-templates/asg.yaml
- Outputs pulled from nested stacks via GetAtt

Updates to a parent stack propagate to nested stacks automatically during a parent update, but a nested stack should not be updated on its own: AWS recommends running every nested-stack change through the parent, because a direct update leaves the child out of sync with what the parent expects. This trips people up during incident response: you shouldn't just patch the child stack in isolation.
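The parent/child layout above can be sketched as a template built in Python and dumped to JSON. This is a minimal illustration, not a production template: the bucket URL and the output names (VpcId, DbEndpoint) are hypothetical, and the child templates are assumed to declare those outputs.

```python
import json


def nested_stack(template_url, parameters=None):
    """Return an AWS::CloudFormation::Stack resource for a child template in S3."""
    properties = {"TemplateURL": template_url}
    if parameters:
        properties["Parameters"] = parameters
    return {"Type": "AWS::CloudFormation::Stack", "Properties": properties}


# Hypothetical bucket; the file names follow the layout described above.
BASE_URL = "https://example-templates-bucket.s3.amazonaws.com/nested-templates"

parent_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "NetworkStack": nested_stack(f"{BASE_URL}/vpc.yaml"),
        "DatabaseStack": nested_stack(
            f"{BASE_URL}/rds.yaml",
            # A child stack's outputs are read as "Outputs.<OutputName>".
            parameters={"VpcId": {"Fn::GetAtt": ["NetworkStack", "Outputs.VpcId"]}},
        ),
        "ComputeStack": nested_stack(
            f"{BASE_URL}/asg.yaml",
            parameters={
                "VpcId": {"Fn::GetAtt": ["NetworkStack", "Outputs.VpcId"]},
                "DbEndpoint": {"Fn::GetAtt": ["DatabaseStack", "Outputs.DbEndpoint"]},
            },
        ),
    },
}

print(json.dumps(parent_template, indent=2))
```

Because ComputeStack reads outputs from the other two stacks, CloudFormation infers the creation order from the GetAtt references without an explicit DependsOn.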
StackSets: Deploying Across Accounts and Regions
StackSets exist for exactly one problem: you have the same template and you need it applied consistently across many accounts and regions, with drift tracked centrally. Think: a baseline CloudTrail configuration, a standard IAM role for break-glass access, or a mandatory security group ruleset that every account in the organization must have.
Management/Delegated Admin Account
- StackSet: "org-baseline-security"
- Deployment targets: entire OU or account list, across selected regions
  - Account A (us-east-1, eu-west-1): stack instance
  - Account B (us-east-1): stack instance

Two deployment models matter for the exam:
- Self-managed permissions: you manually create AWSCloudFormationStackSetAdministrationRole in the administrator account and AWSCloudFormationStackSetExecutionRole in each target account, with a trust relationship between them. Older pattern, more control, more setup.
- Service-managed permissions: StackSets integrates directly with AWS Organizations. Deploy to an entire OU, and with automatic deployment enabled, new accounts that join the OU automatically get the stack instance. This is the pattern to reach for whenever a scenario mentions "new accounts should automatically receive baseline resources."
Concurrent deployment controls: MaxConcurrentCount / MaxConcurrentPercentage and FailureToleranceCount / FailureTolerancePercentage govern how fast a StackSet rolls out and how many failures are tolerated before it stops. This is the same "blast radius control" philosophy as CodeDeploy traffic shifting, just applied to infrastructure changes across accounts instead of application traffic across instances. Expect a question that asks you to prevent a bad template from being applied to all 200 accounts in an organization simultaneously. The answer is tuning these tolerance settings, not a manual approval step (StackSets doesn't have one natively; that gate lives in the pipeline that triggers the StackSet update).
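As a rough sketch of how the service-managed model and the rollout controls fit together, the functions below create a StackSet and deploy it with conservative operation preferences. The `cfn` argument is assumed to be a boto3 CloudFormation client created in the management or delegated administrator account; the function names and default values are illustrative choices, not recommendations from the exam guide.

```python
def cautious_preferences(max_concurrent_count=1, failure_tolerance_count=0):
    """Operation preferences that keep a bad template from reaching many accounts."""
    return {
        # Finish one region before starting the next.
        "RegionConcurrencyType": "SEQUENTIAL",
        # Deploy to at most this many accounts at a time within a region.
        "MaxConcurrentCount": max_concurrent_count,
        # Stop the operation in a region once failures exceed this number.
        "FailureToleranceCount": failure_tolerance_count,
    }


def create_org_baseline(cfn, template_url, ou_ids, regions):
    """Create a service-managed StackSet and roll it out to the given OUs."""
    cfn.create_stack_set(
        StackSetName="org-baseline-security",
        TemplateURL=template_url,
        PermissionModel="SERVICE_MANAGED",
        # Accounts that later join a targeted OU receive the stack automatically.
        AutoDeployment={"Enabled": True, "RetainStacksOnAccountRemoval": False},
    )
    response = cfn.create_stack_instances(
        StackSetName="org-baseline-security",
        DeploymentTargets={"OrganizationalUnitIds": ou_ids},
        Regions=regions,
        OperationPreferences=cautious_preferences(),
    )
    return response["OperationId"]
```

With boto3 installed and credentials configured, a call might look like `create_org_baseline(boto3.client("cloudformation"), template_url, ["ou-..."], ["us-east-1"])`. Passing the client in rather than creating it inside the function also makes the logic easy to exercise against a stub.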
Custom Resources
When CloudFormation doesn't natively support something (registering a third-party SaaS webhook, looking up an AMI ID dynamically, or running a one-time data migration), a custom resource backs the resource with a Lambda function (or, less commonly now, an SNS topic). The Lambda receives a Create/Update/Delete event from CloudFormation and must send a response signal back to a pre-signed S3 URL. The single most common real-world bug here, and a favorite exam trap, is a custom resource Lambda that fails silently and never sends a response, leaving the stack stuck in CREATE_IN_PROGRESS for the full one-hour timeout. Always wrap custom resource logic in the cfn-response module or the newer Provider framework in CDK, which handles this for you.
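To make the response-signal requirement concrete, here is a minimal hand-rolled handler using only the standard library. It is a sketch of what cfn-response does for you, not a replacement for it; `lookup_ami` and its return value are hypothetical placeholders for the real work.

```python
import json
import urllib.request


def send_response(event, context, status, data=None, reason=None):
    """PUT the result to the pre-signed S3 URL that CloudFormation supplied."""
    body = json.dumps({
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or f"See log stream {context.log_stream_name}",
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        # An empty Content-Type stops urllib from adding a form-encoded default.
        headers={"Content-Type": ""},
    )
    urllib.request.urlopen(request, timeout=10)


def lookup_ami(event):
    """Hypothetical placeholder for the real work (AMI lookup, webhook, etc.)."""
    if event["RequestType"] == "Delete":
        return {}
    return {"AmiId": "ami-placeholder"}


def handler(event, context, do_work=lookup_ami, respond=send_response):
    """Lambda entry point: success and failure both end in a response."""
    try:
        data = do_work(event)
    except Exception as exc:
        # Report the failure instead of letting the stack wait for the timeout.
        respond(event, context, "FAILED", reason=str(exc))
        return
    respond(event, context, "SUCCESS", data=data)
```

The try/except covers exceptions raised by the work itself. A Lambda timeout still ends the function before any response is sent, so the function's own timeout and the work it attempts need to be sized with that in mind.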
Drift Detection
Drift detection compares the live state of stack resources against the template's expected state and flags anything changed out-of-band (a security group rule added manually in the console, an S3 bucket policy edited directly). It's not continuous: you trigger it on a stack or a StackSet. It's also not automatic remediation, just detection and reporting. Pair it with EventBridge (Config's ConfigurationItemChangeNotification or scheduled drift detection runs) if a scenario wants proactive alerting rather than manual checks.
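The trigger-then-poll nature of drift detection can be sketched as follows. `cfn` is assumed to be a boto3 CloudFormation client; the function name, polling interval, and poll limit are illustrative, and pagination of the drift results is left out for brevity.

```python
import time


def find_drifted_resources(cfn, stack_name, poll_seconds=5, max_polls=60):
    """Start drift detection, wait for it to finish, and list drifted resources."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection for {stack_name} did not finish")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))
    if status["StackDriftStatus"] != "DRIFTED":
        return []

    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )
    # Report only; nothing here changes the drifted resources.
    return [
        (d["LogicalResourceId"], d["StackResourceDriftStatus"])
        for d in drifts["StackResourceDrifts"]
    ]
```

A scheduled job could call this and publish any non-empty result to a notification channel, which is one way to get the proactive alerting that drift detection does not provide on its own.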
AWS CDK: Concepts You're Expected to Know
DOP-C02 won't ask you to write TypeScript, but it will test whether you understand CDK's relationship to CloudFormation. CDK is not a replacement for CloudFormation; it's a code-first authoring layer that synthesizes CloudFormation templates. When you run cdk deploy, CDK synthesizes your app into one or more CloudFormation templates and asset bundles, then hands them to CloudFormation to actually provision. Every CDK deployment is still a CloudFormation stack under the hood, which means all the StackSets, drift detection, and change set behavior you just learned still applies.
Key vocabulary:
- Construct: the basic building block; L1 constructs map 1:1 to CloudFormation resources, L2 constructs add sane defaults and convenience methods, L3 constructs ("patterns") compose multiple resources into a common architecture (e.g., an ALB-fronted Fargate service in a handful of lines).
- Stack: a unit of deployment, same concept as a CloudFormation stack, just defined in code.
- App: the root construct that can contain multiple stacks, potentially targeting different accounts/regions.
- cdk synth / cdk diff / cdk deploy: synth previews the generated template, diff compares it against what's deployed, deploy applies it. Expect exam scenarios where "review infrastructure changes before applying them in a pipeline" maps to running cdk diff (or an equivalent CloudFormation change set) as a pipeline stage before the deploy action.
CDK Pipelines (a construct library) automates the self-mutating pipeline pattern: a CDK app can define its own CodePipeline, and when the pipeline definition itself changes, the pipeline updates itself before deploying the rest of the application stacks. This is a subtle but testable point: the pipeline is infrastructure too, and it deploys itself first.
Systems Manager for Fleet Configuration Management
Once you have more than a few dozen instances, "SSH in and fix it" stops being a strategy. Systems Manager (SSM) is the fleet-wide control plane, and DOP-C02 tests several of its capabilities specifically:
| SSM Capability | What it solves |
|---|---|
| Run Command | Execute ad-hoc or scheduled commands across a fleet without SSH/RDP access |
| State Manager | Continuously enforce a desired configuration (e.g., ensure an antivirus agent is always running) on a schedule |
| Patch Manager | Automated OS/application patching with maintenance windows and patch baselines |
| Automation | Runbook-style multi-step operational workflows (documents), often triggered by EventBridge |
| Parameter Store | Hierarchical config and secret storage, referenced directly from CloudFormation and CodeBuild |
| Session Manager | Shell access to instances without opening SSH ports or managing bastion hosts |
| Fleet Manager | Console-based fleet inventory and management view |
| Inventory | Collects metadata (installed packages, running services) across the fleet for compliance querying |
Maintenance Windows let you schedule Patch Manager or Run Command tasks during defined low-traffic periods, with concurrency and error-threshold controls that mirror the same blast-radius philosophy you saw in StackSets and CodeDeploy: patch 10% of instances, stop if too many fail health checks.
The SSM Agent must be installed, and an instance profile with the right managed policy (AmazonSSMManagedInstanceCore) must be attached, for any of this to work. That detail shows up in troubleshooting-style questions ("instances aren't appearing in Fleet Manager").
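The same rate controls are available on a direct Run Command invocation, which makes for a compact illustration. In this sketch `ssm` is assumed to be a boto3 SSM client, and the function name, tag values, and percentages are illustrative only.

```python
def run_on_tagged_fleet(ssm, tag_key, tag_value, commands,
                        max_concurrency="10%", max_errors="5%"):
    """Run shell commands on every managed instance carrying the given tag."""
    response = ssm.send_command(
        # Target by tag rather than by instance ID so new instances are included.
        Targets=[{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
        # Both limits are strings: an absolute count ("5") or a percentage ("10%").
        MaxConcurrency=max_concurrency,
        MaxErrors=max_errors,
    )
    return response["Command"]["CommandId"]
```

Only instances that are registered with Systems Manager can be targeted, so an instance missing the agent or the instance profile simply never receives the command.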
Immutable vs. Mutable Infrastructure
| Mutable pattern | Immutable pattern |
|---|---|
| Instance launches once | New AMI/image built per release |
| Config management agent (Ansible/Chef/Puppet/SSM) applies changes in place | New ASG/task definition launched |
| Instance drifts over time if changes aren't tracked | Old instances terminated after cutover (blue/green) |

The professional exam has a clear bias: immutable infrastructure is the preferred pattern for production workloads because it eliminates configuration drift entirely. You never patch a running instance; you replace it with one built from a known-good image. Mutable patterns (SSM State Manager enforcing config on long-lived instances) are still valid, particularly for stateful or legacy fleets where rebuilding isn't practical, but expect the "best practice" answer to lean immutable whenever the scenario allows it.
Golden AMI Pipelines with EC2 Image Builder
EC2 Image Builder formalizes the "golden AMI" pattern that used to require hand-rolled Packer scripts and cron jobs. The pipeline:
| Stage | Details |
|---|---|
| 1. Source image | Base AMI (Amazon Linux 2023) |
| 2. Build component(s) | Install agent, apply hardening, bake app runtime |
| 3. Test component(s) | Vulnerability scan, smoke test |
| 4. Distribution | Copy AMI to multiple regions, share to other accounts/OUs |

Image Builder runs on a schedule (or triggered by a new base AMI release via EventBridge) and produces a versioned, tested AMI automatically, with no more "who built this AMI and what's on it" archaeology. Distribution settings push the finished AMI to every region and account that needs it in one pipeline run, which is exactly the multi-account concern this whole step keeps circling back to.
Pair this with Auto Scaling Group instance refresh: once a new golden AMI is published, an instance refresh gradually replaces running instances with new ones launched from the updated AMI, respecting a minimum healthy percentage. Again, it is the same controlled-blast-radius rollout pattern, just applied at the AMI layer.
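A minimal sketch of starting such a refresh is shown below. It assumes `autoscaling` is a boto3 Auto Scaling client and that the group's launch template already resolves to the new AMI (for example, through a new default version); the function name and default values are illustrative.

```python
def roll_out_new_ami(autoscaling, asg_name, min_healthy_pct=90, warmup_seconds=300):
    """Start a rolling instance refresh and return its ID for later status checks."""
    response = autoscaling.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Strategy="Rolling",
        Preferences={
            # Keep at least this share of the group in service during the refresh.
            "MinHealthyPercentage": min_healthy_pct,
            # Seconds to wait before a new instance counts toward healthy capacity.
            "InstanceWarmup": warmup_seconds,
        },
    )
    return response["InstanceRefreshId"]
```

A higher minimum healthy percentage replaces fewer instances at a time, which slows the rollout but narrows the impact of a bad image.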
Exam Focus: What Questions Test From This Step
- When to use nested stacks (resource limit, reuse) versus StackSets (cross-account/region consistency)
- Service-managed vs. self-managed StackSet permissions, and which one auto-applies to new OU member accounts
- StackSet failure tolerance and concurrency settings as the mechanism for limiting blast radius of a bad template
- Custom resource Lambda behavior: the response-signal requirement and what happens if it's omitted
- Drift detection as a detection-only tool, not automated remediation
- CDK's relationship to CloudFormation: constructs synthesize templates, cdk diff for pre-deploy review, self-mutating CDK Pipelines
- Matching an SSM capability (Run Command, State Manager, Patch Manager, Automation, Session Manager) to a fleet management scenario
- Immutable infrastructure as the preferred professional-level pattern, and when mutable configuration management is still the right call
- Golden AMI pipelines via EC2 Image Builder combined with ASG instance refresh for fleet-wide rollout