DOP-C02 Step 4: Event-Driven Automation & Incident Response on AWS

Step 4 of 5 · updated 2026


Step 4: Incident & Event Response

There's a particular kind of relief in watching an alarm fire, a remediation kick off automatically, and the incident close itself out before you've even finished reading the Slack notification. That's the state this step is building toward. Not "monitoring exists," but "the system heals itself for the failure modes you've already seen before, and hands you a clean starting point for the ones you haven't."


EventBridge as the Nervous System

Every meaningful state change in an AWS account (an instance stopping, a CodeDeploy deployment failing, a GuardDuty finding, a CloudWatch alarm flipping to ALARM) is available as an event. EventBridge is the router that decides what happens next, and the professional exam expects you to design with it as the default glue, not as an optional extra.

                                EventBridge Bus
Event Sources               (default or custom bus)          Targets
─────────────────────     ──────────────────────────    ─────────────────────
CloudWatch Alarm    ──┐                                 ┌──► Lambda (remediation)
CodeDeploy status   ──┤                                 ├──► SSM Automation runbook
GuardDuty finding   ──┼──► Rule: event pattern match ──►┼──► SNS (page on-call)
Config rule result  ──┤    (source, detail-type,        ├──► Step Functions
Auto Scaling event  ──┘     detail field filters)       └──► Another account's bus
                                                             (cross-account)

A rule matches on an event pattern: a JSON structure matched against the event's source, detail-type, and arbitrary fields inside detail. This is more expressive than it sounds. You can write a rule that only fires for GuardDuty findings above a certain severity, or only for CodeDeploy deployments that failed in a specific application, without writing a single line of Lambda filtering logic. Push the filtering into the rule pattern whenever possible: it's cheaper, it's declarative, and it keeps your Lambda function focused on the actual remediation instead of on deciding whether it should run at all.
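As a rough illustration of pushing the filter into the rule, the pattern below matches only GuardDuty findings with severity 7 or higher. The `source`, `detail-type`, and `numeric` keys follow EventBridge's event pattern syntax. The `matches` helper is an assumption added for illustration: a deliberately simplified local stand-in for EventBridge's matcher that understands only exact values and `numeric` comparisons, so the behavior can be checked without an AWS account.

```python
import json
import operator

# Event pattern: GuardDuty findings with severity >= 7.
GUARDDUTY_HIGH_SEVERITY = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}

_OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
        ">": operator.gt, ">=": operator.ge}


def _value_matches(candidate, value):
    """Check one entry of a pattern's value list against an event value."""
    if isinstance(candidate, dict) and "numeric" in candidate:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        spec = candidate["numeric"]
        # spec is a flat list of (operator, operand) pairs; all must hold.
        pairs = zip(spec[0::2], spec[1::2])
        return all(_OPS[op](value, operand) for op, operand in pairs)
    return candidate == value


def matches(pattern, event):
    """Simplified local matcher (hypothetical helper, not an AWS API)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding event object.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not any(_value_matches(c, event[key]) for c in expected):
            return False
    return True


if __name__ == "__main__":
    print(json.dumps(GUARDDUTY_HIGH_SEVERITY, indent=2))
```

Because the severity check lives in the pattern, a target Lambda attached to this rule never sees low-severity findings and needs no filtering code of its own.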

Custom event buses matter for cross-account and cross-organization event routing: a security-tooling account can have its own bus that receives forwarded GuardDuty/Security Hub events from every member account, using a resource policy on the bus rather than per-account Lambda polling. This is structurally identical to the centralized logging pattern from Step 3: centralize the signal, fan out the response from one place.
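A minimal sketch of what such a bus resource policy might look like, assuming the member accounts all belong to one AWS Organization. The bus ARN, account ID, and organization ID are placeholder values, and `org_put_events_policy` is a hypothetical helper name; only `events:PutEvents` and the `aws:PrincipalOrgID` condition key are real AWS identifiers.

```python
import json


def org_put_events_policy(bus_arn: str, org_id: str) -> dict:
    """Build a resource policy letting any account in one organization send events to the bus."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOrgMembersToPutEvents",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "events:PutEvents",
                "Resource": bus_arn,
                # Restrict the wildcard principal to callers inside the organization.
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = org_put_events_policy(
        "arn:aws:events:us-east-1:111122223333:event-bus/security-events",  # placeholder
        "o-exampleorgid",  # placeholder
    )
    print(json.dumps(policy, indent=2))
```

Each member account would still need its own rule with the central bus as a target; the policy only controls who is allowed to deliver events.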

Scheduled rules (cron or rate expressions) are also EventBridge rules, just triggered by time instead of an event. They are the modern replacement for a cron job on an EC2 instance, useful for periodic compliance checks or scheduled Automation runbook execution.
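For reference, a small sketch of the two schedule expression forms. The helper name `rate_expression` is hypothetical; the syntax it produces reflects the EventBridge rule that a value of 1 takes a singular unit (`rate(1 hour)`) while larger values take the plural (`rate(5 minutes)`). Cron expressions use six fields (minutes, hours, day-of-month, month, day-of-week, year) and are evaluated in UTC.

```python
# Six-field EventBridge cron: every day at 02:00 UTC.
# One of day-of-month / day-of-week must be "?".
NIGHTLY_COMPLIANCE_CHECK = "cron(0 2 * * ? *)"

_UNITS = ("minute", "hour", "day")


def rate_expression(value: int, unit: str) -> str:
    """Build a rate() schedule expression with the correct singular/plural unit."""
    if unit not in _UNITS:
        raise ValueError(f"unit must be one of {_UNITS}, got {unit!r}")
    if value < 1:
        raise ValueError("value must be a positive integer")
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


if __name__ == "__main__":
    print(rate_expression(5, "minute"))
    print(rate_expression(1, "day"))
    print(NIGHTLY_COMPLIANCE_CHECK)
```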


Automated Remediation: SSM Automation and Lambda

Once EventBridge routes the signal, something has to act on it. Two tools dominate here, and the exam wants you to pick correctly between them.

SSM Automation documents are declarative, multi-step runbooks: a sequence of steps (aws:runCommand, aws:invokeLambdaFunction, aws:executeAwsApi, aws:approve for a human gate mid-runbook) with built-in retry, rollback-on-failure steps, and execution history you can audit later. Reach for Automation when the remediation is a well-defined operational procedure: restart a stuck service, detach an unhealthy instance from an ASG and replace it, rotate a credential, or resize an EBS volume that's about to fill up.

Lambda functions are the right tool when the remediation logic is more bespoke: parsing a specific finding's structure, making a conditional decision based on data you'd otherwise have to fetch from another API, or orchestrating a response that doesn't map cleanly to existing SSM Automation actions. In practice, most real systems use both together: EventBridge triggers a Lambda, and the Lambda's actual remediation step is to start an SSM Automation execution, because the Automation document gives you the audit trail and retry semantics that raw Lambda code has to reimplement by hand.
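The "Lambda starts an Automation execution" handoff can be sketched roughly as follows. This assumes the triggering event is a GuardDuty finding for an EC2 instance and that a runbook with an `InstanceId` parameter already exists; the runbook name and the optional `ssm_client` argument (added so the handler can be exercised with a stub) are illustrative choices, not part of any AWS contract.

```python
DOCUMENT_NAME = "IsolateAndSnapshotInstance"  # illustrative runbook name


def extract_instance_id(event: dict) -> str:
    """Pull the EC2 instance ID out of a GuardDuty finding event."""
    return event["detail"]["resource"]["instanceDetails"]["instanceId"]


def handler(event, context, ssm_client=None):
    """Lambda entry point: hand the remediation off to SSM Automation."""
    if ssm_client is None:
        import boto3  # imported lazily so the module loads without the AWS SDK

        ssm_client = boto3.client("ssm")

    instance_id = extract_instance_id(event)
    # Automation parameters are passed as a map of name -> list of strings.
    response = ssm_client.start_automation_execution(
        DocumentName=DOCUMENT_NAME,
        Parameters={"InstanceId": [instance_id]},
    )
    return {
        "instanceId": instance_id,
        "executionId": response["AutomationExecutionId"],
    }
```

The Lambda stays small on purpose: it decides what to remediate, and the Automation execution ID it returns is the pointer to the audit trail.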

GuardDuty: EC2 instance i-0abc communicating with known crypto-mining C2 IP
        │
        ▼
EventBridge rule (source: aws.guardduty, severity >= 7)
        │
        ▼
Lambda: "TriageFinding"
  - looks up instance owner/tags
  - checks if instance is in an auto-remediate allowlist
        │
        ▼
SSM Automation: "IsolateAndSnapshotInstance"
  Step 1: Detach instance from Auto Scaling Group
  Step 2: Apply "quarantine" security group (no egress)
  Step 3: Create EBS snapshot / AMI for forensics
  Step 4: aws:approve (human confirms termination)
  Step 5: Terminate instance, launch replacement from golden AMI
        │
        ▼
SNS: notify security channel with runbook execution link

Notice the human approval step embedded mid-runbook. Full auto-remediation without a human checkpoint is appropriate for well-understood, low-risk actions (restart a service, scale out capacity). For anything destructive or security-sensitive, the professional-level answer keeps a human in the loop at the point of highest consequence, while still automating everything mechanical around that decision.
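To make the placement of the approval gate concrete, here is a hedged sketch of an Automation document body built as a Python dictionary. The step names, parameter names, and the `build_isolation_runbook` helper are assumptions for illustration; the `aws:executeAwsApi` and `aws:approve` actions and schema version 0.3 are real, but the inputs should be checked against the Systems Manager documentation before use. Replacement capacity is left to the Auto Scaling group here rather than launched explicitly.

```python
import json


def build_isolation_runbook(approver_arn: str, quarantine_sg: str) -> dict:
    """Sketch of an isolate-then-approve-then-terminate Automation document."""

    def step(name, action, inputs):
        return {"name": name, "action": action, "inputs": inputs}

    return {
        "schemaVersion": "0.3",
        "description": "Isolate a suspect instance, capture forensics, then wait for approval.",
        "parameters": {
            "InstanceId": {"type": "String"},
            "AutoScalingGroupName": {"type": "String"},
        },
        "mainSteps": [
            step("DetachFromAsg", "aws:executeAwsApi", {
                "Service": "autoscaling",
                "Api": "DetachInstances",
                "InstanceIds": ["{{ InstanceId }}"],
                "AutoScalingGroupName": "{{ AutoScalingGroupName }}",
                # False asks the group to launch a replacement instance.
                "ShouldDecrementDesiredCapacity": False,
            }),
            step("ApplyQuarantineGroup", "aws:executeAwsApi", {
                "Service": "ec2",
                "Api": "ModifyInstanceAttribute",
                "InstanceId": "{{ InstanceId }}",
                "Groups": [quarantine_sg],
            }),
            step("CreateForensicImage", "aws:executeAwsApi", {
                "Service": "ec2",
                "Api": "CreateImage",
                "InstanceId": "{{ InstanceId }}",
                "Name": "forensics-{{ InstanceId }}",
                "NoReboot": True,
            }),
            # Human gate: nothing destructive happens before this step succeeds.
            step("ApproveTermination", "aws:approve", {
                "Approvers": [approver_arn],
                "MinRequiredApprovals": 1,
                "Message": "Approve termination of quarantined instance {{ InstanceId }}?",
            }),
            step("TerminateInstance", "aws:executeAwsApi", {
                "Service": "ec2",
                "Api": "TerminateInstances",
                "InstanceIds": ["{{ InstanceId }}"],
            }),
        ],
    }


if __name__ == "__main__":
    doc = build_isolation_runbook(
        "arn:aws:iam::111122223333:role/SecurityOnCall",  # placeholder approver
        "sg-0123456789abcdef0",  # placeholder quarantine security group
    )
    print(json.dumps(doc, indent=2))
```

Everything before `ApproveTermination` is reversible containment; the only irreversible call sits after the gate.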


Comparing Remediation Approaches

| Approach | Best for | Auditability | Typical trigger |
|---|---|---|---|
| SSM Automation runbook | Standardized, repeatable operational procedures | Built-in execution history, step-by-step | EventBridge rule, Config rule non-compliance, manual |
| Lambda function | Custom logic, conditional branching, API orchestration | Requires your own logging (CloudWatch Logs) | EventBridge rule, direct API/SDK invocation |
| Step Functions state machine | Long-running, multi-service workflows with complex branching/wait states | Visual execution history in console | EventBridge rule, another Step Functions execution |
| CodeDeploy automatic rollback | Rolling back a bad deployment specifically | Deployment event history | CloudWatch alarm attached to deployment group |

Step Functions deserves a specific callout here: when a remediation workflow needs to wait on an external condition, branch on multiple outcomes, or coordinate several services over a timeframe longer than a single Lambda's timeout allows, Step Functions is the answer over a long Lambda function or a deeply nested SSM Automation document. It's the tool for "this incident response has several possible paths depending on what we find along the way."
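A minimal sketch of a wait-and-branch workflow in Amazon States Language, expressed as a Python dictionary. It assumes a hypothetical health-check Lambda that returns `{"status": ..., "attempts": n}`; the state names, the five-attempt limit, and the `dangling_targets` sanity check are illustrative additions, not part of Step Functions itself.

```python
import json


def build_remediation_state_machine(check_fn_arn: str, topic_arn: str) -> dict:
    """Poll a health check, wait between attempts, and escalate after five tries."""
    return {
        "Comment": "Wait-and-branch remediation sketch",
        "StartAt": "CheckHealth",
        "States": {
            "CheckHealth": {
                "Type": "Task",
                "Resource": check_fn_arn,
                "ResultPath": "$.health",
                "Next": "IsHealthy",
            },
            "IsHealthy": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.health.status", "StringEquals": "HEALTHY", "Next": "Resolved"},
                    {"Variable": "$.health.attempts", "NumericGreaterThanEquals": 5, "Next": "Escalate"},
                ],
                "Default": "WaitBeforeRetry",
            },
            # A Wait state costs nothing while idle, unlike a sleeping Lambda.
            "WaitBeforeRetry": {"Type": "Wait", "Seconds": 300, "Next": "CheckHealth"},
            "Escalate": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    "Message": "Automated remediation did not restore health. Paging on-call.",
                },
                "End": True,
            },
            "Resolved": {"Type": "Succeed"},
        },
    }


def dangling_targets(definition: dict) -> list:
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    for state in states.values():
        if "Next" in state:
            targets.append(state["Next"])
        if "Default" in state:
            targets.append(state["Default"])
        targets.extend(choice["Next"] for choice in state.get("Choices", []))
    return sorted({t for t in targets if t not in states})


if __name__ == "__main__":
    definition = build_remediation_state_machine(
        "arn:aws:lambda:us-east-1:111122223333:function:CheckServiceHealth",  # placeholder
        "arn:aws:sns:us-east-1:111122223333:oncall-pages",  # placeholder
    )
    print(json.dumps(definition, indent=2))
    print("dangling targets:", dangling_targets(definition))
```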


Incident Response Runbook Automation

The instinct to avoid is treating a runbook as a document a human reads during an incident. At the professional level, a runbook is code: an SSM Automation document or Step Functions definition, version-controlled, tested in a non-production account, and invoked automatically or with one click, not transcribed manually from a wiki page while a service is down.

Practical patterns worth knowing:


Post-Incident Practices

Once the fire is out, the professional-level differentiator is what happens next, and this shows up on the exam as much as the automation itself does.

Root cause analysis workflows. The mature pattern isn't a meeting three weeks later. It's capturing the data needed for RCA automatically at incident time: X-Ray traces around the incident window, the specific CloudWatch Logs Insights query that isolates the failing requests, and the CloudFormation/CodePipeline change history for what deployed in the preceding hours. A well-designed system correlates "what changed" (CloudTrail, CodePipeline execution history) against "when did it break" (the CloudWatch alarm timestamp) automatically, because that correlation is usually the entire root cause.
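The "what changed" versus "when did it break" correlation can be sketched in a few lines of plain Python. This assumes change records have already been collected from sources such as CloudTrail or CodePipeline into a simple list; the `Change` type, the `suspect_changes` function, and the six-hour lookback are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple


class Change(NamedTuple):
    source: str        # e.g. "CodePipeline" or "CloudTrail"
    description: str
    at: datetime


def suspect_changes(changes: List[Change], alarm_time: datetime,
                    lookback: timedelta = timedelta(hours=6)) -> List[Change]:
    """Return changes made within the lookback window before the alarm, newest first."""
    window_start = alarm_time - lookback
    in_window = [c for c in changes if window_start <= c.at <= alarm_time]
    # The most recent change before the alarm is usually the first suspect.
    return sorted(in_window, key=lambda c: c.at, reverse=True)


if __name__ == "__main__":
    alarm = datetime(2026, 1, 15, 14, 30, tzinfo=timezone.utc)
    history = [
        Change("CodePipeline", "deploy api-service", alarm - timedelta(minutes=20)),
        Change("CloudTrail", "security group rule modified", alarm - timedelta(hours=2)),
        Change("CodePipeline", "deploy billing-service", alarm - timedelta(days=2)),
    ]
    for change in suspect_changes(history, alarm):
        print(change.at.isoformat(), change.source, change.description)
```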

Automated rollback triggers, revisited from Step 1: the tightest possible incident response loop is one where a bad deployment never becomes an incident that requires human RCA at all, because the CodeDeploy alarm-based rollback (or an AppConfig feature-flag rollback) already reverted it during the bake window. Every automation covered in this step exists to shrink the gap between detection and recovery. The goal is for the "incident" to be something engineers read about after the fact in an Automation execution log, not something they were paged for at 3 a.m.
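As a sketch of the alarm-based rollback wiring, the helper below builds the keyword arguments one might pass to CodeDeploy's `update_deployment_group` call. The application, deployment group, and alarm names are placeholders and `rollback_settings` is a hypothetical helper; the `alarmConfiguration` and `autoRollbackConfiguration` shapes and the two event names follow the CodeDeploy API.

```python
def rollback_settings(application: str, deployment_group: str, alarm_names: list) -> dict:
    """Keyword arguments enabling alarm-triggered automatic rollback on a deployment group."""
    return {
        "applicationName": application,
        "currentDeploymentGroupName": deployment_group,
        # Alarms that CodeDeploy watches while a deployment is in progress.
        "alarmConfiguration": {
            "enabled": True,
            "alarms": [{"name": name} for name in alarm_names],
        },
        # Roll back when the deployment fails or a watched alarm stops it.
        "autoRollbackConfiguration": {
            "enabled": True,
            "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
        },
    }


if __name__ == "__main__":
    kwargs = rollback_settings("api-service", "production", ["api-5xx-rate-high"])  # placeholders
    print(kwargs)
    # With AWS credentials available, this could be applied roughly as:
    #   import boto3
    #   boto3.client("codedeploy").update_deployment_group(**kwargs)
```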

Game days. AWS Fault Injection Service (FIS) lets you deliberately inject failure (terminate instances, throttle API calls, inject latency) in a controlled experiment to validate that your alarms, runbooks, and rollback automation actually work before a real incident tests them for you. Expect FIS to appear as the answer whenever a scenario asks how to verify resilience automation without waiting for a genuine outage.
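A hedged sketch of what a small FIS experiment template might contain: terminate one tagged instance, with a CloudWatch alarm as the stop condition so the experiment halts if real customer impact appears. The role ARN, alarm ARN, tag values, and the `build_fis_template` helper are placeholders; the field names follow the FIS `CreateExperimentTemplate` request shape, which is worth verifying against the current API reference.

```python
import json


def build_fis_template(role_arn: str, stop_alarm_arn: str,
                       tag_key: str = "Environment", tag_value: str = "staging") -> dict:
    """Experiment template: terminate one tagged EC2 instance, guarded by an alarm."""
    return {
        "description": "Game day: verify instance-loss alarms and runbooks",
        "roleArn": role_arn,
        "targets": {
            "oneTaggedInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "terminateInstance": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "oneTaggedInstance"},
            }
        },
        # If this alarm fires, FIS stops the experiment.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}],
    }


if __name__ == "__main__":
    template = build_fis_template(
        "arn:aws:iam::111122223333:role/FisExperimentRole",  # placeholder
        "arn:aws:cloudwatch:us-east-1:111122223333:alarm:customer-error-rate",  # placeholder
    )
    print(json.dumps(template, indent=2))
```

The stop condition is the part that makes this a controlled experiment rather than a self-inflicted outage.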


Exam Focus: What Questions Test From This Step