Step 4: Incident & Event Response
There's a particular kind of relief in watching an alarm fire, a remediation kick off automatically, and the incident close itself out before you've even finished reading the Slack notification. That's the state this step is building toward. Not "monitoring exists," but "the system heals itself for the failure modes you've already seen before, and hands you a clean starting point for the ones you haven't."
EventBridge as the Nervous System
Every meaningful state change in an AWS account (an instance stopping, a CodeDeploy deployment failing, a GuardDuty finding, a CloudWatch alarm flipping to ALARM) is available as an event. EventBridge is the router that decides what happens next, and the professional exam expects you to design with it as the default glue, not as an optional extra.
                              EventBridge Bus
    Event Sources          (default or custom bus)            Targets
    ------------------     --------------------------     ------------------
    CloudWatch Alarm   --+                             +--> Lambda (remediation)
    CodeDeploy status  --+                             +--> SSM Automation runbook
    GuardDuty finding  --+--> Rule: event pattern   ---+--> SNS (page on-call)
    Config rule result --+    match (source,           +--> Step Functions
    Auto Scaling event --+    detail-type, detail      +--> Another account's bus
                              field filters)                (cross-account)

A rule matches on an event pattern: a JSON structure filtered against the event's source, detail-type, and arbitrary fields inside detail. This is more expressive than it sounds: you can write a rule that only fires for GuardDuty findings above a certain severity, or only for CodeDeploy deployments that failed in a specific application, without writing a single line of Lambda filtering logic. Push the filtering into the rule pattern whenever possible. It's cheaper, it's declarative, and it keeps your Lambda function focused on the actual remediation instead of on deciding whether it should run at all.
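To make the pattern idea concrete, here is a minimal sketch. The pattern itself uses EventBridge's documented syntax for exact and numeric matching; the `matches` helper is a hypothetical local approximation that covers only those two operators, written purely so the filtering behavior can be seen without an AWS account.

```python
# EventBridge event pattern: GuardDuty findings with severity >= 7.
HIGH_SEVERITY_GUARDDUTY = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}

_OPS = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
    "=": lambda a, b: a == b,
}


def _value_matches(candidates, value):
    """True if the event value satisfies any entry in the pattern's list."""
    for cand in candidates:
        if isinstance(cand, dict) and "numeric" in cand:
            rules = cand["numeric"]  # e.g. [">=", 7] or [">", 0, "<=", 5]
            pairs = zip(rules[0::2], rules[1::2])
            if isinstance(value, (int, float)) and all(
                _OPS[op](value, bound) for op, bound in pairs
            ):
                return True
        elif cand == value:
            return True
    return False


def matches(pattern, event):
    """Hypothetical local matcher: exact and numeric matching only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):  # nested object, e.g. "detail"
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not _value_matches(expected, event[key]):
            return False
    return True
```

In a real account the pattern would be serialized to JSON and attached to a rule; the local matcher is only a reading aid and does not reproduce the full EventBridge matching semantics (prefix, anything-but, exists, and so on).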
Custom event buses matter for cross-account and cross-organization event routing: a security-tooling account can have its own bus that receives forwarded GuardDuty/Security Hub events from every member account, using a resource policy on the bus rather than per-account Lambda polling. This is structurally identical to the centralized logging pattern from Step 3: centralize the signal, fan out the response from one place.
Scheduled rules (cron or rate expressions) are also EventBridge rules, just triggered by time instead of an event. They are the modern replacement for a cron job on an EC2 instance, useful for periodic compliance checks or scheduled Automation runbook execution.
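A bus resource policy for this pattern might look like the sketch below. It assumes the member accounts all belong to one AWS Organization and uses the `aws:PrincipalOrgID` condition key to admit them; the bus ARN and organization ID are placeholders, and `build_org_bus_policy` is a hypothetical helper name.

```python
import json


def build_org_bus_policy(bus_arn, org_id):
    """Resource policy letting any principal in the organization put events
    onto the central bus. Both arguments are placeholders supplied by the caller."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOrgMembersToPutEvents",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "events:PutEvents",
                "Resource": bus_arn,
                # Restrict "*" to principals inside the organization.
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
            }
        ],
    }


policy = build_org_bus_policy(
    "arn:aws:events:us-east-1:111122223333:event-bus/security-events",
    "o-exampleorgid",
)
print(json.dumps(policy, indent=2))
```

Each member account would still need its own rule with the central bus as a target; the policy only covers the receiving side.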
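The two schedule expression forms are easy to get subtly wrong, so a small sketch may help. The helper names are hypothetical; the rules they encode (singular unit when the value is 1, six cron fields, and a `?` in exactly one of the two day fields) reflect the EventBridge schedule syntax as I understand it.

```python
def rate_expression(value, unit):
    """Build a rate() expression; the unit is singular only when value == 1."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError(f"unsupported unit: {unit}")
    if value < 1:
        raise ValueError("value must be a positive integer")
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def cron_expression(minute="0", hour="0", day_of_month="*",
                    month="*", day_of_week="?", year="*"):
    """Build a six-field cron() expression (times are evaluated in UTC)."""
    # Exactly one of the two day fields must be "?".
    if (day_of_month == "?") == (day_of_week == "?"):
        raise ValueError("use '?' in exactly one of day_of_month / day_of_week")
    return f"cron({minute} {hour} {day_of_month} {month} {day_of_week} {year})"


print(rate_expression(5, "minute"))    # periodic compliance check
print(cron_expression(hour="6"))       # daily runbook at 06:00 UTC
```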
Automated Remediation: SSM Automation and Lambda
Once EventBridge routes the signal, something has to act on it. Two tools dominate here, and the exam wants you to pick correctly between them.
SSM Automation documents are declarative, multi-step runbooks: a sequence of steps (aws:runCommand, aws:invokeLambdaFunction, aws:executeAwsApi, aws:approve for a human gate mid-runbook) with built-in retry, rollback-on-failure steps, and execution history you can audit later. Reach for Automation when the remediation is a well-defined operational procedure: restart a stuck service, detach and replace an unhealthy instance from an ASG, rotate a credential, or resize an EBS volume that's about to fill up.
Lambda functions are the right tool when the remediation logic is more bespoke: parsing a specific finding's structure, making a conditional decision based on data you'd otherwise have to fetch from another API, or orchestrating a response that doesn't map cleanly to existing SSM Automation actions. In practice, most real systems use both together: EventBridge triggers a Lambda, and the Lambda's actual remediation step is to start an SSM Automation execution, because the Automation document gives you the audit trail and retry semantics that raw Lambda code has to reimplement by hand.
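The "Lambda decides, Automation acts" split described above could be sketched as follows. The runbook name, the tag used as an allowlist, and the function name are all assumptions for illustration. The SSM client is passed in as an argument so the logic can be exercised without AWS; in a real function it would be `boto3.client("ssm")`, and the tags would come from an EC2 lookup.

```python
RUNBOOK_NAME = "IsolateAndSnapshotInstance"  # hypothetical Automation document
AUTO_REMEDIATE_TAG = "auto-remediate"        # hypothetical allowlist tag key


def triage_finding(event, ssm_client, instance_tags):
    """Decide whether to remediate; if so, hand off to SSM Automation.

    event         -- EventBridge event carrying a GuardDuty EC2 finding
    ssm_client    -- object exposing start_automation_execution (boto3 SSM client)
    instance_tags -- dict of the instance's tags, fetched by the caller
    """
    instance_id = event["detail"]["resource"]["instanceDetails"]["instanceId"]

    # Custom decision logic lives in Lambda...
    if instance_tags.get(AUTO_REMEDIATE_TAG) != "true":
        return {"action": "notify-only", "instanceId": instance_id}

    # ...while the remediation itself runs as an auditable Automation execution.
    response = ssm_client.start_automation_execution(
        DocumentName=RUNBOOK_NAME,
        Parameters={"InstanceId": [instance_id]},  # values are lists of strings
    )
    return {
        "action": "automation-started",
        "instanceId": instance_id,
        "executionId": response["AutomationExecutionId"],
    }
```

Keeping the function this thin is the point: everything after the decision is recorded in the Automation execution history rather than in hand-rolled log lines.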
    GuardDuty: EC2 instance i-0abc communicating with known crypto-mining C2 IP
            |
            v
    EventBridge rule (source: aws.guardduty, severity >= 7)
            |
            v
    Lambda: "TriageFinding"
      - looks up instance owner/tags
      - checks if instance is in an auto-remediate allowlist
            |
            v
    SSM Automation: "IsolateAndSnapshotInstance"
      Step 1: Detach instance from Auto Scaling Group
      Step 2: Apply "quarantine" security group (no egress)
      Step 3: Create EBS snapshot / AMI for forensics
      Step 4: aws:approve (human confirms termination)
      Step 5: Terminate instance, launch replacement from golden AMI
            |
            v
    SNS: notify security channel with runbook execution link

Notice the human approval step embedded mid-runbook. Full auto-remediation without a human checkpoint is appropriate for well-understood, low-risk actions (restart a service, scale out capacity). For anything destructive or security-sensitive, the professional-level answer keeps a human in the loop at the point of highest consequence, while still automating everything mechanical around that decision.
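As a rough illustration of how the five steps might map onto Automation actions, the sketch below builds the document content as a Python dict. The step names, parameter names, and the choice of action for each step are assumptions rather than a reference implementation. In particular, it relies on the Auto Scaling group launching the replacement once the instance is detached without decrementing desired capacity.

```python
def build_isolation_runbook(quarantine_sg_id, approver_arn, sns_topic_arn):
    """Return Automation document content (schema 0.3) for the isolation flow."""
    def api_step(name, service, api, **params):
        inputs = {"Service": service, "Api": api, **params}
        return {"name": name, "action": "aws:executeAwsApi",
                "maxAttempts": 2, "onFailure": "Abort", "inputs": inputs}

    return {
        "schemaVersion": "0.3",
        "description": "Isolate, snapshot, and (after approval) terminate an instance.",
        "parameters": {
            "InstanceId": {"type": "String"},
            "AutoScalingGroupName": {"type": "String"},
        },
        "mainSteps": [
            # Step 1: detach; desired capacity is kept so the ASG launches a replacement.
            api_step("DetachFromAsg", "autoscaling", "DetachInstances",
                     InstanceIds=["{{ InstanceId }}"],
                     AutoScalingGroupName="{{ AutoScalingGroupName }}",
                     ShouldDecrementDesiredCapacity=False),
            # Step 2: swap all security groups for the no-egress quarantine group.
            api_step("ApplyQuarantineSg", "ec2", "ModifyInstanceAttribute",
                     InstanceId="{{ InstanceId }}", Groups=[quarantine_sg_id]),
            # Step 3: capture an AMI (and its snapshots) for forensics.
            {"name": "CreateForensicImage", "action": "aws:createImage",
             "onFailure": "Abort",
             "inputs": {"InstanceId": "{{ InstanceId }}",
                        "ImageName": "forensics-{{ InstanceId }}-{{ automation:EXECUTION_ID }}",
                        "NoReboot": True}},
            # Step 4: human gate before the destructive action.
            {"name": "ApproveTermination", "action": "aws:approve",
             "onFailure": "Abort",
             "inputs": {"Approvers": [approver_arn],
                        "NotificationArn": sns_topic_arn,
                        "MinRequiredApprovals": 1,
                        "Message": "Approve termination of {{ InstanceId }}?"}},
            # Step 5: terminate the quarantined instance.
            {"name": "TerminateInstance", "action": "aws:changeInstanceState",
             "inputs": {"InstanceIds": ["{{ InstanceId }}"],
                        "DesiredState": "terminated"}},
        ],
    }
```

The dict would be serialized to JSON or YAML and registered as an Automation document; every step up to the approval is reversible, which is why the gate sits where it does.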
Comparing Remediation Approaches
| Approach | Best for | Auditability | Typical trigger |
|---|---|---|---|
| SSM Automation runbook | Standardized, repeatable operational procedures | Built-in execution history, step-by-step | EventBridge rule, Config rule non-compliance, manual |
| Lambda function | Custom logic, conditional branching, API orchestration | Requires your own logging (CloudWatch Logs) | EventBridge rule, direct API/SDK invocation |
| Step Functions state machine | Long-running, multi-service workflows with complex branching/wait states | Visual execution history in console | EventBridge rule, another Step Functions execution |
| CodeDeploy automatic rollback | Rolling back a bad deployment specifically | Deployment event history | CloudWatch alarm attached to deployment group |
Step Functions deserves a specific callout here: when a remediation workflow needs to wait on an external condition, branch on multiple outcomes, or coordinate several services over a timeframe longer than a single Lambda's timeout allows, Step Functions is the answer over a long Lambda function or a deeply nested SSM Automation document. It's the tool for "this incident response has several possible paths depending on what we find along the way."
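A wait-and-branch workflow of this kind might be sketched in Amazon States Language as below, expressed here as a Python dict. The state names, ARNs, and the `$.status` / `$.attempts` fields are hypothetical; the sketch assumes the health-check Lambda returns both fields. `dangling_targets` is a small local sanity check, not part of any AWS SDK.

```python
REMEDIATION_STATE_MACHINE = {
    "Comment": "Poll service health, retry with a wait, escalate after 5 attempts.",
    "StartAt": "CheckHealth",
    "States": {
        "CheckHealth": {
            "Type": "Task",
            # Placeholder function; assumed to return {"status": ..., "attempts": ...}.
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:CheckHealth",
            "Next": "IsHealthy",
        },
        "IsHealthy": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "HEALTHY", "Next": "Resolved"},
                {"Variable": "$.attempts", "NumericGreaterThanEquals": 5, "Next": "Escalate"},
            ],
            "Default": "WaitBeforeRetry",
        },
        # Waiting costs nothing here, unlike sleeping inside a Lambda.
        "WaitBeforeRetry": {"Type": "Wait", "Seconds": 60, "Next": "CheckHealth"},
        "Escalate": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:111122223333:on-call",
                "Message": "Automated remediation did not restore health.",
            },
            "End": True,
        },
        "Resolved": {"Type": "Succeed"},
    },
}


def dangling_targets(definition):
    """Return transition targets that are not defined as states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        targets.update(t for t in (state.get("Next"), state.get("Default")) if t)
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return sorted(targets - set(states))
```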
Incident Response Runbook Automation
The instinct to avoid is treating a runbook as a document a human reads during an incident. At the professional level, a runbook is code: an SSM Automation document or Step Functions definition, version-controlled, tested in a non-production account, and invoked automatically or with one click, not transcribed manually from a wiki page while a service is down.
Practical patterns worth knowing:
- Automation documents can be shared across accounts through SSM document sharing (granting specific account IDs permission on the document), so a central platform team maintains one canonical "restart unhealthy ECS task" runbook that every application account references rather than reinventing.
- Break-glass access for incidents that need broader permissions than normal is implemented as a pre-provisioned IAM role with a short session duration and mandatory MFA, invoked through Identity Center or AssumeRole, with CloudTrail logging every action taken under that role, not a standing set of elevated credentials someone keeps around "just in case."
- ChatOps integration, meaning EventBridge or SNS targets that post to a chat channel with an actionable button (approve, rollback, acknowledge), is a common professional pattern for reducing the time between detection and human decision, especially for the approval steps embedded in Automation runbooks above.
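For the break-glass role, the MFA requirement is typically enforced in the role's trust policy. The sketch below shows one way that could look, using the `aws:MultiFactorAuthPresent` and `aws:MultiFactorAuthAge` condition keys; the account ID, the 15-minute MFA age limit, and the helper name are assumptions.

```python
def build_break_glass_trust_policy(trusted_account_id, max_mfa_age_seconds=900):
    """Trust policy: principals in the trusted account may assume the role
    only with a recent MFA authentication."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # MFA must have been used to obtain the calling credentials...
                    "Bool": {"aws:MultiFactorAuthPresent": "true"},
                    # ...and recently (age in seconds since MFA authentication).
                    "NumericLessThan": {
                        "aws:MultiFactorAuthAge": str(max_mfa_age_seconds)
                    },
                },
            }
        ],
    }


trust_policy = build_break_glass_trust_policy("111122223333")
```

Session length is a separate setting on the role and on the AssumeRole call, and which principals in the trusted account may call AssumeRole at all is controlled by their own identity policies.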
Post-Incident Practices
Once the fire is out, the professional-level differentiator is what happens next, and this shows up on the exam as much as the automation itself does.
Root cause analysis workflows. The mature pattern isn't a meeting three weeks later; it's capturing the data needed for RCA automatically at incident time: X-Ray traces around the incident window, the specific CloudWatch Logs Insights query that isolates the failing requests, the CloudFormation/CodePipeline change history for what deployed in the preceding hours. A well-designed system correlates "what changed" (CloudTrail, CodePipeline execution history) against "when did it break" (the CloudWatch alarm timestamp) automatically, because that correlation is usually the entire root cause.
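The "what changed" versus "when did it break" correlation reduces to a time-window filter. A minimal sketch, assuming change records have already been collected (for example from CloudTrail or pipeline history) into dicts with a `time` field; the record shape, the three-hour window, and the function name are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def changes_before_alarm(changes, alarm_time, window=timedelta(hours=3)):
    """Return changes inside [alarm_time - window, alarm_time], newest first.

    The most recent change before the alarm is the first suspect.
    """
    start = alarm_time - window
    recent = [c for c in changes if start <= c["time"] <= alarm_time]
    return sorted(recent, key=lambda c: c["time"], reverse=True)


alarm_at = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
history = [
    {"what": "pipeline: deploy api v41", "time": alarm_at - timedelta(hours=26)},
    {"what": "cloudtrail: security group rule edited", "time": alarm_at - timedelta(hours=2)},
    {"what": "pipeline: deploy api v42", "time": alarm_at - timedelta(minutes=9)},
    {"what": "pipeline: deploy web v7", "time": alarm_at + timedelta(minutes=30)},
]
suspects = changes_before_alarm(history, alarm_at)
for change in suspects:
    print(change["time"].isoformat(), change["what"])
```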
Automated rollback triggers, revisited from Step 1: the tightest possible incident response loop is one where a bad deployment never becomes an incident that requires human RCA at all, because the CodeDeploy alarm-based rollback (or an AppConfig feature-flag rollback) already reverted it during the bake window. Every automation covered in this step exists to shrink that gap: the goal is for the "incident" to be something engineers read about after the fact in an Automation execution log, not something they were paged for at 3 a.m.
Game days. AWS Fault Injection Service (FIS) lets you deliberately inject failure (terminate instances, throttle API calls, inject latency) in a controlled experiment to validate that your alarms, runbooks, and rollback automation actually work before a real incident tests them for you. Expect FIS to appear as the answer whenever a scenario asks how to verify resilience automation without waiting for a genuine outage.
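On the CodeDeploy side, alarm-based rollback comes down to two settings on the deployment group. The sketch below builds them in the shape boto3's CodeDeploy client accepts for `update_deployment_group`; the helper name and alarm name are placeholders.

```python
def rollback_settings(alarm_names):
    """Deployment group settings: stop on alarm, then roll back automatically."""
    return {
        "alarmConfiguration": {
            "enabled": True,
            "alarms": [{"name": name} for name in alarm_names],
        },
        "autoRollbackConfiguration": {
            "enabled": True,
            # Roll back when the deployment fails or a monitored alarm fires.
            "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
        },
    }


settings = rollback_settings(["api-5xx-rate-high"])
# With a real client, roughly:
#   codedeploy.update_deployment_group(
#       applicationName=..., currentDeploymentGroupName=..., **settings)
```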
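A game-day experiment of the "terminate one instance and see whether the automation copes" variety might be described like this. The dict follows the general shape of an FIS experiment template (targets, actions, stop conditions, role); the tag, role ARN, alarm ARN, and helper name are placeholders. The stop condition is the safety net: if the named alarm fires, FIS halts the experiment.

```python
def build_fis_template(role_arn, stop_alarm_arn, tag_key, tag_value):
    """Experiment template: terminate one tagged EC2 instance."""
    return {
        "description": "Game day: terminate one instance, expect self-healing.",
        "roleArn": role_arn,
        "targets": {
            "OneTaggedInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "TerminateInstance": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "OneTaggedInstance"},
            }
        },
        # Abort the experiment if customer-facing impact is detected.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
    }


template = build_fis_template(
    "arn:aws:iam::111122223333:role/FisExperimentRole",
    "arn:aws:cloudwatch:us-east-1:111122223333:alarm:api-5xx-rate-high",
    "gameday",
    "eligible",
)
```

Scoping the target by tag keeps the blast radius explicit: only instances a team has opted in are ever candidates.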
Exam Focus: What Questions Test From This Step
- Designing EventBridge rules with event patterns to filter on source, detail-type, and nested detail fields
- Custom event buses and cross-account event forwarding via bus resource policies
- Choosing between SSM Automation, Lambda, and Step Functions for a given remediation scenario
- Embedding human approval steps (aws:approve) in otherwise automated runbooks for high-consequence actions
- Break-glass access design: temporary elevated roles with MFA and full CloudTrail auditing, not standing credentials
- Automated rollback as an EventBridge-driven or CodeDeploy-native reaction to a CloudWatch alarm
- Correlating deployment/change history against incident timestamps as the backbone of automated root cause analysis
- Using AWS Fault Injection Service to validate incident response automation proactively via game days