Step 4 — Incident & Event Response

There’s a particular kind of relief in watching an alarm fire, a remediation kick off automatically, and the incident close itself out before you’ve even finished reading the Slack notification. That’s the state this step is building toward. Not “monitoring exists,” but “the system heals itself for the failure modes you’ve already seen, and hands you a clean starting point for the ones you haven’t.”

EventBridge as the Nervous System

Every meaningful state change in an AWS account — an instance stopping, a CodeDeploy deployment failing, a GuardDuty finding, a CloudWatch alarm flipping to ALARM — is available as an event. EventBridge is the router that decides what happens next, and the professional exam expects you to design with it as the default glue, not as an optional extra.

                              EventBridge Bus
  Event Sources                (default or custom bus)                Targets
 ─────────────────────       ──────────────────────────         ─────────────────────
 CloudWatch Alarm    ──┐                                   ┌──► Lambda (remediation)
 CodeDeploy status   ──┤                                   ├──► SSM Automation runbook
 GuardDuty finding   ──┼──►  Rule: event pattern match  ──►┼──► SNS (page on-call)
 Config rule result  ──┤        (source, detail-type,       ├──► Step Functions
 Auto Scaling event  ──┘         detail field filters)       └──► Another account's bus
                                                                    (cross-account)

A rule matches on an event pattern — a JSON structure filtered against the event’s source, detail-type, and arbitrary fields inside detail. This is more expressive than it sounds: you can write a rule that only fires for GuardDuty findings above a certain severity, or only for CodeDeploy deployments that failed in a specific application, without writing a single line of Lambda filtering logic. Push the filtering into the rule pattern whenever possible — it’s cheaper, it’s declarative, and it keeps your Lambda function focused on the actual remediation instead of on deciding whether it should run at all.
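As a minimal sketch of pushing the filter into the rule, the pattern below selects GuardDuty findings at or above a severity threshold using EventBridge's numeric matching. The function names are illustrative, and `matches` is only a simplified local stand-in for checking the pattern in a unit test; it covers exact-value lists and a single numeric comparison, not the full EventBridge matching engine.

```python
import json


def guardduty_severity_pattern(min_severity):
    """Event pattern for GuardDuty findings with severity >= min_severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def _match_one(condition, actual):
    """Match one pattern entry: a numeric comparison or an exact value."""
    if isinstance(condition, dict) and "numeric" in condition:
        op, bound = condition["numeric"]
        if isinstance(actual, bool) or not isinstance(actual, (int, float)):
            return False
        return {
            ">=": actual >= bound,
            ">": actual > bound,
            "<=": actual <= bound,
            "<": actual < bound,
            "=": actual == bound,
        }[op]
    return condition == actual


def matches(pattern, event):
    """Simplified local matcher (assumption: NOT the real EventBridge engine)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object in the pattern: recurse into the event's object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(_match_one(cond, actual) for cond in expected):
            return False
    return True


def create_rule(events_client, name, pattern):
    """Register the pattern; events_client is a boto3 'events' client."""
    return events_client.put_rule(
        Name=name, EventPattern=json.dumps(pattern), State="ENABLED"
    )
```

With a pattern like this on the rule, the target Lambda never sees low-severity findings, so it needs no "should I run?" branch of its own.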

Custom event buses matter for cross-account and cross-organization event routing: a security-tooling account can have its own bus that receives forwarded GuardDuty/Security Hub events from every member account. Each member account runs a rule whose target is the central bus, and a resource policy on that bus authorizes the member accounts to put events, so no per-account Lambda polling is needed. This is structurally identical to the centralized logging pattern from Step 3: centralize the signal, fan out the response from one place.
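One way the bus resource policy might look, sketched under the assumption that every sending account belongs to a single AWS Organization. The function names, the bus name, and the IDs used in any call are hypothetical; the `aws:PrincipalOrgID` condition key and the `put_permission` call with a `Policy` argument are the real pieces.

```python
import json


def central_bus_policy(bus_arn, org_id):
    """Resource policy letting any account in the organization put events."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOrgMembersToPutEvents",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "events:PutEvents",
                "Resource": bus_arn,
                # Restrict the wildcard principal to one organization.
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
            }
        ],
    }


def apply_policy(events_client, bus_name, policy):
    """Attach the policy; events_client is a boto3 'events' client."""
    return events_client.put_permission(
        EventBusName=bus_name, Policy=json.dumps(policy)
    )
```

Scoping by organization ID rather than listing account IDs means new member accounts can forward events without a policy change on the central bus.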

Scheduled rules (cron or rate expressions) are also EventBridge rules, just triggered by time instead of an event — the modern replacement for a cron job on an EC2 instance, useful for periodic compliance checks or scheduled Automation runbook execution.
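The two schedule syntaxes have small traps: a rate of exactly 1 takes a singular unit, and cron expressions have six fields with a `?` in either day-of-month or day-of-week. The helpers below are a hypothetical convenience layer, not part of any AWS SDK; they only build the strings you would pass as `ScheduleExpression` when creating the rule.

```python
def rate_expression(value, unit):
    """Build a rate expression, e.g. rate(1 hour) or rate(5 minutes)."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if not isinstance(value, int) or isinstance(value, bool) or value < 1:
        raise ValueError("value must be a positive integer")
    # EventBridge expects a singular unit for 1 and a plural unit otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def daily_cron(hour_utc, minute=0):
    """Build a six-field cron expression that fires once a day, in UTC."""
    if not 0 <= hour_utc <= 23 or not 0 <= minute <= 59:
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour_utc} * * ? *)"
```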

Automated Remediation: SSM Automation and Lambda

Once EventBridge routes the signal, something has to act on it. Two tools dominate here, and the exam wants you to pick correctly between them.

SSM Automation documents are declarative, multi-step runbooks — a sequence of steps (aws:runCommand, aws:invokeLambdaFunction, aws:executeAwsApi, aws:approve for a human gate mid-runbook) with built-in retry, rollback-on-failure steps, and execution history you can audit later. Reach for Automation when the remediation is a well-defined operational procedure: restart a stuck service, detach and replace an unhealthy instance from an ASG, rotate a credential, or resize an EBS volume that’s about to fill up.

Lambda functions are the right tool when the remediation logic is more bespoke — parsing a specific finding’s structure, making a conditional decision based on data you’d otherwise have to fetch from another API, or orchestrating a response that doesn’t map cleanly to existing SSM Automation actions. In practice, most real systems use both together: EventBridge triggers a Lambda, and the Lambda’s actual remediation step is to start an SSM Automation execution, because the Automation document gives you the audit trail and retry semantics that raw Lambda code has to reimplement by hand.
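The "Lambda decides, Automation acts" split can be sketched as follows. This is an illustration under assumptions: the function name, the allowlist argument, and the default document name are hypothetical, and the SSM client is passed in so the logic can be exercised without AWS. The field path follows the GuardDuty finding format for EC2 resources, and `start_automation_execution` is the boto3 SSM call that launches a runbook.

```python
def handle_finding(event, ssm_client, allowlist,
                   document_name="IsolateAndSnapshotInstance"):
    """Triage a GuardDuty finding and hand remediation to SSM Automation."""
    detail = event.get("detail", {})
    instance_id = (
        detail.get("resource", {})
        .get("instanceDetails", {})
        .get("instanceId")
    )
    if not instance_id:
        return {"action": "skipped", "reason": "no EC2 instance in finding"}
    if instance_id not in allowlist:
        # Not cleared for automatic action: leave it for a human.
        return {"action": "skipped", "reason": "not in auto-remediate allowlist"}

    # The runbook, not this function, owns retries and the audit trail.
    response = ssm_client.start_automation_execution(
        DocumentName=document_name,
        Parameters={"InstanceId": [instance_id]},
    )
    return {
        "action": "started",
        "execution_id": response["AutomationExecutionId"],
    }
```

In a real Lambda, a thin `lambda_handler(event, context)` would create the client with `boto3.client("ssm")` and call this function, keeping the decision logic testable on its own.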

GuardDuty: EC2 instance i-0abc communicating with known crypto-mining C2 IP
        │
        ▼
EventBridge rule (source: aws.guardduty, severity >= 7)
        │
        ▼
Lambda: "TriageFinding"
   - looks up instance owner/tags
   - checks if instance is in an auto-remediate allowlist
        │
        ▼
SSM Automation: "IsolateAndSnapshotInstance"
   Step 1: Detach instance from Auto Scaling Group
   Step 2: Apply "quarantine" security group (no egress)
   Step 3: Create EBS snapshot / AMI for forensics
   Step 4: aws:approve — human confirms termination
   Step 5: Terminate instance, launch replacement from golden AMI
        │
        ▼
SNS: notify security channel with runbook execution link

Notice the human approval step embedded mid-runbook. Full auto-remediation without a human checkpoint is appropriate for well-understood, low-risk actions (restart a service, scale out capacity). For anything destructive or security-sensitive, the professional-level answer keeps a human in the loop at the point of highest consequence, while still automating everything mechanical around that decision.
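A trimmed sketch of what such a runbook's content could look like, built as a Python dict that serializes to the JSON form of an Automation document. It assumes schema version 0.3 and covers only the quarantine, forensic image, approval, and terminate steps; the Auto Scaling detach and the replacement launch are left out for brevity. Step names, the description, and the approver and security group arguments are hypothetical.

```python
import json


def isolate_runbook(approver_arn, quarantine_sg_id):
    """Automation document content with a human gate before termination."""
    return {
        "schemaVersion": "0.3",
        "description": "Quarantine an instance, image it, then terminate after approval.",
        "parameters": {"InstanceId": {"type": "String"}},
        "mainSteps": [
            {
                "name": "ApplyQuarantineGroup",
                "action": "aws:executeAwsApi",
                "inputs": {
                    "Service": "ec2",
                    "Api": "ModifyInstanceAttribute",
                    "InstanceId": "{{ InstanceId }}",
                    "Groups": [quarantine_sg_id],
                },
            },
            {
                "name": "CreateForensicImage",
                "action": "aws:createImage",
                "inputs": {
                    "InstanceId": "{{ InstanceId }}",
                    "ImageName": "forensics-{{ InstanceId }}",
                    "NoReboot": True,
                },
            },
            {
                # Execution pauses here until an approver responds.
                "name": "ApproveTermination",
                "action": "aws:approve",
                "inputs": {
                    "Approvers": [approver_arn],
                    "Message": "Approve termination of {{ InstanceId }}?",
                },
            },
            {
                "name": "TerminateInstance",
                "action": "aws:changeInstanceState",
                "inputs": {
                    "InstanceIds": ["{{ InstanceId }}"],
                    "DesiredState": "terminated",
                },
            },
        ],
    }


def step_names(document):
    """Return step names in execution order."""
    return [step["name"] for step in document["mainSteps"]]
```

The output of `json.dumps(isolate_runbook(...))` is what you would pass as the document content when creating the Automation document. Everything before the approval step is reversible; the irreversible action sits behind the gate.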

Comparing Remediation Approaches

| Approach | Best for | Auditability | Typical trigger |
| --- | --- | --- | --- |
| SSM Automation runbook | Standardized, repeatable operational procedures | Built-in execution history, step-by-step | EventBridge rule, Config rule non-compliance, manual |
| Lambda function | Custom logic, conditional branching, API orchestration | Requires your own logging (CloudWatch Logs) | EventBridge rule, direct API/SDK invocation |
| Step Functions state machine | Long-running, multi-service workflows with complex branching/wait states | Visual execution history in console | EventBridge rule, another Step Functions execution |
| CodeDeploy automatic rollback | Rolling back a bad deployment specifically | Deployment event history | CloudWatch alarm attached to deployment group |

Step Functions deserves a specific callout here: when a remediation workflow needs to wait on an external condition, branch on multiple outcomes, or coordinate several services over a timeframe longer than a single Lambda’s timeout allows, Step Functions is the answer over a long Lambda function or a deeply nested SSM Automation document. It’s the tool for “this incident response has several possible paths depending on what we find along the way.”
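To make the wait-then-branch shape concrete, here is a small Amazon States Language definition built as a Python dict: wait for the system to settle, check health, then either finish or roll back and escalate. The state names, the five-minute wait, and the assumption that the health-check Lambda returns a `status` field are all illustrative. `dangling_transitions` is a hypothetical lint helper, not an AWS API.

```python
def remediation_state_machine(check_fn_arn, rollback_fn_arn):
    """ASL definition: wait, check health, branch on the result."""
    return {
        "Comment": "Sketch of a branching remediation workflow",
        "StartAt": "WaitForStabilization",
        "States": {
            "WaitForStabilization": {
                "Type": "Wait",
                "Seconds": 300,
                "Next": "CheckHealth",
            },
            "CheckHealth": {
                "Type": "Task",
                "Resource": check_fn_arn,
                "ResultPath": "$.health",
                "Next": "IsHealthy",
            },
            "IsHealthy": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.health.status",
                        "StringEquals": "HEALTHY",
                        "Next": "Resolved",
                    }
                ],
                "Default": "Rollback",
            },
            "Rollback": {
                "Type": "Task",
                "Resource": rollback_fn_arn,
                "Next": "Escalate",
            },
            "Resolved": {"Type": "Succeed"},
            "Escalate": {
                "Type": "Fail",
                "Error": "RemediationFailed",
                "Cause": "Health check failed; rollback ran, human follow-up needed",
            },
        },
    }


def dangling_transitions(definition):
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return sorted(t for t in targets if t not in states)
```

The Wait state costs nothing while it sleeps, which is the practical difference from a Lambda that blocks until its timeout.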

Incident Response Runbook Automation

The instinct to avoid is treating a runbook as a document a human reads during an incident. At the professional level, a runbook is code — an SSM Automation document or Step Functions definition, version-controlled, tested in a non-production account, and invoked automatically or with one click, not transcribed manually from a wiki page while a service is down.

Practical patterns worth knowing:

Automation documents can be shared across accounts through Systems Manager’s own document sharing (granting specific account IDs permission to use the document), so a central platform team maintains one canonical “restart unhealthy ECS task” runbook that every application account references by ARN rather than reinventing.
Break-glass access for incidents that need broader permissions than normal is implemented as a pre-provisioned IAM role with a short session duration and mandatory MFA, invoked through Identity Center or AssumeRole, with CloudTrail logging every action taken under that role — not a standing set of elevated credentials someone keeps around “just in case.”
ChatOps integration, where EventBridge or SNS targets post to a chat channel with an actionable button (approve, rollback, acknowledge), is a common professional pattern for reducing the time between detection and human decision, especially for the approval steps embedded in the Automation runbooks above.
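For the break-glass pattern in the list above, a minimal sketch of the two halves: a trust policy that only allows assumption when MFA is present, and an `AssumeRole` call that supplies the MFA code and asks for the shortest session STS allows. Function names and the session-name convention are hypothetical; the `aws:MultiFactorAuthPresent` condition key and the `assume_role` parameters are standard.

```python
def break_glass_trust_policy(trusted_account_id):
    """Trust policy: principals in the trusted account, only with MFA."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
            }
        ],
    }


def assume_break_glass(sts_client, role_arn, session_name, mfa_serial, mfa_code):
    """Request a short-lived session; sts_client is a boto3 'sts' client."""
    return sts_client.assume_role(
        RoleArn=role_arn,
        # The session name appears in CloudTrail, so use an incident ID.
        RoleSessionName=session_name,
        SerialNumber=mfa_serial,
        TokenCode=mfa_code,
        DurationSeconds=900,  # 15 minutes, the minimum STS accepts
    )
```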

Post-Incident Practices

Once the fire is out, the professional-level differentiator is what happens next — and this shows up on the exam as much as the automation itself does.

Root cause analysis workflows. The mature pattern isn’t a meeting three weeks later — it’s capturing the data needed for RCA automatically at incident time: X-Ray traces around the incident window, the specific CloudWatch Logs Insights query that isolates the failing requests, the CloudFormation/CodePipeline change history for what deployed in the preceding hours. A well-designed system correlates “what changed” (CloudTrail, CodePipeline execution history) against “when did it break” (the CloudWatch alarm timestamp) automatically, because that correlation is usually the entire root cause.
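The "what changed" versus "when did it break" correlation reduces to a time-window filter. The sketch below assumes change records have already been collected (for example from CloudTrail or pipeline execution history) into dicts with a timezone-aware `time` field; the record shape, the function name, and the six-hour default lookback are illustrative choices, not anything AWS prescribes.

```python
from datetime import datetime, timedelta, timezone


def changes_before_alarm(changes, alarm_time, lookback=timedelta(hours=6)):
    """Return changes inside the lookback window, most recent first."""
    window_start = alarm_time - lookback
    recent = [c for c in changes if window_start <= c["time"] <= alarm_time]
    # The change closest to the alarm is usually the first suspect.
    return sorted(recent, key=lambda c: c["time"], reverse=True)


if __name__ == "__main__":
    alarm = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    history = [
        {"id": "deploy-41", "time": alarm - timedelta(hours=10)},
        {"id": "deploy-42", "time": alarm - timedelta(minutes=20)},
    ]
    for change in changes_before_alarm(history, alarm):
        print(change["id"], change["time"].isoformat())
```

Attaching this shortlist to the incident notification gives the responder the likely cause alongside the alert.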

Automated rollback triggers, revisited from Step 1: the tightest possible incident response loop is one where a bad deployment never becomes an incident that requires human RCA at all, because the CodeDeploy alarm-based rollback (or an AppConfig feature-flag rollback) already reverted it during the bake window. Every automation covered in this step exists to shrink that gap: the goal is for the “incident” to be something engineers read about after the fact in an Automation execution log, not something they were paged for at 3 a.m.
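The CodeDeploy side of that loop is two settings on the deployment group: which alarms to watch, and which events trigger a rollback. The helper below is a hypothetical sketch that builds keyword arguments for boto3's `update_deployment_group`; the application, group, and alarm names in any call are placeholders.

```python
def rollback_settings(application, deployment_group, alarm_names):
    """Keyword arguments enabling alarm-based automatic rollback."""
    return {
        "applicationName": application,
        "currentDeploymentGroupName": deployment_group,
        "alarmConfiguration": {
            "enabled": True,
            "alarms": [{"name": name} for name in alarm_names],
        },
        "autoRollbackConfiguration": {
            "enabled": True,
            # Roll back on outright failure and when a watched alarm fires.
            "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
        },
    }


def enable_rollback(codedeploy_client, settings):
    """Apply the settings; codedeploy_client is a boto3 'codedeploy' client."""
    return codedeploy_client.update_deployment_group(**settings)
```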

Game days. AWS Fault Injection Service (FIS) lets you deliberately inject failure — terminate instances, throttle API calls, inject latency — in a controlled experiment to validate that your alarms, runbooks, and rollback automation actually work before a real incident tests them for you. Expect FIS to appear as the answer whenever a scenario asks how to verify resilience automation without waiting for a genuine outage.
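A game-day experiment might be defined along these lines: terminate one tagged instance, with a CloudWatch alarm as the stop condition so the experiment halts if customer impact goes beyond what you planned. This is a sketch under assumptions; the target and action labels, the tag used for selection, and the description are hypothetical, and the dict is shaped as keyword arguments for boto3's `create_experiment_template`.

```python
def game_day_template(role_arn, stop_alarm_arn, tag_key, tag_value):
    """FIS experiment template: terminate one tagged EC2 instance."""
    return {
        "description": "Game day: verify alarms and runbooks on instance loss",
        "roleArn": role_arn,
        # Guardrail: FIS stops the experiment if this alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "terminate": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


def create_template(fis_client, template):
    """Create the template; fis_client is a boto3 'fis' client."""
    return fis_client.create_experiment_template(**template)
```

Selecting targets by tag keeps the blast radius explicit: only resources deliberately opted in to game days can be chosen.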

Exam Focus: What Questions Test From This Step

Designing EventBridge rules with event patterns to filter on source, detail-type, and nested detail fields
Custom event buses and cross-account event forwarding via bus resource policies
Choosing between SSM Automation, Lambda, and Step Functions for a given remediation scenario
Embedding human approval steps (aws:approve) in otherwise automated runbooks for high-consequence actions
Break-glass access design: temporary elevated roles with MFA and full CloudTrail auditing, not standing credentials
Automated rollback as an EventBridge-driven or CodeDeploy-native reaction to a CloudWatch alarm
Correlating deployment/change history against incident timestamps as the backbone of automated root cause analysis
Using AWS Fault Injection Service to validate incident response automation proactively via game days

Written by NPBlue Cloud Team, Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.