
Step 3 of 5 · updated 2026


Step 3 — Monitoring, Logging & Observability

Here’s a question worth sitting with before you read another word: when your pager goes off at 2 a.m., what tells you what to do versus what merely tells you something is wrong? Most teams over-invest in the second and under-invest in the first. This step is about building an observability stack on AWS that actually earns its keep during an incident, not just one that produces dashboards nobody looks at until something breaks.


CloudWatch Metrics: The Foundation, Handled Properly

At the professional level, the interesting problems aren’t “how do I get a metric into CloudWatch” — they’re about resolution, aggregation, and noise.

Standard vs. high-resolution metrics. Standard resolution stores data at one-minute granularity. High-resolution custom metrics (down to one second) cost more and are appropriate for latency-sensitive services where a one-minute-old data point is already too stale to act on — think real-time bidding systems or trading platforms, not your average CRUD API.

Metric math lets you combine multiple metrics into a derived expression without writing custom instrumentation. For example, you can compute an error rate as (errors / requests) * 100 directly in a CloudWatch alarm, rather than emitting a pre-computed percentage from the application. This matters because raw counts mean nothing without context: 5 errors out of 40 requests is a very different situation from 5 errors out of 40,000.
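As a rough sketch of the error-rate example, the snippet below builds the arguments for a CloudWatch PutMetricAlarm call that alarms on a metric math expression. The alarm name, load balancer identifier, period, and evaluation settings are illustrative assumptions; the metric names are the standard Application Load Balancer ones. The code only builds the request and does not call AWS.

```python
import json


def error_rate_alarm_params(load_balancer: str, threshold_pct: float = 5.0) -> dict:
    """Build keyword arguments for PutMetricAlarm using a metric math expression."""
    dimensions = [{"Name": "LoadBalancer", "Value": load_balancer}]

    def raw_metric(metric_id: str, metric_name: str) -> dict:
        # Input metric: fetched for the expression, not alarmed on directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": "HighErrorRate",  # hypothetical name
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 3,  # assumed: three consecutive one-minute periods
        "Threshold": threshold_pct,
        "TreatMissingData": "notBreaching",
        "Metrics": [
            raw_metric("errors", "HTTPCode_Target_5XX_Count"),
            raw_metric("requests", "RequestCount"),
            {
                # The derived expression is the only series the alarm evaluates.
                "Id": "error_rate",
                "Expression": "(errors / requests) * 100",
                "Label": "5xx error rate (%)",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical load balancer identifier.
params = error_rate_alarm_params("app/my-alb/0123456789abcdef")
print(json.dumps(params["Metrics"][-1], indent=2))
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.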

Anomaly Detection fits bands around a metric’s expected range based on historical patterns (including daily/weekly seasonality) rather than a fixed threshold. This is the correct answer whenever a scenario describes traffic with a strong daily cycle — a static “alert if CPU > 80%” threshold either fires constantly during normal peak hours or misses real problems during off-peak hours. An anomaly detection alarm compares against the expected band for that time of day, not a flat line.
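The difference between a flat threshold and a time-of-day band can be illustrated with a toy comparison. This is not the model CloudWatch trains (in CloudWatch the band comes from the ANOMALY_DETECTION_BAND metric math function); the baseline values and band width below are invented purely to show the reasoning.

```python
# Hypothetical expected CPU (%) by hour for a service with a strong daily cycle.
EXPECTED_CPU = {3: 20.0, 14: 85.0}  # 03:00 off-peak, 14:00 peak
BAND_WIDTH = 10.0                   # assumed tolerated deviation either side


def static_alarm(cpu: float, threshold: float = 80.0) -> bool:
    """Flat threshold: fires whenever CPU exceeds a fixed value."""
    return cpu > threshold


def band_alarm(hour: int, cpu: float) -> bool:
    """Band check: fires when CPU leaves the expected range for that hour."""
    return abs(cpu - EXPECTED_CPU[hour]) > BAND_WIDTH


samples = [
    (14, 88.0, "normal afternoon peak"),
    (3, 65.0, "runaway job overnight"),
]

for hour, cpu, label in samples:
    print(f"{label}: static={static_alarm(cpu)} band={band_alarm(hour, cpu)}")
```

The static threshold pages on the normal peak and stays silent on the overnight problem; the band check does the opposite, which is the behavior the scenario calls for.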

Composite Alarms in Practice

A single metric crossing a threshold rarely tells the whole story. Composite alarms combine the state of multiple underlying alarms with AND/OR/NOT logic, and they’re the professional-level answer to alarm fatigue.

ALARM: HighErrorRate        (5xx rate > 5%)
ALARM: HighLatency          (p99 latency > 2s)
ALARM: LowHealthyHostCount  (< 2 healthy targets in target group)

Composite Alarm: "ServiceDegraded"
  = HighErrorRate OR HighLatency OR LowHealthyHostCount

Composite Alarm: "ServiceCriticalWithCapacityRisk"
  = HighErrorRate AND LowHealthyHostCount
    (suppresses noisy latency-only alerts during a deploy,
     escalates only when errors AND capacity both look bad)

The practical value: page the on-call engineer only on the composite alarm, not on every underlying alarm individually. This alone eliminates a huge share of the alert noise that causes on-call burnout — and it’s a recurring theme on this exam, which rewards designs that reduce operational toil, not just designs that technically work.
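A minimal sketch of how the two composite alarms above might be expressed as CloudWatch alarm rules. The rule syntax (ALARM("name") combined with AND/OR/NOT) is CloudWatch's; the SNS topic ARN is a placeholder, and attaching actions only to the composite alarm is the design choice described above. The code builds the request arguments without calling AWS.

```python
def in_alarm(name: str) -> str:
    """Rule fragment that is true while the named child alarm is in ALARM state."""
    return f'ALARM("{name}")'


service_degraded_rule = " OR ".join(
    in_alarm(n) for n in ("HighErrorRate", "HighLatency", "LowHealthyHostCount")
)
service_critical_rule = " AND ".join(
    in_alarm(n) for n in ("HighErrorRate", "LowHealthyHostCount")
)

# Only the composite alarm carries a paging action; the child alarms stay silent.
composite_params = {
    "AlarmName": "ServiceCriticalWithCapacityRisk",
    "AlarmRule": service_critical_rule,
    "ActionsEnabled": True,
    "AlarmActions": [
        "arn:aws:sns:us-east-1:123456789012:oncall-page"  # hypothetical topic
    ],
}

print(service_degraded_rule)
print(composite_params["AlarmRule"])
```

With boto3, the dictionary could be passed as `boto3.client("cloudwatch").put_composite_alarm(**composite_params)`.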


Centralized Logging Across Accounts

A multi-account AWS organization (which is the default assumption at professional level, not the exception) needs logs aggregated somewhere a security or operations team can actually query them — not scattered across forty account-local CloudWatch Logs consoles.

Member Account A              Member Account B
┌─────────────────┐           ┌─────────────────┐
│ CloudWatch Logs │           │ CloudWatch Logs │
│ Log Groups      │           │ Log Groups      │
└────────┬────────┘           └────────┬────────┘
         │ Subscription Filter         │ Subscription Filter
         ▼                             ▼
┌───────────────────────────────────────────────┐
│        Kinesis Data Streams / Firehose        │
│        (Logging/Observability Account)        │
└───────────────────────┬───────────────────────┘
                        ▼
           ┌──────────────────────────┐
           │ Amazon S3 (raw archive)  │
           │ + OpenSearch Service     │
           │ (searchable, dashboards) │
           └──────────────────────────┘

The mechanism is a cross-account subscription filter: each member account’s log group streams to a Kinesis Data Stream (or Firehose delivery stream) owned by the central logging account, authorized via a resource policy on the destination. This is the pattern to recognize whenever a scenario says “security team needs a single place to search logs from all accounts” or “logs must be retained centrally even if a member account is compromised or deleted” — centralizing logs outside the account that produced them is itself a security control, since an attacker with access to a compromised account can’t tamper with logs that already left it.
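The two halves of that mechanism can be sketched as request payloads: the access policy the logging account attaches to its CloudWatch Logs destination, and the subscription filter a member account creates against it. Account IDs, region, destination name, and log group name are placeholders. The code only builds the payloads.

```python
import json

# Placeholder identifiers for illustration only.
LOGGING_ACCOUNT = "999999999999"
MEMBER_ACCOUNTS = ["111111111111", "222222222222"]
REGION = "us-east-1"
DESTINATION_NAME = "central-logs"

destination_arn = (
    f"arn:aws:logs:{REGION}:{LOGGING_ACCOUNT}:destination:{DESTINATION_NAME}"
)

# Resource policy on the destination: which accounts may subscribe to it.
access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowMemberAccounts",
            "Effect": "Allow",
            "Principal": {"AWS": MEMBER_ACCOUNTS},
            "Action": "logs:PutSubscriptionFilter",
            "Resource": destination_arn,
        }
    ],
}

# Logging account: arguments for logs.put_destination_policy(...)
destination_policy_params = {
    "destinationName": DESTINATION_NAME,
    "accessPolicy": json.dumps(access_policy),
}

# Member account: arguments for logs.put_subscription_filter(...)
subscription_filter_params = {
    "logGroupName": "/app/orders",       # hypothetical log group
    "filterName": "to-central-logging",  # hypothetical filter name
    "filterPattern": "",                 # empty pattern forwards every event
    "destinationArn": destination_arn,
}

print(json.dumps(access_policy, indent=2))
```

The destination itself would be created first in the logging account (for example with `put_destination`, pointing at the Kinesis stream and an IAM role that CloudWatch Logs can assume); that step is omitted here.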

For CloudTrail specifically, the equivalent pattern is an organization trail created from the management account, which automatically applies to every member account and writes to a single S3 bucket — no per-account trail configuration needed, and member accounts can’t disable it.
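As a small illustration, an organization trail is an ordinary CloudTrail trail created with the organization flag set. The trail and bucket names below are placeholders, and enabling multi-region coverage and log file validation is an assumption about sensible defaults rather than something the scenario requires.

```python
# Arguments for cloudtrail.create_trail(...), to be run from the management account.
org_trail_params = {
    "Name": "org-trail",                                # hypothetical trail name
    "S3BucketName": "example-central-cloudtrail-logs",  # hypothetical bucket
    "IsOrganizationTrail": True,      # applies the trail to every member account
    "IsMultiRegionTrail": True,       # assumed: capture events from all regions
    "EnableLogFileValidation": True,  # assumed: digest files for tamper detection
}

for key, value in org_trail_params.items():
    print(f"{key} = {value}")
```

With boto3 this would be `boto3.client("cloudtrail").create_trail(**org_trail_params)` followed by `start_logging(Name="org-trail")`; the bucket policy must already allow CloudTrail to write to it.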


X-Ray: Service Maps and Trace Analysis

Metrics tell you that p99 latency spiked. Distributed tracing tells you where in the call graph it spiked. X-Ray instruments requests as they flow through services and assembles a service map from the resulting trace data.

Client ──► API Gateway ──► Lambda (OrderService) ──► DynamoDB
                       └──► Lambda (PaymentService) ──► external HTTPS call
                                                        (3rd-party gateway)

X-Ray Service Map annotates each edge with:
  - average/p99 latency
  - error/fault rate
  - request volume

A slow external HTTPS call to the payment gateway shows up as
a red/yellow node, pinpointing the bottleneck without grepping
logs across three separate services.
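To make the service-map idea concrete, here is a toy aggregation over invented trace data. The record layout (caller, callee, duration, fault flag) is a simplification for illustration and not X-Ray's segment document format; it shows how per-edge latency and fault rate point at the slow downstream call.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical call records from two traces: (caller, callee, seconds, is_fault).
TRACE_EDGES = [
    ("API Gateway", "OrderService", 0.045, False),
    ("OrderService", "DynamoDB", 0.012, False),
    ("API Gateway", "PaymentService", 2.100, False),
    ("PaymentService", "payment-gateway", 2.050, False),
    ("API Gateway", "OrderService", 0.050, False),
    ("OrderService", "DynamoDB", 0.015, False),
    ("API Gateway", "PaymentService", 3.200, True),
    ("PaymentService", "payment-gateway", 3.150, True),
]


def edge_stats(records):
    """Aggregate request count, average latency, and fault rate per edge."""
    grouped = defaultdict(list)
    for caller, callee, seconds, is_fault in records:
        grouped[(caller, callee)].append((seconds, is_fault))
    return {
        edge: {
            "requests": len(calls),
            "avg_latency": mean(s for s, _ in calls),
            "fault_rate": sum(f for _, f in calls) / len(calls),
        }
        for edge, calls in grouped.items()
    }


def slowest_leaf_edge(records):
    """Slowest edge whose callee makes no further calls.

    Parent latency includes time spent downstream, so the deepest slow
    edge is the most likely origin of the delay.
    """
    stats = edge_stats(records)
    callers = {caller for caller, _, _, _ in records}
    leaves = {edge: s for edge, s in stats.items() if edge[1] not in callers}
    return max(leaves, key=lambda edge: leaves[edge]["avg_latency"])


stats = edge_stats(TRACE_EDGES)
slowest = slowest_leaf_edge(TRACE_EDGES)
print(slowest, stats[slowest])
```

On this sample data the payment gateway edge stands out on both latency and fault rate, while the PaymentService edge above it only looks slow because it is waiting on that call.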

Key mechanics worth knowing cold:


Comparing the Observability Building Blocks

Signal               | AWS Service                  | Answers the question
---------------------|------------------------------|-----------------------------------------------------------
Metrics              | CloudWatch Metrics + Alarms  | Is something wrong, right now, in aggregate?
Logs                 | CloudWatch Logs / OpenSearch | What exactly happened, in detail, on one request or host?
Traces               | X-Ray / ADOT                 | Where in a distributed call chain did the problem originate?
Synthetic checks     | CloudWatch Synthetics        | Does the user-facing flow still work, from the outside in?
Real user monitoring | CloudWatch RUM               | What are actual end users experiencing in-browser?

None of these substitute for another — a mature exam answer usually combines at least two. “We need to know immediately if checkout is broken, and diagnose why within minutes” is a Synthetics canary (detection) plus X-Ray tracing (diagnosis), not one or the other.


Building SLO/SLI-Driven Alerting

Uptime percentages (“we were up 99.95% this month”) are a lagging, mostly useless indicator on their own. What actually drives good on-call behavior is defining Service Level Indicators (SLIs — the metrics that represent user-perceived health, like successful-request ratio or latency under a threshold) and Service Level Objectives (SLOs — the target for that SLI over a rolling window), then alerting on error budget burn rate rather than on raw threshold breaches.

SLO: 99.9% of requests succeed within a 30-day rolling window
Error budget: 0.1% of requests may fail = ~43 minutes of full downtime equivalent per month

Burn rate alerting:
  Fast burn (budget exhausted in < 2 hours at current rate) → page immediately
  Slow burn (budget exhausted in ~ 3 days at current rate)  → ticket, review next business day

On AWS, this is implemented with metric math to compute the burn rate as a ratio, a CloudWatch alarm on that derived metric, and typically two alarm thresholds (fast burn vs. slow burn) feeding different notification channels via SNS — a page for fast burn, a lower-urgency ticket for slow burn. This is a distinctly professional-level concept: the associate exam stops at “set an alarm on a threshold,” while DOP-C02 expects you to reason about why a flat threshold under- or over-alerts compared to a burn-rate model tied to an actual business commitment.
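The arithmetic behind those numbers can be sketched in a few lines. Burn rate here means the observed error ratio divided by the error budget ratio, so a burn rate of 1 spends the budget exactly over the 30-day window. The 2-hour and 3-day cutoffs come from the example above; the function names and the decision to treat anything slower than 3 days as "ok" are assumptions for illustration.

```python
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24                     # 30-day rolling window
ERROR_BUDGET = round(1 - SLO_TARGET, 6)    # 0.001, i.e. 0.1% of requests


def budget_minutes() -> float:
    """Full-downtime equivalent of the error budget over the window."""
    return WINDOW_HOURS * 60 * ERROR_BUDGET


def burn_rate(errors: int, requests: int) -> float:
    """Observed error ratio relative to the budgeted error ratio."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET


def hours_to_exhaustion(rate: float) -> float:
    """Hours until the whole budget is spent if the current rate holds."""
    return float("inf") if rate <= 0 else WINDOW_HOURS / rate


def classify(rate: float) -> str:
    hours = hours_to_exhaustion(rate)
    if hours < 2:      # fast burn: page immediately
        return "page"
    if hours <= 72:    # slow burn: ticket for the next business day
        return "ticket"
    return "ok"


print(f"budget: {budget_minutes():.1f} minutes per window")
for errors, requests in [(500, 1000), (20, 1000), (1, 10000)]:
    rate = burn_rate(errors, requests)
    print(f"{errors}/{requests}: burn rate {rate:.1f} -> {classify(rate)}")
```

In these terms, the fast-burn cutoff corresponds to a burn rate above 360 (720 hours / 2 hours) and the slow-burn cutoff to a burn rate of about 10 (720 hours / 72 hours). In CloudWatch the same ratio would be a metric math expression on the error and request counts, with one alarm per threshold.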


Exam Focus: What Questions Test From This Step