Step 3 — Monitoring, Logging & Observability

Here’s a question worth sitting with before you read another word: when your pager goes off at 2 a.m., what tells you what to do versus what merely tells you something is wrong? Most teams over-invest in the second and under-invest in the first. This step is about building an observability stack on AWS that actually earns its keep during an incident, not just one that produces dashboards nobody looks at until something breaks.

CloudWatch Metrics: The Foundation, Handled Properly

At the professional level, the interesting problems aren’t “how do I get a metric into CloudWatch” — they’re about resolution, aggregation, and noise.

Standard vs. high-resolution metrics. Standard resolution stores data at one-minute granularity. High-resolution custom metrics (down to one second) cost more and are appropriate for latency-sensitive services where a one-minute-old data point is already too stale to act on — think real-time bidding systems or trading platforms, not your average CRUD API.
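
As a rough illustration of where the resolution choice is made, the sketch below assembles the request body that boto3's `put_metric_data` accepts. The `StorageResolution` field on each datum selects one-second or one-minute storage. The namespace `Example/Bidding` and the metric name `BidLatency` are hypothetical, and the actual API call is left as a comment so the snippet runs without AWS credentials.

```python
from datetime import datetime, timezone


def latency_datum(value_ms: float, high_resolution: bool) -> dict:
    """Build one entry for the MetricData list of put_metric_data."""
    return {
        "MetricName": "BidLatency",  # hypothetical metric name
        "Timestamp": datetime.now(timezone.utc),
        "Value": value_ms,
        "Unit": "Milliseconds",
        # 1 = high resolution (one-second), 60 = standard (one-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }


request = {
    "Namespace": "Example/Bidding",  # hypothetical namespace
    "MetricData": [latency_datum(12.4, high_resolution=True)],
}

# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
print(request["MetricData"][0]["StorageResolution"])
```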

Metric math lets you combine multiple metrics into a derived expression without writing custom instrumentation. For example, you can compute an error rate as (errors / requests) * 100 directly in a CloudWatch alarm, rather than emitting a pre-computed percentage from the application. This matters because a raw count (5 errors) means nothing without context: 5 errors out of 40 requests is a serious problem, while 5 errors out of 40,000 is background noise.
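
A minimal sketch of an alarm built on a metric math expression follows. It only assembles the keyword arguments for boto3's `put_metric_alarm`; the alarm name, the load balancer identifier, the three-period evaluation window, and the 5% threshold are illustrative assumptions, not values from a real deployment.

```python
def error_rate_pct(errors: int, requests: int) -> float:
    """The same arithmetic the metric math expression performs."""
    return errors / requests * 100


def error_rate_alarm(load_balancer: str, threshold_pct: float = 5.0) -> dict:
    """Build put_metric_alarm kwargs for a 5xx error-rate alarm."""
    dims = [{"Name": "LoadBalancer", "Value": load_balancer}]

    def raw_metric(metric_id: str, metric_name: str) -> dict:
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": dims,
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,  # input to the expression, not alarmed on
        }

    return {
        "AlarmName": "HighErrorRate",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_pct,
        "EvaluationPeriods": 3,
        "TreatMissingData": "notBreaching",
        "Metrics": [
            raw_metric("errors", "HTTPCode_Target_5XX_Count"),
            raw_metric("requests", "RequestCount"),
            {
                "Id": "error_rate",
                "Expression": "(errors / requests) * 100",
                "Label": "5xx error rate (%)",
                "ReturnData": True,  # the single series the alarm evaluates
            },
        ],
    }


alarm = error_rate_alarm("app/example-alb/0123456789abcdef")  # hypothetical ALB
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(error_rate_pct(5, 40))  # 12.5
```

Only one entry in `Metrics` may return data for an alarm, which is why the two raw counts are marked `ReturnData: False` and the derived percentage carries the threshold.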

Anomaly Detection fits bands around a metric’s expected range based on historical patterns (including daily/weekly seasonality) rather than a fixed threshold. This is the correct answer whenever a scenario describes traffic with a strong daily cycle — a static “alert if CPU > 80%” threshold either fires constantly during normal peak hours or misses real problems during off-peak hours. An anomaly detection alarm compares against the expected band for that time of day, not a flat line.
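
The shape of an anomaly detection alarm differs from a static one in two ways: there is no `Threshold` value, and `ThresholdMetricId` points at an `ANOMALY_DETECTION_BAND` expression instead. The sketch below builds such a request for boto3's `put_metric_alarm`. The alarm name, the instance ID, and the band width of 2 standard deviations are assumptions chosen for illustration.

```python
def anomaly_alarm(instance_id: str, band_width: float = 2.0) -> dict:
    """Build put_metric_alarm kwargs that alarm when CPU leaves its expected band."""
    return {
        "AlarmName": "CpuOutsideExpectedBand",  # hypothetical name
        # Fires above the band; use LessThanLowerOrGreaterThanUpperThreshold
        # to also catch unusually low values.
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # compare against the band, not a fixed number
        "TreatMissingData": "missing",
        "Metrics": [
            {
                "Id": "cpu",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
                "ReturnData": True,
            },
            {
                "Id": "band",
                # Second argument is the band width in standard deviations.
                "Expression": f"ANOMALY_DETECTION_BAND(cpu, {band_width:g})",
                "Label": "Expected CPU range",
                "ReturnData": True,
            },
        ],
    }


alarm = anomaly_alarm("i-0123456789abcdef0")  # hypothetical instance ID
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(alarm["Metrics"][1]["Expression"])
```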

Composite Alarms in Practice

A single metric crossing a threshold rarely tells the whole story. Composite alarms combine the state of multiple underlying alarms with AND/OR/NOT logic, and they’re the professional-level answer to alarm fatigue.

ALARM: HighErrorRate        (5xx rate > 5%)
ALARM: HighLatency          (p99 latency > 2s)
ALARM: LowHealthyHostCount  (< 2 healthy targets in target group)

Composite Alarm: "ServiceDegraded"
  = HighErrorRate OR HighLatency OR LowHealthyHostCount

Composite Alarm: "ServiceCriticalWithCapacityRisk"
  = HighErrorRate AND LowHealthyHostCount
    (suppresses noisy latency-only alerts during a deploy,
     escalates only when errors AND capacity both look bad)

The practical value: page the on-call engineer only on the composite alarm, not on every underlying alarm individually. This alone eliminates a huge share of the alert noise that causes on-call burnout — and it’s a recurring theme on this exam, which rewards designs that reduce operational toil, not just designs that technically work.
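
The two composite alarms described above could be expressed roughly as follows. CloudWatch composite alarms take a rule string built from `ALARM("name")` terms joined with AND/OR/NOT. The helper functions and the SNS topic ARN here are hypothetical, and only the composite alarm carries a paging action.

```python
PAGER_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:oncall-pager"  # hypothetical


def in_alarm(name: str) -> str:
    """Rule term that is true while the named alarm is in ALARM state."""
    return f'ALARM("{name}")'


def any_of(*names: str) -> str:
    return " OR ".join(in_alarm(n) for n in names)


def all_of(*names: str) -> str:
    return " AND ".join(in_alarm(n) for n in names)


service_degraded = {
    "AlarmName": "ServiceDegraded",
    "AlarmRule": any_of("HighErrorRate", "HighLatency", "LowHealthyHostCount"),
    "ActionsEnabled": True,
    "AlarmActions": [],  # visible on dashboards, does not page
}

service_critical = {
    "AlarmName": "ServiceCriticalWithCapacityRisk",
    "AlarmRule": all_of("HighErrorRate", "LowHealthyHostCount"),
    "ActionsEnabled": True,
    "AlarmActions": [PAGER_TOPIC_ARN],  # the only alarm that pages on-call
}

# With boto3, each dict would be passed to:
#   boto3.client("cloudwatch").put_composite_alarm(**service_critical)
print(service_critical["AlarmRule"])
```

The underlying metric alarms keep no actions of their own in this design, so a latency blip during a deploy changes state on a dashboard without waking anyone.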

Centralized Logging Across Accounts

A multi-account AWS organization (which is the default assumption at professional level, not the exception) needs logs aggregated somewhere a security or operations team can actually query them — not scattered across forty account-local CloudWatch Logs consoles.

              Member Account A            Member Account B
              ┌─────────────────┐        ┌─────────────────┐
              │ CloudWatch Logs  │        │ CloudWatch Logs  │
              │  Log Groups      │        │  Log Groups      │
              └────────┬─────────┘        └────────┬─────────┘
                       │ Subscription Filter        │ Subscription Filter
                       ▼                            ▼
              ┌──────────────────────────────────────────┐
              │   Kinesis Data Streams / Firehose          │
              │   (Logging/Observability Account)          │
              └───────────────────┬────────────────────────┘
                                  ▼
                     ┌────────────────────────┐
                     │  Amazon S3 (raw archive) │
                     │  + OpenSearch Service     │
                     │    (searchable, dashboards)│
                     └────────────────────────┘

The mechanism is a cross-account subscription filter. Each member account’s log group streams to a Kinesis Data Stream (or Firehose delivery stream) owned by the central logging account. The stream is exposed through a CloudWatch Logs destination in that account, and the destination’s access policy names the member accounts allowed to subscribe. This is the pattern to recognize whenever a scenario says “security team needs a single place to search logs from all accounts” or “logs must be retained centrally even if a member account is compromised or deleted.” Centralizing logs outside the account that produced them is itself a security control, since an attacker with access to a compromised account can’t tamper with logs that have already left it.
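
One way to sketch the two halves of this setup: the logging account attaches an access policy to a CloudWatch Logs destination, and each member account creates a subscription filter that points at that destination's ARN. The account IDs, destination name, filter name, and log group below are all hypothetical, and the boto3 calls are shown only as comments.

```python
import json

LOGGING_ACCOUNT = "999999999999"  # hypothetical central logging account
DESTINATION_ARN = (
    f"arn:aws:logs:us-east-1:{LOGGING_ACCOUNT}:destination:central-logs"
)


def destination_access_policy(destination_arn: str, member_accounts: list) -> str:
    """Policy the logging account attaches to its CloudWatch Logs destination."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowMemberAccountsToSubscribe",
            "Effect": "Allow",
            "Principal": {"AWS": sorted(member_accounts)},
            "Action": "logs:PutSubscriptionFilter",
            "Resource": destination_arn,
        }],
    })


def subscription_filter(log_group: str, destination_arn: str) -> dict:
    """Kwargs a member account passes to put_subscription_filter."""
    return {
        "logGroupName": log_group,
        "filterName": "ship-to-central-logging",  # hypothetical name
        "filterPattern": "",  # empty pattern forwards every log event
        "destinationArn": destination_arn,
    }


policy = destination_access_policy(DESTINATION_ARN, ["222222222222", "111111111111"])
member_filter = subscription_filter("/aws/lambda/order-service", DESTINATION_ARN)

# Logging account:  logs.put_destination_policy(
#                       destinationName="central-logs", accessPolicy=policy)
# Member account:   logs.put_subscription_filter(**member_filter)
print(policy)
```

For an organization with many accounts, listing every account ID gets unwieldy. The same policy can instead be scoped by organization ID with a condition key, which is not shown here.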

For CloudTrail specifically, the equivalent pattern is an organization trail created from the management account, which automatically applies to every member account and writes to a single S3 bucket — no per-account trail configuration needed, and member accounts can’t disable it.
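
For reference, the request for an organization trail differs from a per-account trail mainly by one flag. The sketch below shows the keyword arguments boto3's `create_trail` would receive when run from the management account. The trail and bucket names are hypothetical, and the bucket policy that lets CloudTrail write to the bucket is assumed to exist already.

```python
org_trail = {
    "Name": "org-audit-trail",                    # hypothetical trail name
    "S3BucketName": "example-org-cloudtrail-logs",  # hypothetical bucket
    "IsOrganizationTrail": True,     # applies to every member account
    "IsMultiRegionTrail": True,      # capture events from all regions
    "EnableLogFileValidation": True,  # digest files for tamper detection
}

# From the management account, with boto3:
#   cloudtrail = boto3.client("cloudtrail")
#   cloudtrail.create_trail(**org_trail)
#   cloudtrail.start_logging(Name=org_trail["Name"])
print(org_trail["IsOrganizationTrail"])
```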

X-Ray: Service Maps and Trace Analysis

Metrics tell you that p99 latency spiked. Distributed tracing tells you where in the call graph it spiked. X-Ray instruments requests as they flow through services and assembles a service map from the resulting trace data.

Client ──► API Gateway ──► Lambda (OrderService) ──► DynamoDB
                                    │
                                    └──► Lambda (PaymentService) ──► external HTTPS call
                                                                        (3rd-party gateway)

X-Ray Service Map annotates each edge with:
  - average/p99 latency
  - error/fault rate
  - request volume

A slow external HTTPS call to the payment gateway shows up as
a red/yellow node — pinpointing the bottleneck without grepping
logs across three separate services.

Key mechanics worth knowing cold:

Sampling rules control what fraction of requests get traced — by default 1 request/second plus 5% of additional requests, but you can define custom rules per service or URL path to trace 100% of a specific critical endpoint while sampling lightly elsewhere.
Annotations vs. metadata on a trace segment: annotations are indexed and searchable (use them for fields you’ll filter on, like customerId or orderStatus); metadata is not indexed (use it for large or non-queryable context, like a full request payload).
X-Ray works alongside, not instead of, OpenTelemetry. By 2026 the ecosystem has clearly moved toward OpenTelemetry as the vendor-neutral instrumentation standard, and the ADOT (AWS Distro for OpenTelemetry) collector can export the same trace data to X-Ray and to third-party backends simultaneously. That makes ADOT the answer whenever a scenario wants tracing that isn’t locked into a single vendor’s dashboard.
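
To make the sampling arithmetic and the annotation/metadata split concrete, here is a small sketch. The function models the default rule (a reservoir of 1 request per second plus a 5% fixed rate on the remainder) as a simple expected-value calculation. The custom rule is shaped like the `SamplingRule` argument of boto3's `create_sampling_rule`. The rule name, URL path, and segment field values are hypothetical.

```python
def expected_traces_per_second(
    requests_per_second: float, reservoir: float = 1, fixed_rate: float = 0.05
) -> float:
    """Expected traced requests per second under a reservoir + fixed-rate rule."""
    guaranteed = min(requests_per_second, reservoir)
    return guaranteed + (requests_per_second - guaranteed) * fixed_rate


# Custom rule: trace every request to a critical endpoint.
checkout_rule = {
    "RuleName": "trace-all-checkout",  # hypothetical rule name
    "Priority": 10,        # lower numbers are evaluated first
    "FixedRate": 1.0,      # 100% of matching requests
    "ReservoirSize": 0,
    "ServiceName": "*",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "POST",
    "URLPath": "/checkout*",  # hypothetical path
    "ResourceARN": "*",
    "Version": 1,
}
# With boto3: boto3.client("xray").create_sampling_rule(SamplingRule=checkout_rule)

# Fragment of a segment document showing where each kind of data lives.
segment_fragment = {
    # Indexed: usable in filter expressions such as
    #   annotation.orderStatus = "FAILED"
    "annotations": {"customerId": "c-123", "orderStatus": "FAILED"},
    # Not indexed: stored with the trace for later inspection only.
    "metadata": {"debug": {"request_payload": {"items": 3, "currency": "USD"}}},
}

# At 200 requests/second the default rule traces roughly 11 requests/second.
print(round(expected_traces_per_second(200), 2))
```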

Comparing the Observability Building Blocks

Signal	AWS Service	Answers the question
Metrics	CloudWatch Metrics + Alarms	Is something wrong, right now, in aggregate?
Logs	CloudWatch Logs / OpenSearch	What exactly happened, in detail, on one request or host?
Traces	X-Ray / ADOT	Where in a distributed call chain did the problem originate?
Synthetic checks	CloudWatch Synthetics	Does the user-facing flow still work, from the outside in?
Real user monitoring	CloudWatch RUM	What are actual end users experiencing in-browser?

None of these substitute for another — a mature exam answer usually combines at least two. “We need to know immediately if checkout is broken, and diagnose why within minutes” is a Synthetics canary (detection) plus X-Ray tracing (diagnosis), not one or the other.

Building SLO/SLI-Driven Alerting

Uptime percentages (“we were up 99.95% this month”) are a lagging, mostly useless indicator on their own. What actually drives good on-call behavior is defining Service Level Indicators (SLIs — the metrics that represent user-perceived health, like successful-request ratio or latency under a threshold) and Service Level Objectives (SLOs — the target for that SLI over a rolling window), then alerting on error budget burn rate rather than on raw threshold breaches.

SLO: 99.9% of requests succeed within a 30-day rolling window
Error budget: 0.1% of requests may fail = ~43 minutes of full downtime equivalent per month

Burn rate alerting:
  Fast burn  (budget exhausted in < 2 hours at current rate) → page immediately
  Slow burn  (budget exhausted in ~ 3 days at current rate)  → ticket, review next business day

On AWS, this is implemented with metric math to compute the burn rate as a ratio, a CloudWatch alarm on that derived metric, and typically two alarm thresholds (fast burn vs. slow burn) feeding different notification channels via SNS — a page for fast burn, a lower-urgency ticket for slow burn. This is a distinctly professional-level concept: the associate exam stops at “set an alarm on a threshold,” while DOP-C02 expects you to reason about why a flat threshold under- or over-alerts compared to a burn-rate model tied to an actual business commitment.
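
The arithmetic behind those two thresholds can be worked through in a few lines. This sketch assumes the 99.9% SLO and 30-day window from the example above, and it defines burn rate as the window length divided by the time it would take to exhaust the budget at the current rate. The function names are illustrative.

```python
SLO = 0.999
WINDOW_DAYS = 30


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Full-downtime-equivalent minutes allowed within the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate_for_exhaustion(hours_to_exhaust: float, window_days: int) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return window_days * 24 / hours_to_exhaust


def error_ratio_threshold(burn_rate: float, slo: float) -> float:
    """Failed-request ratio that corresponds to a given burn rate."""
    return burn_rate * (1 - slo)


fast_burn = burn_rate_for_exhaustion(2, WINDOW_DAYS)       # budget gone in 2 hours
slow_burn = burn_rate_for_exhaustion(3 * 24, WINDOW_DAYS)  # budget gone in 3 days

print(f"error budget: {error_budget_minutes(SLO, WINDOW_DAYS):.1f} minutes")
print(f"fast burn rate: {fast_burn:.0f}x, "
      f"page at error ratio >= {error_ratio_threshold(fast_burn, SLO):.2%}")
print(f"slow burn rate: {slow_burn:.0f}x, "
      f"ticket at error ratio >= {error_ratio_threshold(slow_burn, SLO):.2%}")
```

In CloudWatch terms, the error ratio would be a metric math expression over failure and request counts, and each threshold printed here would become the `Threshold` of its own alarm routed to a different SNS topic.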

Exam Focus: What Questions Test From This Step

Choosing anomaly detection over static thresholds for metrics with strong seasonality
Composite alarms as the mechanism for reducing alert noise and encoding “alert only when multiple conditions co-occur”
Cross-account centralized logging via subscription filters into Kinesis/Firehose, and why centralizing logs outside the source account is itself a security control
Organization CloudTrail trails versus per-account trails
X-Ray service maps for pinpointing latency/fault sources in a distributed call chain
Sampling rule configuration and the annotations-vs-metadata distinction on trace segments
Recognizing when a scenario wants vendor-neutral instrumentation (OpenTelemetry/ADOT) instead of X-Ray-only tracing
Matching a signal type (metrics, logs, traces, synthetics, RUM) to what the scenario is actually asking to detect or diagnose
Error-budget burn-rate alerting versus flat-threshold alerting, and fast-burn vs. slow-burn notification routing

Written by NPBlue Cloud Team, Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.