Step 1 — Monitoring & Observability

If you’ve spent any time on an on-call rotation, you already know the real skill being tested here isn’t “can you find CloudWatch in the console.” It’s whether you can build monitoring that tells you something is wrong before a customer does, without burying that signal under a pile of alerts nobody trusts anymore. That’s the lens to keep on for this whole domain.

One naming note before we go further: this certification used to be called AWS Certified SysOps Administrator – Associate. AWS renamed it to AWS Certified CloudOps Engineer – Associate (SOA-C03) to better reflect what the role actually covers day to day. If you’re searching around and keep finding “SysOps” study material, that’s the same lineage — much of it is still relevant, just be mindful of version drift on newer services.

CloudWatch Metrics: What’s Actually Being Measured

Every AWS resource you touch is already emitting metrics whether you asked for them or not. EC2 ships CPUUtilization, NetworkIn/Out, and disk I/O at a 5-minute interval by default. Flip on Detailed Monitoring and that interval drops to 1 minute — useful when you’re chasing a transient spike that a 5-minute average would smooth right over.

What EC2 does not give you natively is memory or disk space utilization — that requires the CloudWatch agent pushing custom metrics from inside the instance. This trips up a lot of candidates who assume CloudWatch sees everything by default. It doesn’t. Anything happening inside the guest OS needs an agent.

Metrics live in namespaces and get sliced by dimensions:

Namespace: AWS/EC2
  Dimension: InstanceId = i-0abc123
    Metric: CPUUtilization
      Datapoints: [12%, 14%, 71%, 68%, 15%] over 5-min periods

Namespace: CWAgent (custom)
  Dimension: InstanceId, Filesystem
    Metric: disk_used_percent
      Datapoints: [55%, 55%, 56%, 91%, 92%]

Custom metrics you push yourself (via the CLI, SDK, or embedded metric format from Lambda logs) can go down to 1-second resolution if you request it — but that’s expensive at scale, so reserve high-resolution metrics for things you’re actively debugging, not everything by default.
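To make the custom-metric idea concrete, here is a minimal sketch that builds a payload shaped like the keyword arguments of boto3's put_metric_data call. The namespace Custom/Example, the helper name, and the filesystem value are assumptions for illustration; nothing is sent to AWS.

```python
from datetime import datetime, timezone


def build_disk_metric(instance_id: str, filesystem: str, used_percent: float,
                      high_resolution: bool = False) -> dict:
    """Build kwargs shaped like boto3's cloudwatch.put_metric_data()."""
    return {
        "Namespace": "Custom/Example",  # hypothetical namespace
        "MetricData": [{
            "MetricName": "disk_used_percent",
            "Dimensions": [
                {"Name": "InstanceId", "Value": instance_id},
                {"Name": "Filesystem", "Value": filesystem},
            ],
            "Timestamp": datetime.now(timezone.utc),
            "Value": used_percent,
            "Unit": "Percent",
            # 1 = high resolution (1-second), 60 = standard resolution
            "StorageResolution": 1 if high_resolution else 60,
        }],
    }


params = build_disk_metric("i-0abc123", "/dev/xvda1", 91.0, high_resolution=True)
# With credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**params)
print(params["MetricData"][0]["StorageResolution"])
```

Leaving high_resolution off by default mirrors the advice above: opt in per metric while you are debugging, rather than paying for 1-second data everywhere.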

Alarms: The Difference Between Signal and Noise

A CloudWatch alarm watches one metric, or one metric math expression, against a threshold over an evaluation period and moves through three states: OK, ALARM, INSUFFICIENT_DATA. The last one matters more than people give it credit for: it means the alarm hasn’t received enough datapoints to evaluate, not that everything is fine. Treating INSUFFICIENT_DATA as “healthy” is a real production mistake, and the exam knows it.

                     ┌────────────────────────┐
  Metric datapoints ─┤  Evaluate M out of N   ├──► State
                     │  datapoints breaching  │
                     └────────────────────────┘
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                 ▼
             OK              ALARM          INSUFFICIENT_DATA
       (within range)   (breach confirmed)   (missing data —
                                               not necessarily healthy)

The single biggest lever for cutting alert fatigue is the “M out of N datapoints” setting. Alarming on a single breached datapoint turns every network blip into a page. Requiring, say, 3 out of 5 evaluation periods to breach absorbs the noise while still catching genuinely sustained problems.
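A small simulation shows why this setting matters. It is a simplified sketch, not CloudWatch's actual evaluator: it ignores the configurable missing-data treatment and only returns INSUFFICIENT_DATA when every datapoint in the window is missing. The function name and the 60% threshold are assumptions.

```python
from typing import Optional, Sequence


def evaluate_alarm(datapoints: Sequence[Optional[float]], threshold: float,
                   datapoints_to_alarm: int, evaluation_periods: int) -> str:
    """Return OK, ALARM, or INSUFFICIENT_DATA for the most recent window."""
    window = list(datapoints)[-evaluation_periods:]
    present = [d for d in window if d is not None]  # None = missing datapoint
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for d in present if d > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


cpu = [12, 14, 71, 68, 15]  # the CPUUtilization datapoints shown earlier

print(evaluate_alarm(cpu, 60, 1, 5))         # ALARM: one breach is enough
print(evaluate_alarm(cpu, 60, 3, 5))         # OK: only 2 of 5 breached
print(evaluate_alarm([None] * 5, 60, 3, 5))  # INSUFFICIENT_DATA
```

The same short spike pages someone under a 1-out-of-5 rule and stays quiet under 3-out-of-5.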

Composite alarms take this further by combining multiple underlying alarms with AND/OR logic. This is how you stop paging someone at 3 a.m. for a CPU spike that’s actually just a scheduled batch job — you might require CPU high AND latency high AND error rate elevated before it escalates. Individually, each metric might cross its threshold routinely; together, they only trip when something is genuinely broken.
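As a rough sketch of the AND case, the snippet below builds a rule string in the ALARM("name") form that composite alarm rules use, then evaluates the same logic locally. The three alarm names and both helper functions are hypothetical.

```python
from typing import Dict, Iterable


def build_alarm_rule(alarm_names: Iterable[str], operator: str = "AND") -> str:
    """Join child alarms into a composite rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_state(child_states: Dict[str, str], operator: str = "AND") -> str:
    """Local stand-in for how the rule combines child alarm states."""
    if not child_states:
        return "OK"
    in_alarm = [state == "ALARM" for state in child_states.values()]
    fired = all(in_alarm) if operator == "AND" else any(in_alarm)
    return "ALARM" if fired else "OK"


names = ["cpu-high", "latency-high", "error-rate-high"]  # hypothetical alarms
print(build_alarm_rule(names))

batch_job = {"cpu-high": "ALARM", "latency-high": "OK", "error-rate-high": "OK"}
outage = dict.fromkeys(names, "ALARM")
print(composite_state(batch_job))  # OK: CPU alone does not page anyone
print(composite_state(outage))     # ALARM: all three signals agree
```

The generated string is the kind of value you would pass as AlarmRule when creating a composite alarm.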

Anomaly detection alarms are worth knowing for workloads with a predictable daily or weekly rhythm. Instead of a flat threshold, CloudWatch builds a band based on historical behavior and alarms when the metric steps outside that band — handy for traffic that’s naturally low on weekends and shouldn’t trigger a static threshold built around weekday peak.
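The band idea can be illustrated with a deliberately crude stand-in: a mean plus or minus two standard deviations, computed separately for weekend and weekday history. CloudWatch's real model is more sophisticated than this; the request counts below are made-up illustration data.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def band(history: Sequence[float], width: float = 2.0) -> Tuple[float, float]:
    """Expected range: mean +/- width * standard deviation of past values."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(value: float, history: Sequence[float], width: float = 2.0) -> bool:
    low, high = band(history, width)
    return value < low or value > high


# Hypothetical requests per minute at the same hour on past days.
weekend_history = [100, 110, 95, 105, 90]
weekday_history = [1000, 1100, 950, 1050, 900]

print(is_anomalous(1000, weekend_history))  # True: far above weekend norm
print(is_anomalous(1000, weekday_history))  # False: ordinary weekday load
```

A static threshold tuned for weekday peak, say 1,200 requests per minute, would never flag that weekend reading.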

Alarm type	Best for	Watch-out
Static threshold	Predictable, flat baselines (disk space, error count)	Wrong threshold = constant noise or missed incidents
Anomaly detection	Traffic with daily/weekly seasonality	Needs a couple of weeks of history to calibrate well
Composite	Reducing false pages across correlated signals	Requires disciplined alarm hygiene underneath it
Metric math	Derived values (error rate = errors / requests)	Easy to get the math backwards under pressure
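For the metric math row, the derived value is simple enough to check by hand. This sketch computes an error rate as a percentage and guards against the no-traffic case; the function name and sample numbers are assumptions.

```python
def error_rate_percent(errors: float, requests: float) -> float:
    """errors / requests as a percentage; 0.0 when there was no traffic."""
    if requests <= 0:
        return 0.0
    return 100.0 * errors / requests


print(error_rate_percent(5, 200))  # 2.5
print(error_rate_percent(0, 0))    # 0.0
```

Swapping numerator and denominator here would give 4000 instead of 2.5, which is the "math backwards" failure the table warns about.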

Dashboards Built for the People Who’ll Actually Use Them

A dashboard nobody reads during an incident is decoration. The operationally useful pattern is to build dashboards around a golden signals view — latency, traffic, errors, saturation — per service, rather than one giant dashboard with every metric from every resource crammed in. During an incident, you want someone to glance at one screen and know which service is degraded, not scroll through forty widgets hunting for the anomaly.
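One way to keep a per-service dashboard this small is to generate its body from code. The sketch below assumes a service sitting behind an Application Load Balancer with instances in an Auto Scaling group; the resource names, region, and helper functions are placeholders, and the output is only the JSON body, not an API call.

```python
import json


def metric_widget(title: str, metric: list, stat: str, x: int, y: int) -> dict:
    """One half-width metric widget on the 24-column dashboard grid."""
    return {
        "type": "metric", "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": [metric],
            "stat": stat,
            "period": 60,
            "region": "us-east-1",  # assumed region
        },
    }


def golden_signals_dashboard(load_balancer: str, asg_name: str) -> str:
    alb = "AWS/ApplicationELB"
    widgets = [
        metric_widget("Latency (p99)",
                      [alb, "TargetResponseTime", "LoadBalancer", load_balancer], "p99", 0, 0),
        metric_widget("Traffic",
                      [alb, "RequestCount", "LoadBalancer", load_balancer], "Sum", 12, 0),
        metric_widget("Errors (5xx)",
                      [alb, "HTTPCode_Target_5XX_Count", "LoadBalancer", load_balancer], "Sum", 0, 6),
        metric_widget("Saturation (CPU)",
                      ["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", asg_name], "Average", 12, 6),
    ]
    return json.dumps({"widgets": widgets})


# Placeholder resource identifiers.
body = golden_signals_dashboard("app/orders-lb/0123456789abcdef", "orders-asg")
print(len(json.loads(body)["widgets"]))  # 4
```

The resulting string is the kind of value passed as DashboardBody when creating a dashboard; four widgets per service keeps the screen readable at a glance.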

Cross-account, cross-region dashboards matter more in 2026 than they used to, now that most orgs run multi-account landing zones as a baseline. CloudWatch supports pulling metrics from linked accounts into a single monitoring account’s dashboard, which is the pattern you want instead of asking every team to log into their own account to check health.

CloudWatch Logs and Logs Insights

Logs land in log groups, which contain log streams, which contain individual log events. The operational decisions that matter:

Retention — logs never expire by default, which quietly becomes a cost problem. Set explicit retention per log group based on compliance needs, not “just in case.”
Metric filters — turn a pattern match in your log lines into a numeric CloudWatch metric, so you can alarm on “ERROR” count per minute without shipping logs anywhere else.
Subscription filters — stream matching log events in near real time to Kinesis, Firehose, or Lambda for further processing.
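The first two decisions map onto small, explicit settings. The sketch below builds dictionaries shaped like the keyword arguments of the boto3 logs client's put_retention_policy and put_metric_filter calls. The log group name, filter name, metric name, and namespace are hypothetical, and nothing is sent to AWS.

```python
def retention_params(log_group: str, days: int) -> dict:
    """Kwargs shaped like logs.put_retention_policy()."""
    return {"logGroupName": log_group, "retentionInDays": days}


def error_metric_filter_params(log_group: str) -> dict:
    """Kwargs shaped like logs.put_metric_filter(): count events containing ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",        # hypothetical filter name
        "filterPattern": "ERROR",           # matches events containing the term ERROR
        "metricTransformations": [{
            "metricName": "ErrorCount",     # hypothetical metric name
            "metricNamespace": "Custom/Example",
            "metricValue": "1",             # add 1 per matching log event
            "defaultValue": 0.0,            # report 0 when nothing matches
        }],
    }


log_group = "/app/orders"  # hypothetical log group
print(retention_params(log_group, 30))
print(error_metric_filter_params(log_group)["metricTransformations"][0]["metricName"])
```

Setting a default value of 0 means quiet periods report a zero rather than no datapoint at all, so an alarm on ErrorCount does not drift into INSUFFICIENT_DATA just because nothing went wrong.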

Logs Insights is the part that separates someone who can grep a log file from someone who can actually run an investigation across a fleet. It’s a purpose-built query language, not full SQL, but it covers what you need for incident triage:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc

That single query answers “when did errors spike and how bad” across every log stream in the group, without downloading anything. Know the core commands — fields, filter, stats, sort, limit, parse — well enough to read one and predict its output, because the exam will show you a query and ask what it returns.
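If it helps to predict what that query returns, here is a plain-Python re-implementation of the same three steps (filter, bin into 5-minute buckets, sort by count descending). It is a reading aid under simplifying assumptions, not how Logs Insights works internally: events are (epoch seconds, message) tuples, the sample data is invented, and bins with no errors are simply absent from the result.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def error_counts_by_5m(events: Iterable[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """Return (bin_start_epoch, errorCount) pairs, highest count first."""
    counts: Counter = Counter()
    for timestamp, message in events:
        if "ERROR" in message:                       # filter @message like /ERROR/
            counts[timestamp - timestamp % 300] += 1  # bin(5m)
    # sort errorCount desc
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


events = [
    (0, "ERROR db timeout"), (10, "INFO ok"), (299, "ERROR db timeout"),
    (300, "ERROR retry failed"),
    (650, "ERROR 500"), (700, "ERROR 500"), (899, "ERROR 500"),
]
print(error_counts_by_5m(events))  # [(600, 3), (0, 2), (300, 1)]
```

Note that the result is ordered by error count, not by time, so the first row is the worst 5-minute window rather than the earliest one.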

CloudWatch Synthetics: Catching Problems Before Users Report Them

Synthetics runs scripted canaries — small Node.js or Python scripts using headless browsers — on a schedule, simulating what a real user does: load the homepage, log in, complete a checkout flow. The point is catching outside-in failures that internal metrics might miss entirely. Your servers can report 0% errors while your login page is actually broken because of a DNS or CDN issue between the canary’s vantage point and your endpoint — internal metrics won’t show that, but a synthetic check will.

Scheduled Canary (every 5 min)
        │
        ▼
  Headless browser run from AWS region
        │
        ▼
  Screenshot + HAR file stored in S3
        │
   ┌────┴────┐
   ▼         ▼
 Pass      Fail ──► CloudWatch Alarm ──► SNS ──► On-call

Canaries save screenshots and HAR files from their runs, which is genuinely useful when one fails: you don’t have to reproduce the failure yourself, you can just look at what the canary saw.
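The shape of an outside-in check is easy to see in miniature. The sketch below is plain Python, not the Synthetics runtime API: the fetch function is injected so the logic can be exercised without a network, and the URL, expected text, and fake fetchers are all made up.

```python
from typing import Callable, Tuple

Fetch = Callable[[str], Tuple[int, str]]  # returns (status_code, body)


def run_check(url: str, fetch: Fetch, must_contain: str) -> dict:
    """Pass only if the page loads with 200 and shows the expected text."""
    try:
        status, body = fetch(url)
    except Exception as exc:  # DNS failure, timeout, TLS error, ...
        return {"url": url, "passed": False, "reason": f"request failed: {exc}"}
    if status != 200:
        return {"url": url, "passed": False, "reason": f"unexpected status {status}"}
    if must_contain not in body:
        return {"url": url, "passed": False, "reason": "expected text missing"}
    return {"url": url, "passed": True, "reason": "ok"}


def healthy_fetch(url: str) -> Tuple[int, str]:
    return 200, "<h1>Sign in</h1>"


def dns_broken_fetch(url: str) -> Tuple[int, str]:
    raise ConnectionError("name resolution failed")


login_url = "https://example.com/login"  # placeholder URL
print(run_check(login_url, healthy_fetch, "Sign in")["passed"])     # True
print(run_check(login_url, dns_broken_fetch, "Sign in")["reason"])  # request failed: name resolution failed
```

The second case is the one internal metrics miss: the servers never see the request, so they have nothing to report.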

X-Ray: Tracing Requests Across Services

Once you’re past a monolith, “which service caused the slowdown” stops being obvious from metrics alone. X-Ray attaches a trace ID to a request and follows it across every instrumented service, producing a service map and a timeline segment breakdown.

Client Request (trace ID: 1-abc-def)
   │
   ▼
API Gateway ──► Lambda (Auth) ──► Lambda (Order Service) ──► DynamoDB
   45ms            80ms                 210ms  ⚠ slow          15ms

Service map highlights the Order Service node in red —
that's where your latency budget is actually going.

The practical value: instead of guessing which downstream call is slow, you get segment-level timing for the entire request path. Sampling rules control cost: you don’t need to trace 100% of requests to find a systemic slow path, and the default rule (the first request each second, plus 5% of any additional requests) keeps costs sane on high-traffic services while still catching patterns.
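Two quick calculations make this tangible. The sketch below uses the timings from the trace diagram, treating each number as that hop's own time (an assumption), and estimates trace volume under X-Ray's default sampling rule of one request per second plus 5% of the rest. The function names are made up.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, float]  # (service name, milliseconds)

trace: List[Segment] = [
    ("API Gateway", 45), ("Lambda (Auth)", 80),
    ("Lambda (Order Service)", 210), ("DynamoDB", 15),
]


def slowest_segment(segments: List[Segment]) -> Segment:
    return max(segments, key=lambda segment: segment[1])


def latency_share(segments: List[Segment]) -> Dict[str, float]:
    """Percentage of total request time spent in each segment."""
    total = sum(ms for _, ms in segments)
    return {name: round(100 * ms / total, 1) for name, ms in segments}


def expected_traces_per_second(requests_per_second: float,
                               reservoir: float = 1.0,
                               fixed_rate: float = 0.05) -> float:
    """Rough trace volume: the reservoir first, then a fixed share of the rest."""
    guaranteed = min(requests_per_second, reservoir)
    return guaranteed + fixed_rate * max(requests_per_second - reservoir, 0.0)


print(slowest_segment(trace))                          # ('Lambda (Order Service)', 210)
print(latency_share(trace)["Lambda (Order Service)"])  # 60.0
print(expected_traces_per_second(1000))
```

At 1,000 requests per second that works out to roughly 51 traces per second, which is still plenty to see that the Order Service accounts for 60% of the request time.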

Exam Focus: What Questions Test From This Step

Reading a CloudWatch alarm scenario and identifying whether OK, ALARM, or INSUFFICIENT_DATA applies, especially around missing datapoints
Choosing composite alarms or “M out of N” evaluation to reduce false-positive paging
Knowing that memory and disk metrics require the CloudWatch agent — EC2 doesn’t expose them natively
Interpreting or completing a Logs Insights query (fields, filter, stats, sort)
When to reach for metric filters vs subscription filters vs Logs Insights
Recognizing Synthetics canaries as the tool for outside-in, user-experience monitoring
Understanding what an X-Ray service map and trace segment actually represent
Anomaly detection alarms vs static threshold alarms for seasonal traffic patterns

Written by NPBlue Cloud Team, Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy. Spot an error? Let us know.