Step 1: Monitoring & Observability

If you’ve spent any time on an on-call rotation, you already know the real skill being tested here isn’t “can you find CloudWatch in the console.” It’s whether you can build monitoring that tells you something is wrong before a customer does, without burying that signal under a pile of alerts nobody trusts anymore. That’s the lens to keep on for this whole domain.

One naming note before we go further: this certification used to be called AWS Certified SysOps Administrator – Associate. AWS renamed it to AWS Certified CloudOps Engineer – Associate (SOA-C03) to better reflect what the role actually covers day to day. If you’re searching around and keep finding “SysOps” study material, that’s the same lineage. Much of it is still relevant; just be mindful of version drift on newer services.


CloudWatch Metrics: What’s Actually Being Measured

Every AWS resource you touch is already emitting metrics whether you asked for them or not. EC2 ships CPUUtilization, NetworkIn/Out, and disk I/O at a 5-minute interval by default. Flip on Detailed Monitoring and that interval drops to 1 minute, which is useful when you’re chasing a transient spike that a 5-minute average would smooth right over.
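To make the smoothing effect concrete, here is a small arithmetic sketch. The five 1-minute readings and the 70% threshold are made-up illustration values, and treating the 5-minute datapoint as a plain average of them is a simplification.

```python
# Hypothetical 1-minute CPUUtilization readings (percent) covering one 5-minute window.
one_minute_samples = [12.0, 14.0, 95.0, 13.0, 11.0]
threshold = 70.0  # assumed alarm threshold, for illustration only

# With a 5-minute period and the Average statistic, the spike is diluted
# by the four quiet minutes around it.
five_minute_average = sum(one_minute_samples) / len(one_minute_samples)

# With 1-minute datapoints, the spike stays visible on its own.
spike_visible_at_1_min = any(s > threshold for s in one_minute_samples)
spike_visible_at_5_min = five_minute_average > threshold

print(five_minute_average)     # 29.0
print(spike_visible_at_1_min)  # True
print(spike_visible_at_5_min)  # False
```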

What EC2 does not give you natively is memory or disk space utilization; that requires the CloudWatch agent pushing custom metrics from inside the instance. This trips up a lot of candidates who assume CloudWatch sees everything by default. It doesn’t. Anything happening inside the guest OS needs an agent.

Metrics live in namespaces and get sliced by dimensions:

Namespace: AWS/EC2
Dimension: InstanceId = i-0abc123
Metric: CPUUtilization
Datapoints: [12%, 14%, 71%, 68%, 15%] over 5-min periods

Namespace: CWAgent (custom)
Dimension: InstanceId, Filesystem
Metric: disk_used_percent
Datapoints: [55%, 55%, 56%, 91%, 92%]

Custom metrics you push yourself (via the CLI, SDK, or embedded metric format from Lambda logs) can go down to 1-second resolution if you request it, but that’s expensive at scale, so reserve high-resolution metrics for things you’re actively debugging, not everything by default.
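As a rough sketch of the embedded metric format route, the helper below builds one EMF log line as JSON. The function name, the MyApp/Checkout namespace, the Service dimension, and the Latency metric are all hypothetical; only the `_aws` envelope follows the published EMF layout, and it is worth checking against the specification before relying on it.

```python
import json
import time


def emf_record(namespace, dimensions, metric_name, value,
               unit="Milliseconds", high_resolution=False):
    """Build one embedded metric format (EMF) log line as a JSON string."""
    metric_definition = {"Name": metric_name, "Unit": unit}
    if high_resolution:
        # StorageResolution 1 requests 1-second resolution; the default is 60.
        metric_definition["StorageResolution"] = 1
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],  # one dimension set (the keys)
                    "Metrics": [metric_definition],
                }
            ],
        },
        metric_name: value,  # the metric value sits at the top level
    }
    record.update(dimensions)  # so do the dimension values
    return json.dumps(record)


# Hypothetical names throughout. In a Lambda function, printing this line
# to stdout sends it to the function's log group.
line = emf_record("MyApp/Checkout", {"Service": "checkout"}, "Latency", 123.0,
                  high_resolution=True)
print(line)
```

Leaving `high_resolution` off by default mirrors the advice above: standard resolution unless you are actively debugging.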


Alarms: The Difference Between Signal and Noise

A CloudWatch alarm watches one metric (or one metric math expression) against a threshold over an evaluation period and moves through three states: OK, ALARM, INSUFFICIENT_DATA. The last one matters more than people give it credit for: it means the alarm hasn’t received enough datapoints to evaluate, not that everything is fine. Treating INSUFFICIENT_DATA as “healthy” is a real production mistake, and the exam knows it.

                    ┌────────────────────────┐
Metric datapoints ──┤ Evaluate over M of N   ├──► State
                    │ periods breaching      │
                    └────────────────────────┘
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                 ▼
             OK               ALARM       INSUFFICIENT_DATA
       (within range)  (breach confirmed)  (missing data,
                                            not necessarily healthy)

The single biggest lever for cutting alert fatigue is the “M out of N datapoints” setting. Alarming on a single breached datapoint turns every network blip into a page. Requiring, say, 3 out of 5 evaluation periods to breach absorbs the noise while still catching genuinely sustained problems.
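The effect of M out of N can be sketched with a toy evaluator. This is a simplified model, not CloudWatch's actual evaluation logic: it ignores missing datapoints instead of applying the alarm's missing-data setting, and the function name, threshold, and sample series are assumptions for illustration.

```python
def evaluate_alarm(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Return 'OK', 'ALARM', or 'INSUFFICIENT_DATA' for the most recent window.

    datapoints: newest-last list of values; None marks a missing datapoint.
    Simplification: missing datapoints are skipped, and the window is only
    INSUFFICIENT_DATA when every datapoint in it is missing.
    """
    window = datapoints[-evaluation_periods:]
    present = [v for v in window if v is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for v in present if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


blip = [20, 20, 95]                 # one noisy datapoint, newest last
sustained = [20, 85, 90, 40, 95]    # three of the last five breach 70
silent = [None, None, None, None, None]

print(evaluate_alarm(blip, 70, 1, 1))       # ALARM: 1 of 1 pages on the blip
print(evaluate_alarm(blip, 70, 3, 5))       # OK: 3 of 5 absorbs it
print(evaluate_alarm(sustained, 70, 3, 5))  # ALARM: sustained breach still caught
print(evaluate_alarm(silent, 70, 3, 5))     # INSUFFICIENT_DATA, not "healthy"
```

In the PutMetricAlarm API the two numbers map to EvaluationPeriods (N) and DatapointsToAlarm (M), and TreatMissingData controls how gaps are counted.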

Composite alarms take this further by combining multiple underlying alarms with AND/OR logic. This is how you stop paging someone at 3 a.m. for a CPU spike that’s actually just a scheduled batch job: you might require CPU high AND latency high AND error rate elevated before it escalates. Individually, each metric might cross its threshold routinely; together, they only trip when something is genuinely broken.
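A composite alarm is defined by a rule expression over its child alarms. The sketch below assembles such a rule for the CPU, latency, and error-rate case described above; the helper, the alarm names, and the SNS topic ARN are placeholders, and the dictionary is only shaped like the PutCompositeAlarm parameters rather than sent anywhere.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarm names into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical child alarms; each would be an ordinary metric alarm.
rule = composite_rule([
    "checkout-cpu-high",
    "checkout-latency-high",
    "checkout-error-rate-high",
])

# Shaped like the PutCompositeAlarm request parameters
# (for example, keyword arguments to boto3's put_composite_alarm).
composite_alarm = {
    "AlarmName": "checkout-degraded",
    "AlarmRule": rule,
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:oncall-pages"],  # placeholder
}

print(rule)
```

Only the composite alarm carries the paging action here; the child alarms can stay silent and exist purely as inputs.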

Anomaly detection alarms are worth knowing for workloads with a predictable daily or weekly rhythm. Instead of a flat threshold, CloudWatch builds a band based on historical behavior and alarms when the metric steps outside that band. That is handy for traffic that’s naturally low on weekends and shouldn’t trigger a static threshold built around weekday peak.
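For orientation, an anomaly detection alarm is expressed as a metric plus an ANOMALY_DETECTION_BAND expression that the alarm uses as its threshold. The sketch below only builds a dictionary shaped like PutMetricAlarm parameters; the alarm name, load balancer value, band width of 2, and period are assumptions, so verify the fields against the API reference before using them.

```python
def band_expression(metric_id, width):
    """Metric math expression for an anomaly detection band around metric_id."""
    return f"ANOMALY_DETECTION_BAND({metric_id}, {width})"


# Hypothetical alarm on request volume for an Application Load Balancer.
anomaly_alarm = {
    "AlarmName": "web-requests-outside-expected-band",
    # Alarm when the metric leaves the band in either direction.
    "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 3,
    "ThresholdMetricId": "band",  # the band, not a fixed number, is the threshold
    "Metrics": [
        {
            "Id": "requests",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "RequestCount",
                    "Dimensions": [
                        {"Name": "LoadBalancer", "Value": "app/web/0123456789abcdef"}
                    ],
                },
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "band",
            "Expression": band_expression("requests", 2),
            "Label": "Expected request range",
        },
    ],
}

print(anomaly_alarm["Metrics"][1]["Expression"])
```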

| Alarm type | Best for | Watch-out |
| --- | --- | --- |
| Static threshold | Predictable, flat baselines (disk space, error count) | Wrong threshold = constant noise or missed incidents |
| Anomaly detection | Traffic with daily/weekly seasonality | Needs a couple weeks of history to calibrate well |
| Composite | Reducing false pages across correlated signals | Requires disciplined alarm hygiene underneath it |
| Metric math | Derived values (error rate = errors / requests) | Easy to get the math backwards under pressure |
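The metric math row is easy to check with numbers. The sketch below computes an error rate locally and shows the same formula as a metric math expression; the query IDs, the load balancer value, and the sample counts are assumptions for illustration.

```python
def error_rate_percent(errors, requests):
    """Error rate as a percentage: errors divided by requests, not the reverse."""
    if requests == 0:
        return 0.0  # no traffic means nothing to divide by
    return 100.0 * errors / requests


def alb_metric(metric_name):
    """Metric definition for a hypothetical Application Load Balancer."""
    return {
        "Metric": {
            "Namespace": "AWS/ApplicationELB",
            "MetricName": metric_name,
            "Dimensions": [{"Name": "LoadBalancer", "Value": "app/web/0123456789abcdef"}],
        },
        "Period": 60,
        "Stat": "Sum",
    }


# Shaped like the Metrics list of a metric math alarm: two inputs, one result.
error_rate_queries = [
    {"Id": "errors", "MetricStat": alb_metric("HTTPCode_Target_5XX_Count"), "ReturnData": False},
    {"Id": "requests", "MetricStat": alb_metric("RequestCount"), "ReturnData": False},
    {"Id": "error_rate", "Expression": "100 * errors / requests",
     "Label": "Error rate (%)", "ReturnData": True},
]

print(error_rate_percent(25, 1000))  # 2.5
print(100.0 * 1000 / 25)             # 4000.0, the backwards version
```

A result far above 100 is the usual sign that numerator and denominator were swapped.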

Dashboards Built for the People Who’ll Actually Use Them

A dashboard nobody reads during an incident is decoration. The operationally useful pattern is to build dashboards around a golden signals view (latency, traffic, errors, saturation) per service, rather than one giant dashboard with every metric from every resource crammed in. During an incident, you want someone to glance at one screen and know which service is degraded, not scroll through forty widgets hunting for the anomaly.
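A per-service golden signals dashboard can be generated rather than hand-built. The sketch below produces a dashboard body with four widgets; it assumes the service sits behind an Application Load Balancer and an Auto Scaling group, and the function name, service name, and resource identifiers are hypothetical.

```python
import json


def golden_signals_dashboard(service, region, load_balancer, asg_name):
    """Return a CloudWatch dashboard body (JSON string) with four widgets."""

    def widget(title, metric, stat, x, y):
        return {
            "type": "metric",
            "x": x, "y": y, "width": 12, "height": 6,  # the grid is 24 units wide
            "properties": {
                "title": f"{service}: {title}",
                "region": region,
                "period": 60,
                "stat": stat,
                "metrics": [metric],
            },
        }

    alb = "AWS/ApplicationELB"
    widgets = [
        widget("Latency (p99)", [alb, "TargetResponseTime", "LoadBalancer", load_balancer], "p99", 0, 0),
        widget("Traffic", [alb, "RequestCount", "LoadBalancer", load_balancer], "Sum", 12, 0),
        widget("Errors (5XX)", [alb, "HTTPCode_Target_5XX_Count", "LoadBalancer", load_balancer], "Sum", 0, 6),
        widget("Saturation (CPU)", ["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", asg_name], "Average", 12, 6),
    ]
    return json.dumps({"widgets": widgets})


# Hypothetical identifiers; the result is what you would pass as the dashboard body.
body = golden_signals_dashboard("checkout", "us-east-1",
                                "app/checkout/0123456789abcdef", "checkout-asg")
print(len(json.loads(body)["widgets"]))  # 4
```

Generating one small dashboard per service keeps every team's view identical, which matters when someone unfamiliar with the service is the one on call.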

Cross-account, cross-region dashboards matter more in 2026 than they used to, now that most orgs run multi-account landing zones as a baseline. CloudWatch supports pulling metrics from linked accounts into a single monitoring account’s dashboard, which is the pattern you want instead of asking every team to log into their own account to check health.


CloudWatch Logs and Logs Insights

Logs land in log groups, which contain log streams, which contain individual log events. The operational decisions that matter:

Logs Insights is the part that separates someone who can grep a log file from someone who can actually run an investigation across a fleet. It’s a purpose-built query language, not full SQL, but it covers what you need for incident triage:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc

That single query answers “when did errors spike and how bad” across every log stream in the group, without downloading anything. Know the core commands (fields, filter, stats, sort, limit, parse) well enough to read one and predict its output, because the exam will show you a query and ask what it returns.
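One way to practice predicting a query's output is to replay its steps by hand. The sketch below mimics the filter, stats, and sort stages of the query above on a handful of invented log events; it is a plain-Python stand-in for reasoning about the result, not how Logs Insights executes queries.

```python
from collections import Counter

BIN_SECONDS = 300  # bin(5m)

# Hypothetical log events: (seconds since start of window, message).
events = [
    (0, "INFO service started"),
    (60, "ERROR upstream timeout"),
    (310, "ERROR db connection refused"),
    (320, "ERROR db connection refused"),
    (330, "WARN retrying"),
    (599, "ERROR db connection refused"),
    (900, "INFO recovered"),
]


def error_counts_by_bin(log_events):
    counts = Counter()
    for timestamp, message in log_events:
        if "ERROR" in message:                                  # filter @message like /ERROR/
            bin_start = timestamp - (timestamp % BIN_SECONDS)   # bin(5m)
            counts[bin_start] += 1                              # stats count(*) by bin(5m)
    # sort errorCount desc
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


print(error_counts_by_bin(events))  # [(300, 3), (0, 1)]
```

Note that the bin starting at 900 seconds does not appear at all: bins with no matching events produce no row.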


CloudWatch Synthetics: Catching Problems Before Users Report Them

Synthetics runs scripted canaries (small Node.js or Python scripts using headless browsers) on a schedule, simulating what a real user does: load the homepage, log in, complete a checkout flow. The point is catching outside-in failures that internal metrics might miss entirely. Your servers can report 0% errors while your login page is actually broken because of a DNS or CDN issue between the canary’s vantage point and your endpoint. Internal metrics won’t show that, but a synthetic check will.

Scheduled Canary (every 5 min)
              │
              ▼
Headless browser run from AWS region
              │
         ┌────┴────┐
         ▼         ▼
       Pass       Fail ──► CloudWatch Alarm ──► SNS ──► On-call
                   │
                   ▼
  Screenshot + HAR file stored in S3

Canaries also capture screenshots and HAR files on failure, which is genuinely useful during triage: you don’t have to reproduce the failure yourself, you can just look at what the canary saw.
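The logic of an outside-in check can be sketched without the Synthetics runtime. The code below is a plain-Python stand-in, not the canary library API: `run_check`, the injected `fetch` callable, the URL, and the fake fetchers are all hypothetical, and a real canary would drive a headless browser instead.

```python
def run_check(fetch, url, must_contain, max_seconds=5.0):
    """Run one outside-in check.

    fetch(url) must return (status_code, body_text, elapsed_seconds);
    it is injected so the check can be exercised without network access.
    """
    try:
        status, body, elapsed = fetch(url)
    except Exception as exc:  # DNS failure, TLS error, timeout, and so on
        return {"passed": False, "reason": f"request failed: {exc}"}
    if status != 200:
        return {"passed": False, "reason": f"unexpected status {status}"}
    if must_contain not in body:
        return {"passed": False, "reason": "expected content missing"}
    if elapsed > max_seconds:
        return {"passed": False, "reason": f"too slow: {elapsed:.1f}s"}
    return {"passed": True, "reason": "ok"}


# Fake fetchers standing in for a real HTTP client or headless browser.
def healthy_fetch(url):
    return 200, "<html><form id='login'>Sign in</form></html>", 0.4


def dns_broken_fetch(url):
    raise OSError("name resolution failed")


print(run_check(healthy_fetch, "https://example.com/login", "Sign in"))
print(run_check(dns_broken_fetch, "https://example.com/login", "Sign in"))
```

The second case is the one internal metrics miss: the servers never see the request, so only a check running outside them can fail.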


X-Ray: Tracing Requests Across Services

Once you’re past a monolith, “which service caused the slowdown” stops being obvious from metrics alone. X-Ray attaches a trace ID to a request and follows it across every instrumented service, producing a service map and a timeline segment breakdown.

Client Request (trace ID: 1-abc-def)
     │
     ▼
API Gateway ──► Lambda (Auth) ──► Lambda (Order Service) ──► DynamoDB
   45ms             80ms              210ms ⚠ slow             15ms

Service map highlights the Order Service node in red:
that's where your latency budget is actually going.

The practical value: instead of guessing which downstream call is slow, you get segment-level timing for the entire request path. Sampling rules control cost. You don’t need to trace 100% of requests to find a systemic slow path, and the default rule (the first request each second plus 5% of any additional requests) is tuned to keep costs sane on high-traffic services while still catching patterns.
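To get a feel for what sampling does to trace volume, here is a rough arithmetic sketch. It assumes a rule with a reservoir of 1 request per second and a 5% fixed rate, and treats the whole service as a single sampling target; the function name is hypothetical and real volumes depend on how rules are applied across hosts.

```python
def expected_traces_per_second(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Approximate traces recorded per second under a reservoir-plus-rate rule."""
    guaranteed = min(requests_per_second, reservoir)               # always traced
    sampled = max(requests_per_second - reservoir, 0) * fixed_rate  # percentage of the rest
    return guaranteed + sampled


for rps in (1, 100, 1000):
    traced = expected_traces_per_second(rps)
    print(f"{rps} req/s -> about {traced:.2f} traces/s")
```

At 1,000 requests per second that works out to roughly 51 traces per second: enough to surface a systemic slow path at about a twentieth of the tracing volume.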


Exam Focus: What Questions Test From This Step