
AWS (Amazon Web Services) · Associate · Step 3 of 5 · 106 guides · updated 2026


Step 3: Operations & Support

A pipeline that runs once is a script. A pipeline that runs reliably every day for two years, alerts someone when it breaks, and costs a sane amount to operate: that’s engineering. This section is where the exam checks whether you can keep a data platform alive, not just build it once and walk away.


Orchestration: Step Functions vs MWAA

Both services coordinate multi-step workflows, and the exam wants you to pick based on complexity and ecosystem fit rather than by habit.

AWS Step Functions                    Amazon MWAA (Managed Airflow)
────────────────────────────────────  ────────────────────────────────────
State machine defined in JSON         DAGs defined in Python
(Amazon States Language)

Native integration with 200+          Airflow operators/hooks: huge
AWS services directly                 existing ecosystem

Pay per state transition              Pay for environment uptime
(serverless, no idle cost)            (small/medium/large environment)

Best for: AWS-service-heavy           Best for: complex DAGs, cross-system
workflows, event-driven glue          dependencies, teams already using
                                      Airflow

Visual workflow debugging in          Airflow UI, task-level logs,
the console                           retries, backfills

If a scenario says “coordinate a Glue job, then a Lambda function, then send an SNS notification, all triggered by an S3 event, with no infrastructure to manage”, that phrasing points to Step Functions. If it says “the team has 40 interdependent DAGs already written in Airflow and wants to migrate without a rewrite”, that’s MWAA, because you keep the DAG code and get AWS to run the scheduler, workers, and web server for you.

A Step Functions state machine for a typical pipeline handoff:

Start
  │
  ▼
[Glue Crawler] ──► Wait for SUCCEEDED
  │
  ▼
[Glue ETL Job] ──► Choice: did it fail?
  │                        │
  │ success                │ failure
  ▼                        ▼
[Redshift COPY]      [SNS: Alert on-call]
  │                        │
  ▼                        ▼
[SNS: Success]            End
  │
  ▼
End

Step Functions natively supports retries with exponential backoff and catch blocks per state: you don’t write that retry logic yourself, you declare it in the state definition.
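To make the declarative retry and catch idea concrete, here is a minimal sketch that builds an Amazon States Language definition as a Python dictionary and prints it as JSON. The state names, the Glue job name `nightly-etl`, the account ID, and the SNS topic ARN are hypothetical placeholders, and the retry values are illustrative rather than recommended settings.

```python
import json

# Hypothetical topic ARN, used only for illustration.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"


def glue_task(job_name, next_state, failure_state):
    """Build a Task state that runs a Glue job and declares its own error handling."""
    return {
        "Type": "Task",
        # Optimized integration: start the job and wait for it to finish.
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        # Retry is declared, not coded: wait 2s, then 4s, then give up.
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "BackoffRate": 2.0,
                "MaxAttempts": 2,
            }
        ],
        # Once retries are exhausted, route to the failure branch.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": failure_state}],
        "Next": next_state,
    }


def sns_publish(message):
    """Build a terminal Task state that publishes a message to SNS."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {"TopicArn": ALERT_TOPIC_ARN, "Message": message},
        "End": True,
    }


definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": glue_task("nightly-etl", "NotifySuccess", "AlertOnCall"),
        "NotifySuccess": sns_publish("Pipeline succeeded"),
        "AlertOnCall": sns_publish("Pipeline failed after retries"),
    },
}

print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of document you would supply as a state machine definition. The point of the sketch is that the retry interval, backoff rate, and fallback state all live in the definition rather than in task code.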


MWAA in Practice

MWAA runs the same open-source Airflow you’d self-host, but AWS manages the scheduler, workers, metadata database, and web server. DAGs, plugins, and requirements files live in an S3 bucket that MWAA polls. The operational win is that you stop patching Airflow infrastructure. The tradeoff is that environment startup and resizing take longer than spinning up a Lambda-backed Step Functions workflow, and you pay for the environment whether or not DAGs are actively running.

Sizing an MWAA environment (small/medium/large) controls the number of concurrent tasks the scheduler and workers can handle. Undersizing shows up as tasks queuing indefinitely rather than as an outright failure, which is a favorite “diagnose the problem” exam pattern.


Monitoring Pipeline Health with CloudWatch

CloudWatch is the nervous system for every AWS-native pipeline, and the exam expects fluency with which signal comes from where:

Signal                                               Source
───────────────────────────────────────────────────  ───────────────────────────────────────────────────
Glue job run status, duration, DPU usage             CloudWatch Metrics (Glue namespace) + Glue job run history
EMR step failures, cluster utilization               CloudWatch Metrics (EMR namespace) + on-cluster logs in S3
Kinesis stream throttling, iterator age              CloudWatch Metrics (GetRecords.IteratorAgeMilliseconds)
Step Functions execution failures                    CloudWatch Metrics + Step Functions execution history
Custom application-level events                      CloudWatch Logs + Logs Insights queries
Anomalous metric behavior without a fixed threshold  CloudWatch Anomaly Detection

Iterator age deserves special attention: it’s the metric that tells you a Kinesis consumer is falling behind the producer. A steadily climbing iterator age means your consumer (Lambda, KCL app, or Flink application) can’t keep pace, and you’re accumulating processing lag that will eventually hit the stream’s retention window, at which point unread records expire and are lost.
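As a rough sketch of how an iterator age alarm could be declared, the function below builds the keyword arguments for the CloudWatch `put_metric_alarm` call. The stream name, SNS topic ARN, the 60-second threshold, and the five-period evaluation window are assumptions chosen for illustration, not tuned recommendations.

```python
def iterator_age_alarm(stream_name, sns_topic_arn, threshold_ms=60_000):
    """Build put_metric_alarm arguments for a consumer-lag alarm on one stream."""
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        # Maximum catches the worst lag seen in each period.
        "Statistic": "Maximum",
        "Period": 60,
        # Require sustained lag (5 x 60s) so a brief spike does not page anyone.
        "EvaluationPeriods": 5,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical stream and topic, for illustration only.
alarm_kwargs = iterator_age_alarm(
    "clickstream-events",
    "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
)
print(alarm_kwargs["AlarmName"])

# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
```

Keeping the argument builder separate from the API call makes the alarm settings easy to review and unit test without touching an AWS account.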

A typical alerting setup layers CloudWatch Alarms on top of these metrics, routing to SNS, which fans out to email, Slack (via a Lambda subscriber), or a paging system. For pipeline-specific health, many teams also emit custom metrics from within a Glue job or Lambda (records processed, records quarantined by a data quality rule, rows rejected on validation) using put_metric_data, since AWS-native service metrics alone won’t tell you if the data itself looks wrong.
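The custom-metric idea can be sketched as follows. The namespace `DataPipeline/Quality`, the metric names, and the `JobName` dimension are hypothetical choices, and the client is passed in so that the same function works with a real boto3 CloudWatch client or a stand-in.

```python
from datetime import datetime, timezone


def build_quality_metrics(job_name, processed, quarantined, rejected):
    """Turn per-run data quality counters into CloudWatch MetricData entries."""
    now = datetime.now(timezone.utc)
    counters = {
        "RecordsProcessed": processed,
        "RecordsQuarantined": quarantined,
        "RowsRejected": rejected,
    }
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "JobName", "Value": job_name}],
            "Timestamp": now,
            "Value": float(value),
            "Unit": "Count",
        }
        for name, value in counters.items()
    ]


def publish_quality_metrics(cloudwatch, job_name, processed, quarantined, rejected):
    """Send the counters with one put_metric_data call."""
    metric_data = build_quality_metrics(job_name, processed, quarantined, rejected)
    cloudwatch.put_metric_data(Namespace="DataPipeline/Quality", MetricData=metric_data)
    return metric_data


# Inside a Glue job or Lambda with AWS credentials available, usage might be:
#   import boto3
#   publish_quality_metrics(boto3.client("cloudwatch"), "nightly-etl", 10_000, 12, 3)
```

An alarm on `RecordsQuarantined` then catches a run that finished "successfully" from the service's point of view but produced suspicious data.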


Error Handling and Retry Patterns

Transient failures (a throttled API call, a momentarily unavailable JDBC endpoint, a Spot Instance reclaim) are normal, and the architecture needs to absorb them without a human intervening every time.

Attempt 1 ──fail──► wait 2s ──► Attempt 2 ──fail──► wait 4s ──► Attempt 3
                                                                    │
                                                              still failing
                                                                    ▼
                                                        Dead Letter Queue (SQS)
                                                                    │
                                                                    ▼
                                                        Alert + manual review

This exponential backoff pattern is built into Step Functions retries natively (you configure IntervalSeconds, BackoffRate, and MaxAttempts per state), and it’s the default behavior for many SDK calls as well. For streaming consumers, a dead-letter queue captures records that repeatedly fail processing so a bad record doesn’t block the entire shard or stall the pipeline. Lambda event source mappings for Kinesis support configuring an on-failure destination (an SQS queue or SNS topic) for exactly this reason.
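The retry flow in the diagram can be sketched in plain Python. This is an illustration of the pattern, not how Step Functions or the SDKs implement it internally; the in-memory `dead_letters` list stands in for an SQS dead-letter queue, and the function names are hypothetical.

```python
import time


def backoff_schedule(interval_seconds, backoff_rate, max_attempts):
    """Waits before each retry: interval * rate**n for n = 0..max_attempts-1."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def process_with_retries(record, handler, waits, dead_letters, sleep=time.sleep):
    """Run handler(record); on failure wait and retry, then dead-letter the record."""
    # One initial attempt plus one retry per wait; None marks the final attempt.
    for wait in [*waits, None]:
        try:
            handler(record)
            return True
        except Exception:
            if wait is None:
                # Retries exhausted: park the record instead of blocking the stream.
                dead_letters.append(record)
                return False
            sleep(wait)


waits = backoff_schedule(interval_seconds=2, backoff_rate=2.0, max_attempts=2)
print(waits)  # [2.0, 4.0], matching the 2s and 4s waits in the diagram
```

Here `max_attempts` counts retries rather than the initial attempt, so two retries yield the three attempts shown in the diagram.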

A related pattern worth knowing cold: idempotency. If a Step Functions state or Lambda retries a write operation, you need that operation to be safe to repeat, for example by using a deterministic key (like an order ID) for an upsert rather than blindly appending, or by checking a processed-records table before re-applying a transformation.
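A small sketch of the deterministic-key approach, using an in-memory SQLite table as a stand-in for the real target store. The table name, columns, and order ID are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key on order_id is what makes a repeated write collapse into one row.
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")


def write_order(conn, order_id, amount):
    """Upsert keyed on order_id, so a retried write overwrites instead of appending."""
    conn.execute(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)",
        (order_id, amount),
    )
    conn.commit()


write_order(conn, "order-1001", 49.90)
write_order(conn, "order-1001", 49.90)  # simulated retry of the same write

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)  # 1
```

A plain `INSERT` into a table without the key would have produced two rows here, which is exactly the duplicate a retry would otherwise introduce.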


Cost-Optimizing Data Pipelines

Cost questions in this domain usually reduce to “are you paying for idle capacity, and can you shift work to when it’s cheaper or shrink the compute footprint?”


Troubleshooting Common Glue and EMR Failures

Symptom: Glue job runs out of memory
Cause:   Data skew: one partition/key vastly larger than others,
         causing a single executor to process a disproportionate share
Fix:     Salt the skewed key, repartition before the join,
         or increase worker type/DPU count

Symptom: Glue job succeeds but downstream table is empty
Cause:   Job bookmark treats data as "already processed"
         (common after a schema change or manual backfill)
Fix:     Reset the job bookmark explicitly for a fresh full run

Symptom: EMR step fails with "container killed on request, exit code 137"
Cause:   YARN container exceeded memory limits (OOM kill)
Fix:     Increase executor memory, or reduce how much data each executor
         holds at once (more, smaller partitions or fewer concurrent
         tasks per executor)

Symptom: Crawler creates duplicate tables with slightly different schemas
Cause:   Inconsistent file formats/schemas under the same S3 prefix
Fix:     Enforce a consistent schema at write time, or configure the
         crawler's schema change policy explicitly

Data skew is worth dwelling on because it’s the most common root cause behind “Glue job runs fine on small data but fails at scale” scenarios. If 90% of your join keys fall under a handful of values, a straight hash-partitioned join sends nearly all the work to a few executors. Salting the key (appending a random suffix and exploding the smaller side of the join accordingly) spreads that work back out evenly.
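The salting technique can be illustrated in plain Python, without Spark. This is a toy model that assumes rows are `(key, value)` pairs and that each distinct join key maps to one unit of work; the number of salts and the sample data are made up for the example.

```python
import random


def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs on key."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]


def salted_join(large, small, num_salts, rng):
    """Same join result, but each hot key is spread over num_salts sub-keys."""
    # Large side: append a random salt so one key becomes several join keys.
    salted_large = [((key, rng.randrange(num_salts)), v) for key, v in large]
    # Small side: replicate every row once per salt so each salted key finds a match.
    exploded_small = [((key, s), v) for key, v in small for s in range(num_salts)]
    joined = hash_join(salted_large, exploded_small)
    # Drop the salt again so callers see the original key.
    return [(key, lv, rv) for (key, _salt), lv, rv in joined]


# Skewed input: 1000 rows share the key "hot", one row has the key "cold".
large = [("hot", i) for i in range(1000)] + [("cold", 0)]
small = [("hot", "H"), ("cold", "C")]

result = salted_join(large, small, num_salts=4, rng=random.Random(0))
print(len(result))  # 1001, the same row count as the unsalted join
```

The cost is that the smaller side grows by a factor of `num_salts`, which is why salting is usually applied only when one side is small enough to replicate cheaply.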


Exam Focus: What Questions Test From This Step