Step 3: Operations & Support
A pipeline that runs once is a script. A pipeline that runs reliably every day for two years, alerts someone when it breaks, and costs a sane amount to operate: that's engineering. This section is where the exam checks whether you can keep a data platform alive, not just build it once and walk away.
Orchestration: Step Functions vs MWAA
Both services coordinate multi-step workflows, and the exam wants you to pick based on complexity and ecosystem fit rather than by habit.
| AWS Step Functions | Amazon MWAA (Managed Airflow) |
|---|---|
| State machine defined in JSON / Amazon States Language | DAGs defined in Python |
| Native integration with 200+ AWS services directly | Airflow operators/hooks, huge existing ecosystem |
| Pay per state transition (serverless, no idle cost) | Pay for environment uptime (small/medium/large environment) |
| Best for: AWS-service-heavy workflows, event-driven glue | Best for: complex DAGs, cross-system dependencies, teams already using Airflow |
| Visual workflow debugging in the console | Airflow UI, task-level logs, retries, backfills |

If a scenario says "coordinate a Glue job, then a Lambda function, then send an SNS notification, all triggered by an S3 event, with no infrastructure to manage", that phrasing points to Step Functions. If it says "the team has 40 interdependent DAGs already written in Airflow and wants to migrate without a rewrite", that's MWAA, because you keep the DAG code and get AWS to run the scheduler, workers, and web server for you.
A Step Functions state machine for a typical pipeline handoff:

    Start
      │
      ▼
    [Glue Crawler] ──► Wait for SUCCEEDED
      │
      ▼
    [Glue ETL Job] ──► Choice: did it fail?
      │                        │
      │ success                │ failure
      ▼                        ▼
    [Redshift COPY]       [SNS: Alert on-call]
      │                        │
      ▼                        ▼
    [SNS: Success]            End
      │
      ▼
     End

Step Functions natively supports retries with exponential backoff and catch blocks per state. You don't write that retry logic yourself; you declare it in the state definition.
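To make the declarative retry and catch configuration concrete, here is a minimal sketch of part of such a state machine, built as a Python dictionary and serialized to Amazon States Language JSON. It covers only the Glue job step and the two SNS notifications; the crawler and Redshift COPY steps are left out for brevity. The job name, topic ARN, and state names are hypothetical placeholders.

```python
import json

# Hypothetical placeholders: substitute your own job name and topic ARN.
GLUE_JOB_NAME = "example-etl-job"
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-pipeline-alerts"

state_machine = {
    "Comment": "Sketch: run a Glue job, retry transient errors, alert on failure",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": GLUE_JOB_NAME},
            # Declarative retry: wait 2s, then 4s, giving up after 3 retries.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "BackoffRate": 2.0,
                    "MaxAttempts": 3,
                }
            ],
            # Once retries are exhausted, route any error to the alert state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "AlertOnCall"}],
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": ALERT_TOPIC_ARN, "Message": "Pipeline succeeded"},
            "End": True,
        },
        "AlertOnCall": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": ALERT_TOPIC_ARN, "Message": "Glue job failed after retries"},
            "End": True,
        },
    },
}

definition_json = json.dumps(state_machine, indent=2)
print(definition_json)
```

The resulting JSON is what you would supply as the state machine definition. Note that no retry loop appears anywhere in code: the Retry and Catch fields carry that behavior.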
MWAA in Practice
MWAA runs the same open-source Airflow you'd self-host, but AWS manages the scheduler, workers, metadata database, and web server. DAGs, plugins, and requirements files live in an S3 bucket that MWAA polls. The operational win is that you stop patching Airflow infrastructure. The tradeoff is that creating or resizing an environment takes far longer than starting a Step Functions execution, and you pay for the environment whether or not DAGs are actively running.
Sizing an MWAA environment (small/medium/large) controls the number of concurrent tasks the scheduler and workers can handle. Undersizing shows up as tasks queuing indefinitely rather than an outright failure, which is a favorite "diagnose the problem" exam pattern.
Monitoring Pipeline Health with CloudWatch
CloudWatch is the nervous system for every AWS-native pipeline, and the exam expects fluency with which signal comes from where:
| Signal | Source |
|---|---|
| Glue job run status, duration, DPU usage | CloudWatch Metrics (Glue namespace) + Glue job run history |
| EMR step failures, cluster utilization | CloudWatch Metrics (EMR namespace) + on-cluster logs in S3 |
| Kinesis stream throttling, iterator age | CloudWatch Metrics (WriteProvisionedThroughputExceeded, ReadProvisionedThroughputExceeded, GetRecords.IteratorAgeMilliseconds) |
| Step Functions execution failures | CloudWatch Metrics + Step Functions execution history |
| Custom application-level events | CloudWatch Logs + Logs Insights queries |
| Anomalous metric behavior without a fixed threshold | CloudWatch Anomaly Detection |
Iterator age deserves special attention: it's the metric that tells you a Kinesis consumer is falling behind the producer. A steadily climbing iterator age means your consumer (Lambda, KCL app, or Flink application) can't keep pace, and you're accumulating processing lag that will eventually hit the stream's retention window and start dropping data.
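As a rough illustration of alarming on iterator age, the sketch below builds the keyword arguments for the CloudWatch put_metric_alarm call. The stream name, topic ARN, threshold, and evaluation window are assumptions chosen for the example, not recommended values; the call itself is left as a comment so the snippet runs without AWS credentials.

```python
def iterator_age_alarm_kwargs(stream_name, topic_arn, threshold_ms=60_000):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The threshold and evaluation window are illustrative assumptions.
    """
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        # Maximum catches the worst lag observed in each period.
        "Statistic": "Maximum",
        "Period": 60,
        # Require five consecutive breaching minutes to ignore brief spikes.
        "EvaluationPeriods": 5,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


# Hypothetical stream name and SNS topic ARN.
alarm_kwargs = iterator_age_alarm_kwargs(
    "example-stream",
    "arn:aws:sns:us-east-1:123456789012:example-alerts",
)

# With boto3 installed and credentials configured, this would be applied as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
print(alarm_kwargs["AlarmName"])
```

Requiring several consecutive breaching periods is a deliberate choice here: a single slow batch should not page anyone, but a sustained climb should.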
A typical alerting setup layers CloudWatch Alarms on top of these metrics, routing to SNS, which fans out to email, Slack (via a Lambda subscriber), or a paging system. For pipeline-specific health, many teams also emit custom metrics from within a Glue job or Lambda (records processed, records quarantined by a data quality rule, rows rejected on validation) using the CloudWatch PutMetricData API (put_metric_data in boto3), since AWS-native service metrics alone won't tell you if the data itself looks wrong.
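A custom-metric emitter along these lines might look like the following sketch. The namespace, metric names, and job name are hypothetical, and a small recording stub stands in for the real CloudWatch client so the example runs offline; in a real job you would pass `boto3.client("cloudwatch")` instead.

```python
def build_metric_data(job_name, processed, quarantined):
    """Build a MetricData list for put_metric_data (metric names are made up)."""
    dimensions = [{"Name": "JobName", "Value": job_name}]
    return [
        {
            "MetricName": "RecordsProcessed",
            "Dimensions": dimensions,
            "Value": float(processed),
            "Unit": "Count",
        },
        {
            "MetricName": "RecordsQuarantined",
            "Dimensions": dimensions,
            "Value": float(quarantined),
            "Unit": "Count",
        },
    ]


def publish_metrics(cloudwatch_client, namespace, metric_data):
    """Send the metrics through any client exposing put_metric_data."""
    cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=metric_data)


class RecordingClient:
    """Offline stand-in for boto3.client('cloudwatch'); it only records calls."""

    def __init__(self):
        self.calls = []

    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
metric_data = build_metric_data("example-etl-job", processed=9800, quarantined=200)
publish_metrics(client, "Example/DataPipeline", metric_data)
print(client.calls[0]["Namespace"], len(client.calls[0]["MetricData"]))
```

An alarm on the quarantined count (or on its ratio to processed records) then catches bad data even when the job itself reports success.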
Error Handling and Retry Patterns
Transient failures (a throttled API call, a momentarily unavailable JDBC endpoint, a Spot Instance reclaim) are normal, and the architecture needs to absorb them without a human intervening every time.

    Attempt 1 ──fail──► wait 2s ──► Attempt 2 ──fail──► wait 4s ──► Attempt 3
                                                                      │ still failing
                                                                      ▼
                                                           Dead Letter Queue (SQS)
                                                                      │
                                                                      ▼
                                                           Alert + manual review

This exponential backoff pattern is built into Step Functions retries natively (you configure IntervalSeconds, BackoffRate, and MaxAttempts per state), and it's the default behavior for many SDK calls as well. For streaming consumers, a dead-letter queue captures records that repeatedly fail processing so a bad record doesn't block the entire shard or stall the pipeline. Lambda event source mappings for Kinesis support an on-failure destination (an SQS queue or SNS topic) for exactly this reason.
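For code that runs outside Step Functions, the same pattern can be hand-rolled. The sketch below is a minimal illustration, assuming an in-memory list as a stand-in for an SQS dead-letter queue and an injectable sleep function so the waits can be observed without actually pausing; the function and parameter names are invented for the example.

```python
import time


def process_with_retry(record, handler, dead_letters, max_attempts=3,
                       interval_seconds=2.0, backoff_rate=2.0, sleep=time.sleep):
    """Call handler(record), backing off exponentially between failed attempts.

    After max_attempts failures the record goes to dead_letters, a plain list
    standing in for a real dead-letter queue.
    """
    delay = interval_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(record)
        except Exception as exc:  # broad on purpose for this sketch
            if attempt == max_attempts:
                dead_letters.append({"record": record, "error": repr(exc)})
                return None
            sleep(delay)
            delay *= backoff_rate


def always_failing_handler(record):
    raise ValueError(f"cannot parse {record!r}")


waits = []          # records each requested wait instead of sleeping
dead_letters = []
result = process_with_retry(
    {"id": 1, "payload": "bad"},
    always_failing_handler,
    dead_letters,
    sleep=waits.append,
)
print(waits, len(dead_letters))  # prints [2.0, 4.0] 1
```

Production code would normally catch only the exception types known to be transient and add jitter to the delay, so that many failing consumers do not retry in lockstep.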
A related pattern worth knowing cold: idempotency. If a Step Functions state or Lambda retries a write operation, you need that operation to be safe to repeat, either by using a deterministic key (like an order ID) for an upsert rather than blindly appending, or by checking a processed-records table before re-applying a transformation.
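A small sketch of the deterministic-key approach, using an in-memory SQLite table as a stand-in for the real target store. The table and column names are hypothetical, and a warehouse would typically use its own MERGE or upsert statement rather than SQLite's INSERT OR REPLACE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")


def upsert_order(connection, order_id, amount):
    """Write an order keyed by order_id so a retried write cannot duplicate it."""
    connection.execute(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)",
        (order_id, amount),
    )
    connection.commit()


upsert_order(conn, "order-1001", 25.0)
upsert_order(conn, "order-1001", 25.0)  # simulated retry of the same write

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)  # prints 1
```

A plain INSERT into a table without the key constraint would have produced two rows here, which is exactly the duplicate a retry can introduce.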
Cost-Optimizing Data Pipelines
Cost questions in this domain usually reduce to "are you paying for idle capacity, and can you shift work to when it's cheaper or shrink the compute footprint."
- Glue: job bookmarks avoid reprocessing data that's already been handled on a previous run, which directly cuts DPU-hours on incremental loads.
- EMR: use Spot Instances for task nodes (not core/master nodes, which hold state) on fault-tolerant Spark jobs; enable EMR managed scaling so the cluster shrinks when a job's later stages need less compute.
- Redshift Serverless: set a sane base RPU capacity so it doesn't overprovision for rare bursts; compute is not billed while the workgroup is idle, so quiet periods are handled automatically.
- Step Functions: Standard workflows charge per state transition, so a workflow that polls in a tight loop racks up transitions fast; use Wait states or switch to Express workflows for high-volume, short-duration executions.
- S3: lifecycle rules moving cold data out of Standard remain the single highest-leverage lever in most data lake cost reviews.
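The Step Functions point above is easy to quantify. The sketch below counts the state transitions consumed by a polling loop, assuming a hypothetical three-state loop (Wait, check status, Choice) and a 30-minute job. The per-transition price is passed in by the caller rather than hard-coded, since pricing varies by region and over time.

```python
import math


def polling_transitions(job_seconds, poll_interval_seconds, states_per_poll=3):
    """Transitions used by a Wait -> CheckStatus -> Choice loop until the job ends."""
    polls = math.ceil(job_seconds / poll_interval_seconds)
    return polls * states_per_poll


def transition_cost(transitions, price_per_transition):
    """Cost for a given number of transitions at a caller-supplied price."""
    return transitions * price_per_transition


job_seconds = 30 * 60  # hypothetical 30-minute job

tight_loop = polling_transitions(job_seconds, poll_interval_seconds=5)
relaxed_loop = polling_transitions(job_seconds, poll_interval_seconds=60)

print(tight_loop, relaxed_loop)  # prints 1080 90
```

Stretching the Wait interval from 5 to 60 seconds cuts transitions twelvefold for the same job; multiply by daily executions and your region's price via `transition_cost` to see the monthly effect.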
Troubleshooting Common Glue and EMR Failures
Symptom: Glue job runs out of memory
Cause: Data skew, where one partition/key is vastly larger than the others, causing a single executor to process a disproportionate share
Fix: Salt the skewed key, repartition before the join, or increase worker type/DPU count

Symptom: Glue job succeeds but downstream table is empty
Cause: Job bookmark treats data as "already processed" (common after a schema change or manual backfill)
Fix: Reset the job bookmark explicitly for a fresh full run

Symptom: EMR step fails with "container killed on request, exit code 137"
Cause: YARN container exceeded memory limits (OOM kill)
Fix: Increase executor memory, or reduce how much data each executor handles at once (fewer cores per executor, or more and smaller partitions)

Symptom: Crawler creates duplicate tables with slightly different schemas
Cause: Inconsistent file formats/schemas under the same S3 prefix
Fix: Enforce a consistent schema at write time, or configure the crawler's schema change policy explicitly

Data skew is worth dwelling on because it's the most common root cause behind "Glue job runs fine on small data but fails at scale" scenarios. If 90% of your join keys fall under a handful of values, a straight hash-partitioned join sends nearly all the work to a few executors. Salting the key (appending a random suffix and exploding the smaller side of the join accordingly) spreads that work back out evenly.
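The effect of salting can be seen without a cluster. The sketch below is a plain-Python simulation, not Spark code: it hashes join keys to a fixed number of partitions and compares the busiest partition before and after salting. The key names, row counts, partition count, and salt count are all made up for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8


def partition_for(key):
    """Assign a key to a partition using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


rng = random.Random(42)

# Large side of the join: 90% of rows share one hot key.
large_side = ["customer-1"] * 9000 + [f"customer-{i}" for i in range(2, 1002)]

# Without salting, every hot-key row lands on the same partition.
unsalted_load = Counter(partition_for(key) for key in large_side)

# Salting: append a random suffix to each key on the large side.
salted_keys = [f"{key}#{rng.randrange(NUM_SALTS)}" for key in large_side]
salted_load = Counter(partition_for(key) for key in salted_keys)

# The small side is exploded: one copy per salt value, so every salted key
# on the large side still finds its match.
small_side = {"customer-1": "gold"}
exploded_small = {
    f"{key}#{salt}": value
    for key, value in small_side.items()
    for salt in range(NUM_SALTS)
}

print("busiest partition, unsalted:", max(unsalted_load.values()))
print("busiest partition, salted:  ", max(salted_load.values()))
```

The tradeoff is visible in `exploded_small`: the smaller side grows by a factor of the salt count, which is why salting suits joins where one side is small enough to replicate cheaply.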
Exam Focus: What Questions Test From This Step
- Step Functions vs MWAA: pick based on AWS-service density vs existing Airflow DAG investment
- Reading CloudWatch iterator age as a sign of a lagging Kinesis consumer
- Exponential backoff configuration and dead-letter queue usage for transient failures
- Idempotency requirements when retries are possible
- Cost levers: Glue job bookmarks, EMR Spot for task nodes, Redshift Serverless RPU ranges, Step Functions Standard vs Express
- Diagnosing Glue OOM/data skew failures and the salting fix
- Recognizing job bookmark issues causing "job succeeded but no new data" symptoms
- EMR container OOM kills (exit 137) and executor memory tuning