Step 3 — Operations & Support

A pipeline that runs once is a script. A pipeline that runs reliably every day for two years, alerts someone when it breaks, and costs a sane amount to operate — that’s engineering. This section is where the exam checks whether you can keep a data platform alive, not just build it once and walk away.

Orchestration: Step Functions vs MWAA

Both services coordinate multi-step workflows, and the exam wants you to pick based on complexity and ecosystem fit rather than by habit.

AWS Step Functions                    Amazon MWAA (Managed Airflow)
──────────────────────────            ──────────────────────────────
State machine defined in JSON/        DAGs defined in Python
  Amazon States Language
Native integration with 200+          Airflow operators/hooks —
  AWS services directly                 huge existing ecosystem
Pay per state transition              Pay for environment uptime
  (serverless, no idle cost)            (small/medium/large environment)
Best for: AWS-service-heavy           Best for: complex DAGs, cross-
  workflows, event-driven glue          system dependencies, teams
                                        already using Airflow
Visual workflow debugging in          Airflow UI, task-level logs,
  the console                           retries, backfills

If a scenario says “coordinate a Glue job, then a Lambda function, then send an SNS notification, all triggered by an S3 event, with no infrastructure to manage” — that phrasing points to Step Functions. If it says “the team has 40 interdependent DAGs already written in Airflow and wants to migrate without a rewrite” — that’s MWAA, because you keep the DAG code and get AWS to run the scheduler, workers, and web server for you.

A Step Functions state machine for a typical pipeline handoff:

Start
  │
  ▼
[Glue Crawler] ──► Wait for SUCCEEDED
  │
  ▼
[Glue ETL Job] ──► Choice: did it fail?
  │                    │
  │ success            │ failure
  ▼                    ▼
[Redshift COPY]   [SNS: Alert on-call]
  │                    │
  ▼                    ▼
[SNS: Success]        End
  │
  ▼
 End

Step Functions natively supports retries with exponential backoff and catch blocks per state — you don’t write that retry logic yourself, you declare it in the state definition.
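
As a rough illustration of what "declare it in the state definition" looks like, the sketch below builds a small Amazon States Language definition as a Python dict. The job name, topic ARN, and state names are hypothetical placeholders, and the retry values are arbitrary; it is one way to express the pattern, not a reference pipeline.

```python
import json


def glue_etl_states(job_name: str, alert_topic_arn: str) -> dict:
    """Return ASL states for a Glue job with declarative retry and catch."""
    return {
        "RunGlueEtl": {
            "Type": "Task",
            # The .sync suffix makes the state wait for the job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job_name},
            # Retry is declared, not coded: wait 2s, then 4s, then 8s.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "BackoffRate": 2.0,
                    "MaxAttempts": 3,
                }
            ],
            # Once retries are exhausted, route the failure to the alert state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "AlertOnCall"}],
            "Next": "Done",
        },
        "AlertOnCall": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": alert_topic_arn,
                "Message": "Glue ETL failed after retries",
            },
            "End": True,
        },
        "Done": {"Type": "Succeed"},
    }


# Hypothetical names, for illustration only.
definition = {
    "StartAt": "RunGlueEtl",
    "States": glue_etl_states(
        "nightly-etl", "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"
    ),
}
print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of document you would supply as the state machine definition; the retry and catch behavior lives entirely in the `Retry` and `Catch` fields.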

MWAA in Practice

MWAA runs the same open-source Airflow you’d self-host, but AWS manages the scheduler, workers, metadata database, and web server. DAGs, plugins, and requirements files live in an S3 bucket that MWAA polls. The operational win is that you stop patching Airflow infrastructure. The tradeoff is that creating or resizing an environment takes far longer than deploying a Step Functions state machine, and you pay for the environment whether or not DAGs are actively running.

Sizing an MWAA environment (small/medium/large) controls the number of concurrent tasks the scheduler and workers can handle — undersizing shows up as tasks queuing indefinitely rather than an outright failure, which is a favorite “diagnose the problem” exam pattern.

Monitoring Pipeline Health with CloudWatch

CloudWatch is the nervous system for every AWS-native pipeline, and the exam expects fluency with which signal comes from where:

Signal	Source
Glue job run status, duration, DPU usage	CloudWatch Metrics (Glue namespace) + Glue job run history
EMR step failures, cluster utilization	CloudWatch Metrics (EMR namespace) + on-cluster logs in S3
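Kinesis stream throttling, iterator age	CloudWatch Metrics (AWS/Kinesis namespace: `ReadProvisionedThroughputExceeded` and `WriteProvisionedThroughputExceeded` for throttling, `GetRecords.IteratorAgeMilliseconds` for lag)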
Step Functions execution failures	CloudWatch Metrics + Step Functions execution history
Custom application-level events	CloudWatch Logs + Logs Insights queries
Anomalous metric behavior without a fixed threshold	CloudWatch Anomaly Detection

Iterator age deserves special attention: it’s the metric that tells you a Kinesis consumer is falling behind the producer. A steadily climbing iterator age means your consumer (Lambda, KCL app, or Flink application) can’t keep pace. If that lag ever exceeds the stream’s retention period, unread records expire and are lost.
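
A minimal sketch of an alarm on that metric, assuming a hypothetical stream name and SNS topic. The function only builds the keyword arguments for CloudWatch's `put_metric_alarm` call; the one-minute threshold and five evaluation periods are illustrative values, not recommendations.

```python
def iterator_age_alarm(stream_name: str, topic_arn: str,
                       threshold_ms: int = 60_000) -> dict:
    """Build put_metric_alarm kwargs for a lagging Kinesis consumer."""
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        # Maximum surfaces the worst lag seen in each period.
        "Statistic": "Maximum",
        "Period": 60,
        # Require five consecutive breaching minutes to avoid paging on a blip.
        "EvaluationPeriods": 5,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


# Hypothetical stream and topic, for illustration only.
alarm = iterator_age_alarm(
    "clickstream", "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"
)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(alarm["AlarmName"])
```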

A typical alerting setup layers CloudWatch Alarms on top of these metrics, routing to SNS, which fans out to email, Slack (via a Lambda subscriber), or a paging system. For pipeline-specific health, many teams also emit custom metrics from within a Glue job or Lambda (records processed, records quarantined by a data quality rule, rows rejected on validation) using the CloudWatch PutMetricData API (put_metric_data in boto3), since AWS-native service metrics alone won’t tell you if the data itself looks wrong.
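
The custom-metric idea can be sketched as follows. The namespace, metric names, and job name are assumptions made up for this example, and `RecordingClient` is a stand-in so the snippet runs without AWS access; in a real job you would pass `boto3.client("cloudwatch")` instead.

```python
from typing import Any, Dict, List


def build_quality_metrics(job_name: str, processed: int,
                          quarantined: int) -> List[Dict[str, Any]]:
    """Build the MetricData payload for one pipeline run."""
    dimensions = [{"Name": "JobName", "Value": job_name}]
    return [
        {"MetricName": "RecordsProcessed", "Dimensions": dimensions,
         "Value": float(processed), "Unit": "Count"},
        {"MetricName": "RecordsQuarantined", "Dimensions": dimensions,
         "Value": float(quarantined), "Unit": "Count"},
    ]


def publish_quality_metrics(cloudwatch, job_name: str, processed: int,
                            quarantined: int,
                            namespace: str = "DataPipeline/Quality") -> None:
    """Send the metrics through any client exposing put_metric_data."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=build_quality_metrics(job_name, processed, quarantined),
    )


class RecordingClient:
    """Offline stand-in for a CloudWatch client; it just records calls."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def put_metric_data(self, **kwargs: Any) -> None:
        self.calls.append(kwargs)


client = RecordingClient()
publish_quality_metrics(client, "nightly-etl", processed=9800, quarantined=200)
print(client.calls[0]["Namespace"])
```

An alarm on the quarantined count then catches "the job succeeded but the data looks wrong" cases that service-level metrics miss.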

Error Handling and Retry Patterns

Transient failures — a throttled API call, a momentarily unavailable JDBC endpoint, a Spot Instance reclaim — are normal, and the architecture needs to absorb them without a human intervening every time.

Attempt 1 ──fail──► wait 2s ──► Attempt 2 ──fail──► wait 4s ──► Attempt 3
                                                                     │
                                                              still failing
                                                                     ▼
                                                          Dead Letter Queue (SQS)
                                                                     │
                                                                     ▼
                                                          Alert + manual review

This exponential backoff pattern is built into Step Functions retries natively (you configure IntervalSeconds, BackoffRate, and MaxAttempts per state), and it’s the default behavior for many AWS SDK calls as well. For streaming consumers, a dead-letter queue captures what repeatedly fails processing so a bad record doesn’t block the entire shard or stall the pipeline. Lambda event source mappings for Kinesis support this through an on-failure destination (an SQS queue or SNS topic), typically paired with a cap on retry attempts or record age so the failing batch is eventually skipped.
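
For code that runs outside Step Functions, the same flow can be sketched by hand. This is a minimal illustration under stated assumptions: `handler` and `dead_letter` are hypothetical callables, and `max_attempts` here counts total attempts to match the diagram above (Step Functions' MaxAttempts counts retries instead).

```python
import time
from typing import Any, Callable


def process_with_backoff(record: Any,
                         handler: Callable[[Any], None],
                         dead_letter: Callable[[Any], None],
                         max_attempts: int = 3,
                         interval_seconds: float = 2.0,
                         backoff_rate: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try handler(record); back off between failures; dead-letter at the end."""
    delay = interval_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            handler(record)
            return True
        except Exception:
            if attempt == max_attempts:
                break  # no wait after the final attempt
            sleep(delay)
            delay *= backoff_rate  # 2s, 4s, 8s, ...
    # Retries exhausted: park the record so it stops blocking the pipeline.
    dead_letter(record)
    return False


def always_fails(record: Any) -> None:
    raise RuntimeError("downstream unavailable")


# Dry run: record the waits instead of sleeping, and use a list as the DLQ.
waits: list = []
dlq: list = []
succeeded = process_with_backoff({"id": 1}, always_fails, dlq.append,
                                 sleep=waits.append)
print(succeeded, waits, dlq)
```

Injecting `sleep` keeps the function testable; in production the default `time.sleep` applies and `dead_letter` would typically send to an SQS queue.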

A related pattern worth knowing cold: idempotency. If a Step Functions state or Lambda retries a write operation, you need that operation to be safe to repeat — using a deterministic key (like an order ID) for an upsert rather than blindly appending, or checking a processed-records table before re-applying a transformation.
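
A small sketch of the upsert-by-deterministic-key idea, using an in-memory SQLite table as a stand-in for the real target store. The table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key is the deterministic business key, not a generated ID.
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")


def upsert_order(db: sqlite3.Connection, order_id: str, amount: float) -> None:
    """Write an order so that repeating the call leaves exactly one row."""
    db.execute(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)",
        (order_id, amount),
    )
    db.commit()


upsert_order(conn, "order-1001", 49.90)
upsert_order(conn, "order-1001", 49.90)  # simulated retry of the same write

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

A blind `INSERT` keyed on an auto-generated ID would have produced two rows here; keying on the order ID is what makes the retry harmless.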

Cost-Optimizing Data Pipelines

Cost questions in this domain usually reduce to “are you paying for idle capacity, and can you shift work to when it’s cheaper or shrink the compute footprint.”

Glue: job bookmarks avoid reprocessing data that’s already been handled on a previous run, which directly cuts DPU-hours on incremental loads.
EMR: use Spot Instances for task nodes on fault-tolerant Spark jobs, not for core nodes (which hold HDFS data) or the primary (master) node (which coordinates the cluster); enable EMR managed scaling so the cluster shrinks when a job’s later stages need less compute.
Redshift Serverless: set a sensible base RPU capacity and cap the maximum so it doesn’t overprovision for rare bursts; compute is billed only while workloads run, so idle periods are handled automatically.
Step Functions: Standard workflows charge per state transition, so a workflow that polls in a tight loop racks up transitions fast; space polls out with Wait states, use a `.sync` service integration where one exists, or switch to Express workflows for high-volume, short-duration executions.
S3: lifecycle rules moving cold data out of Standard remain the single highest-leverage lever in most data lake cost reviews.
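
To make the S3 lever concrete, here is a hedged sketch that builds a lifecycle configuration as a Python dict. The prefix and day counts are assumptions chosen for illustration; the right thresholds depend on your own access patterns.

```python
def cold_data_lifecycle(prefix: str, ia_after_days: int = 30,
                        glacier_after_days: int = 90) -> dict:
    """Build a lifecycle configuration that tiers objects under a prefix."""
    if ia_after_days < 30:
        # S3 requires objects to be at least 30 days old for this transition.
        raise ValueError("STANDARD_IA transition needs at least 30 days")
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    rule_id = "tier-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Hypothetical prefix, for illustration only.
lifecycle = cold_data_lifecycle("raw/events/")
# With boto3 this could be applied to a bucket you own as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["ID"])
```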

Troubleshooting Common Glue and EMR Failures

Symptom: Glue job runs out of memory
Cause:   Data skew — one partition/key vastly larger than others,
         causing a single executor to process a disproportionate share
Fix:     Salt the skewed key, repartition before the join,
         or increase worker type/DPU count

Symptom: Glue job succeeds but downstream table is empty
Cause:   Job bookmark treats data as "already processed"
         (common after a schema change or manual backfill)
Fix:     Reset the job bookmark explicitly for a fresh full run

Symptom: EMR step fails with "container killed on request, exit code 137"
Cause:   YARN container exceeded memory limits (OOM kill)
Fix:     Increase executor memory (or memory overhead), or reduce cores
         per executor so fewer concurrent tasks share each container

Symptom: Crawler creates duplicate tables with slightly different schemas
Cause:   Inconsistent file formats/schemas under the same S3 prefix
Fix:     Enforce a consistent schema at write time, or configure the
         crawler's schema change policy explicitly
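
The bookmark fix above can be sketched with two Glue API calls. The job name is hypothetical, and `FakeGlue` is a stand-in so the example runs offline; with real credentials you would pass `boto3.client("glue")`.

```python
from typing import List, Tuple


def rerun_with_fresh_bookmark(glue, job_name: str) -> str:
    """Reset the job bookmark, then start a full run; return the run ID."""
    # After the reset, the next run treats all input as unprocessed.
    glue.reset_job_bookmark(JobName=job_name)
    response = glue.start_job_run(JobName=job_name)
    return response["JobRunId"]


class FakeGlue:
    """Offline stand-in for a Glue client; records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def reset_job_bookmark(self, JobName: str) -> dict:
        self.calls.append(("reset_job_bookmark", JobName))
        return {}

    def start_job_run(self, JobName: str) -> dict:
        self.calls.append(("start_job_run", JobName))
        return {"JobRunId": "jr_example"}


glue = FakeGlue()
run_id = rerun_with_fresh_bookmark(glue, "nightly-etl")
print(run_id, glue.calls)
```

Because a reset causes a full reprocess, it pairs naturally with an idempotent write on the target side.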

Data skew is worth dwelling on because it’s the most common root cause behind “Glue job runs fine on small data but fails at scale” scenarios. If 90% of your join keys fall under a handful of values, a straight hash-partitioned join sends nearly all the work to a few executors. Salting the key (appending a random suffix and exploding the smaller side of the join accordingly) spreads that work back out evenly.
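
The salting mechanics can be shown in plain Python, without Spark. This is a toy sketch on made-up data: the "facts" side is skewed 90/10 toward one key, the small "dims" side is exploded once per salt value, and the join result is unchanged while the largest key group shrinks.

```python
import random
from collections import Counter


def salt_large_side(rows, num_salts, rng):
    """Attach a random salt to each key on the large, skewed side."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so every salt finds a match."""
    return [((key, salt), value)
            for key, value in rows
            for salt in range(num_salts)]


def hash_join(left, right):
    """Inner join on key; returns (key, left_value, right_value) tuples."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


# Toy data: 90% of the fact rows share a single key.
facts = [("US", i) for i in range(90)] + [("NZ", i) for i in range(90, 100)]
dims = [("US", "United States"), ("NZ", "New Zealand")]

NUM_SALTS = 4
rng = random.Random(0)  # seeded so the demo is repeatable
salted_facts = salt_large_side(facts, NUM_SALTS, rng)
salted_dims = explode_small_side(dims, NUM_SALTS)

plain = sorted(hash_join(facts, dims))
# Drop the salt from the key again after joining.
salted = sorted((key[0], lv, rv)
                for key, lv, rv in hash_join(salted_facts, salted_dims))

largest_group_before = max(Counter(key for key, _ in facts).values())
largest_group_after = max(Counter(key for key, _ in salted_facts).values())
print(plain == salted, largest_group_before, largest_group_after)
```

The cost is that the small side grows by a factor of `NUM_SALTS`, which is why salting is applied to the smaller table in the join.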

Exam Focus: What Questions Test From This Step

Step Functions vs MWAA — pick based on AWS-service density vs existing Airflow DAG investment
Reading CloudWatch iterator age as a sign of a lagging Kinesis consumer
Exponential backoff configuration and dead-letter queue usage for transient failures
Idempotency requirements when retries are possible
Cost levers: Glue job bookmarks, EMR Spot for task nodes, Redshift Serverless RPU ranges, Step Functions Standard vs Express
Diagnosing Glue OOM/data skew failures and the salting fix
Recognizing job bookmark issues causing “job succeeded but no new data” symptoms
EMR container OOM kills (exit 137) and executor memory tuning

Written by NPBlue Cloud Team: Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy. Spot an error? Let us know.