Step 1 — Ingestion & Transformation

Every data pipeline starts the same way: something produces records, and you have to get them somewhere useful without losing, duplicating, or mangling them along the way. The DEA-C01 exam spends a lot of its weight here, and for good reason — pick the wrong ingestion tool for a scenario and everything downstream inherits the mistake. Let’s work through how AWS wants you to think about it.

Batch vs Streaming: The First Fork in the Road

Almost every ingestion question on this exam is secretly asking one thing: does this data need to be acted on within seconds, or can it wait?

BATCH INGESTION                         STREAMING INGESTION
────────────────────────────            ────────────────────────────
Data collected over a window            Data processed as it arrives
Minutes to hours of latency             Sub-second to a few seconds
Cheaper per record processed            Higher cost per record
Good for: nightly reports,              Good for: fraud alerts, live
reconciliation, historical loads        dashboards, IoT telemetry
Typical tools: Glue jobs, EMR,          Typical tools: Kinesis Data
S3 batch loads, Redshift COPY           Streams, MSK, Kinesis Firehose

The trap the exam sets: a scenario describes something that sounds urgent (“the business wants near-real-time visibility”) but the actual requirement, once you read closely, tolerates a 15-minute delay. That’s still batch — just on a tighter schedule. Reserve streaming architectures for genuinely continuous, unbounded data where waiting for a batch window isn’t acceptable.
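That reading habit can be sketched as a tiny classifier over the stated latency tolerance. Treat it as an illustration only: the function name and the 60-second cutoff are assumptions made for this example, not AWS thresholds.

```python
def ingestion_mode(max_delay_seconds: float, unbounded_source: bool) -> str:
    """Classify a requirement as 'streaming' or 'batch'.

    The 60-second cutoff is an illustrative assumption, not an AWS rule.
    """
    if max_delay_seconds <= 0:
        raise ValueError("max_delay_seconds must be positive")
    # Streaming only pays off when data is continuous AND the tolerance is tight.
    if unbounded_source and max_delay_seconds < 60:
        return "streaming"
    return "batch"


# "Near-real-time visibility" that tolerates 15 minutes is tight-window batch.
print(ingestion_mode(15 * 60, unbounded_source=True))
print(ingestion_mode(2, unbounded_source=True))
```

The point is the order of the questions: first check what delay the requirement actually tolerates, then check whether the source is truly continuous.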

AWS Glue: Jobs, Crawlers, and Studio

Glue is the backbone of serverless ETL on AWS, and DEA-C01 expects you to know its pieces individually rather than as one blob called “Glue.”

Glue Crawlers scan a data source (S3, JDBC, DynamoDB) and infer schema, writing table definitions into the Glue Data Catalog. Crawlers don’t move or transform data — they only catalog it. A common exam wrinkle: a crawler runs against a partitioned S3 prefix and needs to detect new partitions on a schedule so downstream Athena or Redshift Spectrum queries stay accurate.
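As a rough sketch of the scheduled-crawler scenario, the snippet below builds the request you might pass to boto3’s Glue create_crawler call. Every name, ARN, and path is a hypothetical placeholder, and the schema change policy shown is one plausible choice rather than a recommendation. The actual API call is left as a comment so the snippet runs without AWS credentials.

```python
import json


def build_crawler_request(name: str, role_arn: str, database: str,
                          s3_path: str, cron: str) -> dict:
    """Build keyword arguments for boto3's glue.create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Glue schedules use a six-field cron expression wrapped in cron(...).
        "Schedule": f"cron({cron})",
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # apply detected changes
            "DeleteBehavior": "LOG",                 # log removals, do not drop
        },
    }


request = build_crawler_request(
    name="orders-raw-crawler",                                  # hypothetical
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    database="raw_zone",                                        # hypothetical
    s3_path="s3://example-raw-bucket/orders/",                  # hypothetical
    cron="0 2 * * ? *",  # daily at 02:00 UTC
)
print(json.dumps(request, indent=2))

# With boto3 installed and credentials configured, this would be submitted as:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```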

Glue Jobs are the actual transformation compute — Spark under the hood (or Python shell for lightweight scripts, or Ray for more recent distributed Python workloads). You write a script, or you build one visually.

Glue Studio is the drag-and-drop authoring layer over Glue jobs. It generates a visual DAG of sources, transforms, and sinks, and produces the underlying PySpark code for you — useful when a team wants ETL logic that isn’t purely code-first, or when you’re prototyping before hand-tuning a script.

Glue Data Quality rides on top of jobs and lets you define rules in DQDL (the Data Quality Definition Language) that check properties such as completeness, uniqueness, and referential integrity before data is allowed to land in a trusted zone. Expect scenario questions where a pipeline needs to quarantine bad records rather than fail the whole job. The answer there is a Data Quality ruleset configured to let the job continue when rules fail, with the row-level outcomes used to route failing records to a separate output instead of triggering a hard stop.

S3 Raw Zone
     │
     ▼
Glue Crawler ──► Glue Data Catalog (schema + partitions)
     │
     ▼
Glue Job (Spark) ──► Glue Data Quality rules
     │                    │
     │              fails ruleset
     │                    ▼
     │              S3 Quarantine Zone
     ▼
S3 Curated Zone (Parquet, partitioned)
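The quarantine branch in the diagram can be imitated in plain Python. This is a sketch of the routing idea only and does not call Glue: the column names and checks are hypothetical, and the DQDL string is included just to show roughly what a comparable ruleset might look like.

```python
from typing import Callable

# Roughly comparable DQDL, for orientation only (hypothetical columns).
DQDL_RULESET = 'Rules = [ IsComplete "order_id", ColumnValues "amount" > 0 ]'

Rule = Callable[[dict], bool]

RULES: dict[str, Rule] = {
    "order_id is complete": lambda r: r.get("order_id") not in (None, ""),
    "amount > 0": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] > 0,
}


def route(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (curated, quarantined) instead of failing the batch."""
    curated, quarantined = [], []
    for record in records:
        failed = [name for name, check in RULES.items() if not check(record)]
        if failed:
            # Keep the reasons with the record so the quarantine zone is debuggable.
            quarantined.append({**record, "failed_rules": failed})
        else:
            curated.append(record)
    return curated, quarantined


curated, quarantined = route([
    {"order_id": "A1", "amount": 25.0},
    {"order_id": None, "amount": 10.0},
    {"order_id": "A3", "amount": -5},
])
print(len(curated), "curated;", len(quarantined), "quarantined")
```

Carrying the failed rule names along with each quarantined record is a design choice for this sketch; it makes it easier to see why a row was held back.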

Streaming Ingestion: Kinesis and MSK

Kinesis Data Streams

A Kinesis Data Stream is a set of shards, each of which is an ordered sequence of records. Producers put records in, consumers (Kinesis Client Library apps, Lambda, Managed Service for Apache Flink) read them out, and records sharing a partition key map to the same shard, which preserves their order relative to one another. Capacity is either provisioned (you size shards yourself; each shard accepts up to 1 MB/s or 1,000 records/s of writes) or on-demand (Kinesis scales shards for you based on throughput, at a premium).
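To see why a partition key pins records to one shard, here is a minimal sketch of the mapping. Kinesis hashes the partition key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The even split of the hash space below is a simplifying assumption (it may not hold after resharding), and shard_for is a hypothetical helper, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumes the hash space is split evenly across shards, which may not hold
    after a stream has been resharded.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


# Records with the same partition key always map to the same shard,
# which is what gives per-key ordering.
for key in ["device-1", "device-2", "device-1"]:
    print(key, "-> shard", shard_for(key, shard_count=4))
```

This also hints at the hot-shard problem: a low-cardinality partition key concentrates traffic on a few shards no matter how many you provision.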

Kinesis Data Firehose

Firehose (renamed Amazon Data Firehose in 2024, though you will still see the older name) is the “just deliver it somewhere” service: you don’t have to write consumer code. Point it at S3, Redshift, OpenSearch, or an HTTP endpoint, optionally attach a Lambda for lightweight transformation in-flight, and it handles buffering, batching, and retries. The exam distinction that trips people up:

Need                                                                  Use
──────────────────────────────────────────────────────────────────    ────────────────────────────────
Custom, low-latency stream processing with your own consumer logic    Kinesis Data Streams
Fully managed delivery to S3/Redshift/OpenSearch with minimal code    Kinesis Data Firehose
Real-time SQL or Flink-based stream analytics                         Managed Service for Apache Flink
Kafka-compatible streaming with existing Kafka tooling/consumers      Amazon MSK
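The optional Lambda transformation that Firehose supports follows a fixed contract: each incoming record carries a recordId and base64-encoded data, and the function must return every recordId with a result of Ok, Dropped, or ProcessingFailed. The sketch below follows that contract; the handler name and the enrichment field are hypothetical.

```python
import base64
import json


def transform_handler(event, context):
    """Firehose transformation Lambda: enrich JSON records, flag the rest."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
            payload["pipeline_stage"] = "raw"  # hypothetical enrichment field
            # Newline-delimit records so they stay separable once batched in S3.
            encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({"recordId": record["recordId"], "result": "Ok",
                           "data": encoded.decode("utf-8")})
        except ValueError:
            # Firehose expects every recordId back; failed ones keep their data.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed", "data": record["data"]})
    return {"records": output}


event = {"records": [
    {"recordId": "1", "data": base64.b64encode(b'{"temp": 21}').decode()},
    {"recordId": "2", "data": base64.b64encode(b"not json").decode()},
]}
result = transform_handler(event, None)
print([r["result"] for r in result["records"]])  # ['Ok', 'ProcessingFailed']
```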

Amazon MSK

MSK is managed Apache Kafka. Choose it instead of Kinesis when the organization already has Kafka producers/consumers, needs Kafka-specific semantics (consumer groups, long retention replays, exactly-once semantics with Kafka transactions), or is migrating an on-prem Kafka estate into AWS without a rewrite. MSK Serverless removes the broker-sizing decision entirely by scaling capacity automatically, which makes it a sensible default unless you need fine control over broker configuration.

A useful mental shortcut: Kinesis is “AWS-native streaming, pay-as-you-go, simplest ops.” MSK is “bring your Kafka expertise and ecosystem, more portable, more configuration surface.”

Schema Evolution

Data engineers rarely get to fix a schema and walk away — producers add fields, rename things, change types. The exam wants you to know how the catalog and query layers cope:

Adding a nullable column — generally safe; new Parquet/ORC files carry the extra column, older files just show null, Glue Catalog table version updates.
Renaming or removing a column — breaks readers expecting the old name; needs either a new table version, a view abstraction, or a controlled migration.
Type changes (e.g., int to string) — usually requires backfilling or maintaining two schema versions in the catalog until old data ages out.

Glue Crawlers can be configured to add new columns automatically on re-crawl, but they will not silently resolve type conflicts — that ambiguity gets logged, and it’s the engineer’s job to resolve it, often by pinning a schema in the job script instead of trusting inference on every run.
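The three categories above can be detected mechanically by comparing two versions of a schema. The sketch below works on plain {column: type} dictionaries; the function names and sample columns are hypothetical, and it knows nothing about the Glue Data Catalog itself.

```python
def classify_schema_change(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare {column: type} mappings and bucket the differences."""
    return {
        "added": sorted(c for c in new if c not in old),
        "removed": sorted(c for c in old if c not in new),
        "type_changed": sorted(c for c in old if c in new and old[c] != new[c]),
    }


def is_backward_compatible(change: dict[str, list[str]]) -> bool:
    """Only purely additive changes are treated as safe for existing readers."""
    return not change["removed"] and not change["type_changed"]


v1 = {"order_id": "string", "amount": "int"}
v2 = {"order_id": "string", "amount": "string", "currency": "string"}

change = classify_schema_change(v1, v2)
print(change)
print("safe for old readers:", is_backward_compatible(change))
```

A rename shows up here as one removed column plus one added column, which matches how readers experience it: the old name simply disappears.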

Choosing the Transformation Tool

Not every transform belongs in the same compute layer. The exam scenarios usually hinge on volume, complexity, and latency tolerance:

Lightweight, event-driven, sub-minute   ──► Lambda
  (reformat a single S3 object,
   enrich a Kinesis record in Firehose)

Heavy, distributed, batch or micro-batch ──► Glue ETL / EMR Spark
  (joins across large datasets,
   dedup, complex aggregations)

Need full control over cluster/Spark    ──► EMR
  (custom libraries, Hive, Presto,
   HBase, existing Hadoop ecosystem)

Fully serverless, less cluster tuning   ──► Glue Jobs
  (same Spark engine, AWS manages
   provisioning and scaling)

EMR gives you the whole Hadoop ecosystem and full control over instance types, bootstrap actions, and cluster lifecycle — useful when a workload needs something Glue doesn’t expose, or when cost optimization via Spot Instances on long-running clusters matters more than operational simplicity. Glue trades some of that control for a no-servers experience with per-second billing on job runtime alone.
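To make the per-second billing point concrete, here is a small arithmetic sketch. The rate of 0.44 per DPU-hour is a placeholder assumption (Glue pricing varies by Region and job type), and the 60-second minimum is an assumption about how recent Glue versions bill Spark jobs; check current pricing before relying on either number.

```python
def glue_job_cost(dpus: float, runtime_seconds: float,
                  rate_per_dpu_hour: float, minimum_seconds: float = 60) -> float:
    """Estimate the cost of one Glue job run billed per second of runtime.

    rate_per_dpu_hour and minimum_seconds are caller-supplied assumptions.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour


# A 12-minute run on 10 DPUs at an assumed rate of 0.44 per DPU-hour.
cost = glue_job_cost(dpus=10, runtime_seconds=12 * 60, rate_per_dpu_hour=0.44)
print(f"estimated cost: ${cost:.2f}")
```

Nothing is billed between runs in this model, which is the contrast with a long-running EMR cluster that accrues instance-hours whether or not a job is active.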

Lambda transformations show up mostly at the edges of a pipeline: reshaping a single record before it lands in Firehose, triggering off an S3 PUT event to do light validation, or gluing together small orchestration steps. Lambda has a hard 15-minute execution limit, so if your transformation logic can’t reliably finish within that window or has to join large datasets, that’s a signal to move it out of Lambda.
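As an example of the light validation case, the sketch below handles an S3 PUT event notification and checks each uploaded object’s key suffix and size. The event layout follows the standard S3 notification shape; the handler name, the allowed suffixes, and the sample bucket are hypothetical.

```python
from urllib.parse import unquote_plus

ALLOWED_SUFFIXES = (".json", ".csv")  # hypothetical validation rule


def validate_upload_handler(event, context):
    """Accept or reject each uploaded object based on its key suffix and size."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        valid = key.lower().endswith(ALLOWED_SUFFIXES) and size > 0
        results.append({"bucket": bucket, "key": key, "valid": valid})
    return results


sample_event = {"Records": [
    {"s3": {"bucket": {"name": "example-raw-bucket"},
            "object": {"key": "incoming/daily+report.csv", "size": 2048}}},
    {"s3": {"bucket": {"name": "example-raw-bucket"},
            "object": {"key": "incoming/notes.txt", "size": 10}}},
]}
results = validate_upload_handler(sample_event, None)
for item in results:
    print(item)
```

The handler only inspects event metadata, so it stays well inside Lambda’s limits; anything that needs to read and join the file contents at scale belongs in Glue or EMR.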

Putting It Together: A Realistic Pipeline

                     ┌──────────────┐
IoT devices ────────►│ Kinesis Data │────► Managed Flink (real-time
(streaming)          │   Streams    │      anomaly scoring)
                     └──────────────┘
                            │
                            ▼
                     Kinesis Firehose ────► S3 Raw Zone (Parquet)
                                                  │
Batch exports from ─────────────────────────────►│
on-prem DB (nightly)                              ▼
                                            Glue Crawler
                                                  │
                                                  ▼
                                          Glue Data Catalog
                                                  │
                                                  ▼
                                     Glue ETL Job + Data Quality
                                                  │
                                                  ▼
                                          S3 Curated Zone ──► Redshift Serverless

This is the shape most DEA-C01 scenario questions gesture toward: a streaming source for immediacy, a batch source for completeness, converging into a catalog-aware transformation layer before landing somewhere queryable.

Exam Focus: What Questions Test From This Step

Recognizing when a “near real-time” requirement is actually satisfied by batch (tight-window batch vs true streaming)
Kinesis Data Streams vs Firehose vs MSK — which one fits a given producer/consumer scenario
Glue Crawlers only catalog; they do not transform or move data
Glue Data Quality rulesets and how to route failing records instead of failing an entire job
When to pick EMR (full cluster control) over Glue Jobs (serverless, less tuning)
Handling schema evolution: additive changes vs breaking changes vs type conflicts
Choosing Lambda for lightweight, short-duration transforms vs Glue/EMR for heavy distributed processing
MSK vs Kinesis when an organization already runs Kafka

Written by NPBlue Cloud Team: Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.