Step 1: Ingestion & Transformation
Every data pipeline starts the same way: something produces records, and you have to get them somewhere useful without losing, duplicating, or mangling them along the way. The DEA-C01 exam spends a lot of its weight here, and for good reason: pick the wrong ingestion tool for a scenario and everything downstream inherits the mistake. Let's work through how AWS wants you to think about it.
Batch vs Streaming: The First Fork in the Road
Almost every ingestion question on this exam is secretly asking one thing: does this data need to be acted on within seconds, or can it wait?
| | Batch ingestion | Streaming ingestion |
|---|---|---|
| How data is handled | Data collected over a window | Data processed as it arrives |
| Latency | Minutes to hours | Sub-second to a few seconds |
| Cost | Cheaper per record processed | Higher cost per record |
| Good for | Nightly reports, reconciliation, historical loads | Fraud alerts, live dashboards, IoT telemetry |
| Typical tools | Glue jobs, EMR, S3 batch loads, Redshift COPY | Kinesis Data Streams, MSK, Kinesis Firehose |

The trap the exam sets: a scenario describes something that sounds urgent ("the business wants near-real-time visibility") but the actual requirement, once you read closely, tolerates a 15-minute delay. That's still batch, just on a tighter schedule. Reserve streaming architectures for genuinely continuous, unbounded data where waiting for a batch window isn't acceptable.
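The "does it need seconds, or can it wait?" test can be written down as a tiny helper. This is only a sketch: the function name and the 60-second default threshold are illustrative assumptions, not an AWS-defined boundary.

```python
def choose_ingestion_mode(max_acceptable_delay_seconds: float,
                          streaming_threshold_seconds: float = 60.0) -> str:
    """Return 'streaming' or 'batch' for a stated latency tolerance.

    The default threshold is an assumption made for illustration only.
    """
    if max_acceptable_delay_seconds <= 0:
        raise ValueError("delay tolerance must be positive")
    if max_acceptable_delay_seconds < streaming_threshold_seconds:
        return "streaming"
    return "batch"


# A requirement that tolerates a 15-minute delay is still batch.
print(choose_ingestion_mode(15 * 60))
# A fraud alert that must fire within a couple of seconds is streaming.
print(choose_ingestion_mode(2))
```

The point is less the threshold itself than the habit: extract the tolerated delay from the scenario before reaching for a streaming service.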
AWS Glue: Jobs, Crawlers, and Studio
Glue is the backbone of serverless ETL on AWS, and DEA-C01 expects you to know its pieces individually rather than as one blob called "Glue."
Glue Crawlers scan a data source (S3, JDBC, DynamoDB) and infer schema, writing table definitions into the Glue Data Catalog. Crawlers don't move or transform data; they only catalog it. A common exam wrinkle: a crawler runs against a partitioned S3 prefix and needs to detect new partitions on a schedule so downstream Athena or Redshift Spectrum queries stay accurate.
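A scheduled crawler over a partitioned prefix might be defined roughly as follows. This is a sketch, assuming boto3's Glue `create_crawler` call; the crawler name, role ARN, database, and bucket path are all hypothetical placeholders, and the hourly schedule is just one possible choice.

```python
def build_crawler_request(name, role_arn, database, s3_path,
                          schedule="cron(0 * * * ? *)"):
    """Build keyword arguments for boto3's glue.create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Hourly re-crawl so newly written partitions reach the catalog.
        "Schedule": schedule,
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # apply detected changes
            "DeleteBehavior": "LOG",                 # never drop tables silently
        },
    }


def create_crawler(request):
    """Submit the request. Needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above runs without it
    boto3.client("glue").create_crawler(**request)


# All identifiers below are hypothetical examples.
request = build_crawler_request(
    name="raw-orders-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="raw_zone",
    s3_path="s3://example-raw-bucket/orders/",
)
print(request["Schedule"])
```

Note that nothing in the request describes a transformation or a destination for data: the crawler only writes table and partition metadata.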
Glue Jobs are the actual transformation compute: Spark under the hood (or Python shell for lightweight scripts, or Ray for more recent distributed Python workloads). You write a script, or you build one visually.
Glue Studio is the drag-and-drop authoring layer over Glue jobs. It generates a visual DAG of sources, transforms, and sinks, and produces the underlying PySpark code for you, which is useful when a team wants ETL logic that isn't purely code-first, or when you're prototyping before hand-tuning a script.
Glue Data Quality rides on top of jobs and lets you define rules (using DQDL, the Data Quality Definition Language) that check things like completeness, uniqueness, and referential expectations before data is allowed to land in a trusted zone. Expect scenario questions where a pipeline needs to quarantine bad records rather than fail the whole job. That calls for a Data Quality ruleset configured to let the job continue, with row-level results used to route failing records to a separate output, rather than a hard stop.
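To make the quarantine idea concrete, here is a small DQDL ruleset alongside a plain-Python stand-in for the routing it implies. The column names are hypothetical, and `split_records` is not the Glue API: it only imitates the outcome of routing rows on pass/fail, and it simplifies uniqueness by keeping the first occurrence of a key.

```python
# A DQDL ruleset of the kind attached to a Glue Data Quality evaluation.
# Column names are hypothetical.
DQDL_RULESET = """
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "amount" > 0
]
"""


def split_records(records):
    """Route rows to (curated, quarantined) using the same three checks.

    Plain-Python illustration only; inside a Glue job the evaluation is
    done by Glue Data Quality, not by hand-written code like this.
    """
    seen_ids = set()
    curated, quarantined = [], []
    for rec in records:
        order_id = rec.get("order_id")
        amount = rec.get("amount")
        passes = (
            order_id is not None             # IsComplete "order_id"
            and order_id not in seen_ids     # IsUnique (first one wins here)
            and amount is not None
            and amount > 0                   # ColumnValues "amount" > 0
        )
        if order_id is not None:
            seen_ids.add(order_id)
        (curated if passes else quarantined).append(rec)
    return curated, quarantined


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 5.0},   # incomplete
    {"order_id": 1, "amount": 7.0},      # duplicate key
    {"order_id": 2, "amount": -3.0},     # non-positive amount
]
curated, quarantined = split_records(rows)
print(len(curated), len(quarantined))
```

The job finishes either way; only the destination of each row changes.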
    S3 Raw Zone
         │
         ▼
    Glue Crawler ──► Glue Data Catalog (schema + partitions)
         │
         ▼
    Glue Job (Spark) ──► Glue Data Quality rules
         │                          │
         │                    fails ruleset
         │                          ▼
         │                 S3 Quarantine Zone
         ▼
    S3 Curated Zone (Parquet, partitioned)

Streaming Ingestion: Kinesis and MSK
Kinesis Data Streams
A Kinesis Data Stream is a set of ordered shards. Producers put records in, consumers (Kinesis Client Library apps, Lambda, Managed Service for Apache Flink) read them out, and each shard preserves order for records sharing a partition key. Capacity is either provisioned (you size shards yourself) or on-demand (Kinesis scales shards for you based on throughput, at a premium).
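Per-key ordering follows from how records are assigned to shards: Kinesis takes an MD5 hash of the partition key and maps the resulting 128-bit integer onto the shards' hash key ranges. The sketch below assumes the shards split the hash space evenly, which real streams need not do after splits or merges; the function name is made up for illustration.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard that would receive this key.

    Assumes shard_count shards with equal, contiguous hash key ranges.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# The same key always maps to the same shard, which is what keeps
# one device's records in order relative to each other.
for key in ("device-1", "device-2", "device-1"):
    print(key, "->", shard_for_key(key, shard_count=4))
```

A practical consequence: a low-cardinality or skewed partition key concentrates traffic on a few shards no matter how many are provisioned.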
Kinesis Data Firehose
Firehose is the "just deliver it somewhere" service: it doesn't require you to write consumer code. Point it at S3, Redshift, OpenSearch, or an HTTP endpoint, optionally attach a Lambda for lightweight transformation in-flight, and it handles buffering, batching, and retries. The exam distinction that trips people up:
| Need | Use |
|---|---|
| Custom, low-latency stream processing with your own consumer logic | Kinesis Data Streams |
| Fully managed delivery to S3/Redshift/OpenSearch with minimal code | Kinesis Data Firehose |
| Real-time SQL or Flink-based stream analytics | Managed Service for Apache Flink |
| Kafka-compatible streaming with existing Kafka tooling/consumers | Amazon MSK |
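The optional in-flight Lambda that Firehose can call follows a fixed contract: each incoming record carries a `recordId` and base64-encoded `data`, and the function must return every `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. A minimal sketch, assuming JSON payloads; the `device_id` field and the `source` enrichment are hypothetical.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation: enrich valid JSON, drop or fail the rest."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except (KeyError, ValueError):
            # Missing data, bad base64, or invalid JSON.
            output.append({"recordId": record_id,
                           "result": "ProcessingFailed",
                           "data": record.get("data", "")})
            continue

        if not isinstance(payload, dict) or payload.get("device_id") is None:
            # Hypothetical rule: records without a device_id are discarded.
            output.append({"recordId": record_id,
                           "result": "Dropped",
                           "data": record["data"]})
            continue

        payload["source"] = "firehose-transform"  # hypothetical enrichment
        line = json.dumps(payload) + "\n"         # newline-delimited output
        output.append({
            "recordId": record_id,
            "result": "Ok",
            "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

This stays within the "lightweight" boundary: one record in, one record out, no joins and no state carried between invocations.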
Amazon MSK
MSK is managed Apache Kafka. Choose it instead of Kinesis when the organization already has Kafka producers/consumers, needs Kafka-specific semantics (consumer groups, long retention replays, exactly-once semantics with Kafka transactions), or is migrating an on-prem Kafka estate into AWS without a rewrite. MSK Serverless removes the broker-sizing decision entirely, auto-scaling storage and throughput, which is increasingly the default recommendation for 2026-era workloads unless you need fine control over broker configuration.
A useful mental shortcut: Kinesis is "AWS-native streaming, pay-as-you-go, simplest ops." MSK is "bring your Kafka expertise and ecosystem, more portable, more configuration surface."
Schema Evolution
Data engineers rarely get to fix a schema and walk away: producers add fields, rename things, change types. The exam wants you to know how the catalog and query layers cope:
- Adding a nullable column: generally safe; new Parquet/ORC files carry the extra column, older files just show null, and the Glue Catalog table version updates.
- Renaming or removing a column: breaks readers expecting the old name; needs either a new table version, a view abstraction, or a controlled migration.
- Type changes (e.g., int to string): usually require backfilling or maintaining two schema versions in the catalog until old data ages out.
Glue Crawlers can be configured to add new columns automatically on re-crawl, but they will not silently resolve type conflicts. That ambiguity gets logged, and it's the engineer's job to resolve it, often by pinning a schema in the job script instead of trusting inference on every run.
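The three categories above can be detected mechanically by diffing two schema versions. The helper below is a sketch with a made-up name, operating on plain `{column: type}` dictionaries rather than on the Glue Data Catalog itself; the column names are hypothetical.

```python
def classify_schema_change(old: dict, new: dict) -> dict:
    """Compare two {column: type} schemas and flag breaking changes."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(col for col in set(old) & set(new) if old[col] != new[col])
    return {
        "added": added,        # additive: generally safe if nullable
        "removed": removed,    # breaking: readers expect the old name
        "retyped": retyped,    # breaking: needs backfill or dual versions
        "breaking": bool(removed or retyped),
    }


old_schema = {"order_id": "bigint", "amount": "double", "cust_name": "string"}
new_schema = {"order_id": "string", "amount": "double",
              "customer_name": "string", "coupon": "string"}

print(classify_schema_change(old_schema, new_schema))
```

A rename is indistinguishable from a removal plus an addition in a diff like this (`cust_name` vs `customer_name` here), which is one reason renames need an explicit migration rather than inference.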
Choosing the Transformation Tool
Not every transform belongs in the same compute layer. The exam scenarios usually hinge on volume, complexity, and latency tolerance:
- Lightweight, event-driven, sub-minute ──► Lambda (reformat a single S3 object, enrich a Kinesis record in Firehose)
- Heavy, distributed, batch or micro-batch ──► Glue ETL / EMR Spark (joins across large datasets, dedup, complex aggregations)
- Need full control over cluster/Spark ──► EMR (custom libraries, Hive, Presto, HBase, existing Hadoop ecosystem)
- Fully serverless, less cluster tuning ──► Glue Jobs (same Spark engine, AWS manages provisioning and scaling)

EMR gives you the whole Hadoop ecosystem and full control over instance types, bootstrap actions, and cluster lifecycle, which is useful when a workload needs something Glue doesn't expose, or when cost optimization via Spot Instances on long-running clusters matters more than operational simplicity. Glue trades some of that control for a no-servers experience with per-second billing on job runtime alone.
Lambda transformations show up mostly at the edges of a pipeline: reshaping a single record before it lands in Firehose, triggering off an S3 PUT event to do light validation, or gluing together small orchestration steps. If your transformation logic needs more than Lambda's 15-minute maximum runtime or has to join large datasets, that's a signal to move it out of Lambda.
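As an illustration of the S3-triggered case, the handler below does light validation using only the metadata in the event notification. It is a sketch: the accepted suffixes and the "non-empty object" rule are hypothetical checks, and a real function would typically go on to move or tag the object.

```python
from urllib.parse import unquote_plus

ALLOWED_SUFFIXES = (".json", ".csv")  # hypothetical accepted formats


def lambda_handler(event, context):
    """Sort newly created S3 objects into accepted and rejected lists."""
    accepted, rejected = [], []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        is_valid = size > 0 and key.lower().endswith(ALLOWED_SUFFIXES)
        (accepted if is_valid else rejected).append(f"s3://{bucket}/{key}")
    return {"accepted": accepted, "rejected": rejected}


# Trimmed-down sample event; bucket and keys are made up.
sample_event = {"Records": [
    {"s3": {"bucket": {"name": "example-raw-bucket"},
            "object": {"key": "incoming/day+1/orders.json", "size": 120}}},
    {"s3": {"bucket": {"name": "example-raw-bucket"},
            "object": {"key": "incoming/empty.csv", "size": 0}}},
]}
print(lambda_handler(sample_event, None))
```

Everything here finishes in milliseconds and touches one object at a time, which is the profile that suits Lambda.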
Putting It Together: A Realistic Pipeline
                          ┌──────────────┐
    IoT devices ────────► │ Kinesis Data │ ─────► Managed Flink (real-time
    (streaming)           │   Streams    │        anomaly scoring)
                          └──────────────┘
                                 │
                                 ▼
                          Kinesis Firehose ─────► S3 Raw Zone (Parquet)
                                                       │
    Batch exports from ───────────────────────────────►│
    on-prem DB (nightly)                               ▼
                                                 Glue Crawler
                                                       │
                                                       ▼
                                               Glue Data Catalog
                                                       │
                                                       ▼
                                          Glue ETL Job + Data Quality
                                                       │
                                                       ▼
                                                S3 Curated Zone ──► Redshift Serverless

This is the shape most DEA-C01 scenario questions gesture toward: a streaming source for immediacy, a batch source for completeness, converging into a catalog-aware transformation layer before landing somewhere queryable.
Exam Focus: What Questions Test From This Step
- Recognizing when a "near real-time" requirement is actually satisfied by batch (tight-window batch vs true streaming)
- Kinesis Data Streams vs Firehose vs MSK: which one fits a given producer/consumer scenario
- Glue Crawlers only catalog; they do not transform or move data
- Glue Data Quality rulesets and how to route failing records instead of failing an entire job
- When to pick EMR (full cluster control) over Glue Jobs (serverless, less tuning)
- Handling schema evolution: additive changes vs breaking changes vs type conflicts
- Choosing Lambda for lightweight, short-duration transforms vs Glue/EMR for heavy distributed processing
- MSK vs Kinesis when an organization already runs Kafka