AWS Data Engineer DEA-C01: Ingestion & Transformation Patterns

Step 1 of 5 · updated 2026


Step 1: Ingestion & Transformation

Every data pipeline starts the same way: something produces records, and you have to get them somewhere useful without losing, duplicating, or mangling them along the way. The DEA-C01 exam spends a lot of its weight here, and for good reason: pick the wrong ingestion tool for a scenario and everything downstream inherits the mistake. Let's work through how AWS wants you to think about it.


Batch vs Streaming: The First Fork in the Road

Almost every ingestion question on this exam is secretly asking one thing: does this data need to be acted on within seconds, or can it wait?

BATCH INGESTION                     STREAMING INGESTION
────────────────────────────────    ──────────────────────────────
Data collected over a window        Data processed as it arrives
Minutes to hours of latency         Sub-second to a few seconds
Cheaper per record processed        Higher cost per record
Good for: nightly reports,          Good for: fraud alerts, live
reconciliation, historical loads    dashboards, IoT telemetry
Typical tools: Glue jobs, EMR,      Typical tools: Kinesis Data
S3 batch loads, Redshift COPY       Streams, MSK, Kinesis Firehose

The trap the exam sets: a scenario describes something that sounds urgent ("the business wants near-real-time visibility") but the actual requirement, once you read closely, tolerates a 15-minute delay. That's still batch, just on a tighter schedule. Reserve streaming architectures for genuinely continuous, unbounded data where waiting for a batch window isn't acceptable.
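The reading habit described above can be written down as a tiny decision helper. This is only a sketch: the function name and the 60-second cutoff are assumptions chosen for illustration, not an AWS rule or an exam threshold.

```python
# Assumed cutoff for illustration only: below one minute of tolerated delay,
# a scheduled batch window is treated as too slow.
STREAMING_CUTOFF_S = 60


def choose_ingestion_mode(max_tolerated_delay_s: float, unbounded_source: bool) -> str:
    """Return 'streaming' only for continuous data with a tight latency need."""
    if unbounded_source and max_tolerated_delay_s < STREAMING_CUTOFF_S:
        return "streaming"
    # Anything that can wait for a window is batch, even on a tight schedule.
    return "batch"


# "Near-real-time visibility" that actually tolerates 15 minutes:
print(choose_ingestion_mode(15 * 60, unbounded_source=True))
# Fraud alerting that must react within a couple of seconds:
print(choose_ingestion_mode(2, unbounded_source=True))
```

The point of the sketch is that the stated tolerance, not the urgency of the wording, drives the choice.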


AWS Glue: Jobs, Crawlers, and Studio

Glue is the backbone of serverless ETL on AWS, and DEA-C01 expects you to know its pieces individually rather than as one blob called "Glue."

Glue Crawlers scan a data source (S3, JDBC, DynamoDB) and infer schema, writing table definitions into the Glue Data Catalog. Crawlers don't move or transform data; they only catalog it. A common exam wrinkle: a crawler runs against a partitioned S3 prefix and needs to detect new partitions on a schedule so downstream Athena or Redshift Spectrum queries stay accurate.
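As a rough illustration of the scheduled-crawler scenario, the sketch below builds the keyword arguments that boto3's Glue `create_crawler` call accepts. The helper name, crawler name, role ARN, database, and bucket path are all hypothetical placeholders. The block only constructs the dictionary, so it runs without AWS credentials.

```python
def crawler_config(name, role_arn, database, s3_path, cron):
    """Build kwargs for the Glue create_crawler API (nothing is sent to AWS)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Glue schedules use a six-field cron expression wrapped in cron(...).
        "Schedule": f"cron({cron})",
        "SchemaChangePolicy": {
            # Update table definitions in place when the schema changes.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Only log objects that disappear instead of dropping tables.
            "DeleteBehavior": "LOG",
        },
    }


# Hypothetical values: crawl the raw zone at the top of every hour.
config = crawler_config(
    name="raw-events-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="raw_zone",
    s3_path="s3://example-raw-bucket/events/",
    cron="0 * * * ? *",
)
print(config["Schedule"])
```

With credentials configured, the dictionary could be passed as `boto3.client("glue").create_crawler(**config)`.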

Glue Jobs are the actual transformation compute: Spark under the hood (or Python shell for lightweight scripts, or Ray for more recent distributed Python workloads). You write a script, or you build one visually.

Glue Studio is the drag-and-drop authoring layer over Glue jobs. It generates a visual DAG of sources, transforms, and sinks, and produces the underlying PySpark code for you, which is useful when a team wants ETL logic that isn't purely code-first, or when you're prototyping before hand-tuning a script.

Glue Data Quality rides on top of jobs and lets you define rules (using DQDL, the Data Quality Definition Language) that check things like completeness, uniqueness, and referential integrity before data is allowed to land in a trusted zone. Expect scenario questions where a pipeline needs to quarantine bad records rather than fail the whole job: that's a Data Quality ruleset evaluated inside the job, with the job set to continue on rule failure and the row-level outcomes used to route failing records to a separate output rather than a hard stop.

S3 Raw Zone
     │
     ▼
Glue Crawler ──► Glue Data Catalog (schema + partitions)
     │
     ▼
Glue Job (Spark) ──► Glue Data Quality rules
     │                      │
     │                fails ruleset
     │                      ▼
     │              S3 Quarantine Zone
     ▼
S3 Curated Zone (Parquet, partitioned)
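The quarantine branch in the diagram can be approximated in plain Python to show the idea of routing rows instead of failing the run. This is a sketch, not the Glue Data Quality API. The rule types named in the DQDL string exist in DQDL, but the column names `order_id` and `amount`, the helper `split_by_quality`, and the handling of nulls and duplicates are assumptions made for illustration.

```python
from collections import Counter

# DQDL ruleset the sketch approximates (column names are hypothetical).
RULESET_DQDL = """Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "amount" >= 0
]"""


def split_by_quality(records):
    """Return (curated, quarantine); quarantined rows carry their failed rules."""
    id_counts = Counter(r.get("order_id") for r in records)
    curated, quarantine = [], []
    for record in records:
        failed = []
        order_id = record.get("order_id")
        if order_id is None:
            failed.append('IsComplete "order_id"')
        elif id_counts[order_id] > 1:
            # Sketch choice: every copy of a duplicated id is quarantined.
            failed.append('IsUnique "order_id"')
        amount = record.get("amount")
        # Sketch choice: a missing amount counts as a failure.
        if amount is None or amount < 0:
            failed.append('ColumnValues "amount" >= 0')
        if failed:
            quarantine.append({**record, "failed_rules": failed})
        else:
            curated.append(record)
    return curated, quarantine


rows = [
    {"order_id": "a1", "amount": 10.0},
    {"order_id": None, "amount": 5.0},
    {"order_id": "a2", "amount": -3.0},
]
good, bad = split_by_quality(rows)
print(len(good), "curated;", len(bad), "quarantined")
```

In a real Glue job the same split would come from the Data Quality transform's row-level results rather than hand-written checks.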

Streaming Ingestion: Kinesis and MSK

Kinesis Data Streams

A Kinesis data stream is a set of shards, each holding an ordered sequence of records. Producers put records in, consumers (Kinesis Client Library apps, Lambda, Managed Service for Apache Flink) read them out, and the partition key determines the shard, so records sharing a partition key stay in order. Capacity is either provisioned (you size shards yourself) or on-demand (Kinesis scales shards for you based on throughput, at a premium).
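To make the partition-key behavior concrete, here is a small sketch of how a key lands on a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and matches it against each shard's hash key range. The even split of the hash space across shards and the function name `shard_for_key` are simplifying assumptions (resharded streams can have uneven ranges).

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the top of the hash space falls in the last shard.
    return min(hash_value // range_size, shard_count - 1)


# The same key always maps to the same shard, which is what preserves
# per-key ordering; different keys may or may not share a shard.
for key in ("device-1", "device-2", "device-1"):
    print(key, "-> shard", shard_for_key(key, shard_count=4))
```

A practical consequence: a low-cardinality partition key concentrates traffic on a few shards, so key choice affects throughput as well as ordering.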

Kinesis Data Firehose

Firehose (renamed Amazon Data Firehose in 2024) is the "just deliver it somewhere" service: it doesn't require you to write consumer code. Point it at S3, Redshift, OpenSearch, or an HTTP endpoint, optionally attach a Lambda for lightweight transformation in-flight, and it handles buffering, batching, and retries. The exam distinction that trips people up:

Need                                                                  Use
──────────────────────────────────────────────────────────────────    ────────────────────────────────
Custom, low-latency stream processing with your own consumer logic    Kinesis Data Streams
Fully managed delivery to S3/Redshift/OpenSearch with minimal code    Kinesis Data Firehose
Real-time SQL or Flink-based stream analytics                         Managed Service for Apache Flink
Kafka-compatible streaming with existing Kafka tooling/consumers      Amazon MSK
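The optional in-flight Lambda mentioned for Firehose follows a fixed record contract. Each incoming record has a `recordId` and base64-encoded `data`, and the function returns one entry per record with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The sketch below assumes JSON payloads. The `device_id` field and the uppercase normalization are hypothetical stand-ins for whatever light reshaping a pipeline needs.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation: return every recordId with a result and data."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Hand the original bytes back so Firehose can treat it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        # Hypothetical enrichment: normalize a device identifier.
        payload["device_id"] = str(payload.get("device_id", "")).upper()

        # Newline-delimit records so objects written to S3 are easy to parse.
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": encoded.decode("utf-8"),
        })
    return {"records": output}
```

Returning every `recordId` exactly once matters: Firehose matches results to inputs by that ID.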

Amazon MSK

MSK is managed Apache Kafka. Choose it instead of Kinesis when the organization already has Kafka producers/consumers, needs Kafka-specific semantics (consumer groups, long retention replays, exactly-once semantics with Kafka transactions), or is migrating an on-prem Kafka estate into AWS without a rewrite. MSK Serverless removes the broker-sizing decision entirely, auto-scaling storage and throughput, which is increasingly the default recommendation for 2026-era workloads unless you need fine control over broker configuration.

A useful mental shortcut: Kinesis is "AWS-native streaming, pay-as-you-go, simplest ops." MSK is "bring your Kafka expertise and ecosystem, more portable, more configuration surface."


Schema Evolution

Data engineers rarely get to fix a schema and walk away: producers add fields, rename things, change types. The exam wants you to know how the catalog and query layers cope:

Glue Crawlers can be configured to add new columns automatically on re-crawl, but they will not silently resolve type conflicts. That ambiguity gets logged, and it's the engineer's job to resolve it, often by pinning a schema in the job script instead of trusting inference on every run.
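The add-columns-but-flag-conflicts behavior can be modeled with two dictionaries. This is a conceptual sketch in plain Python, not crawler code: `merge_schema`, the column names, and the type strings are hypothetical, and the pinned schema stands in for whatever the job script declares explicitly.

```python
def merge_schema(pinned: dict, inferred: dict):
    """Add newly seen columns; report type conflicts instead of resolving them."""
    merged = dict(pinned)
    conflicts = {}
    for column, inferred_type in inferred.items():
        if column not in pinned:
            # New column from the producer: safe to append.
            merged[column] = inferred_type
        elif pinned[column] != inferred_type:
            # Keep the pinned type and surface the disagreement for a human.
            conflicts[column] = (pinned[column], inferred_type)
    return merged, conflicts


pinned = {"order_id": "string", "amount": "double"}
inferred = {"order_id": "string", "amount": "string", "channel": "string"}

merged, conflicts = merge_schema(pinned, inferred)
print(merged)     # pinned columns plus the new "channel" column
print(conflicts)  # "amount" is reported, not silently retyped
```

Keeping the pinned type on conflict mirrors the advice above: inference proposes, the job script decides.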


Choosing the Transformation Tool

Not every transform belongs in the same compute layer. The exam scenarios usually hinge on volume, complexity, and latency tolerance:

Lightweight, event-driven, sub-minute     ──► Lambda
  (reformat a single S3 object,
   enrich a Kinesis record in Firehose)
Heavy, distributed, batch or micro-batch  ──► Glue ETL / EMR Spark
  (joins across large datasets,
   dedup, complex aggregations)
Need full control over cluster/Spark      ──► EMR
  (custom libraries, Hive, Presto,
   HBase, existing Hadoop ecosystem)
Fully serverless, less cluster tuning     ──► Glue Jobs
  (same Spark engine, AWS manages
   provisioning and scaling)

EMR gives you the whole Hadoop ecosystem and full control over instance types, bootstrap actions, and cluster lifecycle, which is useful when a workload needs something Glue doesn't expose, or when cost optimization via Spot Instances on long-running clusters matters more than operational simplicity. Glue trades some of that control for a no-servers experience with per-second billing on job runtime alone.

Lambda transformations show up mostly at the edges of a pipeline: reshaping a single record before it lands in Firehose, triggering off an S3 PUT event to do light validation, or gluing together small orchestration steps. If your transformation logic would run past Lambda's 15-minute maximum execution time or has to join large datasets, that's a signal to move it out of Lambda.
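The selection logic in this section can be condensed into a short helper. Treat it as a study aid under stated assumptions: the function name and the ordering of the checks are illustrative, and the only hard number is Lambda's 900-second maximum timeout.

```python
LAMBDA_MAX_RUNTIME_S = 900  # Lambda's maximum configurable timeout (15 minutes)


def choose_transform_tool(est_runtime_s: float,
                          joins_large_datasets: bool,
                          needs_cluster_control: bool) -> str:
    """Pick a transformation layer from the three signals discussed above."""
    if needs_cluster_control:
        # Custom libraries, Hive/Presto/HBase, or tuned instance fleets.
        return "EMR"
    if joins_large_datasets or est_runtime_s >= LAMBDA_MAX_RUNTIME_S:
        # Distributed Spark work without managing a cluster.
        return "Glue job"
    # Small, event-driven, per-record or per-object work.
    return "Lambda"


print(choose_transform_tool(5, False, False))
print(choose_transform_tool(3600, True, False))
print(choose_transform_tool(3600, True, True))
```

Real scenarios add cost and team-skill considerations, so the helper captures only the first-pass elimination the exam tends to reward.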


Putting It Together: A Realistic Pipeline

                     ┌──────────────┐
IoT devices ────────►│ Kinesis Data │────► Managed Flink (real-time
(streaming)          │   Streams    │      anomaly scoring)
                     └──────────────┘
                            │
                            ▼
                     Kinesis Firehose ────► S3 Raw Zone (Parquet)
                                                 │
Batch exports from ─────────────────────────────►│
on-prem DB (nightly)                             ▼
                                           Glue Crawler
                                                 │
                                                 ▼
                                         Glue Data Catalog
                                                 │
                                                 ▼
                                    Glue ETL Job + Data Quality
                                                 │
                                                 ▼
                                          S3 Curated Zone ──► Redshift Serverless

This is the shape most DEA-C01 scenario questions gesture toward: a streaming source for immediacy, a batch source for completeness, converging into a catalog-aware transformation layer before landing somewhere queryable.


Exam Focus: What Questions Test From This Step