DEA-C01 Exam Prep: Domains, Scenarios, and Common Traps

AWS Amazon Web Services Associate Step 5 of 5 106 guides · updated 2026


Step 5 — Exam Prep & Scenarios

You’ve now got the pieces: ingestion, storage, operations, governance. This last step is about pulling them together the way the actual exam will — mixed into scenarios where the “correct” service is buried inside a paragraph of business context, not labeled for you. Let’s rehearse that.


How the Exam Weights Its Domains

DEA-C01 is organized into four domains, and they are not weighted evenly. Roughly speaking, expect the exam to lean heaviest on data ingestion/transformation and data store management, with operations and governance still carrying serious weight but slightly less raw volume of questions.

Domain 1: Data Ingestion and Transformation ~ 34% (largest share)
Domain 2: Data Store Management ~ 26% (close second)
Domain 3: Data Operations and Support ~ 22% (solid third)
Domain 4: Data Security and Governance ~ 18% (smallest, still material)

Don’t read “smallest” as “skippable” — governance questions tend to be conceptually dense (Lake Formation grant mechanics, encryption key management) even if there are fewer of them, so they can eat disproportionate study time relative to their question count. Budget accordingly: more repetition on ingestion/storage scenarios since you’ll see more of them, but don’t shortchange governance concepts just because the domain percentage looks smaller.


Worked Scenario 1: The Ambiguous “Real-Time” Requirement

Setup: A retail company wants to track inventory levels across 500 warehouses. Store systems currently batch-upload inventory counts every 4 hours via SFTP to an on-prem server. The business now wants “up-to-date visibility” into stock levels, and warehouse managers say a 15-minute delay would be “much better than what we have now, and totally workable.”

What’s actually being asked: Despite the phrase “up-to-date visibility,” the tolerance stated is 15 minutes — that’s still a batch-shaped problem, just running far more frequently than the current 4-hour cycle. Jumping to Kinesis/MSK here is over-engineering; the requirement doesn’t call for sub-second processing.

Reasonable answer path: Increase the batch frequency, for example to every 10-15 minutes, using a scheduled Glue job or a Step Functions workflow triggered on a schedule. The job reads from an S3 landing location that the warehouses upload to, either by uploading directly to S3 instead of the on-prem SFTP server or by keeping SFTP through an AWS Transfer Family endpoint backed by S3. No streaming infrastructure is required.
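As a rough illustration of the scheduled-batch option, the sketch below builds the arguments for a time-based Glue trigger. The job name and trigger naming scheme are hypothetical, and the lower bound of 5 minutes reflects my understanding of Glue’s scheduling limits rather than anything stated in the scenario. The function only builds the request; the real call is left in a comment because it needs boto3 and AWS credentials.

```python
def build_scheduled_trigger(job_name: str, every_minutes: int = 15) -> dict:
    """Build keyword arguments for glue.create_trigger.

    job_name and the trigger name format are illustrative assumptions.
    """
    # Keep the interval to a value that divides the hour evenly so the
    # cron expression fires at a steady cadence (5, 10, 15, 20, 30 ...).
    if not 5 <= every_minutes < 60 or 60 % every_minutes != 0:
        raise ValueError("use an interval from 5 to 30 minutes that divides 60")
    return {
        "Name": f"{job_name}-every-{every_minutes}-min",
        "Type": "SCHEDULED",
        # AWS cron format: minutes hours day-of-month month day-of-week year
        "Schedule": f"cron(0/{every_minutes} * * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


params = build_scheduled_trigger("inventory-load")  # hypothetical job name

# To create the trigger for real:
# import boto3
# boto3.client("glue").create_trigger(**params)
```

The point of the sketch is how small the change is: one schedule expression, no shards, brokers, or consumers to operate.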

Why this matters for the exam: The trap is pattern-matching “real-time” or “up-to-date” to streaming keywords without reading the actual latency number stated in the scenario. Always find the explicit tolerance number if one exists — it overrides the vibe of the sentence around it.


Worked Scenario 2: The Hot Partition

Setup: A DynamoDB table stores clickstream events with partition key event_type (values: page_view, click, purchase, scroll — four total values) and is experiencing throttling on writes despite the table being provisioned well above its average traffic, with utilization metrics showing most partitions nearly idle.

What’s actually being asked: This is a partition key design flaw, not a capacity problem — throwing more provisioned capacity at it won’t help because the issue is that four values can only ever map to a small number of physical partitions, and page_view almost certainly dominates volume, concentrating nearly all writes onto whichever partition(s) hold that value.

Reasonable answer path: Redesign the partition key to include higher-cardinality data — a composite key like event_type#user_id_hash or event_type#shard_number (a random shard suffix, reassembled at read time via a GSI or application-side fan-out query). This spreads writes across many more physical partitions.
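A minimal sketch of the two key shapes described above, assuming a shard count of 10 (an arbitrary illustrative value; the right number depends on write volume). The function names are hypothetical helpers, not part of any AWS SDK.

```python
import hashlib
import random

SHARD_COUNT = 10  # assumed value for illustration only


def random_shard_key(event_type: str, shard_count: int = SHARD_COUNT) -> str:
    """Write path: append a random shard suffix to spread writes evenly."""
    return f"{event_type}#{random.randrange(shard_count)}"


def stable_shard_key(event_type: str, user_id: str,
                     shard_count: int = SHARD_COUNT) -> str:
    """Write path: derive the suffix from a hash of user_id, so one user's
    events always land under the same key."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{event_type}#{shard}"


def all_shard_keys(event_type: str, shard_count: int = SHARD_COUNT) -> list[str]:
    """Read path: every key the application must query (fan-out) to
    reassemble all events of one type."""
    return [f"{event_type}#{shard}" for shard in range(shard_count)]
```

The trade-off is visible in `all_shard_keys`: writes get cheaper to distribute, but reading everything for one `event_type` now costs one query per shard, which the application merges.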

Why this matters for the exam: DEA-C01 scenario questions about DynamoDB throttling almost always want you to diagnose partition key cardinality before jumping to “just add more capacity” — that’s the wrong lever if the real constraint is key design.


Worked Scenario 3: Choosing Between Lake Formation and Raw IAM

Setup: A company has a data lake with around 200 tables in the Glue Data Catalog, shared across six business units. Each business unit should only see specific tables, and within a shared “customers” table, EU-based analysts should only see EU customer rows. The security team wants this centrally auditable and wants to avoid writing custom application logic to enforce row filtering.

What’s actually being asked: Two mechanisms are needed at once — table-level segregation across business units, and row-level filtering within one shared table by region.

Reasonable answer path: Lake Formation permissions for table-level grants per business-unit IAM role or group, layered with a Lake Formation row-level data filter on the customers table restricting EU analysts to region = 'EU' rows. Tag-based access control (LF-TBAC) is worth mentioning as the scalable approach for granting across 200 tables rather than writing 200 individual grants by hand.
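To make the row-level piece concrete, here is a hedged sketch of the request shapes for a Lake Formation data filter and the grant that uses it. The account ID, database name, filter name, and role ARN are placeholders, and the empty `ColumnWildcard` (meaning "all columns") reflects my reading of the API rather than anything in the scenario. The functions only build dictionaries; the boto3 calls are shown in comments.

```python
def eu_row_filter_request(catalog_id: str, database: str,
                          table: str = "customers",
                          filter_name: str = "eu-customers-only") -> dict:
    """Arguments for lakeformation.create_data_cells_filter."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # Row-level restriction taken from the scenario.
            "RowFilter": {"FilterExpression": "region = 'EU'"},
            # Assumption: an empty wildcard exposes every column.
            "ColumnWildcard": {},
        }
    }


def grant_filter_request(principal_arn: str, filter_request: dict) -> dict:
    """Arguments for lakeformation.grant_permissions on that filter."""
    data = filter_request["TableData"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": data["TableCatalogId"],
                "DatabaseName": data["DatabaseName"],
                "TableName": data["TableName"],
                "Name": data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


# Placeholder identifiers, for illustration only.
filter_req = eu_row_filter_request("111122223333", "sales_lake")
grant_req = grant_filter_request(
    "arn:aws:iam::111122223333:role/eu-analysts", filter_req
)

# import boto3
# lf = boto3.client("lakeformation")
# lf.create_data_cells_filter(**filter_req)
# lf.grant_permissions(**grant_req)
```

Note that nothing here touches application code or query logic: the filter is defined once and enforced by Lake Formation for the engines integrated with it, which is what makes it centrally auditable.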

Why this matters for the exam: This scenario is really testing whether you reach for Lake Formation instead of hand-rolled IAM/S3 policies the moment you see “row-level” or “column-level” language, and whether you know row-level filtering is a Lake Formation data filter, not something you build with a Lambda function checking claims at query time.


Worked Scenario 4: The Silent Glue Bookmark

Setup: A nightly Glue ETL job reads new files from an S3 prefix and loads them into Redshift. After a manual backfill process copied a batch of historical files into the same S3 prefix to fix a data gap, the next scheduled Glue job run completed successfully in the console, but none of the backfilled data appeared in Redshift.

What’s actually being asked: This is describing job bookmark behavior — Glue job bookmarks track which files/objects have already been processed (by path and timestamp heuristics) so a job doesn’t reprocess the same data on every run. A manual backfill drops files into S3, but if their timestamps predate the bookmark’s last recorded position, or if the job simply already considers that prefix “seen,” the job skips them silently — no error, just no new rows.

Reasonable answer path: Reset the job bookmark (or run the job with bookmarks disabled for that one execution) to force a full reprocess of the prefix, or run a one-off job pointed specifically at the backfilled files with bookmarks turned off.
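Both fixes map to small API calls. The sketch below builds the arguments for a one-off run with bookmarks disabled and shows the bookmark reset as a commented call; the job name is hypothetical. One caution that is my own assumption about this pipeline: reprocessing the whole prefix can load rows into Redshift that are already there, so the load step should be idempotent or the run should target only the backfilled files.

```python
def backfill_run_args(job_name: str) -> dict:
    """Arguments for glue.start_job_run that ignore bookmark state
    for this single execution."""
    return {
        "JobName": job_name,
        "Arguments": {
            # Special Glue job parameter; other values are
            # job-bookmark-enable and job-bookmark-pause.
            "--job-bookmark-option": "job-bookmark-disable",
        },
    }


run_args = backfill_run_args("nightly-redshift-load")  # hypothetical job name

# import boto3
# glue = boto3.client("glue")
#
# Option A: one-off run that ignores the bookmark.
# glue.start_job_run(**run_args)
#
# Option B: clear the stored bookmark so the next scheduled run
# starts from scratch.
# glue.reset_job_bookmark(JobName="nightly-redshift-load")
```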

Why this matters for the exam: “Job succeeded, but data is missing downstream” is a recurring exam shape, and job bookmarks are one of the most common quiet culprits — the exam wants you to know this is a feature working as designed, not a bug, and that the fix is an explicit bookmark reset.


Comparing the Scenario Types You’ll See

Scenario flavor | What it’s really testing
“Near real-time” language with a stated tolerance | Batch vs streaming judgment: read the actual number
Throttling/performance despite ample provisioned capacity | Key design (partition key, distribution key, sort key)
Multi-team access with row/column restrictions | Lake Formation permissions, not custom app logic
Job succeeds, data missing downstream | Bookmarks, dedup logic, or silent schema mismatch
“Team already uses X tool/ecosystem” | Service choice driven by existing investment (MSK over Kinesis, MWAA over Step Functions)
Cost complaint with a specific idle-time detail | Serverless/on-demand alternative to a fixed-capacity resource

Study Tips


Common Traps at the Associate Level


Exam Focus: What Questions Test From This Step