Step 5 — Exam Prep & Scenarios
You’ve now got the pieces: ingestion, storage, operations, governance. This last step is about pulling them together the way the actual exam will — mixed into scenarios where the “correct” service is buried inside a paragraph of business context, not labeled for you. Let’s rehearse that.
How the Exam Weights Its Domains
DEA-C01 is organized into four domains, and they are not weighted evenly. Roughly speaking, expect the exam to lean heaviest on data ingestion/transformation and data store management, with operations and governance still carrying serious weight but slightly less raw volume of questions.
- Domain 1: Data Ingestion and Transformation ~ largest share
- Domain 2: Data Store Management ~ close second
- Domain 3: Data Operations and Support ~ solid third
- Domain 4: Data Security and Governance ~ smallest, still material

Don’t read “smallest” as “skippable.” Governance questions tend to be conceptually dense (Lake Formation grant mechanics, encryption key management) even if there are fewer of them, so they can eat disproportionate study time relative to their question count. Budget accordingly: more repetition on ingestion/storage scenarios since you’ll see more of them, but don’t shortchange governance concepts just because the domain percentage looks smaller.
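To make the budgeting advice concrete, here is a rough sketch that splits a study budget by domain weight while reserving a minimum number of hours per domain, so governance is not starved. The percentages (34/26/22/18) are an assumption based on the published exam guide at the time of writing; check the current guide before relying on them. The function name and the floor value are illustrative only.

```python
# Assumed domain weights in percent; verify against the current official exam guide.
DOMAIN_WEIGHTS = {
    "Data Ingestion and Transformation": 34,
    "Data Store Management": 26,
    "Data Operations and Support": 22,
    "Data Security and Governance": 18,
}


def allocate_study_hours(total_hours, weights=DOMAIN_WEIGHTS, floor_hours=0.0):
    """Reserve floor_hours for every domain, then split the rest by weight."""
    reserved = floor_hours * len(weights)
    if reserved > total_hours:
        raise ValueError("floor_hours is too large for the total budget")
    remaining = total_hours - reserved
    total_weight = sum(weights.values())
    return {
        name: floor_hours + remaining * weight / total_weight
        for name, weight in weights.items()
    }


if __name__ == "__main__":
    # Hypothetical budget: 40 hours total, at least 4 hours per domain.
    for domain, hours in allocate_study_hours(40, floor_hours=4).items():
        print(f"{domain}: {hours:.1f} h")
```

The floor is the design choice worth noticing: a purely proportional split would give governance the fewest hours, which is exactly the mistake the paragraph above warns against.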
Worked Scenario 1: The Ambiguous “Real-Time” Requirement
Setup: A retail company wants to track inventory levels across 500 warehouses. Store systems currently batch-upload inventory counts every 4 hours via SFTP to an on-prem server. The business now wants “up-to-date visibility” into stock levels, and warehouse managers say a 15-minute delay would be “much better than what we have now, and totally workable.”
What’s actually being asked: Despite the phrase “up-to-date visibility,” the tolerance stated is 15 minutes — that’s still a batch-shaped problem, just running far more frequently than the current 4-hour cycle. Jumping to Kinesis/MSK here is over-engineering; the requirement doesn’t call for sub-second processing.
Reasonable answer path: Increase batch frequency — perhaps every 10-15 minutes via a scheduled Glue job or a Step Functions workflow triggered on a schedule, reading from an S3 landing location the warehouses upload to (replacing SFTP with S3 transfer). No streaming infrastructure required.
Why this matters for the exam: The trap is pattern-matching “real-time” or “up-to-date” to streaming keywords without reading the actual latency number stated in the scenario. Always find the explicit tolerance number if one exists — it overrides the vibe of the sentence around it.
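The "find the number first" habit can be sketched as a tiny decision helper. The cut-off values below are hypothetical and chosen only for illustration; the exam does not publish fixed thresholds, and real designs weigh cost and existing tooling as well. The point is that the function takes a stated tolerance in seconds, not an adjective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionChoice:
    mode: str
    reason: str


# Hypothetical cut-offs for illustration only.
STREAMING_BELOW_SECONDS = 60          # under a minute: streaming is plausible
FREQUENT_BATCH_BELOW_SECONDS = 3600   # under an hour: scheduled micro-batches


def choose_ingestion_mode(tolerance_seconds: float) -> IngestionChoice:
    """Map an explicit latency tolerance to a batch or streaming starting point."""
    if tolerance_seconds <= 0:
        raise ValueError("tolerance_seconds must be positive")
    if tolerance_seconds < STREAMING_BELOW_SECONDS:
        return IngestionChoice("streaming", "tolerance is below one minute")
    if tolerance_seconds < FREQUENT_BATCH_BELOW_SECONDS:
        return IngestionChoice("frequent batch", "minutes of delay are acceptable")
    return IngestionChoice("batch", "hours of delay are acceptable")


if __name__ == "__main__":
    # The warehouse scenario: a 15-minute delay is explicitly acceptable.
    print(choose_ingestion_mode(15 * 60))
```

Under these assumed thresholds, the 15-minute tolerance from the warehouse scenario lands in the frequent-batch bucket, matching the answer path above.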
Worked Scenario 2: The Hot Partition
Setup: A DynamoDB table stores clickstream events with partition key event_type (values: page_view, click, purchase, scroll — four total values) and is experiencing throttling on writes despite the table being provisioned well above its average traffic, with utilization metrics showing most partitions nearly idle.
What’s actually being asked: This is a partition key design flaw, not a capacity problem — throwing more provisioned capacity at it won’t help because the issue is that four values can only ever map to a small number of physical partitions, and page_view almost certainly dominates volume, concentrating nearly all writes onto whichever partition(s) hold that value.
Reasonable answer path: Redesign the partition key to include higher-cardinality data — a composite key like event_type#user_id_hash or event_type#shard_number (a random shard suffix, reassembled at read time via a GSI or application-side fan-out query). This spreads writes across many more physical partitions.
Why this matters for the exam: DEA-C01 scenario questions about DynamoDB throttling almost always want you to diagnose partition key cardinality before jumping to “just add more capacity” — that’s the wrong lever if the real constraint is key design.
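The shard-suffix idea from this scenario can be sketched in a few lines. This is a minimal illustration, not a production pattern: the shard count of 10 and the function names are assumptions, and a real design would size the shard count from write volume. Writers pick a random suffix; readers must query every suffix and merge the results.

```python
import random
from collections import Counter

SHARD_COUNT = 10  # hypothetical; size this from expected write throughput


def sharded_partition_key(event_type, shard_count=SHARD_COUNT, rng=random):
    """Build a write key such as 'page_view#7' using a random shard suffix."""
    shard = rng.randrange(shard_count)
    return f"{event_type}#{shard}"


def keys_to_query(event_type, shard_count=SHARD_COUNT):
    """List every partition key a reader must query to see all events of a type."""
    return [f"{event_type}#{shard}" for shard in range(shard_count)]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the demo is repeatable
    writes = Counter(
        sharded_partition_key("page_view", rng=rng) for _ in range(10_000)
    )
    # Instead of one hot key, the writes now spread over SHARD_COUNT keys.
    for key in sorted(writes):
        print(key, writes[key])
```

The trade-off is on the read side: every query for one event type becomes `shard_count` queries, which is the application-side fan-out mentioned in the answer path.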
Worked Scenario 3: Choosing Between Lake Formation and Raw IAM
Setup: A company has a data lake with around 200 tables in the Glue Data Catalog, shared across six business units. Each business unit should only see specific tables, and within a shared “customers” table, EU-based analysts should only see EU customer rows. The security team wants this centrally auditable and wants to avoid writing custom application logic to enforce row filtering.
What’s actually being asked: Two mechanisms are needed at once — table-level segregation across business units, and row-level filtering within one shared table by region.
Reasonable answer path: Lake Formation permissions for table-level grants per business-unit IAM role or group, layered with a Lake Formation row-level data filter on the customers table restricting EU analysts to region = 'EU' rows. Tag-based access control (LF-TBAC) is worth mentioning as the scalable approach for granting across 200 tables rather than writing 200 individual grants by hand.
Why this matters for the exam: This scenario is really testing whether you reach for Lake Formation instead of hand-rolled IAM/S3 policies the moment you see “row-level” or “column-level” language, and whether you know row-level filtering is a Lake Formation data filter, not something you build with a Lambda function checking claims at query time.
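For a sense of what the row-level filter looks like outside the console, here is a sketch using the boto3 Lake Formation client's `create_data_cells_filter` and `grant_permissions` calls. The account ID, database, table, filter name, and role ARN are all hypothetical placeholders, and the code assumes boto3 is installed and credentials with Lake Formation admin rights are configured. The request payloads are built by plain functions so they can be inspected without calling AWS.

```python
def build_row_filter_request(catalog_id, database, table, name="eu_customers_only"):
    """Payload for create_data_cells_filter: all columns, EU rows only."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "RowFilter": {"FilterExpression": "region = 'EU'"},
            "ColumnWildcard": {},  # empty wildcard means no columns are excluded
        }
    }


def build_grant_request(filter_request, principal_arn):
    """Payload for grant_permissions: SELECT through the filter, not the raw table."""
    table_data = filter_request["TableData"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_filter_and_grant(filter_request, grant_request):
    """Send both requests to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without boto3

    client = boto3.client("lakeformation")
    client.create_data_cells_filter(**filter_request)
    client.grant_permissions(**grant_request)


if __name__ == "__main__":
    # Hypothetical identifiers for illustration only.
    flt = build_row_filter_request("111122223333", "lake_db", "customers")
    grant = build_grant_request(flt, "arn:aws:iam::111122223333:role/eu-analyst")
    print(flt)
    print(grant)
```

Note that the grant targets the data filter rather than the table itself, which is what keeps EU analysts from reading non-EU rows.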
Worked Scenario 4: The Silent Glue Bookmark
Setup: A nightly Glue ETL job reads new files from an S3 prefix and loads them into Redshift. After a manual backfill process copied a batch of historical files into the same S3 prefix to fix a data gap, the next scheduled Glue job run completed successfully in the console, but none of the backfilled data appeared in Redshift.
What’s actually being asked: This is describing job bookmark behavior. Glue job bookmarks persist state between runs so a job doesn’t reprocess the same data every time; for S3 sources they rely mainly on object last-modified timestamps. A manual backfill drops files into S3, but if their timestamps predate the bookmark’s last recorded position, or if the job simply already considers that prefix “seen,” the job skips them silently: no error, just no new rows.
Reasonable answer path: Reset the job bookmark (or run the job with bookmarks disabled for that one execution) to force a full reprocess of the prefix, or run a one-off job pointed specifically at the backfilled files with bookmarks turned off.
Why this matters for the exam: “Job succeeded, but data is missing downstream” is a recurring exam shape, and job bookmarks are one of the most common quiet culprits — the exam wants you to know this is a feature working as designed, not a bug, and that the fix is an explicit bookmark reset.
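Both fixes from the answer path map to boto3 Glue client calls: `reset_job_bookmark` clears the stored state, and `start_job_run` accepts the `--job-bookmark-option` job argument to disable bookmarks for a single run. The sketch below assumes boto3 is installed and credentials are configured; the job name is a hypothetical placeholder.

```python
def one_off_backfill_arguments():
    """Job arguments that disable bookmarks for a single run."""
    return {"--job-bookmark-option": "job-bookmark-disable"}


def rerun_without_bookmark(job_name):
    """Start one run that ignores the bookmark; the stored bookmark is untouched."""
    import boto3  # imported lazily so the argument builder works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=one_off_backfill_arguments(),
    )
    return response["JobRunId"]


def reset_bookmark(job_name):
    """Clear the bookmark so the next scheduled run reprocesses everything."""
    import boto3

    boto3.client("glue").reset_job_bookmark(JobName=job_name)


if __name__ == "__main__":
    # "nightly-s3-to-redshift" is a hypothetical job name.
    print(one_off_backfill_arguments())
```

Either path reprocesses files the job has already loaded, so the Redshift load step needs to be idempotent (for example, a staged merge) or the rerun will create duplicates.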
Comparing the Scenario Types You’ll See
| Scenario flavor | What it’s really testing |
|---|---|
| “Near real-time” language with a stated tolerance | Batch vs streaming judgment: read the actual number |
| Throttling/performance despite ample provisioned capacity | Key design (partition key, distribution key, sort key) |
| Multi-team access with row/column restrictions | Lake Formation permissions, not custom app logic |
| Job succeeds, data missing downstream | Bookmarks, dedup logic, or silent schema mismatch |
| “Team already uses X tool/ecosystem” | Service choice driven by existing investment (MSK over Kinesis, MWAA over Step Functions) |
| Cost complaint with a specific idle-time detail | Serverless/on-demand alternative to a fixed-capacity resource |
Study Tips
- Build the pipeline in your head, end to end, for every scenario — ingestion, storage, transformation, orchestration, security — before jumping to an answer. Half the wrong answers on this exam are technically valid services that solve the wrong stage of the pipeline.
- Practice reading past the adjectives. Words like “real-time,” “massive scale,” “highly secure” are scenario flavor text; the actual requirement is usually a specific number, a specific access pattern, or a specific existing constraint (an existing Kafka investment, a compliance rule, a stated latency budget).
- Know the “why,” not just the name. You won’t just be asked “what is Lake Formation” — you’ll be asked to recognize a row-level access requirement and produce the mechanism from memory.
- Drill the failure-mode vocabulary: hot partition, data skew, small files problem, job bookmark, iterator age, OOM kill. These specific terms are how the exam signals which troubleshooting scenario you’re in.
- Don’t neglect cost-shaped questions. A meaningful slice of operations questions are really cost questions wearing a performance costume — “the pipeline works but the bill is too high” almost always has a serverless, Spot, or lifecycle-policy answer.
Common Traps at the Associate Level
- Reaching for streaming architecture (Kinesis/MSK) whenever a scenario says “real-time,” without checking the actual stated latency tolerance.
- Treating DynamoDB throttling as a capacity problem first, key design problem second — it’s almost always the reverse.
- Forgetting that Glue Crawlers only catalog data; they never transform or move it, and won’t fix a schema conflict automatically.
- Assuming SSE-S3 and SSE-KMS are interchangeable because “both encrypt the data” — missing the audit trail and key control distinction the exam tests directly.
- Overlooking that row-level and column-level security in a lake are Lake Formation features, not something to build with custom application code or IAM condition keys alone.
- Confusing Multi-AZ-style high availability concepts (borrowed from other AWS exams) with data engineering concerns like idempotency and exactly-once processing, which are different problems entirely.
- Picking EMR by default for “big data” scenarios when a serverless Glue job would satisfy the requirement with less operational overhead — EMR is the right answer when the scenario specifically needs cluster-level control or an ecosystem tool Glue doesn’t support.
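The SSE-S3 versus SSE-KMS trap in the list above is easier to remember once you see how little the upload call differs. This sketch builds the encryption parameters for the boto3 S3 client's `put_object`; the bucket, key, and KMS key identifier are hypothetical, and the upload function assumes boto3 and credentials are available. With SSE-KMS, use of the key is recorded as KMS API activity in CloudTrail and governed by the key policy, which is the audit and control distinction the exam cares about.

```python
def put_object_encryption_args(kms_key_id=None):
    """Return put_object encryption parameters: SSE-S3 by default, SSE-KMS if a key is given."""
    if kms_key_id is None:
        # SSE-S3: S3-managed keys, no per-key policy or key-usage audit trail.
        return {"ServerSideEncryption": "AES256"}
    # SSE-KMS: customer-controlled key, access governed by the KMS key policy.
    return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}


def upload_encrypted(bucket, key, body, kms_key_id=None):
    """Upload an object with the chosen encryption mode (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        **put_object_encryption_args(kms_key_id),
    )


if __name__ == "__main__":
    print(put_object_encryption_args())
    # "alias/lake-data-key" is a hypothetical KMS key alias.
    print(put_object_encryption_args("alias/lake-data-key"))
```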
Exam Focus: What Questions Test From This Step
- Time-boxing study effort across domains without under-preparing for the lower-weighted governance domain
- Extracting the real, numeric requirement from scenario language dressed in urgency-sounding adjectives
- Working through multi-step scenarios that combine ingestion, storage, and governance decisions in a single question
- Recognizing recurring failure-mode vocabulary (hot partition, bookmark, iterator age, OOM, small files) as diagnostic shortcuts
- Distinguishing genuine architecture requirements from existing-tooling constraints (e.g., “team already runs Kafka” implies MSK)
- Avoiding default answers (streaming for anything urgent-sounding, EMR for anything “big”) without checking the specific requirement