Cloud/ AWS / AWS Certified Data Engineer โ€” Associate (DEA-C01) / DEA-C01 Data Store Management: S3, Redshift, DynamoDB Guide

AWS Amazon Web Services Associate Step 2 of 5 106 guides ยท updated 2026

Hands-on guides to compute, storage, databases, networking, and serverless on the world's most widely adopted cloud platform.

Step 2 โ€” Data Store Management

Youโ€™ve got data flowing in. Now where does it actually live? This step is where a lot of candidates lose points, not because the individual services are obscure, but because the exam likes to describe a workload in plain business language and expect you to translate it into โ€œthis needs a columnar warehouseโ€ or โ€œthis needs single-digit-millisecond key lookupsโ€ without ever using those words itself. Letโ€™s build that translation skill.


The Core Decision: Five Stores, Five Jobs

S3 โ†’ Cheap, durable object storage. The data lake foundation.
Redshift โ†’ Columnar warehouse. Complex analytical SQL over huge datasets.
DynamoDB โ†’ Key-value / document NoSQL. Millisecond lookups at scale.
RDS โ†’ Relational, transactional (OLTP). Joins, constraints, ACID.
OpenSearch โ†’ Full-text search, log analytics, near-real-time dashboards.

None of these compete head-to-head as often as new candidates assume โ€” they solve different access patterns. The exam question usually hides the access pattern in a sentence like โ€œthe application needs to retrieve a customerโ€™s order history by customer ID in under 10 milliseconds,โ€ which is DynamoDB dressed up in plain English.

Workload descriptionStore
Ad-hoc SQL joins across billions of rows for BI dashboardsRedshift
Raw and semi-structured files retained indefinitely, queried occasionallyS3 (+ Athena)
Shopping cart lookups by user ID, single-digit ms latencyDynamoDB
Order management system needing multi-table transactionsRDS
Searching free-text log messages, faceted search on an e-commerce catalogOpenSearch

S3 as the Data Lake Foundation

S3 underpins almost every data lake architecture on this exam, so the details of storage classes and lifecycle rules matter more here than they might on a general AWS exam, because a data engineer is expected to actually tune the cost/access tradeoff, not just recognize the tier names.

S3 Standard โ€” hot data, frequent access, ms latency
S3 Intelligent-Tiering โ€” unknown/changing access patterns, auto-moves objects
S3 Standard-IA โ€” infrequent access, still millisecond retrieval
S3 One Zone-IA โ€” same as Standard-IA, single AZ, cheaper, less durable
S3 Glacier Instant โ€” archive tier, instant retrieval, rarely touched
S3 Glacier Flexible โ€” archive, minutes-to-hours retrieval
S3 Glacier Deep Archive โ€” compliance retention, 12-hour retrieval, lowest cost

For a data lake, the pattern that shows up constantly is: land raw data in Standard, transition it after 30-60 days of inactivity into Intelligent-Tiering or Standard-IA, and push anything beyond a defined retention window into Glacier tiers for compliance. A lifecycle policy automates the whole thing:

Raw Zone (S3 Standard)
โ”‚ after 30 days of no access
โ–ผ
Standard-IA
โ”‚ after 180 days
โ–ผ
Glacier Flexible Retrieval
โ”‚ after 3 years (regulatory hold)
โ–ผ
Glacier Deep Archive

Partitioning Strategy

Partitioning is arguably the single highest-leverage decision in a data lake, because it directly controls how much data Athena, Redshift Spectrum, or a Glue job has to scan per query. The standard approach is Hive-style partitioning by a low-cardinality, frequently-filtered column โ€” almost always some form of date:

s3://data-lake/curated/orders/
year=2026/month=01/day=15/part-0000.parquet
year=2026/month=01/day=16/part-0000.parquet
year=2026/month=02/day=01/part-0000.parquet

Two failure modes the exam likes to probe:

  1. Over-partitioning โ€” partitioning by a high-cardinality field (like customer_id) creates millions of tiny partitions, each with a handful of small files. Query planners choke on the metadata overhead, and you end up with a โ€œsmall files problem.โ€
  2. Under-partitioning โ€” no partitioning at all, or partitioning only by year, forces a full or near-full table scan for queries that only need one dayโ€™s data.

The fix for small files, once youโ€™re stuck with them, is usually a compaction job (Glue job that reads many small Parquet files and rewrites them as fewer, larger ones โ€” typically targeting a few hundred MB to a few GB per file for efficient scan performance).


Redshift: Distribution and Sort Keys

Redshift is a columnar, MPP (massively parallel processing) database, and its performance hinges on two decisions that donโ€™t have a direct equivalent in RDS: distribution style and sort keys.

Distribution Styles

KEY โ€” Rows distributed across nodes by hash of a chosen column.
Good when that column is used heavily in joins (co-locates
matching rows on the same node, avoiding network shuffle).
ALL โ€” Full copy of the table on every node. Good for small,
frequently-joined dimension tables (a few million rows or less).
EVEN โ€” Round-robin distribution. Good default when there's no
single obvious join column, or for staging tables.
AUTO โ€” Redshift picks EVEN or ALL based on table size, and can
switch as the table grows. The default and generally the
right starting point unless you have a specific reason
to override it.

The classic exam scenario: a large fact table joined constantly to a large dimension table on customer_id โ€” thatโ€™s a strong KEY distribution candidate on customer_id for both tables, so matching rows sit on the same node and the join doesnโ€™t need to shuffle data across the network.

Sort Keys

Sort keys determine the physical order data is stored in on disk, which lets Redshift skip blocks entirely for range-restricted queries (this is called zone map pruning).

A table with a date sort key and queries that always filter on a recent date range will scan a fraction of the blocks a similarly sized unsorted table would.

Redshift Serverless

Redshift Serverless removes cluster sizing from the equation entirely โ€” you set Redshift Processing Units (RPUs) as a capacity range, and it scales compute up and down (including to near-zero for idle periods) based on query demand. This has become the default recommendation for workloads with unpredictable or intermittent query patterns, since you stop paying for an always-on cluster sized for peak load. Provisioned clusters still make sense for steady, predictable, high-utilization workloads where reserved pricing beats on-demand RPU billing.


DynamoDB for Analytics-Adjacent Workloads

DynamoDB isnโ€™t a data warehouse, but data engineers hit it constantly as a serving layer โ€” the fast lookup table that sits downstream of an analytical pipeline, feeding an application or API.

Capacity Modes

Provisioned โ€” You define Read/Write Capacity Units (RCU/WCU).
Cheaper at steady, predictable, high throughput.
Pair with auto-scaling for gradual traffic shifts.
On-Demand โ€” Pay per request, no capacity planning.
Better for spiky, unpredictable, or new workloads
where you don't yet know the traffic shape.

Partition Key Design

The partition key determines how DynamoDB spreads data (and therefore throughput) across physical partitions. A poor choice creates a hot partition โ€” one partition absorbing disproportionate traffic while others sit idle, capping your effective throughput regardless of overall table capacity.

For analytics-feeding tables, a common anti-pattern is using a low-cardinality key like event_type (only a handful of distinct values) when the table needs high write throughput โ€” every write for the same type lands on the same partition. The fix is usually a composite or write-sharded key, e.g., event_type#shard_id, spreading writes across multiple partitions while still allowing query-time reconstruction of the full set.

SymptomLikely causeFix
Throttling despite low overall table usageHot partition from low-cardinality keyAdd a shard suffix or pick a higher-cardinality key
Read costs spiking on one access patternNo GSI matching that query shapeAdd a Global Secondary Index
Need to run ad-hoc analytical queriesDynamoDB isnโ€™t built for scans/joinsExport via DynamoDB Streams or S3 export to Redshift/S3 for analysis

Exam Focus: What Questions Test From This Step