Step 2 โ Data Store Management
Youโve got data flowing in. Now where does it actually live? This step is where a lot of candidates lose points, not because the individual services are obscure, but because the exam likes to describe a workload in plain business language and expect you to translate it into โthis needs a columnar warehouseโ or โthis needs single-digit-millisecond key lookupsโ without ever using those words itself. Letโs build that translation skill.
The Core Decision: Five Stores, Five Jobs
S3 โ Cheap, durable object storage. The data lake foundation.Redshift โ Columnar warehouse. Complex analytical SQL over huge datasets.DynamoDB โ Key-value / document NoSQL. Millisecond lookups at scale.RDS โ Relational, transactional (OLTP). Joins, constraints, ACID.OpenSearch โ Full-text search, log analytics, near-real-time dashboards.None of these compete head-to-head as often as new candidates assume โ they solve different access patterns. The exam question usually hides the access pattern in a sentence like โthe application needs to retrieve a customerโs order history by customer ID in under 10 milliseconds,โ which is DynamoDB dressed up in plain English.
| Workload description | Store |
|---|---|
| Ad-hoc SQL joins across billions of rows for BI dashboards | Redshift |
| Raw and semi-structured files retained indefinitely, queried occasionally | S3 (+ Athena) |
| Shopping cart lookups by user ID, single-digit ms latency | DynamoDB |
| Order management system needing multi-table transactions | RDS |
| Searching free-text log messages, faceted search on an e-commerce catalog | OpenSearch |
S3 as the Data Lake Foundation
S3 underpins almost every data lake architecture on this exam, so the details of storage classes and lifecycle rules matter more here than they might on a general AWS exam, because a data engineer is expected to actually tune the cost/access tradeoff, not just recognize the tier names.
S3 Standard โ hot data, frequent access, ms latencyS3 Intelligent-Tiering โ unknown/changing access patterns, auto-moves objectsS3 Standard-IA โ infrequent access, still millisecond retrievalS3 One Zone-IA โ same as Standard-IA, single AZ, cheaper, less durableS3 Glacier Instant โ archive tier, instant retrieval, rarely touchedS3 Glacier Flexible โ archive, minutes-to-hours retrievalS3 Glacier Deep Archive โ compliance retention, 12-hour retrieval, lowest costFor a data lake, the pattern that shows up constantly is: land raw data in Standard, transition it after 30-60 days of inactivity into Intelligent-Tiering or Standard-IA, and push anything beyond a defined retention window into Glacier tiers for compliance. A lifecycle policy automates the whole thing:
Raw Zone (S3 Standard) โ after 30 days of no access โผStandard-IA โ after 180 days โผGlacier Flexible Retrieval โ after 3 years (regulatory hold) โผGlacier Deep ArchivePartitioning Strategy
Partitioning is arguably the single highest-leverage decision in a data lake, because it directly controls how much data Athena, Redshift Spectrum, or a Glue job has to scan per query. The standard approach is Hive-style partitioning by a low-cardinality, frequently-filtered column โ almost always some form of date:
s3://data-lake/curated/orders/ year=2026/month=01/day=15/part-0000.parquet year=2026/month=01/day=16/part-0000.parquet year=2026/month=02/day=01/part-0000.parquetTwo failure modes the exam likes to probe:
- Over-partitioning โ partitioning by a high-cardinality field (like customer_id) creates millions of tiny partitions, each with a handful of small files. Query planners choke on the metadata overhead, and you end up with a โsmall files problem.โ
- Under-partitioning โ no partitioning at all, or partitioning only by year, forces a full or near-full table scan for queries that only need one dayโs data.
The fix for small files, once youโre stuck with them, is usually a compaction job (Glue job that reads many small Parquet files and rewrites them as fewer, larger ones โ typically targeting a few hundred MB to a few GB per file for efficient scan performance).
Redshift: Distribution and Sort Keys
Redshift is a columnar, MPP (massively parallel processing) database, and its performance hinges on two decisions that donโt have a direct equivalent in RDS: distribution style and sort keys.
Distribution Styles
KEY โ Rows distributed across nodes by hash of a chosen column. Good when that column is used heavily in joins (co-locates matching rows on the same node, avoiding network shuffle).
ALL โ Full copy of the table on every node. Good for small, frequently-joined dimension tables (a few million rows or less).
EVEN โ Round-robin distribution. Good default when there's no single obvious join column, or for staging tables.
AUTO โ Redshift picks EVEN or ALL based on table size, and can switch as the table grows. The default and generally the right starting point unless you have a specific reason to override it.The classic exam scenario: a large fact table joined constantly to a large dimension table on customer_id โ thatโs a strong KEY distribution candidate on customer_id for both tables, so matching rows sit on the same node and the join doesnโt need to shuffle data across the network.
Sort Keys
Sort keys determine the physical order data is stored in on disk, which lets Redshift skip blocks entirely for range-restricted queries (this is called zone map pruning).
- Compound sort key โ columns sorted in the order listed, most useful when queries filter on a prefix of those columns (e.g., sort by
date, regionwhen most queries filter by date first). - Interleaved sort key โ gives equal weight to each column in the key, useful when queries filter unpredictably across several columns, at the cost of slower load/vacuum operations.
A table with a date sort key and queries that always filter on a recent date range will scan a fraction of the blocks a similarly sized unsorted table would.
Redshift Serverless
Redshift Serverless removes cluster sizing from the equation entirely โ you set Redshift Processing Units (RPUs) as a capacity range, and it scales compute up and down (including to near-zero for idle periods) based on query demand. This has become the default recommendation for workloads with unpredictable or intermittent query patterns, since you stop paying for an always-on cluster sized for peak load. Provisioned clusters still make sense for steady, predictable, high-utilization workloads where reserved pricing beats on-demand RPU billing.
DynamoDB for Analytics-Adjacent Workloads
DynamoDB isnโt a data warehouse, but data engineers hit it constantly as a serving layer โ the fast lookup table that sits downstream of an analytical pipeline, feeding an application or API.
Capacity Modes
Provisioned โ You define Read/Write Capacity Units (RCU/WCU). Cheaper at steady, predictable, high throughput. Pair with auto-scaling for gradual traffic shifts.
On-Demand โ Pay per request, no capacity planning. Better for spiky, unpredictable, or new workloads where you don't yet know the traffic shape.Partition Key Design
The partition key determines how DynamoDB spreads data (and therefore throughput) across physical partitions. A poor choice creates a hot partition โ one partition absorbing disproportionate traffic while others sit idle, capping your effective throughput regardless of overall table capacity.
For analytics-feeding tables, a common anti-pattern is using a low-cardinality key like event_type (only a handful of distinct values) when the table needs high write throughput โ every write for the same type lands on the same partition. The fix is usually a composite or write-sharded key, e.g., event_type#shard_id, spreading writes across multiple partitions while still allowing query-time reconstruction of the full set.
| Symptom | Likely cause | Fix |
|---|---|---|
| Throttling despite low overall table usage | Hot partition from low-cardinality key | Add a shard suffix or pick a higher-cardinality key |
| Read costs spiking on one access pattern | No GSI matching that query shape | Add a Global Secondary Index |
| Need to run ad-hoc analytical queries | DynamoDB isnโt built for scans/joins | Export via DynamoDB Streams or S3 export to Redshift/S3 for analysis |
Exam Focus: What Questions Test From This Step
- Mapping a plain-English workload description to the right store (Redshift vs DynamoDB vs RDS vs OpenSearch)
- S3 storage class selection and lifecycle transition ordering
- Diagnosing and fixing the โsmall files problemโ through partitioning and compaction
- Redshift distribution style choice (KEY vs ALL vs EVEN vs AUTO) based on join patterns
- Compound vs interleaved sort keys and when each helps query performance
- When Redshift Serverless beats a provisioned cluster (intermittent vs steady workloads)
- DynamoDB partition key design and diagnosing hot partitions
- Provisioned vs on-demand DynamoDB capacity mode selection