Step 2 — Data Store Management

You’ve got data flowing in. Now where does it actually live? This step is where a lot of candidates lose points. The individual services aren’t obscure; the trouble is that the exam describes a workload in plain business language and expects you to translate it into “this needs a columnar warehouse” or “this needs single-digit-millisecond key lookups” without ever using those words itself. Let’s build that translation skill.

The Core Decision: Five Stores, Five Jobs

S3            → Cheap, durable object storage. The data lake foundation.
Redshift      → Columnar warehouse. Complex analytical SQL over huge datasets.
DynamoDB      → Key-value / document NoSQL. Millisecond lookups at scale.
RDS           → Relational, transactional (OLTP). Joins, constraints, ACID.
OpenSearch    → Full-text search, log analytics, near-real-time dashboards.

None of these compete head-to-head as often as new candidates assume — they solve different access patterns. The exam question usually hides the access pattern in a sentence like “the application needs to retrieve a customer’s order history by customer ID in under 10 milliseconds,” which is DynamoDB dressed up in plain English.

Workload description	Store
Ad-hoc SQL joins across billions of rows for BI dashboards	Redshift
Raw and semi-structured files retained indefinitely, queried occasionally	S3 (+ Athena)
Shopping cart lookups by user ID, single-digit ms latency	DynamoDB
Order management system needing multi-table transactions	RDS
Searching free-text log messages, faceted search on an e-commerce catalog	OpenSearch
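One way to drill the table above is to treat it as a rule lookup. The sketch below is only an illustration: the Workload class, its field names, and the order of the checks are assumptions made for this example, not an official decision tree.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical flags describing the access pattern hidden in a question.
    needs_full_text_search: bool = False
    needs_multi_table_transactions: bool = False
    key_lookup_low_latency: bool = False
    analytical_sql_at_scale: bool = False


def pick_store(w: Workload) -> str:
    """Map an access pattern to one of the five stores from the table."""
    if w.needs_full_text_search:
        return "OpenSearch"
    if w.needs_multi_table_transactions:
        return "RDS"
    if w.key_lookup_low_latency:
        return "DynamoDB"
    if w.analytical_sql_at_scale:
        return "Redshift"
    # Nothing special required: cheap durable files, queried occasionally.
    return "S3 (+ Athena)"


# "Retrieve order history by customer ID in under 10 ms"
print(pick_store(Workload(key_lookup_low_latency=True)))
```

The useful habit is the first step, not the function: strip the business wording down to the access pattern, then match.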

S3 as the Data Lake Foundation

S3 underpins almost every data lake architecture on this exam, so storage classes and lifecycle rules matter more here than they might on a general AWS exam. A data engineer is expected to actually tune the cost/access tradeoff, not just recognize the tier names.

S3 Standard             - hot data, frequent access, millisecond latency
S3 Intelligent-Tiering  - unknown/changing access patterns, auto-moves objects between access tiers
S3 Standard-IA          - infrequent access, still millisecond retrieval
S3 One Zone-IA          - like Standard-IA but stored in a single AZ; cheaper, lower availability, data lost if that AZ is lost
S3 Glacier Instant      - archive tier, millisecond retrieval, rarely touched
S3 Glacier Flexible     - archive, minutes-to-hours retrieval
S3 Glacier Deep Archive - compliance retention, retrieval within 12 hours (standard) or 48 hours (bulk), lowest cost

For a data lake, the pattern that shows up constantly is: land raw data in Standard, transition it to Standard-IA once objects are 30-60 days old, and push anything beyond a defined retention window into Glacier tiers for compliance. Lifecycle rules act on object age, not on last access; if you want objects moved based on how often they are actually read, that is the job of Intelligent-Tiering. A lifecycle policy automates the whole thing:

Raw Zone (S3 Standard)
   │  30 days after object creation
   ▼
Standard-IA
   │  180 days after creation
   ▼
Glacier Flexible Retrieval
   │  3 years after creation (regulatory hold)
   ▼
Glacier Deep Archive
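The diagram above could be expressed as a lifecycle rule roughly like this. It is a minimal sketch: the rule ID, the raw/ prefix, and the bucket name in the comment are assumptions, and the dictionary follows the shape that boto3's put_bucket_lifecycle_configuration expects. Days is counted from object creation.

```python
def build_lifecycle_rule(prefix: str) -> dict:
    """Build one lifecycle rule mirroring the three-step diagram."""
    return {
        "ID": "raw-zone-tiering",          # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 180, "StorageClass": "GLACIER"},        # Glacier Flexible Retrieval
            {"Days": 1095, "StorageClass": "DEEP_ARCHIVE"},  # about 3 years
        ],
    }


def transitions_are_ordered(rule: dict) -> bool:
    """Sanity check: each transition must happen later than the previous one."""
    days = [t["Days"] for t in rule["Transitions"]]
    return days == sorted(days) and len(set(days)) == len(days)


rule = build_lifecycle_rule("raw/")
print(transitions_are_ordered(rule))

# Applying it would look like this (needs boto3 and AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",  # hypothetical bucket name
#     LifecycleConfiguration={"Rules": [rule]},
# )
```

The ordering check reflects what exam questions tend to probe: transitions only move toward colder tiers, and each step needs a later day count than the one before it.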

Partitioning Strategy

Partitioning is arguably the single highest-leverage decision in a data lake, because it directly controls how much data Athena, Redshift Spectrum, or a Glue job has to scan per query. The standard approach is Hive-style partitioning by a low-cardinality, frequently-filtered column — almost always some form of date:

s3://data-lake/curated/orders/
    year=2026/month=01/day=15/part-0000.parquet
    year=2026/month=01/day=16/part-0000.parquet
    year=2026/month=02/day=01/part-0000.parquet
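A writer producing that layout only needs to derive the prefix from the record's date. The helper names below are made up for this sketch; the path format is the one shown above.

```python
from datetime import date, timedelta


def partition_prefix(base: str, d: date) -> str:
    """Build the Hive-style year=/month=/day= prefix for one day."""
    return f"{base.rstrip('/')}/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"


def prefixes_for_range(base: str, start: date, end: date) -> list[str]:
    """List every daily prefix a query over [start, end] has to read."""
    days = (end - start).days + 1
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days)]


base = "s3://data-lake/curated/orders"
print(partition_prefix(base, date(2026, 1, 15)))
# s3://data-lake/curated/orders/year=2026/month=01/day=15/
print(len(prefixes_for_range(base, date(2026, 1, 15), date(2026, 1, 16))))
# 2
```

The second function is the point of partitioning in miniature: a two-day query touches two prefixes, no matter how many years of data sit under the table.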

Two failure modes the exam likes to probe:

Over-partitioning — partitioning by a high-cardinality field (like customer_id) creates millions of tiny partitions, each with a handful of small files. Query planners choke on the metadata overhead, and you end up with a “small files problem.”
Under-partitioning — no partitioning at all, or partitioning only by year, forces a full or near-full table scan for queries that only need one day’s data.

The fix for small files, once you’re stuck with them, is usually a compaction job: a Glue job that reads many small Parquet files and rewrites them as fewer, larger ones, commonly targeting somewhere between about 128 MB and 1 GB per file for efficient scans.
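The planning half of a compaction job can be sketched in a few lines. This is an illustration only: the 256 MB target and the function name are assumptions, and a real Glue job would also read and rewrite the Parquet data, which is left out here.

```python
def plan_compaction(file_sizes_mb: list[float], target_mb: float = 256) -> list[list[float]]:
    """Group small files, in order, into batches of at most roughly target_mb.

    Each batch would be rewritten as a single larger output file.
    """
    groups: list[list[float]] = []
    current: list[float] = []
    current_size = 0.0
    for size in file_sizes_mb:
        # Close the current batch if this file would push it past the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [1.0] * 1000           # a thousand 1 MB files in one partition
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "files")
```

With the values above, 1000 input files collapse to 4 output files, which is the whole benefit: far fewer objects to list and open per query.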

Redshift: Distribution and Sort Keys

Redshift is a columnar, MPP (massively parallel processing) database, and its performance hinges on two decisions that don’t have a direct equivalent in RDS: distribution style and sort keys.

Distribution Styles

KEY       - Rows distributed across nodes by hash of a chosen column.
             Good when that column is used heavily in joins (co-locates
             matching rows on the same node, avoiding network shuffle).

ALL       - Full copy of the table on every node. Good for small,
             frequently-joined dimension tables (a few million rows or fewer).

EVEN      - Round-robin distribution. Good default when there's no
             single obvious join column, or for staging tables.

AUTO      - Redshift picks the style for you: typically ALL while the
             table is small, then KEY or EVEN as it grows. The default
             and generally the right starting point unless you have a
             specific reason to override it.

The classic exam scenario: a large fact table joined constantly to a large dimension table on customer_id — that’s a strong KEY distribution candidate on customer_id for both tables, so matching rows sit on the same node and the join doesn’t need to shuffle data across the network.
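To see why hashing both tables on the same column avoids the shuffle, here is a toy simulation. It is not how Redshift is implemented: crc32 stands in for Redshift's internal hash function, and the table contents and slice count are invented for the example.

```python
import zlib


def slice_for(key, num_slices: int) -> int:
    # crc32 is a stand-in; the real hash function is internal to Redshift.
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def distribute(rows: list[dict], key_col: str, num_slices: int) -> list[list[dict]]:
    """KEY distribution: place each row on the slice its key hashes to."""
    slices: list[list[dict]] = [[] for _ in range(num_slices)]
    for row in rows:
        slices[slice_for(row[key_col], num_slices)].append(row)
    return slices


def join_is_local(fact_slices, dim_slices, key_col: str) -> bool:
    """True if every fact row finds its matching dimension row on the same slice."""
    for fact_slice, dim_slice in zip(fact_slices, dim_slices):
        local_keys = {d[key_col] for d in dim_slice}
        if any(f[key_col] not in local_keys for f in fact_slice):
            return False
    return True


orders = [{"order_id": i, "customer_id": i % 5} for i in range(20)]
customers = [{"customer_id": c, "name": f"cust-{c}"} for c in range(5)]

order_slices = distribute(orders, "customer_id", 4)
customer_slices = distribute(customers, "customer_id", 4)
print(join_is_local(order_slices, customer_slices, "customer_id"))  # True
```

Because both tables use the same key and the same hash, a given customer_id always lands on the same slice in both, so the join never has to look elsewhere.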

Sort Keys

Sort keys determine the physical order data is stored in on disk, which lets Redshift skip blocks entirely for range-restricted queries (this is called zone map pruning).

Compound sort key — columns sorted in the order listed, most useful when queries filter on a prefix of those columns (e.g., sort by date, region when most queries filter by date first).
Interleaved sort key — gives equal weight to each column in the key, useful when queries filter unpredictably across several columns, at the cost of slower load/vacuum operations.

A table with a date sort key and queries that always filter on a recent date range will scan a fraction of the blocks a similarly sized unsorted table would.
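The block-skipping effect can be shown with a toy zone map. This is a simplified model, assuming fixed-size blocks that each record only the min and max of one column; the day numbers and block size are invented for the example.

```python
def build_zone_map(values: list[int], block_size: int) -> list[tuple[int, int]]:
    """Split a column into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]


def blocks_scanned(zone_map: list[tuple[int, int]], lo: int, hi: int) -> int:
    """Count blocks whose (min, max) range overlaps the query range [lo, hi]."""
    return sum(1 for mn, mx in zone_map if mx >= lo and mn <= hi)


# 1000 rows, one per "day number", stored 100 rows per block.
sorted_days = list(range(1000))
# Same values, deliberately scattered so every block spans the full range.
unsorted_days = [v for start in range(10) for v in range(start, 1000, 10)]

recent = (950, 999)  # "last 50 days" filter
print(blocks_scanned(build_zone_map(sorted_days, 100), *recent))    # 1
print(blocks_scanned(build_zone_map(unsorted_days, 100), *recent))  # 10
```

Same data, same filter: the sorted layout reads one block in ten, while the scattered layout cannot rule out any block.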

Redshift Serverless

Redshift Serverless removes cluster sizing from the equation entirely. You set a base capacity in Redshift Processing Units (RPUs), optionally with a maximum, and it scales compute up and down with query demand; you are billed for compute only while workloads run, so idle periods cost nothing on the compute side. It is the usual recommendation for workloads with unpredictable or intermittent query patterns, since you stop paying for an always-on cluster sized for peak load. Provisioned clusters still make sense for steady, predictable, high-utilization workloads where reserved pricing beats on-demand RPU billing.
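The intermittent-versus-steady tradeoff comes down to simple arithmetic. The sketch below uses placeholder prices that are NOT real AWS rates, and it assumes a fixed RPU level while queries run; plug in current pricing for your region before drawing conclusions.

```python
HOURS_IN_MONTH = 730


def serverless_monthly_cost(rpus: int, busy_hours: float, price_per_rpu_hour: float) -> float:
    """Serverless bills for RPUs only while workloads are running."""
    return rpus * busy_hours * price_per_rpu_hour


def provisioned_monthly_cost(nodes: int, price_per_node_hour: float) -> float:
    """A provisioned cluster bills around the clock, busy or idle."""
    return nodes * price_per_node_hour * HOURS_IN_MONTH


def break_even_busy_hours(rpus: int, price_per_rpu_hour: float,
                          nodes: int, price_per_node_hour: float) -> float:
    """Busy hours per month above which provisioned becomes cheaper."""
    return provisioned_monthly_cost(nodes, price_per_node_hour) / (rpus * price_per_rpu_hour)


# Placeholder numbers for illustration only (hypothetical prices).
RPUS, RPU_PRICE = 32, 0.40
NODES, NODE_PRICE = 2, 3.00

print(serverless_monthly_cost(RPUS, 100, RPU_PRICE))   # light, bursty usage
print(provisioned_monthly_cost(NODES, NODE_PRICE))     # always-on cluster
print(break_even_busy_hours(RPUS, RPU_PRICE, NODES, NODE_PRICE))
```

Whatever the real rates are, the shape of the answer is the same: below the break-even number of busy hours, serverless wins; above it, the always-on cluster does.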

DynamoDB for Analytics-Adjacent Workloads

DynamoDB isn’t a data warehouse, but data engineers hit it constantly as a serving layer — the fast lookup table that sits downstream of an analytical pipeline, feeding an application or API.

Capacity Modes

Provisioned    — You define Read/Write Capacity Units (RCU/WCU).
                 Cheaper at steady, predictable, high throughput.
                 Pair with auto-scaling for gradual traffic shifts.

On-Demand      — Pay per request, no capacity planning.
                 Better for spiky, unpredictable, or new workloads
                 where you don't yet know the traffic shape.
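For provisioned mode, sizing follows from how capacity units are defined: one RCU covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one WCU covers one write per second of an item up to 1 KB. The function names and traffic figures below are invented for this sketch, which ignores transactional reads and writes.

```python
import math


def required_rcu(reads_per_sec: int, item_kb: float, strongly_consistent: bool = True) -> int:
    """RCUs needed: item size rounds up to the next 4 KB per read."""
    units_per_read = math.ceil(item_kb / 4)
    rcu = reads_per_sec * units_per_read
    # Eventually consistent reads cost half as much.
    return rcu if strongly_consistent else math.ceil(rcu / 2)


def required_wcu(writes_per_sec: int, item_kb: float) -> int:
    """WCUs needed: item size rounds up to the next 1 KB per write."""
    return writes_per_sec * math.ceil(item_kb)


print(required_rcu(500, 6))                             # 1000
print(required_rcu(500, 6, strongly_consistent=False))  # 500
print(required_wcu(200, 2.5))                           # 600
```

Note how the rounding bites: a 6 KB item costs two read units, not one and a half, so trimming item size can matter as much as the request rate.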

Partition Key Design

The partition key determines how DynamoDB spreads data (and therefore throughput) across physical partitions. A poor choice creates a hot partition — one partition absorbing disproportionate traffic while others sit idle, capping your effective throughput regardless of overall table capacity.

For analytics-feeding tables, a common anti-pattern is using a low-cardinality key like event_type (only a handful of distinct values) when the table needs high write throughput: every write for the same type lands on the same partition. The fix is usually a write-sharded key, e.g., event_type#shard_id, which spreads writes across multiple partitions. The tradeoff is on the read side: to get the full set for one event type, you query every shard and merge the results.
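A write-sharded key along the lines of event_type#shard_id might be generated like this. The shard count, the use of an event ID as the hash input, and the function names are all assumptions made for the sketch.

```python
import zlib

NUM_SHARDS = 8  # hypothetical; size it to the write throughput you need


def sharded_partition_key(event_type: str, event_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Derive a stable shard suffix from the event ID and append it to the key."""
    shard = zlib.crc32(event_id.encode("utf-8")) % num_shards
    return f"{event_type}#{shard}"


def all_shard_keys(event_type: str, num_shards: int = NUM_SHARDS) -> list[str]:
    """Every partition key a reader must query to rebuild the full set."""
    return [f"{event_type}#{i}" for i in range(num_shards)]


key = sharded_partition_key("page_view", "evt-000123")
print(key in all_shard_keys("page_view"))  # True
print(all_shard_keys("page_view")[:3])     # ['page_view#0', 'page_view#1', 'page_view#2']
```

Hashing a stable attribute, rather than picking a random suffix, means the same event always maps to the same shard, so a later update or lookup can recompute the key without scanning.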

Symptom	Likely cause	Fix
Throttling despite low overall table usage	Hot partition from low-cardinality key	Add a shard suffix or pick a higher-cardinality key
Read costs spiking on one access pattern	No GSI matching that query shape	Add a Global Secondary Index
Need to run ad-hoc analytical queries	DynamoDB isn’t built for scans/joins	Export the table to S3 (or stream changes out via DynamoDB Streams) and analyze it with Athena or Redshift

Exam Focus: What Questions Test From This Step

Mapping a plain-English workload description to the right store (Redshift vs DynamoDB vs RDS vs OpenSearch)
S3 storage class selection and lifecycle transition ordering
Diagnosing and fixing the “small files problem” through partitioning and compaction
Redshift distribution style choice (KEY vs ALL vs EVEN vs AUTO) based on join patterns
Compound vs interleaved sort keys and when each helps query performance
When Redshift Serverless beats a provisioned cluster (intermittent vs steady workloads)
DynamoDB partition key design and diagnosing hot partitions
Provisioned vs on-demand DynamoDB capacity mode selection

Written by NPBlue Cloud Team, Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.