Step 1 — Data Preparation

Every ML engineering exam question that looks like a modeling question is secretly a data question. Get the data pipeline wrong and no amount of hyperparameter tuning saves you. This step is about building the muscle memory for “where does this data live, how does it get clean, and how do I prove it stayed clean.”

Where Training Data Actually Comes From

Most real pipelines pull from more than one source, and the exam expects you to match the source to the right ingestion tool rather than defaulting to “just use S3 for everything.”

                    ┌──────────────────┐
   Batch files ───► │   Amazon S3      │ ──► Glue Crawler ──► Data Catalog
   (CSV/Parquet)    │  (raw data lake) │
                    └──────────────────┘
                             ▲
   Streaming events ────────►│
   (clickstream, IoT)   Kinesis Data Streams / Firehose
                             │
                             ▼
                    ┌──────────────────┐
                    │   Glue ETL job   │ ── clean, join, dedupe
                    └────────┬─────────┘
                             ▼
                    ┌──────────────────┐
                    │ SageMaker Feature│ ── online + offline store
                    │      Store       │
                    └────────┬─────────┘
                             ▼
                    Training job / batch inference

S3 is the default landing zone for anything file-based: it’s cheap, it keeps prior object versions once S3 Versioning is enabled on the bucket, and most SageMaker training jobs read from it eventually, even if a feature store sits in front. Glue is the workhorse for schema discovery and serverless ETL; a Glue Crawler infers schema and populates the Data Catalog, and Glue jobs (Spark under the hood) do the heavy transformation work without you provisioning a cluster. Kinesis Data Streams and Kinesis Data Firehose (now branded Amazon Data Firehose) cover the streaming case: Streams when you need low-latency custom processing, Firehose when you just want data landed in S3 or Redshift with light transformation.

The exam likes to test whether you’d reach for Glue vs. EMR vs. a SageMaker Processing job. The rule of thumb: Glue for serverless, catalog-integrated ETL; EMR when you already run Spark/Hadoop at scale and want cluster control; SageMaker Processing when the transformation is tightly coupled to a training job and you want it to run in the same SDK/pipeline context.

SageMaker Feature Store

A feature store solves a problem that bites almost every team eventually: training-serving skew, where the features computed offline for training don’t match what gets computed online at inference time.

Feature Group: "customer_churn_features"
├── Offline Store (S3, Parquet)
│      used by: training jobs, batch scoring, Athena queries
└── Online Store (low-latency key-value)
       used by: real-time inference, feature lookups < 10ms

Both stores are written to from the same ingestion path, so training and inference read values produced by one feature pipeline instead of two separately maintained implementations. Feature Groups carry a record identifier and an event-time feature. The online store keeps the latest record per identifier while the offline store keeps the history, which lets you do point-in-time correct joins. That is critical when you’re reconstructing “what did this feature look like on the day we trained” for an audit or a reproducibility requirement.
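To make the point-in-time idea concrete, here is a minimal sketch in plain Python. It is not the Feature Store API: the field names (customer_id, event_time, as_of, support_tickets_30d) and the helper itself are hypothetical, and it assumes all timestamps are ISO-8601 strings in the same format and time zone so that string comparison matches chronological order.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(feature_rows, label_rows):
    """For each label row, attach the latest feature value whose event_time
    is at or before the label's as_of timestamp (never a later one)."""
    # Group the feature history per record identifier, ordered by event time.
    history = defaultdict(list)
    for row in sorted(feature_rows, key=lambda r: r["event_time"]):
        history[row["customer_id"]].append(row)

    joined = []
    for label in label_rows:
        rows = history.get(label["customer_id"], [])
        times = [r["event_time"] for r in rows]
        # Number of feature rows with event_time <= as_of.
        idx = bisect_right(times, label["as_of"])
        value = rows[idx - 1]["support_tickets_30d"] if idx else None
        joined.append({**label, "support_tickets_30d": value})
    return joined


# Hypothetical sample data, for illustration only.
features = [
    {"customer_id": "c1", "event_time": "2024-01-01T00:00:00Z", "support_tickets_30d": 1},
    {"customer_id": "c1", "event_time": "2024-02-01T00:00:00Z", "support_tickets_30d": 4},
]
labels = [
    {"customer_id": "c1", "as_of": "2024-01-15T00:00:00Z", "churned": 0},
    {"customer_id": "c1", "as_of": "2024-02-15T00:00:00Z", "churned": 1},
    {"customer_id": "c2", "as_of": "2024-02-15T00:00:00Z", "churned": 0},
]
training_rows = point_in_time_join(features, labels)
```

The January label picks up the January feature value and never sees the February update, which is exactly the leakage a naive “join on latest value” would introduce.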

Don’t confuse Feature Store with a generic feature engineering library. It’s a managed store with versioning and access control, not a transformation engine — you still do the actual feature computation in Glue, SageMaker Processing, or Data Wrangler, and then ingest the result.
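As a rough illustration of “compute elsewhere, then ingest,” the sketch below shapes already-computed features into the list of FeatureName / ValueAsString pairs that the Feature Store runtime PutRecord call takes. The helper name, the feature names, and the choice to omit missing values are assumptions for this example; the actual ingest call is shown only as a comment because it needs AWS credentials and an existing feature group.

```python
from datetime import datetime, timezone


def to_feature_record(features: dict) -> list:
    """Convert a flat dict of computed features into FeatureName/ValueAsString pairs."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
        if value is not None  # assumption: missing features are simply left out
    ]


# Hypothetical feature values, computed upstream (Glue, Processing, Data Wrangler).
computed = {
    "customer_id": "c-1001",
    "event_time": datetime(2024, 2, 1, tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "support_tickets_30d": 4,
    "avg_basket_value": 37.5,
    "plan_tier": None,
}
record = to_feature_record(computed)

# With credentials and an existing feature group, ingestion would look roughly like:
#   boto3.client("sagemaker-featurestore-runtime").put_record(
#       FeatureGroupName="customer_churn_features", Record=record)
```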

Cleaning and Transforming

SageMaker Data Wrangler gets a lot of exam attention because it’s the visual, low-code way to explore and transform data inside Studio. It ships with 300+ built-in transforms (missing value imputation, one-hot encoding, outlier handling, time-series lag features) and a “Data Quality and Insights Report” that flags target leakage, class imbalance, and duplicate rows before you waste a training run on bad data.

Typical cleaning steps you should be able to reason about without a UI, too:

Missing values — drop rows only when missingness is trivial and random; otherwise impute (mean/median for numeric, mode or a new “unknown” category for categorical, or a model-based imputer for anything structurally important)
Outliers — clip, winsorize, or bucket depending on whether the outlier is a data error or a genuine tail event you want the model to see
Encoding — one-hot for low-cardinality categoricals, target/frequency encoding for high-cardinality ones (don’t one-hot a zip-code column and blow up your feature space)
Scaling — standardization for algorithms sensitive to feature magnitude (linear models, neural nets, k-NN); tree-based models generally don’t need it
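The four steps above can be sketched with nothing but the Python standard library. The helper names and sample values are made up for illustration; a real pipeline would more likely use pandas or scikit-learn, and would fit the median, mean, and standard deviation on the training split only.

```python
import statistics
from collections import Counter


def impute_median(values):
    """Replace None with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def clip(values, low, high):
    """Cap values to the [low, high] range (a simple form of outlier clipping)."""
    return [min(max(v, low), high) for v in values]


def one_hot(values):
    """One-hot encode a low-cardinality categorical column."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


def frequency_encode(values):
    """Replace each category with its relative frequency (one column, any cardinality)."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = impute_median([34, None, 29, 41, None, 38])   # [34, 36.0, 29, 41, 36.0, 38]
spend = clip([20, 25, 22, 900], low=0, high=100)      # [20, 25, 22, 100]
plan, plan_categories = one_hot(["basic", "pro", "basic"])
zip_freq = frequency_encode(["94107", "10001", "94107", "60601"])
scaled = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

Note how frequency encoding keeps the zip-code column at one numeric feature, where one-hot encoding would add a column per distinct zip code.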

Handling Imbalanced Data

This shows up constantly on the exam, usually framed as fraud detection or churn prediction where the positive class is 1-3% of the data.

Technique	What it does	When to prefer it
Random oversampling	Duplicates minority class rows	Small datasets, quick baseline
SMOTE	Synthesizes new minority samples via interpolation	Continuous features, moderate imbalance
Random undersampling	Drops majority class rows	Very large datasets where you can afford to lose data
Class weighting	Penalizes misclassifying the minority class more heavily	Preferred when you don’t want to distort the data distribution
Anomaly detection framing	Treat minority class as an anomaly rather than a classification target	Extreme imbalance (<0.5% positive)
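For the class-weighting row, one common heuristic (the same formula scikit-learn uses for class_weight="balanced") sets each class weight to n_samples / (n_classes * class_count). The sketch below computes it by hand; the 98/2 label split is an invented example, and how the weights are passed to an algorithm depends on the framework.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])"""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


# Hypothetical dataset: 2% positive class.
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
# weights[1] is 25.0 and weights[0] is about 0.51, so one missed positive
# costs roughly as much as 49 missed negatives.
```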

A trap the exam sets: applying SMOTE or oversampling before the train/test split. That leaks synthetic or duplicated minority samples into your evaluation set and inflates your metrics. Always split first, resample only the training fold.
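The correct order can be sketched as follows: a stratified split first, then random oversampling applied to the training rows only. This is a standard-library illustration with hypothetical helper names and synthetic rows, not a SageMaker or imbalanced-learn API.

```python
import random


def stratified_split(rows, label_key, test_fraction, rng):
    """Split each class separately so the test set keeps the original class ratio."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def oversample_minority(train, label_key, rng):
    """Duplicate minority rows (with replacement) until every class matches the largest."""
    by_class = {}
    for row in train:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


rng = random.Random(0)
# Synthetic data: 20 fraud rows out of 1000 (2% positive).
rows = [{"id": i, "fraud": 1 if i < 20 else 0} for i in range(1000)]

train, test = stratified_split(rows, "fraud", 0.2, rng)   # split FIRST
train_balanced = oversample_minority(train, "fraud", rng)  # resample train ONLY
```

The test set keeps its natural 2% positive rate, and no test row (or a copy of one) appears in the resampled training set, so the evaluation metrics reflect what production traffic will look like.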

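Metric choice matters just as much as the resampling technique: accuracy is close to useless on an imbalanced dataset. Precision, recall, F1, and PR-AUC are the numbers that actually tell you whether the model is useful. ROC-AUC is less informative here because the large negative class keeps the false-positive rate low even when most flagged cases are wrong.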
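A small worked example shows why accuracy misleads. The labels and predictions below are invented: 20 positives in 1000 rows, a “model” that always predicts negative, and a model that catches half the positives at the cost of ten false alarms.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1] * 20 + [0] * 980
always_negative = [0] * 1000
# Catches 10 of 20 positives, raises 10 false alarms among the negatives.
some_model = [1] * 10 + [0] * 10 + [1] * 10 + [0] * 970

baseline = classification_metrics(y_true, always_negative)
model = classification_metrics(y_true, some_model)
# Both score 0.98 accuracy; only recall, precision and F1 tell them apart.
```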

Data Labeling with SageMaker Ground Truth

When you don’t have labels at all, Ground Truth is the managed labeling workflow AWS wants you to know. It supports built-in workflows for image classification, bounding boxes, semantic segmentation, text classification, and named entity recognition, and it routes work to a labeling workforce — your own private workforce, Amazon Mechanical Turk, or a vetted third-party vendor.

The feature that gets tested most is automated data labeling: Ground Truth trains a model on the labels humans have already produced, uses that model to label the easy examples automatically, and only sends the low-confidence examples back to humans. Over the course of a labeling job, this can cut human labeling cost substantially while keeping quality high, because the active-learning loop concentrates human effort where it’s actually needed.

Unlabeled data ──► Human labels (initial batch)
                          │
                          ▼
                 Train auto-labeling model
                          │
              ┌───────────┴────────────┐
              ▼                        ▼
     High confidence labels     Low confidence examples
     (auto-applied)             (sent to humans again)

Ground Truth Plus goes a step further and provides a fully managed workforce and labeling operation for teams that don’t want to manage workforce logistics at all.

Data Quality and Versioning

Two things the exam expects you to connect. The first is SageMaker Clarify for bias detection in the pre-training data: checking whether a sensitive attribute like age or gender is unevenly represented or correlates unfairly with the label. The second is dataset versioning for reproducibility: every training job should be traceable back to an exact snapshot of the data it was trained on, which usually means either S3 object versioning, a Feature Store point-in-time query, or an explicit manifest file checked into your pipeline’s lineage tracking.
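Two of the pre-training bias metrics Clarify reports are simple enough to compute by hand: class imbalance (CI), which compares how many rows each facet group has, and difference in proportions of labels (DPL), which compares positive-label rates between the groups. The sketch below follows those definitions as I understand them; the function names, the age_group facet, and the approval data are hypothetical.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means both facet groups are equally represented."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_label_proportions(rows, facet_key, group_a_value, label_key):
    """DPL = q_a - q_d, where q is the share of positive labels within each facet group."""
    group_a = [r[label_key] for r in rows if r[facet_key] == group_a_value]
    group_d = [r[label_key] for r in rows if r[facet_key] != group_a_value]
    return sum(group_a) / len(group_a) - sum(group_d) / len(group_d)


# Hypothetical loan data: 60 applicants under 40 (30 approved), 40 aged 40+ (10 approved).
rows = (
    [{"age_group": "under_40", "approved": 1}] * 30
    + [{"age_group": "under_40", "approved": 0}] * 30
    + [{"age_group": "40_plus", "approved": 1}] * 10
    + [{"age_group": "40_plus", "approved": 0}] * 30
)
ci = class_imbalance(n_a=60, n_d=40)
dpl = difference_in_label_proportions(rows, "age_group", "under_40", "approved")
# ci is 0.2 (mild over-representation); dpl is 0.25 (50% vs. 25% approval rate).
```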

Skipping versioning is the kind of shortcut that looks fine until someone asks “which dataset produced the model currently in production,” and nobody can answer with certainty. Treat data like code: pin it, tag it, and never silently overwrite a training dataset in place.
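One lightweight way to “pin” a dataset is a manifest of content hashes. The sketch below is an assumption about how such a manifest might look, not a SageMaker feature: it hashes every file in a directory and derives a dataset ID from the combined entries, so any silent overwrite changes the ID. It reads each file fully into memory, which is fine for a demo but would need chunked reads for large files.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(data_dir):
    """Hash every file under data_dir and derive one ID for the whole snapshot."""
    entries = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            entries.append({
                "path": path.relative_to(data_dir).as_posix(),
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
                "bytes": path.stat().st_size,
            })
    # The dataset ID is the hash of the canonical JSON form of all file entries.
    canonical = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {"dataset_id": hashlib.sha256(canonical).hexdigest(), "files": entries}


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "train.csv").write_text("id,label\n1,0\n2,1\n")
    before = build_manifest(tmp)
    unchanged = build_manifest(tmp)

    # Simulate someone overwriting the training data in place.
    Path(tmp, "train.csv").write_text("id,label\n1,0\n2,0\n")
    after = build_manifest(tmp)
```

Recording the dataset_id alongside each training job gives a direct answer to “which dataset produced this model,” and a mismatch between the stored ID and a freshly computed one flags an in-place overwrite.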

Exam Focus: What Questions Test From This Step

Matching ingestion tool to data shape: Kinesis for streaming, Glue for serverless batch ETL, S3 as the universal landing zone
Why Feature Store prevents training-serving skew (online vs. offline store roles)
Correct order of operations for imbalanced data: split first, resample only the training set
Metric selection for imbalanced classification (precision/recall/F1/PR-AUC over raw accuracy)
Ground Truth’s automated labeling / active-learning loop and when it reduces human labeling cost
Data Wrangler’s role in the Studio workflow vs. Glue’s role in production ETL
Data versioning and lineage as a reproducibility requirement, not an optional nicety

Written by NPBlue Cloud Team, Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.