Step 1: Data Preparation
Every ML engineering exam question that looks like a modeling question is secretly a data question. Get the data pipeline wrong and no amount of hyperparameter tuning saves you. This step is about building the muscle memory for "where does this data live, how does it get clean, and how do I prove it stayed clean."
Where Training Data Actually Comes From
Most real pipelines pull from more than one source, and the exam expects you to match the source to the right ingestion tool rather than defaulting to "just use S3 for everything."
                          ┌─────────────────┐
    Batch files      ───► │    Amazon S3    │ ──► Glue Crawler ──► Data Catalog
    (CSV/Parquet)         │ (raw data lake) │
                          └─────────────────┘
                                   ▲
    Streaming events ─────────────►│
    (clickstream, IoT)   Kinesis Data Streams / Firehose
                                   │
                                   ▼
                          ┌─────────────────┐
                          │  Glue ETL job   │ ── clean, join, dedupe
                          └────────┬────────┘
                                   ▼
                          ┌─────────────────┐
                          │SageMaker Feature│ ── online + offline store
                          │      Store      │
                          └────────┬────────┘
                                   ▼
                     Training job / batch inference

S3 is the default landing zone for anything file-based: it's cheap, versioned (via S3 Versioning), and every SageMaker training job reads from it eventually, even if a feature store sits in front. Glue is the workhorse for schema discovery and serverless ETL; a Glue Crawler infers schema and populates the Data Catalog, and Glue jobs (Spark under the hood) do the heavy transformation work without you provisioning a cluster. Kinesis Data Streams and Kinesis Data Firehose cover the streaming case: Streams when you need low-latency custom processing, Firehose when you just want data landed in S3 or Redshift with light transformation.
The exam likes to test whether you'd reach for Glue vs. EMR vs. a SageMaker Processing job. The rule of thumb: Glue for serverless, catalog-integrated ETL; EMR when you already run Spark/Hadoop at scale and want cluster control; SageMaker Processing when the transformation is tightly coupled to a training job and you want it to run in the same SDK/pipeline context.
SageMaker Feature Store
A feature store solves a problem that bites almost every team eventually: training-serving skew, where the features computed offline for training don't match what gets computed online at inference time.
    Feature Group: "customer_churn_features"
    ├── Offline Store (S3, Parquet)
    │     used by: training jobs, batch scoring, Athena queries
    └── Online Store (low-latency key-value)
          used by: real-time inference, feature lookups < 10ms

Both stores are written to from the same ingestion path, so the exact feature definition used to train a model is guaranteed to be the same one served at inference. Feature Groups carry a record identifier and an event-time feature, which lets you do point-in-time correct joins, critical when you're reconstructing "what did this feature look like on the day we trained" for an audit or a reproducibility requirement.
Don't confuse Feature Store with a generic feature engineering library. It's a managed store with versioning and access control, not a transformation engine. You still do the actual feature computation in Glue, SageMaker Processing, or Data Wrangler, and then ingest the result.
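A point-in-time correct join can be sketched in plain Python to show the idea: for each training label, pick the latest feature value whose event time is not after the label's timestamp. This is only an illustration of the concept, not Feature Store's API; the row layout, the helper names `build_history` and `point_in_time_lookup`, and the sample data are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def build_history(feature_rows):
    """Group (record_id, event_time, value) rows by record_id, sorted by event_time."""
    history = defaultdict(list)
    for record_id, event_time, value in feature_rows:
        history[record_id].append((event_time, value))
    for rows in history.values():
        rows.sort()
    return history


def point_in_time_lookup(history, record_id, as_of):
    """Return the latest value with event_time <= as_of, or None if there is none."""
    rows = history.get(record_id, [])
    times = [event_time for event_time, _ in rows]
    idx = bisect_right(times, as_of)
    return rows[idx - 1][1] if idx > 0 else None


# Hypothetical feature history; ISO date strings sort chronologically.
feature_rows = [
    ("cust-1", "2024-01-01", 3),
    ("cust-1", "2024-02-01", 7),
    ("cust-1", "2024-03-01", 12),
    ("cust-2", "2024-02-15", 1),
]
history = build_history(feature_rows)

# Each label carries the timestamp at which the outcome was observed.
labels = [("cust-1", "2024-02-10"), ("cust-2", "2024-01-20")]
training_rows = [
    (record_id, ts, point_in_time_lookup(history, record_id, ts))
    for record_id, ts in labels
]
print(training_rows)
```

The March value for `cust-1` is never used for a February label, and `cust-2` gets no value at all because nothing had been recorded for it yet. That is the property that keeps future information out of the training set.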
Cleaning and Transforming
SageMaker Data Wrangler gets a lot of exam attention because it's the visual, low-code way to explore and transform data inside Studio. It ships with 300+ built-in transforms (missing value imputation, one-hot encoding, outlier handling, time-series lag features) and a "Data Quality and Insights Report" that flags target leakage, class imbalance, and duplicate rows before you waste a training run on bad data.
Typical cleaning steps you should be able to reason about without a UI, too:
- Missing values: drop rows only when missingness is trivial and random; otherwise impute (mean/median for numeric, mode or a new "unknown" category for categorical, or a model-based imputer for anything structurally important)
- Outliers: clip, winsorize, or bucket depending on whether the outlier is a data error or a genuine tail event you want the model to see
- Encoding: one-hot for low-cardinality categoricals, target/frequency encoding for high-cardinality ones (don't one-hot a zip-code column and blow up your feature space)
- Scaling: standardization for algorithms sensitive to feature magnitude (linear models, neural nets, k-NN); tree-based models generally don't need it
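The four steps above can be sketched with the standard library alone, which is a useful way to check you understand what each transform does without a UI. The column values, the clip bounds of 18 and 100, and the function names are assumptions for illustration; a real pipeline would more likely use pandas or scikit-learn equivalents.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip(values, lower, upper):
    """Clamp each value into [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def one_hot(values):
    """Return the sorted category list and one indicator row per input value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [34, None, 29, 41, 250]             # one missing value, one likely data error
ages_filled = impute_median(ages)          # missing value becomes 37.5
ages_clipped = clip(ages_filled, 18, 100)  # 250 is clamped to 100
ages_scaled = standardize(ages_clipped)

plans = ["basic", "pro", "basic", "free", "pro"]
categories, plan_matrix = one_hot(plans)

print(ages_clipped)
print(categories, plan_matrix[1])
```

In a real pipeline the median, clip bounds, category list, and scaling statistics should be computed on the training split only and then reused unchanged on validation and test data, for the same leakage reasons that apply to resampling.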
Handling Imbalanced Data
This shows up constantly on the exam, usually framed as fraud detection or churn prediction where the positive class is 1-3% of the data.
| Technique | What it does | When to prefer it |
|---|---|---|
| Random oversampling | Duplicates minority class rows | Small datasets, quick baseline |
| SMOTE | Synthesizes new minority samples via interpolation | Continuous features, moderate imbalance |
| Random undersampling | Drops majority class rows | Very large datasets where you can afford to lose data |
| Class weighting | Penalizes misclassifying the minority class more heavily | Preferred when you don't want to distort the data distribution |
| Anomaly detection framing | Treat minority class as an anomaly rather than a classification target | Extreme imbalance (<0.5% positive) |
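To make the class weighting row concrete, here is a minimal sketch of one common weighting heuristic, `n_samples / (n_classes * class_count)`, which is the formula behind scikit-learn's `class_weight="balanced"` option. The function name and the 2% positive dataset are assumptions for illustration; the table above does not prescribe a specific weighting formula.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical fraud-style dataset: 980 negatives, 20 positives.
labels = [0] * 980 + [1] * 20
weights = balanced_class_weights(labels)
print(weights)
```

With these weights a single misclassified positive costs roughly as much as 49 misclassified negatives, while the data itself is left untouched.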
A trap the exam sets: applying SMOTE or oversampling before the train/test split. That leaks synthetic or duplicated minority samples into your evaluation set and inflates your metrics. Always split first, resample only the training fold.
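The correct order of operations can be sketched as follows: a stratified split first, then random oversampling applied to the training fold only. The helper names, the 20% test fraction, and the synthetic `(id, label)` rows are assumptions for the example; in practice a library such as scikit-learn or imbalanced-learn would usually handle both steps.

```python
import random


def label_of(row):
    """Rows in this sketch are (id, label) tuples."""
    return row[1]


def stratified_split(rows, test_fraction, rng):
    """Split each class separately so the test set keeps the original class ratio."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def oversample_minority(train, rng):
    """Duplicate minority rows (with replacement) until every class matches the largest."""
    by_class = {}
    for row in train:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


rng = random.Random(0)
rows = [(i, 1 if i < 20 else 0) for i in range(1000)]  # 2% positive class

train, test = stratified_split(rows, 0.2, rng)  # 1) split first
train_balanced = oversample_minority(train, rng)  # 2) resample the training fold only

print(len(test), sum(label_of(r) for r in test))
print(len(train_balanced), sum(label_of(r) for r in train_balanced))
```

The test set keeps the original 2% positive rate and shares no rows with the resampled training data, so metrics computed on it reflect what the model will see in production.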
Metric choice matters just as much as the resampling technique: accuracy is close to useless on an imbalanced dataset. Precision, recall, F1, and PR-AUC (not ROC-AUC) are the numbers that actually tell you whether the model is useful.
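A small worked example shows why accuracy misleads here. The sketch below compares two hypothetical classifiers on a 2% positive dataset: one that always predicts the negative class, and one that finds half the positives at the cost of some false alarms. The prediction vectors are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1] * 20 + [0] * 980

# Model A: never predicts the positive class.
always_negative = [0] * 1000
# Model B: catches 10 of 20 positives and raises 10 false alarms.
partial_detector = [1] * 10 + [0] * 10 + [1] * 10 + [0] * 970

for name, y_pred in [("always_negative", always_negative), ("partial_detector", partial_detector)]:
    print(name, accuracy(y_true, y_pred), precision_recall_f1(y_true, y_pred))
```

Both models score 0.98 accuracy, but the first has zero precision, recall, and F1, while the second has 0.5 on all three. Accuracy cannot tell them apart; the minority-class metrics can.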
Data Labeling with SageMaker Ground Truth
When you don't have labels at all, Ground Truth is the managed labeling workflow AWS wants you to know. It supports built-in workflows for image classification, bounding boxes, semantic segmentation, text classification, and named entity recognition, and it routes work to a labeling workforce: your own private workforce, Amazon Mechanical Turk, or a vetted third-party vendor.
The feature that gets tested most is automated data labeling: Ground Truth trains a model on the labels humans have already produced, uses that model to label the easy examples automatically, and only sends the low-confidence examples back to humans. Over the course of a labeling job, this can cut human labeling cost substantially while keeping quality high, because the active-learning loop concentrates human effort where it's actually needed.
    Unlabeled data ──► Human labels (initial batch)
                              │
                              ▼
                   Train auto-labeling model
                              │
                  ┌───────────┴────────────┐
                  ▼                        ▼
       High confidence labels    Low confidence examples
           (auto-applied)        (sent to humans again)

Ground Truth Plus goes a step further and provides a fully managed workforce and labeling operation for teams that don't want to manage workforce logistics at all.
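The routing step at the heart of the automated labeling loop can be sketched in a few lines. This is a conceptual illustration only, not how Ground Truth is implemented: the `route_by_confidence` helper, the 0.95 threshold, and the sample predictions are all assumptions, and the service chooses its own confidence criteria internally.

```python
def route_by_confidence(predictions, threshold):
    """Split model predictions into auto-applied labels and items needing human review.

    predictions: iterable of (item_id, predicted_label, confidence) tuples.
    """
    auto_labeled, needs_human = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((item_id, label))
        else:
            needs_human.append(item_id)
    return auto_labeled, needs_human


# Hypothetical output of an auto-labeling model trained on the initial human batch.
predictions = [
    ("img-001", "cat", 0.99),
    ("img-002", "dog", 0.62),
    ("img-003", "cat", 0.97),
    ("img-004", "dog", 0.51),
]
auto_labeled, needs_human = route_by_confidence(predictions, threshold=0.95)
print(auto_labeled)
print(needs_human)
```

In a full loop, the human labels collected for the low-confidence items would be added to the training set and the model retrained, so the share of items that can be auto-labeled tends to grow over successive rounds.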
Data Quality and Versioning
Two things the exam expects you to connect: SageMaker Clarify for bias detection in the pre-training data (checking whether a sensitive attribute like age or gender correlates unfairly with the label), and dataset versioning for reproducibility. Every training job should be traceable back to an exact snapshot of the data it was trained on, which usually means S3 object versioning, a Feature Store point-in-time query, or an explicit manifest file checked into your pipeline's lineage tracking.
Skipping versioning is the kind of shortcut that looks fine until someone asks "which dataset produced the model currently in production," and nobody can answer with certainty. Treat data like code: pin it, tag it, and never silently overwrite a training dataset in place.
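One way to pin a dataset is a manifest of content hashes, sketched below with the standard library. The manifest layout, the `build_manifest` helper, the 12-character version ID, and the file names are all hypothetical; the example hashes in-memory bytes to stay self-contained, whereas a real pipeline would hash the objects it reads from storage.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(files):
    """Build a manifest from a dict mapping file path -> file content (bytes).

    The dataset version is derived from the per-file hashes, so any change to
    any file's content or path produces a different version ID.
    """
    entries = [
        {"path": path, "sha256": sha256_of(content), "bytes": len(content)}
        for path, content in sorted(files.items())
    ]
    canonical = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {"dataset_version": sha256_of(canonical)[:12], "files": entries}


# Hypothetical training files.
files_v1 = {
    "train/part-0000.csv": b"id,label\n1,0\n2,1\n",
    "train/part-0001.csv": b"id,label\n3,0\n",
}
manifest_v1 = build_manifest(files_v1)

# Changing a single label yields a new dataset version.
files_v2 = dict(files_v1)
files_v2["train/part-0001.csv"] = b"id,label\n3,1\n"
manifest_v2 = build_manifest(files_v2)

print(manifest_v1["dataset_version"], manifest_v2["dataset_version"])
```

Recording the `dataset_version` alongside each training job is what lets you later answer which data produced a given model, independent of where the files happen to live.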
Exam Focus: What Questions Test From This Step
- Matching ingestion tool to data shape: Kinesis for streaming, Glue for serverless batch ETL, S3 as the universal landing zone
- Why Feature Store prevents training-serving skew (online vs. offline store roles)
- Correct order of operations for imbalanced data: split first, resample only the training set
- Metric selection for imbalanced classification (precision/recall/F1/PR-AUC over raw accuracy)
- Ground Truth's automated labeling / active-learning loop and when it reduces human labeling cost
- Data Wrangler's role in the Studio workflow vs. Glue's role in production ETL
- Data versioning and lineage as a reproducibility requirement, not an optional nicety