AWS Certified Machine Learning Engineer - Associate (MLA-C01)
Data Preparation: Ingestion, Feature Engineering & Ground Truth

Step 1 of 5 · updated 2026


Step 1: Data Preparation

Every ML engineering exam question that looks like a modeling question is secretly a data question. Get the data pipeline wrong and no amount of hyperparameter tuning saves you. This step is about building the muscle memory for “where does this data live, how does it get clean, and how do I prove it stayed clean.”


Where Training Data Actually Comes From

Most real pipelines pull from more than one source, and the exam expects you to match the source to the right ingestion tool rather than defaulting to “just use S3 for everything.”

                   ┌──────────────────┐
Batch files ─────► │    Amazon S3     │ ──► Glue Crawler ──► Data Catalog
(CSV/Parquet)      │ (raw data lake)  │
                   └──────────────────┘
                            ▲
Streaming events ──────────►│
(clickstream, IoT)  Kinesis Data Streams / Firehose
                            │
                            ▼
                   ┌──────────────────┐
                   │   Glue ETL job   │ ── clean, join, dedupe
                   └────────┬─────────┘
                            ▼
                   ┌──────────────────┐
                   │ SageMaker Feature│ ── online + offline store
                   │      Store       │
                   └────────┬─────────┘
                            ▼
             Training job / batch inference

S3 is the default landing zone for anything file-based: it’s cheap, it can be versioned (via S3 Versioning), and most SageMaker training jobs read from it eventually, even if a feature store sits in front. Glue is the workhorse for schema discovery and serverless ETL; a Glue Crawler infers schema and populates the Data Catalog, and Glue jobs (Spark under the hood) do the heavy transformation work without you provisioning a cluster. Kinesis Data Streams and Kinesis Data Firehose cover the streaming case: Streams when you need low-latency custom processing, Firehose when you just want data landed in S3 or Redshift with light transformation.

The exam likes to test whether you’d reach for Glue vs. EMR vs. a SageMaker Processing job. The rule of thumb: Glue for serverless, catalog-integrated ETL; EMR when you already run Spark/Hadoop at scale and want cluster control; SageMaker Processing when the transformation is tightly coupled to a training job and you want it to run in the same SDK/pipeline context.


SageMaker Feature Store

A feature store solves a problem that bites almost every team eventually: training-serving skew, where the features computed offline for training don’t match what gets computed online at inference time.

Feature Group: "customer_churn_features"
├── Offline Store (S3, Parquet)
│     used by: training jobs, batch scoring, Athena queries
└── Online Store (low-latency key-value)
      used by: real-time inference, feature lookups < 10ms

Both stores are written to from the same ingestion path, so the feature values used to train a model come from the same definition that is served at inference. Feature Groups carry a record identifier and an event-time feature, which lets you do point-in-time correct joins. That is critical when you’re reconstructing “what did this feature look like on the day we trained” for an audit or a reproducibility requirement.
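A point-in-time correct lookup can be illustrated without any AWS service. The following is a minimal sketch, assuming a hypothetical in-memory `history` list for one record identifier and a hypothetical helper `features_as_of`; it is not the Feature Store API, only the selection rule behind it: take the latest record whose event time is not later than the cutoff.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one record identifier, sorted by event time.
history = [
    (datetime(2024, 1, 1), {"orders_30d": 2}),
    (datetime(2024, 2, 1), {"orders_30d": 5}),
    (datetime(2024, 3, 1), {"orders_30d": 1}),
]

def features_as_of(history, as_of):
    """Return the latest feature values with event time <= as_of, or None."""
    times = [event_time for event_time, _ in history]
    idx = bisect_right(times, as_of)  # number of records at or before as_of
    return history[idx - 1][1] if idx else None

# Training on 2024-02-15 must not see the March value.
print(features_as_of(history, datetime(2024, 2, 15)))
```

The point of the rule is that a later feature value never leaks into a training row whose label was observed earlier.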

Don’t confuse Feature Store with a generic feature engineering library. It’s a managed store with versioning and access control, not a transformation engine. You still do the actual feature computation in Glue, SageMaker Processing, or Data Wrangler, and then ingest the result.


Cleaning and Transforming

SageMaker Data Wrangler gets a lot of exam attention because it’s the visual, low-code way to explore and transform data inside Studio. It ships with 300+ built-in transforms (missing value imputation, one-hot encoding, outlier handling, time-series lag features) and a “Data Quality and Insights Report” that flags target leakage, class imbalance, and duplicate rows before you waste a training run on bad data.

Typical cleaning steps you should be able to reason about without a UI, too:
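A few of the transforms named above (duplicate removal, outlier capping, median imputation, one-hot encoding) can be sketched in plain Python. This is a minimal illustration on made-up rows; the column names, the `clean` helper, and the age cap of 100 are assumptions, and it is not how Data Wrangler implements these steps.

```python
from statistics import median

# Hypothetical raw rows with a duplicate, a missing value, and an outlier.
rows = [
    {"id": 1, "age": 34, "plan": "basic"},
    {"id": 2, "age": None, "plan": "pro"},
    {"id": 2, "age": None, "plan": "pro"},   # exact duplicate
    {"id": 3, "age": 51, "plan": "basic"},
    {"id": 4, "age": 290, "plan": "pro"},    # implausible outlier
]

def clean(rows, age_cap=100):
    # 1. Drop exact duplicate rows, keeping the first occurrence.
    seen, deduped = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(dict(r))  # copy so the input is not mutated
    # 2. Cap outliers, then impute missing ages with the median of observed values.
    for r in deduped:
        if r["age"] is not None:
            r["age"] = min(r["age"], age_cap)
    fill = median(r["age"] for r in deduped if r["age"] is not None)
    for r in deduped:
        if r["age"] is None:
            r["age"] = fill
    # 3. One-hot encode the categorical column.
    categories = sorted({r["plan"] for r in deduped})
    for r in deduped:
        plan = r.pop("plan")
        for c in categories:
            r[f"plan_{c}"] = int(plan == c)
    return deduped

cleaned = clean(rows)
```

In a real pipeline the imputation statistic would be computed on the training split only and reused on validation and test data.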


Handling Imbalanced Data

This shows up constantly on the exam, usually framed as fraud detection or churn prediction where the positive class is 1-3% of the data.

Technique | What it does | When to prefer it
Random oversampling | Duplicates minority class rows | Small datasets, quick baseline
SMOTE | Synthesizes new minority samples via interpolation | Continuous features, moderate imbalance
Random undersampling | Drops majority class rows | Very large datasets where you can afford to lose data
Class weighting | Penalizes misclassifying the minority class more heavily | Preferred when you don’t want to distort the data distribution
Anomaly detection framing | Treat minority class as an anomaly rather than a classification target | Extreme imbalance (<0.5% positive)
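The "interpolation" in the SMOTE row is easier to see in code. Below is a simplified sketch of the core step only: a synthetic point is drawn on the line segment between a minority sample and one of its minority-class neighbors. The helper name `smote_like_sample` is hypothetical, and the neighbor is passed in directly rather than found by a k-nearest-neighbor search as full SMOTE does; libraries such as imbalanced-learn provide complete implementations.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Interpolate a synthetic point between x and a minority-class neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = smote_like_sample([1.0, 2.0], [3.0, 6.0], rng)
```

Because the new point is a blend of numeric coordinates, the technique fits continuous features and is a poor match for raw categorical columns.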

A trap the exam sets: applying SMOTE or oversampling before the train/test split. That leaks synthetic or duplicated minority samples into your evaluation set and inflates your metrics. Always split first, resample only the training fold.
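The ordering can be made concrete with a small sketch. It assumes binary labels stored under a hypothetical `label` key, uses a stratified hold-out so the test set keeps the real class ratio, and then applies random oversampling to the training rows only. The function name and the 20% test fraction are illustrative choices.

```python
import random

def split_then_oversample(rows, label_key="label", test_frac=0.2, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    # Stratified split: hold out test_frac of each class.
    for label in (0, 1):
        group = [r for r in rows if r[label_key] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test += group[:n_test]
        train += group[n_test:]
    # Oversample the minority class in the training rows only.
    pos = [r for r in train if r[label_key] == 1]
    neg = [r for r in train if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra, test

# 100 hypothetical rows with 5% positives.
rows = [{"id": i, "label": 1 if i < 5 else 0} for i in range(100)]
train_balanced, test = split_then_oversample(rows)
```

The test set stays untouched at its original 5% positive rate, so metrics computed on it reflect the distribution the model will see in production.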

Metric choice matters just as much as the resampling technique: accuracy is close to useless on an imbalanced dataset. Precision, recall, F1, and PR-AUC (not ROC-AUC) are the numbers that actually tell you whether the model is useful.
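A quick worked example shows why accuracy misleads here. The sketch below uses made-up labels with 2% positives and a degenerate model that always predicts the majority class; the helper `precision_recall_f1` is written out by hand for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 1,000 rows, 2% positive; the "model" never predicts the positive class.
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.98 0.0 0.0 0.0
```

The model scores 98% accuracy while catching none of the positives, which is exactly the failure that recall and F1 expose.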


Data Labeling with SageMaker Ground Truth

When you don’t have labels at all, Ground Truth is the managed labeling workflow AWS wants you to know. It supports built-in workflows for image classification, bounding boxes, semantic segmentation, text classification, and named entity recognition, and it routes work to a labeling workforce: your own private workforce, Amazon Mechanical Turk, or a vetted third-party vendor.

The feature that gets tested most is automated data labeling: Ground Truth trains a model on the labels humans have already produced, uses that model to label the easy examples automatically, and only sends the low-confidence examples back to humans. Over the course of a labeling job, this can cut human labeling cost substantially while keeping quality high, because the active-learning loop concentrates human effort where it’s actually needed.

Unlabeled data ──► Human labels (initial batch)
                           │
                           ▼
               Train auto-labeling model
                           │
               ┌───────────┴────────────┐
               ▼                        ▼
    High confidence labels   Low confidence examples
        (auto-applied)       (sent to humans again)
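The routing step at the bottom of that loop can be sketched in a few lines. This is an illustration of the idea only: the function `route_by_confidence`, the tuple layout, and the 0.9 threshold are assumptions, and Ground Truth manages its own confidence handling internally.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split model predictions into auto-applied labels and a human review queue."""
    auto_labeled, needs_human = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((item_id, label))
        else:
            needs_human.append(item_id)  # the model's guess is discarded
    return auto_labeled, needs_human

# Hypothetical model output: (item id, predicted label, confidence).
predictions = [
    ("img-001", "cat", 0.98),
    ("img-002", "dog", 0.55),
    ("img-003", "cat", 0.91),
]
auto_labeled, needs_human = route_by_confidence(predictions)
```

Each round of human labels on the low-confidence items becomes new training data, so the auto-labeled share tends to grow as the job progresses.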

Ground Truth Plus goes a step further and provides a fully managed workforce and labeling operation for teams that don’t want to manage workforce logistics at all.


Data Quality and Versioning

Two things the exam expects you to connect: SageMaker Clarify for bias detection in the pre-training data (checking whether a sensitive attribute like age or gender correlates unfairly with the label), and dataset versioning for reproducibility. Every training job should be traceable back to an exact snapshot of the data it was trained on, which usually means S3 object versioning, a Feature Store point-in-time query, or an explicit manifest file checked into your pipeline’s lineage tracking.
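Two pre-training bias checks of the kind described above can be computed by hand. The sketch below assumes the usual definitions of class imbalance, (n_a - n_d) / (n_a + n_d), and difference in proportions of labels, q_a - q_d, where a and d are the two groups of a sensitive attribute. The rows, the `age_group` column, and the function names are made up for illustration; this is not the Clarify SDK.

```python
def class_imbalance(rows, facet_key, group_a):
    """(n_a - n_d) / (n_a + n_d): how unevenly the two groups are represented."""
    n_a = sum(1 for r in rows if r[facet_key] == group_a)
    n_d = len(rows) - n_a
    return (n_a - n_d) / (n_a + n_d)

def difference_in_label_proportions(rows, facet_key, group_a, label_key="label"):
    """q_a - q_d: gap in positive-label rate between the two groups."""
    a = [r[label_key] for r in rows if r[facet_key] == group_a]
    d = [r[label_key] for r in rows if r[facet_key] != group_a]
    return sum(a) / len(a) - sum(d) / len(d)

# Hypothetical dataset: 60 rows under 40 (half positive), 40 rows 40+ (a quarter positive).
rows = (
    [{"age_group": "under_40", "label": 1}] * 30
    + [{"age_group": "under_40", "label": 0}] * 30
    + [{"age_group": "40_plus", "label": 1}] * 10
    + [{"age_group": "40_plus", "label": 0}] * 30
)
ci = class_imbalance(rows, "age_group", "under_40")
dpl = difference_in_label_proportions(rows, "age_group", "under_40")
print(ci, dpl)  # 0.2 0.25
```

Values near zero suggest the groups are balanced on that measure; larger magnitudes are a prompt to look closer before training, not proof of unfairness by themselves.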

Skipping versioning is the kind of shortcut that looks fine until someone asks “which dataset produced the model currently in production,” and nobody can answer with certainty. Treat data like code: pin it, tag it, and never silently overwrite a training dataset in place.
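One way to pin a dataset is a content-hash manifest. The sketch below is a minimal illustration under stated assumptions: the files are passed as in-memory bytes, and `build_manifest`, the file names, and the `dataset_id` field are hypothetical. In practice you might hash objects read from S3 or record their S3 version IDs instead.

```python
import hashlib
import json

def build_manifest(files: dict[str, bytes]) -> dict:
    """Hash every file, then hash the sorted listing to get one dataset fingerprint."""
    entries = {name: hashlib.sha256(data).hexdigest()
               for name, data in sorted(files.items())}
    listing = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {"files": entries, "dataset_id": hashlib.sha256(listing).hexdigest()}

v1 = build_manifest({"train.csv": b"id,label\n1,0\n", "test.csv": b"id,label\n2,1\n"})
v2 = build_manifest({"train.csv": b"id,label\n1,1\n", "test.csv": b"id,label\n2,1\n"})
print(v1["dataset_id"] != v2["dataset_id"])  # True: a one-character edit changes the ID
```

Storing the `dataset_id` alongside each training job gives a direct answer to "which dataset produced this model" without relying on file names or timestamps.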


Exam Focus: What Questions Test From This Step