Step 2: Model Development & Training
Once the data is clean and sitting in a feature group or an S3 prefix, the real engineering decisions start. This step is where the exam separates people who've only trained models in a notebook from people who've had to make a training job survive contact with a real budget and a real deadline.
Built-In Algorithms vs. Bring-Your-Own
SageMaker gives you three ways to get a model trained, and picking the wrong one for a scenario is a classic wrong-answer trap.
| Built-in Algorithms | Script Mode (BYO code) | Bring Your Own Container |
|---|---|---|
| XGBoost, Linear Learner, k-NN, DeepAR. Fastest to deploy, least flexible. | Your training script + a prebuilt framework container (TF, PyTorch, MXNet, Hugging Face). | Custom Docker image with full control over runtime. Use when you need custom system deps or a stack SageMaker doesn't ship. |

If a question describes a tabular classification/regression problem with no unusual requirements, XGBoost is almost always the "best" built-in answer: it's fast, well-documented, and handles missing values natively. DeepAR is the built-in for time-series forecasting with multiple related series. BlazingText covers word embeddings and text classification at scale. For anything involving a custom PyTorch or TensorFlow architecture, script mode is the answer: you keep your own training code but let SageMaker manage the infrastructure, container, and I/O channels around it.
Bring-your-own-container is reserved for edge cases: an unsupported framework, a specific CUDA/driver version, or a non-Python runtime. If a question is testing "least operational overhead," BYOC is rarely correct, because it's the highest-maintenance option.
SageMaker JumpStart deserves a mention here too: it's the model hub for pretrained foundation models and common architectures, letting you fine-tune instead of training from scratch. On the exam, JumpStart is the answer whenever the scenario is "we want to fine-tune an existing large model quickly" rather than build one from zero.
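The selection rules above can be condensed into a small decision helper. This is only a study aid under stated assumptions: the `Scenario` fields and the `pick_training_approach` function are hypothetical names invented for illustration, not part of any AWS SDK, and real exam scenarios carry more nuance than three booleans.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical summary of an exam scenario (field names are illustrative)."""
    fine_tune_pretrained: bool = False  # adapt an existing large model quickly
    custom_architecture: bool = False   # custom PyTorch/TensorFlow model code
    unsupported_runtime: bool = False   # unsupported framework, pinned CUDA/driver, non-Python


def pick_training_approach(s: Scenario) -> str:
    """Return the lowest-overhead option that still satisfies the scenario."""
    if s.unsupported_runtime:
        # Only a custom image can satisfy this, despite the maintenance cost.
        return "bring-your-own-container"
    if s.fine_tune_pretrained:
        return "JumpStart"
    if s.custom_architecture:
        return "script mode"
    # Plain tabular/forecasting/text problem with no unusual requirements.
    return "built-in algorithm"


print(pick_training_approach(Scenario()))
print(pick_training_approach(Scenario(custom_architecture=True)))
```

The ordering encodes the "least operational overhead" heuristic: BYOC is chosen only when nothing else can work.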
Hyperparameter Tuning
SageMaker Automatic Model Tuning (AMT) runs multiple training jobs with different hyperparameter combinations and picks the best by an objective metric you define.
| Strategy | How it searches | Trade-off |
|---|---|---|
| Grid search | Exhaustively tries every combination | Guaranteed coverage, expensive at scale |
| Random search | Samples combinations at random | Cheaper, surprisingly competitive with grid |
| Bayesian optimization | Uses prior results to pick the next combination | Fewest jobs needed, default AMT strategy |
| Hyperband | Early-stops poorly performing jobs | Best when training is expensive and many configs are clearly bad early |
Bayesian optimization is the default and generally the right answer when a question asks "which strategy finds a good hyperparameter combination with the fewest training jobs." Hyperband is the answer when the emphasis is on cost control for expensive training runs, since it kills bad trials before they finish.
A subtlety worth internalizing: AMT parallelizes jobs, but too much parallelism with Bayesian optimization actually hurts it, because the strategy can't learn from jobs that haven't finished yet. If a scenario mentions "we want maximum parallelism," that's a mild signal toward random search instead.
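To make the early-stopping idea concrete, here is a minimal sketch of successive halving, the resource-allocation routine that Hyperband is built on. It is a simplification and not SageMaker's implementation: the `toy_loss` function and the learning-rate candidates are made-up stand-ins for real training jobs.

```python
def successive_halving(configs, train_one_round, rounds=3, keep_fraction=0.5):
    """Train every surviving config for one round, keep the best fraction, repeat."""
    survivors = list(configs)
    rounds_spent = 0
    for round_index in range(rounds):
        scored = [(train_one_round(c, round_index), c) for c in survivors]
        rounds_spent += len(survivors)
        scored.sort(key=lambda pair: pair[0])  # lower loss is better
        keep = max(1, int(len(scored) * keep_fraction))
        survivors = [c for _, c in scored[:keep]]  # poor configs stop here
    return survivors[0], rounds_spent


def toy_loss(learning_rate, round_index):
    """Made-up objective: best at learning_rate = 0.1, improves with more rounds."""
    return abs(learning_rate - 0.1) / (round_index + 1)


candidates = [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1.0, 2.0]
best, rounds_spent = successive_halving(candidates, toy_loss)
full_cost = len(candidates) * 3  # every config trained for all 3 rounds
print(best, rounds_spent, full_cost)
```

In this toy setup the search spends 14 training rounds instead of the 24 needed to train all eight candidates to completion, which is the cost argument for Hyperband on expensive jobs.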
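A rough way to see the parallelism trade-off is to count how many sequential batches of jobs a tuning run has, since a Bayesian strategy can only learn between batches. The sketch below assumes jobs run in lockstep batches, which is a simplification; the parameter names mirror the job-count and parallelism limits of a tuning job, but the function itself is hypothetical.

```python
import math


def feedback_rounds(max_jobs: int, max_parallel_jobs: int) -> int:
    """Number of sequential batches; only batches after the first can use earlier results."""
    return math.ceil(max_jobs / max_parallel_jobs)


for parallel in (1, 2, 5, 20):
    print(parallel, feedback_rounds(20, parallel))
```

With 20 jobs and 20 in parallel there is a single batch, so no job benefits from any earlier result and the search behaves much like random search.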
Distributed Training
Once a single GPU or a single instance can't hold the model or the data, you split the work. Two strategies, and the exam wants you to know which problem each one solves.

| | Data parallelism | Model parallelism |
|---|---|---|
| When it applies | Model fits on one device, dataset is the bottleneck | Model does NOT fit on one device |
| GPU 1 | Full model, batch A | Layers 1-10 |
| GPU 2 | Full model, batch B | Layers 11-20 |
| GPU 3 | Full model, batch C | Layers 21-30 |
| Communication | Gradients averaged/synced across devices each step | Activations passed between devices in a pipeline |

SageMaker's Distributed Data Parallel (SMDDP) library optimizes the gradient-sync step specifically for AWS networking, and it's the answer whenever a question is about scaling training throughput across many GPUs with an already-fits-in-memory model. SageMaker Model Parallel is the answer when the model itself (think large language models) is too big for one accelerator's memory.
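The gradient-sync step in data parallelism can be illustrated in plain Python with a one-weight linear model. This is a conceptual sketch only: the "devices" are list shards, the averaging line stands in for the all-reduce that a library such as SMDDP performs over the network, and all names are illustrative.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, lr=0.01):
    """Each shard plays the role of one device holding a full copy of the model (w)."""
    local_grads = [gradient(w, shard) for shard in shards]  # computed independently
    synced_grad = sum(local_grads) / len(local_grads)       # the averaging/sync step
    return w - lr * synced_grad


data = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (6, 12)]  # y = 2x
shards = [data[0:2], data[2:4], data[4:6]]                  # three equal-sized shards

w_parallel = data_parallel_step(0.0, shards)
w_single = 0.0 - 0.01 * gradient(0.0, data)                 # same step on one device
print(w_parallel, w_single)
```

With equal-sized shards, averaging the per-device gradients gives the same update as one device processing the whole batch, which is why data parallelism scales throughput without changing the model.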
Training Infrastructure Choices
Instance selection is one of those areas where the exam expects current, practical judgment rather than memorized specs.
- GPU instances (P and G families): standard choice for deep learning training; P-series for the heaviest workloads, G-series for lighter or inference-leaning jobs
- Trainium-based instances (Trn): AWS's purpose-built training silicon; by 2026 this is a mainstream cost-efficient option for large-scale deep learning training, especially when a workload can tolerate the extra engineering effort to compile against AWS Neuron
- Inferentia-based instances (Inf): purpose-built for inference, not training; if a question mentions Inf-series in the context of a training job, that's the wrong-answer bait
- CPU instances: perfectly fine for classical ML (XGBoost, linear models) that doesn't need a GPU at all
Managed Spot Training is a near-guaranteed exam topic: it runs training jobs on Spot capacity for up to 90% savings versus on-demand. If the instance is reclaimed, SageMaker restarts the job on new Spot capacity and restores the checkpoints it has synced to S3, so you don't lose all your progress, provided the training script (or a built-in algorithm that supports checkpointing) writes checkpoints and reloads them on start. The trade-off is unpredictable start times and possible interruption: fine for most training jobs, risky for anything with a hard deadline where you can't tolerate delay.
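Resuming after an interruption only helps if the training code writes checkpoints and reloads them on startup. The toy loop below sketches that pattern with a simulated interruption. Everything here is illustrative: the `train` function, the JSON checkpoint format, and the fake "weight" update are assumptions, and a temporary directory stands in for the container's local checkpoint path (by default `/opt/ml/checkpoints`) that SageMaker syncs with S3.

```python
import json
import os
import tempfile


def train(checkpoint_dir, total_steps, checkpoint_every, interrupt_at=None):
    """Toy training loop that resumes from checkpoint_dir if a checkpoint exists."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume instead of starting from step 0
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            return None  # simulated Spot reclaim: work since the last checkpoint is lost
        state["step"] += 1
        state["weight"] += 0.5  # stand-in for a real optimizer update
        if state["step"] % checkpoint_every == 0:
            with open(path, "w") as f:
                json.dump(state, f)
    return state


with tempfile.TemporaryDirectory() as ckpt_dir:
    first_attempt = train(ckpt_dir, total_steps=10, checkpoint_every=4, interrupt_at=6)
    resumed = train(ckpt_dir, total_steps=10, checkpoint_every=4)

print(first_attempt, resumed)
```

The first attempt is interrupted at step 6, so the second starts from the step-4 checkpoint rather than from zero; only two steps of work are repeated.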
Managed Spot Training flow:
- Training job starts; a checkpoint is saved every N steps and written to S3
- Spot interruption occurs (2-minute warning)
- Job automatically resumes from the last checkpoint on new Spot capacity

Evaluation Metrics and Experiment Tracking
Choosing the right metric is inseparable from choosing the right training approach, and the exam frequently tests this pairing rather than metrics in isolation.
| Problem type | Common metrics | Notes |
|---|---|---|
| Binary classification (balanced) | Accuracy, AUC-ROC | Fine when classes are roughly balanced |
| Binary classification (imbalanced) | Precision, Recall, F1, PR-AUC | Accuracy is misleading here |
| Multi-class classification | Macro/micro F1, confusion matrix | Macro F1 when classes matter equally |
| Regression | RMSE, MAE, R² | RMSE penalizes large errors more than MAE |
| Ranking/recommendation | NDCG, MAP | Order matters, not just correctness |
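Two notes from the table are easy to verify by hand: accuracy is misleading on imbalanced classes, and RMSE penalizes large errors more than MAE. The sketch below computes both from first principles on made-up data; the label counts and prediction values are illustrative assumptions, not results from any real model.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# 95 negatives, 5 positives, and a useless model that always predicts "negative".
labels = [0] * 95 + [1] * 5
always_negative = [0] * 100
imbalanced = classification_metrics(labels, always_negative)
print(imbalanced)

# Two prediction sets with the same MAE; only one contains a large error.
targets = [0.0, 0.0, 0.0, 0.0]
even_errors = [3.0, 3.0, 3.0, 3.0]
one_outlier = [1.0, 1.0, 1.0, 9.0]
print(mae(targets, even_errors), rmse(targets, even_errors))
print(mae(targets, one_outlier), rmse(targets, one_outlier))
```

The always-negative model scores 95% accuracy while finding none of the positives, and the two regression cases share an MAE of 3.0 even though the outlier case has a noticeably higher RMSE.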
For tracking, SageMaker Experiments (built into Studio) automatically logs parameters, metrics, and artifacts for every training run tied to a pipeline, so you can compare runs side by side instead of relying on someone's spreadsheet of results. This matters operationally as much as it matters for the exam: reproducibility questions almost always trace back to whether experiment metadata was captured at training time, not reconstructed after the fact.
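The principle of capturing metadata at training time can be shown with a deliberately simple local stand-in. This is not the SageMaker Experiments API: the `log_run`, `load_runs`, and `best_run` helpers, the JSON-lines file, the run names, metric values, and the S3 paths are all hypothetical and exist only to illustrate what a tracked record contains.

```python
import json
import os
import tempfile
import time


def log_run(log_path, run_name, params, metrics, artifacts):
    """Append one run record at training time (parameters, metrics, artifact locations)."""
    record = {"run": run_name, "logged_at": time.time(),
              "params": params, "metrics": metrics, "artifacts": artifacts}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(log_path):
    with open(log_path) as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(runs, metric, higher_is_better=True):
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "runs.jsonl")
    # Made-up runs and scores, purely for illustration.
    log_run(log_path, "xgb-depth-4", {"max_depth": 4, "eta": 0.3},
            {"validation_auc": 0.81}, ["s3://example-bucket/xgb-depth-4/model.tar.gz"])
    log_run(log_path, "xgb-depth-6", {"max_depth": 6, "eta": 0.1},
            {"validation_auc": 0.86}, ["s3://example-bucket/xgb-depth-6/model.tar.gz"])
    runs = load_runs(log_path)

best = best_run(runs, "validation_auc")
print(best["run"], best["params"])
```

Because each record is written when the run happens, the winning configuration can be recovered exactly later, which is the property reproducibility questions are probing.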
Exam Focus: What Questions Test From This Step
- Choosing built-in algorithm vs. script mode vs. bring-your-own-container based on flexibility needed
- XGBoost as the default strong answer for tabular problems; DeepAR for time series
- Bayesian optimization as AMT's default tuning strategy, and when Hyperband or random search fits better
- Data parallelism vs. model parallelism: matching the strategy to whether the model or the dataset is the bottleneck
- Trainium for training cost efficiency vs. Inferentia for inference (do not swap these)
- Managed Spot Training's checkpoint/resume behavior and its cost/reliability trade-off
- Matching evaluation metric to problem type, especially avoiding accuracy on imbalanced classification
- SageMaker Experiments' role in reproducibility and run comparison