Step 2 — Model Development & Training

Once the data is clean and sitting in a feature group or an S3 prefix, the real engineering decisions start. This step is where the exam separates people who’ve only trained models in a notebook from people who’ve had to make a training job survive contact with a real budget and a real deadline.

Built-In Algorithms vs. Bring-Your-Own

SageMaker gives you three ways to get a model trained, and picking the wrong one for a scenario is a classic wrong-answer trap.

┌───────────────────────┬────────────────────────┬──────────────────────────┐
│ Built-in Algorithms   │ Script Mode (BYO code) │ Bring Your Own Container │
├───────────────────────┼────────────────────────┼──────────────────────────┤
│ XGBoost, Linear       │ Your training script + │ Custom Docker image with │
│ Learner, k-NN, DeepAR │ a prebuilt framework   │ full control over runtime│
│ Fastest to deploy,    │ container (TF, PyTorch,│ Use when you need custom │
│ least flexible        │ MXNet, Hugging Face)   │ system deps or a stack   │
│                       │                        │ SageMaker doesn't ship   │
└───────────────────────┴────────────────────────┴──────────────────────────┘

If a question describes a tabular classification/regression problem with no unusual requirements, XGBoost is almost always the “best” built-in answer — it’s fast, well-documented, and handles missing values natively. DeepAR is the built-in for time-series forecasting with multiple related series. BlazingText covers word embeddings and text classification at scale. For anything involving a custom PyTorch or TensorFlow architecture, script mode is the answer — you keep your own training code but let SageMaker manage the infrastructure, container, and I/O channels around it.

Bring-your-own-container is reserved for edge cases: an unsupported framework, a specific CUDA/driver version, or a non-Python runtime. If a question is testing “least operational overhead,” BYOC is rarely correct — it’s the highest-maintenance option.
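
The selection logic above can be sketched as a small decision function. This is only an illustration of the "least overhead that still meets the requirement" rule of thumb; the `Scenario` fields and `pick_training_option` are hypothetical names, not part of any SageMaker API.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical flags summarizing what an exam scenario asks for."""
    custom_model_code: bool = False      # custom PyTorch/TensorFlow architecture
    unsupported_framework: bool = False  # no prebuilt framework container exists
    custom_system_deps: bool = False     # e.g. a specific CUDA/driver version
    non_python_runtime: bool = False     # training code is not Python


def pick_training_option(s: Scenario) -> str:
    """Return the lowest-overhead option that still satisfies the scenario."""
    # Only runtime-level requirements justify owning a whole container.
    if s.unsupported_framework or s.custom_system_deps or s.non_python_runtime:
        return "bring-your-own-container"
    # Custom model code alone fits a prebuilt framework container.
    if s.custom_model_code:
        return "script mode"
    # No unusual requirements: stay with a built-in algorithm.
    return "built-in algorithm"


print(pick_training_option(Scenario()))
print(pick_training_option(Scenario(custom_model_code=True)))
print(pick_training_option(Scenario(custom_model_code=True, custom_system_deps=True)))
```

The order of the checks mirrors the exam framing: BYOC is chosen only when something forces it, never as a default.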

SageMaker JumpStart deserves a mention here too: it’s the model hub for pretrained foundation models and common architectures, letting you fine-tune instead of training from scratch. On the exam, JumpStart is the answer whenever the scenario is “we want to fine-tune an existing large model quickly” rather than build one from zero.

Hyperparameter Tuning

SageMaker Automatic Model Tuning (AMT) runs multiple training jobs with different hyperparameter combinations and picks the best by an objective metric you define.

Strategy	How it searches	Trade-off
Grid search	Exhaustively tries every combination	Guaranteed coverage, expensive at scale
Random search	Samples combinations at random	Cheaper, surprisingly competitive with grid
Bayesian optimization	Uses prior results to pick the next combination	Fewest jobs needed, default AMT strategy
Hyperband	Early-stops poorly performing jobs	Best when training is expensive and many configs are clearly bad early
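
To make the cost difference between the first two rows concrete, here is a minimal sketch that counts grid combinations and draws a fixed budget of random trials. The search space values and the budget of 10 are made up for illustration; this runs locally and does not call AMT.

```python
import itertools
import random

# Hypothetical search space for a gradient-boosted tree model.
search_space = {
    "max_depth": [3, 5, 7, 9],
    "eta": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}

# Grid search: every combination becomes one training job.
grid = list(itertools.product(*search_space.values()))
print(len(grid))  # 4 * 4 * 3 = 48

# Random search: the job count is whatever budget you choose.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
budget = 10
random_trials = [
    {name: rng.choice(values) for name, values in search_space.items()}
    for _ in range(budget)
]
print(len(random_trials))  # 10
```

Adding a fourth hyperparameter with five candidate values would multiply the grid to 240 jobs, while the random budget stays at whatever you set.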

Bayesian optimization is the default and generally the right answer when a question asks “which strategy finds a good hyperparameter combination with the fewest training jobs.” Hyperband is the answer when the emphasis is on cost control for expensive training runs, since it kills bad trials before they finish.
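
As a rough sketch of where the strategy choice lives, the function below builds a dictionary shaped like the tuning-job configuration that the `CreateHyperParameterTuningJob` API accepts. The field names follow my understanding of that request shape and should be checked against the API reference; the metric name, ranges, and the `tuning_job_config` helper are assumptions for illustration. Grid and Hyperband are left out because they carry extra constraints not covered here.

```python
def tuning_job_config(strategy: str, max_jobs: int, max_parallel: int) -> dict:
    """Build a tuning config dict; nothing is sent to AWS."""
    if strategy not in {"Bayesian", "Random"}:
        raise ValueError("this sketch only covers Bayesian and Random")
    if max_parallel > max_jobs:
        raise ValueError("max_parallel cannot exceed max_jobs")
    return {
        "Strategy": strategy,
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": "validation:auc",  # assumed objective metric
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel,
        },
        "ParameterRanges": {
            # Range bounds are passed as strings in this request shape.
            "ContinuousParameterRanges": [
                {"Name": "eta", "MinValue": "0.01", "MaxValue": "0.3",
                 "ScalingType": "Logarithmic"},
            ],
            "IntegerParameterRanges": [
                {"Name": "max_depth", "MinValue": "3", "MaxValue": "10",
                 "ScalingType": "Linear"},
            ],
        },
    }


config = tuning_job_config("Bayesian", max_jobs=20, max_parallel=2)
print(config["Strategy"], config["ResourceLimits"])
```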

A subtlety worth internalizing: AMT parallelizes jobs, but too much parallelism with Bayesian optimization actually hurts it, because the strategy can’t learn from jobs that haven’t finished yet. If a scenario mentions “we want maximum parallelism,” that’s a mild signal toward random search instead.
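
The parallelism trade-off can be approximated with simple arithmetic. The sketch below treats a tuning job as strict batches, which is a simplification (in practice new jobs can start as earlier ones finish), and `feedback_rounds` is a hypothetical helper name.

```python
import math


def feedback_rounds(max_jobs: int, max_parallel: int) -> int:
    """Sequential batches needed: a rough proxy for how many times a
    Bayesian strategy gets to learn from finished jobs."""
    if max_jobs < 1 or max_parallel < 1:
        raise ValueError("both values must be at least 1")
    return math.ceil(max_jobs / max_parallel)


for parallel in (1, 2, 5, 20):
    print(parallel, feedback_rounds(20, parallel))
# 1 20
# 2 10
# 5 4
# 20 1
```

With 20 jobs all running at once there is a single round, so every configuration is chosen before any result exists, which is effectively random search.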

Distributed Training

Once a single GPU or a single instance can’t hold the model or the data, you split the work. Two strategies, and the exam wants you to know which problem each one solves.

DATA PARALLELISM                      MODEL PARALLELISM
(model fits on one device,            (model does NOT fit on one device)
 dataset is the bottleneck)

  GPU 1: full model, batch A          GPU 1: layers 1-10
  GPU 2: full model, batch B          GPU 2: layers 11-20
  GPU 3: full model, batch C          GPU 3: layers 21-30
         │                                    │
         ▼                                    ▼
  Gradients averaged/synced          Activations passed between
  across devices each step           devices in a pipeline

SageMaker’s distributed data parallelism (SMDDP) library optimizes the gradient-sync step specifically for AWS networking, and it’s the answer whenever a question is about scaling training throughput across many GPUs when the model already fits in a single device’s memory. The SageMaker model parallelism (SMP) library is the answer when the model itself (think large language models) is too big for one accelerator’s memory.
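
Both strategies can be shown in miniature with plain Python. This is a toy sketch, not how SMDDP or the model parallelism library is implemented: the one-weight model, the data, and the lambda "layers" are all invented for illustration.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the model y ≈ w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]
w = 0.5

# Data parallelism: every "GPU" has the full model (w) and its own shard.
shards = [data[0:2], data[2:4], data[4:6]]
local_grads = [mse_gradient(w, shard) for shard in shards]
synced_grad = sum(local_grads) / len(local_grads)  # the averaging/sync step

# With equal-sized shards, the synced gradient equals the full-batch gradient.
full_grad = mse_gradient(w, data)
print(abs(synced_grad - full_grad) < 1e-9)

# Model parallelism: each "GPU" holds only a slice of the layers, and
# activations flow from one device to the next.
stages = [
    [lambda a: a * 2, lambda a: a + 1],  # device 1
    [lambda a: a * 3],                   # device 2
    [lambda a: a - 4],                   # device 3
]


def pipeline_forward(x, stages):
    activation = x
    for stage in stages:        # hand-off between devices
        for layer in stage:     # layers resident on one device
            activation = layer(activation)
    return activation


print(pipeline_forward(1.0, stages))
```

The first half is why data parallelism scales throughput without changing the math; the second half is why model parallelism adds device-to-device communication on every forward pass.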

Training Infrastructure Choices

Instance selection is one of those areas where the exam expects current, practical judgment rather than memorized specs.

GPU instances (P and G families) — standard choice for deep learning training; P-series for the heaviest workloads, G-series for lighter or inference-leaning jobs
Trainium-based instances (Trn) — AWS’s purpose-built training silicon; by 2026 this is a mainstream cost-efficient option for large-scale deep learning training, especially when a workload can tolerate the extra engineering effort to compile against AWS Neuron
Inferentia-based instances (Inf) — purpose-built for inference, not training; if a question mentions Inf-series in the context of a training job, that’s the wrong-answer bait
CPU instances — perfectly fine for classical ML (XGBoost, linear models) that doesn’t need a GPU at all

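Managed Spot Training is a near-guaranteed exam topic: it runs training jobs on Spot capacity for up to 90% savings versus on-demand. When you configure a checkpoint location, SageMaker syncs the checkpoints your training script writes to S3 and, if the instance is reclaimed, restarts the job with those checkpoints restored, so you don’t lose all your progress. The checkpoint logic itself (saving state and loading it at startup) lives in your training script or in a built-in algorithm that supports checkpointing. The trade-off is unpredictable start times and possible interruption: fine for most training jobs, risky for anything with a hard deadline where you can’t tolerate delay.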
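
A minimal sketch of the Spot-related settings and the savings arithmetic follows. The dictionary keys mirror my understanding of the `CreateTrainingJob` request fields and should be verified against the API reference; the bucket name, the durations, and both helper functions are hypothetical.

```python
def spot_training_settings(checkpoint_s3_uri: str,
                           max_runtime_s: int,
                           max_wait_s: int) -> dict:
    """Build the Spot-related part of a training job request (no AWS call)."""
    # The wait budget covers runtime plus time spent waiting for capacity.
    if max_wait_s < max_runtime_s:
        raise ValueError("max_wait_s must be >= max_runtime_s")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
        "CheckpointConfig": {
            "S3Uri": checkpoint_s3_uri,
            "LocalPath": "/opt/ml/checkpoints",
        },
    }


def spot_savings_percent(training_seconds: int, billable_seconds: int) -> float:
    """Savings relative to paying for every second the job trained."""
    return 100 * (training_seconds - billable_seconds) / training_seconds


settings = spot_training_settings(
    "s3://example-bucket/checkpoints/",  # hypothetical bucket
    max_runtime_s=3600,
    max_wait_s=7200,
)
print(settings["StoppingCondition"])
print(spot_savings_percent(training_seconds=3600, billable_seconds=1080))  # 70.0
```

The gap between the runtime limit and the wait limit is the hard-deadline risk in numeric form: it is how long you are willing to sit idle waiting for Spot capacity.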

Managed Spot Training flow:

Training job starts ──► checkpoint saved every N steps ──► S3
        │
        ▼
Spot interruption (2-min warning)
        │
        ▼
Job automatically resumes from last checkpoint on new Spot capacity
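
The flow above can be simulated locally with a toy training loop. This sketch assumes the training code writes and reloads its own checkpoint file; a temporary directory stands in for the synced checkpoint location, and the step counts and `train` function are invented for illustration.

```python
import json
import os
import tempfile


def train(checkpoint_dir, total_steps, checkpoint_every, interrupt_at=None):
    """Toy loop that resumes from the latest checkpoint if one exists.

    Returns (start_step, end_step, finished)."""
    path = os.path.join(checkpoint_dir, "state.json")
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(path):            # resume instead of starting over
        with open(path) as f:
            state = json.load(f)
    start_step = state["step"]
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            return start_step, state["step"], False  # simulated reclaim
        state["step"] += 1
        state["loss"] *= 0.9            # stand-in for a real update
        if state["step"] % checkpoint_every == 0:
            with open(path, "w") as f:  # checkpoint every N steps
                json.dump(state, f)
    return start_step, state["step"], True


with tempfile.TemporaryDirectory() as ckpt_dir:
    first = train(ckpt_dir, total_steps=100, checkpoint_every=10, interrupt_at=37)
    second = train(ckpt_dir, total_steps=100, checkpoint_every=10)

print(first)   # (0, 37, False)
print(second)  # (30, 100, True)
```

The second run starts at step 30, not 37: work done after the last checkpoint is lost, which is why checkpoint frequency is a cost decision as much as a reliability one.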

Evaluation Metrics and Experiment Tracking

Choosing the right metric is inseparable from choosing the right training approach, and the exam frequently tests this pairing rather than metrics in isolation.

Problem type	Common metrics	Notes
Binary classification (balanced)	Accuracy, AUC-ROC	Fine when classes are roughly balanced
Binary classification (imbalanced)	Precision, Recall, F1, PR-AUC	Accuracy is misleading here
Multi-class classification	Macro/micro F1, confusion matrix	Macro F1 when classes matter equally
Regression	RMSE, MAE, R²	RMSE penalizes large errors more than MAE
Ranking/recommendation	NDCG, MAP	Order matters, not just correctness
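
Two notes in the table are easy to verify by hand. The sketch below uses made-up labels and errors: a model that always predicts the majority class on a 95/5 split, and two error sets with the same MAE but different RMSE.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Imbalanced data: 5 positives, 95 negatives, model always says "negative".
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
metrics = classification_metrics(y_true, always_negative)
print(metrics["accuracy"], metrics["recall"], metrics["f1"])  # 0.95 0.0 0.0


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


even_errors = [1, 1, 1, 1]   # four small misses
spiky_errors = [0, 0, 0, 4]  # one large miss
print(mae(even_errors), mae(spiky_errors))    # 1.0 1.0
print(rmse(even_errors), rmse(spiky_errors))  # 1.0 2.0
```

A 95% accurate model that finds none of the positives is the exact trap the imbalanced-classification row warns about.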

For tracking, SageMaker Experiments (built into Studio) automatically logs parameters, metrics, and artifacts for every training run tied to a pipeline, so you can compare runs side by side instead of relying on someone’s spreadsheet of results. This matters operationally as much as it matters for the exam: reproducibility questions almost always trace back to whether experiment metadata was captured at training time, not reconstructed after the fact.

Exam Focus: What Questions Test From This Step

Choosing built-in algorithm vs. script mode vs. bring-your-own-container based on flexibility needed
XGBoost as the default strong answer for tabular problems; DeepAR for time series
Bayesian optimization as AMT’s default tuning strategy, and when Hyperband or random search fits better
Data parallelism vs. model parallelism — matching the strategy to whether the model or the dataset is the bottleneck
Trainium for training cost efficiency vs. Inferentia for inference (do not swap these)
Managed Spot Training’s checkpoint/resume behavior and its cost/reliability trade-off
Matching evaluation metric to problem type, especially avoiding accuracy on imbalanced classification
SageMaker Experiments’ role in reproducibility and run comparison

Written by NPBlue Cloud Team, Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy. Spot an error? Let us know.