Step 3: Deployment & Orchestration
A trained model sitting in an S3 bucket has zero business value. This step is about the last mile: getting predictions in front of an application reliably, and building the plumbing so that the "last mile" happens automatically every time a new model is ready, not through someone manually clicking buttons in the console.
Picking an Inference Pattern
The exam almost always gives you a scenario with a latency requirement, a traffic pattern, and a cost constraint, and expects you to pick the deployment type that satisfies all three.

| Real-Time Endpoint | Serverless Inference | Asynchronous Inference | Batch Transform |
|---|---|---|---|
| Always-on instance, persistent; latency < 100 ms typical | Scales to zero, pay per invoke; good for spiky/intermittent load | Queues requests; handles large payloads (up to 1GB) and long processing times | No endpoint at all; runs over a whole dataset once, writes output to S3 |

Real-time endpoints are the answer when a question emphasizes sub-second latency and steady traffic: think a recommendation engine on a live product page. Serverless inference fits when traffic is unpredictable or bursty and idle cost matters more than cold-start latency, which makes it a good fit for an internal tool used sporadically. Asynchronous inference is the one people forget: it exists for large payloads or slow models (think processing a video frame-by-frame) where you don't want the caller blocked waiting, but you still want a managed endpoint rather than a batch job. Batch Transform is for offline scoring of an entire dataset with no need for a persistent endpoint at all, and it is the cheapest option when you don't need predictions in real time.
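The selection logic above can be sketched as a small decision function. This is a study aid under stated assumptions, not an AWS API: the `Scenario` fields, the function name, and the two cut-off constants are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Illustrative cut-offs chosen for this sketch; they are assumptions,
# not quoted AWS service limits.
ASSUMED_MAX_SYNC_PAYLOAD_MB = 5.0
ASSUMED_MAX_SYNC_SECONDS = 60.0


@dataclass
class Scenario:
    offline_dataset: bool      # scoring a whole dataset with no live caller?
    payload_mb: float          # size of a single request
    processing_seconds: float  # time the model needs per request
    steady_traffic: bool       # steady load vs. spiky or intermittent


def pick_inference_pattern(s: Scenario) -> str:
    """Map a scenario to one of the four deployment types."""
    if s.offline_dataset:
        return "batch transform"
    # Large payloads or slow models: queue the request instead of blocking.
    if (s.payload_mb > ASSUMED_MAX_SYNC_PAYLOAD_MB
            or s.processing_seconds > ASSUMED_MAX_SYNC_SECONDS):
        return "asynchronous inference"
    # Steady traffic justifies paying for always-on instances.
    if s.steady_traffic:
        return "real-time endpoint"
    # Spiky or intermittent traffic: accept cold starts, avoid idle cost.
    return "serverless inference"
```

Reading exam scenarios in this order (offline first, then payload and duration, then traffic shape) tends to eliminate the wrong options quickly.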
Multi-Model and Multi-Container Endpoints
Running one endpoint per model gets expensive fast when you have hundreds of similar models (a common pattern: one model per customer, or one per store location). Multi-Model Endpoints (MME) solve this by hosting many models behind a single endpoint, loading them into memory on demand and evicting the least-recently-used ones when memory is tight.

    Single Endpoint (instance fleet)
    ├── model_customer_001.tar.gz ──┐
    ├── model_customer_002.tar.gz   ├─ loaded/unloaded dynamically
    ├── model_customer_003.tar.gz   │  from a shared S3 prefix
    └── ... (hundreds more) ────────┘

Multi-Container Endpoints solve a different problem: running distinct models (potentially different frameworks) behind one endpoint, invoked either directly or as a serial inference pipeline where the output of one container feeds the next. This is useful for a preprocessing container feeding a model container feeding a postprocessing container.
If a scenario says "many nearly identical models, cost-sensitive, infrequent per-model traffic," that's MME. If it says "different model types chained together" or "route by target container," that's multi-container.
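From the caller's side, the difference shows up as one extra argument on the invoke call. The sketch below only builds the keyword arguments that could be passed to the SageMaker Runtime `invoke_endpoint` call in boto3; `TargetModel` and `TargetContainerHostname` are the request parameters as I understand them, while the helper names, endpoint names, and payload are hypothetical.

```python
import json
from typing import Any, Dict


def mme_request(endpoint_name: str, model_artifact: str,
                payload: Dict[str, Any]) -> Dict[str, Any]:
    """Arguments for invoking one model on a Multi-Model Endpoint.

    model_artifact is the artifact path relative to the shared S3 prefix,
    e.g. "model_customer_002.tar.gz".
    """
    return {
        "EndpointName": endpoint_name,
        "TargetModel": model_artifact,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


def multi_container_request(endpoint_name: str, container_hostname: str,
                            payload: Dict[str, Any]) -> Dict[str, Any]:
    """Arguments for directly invoking one container on a multi-container endpoint."""
    return {
        "EndpointName": endpoint_name,
        "TargetContainerHostname": container_hostname,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


# Assumed usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# runtime.invoke_endpoint(**mme_request("per-customer", "model_customer_002.tar.gz", {"x": 1}))
```

A serial inference pipeline needs neither argument, since the request simply flows through the containers in order.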
SageMaker Pipelines for Orchestration
Pipelines is SageMaker's native CI/CD-for-ML orchestration tool: a directed acyclic graph of steps defined in the Python SDK, version-controlled like any other code artifact.

    ┌────────────┐    ┌───────────┐    ┌────────────┐    ┌─────────────┐
    │ Processing │───►│ Training  │───►│ Evaluation │───►│ Condition   │
    │ Step       │    │ Step      │    │ Step       │    │ (metric >   │
    └────────────┘    └───────────┘    └────────────┘    │ threshold?) │
                                                         └──────┬──────┘
                                                        yes ┌───┴───┐ no
                                                            ▼       ▼
                                                 Register Model   Fail /
                                               in Model Registry  Notify

Each step's inputs and outputs are tracked automatically, which gives you lineage for free: you can always answer "which data and code produced this exact model artifact." The condition step is what makes this genuinely CI/CD rather than just a script: a model that doesn't beat your accuracy threshold never reaches the registry, let alone production.
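To make the gating idea concrete, here is a plain-Python simulation of the flow. It deliberately does not use the SageMaker Pipelines SDK; the function, the step lambdas, and the metric values are hypothetical stand-ins that only mimic the order of steps and the condition check.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_gated_pipeline(steps: List[Step], metric: str,
                       threshold: float) -> Dict[str, Any]:
    """Run steps in order, record which ran, then gate on one metric."""
    lineage: List[str] = []
    output: Any = None
    for name, fn in steps:
        output = fn(output)   # each step consumes the previous step's output
        lineage.append(name)  # crude stand-in for automatic lineage tracking
    passed = output[metric] > threshold
    return {
        "lineage": lineage,
        "metrics": output,
        "action": "register" if passed else "fail",
    }


# Hypothetical steps with made-up outputs, for illustration only.
example_steps: List[Step] = [
    ("processing", lambda _: {"rows": 1000}),
    ("training", lambda data: {"model": "model.tar.gz", "rows": data["rows"]}),
    ("evaluation", lambda model: {"accuracy": 0.91}),
]

result = run_gated_pipeline(example_steps, metric="accuracy", threshold=0.90)
```

With the same steps and a threshold of 0.95, the action becomes `"fail"` and nothing would be registered, which is the whole point of the condition step.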
Model Registry and Approval Workflows
The SageMaker Model Registry is the catalog of model versions grouped into model package groups, each version carrying its evaluation metrics, its lineage back to the training job, and an approval status: PendingManualApproval, Approved, or Rejected.
This status field is the trigger point for CI/CD. A common pattern: an EventBridge rule watches for a model package's status changing to Approved and kicks off a deployment pipeline (often via CodePipeline/CodeBuild) that updates a SageMaker endpoint with the new model, with no human touching the endpoint configuration directly.
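A rule like that is driven by an event pattern. The sketch below shows what such a pattern might look like, plus a tiny local matcher for reasoning about it. The `source`, `detail-type`, and `detail` field names reflect my understanding of SageMaker's model package state change events and should be checked against the current event reference; the group name `churn-models` and the `matches` helper are hypothetical, and the helper covers only exact-value matching, a small subset of real EventBridge semantics.

```python
from typing import Any, Dict

# Assumed event pattern for "a model package in this group became Approved".
APPROVED_MODEL_PATTERN: Dict[str, Any] = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {
        "ModelPackageGroupName": ["churn-models"],  # hypothetical group name
        "ModelApprovalStatus": ["Approved"],
    },
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Exact-value matching only: lists mean 'any of', dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Filtering on the approval status inside the pattern means the deployment pipeline never starts for `PendingManualApproval` or `Rejected` versions.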
| Approval mode | How it works | When to use |
|---|---|---|
| Manual approval | A human reviews metrics and clicks approve | Regulated environments, high-stakes models |
| Automatic approval | A Lambda or pipeline step approves if metrics clear a bar | High-velocity teams, well-tested evaluation criteria |
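The automatic mode in the table could be implemented along these lines. This is a minimal sketch, assuming the evaluation metrics are already available as a dictionary; the helper names and thresholds are hypothetical. The returned dictionary uses the argument names of the SageMaker `update_model_package` call as I understand them.

```python
from typing import Dict


def approval_status(metrics: Dict[str, float],
                    minimums: Dict[str, float]) -> str:
    """Approve only if every required metric is present and clears its bar."""
    for name, minimum in minimums.items():
        if name not in metrics or metrics[name] < minimum:
            return "Rejected"
    return "Approved"


def build_update_request(model_package_arn: str,
                         metrics: Dict[str, float],
                         minimums: Dict[str, float]) -> Dict[str, str]:
    """Keyword arguments intended for update_model_package."""
    status = approval_status(metrics, minimums)
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": status,
        "ApprovalDescription": f"Automated gate: {status} on {sorted(minimums)}",
    }


# Assumed usage inside a Lambda or pipeline step (needs boto3 and credentials):
# import boto3
# boto3.client("sagemaker").update_model_package(**build_update_request(arn, metrics, minimums))
```

Treating a missing metric as a failure is a deliberate choice here: an incomplete evaluation report should not slip through as an approval.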
A/B Testing and Shadow Deployments
Rolling out a new model version blind is a good way to find out about a regression from your users instead of from your metrics. SageMaker supports a few ways to de-risk this.
Production variants let a single endpoint split traffic across multiple model versions by weight (90% to the incumbent, 10% to the challenger), and you compare business or accuracy metrics before shifting more traffic. This is the standard mechanism for A/B testing on SageMaker endpoints.
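As a rough illustration, variant weights are relative, so each variant's traffic share is its weight divided by the sum of all weights. The sketch below builds a `ProductionVariants` list in the shape I understand `create_endpoint_config` to accept and computes the resulting shares; the helper names, model names, and instance type are hypothetical.

```python
from typing import Any, Dict, List


def production_variants(weights_by_model: Dict[str, float],
                        instance_type: str = "ml.m5.large") -> List[Dict[str, Any]]:
    """One variant per model; the variant name is derived from the model name."""
    return [
        {
            "VariantName": f"variant-{model_name}",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": float(weight),
        }
        for model_name, weight in weights_by_model.items()
    ]


def traffic_shares(variants: List[Dict[str, Any]]) -> Dict[str, float]:
    """Fraction of requests each variant receives: weight / sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


variants = production_variants({"incumbent-v1": 9, "challenger-v2": 1})
shares = traffic_shares(variants)
```

Weights of 9 and 1 give the same 90/10 split as 0.9 and 0.1, since only the ratio matters.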
Shadow deployments go further: the new model receives a copy of live traffic and returns predictions that are logged but never shown to the user, so you can validate the challenger against real traffic with zero user-facing risk before it ever gets a production traffic share.

                      ┌──────────────► Model A (90%) ──► response to user
    Incoming request ─┤
                      └──────────────► Model B (10%) ──► response to user
                                          │ (or, in shadow mode)
                                          ▼
                                  Model B (shadow) ──► logged only, never returned

Canary deployment is the related pattern where you shift a small percentage of traffic to the new version and increase it gradually as confidence builds. Conceptually, it is a slower-moving cousin of the A/B split.
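A gradual shift can be thought of as a sequence of weight updates. The sketch below generates that sequence as keyword arguments shaped for the SageMaker `update_endpoint_weights_and_capacities` call, as I understand its parameters; the function name, the default percentages, and the variant names are hypothetical, and a real rollout would check metrics or alarms between steps.

```python
from typing import Any, Dict, List, Sequence


def canary_schedule(endpoint_name: str, incumbent: str, challenger: str,
                    challenger_pcts: Sequence[int] = (5, 25, 50, 100)
                    ) -> List[Dict[str, Any]]:
    """One weight-update request per stage of the rollout."""
    pcts = list(challenger_pcts)
    if not pcts or pcts != sorted(set(pcts)):
        raise ValueError("percentages must be non-empty and strictly increasing")
    if pcts[0] <= 0 or pcts[-1] > 100:
        raise ValueError("percentages must lie in (0, 100]")
    return [
        {
            "EndpointName": endpoint_name,
            "DesiredWeightsAndCapacities": [
                {"VariantName": incumbent, "DesiredWeight": float(100 - pct)},
                {"VariantName": challenger, "DesiredWeight": float(pct)},
            ],
        }
        for pct in pcts
    ]


stages = canary_schedule("recs-endpoint", "model-a", "model-b")
```

Rolling back at any stage amounts to replaying an earlier entry in the list, which is part of why small steps are attractive.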
Infrastructure as Code for ML Resources
Manually clicking through the console to create endpoints, pipelines, and feature groups doesn't scale past a handful of models, and it's an easy way to lose track of what's actually deployed. The exam expects familiarity with:
- AWS CloudFormation / CDK: define endpoints, IAM roles, and pipeline resources as code, deployed through the same review process as application code
- SageMaker Projects: MLOps templates that scaffold a CodeCommit/CodePipeline/CodeBuild setup pre-wired to SageMaker Pipelines and Model Registry, giving teams a standard starting point instead of building CI/CD from scratch each time
- Terraform: a common alternative to CloudFormation for teams already standardized on it, fully capable of managing SageMaker resources
The underlying principle the exam is testing: reproducible infrastructure and reproducible models go hand in hand. If your endpoint configuration lives only in console click-history, you can't roll back with confidence.
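To illustrate what "endpoint configuration as code" can mean in practice, the sketch below generates a small CloudFormation template as JSON. The resource types and property names follow the CloudFormation SageMaker resources as I understand them and should be verified against the resource reference; the function name, logical IDs, variant name, and instance type are hypothetical.

```python
import json
from typing import Any, Dict


def endpoint_template(variant_name: str = "AllTraffic",
                      instance_type: str = "ml.m5.large") -> Dict[str, Any]:
    """A template with one endpoint config and one endpoint that uses it."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        # The model name is a parameter so the same template serves every version.
        "Parameters": {"ModelName": {"Type": "String"}},
        "Resources": {
            "EndpointConfig": {
                "Type": "AWS::SageMaker::EndpointConfig",
                "Properties": {
                    "ProductionVariants": [{
                        "VariantName": variant_name,
                        "ModelName": {"Ref": "ModelName"},
                        "InstanceType": instance_type,
                        "InitialInstanceCount": 1,
                        "InitialVariantWeight": 1.0,
                    }]
                },
            },
            "Endpoint": {
                "Type": "AWS::SageMaker::Endpoint",
                "Properties": {
                    "EndpointConfigName": {
                        "Fn::GetAtt": ["EndpointConfig", "EndpointConfigName"]
                    }
                },
            },
        },
    }


# sort_keys keeps the output stable, so diffs in code review show real changes only.
template_json = json.dumps(endpoint_template(), indent=2, sort_keys=True)
```

Because the template is generated deterministically and lives in version control, rolling back means redeploying an earlier commit rather than reconstructing console settings from memory.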
Exam Focus: What Questions Test From This Step
- Matching inference pattern (real-time, serverless, async, batch) to latency and traffic requirements
- Multi-Model Endpoints for many similar models vs. multi-container for chained/different models
- SageMaker Pipelines as a DAG with a condition step gating model registration
- Model Registry approval statuses as the trigger for automated deployment (EventBridge + CodePipeline pattern)
- Production variants for A/B traffic splitting vs. shadow deployment for zero-risk validation
- Canary rollout as gradual traffic shifting
- SageMaker Projects and IaC (CloudFormation/CDK/Terraform) as the standard for reproducible ML infrastructure