Step 3 — Deployment & Orchestration

A trained model sitting in an S3 bucket has zero business value. This step is about the last mile — getting predictions in front of an application reliably, and building the plumbing so that “last mile” happens automatically every time a new model is ready, not through someone manually clicking buttons in the console.

Picking an Inference Pattern

The exam almost always gives you a scenario with a latency requirement, a traffic pattern, and a cost constraint, and expects you to pick the deployment type that satisfies all three.

┌────────────────────┬───────────────────┬──────────────────┬─────────────────┐
│ Real-Time          │ Serverless        │ Asynchronous     │ Batch           │
│ Endpoint           │ Inference         │ Inference        │ Transform       │
├────────────────────┼───────────────────┼──────────────────┼─────────────────┤
│ Always-on instance │ Scales to zero,   │ Queues requests, │ No endpoint at  │
│ consistent latency │ pay per invoke,   │ handles large    │ all; runs over  │
│ < 100 ms typical   │ good for spiky/   │ payloads (up to  │ a whole dataset │
│                    │ intermittent load │ 1 GB) and long   │ once, writes    │
│                    │                   │ processing times │ output to S3    │
└────────────────────┴───────────────────┴──────────────────┴─────────────────┘

Real-time endpoints are the answer when a question emphasizes sub-second latency and steady traffic — think a recommendation engine on a live product page. Serverless inference fits when traffic is unpredictable or bursty and idle cost matters more than cold-start latency — a good fit for an internal tool used sporadically. Asynchronous inference is the one people forget: it exists for large payloads or slow models (think processing a video frame-by-frame) where you don’t want the caller blocked waiting, but you still want a managed endpoint rather than a batch job. Batch Transform is for offline scoring of an entire dataset with no need for a persistent endpoint at all — cheapest option when you don’t need predictions in real time.
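The selection logic above can be sketched as a small decision function. This is a study aid rather than any AWS API: the `Workload` fields, the function name, and both thresholds are assumptions chosen for illustration, so check current SageMaker quotas before treating the numbers as limits.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; all field names are illustrative."""
    needs_online_response: bool   # does a caller need each prediction on request?
    payload_mb: float             # typical request size
    processing_seconds: float     # typical model processing time per request
    traffic_is_steady: bool       # steady load vs. spiky or intermittent


def pick_inference_pattern(w: Workload,
                           large_payload_mb: float = 6.0,
                           slow_seconds: float = 60.0) -> str:
    """Map a workload to one of the four patterns in the table above.

    The two thresholds are illustrative defaults, not authoritative AWS limits.
    """
    if not w.needs_online_response:
        return "batch-transform"    # offline scoring, no persistent endpoint
    if w.payload_mb > large_payload_mb or w.processing_seconds > slow_seconds:
        return "asynchronous"       # queue the request so the caller is not blocked
    if w.traffic_is_steady:
        return "real-time"          # always-on instances, lowest latency
    return "serverless"             # scales to zero, accepts cold starts


recommendations = Workload(True, 0.01, 0.05, True)     # live product page
internal_tool = Workload(True, 0.01, 0.2, False)       # sporadic internal use
video_job = Workload(True, 500.0, 300.0, False)        # large, slow requests
nightly_scoring = Workload(False, 50.0, 1.0, True)     # whole-dataset scoring

for name, workload in [("recommendations", recommendations),
                       ("internal_tool", internal_tool),
                       ("video_job", video_job),
                       ("nightly_scoring", nightly_scoring)]:
    print(name, "->", pick_inference_pattern(workload))
```

The order of the checks mirrors how exam scenarios usually resolve: rule out the need for an endpoint first, then payload size and processing time, and only then weigh steady traffic against idle cost.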

Multi-Model and Multi-Container Endpoints

Running one endpoint per model gets expensive fast when you have hundreds of similar models (a common pattern: one model per customer, or one per store location). Multi-Model Endpoints (MME) solve this by hosting many models behind a single endpoint, loading them into memory on demand and evicting the least-recently-used ones when memory is tight.

Single Endpoint (instance fleet)
├── model_customer_001.tar.gz  ─┐
├── model_customer_002.tar.gz   ├─ loaded/unloaded dynamically
├── model_customer_003.tar.gz   │  from a shared S3 prefix
└── ... (hundreds more)        ─┘
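The load-on-demand and least-recently-used eviction behavior can be illustrated with a toy cache. This is a simulation in plain Python, not SageMaker's actual implementation: the `ModelCache` class and its `capacity` parameter are hypothetical, and a real endpoint evicts based on available memory rather than a fixed model count.

```python
from collections import OrderedDict


class ModelCache:
    """Toy model of MME-style loading: keep at most `capacity` models in memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._loaded = OrderedDict()   # model name -> stand-in for a loaded model
        self.loads = 0                 # number of fetches from storage

    def invoke(self, model_name: str) -> str:
        if model_name in self._loaded:
            self._loaded.move_to_end(model_name)   # mark as most recently used
            return "warm"
        if len(self._loaded) >= self.capacity:
            self._loaded.popitem(last=False)       # evict the least recently used
        self._loaded[model_name] = object()        # stand-in for loading from S3
        self.loads += 1
        return "cold"

    def loaded_models(self) -> list:
        return list(self._loaded)


cache = ModelCache(capacity=2)
requests = [
    "model_customer_001.tar.gz",
    "model_customer_002.tar.gz",
    "model_customer_001.tar.gz",   # already in memory
    "model_customer_003.tar.gz",   # forces an eviction
]
results = [cache.invoke(name) for name in requests]
print(results)
print(cache.loaded_models())
```

On a real Multi-Model Endpoint the caller selects the model per request with the `TargetModel` parameter of `invoke_endpoint`. The first request to a model that is not in memory pays a cold-load penalty, which is why MME suits infrequent per-model traffic.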

Multi-Container Endpoints solve a different problem: running distinct models (potentially different frameworks) behind one endpoint, invoked either directly or as a serial inference pipeline where the output of one container feeds the next — useful for a preprocessing container feeding a model container feeding a postprocessing container.

If a scenario says “many nearly identical models, cost-sensitive, infrequent per-model traffic,” that’s MME. If it says “different model types chained together” or “route by target container,” that’s multi-container.
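For the multi-container case, the difference between a serial pipeline and direct invocation comes down to one setting on the model definition. The helper below only builds a request dictionary shaped for boto3's SageMaker `create_model` call and does not call AWS. The helper name, the input dictionary keys, the account ID, image URIs, and role ARN are all placeholders invented for this sketch.

```python
def build_multi_container_model(model_name: str, role_arn: str,
                                containers: list, mode: str = "Serial") -> dict:
    """Build keyword arguments shaped for the SageMaker `create_model` call.

    `containers` is a list of dicts with hypothetical keys:
    'hostname', 'image', and optionally 'model_data_url'.
    """
    if mode not in ("Serial", "Direct"):
        raise ValueError("mode must be 'Serial' or 'Direct'")
    definitions = []
    for c in containers:
        definition = {"ContainerHostname": c["hostname"], "Image": c["image"]}
        if "model_data_url" in c:                  # preprocessing may ship no artifact
            definition["ModelDataUrl"] = c["model_data_url"]
        definitions.append(definition)
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "Containers": definitions,                 # order matters in Serial mode
        "InferenceExecutionConfig": {"Mode": mode},
    }


# Placeholder registry prefix, bucket, and role; none of these exist.
REGISTRY = "111122223333.dkr.ecr.us-east-1.amazonaws.com"
request = build_multi_container_model(
    model_name="fraud-pipeline",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
    containers=[
        {"hostname": "preprocess", "image": f"{REGISTRY}/preprocess:latest"},
        {"hostname": "model", "image": f"{REGISTRY}/xgboost:latest",
         "model_data_url": "s3://example-bucket/model.tar.gz"},
        {"hostname": "postprocess", "image": f"{REGISTRY}/postprocess:latest"},
    ],
    mode="Serial",
)
print([c["ContainerHostname"] for c in request["Containers"]])
```

With `Direct` mode the same containers are invoked individually, and the caller chooses one per request through the `TargetContainerHostname` parameter of `invoke_endpoint`.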

SageMaker Pipelines for Orchestration

Pipelines is SageMaker’s native CI/CD-for-ML orchestration tool — a directed acyclic graph of steps defined in the Python SDK, version-controlled like any other code artifact.

   ┌────────────┐    ┌───────────┐    ┌────────────┐    ┌─────────────┐
   │ Processing │───►│ Training  │───►│ Evaluation │───►│  Condition   │
   │   Step     │    │   Step    │    │   Step     │    │  (metric >   │
   └────────────┘    └───────────┘    └────────────┘    │  threshold?) │
                                                          └──────┬──────┘
                                                       yes │     │ no
                                                            ▼     ▼
                                                  Register Model   Fail /
                                                  in Model Registry Notify

Each step’s inputs and outputs are tracked automatically, which gives you lineage for free — you can always answer “which data and code produced this exact model artifact.” The condition step is what makes this genuinely CI/CD rather than just a script: a model that doesn’t beat your accuracy threshold never reaches the registry, let alone production.
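The gating behavior of the condition step can be shown without the SageMaker SDK. The sketch below is plain Python that mimics the flow in the diagram; `run_pipeline`, the step functions, and the hard-coded accuracy are hypothetical stand-ins. In the real SDK the equivalent pieces live under `sagemaker.workflow` (for example `ProcessingStep`, `TrainingStep`, and `ConditionStep`).

```python
from typing import Callable, Dict, List


def run_pipeline(steps: List[Callable[[Dict], Dict]], threshold: float) -> Dict:
    """Run steps in order, then gate registration on the evaluation metric."""
    context: Dict = {"lineage": []}
    for step in steps:
        context.update(step(context))              # each step adds its outputs
        context["lineage"].append(step.__name__)   # record which step ran
    # Condition step: only a model that beats the threshold gets registered.
    if context["accuracy"] > threshold:
        context["outcome"] = "registered"
    else:
        context["outcome"] = "failed"
    return context


def process(ctx: Dict) -> Dict:
    return {"dataset": "clean-data"}


def train(ctx: Dict) -> Dict:
    return {"model": f"model-from-{ctx['dataset']}"}


def evaluate(ctx: Dict) -> Dict:
    return {"accuracy": 0.91}                      # hard-coded stand-in metric


passing = run_pipeline([process, train, evaluate], threshold=0.90)
failing = run_pipeline([process, train, evaluate], threshold=0.95)
print(passing["outcome"], passing["lineage"])
print(failing["outcome"])
```

The `lineage` list is a crude stand-in for the tracking that Pipelines does automatically: it records which steps produced the artifact that the gate is about to accept or refuse.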

Model Registry and Approval Workflows

The SageMaker Model Registry is the catalog of model versions grouped into model package groups, each version carrying its evaluation metrics, its lineage back to the training job, and an approval status: PendingManualApproval, Approved, or Rejected.

This status field is the trigger point for CI/CD. A common pattern: an EventBridge rule watches for a model package’s status changing to Approved and kicks off a deployment pipeline (often via CodePipeline/CodeBuild) that updates a SageMaker endpoint with the new model — no human touching the endpoint configuration directly.
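The EventBridge side of that pattern is an event pattern that matches only approved model packages. The sketch below defines such a pattern as a Python dictionary and tests it with a deliberately simplified matcher. The group name `churn-models` and the sample events are hypothetical, and the `matches` function handles only exact-value lists, not the full EventBridge pattern syntax.

```python
import json

# Event pattern for a rule that fires when a model package becomes Approved.
APPROVED_MODEL_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {
        "ModelPackageGroupName": ["churn-models"],   # hypothetical group name
        "ModelApprovalStatus": ["Approved"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must exist and hold an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def sample_event(status: str) -> dict:
    """Build a trimmed-down, hypothetical event for local testing."""
    return {
        "source": "aws.sagemaker",
        "detail-type": "SageMaker Model Package State Change",
        "detail": {
            "ModelPackageGroupName": "churn-models",
            "ModelApprovalStatus": status,
        },
    }


print(json.dumps(APPROVED_MODEL_PATTERN, indent=2))
print(matches(APPROVED_MODEL_PATTERN, sample_event("Approved")))
print(matches(APPROVED_MODEL_PATTERN, sample_event("PendingManualApproval")))
```

Filtering on the status inside the rule keeps the deployment pipeline from starting on every registry change, so a rejected or still-pending version never triggers a build.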

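┌────────────────────┬───────────────────────────┬─────────────────────────┐
│ Approval mode      │ How it works              │ When to use             │
├────────────────────┼───────────────────────────┼─────────────────────────┤
│ Manual approval    │ A human reviews metrics   │ Regulated environments, │
│                    │ and clicks approve        │ high-stakes models      │
├────────────────────┼───────────────────────────┼─────────────────────────┤
│ Automatic approval │ A Lambda or pipeline step │ High-velocity teams,    │
│                    │ approves if metrics clear │ well-tested evaluation  │
│                    │ a bar                     │ criteria                │
└────────────────────┴───────────────────────────┴─────────────────────────┘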
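The two approval modes differ only in what happens once the metrics clear the bar. The sketch below shows one way a Lambda or pipeline step might make that decision and shape the arguments for the SageMaker `update_model_package` call. The function names, metric names, thresholds, and ARN are hypothetical, and marking a failing model `Rejected` is an assumption of this sketch rather than a required behavior.

```python
def decide_approval(metrics: dict, thresholds: dict, auto_approve: bool) -> str:
    """Return one of the three Model Registry approval statuses."""
    passed = all(
        metrics.get(name, float("-inf")) >= minimum   # a missing metric fails
        for name, minimum in thresholds.items()
    )
    if not passed:
        return "Rejected"
    return "Approved" if auto_approve else "PendingManualApproval"


def build_update_request(model_package_arn: str, status: str, reason: str) -> dict:
    """Keyword arguments shaped for the `update_model_package` call."""
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": status,
        "ApprovalDescription": reason,
    }


thresholds = {"accuracy": 0.90, "auc": 0.85}       # illustrative bar
good = {"accuracy": 0.93, "auc": 0.88}
weak = {"accuracy": 0.93, "auc": 0.80}

auto_status = decide_approval(good, thresholds, auto_approve=True)
manual_status = decide_approval(good, thresholds, auto_approve=False)
weak_status = decide_approval(weak, thresholds, auto_approve=True)

# Placeholder ARN; this model package does not exist.
update = build_update_request(
    "arn:aws:sagemaker:us-east-1:111122223333:model-package/churn-models/7",
    auto_status,
    "Metrics cleared the automated bar",
)
print(auto_status, manual_status, weak_status)
```

Keeping the decision separate from the API call makes the approval rule easy to unit test, which matters when that rule is the only thing standing between a training run and production.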

A/B Testing and Shadow Deployments

Rolling out a new model version blind is a good way to find out about a regression from your users instead of from your metrics. SageMaker supports a few ways to de-risk this.

Production variants let a single endpoint split traffic across multiple model versions by weight — 90% to the incumbent, 10% to the challenger — and you compare business or accuracy metrics before shifting more traffic. This is the standard mechanism for A/B testing on SageMaker endpoints.
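A 90/10 split is expressed through variant weights in the endpoint configuration, and each variant's share is its weight divided by the sum of all weights. The sketch below builds a dictionary shaped for the SageMaker `create_endpoint_config` call and computes the resulting shares. The helper names, model names, and the `ml.m5.large` instance type are placeholder choices for illustration.

```python
def build_ab_endpoint_config(config_name: str, variants: list,
                             instance_type: str = "ml.m5.large") -> dict:
    """Build `create_endpoint_config` arguments.

    `variants` is a list of (variant_name, model_name, weight) tuples.
    """
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": variant_name,
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "InitialVariantWeight": float(weight),
            }
            for variant_name, model_name, weight in variants
        ],
    }


def traffic_shares(config: dict) -> dict:
    """Each variant's share is its weight over the sum of all weights."""
    variants = config["ProductionVariants"]
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


config = build_ab_endpoint_config(
    "churn-ab-test",
    [("incumbent", "churn-model-v1", 90), ("challenger", "churn-model-v2", 10)],
)
shares = traffic_shares(config)
print(shares)
```

Because shares are relative, weights of 9 and 1 produce the same split as 90 and 10, so shifting more traffic to the challenger only requires changing the weights.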

Shadow deployments go further: the new model receives a copy of live traffic and returns predictions that are logged but never shown to the user, so you can validate the challenger against real traffic with zero user-facing risk before it ever gets a production traffic share.

                     ┌──────────────►  Model A (90%) ──► response to user
Incoming request ────┤
                     └──────────────►  Model B (10%) ──► response to user
                                              │
                                     (or, in shadow mode)
                                              ▼
                                     Model B (shadow) ──► logged only,
                                                           never returned
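The shadow path in the diagram can be simulated in a few lines. This is a conceptual sketch in plain Python, not how SageMaker implements it: `serve_with_shadow`, the two stand-in models, and the log format are hypothetical. On a real endpoint the shadow model is declared through the `ShadowProductionVariants` setting of the endpoint configuration.

```python
def serve_with_shadow(request, production_model, shadow_model, shadow_log: list):
    """Return the production prediction; record the shadow prediction only."""
    response = production_model(request)
    try:
        entry = {"request": request, "production": response,
                 "shadow": shadow_model(request)}
    except Exception as exc:                       # a shadow failure must not reach users
        entry = {"request": request, "production": response,
                 "shadow_error": repr(exc)}
    shadow_log.append(entry)
    return response                                # the caller never sees shadow output


def model_a(x):                                    # stand-in incumbent model
    return x * 2


def model_b(x):                                    # stand-in challenger model
    return x * 2 + 1


def broken_model(x):                               # challenger that crashes
    raise RuntimeError("model failed to load")


log = []
answers = [serve_with_shadow(x, model_a, model_b, log) for x in (1, 2, 3)]
crash_answer = serve_with_shadow(4, model_a, broken_model, log)
disagreements = sum(1 for e in log if e.get("shadow") != e["production"])
print(answers, crash_answer, disagreements)
```

Comparing the logged shadow predictions against the production ones is what provides the validation signal, and a crash in the challenger shows up in the log rather than in a user-facing error.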

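Canary deployment is the related pattern where you shift a small percentage of traffic to the new version first, watch its metrics, and raise its share only as confidence builds. Where an A/B split holds its weights steady to compare two models, a canary is a staged cutover whose end state is 100% of traffic on the new version.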
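A canary rollout with automatic rollback can be sketched as a loop over traffic percentages. The function below is an illustration in plain Python: `canary_rollout`, the step sizes, the error-rate callbacks, and the 5% ceiling are all assumptions, and a real rollout would read the error rate from monitoring alarms rather than from a function.

```python
from typing import Callable, List, Tuple


def canary_rollout(traffic_steps: List[int],
                   error_rate_at: Callable[[int], float],
                   max_error_rate: float) -> Tuple[int, list]:
    """Walk through traffic percentages; roll back to 0 on an unhealthy step.

    Returns the final traffic percentage on the new version and the history
    of (percent, observed_error_rate) pairs.
    """
    history = []
    for percent in traffic_steps:
        observed = error_rate_at(percent)          # stand-in for a metrics check
        history.append((percent, observed))
        if observed > max_error_rate:
            return 0, history                      # roll back to the old version
    final = traffic_steps[-1] if traffic_steps else 0
    return final, history


def healthy(percent: int) -> float:
    return 0.01                                    # constant low error rate


def regresses_under_load(percent: int) -> float:
    return 0.01 if percent < 50 else 0.08          # breaks at half the traffic


steps = [10, 25, 50, 100]
ok_final, ok_history = canary_rollout(steps, healthy, max_error_rate=0.05)
bad_final, bad_history = canary_rollout(steps, regresses_under_load, max_error_rate=0.05)
print(ok_final, bad_final, bad_history)
```

The second run shows why the small first step matters: the regression only appears at 50% of traffic, and the rollout stops there instead of exposing every user to it.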

Infrastructure as Code for ML Resources

Manually clicking through the console to create endpoints, pipelines, and feature groups doesn’t scale past a handful of models, and it’s an easy way to lose track of what’s actually deployed. The exam expects familiarity with:

AWS CloudFormation / CDK — define endpoints, IAM roles, and pipeline resources as code, deployed through the same review process as application code
SageMaker Projects — MLOps templates that scaffold a CodeCommit/CodePipeline/CodeBuild setup pre-wired to SageMaker Pipelines and Model Registry, giving teams a standard starting point instead of building CI/CD from scratch each time
Terraform — a common alternative to CloudFormation for teams already standardized on it, fully capable of managing SageMaker resources

The underlying principle the exam is testing: reproducible infrastructure and reproducible models go hand in hand. If your endpoint configuration lives only in console click-history, you can’t roll back with confidence.
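As a small illustration of endpoint configuration living in code, the sketch below generates a CloudFormation template for an endpoint and its configuration as a Python dictionary and serializes it to JSON. The function name, logical resource IDs, model name, and instance type are placeholders, and the template assumes a SageMaker model with that name already exists.

```python
import json


def endpoint_template(model_name: str, instance_type: str = "ml.m5.large") -> dict:
    """Return a minimal CloudFormation template for one SageMaker endpoint."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "EndpointConfig": {
                "Type": "AWS::SageMaker::EndpointConfig",
                "Properties": {
                    "ProductionVariants": [
                        {
                            "VariantName": "AllTraffic",
                            "ModelName": model_name,       # assumed to exist already
                            "InstanceType": instance_type,
                            "InitialInstanceCount": 1,
                            "InitialVariantWeight": 1.0,
                        }
                    ]
                },
            },
            "Endpoint": {
                "Type": "AWS::SageMaker::Endpoint",
                "Properties": {
                    # Tie the endpoint to the configuration defined above.
                    "EndpointConfigName": {
                        "Fn::GetAtt": ["EndpointConfig", "EndpointConfigName"]
                    }
                },
            },
        },
    }


template = endpoint_template("churn-model-v2")
rendered = json.dumps(template, indent=2)
print(rendered)
```

Because the template is generated from code and committed to version control, rolling back means redeploying an earlier commit rather than reconstructing console settings from memory.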

Exam Focus: What Questions Test From This Step

Matching inference pattern (real-time, serverless, async, batch) to latency and traffic requirements
Multi-Model Endpoints for many similar models vs. multi-container for chained/different models
SageMaker Pipelines as a DAG with a condition step gating model registration
Model Registry approval statuses as the trigger for automated deployment (EventBridge + CodePipeline pattern)
Production variants for A/B traffic splitting vs. shadow deployment for zero-risk validation
Canary rollout as gradual traffic shifting
SageMaker Projects and IaC (CloudFormation/CDK/Terraform) as the standard for reproducible ML infrastructure

Written by NPBlue Cloud Team, Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.