MLA-C01 Deployment: SageMaker Endpoints, Pipelines & CI/CD for ML


Step 3: Deployment & Orchestration

A trained model sitting in an S3 bucket has zero business value. This step is about the last mile: getting predictions in front of an application reliably, and building the plumbing so that the "last mile" happens automatically every time a new model is ready, not through someone manually clicking buttons in the console.


Picking an Inference Pattern

The exam almost always gives you a scenario with a latency requirement, a traffic pattern, and a cost constraint, and expects you to pick the deployment type that satisfies all three.

┌─────────────────────┬───────────────────┬──────────────────┬─────────────────┐
│ Real-Time           │ Serverless        │ Asynchronous     │ Batch           │
│ Endpoint            │ Inference         │ Inference        │ Transform       │
├─────────────────────┼───────────────────┼──────────────────┼─────────────────┤
│ Always-on instance, │ Scales to zero,   │ Queues requests, │ No endpoint at  │
│ persistent; latency │ pay per invoke;   │ handles large    │ all; runs over  │
│ < 100 ms typical    │ good for spiky/   │ payloads (up to  │ a whole dataset │
│                     │ intermittent load │ 1 GB) and long   │ once, writes    │
│                     │                   │ processing times │ output to S3    │
└─────────────────────┴───────────────────┴──────────────────┴─────────────────┘

Real-time endpoints are the answer when a question emphasizes sub-second latency and steady traffic, such as a recommendation engine on a live product page. Serverless inference fits when traffic is unpredictable or bursty and idle cost matters more than cold-start latency, which makes it a good fit for an internal tool used sporadically. Asynchronous inference is the one people forget: it exists for large payloads or slow models (think processing a video frame by frame) where you don't want the caller blocked waiting, but you still want a managed endpoint rather than a batch job. Batch Transform is for offline scoring of an entire dataset with no need for a persistent endpoint at all; it is the cheapest option when you don't need predictions in real time.
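The selection logic above can be sketched as a small decision function. This is only an illustration of the reasoning: the `Workload` fields and the order of the checks are assumptions made for the example, not an AWS API or an official decision tree.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of an exam-style scenario."""
    needs_endpoint: bool          # predictions are requested on demand, one at a time
    steady_traffic: bool          # True for constant load, False for spiky or idle
    large_payload_or_slow: bool   # big inputs or long-running inference
    tolerates_cold_start: bool    # callers can accept an occasional cold start


def pick_inference_pattern(w: Workload) -> str:
    """Map a workload to one of the four deployment types described above."""
    if not w.needs_endpoint:
        # Offline scoring of a whole dataset: no persistent endpoint required.
        return "batch-transform"
    if w.large_payload_or_slow:
        # Requests are queued so the caller is not blocked while the model runs.
        return "asynchronous"
    if not w.steady_traffic and w.tolerates_cold_start:
        # Idle cost matters more than cold-start latency.
        return "serverless"
    # Low latency with steady traffic, or no tolerance for cold starts.
    return "real-time"
```

For example, a sporadically used internal tool would be `Workload(True, False, False, True)`, which this sketch maps to serverless inference.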


Multi-Model and Multi-Container Endpoints

Running one endpoint per model gets expensive fast when you have hundreds of similar models (a common pattern: one model per customer, or one per store location). Multi-Model Endpoints (MME) solve this by hosting many models behind a single endpoint, loading them into memory on demand and evicting the least-recently-used ones when memory is tight.

Single Endpoint (instance fleet)
├── model_customer_001.tar.gz ─┐
├── model_customer_002.tar.gz  ├─ loaded/unloaded dynamically
├── model_customer_003.tar.gz  │  from a shared S3 prefix
└── ... (hundreds more)       ─┘

Multi-Container Endpoints solve a different problem: running distinct models (potentially different frameworks) behind one endpoint, invoked either directly or as a serial inference pipeline where the output of one container feeds the next. This is useful for a preprocessing container feeding a model container feeding a postprocessing container.

If a scenario says "many nearly identical models, cost-sensitive, infrequent per-model traffic," that's MME. If it says "different model types chained together" or "route by target container," that's multi-container.
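From the caller's side, the difference between the two shows up in a single request parameter. The sketch below builds the arguments for the SageMaker Runtime `invoke_endpoint` call: `TargetModel` selects an artifact on a multi-model endpoint, and `TargetContainerHostname` selects a container on a multi-container endpoint with direct invocation. The endpoint names, artifact names, and JSON payload shape are placeholders assumed for illustration.

```python
import json


def mme_invoke_kwargs(endpoint_name: str, model_artifact: str, payload: dict) -> dict:
    """Arguments for invoking one model on a Multi-Model Endpoint.

    model_artifact is the artifact path relative to the endpoint's shared S3 prefix.
    """
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
        "TargetModel": model_artifact,
    }


def multi_container_invoke_kwargs(endpoint_name: str, container_hostname: str, payload: dict) -> dict:
    """Arguments for directly invoking one container on a multi-container endpoint."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
        "TargetContainerHostname": container_hostname,
    }


def invoke(kwargs: dict) -> bytes:
    """Send the request. Needs boto3, AWS credentials, and a live endpoint."""
    import boto3  # imported lazily so the builders above work without it

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**kwargs)
    return response["Body"].read()
```

A call such as `invoke(mme_invoke_kwargs("per-customer-models", "model_customer_002.tar.gz", {"features": [1, 2, 3]}))` would route to that one customer's model, assuming an endpoint with that hypothetical name exists.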


SageMaker Pipelines for Orchestration

Pipelines is SageMaker's native CI/CD-for-ML orchestration tool: a directed acyclic graph of steps defined in the Python SDK, version-controlled like any other code artifact.

┌────────────┐    ┌───────────┐    ┌────────────┐    ┌─────────────┐
│ Processing │───►│ Training  │───►│ Evaluation │───►│ Condition   │
│ Step       │    │ Step      │    │ Step       │    │ (metric >   │
└────────────┘    └───────────┘    └────────────┘    │ threshold?) │
                                                     └───┬─────┬───┘
                                                     yes │     │ no
                                                         ▼     ▼
                                             Register Model   Fail /
                                          in Model Registry   Notify

Each step's inputs and outputs are tracked automatically, which gives you lineage for free: you can always answer "which data and code produced this exact model artifact." The condition step is what makes this genuinely CI/CD rather than just a script: a model that doesn't beat your accuracy threshold never reaches the registry, let alone production.
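The flow in the diagram can be modeled in a few lines of plain Python. This is a toy model of the DAG and its condition branch, not the SageMaker SDK: the step names are taken from the diagram, and `run_pipeline` is a hypothetical helper. In the SDK itself, the branch would be expressed with a `ConditionStep` and its `if_steps` and `else_steps`.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (names follow the diagram).
DEPENDENCIES = {
    "Processing": set(),
    "Training": {"Processing"},
    "Evaluation": {"Training"},
    "Condition": {"Evaluation"},
}


def run_pipeline(metric: float, threshold: float) -> list[str]:
    """Return the steps that would execute, in order, for a given evaluation metric."""
    # A topological sort yields an order in which every dependency runs first.
    executed = list(TopologicalSorter(DEPENDENCIES).static_order())
    # The condition step selects exactly one branch.
    executed.append("RegisterModel" if metric > threshold else "Fail")
    return executed
```

With a threshold of 0.90, a model scoring 0.92 ends in `RegisterModel`, while one scoring exactly 0.90 ends in `Fail`, because the diagram's condition is a strict "greater than".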


Model Registry and Approval Workflows

The SageMaker Model Registry is the catalog of model versions grouped into model package groups, each version carrying its evaluation metrics, its lineage back to the training job, and an approval status: PendingManualApproval, Approved, or Rejected.

This status field is the trigger point for CI/CD. A common pattern: an EventBridge rule watches for a model package's status changing to Approved and kicks off a deployment pipeline (often via CodePipeline/CodeBuild) that updates a SageMaker endpoint with the new model, with no human touching the endpoint configuration directly.
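A minimal sketch of the event pattern such a rule might use is shown below. The model package group name and rule name are placeholders, and `matches` is a simplified local checker written for this example; it handles exact-value patterns only and does not reproduce full EventBridge matching semantics.

```python
import json


def approved_model_event_pattern(model_package_group: str) -> dict:
    """Event pattern that matches a model package in the group becoming Approved."""
    return {
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {
            "ModelPackageGroupName": [model_package_group],
            "ModelApprovalStatus": ["Approved"],
        },
    }


def put_rule_kwargs(rule_name: str, pattern: dict) -> dict:
    """Arguments for the EventBridge put_rule call (the pattern is sent as a JSON string)."""
    return {"Name": rule_name, "EventPattern": json.dumps(pattern), "State": "ENABLED"}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must be present with an allowed value."""
    for key, allowed in pattern.items():
        if isinstance(allowed, dict):
            nested = event.get(key)
            if not isinstance(nested, dict) or not matches(allowed, nested):
                return False
        elif event.get(key) not in allowed:
            return False
    return True
```

The rule's target (a CodePipeline execution, a Lambda function, and so on) is configured separately and is omitted here.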

Approval mode      | How it works                                              | When to use
Manual approval    | A human reviews metrics and clicks approve                | Regulated environments, high-stakes models
Automatic approval | A Lambda or pipeline step approves if metrics clear a bar | High-velocity teams, well-tested evaluation criteria
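As a rough sketch of the automatic path, the function below turns evaluation metrics into the arguments for the SageMaker `update_model_package` call. The metric names and minimum values are hypothetical, and rejecting on a miss is a choice made for this example; a team could equally leave the version in PendingManualApproval for a human to review.

```python
def approval_update(model_package_arn: str, metrics: dict, minimums: dict) -> dict:
    """Build update_model_package arguments from evaluation metrics.

    metrics and minimums map metric names to values; a metric that is missing
    from metrics is treated as failing its minimum.
    """
    failures = [
        name for name, floor in minimums.items()
        if metrics.get(name, float("-inf")) < floor
    ]
    if failures:
        status = "Rejected"
        reason = "Below threshold: " + ", ".join(sorted(failures))
    else:
        status = "Approved"
        reason = "All metrics cleared their thresholds"
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": status,
        "ApprovalDescription": reason,
    }
```

Inside a Lambda function, the result would be passed as `boto3.client("sagemaker").update_model_package(**kwargs)`, which requires credentials and an existing model package.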

A/B Testing and Shadow Deployments

Rolling out a new model version blind is a good way to find out about a regression from your users instead of from your metrics. SageMaker supports a few ways to de-risk this.

Production variants let a single endpoint split traffic across multiple model versions by weight (for example, 90% to the incumbent and 10% to the challenger), and you compare business or accuracy metrics before shifting more traffic. This is the standard mechanism for A/B testing on SageMaker endpoints.

Shadow deployments go further: the new model receives a copy of live traffic and returns predictions that are logged but never shown to the user, so you can validate the challenger against real traffic with zero user-facing risk before it ever gets a production traffic share.

                     ┌──────────────► Model A (90%) ──► response to user
Incoming request ────┤
                     └──────────────► Model B (10%) ──► response to user
                                            │
                                  (or, in shadow mode)
                                            ▼
                                    Model B (shadow) ──► logged only,
                                                         never returned
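Both layouts in the diagram are expressed through the endpoint configuration. The sketch below builds arguments for the SageMaker `create_endpoint_config` call, using `ProductionVariants` for the weighted split and `ShadowProductionVariants` for the shadow case. The variant names, instance type, and instance counts are assumptions for illustration, and the exact way a shadow variant's weight controls how much traffic is copied should be checked against the API reference.

```python
def _variant(name: str, model_name: str, weight: float) -> dict:
    """One variant entry; instance type and count are placeholder choices."""
    return {
        "VariantName": name,
        "ModelName": model_name,
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": float(weight),
    }


def ab_test_config(config_name: str, incumbent: str, challenger: str,
                   challenger_percent: int = 10) -> dict:
    """Two production variants that both answer real user requests."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            _variant("incumbent", incumbent, 100 - challenger_percent),
            _variant("challenger", challenger, challenger_percent),
        ],
    }


def shadow_config(config_name: str, incumbent: str, challenger: str) -> dict:
    """One production variant serves users; the shadow variant's responses are not returned."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [_variant("incumbent", incumbent, 1)],
        "ShadowProductionVariants": [_variant("shadow", challenger, 1)],
    }


def traffic_shares(config: dict) -> dict:
    """Weights are relative: a variant's share is its weight divided by the total."""
    variants = config["ProductionVariants"]
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}
```

Because weights are relative rather than percentages, weights of 9 and 1 would produce the same 90/10 split as 90 and 10.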

Canary deployment is the related pattern where you shift a small percentage of traffic to the new version and increase it gradually as confidence builds; conceptually, it is a slower-moving cousin of the A/B split.
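One way to picture the gradual shift is as a sequence of weight updates between two production variants. The sketch below generates arguments for the SageMaker `update_endpoint_weights_and_capacities` call at each stage. The percentage schedule and variant names are assumptions, and the metric checks that should happen between stages are left out.

```python
def weight_updates(endpoint_name: str, old_variant: str, new_variant: str,
                   percentages=(10, 25, 50, 100)) -> list[dict]:
    """Return one update request per stage, moving traffic toward new_variant."""
    updates = []
    for pct in percentages:
        if not 0 <= pct <= 100:
            raise ValueError(f"percentage out of range: {pct}")
        updates.append({
            "EndpointName": endpoint_name,
            "DesiredWeightsAndCapacities": [
                {"VariantName": old_variant, "DesiredWeight": float(100 - pct)},
                {"VariantName": new_variant, "DesiredWeight": float(pct)},
            ],
        })
    return updates
```

Each dictionary would be applied only after the previous stage's metrics look healthy; if they do not, reapplying an earlier stage moves traffic back to the old variant.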


Infrastructure as Code for ML Resources

Manually clicking through the console to create endpoints, pipelines, and feature groups doesn't scale past a handful of models, and it's an easy way to lose track of what's actually deployed. The exam expects familiarity with:

The underlying principle the exam is testing: reproducible infrastructure and reproducible models go hand in hand. If your endpoint configuration lives only in console click-history, you can't roll back with confidence.
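To make the idea concrete, here is a minimal sketch that generates a template describing a model, an endpoint configuration, and an endpoint. It assumes AWS CloudFormation as the infrastructure-as-code tool, which is a choice made for this example rather than something the text above specifies. The logical resource names, instance type, and variant name are placeholders.

```python
import json


def endpoint_template(image_uri: str, model_data_url: str, role_arn: str) -> dict:
    """CloudFormation template (as a dict) for a single-variant SageMaker endpoint."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "Model": {
                "Type": "AWS::SageMaker::Model",
                "Properties": {
                    "ExecutionRoleArn": role_arn,
                    "PrimaryContainer": {
                        "Image": image_uri,
                        "ModelDataUrl": model_data_url,
                    },
                },
            },
            "EndpointConfig": {
                "Type": "AWS::SageMaker::EndpointConfig",
                "Properties": {
                    "ProductionVariants": [{
                        "VariantName": "AllTraffic",
                        # Reference the model resource declared above.
                        "ModelName": {"Fn::GetAtt": ["Model", "ModelName"]},
                        "InstanceType": "ml.m5.large",
                        "InitialInstanceCount": 1,
                        "InitialVariantWeight": 1.0,
                    }],
                },
            },
            "Endpoint": {
                "Type": "AWS::SageMaker::Endpoint",
                "Properties": {
                    "EndpointConfigName": {
                        "Fn::GetAtt": ["EndpointConfig", "EndpointConfigName"]
                    },
                },
            },
        },
    }


def render(template: dict) -> str:
    """Serialize deterministically so the file diffs cleanly in version control."""
    return json.dumps(template, indent=2, sort_keys=True)
```

Committing the rendered file means a rollback becomes a revert of a commit followed by a stack update, rather than a reconstruction from memory of what was clicked.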


Exam Focus: What Questions Test From This Step