Deep Learning Deployment: ONNX, TorchScript, Quantization, and Monitoring

How to actually deploy a trained model — saving, ONNX conversion, TensorFlow Serving, TorchScript, quantization, edge deployment, and monitoring.

A trained model sitting in a Jupyter notebook has delivered none of its intended value. Deployment is where a model starts producing predictions for real users, and it introduces a different set of concerns from training: latency, memory footprint, serving infrastructure, and detecting when a model’s real-world performance quietly degrades. This final guide ties the series together by covering what happens after training ends.


Saving and Loading Models

The most basic deployment step: persisting trained weights so they can be loaded later without retraining.

import torch

# Saving just the weights (the recommended approach)
torch.save(model.state_dict(), "model_weights.pt")

# Loading requires recreating the architecture first, then loading weights into it
model = MyModelArchitecture()  # placeholder for your own nn.Module subclass
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()  # critical: disables dropout and switches batch norm to inference mode
              # (covered in Dropout and Batch Normalization)

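Saving only the state_dict (weights), rather than the entire model object, is the standard recommended practice. Pickling the whole model stores a reference to the class by its import path, so loading breaks if the class is renamed, moved, or changed between saving and loading. A state_dict only needs the parameter names and shapes to match, which makes it more portable across code versions.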
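The save/load round trip can be checked end to end with a small sketch. It assumes a hypothetical `TinyNet` class standing in for the real architecture, and uses an in-memory buffer in place of a file on disk:

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for the real model architecture."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


torch.manual_seed(0)
trained = TinyNet()

# Save only the weights. The BytesIO buffer stands in for "model_weights.pt".
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Loading: rebuild the architecture first, then load the weights into it.
restored = TinyNet()
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
restored.eval()

# The restored model should reproduce the original model's outputs.
x = torch.randn(3, 4)
with torch.no_grad():
    same_outputs = torch.allclose(trained(x), restored(x))
```

Comparing outputs on a fixed input right after loading is a cheap sanity check that the architecture and the weights still agree.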


ONNX: A Framework-Agnostic Model Format

ONNX (Open Neural Network Exchange) is a standardized format that lets a model trained in one framework (PyTorch) run in a different runtime environment (a C++ production server, a mobile device, a different framework entirely) without needing the original training framework installed.

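import torch

model.eval()  # export in inference mode so dropout and batch norm behave as in production
dummy_input = torch.randn(1, 3, 224, 224)  # example input that defines the exported input shape
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"], output_names=["output"])

# Inference needs only ONNX Runtime, not PyTorch
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
# input_data: a tensor with the same shape and dtype as dummy_input
outputs = session.run(None, {"input": input_data.numpy()})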

This matters enormously in practice — a model trained in Python/PyTorch often needs to run in a production environment written in an entirely different language, or on hardware where installing a full PyTorch environment isn’t practical, and ONNX is the standard bridge for exactly that gap.


TorchScript: Optimized, Python-Independent PyTorch Execution

TorchScript compiles a PyTorch model into a serialized, optimized representation that can run without a Python interpreter at all — important for production environments where Python’s overhead and dependency management are genuinely undesirable.

scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")
# Can be loaded and run in a pure C++ environment via libtorch, no Python needed
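One design point worth seeing in code is the difference between `torch.jit.script`, which compiles the module's source and keeps its control flow, and `torch.jit.trace`, which only records the operations executed for one example input. The sketch below assumes a hypothetical `Gate` module with a data-dependent branch:

```python
import warnings

import torch
import torch.nn as nn


class Gate(nn.Module):
    """Hypothetical module whose output depends on a data-dependent branch."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:
            return x * 2
        return x - 1


module = Gate()

# Scripting compiles the source, so both branches survive.
scripted = torch.jit.script(module)

# Tracing records only the branch taken for this example input (x * 2).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # tracing warns about the tensor-to-bool branch
    traced = torch.jit.trace(module, torch.ones(3))

negative = torch.full((3,), -3.0)
eager_out = module(negative)        # takes the `x - 1` branch
scripted_out = scripted(negative)   # follows the same branch as eager mode
traced_out = traced(negative)       # replays `x * 2`, which is wrong for this input
```

For models without data-dependent control flow the two approaches generally produce the same result, so the choice mostly matters for modules like this one.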

TensorFlow Serving: A Dedicated Model-Serving Infrastructure

For TensorFlow models specifically, TensorFlow Serving provides a production-grade serving system with built-in versioning, batching of incoming requests for efficiency, and a standardized API — rather than hand-building a custom inference server from scratch.

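# Conceptual usage: serve a saved model directory over a gRPC/REST API
# --model_name must match the name in the request URL (it defaults to "default")
tensorflow_model_server --model_name=my_model --model_base_path=/models/my_model --rest_api_port=8501

# Python client: send a prediction request to the REST endpoint
import requests

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json={"instances": input_data.tolist()},
)
predictions = response.json()["predictions"]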
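The request and response shapes of the REST predict endpoint can be sketched without a running server. The helper names below (`predict_url`, `predict_body`, `parse_predictions`) are hypothetical, and the sample response is hand-written rather than real server output:

```python
import json


def predict_url(host, model_name, version=None):
    """Build the REST predict URL, optionally pinned to one model version."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    return f"http://{host}{path}:predict"


def predict_body(instances):
    """Serialize a batch of inputs in the row ("instances") request format."""
    return json.dumps({"instances": instances})


def parse_predictions(response_text):
    """Extract the predictions list, raising if the server reported an error."""
    payload = json.loads(response_text)
    if "error" in payload:
        raise RuntimeError(payload["error"])
    return payload["predictions"]


url = predict_url("localhost:8501", "my_model", version=3)
body = predict_body([[1.0, 2.0], [3.0, 4.0]])

# Hand-written sample response, not real server output.
preds = parse_predictions('{"predictions": [[0.1, 0.9], [0.8, 0.2]]}')
```

Pinning a version in the URL is optional; without it the server answers with whichever version it is currently serving.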

Quantization: Trading a Small Amount of Precision for Major Efficiency Gains

Quantization reduces the numerical precision of a model’s weights, and often its activations, commonly from 32-bit floating point down to 8-bit integers, which connects directly to the precision discussion in Numerical Computation. Going from 32 to 8 bits cuts the storage for the quantized tensors by about 4x and can significantly speed up inference, at the cost of a typically small, often acceptable drop in accuracy.

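import torch
import torch.nn as nn
import torch.quantization as quantization

# Dynamic quantization: weights of the listed layer types are stored as int8,
# and activations are quantized on the fly at inference time
quantized_model = quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
# The quantized nn.Linear weights are now roughly 4x smaller (32-bit -> 8-bit);
# other layer types stay in float32. Inference is typically faster,
# at a typically small, task-dependent accuracy cost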
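The arithmetic behind the size and accuracy trade-off can be sketched in plain Python. This is a simplified symmetric int8 scheme for illustration only, not a reproduction of what `quantize_dynamic` does internally; the function names and sample weights are made up:

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers using a single shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # illustrative values
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)

# Each weight is off by at most half a quantization step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, which is where the roughly 4x size reduction comes from. The price is a rounding error of at most `scale / 2` per weight, and very small weights such as `0.003` collapse to zero.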

Edge Deployment: Running Models on Constrained Devices

Deploying to phones, IoT devices, or embedded hardware requires models small and fast enough to run without a GPU or significant memory — quantization, architectural choices like EfficientNet (covered in Popular CNN Architectures) chosen specifically for their favorable accuracy-per-parameter ratio, and specialized runtimes (TensorFlow Lite, PyTorch Mobile, ONNX Runtime Mobile) are the standard toolkit for this.

import tensorflow as tf

# Conceptual TensorFlow Lite conversion for edge/mobile deployment
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # applies quantization automatically
tflite_model = converter.convert()  # serialized model bytes, ready to write to a .tflite file
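Before converting anything, a back-of-the-envelope size estimate shows whether a model can fit on the target device at all. The sketch below counts weight storage only, ignoring activations, runtime overhead, and quantization metadata; the parameter count and the 8 MB budget are hypothetical:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_mb(num_params, dtype):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


def fits_budget(num_params, dtype, budget_mb):
    """Check weight storage against a device's memory budget."""
    return weights_size_mb(num_params, dtype) <= budget_mb


num_params = 5_000_000  # hypothetical parameter count
sizes = {dtype: weights_size_mb(num_params, dtype) for dtype in BYTES_PER_PARAM}

fits_fp32 = fits_budget(num_params, "float32", budget_mb=8)
fits_int8 = fits_budget(num_params, "int8", budget_mb=8)
```

Under these assumptions the float32 weights need 20 MB and miss the budget, while the int8 weights need 5 MB and fit.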

Model Monitoring: Detecting Degradation After Deployment

A deployed model’s performance can degrade over time even without any code changes — the real-world data distribution can drift away from what the model was trained on, connecting directly to the train/test distribution assumptions covered in Dataset Preparation.

# Conceptual monitoring: track prediction confidence and input distribution over time
import logging

logger = logging.getLogger("model_monitoring")

def log_metric(name, value):
    # Placeholder: replace with a call to your metrics backend
    logger.info("%s=%.4f", name, value)

def log_prediction_metrics(inputs, predictions, confidences):
    log_metric("avg_confidence", float(confidences.mean()))
    log_metric("input_feature_mean", float(inputs.mean()))
# A significant, sustained shift in either signals potential model degradation
# or genuine data drift, worth investigating before it silently affects users

Without monitoring, a model’s real-world accuracy can silently decay for months before anyone notices — the same discipline that made train/validation/test splitting important during development remains just as important, in an ongoing form, after deployment.
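Turning logged metrics into an alert needs some rule for what counts as a sustained shift. One simple option, sketched here, is a z-test on the rolling mean of a scalar signal (a feature mean or the average confidence) against baseline statistics from training. The `MeanShiftMonitor` class, window size, and threshold are all illustrative assumptions:

```python
import math
from collections import deque


class MeanShiftMonitor:
    """Flags when a rolling mean drifts away from a training-time baseline."""

    def __init__(self, baseline_mean, baseline_std, window=100, z_threshold=3.0):
        self.baseline_mean = baseline_mean
        self.baseline_std = baseline_std
        self.values = deque(maxlen=window)
        self.z_threshold = z_threshold

    def update(self, value):
        """Record one observation; return True if the full window looks drifted."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet to judge
        window_mean = sum(self.values) / len(self.values)
        standard_error = self.baseline_std / math.sqrt(len(self.values))
        z = abs(window_mean - self.baseline_mean) / standard_error
        return z > self.z_threshold


monitor = MeanShiftMonitor(baseline_mean=0.0, baseline_std=1.0, window=50)

# Values scattered around the baseline should never trigger an alert.
stable = [monitor.update(0.1 if i % 2 else -0.1) for i in range(50)]

# A sustained shift to 0.8 should trigger once it dominates the window.
shifted = [monitor.update(0.8) for _ in range(50)]
```

Requiring a full window before alerting trades detection speed for fewer false alarms: a single unusual request does not fire, but a sustained shift does.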

Versioning Models in Production

Beyond the technical formats covered above, one practical deployment concern is easy to overlook until it causes a real incident: tracking exactly which model version is serving which predictions, and having a fast rollback path if a newly deployed version underperforms in production despite passing offline evaluation. This typically means tagging every deployed model with a version identifier tied to the exact training run, dataset version, and code commit that produced it, and keeping the previous version’s artifacts readily available rather than overwriting them. It is the same discipline as versioning any other piece of production software, applied to model artifacts, which are otherwise easy to treat as a single unversioned file that gets replaced on every update.
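The bookkeeping described above can be sketched as a small in-memory registry. The `ModelVersion` and `ModelRegistry` names, and every identifier and path in the usage lines, are hypothetical; a real setup would persist this in a database or a dedicated model registry:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """Everything needed to trace a deployed model back to its origin."""

    version: str
    training_run: str
    dataset_version: str
    code_commit: str
    artifact_path: str  # each version keeps its own path, never overwritten


class ModelRegistry:
    def __init__(self):
        self._history = []  # deployed versions, oldest first

    def deploy(self, model_version):
        self._history.append(model_version)

    @property
    def active(self):
        return self._history[-1] if self._history else None

    def rollback(self):
        """Reactivate the previous version, whose artifact is still on disk."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self.active


registry = ModelRegistry()
registry.deploy(ModelVersion("v1", "run-001", "data-2024-01", "abc1234", "models/v1/model.pt"))
registry.deploy(ModelVersion("v2", "run-002", "data-2024-02", "def5678", "models/v2/model.pt"))

# v2 underperforms in production, so switch back to v1.
restored = registry.rollback()
```

Because every version has its own artifact path, rolling back is a pointer change rather than a retraining job.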

Summary

Concern | Tool/Technique
Cross-framework portability | ONNX
Python-independent execution | TorchScript
Production serving infrastructure | TensorFlow Serving
Reducing model size/latency | Quantization
Constrained device deployment | Quantization + specialized mobile runtimes
Post-deployment reliability | Continuous monitoring for data/performance drift

Deployment isn’t an afterthought tacked onto the end of a deep learning project — it’s where every architectural and training decision covered throughout this entire series finally gets tested against reality, and getting it right is what separates a model that works in a notebook from one that reliably serves real users.