Deep Learning Deployment: ONNX, TorchScript, Quantization, and Monitoring
A trained model sitting in a Jupyter notebook has delivered exactly none of its intended value — deployment is where a model actually starts producing predictions for real users, and it introduces an entirely different set of concerns from training: latency, memory footprint, serving infrastructure, and detecting when a model’s real-world performance quietly degrades. This final guide ties the whole series together by covering what happens after training ends.
Saving and Loading Models
The most basic deployment step: persisting trained weights so they can be loaded later without retraining.
    import torch

    # Saving just the weights (the recommended approach)
    torch.save(model.state_dict(), "model_weights.pt")

    # Loading requires recreating the architecture first, then loading weights into it
    model = MyModelArchitecture()
    model.load_state_dict(torch.load("model_weights.pt"))
    model.eval()  # critical: disables dropout and switches batch norm to inference mode,
                  # covered in Dropout and Batch Normalization

Saving only the state_dict (weights), rather than the entire model object, is the standard recommended practice: it’s more portable across code versions and avoids issues if the model class definition changes slightly between saving and loading.
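As a quick sanity check on the save/load cycle described above, the sketch below round-trips a state_dict through an in-memory buffer and confirms that the restored model produces the same outputs. `TinyClassifier` and `roundtrip_state_dict` are hypothetical names invented for this illustration (standing in for `MyModelArchitecture`), and the buffer replaces a file on disk purely to keep the example self-contained.

```python
import io

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for MyModelArchitecture."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2)
        )

    def forward(self, x):
        return self.net(x)


def roundtrip_state_dict(model):
    """Save weights to an in-memory buffer and load them into a fresh instance."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    buffer.seek(0)
    restored = TinyClassifier()  # the architecture must be recreated first
    restored.load_state_dict(torch.load(buffer, map_location="cpu"))
    restored.eval()  # inference mode: dropout is disabled
    return restored


torch.manual_seed(0)
original = TinyClassifier().eval()
restored = roundtrip_state_dict(original)

sample = torch.randn(3, 4)
with torch.no_grad():
    outputs_match = torch.allclose(original(sample), restored(sample))
print(outputs_match)
```

Both models are in eval mode when compared; with dropout still active, two forward passes would differ even with identical weights, which is one reason the `model.eval()` call matters.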
ONNX: A Framework-Agnostic Model Format
ONNX (Open Neural Network Exchange) is a standardized format that lets a model trained in one framework (PyTorch) run in a different runtime environment (a C++ production server, a mobile device, a different framework entirely) without needing the original training framework installed.
    import torch.onnx

    # Export by running the model once on a dummy input of the expected shape
    dummy_input = torch.randn(1, 3, 224, 224)
    torch.onnx.export(model, dummy_input, "model.onnx",
                      input_names=["input"], output_names=["output"])

    import onnxruntime as ort

    # input_data is assumed to be a float32 tensor shaped like dummy_input
    session = ort.InferenceSession("model.onnx")
    outputs = session.run(None, {"input": input_data.numpy()})

This matters enormously in practice. A model trained in Python/PyTorch often needs to run in a production environment written in an entirely different language, or on hardware where installing a full PyTorch environment isn’t practical, and ONNX is the standard bridge for exactly that gap.
TorchScript: Optimized, Python-Independent PyTorch Execution
TorchScript compiles a PyTorch model into a serialized, optimized representation that can run without a Python interpreter at all — important for production environments where Python’s overhead and dependency management are genuinely undesirable.
    scripted_model = torch.jit.script(model)
    scripted_model.save("model_scripted.pt")

    # Can be loaded and run in a pure C++ environment via libtorch, no Python needed

TensorFlow Serving: A Dedicated Model-Serving Infrastructure
For TensorFlow models specifically, TensorFlow Serving provides a production-grade serving system with built-in versioning, batching of incoming requests for efficiency, and a standardized API — rather than hand-building a custom inference server from scratch.
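The standardized API and built-in versioning come together in the REST URL scheme, where a request can target either the latest model version or a specific one. The sketch below only builds the request and parses a response body; it makes no network call. The helper names `build_predict_request` and `parse_predict_response` are hypothetical, and the host, port, and model name are placeholder assumptions.

```python
import json


def build_predict_request(host, model_name, instances, version=None, port=8501):
    """Return (url, json_body) for a TensorFlow Serving REST predict call."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        # Pin a specific served version instead of whichever is latest
        path += f"/versions/{version}"
    url = f"http://{host}:{port}{path}:predict"
    body = json.dumps({"instances": instances})
    return url, body


def parse_predict_response(response_text):
    """Extract the predictions list from a row-format response body."""
    payload = json.loads(response_text)
    if "error" in payload:
        raise RuntimeError(payload["error"])
    return payload["predictions"]


url, body = build_predict_request("localhost", "my_model", [[1.0, 2.0]], version=3)
print(url)  # http://localhost:8501/v1/models/my_model/versions/3:predict
```

Pinning the version in the URL is one simple way to keep a client on a known-good model while a newer version is being evaluated.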
    # Conceptual usage: serve a saved model directory over a gRPC/REST API
    # (--model_name must match the name used in the request URL below)
    tensorflow_model_server --model_name=my_model --model_base_path=/models/my_model --rest_api_port=8501

    import requests

    response = requests.post(
        "http://localhost:8501/v1/models/my_model:predict",
        json={"instances": input_data.tolist()},
    )

Quantization: Trading a Small Amount of Precision for Major Efficiency Gains
Quantization reduces the numerical precision of a model’s weights (commonly from 32-bit floating point down to 8-bit integers), directly connecting to the precision discussion in Numerical Computation — this dramatically reduces model size and can significantly speed up inference, at the cost of a typically small, often acceptable drop in accuracy.
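To make the 32-bit to 8-bit mapping concrete, here is a minimal pure-Python sketch of affine quantization, where each float is stored as an 8-bit integer plus a shared scale and zero point. It is a simplified illustration, not necessarily the exact scheme any particular library uses, and the function names and sample weights are invented for the example.

```python
def quantize_int8(values):
    """Affine-quantize a list of floats to int8; returns (codes, scale, zero_point)."""
    # The representable range must include 0.0 so that zero maps to an exact code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when every value is zero
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight now needs one byte instead of four, and the reconstruction error stays on the order of the step size `scale`, which is why the accuracy cost is usually small when the weights do not span an extreme range.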
    import torch
    import torch.nn as nn
    import torch.quantization as quantization

    quantized_model = quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    # The quantized nn.Linear weights are now roughly 4x smaller (32-bit -> 8-bit),
    # usually with faster CPU inference, at a typically small, task-dependent accuracy cost

Edge Deployment: Running Models on Constrained Devices
Deploying to phones, IoT devices, or embedded hardware requires models small and fast enough to run without a GPU or significant memory — quantization, architectural choices like EfficientNet (covered in Popular CNN Architectures) chosen specifically for their favorable accuracy-per-parameter ratio, and specialized runtimes (TensorFlow Lite, PyTorch Mobile, ONNX Runtime Mobile) are the standard toolkit for this.
    import tensorflow as tf

    # Conceptual TensorFlow Lite conversion for edge/mobile deployment
    converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # applies quantization automatically
    tflite_model = converter.convert()

Model Monitoring: Detecting Degradation After Deployment
A deployed model’s performance can degrade over time even without any code changes — the real-world data distribution can drift away from what the model was trained on, connecting directly to the train/test distribution assumptions covered in Dataset Preparation.
    # Conceptual monitoring: track prediction confidence and input distribution over time
    def log_prediction_metrics(inputs, predictions, confidences):
        log_metric("avg_confidence", confidences.mean())
        log_metric("input_feature_mean", inputs.mean())
        # A significant, sustained shift in either signals potential model degradation
        # or genuine data drift, worth investigating before it silently affects users

Without monitoring, a model’s real-world accuracy can silently decay for months before anyone notices. The same discipline that made train/validation/test splitting important during development remains just as important, in an ongoing form, after deployment.
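One simple way to turn a logged metric into an alert is to compare the mean of a recent window against a baseline captured at training time. The sketch below uses a z-score on the window mean; the function names, the threshold of 3.0, and the synthetic data are all assumptions chosen for illustration, and a production system would likely also require the shift to persist across several windows before alerting.

```python
import math
import statistics


def mean_shift_zscore(baseline, recent):
    """How many standard errors the recent mean sits from the baseline mean.

    Assumes the baseline has at least two values and nonzero variance.
    """
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)
    standard_error = base_std / math.sqrt(len(recent))
    return (statistics.fmean(recent) - base_mean) / standard_error


def drift_alert(baseline, recent, threshold=3.0):
    """Flag a window whose mean is more than `threshold` standard errors away."""
    return abs(mean_shift_zscore(baseline, recent)) > threshold


# Synthetic feature values: the baseline cycles through 0.0 .. 0.9
baseline = [(i % 10) / 10 for i in range(200)]
stable_window = [(i % 10) / 10 for i in range(50)]
shifted_window = [v + 0.3 for v in stable_window]

print(drift_alert(baseline, stable_window))
print(drift_alert(baseline, shifted_window))
```

The same check applies equally to average prediction confidence or to any per-feature summary statistic, as long as a trustworthy baseline was recorded before deployment.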
Versioning Models in Production
Beyond the technical formats covered above, one practical deployment concern is easy to overlook until it causes a real incident: tracking exactly which model version is serving which predictions, and having a fast rollback path if a newly deployed version underperforms in production despite passing offline evaluation. This typically means tagging every deployed model with a version identifier tied to the exact training run, dataset version, and code commit that produced it, and keeping the previous version’s artifacts readily available rather than overwriting them. It is the same discipline as versioning any other piece of production software, applied to model artifacts, which are otherwise easy to treat as a single, unversioned file that gets replaced on every update.
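A minimal sketch of this bookkeeping, assuming an in-memory registry rather than any particular model-registry product: each record ties a version to its training run, dataset version, code commit, and an artifact checksum, and rollback moves a pointer instead of deleting anything. `ModelVersion`, `ModelRegistry`, `fingerprint`, and all field values are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    version: str
    training_run_id: str
    dataset_version: str
    code_commit: str
    artifact_sha256: str


def fingerprint(artifact_bytes):
    """Checksum that identifies the exact artifact that was deployed."""
    return hashlib.sha256(artifact_bytes).hexdigest()


class ModelRegistry:
    """Keeps every deployed version; rollback only moves the live pointer."""

    def __init__(self):
        self._records = []  # oldest first, never overwritten
        self._live_index = -1

    def deploy(self, record):
        if any(r.version == record.version for r in self._records):
            raise ValueError(f"version {record.version} already registered")
        self._records.append(record)
        self._live_index = len(self._records) - 1

    @property
    def live(self):
        return self._records[self._live_index] if self._live_index >= 0 else None

    def rollback(self):
        """Point live traffic back at the previously deployed version."""
        if self._live_index < 1:
            raise RuntimeError("no earlier version to roll back to")
        self._live_index -= 1
        return self.live


registry = ModelRegistry()
registry.deploy(ModelVersion("v1", "run-001", "data-2024-01", "abc1234", fingerprint(b"weights-v1")))
registry.deploy(ModelVersion("v2", "run-002", "data-2024-02", "def5678", fingerprint(b"weights-v2")))
rolled_back = registry.rollback()
print(rolled_back.version)  # v1
```

Because the v2 record is retained after the rollback, it remains available for a post-incident comparison against v1.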
Summary
| Concern | Tool/Technique |
|---|---|
| Cross-framework portability | ONNX |
| Python-independent execution | TorchScript |
| Production serving infrastructure | TensorFlow Serving |
| Reducing model size/latency | Quantization |
| Constrained device deployment | Quantization + specialized mobile runtimes |
| Post-deployment reliability | Continuous monitoring for data/performance drift |
Deployment isn’t an afterthought tacked onto the end of a deep learning project — it’s where every architectural and training decision covered throughout this entire series finally gets tested against reality, and getting it right is what separates a model that works in a notebook from one that reliably serves real users.