Deep Learning Model Deployment
Training a model is only half the job. Deployment puts the model into production — making it available to real users through applications, APIs, or on-device systems. A model that never reaches production creates no value. This topic covers how to take a trained Deep Learning model from your notebook to the real world.
The Deployment Pipeline
Training Environment               Production Environment
──────────────────────────         ────────────────────────────────
Data → Train → Evaluate → Export → Optimize → Serve → Monitor
(Jupyter, Colab)                   (Production Server / Device)
Step 1: Export the Model
After training, you save the model's weights and architecture to a file. This file is then loaded wherever the model needs to run.
TensorFlow / Keras:
model.save('my_model.h5') → single-file HDF5 format (legacy)
model.save('saved_model_dir/') → TensorFlow SavedModel format (recommended for TensorFlow Serving; in Keras 3, use model.export('saved_model_dir/') instead)
PyTorch:
torch.save(model.state_dict(), 'model_weights.pth') → weights only; the model class is needed to reload them
torch.onnx.export(model, ...) → ONNX format (cross-platform)
ONNX (Open Neural Network Exchange):
An open interchange format: PyTorch exports to it directly, TensorFlow models reach it through converters such as tf2onnx, and many deployment tools can load it
→ Export from PyTorch → Run with ONNX Runtime on servers, edge devices, or cloud
Step 2: Optimize for Production
A research model is often large, slow, and memory-hungry. Optimization trims it down for faster, cheaper predictions without sacrificing much accuracy.
Quantization
Training uses 32-bit floating point numbers (float32):
  Weight value: 0.38291746231
Quantization converts to 8-bit integers (int8):
  Weight value: 97 (approximate)
Result:
  Model size: 4× smaller
  Inference speed: 2–4× faster
  Accuracy drop: typically less than 1%
Used in: mobile apps, embedded devices, edge AI
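The float32-to-int8 conversion can be sketched in plain Python. This is a minimal illustration that assumes symmetric, per-tensor quantization with a single scale factor. Real toolchains add calibration data and per-channel scales, and the weight values below are made up for the example.

```python
from array import array

def quantize_int8(weights):
    """Symmetric quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale factor for the whole tensor; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

# Illustrative weights (hypothetical values).
weights = [0.38291746231, -0.92, 0.0051, 0.71, -0.33]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 value.
float_bytes = len(weights) * array("f").itemsize
int8_bytes = len(weights) * array("b").itemsize
```

The integer assigned to a given weight depends on the scale chosen for the whole tensor, so a value such as 97 in the text should be read as illustrative.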
Pruning
A large model has many near-zero weights that contribute almost nothing. Pruning removes those weights entirely.
Before pruning: 100 million parameters
After pruning: 40 million parameters (60% removed, accuracy nearly unchanged)
Analogy: pruning dead branches from a tree → tree grows more efficiently
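One common way to choose which weights to remove is magnitude pruning: zero out the fraction of weights with the smallest absolute values. The sketch below assumes that approach and uses a made-up list of ten weights. The function name and the `sparsity` parameter are illustrative, not part of any library.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

# Hypothetical weights: a few large values and many near-zero ones.
weights = [0.91, -0.002, 0.0004, -0.75, 0.01, 0.33, -0.0009, 0.58, 0.003, -0.12]
pruned = magnitude_prune(weights, sparsity=0.6)
remaining = sum(1 for w in pruned if w != 0.0)
```

Zeroing weights does not by itself shrink a densely stored model. The size and speed gains come from sparse storage formats or from removing whole neurons or channels (structured pruning).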
Knowledge Distillation
Large Teacher Model (GPT-3 sized) → produces soft probability outputs
Small Student Model → trained to mimic the Teacher's outputs, not the raw labels
Teacher: "cat 75%, dog 20%, lion 5%" (richer information than label "cat")
Student: learns these probability distributions → can approach the Teacher's accuracy
at a fraction of the size (for example, about 90% of the accuracy at 10% of the size)
Used in: DistilBERT (40% smaller than BERT and 60% faster, while retaining about 97% of its language-understanding performance)
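The idea of learning from soft probabilities can be written as a loss function. The sketch below assumes the commonly used formulation: a KL-divergence term between temperature-softened teacher and student outputs, optionally blended with ordinary cross-entropy on the true label. The `temperature` and `alpha` parameters and the logit values are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target loss (mimic the teacher) and hard-label loss."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy against the true label at temperature 1.
    hard = -math.log(softmax(student_logits)[true_index])
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * (temperature ** 2) * soft + (1 - alpha) * hard

# Hypothetical logits for the classes (cat, dog, lion); the true label is "cat".
teacher_logits = [2.0, 0.7, -0.7]
teacher_probs = softmax(teacher_logits)  # roughly cat 75%, dog 20%, lion 5%

loss_matching = distillation_loss(teacher_logits, teacher_logits, true_index=0)
loss_poor = distillation_loss([0.1, 1.5, 0.2], teacher_logits, true_index=0)
```

Setting `alpha=1.0` trains the student on the teacher's outputs alone, which matches the description above most literally. Blending in the hard labels is a common variation.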
Step 3: Choose a Serving Strategy
REST API Deployment
Wrap the model in a web server. Applications send input data over HTTP and receive predictions in return.
Client App       Server             Model
│                   │                 │
├── POST /predict ─→│                 │
│ {"image": "..."}  │                 │
│                   ├── run model ───→│
│                   │←── prediction ──┤
│←── {"class": "cat", "confidence": 0.92}
│
Response time: 50–200ms typical for a well-optimized API
Common tools: FastAPI (Python), Flask, TensorFlow Serving, TorchServe
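The request flow above can be sketched with Python's standard library alone. This is a dependency-free illustration and does not use FastAPI, Flask, or the other tools listed. The `predict` function is a stand-in for a real model, and the `"features"` payload key is a hypothetical input format chosen to keep the example small.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a real model: averages the inputs into a score."""
    score = sum(features) / len(features)
    label = "cat" if score >= 0.5 else "dog"
    confidence = score if label == "cat" else 1.0 - score
    return {"class": label, "confidence": round(confidence, 2)}

def handle_predict(body):
    """Validate the JSON payload and return (status_code, response_dict)."""
    try:
        features = json.loads(body)["features"]
        if not features or not all(isinstance(x, (int, float)) for x in features):
            raise ValueError("features must be a non-empty list of numbers")
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected JSON like {\"features\": [0.1, 0.2]}"}
    return 200, predict(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_predict(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

def run_server(host="127.0.0.1", port=8000):
    """Blocks and serves POST /predict until interrupted."""
    HTTPServer((host, port), PredictHandler).serve_forever()

# The request logic can be exercised without starting a server.
status, result = handle_predict(b'{"features": [0.9, 0.95, 0.91]}')
bad_status, bad_result = handle_predict(b"not json")
```

Calling `run_server()` starts the server locally. Keeping validation and prediction in `handle_predict`, separate from the HTTP handler, makes the logic easy to unit test and to move to another web framework later.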
Cloud-Based Deployment
| Platform | Service | Best For |
|---|---|---|
| Google Cloud | Vertex AI | TensorFlow models, AutoML, large scale |
| AWS | SageMaker | End-to-end ML platform, auto-scaling |
| Microsoft Azure | Azure ML | Enterprise integration, MLOps pipelines |
| Hugging Face | Inference API | Quick NLP model hosting, free tier available |
On-Device Deployment
Run the model directly on the user's phone or embedded device, with no internet connection required. This keeps user data on the device, removes network round-trip latency, and works offline.
Phone (iOS / Android):
  Frameworks: TensorFlow Lite, Core ML, ONNX Runtime Mobile
  Use cases: face ID, real-time translation, photo filters
Embedded Devices (Raspberry Pi, Coral TPU):
  Frameworks: TensorFlow Lite, OpenVINO
  Use cases: quality inspection cameras, smart meters
Browser (WebAssembly):
  Framework: TensorFlow.js
  Use cases: in-browser image recognition, live webcam demos
Step 4: Monitor the Deployed Model
A model deployed today may become inaccurate over time. Real-world data changes — a fraud detection model trained in 2022 may miss new fraud patterns in 2025. Monitoring catches these issues before they cause significant problems.
Key Metrics to Monitor
| Metric | What to Watch For |
|---|---|
| Prediction accuracy | Should stay stable over time |
| Confidence scores | Drop signals distribution shift |
| Latency | Response time stays within SLA |
| Input data drift | Input distribution changes |
| Error rate | Sudden spikes = problems |
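Several of these metrics can be tracked over a rolling window of recent requests. The sketch below is one possible shape for such a monitor. The class name, the thresholds (0.7 mean confidence, 200 ms p95 latency, 5% error rate), and the recorded values are all assumptions for illustration. Accuracy is left out because true labels usually arrive later than predictions.

```python
from collections import deque

class RollingMonitor:
    """Tracks recent predictions and flags metrics that cross a threshold."""

    def __init__(self, window=100, min_confidence=0.7,
                 max_p95_latency_ms=200.0, max_error_rate=0.05):
        self.records = deque(maxlen=window)  # oldest entries drop off automatically
        self.min_confidence = min_confidence
        self.max_p95_latency_ms = max_p95_latency_ms
        self.max_error_rate = max_error_rate

    def record(self, confidence, latency_ms, error=False):
        self.records.append((confidence, latency_ms, error))

    def alerts(self):
        """Return the names of metrics currently outside their limits."""
        if not self.records:
            return []
        n = len(self.records)
        mean_confidence = sum(r[0] for r in self.records) / n
        latencies = sorted(r[1] for r in self.records)
        p95_latency = latencies[int(0.95 * (n - 1))]
        error_rate = sum(1 for r in self.records if r[2]) / n
        found = []
        if mean_confidence < self.min_confidence:
            found.append("low_confidence")
        if p95_latency > self.max_p95_latency_ms:
            found.append("latency_sla")
        if error_rate > self.max_error_rate:
            found.append("error_rate")
        return found

monitor = RollingMonitor(window=50)

# A healthy period: confident, fast, no errors.
for _ in range(50):
    monitor.record(confidence=0.92, latency_ms=80.0)
healthy_alerts = monitor.alerts()

# A later period where confidence drops, which may signal distribution shift.
for _ in range(50):
    monitor.record(confidence=0.55, latency_ms=90.0)
degraded_alerts = monitor.alerts()
```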
Data Drift
Your model trained on: daytime photos, clear weather
Production receives: nighttime photos during a rainy week
The input distribution has shifted. Model performance drops.
Monitoring detects this → team retrains on new data
This is called "data drift" or "distribution shift."
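Drift in a single numeric input feature can be quantified by comparing its distribution in training data against recent production data. The sketch below uses the Population Stability Index (PSI), one of several possible drift measures. The feature (average image brightness), the synthetic samples, and the bin count are assumptions for illustration. A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but thresholds should be tuned per application.

```python
import math
import random

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a new sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[max(0, min(bins - 1, idx))] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Synthetic brightness values in arbitrary units (hypothetical data).
rng = random.Random(0)
daytime = [rng.gauss(0.7, 0.1) for _ in range(1000)]       # training data
more_daytime = [rng.gauss(0.7, 0.1) for _ in range(1000)]  # production, no drift
nighttime = [rng.gauss(0.3, 0.1) for _ in range(1000)]     # production, drifted

psi_stable = population_stability_index(daytime, more_daytime)
psi_shifted = population_stability_index(daytime, nighttime)
```

In practice a check like this would run on a schedule for each important input feature, with an alert or retraining job triggered when the score stays high.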
The CI/CD Pipeline for ML (MLOps)
Data Update
↓
Automated Retraining
↓
Automated Evaluation (must beat current model)
↓
Automated Testing (unit tests, integration tests)
↓
Gradual Rollout (5% of traffic first, then 100%)
↓
Monitor in Production
↓
(repeat when data drifts or accuracy drops)
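Two steps in this pipeline, the evaluation gate and the gradual rollout, reduce to small pieces of logic. The sketch below is one way to express them. The function names, the `min_gain` margin, and the hash-based bucketing scheme are assumptions, not a description of any specific MLOps tool.

```python
import hashlib

def should_promote(candidate_accuracy, current_accuracy, min_gain=0.0):
    """Evaluation gate: the candidate must beat the current model by a margin."""
    return candidate_accuracy > current_accuracy + min_gain

def route_request(request_id, rollout_percent):
    """Deterministically send `rollout_percent` of traffic to the candidate model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000  # buckets 0..9999
    return "candidate" if bucket < rollout_percent * 100 else "current"

# Gate: a retrained model at 93.1% accuracy versus the live model at 92.4%.
promote = should_promote(candidate_accuracy=0.931, current_accuracy=0.924)

# Rollout: start with 5% of traffic, then move to 100% if monitoring stays healthy.
request_ids = [f"user-{i}" for i in range(10000)]
routed = [route_request(rid, rollout_percent=5) for rid in request_ids]
candidate_share = routed.count("candidate") / len(routed)
```

Hashing the request or user ID, instead of choosing randomly per request, keeps each user on the same model version throughout the rollout, which makes problems easier to trace.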
Common Deployment Mistakes
- Not versioning models — you cannot roll back to a previous model if the new one fails
- Skipping latency testing — a model that takes 5 seconds per prediction is unusable in production
- Ignoring data drift — models silently degrade as the world changes
- Over-engineering too early — a simple Flask API is fine for thousands of requests per day; complex infrastructure adds cost with no benefit at small scale
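Latency testing in particular needs very little code. The sketch below times repeated calls to a prediction function and reports percentiles. The helper name, the run counts, and the stand-in `fake_predict` function are assumptions for illustration. A real test would call the deployed model or its API with representative inputs.

```python
import time

def measure_latency_ms(predict_fn, sample_input, runs=50, warmup=5):
    """Time repeated predictions and return p50, p95, and max in milliseconds."""
    # Warm-up calls absorb one-time costs such as lazy loading or caching.
    for _ in range(warmup):
        predict_fn(sample_input)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(sample_input)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "p50": timings[len(timings) // 2],
        "p95": timings[int(0.95 * (len(timings) - 1))],
        "max": timings[-1],
    }

def fake_predict(features):
    """Stand-in for a model call (hypothetical)."""
    return sum(x * x for x in features)

stats = measure_latency_ms(fake_predict, list(range(1000)))
within_budget = stats["p95"] <= 200.0  # example budget taken from the 50-200 ms range
```

Percentiles matter more than the average here, because a small share of slow requests can break a latency target even when the mean looks fine.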
Key Terms
- Inference — running a trained model on new input to produce a prediction (not training)
- Quantization — converting model weights from 32-bit to 8-bit for speed and size reduction
- Pruning — removing near-zero weights from a trained model
- Knowledge Distillation — training a small model to mimic a large one
- ONNX — Open Neural Network Exchange — a cross-platform model format
- Data Drift — when real-world input data changes from what the model was trained on
- MLOps — the practice of automating machine learning pipelines in production
