Deep Learning Model Deployment

Training a model is only half the job. Deployment puts the model into production — making it available to real users through applications, APIs, or on-device systems. A model that never reaches production creates no value. This topic covers how to take a trained Deep Learning model from your notebook to the real world.

The Deployment Pipeline

Training Environment          Production Environment
──────────────────────────    ────────────────────────────────
 Data → Train → Evaluate  →   Export → Optimize → Serve → Monitor
       (Jupyter, Colab)             (Production Server / Device)

Step 1: Export the Model

After training, you save the model to a file: either the weights alone, or the weights together with the architecture. This file is then loaded wherever the model needs to run.

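TensorFlow / Keras:
  model.save('my_model.h5')         → single-file HDF5 format (legacy)
  model.save('my_model.keras')      → native Keras format (recommended in recent Keras releases)
  model.save('saved_model_dir/')    → TensorFlow SavedModel format (TF 2.x; in Keras 3 use model.export() instead)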

PyTorch:
  torch.save(model.state_dict(), 'model_weights.pth')   → weights only; reloading requires the model class
  torch.onnx.export(model, ...)     → ONNX format (cross-platform)

ONNX (Open Neural Network Exchange):
  An open interchange format: PyTorch exports to it directly, TensorFlow models can be converted to it, and many deployment tools can load it
  → Export from PyTorch → Deploy with ONNX Runtime on servers, edge devices, or in the cloud

Step 2: Optimize for Production

A research model is often large, slow, and memory-hungry. Optimization trims it down for faster, cheaper predictions without sacrificing much accuracy.

Quantization

Training uses 32-bit floating point numbers (float32):
  Weight value: 0.38291746231

Quantization converts to 8-bit integers (int8):
  Weight value: 97  (approximate)

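Result:
  Model size: about 4× smaller (8 bits per weight instead of 32)
  Inference speed: often 2–4× faster, depending on hardware support for int8
  Accuracy drop: typically less than 1%, but it should be measured for each model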

Used in: mobile apps, embedded devices, edge AI
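
The mapping from 0.38291746231 to 97 can be reproduced with a minimal sketch of symmetric int8 quantization. It assumes one shared scale for the whole tensor and a largest absolute weight of 0.5; both are illustrative choices rather than what any particular framework is confirmed to do, and the function names are hypothetical.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for int8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Assumed example tensor: the largest magnitude (0.5) fixes the scale.
weights = [0.38291746231, -0.5, 0.125, 0.0]
q_weights, scale = quantize_symmetric(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)   # [97, -127, 32, 0]
```

Each restored weight differs from the original by at most half a quantization step (scale / 2), which is the source of the small accuracy loss.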

Pruning

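A large model often has many near-zero weights that contribute almost nothing to its predictions.
Pruning removes those weights (in practice, it sets them to zero so they can be skipped or stored sparsely).

Before pruning: 100 million parameters
After pruning:   40 million nonzero parameters (60% removed, accuracy nearly unchanged in this example)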

Analogy: pruning dead branches from a tree → tree grows more efficiently
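
One common way to decide which weights count as "near-zero" is magnitude pruning: zero out the fraction of weights with the smallest absolute value. The sketch below assumes a flat list of weights and a hypothetical function name; real frameworks usually apply this per layer and then fine-tune the model.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Assumed toy layer: six weights are close to zero, four carry the signal.
weights = [0.91, -0.002, 0.004, -0.75, 0.0005, 0.006, -0.001, 0.62, 0.003, -0.48]
pruned = magnitude_prune(weights, sparsity=0.6)
remaining = sum(1 for w in pruned if w != 0.0)

print(pruned)      # [0.91, 0.0, 0.0, -0.75, 0.0, 0.0, 0.0, 0.62, 0.0, -0.48]
print(remaining)   # 4
```

The zeroed weights only save memory or time when the runtime stores or executes the result sparsely.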

Knowledge Distillation

Large Teacher Model (GPT-3 sized) → produces soft probability outputs
Small Student Model → trained to mimic the Teacher's outputs, not the raw labels

Teacher: "cat 75%, dog 20%, lion 5%"  (richer information than label "cat")
Student: learns these probability distributions → can retain most of the Teacher's accuracy
         at a fraction of the size (for example, 90% of the accuracy at 10% of the size)

Used in: DistilBERT (40% smaller than BERT, 60% faster, about 97% of its language-understanding performance)
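
A minimal sketch of the "mimic the Teacher" objective is shown below. It assumes the commonly used formulation: both models' logits are softened with a temperature, and the Student is penalized by the KL divergence between the two distributions. The temperature value, the function names, and the toy logits are assumptions for illustration; real training usually also scales the loss and mixes in the ordinary label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Teacher logits chosen so that softmax at temperature 1 gives cat 75%, dog 20%, lion 5%.
teacher_logits = [math.log(0.75), math.log(0.20), math.log(0.05)]
untrained_student = [0.0, 0.0, 0.0]          # predicts a uniform distribution

teacher_probs = softmax(teacher_logits)
softened_probs = softmax(teacher_logits, temperature=2.0)
loss_before = distillation_loss(untrained_student, teacher_logits)
loss_matched = distillation_loss(teacher_logits, teacher_logits)
```

The loss is zero when the Student reproduces the Teacher's distribution exactly and positive otherwise, so minimizing it pulls the Student toward the Teacher's soft outputs.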

Step 3: Choose a Serving Strategy

REST API Deployment

Wrap the model in a web server. Applications send input data over HTTP and receive predictions in return.

Client App          Server              Model
    │                   │                 │
    ├── POST /predict ─→│                 │
    │  {"image": "..."} │                 │
    │                   ├── run model ───→│
    │                   │←── prediction ──┤
    │←── {"class": "cat", "confidence": 0.92}
    │
Response time: 50–200ms typical for a well-optimized API

Common tools: FastAPI (Python), Flask, TensorFlow Serving, TorchServe
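
The request flow in the diagram can be sketched with Python's standard library alone. In practice FastAPI, Flask, or a dedicated model server would usually take this role; http.server is used here only to keep the example self-contained. The predict function is a hypothetical stand-in for a real model, and the "features" field name is an assumption.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a real model call on a loaded network."""
    score = sum(features) / len(features)
    label = "cat" if score >= 0.5 else "dog"
    return {"class": label, "confidence": round(max(score, 1.0 - score), 2)}


def handle_predict(raw_body):
    """Turn a raw JSON request body into (status_code, response_dict)."""
    try:
        features = json.loads(raw_body)["features"]
        return 200, predict(features)
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        return 400, {"error": "expected JSON with a non-empty numeric 'features' list"}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            status, result = 404, {"error": "unknown endpoint"}
        else:
            length = int(self.headers.get("Content-Length", 0))
            status, result = handle_predict(self.rfile.read(length))
        body = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=8000):
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), PredictHandler)
```

Calling make_server().serve_forever() starts the service; a client can then POST {"features": [0.92, 0.92]} to /predict and receive {"class": "cat", "confidence": 0.92}. Keeping handle_predict separate from the HTTP handler makes the prediction path testable without opening a socket.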

Cloud-Based Deployment

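┌──────────────────────────────────────────────────────────────────────────────────────┐
│  Platform         │  Service        │  Best For                                      │
├──────────────────────────────────────────────────────────────────────────────────────┤
│  Google Cloud     │  Vertex AI      │  TensorFlow models, AutoML, large scale        │
│  AWS              │  SageMaker      │  End-to-end ML platform, auto-scaling          │
│  Microsoft Azure  │  Azure ML       │  Enterprise integration, MLOps pipelines       │
│  Hugging Face     │  Inference API  │  Quick NLP model hosting, free tier available  │
└──────────────────────────────────────────────────────────────────────────────────────┘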

On-Device Deployment

Run the model directly on the user's phone or embedded device, with no internet connection required. This protects privacy, removes network round-trip latency, and works offline.

Phone (iOS / Android):
  Frameworks: TensorFlow Lite, Core ML, ONNX Runtime Mobile
  Use cases: face ID, real-time translation, photo filters

Embedded Devices (Raspberry Pi, Coral TPU):
  Frameworks: TensorFlow Lite, OpenVINO
  Use cases: quality inspection cameras, smart meters

Browser (WebGL / WebAssembly):
  Framework: TensorFlow.js
  Use cases: in-browser image recognition, live webcam demos

Step 4: Monitor the Deployed Model

A model deployed today may become inaccurate over time. Real-world data changes — a fraud detection model trained in 2022 may miss new fraud patterns in 2025. Monitoring catches these issues before they cause significant problems.

Key Metrics to Monitor

┌─────────────────────────────────────────────────────────┐
│  Metric              │  What to Watch For               │
├─────────────────────────────────────────────────────────┤
│  Prediction accuracy │  Should stay stable over time    │
│  Confidence scores   │  Drop signals distribution shift │
│  Latency             │  Response time stays within SLA  │
│  Input data drift    │  Input distribution changes      │
│  Error rate          │  Sudden spikes = problems        │
└─────────────────────────────────────────────────────────┘

Data Drift

Your model trained on:  daytime photos, clear weather
Production receives:    nighttime photos during a rainy week

The input distribution has shifted.
Model performance drops.
Monitoring detects this → team retrains on new data

This is called "data drift" or "distribution shift."
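
One way monitoring can detect such a shift is to compare the distribution of an input feature in production against the training data. The sketch below uses the Population Stability Index (PSI) on a single hypothetical feature, average image brightness. The feature choice, the bin count, and the often-quoted rule of thumb that a PSI above roughly 0.25 signals a major shift are assumptions that should be tuned per application.

```python
import math


def population_stability_index(reference, production, bins=10):
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref = bin_fractions(reference)
    prod = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


# Assumed toy data: brightness of daytime training photos vs. nighttime production photos.
daytime_brightness = [0.50 + 0.004 * i for i in range(100)]
nighttime_brightness = [0.05 + 0.003 * i for i in range(100)]

psi_same = population_stability_index(daytime_brightness, daytime_brightness)
psi_shifted = population_stability_index(daytime_brightness, nighttime_brightness)
drift_detected = psi_shifted > 0.25
```

Here psi_same is 0 because the two samples are identical, while the nighttime sample falls entirely outside the training range and produces a large PSI, which would trigger the retraining step.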

The CI/CD Pipeline for ML (MLOps)

Data Update
    ↓
Automated Retraining
    ↓
Automated Evaluation (must beat current model)
    ↓
Automated Testing (unit tests, integration tests)
    ↓
Gradual Rollout (5% of traffic first, then 100%)
    ↓
Monitor in Production
    ↓
(repeat when data drifts or accuracy drops)
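
Two of the steps above, the evaluation gate and the gradual rollout, can be sketched in a few lines. The function names, the accuracy values, and the hash-based bucketing are illustrative assumptions; a production setup would typically delegate traffic splitting to a load balancer or serving platform.

```python
import hashlib


def should_promote(candidate_accuracy, current_accuracy, min_gain=0.0):
    """Evaluation gate: the retrained model must beat the current one."""
    return candidate_accuracy > current_accuracy + min_gain


def route_request(user_id, rollout_percent):
    """Deterministically send about rollout_percent% of users to the candidate model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100            # stable bucket in 0..99 per user
    return "candidate" if bucket < rollout_percent else "current"


# Assumed evaluation results for the two models.
promote = should_promote(candidate_accuracy=0.93, current_accuracy=0.91)

# Start with 5% of traffic, as in the pipeline above.
users = [f"user-{i}" for i in range(10000)]
routed = [route_request(u, rollout_percent=5) for u in users]
candidate_share = routed.count("candidate") / len(routed)
```

Hashing the user ID keeps each user on the same model between requests, so raising rollout_percent from 5 to 100 only ever moves users from the current model to the candidate.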

Common Deployment Mistakes

Not versioning models — you cannot roll back to a previous model if the new one fails
Skipping latency testing — a model that takes 5 seconds per prediction is unusable in production
Ignoring data drift — models silently degrade as the world changes
Over-engineering too early — a simple Flask API is fine for thousands of requests per day; complex infrastructure adds cost with no benefit at small scale
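
Latency testing does not need special tooling to get started. The sketch below times repeated calls to a prediction function and reports percentiles; the function names are hypothetical and the lambda stands in for a real model call. Percentiles such as p95 are reported because averages hide slow outliers.

```python
import time


def measure_latency_ms(predict_fn, sample_input, runs=100, warmup=5):
    """Time repeated predictions and return latency percentiles in milliseconds."""
    for _ in range(warmup):                   # warm-up calls are not timed
        predict_fn(sample_input)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(sample_input)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()

    def percentile(p):
        idx = min(len(timings) - 1, int(p / 100 * len(timings)))
        return timings[idx]

    return {"p50": percentile(50), "p95": percentile(95), "max": timings[-1]}


# Stand-in for a real model: any callable that takes one input works here.
stats = measure_latency_ms(lambda x: sum(v * v for v in x), list(range(1000)), runs=50)
print(stats)
```

The same function can be pointed at a real model's predict call and compared against the latency budget before release.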

Key Terms

Inference — running a trained model on new input to produce a prediction (not training)
Quantization — converting model weights from 32-bit to 8-bit for speed and size reduction
Pruning — removing near-zero weights from a trained model
Knowledge Distillation — training a small model to mimic a large one
ONNX — Open Neural Network Exchange — a cross-platform model format
Data Drift — when real-world input data changes from what the model was trained on
MLOps — the practice of automating machine learning pipelines in production
