ML Model Deployment and MLOps

Training a Machine Learning model is only one part of the process. A model that sits on a laptop helps no one. Deployment is the process of making a trained model available so that real users and systems can send inputs and receive predictions. MLOps (Machine Learning Operations) is the discipline of managing the full lifecycle of ML models — from development through deployment to ongoing monitoring and maintenance.

The Gap Between Research and Production

A common mistake: treating model training as the finish line.

Research Stage (Notebook / Laptop):
  Data → Train → Evaluate → "87% accuracy!" → done?

Production Reality:
  ✗ The model must handle thousands of requests per second.
  ✗ Input data in production may differ from training data.
  ✗ The model must integrate with existing software systems.
  ✗ Failures must be caught and recovered automatically.
  ✗ The model's accuracy must be monitored over time.
  ✗ New model versions must be tested before replacing old ones.

MLOps addresses all of these challenges systematically.

The Machine Learning Lifecycle

┌──────────────────────────────────────────────────────────────────┐
│                  Machine Learning Lifecycle                      │
│                                                                  │
│  Business Problem                                                │
│       │                                                          │
│       ▼                                                          │
│  Data Collection & Storage                                       │
│       │                                                          │
│       ▼                                                          │
│  Data Preprocessing & Feature Engineering                        │
│       │                                                          │
│       ▼                                                          │
│  Model Training & Evaluation                                     │
│       │                                                          │
│       ▼                                                          │
│  Model Registry (versioned model storage)                        │
│       │                                                          │
│       ▼                                                          │
│  Deployment (API / Batch / Edge)                                 │
│       │                                                          │
│       ▼                                                          │
│  Monitoring (accuracy, latency, data drift)                      │
│       │                                                          │
│       └──► Retrain when needed ──► Back to Training              │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Types of Model Deployment

REST API (Real-Time Online Prediction)

The most common deployment approach. The model is wrapped
inside a web server, and any application can send a request
and get a prediction back.

Flow:
  User App → HTTP Request (with input data) →
  → API Server (loads model) →
  → Model predicts →
  → Response with prediction → User App

Example: Fraud Detection
  Bank's payment system sends transaction details via API.
  Model returns: {"fraud_probability": 0.92, "decision": "BLOCK"}
  Bank blocks the transaction in real time.

Tools: Flask, FastAPI (Python), Docker, Kubernetes
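
The fraud-detection flow above can be sketched as a tiny prediction API. To keep this self-contained it uses only the Python standard library (in practice Flask or FastAPI would be used), and `predict_fraud` is a hypothetical stand-in rule, not a trained model:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict_fraud(transaction):
    """Stand-in for model.predict_proba(): flag large foreign transactions."""
    score = 0.9 if transaction["amount"] > 10_000 and transaction["foreign"] else 0.1
    return {"fraud_probability": score,
            "decision": "BLOCK" if score > 0.5 else "ALLOW"}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run the model, return JSON.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict_fraud(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # Simulate the bank's payment system calling the API.
    req = Request(f"http://127.0.0.1:{port}/predict",
                  data=json.dumps({"amount": 25_000, "foreign": True}).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        print(json.loads(resp.read()))  # {'fraud_probability': 0.9, 'decision': 'BLOCK'}
    server.shutdown()
```

A real service would load a serialized model at startup instead of a hard-coded rule, but the request/response shape is the same.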

Batch Prediction

Predictions run on a large dataset at once, not on demand.
No real-time requirement — results stored for later use.

Example: Email Campaign Scoring
  Every night at midnight:
    1. Load 5 million customer records
    2. Run model → predict purchase probability for each
    3. Store predictions in a database
    4. Marketing team uses scores the next morning

Best for:
  ✓ Large volumes, not time-sensitive
  ✓ Weekly/daily reporting
  ✓ Recommendations computed ahead of time

Tools: Apache Spark, AWS Batch, Airflow, scheduled Python scripts
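
The nightly campaign-scoring job above can be sketched as follows. All names are illustrative: `score_customer` stands in for a trained model's predict call, and a plain dict stands in for the database table a real job would write to (via Spark or a scheduled script):

```python
from datetime import date

def score_customer(customer):
    """Stand-in for model.predict_proba() on one customer record."""
    base = 0.05
    base += 0.3 if customer["visited_last_week"] else 0.0
    base += 0.2 if customer["past_purchases"] > 3 else 0.0
    return round(min(base, 1.0), 2)

def run_batch(customers, store):
    """Score every record and persist results for the marketing team."""
    run_date = date.today().isoformat()
    for customer in customers:
        store[customer["id"]] = {
            "purchase_probability": score_customer(customer),
            "scored_on": run_date,
        }
    return len(store)

if __name__ == "__main__":
    customers = [
        {"id": 1, "visited_last_week": True,  "past_purchases": 5},
        {"id": 2, "visited_last_week": False, "past_purchases": 0},
    ]
    predictions = {}  # stands in for a database table
    run_batch(customers, predictions)
    print(predictions[1]["purchase_probability"])  # 0.55
```

Because nothing here is request-driven, the whole job can simply be scheduled (cron, Airflow) to run at midnight.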

Edge Deployment

Model runs directly on a device — phone, camera, sensor.
No internet connection needed. Very low latency.

Examples:
  - Face unlock on a smartphone (runs on device)
  - Object detection camera in a factory (no cloud needed)
  - Voice assistant on a smart speaker

Challenge: Edge devices have limited memory and compute.
Solution: Model compression — make the model smaller.
  Quantization: Reduce weight precision (float32 → int8)
  Pruning:      Remove unimportant connections
  Distillation: Teach a small "student" model from a large "teacher"

Tools: TensorFlow Lite, ONNX Runtime, Apple CoreML
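
To make quantization concrete, here is a minimal sketch of symmetric post-training quantization in pure Python: float weights are mapped to the int8 range with a single scale factor. This is the conceptual idea behind what tools like TensorFlow Lite do, not their actual implementation:

```python
def quantize(weights):
    """Map float weights into the int8 range [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.003, 0.5]
q, scale = quantize(weights)
print(q)  # small integers in [-127, 127], stored in 1 byte each
recovered = dequantize(q, scale)
print(max(abs(w - r) for w, r in zip(weights, recovered)))  # small rounding error
```

The storage win: each weight goes from 4 bytes (float32) to 1 byte (int8), at the cost of a rounding error bounded by half the scale.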

Model Serialization

After training, the model must be saved (serialized) to disk
so it can be loaded later for prediction.

Common formats:
┌─────────────────────┬────────────────────────────────────────────┐
│ Format              │ Used For                                   │
├─────────────────────┼────────────────────────────────────────────┤
│ Pickle (.pkl)       │ Scikit-learn models (Python-specific)      │
│ Joblib (.joblib)    │ Scikit-learn (faster on large numpy arrays)│
│ ONNX (.onnx)        │ Cross-platform, any language/framework     │
│ SavedModel          │ TensorFlow / Keras models                  │
│ .pt / .pth          │ PyTorch models                             │
│ PMML                │ Enterprise standard for ML model export    │
└─────────────────────┴────────────────────────────────────────────┘

Example:
  Training: Train XGBoost model → Save as model.pkl
  Deployment: Load model.pkl → Receive new data → Predict → Return result
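
The save/load cycle looks like this with pickle. A hypothetical `ThresholdModel` class stands in for a trained estimator; the same two-step pattern applies to a real scikit-learn or XGBoost model (where `joblib.dump` is often preferred):

```python
import os
import pickle
import tempfile

class ThresholdModel:
    """Stand-in for a trained model object with a predict() method."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, values):
        return [1 if v >= self.threshold else 0 for v in values]

# Training side: fit, then serialize the fitted object to disk.
model = ThresholdModel(threshold=0.5)
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Deployment side: deserialize and predict on new data.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict([0.2, 0.9]))  # [0, 1]
```

One caveat worth knowing: unpickling runs arbitrary code, so pickle files should only ever be loaded from trusted sources.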

Containerization with Docker

Problem: "It works on my laptop but not on the server!"
  Different Python versions, different library versions,
  different operating systems — all cause deployment failures.

Docker solution:
  Package the model, Python environment, all dependencies, and
  the prediction code into a single container image.
  The container runs identically everywhere.

Dockerfile Example Structure:
  FROM python:3.10-slim         ← Base Python image
  COPY requirements.txt .       ← List of libraries
  RUN pip install -r requirements.txt  ← Install libraries
  COPY model.pkl .              ← Copy trained model
  COPY app.py .                 ← Copy prediction API code
  CMD ["python", "app.py"]      ← Start the API server

Result: One Docker image runs identically on:
  ✓ Laptop (development)
  ✓ Test server (QA)
  ✓ Production cloud (AWS, Azure, GCP)

Model Monitoring

A deployed model does not maintain itself.
The world changes — and the model must be watched.

What to Monitor:

1. Prediction Quality (Model Performance):
   Compare predictions to actual outcomes as they arrive.
   Did fraud predictions match actual fraud outcomes?

2. Data Drift (Input Distribution Shift):
   The statistical properties of incoming data change over time.
   Example: COVID changed customer buying patterns overnight.
   Old model trained on pre-COVID data no longer applies.

   Detect with: KS test, PSI (Population Stability Index)

3. Prediction Drift (Output Distribution Shift):
   The model's predictions change even if inputs seem the same.
   "Average fraud probability jumped from 2% to 18% this week."

4. Latency and System Health:
   Is the API responding within acceptable time?
   Are there server errors, memory issues, or crashes?

5. Data Quality:
   Are null values increasing in incoming data?
   Are any expected features suddenly missing?

Alert Thresholds:
  Accuracy drop > 5% → Send alert, investigate
  Input feature drift > threshold → Retrain trigger
  API response time > 500ms → Infrastructure alert
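
The PSI mentioned above can be computed in a few lines. This is a minimal sketch: the binning scheme is simplified, and the 0.2 alert cutoff is a common convention rather than a universal rule:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between training-time data and recent production data."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-4) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p, q))

baseline = [i / 100 for i in range(100)]        # training distribution
same     = [i / 100 for i in range(100)]        # no drift
shifted  = [0.8 + i / 500 for i in range(100)]  # values piled into the top bins
print(psi(baseline, same))     # 0.0  → stable
print(psi(baseline, shifted))  # large → drift alert (common cutoff: 0.2)
```

A monitoring job would run this per feature on a schedule and fire the retrain trigger when the index crosses the threshold.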

CI/CD for Machine Learning

CI/CD = Continuous Integration / Continuous Deployment

Software CI/CD: New code → tests → deploy automatically.
ML CI/CD extends this to cover data and model changes.

ML Pipeline Steps:

  Code change OR new data arrives
           │
           ▼
  Data validation (check schema, distributions)
           │
           ▼
  Model training (on updated dataset)
           │
           ▼
  Model evaluation (compare to previous model)
           │
           ▼
  Passes threshold? (new model better than old?)
           │
           ├── Yes → Deploy new model automatically
           │
           └── No  → Alert team, do not deploy

This entire pipeline runs automatically using tools like
MLflow, Kubeflow, ZenML, or GitHub Actions.
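
The "passes threshold?" gate in the pipeline above can be sketched as a single decision function. The metric names, the 0.005 improvement margin, and the 500 ms latency budget are illustrative choices, not fixed rules:

```python
def evaluation_gate(current, candidate, max_latency_ms=500):
    """Decide whether a candidate model replaces the current one.

    current / candidate: dicts of offline evaluation results,
    e.g. {"accuracy": 0.87, "p95_latency_ms": 120}.
    """
    # Require a minimum improvement, not just any improvement,
    # to avoid redeploying on noise.
    better = candidate["accuracy"] > current["accuracy"] + 0.005
    fast_enough = candidate["p95_latency_ms"] <= max_latency_ms
    return "deploy" if better and fast_enough else "alert"

current   = {"accuracy": 0.87, "p95_latency_ms": 120}
candidate = {"accuracy": 0.91, "p95_latency_ms": 130}
print(evaluation_gate(current, candidate))  # deploy
```

In a real pipeline this check runs automatically after evaluation; "deploy" promotes the model in the registry, "alert" notifies the team and leaves the current model serving.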

A/B Testing for Models

Before fully replacing an old model with a new one,
split traffic between both and compare real performance.

Example: Recommendation System

  50% of users → Old Model A  (previous recommendation algorithm)
  50% of users → New Model B  (newly trained model)

  Run for 2 weeks. Measure:
    Click-through rate: A=4.2%, B=5.1% → B is better
    Revenue per session: A=₹85, B=₹97   → B is better

  Result: Route 100% of traffic to Model B.
  
A/B testing prevents deploying models that look better
on test data but underperform in the real world.
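
One practical detail of A/B testing is routing: each user must see the same variant for the whole test, or the metrics are meaningless. A common approach, sketched here, is to hash the user id into a bucket so assignment is deterministic and needs no lookup table:

```python
import hashlib

def assign_variant(user_id, split=0.5):
    """Stably assign a user to model 'A' or 'B' by hashing their id."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "B" if bucket < split else "A"

# The split comes out close to 50/50 across many users...
counts = {"A": 0, "B": 0}
for uid in range(10_000):
    counts[assign_variant(uid)] += 1
print(counts)  # roughly 5000 each

# ...and the same user always lands in the same group.
print(assign_variant(42) == assign_variant(42))  # True
```

Adjusting `split` also supports canary rollouts, e.g. sending only 5% of traffic to the new model at first.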

MLOps Tools Overview

┌─────────────────────────────┬───────────────────────────────────────┐
│ MLOps Area                  │ Common Tools                          │
├─────────────────────────────┼───────────────────────────────────────┤
│ Experiment Tracking         │ MLflow, Weights & Biases, Neptune     │
│ Feature Store               │ Feast, Tecton, Hopsworks              │
│ Model Registry              │ MLflow Models, Hugging Face Hub       │
│ Pipeline Orchestration      │ Apache Airflow, Kubeflow, ZenML       │
│ Model Serving (API)         │ FastAPI, TorchServe, Triton, BentoML  │
│ Containerization            │ Docker, Kubernetes                    │
│ Cloud ML Platforms          │ AWS SageMaker, GCP Vertex AI,         │
│                             │ Azure ML, Databricks                  │
│ Model Monitoring            │ Evidently AI, WhyLabs, Arize          │
└─────────────────────────────┴───────────────────────────────────────┘

The Production Machine Learning Stack

Data Sources (databases, APIs, sensors)
        │
        ▼
Feature Pipeline (compute + store features)
        │
        ▼
Training Pipeline (data → model → evaluation)
        │
        ▼
Model Registry (versioned, tagged models)
        │
        ▼
Deployment Service (API / Batch / Edge)
        │
        ▼
Users / Applications (receive predictions)
        │
        ▼
Monitoring System (track performance and drift)
        │
        └──► Retraining Pipeline (triggered automatically) ──► Training Pipeline

Best Practices Summary

┌──────────────────────────────────────────────┬────────────────────┐
│ Practice                                     │ Why It Matters     │
├──────────────────────────────────────────────┼────────────────────┤
│ Version control data, code, and models       │ Reproducibility    │
│ Log all experiments with parameters, metrics │ Comparability      │
│ Automate training and evaluation pipelines   │ Consistency        │
│ Use shadow mode before full deployment       │ Safety             │
│ Monitor data drift continuously              │ Reliability        │
│ Set automated retraining triggers            │ Freshness          │
│ Document model inputs, outputs, and limits   │ Transparency       │
│ Test model for fairness across groups        │ Ethics             │
└──────────────────────────────────────────────┴────────────────────┘

Machine Learning deployment and MLOps transform a model from an experimental script into a reliable production system. A well-deployed model serves real users, adapts to changing data, and improves continuously — making the full lifecycle of Machine Learning complete.
