ML Model Deployment and MLOps
Training a Machine Learning model is only one part of the process. A model that sits on a laptop helps no one. Deployment is the process of making a trained model available so that real users and systems can send inputs and receive predictions. MLOps (Machine Learning Operations) is the discipline of managing the full lifecycle of ML models — from development through deployment to ongoing monitoring and maintenance.
The Gap Between Research and Production
A common mistake: treating model training as the finish line.

Research Stage (Notebook / Laptop):
Data → Train → Evaluate → "87% accuracy!" → done?

Production Reality:
✗ The model must handle thousands of requests per second.
✗ Input data in production may differ from training data.
✗ The model must integrate with existing software systems.
✗ Failures must be caught and recovered automatically.
✗ The model's accuracy must be monitored over time.
✗ New model versions must be tested before replacing old ones.

MLOps addresses all of these challenges systematically.
The Machine Learning Lifecycle
Business Problem
      │
      ▼
Data Collection & Storage
      │
      ▼
Data Preprocessing & Feature Engineering
      │
      ▼
Model Training & Evaluation
      │
      ▼
Model Registry (versioned model storage)
      │
      ▼
Deployment (API / Batch / Edge)
      │
      ▼
Monitoring (accuracy, latency, data drift)
      │
      └──► Retrain when needed ──► Back to Training
Types of Model Deployment
REST API (Real-Time Online Prediction)
Most common deployment approach.
The model is wrapped inside a web server.
Any application sends a request → gets a prediction back.
Flow:
User App → HTTP Request (with input data) →
→ API Server (loads model) →
→ Model predicts →
→ Response with prediction → User App
Example: Fraud Detection
Bank's payment system sends transaction details via API.
Model returns: {"fraud_probability": 0.92, "decision": "BLOCK"}
Bank blocks the transaction in real time.
Tools: Flask, FastAPI (Python), Docker, Kubernetes
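The request → predict → response loop can be sketched without any framework. Real services would use Flask or FastAPI behind Docker/Kubernetes as listed above; this standard-library-only sketch uses a hypothetical `score_transaction` rule in place of a real trained model, just to show the shape of a prediction endpoint.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a real trained model. In production this would
# be e.g. a pickled scikit-learn pipeline loaded once at server startup.
def score_transaction(features: dict) -> dict:
    # Toy rule: large amounts at unusual hours look risky.
    risk = 0.0
    if features.get("amount", 0) > 10_000:
        risk += 0.6
    if features.get("hour", 12) < 5:
        risk += 0.35
    return {"fraud_probability": round(risk, 2),
            "decision": "BLOCK" if risk >= 0.9 else "ALLOW"}

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON request body, score it, return a JSON prediction.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps(score_transaction(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve on port 8000:
# HTTPServer(("0.0.0.0", 8000), PredictionHandler).serve_forever()
```

A framework like FastAPI adds input validation, docs, and async handling on top of this same pattern.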
Batch Prediction
Predictions run on a large dataset at once, not on demand.
No real-time requirement — results stored for later use.
Example: Email Campaign Scoring
Every night at midnight:
1. Load 5 million customer records
2. Run model → predict purchase probability for each
3. Store predictions in a database
4. Marketing team uses scores the next morning
Best for:
✓ Large volumes, not time-sensitive
✓ Weekly/daily reporting
✓ Recommendations computed ahead of time
Tools: Apache Spark, AWS Batch, Airflow, scheduled Python scripts
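The nightly flow above can be sketched as a plain Python script of the kind a scheduler (cron, Airflow) would run. The `purchase_probability` heuristic is a hypothetical stand-in for a real model; results land in SQLite here, where production would use a warehouse table.

```python
import sqlite3

# Hypothetical scoring function standing in for a real trained model.
def purchase_probability(customer: dict) -> float:
    # Toy heuristic: frequent recent buyers are more likely to buy again.
    score = 0.05 + 0.1 * min(customer["orders_last_year"], 8)
    return round(min(score, 0.95), 2)

def run_batch_scoring(customers: list, db_path: str = ":memory:") -> int:
    """Score every customer and store the results for the morning team."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS scores "
                 "(customer_id INTEGER PRIMARY KEY, probability REAL)")
    rows = [(c["id"], purchase_probability(c)) for c in customers]
    conn.executemany("INSERT OR REPLACE INTO scores VALUES (?, ?)", rows)
    conn.commit()
    n = conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0]
    conn.close()
    return n
```

At 5 million records the same structure holds; only the execution engine changes (Spark partitions the scoring across workers).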
Edge Deployment
Model runs directly on a device — phone, camera, sensor.
No internet connection needed. Very low latency.

Examples:
- Face unlock on a smartphone (runs on device)
- Object detection camera in a factory (no cloud needed)
- Voice assistant on a smart speaker

Challenge: Edge devices have limited memory and compute.
Solution: Model compression — make the model smaller.

Quantization:  Reduce weight precision (float32 → int8)
Pruning:       Remove unimportant connections
Distillation:  Teach a small "student" model from a large "teacher"

Tools: TensorFlow Lite, ONNX Runtime, Apple CoreML
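The float32 → int8 step is worth seeing as arithmetic. This is a pure-Python illustration of affine quantization (the scheme used, in spirit, by TensorFlow Lite and ONNX Runtime), not a call into either library: weights are mapped onto the signed 8-bit range and recovered with a scale and zero point.

```python
def quantize(weights, num_bits=8):
    """Map float weights onto signed integers (affine quantization)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128..127
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the int8 representation."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.2, 0.0, 0.37, 2.5]
q, scale, zp = quantize(weights)
approx = dequantize(q, scale, zp)
# Each recovered weight is within one quantization step of the original,
# while storage drops from 32 bits to 8 bits per weight.
```

The model shrinks 4×, at the cost of a small, bounded rounding error per weight.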
Model Serialization
After training, the model must be saved (serialized) to disk so it can be loaded later for prediction.

Common formats:

┌─────────────────────┬─────────────────────────────────────────────┐
│ Format              │ Used For                                    │
├─────────────────────┼─────────────────────────────────────────────┤
│ Pickle (.pkl)       │ Scikit-learn models (Python-specific)       │
│ Joblib (.joblib)    │ Scikit-learn (faster for large numpy arrays)│
│ ONNX (.onnx)        │ Cross-platform, any language/framework      │
│ SavedModel          │ TensorFlow / Keras models                   │
│ .pt / .pth          │ PyTorch models                              │
│ PMML                │ Enterprise standard for ML model export     │
└─────────────────────┴─────────────────────────────────────────────┘

Example:
Training:   Train XGBoost model → Save as model.pkl
Deployment: Load model.pkl → Receive new data → Predict → Return result
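The save-then-load round trip looks like this with pickle. A tiny hand-written linear model stands in for a fitted scikit-learn or XGBoost object so the sketch is self-contained; the two halves would normally run on different machines.

```python
import os
import pickle
import tempfile

class TinyLinearModel:
    """Minimal stand-in for a trained model (e.g. scikit-learn/XGBoost)."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

# Training side: save the fitted model to disk.
model = TinyLinearModel(weights=[0.5, -0.25], bias=1.0)
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Deployment side: load the model and serve predictions.
with open(path, "rb") as f:
    loaded = pickle.load(f)
prediction = loaded.predict([4.0, 2.0])  # 0.5*4 - 0.25*2 + 1 = 2.5
```

One caveat worth knowing: pickle executes code on load, so only unpickle files you trust — one reason cross-platform formats like ONNX exist.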
Containerization with Docker
Problem: "It works on my laptop but not on the server!"
Different Python versions, different library versions, different operating systems — all cause deployment failures.

Docker solution: Package the model, Python environment, all dependencies, and the prediction code into a single container image. The container runs identically everywhere.

Dockerfile Example Structure:

FROM python:3.10-slim                  ← Base Python image
COPY requirements.txt .                ← List of libraries
RUN pip install -r requirements.txt    ← Install libraries
COPY model.pkl .                       ← Copy trained model
COPY app.py .                          ← Copy prediction API code
CMD ["python", "app.py"]               ← Start the API server

Result: One Docker image runs identically on:
✓ Laptop (development)
✓ Test server (QA)
✓ Production cloud (AWS, Azure, GCP)
Model Monitoring
A deployed model does not maintain itself. The world changes — and the model must be watched.

What to Monitor:

1. Prediction Quality (Model Performance):
   Compare predictions to actual outcomes as they arrive.
   Did fraud predictions match actual fraud outcomes?

2. Data Drift (Input Distribution Shift):
   The statistical properties of incoming data change over time.
   Example: COVID changed customer buying patterns overnight.
   Old model trained on pre-COVID data no longer applies.
   Detect with: KS test, PSI (Population Stability Index)

3. Prediction Drift (Output Distribution Shift):
   The model's predictions change even if inputs seem the same.
   "Average fraud probability jumped from 2% to 18% this week."

4. Latency and System Health:
   Is the API responding within acceptable time?
   Are there server errors, memory issues, or crashes?

5. Data Quality:
   Are null values increasing in incoming data?
   Are any expected features suddenly missing?

Alert Thresholds:
Accuracy drop > 5%              → Send alert, investigate
Input feature drift > threshold → Retrain trigger
API response time > 500ms       → Infrastructure alert
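PSI, mentioned above for drift detection, is simple enough to compute by hand. This sketch uses fixed equal-width bins derived from the baseline sample (libraries like Evidently AI handle binning and edge cases more carefully): bin the baseline and the live sample the same way, then compare the bin proportions.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def frac(sample, i):
        # Share of the sample falling into bin i (last bin includes hi).
        left, right = lo + i * width, lo + (i + 1) * width
        count = sum(1 for x in sample
                    if left <= x < right or (i == bins - 1 and x == hi))
        return max(count / len(sample), 1e-6)  # avoid log(0) for empty bins

    return sum((frac(actual, i) - frac(expected, i))
               * math.log(frac(actual, i) / frac(expected, i))
               for i in range(bins))
```

An unchanged distribution scores near 0; a distribution whose mass has moved into new bins scores well above the 0.25 alert threshold.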
CI/CD for Machine Learning
CI/CD = Continuous Integration / Continuous Deployment
Software CI/CD: New code → tests → deploy automatically.
ML CI/CD extends this to cover data and model changes.
ML Pipeline Steps:
Code change OR new data arrives
│
▼
Data validation (check schema, distributions)
│
▼
Model training (on updated dataset)
│
▼
Model evaluation (compare to previous model)
│
▼
Passes threshold? (new model better than old?)
│
├── Yes → Deploy new model automatically
│
└── No → Alert team, do not deploy
This entire pipeline runs automatically using tools like
MLflow, Kubeflow, ZenML, or GitHub Actions.
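The "passes threshold?" gate in the pipeline above reduces to a small comparison function. A hedged sketch, assuming the pipeline tracks a dict of metrics per model (names here are illustrative):

```python
def should_deploy(new_metrics: dict, old_metrics: dict,
                  min_improvement: float = 0.0) -> bool:
    """Evaluation gate: deploy only if the new model matches or beats the
    old one on every tracked metric by at least `min_improvement`."""
    return all(new_metrics[m] >= old_metrics[m] + min_improvement
               for m in old_metrics)

old = {"accuracy": 0.87, "auc": 0.91}
new = {"accuracy": 0.89, "auc": 0.93}
# should_deploy(new, old) → deploy automatically
# should_deploy(old, new) → alert team, do not deploy
```

In MLflow or Kubeflow this check would run as a pipeline step, with the winning model tagged in the model registry.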
A/B Testing for Models
Before fully replacing an old model with a new one,
split traffic between both and compare real performance.
Example: Recommendation System
50% of users → Old Model A (previous recommendation algorithm)
50% of users → New Model B (newly trained model)
Run for 2 weeks. Measure:
Click-through rate: A=4.2%, B=5.1% → B is better
Revenue per session: A=₹85, B=₹97 → B is better
Result: Route 100% of traffic to Model B.
A/B testing prevents deploying models that look better
on test data but underperform in the real world.
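Deciding whether B's lift is real or noise is a statistics question. A minimal sketch using a two-proportion z-test on click-through rates — the counts below are illustrative, chosen to match the 4.2% vs 5.1% rates in the example above:

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z-statistic for comparing two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts: 10,000 sessions per arm over the two-week test.
z = two_proportion_z(clicks_a=420, n_a=10_000, clicks_b=510, n_b=10_000)
significant = abs(z) > 1.96  # 95% confidence, two-sided
```

With |z| ≈ 3, the difference clears the 1.96 cutoff, supporting the decision to route traffic to Model B; with smaller samples the same rate gap might not.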
MLOps Tools Overview
┌─────────────────────────────┬───────────────────────────────────────┐
│ MLOps Area                  │ Common Tools                          │
├─────────────────────────────┼───────────────────────────────────────┤
│ Experiment Tracking         │ MLflow, Weights & Biases, Neptune     │
│ Feature Store               │ Feast, Tecton, Hopsworks              │
│ Model Registry              │ MLflow Models, Hugging Face Hub       │
│ Pipeline Orchestration      │ Apache Airflow, Kubeflow, ZenML       │
│ Model Serving (API)         │ FastAPI, TorchServe, Triton, BentoML  │
│ Containerization            │ Docker, Kubernetes                    │
│ Cloud ML Platforms          │ AWS SageMaker, GCP Vertex AI,         │
│                             │ Azure ML, Databricks                  │
│ Model Monitoring            │ Evidently AI, WhyLabs, Arize          │
└─────────────────────────────┴───────────────────────────────────────┘
The Production Machine Learning Stack
Data Sources (databases, APIs, sensors)
│
▼
Feature Pipeline (compute + store features)
│
▼
Training Pipeline (data → model → evaluation)
│
▼
Model Registry (versioned, tagged models)
│
▼
Deployment Service (API / Batch / Edge)
│
▼
Users / Applications (receive predictions)
│
▼
Monitoring System (track performance and drift)
│
└──► Retraining Pipeline (triggered automatically) ──► Training Pipeline
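The monitoring → retraining link at the bottom of the stack is typically a small decision rule evaluated on each monitoring run. A sketch, assuming the monitoring system exposes a live accuracy estimate and a drift score (thresholds match the alert table in the monitoring section):

```python
def check_retrain_trigger(live_accuracy, baseline_accuracy, drift_score,
                          max_accuracy_drop=0.05, max_drift=0.25):
    """Decide, from monitoring signals, whether to kick off retraining."""
    if baseline_accuracy - live_accuracy > max_accuracy_drop:
        return "retrain: accuracy degraded"
    if drift_score > max_drift:
        return "retrain: input drift detected"
    return "ok"
```

An orchestrator such as Airflow would run this check on a schedule and launch the training pipeline when it returns a retrain verdict.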
Best Practices Summary
┌──────────────────────────────────────────────┬────────────────────┐
│ Practice                                     │ Why It Matters     │
├──────────────────────────────────────────────┼────────────────────┤
│ Version control data, code, and models       │ Reproducibility    │
│ Log all experiments with parameters, metrics │ Comparability      │
│ Automate training and evaluation pipelines   │ Consistency        │
│ Use shadow mode before full deployment       │ Safety             │
│ Monitor data drift continuously              │ Reliability        │
│ Set automated retraining triggers            │ Freshness          │
│ Document model inputs, outputs, and limits   │ Transparency       │
│ Test model for fairness across groups        │ Ethics             │
└──────────────────────────────────────────────┴────────────────────┘
Machine Learning deployment and MLOps transform a model from an experimental script into a reliable production system. A well-deployed model serves real users, adapts to changing data, and improves continuously — making the full lifecycle of Machine Learning complete.
