Databricks MLflow

Building a machine learning model is not a one-step process. Data scientists run dozens or hundreds of experiments before finding a model that works well. They test different settings, different algorithms, and different data combinations. Without a system to record each attempt, everything becomes chaotic. Which experiment gave the best accuracy? What exact settings produced that result? Can someone reproduce the same model six months later?

MLflow is the answer to these questions. It is an open-source platform, deeply integrated into Databricks, that tracks every machine learning experiment, stores every model, and manages the entire journey from experiment to production deployment. Think of MLflow as the lab notebook for data scientists — every experiment gets recorded, every result gets saved, and the best models get deployed to serve real users.

The Four Core Components of MLflow

MLflow organizes its functionality into four main areas. Each area addresses a specific challenge in the machine learning lifecycle.

MLflow Tracking

MLflow Tracking records the details of each experiment run. When a data scientist trains a model, MLflow captures the parameters used, the metrics achieved, and the artifacts produced — all automatically or with a few lines of code.

MLflow Projects

MLflow Projects packages machine learning code in a standard format so that anyone can reproduce an experiment. The code, dependencies, and entry points are bundled together, eliminating the "it works on my machine" problem.

MLflow Models

MLflow Models defines a standard format for saving trained models. A model saved in MLflow format can be loaded by many different tools — Python scripts, REST APIs, Spark jobs — without needing to rewrite any code.

MLflow Model Registry

The Model Registry is a central store for registered models and their versions. It tracks which version of a model is currently in production, which versions are being tested, and which versions are retired. Teams collaborate on model lifecycle decisions through the registry.

Understanding MLflow Tracking with a Real Example

Imagine a bakery testing different bread recipes. The baker tries Recipe A with 500 grams of flour, Recipe B with 600 grams, and Recipe C with a different type of yeast. Each batch produces a loaf that gets scored on taste, texture, and rise. The baker writes down every detail in a notebook: the ingredients (parameters), the scores (metrics), and a photo of each loaf (artifacts).

MLflow Tracking is that notebook, but for machine learning models.

A data scientist building a model to predict house prices might log three runs, scoring each by R² (the share of price variation the model explains; accuracy applies only to classification):

Experiment Run 1: Random Forest with 100 trees, max depth 10 → R²: 0.82
Experiment Run 2: Random Forest with 200 trees, max depth 15 → R²: 0.85
Experiment Run 3: Gradient Boosting with 150 trees, learning rate 0.08 → R²: 0.88

MLflow records all three runs. The data scientist opens the MLflow UI and sees a table comparing all runs side by side. Experiment Run 3 wins. The data scientist registers that model for deployment.

Parameters, Metrics, and Artifacts: The Three Things MLflow Records

Parameters

Parameters are the settings the data scientist chooses before training starts. They control how the model learns. Examples include the number of trees in a random forest, the learning rate for a neural network, or the maximum depth of a decision tree. Parameters are set once at the beginning and do not change during training.

Metrics

Metrics measure how well the model performed after training. Common metrics include accuracy (what percentage of predictions were correct), RMSE (root mean squared error, measuring how far predictions were from true values), and AUC (area under the curve, measuring classification quality). Metrics change across runs — that is the whole point. Data scientists compare metrics to decide which run produced the best model.

Artifacts

Artifacts are files produced during or after training. The trained model file itself is an artifact. A chart showing how accuracy improved during training is an artifact. A confusion matrix image is an artifact. MLflow stores all artifacts and links them to the specific run that produced them.
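
The three categories map onto three groups of logging calls. The following is a minimal sketch rather than a reference implementation: the helper names (write_confusion_report, log_example_run) and the report contents are assumptions made for illustration. MLflow is imported inside the function only so that the file-writing helper can be tried without MLflow installed.

```python
import json
import tempfile
from pathlib import Path


def write_confusion_report(counts: dict, directory: str) -> Path:
    """Write confusion-matrix counts to a JSON file and return its path."""
    path = Path(directory) / "confusion_matrix.json"
    path.write_text(json.dumps(counts, indent=2))
    return path


def log_example_run(params: dict, accuracy_per_epoch: list, counts: dict) -> None:
    """Log parameters, a per-epoch metric, and one artifact file to a single run."""
    import mlflow  # imported here so the helper above works without MLflow installed

    with mlflow.start_run():
        # Parameters: settings fixed before training starts
        mlflow.log_params(params)

        # Metrics: one value per epoch, so the UI can chart the curve
        for epoch, accuracy in enumerate(accuracy_per_epoch):
            mlflow.log_metric("accuracy", accuracy, step=epoch)

        # Artifacts: any file, attached to the run that produced it
        with tempfile.TemporaryDirectory() as tmp:
            report = write_confusion_report(counts, tmp)
            mlflow.log_artifact(str(report))
```

Calling log_example_run({"max_depth": 5}, [0.71, 0.78, 0.81], {"tp": 40, "fp": 6, "fn": 9, "tn": 45}) would produce one run holding a parameter, a three-point accuracy curve, and a JSON file (all values here are made up).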

Logging in MLflow: Writing to the Lab Notebook

Adding MLflow tracking to a machine learning script requires only a few lines of code. Here is what a simple training script looks like with MLflow logging:

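import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the real house price dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("learning_rate", 0.05)

    # Train the model with the same settings that were logged
    model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05)
    model.fit(X_train, y_train)

    # Log metrics (R² on the held-out test set)
    r2 = model.score(X_test, y_test)
    mlflow.log_metric("r2", r2)

    # Save the model as an artifact of this run
    mlflow.sklearn.log_model(model, "house_price_model")

Every time this script runs, MLflow creates a new run record with the parameters, the R² metric, and the saved model file. The data scientist never loses track of which settings produced which results.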

The MLflow Tracking UI: Seeing Every Experiment at a Glance

Databricks provides a built-in MLflow Tracking UI accessible directly from the workspace. The UI shows every experiment and every run within each experiment in a visual table. Data scientists can:

Sort runs by any metric to find the best performing model instantly
Compare two runs side by side to see exactly which parameters changed
View metric charts that show how metrics evolved during training
Download any artifact from any run
Add notes and tags to runs for easier organization

The UI requires no code to use. It is a visual dashboard that non-technical stakeholders can also review to understand model performance history.

Autologging: Zero-Code Tracking

MLflow supports autologging for many popular machine learning libraries. With one line of code — mlflow.autolog() — MLflow automatically captures parameters, metrics, and models from scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM, and other libraries.

This is like installing a dashcam in a car. Once it is turned on, it records everything automatically. The driver does not press record before every trip. MLflow autologging works the same way — turn it on once, and every training run gets captured.

Autologging often captures more than a data scientist would log by hand. For a scikit-learn model, it records the estimator's hyperparameters, training metrics, and the fitted model itself, plus cross-validation results when a search estimator such as GridSearchCV is used. The only code required is the single autolog call.
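
As a rough sketch of what "turn it on once" looks like in practice, the function below trains a small scikit-learn model with autologging enabled and no explicit logging calls. The toy dataset and the function names are assumptions for illustration only; MLflow and scikit-learn are imported inside the training function so the data helper can be tried on its own.

```python
def make_toy_data(n_rows: int = 20):
    """Build a tiny synthetic dataset: price grows with size and bedroom count."""
    X = [[50 + 10 * i, 1 + i % 4] for i in range(n_rows)]
    y = [1000 * size + 5000 * beds for size, beds in X]
    return X, y


def train_with_autolog(n_estimators: int = 50):
    """Train a small random forest with autologging switched on."""
    import mlflow  # lazy imports: only needed when training actually runs
    from sklearn.ensemble import RandomForestRegressor

    mlflow.autolog()  # one call; no log_param or log_metric lines follow

    X, y = make_toy_data()
    with mlflow.start_run():
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
        model.fit(X, y)  # fit() is what triggers the automatic logging
    return model
```

After train_with_autolog() finishes, the run should show the forest's hyperparameters, training metrics, and the model artifact, even though the function never calls a logging method directly.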

The MLflow Model Registry: From Experiment to Production

Running experiments produces many model versions. The Model Registry manages the journey from experiment to production use. It answers three critical questions:

Which model version is currently serving users in production?
Which model versions are being tested and evaluated?
Which older versions should be archived but kept for reference?

Model Stages

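The Model Registry has traditionally used stages to track where each model version stands. MLflow 2.9 and later deprecate stages in favor of model version aliases and tags, and models registered in Unity Catalog rely on aliases, but the four stages remain a useful way to describe the lifecycle: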

None — Newly registered, not yet evaluated for production
Staging — Being tested and validated before promotion
Production — Currently serving real users or business processes
Archived — Retired from use but preserved for historical reference

Think of this like a professional sports league. A player (model version) starts in development (None). They get called up to the minor league (Staging) for evaluation. If they perform well, they join the major league roster (Production). When they retire, they move to the hall of fame archive (Archived).

Registering a Model

Registering a model from a completed run takes one command:

import mlflow

# "abc123" stands in for the ID of the run that logged the model
mlflow.register_model(
    model_uri="runs:/abc123/house_price_model",
    name="HousePricePredictor"
)

After registration, the model appears in the registry with version number 1. As the data scientist trains improved models and registers them, the version number increments automatically.

Transitioning Between Stages

Moving a model from Staging to Production requires deliberate action. This prevents accidental deployment. A team leader reviews the model's performance in Staging, approves the transition, and the model moves to Production. Teams can add comments and notes during each stage transition to document their decision-making process.
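
A stage transition can also be scripted. The sketch below assumes the classic stage-based registry and uses MlflowClient.transition_model_version_stage, which recent MLflow releases mark as deprecated. The helper names normalize_stage and promote are hypothetical, and the review note is simply stored as the version description.

```python
VALID_STAGES = ("None", "Staging", "Production", "Archived")


def normalize_stage(stage: str) -> str:
    """Return the canonical stage name, or raise ValueError for an unknown stage."""
    for valid in VALID_STAGES:
        if stage.strip().lower() == valid.lower():
            return valid
    raise ValueError(f"Unknown stage {stage!r}; expected one of {VALID_STAGES}")


def promote(name: str, version: int, stage: str, note: str) -> None:
    """Move one model version to a new stage and record why."""
    from mlflow.tracking import MlflowClient  # lazy import: needed only at call time

    target = normalize_stage(stage)
    client = MlflowClient()
    client.transition_model_version_stage(
        name=name,
        version=str(version),
        stage=target,
        # When promoting to Production, archive whatever was serving before
        archive_existing_versions=(target == "Production"),
    )
    # Overwrites the version description with the reviewer's note
    client.update_model_version(name=name, version=str(version), description=note)
```

For example, promote("HousePricePredictor", 2, "production", "Approved after staging review") would move version 2 to Production and archive the version it replaces.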

Serving Models: Making Predictions Available

A trained model sitting in the registry provides no value until it makes predictions for real users. MLflow Model Serving in Databricks converts a registered model into a live REST API endpoint with one click.

Once the model is served, any application can send a request to the endpoint and receive a prediction. A web application showing house price estimates sends address details to the endpoint and receives a predicted price. A fraud detection system sends transaction details and receives a risk score. The serving infrastructure also handles scaling: if ten times more requests arrive, Databricks scales up the serving compute automatically.

The Prediction Request Format

Applications query the served model by sending JSON data to the API endpoint. For a tabular model, the body lists one record per row under the dataframe_records key:

POST /model/HousePricePredictor/1/invocations
{
  "dataframe_records": [
    {
      "bedrooms": 3,
      "bathrooms": 2,
      "square_feet": 1800,
      "neighborhood": "North District"
    }
  ]
}

The model returns a prediction:

{
  "predictions": [425000]
}

For a lightweight model, this exchange typically completes in milliseconds. The application team never needs to understand how the model works internally; they only need to know the API format.
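
From the application side, the exchange is an ordinary authenticated HTTP POST. The sketch below uses only the Python standard library. The workspace URL, the bearer token, and the function names are assumptions, and the URL pattern simply mirrors the path shown above, which may differ on your workspace.

```python
import json
import urllib.request


def build_scoring_request(base_url: str, model_name: str, version: int,
                          payload: dict, token: str) -> urllib.request.Request:
    """Assemble the POST request for a served model without sending it."""
    url = f"{base_url.rstrip('/')}/model/{model_name}/{version}/invocations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed token-based auth
        },
    )


def predict(request: urllib.request.Request, timeout: float = 10.0) -> list:
    """Send the request and return the list under the "predictions" key."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

A caller would build the request with a body such as {"dataframe_records": [{"bedrooms": 3, "bathrooms": 2, "square_feet": 1800, "neighborhood": "North District"}]} and pass it to predict. Keeping construction separate from sending makes the request easy to inspect or unit test without a live endpoint.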

Model Flavors: One Model, Many Ways to Use It

MLflow uses the concept of "flavors" to make models compatible with different tools. When a data scientist saves a scikit-learn model using MLflow, it gets saved in multiple flavors simultaneously:

Python function flavor — Load and use it in any Python script
Scikit-learn flavor — Load it specifically as a scikit-learn model to access scikit-learn specific features
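
The Python function flavor can also be wrapped as a Spark UDF, which applies the model across millions of rows in a Spark DataFrame for batch predictions. The UDF is a way of using that flavor rather than a separate flavor of its own.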

This is like a document saved in multiple formats simultaneously — as a Word file, a PDF, and a plain text file. The same document, usable by different tools without conversion.
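
To make the idea concrete, the sketch below loads the same registered scikit-learn model through two flavors. The helper names are hypothetical, and it assumes a model such as HousePricePredictor was logged with mlflow.sklearn and then registered.

```python
def registry_uri(name: str, version_or_stage) -> str:
    """Build a Model Registry URI such as models:/HousePricePredictor/1."""
    return f"models:/{name}/{version_or_stage}"


def load_two_ways(name: str, version_or_stage):
    """Load one registered model through its generic and its native flavor."""
    import mlflow.pyfunc  # lazy imports: needed only when models are loaded
    import mlflow.sklearn

    uri = registry_uri(name, version_or_stage)

    # Generic flavor: a wrapper exposing only predict(), whatever the library
    generic = mlflow.pyfunc.load_model(uri)

    # Native flavor: the original scikit-learn estimator, with attributes
    # such as feature_importances_ still available
    native = mlflow.sklearn.load_model(uri)
    return generic, native
```

Code that only needs predictions can depend on the generic wrapper, while analysis code that inspects the estimator can load the native object from the same saved model.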

Running Batch Predictions with Spark

Online serving handles real-time predictions one request at a time. But sometimes, an organization needs to run predictions on an entire dataset at once — predicting churn risk for all two million customers overnight, for example.

MLflow models load as Spark UDFs (user-defined functions) for this purpose. A single command applies the model to every row in a Spark DataFrame:

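import mlflow.pyfunc

# `spark` is the SparkSession that Databricks notebooks provide; `houses_df` is an
# assumed DataFrame holding one row per house with the model's feature columns
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/HousePricePredictor/Production")

feature_columns = ["bedrooms", "bathrooms", "square_feet", "neighborhood"]
predictions = houses_df.withColumn("predicted_price", predict_udf(*feature_columns))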

Databricks runs this computation across a cluster, processing millions of rows in parallel. Depending on cluster size and model complexity, a batch job that would take hours on a single machine can finish in minutes.

Experiment Organization: Keeping Projects Tidy

As a team runs hundreds of experiments across multiple projects, the tracking server can accumulate thousands of runs. MLflow uses experiments as containers to keep things organized. Each project gets its own experiment. All runs for that project live inside that experiment.

A data science team might have these experiments:

house_price_prediction — All runs for the property valuation project
customer_churn_v2 — All runs for the churn prediction project
fraud_detection_q4 — All runs for the fraud detection improvement project

Opening the house_price_prediction experiment shows only that project's runs, so team members are not distracted by thousands of unrelated runs from other projects.
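
In code, a run is placed in an experiment by selecting the experiment before the run starts. The sketch below assumes a Databricks workspace, where experiment names are workspace paths; the /Shared folder, the helper names, and the tag values are illustrative assumptions.

```python
def experiment_path(project: str, folder: str = "/Shared") -> str:
    """Build a workspace-style experiment path such as /Shared/house_price_prediction."""
    return f"{folder.rstrip('/')}/{project}"


def start_project_run(project: str, run_name: str, tags: dict) -> str:
    """Create a tagged run inside the project's experiment and return its run ID."""
    import mlflow  # lazy import: needed only when a run is actually created

    # Creates the experiment on first use, then makes it the active one
    mlflow.set_experiment(experiment_path(project))

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.set_tags(tags)  # tags make runs easier to filter later
        return run.info.run_id
```

For instance, start_project_run("house_price_prediction", "gbm-baseline", {"owner": "pricing-team"}) would file the run under the house price experiment rather than the default one.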

Comparing Runs: Finding the Best Model Visually

The MLflow UI includes a parallel coordinates plot — a specialized chart that makes multi-run comparison intuitive. Each vertical axis represents one parameter or metric. Each line connecting the axes represents one run. Lines that converge toward high accuracy are easy to spot visually.

A data scientist can look at this chart and immediately see a pattern: runs with a learning rate below 0.1 and more than 150 trees consistently achieve higher accuracy. This visual insight guides the next round of experiments, narrowing the search toward better configurations faster.
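
The same comparison can be done programmatically with mlflow.search_runs, which returns the matching runs as a pandas DataFrame. This is a sketch under assumptions: the metric and parameter names must match whatever the runs actually logged, and the helper names are hypothetical. Parameters are stored as strings, so the filter only tests them for equality.

```python
def build_filter(metric: str, minimum: float, **param_equals: str) -> str:
    """Compose a search filter: a metric threshold plus optional exact parameter matches."""
    clauses = [f"metrics.{metric} >= {minimum}"]
    clauses += [f"params.{key} = '{value}'" for key, value in param_equals.items()]
    return " and ".join(clauses)


def best_runs(experiment_name: str, metric: str, minimum: float,
              top_n: int = 5, **param_equals: str):
    """Return the top runs of one experiment, best metric value first."""
    import mlflow  # lazy import: needed only when the tracking server is queried

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=build_filter(metric, minimum, **param_equals),
        order_by=[f"metrics.{metric} DESC"],
        max_results=top_n,
    )
```

A call such as best_runs("house_price_prediction", "accuracy", 0.85, n_estimators="200") would list up to five runs with 200 trees that reached the threshold, ready for side-by-side inspection of their params columns.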

MLflow in Databricks vs. Standalone MLflow

MLflow is an open-source project usable outside Databricks. But Databricks adds significant value on top of the base MLflow experience:

Automatic tracking server — No setup required. The tracking server runs inside the Databricks workspace automatically.
Unity Catalog integration — Models registered in the MLflow Model Registry connect to Unity Catalog for governance, access control, and lineage tracking.
One-click serving — Deploying a model as a REST endpoint requires clicking one button in the Databricks UI.
Cluster-aware logging — When running distributed training across a Spark cluster, MLflow captures the distributed job details automatically.
Access control — Model registry permissions integrate with Databricks workspace security, controlling who can view, modify, or deploy each model.

Real-World Scenario: A Retail Company's Price Optimization Journey

A retail company wants to build a model that recommends optimal prices for ten thousand products based on demand, competitor prices, and inventory levels. Here is how MLflow supports the entire process:

The data science team starts by creating an experiment called price_optimization_2024. They enable autologging. Over three weeks, the team runs 200 experiments, testing different algorithms and settings. MLflow captures every run automatically.

At the end of week three, the team opens the MLflow UI and sorts all 200 runs by their key metric — revenue uplift in simulated testing. The top five runs stand out clearly. The team examines their parameters side by side. They identify a Gradient Boosting model with specific settings that consistently achieves 12% higher revenue in simulations.

The team registers that model as PriceOptimizer version 1 in the Model Registry. It moves to Staging, where the business team validates it using historical data. After approval, it moves to Production. An automated pipeline runs every night, loading all ten thousand products into a Spark DataFrame, applying the model as a batch UDF, and writing recommended prices to the pricing database.

Three months later, the team trains an improved version using new data. Version 2 enters Staging while Version 1 continues to serve in Production. After A/B testing confirms that Version 2 performs better, it transitions to Production and Version 1 moves to Archived. The entire history of both versions, including every experiment, metric, and decision, remains accessible in MLflow.

Key Points Summary

MLflow solves the core challenge of tracking machine learning experiments systematically.
The four components — Tracking, Projects, Models, and Model Registry — cover the full machine learning lifecycle.
Parameters, metrics, and artifacts are the three categories of information MLflow records for each run.
Autologging captures experiment details automatically with a single line of code.
The Model Registry manages model versions through None, Staging, Production, and Archived stages.
Model flavors allow one trained model to be used in Python scripts, REST APIs, and Spark jobs without modification.
Databricks adds automatic tracking servers, Unity Catalog integration, and one-click serving on top of open-source MLflow.
Batch prediction using Spark UDFs processes millions of records in parallel for overnight scoring jobs.
