Databricks Architecture

Understanding how Databricks is built helps you use it more effectively. When you know what happens behind the scenes when you run a query or train a model, you make better decisions about how to structure your work, optimize your pipelines, and troubleshoot problems when they appear.

Databricks architecture has two main layers: the Control Plane and the Data Plane. These two layers work together like a conductor and an orchestra. The conductor (Control Plane) gives instructions and manages everything. The orchestra (Data Plane) does the actual performance — the heavy computation.

The Two Planes Explained with a Diagram

YOUR BROWSER
     |
     v
┌─────────────────────────────────┐
│         CONTROL PLANE           │  ← Managed by Databricks
│                                 │
│  • Databricks UI (Workspace)    │
│  • Job Scheduler                │
│  • Cluster Manager              │
│  • Notebook Service             │
│  • Unity Catalog (Governance)   │
└─────────────────────────────────┘
              |
              | (Sends instructions to launch clusters,
              |  run jobs, manage data)
              v
┌─────────────────────────────────┐
│           DATA PLANE            │  ← Lives in YOUR cloud account
│                                 │
│  • Spark Clusters (Compute)     │
│  • Your Cloud Storage (Data)    │
│  • Delta Lake (Storage layer)   │
│  • Network, Security, Keys      │
└─────────────────────────────────┘

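This separation matters for security. In the classic architecture described here, the Data Plane lives inside your own cloud subscription: your AWS account, your Azure subscription, or your Google Cloud project. Your stored data stays there, and clusters read and process it inside your account. The Control Plane mostly sends control signals, like "start this cluster" or "run this notebook." Some data does cross the boundary, though: the results of a notebook command travel back through the Control Plane so they can be displayed in your browser.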

The Control Plane in Detail

The Control Plane is the brain of Databricks. It handles everything you interact with when you open the Databricks website.

The Workspace

The workspace is the visual interface you see in your browser. It is where you create notebooks, manage clusters, schedule jobs, and view dashboards. Think of it as the front desk of a hotel — the place where you make requests, and those requests get routed to the right department to fulfill them.

The Cluster Manager

The Cluster Manager decides how and when to create, scale, and terminate clusters. When you click "Start Cluster" in the UI, the Cluster Manager sends a request to your cloud provider saying, "Please create 10 virtual machines with these specifications." The cloud provider spins up those machines, the Cluster Manager configures them with Apache Spark, and within a few minutes your cluster is ready to run code.
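
As a rough sketch, the request the Cluster Manager acts on can be pictured as a small specification like the one below. The field names follow the Databricks Clusters API; every value (the cluster name, the runtime string, the instance type) is a placeholder chosen for illustration, not a recommendation. Note that a cluster of 10 virtual machines means nine workers plus one driver.

```python
import json

# Hypothetical cluster specification. Field names follow the Databricks
# Clusters API; every value here is an illustrative placeholder.
cluster_spec = {
    "cluster_name": "nightly-etl",        # assumed name
    "spark_version": "14.3.x-scala2.12",  # a Databricks Runtime version string
    "node_type_id": "i3.xlarge",          # AWS instance type; differs on Azure/GCP
    "num_workers": 9,                     # worker VMs only, the driver is extra
    "autotermination_minutes": 30,        # shut down after 30 idle minutes
}

# Total VMs requested from the cloud provider: workers plus one driver.
total_vms = cluster_spec["num_workers"] + 1

print(json.dumps(cluster_spec, indent=2))
print("VMs requested:", total_vms)
```

In practice you would submit a specification like this through the UI, the REST API, or an infrastructure tool rather than print it.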

The Job Scheduler

The Job Scheduler handles automation. When you set a notebook to run every night at 2 AM, the Job Scheduler remembers that schedule and triggers the job at the right time. It also monitors job runs, records success or failure, and sends alerts if something goes wrong.
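
The "every night at 2 AM" schedule can be sketched as follows. The `schedule` dictionary uses the field names of the Databricks Jobs API, which expects a Quartz cron expression. The helper `next_nightly_run` is a hypothetical function written for this lesson to show what the scheduler has to work out; it is not part of any Databricks library.

```python
from datetime import datetime, timedelta

# Schedule block in the shape used by the Databricks Jobs API.
# Quartz cron fields: second minute hour day-of-month month day-of-week.
schedule = {
    "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00:00
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}

def next_nightly_run(now: datetime, hour: int = 2) -> datetime:
    """Return the next time the clock reads `hour`:00, strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow
    return candidate

print(next_nightly_run(datetime(2024, 1, 15, 14, 30)))  # 2024-01-16 02:00:00
print(next_nightly_run(datetime(2024, 1, 15, 1, 0)))    # 2024-01-15 02:00:00
```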

The Notebook Service

Every notebook you write is saved in the Control Plane. The Notebook Service stores your code, manages version history, and coordinates real-time collaboration between multiple users editing the same notebook.

The Data Plane in Detail

The Data Plane is where actual computation happens. It lives entirely inside your own cloud account.

Clusters (Compute)

A cluster is a group of virtual machines that work together to run Apache Spark. A typical cluster has one driver node and one or more worker nodes; a single-node cluster runs everything on the driver.

CLUSTER STRUCTURE
─────────────────

         ┌──────────────┐
         │  DRIVER NODE │  ← Coordinates the work
         │  (The boss)  │
         └──────┬───────┘
                │
        ┌───────┼────────────┐
        │       │            │
┌───────┴──┐  ┌─┴────────┐ ┌─┴────────┐
│WORKER 1  │  │WORKER 2  │ │WORKER 3  │  ← Do the actual work
│(executor)│  │(executor)│ │(executor)│
└──────────┘  └──────────┘ └──────────┘

Each worker processes a portion of the data simultaneously.

The driver node receives your code, breaks the task into smaller pieces, and distributes those pieces across the worker nodes. Each worker node processes its portion of the data in parallel. The results come back to the driver, which assembles the final answer.

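This parallel processing is a large part of why Spark is fast. Instead of one computer reading through a billion rows one by one, ten workers each read 100 million rows at the same time. In the ideal case the scan takes roughly a tenth as long, less whatever time is spent coordinating the workers and combining their results.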
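
The driver-and-worker pattern described above can be imitated in plain Python. This is only a structural sketch: the "workers" here are threads on one machine, so it shows how work is split and recombined, not the speedup a real cluster delivers. The function names are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(rows, num_partitions):
    """Driver step: cut the data into roughly equal contiguous slices."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def worker_task(partition):
    """Worker step: each worker sums only its own slice."""
    return sum(partition)

def parallel_sum(rows, num_workers=10):
    partitions = split_into_partitions(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, partitions))
    # Driver step: combine the partial results into the final answer.
    return sum(partial_sums)

rows = list(range(1, 1_000_001))
print(parallel_sum(rows))  # 500000500000
```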

Cloud Storage

Your data lives in cloud storage — Amazon S3 if you use AWS, Azure Data Lake Storage Gen2 if you use Azure, or Google Cloud Storage if you use GCP. Databricks reads data directly from these storage systems. The data never needs to be loaded into a separate database first.

Delta Lake

Delta Lake is a storage layer that sits on top of your cloud storage files. It adds reliability features that raw file storage lacks. When you store data in Delta Lake format, every write operation gets logged in a transaction log. This means you can roll back mistakes, run queries against older versions of your data, and ensure that two jobs writing to the same table at the same time do not corrupt each other's work.
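
To make the idea of a transaction log concrete, here is a toy model in plain Python. It is not how Delta Lake is implemented (the real format keeps commit files in a `_delta_log` directory next to Parquet data files); the class `ToyVersionedTable` is invented for this lesson and only illustrates why logging every write makes older versions readable.

```python
class ToyVersionedTable:
    """Append-only log of commits; each commit records the rows it added."""

    def __init__(self):
        self._log = []  # position in this list is the table version

    def append(self, rows):
        """Commit a write and return the new version number."""
        self._log.append(list(rows))
        return len(self._log) - 1

    def read(self, version=None):
        """Replay the log up to `version` (the latest if omitted)."""
        if not self._log:
            return []
        if version is None:
            version = len(self._log) - 1
        if not 0 <= version < len(self._log):
            raise ValueError(f"unknown version {version}")
        rows = []
        for commit in self._log[: version + 1]:
            rows.extend(commit)
        return rows

table = ToyVersionedTable()
v0 = table.append([("Priya", 2500)])
v1 = table.append([("Rahul", 1800)])
print(table.read())            # both rows
print(table.read(version=v0))  # only the first commit
```

Reading an older version never modifies the log, which is the property that makes rollback and "time travel" queries safe.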

How Apache Spark Works Inside Databricks

Apache Spark is the engine that Databricks runs on. Understanding Spark's basic behavior helps you write better code.

DataFrames

Spark works with data structures called DataFrames. A DataFrame looks and behaves like a table with rows and columns. You can filter rows, add new columns, join two DataFrames together, and group data — all using either Python code or SQL syntax.

SPARK DATAFRAME EXAMPLE
─────────────────────────
┌──────────┬────────┬──────────┐
│ customer │  city  │ purchase │
├──────────┼────────┼──────────┤
│ Priya    │ Mumbai │   2500   │
│ Rahul    │ Delhi  │   1800   │
│ Aisha    │ Pune   │   3200   │
│ Dev      │ Mumbai │    900   │
└──────────┴────────┴──────────┘

Filter where city = "Mumbai":
→ Priya (2500), Dev (900)

Sum of purchases in Mumbai:
→ 3400

Even if this DataFrame has 10 billion rows instead of 4, Spark splits it across hundreds of workers and performs the filter and sum in parallel across all of them.
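
The same filter and sum can be written in a few lines of plain Python, using the four rows from the table above. This is a single-machine stand-in for illustration; in PySpark the equivalent steps would be a `filter` on the `city` column followed by a sum over `purchase`.

```python
# The four example rows, one dictionary per row.
rows = [
    {"customer": "Priya", "city": "Mumbai", "purchase": 2500},
    {"customer": "Rahul", "city": "Delhi", "purchase": 1800},
    {"customer": "Aisha", "city": "Pune", "purchase": 3200},
    {"customer": "Dev", "city": "Mumbai", "purchase": 900},
]

# Filter: keep only the rows where city is "Mumbai".
mumbai = [r for r in rows if r["city"] == "Mumbai"]

# Aggregate: add up the purchase column of the filtered rows.
total = sum(r["purchase"] for r in mumbai)

print([(r["customer"], r["purchase"]) for r in mumbai])  # [('Priya', 2500), ('Dev', 900)]
print(total)  # 3400
```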

Lazy Evaluation

Spark uses a concept called lazy evaluation. When you write code to filter a DataFrame or join two tables, Spark does not immediately run that code. It waits until you explicitly ask for the result — by writing to a file, displaying rows, or collecting data into your program. This gives Spark time to build an optimal execution plan before touching any data.

LAZY EVALUATION TIMELINE
─────────────────────────
Step 1: df = spark.read.table("sales")             → No data read yet
Step 2: df = df.filter(df.city == "Mumbai")        → No processing yet
Step 3: df = df.groupBy("month").sum("purchase")   → No processing yet
Step 4: df.write.save("output")                    → NOW Spark executes all steps
                                                     with an optimized plan

This approach avoids wasted computation. Spark figures out the most efficient sequence of operations and only runs what is necessary.
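
A toy version of lazy evaluation fits in a short class. `LazyFrame` below is a made-up name for this lesson, not a Spark class: its `filter` and `select` methods only record what should happen, and `collect` is the action that finally runs the recorded steps. The `calls` list exists purely to prove that nothing runs early.

```python
class LazyFrame:
    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def filter(self, predicate):
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._steps + (("select", columns),))

    def collect(self):
        """Action: only now are the recorded steps applied to the data."""
        result = self._rows
        for kind, arg in self._steps:
            if kind == "filter":
                result = [r for r in result if arg(r)]
            else:  # "select"
                result = [{c: r[c] for c in arg} for r in result]
        return result

calls = []  # records every row the predicate actually inspects

def is_mumbai(row):
    calls.append(row["customer"])
    return row["city"] == "Mumbai"

sales = LazyFrame([
    {"customer": "Priya", "city": "Mumbai", "purchase": 2500},
    {"customer": "Rahul", "city": "Delhi", "purchase": 1800},
    {"customer": "Aisha", "city": "Pune", "purchase": 3200},
    {"customer": "Dev", "city": "Mumbai", "purchase": 900},
])

plan = sales.filter(is_mumbai).select("customer", "purchase")
calls_before = len(calls)  # 0: building the plan touched no rows
result = plan.collect()
calls_after = len(calls)   # 4: the predicate ran once per row

print(calls_before, calls_after)
print(result)
```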

The Catalyst Optimizer

Spark includes a query optimizer called Catalyst. When you write a SQL query or a chain of DataFrame operations, Catalyst analyzes it, rewrites the logical plan into a more efficient form, and selects a physical plan, which Spark then compiles to JVM bytecode for execution. This optimization happens automatically, so you rarely need to think about it. But knowing it exists helps explain why Databricks often runs queries faster than you might expect.
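
One kind of rewrite an optimizer performs is moving a filter earlier so that later steps handle fewer rows. The sketch below imitates that with a plan stored as a simple list; Catalyst itself works on tree-shaped plans with many more rules, so treat `optimize` and `execute` as hypothetical helpers for illustration only. On a real DataFrame, `df.explain()` shows the plan Spark actually chose.

```python
def optimize(plan):
    """Rule: whenever a filter directly follows a sort, swap them."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            if plan[i][0] == "sort" and plan[i + 1][0] == "filter":
                plan[i], plan[i + 1] = plan[i + 1], plan[i]
                changed = True
    return plan

def execute(rows, plan):
    """Run each step of the plan in order."""
    for op, arg in plan:
        if op == "filter":
            rows = [r for r in rows if arg(r)]
        elif op == "sort":
            rows = sorted(rows, key=arg)
    return rows

rows = [
    {"customer": "Priya", "city": "Mumbai", "purchase": 2500},
    {"customer": "Rahul", "city": "Delhi", "purchase": 1800},
    {"customer": "Aisha", "city": "Pune", "purchase": 3200},
    {"customer": "Dev", "city": "Mumbai", "purchase": 900},
]

# As written by the user: sort all rows, then keep only Mumbai.
original = [
    ("sort", lambda r: r["purchase"]),
    ("filter", lambda r: r["city"] == "Mumbai"),
]
optimized = optimize(original)

print([op for op, _ in optimized])               # ['filter', 'sort']
print(execute(rows, original) == execute(rows, optimized))  # True
```

Both plans return the same rows, but the optimized one sorts two rows instead of four.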

Databricks Runtime

When you create a cluster, you choose a Databricks Runtime version. The Runtime is a pre-configured software environment that includes Apache Spark, Delta Lake libraries, machine learning libraries, and other dependencies. You do not install these manually — Databricks packages them together and updates them regularly.

DATABRICKS RUNTIME VERSIONS
─────────────────────────────
DBR 14.x → Spark 3.5, Python 3.10; 14.3 is an LTS (Long Term Support) release
DBR 13.x → Spark 3.4; 13.3 is the previous LTS release, stable for production
DBR ML    → Includes TensorFlow, PyTorch, scikit-learn pre-installed
DBR GPU   → Optimized for GPU-powered deep learning workloads

For most data engineering tasks, choose the latest Long Term Support (LTS) runtime. For machine learning, choose the ML runtime. This saves you time installing libraries manually.
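
Picking "the latest LTS" is a simple selection rule, sketched below. The `runtimes` list is a small hand-written sample, not a complete or current catalogue, and `latest_lts` is a hypothetical helper. The one subtle point it shows is that versions must be compared as numbers, because comparing them as text would rank "9.1" above "14.3".

```python
# Hand-written sample of runtime releases, for illustration only.
runtimes = [
    {"version": "9.1", "lts": True},
    {"version": "13.3", "lts": True},
    {"version": "14.2", "lts": False},
    {"version": "14.3", "lts": True},
]

def version_key(release):
    """Turn '14.3' into (14, 3) so versions compare numerically."""
    return tuple(int(part) for part in release["version"].split("."))

def latest_lts(releases):
    """Return the highest LTS version string, or None if there is none."""
    lts_releases = [r for r in releases if r["lts"]]
    if not lts_releases:
        return None
    return max(lts_releases, key=version_key)["version"]

print(latest_lts(runtimes))  # 14.3
```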

Unity Catalog – Governance Across the Whole Platform

Unity Catalog is Databricks' central system for managing who can access what data. It sits at the Control Plane level and governs all data assets across all workspaces in your organization.

UNITY CATALOG HIERARCHY
─────────────────────────
Metastore (One per cloud region)
   └── Catalog (Like a database collection)
         └── Schema (Like a folder of tables)
               └── Table / View / Volume / Function

An administrator defines which users or groups can read, write, or manage each catalog, schema, or table. This means a junior analyst can query the marketing database without accidentally accessing sensitive employee data. Security is enforced at the platform level, not just at the application level.
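
The junior-analyst example can be modelled with a small lookup. This sketch assumes a simplified rule in which a privilege granted on a catalog or schema also applies to everything beneath it; real Unity Catalog additionally requires usage privileges on the parent catalog and schema, which are left out here. All group, catalog, schema, and table names are invented for illustration.

```python
# (principal, securable) -> privileges granted directly on that securable.
grants = {
    ("analysts", "main.marketing"): {"SELECT"},
    ("hr_team", "main.hr"): {"SELECT", "MODIFY"},
}

def ancestors(securable):
    """'main.marketing.campaigns' -> the table, its schema, its catalog."""
    parts = securable.split(".")
    return [".".join(parts[:n]) for n in range(len(parts), 0, -1)]

def is_allowed(principal, privilege, securable):
    """True if the privilege is granted on the object or any parent."""
    return any(
        privilege in grants.get((principal, level), set())
        for level in ancestors(securable)
    )

print(is_allowed("analysts", "SELECT", "main.marketing.campaigns"))  # True
print(is_allowed("analysts", "SELECT", "main.hr.salaries"))          # False
print(is_allowed("analysts", "MODIFY", "main.marketing.campaigns"))  # False
```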

How a Typical Databricks Request Flows

Here is a step-by-step picture of what happens when you run a cell in a notebook.

YOU PRESS "RUN" IN A NOTEBOOK
          │
          ▼
Control Plane receives the code
          │
          ▼
Job is sent to the Cluster Manager
          │
          ▼
Cluster Manager routes job to your cluster (in Data Plane)
          │
          ▼
Driver node on the cluster receives the job
          │
          ▼
Driver splits the work across worker nodes
          │
          ▼
Workers read data from cloud storage (S3 / ADLS / GCS)
          │
          ▼
Workers process data in parallel using Spark
          │
          ▼
Results return to driver node
          │
          ▼
Driver sends results back to Control Plane
          │
          ▼
Results display in your browser notebook

This entire process — from pressing Run to seeing results — can take seconds for small data and minutes for large datasets spread across many worker nodes.

Key Points

Databricks architecture has two main layers: the Control Plane (managed by Databricks) and the Data Plane (in your cloud account).
Your stored data stays in your own cloud storage. Only control signals and the results you ask to see pass through the Control Plane.
A cluster has one driver node that coordinates work and one or more worker nodes that process data in parallel.
Apache Spark uses lazy evaluation to build an optimized execution plan before running any computation.
The Catalyst Optimizer automatically rewrites queries into more efficient execution plans.
Databricks Runtime packages Spark, Delta Lake, and other libraries into a single pre-configured environment.
Unity Catalog governs data access across all workspaces and users in your organization.
