Databricks Clusters

A cluster is the engine that powers all computation in Databricks. Every time you run a notebook cell, execute a SQL query, or trigger a data pipeline, a cluster performs the actual work. Understanding how clusters are built, configured, and managed is one of the most important skills for using Databricks efficiently — both in terms of performance and cost.

Think of a cluster like a delivery team. When a large shipment of packages arrives at a warehouse, one manager (the driver node) reads the manifest and assigns packages to multiple delivery drivers (worker nodes). Each driver handles a portion of the total deliveries simultaneously. The job finishes much faster than if only one person made all the deliveries alone.

What Is a Cluster?

A Databricks cluster is a collection of virtual machines (VMs) in the cloud that work together as a single Apache Spark environment. These virtual machines are rented from your cloud provider — AWS, Azure, or GCP — on demand. You pay only while the cluster is running.

CLUSTER ANATOMY
────────────────────────────────────────────────────────
DATABRICKS CLUSTER
│
├── DRIVER NODE (1 machine)
│    • Receives your code from the notebook
│    • Creates the execution plan (DAG)
│    • Distributes tasks to worker nodes
│    • Collects results and returns them to you
│    • Runs the SparkContext and Spark Session
│
└── WORKER NODES (1 to hundreds of machines)
     ├── WORKER 1: Processes rows 1–10 million
     ├── WORKER 2: Processes rows 10–20 million
     ├── WORKER 3: Processes rows 20–30 million
     └── WORKER N: Processes rows (N−1)×10M – N×10M

Each worker has:
  • Executors (processes that run Spark tasks)
  • Cores (CPU threads, one task per core)
  • Memory (RAM for data caching and processing)
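The "one task per core" rule above gives a quick way to estimate how much work a cluster can do at once. The sketch below is illustrative only: the function names are hypothetical, and it assumes every task takes roughly the same time.

```python
import math


def parallel_task_slots(num_workers: int, cores_per_worker: int) -> int:
    """Upper bound on tasks running at the same time (one task per core)."""
    if num_workers < 1 or cores_per_worker < 1:
        raise ValueError("need at least one worker and one core per worker")
    return num_workers * cores_per_worker


def waves_needed(num_partitions: int, slots: int) -> int:
    """Scheduling rounds needed to process all partitions, one task each."""
    return math.ceil(num_partitions / slots)


# Example: 4 workers with 4 cores each, processing 200 partitions.
slots = parallel_task_slots(num_workers=4, cores_per_worker=4)
waves = waves_needed(num_partitions=200, slots=slots)
print(f"{slots} tasks in parallel, {waves} waves for 200 partitions")
```

Doubling the workers roughly halves the number of waves, which is why adding workers speeds up jobs that have many partitions.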

Types of Clusters

Databricks offers three main cluster types, each suited to a different use case.

All-Purpose Clusters

All-Purpose Clusters stay running until you manually stop them or until they auto-terminate due to inactivity. Multiple users can attach to the same All-Purpose Cluster at the same time and run their notebooks concurrently. These clusters suit interactive, exploratory work during development.

The trade-off is cost: if you forget to stop an All-Purpose Cluster at the end of the day, it keeps running and keeps costing money. Always set an inactivity auto-termination period (30–60 minutes is common).
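Auto-termination is a single field in the cluster definition. The sketch below builds a request body of the kind accepted by the Databricks Clusters API. The cluster name is hypothetical, and the runtime string and node type are assumptions chosen to match the examples in this lesson.

```python
import json

# Hypothetical all-purpose cluster definition.
all_purpose_cluster = {
    "cluster_name": "dev-exploration",      # hypothetical name
    "spark_version": "14.3.x-scala2.12",    # assumed runtime version string
    "node_type_id": "Standard_DS3_v2",      # Azure example node type
    "num_workers": 2,
    "autotermination_minutes": 30,          # stop after 30 idle minutes
}

payload = json.dumps(all_purpose_cluster, indent=2)
print(payload)
```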

Job Clusters

Job Clusters are created automatically when a scheduled job starts and terminated as soon as the job finishes, so they never sit idle. Each job run gets a fresh cluster, which eliminates the risk of leftover state from previous runs affecting results. Job Clusters usually cost less than All-Purpose Clusters for two reasons: they exist only while the job is running, and Databricks bills jobs compute at a lower rate than all-purpose compute.

JOB CLUSTER LIFECYCLE
──────────────────────────────────────────────
Job Triggered (6:00 AM)
       │
       ▼
Cluster Created (takes 2–5 minutes to spin up)
       │
       ▼
Job Runs (notebook executes end-to-end)
       │
       ▼
Job Completes Successfully
       │
       ▼
Cluster Terminates Automatically
       │
       ▼
No compute cost until next job run
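A job that follows this lifecycle declares its cluster inside the job definition instead of pointing at an existing one. The sketch below shows the general shape of such a definition for the Databricks Jobs API. The job name, task key, and notebook path are hypothetical, and the cluster settings are assumptions.

```python
import json

job_definition = {
    "name": "nightly-etl",                          # hypothetical job name
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",    # every day at 6:00 AM
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "run_etl_notebook",         # hypothetical task key
            "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},  # hypothetical path
            # "new_cluster" asks for a fresh job cluster on every run;
            # it is created when the run starts and removed when it ends.
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 4,
            },
        }
    ],
}

job_payload = json.dumps(job_definition, indent=2)
print(job_payload)
```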

SQL Warehouses (formerly SQL Endpoints)

SQL Warehouses are specialized clusters optimized for SQL query workloads. They connect to the Databricks SQL Editor and BI tools like Tableau, Power BI, and Looker. SQL Warehouses can handle many simultaneous users through query queuing and auto-scaling, making them suitable for shared analytics environments.

SQL Warehouses come in two flavors: Classic (runs on regular VMs in your cloud account) and Serverless (Databricks manages the infrastructure, starts in seconds, charges by the second of compute used).

Cluster Configuration Options

When you create a cluster, you configure several settings that determine its performance and cost. Understanding these options helps you choose the right cluster for each task.

Node Types

Node type refers to the VM specification — how much CPU and RAM each machine has. Cloud providers offer many node types in different families:

NODE TYPE FAMILIES (AZURE EXAMPLE)
────────────────────────────────────────────────
Standard_DS3_v2   → 4 CPU, 14 GB RAM  (General purpose)
Standard_DS4_v2   → 8 CPU, 28 GB RAM  (Larger general purpose)
Standard_E8s_v3   → 8 CPU, 64 GB RAM  (Memory optimized)
Standard_F8s_v2   → 8 CPU, 16 GB RAM  (CPU compute optimized)
Standard_NC6s_v3  → 6 CPU, 112 GB RAM + 1 GPU (GPU for ML)

WHEN TO USE EACH TYPE:
• General purpose  → Most ETL and SQL workloads
• Memory optimized → Large joins, caching huge datasets
• Compute optimized → ML training, CPU-heavy transformations
• GPU              → Deep learning, image/video processing

Cluster Size – Workers and Cores

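The number of worker nodes and the cores per node determine how much parallel processing your cluster can do. More workers mean more parallelism, which usually shortens run time but also raises the hourly cost. Adding workers stops helping once the cluster has more cores than there are partitions to process.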

SIZING GUIDE (APPROXIMATE)
──────────────────────────────────────────────────────
Dataset Size    │ Recommended Workers │ Use Case
────────────────┼─────────────────────┼───────────────────
Under 10 GB     │ 1–2 workers         │ Single Node or small
10 GB – 1 TB    │ 4–8 workers         │ Standard ETL
1 TB – 10 TB    │ 8–20 workers        │ Large ETL, ML
Over 10 TB      │ 20–100+ workers     │ Enterprise scale
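The sizing guide can be expressed as a small lookup function. This is only a restatement of the approximate table above, not a Databricks feature. The function name is hypothetical, 1 TB is treated as 1,000 GB, and the upper bound of 100 for the largest tier is a soft limit.

```python
def recommended_workers(dataset_gb: float) -> tuple[int, int]:
    """Return the (min, max) worker range suggested by the sizing guide."""
    if dataset_gb < 0:
        raise ValueError("dataset size cannot be negative")
    if dataset_gb < 10:
        return (1, 2)        # single node or small cluster
    if dataset_gb <= 1_000:
        return (4, 8)        # standard ETL
    if dataset_gb <= 10_000:
        return (8, 20)       # large ETL, ML
    return (20, 100)         # enterprise scale; 100 is a soft upper bound


for size_gb in (5, 500, 5_000, 50_000):
    low, high = recommended_workers(size_gb)
    print(f"{size_gb:>6} GB -> {low} to {high} workers")
```

Treat the result as a starting point and adjust after looking at real run times.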

Autoscaling

With autoscaling enabled, the cluster monitors its workload and automatically adds worker nodes when busy and removes them when idle. You set a minimum and maximum number of workers, and the cluster manager adjusts within that range.

AUTOSCALING IN ACTION
──────────────────────────────────────────────
8 AM: Light workload → Cluster uses 2 workers
10 AM: Heavy batch job arrives → Cluster scales up to 8 workers
11 AM: Job finishes → Cluster scales back down to 2 workers
2 PM: Another large job → Scales up to 6 workers
5 PM: End of workday → No activity → Cluster terminates (auto-term)

Cost: You pay only for the workers actually running.
Without autoscaling: You would pay for 8 workers all day.
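The saving can be estimated by counting worker-hours. The sketch below uses the timeline above with two assumptions that the timeline does not state: the day is measured from 8 AM to 5 PM, and the 2 PM job keeps 6 workers busy until 5 PM. The driver node is left out because it runs in both cases.

```python
# (start_hour, end_hour, workers) segments on a 24-hour clock.
autoscaled_timeline = [
    (8, 10, 2),    # light workload
    (10, 11, 8),   # heavy batch job
    (11, 14, 2),   # back to the minimum
    (14, 17, 6),   # second large job (assumed to run until 5 PM)
]
fixed_timeline = [(8, 17, 8)]   # 8 workers for the whole day


def worker_hours(segments):
    """Sum of workers multiplied by hours across all segments."""
    return sum((end - start) * workers for start, end, workers in segments)


autoscaled = worker_hours(autoscaled_timeline)
fixed = worker_hours(fixed_timeline)
savings_pct = 100 * (fixed - autoscaled) / fixed
print(f"autoscaled: {autoscaled} worker-hours, fixed: {fixed} worker-hours")
print(f"saving: {savings_pct:.0f}%")
```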

Databricks Runtime Version

Each cluster runs a specific Databricks Runtime (DBR) version. The runtime includes Apache Spark, Delta Lake, Python, and various libraries. Choose the version that matches your workload:

RUNTIME SELECTION GUIDE
──────────────────────────────────────────────
Standard runtime (e.g., DBR 14.3 LTS)
→ Use for: ETL, data engineering, SQL analytics
→ Includes: Spark 3.5, Python 3.10, Delta Lake 3.x

ML runtime (e.g., DBR 14.3 LTS ML)
→ Use for: Machine learning, model training
→ Adds: TensorFlow, PyTorch, scikit-learn, XGBoost, MLflow

GPU runtime (e.g., DBR 14.3 LTS GPU ML)
→ Use for: Deep learning with GPU acceleration
→ Requires: GPU-capable node types

Spark Configuration

You can add custom Spark configuration properties under the Spark Config section when creating a cluster. These properties fine-tune Spark's behavior.

COMMON SPARK CONFIGS
──────────────────────────────────────────────────────
spark.sql.shuffle.partitions = 200
→ Number of partitions created during joins and aggregations.
  Default is 200. For small datasets, lower this (e.g., 20)
  to avoid many tiny tasks.

spark.databricks.delta.optimizeWrite.enabled = true
→ Automatically optimizes file sizes when writing Delta tables.

spark.sql.adaptive.enabled = true
→ Enables Adaptive Query Execution, which dynamically
  adjusts query plans based on runtime statistics.
  Already on by default in Spark 3.2 and later.

spark.executor.memory = 8g
→ Amount of RAM per executor for data processing.
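The Spark Config box on the cluster page expects one property per line, with the key and value separated by a space. The sketch below keeps the settings in a dictionary and renders them in that form. The helper name is hypothetical and the values are examples, not recommendations.

```python
spark_conf = {
    "spark.sql.shuffle.partitions": "20",                     # lowered for a small dataset
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.sql.adaptive.enabled": "true",
}


def to_spark_config_text(conf: dict) -> str:
    """Render one 'key value' pair per line, separated by a single space."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


config_text = to_spark_config_text(spark_conf)
print(config_text)
```

Session-level settings such as spark.sql.shuffle.partitions can also be changed from a notebook with spark.conf.set, while executor-level settings such as spark.executor.memory only take effect when the cluster starts.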

Cluster Policies – Simplifying Cluster Creation for Teams

In large organizations, administrators use Cluster Policies to restrict and pre-configure cluster settings. A policy defines what options users can change and enforces cost controls.

EXAMPLE CLUSTER POLICY: "Analyst Policy"
──────────────────────────────────────────────
Maximum node type: Standard_DS4_v2 (8 CPU, 28 GB)
Maximum workers: 4
Auto-termination: Required, max 30 minutes
Runtime: Must use DBR 14.3 LTS or newer
Allowed tags: Must include department name

Effect:
Analysts create clusters within approved limits.
No one accidentally creates a 50-node GPU cluster.
Cost is controlled automatically.
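Behind the scenes a policy is a JSON document that maps cluster attributes to rules. The sketch below approximates the "Analyst Policy" above. The attribute paths and rule types (allowlist, range, unlimited, isOptional) are assumed to follow the Databricks policy definition format, so check them against the current documentation before use. Two simplifications are made: the node type limit is written as an allowlist of two sizes, and "14.3 LTS or newer" is reduced to a single allowed version.

```python
import json

analyst_policy = {
    # Only these node types may be selected (nothing larger than DS4_v2).
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
    },
    # At most 4 workers.
    "num_workers": {"type": "range", "maxValue": 4},
    # Auto-termination must be set; 0 would disable it, so a minimum is enforced.
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 30},
    # Simplification: a single allowed runtime instead of "14.3 LTS or newer".
    "spark_version": {"type": "allowlist", "values": ["14.3.x-scala2.12"]},
    # Any value is accepted, but the department tag must be present.
    "custom_tags.department": {"type": "unlimited", "isOptional": False},
}

policy_json = json.dumps(analyst_policy, indent=2)
print(policy_json)
```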

Spot Instances – Saving Money on Compute

Cloud providers offer a cheaper class of VMs called Spot Instances (AWS), Spot VMs (Azure), or Preemptible VMs (GCP). These machines are spare capacity sold at a steep discount, often 60–90%. The trade-off: the cloud provider can reclaim them on short notice (about 2 minutes on AWS, roughly 30 seconds on Azure and GCP) when it needs the capacity back.

SPOT INSTANCE STRATEGY
──────────────────────────────────────────────────────────
RECOMMENDED SETUP:
• Driver Node: On-Demand VM (never interrupted)
• Worker Nodes: Spot Instances (cheap, may be interrupted)

If a worker is reclaimed, Spark reschedules its tasks
on the remaining workers. The job usually slows down
(lost work has to be redone) but does not fail.

AVOID Spot Instances when:
• Running time-critical production jobs with strict deadlines
• Processing streaming data with low latency requirements
• The job cannot tolerate restarts or slowdowns
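On Azure, the recommended setup above (on-demand driver, spot workers) is expressed through the azure_attributes section of the cluster definition. The sketch below shows one plausible configuration. The cluster name is hypothetical, and the field values are assumptions to verify against the Clusters API reference for your cloud, since AWS and GCP use different attribute blocks.

```python
import json

spot_cluster = {
    "cluster_name": "batch-etl-spot",         # hypothetical name
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "azure_attributes": {
        # The first node placed is the driver, so it stays on-demand.
        "first_on_demand": 1,
        # Use spot VMs for the rest, falling back to on-demand if none are available.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # -1 means "pay up to the on-demand price" rather than a fixed bid.
        "spot_bid_max_price": -1,
    },
}

spot_payload = json.dumps(spot_cluster, indent=2)
print(spot_payload)
```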

Cluster Monitoring

Once a cluster is running, you monitor its health and performance through several tools.

Spark UI

The Spark UI is a built-in dashboard accessible from the cluster's detail page. It shows every job, stage, and task that Spark runs, along with timing information. Use it to identify slow stages, tasks with data skew (one task taking 10 times longer than others), or tasks that spill data to disk due to insufficient memory.

SPARK UI TABS
──────────────────────────────────────────────
Jobs      → One entry per action you call (write, show, collect)
Stages    → Each job splits into stages at shuffle boundaries
Tasks     → Each stage splits into tasks, one per partition
            (listed on the stage detail page, not a separate tab)
Storage   → Cached DataFrames and RDDs currently in memory
Executors → Resource usage per executor (memory, cores, task time, shuffle I/O)
SQL       → Physical query plans for SQL and DataFrame queries
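The data skew symptom described above (one task taking about 10 times longer than the others) can be checked with simple arithmetic once you have the task durations of a stage, for example copied from the stage detail page. The sketch below is a hypothetical helper, not part of Spark or Databricks. It compares the slowest task to the median task.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task in one stage."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    typical = median(task_durations_s)
    if typical <= 0:
        raise ValueError("median duration must be positive")
    return max(task_durations_s) / typical


def is_skewed(task_durations_s, threshold=10.0):
    """Flag a stage whose slowest task is at least `threshold` times the median."""
    return skew_ratio(task_durations_s) >= threshold


balanced_stage = [12, 11, 13, 12, 14]    # seconds per task
skewed_stage = [12, 11, 13, 12, 140]     # one straggler

print(is_skewed(balanced_stage))
print(is_skewed(skewed_stage))
```

The median is used instead of the mean so that a single straggler does not hide itself by inflating the average.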

Cluster Metrics

Databricks provides low-level cluster metrics: CPU utilization, memory usage per node, network I/O, and disk read/write rates. Access them via the Metrics tab on the cluster details page. On Databricks Runtime 12.2 LTS and below these charts come from the Ganglia monitoring system; newer runtimes use the built-in compute metrics instead.

Cluster Event Log

The Event Log records every significant event in a cluster's lifetime: creation, scaling up, scaling down, configuration changes, and termination. It is the first place to look when diagnosing issues like a cluster that failed to start or terminated unexpectedly.

Auto-Termination and Cost Control

The single most common Databricks cost mistake is leaving All-Purpose Clusters running overnight or over the weekend. A cluster with 10 worker nodes might cost $10–30 per hour. Running for a weekend without anyone using it costs hundreds of dollars for no benefit.

Set auto-termination to trigger after 30–60 minutes of inactivity for All-Purpose Clusters. This means the cluster shuts down automatically if no code has run for that duration. When you next open your notebook and press Run, it takes 2–5 minutes for the cluster to restart — a small inconvenience that prevents large unnecessary costs.

CLUSTER COST EXAMPLE
──────────────────────────────────────────────────────
Cluster: 4 workers, Standard_DS3_v2 on Azure
Cost: Approximately $2 per hour

With Auto-Termination (30 min):
• Morning work: 9 AM – 12 PM (3 hours) = $6
• Afternoon work: 2 PM – 5 PM (3 hours) = $6
• Idle time before shutdown at 12:30 PM and 5:30 PM (2 × 0.5 hours) = $2
• Total daily cost: $14

Without Auto-Termination:
• Cluster runs 9 AM – next morning 9 AM = 24 hours = $48
• You used it for only 6 hours but paid for 24 hours
• Waste: 18 idle hours = $36 per day, about $1,080 per 30-day month for ONE cluster
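The arithmetic can be reproduced in a few lines. This sketch separates the cost of active work from the cost of idle time, including the 30-minute window that passes before auto-termination fires, since the cluster is still billed during that window. The constants come from the example above and the variable names are illustrative.

```python
HOURLY_RATE = 2.0             # approximate dollars per hour for the example cluster
ACTIVE_SESSIONS_H = [3, 3]    # 9 AM to 12 PM and 2 PM to 5 PM
IDLE_TIMEOUT_H = 0.5          # 30-minute auto-termination setting
DAYS_PER_MONTH = 30

active_hours = sum(ACTIVE_SESSIONS_H)
active_cost = HOURLY_RATE * active_hours

# With auto-termination: each session is followed by one idle timeout window.
idle_tail_cost = HOURLY_RATE * IDLE_TIMEOUT_H * len(ACTIVE_SESSIONS_H)
with_auto_termination = active_cost + idle_tail_cost

# Without auto-termination: the cluster is billed around the clock.
always_on = HOURLY_RATE * 24
idle_cost_always_on = always_on - active_cost
monthly_idle_cost = idle_cost_always_on * DAYS_PER_MONTH

print(f"active work:            ${active_cost:.2f} per day")
print(f"with auto-termination:  ${with_auto_termination:.2f} per day")
print(f"always on:              ${always_on:.2f} per day")
print(f"idle cost (always on):  ${idle_cost_always_on:.2f} per day, "
      f"${monthly_idle_cost:.2f} per month")
```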

Key Points

• A Databricks cluster is a group of cloud VMs running Apache Spark, with one driver node coordinating work and multiple worker nodes doing parallel computation.
• All-Purpose Clusters suit interactive development; Job Clusters suit automated production; SQL Warehouses suit SQL analytics.
• Node type determines CPU and memory per machine; choose memory-optimized nodes for large joins, GPU nodes for deep learning.
• Autoscaling adjusts the number of workers automatically based on workload, so you pay for peak capacity only while you need it.
• Spot Instances can reduce VM costs by 60–90% and work well for non-critical batch jobs when configured on worker nodes only.
• Always set auto-termination on All-Purpose Clusters to avoid unexpected costs from idle resources.
• The Spark UI shows job execution details and helps diagnose performance issues like data skew and memory spills.
