Spark Setup and Installation
You can run Spark on your laptop for learning without needing a cluster. This topic covers three ways to get started: local installation, Google Colab (browser-based, zero setup), and Docker.
Prerequisites
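- Java: Spark runs on the Java Virtual Machine (JVM). Spark 3.x supports Java 8 and 11 (plus 17 from Spark 3.3 onward); Spark 4.x requires Java 17 or later
- Python 3.7+ for older PySpark 3.x releases; newer releases need a newer Python (3.9+ for Spark 4.x), so check the documentation for the version you install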
- At least 4 GB RAM — 8 GB recommended for smooth local runs
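Before installing anything, it can help to see what is already on the machine. The following is a minimal sketch using only the Python standard library; the helper names parse_java_major and report_prerequisites are made up for this example and are not part of PySpark. It assumes the usual banner format printed by java -version (for example version "11.0.20" or the legacy "1.8.0_382").

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(version_output):
    """Return the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_382").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def report_prerequisites():
    """Print the Python version and, if Java is on PATH, its major version."""
    print("Python:", ".".join(map(str, sys.version_info[:3])))
    java = shutil.which("java")
    if java is None:
        print("Java: not found on PATH")
        return
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    print("Java major version:", parse_java_major(result.stderr + result.stdout))


if __name__ == "__main__":
    report_prerequisites()
```

Compare the reported versions with the requirements of the Spark release you plan to install.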
Option 1: Install PySpark via pip (Easiest)
PySpark is available as a Python package. This approach installs everything you need in one command and works on Windows, macOS, and Linux.
# Step 1: Install Java (if not already installed)
# Download from: https://adoptium.net

# Step 2: Install PySpark
pip install pyspark

# Step 3: Verify the installation
python -c "import pyspark; print(pyspark.__version__)"
First Program After pip Install
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("HelloSpark") \
    .master("local[*]") \
    .getOrCreate()
print("Spark version:", spark.version)
spark.stop()
The local[*] setting tells Spark to run locally and use all available CPU cores.
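The local master URL has a few common forms: local (a single worker thread), local[N] (N threads), and local[*] (one thread per logical core). The sketch below illustrates how those forms map to a thread count. The function local_thread_count is a hypothetical helper written for this example, not a PySpark API, and os.cpu_count() only approximates what the JVM reports as available processors.

```python
import os
import re


def local_thread_count(master):
    """Return the worker-thread count implied by a local master URL.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if master == "local":
        return 1  # plain "local" runs with a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master URL: {master!r}")
    if match.group(1) == "*":
        # "*" means one thread per logical core on this machine
        return os.cpu_count() or 1
    return int(match.group(1))


for master in ("local", "local[2]", "local[*]"):
    print(master, "->", local_thread_count(master))
```

Using a fixed count such as local[2] can be useful on a laptop when you want to leave some cores free for other work.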
Option 2: Google Colab (No Installation)
Google Colab runs in your browser and provides free compute. This is the fastest way to start if you do not want to install anything locally.
# Run this in a Colab cell
!pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)
Option 3: Docker
Docker containers provide a clean, reproducible Spark environment that works identically on any machine.
# Pull the official Jupyter PySpark image
docker pull jupyter/pyspark-notebook

# Run it
docker run -p 8888:8888 jupyter/pyspark-notebook
Open the URL printed in the terminal to access a Jupyter notebook with PySpark pre-configured.
Spark UI — Your Built-In Dashboard
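While a Spark application is running, its driver serves a web dashboard, by default at http://localhost:4040. If that port is already in use (for example, by another Spark application), Spark tries 4041, 4042, and so on. The Spark UI shows active jobs, completed stages, executor memory usage, and task timelines. Check this dashboard whenever a job runs slowly.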
What You See in Spark UI:
+----------------------------+
| Jobs  | Stages   | Storage |
+----------------------------+
| Job 0 | 3 stages | 2 GB    |
| Job 1 | 1 stage  | 0 GB    |
+----------------------------+
| Executors: 4 active        |
| Total input: 10 GB         |
+----------------------------+
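If the dashboard does not load, a quick check is whether anything is listening on the expected port. The sketch below probes a small range starting at 4040, since Spark moves to the next port when the default one is taken. The helper find_open_port and the range 4040-4045 are choices made for this example; an open port only shows that something is listening there, not that it is necessarily Spark.

```python
import socket


def find_open_port(host="localhost", ports=range(4040, 4046), timeout=0.5):
    """Return the first port in `ports` that accepts a TCP connection, else None."""
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            # Connection refused or timed out: try the next candidate port.
            continue
    return None


port = find_open_port()
if port is None:
    print("No listener found on ports 4040-4045; is a Spark application running?")
else:
    print(f"Something is listening at http://localhost:{port}")
```

Run it from a second terminal while your Spark job is still active, because the UI goes away once spark.stop() is called.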
Setting SPARK_HOME (Optional for Full Install)
If you download the Spark binary directly from spark.apache.org (for Scala/Java development), set the SPARK_HOME environment variable so your system knows where Spark lives.
# macOS / Linux (~/.bashrc or ~/.zshrc)
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

# Windows (System Properties > Environment Variables)
SPARK_HOME = C:\spark
PATH += C:\spark\bin
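After editing the environment, open a new terminal and confirm the settings took effect. The following sketch checks three things: that SPARK_HOME is set, that its bin directory contains a spark-submit launcher, and that the bin directory is on PATH. The function check_spark_home is a hypothetical helper for this example; it assumes the standard layout of a Spark binary download, where the launcher scripts live under bin.

```python
import os
from pathlib import Path


def check_spark_home(env=None):
    """Return a list of problems found with SPARK_HOME (empty if none)."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    problems = []
    bin_dir = Path(spark_home) / "bin"
    # Spark distributions ship launcher scripts such as spark-submit in bin/.
    launchers = ("spark-submit", "spark-submit.cmd")
    if not any((bin_dir / name).exists() for name in launchers):
        problems.append(f"no spark-submit launcher found in {bin_dir}")
    # Compare against PATH entries; normpath drops trailing separators.
    path_dirs = [os.path.normpath(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if os.path.normpath(str(bin_dir)) not in path_dirs:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


for problem in check_spark_home() or ["SPARK_HOME looks fine"]:
    print(problem)
```

If you installed PySpark with pip only, SPARK_HOME is normally unset and the first message is expected.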
Quick Setup Checklist
- Java installed and JAVA_HOME set
- PySpark installed (pip install pyspark)
- Test script runs without errors
- Spark UI accessible at localhost:4040 during a job
