Spark Setup and Installation

You can run Spark on your laptop for learning without needing a cluster. This topic covers three ways to get started: local installation, Google Colab (browser-based, zero setup), and Docker.

Prerequisites

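Java 8 or 11 (Java 17 also works from Spark 3.3 onward; Spark 4.x requires Java 17 or later): Spark runs on the Java Virtual Machine (JVM)
Python 3.7+: required for PySpark (newer PySpark releases raise this minimum, so check the release notes for the version you install)
At least 4 GB RAM: 8 GB recommended for smooth local runs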
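If you are unsure which Java version is installed, one way to find out from Python is to read the banner that java -version prints. The following is a minimal sketch, not an official tool: the helper names parse_java_major and installed_java_major are made up for this example, and it assumes the usual version "x.y.z" banner format used by OpenJDK and Oracle builds.

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(banner: str):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', banner)
    if match is None:
        return None
    parts = re.split(r"[._-]", match.group(1))
    try:
        # Java 8 and earlier report themselves as 1.8, 1.7, ...
        if parts[0] == "1" and len(parts) > 1:
            return int(parts[1])
        return int(parts[0])
    except ValueError:
        return None


def installed_java_major():
    """Run `java -version` if java is on PATH; return the major version or None."""
    if shutil.which("java") is None:
        return None
    # java prints its version banner to stderr, not stdout
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Python:", sys.version_info[:3])
    print("Java major version:", installed_java_major())
```

A result of None means Java was not found on PATH or the banner did not match the assumed format.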

Option 1: Install PySpark via pip (Easiest)

PySpark is available as a Python package. The pip package bundles Spark itself, so apart from Java this one command installs everything you need, and it works on Windows, macOS, and Linux.

# Step 1: Install Java (if not already installed)
# Download from: https://adoptium.net
java -version   # confirm Java is on your PATH

# Step 2: Install PySpark
pip install pyspark

# Step 3: Verify the installation
python -c "import pyspark; print(pyspark.__version__)"

First Program After pip Install

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HelloSpark") \
    .master("local[*]") \
    .getOrCreate()

print("Spark version:", spark.version)
spark.stop()

The local[*] master setting tells Spark to run on your own machine in a single process, with one worker thread per available CPU core.
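The common local master strings differ only in how many worker threads they request: local uses one thread, local[N] uses N, and local[*] uses one per logical core. The sketch below illustrates that mapping in plain Python. local_thread_count is a hypothetical helper written for this lesson, not part of PySpark, and it uses os.cpu_count() as a stand-in for the core count that Spark obtains from the JVM, which can differ inside containers.

```python
import os
import re


def local_thread_count(master: str) -> int:
    """Number of worker threads implied by a simple local master string.

    Hypothetical helper for illustration; Spark does this parsing internally.
    """
    if master == "local":
        return 1  # a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a simple local master string: {master!r}")
    if match.group(1) == "*":
        return os.cpu_count() or 1  # one thread per logical core
    return int(match.group(1))


for master in ("local", "local[2]", "local[*]"):
    print(master, "->", local_thread_count(master), "thread(s)")
```

With a live session, spark.sparkContext.defaultParallelism normally reports the number Spark actually chose, assuming spark.default.parallelism has not been overridden.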

Option 2: Google Colab (No Installation)

Google Colab runs in your browser and provides free compute. This is the fastest way to start if you do not want to install anything locally.

# Run this in a Colab cell
!pip install pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)

Option 3: Docker

Docker containers provide a clean, reproducible Spark environment that behaves the same on any machine that can run Docker.

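# Pull the Jupyter project's PySpark notebook image
# (newer builds are published as quay.io/jupyter/pyspark-notebook)
docker pull jupyter/pyspark-notebook

# Run it: port 8888 serves Jupyter; publishing 4040 as well exposes the Spark UI to the host
docker run -p 8888:8888 -p 4040:4040 jupyter/pyspark-notebook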

Open the URL printed in the terminal (it includes a login token) to access a Jupyter notebook with PySpark pre-configured.

Spark UI — Your Built-In Dashboard

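Every running Spark application serves a web dashboard, by default at http://localhost:4040 (if that port is already taken, Spark tries 4041, 4042, and so on). The Spark UI shows active jobs, completed stages, executor memory usage, and task timelines, and it is available only while the application is running. Check this dashboard whenever a job runs slowly.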

What You See in Spark UI:

+--------+----------+---------+
| Jobs   | Stages   | Storage |
+--------+----------+---------+
| Job 0  | 3 stages | 2 GB    |
| Job 1  | 1 stage  | 0 GB    |
+--------+----------+---------+
| Executors: 4 active         |
| Total input: 10 GB          |
+-----------------------------+
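If the dashboard does not load, it helps to check whether anything is listening on the port at all. The sketch below is one way to do that with the standard library, plus a way to pin the UI port when starting a session. The function names ui_is_reachable and start_session_with_ui_port and the app name UiPortDemo are invented for this example; spark.ui.port is the Spark setting that controls the port.

```python
import socket


def ui_is_reachable(host: str = "localhost", port: int = 4040, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_session_with_ui_port(port: int = 4040):
    """Start a local session with an explicit UI port (requires pyspark and Java)."""
    # Imported lazily so the port probe above works even without Spark installed
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("UiPortDemo")
        .master("local[*]")
        .config("spark.ui.port", str(port))
        .getOrCreate()
    )


print("UI reachable on 4040:", ui_is_reachable())
```

Once a session is running, spark.sparkContext.uiWebUrl holds the address Spark actually bound, which is handy when 4040 was already in use. Note that the probe only confirms that some process is listening on the port, not that it is Spark.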

Setting SPARK_HOME (Optional for Full Install)

If you download the Spark binary directly from spark.apache.org (for Scala/Java development), set the SPARK_HOME environment variable so your system knows where Spark lives.

# macOS / Linux (~/.bashrc or ~/.zshrc)
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

# Windows (System Properties > Environment Variables)
SPARK_HOME = C:\spark
PATH += C:\spark\bin
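After setting these variables, open a new terminal and confirm they took effect. The following sketch checks that SPARK_HOME is set, that its bin directory contains a spark-submit launcher, and that the directory is on PATH. spark_home_problems is a hypothetical helper name, and the check assumes the standard layout of a Spark binary download.

```python
import os
from pathlib import Path


def spark_home_problems(env=None):
    """Return a list of human-readable problems with SPARK_HOME (empty if none)."""
    env = os.environ if env is None else env
    home = env.get("SPARK_HOME")
    if not home:
        return ["SPARK_HOME is not set"]
    bin_dir = Path(home) / "bin"
    # Spark ships spark-submit for Unix shells and spark-submit.cmd for Windows
    if not any((bin_dir / name).is_file() for name in ("spark-submit", "spark-submit.cmd")):
        return [f"no spark-submit launcher found in {bin_dir}"]
    problems = []
    path_entries = [Path(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


print(spark_home_problems() or "SPARK_HOME looks fine")
```

If you installed PySpark with pip only, SPARK_HOME is normally unset and the first message is expected.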

Quick Setup Checklist

Java installed (java -version works) and JAVA_HOME set to its install directory
PySpark installed (pip install pyspark)
Test script runs without errors
Spark UI accessible at localhost:4040 during a job
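The first two checklist items can be verified without starting Spark. Here is a small sketch that does so using only the standard library; setup_checklist is a made-up name for this lesson. The last two items still require actually running the test script and opening the UI while it runs.

```python
import importlib.util
import os
import shutil


def setup_checklist():
    """Return {check description: passed?} for the items that can be tested offline."""
    return {
        "java on PATH": shutil.which("java") is not None,
        "JAVA_HOME set": bool(os.environ.get("JAVA_HOME")),
        "pyspark importable": importlib.util.find_spec("pyspark") is not None,
    }


for item, ok in setup_checklist().items():
    print(f"[{'x' if ok else ' '}] {item}")
```

Each line prints with an x in the box when the check passes, so an empty box points to the step that needs attention.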
