Spark Shell and SparkContext

The Spark shell is an interactive command-line environment where you type Spark commands and see results instantly. It is ideal for exploring data, testing logic, and learning the API without writing full programs.

Starting the Spark Shell

Spark ships with two shells: spark-shell for Scala and pyspark for Python.

# Python shell
pyspark

# Scala shell
spark-shell

# Starting pyspark prints a banner like this:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Python version 3.10.x
SparkSession available as 'spark'.
SparkContext available as 'sc'.

SparkContext — The Original Entry Point

SparkContext (sc) was the main entry point in Spark 1.x. It connects your program to the cluster and lets you create RDDs (the original Spark data structure). When you start pyspark, Spark automatically creates sc for you.

# In the pyspark shell, sc is already available
print(sc)
# Output: <SparkContext master=local[*] appName=PySparkShell>

# Check Spark version
print(sc.version)

# Check the default number of partitions (in local[*] mode this equals the number of local cores)
print(sc.defaultParallelism)

SparkSession — The Modern Entry Point

SparkSession (spark) became the primary entry point in Spark 2.0. It wraps a SparkContext, which stays reachable as spark.sparkContext, and adds support for DataFrames and SQL. In the pyspark shell, both sc and spark are pre-created.

# SparkSession is available as 'spark' in the shell
print(spark)
# Output: <pyspark.sql.session.SparkSession object at 0x...>

# Read a CSV file directly using SparkSession
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show(5)

SparkContext vs SparkSession

+-------------------+------------------+--------------------+
|                   |SparkContext (sc) |SparkSession (spark)|
+-------------------+------------------+--------------------+
| Introduced        | Spark 1.0        | Spark 2.0          |
| Creates           | RDDs             | DataFrames, SQL    |
| Lower-level API?  | Yes              | No (higher-level)  |
| Still used?       | Yes (via sc)     | Yes (primary)      |
+-------------------+------------------+--------------------+

Trying Commands in the Shell

The shell gives instant feedback. Try these commands after launching pyspark:

# Create a small list and turn it into a Spark RDD
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = sc.parallelize(numbers)

# Count the items
print(rdd.count())      # 10

# Find the sum
print(rdd.sum())        # 55

# Filter only even numbers
evens = rdd.filter(lambda x: x % 2 == 0)
print(evens.collect())  # [2, 4, 6, 8, 10]

Running Spark in Non-Interactive Mode

For production jobs, you write a Python script and submit it with spark-submit instead of using the shell.

# my_job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyJob").getOrCreate()
df = spark.read.csv("sales.csv", header=True)
print(df.count())
spark.stop()

# Submit the job
spark-submit my_job.py

Stopping Spark

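Stop Spark at the end of a script or notebook to release cluster resources. A session left running can keep holding executors that other users of a shared cluster are waiting for. Calling spark.stop() also stops the underlying SparkContext, so one of the two calls below is enough.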

spark.stop()
# or
sc.stop()

Shell Tips

Press Tab to auto-complete object names and methods
Type :help in the Scala shell for assistance
Use Ctrl+D or type exit() to quit the Python shell
The Spark UI is available at localhost:4040 by default while the shell is running; if that port is already in use, Spark tries 4041, 4042, and so on
