Spark Shell and SparkContext
The Spark shell is an interactive command-line environment where you type Spark commands and see results instantly. It is ideal for exploring data, testing logic, and learning the API without writing full programs.
Starting the Spark Shell
Spark ships with two shells: spark-shell for Scala and pyspark for Python.
# Python shell
pyspark
# Scala shell
spark-shell
# Starting pyspark prints a banner like this:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/
Using Python version 3.10.x
SparkSession available as 'spark'.
SparkContext available as 'sc'.
SparkContext — The Original Entry Point
SparkContext (sc) was the main entry point in Spark 1.x. It connects your program to the cluster and lets you create RDDs (the original Spark data structure). When you start pyspark, Spark automatically creates sc for you.
# In the pyspark shell, sc is already available
print(sc)
# Output: <SparkContext master=local[*] appName=PySparkShell>
# Check Spark version
print(sc.version)
# Check the default number of partitions (the core count in local[*] mode)
print(sc.defaultParallelism)
SparkSession — The Modern Entry Point
SparkSession (spark) has been the primary entry point since Spark 2.0. It wraps a SparkContext, which stays reachable as spark.sparkContext, and adds support for DataFrames and SQL. In the pyspark shell, both sc and spark are pre-created.
# SparkSession is available as 'spark' in the shell
print(spark)
# Output: <pyspark.sql.session.SparkSession object>
# Read a CSV file directly using SparkSession
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show(5)
SparkContext vs SparkSession
+-------------------+-------------------+----------------------+
|                   | SparkContext (sc) | SparkSession (spark) |
+-------------------+-------------------+----------------------+
| Introduced        | Spark 1.0         | Spark 2.0            |
| Creates           | RDDs              | DataFrames, SQL      |
| Lower-level API?  | Yes               | No (higher-level)    |
| Still used?       | Yes (via sc)      | Yes (primary)        |
+-------------------+-------------------+----------------------+
Trying Commands in the Shell
The shell gives instant feedback. Try these commands after launching pyspark:
# Create a small list and turn it into a Spark RDD
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = sc.parallelize(numbers)
# Count the items
print(rdd.count())  # 10
# Find the sum
print(rdd.sum())  # 55
# Filter only even numbers
evens = rdd.filter(lambda x: x % 2 == 0)
print(evens.collect())  # [2, 4, 6, 8, 10]
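One way to make shell experiments easier to check is to give the filter predicate a name, so it can be tried on a plain Python list before it is handed to an RDD. The sketch below runs without Spark; is_even is a hypothetical helper name, not part of the Spark API.

```python
def is_even(x):
    """Predicate that works on plain Python values and inside rdd.filter alike."""
    return x % 2 == 0


numbers = list(range(1, 11))

# Local equivalents of rdd.count(), rdd.sum() and rdd.filter(...).collect()
local_count = len(numbers)
local_sum = sum(numbers)
local_evens = list(filter(is_even, numbers))

print(local_count, local_sum, local_evens)  # 10 55 [2, 4, 6, 8, 10]

# In the pyspark shell the same predicate can be passed unchanged:
#   sc.parallelize(numbers).filter(is_even).collect()
```

Because the predicate is an ordinary function, it can be unit tested on small samples and then reused as-is in the shell.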
Running Spark in Non-Interactive Mode
For production jobs, you write a Python script and submit it with spark-submit instead of using the shell.
# my_job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyJob").getOrCreate()
df = spark.read.csv("sales.csv", header=True)
print(df.count())
spark.stop()
# Submit the job
spark-submit my_job.py
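When a job is launched from another Python program (a scheduler or a wrapper script, for example), the spark-submit call is often assembled as an argument list. The following is a minimal sketch under that assumption: build_submit_command is a hypothetical helper, and the master URL, app name, and spark.ui.port value are illustrative choices. It uses the --master, --name, and --conf options of spark-submit and only builds the command; it does not run it.

```python
import shlex


def build_submit_command(script, master=None, app_name=None, conf=None, script_args=()):
    """Assemble a spark-submit argument list (hypothetical helper, not a Spark API)."""
    cmd = ["spark-submit"]
    if master is not None:
        cmd += ["--master", master]
    if app_name is not None:
        cmd += ["--name", app_name]
    # Each Spark property is passed as its own --conf key=value pair.
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # The application script comes after all spark-submit options,
    # followed by any arguments meant for the script itself.
    cmd.append(script)
    cmd.extend(script_args)
    return cmd


cmd = build_submit_command(
    "my_job.py",
    master="local[4]",                # assumption: run locally with 4 threads
    app_name="MyJob",
    conf={"spark.ui.port": "4050"},   # assumption: move the UI off the default port
)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

To launch the job, the list can be passed to subprocess.run(cmd, check=True); keeping it as a list avoids shell-quoting problems with values such as local[4].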
Stopping Spark
Always stop Spark at the end of a script or notebook to release cluster resources. Forgetting to stop causes resource leaks in shared environments.
spark.stop() # or sc.stop()
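A plain spark.stop() at the end of a script is skipped if an earlier line raises an exception. One possible guard is a small context manager that calls stop() in a finally block. The sketch below is an assumption-laden illustration: stopping is a hypothetical helper, and FakeSession is a stand-in object used only so the example runs without Spark installed.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session, then call session.stop() even if the body raises."""
    try:
        yield session
    finally:
        session.stop()


class FakeSession:
    """Stand-in for a SparkSession or SparkContext (illustration only)."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeSession()
try:
    with stopping(fake) as spark:
        raise RuntimeError("job failed")  # simulate a failing job
except RuntimeError:
    pass

print(fake.stopped)  # True
```

In a real script the same helper would wrap the session created with SparkSession.builder.appName("MyJob").getOrCreate(), so the session is stopped whether the job succeeds or fails.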
Shell Tips
- Press Tab to auto-complete object names and methods
- Type :help in the Scala shell for assistance
- Use Ctrl+D or type exit() to quit the Python shell
- The Spark UI at localhost:4040 is active while the shell is running
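To check from a script whether the UI is reachable, a small probe with the standard library is enough. This is a rough sketch: find_spark_ui is a hypothetical helper, and the port range is an assumption based on Spark normally trying the next port when 4040 is already taken. The probe only confirms that something answers HTTP on the port, not that it is Spark.

```python
import http.client
import urllib.error
import urllib.request


def find_spark_ui(host="localhost", ports=range(4040, 4045), timeout=1.0):
    """Return the first URL in the port range that answers HTTP, or None."""
    # Bypass any configured HTTP proxy; the UI runs on the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    for port in ports:
        url = f"http://{host}:{port}"
        try:
            with opener.open(url, timeout=timeout):
                return url
        except urllib.error.HTTPError:
            # A server responded, even if with an error status.
            return url
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Nothing listening (or not speaking HTTP) on this port; try the next.
            continue
    return None


print(find_spark_ui() or "no HTTP server found on ports 4040-4044")
```

Running this while a pyspark shell is open in another terminal should print the UI address; with no shell running it falls through to the fallback message.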
