Spark Lazy Evaluation

Lazy evaluation means Spark defers running your code until a result is actually required. When you write transformations, Spark only records them in a plan. When you call an action, Spark runs that plan, optimizing it first where it can.

The Restaurant Order Analogy

Eager evaluation (most languages):

You: "Chop onions."        --> Kitchen chops onions immediately
You: "Chop tomatoes."      --> Kitchen chops tomatoes immediately
You: "Actually, no salad." --> Work already done — wasted effort

Lazy evaluation (Spark):

You: "Chop onions."        --> Kitchen writes it down
You: "Chop tomatoes."      --> Kitchen writes it down
You: "Actually, no salad." --> Kitchen crosses off both tasks
You: "Serve the soup."     --> Kitchen executes only the soup tasks

Because Spark sees the full "recipe" before starting, it runs only the steps that the requested result depends on.
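
The kitchen's "write it down, cook later" behavior can be sketched in plain Python. This is a toy model, not Spark's implementation: the LazyPipeline class and the double helper are hypothetical names invented for illustration. Transformations only append to a list of recorded steps, and the collect() action is the first point at which the data is touched.

```python
class LazyPipeline:
    """Toy model of lazy evaluation: record steps now, run them on demand."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # the recorded "order ticket"

    def map(self, fn):
        # Transformation: returns a new pipeline with one more recorded step.
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, never executed here.
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the data actually touched.
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def double(x):
    calls.append(x)  # side effect so we can see when work happens
    return x * 2


pipeline = LazyPipeline([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = pipeline.collect()       # the action runs the recorded steps
print(calls_before_action, result)  # 0 [4, 6]
```

The counter stays at zero until collect() is called, which is the same behavior the timing example in the next section shows with a real RDD.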

Seeing Laziness in Action

import time

from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the shell's context if one exists
rdd = sc.parallelize(range(1_000_000))

# These lines return almost instantly: nothing is computed yet
t1 = time.time()
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 100)
t2 = time.time()
print(f"Transformation time: {t2 - t1:.4f} seconds")
# Example output: Transformation time: 0.0012 seconds  (timings vary by machine)

# This line triggers the actual computation
t3 = time.time()
count = result.count()
t4 = time.time()
print(f"Action time: {t4 - t3:.4f} seconds")
# Example output: Action time: 1.8300 seconds  (the real work happens here)

The DAG — Spark's Execution Plan

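Every time you chain transformations, Spark extends a DAG (Directed Acyclic Graph). Each node in the graph is a dataset, and each edge is the transformation that derives one dataset from another. When an action runs, the scheduler splits the graph into stages at shuffle boundaries (reduceByKey in the example below) and pipelines the remaining transformations within each stage.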

Your code:
result = (
    sc.textFile("sales.csv")
    .map(parse_row)  # parse_row: your own parser, assumed to return a dict with a numeric "amount"
    .filter(lambda r: r["amount"] > 500)
    .map(lambda r: (r["region"], r["amount"]))
    .reduceByKey(lambda a, b: a + b)
)

DAG:

[textFile] --> [map:parse] --> [filter:amount>500] --> [map:key-value] --> [reduceByKey]

Only runs when you call: result.collect() or result.saveAsTextFile(...)
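
To make the graph idea concrete, here is a minimal sketch of a lineage chain in plain Python. The Node class and the lineage and stages helpers are hypothetical, not Spark classes, and the stage split is deliberately simplified: it starts a new stage at each operation marked as needing a shuffle, whereas real Spark also does part of the reduceByKey work before the shuffle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One dataset in a toy lineage graph (hypothetical, not a Spark class)."""
    op: str
    parent: Optional["Node"] = None
    wide: bool = False  # True if the op needs a shuffle (e.g. reduceByKey)

    def then(self, op, wide=False):
        # Chaining creates a new node that points back at this one.
        return Node(op, parent=self, wide=wide)


def lineage(node):
    """Walk parent pointers back to the source; return nodes in run order."""
    nodes = []
    while node is not None:
        nodes.append(node)
        node = node.parent
    return list(reversed(nodes))


def stages(node):
    """Group ops into stages, starting a new stage at each wide op."""
    result, current = [], []
    for n in lineage(node):
        if n.wide and current:
            result.append(current)
            current = []
        current.append(n.op)
    result.append(current)
    return result


dag = (
    Node("textFile")
    .then("map:parse")
    .then("filter:amount>500")
    .then("map:key-value")
    .then("reduceByKey", wide=True)
)

# Nothing has been "run": the chain above only built a description.
print(" --> ".join(n.op for n in lineage(dag)))
print(stages(dag))
```

The first print reproduces the arrow diagram above, and the second shows the four narrow operations grouped into one stage with reduceByKey in a stage of its own.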

How Lazy Evaluation Saves Work

Scenario: Find total sales for the North region only.

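Without lazy evaluation:
Step 1: Load and parse 100 million rows into memory  (all regions)
Step 2: Filter to North, building a second dataset   (50 million rows)
Step 3: Sum                                          (result)

With lazy evaluation (Spark):
Spark sees the filter BEFORE starting, so it:
Step 1: Read + filter North rows in one pass  (only 50 million rows move on)
Step 2: Sum                                   (result)

No intermediate 100-million-row dataset is ever materialized. Every row is still read once unless the data source can skip rows itself, as columnar formats such as Parquet can when a filter is pushed down to them.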
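
The single-pass behavior can be imitated with Python generators, which are also lazy. This is only an analogy under stated assumptions: read_rows and only_north are hypothetical helpers, the data is synthetic (half the rows are "North", each with amount 10), and the row count is scaled down to 1,000.

```python
from collections import Counter

counts = Counter()


def read_rows(n):
    """Hypothetical source: yields (region, amount) rows one at a time."""
    for i in range(n):
        counts["read"] += 1
        yield ("North" if i % 2 == 0 else "South", 10)


def only_north(rows):
    """Pass through the amounts of North rows; drop everything else."""
    for region, amount in rows:
        if region == "North":
            counts["passed_filter"] += 1
            yield amount


# Building the pipeline does no work: generators are lazy.
pipeline = only_north(read_rows(1000))
work_before_sum = counts["read"]

# The "action": rows are pulled through read + filter one at a time,
# so no intermediate list of all 1,000 rows is ever built.
total = sum(pipeline)
print(work_before_sum, counts["read"], counts["passed_filter"], total)
# 0 1000 500 5000
```

All 1,000 rows are read, but only the 500 North rows reach the sum, and no step holds the full dataset in memory.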

Catalyst Optimizer — Spark's Smart Planner

For DataFrames and SQL, Spark uses the Catalyst optimizer to improve your query automatically. Catalyst applies rules such as pushing filters earlier in the pipeline, pruning columns the query never uses, and, when cost-based optimization is enabled, reordering joins. You write simple code; Catalyst produces an efficient execution plan.

Your query (inefficient-looking):
df.select("*").filter("region = 'North'").select("amount")

Catalyst optimizes it into a plan equivalent to:
df.select("region", "amount").filter("region = 'North'").select("amount")
   ^                          ^
   reads only needed columns  applied as early as possible
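
Column pruning, one of the rules mentioned above, can be sketched as a backward walk over a plan. This is a toy illustration and not Catalyst's actual code: the tuple-based plan format and the required_columns function are assumptions made up for this example.

```python
def required_columns(steps):
    """Walk a toy plan backwards and collect the columns the scan must read.

    `steps` is a list of ("select", [cols]) or ("filter", col) tuples in
    execution order; ["*"] means every column. This format is hypothetical.
    Returns None when the plan needs every column.
    """
    needed = None  # None means "no narrower requirement found yet"
    for kind, arg in reversed(steps):
        if kind == "select":
            # The last explicit select fixes what the output needs.
            if arg != ["*"] and needed is None:
                needed = set(arg)
        elif kind == "filter" and needed is not None:
            # A filter also needs the column it tests.
            needed.add(arg)
    return needed


plan = [
    ("select", ["*"]),
    ("filter", "region"),
    ("select", ["amount"]),
]
print(sorted(required_columns(plan)))  # ['amount', 'region']
```

The result matches the rewritten query: the scan needs only region (for the filter) and amount (for the output), regardless of the leading select("*").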

Viewing the Execution Plan

Call .explain() on a DataFrame to print the plan Spark would run. explain() itself does not execute the query.

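# inferSchema=True makes "amount" numeric so sum() accepts it
# (this option scans the file once when the DataFrame is created)
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
plan = df.filter("amount > 500").groupBy("region").sum("amount")
plan.explain()

# Example output (abridged; the exact format varies by Spark version):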
# == Physical Plan ==
# *(2) HashAggregate(keys=[region], functions=[sum(amount)])
# +- Exchange hashpartitioning(region, 200)
#    +- *(1) HashAggregate(keys=[region], functions=[partial_sum(amount)])
#       +- *(1) Filter (amount > 500)
#          +- FileScan csv [region, amount] ...

Benefits of Lazy Evaluation

Efficiency: Spark runs only the steps the requested result depends on
Optimization: the engine can reorder and rewrite the plan for better performance
Resource savings: rows filtered out early never occupy memory or network bandwidth in later steps
Fault tolerance: the recorded lineage (the chain of transformations) lets Spark recompute lost partitions
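
The fault tolerance point can be sketched as follows. This is a simplified model, not how Spark stores partitions: the source dictionary, the lineage_steps list, and compute_partition are hypothetical names. The idea it shows is that a lost partition can be rebuilt by replaying the recorded transformations on that partition's input alone.

```python
# Partition id -> input records (stands in for data on stable storage).
source = {0: [1, 2, 3], 1: [4, 5, 6]}

# The recorded transformations, in order (the "lineage").
lineage_steps = [lambda x: x * 2, lambda x: x + 1]


def compute_partition(pid):
    """Rebuild one partition by replaying the recorded steps on its input."""
    records = source[pid]
    for step in lineage_steps:
        records = [step(r) for r in records]
    return records


# Compute and hold every partition, as a cache would.
cache = {pid: compute_partition(pid) for pid in source}
original = list(cache[1])

del cache[1]                     # simulate losing one partition
cache[1] = compute_partition(1)  # recompute only the lost partition

print(cache[1] == original, cache[1])  # True [9, 11, 13]
```

Partition 0 is never recomputed; only the missing piece is rebuilt from its input and the recorded steps.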

Common Mistake: Assuming Transformations Run Immediately

# This does NOT run the filter
filtered = large_rdd.filter(lambda x: x > 0)

# This does NOT print the result
print(filtered)
# Output: PythonRDD[5] at RDD at PythonRDD.scala:53

# This DOES run everything
print(filtered.count())
# Example output: 84291837  (the actual count)

Always call an action when you need a result. The printed RDD object is just a reference to the plan, not the data itself.
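
The same trap exists in plain Python with generators, which makes for a quick experiment without a cluster. This is an analogy only; the variable names are made up for illustration.

```python
numbers = [-2, 5, 0, 7]

# A generator expression is lazy, much like a transformation.
positives = (x for x in numbers if x > 0)

# repr() describes the generator object, not the data it would produce.
description = repr(positives)

# Consuming the generator is the analog of calling an action.
materialized = list(positives)

print(description.startswith("<generator object"))  # True
print(materialized)                                 # [5, 7]
```

One difference worth keeping in mind: a Python generator is exhausted after one pass, whereas calling a second action on an RDD re-evaluates its plan (or reads cached data if the RDD was cached).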
