Spark Architecture
Spark uses a master-worker architecture. One central process coordinates everything, and many worker processes do the actual computation. Understanding this structure helps you debug failures, tune performance, and make sense of Spark logs.
The Three Core Components
1. Driver Program
The Driver is the brain of a Spark application. It runs your main program, builds a plan of execution, and tells workers what to do. Think of the Driver as the restaurant manager who reads the order, plans the kitchen workflow, and assigns tasks to cooks.
2. Cluster Manager
The Cluster Manager controls the machines available for use. When the Driver needs workers, it requests resources from the Cluster Manager. The cluster managers in common use are Spark Standalone (built-in), YARN (Hadoop), and Kubernetes. Apache Mesos was also supported but has been deprecated since Spark 3.2.
3. Executors
Executors are the workers. Each Executor is a process that runs on a worker machine in the cluster and uses one or more CPU cores; a single machine can host several Executors. They receive tasks from the Driver, process data, and return results. Executors also hold data in memory when caching is enabled.
Full Architecture Diagram
           Your Code (Python, Scala, SQL)
                        |
                        v
               [ Driver Program ]
               - SparkContext
               - Creates execution plan
               - Schedules tasks
                        |
                        v
               [ Cluster Manager ]
        (YARN / Kubernetes / Standalone)
               - Allocates resources
                  /     |     \
                 v      v      v
    [Executor]     [Executor]     [Executor]
    - Task 1       - Task 2       - Task 3
    - Cache        - Cache        - Cache
    - Result       - Result       - Result
                  \     |     /
                        v
               [ Driver collects
                 final results ]
SparkContext and SparkSession
SparkContext is the original entry point, the main one in Spark 1.x, and it still underlies the RDD API. SparkSession is the unified entry point introduced in Spark 2.0 and is the standard today. SparkSession wraps SparkContext and adds support for DataFrames and SQL.
# PySpark example
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MyFirstSparkApp")
    .getOrCreate()
)
Jobs, Stages, and Tasks
When you run a Spark action (such as counting rows), Spark creates a Job. Each Job is divided into Stages, and each Stage is divided into Tasks, one per data partition. Tasks are the smallest units of work, and they are what executors actually run.
[ Job ]
   |
   +-- [ Stage 1 ]
   |      |-- Task 1 (Executor A)
   |      |-- Task 2 (Executor B)
   |      |-- Task 3 (Executor C)
   |
   +-- [ Stage 2 ]
          |-- Task 4 (Executor A)
          |-- Task 5 (Executor B)
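The tree above can be made concrete with a little arithmetic. The sketch below is plain Python, not Spark code. It assumes that a stage runs one task per data partition and that each executor core runs one task at a time (Spark's default behavior). The function names and the cluster sizes are made up for illustration.

```python
import math


def task_slots(num_executors: int, cores_per_executor: int) -> int:
    """Number of tasks that can run at the same time across the cluster."""
    return num_executors * cores_per_executor


def waves_needed(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Rounds of tasks a stage needs when it has one task per partition."""
    slots = task_slots(num_executors, cores_per_executor)
    return math.ceil(num_partitions / slots)


# Hypothetical cluster: 3 executors with 4 cores each, a stage with 200 partitions
print(task_slots(3, 4))         # 12
print(waves_needed(200, 3, 4))  # 17
```

With 12 slots, the 200 tasks of this hypothetical stage run in 17 rounds, and the last round leaves some slots idle. This is one reason partition counts are often chosen relative to the total number of cores.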
What Separates Stages?
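A shuffle separates stages. A shuffle happens when data must be redistributed across partitions, and therefore usually between executors, for example when grouping data by a key. Shuffles are expensive because they involve serialization, disk I/O, and network transfer, so Spark avoids them where it can and runs the operations between two shuffles together in a single stage.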
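To illustrate how shuffles divide a job, here is a toy model in plain Python. It is not how Spark's scheduler is implemented. It only assumes a linear list of operation names and a hand-picked set of operations that typically require a shuffle. The names `WIDE_OPS` and `split_into_stages` are hypothetical.

```python
# Operations treated as shuffle-producing in this toy model (an illustrative subset)
WIDE_OPS = {"groupBy", "join", "repartition", "distinct"}


def split_into_stages(operations):
    """Group a linear list of operation names into stages.

    A wide operation starts a new stage, because its input has to be
    shuffled before it can run.
    """
    stages = [[]]
    for op in operations:
        # Open a new stage at a shuffle, unless the current stage is still empty
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["read", "filter", "groupBy", "count"]
print(split_into_stages(pipeline))
# [['read', 'filter'], ['groupBy', 'count']]
```

The model is deliberately simplified. In real Spark the work of a grouping is split across both sides of the shuffle, and some joins can avoid a shuffle altogether.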
DAG Scheduler
Spark uses a DAG (Directed Acyclic Graph) Scheduler to plan execution. Transformations such as filters and joins are lazy: when you write them, Spark does not run them immediately. Instead, it records them in a graph of planned operations and works out an efficient way to execute them, splitting the graph into stages. Nothing in this graph runs until you trigger an action.
DAG Example:
[Read CSV] --> [Filter rows] --> [Group by city] --> [Count] --> OUTPUT
     \                                              /
      \--- [Join with another table] --------------/
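The idea of recording a plan first and running it only on an action can be sketched without Spark. The `LazyPipeline` class below is a hypothetical stand-in written in plain Python. It handles only a straight chain of steps, not a branching graph, and does no optimization.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # the recorded plan; nothing has run yet

    # Transformations: extend the plan and return a new pipeline
    def map(self, fn):
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def plan(self):
        return [name for name, _ in self._steps]

    # Action: run the recorded steps in order and return the result
    def collect(self):
        rows = iter(self._data)
        for name, fn in self._steps:
            rows = map(fn, rows) if name == "map" else filter(fn, rows)
        return list(rows)


calls = []


def double(x):
    calls.append(x)  # record every real invocation
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(pipeline.plan())     # ['map', 'filter']
print(len(calls))          # 0, because no action has run yet
print(pipeline.collect())  # [6, 8]
print(len(calls))          # 4
```

Building the pipeline calls `double` zero times. The function runs only once `collect()` is called, which plays the role of a Spark action here.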
Key Architecture Facts
- One Driver per application, many Executors
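- The Driver can become a bottleneck if it collects too much data, for example by calling collect() on a large dataset; keep large results in distributed storage
- Executor memory and CPU cores are configurable when the application is submitted (for example, with the spark-submit options --executor-memory and --executor-cores)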
- If an Executor fails, the Driver reassigns its tasks to other Executors
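As a rough illustration of setting executor resources at submission time, the helper below assembles a spark-submit command line in plain Python. The configuration keys `spark.executor.memory`, `spark.executor.cores`, and `spark.executor.instances` are standard Spark settings. The helper itself, the file name `my_app.py`, and the sizes are assumptions chosen for the example.

```python
def spark_submit_args(app_path, master, executor_memory, executor_cores, num_executors):
    """Build the argument list for a spark-submit call."""
    return [
        "spark-submit",
        "--master", master,
        "--conf", f"spark.executor.memory={executor_memory}",
        "--conf", f"spark.executor.cores={executor_cores}",
        "--conf", f"spark.executor.instances={num_executors}",
        app_path,
    ]


# Hypothetical application file and sizes, for illustration only
args = spark_submit_args("my_app.py", "yarn", "4g", 2, 10)
print(" ".join(args))
```

The resulting list could be passed to `subprocess.run` on a machine where spark-submit is installed. Note that a fixed executor count applies only when dynamic allocation is not in use.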
