Spark Partitioning and Parallelism
Partitioning controls how Spark splits data across executors. Good partitioning means every executor stays busy and no single machine gets overloaded. Poor partitioning leads to slow jobs, memory errors, or idle workers.
What Is a Partition?
A partition is one slice of an RDD or DataFrame. Spark processes each partition independently on one executor core. Think of a 1,000-row dataset split into 10 partitions: each executor core processes 100 rows at the same time.
Dataset: 1,000 rows Partition 0: rows 1–100 --> Executor Core 1 Partition 1: rows 101–200 --> Executor Core 2 Partition 2: rows 201–300 --> Executor Core 3 ... Partition 9: rows 901–1000 --> Executor Core 10 All 10 cores work simultaneously = 10x faster than sequential.
Default Partition Count
When reading from a file, Spark creates one partition per input block (usually 128 MB for HDFS). For in-memory datasets, the default equals the number of CPU cores on the cluster.
# Check partition count
rdd = sc.textFile("bigfile.csv")
print(rdd.getNumPartitions())
df = spark.read.csv("bigfile.csv")
print(df.rdd.getNumPartitions())
Changing the Number of Partitions
repartition() — Increase or Decrease Partitions
repartition(n) reshuffles all data across the cluster to create exactly n partitions. Use it to increase parallelism before a heavy computation, or to balance uneven data.
df = spark.read.csv("data.csv", header=True)
print(df.rdd.getNumPartitions()) # e.g., 4
# Increase to 20 partitions for a large join
df = df.repartition(20)
print(df.rdd.getNumPartitions()) # 20
coalesce() — Reduce Partitions Without Full Shuffle
coalesce(n) merges partitions without moving data across the network. Use it before writing output to avoid many small files.
# Before writing, combine 200 partitions into 5
result = result.coalesce(5)
result.write.csv("/output/results")
# Creates 5 output files instead of 200 tiny ones
When to Use Each
| Situation | Use |
|---|---|
| Increase partitions (more parallelism) | repartition(n) |
| Decrease partitions (less output files) | coalesce(n) |
| Uneven data distribution (skew) | repartition(n) with a key column |
The Too-Few and Too-Many Problem
TOO FEW PARTITIONS: [Partition 0: 10 GB] --> One core works, others sit idle [no other partitions] Result: slow job, out-of-memory errors TOO MANY PARTITIONS: [P0: 1 KB][P1: 1 KB]...[P10000: 1 KB] Each partition takes 0.1ms of data + 50ms of overhead Result: slow job due to scheduling overhead JUST RIGHT: Each partition = 100 MB to 200 MB of data Partition count = 2x to 4x the number of executor cores
Data Skew — The Uneven Load Problem
Data skew happens when some partitions have far more records than others. One executor processes a huge partition while all others finish early and sit idle, making the job as slow as the slowest partition.
Skewed data (region column): Partition 0 (North): 100,000 rows <-- huge Partition 1 (South): 500 rows <-- small Partition 2 (East): 300 rows <-- small Partition 3 (West): 200 rows <-- small Executor 0 finishes last, holding up the entire job. Fix: repartition by a high-cardinality column (like user_id) df = df.repartition(200, "user_id")
spark.sql.shuffle.partitions
After a shuffle operation (like a join or group by), Spark creates a fixed number of partitions defined by spark.sql.shuffle.partitions. The default is 200, which is too high for small datasets and too low for very large ones.
# Set shuffle partitions for a small dataset
spark.conf.set("spark.sql.shuffle.partitions", "10")
# Check current setting
print(spark.conf.get("spark.sql.shuffle.partitions"))
# Default: 200
Rule of Thumb for Partition Count
- Aim for partition sizes between 100 MB and 200 MB each
- Use 2–4 partitions per executor CPU core
- Reduce shuffle partitions to 10–50 for datasets under 10 GB
- Watch the Spark UI for skew: stages where one task takes 10x longer than others
