Spark Partitioning and Parallelism

Partitioning controls how Spark splits data across executors. Good partitioning means every executor stays busy and no single machine gets overloaded. Poor partitioning leads to slow jobs, memory errors, or idle workers.

What Is a Partition?

A partition is one slice of an RDD or DataFrame. Spark processes each partition independently on one executor core. Think of a 1,000-row dataset split into 10 partitions: each executor core processes 100 rows at the same time.

Dataset: 1,000 rows

Partition 0: rows 1–100    --> Executor Core 1
Partition 1: rows 101–200  --> Executor Core 2
Partition 2: rows 201–300  --> Executor Core 3
...
Partition 9: rows 901–1000 --> Executor Core 10

All 10 cores work simultaneously = 10x faster than sequential.

Default Partition Count

When reading from a file, Spark creates one partition per input block (usually 128 MB for HDFS). For in-memory datasets, the default equals the number of CPU cores on the cluster.

# Check partition count
rdd = sc.textFile("bigfile.csv")
print(rdd.getNumPartitions())

df = spark.read.csv("bigfile.csv")
print(df.rdd.getNumPartitions())

Changing the Number of Partitions

repartition() — Increase or Decrease Partitions

repartition(n) reshuffles all data across the cluster to create exactly n partitions. Use it to increase parallelism before a heavy computation, or to balance uneven data.

df = spark.read.csv("data.csv", header=True)
print(df.rdd.getNumPartitions())  # e.g., 4

# Increase to 20 partitions for a large join
df = df.repartition(20)
print(df.rdd.getNumPartitions())  # 20

coalesce() — Reduce Partitions Without Full Shuffle

coalesce(n) merges partitions without moving data across the network. Use it before writing output to avoid many small files.

# Before writing, combine 200 partitions into 5
result = result.coalesce(5)
result.write.csv("/output/results")
# Creates 5 output files instead of 200 tiny ones

When to Use Each

SituationUse
Increase partitions (more parallelism)repartition(n)
Decrease partitions (less output files)coalesce(n)
Uneven data distribution (skew)repartition(n) with a key column

The Too-Few and Too-Many Problem

TOO FEW PARTITIONS:
[Partition 0: 10 GB] --> One core works, others sit idle
[no other partitions]
Result: slow job, out-of-memory errors

TOO MANY PARTITIONS:
[P0: 1 KB][P1: 1 KB]...[P10000: 1 KB]
Each partition takes 0.1ms of data + 50ms of overhead
Result: slow job due to scheduling overhead

JUST RIGHT:
Each partition = 100 MB to 200 MB of data
Partition count = 2x to 4x the number of executor cores

Data Skew — The Uneven Load Problem

Data skew happens when some partitions have far more records than others. One executor processes a huge partition while all others finish early and sit idle, making the job as slow as the slowest partition.

Skewed data (region column):

Partition 0 (North):  100,000 rows  <-- huge
Partition 1 (South):      500 rows  <-- small
Partition 2 (East):       300 rows  <-- small
Partition 3 (West):       200 rows  <-- small

Executor 0 finishes last, holding up the entire job.

Fix: repartition by a high-cardinality column (like user_id)
df = df.repartition(200, "user_id")

spark.sql.shuffle.partitions

After a shuffle operation (like a join or group by), Spark creates a fixed number of partitions defined by spark.sql.shuffle.partitions. The default is 200, which is too high for small datasets and too low for very large ones.

# Set shuffle partitions for a small dataset
spark.conf.set("spark.sql.shuffle.partitions", "10")

# Check current setting
print(spark.conf.get("spark.sql.shuffle.partitions"))
# Default: 200

Rule of Thumb for Partition Count

  • Aim for partition sizes between 100 MB and 200 MB each
  • Use 2–4 partitions per executor CPU core
  • Reduce shuffle partitions to 10–50 for datasets under 10 GB
  • Watch the Spark UI for skew: stages where one task takes 10x longer than others

Leave a Comment