Spark RDD Basics
RDD stands for Resilient Distributed Dataset. It is the original, foundational data structure in Spark. Every piece of data you process in Spark starts as or becomes an RDD at the lowest level.
What Does Each Word Mean?
- Resilient — if a part of the data is lost (because a machine crashes), Spark can rebuild it automatically
- Distributed — the data is split across multiple machines in the cluster
- Dataset — a collection of records, like rows in a table or lines in a file
The Bookshelf Analogy
Imagine a 1,000-page book split across 10 people. Person 1: Pages 1–100 Person 2: Pages 101–200 ... Person 10: Pages 901–1000 Each person = one RDD partition on one executor. All 10 people read at the same time = parallel processing. If Person 5 loses their pages, Spark knows the original book source and reprints pages 401–500 automatically.
Creating an RDD
You create RDDs in two ways: from an existing Python collection using parallelize(), or from a file using textFile().
# Method 1: From a Python list
data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)
# Method 2: From a text file (each line becomes one record)
rdd = sc.textFile("logs.txt")
# Method 3: From a text file with a specified number of partitions
rdd = sc.textFile("logs.txt", minPartitions=4)
Viewing RDD Contents
data = [100, 200, 300, 400, 500] rdd = sc.parallelize(data) # collect() returns all data to the Driver as a Python list print(rdd.collect()) # Output: [100, 200, 300, 400, 500] # take() returns only the first N items print(rdd.take(3)) # Output: [100, 200, 300] # first() returns just one item print(rdd.first()) # Output: 100
Partitions — How RDDs Are Divided
Spark breaks an RDD into partitions. Each partition processes independently on an executor. More partitions mean more parallelism, but too many small partitions waste overhead. The default number of partitions equals the number of CPU cores on the cluster.
rdd = sc.parallelize([1,2,3,4,5,6,7,8], numSlices=4) # Check how many partitions exist print(rdd.getNumPartitions()) # Output: 4 # Visualize the split: Partition 0: [1, 2] Partition 1: [3, 4] Partition 2: [5, 6] Partition 3: [7, 8]
RDD Lineage — The Recovery Plan
Spark tracks every operation applied to an RDD in a lineage graph. If data is lost, Spark uses this lineage to recompute only the missing parts from the original source. No data is duplicated on disk; the recovery plan is the protection.
lineage example:
textFile("logs.txt") <-- original source
|
v
filter(line contains "ERROR") <-- step 1
|
v
map(extract timestamp) <-- step 2
|
v
[result RDD]
If step 2's partition 3 is lost,
Spark replays steps 1 and 2 for partition 3 only.
RDD vs DataFrame — When to Use Which
| Aspect | RDD | DataFrame |
|---|---|---|
| Structure | Raw objects, no schema | Structured rows and columns |
| Optimization | Manual | Automatic (Catalyst optimizer) |
| Best for | Custom logic, unstructured data | SQL-like operations, structured data |
| API complexity | More code needed | Concise and expressive |
Key Facts About RDDs
- RDDs are immutable — you never modify an RDD; you create a new one from it
- RDDs are lazy — operations do not run until you call an action (covered in the next topics)
- RDDs support any Python, Scala, or Java object as a record
- For structured data (tables), prefer DataFrames over RDDs
