Spark RDD Basics

RDD stands for Resilient Distributed Dataset. It is the original, foundational data structure in Spark. Higher-level APIs such as DataFrames are ultimately executed as operations on RDDs, so at the lowest level every piece of data you process in Spark is an RDD.

What Does Each Word Mean?

Resilient — if a part of the data is lost (because a machine crashes), Spark can rebuild it automatically
Distributed — the data is split across multiple machines in the cluster
Dataset — a collection of records, like rows in a table or lines in a file

The Bookshelf Analogy

Imagine a 1,000-page book split across 10 people.

Person 1: Pages 1–100
Person 2: Pages 101–200
...
Person 10: Pages 901–1000

Each person = one RDD partition on one executor.
All 10 people read at the same time = parallel processing.

If Person 5 loses their pages, Spark knows the original
book source and reprints pages 401–500 automatically.
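
To make the analogy concrete, here is a minimal plain-Python sketch. It is not Spark code: the names PAGES, NUM_READERS, and pages_for are hypothetical and only model the idea that a lost share can be rebuilt from the original source without touching the other shares.

```python
# Plain-Python model of the book analogy; this is not Spark code.
PAGES = list(range(1, 1001))   # the "book": pages 1..1000
NUM_READERS = 10               # stands in for 10 partitions


def pages_for(reader):
    """Return the pages held by a reader (1-based), 100 pages each."""
    size = len(PAGES) // NUM_READERS
    start = (reader - 1) * size
    return PAGES[start:start + size]


# Hand each reader their share of the book.
shares = {reader: pages_for(reader) for reader in range(1, NUM_READERS + 1)}

# Person 5 loses their pages...
del shares[5]

# ...and only that share is rebuilt from the original source.
shares[5] = pages_for(5)

print(shares[5][0], shares[5][-1])  # 401 500
```

The other nine shares are never touched during the rebuild, which is the property the analogy is meant to convey.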

Creating an RDD

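You create RDDs in two main ways: from an existing Python collection using parallelize(), or from a file using textFile(). Both are methods on the SparkContext, called sc in the examples.

from pyspark import SparkContext

# Reuse the running SparkContext, or start one if none exists
sc = SparkContext.getOrCreate()

# Method 1: From a Python list
data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)

# Method 2: From a text file (each line becomes one record)
rdd = sc.textFile("logs.txt")

# Variant of Method 2: ask for at least 4 partitions
# (minPartitions is a suggested minimum, not an exact count)
rdd = sc.textFile("logs.txt", minPartitions=4)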

Viewing RDD Contents

data = [100, 200, 300, 400, 500]
rdd = sc.parallelize(data)

# collect() returns all data to the driver as a Python list (use it only on small results)
print(rdd.collect())
# Output: [100, 200, 300, 400, 500]

# take() returns only the first N items
print(rdd.take(3))
# Output: [100, 200, 300]

# first() returns only the first item
print(rdd.first())
# Output: 100

Partitions — How RDDs Are Divided

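Spark breaks an RDD into partitions. Each partition is processed independently by one task on an executor. More partitions mean more parallelism, but too many small partitions add scheduling overhead. For parallelize(), the default number of partitions comes from spark.default.parallelism, which typically equals the number of cores available to the application. For textFile(), it depends on how the input file is split.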

rdd = sc.parallelize([1,2,3,4,5,6,7,8], numSlices=4)

# Check how many partitions exist
print(rdd.getNumPartitions())
# Output: 4

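# Inspect the split: glom() groups the elements of each partition into a list
print(rdd.glom().collect())
# Output: [[1, 2], [3, 4], [5, 6], [7, 8]]
#
# Partition 0: [1, 2]
# Partition 1: [3, 4]
# Partition 2: [5, 6]
# Partition 3: [7, 8]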
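
The split above is even because 8 divides cleanly by 4. As a rough illustration of what happens when it does not, here is a plain-Python sketch of index-based slicing. The function split_into_slices is hypothetical and only a model; Spark's exact placement of elements can differ, so treat the sizes as indicative rather than guaranteed.

```python
def split_into_slices(items, num_slices):
    """Split a list into num_slices contiguous chunks of near-equal size."""
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


even = split_into_slices([1, 2, 3, 4, 5, 6, 7, 8], 4)
print(even)  # [[1, 2], [3, 4], [5, 6], [7, 8]]

# 10 elements cannot be divided evenly into 4 chunks.
uneven = split_into_slices(list(range(1, 11)), 4)
print([len(chunk) for chunk in uneven])  # [2, 3, 2, 3]
```

The point of the sketch is that chunk sizes differ by at most one element, so no single partition ends up with most of the work.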

RDD Lineage — The Recovery Plan

Spark records every transformation applied to an RDD in a lineage graph. If a partition is lost, Spark uses this lineage to recompute only the missing partitions from the original source. By default, Spark does not keep replicated copies of RDD data; the recovery plan itself is the protection.

Lineage example:

textFile("logs.txt")          <-- original source
    |
    v
filter(line contains "ERROR") <-- step 1
    |
    v
map(extract timestamp)        <-- step 2
    |
    v
[result RDD]

If step 2's partition 3 is lost,
Spark replays steps 1 and 2 for partition 3 only.
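
The replay described above can be sketched in plain Python. This is a toy model, not Spark's implementation: SOURCE_PARTITIONS, LINEAGE, and compute_partition are hypothetical names, and the log lines are made up for illustration.

```python
# Plain-Python model of lineage-based recovery; not Spark's implementation.
# The log lines below are made up for illustration.
SOURCE_PARTITIONS = [
    ["10:00 INFO start", "10:01 ERROR disk full"],
    ["10:02 ERROR timeout", "10:03 INFO retry"],
    ["10:04 ERROR refused", "10:05 ERROR refused"],
]

# The lineage: an ordered list of steps, each applied to one partition.
LINEAGE = [
    lambda lines: [line for line in lines if "ERROR" in line],  # step 1: filter
    lambda lines: [line.split()[0] for line in lines],          # step 2: map
]


def compute_partition(index):
    """Rebuild one result partition by replaying the lineage from the source."""
    data = SOURCE_PARTITIONS[index]
    for step in LINEAGE:
        data = step(data)
    return data


result = [compute_partition(i) for i in range(len(SOURCE_PARTITIONS))]

result[2] = None                  # simulate losing one result partition
result[2] = compute_partition(2)  # replay steps 1 and 2 for that partition only

print(result)  # [['10:01'], ['10:02'], ['10:04', '10:05']]
```

Because filter and map work on each partition independently, rebuilding one result partition needs only the matching source partition, never the whole dataset.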

RDD vs DataFrame — When to Use Which

Aspect	RDD	DataFrame
Structure	Raw objects, no schema	Structured rows and columns
Optimization	Manual	Automatic (Catalyst optimizer)
Best for	Custom logic, unstructured data	SQL-like operations, structured data
API complexity	More code needed	Concise and expressive

Key Facts About RDDs

RDDs are immutable — you never modify an RDD; you create a new one from it
RDDs are lazy — operations do not run until you call an action (covered in the next topics)
RDDs can hold any serializable Python, Scala, or Java object as a record
For structured data (tables), prefer DataFrames over RDDs
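
The first two facts, immutability and laziness, can be illustrated with a small plain-Python stand-in. LazyDataset is a hypothetical toy class, not part of Spark; it only mimics the behavior of recording transformations and running them when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # tuple: the recorded plan is never mutated

    def map(self, fn):
        # Return a NEW dataset with one more recorded step; nothing runs yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # The action: only now are the recorded steps executed.
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def double(x):
    calls.append(x)  # record each call so laziness is observable
    return x * 2


base = LazyDataset([10, 20, 30, 40, 50])
doubled = base.map(double).filter(lambda x: x > 50)

print(len(calls))         # 0: no work has happened yet
print(doubled.collect())  # [60, 80, 100]
print(base.collect())     # [10, 20, 30, 40, 50]: the original is unchanged
```

Calling map() or filter() only extends the recorded plan, and the original dataset stays usable afterwards, which mirrors how transformations on an RDD produce a new RDD instead of changing the existing one.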
