Spark RDD Basics

RDD stands for Resilient Distributed Dataset. It is the original, foundational data structure in Spark. Every piece of data you process in Spark starts as or becomes an RDD at the lowest level.

What Does Each Word Mean?

  • Resilient — if a part of the data is lost (because a machine crashes), Spark can rebuild it automatically
  • Distributed — the data is split across multiple machines in the cluster
  • Dataset — a collection of records, like rows in a table or lines in a file

The Bookshelf Analogy

Imagine a 1,000-page book split across 10 people.

Person 1: Pages 1–100
Person 2: Pages 101–200
...
Person 10: Pages 901–1000

Each person = one RDD partition on one executor.
All 10 people read at the same time = parallel processing.

If Person 5 loses their pages, Spark knows the original
book source and reprints pages 401–500 automatically.

Creating an RDD

You create RDDs in two ways: from an existing Python collection using parallelize(), or from a file using textFile().

# Method 1: From a Python list
data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)

# Method 2: From a text file (each line becomes one record)
rdd = sc.textFile("logs.txt")

# Method 3: From a text file with a specified number of partitions
rdd = sc.textFile("logs.txt", minPartitions=4)

Viewing RDD Contents

data = [100, 200, 300, 400, 500]
rdd = sc.parallelize(data)

# collect() returns all data to the Driver as a Python list
print(rdd.collect())
# Output: [100, 200, 300, 400, 500]

# take() returns only the first N items
print(rdd.take(3))
# Output: [100, 200, 300]

# first() returns just one item
print(rdd.first())
# Output: 100

Partitions — How RDDs Are Divided

Spark breaks an RDD into partitions. Each partition processes independently on an executor. More partitions mean more parallelism, but too many small partitions waste overhead. The default number of partitions equals the number of CPU cores on the cluster.

rdd = sc.parallelize([1,2,3,4,5,6,7,8], numSlices=4)

# Check how many partitions exist
print(rdd.getNumPartitions())
# Output: 4

# Visualize the split:
Partition 0: [1, 2]
Partition 1: [3, 4]
Partition 2: [5, 6]
Partition 3: [7, 8]

RDD Lineage — The Recovery Plan

Spark tracks every operation applied to an RDD in a lineage graph. If data is lost, Spark uses this lineage to recompute only the missing parts from the original source. No data is duplicated on disk; the recovery plan is the protection.

lineage example:

textFile("logs.txt")          <-- original source
    |
    v
filter(line contains "ERROR") <-- step 1
    |
    v
map(extract timestamp)        <-- step 2
    |
    v
[result RDD]

If step 2's partition 3 is lost,
Spark replays steps 1 and 2 for partition 3 only.

RDD vs DataFrame — When to Use Which

AspectRDDDataFrame
StructureRaw objects, no schemaStructured rows and columns
OptimizationManualAutomatic (Catalyst optimizer)
Best forCustom logic, unstructured dataSQL-like operations, structured data
API complexityMore code neededConcise and expressive

Key Facts About RDDs

  • RDDs are immutable — you never modify an RDD; you create a new one from it
  • RDDs are lazy — operations do not run until you call an action (covered in the next topics)
  • RDDs support any Python, Scala, or Java object as a record
  • For structured data (tables), prefer DataFrames over RDDs

Leave a Comment