Spark DataFrames Intro
A DataFrame is a table with named columns and defined data types. It is the most widely used data structure in modern Spark because it is fast, easy to read, and integrates with SQL. Most new Spark code uses DataFrames instead of raw RDDs.
DataFrame vs RDD — The Spreadsheet Analogy
RDD (raw data, no structure):
[("Alice", 32, "HR"), ("Bob", 28, "IT"), ("Carol", 35, "Finance")]
Spark sees: a list of tuples. No column names. No types.
DataFrame (structured table):
+-------+---+---------+
| name |age|dept |
+-------+---+---------+
| Alice | 32| HR |
| Bob | 28| IT |
| Carol | 35| Finance |
+-------+---+---------+
Spark knows: 3 columns, their names, and their types.
Creating a DataFrame
From a Python list
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName("DF Demo").getOrCreate()
data = [("Alice", 32, "HR"), ("Bob", 28, "IT"), ("Carol", 35, "Finance")]
columns = ["name", "age", "dept"]
df = spark.createDataFrame(data, columns)
df.show()
# Output:
# +-----+---+-------+
# | name|age| dept|
# +-----+---+-------+
# |Alice| 32| HR|
# | Bob| 28| IT|
# |Carol| 35|Finance|
# +-----+---+-------+
From a CSV file
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.show(5)
From a JSON file
df = spark.read.json("employees.json")
df.show(5)
Exploring a DataFrame
# Schema — column names and data types df.printSchema() # Output: # root # |-- name: string (nullable = true) # |-- age: integer (nullable = true) # |-- dept: string (nullable = true) # Number of rows print(df.count()) # 3 # Column names print(df.columns) # ['name', 'age', 'dept'] # Summary statistics df.describe().show() # +-------+------+------------------+-------+ # |summary| name| age| dept| # +-------+------+------------------+-------+ # | count| 3| 3| 3| # | mean| null|31.666666666666668| null| # | stddev| null| 3.511884584284495| null| # | min| Alice| 28|Finance| # | max| Bob| 35| IT| # +-------+------+------------------+-------+
Selecting Columns
# Select one column
df.select("name").show()
# Select multiple columns
df.select("name", "dept").show()
# Select with expression
from pyspark.sql.functions import col
df.select(col("name"), (col("age") + 1).alias("age_next_year")).show()
# +-----+-------------+
# | name|age_next_year|
# +-----+-------------+
# |Alice| 33|
# | Bob| 29|
# |Carol| 36|
# +-----+-------------+
Filtering Rows
# Keep only employees older than 30
df.filter(df.age > 30).show()
# Using SQL-style string
df.filter("age > 30").show()
# Multiple conditions
df.filter((df.age > 30) & (df.dept == "HR")).show()
Adding and Renaming Columns
from pyspark.sql.functions import lit
# Add a new column with a constant value
df_with_company = df.withColumn("company", lit("Acme Corp"))
# Rename a column
df_renamed = df.withColumnRenamed("dept", "department")
# Drop a column
df_no_age = df.drop("age")
Sorting
# Sort by age ascending
df.orderBy("age").show()
# Sort by age descending
df.orderBy(df.age.desc()).show()
Grouping and Aggregation
from pyspark.sql.functions import avg, count
# Average age per department
df.groupBy("dept").agg(avg("age").alias("avg_age"), count("*").alias("headcount")).show()
# +-------+-------+---------+
# | dept|avg_age|headcount|
# +-------+-------+---------+
# |Finance| 35.0| 1|
# | HR| 32.0| 1|
# | IT| 28.0| 1|
# +-------+-------+---------+
Why DataFrames Beat RDDs for Most Work
- Spark's Catalyst optimizer automatically rewrites DataFrame queries for speed
- DataFrames use Tungsten encoding — a compact binary format that reduces memory and speeds serialization
- Code is cleaner and easier to read
- DataFrames integrate directly with Spark SQL and external BI tools
