Spark DataFrames Intro

A DataFrame is a distributed table with named columns and a defined data type for each column. It is the most widely used data structure in modern Spark because Spark optimizes its queries automatically, the code is easy to read, and it integrates with SQL. Most new Spark code uses DataFrames instead of raw RDDs.

DataFrame vs RDD — The Spreadsheet Analogy

RDD (raw data, no structure):
[("Alice", 32, "HR"), ("Bob", 28, "IT"), ("Carol", 35, "Finance")]
Spark sees: a list of tuples. No column names. No types.

DataFrame (structured table):
+-------+---+---------+
| name  |age|dept     |
+-------+---+---------+
| Alice | 32| HR      |
| Bob   | 28| IT      |
| Carol | 35| Finance |
+-------+---+---------+
Spark knows: 3 columns, their names, and their types.

Creating a DataFrame

From a Python list

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DF Demo").getOrCreate()

data = [("Alice", 32, "HR"), ("Bob", 28, "IT"), ("Carol", 35, "Finance")]
columns = ["name", "age", "dept"]

df = spark.createDataFrame(data, columns)
df.show()
# Output:
# +-----+---+-------+
# | name|age|   dept|
# +-----+---+-------+
# |Alice| 32|     HR|
# |  Bob| 28|     IT|
# |Carol| 35|Finance|
# +-----+---+-------+

From a CSV file

df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.show(5)

From a JSON file

df = spark.read.json("employees.json")
df.show(5)

Exploring a DataFrame

# Schema — column names and data types
df.printSchema()
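# Output (for the DataFrame built from the Python list; Python ints are
# inferred as long, while a CSV read with inferSchema=True reports integer):
# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- dept: string (nullable = true)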

# Number of rows
print(df.count())   # 3

# Column names
print(df.columns)   # ['name', 'age', 'dept']

# Summary statistics
df.describe().show()
# +-------+------+------------------+-------+
# |summary|  name|               age|   dept|
# +-------+------+------------------+-------+
# |  count|     3|                 3|      3|
# |   mean|  null|31.666666666666668|   null|
# | stddev|  null| 3.511884584284495|   null|
# |    min| Alice|                28|Finance|
# |    max| Carol|                35|     IT|
# +-------+------+------------------+-------+

Selecting Columns

# Select one column
df.select("name").show()

# Select multiple columns
df.select("name", "dept").show()

# Select with expression
from pyspark.sql.functions import col
df.select(col("name"), (col("age") + 1).alias("age_next_year")).show()
# +-----+-------------+
# | name|age_next_year|
# +-----+-------------+
# |Alice|           33|
# |  Bob|           29|
# |Carol|           36|
# +-----+-------------+

Filtering Rows

# Keep only employees older than 30
df.filter(df.age > 30).show()

# Using SQL-style string
df.filter("age > 30").show()

# Multiple conditions
df.filter((df.age > 30) & (df.dept == "HR")).show()

Adding and Renaming Columns

from pyspark.sql.functions import lit

# Add a new column with a constant value
df_with_company = df.withColumn("company", lit("Acme Corp"))

# Rename a column
df_renamed = df.withColumnRenamed("dept", "department")

# Drop a column
df_no_age = df.drop("age")

Sorting

# Sort by age ascending
df.orderBy("age").show()

# Sort by age descending
df.orderBy(df.age.desc()).show()

Grouping and Aggregation

from pyspark.sql.functions import avg, count

# Average age and headcount per department
df.groupBy("dept").agg(
    avg("age").alias("avg_age"),
    count("*").alias("headcount"),
).show()
# Output (row order may vary without an explicit orderBy):
# +-------+-------+---------+
# |   dept|avg_age|headcount|
# +-------+-------+---------+
# |Finance|   35.0|        1|
# |     HR|   32.0|        1|
# |     IT|   28.0|        1|
# +-------+-------+---------+

Why DataFrames Beat RDDs for Most Work

  • Spark's Catalyst optimizer rewrites DataFrame queries automatically, for example by pushing filters down and pruning unused columns
  • DataFrames use Tungsten's compact binary row format, which reduces memory use and speeds up serialization
  • Code is cleaner and easier to read
  • DataFrames integrate directly with Spark SQL and external BI tools
