Spark Reading and Writing Data
Choosing the right file format has a large impact on storage cost, query speed, and compatibility. Spark supports CSV, JSON, Parquet, ORC, Avro, and Delta Lake. Each format suits different situations.
Row vs Column Storage — The Core Difference
Row-based storage (CSV, JSON, Avro): Stores one complete record per row. Record 1: [Alice, 32, HR, 50000] Record 2: [Bob, 28, IT, 60000] Record 3: [Carol, 35, Fin, 55000] Best for: writing individual records, inserting rows one at a time. Worst for: reading only 2 columns from 100 — must read all others too. Column-based storage (Parquet, ORC): Stores all values from one column together. name: [Alice, Bob, Carol] age: [32, 28, 35] dept: [HR, IT, Fin] salary: [50000, 60000, 55000] Best for: analytics — "total salary" only reads the salary column. Worst for: inserting one new row at a time (rewrites the column).
CSV — The Universal Format
CSV (Comma Separated Values) is the most portable format. Every tool opens it. But it has no schema, no compression by default, and is slow for large datasets.
# Read
df = spark.read.csv("data.csv", header=True, inferSchema=True)
# Write
df.write.mode("overwrite").option("header", "true").csv("/output/data")
# Pros: Human-readable. Opens in Excel. Universally supported.
# Cons: No schema. No compression. Slow. Large file sizes.
# Use when: sharing data with non-Spark teams, small datasets.
JSON — Flexible but Slow
JSON handles nested and hierarchical data that CSV cannot represent. Spark reads each line as one JSON object by default. It is flexible but verbose and slow for analytics.
# Read
df = spark.read.json("events.json")
# JSON supports nested structures
# {"user": {"name": "Alice", "age": 32}, "action": "login"}
# Access nested fields after reading
df.select("user.name", "user.age", "action").show()
# Pros: Flexible structure. Handles nested data. Human-readable.
# Cons: Verbose. No schema enforced. Slow reads for analytics.
# Use when: working with APIs, logs, config data.
Parquet — The Analytics Standard
Parquet is the most recommended format for Spark analytics. It stores data in columns, applies efficient compression, and embeds the schema. Reading only the columns you need skips the rest entirely.
# Read
df = spark.read.parquet("data.parquet")
# Write
df.write.mode("overwrite").parquet("/output/data.parquet")
# Columnar read — only reads "amount" column from disk
df = spark.read.parquet("sales.parquet").select("amount")
# Pros: Fast. Compressed. Schema embedded. Column pruning.
# Cons: Not human-readable. Requires Spark/Hive/Presto to open.
# Use when: storing large datasets for repeated analytics.
ORC — Parquet's Close Rival
ORC (Optimized Row Columnar) is another columnar format, originally built for Hive. It performs similarly to Parquet. Use Parquet unless the system you integrate with specifically requires ORC.
df.write.mode("overwrite").orc("/output/data.orc")
df = spark.read.orc("/output/data.orc")
# Pros: Fast columnar reads. Good compression. Hive-native.
# Cons: Less widely supported outside the Hadoop ecosystem.
# Use when: storing data in Hive or tools that prefer ORC.
Avro — Best for Streaming and Schema Evolution
Avro is a row-based format that stores its schema alongside data. It handles schema evolution well — you can add or remove fields without breaking existing readers. Kafka commonly uses Avro.
# Requires avro package
df.write.format("avro").mode("overwrite").save("/output/data.avro")
df = spark.read.format("avro").load("/output/data.avro")
# Pros: Schema embedded. Handles schema changes gracefully.
# Efficient serialization for streaming.
# Cons: Row-based (slower for analytics than Parquet).
# Use when: Kafka pipelines, data with frequent schema changes.
Delta Lake — Parquet with ACID Transactions
Delta Lake adds database-like features to Parquet files: ACID transactions, version history, and the ability to update or delete specific rows. It has become the default format for modern data lakehouses.
# Requires delta package
df.write.format("delta").mode("overwrite").save("/output/sales_delta")
# Read
df = spark.read.format("delta").load("/output/sales_delta")
# Update rows (not possible in plain Parquet)
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/output/sales_delta")
delta_table.update("region = 'North'", {"amount": "amount * 1.1"})
# Time travel — read yesterday's version
df_old = spark.read.format("delta") \
.option("versionAsOf", 0) \
.load("/output/sales_delta")
# Pros: ACID. Update/delete rows. Version history. Schema enforcement.
# Cons: Requires Delta Lake library. Not a standard open format.
# Use when: production data lakehouses, data that needs updates.
Format Comparison at a Glance
| Format | Layout | Schema | Compression | Best For |
|---|---|---|---|---|
| CSV | Row | None | None (optional) | Interoperability, small data |
| JSON | Row | Flexible | None | Nested data, APIs, logs |
| Parquet | Column | Embedded | Excellent | Analytics workloads |
| ORC | Column | Embedded | Excellent | Hive ecosystems |
| Avro | Row | Embedded | Good | Streaming, schema evolution |
| Delta Lake | Column (Parquet) | Enforced | Excellent | Data lakehouses, mutable data |
