Databricks Delta Lake

Delta Lake is one of the most important inventions in the modern data world. It is an open-source storage layer that adds reliability, performance, and intelligence to data stored in cloud storage. Without Delta Lake, storing large amounts of data in a cloud storage bucket is like putting thousands of loose papers in a giant cardboard box — you can find them eventually, but it is slow, messy, and easy to make a mess that is hard to clean up. Delta Lake turns that cardboard box into a well-organized filing cabinet with a built-in index, an edit history, and automatic error prevention.

What Problem Delta Lake Solves

Before Delta Lake, data engineers stored data in cloud storage as plain files — CSV, JSON, or Parquet format. This approach had four serious problems.

PROBLEMS WITH PLAIN FILE STORAGE
──────────────────────────────────────────────────────────────
Problem 1: FAILED WRITES LEAVE PARTIAL DATA
─────────────────────────────────────────────
A job starts writing 1 billion rows.
It writes 700 million rows... then crashes.
Result: Half-written files mixed with old files.
Readers see corrupt, inconsistent data.
No way to know which records are valid.

Problem 2: NO SUPPORT FOR UPDATES OR DELETES
─────────────────────────────────────────────
Parquet files are immutable — you cannot change a row.
To "update" one record: read the entire file,
modify one row in memory, write the whole file back.
For tables with billions of rows: this is extremely slow.

Problem 3: TOO MANY SMALL FILES
─────────────────────────────────────────────
Streaming jobs write data every minute.
After one week: 10,080 tiny files.
Spark must open each file, read the header,
check the data — all before processing a single row.
Performance degrades severely over time.

Problem 4: NO TIME TRAVEL
─────────────────────────────────────────────
An analyst accidentally runs DELETE without a WHERE clause.
All records gone.
No backup. No rollback. Data is lost permanently.

Delta Lake solves all four problems with one unified storage format.

The Delta Lake Transaction Log

The core of Delta Lake is the transaction log, also called the Delta Log. Every single change to a Delta table — whether it is an insert, update, delete, merge, or schema change — gets recorded as an entry in the transaction log before the actual data files are touched.

DELTA TABLE FOLDER STRUCTURE
──────────────────────────────────────────────────────
/retail_data/transactions/          ← Delta Table root folder
│
├── _delta_log/                     ← Transaction Log folder
│    ├── 00000000000000000000.json  ← Commit 0: Table created
│    ├── 00000000000000000001.json  ← Commit 1: First data load
│    ├── 00000000000000000002.json  ← Commit 2: Update applied
│    ├── 00000000000000000003.json  ← Commit 3: Delete applied
│    └── 00000000000000000010.checkpoint.parquet ← Checkpoint
│
├── part-00001-abc123.parquet       ← Actual data files
├── part-00002-def456.parquet
└── part-00003-ghi789.parquet

Each JSON commit file records what changed: which files were added, which files were removed, what the schema looks like, and when the operation happened. When you read a Delta table, Databricks first reads the transaction log to find the current set of valid data files, then reads only those files. This is why Delta Lake reads are reliable — the log is the single source of truth about what data currently exists.

ACID Transactions – What They Mean in Plain Language

Delta Lake provides ACID transactions. ACID is an acronym for four properties that guarantee data reliability. Here is what each letter means in plain terms:

ACID PROPERTIES EXPLAINED
──────────────────────────────────────────────────────────────
A – ATOMICITY
"All or nothing"
When you write 100 files, either all 100 succeed,
or none of them are visible to readers.
A crash midway through leaves the table unchanged.

C – CONSISTENCY
"The table always makes sense"
You cannot put a string value into a numeric column.
Schema rules are enforced on every write.

I – ISOLATION
"Readers and writers do not disturb each other"
An analyst runs a query while an engineer loads new data.
The analyst sees a clean snapshot of data from before
the load started. They never see half-loaded rows.

D – DURABILITY
"Committed data survives crashes"
Once a write is committed to the Delta Log,
it is permanent. A server crash after the commit
does not lose the data.

Time Travel – Reading Older Versions of Your Data

Every committed change to a Delta table creates a new version. Delta Lake keeps all previous versions, allowing you to read the state of the table at any point in the past. This is called Time Travel.

TIME TRAVEL DIAGRAM
──────────────────────────────────────────────────────────────
Version 0 (10:00 AM): Table created with 1 million rows
Version 1 (11:00 AM): 500,000 new rows inserted
Version 2 (12:00 PM): Price column updated for 100 products
Version 3 (1:00 PM):  Analyst accidentally deletes 200,000 rows

TIME:  10am    11am    12pm    1pm (now)
        V0 ──── V1 ──── V2 ──── V3

You discover the accidental delete at 2pm.
You query Version 2 to see the data before the delete:

SQL: SELECT * FROM transactions VERSION AS OF 2
Python: df = spark.read.format("delta")
               .option("versionAsOf", 2)
               .load("/path/to/transactions")

You can restore Version 2:
RESTORE TABLE transactions TO VERSION AS OF 2

Time Travel with Timestamps

-- Query the table as it looked yesterday at 6 PM:
SELECT * FROM transactions TIMESTAMP AS OF '2024-01-14 18:00:00'

-- Compare current vs yesterday:
SELECT today.product_id,
       today.price AS current_price,
       yesterday.price AS yesterday_price
FROM transactions today
JOIN (SELECT * FROM transactions
      TIMESTAMP AS OF '2024-01-14 18:00:00') yesterday
ON today.product_id = yesterday.product_id
WHERE today.price != yesterday.price

Time Travel is the ultimate safety net. Accidental deletes, bad transformations, regulatory audits that require seeing data as it existed on a specific past date — Delta Lake handles all of these with a simple query parameter.

Schema Enforcement and Schema Evolution

Schema Enforcement

Delta Lake refuses to accept data that does not match the table's defined schema. If your table expects a numeric column called amount and you try to write a string value into it, Delta Lake rejects the entire write and raises an error. This prevents silent data corruption that plagues plain file storage.

SCHEMA ENFORCEMENT EXAMPLE
──────────────────────────────────────────────
Table schema:
  amount → DOUBLE (numeric)
  date   → DATE

Incoming data:
  amount = "not a number"   ← Invalid
  date   = "yesterday"      ← Invalid

Result: Write REJECTED, table unchanged.
Error message explains exactly what was wrong.

Without Delta Lake (plain Parquet):
  The bad data gets written silently.
  Future queries on amount column fail or return nulls.
  Debugging takes hours.

Schema Evolution

Business requirements change. A new data source adds a column that did not previously exist. Delta Lake handles this through schema evolution — a controlled process for adding new columns without breaking existing queries.

SCHEMA EVOLUTION: ADDING A NEW COLUMN
──────────────────────────────────────────────
Original table: customer_id, name, city

New data source includes: customer_id, name, city, phone

Without schema evolution: Write FAILS (new column rejected)

With schema evolution enabled:
  df.write.format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/path/to/customers")

Result: Phone column added to table automatically.
Existing rows show NULL for phone (they have no phone data).
New rows show actual phone values.
No data lost. No breaking changes.

Delta Lake Operations: INSERT, UPDATE, DELETE, MERGE

INSERT

INSERT INTO retail_data.customers
VALUES (1001, 'Priya Sharma', 'Mumbai', '+91-9876543210')

UPDATE

UPDATE retail_data.customers
SET city = 'Pune'
WHERE customer_id = 1001

DELETE

DELETE FROM retail_data.customers
WHERE customer_id = 1001

MERGE (Upsert) – The Most Powerful Operation

MERGE handles the "upsert" pattern: update a row if it exists, insert it if it does not. This is the most common operation in data engineering — synchronizing an incoming dataset with an existing table.

MERGE SCENARIO:
New data arrives from the CRM system.
Some customers are new (need INSERT).
Some customers changed their city (need UPDATE).
Some customers are exactly the same (no change needed).

MERGE INTO retail_data.customers AS target
USING incoming_crm_data AS source
ON target.customer_id = source.customer_id

WHEN MATCHED AND target.city != source.city THEN
  UPDATE SET target.city = source.city

WHEN NOT MATCHED THEN
  INSERT (customer_id, name, city)
  VALUES (source.customer_id, source.name, source.city)

Result:
• Existing customers with changed city: Updated ✓
• Brand new customers: Inserted ✓
• Unchanged customers: Left alone ✓

Optimize and ZORDER – Making Queries Faster

OPTIMIZE

Over time, especially with frequent small writes (like streaming), a Delta table accumulates many small files. Spark reads each file separately, so too many small files slow down queries. The OPTIMIZE command merges small files into larger, more efficient files.

SMALL FILES PROBLEM:
─────────────────────────────────────────────────────
Before OPTIMIZE:
  1000 files × 1 MB each = 1 GB total
  Spark opens 1000 files, reads 1000 headers → SLOW

After OPTIMIZE:
  10 files × 100 MB each = 1 GB total
  Spark opens 10 files, reads 10 headers → FAST

Command:
  OPTIMIZE retail_data.transactions

ZORDER

ZORDER is a technique that physically arranges data within files to make filtering faster. When you frequently filter a table by a specific column (like city or product), ZORDERing by that column groups related rows together in the same files. Spark skips entire files that cannot contain matching rows, dramatically reducing the data read.

ZORDER EXAMPLE:
──────────────────────────────────────────────────────────────
Query pattern: SELECT * FROM transactions WHERE city = 'Mumbai'

Without ZORDER:
  Mumbai records scattered across all 100 files.
  Spark reads all 100 files to find Mumbai rows. → SLOW

OPTIMIZE transactions ZORDER BY (city):
  Mumbai records grouped into files 1–3.
  Delhi records in files 4–8.
  Pune records in files 9–11.
  ...etc.

With ZORDER:
  Spark reads file metadata (min/max city per file).
  Files 4–100 cannot contain Mumbai → SKIPPED.
  Spark reads only files 1–3. → FAST (97% less data read)

Vacuum – Cleaning Up Old Files

Time Travel keeps old versions of your data. Those old versions live as old data files on disk. Over time, these accumulate and consume storage space. The VACUUM command removes data files that are no longer referenced by any version within the retention period (default 7 days).

VACUUM retail_data.transactions RETAIN 168 HOURS -- 7 days

Effect:
  Files that belong to versions older than 7 days are deleted.
  Time Travel still works for versions within the last 7 days.
  Storage costs decrease as unnecessary old files are removed.

WARNING: After VACUUM, you CANNOT time travel past
the retention period. Set the retention period based
on how far back your business needs to roll back data.

Delta Lake Table Properties

Delta tables support properties that configure their behavior. Set these when creating the table or alter them later.

CREATE TABLE retail_data.transactions (
  txn_id     BIGINT,
  customer_id INT,
  amount     DOUBLE,
  city       STRING,
  txn_date   TIMESTAMP
)
USING DELTA
PARTITIONED BY (city)
TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',
  'delta.deletedFileRetentionDuration' = 'interval 7 days',
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

The autoOptimize properties tell Delta Lake to automatically compact small files during writes, saving you from running OPTIMIZE manually.

Key Points

Delta Lake is a storage layer on top of cloud storage that adds reliability, performance, and smart management to data files.
The transaction log records every change to a Delta table, making all reads consistent and all writes atomic.
ACID transactions guarantee that partial writes never corrupt a table and that readers always see consistent data.
Time Travel allows querying or restoring any previous version of a table, protecting against accidental data loss.
Schema Enforcement prevents bad data from entering a table; Schema Evolution allows adding new columns safely.
OPTIMIZE merges small files into large ones for faster reads; ZORDER physically arranges rows so Spark can skip irrelevant files.
VACUUM removes old data files beyond the retention window, keeping storage costs under control.
The MERGE command handles upsert patterns — insert new records and update changed records in a single efficient operation.

Previous lesson

Back to course

Next lesson