Azure Databricks for Data Engineers

For complex data transformations that go beyond what ADF Data Flows handle, data engineers turn to Azure Databricks. It is a powerful, cloud-based data processing platform built on Apache Spark. Databricks lets you write Python, Scala, SQL, or R code to process billions of rows of data quickly and efficiently.

What is Azure Databricks

Azure Databricks is a managed Apache Spark platform. Apache Spark is an open-source engine for large-scale data processing. Databricks takes Spark and packages it into a fully managed, easy-to-use service on Azure.

A car engine is powerful, but most people cannot build one from scratch. Azure Databricks is like buying a fully assembled, high-performance car — the engine (Spark) is already inside, tuned, and ready to drive. You just write your transformation logic and focus on the destination.

Key Components of Azure Databricks

Workspace

The Databricks Workspace is your development environment. It contains notebooks, libraries, jobs, and cluster configurations. Teams collaborate inside a single workspace. You organize work using folders just like a file system.

Clusters

A cluster is a group of virtual machines that run your Spark code. You choose the cluster size based on the amount of data you process.

Two types of clusters:

All-Purpose Clusters: Always running — used for development and interactive exploration. You start them manually and they stay available for you to run notebooks on-demand.
Job Clusters: Created automatically when a scheduled job starts, destroyed when the job finishes. Used for production workloads. Cheaper because you pay only for the duration of the job.

Important cost tip: All-purpose clusters left running overnight burn money with nothing to show for it. Always configure auto-termination — set the cluster to shut down after 30 or 60 minutes of inactivity.

Notebooks

A notebook is an interactive coding environment made up of cells. Each cell contains either code or markdown text. You run cells one at a time or all at once. Notebooks are where you write and test transformation logic.

Supported languages per cell: Python, SQL, Scala, R. You can mix languages within one notebook using %sql, %python, %scala magic commands at the top of each cell.

PySpark — Python for Big Data

PySpark is the Python API for Apache Spark. It lets you use Python syntax to manipulate distributed datasets that are too large to fit in a single computer's memory.

The DataFrame

In PySpark, a DataFrame is a distributed table — rows and columns spread across many machines. You apply transformations to a DataFrame, and Spark executes them across all machines in parallel.

Think of processing 1 billion rows. One computer might take 10 hours. A Spark cluster with 20 machines splits the work and completes it in 30 minutes.

Reading Data into a DataFrame

# Read a CSV file from ADLS Gen2
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("abfss://bronze@mystorageaccount.dfs.core.windows.net/sales/2024/")

Common DataFrame Transformations

from pyspark.sql.functions import col, upper, to_date, when, sum

# Select specific columns
df_selected = df.select("order_id", "customer_id", "order_date", "amount", "status")

# Filter rows
df_filtered = df_selected.filter(col("status") == "Completed")

# Add a new column
df_enriched = df_filtered.withColumn("order_year", year(to_date(col("order_date"), "yyyy-MM-dd")))

# Aggregate — total sales per region
df_agg = df_enriched.groupBy("region").agg(sum("amount").alias("total_sales"))

# Write to silver layer as Parquet
df_agg.write.format("parquet") \
    .mode("overwrite") \
    .save("abfss://silver@mystorageaccount.dfs.core.windows.net/sales_by_region/")

The Parquet File Format

Parquet is the standard file format for data lakes and Spark workloads. It stores data in columns rather than rows. This makes it dramatically faster to query because reading a column-based file skips all the columns you did not ask for.

Imagine a spreadsheet with 100 columns. A row-based format reads all 100 columns even if you only needed 5. Parquet reads only those 5 columns. For a file with 500 million rows, this difference saves significant time and cost.

Always write processed data as Parquet in the silver and gold layers. Keep CSV or JSON only in the bronze layer where raw files are stored as received.

Delta Lake — The Gold Standard for Data Lakes

Delta Lake is an open-source storage format that adds ACID transactions to your data lake. ACID means your data is always consistent — even if a pipeline fails halfway through writing.

Why Delta Lake Matters

Imagine writing 10 million rows to a Parquet folder. The pipeline fails after writing 6 million rows. Now your folder has partial data — some from yesterday, some incomplete rows from today. Queries on this data give wrong results.

Delta Lake solves this by using a transaction log. Every write is recorded as a transaction. If the write fails, the transaction is not committed. Your data stays clean and queryable as if the failed write never happened.

Key Delta Lake Features

ACID Transactions: Write operations are all-or-nothing
Schema Enforcement: Rejects data that does not match the expected schema
Time Travel: Query historical versions of your data
Upsert (MERGE): Update existing records and insert new ones in a single operation
Streaming + Batch unified: The same Delta table serves both streaming and batch pipelines

Writing and Reading Delta Tables

# Write a DataFrame as a Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .save("abfss://silver@mystorageaccount.dfs.core.windows.net/sales/")

# Read a Delta table
df_delta = spark.read.format("delta") \
    .load("abfss://silver@mystorageaccount.dfs.core.windows.net/sales/")

# Time travel — read the version from 3 versions ago
df_old = spark.read.format("delta") \
    .option("versionAsOf", 3) \
    .load("abfss://silver@mystorageaccount.dfs.core.windows.net/sales/")

Unity Catalog — Data Governance in Databricks

Unity Catalog is Databricks' centralized governance layer. It manages who can access which tables and columns across the entire Databricks workspace. Think of it as a library system — every table (book) is catalogued, and permissions determine who can borrow which books.

Unity Catalog uses a three-level namespace: catalog.schema.table. For example: analytics.sales.orders.

Databricks Jobs — Running Notebooks in Production

A Databricks Job schedules a notebook or a set of notebooks to run automatically. You configure the cluster size, schedule, and retry logic. When the job runs, Databricks spins up a Job Cluster, executes the notebook, and shuts the cluster down.

Multi-task jobs let you chain notebooks together with dependencies — notebook B starts only after notebook A succeeds. This is how you build production-grade data pipelines entirely within Databricks.

Key Points

Azure Databricks is a managed Spark platform — write Python, SQL, or Scala to process large datasets
Use Job Clusters for production workloads — they start, run, and terminate automatically
Always enable auto-termination on All-Purpose Clusters to avoid wasted spending
Write silver and gold layer data as Delta tables — not plain Parquet — for ACID guarantees and upsert support
Use Time Travel to audit changes or recover from accidental data overwrites
Schedule notebooks as Databricks Jobs for automated, production-grade data pipelines

Previous lesson

Back to course

Next lesson