Databricks Notebooks

A Databricks Notebook is an interactive document where you write code, run it immediately, and see the results displayed directly below each piece of code. It is like a lab journal: you write your hypothesis (the code), run the experiment (execute the cell), and record the results (the output appears below). Keeping code and results on one page is what makes notebooks such a widely used tool for data exploration and analysis.

Databricks Notebooks support four programming languages: Python, SQL, Scala, and R. You choose a default language when creating the notebook, but you can switch languages cell by cell using special commands called magic commands. This flexibility means a data engineer can write Python data processing code in the same notebook where an analyst writes SQL queries, keeping related work together.

Anatomy of a Databricks Notebook

NOTEBOOK STRUCTURE
──────────────────────────────────────────────
┌──────────────────────────────────────────┐
│  Notebook Title: sales_analysis          │
│  Language: Python   Cluster: dev-cluster │
├──────────────────────────────────────────┤
│  CELL 1 (Code)                           │
│  df = spark.read.csv("/data/sales.csv")  │
│  df.show(5)                              │
│  ─────────────────────────────────────   │
│  OUTPUT:                                 │
│  | id | product | amount | date      |   │
│  | 1  | Widget  | 250.00 | 2024-01-01|   │
│  | 2  | Gadget  | 180.00 | 2024-01-01|   │
├──────────────────────────────────────────┤
│  CELL 2 (Markdown)                       │
│  ## Sales by Product Category            │
│  This section summarizes monthly sales   │
│  grouped by product type.                │
├──────────────────────────────────────────┤
│  CELL 3 (Code)                           │
│  summary = df.groupBy("product")         │
│             .sum("amount")               │
│             .orderBy("sum(amount)", ...  │
│  display(summary)                        │
│  ─────────────────────────────────────   │
│  OUTPUT: [Bar Chart of sales by product] │
└──────────────────────────────────────────┘

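Each cell runs on its own, but all cells share the same execution environment. A variable you create in Cell 1 is available in Cell 2 and every later cell, provided Cell 1 has already been run in the current session. In mixed-language notebooks, variables are shared only between cells that use the same language. This shared state makes it easy to build up an analysis step by step.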
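
The shared-state behavior can be pictured with plain Python. The sketch below is only an analogy, not how Databricks is implemented: each "cell" is a string executed against one shared namespace dictionary. The names run_cell and namespace are made up for illustration.

```python
# Analogy only: model notebook cells as code strings that share one namespace.
namespace = {}

def run_cell(source: str) -> None:
    """Execute one 'cell' against the shared namespace."""
    exec(source, namespace)

cell_1 = "amounts = [250.0, 180.0]"
cell_2 = "total = sum(amounts)"

# Running Cell 2 before Cell 1 fails because `amounts` does not exist yet.
try:
    run_cell(cell_2)
    out_of_order_error = None
except NameError as err:
    out_of_order_error = str(err)

# Running the cells in order works: Cell 2 sees the variable from Cell 1.
run_cell(cell_1)
run_cell(cell_2)
print(namespace["total"])  # 430.0
```

The first attempt raises a NameError, which is the same error you see in a notebook when you run a cell before the cell that defines its variables.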

Cell Types

Code Cells

Code cells contain executable code in your notebook's default language (or a different language if you use a magic command). When you run a code cell, Databricks sends the code to the attached cluster, the cluster executes it using Apache Spark, and the result appears below the cell. Results can be text output, tables, charts, error messages, or nothing at all if the code produces no visible output.

Markdown Cells

Markdown cells contain formatted text rather than executable code. You use them to add titles, descriptions, tables, bullet points, and explanations between code cells. This is how you turn a notebook from a collection of raw code into a readable, well-documented analysis that anyone can follow.

To create a Markdown cell, type %md at the top of the cell. Then write regular Markdown syntax:

%md
## Monthly Sales Analysis

This notebook examines sales performance across all product categories
for Q1 2024. Key findings:
- Widget sales grew **32%** compared to Q1 2023
- Gadget returns decreased by 15%

Magic Commands – Switching Languages Mid-Notebook

Magic commands are special instructions that change how a single cell behaves. They always start with the % symbol.

MAGIC COMMANDS IN DATABRICKS
─────────────────────────────────────────
%python   → Run this cell in Python (even if notebook default is SQL)
%sql      → Run this cell in SQL
%scala    → Run this cell in Scala
%r        → Run this cell in R
%md       → Render this cell as Markdown text
%sh       → Run this cell as a shell (bash) command
%fs       → Access the Databricks File System (DBFS)
%run      → Run another notebook and import its variables
%pip      → Install Python packages in the current session

Using %sql in a Python Notebook

In a Python notebook, you can run SQL in a specific cell by starting it with %sql:

%sql
SELECT product, SUM(amount) AS total_sales
FROM retail_data.transactions
WHERE year(txn_date) = 2024
GROUP BY product
ORDER BY total_sales DESC

The SQL result appears as an interactive table directly below the cell.
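
The same query can also be issued from a Python cell with spark.sql(), which returns a DataFrame you can keep working with. This is a minimal sketch, assuming the spark session that Databricks provides in every notebook. The function name total_sales_by_product is hypothetical.

```python
def total_sales_by_product(spark, year: int):
    """Run the query above from Python and return the result as a DataFrame.

    `spark` is the SparkSession that Databricks notebooks provide automatically.
    """
    year = int(year)  # coerce to int so only a number is interpolated into the SQL
    query = f"""
        SELECT product, SUM(amount) AS total_sales
        FROM retail_data.transactions
        WHERE year(txn_date) = {year}
        GROUP BY product
        ORDER BY total_sales DESC
    """
    return spark.sql(query)
```

In a notebook cell you would call it as display(total_sales_by_product(spark, 2024)).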

Using %run to Import Another Notebook

The %run command runs another notebook and makes its variables, functions, and classes available in the current notebook. This is useful for storing reusable utility functions in one shared notebook and importing them wherever needed.

NOTEBOOK A: utility_functions
─────────────────────────────
def clean_currency(value):
    return float(str(value).replace(",", "").replace("₹", ""))

─────────────────────────────

NOTEBOOK B: sales_analysis
─────────────────────────────
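%run ./utility_functions

# (Next cell: %run must be in a cell by itself.)
# clean_currency is now available here.
# This line assumes df is a pandas DataFrame; a Spark DataFrame does not support this syntax.
df["amount_clean"] = df["amount"].apply(clean_currency)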
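
To check what clean_currency does before using it on real data, you can call it on a few sample values. The sketch repeats the function from Notebook A so that it runs on its own. The sample values are made up.

```python
def clean_currency(value):
    # Strip thousands separators and the rupee sign, then convert to float.
    return float(str(value).replace(",", "").replace("₹", ""))

samples = ["₹1,250.50", "3,200", 180]
cleaned = [clean_currency(v) for v in samples]
print(cleaned)  # [1250.5, 3200.0, 180.0]
```

For a Spark DataFrame, the usual route would be a column expression (for example regexp_replace followed by a cast) or a UDF rather than .apply.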

The display() Function – Seeing Data as Tables and Charts

The display() function is one of the most useful tools in Databricks notebooks. When you pass a Spark DataFrame or Pandas DataFrame to display(), it renders the data as a rich interactive table with sorting, filtering, and charting capabilities.

CODE:
display(sales_df)

OUTPUT (Interactive Table):
──────────────────────────────────────────
product    │ region │ amount │ date
───────────┼────────┼────────┼───────────
Widget A   │ North  │  4500  │ 2024-01-15
Widget B   │ South  │  3200  │ 2024-01-16
Gadget Pro │ East   │  6700  │ 2024-01-17
[📊 Chart] [⬇ Download] [🔍 Filter]
──────────────────────────────────────────

Click the chart icon below the table to instantly turn the data into a bar chart, line chart, scatter plot, histogram, or pie chart. This visual exploration requires zero additional code.

Working with Data in Notebooks – Step-by-Step Example

Here is a complete walkthrough of a typical notebook workflow analyzing retail sales data.

Step 1: Read the Data

# Read a CSV file from cloud storage
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("dbfs:/mnt/retail/sales_2024.csv")

print(f"Total records: {df.count()}")
print(f"Columns: {df.columns}")

Step 2: Explore the Schema

df.printSchema()

# OUTPUT:
# root
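#  |-- transaction_id: integer (nullable = true)
#  |-- customer_id: integer (nullable = true)
#  |-- product: string (nullable = true)
#  |-- amount: double (nullable = true)
#  |-- city: string (nullable = true)
#  |-- date: timestamp (nullable = true)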
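
With inferSchema, Spark has to scan the file to guess the column types. When the columns are known in advance, a schema can be supplied instead. The sketch below works under that assumption and uses the column names and types from the printSchema() output above. SALES_SCHEMA_DDL and read_sales are hypothetical names, and spark is the notebook's session.

```python
# Column names and types taken from the printSchema() output above.
SALES_COLUMNS = [
    ("transaction_id", "INT"),
    ("customer_id", "INT"),
    ("product", "STRING"),
    ("amount", "DOUBLE"),
    ("city", "STRING"),
    ("date", "TIMESTAMP"),
]

# DataFrameReader.schema() accepts a DDL-formatted string such as "a INT, b STRING".
# Backticks quote the column names so that a name like `date` is not read as a type.
SALES_SCHEMA_DDL = ", ".join(f"`{name}` {dtype}" for name, dtype in SALES_COLUMNS)

def read_sales(spark, path: str):
    """Read the sales CSV with an explicit schema instead of inferSchema."""
    return (
        spark.read
        .option("header", "true")
        .schema(SALES_SCHEMA_DDL)
        .csv(path)
    )

print(SALES_SCHEMA_DDL)
```

In a notebook, df = read_sales(spark, "dbfs:/mnt/retail/sales_2024.csv") would replace the read in Step 1.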

Step 3: Clean the Data

from pyspark.sql.functions import col, upper, trim

df_clean = df \
    .dropna(subset=["customer_id", "amount"]) \
    .withColumn("city", upper(trim(col("city")))) \
    .filter(col("amount") > 0)

print(f"Records after cleaning: {df_clean.count()}")

Step 4: Analyze the Data

# Import under an alias so Spark's sum and round do not shadow Python's built-ins
from pyspark.sql import functions as F

city_summary = df_clean \
    .groupBy("city") \
    .agg(
        F.count("transaction_id").alias("num_transactions"),
        F.round(F.sum("amount"), 2).alias("total_revenue"),
        F.round(F.avg("amount"), 2).alias("avg_order_value")
    ) \
    .orderBy("total_revenue", ascending=False)

display(city_summary)

Step 5: Save the Result

city_summary.write \
    .mode("overwrite") \
    .saveAsTable("retail_data.city_revenue_summary")

print("Table saved successfully!")

Running these five cells in order takes you from a raw CSV file to a clean, aggregated, saved table. Small datasets typically finish within a couple of minutes; for datasets with millions of rows, the run time depends on the data volume and the size of the cluster.
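
To make the logic of Steps 3 and 4 concrete, the sketch below applies the same rules to a handful of made-up rows in plain Python, with no Spark involved: drop rows missing customer_id or amount, normalize the city, keep positive amounts, then aggregate per city. It only illustrates what the Spark code computes and is not a replacement for it.

```python
from collections import defaultdict

# Hypothetical sample rows; None stands in for a missing (null) value.
rows = [
    {"transaction_id": 1, "customer_id": 10, "amount": 250.0, "city": " mumbai "},
    {"transaction_id": 2, "customer_id": 11, "amount": 180.0, "city": "Mumbai"},
    {"transaction_id": 3, "customer_id": None, "amount": 90.0, "city": "Pune"},
    {"transaction_id": 4, "customer_id": 12, "amount": -5.0, "city": "Pune"},
    {"transaction_id": 5, "customer_id": 13, "amount": 400.0, "city": "pune"},
]

# Step 3 equivalent: dropna on customer_id/amount, upper(trim(city)), amount > 0.
clean = [
    {**r, "city": r["city"].strip().upper()}
    for r in rows
    if r["customer_id"] is not None and r["amount"] is not None and r["amount"] > 0
]

# Step 4 equivalent: group by city, then count, sum, and average the amounts.
amounts_by_city = defaultdict(list)
for r in clean:
    amounts_by_city[r["city"]].append(r["amount"])

summary = sorted(
    (
        {
            "city": city,
            "num_transactions": len(vals),
            "total_revenue": round(sum(vals), 2),
            "avg_order_value": round(sum(vals) / len(vals), 2),
        }
        for city, vals in amounts_by_city.items()
    ),
    key=lambda row: row["total_revenue"],
    reverse=True,
)

for row in summary:
    print(row)
```

Rows 3 and 4 are dropped by the cleaning rules, and the three spellings of each city collapse into MUMBAI (2 transactions, 430.0 total) and PUNE (1 transaction, 400.0 total).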

Notebook Collaboration Features

Real-Time Co-Authoring

Multiple people can open and edit the same notebook simultaneously, similar to Google Docs. Each user's cursor appears in a different color, and changes made by one user appear immediately in other users' browsers. This feature works best for pair programming or when a data engineer and data analyst review a notebook together.

Comments

You can add comments to specific cells or specific lines of code. Right-click a cell and choose Add Comment. Comments work like code review comments — your colleague writes a question, you reply, and you mark the thread as resolved when done. Comments stay attached to the specific cell or code line they reference.

Version History

Databricks automatically saves a version of your notebook every time it changes. To view the history, click the clock icon in the top-right toolbar. You see a list of all saved versions with timestamps and the username of who made each change. Click any version to see what the notebook looked like at that point in time, and click Restore to roll back to that version.

Running Notebooks as Jobs

A notebook you build interactively can be run on a schedule without any code changes. Click the Schedule button at the top of the notebook, set the frequency (daily at 6 AM, every hour, every Monday), and choose a cluster. Databricks creates a job that runs the notebook automatically at the specified time.

NOTEBOOK → JOB CONVERSION
─────────────────────────────────────
Interactive Notebook
    │
    │ Click "Schedule"
    ▼
Configure Job:
    • Name: daily_sales_refresh
    • Schedule: Every day at 6:00 AM IST
    • Cluster: Job Cluster (auto-creates, auto-terminates)
    • Notification: Email on failure
    │
    ▼
Job runs automatically every morning
Results saved to Delta table
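
The same schedule can also be described as data, for example when a job is created through the Databricks Jobs REST API instead of the UI. The dictionary below is a sketch of such a payload, assuming the Jobs API 2.1 field names. The notebook path and email address are placeholders, and cluster settings are omitted. Databricks schedules use Quartz cron syntax, where 0 0 6 * * ? means 06:00:00 every day.

```python
import json

# Sketch of a job definition; field names assume the Jobs API 2.1 format.
job_settings = {
    "name": "daily_sales_refresh",
    "schedule": {
        # Quartz cron fields: second minute hour day-of-month month day-of-week
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "Asia/Kolkata",  # IST
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "run_sales_notebook",
            # Placeholder path: replace with your notebook's workspace path.
            "notebook_task": {
                "notebook_path": "/Workspace/Users/you@example.com/sales_analysis"
            },
        }
    ],
    # Placeholder address for the failure notification.
    "email_notifications": {"on_failure": ["you@example.com"]},
}

print(json.dumps(job_settings, indent=2))
```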

Installing Libraries in Notebooks

Sometimes you need a Python package that is not pre-installed in the Databricks Runtime. Install it using the %pip magic command:

%pip install pandas-profiling plotly-express faker

# After installation, restart Python (run this in a separate cell):
dbutils.library.restartPython()

Libraries installed this way are available for the rest of your notebook session. For permanent library installation across all sessions, add the library to the cluster's Libraries tab instead.
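
Before installing, it can help to check whether a package is already available in the current environment. This is a small sketch using only the standard library. The helper name missing_packages is hypothetical, and note that a package's import name can differ from its pip name.

```python
import importlib.util

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

# "json" ships with Python; the second name is deliberately nonexistent.
print(missing_packages(["json", "surely_not_a_real_package_xyz"]))
```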

dbutils – The Databricks Utility Toolkit

The dbutils object provides utility functions for working with files, secrets, notebooks, and the Databricks environment. It is available automatically in every Python and Scala notebook without any import.

DBUTILS FUNCTIONS
──────────────────────────────────────────────────
dbutils.fs.ls("dbfs:/mnt/data/")     → List files
dbutils.fs.cp("src_path", "dst_path") → Copy file
dbutils.fs.rm("path", recurse=True)  → Delete file/folder

dbutils.secrets.get(scope, key)       → Read a secret securely

dbutils.notebook.run("notebook_path", 300, {...})
                                      → Run another notebook from code (path, timeout in seconds, arguments)

dbutils.widgets.text("city", "Mumbai") → Create an input widget
city = dbutils.widgets.get("city")    → Read widget value
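
dbutils.notebook.run raises an exception if the child notebook fails or exceeds its timeout, so callers often wrap it in a retry loop. The sketch below shows that pattern in minimal form. The runner is passed in as a parameter so that the function can be tried outside Databricks; inside a notebook you would pass dbutils.notebook.run. The function name run_with_retry is hypothetical.

```python
def run_with_retry(run_notebook, path, timeout_seconds, arguments=None, max_retries=3):
    """Call run_notebook(path, timeout_seconds, arguments), retrying on failure.

    In a Databricks notebook, pass dbutils.notebook.run as run_notebook.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    arguments = arguments or {}
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return run_notebook(path, timeout_seconds, arguments)
        except Exception as err:  # the child notebook failed or timed out
            last_error = err
            print(f"Attempt {attempt} of {max_retries} failed: {err}")
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A call in a notebook might look like run_with_retry(dbutils.notebook.run, "./child_notebook", 300, {"city": "Mumbai"}), where ./child_notebook is a placeholder path.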

Widgets – Making Notebooks Interactive

Widgets add input controls to your notebook: text boxes, dropdowns, comboboxes, and multi-select lists. They turn a static notebook into a configurable report that different users can interact with.

NOTEBOOK WITH WIDGETS
──────────────────────────────────────────────
┌──────────────────────────────────────────┐
│  [City: Mumbai ▼]  [Year: 2024 ▼]        │  ← Widget controls at top
├──────────────────────────────────────────┤
│  # Code reads widget values              │
│  city = dbutils.widgets.get("city")      │
│  yr = int(dbutils.widgets.get("year"))   │
│                                          │
│  df = sales_df.filter(                   │
│      (col("city") == city) &             │
│      (year(col("date")) == yr)           │
│  )                                       │
│  display(df)                             │
├──────────────────────────────────────────┤
│  OUTPUT: Filtered results update         │
│  automatically when you change           │
│  the widget values above.                │
└──────────────────────────────────────────┘
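
Widget values always come back as strings, so it is worth converting and validating them before they reach a filter. The sketch below shows that step. get_value stands in for dbutils.widgets.get so the code can run anywhere. The helper name read_report_params and the allowed year range are assumptions for illustration.

```python
def read_report_params(get_value, min_year=2000, max_year=2100):
    """Read and validate the 'city' and 'year' widget values.

    In a Databricks notebook, pass dbutils.widgets.get as get_value.
    """
    city = get_value("city").strip()
    if not city:
        raise ValueError("city must not be empty")

    raw_year = get_value("year")
    try:
        selected_year = int(raw_year)
    except ValueError:
        raise ValueError(f"year must be a whole number, got {raw_year!r}") from None
    if not min_year <= selected_year <= max_year:
        raise ValueError(f"year must be between {min_year} and {max_year}")

    return city, selected_year

# Stand-in for the widget panel: a dict lookup instead of dbutils.widgets.get.
fake_widgets = {"city": " Mumbai ", "year": "2024"}
print(read_report_params(fake_widgets.__getitem__))  # ('Mumbai', 2024)
```

Storing the parsed value under a name such as selected_year also keeps it from shadowing the year() function from pyspark.sql.functions.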

Key Points

Databricks Notebooks are interactive documents combining code, results, and explanatory text in a single page.
They support Python, SQL, Scala, and R, with magic commands allowing language switching per cell.
The display() function renders DataFrames as interactive tables with built-in charting capabilities.
Multiple users can co-author a notebook simultaneously, with comments and version history for collaboration.
Any notebook can be scheduled as a job to run automatically without code changes.
dbutils provides file management, secret reading, and notebook chaining utilities available in every session.
Widgets add configurable input controls, turning static notebooks into interactive reports.
