Databricks SQL

Databricks SQL is the analytics powerhouse of the Databricks platform. It gives data analysts and business intelligence teams a familiar SQL-based interface to query data, build dashboards, and set up automated alerts — without needing to write Python code or manage clusters manually. If you already know SQL from working with databases like MySQL, PostgreSQL, or SQL Server, you can start querying Delta Lake tables in Databricks SQL within minutes.

Databricks SQL is not just a SQL editor. It is a complete analytics environment: query editor, visual chart builder, dashboard creator, and alert system — all in one place, connected to your Delta Lake data.

The Databricks SQL Interface

DATABRICKS SQL COMPONENTS
──────────────────────────────────────────────────────────────
┌─────────────────────────────────────────────────────────┐
│  SQL EDITOR                                             │
│  • Write and run SQL queries                            │
│  • View results as tables or charts                     │
│  • Save and share queries                               │
├─────────────────────────────────────────────────────────┤
│  QUERY HISTORY                                          │
│  • See all queries you and your team ran                │
│  • Re-run or modify previous queries                    │
│  • Filter by status (success, error, running)           │
├─────────────────────────────────────────────────────────┤
│  DASHBOARDS                                             │
│  • Combine multiple query results into visual reports   │
│  • Schedule automatic refresh                           │
│  • Share with stakeholders via link                     │
├─────────────────────────────────────────────────────────┤
│  ALERTS                                                 │
│  • Monitor metrics automatically                        │
│  • Send email or Slack notification when threshold hit  │
├─────────────────────────────────────────────────────────┤
│  SQL WAREHOUSE                                          │
│  • The compute engine that powers SQL queries           │
│  • Auto-scales to handle multiple users simultaneously  │
└─────────────────────────────────────────────────────────┘

SQL Warehouses – The Engine Behind Databricks SQL

Before writing any query, you connect to a SQL Warehouse. A SQL Warehouse is a dedicated compute cluster optimized specifically for SQL workloads. Unlike regular Databricks clusters (which are configured for Python and Spark code), SQL Warehouses use Photon, Databricks' own high-performance query engine written in C++, to execute SQL much faster.

Photon – Why Databricks SQL Is Fast

Photon is a native vectorized query engine. Instead of processing data one row at a time (row-by-row processing), Photon processes data in large batches called vectors — thousands of rows at once. This approach uses CPU caches more efficiently and can be up to 12 times faster than standard Spark execution for SQL queries.

ROW-BY-ROW VS VECTORIZED PROCESSING
──────────────────────────────────────────────────────────────
Row-by-row (Traditional):
Loop through 1 billion rows:
  Row 1: check if city = "Mumbai" → Yes → add to result
  Row 2: check if city = "Delhi"  → No  → skip
  Row 3: check if city = "Mumbai" → Yes → add to result
  ... (1 billion iterations, each in separate CPU instruction)

Vectorized (Photon):
Load 10,000 rows into CPU cache as a vector
Apply "city = Mumbai" filter to all 10,000 at once using SIMD
Repeat with next 10,000 rows
... (100,000 iterations of 10,000 rows each)

Speed difference: Vectorized is typically 5–12x faster.

Serverless SQL Warehouses

Serverless SQL Warehouses are the fastest option to start. With a traditional SQL Warehouse, startup time can take 2–5 minutes as cloud VMs spin up. Serverless warehouses start in seconds because Databricks maintains a pool of pre-warmed compute resources. You pay only for the seconds of compute you actually use, billed at per-second granularity.

Writing Queries in the SQL Editor

The SQL Editor is straightforward. Select a catalog and database from the left panel, write your SQL in the center editor, and click Run. Results appear below the editor as an interactive table.

Basic Query Example

-- Find top 10 cities by total revenue in 2024
SELECT
    city,
    COUNT(DISTINCT customer_id)   AS unique_customers,
    COUNT(txn_id)                 AS total_transactions,
    ROUND(SUM(amount), 2)         AS total_revenue,
    ROUND(AVG(amount), 2)         AS avg_order_value
FROM main.retail_data.transactions
WHERE YEAR(txn_date) = 2024
  AND amount > 0
GROUP BY city
ORDER BY total_revenue DESC
LIMIT 10
RESULT:
──────────────────────────────────────────────────────────────────────────────────
city      │ unique_customers │total_transactions │ total_revenue │ avg_order_value
──────────┼──────────────────┼───────────────────┼───────────────┼────────────────
Mumbai    │ 45,231           │ 128,450           │ 32,112,500.00 │ 249.97
Delhi     │ 38,920           │ 110,230           │ 27,557,500.00 │ 250.01
Bengaluru │ 32,441           │ 91,880            │ 22,970,000.00 │ 250.00

Using CTEs for Complex Queries

Common Table Expressions (CTEs) make complex queries readable. Instead of nesting subqueries inside each other (which becomes confusing quickly), CTEs name each logical step and chain them together.

-- Step 1: Find customers who bought more than ₹10,000 total
WITH high_value_customers AS (
    SELECT
        customer_id,
        SUM(amount) AS total_spent
    FROM main.retail_data.transactions
    WHERE YEAR(txn_date) = 2024
    GROUP BY customer_id
    HAVING SUM(amount) > 10000
),

-- Step 2: Get their profile information
customer_profiles AS (
    SELECT
        c.customer_id,
        c.name,
        c.city,
        c.signup_date,
        hvc.total_spent
    FROM main.retail_data.customers c
    INNER JOIN high_value_customers hvc
        ON c.customer_id = hvc.customer_id
),

-- Step 3: Classify them by spending tier
classified_customers AS (
    SELECT
        *,
        CASE
            WHEN total_spent >= 50000 THEN 'Platinum'
            WHEN total_spent >= 25000 THEN 'Gold'
            ELSE 'Silver'
        END AS tier
    FROM customer_profiles
)

-- Final: Show summary by city and tier
SELECT
    city,
    tier,
    COUNT(*) AS customer_count,
    ROUND(AVG(total_spent), 2) AS avg_spending
FROM classified_customers
GROUP BY city, tier
ORDER BY city, tier

CTEs make each step of the logic self-explanatory. Another analyst reading this query understands exactly what each block does without needing to trace through nested subqueries.

Window Functions – Advanced Analytics Made Simple

Window functions perform calculations across a set of rows that are related to the current row. They are powerful for ranking, running totals, moving averages, and period-over-period comparisons.

Ranking Customers by Revenue

SELECT
    customer_id,
    name,
    city,
    total_spent,
    RANK() OVER (
        PARTITION BY city
        ORDER BY total_spent DESC
    ) AS rank_in_city,
    ROW_NUMBER() OVER (
        ORDER BY total_spent DESC
    ) AS overall_rank
FROM customer_profiles
RESULT EXPLANATION:
─────────────────────────────────────────────────────────────
customer │ city   │ total_spent │ rank_in_city │ overall_rank
─────────┼────────┼─────────────┼──────────────┼─────────────
Priya    │ Mumbai │ 85,000      │ 1            │ 1
Arun     │ Mumbai │ 72,000      │ 2            │ 3
Deepa    │ Delhi  │ 78,000      │ 1            │ 2
Kiran    │ Delhi  │ 45,000      │ 2            │ 6

RANK() OVER (PARTITION BY city):
→ Ranks each customer within their own city.
  Both Priya and Deepa are Rank 1 — but in different cities.

ROW_NUMBER() OVER (ORDER BY total_spent DESC):
→ Ranks all customers globally by spending.
  Priya is #1 overall, Deepa is #2 overall.

Running Total and Moving Average

SELECT
    txn_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY txn_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    AVG(daily_revenue) OVER (
        ORDER BY txn_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS seven_day_avg
FROM daily_revenue_summary
ORDER BY txn_date
RESULT:
──────────────────────────────────────────────────────────
date       │ daily_revenue │ running_total │ seven_day_avg
───────────┼───────────────┼───────────────┼──────────────
2024-01-01 │ 45,000        │ 45,000        │ 45,000
2024-01-02 │ 52,000        │ 97,000        │ 48,500
2024-01-03 │ 38,000        │ 135,000       │ 45,000
...        │ ...           │ ...           │ ...
2024-01-07 │ 61,000        │ 331,000       │ 49,571  ← 7-day avg
2024-01-08 │ 55,000        │ 386,000       │ 51,000  ← Rolling: drops Jan 1

Databricks SQL Functions – Built-In Power

Databricks SQL includes hundreds of built-in functions. Here are the most commonly used categories:

Date and Time Functions

SELECT
    txn_date,
    DATE_FORMAT(txn_date, 'yyyy-MM')          AS month,
    DAYOFWEEK(txn_date)                       AS day_of_week,
    DATEDIFF(CURRENT_DATE(), txn_date)        AS days_ago,
    DATE_TRUNC('month', txn_date)             AS first_of_month,
    ADD_MONTHS(txn_date, 3)                   AS three_months_later
FROM transactions

String Functions

SELECT
    UPPER(city)                               AS city_upper,
    LOWER(TRIM(name))                         AS name_clean,
    SUBSTRING(phone, 1, 3)                    AS country_code,
    CONCAT(name, ' from ', city)              AS label,
    REGEXP_REPLACE(email, '@.*', '')          AS username,
    SPLIT(full_address, ',')[0]               AS street
FROM customers

Conditional Logic

SELECT
    customer_id,
    amount,
    CASE
        WHEN amount >= 10000 THEN 'Large'
        WHEN amount >= 1000  THEN 'Medium'
        WHEN amount >= 100   THEN 'Small'
        ELSE 'Micro'
    END AS order_size,
    IIF(amount > 5000, 'Premium', 'Standard')  AS customer_type,
    COALESCE(phone, email, 'No Contact Info')  AS contact
FROM transactions t
JOIN customers c USING (customer_id)

Creating and Using Views

Views are saved SQL queries that behave like tables. They compute results on the fly each time you query them. Views hide complex joins and transformations behind a simple table name, making it easy for analysts to query clean, pre-processed data without knowing all the underlying logic.

CREATE VIEW main.retail_data.monthly_revenue AS
SELECT
    DATE_TRUNC('month', txn_date)     AS month,
    city,
    SUM(amount)                       AS total_revenue,
    COUNT(DISTINCT customer_id)       AS unique_buyers
FROM main.retail_data.transactions
WHERE amount > 0
GROUP BY month, city;

-- Anyone can now query:
SELECT * FROM main.retail_data.monthly_revenue
WHERE city = 'Mumbai'
ORDER BY month

-- They see clean monthly revenue data without knowing
-- how transactions, customers, and filters work underneath.

Materialized Views

Unlike regular views (which re-execute the query each time), materialized views pre-compute results and store them. When you query a materialized view, you get near-instant results because the computation already happened. Databricks automatically refreshes materialized views when the underlying data changes.

Query Parameters – Building Reusable Queries

Query parameters let you add input variables to a SQL query. Instead of hardcoding a city name or date range, you mark a value as a parameter. Users fill in the parameter value before running the query, making it reusable for different inputs.

-- Query with parameters (use double curly braces)
SELECT
    city,
    ROUND(SUM(amount), 2) AS total_revenue
FROM main.retail_data.transactions
WHERE city = '{{ city_name }}'
  AND YEAR(txn_date) = {{ target_year }}
GROUP BY city

-- When running:
-- city_name → "Mumbai"
-- target_year → 2024

-- Result shows Mumbai's 2024 revenue.
-- Change parameters to "Delhi" and 2023 for Delhi's 2023 data.

Building Dashboards from SQL Queries

Every query result in the SQL Editor can become a chart or table on a dashboard. The workflow is simple:

DASHBOARD CREATION WORKFLOW
──────────────────────────────────────────────────────────────
Step 1: Write and run a SQL query in the SQL Editor
        ↓
Step 2: Click the "+" button next to the results table
        Select chart type: Bar, Line, Pie, Area, Counter...
        ↓
Step 3: Configure chart: which column is the X axis,
        which is the Y axis, grouping, colors, labels
        ↓
Step 4: Click "Add to Dashboard"
        Select an existing dashboard or create new one
        ↓
Step 5: Arrange widgets on the dashboard by dragging
        Add text boxes, images, or section headers
        ↓
Step 6: Set refresh schedule: every hour, daily, weekly
        ↓
Step 7: Share the dashboard link with your team
        They see live, auto-refreshed charts without needing
        to log into the SQL Editor themselves

Setting Up Alerts

Alerts monitor a query result and send a notification when a condition is met. You set up an alert once, and Databricks checks the condition on a schedule.

ALERT EXAMPLE: Detect Revenue Drop
──────────────────────────────────────────────────────────────
Query:
  SELECT SUM(amount) AS today_revenue
  FROM transactions
  WHERE DATE(txn_date) = CURRENT_DATE()

Alert condition:
  today_revenue < 100000

Notification:
  Send email to: analytics-team@company.com
  Message: "Daily revenue dropped below ₹1,00,000.
             Check for data pipeline issues."

Schedule: Check every 30 minutes.

Alerts integrate with email, Slack, and PagerDuty. A data pipeline failure that stops new data from loading will trigger revenue alerts long before any human notices the dashboard shows stale data.

Key Points

  • Databricks SQL provides a complete analytics environment: SQL editor, dashboard builder, alert system, and query history in one interface.
  • SQL Warehouses power queries with Photon, a vectorized engine that processes thousands of rows at once and runs SQL up to 12 times faster than standard Spark.
  • Serverless SQL Warehouses start in seconds and charge by the second, making them ideal for intermittent or unpredictable query workloads.
  • CTEs make complex multi-step queries readable by naming each logical step clearly.
  • Window functions enable ranking, running totals, and moving averages over related rows without subqueries.
  • Views hide complex SQL logic behind simple table names; materialized views pre-compute results for instant query response.
  • Query parameters make saved queries reusable for different inputs, turning a one-off query into a self-service report.
  • Alerts monitor query results on a schedule and send notifications when metrics cross defined thresholds.

Leave a Comment