Databricks SQL
Databricks SQL is the analytics powerhouse of the Databricks platform. It gives data analysts and business intelligence teams a familiar SQL-based interface to query data, build dashboards, and set up automated alerts — without needing to write Python code or manage clusters manually. If you already know SQL from working with databases like MySQL, PostgreSQL, or SQL Server, you can start querying Delta Lake tables in Databricks SQL within minutes.
Databricks SQL is not just a SQL editor. It is a complete analytics environment: query editor, visual chart builder, dashboard creator, and alert system — all in one place, connected to your Delta Lake data.
The Databricks SQL Interface
DATABRICKS SQL COMPONENTS ────────────────────────────────────────────────────────────── ┌─────────────────────────────────────────────────────────┐ │ SQL EDITOR │ │ • Write and run SQL queries │ │ • View results as tables or charts │ │ • Save and share queries │ ├─────────────────────────────────────────────────────────┤ │ QUERY HISTORY │ │ • See all queries you and your team ran │ │ • Re-run or modify previous queries │ │ • Filter by status (success, error, running) │ ├─────────────────────────────────────────────────────────┤ │ DASHBOARDS │ │ • Combine multiple query results into visual reports │ │ • Schedule automatic refresh │ │ • Share with stakeholders via link │ ├─────────────────────────────────────────────────────────┤ │ ALERTS │ │ • Monitor metrics automatically │ │ • Send email or Slack notification when threshold hit │ ├─────────────────────────────────────────────────────────┤ │ SQL WAREHOUSE │ │ • The compute engine that powers SQL queries │ │ • Auto-scales to handle multiple users simultaneously │ └─────────────────────────────────────────────────────────┘
SQL Warehouses – The Engine Behind Databricks SQL
Before writing any query, you connect to a SQL Warehouse. A SQL Warehouse is a dedicated compute cluster optimized specifically for SQL workloads. Unlike regular Databricks clusters (which are configured for Python and Spark code), SQL Warehouses use Photon, Databricks' own high-performance query engine written in C++, to execute SQL much faster.
Photon – Why Databricks SQL Is Fast
Photon is a native vectorized query engine. Instead of processing data one row at a time (row-by-row processing), Photon processes data in large batches called vectors — thousands of rows at once. This approach uses CPU caches more efficiently and can be up to 12 times faster than standard Spark execution for SQL queries.
ROW-BY-ROW VS VECTORIZED PROCESSING ────────────────────────────────────────────────────────────── Row-by-row (Traditional): Loop through 1 billion rows: Row 1: check if city = "Mumbai" → Yes → add to result Row 2: check if city = "Delhi" → No → skip Row 3: check if city = "Mumbai" → Yes → add to result ... (1 billion iterations, each in separate CPU instruction) Vectorized (Photon): Load 10,000 rows into CPU cache as a vector Apply "city = Mumbai" filter to all 10,000 at once using SIMD Repeat with next 10,000 rows ... (100,000 iterations of 10,000 rows each) Speed difference: Vectorized is typically 5–12x faster.
Serverless SQL Warehouses
Serverless SQL Warehouses are the fastest option to start. With a traditional SQL Warehouse, startup time can take 2–5 minutes as cloud VMs spin up. Serverless warehouses start in seconds because Databricks maintains a pool of pre-warmed compute resources. You pay only for the seconds of compute you actually use, billed at per-second granularity.
Writing Queries in the SQL Editor
The SQL Editor is straightforward. Select a catalog and database from the left panel, write your SQL in the center editor, and click Run. Results appear below the editor as an interactive table.
Basic Query Example
-- Find top 10 cities by total revenue in 2024
SELECT
city,
COUNT(DISTINCT customer_id) AS unique_customers,
COUNT(txn_id) AS total_transactions,
ROUND(SUM(amount), 2) AS total_revenue,
ROUND(AVG(amount), 2) AS avg_order_value
FROM main.retail_data.transactions
WHERE YEAR(txn_date) = 2024
AND amount > 0
GROUP BY city
ORDER BY total_revenue DESC
LIMIT 10
RESULT: ────────────────────────────────────────────────────────────────────────────────── city │ unique_customers │total_transactions │ total_revenue │ avg_order_value ──────────┼──────────────────┼───────────────────┼───────────────┼──────────────── Mumbai │ 45,231 │ 128,450 │ 32,112,500.00 │ 249.97 Delhi │ 38,920 │ 110,230 │ 27,557,500.00 │ 250.01 Bengaluru │ 32,441 │ 91,880 │ 22,970,000.00 │ 250.00
Using CTEs for Complex Queries
Common Table Expressions (CTEs) make complex queries readable. Instead of nesting subqueries inside each other (which becomes confusing quickly), CTEs name each logical step and chain them together.
-- Step 1: Find customers who bought more than ₹10,000 total
WITH high_value_customers AS (
SELECT
customer_id,
SUM(amount) AS total_spent
FROM main.retail_data.transactions
WHERE YEAR(txn_date) = 2024
GROUP BY customer_id
HAVING SUM(amount) > 10000
),
-- Step 2: Get their profile information
customer_profiles AS (
SELECT
c.customer_id,
c.name,
c.city,
c.signup_date,
hvc.total_spent
FROM main.retail_data.customers c
INNER JOIN high_value_customers hvc
ON c.customer_id = hvc.customer_id
),
-- Step 3: Classify them by spending tier
classified_customers AS (
SELECT
*,
CASE
WHEN total_spent >= 50000 THEN 'Platinum'
WHEN total_spent >= 25000 THEN 'Gold'
ELSE 'Silver'
END AS tier
FROM customer_profiles
)
-- Final: Show summary by city and tier
SELECT
city,
tier,
COUNT(*) AS customer_count,
ROUND(AVG(total_spent), 2) AS avg_spending
FROM classified_customers
GROUP BY city, tier
ORDER BY city, tier
CTEs make each step of the logic self-explanatory. Another analyst reading this query understands exactly what each block does without needing to trace through nested subqueries.
Window Functions – Advanced Analytics Made Simple
Window functions perform calculations across a set of rows that are related to the current row. They are powerful for ranking, running totals, moving averages, and period-over-period comparisons.
Ranking Customers by Revenue
SELECT
customer_id,
name,
city,
total_spent,
RANK() OVER (
PARTITION BY city
ORDER BY total_spent DESC
) AS rank_in_city,
ROW_NUMBER() OVER (
ORDER BY total_spent DESC
) AS overall_rank
FROM customer_profiles
RESULT EXPLANATION: ───────────────────────────────────────────────────────────── customer │ city │ total_spent │ rank_in_city │ overall_rank ─────────┼────────┼─────────────┼──────────────┼───────────── Priya │ Mumbai │ 85,000 │ 1 │ 1 Arun │ Mumbai │ 72,000 │ 2 │ 3 Deepa │ Delhi │ 78,000 │ 1 │ 2 Kiran │ Delhi │ 45,000 │ 2 │ 6 RANK() OVER (PARTITION BY city): → Ranks each customer within their own city. Both Priya and Deepa are Rank 1 — but in different cities. ROW_NUMBER() OVER (ORDER BY total_spent DESC): → Ranks all customers globally by spending. Priya is #1 overall, Deepa is #2 overall.
Running Total and Moving Average
SELECT
txn_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY txn_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total,
AVG(daily_revenue) OVER (
ORDER BY txn_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS seven_day_avg
FROM daily_revenue_summary
ORDER BY txn_date
RESULT: ────────────────────────────────────────────────────────── date │ daily_revenue │ running_total │ seven_day_avg ───────────┼───────────────┼───────────────┼────────────── 2024-01-01 │ 45,000 │ 45,000 │ 45,000 2024-01-02 │ 52,000 │ 97,000 │ 48,500 2024-01-03 │ 38,000 │ 135,000 │ 45,000 ... │ ... │ ... │ ... 2024-01-07 │ 61,000 │ 331,000 │ 49,571 ← 7-day avg 2024-01-08 │ 55,000 │ 386,000 │ 51,000 ← Rolling: drops Jan 1
Databricks SQL Functions – Built-In Power
Databricks SQL includes hundreds of built-in functions. Here are the most commonly used categories:
Date and Time Functions
SELECT
txn_date,
DATE_FORMAT(txn_date, 'yyyy-MM') AS month,
DAYOFWEEK(txn_date) AS day_of_week,
DATEDIFF(CURRENT_DATE(), txn_date) AS days_ago,
DATE_TRUNC('month', txn_date) AS first_of_month,
ADD_MONTHS(txn_date, 3) AS three_months_later
FROM transactions
String Functions
SELECT
UPPER(city) AS city_upper,
LOWER(TRIM(name)) AS name_clean,
SUBSTRING(phone, 1, 3) AS country_code,
CONCAT(name, ' from ', city) AS label,
REGEXP_REPLACE(email, '@.*', '') AS username,
SPLIT(full_address, ',')[0] AS street
FROM customers
Conditional Logic
SELECT
customer_id,
amount,
CASE
WHEN amount >= 10000 THEN 'Large'
WHEN amount >= 1000 THEN 'Medium'
WHEN amount >= 100 THEN 'Small'
ELSE 'Micro'
END AS order_size,
IIF(amount > 5000, 'Premium', 'Standard') AS customer_type,
COALESCE(phone, email, 'No Contact Info') AS contact
FROM transactions t
JOIN customers c USING (customer_id)
Creating and Using Views
Views are saved SQL queries that behave like tables. They compute results on the fly each time you query them. Views hide complex joins and transformations behind a simple table name, making it easy for analysts to query clean, pre-processed data without knowing all the underlying logic.
CREATE VIEW main.retail_data.monthly_revenue AS
SELECT
DATE_TRUNC('month', txn_date) AS month,
city,
SUM(amount) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_buyers
FROM main.retail_data.transactions
WHERE amount > 0
GROUP BY month, city;
-- Anyone can now query:
SELECT * FROM main.retail_data.monthly_revenue
WHERE city = 'Mumbai'
ORDER BY month
-- They see clean monthly revenue data without knowing
-- how transactions, customers, and filters work underneath.
Materialized Views
Unlike regular views (which re-execute the query each time), materialized views pre-compute results and store them. When you query a materialized view, you get near-instant results because the computation already happened. Databricks automatically refreshes materialized views when the underlying data changes.
Query Parameters – Building Reusable Queries
Query parameters let you add input variables to a SQL query. Instead of hardcoding a city name or date range, you mark a value as a parameter. Users fill in the parameter value before running the query, making it reusable for different inputs.
-- Query with parameters (use double curly braces)
SELECT
city,
ROUND(SUM(amount), 2) AS total_revenue
FROM main.retail_data.transactions
WHERE city = '{{ city_name }}'
AND YEAR(txn_date) = {{ target_year }}
GROUP BY city
-- When running:
-- city_name → "Mumbai"
-- target_year → 2024
-- Result shows Mumbai's 2024 revenue.
-- Change parameters to "Delhi" and 2023 for Delhi's 2023 data.
Building Dashboards from SQL Queries
Every query result in the SQL Editor can become a chart or table on a dashboard. The workflow is simple:
DASHBOARD CREATION WORKFLOW
──────────────────────────────────────────────────────────────
Step 1: Write and run a SQL query in the SQL Editor
↓
Step 2: Click the "+" button next to the results table
Select chart type: Bar, Line, Pie, Area, Counter...
↓
Step 3: Configure chart: which column is the X axis,
which is the Y axis, grouping, colors, labels
↓
Step 4: Click "Add to Dashboard"
Select an existing dashboard or create new one
↓
Step 5: Arrange widgets on the dashboard by dragging
Add text boxes, images, or section headers
↓
Step 6: Set refresh schedule: every hour, daily, weekly
↓
Step 7: Share the dashboard link with your team
They see live, auto-refreshed charts without needing
to log into the SQL Editor themselves
Setting Up Alerts
Alerts monitor a query result and send a notification when a condition is met. You set up an alert once, and Databricks checks the condition on a schedule.
ALERT EXAMPLE: Detect Revenue Drop
──────────────────────────────────────────────────────────────
Query:
SELECT SUM(amount) AS today_revenue
FROM transactions
WHERE DATE(txn_date) = CURRENT_DATE()
Alert condition:
today_revenue < 100000
Notification:
Send email to: analytics-team@company.com
Message: "Daily revenue dropped below ₹1,00,000.
Check for data pipeline issues."
Schedule: Check every 30 minutes.
Alerts integrate with email, Slack, and PagerDuty. A data pipeline failure that stops new data from loading will trigger revenue alerts long before any human notices the dashboard shows stale data.
Key Points
- Databricks SQL provides a complete analytics environment: SQL editor, dashboard builder, alert system, and query history in one interface.
- SQL Warehouses power queries with Photon, a vectorized engine that processes thousands of rows at once and runs SQL up to 12 times faster than standard Spark.
- Serverless SQL Warehouses start in seconds and charge by the second, making them ideal for intermittent or unpredictable query workloads.
- CTEs make complex multi-step queries readable by naming each logical step clearly.
- Window functions enable ranking, running totals, and moving averages over related rows without subqueries.
- Views hide complex SQL logic behind simple table names; materialized views pre-compute results for instant query response.
- Query parameters make saved queries reusable for different inputs, turning a one-off query into a self-service report.
- Alerts monitor query results on a schedule and send notifications when metrics cross defined thresholds.
