Databricks Workflows

Databricks Workflows is the built-in job scheduling and orchestration system. It lets you run notebooks, Python scripts, Delta Live Tables pipelines, dbt projects, and SQL queries automatically on a schedule or in response to external events. A Workflow is a collection of tasks that execute in a defined order, with dependencies, conditions, and error handling built in.

Before Workflows, teams used external orchestration tools like Apache Airflow, Azure Data Factory, or AWS Glue to schedule Databricks jobs. While those tools still work with Databricks, Databricks Workflows eliminates the need for a separate orchestration service for most use cases, reducing complexity and cost.

Think of Databricks Workflows like a factory assembly line. Each station (task) performs a specific job. The stations are arranged in order: station A must finish before station B starts. Some stations run in parallel. If any station fails, the line stops and a supervisor (the alert system) is notified. Everything is automated and monitored.

Anatomy of a Databricks Workflow

WORKFLOW STRUCTURE
──────────────────────────────────────────────────────────────
JOB: daily_retail_pipeline
  │
  ├── TASK 1: ingest_orders          (Notebook Task)
  │    Notebook: /etl/bronze/ingest_orders
  │    Cluster: Job Cluster (auto-created)
  │    Depends on: Nothing (runs first)
  │
  ├── TASK 2: ingest_customers        (Notebook Task)
  │    Notebook: /etl/bronze/ingest_customers
  │    Cluster: Same job cluster
  │    Depends on: Nothing (runs in parallel with Task 1)
  │
  ├── TASK 3: clean_orders            (Notebook Task)
  │    Notebook: /etl/silver/clean_orders
  │    Cluster: Same job cluster
  │    Depends on: Task 1 (waits for orders ingestion)
  │
  ├── TASK 4: clean_customers         (Notebook Task)
  │    Notebook: /etl/silver/clean_customers
  │    Cluster: Same job cluster
  │    Depends on: Task 2 (waits for customer ingestion)
  │
  └── TASK 5: build_gold_revenue     (Notebook Task)
       Notebook: /etl/gold/monthly_revenue
       Cluster: Same job cluster
       Depends on: Task 3 AND Task 4 (waits for BOTH silver tables)

EXECUTION TIMELINE:
T=0:00  Task 1 starts   Task 2 starts   (parallel)
T=0:05  Task 1 done →   Task 3 starts
T=0:07  Task 2 done →   Task 4 starts
T=0:12  Task 3 done
T=0:14  Task 4 done → Both done → Task 5 starts
T=0:18  Task 5 done → JOB COMPLETE

Creating a Workflow in the Databricks UI

Go to Workflows in the left sidebar and click Create Job. You are taken to the job editor, where you define tasks, dependencies, clusters, and scheduling.

Adding Tasks

Each task has a task type. The most common types are:

TASK TYPES IN DATABRICKS WORKFLOWS
──────────────────────────────────────────────────────────────
Notebook Task
  • Runs a Databricks notebook end-to-end
  • Passes parameters to the notebook via widgets
  • Most common task type for ETL and analysis

Python Script Task
  • Runs a .py file stored in a Git repository or DBFS
  • Best for code managed in Git without notebooks

Delta Live Tables Pipeline Task
  • Triggers a DLT pipeline run
  • Waits for the pipeline to complete before proceeding

SQL Task
  • Runs SQL queries against a SQL Warehouse
  • Good for data quality checks or Gold layer refresh

dbt Task
  • Runs a dbt (data build tool) project
  • For teams using dbt for SQL transformations

Condition Task
  • Evaluates a True/False condition
  • Routes workflow to different tasks based on result
  • Enables if/else branching in workflows

Run Job Task
  • Triggers another Databricks job
  • Enables modular, reusable sub-pipelines

Task Dependencies

When adding a task, the Depends on field lists the tasks that must complete before this task starts. Leave it empty to run the task at the start of the job. Select one or more tasks to create a dependency. Tasks with no upstream dependencies and no task depending on them run in parallel automatically.

Scheduling Workflows

Click Add a schedule or trigger on the job page to configure when the job runs.

SCHEDULING OPTIONS
──────────────────────────────────────────────────────────────
Manual (default):
  → Runs only when you click "Run Now"
  → No automatic execution
  → Use for ad hoc or one-time jobs

Scheduled (Cron):
  → Runs on a fixed schedule using cron syntax
  → Examples:
     0 6 * * *     → Every day at 6:00 AM
     0 */2 * * *   → Every 2 hours
     0 9 * * 1     → Every Monday at 9:00 AM
     0 22 L * *    → Last day of every month at 10:00 PM
  → Select timezone explicitly to avoid DST surprises

File Arrival (Continuous Trigger):
  → Runs when new files arrive in a cloud storage location
  → Auto Loader compatible — triggers on new file detection
  → Best for event-driven ingestion pipelines

External Trigger (API):
  → Other systems call the Databricks Jobs API to start a run
  → Pass parameters dynamically at trigger time
  → Use for orchestration from external tools like Airflow

Passing Parameters to Notebook Tasks

Job parameters make workflows reusable. Instead of hardcoding dates, environment names, or other values in your notebook code, you pass them as parameters when triggering the job. The notebook reads them using dbutils.widgets.get().

PARAMETERIZED WORKFLOW EXAMPLE
──────────────────────────────────────────────────────────────
Job Configuration:
  Task: process_daily_sales
  Parameters:
    processing_date → {{job.start_time.iso_date}} (auto-filled)
    environment     → "production"
    source_path     → "/mnt/incoming/sales/"

Notebook code:
  processing_date = dbutils.widgets.get("processing_date")
  environment     = dbutils.widgets.get("environment")
  source_path     = dbutils.widgets.get("source_path")

  print(f"Processing sales for {processing_date}
          in {environment} environment")
  print(f"Reading from: {source_path}")

Dynamic Variable Available in Workflows:
  {{job.start_time.iso_date}} → Date the job started (YYYY-MM-DD)
  {{job.id}}                  → Job ID
  {{run.id}}                  → Current run ID
  {{task.name}}               → Current task name

Handling Failures in Workflows

Production pipelines fail. Network timeouts, data quality issues, upstream API outages, and cloud infrastructure hiccups all cause task failures. Databricks Workflows provides several mechanisms to handle failures gracefully.

Retry on Failure

Each task supports automatic retries. You configure the maximum number of retries and the wait time between each attempt. This handles transient failures — a database connection that times out, a file that briefly becomes unavailable, or a cluster that fails to start on the first attempt.

RETRY CONFIGURATION
──────────────────────────────────────────────────────────────
Max Retries:      3
Retry Wait:       300 seconds (5 minutes) between attempts

Execution timeline when a task fails:
  00:00 → Task starts
  00:05 → Task fails (API timeout)
  00:05 → Retry 1/3 scheduled (wait 5 min)
  00:10 → Retry 1 starts
  00:15 → Retry 1 fails (still timing out)
  00:15 → Retry 2/3 scheduled
  00:20 → Retry 2 starts
  00:25 → Retry 2 SUCCEEDS
  00:25 → Downstream tasks proceed

Without retries: Job fails permanently at 00:05.
With 3 retries: Job succeeds at 00:25 despite two failures.

Task-Level Success/Failure Routing

Each task has configurable behavior when it fails: stop the entire job, continue to downstream tasks anyway, or route to a different task for error handling. This lets you build pipelines that handle partial failures gracefully — for example, sending a failure alert but still running the reporting task using yesterday's data.

CONDITIONAL SUCCESS/FAILURE ROUTING
──────────────────────────────────────────────────────────────
Task: refresh_gold_table
  On Success → Continue to task: send_success_email
  On Failure → Continue to task: send_failure_alert_and_use_cached

Task: send_failure_alert_and_use_cached
  → Sends email: "Gold refresh failed, using yesterday's data"
  → Copies yesterday's gold table to today's location
  → Marks job as partially succeeded (not complete failure)

This way, your dashboard still shows data (from yesterday)
rather than showing empty charts because of today's failure.

Email and Slack Notifications

Configure notifications at the job level or task level. You specify email addresses and notification events: job start, job success, job failure, task failure, and skipped tasks. For Slack integration, use the Slack webhook URL in the notification settings.

Monitoring Workflow Runs

The Workflows section shows all jobs and their run history. Click any job to see its run timeline, which tasks succeeded, which failed, and how long each task took.

WORKFLOW RUN MONITORING DASHBOARD
─────────────────────────────────────────────────────────
JOB: daily_retail_pipeline
─────────────────────────────────────────────────────────
Run Date       │ Status     │ Duration  │ Start Time
───────────────┼────────────┼───────────┼────────────
2024-01-15     │ ✅ Success │ 18 min   │ 06:00:02 AM
2024-01-14     │ ✅ Success │ 17 min   │ 06:00:01 AM
2024-01-13     │ ❌ Failed  │ 8 min    │ 06:00:00 AM
2024-01-12     │ ✅ Success │ 19 min   │ 06:00:02 AM

Click "2024-01-13 Failed" to see details:
  Task 1 ingest_orders:    ✅ 5 min
  Task 2 ingest_customers: ✅ 4 min
  Task 3 clean_orders:     ✅ 3 min
  Task 4 clean_customers:  ❌ FAILED (2 min) → Error: OOM

Error message: "Task 4 failed: java.lang.OutOfMemoryError:
Java heap space. Consider increasing driver memory."

Action: Increase driver memory on the cluster configuration.

Using the Databricks Jobs API

The Jobs API lets you trigger, monitor, and manage jobs programmatically. This is essential when Databricks Workflows must integrate with an external system that initiates the job — a CI/CD pipeline that deploys new notebook code, a business application that triggers a report generation, or a partner system that sends a file and expects confirmation of processing.

TRIGGERING A JOB VIA API (Python Example)
──────────────────────────────────────────────────────────────
import requests

databricks_host = "https://your-workspace.azuredatabricks.net"
token = "your-personal-access-token"
job_id = 12345  # From Workflows UI

# Trigger the job with parameters
response = requests.post(
    url=f"{databricks_host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": job_id,
        "notebook_params": {
            "processing_date": "2024-01-15",
            "environment": "production"
        }
    }
)

run_id = response.json()["run_id"]
print(f"Job triggered. Run ID: {run_id}")

# Poll for completion
import time
while True:
    status = requests.get(
        url=f"{databricks_host}/api/2.1/jobs/runs/get",
        headers={"Authorization": f"Bearer {token}"},
        params={"run_id": run_id}
    ).json()["state"]["life_cycle_state"]

    if status in ["TERMINATED", "SKIPPED", "INTERNAL_ERROR"]:
        final_status = response.json()["state"]["result_state"]
        print(f"Job completed with status: {final_status}")
        break

    time.sleep(30)  # Check every 30 seconds

Best Practices for Production Workflows

PRODUCTION WORKFLOW BEST PRACTICES
──────────────────────────────────────────────────────────────
1. USE JOB CLUSTERS, NOT ALL-PURPOSE CLUSTERS
   Each run gets a fresh, clean environment.
   No leftover state from previous runs.
   Auto-terminates after job completes.

2. VERSION CONTROL YOUR NOTEBOOKS
   Connect notebooks to Git repositories.
   Deploy specific Git branches/tags to production.
   Never edit production notebooks directly.

3. SET REASONABLE TIMEOUTS
   Each task should have a max runtime.
   A task running for 6 hours instead of 30 minutes
   signals a problem — catch it early.
   Default timeout: 30 minutes per task unless documented.

4. USE MODULAR, REUSABLE TASKS
   Each task does one thing (ingest, clean, or aggregate).
   Reuse the same notebook across multiple workflows
   by parameterizing inputs and outputs.

5. STORE LOGS AND AUDIT TRAILS
   Log task start time, end time, record counts,
   and quality check results to a dedicated audit table.
   This helps diagnose historical failures months later.

6. TEST IN DEV BEFORE PROMOTING TO PROD
   Use job parameters to switch between dev and prod
   catalogs. Run the full workflow in dev environment
   before deploying to production schedule.

Key Points

  • Databricks Workflows automates the scheduling, orchestration, and monitoring of notebooks, pipelines, SQL queries, and scripts.
  • A workflow consists of tasks with explicit dependencies that determine execution order; tasks without dependencies run in parallel.
  • Task types include notebooks, Python scripts, Delta Live Tables pipelines, SQL queries, dbt projects, and conditional branching.
  • Schedules use cron syntax; file arrival and API triggers support event-driven execution.
  • Parameters make workflows reusable by injecting values at runtime rather than hardcoding them in notebook code.
  • Retry settings handle transient failures automatically; failure routing lets workflows continue gracefully with fallback logic.
  • Always use Job Clusters (not All-Purpose Clusters) for production workflows to ensure clean environments and automatic cost control.
  • The Jobs API enables external systems to trigger, monitor, and integrate with Databricks workflows programmatically.

Leave a Comment