DevOps for Azure Data Engineering

Professional data engineering teams do not manually deploy pipeline changes to production. They use DevOps practices — version control, automated testing, and continuous deployment — to move changes from development to production safely and reliably. This topic covers how to apply DevOps principles specifically to Azure data engineering workloads.

Why Data Engineering Needs DevOps

Without DevOps, a common scenario plays out like this: a developer edits an ADF pipeline directly in the production environment to "just fix a small thing quickly." The fix causes an unintended side effect, and there is no way to roll back because the previous version was never saved. The team spends hours diagnosing and rebuilding something that a version control system would have let them revert in five minutes.

DevOps for data engineering solves three specific problems:

Version control: Every change to pipelines, notebooks, and schemas is tracked with who made it, when, and why
Environment consistency: Development, testing, and production environments are built from the same code, eliminating "it works on dev but not prod" problems
Automated deployment: Changes deploy automatically after passing tests, removing manual steps that introduce human error

Source Control for Azure Data Factory

ADF integrates directly with Azure DevOps Git repositories and GitHub. When you connect ADF to a Git repository, every pipeline, dataset, linked service, and trigger is stored as a JSON file in the repository.

The ADF Git Workflow

ADF uses a branch-based workflow:

Developers work on their own feature branches — changes are saved to the branch, not published live
When the feature is ready, a pull request merges the branch into the collaboration branch (typically main or develop)
The collaboration branch is the source of truth for the development environment
Publishing from the collaboration branch deploys changes to ADF and generates an ARM template in the adf_publish branch
The CI/CD pipeline deploys the ARM template to the test and production ADF instances

Parameterizing for Multiple Environments

A pipeline developed in the dev environment uses a dev storage account and dev SQL database. The same pipeline in production must use production connections. ADF handles this with global parameters and environment-specific ARM template parameter files.

Each environment (dev, test, prod) has its own parameter file that overrides linked service URLs, storage account names, and database connection strings. The pipeline code stays identical — only the parameters change per environment.
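
As a rough illustration of the override idea, the sketch below applies per-environment values to a base parameter file in the standard ARM parameter-file shape. The parameter names (factoryName, storageAccountUrl), the resource names, and the apply_overrides helper are all hypothetical; the names ADF generates for your factory will differ.

```python
import json
from copy import deepcopy


def apply_overrides(base_params: dict, overrides: dict) -> dict:
    """Return a copy of an ARM parameter file with per-environment values applied."""
    result = deepcopy(base_params)
    for name, value in overrides.items():
        if name not in result["parameters"]:
            # Fail loudly on typos instead of silently adding an unused parameter
            raise KeyError(f"Unknown parameter: {name}")
        result["parameters"][name] = {"value": value}
    return result


# Hypothetical base parameter file (values point at the dev environment)
BASE = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {"value": "adf-sales-dev"},
        "storageAccountUrl": {"value": "https://salesdev.dfs.core.windows.net"},
    },
}

# Hypothetical production overrides; the pipeline definitions are untouched
PROD_OVERRIDES = {
    "factoryName": "adf-sales-prod",
    "storageAccountUrl": "https://salesprod.dfs.core.windows.net",
}

prod_params = apply_overrides(BASE, PROD_OVERRIDES)
print(json.dumps(prod_params["parameters"], indent=2))
```

Rejecting unknown parameter names is a deliberate choice here: a misspelled override would otherwise leave the dev value in place and only surface after deployment.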

Source Control for Databricks Notebooks

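Databricks integrates with Azure DevOps Git and GitHub through Git folders (formerly called Repos). Each Git folder in the workspace is a clone of a Git repository, so the notebooks inside it are versioned like any other source file. Developers commit and push notebook changes directly from the Databricks UI, or work in a local clone with standard Git tooling.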

Databricks Asset Bundles (DABs)

Databricks Asset Bundles are the modern approach to deploying Databricks resources as code. You define jobs, clusters, permissions, and libraries in YAML configuration files. The bundle deploys consistently across dev, test, and production environments using the Databricks CLI.

# databricks.yml — a simple bundle definition
bundle:
  name: sales_pipeline

resources:
  jobs:
    transform_sales:
      name: "Transform Sales Data"
      tasks:
        - task_key: clean_data
          notebook_task:
            notebook_path: ./notebooks/transform_sales.py  # local path relative to this file; the .py extension is assumed here
          new_cluster:
            num_workers: 4
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
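
A deployment script might wrap the Databricks CLI along these lines. This is a sketch, not a prescribed workflow: it assumes the bundle file also declares targets named dev, test, and prod under a targets mapping (the example above does not show one), and the helper names are hypothetical. The databricks bundle validate and databricks bundle deploy commands with the -t flag are part of the Databricks CLI.

```python
import subprocess

# Assumed target names; they must match the targets declared in databricks.yml
ALLOWED_TARGETS = ("dev", "test", "prod")


def bundle_commands(target: str) -> list[list[str]]:
    """Build the CLI calls that validate and then deploy a bundle to one target."""
    if target not in ALLOWED_TARGETS:
        raise ValueError(f"Unknown target {target!r}; expected one of {ALLOWED_TARGETS}")
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]


def deploy_bundle(target: str, dry_run: bool = True) -> None:
    """Run the commands in order; with dry_run=True, only print them."""
    for cmd in bundle_commands(target):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True stops the script if validation or deployment fails
            subprocess.run(cmd, check=True)


deploy_bundle("dev")
```

Validating before deploying means a malformed bundle is rejected before anything in the workspace changes.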

CI/CD Pipeline with Azure DevOps

A CI/CD pipeline automates the steps between a code merge and a production deployment.

Continuous Integration (CI) — The Build Stage

When code is merged to the main branch, the CI pipeline automatically:

Runs unit tests on any Python or SQL transformation code
Validates ADF JSON files for syntax errors
Checks that all referenced linked services and datasets exist
Packages the ADF ARM template and notebook files as build artifacts
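
The two validation steps above can be approximated with a small script. This is a sketch that assumes the repository root contains the dataset, linkedService, and pipeline folders ADF writes, and that references appear as objects carrying referenceName and type keys. It checks only those three resource kinds, and validate_adf_repo is a hypothetical helper rather than an official tool.

```python
import json
from pathlib import Path

# Folder name in the ADF Git repo -> reference type used inside resource JSON
FOLDER_TO_REF_TYPE = {
    "dataset": "DatasetReference",
    "linkedService": "LinkedServiceReference",
    "pipeline": "PipelineReference",
}


def find_references(node):
    """Yield (type, referenceName) pairs found anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        name, ref_type = node.get("referenceName"), node.get("type")
        # Skip parameterized references whose name is an expression object
        if isinstance(name, str) and isinstance(ref_type, str):
            yield ref_type, name
        for value in node.values():
            yield from find_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_references(item)


def validate_adf_repo(root: Path) -> list[str]:
    """Return a list of problems: unparseable files and dangling references."""
    problems, defined, parsed = [], set(), []
    for folder, ref_type in FOLDER_TO_REF_TYPE.items():
        for path in sorted((root / folder).glob("*.json")):
            try:
                doc = json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems.append(f"{path.name}: invalid JSON ({exc.msg})")
                continue
            # Assumes each resource file holds a JSON object with a "name" key
            defined.add((ref_type, doc.get("name", path.stem)))
            parsed.append((path, doc))
    known_types = set(FOLDER_TO_REF_TYPE.values())
    for path, doc in parsed:
        for ref in find_references(doc):
            if ref[0] in known_types and ref not in defined:
                problems.append(f"{path.name}: missing {ref[0]} '{ref[1]}'")
    return problems


if __name__ == "__main__":
    for problem in validate_adf_repo(Path(".")):
        print(problem)
```

In a CI job, a non-empty result would typically fail the build so that a broken reference never reaches the packaging step.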

Continuous Deployment (CD) — The Release Stage

The CD pipeline deploys the build artifacts to each environment in sequence:

Deploy to Test: Apply ARM template to the test ADF instance; deploy notebooks to the test Databricks workspace
Run Integration Tests: Execute test pipelines that verify end-to-end data flow with a sample dataset
Approval Gate: A team lead reviews test results and approves the production deployment
Deploy to Production: Apply the same ARM template to the production ADF instance; deploy notebooks to the production Databricks workspace
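
The control flow of these four steps can be modeled in a few lines. This sketch only illustrates the sequencing and the gate. In Azure DevOps the stages and approvals are normally configured in the pipeline itself, and the Stage and run_release names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]          # returns True on success
    requires_approval: bool = False  # gate is checked before the stage runs


def run_release(stages: list[Stage], approved: Callable[[str], bool]) -> list[str]:
    """Run stages in order; stop at the first failure or missing approval."""
    log = []
    for stage in stages:
        if stage.requires_approval and not approved(stage.name):
            log.append(f"{stage.name}: blocked (awaiting approval)")
            break
        ok = stage.run()
        log.append(f"{stage.name}: {'succeeded' if ok else 'failed'}")
        if not ok:
            break  # later environments never see a build that failed earlier
    return log


# Stand-in stages; real ones would call the deployment and test tooling
stages = [
    Stage("deploy_test", lambda: True),
    Stage("integration_tests", lambda: True),
    Stage("deploy_prod", lambda: True, requires_approval=True),
]

for line in run_release(stages, approved=lambda name: False):
    print(line)
```

Because the loop breaks on the first failure, a failed integration test stops the release before the approval gate is even reached.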

Infrastructure as Code with Bicep and Terraform

Creating Azure resources manually through the portal creates environments that are hard to reproduce consistently. Infrastructure as Code (IaC) defines your Azure infrastructure in code files that are version-controlled and deployable with a single command.

Azure Bicep

Bicep is Microsoft's native IaC language for Azure. Its syntax is more concise than ARM template JSON, and Bicep files compile to ARM JSON for deployment.

// Bicep — create an ADLS Gen2 storage account
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'mystorageprod001'
  location: 'eastus'
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
  properties: {
    isHnsEnabled: true  // Enables hierarchical namespace = ADLS Gen2
    minimumTlsVersion: 'TLS1_2'
  }
}
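
To make the "single command" concrete, the sketch below assembles an Azure CLI call for a Bicep file. The resource group name, the main.bicep file name, and the storageName parameter are hypothetical. Passing a parameter also assumes the template declares a matching param, which the example above does not, since it hard-codes the account name. The az deployment group create and what-if subcommands are part of the Azure CLI.

```python
def bicep_deploy_command(
    resource_group: str,
    template: str,
    parameters: dict[str, str],
    preview: bool = False,
) -> list[str]:
    """Build an az CLI call that deploys (or previews) a Bicep template."""
    # "what-if" reports the changes a deployment would make without applying them
    action = "what-if" if preview else "create"
    cmd = [
        "az", "deployment", "group", action,
        "--resource-group", resource_group,
        "--template-file", template,
    ]
    if parameters:
        cmd.append("--parameters")
        # Sorted so the generated command is stable across runs
        cmd.extend(f"{key}={value}" for key, value in sorted(parameters.items()))
    return cmd


command = bicep_deploy_command(
    "rg-data-prod", "main.bicep", {"storageName": "mystorageprod001"}
)
print(" ".join(command))
```

The returned list can be passed to subprocess.run in a deployment script; running the preview form first gives reviewers a change summary before anything is applied.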

Terraform

Terraform is a third-party IaC tool from HashiCorp that works across multiple cloud providers. Organizations that use both Azure and AWS often choose Terraform for a unified IaC experience. The Azure provider for Terraform (azurerm) covers the major Azure data services.

Testing Data Pipelines

Unit Testing Transformation Logic

PySpark transformation functions can be unit tested using pytest. Test each transformation function in isolation with small, controlled input DataFrames and verify the output matches expectations.

# Unit test for a transformation function
import pytest
from pyspark.sql import SparkSession
from my_transformations import clean_sales_data

@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared by every test in the session
    session = SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()
    yield session
    session.stop()

def test_removes_negative_amounts(spark):
    input_data = [(1, "2024-01-01", -50.0), (2, "2024-01-01", 100.0)]
    df_input = spark.createDataFrame(input_data, ["order_id", "order_date", "amount"])

    df_result = clean_sales_data(df_input)

    assert df_result.count() == 1  # Negative row should be removed
    assert df_result.first()["order_id"] == 2

Key Points

Connect ADF and Databricks to Git repositories — every change becomes versioned and reversible
Never make changes directly in the production environment — all changes flow through the CI/CD pipeline
Use parameterization and environment-specific parameter files to deploy the same code to dev, test, and production
Include an approval gate before production deployment — human review catches issues automated tests miss
Define Azure infrastructure as Bicep or Terraform code so environments are reproducible and consistent
Write unit tests for PySpark transformation functions — catch logic errors before they reach production data
