ADE Pipeline Orchestration

A production data platform is never a single pipeline. It is a network of interdependent pipelines that must run in the right order, handle failures gracefully, and alert the team when something goes wrong. Orchestration is the practice of coordinating and managing this network of pipelines.

What is Orchestration

Orchestration means controlling when each pipeline runs, in what order, what happens if one fails, and how to retry or recover. Without orchestration, pipelines run independently with no awareness of each other — leading to downstream pipelines processing incomplete or stale data.

Think of orchestration as conducting an orchestra. Each musician (pipeline) plays an instrument (job). The conductor ensures the woodwinds (ingestion) play before the strings (transformation), which play before the brass (reporting load). If the woodwinds miss a note, the conductor signals the appropriate response: repeat the section or pause the performance.

Dependency Chains in Data Pipelines

A dependency chain means Pipeline B cannot start until Pipeline A successfully completes. In real projects, chains can be deep and branched.

A typical morning data load might follow this sequence:

  1. Extract: Pull raw sales data from the source system (runs first — no dependencies)
  2. Extract: Pull raw customer data from the source system (runs in parallel with step 1)
  3. Transform Sales: Clean and enrich sales data (depends on step 1 completing)
  4. Transform Customers: Clean and enrich customer data (depends on step 2 completing)
  5. Load Fact Table: Merge cleaned sales into the fact table (depends on steps 3 AND 4 both completing)
  6. Refresh Aggregates: Recalculate summary tables (depends on step 5)
  7. Trigger Dashboard Refresh: Notify Power BI to refresh reports (depends on step 6)

If step 3 fails, steps 5, 6, and 7 must not run. It is better for the reporting team to see stale data than partial or corrupted data.
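
The morning load above can be modeled as a small dependency graph. The sketch below is plain Python rather than ADF code, and the step names and the `plan_run` helper are hypothetical. It uses the standard library's `graphlib` to derive a valid run order, then marks every step downstream of a failure as skipped.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical names
# mirroring the seven-step morning load).
DEPENDENCIES = {
    "extract_sales": set(),
    "extract_customers": set(),
    "transform_sales": {"extract_sales"},
    "transform_customers": {"extract_customers"},
    "load_fact_table": {"transform_sales", "transform_customers"},
    "refresh_aggregates": {"load_fact_table"},
    "trigger_dashboard_refresh": {"refresh_aggregates"},
}


def plan_run(dependencies, failing_steps=frozenset()):
    """Return a status per step, skipping anything downstream of a failure."""
    status = {}
    # static_order() yields each step only after all of its dependencies.
    for step in TopologicalSorter(dependencies).static_order():
        if any(status[dep] != "succeeded" for dep in dependencies[step]):
            status[step] = "skipped"
        elif step in failing_steps:
            status[step] = "failed"
        else:
            status[step] = "succeeded"
    return status


# Simulate step 3 (Transform Sales) failing.
result = plan_run(DEPENDENCIES, failing_steps={"transform_sales"})
for step, outcome in result.items():
    print(f"{step}: {outcome}")
```

With `transform_sales` failing, the customer branch still completes, while the fact table load, the aggregate refresh, and the dashboard trigger are all skipped.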

Orchestration in Azure Data Factory

Parent-Child Pipeline Pattern

ADF uses the Execute Pipeline Activity to call one pipeline from another. A parent pipeline acts as the orchestrator — it calls child pipelines in sequence or in parallel and monitors their completion.

Parent Pipeline: Daily_Data_Load
├── Execute Pipeline: Extract_Sales  (runs first)
├── Execute Pipeline: Extract_Customers  (runs in parallel)
├── Wait for both to complete
├── Execute Pipeline: Transform_Sales  (depends on Extract_Sales)
├── Execute Pipeline: Transform_Customers  (depends on Extract_Customers)
├── Wait for both to complete
└── Execute Pipeline: Load_Fact_Table  (depends on both transforms)

Each Execute Pipeline Activity has a Wait on Completion setting. When enabled, the parent waits for the child to finish before proceeding. When disabled, the parent fires the child and moves on immediately (fire-and-forget).
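
Behind the designer, each Execute Pipeline activity is stored as JSON. The sketch below builds an abridged version of that JSON in Python for the Daily_Data_Load parent. The helper `execute_pipeline_activity` and the `Run_...` activity names are hypothetical, and a definition exported from a real factory may carry further properties (parameters, descriptions, and so on).

```python
import json


def execute_pipeline_activity(name, child_pipeline, depends_on=(), wait=True):
    """Build an abridged JSON definition of an Execute Pipeline activity."""
    return {
        "name": name,
        "type": "ExecutePipeline",
        # Run only after every listed upstream activity has succeeded.
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": ["Succeeded"]}
            for upstream in depends_on
        ],
        "typeProperties": {
            "pipeline": {
                "referenceName": child_pipeline,
                "type": "PipelineReference",
            },
            # False would make the call fire-and-forget.
            "waitOnCompletion": wait,
        },
    }


activities = [
    execute_pipeline_activity("Run_Extract_Sales", "Extract_Sales"),
    execute_pipeline_activity("Run_Extract_Customers", "Extract_Customers"),
    execute_pipeline_activity(
        "Run_Transform_Sales", "Transform_Sales",
        depends_on=["Run_Extract_Sales"],
    ),
    execute_pipeline_activity(
        "Run_Transform_Customers", "Transform_Customers",
        depends_on=["Run_Extract_Customers"],
    ),
    execute_pipeline_activity(
        "Run_Load_Fact_Table", "Load_Fact_Table",
        depends_on=["Run_Transform_Sales", "Run_Transform_Customers"],
    ),
]

parent_pipeline = {
    "name": "Daily_Data_Load",
    "properties": {"activities": activities},
}
print(json.dumps(parent_pipeline, indent=2))
```

The two extract activities have no dependencies, so they can start in parallel. The "wait for both" steps in the diagram are expressed simply by listing two upstream activities in `dependsOn`.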

Activity Dependencies — Controlling Flow Within a Pipeline

Within a single pipeline, each activity can have dependency conditions on the previous activity:

  • On Success: Run this activity only if the previous activity succeeded
  • On Failure: Run this activity only if the previous activity failed (use for error-handling steps)
  • On Completion: Run this activity regardless of whether the previous activity succeeded or failed
  • On Skipped: Run this activity if the previous activity was skipped
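
The four conditions can be read as a small truth table over the outcome of the upstream activity. The sketch below is an illustrative Python model, not ADF code. It uses the names these conditions carry in the pipeline JSON (`Succeeded`, `Failed`, `Completed`, `Skipped`), and the `should_run` helper is hypothetical. It assumes that listing several conditions for the same upstream activity means "any of them".

```python
# Outcomes of the upstream activity that satisfy each dependency condition.
SATISFIED_BY = {
    "Succeeded": {"Succeeded"},
    "Failed": {"Failed"},
    "Skipped": {"Skipped"},
    "Completed": {"Succeeded", "Failed"},
}


def should_run(upstream_outcome, conditions):
    """True if the upstream outcome satisfies at least one listed condition."""
    return any(upstream_outcome in SATISFIED_BY[c] for c in conditions)


for outcome in ("Succeeded", "Failed", "Skipped"):
    print(
        outcome,
        "| On Success:", should_run(outcome, ["Succeeded"]),
        "| On Failure:", should_run(outcome, ["Failed"]),
        "| On Completion:", should_run(outcome, ["Completed"]),
        "| On Skipped:", should_run(outcome, ["Skipped"]),
    )
```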

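A common pattern: add a failure notification activity with an "On Failure" dependency on the activity it should watch. If that activity fails, the notification runs and sends an alert email or Teams message. Be careful when wiring one notification to several activities: when an activity depends on more than one upstream activity, ADF requires all of those dependency conditions to be met, so a notification connected "On Failure" to every activity fires only if all of them fail. For a sequential chain, connect the notification to the last activity with both "On Failure" and "On Skipped". A failure anywhere upstream causes the later activities to be skipped, so this single connection catches a failure at any point in the chain.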

Retry Logic

Transient failures, such as network blips or temporary service unavailability, are common in cloud environments. Most ADF execution activities, such as Copy, expose a Retry count and a Retry interval in their settings. Configure a reasonable retry count (2–3 times) with a retry interval (30–60 seconds) for activities that connect to external systems.
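
In ADF, retry is a setting rather than code, but the behavior it configures is easy to sketch. The Python below is illustrative only: `run_with_retry` and `flaky_extract` are hypothetical names, and treating only `ConnectionError` as transient is an assumption made for the example. The `sleep` function is injectable so the demonstration does not actually wait.

```python
import time


def run_with_retry(task, retries=3, interval_seconds=30, sleep=time.sleep):
    """Call task(); on a transient error wait, then retry up to `retries` times."""
    attempt = 0
    while True:
        try:
            return task()
        except ConnectionError:
            if attempt >= retries:
                raise  # retries exhausted: surface the failure
            attempt += 1
            sleep(interval_seconds)


calls = {"count": 0}


def flaky_extract():
    """Fail twice with a transient error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network blip")
    return "42 rows copied"


waits = []  # records the requested waits instead of sleeping
outcome = run_with_retry(
    flaky_extract, retries=3, interval_seconds=30, sleep=waits.append
)
print(outcome, "after", calls["count"], "attempts; waits:", waits)
```

Errors that are not transient (a bad query, a missing table) should fail immediately rather than retry, which is why the sketch catches only one exception type.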

Incremental Loading — Processing Only New Data

Loading all data from a source every day is wasteful and slow. When a source table has 100 million rows and only 50,000 changed yesterday, reprocessing all 100 million rows wastes 99.95% of the effort. Incremental loading solves this by processing only new or changed records.

Watermark Pattern

The watermark pattern tracks the last successfully processed record using a timestamp or sequence number stored in a control table.

  1. Read the last watermark from the control table (e.g., last_processed_date = 2024-01-15)
  2. Extract only records where updated_at > last_processed_date from the source
  3. Process and load those records to the destination
  4. Update the control table with the new watermark (e.g., 2024-01-16), but only after the load in step 3 succeeds

-- Control table in Azure SQL
CREATE TABLE pipeline_watermark (
    pipeline_name     VARCHAR(100) PRIMARY KEY,
    last_watermark    DATETIME,
    last_run_status   VARCHAR(20),
    last_run_time     DATETIME
);

-- The ADF Lookup Activity reads this before the copy
SELECT last_watermark FROM pipeline_watermark
WHERE pipeline_name = 'sales_incremental_load';
Change Data Capture (CDC)

Some source systems support Change Data Capture — a feature that automatically logs every INSERT, UPDATE, and DELETE in a change table. ADF can read from these change tables to pick up only what changed, even capturing deletes which a timestamp filter alone cannot detect.

Azure SQL Database and SQL Managed Instance support CDC natively. ADF has a native CDC source connector for Azure SQL, making incremental pipeline setup straightforward.
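
To see why CDC handles deletes where a timestamp filter cannot, it helps to replay a change log against a target. The Python sketch below is a simplified model: the change-record shape (`op`, `id`, `row`) and the `apply_changes` helper are assumptions for illustration and do not reflect the actual layout of SQL Server change tables.

```python
def apply_changes(target, changes):
    """Apply an ordered list of change records to a dict keyed by primary key."""
    for change in changes:
        key = change["id"]
        if change["op"] == "DELETE":
            # A deleted row no longer exists in the source, so a filter on
            # updated_at would never return it. The change log still does.
            target.pop(key, None)
        else:
            # INSERT and UPDATE both write the latest image of the row.
            target[key] = change["row"]
    return target


customers = {
    1: {"name": "Asha", "city": "Pune"},
    2: {"name": "Ben", "city": "Leeds"},
}

change_log = [
    {"op": "UPDATE", "id": 1, "row": {"name": "Asha", "city": "Mumbai"}},
    {"op": "INSERT", "id": 3, "row": {"name": "Chen", "city": "Taipei"}},
    {"op": "DELETE", "id": 2, "row": None},
]

apply_changes(customers, change_log)
print(customers)
```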

Error Handling and Alerting

Pipeline Failure Notifications

ADF integrates with Azure Monitor. You create alert rules that fire when a pipeline run fails. Alerts route to action groups, which can send emails or SMS messages, or trigger an Azure Logic App that posts to Microsoft Teams or Slack.

Dead Letter Queue Pattern

In stream processing, some events cannot be processed due to format errors or missing reference data. Instead of dropping these events, write them to a separate "dead letter" folder in ADLS Gen2 or to a dedicated dead-letter event hub (Event Hubs has no built-in dead letter queue, so this is a separate hub you create). A separate process investigates and reprocesses them later. Nothing is silently lost.
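
The routing logic behind a dead letter pattern is small. The sketch below is a minimal Python illustration in which in-memory lists stand in for the real destinations (a folder in ADLS Gen2 or a separate event hub). The `route_events` function, the `customer_id` field, and the sample events are hypothetical.

```python
import json


def route_events(raw_events, known_customer_ids):
    """Split raw JSON events into processed records and dead-lettered ones."""
    processed, dead_letters = [], []
    for raw in raw_events:
        try:
            event = json.loads(raw)  # format errors raise here
            if event["customer_id"] not in known_customer_ids:
                raise LookupError("unknown customer_id")  # missing reference data
            processed.append(event)
        except (json.JSONDecodeError, KeyError, TypeError, LookupError) as error:
            # Keep the original payload and the reason so it can be replayed.
            dead_letters.append({"payload": raw, "reason": repr(error)})
    return processed, dead_letters


events = [
    '{"customer_id": 1, "amount": 5}',
    "not json at all",
    '{"amount": 3}',
    '{"customer_id": 99, "amount": 7}',
]

processed, dead_letters = route_events(events, known_customer_ids={1, 2})
print(len(processed), "processed,", len(dead_letters), "dead-lettered")
for item in dead_letters:
    print(item["reason"], "<-", item["payload"])
```

Storing the untouched payload alongside the failure reason is what makes later reprocessing possible: once the reference data is fixed, the same events can be fed through `route_events` again.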

Apache Airflow on Azure — Advanced Orchestration

For organizations with complex multi-system orchestration needs, Apache Airflow is a popular open-source orchestration tool. Airflow lets you define pipelines as Python code (called DAGs — Directed Acyclic Graphs), giving you full programmatic control over dependencies, branching, and scheduling.

Azure Data Factory offers a managed Airflow environment, introduced as Managed Airflow and later renamed Workflow Orchestration Manager. Alternatively, you can run Airflow on Azure Kubernetes Service for full control.

Key Points

  • Orchestration coordinates pipelines so they run in the correct order with proper dependency handling
  • Use the Parent-Child Pipeline pattern in ADF to manage complex pipeline dependencies
  • Configure On Failure activity dependencies to run notification or cleanup steps when something goes wrong
  • Implement incremental loading with the Watermark Pattern to avoid reprocessing unchanged data daily
  • Set retry logic on activities that connect to external systems to handle transient failures automatically
  • Use a dead letter folder to capture unprocessable events rather than silently dropping them
