ADE Pipeline Design and Architecture

Building individual pipelines is one skill. Designing a complete, production-grade data platform architecture is another. This topic walks through the decisions and patterns that turn individual components into a coherent, maintainable system that serves a real business reliably.

Translating a Business Requirement into an Architecture

Architecture design always starts with a business requirement, not a list of Azure services. The requirement determines every technical decision.

Consider this example requirement: A retail chain wants a daily analytics platform that shows sales performance by product, region, and store. Source data comes from 50 stores using different POS systems. Analysts need data ready by 7 AM every morning. Historical data must be retained for 5 years. Customer personal data must comply with GDPR.

Translating Requirements to Technical Decisions

Business Requirement | Technical Decision
50 stores, different POS systems | ADF with multiple Linked Services; metadata-driven ingestion framework
Data ready by 7 AM daily | ADF schedule trigger at 1 AM; SLA monitoring alert if the pipeline is not complete by 6:30 AM
5-year data retention | ADLS Gen2 with lifecycle policy: hot → cool (after 90 days) → archive (after 365 days)
GDPR compliance on customer data | Purview for data classification; Key Vault for secrets; private endpoints; column-level encryption of PII columns
Analytics by product, region, store | Star schema in Synapse Dedicated Pool; Power BI connected in Import mode
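
The metadata-driven ingestion decision in the first row can be sketched in plain Python. This is an illustration of the idea only, not ADF's API: the control-table fields, the build_copy_tasks helper, the linked service names, and the bronze folder layout are all hypothetical. In ADF the same pattern is commonly built as a Lookup activity that reads the control table and feeds a ForEach around one parameterized Copy activity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceConfig:
    """One row of a hypothetical ingestion control table."""
    store_id: str
    pos_system: str
    linked_service: str
    source_object: str
    enabled: bool = True


def build_copy_tasks(configs, run_date):
    """Turn control-table rows into parameter sets for one generic copy step."""
    tasks = []
    for cfg in configs:
        if not cfg.enabled:
            continue  # disabled sources are skipped without any code change
        tasks.append({
            "linked_service": cfg.linked_service,
            "source_object": cfg.source_object,
            # Assumed bronze layout: one folder per POS system, store, and load date
            "sink_path": f"bronze/{cfg.pos_system}/{cfg.store_id}/{run_date.isoformat()}/",
        })
    return tasks


# Illustrative rows; a real control table would hold all 50 stores.
CONTROL_TABLE = [
    SourceConfig("store-001", "pos_a", "ls_pos_a_sql", "dbo.Sales"),
    SourceConfig("store-002", "pos_b", "ls_pos_b_rest", "/v1/transactions"),
    SourceConfig("store-003", "pos_a", "ls_pos_a_sql", "dbo.Sales", enabled=False),
]

tasks = build_copy_tasks(CONTROL_TABLE, date(2024, 1, 15))
for task in tasks:
    print(task["sink_path"])
```

Onboarding a new store then means adding a row to the control table rather than building a new pipeline.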

The Lambda Architecture

Some platforms need both real-time data and reliable batch-processed historical data. The Lambda Architecture serves both simultaneously using two parallel processing paths.

Batch Layer: Processes all historical data at regular intervals (daily or hourly). Produces highly accurate, complete results. Slow to update.

Speed Layer: Processes incoming real-time events immediately. Produces approximate or partial results for recent data. Fast but does not handle historical data.

Serving Layer: Merges results from both layers. Queries hit the serving layer and get the batch result for historical data combined with the real-time result for recent data.

On Azure, the Lambda Architecture typically looks like: ADF + Databricks (batch layer), Event Hubs + Stream Analytics (speed layer), Synapse Analytics (serving layer).
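
The way the serving layer combines the two paths can be sketched as follows. This is a simplified in-memory model, not Synapse or Stream Analytics code: the region totals, the cutoff date, and the function names are made up for illustration. The key point it shows is that the speed view only counts events newer than the last batch run, so nothing is counted twice.

```python
from collections import Counter
from datetime import date

# Batch view: complete totals per region, computed up to and including this date.
BATCH_CUTOFF = date(2024, 1, 14)
batch_view = Counter({"north": 1200.0, "south": 950.0})

# Speed layer input: raw events seen by the real-time path.
recent_events = [
    {"region": "north", "amount": 40.0, "event_date": date(2024, 1, 15)},
    {"region": "south", "amount": 25.5, "event_date": date(2024, 1, 15)},
    {"region": "north", "amount": 10.0, "event_date": date(2024, 1, 14)},  # already in batch
]


def build_speed_view(events, cutoff):
    """Aggregate only events newer than the batch cutoff to avoid double counting."""
    view = Counter()
    for event in events:
        if event["event_date"] > cutoff:
            view[event["region"]] += event["amount"]
    return view


def query_sales(region, batch, speed):
    """Serving layer: historical batch total plus the recent real-time increment."""
    return batch.get(region, 0.0) + speed.get(region, 0.0)


speed_view = build_speed_view(recent_events, BATCH_CUTOFF)
print(query_sales("north", batch_view, speed_view))  # 1240.0
```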

Lambda Architecture is complex to maintain because you run two separate code paths that must produce consistent results. Consider whether a simpler purely-batch or near-real-time approach meets the business need before committing to Lambda.

The Kappa Architecture

The Kappa Architecture simplifies Lambda by using only one processing path — a stream processor — for both real-time and historical data. Historical data is replayed through the same stream pipeline.

On Azure, this means Event Hubs holds the event log (within the retention limit of its tier, or durably through Event Hubs Capture to ADLS Gen2 when longer history is needed), and Databricks Structured Streaming or Delta Live Tables processes everything, both live and replayed historical data, through the same code.
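
The single-code-path idea can be sketched in plain Python. This is a toy model rather than Structured Streaming code: process_stream and the two event lists are hypothetical stand-ins for the stream job and its sources. What matters is that replayed history and live events pass through the same function.

```python
from collections import defaultdict


def process_stream(events, state=None):
    """Single code path: fold events into running totals per store.

    The same function handles replayed history and live events; only the
    source of the iterable differs.
    """
    totals = defaultdict(float) if state is None else state
    for event in events:
        totals[event["store_id"]] += event["amount"]
    return totals


# Hypothetical stand-ins for the two sources.
replayed_history = [  # e.g. events read back from captured files
    {"store_id": "s1", "amount": 100.0},
    {"store_id": "s2", "amount": 50.0},
]
live_events = [  # e.g. events arriving on the stream right now
    {"store_id": "s1", "amount": 20.0},
]

state = process_stream(replayed_history)    # rebuild state from history
state = process_stream(live_events, state)  # continue with live data
print(dict(state))
```

Because there is only one implementation of the logic, a fix to process_stream applies to historical and live results alike once the history is replayed.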

Designing for Failure

A well-designed data platform assumes failures will happen and handles them gracefully. Every design decision should consider: what happens when this component fails?

Idempotency — Safe to Rerun

An idempotent pipeline produces the same result whether it runs once or ten times. Rerunning the same pipeline never creates duplicate data.

Implement idempotency by using overwrite write modes for batch loads with bounded time windows. A pipeline that loads data for 2024-01-15 always overwrites the 2024-01-15 partition — running it twice produces the same result as running it once.

# Idempotent write: overwrite only the 2024-01-15 partition.
# Assumes df holds just that day's rows and output_path is the Delta table location.
df.write.format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", "order_date = '2024-01-15'") \
    .save(output_path)

Circuit Breaker Pattern

If a source system is unavailable, a pipeline retrying aggressively creates more load on an already struggling system. The circuit breaker pattern detects repeated failures and stops retrying after a threshold — giving the source system time to recover.

In ADF, implement this with a combination of retry settings, failure condition tracking in a control table, and an If Condition that checks the failure count before attempting a connection.
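
The logic behind that check can be sketched in Python. This is a minimal model under stated assumptions, not ADF functionality: the CircuitBreaker class, its threshold, and its cooldown are hypothetical. In the ADF version, failure_count would live in the control table and allow_request would correspond to the If Condition.

```python
import time


class CircuitBreaker:
    """Stops calls to a failing source after repeated consecutive failures."""

    def __init__(self, failure_threshold=3, cooldown_seconds=300, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        """Return True if a connection attempt should be made now."""
        if self.opened_at is None:
            return True
        # After the cooldown, allow attempts again; another failure reopens the circuit.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        """A successful call closes the circuit and resets the counter."""
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        """Count the failure and open the circuit once the threshold is reached."""
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = self.clock()


# Demonstration with a fake clock instead of real waiting.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=300, clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.allow_request())  # False: threshold reached, source gets time to recover
now[0] = 301.0
print(breaker.allow_request())  # True: cooldown elapsed, one more attempt is allowed
```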

Dead Letter Handling

Events or records that cannot be processed, whether because of schema mismatches, missing reference data, or validation failures, go to a dead letter folder or queue. A separate process then deals with these rejected records, either through manual review or a corrective pipeline. Nothing is silently lost.
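
A routing step of this kind might look like the sketch below. The required fields, the reference set of stores, and the reason strings are all hypothetical; a real pipeline would write the dead-letter entries to a folder or queue instead of keeping them in a list.

```python
REQUIRED_FIELDS = ("order_id", "store_id", "amount")
KNOWN_STORES = {"s1", "s2"}  # hypothetical reference data


def validate(record):
    """Return None if the record is valid, else a short rejection reason."""
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        return f"schema_mismatch: missing {', '.join(missing)}"
    if record["store_id"] not in KNOWN_STORES:
        return "missing_reference: unknown store_id"
    if not isinstance(record["amount"], (int, float)) or record["amount"] < 0:
        return "validation_failed: amount must be a non-negative number"
    return None


def route(records):
    """Split records into processable rows and dead-letter entries with reasons."""
    valid, dead_letter = [], []
    for record in records:
        reason = validate(record)
        if reason is None:
            valid.append(record)
        else:
            # Keep the original payload so it can be corrected and replayed.
            dead_letter.append({"record": record, "reason": reason})
    return valid, dead_letter


valid, dead_letter = route([
    {"order_id": 1, "store_id": "s1", "amount": 19.99},
    {"order_id": 2, "store_id": "s9", "amount": 5.00},
    {"order_id": 3, "store_id": "s2"},
])
print(len(valid), len(dead_letter))
```

Storing the rejection reason next to the original payload is what makes a later corrective run practical.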

A Reference Architecture for a Modern Azure Data Platform

This architecture covers the majority of enterprise Azure data engineering use cases.

Ingestion Layer:

  • ADF for batch ingestion from on-premises databases, cloud APIs, and file drops
  • Event Hubs for real-time event streaming from applications and IoT devices
  • Self-Hosted Integration Runtime for on-premises data sources behind firewalls

Storage Layer:

  • ADLS Gen2 with Bronze, Silver, Gold container structure
  • Delta format for all silver and gold data
  • Lifecycle management policies for cost-efficient long-term retention

Processing Layer:

  • ADF Data Flows for simple transformations
  • Azure Databricks with Delta Live Tables for complex transformations and streaming
  • Databricks Jobs orchestrated by ADF or Databricks Workflows

Serving Layer:

  • Azure Synapse Dedicated SQL Pool for high-concurrency BI queries
  • Synapse Serverless SQL Pool for ad-hoc exploration of the data lake
  • Azure SQL Database for operational reporting

Analytics Layer:

  • Power BI in Import mode connected to Synapse Dedicated Pool
  • Star schema semantic models optimized for self-service analytics

Governance and Security:

  • Microsoft Purview for data catalog and lineage
  • Azure Key Vault for all secrets and connection strings
  • Managed Identity for all service-to-service authentication
  • Private Endpoints for all production data services

Monitoring:

  • Azure Monitor with Log Analytics for centralized logging
  • Alert rules on pipeline failures routed to Teams and email
  • Data quality checks in every pipeline with metrics logged to a Delta table
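
The data quality checks in the last bullet can be sketched as a small function that returns one metrics record per batch. The specific checks (row count, duplicate keys, nulls in required fields) and all field names are assumptions chosen for illustration. In a real pipeline the returned dictionary would be appended as a row to the metrics Delta table; that write is omitted here.

```python
from datetime import datetime, timezone


def run_quality_checks(rows, key_field, required_fields):
    """Compute simple quality metrics for one batch of rows (a list of dicts)."""
    keys = [row.get(key_field) for row in rows]
    null_counts = {
        field: sum(1 for row in rows if row.get(field) is None)
        for field in required_fields
    }
    duplicate_keys = len(keys) - len(set(keys))
    return {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        "duplicate_keys": duplicate_keys,
        "null_counts": null_counts,
        # Pass only if the batch is non-empty, keys are unique, and nothing required is null.
        "passed": len(rows) > 0 and duplicate_keys == 0 and not any(null_counts.values()),
    }


batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 7.5},
]
metrics = run_quality_checks(batch, key_field="order_id", required_fields=["order_id", "amount"])
print(metrics["row_count"], metrics["duplicate_keys"], metrics["passed"])
```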

Key Points

  • Architecture decisions flow from business requirements — start with what the business needs, then choose the tools
  • Design every pipeline to be idempotent — safe to rerun without creating duplicate or inconsistent data
  • Lambda Architecture handles combined batch and real-time needs but adds significant complexity — validate the need before adopting it
  • Always design failure handling — retry logic, dead letter patterns, and circuit breakers prevent silent data loss
  • The Bronze/Silver/Gold storage pattern, Managed Identity, Private Endpoints, and Azure Monitor belong in every production architecture
  • A governance layer with Purview and Key Vault is not optional in enterprise environments — build it in from the start
