Azure Data Factory
Organizations collect data from dozens of different sources — ERP systems, CRM databases, flat files on FTP servers, REST APIs, and cloud storage accounts. Before this data can be analyzed, it must be extracted, transformed into a consistent format, and loaded into a destination like a data warehouse or data lake. Azure Data Factory (ADF) is a cloud-based ETL (Extract, Transform, Load) and data integration service that orchestrates and automates these data movement and transformation workflows at any scale.
What is Azure Data Factory?
Azure Data Factory is a serverless data integration service for building data pipelines — automated workflows that move data from source systems to destination systems, transforming it along the way. It ships with 90+ built-in connectors (Azure, AWS, Google Cloud, SaaS applications, on-premises databases) and provides a visual, low-code pipeline authoring experience.
Core ADF Components
Diagram – ADF Architecture
Sources                      Azure Data Factory                 Destinations
┌──────────────┐        ┌────────────────────────┐        ┌──────────────────┐
│ SQL Server   │        │  Pipeline              │        │ Azure Synapse    │
│ Oracle DB    │        │  ┌──────────────────┐  │        │ Azure Data Lake  │
│ Salesforce   │───────►│  │ Activities:      │  │───────►│ Azure SQL DB     │
│ SAP          │        │  │ Copy → Transform │  │        │ Blob Storage     │
│ S3 (AWS)     │        │  │ → Stored Proc    │  │        │ Power BI Dataset │
│ REST APIs    │        │  └──────────────────┘  │        └──────────────────┘
│ Excel files  │        │                        │
│ FTP server   │        │  Integration Runtime   │
└──────────────┘        │  (execution engine)    │
                        └────────────────────────┘
Pipeline
A pipeline is a logical grouping of activities that together perform a data integration task. A pipeline can contain one activity or a sequence of dozens of activities with conditional branching, loops, and error handling.
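Under the hood, every pipeline is a JSON document, and the Python management SDK (azure-mgmt-datafactory) exposes the same model. Below is a minimal sketch that creates a one-activity pipeline; the subscription ID, resource group, and factory name are placeholders, and later sketches in this section reuse adf_client, rg_name, and df_name from here.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

subscription_id = "<subscription-id>"  # placeholder
rg_name = "my-rg"                      # hypothetical resource group
df_name = "my-data-factory"            # hypothetical factory name

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# A pipeline is a named collection of activities. This one holds a single Wait
# activity so the structure is visible without any external dependencies.
pipeline = PipelineResource(
    activities=[WaitActivity(name="WaitOneMinute", wait_time_in_seconds=60)]
)
adf_client.pipelines.create_or_update(rg_name, df_name, "DemoPipeline", pipeline)
```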
Activity
An activity is a single step within a pipeline — the unit of work. ADF supports three categories of activities:
- Data Movement Activities: Copy data from one store to another. The Copy Activity is the most commonly used.
- Data Transformation Activities: Transform data using Mapping Data Flows, Azure Databricks notebooks, Azure HDInsight, Azure Functions, or stored procedures.
- Control Activities: Orchestrate the pipeline flow — If/Else conditions, ForEach loops, Until loops, Execute Pipeline (call another pipeline), Wait, and Set Variable. A control-flow sketch follows this list.
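As a concrete illustration of control flow, the hedged sketch below wraps a Wait activity in a ForEach loop and makes the loop depend on a first activity succeeding. It reuses adf_client, rg_name, and df_name from the earlier sketch; all names and the tableNames parameter are illustrative.

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency, Expression, ForEachActivity, ParameterSpecification,
    PipelineResource, WaitActivity,
)

first = WaitActivity(name="WarmUp", wait_time_in_seconds=5)

# ForEach iterates over a pipeline parameter (an array) and runs its inner
# activities once per element. depends_on gates it on the first activity.
loop = ForEachActivity(
    name="PerTable",
    items=Expression(value="@pipeline().parameters.tableNames"),
    activities=[WaitActivity(name="SimulateWork", wait_time_in_seconds=1)],
    depends_on=[ActivityDependency(activity="WarmUp",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(
    parameters={"tableNames": ParameterSpecification(type="Array")},
    activities=[first, loop],
)
adf_client.pipelines.create_or_update(rg_name, df_name, "ControlFlowDemo", pipeline)
```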
Linked Service
A Linked Service is a connection definition — like a connection string — to a data source or destination. Every source or destination used in a pipeline needs a linked service that contains the credentials and endpoint information. Linked Services can reference Azure Key Vault secrets so that credentials are never stored in the factory itself.
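A hedged sketch of a linked service whose password comes from Key Vault. The server, database, linked-service names, and secret name are placeholders; exact model fields depend on the azure-mgmt-datafactory version.

```python
from azure.mgmt.datafactory.models import (
    AzureKeyVaultSecretReference, AzureSqlDatabaseLinkedService,
    LinkedServiceReference, LinkedServiceResource,
)

# Reference to an already-defined Key Vault linked service (hypothetical name).
kv_ref = LinkedServiceReference(type="LinkedServiceReference",
                                reference_name="MyKeyVaultLS")

sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        # placeholder connection string, without the password
        connection_string="Server=tcp:myserver.database.windows.net;Database=sales;",
        # the password is pulled from Key Vault at runtime
        password=AzureKeyVaultSecretReference(store=kv_ref,
                                              secret_name="sql-password"),
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "AzureSqlLS", sql_ls)
```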
Dataset
A Dataset is a named reference to data within a linked service — it points to a specific table, file, or query result. Datasets define the structure and location of the data to be read or written.
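Continuing the sketch above, a dataset that names one specific table behind the AzureSqlLS linked service (dataset and table names are illustrative):

```python
from azure.mgmt.datafactory.models import (
    AzureSqlTableDataset, DatasetResource, LinkedServiceReference,
)

ds = DatasetResource(
    properties=AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureSqlLS"),
        table_name="SalesOrders",  # the specific table this dataset points at
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "SalesOrdersDS", ds)
```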
Integration Runtime (IR)
The Integration Runtime is the compute infrastructure that executes ADF activities. Three types exist:
| IR Type | Where It Runs | Best For |
|---|---|---|
| Azure IR | Fully managed by Microsoft in Azure | Cloud-to-cloud data movement and transformations |
| Self-Hosted IR | Installed on an on-premises server or VM | Moving data from on-premises databases to Azure |
| Azure-SSIS IR | Managed SSIS (SQL Server Integration Services) cluster | Lift-and-shift existing SSIS packages to Azure |
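For the Self-Hosted IR row in the table above, a hedged sketch of the registration step. Creating the resource only reserves the name in the factory; you still install the IR runtime on the on-premises server and link it with one of the authentication keys. Names are placeholders.

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="On-prem gateway")
)
adf_client.integration_runtimes.create_or_update(rg_name, df_name, "OnPremIR", ir)

# Fetch the auth keys to paste into the on-premises IR installer.
keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, "OnPremIR")
```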
Copy Activity
The Copy Activity is the workhorse of Azure Data Factory. It reads data from a source dataset and writes it to a sink (destination) dataset. It handles parallelism, error handling, schema mapping, and data type conversion automatically.
Copy Activity Example: SQL Server to Azure Data Lake
Source:
  Linked Service: OnPremSQLServer (Self-Hosted IR)
  Dataset: SalesOrders table
  Query: SELECT * FROM SalesOrders WHERE OrderDate >= '@{variables('lastRunDate')}'
Sink:
  Linked Service: AzureDataLakeStorage
  Dataset: sales-orders/year=2024/month=01/
  File format: Parquet (compressed, column-oriented)
Settings:
  Parallelism: 4 (read from 4 table partitions simultaneously)
  Fault tolerance: Skip incompatible rows (log errors to a file)
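The same copy can be defined programmatically. A hedged sketch using the SDK models, reusing adf_client from the first sketch: the two dataset names are hypothetical, and the exact model and field names (SqlServerSource, ParquetSink, parallel_copies, enable_skip_incompatible_row) depend on the azure-mgmt-datafactory version.

```python
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, PipelineResource, ParquetSink, SqlServerSource,
)

copy = CopyActivity(
    name="CopySalesOrders",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="SalesOrdersOnPremDS")],   # hypothetical
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="SalesOrdersParquetDS")], # hypothetical
    source=SqlServerSource(
        # ADF expression interpolation, evaluated at run time
        sql_reader_query="SELECT * FROM SalesOrders "
                         "WHERE OrderDate >= '@{variables('lastRunDate')}'"
    ),
    sink=ParquetSink(),
    parallel_copies=4,                  # read 4 partitions simultaneously
    enable_skip_incompatible_row=True,  # fault tolerance: skip and log bad rows
)
adf_client.pipelines.create_or_update(
    rg_name, df_name, "SqlToLakePipeline", PipelineResource(activities=[copy])
)
```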
Mapping Data Flows
Mapping Data Flows provide a visual, code-free interface for building complex data transformation logic. Transformations are defined visually and ADF translates them into Apache Spark jobs that run on a managed Spark cluster — no Spark knowledge required. (A conceptually equivalent PySpark sketch follows the table below.)
Common Data Flow Transformations
| Transformation | What It Does |
|---|---|
| Select | Choose, rename, or reorder columns |
| Filter | Keep only rows matching a condition |
| Derived Column | Add new columns or transform existing ones using expressions |
| Aggregate | GROUP BY with SUM, COUNT, AVG, MAX, MIN |
| Join | Combine two datasets on a common key (Inner, Left, Right, Full) |
| Lookup | Enrich rows by looking up matching values in a reference dataset |
| Pivot / Unpivot | Rotate rows to columns or columns to rows |
| Window | Calculate running totals, ranks, or moving averages over ordered partitions |
| Flatten | Expand arrays or complex nested JSON structures into rows |
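To make the table concrete, here is what a Filter → Derived Column → Aggregate flow does, expressed as roughly equivalent PySpark. This is not the code ADF generates (that is internal to the service); the spark session, input path, and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flow-sketch").getOrCreate()
orders = spark.read.parquet("/data/sales-orders/")  # hypothetical input path

result = (
    orders
    .filter(F.col("Status") == "Shipped")                  # Filter
    .withColumn("Revenue", F.col("Qty") * F.col("Price"))  # Derived Column
    .groupBy("Region")                                     # Aggregate
    .agg(F.sum("Revenue").alias("TotalRevenue"),
         F.count("*").alias("OrderCount"))
)
```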
Triggers
Pipelines can be executed by three types of triggers (a schedule-trigger sketch follows this list):
- Schedule Trigger: Run a pipeline on a defined schedule — daily at 2 AM, every 15 minutes, on the first day of the month.
- Tumbling Window Trigger: Process data in fixed, non-overlapping time windows in sequence — useful for backfilling historical data or regular batch processing.
- Storage Event Trigger: Fire a pipeline when a blob is created or deleted in Azure Blob Storage or Data Lake — enables event-driven pipelines.
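A hedged sketch of a schedule trigger that runs the earlier pipeline daily at 2 AM UTC. Model names follow the azure-mgmt-datafactory SDK; the trigger name, start time, and pipeline name are illustrative.

```python
from datetime import datetime, timezone
from azure.mgmt.datafactory.models import (
    PipelineReference, RecurrenceSchedule, ScheduleTrigger,
    ScheduleTriggerRecurrence, TriggerPipelineReference, TriggerResource,
)

trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day",
            interval=1,
            start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
            time_zone="UTC",
            schedule=RecurrenceSchedule(hours=[2], minutes=[0]),  # daily at 02:00
        ),
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference", reference_name="SqlToLakePipeline")
        )],
    )
)
adf_client.triggers.create_or_update(rg_name, df_name, "DailyAt2AM", trigger)
# Note: a trigger must also be started before it fires (begin_start in recent
# SDK versions).
```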
Monitoring and Alerting
The ADF Monitor hub shows the status of all pipeline runs, activity runs, and trigger runs. Each failed run shows the exact activity that failed and the full error message. Integration with Azure Monitor allows setting up alerts that send notifications when pipelines fail or run longer than expected.
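The same run history is available programmatically. A minimal sketch that queries the last 24 hours of pipeline runs, mirroring what the Monitor hub shows:

```python
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

now = datetime.now(timezone.utc)
filter_params = RunFilterParameters(last_updated_after=now - timedelta(days=1),
                                    last_updated_before=now)

runs = adf_client.pipeline_runs.query_by_factory(rg_name, df_name, filter_params)
for run in runs.value:
    # e.g. feed Failed runs into an alerting channel
    print(run.pipeline_name, run.status, run.run_id)
```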
ADF vs Azure Synapse Pipelines
Azure Synapse Analytics includes its own pipeline capability — Synapse Pipelines — that is nearly identical to ADF (same user interface, same connectors, same activities), though a few features such as the Azure-SSIS IR remain ADF-only. The main difference is context: ADF is standalone; Synapse Pipelines runs within the Synapse workspace alongside SQL pools, Spark pools, and Power BI — making it the preferred choice when building a complete analytics platform on Synapse.
Key Takeaways
- Azure Data Factory is a serverless ETL/ELT and data orchestration service for moving and transforming data between 90+ source and destination systems.
- Core components: Pipelines (workflow), Activities (steps), Linked Services (connections), Datasets (data references), and Integration Runtime (execution engine).
- Self-Hosted Integration Runtime enables secure data movement from on-premises databases to Azure.
- Mapping Data Flows provide visual, code-free transformation logic that runs on managed Apache Spark.
- Triggers execute pipelines on a schedule, on a tumbling window, or in response to storage events.
