Azure Data Factory
Organizations collect data from dozens of different sources — ERP systems, CRM databases, flat files on FTP servers, REST APIs, and cloud storage accounts. Before this data can be analyzed, it must be extracted, transformed into a consistent format, and loaded into a destination like a data warehouse or data lake. Azure Data Factory (ADF) is a cloud-based ETL (Extract, Transform, Load) and data integration service that orchestrates and automates these data movement and transformation workflows at any scale.
What is Azure Data Factory?
Azure Data Factory is a serverless data integration service for building data pipelines — automated workflows that move data from source systems to destination systems, transforming it along the way. It ships with 90+ built-in connectors (Azure, AWS, Google Cloud, SaaS applications, on-premises databases) and provides a visual, low-code pipeline authoring experience.
Core ADF Components
Diagram – ADF Architecture
Sources                      Azure Data Factory                 Destinations
┌──────────────┐        ┌────────────────────────┐        ┌──────────────────┐
│ SQL Server   │        │  Pipeline              │        │ Azure Synapse    │
│ Oracle DB    │        │  ┌──────────────────┐  │        │ Azure Data Lake  │
│ Salesforce   │───────►│  │ Activities:      │  │───────►│ Azure SQL DB     │
│ SAP          │        │  │ Copy → Transform │  │        │ Blob Storage     │
│ S3 (AWS)     │        │  │ → Stored Proc    │  │        │ Power BI Dataset │
│ REST APIs    │        │  └──────────────────┘  │        └──────────────────┘
│ Excel files  │        │                        │
│ FTP server   │        │  Integration Runtime   │
└──────────────┘        │  (execution engine)    │
                        └────────────────────────┘
Pipeline
A pipeline is a logical grouping of activities that together perform a data integration task. A pipeline can contain one activity or a sequence of dozens of activities with conditional branching, loops, and error handling.
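Under the hood, every pipeline is a JSON document, and the Python management SDK (azure-mgmt-datafactory) exposes the same model. Below is a minimal sketch that creates a one-activity pipeline; the subscription ID, resource group, and factory name are placeholders, and later sketches in this section reuse adf_client, rg_name, and df_name from here.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

subscription_id = "<subscription-id>"  # placeholder
rg_name = "my-rg"                      # hypothetical resource group
df_name = "my-data-factory"            # hypothetical factory name

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# A pipeline is a named collection of activities. This one holds a single Wait
# activity so the structure is visible without any external dependencies.
pipeline = PipelineResource(
    activities=[WaitActivity(name="WaitOneMinute", wait_time_in_seconds=60)]
)
adf_client.pipelines.create_or_update(rg_name, df_name, "DemoPipeline", pipeline)
```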
Activity
An activity is a single step within a pipeline — the unit of work. ADF supports three categories of activities:
- Data Movement Activities: Copy data from one store to another. The Copy Activity is the most commonly used.
- Data Transformation Activities: Transform data using Mapping Data Flows, Azure Databricks notebooks, Azure HDInsight, Azure Functions, or stored procedures.
- Control Activities: Orchestrate the pipeline flow — If/Else conditions, ForEach loops, Until loops, Execute Pipeline (call another pipeline), Wait, and Set Variable. A control-flow sketch follows this list.
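As a concrete illustration of control flow, the hedged sketch below wraps a Wait activity in a ForEach loop and makes the loop depend on a first activity succeeding. It reuses adf_client, rg_name, and df_name from the earlier sketch; all names and the tableNames parameter are illustrative.

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency, Expression, ForEachActivity, ParameterSpecification,
    PipelineResource, WaitActivity,
)

first = WaitActivity(name="WarmUp", wait_time_in_seconds=5)

# ForEach iterates over a pipeline parameter (an array) and runs its inner
# activities once per element. depends_on gates it on the first activity.
loop = ForEachActivity(
    name="PerTable",
    items=Expression(value="@pipeline().parameters.tableNames"),
    activities=[WaitActivity(name="SimulateWork", wait_time_in_seconds=1)],
    depends_on=[ActivityDependency(activity="WarmUp",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(
    parameters={"tableNames": ParameterSpecification(type="Array")},
    activities=[first, loop],
)
adf_client.pipelines.create_or_update(rg_name, df_name, "ControlFlowDemo", pipeline)
```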
Linked Service
A Linked Service is a connection definition — like a connection string — to a data source or destination. Every source or destination used in a pipeline needs a linked service that contains the credentials and endpoint information. Linked Services can reference Azure Key Vault secrets so that credentials are never stored in the factory itself.
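A hedged sketch of a linked service whose password comes from Key Vault. The server, database, linked-service names, and secret name are placeholders; exact model fields depend on the azure-mgmt-datafactory version.

```python
from azure.mgmt.datafactory.models import (
    AzureKeyVaultSecretReference, AzureSqlDatabaseLinkedService,
    LinkedServiceReference, LinkedServiceResource,
)

# Reference to an already-defined Key Vault linked service (hypothetical name).
kv_ref = LinkedServiceReference(type="LinkedServiceReference",
                                reference_name="MyKeyVaultLS")

sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        # placeholder connection string, without the password
        connection_string="Server=tcp:myserver.database.windows.net;Database=sales;",
        # the password is pulled from Key Vault at runtime
        password=AzureKeyVaultSecretReference(store=kv_ref,
                                              secret_name="sql-password"),
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "AzureSqlLS", sql_ls)
```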
Dataset
A Dataset is a named reference to data within a linked service — it points to a specific table, file, or query result. Datasets define the structure and location of the data to be read or written.
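Continuing the sketch above, a dataset that names one specific table behind the AzureSqlLS linked service (dataset and table names are illustrative):

```python
from azure.mgmt.datafactory.models import (
    AzureSqlTableDataset, DatasetResource, LinkedServiceReference,
)

ds = DatasetResource(
    properties=AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureSqlLS"),
        table_name="SalesOrders",  # the specific table this dataset points at
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "SalesOrdersDS", ds)
```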
Integration Runtime (IR)
The Integration Runtime is the compute infrastructure that executes ADF activities. Three types exist:
| IR Type | Where It Runs | Best For |
|---|---|---|
| Azure IR | Fully managed by Microsoft in Azure | Cloud-to-cloud data movement and transformations |
| Self-Hosted IR | Installed on an on-premises server or VM | Moving data from on-premises databases to Azure |
| Azure-SSIS IR | Managed SSIS (SQL Server Integration Services) cluster | Lift-and-shift existing SSIS packages to Azure |
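For the Self-Hosted IR row in the table above, a hedged sketch of the registration step. Creating the resource only reserves the name in the factory; you still install the IR runtime on the on-premises server and link it with one of the authentication keys. Names are placeholders.

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="On-prem gateway")
)
adf_client.integration_runtimes.create_or_update(rg_name, df_name, "OnPremIR", ir)

# Fetch the auth keys to paste into the on-premises IR installer.
keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, "OnPremIR")
```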
Copy Activity
The Copy Activity is the workhorse of Azure Data Factory. It reads data from a source dataset and writes it to a sink (destination) dataset. It handles parallelism, error handling, schema mapping, and data type conversion automatically.
Copy Activity Example: SQL Server to Azure Data Lake
Source:
  Linked Service: OnPremSQLServer (Self-Hosted IR)
  Dataset: SalesOrders table
  Query: SELECT * FROM SalesOrders WHERE OrderDate >= '@{variables('lastRunDate')}'
Sink:
  Linked Service: AzureDataLakeStorage
  Dataset: sales-orders/year=2024/month=01/
  File format: Parquet (compressed, column-oriented)
Settings:
  Parallelism: 4 (read from 4 table partitions simultaneously)
  Fault tolerance: Skip incompatible rows (log errors to a file)
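The same copy can be defined programmatically. A hedged sketch using the SDK models, reusing adf_client from the first sketch: the two dataset names are hypothetical, and the exact model and field names (SqlServerSource, ParquetSink, parallel_copies, enable_skip_incompatible_row) depend on the azure-mgmt-datafactory version.

```python
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, PipelineResource, ParquetSink, SqlServerSource,
)

copy = CopyActivity(
    name="CopySalesOrders",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="SalesOrdersOnPremDS")],   # hypothetical
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="SalesOrdersParquetDS")], # hypothetical
    source=SqlServerSource(
        # ADF expression interpolation, evaluated at run time
        sql_reader_query="SELECT * FROM SalesOrders "
                         "WHERE OrderDate >= '@{variables('lastRunDate')}'"
    ),
    sink=ParquetSink(),
    parallel_copies=4,                  # read 4 partitions simultaneously
    enable_skip_incompatible_row=True,  # fault tolerance: skip and log bad rows
)
adf_client.pipelines.create_or_update(
    rg_name, df_name, "SqlToLakePipeline", PipelineResource(activities=[copy])
)
```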
Mapping Data Flows
Mapping Data Flows provide a visual, code-free interface for building complex data transformation logic. Transformations are defined visually and ADF translates them into Apache Spark jobs that run on a managed Spark cluster — no Spark knowledge required. (A conceptually equivalent PySpark sketch follows the table below.)
Common Data Flow Transformations
| Transformation | What It Does |
|---|---|
| Select | Choose, rename, or reorder columns |
| Filter | Keep only rows matching a condition |
| Derived Column | Add new columns or transform existing ones using expressions |
| Aggregate | GROUP BY with SUM, COUNT, AVG, MAX, MIN |
| Join | Combine two datasets on a common key (Inner, Left, Right, Full) |
| Lookup | Enrich rows by looking up matching values in a reference dataset |
| Pivot / Unpivot | Rotate rows to columns or columns to rows |
| Window | Calculate running totals, ranks, or moving averages over ordered partitions |
| Flatten | Expand arrays or complex nested JSON structures into rows |
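To make the table concrete, here is what a Filter → Derived Column → Aggregate flow does, expressed as roughly equivalent PySpark. This is not the code ADF generates (that is internal to the service); the spark session, input path, and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flow-sketch").getOrCreate()
orders = spark.read.parquet("/data/sales-orders/")  # hypothetical input path

result = (
    orders
    .filter(F.col("Status") == "Shipped")                  # Filter
    .withColumn("Revenue", F.col("Qty") * F.col("Price"))  # Derived Column
    .groupBy("Region")                                     # Aggregate
    .agg(F.sum("Revenue").alias("TotalRevenue"),
         F.count("*").alias("OrderCount"))
)
```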
Triggers
Pipelines can be executed by three types of triggers (a schedule-trigger sketch follows this list):
- Schedule Trigger: Run a pipeline on a defined schedule — daily at 2 AM, every 15 minutes, on the first day of the month.
- Tumbling Window Trigger: Process data in fixed, non-overlapping time windows in sequence — useful for backfilling historical data or regular batch processing.
- Storage Event Trigger: Fire a pipeline when a blob is created or deleted in Azure Blob Storage or Data Lake — enables event-driven pipelines.
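A hedged sketch of a schedule trigger that runs the earlier pipeline daily at 2 AM UTC. Model names follow the azure-mgmt-datafactory SDK; the trigger name, start time, and pipeline name are illustrative.

```python
from datetime import datetime, timezone
from azure.mgmt.datafactory.models import (
    PipelineReference, RecurrenceSchedule, ScheduleTrigger,
    ScheduleTriggerRecurrence, TriggerPipelineReference, TriggerResource,
)

trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day",
            interval=1,
            start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
            time_zone="UTC",
            schedule=RecurrenceSchedule(hours=[2], minutes=[0]),  # daily at 02:00
        ),
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference", reference_name="SqlToLakePipeline")
        )],
    )
)
adf_client.triggers.create_or_update(rg_name, df_name, "DailyAt2AM", trigger)
# Note: a trigger must also be started before it fires (begin_start in recent
# SDK versions).
```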
Monitoring and Alerting
The ADF Monitor hub shows the status of all pipeline runs, activity runs, and trigger runs. Each failed run shows the exact activity that failed and the full error message. Integration with Azure Monitor allows setting up alerts that send notifications when pipelines fail or run longer than expected.
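The same run history is available programmatically. A minimal sketch that queries the last 24 hours of pipeline runs, mirroring what the Monitor hub shows:

```python
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

now = datetime.now(timezone.utc)
filter_params = RunFilterParameters(last_updated_after=now - timedelta(days=1),
                                    last_updated_before=now)

runs = adf_client.pipeline_runs.query_by_factory(rg_name, df_name, filter_params)
for run in runs.value:
    # e.g. feed Failed runs into an alerting channel
    print(run.pipeline_name, run.status, run.run_id)
```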
ADF vs Azure Synapse Pipelines
Azure Synapse Analytics includes its own pipeline capability — Synapse Pipelines — that is nearly identical to ADF (same user interface, same connectors, same activities), though a few features such as the Azure-SSIS IR remain ADF-only. The main difference is context: ADF is standalone; Synapse Pipelines runs within the Synapse workspace alongside SQL pools, Spark pools, and Power BI — making it the preferred choice when building a complete analytics platform on Synapse.
Key Takeaways
- Azure Data Factory is a serverless ETL/ELT and data orchestration service for moving and transforming data between 90+ source and destination systems.
- Core components: Pipelines (workflow), Activities (steps), Linked Services (connections), Datasets (data references), and Integration Runtime (execution engine).
- Self-Hosted Integration Runtime enables secure data movement from on-premises databases to Azure.
- Mapping Data Flows provide visual, code-free transformation logic that runs on managed Apache Spark.
- Triggers execute pipelines on a schedule, on a tumbling window, or in response to storage events.
