ADE Azure Data Factory Basics

Data rarely lives in one place. A company might store sales data in an on-premises Oracle database, customer data in Salesforce, and inventory data in a REST API. Azure Data Factory — called ADF — is the service that connects to all of these sources, moves the data, and transforms it into a usable format. It is Azure's core data integration and orchestration tool.

What Azure Data Factory Does

ADF is a cloud-based ETL (Extract, Transform, Load) service. It extracts data from sources, optionally transforms it, and loads it into a destination.

Think of ADF as a postal system for data. You have packages (data) sitting at different addresses (source systems). ADF picks them up, sorts them (transforms), and delivers them to the right destination (data lake, database, warehouse).

ADF is primarily a low-code, visual tool. You design pipelines by dragging and dropping activities on a canvas. Behind the canvas, every pipeline is stored as a JSON definition, which you can edit directly for advanced configurations and keep in version control.

Core Components of Azure Data Factory

Pipelines

A pipeline is a logical grouping of activities that together perform a data movement or transformation task. A pipeline is the main unit of work in ADF. You schedule it, monitor it, and trigger it.

Example: A pipeline named "Load_Daily_Sales" runs every morning at 6 AM. It copies yesterday's sales file from an FTP server, renames it with the correct date, and saves it to the bronze layer in ADLS Gen2.
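
As a rough illustration of how such a pipeline looks behind the canvas, here is a minimal sketch of its JSON definition, written as a Python dictionary. The activity name "CopySalesFile" is a hypothetical placeholder, and the activity is deliberately left incomplete (a deployable Copy activity also needs inputs, outputs, and type properties).

```python
import json

# Hypothetical skeleton of the "Load_Daily_Sales" pipeline definition.
# Only the overall shape is shown: a named pipeline whose properties
# hold a list of activities.
load_daily_sales = {
    "name": "Load_Daily_Sales",
    "properties": {
        "description": "Copy yesterday's sales file into the bronze layer",
        "activities": [
            # Placeholder activity; a real Copy activity needs more settings.
            {"name": "CopySalesFile", "type": "Copy"},
        ],
    },
}


def activity_names(pipeline):
    """Return the names of all activities grouped in a pipeline definition."""
    return [activity["name"] for activity in pipeline["properties"]["activities"]]


# Serialize to the JSON form that would be kept in version control.
pipeline_json = json.dumps(load_daily_sales, indent=2)
print(pipeline_json)
```

The helper only reads the dictionary; it does not talk to Azure.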

Activities

Activities are the individual steps inside a pipeline. Each activity does one specific thing.

Activity Type	What It Does	Example
Copy Activity	Moves data from source to destination	Copy CSV from FTP to ADLS Gen2
Data Flow Activity	Transforms data visually (no code)	Join two tables, filter rows, rename columns
Lookup Activity	Reads a value to use in later activities	Get last processed date from a control table
Get Metadata Activity	Retrieves properties of a file or folder	Check if a file exists before processing it
Notebook Activity	Runs a Databricks notebook	Execute Python transformation code in Databricks
Stored Procedure Activity	Runs a stored procedure in a database	Call a SQL procedure to merge staged data
ForEach Activity	Loops over a list and runs activities for each item	Process 12 monthly files in a loop
If Condition Activity	Branches execution based on a condition	If file count is zero, skip processing
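
Activities are chained through dependencies. The sketch below shows, under assumptions, how a Get Metadata Activity could feed an If Condition Activity in pipeline JSON (modeled here as Python dictionaries). The names "CheckFileExists", "IfFileExists", "CopySalesFile", and "SalesCsvSource" are hypothetical, and the nested Copy activity is left as an incomplete placeholder.

```python
# Get Metadata: ask whether the file behind a dataset exists.
check_file = {
    "name": "CheckFileExists",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {"referenceName": "SalesCsvSource", "type": "DatasetReference"},
        "fieldList": ["exists"],
    },
}

# If Condition: runs only after the metadata check succeeds, then branches
# on the "exists" value returned by that activity.
branch = {
    "name": "IfFileExists",
    "type": "IfCondition",
    "dependsOn": [
        {"activity": "CheckFileExists", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "expression": {
            "value": "@activity('CheckFileExists').output.exists",
            "type": "Expression",
        },
        # Placeholder Copy activity; a real one needs inputs and outputs.
        "ifTrueActivities": [{"name": "CopySalesFile", "type": "Copy"}],
        "ifFalseActivities": [],
    },
}


def upstream(activity):
    """List the activities that must finish before this one starts."""
    return [dep["activity"] for dep in activity.get("dependsOn", [])]


print(upstream(branch))
```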

Linked Services

A Linked Service is a connection definition. It stores the connection string, credentials, and settings needed to connect ADF to an external system. Think of a Linked Service as a contact in your phone — it holds the address and authentication details so you do not have to enter them every time.

ADF supports over 90 built-in connectors, including Azure SQL, Oracle, Snowflake, SAP, Amazon S3, Google BigQuery, REST APIs, FTP servers, and more.
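
To make the idea concrete, here is a hedged sketch of two Linked Service definitions as Python dictionaries, one for Blob Storage and one for ADLS Gen2. The Linked Service names and the storage account names in the URLs are hypothetical. No key or secret is included, on the assumption that the factory's managed identity is used for authentication.

```python
def blob_linked_service(name, account):
    """Sketch of a Blob Storage Linked Service that relies on managed identity."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
                # Endpoint only, no account key: identity-based access is assumed.
                "serviceEndpoint": f"https://{account}.blob.core.windows.net/",
            },
        },
    }


def adls_gen2_linked_service(name, account):
    """Sketch of an ADLS Gen2 Linked Service that relies on managed identity."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobFS",
            "typeProperties": {
                "url": f"https://{account}.dfs.core.windows.net",
            },
        },
    }


# Hypothetical names and accounts, for illustration only.
source_ls = blob_linked_service("SourceBlobStorage", "examplesourceacct")
sink_ls = adls_gen2_linked_service("LakeStorage", "examplelakeacct")
print(source_ls["properties"]["type"], sink_ls["properties"]["type"])
```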

Datasets

A Dataset represents the specific data structure inside a connected system. It points to a table in a database, a file in storage, or a specific API endpoint. A Dataset always depends on a Linked Service — the Linked Service provides the connection, and the Dataset specifies what data to access.
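
The dependency on a Linked Service shows up directly in the Dataset definition. Below is a minimal sketch of a delimited-text (CSV) Dataset as a Python dictionary. The Dataset name, Linked Service name, container, and file name are all hypothetical.

```python
# Sketch of a CSV Dataset stored in Blob Storage.
sales_csv_source = {
    "name": "SalesCsvSource",
    "properties": {
        "type": "DelimitedText",
        # The connection comes from the Linked Service referenced here.
        "linkedServiceName": {
            "referenceName": "SourceBlobStorage",
            "type": "LinkedServiceReference",
        },
        # The Dataset itself only says which data to access and how to read it.
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "sales",
                "fileName": "daily_sales.csv",
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}


def linked_service_of(dataset):
    """Return the name of the Linked Service a Dataset depends on."""
    return dataset["properties"]["linkedServiceName"]["referenceName"]


print(linked_service_of(sales_csv_source))
```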

Integration Runtime

The Integration Runtime (IR) is the compute engine that executes ADF activities. There are three types:

Azure IR: Fully managed by Microsoft. Used for cloud-to-cloud data movement. No setup required.
Self-Hosted IR: Installed on your own machine or on-premises server. Used when the source data is inside a company's private network — like an on-premises SQL Server behind a firewall.
Azure-SSIS IR: Runs SQL Server Integration Services (SSIS) packages in the cloud. Used for migrating existing SSIS workloads to Azure.
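
A Linked Service chooses its Integration Runtime through a connectVia reference. The sketch below assumes a Self-Hosted IR registered under the hypothetical name "SelfHostedIR" and an on-premises SQL Server Linked Service with a placeholder connection string. When connectVia is omitted, the default Azure IR (named AutoResolveIntegrationRuntime) is used.

```python
DEFAULT_AZURE_IR = "AutoResolveIntegrationRuntime"

# Sketch of an on-premises SQL Server Linked Service routed through a
# Self-Hosted IR. All names and the connection string are placeholders.
on_prem_sql = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "connectVia": {
            "referenceName": "SelfHostedIR",
            "type": "IntegrationRuntimeReference",
        },
        "typeProperties": {
            "connectionString": "<connection string placeholder>",
        },
    },
}

# A cloud Linked Service with no connectVia falls back to the Azure IR.
cloud_storage = {
    "name": "LakeStorage",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {"url": "https://examplelakeacct.dfs.core.windows.net"},
    },
}


def runtime_for(linked_service):
    """Return the Integration Runtime a Linked Service would run on."""
    connect_via = linked_service["properties"].get("connectVia")
    return connect_via["referenceName"] if connect_via else DEFAULT_AZURE_IR


print(runtime_for(on_prem_sql), runtime_for(cloud_storage))
```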

Building Your First Pipeline — A Step-by-Step Walkthrough

This example builds a pipeline that copies a sales CSV file from Azure Blob Storage to ADLS Gen2.

Step 1 — Create Linked Services

Create two Linked Services — one pointing to the source Blob Storage and one pointing to the ADLS Gen2 destination. Use Managed Identity for authentication in both.

Step 2 — Create Datasets

Create a source Dataset pointing to the specific CSV file in Blob Storage. Create a sink Dataset pointing to the destination folder in ADLS Gen2.

Step 3 — Create a Pipeline and Add Copy Activity

Open ADF Studio, create a new pipeline, and drag a Copy Activity onto the canvas. The source and sink Datasets are set in the activity's settings, covered in the next step.

Step 4 — Configure the Copy Activity

In the Source tab, select the source Dataset. In the Sink tab, select the destination Dataset. In the Mapping tab, map source columns to destination columns if needed.

Step 5 — Test and Publish

Click Debug to run the pipeline in test mode. Check the output to confirm data moved correctly. Click Publish All to save and deploy the pipeline.
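
For reference, the pipeline built in the steps above would be stored as a JSON definition roughly like the following sketch, shown as a Python dictionary. The pipeline, activity, and Dataset names are hypothetical, and the exact settings ADF Studio generates may differ.

```python
import json

# Sketch of a pipeline with one Copy activity from Blob Storage to ADLS Gen2.
copy_pipeline = {
    "name": "CopySalesCsvToLake",
    "properties": {
        "activities": [
            {
                "name": "CopySalesCsv",
                "type": "Copy",
                # Source and sink Datasets from Step 2 (hypothetical names).
                "inputs": [
                    {"referenceName": "SalesCsvSource", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "SalesCsvSink", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {
                        "type": "DelimitedTextSource",
                        "storeSettings": {"type": "AzureBlobStorageReadSettings"},
                    },
                    "sink": {
                        "type": "DelimitedTextSink",
                        "storeSettings": {"type": "AzureBlobFSWriteSettings"},
                        "formatSettings": {
                            "type": "DelimitedTextWriteSettings",
                            "fileExtension": ".csv",
                        },
                    },
                },
            }
        ]
    },
}


def dataset_refs(pipeline):
    """Collect every Dataset name referenced by the pipeline's activities."""
    refs = set()
    for activity in pipeline["properties"]["activities"]:
        for ref in activity.get("inputs", []) + activity.get("outputs", []):
            refs.add(ref["referenceName"])
    return refs


print(json.dumps(copy_pipeline, indent=2))
print(sorted(dataset_refs(copy_pipeline)))
```

Listing the referenced Datasets is a quick way to confirm that everything created in Step 2 exists before publishing.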

Triggers — Scheduling and Automating Pipelines

Triggers determine when a pipeline runs automatically.

Schedule Trigger: Runs a pipeline on a fixed schedule — every day at 6 AM, every hour, every Monday. Like a phone alarm clock.
Tumbling Window Trigger: Fires at fixed intervals from a set start time, with each run tied to its own fixed-size, non-overlapping time window. Because the windows are contiguous, past windows can be backfilled and failed windows rerun. If a run fails and you need to reprocess the last 7 days, a Tumbling Window Trigger makes this easy.
Event-Based Trigger (Storage Events): Fires when a file is created or deleted in ADLS Gen2 or Blob Storage. Useful for near-real-time ingestion, where the file is processed shortly after it lands.
Manual Trigger: You click Run to start the pipeline. Useful for one-time loads or testing.
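
Two short sketches may help here. The first is a Schedule Trigger definition (as a Python dictionary) that runs a pipeline daily at 6 AM; the trigger name, start time, and time zone are assumptions. The second is a plain Python illustration of what "tumbling windows" means: contiguous, non-overlapping intervals. It is not ADF code, only a model of the concept.

```python
from datetime import datetime, timedelta

# Sketch of a Schedule Trigger that starts "Load_Daily_Sales" every day at 06:00.
schedule_trigger = {
    "name": "DailyAt6AM",  # hypothetical trigger name
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T00:00:00Z",  # assumed start date
                "timeZone": "UTC",
                "schedule": {"hours": [6], "minutes": [0]},
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "Load_Daily_Sales",
                    "type": "PipelineReference",
                }
            }
        ],
    },
}


def tumbling_windows(start, end, size):
    """Split [start, end) into back-to-back windows of a fixed size."""
    windows = []
    cursor = start
    while cursor + size <= end:
        windows.append((cursor, cursor + size))
        cursor += size
    return windows


# Reprocessing a 7-day span yields one window per day, with no gaps or overlap.
last_week = tumbling_windows(
    datetime(2024, 1, 1), datetime(2024, 1, 8), timedelta(days=1)
)
print(len(last_week))
```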

Parameters and Variables

Hardcoding file names and dates inside a pipeline makes it brittle. Parameters and variables make pipelines reusable and dynamic.

Parameters are values passed into a pipeline from outside — from a trigger, from a parent pipeline, or from a manual run. Example: Pass the date as a parameter so the pipeline always processes the correct day's data.

Variables are values that exist inside the pipeline and can change as activities run. Example: Set a variable to "Success" after a copy activity completes, then check that variable in a later condition.
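
The sketch below shows how a parameter and a variable might appear in a pipeline definition, again as a Python dictionary. The parameter name "runDate", the variable name "copyStatus", the Dataset parameter "fileName", and the file naming pattern are all hypothetical. The small Python function only mirrors what the ADF expression would evaluate to.

```python
# ADF expression that builds the file name from the pipeline parameter.
FILE_NAME_EXPRESSION = "@concat('sales_', pipeline().parameters.runDate, '.csv')"

parameterized_pipeline = {
    "name": "Load_Daily_Sales",
    "properties": {
        # Parameter: supplied from outside (trigger, parent pipeline, manual run).
        "parameters": {"runDate": {"type": "String"}},
        # Variable: lives inside the pipeline and can be changed by activities.
        "variables": {"copyStatus": {"type": "String", "defaultValue": "Pending"}},
        "activities": [
            {
                "name": "CopySalesFile",
                "type": "Copy",
                # The expression is passed to a (hypothetical) Dataset parameter.
                "inputs": [
                    {
                        "referenceName": "SalesCsvSource",
                        "type": "DatasetReference",
                        "parameters": {
                            "fileName": {
                                "value": FILE_NAME_EXPRESSION,
                                "type": "Expression",
                            }
                        },
                    }
                ],
            },
            {
                "name": "MarkSuccess",
                "type": "SetVariable",
                "dependsOn": [
                    {"activity": "CopySalesFile", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {"variableName": "copyStatus", "value": "Success"},
            },
        ],
    },
}


def resolve_file_name(run_date):
    """Python equivalent of FILE_NAME_EXPRESSION, for illustration only."""
    return f"sales_{run_date}.csv"


print(resolve_file_name("2024-01-15"))
```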

Monitoring Pipelines

ADF has a built-in Monitor tab that shows every pipeline run, its status, start time, duration, and which activity failed if something went wrong. For production pipelines, set up alerts in Azure Monitor to send email notifications on failure, or route them to Teams through a webhook or Logic App.
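
As a small illustration of the kind of information involved, the sketch below filters a list of pipeline run records for failures. The records are made-up sample data, and the field names (pipelineName, runId, status, durationInMs, message) are assumed to match what the service reports for a run. It does not call Azure.

```python
# Hypothetical run records, shaped like the details shown for each pipeline run.
sample_runs = [
    {"pipelineName": "Load_Daily_Sales", "runId": "run-001",
     "status": "Succeeded", "durationInMs": 42000, "message": ""},
    {"pipelineName": "Load_Daily_Sales", "runId": "run-002",
     "status": "Failed", "durationInMs": 3000, "message": "Source file not found"},
    {"pipelineName": "Load_Inventory", "runId": "run-003",
     "status": "InProgress", "durationInMs": 15000, "message": ""},
]


def failed_runs(runs):
    """Return (pipeline, run id, message) for every run that failed."""
    return [
        (run["pipelineName"], run["runId"], run["message"])
        for run in runs
        if run["status"] == "Failed"
    ]


for pipeline_name, run_id, message in failed_runs(sample_runs):
    print(f"ALERT: {pipeline_name} ({run_id}) failed: {message}")
```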

Key Points

ADF is Azure's data integration service — it moves and transforms data between systems
A Pipeline contains Activities; Activities use Linked Services and Datasets
Use Azure IR for cloud-to-cloud movement; use Self-Hosted IR for on-premises sources
Use parameters to make pipelines reusable across different dates and file names
Schedule Trigger for time-based runs; Event Trigger for file-arrival-based runs
Always monitor production pipelines and set up failure alerts
