ADF Data Flows and Transformations

Moving data from one place to another is only half the job. Real-world data needs cleaning, reshaping, and enriching before it becomes useful. Azure Data Factory's Mapping Data Flows let you do all of this transformation work visually — without writing code.

What is a Mapping Data Flow

A Mapping Data Flow is a visual transformation tool inside ADF. You build transformation logic by connecting blocks on a canvas. Behind the scenes, ADF compiles your visual design into Apache Spark code and runs it on an automatically managed Spark cluster. You get the power of Spark without writing a single line of Python or Scala.

Think of a Mapping Data Flow as building a water purification system with LEGO bricks. Each brick cleans or changes the water in a specific way. You snap the bricks together in the right order, and clean water comes out at the end.

Data Flow vs Copy Activity — When to Use Which

Scenario	Use This
Move data without changing it	Copy Activity
Filter rows, rename columns, change data types	Mapping Data Flow
Join two datasets together	Mapping Data Flow
Aggregate data (sum, count, average by group)	Mapping Data Flow
Deduplicate rows	Mapping Data Flow
Complex multi-step transformations	Mapping Data Flow or Databricks Notebook

Core Transformation Types in Mapping Data Flows

Source

Every Data Flow starts with a Source transformation. This connects to your input data — a file in ADLS Gen2, a table in Azure SQL, or the output of another step. You can have multiple Source transformations in one Data Flow to join different datasets.

Select

The Select transformation chooses which columns to keep, renames them, and changes their order. If your source file has 50 columns but you only need 8, Select removes the other 42. This reduces the amount of data processed in subsequent steps — which saves time and money.

Filter

Filter removes rows that do not meet a condition. Example: Keep only rows where the order status is "Completed". Rows where status is "Cancelled" or "Pending" are dropped.

The condition uses an expression language similar to SQL:

status == 'Completed' && order_amount > 0

Derived Column

Derived Column creates new columns or modifies existing ones using expressions. This is where most transformation logic lives.

Common uses:

Combine first name and last name into a full name: firstName + ' ' + lastName
Convert a string date to a proper date type: toDate(order_date, 'yyyy-MM-dd')
Calculate profit margin: (revenue - cost) / revenue * 100
Upper-case a column: upper(product_name)
Replace null values: iifNull(discount, 0)

Aggregate

Aggregate groups rows and calculates summary values — just like GROUP BY in SQL.

Example: Group by region and calculate total sales per region:

Group by: region
Aggregations: total_sales = sum(order_amount), order_count = count(order_id)

Join

Join combines two streams of data based on a matching key — like a JOIN in SQL. ADF supports Inner Join, Left Outer, Right Outer, Full Outer, and Cross Join.

Example: Join an orders dataset with a customers dataset on customer_id to add customer names and cities to each order record.

Lookup

Lookup is similar to Join but returns data from a reference table based on a matching key. It is commonly used to enrich data with values from a small reference table — like mapping product codes to product names.

Union

Union stacks rows from multiple datasets on top of each other. All datasets must have the same column structure. Use this when you have the same type of data in multiple files — like monthly sales files that need to be combined into one annual dataset.

Pivot and Unpivot

Pivot converts row values into column names. Unpivot does the reverse — converts column names into row values.

Example of Pivot: You have sales data with rows for each month. Pivot converts it so each month becomes a column, giving you one row per product with 12 sales columns.

Conditional Split

Conditional Split routes rows to different output streams based on conditions. It is the Data Flow equivalent of an IF-ELSE statement.

Example: Route rows to three outputs based on region — one stream for Asia, one for Europe, one for Americas. Each stream writes to a different folder in ADLS Gen2.

Exists

Exists checks whether a row from one stream has a matching row in another stream. Use it to find new records (rows that exist in the source but not in the destination) or deleted records (rows that exist in the destination but not in the source).

Sink

Every Data Flow ends with one or more Sink transformations. The Sink writes the transformed data to a destination — ADLS Gen2, Azure SQL, Synapse, or another storage system.

Key Sink settings:

Write behavior: Overwrite, append, or upsert (insert new rows, update existing ones)
File naming: Use a pattern to control output file names
Partitioning: Control how many output files are created

Data Flow Debug Mode

Before running a full Data Flow, you use Debug Mode to test it. Debug Mode starts a small Spark cluster (takes about 2-3 minutes to start). Once running, you can preview data at any transformation step and check that your logic is correct.

Always test with a small data sample in Debug Mode before running against millions of rows. Debug Mode keeps a cluster alive and charges per hour — turn it off when not actively testing.

A Complete Example — Clean and Enrich Sales Data

Imagine raw sales data arrives daily in ADLS Gen2. The file has duplicate rows, inconsistent date formats, and missing customer names. Here is a Data Flow that fixes all of this.

Source: Read the raw CSV from bronze/sales/
Select: Keep only order_id, customer_id, order_date, amount, status
Filter: Remove rows where amount is null or negative
Derived Column: Convert order_date string to proper date type; calculate year and month columns
Aggregate: Deduplicate by keeping max(amount) per order_id
Lookup: Join with a customers reference table on customer_id to add customer_name and city
Conditional Split: Route high-value orders (amount > 10000) to one output, regular orders to another
Sink (x2): Write high-value orders to silver/sales/high_value/ and regular orders to silver/sales/regular/

Key Points

Mapping Data Flows transform data visually — no coding required
ADF compiles Data Flows into Spark jobs automatically
Use Select early in the flow to drop unnecessary columns and reduce processing time
Derived Column handles most common transformations — type conversion, null handling, string manipulation
Always test with Debug Mode on a small sample before processing full datasets
Turn off Debug Mode when not in use to avoid unnecessary compute charges

Previous lesson

Back to course

Next lesson