ADF Data Flows and Transformations
Moving data from one place to another is only half the job. Real-world data needs cleaning, reshaping, and enriching before it becomes useful. Azure Data Factory's Mapping Data Flows let you do all of this transformation work visually — without writing code.
What is a Mapping Data Flow
A Mapping Data Flow is a visual transformation tool inside ADF. You build transformation logic by connecting blocks on a canvas. Behind the scenes, ADF compiles your visual design into Apache Spark code and runs it on an automatically managed Spark cluster. You get the power of Spark without writing a single line of Python or Scala.
Think of a Mapping Data Flow as building a water purification system with LEGO bricks. Each brick cleans or changes the water in a specific way. You snap the bricks together in the right order, and clean water comes out at the end.
Data Flow vs Copy Activity — When to Use Which
| Scenario | Use This |
|---|---|
| Move data without changing it | Copy Activity |
| Filter rows, rename columns, change data types | Mapping Data Flow |
| Join two datasets together | Mapping Data Flow |
| Aggregate data (sum, count, average by group) | Mapping Data Flow |
| Deduplicate rows | Mapping Data Flow |
| Complex multi-step transformations | Mapping Data Flow or Databricks Notebook |
Core Transformation Types in Mapping Data Flows
Source
Every Data Flow starts with a Source transformation. This connects to your input data — a file in ADLS Gen2, a table in Azure SQL, or the output of another step. You can have multiple Source transformations in one Data Flow to join different datasets.
Select
The Select transformation chooses which columns to keep, renames them, and changes their order. If your source file has 50 columns but you only need 8, Select removes the other 42. This reduces the amount of data processed in subsequent steps — which saves time and money.
Filter
Filter removes rows that do not meet a condition. Example: Keep only rows where the order status is "Completed". Rows where status is "Cancelled" or "Pending" are dropped.
The condition uses an expression language similar to SQL:
status == 'Completed' && order_amount > 0
Derived Column
Derived Column creates new columns or modifies existing ones using expressions. This is where most transformation logic lives.
Common uses:
- Combine first name and last name into a full name:
firstName + ' ' + lastName - Convert a string date to a proper date type:
toDate(order_date, 'yyyy-MM-dd') - Calculate profit margin:
(revenue - cost) / revenue * 100 - Upper-case a column:
upper(product_name) - Replace null values:
iifNull(discount, 0)
Aggregate
Aggregate groups rows and calculates summary values — just like GROUP BY in SQL.
Example: Group by region and calculate total sales per region:
- Group by: region
- Aggregations: total_sales = sum(order_amount), order_count = count(order_id)
Join
Join combines two streams of data based on a matching key — like a JOIN in SQL. ADF supports Inner Join, Left Outer, Right Outer, Full Outer, and Cross Join.
Example: Join an orders dataset with a customers dataset on customer_id to add customer names and cities to each order record.
Lookup
Lookup is similar to Join but returns data from a reference table based on a matching key. It is commonly used to enrich data with values from a small reference table — like mapping product codes to product names.
Union
Union stacks rows from multiple datasets on top of each other. All datasets must have the same column structure. Use this when you have the same type of data in multiple files — like monthly sales files that need to be combined into one annual dataset.
Pivot and Unpivot
Pivot converts row values into column names. Unpivot does the reverse — converts column names into row values.
Example of Pivot: You have sales data with rows for each month. Pivot converts it so each month becomes a column, giving you one row per product with 12 sales columns.
Conditional Split
Conditional Split routes rows to different output streams based on conditions. It is the Data Flow equivalent of an IF-ELSE statement.
Example: Route rows to three outputs based on region — one stream for Asia, one for Europe, one for Americas. Each stream writes to a different folder in ADLS Gen2.
Exists
Exists checks whether a row from one stream has a matching row in another stream. Use it to find new records (rows that exist in the source but not in the destination) or deleted records (rows that exist in the destination but not in the source).
Sink
Every Data Flow ends with one or more Sink transformations. The Sink writes the transformed data to a destination — ADLS Gen2, Azure SQL, Synapse, or another storage system.
Key Sink settings:
- Write behavior: Overwrite, append, or upsert (insert new rows, update existing ones)
- File naming: Use a pattern to control output file names
- Partitioning: Control how many output files are created
Data Flow Debug Mode
Before running a full Data Flow, you use Debug Mode to test it. Debug Mode starts a small Spark cluster (takes about 2-3 minutes to start). Once running, you can preview data at any transformation step and check that your logic is correct.
Always test with a small data sample in Debug Mode before running against millions of rows. Debug Mode keeps a cluster alive and charges per hour — turn it off when not actively testing.
A Complete Example — Clean and Enrich Sales Data
Imagine raw sales data arrives daily in ADLS Gen2. The file has duplicate rows, inconsistent date formats, and missing customer names. Here is a Data Flow that fixes all of this.
- Source: Read the raw CSV from bronze/sales/
- Select: Keep only order_id, customer_id, order_date, amount, status
- Filter: Remove rows where amount is null or negative
- Derived Column: Convert order_date string to proper date type; calculate year and month columns
- Aggregate: Deduplicate by keeping max(amount) per order_id
- Lookup: Join with a customers reference table on customer_id to add customer_name and city
- Conditional Split: Route high-value orders (amount > 10000) to one output, regular orders to another
- Sink (x2): Write high-value orders to silver/sales/high_value/ and regular orders to silver/sales/regular/
Key Points
- Mapping Data Flows transform data visually — no coding required
- ADF compiles Data Flows into Spark jobs automatically
- Use Select early in the flow to drop unnecessary columns and reduce processing time
- Derived Column handles most common transformations — type conversion, null handling, string manipulation
- Always test with Debug Mode on a small sample before processing full datasets
- Turn off Debug Mode when not in use to avoid unnecessary compute charges
