Building Your First Complete Azure Data Project
Reading about Azure data engineering builds knowledge. Building a real project builds skill, confidence, and a portfolio that gets you hired. This topic walks you through a complete, hands-on project — from setting up your Azure environment to delivering a working analytics pipeline — using free or low-cost Azure resources.
The Project — Sales Analytics Pipeline
You will build a pipeline that ingests daily sales data, transforms it using Databricks and ADF, stores it in a Medallion Architecture in ADLS Gen2, and serves it to a Synapse Serverless SQL endpoint for analysis.
This project covers the most frequently tested and professionally relevant skills: ADLS Gen2, ADF Copy Activity, Databricks Delta Lake, Synapse Serverless, security with Managed Identity, and basic monitoring.
Step 1 — Set Up Your Azure Environment
Create a Free Azure Account
Go to azure.microsoft.com and create a free account. You receive $200 in credits valid for 30 days plus several always-free services. This is sufficient to complete this project at near-zero cost if you delete resources when not actively working.
Create a Resource Group
In the Azure portal, create a resource group named rg-sales-analytics in your nearest region (East US, West Europe, South India, etc.). All project resources go inside this group.
Create ADLS Gen2 Storage Account
Create a Storage Account with these settings:
- Name:
stgsalesanalyticsXXX(replace XXX with a unique number) - Region: same as your resource group
- Performance: Standard
- Redundancy: LRS (lowest cost for a learning project)
- Advanced tab → Enable Hierarchical Namespace: ON (this activates ADLS Gen2)
Create three containers inside the storage account: bronze, silver, gold.
Step 2 — Prepare Sample Data
Create a CSV file named sales_2024_01.csv with this content:
order_id,customer_id,customer_name,product,region,order_date,quantity,unit_price,status 1001,C001,Alice Sharma,Laptop,South,2024-01-15,1,75000,Completed 1002,C002,Bob Patel,Mobile Phone,West,2024-01-15,2,25000,Completed 1003,C003,Carol Singh,Laptop,North,2024-01-16,1,75000,Cancelled 1004,C001,Alice Sharma,Tablet,South,2024-01-17,1,35000,Completed 1005,C004,David Kumar,Mobile Phone,East,2024-01-17,3,25000,Completed 1006,C005,Eva Reddy,Laptop,South,2024-01-18,1,75000,Completed 1007,C002,Bob Patel,Tablet,West,2024-01-19,2,35000,Completed 1008,C003,Carol Singh,Mobile Phone,North,2024-01-19,1,25000,Completed 1009,C006,Frank Nair,Laptop,East,2024-01-20,-1,75000,Completed 1010,C004,David Kumar,Tablet,East,2024-01-20,2,35000,Completed
Upload this file to the bronze container in a folder: bronze/sales/2024/01/.
Note that row 1009 has a negative quantity — a data quality issue you will fix in the silver layer.
Step 3 — Create an Azure Databricks Workspace
Create an Azure Databricks workspace in your resource group. Use the Standard pricing tier for this project. Once created, launch the Databricks Workspace.
Create a Cluster
Create a Single Node cluster (cheapest option for learning) with:
- Databricks Runtime: 13.3 LTS or later
- Node type: Standard_DS3_v2 (or the smallest available in your region)
- Auto-termination: 30 minutes
Grant Databricks Access to ADLS Gen2
Assign the Storage Blob Data Contributor role to the Databricks workspace's Managed Identity on your storage account. This grants the cluster read and write access to all containers.
Step 4 — Build the Bronze to Silver Transformation in Databricks
Create a new notebook called 01_bronze_to_silver. Add the following cells:
# Cell 1 — Configuration
storage_account = "stgsalesanalyticsXXX" # Replace with your storage account name
bronze_path = f"abfss://bronze@{storage_account}.dfs.core.windows.net/sales/2024/01/"
silver_path = f"abfss://silver@{storage_account}.dfs.core.windows.net/sales/"
# Cell 2 — Read raw CSV from bronze layer
from pyspark.sql.functions import col, to_date, upper, year, month
df_bronze = spark.read.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(bronze_path)
print(f"Bronze row count: {df_bronze.count()}")
display(df_bronze)
# Cell 3 — Apply transformations and data quality rules
df_silver = (df_bronze
# Remove cancelled orders
.filter(col("status") == "Completed")
# Remove rows with negative or zero quantity (data quality)
.filter(col("quantity") > 0)
# Standardize region to upper case
.withColumn("region", upper(col("region")))
# Convert order_date string to proper date type
.withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
# Calculate total amount
.withColumn("total_amount", col("quantity") * col("unit_price"))
# Add year and month columns for partitioning
.withColumn("year", year(col("order_date")))
.withColumn("month", month(col("order_date")))
# Drop redundant columns
.drop("customer_name") # PII — moved to a separate customer dimension table
)
print(f"Silver row count: {df_silver.count()}")
display(df_silver)
# Cell 4 — Write silver layer as Delta, partitioned by year and month
df_silver.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("year", "month") \
.save(silver_path)
print("Silver layer written successfully")
Run all cells. Verify: Bronze had 10 rows, Silver should have 8 rows (2 removed — 1 Cancelled, 1 negative quantity).
Step 5 — Build the Silver to Gold Transformation
Create a notebook called 02_silver_to_gold:
# Read silver Delta table
df_silver = spark.read.format("delta") \
.load(f"abfss://silver@{storage_account}.dfs.core.windows.net/sales/")
# Gold — total revenue and order count by region and product
df_gold_by_region_product = (df_silver
.groupBy("region", "product", "year", "month")
.agg(
{"total_amount": "sum", "order_id": "count"}
)
.withColumnRenamed("sum(total_amount)", "total_revenue")
.withColumnRenamed("count(order_id)", "order_count")
)
# Write gold layer
gold_path = f"abfss://gold@{storage_account}.dfs.core.windows.net/sales_by_region_product/"
df_gold_by_region_product.write \
.format("delta") \
.mode("overwrite") \
.save(gold_path)
display(df_gold_by_region_product)
Step 6 — Create Azure Synapse Analytics Workspace
Create a Synapse Analytics Workspace linked to your ADLS Gen2 storage account. Use the Serverless SQL Pool — no additional cost beyond the data scanned.
In Synapse Studio, open a new SQL script and query the gold layer directly:
-- Query gold layer Delta files from Synapse Serverless SQL Pool
SELECT *
FROM OPENROWSET(
BULK 'https://stgsalesanalyticsXXX.dfs.core.windows.net/gold/sales_by_region_product/**',
FORMAT = 'DELTA'
) AS [result]
ORDER BY total_revenue DESC;
Step 7 — Automate with ADF
Create an Azure Data Factory instance. Build a pipeline named Daily_Sales_Pipeline with two Notebook Activities:
- First activity: runs
01_bronze_to_silver - Second activity: runs
02_silver_to_gold— with a Success dependency on the first activity
Add a Schedule Trigger that runs the pipeline daily. Test by clicking Debug and verifying both notebooks execute successfully.
Step 8 — Add Basic Monitoring
In Azure Monitor, create an alert rule on the ADF pipeline. Set the condition to fire when any pipeline run fails. Add an Action Group that sends an email notification to your address.
What You Have Built
You now have a working end-to-end Azure data pipeline that demonstrates:
- ADLS Gen2 with Medallion Architecture (Bronze/Silver/Gold)
- Databricks PySpark transformations with Delta Lake
- Data quality checks (filtering bad records)
- ADF orchestration with activity dependencies
- Synapse Serverless SQL for querying the data lake
- Automated scheduling and failure alerting
This project is portfolio-ready. Add it to your LinkedIn and GitHub with a clear README explaining the architecture and the business problem it solves.
Key Points
- Building a real project embeds the concepts more effectively than reading alone — allocate time for hands-on practice
- Use a single-node Databricks cluster and LRS storage for a learning project to minimize cost
- Always enable auto-termination on your cluster and delete or pause resources when not actively working
- Intentionally include a data quality issue in your sample data and fix it in the silver layer — this demonstrates data engineering judgment
- The project structure — ADLS Gen2 + Databricks + Synapse + ADF — matches the most common architecture pattern in real enterprise environments
- Document your project clearly — a portfolio project with a well-written README is a strong differentiator in job applications
