Building Your First Complete Azure Data Project

Reading about Azure data engineering builds knowledge. Building a real project builds skill, confidence, and a portfolio that gets you hired. This topic walks you through a complete, hands-on project — from setting up your Azure environment to delivering a working analytics pipeline — using free or low-cost Azure resources.

The Project — Sales Analytics Pipeline

You will build a pipeline that ingests daily sales data, transforms it using Databricks and ADF, stores it in a Medallion Architecture in ADLS Gen2, and serves it to a Synapse Serverless SQL endpoint for analysis.

This project covers the most frequently tested and professionally relevant skills: ADLS Gen2, ADF Copy Activity, Databricks Delta Lake, Synapse Serverless, security with Managed Identity, and basic monitoring.

Step 1 — Set Up Your Azure Environment

Create a Free Azure Account

Go to azure.microsoft.com and create a free account. You receive $200 in credits valid for 30 days plus several always-free services. This is sufficient to complete this project at near-zero cost if you delete resources when not actively working.

Create a Resource Group

In the Azure portal, create a resource group named rg-sales-analytics in your nearest region (East US, West Europe, South India, etc.). All project resources go inside this group.

Create ADLS Gen2 Storage Account

Create a Storage Account with these settings:

  • Name: stgsalesanalyticsXXX (replace XXX with a unique number)
  • Region: same as your resource group
  • Performance: Standard
  • Redundancy: LRS (lowest cost for a learning project)
  • Advanced tab → Enable Hierarchical Namespace: ON (this activates ADLS Gen2)

Create three containers inside the storage account: bronze, silver, gold.

Step 2 — Prepare Sample Data

Create a CSV file named sales_2024_01.csv with this content:

order_id,customer_id,customer_name,product,region,order_date,quantity,unit_price,status
1001,C001,Alice Sharma,Laptop,South,2024-01-15,1,75000,Completed
1002,C002,Bob Patel,Mobile Phone,West,2024-01-15,2,25000,Completed
1003,C003,Carol Singh,Laptop,North,2024-01-16,1,75000,Cancelled
1004,C001,Alice Sharma,Tablet,South,2024-01-17,1,35000,Completed
1005,C004,David Kumar,Mobile Phone,East,2024-01-17,3,25000,Completed
1006,C005,Eva Reddy,Laptop,South,2024-01-18,1,75000,Completed
1007,C002,Bob Patel,Tablet,West,2024-01-19,2,35000,Completed
1008,C003,Carol Singh,Mobile Phone,North,2024-01-19,1,25000,Completed
1009,C006,Frank Nair,Laptop,East,2024-01-20,-1,75000,Completed
1010,C004,David Kumar,Tablet,East,2024-01-20,2,35000,Completed

Upload this file to the bronze container in a folder: bronze/sales/2024/01/.

Note that row 1009 has a negative quantity — a data quality issue you will fix in the silver layer.

Step 3 — Create an Azure Databricks Workspace

Create an Azure Databricks workspace in your resource group. Use the Standard pricing tier for this project. Once created, launch the Databricks Workspace.

Create a Cluster

Create a Single Node cluster (cheapest option for learning) with:

  • Databricks Runtime: 13.3 LTS or later
  • Node type: Standard_DS3_v2 (or the smallest available in your region)
  • Auto-termination: 30 minutes

Grant Databricks Access to ADLS Gen2

Assign the Storage Blob Data Contributor role to the Databricks workspace's Managed Identity on your storage account. This grants the cluster read and write access to all containers.

Step 4 — Build the Bronze to Silver Transformation in Databricks

Create a new notebook called 01_bronze_to_silver. Add the following cells:

# Cell 1 — Configuration
storage_account = "stgsalesanalyticsXXX"  # Replace with your storage account name

bronze_path = f"abfss://bronze@{storage_account}.dfs.core.windows.net/sales/2024/01/"
silver_path = f"abfss://silver@{storage_account}.dfs.core.windows.net/sales/"
# Cell 2 — Read raw CSV from bronze layer
from pyspark.sql.functions import col, to_date, upper, year, month

df_bronze = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(bronze_path)

print(f"Bronze row count: {df_bronze.count()}")
display(df_bronze)
# Cell 3 — Apply transformations and data quality rules
df_silver = (df_bronze
    # Remove cancelled orders
    .filter(col("status") == "Completed")
    # Remove rows with negative or zero quantity (data quality)
    .filter(col("quantity") > 0)
    # Standardize region to upper case
    .withColumn("region", upper(col("region")))
    # Convert order_date string to proper date type
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
    # Calculate total amount
    .withColumn("total_amount", col("quantity") * col("unit_price"))
    # Add year and month columns for partitioning
    .withColumn("year", year(col("order_date")))
    .withColumn("month", month(col("order_date")))
    # Drop redundant columns
    .drop("customer_name")  # PII — moved to a separate customer dimension table
)

print(f"Silver row count: {df_silver.count()}")
display(df_silver)
# Cell 4 — Write silver layer as Delta, partitioned by year and month
df_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .save(silver_path)

print("Silver layer written successfully")

Run all cells. Verify: Bronze had 10 rows, Silver should have 8 rows (2 removed — 1 Cancelled, 1 negative quantity).

Step 5 — Build the Silver to Gold Transformation

Create a notebook called 02_silver_to_gold:

# Read silver Delta table
df_silver = spark.read.format("delta") \
    .load(f"abfss://silver@{storage_account}.dfs.core.windows.net/sales/")

# Gold — total revenue and order count by region and product
df_gold_by_region_product = (df_silver
    .groupBy("region", "product", "year", "month")
    .agg(
        {"total_amount": "sum", "order_id": "count"}
    )
    .withColumnRenamed("sum(total_amount)", "total_revenue")
    .withColumnRenamed("count(order_id)", "order_count")
)

# Write gold layer
gold_path = f"abfss://gold@{storage_account}.dfs.core.windows.net/sales_by_region_product/"
df_gold_by_region_product.write \
    .format("delta") \
    .mode("overwrite") \
    .save(gold_path)

display(df_gold_by_region_product)

Step 6 — Create Azure Synapse Analytics Workspace

Create a Synapse Analytics Workspace linked to your ADLS Gen2 storage account. Use the Serverless SQL Pool — no additional cost beyond the data scanned.

In Synapse Studio, open a new SQL script and query the gold layer directly:

-- Query gold layer Delta files from Synapse Serverless SQL Pool
SELECT *
FROM OPENROWSET(
    BULK 'https://stgsalesanalyticsXXX.dfs.core.windows.net/gold/sales_by_region_product/**',
    FORMAT = 'DELTA'
) AS [result]
ORDER BY total_revenue DESC;

Step 7 — Automate with ADF

Create an Azure Data Factory instance. Build a pipeline named Daily_Sales_Pipeline with two Notebook Activities:

  1. First activity: runs 01_bronze_to_silver
  2. Second activity: runs 02_silver_to_gold — with a Success dependency on the first activity

Add a Schedule Trigger that runs the pipeline daily. Test by clicking Debug and verifying both notebooks execute successfully.

Step 8 — Add Basic Monitoring

In Azure Monitor, create an alert rule on the ADF pipeline. Set the condition to fire when any pipeline run fails. Add an Action Group that sends an email notification to your address.

What You Have Built

You now have a working end-to-end Azure data pipeline that demonstrates:

  • ADLS Gen2 with Medallion Architecture (Bronze/Silver/Gold)
  • Databricks PySpark transformations with Delta Lake
  • Data quality checks (filtering bad records)
  • ADF orchestration with activity dependencies
  • Synapse Serverless SQL for querying the data lake
  • Automated scheduling and failure alerting

This project is portfolio-ready. Add it to your LinkedIn and GitHub with a clear README explaining the architecture and the business problem it solves.

Key Points

  • Building a real project embeds the concepts more effectively than reading alone — allocate time for hands-on practice
  • Use a single-node Databricks cluster and LRS storage for a learning project to minimize cost
  • Always enable auto-termination on your cluster and delete or pause resources when not actively working
  • Intentionally include a data quality issue in your sample data and fix it in the silver layer — this demonstrates data engineering judgment
  • The project structure — ADLS Gen2 + Databricks + Synapse + ADF — matches the most common architecture pattern in real enterprise environments
  • Document your project clearly — a portfolio project with a well-written README is a strong differentiator in job applications

Leave a Comment