ADE Azure Data Lake Storage Gen2

Every data engineering project needs a place to store raw data before processing begins. Azure Data Lake Storage Gen2 — commonly called ADLS Gen2 — is that place. It is the foundation of nearly every modern Azure data solution.

What is a Data Lake

A data lake is a storage system that holds massive amounts of data in its original, unprocessed form. It accepts any type of data — structured tables, unstructured text files, images, videos, logs, sensor readings — without requiring a predefined structure.

Compare this to a data warehouse, which only accepts clean, structured data that fits a defined schema. A data lake is more flexible. You store everything first and decide how to use it later.

Picture a river flowing into a lake. The river (your data sources) pours water (data) into the lake continuously. The lake holds everything. Later, you pump water out in specific amounts for specific uses — irrigation, drinking, industry. The data lake works the same way.
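The "store first, decide later" idea is often called schema-on-read. A minimal sketch of it, using made-up sensor records rather than any real dataset: the raw lines are kept exactly as they arrived, and a structure is applied only when a specific question is asked. The function name `read_temperatures` and the field names are assumptions for illustration.

```python
import json

# Raw lines land in the lake as-is: mixed shapes, even non-JSON text.
raw_lines = [
    '{"device": "t1", "temp_c": 21.5, "ts": "2024-01-01T00:00:00"}',
    '{"device": "t2", "humidity": 40}',
    "free-form log line: sensor t3 rebooted",
]


def read_temperatures(lines):
    """Apply a schema at read time: keep only records that carry temp_c."""
    readings = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON; still stored, just not used by this reader
        if isinstance(record, dict) and "temp_c" in record:
            readings.append((record["device"], float(record["temp_c"])))
    return readings


temperatures = read_temperatures(raw_lines)
print(temperatures)
```

A different reader could later pull humidity or parse the log lines from the same stored data, which is the flexibility a warehouse schema would have ruled out at load time.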

Why ADLS Gen2 Over Regular Azure Blob Storage

Azure Blob Storage is Azure's general-purpose object storage. ADLS Gen2 is built on top of Blob Storage but adds features specifically designed for analytics workloads.

Feature	Azure Blob Storage	ADLS Gen2
Hierarchical file system	No (flat namespace)	Yes (folders and subfolders)
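Fine-grained access control	Account and container level (RBAC)	File and folder level (RBAC plus ACLs)
Performance for analytics	Moderate (folder operations touch every blob)	High (directory rename and delete are single operations)
Hadoop compatibility	Limited (legacy WASB driver)	Yes (ABFS driver; works with Spark, Hive, Databricks)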

For data engineering, ADLS Gen2 is usually the preferred choice. The hierarchical namespace makes organizing millions of files much more manageable, and renaming or deleting a folder becomes a single operation instead of one operation per file.
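To see why the hierarchical namespace matters, consider renaming a folder. The sketch below is a toy in-memory model, not the Azure SDK, and the function names are made up for illustration. In a flat namespace a "folder" is only a shared name prefix, so a rename has to rewrite every object key. With a real directory tree it is one change.

```python
def rename_prefix_flat(blobs: dict, old: str, new: str) -> int:
    """Rename a 'folder' in a flat namespace; returns how many objects were touched."""
    touched = 0
    for key in list(blobs):  # copy the keys because the dict is modified in the loop
        if key.startswith(old):
            blobs[new + key[len(old):]] = blobs.pop(key)
            touched += 1
    return touched


def rename_dir_hierarchical(tree: dict, old: str, new: str) -> int:
    """Rename a top-level directory in a nested tree; always one operation."""
    tree[new] = tree.pop(old)
    return 1


flat = {
    "sales/2024/01/01/orders.csv": b"...",
    "sales/2024/01/02/orders.csv": b"...",
    "customers/list.csv": b"...",
}
nested = {
    "sales": {"2024": {"01": {"01": {"orders.csv": b"..."}, "02": {"orders.csv": b"..."}}}},
    "customers": {"list.csv": b"..."},
}

flat_ops = rename_prefix_flat(flat, "sales/", "orders/")
tree_ops = rename_dir_hierarchical(nested, "sales", "orders")
print(flat_ops, tree_ops)  # prints "2 1"
```

With two files the difference is trivial. With millions of files under one prefix, the flat rename becomes millions of operations while the directory rename stays at one.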

Creating an ADLS Gen2 Storage Account

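To enable ADLS Gen2, you create a standard Azure Storage Account and turn on the Hierarchical Namespace option during creation (in the Azure portal it sits on the Advanced tab). This single toggle activates all the Gen2 features. Decide deliberately, because the setting cannot be turned off once it is enabled.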

Key settings to configure:

Performance tier: Standard (most workloads) or Premium (low-latency requirements)
Redundancy: LRS (locally redundant), ZRS (zone-redundant), GRS (geo-redundant), or GZRS (geo-zone-redundant); choose based on how critical your data is
Region: Match the region of your processing services to keep latency low and avoid cross-region data transfer charges
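The settings above map onto a small request body when the account is created through Azure Resource Manager instead of the portal. The sketch below only builds that body as a Python dictionary and does not call Azure. The property names (`kind`, `sku`, `isHnsEnabled`) follow the ARM schema for storage accounts, while the helper name `storage_account_request` and the region and SKU values are assumptions chosen for the example.

```python
import json


def storage_account_request(location: str, sku: str = "Standard_LRS") -> dict:
    """Build an ARM-style body for a storage account with Gen2 features on."""
    return {
        "location": location,            # match your processing services
        "kind": "StorageV2",             # general-purpose v2 account
        "sku": {"name": sku},            # performance tier + redundancy
        "properties": {
            "isHnsEnabled": True,        # the Hierarchical Namespace toggle
        },
    }


# "eastus" is a placeholder region for illustration only.
body = storage_account_request("eastus")
print(json.dumps(body, indent=2))
```

Note that the SKU name carries both the performance tier and the redundancy choice, for example `Standard_LRS` or `Standard_ZRS`.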

Containers, Folders, and Files

Inside a storage account, data lives in a container — the top-level bucket. Inside each container, you create folders and subfolders just like a file system on your computer.

The Bronze, Silver, Gold Pattern

Most Azure data engineering projects organize ADLS Gen2 using a three-layer folder structure called the Medallion Architecture. This pattern is one of the most important concepts in modern data engineering.

Imagine a gold refinery. Raw ore comes in (bronze), gets partially refined (silver), and then becomes pure gold. Your data follows the same journey.

Bronze (Raw layer): Exact copy of source data. Nothing is changed. If something goes wrong in later stages, you always have the original data to reprocess.
Silver (Cleaned layer): Data is cleaned — duplicates removed, nulls handled, formats standardized. The structure is still close to the source.
Gold (Curated layer): Business-ready data. Aggregations, joins, and transformations are complete. Analysts and dashboards read from here.
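The three layers can be sketched in a few lines of plain Python. This is an illustration on a tiny in-memory sample, not a production pipeline: the records, the function names `to_silver` and `to_gold`, and the choice to treat a missing amount as 0.0 are all assumptions made for the example.

```python
from collections import defaultdict

# Bronze: exact copy of the source, strings and all. Never modified.
bronze = [
    {"order_id": "1", "customer": "alice", "amount": "10.50"},
    {"order_id": "1", "customer": "alice", "amount": "10.50"},  # duplicate row
    {"order_id": "2", "customer": "BOB ", "amount": ""},        # missing amount
    {"order_id": "3", "customer": "bob", "amount": "4.50"},
]


def to_silver(rows):
    """Clean: drop duplicate order IDs, standardize names, handle nulls, fix types."""
    seen, cleaned = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            # Assumption for this sketch: a missing amount counts as 0.0.
            "amount": float(row["amount"]) if row["amount"] else 0.0,
        })
    return cleaned


def to_gold(rows):
    """Curate: aggregate revenue per customer for dashboards."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # prints {'alice': 10.5, 'bob': 4.5}
```

Because `bronze` is never changed, a bug in `to_silver` can be fixed and the later layers rebuilt from the original data.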

A typical layout, with one container per layer, looks like this:

storageaccount/
├── bronze/
│   ├── sales/
│   │   ├── 2024/01/01/orders.csv
│   │   └── 2024/01/02/orders.csv
│   └── customers/
├── silver/
│   ├── sales/
│   └── customers/
└── gold/
    ├── monthly_revenue/
    └── customer_segments/
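The date-partitioned paths in the bronze layer are easy to generate in code. A small sketch, where the helper name `bronze_path` is hypothetical and the year/month/day layout simply mirrors the tree above:

```python
from datetime import date
from pathlib import PurePosixPath


def bronze_path(dataset: str, day: date, filename: str) -> str:
    """Build a dataset/YYYY/MM/DD/filename path inside the bronze container."""
    return str(PurePosixPath(dataset) / f"{day:%Y/%m/%d}" / filename)


print(bronze_path("sales", date(2024, 1, 2), "orders.csv"))
# prints sales/2024/01/02/orders.csv
```

Zero-padded month and day values keep the folders in chronological order when they are listed alphabetically.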

Access Control in ADLS Gen2

ADLS Gen2 supports two levels of access control:

RBAC — Role-Based Access Control

Assigns roles at the storage account or container level. A team of analysts can receive the Storage Blob Data Reader role on the gold container only. They can read gold data but cannot touch bronze or silver data.
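Limiting a role to one container means using that container's resource ID as the scope of the role assignment. The sketch below only builds the scope string. The helper name `container_scope` is hypothetical, and the subscription ID, resource group, and account name are placeholders.

```python
def container_scope(subscription_id: str, resource_group: str,
                    account: str, container: str) -> str:
    """Return the Azure resource ID of a blob container, usable as an RBAC scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )


# Placeholder values for illustration only.
scope = container_scope(
    "00000000-0000-0000-0000-000000000000", "rg-data", "mystorageaccount", "gold"
)
role = "Storage Blob Data Reader"
print(role, "on", scope)
```

Dropping the last path segment pair gives the storage account scope, which would grant the same role on every container.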

ACLs — Access Control Lists

Provide permissions at the individual file and folder level. You can grant a specific user read permission on a specific folder without giving them access to anything else in the storage account. The user also needs execute permission on every parent folder above it, up to the container root, so that the path can be traversed. This level of control is only available in ADLS Gen2, not in regular Blob Storage.
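Reading a single file through ACLs takes more than one entry, because every folder on the way down must allow traversal. The sketch below lists the minimum entries for one user, assuming the user has no RBAC data role that already grants access. It does not call Azure; the helper name `acl_entries_for_read` and the object ID are placeholders, and the entries use the short `user:<object-id>:rwx` form.

```python
from pathlib import PurePosixPath


def acl_entries_for_read(file_path: str, object_id: str) -> list:
    """Return (path, ACL entry) pairs needed to read one file in a container."""
    path = PurePosixPath(file_path)
    traverse = f"user:{object_id}:--x"   # execute only: pass through a folder
    entries = [("/", traverse)]          # the container root
    for parent in list(path.parents)[::-1]:   # outermost folder first
        if str(parent) != ".":
            entries.append((str(parent), traverse))
    entries.append((str(path), f"user:{object_id}:r--"))  # read the file itself
    return entries


# Placeholder object ID for illustration only.
needed = acl_entries_for_read(
    "sales/2024/01/01/orders.csv", "11111111-1111-1111-1111-111111111111"
)
for target, entry in needed:
    print(f"{target:30} {entry}")
```

Listing the contents of a folder, as opposed to opening a known file, would need read and execute (`r-x`) on that folder.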

Storage Tiers and Cost Management

Not all data needs to be instantly accessible. Azure offers different access tiers to help manage storage costs.

Tier	Access Speed	Cost	Best For
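Hot	Instant	Highest storage cost, lowest access cost	Data accessed frequently (silver, gold layers)
Cool	Instant	Lower storage cost, higher access cost; 30-day minimum retention	Data accessed occasionally (recent bronze data)
Archive	Hours to retrieve (must be rehydrated first)	Lowest storage cost, highest retrieval cost; 180-day minimum retention	Old data kept for compliance (bronze data older than 1 year)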

A good practice: configure lifecycle management policies that automatically move data to cooler tiers after a certain number of days. For example, bronze data not modified for 90 days moves to Cool, and bronze data not modified for 365 days moves to Archive.
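A lifecycle policy is a JSON document made of rules. The sketch below builds one rule for the 90-day and 365-day example as a Python dictionary and prints it as JSON. The field names follow the Azure Storage lifecycle policy schema; the helper name `tiering_rule`, the rule name, and the `bronze/` prefix (the container name) are assumptions for the example.

```python
import json


def tiering_rule(name: str, prefix: str, cool_after_days: int,
                 archive_after_days: int) -> dict:
    """Build one lifecycle rule that tiers block blobs under a prefix."""
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "filters": {
                "blobTypes": ["blockBlob"],
                "prefixMatch": [prefix],   # starts with the container name
            },
            "actions": {
                "baseBlob": {
                    "tierToCool": {
                        "daysAfterModificationGreaterThan": cool_after_days
                    },
                    "tierToArchive": {
                        "daysAfterModificationGreaterThan": archive_after_days
                    },
                }
            },
        },
    }


policy = {"rules": [tiering_rule("bronze-tiering", "bronze/", 90, 365)]}
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of document that can be pasted into the lifecycle management code view of a storage account, after checking it against the current schema.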

Connecting to ADLS Gen2

Different Azure services connect to ADLS Gen2 in different ways:

Azure Data Factory: Uses a Linked Service with Managed Identity or Service Principal
Azure Databricks: Connects directly with abfss:// paths, or mounts the storage using a Service Principal (mounts are treated as a legacy pattern in newer workspaces)
Azure Synapse Analytics: Uses built-in integration — the storage account is linked directly to the Synapse workspace

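The abfss:// scheme (Azure Blob File System over TLS; the trailing "s" stands for secure) is used when code needs to read or write files in ADLS Gen2. The container name comes before the @, the storage account name comes after it, and the file path follows the host name. A typical path looks like: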

abfss://bronze@mystorageaccount.dfs.core.windows.net/sales/2024/01/01/orders.csv
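Building and taking apart these paths by hand is error-prone, so a pair of small helpers can help. This is a sketch using only the Python standard library; the function names `abfss_uri` and `parse_abfss` are hypothetical, and the code only handles strings without connecting to storage.

```python
from urllib.parse import urlparse


def abfss_uri(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI from container, storage account, and file path."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def parse_abfss(uri: str) -> tuple:
    """Split an abfss:// URI into (container, account, path)."""
    parts = urlparse(uri)
    if parts.scheme != "abfss" or "@" not in parts.netloc:
        raise ValueError(f"not an abfss URI: {uri}")
    container, _, host = parts.netloc.partition("@")
    account = host.split(".")[0]
    return container, account, parts.path.lstrip("/")


uri = abfss_uri("bronze", "mystorageaccount", "sales/2024/01/01/orders.csv")
print(uri)
print(parse_abfss(uri))
```

Note the `dfs` endpoint in the host name: ADLS Gen2 paths use `dfs.core.windows.net`, while the plain Blob endpoint uses `blob.core.windows.net`.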

Key Points

ADLS Gen2 is the standard storage layer for Azure data engineering workloads
Enable Hierarchical Namespace when creating the storage account — this activates Gen2 features
Use the Bronze, Silver, Gold (Medallion) architecture to organize data in layers
Use RBAC for broad access control and ACLs for fine-grained file- and folder-level permissions
Use lifecycle policies to automatically move old data to cheaper storage tiers
Always keep all processing services in the same region as the storage account
