ADE Azure Data Lake Storage Gen2
Every data engineering project needs a place to store raw data before processing begins. Azure Data Lake Storage Gen2 — commonly called ADLS Gen2 — is that place. It is the foundation of nearly every modern Azure data solution.
What is a Data Lake
A data lake is a storage system that holds massive amounts of data in its original, unprocessed form. It accepts any type of data — structured tables, unstructured text files, images, videos, logs, sensor readings — without requiring a predefined structure.
Compare this to a data warehouse, which only accepts clean, structured data that fits a defined schema. A data lake is more flexible. You store everything first and decide how to use it later.
Picture a river flowing into a lake. The river (your data sources) pours water (data) into the lake continuously. The lake holds everything. Later, you pump water out in specific amounts for specific uses — irrigation, drinking, industry. The data lake works the same way.
Why ADLS Gen2 Over Regular Azure Blob Storage
Azure Blob Storage is Azure's general-purpose object storage. ADLS Gen2 is built on top of Blob Storage but adds features specifically designed for analytics workloads.
| Feature | Azure Blob Storage | ADLS Gen2 |
|---|---|---|
| Hierarchical file system | No (flat namespace) | Yes (folders and subfolders) |
| Fine-grained access control | Container-level only | File and folder level (ACLs) |
| Performance for analytics | Moderate | High — optimized for big data |
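| Hadoop compatibility | Limited (legacy WASB driver) | Yes (ABFS driver; works with Spark, Hive, Databricks) |
For data engineering workloads, ADLS Gen2 is usually the preferred choice. The hierarchical namespace makes organizing millions of files much more manageable, and operations such as renaming or deleting a folder act on the directory itself rather than on every object that shares a prefix.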
Creating an ADLS Gen2 Storage Account
To enable ADLS Gen2, you create a standard Azure Storage Account and turn on the Hierarchical Namespace option during creation. This single toggle activates all the Gen2 features.
Key settings to configure:
- Performance tier: Standard (most workloads) or Premium (low-latency requirements)
- Redundancy: LRS (locally redundant), ZRS (zone redundant), GRS (geo-redundant) — choose based on how critical your data is
- Region: Always match the region of your processing services
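The settings above map onto a handful of properties in the storage account definition. As a minimal sketch, the snippet below builds the kind of JSON body an ARM template or REST request would carry for a Standard-tier account. The helper name `storage_account_definition` and the region value are hypothetical; `isHnsEnabled` is the property that corresponds to the Hierarchical Namespace toggle.

```python
import json

# Redundancy options named in the list above.
SUPPORTED_REDUNDANCY = {"LRS", "ZRS", "GRS"}


def storage_account_definition(location: str, redundancy: str = "LRS") -> dict:
    """Build a Standard-tier storage account definition with Gen2 features on."""
    if redundancy not in SUPPORTED_REDUNDANCY:
        raise ValueError(f"unsupported redundancy option: {redundancy}")
    return {
        "location": location,                        # match your processing services
        "kind": "StorageV2",                         # general-purpose v2 account
        "sku": {"name": f"Standard_{redundancy}"},   # performance tier + redundancy
        "properties": {"isHnsEnabled": True},        # the Hierarchical Namespace toggle
    }


# Hypothetical example: a zone-redundant account in West Europe.
definition = storage_account_definition("westeurope", "ZRS")
print(json.dumps(definition, indent=2))
```

Premium accounts use a different account kind and SKU, so this sketch deliberately covers the Standard tier only.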
Containers, Folders, and Files
Inside a storage account, data lives in a container — the top-level bucket. Inside each container, you create folders and subfolders just like a file system on your computer.
The Bronze, Silver, Gold Pattern
Most Azure data engineering projects organize ADLS Gen2 using a three-layer folder structure called the Medallion Architecture. This pattern is one of the most important concepts in modern data engineering.
Imagine a gold refinery. Raw ore comes in (bronze), gets partially refined (silver), and then becomes pure gold. Your data follows the same journey.
- Bronze (Raw layer): Exact copy of source data. Nothing is changed. If something goes wrong in later stages, you always have the original data to reprocess.
- Silver (Cleaned layer): Data is cleaned — duplicates removed, nulls handled, formats standardized. The structure is still close to the source.
- Gold (Curated layer): Business-ready data. Aggregations, joins, and transformations are complete. Analysts and dashboards read from here.
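To make the three layers concrete, here is a small pure-Python sketch of the journey from bronze to gold. The sample rows, column names, and the choice to treat a missing amount as 0.0 are all assumptions made for illustration; a real pipeline would typically run on Spark or a similar engine and apply its own cleaning rules.

```python
from collections import defaultdict
from datetime import date, datetime

# Hypothetical bronze records: raw rows exactly as they arrived from the source.
bronze = [
    {"order_id": "1", "order_date": "2024-01-01", "amount": "100.50"},
    {"order_id": "1", "order_date": "2024-01-01", "amount": "100.50"},  # duplicate
    {"order_id": "2", "order_date": "01/02/2024", "amount": None},      # null, other date format
    {"order_id": "3", "order_date": "2024-02-10", "amount": "40"},
]


def parse_date(text: str) -> date:
    """Standardize the two date formats assumed to occur in the source."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def to_silver(rows: list) -> list:
    """Clean the data: drop duplicate orders, handle nulls, standardize types."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "order_date": parse_date(row["order_date"]),
            # Assumption: a missing amount counts as 0.0.
            "amount": float(row["amount"]) if row["amount"] is not None else 0.0,
        })
    return cleaned


def to_gold(rows: list) -> dict:
    """Aggregate cleaned orders into business-ready monthly revenue."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["order_date"].strftime("%Y-%m")] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Note that `bronze` is never modified: each layer is derived from the one before it, so the raw data stays available for reprocessing.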
A typical layout, with one container per layer and folders inside each container, looks like this:
storageaccount/
├── bronze/
│   ├── sales/
│   │   ├── 2024/01/01/orders.csv
│   │   └── 2024/01/02/orders.csv
│   └── customers/
├── silver/
│   ├── sales/
│   └── customers/
└── gold/
    ├── monthly_revenue/
    └── customer_segments/
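The date-based folders in the bronze layer follow a simple year/month/day convention. A minimal sketch of a helper that produces such paths is shown below; the function name `partition_path` is hypothetical and the layout is just the one used in the tree above.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(dataset: str, day: date, filename: str) -> str:
    """Return '<dataset>/<YYYY>/<MM>/<DD>/<filename>' for a given load date."""
    # PurePosixPath always joins with forward slashes, regardless of the OS.
    return str(PurePosixPath(dataset, f"{day:%Y/%m/%d}", filename))


print(partition_path("sales", date(2024, 1, 1), "orders.csv"))
# -> sales/2024/01/01/orders.csv
```

Zero-padded month and day values keep folders in chronological order when they are sorted by name.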
Access Control in ADLS Gen2
ADLS Gen2 supports two levels of access control:
RBAC — Role-Based Access Control
Assigns roles at the storage account or container level. A team of analysts can receive the Storage Blob Data Reader role on the gold container only. They can read gold data but cannot touch bronze or silver data.
ACLs — Access Control Lists
Provide permissions at the individual file and folder level. You can grant read permission on a specific folder to a specific user without giving them access to anything else in the storage account, although the user also needs execute permission on each parent folder in order to reach it. This level of control is only available in ADLS Gen2, not in regular Blob Storage.
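ADLS Gen2 ACLs follow a POSIX-style model, so listing a folder requires read and execute on that folder plus execute on every folder above it. The sketch below works out which short-form ACL entries would be needed for one user and one folder. The helper name `acl_plan` and the all-zero object ID are hypothetical placeholders, and the code only computes the entries; it does not call any Azure API.

```python
from pathlib import PurePosixPath


def acl_plan(folder: str, principal_id: str) -> dict[str, str]:
    """Map each directory to the ACL entry the principal needs to list `folder`."""
    target = PurePosixPath("/") / folder
    plan = {}
    # Walk from the container root down to the parent of the target folder:
    # execute (--x) lets the principal traverse a folder without listing it.
    for parent in list(target.parents)[::-1]:
        plan[str(parent)] = f"user:{principal_id}:--x"
    # Read + execute (r-x) on the target folder allows listing its contents.
    plan[str(target)] = f"user:{principal_id}:r-x"
    return plan


# Hypothetical object ID of the user in the directory service.
analyst = "00000000-0000-0000-0000-000000000000"
for directory, entry in acl_plan("sales/2024", analyst).items():
    print(f"{directory:12} {entry}")
```

Reading the files inside the folder additionally requires read permission on the files themselves, which is usually handled by setting default ACLs on the folder so that new files inherit them.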
Storage Tiers and Cost Management
Not all data needs to be instantly accessible. Azure offers different access tiers to help manage storage costs.
| Tier | Access Speed | Cost | Best For |
|---|---|---|---|
| Hot | Instant | Highest storage cost, lowest access cost | Data accessed frequently (silver, gold layers) |
| Cool | Instant | Lower storage cost, higher access cost | Data accessed occasionally (recent bronze data) |
| Archive | Hours to retrieve (data must be rehydrated first) | Lowest storage cost, highest retrieval cost | Old data kept for compliance (bronze data older than 1 year) |
A good practice is to configure lifecycle management policies that automatically move data to cooler tiers after a certain number of days. For example, bronze data older than 90 days moves to Cool, and bronze data older than 365 days moves to Archive.
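The 90-day and 365-day rule described above can be written as a lifecycle management policy, which Azure Storage accepts as a JSON document. The following is a minimal sketch that builds such a document. The function name and the rule name are hypothetical, the `bronze/` prefix assumes a container named bronze, and age is measured here from the last modification time, which is one of several conditions the policy format supports.

```python
import json


def bronze_lifecycle_policy(cool_after_days: int = 90,
                            archive_after_days: int = 365) -> dict:
    """Build a lifecycle policy that tiers bronze data to Cool, then Archive."""
    if not 0 <= cool_after_days < archive_after_days:
        raise ValueError("cool_after_days must be >= 0 and less than archive_after_days")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "tier-bronze-by-age",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        # Prefixes start with the container name.
                        "prefixMatch": ["bronze/"],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": archive_after_days
                            },
                        }
                    },
                },
            }
        ]
    }


policy = bronze_lifecycle_policy()
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of document you would paste into the lifecycle management view of the storage account or pass to your deployment tooling.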
Connecting to ADLS Gen2
Different Azure services connect to ADLS Gen2 in different ways:
- Azure Data Factory: Uses a Linked Service with Managed Identity or Service Principal
- Azure Databricks: Mounts the storage using a Service Principal or connects directly using the abfss:// protocol
- Azure Synapse Analytics: Uses built-in integration — the storage account is linked directly to the Synapse workspace
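The abfss:// (Azure Blob File System Secure) URI scheme is used when code needs to read or write files in ADLS Gen2. The container name comes before the @, the storage account name comes after it, and the file path follows the host name. A typical path looks like: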
abfss://bronze@mystorageaccount.dfs.core.windows.net/sales/2024/01/01/orders.csv
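Because the path has a fixed shape, it is easy to build and take apart in code. The sketch below shows one way to do that with the Python standard library. The names `AbfssPath`, `build_abfss`, and `parse_abfss` are hypothetical helpers, and the `.dfs.core.windows.net` suffix assumes the public Azure cloud, as in the example path above.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

# Endpoint suffix used in the example path (public Azure cloud assumed).
DFS_SUFFIX = ".dfs.core.windows.net"


class AbfssPath(NamedTuple):
    container: str
    account: str
    path: str


def build_abfss(container: str, account: str, path: str) -> str:
    """Assemble an abfss:// URI from its three parts."""
    return f"abfss://{container}@{account}{DFS_SUFFIX}/{path.lstrip('/')}"


def parse_abfss(uri: str) -> AbfssPath:
    """Split an abfss:// URI into container, storage account, and file path."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    if parts.scheme != "abfss" or not parts.username or not host.endswith(DFS_SUFFIX):
        raise ValueError(f"not a valid abfss URI: {uri!r}")
    account = host[: -len(DFS_SUFFIX)]
    return AbfssPath(parts.username, account, parts.path.lstrip("/"))


uri = build_abfss("bronze", "mystorageaccount", "sales/2024/01/01/orders.csv")
print(uri)
print(parse_abfss(uri))
```

In Spark-based services the resulting string is what you would pass to a reader or writer as the file location.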
Key Points
- ADLS Gen2 is the standard storage layer for Azure data engineering workloads
- Enable Hierarchical Namespace when creating the storage account — this activates Gen2 features
- Use the Bronze, Silver, Gold (Medallion) architecture to organize data in layers
- Use RBAC for broad access control and ACLs for fine-grained file-level permissions
- Use lifecycle policies to automatically move old data to cheaper storage tiers
- Always keep all processing services in the same region as the storage account
