ADE Azure Data Lake Storage Gen2

Every data engineering project needs a place to store raw data before processing begins. Azure Data Lake Storage Gen2 — commonly called ADLS Gen2 — is that place. It is the foundation of nearly every modern Azure data solution.

What is a Data Lake

A data lake is a storage system that holds massive amounts of data in its original, unprocessed form. It accepts any type of data — structured tables, unstructured text files, images, videos, logs, sensor readings — without requiring a predefined structure.

Compare this to a data warehouse, which only accepts clean, structured data that fits a defined schema. A data lake is more flexible. You store everything first and decide how to use it later.

Picture a river flowing into a lake. The river (your data sources) pours water (data) into the lake continuously. The lake holds everything. Later, you pump water out in specific amounts for specific uses — irrigation, drinking, industry. The data lake works the same way.

Why ADLS Gen2 Over Regular Azure Blob Storage

Azure Blob Storage is Azure's general-purpose object storage. ADLS Gen2 is built on top of Blob Storage but adds features specifically designed for analytics workloads.

Feature                     | Azure Blob Storage            | ADLS Gen2
Hierarchical file system    | No (flat namespace)           | Yes (folders and subfolders)
Fine-grained access control | Container-level only          | File and folder level (ACLs)
Performance for analytics   | Moderate                      | High, optimized for big data
Hadoop compatibility        | Limited (legacy WASB driver)  | Yes (ABFS driver), works with Spark, Hive, Databricks

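For data engineering, ADLS Gen2 is usually the preferred choice. The hierarchical namespace makes organizing millions of files much more manageable, and renaming or deleting a folder becomes a single metadata operation instead of a copy of every file beneath it.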

Creating an ADLS Gen2 Storage Account

To enable ADLS Gen2, you create a standard Azure Storage Account and turn on the Hierarchical Namespace option during creation. This single toggle activates all the Gen2 features.

Key settings to configure:

  • Performance tier: Standard (most workloads) or Premium (low-latency requirements)
  • Redundancy: LRS (locally redundant), ZRS (zone redundant), GRS (geo-redundant) — choose based on how critical your data is
  • Region: Match the region of your processing services to avoid cross-region latency and data transfer charges
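
As a rough sketch of how these settings map onto a scripted deployment, the helper below assembles an Azure CLI `az storage account create` command with the hierarchical namespace enabled. The account name, resource group, and region are placeholder values chosen for illustration, and the function only builds the argument list; it does not call Azure.

```python
import shlex


def build_create_account_command(name, resource_group, location,
                                 sku="Standard_LRS"):
    """Return the Azure CLI arguments for a Gen2-enabled storage account.

    All names passed in are illustrative placeholders, not real resources.
    """
    return [
        "az", "storage", "account", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--location", location,          # same region as the processing services
        "--sku", sku,                    # e.g. Standard_LRS, Standard_ZRS, Standard_GRS
        "--kind", "StorageV2",
        # This flag is the Hierarchical Namespace toggle described above.
        "--enable-hierarchical-namespace", "true",
    ]


cmd = build_create_account_command(
    "mystorageaccount", "my-resource-group", "westeurope"
)
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps each setting visible and avoids quoting mistakes if the list is later handed to `subprocess.run`.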

Containers, Folders, and Files

Inside a storage account, data lives in a container — the top-level bucket. Inside each container, you create folders and subfolders just like a file system on your computer.

The Bronze, Silver, Gold Pattern

Most Azure data engineering projects organize ADLS Gen2 using a three-layer folder structure called the Medallion Architecture. This pattern is one of the most important concepts in modern data engineering.

Imagine a gold refinery. Raw ore comes in (bronze), gets partially refined (silver), and then becomes pure gold. Your data follows the same journey.

  • Bronze (Raw layer): Exact copy of source data. Nothing is changed. If something goes wrong in later stages, you always have the original data to reprocess.
  • Silver (Cleaned layer): Data is cleaned — duplicates removed, nulls handled, formats standardized. The structure is still close to the source.
  • Gold (Curated layer): Business-ready data. Aggregations, joins, and transformations are complete. Analysts and dashboards read from here.
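
To make the bronze-to-silver step concrete, here is a minimal sketch using only the Python standard library. The sample rows, column names, and cleaning rules (keep the first copy of each order, fill a missing amount with 0.0, normalize dates to ISO format) are assumptions made up for illustration; a real pipeline would typically do this in Spark or a similar engine.

```python
import csv
import io
from datetime import datetime

# Hypothetical bronze data: one duplicate row, one missing amount, mixed date formats.
RAW_BRONZE_CSV = """order_id,customer,amount,order_date
1001,alice,25.50,01/01/2024
1001,alice,25.50,01/01/2024
1002,BOB,,02/01/2024
1003,carol,12.00,2024-01-03
"""


def parse_date(text):
    """Standardize the two date formats seen in the sample to ISO 8601."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def bronze_to_silver(raw_csv):
    """Remove duplicates, handle nulls, and standardize formats."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if row["order_id"] in seen:      # duplicate: keep the first copy only
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            # Illustrative null policy: a missing amount becomes 0.0.
            "amount": float(row["amount"]) if row["amount"] else 0.0,
            "order_date": parse_date(row["order_date"]),
        })
    return cleaned


silver_rows = bronze_to_silver(RAW_BRONZE_CSV)
for row in silver_rows:
    print(row)
```

The bronze input is never modified, so the cleaning rules can be changed and rerun against the original data at any time.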

A typical folder structure looks like this:

storageaccount/
├── bronze/
│   ├── sales/
│   │   ├── 2024/01/01/orders.csv
│   │   └── 2024/01/02/orders.csv
│   └── customers/
├── silver/
│   ├── sales/
│   └── customers/
└── gold/
    ├── monthly_revenue/
    └── customer_segments/
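
The date-partitioned layout shown for bronze can be generated rather than typed by hand. The helper below is a small sketch; the function name `bronze_path` is hypothetical, and it returns the path inside the bronze layer for a given source, load date, and file name.

```python
from datetime import date


def bronze_path(source, load_date, filename):
    """Build a year/month/day partitioned path such as sales/2024/01/01/orders.csv."""
    return f"{source}/{load_date:%Y/%m/%d}/{filename}"


print(bronze_path("sales", date(2024, 1, 1), "orders.csv"))
print(bronze_path("sales", date(2024, 1, 2), "orders.csv"))
```

Zero-padded month and day folders keep the paths sorted chronologically when listed.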

Access Control in ADLS Gen2

ADLS Gen2 supports two levels of access control:

RBAC — Role-Based Access Control

Assigns roles at the storage account or container level. A team of analysts can receive the Storage Blob Data Reader role on the gold container only. They can read gold data but cannot touch bronze or silver data.
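
As an illustration of scoping a role to a single container, the sketch below builds the Azure resource ID for the gold container and the matching `az role assignment create` arguments. The subscription ID, resource group, account name, and assignee are placeholders invented for the example, and nothing here contacts Azure.

```python
def container_scope(subscription_id, resource_group, account, container):
    """Return the resource ID that limits a role assignment to one container."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )


def build_role_assignment_command(assignee, role, scope):
    """Return Azure CLI arguments for granting `role` to `assignee` at `scope`."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", assignee,
        "--role", role,
        "--scope", scope,
    ]


# Placeholder identifiers, not real resources.
gold_scope = container_scope(
    "00000000-0000-0000-0000-000000000000",
    "my-resource-group",
    "mystorageaccount",
    "gold",
)
role_cmd = build_role_assignment_command(
    "analysts@example.com", "Storage Blob Data Reader", gold_scope
)
print(gold_scope)
```

Because the scope ends at the gold container, the same role grants nothing on the bronze or silver containers.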

ACLs — Access Control Lists

ACLs provide permissions at the individual file and folder level. You can set read permission on a specific folder for a specific user without giving them access to anything else in the storage account. This level of control is only available in ADLS Gen2, not in regular Blob Storage.
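
ADLS Gen2 ACLs follow a POSIX-style model, so a user who should list one folder also needs execute (traverse) permission on every parent folder up to the container root. The sketch below works out which entries that implies, using the short `type:id:permissions` text form. The helper names and the all-zero object ID are hypothetical, and the code only plans the entries; it does not apply them.

```python
def acl_entry(principal_type, object_id, permissions):
    """Format one ACL entry, e.g. user:<object-id>:r-x."""
    if principal_type not in {"user", "group"}:
        raise ValueError("principal_type must be 'user' or 'group'")
    # Each position must be its letter from 'rwx' or a dash.
    if len(permissions) != 3 or any(
        p not in (letter, "-") for p, letter in zip(permissions, "rwx")
    ):
        raise ValueError(f"invalid permission string: {permissions!r}")
    return f"{principal_type}:{object_id}:{permissions}"


def entries_to_list_folder(object_id, folder_path):
    """Return (path, entry) pairs needed for a user to list `folder_path`."""
    parts = [p for p in folder_path.strip("/").split("/") if p]
    if not parts:                         # the target is the container root
        return [("/", acl_entry("user", object_id, "r-x"))]
    plan = [("/", acl_entry("user", object_id, "--x"))]
    for i in range(1, len(parts)):        # traverse-only on each parent folder
        plan.append(("/".join(parts[:i]), acl_entry("user", object_id, "--x")))
    # Read plus execute on the target folder itself.
    plan.append(("/".join(parts), acl_entry("user", object_id, "r-x")))
    return plan


# Placeholder object ID for illustration only.
acl_plan = entries_to_list_folder(
    "00000000-0000-0000-0000-000000000000", "sales/2024"
)
for path, entry in acl_plan:
    print(path, entry)
```

Parent folders receive execute only, so the user can pass through them without being able to list their other contents.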

Storage Tiers and Cost Management

Not all data needs to be instantly accessible. Azure offers different access tiers to help manage storage costs.

Tier    | Access Speed      | Cost                  | Best For
Hot     | Instant           | Higher storage cost   | Data accessed frequently (silver, gold layers)
Cool    | Instant           | Lower storage cost    | Data accessed occasionally (recent bronze data)
Archive | Hours to retrieve | Very low storage cost | Old data kept for compliance (bronze data older than 1 year)

A good practice: configure lifecycle management policies that automatically move data to cooler tiers after a certain number of days. For example, a policy might move bronze data older than 90 days to Cool and bronze data older than 365 days to Archive.
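
A lifecycle management policy is a JSON document attached to the storage account. The sketch below generates one for the 90-day and 365-day thresholds mentioned above. The rule name `tier-down-bronze` and the `bronze/` prefix (which assumes a container named bronze) are illustrative choices, and the age is measured from each blob's last modification time.

```python
import json


def bronze_lifecycle_policy(cool_after_days=90, archive_after_days=365):
    """Return a lifecycle policy that tiers down aging bronze data."""
    return {
        "rules": [
            {
                "enabled": True,
                "name": "tier-down-bronze",        # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        # Prefix starts with the container name (assumed: bronze).
                        "prefixMatch": ["bronze/"],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": archive_after_days
                            },
                        }
                    },
                },
            }
        ]
    }


policy = bronze_lifecycle_policy()
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and supplied when configuring lifecycle management for the account.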

Connecting to ADLS Gen2

Different Azure services connect to ADLS Gen2 in different ways:

  • Azure Data Factory: Uses a Linked Service with Managed Identity or Service Principal
  • Azure Databricks: Mounts the storage using a Service Principal or connects directly with the abfss:// protocol
  • Azure Synapse Analytics: Uses built-in integration — the storage account is linked directly to the Synapse workspace

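The abfss:// (Azure Blob File System Secure) protocol is used when code needs to read or write files in ADLS Gen2. The container name comes before the @ sign, followed by the storage account endpoint and then the folder path. A typical path looks like: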

abfss://bronze@mystorageaccount.dfs.core.windows.net/sales/2024/01/01/orders.csv
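
The path above breaks down into a container (bronze), a storage account (mystorageaccount), and a file path inside the container. The helpers below are a small sketch for building and splitting such URIs with the standard library; the function names are hypothetical and the code does not connect to any storage account.

```python
from urllib.parse import urlsplit


def abfss_uri(container, account, path):
    """Build an abfss:// URI from its three parts."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def parse_abfss_uri(uri):
    """Split an abfss:// URI into (container, account, path)."""
    parts = urlsplit(uri)
    if parts.scheme != "abfss" or not parts.username or not parts.hostname:
        raise ValueError(f"not a valid abfss URI: {uri!r}")
    container = parts.username                 # text before the @ sign
    account = parts.hostname.split(".")[0]     # first label of the endpoint host
    return container, account, parts.path.lstrip("/")


uri = abfss_uri("bronze", "mystorageaccount", "sales/2024/01/01/orders.csv")
print(uri)
print(parse_abfss_uri(uri))
```

Centralizing URI construction in one function helps keep container and account names consistent across jobs.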

Key Points

  • ADLS Gen2 is the standard storage layer for Azure data engineering workloads
  • Enable Hierarchical Namespace when creating the storage account — this activates Gen2 features
  • Use the Bronze, Silver, Gold (Medallion) architecture to organize data in layers
  • Use RBAC for broad access control and ACLs for fine-grained file-level permissions
  • Use lifecycle policies to automatically move old data to cheaper storage tiers
  • Always keep all processing services in the same region as the storage account
