Snowflake Clustering Keys and Micro-Partitions
Micro-partitions are the foundation of how Snowflake stores and retrieves data. Clustering keys are the tool you use to control how data organises itself across those micro-partitions. Together, these two concepts determine whether a query on a billion-row table scans 0.1% of partitions or 100% of them — a difference of thousands of seconds in query time and thousands of dollars in compute cost on large datasets.
Micro-Partitions: A Deep Dive
Every Snowflake table is physically stored as a collection of micro-partitions. Snowflake creates these partitions automatically when you load data. You cannot see or directly control individual micro-partitions — they are an internal implementation detail. But understanding their structure helps you design tables and queries that work with Snowflake's engine rather than against it.
MICRO-PARTITION PROPERTIES
============================
Size: 50 MB to 500 MB of uncompressed data per partition
(typically 16 MB compressed on disk)
Format: Columnar — all values from each column stored together
Compression: Automatic, algorithm chosen per column type
Immutable: Cannot be modified; DML creates new partitions
Count: A 1TB table typically has thousands of micro-partitions
Metadata: For every column in every partition, Snowflake records:
- Minimum value
- Maximum value
- Count of distinct values
- Count of NULL values
This metadata lives in Cloud Services layer, always available
MICRO-PARTITION FILE STRUCTURE (conceptual)
============================================
Micro-Partition File #847
+--------------------------+
| COLUMN: order_date |
| Values: 2024-05-10 ... |
| Min: 2024-05-10 |
| Max: 2024-05-10 | <-- metadata
| Rows: 85,000 |
+--------------------------+
| COLUMN: amount |
| Values: 145.50, 320.00. |
| Min: 10.00 |
| Max: 4,999.00 |
+--------------------------+
| COLUMN: status |
| Values: SHIPPED, SHIPPED.|
| Min: SHIPPED |
| Max: SHIPPED |
+--------------------------+
How Data Lands in Micro-Partitions at Load Time
When you run COPY INTO or INSERT, Snowflake writes incoming rows into new micro-partitions in the order the data arrives. If your source file has rows sorted by date, the resulting partitions naturally contain date ranges with minimal overlap between partitions. If your source file has rows in random order, each partition may contain data from all dates, causing heavy partition overlap.
ORDERED DATA LOADING (good clustering) ======================================= Source CSV rows (sorted by order_date): 2024-01-01, Alice, 500 2024-01-01, Bob, 200 2024-01-02, Carol, 350 ... 2024-01-31, Dave, 800 Resulting micro-partitions: Partition A: order_date MIN=2024-01-01 MAX=2024-01-07 (Jan week 1) Partition B: order_date MIN=2024-01-08 MAX=2024-01-14 (Jan week 2) Partition C: order_date MIN=2024-01-15 MAX=2024-01-31 (Jan weeks 3-4) Query: WHERE order_date = '2024-01-05' Partition A → SCAN (contains Jan 5) Partition B → SKIP Partition C → SKIP Result: 1 partition scanned out of 3 (33%) RANDOM DATA LOADING (poor clustering) ======================================= Source CSV rows (unsorted): 2024-01-15, Eve, 900 2024-03-02, Frank, 150 2024-01-05, Grace, 400 2024-06-20, Henry, 700 ... Resulting micro-partitions: Partition A: order_date MIN=2024-01-05 MAX=2024-06-20 (spans 6 months!) Partition B: order_date MIN=2024-01-10 MAX=2024-05-15 (spans 4 months) Partition C: order_date MIN=2024-02-01 MAX=2024-06-10 (spans 4 months) Query: WHERE order_date = '2024-01-05' Partition A → SCAN (MIN ≤ 2024-01-05 ≤ MAX) Partition B → SCAN (MIN ≤ 2024-01-05 ≤ MAX) Partition C → SCAN (MIN ≤ 2024-01-05 ≤ MAX) Result: All 3 partitions scanned (100%) — no pruning possible
Clustering Depth and the Clustering Information Function
Snowflake provides a built-in function to measure how well a table is clustered on a given column. The output tells you whether adding or changing a clustering key would improve query performance.
-- Measure clustering quality on the order_date column
SELECT SYSTEM$CLUSTERING_INFORMATION('ORDERS', '(order_date)');
-- Sample output (JSON):
{
"average_overlaps": 842.3, -- average partitions that overlap at any point
"average_depth": 843.8, -- average stack depth (1.0 = perfect, higher = worse)
"total_partition_count": 1200,
"clustered_partition_count": 45
}
INTERPRETING CLUSTERING DEPTH:
average_depth = 1.0 Perfect clustering (each value in only one partition)
average_depth = 1.0 to 5.0 Good clustering (minimal overlap)
average_depth = 5.0 to 20.0 Fair clustering (moderate overlap, query still pruning)
average_depth > 20.0 Poor clustering (high overlap, consider clustering key)
average_depth ≈ total partitions No clustering (data randomly distributed)
Clustering Keys: Telling Snowflake How to Organise Data
A clustering key instructs Snowflake to keep rows with similar values on the chosen column(s) physically co-located in the same or adjacent micro-partitions. Snowflake continuously reorganises the table in the background — a process called Automatic Clustering — to maintain good organisation as new data loads.
-- Add a clustering key to an existing large table
ALTER TABLE ORDERS CLUSTER BY (order_date);
-- Cluster by multiple columns (most selective column first)
ALTER TABLE ORDERS CLUSTER BY (order_date, country);
-- Cluster using an expression
ALTER TABLE ORDERS CLUSTER BY (DATE_TRUNC('MONTH', order_date));
-- Check clustering status
SELECT SYSTEM$CLUSTERING_INFORMATION('ORDERS', '(order_date)');
-- Remove a clustering key
ALTER TABLE ORDERS DROP CLUSTERING KEY;
Choosing the Right Clustering Key
GOOD CLUSTERING KEY CANDIDATES WHY
------------------------------- ---
Date/timestamp columns Most queries filter by date range
High-cardinality columns used in WHERE Enables tight partition boundaries
Foreign keys used in JOIN conditions Collocates data for fast joins
Status or category columns (combined Reduces partitions scanned for
with date in compound key) status+date queries
BAD CLUSTERING KEY CHOICES WHY NOT
------------------------------ -------
Columns rarely used in WHERE/JOIN Reorganisation cost with no benefit
Boolean columns (only 2 values) All partitions contain both values
Columns with very low cardinality Partitions span entire value range
Primary key (unique per row) Every partition contains every value
range — no meaningful pruning
Automatic Clustering: Snowflake Maintains Order for You
Once you define a clustering key, Snowflake's Automatic Clustering service runs in the background and continuously reorganises micro-partitions to reduce overlap. You do not trigger this manually — Snowflake monitors the clustering depth and kicks off reorganisation when the table degrades beyond a threshold.
AUTOMATIC CLUSTERING WORKFLOW ============================== DAY 1: You define CLUSTER BY (order_date) on ORDERS table Snowflake runs initial clustering reorganisation Clustering depth improves from 842 to 1.3 Cost: background compute credits (separate from your warehouse) DAY 2-7: Your ETL loads 100K new rows each night New rows arrive in new micro-partitions with good internal order Clustering depth gradually increases as new partitions accumulate (each nightly load adds well-ordered partitions at the end of the table) DAY 8: Clustering depth reaches threshold (e.g., depth = 4.0) Snowflake's Automatic Clustering service re-sorts affected partitions Depth returns to ~1.3 Your queries continue to run fast with no manual intervention AUTOMATIC CLUSTERING COST: - Snowflake charges compute credits for reorganisation work - Cost depends on table size and how frequently data loads disturb order - Monitor with: SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY - Consider whether query performance gains justify the ongoing clustering cost
When to Use Clustering Keys
USE A CLUSTERING KEY WHEN: Table has > 1TB of data (smaller tables prune well without clustering) You run frequent range queries on the same column (e.g., last 30 days) Current clustering depth > 10 and query performance is unsatisfactory The table is append-only and data loads in random order DO NOT USE A CLUSTERING KEY WHEN: Table is small (< 500GB) — automatic pruning already works well Queries access the full table most of the time (no selective filters) Data is frequently updated or deleted (creates heavy recluster churn) You cannot measure a performance problem first — do not cluster prematurely COST-BENEFIT DECISION FRAMEWORK: Monthly clustering compute cost < Monthly compute saved by faster queries? YES → Add clustering key NO → Explore other optimisations first (query rewrites, partition pruning rules)
Search Optimisation Service: Point Lookup Acceleration
Clustering keys improve range queries (WHERE date BETWEEN x AND y). For point lookup queries (WHERE customer_id = 12345 or WHERE email = 'alice@example.com'), Snowflake offers the Search Optimisation Service. It builds a persistent, automatically-maintained secondary data structure that enables fast single-row lookups without scanning all partitions.
-- Enable Search Optimisation on a table ALTER TABLE CUSTOMERS ADD SEARCH OPTIMIZATION; -- Enable for a specific column only (cheaper) ALTER TABLE CUSTOMERS ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id); ALTER TABLE CUSTOMERS ADD SEARCH OPTIMIZATION ON EQUALITY(email); ALTER TABLE CUSTOMERS ADD SEARCH OPTIMIZATION ON SUBSTRING(product_name); -- Check search optimization status SHOW SEARCH OPTIMIZATIONS IN TABLE CUSTOMERS; USE WHEN: Frequent point lookups by customer ID, email, or other unique fields Support queries that look up one specific record at a time High-cardinality columns where clustering keys do not help COST: Additional storage for the search access paths (typically 30-50% of table size) Ongoing credit cost to maintain as data changes
Key Points
- Snowflake stores every table as immutable micro-partitions of 50–500 MB, automatically compressed and stored in columnar format
- Snowflake records metadata (min, max, null count, distinct count) for every column in every micro-partition — this metadata drives partition pruning
- When data loads in natural order (e.g., sorted by date), micro-partitions have minimal range overlap and pruning works automatically without a clustering key
- When data loads in random order, partition ranges overlap heavily — queries scan too many partitions — this is when a clustering key helps
- Use SYSTEM$CLUSTERING_INFORMATION() to measure clustering depth before adding a key — only add one if depth is high and query performance is genuinely suffering
- Automatic Clustering runs in the background to maintain partition order as new data loads — it incurs ongoing compute credit cost
- Clustering keys help range queries; Search Optimisation Service helps point lookup queries — they address different access patterns
