Snowflake Clustering Keys and Micro-Partitions

Micro-partitions are the foundation of how Snowflake stores and retrieves data. Clustering keys are the tool you use to control how data organises itself across those micro-partitions. Together, these two concepts determine whether a query on a billion-row table scans 0.1% of partitions or 100% of them — a difference of thousands of seconds in query time and thousands of dollars in compute cost on large datasets.

Micro-Partitions: A Deep Dive

Every Snowflake table is physically stored as a collection of micro-partitions. Snowflake creates these partitions automatically when you load data. You cannot see or directly control individual micro-partitions — they are an internal implementation detail. But understanding their structure helps you design tables and queries that work with Snowflake's engine rather than against it.

MICRO-PARTITION PROPERTIES
============================

Size:         50 MB to 500 MB of uncompressed data per partition
              (typically 16 MB compressed on disk)
Format:       Columnar — all values from each column stored together
Compression:  Automatic, algorithm chosen per column type
Immutable:    Cannot be modified; DML creates new partitions
Count:        A 1TB table typically has thousands of micro-partitions
Metadata:     For every column in every partition, Snowflake records:
              - Minimum value
              - Maximum value
              - Count of distinct values
              - Count of NULL values
              This metadata lives in Cloud Services layer, always available

MICRO-PARTITION FILE STRUCTURE (conceptual)
============================================

Micro-Partition File #847
+--------------------------+
| COLUMN: order_date       |
| Values: 2024-05-10 ...   |
| Min: 2024-05-10          |
| Max: 2024-05-10          |  <-- metadata
| Rows: 85,000             |
+--------------------------+
| COLUMN: amount           |
| Values: 145.50, 320.00.  |
| Min: 10.00               |
| Max: 4,999.00            |
+--------------------------+
| COLUMN: status           |
| Values: SHIPPED, SHIPPED.|
| Min: SHIPPED             |
| Max: SHIPPED             |
+--------------------------+

How Data Lands in Micro-Partitions at Load Time

When you run COPY INTO or INSERT, Snowflake writes incoming rows into new micro-partitions in the order the data arrives. If your source file has rows sorted by date, the resulting partitions naturally contain date ranges with minimal overlap between partitions. If your source file has rows in random order, each partition may contain data from all dates, causing heavy partition overlap.

ORDERED DATA LOADING (good clustering)
=======================================

Source CSV rows (sorted by order_date):
  2024-01-01, Alice,   500
  2024-01-01, Bob,     200
  2024-01-02, Carol,   350
  ...
  2024-01-31, Dave,    800

Resulting micro-partitions:
  Partition A: order_date MIN=2024-01-01  MAX=2024-01-07  (Jan week 1)
  Partition B: order_date MIN=2024-01-08  MAX=2024-01-14  (Jan week 2)
  Partition C: order_date MIN=2024-01-15  MAX=2024-01-31  (Jan weeks 3-4)

Query: WHERE order_date = '2024-01-05'
  Partition A → SCAN (contains Jan 5)
  Partition B → SKIP
  Partition C → SKIP
  Result: 1 partition scanned out of 3 (33%)

RANDOM DATA LOADING (poor clustering)
=======================================

Source CSV rows (unsorted):
  2024-01-15, Eve,  900
  2024-03-02, Frank, 150
  2024-01-05, Grace, 400
  2024-06-20, Henry, 700
  ...

Resulting micro-partitions:
  Partition A: order_date MIN=2024-01-05  MAX=2024-06-20  (spans 6 months!)
  Partition B: order_date MIN=2024-01-10  MAX=2024-05-15  (spans 4 months)
  Partition C: order_date MIN=2024-02-01  MAX=2024-06-10  (spans 4 months)

Query: WHERE order_date = '2024-01-05'
  Partition A → SCAN (MIN ≤ 2024-01-05 ≤ MAX)
  Partition B → SCAN (MIN ≤ 2024-01-05 ≤ MAX)
  Partition C → SCAN (MIN ≤ 2024-01-05 ≤ MAX)
  Result: All 3 partitions scanned (100%) — no pruning possible

Clustering Depth and the Clustering Information Function

Snowflake provides a built-in function to measure how well a table is clustered on a given column. The output tells you whether adding or changing a clustering key would improve query performance.

-- Measure clustering quality on the order_date column
SELECT SYSTEM$CLUSTERING_INFORMATION('ORDERS', '(order_date)');

-- Sample output (JSON):
{
  "average_overlaps": 842.3,    -- average partitions that overlap at any point
  "average_depth":    843.8,    -- average stack depth (1.0 = perfect, higher = worse)
  "total_partition_count": 1200,
  "clustered_partition_count": 45
}

INTERPRETING CLUSTERING DEPTH:
  average_depth = 1.0          Perfect clustering (each value in only one partition)
  average_depth = 1.0 to 5.0   Good clustering (minimal overlap)
  average_depth = 5.0 to 20.0  Fair clustering (moderate overlap, query still pruning)
  average_depth > 20.0         Poor clustering (high overlap, consider clustering key)
  average_depth ≈ total partitions  No clustering (data randomly distributed)

Clustering Keys: Telling Snowflake How to Organise Data

A clustering key instructs Snowflake to keep rows with similar values on the chosen column(s) physically co-located in the same or adjacent micro-partitions. Snowflake continuously reorganises the table in the background — a process called Automatic Clustering — to maintain good organisation as new data loads.

-- Add a clustering key to an existing large table
ALTER TABLE ORDERS CLUSTER BY (order_date);

-- Cluster by multiple columns (most selective column first)
ALTER TABLE ORDERS CLUSTER BY (order_date, country);

-- Cluster using an expression
ALTER TABLE ORDERS CLUSTER BY (DATE_TRUNC('MONTH', order_date));

-- Check clustering status
SELECT SYSTEM$CLUSTERING_INFORMATION('ORDERS', '(order_date)');

-- Remove a clustering key
ALTER TABLE ORDERS DROP CLUSTERING KEY;

Choosing the Right Clustering Key

GOOD CLUSTERING KEY CANDIDATES           WHY
-------------------------------          ---
Date/timestamp columns                   Most queries filter by date range
High-cardinality columns used in WHERE   Enables tight partition boundaries
Foreign keys used in JOIN conditions     Collocates data for fast joins
Status or category columns (combined     Reduces partitions scanned for
  with date in compound key)             status+date queries

BAD CLUSTERING KEY CHOICES               WHY NOT
------------------------------           -------
Columns rarely used in WHERE/JOIN        Reorganisation cost with no benefit
Boolean columns (only 2 values)          All partitions contain both values
Columns with very low cardinality        Partitions span entire value range
Primary key (unique per row)             Every partition contains every value
                                         range — no meaningful pruning

Automatic Clustering: Snowflake Maintains Order for You

Once you define a clustering key, Snowflake's Automatic Clustering service runs in the background and continuously reorganises micro-partitions to reduce overlap. You do not trigger this manually — Snowflake monitors the clustering depth and kicks off reorganisation when the table degrades beyond a threshold.

AUTOMATIC CLUSTERING WORKFLOW
==============================

DAY 1: You define CLUSTER BY (order_date) on ORDERS table
  Snowflake runs initial clustering reorganisation
  Clustering depth improves from 842 to 1.3
  Cost: background compute credits (separate from your warehouse)

DAY 2-7: Your ETL loads 100K new rows each night
  New rows arrive in new micro-partitions with good internal order
  Clustering depth gradually increases as new partitions accumulate
  (each nightly load adds well-ordered partitions at the end of the table)

DAY 8: Clustering depth reaches threshold (e.g., depth = 4.0)
  Snowflake's Automatic Clustering service re-sorts affected partitions
  Depth returns to ~1.3
  Your queries continue to run fast with no manual intervention

AUTOMATIC CLUSTERING COST:
  - Snowflake charges compute credits for reorganisation work
  - Cost depends on table size and how frequently data loads disturb order
  - Monitor with: SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
  - Consider whether query performance gains justify the ongoing clustering cost

When to Use Clustering Keys

USE A CLUSTERING KEY WHEN:
  Table has > 1TB of data (smaller tables prune well without clustering)
  You run frequent range queries on the same column (e.g., last 30 days)
  Current clustering depth > 10 and query performance is unsatisfactory
  The table is append-only and data loads in random order

DO NOT USE A CLUSTERING KEY WHEN:
  Table is small (< 500GB) — automatic pruning already works well
  Queries access the full table most of the time (no selective filters)
  Data is frequently updated or deleted (creates heavy recluster churn)
  You cannot measure a performance problem first — do not cluster prematurely

COST-BENEFIT DECISION FRAMEWORK:
  Monthly clustering compute cost < Monthly compute saved by faster queries?
  YES → Add clustering key
  NO  → Explore other optimisations first (query rewrites, partition pruning rules)

Search Optimisation Service: Point Lookup Acceleration

Clustering keys improve range queries (WHERE date BETWEEN x AND y). For point lookup queries (WHERE customer_id = 12345 or WHERE email = 'alice@example.com'), Snowflake offers the Search Optimisation Service. It builds a persistent, automatically-maintained secondary data structure that enables fast single-row lookups without scanning all partitions.

-- Enable Search Optimisation on a table
ALTER TABLE CUSTOMERS ADD SEARCH OPTIMIZATION;

-- Enable for a specific column only (cheaper)
ALTER TABLE CUSTOMERS ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id);
ALTER TABLE CUSTOMERS ADD SEARCH OPTIMIZATION ON EQUALITY(email);
ALTER TABLE CUSTOMERS ADD SEARCH OPTIMIZATION ON SUBSTRING(product_name);

-- Check search optimization status
SHOW SEARCH OPTIMIZATIONS IN TABLE CUSTOMERS;

USE WHEN:
  Frequent point lookups by customer ID, email, or other unique fields
  Support queries that look up one specific record at a time
  High-cardinality columns where clustering keys do not help

COST:
  Additional storage for the search access paths (typically 30-50% of table size)
  Ongoing credit cost to maintain as data changes

Key Points

  • Snowflake stores every table as immutable micro-partitions of 50–500 MB, automatically compressed and stored in columnar format
  • Snowflake records metadata (min, max, null count, distinct count) for every column in every micro-partition — this metadata drives partition pruning
  • When data loads in natural order (e.g., sorted by date), micro-partitions have minimal range overlap and pruning works automatically without a clustering key
  • When data loads in random order, partition ranges overlap heavily — queries scan too many partitions — this is when a clustering key helps
  • Use SYSTEM$CLUSTERING_INFORMATION() to measure clustering depth before adding a key — only add one if depth is high and query performance is genuinely suffering
  • Automatic Clustering runs in the background to maintain partition order as new data loads — it incurs ongoing compute credit cost
  • Clustering keys help range queries; Search Optimisation Service helps point lookup queries — they address different access patterns

Leave a Comment

Your email address will not be published. Required fields are marked *