Snowflake Micro-Partitions & Clustering: How Data Organization Impacts Performance
Most Snowflake users understand warehouses, credits, and queries pretty well. But there’s a lot that goes on “under the hood,” as it were. For example, Snowflake tables are organized into data blocks, called micro-partitions, that enable aggressive pruning of unnecessary data accesses and are essential to query performance.
If you understand how these micro-partitions (and, by extension, clustering) work, you can take action to improve query efficiency. However, there are limitations to this approach, and relying on it as your sole optimization strategy could end up resulting in missed savings opportunities.
In this article, we’ll break down how Snowflake structures data into micro-partitions, how clustering affects query pruning, and how to use powerful metadata functions like SYSTEM$CLUSTERING_INFORMATION to diagnose issues and fine-tune performance.
Key Takeaways
- Micro-partitions store Snowflake data in compact, immutable blocks with built-in metadata that enables efficient query pruning. This allows Snowflake to skip irrelevant data and dramatically speed up queries.
- By grouping related rows based on clustering keys, Snowflake improves pruning efficiency and query performance, particularly for large, frequently queried tables.
- While micro-partitions and clustering reduce costs and boost performance, excessive reclustering or poor key selection can backfire. It’s important to monitor clustering depth, query pruning efficiency, and warehouse usage regularly to stay optimized.
What Are Micro-Partitions in Snowflake?
Snowflake micro-partitions are the platform’s foundational units of data storage: small, contiguous blocks that each hold roughly 50–500 MB of uncompressed data, stored in heavily compressed columnar form. Snowflake automatically creates and manages them whenever data is loaded, inserted, or modified.
For each micro-partition, Snowflake tracks column-level statistics such as min/max values, the number of distinct values, null counts, and other internal statistics. This rich metadata lets Snowflake’s query optimizer decide which partitions to skip during query execution (a process known as pruning), turning what would be a full table scan into a targeted, efficient read.
What’s more, Snowflake micro-partitions are immutable, meaning that Snowflake doesn’t update them when data changes. Instead, the platform simply creates new blocks. There are many benefits to this approach, including improved consistency, simplified concurrency, and powerful time travel and cloning features.
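Because old micro-partitions are retained rather than overwritten, you can query a table as it existed at an earlier point in time, or clone it without copying any data. Here’s a quick illustration (the orders table and the one-hour offset are just placeholders):
-- Query the table as it existed one hour ago (Time Travel)
SELECT * FROM orders AT(OFFSET => -3600);
-- Create a zero-copy clone that initially shares the same micro-partitions
CREATE TABLE orders_backup CLONE orders;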
Understanding Clustering in Snowflake
Key to understanding how micro-partitions work in Snowflake is understanding how clustering works. Clustering is the process of physically organizing data within micro-partitions based on specific columns (called clustering keys). By colocating related rows in storage, Snowflake improves both partition pruning and query performance, especially for large datasets where scanning unnecessary partitions drives up both runtime and cost.
There are two main types of clustering behavior:
- Natural clustering happens when there’s no defined clustering key; in this case, Snowflake organizes data in order of insertion
- Defined clustering happens when you set explicit clustering keys; in this case, Snowflake orders data around specific columns
Here’s an example of how to define a clustering key:
CREATE OR REPLACE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount NUMBER(10,2)
)
CLUSTER BY (order_date, customer_id);
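If the table already exists, you can set (or remove) a clustering key with an ALTER statement:
-- Add or change the clustering key on an existing table
ALTER TABLE orders CLUSTER BY (order_date, customer_id);
-- Remove the clustering key if it's no longer helping
ALTER TABLE orders DROP CLUSTERING KEY;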
Although defined clustering takes some effort to configure and consumes compute resources to maintain, it helps avoid both fragmented and overlapping partitions, while also aligning the physical data layout more closely with common query patterns.
However, clustering has to be applied with caution: a configuration that works well for one part of your workload may harm another. Deciding whether to use defined clustering, and which columns to cluster on, is a non-trivial decision that requires a deep understanding of the workload and the data, as well as a careful analysis of the performance impact.
As tables undergo insertions, deletions, and updates, a once-effective clustering can become stale and must be re-established. When it comes to who does the reclustering, Snowflake has offered two options: manual and automatic. Here’s a quick rundown of each:
| | Manual | Automatic |
| --- | --- | --- |
| Control | Full control over when and how reclustering happens. | Reclustering happens automatically and continuously in the background; no manual intervention. |
| Cost Management | More predictable and potentially lower costs, as reclustering is triggered manually. | Costs vary with data size and update frequency; ongoing background reclustering activity consumes credits. |
| Maintenance | Requires ongoing operational effort to schedule and execute reclustering jobs. | Requires less maintenance effort, leading to uninterrupted data operations. |
| Effectiveness | Can be optimized precisely based on workload knowledge and timing. | Continuously optimizes clustering depth and data arrangement to maintain consistent query performance. |
| Ideal Use Case | Best when precise control over reclustering operations is needed (e.g., to manage cost). | Best in any case that does not explicitly require manual reclustering; Snowflake decides when reclustering a table is worthwhile and issues jobs accordingly. |
Manual reclustering has been deprecated since May 2020 and has been removed from most accounts. Consequently, defining a clustering key on a table implicitly enables automatic clustering. Suspending automatic reclustering is the only remaining option to retain full control of the cost associated with reclustering:
ALTER TABLE orders SUSPEND RECLUSTER;
Replace SUSPEND with RESUME to enable automatic clustering again as needed.
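To keep an eye on what automatic clustering actually costs, you can check the credits it consumes per table. Here’s a rough sketch using the AUTOMATIC_CLUSTERING_HISTORY table function (the seven-day window is an arbitrary choice):
-- Credits consumed by automatic clustering over the past week, by table
SELECT table_name, SUM(credits_used) AS credits_used
FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
    DATE_RANGE_START => DATEADD('day', -7, CURRENT_DATE())))
GROUP BY table_name
ORDER BY credits_used DESC;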
How Do I Query Clustering Information and Health?
Snowflake offers two system functions that surface key information for assessing your clustering health. First, the SYSTEM$CLUSTERING_INFORMATION function provides a summary of how well data is organized relative to the clustering keys:
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_TABLE');
If you run that query, Snowflake will return:
- Clustering key expression (i.e. which columns define the clustering)
- Average overlap depth
- Number of partitions
Snowflake also returns several other useful statistics that indicate how well your table’s current clustering is performing.
Additionally, the SYSTEM$CLUSTERING_DEPTH function focuses specifically on depth metrics for the defined keys. It’s useful for detecting when a table has drifted and might need reclustering; you can run it by swapping in the function name in the example above.
If a table doesn’t have a defined clustering key yet, or you’re investigating a potential new one, these functions can also evaluate a candidate set of clustering columns.
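For example, passing candidate columns as a second argument returns the same statistics as if the table were clustered on them, which is a cheap way to evaluate a key before committing to it:
-- Evaluate how well MY_TABLE would cluster on (order_date, customer_id)
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_TABLE', '(order_date, customer_id)');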
How Does Query Pruning Work in Snowflake?
Snowflake query pruning is, simply put, the process of skipping irrelevant data during query execution. It should be pretty obvious how this helps to improve performance and mitigate costs. But how exactly does it work? Here’s an overview of the process:
- Each micro-partition stores metadata that includes minimum and maximum values for every column within that partition, plus counts of distinct values and other statistics.
- When a query runs with filter predicates (e.g., WHERE clauses), the pruning engine compares the query’s filter conditions against the min/max metadata of each micro-partition for the relevant columns.
- If the metadata indicates that the micro-partition’s column value ranges do not overlap with the query’s filter conditions, that entire micro-partition can be safely skipped without scanning its data.
Because Snowflake micro-partitions are already so small (remember, 50 MB to 500 MB of raw data, but highly compressed), query pruning can happen on a very granular level. This is the core reason why Snowflake queries on large tables can run very fast, often scanning just a small fraction of the table’s data based on the query filters.
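If you want to check how well your own queries are pruning, the Query Profile shows “Partitions scanned” versus “Partitions total,” and the same numbers are available in the ACCOUNT_USAGE.QUERY_HISTORY view. Here’s a sketch that surfaces recent queries with poor pruning (the ordering and LIMIT are arbitrary choices):
-- Queries that scanned the largest fraction of their tables' partitions
SELECT query_id,
       partitions_scanned,
       partitions_total,
       partitions_scanned / NULLIF(partitions_total, 0) AS scan_ratio
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE partitions_total > 0
ORDER BY scan_ratio DESC
LIMIT 20;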
Limitations of Snowflake Clustering
That said, clustering, micro-partitions, and query pruning aren’t a silver bullet that will fix all your cost and performance issues. Here are a few limitations to keep in mind:
- Determining a good clustering key requires a careful analysis of the entire workload
- Choosing a poorly suited clustering key can degrade performance rather than improve it, and may even be worse than natural clustering
- Reclustering operations, especially on large tables, consume compute credits and can increase costs
- Complex filters and query patterns may not benefit from pruning sufficiently and still necessitate a larger warehouse or Snowflake’s search optimization service
- Clustering availability, behavior, and metadata visibility may differ for certain object types, such as external tables, which can affect how clustering works or is inspected in those contexts
Final Thoughts on Snowflake Micro-Partitions & Clustering
Snowflake micro-partitions form the foundation of Snowflake’s performance engine, while clustering optimizes how the platform organizes the data contained within those partitions.
While these key features can help with performance and thus contribute to keeping costs under control, there are a variety of scenarios they can’t handle (even with automatic clustering). For example, real-time changes to your workloads (e.g., a temporary usage spike) are tough to manage, as even automatic clustering lags behind the data. What’s more, dynamic, variable workloads that are hard to predict often require manual intervention, which is time-consuming and expensive.
Thankfully, there’s a better approach. Keebo’s autonomous cost and performance optimization engine keeps your warehouses right-sized to your workloads, even if those workloads fluctuate a hundred times a day. What’s more, Keebo does this 24/7, so you can be saving credits while your whole team is asleep.
Go beyond Snowflake’s native features and take control of your costs today. Contact Keebo for a short demo to see how much you could be saving right now.

