Snowflake Micro-Partitions & Clustering: How Data Organization Impacts Performance
Most Snowflake users understand warehouses, credits, and queries pretty well. But there’s a lot that goes on “under the hood,” as it were. For example, Snowflake tables are organized into data blocks, called micro-partitions, that enable aggressive pruning of unnecessary data accesses and are essential to query performance.
If you understand how these micro-partitions (and, by extension, clustering) work, you can take action to improve query efficiency. However, there are limitations to this approach, and relying on it as your sole optimization strategy could end up resulting in missed savings opportunities.
In this article, we’ll break down how Snowflake structures data into micro-partitions, how clustering affects query pruning, and how to use powerful metadata functions like SYSTEM$CLUSTERING_INFORMATION to diagnose issues and fine-tune performance.
Key Takeaways
- Micro-partitions store Snowflake data in compact, immutable blocks with built-in metadata that enables efficient query pruning. This allows Snowflake to skip irrelevant data and dramatically speed up queries.
- By grouping related rows based on clustering keys, Snowflake improves pruning efficiency and query performance, particularly for large, frequently queried tables.
- While micro-partitions and clustering reduce costs and boost performance, excessive reclustering or poor key selection can backfire. It’s important to monitor clustering depth, query pruning efficiency, and warehouse usage regularly to stay optimized.
What Are Micro-Partitions in Snowflake?
Snowflake micro-partitions are the platform’s foundational units of data storage: small, contiguous blocks that each hold roughly 50–500 MB of uncompressed data, stored in heavily compressed columnar form. Snowflake automatically creates and manages them whenever data is loaded, inserted, or modified.
For each micro-partition, Snowflake tracks column-level statistics such as min/max values, the number of distinct values, null counts, and other internal statistics. This rich metadata lets Snowflake’s query optimizer decide which partitions to skip during query execution (a process known as pruning), turning what would be a full table scan into a targeted, efficient read.
What’s more, Snowflake micro-partitions are immutable, meaning that Snowflake doesn’t update them when data changes. Instead, the platform simply creates new blocks. There are many benefits to this approach, including improved consistency, simplified concurrency, and powerful time travel and cloning features.
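Because old micro-partitions are retained rather than overwritten, you can query a table as it existed at an earlier point in time, or clone it without copying any data. Here’s a quick illustration (the orders table and the one-hour offset are just placeholders):
-- Query the table as it existed one hour ago (Time Travel)
SELECT * FROM orders AT(OFFSET => -3600);
-- Create a zero-copy clone that initially shares the same micro-partitions
CREATE TABLE orders_backup CLONE orders;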
Understanding Clustering in Snowflake
Key to understanding how micro-partitions work in Snowflake is understanding how clustering works. Clustering is the process of physically organizing data within micro-partitions based on specific columns (called clustering keys). By colocating related rows in storage, Snowflake improves both partition pruning and query performance, especially for large datasets where scanning unnecessary partitions drives up both runtime and cost.
There are two main types of clustering behavior:
- Natural clustering happens when there’s no defined clustering key; in this case, Snowflake organizes data in order of insertion
- Defined clustering happens when you set explicit clustering keys; in this case, Snowflake orders data around specific columns
Here’s an example of how to define a clustering key:
CREATE OR REPLACE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount NUMBER(10,2)
)
CLUSTER BY (order_date, customer_id);
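If the table already exists, you can set (or remove) a clustering key with an ALTER statement:
-- Add or change the clustering key on an existing table
ALTER TABLE orders CLUSTER BY (order_date, customer_id);
-- Remove the clustering key if it's no longer helping
ALTER TABLE orders DROP CLUSTERING KEY;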
Although defined clustering takes some effort to configure and consumes compute resources to maintain, it helps avoid both fragmented and overlapping partitions, while also aligning the physical data layout more closely with common query patterns.
However, clustering has to be applied with caution: a configuration that works well for one part of your workload may harm another. Deciding whether to use defined clustering, and which columns to cluster on, is a non-trivial decision that requires a deep understanding of the workload and the data, as well as a careful analysis of the performance impact.
As tables undergo insertions, deletions, and updates, a once-effective clustering can become stale and must be re-established. When it comes to who does the reclustering, Snowflake has offered two options: manual and automatic. Here’s a quick rundown of each:
| | Manual | Automatic |
| --- | --- | --- |
| Control | Full control over when and how reclustering happens. | Reclustering happens automatically and continuously in the background; no manual intervention. |
| Cost Management | More predictable and potentially lower costs, as reclustering is triggered manually. | Costs vary with data size and update frequency; ongoing background reclustering activity consumes credits. |
| Maintenance | Requires ongoing operational effort to schedule and execute reclustering jobs. | Requires less maintenance effort, leading to uninterrupted data operations. |
| Effectiveness | Can be optimized precisely based on workload knowledge and timing. | Continuously optimizes clustering depth and data arrangement to maintain consistent query performance. |
| Ideal Use Case | Best when precise control over reclustering operations is needed (e.g., to manage cost). | Best in any case that does not explicitly require manual reclustering; Snowflake decides when reclustering a table is worthwhile and issues jobs accordingly. |
Manual reclustering has been deprecated since May 2020 and has been removed from most accounts. Consequently, defining a clustering key on a table implicitly enables automatic clustering. Suspending automatic reclustering is the only remaining option to retain full control of the cost associated with reclustering:
ALTER TABLE orders SUSPEND RECLUSTER;
Replace SUSPEND with RESUME to enable automatic clustering again as needed.
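To keep an eye on what automatic clustering actually costs, you can check the credits it consumes per table. Here’s a rough sketch using the AUTOMATIC_CLUSTERING_HISTORY table function (the seven-day window is an arbitrary choice):
-- Credits consumed by automatic clustering over the past week, by table
SELECT table_name, SUM(credits_used) AS credits_used
FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
    DATE_RANGE_START => DATEADD('day', -7, CURRENT_DATE())))
GROUP BY table_name
ORDER BY credits_used DESC;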
How Do I Query Clustering Information and Health?
Snowflake offers two system functions that surface key information for assessing your clustering health. First, the SYSTEM$CLUSTERING_INFORMATION function provides a summary of how well data is organized relative to the clustering keys:
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_TABLE');
If you run that query, Snowflake will return:
- Clustering key expression (i.e. which columns define the clustering)
- Average overlap depth
- Number of partitions
Snowflake also returns several other useful statistics that indicate how well your table’s current clustering is performing.
Additionally, the SYSTEM$CLUSTERING_DEPTH function focuses specifically on depth metrics for the defined keys. It’s useful for detecting when a table has drifted and might need reclustering; you can run it by swapping in the function name in the example above.
If a table doesn’t have a defined clustering key yet, or you’re investigating a potential new one, these functions can also evaluate a candidate set of clustering columns.
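For example, passing candidate columns as a second argument returns the same statistics as if the table were clustered on them, which is a cheap way to evaluate a key before committing to it:
-- Evaluate how well MY_TABLE would cluster on (order_date, customer_id)
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_TABLE', '(order_date, customer_id)');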
How Does Query Pruning Work in Snowflake?
Snowflake query pruning is, simply put, the process of skipping irrelevant data during query execution. It should be pretty obvious how this helps to improve performance and mitigate costs. But how exactly does it work? Here’s an overview of the process:
- Each micro-partition stores metadata that includes minimum and maximum values for every column within that partition, plus counts of distinct values and other statistics.
- When a query runs with filter predicates (e.g., WHERE clauses), the pruning engine compares the query’s filter conditions against the min/max metadata of each micro-partition for the relevant columns.
- If the metadata indicates that the micro-partition’s column value ranges do not overlap with the query’s filter conditions, that entire micro-partition can be safely skipped without scanning its data.
Because Snowflake micro-partitions are already so small (remember, 50 MB to 500 MB of raw data, but highly compressed), query pruning can happen on a very granular level. This is the core reason why Snowflake queries on large tables can run very fast, often scanning just a small fraction of the table’s data based on the query filters.
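If you want to check how well your own queries are pruning, the Query Profile shows “Partitions scanned” versus “Partitions total,” and the same numbers are available in the ACCOUNT_USAGE.QUERY_HISTORY view. Here’s a sketch that surfaces recent queries with poor pruning (the ordering and LIMIT are arbitrary choices):
-- Queries that scanned the largest fraction of their tables' partitions
SELECT query_id,
       partitions_scanned,
       partitions_total,
       partitions_scanned / NULLIF(partitions_total, 0) AS scan_ratio
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE partitions_total > 0
ORDER BY scan_ratio DESC
LIMIT 20;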
Limitations of Snowflake Clustering
That said, clustering, micro-partitions, and query pruning aren’t a silver bullet that will fix all your cost and performance issues. Here are a few limitations to keep in mind:
- Determining a good clustering key requires a careful analysis of the entire workload
- Choosing a poorly suited clustering key can degrade performance rather than improve it, and may even be worse than natural clustering
- Reclustering operations, especially on large tables, consume compute credits and can increase costs
- Complex filters and query patterns may not benefit from pruning sufficiently and still necessitate a larger warehouse or Snowflake’s search optimization service
- Clustering availability, behavior, and metadata visibility may differ for certain object types, such as external tables, which can affect how clustering works or is inspected in those contexts
Final Thoughts on Snowflake Micro-Partitions & Clustering
Snowflake micro-partitions form the foundation of Snowflake’s performance engine, while clustering optimizes how the platform organizes the data contained within those partitions.
While these key features can help with performance and thus contribute to keeping costs under control, there are a variety of scenarios they can’t handle (even with automatic clustering). For example, real-time changes to your workloads (e.g., a temporary usage spike) are tough to manage, as even automatic clustering lags behind the data. What’s more, dynamic, variable workloads that are hard to predict often require manual intervention, which is time-consuming and expensive.
Thankfully, there’s a better approach. Keebo’s autonomous cost and performance optimization engine keeps your warehouses right-sized to your workloads, even if those workloads fluctuate a hundred times a day. What’s more, Keebo does this 24/7, so you can be saving credits while your whole team is asleep.
Go beyond Snowflake’s native features and take control of your costs today. Contact Keebo for a short demo to see how much you could be saving right now.

