Snowflake Micro-Partitions & Clustering: How Data Organization Impacts Performance

Most Snowflake users understand warehouses, credits, and queries, but some of the most important details sit under the hood. Every table is stored as micro-partitions: blocks of data whose metadata enables pruning, which cuts unnecessary scans and improves query performance.

If you understand how these micro-partitions (and, by extension, clustering) work, you can improve query efficiency. However, this approach has limits, and relying on it alone could lead to missed savings.

In this article, we’ll break down how Snowflake structures data into micro-partitions, how clustering affects query pruning, and how to use metadata functions like SYSTEM$CLUSTERING_INFORMATION to diagnose issues and improve performance.

Executive Summary

  • Snowflake automatically divides tables into small storage blocks called micro-partitions (≈ 50–500 MB of uncompressed data each), with metadata (min/max per column, distinct counts) to support efficient pruning.
  • By default, data clusters by insertion order, called natural clustering. For large tables or frequent filters, define a clustering key. This groups related rows and improves pruning.
    Effective clustering can greatly reduce the number of micro-partitions scanned, turning multi-minute scans into seconds for targeted queries.
  • Over-clustering or a poor key choice can hurt performance, adding reclustering cost, extra work on write-heavy tables, and diminishing returns. Apply clustering deliberately based on data size, query patterns, and stability.
  • Regularly monitor clustering health (e.g., SYSTEM$CLUSTERING_INFORMATION or SYSTEM$CLUSTERING_DEPTH) and query profile stats (micro-partitions scanned) to confirm clustering still adds value.

Key Takeaways

  • Micro-partitions store Snowflake data in compact, immutable blocks with built-in metadata that supports efficient query pruning. This allows Snowflake to skip unneeded data and speed up queries.
  • By grouping related rows based on clustering keys, Snowflake improves pruning and query performance, especially for large, frequently queried tables.
  • While micro-partitions and clustering reduce costs and improve performance, too much reclustering or poor key choice can hurt results. It’s important to monitor clustering depth, pruning efficiency, and warehouse usage regularly to stay efficient.

What Are Micro-Partitions in Snowflake?

Micro-partitions are Snowflake’s core storage units. Each holds about 50 to 500 MB of uncompressed data (considerably less once compressed). Snowflake creates and manages them automatically as data is loaded or changed.

For each micro-partition, Snowflake tracks column-level statistics such as min/max values, the number of distinct values, null counts, and other internal statistics. This rich metadata enables Snowflake’s query optimizer to decide which partitions to skip during query execution (also known as pruning), turning what would be a full table scan into a targeted, efficient read.

What’s more, Snowflake micro-partitions are immutable, meaning that Snowflake doesn’t update them when data changes. Instead, the platform simply creates new blocks. There are many benefits to this approach, including improved consistency, simplified concurrency, and powerful time travel and cloning features. 
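For example, because old partition versions are retained rather than overwritten, Time Travel can read a table as it existed in the past. A minimal sketch, assuming an orders table like the one defined later in this article:

SELECT *
FROM orders AT (OFFSET => -3600); -- table state as of one hour (3,600 seconds) ago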

Understanding Clustering in Snowflake

Micro-partitions are only half the picture; clustering determines how rows are distributed across them. Clustering organizes data by key columns so that related rows land together, improving pruning and query performance on large tables.

There are two main types of clustering behavior:

  • Natural clustering happens when there’s no defined clustering key; in this case, Snowflake organizes data in order of insertion
  • Defined clustering happens when you set explicit clustering keys; in this case, Snowflake orders data around specific columns

Here’s an example of how to define a clustering key: 

CREATE OR REPLACE TABLE orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    total_amount NUMBER(10,2)
)
CLUSTER BY (order_date, customer_id);

Defined clustering takes some effort to set up and uses compute resources. However, it helps avoid fragmented or overlapping partitions and better matches common query patterns.

Use clustering with care. A setup that helps one workload may hurt another. Choosing keys requires understanding your data, queries, and performance impact.
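You also don’t have to pick a key at table creation time. If analysis shows a different key would prune better, you can set, change, or drop the key on an existing table; a quick sketch, reusing the orders table from above:

ALTER TABLE orders CLUSTER BY (order_date); -- set or change the clustering key
ALTER TABLE orders DROP CLUSTERING KEY;     -- remove clustering entirely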

As tables change with inserts, deletes, and updates, clustering can become outdated and must be refreshed. Snowflake has offered two approaches to reclustering, manual and automatic. Here’s a quick overview of each:

  • Control: Manual gives you full control over when and how reclustering happens. Automatic reclusters continuously in the background, with no manual intervention.
  • Cost management: Manual is more predictable and potentially cheaper, since you trigger reclustering yourself. Automatic costs vary with data size and update frequency, since ongoing background reclustering consumes credits.
  • Maintenance: Manual requires ongoing operational effort to schedule and execute reclustering jobs. Automatic requires little maintenance effort and keeps data operations uninterrupted.
  • Effectiveness: Manual can be optimized precisely based on workload knowledge and timing. Automatic continuously optimizes clustering depth and data arrangement to maintain consistent query performance.
  • Ideal use case: Manual is best when precise control over reclustering operations is needed (e.g., to manage cost). Automatic is best in any case that doesn’t explicitly require manual reclustering; Snowflake decides when reclustering a table is worthwhile and issues jobs accordingly.

Note that manual reclustering is deprecated and has been removed from most accounts; today, simply defining a clustering key enables Automatic Clustering. To control cost, you can suspend automatic reclustering at any time:

ALTER TABLE orders SUSPEND RECLUSTER;

Replace SUSPEND with RESUME to enable automatic clustering again as needed.
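For example:

ALTER TABLE orders RESUME RECLUSTER;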

How Do I Query Clustering Information and Health?

Snowflake offers two system functions that surface key information for assessing your clustering health. First, SYSTEM$CLUSTERING_INFORMATION provides a summary of how well data is organized relative to the clustering keys:

SELECT SYSTEM$CLUSTERING_INFORMATION('MY_TABLE');

If you run that query, Snowflake will return:

  • Clustering key expression (i.e. which columns define the clustering)
  • Average overlap depth
  • Number of partitions

It also returns several other useful stats that indicate how well your table’s current clustering is performing.

Additionally, the SYSTEM$CLUSTERING_DEPTH function focuses specifically on depth metrics for the defined keys. It’s useful for detecting when a table has drifted and might need reclustering.
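For example, using the same hypothetical MY_TABLE:

SELECT SYSTEM$CLUSTERING_DEPTH('MY_TABLE');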

If a table has no clustering key yet, or you’re evaluating a potential new one, both functions also accept an optional list of candidate columns, letting you measure how well the table is already organized around them.
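For example, to test a candidate key on MY_TABLE (the column names here are illustrative):

SELECT SYSTEM$CLUSTERING_INFORMATION('MY_TABLE', '(order_date, customer_id)');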

How Does Query Pruning Work in Snowflake? 

Snowflake query pruning is the process of skipping data that does not match a query. This helps improve performance and reduce costs. But how does it work? Here’s a quick look:

  • Each micro-partition stores metadata, including minimum and maximum values for each column, along with distinct counts and other stats.
  • When a query runs with filters (e.g., WHERE clauses), Snowflake compares those filters to the min and max metadata for each micro-partition.
  • If the metadata shows no overlap with the filter, Snowflake skips that micro-partition. This avoids scanning extra data.

Because micro-partitions are small (about 50 to 500 MB of uncompressed data each), pruning works at a very fine-grained level. This is why queries on large tables can run fast, often scanning only a small portion of the data.
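To see pruning in action, run a selectively filtered query and check its Query Profile. A sketch, reusing the orders table from earlier (the date is illustrative):

SELECT SUM(total_amount)
FROM orders
WHERE order_date = '2024-01-15'; -- with order_date in the clustering key, most partitions are skipped

In the TableScan node of the Query Profile, compare “Partitions scanned” to “Partitions total”; the smaller the ratio, the better your pruning.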

Limitations of Snowflake Clustering

That said, clustering, micro-partitions, and query pruning aren’t a silver bullet that will fix all your cost and performance issues. Here are a few limitations to keep in mind: 

  • Determining a good clustering key requires a careful analysis of the entire workload
  • A poorly chosen clustering key can degrade performance rather than improve it, and may even perform worse than natural clustering
  • Reclustering operations, especially on large tables, consume compute credits and can increase costs
  • Complex filters and query patterns may not benefit sufficiently from pruning and may still require a larger warehouse or Snowflake’s search optimization service
  • Clustering behaves differently for some object types (such as external tables), which affects both how it works and how you inspect it

Final Thoughts on Snowflake Micro-Partitions & Clustering

Snowflake micro-partitions are the foundation of its performance, and clustering determines how data is organized inside them.

These features can improve performance and help control costs, but they don’t fit every case. Even with automatic clustering, some workloads are hard to manage: real-time changes or sudden spikes are difficult because clustering lags behind the data. Shifting, unpredictable workloads also demand ongoing manual re-evaluation of clustering keys, which takes time and adds cost.

There is a better approach. Keebo’s automated engine improves cost and performance by keeping warehouses sized correctly for your workloads, even as they change. It runs continuously, so you can save credits without constant manual work.

Go beyond Snowflake’s native features and take control of your costs today. Contact Keebo for a short demo to see how much you could be saving right now. 

Frequently Asked Questions

What is a micro-partition in Snowflake?

Each Snowflake table is split into micro-partitions: small, immutable storage blocks (about 50 to 500 MB of uncompressed data, stored compressed) with per-column metadata such as min/max values, distinct counts, and null counts.

How do micro-partitions improve query performance?

Snowflake stores metadata for each partition. The optimizer uses it to skip partitions that don’t match filters. This reduces I/O and speeds up queries.

What is clustering in Snowflake, and when should I use it?

Clustering is a way to physically co-locate related rows by defining one or more clustering key columns. It’s valuable when you have large tables and frequent queries filtering or sorting by the same columns (date, customer ID, etc.).

What’s the difference between natural clustering and defined clustering?

Natural clustering relies on insertion order: data lands in micro-partitions as inserted. Defined clustering uses explicit clustering keys to reorder data for optimal locality and pruning.

Does clustering always improve performance?

No – clustering is most beneficial for large tables with stable data and frequent, selective queries. For small tables or write-heavy workloads, clustering overhead may outweigh benefits.

How can I check if a table’s clustering is still effective?

Use SYSTEM$CLUSTERING_INFORMATION('table_name') or SYSTEM$CLUSTERING_DEPTH('table_name') to inspect clustering metadata: overlapping partitions, clustering depth, number of partitions. If depth is high or coverage is poor, consider reclustering.

What are the risks or downsides of clustering?

Costs include extra compute credits for reclustering, maintenance overhead, and potential negative impact on concurrency or DML-heavy usage. Poor key selection or over-clustering can degrade performance or waste resources.

How does clustering affect cost?

Well-clustered tables can reduce compute cost by pruning unnecessary partitions and lowering scanned data volume. But reclustering and maintenance consume credits – so the net saving depends on access patterns and frequency.

When should I avoid clustering and rely on default micro-partitioning?

For small-to-medium tables, frequently updated tables, or datasets with high insert churn where the clustering key would quickly become stale, default micro-partitioning may be sufficient and cheaper.