Snowflake Optimization Part 2: Schema Choices

Relational databases use three schema types: physical, logical, and external. Decisions at each level affect ingestion, query performance, and Snowflake compute costs.
Keebo is a data learning platform that automatically optimizes cloud data warehouses. It performs thousands of Snowflake optimizations daily. This is the second of three articles on Snowflake optimization. This article focuses on schema optimization.schema optimization.
Here are the top 3 schema optimizations that will have the most beneficial impact for Snowflake.
Top Snowflake Schema Optimizations
Snowflake Schema Views
Views are virtual tables built from queries. They simplify complex joins, improve readability, and reduce query errors. They also support security and business logic.
Views reduce errors from less-experienced users, who often write inefficient queries. Teams can convert common query patterns into optimized views to improve performance and reduce costs.
Views inherit the same performance issues as queries. Deep view hierarchies create nested queries that are difficult to optimize. Materialized views can help address this.
Snowflake Schema Materialized Views
Materialized views store precomputed query results. They improve performance by answering common queries in advance.
Materialized views improve performance but increase compute costs for creation and maintenance.
Use materialized views for frequently used queries with low update frequency. Automation can identify, create, and manage these views efficiently.
Clustering Keys
Clustering is the most important part of your physical schema. It defines how data is ordered in a table. For example, clustering by time helps time-based queries but may hurt queries by category or region.
Clustering improves pruning by skipping irrelevant micro-partitions. This significantly speeds up queries on large tables. Snowflake stores data in micro partitions of 50-500MB in size. It will skip irrelevant partitions for your query, and this dramatically speeds up queries on large tables. Using time as an example for a clustering key, if your table contains data from many years across billions of rows but you only want records from June of 2022, then Snowflake can go directly to the micro partitions that have those records (which will all be conveniently “clustered together” in storage) and skip the millions of other partitions that it knows can’t possibly contain those records.
The clustering key determines storage layout and query efficiency. Choose it carefully, as tables can only be clustered one way.
Snowflake clusters data by insertion order by default, which often aligns with time-based queries. Since time is the most popular filtering condition in queries, this works well in most cases. But you can use Snowflake’s SYSTEM$CLUSTERING_DEPTH system function to see how effective your micro partitions are for a specified set of columns.
Clustering degrades with frequent DML. Monitor clustering depth and adjust keys if costs increase. Prior to May 2020, Snowflake users had to manually recluster tables, which could get expensive, especially if not done properly. Since then, Snowflake does not allow manual reclustering, which means Snowflake will determine when it is worth reclustering a table. You can check the costs incurred by auto clustering and suspend it if you want. But suspending this can cause conflicts between your DML operations and your clustering key. Instead, if you notice that Snowflake is constantly reclustering and thus driving up costs, it might be a sign that you need to change your clustering key altogether.
Snowflake Schema Conclusion
Schema optimization is difficult to maintain manually as workloads evolve. Optimization requires ongoing effort as databases change. We recommend enlisting automation and machine learning to help with this task. Keebo automates schema optimization, improves performance, and reduces costs in real time.
