Snowflake Optimization: Schema Choices
There are three levels of schema in a relational database: physical, logical, and external. You need to make the right decisions at all three levels based on business requirements and data semantics. Those decisions directly affect data ingestion and query performance, which in turn drives your Snowflake compute usage (and therefore your direct costs).
Introduction
Keebo is the first data learning platform that automatically optimizes cloud data warehouses and analytics. On behalf of our customers, we perform tens of thousands of Snowflake optimizations daily. This is the second of three articles where we will examine what you can do right away to optimize Snowflake across three main areas: queries, schema, and warehouse. Here in part 2, we will look at schema optimization.
Here are the three schema optimizations that will have the most beneficial impact on your Snowflake usage.
Top Snowflake Schema Optimizations
Views
Database views are virtual tables. A view is defined by a query, which can join several tables into a single place and retrieve only what you need from a complex set of tables. Views are often created to enforce data security and to encode business logic. And because they hide that complexity, they significantly improve query readability and reduce the likelihood of mistakes in query logic.
Reducing mistakes in query logic is an often-overlooked benefit of views in organizations where less-experienced users write queries. These users can easily write inefficient SQL queries, which directly increases compute costs. If a data team can identify the common building blocks that most query writers use, it can turn them into efficiently written views, leading not only to better overall performance but to lower costs as well.
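As a minimal sketch of that idea (the table and view names here are hypothetical), a view can capture a common join plus a business rule once, so query writers reuse the efficient version instead of re-deriving it:

    -- Hypothetical example: encapsulate a common building block
    -- (a join plus a business rule) in one efficiently written view.
    CREATE OR REPLACE VIEW analytics.completed_orders AS
    SELECT
        o.order_id,
        o.order_date,
        o.amount,
        c.customer_name,
        c.region
    FROM raw.orders o
    JOIN raw.customers c
      ON c.customer_id = o.customer_id
    WHERE o.status = 'COMPLETED';  -- the business rule lives in one place

    -- Less-experienced users can now query the view directly:
    SELECT region, SUM(amount) AS revenue
    FROM analytics.completed_orders
    GROUP BY region;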
Since a view is essentially a saved query, it is subject to the same problems that queries can create for Snowflake (see part 1 of this series for the top 7 query mistakes we see). For example, when presented with a deep hierarchy of complex views, Snowflake’s query optimizer is effectively trying to optimize highly nested queries, and it will fare no better than if the views were not there at all. Using a materialized view is one way to prevent this, as we will see in the next section.
Materialized Views
If you are using Snowflake Enterprise Edition or higher, you have access to materialized views. As the name suggests, a materialized view is a regular view whose results are “materialized,” that is, pre-computed and stored in the database, so queries against it are much faster. In essence, a materialized view answers common queries in advance.
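As a sketch (the table name is hypothetical, and note that a Snowflake materialized view can query only a single table), a heavily reused aggregation might be materialized like this:

    -- Hypothetical sketch: pre-compute a common aggregation so frequent
    -- queries read stored results instead of re-scanning the base table.
    -- Snowflake keeps the results up to date as raw.orders changes.
    CREATE MATERIALIZED VIEW analytics.daily_order_totals AS
    SELECT
        order_date,
        status,
        SUM(amount) AS total_amount,
        COUNT(*)    AS order_count
    FROM raw.orders
    GROUP BY order_date, status;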
Materialized views will significantly improve query performance for the common building blocks of your query needs. But this can also be a double-edged sword. Snowflake will need to use compute resources to build the views in advance, and update them if the underlying data changes. As you can imagine, this is too expensive for some use cases.
To get the most benefit from materialized views at the lowest cost, you need to identify the situations where these views can help the largest number of slow queries that require the least amount of updating. This can be a daunting task in a busy warehouse with thousands of queries. Fortunately, there are automated ways of identifying the best queries for materialized views and even creating and removing those views dynamically. Keebo is one such solution with our Query Acceleration product.
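If you want to do a rough first pass by hand, one generic starting point (this is a sketch, not Keebo’s method, and the thresholds are arbitrary) is to look for frequently repeated, slow query patterns in Snowflake’s ACCOUNT_USAGE.QUERY_HISTORY view:

    -- Generic sketch: surface frequent, slow query patterns over the
    -- last 30 days. QUERY_PARAMETERIZED_HASH groups queries that differ
    -- only in literal values; TOTAL_ELAPSED_TIME is in milliseconds.
    SELECT
        query_parameterized_hash,
        COUNT(*)                       AS executions,
        AVG(total_elapsed_time) / 1000 AS avg_seconds,
        ANY_VALUE(query_text)          AS sample_query
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    GROUP BY query_parameterized_hash
    HAVING COUNT(*) > 100                       -- repeated often...
       AND AVG(total_elapsed_time) > 10 * 1000  -- ...and slow (>10 s)
    ORDER BY COUNT(*) * AVG(total_elapsed_time) DESC
    LIMIT 50;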
Clustering Keys
Clustering is the most important part of your physical schema (how your tables are physically laid out) when it comes to Snowflake optimization. You can think of clustering as the sort order of your table. For example, if you cluster your table by time, then rows will be stored in roughly chronological order. This would be ideal for queries that ask for data by time ranges, such as months or quarters, but it could be detrimental for frequent queries that want data grouped by product category or region.
Clustering keys are even more consequential for “pruning,” that is, skipping over data that a query doesn’t need to read. Snowflake stores data in micro-partitions, each holding 50-500 MB of uncompressed data, and it skips the partitions that are irrelevant to your query, which dramatically speeds up queries on large tables. Using time as an example of a clustering key: if your table contains billions of rows spanning many years but you only want records from June of 2022, Snowflake can go directly to the micro-partitions that hold those records (which will all be conveniently “clustered together” in storage) and skip the millions of other partitions that it knows cannot possibly contain them.
Therefore, the clustering key you use will directly affect how Snowflake stores your data and how efficient queries on large tables will be. However, a table can be physically sorted only one way, so it can have only one clustering key (though that key can span multiple columns or expressions). Choosing it wisely is of the utmost importance.
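As a hedged sketch (the table and column names are hypothetical), a clustering key is declared when the table is created or added later, and queries that filter on it benefit from pruning:

    -- Hypothetical sketch: declare a clustering key at creation time...
    CREATE TABLE raw.events (
        event_id   NUMBER,
        event_date DATE,
        category   VARCHAR,
        payload    VARIANT
    )
    CLUSTER BY (event_date);

    -- ...or add one to an existing table.
    ALTER TABLE raw.events CLUSTER BY (event_date);

    -- A range filter on the clustering key lets Snowflake prune
    -- micro-partitions that cannot contain matching rows.
    SELECT COUNT(*)
    FROM raw.events
    WHERE event_date BETWEEN '2022-06-01' AND '2022-06-30';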
Snowflake’s default is to cluster rows in the order in which they are inserted, which effectively means your clustering key is load time. Since time is the most popular filtering condition in queries, this works well in most cases. But you can use Snowflake’s SYSTEM$CLUSTERING_DEPTH system function to see how well clustered your micro-partitions are for a specified set of columns.
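For example (the table and column names are again hypothetical):

    -- Average depth at which micro-partitions overlap for the given
    -- columns; smaller values generally mean better pruning.
    SELECT SYSTEM$CLUSTERING_DEPTH('raw.events', '(event_date)');

    -- The related SYSTEM$CLUSTERING_INFORMATION function returns a
    -- fuller JSON report, including a partition-depth histogram.
    SELECT SYSTEM$CLUSTERING_INFORMATION('raw.events', '(event_date)');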
Clustering can lose effectiveness over time, especially if there are lots of DML statements (Inserts and Deletes in particular) after your initial data load. Prior to May 2020, Snowflake users had to manually recluster tables, which could get expensive, especially if not done properly. Since then, Snowflake no longer allows manual reclustering; instead, its Automatic Clustering service determines when it is worth reclustering a table. You can check the costs incurred by Automatic Clustering and suspend it for a table if you want, but a suspended table’s clustering will gradually degrade as DML operations continue. If you notice that Snowflake is constantly reclustering and thus driving up costs, it might instead be a sign that you need to change your clustering key altogether. Doing so is beyond the scope of this article, but Keebo’s Guide to Snowflake Optimization has further advice on this subject.
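Both the cost check and the suspension can be done in SQL; for example (the table name is hypothetical):

    -- Credits consumed by Automatic Clustering, per table, over the
    -- last month.
    SELECT
        table_name,
        SUM(credits_used) AS credits
    FROM snowflake.account_usage.automatic_clustering_history
    WHERE start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP())
    GROUP BY table_name
    ORDER BY credits DESC;

    -- Suspend (or later resume) Automatic Clustering on one table.
    ALTER TABLE raw.events SUSPEND RECLUSTER;
    ALTER TABLE raw.events RESUME RECLUSTER;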
Conclusion
While we have listed some ways to optimize Snowflake schema, they can still be challenging to implement, especially materialized views. Even if you focused only on the slowest 1% of queries, a medium-sized warehouse with just 100,000 queries per month would still mean inspecting 1,000 queries. And while that is daunting enough, databases are “living” entities that are constantly changing. We recommend enlisting automation and machine learning to help with this task. Keebo gives data teams their time back, boosts performance, and saves money with fully automated data warehouse and analytics optimizations. Keebo doesn’t merely alert you to performance issues; we fix them in real time.