Keebo | Fundamentals of Snowflake Query Design & Optimization

Fundamentals of Snowflake Query Design & Optimization

No matter how optimized your warehouses are, inefficient queries can still drag your performance and increase costs. For that reason, Snowflake query design should be a tentpole of your cloud optimization strategy. 

In this article, we’ll walk through several query optimization strategies that can both enhance performance and reduce costs. If you’re using an AI-powered optimization tool like Keebo, these tips will help you increase your savings even more.

Why query optimization is important when using Snowflake 

Modern data environments demand efficient, near real-time data processing. Inefficient queries create significant friction in meeting those demands. As such, query optimization is table stakes for navigating modern data environments and maintaining user satisfaction. 
Although there are many ways to write a SQL query that retrieves a certain data set, not all do so efficiently and quickly. An inefficient query will place an unnecessarily high load on your data store, slowing performance and increasing your costs. If you can retrieve the same data in a shorter time, you can achieve the win-win of satisfying customers without excessive spend.

First: a brief review of Snowflake’s architecture

To understand how query optimization can improve Snowflake performance, we need to take a step back and briefly review Snowflake’s architecture.

Snowflake is a cloud native platform, built from the get-go to leverage the flexibility, reliability, and scalability that comes with cloud computing. One of the key distinctive features is the separation of storage and compute, as illustrated in the following graphic:

Keebo | Fundamentals of Snowflake Query Design & Optimization

The separation of storage from processing is also reflected in Snowflake’s pricing model. Data storage is charged on a flat rate per terabyte. Warehouse compute resources, on the other hand, are charged on a usage-based credit system. 

In reality, this means that you can store large quantities of data in Snowflake and have a pretty straightforward and predictable monthly cost. Costs to retrieve that data, however, can be highly variable depending on your warehouse sizes, monthly usage, and whether you purchase credits at Capacity (pre-paid plan, lower per-credit charge) or On-Demand (pay-as-you-go, higher per-credit charge). 

What’s more, because Snowflake is cloud native, the upper limit of your query processing costs is virtually limitless. Which means it’s that much easier for Snowflake costs to get out of control

Under this model, query processing is where the bulk of your costs come from. The more efficient your queries, the less your budget will fluctuate. You can make queries more efficient in two ways: provision more resources to handle them (e.g. scale up warehouses) or design them to function more efficiently. In this article, we’ll focus on the latter approach. 

How to optimize your Snowflake query performance 

Although we’ve done an in-depth guide on Snowflake query performance here, let’s take a quick look at some of the key strategies at your disposal. 

Another challenge with Snowflake performance monitoring is the need for real-time insights, not monthly, weekly, or even daily summaries. For example, you could have a spike in performance at 3:07am that lasts until 3:21am. If you have a 24-hour reporting period, you may not see that report until the next day—and by then you really can’t do anything about it. 

Pre-computing and storing complex calculations

One common mechanism for optimizing query performance is to pre-compute and store complex calculations in intermediate tables or materialized views. In SQL, creating a materialized view might look like:

CREATE MATERIALIZED VIEW monthly_sales_summary AS SELECT region, SUM(sales) AS total_sales, DATE_TRUNC('month', sale_date) AS month FROM sales_transactions GROUP BY region, month;

Here, monthly_sales_summary would be a materialized view storing monthly sales data by region. This strategy allows for faster query execution by avoiding repetitive calculations and aggregations which, in turn, reduce the query’s runtime.

Query design

Optimizing Snowflake query design rests on two principles: 1) retrieve only the data you need, and 2) write your queries as simply as possible. 

For example, select only necessary columns instead of using SELECT *. This limits the data you retrieve and can reduce query latency.

SELECT userID, name, created_date FROM app_users ...;  vs.   SELECT * FROM app_users ...;

Another option is to filter data earlier in the query using appropriate WHERE clauses:

SELECT userID, name, created_date FROM app_users WHERE created_date < '2024-01-01' GROUP BY first_name;   vs.   SELECT userID, name, created_date FROM app_users GROUP BY first_name WHERE created_date < '2024-01-01';

Additionally, you can use the right type of JOIN functions (i.e. INNER vs. OUTER). INNER join only retrieves the data common to both tables, where the OUTER join will return all data from both tables. For example:

SELECT app_users.name, app_users.userID, customers.name FROM app_users INNER JOIN customers ON app_users.name=customers.name;  vs.   SELECT app_users.name, app_users.userID, customers.name FROM app_users OUTER JOIN customers ON app_users.name=customers.name;

Additionally, it’s important to use Common Table Expressions (CTEs) wisely. These are the result sets of queries that exist temporarily and only for use within the context of a larger query. You should also limit the complexity of functions within SELECT statements as much as possible.

Warehouse configuration

As mentioned above, another approach is to provision more resources to your warehouses so that they can handle particularly complex and latent queries. For example, upsizing your warehouse or setting up multi-cluster warehouses that allow for concurrent processing of queries without worrying about queue time. However, you don’t want all your queries to run on upsized warehouses and you don’t want to use more multi-cluster warehouses than you need, especially if a smaller warehouse or fewer clusters can run them with negligible impact on performance.Because it’s impossible to track the changes in workloads in real time—especially in those hours when your team is off-duty or even asleep—warehouse configuration works best when you deploy an AI-powered tool to automate these adjustments in real time.

Data clustering

Data clustering involves sorting your table data in such a way that queries can retrieve it more efficiently by setting specific columns as clustering keys. When applied to especially large tables, this can significantly reduce compute cost since data is easily retrieved thus minimizing scanning during JOINs. It’s a heavy lift, but worth it in the end. That is, so long as you maintain the sorting patterns when you add new data to your warehouse. Otherwise, you’ll end up undercutting your clustering efforts.

Caching

Snowflake offers three caching levels, so queries don’t have to constantly retrieve the same data again and again. The result cache holds the results of every query executed over the past 24 hours. The local disk cache retains the data used by SQL queries. The remote disk holds your long-term storage, and is meant to ensure data resilience.

Query routing

Another approach is to implement a strategy to route your queries to warehouses that currently have unused resources or where small, quick ad-hoc queries are routed to a smaller, cost-efficient warehouse, and likewise, where large complex queries are sent to a larger warehouse. This helps you make the most of the credits you’ve already provisioned, rather than provisioning additional resources.

Challenges with Snowflake query optimization

Although there are plenty of advantages to Snowflake query optimization, it shouldn’t be the only strategy for controlling Snowflake costs and improving performance. There are a few reasons why:

  • Some sources are incapable of query optimization (e.g. a Looker integration that creates queries through Gen AI)
  • Monitoring the quality of every single query at scale is too much for one person (or even one team) to handle
  • Query optimization requires extensive training which, while it’s a worthwhile investment, takes time to implement and maintain
  • Not a good solution for sudden spikes in query loads and other demands on your Snowflake systems
  • Can’t make adjustments in real time—especially in those midnight hours when all your team is sleeping

So while you should absolutely work to optimize your Snowflake queries, this strategy must work in tandem with other systems. 

One example of this is Keebo’s warehouse optimization tools that automatically reduce your costs without hindering performance—all in real time. When you feed high-quality queries into Keebo-optimized warehouses, the savings and performance improvements compound. 

To see how Keebo can enhance your optimized query performance, schedule a demo today.

Keebo | Fundamentals of Snowflake Query Design & Optimization
Jonathan Thein
Articles: 6