Keebo | Snowflake Optimization: Avoiding Query Mistakes

Snowflake Optimization: Avoiding Query Mistakes

Inefficient or poorly written queries are one of the biggest factors to consider when reducing Snowflake costs. Here are the top 8 mistakes we see that are easiest to correct and can have an immediate impact on query performance and cost.

Introduction

Performance and cost optimization will determine your long-term success with Snowflake. It is critical not only for financial reasons, but also for improving user experience and adoption of any data-driven initiatives across your organization. This is the first of 3 articles where we will examine what you can do right away for Snowflake Optimization.

Keebo is the first data learning platform that automatically optimizes cloud data warehouses and analytics. On behalf of our customers, we perform millions of Snowflake optimizations on an ongoing basis. So we have deep experience in saving Snowflake customers time and money. There are three main areas every Snowflake customer should look to optimize: queries, schema, and warehouse. In Part 1, we will look at query optimization.

Top 8 Snowflake Query Mistakes

1. Using SELECT * Queries

Anyone who has been in the database business knows that SELECT * queries, or asking the database to return every column in a table, is asking for trouble. But as there are new data analysts entering the industry each day, it is worth mentioning, just to be sure. Snowflake is a column-store database, which means retrieving a subset of columns will dramatically improve performance, especially if your table has a lot of columns. Therefore, you should be certain that each column you are asking for in a SELECT statement is necessary for the analytics being performed. For example, don’t pull the city, region, and country of a customer when you could do the job with just the postal code.

2. Nested Queries

A nested query, or sometimes called a query within a query, is an entire query embedded as part of an operation. Query optimizers are not very effective in optimizing nested queries. Consider this example (written in pseudo code for simplicity, instead of proper T-SQL):

SELECT A FROM T1 WHERE T1.B > ANY (SELECT T2.B FROM T2 WHERE T1.A = T2.A)

In this example, you see a classic illustration of how subqueries are used: to compare the value of the first table with all values of the second table and return a row if there is a match with any value. A subquery isn’t needed for this and the same result could be achieved using the DISTINCT statement:

SELECT DISTINCT A FROM T1 JOIN T2 ON T1.A = T2.A WHERE T1.B > T2.B

The larger point is to be on the lookout for nested queries, and upon finding one, ask why it is there and if there might be a better way. 

3. Using IN or EXISTS Statements

This is very similar to a nested query, and IN or EXISTS are often found with nested queries. Consider this example, again in pseudo code:

SELECT A FROM T1 WHERE T1.B NOT IN (SELECT T2.B FROM T2 WHERE T1.B = T2.B)

As with the previous example, the goal here is to compare tables and return matching rows. Anytime you see IN or EXISTS you can probably avoid the subquery altogether for a simpler LEFT JOIN:

SELECT A FROM T1 LEFT JOIN T2 ON T1.B = T2.B WHERE T2.B IS NULL
4. Inequality Joins

Most joins are matching records on the left table with those on the right, using a simple equality join as indicated by the = sign. Most DBAs don’t realize you can join tables based on inequalities. When this happens, you’ll see the join condition with such operators as <=, >=, != and so on. The most common use for these non-standard joins is to list or count unique pairs of items. But these joins are one of the hardest to optimize for any data warehouse, so it is worth it to look for an alternative. Consider this example:

SELECT T1.A, COUNT(DISTINCT T2.A) FROM T T1 JOIN T T2 ON T1.A <= T2.A  GROUP BY T1.key, T1.A ORDER BY T1.A DESC

This query achieves the same thing but more efficiently:

SELECT A, DENSE_RANK() OVER (ORDER BY A DESC) FROM T

There are perfectly valid reasons to use inequality joins. But as with nested queries, the broader point is to be on the lookout for them and investigate if there might be a more efficient way.

5. Using RANK

When you are looking for top-K results, where K is a small constant, using MIN or MAX is more efficient than RANK. This is because RANK requires sorting where MIN and MAX can be done with a simple scan and thus often benefit from partition-based pruning. Consider this example:

SELECT max(A) AS secondA FROM (SELECT RANK() OVER (ORDER BY A DESC) AS A_rank, A FROM T) WHERE A_rank = 2

This achieves the same result but more efficiently:

SELECT MAX(A) AS secondA FROM T WHERE A < (SELECT MAX(A) FROM T)
6. Complex Functions and Expressions in Join Keys

In the worst case, a complex join condition will lead to a nested loop join. If you need a complex expression as a join key, it might be better to create intermediate results first before performing the join. Standard equality joins with simpler expressions allow the query optimizer to choose from a wider range of efficient join implementations. The difference between such a merge join instead of a nested loop could be a query that runs in a few seconds instead of ten or more minutes.

7. Using UNION instead of UNION ALL

In our observations, if a data engineer is not entirely sure if a query result set will have duplicates, they use UNION as it returns only unique records. However, it does so at the cost of extra computation, typically implemented with a sort. If you are confident in the integrity constraints of your schema and you’ve invested in data quality efforts to remove duplicates in the first place, you can eliminate this need for extra computation. UNION ALL will keep all records and skip the check for duplicates.

8. Not Using ANSI Joins

Along the same lines as avoiding SELECT *, most data engineers know that query optimizers are more likely to have success with standard ANSI joins. Looking for the JOIN operator is the indicator that you are using ANSI joins.

Conclusion

While we have listed some common mistakes in Snowflake queries, it can still be challenging to fix them. Even if you focused on the slowest 1% of queries, optimizing a medium-sized warehouse with only 100,000 queries per month would still mean inspecting 1,000 queries. While this is daunting enough, databases are “living” entities that are constantly changing. You can make progress looking for the common mistakes, but we recommend enlisting automation and machine learning to help with this task. Keebo gives data teams their time back, boosts performance, and saves money with fully automated data warehouse and analytics optimizations. Keebo doesn’t merely alert you to performance issues, we fix them in real-time.

Keebo | Snowflake Optimization: Avoiding Query Mistakes
Carl Dubler
Articles: 36