Because Snowflake is a pay-as-you-go data cloud, and the main thing you are paying for is warehouse size, it pays to get the warehouse size right. You can try to do this on your own or enlist a fully-automated optimizer like Keebo to do it for you. Let’s discuss.
How Snowflake warehouse sizes work
It sounds a bit counter-intuitive, but data storage is not what we are talking about when discussing the best warehouse size. Size in this case refers to computing power. The larger the warehouse, the faster it can handle a compute task (query). In general, the bigger and more complicated your query workload, the larger your warehouse should be. It should be big enough to process queries in time to satisfy your users’ requirements or SLAs. In other words, you want to minimize user complaints about slow queries.
Snowflake currently offers the following warehouse “T-shirt” sizes:
|Credits per hour (cost)
Warehouse size and Snowflake cost
If minimizing user complaints about slow queries was the only consideration, then just choose a 6X warehouse and be done with it! But did you notice that your cost will double with each increase in warehouse size? So a 6X warehouse will be awesome for the relationship with your users but will likely end whatever friendship you have with your CFO. What exactly is “best” then?
Ideally, you would pick the smallest warehouse that will efficiently process most of your queries and meet your response time SLA. But don’t get it wrong! Sizing based on your biggest query would mean you are paying too much, since most of the time you will be paying for an underutilized warehouse. On the other hand, if your warehouse is undersized, your queries could “spill” out of memory and need to access storage, either local (slow) or remote (very slow). The result of spillage is slow or even failed queries and the unhappy users that follow.
You could wait until either your users or CFO yell at you and then make the corresponding adjustment to the warehouse size. Or you could take a more scientific approach.
Key performance parameters for warehouse size
Many Snowflake customers create dashboards using Snowsight (the online SQL tool for Snowflake) to measure several different performance metrics. You could also use Snowflake’s Query Profile that is available in the classic console. In such a dashboard, here are the best metrics to monitor for warehouse size analysis:
|In general, the higher the bytes scanned to answer a query, the larger the warehouse needed. If most queries scan less than a GB you probably only need a small or extra small warehouse.
|If this is high, it indicates expensive (long-running) queries. You need to either optimize these queries or scale-up to a larger warehouse.
|Query load percent
|If this is less than 50 for a query, then it should be moved to another warehouse or you could increase the warehouse size.
|Bytes spilled to local and bytes spilled to remote
|This indicates the warehouse is undersized. Bytes that can’t fit into memory “spill” to disk storage. This will be slow on local storage and even slower on remote storage.
|A query that is competing with others for resources will have a high queued time. Increase warehouse size or move queries to a less utilized warehouse.
Making warehouse size optimizations
Once you start tracking these metrics, you have the data you need to begin making adjustments. In particular, it seems that Snowflake is keen on using bytes scanned to get the best warehouse size. If you look at their own presentation from Snowflake Summit 2023, they provide a slide showing recommended warehouse size according to bytes scanned:
So if your bytes scanned is between 1-20GB, the recommendation is either a small, medium, large, x-large or 2x-large warehouse. Now I don’t want to seem ungrateful for the help here, but what exactly am I supposed to do with this information? I’d love more specifics. Unfortunately, that can’t be determined since query workloads vary wildly. Even in Snowflake’s own documentation, the advice is to experiment until you find the ideal size: “Experiment by running the same queries against warehouses of multiple sizes (e.g. X-Large, Large, Medium). The queries you experiment with should be of a size and complexity that you know will typically complete within 5 to 10 minutes (or less).”
So after all this, we are left with experimentation as the answer? Apparently, yes. Even if this experimentation sounds like a lot of fun, you are going to run into two major problems following Snowflake’s own advice:
- Resources. Who is going to do all this work? It would be nice to have a large data team for this, and some Snowflake customers do. Even for those large teams, their efforts are better spent on other tasks, particularly building data pipelines and directly supporting the analytics needs of the business.
- Constant change. Even if you have a team to optimize Snowflake, or make time to do it yourself, you still face another quandary: you’ve only optimized to the workload you have now. When your workload changes, the optimizations you spent so much time on could become instantly irrelevant.
May I suggest that you instead turn this task over to a solution that can fully automate setting the best warehouse size. Keebo is one such solution! You are reading this article on Keebo’s website, so hopefully you expected this would be coming next.
How Keebo’s AI automates Snowflake optimization
The way to solve the resourcing issue and the quandary posed by a dynamic environment is to enlist your own army of robots. At Keebo, our mascot is a friendly robot dog, because this is what we do 24/7/365: optimize Snowflake warehouses. Here are our operating principles.
I listed 5 key performance parameters earlier in this article. Keebo monitors 76 Snowflake usage metadata fields and we monitor them constantly. It’s important to note that we only access usage metadata, never your user or business data. We then train our proprietary machine learning algorithms on this data to make predictions and constantly optimize your warehouse, starting with warehouse size.
Not only do we optimize warehouse size, we optimize warehouse suspension and clustering, too. You set how aggressive you want our robots to be in finding savings, and then we do the work, freeing up your time. This screenshot shows the optimizations we are doing in real time of just the auto-suspend settings for one week of a customer’s warehouse:
And this works both ways, too. If an optimization turns out to be less than ideal (for example, it is slowing down queries), Keebo will automatically adjust, no matter the day or time. This is important for you to consider if you are planning to do manual optimizations. If something goes wrong, will you be there to adjust again even if it is the middle of the night or the middle of your hard-earned vacation?
It is important to find a solution that can do the actual optimizations for you. There are plenty of products on the market that can show you reports and dashboards instead of building them yourself in Snowsight. But Keebo is the only fully-automated solution that goes beyond “observe and report” to do the actual work.
Price is based on savings
We shouldn’t have to spend money and time to save money and time. Our price is based on how much we save you. We take a percentage of the money you save on Snowflake–a third or less–meaning Keebo will pay for itself in addition to putting a lot of money back into your pocket.
You can see what our own customers have to say in our collection of customer stories. And if Keebo seems like an intriguing solution to put the power of AI to work in saving you money on Snowflake, you can get a lot more details in our Practical Guide for the Data Team.