
Lessons from Instacart and Snowflake: Why We Need a Paradigm Shift


When details began emerging about Instacart and their Snowflake spend, the news caught my interest. Suddenly there was a lot of controversy around whether Instacart’s Snowflake usage had actually gone down, or whether it was simply that they had pre-purchased more credits the year before. There was also plenty of speculation about whether it had to do with Instacart moving toward Databricks, or whether the fact that Snowflake’s CEO sits on Instacart’s board of directors prompted Instacart’s CEO to make a public statement.

But my interest deepened when Snowflake pointed everyone to a presentation that Instacart gave at Snowflake Summit back in June titled “How Instacart Optimized Snowflake Costs by 50%.” I was at the Summit myself, so I went back and rewatched that presentation. I think there are four key lessons here that all of us in the database community can take from this episode.

Before I begin, I want to be clear that I’m not criticizing Instacart. In fact, I appreciate how much work goes into optimizing database workloads at large companies. As a professor of computer science and a database researcher, I have built and studied databases for more than a decade and a half. And I have spent countless hours staring at insane queries and looking for ways to optimize them, both for my research and for large companies.

So while I have a lot of respect for all the hard work that engineers constantly put into optimizing database “things” at big companies, I think the real question is not whether Instacart migrated some of their Snowflake workload to Databricks, or how exactly they reduced their usage via optimizations. Rather, the more productive discussion for us as a database community is whether, in an AI era, we should still be spending precious engineering cycles on optimizing database workloads by staring at them one query at a time. And even more so, is it wise, or even feasible, for other companies to follow in Instacart’s footsteps? So what is Instacart’s approach? Let me summarize it next.

Instacart’s three-step process for Snowflake optimization

If you take the time to watch the Instacart video, you will see that they follow a three-step process for Snowflake optimization. 

Step 1: Find the most expensive queries

The data team at Instacart has put a lot of effort into building a company-wide infrastructure to ensure each query is tagged. The tags feed a summary view that powers some very nice-looking dashboards. From there, they find the most expensive queries, identify the Directly Responsible Individual (DRI) who wrote each one, and notify them that it needs to be optimized.
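To make this concrete, here is a minimal sketch of what this kind of attribution can look like in Snowflake. This is not Instacart’s actual tooling; the tag format, the 30-day window, and the use of execution time and bytes scanned as cost proxies are all assumptions, but QUERY_TAG and the ACCOUNT_USAGE views are standard Snowflake features.

```sql
-- Each tool or team labels its sessions so queries can be attributed later.
-- (The tag format here is hypothetical.)
ALTER SESSION SET QUERY_TAG = 'team=marketing;job=daily_revenue;owner=jane_doe';

-- The data team can then rank the most expensive query groups over the last
-- 30 days, using execution time and bytes scanned as rough proxies for cost.
SELECT
    query_tag,
    user_name,
    warehouse_name,
    COUNT(*)                              AS runs,
    SUM(execution_time) / 1000 / 3600     AS total_execution_hours,
    SUM(bytes_scanned) / POWER(1024, 4)   AS tb_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  AND execution_status = 'SUCCESS'
GROUP BY 1, 2, 3
ORDER BY total_execution_hours DESC
LIMIT 100;  -- the shortlist that gets routed to the DRIs
```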

Step 2: Manually optimize queries

It is then up to the DRI to stare at that query and figure out how to optimize it as much as possible. How? They don’t say. But they keep mentioning pushing filters down, which is one of the oldest ideas in query optimization, and while effective when applicable, it is easier said than done, especially for something like a 200-line SQL query that joins 10 different tables and has 5 nested queries in it. What’s different in this case is that instead of a particular data team being in charge of this tedious task, the responsibility is now spread across hundreds of individuals in the company (essentially, anyone who ever wrote a SQL query can now be on the hook for optimizing it). This might create a culture of responsibility, but it also imposes a massive opportunity cost on the company by spending so many people’s time on “stare at your query and try to optimize it.”
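To illustrate what a DRI is actually being asked to do, here is a toy filter-pushdown rewrite. The tables and columns are made up, and real queries are rarely this small:

```sql
-- Before: the date filter is applied only after joining against all events.
SELECT c.customer_id, COUNT(*) AS orders
FROM customers c
JOIN (SELECT customer_id, event_date FROM events) e
  ON e.customer_id = c.customer_id
WHERE e.event_date >= '2023-01-01'
GROUP BY c.customer_id;

-- After: the filter is pushed into the subquery, so the join reads far less data.
SELECT c.customer_id, COUNT(*) AS orders
FROM customers c
JOIN (SELECT customer_id FROM events WHERE event_date >= '2023-01-01') e
  ON e.customer_id = c.customer_id
GROUP BY c.customer_id;
```

On a 200-line query with 10 joins and 5 levels of nesting, spotting the equivalent opportunity, and convincing yourself the rewrite is still correct, is far harder.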

Step 3: Find the right size for the warehouse

In addition to the work done in step 2 by the DRI, the data team will attempt to fit each query to a Snowflake warehouse that is the right size for it. Too big and you are needlessly spending money; too small and the query will run too slowly. How do they pick the right size? Through experimentation: running the workload on warehouses of various sizes and comparing the cost and performance.
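Below is a hedged sketch of what such an experiment typically looks like in Snowflake; the warehouse name is hypothetical and this is the generic mechanism, not Instacart’s specific setup. You change the size, replay a representative workload, and compare credits and runtimes.

```sql
-- Resize the warehouse, replay the workload, and repeat for each candidate size.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM';  -- then SMALL, LARGE, ...

-- ... replay the representative workload here ...

-- Compare credit consumption across the trial runs.
SELECT warehouse_name,
       DATE_TRUNC('hour', start_time) AS hour,
       SUM(credits_used)              AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE warehouse_name = 'ANALYTICS_WH'
  AND start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY hour;
```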

This is admittedly a very brief summary of their methodology. I encourage you to watch their presentation to get the full effect. There were at least 6 other customers at Snowflake Summit who gave presentations on optimization and cost savings. If you are short on time, Carl Dubler has a more detailed summary of all the customer presentations on Snowflake cost savings and also has a post on Snowflake’s advice for optimization.

Four lessons we can learn from Instacart

I know this optimization methodology doesn’t only happen at Instacart. I personally know teams at Uber, Microsoft, and other companies whose main job is following a similar methodology for optimizing the query workload across the entire company to reduce the cost of their shared infrastructure. There is too strong a tendency in our community to clean up the mess one query at a time, and it seems the move to cloud data warehouses hasn’t changed this attitude at all. So here are the lessons I think we can all take away from the Instacart story.

Lesson 1: Everyone’s ignoring the long tail

A significant portion of Instacart’s approach is focused on identifying the most expensive queries. That’s a reasonable approach, but it tells us something important: even a big organization like Instacart, with a massive data team and two years of hard work, has found it prohibitive to look at all of its queries. Even with all this investment, they are admittedly focusing on only a small fraction of the queries by picking the most expensive ones. For example, if you can only afford to manually inspect 1,000 queries, then you obviously want to focus on your 1,000 slowest and most frequent ones. But in a large organization this means ignoring millions of smaller queries whose collective impact on your overall bill is no longer small. In other words, the 80-20 rule doesn’t always hold for database workloads: there is often a long-tailed distribution of faster queries that, because of their sheer number, still account for a large portion of your overall bill. Imagine the potential if we could analyze and optimize all the queries, not just a tiny fraction of the most expensive ones.
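If you want to gauge how heavy your own long tail is, here is a rough sketch against Snowflake’s query history. The 1,000-query cutoff and the 30-day window are arbitrary, and execution time is only a proxy for cost:

```sql
-- Share of total execution time inside vs. outside the top 1,000 queries.
WITH ranked AS (
    SELECT execution_time,
           ROW_NUMBER() OVER (ORDER BY execution_time DESC) AS rnk
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
)
SELECT SUM(IFF(rnk <= 1000, execution_time, 0)) / SUM(execution_time) AS head_share,
       SUM(IFF(rnk >  1000, execution_time, 0)) / SUM(execution_time) AS long_tail_share
FROM ranked;
```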

Lesson 2: Asking more people to do manual work doesn’t make it less manual (or less expensive)

Even for the queries that Instacart’s data team flags as expensive or as candidates for optimization, there still isn’t a silver bullet. Instead, they rely on the DRI to go fix it. They mention in their presentation that they have hundreds of users from different teams who write queries. So essentially the job of optimizing the most expensive queries is now crowdsourced to all these teams across the company instead of falling on the data team alone. In a sense, this is fair and probably a good “culture”: these analysts are using a shared resource on a pay-as-you-go model, so they need to use it responsibly. The data team focuses on providing a more user-friendly interface so the other teams can do this work themselves: a tagging system so they know who wrote which query, and a nice user interface for viewing the query profile and the query plan to help with performance debugging.

But the question is whether this is a wise use of the company’s resources, for a number of reasons:

  • Not everyone who needs to use the database needs to be a database expert. Is it wise to ask an ML expert, a supply chain guru, or a marketing operations expert to spend their time writing perfect queries, or worse, optimizing the queries they wrote months ago? Weren’t these individuals recruited for a different skill set, to perform a different business function for their company?
  • More importantly, distributing the manual work to almost everyone who writes queries certainly lessens the burden on any single individual and allows the company to “manually” optimize more queries, but it doesn’t make the work any less manual. You are still spending a lot of time hand-tweaking SQL queries, one query and one person at a time. That’s a lot of money. In some ways this is even worse, because now you cannot even track how much of the organization’s resources are spent on this manual work. It’s all hidden and distributed.

In some ways, the company is still paying the same tax for the manual optimizations, but more people are being taxed and some of them are paying an even higher tax rate because they are not trained, hired, or motivated for this line of work. Are there other activities for these individuals that will be more beneficial for Instacart as a business?

Lesson 3: “Stare at your query” is still rampant in our industry

A lot of the infrastructure seems geared towards 1) highlighting the expensive and frequently running queries that need to be examined by humans, and 2) a user-friendly way of seeing the query plan (the query profile, in the case of Snowflake). But all that does, essentially, is reveal an expensive query and the expensive operator in that query. The DRI must then examine it long enough to see what can be done to speed it up. The only advice I kept hearing was to look for opportunities to push down a filter. But we all know how intimidating real-life queries can look; in many cases you cannot even see your entire query on the screen without scrolling through multiple pages. Likewise, what if the query profile simply shows that you are scanning several terabytes of data? How do you expect your end user to solve that?

We’ve got to do better than that. As a data community, we need to provide better tooling and more specific advice to our end users if we’re expecting them to “solve” our cost and performance problems. I’d even go further and argue that in many cases it’s not even possible to optimize your workload one query at a time. In many cases, you need to identify the common blocks of computation across many queries and find a way to reuse that computation (at the risk of sounding self-promotional, that’s one of the things our query acceleration product does).
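As a simplified illustration of reusing computation across queries (not a description of how our product works internally), a shared aggregate can be precomputed once and served to the many queries that need it. The table and view names are made up, and Snowflake materialized views come with restrictions, so treat this as a sketch:

```sql
-- Precompute an aggregate that many dashboards and ad hoc queries share.
CREATE MATERIALIZED VIEW daily_store_revenue AS
SELECT order_date, store_id, SUM(total_amount) AS revenue, COUNT(*) AS orders
FROM orders
GROUP BY order_date, store_id;

-- Queries that previously scanned the full orders table can now be answered
-- (or rewritten to run) against the much smaller precomputed result.
SELECT store_id, SUM(revenue) AS q3_revenue
FROM daily_store_revenue
WHERE order_date >= '2023-07-01'
GROUP BY store_id;
```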

Lesson 4: “Trial-and-error” is the modern way of saying “we don’t know”

One of the most common ways of optimizing Snowflake, or cloud warehouses in general, is rightsizing your warehouses (see this article for more on this). The problem is that even Snowflake’s official advice is to choose the optimal size for your warehouse through trial and error. Their advice, which Instacart follows, is to track bytes scanned for each query and then choose the appropriate size for your warehouse. Say your query scans less than 20GB. According to Snowflake’s own advice, as shown in this illustration from Instacart’s presentation, the optimal warehouse size for you might be small, medium, large, x-large, or even 2x-large. But which one? Well, you won’t know until you try and try again.

[Illustration from Instacart’s presentation: Snowflake’s recommended warehouse sizes by bytes scanned]

No rule of thumb is precise, but even as a rule of thumb this is comically unhelpful. Imagine how many hours of experimentation will be required to find the right size. Even if you manage to get it right, there’s no guarantee that it will work tomorrow or a week later when the workload changes. There is also no guarantee that the size that’s optimal at 9pm will also be optimal during 9am peak traffic. So what should we do? Well, your poor engineers have to keep repeating this tedious and error-prone process over and over again.
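One way to see why a single static size rarely holds is to look at queueing and spilling by hour of day. This is a sketch with a hypothetical warehouse name, not a recommendation from the presentation:

```sql
-- Queueing during peak hours vs. idle capacity off-peak, for one warehouse.
SELECT DATE_TRUNC('hour', start_time)        AS hour,
       COUNT(*)                              AS queries,
       AVG(queued_overload_time) / 1000      AS avg_queue_seconds,
       SUM(bytes_spilled_to_local_storage)   AS bytes_spilled
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'ANALYTICS_WH'
  AND start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;
```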

A better way: AI with data learning

It’s incredible that in the LLM and GenAI era, we’re still approaching our database problems the same way we have for the last 40 years: a manual, expensive, one-query-at-a-time process.

It is very easy for any third-party observer to see that Instacart’s way of saving money on Snowflake takes a lot of engineering hours. My guess is that their approach has taken 20-30 FTEs for a year once you add up all the distributed effort of annotating queries, monitoring the dashboards, inspecting and optimizing individual queries, back-and-forth communication between teams, and trying different Snowflake settings. With a conservative estimate of $250K per FTE per year to include overhead, that’s at least a $5M investment in year one. Some of that cost will get amortized, particularly the annotation infrastructure and the internal dashboards and tooling, but the distributed effort of optimizing slow queries is a recurring cost: as long as there are new queries and new data, there will be new queries that need to be optimized. For Instacart, maybe that’s still a good investment if they truly saved $10M each year on their Snowflake bill, assuming they could not have invested those engineering hours elsewhere to boost their revenue. But what about other organizations that can’t afford this kind of investment? Or, even if they can afford it, what if they want to spend that money in ways that better move their enterprise forward?

The answer, in my mind, is to use automation for tasks that add little value to the organization’s core business, or tasks that humans cannot or do not want to do manually. There has been a lot of progress in recent years on techniques that can automate query and workload optimization, and academia has been leading the charge. For example, SlabCity, co-authored by one of my PhD students, is a technique that can automatically rewrite poorly written queries. At Keebo, we have tried to bring many of these cutting-edge technologies from academia to market, so data teams don’t have to resort to manual optimizations every single time. For example, you can now leverage a specific form of AI, which we call Data Learning, to rewrite your slow queries on the fly with correctness guarantees. Similarly, many Snowflake customers have already started leveraging Data Learning to automatically optimize their warehouses using automated Warehouse Optimization. If you’re intrigued by how AI can reduce your cloud warehousing costs or optimize your slow queries, I encourage you to read some of these papers. But if you don’t have time to read the technical papers, you can also just watch or book a live demo (seeing is believing, after all) and then decide whether you still want to spend your precious engineering cycles on manual optimizations.

Barzan Mozafari