Optimizing Snowflake in the Age of Gen AI: Keebo’s Reinforcement Learning (RL) Solution
We’ve spilled a lot of (metaphorical) ink on this blog about the value of choosing AI over manual human effort when optimizing Snowflake. Now let’s pull back the curtain and show the technical processes we use to build and train our own AI tools.
Unlike many of our competitors, Keebo uses reinforcement learning (RL) to train our optimization tools. This approach is a game-changer: it’s the same family of techniques behind many of the seismic shifts in AI’s usefulness, value, and accessibility we’ve all seen in recent years.
Read on to learn about how RL works, and why it’s uniquely adept at optimizing complex environments like Snowflake.
What is reinforcement learning?
Reinforcement learning (RL) is a machine learning technique used to train decision-making software and algorithms. What distinguishes RL from other training approaches is the use of trial-and-error that mimics human learning.
RL algorithms start with a stated goal or outcome. Actions that move toward that outcome are reinforced, while those that detract from it are penalized or ignored. For Keebo specifically, our goal is to reduce Snowflake spend without hindering performance. So when our AI-powered Snowflake optimizer takes an action that reduces spend with no negative impact on performance, the AI receives a reward. If not, it receives no reward, or a negative one.
How does reinforcement learning work?
To understand how reinforcement learning works at a technical level, we need to go over a few key terms:
| Term | Definition |
|---|---|
| Agent | The ML algorithm or autonomous system that learns and makes decisions |
| Environment | The external system that includes all conditions, variables, and rules defining the context of the problem. The agent interacts with the environment by sensing its state and taking actions |
| Action | A decision or move the agent takes to interact with the environment |
| State | A description of the environment at a specific point in time that provides all relevant information the agent needs to make decisions |
| Reward | A positive, negative, or zero value given to the agent in response to a given action |
| Cumulative reward | The sum of all rewards received over time |
Reinforcement learning is based on actions taken at discrete time steps. At each step, the agent takes an action that moves the environment into a new state, and it receives feedback (a reward) based on the action taken.
If the action moves the environment toward the designated desired state (in our case, reduced Snowflake spend with no negative impact on performance), the agent receives a reward. If not, it receives a zero or negative reward. This reward can be immediate or delayed (more on delayed gratification below).
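To make these terms concrete, here’s a minimal sketch of the agent-environment loop in Python. The environment, state, and reward values are toy stand-ins for illustration, not Keebo’s actual implementation:

```python
import random

# Toy action space: what the agent can do to a warehouse at each time step.
ACTIONS = ["downsize", "keep", "upsize"]

def observe_state():
    """Return a simplified state: how many queries are queued right now."""
    return {"queued_queries": random.randint(0, 20)}

def take_action(state, action):
    """Apply the action and return a reward.

    Positive reward: spend went down with no hit to performance.
    Negative reward: performance suffered. Zero: nothing changed.
    (Purely illustrative reward shaping.)
    """
    if action == "downsize" and state["queued_queries"] < 5:
        return 1.0    # saved credits, queries still fast
    if action == "downsize":
        return -1.0   # queue backed up, latency suffered
    if action == "upsize" and state["queued_queries"] > 15:
        return 0.5    # prevented a latency spike
    return 0.0

# The agent-environment loop at discrete time steps.
cumulative_reward = 0.0
for step in range(100):
    state = observe_state()
    action = random.choice(ACTIONS)   # a trained agent would learn this policy
    reward = take_action(state, action)
    cumulative_reward += reward

print(f"Cumulative reward after 100 steps: {cumulative_reward:.1f}")
```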
Why use reinforcement learning over supervised learning?
AI/ML engineers deploy reinforcement learning as an alternative to the more traditional supervised learning. To understand why we’ve opted for RL in our own tools, we need to compare how the two approaches work.
Supervised learning uses labeled data to train the model. This data comes as input-label pairs, where the label is the correct answer (the ground truth). The model learns to map inputs to outputs by minimizing the difference between its predictions and the ground-truth labels. Supervised learning covers a wide range of use cases, including data classification (e.g. “identify which of these pictures is a cat”), text generation (e.g. “write a paragraph describing how to optimize a SQL query”), predictive analytics (e.g. “if X = [value], predict the [value] of Y”), and more.
Reinforcement learning reduces the amount of human input needed to train the model. Instead, it models the problem as a Markov Decision Process (MDP), which describes the probability that the environment transitions from one state to another when the agent performs a certain action. Training works by generating a large number of traces: each trace starts from a random state in the environment, takes an action, observes the result of that action, and receives a reward in response. The model then learns which actions generate the most attractive rewards and adapts accordingly.
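As a rough illustration of that trace-generation idea, here’s a hypothetical sketch. The states, actions, and reward numbers are made up purely to show the mechanics of sampling traces and scoring actions:

```python
from collections import defaultdict
import random

# Hypothetical MDP: states are warehouse sizes, actions are size changes.
SIZES = ["S", "M", "L"]
ACTIONS = ["down", "stay", "up"]

def step(state, action):
    """Toy transition and reward function standing in for a real MDP."""
    idx = SIZES.index(state)
    if action == "down":
        next_state = SIZES[max(idx - 1, 0)]
    elif action == "up":
        next_state = SIZES[min(idx + 1, len(SIZES) - 1)]
    else:
        next_state = state
    # Made-up reward signal: downsizing tends to save credits on average.
    reward = random.gauss({"down": 0.3, "stay": 0.0, "up": -0.2}[action], 0.1)
    return next_state, reward

# Generate many traces from random starting states and track which
# actions earn the best average reward.
totals, counts = defaultdict(float), defaultdict(int)
for _ in range(1000):
    state = random.choice(SIZES)
    for _ in range(10):              # one short trace
        action = random.choice(ACTIONS)
        state, reward = step(state, action)
        totals[action] += reward
        counts[action] += 1

best_action = max(ACTIONS, key=lambda a: totals[a] / counts[a])
print(f"Best-performing action so far: {best_action}")
```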
At Keebo, we chose reinforcement learning as our approach to Snowflake optimization for a number of reasons. Here are some of the most notable.
1. Dynamic optimization for ever-changing workloads
Snowflake workloads aren’t static. User demands and query loads constantly fluctuate, which means that even if you perfectly optimize a warehouse for today’s demands, you’ll have to redo the whole process tomorrow.
What’s more, these workload changes are unpredictable. You could end up with a spike happening at 3am and lasting for seven minutes. You can’t just wake up a database analyst to handle every spike that happens throughout the day.
However, because RL algorithms work 24/7, they can predict these spikes even while your team is asleep. That way, you don’t leave any money on the table.
2. Complex user environments
Snowflake environments are highly complex, especially for businesses with unique needs and workloads. What’s more, those workloads and their associated users have different priorities and performance requirements, even within the same business.
Supervised learning struggles with such complex use cases. That’s because it relies on clear, black-and-white labels for what the “right” decision looks like, which is hard to provide when no single “right” decision is easy to parse out.
Consider the following scenario: you have a sudden influx of queries to a Small warehouse, and you’re trying to decide whether to scale up to an XL. Simple logic says yes. But the nuances of Snowflake pricing make this decision more complicated.
If those queries are highly complex and take a long time to execute on a Small, you may end up saving credits by scaling up:
| Warehouse Size | Credits Per Hour | Query Execution Time (hours) | Total Credits Consumed |
|---|---|---|---|
| S | 2 | 0.18 | 0.36 |
| XL | 16 | 0.02 | 0.32 |
In that example, then, the cost savings ended up favoring the XL warehouse. But let’s consider another case:
| Warehouse Size | Credits Per Hour | Query Execution Time (hours) | Total Credits Billed (60-second minimum) |
|---|---|---|---|
| S | 2 | 0.014 | 0.033 |
| XL | 16 | 0.001 | 0.267 |
Because Snowflake bills a minimum of 60 seconds of compute each time a warehouse runs, scaling up to an XL speeds up query execution but actually consumes more credits.
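Here’s a small sketch of that billing math, assuming a simple per-second model with a 60-second minimum per warehouse run (it ignores auto-suspend, concurrency, and idle time):

```python
def billed_credits(credits_per_hour: float, execution_hours: float,
                   minimum_seconds: float = 60.0) -> float:
    """Estimate credits billed for a single warehouse run.

    Snowflake bills per second of warehouse uptime, with a 60-second
    minimum each time the warehouse starts. This is a simplified model.
    """
    billed_seconds = max(execution_hours * 3600, minimum_seconds)
    return credits_per_hour * billed_seconds / 3600


# Scenario 1: long-running queries, so the XL wins.
print(billed_credits(2, 0.18))    # S  -> 0.36 credits
print(billed_credits(16, 0.02))   # XL -> 0.32 credits

# Scenario 2: short queries, so the 60-second minimum dominates
# and the Small warehouse is far cheaper.
print(billed_credits(2, 0.014))   # S  -> ~0.033 credits
print(billed_credits(16, 0.001))  # XL -> ~0.267 credits
```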
Reinforcement learning algorithms, then, can adapt to changing and complex environments very quickly. They’re also quite adept at finding the best path forward, often in ways that aren’t obvious to human engineers.
3. Reduced human effort and error
As mentioned above, supervised learning requires humans to label data in order to provide direction to the algorithm. This introduces a risk of human error, subjectivity, and bias into the model. When you’re dealing with tens of thousands of dollars in Snowflake spend, every improvement in the algorithm’s accuracy has a big impact.
What’s more, millions of queries run every day. It’s simply not possible for a human to analyze every query and determine the optimal decision for each one around the clock.
RL algorithms don’t require humans to label training data, so they don’t fall prey to these problems. Instead, the reward-and-penalty system enables the algorithm to self-correct whenever it fails to advance the predetermined outcome.
In our case, the outcome is Snowflake cost savings. But how do you know whether you’ve actually saved? Here’s an article where our CEO walks through how to figure out how much your Snowflake optimizations are saving you.
4. Optimization against long-term goals (delayed gratification)
Consider a game of chess: often you make a move but won’t know until 10, 15, or even 20 moves down the line whether it was a good or bad choice. The same principle applies for AI decision-making.
For Snowflake cost optimization, the agent usually knows the consequences of its decision, such as the impact on latency and bytes scanned, within 15 minutes. But when queries run for more than a couple of hours, the consequences aren’t immediately obvious.
Because of its use of cumulative rewards, RL is built to emphasize long-term reward maximization. This is especially important when working with Snowflake, as suboptimal optimization decisions can have long-term consequences that go beyond immediate performance/cost impact:
- Poor user experience that results in churn and lack of re-engagement
- Service-level agreement violations that result in penalties and lost trust
- Delays in data pipelines that result in operational inefficiencies and lost opportunities
Because feedback in these qualitative areas isn’t always immediately available, RL’s ability to learn from long-term trends that result in higher cumulative rewards is especially valuable.
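One common way to formalize this is a discounted cumulative reward, where a discount factor close to 1 makes the agent weigh distant consequences almost as heavily as immediate ones. The discount factor and reward numbers below are illustrative assumptions, not values from Keebo’s system:

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of rewards where later rewards are down-weighted by gamma.

    A gamma close to 1 makes the agent care about long-term consequences
    (SLA penalties, churned users) almost as much as immediate savings.
    """
    return sum(r * gamma**t for t, r in enumerate(rewards))

# An immediate saving of 1 credit followed much later by an SLA penalty:
rewards = [1.0, 0.0, 0.0, 0.0, -5.0]
print(discounted_return(rewards, gamma=0.5))   # ~0.69: the penalty barely registers
print(discounted_return(rewards, gamma=0.95))  # ~-3.07: the penalty dominates
```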
5. Exploration vs. exploitation
In the same vein, an intelligent RL agent can constantly weigh the tradeoff between exploration and exploitation, which isn’t something a model trained through supervised learning can offer.
While RL agents can certainly exploit the information received to maximize their rewards, their ability to delay gratification incentivizes them to go out and identify potential new ways of improving the outcome or achieving it more efficiently.
For example, with any AI model, there are dependencies the model builders simply can’t capture. In supervised learning, the assumption is often that data points (rows) are independent of each other. RL, however, explores the environment and learns from interactions, which allows it to capture such dependencies.
By prioritizing exploration in some cases, RL models put more “tools in the belt” and expand their knowledge of the environment. As this process continues, they become more efficient at achieving their mandated outcomes.
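A classic (and deliberately simple) way to balance this tradeoff is an epsilon-greedy rule: explore a random action occasionally, exploit the best-known action otherwise. This is a textbook sketch, not necessarily the exact rule Keebo uses:

```python
import random

def epsilon_greedy(avg_reward_by_action, epsilon=0.1):
    """With probability epsilon, explore a random action;
    otherwise exploit the action with the best average reward so far."""
    if random.random() < epsilon:
        return random.choice(list(avg_reward_by_action))
    return max(avg_reward_by_action, key=avg_reward_by_action.get)

# Hypothetical running estimates for three warehouse-sizing actions:
estimates = {"downsize": 0.4, "keep": 0.1, "upsize": -0.2}
action = epsilon_greedy(estimates, epsilon=0.1)
print(f"Chosen action: {action}")
```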
How does Keebo use reinforcement learning to make better decisions within Snowflake environments?
So let’s break down what this looks like when it comes to Snowflake optimization. Specifically, we’ll drill down into one of the algorithms we use: the multi-armed bandit (MAB).
MAB is a machine learning framework in which the agent repeatedly chooses among multiple “arms” (actions) to maximize its cumulative reward over time. The goal is to test multiple actions and quickly identify which one maximizes the reward. Once the exploration process is done, the agent can switch to exploitation and double down on that action.
After building the MAB model, we built a reward model based on the data in the warehouse. Essentially, we were able to determine not only the probability of transitioning from one state to another, but also the anticipated reward from that transition.
For Snowflake specifically, here’s a simplified example of how this plays out. Let’s say you have a Medium-size warehouse. The agent has the option to either maintain the current size (State A) or downsize it to a Small in order to save costs (State B). The agent will measure the impact on latency brought about by the decision to either maintain or downsize, then classify that as a positive or negative impact. If latency increases, the result is classified as negative, and the agent is given a negative reward.
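Here’s a simplified, hypothetical sketch of that reward step. The latency tolerance, metric names, and reward values are illustrative only, not Keebo’s production reward model:

```python
def sizing_reward(latency_before_s, latency_after_s,
                  credits_before, credits_after,
                  latency_tolerance=1.05):
    """Classify the outcome of a sizing decision as a reward.

    Positive reward: credits dropped and latency stayed within tolerance.
    Negative reward: latency regressed beyond the tolerance.
    (Illustrative thresholds, not Keebo's production reward model.)
    """
    if latency_after_s > latency_before_s * latency_tolerance:
        return -1.0   # performance suffered
    if credits_after < credits_before:
        return 1.0    # saved money, performance held
    return 0.0        # no meaningful change

# State B (downsize Medium -> Small): latency barely moves, credits halve.
print(sizing_reward(latency_before_s=12.0, latency_after_s=12.3,
                    credits_before=4.0, credits_after=2.0))  # -> 1.0
```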
One of Keebo’s advantages over other AI-based optimization tools is that we make warehouse adjustments in real time, rather than issuing recommendations that depend on manual changes. This means our model has near real-time access to feedback that improves its performance over time. And because we incorporate real-world data into our reinforcement learning algorithm daily, we’re able to make adjustments that align with reality.
Final thoughts on reinforcement learning in Snowflake optimization
Reinforcement learning (RL) enables Keebo to account for the complexities of Snowflake warehouse environments that supervised learning simply cannot. As a result, we’re able to explore new, ongoing opportunities to reduce Snowflake costs that may not align with predetermined rules.
This is one of the many reasons Keebo is able to outperform our competitors and achieve maximum savings for customers—without sacrificing query performance.
Learn more about what Keebo can offer you by scheduling a live demo here.