3 Strategies to Prevent Auto-Generated Queries from Impacting Cost & Performance
Most organizations use some kind of business intelligence, data modeling, data visualization, or AI/ML tool on top of Snowflake’s data cloud. For most of these integrations, engineers don’t manually translate data requests into SQL queries. Rather, these platforms auto-generate SQL code that queries Snowflake for the data needed at any given moment.
While these auto-generated queries help avoid unnecessary manual work, they have problems of their own. Namely, they’re often written inefficiently and, as a result, negatively impact cost and performance. Unfortunately, you can’t rewrite them to be more efficient, so you’re stuck absorbing those costs.
Or are you? In this article, we’ll walk through three strategies that can help address auto-generated queries so you can avoid incurring higher costs and hurting your data cloud performance.
How do auto-generated queries work?
Most platforms that integrate with Snowflake or other data cloud layers have their own specific mechanism for auto-generating queries. For the sake of brevity, we’ll use one platform—Looker—as an example of how this process works.
Looker auto-generates SQL queries through LookML, a modeling language that abstracts SQL complexities and creates optimized queries based on user interactions:
- Looker uses LookML to define data relationships, dimensions, measures, and filters—this creates a blueprint for future query generation
- When a user interacts with Looker, the platform will use this blueprint to generate a SQL query that integrates user-defined attributes (see the sketch after this list)
- Throughout this process, Looker can create temporary or persistent derived data tables, which can help pre-aggregate data to improve complex query performance
- Looker can integrate directly with the cloud data layer’s unique features, like Snowflake’s ability to auto-suspend warehouses
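For illustration, here’s the kind of SQL Looker might emit when a user filters an orders Explore by date and groups results by customer. This is a hypothetical sketch, not actual Looker output, using made-up table and column names that match the examples later in this article:

SELECT
    customers.customer_name AS "customers.customer_name",
    COUNT(DISTINCT orders.order_id) AS "orders.order_count"
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
WHERE orders.order_date >= '2024-01-01'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;

Note the generated aliases and the appended row limit: hallmarks of machine-written SQL that no human reviews before it hits the warehouse.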
What are the most common inefficiencies with auto-generated SQL queries?
Although tools like Looker are built to auto-generate efficient queries, some queries inevitably are not. If left unaddressed, these issues can lead to excessive compute resource consumption, slow query performance, concurrency issues, and more.
Excessive data fetching
Many auto-generated queries use SELECT *, which retrieves every column from a table. If you only need a few of those columns, the query fetches unneeded data. As a result, it takes longer to run and uses up more compute resources.
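For example, if a dashboard only displays order IDs and dates, there’s no reason to pull every column. Here’s a simple sketch using the hypothetical orders table from the examples below:

-- Auto-generated: fetches every column, scanning more data than needed
SELECT * FROM orders;

-- Hand-tuned: fetches only the columns the dashboard actually uses
SELECT order_id, order_date FROM orders;

Because Snowflake stores data by column, trimming the column list directly reduces the amount of data scanned.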
Inefficient joins
Another common issue with auto-generated queries is the use of poorly optimized join strategies, like Cartesian joins or joins without proper keys. These add needless complexity to how data is retrieved and consume resources the query doesn’t actually need.
For example, the following query:
- Uses a JOIN on an unfiltered column, leading to unnecessary row scans
- Uses a subquery inside the JOIN, forcing Snowflake to materialize the subquery separately
- Does not pre-filter data, making the join scan all rows
As a result, the subquery in the JOIN causes an unnecessary scan of order_details. Additionally, no filtering means Snowflake scans all orders and customers.
SELECT
    orders.order_id,
    orders.customer_id,
    customers.customer_name,
    SUM(od.quantity * od.unit_price) AS total_amount
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
JOIN (
    SELECT order_id, quantity, unit_price FROM order_details
) AS od ON orders.order_id = od.order_id -- Unnecessary subquery
GROUP BY orders.order_id, orders.customer_id, customers.customer_name;
A more efficient join strategy would look something like the following:
SELECT
    orders.order_id,
    orders.customer_id,
    customers.customer_name,
    SUM(order_details.quantity * order_details.unit_price) AS total_amount
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
JOIN order_details ON orders.order_id = order_details.order_id -- Direct join without subquery
WHERE orders.order_date >= '2024-01-01' -- Filter early to reduce scanned rows
GROUP BY orders.order_id, orders.customer_id, customers.customer_name;
This rewritten query:
- Removes the unnecessary subquery for order_details
- Applies filtering (WHERE clause) before joining, reducing the number of rows scanned
- Keeps all column selections the same as the unoptimized query
Complex query logic
Auto-generated queries can sometimes have unnecessarily complex query logic. Here’s an example of a query with numerous inefficiencies:
- Uses DISTINCT when there’s no need for it
- Joins to a subquery instead of filtering directly in the main query
- Uses LIKE '%value%', whose leading wildcard forces a scan of every value, instead of a prefix filter
SELECT DISTINCT
    orders.order_id,
    orders.customer_id,
    customers.customer_name,
    orders.order_date,
    SUM(order_details.quantity * order_details.unit_price) AS total_amount
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
JOIN order_details ON orders.order_id = order_details.order_id
WHERE
    (orders.status = 'Completed' OR orders.status = 'Shipped')
    AND customers.customer_name LIKE '%Smith%'
    AND orders.order_id IN (
        SELECT order_id FROM shipments WHERE shipped_date IS NOT NULL
    )
GROUP BY orders.order_id, orders.customer_id, customers.customer_name, orders.order_date;
Here are some ways to optimize this query:
- Remove DISTINCT, since GROUP BY already ensures unique order IDs.
- Replace the IN subquery with a direct JOIN on shipments.
- Replace the chained ORs with IN, which is simpler for the optimizer to evaluate.
- Use ILIKE 'Smith%' (a prefix search) instead of '%Smith%', since a leading wildcard prevents efficient pruning.
SELECT
    orders.order_id,
    orders.customer_id,
    customers.customer_name,
    orders.order_date,
    SUM(order_details.quantity * order_details.unit_price) AS total_amount
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
JOIN order_details ON orders.order_id = order_details.order_id
JOIN shipments ON orders.order_id = shipments.order_id
    AND shipments.shipped_date IS NOT NULL -- Replaces the subquery; keeps the original shipped filter (assumes one shipment per order)
WHERE
    orders.status IN ('Completed', 'Shipped') -- Simpler for the optimizer than chained ORs
    AND customers.customer_name ILIKE 'Smith%' -- Uses prefix search for better performance
GROUP BY orders.order_id, orders.customer_id, customers.customer_name, orders.order_date;
The problem is that if you’re using a tool that auto-generates SQL queries, you don’t have the ability to go in and make these changes.
Problems with inefficient queries for front-end users
Inefficient queries can cause a number of front-end issues that result in a poor user experience and erode trust in your product and brand. For example, healthcare data platforms often generate Snowflake queries to fetch data that aids in diagnoses and treatment plans. If that data loads slowly, it can keep providers from doing their jobs properly.
Inefficient queries can manifest these problems in a number of ways, including:
- Slow dashboard performance and unnecessary lag in generating visualizations and reports
- Timeouts and errors caused by high query latencies and concurrency limits
- Inconsistent or incorrect data, as AI-generated queries may not always follow business logic correctly
Problems with inefficient queries for data teams & engineers
The problems caused by inefficient queries are equally challenging for engineers and data teams. Not only do they increase compute costs, but they also create performance inefficiencies that distract from more mission-critical efforts.
Some of the key challenges include:
- Inability to optimize auto-generated queries, which limits their ability to control and mitigate these inefficiencies
- Increased concurrency pressure, as poorly written queries running in parallel can slow down other workloads
- Trouble debugging performance issues, as some of the root causes are inaccessible to engineers
- Schema design constraints, as engineers sometimes have to restructure tables (e.g., denormalizing data, adding materialized views, or pre-aggregating data) so AI-generated queries can perform acceptably (see the sketch after this list)
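As an example of that last point, a pre-aggregation workaround might look like the following: a hypothetical sketch that rolls up the order_details table from earlier so repetitive auto-generated aggregations hit a smaller, precomputed object. (Note that Snowflake materialized views require Enterprise edition and support only a single source table.)

-- Pre-aggregate line items once so auto-generated queries can read the rollup
CREATE MATERIALIZED VIEW order_totals AS
SELECT
    order_id,
    SUM(quantity * unit_price) AS total_amount
FROM order_details
GROUP BY order_id;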
How to keep inefficient queries from impacting cost & performance
While it’s technically possible to address the issues mentioned above, it requires time and effort. Plus, there’s no guarantee that you won’t have to perform that same exercise again weeks or even days later.
Instead, there are a number of ways AI can help you mitigate the risks of query inefficiencies when eliminating them is infeasible. Here are three strategies that Keebo’s AI optimization engine can autonomously implement today.
Smart query routing
In most Snowflake instances, compute resources (i.e., warehouses) are assigned based not on performance logic, but on application logic. In other words, Looker might have its own dedicated warehouse, Airbyte its own, Fivetran its own, etc.
But let’s say that 85% of the queries that come through Fivetran require relatively little compute power, while the other 15% require a larger warehouse. On the flip side, nearly all of Looker’s queries require more compute power. Wouldn’t it be great if you could take that 15% of Fivetran’s queries and run them on the Looker warehouse? That way, you wouldn’t have to upsize Fivetran’s warehouse at all.
Or let’s say you have a warehouse assigned to Airbyte, but your Looker and Fivetran usage is 10X higher than your Airbyte usage. You could route queries to the underutilized Airbyte warehouse, taking advantage of resources you’ve already provisioned.
This is exactly what query routing does. It helps eliminate unnecessary auto-scaling costs by leveraging underutilized warehouses, and improves concurrency by spreading workloads across warehouses.
Keebo’s smart query routing feature dynamically directs queries to the optimal warehouse based on size, load, and query characteristics. This keeps lightweight queries from running on expensive warehouses (and vice versa), improving load balancing and reducing the impact of inefficient auto-generated queries.
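To make the idea concrete, here’s what routing looks like when done by hand in a Snowflake session, using hypothetical warehouse names in a simplified sketch. Keebo’s engine makes this decision automatically, per query:

-- Lightweight queries stay on the small warehouse
USE WAREHOUSE fivetran_wh_xs;
SELECT COUNT(*) FROM orders WHERE order_date = CURRENT_DATE;

-- The heavy minority gets redirected to a larger warehouse that's already provisioned
USE WAREHOUSE looker_wh_l;
SELECT customer_id, SUM(quantity * unit_price) AS lifetime_value
FROM order_details
GROUP BY customer_id;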
Autonomous warehouse optimization
Despite your best efforts, there are going to be instances where query routing won’t give your queries the resources they need to run efficiently. In that case, you’ll need to provision more resources by scaling up your warehouses.
Snowflake enables users to easily scale warehouses up and down based on resource demands. However, changing warehouse settings manually is a time-consuming exercise that takes engineering resources away from other important tasks. Plus, high-demand queries come in at all hours of the day, and your engineers aren’t on call 24/7.
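For reference, the manual version of this is standard Snowflake DDL (shown here with a hypothetical warehouse name) that someone has to run every time demand shifts:

-- Scale up for a heavy reporting window, suspending quickly when idle
ALTER WAREHOUSE looker_wh SET
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60; -- seconds of inactivity before auto-suspending

-- ...and someone has to remember to scale back down afterward
ALTER WAREHOUSE looker_wh SET WAREHOUSE_SIZE = 'SMALL';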
Keebo autonomously adjusts warehouse sizes, clustering, & caching to reduce spend without impacting performance. Our algorithms continuously learn and adapt to changes in real time. Plus, we offer fine-grained control over Snowflake so you can adjust optimization settings as needed.
360-degree observability
One of the challenges with auto-generated queries is that it can be difficult to see which queries and applications are your biggest cost drivers. For most applications, this is a significant “black box” that prevents you from gaining the insights you need to optimize your data cloud.
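Snowflake’s own metadata offers a starting point. For instance, this query against the ACCOUNT_USAGE schema surfaces which warehouses are doing the most work over the past week, but attributing that activity to specific upstream applications is where the black box closes:

-- Which warehouses are burning the most query time this week?
SELECT
    warehouse_name,
    COUNT(*) AS query_count,
    SUM(total_elapsed_time) / 1000 AS total_seconds,
    SUM(bytes_scanned) AS total_bytes_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_seconds DESC;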
Keebo’s comprehensive workload intelligence tool provides 360-degree observability into spend, query, warehouse, storage, and data health. The platform handles complex workloads and provides transparent, measurable, and actionable insights.
For example, if one of your applications is consistently generating inefficient queries, Keebo can isolate those inefficiencies and provide recommendations to mitigate those costs.
Final thoughts on addressing inefficiencies from auto-generated queries
To be clear, not every auto-generated query is inherently more inefficient than one that’s “hand-written.” Humans make mistakes, and sometimes AIs hit the ball out of the park.
The problem isn’t human vs. AI. It’s about visibility into the actual performance of AI-generated queries and having the right tools to address them in real time. Keebo’s powerful combination of comprehensive insight and autonomous optimization is critical for controlling data cloud costs and maintaining optimal performance. Check out a demo today to see Keebo in action!