As a computer science professor focused on database systems (and now CEO of a data learning platform, called Keebo), Looker’s in-database architecture has always resonated with me. It’s exactly what we teach as best practice in database classes: push your computation to where your data lives!
This approach, known as “live connection mode,” improves security and eliminates the need for repetitive and expensive data extracts. It lets users see fresh data while enabling them to seamlessly drill down to granular details with fewer chances of seeing inconsistent numbers. Yet, despite its many advantages, this approach comes with one important tradeoff: slower dashboards.
Each time you load a dashboard or change a filter, you’re sending a new query to the database. This means waiting for:
- Your Snowflake, Redshift or BigQuery (or whichever database you’re using) to run the query,
- The network round-trip time for the request and the results, and
- Looker’s own time to render the new results.
If your dashboard has 5-10 tiles, then you have to wait for your database to finish running 5-10 of those queries. So you can easily end up looking at a spinning wheel for 15-60 seconds waiting, depending on the complexity of your dashboard. It gets even worse if you’re at a company with concurrent dashboard users.
While 15-60 seconds may not seem like a lot of time, it discourages exploration of interesting patterns and decreases the frequency of “going to the data”. User experience studies have long shown that, for a wide range of tasks, human engagement drops significantly when the response time of the interaction exceeds 2 seconds. The good news is that there are ways to mitigate this problem, so read on!
In this post, I’ll first talk about two common approaches, and then present a new but more automated and, I believe, more effective approach to improve dashboard performance.
#1. Reduce the number of tiles on each dashboard
This “breaks up” the latency into smaller chunks to improve dashboard speed. For example, instead of a single dashboard that takes 60 seconds to load, break it into two dashboards that can each load in 30 seconds. The speed-up may not always be linear like that, but you get the idea.
- It’s easy to do. You don’t have to optimize your queries or tweak your schema.
- This method is not ideal for information-rich dashboards. At some point, the dashboard loses its value if you keep breaking it up.
- This doesn’t help for the tiles that are inherently slow, i.e., a query that is too expensive even when run by itself.
#2. Manual dashboard performance optimization
You can manually analyze the SQL queries behind your slow dashboard and experiment with different ideas, like changing column types, adding an index, or looking at the query plan (by running
EXPLAIN). You can also materialize common subqueries, either in your ETL pipeline or by using Looker’s PDTs (Persistent Derived Tables).
- This actually speeds up your slow queries instead of scattering them across multiple dashboards.
- You can keep your data-rich dashboard, with all relevant KPIs shown together.
- It can be daunting to analyze tens or hundreds of SQL queries across different dashboards and decide what to pre-aggregate or pre-join in order to achieve the highest speedup. You may end up materializing a view that gets updated too frequently or end up with too many pre-aggregated/pre-joined tables that are too expensive to update.
- You have to ensure that all your dashboards use those aggregated views or PDTs instead of the base tables, and do so correctly.
- It is time-consuming. New data sources get added, schemas change, and new dashboards get created, so revisions are constantly needed. With this approach, you’re never going to be “done”. Many data teams are surprised when they measure how much of their time is spent on performance optimization and preventing performance problems.
#3. Automated performance optimization using Keebo’s Data Learning
There is a third, automated approach to solving database performance problems: Over the past 7 years, my Ph.D. students and I have been researching the use of machine learning to accelerate and automate analytical workloads. This area, which we call Data Learning, has seen several breakthroughs in recent years. After seeing several successful deployments, we decided to commercialize this idea in 2019 as a system we call Keebo.
Data teams have since used Keebo to dramatically improve their user experience and query performance. Keebo’s automated Data Learning can speed up Looker dashboards without requiring you to write a single line of code or make any changes to the database. Keebo transparently inserts itself between your BI layer and your data warehouse to automatically accelerate queries by 10-100x.
How to use Keebo
- Run a Keebo server. To ensure maximum security, Keebo runs within your own VPC, keeping all your data, models and queries behind your firewall. You just need to spin up a node and obtain a license for Keebo (you can get a free 30-day trial license). Keebo gets installed and configured for you.
- Connect your BI and Data Warehouse to your Keebo server. Keebo is wire compatible with the native interface of all popular databases (e.g., Postgres, Snowflake, Redshift, BigQuery). This means Looker, or any other application that connects to a database, connects to Keebo in the same way it would connect directly to the underlying database. No custom driver or special configuration required: connecting/disconnecting Keebo is as simple as changing the hostname in your connection setting (see Figure below).
It’s that easy. Any dashboard using this connection automatically accelerates.
How Keebo works
Once Keebo connects to your data warehouse, it automatically starts to learn a set of “smart models” by analyzing your queries and data. Keebo stores its smart models in your own data warehouse for security and efficiency. At run time, when your dashboards load or issue queries to your database, Keebo analyzes those queries. If it identifies slow queries (> 2 seconds), Keebo automatically rewrites them into an equivalent form using its smart models before routing them to your data warehouse. This means your computation and data remain inside your data warehouse.
Smart models are significantly more compact than the raw data in your base tables and they avoid expensive computations (joins, aggregation, filtering). This means that your database can run Keebo’s rewritten query up to 100x faster than the original query.
Keebo’s automated process, called Data Learning, takes three factors into account when deciding which smart models to build:
- Your data distribution, considering factors such as high cardinality and skewed columns or large tables.
- Your query distribution, such as common join patterns, grouping conditions, filter columns, and aggregation patterns.
- Your storage budget.
Keebo automatically learns an optimal set of smart models that will speed up the largest number of slow query patterns while staying within your storage budget.
Whenever the underlying data or the schema changes, Keebo automatically updates its smart models too. Keebo continuously monitors for new query patterns and their frequency of occurrence, creating new models or retiring older ones when they’re no longer useful.
- Similar to Machine Learning, Data Learning is a general approach that can analyze millions of past queries, whereas human experts may get overwhelmed and make suboptimal decisions.
- This fully-automated approach typically delivers 10-100x speedups.
- Because your queries need less compute power, you can considerably reduce your compute bill as well, and potentially even downsize your cloud data warehouse.
- It saves you the time needed for diagnosing and fixing performance problems. Unlike typical machine learning, which is often block-box, with Data Learning you can look at the rewritten queries to examine what’s going on underneath.
- Anyone using a Keebo connection will automatically benefit from the speedups without a need to understand the underlying changes.
- Queries are rewritten on-the-fly, so the queries in your dashboard remain in their original, straightforward forms which can be easily debugged and modified.
- You don’t need to worry about your materialized views or ETL pipeline. Keebo detects schema and data changes automatically, and then updates its own models accordingly.
- Whenever new queries are issued or new dashboards are created, Keebo automatically adjusts its models.
- Keebo resides behind your firewall and your models and data remain secure inside your data warehouse. You also don’t need a large server to host Keebo because queries are still run by your own database.
- You of course need to install Keebo (duh!).
Letting the machine do the work
There are things that only humans can do, but more and more, we’re finding that some things we thought required human expertise can actually be automated. Advances in Data Learning show that performance-related optimizations, database tuning and writing/maintaining complex data aggregation pipelines can and should now be delegated to automated systems. Data teams can instead spend their precious time on more important tasks that directly impact business results, like ensuring data quality or deriving actionable insights. That’s what Keebo is here for: Our mission is automating and accelerating enterprise analytics using Data Learning. You can learn more about Keebo in this white paper or request a free trial here.