Finding and resolving costly BigQuery queries
23rd August 2022 - Reading time: 7 min
Taking informed decisions is key to improving business success. From our experience organizations are currently spending significant resources on building out their data stack. As a result, costs start increasing when moving from a prototyping phase to productionizing their data flows due to increased data amounts.
A modern data stack is usually built around data storage within a data warehouse, collecting a great number of data points across the organization. This blog post will focus on Google BigQuery even though the concepts can be applied to other data warehouses as well.
BigQuery is a fully managed, highly scalable, cloud data warehouse and analytical engine. It is optimized for running analytic queries over large datasets. It can perform queries on terabytes of data in seconds and petabytes in minutes.
A great benefit of a data warehouse is that it enables the analytical team to query large amounts of data to provide insights. It also allows working with data in a streaming or batch manner for automation of data flows. However, when using a managed data warehouse, it's important to track costs. BigQuery uses either a flat rate pricing model or an on-demand pricing model, charging for bytes processed. BigQuery analytics already provides insight into costs per user. There are common scenarios where this measurement is insufficient such as for service accounts shared between multiple data flows.
Let's go through a project and a simple query that can be used in an organization to gain insights on costs for tables and queries over time. This helps to find and isolate costly parts in the data organization and data flows.
The project is set up in Google Cloud Platform with the id "stoix". In BigQuery two datasets have been created. The first one is used for ingestion of data and the second one is used for analytical tables consumed by dashboards. In the ingestion step, both partitioned and sharded tables are provided to demonstrate how to get a correct estimation for the full period.
Costs per source table
The first insight to look into is grouping the costs of queries per source table for the last 2 weeks. This provides a good indication of where there could be issues with partitioning, sharding, selection of too many tables or possibilities to reduce the number of columns retrieved.
WITH date_table_costs AS ( SELECT FORMAT_DATETIME('%Y-%m-%d', creation_time) AS date -- Set common name for sharded tables , referenced_table.project_id || '.' || referenced_table.dataset_id || '.' || CASE WHEN REGEXP_CONTAINS(referenced_table.table_id, '.*_\\d8') THEN REGEXP_EXTRACT(referenced_table.table_id, '(.*_)\\d8') || 'YYYYMMDD' ELSE referenced_table.table_id END AS table , total_bytes_billed FROM `stoix-data.region-eu.INFORMATION_SCHEMA.JOBS_BY_PROJECT` , UNNEST (referenced_tables) AS referenced_table -- Change the interval for other date spans WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP, INTERVAL 14 DAY) ) , cost_per_date AS ( SELECT date, SUM(total_bytes_billed) AS total_bytes_billed FROM date_table_costs GROUP BY date ) , cost_per_date_table AS ( SELECT date, table, COUNT(*) AS n, SUM(total_bytes_billed) AS total_bytes_billed FROM date_table_costs GROUP BY date, table ) SELECT Cost_per_date_table.table , SUM(n) AS n , FORMAT('%.2f', SUM(cost_per_date_table.total_bytes_billed) / 1073741824.0) || ' GB' AS billed , FORMAT('%.2f', SUM(cost_per_date_table.total_bytes_billed) / SUM(cost_per_date.total_bytes_billed) * 100) || '%' AS percent_of_date FROM cost_per_date JOIN cost_per_date_table USING (date) GROUP BY table ORDER BY billed DESC
Running the query provides billed bytes per table for the specified time period of 14 days. The returned columns provide information on the table name, number of times it has been queried, amount of bytes billed (rounded to GB) as well as percentage of bytes billed of the total. The following result is for the example above:
|Table (top 4)||Times queried||Bytes billed||Of total|
Notice the "YYYYMMDD", this is a combination of shards such as "20220101", "20220102", etc. Wildcards are not combined since they already reference multiple tables and are interesting enough to understand as a separate entry.
Learnings from the result
Based on this information there are a few steps that an organization can take in order to decrease BigQuery costs.
High frequency queries
Introducing automation in data flows can make even a comparably cheap job stack up on costs over time. It's important to keep in mind the combination of both cost per query with the frequency.
Looking at “stoix.ingestion.user_activity_YYYYMMDD”, it has been queried at a higher rate than the other tables, roughly 4 times per minute. Even though each run of the query has a lower cost compared to the others, it ends up being the most expensive.
Looking at high frequency queries can be a quick way of reducing cost due to any savings being multiplied by the frequency. If the query cost could be reduced from 7,5GB to 5GB that would result in savings of 10% monthly.
Automation of analytical queries
During a data exploration phase, tables tend to contain more information than needed and queries usually keep unnecessary columns and rows. Usually these queries can be simplified producing the same output but for a lower cost. Looking at the cost for a specific table over time can highlight tables that should be simplified.
If the query is changed to a daily perspective, the table “stoix.analytics.user_activity” would show low cost for 12 days and high cost for the last two days. It's a table that was recently automated purely by copying work done by the analytics team. By reducing data used by that automation or migrating the SQL to code, the savings could be up to 60% of billed bytes for that table.
Missing partition or shard filtering
With time, datasets can grow from small to unmanageable. Data warehouses provide convenient ways of handling large volumes of data, and for BigQuery that includes sharding such as for “stoix.ingestion.user_activity_YYYYMMDD” and partitioning of a table such as for “stoix.ingestion.purchases”.
A common pitfall to look out for with sharded tables is the wildcard selector “*”. Run the queries and validate bytes billed to make sure shards are correctly filtered. For table partitions the same logic applies but with the column that determines the partitions.
Looking at “stoix.ingestion.purchases” in the table above, one can see that the table is rarely queried but stands for a significant cost per query. This is an early red flag that increased usage could lead to very high costs. An easy cost saver is to double check that correct data filters are in place. The purchase table is going to be used for multiple dashboards and with more strict filters, billed bytes will be reduced close to 90%.
When building out a data stack and growing the organization's data needs, a managed data warehouse is usually a central piece. The above scenarios highlighted some actions an organization can take to reduce their monthly data warehouse costs.
In order to improve the value to cost ratio within your data warehouse it's important to start understanding where costs come from and to take adequate measures.
Also, when doing the cost analysis, total cost over time usually matters more than cost for a single query. With that in mind, it's usually fine for queries from analytics teams to be a lot more expensive than automation and recurring data flows since they have a much lower frequency.
Here are some additional steps to help with this analysis:
- Grouping the above query by table, count and user to understand if the cost belongs to the analytics team or automated data flows.
- Grouping the above query by dates to see if any tables are drastically increasing over time.
- Selecting the queries ran against the source tables sorted by bytes billed to identify the query that has the highest cost.
Click here to find out how your organization can run, orchestrate, schedule, and monitor your data ecosystem in STOIX. Be up and running in less than 5 minutes!