[ETR #49] Why Your SQL Query Is So Slow


Extract. Transform. Read.

A newsletter from Pipeline

Hi past, present, or future data professional!

A lagging SQL query caused me to nearly miss my flight home.

Ok, that’s maybe a bit of an exaggeration; while I would have still gotten on the plane, the query in question did take nearly 2 hours to run… even after working hours!

The frustrating thing about SQL and programming in general is that no matter how technically perfect your code is, you will almost always bump up against resource constraints that can slow or stop your work from executing.

Bad or inefficient code is not the sole culprit when it comes to poor execution; there are two categories of factors that can determine whether your query runs or you receive the dreaded “time out” message, technical and situational.

Note that I explain the factors within the context of the platform (Google Cloud Platform) and SQL dialect (BigQuery SQL) I’m most familiar with; nonetheless, many of these issues can be observed across SQL dialects and query engines.

Technical Factors

Data Format and Loading: First off, your data needs to be in a format BigQuery likes. If BigQuery struggles to interpret your data (e.g., dates formatted incorrectly), you'll get errors or, even worse, inconsistent results.

Optimal Data Storage: BigQuery loves partitioned and clustered data. These techniques help BigQuery pre-sort things, making queries much more efficient. Think of it as organizing your closet so you can quickly find that one shirt. If you're working with BigQuery, learning to automate your BigQuery schema definitions with Python can save you a lot of time.

Query Syntax (Beyond SELECT *): Yes, avoid SELECT *. We all know that. But more broadly you need to think: Only grab the minimum amount of data you actually need; i.e. don't pull eight years' worth of data if you only need last month’s information. Use filters! For more help on this, check out these BigQuery Data Engineering Tricks.

Views: Views can be handy, but stacking too many views on top of each other can create a performance nightmare. Not only does BigQuery need to execute the query to return results for the initial view, it needs to do this for each underlying view definition; think of this pitfall as View-ception.

Materialized views vs static tables: Understand the source data you’re repeatedly querying and evaluate whether the size and frequency of its updates warrant a saved query or a static table updated by an ETL Pipeline.

Situational Factors

BigQuery Pricing: BigQuery offers different pricing models (on-demand and capacity-based). Your organization's choice impacts the compute power available for your queries.

Slot Contention: In capacity-based pricing, BigQuery uses "slots" (units of compute power). When lots of people are running queries at the same time (think 9-5 workday), those slots get contested. Your query might have to wait its turn, leading to delays or timeouts. This is one of the worst times for data engineering.

Downstream Queries: The number and complexity of queries from other users and systems (like dashboards) can also eat up resources and affect your query's performance.

Time of Day/Activity Level: As mentioned, peak usage times can significantly impact query execution. This is often why your data engineering requests take forever.

Which all suggests that a technically sound query can still fail if the BigQuery environment is under heavy load.

Understanding BigQuery's architecture, resource management, and how your query interacts with other processes is crucial for getting consistent results–and making a flight.

For a more in-depth explanation, including examples, read my write-up: Why Your BigQuery SQL Query Will (Or Won't) Run on Medium.

Thanks for ingesting,

-Zach Quinn

Extract. Transform. Read.

Reaching 20k+ readers on Medium and nearly 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.

Read more from Extract. Transform. Read.

Extract. Transform. Read. A newsletter from Pipeline Hi past, present or future data professional! One of the most loaded terms, after AI, is upskilling. It’s something everyone should always be doing, yet, only the most dedicated can consistently dedicate time to learning and expanding beyond their comfort zones. If you’re on the path to becoming a data professional, you’ve probably spent countless hours learning, only to find yourself wondering if you’re actually making progress. I’ve been...

Extract. Transform. Read. A newsletter from Pipeline Hi past, present or future data professional! When I worked as a resume consultant, the toughest mental block for clients was identifying and expressing material contributions at work; avoiding this communication is why so many job hunters revert to regurgitating their job duties rather than clarifying the outcomes of their work. In addition to overcoming the hurdle of distilling a complex technical role for non-technical recruiters to...

Extract. Transform. Read. A newsletter from Pipeline Hi past, present or future data professional! Data science just cracked the top 40… of jobs whose main functions are most likely to be replaced by AI. If you’re up to speed on your AI doomerism news you’ll know that at the end of July, Microsoft released a list of jobs across disciplines and industries that could be majorly disrupted by AI. On a more positive economic outlook, data engineering is specifically cited as a growing role in the...