[ETR #49] Why Your SQL Query Is So Slow


Extract. Transform. Read.

A newsletter from Pipeline

Hi past, present, or future data professional!

A lagging SQL query caused me to nearly miss my flight home.

Ok, that’s maybe a bit of an exaggeration; while I would have still gotten on the plane, the query in question did take nearly 2 hours to run… even after working hours!

The frustrating thing about SQL and programming in general is that no matter how technically perfect your code is, you will almost always bump up against resource constraints that can slow or stop your work from executing.

Bad or inefficient code is not the sole culprit when it comes to poor execution; there are two categories of factors that can determine whether your query runs or you receive the dreaded “time out” message, technical and situational.

Note that I explain the factors within the context of the platform (Google Cloud Platform) and SQL dialect (BigQuery SQL) I’m most familiar with; nonetheless, many of these issues can be observed across SQL dialects and query engines.

Technical Factors

Data Format and Loading: First off, your data needs to be in a format BigQuery likes. If BigQuery struggles to interpret your data (e.g., dates formatted incorrectly), you'll get errors or, even worse, inconsistent results.

Optimal Data Storage: BigQuery loves partitioned and clustered data. These techniques help BigQuery pre-sort things, making queries much more efficient. Think of it as organizing your closet so you can quickly find that one shirt. If you're working with BigQuery, learning to automate your BigQuery schema definitions with Python can save you a lot of time.

Query Syntax (Beyond SELECT *): Yes, avoid SELECT *. We all know that. But more broadly you need to think: Only grab the minimum amount of data you actually need; i.e. don't pull eight years' worth of data if you only need last month’s information. Use filters! For more help on this, check out these BigQuery Data Engineering Tricks.

Views: Views can be handy, but stacking too many views on top of each other can create a performance nightmare. Not only does BigQuery need to execute the query to return results for the initial view, it needs to do this for each underlying view definition; think of this pitfall as View-ception.

Materialized views vs static tables: Understand the source data you’re repeatedly querying and evaluate whether the size and frequency of its updates warrant a saved query or a static table updated by an ETL Pipeline.

Situational Factors

BigQuery Pricing: BigQuery offers different pricing models (on-demand and capacity-based). Your organization's choice impacts the compute power available for your queries.

Slot Contention: In capacity-based pricing, BigQuery uses "slots" (units of compute power). When lots of people are running queries at the same time (think 9-5 workday), those slots get contested. Your query might have to wait its turn, leading to delays or timeouts. This is one of the worst times for data engineering.

Downstream Queries: The number and complexity of queries from other users and systems (like dashboards) can also eat up resources and affect your query's performance.

Time of Day/Activity Level: As mentioned, peak usage times can significantly impact query execution. This is often why your data engineering requests take forever.

Which all suggests that a technically sound query can still fail if the BigQuery environment is under heavy load.

Understanding BigQuery's architecture, resource management, and how your query interacts with other processes is crucial for getting consistent results–and making a flight.

For a more in-depth explanation, including examples, read my write-up: Why Your BigQuery SQL Query Will (Or Won't) Run on Medium.

Thanks for ingesting,

-Zach Quinn

Extract. Transform. Read.

Reaching 20k+ readers on Medium and nearly 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.

Read more from Extract. Transform. Read.

Extract. Transform. Read. A newsletter from Pipeline Hi past, present or future data professional! While many tech-oriented companies have (in one way or another) reneged on remote working arrangements, my employer made an extreme gesture to demonstrate its commitment to the ongoing office-less lifestyle: It removed an entire floor of our two-floor New Jersey office space. Other companies, like Spotify, have unveiled slogans like “Our employees aren’t children. Spotify will continue working...

Extract. Transform. Read. A newsletter from Pipeline Hi past, present or future data professional! The only thing worse than summer temperatures (if you’re in the western hemisphere, that is) is a summer job search. Conventionally, summer isn’t the best time to apply for work; you could probably tell this if you’re currently working and find yourself accepting an overwhelming amount of OOO cal invites. If you are braving the heat of the job market, I want to share a more targeted and...

Extract. Transform. Read. A newsletter from Pipeline Hi past, present or future data professional! Well, it finally happened; AI has replaced a build I created and I’ve been made redundant. Thankfully, the person that created the AI integration was also me. And I did this on personal time so this isn’t an apocalyptic scenario. I’ve previously written about a handful of tools I created to optimize the “busy work” of blogging. One of the ways is by adding links to past relevant articles and...