Extract. Transform. Read.

A newsletter from Pipeline

Hi past, present, or future data professional!

A lagging SQL query caused me to nearly miss my flight home.

Ok, that’s maybe a bit of an exaggeration; while I would have still gotten on the plane, the query in question did take nearly 2 hours to run… even after working hours!

The frustrating thing about SQL and programming in general is that no matter how technically perfect your code is, you will almost always bump up against resource constraints that can slow or stop your work from executing.

Bad or inefficient code is not the sole culprit when it comes to poor execution; there are two categories of factors that can determine whether your query runs or you receive the dreaded “time out” message, technical and situational.

Note that I explain the factors within the context of the platform (Google Cloud Platform) and SQL dialect (BigQuery SQL) I’m most familiar with; nonetheless, many of these issues can be observed across SQL dialects and query engines.

Technical Factors

Data Format and Loading: First off, your data needs to be in a format BigQuery likes. If BigQuery struggles to interpret your data (e.g., dates formatted incorrectly), you'll get errors or, even worse, inconsistent results.

Optimal Data Storage: BigQuery loves partitioned and clustered data. These techniques help BigQuery pre-sort things, making queries much more efficient. Think of it as organizing your closet so you can quickly find that one shirt. If you're working with BigQuery, learning to automate your BigQuery schema definitions with Python can save you a lot of time.

Query Syntax (Beyond SELECT *): Yes, avoid SELECT *. We all know that. But more broadly you need to think: Only grab the minimum amount of data you actually need; i.e. don't pull eight years' worth of data if you only need last month’s information. Use filters! For more help on this, check out these BigQuery Data Engineering Tricks.

Views: Views can be handy, but stacking too many views on top of each other can create a performance nightmare. Not only does BigQuery need to execute the query to return results for the initial view, it needs to do this for each underlying view definition; think of this pitfall as View-ception.

Materialized views vs static tables: Understand the source data you’re repeatedly querying and evaluate whether the size and frequency of its updates warrant a saved query or a static table updated by an ETL Pipeline.

Situational Factors

BigQuery Pricing: BigQuery offers different pricing models (on-demand and capacity-based). Your organization's choice impacts the compute power available for your queries.

Slot Contention: In capacity-based pricing, BigQuery uses "slots" (units of compute power). When lots of people are running queries at the same time (think 9-5 workday), those slots get contested. Your query might have to wait its turn, leading to delays or timeouts. This is one of the worst times for data engineering.

Downstream Queries: The number and complexity of queries from other users and systems (like dashboards) can also eat up resources and affect your query's performance.

Time of Day/Activity Level: As mentioned, peak usage times can significantly impact query execution. This is often why your data engineering requests take forever.

Which all suggests that a technically sound query can still fail if the BigQuery environment is under heavy load.

Understanding BigQuery's architecture, resource management, and how your query interacts with other processes is crucial for getting consistent results–and making a flight.

For a more in-depth explanation, including examples, read my write-up: Why Your BigQuery SQL Query Will (Or Won't) Run on Medium.

Thanks for ingesting,

-Zach Quinn

Extract. Transform. Read.

[ETR #49] Why Your SQL Query Is So Slow

Extract. Transform. Read.

Technical Factors

Situational Factors

[ETR #53] Companies Actually Still Hiring Remotely

[ETR #52] 5-Step Prompt To Land More Interviews

[ETR #51] Why AI Can Take Some DE Jobs