[ETR #45] 1 Google Paper Every DE Must Read


Extract. Transform. Read.

A newsletter from Pipeline

Hi past, present or future data professional!

While I generally find textbooks a bit dry (especially when it comes to code), I can stomach the occasional white paper. One formative paper I read in grad school, "Dremel: Interactive Analysis of Web-Scale Datasets," provides a crucial look "under the hood" at the origins of BigQuery.

The Story of Dremel

Before it became the BigQuery we know (and sometimes love), Google developed Dremel to handle its massive amounts of data, including unstructured data like search queries and ad clicks. Faced with a "data-geddon" scenario, Google engineers designed Dremel to efficiently process this information. Dremel's innovations, which BigQuery inherited, include:

  • Columnar Storage: Dremel stores data by column, not row, which is far more efficient for analytical queries that touch only a few columns of a wide table. This is why data engineers emphasize using SELECT column_name instead of SELECT *; BigQuery reads only the columns you reference.
  • Execution Tree: Dremel breaks down queries into smaller tasks, distributed across a tree-like structure for parallel processing. Understanding this structure, especially by examining query execution graphs in BigQuery, can help you optimize query performance.
  • Slots as a Unit of Compute: Dremel introduced the concept of slots, an abstraction for compute power. BigQuery uses slots to manage query execution. Understanding slots is crucial for grasping how queries consume resources and how slot contention (when many users try to process large amounts of data simultaneously) can occur.
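To make those three ideas concrete, here is a toy sketch in plain Python. It is not how Dremel or BigQuery is actually implemented; the table, the function names (bytes_scanned, leaf_task, merge, run_query) and the "waves" model of slots are all assumptions I made up for illustration. It shows a query reading only the columns it needs, leaf tasks computing partial aggregates that get merged up a tree, and more tasks than slots forcing work to queue.

```python
from collections import Counter
from math import ceil

# Toy table stored by column: each column is its own list, so a query can
# read one column without touching the others. All data here is made up.
COLUMNS = {
    "country": ["US", "DE", "US", "FR", "DE", "US", "FR", "US"],
    "clicks": [3, 1, 4, 1, 5, 9, 2, 6],
    "user_agent": ["x" * 200] * 8,  # a wide column the query never needs
}


def bytes_scanned(selected):
    """Rough proxy for data read: total str() length of the selected columns."""
    return sum(len(str(value)) for name in selected for value in COLUMNS[name])


def leaf_task(start, stop):
    """Leaf level: partial SUM(clicks) GROUP BY country over one row range."""
    partial = Counter()
    countries = COLUMNS["country"][start:stop]
    clicks = COLUMNS["clicks"][start:stop]
    for country, n in zip(countries, clicks):
        partial[country] += n
    return partial


def merge(partials):
    """Intermediate or root level: combine partial aggregates from children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def run_query(rows_per_leaf=2, fan_out=2, slots=2):
    """Run the aggregation through a small tree: leaves -> intermediates -> root."""
    n_rows = len(COLUMNS["clicks"])
    ranges = [(i, min(i + rows_per_leaf, n_rows))
              for i in range(0, n_rows, rows_per_leaf)]
    leaves = [leaf_task(start, stop) for start, stop in ranges]
    intermediates = [merge(leaves[i:i + fan_out])
                     for i in range(0, len(leaves), fan_out)]
    result = merge(intermediates)
    # Simplifying assumption: one leaf task occupies one slot, so with fewer
    # slots than tasks the leaf work runs in several sequential "waves".
    waves = ceil(len(ranges) / slots)
    return dict(result), waves


if __name__ == "__main__":
    print("SELECT country, clicks:", bytes_scanned(["country", "clicks"]))
    print("SELECT *:", bytes_scanned(list(COLUMNS)))
    print(run_query(slots=2))
    print(run_query(slots=1))
```

In this toy model, selecting two narrow columns scans a small fraction of what SELECT * does, and halving the slots doubles the number of waves the same query needs, which is the intuition behind slot contention.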

Why This Matters

Reading the Dremel paper (linked below) provides a crash course in distributed computing system design. Practically, this translates to the following improvements for you, the new data engineer:

  • You'll write better queries: Understanding how BigQuery's engine works helps you avoid performance pitfalls.
  • You'll grasp system design: The concepts in the paper, like execution trees, are fundamental to understanding data infrastructure.
  • You'll (better) understand cloud computing: Learning about slots provides a foundation for understanding resource allocation in cloud environments.

Essentially, the Dremel paper helps you understand the why behind BigQuery, enabling you to become more effective and insightful in your system design and conversations with technical and non-technical stakeholders.

For those looking for concrete practice, I recommend:

  • Querying both large and small BigQuery datasets, experimenting with complex queries to see how they perform.
  • Analyzing query execution graphs to identify bottlenecks.
  • Exploring how inefficient queries impact the performance of connected visualization tools.

Or, you could skip the paper and dive headfirst into practical application, learning through one of the most traditional problem-solving approaches in programming: brute force.

And before I go, here’s the link to the Dremel paper:

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf

And here’s the longer version of the Dremel story (published last week): https://medium.com/pipeline-a-data-engineering-resource/1-google-white-paper-every-aspiring-data-engineer-must-read-a82d19bb7b4d

Thanks for ingesting,

-Zach Quinn

Extract. Transform. Read.

Reaching 20k+ readers on Medium and nearly 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.
