[ETR #36] The "I" Word Every DE Needs


Extract. Transform. Read.

A newsletter from Pipeline

Hi past, present or future data professional!

One of the dirty secrets about my job is how easy it can be to fix broken pipelines. Often I’m retriggering a failed DAG task or, if using a code-less pipeline, literally hitting refresh.

In fact, “refresh” is a great example of one of the more abstract data engineering concepts: state.

More specifically, it illustrates maintaining the same end state no matter how many times a process runs. That property has a name, and it is an important I-word: “idempotency.”

While idempotency sounds like an SAT word, it’s as simple as saying “Every time this process runs with the same input, the result (end state) will be the same.”

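An easy-to-grasp example of idempotency is the “WRITE_TRUNCATE” write disposition in the Google Cloud BigQuery API. If you run a load with “WRITE_TRUNCATE”, any data already in the destination table is overwritten during the load step, so running the same load twice leaves the table in the same state as running it once.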
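
To make the overwrite behavior concrete, here is a minimal sketch in plain Python. It does not call BigQuery at all: the `load` function, the mode names and the in-memory table are hypothetical stand-ins that only mimic how an append-style load and a truncate-style load behave when the same job runs twice.

```python
# Hypothetical stand-in for a load step; this is NOT the BigQuery client.
# "append" mimics adding rows to whatever is already in the table,
# "truncate" mimics replacing the table contents with the new rows.

def load(table, rows, mode):
    """Return the table contents after loading rows with the given mode."""
    if mode == "truncate":
        return list(rows)          # old contents are discarded
    if mode == "append":
        return table + list(rows)  # old contents are kept
    raise ValueError(f"unknown mode: {mode}")


def run_twice(mode):
    """Simulate a pipeline run followed by a re-trigger of the same run."""
    rows = [("2024-01-01", 10), ("2024-01-02", 12)]
    table = []
    table = load(table, rows, mode)  # first run
    table = load(table, rows, mode)  # re-run with the same input
    return table


appended = run_twice("append")
truncated = run_twice("truncate")

print(len(appended))   # 4 rows: the re-run duplicated the data
print(len(truncated))  # 2 rows: the re-run left the same end state
```

Only the truncate-style load is idempotent here: however many times it is re-triggered, the table ends up holding the same two rows.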

A more precise way to implement idempotency, and something I include in nearly all my pipelines, is a DELETE step. Rather than overwriting the entire table, I specify deletion for a particular window and then insert that window’s data again.

This means that when I run a job that deletes and inserts only yesterday’s data, the output will be the same each time, leaving historic data intact and avoiding the very real possibility of data loss.
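
The delete-then-insert pattern can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not the pipeline described above: the `sales` table, its columns, the dates and the `reload_day` helper are all made up for the example, and a real warehouse would use its own client and SQL dialect.

```python
import sqlite3


def reload_day(conn, day, amounts):
    """Delete one day's rows, then insert that day's fresh rows.

    Both statements run in a single transaction, so a failure midway
    rolls back the delete instead of leaving the day empty.
    """
    with conn:  # commits on success, rolls back on an exception
        conn.execute("DELETE FROM sales WHERE sale_date = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (sale_date, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )


# Hypothetical table with one historic row already loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
conn.execute("INSERT INTO sales VALUES ('2024-01-01', 5)")
conn.commit()

reload_day(conn, "2024-01-02", [10, 12])  # scheduled run
reload_day(conn, "2024-01-02", [10, 12])  # re-triggered run, same input

result = conn.execute(
    "SELECT sale_date, amount FROM sales ORDER BY sale_date, amount"
).fetchall()
print(result)
# [('2024-01-01', 5), ('2024-01-02', 10), ('2024-01-02', 12)]
```

The re-triggered run neither duplicates the rows for the reloaded day nor touches the historic row, because the delete is scoped to the same window that gets inserted.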

This is a very practical approach to designing data pipelines because you may get spur-of-the-moment requests to reload data or otherwise re-trigger your runs.

When executed properly, idempotency is as easy as hitting page refresh.

Thanks for ingesting,

-Zach Quinn

Reaching 20k+ readers on Medium and over 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.
