[ETR #36] The "I" Word Every DE Needs


Extract. Transform. Read.

A newsletter from Pipeline

Hi past, present or future data professional!

One of the dirty secrets about my job is how easy it can be to fix broken pipelines. Often I’m retriggering a failed DAG task or, if using a code-less pipeline, literally hitting refresh.

In fact, “refresh” is a great example of one of the more abstract data engineering concepts: state.

More specifically, it illustrates maintaining the same state under any condition. That is the definition of an important I-word: “idempotency.”

While idempotency sounds like an SAT word, it’s as simple as saying, “Every time this process runs, the result (end state) will be the same.”

An easy-to-grasp example of idempotency is the Google Cloud BigQuery API’s “WRITE_TRUNCATE” write disposition. If you run a pipeline with “WRITE_TRUNCATE”, the existing data in the destination table is overwritten during the load step, so rerunning the same load leaves the table in the same end state.
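To make the difference concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the BigQuery client: the `load` function and the sample `batch` are hypothetical, and only the two disposition names are borrowed from BigQuery. It simulates retriggering the same job three times under each setting.

```python
# Toy in-memory model of a load step; this is NOT the BigQuery client.
# `load` and `batch` are hypothetical names used only for illustration.
def load(table, rows, write_disposition):
    """Return the table contents after loading `rows`."""
    if write_disposition == "WRITE_TRUNCATE":
        return list(rows)          # replace whatever was there
    if write_disposition == "WRITE_APPEND":
        return table + list(rows)  # add on top of the existing rows
    raise ValueError(f"unknown write disposition: {write_disposition}")


batch = [("2024-01-01", 10), ("2024-01-02", 12)]

truncated = []
appended = []
for _ in range(3):  # simulate three retriggers of the same job
    truncated = load(truncated, batch, "WRITE_TRUNCATE")
    appended = load(appended, batch, "WRITE_APPEND")

print(len(truncated))  # 2: same end state no matter how many reruns
print(len(appended))   # 6: every rerun duplicated the batch
```

The truncating load ends with the same two rows after every run, while the appending load grows with each retrigger. That is the difference between an idempotent step and a non-idempotent one.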

A more targeted way to implement idempotency, and something I include in nearly all my pipelines, is a DELETE step. This is more precise than overwriting an entire table because I specify deletion for a particular window.

This means that when I run a job that deletes and inserts only yesterday’s data, the output will be the same each time, leaving historic data intact and avoiding the very real possibility of data loss.
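A delete-then-insert step for a single window can be sketched as follows. This is an illustration under stated assumptions rather than the code from any real pipeline: it uses Python's built-in `sqlite3` module in place of a cloud warehouse, and the `sales` table, its columns, and the `reload_day` helper are all hypothetical.

```python
import sqlite3


def reload_day(conn, day, amounts):
    """Delete one day's rows, then insert the fresh rows, in one transaction.

    Hypothetical helper: the `sales` table and its columns are assumptions.
    """
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute("DELETE FROM sales WHERE sale_date = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (sale_date, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
conn.execute("INSERT INTO sales VALUES ('2024-01-01', 5)")  # historic row
conn.commit()

for _ in range(3):  # retrigger the same daily job three times
    reload_day(conn, "2024-01-02", [10, 12])

rows = conn.execute(
    "SELECT sale_date, amount FROM sales ORDER BY sale_date, amount"
).fetchall()
print(rows)
# [('2024-01-01', 5), ('2024-01-02', 10), ('2024-01-02', 12)]
```

After three runs the table holds the untouched historic row plus exactly one copy of the reloaded day. Wrapping the delete and the insert in one transaction is a design choice in this sketch: if the insert fails, the delete is rolled back, so a failed run does not leave the window empty.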

This is a very practical approach to designing data pipelines because you may get spur-of-the-moment requests to reload data or otherwise re-trigger your runs.

When executed properly, idempotency is as easy as hitting page refresh.

Thanks for ingesting,

-Zach Quinn


Reaching 20k+ readers on Medium and nearly 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.
