[ETR #33] Avoid These DE Shortcuts


Extract. Transform. Read.

A newsletter from Pipeline

Hi past, present or future data professional!

In data engineering, a profession built on principles of automation, it can feel counterintuitive to suggest that an optimization or “shortcut” could be a bad thing.

But, as someone who was once a “baby engineer,” I can tell you that a combination of temptation and overconfidence will inevitably drive you to say, “I can do without X development step.”

Doing so creates reputational risk (loss of credibility or trust) and, in a worst-case scenario, could even cost you your job.

If you’re job searching or beginning your first role, there are 6 areas where I’d never even attempt to take “the easy route.”

  1. Testing only transformations, not load steps - It’s true that most of the “heavy lifting” in your pipeline happens at the “T” of ETL (or ELT), but that doesn’t mean errors can’t occur when loading to your database. Take the time to create and load to a test table first.
  2. Validating against expected volume, not metrics - Data engineers are primarily concerned with the shape of data, but it’s the content that matters. If the output is suspect, it’s important to understand what components are influencing critical fields like revenue. Take the time to meet and collaborate with technical partners, like data analysts, who understand the “what” of the data.
  3. Waiting too long to alert stakeholders of “bad data” - Being responsible for pipelines that produce “bad data” is objectively not a good look. But you can mitigate stakeholder concern and emotional responses by sounding the alarm as soon as you notice something is “off.” Providing a concise explanation of the issue, the steps to resolve it and an estimated resolution timeline will help cushion the bad news you deliver.
  4. Assuming “someone else will fix it” - It’s too easy to assume, when an alert comes in, that “someone else will deal with it.” If your pipelines include an alerting component and you don’t receive a message that someone is checking the issue, take the initiative and, most importantly, let your team know you’re on it.
  5. Not testing in production-adjacent environments - Repeat after me: Your local IDE is not production. While your code may run flawlessly locally, you need to remember that production environments have different configurations and dependency requirements. Work to create a virtual environment or container that mimics these conditions to decrease the chances of something not deploying or functioning correctly.
  6. Saying “yes” to everything - While it’s tempting to cement yourself as the “go-to” for your team as a new engineer, you need to avoid taking on too much grunt work. An abundance of grunt work increases your chances of becoming overwhelmed. It also means you’re not working on projects that will raise your visibility and make an impact within the org, both of which are necessary to get you noticed for raises, promotions and general kudos, all of which make this sometimes thankless job a little better.
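To make point 1 concrete, here’s a minimal sketch of loading to a test table before touching the real target. The table names, schema and sample rows are invented for illustration; SQLite stands in for whatever warehouse you actually load to.

```python
import sqlite3

# Hypothetical example: "orders_test" and "orders" are invented names.
# The idea: land data in a test table, validate the load itself, and
# only then write to the real target table.

def load_with_test_table(conn, rows):
    """Load rows into orders_test, validate, then copy into orders."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS orders_test (id INTEGER, revenue REAL)")
    cur.execute("DELETE FROM orders_test")  # start from a clean slate
    cur.executemany("INSERT INTO orders_test VALUES (?, ?)", rows)

    # Validate the load step: did every row land, and are keys intact?
    (count,) = cur.execute("SELECT COUNT(*) FROM orders_test").fetchone()
    assert count == len(rows), f"expected {len(rows)} rows, loaded {count}"
    (nulls,) = cur.execute(
        "SELECT COUNT(*) FROM orders_test WHERE id IS NULL"
    ).fetchone()
    assert nulls == 0, "NULL ids found after load"

    # Only after validation does data reach the real table.
    cur.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, revenue REAL)")
    cur.execute("INSERT INTO orders SELECT * FROM orders_test")
    conn.commit()
    return count

conn = sqlite3.connect(":memory:")
loaded = load_with_test_table(conn, [(1, 19.99), (2, 5.00), (3, 42.50)])
```

The same pattern works in any warehouse: a disposable staging table costs you a few lines of SQL and catches load-time problems (truncation, type coercion, dropped rows) before they reach a table anyone queries.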
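For point 2, a sketch of what checking content, not just volume, might look like. The field name, expected row count and revenue range are invented; in practice the thresholds would come from the analysts who understand the “what” of the data.

```python
# Hypothetical sketch: validate a content metric alongside row volume.
# The revenue_range thresholds are invented for illustration.

def validate_batch(rows, expected_rows, revenue_range):
    """Check both the shape (volume) and the content (a revenue metric)."""
    failures = []
    if len(rows) != expected_rows:
        failures.append(f"volume: got {len(rows)} rows, expected {expected_rows}")
    total_revenue = sum(r["revenue"] for r in rows)
    lo, hi = revenue_range
    if not lo <= total_revenue <= hi:
        failures.append(f"metric: revenue {total_revenue} outside [{lo}, {hi}]")
    return failures

batch = [{"revenue": 100.0}, {"revenue": 250.0}, {"revenue": -9999.0}]
problems = validate_batch(batch, expected_rows=3, revenue_range=(0, 1000))
# The volume check passes, but the revenue metric catches the bad content.
```

A batch can have exactly the right number of rows and still be garbage; a single sanity bound on a critical field is often enough to catch it.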
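And for point 5, a small sketch of checking how far your local environment has drifted from production before you deploy. The production Python version and the package name below are deliberately fake so the mismatches are visible; in practice you’d read the pins from your production requirements file or image.

```python
import sys
import importlib.metadata

# Hypothetical sketch: compare this environment against production pins.
# "some-prod-only-lib" and the "3.999" Python version are invented so
# both checks fail when you run this locally.

def check_parity(prod_python, prod_pins):
    """Report mismatches between this environment and production pins."""
    mismatches = []
    local_python = f"{sys.version_info.major}.{sys.version_info.minor}"
    if local_python != prod_python:
        mismatches.append(f"python: local {local_python} != prod {prod_python}")
    for pkg, pinned in prod_pins.items():
        try:
            local = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            mismatches.append(f"{pkg}: not installed locally (prod pins {pinned})")
            continue
        if local != pinned:
            mismatches.append(f"{pkg}: local {local} != prod pin {pinned}")
    return mismatches

issues = check_parity(prod_python="3.999",
                      prod_pins={"some-prod-only-lib": "2.1.0"})
```

A check like this is a stopgap, not a substitute for actually testing inside a container or staging environment that mirrors production, but it catches the most common “works on my machine” surprises cheaply.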

For an expansion on any of these areas, you can read the piece this was based on, “These 6 Data Engineering Shortcuts Will Burn You In Year 1” published in Pipeline earlier this week.

Thanks for ingesting,

-Zach Quinn

Pipeline To DE

Top data engineering writer on Medium & Senior Data Engineer in media; I use my skills as a former journalist to demystify data science/programming concepts so beginners to professionals can target, land and excel in data-driven roles.

Read more from Pipeline To DE
