Hi fellow data professional!

If you read my note on Tuesday, you’ll know I’m coming off the data engineering week from hell, one that seeped into my personal life and delayed the launch of something cool I was planning to share with you. If you want to know more about that, scroll to the end of this message.

Last week a flagship data source had a major problem, and since it’s within my ownership area, I was the one with the knowledge and responsibility to fix it. I wanted to share the experience to demonstrate effective debugging under pressure and at scale, because in situations like these it’s almost never just one table that fails to load. There are rippling, cascading downstream impacts felt for days or, in this case, a week after.

This downtime was particularly difficult to resolve because the pipeline broke on a Friday, which means anything not fixed gets handed off to the weekend team. I jumped in on Sunday because I wanted to get ahead of the issue before a trip to the New York office, when I was scheduled to be offline for two days. So really, this product could not have broken at a worse time.

Fully restoring data access typically follows a particular trajectory.
Triage

As the name suggests, triaging an issue means “stopping the bleeding”, i.e. solving the immediate problem. When a pipeline is simply not loading, the first step is to turn it off so the failure notifications stop. Then we alert partners on the analyst side of the Business Intelligence umbrella who, in turn, let their stakeholders know that data access will be limited for the foreseeable future.

Once it becomes clear that the issue is out of our control, or “upstream”, we connect with an escalation team at the data vendor and seek immediate restoration options, which just means finding any way possible to get the data. In our case, we downloaded a CSV from a UI, which took hours and produced a multi-GB file that had to be manually cleaned and uploaded to BigQuery.

Debug

Since that kind of manual process is not sustainable, the next step is to debug. We had to solve a code issue, which involved figuring out how to filter out and ultimately reconstruct source data before converting it to a human-readable form. Honestly, cleaning malformed CSVs has never been my strength. It takes a lot of patience and file inspection. However, this is where an LLM can be effectively integrated. While I did use some suggestions to rewrite the Python loader script, the most helpful input from Chat involved writing the inspection snippets. Between writing the snippets and reviewing the screenshots I provided, Chat helped me discover three root causes.
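For a sense of what an inspection snippet can look like, here is a minimal sketch rather than the actual code from this incident. It assumes a comma-delimited file with a header row, and the sample data, column names and the `inspect_csv` helper are all hypothetical. It counts how many fields each row has and flags the rows that don’t match the header.

```python
import csv
import io
from collections import Counter


def inspect_csv(lines, delimiter=","):
    """Return the header width, a tally of row widths, and the rows that deviate."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    counts = Counter()
    bad_rows = []
    for row in reader:
        counts[len(row)] += 1
        if len(row) != expected:
            # reader.line_num is the number of source lines read so far,
            # so it points at the physical line where this row ended.
            bad_rows.append((reader.line_num, len(row)))
    return expected, counts, bad_rows


# Hypothetical sample standing in for the real export.
sample = io.StringIO(
    "id,plan,started_at\n"
    "1,basic,2024-01-01\n"
    "2,premium,annual,2024-01-02\n"  # one field too many
    "3,basic\n"                      # one field too few
)

expected, counts, bad_rows = inspect_csv(sample)
print(f"Header has {expected} fields")
print(f"Row widths seen: {dict(counts)}")
for line_num, width in bad_rows:
    print(f"Line {line_num}: {width} fields")
```

Because the reader streams one row at a time, the same function should work on a multi-GB file by passing an open file handle (opened with `newline=""`) instead of the in-memory sample.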
Once we had determined that collection of root causes, we moved into the “putting back the pieces” phase, aka backfill.

Backfill

Many data warehouses begin with one primary data source that is then joined to supplemental data. This export was the foundation for our subscription data mart. And because the data hadn’t updated, every DAG that includes a check for the export failed. We’ve experienced similarly catastrophic failures before, so we had revised our DAGs to include Jinja-templated timestamps to make backfills easier. Although I had to run backfills across several DAGs, it was more a matter of clearing or marking tasks than manually executing queries. This design made the scramble feel like we sort of knew what we were doing.

If you find yourself in a similar situation, here is what kept me sane and my stakeholders at bay.
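On the templated-timestamp design specifically, here is a rough illustration of why it makes backfills easier. This is not our DAG code and it does not use Airflow or Jinja. It is a standard-library stand-in that swaps a `{{ ds }}` placeholder (the template variable Airflow fills with the run’s logical date in YYYY-MM-DD form) into a query. The table name, column name and dates are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical query; the table and column names are made up for illustration.
QUERY_TEMPLATE = (
    "SELECT * FROM raw.subscription_export "
    "WHERE export_date = '{{ ds }}'"
)


def render(template, logical_date):
    """Stand-in for Jinja rendering: substitute the run's date for {{ ds }}."""
    return template.replace("{{ ds }}", logical_date.isoformat())


def backfill_dates(start, end):
    """Yield each date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Hypothetical outage window: re-running each missed day yields the query
# for that day, with no hand-edited dates.
rendered = [
    render(QUERY_TEMPLATE, day)
    for day in backfill_dates(date(2024, 1, 5), date(2024, 1, 7))
]
for sql in rendered:
    print(sql)
```

In a real DAG the scheduler does the substitution for each run, so clearing a failed task re-executes the same query for that run’s date. That is why the backfill was mostly a matter of clearing or marking tasks.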
This is my attempt at both an explanation and a post-mortem. When I was in data science school, I never heard about anyone’s worst day (or week) at work. It’s important, when considering a discipline, to understand the very real challenges you’re signing up to face. These are the days when you truly earn the paycheck.

And if you’ve scrolled for updates on the release…

New Release Dates

I’ll be making an announcement on Tuesday, 7/7. Want to hear from me sooner? Get early access on 6/30 by clicking here.

Happy debugging and thanks for ingesting,
-Zach Quinn

Medium | LinkedIn | Ebooks
Reaching 20k+ readers on Medium and over 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.