[ETR #104] Master The 3-Step Debugging Framework From A Worst-Case Data Scenario

Hi fellow data professional!

If you read my note on Tuesday, you’ll know I’m coming off the data engineering week from hell. It seeped into my personal life and delayed the launch of something cool I was planning to share with you. If you want to know more about that, scroll to the end of this message.

Last week a flagship data source had a major problem and since it’s within my ownership area, I was the one with the knowledge and responsibility to fix it.

I wanted to share the experience to demonstrate effective debugging under pressure and at scale. Because in situations like these it’s almost never one table that fails to load.

There are rippling, cascading downstream impacts felt for days or, in this case, a week after.

This downtime was particularly difficult to resolve because the pipeline broke on a Friday. That means anything not fixed gets handed off to the weekend team. But I jumped in on Sunday because I wanted to get ahead of the issue before a trip to the New York office, when I was scheduled to be offline for two days.

So really this product could not have broken at a worse time.

Typically, fully restoring data access follows a particular trajectory.

Triage
Debug
Backfill

Triage

As the name suggests, triaging an issue means “stopping the bleeding”, i.e. solving the immediate problem. When a pipeline is simply not loading, the first step is to turn it off so the failure notifications stop. Then we alert partners on the analyst side of the Business Intelligence umbrella who, in turn, let their stakeholders know that data access will be limited for the foreseeable future.

Once it becomes clear that the issue is out of our control, or “upstream”, we connect with an escalation team at the data vendor and seek immediate restoration options, which just means finding any way possible to get the data. In our case, we downloaded a CSV from a UI, which took hours and produced a multi-GB file that had to be manually cleaned and uploaded to BigQuery.

Debug

Since that kind of manual process is not sustainable, it’s time to debug. We had to solve a code issue, which involved figuring out how to filter out and ultimately reconstruct source data before converting it to a human-readable form.

Honestly, cleaning malformed CSVs has never been my strength. It involves a lot of patience and file inspection. However, this is where an LLM can be effectively integrated.

While I did use some suggestions to rewrite the Python loader script, the most helpful input from Chat involved writing the inspection snippets. Between those snippets and the screenshots I provided, Chat helped me discover three root causes.

An export with additional, unexpected fields
The last field contained an enclosed quotation, corrupting the entire chunk (this file loads in batches of 100k)
NUL bytes that seeped into a different part of the export
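
For illustration, here is a minimal sketch of the kind of inspection snippet that can surface all three problems. It is not the script from this incident: the function name, the sample rows, and the expected field count are all hypothetical, and the odd-quote check is only a heuristic (a legitimate multi-line quoted field would also trip it).

```python
import csv
import io
from collections import Counter


def inspect_csv_bytes(raw: bytes, expected_fields: int, encoding: str = "utf-8") -> dict:
    """Report NUL bytes, odd quote counts, and rows with unexpected widths."""
    report = {
        "nul_lines": [],          # physical lines containing a NUL byte
        "odd_quote_lines": [],    # physical lines with an odd number of quotes
        "field_counts": Counter(),
        "bad_width_rows": [],     # parsed rows whose width != expected_fields
    }

    # Pass 1: scan the raw physical lines, before any CSV parsing.
    for line_no, line in enumerate(raw.splitlines(), start=1):
        if b"\x00" in line:
            report["nul_lines"].append(line_no)
        if line.count(b'"') % 2 == 1:
            report["odd_quote_lines"].append(line_no)

    # Pass 2: strip NULs, then count fields per parsed row.
    text = raw.replace(b"\x00", b"").decode(encoding, errors="replace")
    reader = csv.reader(io.StringIO(text))
    try:
        for row_no, row in enumerate(reader, start=1):
            report["field_counts"][len(row)] += 1
            if len(row) != expected_fields:
                report["bad_width_rows"].append(row_no)
    except csv.Error as exc:
        # e.g. a runaway quoted field that exceeds the csv field size limit
        report["parse_error"] = f"line {reader.line_num}: {exc}"
    return report


# Hypothetical sample: one row with an extra field, one NUL, one unclosed quote.
sample = (
    b"id,plan,note\n"
    b"1,basic,ok\n"
    b"2,pro,fine,extra\n"
    b"3,ba\x00sic,ok\n"
    b'4,pro,"unclosed\n'
    b"5,basic,ok\n"
)
report = inspect_csv_bytes(sample, expected_fields=3)
```

Note that the unclosed quote silently swallows the row after it into one field, so the parsed row count comes up short even though the width check passes. That is why the raw-line scan runs separately from the parser. On a multi-GB file you would stream the file line by line instead of holding it all in memory.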

Once we determined that collection of root causes, we moved into the putting-the-pieces-back phase, aka

Backfill

Many data warehouses begin with one primary data source that is then joined to supplemental data. This export was the foundation for this subscription data mart.

And because this data hadn’t updated, dags that include checks for the export all failed.

We’ve experienced similarly catastrophic failures before, so we revised our dags to include Jinja-templated timestamps to facilitate easier backfills.

Although I had to run backfills across several dags, it was more a matter of clearing/marking tasks than manually executing queries.
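
To show why templated timestamps make this easier, here is a rough sketch in plain Python. It is not our actual dag code: the table names and dates are made up, and a simple string replace stands in for Jinja rendering. The `{{ ds }}` placeholder mirrors Airflow's macro for a run's logical date in YYYY-MM-DD form.

```python
from datetime import date, timedelta

# Hypothetical query. Because it is keyed to the run's date rather than
# "today", re-running a past run rebuilds exactly that day's slice.
QUERY_TEMPLATE = (
    "DELETE FROM mart.subscriptions WHERE export_date = '{{ ds }}';\n"
    "INSERT INTO mart.subscriptions\n"
    "SELECT * FROM raw.subscription_export WHERE export_date = '{{ ds }}';"
)


def render(template: str, ds: date) -> str:
    """Stand-in for Jinja rendering: substitute the run's logical date."""
    return template.replace("{{ ds }}", ds.isoformat())


def backfill_dates(start: date, end: date) -> list[date]:
    """Every daily run date from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


# Hypothetical outage window of three days.
runs = backfill_dates(date(2024, 1, 5), date(2024, 1, 7))
rendered = [render(QUERY_TEMPLATE, ds) for ds in runs]
```

Each rendered query deletes and reloads a single day, so running it twice for the same date gives the same result. In a scheduler like Airflow, clearing a past task re-renders the template with that run's original date, which is what turns a backfill into clearing tasks rather than hand-editing queries.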

This design made the scramble feel like we sort of knew what we were doing.

If you find yourself in a similar situation, here is what kept me sane and my stakeholders at bay.

Frequent, public communication; no side conversations, just public bulletins
Visible ownership; there’s a problem and I’m the point person on my team responsible for the outcome
No concrete ETAs; a bit controversial, but we couldn’t guarantee timing beyond ASAP
Preemptive messaging with my boss; I identified and escalated the issue, taking it directly to vendor support, making it less of his problem

This is my attempt at both an explanation and a post-mortem.

When I was in data science school I never heard about anyone’s worst day (or week) at work. It’s important, when considering a discipline, to understand the very real challenges you’re signing up to face.

These are the days when you truly earn the paycheck.

And if you’ve scrolled for updates on the release…

New Release Dates

I’ll be making an announcement on Tuesday, 7/7.

Want to hear from me sooner? Get early access on 6/30 by clicking here.

Happy debugging and thanks for ingesting,

-Zach Quinn

Extract. Transform. Read.

Medium | LinkedIn | Ebooks

[ETR #103] What's Preventing Analysts From Becoming Engineers

[ETR #102] I Fell For A Job Scam So You Don't

[ETR #101] Why The Hottest Data Job Is A Waste Of Time
