[ETR #93] This QA Metric Tricks Most DEs


Hi fellow data professional!

For years, the opening of The Simpsons, specifically Bart writing lines on the chalkboard, has been incredibly relatable to me. Not because I’m up to mischief (none I’ll admit to here, anyway), but because I spend most days writing the same three lines of SQL over and over again.

If you've ever been paranoid about a table's content, you might know what I'm talking about.

It’s the aggregate COUNT(*) grouped by a date field, ordered by date DESC.
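For the curious, those three lines look something like this. The sketch below runs them against an in-memory SQLite table; the `events` table and `event_date` column are made-up stand-ins for whatever you actually query:

```python
import sqlite3

# In-memory stand-in for the warehouse table; the names here are
# hypothetical -- swap in your own table and date field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2024-06-03", "a"), ("2024-06-03", "b"), ("2024-06-04", "c")],
)

# The three lines in question: count rows per date, newest first.
rows = conn.execute(
    """
    SELECT event_date, COUNT(*) AS row_count
    FROM events
    GROUP BY event_date
    ORDER BY event_date DESC
    """
).fetchall()

print(rows)  # [('2024-06-04', 1), ('2024-06-03', 2)]
```

If the top row of that output is today's date with a healthy count, you're (probably) fine.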

The output of that query determines exactly how my day is going to go. If the most recent partition is populated, my phone stays quiet, and I can actually drink my coffee in relative peace (depending on a baby's nap schedule, of course).

If the count is zero or significantly lower than yesterday, I drop everything and start the forensic deep dive into logs and upstream triggers.

In our world, row counts are the "top-line" metric. They signal partition completeness and job success. We use them to trigger anomaly alerts and to validate staging-to-prod migrations.

So row counts are incredibly useful. Until they aren’t.

The trap we often fall into is confusing data volume with data integrity. As a senior engineer, I’ve learned that my real job isn't just moving rows; it's ensuring the data doesn't "look weird."

Financial data is the perfect example of where row counts go to die. Because of the nuances between "booked" revenue (anticipated) and "earned" revenue (actually in the bank), statuses change constantly. You can run a migration where the row counts match perfectly, but the actual metrics have drifted by 15% because your new pipeline correctly captured a status update that your old one missed.

If you only check the volume, you’ll give a thumbs-up to a table that is fundamentally wrong.
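To make the trap concrete, here's a minimal sketch using hypothetical `old_rev` and `new_rev` tables in SQLite: identical row counts, drifted totals. The table names, statuses, and amounts are all invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical old vs. new pipeline outputs for the same partition.
# Same row count, but the new pipeline caught a status flip from
# "booked" to "earned" that changes the amount recognized.
conn.executescript("""
CREATE TABLE old_rev (id INTEGER, status TEXT, amount REAL);
CREATE TABLE new_rev (id INTEGER, status TEXT, amount REAL);
INSERT INTO old_rev VALUES (1, 'booked', 100.0), (2, 'earned', 50.0);
INSERT INTO new_rev VALUES (1, 'earned', 80.0), (2, 'earned', 50.0);
""")

def summarize(table):
    # Interpolating a table name is fine for a QA script you control,
    # but never do this with untrusted input.
    n, total = conn.execute(
        f"SELECT COUNT(*), SUM(amount) FROM {table}"
    ).fetchone()
    return n, total

old_n, old_total = summarize("old_rev")
new_n, new_total = summarize("new_rev")

print(old_n == new_n)          # True  -- row counts give a thumbs-up
print(old_total == new_total)  # False -- the metric drifted anyway
```

The count comparison passes; the metric comparison does not. Checking only the first gives you a false green light.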

The technical "easy button" is to just delete and reload everything every day. But between compute costs and the loss of historic snapshots, that’s rarely the right move.

Instead, the fix is usually interpersonal. I’ve found that the best partner for troubleshooting these "invisible" discrepancies is the analyst at the end of the pipe. They are the experts in the content, while we are the experts in the shape.

An analyst was the one who finally explained the earned-vs-deferred revenue logic to me, which was the only way we figured out why months of data didn't align with the source.

As more of our boilerplate code becomes AI-generated, our value shifts from writing the SELECT statement to confirming the output. Before you merge your next project to production, I'd encourage you to go a little deeper than rowCount.

Review your accounting windows, align on variance tolerances with your stakeholders, and make sure your test environment is actually "clean."
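That tolerance check can be as simple as a relative-difference guard. The 1% default below is purely illustrative; use whatever number your stakeholders actually signed off on:

```python
def within_tolerance(old_value, new_value, rel_tol=0.01):
    """Return True if a migration diff is inside the agreed tolerance.

    rel_tol is the relative variance your stakeholders accepted
    (1% here, as an illustrative default only).
    """
    if old_value == 0:
        # Avoid dividing by zero: only an exact match passes.
        return new_value == 0
    return abs(new_value - old_value) / abs(old_value) <= rel_tol

# A 15% revenue drift blows past a 1% tolerance, even when counts match.
print(within_tolerance(1_000_000, 850_000))  # False
print(within_tolerance(1_000_000, 995_000))  # True
```

A check like this belongs in the same place as your row-count assertion, so the two pass or fail together.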

Go deeper with an example use case, its output, and a more detailed reflection on the pitfalls of counting without analyzing.

Thanks for ingesting,

-Zach

Medium | LinkedIn | Ebooks

Extract. Transform. Read.

Reaching 20k+ readers on Medium and over 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.
