Extract. Transform. Read.

A newsletter from Pipeline

Hi past, present or future data professional!

I once participated in a remote job interview in which the interviewer was on the video call while driving... and smoking.

While that was among my most memorable interview experiences (for the wrong reasons), plenty of others have blended together and faded into the recesses of my mind.

The common denominator, however, was the insistence on asking one question.

The answer you provide can make or break your interview.

The question I heard repeatedly, especially after I presented a project from my portfolio, was: “Where did you get your data?”

It’s an innocent question, but it’s a brilliant way for an interviewer to gauge your resourcefulness.

And while there’s no truly "wrong" answer, I quickly learned there's a definite best answer. The truth is, relying on perfectly clean, pre-packaged data from repositories like Kaggle is a trap. I’m not saying Kaggle is necessarily bad. I mean, I’ve used it myself for school projects. It just isn't always representative of the majority of data sources you'll encounter.

As I got deeper into the field and understood employer expectations, I realized that real-world data is messy, incomplete, and rarely comes in a perfectly formatted CSV. Using a stock dataset doesn’t show a potential employer that you’re ready for the reality of the job; it just demonstrates your ability to use read_csv.

When I started offering responses that showed my ability to source and manipulate data in a novel way, the interviews took a noticeable turn for the better.

Here’s what you should be saying:

“I scraped the data from a website and converted it to a dataframe.”
“I combined an existing dataset with data scraped from a Wikipedia table.”
“I accessed an API and built a pipeline to gather the information.”

These answers signal a crucial skill: you’re not just a data consumer; you’re an aggregator of information. You’re resourceful and you're not afraid of the messiness that accompanies the process of mining real-world data.
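
To make the API answer a little more concrete, here’s a minimal sketch of an extract-transform-load pipeline using only Python’s standard library. The payload, the field names (city, temp_f, observed) and the readings table are hypothetical stand-ins invented for illustration, and the fetch step is a placeholder so the example runs without a network connection. Treat it as one possible shape, not a definitive implementation.

```python
import json
import sqlite3
from typing import Callable, Iterable, Iterator

# Hypothetical payload standing in for an API response; the field names
# ("city", "temp_f", "observed") are made up for illustration.
SAMPLE_RESPONSE = json.dumps({
    "results": [
        {"city": " Orlando ", "temp_f": "91.4", "observed": "2024-07-01"},
        {"city": "Tampa", "temp_f": None, "observed": "2024-07-01"},
        {"city": "orlando", "temp_f": "89.0", "observed": "2024-07-02"},
    ]
})


def extract(fetch: Callable[[], str]) -> list[dict]:
    """Call the API through the injected fetch function and return raw records."""
    return json.loads(fetch())["results"]


def transform(records: list[dict]) -> Iterator[tuple[str, float, str]]:
    """Normalize city names, skip rows without a reading, cast temps to float."""
    for rec in records:
        if rec.get("temp_f") is None:
            continue  # real-world data is incomplete; skip missing readings
        yield rec["city"].strip().title(), float(rec["temp_f"]), rec["observed"]


def load(rows: Iterable[tuple[str, float, str]], conn: sqlite3.Connection) -> int:
    """Write cleaned rows to a SQLite table and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (city TEXT, temp_f REAL, observed TEXT)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


def run_pipeline(fetch: Callable[[], str], conn: sqlite3.Connection) -> int:
    """Chain the three steps: extract, transform, load."""
    return load(transform(extract(fetch)), conn)


conn = sqlite3.connect(":memory:")
row_count = run_pipeline(lambda: SAMPLE_RESPONSE, conn)
print(row_count)
```

In a real project, the lambda would be swapped for a function that requests the API’s URL and returns the response body. Passing the fetch step in as an argument keeps the cleaning logic easy to test against saved responses.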

Creating your own unique dataset (even a small, niche one) demonstrates 3 things to a hiring manager:

You’re comfortable converting messy data into something usable
You’re willing to deviate from "stock" datasets and approach problems with creativity
You have a genuine passion for the field and are invested in the craft of the role

As a bonus, if you can find a dataset that’s relevant to the industry you’re applying to, you’ll also prove that you have domain knowledge, which is truly a rarity among technically inclined candidates.

So, before your next interview, take a look at your portfolio. If it's full of projects using perfectly clean data, consider spending some time creating a new end-to-end build that starts messier.

You don’t have to build a custom data warehouse from scratch. In fact, even a simple project that involves scraping a Wikipedia table with Pandas can demonstrate additional effort that goes beyond downloading and reading a CSV.
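
As a rough sketch of what that simple project might look like, the snippet below reads an HTML table with pandas and cleans up the kind of mess a Wikipedia table tends to carry (footnote markers, comma-formatted numbers). The table contents are made up and inlined so the example runs offline; for a live page you would pass the page URL or its downloaded HTML to read_html instead. Note that read_html needs an HTML parser such as lxml installed alongside pandas.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for a Wikipedia-style table; the contents are made up.
HTML = """
<table class="wikitable">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield[1]</td><td>1,200</td></tr>
  <tr><td>Shelbyville</td><td>850[2]</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(HTML))
df = tables[0]

# Strip footnote markers such as "[1]" from the text column.
df["City"] = df["City"].str.replace(r"\[\d+\]", "", regex=True)

# Strip footnote markers and thousands separators, then cast to integers.
df["Population"] = (
    df["Population"]
    .astype(str)
    .str.replace(r"\[\d+\]", "", regex=True)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(df)
```

The cleanup steps are the part worth talking about in an interview, since they show you dealt with the data as it actually arrived.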

In the end, the best source of data is yourself.

Read the original story here.

Thanks for ingesting,

-Zach Quinn

[ETR #62] I Got The Same ? In 12 Interviews
