Extract. Transform. Read. A newsletter from Pipeline

Hi past, present or future data professional!

Silicon Valley's Gavin Belson once predicted a doomsday data scenario coined "datageddon." The threat of datageddon wasn't the volume of data but, rather, the lack of repositories. Beginning a data engineering project, you might experience the opposite of datageddon: a scenario in which you know exactly how to store data but don't know where to source it.

The problem is, there's no shortage of options. If you're a data science student, you've probably visited Kaggle so much that it haunts your autocorrect. Or you've exhausted data science 101 datasets like BigQuery's public data repositories. As a data novice you likely want something novel, but you might lack the experience, and the free time between firing off job applications, to build a complex web scraping script.

Enter my favorite hidden data repository and your teacher's least favorite academic source: Wikipedia. Embedded in articles with questionable user edits are clean, robust and endlessly scrapable wiki tables, the columnar HTML elements Wikipedia uses to display tabular data. You'll find many Wikipedia articles use tables to represent basic data, like this example, a list of past and present U.S. presidents. Accessing the data within is as easy as inspecting the page source and finding the class "wikitable sortable."

While you could use a Python web scraping library like Beautiful Soup to scrape the headings and iterate through the rows, there's an easier, little-known Pandas method that can directly read HTML elements: read_html. By passing the URL of the article and the class "wikitable sortable" to the "attrs" parameter, it's possible to return the data of this table as a Pandas data frame, as I demonstrated in a previous article. But, as I noted, this method has one fatal flaw: a lack of scalability.
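To make the read_html pattern concrete, here's a minimal sketch. The inline HTML snippet is a hypothetical stand-in for a Wikipedia article page (the miniature presidents table is made-up illustrative data); in practice you would pass the article URL straight to pd.read_html and it would fetch the page for you.

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for a Wikipedia article page; in practice you would
# pass the article URL directly, e.g. pd.read_html("https://en.wikipedia.org/...").
html = """
<table class="wikitable sortable">
  <tr><th>No.</th><th>President</th><th>Term began</th></tr>
  <tr><td>1</td><td>George Washington</td><td>1789</td></tr>
  <tr><td>2</td><td>John Adams</td><td>1797</td></tr>
</table>
"""

# attrs filters the page's tables down to those whose HTML attributes match,
# here Wikipedia's "wikitable sortable" class. read_html returns a list of
# DataFrames, one per matching table.
tables = pd.read_html(StringIO(html), attrs={"class": "wikitable sortable"})
df = tables[0]
print(df)
```

Note that read_html always returns a list, even when only one table matches, so you index into it to get the DataFrame itself.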
While Wikipedia will yield niche and, honestly, really cool data, a single table is usually limited to a few hundred rows, because who's going to scroll through a Wikipedia table with thousands or millions of rows?

After messing with Wikipedia and understanding the structure of not only wikitables, but also the URLs that point to tables and entries, I discovered a hacky solution for scraping as many as 50 pages of open-source encyclopedic gold. And even though the datasets are a bit constrained, there are still lots of possibilities for analysis, covering everything from U.S. Olympic gold medals won in the Winter Games to luxury watches that cost more than a starter home.

While I wouldn't rely on data sourced from a wiki table to power an LLM integration, I would use it as an opportunity to demonstrate to a potential employer that you can source and aggregate niche data. After showcasing a Wikipedia-fueled data analysis or visualization project, you'll leave an audience member asking, "Wiki How did you come up with that?"

Thanks for ingesting,

-Zach Quinn
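The newsletter doesn't spell out the multi-page trick, but one plausible shape for it is a loop: call read_html on each article in a list of URLs and concatenate the resulting frames. The sketch below is an assumption, not the author's actual method, and the inline HTML snippets are hypothetical stand-ins for real article pages so it runs offline.

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-ins for multiple Wikipedia article pages; in practice
# each entry would be an article URL passed directly to pd.read_html.
pages = [
    """<table class="wikitable sortable">
         <tr><th>Item</th><th>Value</th></tr>
         <tr><td>alpha</td><td>1</td></tr>
       </table>""",
    """<table class="wikitable sortable">
         <tr><th>Item</th><th>Value</th></tr>
         <tr><td>beta</td><td>2</td></tr>
       </table>""",
]

# Scrape the wikitable from each page, then stack the frames into one dataset.
frames = [
    pd.read_html(StringIO(html), attrs={"class": "wikitable sortable"})[0]
    for html in pages
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

This only works cleanly when the tables across pages share the same columns; otherwise pd.concat will pad the mismatched columns with NaN.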
Reaching 20k+ readers on Medium and nearly 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.