Extract. Transform. Read. A newsletter from Pipeline

Hi past, present or future data professional!

Silicon Valley's Gavin Belson once predicted a doomsday data scenario coined "datageddon." The threat of datageddon wasn't the volume of data but, rather, the lack of repositories. Beginning a data engineering project, you might experience the opposite of datageddon: a scenario in which you know exactly how to store data but don't know where to source it.

The problem is, there's no shortage of options. If you're a data science student, you've probably visited Kaggle so much that it haunts your autocorrect. Or you've exhausted data science 101 datasets like BigQuery's public data repositories. As a data novice you likely want something novel, but you might lack the experience, and the free time between firing off job applications, to build a complex web scraping script.

Enter my favorite hidden data repository and your teacher's least favorite academic source: Wikipedia. Embedded in articles with questionable user edits are clean, robust and endlessly scrapable wiki tables, the columnar HTML elements Wikipedia uses to display tabular data. You'll find many Wikipedia articles use tables to represent basic data, like this example, a list of past and present U.S. presidents. Accessing the data within is as easy as inspecting the page source and finding the class "wikitable sortable."

While you could use a Python web scraping library like Beautiful Soup to scrape the headings and iterate through the rows, there's an easier, little-known Pandas method that can directly read HTML elements: read_html. By passing the URL of the article and the class "wikitable sortable" to the "attrs" parameter, it's possible to return the data of this table as a Pandas data frame, as I demonstrated in a previous article. But, as I noted, this method has one fatal flaw: a lack of scalability.
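To make the read_html pattern concrete, here's a minimal sketch. The inline HTML snippet is a hypothetical stand-in for a Wikipedia article page (the miniature presidents table is made-up illustrative data); in practice you would pass the article URL straight to pd.read_html and it would fetch the page for you.

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for a Wikipedia article page; in practice you would
# pass the article URL directly, e.g. pd.read_html("https://en.wikipedia.org/...").
html = """
<table class="wikitable sortable">
  <tr><th>No.</th><th>President</th><th>Term began</th></tr>
  <tr><td>1</td><td>George Washington</td><td>1789</td></tr>
  <tr><td>2</td><td>John Adams</td><td>1797</td></tr>
</table>
"""

# attrs filters the page's tables down to those whose HTML attributes match,
# here Wikipedia's "wikitable sortable" class. read_html returns a list of
# DataFrames, one per matching table.
tables = pd.read_html(StringIO(html), attrs={"class": "wikitable sortable"})
df = tables[0]
print(df)
```

Note that read_html always returns a list, even when only one table matches, so you index into it to get the DataFrame itself.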
While Wikipedia will yield niche and, honestly, really cool data, a single table is usually limited to a few hundred rows, because who's going to scroll through a Wikipedia table with thousands or millions of rows?

After messing with Wikipedia and understanding the structure of not only wikitables, but also the URLs that point to tables and entries, I discovered a hacky solution for scraping as many as 50 pages of open-source encyclopedic gold. And even though the datasets are a bit constrained, there are still lots of possibilities for analysis, covering everything from U.S. Olympic gold medals won in the Winter Games to luxury watches that cost more than a starter home.

While I wouldn't rely on data sourced from a wiki table to power an LLM integration, I would use it as an opportunity to demonstrate to a potential employer that you can source and aggregate niche data. After showcasing a Wikipedia-fueled data analysis or visualization project, you'll leave an audience member asking, "Wiki How did you come up with that?"

Thanks for ingesting,

-Zach Quinn
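The newsletter doesn't spell out the multi-page trick, but one plausible shape for it is a loop: call read_html on each article in a list of URLs and concatenate the resulting frames. The sketch below is an assumption, not the author's actual method, and the inline HTML snippets are hypothetical stand-ins for real article pages so it runs offline.

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-ins for multiple Wikipedia article pages; in practice
# each entry would be an article URL passed directly to pd.read_html.
pages = [
    """<table class="wikitable sortable">
         <tr><th>Item</th><th>Value</th></tr>
         <tr><td>alpha</td><td>1</td></tr>
       </table>""",
    """<table class="wikitable sortable">
         <tr><th>Item</th><th>Value</th></tr>
         <tr><td>beta</td><td>2</td></tr>
       </table>""",
]

# Scrape the wikitable from each page, then stack the frames into one dataset.
frames = [
    pd.read_html(StringIO(html), attrs={"class": "wikitable sortable"})[0]
    for html in pages
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

This only works cleanly when the tables across pages share the same columns; otherwise pd.concat will pad the mismatched columns with NaN.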
Reaching 20k+ readers on Medium and nearly 3k learners by email, I draw on my 4 years of experience as a Senior Data Engineer to demystify data science, cloud and programming concepts while sharing job hunt strategies so you can land and excel in data-driven roles. Subscribe for 500 words of actionable advice every Thursday.