Extract. Transform. Read.
A newsletter from Pipeline

Hi past, present or future data professional!

When you apply for data analysis, data engineering or data science jobs, you likely consider factors like company name, culture and compensation. Caught up in the excitement of a fresh opportunity or a compelling offer, it's easy to neglect an important part of your day-to-day reality in a new role: what stage of data maturity the organization is in.

If you're looking for experience building something new from the ground up, you likely won't find it at a company with years-old, established cloud infrastructure. If you're inexperienced, you might also feel lost at a company that is still conceptualizing how it will establish and scale its data infrastructure.

While I personally arrived at a team and organization in its mid-life stage, I've had opportunities to discuss, examine and advise those who are considering how they can make an impact at an earlier-stage company, in both full-time and contract roles. This compelled me, after a transatlantic flight, to compile a framework you can use to conceptualize anything from an in-house data solution to full-fledged infrastructure.

Phase 1: Discovery - Extensive, purposeful requirements gathering to make sure you are providing a solution and, more importantly, a service to an end user.

Phase 2: Design - You can't begin a journey or a complex technical build without a road map; take time to make a wish list of must-have data sources and sketch your architecture before writing line 1 of code.

Phase 3: Ingestion - Build your pipelines according to best practices with a keen eye on cost and consumption; expect this to take 6-12 months depending on your work situation (see the ingestion sketch at the end of this issue).

Phase 4: Downstream Build - Going hand-in-hand with requirements gathering, consider how your target audience will use what you've built; might it be better to simplify or aggregate data sources in something like a view? (See the view sketch at the end of this issue.)

Phase 5: Quality Assurance and Ongoing Tasks - Even though your pipelines and dashboards will be automated initially, nothing in data engineering is 100% automated. Components will break. You'll be expected to fix them and ensure it doesn't happen again (see the quality-check sketch at the end of this issue).

These 5 phases aren't meant to be strict rules for building data infra. But they should get you thinking about how to build something purposefully so you can spend your time dealing with angry code, not stakeholders.

Dive into the framework here.

Here are this week's links:
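Before signing off, a few minimal sketches to make Phases 3-5 concrete. First, Phase 3: a bare-bones ingestion job that pulls records from a source and lands them as a date-partitioned raw file before any transformation. Everything here is hypothetical; the endpoint, field names and landing path are placeholders, not part of the framework itself.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

# Hypothetical source endpoint; swap in whatever API or file drop you actually ingest from.
SOURCE_URL = "https://api.example.com/v1/orders"
LANDING_DIR = Path("landing/orders")


def extract(url: str) -> list[dict]:
    """Pull raw records from the source, failing loudly if the request errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def load(records: list[dict], landing_dir: Path) -> Path:
    """Write raw records to a date-partitioned landing file (cheap, replayable storage)."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target = landing_dir / f"orders_{run_date}.json"
    target.write_text(json.dumps(records))
    return target


if __name__ == "__main__":
    raw = extract(SOURCE_URL)
    path = load(raw, LANDING_DIR)
    print(f"Ingested {len(raw)} records to {path}")
```

Keeping the raw landing step separate from any transformation is what makes reruns cheap when something breaks downstream.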
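For Phase 4, a view is one of the cheapest ways to hand downstream users a simplified, pre-aggregated slice of raw data. The sketch below uses an in-memory SQLite database and made-up order data purely for illustration; in practice you would create the view in your warehouse.

```python
import sqlite3

# In-memory database stands in for your warehouse; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL, order_date TEXT);

    INSERT INTO raw_orders VALUES
        (1, 101, 25.00, '2024-01-02'),
        (2, 101, 40.00, '2024-01-15'),
        (3, 202, 15.50, '2024-01-20');

    -- A view spares downstream users from rewriting the same aggregation in every dashboard.
    CREATE VIEW customer_monthly_spend AS
    SELECT
        customer_id,
        strftime('%Y-%m', order_date) AS order_month,
        SUM(amount)                   AS total_spend,
        COUNT(*)                      AS order_count
    FROM raw_orders
    GROUP BY customer_id, order_month;
    """
)

for row in conn.execute("SELECT * FROM customer_monthly_spend ORDER BY customer_id"):
    print(row)
```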
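And for Phase 5, even a handful of scripted checks beats hearing from a stakeholder that a dashboard is empty. This sketch assumes the landing file produced by the ingestion example above; the file path and required fields are, again, placeholders.

```python
import json
from pathlib import Path

# Hypothetical landing file produced by the ingestion sketch above.
LANDING_FILE = Path("landing/orders/orders_2024-01-31.json")
REQUIRED_FIELDS = {"order_id", "customer_id", "amount"}


def run_quality_checks(path: Path) -> list[str]:
    """Return a list of human-readable failures; an empty list means the load looks healthy."""
    failures = []

    if not path.exists():
        return [f"Expected landing file is missing: {path}"]

    records = json.loads(path.read_text())

    if not records:
        failures.append("Landing file is empty; the upstream source may have changed.")

    for i, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            failures.append(f"Record {i} is missing fields: {sorted(missing)}")

    return failures


if __name__ == "__main__":
    problems = run_quality_checks(LANDING_FILE)
    if problems:
        # In production this is where you'd alert yourself (Slack, email, pager, etc.).
        print("\n".join(problems))
    else:
        print("All checks passed.")
```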
Until next time–thanks for ingesting,
-Zach Quinn
Top data engineering writer on Medium & Senior Data Engineer in media; I use my skills as a former journalist to demystify data science/programming concepts so beginners to professionals can target, land and excel in data-driven roles.