Extract. Transform. Read. A newsletter from Pipeline

For a STEM discipline, data engineering involves a lot of abstraction, evident in everything from temporary SQL views to complex, multi-task Airflow DAGs. Perhaps most abstract of all is containerization: the process of running an application in a clean, standalone environment. That's the simplest definition I can provide.

Since neither of us has all day, I won't get too into the weeds on containerization, but I will offer a brief explanation followed by some best practices.

If my simple definition doesn't provide enough detail, consider this example. One afternoon, while on a walk, I explained to my non-technical (but very intelligent) wife that running a container from an image on an infrastructure layer like a virtual machine is like setting up a computer with an operating system that contains only what is minimally necessary to run an application. In our example, the application was a game from her childhood she wanted, theoretically, to run. The instructions, including installation of the game and the OS needed to run it, would be the container image.

If you're using a service like Docker, the spark that jumpstarts the engine of that infrastructure is the Dockerfile, which contains detailed instructions in the form of one-word commands like FROM, COPY, RUN and CMD.
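Continuing the game example, a minimal Dockerfile might look like the sketch below. The base image, paths and package are hypothetical placeholders, not a definitive recipe:

```dockerfile
# Start from a minimal base image (hypothetical choice)
FROM debian:stable-slim

# Copy the game's files into the image (placeholder path)
COPY ./game /opt/game

# Install only what the game needs to run (placeholder dependency)
RUN apt-get update && apt-get install -y libsdl2-2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Command executed when the container starts
CMD ["/opt/game/start"]
```

From a directory containing this Dockerfile, `docker build -t childhood-game .` would assemble the image and `docker run childhood-game` would launch it as a container.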
Like other tech concepts, employers' desire for candidates knowledgeable about containerization shows up in job descriptions as buzzwords like Kubernetes (cluster management) and Docker (the industry standard for creating, maintaining and running containers and images). To stand out as a container-izer, definitely take time to learn the quirks of management services like Docker, but also:
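One of those quirks worth practicing is passing configuration, such as API keys, into a container at run time instead of baking it into the image. A sketch using the Docker CLI, with a hypothetical image name and variable:

```shell
# Inject a single secret as an environment variable at run time
docker run -e API_KEY="$API_KEY" my-pipeline-image

# Or keep several variables in a local file that never enters the image
docker run --env-file ./prod.env my-pipeline-image
```

The advantage of either approach is that the image itself stays secret-free, so it can be shared or pushed to a registry without leaking credentials.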
Obviously, Docker and containerization go well beyond the scope of this brief overview. One of the challenges I faced was learning how to properly inject environment variables containing API keys into a container at run time. Assuming others have faced the same issue, I wrote about how to authenticate to GCP when running a Docker image. Thanks for ingesting, -Zach Quinn
Top data engineering writer on Medium & Senior Data Engineer in media; I use my skills as a former journalist to demystify data science/programming concepts so everyone from beginners to professionals can target, land and excel in data-driven roles.