Extract. Transform. Read. A newsletter from Pipeline

For a STEM discipline, data engineering involves a lot of abstraction, evident in everything from temporary SQL views to complex, multi-task Airflow DAGs. Perhaps most abstract of all is containerization: the process of running an application in a clean, standalone environment. That's the simplest definition I can provide.

Since neither of us has all day, I won't get too into the weeds on containerization, but I will offer a brief explanation followed by some best practices.

If my simple definition doesn't provide enough detail, consider this example. One afternoon, while on a walk, I explained to my non-technical (but very intelligent) wife that running a container from an image on an infrastructure layer like a virtual machine is like setting up a computer with an operating system that contains only what is minimally necessary to run an application. In our example, the application was a game from her childhood she wanted, theoretically, to run. The instructions, including installation of the game and the OS needed to run it, would be the container image.

If you're using a service like Docker, the spark that jumpstarts the engine of that infrastructure is the Dockerfile, which contains detailed instructions in the form of one-word commands like FROM, COPY, RUN and CMD.
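Continuing the game example, a minimal Dockerfile might look like the sketch below. The base image, paths and package are hypothetical placeholders, not a definitive recipe:

```dockerfile
# Start from a minimal base image (hypothetical choice)
FROM debian:stable-slim

# Copy the game's files into the image (placeholder path)
COPY ./game /opt/game

# Install only what the game needs to run (placeholder dependency)
RUN apt-get update && apt-get install -y libsdl2-2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Command executed when the container starts
CMD ["/opt/game/start"]
```

From a directory containing this Dockerfile, `docker build -t childhood-game .` would assemble the image and `docker run childhood-game` would launch it as a container.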
Like other tech concepts, employers' desire for candidates knowledgeable about containerization shows up in job descriptions as buzzwords like Kubernetes (cluster management) and Docker (the industry standard for creating, maintaining and running containers and images). To stand out as a container-izer, definitely take time to learn the quirks of management services like Docker, but also:
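One of those quirks worth practicing is passing configuration, such as API keys, into a container at run time instead of baking it into the image. A sketch using the Docker CLI, with a hypothetical image name and variable:

```shell
# Inject a single secret as an environment variable at run time
docker run -e API_KEY="$API_KEY" my-pipeline-image

# Or keep several variables in a local file that never enters the image
docker run --env-file ./prod.env my-pipeline-image
```

The advantage of either approach is that the image itself stays secret-free, so it can be shared or pushed to a registry without leaking credentials.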
Obviously, Docker and containerization go well beyond the scope of this brief overview. One of the challenges I faced was learning how to properly inject environment variables containing API keys into a container at run time. Assuming others have faced the same issue, I wrote about how to authenticate to GCP when running a Docker image. Thanks for ingesting, -Zach Quinn
Top data engineering writer on Medium & Senior Data Engineer in media; I use my skills as a former journalist to demystify data science/programming concepts so everyone from beginners to professionals can target, land and excel in data-driven roles.