Getting results from Big Data without the Big Infrastructure problem: Cloud + Docker + Kubernetes

Imagine what would happen if, every day at lunchtime, you had to consciously coordinate every step of digestion: breaking down the food in your stomach, pushing it through your intestines, and telling yourself to stop feeling hungry once you have eaten. You would spend the entire day coordinating your digestive system! Eating is a complex process, but thankfully most of its steps are automated and we don’t have to coordinate anything consciously. So why not automate the coordination of software in distributed computing?

Distributed computing used to be very hectic. Setting up your own computing infrastructure often meant relying on software from various vendors, each of which required its own configuration files and variable naming schemes. This complexity often led to misconfigured systems. Fortunately, the field has improved dramatically over the past few years.

The first innovation that insulated the user from the bulk of the compute infrastructure complexity was probably cloud computing. The cloud hides the details of OS and hardware maintenance and offers a consistent view of the resources, so you don’t need to know the nitty-gritty details of the cluster. For example, I usually don’t give a damn whether the router is from Cisco or not; I just want the computers to be connected together. I don’t want to maintain the hardware either. Most cloud providers offer a range of OS images and hardware that can be spun up with a few clicks, which solves the problem of a shared environment that can’t accommodate everyone’s needs.

The recent wave of Dockerization has pushed software containerization up a notch. Docker offers many of the benefits of a traditional virtual machine without the resource strain. It blew my mind when I first saw how fast I could download an Ubuntu image and run it in Docker (almost) like a regular software binary, and the image was only on the order of GBs. Software containerization comes with many benefits. For example, even hardcoding paths becomes a much more forgivable sin in a containerized environment, as each container has its own file system, so the same code can run in several copies under the same user without the paths colliding. Let’s say you have a piece of software that allows only one process per OS, and you want to run multiple copies on the same node. Docker helps you by tricking each process into thinking that it is running in a separate OS. This also helps in bioinformatics research, where we often bump into packages that can no longer be run. The inconvenient reality is that most packages stop being maintained when the grant runs out or the grad student is gone. By running such packages inside a compatible Docker image, we can avoid having to track down compatible statically linked libraries and recompile after every change in computer infrastructure.
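To make the "multiple copies on the same node" scenario a bit more concrete, here is a minimal sketch in Python that only builds the `docker run` command lines, one per container, without executing anything. The image name `example/one-per-os-tool:1.0` and the `worker` name prefix are made up for illustration; the flags (`--detach`, `--rm`, `--name`) are standard Docker CLI options.

```python
import shlex


def docker_run_commands(image, copies, name_prefix="worker"):
    """Return one `docker run` argument list per isolated copy of the tool."""
    commands = []
    for i in range(copies):
        commands.append([
            "docker", "run",
            "--detach",                      # run the container in the background
            "--rm",                          # clean up the container when it exits
            "--name", f"{name_prefix}-{i}",  # give each copy a unique name
            image,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical image name; substitute the image that wraps your own tool.
    for cmd in docker_run_commands("example/one-per-os-tool:1.0", copies=3):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell or handed to `subprocess.run`. Every container gets its own file system and process table, so each copy of the tool believes it is the only one running.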

Kubernetes further insulates the user from the complexity of the cloud by simplifying cluster creation, maintenance, and monitoring, and by smoothing over the differences between cloud platforms. A single Kubernetes Operations (kops) command will create and start the cluster, and another will delete it. Configuring AWS directly, for example, can create many mental barriers: most users don’t really want to deal with availability zones, virtual private networks, or security groups. The annoying part about all of the cloud platforms is that each likes to use its own vocabulary when defining its services. For example, AWS has hundreds of glossary terms that start with an A (https://docs.aws.amazon.com/general/latest/gr/glos-chap.html), and Google Cloud has its own version of the glossary (https://developers.google.com/custom-search/docs/glossary). Kubernetes cuts through the crap and offers a consistent view regardless of which cloud you or your boss chose. This consistent view across cloud platforms is key to lowering the barrier for developers who want to ship software that deploys on any cloud. The part I am looking forward to the most in Kubernetes is integration with Helm charts, which allow developers to deploy software onto a cluster with a single installation command. When I tried it, I was pretty shocked that it took only a week to figure out how to set up JupyterHub with autoscaling, spot instances, and file storage (GitHub: https://github.com/brianyiktaktsui/NotebookForBlogs/blob/master/CreateKuberneteCluster_andInstall.ipynb).
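As a rough illustration of the "one line to create, one line to delete, one line to install" workflow, the sketch below builds the corresponding command lines in Python without running them. It is not taken from the linked notebook. The cluster name, S3 state bucket, zones, release name, namespace, and `config.yaml` are placeholders I made up, and the Helm command assumes Helm 3 with the JupyterHub chart repository already added under the alias `jupyterhub`.

```python
import shlex


def kops_create_cluster(name, state, zones, node_count=2):
    """Build the kops command that creates and starts a cluster."""
    return [
        "kops", "create", "cluster",
        f"--name={name}",
        f"--state={state}",              # bucket where kops keeps cluster state
        f"--zones={','.join(zones)}",
        f"--node-count={node_count}",
        "--yes",                         # apply immediately instead of previewing
    ]


def kops_delete_cluster(name, state):
    """Build the kops command that tears the cluster down again."""
    return ["kops", "delete", "cluster", f"--name={name}", f"--state={state}", "--yes"]


def helm_install(release, chart, namespace, values_file):
    """Build a Helm 3 command that installs (or upgrades) a chart."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",
        "--values", values_file,
    ]


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    name, state = "demo.k8s.local", "s3://my-kops-state-bucket"
    steps = [
        kops_create_cluster(name, state, ["us-west-2a"]),
        helm_install("jhub", "jupyterhub/jupyterhub", "jhub", "config.yaml"),
        kops_delete_cluster(name, state),
    ]
    for cmd in steps:
        print(shlex.join(cmd))
```

Options such as autoscaling, spot instances, and storage would live in the cluster spec and in `config.yaml` rather than on the command line, which is why the commands themselves stay this short.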

Another aspect I love about Kubernetes is the dashboard. The AWS console reminds me of an old plane cockpit (bottom figure, left), while the Kubernetes dashboard reminds me of the F-35 cockpit (bottom figure, right). It is more concise, has fewer knobs, and gets me where I want to go much more quickly.