The recurrent question in the data-intensive workplace often revolves around which computing infrastructure to use. In the past four years as a bioinformatics Ph.D. student, I have both received and offered solicited and unsolicited advice regarding computing infrastructures using my prior experience in high-performance computing lab and current expertise in data analytics. This blog post covers my experience in using private computing infrastructure as compared to adopting the cloud in bioinformatics/ data analytics, and the thoughts on the advantages and disadvantages.
The major challenge I found was that: The decision-making process of deciding between buying computing infrastructure versus adopting the cloud is very much like buying a car versus renting a car. The main challenge is the unpredictability of user usage pattern. The moment I walked into the car dealer, all I saw were the shiny cars with a seemingly affordable price tag, and all I had in my mind was to get the car now so that I don’t have to carry groceries back to grad housing for two miles whenever I go to Ralphs. Only till all the bills come then I know I am eating ramen for the next couple weeks and regret why the heck I bought the car. The immediate need of a car rushed me into the purchase without thinking about how often do I need it, and what kind of car do I need in the future.
It is the same game for finding the efficient and cost-effective computing infrastructure. The main problem in the story is that I am a poor graduate student having no money and I need to better utilize the limited financial resources to accomplish this goal. Fast completion time for the whatever demanding task requires expensive hardware, and like Ferrari, fast and powerful computing infrastructure doesn’t come cheap.
What are the main components in computing
Most machines have different RAM, hard disk, CPU, GPU availability. Like cars, certain vehicles cater to a specific group of drivers (Truck if you are loading) and there are types of cars that are close to one size fit all (Honda civic). Of course, there are also people love driving a pickup truck like they need to move every day, burning the double amount of gas without any extra values. Anyway, the point is that a misjudgment on resource usage can cause you extra without gaining more benefits.
For example, identifying the appropriate RAM usage is one big thing, as swapping with disk can slow down the job turn around time significantly. But if you know are going to swap, you might want to have a machine with SSD hard disk instead of HDD. None the less, most machine learning, analytic and bioinformatics programs run within <10GB RAM and some simple memory profiling using unix command memusg will tell you what needs to happen.
Renting (Cloud) vs. buying compute infrastructure:
- The cloud is more like renting a car by the hour, achieving immediate goals without commitment. For example, you want a fancy sports car to impress the girl tonight, and you went on Toru (car renting app) and found that renting a Porsche 911 for one night isn’t too bad, it’s only $200. However, after the first date, the girl expects you to drive her around in the Porsche 911 every day, and after a year, you found out that it cost you $73,000 which can already buy you that car already. None the less, the AWS spot instances, cheap availability of idle computing power is slowly changing the game. My friend Ted Liefiled from Jill Mesirov Lab told me that they were able to achieve close to perfect availability with 10X cost reduction using AWS spot instances and 99% availability, where the only time they struggle with availability is during the online shopping seasons.
- Cloud computing enables simple and yet diverse configuration of security, network, compute and storage instances. My experience of using the AWS as compared to traditional SGE and TORQUE computing ecosystem here is that AWS is highly configurable in both how the computing nodes behave together and within itself. For example, I could easily set up a secure virtual private network and file storage in a day, as opposed to negotiating for weeks with the supercomputer center on the logistic of moving the computing nodes in a private network.
- Deep Learning Amazon Machine Image + p3.16xlarge = deep learning magic. For machine learning tasks, the AWS offer instances with up to 16 GPUs and images with the CUDA and common data analytical python libraries preinstalled. Those beefy machines alleviate the need for writing distributed code which can take weeks to months, and this turns out to be a huge advantage. Also, the environments issues became much more straightforward. For example, the CUDA static object libraries have versioning extension which screwed up the dependencies when I tried installing tensorflow-gpu on shared computing infrastructures. Moreover, all the computing infrastructures I have used so far hesitate to modify the computing environment, which is often the bottleneck in the development process.
- Private compute infrastructure maintenance is expensive: During my years in an HPC (High-performance computing) lab, we used to maintain the cluster ourselves, and the cluster was hosted within the office. And whenever someone submits a job to the cluster we can feel the heat wave from the cluster. And when we tried to be smart and save on the all the compute infrastructure except the CPUs and ended have a ghetto way to stack on machines without much consideration for the heat dissipation. The frequent death of the main translates into frequent maintenance, which was a huge headache.
Finally, check your code before checking the hardware: Aside from computing infrastructure, the other significant problem towards efficient computing in bioinformatics is the lack of algorithm/software knowledge when people first come in. In the car analogy, if the person wants to go from point A to point B in the shortest amount of time, doesn’t matter how fast the car is if the driver thinks the steering wheel is a gaming controller and the volume up button will propel the car forward. I have come across some bioinformaticians that have a minimal amount of algorithm knowledge, who thought without any software configuration/optimization, simply running on more machines will speed up the process. I have also seen many published bioinformatics software that could be hundreds of lines but somehow they ended up a hundred thousand lines. I am not trying to say we should stop people who didn’t ace all their algorithms to write code, as I do believe that everyone can learn and I have seen many biologists that could code fantastically. My argument is just the field could save so much time if most people could learn some basic linear algebra and algorithms before they start playing around with Big Data. For example, in the case of table IDs mapping, could be done with a simple sparse matrix multiplication using an existing optimized matrix arithmetic library, instead of writing distributed python to loop over each ID.
These are just my opinions, hope you learned something, feel free to leave comments below. 🙂