All right. Good morning, everybody. I'm George Roth. I'm going to talk to you about the state of containers in scientific computing, and why we're doing everything differently than anybody else. I work for NERSC, which is the primary scientific computing facility of the Office of Science of the US Department of Energy. We have two supercomputers, named Cori and Edison — both of them Cray machines — and three smaller clusters. All in all that's nearly a million cores and around 50 petabytes of storage at varying speeds. We serve more than 6,000 scientists at NERSC, and all of the science that happens there is open. The fields cover nearly everything, from astrophysics to high-energy physics to genomics, so jobs that span 8,000 nodes do get run.

So, can I get a quick show of hands? Who of you knows what HPC is? That's more than I thought. Who of you knows what Docker is?

HPC in a nutshell is: you have a large number of compute nodes connected by a high-speed network, which is usually something like InfiniBand or something highly proprietary like Aries. You access data stored in a parallel file system. You run the whole thing as a shared resource, and all of it is orchestrated by a workload manager — in our case that's Slurm.

So, one question is: what's the hardest problem in scientific computing? Surprisingly, the answer to that might be installing software. The way it usually works is that you have a stack that's provided by the center, and you use a modules tool that can load different versions of the software — the selection being Lmod, the classic Environment Modules, and Modules 4. This is a process that's usually error-prone, because scientific software is surprisingly hard to build. It's a slow process, in the sense that when you open a ticket with your center to, for example, build TensorFlow, that's going to take weeks to months. And it makes the stacks unique: if you use the software stack at one HPC center and then go to an HPC center an hour away, it's going to look very different. And of course, that's not portable. What that leads to is users piling up software in home directories that depends on the module system, which makes it even harder to port.
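Concretely, a typical session with such a module stack looks something like this — the module names and versions here are made up for illustration:

```sh
# Discover what the center provides, then swap in the versions you need.
module avail                # list the software stack provided by the center
module load gcc/7.3.0       # center-built compiler (version is illustrative)
module load openmpi/3.0.0   # MPI build tuned for the local interconnect
module list                 # show what is currently loaded
# The catch: these names and versions differ from center to center,
# so build scripts written against them do not travel well.
```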
And then Docker came, right? Docker was an interesting solution to the problem for us, because it made everything simple, portable, and relatively reproducible, and it leveraged relatively stable kernel APIs like namespaces and cgroups. And I do say relatively stable, because if you've used Docker on CentOS 7, you might have run into a few bugs. We of course wanted to use it. And then we cried for a bit, because it was not a good fit for HPC. We really would have liked to use Docker at the time — and that's not to shame Docker, it's just aimed at another user base. For us, it's very hard to install the Docker daemon on a shared machine, because it's basically root-equivalent. And as HPC people are quite puristic, you're not going to be able to run a daemon on a compute node, nearly ever: if you have a tightly coupled application and there's jitter on your compute nodes, that affects performance negatively, and we don't do that.

So we thought: if you take Docker, or containers as a whole, what features do we want out of those systems? We wanted a way to run Docker containers on Cori and Edison, very specialized Cray systems. We didn't need the fancy stuff like overlay networks, because, to be frank, if you run a proprietary interconnect, the chances of that ever working are very slim. We didn't need plugins like storage provisioning via volume plugins, because we have a parallel file system — that's handled for us. Swarm or Kubernetes — the equivalent for us would be Slurm. Optimally, we do not want to run a daemon on the compute nodes. We want it to be secure — that depends on your definition of secure — and scalable, which means: if you take a container, put it on 10,000 nodes and spin them all up at the same time, that should work. And as a bonus, it should work on old kernels, like 2.6, for example.

And apparently, the people at NERSC were not the only ones with that idea. In early 2015, Shifter popped up; at the end of 2015, Singularity; and a bit later, Charliecloud. Some of you might say udocker is not on that slide — that is very correct, because until yesterday I didn't know that it existed, and I'm sorry for that. As for governance: Shifter is developed at NERSC. Charliecloud comes from Los Alamos National Lab, which has even more stringent security requirements than we have. And Singularity started at Lawrence Berkeley National Lab as well — NERSC is part of Lawrence Berkeley National Lab — but it's now run by Sylabs, who do commercial support for it.

The mechanisms those tools use to construct a container and ensure you can do the things you want to do in a container: setuid for Shifter; Charliecloud, because setuid is considered a security risk, uses exclusively user namespaces; and Singularity uses a mix of both. The way those tools work is: Shifter as well as Charliecloud leverage Docker and the Dockerfile to build a Docker container. Then you pull it onto your system, convert it to a native format better suited to this massively parallel use case, and run it. Singularity works nearly the same way, except that while Singularity can import Docker images, they have their own recipe format — which resembles RPM spec files a bit — to satisfy the needs of scientific users a bit more.

The image formats are interesting: Shifter and recent versions of Singularity use SquashFS. You take a Docker image, you flatten it, you put it into SquashFS. When you run the container, you loop-mount the SquashFS, which has a very nice side effect. If you have a large parallel file system and you start 10,000 jobs at the same time, you're going to hit the metadata server for every single file operation in every invocation of the container. If you loop-mount a large image instead, all of the metadata operations are local to the node, which makes it scalable. That's why we do it this way. Charliecloud has a bit of a different approach: it uses tar files, unpacks them to a RAM disk on the local node, and runs from there. The reason for that is that Charliecloud is pretty lightweight. If you compare those tools by code size — which is a very dubious metric, I know — Shifter is 20,000 lines of code, Singularity is also 20,000, and Charliecloud weighs in at about 1,000.

The reason why you can build a container with that little code is that it's not rocket science, really. These 13 lines of shell script construct a container that you can run on a recent Linux distribution. The way it works is: you bootstrap an OS — in this case Debian, you can use anything you like — into a folder called containerfs. You set up namespaces, bind-mount the folder onto itself, and make the mount private, which means that if you unmount things in the container, that doesn't propagate to your host operating system — so, for example, we mount proc, sys, tmp, and run, and if you unmount them, you don't unmount them on the host. Then you pivot_root into the container, exec bash, and that's it. In a nutshell, that's what a container is.
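A sketch along those lines — run as root; the paths and the Debian suite here are illustrative, not the exact slide:

```sh
#!/bin/sh
# Minimal container construction, as described above (sketch, needs root).
set -e
mkdir -p containerfs
debootstrap stable containerfs           # bootstrap a Debian userland into a folder

unshare --mount --pid --fork sh -e -c '
  mount --bind containerfs containerfs   # turn the folder into its own mount point
  mount --make-private containerfs       # unmounts inside will not propagate to the host
  mount -t proc  proc containerfs/proc
  mount -t sysfs sys  containerfs/sys
  mount -t tmpfs tmp  containerfs/tmp
  mount -t tmpfs run  containerfs/run
  cd containerfs
  pivot_root . mnt                       # swap the root filesystem for containerfs
  cd /
  umount -l /mnt                         # detach the old root
  exec bash                              # and we are "inside" the container
'
```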
And I think this is important to understand: you're still using the host kernel, of course, and it's relatively uncontained. That might be a bad thing in general, but in this case it's actually quite beneficial, because if you have a tightly coupled application, with several ranks of it running on a single node, you share memory — so we don't actually need the containment.

HPC is a field where you usually get libraries provided by the vendor: if you have a proprietary high-speed interconnect and you use some kind of message-passing library, you're going to get it from your vendor, and you should use it, because it's highly optimized and it does change the performance of the things you run. Another use case for accessing hardware directly would be GPUs, of course — a problem that I guess many of you have. If you want to do machine learning, for example, which is a popular topic at the moment, you somehow need to access the GPU, which violates this whole containment thought. You're basically breaking the wall and mapping stuff from the host into your container, which could be a problem — it depends on what you're looking for, right?

What you usually do, using GPUs as an example: you bind the GPU device — /dev/nvidia0; I'm just going to use NVIDIA here — into the container, then you inject the host libraries. So you take libcuda.so, which is part of the driver, and inject it into the container. Either you do that manually — which is relatively hard to do — or you use libnvidia-container, which conveniently automates the process: you tell it which namespace you want to make GPU-ready, and it does that for you. A considerable downside in this case is that, because you're propagating whatever you have on the host into the container, you need to be ABI-compatible. If for whatever reason NVIDIA breaks that compatibility in three years — by accident or not, or for a good reason, who knows — your container is not going to be able to run anymore. So that's a bit of a downside.
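To make the manual variant concrete, a rough sketch of the injection — my own illustration; device nodes and library paths vary by distribution and driver version:

```sh
# Bind the GPU device nodes from the host into the container rootfs.
CFS=containerfs
for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm; do
  touch "$CFS$dev"                       # bind mounts need an existing target
  mount --bind "$dev" "$CFS$dev"
done

# Inject the driver's userspace libraries. libcuda.so ships with the driver,
# not with the CUDA toolkit, so it must match the host kernel module --
# exactly the ABI coupling mentioned above.
mkdir -p "$CFS/usr/lib/host-nvidia"
cp -a /usr/lib/x86_64-linux-gnu/libcuda.so*      "$CFS/usr/lib/host-nvidia/"
cp -a /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* "$CFS/usr/lib/host-nvidia/"
echo /usr/lib/host-nvidia > "$CFS/etc/ld.so.conf.d/host-nvidia.conf"
chroot "$CFS" ldconfig                   # make the loader inside see them
```

libnvidia-container does essentially this enumeration and injection for you, plus the bookkeeping this sketch skips.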
Another downside — many of you will ask: who does static linking anymore? HPC does. Because if you run a program and that program loads, let's say, 100 shared libraries, and you do that on 10,000 nodes in parallel, that is a quite expensive operation. And static linking doesn't work with this injection, because you rely on the loader to pull those libraries into your guest program. Also a thing that not many of you will run into: if you take a modern Linux distribution — let's say openSUSE 42 — and you try to run it on a 2.6 kernel, it's not going to work by default, because the glibc in modern Linux needs at least a 3.0 kernel. You can of course argue: who is going to use a 2.6 kernel anymore? But the other argument you can make is: in five to ten years, if you want to reproduce your science, are you still going to be able to run your container? That's something to keep in mind.

Then there are other problems — for example, speed. This is a Dockerfile that installs TensorFlow from an Ubuntu 18.04 base image: it installs Python and TensorFlow, adds some AI program that, I don't know, cures cancer, and then runs it. There's a very intricate problem with this. On a very broad level, if you produce a container that gets used by many people, you're not going to enable machine-specific optimizations, in order to stay portable, right? The problem with that is that modern processors need vectorization to perform. As a minor point, it is very hard to say what the theoretical performance of a modern processor actually is, but there are relatively good numbers for the Intel Haswell architecture: in scalar mode it gives you around 130 gigaflops (floating-point operations per second), and with AVX you get a nearly four-fold improvement on that. The downside is that you need to use machine-specific optimizations to get that performance.

So, let's fix that. And suddenly we're back to the hard problems, right? That Dockerfile doesn't fit on the slide — it's from the official TensorFlow repositories. Again, not hitting on TensorFlow; it's just that building software is not easy. If anybody was at the talk yesterday, "How To Make Package Managers Cry", that explains why. The Dockerfile is like 80 lines — 85 — and a fair bit more sophisticated: you need to be intimately familiar with the build system, the loader, and how they play together. In this case, it sets the library path for the loader, invokes Bazel, builds for the Haswell architecture, builds a wheel, then installs the wheel, and so on and so forth. If you want to try that at home, I do recommend that you use some tool that automates the process for you, like EasyBuild or Spack, because if you do that repeatedly, it's going to be pretty bad for your mental health.
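The gist of those 85 lines, compressed into a few illustrative commands — the flags and paths are typical of a TensorFlow source build of that era, not the exact Dockerfile:

```sh
# Tell the loader where the freshly built support libraries live.
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}

# Configure and build TensorFlow from source with Haswell-specific
# optimizations (AVX2/FMA) instead of the portable default.
./configure
bazel build -c opt --copt=-march=haswell \
    //tensorflow/tools/pip_package:build_pip_package

# Package the result as a wheel and install it.
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tf_pkg
pip install /tmp/tf_pkg/tensorflow-*.whl
```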
So, does it pay off? The Dockerfile is eight times the size, and we get a seven-times speedup, so that's a pretty good bang for the buck, line by line. I have to thank Kenneth Hoste, the lead developer of EasyBuild, for the benchmark. So it does pay off, but it is a considerable investment of time, and it opens up a whole other set of issues: you basically need to cross-compile. Suddenly, you had this thing that was portable, you wanted to make it fast, and now it's not portable anymore. But it is a common problem, and if you use EasyBuild or Spack in conjunction with containers, you can do portable builds — but then you have the problem: how do you share that? Do you tell the users, "use the version that's AVX-optimized, maybe with FMA or without FMA"? Maybe the cache size of your processor is different. That's going to be a quite interesting problem in the future. Maybe you want to share with a researcher who runs an ARM machine, so the architecture is totally different. One way to solve the problem is a Docker feature called fat manifests — because, of course, Docker containers run on z Series mainframes, which makes me a bit happy. So if you do a docker pull on an IBM mainframe or on a POWER machine, and the registry has a version of the image for that architecture, it's going to pull that one down. Unfortunately, this hasn't been integrated with the HPC solutions yet, and — a minor point — I didn't find any information on whether it can key on CPU features as well, and not only on architectures.

To conclude this talk: I do think that containers are a valuable tool for scientific computing, because they enable a user-defined software stack. If you want to say it less fancy, it's basically RPM in userland, kind of. I do have to say — as the advertising disclaimer goes — they're not a panacea, so they're not going to automatically solve all of your problems. Particularly, the performance does require work, and if you imagine a seven-times performance difference — if you need to buy seven times the servers, or pay seven times your AWS bill — you're going to think about it very hard. And since containers are quite a new paradigm, it's also quite beneficial to use what's already there in conjunction with them, to use them to the fullest extent. And with that, I'm going to go into questions.

[Audience] Good morning. I work for a major European company that does HPC, and we still run 2.6 kernels on some of our machines, for applications. But that's not my point. I was very interested to hear what you said about optimizing and installing the optimized TensorFlow. I think HPC people have a great deal to learn from the web generation: you've got scaling out to the cloud, you've got configuration management, cattle not pets, and you've got applications which are designed to tolerate failure, unlike HPC applications, which don't tolerate failures. But here's what I want you to comment on. Often you see the recipe for an installation: apt-get install TensorFlow, apt-get install OpenMPI. And then you see on the OpenMPI list: "I did an apt-get install of OpenMPI and I've got a really out-of-date version; it doesn't perform well; I've got a problem with it." And the answer from the OpenMPI guys is: "Our 2.0 series is great, it's got all these bugs fixed." I think people don't realize that with the maintainers of these scientific packages in Red Hat and Ubuntu, you're depending on some guy maintaining it who's not really going to keep it up to date, and who's not going to optimize it for the latest processors. So, could you comment on that?

[Speaker] So, I'm not quite sure what the question was. You very nicely elaborated what...

[Audience] Comment on the issue of the tooling not being kept up to date in the distributions, due to lack of time of the maintainers, and how containers help with that.

[Speaker] To say it in a blunt way: what you're doing is shifting the load of maintaining the software stacks to the user. And for the stuff that really needs to be highly optimized, you still have the center people, but it makes the turnaround times quicker. So, maintaining software is a very hard problem that's not going to go away in the future, even with cloud technologies.

Anybody else? The question is: how do you integrate it with the batch system? For Shifter, there is a Slurm SPANK plugin; for Singularity as well. But you don't need to use one: if you spin up a job and you run shifter with the image name, it's just going to run. That applies to Singularity as well. So, you don't need to integrate it with the workload manager — it does give better performance, though, for several reasons.

All right? Thank you.