We're going to be talking about distributed HPC applications with unprivileged containers.

Thank you, Stefan. So I'm Felix, this is Jonathan; we work at NVIDIA, in California, and we're going to talk about our infrastructure, where we use containers for multiple different applications.

Let me get something out of the way: I know we just heard about Steam, and we are from NVIDIA, but we don't do video games. Sometimes when I go to a conference about Linux and containers, people say: why are you here, you're NVIDIA, you do Windows and games. But that's not all we do. GPUs do ray tracing with RTX, which is useful for games, but also for professional visualization, for movies. In the middle we have a type of GPU that's only used for crunching numbers: astrophysics, biology, simulation, mathematics, machine learning. Those GPUs don't have a display output; you put them in a data center and they just crunch numbers, for deep learning or for what we call HPC, high-performance computing. On the other end of the spectrum we even have autonomous machines, like robots and self-driving cars: smaller, embedded systems that people can use for computer vision and real-time tasks. So we call all of that GPU computing, and you might have heard of CUDA, our platform for doing it, beyond video games.

So at NVIDIA we have an infrastructure with multiple clusters in multiple data centers in Santa Clara, and we have one that we submitted to the TOP500 benchmark.
That's a ranking of supercomputers; we are number 20 in the world, and those are very, very large machines. Each has 16 GPUs of more than 400 watts each, for a total of 12 kilowatts per machine, and we have 96 of those, with 1.5 terabytes of RAM each, so these machines are gigantic. We use InfiniBand for the networking, because we need very fast networking. And, somewhat uncommon for HPC, we run Ubuntu, but that happens. I know the next talk is about Raspberry Pis; it would take quite a few Raspberry Pis to reach one node here, at 12 kilowatts.

Four years ago, Jonathan and myself started a project called NVIDIA Docker, which then evolved into libnvidia-container. The goal was to make it easy to deploy CUDA applications in containers, and it worked pretty well. It now supports all runtimes, which is good because it doesn't force us to use a specific one: LXC, runc, Docker, containerd, Kubernetes, you can run your CUDA apps in all of them.

We use containers at NVIDIA especially for deep learning and HPC, because we know some of these applications are difficult to package and install, and they sometimes conflict with each other. So we put them in a container and publish it on Docker Hub, or some of them on our own container registry. People can download those and get a stack that's already installed for TensorFlow, PyTorch, biology workloads, and so on. We use containers for a lot of things, even benchmarking, and especially for machine learning and deep learning.

I've mentioned our hardware, so let's take a look at what a typical cloud deployment looks like. There, you usually don't have 12-kilowatt nodes like we do; you have smaller nodes, maybe hundreds or thousands of instances, and you containerize not just for packaging but also for security, and you have microservices, so maybe 100 containers per node. You have traffic
from the outside world, you have user uploads and internal traffic, and users don't access the cluster directly: they basically ask someone to deploy their new app on the cluster, and you have many advanced features that I've listed here. That's the kind of environment Kubernetes has to offer.

Now compare that to what we do at NVIDIA. As you've seen, we have very large nodes, and we can trust the users, because our clusters are sometimes air-gapped or have very little access to the outside. If someone hacks the cluster, if someone uses a zero-day on our cluster, well, we're just going to fire them, and that's pretty much it. We run trusted code on our clusters. Also, not all applications are containerized, and if an application is already packaged well, we don't use a container; we don't want to force users into containers when they did that job correctly. That requirement alone actually eliminates Kubernetes, because Kubernetes containerizes everything. We have few applications per node, we are not very dense, and we have multi-node jobs, which means you need to start a job on 30 nodes in parallel, so it takes a bit of time. And, as I said, mostly little traffic to the outside world.

So, unlike maybe many people before us, we have not chosen Kubernetes; we have chosen something that's pretty classic in HPC, called Slurm.
It was much better adapted to our use case. It has advanced scheduling algorithms, for quotas, teams, and so on, and in particular it supports gang scheduling, which means that if you need to run on 30 nodes, there is no point starting unless everyone is ready: if you start running on 15 nodes, you're just going to burn CPU cycles waiting for the other 15 to join. You really have to make sure they are all allocated and start at pretty much the same time, otherwise it's just waste. It's low overhead, the control plane is pretty small, and it's topology-aware, so the allocation of machines is optimized for performance. It's user-centric: you submit a bash script and it runs your bash script; you can SSH into a login node and then have an interactive session on a big machine like the ones we showed. And it knows about GPUs. The only drawback was that it did not support containers, and we use containers especially for deep learning and machine learning. But it does support plugins.

So we looked at our requirements. We need performance: no overhead from the container runtime. We need to support Docker images, because our researchers do install Docker on their own machines; it's convenient, it's a nice UI, everything is nice. We need soft multi-tenancy, meaning you can have multiple users on the same machine and you just want to make sure they don't steal resources from someone else; we are not trying to strongly isolate them, so it's not a security boundary.
It's just: you have your four CPUs and you cannot take more. We need GPU support and Mellanox RDMA networking, and multi-node jobs, so the runtime should not get in the way of any of that. We also want interactive containers, where you can install packages, ptrace, strace, gdb your processes, everything like that.

There was no container runtime that really filled these needs, so it was actually simpler to stay with Slurm and use its plugin system to create a new runtime and integrate it, than to modify Kubernetes for our needs. So that's what we did.

Before I hand over to Jonathan, I just want to get something out of the way first: writing a secure privileged container runtime is very, very hard. There was a presentation by Aleksa about this a month or so ago. You really want to use what we call user namespaces, and that's basically the point of his talk: you avoid a lot of issues. Once you get that out of the way, writing a container runtime is not actually that difficult. We're not going to explain how to write a container runtime, because there are a lot of talks about that already, but we're going to explain the reasons why we did what we did. The last point here: even if you trust your users pretty much like we do, there is still a lot of accidental damage that can happen if you give real root inside containers to users, like access to files, breaking the system, or not being able to debug a container from the outside, because if it runs as UID zero you might not be able to attach to it.

So our runtime is called enroot, and Jonathan will now try to explain each topic on this list.

Right, so we're going to go through some of the design principles listed here, and to be fair, it's heavily influenced by LXC. The first thing we did was use user namespaces.
The reason is that we wanted a fully unprivileged container. The difference with other runtimes, though, is that we only have one user namespace mapping, which maps the user outside the container to the same user inside, so they have the same UID; and optionally we let them remap themselves as root inside the container. We offer both choices because some applications refuse to run as root, and some actually need root, or fake root, access. We also keep the same username inside the container, just for convenience, so as not to confuse people, and we can automatically mount home directories and other things inside the container. Also note that runc or Docker-based containers always remap you to root inside the container; we don't.

Traditional runtimes usually use subuid and subgid maps for user namespaces. They do that because they want to allocate a huge chunk of UIDs and GIDs. We don't really do that, because we run application containers and don't need that UID separation; but you kind of have to once you start installing packages, because you usually need to be root and you need additional UIDs and GIDs, since packages will try to create new users and groups. There is also a problem with subuid and subgid maps: they are difficult to maintain across a cluster. There are efforts in shadow-utils to solve this, but right now it's kind of difficult. And you can run into permission issues once you exit the container: files that had certain permissions inside the container now have different ones outside. So to solve that, we use a seccomp filter and trap all the setuid-family syscalls to make them succeed, even though those UIDs have no mapping inside the container.

We also wanted a standalone runtime with low overhead.
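Before moving on, the single-entry mapping just described can be sketched with plain util-linux tools. This is an illustration, not enroot's actual code; it assumes unshare(1) is available and that the kernel permits unprivileged user namespaces, which enroot needs anyway:

```shell
#!/bin/sh
# Enter a new user namespace remapped as root, then show the resulting
# one-entry /proc/self/uid_map ("0 <outer-uid> 1"): a single mapping,
# with no subuid/subgid ranges involved.
inside_uid=$(unshare --user --map-root-user sh -c 'id -u')
echo "uid inside the namespace: $inside_uid"
unshare --user --map-root-user sh -c 'cat /proc/self/uid_map'
```

An identity-style mapping, enroot's default, would instead map the caller's UID onto itself, so id -u inside the namespace prints the same UID as outside.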
There's no daemon involved, no persistent spawning daemon, and we inherit all the cgroups from whatever is above us: it could be systemd, it could be Docker if we're running enroot inside Docker, or it could be Slurm itself in our case. Docker users are sometimes confused by that, because you actually get the cgroups from the Docker daemon and not from the docker run command; but that's the behavior we wanted. Also, after the runtime has executed the application, it's out of the picture: it execs the application and just disappears, so you don't have any extra process hanging around the way runc or Docker do, to handle the PTY, the exit status, or container tracking. We don't have that.

We also wanted minimal isolation: we don't need a fancy network namespace with overlays, we don't need an IP for each container, and we don't need to bind privileged ports. We also don't want a PID namespace, because it tends to confuse some programs, as we saw before, and handling PID 1 is pretty tricky; we've seen a lot of people running sleep as PID 1. As I said, we just want to keep the cgroups that were handed to us, especially since the scheduler has better insight into what needs to happen, for instance whether you can overcommit resources or not. And obviously it simplifies the runtime and improves performance.

Speaking of performance, some of the problems you can get with traditional runtimes: a network namespace usually involves a bridge, NAT, or overlay networking, so you have overhead there. Seccomp and LSMs can have overhead for syscall-heavy applications. We also rely on shared memory a lot, so we need a shared IPC namespace. And Docker and other runtimes try to tune your rlimits, which are sometimes not well adapted; for example, memlock is pretty low on Docker by default. Something that's less
known is that when you turn on seccomp, on a lot of distributions you also turn on the Spectre mitigations by default, so if you're concerned about performance you may not want that.

Also worth noting: MPI, the standard framework for message passing that we use to communicate inside a node or even across nodes, doesn't really like PID and IPC namespaces either, as we discussed. And the nice thing about having a fully unprivileged runtime is that we can use cross-memory attach: that's the process_vm_readv/process_vm_writev syscalls, which are used extensively by MPI and require ptrace-level access, so with traditional privileged runtimes it's kind of hard to get that working. As for coordination with PMI and PMIx, I won't really dwell on it, but we need to pass file descriptors to the actual containers, so we leak file descriptors into the containers by default.

The most difficult part is actually importing Docker images into the runtime, because you have to deal with a lot of the OCI details, whiteout formats, and other things, and we really wanted to speed that up, because we pull really big images. So we rely on overlayfs: basically, we just want the kernel to do all the squashing of layers, rather than a sequential extraction like Docker or umoci do. We also found out that just using plain bash with a parallel curl pipeline tends to be much faster than Go, because some of the Go packages involved are not really optimized. And we weren't really fond of the existing formats: the vfs storage driver that Podman or Docker have, for example, is huge on disk space. If you use the overlay driver in Docker, the layers are kept uncompressed, but you can't share them between users; you would need something like shiftfs to convert the ownership. So what we do is we actually share layers across the
same group of users, and we compress them with zstandard. We also have a helper binary for the overlay mount because, unfortunately, overlayfs is privileged; it's unprivileged on Ubuntu, but that's kind of broken right now. That helper doesn't get the capability by default, so out of the box only the admin can import images with enroot.

For the image format we chose squashfs, and we wanted it to be super simple: when we convert a Docker image, the entrypoint just becomes /etc/rc, the environment goes to /etc/environment, and the volumes go to /etc/fstab. When you land in the container, you can edit these configuration files like you're used to, restart the container, and the changes are just picked up.

We like squashfs images because you can store one as a single file on the parallel filesystem and use it from there; we can pull it really fast internally, and it avoids the thundering-herd problem you get when you kick off a multi-node job and it starts pulling the same image a hundred times. It's also useful for air-gapped setups, where the admin controls what can run: they import the squashfs and just store it on the cluster. You can also mount it as a block device and have it fetched lazily, for example over NFS.

Finally, we wanted it to be super simple: the runtime is just a simple shell script, about five hundred lines of code, and it uses basic Linux utilities to set everything up. So it's easy for users and admins to customize; if there's something they don't like in the runtime, they can change it. And we have per-user and system-wide configuration you can drop in, if you want certain mounts or environment variables in all the containers on the system, or in all the containers of a specific user.
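Going back to the image conversion for a moment, here is a hedged sketch of where a Docker image's metadata could land in the converted root filesystem; the paths are the ones named above, while the contents are purely illustrative:

```shell
# /etc/rc           the image's ENTRYPOINT/CMD becomes an rc script, e.g.:
#     exec python train.py
#
# /etc/environment  the image's ENV lines land here, e.g.:
#     PATH=/usr/local/bin:/usr/bin:/bin
#
# /etc/fstab        the image's VOLUMEs become ordinary mount entries, e.g.:
#     tmpfs /data tmpfs rw 0 0
```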
For that, you just write plain configuration files.

Now for the basic usage. You can do enroot import: you give it the Docker URI and you end up with a squashfs file representing the entire image, which you can then store and share with someone. You can create a container from this squashfs image, which unpacks it under your XDG data path, and then you can start things in there: you can run nvidia-smi in a TensorFlow container, or you can remap yourself as root with a read-write root filesystem, so you can install packages as an unprivileged user.

Some of the more advanced things: you can start the image directly, in which case we rely on FUSE to mount the squashfs and all your changes stay in memory. And the really neat thing people like is being able to create self-extracting bundles: you run enroot bundle and you get what we call a .run file that includes both the image and the runtime, so you can send it to anyone running Linux and they can run your container without any dependencies. If you have an experiment you want to share with a co-worker, you can even send it by email, which is really practical, especially for cloud deployments: you don't have to install anything on the instance, you just drop this file and run it.

Now, looking at how it's implemented inside, we have a few utilities that basically replace the standard Linux ones. They're really simple, and there are basically three of them, enroot-unshare, enroot-mount, and enroot-switchroot, plus two others that are mostly used for Docker imports. If you were to use these utilities directly, you could write your own custom runtime really easily.
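The self-extracting bundles mentioned above use a classic shell trick: a script stub followed by an embedded archive, where the stub finds a marker inside itself and unpacks everything after it. A minimal generic sketch follows; the file names and layout are illustrative, not enroot's actual format:

```shell
#!/bin/sh
# Build a tiny self-extracting "run file": shell header + tar payload.
set -e
workdir=$(mktemp -d)

cat > "$workdir/demo.run" <<'EOF'
#!/bin/sh
# Locate the line right after the payload marker, then untar from there.
start=$(awk '/^__PAYLOAD__$/{print NR + 1; exit}' "$0")
tail -n +"$start" "$0" | tar -xf - -C "${1:-.}"
exit 0
__PAYLOAD__
EOF

# Append the payload (here, a one-file tar archive) after the marker.
echo "hello from the bundle" > "$workdir/hello.txt"
tar -C "$workdir" -cf - hello.txt >> "$workdir/demo.run"
chmod +x "$workdir/demo.run"

# Running the file extracts its own payload into a target directory.
extract=$(mktemp -d)
"$workdir/demo.run" "$extract"
cat "$extract/hello.txt"
```

A real bundle would embed the runtime alongside the image and exec into the container after extraction, rather than merely unpacking files.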
For example, here you just download the official Ubuntu root filesystem, you create new namespaces, you mount a bunch of things, and at the end you just exec the switchroot and run bash inside your container; and now you have a really simple container runtime of your own. Now I'll hand over to Felix, who will explain the Slurm plugin that we did.

Right. So we saw enroot, the runtime. We actually give users access to this runtime, so they can use enroot start and enroot create, but we didn't want users to have to learn something new, because they are used to Docker; and for the reasons we have explained, we didn't give them access to Docker on our clusters. They are used to the syntax above: in Slurm the command is called srun, you specify your command and it runs it on one node or on multiple nodes; you say, I want to run Python on one node. Slurm has a lot of plugin hooks, which was very nice, because we could just add a --container-image flag: you say, I want to run TensorFlow, and that's the only change (in blue and yellow here), the only thing our users had to learn. They can still use enroot directly if they want, but we wanted to make it easy: they don't need to know about container runtimes, they don't know it's enroot; it actually started as a prototype with LXC, with the same syntax. It doesn't need to be very complicated, because our runtime is very simple too: we just want to say, run this image with these mounts, and that's pretty much it.
That's the basics.

Now, a little bit of detail on how we do that within Slurm. We start the container early, and we get handles on its namespaces, the environment variables of the process, and the directory where the container is. But we cannot hijack Slurm's exec of python train.py itself, so it's more like a docker exec: just before Slurm gets to python train.py, we do a bunch of setns calls to join the namespaces of the container we created, and then we hand control back to Slurm, and Slurm executes python train.py. It doesn't know that it has been containerized in between, but it just works; it's really just those few steps.

Quickly, as I said, we only have a few new flags. You can name a container: if you do --container-image pytorch, it pulls that from Docker Hub transparently, and you can name it so you can reuse it. Say I call that container "pytorch" and I install a package, for instance vmtouch, to load a dataset in memory; if I then reuse this container name and add a mount, it's still the same container, and my dataset is already in memory. For interactive jobs, if I want a shell inside my container, we don't actually do anything on our side: Slurm has a flag called --pty, I'm remapped as root inside the container, and I can do whatever I want with the combination of Slurm and our plugin. We also have Slurm batch jobs: you write a shell script, you send it to Slurm with sbatch, and you can run it on 64 nodes, for instance. And here is something a bit more complicated: a multi-node TensorFlow job with 16 processes per node.

And with five minutes left, we've reached the conclusion.
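On the plugin mechanics above: the namespace "handles" are just the files under /proc/<pid>/ns, and a setns(2) call on one of those descriptors, which is what nsenter does, moves the caller into that namespace. A quick look at our own process's set:

```shell
#!/bin/sh
# Each entry here is a descriptor that setns(2) accepts; the plugin
# opens these for the container's init process and joins them right
# before exec'ing the user's command.
ns_list=$(ls /proc/self/ns)
echo "$ns_list"
```

Listing them for another process (e.g. /proc/1234/ns) requires ptrace-level access to that process, which is exactly why the fully unprivileged, same-user model makes this easy.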
So we have five minutes for questions. We want to thank all our colleagues, because we've been working on this for a few months at NVIDIA. And that's it: enroot is entirely open source on GitHub, our plugin is also open source, and we are open for questions now.

Q: Could you explain a little better what you didn't find in current HPC container solutions that prompted you to create enroot? Things like Singularity, Charliecloud, these kinds of solutions.

A: Well, we reviewed lots of them, and each one of them has its downsides; I could list the reasons for each one, but we did evaluate all of them. The closest one was Charliecloud, I think, which was the closest in philosophy to enroot. The problem is it was a little too static for our case: we wanted something more dynamic, something we could modify depending on the cluster environment. Admins want to do some things in the container at runtime, and Charliecloud was kind of static. And they don't use all the tricks that we have listed; maybe eventually they will, but right now that isn't the case, so it was easier for us to write our own. Again, it was 500 lines of bash, so it wasn't too complicated.

Q: Perhaps I missed it: are there any bits that need root privileges in handling the containers?

A: No, the runtime itself is fully unprivileged. For the import, you need one capability to actually mount the overlay, but the squashfs step is unprivileged. So by default, we don't allow users to pull and convert Docker images.
Even though they could do it themselves with skopeo and umoci or something, we don't allow it directly, partly because it discourages people from pulling random stuff from the internet. If the admin wants to allow it, they can just add one capability on the helper binary, which is fairly safe, because mounting read-only overlays is fairly safe; Ubuntu allows it, other distributions don't. But no, the rest is fully unprivileged.

Q: All the focus here is on performance, on the HPC side, and as a user I would like to understand whether your plan is more to go towards a substitute for the modules scheme, where a single cluster provides the modules for the software, you load a module, and you run your software with the libraries already pre-configured; or do you rather see the cluster's IT people providing a container with the software specifically compiled for that cluster; or do you provide the container to run the software, TensorFlow for example? Should I download the application from your side, should it be provided by my cluster's IT people, or should I compile it myself in your environment? How do you foresee it? I see the modules scheme as the standard today.

A: Right. The problem with modules and the like is that you don't get all the benefits of containers, and it's mostly admin-driven, whereas our needs were more user-centric: we want users to bring their own applications, so modules didn't really fit. We didn't want admins to be deeply involved in the process of running apps. As for actually compiling things specifically for your cluster, we tend to do that with our containers.
So Okay, but if I am How do you call it if I am a developer and I Develop my software for for something my specific case for TensorFlow Okay, I do that and at the end if I provide to the end user the Container with TensorFlow embedded in it the flag that I will that I would have used to compile that Particular machine would have been the one that fit my own clusters Environment, okay, not the one of the final user in his own machine, right? Okay, so it will be useless for him at the end Or it won't be the maximum performance that you will get by compiling it on its right and again We so we have a registry where we host all the optimized stuff from for our platforms But again, you like you can have an admin that just drop squash files They're specifically compiled for your clusters and then your users can just use the squash file that the admin imported similarly to modules Or you can have the user if you really knows what he's doing Can also push something to a registry and pull that so we have both use cases covered pretty much Yeah, and we're off time. Thank you