Okay, can you guys hear me? Perfect. So, as you said, I'm Michael Bauer. For the last nine months or so I've worked at GSI. You can find me on GitHub at bauerm97, or send me an email at bauerm@umich.edu if you find this particularly interesting.

I'm here to talk a little bit about Singularity. Singularity is a container solution for HPC environments. Here's our homepage, singularity.lbl.gov, and you can also check us out on GitHub. We've had a lot of contributions in the last six or seven months, lots of activity, so feel free to come by and contribute. There's a list of our contributors. The project head is Greg Kurtzer at Lawrence Berkeley National Lab. Myself, Brian, Krishna, and Vanessa are some of the core developers, and then we also have people who have made other significant contributions to our code base.

First I want to pose a question. I'm assuming that in the container dev room most everybody already knows what a container is, but it's important to strictly define what we mean by a container. To do that, I want to talk first about virtual machines. Wikipedia says that, in computing, a virtual machine is an emulation of a computer system; that's it at its most basic level. Good examples are VMware, VirtualBox, what have you. Some pros: you can run different operating systems. Back when I was younger it was really cool that I could run Windows on my dad's Mac OS laptop. You can save money: we didn't have to buy a second laptop just to run programs that only run on Windows, namely games in my case. And it's a little easier to maintain than hardware; you can just download a new virtual machine if you want. Some downsides at the same time: performance is going to be a little slower, most likely, and you're going to have memory and storage requirements you wouldn't otherwise have. To run a virtual machine, maybe you need 10 gigabytes of space.

Containers, then, are very similar in their goal to a virtual machine. With a container you take some environment, encapsulate it, and store it in a file or system of files to be distributed. With a container, however, we don't do any kernel emulation. There's no architecture virtualization; it's just software packaged into one file. What that means is we don't have to waste an extra 5% of performance emulating the kernel, we have a much smaller footprint on disk, and we have nearly instantaneous startup time. When you do a docker run, you're running immediately rather than waiting two or three minutes for VirtualBox to start up its virtual machine. And you can run multiple instances of the same container using just one image on disk: if you fire up 20 Docker containers, you really only have one image on the disk.

My internet died, hold on. Is it possible to get full screen? It's fine. So, to talk specifically about running multiple instances of the same container off the same base image: this is a depiction of Docker's image structure. You have multiple read-only layers built on top of each other, and then at the top here (cut off, if I can scroll up just a little bit) you can see a small read-write layer, maybe a couple of megabytes, where any running application can do its read-write operations.
Each instance of a container you run will have its own read-write layer at the top, and that lets you do some really cool things with containers.

A lot of places are actually using containers now. You can deploy on Amazon with AWS, or on Google Cloud Platform. You have companies that now provide their services as a container that you use. Some big websites, such as Reddit, are even using containers to deploy all of their infrastructure.

I want to talk specifically about containers for scientific computing. (If I what? Yeah, I am, and it doesn't align. It won't align. It's okay, it was worth the try.) Anyway, we see containers a lot in industry, but, for instance, at the place I've been working for the last nine months they weren't using containers at all, and they wanted to investigate containers for their HPC. Why would we want to do that? First and foremost, you can escape dependency hell. We always had the issue that we were trying to run some version of code that depended on library A, we had library B, it didn't run, users got annoyed about it, and they yelled at IT because of it. Secondly, you want your remote code to work exactly the same way as your local code, every single time. With containers you can run the container on your local machine, send the container to the cluster, and the cluster just runs the container; it only runs what's inside the container. And third, and maybe most important for scientific computing, you can have one file that distributes the entire environment used to generate your results. That's really important for reproducibility. When you do an experiment and you can put everything you used for that experiment inside one file and give that file to other people, you can promise them they'll be able to reproduce your results using the exact same environment you did.

Here's a little diagram of what users absolutely hate: you run code locally, it works perfectly; you run code on the cluster, it works not so perfectly. (This is supposed to be animated, but it's a PDF, so I can't do that.)

When we were investigating container solutions to use for our HPC, I put together a little checklist of what we actually need from a container solution. One: can any user run containers without special privileges? Two: can we integrate it seamlessly into our infrastructure, or do we have to install something like Kubernetes or Docker Swarm and spend time and effort getting that configured properly? Three: is it portable between many systems? Can we run it on older hosts? Can we send it across the ocean to somebody else and let them use our container? And four: can we let users bring containers onto our cluster without any administrative oversight? I don't want to have to scan my users' containers to make sure they don't have malicious content, and I don't want to have to trust my users.

So first we investigated Docker, as most people do. It fills three out of four of the checkboxes. It doesn't really integrate nicely into our infrastructure; we would probably have had to install something like Kubernetes on our servers to get it to work properly. So that kind of ruled Docker out for us.
Another really, really important point for HPC is that your container software can't come anywhere close to root privileges, ever. You have to ensure that if you give users the ability to run a container, they're not going to be able to do something they couldn't otherwise do on the host machine. And with Docker we found we run into a bit of an issue: Docker runs a root-level process all the time, a daemon that spawns containers, kills containers, and is in charge of scheduling containers. So when we initially proposed to IT that we install Docker on the cluster and try it out, they just told us no. They wouldn't let us install something that runs a root process on every machine, and that really ruled Docker out for us.

That's where the investigation came in, and then we stumbled upon Singularity. Singularity is the solution we chose, designed for high-performance computing. If we go back to the checklist, we're filling every single checkmark this time. There's no root-level daemon, so we don't have to worry about the point that Docker falls down on. It can be run essentially the same way you run any application; you can directly execute a container as a file on your disk. It's portable: Singularity doesn't require any kernel features from the most recent kernels, so you can run it way back. I think one of our users was running it on Red Hat with a 2.6 kernel. It's designed for maximum portability. And we don't actually have to worry about trusting our users: there's no issue with giving a user full control over what containers they run on the cluster. We won't run into problems with that, and we'll talk a little more about why in a bit.

Again, with Singularity, any container can be run by any user at any time for any reason. When you run a container with Singularity, the user has the same user ID inside the container as on the host. And, just to reiterate, you don't have to change your workflow at all to use Singularity. In our case it was just a matter of changing our batch submission script to run singularity instead of whatever other executable we had been running, or to directly execute the image file itself; there's a rough sketch of what that looks like below. Thirdly, it's a single image file, and that ties in really well with one of the big topics in a paper we recently submitted for publication about Singularity: you have just one file, so you can distribute that file, and it contains everything necessary to run whatever you wanted to run. And it's safe. Again, you don't have to worry about users and what they have inside their containers, because they can't do anything malicious.

As you can see, there are a lot of places now trusting our container software on their clusters: the place I work, the GSI Helmholtz Centre, is running it on our cluster, and a couple of other big ones too; there are people running it at MIT and on Stampede in Texas.
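To make that "just change your submission script" point concrete, here is a minimal sketch of the before and after. The job name, container path, and executable are made up for illustration, and the syntax follows the Singularity 2.x commands described in this talk:

    #!/bin/bash
    #SBATCH --job-name=analysis

    # Before: run the executable directly on the host
    # ./run_analysis --input data.root

    # After: run the same executable inside a Singularity container
    singularity exec /cluster/containers/analysis.img ./run_analysis --input data.root

    # Or, if the image has a run script, execute the image file directly:
    # /cluster/containers/analysis.img --input data.root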
Here I want to compare Singularity to a couple of other leading container solutions. Shifter, which you may or may not have heard of, is similar to Singularity: containers designed for high-performance computing. Charliecloud is the same idea, though a slightly less mature project. Then there's Docker, which I'm sure everybody has heard of, and then Singularity.

There are three main points here that are really important for high-performance computing. One, native support for GPUs: it's really important, especially for scientific computing, that you can use your GPU inside the container without any extra integration. Two, native support for InfiniBand: again, it's really important that we can use not just IP over InfiniBand but verbs or RDMA inside the container without any extra integration, so you simply have access to InfiniBand inside the container. And three, native support for Open MPI. The other two HPC container solutions provide similar native support, so there the difference isn't so big, but compared to Docker it's a big difference.

Right, so essentially what we're saying is that we don't have to rely on any other software to build a Singularity image. With Shifter and Charliecloud you can essentially only use Docker images, so on those systems you have to rely on Docker in order to have an image at all. With Singularity it's possible to use something like debootstrap to build your image yourself, natively, really quickly and easily. Does that answer your question? Yeah. So we actually submitted a paper, which is where that graphic is from, talking about containers for scientific computing specifically.

Now I'd like to cover some basic usage of Singularity, just to get people into the workflow and how you would actually use it. I like to think of Singularity being used in three separate steps. There's a create step, where you run sudo singularity create, give it a name, and that puts the physical image file on the disk. There's sudo singularity bootstrap, where you give it an image and a definition file, which I'll talk about in a bit; all that does is take the set of rules inside your definition file, analogous to a Dockerfile, and build your image, either from scratch or from a Docker image hosted upstream on Docker Hub. And then there's running your image. There are three main ways to run an image: singularity shell, which, as it sounds, opens an interactive shell inside the container for you to play around with; singularity exec, which executes any executable you specify; and singularity run, or directly executing the image file, which are essentially the same thing. What those do is execute a script you can specify in the definition file, called a run script.

So now I have, and hopefully this is still cached because I don't have internet, an asciinema demonstration of the basic bootstrapping process. I don't know why it's not going full screen over here, but if we can play it and zoom out a bit, this will show you how simple it is to get started with Singularity. It takes you through the creation process, which we're doing right now by specifying a size of 568 megabytes. One thing about image creation: we want users to create their image locally, on their own computer. Because of the way Singularity works, with setuid and without user namespaces, it's not really feasible for now to let a user bootstrap arbitrary code in a remote environment, since to do most bootstrapping operations you need to be able to run sudo apt-get install or whatever.
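As a rough sketch of that three-step workflow (create, bootstrap, run) in the Singularity 2.x syntax described above; the image name, size, and definition file name are made up for illustration:

    # 1. Create an empty image file (done locally, needs root)
    sudo singularity create --size 1024 my-container.img

    # 2. Bootstrap it from a definition file (analogous to a Dockerfile)
    sudo singularity bootstrap my-container.img my-container.def

    # 3. Run it; none of these need root
    singularity shell my-container.img            # interactive shell inside the container
    singularity exec my-container.img whoami      # run any executable; same user ID as on the host
    singularity run my-container.img              # execute the image's run script
    ./my-container.img                            # direct execution, same as 'run'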
So I bootstrapped the image here with just a basic definition file. Then all we do to run it is singularity shell, and as you can see, we don't need sudo for that.

Singularity actually supports several different image formats. The most common and most basic is the image file, which isn't listed here: just a .img file. It's a file system formatted by singularity create, and it's one file, usually 400 to 500 megabytes or maybe a little more. We also allow you to use a plain directory: if you extract, for instance, the archive generated by docker export into a directory, you can use that directory as a container, or you can use the archive itself as a container as well.

For users who are more accustomed to Docker, Singularity also has direct Docker integration. If you watch this, what I'm going to do is, as a non-root user, run a singularity shell command that calls directly on the Docker API, and we're going to run the Ubuntu 14.04 image without any extra privileges. This was done in my hotel room, and it was awful downloading 100 megabytes, so it takes two minutes. If we skip to about here, yeah, you can see it finishes downloading and extracting and puts us directly into the Docker container. The next thing I want to point out: I exit the container here, then rerun the same singularity shell command against the Docker API, and you'll notice that it's just instant this time. That's because we store these layers in a cache, normally in the home directory, optionally also in the temp directory. So if you run 10 instances of the same Singularity container or Docker container on some node somewhere on your cluster, you actually only have to download it once. We support that as well.
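A minimal sketch of that Docker integration, using the docker:// URI syntax from Singularity 2.x; the exact image used in the demo is my assumption, a stock Ubuntu 14.04 image from Docker Hub:

    # Pull and enter a Docker Hub image directly, as a normal user, with no daemon involved.
    # The first run downloads and extracts the layers; they are cached (by default under the
    # home directory), so the second run starts essentially instantly.
    singularity shell docker://ubuntu:14.04

    # The same works for exec and run:
    singularity exec docker://ubuntu:14.04 cat /etc/lsb-release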
There's no native integration for any schedulers except Slurm. We integrate directly with the Slurm plugin mechanism, and I was going to talk about that; it's actually in my slides online, except those aren't behaving properly. But essentially, if you're using Slurm, you can just add a line to an sbatch submission script that specifies which container, and it will start the container before the job runs and then run the job script inside the container.

So the primary motivation for looking into Singularity where I was working was that we were running jobs for ALICE; we're an ALICE Tier 2 centre. Our cluster there is a Debian-based system, whereas ALICE and CERN expect Scientific Linux. What happened is that we were getting an exceptional number of errors, something like a 40% error rate, which is not normal, not what they expect, and not acceptable either. For those of you who are unfamiliar, this is roughly what the ALICE project, or more generally the WLCG, looks like. Data is generated at Tier 0, inside the ALICE detector at CERN, and distributed to the Tier 1 centres; there's one in each member country. It's then distributed onwards for processing to Tier 2 centres. GSI is a Tier 2 centre, and that's where I work; that's the computer centre we've been building, the Green Cube. We'll have 300,000 cores.

So again, as I said, we're trying to run 2,000 jobs at the same time, except 40% of them are failing because we're running on the wrong operating system. The solution at the time was mounting some libraries from Lustre in some weird directory; they had their Slurm job submission script, and they actually had to intercept the script, modify it, and hack in some LD_LIBRARY_PATH fix. I think there was more; I don't know exactly what they were doing, and they couldn't explain it to me well enough for me to figure it out. The point is that it was a big, ugly mess, and we converted it to run on Singularity.

The Singularity solution to this problem is much simpler. We package Scientific Linux 6 into a container. We modify our Slurm submission script simply to execute the container instead of executing the ALICE job environment directly. We no longer have to mount Lustre to get access to the libraries we were supposed to have from Scientific Linux, and we can test our containers locally before deployment. Here's a small diagram of our Singularity build file. Essentially, it pulls the Scientific Linux Docker container from upstream; CERN actually provides a Dockerfile in the ALICE repository for building a Docker container, and since Singularity natively integrates with Docker, it's a simple one-liner inside the build file and we're on Scientific Linux already. Then we package it up into a container, store it on our file system, and it's accessible anywhere on the cluster.
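A minimal sketch of what such a build file could look like in Singularity 2.x definition-file syntax; the Docker Hub image name and the run script are assumptions for illustration, not the actual GSI/ALICE files:

    Bootstrap: docker
    From: scientificlinux/sl:6

    %runscript
        exec "$@"

Built once with the create and bootstrap commands from earlier and dropped onto a shared file system, for example:

    sudo singularity create --size 2048 /lustre/containers/sl6.img
    sudo singularity bootstrap /lustre/containers/sl6.img sl6.def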
So how exactly do we do what we're doing? How can we have such secure containers? How can we assure system administrators that users aren't going to be able to escalate their privileges? Singularity is actually very basic at its heart. It's a setuid binary owned by root, so when you call it, your effective user ID is zero. What we do is mount all the necessary files: we bind-mount the image, we do any bind mounts the user asked for, we clean the environment to ensure no leftover environment variables are passed through, and then we hide inside certain namespaces where possible. The one we always use on newer kernels is the mount namespace. If you're interested in trying it, it's now possible to use user namespaces, and it's also possible to hide behind a PID namespace so that you can only see processes running in your container.

Now, running with a setuid binary is kind of insecure by nature; you're asking a system administrator to trust something that grants root privileges the moment it's called. So, to do this securely and in a way that's auditable, we have two function calls in our code. We have a call named singularity priv escalate, which changes the effective user ID of the running process to zero, and singularity priv drop. We do one action at a time; for instance, we need root privileges to do a bind mount, so we escalate, do the mount, and drop immediately afterwards, and we only ever use these calls immediately paired with each other.

So you can actually go look in our code, look around, and see exactly what runs with privileges and exactly what runs without, and you can audit the code to check that we're actually doing what we say we're doing.

Another thing we get asked a lot is why we don't use user namespaces always and by default, because Docker, for instance, I believe is always using namespaces by default now. Inside a user namespace, you map a user ID on the host to a user ID inside the container. So if your user ID is 1,000 on the host, you might have a mapping of 1,000 to zero, and then you're user ID zero inside the namespace. Apart from a few system calls, it looks and feels like root; you can do almost everything you want. The issue for us, specifically, is that it has the potential to break portability between environments. If you go back outside the user namespace and inspect the container's file system, you'll notice that things you created as user ID zero inside now have user ID 1,000 on the host. And if you send that container to another system where your account has ID 1,010 rather than 1,000, we'd have to do extra work to map your new user ID to your old user ID, which is mapped to root. So that's a portability problem for us.

As I said earlier, this is the real danger for us when developing this code, and it's what we try our hardest to get right; and right now we succeed: we make sure that no user code ever gets executed as the real root user. If it does, you have to consider your system compromised, because you can't trust that your users are, one, intelligent enough to know what they're doing, or two, non-malicious.

So again, security in Singularity is one really big thing we focus on. We spend a lot of time going through our code and making sure there are no edge cases that will cause security issues. The first principle is: never let users run code as real root, ever. It just doesn't happen. Two: we only use an effective UID of zero when it's absolutely, 100% necessary to do one thing in our code; as I mentioned earlier, to do a bind mount we escalate, do it, and drop it. And three: we drop permissions and capabilities when forking a new process. When we fork, the child process waits, standing idle, until the parent process has dropped permissions and confirmed that they've been dropped, and then tells the child through a pipe that it's good to continue; only then does the child carry on with its code. That's to ensure that when we fork into the actual container process, the user can't force a race condition, beat us, and take over the parent process as root.

Another thing we have to look at is isolation: what do we isolate in order to provide the most secure environment for a user? We bind-mount the image file into the host file system at a specific location and chroot into the mounted image's location. We mount devices, we mount the host's hosts file and also resolv.conf, and a few other files. And we use namespaces when possible; I talked about user namespaces earlier, and PID namespaces, and we mount those on request, but not by default.
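To illustrate the kind of per-run control being described here, user-requested bind mounts and opt-in namespaces, a rough sketch using Singularity 2.x-style flags; the exact flag names can vary between versions, and the paths are made up:

    # Bind an extra host directory into the container for this run only
    singularity exec -B /scratch/data:/data my-container.img ls /data

    # Opt in to a PID namespace, so only the container's own processes are visible
    singularity shell --pid my-container.img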
The idea is to make the container as close to the host's environment as possible without doing any unnecessary extra work. A big difference between Docker and Singularity specifically is network isolation. In Docker you might be familiar with libnetwork, where you actually have to go in, set up a network, and tell Docker how you want containers to interact with it. In Singularity we don't have that: you're simply on the host, using the host's network. Whatever network stack you have on your host, your users can still make use of it; IP over InfiniBand, Ethernet, whatever, it all works. And you can also use your host's physical devices inside the container. We make it a point to isolate as little as necessary while still maintaining security, and that allows you to do things like using your GPU.

Just a little more on the specific things we do to ensure we're not letting users escalate permissions. There are two binaries installed by Singularity: a setuid binary and a non-setuid binary. When you call the non-setuid binary, it in turn calls the setuid binary, and only that non-setuid binary is allowed to call the setuid one; no user can call it on their own. That ensures you never have extra arguments that could potentially corrupt our runtime, and it forces the setuid binary to have exactly the arguments it needs to execute. As we talked about earlier, we have singularity priv escalate and drop to escalate and drop privileges. Also, when we mount files, we do it with the MS_NOSUID flag, which essentially ensures there are no setuid files inside the container: if you have a container that contains what should be a setuid file, once you mount that container it will no longer behave like a setuid binary. And when we open files, we set the close-on-exec flag so that the child process is unable to compromise the parent process.

That's about all I have in terms of really specific stuff. This is a table we have that displays all of our commands, how you use things, and what they do. So I think, yeah, we have some time for questions now. Go ahead.

I've been wondering about the setuid stuff, and getting more and more uneasy about it. Yes, we can discuss all the security measures taken to avoid something bad happening while we're in setuid, but ultimately I'm struggling to understand the defence in depth. You can execute the minimal amount of code while setuid, but at the same time, security vulnerabilities in Linux and bugs turning up all the time are just a fact of life. So I'm wondering why you're using this very old setuid mechanism instead of much more modern mechanisms, capabilities management and things like that.

The point is specifically to use an older mechanism. So the question is why we use setuid in Singularity when we could use newer methods, such as user namespaces or whatever, correct? We actually do interact with capabilities. When we create the child processes, we set no-new-privileges inside them, and we do that whenever we can. There are certain things we allow for: when you compile Singularity, it assembles a list of the capabilities the kernel has, and no-new-privs support is one of the things we look at.
And if we have it, then we say, okay, we're going to set this specifically because it's available. So we do all that we can, but only as far as the kernel makes it available, and the idea is that we want this to be usable on as old a kernel as possible. We do have a user on something like a 2.6 kernel, I think, and I'm not sure what features are available there.

Okay, one more question. You said it's safe to run arbitrary user code; how do you maintain that without validating the containers first? I don't really understand how that works. Sure, even if you prevent them from getting root, then fine, they don't get root. But that's a very old way of looking at it. I care if they root a machine, but I care much more whether they can run a botnet on my HPC cluster, or somehow do something malicious with my compute power. I don't understand how this works.

Right. So the second half of the question is essentially not understanding why it's important, why we care about just limiting the user's ability to run as root. The answer to that is: if your users are running on your system normally and they can submit arbitrary code there, then they can probably already do the same things you're afraid of them doing. The idea is that all the same measures you have on your cluster work the same inside the container; it's just that it's inside a container now. Exactly. It's not that you can take this, put it on an insecure system, and all of a sudden users can't do anything. It's that whatever measures you have in place on your cluster, we're not going to destroy your efforts, and we're going to ensure users can't have root while doing it. And if there is a kernel vulnerability, we can't prevent that, and we're not going to prevent that.

That's not what I'm asking. You said you can run a container without having to validate the container. I don't understand how that's supposed to work, how that's supposed to be safe when they run arbitrary code in a container where you don't isolate, say, the network. Even if you isolate those things, people can still use your compute. I just don't understand how this is supposed to be more secure for HPC.

I mean, my point is just that there's no inherent vulnerability added by running the container, right? On an HPC system, users are running arbitrary code on your system essentially by definition, and Singularity is there precisely so users can run that arbitrary code. Go ahead.

You seem to be arguing against a lot of what I would consider the features of containers. What does Singularity actually provide on top of those? Sorry, what was that? Can you repeat the question? I couldn't hear very well. You seem to be arguing against a lot of what I would consider the features of containers, specifically modern features that support containers. What does Singularity actually provide that you don't get from a chroot and a little bit of extra setup? Right, so the question is that I seem to be talking negatively about features that other container implementations use, features of containers, and why am I talking negatively about those?
And why don't we implement them? So, what does Singularity provide that isn't just a chroot and mounting an image? One, we do use isolation features where possible. User namespaces, when it's possible, we use, but only at the request of the user. The idea in an HPC environment is that we want to make our containers as similar to the host as possible, so that we don't have to mess with what users and administrators are already used to. Network isolation, for instance: we don't want somebody to have to define new policies and build a new network setup; we just want it to appear like the host, while letting users bring their own environment. What Singularity specifically does is look for the things we can do to provide security and isolation for your container that you otherwise couldn't do without root. So essentially, yes, what we're doing is mounting an image and chrooting into it, but to do certain aspects of that you are going to need root, and so we have a setuid binary to do it, in a defined, controlled way.

So the question is, can't we use the setuid part of Slurm to do whatever is necessary to mount the container? The answer is we can, and we do. That's the Slurm integration, actually. We have a Slurm plugin; I was supposed to have a slide showing exactly how that works, but of course my internet died. Let me see if I can bring it up over here. Anyway, we do have a plugin that was created for us by a user, and it integrates exactly as you describe. It hooks into the root part of sbatch. You can enable it with just #SBATCH --singularity and then the path to the container, and it does all of that using the Singularity library that comes with it, not the Singularity executable.

Yeah, go ahead. Is that the question? No? Can you talk louder? Sorry, I can't hear. So the question is, what happens when you run the same container at the same time? Those are read-only layers, right? They all look at the same cache directory. If you run, just in your terminal on your computer, singularity run on some Docker container, and you do that twice in two separate windows, they'll both just look at the same cache directory, since it's read-only, and they'll both read from it at the same time. It works just as you would expect, like doing it simultaneously with Docker.

So the question is, when I'm talking about portability, do I mean something other than Linux? The answer is essentially no, it's just for Linux. All the HPC environments I've personally been involved with have been on Linux, and the goal is just portability between different versions of Linux, different distributions, different libraries, different users, different sites.

I think that was all the time we have. Yeah, we're out of time. So thank you very much. Thank you.