Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online at rce-cast.com, where you can find the entire back catalog of over 100 episodes. Again, I have here Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks a lot for your time.

Hey, Brock. How's it going? We haven't done one of these in a little while. We kind of just got busy with some other things, I suppose.

Yeah. Personal things, summer travel, all that sort of stuff. I moved. It's just been busy. We've still got a little more vacation coming up and whatnot, but we have like two or three of these queued up for the relatively near future, right?

Yeah, we went on a little spate of reaching out to people and setting up times to be able to have discussions with them. And you know, we're always looking for people to go to the nomination form on the website at rce-cast.com and send us ideas for other topics and groups we should talk to. So if anyone has ideas, they should do that.

Yeah, thanks for bearing with us, all the fans out there. Appreciate it, and sorry for the bit of a hiatus we've had, but give us another couple weeks here for summer vacations and whatnot, and then we'll be back in the swing of things. But I am excited to start up with who we've got today. This is a guy I've known for quite a while. He's been in the HPC community for as long as I have, which is actually a very, very long time. But amusingly enough, it didn't even occur to me until we started recording today: I'm at the Cisco mothership today in San Jose, California, and I could have literally driven up the road and we could have done this sitting in the same room. Boo on me for not thinking of that ahead of time. But let me introduce him: we're going to be talking about Singularity today with its founder, Greg Kurtzer. So Greg, could you introduce yourself?

Hey guys, I'm Greg. By the way, thank you so much for inviting me to do this. This is fantastic. How long have you guys been doing this for?

Oh, that's a good question, man. I think it's coming up on, you know, five, six, seven years, something like that.

I remember when you guys first started this, and I'm so happy to see you've been putting this out there and doing so much good for the community with these podcasts. So thank you so much, and thank you for inviting me. As I mentioned, I'm Greg. I'm an HPC architect and developer for Lawrence Berkeley National Lab. I've had some experience creating various open-source projects in the community, some of which people may be familiar with. Probably the biggest one is CentOS Linux; I'm the primary founder of CentOS Linux, although there were some other co-founders involved as well, but I was running the organization that created it. Also Warewulf, and Singularity most recently.

So Singularity is what we're going to be talking about today. Can you give us a quick overview of what Singularity is?

Sure thing. So Singularity is a container system.
I can give a little story, actually, of how Singularity came to be. In doing HPC for the national labs, we've always had a lot of scientists that would come to us with their code, whether it be Fortran or C, and they really bragged on how efficient their code was and how portable it was. They wanted accounts everywhere you can imagine, running it just to basically make sure that, A, it compiled, and B, it was faster than heck. So the scientists, I mean, that was the environment that I came from, the scientists would relish the fact of how portable their source code was.

Then we started working more with the university, and the university is, I guess you could say, much more forward-thinking; they're doing more of the current trends and moving more into the new wave of how people are doing things. And what we found is that they wanted to not just build a binary and have that be portable; they basically owned and built an entire operating system image, a container if you will, that basically described and had within it all of their libraries, all of the scripts that they needed, all of the workflow, everything that they wanted, and they would have it running, possibly on their laptop or possibly on their workstation. When we started working with the university, they asked us, they said: so, we've already got an image, we've got it running on this workstation, I don't want to rebuild everything that I've already done, and you guys aren't even running a similar environment, so can you just run this over there? And that was really the first time I started thinking about containers, and it's an interesting problem to solve.

All right, so hold on a second. What is a container? What is the difference between containers and virtual machines?

Oh, that's a really good question. So a virtual machine is basically running an application which is going to virtualize another computer, and then on top of that virtualization of another computer, you're running another kernel, you're running another userland, and then on top of that you're going to run a whole other level of applications. The distance between those applications and the physical hardware, compared to applications running natively on that physical hardware, is much greater, and as a result you're going to get a lot slower runs, and you're going to get higher latencies and whatnot. There are a lot of things they do to make it faster and whatnot, but let's not go into that right now. Containers, on the other hand: the applications that run within a container system are running basically at the same distance from the physical hardware as native applications, and as a result they're going to run much faster and much more efficiently.
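To make that "same distance from the hardware" point concrete: a container is, roughly, an ordinary Linux process placed into separate kernel namespaces. A minimal illustration using util-linux's unshare (not Singularity itself):

    # Start a shell in a new PID namespace with a private /proc:
    sudo unshare --pid --fork --mount-proc /bin/bash
    ps aux   # inside: only this shell and ps are visible
    # No emulated hardware and no guest kernel is involved, so the
    # process runs at native speed, unlike a VM guest.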
So probably the best-known implementation of using containers on Linux is Docker. How does Singularity — you know, why don't I just use Docker on my system? How do Singularity and Docker compare?

Ah, great question. So Docker is kind of the de facto container solution out there. When people think of containers, that's what they typically think of, and for good reason. I mean, it didn't pioneer the idea of containers — containers existed for a much longer period of time than Docker has — but Docker has done a lot: things like bringing in the entire workflow of building containers, sharing containers with your friends, family, and neighbors, and leveraging the work of your peers to extend that, build additional containers, and then run them. So they've basically brought a whole ecosystem of a workflow together. So there's definitely a lot of validity in using Docker on HPC resources. The question is really, why aren't we doing that?

If you look at how long Docker's been around — let's say approximately two years, at least from the scientific perspective — so far there's not really been one HPC center that's been able to do it. There are a huge number of complexities in installing Docker into a multi-tenant, shared HPC environment. I can go into any of the specifics that you want, but at a high level, you can see, I mean, we're going on years now, and scientists have been banging down our doors trying to get Docker installed, but it hasn't been done yet, and there are a lot of reasons for that. Installing Docker on an HPC resource really would have been the easy win.

Can you give us some examples? What are the complexities of it? Because it's marketed — and I know a lot of my friends who are in the web development community, the web applications kind of community, they swear by Docker and they love Docker, and that is inherently a multi-tenant community, and they put all of their applications in Docker. So what's different between that and HPC types of environments?

Okay, there are a lot of different ways that we can approach this problem. Let's start off with the fact that it's running as a daemon. When the Docker process runs, it's actually running as a root-owned daemon, and for users to spawn jobs in Docker, they're basically communicating over a socket, out front, with a command-line tool, and they're basically instructing that daemon what to do.

So let's talk about this from a scheduler perspective. If I were to go and submit a job to a cluster, to an HPC resource, the resource manager and the scheduler would work together to figure out, well, what size allocation do I get? Where is it going to go? And they're going to let me run on that. Now, if I asked for a job that says give me a wall-clock time of five minutes on one processor, and then I run a job and my job goes over that five minutes, what's going to happen? It's going to get killed, and the resource manager is basically going to say, you've exceeded your allocation, so it's killing it. No problem. Now, from the Docker perspective, what you end up with is: the user application, the user command line that communicates with that daemon, is going to tell that daemon, okay, well, here's the job I want you to run. Now, they could detach from that daemon, and that job just goes off and runs, and even if that client program gets killed at that point, that's not stopping the Docker daemon from continuing to run that job. So what you end up with is that the Docker daemon is outside the reach of the scheduler, and that basically means that you can't properly schedule or deal with the resource management of containers running under that daemon.
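A sketch of the failure mode being described, with a hypothetical image and job script:

    # The docker client only talks to the root-owned dockerd over a socket:
    docker run -d my_image ./long_job.sh   # -d detaches immediately
    # If the resource manager kills the submitting shell at the wall-clock
    # limit, the container keeps running under dockerd, outside the
    # scheduler's process tree and its accounting.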
Now, there's a whole bunch, as I mentioned, a whole bunch of reasons why Docker is really not geared towards HPC. The web world and the enterprise world — that's really its bread and butter, and that's what it was designed to do. There have been a few patches that have addressed some of the issues within the Docker ecosystem and that would make it better for HPC, and those patches, at least many of them, have not been accepted. Some of them are almost a year old at this point and still pending, so you can see that this really isn't in the interest of the Docker developers. This is not the primary use case that they're looking for; it's probably not even a secondary or tertiary or even tangential use case that they're looking to solve, and as a result it's been very difficult to get the necessary changes in for HPC. And there are other aspects as well. I mean, the complexities of dealing with MPI, the complexities of dealing with GPUs — NVIDIA has taken a stab at it, and it's been pretty good, but there are still a lot of complexities there — and security as well.

One other point, if you don't mind me mentioning it real quick: within a Docker container — let's say, for example, I have a Docker container that I want to run and that I own, and I want to get root, so I know what the root password is. Now, there are lots of ways of getting root, but let's just assume that I set the root password; let's go the easiest way. When I submit my Docker container and the Docker daemon is running it, well, I can easily escalate up to root in that particular case. What do I have access to do? Well, it depends. It comes down to what namespaces are involved, and how the Docker daemon and the underlying operating system are dealing with that — is the user namespace separated out or not? What you end up with, though, is that I'm possibly able to escalate up to another user's privilege, and in that case, where does that leave the security of the system? I mean, if I've got, you know, storage that has data from other users that I should not be able to see, or I'm on a public — or rather, the shared — network of my HPC resource, I don't necessarily want a root user on that.
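The escalation concern is easy to demonstrate: by default, processes inside a Docker container run as root.

    docker run --rm ubuntu whoami   # prints: root
    # Without user-namespace remapping, UID 0 inside the container maps to
    # UID 0 on the host, which is exactly the concern on a multi-tenant
    # cluster with shared file systems and fabrics.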
So there are other mitigations that people need to put into place, and they start bisecting the network and bisecting the InfiniBand fabric, and saying that Docker containers then don't have access to the file systems, and you end up with a small little virtual cluster inside your big cluster — and that's not the problem that we want to solve.

Okay, so you pointed out a lot of things there, and like you said, Docker kind of aims at one thing. But Docker is still very popular, even among a lot of scientific codes, especially things that are kind of one-node or smaller, like small analytics codes, where you can package everything up, which makes things really nice for distribution: in terms of reproducibility, it's exactly the way the developer made it, which is one of the huge benefits of all these container setups. Singularity — can you take a Docker container and turn it into a Singularity container?

Yes, you can. This is one of the workflows that I've been pushing forward with several other people. The idea is that, as you said, a lot of people already have experience with Docker, and they have already built these workflows using Dockerfiles and implementations with Docker. So one workflow that I think is really good is: people use Docker to create their containers, to build them, and to develop with them, because it's a known technology. From that, when they're ready to take it to a larger center, to an HPC center or a bigger resource, they basically do a docker export. Now, docker export basically dumps a particular format, and, very coincidentally, Singularity imports that exact same format. So you can basically do a docker export piped into a singularity import.
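A sketch of that hand-off, using hypothetical image names and the 2.x-era command syntax:

    # Create an empty Singularity image, then stream the Docker
    # container's file system into it as a tar archive:
    singularity create app.img
    docker export $(docker create my_image) | singularity import app.img
    # (docker export operates on a container rather than an image, hence
    # the docker create to instantiate one first.)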
Okay, so we kind of skipped an important question, though. We talked about what the shortcomings of Docker are in an HPC environment. How is Singularity better? Like, what did you design, in terms of how Singularity is both implemented and intended to be used, that makes it suitable for an HPC environment?

That's a good question. So Singularity does a lot of things differently than Docker. One of them, for example, is that you are the same user inside the container as you are outside the container, and as a result of that you can blur the line between what is actually contained. So for example, if on the host I have access to particular directories and particular device files, and I have access to a certain amount of resources on that particular host — well, if I'm running inside of a container, there's no reason why I shouldn't still have access to those exact same resources. Granted, I need to virtualize my environment a little bit, so I'm running my own libc, I'm running any other programs I've installed into that environment, but I can blur that line between what is container and what is host. So, for example, very easily I can say the scratch file system should be available inside the container, and my home directory should be available inside the container. As a result of that, when you invoke, let's say, a singularity shell command into a container, what it feels like you're doing is SSHing into a node that's running a completely different operating system, but you still happen to have all of the same file systems shared to that node.

All right, now, I thought I saw something else about the format, too, about how you actually store and move the containers around. Is there something different about that as well?

Yes, there is. Singularity uses a single-file image to represent the container and all of the files therein. So this can be a file that I, as a user, own, but it's a single file, and if I wanted you to have access to it, it's simply a matter of changing the POSIX permissions on that file so you can actually get access to it, or we can just copy it over to your home directory and branch it, in a manner of speaking. So there are certain advantages to having it as a file versus what Docker is doing — and I'll mention what Docker is doing in just a second.

As a single file: if I were to put this, let's say, on my Lustre file system, and I run Singularity with the image on the Lustre file system, and I run a gigantic parallel MPI command against that single file — well, let's say inside I'm running a Python job, pyMPI. Now, the number of file opens that are necessary within a big, huge pyMPI job is actually huge, and those are all metadata operations hitting your parallel file system. I've heard some numbers stating that to run a really big Python MPI job, it can take upwards of 20 to 30 minutes of a distributed denial-of-service attack on the metadata server of your parallel file system — and that's just to start the job. Now, with Singularity, because we're using a single image, it doesn't matter how many file opens occur within that image; it's still just one metadata lookup on your parallel file system's metadata server. So for a big, huge parallel job of something like Python, which has lots of file opens going on underneath the hood, you can actually get speed-ups as a result of using Singularity.

Docker, on the other hand: basically every node that will be running a Docker job has to cache that file system locally, and what that means is it's going to cache and then rebuild the entire container image on each node that needs to run that job. So if you need to run a whole bunch of HTC or MPI type jobs, you're going to incur quite a bit of startup overhead just for starting the job, whereas Singularity is really tuned for parallel runs, especially using a parallel file system.
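Because the container is one file, both sharing and parallel startup reduce to ordinary file-system operations; a sketch with hypothetical paths:

    # Sharing a container is just POSIX permissions (or a plain copy):
    chmod g+r /lustre/containers/analysis.img
    # A job that opens thousands of files inside the image still costs the
    # Lustre metadata server only the lookups for the image file itself:
    singularity exec /lustre/containers/analysis.img python analysis.py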
Yeah, actually, if you go back to our episode on pyMPI, they talked about how, on large Blue Gene systems, things would lock up and they were trying all these things to work around it. So that's really quite interesting. So, in this setup, am I still basically bringing my entire environment, though? Like, am I bringing my own glibc and stuff like that, if I'm at a version different from what my host platform is running? Do I still get all of that isolation of what I'm running, except for the kernel?

Yes, it is exactly that. You are getting complete isolation, pretty much, except for the kernel, which brings its own level of complexities and things that have to be mitigated. But for the most part, you're right: your C library — and I also have to get out of the habit of saying glibc now, because there are more libcs than just GNU; Alpine Linux, for example, is using musl libc — but the libc itself is completely independent inside your container from what you're running on the host. Now, there is some libc-to-kernel compatibility that needs to occur. For example, if you want to run a Red Hat Enterprise Linux 7 container on a Red Hat Enterprise Linux 5 kernel, it's basically just not going to run, because the libc is not compatible with that kernel. But you can go the other direction: you can run a Red Hat Enterprise Linux 5 container on a Red Hat Enterprise Linux 7 host, and that works just fine.

So how do I integrate this with my batch system, though? Do I still have a job script? Is my job script the equivalent of the command the container starts on boot? Can I do all this just as part of a specially crafted batch script, or do I need support in my batch scheduler?

Good question. The architecture and the original design of Singularity was trying to make it as simple for a user to run a command that's inside of a container as it is to run a job or command that's outside the container. So from a user perspective: let's say, for example, they have the program foobar. They can run this program foobar on their host, if it exists there, or they can run it inside the container, if it exists there. If they want to run it on the host, obviously, if it's in the path, you just say "foobar". If you want to run it inside of a Singularity container, you basically just prefix that command with "singularity exec" — we're going to exec a program that exists inside of a particular container. Then we pass the path of the container file, the container image itself, then we basically pass the command foobar, hit enter, and it'll actually run foobar from within that container.

Because of this workflow, which is really designed to emulate what it's like to run on a standard host, in a standard shell, running any of those commands, you can basically put the singularity command itself inside of your batch script and run the container with no changes, architectural or otherwise, to the host system. So for a service provider, an HPC provider, it just means that we need to install Singularity on the nodes of our compute cluster. Now, when the users come along and they want to run a job, they simply add those singularity lines inside the batch script whenever they want to reference a particular container that they want to utilize. So integration is incredibly simple.
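So a batch script might look something like this Slurm-flavored sketch (names and paths are hypothetical):

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --time=00:05:00

    # Same command you would run interactively, prefixed with
    # "singularity exec <image>":
    singularity exec $HOME/containers/foobar.img foobar --input data.txt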
Now, you talked before about how I can make a Docker container and easily convert that to a Singularity container, but are there other ways to natively create a Singularity container? Because usually, I would imagine, you want to have that command actually inside the container, along with whatever other support apparatus it needs: specific libraries, shell scripts, and other environment kinds of things like that. How does one typically do that?

So I think you're asking how to — what I call — bootstrap an image from scratch, which is basically to build up a new image using Singularity as the bootstrap mechanism. Is that correct?

Correct, yes.

Okay, so in that case: Singularity does offer bootstrapping functionality, and the way that it offers it is basically a two-step process — it'll probably end up becoming a one-step process soon enough, but at the moment it's two steps. You basically create your container image with one command: it would be literally "singularity create", and then you give it an image path, and it'll create the image. Once that's done, then you do a "singularity bootstrap": you point it at that image, and then you pass it a definition file. The definition file is a fairly simple syntax that basically says, okay, this is a Debian- or Red Hat-based system, I want to install from this particular mirror of packages, and this is what I want it to look like: here are the packages and files that I want installed. And I can copy files or install files from my current directory, or download files and put them into the container — all basically within this definition file. So, as I mentioned, two commands: the create command and then the bootstrap command. When you run singularity bootstrap, pointing at an image, and give it the definition file, it'll go off and install a new operating system into that image, and it'll copy and install any programs or scripts that you want into that image.

But you don't have to do everything within a definition file. For example, you can, as root, go into that container image. Something I mentioned before is that you are whatever user you are inside the container as you are outside the container, and there are certain blocks in place so a user cannot escalate up to root within a container. So if you want to be root inside your container, you actually first have to be root outside your container. So you start up Singularity by going, let's say, "sudo singularity shell"; we want to make this container writable for this instance, so you pass the -w option, and then you give it the path. Once you do that, you can do a yum install or an apt-get install or anything else that you want to do within that container, and as soon as you exit out of that shell, all of the changes are automatically flushed to the image, and there's no rebuild time or anything like that. So the image is always up to date.
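A sketch of both routes; the definition-file keywords have changed between Singularity versions, so treat the details as approximate:

    # Route 1: two-step bootstrap from a definition file that names the
    # distro family, package mirror, and packages/scripts to install:
    sudo singularity create centos7.img
    sudo singularity bootstrap centos7.img centos7.def

    # Route 2: modify an existing image interactively, as root, with the
    # image mounted writable (-w):
    sudo singularity shell -w centos7.img
    #   ... inside: yum -y install mypackage ; exit
    # Changes are flushed straight into the image on exit; no rebuild step.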
Okay, so we've created an image, and you said we can just run simple commands inside a normal batch script, and all we have to do is install Singularity. What does that actually mean for a system administrator group like mine? Are we talking about kernel modules? Does it just use LXC, and it's just a couple of scripts and helper utilities that really just make LXC fit for human consumption? Do we have to rebuild against every kernel release? How bad is it for an admin to maintain Singularity on their platform?

Oh, it's very easy. There are no kernel components to it; it pretty much is all interfacing at the userland level, so there's nothing you need to deal with as far as updating and whatnot. And it's not using LXC or anything else on the back end; it's all brand-new code. It's written primarily in C. There are a few shell-script wrappers that basically just go around the end pieces of it, just to make sure everything is sane before it calls the binary components behind them. What it's basically doing is dealing with the namespaces in the kernel — not as a kernel module, but calling them via system calls. And once it's installed, you're pretty much done. When you build it, the configure script will automatically figure out what capabilities your host has. So, for example, if you don't have access to the user namespace, it's not going to leverage the user namespace code.

Okay, so let's dig a little deeper here. How does this translate into the HPC realm? The area that I personally care about a lot is MPI, right? So you've got some kind of high-speed networking stack that you want to use. What does Singularity do, or intentionally not do, for me in that kind of realm?

Okay, so I've mentioned how the singularity command workflow exists: you basically just type in "singularity exec", the image name, and then whatever commands you want. So in my mind, a perfect world would have it so an MPI implementation of this would be as simple as prefixing all of that with an mpirun command. And that was really the target; that was the example that I was going for. So if you want to run, for example, a particular MPI job that exists inside of a container, you really do just run mpirun, you know, the number of processes or whatever other features and parameters you want, and then "singularity exec", the container name, and your MPI program that exists inside that container.
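Concretely, the whole invocation is just a normal mpirun line with the singularity prefix spliced in (hypothetical names):

    # Launch 64 ranks; each rank execs the program from inside the image:
    mpirun -np 64 singularity exec /lustre/containers/app.img ./hello_mpi
    # Caveat: the MPI library inside the container has to be compatible
    # with the mpirun layer on the host, as described next.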
Now, what's happening on the back end is a little bit more complicated than what it looks like on the front end. What's happening on the back end is a hybrid MPI approach, meaning part of the MPI exists outside the container and part of the MPI exists inside the container. I can explain this a little bit, and I'm hoping, Jeff, that you're going to kick me if I make any mistakes, or at least correct me nicely. Basically, what you end up with is: when you call mpirun, you're on the host, and it's going to basically fork off the ORTE process, and the ORTE process is going to fork off the number of processes necessary on each one of the hosts that it needs to run on, in order to basically satisfy the -np requirement of your mpirun. Then it's going to exec whatever program you specified at the command line. If it was just an MPI program, it would basically jump into that MPI program; but if it's singularity, Singularity is now going to build up the container necessary for that MPI program — the Singularity container that you specified — at which point it's going to launch the MPI program within the container. The MPI program is going to link against the MPI libraries that exist inside that container. Now, those MPI libraries are basically going to do everything that they need to do on that end — basically link against all of their other shared objects and whatnot — and then they're going to make a connection back to the original ORTE process via PMI or PMIx, and that basically closes the loop.

Okay, so you pretty much blew my mind with all that stuff, but it sounds like it works, at least better than any other sort of system I've seen before, which is kind of heck of awesome — and I don't really give that kind of praise that often. A different question, though: can I have more than one of these? I guess there's no reason I can't have more than one Singularity container running on a single host at a time, right? I can have two different people running their two different Singularity images, doing their own thing on their own subset of the machine that was given to them by the batch system — that all works?

Oh yes, oh yes. There is a consumable in Singularity, though, and that is because we are using image files: to mount up those image files, you need to basically use a loop device, and by default most Linux distributions max out at eight loop devices. You can change that into the hundreds and thousands if you want to, but by default, yes, it will consume a loop device for each container that you want. Now, if you're using an mpirun and you're going to run, let's say, 20 of these MPI jobs per host, it will just use one loop device for that container, because there's no reason to bind it to additional loop devices when it's the same image. So it tries to be smart about that, but there is a consumable involved, so you've got to be careful.
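The loop-device consumable can be inspected and raised like so (the exact mechanism varies by distribution and kernel):

    # See which loop devices are currently bound:
    losetup -a
    # If the loop driver is built as a module, its limit can be raised:
    #   modprobe loop max_loop=256
    # or via the max_loop= kernel boot parameter.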
I found out about Singularity through an announcement on the XSEDE list, actually. SDSC, San Diego, has deployed this on their Comet system, if people want to try this out. What other systems around the country that are commonly available is Singularity working on today?

That's a good question. I don't know all of them; I'm surprised every time I hear of a new one. I know that there's an initiative right now at TACC getting it onto Stampede, I know SDSC has put it on a bunch of systems, and I know Stanford has it on a bunch of systems. I know there are a bunch of other universities around that have done this, and I know of some European centers as well, but I don't know if they've made announcements, so I don't necessarily want to spill the beans. But at the same token, I'm just massively blown away by the uptake and how fast the uptake has been on Singularity. As a matter of fact, it's making me a little bit nervous.

Which is a good nervous, I mean, only because, you know —

Yeah, yeah, probably enough said on that. Well, at the beginning of this, before we started recording, actually, you kind of dropped a nice little bomb on me, saying that you can actually use Singularity inside continuous-integration systems. We use that a bunch in Open MPI development, and we use the wonderful free service called Travis, and you were like, oh yeah, you can use Singularity in Travis. Can you expand on that a little bit? Like, how can I do that, and why would I want to do that?

Sure. Well, first I want to mention I'm not a Travis expert, but I am using Travis for the Singularity project itself, and as part of the normal CI testing that goes on, one of the things it does is build Singularity and install Singularity, and then it will create a new container and run tests against that container. So I basically took that knowledge — that I know it already works in Travis to build containers and to use containers with Singularity — together with the fact that, when I'm doing a release of Singularity, I've got a whole bunch of Singularity images — you know, CentOS 7, 6, 5, Ubuntu, Debian — and I go through and build Singularity within each one of those, to make sure that at least I don't get any weird build errors with any of the features and changes that I've done. So I'm pretty positive you could actually install Singularity into Travis, or into the system that Travis is running on, and then basically just download a few containers.

There are a few initiatives going on right now about how people can share containers, share workflows, define workflows, and then use that for scientific documents — basically just saying, you know, go check out this workflow ID if you want to recapitulate everything that I've just done in this whole paper. That's going on right now; that's an initiative at Stanford. So you could very easily just tell Travis to download these images and then test your build inside these images. I think Travis also supports a little bit of that on the side, but again, as I said, I'm not an expert on Travis, so I can't tell you for sure.
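In Travis terms, the recipe described here boils down to a few shell steps; a hypothetical sketch, not the project's actual CI configuration:

    # Build and install Singularity on the CI worker (autotools-style):
    ./configure --prefix=/usr/local && make && sudo make install
    # Create and bootstrap a test container, then run something in it:
    sudo singularity create /tmp/test.img
    sudo singularity bootstrap /tmp/test.img examples/centos.def
    singularity exec /tmp/test.img true   # e.g., build/test inside the image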
Okay, so — well, before that, I'll say that there will probably be Singularity running on a system at the University of Michigan here in a couple of weeks. But beyond that, where can people find out more about Singularity?

There is a website that you can go to, which is basically some GitHub pages that we threw together, at singularity.lbl.gov, but you can also just go there directly via the GitHub page. Right now it's still being hosted under my username on GitHub, which is gmkurtzer. So if you go to github.com/gmkurtzer, you'll see the repository there for Singularity. So those are probably the two easiest ways of getting to it.

Hey, one other question that we typically ask people, since you're doing a software project yourself. Particularly in the scientific community, it's not always appreciated as important, but it is important: what license are you offering Singularity under?

Singularity is released under a modified BSD three-clause license. The third clause references LBL, the Department of Energy, and the UC Regents explicitly, but aside from that part, it is a standard BSD license, with one other exception: we have a contributors' clause that is actually in print inside the license. It basically says you don't have to contribute anything that you do to Singularity, because it's the three-clause BSD, so you can do pretty much whatever you want with it. But if you do contribute to Singularity — and you're allowed to specify any license you want for that contribution — you're giving us a grant-back clause, a grant-back license, which basically says that we still have the ability to release this software open source and basically keep everything open. So it's basically a three-clause BSD license to users and to contributors; it basically just says that we're going to keep your code open if you give it to us.

Greg, thank you very much for your time.

Yeah, this was great. Thanks, Greg.

Oh, thank you guys for inviting me to this.