Hello, welcome, everyone. This is the second day of the sixth EasyBuild User Meeting. Our first talk of the day will be about using EasyBuild for Singularity containers. It will be given by Jörg Saßmannshausen from Guy's and St Thomas' NHS Foundation Trust and King's College London. And over to you now, Jörg. Yeah, thank you very much for the kind introduction. As you can see, I work for the NHS, and part of that also involves King's College London. Before I start, a little bit of background about me, because it is quite important for understanding the talk and where I'm coming from. I'm a scientist, a fully trained chemist. I started doing chemistry at the age of 15 with my apprenticeship. As you can see, I then continued with an MSc in England, in the UK; originally I'm from Germany. I continued with my PhD and did various postdoc positions in the UK, Germany and Austria. Back in 2010, I received my Venia Legendi, that's the licence to teach, not to kill, so you're all safe, from the Technical University of Graz in Austria. I also realised at that time that, yes, I'm probably quite successful in terms of chemistry, but chemistry requires money, and that is something very difficult to come by if you want to do a certain kind of chemistry. During my time in Austria, I had become interested in computational chemistry, which requires, as you might have guessed, computers, in particular high-performance computers. I was quite lucky that the local IT guys showed me a few things, so I started looking into HPC clusters. And back in 2010, I decided to switch, as I said, from doing HPC on the side to making it my daytime job. I'm still doing chemistry as a gentleman scientist in the evening, so you're still safe. Since 2010, I have had various positions, originally at University College London in Chemistry.
One of my more recent positions was at the Francis Crick Institute here in the UK, and now I'm working at Guy's and St Thomas' Biomedical Research Centre. Part of our job is to help clinicians actually use HPC computers to do their work. Now, a fundamental question: what is science? In science, we do experiments that are reproducible. If it is not reproducible, it's not science. If you don't do an experiment and you come up with a crazy idea but you can't prove it, it's not science. Now, in the lab, I repeat an analysis three times, simply to iron out any kind of random errors, which I can't avoid; there are also systematic errors, which are simply in there. On an HPC cluster, however, we only do one calculation and that's it. The big question is why that is the case. Anybody who has used an HPC cluster knows that things are not cheap, so time there is precious and you don't want to waste it. That's one of the problems. The other answer is that the underlying maths is an exact science, as exact as it can be. If you take that together, what I expect from the people who are installing the programs is that the installation needs to be exact. You need to run test jobs in order to make sure your installation is reliable and reproducible. In other words, what I'm doing here in London, I'm based in London, needs to be reproducible in a different environment, for example in Sweden or in the US. As I already alluded to, that is a big problem. We need to make sure that the installation of the program is the same wherever I'm running it, and one way forward here is, of course, something like EasyBuild. So we don't install software manually; we install it by a script, or a collection of scripts, however you want to do it, and that way it is reproducible. The problem is that in order to really judge that it is reproducible, that it has been done to the best of our abilities, you need an expert and you need time.
I'm working with clinicians, and don't get me wrong, currently, with the ongoing pandemic, and that is actually a true story, not made up, they are more interested in getting their COVID-19 research going forward, and they don't have time to install software. They don't have the time, and I see that from the tickets, to debug the problems either: the software is not installing, the compiler is not suitable because you're trying to use a ten-year-old program with a recent compiler. All of that is a little bit of a mess. So I was thinking: is there a solution? Can we install the program not only in such a way that it will be reproducible, but also in such a way that I've got the whole program packed in a container which I can then ship elsewhere? Remember, with containers you basically put them on a truck, you ship them off, you put them on a train, you ship them off; they are standardised, you can run them everywhere, and so you can run software containers everywhere as well, within limits. There are two commonly used container systems these days, probably a few more; the most popular ones are Docker and Singularity. Both have advantages, both have disadvantages. One of the problems we are having on our cluster is that Docker requires a daemon process running in the background, and that is a little bit problematic in an HPC environment, at least the one we are using. Singularity does not have that requirement, so you don't need any kind of daemon running in the background in order to run a Singularity container. That makes it much easier for us to let users execute their Singularity containers in a user-space environment. Also, we can mount the outside environment into the container, so all your data, which is traditionally stored on the HPC cluster, will be outside of the container but available inside it, and that makes things a little bit easier.
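Mounting the outside environment into the container can be sketched like this (the paths and the container name are illustrative examples, not taken from the talk):

```shell
# Bind-mount a host data directory into the container and run a command there.
# /mnt/hpc/data and analysis.sif are hypothetical names; --bind takes src:dest.
singularity exec --bind /mnt/hpc/data:/data analysis.sif ls -l /data
```

The data never has to be copied into the image; the container only sees it at runtime.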
If you look at bioinformatics, what usually comes up is Nextflow, TensorFlow, all of these buzzwords and programs, and they are used to write pipelines; I'll come to that in a moment. They work with both, at least Nextflow does, Docker and Singularity. If I can provide a Singularity container with the program I am using in it, the person who is writing the pipeline doesn't have to redo what I have already done; they just use that container. So all of that sounds wonderful, and the big question is why we are not using it. The answer lies a little bit in the following, and please do not get me wrong here: clinicians, like scientists, are interested in getting their science done. We are using software as a tool. We are not software developers. I remember when I started doing computational chemistry and somebody said, you have to compile your program. I was like, uh-huh, and how do I get out of that? So we are using it as a tool because we want to do science. It's like a car driver: a car driver wants to drive on the road to get from A to B. They're not interested in the thermodynamics happening in the combustion engine. So we need to find a way to deal with that. One of the solutions is to give clinicians a very easy set of tools and say: okay, all you need is this machine, you put an input in here, and you get your tool out, custom-made for your problem, and then you can use it for your science. The way I try to do that is basically to give researchers access to exactly that kind of custom tool-making machinery. I'm using Singularity here, in combination with EasyBuild, and I will show you what I came up with. So my solution is that I can tell scientists: okay, execute this command, do that, don't worry about what's happening inside the black box, that's my responsibility; just do that and then you can do with the program what you want.
So I'm providing a set of bash scripts, which are very easy to use, as I will show you in a moment, and they create a Singularity definition file. That has two advantages. First, you obviously can create a Singularity container from it, and second, you can publish that definition file with your publication. It's a simple text file, so it's not big. That way everybody can see not only what you have done, in terms of which tool you were using and what the underlying mechanisms are, they can also reproduce it: they take that Singularity definition file, build a new Singularity container in their environment, and can reproduce your results. Coming back to my initial statement, science is about reproducibility, and this makes that very easy. So how do we do that? I came up with this GitHub repository; the link is there, and the slides will be distributed later, so that's not a big problem. Basically, there are two folders which are of interest. One of them is called definitions. Obviously, I'm using the tool I have written, so for any software I'm installing on the cluster using Singularity containers, I put the definition files up there. The simple reason is that any scientist can then download them and reproduce them. The scripts folder is the more interesting one for this presentation and for the tutorial on Friday, because it is the set of bash scripts which allows you to create the Singularity definition files. That was the first hurdle I had to overcome when I wanted to use Singularity: how do I do the definition files? And here I'm very grateful that EasyBuild already had something in place, so part of this is inspired by EasyBuild. For the names of the bash scripts, I try to use speaking names, so if you look at one, you see exactly what it does. It's pretty obvious: container-build, yes, you want to build a container; CentOS7 is the operating system; and we are using environment modules in that container. You can use Lmod as well.
There is a matrix further down. And this one is still using Python 2 in the operating system; I left that in for historical reasons. We all know Python 2 is on the way out, it is deprecated, we should not use it anymore, but there might be a reason why somebody needs it, so I decided to just leave it in there. And I would like to stress that this is not any kind of Python version installed with EasyBuild; that is the Python of the operating system I'm using here. So what are the requirements? Obviously, you need my GitHub page and you need the name of the EasyBuild file. My page is obviously on GitHub, and EasyBuild is on GitHub, so you can get the names of the EasyBuild configuration files from there; you just have a look at their page and that's it. What you really need to have installed is Singularity, and if you're building Debian containers, for example, debootstrap needs to be installed as well. Now you will probably say: hang on a second, but I'm not using Debian or Linux, I'm using a Mac, or I'm on Windows, what do we do now, can't we use it? The answer is yes, you can use it, because if you put all of what is in the box up there inside a virtual machine, VirtualBox for example, using Vagrant, you create an environment which actually allows you to install Singularity and create the Singularity containers. Fortunately, if you really want a shortcut, there is also a Vagrant box with Singularity already provided, so you don't have to do any kind of installation. If you are a little bit more adventurous, you could download CentOS, for example, on your Debian machine, which I have done, and install Singularity inside that CentOS environment. All of that is doable. So how do I create the Singularity definition file with this script? It's very easy. You basically call the bash script for the container you want to build. In my case here, I want to build CentOS 7, because that's the version we are using on the cluster. You do not have to do that; you could use Debian.
Some of the Singularity containers are built on Debian; that's not a problem. And this container is using Lmod; that is something you should be a little bit more careful about if you're using a different environment module system in your container, for example Tcl modules instead of Lmod. Then all you need is the name, not the actual file, just the name, of the EasyBuild configuration script for the software you want to install. In this case, I deliberately chose GCC version 9.3.0. The script then asks you: do you need a second EasyBuild recipe? That basically gives you an entry point to say, okay, I'm very good at EasyBuild, I wrote my own EasyBuild configuration script because I need to do things differently. That is the only time where you actually need the actual file, the EasyBuild configuration file. If you don't have one, if you want to keep it very easy and you say, no, I don't need that, I just need a GCC container, you just say no, and you get your definition file out. If you then have a look, the file is named Singularity, and then basically I'm loosely following what EasyBuild does: the name of the software I'm installing, what kind of module system I'm using, and the operating system. So how does it look inside? There are basically three different parts. The first part is that we need to set up the operating system inside the Singularity container with all the relevant software, and under this umbrella I decided to include EasyBuild as well. The next step is that we use EasyBuild, which is installed inside the container, so you don't need a local installation of EasyBuild, you get it inside the container, and we always use the latest version. We use EasyBuild to install the software we want to install, in our case GCC. And then there is a little bit of post-installation set-up for running the Singularity container. So how does it look? The first bit: installing software.
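The naming scheme just described can be sketched in a few lines of shell (the module system and OS suffixes here are assumed examples, following the pattern from the talk):

```shell
# The generated file follows roughly: Singularity.<software>-<module system>-<OS>
easyconfig="GCC-9.3.0.eb"
software="${easyconfig%.eb}"            # strip the .eb suffix from the easyconfig name
deffile="Singularity.${software}-Lmod-centos7"
echo "${deffile}"                       # Singularity.GCC-9.3.0-Lmod-centos7
```

So from the definition file name alone you can already tell what is inside, how modules are handled, and which base OS it uses.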
That is done in a chroot environment. The first bit is basically Bootstrap: yum, with operating system version 7; it downloads it, it includes yum, and that's done. What we then do, that is the bit from here onwards, is basically say: okay, whatever you downloaded, fine, we do an update of that, just to make sure we really have the latest version, so any kind of bug fixes and so on are there. Then we install, using yum, all the requirements we need for EasyBuild to build the software. That is fairly generic, so we might install more than we actually need, but I decided it's better to have a little bit more and have it working, instead of fiddling around with "oh, we don't need this package here, unless it is that circumstance"; we can avoid that if it's too complicated. Next, we install EasyBuild. As you can see, we are using pip3 here, which means we have Python 3 installed. So we install EasyBuild, the latest version, as you do. We make sure the user easybuild is actually created; if not, we just do that, and we do a little bit of magic regarding the Lmod environment. So that's the first bit. The second bit is that for the user easybuild we modify the .bashrc, so that in case we want to open the container, we've got a proper environment for this user. We also set a few aliases, and of course we do exactly the same for the actual installation. The whole installation will be written inside the container, in the home directory of the easybuild user, in a script called ebinstall.sh. The reason for that is that if you want to do something different, because you want to modify it, at least you know what has been done here and you've got a guide. Further down it says eb --fetch. One of EasyBuild's features is that you can download all the files first, and then you build. And we noticed last year, when we first went into lockdown, that some of the GitHub repositories were offline for a few minutes.
If that happens when you want to build your software, the whole build fails. You don't want that to happen four hours into your software build, because you've got a slow machine or you're building quite extensive software; you want it to happen right in the first ten minutes, so you can look into the problem and restart it. The next step is then obviously building the software. You might say: hang on a second, you're not using --robot and so on. We defined an alias, so I can use eb and that automatically always calls --robot; that's one of the advantages of having that script. It then does a little bit of magic for the Lmod system again, and then it installs everything as the user easybuild, that is the second, third line from the bottom, su -l easybuild. So we're not installing as root or so. The final bit is that we need to set up the post-installation environment. That means we source a profile, so that all the commands like module, ml, whatever is used, are found, and we do a few bits with the modules. Right at the bottom, you see labels. When you're using my script for the very first time, you are asked, if you haven't done it yet, to create a little text file where the author and your email address, for example, are stored. That goes into the author label, so if somebody reading that script has a problem, they can contact me, assuming that email address is still valid; if you come back in many years' time, that might be different. And one of the labels is, of course, that we installed GCC. So we've got EasyBuild, we've got the Singularity definition file; now, how to build it? This is the only time we require sudo access. That can't be done on our cluster, but as I said before, if you're using Vagrant, for example, on your machine, if it is not Linux-based, it's not a problem. So that bit requires sudo.
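Putting the three parts together, a generated definition file looks roughly like the following condensed sketch. The package list, the module tree path, and the label values are assumptions for illustration, not the literal output of the scripts; only the overall structure follows what is described above:

```
Bootstrap: yum
OSVersion: 7
MirrorURL: http://mirror.centos.org/centos-%{OSVERSION}/%{OSVERSION}/os/x86_64/
Include: yum

%post
    # Part 1: operating system packages plus EasyBuild itself
    yum -y update
    yum -y install epel-release
    yum -y install which make patch gcc-c++ python3 python3-pip Lmod
    pip3 install easybuild
    useradd easybuild
    # Part 2: build the requested software as the easybuild user, not as root;
    # fetch the sources first so network problems surface in the first minutes
    su -l easybuild -c 'eb --fetch GCC-9.3.0.eb'
    su -l easybuild -c 'eb --robot GCC-9.3.0.eb'

%environment
    # Part 3: make the module command and the installed modules available at runtime
    source /usr/share/lmod/lmod/init/profile
    module use /home/easybuild/.local/easybuild/modules/all

%labels
    Author your.name@example.org
    Software GCC-9.3.0
```

The real generated files also write the build commands into ebinstall.sh and set up the .bashrc and aliases, as described in the talk.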
So: sudo singularity build, then the name of the container, where the suffix .sif stands for Singularity Image File, which is basically a single file that contains everything, and then the definition file. You can, and we are doing that in the tutorial, take a container which is already there, like, for example, that GCC container, and open it up. Then you've got something you can use like a chroot environment, and you can use Singularity to go in there. You can then install, for example, bespoke pipelines, bespoke software, for which you do not yet have an EasyBuild configuration file, for whatever reason you might think of. Running the container is very easy. Once you've got it, that can be done in user space. Running it is: singularity run, the name of the container, and then what you want to do. In this case, just to demonstrate it, I've done gcc --version, and it comes up with the version and so on. Now, the really funny thing is you can run this CentOS container on Debian, you can run a Debian container on CentOS, and you can run such a container, for example, on a Mac, if you install Singularity; all of that works. And I believe Singularity is now supporting Windows as well, natively. So you've got a container which runs at least on Linux, any version, on Mac, as far as I know any version, and I believe on Windows as well. In summary, and I'm good on time: Singularity containers are easy to build using the provided scripts, and all you need to know is really the EasyBuild configuration file name, not necessarily the file itself, unless you do fancy things. The only requirement is that you need a Linux environment with Singularity installed. If you don't have Linux, or you do not have sudo access, if you can install VirtualBox and Vagrant, you can use it.
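The build-and-run cycle described above can be summarised like this (the container and definition file names are illustrative; only the build step needs sudo):

```shell
# Build the image from the generated definition file; .sif = Singularity Image File
sudo singularity build GCC-9.3.0.sif Singularity.GCC-9.3.0-Lmod-centos7

# Run the software in user space, no root and no daemon needed
singularity run GCC-9.3.0.sif gcc --version

# Or open the container interactively, chroot-like, to poke around inside
singularity shell GCC-9.3.0.sif
```

Since only `singularity build` requires elevated privileges, the image can be built on a laptop or in a Vagrant VM and then copied to the cluster, where running it is an ordinary user-space operation.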
Again, one of the beauties is, and I hope you got that message, you do not need in-depth knowledge of Singularity or of how EasyBuild works. All you need to know is literally a very few commands, and you need to know the name of the EasyBuild configuration file. As I said before, the Singularity definition file can be published, so if somebody wants to redo it, they can; it's absolutely not a problem. You could think of publishing the container as well, but containers can be quite big. It's one thing to say, okay, I've got a text file, I put it in the supplementary material of my publication; but with 700 or 800 megabytes or even a gigabyte of container, how do I attach that to my publication? That might be a little bit more tricky. Now, as an outlook, you could think of: can we automate this? Can we say, okay, the user knows they want to use R, a particular version, so in my machinery I put in R version 3.5, whatever, and it just builds me the container; once it is built, run this data set through it and give me the output, and then I might destroy the container again, because I don't need it anymore. So you could automate it, which means that from a user perspective, all you need to do is submit a job to a queue, the magic happens, and you get your result back. How that happens in the background, whether it is running locally, in a cloud, in a more secure environment, the user does not need to care about. They can care, but they do not need to. Now, obviously, even though I'm the one giving the presentation, there are always people in the background who are helping. I'm very grateful to Kenneth for EasyBuild, and of course to all the people in the EasyBuild Slack channel. The same goes for the Linux help channel on JetJunkies, where I'm probably hanging out too much, and here, my friend Loki, I'm very grateful to him; he helped me do proper bash scripting, not just so that it works, but so that it actually conforms to standards, and many other discussions.
The same goes for my colleagues in the Rosalind team. I only started there back in December 2019, managed to break my right elbow in January 2020, had my operation, and then we had a lockdown. So I had a very stupid start, but the people there are very warm and helped me a lot; they're great people. The good news is, if you are interested in joining us, you can: there's a link down there, we are looking for a full-stack engineer. As I said, the slides will be published. I also need to advertise that there will be a tutorial; if you want to learn more, please sign up on the tutorial webpage. I would like to thank Christian and the team from Amazon for organizing that and providing the space. As I alluded to before, in the tutorial we build a simple Singularity container so you understand the mechanism, then we open up the container and it's basically up to you to install software in it. There's also a channel for that. That is my last slide; I think I'm actually dead on time. Are there any questions? Indeed. Thank you to Jörg for that. Just a reminder, if you're watching the live stream on YouTube, then ask questions in the EasyBuild Slack, in the EUM channel. If you're watching on Zoom, then there's the option to raise your hand; there's a reactions option towards the bottom of the Zoom window that allows you to raise your hand, and then we'll unmute you and allow you to ask a question. We have our first person: it's Victor from CSCS. A simple question. I'm very happy that you have developed this tool, because we have a similar tool at CSCS; it's more or less exactly the same work that you're doing. But at CSCS we have the need to allow users to install different MPIs. We are also concerned about the size of the images. Have you considered allowing an option where users can customize this on the command line, or by any other mechanism?
The packages that get installed, like I saw that you install libverbs, are not necessarily needed for all clusters, so leaving them out would decrease the image size. Have you considered that? That's question number one. Question number two is related to doing multi-stage builds, where you compile the code and just copy the binaries of that code into a different container, so that you have a smaller image. And the third question, if I may ask: have you tried MPI? So getting native MPI performance for the clusters? Right, going backwards: I haven't tried MPI. I don't think any of our users has used it, so I don't know, is the honest answer. I haven't done any performance tests. It's a very good question; I really should look into it. Regarding whether you can remove the packages which are required for the operating system and so on: my aim was to keep it simple. Because you've got the definition file, if somebody asks "oh, can I remove these packages?", that means they are a little bit more advanced, and that means you can just create the Singularity definition file and then edit it the way you want. I do, for example, on a regular basis, just install things on Debian, which I don't think is on the list. So it depends on the user. If a user just wants to get a container, that is the aim of the script. If the user wants to do more, because you get the definition file, and of course you can download it from GitHub and modify it, they can do that; I'm not ruling it out. The size of the container is something I have noticed. In particular, when you build R, that can very easily go up to a gig, so you are firing up a very large container. Even though the sources of R and all of that are removed at the end, it's still a large container, because R is simply big. I haven't found a good solution for that one. I think I've missed out one question, didn't I? Opening up the container, I think it was.
What you can do, and what I have done, is build, usually, GCC as a base container, then open it up and use EasyBuild, for example, to install the software I need, or install software inside that container for which there aren't any EasyBuild configuration files yet, and then, obviously, pack it all up again and ship the container. The flip side is that here you don't have a definition file. Does that answer all your questions? I've noticed there are two more. Yeah, so the question is more related to the last one, about the multi-stage build, right? Instead of just doing the eb, you could do eb with rpath, and then just copy, I don't know, if you built GROMACS, you just copy GROMACS with the binary and all of its dependencies, and that's it, right? Then you leave out all these unnecessary things that get installed in the container, and you decrease the size. You probably could do that; I haven't tried it, to be perfectly honest. Okay, because it works pretty well for us. I don't know if you want to implement that yourself too. Yeah. We can discuss it offline later on if you want to. Yeah, sure. Cool, thanks. Just ping me on Slack, but obviously, like everybody, I'm busy. There were two more questions. So what we're going to do at the moment is ask people with any more questions to move to the breakout room, to give us a chance to set up for the next talk.