My talk: the challenges of installing software on HPC systems. I'm Santiago Lacal, a research computing support analyst, working within the research computing service, which is part of central IT. This is going to be, hopefully, a light-hearted talk: half rant, half confession, half sharing my experience of installing software over these past years. I'm going to speak briefly about scientific software and compiling it; about Conda and the shortfalls and issues I've found working with it; about system package managers; and briefly about Singularity and containers. Then challenges related to software installation that aren't really intrinsic to the software itself, a bit about high-performance software and, of course, EasyBuild: what role it plays, and my apprehension about adopting it at first.

So, the challenges of installing scientific software. A few years ago they literally threw me in the deep end and said: you are now completely responsible for installing software. And that was challenging, because I didn't know how difficult it was. I was used to simple wizards, click next, or maybe configure, make, make install, and that's it. But I realized these installations can be very, very time consuming. I'm talking hours, days, weeks. Sometimes you get requests from users to support software that's no longer actively developed or supported; you don't even know when it last actually worked. So you've got to figure it out, you've got to think outside the box. Then there's the lack of documentation. I was just discussing this with Kenneth earlier. Poor software engineering practices from developers. And I'm not just talking about things like enhancing code readability, keeping code efficient, version control, being descriptive, keep it simple, stupid. Just including a README file would help sometimes, or just explaining what the software does. Sometimes you'll be lucky if you see a file extension that actually looks familiar; at least that can point you in the right direction. So that's challenging in itself.

And dependencies. Of course we've got to talk about dependencies. You'd think this is going to be a sort of puzzle, a challenge, something to look forward to. But you can get into a big mess; you can get into dependency hell. Long chains of dependencies, conflicting dependencies, circular dependencies. Diamond dependencies, that's my favorite one: library A depends on library B and library C, but each of those depends on a different version of library D, and you can't have two versions of library D loaded at the same time. And, well, NPM: I'm not going to talk about NPM and the fiasco a few years ago. But a few examples of how things start. You get a request to install PETSc and you think, okay, I'm good to go, let's see what I have. Oh, I need BLAS. All right, is it the right version? Oh, but am I going to build with GCC or Intel? This one's installed with Intel, but the user requested GCC. Okay, so I need to recompile it. But then I need to recompile, I don't know, HDF5 as well, because it's not the parallel build. As you can imagine, you start pulling that string, going down that rabbit hole. What didn't seem like a simple software installation ends up being a beast. And that's if all those packages end up being named and organized in a tidy manner.
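The diamond case above is concrete enough to sketch. Here is a minimal, purely illustrative bit of Python (the package names and version constraints are invented): a naive walk over the dependency graph flags the point where two incompatible requirements on library D meet.

```python
# Minimal illustration of a diamond dependency conflict.
# The graph and version constraints are hypothetical.
deps = {
    "A": [("B", None), ("C", None)],   # A needs B and C, any version
    "B": [("D", "1.0")],               # B was built against D 1.0
    "C": [("D", "2.0")],               # C was built against D 2.0
}

def collect(pkg, wanted=None):
    """Walk the graph, recording which version of each package
    every consumer asks for."""
    if wanted is None:
        wanted = {}
    for dep, version in deps.get(pkg, []):
        wanted.setdefault(dep, set()).add(version)
        collect(dep, wanted)
    return wanted

for pkg, versions in collect("A").items():
    versions.discard(None)  # "any version" never conflicts
    if len(versions) > 1:
        # Only one version of a library can be loaded at a time,
        # so this request cannot be satisfied as stated.
        print(f"conflict: {pkg} required at versions {sorted(versions)}")
```

Running it prints a conflict on D, which is exactly the situation where a simple install request turns into recompiling half the stack.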
When I actually arrived it was, and I have to half admit it still is, a bit messy: modules and folders with different naming conventions, names that aren't fully descriptive, modules that are incomplete and don't load their full dependencies. It's really playing Russian roulette.

So, compiling software. There's a lot to think about, one thing being environment modules, like I mentioned earlier. We're using Tcl, Environment Modules 4.7 I think, and it starts getting very challenging. There are a lot of things you've got to keep in mind, and obviously you don't want to reinvent the wheel. The approach I had was: okay, I'm going to be organized, I'm going to try to start from scratch and clean up the software stack. But you can't really do that, because all the time I spent on it I wasn't actually making changes. So I decided to just start writing things down: okay, this OpenFOAM module is actually misnamed, hang on a minute... A couple of hours later I had 200 things written down, and I decided it just wasn't feasible. It's better to start a new software stack, which I didn't do for years. I said, let's leave it for the next guy. But then again, that's not sustainable. You keep living in that mess, you sort of keep at it, but how do you deal with many requests at the same time?

Conda was a lifesaver in some aspects. It supports over 7,500 packages. It's mostly aimed at end users and researchers, not necessarily at HPC, but it empowers users. So I was pushing the problem to them: here's a tool, you can manage your environment, you can install your own packages, and that will reduce support requests. I primarily lead user support, so for me that was amazing, in theory. Pre-built binaries, so it's fairly quick. I say fairly, because when requirements are defined very laxly, the Conda solver takes quite a while to do its thing. Users tend to complain that it's frozen, et cetera, et cetera. That's because they've requested 20 packages with no strict version constraints at all: give me any version of R, any version of data.table, any version of this, any version of that. Conda also plays well with other services; this is Jupyter on Open OnDemand. For a lot of applications users tend to go straight to: I just want Jupyter, I just want to jump on that. That's what I know, that's what my supervisor told me, that's what I've seen in the latest YouTube video, that's what I want to do. And Conda has seen wider and wider adoption among software developers as sort of the default method to install their software.

However, it has some caveats and shortfalls that I've come to realize over the years. It can't be used for everything; obviously it's not a replacement for EasyBuild or Spack. Most of what users still run through Conda, and it's widely used on our cluster, is Python and R. Now, big environments can become very delicate. They grow very fast, especially with R. Here's an example: r-base plus six packages already comes to 247 dependencies. Imagine you add a couple more packages and you're already at 400 dependencies. That environment is delicate, in the sense that users will just do a Conda update, or update one big package in that environment like r-base, and they won't realize what that means; they'll just say yes, and it'll break. It'll break. I mean, hopefully it doesn't.
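Those loose requirements that stall the solver are easy to spot mechanically. As a purely illustrative sketch (the file name and the pinning convention are assumptions), this is the kind of check a support team could run over a user's environment file before Conda grinds on it:

```python
import yaml  # PyYAML

# Hypothetical environment file exported or written by a user.
with open("environment.yml") as f:
    env = yaml.safe_load(f)

unpinned = []
for dep in env.get("dependencies", []):
    # Conda specs pin versions with "=", e.g. "r-data.table=1.14".
    # A bare name lets the solver consider every build ever published.
    if isinstance(dep, str) and "=" not in dep:
        unpinned.append(dep)

if unpinned:
    print("Consider pinning these before solving:")
    for name in unpinned:
        print(f"  {name}")
```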
A lot of the time it doesn't, but it's delicate, and it increases user support requests to fix it. In that sense, users tend to run a conda update --all, or just say yes when it says it's going to upgrade the version of OpenSSL. A few years ago, in Conda 4.8 I think it was, there was a bug where you did a conda update, and it updated Conda but didn't update Python, and you ended up with an error. It was fixed later, but there was really no fixing a broken environment: pretty much scrap it, install it again and restore your environment from a snapshot. Conda requires some playing around to learn, and being patient with these kinds of quirks. Maybe even use Mamba, a drop-in alternative that is a bit more verbose when things fail, instead of just "failed" or "couldn't resolve dependencies". It's another snake, by the way. Because of these quirks, on the user support side you need to write a bit more documentation for the users, a bit of TLC, to say: if you're going to use Conda, that's fine, but FYI, keep these considerations in mind.

A lot of users, especially in the biosciences, also want to install things that aren't in Conda. That's where you get into the question: are you going to reinvent the wheel and create 20 Conda recipes and push them up to Bioconductor or Bioconda? It gets time consuming. So what do you do? You say: okay, I'm going to install with Conda all the dependencies I can, and then I'll install the rest with pip, or with install.packages in R. That's fine in theory, but it becomes very tedious, very long, and Conda doesn't play very well when you start mixing and matching pip and install.packages into its environments. I've found that if you do it as a last resort it works most of the time, but you do have to take very good care when you later update those packages and something says, hey, I want to update all of these.

Other package managers: I've been very tempted over the years to say, well, there's a yum package for this, so I can just do a yum install. The problem is that a few years ago we wanted to do a bit of spring cleaning and decided to thin down, give a bit of cardio to, slim down our node images, particularly the login node. We had 2,000 libraries in lib64, from HDF5, NetCDF, all sorts of things, which were fine initially, until we realized later that there was software linking against them that didn't actually declare them as dependencies. And when we slimmed down the compute node images as well, suddenly that library was no longer there, and you're in a situation where you have to rebuild that NetCDF, rebuild that software. So the kind of approach is: if you're going to do a yum install, it can be okay, but you might be shooting yourself in the foot for the future. And most importantly, those packages are not optimized for the architecture; the same goes for Conda, which I didn't say earlier, but yes.

Containers: we started using Singularity a lot as another tool. It provides a practical solution, like it says here, for distributing large software stacks with a lot of moving parts, particularly with groups in different locations. But sometimes these containers are not easy to create, especially if there are a lot of dependencies. It starts being very time consuming. And it's a bit of a gray area: am I responsible, or is the user?
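That "undeclared dependency on the node image" problem is something you can audit for before slimming anything down. A rough sketch, with an invented software-tree path, of using ldd to find binaries that quietly resolve libraries from /lib64 on the image:

```python
import subprocess
from pathlib import Path

SOFTWARE_ROOT = Path("/apps")  # hypothetical root of the software stack

def system_links(binary: Path):
    """Return the shared libraries this binary resolves from /lib64."""
    out = subprocess.run(["ldd", str(binary)],
                         capture_output=True, text=True)
    # ldd lines look like: "libhdf5.so.10 => /lib64/libhdf5.so.10 (0x...)"
    return [line.split()[0] for line in out.stdout.splitlines()
            if "/lib64/" in line]

for binary in SOFTWARE_ROOT.rglob("*"):
    if binary.is_file() and binary.stat().st_mode & 0o111:  # executable
        libs = system_links(binary)
        if libs:
            # These builds will break if the image loses those libraries.
            print(binary, "->", ", ".join(libs))
```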
Should I empower them to build it themselves? Most users don't want to learn this, but they should. Obviously we're getting into a common topic here: researchers don't have time budgeted for learning, learning about compute, learning Linux, learning how to use HPC systems. It's not recognized by their supervisors, it's not recognized by grant givers, so for them it's time that is lost, and they don't want to spend it. It's actually really reassuring, a really good feeling, when you find a group, when you find an individual who says: I want to learn, I want to be involved, I want to empower myself to support myself. In terms of containers, leaning on the OS package manager inside the container raises the same question again: am I really giving high-performance binaries to this individual? It's just generic, not optimized. And a lot of the time groups will want continuous support for that container; in particular they may want to modify the source code and do continuous updates, which means rebuilding the image. That's where we'd really push them to do a little bit of training so they can manage their own container. It's definitely a tool; it obviously doesn't replace Conda, it doesn't replace EasyBuild, but it's definitely there. What I'm getting at with these two is that they should not be the first route to delivering software.

Then there are challenges that are not related to the software itself, which I've found very prominent over these past five years. One is how to manage the sheer number of requests: we get dozens of requests a month, and we're a small team. To give you some context, we have over 2,000 compute nodes in a heterogeneous cluster; the majority of the compute is AMD EPYC, Zen 2, Milan... Rome, sorry, 128 cores per node. We've got, how many GPUs? Don't quote me on that; we're in the middle of a hardware refresh. Over 74,000 cores, and we're currently four in the team. We're hiring. So we're a small team, which means managing incoming support requests is difficult; it's difficult to meet those demands. And that comes with another challenge: setting expectations with users, with customers, users with varying degrees of technical knowledge. Making them understand that at this level installations are not 20 minutes; installations can take days, even weeks, depending on how complex the stack is or what solution they want: a VM, a database, all these moving parts. It takes time. And that's just the installation; then there's benchmarking, which is another topic we haven't discussed yet, one that's not accounted for and that researchers normally overlook completely: I'm just going to run it and, you know, we'll see.

Then there's efficiently maintaining software stacks. I mentioned some of the mess we had. Looking at this one here: at some point we decided to stop with 8.06, 8.4 and just name it by the date instead. Or 6.6, 6.7 and then suddenly 20, 21. And that translates into the modules. How do you clean that up afterwards? Because users, as you all well know, will hard-code those names into their job scripts forever and ever and ever. And do you want to keep symlinks around forever and ever and ever? So maintaining software stacks, organized planning, deciding when you're going to upgrade the default modules, these kinds of things have to be kept in mind.
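Those hard-coded module names are at least easy to hunt for before retiring a stack. A purely illustrative sketch (the paths and the deprecated-module list are invented) of scanning job scripts for module load lines that reference modules slated for removal:

```python
import re
from pathlib import Path

# Hypothetical set of module names we want to retire.
DEPRECATED = {"openfoam/8.06", "hdf5/1.8.4"}

# Hypothetical location and extension of users' job scripts.
for script in Path("/home").rglob("*.pbs"):
    text = script.read_text(errors="ignore")
    for lineno, line in enumerate(text.splitlines(), 1):
        match = re.match(r"\s*module\s+load\s+(\S+)", line)
        if match and match.group(1) in DEPRECATED:
            # These jobs will break when the old stack disappears.
            print(f"{script}:{lineno}: loads deprecated {match.group(1)}")
```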
And ensuring organization and standardization across the stack: the directory naming, the modules, are they done well? Doing all of this manually is very daunting. It's just very daunting. And lastly, ensuring the software is optimized for each architecture. Like I said, we have a heterogeneous cluster with different types of nodes, and we want to make sure that software is running as well as it can. This is HPC: are we actually delivering high-performance software? One of the key things, and I fall into this group, is: I deliver a piece of software, or a user asks for something and Conda works great, so the user is happy. Should I be happy? Should I be content? I mean, I have another 40 requests waiting in the queue, so I'm tempted to just say, okay, the user is happy, and forget about it. But I should ask myself: am I actually giving them the best build there is, the best version I can? Maybe it could run faster, maybe 5%, maybe 10%. Those are considerable numbers. That's time saved for the user, reduced compute time and wall time, a smaller carbon footprint, saved electricity and money. This is not something we should overlook. A lot of our users, don't get me wrong, will be happy with just the Conda install. But if there's an alternative way, why shouldn't we make a little extra effort and say: it's going to take two days, but I'm going to build it with those processor optimizations, I'm going to build it with MKL, with LAPACK. A little bit of effort, because in the long run it's going to be better. And when a user, which happens all the time, is on a deadline and needs to run 6,000 simulations in two weeks, I'm sure that if I tell them I can make it run faster, they'll be super, super happy. And that's what generally happens: they tell us at the last minute, oh, I need to run it faster, and by then it's too late.

So this is where EasyBuild, and the person behind me, Mr. Sassmannshausen, who really pushed for it, come into play. Now, like I said earlier, I was very apprehensive about adopting it at first, I have to admit; don't look at me like that, don't get me wrong. Why? I thought initially it was just going to add another layer of complexity. And I know makefiles: if I've got a really complex installation, I'd rather fight with that directly. But that was a misunderstanding on my part, because EasyBuild is not there to help me with those complicated installations, at least not in the first place. If there's no easyconfig file already created, I'm going to have to install the software manually first and figure out how it's installed. And once I've done that, then sure, I write an easyconfig file for posterity and to share with the world. But I would have needed to install it manually in any case, whether I was doing it with EasyBuild or not. Where EasyBuild is helping me is with that complex installation I did of OpenFOAM: I installed it with Intel and got everything right, and then two days later someone says, oh, I want GCC, and I want it two versions back. EasyBuild just says: sure, if it's available, here it is. And it does it automatically. It manages all the dependencies, and it takes care of standardizing the modules and the folders.
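That "same software, different toolchain" request maps directly onto EasyBuild's command line. A hedged sketch of driving it from Python; the easyconfig file name here is illustrative (in practice you'd look one up with `eb --search openfoam`):

```python
import subprocess

# Hypothetical easyconfig for the build that already exists.
easyconfig = "OpenFOAM-8-intel-2020a.eb"

# --robot resolves and builds any missing dependencies automatically;
# --try-toolchain asks EasyBuild to attempt the same build with a
# different compiler toolchain (here GCC-based foss instead of intel).
subprocess.run(
    ["eb", easyconfig, "--robot", "--try-toolchain=foss,2020a"],
    check=True,
)
```

The point is that the second toolchain costs one command rather than a second round of manual dependency archaeology.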
So it addresses quite a few of those challenges we talked about earlier. And if we just take the extra step of making sure those optimization flags are there, we can get it optimized for each architecture. This is a little schematic, very crude, of something Jörg developed. Essentially: we're going to use EasyBuild, it's there, it's available, and sure, you can compile on the login nodes or jump onto a compute node. But why don't we get it to automatically submit jobs instead, use our scheduler and push each build job to a given type of compute node, and compile on that compute node? On top of that, because we had so many issues with our node images and cross-contamination of libraries, we leverage Singularity containers that contain the bare minimum. Bare minimum to the point that at the beginning we even had a few issues where we needed to put some things back; we had slimmed them down too much. But now it means that if any easyconfig, any software build, pulls something in, it's either a declared dependency or it shouldn't be pulling it at all. And it's working wonders. You define the easyconfigs in the software list, run it through the automated build, that sends jobs to PBS Pro, you forget about it, come back the next day and everything has passed; we should see a pass in those logs. And then that's it, compiled per architecture. In the background, obviously, we have different mount points for those software stacks on each of those types of nodes.

Now, to add to the other challenge we had, the level of software installation requests we get, how do we make it even better? Why don't we automate it? This is a kind of future project. Essentially what we'd like is to create an online form that just lists software and versions from the easyconfigs, and the user can say: I want this. It checks if it's available; if it's installed already, it tells the user. If it's not, it creates a ticket in whatever ticketing system we have, spins up those build jobs automatically, and we only get involved if there's a failure. That would do wonders.

[Audience question: is each piece of software built in a separate container?] In one sense, yes. It will mount the software stack already built in-house for that architecture into the container, so it actually detects: hey, we already built this. But obviously if I'm installing on different architectures, those are different stacks. And the container itself is slimmed down.

Now, to give you some context on where we've got to so far with EasyBuild: our old stack, which has had software going into it since 2013, holds 2,700 modules. In the new stack, courtesy of Mr. Sassmannshausen behind me, we already have 2,137 between our production and development stacks. We're almost there. And you can imagine how much time we spent here versus how much time we spend there, because I remember spending weeks and weeks installing certain software. Thank you. Questions, any questions?
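Before the Q&A, a rough sketch of the scheduler-driven build flow described above. Everything here is hypothetical: the queue names, stack prefixes, image path and easyconfig are invented, and PBS Pro's qsub is simply fed a generated job script on stdin:

```python
import subprocess

EASYCONFIG = "OpenFOAM-8-foss-2020a.eb"  # hypothetical example

# Hypothetical mapping of node types to PBS queues and stack prefixes.
ARCHS = {
    "rome":  {"queue": "rome",  "stack": "/apps/rome"},
    "milan": {"queue": "milan", "stack": "/apps/milan"},
}

for arch, cfg in ARCHS.items():
    # The job runs EasyBuild inside the slimmed-down Singularity image,
    # with this architecture's software stack bind-mounted in, so the
    # build can only see declared dependencies; the compiler then
    # optimizes for whatever CPU the job lands on.
    job = f"""#!/bin/bash
singularity exec --bind {cfg["stack"]}:/apps /images/build.sif \\
    eb {EASYCONFIG} --robot
"""
    # qsub reads the job script from stdin; -q routes the build to the
    # queue whose nodes have the matching architecture.
    subprocess.run(["qsub", "-q", cfg["queue"], "-N", f"eb-{arch}"],
                   input=job, text=True, check=True)
```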
[Audience question, partly inaudible, about whether these optimizations are straightforward.] Sorry, say that again? Yes, 100%: the optimizations are not straightforward, and it tends to be: first go to the user, check the job script, what are they doing in the first place? Are they just asking for 300 nodes when their program is single-core? Are they needlessly copying a 20-terabyte file in an array job and reading from that same file 3,000 times? One of the recent examples I have with EasyBuild, with a colleague of mine: they were helping someone run scikit-learn from Conda, multi-core on CPU, and it wasn't working. So they went the EasyBuild route, and they went down from seven minutes per iteration to 40 seconds. In the end they did get it working multi-core with Conda too, but it still showed a difference of about 20% in favor of the EasyBuild build. 20% is considerable here, and it's not something we should just overlook. But yes, I agree, in the majority of cases it won't be the compile-time optimizations that are really holding performance back; it's other factors. And a lot of the time it's users, I'm sorry, but I have to say it, users doing something silly that they shouldn't be doing. Or it's the application itself that's just coded terribly: terrible practices, writing thousands of times a second, that kind of thing.

[Audience question: what is the advantage of having different Singularity containers?] Well, it is one Singularity container image, but we compile with the host optimizations of wherever it's actually running, and mount a different software stack on each architecture, the ones that apply. The same image runs on the different architectures; it detects which architecture it's on and then mounts the right stack inside the container: the GPU stack, the Milan stack, et cetera. And then obviously the builds use -xHost, or -march= whatever the Rome one is. And the obvious follow-up question: how do we do the detection? There's a little Python program that detects the architecture. I just run it, it comes back with Milan or Rome or whatever, and then in the script I know exactly what to do. [Kenneth: it's not our own, and it's not new; it's a piece of technology reused from another project, just to highlight that these two projects, EasyBuild and Spack, can learn from each other, and that makes it really easy to incorporate new technology into your methods. Archspec, yeah.] Yes, just to repeat what Kenneth said: it's the archspec tool, which came from Spack and was then pulled out and enhanced.

[Partly inaudible exchange about container contents and how they integrate with the module system.]

[Audience question: do they have their own repo, or do the containers themselves come from Sylabs or elsewhere?] They have containers that they support, and you can make your own. They have a lot of bio stuff. Yeah, happy to help.
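For reference, the detection described above maps onto the archspec Python package that was spun out of Spack. A minimal sketch; the stack paths in the mapping are invented for illustration:

```python
import archspec.cpu

# Detect the host microarchitecture, e.g. "zen2" for Rome
# or "zen3" for Milan.
uarch = archspec.cpu.host()
print(uarch.name)

# Hypothetical mapping from microarchitecture to the software stack
# that gets bind-mounted into the container on that node type.
stacks = {
    "zen2": "/apps/rome",
    "zen3": "/apps/milan",
}
print(stacks.get(uarch.name, "/apps/generic"))
```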