Next talk is from Anke Kreuzer from the Jülich Supercomputing Centre. This is a site presentation of the work they're doing there. Anke, you have about 15 minutes for the talk, over to you.

OK, can you hear me? I hope you can hear me. Yes, we can hear you. OK, great.

Yeah, so hi, I'm Anke and I'd like to give you a short insight or update on how we are doing things at JSC with EasyBuild, especially now with our modular supercomputing architecture, MSA for short. We have five systems at the moment which are set up with EasyBuild, and here's a short insight on that.

We started with a dual approach at JSC, having highly scalable systems like JUQUEEN and general-purpose clusters like the old JUROPA. A few years back, we decided to merge this approach, so that we now have one single system containing several parts with different purposes. So now we have machines with a cluster part and a booster part to merge these approaches. Our biggest system at the moment is the JUWELS supercomputer. It has a cluster part with Intel CPUs and NVIDIA GPUs and a booster part with AMD CPUs and NVIDIA GPUs. And as Alan mentioned before, the JUWELS Booster is in the TOP500 list now at position seven, so that's really great. We also have the JURECA system with its cluster part, JURECA-DC, with AMD CPUs and NVIDIA GPUs, and the JURECA Booster with Intel KNLs. Then we have some smaller systems like JUSUF with AMD CPUs and NVIDIA GPUs, the HDF-ML system with Intel CPUs and NVIDIA GPUs, and the DEEP-EST prototype with its cluster module with Intel CPUs, its booster with Intel CPUs and NVIDIA GPUs, and the data analytics module with Intel CPUs and NVIDIA GPUs. The last one also has Intel FPGAs, but the software for the FPGAs is not yet in the EasyBuild stack, and I'm not sure it will be, because it's the only system with FPGAs, so having that software globally does not make sense at the moment.

OK, so what does this mean for our software stack? As you have seen, we have two systems with only one module, so non-MSA systems, but we also have three systems with at least two different modules. And as you've seen, there are quite a few combinations: systems with AMD CPUs and NVIDIA GPUs, Intel CPUs and NVIDIA GPUs, or even Intel Xeon Phi, so KNLs. Different node types can mean different requirements, and in our case, as you've seen, the node types not only differ from system to system, but also from module to module. So when you're running jobs across different modules, you need to make sure that you load the software from the proper stack. For this, we at JSC use the xenv tool to make sure that the modules you load for the cluster part are also coming from the software stack of the cluster part.

Another thing to consider on an MSA system is that you might not have login nodes for all parts of the whole system, but maybe only central ones, like on JURECA or on the DEEP-EST system. To make sure that when you install software it's installed for the right part, we introduced architecture modules that need to be loaded before you install the software. In our case, we now have a package base with packages that can be installed on all systems, and we also have some overlay directories where we put easyconfig files that needed to be adapted to be able to run or install on the different modules. In total, we have 689 packages, so the base plus the overlays. And we support three compilers, namely GCC, Intel and NVHPC.
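To give a rough idea of what one of those overlay easyconfigs can look like, here is a minimal, purely illustrative sketch; the package name, toolchain, versions and CUDA settings below are made up for this example and are not one of the actual JSC files. The point is just that the copy in the overlay directory only changes the module-specific parts, here the GPU settings for a booster:

```python
# Hypothetical overlay easyconfig for a GPU module (illustrative only, not a real JSC file).
# The generic version lives in the shared package base; this adapted copy sits in the
# overlay directory for the booster and only changes the GPU-specific settings.
easyblock = 'ConfigureMake'

name = 'ExampleApp'        # made-up application name
version = '1.2.3'

homepage = 'https://example.org'
description = "Example application, built with CUDA support for the booster module"

toolchain = {'name': 'gpsmkl', 'version': '2021'}   # assumed toolchain name/version

source_urls = ['https://example.org/downloads']
sources = [SOURCE_TAR_GZ]

dependencies = [('CUDA', '11.0', '', SYSTEM)]        # GPU modules pull in CUDA

# only this overlay copy sets the compute capability, matched to the GPUs of that module
cuda_compute_capabilities = ['8.0']

moduleclass = 'tools'
```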
We also support three different MPIs, with ParaStation MPI, Intel MPI and Open MPI. For math, we have MKL and we also want to add BLIS, but at the moment that's work in progress. And in our case, that makes for 19 different toolchains at the moment.

Our hierarchy for the toolchains looks like this. We have the system toolchain with packages like UCX or VTune. We have the GCCcore toolchain with packages like OpenGL or Python. On top of GCCcore, we have installed the compilers. The compiler toolchains have packages like HDF5 or METIS. And then on top of the compilers, we have the MPIs in different combinations: ParaStation MPI, for example, is installed on top of all of the compilers, Open MPI on top of GCC and Intel, and Intel MPI only for Intel. Within the MPI toolchains, we have packages like HDF5 and Score-P. Most of the MPIs also have CUDA as a dependency, since most of our systems have GPUs. And for each compiler and MPI combination, we also have a toolchain with MKL on top, and in these toolchains we have packages like GROMACS or PETSc. Then, as mentioned before, we want to add BLIS as well, but this is still work in progress so far; the plan is to hopefully have it for our next stage.

So what is the view for a user when initially logging into one of our systems? By default, the GCCcore module is loaded to make sure the user sees all the available compilers and also some binary tools and packages installed with GCCcore. If you then load a compiler, you also get the possible MPI runtimes and the packages on top of the compiler, and the same when you load an MPI runtime: then you get to see the packages built with that MPI. If you now load a full software stack for your purpose and notice that you need to change the MPI or the compiler version, you can just load it and Lmod will then swap the branches and activate or deactivate the modules accordingly.

We also have a lot of hidden modules, because the users don't need to see all available modules; some of them are only needed as dependencies for other packages. We have over 150 of those hidden packages, just to keep what a user can see or choose from manageable and make it easier to find the right packages.

Another thing we do is bundle extensions. Some packages like Python, R or Perl need lots of extensions, and in most cases these extensions are only needed by that one package. So making each of the extensions a separate module and installing it by itself would be excessive, and so we bundle them as extensions of the corresponding packages, giving us the Python, R and Perl bundle packages.

Our stage concept at the moment is that we redeploy software for a given timeframe; in our case it's one year, so we update everything once a year. A stage is more or less a simple directory where we put everything in. For each stage, we also have a development stage where new software versions, configurations and things like that can be tested. Tested software is then added to our repository and deployed to production. We are also in the test phase for user-based software installations, where a user can load a module to be able to install new software. In the future, the idea is that each user could install their own software and we wouldn't need a whole development stage anymore. But as I said, that's still in the test phase.
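To make the extension bundling from a moment ago a bit more concrete, a bundle easyconfig could look roughly like the sketch below. The bundle name, toolchain version and extension list are only examples, not the actual JSC bundles:

```python
# Illustrative sketch of bundling many extensions in one module instead of many
# separate modules (example names and versions only, not the actual JSC bundle).
easyblock = 'PythonBundle'

name = 'Python-extras'     # made-up bundle name
version = '2021'

homepage = 'https://www.python.org'
description = "Collection of commonly needed Python extensions, provided as a single module"

toolchain = {'name': 'GCCcore', 'version': '10.3.0'}   # assumed GCCcore version

dependencies = [('Python', '3.8.5')]

# each entry becomes an extension inside this one module, so users load the bundle
# instead of dozens of tiny modules that are only ever needed together with Python
exts_list = [
    ('six', '1.15.0'),
    ('packaging', '20.9'),
    ('pyparsing', '2.4.7'),
]

moduleclass = 'lang'
```

The R and Perl bundles follow the same idea.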
Switching between the stages is done during a maintenance window, so the users are not additionally affected in the sense that they can't use the system, because maintenance is necessary anyway for several things, so that's okay. If a user then notices, oh, my old compiler version isn't there anymore: yes, in the new stage we won't copy the old compiler or other old versions, but they are not gone. The development stage and the old stages are still available, just not visible by default.

To make the user installations a little easier, or at least smoother, we enable some hooks. For example, we have the parse hook to manage the software installations: there we inject the module families, saying, okay, this package is a compiler, an MPI or a toolchain, and we also add the appropriate site contacts, so that for each piece of software it's clear who the responsible person is. We also have a pre-ready hook to check for some bad behaviour, like: is the user trying to use an unsupported toolchain, or are they even trying to install GCCcore by themselves, which shouldn't be the case, or installing a non-JSC MPI, which we also don't want. And then there's the end hook to check if the user is part of the development group or the software group, because then the installation should be system-wide.

OK, so the software team I mentioned before consists of a small core team responsible for the core installations like GCCcore, the compilers and the MPIs. It also supervises the quality standards of the work from the software team, so it checks whether the tested software has the correct dependencies, has proper programming in the easyconfigs and things like that, and then installs things in production. The whole software team consists of several people. We are trying to get at least one person for each field of application who works on the EasyBuild installations, just to have more experts on the team in case there is trouble or optimizations are needed, and also to reduce the work on the software core team a bit. Up to now this works quite well, and all the members of the software team are allowed to install software in the development stage. Anybody can change any other installation in the development stage, so that means the stage might get a bit messy, but that's no problem; it's only the development stage, and the production stage will always stay clean.

One good example of dividing the work across a whole team is that we now have people with experience in each separate field. For example, in visualization we now have a new OpenGL module. The old one had the problem, when it comes to MSA, that you need a version for CPU and a version for GPU, so one version with the Mesa driver and one with the NVIDIA driver. But this would also mean that each package which depends on OpenGL would need two versions, one with the OpenGL CPU version and one with the GPU version, and that would blow up the stack a lot. Another problem was that it was up to the user to choose the right packages all the time, and in a lot of cases that leads to users just always using the CPU versions, so they don't need to think about which one to use, and then later complaining that the performance on the GPUs is not that good. So here our colleague Jens Henrik Göbbert put a lot of effort into the new module, and now we have an OpenGL module that can cover all of it, so CPU and GPU. It is based on the NVIDIA driver and has Mesa as a component, as well as the GL Vendor-Neutral Dispatch library, libglvnd.
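Going back to the hooks I mentioned a moment ago, here is a heavily simplified sketch of what an EasyBuild hooks file along those lines could look like; the group name, contact address, MPI list and the exact checks are placeholders I made up, not the actual JSC implementation:

```python
# Heavily simplified sketch of an EasyBuild hooks file (hooks.py).
# Group name, contact address and the exact checks are placeholders, not the JSC implementation.
import grp
import os

from easybuild.tools.build_log import EasyBuildError

SUPPORTED_MPIS = ['psmpi', 'impi', 'OpenMPI']      # assumed list of site-supported MPIs
SITE_CONTACT = 'sc@example.org'                    # placeholder contact address


def parse_hook(ec, *args, **kwargs):
    """Runs after an easyconfig is parsed: inject module family and site contact."""
    if ec.name in ('GCC', 'intel', 'NVHPC'):
        # let Lmod treat these as a 'compiler' family, so only one can be loaded at a time
        ec['modluafooter'] = 'family("compiler")'
    elif ec.name in SUPPORTED_MPIS:
        ec['modluafooter'] = 'family("mpi")'
    # record who is responsible for this installation
    ec['site_contacts'] = SITE_CONTACT


def pre_ready_hook(self, *args, **kwargs):
    """Runs before the 'ready' step: refuse installations we don't want."""
    ec = self.cfg
    if ec.name == 'GCCcore':
        raise EasyBuildError("GCCcore is installed centrally, please don't install it yourself")
    if 'MPI' in ec.name and ec.name not in SUPPORTED_MPIS:
        raise EasyBuildError("%s is not a supported MPI at this site", ec.name)


def end_hook(*args, **kwargs):
    """Runs at the very end: remind software-group members about system-wide installations."""
    user_groups = [grp.getgrgid(gid).gr_name for gid in os.getgroups()]
    if 'swmanage' in user_groups:                  # placeholder group name
        print("Note: you are in the software group, consider doing a system-wide installation.")
```

A file like this would then be pointed to with EasyBuild's hooks configuration option.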
To come back to OpenGL: this libglvnd library has the advantage that OpenGL now chooses the right driver at runtime. In the GLX case, the driver is defined by the settings of the user's X screen, and in the EGL case it's defined by JSON config files which are listed in the __EGL_VENDOR_LIBRARY_FILENAMES variable. So now all applications can use the same OpenGL module, and also all packages depending on OpenGL can use the same single module, which is really, really nice.

Our software team roadmap at the moment looks like this: the software team installs and tests new software in the development stage, or a few users already test the user-based installations, but after that the software core team needs to install it in production. What we would like to have in the near future is that the software team installs and tests the software with the user-based installations, and then later on is able to install their own packages in the production stage based on an ACL authorization scheme, so that each member of the software group is able to install their own software, and only their own software, in production as well, to reduce the work on the software core team a little bit. And that's all from my side, thanks for listening, and now I'm open for questions.

Thank you to Anke for the presentation. If you have questions in Zoom, please use the raise-your-hand feature and we'll allow you to ask them. If you're watching the live stream, then ask in the EasyBuild Slack and I'll pass the question on. Yes, I had a question, or maybe two or three, if you'll allow me, or if nobody else asks any questions. In the overview of hardware you showed that you have a big variety of systems, but all of the ones that have GPUs, it's all NVIDIA. Is that a conscious decision by JSC to, for now, stay away from AMD GPUs, or is it just the way that it has worked out until now?

I would assume the second point, but I am not quite sure and I'm not part of the decision teams. I think we chose the NVIDIA GPUs because we also have a cooperation with NVIDIA, so we have some people from NVIDIA also working with us at JSC. And in the past we also worked a lot with NVIDIA GPUs, so there's the experience. Yeah, Alex has a comment maybe.

Yeah, so she's right. The thing is that we have an NVIDIA centre in Jülich, so there is an NVIDIA team permanently in Jülich. There is a lot of CUDA knowledge, and when we considered the AMD GPUs, the software stack was just not ready yet. The software would not run on it, so it's fine, they have nice GPUs, but if you cannot run GROMACS properly there, it's pointless. Yeah. Okay.

And then maybe another question in terms of manpower: you mentioned there are three people in the core team that look at things like compilers, I guess Alex is one of them, MPI and things like that, and then a whole bunch more application specialists on top. Can you give a ballpark idea of how many people are using EasyBuild at JSC? A rough estimate would be 10 plus the core team. Okay. I'm not sure, Alex, yeah. That's pretty good. And do you have end users as well who already use it for building their own stuff on top of what's centrally provided, or? Yes and no. One of the people who is testing it is also part of the software team, so he just takes the role of a normal user and tests it. But before we advertise it to the normal users, we just want to make sure that everything works as expected. So that's why it's still a test phase, and a really small test phase. Okay.
So we have the software support team, which is like 30 people, and some of them are taking on this role of dealing with EasyBuild, with our help, with Anke's and my help. And then there is what we call the simulation labs, which are application specialists. Those are people who deal with their software in their own way; they are specialists on the applications, and then we help them to work with EasyBuild. That's it. Okay.

I see Jörg has a question, so I'll pass the virtual mic to him. Cool. Coming back to the NVIDIA GPU and AMD GPU question earlier on, I was wondering: is that kind of a chicken-and-egg problem? If there were more software written for AMD GPUs, or any other kind of GPUs that are around, would you then be more inclined to use non-NVIDIA GPUs, or is it like, okay, we always use NVIDIA, that's the best, and we just stick to that box, if you see what I mean?

Yes. But I think that's something that's not decided on our part, so that's a step higher than we are, so I don't know. We also have smaller systems at our site where we always test newly developed things, and there, of course, we will test whatever we can get our fingers on. But for the production systems, I would assume we would stay with NVIDIA for the time being. Okay. Thanks. We have been testing HIP, that software library, but it really is unfinished. So it's really pointless to buy 4,000 GPUs of a brand that you cannot use yet. So yes, it's a lack of software, and it's not our job to fix every single scientific software right now, because they use CUDA, and the people, for whatever reason, they like CUDA, and there are people who even like the NVIDIA software development somehow. But yeah, it's a fight that is not our fight. Okay. And in-house experience is a big factor of course; if you have a lot of CUDA specialists, then you would have to retrain those people, which again takes time as well. Yeah.

I see Victor has a question, so I'll let him ask it. So when you say that you're going to empower the software stack team to install their own software, does that mean that you currently do manual installations, so that they would then install things themselves by hand using ACLs, or is there a common user who can install it, or a GitHub or Git project that controls the installations? No, the installations are all done with EasyBuild, also by the software team. What I mean by "by hand" is basically: you log into the system, you do module load EasyBuild and run eb yourself? Is it by hand, the manual solution itself, meaning you're typing the EasyBuild command and there is no ReFrame or anything like that? So, yes. But at CSCS what we do is, we have a repository where everybody can push their recipes, and then the machine, a robot, will install the software. So anybody can install software at CSCS, because the machine will install it. No, we have the same with a GitHub repository where we can push our software, but it's not automatically installed on the systems, because at the moment it's checked by our software core team whether everything is okay and things like that, and then one of us three will install it manually, so typing the eb command for the production stage. In that case, I have a suggestion, because we have the same thing, right? We also have a team that checks the easyconfigs and the easyblocks. And then what we do is, we split it into different projects, right?
We have what's called the testing project, which tests the easyconfigs, and someone goes there and checks them manually, but we also have what's called the production one, where we approve a recipe to be installed on the system, and then it goes there and is installed automatically. So you don't need to be logged into the system to do it manually. You guys could implement something like that too. Yeah, that sounds interesting, but I'm not sure if it would be possible with the security restrictions we have at the moment; but it's something we can think over, yes.