So last year I gave a first presentation about LUMI. LUMI had been in general availability for just a couple of weeks at that time, and only the CPU section, not yet the GPU section, which is the main part of the cluster. So this year I'm going to give an update, now that we've used EasyBuild for over a year to install software based on user requests: where we are with EasyBuild, what annoys us and what we like. LUMI is supposed to be one of the fastest supercomputers in the world; in the last Top500 it was in third place. I'm not going to go into too much detail, but just briefly summarize what I told last year. It is an HPE Cray EX supercomputer, manufactured by Hewlett Packard Enterprise, and it has two main sections for compute. One is the CPU section, which we call LUMI-C, which currently has 1,536 two-socket AMD Milan nodes, and then the big part of LUMI is what we call LUMI-G, the GPU section, which has 2,560 nodes. Each node has four MI250X GPUs, where a GPU is really the package with two compute dies, the thing that, as we saw yesterday in the talk of Ian Patrice, shows up to Slurm as two GPUs. The peak vector FP64 performance is supposed to be close to 500 petaflops. It also has matrix compute units that also do FP64, so in matrix compute, a number that is not often quoted, it's actually an exaflop machine in peak performance. So it's roughly one third of Frontier, but it uses the same architecture, at least for the GPU section, and its size is about one tennis court. This is basically what LUMI looks like. It's built in Kajaani, in an old paper factory, where they had a huge hall with three sections in which they can build a containment, which is this part, which they even made a little bit nice on the outside for LUMI, and then this is the machine as it is inside. I think the picture is a little bit enhanced.
There are a bit more reflections in there than I would expect, but you basically have the CPU part here, then the part that has the name LUMI on it is four rows of GPU racks, and then the black part at the end is everything which is storage, admin nodes, login nodes and so on. That's what it looks like today. It's going to be further extended, basically because we are due compensation for late delivery, so that's going to give another two racks of CPU nodes, and it didn't reach its promised Linpack benchmark, so they need to add hardware until they reach that benchmark too, because we really want to be in the Top500 at the promised 370 petaflops, and not the 309 or 310 we're at now. LUMI user support is distributed. LUMI is a EuroHPC system, which means that EuroHPC pays, in this case, half of the bill; the other half is paid for by ten countries in a consortium, the biggest member of course being Finland, but Belgium, which I represent, is actually the second biggest contributor to LUMI. There are support specialists in each country that are part of the central support team, which is called the LUMI User Support Team; the abbreviation is actually just my last name, so I had to be part of it. Besides level-two support, it turned out that we actually also have to supply level-one support, which was first meant to be done by other parties, and user training; we also maintain much of the software portfolio and write user documentation for the system. Level-three support is supposed to come from other entities, though in practice we do a little bit of that also, if only to train ourselves, and also because those other entities have not always shown up yet. Application enabling and methodology support has to come from the local centres, the EuroHPC competence centres, and of course we have the support of a team at HPE and AMD that is also called a Center of Excellence, so the name is a bit confusing.
Support for issues with accounts and allocations is purely national: there is no central allocation system for LUMI. Each of the stakeholders allocates users independently of one another; there is a joint system to manage all that, or basically two systems, as there is one country that has something else. So what we have to do is maintain a software stack for a machine which, and my boss is always angry when I say this, but let's be honest, is a fairly experimental machine. The ROCm software stack is not stable, and the interconnect, well, the time between crashes is a very nice random number generator and you don't have to wait long to get the next number out of it. So it's a new interconnect, a new GPU architecture, an immature software ecosystem, and for the visualisation nodes, instead of throwing in AMD GPUs, they threw in some NVIDIA GPUs just to make it fun, and then a mix of Zen 2 and Zen 3: the compute nodes are Zen 3, the login nodes and visualisation nodes are still Rome. We've got users that come from eleven different channels, not counting subchannels; in Belgium everything consists of more than one, so we have two subchannels, and the Danes do even better, they have six resource allocators that independently allocate users to the system, so it's kind of fun. This has to be done by a too-small central support team: we're basically only nine full-time equivalents for all the tasks that we have. In principle the consortium should contribute; in practice it's not yet really happening, though in Flanders, for instance, we just acquired funding to set up a local support team. One key thing is that everything we do on LUMI is based on the Cray programming environment. It's a key part of our system, and that also means that Clang and LLVM are important compilers for us: both the Cray compiler and the AMD compilers for CPU and GPU are Clang- and LLVM-based.
To make management a little bit fun, and because we were a bit concerned about the performance of the file systems, the software stack is actually on four different file systems, each mounted on a quarter of the nodes, so we also have a synchronisation issue whenever we install new software. Since we are a small team on a machine on which we have to act quickly, also because recommendations change so quickly and our users, at least the leading-edge users, want new compilers and a new software stack all the time, we go for a small central software stack at the moment, with only the high-priority libraries, libraries that are used as dependencies in lots of packages, which we update quickly after the installation of a new programming environment: this time it took me ten days, last time it took me like five days to get a new stack up and running. The other EasyConfigs evolve as needed, so most of the software packages are not installed in the central stack but in the user's directory. Development of those EasyConfigs is driven by requests, and sometimes we even have customised setups for users, because a single version of a package just doesn't cut it on a system like LUMI. Everybody wants their own version of GROMACS, some people want a special patch applied to it with some special functionality, and in that way we can cater to that very easily. Managing such an evolution in the central software stack would be very hard: we basically cannot install over another package while users are using it, for instance, and it's even harder if we have to keep everything compatible. So we go very strongly for personal environments in our setup. Basically I agree with those people from EESSI and so on that all users want a central software stack.
The problem is they don't all want the same central software stack: they want their software central and no other packages. We already get tickets even though our software stack is very small: which modules should I load, I can't really find my way in this mess. Also, with that system of putting more in user environments, we have far fewer problems with version conflicts, and we can move a lot faster when installing things than in a big central software stack. We don't need to be concerned too much about whether this package is going to work with this and this version of that library. If it doesn't work well for that user, we give them another version of it. If they really want another version, for instance OpenFOAM, which uses very old versions of some libraries, we can easily supply that separately. Another argument for personal environments: again, users say that they want the central environment, and then they go on and use Conda, they use containers, they use Python virtual environments. We see users using personal environments all the time. So what we did is we made a setup with Lmod and EasyBuild and a nice configuration module. Users don't load EasyBuild directly; they load a module which we've called EasyBuild-user, and that configures EasyBuild to build on top of the current software stack in a way that completely integrates with the module system. Basically, we build a software stack for each release of the Cray programming environment that we install; 22.12 and 23.03 are currently our most recent ones. Users load the module corresponding to that release, and if they then load EasyBuild-user, just a single EasyBuild-user module, it configures itself automatically for that software stack. Moreover, they don't need to keep that EasyBuild-user module loaded to see their modules; they only need it to install software. After that, it really looks as if they are in the central software stack.
So in that way we still try to give the users the impression of a big central software stack without actually offering one. I've said most of that already. Our software stacks are based on the releases of the HPE Cray programming environment, which means that the compilers are not installed with EasyBuild, even though the rest is an EasyBuild installation. We have four different compilers and three hardware platforms. Luckily the number of combinations is not four times three, because with AMD you have a compiler for CPUs which doesn't work for GPUs, and you have a compiler for GPUs which, they say, gives suboptimal optimisation for CPUs, so you shouldn't use it there. And then there is that third one that we heard of this morning, which we actually don't have installed on our system. We have no hierarchy in the toolchains, only full toolchains, so toolchains with compiler, MPI and the math libraries, though we are actually thinking about how we could implement something that takes the role of GCCcore, simply to reduce the amount of software that we need to install with three or four compilers, and because much software only works well with GCC anyway. At the moment we actually fix the version of EasyBuild for a given version of the software stack, basically to have an environment that is as reproducible as possible. And we bootstrap EasyBuild for each version of the LUMI software stack, so that those different versions are completely independent of one another. So if something happens to LUMI, we can restart very easily without having to go through a lot of history. Or if a user wants a test stack: it's made completely relocatable, so you can just download two repositories, run a script, and you have a copy of the software stack on a different root in which you can start experimenting. So basically we offer our EasyConfigs in two central repositories. If you get a handout of the slides, these will actually be links.
So there is a LUMI-SoftwareStack repository for everything that we install centrally, and then there is a LUMI-EasyBuild-contrib repository for everything that we support a bit less, in the sense that we have done less quality checking and so on, and which we use for user support. We also take contributions from other parties in LUMI-EasyBuild-contrib; we've had a few, but not that many so far. That's something that I've already touched on: we have configuration modules for EasyBuild that configure it for specific tasks. It really is just a single module, but we use the introspection functions of Lmod so that from its position and its name it can figure out what it should do. So we have an EasyBuild-production module for when we install in the central stack, which users cannot even see, and an EasyBuild-user module to install software in the user environment: a single piece of code to maintain, which makes sure that all those modules remain nicely in sync with each other, but which picks up where to install software from its name and its location in the module tree, so we fully exploit hierarchical modules in Lmod. What the user has to do to install GROMACS is basically: load the LUMI software stack, load partition/C to say "I want to install it for the CPU nodes", load the EasyBuild-user module, and then just execute the eb command as you would in regular EasyBuild, with -r because I don't have the robot path turned on automatically; here that's a version that has an additional dependency, PLUMED. And it runs the installation for you, which I'm not going to show because it takes 20 minutes. So that brings me to the new material; what I've said so far is basically a summary of what I said last year.
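As a hedged sketch of that introspection idea: the real modules are written in Lua and the paths below are invented, but the principle of one generic module deriving what it should configure from its own location and name in the module tree amounts to something like this:

```python
# Illustrative only: derive EasyBuild's configuration from the path and
# name of the module file itself, the way Lmod introspection allows.
# The directory layout and module names here are assumptions, not LUMI's
# actual ones.
def easybuild_mode(module_path: str):
    parts = module_path.split("/")
    stack = parts[parts.index("LUMI") + 1]           # e.g. "22.12"
    partition = parts[parts.index("partition") + 1]  # e.g. "C" or "G"
    name = parts[-2]                                 # module name
    target = "user" if name == "EasyBuild-user" else "production"
    return stack, partition, target

print(easybuild_mode(
    "/appl/modules/LUMI/22.12/partition/C/EasyBuild-user/23.03.lua"))
# ('22.12', 'C', 'user')
```

One piece of code, and whether it installs in the user's directory or the central stack follows purely from where in the hierarchy it was loaded.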
We actually make quite a lot of use of the SYSTEM toolchain. That may surprise some people, but for us, because we have no GCCcore equivalent, it's an easy way to install software that we want everywhere, even outside of the EasyBuild environment, because the modules don't load any toolchain when they start the software. In that way we make it available even for users who don't want to use our EasyBuild toolchains. For those packages, to minimise interaction, I actually still use traditional static linking, so I have a separate module with ncurses and a few other packages that those things need, and link them statically. We use that for instance for the build tools: users sometimes ask for a newer version of CMake even if they don't want to use the toolchains that we have, they just want to use the Cray programming environment as is; well, that's an easy way for us to provide it. We are thinking of a GCCcore equivalent, and even Cray is now seeing that sometimes users want to mix compilers and is making provisions for that in their module system, but we're not there yet and we basically lack the time to do a lot on that. Something that I couldn't touch on last year, because we didn't have the GPUs at that time: the GPU toolchains. Well, it turns out that we really didn't need to do anything special for that, or hardly anything special, just load a few additional modules. Contrary to Spack, which went to a module-less setup for the programming environment, we do use the Cray compiler wrappers, and then you basically don't need to do much: it's the same compiler wrappers, just with a different set of modules loaded, and they adapt themselves and most of the time, but not always, produce correct options for the underlying compilers, so that needed very little work. A question we often get asked by the EasyBuild community is: what do you do with ROCm? How do you get that on LUMI?
Well, for us it's part of the Cray programming environment, so in some sense we do nothing. The problem is that the Cray programming environment is very slow in picking up new versions of ROCm: at the moment it's 5.2, or a version of 5.2, that they officially distribute, while we're at 5.4 with 5.5 in the pipeline, already in release candidates, so probably coming out in the next weeks or months. Obviously our users, or at least the leading-edge users, are not happy with those old versions, so we do try to build our own ROCm modules; however, so far we do this from the binaries, we don't try to compile it ourselves. Does this come with problems? Yes, it comes with problems. So far, most of the time it works for us, but one thing you have to be aware of: like any GPU library, you are limited by the driver version. At the moment the driver is the one that comes with 5.2; for 5.4 that's still OK, but we don't know if that will also be OK for 5.5. Another problem with ROCm, and that's why I made a comment this morning, when Patrick Lear spoke about packaging engineering, that I doubt they have a packaging engineer: it's full of hard-coded paths. It's improving, they're removing them. Sometimes they point to version-specific directories, but sometimes also to the generic directory which should then link to the default version, which is quite problematic because it can have performance implications. So far the major problem seems to be with MIOpen, which wants to compile on the fly in some cases and because of this uses an older compiler than the one it should use. I said it because I knew that reaction would come.
So the comment is that someone who's actually working on supporting ROCm in EasyBuild said: look, that path is hard-coded everywhere in ROCm where you don't want it to be hard-coded, so even when you want to compile it from sources, it's really tricky to adapt it to the path that you really want to see there. Concerning the toolchains, I do think we can do a better job with more settings in toolchainopts, but that would need us to gain more insight into how programs that use ROCm will ultimately use CMake, Autotools and so on; that may still be converging. One problem that I bumped into while developing those toolchains is that EasyBuild doesn't really distinguish between the C-based languages. For instance, there is a cstd option in toolchainopts which is used for both C and C++, and we've had cases where we wanted to set both differently and then had to do it via CFLAGS and CXXFLAGS. The reason why I think this may become a bigger problem in the future is that if you want to support HIP, SYCL, OpenCL: these are all C-family languages. Another problem that we've run into is that there is no way to add options to LDFLAGS unless they are -L options, and it turns out that for linking with HIP we sometimes needed to add options there to work around problems elsewhere, mostly in the Cray programming environment, but they may even be needed in general. So now we do it with -Xlinker in CFLAGS and CXXFLAGS, and usually that works because most configure scripts also add CFLAGS or CXXFLAGS to the linker options, but it may become a problem. Can you export LDFLAGS? Yes, that's another thing that we sometimes do, but it's not as nice as setting it in toolchainopts, and in fact with GPU programs in general we often still need to manually work with those variables. Yes, that also exists and I know it exists, and I actually use it.
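That -Xlinker trick can be sketched in a few lines; this is a toy illustration, and the linker option shown is just an example, not one LUMI specifically needs:

```python
# Since there is no toolchainopts slot for arbitrary linker options, wrap
# each one in -Xlinker so it survives being passed through CFLAGS/CXXFLAGS
# to the compiler driver, which forwards -Xlinker arguments to the linker.
def via_xlinker(linker_opts):
    out = []
    for opt in linker_opts:
        out += ["-Xlinker", opt]
    return out

cflags = ["-O2"] + via_xlinker(["--allow-multiple-definition"])
print(" ".join(cflags))  # -O2 -Xlinker --allow-multiple-definition
```

It works as long as the configure script really does put CFLAGS/CXXFLAGS on the link line, which is the fragile part mentioned above.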
But for some things it would be nice to be able to set it as a toolchainopt too, because that makes a better abstraction towards the person who writes the EasyConfig. (The remark from the audience was that there are extra_cflags and extra_cxxflags options, and maybe there should be an extra_ldflags; it may even exist, but I'm not sure.) The second thing which is very important for us on LUMI is documentation. Documentation is really everything, so we make a lot of use of any EasyBuild feature that can be used to produce documentation. One thing is that on an HPE Cray system, users really need to learn to read man pages again, because so much of the Cray environment is documented through man pages and documentation that you cannot easily find on the web. So when we install software with EasyBuild and we see that it comes with man pages, we do make a point of making sure that those are also on the system. But we also developed, and "developed" is a big word because it's really a very simple thing, something which we call the LUMI Software Library, which we generate from markdown files stored with the EasyConfigs. Why? User documentation is separate, but we figured out that if you want to document a package, or something that you did to configure it, or you want to warn the user about a restriction, and you put that in documentation away from the EasyConfigs, then you update the EasyConfigs and you don't update the documentation, because it's not part of your work cycle. So we started with one file, a README.md, which we actually had long before we even thought about automating this into a LUMI Software Library. Then a second file was added, because we figured out README.md is more for the technical persons; let us write a user.md with documentation more at the user level. And then CSC became very concerned about software licenses and wanted us to document those better also. So there are now actually even three files, and that is the result.
It's a website that looks like this. We've got an alphabetical list of the packages, and for instance for the lumi-vnc module it will show you that this is pre-installed on the system; it basically gets that information from knowing where the EasyConfig is. It will show you the license information, user documentation, pre-installed modules and EasyConfigs, so you can even click on one, see the EasyConfig and see what has happened, and then some technical documentation. Not all those parts have to be there; some parts can be missing. And at the very bottom it can even show you archived EasyConfigs: if a package disappears, because everything is archived, you will still find it in the software library, and you will still be able to find back its EasyConfigs and either try it yourself, if you have sufficient feeling for how EasyConfigs work, or contact support and say: hey, you once had that package, could you reinstall it in this or that toolchain; they can find it there. A problem that we have is that we cannot exploit the Lmod whatis lines as much as we would like to. EasyBuild generates a lot of those lines automatically, but that gets broken as soon as you use the whatis keyword: then it only uses the ones that you define. And what is for us a reason to use the whatis keyword is the use of the description keyword. Whatis lines have a structure: they all start with a key and then a value, and an important key is Description, because that's the line that Lmod uses when you ask for, sorry, not whatis, module spider: the summary of all modules, where for all versions of a module it will pick just a single Description.
So the description that is used for the whatis line should really be valid for all versions of a module, but you may want to add more description about specific configurations. For some software packages you need multiple configurations and you want to mention that in the description, or you just want to give a bit more description in module help. So it would be very nice to have something like a short_description keyword that is used for the whatis line, and a description that is used for the help block; and if you don't want to break compatibility with current EasyConfigs, you fall back to description when short_description is not present. I don't think you'd run into compatibility problems that way, so that might be an easy solution. Yeah, so the remark that was made: there was a time when there was a bug in Lmod where, for whatis in module spider, it combined information from two different versions of the package. Now it's more consistent, that's true. But even the current whatis keyword is not enough for us, because it doesn't really enforce the key: value idiom, which I think is a pity. And as soon as you use the current syntax, EasyBuild no longer adds the data that it otherwise adds automatically, and it's so nice that it automatically adds the name, the version and so on. And just an idea, a suggestion, not important for me, but I found it useful as a developer: when I worked on the HLRS system, which is actually using that package from the competition that we just heard about, one of the funny things in the whatis information of a module was that if a package was built with Autotools or CMake, the arguments that were used for the configure script or CMake were in a special whatis line. So as a developer I could see: look, that library is configured this way, and I could see very quickly, without digging in, what concretisation was used.
I could very easily see how the package was built, and I could see whether it was suitable for me or whether I needed to rebuild the library in a configuration that was more suitable. That's just a little thing, not a high-priority thing, but it might be a nice idea for something else that you could add to the whatis lines. EasyBlocks: EasyBlocks on LUMI are a nuisance. They often fail, and sometimes in ways that can be avoided. That is, to my feeling, a little bit the result of the idea in EasyBuild that everything has to be tested, and if it's not tested, it's not allowed. They are annoying for us in two ways. One is when they start testing for compilers: if it is this compiler, then do this; if it is that compiler, then do that. And of course the Cray compilers and so on are not in that list, and it fails. Sometimes this is understandable, because sometimes you can expect that you will need a different flag for every compiler. But suppose, like at some point, I think it was GCC 4.9 or 10, you needed to add an option for some packages that relied on behaviour that had changed in that GCC version. If you then write your test as: if that GCC version then do this, else if Intel then do nothing, else bomb, that's annoying people without a good reason to annoy people. And we've run into such things more often than we'd like. The second thing is that there should be more consistency in testing for modules, I mean, testing whether modules are loaded or not. This should always be done through metadata and not through module names. Sometimes EasyBlocks test for loaded modules, sometimes they test for the presence of EBROOT variables. It should always be EBROOT variables, because then it's also compatible with external modules. And the most annoying EasyBlock, one that we have customised for LUMI, is the MesonNinja EasyBlock.
I don't know who wrote it, and it may have changed by now, but when I made that change it was like: first we check whether the Meson and Ninja modules are loaded, then we check whether the EBROOT variables exist, and we're not happy yet, we even explicitly check whether the meson and ninja executables are there. I mean, how paranoid can you be? And that's a problem on LUMI, because we wanted to put those in a module that has a different name. Something that would also be great, just for compatibility with EasyBlocks, is if there were a way to add metadata, even if it doesn't make much sense from a module point of view, to OS dependencies. Suppose you want to use the ncurses library from the system: it would be nice if you could just have a file that sets EBROOTNCURSES to /usr or so, and EBVERSIONNCURSES to the version that is on the system, so that EasyBlocks that check for ncurses would be happy. And I give the ncurses example in particular because on SUSE you're also hit by the symbol versions, by the version info in the libraries, and that is not solved by adding that argument to the configure line. I don't know what SUSE does, but they seem to be using a patch somewhere that adds even older versions to the symbols in the library, and those are actually used in some of the tools that come with SUSE. Bart? Yeah, well, that's precisely what I mean here. So Bart made a remark that they have what he called the opposite problem: because they sometimes use libraries from the compatibility layer in the EasyBuild setup that the Digital Research Alliance of Canada uses, they also have failing EasyBlocks, because those don't find the metadata, and so they have to explicitly set the EBROOT variables.
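A minimal sketch of that metadata-first idea, assuming EasyBuild's usual naming convention for these variables (upper-cased name, with "-" mapped to "MIN"); the /usr prefix and the version are purely illustrative:

```python
import os

# Detect a dependency through its EBROOT variable (metadata), not through
# a module name or an executable on PATH, so a renamed module or a system
# package declared as external still satisfies the check.
def find_dep_root(pkg):
    return os.environ.get("EBROOT" + pkg.upper().replace("-", "MIN"))

# A hypothetical metadata file for a system-provided ncurses could then
# simply export these, without any real module behind them:
os.environ["EBROOTNCURSES"] = "/usr"     # illustrative prefix
os.environ["EBVERSIONNCURSES"] = "6.1"   # illustrative version
print(find_dep_root("ncurses"))  # /usr
```

An EasyBlock written against this one check works unchanged whether the dependency comes from a LUMI module, a differently named module, or the operating system.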
So this corresponds to what I said: it would be great to have such a feature, either as an extension of OS dependencies or as a separate file, like the external module files we already have for the things that do not come via modules, where we can add metadata specifically for our system. The next slide is probably already outdated, because we've heard about one or two other solutions already at this event, but we're getting very concerned about both module and directory health: too many modules, too many subdirectories in paths, too long LD_LIBRARY_PATH and other search paths, and so on. On a system like LUMI, if you start an application on 500 nodes and they all start hammering on the Lustre metadata servers to go through a path that contains 100 different directories, that's the way to get your file system to slow down. Of course there is a very nice argument in favour of splitting up into as many modules as possible: you get more manageable chunks for installation. But sometimes it's really too ridiculous. Among the arguments I have against splitting up: you shouldn't use disk capacity as an argument any more, as in "we want to keep our installation as small as possible". It's not disk capacity that is expensive, it's the IOPS. If you have long paths you create way more IOPS, and it's those that make your file system expensive. The same holds for long path variables; long link lines are also terrible for developers. If you run into a problem installing a software package and you see that configure has generated a link line which basically fills your screen, so that you need to ask your boss for a bigger monitor to see what's going on, then obviously that's a problem.
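To make the IOPS argument concrete, here is a back-of-the-envelope sketch; all the numbers are invented for illustration, but the shape of the product is the point: resolving each shared library may probe every directory on the search path, on every process.

```python
# Worst case: every rank probes every search-path entry for every shared
# library before finding it, and each probe is a metadata operation.
def worst_case_lookups(path_dirs, libraries, ranks):
    return path_dirs * libraries * ranks

# 100 path entries, 50 shared libraries, one rank per GPU die
# on 500 LUMI-G nodes (8 dies per node):
print(worst_case_lookups(100, 50, 8 * 500))  # 20000000
```

Twenty million metadata operations at application launch, before a single byte of actual library data is read, which is why trimming the number of directories matters more than trimming gigabytes.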
Also, some software expects certain components to be together. In the past, not us, because we use plain netCDF, but others have had problems with netCDF installations that expected the Fortran interface and the C interface to be in the same directory, because the Fortran interface links to the C interface. I don't understand why that has to be split up: the library names are perfectly compatible, so it fits perfectly well in a bundle, for instance. Another thing that I've seen is that it's so easy to overlook dependencies. Not all EasyConfigs contain all possible dependencies, and the problem is that if a library happens to be on the system and you haven't listed it in the EasyConfig, it may still be picked up by the configure script. All of a sudden you have software that's configured differently than you expect, and differently from the configuration that was tested when the EasyConfig was designed, so you may run into problems that never occurred on the test system on which it was tested. Also, what's the point of having packages in separate modules if you want to have only one version of each? You don't really save space, at least not for the basic packages, because they are installed everywhere anyway. And having some of the compression libraries together can actually help you to work around circular dependencies, where the tools that come with the compression libraries support each other's formats; there is definitely such a circle of dependencies in the graphics libraries. And maybe installing those as one module with graphics tools and one module with compression tools might even help the programs that you generate. Of course, I see what you're thinking: it has lots of disadvantages too, doing it that way. But I'm not sure that the current solution is the optimum either. Then there are non-arguments in favour of splitting up: better visibility of what is installed. I'm not going to speak for the Tcl-based version of modules.
I'm not familiar with that one. But on our system, on LUMI, we teach people to use module spider, which is an extremely powerful tool. And actually, when I develop bundles (not right now, because we have an old version of Lmod with a bug where you cannot disable the display of the list of extensions, and I don't want the output to get too long), even for a bundle I sometimes just manually add a line with the components of the bundle, so that they show up as extensions of the package and are found by module spider. And even without doing that, if you make sure they're mentioned in the proper place in the whatis lines, module spider, or at least module keyword, will definitely find them. The argument that you shouldn't install more than is really needed also doesn't make much sense: it's not the disk space that's expensive, it's the IOPS you need to save on, and I think everybody who runs a big cluster has already run into trouble with too much load on the metadata servers. "Linux distributions do it, so why wouldn't we mimic that?" Well, there's one important difference: Linux distributions still install all those separate packages into the same directories. And some distributions target very small systems where you really want to trim down what you install; that shouldn't be an issue for HPC systems. EESSI is probably targeting workstations too, but even there, because the caching works on a per-file basis, it won't make a difference. If installation in more manageable chunks is the argument, then maybe we need to think about a better mechanism than the current Bundle for bundling installations, rather than using ever more modules. Just food for thought; I know it's not something that can or will be realized quickly, but it's something we think about.
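To make the bundle idea above concrete, a hypothetical easyconfig for a bundle could list its components as extensions so that module spider finds them; all names and versions below are illustrative, not an actual LUMI recipe:

```python
# Hypothetical sketch of a Bundle easyconfig that lists its components as
# extensions, so that `module spider zstd` can locate the bundle.
# Names, versions and the toolchain are illustrative assumptions.
easyblock = 'Bundle'

name = 'compression-tools'
version = '2023a'
homepage = '(site-specific)'
description = """Bundle of compression libraries and tools.
Components: zlib 1.2.13, zstd 1.5.2"""

toolchain = {'name': 'cpeGNU', 'version': '22.08'}

# Listing the components here makes them show up as extensions of the
# bundle in the module file, so module spider can report them.
exts_list = [
    ('zlib', '1.2.13'),
    ('zstd', '1.5.2'),
]

moduleclass = 'tools'
```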
Then there's a little tool written at CSC that I want to talk about, because it links to some other things we've already heard about. CSC, for Python and Conda, on their own systems and also on LUMI, basically tell users not to use Conda and not to make large Python or R installations directly on the file system. We actually limit users in the number of files they can create, and if their argument for more is "I want to use Conda", they don't get a higher file quota. So this is a tool written by one of the support people at CSC, aimed at reducing the load on the Lustre metadata servers, and on their own system they claim that for many Python workloads you get a 30% speed increase. Their approach starts from a minimal Singularity container; there's really almost nothing in it, and its main role is to carry a SquashFS file that contains your whole Conda or Python installation. They have a command that, starting from a requirements.txt file or a Conda environment file, does the installation in temporary disk space, packs it into SquashFS, and then creates wrapper scripts for the commands in the bin directory, so that most of the time users don't even need to know they're working with a container. It works quite nicely. GPU support is not yet optimal, we need to work on that, but it's very useful on LUMI, because LUMI doesn't even allow fakeroot, so you're very limited in what you can do when building containers on LUMI itself; this tool works around that limitation. On LUMI we use the Cray-provided Python, so packages such as NumPy, SciPy and pandas come from Cray and are properly optimized using the Cray libraries.
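Conceptually, the wrapper scripts the tool generates could look something like the sketch below; the container invocation, the mount point and the paths are my assumptions for illustration, not the tool's actual output:

```python
# Conceptual sketch of what a container-wrapper tool does: for each command
# in the packed environment it emits a small wrapper script that runs the
# real binary inside the container, with the SquashFS image mounted.
def make_wrapper(command: str, image: str, sqfs: str) -> str:
    """Return the text of a transparent wrapper script for `command`."""
    return "\n".join([
        "#!/bin/bash",
        # Bind-mount the SquashFS holding the conda/pip environment into
        # the minimal container, then exec the wrapped command.
        f'exec singularity exec -B "{sqfs}:/user-software:image-src=/" \\',
        f'    "{image}" /user-software/env/bin/{command} "$@"',
    ]) + "\n"

print(make_wrapper("python3", "base.sif", "env.sqfs"))
```

The point is that the user just runs `python3` as usual; the wrapper hides the container entirely, and the thousands of small files of the environment count as a single SquashFS file on Lustre.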
I've actually been thinking about ways to interface that with EasyBuild, because one of the nice things would be that when you make such a package with lumi-container-wrapper, as the module is called, or Tykky, as the tool is called in Finnish, you would also get a module for it. So we've been thinking about ways to interface, but there's been a complete lack of time, so nothing has happened yet. If you're interested in the tool, the link is in the slides. It's publicly available, and it should not be too hard, I think, to configure for a different system: it's basically a configuration file in which you describe, for instance, which container to use, and a few other things. The next thing, which I always wonder about and have discussed with a few people over coffee: where is LLVM, and where is MPICH? Support for LLVM is quite limited at the moment; there are no common toolchains based on it, unless you count the new Intel toolchain that is being built, since the new Intel compiler is basically LLVM. That's understandable given the confusing state of Fortran support: there's the new flang, which is not really ready yet, and the old flang, which is not very good. Intel has just moved its own Fortran compiler on top of an LLVM backend, but otherwise it's really still its own compiler. Same with Cray Fortran. So when it comes to Fortran it's still a mess, but whichever way you turn it, LLVM is the number one base for the development of HPC compilers at the moment. Almost every commercial vendor, if not every commercial vendor, is using it. Even NEC, for instance, which built its own compiler for its vector machines when it launched the SX-Aurora TSUBASA architecture, is working on an LLVM port at the moment.
And outside HPC it's even more so. When I talked to people at imec (those from Belgium know imec, the big lab for micro-electronics and related research), they said: are you, as an HPC community, only thinking of switching now? We made the switch from GCC to Clang in 2011. So in the embedded world it's also all Clang, which means that for the CPUs that come out of the embedded world, think ARM, think RISC-V, the main development is also happening in the LLVM ecosystem. Same with GPUs: it's the basis for all GPU compilers. Look at ROCm, at Data Parallel C++ from Intel; the compilers from NVIDIA are currently all built on LLVM. And we do next to nothing with it in EasyBuild. The second thing for me is: where is MPICH? If you look at the vendor MPIs, some are derived from Open MPI, others are derived from MPICH, and my impression is that the MPICH-derived ones are the ones favoured by the network vendors, because all network vendors except Mellanox nowadays work with libfabric rather than UCX. Cornelis Networks, for instance, is also switching to libfabric. It seems to be the more open library compared to UCX; in fact, one of the technology people at CSC basically called Open MPI just a wrapper around UCX. Which is a bit strong, but if you look at GPU support in Open MPI, it basically relies on UCX: if you don't have UCX transport, you don't have support for things like GPUDirect and so on. So for us MPICH is really an important implementation. And the next slide is one that I wrote during the EasyBuild conference call four weeks ago, which started with about ten minutes of bashing Spack. I heard some arguments there that I really think are not right. You can compare Spack and EasyBuild in many different ways, of course, and we had a talk making that comparison last year.
When I look at it in the context of LUMI: I have some colleagues who work with Spack on LUMI, and then there's our team that does everything with EasyBuild. And our experience is that with Spack they often have a solution that makes the user happy much quicker. I am comparing someone experienced with Spack against the speed of someone experienced with EasyBuild. Of course our case is special, because we cannot use the common toolchains; we have the Cray compilers, and that of course slows down the process a little. So the comparison may not be valid for everybody, but it's something to think about. Then there's one of the arguments that always comes up: the Spack API changes all the time. OK, we've just heard that they're planning a major change, but in the past two or three years it has actually been pretty stable. I mean, how could they maintain 7,000 packages if they changed the API continuously? And the other thing is that the Spack API is far more readable than the one for easyblocks. When I bump into a new package that I don't know how to install, and I'm not sure whether the EasyBuild support or the Spack support is the better fit for us, and I want to look at the dependencies to estimate how long it will take to prepare it for the user, the Spack sources typically tell me a lot more, much quicker, than the EasyBuild sources. The readability of the Spack package files is really better. Then there's the criticism that when you build something with Spack, it's basically an untested configuration. That's also an argument I hear a lot, and it's only partly true: Spack has its quality control too. Of course they cannot test every combination; it's a combinatorial problem. But very often it just works. And quite often Spack will do better precisely because you have all that flexibility, while in EasyBuild...
I mean, you have to live with the set of dependencies that is pretty much determined by the toolchain. In Spack, if a user says "I need that, but I want to use it with this", you can just try. Maybe it's an untested configuration, but who cares, if it works in the end and the user is happy? And given all the variations in Linux and in the underlying hardware requiring different compiler optimizations, and the risk that EasyBuild picks something up from the environment that was not there in the environment where you tested, the EasyBuild recipes are also not fully tested. Fully testing software installation tools is not possible. We have to live with that; we will have failures whatever CI we implement. It's impossible to have CI that captures all problems in a software installation tool. A comment from the room: even though we may be picking up stuff accidentally from the OS, and that definitely changes things, because we run the test suites we're probably going to catch more problems; PyTorch is a good and a bad example, since a lot of time goes into the PyTorch test suite even when only one or two out of the 80,000 tests fail. Yes, but you've still got all those packages that don't come with test suites. I'm not sure that's correct; I'm not going to say yes or no. If Todd is still on the call he can write it in the comments. He's already in the comments, I see. I knew he would want to listen to my talk, because I told him I would say a few things about Spack. Also: "if you want to add another package to an environment, the concretizer may come up with a very different solution, forcing you to reinstall a lot." I'm not sure that is true. He didn't say much about it today, but as far as I know, the concretizer tries to take into account what is already installed, if that is possible. Of course, you may have new conflicts that then require a completely different solution.
That's true, but the thing is, EasyBuild in those cases will also not always come up with a nice solution. If you want to combine package A with package B in particular versions, it may be that package A is supported in the 2020a toolchain and the version of package B that you want is in 2022a, and then you end up installing two toolchains, with dependencies in two toolchains. So that's not optimal either, unless you start editing easyconfigs to move things to a different toolchain. So which is the better tool from a user's perspective? All the user cares about is having a working environment, and with Spack they may even end up with an environment where they can use both tools without loading new modules in between, while with EasyBuild they may end up in two different toolchains. Another thing: a flexible tool whose developers don't feel the need to test everything may be better prepared for the Cambrian-explosion phase we're in. That's something that was touched on yesterday by Ian, who showed us how much new hardware is coming up. In some of those cases, the way you install for AMD GPUs is quite often the NVIDIA installation procedure pointed at ROCm instead of at CUDA, using basically the same variables; your installation procedure for CUDA may not be tested with ROCm, but it may just work. Not always, but if it works, you have a happy user. The other thing I'm a little concerned about is that bashing the competitor can make you blind to your own shortcomings. EasyBuild has many fantastic features, but it has its shortcomings too, and we need to be aware of them, and not refuse to do something just because the competition does it that way. And then there's always Kenneth, who comes up with his remark: maybe we should meet somewhere in the middle, since Spack and EasyBuild are really on opposite ends of the spectrum.
Spack, the very flexible tool, with the emphasis on flexibility at the price of untested configurations; EasyBuild, with fully fixed but well-tested configurations. So maybe we need to meet somewhere in the middle. Well, I agree, I agree 100%, but you shouldn't be blind to what has been going on with Spack in the past few years. I think Spack is taking that middle position already with the environments feature. You can run an environment through the concretizer, and we've seen the lock file that was shown today, which is an ideal way of transferring a completely concretized environment to a different system, where you can still build it optimized for that particular system. Which is exactly what we want to do with easyconfigs: you have the fully fixed configuration, you port it to another system and install it, optimized for that particular system. Spack is doing that too. So I often think about how to make EasyBuild a bit more flexible; that's why I made that remark about all those "let's just try" things. For instance, we use a script that we got from CSCS to switch a set of easyconfigs to a different Cray toolchain when we update, and I have my own tricks for changing versions of dependencies, but they rely on something very awkward: I define the versions as variables elsewhere and then use those variables in the dependency statements, rather than literal versions, so that I don't need to patch every dependency line. Sometimes, I think, it would be nice to really have a file with versions, something like a database of the versions we want to use of all packages, and refer to that database when EasyBuild resolves things, a bit like templates. Maybe this is something to think about, not for EasyBuild 5, but for EasyBuild 6 or EasyBuild 7; or maybe we should then call it EasyBuild X, because it would be so dramatically different, and because X is popular nowadays.
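The versions-database idea could be sketched like this; it is purely hypothetical, not an existing EasyBuild feature, and the package names are just examples:

```python
# Sketch of the "central versions database" idea: keep one mapping of
# preferred dependency versions, and refer to it from easyconfig-like
# recipes instead of hard-coding a version in every dependency line.
PREFERRED_VERSIONS = {
    'zlib': '1.2.13',
    'HDF5': '1.12.2',
    'netCDF': '4.9.0',
}

def dep(name: str) -> tuple:
    """Build a (name, version) dependency entry from the database."""
    return (name, PREFERRED_VERSIONS[name])

# A recipe then only names its dependencies; bumping a version happens
# in exactly one place instead of in dozens of easyconfigs.
dependencies = [dep('zlib'), dep('HDF5'), dep('netCDF')]
print(dependencies[1])  # ('HDF5', '1.12.2')
```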
And we should be careful that we don't move even further towards an extreme position and make things difficult. EESSI, for me, is both an opportunity and a threat. It's a big opportunity because it means extra person-power, and all person-power for EESSI development is good for EasyBuild too, because EasyBuild is such an important tool within EESSI. My big concern is that it's going to put even more focus on a single big software stack, and on features to manage that, while I'm not convinced that one big software stack is the solution for the future. Just look at what's happening with Python, with all those version conflicts where people need virtual environments; we're very close to ending up in that situation with regular packages too: packages left behind on old versions of libraries, combined with newer packages that need new versions of libraries. And the thing I'm most concerned about, which is why I asked that question about support: if EESSI cannot figure out a proper support model that keeps users happy, it may reflect negatively on EasyBuild too. EESSI and EasyBuild are so close, even in the way the names sound, that if users don't like EESSI, they may link it so strongly with EasyBuild that they don't like EasyBuild either. So, some thoughts on EESSI from the perspective of LUMI; I've actually discussed this with the head of the user support team as well. Finding a user-friendly way to build on top of EESSI will be critical. That was said today too, and everybody realizes it, but I think other distribution models should also be thought about. Not every site is ready to set up a cache: you need a budget to buy that cache, and even though it's not that expensive, it's an out-of-line budget item and it's not always easy to find. Though the setup CSCS has for LHC ATLAS could be inspiring for our case.
I think they just copied the whole repository from CERN. Extra daemons are not always an option, and that's particularly true on Cray: Cray really hates daemons on the compute nodes, because they introduce OS jitter and limit scalability. On our GPU nodes we even have one core turned off, because it turned out that some ROCm services produced too much OS jitter and they couldn't reach their scalability benchmarks. This gives a very strange configuration with 63 cores, one chiplet with seven cores and seven chiplets with eight, while you have to do a careful mapping between cores and GPUs to get optimal performance out of unified memory. So it's not a happy solution if you have to do such things. Also, on LUMI specifically, putting stuff in /opt is not an option. That's managed by the sysadmins; we have a wall between the sysadmins and the people who do the applications for the users, and we each have our own directories to play in. Moreover, /opt is populated by the Cray management environment, and they want to interfere with that as little as possible. So a native build, with or without a compatibility layer, may be a better solution for a system like LUMI. The compatibility layer is actually a nice idea, I think, to get rid of some of the problems you have with different Linux versions, so that's probably something that should stay in such a setup. Another thing, taken from a slide of Kenneth's from the State of the Union two years ago; we all know that slide too well: the community of EasyBuild versus the community of Spack. Now look at how EuroHPC is working. EuroHPC has plenty of money for machines, and very little money for support close to the machine. On LUMI we're nine FTE, and we're in a luxury position compared to the petascale centres.
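As a side note to the 63-core configuration mentioned above, a small sketch of why the mapping gets awkward; the chiplet layout follows what is described in the talk, while the reservation of core 0 and any pairing with GPU dies are illustrative assumptions, not LUMI's actual binding masks:

```python
# Back-of-the-envelope sketch: a 64-core CPU with 8 chiplets of 8 cores,
# with one core (assumed here to be core 0) reserved to reduce OS jitter,
# leaves one 7-core chiplet and seven 8-core chiplets, i.e. 63 cores,
# to be mapped carefully onto the 8 GPU dies.
CORES_PER_CHIPLET = 8
N_CHIPLETS = 8
RESERVED = {0}  # core disabled for low-noise mode (illustrative choice)

chiplets = []
for c in range(N_CHIPLETS):
    cores = [i for i in range(c * CORES_PER_CHIPLET,
                              (c + 1) * CORES_PER_CHIPLET)
             if i not in RESERVED]
    chiplets.append(cores)

print([len(c) for c in chiplets])  # [7, 8, 8, 8, 8, 8, 8, 8] -> 63 cores
```

The asymmetry is exactly what makes a naive "one chiplet per GPU die, eight cores each" binding impossible, so job scripts need explicit CPU masks.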
If you hear how the petascale centres complain about not having any funding for support while being expected to support the users: to me it looks like EuroHPC favours strong communities that do development and support near the user, rather than near the supercomputer. They have their Centres of Excellence, which are domain-specific; they have their National Competence Centres to bring support close to industry; but they have very little money for support of the big machines. So to me, their ideal community looks more like the Spack community than like the EasyBuild community. I think we will need to find ways to make EasyBuild more attractive to software developers and to scientists directly, and to work with those Centres of Excellence and others to provide ways for them to distribute their packages, or install scripts for their packages, in EasyBuild. "EESSI is a way to get funding to..." Yes, and that's where the opportunity is, but... So the remark just made was that EESSI is also funded through EuroHPC, via the MultiXscale Centre of Excellence we've heard about. That's true, but again, it's going to be difficult to get everybody behind one single software stack, because of all the possible version conflicts you can land in, and the whole amount of synchronization that's needed between teams: you basically need to set up the base of the software stack and fix versions there before you can start building other applications. I mean, see how slow it is, how much time it has taken to start up the 2020b software stack. There were also technical problems, with Python and so on, but it slowed down the rollout considerably. So that brings me to my conclusions, and I'm already over time, I think. EasyBuild on LUMI is really not a typical EasyBuild installation, because we don't control the whole environment with EasyBuild; we work on top of the Cray Programming Environment.
EasyBuild works for us, but probably with more pain than needed, for instance with the easyblocks. Despite my nice remarks about Spack, we do actually continue to invest in EasyBuild: whenever we have to build something that doesn't work out of the box, we don't write package.py files for Spack, we still develop for EasyBuild. But I have to admit that Spack has been really good for some users on LUMI, also for GPU software, where they're much further along than we are. Users need some time to adapt to the personal-environment idea, but I think they now see the benefits, and many users really appreciate that they can build their personal environment, even with EasyBuild. Some still consider it difficult, but most don't, specifically because the easyconfigs in our own repositories have been tested on the system, so they are far less likely to fail than generic easyconfigs on a new system. And I've mentioned lots of possible improvements. For us: we like the SYSTEM toolchain, which unfortunately is not a full toolchain, the CC/CXX issues. If I went through my wish-list log, there's a lot more that's very specific, which I didn't even include here. You could also argue that AMD GPU support is becoming important, of course, but so far we can live with what we have. I've shown you the documentation through modules; there are some issues there, but it's a very nice feature of EasyBuild. I don't want to be negative about that feature at all; it does much better than Spack at documenting modules. Because you do it in your easyconfig, it's very easy to make specific documentation for each module without having to write hooks of the form "if you have this configuration, then add this to the documentation", and so on. So EasyBuild is really good there, but it could do even better.
Then there are the issues we have with easyblocks, the concerns about module health and metadata server load, and the concerns about missing support for LLVM and MPICH. And OK, sometimes it's good to make jokes about Spack; it's fun, I like them too. But you shouldn't be blinded by those jokes. Sometimes I get the impression that people think "Spack cannot do this, Spack cannot do that", but really, Spack does a lot more than we sometimes realize. So keep your eyes open for upcoming threats: if EESSI doesn't do well, will that be negative for EasyBuild or not? EuroHPC and their model: will EasyBuild find its place in there? Of course there is EESSI, but if the big centres were to move en masse to Spack, then you have a problem. So that's it basically; that's my last slide. I hope I was not too controversial. All right, we're a bit over time, but we have some buffer with Sam's talk as well; I don't think Sam needs two hours. You do? OK. Do we have any questions for Kurt, in the room, or, looking at Simon, on Zoom? Well, several questions were already asked during the talk. There's a question; yes, but that means you will have to repeat it. So the comment was made that it's good to have two tools, that some competition is always good, and that these are two tools that do not work the same way and target slightly different environments. That's true, I agree; the competition is important, and we've seen far too often that things stop being developed otherwise. But it means that you must find ways for EasyBuild to survive. Look, for instance, at the comments made by one of the referees on the EESSI proposal, or at the growth of Spack in number of contributions: I may be mistaken, but the graphs looked a bit steeper than for EasyBuild. That's ECP, right? Spack has a lot more funding than EasyBuild.
But that is the big concern: will it survive, and will it be attractive enough to a big community? And not only the small centres; it should not become the tool for the small centres only. It should remain relevant for the big centres too, because that's where the money for development will have to come from, and the influence to push EuroHPC to put money into that development. Yes, and the comment was: if we start paying more attention to the large centres, the small centres will not be happy anymore, because the tool no longer has the focus they would like. I think LUMI is a good example of that. LUMI is so special that there's lots of stuff missing in EasyBuild that could be a lot better for LUMI, and it's not there because LUMI is such a special case, right? But it's difficult to find a balance. To some extent, CSCS was in that same boat, let's say eight years ago, when we started working on the initial Cray support for EasyBuild. We did that because, first of all, we had the time for it back then, and it was fun to do; and it was nice to get our foot in the door at a big centre like CSCS. But that's not something we can keep doing, unless there's actually help. Yes, but there too you see the same discussion as on LUMI: should we go with Spack, should we stay with EasyBuild? There are strong opinions in both directions. Can we unmute someone on Zoom? Yes, sure, try that. "I was just going to bring up one thing, because people also seem to have the misconception that Spack is mainly funded by ECP. It's not; it's funded by ASC, the Advanced Simulation and Computing program of the NNSA, which is not going away anytime soon. We're going to lose a chunk of funding at the end of ECP, but it's likely to be replaced by the sustainability program that will be in place after ECP, which is still being worked out."
"So no, we're not going anywhere." I totally missed that comment, but fine; I'm not going to bash back, or at least I'll try not to. I was talking about the funding and the fact that it comes from a different channel than the one you mentioned. On the bashing, by the way: I'm not sure you should call it bashing. We make jokes, but Todd and I definitely talk all the time, as do many people; you just can't see that, because it happens directly. "But specifically on that EasyBuild conference call, I think I heard some comments that really were not nice." Well, that's informal, right? It's not recorded, so maybe things are a bit more loose there. But we've always been very happy to have Todd at the meeting to give updates, and we're very open to having that discussion, so that's important to mention as well. Sam, you had a comment or a question? So Sam's comment is that there are indeed threats and opportunities, but maybe this is an opportunity to let LUMI collaborate more closely with EasyBuild and try to fix some of these issues. And then you're back at the person-power problem. Yes, partly the person-power problem, and partly: I'm happy to contribute our toolchains, for instance, but they would have to be validated, and they are incomplete, because they are tuned only to LUMI. The support for the Intel programming environment, for instance, is missing, and the NVIDIA support. To add those, I could ask for an account on Piz Daint and try it there, or wait until the new Alps setup is completely ready to test on, but... And I see there were many points raised that we should follow up on, like the easyblocks. Yesterday we talked about policies; that's precisely something that should go into the policies: what is good practice in writing easyblocks, what is good practice in... That's probably something we have to figure out.
And it's not only going to help LUMI, it's going to help EESSI and... it helps everybody, and the Alliance as well; we've heard that they had similar issues that could be solved that way. So there's just a lot of stuff brought up here that we should come back to and follow up on. There are one or two more comments in the back. Luca? What's the comment? The question, roughly: has anything been contributed back to the mainline repositories, or do sites keep their own policies and payloads? Oh, I use those extensively: my toolchains are a copy of the ones you use, which I then refined further. So maybe it takes too much effort or too much time to put stuff back centrally, especially for special systems like they have at CSCS. And that's something I think I agree with: if people want to make the effort to contribute something back, it has to be a win-win, a win for the central EasyBuild community and a win for the site doing it. If that's not the case, just have your own repository with your own customized easyblocks or hooks or easyconfigs or even toolchains or whatever you need, and that's absolutely fine. But another thing that makes it difficult for me to contribute back an easyblock is that I have no way to test it, at least not on LUMI, within a regular EasyBuild environment. Of course, any change I make, I should make sure it still works for regular EasyBuild too, and then I need to rely fully on the CI; with only a CI error message it's more difficult to debug than when you can do it on your own system. It does mean more effort, of course. There's also lots of stuff within the VSC setup that you can still build on, though it has diverged at some points, because we missed things in EasyBuild. Kaspar, you had something else? A remark right away:
To be very clear: both the CSCS repositories and our repositories are completely open and available to the outside world. Yes, and that's a good point. So Kaspar was saying that EasyBuild is also a platform to share expertise and to share whatever is built on top of EasyBuild, and maybe it makes sense for the sites that do have a Cray system to join forces there a bit. But I'm really not so much asking for more specific support for Cray; I'm really asking that we try to avoid things that work against what we do. And I'm not only talking about Cray systems: you will have exactly the same problems I run into with any other toolchain that is not part of the common toolchains. If we ever have something LLVM-based, you'll have issues like this as well. And, I'm pretty sure, yes, but you missed the big point that I made: you were talking from a sysadmin's point of view, while I was talking also from a user's perspective. A user cares about errors in a different way. You really think like a user support person or a sysadmin; I want a warning when something may not work, but at least I see that the Spack users are way happier: they would rather have something that sometimes doesn't work but most of the time works, than something that never works. That's one of the use cases; that's the way Spack is being used on LUMI at the moment. I think there's a middle ground there. We could have EasyBuild at least try something, right? Take a guess, or give a warning rather than fail completely: "look, there is a risk here that we're doing something that is not quite right", rather than just saying "sorry, we don't support this". That could be a configuration option: do you want a hard stop when something unexpected may happen, or do you at least try and see what happens, and get those compiler errors?
Yes, though the comment is that the actual problem you introduce if you just go ahead and try may only pop up six installations later, when the result is used as a dependency. We should wrap this up; there's lots more we could discuss. There's another slot tomorrow, the closing slot, where we have about an hour for open discussion, and we can raise some of these points again. There's just way too much in here to cover everything, but lots of these things should become issues that we discuss later, maybe on conf calls; maybe there should be a dedicated follow-up conf call with some of the maintainers on this. And some of these things are relevant in the scope of working towards EasyBuild 5; that's why I wanted to bring them up. Like the way we check for dependencies in easyblocks: that's maybe something we really need to rethink and redo, and come up with a way that we are happy with, but that special cases like EESSI and LUMI are happy with too. That's another opportunity. [inaudible] And similar issues, and the Canadian friends as well. OK, good. Lots of stuff to follow up on. Thanks a lot.