Welcome back. The next talk is from Ole Saastad from the University of Oslo, who will be talking about the PRACE best practice guides. Over to you, Ole.

Yeah, thank you. First, a short introduction of myself. I studied at the University of Oslo and have a PhD in physical chemistry. There you run experiments, simulate them, and then fit parameters — fitting functions and models, which turns into optimization problems. Some of these are quite elaborate and take considerable compute time, especially when you don't have second derivatives and cannot build a Hessian matrix. So out of sheer necessity I ended up in supercomputing, though I had been playing with computers for most of my career — automating experiments in the lab, logging data, and processing it on the fly — but that was done on simple PCs. Then I started working with supercomputers proper. I spent five and a half years in a company called Scali, which made MPI implementations and cluster management kits, and then joined the supercomputing group at the University of Oslo — called scientific computing, or nowadays research infrastructure. The key phrase is research infrastructure, which brings us to PRACE, the Partnership for Advanced Computing in Europe. I was also asked to give a little overview, because not everybody knows PRACE that well, and the PRACE people will be happy that I do. So what is PRACE? It started about ten years ago — a little more, eleven. I was in the first implementation phase, where we had a prototype machine from Numascale, which used an interconnect to build shared-memory systems; in 2011 we had a system with 4.7 terabytes of memory, which was quite a lot at the time. Then came the second IP, third IP, and now we are up to 6IP.
IP stands for implementation phase — one through six — and the RI stands for research infrastructure. As you can read from the slide, PRACE offers world-class computing and data management, and that is what they do. They provide access to infrastructure, so the large systems around Europe can be used by people who would not normally have access to HPC — if you are eligible, you can apply. PRACE also runs a lot of training and education: there are summer schools, winter schools, and a lot of activities going on, including online training that you can either just watch or actively participate in, discussing with others in groups. Within training and education there is also work on enabling applications, and I'll give a little overview of the different work packages. They also do HPC market surveillance: if you are planning to procure a system, you can read reports on what is available on the market and what others have been procuring over the years. There is a lot of interesting material — the more you dig into it, the more layers you find. It is a bit like reading Kafka's The Castle: there is always something going on that you never even knew existed. PRACE is based in Brussels and has 26 members, and that is not a fixed number — members come and go; some countries find they are paying too little back for what they pay in and step out, others join. I deliberately did not put up a map, because every time somebody does, a hand goes up to say that this country is no longer a member or that country has just joined. But it is 26 European countries. The meetings used to rotate between different places, so you could see different cultures when you attended; today everything is online, so we stay at home.
For access to the HPC systems — and now we are talking about the really big ones, the biggest in Europe, soon to be pre-exascale when the LUMI system comes online — the Tier-1 and Tier-0 systems are open for application. Of course, to run on these really big systems you have to qualify: there is a review process, and you have to show that your application scales and is scientifically sound. But we have examples like the stellar atmosphere code of Mats Carlsson, which runs 80 to 100 million core hours on the European systems each year, so it is possible to get a substantial amount of core hours. There is preparatory access, where a team will help you get onto the bigger systems to check whether your application scales — and if it doesn't, help you make it scale. Then there is the SME programme for small and medium-sized enterprises, which is geared towards the commercial side, but not the large corporations — the smaller companies. Say you are in the car industry making bumpers, and you want to run crash tests on just the bumper: you can use the SME programme to get access for structural modelling and finite element analysis. Then there is DECI, the distributed computing initiative in Europe, which is far older than PRACE, through which different organizations can get access. And there is project access, which is the one I mentioned for Mats Carlsson and the stellar atmosphere simulations. Training and education is very important and there is a lot of training you can attend — the best way in is the website at the bottom right; look into the training section, and there are also seasonal schools. As for market surveillance, if you want to procure a big system you can read the reports here, see the systems already procured, and contact those people if you have detailed questions.
So it is a very nice starting point, and one of the work packages also maintains benchmark suites, so you can get help selecting from a broad range of benchmarks. There are eight work packages: one and two are strategy and administration, work package three is outreach, four is training, five is planning and commissioning of systems, six is services, eight is preparing for exascale, and then comes work package seven, application enabling and support, which is where I work. There are different activities here, as I mentioned earlier: preparatory access of different kinds, and then the High-Level Support Teams — a strange name, but they are effectively second-, third-, or fourth-line support for when you have a problem that is very, very hard to debug or solve. Say your library doesn't scale and you really need that library: then you can get help at the level of a full-time employee for half a year, or thereabouts. And then there is the part I take part in: the best practice guides. There are already seven of them, of various ages, but we do not rewrite everything for each guide, so it is well worth looking into the older guides to see what was done in the past — x86-64 has been around for a considerable number of years, so things covered five years ago may still be just as valid today, even though the processors have changed a little. You can see we already covered things that are now outdated, like the Xeon Phi Knights Landing, but I will come back to Knights Landing, because it was a rather peculiar technology with some forward-looking ideas that we can talk about a little later.
This is a graphical overview of more or less the same thing, and I will discuss the relevant guides. The current one is called Best Practice Guide — Modern Processors; modern in 2020 is not modern in 2025 anymore, but that was the name chosen, and it was published in November 2020. There is the Broadwell guide, the Knights Landing one, and there is even a guide specifically for ARM covering some of the ARM systems; we also included part of that in the modern processors guide, so people thinking about going to ARM can see what is going on elsewhere and learn a little about how things work. There is some overlap, but topics are covered differently in each guide, because they are written by different authors, and different authors have different views of things — so it can be well worth reading the same chapter in the older guides too. The current one, the best practice guide for modern processors, covers ARM, AMD Rome, and Intel Skylake, and there is a long list of authors. I have half the table of contents here. It is a mixture of a reference guide and a field guide, because it also contains a lot of examples of how to do things, compiler flags and so on. I use it both ways: when I wonder how I actually ran that monitoring tool the other day — I remember it had a strange syntax — I look it up in the guide. But you can also read it as a textbook: if somebody asks about teaching material, the system architecture chapters and the programming environment chapter can be quite interesting. By design and by choice there is very little benchmarking, because this is not a benchmarking exercise.
So you can see it is only STREAM and HPL, and they are per node, because benchmarking a larger system is not so easy, and we want to have the guide ready when the system comes alive — so we limit all the benchmarking effort to single-node benchmarks. That is the building brick of a large MPI run: you start with one thread, or maybe many threads per MPI rank, but it all boils down to what you can squeeze out of a single box. Then, as I said, back to Knights Landing. The point with Knights Landing is that it had MCDRAM — multi-channel DRAM, a kind of high-bandwidth memory with four or five times the bandwidth of normal DDR4 memory — so in principle we can refer to it as high-bandwidth memory. And that might come back, because some of the ARM systems are also discussing a hybrid of high-bandwidth memory and DDR memory. The Fugaku system opted for high-bandwidth memory only, but then they are limited to 32 gigabytes per node, which with 48 cores is less than one gigabyte per core. When Intel came out with Knights Landing, they put the MCDRAM — the high-bandwidth memory — on the chip and used normal DDR memory for the rest. And you could configure it in the BIOS settings, which is quite interesting: you could expose it as NUMA nodes in so-called flat mode, giving the user full control to allocate memory at runtime wherever they wanted, or — as Intel preferred, and I found was the best option — use it as a last-level cache. I'll show some of this later. Then, this is what is normally contained in a guide: architecture, system design, and the instruction set — which some authors go into at length, because they like the concept and like to play with the different Lego blocks inside the processor package — and memory design, which differs slightly between Intel and AMD, and also with ARM.
Where is the memory controller? What is the NUMA layout? Then compilers and libraries: with three different architectures, one distinctly different from the other two, compilers and libraries differ — and even among the x86-64 processors they differ. For the tuning tools I have put a large emphasis on the command line, because there is no way to easily run X11 applications on a very large system. It is doable through the queuing system, but it is far easier to use the command-line interface, put it inside a batch job, collect the data, and display it later using X11 on the front end. Then debuggers — there is not very much about debugging in this last guide; few people wanted to write about debuggers, so we did not get much — and finally access to, and an overview of, the European systems. As I said, some people like playing with the internal structure of a processor; you could talk for a whole week about this, but the message to take home is the floating-point and SIMD part — there is a box at the lower right. You can see there are several units that can do SIMD; these are the vector units, and they differ quite heavily between processors. In the AMD ones they are organized as 256 bits wide, but the Intel ones are 512 bits. This is important when you do tuning: how wide do you ask the compiler to make the vectors, and what can the libraries do? AVX and AVX2 run fine on Intel, and so does AVX-512, but AVX-512 does not run on the AMD, so it differs a little. And the whole picture differs quite a lot, because the number of memory controllers and so on differs between AMD and Intel, so you get a different memory bandwidth from the AMD than from the Intel. You can also see — this is just a clip from the guide — what level of detail the system architecture chapters go into.
It goes down to the level where you can see there is a floating-point section and an integer section. In the old days everything revolved around 64-bit floating point, but now with bioinformatics and even machine learning, integers are used more and more. The modern processors we talk about here only support 32-bit single precision and 64-bit double precision floating point, but newer processors provide many more data types — they even have 8-bit and 16-bit floating point, half precision alongside single and double. But then we are moving over to the accelerators, which is not the topic of this guide. Here I have some detail about the high-bandwidth memory. You can see that if you use it as a cache, everything is done for you. So if you ever get questions about a hybrid memory model — high-bandwidth memory together with DDR memory — it has already been tested, played with, and documented in the Knights Landing guide. That is really the only reason left for reading the Knights Landing guide, since the processor is no longer with us. But I did some extensive work to see how this plays out, so it could be interesting if an ARM system or something else comes with hybrid memory; then you can see whether it is worth it and how it should be configured. You get very high bandwidth — that is a simple fact — but the penalty for a last-level cache miss grows tremendously when you run it in cache mode. So if you have random access to memory, you are better off putting it in some other mode where you can access the memory directly instead of through the cache. And it is not possible to bypass this cache — at least it wasn't then; I don't know how it will be in the future. Then we also played with ARM: we were in China, running with RoCE as the interconnect — RoCE is RDMA over Converged Ethernet — and the system we had in China had nodes with Huawei's own processor.
They also have Ethernet built into the motherboard. This test was run on Mellanox Ethernet, but it would be the same with the built-in NICs running RoCE. You can see we get about one and a half microseconds of latency, which is quite good — 1.5 is nice even compared with InfiniBand — and you get wire speed. So for smaller clusters it is possible to replace InfiniBand with more or less cost-free Ethernet, because the NICs come embedded on the motherboard; you only need the 100-gigabit switch. For bigger clusters you need InfiniBand, but for medium-sized clusters, up to maybe 16 nodes, or maybe even 24, you could get away with RoCE and spend the money on more memory and more processors instead. Then more of the same — the guide covers NUMA. You all know NUMA, but it is important to read how a particular system is laid out, so you don't apply tricks that work for one architecture and mistakenly reuse them on another; you can read how it is set up for each kind of system. Then we come to one of the things people ask about the most: what compiler flags should I use? We put up some suggested flags that should work. They have all been tested with code, and there was also quite a lot of internal benchmarking to check that there is no known magic bullet that would do much better. So these flags are good starting points. I will not say you cannot beat them in performance — you can probably try, and you will probably succeed — but they take you quite a long way. Of course, the Intel compiler is tailor-made for Intel processors, so on AMD there can be some issues, and I will show later how you can overcome some of them. And I got some flaming for this one, because this is the performance of the vectorization flags — the truth is that these numbers come from matrix multiplication.
I took Jack Dongarra's reference implementation from Netlib — three nested loops, not optimized whatsoever — and fed it to the Intel compiler with different flags, recording the performance. I also looked at the assembly code to check that the compiler does not discover that this is matrix multiplication and call an optimized library — it doesn't. But the Intel compiler does something very efficient here: the three nested loops of the matrix multiplication are handled nicely. One can hope, or assume, depending on how optimistic you are, that other nested-loop code — a stencil code, a weather or forecasting code, which also contain a lot of nested do loops — can be handled in the same nice way as the Intel compiler handles these three simple loops. The point is, if you take the last line in the table, with -x: the -x flag makes code exclusively for that processor family, and that can be dangerous, because Intel puts in a runtime test so that the binary refuses to run on a non-Intel processor, even though AVX2 is supported there. I figured out — got a hint — that if you compile the main program with -xCORE-AVX2 it refuses to run, but if you compile just the main function without it, you can compile all the other routines with -xCORE-AVX2 and it runs. The table doesn't show any speedup from this, but I have examples where it can help, so play with it. I have a hard time giving really solid recommendations here; it is a bit of trial and error, because sometimes -xCORE-AVX2 can be okay. You can also put an 'a' in front — -axCORE-AVX2 — and it will compile for any x86 processor, but then do record the performance: you might end up with 4.8 gigaflops instead of 26. So optimization is not trivial when you run Intel-compiled code on AMD processors. Then we cover performance libraries.
Of course, the Math Kernel Library from Intel is the gold standard. It is very good, it comes with the compiler, and it is widely used. I did a check with the AMD library, and for matrix multiplication they have done a good job — they are on par with Intel. But then I switched to Fourier transforms, and there the performance is not very good: Intel is almost twice as fast. With all the machine learning and bioinformatics things have changed, but classically a supercomputer did one of two things — linear algebra or Fourier transforms — hence the focus on the Fourier transform. So if you can use MKL, that is nice, though MKL has its own issues. First, we also go through linking. Setting up a link line can be less than trivial, but we have some hints for how to do it, and this is one of them. There is also a web page from Intel that helps you set up the link line, but of course that is only relevant for Intel; if you are running the AMD library you have to work it out yourself. These are the default paths that the Intel installer uses; if you install with a system like EasyBuild or any other build system, the libraries will be in a completely different place, but their names are the same. Now, when you run MKL there are runtime tests inside. In versions before 2020 you could set MKL_DEBUG_CPU_TYPE=5, and MKL would then use the AVX2 instructions on AMD. You can see you get quite a performance increase by setting this flag — this is again matrix multiplication, but the same goes for all the other routines in the library, so it is helpful across the board. There are also ways of fooling the newest versions into using AVX2 instructions, and I wrote some documentation that you can see at the bottom — this is not the guide, it is the internal Sigma2 documentation, but it is open to anyone.
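As described in community write-ups and the Sigma2 documentation, that trick for newer MKL versions boils down to preloading a one-function shared library. A sketch follows — the symbol name is the one widely reported for recent MKL releases, but verify it against the documentation referenced above, since it may change between releases.

```c
/* fake_intel.c - sketch of the LD_PRELOAD workaround for MKL's
   CPU-vendor check on AMD processors.

   Build:  gcc -shared -fPIC -o libfakeintel.so fake_intel.c
   Use:    LD_PRELOAD=./libfakeintel.so ./my_mkl_program
*/

/* Symbol name as reported in community write-ups; the preloaded
   version shadows MKL's internal check. */
int mkl_serv_intel_cpu_true(void) {
    return 1;   /* pretend the CPU is a genuine Intel */
}
```

Because the preloaded library is searched before MKL's own, the dispatcher sees "Intel CPU: yes" and selects the AVX2 code paths on AMD as well.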
This one will be put into the next update of the best practice guide. There is a test inside MKL — you can see the function name on the slide — that checks whether or not the CPU is an Intel one. You simply write your own function with that name, compile it into a shared library, and preload that library before launch. When the function is called, it is not the true Intel function that runs but my fake function, which just returns one. So it fools MKL into believing it is running on Intel while it is actually running on AMD. Whether that will change in the next release — who knows? Then over to Intel: should we use AVX, AVX2, or go for AVX-512? Some of the Intel processors had issues when using AVX-512, so I suggest that before you go into production, you run a test to see how things behave. Here you see a tremendous increase in performance from AVX2 to AVX-512, but this is again HPL, which is mostly matrix multiplication — so you can fool yourself: on a realistic code the picture might not be the same. Hence, try to run benchmarks that are a bit bigger, or a real application if possible. We cover some more system things. One of the benchmarks was memory bandwidth; you can see that the AMD gets up to 330–340 gigabytes per second — but then again, you need to place your threads. This is a single system, so it is only 128 threads. Placing the threads correctly on the cores is always important when you are dealing with threads on NUMA. Depending on how you do it, the result differs: if you don't do anything, you might get a much lower memory bandwidth, because you may be running on a core that is far away from the memory bank you are addressing, and addressing memory over the interconnect inside the chip is a bit costly. Then we cover some tools for measuring performance. The ARM one used to be Allinea Performance Reports, but ARM bought Allinea, so now it is called ARM Performance Reports.
It is supported on x86-64 — Intel and AMD — and of course also on the ARM processors. There is a lot to be learned from this output. First of all, you see that the application spends quite a large share of its time in MPI — more than in compute. If you go down to the compute section — normally you have one core per MPI rank, quite often, though not always — you can see that the vector numeric operations are very, very small. So you could gain some performance by having more vector operations, that is, compiling with more vectorization. You see that memory access accounts for 74 percent: of the 47 percent of the time this application actually spends computing, 74 percent of that is spent on memory access. So your actual compute time is roughly 0.47 times (1 minus 0.74) — a small number. And this is not uncommon: if you sit down and count the flops you did, and count the flops the computer could have done, you end up at a few percent utilization of the whole computer. Then you can use the tuning tools — this shows how to do it with the Intel tools. They work wonders on the Intel processors, except that some of them can be a bit intimidating, like VTune — now called VTune Amplifier. It is really a tool made for people developing processors, and the first time you see it, it is intimidating. So I actually suggest, if you want to learn these tools, that you attend some of the free classes that Intel used to run around the world — you could show up close to your home institution or after a short travel. I have been to several of them in Oslo, and I know they had them in Stockholm and probably a lot of other places around Europe. They have good teachers; I really suggest you attend training if you want to use some of these tools. The Advisor is slightly simpler than VTune Amplifier.
And the Trace Analyzer is even easier to understand. This shows how to use the command-line interface; you can just look it up in the guide — no reason to study it at length here; the message is that this is how it is covered in the guide. Then we are drawing to an end. These are the systems we covered in the modern processors guide. There is Fulhame in Edinburgh, which is ARM based; we also used systems we had locally, both Kunpeng from Huawei and ThunderX2 from Marvell. We did not cover Fujitsu, because we did not have access to a Fujitsu system — which is a pity, because the Fujitsu A64FX is the only ARM processor here with a vector extension that does real vectors, 512 bits wide, and it has SVE, the vector-length-agnostic approach: you don't really need to care about the vector length when you program, which is very good — you can just assume there is a vector unit of some length. There is Skylake on MareNostrum, which has the most beautiful housing in the world, I would imagine — I mean, it has. SuperMUC at LRZ. Then we have the European AMD processors, with Hawk at HLRS and the Betzy system in Trondheim, which I play with — and struggle with — every day, depending on how you see it; there are ups and downs, there are applications that fail and so on, which is normal. There are pictures of all of these. And there is one I haven't mentioned: the LUMI system. Hopefully during the summer we are going to prepare a best practice guide on how to use the AMD processors and AMD accelerators that will be installed in LUMI, so that we have something to help users get started on LUMI. And that's it — I spent slightly more time than planned, but thank you very much.

Thank you for the talk. If there are any questions, please raise your hand in Zoom, or ask on Slack if you are watching the live stream. Kenneth seems to have raised his hand, so we'll go to him first.

Hi, thank you very much — that was very interesting.
We should clearly take a good look at these best practice guides and see what we can reuse, or what choices we've made in EasyBuild that we should revisit. I didn't see anything really surprising, but it is definitely worth a good look. I did have a couple of questions, so as long as nobody else has any, I'll ask them. Maybe the first one: for some of the performance comparisons you do with different compilers, and for the flags you recommend, you mentioned benchmarks you use to make sure those suggestions make sense. Is that collection of benchmarks available somewhere, so others could run similar experiments?

For the best practice guide we use matrix multiplication, HPL, and STREAM. Those are the only ones, because we are not going to embark on a benchmarking exercise. But if you want a suggestion, you can either go to the European benchmark suite that PRACE has, or — as I do — rely on the NAS Parallel Benchmarks, the NPB, which I have used for the last 15 years. They are available in OpenMP and MPI versions, and they are actually small kernels taken from real applications, so the NPBs are quite nice. They come in different classes, and even the larger classes now fit on a single node, because they are quite old — but they are taken from real applications. As simple benchmarks, STREAM for measuring memory bandwidth is nice, and of course the omnipresent HPL from the Top500 list is always important, because people know it so well — everybody has heard of the Top500 list and is familiar with it. If you want more of a challenge, you can run HPCG, the High Performance Conjugate Gradients benchmark, which also has its own Top500-style list. But it depends on how much of a benchmarking exercise you want to embark on.
And regarding the PRACE benchmark suite, I have taken a close look at that, and we are using some of the input sets that are available there, for things like GROMACS, something that sits on top of TensorFlow, and CP2K. Some of these are quite good, and they come with detailed performance analysis in terms of scaling on different systems and such. So it also includes very good reference data, which is important, because if you get a timing you never know: is it actually good or bad compared to other systems? That is a very valuable aspect of the benchmark suite. But one thing I did notice — and since we are here in the EasyBuild community, we are certainly biased, since we work together on a tool that makes it easier to install software from source, including big scientific applications like TensorFlow and CP2K — is that the README of the different benchmarks just says: install CP2K, then take this input file, pass these parameters to the CP2K command, and run it. But that first part, installing CP2K, is already very non-trivial, and there are no specific guidelines on how that could be facilitated. Maybe it ties into the best practice guides as well: I was wondering if there has been any consideration of writing a best practice guide on installing software — covering tools like EasyBuild and others, maybe also containers, comparing them, saying you can do this with that tool, and it is a good fit for this particular use case but not for others. Is that something that would be a suitable topic for a best practice guide?

I signed up to work on the benchmarks some time ago, because I wanted them to be a simple configure, make, make install — but I never really got around to spending time on it. I totally agree, though, that installing the application should either be trivial, or there should at least be some best practice for how to install it.
Because, I agree, it just says install — it doesn't really tell you how. And it can be installed in so many ways, with so many libraries. If the build system is CMake or something more complicated than Make, it is not so easy to figure out what went wrong when you suddenly get a missing symbol from a library. So I totally agree with you, but we haven't really gone into that; the effort went into making the inputs and the things you are happy with, not into installing the actual application. So if you could, say, have an EasyBuild recipe for all of the applications, that would be very nice.

We actually do — at least half, probably more, of the applications covered by that work are supported in EasyBuild already. So maybe we should engage in that discussion with the people involved in that work package and say: look, here is a pull request for your README files that suggests one way of installing the application is using EasyBuild, like this — let EasyBuild run, and after waiting a while you have a working installation that you can use for your benchmark. So maybe that actually makes sense.

I suggest you go to PRACE and contact Walter Lioen, because he is leading that effort — engage with him; that could be very fruitful.

Okay. And what about the best practice guide on software installation, looking at different tools, like you do for debuggers and profilers — is that something that would make sense?

Probably, but nobody has brought it up yet. It could make sense, but it is not so easy, because we are not in a position where we can give orders to the developers — rather the contrary.

Yeah, but it would be a neutral view on the tools, right? We've tested this with different tools, maybe including containers, and these are the advantages, these are the disadvantages.
This you have to take into account when using these tools, just like you have with compiler suggestions, like some of the tricks you have with Intel. I'm sure for these installation tools there are several important aspects, or workarounds for issues that the developers don't want to change for whatever reason. So maybe it makes sense to have a guide on that as well. And specifically for EasyBuild, one question that we get frequently is how EasyBuild compares to another installation tool, Spack, which is more US-oriented. If that were covered by a PRACE best practice guide, it would basically answer that question, because we could just point people there: they made a detailed comparison of what's good and what's bad, what's different and what's the same. So I think that would be a very valuable document for the community as a whole.

Yeah, I can bring it up in some meetings. That's quite good feedback. I see there's another question by Kurt, so I'll let him ask his question.

Yeah, let Kurt unmute himself. It's not a question, it's a remark actually. We have looked at EasyBuild and Spack for the UEABS benchmark suite, and we came to the conclusion that it's not usable; it's simply not the ideal tool for the target audience that we have with benchmarks. They're used in tenders, which means that vendors must be able to use them. They must be able to show the power of their compilers, so also use non-standard toolchains. That's simply not possible with, I mean, we cannot ask them to start developing in EasyBuild or to start developing in Spack simply to submit benchmark results. All benchmark coordinators have been asked to at least be a bit more detailed in the installation instructions, though. But quite often the installation instructions were there, just in a subdirectory of the benchmark repository.
But the suite is really meant, also, to test novel systems and so on, so also to be used with vendor compilers that are not supported by EasyBuild and Spack. And the other problem with both packages, and even more so with EasyBuild than with Spack, is that you almost have to start from scratch, and it's tricky. I know how to do it: I actually have a separate install of the Intel compiler that I integrate as if EasyBuild had installed it, even though it hasn't, but it's complicated enough that you do not want to hand that to a vendor.

Yeah, but compare that to the instructions that are there now. For CP2K it literally says: CP2K gives you a file that you can edit to tweak the compiler and compiler options; just go and change that until you like it, and then install CP2K. If it were that easy, that would be good, but in practice it's really not; there's a lot going on. Only looking at dependencies: CP2K is not that bad, but there are already three or four very important dependencies that you also have to build properly. I agree with you that EasyBuild will definitely not be the answer for every possible situation and every possible compiler and so on, but it could be a starting point, or at least it could be mentioned as an option, probably together with other tools as well, of course.

We're not writing the instructions; we simply don't have the manpower for two or three different environments: EasyBuild, Spack, doing it by hand. I actually took up the challenge (I'm the benchmark coordinator for the GPAW benchmark) and I tried to do it in EasyBuild in a way that a vendor could actually use, and I had to stop at some point because it took me too much time.

Maybe that's something we should look at together. You're talking about benchmarking, trying out different configurations. Now, GPAW is particularly difficult, because you also have to customize a file, which is hidden somewhere, to choose the install options and so on.
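The hidden file being referred to is presumably GPAW's `siteconfig.py`, which GPAW picks up from its source tree at build time to choose libraries and options. A minimal sketch, assuming that file is what is meant; the library names and the `pip install` step are illustrative for a generic Linux cluster, not the benchmark's actual configuration:

```shell
# Sketch: customize GPAW's build via siteconfig.py before installing.
# Library names below are illustrative placeholders.
cat > siteconfig.py <<'EOF'
scalapack = True
fftw = True
libraries = ['fftw3', 'scalapack', 'openblas']
EOF
pip install .    # build GPAW from source against the libraries chosen above
```

This illustrates the difficulty raised here: the tuning lives in a Python file inside the source tree rather than in command-line flags, which is awkward for both hand-written benchmark instructions and for tools like EasyBuild or Spack to expose.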
It's probably one of the hardest benchmarks in the whole suite to install, in my view.

Okay, then maybe we should continue that discussion and see if there are things we can change to make it easier or better. I see your point: if you want a lot of flexibility in terms of tuning and changing compiler options, maybe using compiler option A for some part and option B for another part, then yes, currently that's very difficult.

EasyBuild is good for getting something that works, but it's not a benchmarking platform.

No, it's not; it's a software installation platform. But that's the first thing you have to do before you can run a benchmark.

Yes, but much of it is pre-programmed into EasyBuild, and it's even worse for those packages that have an easyblock: there it's even hard to see what's happening and to influence it.

Yeah, okay, that's a good point.

I mean, we have looked into it; that was one of the points where we thought, is that something we can improve in 6IP? But in the first all-hands meeting we came to the conclusion that it was not doable with the resources that we have.

Yeah, and there are certainly some challenging applications in that benchmark suite. I think one of the ocean modelling codes is NEMO, or one of those; those are pretty challenging even to get properly supported in EasyBuild.

There are a few packages in there that I think don't really belong in a benchmark suite, simply because they are too difficult to install.

Okay, we're 10 minutes over, so maybe we should start wrapping this up, unless there are other questions. If not, I guess we can wrap it up here. Thank you very much, Ole. Thank you, it was very nice presenting this.