Good morning everybody, and welcome to BioExcel webinar number 66. Today we have Holly Judge and Arno Proeme speaking about efficient GROMACS + CP2K compute resource usage for QM/MM simulation of biomolecular systems. Holly and Arno come from EPCC at the University of Edinburgh. I host this webinar; I'm Alessandra Villa from the Royal Institute of Technology (KTH). So, to introduce today's presenters to all of you participating: Holly is an application consultant with a background in computational chemistry and physics research. She is currently focused on exploring the performance of high-performance computing applications used in scientific research, in particular QM/MM calculations using CP2K, and she is looking at how to improve the performance of such calculations. She is a member of BioExcel and also part of the CSE team for the UK national computing service ARCHER2. Arno is a research software engineer with a background in computational statistical physics. He is currently working on a number of projects improving the parallel performance of scientific software so that it can take advantage of modern computer architectures. He is also involved in training, in a number of European projects, and in supporting research using HPC facilities, and he is also part of the UK national computing service. Now I give the word to them.

Okay, welcome everybody, and thank you for coming. So yes, I'm Arno, and here today with me is Holly. We're going to be telling you about how to use the CPUs and GPUs that are commonly found in HPC systems to do efficient biomolecular QM/MM simulation with GROMACS using CP2K. Before I proceed, I just want to make clear that although I'll be presenting the main part of the presentation, a lot of the work, especially the technical work, has actually been done by Holly. So she can correct me on anything that I get wrong or anything I miss out, and if you have any questions at the end, I will try my best to answer them, but probably Holly will jump in with the real answer. We'll see how we go.

Now, the reason why we put together this webinar is because, as some of you may already know, BioExcel has been producing an interface integrated with GROMACS that allows the use of CP2K to do QM/MM simulation. Our colleague Dmitry Morozov at the University of Jyväskylä in Finland developed this in collaboration with the GROMACS development team at KTH, Stockholm, and we have seen people starting to use it. As people start to use it, they're asking questions: not only "how do I technically use this, how do I get all the settings and parameters right", but also, since people use GPUs all the time to run GROMACS, and GROMACS runs great on GPUs, especially with the improvements made in the last couple of years, people ask: okay, can I use the GPUs that I run GROMACS on to run CP2K? Should I do that? How many cores should I use? CP2K has lots of different QM treatments; which one should I choose? We have already organised a BioExcel workshop on the software-agnostic QM/MM modelling aspects, the underpinning modelling choices, which addresses, for example, which functional you might like to choose. What we're trying to address with this webinar today are the concrete questions that we've seen people have, which will hopefully make it easier to start using HPC machines to do research with GROMACS and CP2K.
It should also hopefully serve as a reference guide for when you're applying for compute time, whether on your local machine at a university or research institute, on national-level HPC machines, or right up to the largest supercomputers run in the EU by PRACE and EuroHPC. So what we're going to try to cover today: first, briefly, something about GROMACS and CP2K parallel execution, just enough to point out why we're taking the approach we will take in telling you about the parallel performance of CP2K. Then I'll introduce the BioExcel QM/MM benchmark suite, which Holly and I have developed. Then we'll look in a little bit of detail at the actual parallel performance of CP2K with the biomolecular QM/MM benchmarks that form the suite, both on CPUs and GPUs, and we'll try to gather some lessons learned about how to make efficient use of HPC resources. As Alessandra already said, please feel free at any point to ask any questions that pop up. You don't need to wait until the end; ask them as they come up and we'll address them at the end.

So, as I said, this webinar is meant to be very concrete and practical, but we do want to say something at quite a high level about the way GROMACS and CP2K work together when you run QM/MM simulations using the new interface that Dmitry has developed. Why are we telling you this? Because it will justify why we're going to focus on CP2K parallel performance, and we're actually going to forget about GROMACS for the rest of the webinar after I've said this. Why is that? Well, because of the way the interface works: during the standard MD loop that GROMACS performs, the calculation of the energies and forces on the QM atoms, and on the MM atoms due to the coupling between the QM and MM atoms, is computed not by GROMACS, because it doesn't do quantum chemistry, but by CP2K. So GROMACS is launched in parallel, like you would normally launch it, on a number of MPI ranks, using a number of cores, using a number of GPUs, whatever you might usually do. Then, when it comes to calculating the QM and QM/MM interactions, GROMACS passes the information about the atomic structure to CP2K, which computes the QM and QM/MM coupling forces and energies. It does that in parallel, using the same cores that GROMACS has been running on, and passes the forces and energies back to GROMACS, which then proceeds with time integration. Now, why did I say we can forget about GROMACS? Because, as may come as no surprise, these QM and QM/MM coupling calculations are far more computationally costly than calculating the classical forces and energies and performing the time integration. What that means is that the parallel performance of doing QM/MM simulation with GROMACS and CP2K is pretty much entirely determined by the parallel performance of CP2K running whatever it is given by GROMACS when GROMACS calls its functionality. In other words, to understand how to use GROMACS with CP2K efficiently for QM/MM simulation, it is enough, for now anyway, to understand how CP2K efficiently uses CPU cores and GPUs to do the calculations that GROMACS asks it to do.
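For orientation, this is roughly what the GROMACS side of such a run looks like. This is a minimal sketch only: the option names follow the CP2K QM/MM interface as documented in recent GROMACS releases (treat them as assumptions and check the manual for your version), and the index group name and DFT settings are placeholders.

```python
# Minimal sketch of the GROMACS .mdp options that hand the QM and QM/MM terms
# over to CP2K. Option names as documented for recent GROMACS releases (assumed
# here, not verified against your version); "QMatoms" is a hypothetical index
# group defining the QM region.
qmmm_mdp_fragment = """
qmmm-cp2k-active         = true      ; hand QM and QM/MM terms to CP2K
qmmm-cp2k-qmgroup        = QMatoms   ; index group containing the QM atoms
qmmm-cp2k-qmmethod       = PBE       ; QM treatment (e.g. PBE, BLYP)
qmmm-cp2k-qmcharge       = 0         ; total charge of the QM region
qmmm-cp2k-qmmultiplicity = 1         ; spin multiplicity of the QM region
"""

with open("qmmm.mdp.fragment", "w") as f:
    f.write(qmmm_mdp_fragment)
```

Once these terms are handed to CP2K, everything that follows in this webinar, the core counts, threads, functionals and QM cell sizes, is about how CP2K spends its time on them.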
So now to introduce the BioExcel QM/MM benchmark suite. As part of our work in the project, Holly and I have two goals. We're trying to make it easier for people to use CP2K for QM/MM simulation of biomolecular systems, and the interface is one way to facilitate that. We're also trying to see whether there are places where CP2K could perhaps be made better for this particular kind of simulation, because CP2K is traditionally a computational chemistry code that has been used for materials science and all kinds of problems, not necessarily as much for biomolecular systems, and also not necessarily as much for its QM/MM functionality, which has been around for a while now, since about 2004 or so, I think. So as part of this effort to understand and analyse CP2K's performance, to see how we could improve it, and to improve the usability of CP2K in combination with GROMACS for biomolecular simulation, we have gathered a number of benchmarks together. Some of these have been adapted courtesy of our colleague Emiliano Ippoliti at Forschungszentrum Jülich, who published, together with collaborators, a paper on another interface, which still lives on in GROMACS as well, for using not CP2K but CPMD to do QM/MM simulation. We've adapted some of those benchmarks to work with CP2K.

Now, we have adopted quite a systematic approach, because we know in advance that there are a number of key aspects of the QM treatment of the QM region in the biomolecule, and of the QM/MM coupling, that affect the performance and essentially influence not only the computational cost of what we're asking CP2K to do but also, for example, which code path in CP2K is being exercised when we run these benchmarks. The approach we've taken is to use what are essentially three different biomolecular systems: MQAE, CBD-Phy and ClC. I should say that CBD-Phy was provided by a colleague of ours, Dmitry Morozov. MQAE is a small solute in aqueous solution, so it's a solute-solvent system. CBD-Phy is a phytochrome: a biliverdin chromophore bound to a dimer. ClC is a large membrane protein system, a chloride ion channel embedded in a lipid bilayer. As well as representing a variety of different biomolecular systems, we've chosen these benchmarks specifically because they have certain features that span a range of ways of exercising CP2K, which we want to investigate and share with people. Namely, the number of QM atoms, the number of atoms we designate to be treated quantum mechanically within these systems, goes from very small (19), through medium (34 and 68), to somewhat larger (253). That is actually fairly small compared to some of the calculations CP2K can do, which scale to tens of thousands of atoms, but it reflects the way we want to use QM for a region of interest in our biomolecule, whether we're calculating spectral properties, bond formation and breaking, proton transfer, things like that. Another key aspect of the QM treatment, besides the number of QM atoms, is what kind of total system they form part of. All the other atoms that are not QM atoms are treated classically, and you can imagine that if the balance between the number of QM atoms and the number of classical atoms in the system is very different, that could also give rise to different usage of CP2K's code paths. That's why we have one system, MQAE, with a fairly small total number of atoms, about 16,000 in total, while the other two, CBD-Phy and ClC, are much larger.
Now, a key aspect of the QM treatment of the QM region is the choice of density functional approximation within CP2K. Here we have incorporated two common generalised gradient approximations, BLYP and PBE, and also two common hybrid functionals that incorporate some Hartree-Fock exchange, namely B3LYP and PBE0, because, as we shall see, these exercise the code in quite different ways. The benchmarks are chosen so that by comparing any two of them we can isolate the effect of, for example, changing the number of QM atoms while keeping everything else the same, or changing the functional while keeping everything else the same, or changing the QM cell size that defines the spatial QM region while keeping everything else the same, and then seeing what the effect is. Teasing apart the effects on CP2K performance in this systematic way has helped us not only in profiling and identifying how CP2K performs as these parameters vary, but hopefully it will also allow you, here today, and anybody who watches this recording later, to understand how CP2K's usage of compute resources changes as a result of a decision to, for example, include more QM atoms or expand the QM cell size. For most of these results we've used the same basis set, one that is quite suitable for molecular systems; as the name suggests, MOLOPT is optimised for molecular systems. The time step in all cases is one femtosecond, the ensemble is NVE, and these are periodic systems. Okay, so that's our BioExcel QM/MM benchmark suite. As I said, this is very concrete: we're going to discuss some results from the benchmarks that will give you an idea of the usage of compute resources and how it varies depending on these parameters I've outlined.
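To keep the comparisons easy to follow, here is the suite at a glance as an informal Python summary; it only restates what was said above, and anything not stated explicitly is left as a comment rather than guessed.

```python
# At-a-glance summary of the BioExcel QM/MM benchmark suite as described above.
# Informal restatement only, not the definitive suite definition; see the
# GitHub repository mentioned later for the actual inputs.
benchmarks = {
    "MQAE": {
        "system": "small dye solute in aqueous solution (solute-solvent)",
        "total_atoms": 16_000,   # approximate
        "comparisons": ["BLYP vs B3LYP", "default vs 8x-volume QM cell"],
    },
    "CBD-Phy": {
        "system": "phytochrome, biliverdin chromophore bound to a dimer",
        "total_atoms": 168_000,  # approximate
        "comparisons": ["PBE vs PBE0"],
    },
    "ClC": {
        "system": "chloride ion channel in a lipid bilayer (membrane protein)",
        "total_atoms": None,     # large; exact count not quoted in the talk
        "comparisons": ["19 vs 253 QM atoms (with matching QM cell sizes)"],
    },
}

# Shared settings across the suite: MOLOPT basis set, 1 fs time step,
# NVE ensemble, periodic boundary conditions.
```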
Now, we've run these benchmarks on a few different systems, again chosen strategically because they represent both recent-past and still-current architectures and processors that are available in a lot of places. For example Cirrus, which is an HPC machine here at EPCC in Edinburgh, consists of 280 compute nodes; each compute node has two 18-core Intel Xeon Broadwell processors running at 2.1 GHz and 256 GB of RAM, and the nodes are connected through an InfiniBand interconnect. So that's, I don't want to say traditional, but the kind of multi-core Intel processor node that has been around in HPC for a while now. Then a lot of systems worldwide, including in the EU, are starting to appear that include AMD EPYC processors, which have, for example, 64 cores each. So as well as benchmarking on Cirrus, we've also benchmarked on ARCHER2, the UK national supercomputer for scientific research, which is a lot larger and on which each compute node has two of these AMD EPYC Zen 2 "Rome" processors, each with 64 cores, for a total of 128 cores per compute node, running at 2.25 GHz, also with 256 GB of RAM. That's already one interesting piece of information: the amount of memory per core is a lot smaller on that system, which can have implications for how we optimally run our benchmarks. The network there is HPE Cray Slingshot, which is an especially low-latency, high-bandwidth interconnect.

Now, just a little note about how we have run our benchmarks. These are MD benchmarks using the QM/MM functionality within CP2K. In order to subtract out any initialisation cost and keep things manageable, we have, or rather Holly has, run lots of runs where we run 6 MD steps and also 1 MD step; we subtract the 1-step time from the 6-step time and are left with 5 steps, which we average, to factor out the initialisation cost. That helps not just because of the QM aspects, but also because we run CP2K standalone rather than through GROMACS, which is fine, because that's where the performance comes from ultimately, as I've said, but there are some classical parts that CP2K does standalone that we don't really care about, because GROMACS would handle those. We also average over multiple runs, so that we can look for any outliers that signify operating system noise or network noise. Holly has also developed a script which allows us to analyse the resulting standard-format logs produced by CP2K and to extract not only parallel scaling plots of overall execution time, by computing these averages and the average time per MD step, but also to generate profiles, where we can visually see the top however-many subroutines within CP2K that contribute most to the overall run time.
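As a small illustration of that timing methodology, here is a sketch in Python of the arithmetic just described: a per-step time obtained from a 6-step run and a 1-step run, averaged over repeats. The wall times passed in below are made up for illustration.

```python
from statistics import mean

def time_per_md_step(six_step_walltime: float, one_step_walltime: float) -> float:
    """Cost of one MD step with initialisation subtracted out.

    Subtracting the 1-step run from the 6-step run removes the start-up cost
    common to both, leaving 5 'production' MD steps to average over.
    """
    return (six_step_walltime - one_step_walltime) / 5.0

# Average over repeated runs to smooth out operating-system / network noise.
# Wall times (in seconds) below are purely illustrative, not measured values.
repeats = [(32.1, 7.4), (31.8, 7.2), (33.0, 7.5)]
avg_step_time = mean(time_per_md_step(t6, t1) for t6, t1 in repeats)
print(f"average time per MD step: {avg_step_time:.2f} s")
```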
Okay, so let's dive into the benchmarks now. We'll start with MQAE, and for the MQAE system we're running on the CPU partition of Cirrus that I've just outlined. What I've shown here is the time per MD step, and when I say time per MD step it's the wall time, so it's literally the amount of time you need to wait for the program to execute. Of course, what most people actually care about is nanoseconds per day when they run this as part of an MD simulation, so I've converted this to nanoseconds per day, or actually, as it happens, picoseconds per day, because these are costly calculations. That conversion is based on the assumption that the only thing impacting the run time and the simulation performance is this computation, which we know is true to zeroth order, as an initial estimate. So all these slides compare, as I said, the effect of varying one thing between two of the benchmarks; in this case we're comparing MQAE with the GGA functional BLYP against the same system with B3LYP.

Now, when you run CP2K, as you either already know or will see when you start to follow the instructions for running it, it runs in parallel in two ways: with MPI, launched on a number of ranks, and optionally with OpenMP threading, where each MPI rank can have one or more OpenMP threads. In general it is worth experimenting with this; it can certainly be advantageous to have more than one thread. What I've shown, as the most useful way we could think of to summarise the performance characteristics we've encountered, uses what seemed like good choices of the number of threads per rank, so hopefully that also serves as a reference for what good choices might look like, but you always have to try for yourself, because it depends on your system and on other parameters we've not specified here, like cutoffs and all kinds of things. We see that on Cirrus we get a performance of 17 picoseconds per day for the BLYP case and a little less, 13 picoseconds per day, for the B3LYP case, which is not too bad: we have Hartree-Fock exchange in there, so that's giving us something better, and it's not costing us a whole lot more. On ARCHER2 it's similar but a bit better. I've chosen core counts that are roughly comparable; of course the node sizes differ, and since these data points are mostly whole-node increments of 1, 2, 4, 8 and so on, they don't exactly coincide, but comparing 288 cores on Cirrus with 256 on ARCHER2 gives some idea that we are roughly in the same ballpark, despite these being two different generations of processor, one Intel and one AMD, and one a bit older.

When you end up in a similar ballpark with these kinds of calculations, the parallel efficiency is something to keep in mind, and I've shown it here. When you run in parallel you have to choose how many cores you want to use, so the question is: how efficient is that, and should you use more? To determine that, in principle you should first run on the smallest number of cores you possibly can (for example, if you ran on any fewer cores the system might run out of memory, so you have to use at least that many) and then see how the execution time decreases as you increase the number of cores. A simple notion is: if you run on 10 times as many cores and it goes 10 times as fast, that's 100% parallel efficiency, and anything less than that is lower efficiency. What I've shown here is the parallel efficiency relative to running on a single node. That is not to say you cannot run these benchmarks on less than one full node on these systems; for some of them you can, but there's always some ambiguity in what you compare the parallel efficiency to, and as I said it can depend on the parameters, so this was a useful way to get some metric for the parallel efficiency at the core counts quoted.
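Both of those quantities are simple arithmetic, so here is a small sketch of the conversions in Python, assuming the 1 fs time step used throughout the suite; the figures fed in are illustrative, not measured values.

```python
SECONDS_PER_DAY = 86_400.0

def ps_per_day(time_per_step_s: float, timestep_fs: float = 1.0) -> float:
    """Simulated picoseconds per day, assuming the wall time is dominated by
    the QM/MM force evaluation (the zeroth-order assumption above)."""
    steps_per_day = SECONDS_PER_DAY / time_per_step_s
    return steps_per_day * timestep_fs * 1e-3   # fs -> ps

def parallel_efficiency(t_base: float, cores_base: int,
                        t: float, cores: int) -> float:
    """Speed-up relative to a baseline (e.g. one full node), divided by the
    increase in core count; 1.0 means perfect scaling."""
    speedup = t_base / t
    return speedup / (cores / cores_base)

# Illustrative: ~5 s per MD step corresponds to roughly 17 ps/day at a 1 fs step.
print(f"{ps_per_day(5.0):.1f} ps/day")
# Illustrative: 4x the cores but only 2.5x faster -> ~62% parallel efficiency.
print(f"{parallel_efficiency(10.0, 36, 4.0, 144):.0%}")
```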
At these core counts, the performance, whatever it is, comes at something like 40% or 50% parallel efficiency. If you're happy to run at that, okay, but you're potentially wasting compute budget, so it's potentially not very efficient. Doing this kind of measuring is a useful way to determine whether you're getting bang for your buck, or bang for your funder's buck, and whether you're making efficient use of the machine. One thing we see often with CP2K is that, as I said, threading can help. A general trend that we often see, and that we've seen with these benchmarks as well, is that having more than a single thread per rank can be better; in particular it seems to push the performance out a little further at larger scale, sometimes to the detriment of the performance at a single node, or whatever the smallest scale is that you're running at. So that's something to be aware of. We could in principle go into detail about why these don't scale any better, and we have looked at this in detail, but that's perhaps more than the time we have here merits, really.

Now, on the next slide there's one single change: the column on the right has remained the same, so we're still looking at the same functional, but on the left what we've changed is the QM cell size. We've increased each linear dimension of the cubic cell by a factor of two, for an overall eight-fold volume increase, and the performance is just about six times slower on Cirrus than it was for the original QM cell size. So you've increased the volume by a factor of eight and the performance has gone down by a factor of six; on ARCHER2 it's slower by a factor of five. Why the difference between the two machines? Difficult to say; you'd have to look into the profiles, and it could be that the faster interconnect on ARCHER2 is helping the performance there. But that gives you some sense of how the performance scales with the QM cell size: you might expect the cost to grow with the volume, i.e. cubically in the linear dimension, but it looks as if it grows a little more slowly than that.

So let's look at a different system now, one with slightly more QM atoms, embedded in a much larger overall system of 168,000 atoms. We're now looking at the GGA functional PBE in comparison to its Hartree-Fock-exchange-including equivalent, PBE0, and there we are again seeing roughly similar behaviour. It's not necessarily worthwhile comparing across systems: if you're trying to decide between PBE and BLYP, or between PBE0 and B3LYP, then you would need to compare those directly. The interesting point here, I guess, is that the comparison between PBE and PBE0 looks similar to the comparison between BLYP and B3LYP. We can see that the hybrid is costing a bit more, the parallel efficiency is looking a little better at the scales we're looking at, and you get an impression of the performance in picoseconds per day on both systems; trend-wise we really are in the same ballpark on these two machines.

Then let's look at the ClC system, the ion channel, where we have a case of keeping everything the same apart from the number of QM atoms we designate — oh, and also the QM cell size, which changes with it. There we do see quite a big difference, as might not be a surprise: roughly a seven-fold increase in performance on Cirrus going from 253 down to 19 QM atoms, and not as dramatic a change on ARCHER2. But there is something going on here which is instructive and useful to know about.
When you run ClC-253 — we saw this on Cirrus, because of the available memory — you actually have to choose a different parallel scheme parameter for the QM/MM subsystem specification in CP2K. By default the parallelisation scheme is atom-based, but that can run out of memory, because the grids used to do the QM/MM calculations are replicated. The other option is the grid-based parallel decomposition scheme, which reduces the memory requirements, but then, if you have many QM atoms, the performance may suffer. Threading can help here: these particular choices of thread counts have made a big difference compared to what we would have got with different numbers of threads, and for ClC-253 threading ameliorates the strong effect of the inefficiency of the grid scheme. What's more, threading lets you make better use of the available memory on a compute node, because each rank then has more memory available. Similarly — and this is true in particular for hybrid functionals, where you specify the maximum amount of memory that can be used to store electron repulsion integrals, although that's not what's at play here — under-populating nodes, so that you don't use all the cores but each rank that is running has more memory available, can mean that you actually run faster. So you might in some cases use fewer cores and run faster, simply because you are then still able to use the default atom-based parallel decomposition scheme. The detailed reason why this effect is so large is not something we should go into here, because it depends on details of the profiling.

Something to be aware of when you want to run hybrid functional calculations, for example PBE0 or B3LYP, is that the MOLOPT basis set is very costly when combined with hybrid functionals. In CP2K, people generally say: okay, maybe use the HFX-optimised basis sets or ones from EMSL, but these may be less suitable for biomolecular systems, so you should check the literature. There is an approach in CP2K that can help, which is the auxiliary density matrix method, or ADMM. This can dramatically speed up hybrid functional calculations with the MOLOPT basis set: for example, for ClC-19 with B3LYP (not shown on the previous slide, which was BLYP), using ADMM we can accelerate the MOLOPT-basis calculation by a factor of 15, a dramatic effect. And if you want to go beyond Hartree-Fock exchange — if you're thinking, why are we using so few cores, why is it scaling badly, I want to run calculations with MP2 accuracy, or RPA — you can do that with CP2K, you can go really extreme, and that does scale out much better, because effectively you're asking CP2K to do much, much more work. One of the things you can see in the profiles is that the balance between computation and the communication that is required is then more favourable, whereas for some of these systems with fewer QM atoms, or with less demanding DFT approximations, the amount of communication can dominate over the amount of computation you're actually asking CP2K to do, leading to a petering out, a diminishing, of the parallel scaling efficiency of your simulations.
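Before we move on to GPUs, for reference: the two settings just mentioned, the QM/MM parallel decomposition scheme and ADMM, live in the CP2K input. The fragment below is a rough sketch from memory rather than a validated input; the section and keyword names (PARALLEL_SCHEME under &QMMM, the &AUXILIARY_DENSITY_MATRIX_METHOD section and the AUX_FIT basis assignment) should be checked against the CP2K input reference for your version, and the basis set names are placeholder choices.

```python
# Rough, unvalidated sketch of where the QM/MM parallel decomposition scheme
# and ADMM are set in a CP2K input file. Keyword and section names are from
# memory; check the CP2K input reference for your version before use.
cp2k_input_fragment = """
&FORCE_EVAL
  &QMMM
    ! Default decomposition is atom-based; GRID lowers the per-rank memory
    ! footprint at some cost in performance (as discussed above).
    PARALLEL_SCHEME GRID
  &END QMMM
  &DFT
    BASIS_SET_FILE_NAME BASIS_MOLOPT
    BASIS_SET_FILE_NAME BASIS_ADMM
    ! ADMM: approximate the Hartree-Fock exchange in a small auxiliary basis.
    &AUXILIARY_DENSITY_MATRIX_METHOD
      METHOD BASIS_PROJECTION
    &END AUXILIARY_DENSITY_MATRIX_METHOD
  &END DFT
  &SUBSYS
    &KIND O
      BASIS_SET         DZVP-MOLOPT-GTH
      BASIS_SET AUX_FIT cFIT3    ! auxiliary fitting basis used by ADMM
    &END KIND
  &END SUBSYS
&END FORCE_EVAL
"""
print(cp2k_input_fragment)
```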
Okay, so that's CPUs: CP2K runs great on CPUs, over many, many CPU cores using MPI communication; it's been doing that for years and has been heavily optimised for it. How about GPUs? As I said, the question we often get is: can I, should I, run CP2K with GROMACS on GPUs to do QM/MM calculations, because I know GROMACS runs great on GPUs? To check what the performance is like, we ran the same benchmarks on two different systems. One is the GPU partition of Cirrus, which has four NVIDIA Tesla V100 GPUs per node and two 20-core Intel Xeon Cascade Lake processors per node — a bit newer than the 18-core Intel Broadwell processors on the CPU partition of Cirrus — and a bit more RAM per node as well, also connected via InfiniBand. In addition, we were given access to the AMD Accelerator Cloud to see what the performance was like there, on a combination of AMD EPYC CPUs and eight AMD Instinct MI100 GPUs per node with 512 GB of RAM. That is relevant because these AMD GPUs are becoming more common, not just in US HPC systems but also in European machines like EuroHPC's LUMI system, which is up and coming and for which the CP2K developers have been preparing as well, in collaboration with CSC, the hosting centre.

Going through the same benchmarks as before: in principle you should perhaps compare, on exactly the same nodes, what it's like using the GPUs versus not using them; however, for practical reasons it was more efficient to simply use the GPU nodes to run on GPUs and then compare those results, as we've done here, to the results you've already seen. So at the top is always a result you've already seen, from the CPU partition, now compared to the GPU partition. Does that make sense? Yes, because they're somewhat comparable: they're in the same machine, they have the same interconnect, and the CPU cores are slightly fewer and a little older than the ones on the GPU nodes, but comparable. So what do we get when we build CP2K to use GPUs in all the ways it can, which I'll briefly mention as well? Well, unfortunately the answer is: you don't get a lot. You can see that the performance on the CPU nodes is actually quite similar to the performance on the GPU nodes. We're trying to get CP2K to use all four GPUs, but for this particular biomolecular benchmark there's not a massive difference between having those GPUs and not having them. Switching to the large QM cell size, we can see that there is a bit more of an advantage: 17 ps per day on the CPU nodes versus 24 ps per day on the GPU nodes. We see similar trends for the CBD-Phy system: yes, it's slightly better on the GPU nodes, but the thing is that the CPU cores on the GPU nodes, as I said, are newer, slightly faster, and there are slightly more of them, so if you were to run not on the GPUs at all but just on the CPU cores of the GPU nodes, it might actually be faster than using the GPUs. We'll get to why that is, because we know that CP2K can perform very well on GPUs; it's just a question of the domain of applicability and the fit to these biomolecular systems. Finally, for ClC we see a similar story: the GPUs give some boost, it is a bit better, but overall it's not a massively overwhelming benefit. So why is this? It's because the typical QM treatments for biomolecular QM/MM simulation in CP2K, and in particular the number of QM atoms, mean that these benchmarks and these simulations do not heavily rely on the parts of
CP2K itself, or on the external libraries, that have been heavily optimised to be offloaded to GPUs, and to do so in a distributed way. CP2K can scale linearly to extremely large system sizes, but that is not the code path being exercised by these biomolecular benchmarks. However, we do not despair, because there is extremely active, ongoing development by the CP2K developers on improving GPU offloading, both for NVIDIA devices and AMD devices, within CP2K itself and also within libraries they develop, like DBCSR. In particular, the ones to keep an eye out for are the two-electron integrals, which we know are a bottleneck for a lot of the hybrid functional calculations in these biomolecular benchmarks and are handled by the libint library, as well as some of the grid operations. And actually, as we started talking to the developers, they started looking at GEEP as well, so we have looked at that a little; GEEP is the set of QM/MM-specific subroutines within CP2K that calculate the QM/MM interactions.

We have only talked about NVIDIA so far; how about AMD GPUs? CP2K version 9.1, which was released this January, has experimental HIP offload support via the DBCSR library; however, we know that DBCSR is more important for linear-scaling DFT calculations than for what we are exercising with these biomolecular benchmarks, and it can use OpenMP multithreading. There is a HIP backend for the grid operations. There is the ELPA library, which we don't recommend for production use here because it is a lot slower than simply using the CPUs, and the COSMA library, which is also used by CP2K, supports HIP but is not recommended for production use either, for similar reasons. So Holly tried a single-node benchmark of one of these systems, MQAE with the GGA functional BLYP, and the performance is, similarly, not necessarily advantageous compared with the NVIDIA GPUs. There are eight of these AMD MI100s per node, where there were four NVIDIA GPUs: we got 15 seconds per MD step on a single node of the Cirrus GPU partition; with those eight AMD GPUs we get 11 seconds per MD step; and on half a node of ARCHER2, without any GPUs, just the AMD EPYC CPUs, 3 seconds per MD step. So CP2K just runs really well on CPUs, but it is being very actively developed for GPUs, including for these kinds of cases that have not traditionally exercised those code paths and have therefore simply not received as much development priority.

I think we are getting to time, which is fine. I hope this has been useful; we really wanted to give you an overview, a rough guide as we said in the abstract for the webinar, of the performance of CP2K for typical biomolecular QM/MM treatments. The benchmark suite is on GitHub. There is a best practice guide that we have put out, which is linked there and should be useful — you can see useful information on the left-hand side — and we are expanding it in the very near future to include some additional information covering not just CP2K standalone but GROMACS plus CP2K. A while ago, early last year, we organised a workshop on best practices in QM/MM simulation, which really focused on the software-agnostic side of the QM/MM treatment, and our colleague Dmitry Morozov also gave a webinar earlier on the interface that he developed. I just want to say thanks to Holly for doing loads and loads of work on this, and to our colleagues from BioExcel — Emiliano, Mirko, Gerrit, Dmitry — and also to the CP2K developers, who have
been extremely helpful, in particular Thomas Kühne and Matthias Krack and other people who have been very, very helpful in talking to us about CP2K. So with that I think we can probably stop.

Thank you very much, Arno, that was a very nice presentation. So now we go to the questions: please go ahead and type your question in the Q&A section if you haven't up to now, or if a new question comes up, but I will start by reading the first one. Aik is asking: how did you determine what restrictions on the step size apply in QM/MM simulation?

Thanks. I have a simple answer, which is that this is what was used before, but I don't know if Holly has anything more insightful. — We took the step sizes for the systems from the papers that Arno mentioned. I guess if you're running a QM/MM system you can expect to maybe have to decrease the step size a bit compared to a typical classical system, because you have to resolve the motion of the QM parts as well. — It's an interesting question, actually, to what extent the time per MD step depends on the time step; we've thought about this. It's not simply that you can scale things up by using a different time step, and it's not only the time integration that GROMACS does that you have to worry about. It's more that CP2K computes an approximate wavefunction for the electronic structure within the QM region, retains it, and iterates to reconverge it via its self-consistent field algorithm on the next MD step. The more everything has moved — the more GROMACS has told everything to move, which is what happens if the time step is very large — the further the next wavefunction will be from what was computed in the previous step, so the convergence of CP2K's SCF algorithm will probably be slower. So there is some interlinking there. And you always have to be aware of instability and the many other things that can go wrong; between Dmitry, Emiliano and the participants of our workshop last year, specifically the workshop on QM/MM simulation with GROMACS and CP2K, there is good guidance on that. — Maybe you can add the link in the chat if you think it's useful. And please let us know if this answered your question.

Now we have another question, from Can: would it be wise to prioritise AVX-512 availability and compatibility in the processors chosen, for utmost performance gain, or do other factors such as CPU count, core count or core clock simply contribute more? — Well, vectorisation is increasingly important in modern HPC codes. I'm pretty sure that when you compile you just pass the flag to tell it to use it. It matters; I don't know how much it matters — do you have any sense? — We've not looked at that specifically. But basically, if you have some processors, you'd better make sure you make use of whatever vectorisation they offer, because I'm sure at least some parts of CP2K can use it.

Okay, now we have a further question, also from Can: will CP2K support more mainstream CUDA-capable NVIDIA GPUs, like NVIDIA's RTX GPUs? — I assume it's mainly a matter of precision, right? The precision issue. CP2K might not be happy with that, actually; I think it might require...
I mean, it's not like GROMACS, which is happy to do mixed precision and work things out cleverly — not that CP2K isn't clever, but I'm not sure, to be honest. I think the focus has very much been on the cards that are in these big HPC machines, so I don't think consumer cards have been on the radar as much, because CP2K has traditionally been run at massive scale on big parallel machines. So I don't think there is as much of that pattern of labs having their own cheap, really cost-effective, bang-for-buck consumer GPUs that they buy because they can get really good performance out of them, like they do with GROMACS — but maybe I'm wrong, and the CP2K developers will, of course, know.

Then, if there are no further questions, I just want to tell you that the next webinar will be on the 7th of June, and it will be about HADDOCK. The title is still to come, so keep an eye out, and then you will know which HADDOCK it is — we think it will be on HADDOCK 3 — and the speaker will be João Teixeira from Utrecht University. And if no other questions are popping up — let me just check whether there is something in the chat; no, and nobody has a hand raised — then I thank everybody for attending, and Arno and Holly for the presentation and for the work. Thank you very much.