He's based at KTH, the Royal Institute of Technology, in Sweden. Do you have any slides to present, Professor?

Yes. Should I share? Yes, please. Let's see if I can find where my slides are. Right there. Let's see if this works.

I can certainly see your screen.

Yeah, let's see if it goes full screen. Okay. So thank you for inviting me to speak at this session, and especially to answer questions. The slides I have are mainly on performance tuning, but I don't know if that's the main interest of the audience, so please interrupt and ask questions and we can go in any direction. I also have other material if necessary. For example, GROMACS has gotten a lot faster recently, and we have more support for different GPUs and, in particular, different ways of using them. I haven't included a lot of GPU details, because those tend to change fast, but if there are questions I'm happy to answer them. That is of course where most of the performance is to be had, depending also on the hardware you have. As far as correctness is concerned, we haven't changed anything: GROMACS has been correct for a long time and is still correct. More than ten years ago we had approximations to make things faster, but nowadays we do things right, as most codes in this community do.

Okay, so I can change slides. GROMACS does classical molecular dynamics. There are some efforts on bringing QM/MM back into better shape; we have a link to CP2K and probably also other codes, but I won't go into that here. It mostly targets problems from biochemistry, but there's also materials science, and personally I'm working on flow kinds of problems: simulations of water droplets and of wetting problems containing millions to tens of millions of atoms. That's also becoming more important, though by far not as large as the biochemistry field, which is really large.

It's free and open source. Currently C++17 is our standard, at least for the next release, maybe not yet for the current 2021; I'm not even sure when we switch to the C++17 requirement. It's developed by multiple institutions, although the main weight of developers is currently at KTH in Stockholm, and it's used by hundreds of research groups, could be thousands or more; since we can't track how many people use it, that's difficult to say. Currently the main funding source is BioExcel, from which I'm speaking here, but we also have Swedish funding from SeRC and other changing funding depending on the projects we have. And of course the people contributing from other places usually have funding from somewhere else again.

Okay, so a typical simulation system is shown here, although you can simulate anything containing atoms with classical force fields. A reasonable fraction of the simulations are this kind of thing: either proteins in solution or membrane proteins, as drawn here in a somewhat artistic picture, where you have a protein in a membrane, where you see the two leaflets in blue, a coloured protein, some counter-ions and some alcohol molecules, I think, because here we want to know how alcohol affects this particular signalling protein. Those are the kinds of systems one would like to simulate, as fast as possible and correctly as well, of course. These are systems of 150,000 to 200,000 atoms, which is a very common size.
You can also simulate something much smaller or bigger, but in biomolecular simulations one of the main issues is that the size of the system is given by what you're after. If you want to study this molecule, that's the size you have; there isn't something much bigger. And this is an issue for scaling, especially on modern machines which are going towards the exascale, because this is a limited-size system, so you can't scale it to a whole exascale machine by itself for sure. For that you need to work on ensemble techniques and those kinds of things, which are somewhat beyond the scope of this presentation and session, but I'll at least go into running multiple simulations together and how you can do that efficiently. Of course you might be simulating something completely different, something of your own interest, but these should be fairly general recommendations.

So one question is: does GROMACS performance optimization matter? Well, the quality of science in general depends on the number of independent configurations sampled. That holds generally, and in particular for MD problems, since you often cannot sample enough; you would like to sample more than you can. That's why it's important to have good performance, so that within the wall-clock time or the allocation you have on a supercomputer you sample as much as possible. The GROMACS default performance is pretty good, so if you're a beginner you don't need to bother too much; things will run quite okay, especially compared to many other codes, and you can go do something else. But if you're running many simulations of the same kind, as is often the case, on the same kind of hardware, which is also often the case on your supercomputer or local cluster, then it might be worth optimizing the performance, because the resources cost more than your time. So in the situation where you need to run many simulations on the same machine, which is very often the case, it's worth looking into how to optimize performance.

What do you need to consider? You need to consider how GROMACS was built, which I'll shortly go into; what we want to simulate and how we want to simulate it; how GROMACS works inside, which I won't go into in any detail here since there's not enough time; what hardware we will use; how we map the simulation to the hardware; and how we find out what might be improved. For the last point, I'll leave that to the question-and-answer session: if you have particular problems or cases, we can discuss those, and I think that's more useful in terms of an example.

Okay, so let's start at the beginning. Nowadays I think building GROMACS is not really critical anymore, because most compilers are quite good; standard GCC, or even better Clang, is pretty good. So this is rather uncritical nowadays. CUDA is also mature. OpenCL is a bit of an issue for AMD GPUs because that's likely being phased out. And you need an MPI library if you want to run across nodes; an example configure command is sketched below.
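As a rough illustration of a build along these lines (not from the talk; treat it as a sketch rather than a recipe, since the exact CMake options and recommended values depend on your GROMACS version and hardware):

```bash
# Minimal sketch of a GROMACS build for a single GPU node.
# GMX_BUILD_OWN_FFTW downloads and builds FFTW with suitable SIMD support
# automatically; double precision is left off, as recommended for most users.
cmake .. \
  -DGMX_BUILD_OWN_FFTW=ON \
  -DGMX_GPU=CUDA \
  -DGMX_MPI=OFF \
  -DGMX_DOUBLE=OFF
make -j 8 && make install
```

For a multi-node run you would instead build with -DGMX_MPI=ON against the cluster's MPI library; on a single node the thread-MPI default is enough.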
So about the only thing you need to worry about somewhat is how to configure the FFT library, usually FFTW, which is used for the PME electrostatics. That can be a lot slower if it's not built with the SIMD instructions of your hardware, so you usually want the --enable-sse2 flag and --enable-avx or --enable-avx2, depending on your hardware; you need both of those flags, and that can affect performance, especially if you're limited by PME. But if you use the flag that has GROMACS build its own FFTW, then it sets those automatically, so even that's not an issue. The only other requirement is an MPI library if you want to run multi-node; on a single node you don't need MPI, because GROMACS has internal thread-MPI, which just works. And about the only other thing here is that you don't need double precision. You only need that if you really know why you need it; in general most users don't need it at all, and it loses a lot of performance. So I think building has become easy nowadays; it's not problematic anymore.

Then preparing the simulation is rather important. Here one can gain performance by having a simulation cell of the right shape, just large enough for the science and the physical model, but not so small that you get artifacts. We have the virtual-site trick, which you should read about in the reference, to double the time step. We hope to replace that with multiple time stepping, but that doesn't work fully stably yet, so we need to work on it before we can advise everybody to use it. Otherwise, our advice nowadays for a normal simulation setup is to use LINCS for bond constraints and to constrain only the bonds involving hydrogens, because that is what the force fields usually used, not all bonds as we used to do in GROMACS. This is faster and it's more correct for the force field, so that's one small consideration that improves both performance and correctness. It doesn't have large effects, but since both performance and correctness go up, it's worth doing. Typical water models are rigid. And, this is probably an old note, you should use the water model intended for your force field unless you know what you're doing. So preparation, I think, is nowadays also unproblematic and rather standard.

Then there are some considerations on how to choose your simulation settings. One issue that can quickly happen is that if you decide to write coordinates and velocities every step, you can very quickly fill up your disk; you might bring down a whole machine that way, because if you do a thousand steps per second and write a hundred thousand coordinates and velocities per step, you write a lot of data, and usually it's useless. So it's good to think about how much data you need: not too little, not too much. You also shouldn't call the coupling algorithms every step; that also hurts performance, but I think the defaults are set up quite okay now. There are several nst parameters for this if you look in the parameter setup.

Then a critical thing for correctness is that you should use the Lennard-Jones, or van der Waals, cutoff that is part of your force field; you should use what the force field was parametrized with. The same goes for the dispersion correction, which is linked to that. So this is rather critical.
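As a small sketch of the kind of .mdp settings discussed above (illustrative values only, not from the talk; the cutoffs, dispersion correction and water model must follow your force field, and the output intervals depend on what you actually need):

```
; Output: don't write full-precision coordinates/velocities every step
nstxout              = 0
nstvout              = 0
nstxout-compressed   = 5000     ; compressed coordinates only, every 10 ps at dt = 2 fs
; Constraints: LINCS on bonds to hydrogens only, as most force fields expect
constraints          = h-bonds
constraint-algorithm = lincs
; Nonbonded settings: take these from the force field, don't tune them
vdwtype              = cut-off
rvdw                 = 1.0
DispCorr             = EnerPres
coulombtype          = PME
rcoulomb             = 1.0
```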
If you change those cutoff and dispersion-correction settings, your simulation results might not be correct, or might not match what was intended with the force field, so you shouldn't mess with that. The Coulomb cutoff doesn't matter in the same way, because we usually use particle-mesh Ewald; there you can in principle use a longer cutoff, like GROMACS does internally, and it now tunes that automatically. You can trade off cutoff against grid spacing and run at the same accuracy with different parameters, but unfortunately we don't have an accuracy input setting for this for the user, so this is a complicated part. Usually you can just use the standard parameters there, but there are some considerations; I think we have those in the user guide. There's a lot of information in the user guide if you want details: there's a how-to linked here at the top, and the user guide gives more details about the different options. I won't go into further detail; you can find all of it on manual.gromacs.org. This presentation is mainly to give you awareness of the considerations you should have when running simulations.

Then, for good performance you need appropriate hardware. Either you have access to hardware at a computer centre, in which case you have what you have, or you buy your own hardware, and then there's this very nice paper, I should have put a picture here, called "More bang for your buck", linked here, which looks at all different kinds of configurations of GPUs and CPUs. It's from 2018, so by now it's again somewhat outdated; we had an earlier version from 2015, and since developments go quickly these things tend to get outdated. But there you can see, for different simulation types and different hardware combinations, what performance you get, so you can choose the best setup for the price.

A consideration here is that if you have multiple GPUs, since the overhead of moving data between them and the CPU is high, you need about 10,000 particles per GPU to make using multiple GPUs worthwhile. As GPUs get bigger, this gets more and more problematic, so this is a big challenge for us working on GROMACS: to still scale to multiple GPUs when the GPUs keep getting bigger and the simulations are not getting much bigger. That is hard. If you want to run on multiple nodes, you need a fast network; this slide is probably a bit dated, most clusters have at least 10 gigabit Ethernet now, and that's usually okay. It's often the PCIe bus that is also slow. Memory and disk don't matter for GROMACS. You can even run on cloud resources, but you should avoid running inside a virtual machine; nowadays even that is a feasible solution. It costs you, of course, but in the end the prices are quite competitive. And homogeneity matters: homogeneous resources are much better. There's a lot more to say about hardware, but I'll leave it at this; if there are more questions, I can talk about it later.

So here are some tips for running different setups. Say you run on a single, CPU-only node. Then you can use the default build, which compiles with our thread-MPI support; we emulate MPI using threads. So you can run with domain decomposition, which I don't have a picture of here.
mdrun can then automatically choose how many domains it uses, but you can vary that by hand with the -ntmpi flag, and you can use -ntomp to set the number of threads per rank, although everything is threads here, of course. Hyperthreading, I think, is handled quite well nowadays; it used to be somewhat of an issue, but nowadays it usually works well on CPUs. So you can choose different layouts, and this might be worth trying if you run many simulations. If you have 16 cores, for instance, you can run 16 ranks, so 16 domains with one thread each, or eight domains with two threads, or four domains with four threads, to use all the hardware. And with hyperthreading you can double those numbers. There are more examples in the user guide, but this is basically the only thing you need to try here, so that's rather doable.

Then if you go to a single GPU-plus-CPU node, currently part of GROMACS runs on the CPU and part on the GPU. Here there are a lot more options for dividing the tasks, but the default is quite okay; again I would refer to the user guide for more examples. But now you can vary things. If you have two GPUs you could, as in the examples here, run two domains, one on each GPU, with this many threads. But it might actually be faster in many cases to run multiple domains on one GPU, since then the GPU can balance the load a lot better and idles less. So maybe the optimal case here would be four domains with four threads each, two domains per GPU, since the work can then overlap better. One can play around with these kinds of things and see what happens. Oh, and this varying shown here, I should probably remove that from the slide: since the 2020 version it's automated and no longer sensitive, so it's not important anymore and can be removed.

Then on multi-node clusters things are more complicated; maybe you've seen some examples of that in the previous talk. Here you need to build an MPI-enabled GROMACS, and then it becomes very network dependent. This is an issue even on good hardware that is not busy, because data needs to go over the network, and latencies play a strong role there, since GROMACS has such fast iterations: it can do about a thousand iterations per second or more, so it's really latency sensitive. And if there are other jobs on the machine, they can interfere, so this quickly gets problematic, and the performance might vary a lot depending on what other jobs are running. Requesting nodes that are close together in the network can help; we've been experimenting with this, but it is annoying to use. You can also optimize the MPI library for this, but that gets very technical; you should talk to your computer centre for that kind of thing. What you can do yourself, which I haven't explained yet, is have separate ranks for the PME mesh calculation, which is a rather different calculation from the rest and can be split off to separate MPI ranks. Then you can play with the number of those; you need to preset it. There's an -npme option for that; sorry, there should be a dash there on the slide. Some example command lines for these layouts are sketched below.
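As an illustration of the layouts just described (all flags here are standard mdrun options, but the specific combinations and counts are only examples; which layout wins depends on your hardware and GROMACS version):

```bash
# 16-core CPU-only node, thread-MPI build: same total thread count, different layouts
gmx mdrun -ntmpi 16 -ntomp 1    # 16 domains, 1 OpenMP thread each
gmx mdrun -ntmpi 8  -ntomp 2    # 8 domains, 2 threads each
gmx mdrun -ntmpi 4  -ntomp 4    # 4 domains, 4 threads each

# Node with 2 GPUs: two domains per GPU can overlap work better than one per GPU
gmx mdrun -ntmpi 4 -ntomp 4 -nb gpu -pme cpu -gputasks 0011

# Multi-node, real MPI build, with dedicated PME ranks set via -npme
mpirun -np 32 gmx_mpi mdrun -ntomp 4 -npme 8
```

The -gputasks string assigns one GPU id per GPU task, so in the second example the first two domains share GPU 0 and the last two share GPU 1.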
So if you run many domains over many nodes, it's worth playing around with that number of PME ranks to see how well things work. There's gmx tune_pme, which can do this for you automatically; you should look in the manual for how it works. It runs some steps in different configurations and tries different things. The number of PME ranks should preferably be a divisor of the number of what we call PP ranks, the ranks doing the rest, because we need to communicate data between them, so it's nice if they have large common factors. This starts to get complicated because of the way things communicate and the way GROMACS is set up, so I won't go into more detail here. There was originally also a slide on mixed CPU-GPU runs across nodes, but I removed that because it gets even more complicated: you also have to worry about how many GPUs you have, where they are and how they communicate. So that's even more complicated and depends on the hardware, and it might improve as GROMACS improves.

Then, if you have CPU-only multi-node clusters, let's see what we have here. What's the difference with the previous slide? That's about the same, yes, right; I think there's no difference. Sorry, it's a continuation: it shows what command line you should use. You use mpirun with a number of ranks, and then you can decide how many threads you want per rank, which usually fills up the whole machine; if you don't set it, it's worked out automatically. But you can also choose how many of those ranks you want as PME ranks. Again, I refer to the user guide here, because you quickly get lost if I try to explain all of this, but it's important to realize that there are many options one can tune here, and these kinds of things can affect performance a lot, so it's worth looking into.

Okay, so one of the most important aspects here is this PME tuning, because you have some ranks running PME and some ranks running the rest, and if one of those sets goes faster, you're wasting time because the others are idling. That's why we nowadays have automated PME load balancing running, but that can only do a bit of tweaking. In the end, if you run many simulations, it's usually worth trying how many PME ranks you need, as I said. You can also use GPUs, but there are not so many layouts possible there: with PME on a GPU it currently runs only on a single GPU anyhow, so that only works if you have two, three or four GPUs, not more. We're working on parallel PME on GPUs; that's coming. So if you have many GPUs, you would usually run PME on the CPU and not on the GPU. I don't have examples here of how to choose that; you can find them in the documentation. If you just start gmx mdrun with the defaults, it will automatically choose something not unreasonable: it will run and it will use all resources, but maybe not in an optimal way.

Then, strong scaling versus throughput. This is also always an important consideration. As I said, for the science you need a lot of samples, so you often run the same system multiple times anyway, maybe with different initial velocities, to create more data, more samples.
So if you anyhow need many copies of similar simulations, you can wait longer to get the full set of results, and you have finite resources, which is usually the case, then you can run multiple simulations together. For instance, here we show mpirun on 16 ranks with the -multidir option, where you give directories, say A, B, C and D, and you run multiple simulations at the same time in different directories. This allows you to have one large job to schedule that runs multiple simulations, and, conversely, which I think, oh yes, here we have it, you can use GPUs more efficiently. A GPU is usually idling some of the time with a single simulation, so if I have, for instance, two GPUs and four simulations, I can mix them, and you can mix them in different ways by assigning different domains of each of the simulations to different GPUs. This way you create more concurrency and you can use the GPU to near 100%, which is not possible with a single simulation. So this makes more efficient use of resources, and even management-wise it's easier for you, because you run four simulations at once and don't need to keep track of four independent jobs. So this is quite useful for improving throughput; it makes it easier to get to the maximum throughput possible on the hardware, especially with GPUs.

Then, how can you actually measure performance? You can take the production system, the tpr file, run a few thousand steps to let the PME tuning and load balancing stabilize, then reset the counters and observe the performance over a short run. Here's an example command line: run 6,000 steps and reset the counters at step 5,000, so you measure only over the last thousand. I think these numbers are probably too small; it should be more like 60,000 and 50,000 to get reliable measurements. But this way you can quickly, within a few minutes, measure how fast your run is, while allowing for the PME tuning, the load balancing and the warm-up of the CPUs and GPUs before you measure. So that's a way to get reliable, quick performance measurements; you should aim for a run of a few minutes, probably.

Then there's the log file, which I don't have examples of here; if people have questions we can look at one later. It gives you a summary of the hardware and software configuration at the start, it gives all the information about your physical simulation settings, and it reports how the simulation has been set up just before it starts: what ranks are used, how many threads, what is run on the GPU and what on the CPU. And at the end there's an accounting of where all the time went, so you can see which part of the time is spent in which part of the code. You can diff such files from runs with different settings or on different hardware to see where the times differ, and if you change settings, what effect they had. So you can do quite a lot of analysis just by looking at log files; you don't need to profile the code. Even if you don't understand exactly what's happening, you can often infer from that what the effects are of what you've done, if you have a basic understanding of the different things that need to be computed in an MD simulation. Right.
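As a sketch of the multi-simulation and benchmarking commands described above (illustrative only; the directory names, tpr file name and step counts are just examples, and -multidir requires an MPI build):

```bash
# Four similar simulations packed into one MPI job, four ranks per simulation
mpirun -np 16 gmx_mpi mdrun -multidir A B C D

# Quick benchmark: let PME tuning and load balancing settle, then reset the
# cycle counters so only the last 10,000 steps are measured
gmx mdrun -s topol.tpr -nsteps 60000 -resetstep 50000

# Compare the cycle accounting of two runs with different settings or hardware
diff runA/md.log runB/md.log
```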
So that was what I had prepared in terms of best practices for performance, and a bit on correctness as well. Let's see, there are already some questions in the chat.

Yes, I'll go through those first. The first one is from Jamshed, asking whether you could say a bit more about the -ntmpi and -ntomp parameters.

Yes, that's a bit of a confusing thing we have in GROMACS. As I said, we have this thread-MPI library, as we call it, which emulates MPI for us, but using threads. So you have multiple ranks, but those are not real processes, they're just threads. In this way, without having to write completely new code paths, we can still use the domain decomposition and the same communication setup that we have for normal MPI communication. So -ntmpi gives the number of thread-MPI ranks, which from the code's point of view correspond exactly to MPI ranks, real processes, but in fact they're threads. That means you can choose how many domains you want in your domain decomposition. And the total number of threads used is the product of -ntmpi and -ntomp, where -ntomp is the number of OpenMP threads, the classical OpenMP thread count. This -ntmpi option disappears if you have a real MPI build; then you set the number of ranks through MPI.

I was going to ask a supplementary question: I'm guessing you don't necessarily need that option if your MPI setup is already there; is it there to allow for hyperthreading?

Well, hyperthreading is a different thing; whether you use it depends on how many threads you launch. If you launch as many threads as you have physical cores, you don't get hyperthreading; if you launch twice as many, then you do. And that in turn depends on how this is set up: if you're running on a supercomputer, the scheduling system or the mpirun might actually set that for you, so you have that given. GROMACS actually checks whether it's already set by something else and then tries not to mess with it, but you can still override it with the thread options if you want to. And it depends on whether the hardware on the supercomputer has hyperthreading enabled; there was a period when many did not, but I think now they usually do, because the newer processors handle hyperthreading somewhat better than before.

So the next question is from Pedro, asking: has GROMACS been fully ported to AMD GPUs?

Yeah, that's a question with many answers. We have had, for a very long time, an OpenCL version, which runs on any GPU with OpenCL support and is optimized for AMD. But that has more limited support compared to CUDA, so it's lagging somewhat behind. It has support for offloading the non-bonded interactions and PME, which are the two main components anyhow, so it's quite performant. But now the question is what the future of OpenCL is, and on new machines like LUMI it doesn't seem like there's going to be good OpenCL support. So we're working on SYCL for that.
That's a massive effort, but we hope to have, in the 2022 release, at least the non-bonded kernels available in SYCL, and maybe PME, but the whole software stack there is not fully reliable yet, so that should hopefully mature over the coming half year or year. That's the current state of things. But if you have an AMD GPU now, then OpenCL works very well; that is optimized code.

Actually, the person who asked that question has raised their hand. Pedro, would you like to unmute and ask another question? It's asking you for permission. Okay, hang on, I'll see if I can ask you to unmute.

Hi, can you hear me? Yes. What did you say you're working on, SYCL?

SYCL, yes, S-Y-C-L. That's the open standard which seems to be becoming the standard now among all hardware vendors, except NVIDIA for the moment. It has been very uncertain for some time, but what seems to be happening now is that SYCL is becoming the standard. On the one hand that's nice, it's a nice C++ standard, but the problem is that it's quite orthogonal to the way our current code works, so it requires rewriting all the glue code. Rewriting kernels is not such an issue for us, we know how they work and it's rather straightforward, but rewriting glue code is annoying.

So this is the version that will be running on LUMI, right?

Yes, that's the plan. I see, thank you.

Okay, so there are some more questions. There was one question about whether you could expand a little bit more on the different options you showed for a node with GPU and CPU.

Yes, let's see what I have on that slide; let's go back. There's very much to say about this, unfortunately, because it's really complicated. Multi-simulation... where was it? I think it must have been this one; yes, that's where it starts. So the question is what to do if you have a single simulation and you go from one to two GPUs on a single node. Running on a single GPU is rather straightforward, because you can run everything there without domain decomposition, although even there it is sometimes faster with it. But if you have two GPUs, then you have to do something, because you can't run over two GPUs without domain decomposition, at least in GROMACS. There's actually another option which I didn't mention here: you can run PME on one GPU and the rest on the other GPU, but then you need to be lucky that the load is roughly 50-50 for those parts. If that's the case, that's going to be the most efficient. Since that's often not the case, there's the option to run two domains, as you see in the lower line on this slide with -ntmpi 2, and you can specify which domain runs on which GPU: with the 0 1 here, the first domain runs on GPU 0 and the second on GPU 1, which it will do by default anyway, by the way; you don't need to give the GPU id option here, and the threads would also be the default.
So you actually don't even need to give -ntmpi 2; you would get this automatically. But it could actually be faster to run two domains on each GPU, since then you have more concurrency. The GPU can compute on something while it's waiting for data to arrive for another task; we try to overlap, because there are multiple tasks, so if we can send over data for one task, it can compute on that while the data for another task is still coming in. So it could actually be better to run two domains on each GPU, as in the middle line here with -ntmpi 4. Again, you wouldn't need to give the other options, those would be the defaults, but we show them here to see what happens: you run domains 0 and 1 on GPU 0 and domains 2 and 3 on the other GPU. Then you have more concurrency, and since the GPU is fully asynchronous, it will by default run somewhat out of sync, so while you're computing for one domain you can transfer the data for the other domain. This tends to happen by itself, so it can give better performance. And if you're anyhow running multiple simulations, it might even be better to run even more domains, but that depends, so you can try and see what is better.

Internally GROMACS doesn't really know about this; it's transparent how these things work, except for the load balancing, but I don't want to go into details. You can play around with many such options; you can imagine you can do many things. Unfortunately there are often quite a lot of combinations to try, especially once PME ranks come in, and you can also choose which tasks run where: I didn't even mention that you can assign PME to the GPU or the CPU, and the bonded interactions to the CPU or the GPU, and so on, so there are even more combinations there. What we would actually need, since this is too much work for the user, obviously, and it might be really hard to follow what I'm saying, is an option that does something like Monte Carlo on the task placement: it moves the tasks around and tries, maybe with machine learning or whatever, to figure out what the optimal setup is. But the problem is, of course, that this is a rather nasty optimization problem: as you move something somewhere, that resource tends to get overloaded, and then you need to move something else back. So it's a very nasty optimization problem. I don't know if that somewhat answered the question.

Okay, I guess so. We've got another question asking: if you use mpirun -np 16 together with the -multidir option and there are four directories, do you then get four MPI ranks per run, in each directory?

Yes, that's correct. You can do whatever you want there; of course the -np count should be a multiple of the number of directories, and I think that's the only requirement.

Okay, and they're identical? You can't allow different numbers of processes per simulation?

We could even vary that, but I think we don't allow it. I don't think so.
I'm not sure, maybe we even have that option, but I don't think so. You're usually anyhow running similar runs, so it doesn't really matter that much. This option was actually originally there to work around requirements of HPC centres, because they want big parallel jobs, so we disguised many runs as one big one. But nowadays you can use the same setup for replica exchange or other kinds of exchange algorithms, where you actually do have coupling between the simulations and you need this. That's one thing. The other thing is that it's convenient for managing jobs, because you might get too many jobs. That's another topic we're currently working on, somewhat off topic maybe: as the computers get bigger and bigger, you can of course run more and more runs, and in scientific problems we easily have enough compute requirements, but then you quickly get to having to manage thousands and thousands of runs. And if you throw that at the scheduler on a supercomputer, it might go down; we've had that happen, someone submitted a hundred thousand GROMACS jobs and the whole supercomputer went down. So then it's actually an advantage to be able to do this, because you take load off the scheduler.

Okay, so the next question is: will virtual sites be supported in the future, or will this option be deprecated?

We'll keep supporting this. It works well and there's no real overhead in the code from having it. Maybe we'll deprecate, what is it, the aromatics variant, which is almost never used; we're not sure about that. But the standard one we'll keep supporting, even if we come up with a multiple time stepping scheme that might be better. There's no reason not to support it, it's little effort for us, and it's quite appreciated, I think.

Yep, okay. There's also another question, somewhat related to what we were just talking about: is it possible to run replica exchange on a single workstation or a single node? The documentation seems to suggest it's only possible when using MPI.

Yes, that's indeed about the only case where you would want MPI on a single node, to run replica exchange. But you can do that, of course; you can install MPI. I have MPI installed on my workstation and laptop especially for this case.

Okay. And then the next question: creating residue topology files for new small molecules is rather challenging and opaque; is there any more guidance other than the manual?

I suppose the question here is about topology files, or residue topologies; for small molecules you usually write a topology directly. That is problematic, and we're aware of it. The question, though, is whether GROMACS should be doing this, or whether other tools should take care of it. We have, what's it called now, a colleague here wrote this tool, I always forget the name, which interfaces with three different tools: one is GAFF, the generalized AMBER force field, one is the CHARMM general force field for drugs and such things, and there's one more.
So the person here made an interface that scripts everything automatically: you can put in a molecule as SMILES or in two other formats, or a molecule setup, and it will spit out the topology for you for one of these three force fields. It's probably useful to link to that; maybe I can find it in the meantime and paste it in, because doing this by hand, I mean, these kinds of things should be automated. What's it called again? I still don't have the name, but I have the paper. Where's the link?

Okay, let's go to the next question. I think we've actually run out of questions so far, unless anybody else wants to ask anything. Someone has put in a link to a paper.

Yes, exactly, that's the one. I found a link to the tool itself, whose name I'd forgotten. I don't understand why it's so difficult to find the tool. Is it in the supporting information? Come on. It's apparently called STaGE.

Right, exactly. Is it STaGE or STaGEs?

I think just STaGE; I've checked that as well. Yes, exactly, it's called STaGE. But where's the repository? I actually never used it myself, since I'm not parametrizing small molecules for my systems, but it seems to be quite appreciated. I found a repo, but I don't know if there's a front page for it. If you search for GROMACS and STaGE, you'll find it. That's extremely useful, because doing this by hand is very tedious and annoying, and it's also easy to make mistakes. It's on our GitLab, the GROMACS GitLab server actually, so if you search for STaGE and GROMACS you will find it. That's very useful. There are other tools too, I think; there's CHARMM-GUI and those kinds of things, but I'm not familiar enough with that to say whether it also works for such cases.

I have a little bit of knowledge of it; I think it does, I think it can work. I think it's aimed at maybe slightly larger molecules, but it may work for small ones too. It's been a little while since I've had occasion to look at it. So, does anybody else have any questions?

There can be questions about any topic in GROMACS. This was somewhat performance related, but there can also be questions about development, about algorithms, anything.

I can't see any more questions come through yet. I suppose if anyone has any questions really about... oh, here we go: is replica exchange fully functional in GROMACS? Because there are papers around using PLUMED with GROMACS to do this.

Yes, that's a useful question. There's temperature replica exchange, which is fully implemented in GROMACS, and there are a few other things one can exchange. But temperature replica exchange is not so useful by itself, because you can only use small temperature gaps, especially if you have a lot of solvent. What's much better to do is what's called REST, replica exchange solute tempering.
There it's formulated differently: instead of changing the temperature, you scale down the Hamiltonian. If you scale down the full Hamiltonian, that's the same as changing the temperature, because instead of making the kinetic energy higher you lower the potential energy landscape, which relatively speaking does the same thing. But you can scale different parts of the Hamiltonian differently, so you can, for instance, scale your protein and not the solvent, and that makes it much easier to have larger temperature gaps, so to say; it then turns into Hamiltonian gaps. That's a far more efficient way of running it, and GROMACS itself can't do that. I would really like to have it, but then someone needs to implement it, and I'm not doing this myself, nor are my students. So I think that's probably one of the main applications of PLUMED for replica exchange with GROMACS: using the REST protocol, which is far more efficient. But then you have the overhead of PLUMED, which just hooks into GROMACS, so it needs to mess around, maybe write stuff to file, change topologies and these kinds of things. That gives you some performance loss in that sense, but you have a lot of algorithmic performance gain. So that is something we would like to have integrated. And I think PLUMED can do even more than that in terms of replica exchange.

Will there be some tools in GROMACS for machine learning or deep learning? The question is about analysis, I suppose.

I think we would leave that to others; I don't see why it should be integrated in GROMACS. I think the GROMACS tools should give you access to the basic molecular properties you want, like dihedrals or principal components and these kinds of things, and then one can build on top of that: there are now loads of nice machine-learning tools and networks around, so you can extract the properties you want and feed them in there. What would help is a Python interface in GROMACS to feed those things through, maybe in real time, or to couple them more easily. So we're working on extending the Python API to make it easier to interface things and do things, maybe on the fly. That might help there; it will make it even easier to link in some machine-learning Python framework. The other thing, if the question is about force fields, is that a development now is having better force fields using machine learning and QM simulations. That's something we likely will implement in GROMACS, but it's a rather large project that will take some time: you first need to find the resources and then do it.

Sure, absolutely. If there are no more questions for Berk, we could also have some questions for either Anila or Pedro, if they're still around, to answer a question or two, if there are any remaining questions that anybody has. Otherwise we could finish five minutes early, I suppose. In which case, I'd like to thank you, Berk, for a very useful and interesting Q&A session.

Yeah, thanks for having me. Right.
And I'd also like to thank the other speakers today, Anila and Pedro, for their talks and the practical session. And I'd like to thank everybody else who attended, listened, paid attention and asked questions, of course. So thank you very much, everybody, and I hope you're willing to come back for tomorrow's session on integrative modelling, docking and preparation of biomolecular systems, which I believe starts at the same time again tomorrow, at 9:30.