Welcome to this short talk on getting good performance with GROMACS. I'm Berk Hess, one of the main developers of the GROMACS simulation package, and I work at KTH Royal Institute of Technology in Stockholm. The topic of this talk is how to get good performance out of GROMACS. This is a topic one could talk about for hours, with many details and many aspects that might affect performance, but I'll try to give you a short overview in 15 minutes of the most important considerations and what you need to think of. Originally, when molecular dynamics started, parameters such as cut-offs had a large effect on the performance of your simulation. Nowadays this has changed, and these parameters have little effect on performance: for one because the cut-offs are now rather fixed for the different force fields, but also because GROMACS has optimizations that automatically adapt, under the hood, the way things are computed, so the same potential is computed while the parameters of the algorithms vary. In particular, particle mesh Ewald (PME) electrostatics is one of the computationally expensive components, especially when running at large scale in parallel. There it is sometimes worth it to manually increase the PME interpolation order and use a coarser grid to achieve better scaling at large scale, but at small to medium scale even this is not worth it. Part of the reason is that GROMACS has automated PME tuning, which automatically increases the real-space cut-off for computing direct pair interactions, so the PME grid can be made coarser and the mesh part cheaper. In this way we can move work between resources, or lower the cost of the calculation even when running on the same resource.
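As a rough illustration of the trade-off that the automatic PME tuning exploits: scaling the real-space cut-off and the PME grid spacing by the same factor keeps the accuracy roughly constant while shifting work from the mesh part to the direct-space part. A minimal sketch; the cut-off, spacing, and scale factor below are made-up example values, not recommendations:

```shell
# PME tuning trade-off sketch: scale cut-off and grid spacing together.
# Values are illustrative only, not recommended settings.
rc=1.0        # real-space cut-off in nm
spacing=0.125 # PME grid spacing in nm
scale=1.2     # tuning factor, chosen by mdrun at run time
awk -v rc="$rc" -v sp="$spacing" -v s="$scale" \
  'BEGIN { printf "rc=%.2f nm, spacing=%.3f nm\n", rc*s, sp*s }'
# prints: rc=1.20 nm, spacing=0.150 nm
# To benchmark without this automation, mdrun can be started with -notunepme.
```

A longer cut-off means more direct pair interactions but a coarser, cheaper mesh; mdrun measures which balance is fastest on your hardware.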
So that's about the only thing one has to think of in terms of parameters, and even that is fine-tuning, as it is largely automated for you. The most important aspect for performance in GROMACS, and probably in molecular dynamics in general, is how to map the tasks in the computation, in this case in GROMACS mdrun, to the available hardware, and also how much hardware to use. This can have a large effect on performance. It is often not easy to optimize fully, although it is usually not hard to get a significant improvement. To discuss this we first need some idea of what modern hardware looks like, so here is a very high-level overview. On the left-hand side there is a picture of a CPU, in this case an AMD EPYC, which has 64 physical cores spread over different chips on the same processor, interconnected with a complex hierarchy between the cores. You can run what are called two hardware threads on each of these cores, which gives a total of 128 hardware threads. A hardware thread can run an individual thread of a program; two such threads share a core, thereby improving the throughput of your calculation, so you get a bit of extra performance out of this, not much, but it is often worth it. In compute centers you can often find two of these CPUs in a node, which means 128 physical cores or 256 hardware threads in one node, which is quite a lot, and this is only increasing.
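The core counts above multiply out as follows; a trivial sketch using the EPYC numbers from the slide:

```shell
# Hardware threads per node for the dual-EPYC example:
sockets=2       # two CPUs per node
cores_per_cpu=64
smt=2           # two hardware threads per core
echo "$((sockets * cores_per_cpu)) physical cores"
echo "$((sockets * cores_per_cpu * smt)) hardware threads"
# prints: 128 physical cores
#         256 hardware threads
```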
Then on the right-hand side there is a picture of a GPU. You are probably all aware that GPUs, which were initially developed for games, are now used a lot for calculations, and GROMACS in particular can make very good use of them; we have highly optimized code for that. Here is an example of an NVIDIA GPU, one of the fastest ones, an RTX 2080 Ti, which has 68 streaming multiprocessors, each of which in turn contains cores running threads that compute in parallel, so again you have very massive parallelization. You can often find two or four of these in a node, or maybe a single one if you have your own workstation. This kind of hardware enables massive parallelism in the computation, and you need software that makes good use of that, which GROMACS has put a lot of work into. On top of this you can have multiple such nodes connected together in a network, like a supercomputer for instance. So the question now is: how do I run GROMACS efficiently on such hardware? For that we first need to discuss how we actually parallelize, to understand what the different options do and what you can do. At the finest level there is thread parallelization, both on the CPU and on the GPU. These threads are used to execute code in parallel. You can do this on the same core, as I already said, by running two threads on one core, which is called hyper-threading on Intel and SMT, simultaneous multithreading, on AMD and other processors. The advantage here is that all threads in a process have access to all data in that process, so for the processor I showed before, up to 128 threads can all share the same data. This allows rather fast access to the data, but it still takes time to move data from one core to another. GROMACS uses the popular OpenMP interface for thread
parallelization, to run different software threads on the hardware threads. On the other hand there is what is called message-passing parallelization. Here, processes running on cores, either in the same node or in different nodes, exchange data by passing messages, which go either through the processor, over the motherboard, or between different nodes through a network. This also enables you to run in parallel between nodes, but then you need to explicitly pass around the data you want to communicate. GROMACS uses MPI, which is by now the standard library for this, or alternatively its built-in thread-MPI library, which is built into GROMACS mdrun. With thread-MPI you can make use of the features we built on top of MPI without using separate processes, using threads within one node instead. Since everything then lives in the same process, this allows much easier interaction and also automation of the parallelization when you are running on a single node, which is often the case. Then we need algorithms to parallelize your calculation. At the highest level there is what is called domain decomposition, which in GROMACS is done in 3D. Here is an example of a simulation box which in this case is divided into 3 by 2 by 2 domains, 12 in total, where local atoms reside in a domain, for instance in domain zero. The way the parallelization is set up, we need information, that is atoms, from domains in a forward direction to compute pair interactions and bonded interactions with atoms in domain zero, but that is going into too much detail here. Before computing the non-local forces we need to communicate what is called the halo, this shaded area, from other processors, and then we need to communicate the forces back after we compute them.
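The 3x2x2 decomposition described above can also be requested explicitly with the -dd option, which sets the domain grid. A sketch; md.tpr is a placeholder input file, and in practice letting mdrun choose the grid is usually fine:

```shell
# A 3x2x2 domain grid needs one MPI rank per domain:
nx=3; ny=2; nz=2
echo "$((nx * ny * nz)) domains"
# prints: 12 domains
# Corresponding invocation (requires GROMACS; md.tpr is a placeholder):
#   gmx mdrun -s md.tpr -ntmpi 12 -dd 3 2 2
```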
In addition to this, for parallelizing PME, we have what is called multiple-program multiple-data parallelization. In the standard case, each process on each MPI rank runs the same code in the same order. That means that for the PME, particle mesh Ewald, calculation, where we have full interactions, all atoms see each other through electrostatic interactions, we need to communicate between all ranks, in this picture for instance eight MPI ranks, so we get a lot of communication. But GROMACS has the option to run the PME calculation on separate ranks. The PME load is usually about a quarter, so we can dedicate two MPI ranks to doing PME and six to doing the rest. We first need to send the coordinates over to the PME ranks, but then very little communication is left, since only two ranks need to communicate with each other, and at the end we send the computed forces back. This significantly reduces the communication in PME, which is otherwise communication bound, so that is a nice feature GROMACS can use to improve performance. Then we need to somehow map the domains we have to the hardware, and this can be done in many ways. Here is a picture which shows, for instance, the red domain being mapped to part of the first CPU and one GPU, the blue domain crossing between two CPUs in the same node and using the blue GPU, the green domain mapped to the green GPU, the yellow domain to the next GPU in the next node, and so on. Each domain can use at most one GPU, but GPUs can also be shared. This particular mapping is not fully optimal, since you would not actually want a domain shared between different CPUs; you would rather make sure that each domain uses a whole CPU, or part of a CPU, without sharing, but that is a detail here. This mapping has a large effect on performance, since you can imagine that the communication is strongly affected by how you do it, so that is something to think of.
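With the roughly one-quarter PME load mentioned above, the eight-rank example splits as follows. A sketch; the right -npme value is something to benchmark, not a fixed rule, and md.tpr is a placeholder input file:

```shell
# Dedicate 2 of 8 ranks to PME, matching the ~25% PME load:
ntmpi=8; npme=2
awk -v n="$ntmpi" -v p="$npme" 'BEGIN { printf "PME fraction: %.2f\n", p/n }'
# prints: PME fraction: 0.25
# Corresponding invocation (requires GROMACS):
#   gmx mdrun -s md.tpr -ntmpi 8 -npme 2
```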
So how would we actually do this in practice? With thread-MPI, which can only run on a single node, you run gmx mdrun, and you can set the number of MPI ranks with the -ntmpi option and the number of OpenMP threads with -ntomp. You can also give no options at all, and mdrun will automatically choose something which may or may not be reasonable; we try to do our best to give reasonable performance there, but it can be hard to estimate. For real MPI you actually have to use an MPI library, usually through mpirun, and explicitly give the number of ranks to the command, which is often called gmx_mpi mdrun. You can still set the number of threads if you want to, but that can also be automated. Both these commands have the same effect on mdrun: under the hood the algorithms are the same, but in the thread-MPI case MPI runs with threads and in the other case with processes. As said, with thread-MPI we have full automation of the number of ranks and number of threads, so you can just run gmx mdrun and hope for the best, but I'll show now that this does not give the best result in all cases. Here is an example system where I show the end of the log file. At the end of the log file you can find information about performance, a lot more than I show here, but you can look for that yourself if you have a log file. We are going to look at a membrane protein system of about 140,000 atoms. I ran this on a machine in our cluster that has a 12-core CPU with 24 hardware threads and two RTX 2080 Ti GPUs, with default mdrun settings, which in this case gives four domains and six threads per rank. You can see the performance here, which is 58.7 nanoseconds per day, and here is the breakdown of all the parts of the calculation.
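The two launch styles above might look as follows. A sketch; the file names are placeholders, and on the 24-hardware-thread machine from the example, 4 ranks times 6 threads fills the CPU, which matches the default mdrun chose there:

```shell
# Thread-MPI, single node: 4 ranks x 6 OpenMP threads = 24 hardware threads.
ntmpi=4; ntomp=6
echo "$((ntmpi * ntomp)) threads in total"
# prints: 24 threads in total
# Requires GROMACS; md.tpr is a placeholder input:
#   gmx mdrun -s md.tpr -ntmpi 4 -ntomp 6
# Real MPI, e.g. across nodes (gmx_mpi is the MPI-enabled build):
#   mpirun -np 4 gmx_mpi mdrun -s md.tpr -ntomp 6
```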
One issue is that when running on the GPU, we can't actually time how long the GPU is busy, for various technical reasons; we can only time on the CPU. There we can see, for instance, how long the CPU is waiting for the GPU, which is this red line here: 0.1 percent of the time is spent waiting on the GPU. That means the GPU does not spend more time on the calculation than the CPU needs while it is busy, but we can't see how much of the time the GPU is actually busy; that is difficult to estimate without profiling. So the question is: on this machine, which actually has two very large GPUs compared to the CPU, are we using them optimally? And the answer is actually no. We can run different setups. In this case I ran, for instance, on a single GPU: I can remove the use of one of the GPUs with the option -gpu_id, and if we do that we still use 12 threads, all the cores, but now with one thread per core, and we actually get a much higher performance of 98.6 nanoseconds per day. So the GPUs were actually quite idle, and there was a lot of overhead in running the domain decomposition in this case; the performance actually goes up when using only one of the two GPUs. If we look at the log file, we now see 16.6 percent wait time, so now the CPU spends less time in the force calculation than the GPU and we need to wait a bit. We can also use half of the CPU cores, running six threads, and then the performance goes down to 82 nanoseconds per day; we get a bit less wait time, as you see here, I've cut out the rest. So the performance goes down a bit, but now we can run a second simulation on the other GPU with the other half of the CPU. I forgot to print the performance numbers on this slide, but then you actually get a bit more than 70 nanoseconds per day.
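The single-GPU and split-machine runs above could be launched roughly like this. A sketch; -gpu_id, -pin, and -pinoffset are real mdrun options, md.tpr is a placeholder, and the pinning offsets shown assume a simple core layout, so always check the mapping printed in the log:

```shell
# Split the 12-core CPU of the example between two runs, one per GPU:
cores=12
echo "$((cores / 2)) cores per simulation"
# prints: 6 cores per simulation
# Requires GROMACS; run each in its own directory:
#   gmx mdrun -s md.tpr -gpu_id 0 -ntomp 6 -pin on -pinoffset 0
#   gmx mdrun -s md.tpr -gpu_id 1 -ntomp 6 -pin on -pinoffset 6
```

Pinning matters here: without it the two simulations may land on the same cores and slow each other down.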
That's not much worse than the 82, and two times 70 is about 140 nanoseconds per day, which is a lot faster than the 58.7 here, although spread over two simulations. So if you want to maximize throughput, as opposed to maximizing the performance in nanoseconds per day of a single run, it is often good to run one simulation per GPU, or even better two simulations per GPU, to maximally utilize the GPU, because the GPU cannot be computing all the time in a single simulation; if simulations share a GPU, you can actually get better utilization. You do need to take care that the simulations are mapped well to the hardware. I did this manually in the example I gave before, but you can have mdrun do it for you by using the -multidir option of gmx_mpi, which allows you to run multiple simulations in separate directories with a single command. If the simulations are similar, you then get maximum throughput and the best utilization of the hardware. To conclude, there are many aspects to this performance question. How many MPI ranks should I use? I haven't really gone into that, and you can look at it yourself, although mdrun makes a reasonable estimate when using thread-MPI. Which domain decomposition should I use? You can change that as well, with the -dd option. Should I use separate PME ranks? As I said, that is the -npme option. And which parts of the calculation should I run on the GPU? The non-bonded part always needs to run there, but for the rest you can choose for a lot of components; look at the mdrun help to see what you can specify. PME can only run on a single GPU for the moment, but the 2022 release has an experimental PME GPU parallelization with CUDA, which promises to improve performance a lot if you can parallelize PME over GPUs. And then, as I showed, should I have multiple runs share a node or GPUs? That can also improve performance.
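The throughput comparison above in numbers, using the roughly 70 ns/day per run and the 82 ns/day single-run figure from the example:

```shell
# Aggregate throughput of two concurrent runs vs one faster single run:
per_run=70; nruns=2
echo "$((per_run * nruns)) ns/day aggregate, vs 82 ns/day for one run"
# prints: 140 ns/day aggregate, vs 82 ns/day for one run
# Launching both runs with one command (requires an MPI build of GROMACS;
# the directories sim1 and sim2 each contain their own input files):
#   mpirun -np 2 gmx_mpi mdrun -multidir sim1 sim2
```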
Finally, there is a lot more information in the GROMACS manual, at manual.gromacs.org, which describes everything in detail, also the concepts and the different things that might affect performance and how to look at them. And there is a BioExcel webinar on parallelization and the improvements we have made there, by Szilárd Páll, on April 5th; if you are interested in that, have a look at bioexcel.eu for the link. Thank you for your attention.