Hello everyone and welcome to today's webinar. My name is Rossen Apostolov and I will be the host of today's event. Before we start I have a few announcements to make. The first one is that we are recording this webinar, and the recording will be put on the BioExcel website, where you can watch it later or forward it to your colleagues. At the end of the webinar we will have a question and answer session, when you can ask Mark any questions that you have. I will take the questions in order and give you the microphone; if we have problems with the audio, I will read the question on your behalf.

This webinar series is organized by BioExcel, the Centre of Excellence for Computational Biomolecular Research, which is a new project that started last year. I would like to give you a small overview of the centre, since we will be doing a lot of future events regarding computational biomolecular research. BioExcel is working with three widely used codes for molecular simulation and modelling. One of them is GROMACS, which I hope you are very familiar with. Another one is HADDOCK; for those of you who have done docking, it is also very popular software. And also CPMD, which is used for electronic structure studies, specifically hybrid QM/MM studies of enzymatic reactions. BioExcel is working on improving the performance, efficiency and scalability of those codes. In addition to working with the software, the centre works with experts in several popular workflow environments and platforms, such as Galaxy, COMPSs, Apache Taverna and others. We have several sub-projects where we are combining tools like GROMACS and HADDOCK with external databases to automate and optimize the work of researchers. BioExcel is also working towards training and the promotion of best practices among academia and industry on how to best take advantage of these powerful applications and make the best of your work, and this webinar series is part of our efforts in spreading that knowledge.

As part of this work, BioExcel is starting several interest groups on different topics of biomolecular research, some of which might be of interest to you. We have groups on integrative modelling, which is mostly docking; on free energy calculations using GROMACS, and we will have a webinar on that as well in about a month; on best practices for performance tuning, which is specifically the topic of today's webinar; on hybrid methods for biomolecular systems, which will include QM/MM and also coarse-grained modelling; on biomolecular simulations for entry-level users, which is very useful for those of you who are just starting with such simulations; and on practical applications for industry, since all three codes are used in dozens of companies in pharma and also in the food industry.

BioExcel provides several platforms for support. You can learn more about this from our website: on bioexcel.eu/contact there are links to our forum, ask.bioexcel.eu, where you can post questions regarding the codes. We have a GitHub repository where we will put newly developed code, and we also have an open chat channel and a video channel where we are going to upload the webinar recordings, for example. So this was the introduction to BioExcel, and I hope that in future our support will be of use to you.

Now I would like to present to you Mark Abraham. Some of you probably know him very well from the GROMACS mailing list. He is the project manager of GROMACS and one of the main developers of the package.
His interests are not only in parallelization, high performance computing and accelerators; he is also working on clustering, sampling methods and replica exchange, and it's my great pleasure today to give Mark the microphone so he can tell you more about how to make the best out of GROMACS. So Mark, could you start?

Thank you for that introduction, Rossen. I am the GROMACS development manager, based here at the KTH Royal Institute of Technology in sunny Stockholm, where a lot of the development of GROMACS takes place. GROMACS itself is a classical molecular dynamics package. It targets a lot of problems that are of interest to people in the biomolecular simulation community. It is a free and open source C++11 community project that is developed by researchers at multiple institutions and gets used by hundreds of research groups around the world, which is wonderful because they are able to cite the papers that we produce, so that they have reliable molecular simulation methods, and that in turn allows lots of funding agencies to recognize the value that we are able to deliver to lots of you for doing your science; they have been loyal supporters of ours over the years with funding.

Here on screen we see a typical GROMACS simulation target system. This is the GLIC ion channel, which is one of the things we do research on here in Stockholm. It is a typically challenging kind of biomolecular target system: we have a protein seated inside a membrane, all solvated in water, perhaps with a lot of ions, which is very characteristic of the kinds of aqueous solutions we see in biomolecular simulations. This is very challenging for people to model, because it has multiple different parts of the system which need different kinds of interactions to be modelled well. There is even further challenge for the developers, who have to write the code so that it runs well with the different degrees of resolution that need to be targeted within the software for the various parts of the system. It is also challenging for the users to marshal all of these parts together in a way that will allow them to run fast, to maximize their science output for the amount of computer and human time that they have expended in getting their simulation run.

So today's topic is GROMACS performance optimization and tuning. Before we start, it is worth considering when this even matters. In many cases, when one is doing a biomolecular simulation, what one is seeking to do is run a simulation over a long period of time that generates a number of independent configurations that are expected to be characteristic of the ensemble of conformations that would be sampled in real life. However, we can only do a finite length of simulation, so we need to push that length as far as we can, so that the number of independent configurations we are able to sample is as large as possible for the amount of computational resources we expend. So it makes sense to try to get as many samples as possible for the amount of computer time that you have available. By default, if you just run GROMACS's simulation engine, mdrun, you will get pretty good performance. You shouldn't bother trying to improve that if you are just starting out with GROMACS; you should be focusing on whether you are building a valid, correct model of the real biochemical system that you are trying to model.
Certainly if you are only starting out doing tutorials, most of what I am saying should go in the back of your mind for later, when you are running a very large set of simulations over a lot of hardware; when you are starting out you don't really want to bother with this. You should also consider not bothering if you have something else useful that you can go and do while your simulation runs, and if nobody else is tapping you on the shoulder wishing to use the hardware after you. This does unfortunately require a bit of human time to compare the performance of different configurations, and you only want to put time into this if you will get some value back out of it.

So you really should bother with performance optimization and tuning if you are running lots of the same kind of simulation, particularly if you are going to run them on the same kind of hardware. If you have access to the same kind of supercomputer with an annual allocation, and you are going to be doing lots of variations on the same simulation, it is worth your while finding out how to run mdrun with high efficiency for the kind of simulation you want to run. Unfortunately, the answer will be different from how somebody else in your lab, or somebody running on different hardware, or someone on the other side of the world, is able to run their GROMACS simulation, so you won't simply find a ready-made recipe for how to run your particular simulation well. You are going to have to look at what you are simulating and how your hardware works in order to try to get the last 10% of performance out of the resources that you have to manage. You definitely want to bother with this if the resources you are running on cost more money than your time costs; that is for you to judge.

So there are a few things we need to consider when we are planning to run a molecular simulation with GROMACS, and that includes things like: how we build the software that we are going to run, what kind of thing we are going to simulate, what kind of physical model we are going to use for our simulation, some details of how GROMACS works on the inside (which helps in working out what sort of hardware we should use), how we express to GROMACS what hardware we are running on, and how to get feedback on what we might try to improve.

The first step is that we definitely want to be building the most recent version of GROMACS. If you have an ongoing simulation study that was started with an old version of GROMACS, it can be reasonable to continue with that version. However, if you are updating to newer hardware, you need to bear in mind that GROMACS was optimized to run well on the hardware that existed at the time it was written. If you are using GROMACS 4.5, which was released about six years ago, much of the hardware that is currently being sold wasn't even thought of back then, so you will not be able to get anything like good performance. So you should consider that your scientific continuity might be served better by being able to generate more samples than by using literally the same version. Having chosen the version of GROMACS that you wish to run, you should definitely consult the install guide; there is one for each of the different versions. You are able to download the PDF of this presentation from the BioExcel website.
If you have done this, you will be able to click on a lot of the HTTP links that are in my talk, which will take you to the websites of various resources that will give you more background information and detail than I have time to cover in the short period we have today. So do check out the install guide and have a read. You don't need to read all of it the first time you install GROMACS; you do, however, want to consider it if you are actually trying to get the most out of your version of GROMACS.

You do want to try to use the most recent, and preferably the very latest, versions of all of your infrastructure. You will need a C++ compiler; for example, GCC or Intel's compiler are our go-to choices, and you should really be using the very latest versions of those you can possibly get your hands on. Our general experience is that GCC outperforms Intel by a little bit, but if, for example, you run simulations on Intel's accelerators, that's where the Intel compiler shines, so we certainly support both of them. If you want to run on GPUs, you will also need a CUDA or OpenCL toolkit, along with the appropriate software development kits in the case of CUDA. If you have the latest Tesla-generation GPUs or Quadro cards, you will also want their so-called GPU Deployment Kit, so you can take advantage of NVML, which is a very good tool for permitting the GPUs to raise their own clock speeds, taking advantage of the time when we're not using the GPUs to cool down rather than overheat. If you haven't got the GDK, you won't be able to take advantage of that. You also want the latest drivers and so on installed on your machine so that you can have the best of all worlds. Similarly, for MPI libraries you want to use something very recent; they have better support for everything.

One of the key pieces of infrastructure GROMACS needs for the simulations it typically does is a fast Fourier transform library. The state of the art there is the so-called Fastest Fourier Transform in the West (FFTW) package from some researchers at MIT. Intel MKL is also pretty good, so if you are using the Intel compilers, by all means link with MKL; there are instructions for that in our install guide. You do want to build FFTW carefully: typical pre-installed versions don't always have support for both SSE2 and AVX, the typical kinds of SIMD hardware that people run GROMACS simulations on, so you want to configure that appropriately. If, however, you don't want to worry about those details, we have a facility within GROMACS that will build our own version of FFTW during the GROMACS build, in the way we think will be best for your hardware, so please do take advantage of that and simplify your life.

If you want to run GROMACS on a multi-node cluster or supercomputer, then you will want to build with MPI support enabled. However, if you are only going to run on your desktop, or maybe on single nodes of your departmental cluster, you would only want to build the default non-MPI version. The default build also enables the thing we call thread-MPI, which plays much the same role as MPI but doesn't require that you have a bunch of external software installed and organized. You will also need a non-MPI version if you're running some of the PME tuning (I'm going to talk about that later), as well as for doing any of the pre- and post-processing.
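To make that concrete, here is a minimal build sketch. The version number, install path and compiler are hypothetical placeholders, the GPU line assumes a CUDA toolkit is already installed, and you should treat this as illustrative rather than as the definitive recipe from the install guide:

    # hypothetical version and paths -- adjust to your own machine
    tar xfz gromacs-5.1.4.tar.gz
    cd gromacs-5.1.4
    # default single-node build: thread-MPI, GPU support, and FFTW built for you
    mkdir build-threadmpi && cd build-threadmpi
    cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DCMAKE_INSTALL_PREFIX=$HOME/gromacs
    make -j 8 && make install          # installs the 'gmx' binary
    # optional second build with a real MPI library, for multi-node runs
    cd .. && mkdir build-mpi && cd build-mpi
    cmake .. -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=$HOME/gromacs
    make -j 8 && make install          # installs 'gmx_mpi' alongside 'gmx'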
So particularly if you're a system administrator at a large cluster or a supercomputer, and you want to support your users well, then you should think about installing both an MPI version and a non-MPI version, so the right one is available for the job at hand. GROMACS's default mode uses a mix of single and double floating point precision. You do have the option of building GROMACS fully in double precision, which, if you're a system administrator, you might also consider making available; but as a user you should only choose to use that if you really know why you want it. Only a very small minority of GROMACS simulations actually benefit from it, you won't be able to use GPUs, and you will run about a factor of two slower for choosing this option. So please choose wisely.

Getting a simulation to run fast starts very early in the preparation process. You want to choose a box for your simulation that is just the right shape and just large enough for you to do your science well using the model physics that you are using. If you're modelling a small folding protein that's approximately spherical, you would really like your simulation cell to be approximately spherical as well. GROMACS has support for a wide range of simulation cell shapes, ranging from cubic all the way through to general triclinic cells, which allow us to use rhombic dodecahedra; these are the smallest shapes that can tessellate 3D space, allowing you to use periodic boundary conditions to replicate your system to fill all of space in a way that's both physically valid and very efficient to simulate.

You also want to think, when you're preparing your system topology, about the use of virtual sites. This is a key performance feature in GROMACS that allows us to treat typical groups, such as methyl and amino groups, that have a couple of hydrogen atoms bound to the same heavy atom, as rigid bodies. When we are doing our force calculations we treat all of the atomic sites as fully interacting, but when we go to do the update we project the forces on, say, all four atoms of the methyl group down to a smaller number of interaction sites, which allows us to take a larger time step in the update, before projecting back out for the next time step. There are other MD packages out there that use quite different schemes that allow you to push the outer time step out to about 4 fs, but the use of virtual sites is the way GROMACS currently encourages people to take best advantage of the code we have, allowing you to generate more independent samples of your system faster. When you use virtual sites you also need to plan to use LINCS with all-bonds constraints for your bonded interactions. If you're not prepared to use virtual sites, or perhaps they're not yet supported for your force field, then you want to consider using LINCS constraints on all bonds involving hydrogen; if you do that, you can typically use a 2 fs time step. If you don't have any constraints at all, however, you'll need to go down to 1 fs, or probably 0.5 fs, time steps in order that your simulation remains stable and is actually modelling the real world.
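As a hedged illustration of those preparation choices, a sketch might look like the following. The file names and the 1.2 nm box margin are placeholders, pdb2gmx will prompt you to choose a force field, and you should check that your chosen force field supports virtual sites:

    # generate a topology, with hydrogens converted to virtual sites where appropriate
    gmx pdb2gmx -f protein.pdb -o processed.gro -water tip3p -vsite hydrogens
    # a compact, near-spherical periodic cell: rhombic dodecahedron with a 1.2 nm margin
    gmx editconf -f processed.gro -o boxed.gro -bt dodecahedron -d 1.2

And the matching integration choices in the .mdp file, following the advice above:

    constraints          = all-bonds   ; recommended alongside virtual sites
    constraint-algorithm = lincs
    dt                   = 0.004       ; 4 fs with virtual sites; ~0.002 with h-bonds
                                       ; constraints only; ~0.0005-0.001 unconstrained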
Do be aware, however, that typical water models are rigid, and they are handled within GROMACS using an algorithm called SETTLE. So if you read some of the older papers, or some other simulation packages, they will talk about the use of SHAKE, which is implemented in GROMACS but is much harder to parallelise than algorithms like LINCS; LINCS is entirely equivalent to SHAKE for this purpose and is the way we prefer to do it. There are many different kinds of water models out there. Most of the available biomolecular force fields were parameterised with one of the three-site water models in mind. There are four- and five-site models out there that have been demonstrated to have some interesting properties that might occasionally be useful. Do feel free to consider using those; they're supported in GROMACS. But remember that you are paying for some extra work there: typically TIP4P will be about 10% slower than TIP3P, so only do that if you think you're getting value for it.

Another good tip is to orient your simulation box with load balancing in mind. You need to think about the fact that GROMACS is going to have to chop up your simulation cell into chunks of 3D space that will be sent to different bits of the computer to be calculated on. That works best if you orient your system so that the default in GROMACS, which is to partition along planes perpendicular to the Z axis, will naturally lead to good load balance. Here, in the case of the ion channel, we have the Z axis vertical, so we are able to resize the boundaries between our domains and end up with a similar amount of work in each domain.

We have some problems with the sound. Mark, you became very choppy and robotic.
No? It still sounds the same to me.
Do you want to try to reconnect, maybe?
Nothing's changed on my end; unless it's a network effect, that won't help.
Yeah. I can try reconnecting if you want. It'll take a minute or two.
Could you try just to reconnect? It might improve for some reason.
OK, everyone, until Mark connects back: can everybody hear me now? I would like to show you what we have on BioExcel's website. It's part of our support structure. If you go to Contact, you'll see links to the support forums for the three applications, HADDOCK, GROMACS and CPMD, where you can post questions. We also have a video channel on YouTube where the webinars are uploaded. We have GitHub repositories as well; there's not much there yet, but with time we'll upload new code. And we have a Gitter chat channel that is open to everyone, and when we are online we can directly answer questions. So let's see if Mark is... Mark?
Yeah.
I think it's better now. Let's switch back to you.

We should certainly think about writing only the output we can actually use from our simulations; that will make lots of stages faster. You also don't want to apply your temperature and pressure coupling every step: a multiple of 10 or 100 steps is a good choice for the frequencies at which those algorithms act. It is tempting to reduce the length of your cut-offs in order to make your simulation run faster, but be very careful not to do that with your van der Waals cut-offs: they tend to be parameterized into your force field, so a lot of the values of the parameters that describe how the different atoms interact depend very critically on those cut-offs, and you should not vary them from the standard practice of the authors of your force field or other publications in your field. You should very much use the GROMACS default settings for long-range PME: they've been studied by lots of people and are very good, and just turning them up in some sort of quest for accuracy tends to slow your simulation down without any particular gain, so don't do that unless you know what you're doing. You might consider PME order 5 if you're parallelizing very heavily. You also need to choose appropriate LINCS settings to make sure your simulation runs well.
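Pulling those .mdp recommendations together, a hypothetical fragment might look like this. The output intervals are placeholders you should match to what you will actually analyse, the cut-off should come from your force field's own prescription, and the temperature and pressure coupling intervals (nsttcouple, nstpcouple) can be left at their defaults or set to a multiple of 10 or 100 rather than 1, as discussed above:

    nstxout             = 0        ; no uncompressed coordinates
    nstvout             = 0        ; no velocities
    nstfout             = 0        ; no forces
    nstxout-compressed  = 5000     ; compressed coordinates, only as often as needed
    nstenergy           = 1000
    nstlog              = 1000
    nstcalcenergy       = 100      ; compute energies only every 100 steps
    rvdw                = 1.0      ; keep your force field's van der Waals cut-off
    coulombtype         = PME
    fourierspacing      = 0.12     ; GROMACS default
    pme-order           = 4        ; consider 5 only when parallelizing very heavily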
So let's have a look at how GROMACS works on the inside, so that we'll understand how to run it well. Typically in an MD step we have a number of phases that we have to go through. We have to compute the short-range interactions; that will typically take place over multiple SIMD units within your hardware, on multiple different cores, running across lots of threads, so we'll have to combine those results. Then we go and do the bonded interactions, things like the bonds in your proteins, which also includes angles and dihedrals, and combine all of those. Then we have to do the long-range PME part, which involves spreading the particle charges onto a grid, doing a 3D FFT, solving in reciprocal space with a multiplication stage, doing more 3D FFTs back, and interpolating from the grid back to our real coordinates. We then combine all of those contributions together so that we have our atomic forces. Then, using those, we perhaps spread the forces from the virtual interaction sites onto the real atoms, update our positions and velocities using those forces, constrain all of the updated positions, perhaps reconstruct the virtual sites, and loop back to the start. We do this many millions of times over the course of an MD simulation. So this is how GROMACS looks if you're running on a single MPI rank without GPU support.

However, things get complicated quickly as we try to run on more bits of hardware. When we're running on multiple ranks, because we have multiple ranks participating, we have to do communication. This is illustrated with the green arrows here, which are the phases during the MD step where ranks have to down tools and talk to each other in order that everybody stays together. So now we want to do our short-range work in two different parts, because some of it pertains to atoms that are shared with neighbouring domains: we want to do that work first and send those forces off before we start doing the short-range forces that only our local domain will care about. We still need to combine those, then do some bonds, then do some PME. But now, during PME, we're going to have to do different kinds of communication: we're also going to have to talk with adjacent PME ranks, perhaps, and during the 3D FFT stage there will be some global communication. This tends to be a big bottleneck as we add lots and lots of ranks, so we'll consider ways of doing it better in a moment. Then we will have the same update phase at the end, and once we've done our update we'll have to send coordinates to our neighbouring ranks, depending on who has responsibility for what.
We have to do this global communication stage here. That works much better on supercomputers if the PME work is separated onto a smaller set of dedicated MPI ranks than the full set you would like to run on. This greatly complicates how the code works internally, and we'll see more about how to run with this setting in a minute. We now have an initial phase where we send coordinates off to those separate PME ranks. On our particle-particle ranks, as the others are now called, we do the same kind of work as before: we still have to communicate locally, we still have to do our bonded work, and then we get over here and wait for our separate PME ranks to finish the work that used to be done on our main ranks too; but now these guys only do the PME part. This makes the communication phase work a lot more efficiently, and with much less contention on, for example, the interconnects of typical clusters. The downside of doing this is that during the update phase there's nothing for the PME ranks to do. That's a little bit wasteful, but we gain enough here to pay for that waste there; this is quite an efficient optimization within GROMACS.

Since version 4.6, GROMACS has also had support for GPUs, which allows us to offload the big part of the compute work, the short-range computations, to the GPUs. They run down here, in the GPU section of our layout. When the non-local work is done, however, we still need to send those results back: we have to transfer coordinates over, transfer forces back, and send them off to other MPI ranks. We start doing a lot of data transfer, and it is an unfortunate fact of life that as you start to run GROMACS on lots of hardware it becomes less and less compute bound and more and more communication bound. If you're running on small amounts of hardware, you have to squeeze out the most compute performance; if you're running on lots of hardware, you need to think about how your network and your transfers behave. The GPUs, once they've done the non-local work, can still do the local work and send those forces back. Then, once that has been sent off so the communication can happen in the background, we can go and do our PME, combine everything, and get to our update phase. However, a fact of life of only offloading the very compute-intensive part is that, again, we aren't using our GPUs during all of this phase, so we really want that phase to run as quickly as possible in this scenario.

Finally, the most complicated layout we could imagine is that we're running over lots and lots of hardware with multiple MPI ranks, we have separate PME ranks, and we're using GPUs. This is the most complicated version to try to manage, both within the code and as a user. Again we need to set our GPUs doing their work, and they'll be idle during the update phase. We also need to send our coordinates off to our separate PME ranks so that they can do their separate work. Meanwhile, back on the CPUs of our main ranks, we're doing our bonded interactions, receiving the forces from the GPUs and sending them off to other ranks. Maybe some ranks are sitting idle, which is a bit of an irritating thing, because we only have bonded interactions on some of our domains, for example where our protein is; we won't have any bonds where our waters are, because of course those are rigid, so there aren't any bonded interactions that vary. So on some MPI ranks there actually isn't any work to do here.
Their bonded work is zero because they've just got water, so all they can do is wait for the GPUs, send that off, and then sit around waiting. That's frustrating, but that's how the current code works. Finally, we combine all of the forces back on the CPU, do the update, and so on. So again we might have multiple bits of hardware that are lying idle. How to balance all the workload between all these different compute units is something that we do a decent job of within GROMACS, but there are some things that you need to be aware of so that you can get the maximum benefit from all the good stuff that is within mdrun.

Because we are going to use both the CPU and the GPU, you do want to have a well-balanced set of resources there. There's an excellent paper produced by some of the core GROMACS developers and friends, all of whom are supported through BioExcel. Please do go and read that paper: it's called "Best bang for your buck: GPU nodes for GROMACS biomolecular simulations", and it goes through a lot of the details I'm talking about today, with graphs, measurements and the costs of different amounts of hardware, to help you optimize all these things. I really can't recommend reading that paper strongly enough. It is worth bearing in mind that if you want to scale a GROMACS simulation across multiple GPUs, you really are going to want several tens of thousands of particles per GPU; that's pretty commonplace across lots of MD packages that want to run with GPUs. If you're going to run with multiple nodes, then your network needs to be at least gigabit ethernet, and preferably InfiniBand. This is so that you have minimal latency of communication: GROMACS needs to send a lot of very small messages, so bandwidth tends not to be important and latency is important. If you're buying hardware, memory and disk pretty much don't matter; buy whatever seems cheap and effective, and especially if you might have other people using your cluster, get whatever is going to suit them. You can certainly run GROMACS on resources in the cloud; that's a perfectly good way to do things. You do have to avoid running inside virtual machines where only, for example, SSE2 SIMD units are exposed on the CPU side; that can waste a lot of the capability within GROMACS for running on more recent hardware, so be alert for that one. You'll also get much better value if you're running on a cluster that's relatively homogeneous: you don't want this sort of CPU here and that sort of CPU there, because that's just going to make your life harder, as a user and indeed as a sysadmin.

If you're only running mdrun on a single node that you know, you have a pretty easy time. If you use the default build, that will be optimal for it, and the defaults within mdrun will already do a very good job. If you want to explore, that's reasonable; you might get a little bit better. The thing that you want to do is to choose the number of thread-MPI ranks and the number of OpenMP threads over which each of those ranks is parallelised, so that their product equals the total number of cores or hardware threads available.
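As a concrete illustration of that rank-and-thread arithmetic, on a hypothetical 16-core node each of the following uses every core, and only benchmarking will tell you which is fastest (topol.tpr is a placeholder for your own run input):

    gmx mdrun -ntmpi 16 -ntomp 1 -s topol.tpr
    gmx mdrun -ntmpi 8  -ntomp 2 -s topol.tpr
    gmx mdrun -ntmpi 4  -ntomp 4 -s topol.tpr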
Hyperthreading on Intel CPUs can be useful, but you're still going to want thousands of particles per hardware thread in that case, so choose wisely; if you've got a fairly small system you might not see any value from hyperthreading. If you want to know what hyperthreading is, please do check out our excellent user guide on our homepage, which talks about all these details of how CPUs work so that you can understand these kinds of issues. If you were running on nodes that had 16 cores, which is pretty typical these days, you could run the same simulation in all the different ways sketched above. One of them will run fastest, and you might want to try all three to see which one in fact gives you the best performance. There are more examples in our user guide on the documentation page.

If, however, you're running with a GPU as well as a CPU, you can still use the default build, but now of course you have to have configured with -DGMX_GPU=ON. The mdrun defaults, particularly in GROMACS 5.1, do a pretty good job of maximizing total resource usage. You do need the number of particle-particle domains to be a multiple of the number of GPUs on the node. This is something that people sometimes get a little tripped up on (we see that on the GROMACS mailing list), but it follows simply from the fact that we offload the short-range work from domains to GPUs. Again you might want to vary the number of MPI ranks and the number of threads within those ranks so that their product equals the total number of cores you have, and again the number of ranks has to be a multiple of the number of GPUs. You also might want to set -gpu_id appropriately, because you need to specify indices that say: yes, this first PP rank should go to GPU 0, so should the second, third and fourth; once we get to the fifth rank we want to use GPU 1, and so on for the next three. You also might want to vary the parameter nstlist; you can do that either within your .mdp file or on the mdrun command line. The default with GPUs is about 50 (it does vary a little with GROMACS version), and you might observe total throughput go up or down a little over this sort of range, so you might like to play with it to get maximum throughput. So again, on a node with 16 cores, if you also had two GPUs, you might want to vary things in these kinds of ways to get different throughput. With earlier GROMACS versions you had to be very specific about your -gpu_id; with GROMACS 5.1 and more recent versions you can get away with being less specific, and GROMACS will work out that a short specification like 01 means it should spread the ranks over the GPUs in the same kind of way.

If you are running mdrun across multiple nodes, however, you need to make sure you built an MPI-enabled version of GROMACS, and as I said earlier, GROMACS does use the network heavily, so latency and its variability normally limit performance and scaling. If you have to share your network with other users, you aren't going to be able to get full value out of the software engineering that does exist in GROMACS. You can maximize what you can do if you are able to request from your job scheduling system a set of nodes that are very close to each other: you only want one level of your switched network, or one level of your fat-tree or Dragonfly network, exposed to your set of nodes, and that will limit your exposure to the way such networks are shared between everybody on the machine. If you haven't got that facility available in your job scheduler, please talk with your system admin and see if there are options to get it installed, because you will get much better performance from GROMACS if you can do that.
You should also consider tweaking things within your MPI library regarding how small messages get transferred: you want minimal overheads in how the MPI library optimizes things and whether it copies buffers. How to do that depends widely on the MPI version and library, but please also talk with your cluster sysadmin; they will know lots of secrets there. As you get to larger numbers of ranks, you definitely want to use the separate PME rank facility that I talked about earlier. We have a tool within GROMACS called gmx tune_pme, which is very useful for this because it is able to optimize also over the number of PME ranks, which mdrun itself isn't able to do. Performance does tend to be best, as a rule of thumb, when the numbers of MPI ranks for these two components are composite numbers with lots of common factors. For example, 48 PP ranks and 16 PME ranks tends to work pretty well if you have 64 total ranks, because we can have a decomposition that is 8 by 3 by 2 in 3-dimensional PP space and 8 by 2 by 1 in PME space, and the cost of communicating within and between these groups of ranks is minimal. As we get to more and more ranks, as I've said, we need to think about our communication more than our computation; that will work a bit better than choices whose common factors are not so good, and prime factors like 7 and 11 just tend to be pretty bad, so you want to avoid those if you can help it.

If you're running CPU-only on multi-node clusters, you need to think about a few issues, and now you have to use gmx_mpi mdrun; you definitely need to use that. So, for example, if you're running on 4 nodes each with 16 cores, there are some alternatives here, sketched below; we've got more examples of these in the GROMACS user guide, with explanations of how these different parameters all work. What this does is use the same number of cores for each of these three examples, but we're grouping them together differently, so you'll have different amounts of OpenMP overhead versus MPI overhead, and one of these will turn out to be better for your simulation and your hardware in practice. Unfortunately it's very difficult to give general advice, but generally, with only CPUs, -ntomp 1 or 2 is about as far as you want to go with current hardware and current GROMACS.
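For instance, on those hypothetical 4 nodes of 16 cores each, the alternatives might look like the following. The exact mpirun or srun syntax, and how ranks get placed on nodes, depends on your cluster and scheduler, so treat these as sketches:

    mpirun -np 64 gmx_mpi mdrun -ntomp 1 -s topol.tpr
    mpirun -np 32 gmx_mpi mdrun -ntomp 2 -s topol.tpr
    mpirun -np 16 gmx_mpi mdrun -ntomp 4 -s topol.tpr
    # gmx tune_pme can search over the number of separate PME ranks for you;
    # depending on your version you may need to tell it how to launch the MPI mdrun
    gmx tune_pme -np 64 -s topol.tpr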
Once you're running on GPUs on multiple nodes, all these previous considerations apply, but things get more complex. You now need to map your PP ranks to your GPU IDs in the same way you did with a single node (you should certainly start there when learning how things work), but you now also need to manage the fact that the PME ranks don't use the GPUs. So in this case, where we have the same 4 nodes with 16 cores and 2 GPUs each, it is very natural to use a layout where we have 64 total ranks, 16 of which are PME, so we have 48 PP ranks left, of which there are going to be 12 on each node, and we need to say how to split up those 12 ranks, as in the sketch below. If, however, we use fewer ranks, we will eventually get a situation where we have an unbalanced number of PP ranks per node; that probably won't run very efficiently. There are ways to vary how we use OpenMP across the different kinds of ranks, so that we parallelize across OpenMP efficiently, fill out the number of cores that we have per node, and fill out the GPUs in a balanced way. How to do this depends very much on the structure of your hardware, so you definitely want to read the documentation for your hardware, think about how to do this, and perhaps talk with us on the GROMACS developer and user forums.

The tool gmx tune_pme is also a very good resource to think about using: it's able to tune the number of PME ranks, which is something mdrun can't do for itself. There are a few tips and tricks here: particularly before GROMACS 5.1 this could interact poorly with the way our dynamic load balancing works, so you might want to turn that off in some cases and see if you do better. tune_pme can optimize over this variable, which is often very important, but it needs both an MPI and a non-MPI build available, so talk to your cluster admin about that. It can also work with GPUs, but there is only a small number of per-node layouts that could be useful, so it's not always helpful there.

The best tip for running with GPUs is that if you need lots of copies of similar simulations anyway, and you don't need the results immediately, then you should consider running several copies of the same kind of simulation on the same hardware, so that the time I mentioned earlier when the GPUs are lying idle can get used for running another simulation. The easiest way to set this up (a lot of these details are discussed in the paper I recommended earlier) is to run the MPI-enabled version of GROMACS using the -multidir feature to say: OK, I want to run these four simulations across those 16 ranks, four ranks per simulation, sharing the hardware and all the details of the processor layout. This is really good with GPUs, because we're able to take advantage of the fact that during the time when they would lie idle for one simulation, we can be working on another one. This tends to work very well: your overall performance per unit time, per unit dollar, per unit of whatever is important to you, goes up dramatically if you need multiple copies of the same simulation, so people should consider that very seriously. This will also look good in your paper, because you'll have multiple simulations to report and to base your measurements on.
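A hedged sketch of those two ideas, using the hypothetical 4 nodes of 16 cores and 2 GPUs from above; again, rank placement is up to your MPI library and scheduler, and these particular splits are examples rather than recommendations:

    # one node, 16 cores, 2 GPUs: 8 PP ranks of 2 threads, four ranks per GPU
    gmx mdrun -ntmpi 8 -ntomp 2 -gpu_id 00001111 -s topol.tpr
    # 4 nodes: 64 ranks, 16 of them PME-only, leaving 12 PP ranks per node,
    # mapped six to GPU 0 and six to GPU 1 on each node
    mpirun -np 64 gmx_mpi mdrun -npme 16 -ntomp 1 -gpu_id 000000111111 -s topol.tpr
    # four related simulations sharing the same hardware to fill the GPUs' idle time;
    # each directory holds its own input files
    mpirun -np 16 gmx_mpi mdrun -multidir sim1 sim2 sim3 sim4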
When you are benchmarking, you do want to use the actual production .tpr you intend to use, so go and build that; then you want to run a few thousand MD steps to permit the tuning and load balancing to stabilize, then reset the performance counters, and finally observe the performance. A typical way to do this is to use the -nsteps command line option, which is really good for benchmarking. You shouldn't do your production simulations using this option (you should set nsteps within your .mdp file), but for performance testing it is very convenient. You do want to reset the counters, however (mdrun has options for that), so that you don't have all of this tuning and load balancing polluting the final performance report. This is a very good way to get performance numbers that are reasonably reliable, because everything has already been tuned.

There are lots of clues within the log file; that might be a topic for a future webinar. There's a summary of the hardware and software configuration at the start of the log file, there are reports on how things got set up, and there's an analysis of where your time got spent in different parts of the code and how efficiently everything was running. If you have multiple log files because you've tried varying some of the parameters I've been suggesting, you often want to use the Unix tool diff in its side-by-side mode to compare different runs, to understand where they differed and what kind of effect that had, so that you can follow up on it. So that brings my material to a conclusion, and we're going to take some questions now. Hopefully you've been asking questions, and Rossen has someone in mind to pass the microphone to.

Yes, thank you Mark. There is one question from Robin Corey. Robin, I'm going to give you the mic now; can you say something, so we can see if we can hear you? OK, so I will read the question on behalf of Robin. He's asking about v-sites: 5 fs time steps are apparently possible; how true is this, and how applicable is it to different biomolecular systems?

Yes, certainly 5 fs time steps can be used; some people like the conservatism that they feel with 4 fs. I would certainly encourage you to do a small simulation on something that's characteristic of what you want to observe, and check that in fact there's no difference in the quality of your observable between those two. There's a lot more information in a paper from Erik Lindahl's group that came out about four years ago (the approach itself was introduced within GROMACS well before that), and the observations within it are still quite pertinent. You can't go much past about 5 fs, because then the subtle ways in which water vibrates aren't modelled appropriately; that's why 4 to 5 fs is about as far as you can push molecular simulations, whether you're doing that with multiple-time-stepping regimes or whether you're doing it with virtual sites. So I certainly do encourage you to explore the use of 5 fs; you may decide it is quite good for you. If you are after kinetic properties, you may want to consider more carefully how far you push that; whether the kinetics of the amino and methyl groups matter to you is going to be important.

Thanks, Mark. There is a follow-up question, again from Robin: whether there is a performance difference between MPI and thread-MPI.

Thread-MPI of course only works on one node, and it is built to work very well: it's a very efficient implementation on top of either native pthreads or Windows threads, whether you're running on Unix-style systems or Windows systems, and it's intended to be very low overhead. You will observe more overhead if you use a full MPI library on a single node, maybe to the tune of 10-20%. Thread-MPI is also much easier to build: you don't need to have MPI installed to run on a single node.
Everybody on a single node should be doing thread-MPI.

OK, we have a question by Ramon Krehe, I'm not sure I pronounced it correctly. Ramon, can you hear us? We can't hear you; maybe you could type the question in the question pane and I will give the mic to... OK, just a second. OK, can you hear us?
Yeah, can you hear me?
Yes.
Regarding MPI tuning: what sizes are those small messages, basically, and is there more information on what the normal message sizes are?
The messages are really quite small. For example, when passing around packets of coordinates between adjacent PP ranks, we're talking about hundreds of atoms, each of which has three position coordinates, each of which might be a four-byte float, so we're really only talking thousands of bytes. So the overhead of setting up and sending the MPI messages is absolutely dominant for the typical kinds of scenarios at scale. Similarly, with the 3D FFT there tend to be fairly small messages.
Yeah, and there are no really large messages used anywhere.

OK, and next we have a question from Siri van Keulen. Siri, can you hear us? We have a couple of questions...
Yeah, I can hear you. Can you hear me?
Yes, we hear you.
OK, I had two questions. One is about the orientation of the simulation box, because when you were explaining this the sound was a bit unclear. Especially if you have a rectangular box, how would you need to orient the box to get optimal cutting of the box for GROMACS?
OK, so if you had something shaped like a long thin pipe, you would want to end up with domains that are approximately spherical, because that keeps each domain as compact as possible. Of course a long thin pipe can't be chopped into spheres, so you end up with cuboid kinds of things. The default partitioning in the domain decomposition is to split things along planes orthogonal to the Z axis, so you want to set up your simulation system so that X and Y are your short dimensions and Z is your long dimension, so that when the partitioning happens in parallel planes, it is the long dimension that gets broken up. Is that the kind of information you were looking for?
Yes. So the long dimension is along Z, and X and Y are short?
Yes.
OK, thanks. And one more question, about the GPUs, because I am fairly new to GPUs and I didn't understand how you can optimize them on multiple nodes with these ones and zeros; I didn't understand that part.
OK, so the...
One more, perhaps.
Yeah, sure. So I'll just come back to this slide. -gpu_id is the mdrun command line option that says how to map the particle-particle ranks to the GPU IDs. Here, if we have two GPUs, we give them IDs 0 and 1, and because we have eight particle-particle ranks, we map ranks 1 through 8 to the GPUs, four to each. That doesn't change when you go to multiple nodes; the indexing in -gpu_id, however, stays within a node. So if you have two identical nodes, each with 16 cores and two GPUs, you still want to use -gpu_id like this, to express the mapping from the PP ranks within a node to the GPUs within that node.
OK, so in this example the 0 and the 1 stand for the two GPUs, and this is just how you distribute the -ntmpi ranks on the multiple nodes?
Yes, we're going to have eight particle-particle domains, each of which is an MPI rank, and each is mapped to a GPU so that there are four domains per GPU. You could also use a different assignment string; that might be faster or slower, and you could try them both.
OK. And on the multiple nodes you said it doesn't change. So, but how... OK, so you still, because you have two GPUs in the node...
So yes, this mapping is expressed within the node. Again, as I very quickly did the math there, we will have 48 PP ranks in total, 12 on each node, and we need to map those 12 PP ranks to our two GPUs, so six 0s followed by six 1s is one way to do that.
I just didn't get this 48.
So, because we have chosen 16 PME ranks out of 64 total ranks, there will be 48 left.
Yeah, OK.
So this can be complex, because you end up with domain decompositions that have awkward numbers as factors, and as I said on one of the slides, you want to try to choose everything so the numbers are mutually composite; that's very hard to do in the general case.
So with this 48... sorry, could you please explain that one more time? You have this 48, and then what do you do?
So there are 48 ranks distributed over the four nodes that we're computing on, so there will be 12 of those on each node.
OK.
Exactly how that happens often means you need to get involved with how your MPI library would like you to express a host file, or how your job scheduler wants you to tell it where the MPI ranks get placed. All of that involves talking to your local admins and so on; we can help with the general principles, but the details are the details. So yes, there will be 12 of those on each of your nodes, and by expressing six 0s and six 1s for -gpu_id, mdrun will understand that you want those 12 ranks mapped with the first 6 to GPU 0 and the next 6 to GPU 1.
OK.
Another thing that -gpu_id is useful for is if you're running on a desktop: you might have one GPU that is used for your display and other GPUs that you actually want to compute on. Using -gpu_id allows you to skip a GPU that you don't want to use because it's underpowered; you do much better if you use just the two compute ones, for example.
Thanks.
OK, thank you. Since Ramon couldn't connect with the microphone, he wrote the question in the question pane. His question is about checking the combinations of -ntomp, -ntmpi and nstlist. In particular: should we try all these combinations with all possible nstlist options, are they coupled, and what's the best way to optimize?
You do of course want to choose the number of ranks and the number of threads per rank so that you keep all the cores busy; that's the first place to optimize, and that's what we're expressing here. nstlist is something you might vary as well, and it is largely independent of those other degrees of freedom, because it just controls how often the pair list is rebuilt.
So, Mark, we're getting problems with the sound again, and it's actually 5 o'clock already, so I suggest we stop here. There are still questions that were asked, and we'll see what we can do; maybe we can do another session in the future that focuses more extensively on questions. In the meantime, everybody is welcome to post questions on BioExcel's website at ask.bioexcel.eu, and we can follow up with those questions there. I would also like to tell everyone about our next webinar in the series. Mark, could you show the slide? Yes: so our next webinar is in two weeks, and it's on atomistic molecular dynamics setup with MDWeb. That webinar will be more for novice users who are not so experienced in the area. The MDWeb software is developed in Barcelona, and its web interface is very easy to use, so if you have colleagues, students or people who are just starting with molecular dynamics simulations, it will be very useful for them. And yes, we'll have more webinars coming in the future: on free energy calculations we are planning one webinar in about a month's time, which will also be advertised on the website. I encourage everyone to subscribe to our mailing list and newsletter on bioexcel.eu; you'll see a form in the footer or in the right column of the subpages. Subscribe there and you'll get notified about future events. I'm sorry we had some problems with the sound quality; I hope it was useful for everyone. There will be a recording of the webinar on the website which you can go over, and the slides are also there, and based on the questions we can plan some future events. So thank you for today, Mark, thank you very much for the very useful and nice talk, and I hope we'll see everyone online again. Thanks for today.