Welcome to the GROMACS Features and Future talk from the BioExcel Summer School 2020. For those of you who may not know me, I am the GROMACS development manager at SciLifeLab in Stockholm, where I'm responsible for organizing the general development of GROMACS: making sure that we follow our own development guidelines, that we meet the targets we set for providing new features for you, and that we release new versions, updated versions of the code, patch releases and so on, on time, and with as few bugs as possible when you get your hands on them.

During this talk I will give you a very quick overview of some of the special features of GROMACS. This is of course far from complete, because I don't have several hours to tell you about everything GROMACS can do, and you can always go to the documentation to find out everything that is currently supported, for the current version and also for previous releases. In the documentation you will also find the release notes, which tell you exactly what has changed between versions in the past, and you can get some information about what we are planning to do in the future by looking at the in-development documentation, including which features may have been removed because we decided we can no longer support them. If you want to know exactly what has happened in the previous versions since 2018, I recommend that you check out the excellent BioExcel webinar series on the different GROMACS releases. Those webinars tell you in much more detail and depth what was new in the different versions and what we were planning to do, and you can check up on us to see whether we actually managed to do what we had been planning.

I will start the talk not with what the code can do for you in terms of different MD simulation settings or simulation modes, but with the hardware acceleration that we use to get the performance out of GROMACS that you are interested in. In this figure, which is quite busy, you can see one of the main things we use to improve calculation speed: we are not using the standard pair list that basically all other MD engines use. In the figure on the left you have a standard pair-list setup, where every atom interacts with the atoms around it, and you make a list of atoms that may possibly interact with this atom and only calculate interactions with those. This is done to reduce the N-squared scaling problem that you would have if you needed to do an all-to-all interaction calculation. What GROMACS does is slightly different: we do not do those distance cut-off checks based on single atoms, but based on clusters of atoms. Depending on what kind of computer hardware we are running on, we use different cluster sizes to define how many atoms are always checked against each other. We also do not use just a single distance to decide which atoms might interact in the near future, but we have a second, outer buffer as well, which makes it possible to update the inner pair list that you use for calculating the interactions on the fly, during downtime, for example on an accelerator device.
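To make this a bit more concrete, the user-visible side of this buffered, cluster-based pair list is just a few .mdp settings; a minimal sketch could look like the fragment below, where the numbers are purely illustrative placeholders rather than recommendations.

```
; Illustrative .mdp fragment for the buffered Verlet pair list (placeholder values)
cutoff-scheme           = Verlet   ; cluster-based, buffered pair lists
verlet-buffer-tolerance = 0.005    ; kJ/mol/ps of allowed drift per atom; the buffer size is derived from this
nstlist                 = 100      ; steps between full pair-list rebuilds; mdrun may adjust this value
rlist                   = 1.2      ; nm, outer list cut-off; normally set automatically from the tolerance
```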
Now, this may sound a bit strange, because it looks like we are doing extra calculation, but it means that we can increase the time between rebuilding the complete pair list, where we again check which atoms may be interacting with which. And this can save a lot of performance, by only doing this every, for example, 100 steps instead of every 10 steps. This has been a feature since GROMACS 2018, and it has led to some really nice performance improvements, because you can reduce the time spent on pair-list regeneration.

We use those clusters of atoms because they work very well with modern processors, which have special instruction units called SIMD units. SIMD stands for single instruction, multiple data: a SIMD unit lets you apply the same instruction to multiple sets of data at once. But you need to fill the whole register, and you can only do one instruction on it at a time. If you loaded it with just two sets of data, two atoms that interact with each other, you would be wasting a lot of performance, because you would not be using most of the register space available in the CPU. But if you use clusters of atoms, you can combine them, and then for each atom in a cluster do the instruction against all the other atoms that may interact with it and that fit into the SIMD unit. What you can see here is that in one pass you can calculate the interactions for 16 atom pairs, which would normally take 16 cycles if you did it one by one. This cluster setup also works very well with the way graphics cards and GPU computing engines are built, because they can take advantage of a similar data layout. So we can use a data layout that is very similar between the CPU kernels and the CUDA and OpenCL kernels for GPUs, which means we save on code complexity and make it easier to test our code, because it uses the same layout in the end. One major advantage is that we are not wasting many CPU cycles, because we do not need to load data into the unit that often; we can reuse data multiple times. This allows our kernels to reach about 50% of the floating-point peak performance of the CPU, which gives us amazing speed-ups even for CPU-only simulations.

But nowadays people don't run on CPUs only any more; they want to run on graphics cards. And this makes the whole setup much more complicated, because a single MD step suddenly looks like this. On the CPU you have things like the domain decomposition, where you partition the system and divide atoms into different domains, the bonded force calculation, and the PME force calculation, and on the GPU you want to do your non-bonded force calculations. This can be even more complicated if you also want to move in data from remote ranks, for example over MPI, but that is an additional complication we don't really need to go into here. As you can see, at this point you have a lot of communication steps between the CPU and your device, in this case a CUDA GPU. And you can still see that there is some time on the CUDA GPU that is wasted, shown by the red striped boxes: the GPU is idle because the CPU is calculating something and preparing new data. This idle time can of course be reduced if you start offloading more things to the GPU.
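In terms of how you run this, the scheme in the figure corresponds roughly to the offload mode where only the short-ranged non-bonded kernels go to the GPU; an illustrative invocation (file names are placeholders) might look like this:

```
# Non-bonded interactions on the GPU, PME and bonded forces kept on the CPU
gmx mdrun -deffnm md -nb gpu -pme cpu -bonded cpu
```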
For example, the PME mesh force calculation can also be moved to the GPU. Depending on the performance of the GPU compared to the CPU, this can mean that a lot less time is wasted with either the CPU or the GPU sitting idle in a step. Even this is not always perfect. You can test it yourself, running with the necessary flags to see how many idle cycles you have. It can be that your CPU is still idle because it is waiting for the GPU, if you have a big GPU, or that your GPU needs to wait for the CPU to calculate the bonded forces and do the integration and constraints before it can get new data.

To get around this idle time, we worked a lot for GROMACS 2020 to make it possible to do the whole MD step on the GPU. You offload the PME force calculation, the non-bonded force calculation, the bonded force calculation, and the integration of forces and constraints onto the GPU, and make use of the acceleration features there. This can lead to very good speed-ups if you have weak CPUs that are much slower than your GPU. It can mean that you can run multiple simulations on a single node with multiple GPUs, and even run an extra simulation on the CPU, because it is mostly idle. It can also be a problem, because it is not always the case that you can calculate everything on the GPU: we don't always have kernels for everything. One notorious example is the CHARMM force field, where we need to calculate forces from the CMAP correction terms, and there is no GPU kernel for this. To get that calculation done, we need to copy positions back from the GPU to the CPU, so we are essentially offloading to the CPU, calculating those special forces there, and copying the forces back before we do the integration on the GPU. It can actually improve performance if you, for example, decide to move the bonded or PME calculation from the GPU back to the CPU, while still doing the update and constraints on the GPU, because in the scheme above your CPU is otherwise sitting idle; it's not doing anything for you. So why waste those cycles if you are not planning to use them for something else? This means that doing everything on the GPU can actually be slower than offloading some work back to the CPU, and we are working on automating this as much as possible, so that GROMACS will be able to tell you which offload scheme is the optimum for your current simulation settings. We are also trying to increase the number of simulation types that are supported when running all of the calculations on the GPU. This is still ongoing work, because even in GROMACS 2020 you cannot use it with domain decomposition and multiple ranks unless you are willing to be a bit experimental and try the experimental feature flags that are in the code for this. If you really want to try this, I recommend that you check out the blog post by NVIDIA, where they go into detail on how to run these simulations, how to enable the developer or experimental features, and how to validate that the simulations are still stable and doing what they are supposed to be doing.
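For reference, a sketch of how the fully GPU-resident step is switched on in GROMACS 2020 is shown below; the environment variables are the experimental switches described in that NVIDIA post, as I remember them, and they may change or disappear in later releases, so treat this as an illustration rather than a recipe.

```
# Experimental switches for GROMACS 2020 multi-GPU runs, as described in the NVIDIA post
export GMX_FORCE_UPDATE_DEFAULT_GPU=true
export GMX_GPU_DD_COMMS=true          # direct GPU-GPU halo exchange between domains
export GMX_GPU_PME_PP_COMMS=true      # direct GPU-GPU transfer between PME and PP ranks

# Offload non-bonded, PME and bonded forces plus the coordinate update and constraints
gmx mdrun -deffnm md -nb gpu -pme gpu -bonded gpu -update gpu
```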
But I think this is enough about how we achieve our speed-ups; let's talk about what you can actually do with them. One thing that was added in GROMACS 2018, and which I think is one of the most important recent additions, is the ability to validate the physical correctness of simulation settings. This is useful both for us as developers, to make sure that if we change something in the implementation of, for example, an integrator or a thermostat, we are not making unintended changes that would mean simulations are no longer physical, and for you, to check whether your simulation input actually follows physical laws and reproduces observables that can be checked. Since GROMACS 2018, the shipped physical validation suite can be run together with the normal regression tests, but it usually is not, because it can take quite a lot of time; it also includes the ability for you, the user, to add your own simulations to it to quickly check whether they are valid.

Here is an example of a simulation that uses an unphysical setting in the MDP input file, and one that is valid and follows the physical laws. On the left you can see the kinetic energy distribution plotted against the expected value, as a function of temperature in the form of kB*T, and it shows that when using the Berendsen thermostat the sampled values, shown in blue, and the red fitted line are far off from the theoretical value shown by the black line. It just shows what I hope everybody knows by this point: nobody should use the Berendsen thermostat for long production simulations. On the right you see the same simulation, but with the v-rescale thermostat instead of Berendsen, and as I hope you can see, the fit to the sampled data and the theoretical value overlap perfectly, showing that v-rescale in GROMACS actually is a physically correct thermostat. This is also available for a few other things you can check, like barostats, and it is a good tool if you are trying to see whether your simulation will actually be able to give you physical results or not.

Another thing that we added in GROMACS 2020, so in the newest release, is the ability to run normal simulations of a protein or other structures and fit them to electron density maps that have been gathered experimentally. This means you are able to take an experimental electron density map from X-ray crystallography or cryo-EM imaging, give it to GROMACS together with your structure, and tell GROMACS to please fit my structure into this map as well as possible, and GROMACS will just do this. We calculate forces from the difference between the map your structure, your protein, would create and the experimental map, and those forces are slowly adapted to improve the fit of the structure until it fits, or it doesn't fit and your simulation will stop because of that. This is rather cool, because you can not only test whether, for example, an X-ray structure fits into a cryo-EM density and how many changes are needed; you can also drive structures between different cryo-EM or X-ray densities to see what changes are needed, for example to open a structure or close it, and with physically correct forces, so nothing unphysical happens during the simulation. As for what the end result looks like, I have this slide here because I think it looks really cool: you have your density, you have your protein, and they match perfectly if your protein can actually fit into the density. It's something I just can't stop being amazed by.
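To give an idea of what switching this on looks like from the user side, here is an illustrative .mdp fragment for a density-guided simulation; the option names follow the GROMACS 2020 documentation as far as I recall them, and the values and file name are placeholders, so please check the reference manual before copying anything.

```
; Illustrative density-guided simulation settings (placeholder values)
density-guided-simulation-active                     = yes
density-guided-simulation-group                      = Protein
density-guided-simulation-similarity-measure         = inner-product
density-guided-simulation-force-constant             = 1e9
density-guided-simulation-reference-density-filename = reference.mrc
density-guided-simulation-nst                        = 1
```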
The density-guided simulation code has been developed by Christian Blau, who has been working on this for a long time as part of the GROMACS development team, and I want to give him a big shout-out for actually getting this into the latest version.

Another thing that was included in GROMACS with the 2019 release is the ability to actually do QM/MM simulations with GROMACS, something we have been sorely lacking in the past, because it has not been possible to do this reliably with the legacy QM/MM interface that we have. With this, you can link GROMACS to the CPMD program and then, through CPMD, control what actually happens in your simulation, and try to simulate excited states of atoms, chemical reactions in a protein environment, whatever you want. There has to be a word of warning here: this is under a major rework right now for GROMACS 2021, because we realized that this interface still relied on some of the old data structures and simulation code paths that have long been deprecated, so we are changing this to a different interface at the moment. If I have time, I will talk about this a bit at the end of the lecture, when I come to the future of GROMACS.

Now, a thing that was added in 2018, and again I think not that many people are aware of it, is that we have a very, not just great but awesome, way to do enhanced sampling in GROMACS with the accelerated weight histogram (AWH) method. It is quite simply, as the slide says, an iterative scheme that solves for the free energy or bias that you are interested in. You collect samples during the MD, not just from a single simulation but from multiple simulations as well, and use those samples to continuously update the free-energy estimate until the difference between the target distribution and the distribution you actually sample with the bias you have collected becomes zero. In practice this looks like what you see on the right: you start from a flat estimate and you have a target distribution. Once you start, you can have multiple walkers that continuously update the same free-energy surface, and it converges to something as close as you can estimate to the final answer, and this is without any prior information about what the underlying distribution actually looks like. One major advantage of using this in GROMACS is that it works very well with our domain decomposition code and with the ability to run multiple simulations at a time, because you can easily run up to a hundred simulations on thousands of cores and reduce the time you actually need until you get the free energy you are interested in, from hundreds of hours to maybe a few hours, because this scales amazingly well. The simulations are independent, but they update the bias you are looking for together: they communicate every 100 steps, or however many steps you set, and with this communication they continue updating their bias, and they won't stop until they actually converge to the final estimate you are interested in. This work has been done mostly by Berk Hess and Viveca Lindahl, and there are some very interesting papers on it if you want to know more. For example, calculating the free energy of bases flipping out of double-stranded DNA was one of the key early examples of using this.
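To make the iterative idea a bit more tangible, here is a deliberately simplified toy sketch in Python of "sample, then push the bias towards whatever flattens the sampled distribution"; this is not the AWH algorithm as implemented in GROMACS, and every name, force and parameter in it is made up for the illustration.

```
import numpy as np

# Toy 1D double-well "free energy" that we pretend not to know (units of kT).
grid = np.linspace(-1.5, 1.5, 61)
true_F = 5.0 * (grid**2 - 1.0) ** 2

bias = np.zeros_like(grid)       # current bias estimate; converges to -F up to a constant
counts = np.zeros_like(grid)
rng = np.random.default_rng(1)
x = 0.0

def bin_of(x):
    return int(np.argmin(np.abs(grid - x)))

for interval in range(300):                 # "communication intervals" between bias updates
    for _ in range(500):                    # sampling steps; several walkers could share these counts
        x_new = float(np.clip(x + rng.normal(scale=0.1), grid[0], grid[-1]))
        dU = (true_F[bin_of(x_new)] + bias[bin_of(x_new)]) - (true_F[bin_of(x)] + bias[bin_of(x)])
        if dU <= 0 or rng.random() < np.exp(-dU):
            x = x_new
        counts[bin_of(x)] += 1
    # Push the bias up where we over-sampled and down where we under-sampled,
    # driving the biased sampling towards a flat target distribution.
    prob = (counts + 1e-2) / (counts.sum() + 1e-2 * counts.size)
    bias += 0.5 * np.log(prob * counts.size)
    bias -= bias.max()
    counts[:] = 0

# At convergence F is roughly -bias plus a constant; compare in the well-sampled central region.
F_est = -bias - (-bias).min()
print(np.round(F_est[15:46:5], 1))
print(np.round((true_F - true_F.min())[15:46:5], 1))
```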
Now, a thing that was included starting with 2019 and has been extended in GROMACS 2020 is the Python API, which by now allows you to fully control simulations from Python. You can set up simulations, run simulations, analyze simulations, and build up dependency graphs that get automatically resolved by the API workflow mechanics: set up a simulation, run it, modify it, and then analyze the data from it. You can also use the API to define additional restraints that you want to put on a simulation, which will hopefully make it much easier to work with GROMACS in different environments. If you are interested in more depth on this, I really recommend checking out the webinar where Eric Irrgang talks about what you can actually do with gmxapi, and I would also recommend the paper he has written with Peter Kasson on this; a minimal sketch of what such a workflow looks like follows a bit further down, after the release schedule.

Now, enough about what we already have in GROMACS; it's time to think about the future. You can always check on GitLab, where we develop now, what our current milestones are. You will see when we are planning to release new versions and which current issues are assigned to which version, and you can check whether there is something you are interested in. You can of course also open new issues if you think something is wrong with GROMACS, or if you want to tell us about an awesome new way to do MD. In that case, I would recommend that you get in touch with the developers more closely, so we can discuss what a good plan for the future is, or whether it is better to have sustainable development somewhere else. What you can also see there are our current development branches, the state of the current stable branches, and what we are currently working on. We have a lot of merge requests and code that is about to get into GROMACS, so if you are more on the development side, or want to contribute, I highly recommend checking it out and maybe starting to contribute to our project.

I also have to show the current timeline, if you don't want to go through GitLab yourself and figure out when we are planning to release things. The plan for the 2020 version is that we are going to have at least three or four more patch releases until the beginning of next year, where we continuously fix issues that are reported with the code. I don't think there are going to be more releases for the 2019 version, because it is currently in a stable phase where we would only fix issues with the physics of the simulations, and luckily none have come up so far. For 2021, I am expecting to open the beta phase at the beginning of September this year. This will also mark the point after which, hopefully, no more new features are included into GROMACS; anything later is targeted for the following year's release. We will use the time instead to polish up the existing features and make sure that we have well-working beta and release candidate versions, which will then result in a new GROMACS 2021 version at the beginning of next year. Unlike this year's version, it may not be released directly on January 1st, because we will likely want a bit more time for people to test the release candidates, so it will be in the middle of January instead.
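Coming back to the Python API for a moment: a minimal sketch of the kind of workflow gmxapi is aiming at could look like the following, based on my reading of the gmxapi documentation; the exact call signatures depend on the gmxapi version you have, and the file names are placeholders.

```
import gmxapi as gmx

# Wrap grompp as a workflow node (file names are placeholders).
grompp = gmx.commandline_operation(
    'gmx', 'grompp',
    input_files={'-f': 'md.mdp', '-c': 'conf.gro', '-p': 'topol.top'},
    output_files={'-o': 'run.tpr'})

# Read the run input and declare an MD run; the dependency on grompp
# is resolved automatically when the result is requested.
tpr = gmx.read_tpr(grompp.output.file['-o'])
md = gmx.mdrun(tpr)
md.run()
```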
Now, what are we currently working on? One thing is something you may not be aware of, and maybe don't need to be aware of: we are reworking how we perform the integration step in the simulation. A simulation step is relatively simple if you think about what needs to be done, but it needs to be done in the correct order and with the correct frequency for the different parts. This is usually achieved by just having a single loop function that loops over the steps, but this is not modular. You cannot easily extend the simulation loop if you want to include a new kind of simulation, you need to duplicate a lot of code, and you cannot decide to maybe go back a step if you wanted to implement something like a hybrid Monte Carlo / molecular dynamics scheme. What we have been working on for 2020, and are continuing for 2021, is modularizing this: we define the different things that need to happen during the simulation as individual modules that can be combined as we want, as long as we end up with a fully functional simulation in the end. And if you have been running 2020 with velocity Verlet and you didn't notice any difference from 2019, then everything works well, because we have already enabled this for those kinds of simulations. We are looking forward to replacing our current legacy code path for this in 2021 as well, hopefully again without any of you noticing, because that would mean we did everything right and nothing got broken along the way.

Another thing that this modular integration will enable, as soon as it is in place, is an easy way to do multiple time stepping. In this kind of simulation you don't use the same time step for evaluating all forces. You use shorter time steps for, for example, the fast-oscillating forces like the bonded forces, and longer time steps for forces that don't change that much over time, for example the non-bonded interactions. This makes it possible to increase the overall time step, even though the five femtoseconds I show here for the non-bonded time step is slightly optimistic. It also means that you no longer need to use virtual sites or constraints on bonds. The thing is that constraints work very well for fixing the distances between atoms, but they are not physical: atoms oscillate, the distances between them are not fixed. But you don't want to use small time steps for a simulation, because then it takes forever to sample enough, so instead you use long time steps and accept the trade-off of using constraints or virtual sites. With multiple time stepping, you may no longer need to make that trade-off yourself. Another advantage is that the bonded force evaluation is usually a small part of the total cost of the force evaluation for a single MD step, so you can pay this cost and do multiple time stepping here, in this case ten bonded steps for every non-bonded time step, and still have the same performance as before if you have a good general integration scheme. A toy sketch of this inner and outer loop idea follows a bit further down. So this will hopefully allow us to enable longer time steps by default for users and make it possible to get rid of constraints where they are not needed. Again, we are hoping to get this into 2021, likely enabled by default, and if none of you notice anything wrong with your simulations, then again we did everything right.
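Here is the toy sketch of the multiple-time-stepping idea I promised: a RESPA-style inner/outer loop on a made-up one-dimensional model, with invented forces and parameters; it only illustrates the splitting into fast and slow forces, not how GROMACS will implement it.

```
import numpy as np

# Toy 1D particle with a "fast" stiff bonded-like force and a "slow" soft nonbonded-like force.
def fast_force(x):            # stiff harmonic bond, changes quickly
    return -1000.0 * x

def slow_force(x):            # soft long-range-like force, changes slowly
    return -1.0 * x**3

m = 1.0
x, v = 1.0, 0.0
dt_outer = 0.005              # long step for the slow forces
n_inner = 10                  # fast forces evaluated 10 times per outer step
dt_inner = dt_outer / n_inner

for step in range(1000):
    v += 0.5 * dt_outer * slow_force(x) / m      # half kick with the slow force
    for _ in range(n_inner):                     # inner velocity-Verlet loop with the fast force
        v += 0.5 * dt_inner * fast_force(x) / m
        x += dt_inner * v
        v += 0.5 * dt_inner * fast_force(x) / m
    v += 0.5 * dt_outer * slow_force(x) / m      # closing half kick with the slow force

print(x, v)
```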
Another thing that has been worked on a lot for the 2021 release is that we are trying to accelerate free energy calculations as much as possible. This is done mostly by Magnus Lundborg, who is a contributing developer to GROMACS, and he has been working hard on first getting GPU support for PME when doing free energy calculations, so you will finally be able to accelerate free energy calculations on GPUs, and also on introducing SIMD support for the free energy kernels, so that they can take advantage of the very efficient SIMD code that we have for our other kernels. That means those interactions will no longer be the bottleneck in the calculation if you want to do free energy simulations. We also hope to get some general speed-up from our code modernization and optimizations, but those are the major things we are trying to integrate into 2021, and we hope they will be noticeable from the beginning if you are planning to use free energy simulations at all.

Another thing to speed up simulations in general, especially if you have very large simulation boxes and maybe sparse systems, is that we want to implement a fast multipole method to supplement our current PME implementation. As you might be aware, if you want to do PME for a simulation box, you need at some point to communicate the grid contributions between all ranks to calculate the PME mesh. This can be inefficient, so we already do optimizations there, splitting the work between a smaller number of PME ranks and particle ranks, with fewer communication steps. But it is still not optimal, because the 3D FFTs that we use can still limit scaling. The alternative is the fast multipole method, where we evaluate the electrostatic field to calculate long-range electrostatics on a hierarchy of cell sizes. Purely in terms of asymptotic computational cost, this is much better than PME, with much better scaling. The reason it is not usually used, and not implemented yet, is that the prefactor is much larger, so the computational cost can be higher than PME for smaller systems. We hope to get this implemented in 2021 or the following version. It will likely be able to speed up simulations of boxes that are very large with sparsely distributed particles: if you have empty space in a simulation that uses PME, you still need to calculate the PME grid for that space; with an FMM method you don't, so you can save a lot of time there. As with the multiple time stepping, this has mostly been the work of a single developer, who is working on getting it in as soon as possible.

Something else we have been working on, and that I have mostly been working on for almost the past two years, is improving the way we do data analysis, or rather the way we enable data analysis steps for users. I think all of you know what the current way to analyze a simulation is: you take your trajectory file, you run it three times through trjconv, and then you hope you have your protein in the right position and orientation to actually run RMSD calculations, RMSF calculations, distance calculations, whatever you are interested in. That shouldn't be needed, because in the end a tool knows how the structure or molecule you are interested in analyzing should be oriented in the box before it does the analysis. An RMSD analysis tool knows that it needs the protein in the box, no jumps between periodic images, and all molecules made whole. This can be expressed in a programmatic way, and then your trajectory file can be pre-processed to make sure that those pre-conditions are fulfilled before the actual analysis is done. I have been working on implementing those requirements and the trajectory pre-processing, and the hope is that some part of this will already make it into 2021 and be enabled for a small number of tools, to make sure that people can use this instead of having to run trjconv all the time.
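For context, the manual pre-processing that this is meant to replace typically looks something like the commands below today; the file names and the exact trjconv options are placeholders, since what you actually need depends on your system.

```
# Make molecules whole, remove jumps over the periodic boundary and center the protein
gmx trjconv -s topol.tpr -f traj.xtc -pbc mol -center -o whole.xtc

# Only then run the actual analysis, for example an RMSD calculation
gmx rms -s topol.tpr -f whole.xtc -o rmsd.xvg
```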
Another thing we are working on in the background, mostly for our own sake but also to make it easier for people implementing new methods to find out new things with GROMACS, is that we want to modularize how we express the different methods used when doing MD, so that people can just add a new module instead of trying to hack legacy code to introduce a new simulation step in the middle of the main MD loop. This is done mostly by Christian Blau, who also implemented the density-fitting code. We are aiming both to make it possible to check the validity of different simulation methods and to provide extensibility, making it easy, not too difficult, to implement new things. It should also make it possible to have API-level access to those routines, to enable you to declare a new method from the outside, link it into the main GROMACS code, and run a simulation with it.

On a similar note, we are working on doing almost the same thing with the non-bonded calculations. Here we are striving to create a library of methods that enables you to do accelerated non-bonded calculations for sets of particles, given the conditions for how the particles interact. You just call the library through its different API routines and you get the forces for those particles back, and then you can do whatever you want with those forces. You can use them in your own simulation engine to do the integration for the next step, you can try strange new ways to integrate the forces, and you can maybe even replace the force calculation you currently have in your own method with those library calls, instead of having to implement it and check that you did the implementation right yourself. In GROMACS, we want to use this to replace the legacy code that we have, and we hope that exposing those routines at the API level will make it easily extensible for people who want to play around with different force-calculation schemes.

Another thing, which will hopefully make sure that in the future, when you get a GROMACS version, you can be sure it is fully tested, with all the possible combinations we can think of tested automatically without any manual interaction needed, is that we are working on container-based testing and distribution using Kubernetes and GitLab. Images with the different build tools and test tools are made with Docker and then used in a Kubernetes cluster to create build nodes that build GROMACS for us and test nodes that test the code on different hardware, with different accelerator devices or without, and with different libraries available, to get a verdict on whether a new patch that we want to merge into the code is ready to be merged or not. We are also planning to use this to distribute Docker images to users, built with the same pipeline, so that you get a version of GROMACS inside a Docker image that you can then use directly on a computing cluster with Singularity, with the hardware acceleration that you need, and without having to worry that your own installation might be compiled differently and give you issues when running one binary on different processors; the Docker image takes care of this for you.

Of course, those are not all the plans, and I can see that I am already running out of time, so there are a lot of other things we are working on that will hopefully come to fruition for 2021 or later, and I welcome you to check out the current development on GitLab.
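The intended usage on a cluster would then be along these lines; note that the image name and tag here are only an assumption for the illustration, and whatever we end up publishing may well be named differently.

```
# Pull a pre-built GROMACS image and run it through Singularity on the cluster
singularity pull gromacs.sif docker://gromacs/gromacs:2020.3
singularity exec --nv gromacs.sif gmx mdrun -deffnm md -nb gpu
```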
You can also check out how to contribute to GROMACS, if you are interested in doing so, by looking at the webinar I made about this a while ago. I have to warn you that it is not totally up to date, because it was still targeted at our old code review and testing system, but most of the things in it are still valid. And I would also like to thank all the people who have been working and are still working on GROMACS, and especially you, who make our time spent developing GROMACS worthwhile.