Hello everyone and welcome to our sixth webinar of the BioExcel Education series. Today we have Carsten Kutzner, who is going to talk about Optimizing Cluster and Simulation Setup for GROMACS. I'm Rossen Apostolov and I'll be the host today. Before we start with the main webinar, I would like to give a very short introduction about today's webinar. First, you should know that this webinar is being recorded, and the recording will be posted on BioExcel's YouTube channel and also on the website, so that you can watch it later at your convenience. At the end of the webinar we will have a question and answer session where you can ask your questions to Carsten. If your audio doesn't work, I will read the questions on your behalf. You can use the questions panel within the GoToWebinar application to write your questions during the presentation.

Today's webinar is organized by BioExcel, which is a center of excellence for computational biomolecular research. Our center works with several important codes for molecular modeling and simulations: GROMACS, which you know, HADDOCK for integrative modeling, and CPMD, which is used for hybrid QM/MM simulations. We work on improving their performance, efficiency and scalability. We also work with several very popular workflow platforms, such as Galaxy, KNIME and Apache Taverna, to help users with automation and improve their productivity. Our center also promotes best practices and offers a lot of training events to users. What might be of big interest to you is that we have launched several interest groups in different areas of molecular research that offer more guided advice to users; one of them is the best practices for performance tuning interest group. You can find more about it on our website, and we also provide several support platforms such as forums, a chat channel and a video channel where we have recordings of our webinars.

And today it is my big pleasure to present to you Carsten Kutzner, who is a senior scientist in Göttingen. During his PhD he was doing numerical simulations of the Earth's magnetic field; later he continued with high-performance and parallel computing. Since 2004 he has been working at the Max Planck Institute for Biophysical Chemistry in the group of Helmut Grubmüller, and his interests are in the area of method development, high-performance computing and atomistic biomolecular simulations. So now I will let Carsten continue with his talk. Welcome, Carsten.

Hi Rossen, thanks for the introduction. So let's see, it should be on play already, is that correct like that? I think you need to select the play button to share the screen. Ah, yes. So how about now? Yes, it's good. Okay.

So the title of my talk is Optimizing Cluster and Simulation Setup for GROMACS. And maybe a small word of warning here: I'm not talking so much about simulation setup, but about really setting up a compute cluster for GROMACS simulations. I assume that you have some familiarity with the GROMACS software, that you have already set up simulations and that you know how to produce a TPR file and run things. So this is more about tuning settings for optimal, or rather minimal, run time. I would also like to recommend the excellent webinar number two in this series by Mark Abraham, which also covers performance tuning and GROMACS optimizations, and where you will also see a detailed introduction to building and running GROMACS.
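As a concrete starting point for the build step, here is a minimal, illustrative sketch of configuring a GROMACS build with a recent compiler and an explicit SIMD level. The CMake options shown exist in GROMACS 5.x/2016-era builds; the compiler names, the SIMD flavor and the GPU/MPI switches are assumptions that you would adapt to your own machine, not settings taken from the talk.

    # Configure and build GROMACS out of source (adapt every value to your machine):
    # - pick a recent GCC, which usually gives the fastest binaries,
    # - set GMX_SIMD to the newest instruction set your CPU supports,
    # - GMX_GPU=ON enables CUDA offload of the short-range non-bonded work,
    # - GMX_MPI=ON is only needed for multi-node runs (thread-MPI covers single nodes).
    mkdir build && cd build
    cmake .. \
      -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ \
      -DGMX_SIMD=AVX2_256 \
      -DGMX_GPU=ON \
      -DGMX_MPI=ON \
      -DGMX_BUILD_OWN_FFTW=ON
    make -j 8 && make install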
I'm not going into as much detail here, but there is a bit of overlap between these two talks; here I'm going to focus more on the hardware side and on actual benchmarks. So the basic question that I want to address is how to produce as much MD trajectory as possible for your science. How can that be achieved? Depending on whether you have access to compute resources or whether you want to buy a cluster, we can rephrase the question a bit: in one case you will be limited by the number of core hours that you have allocated on a supercomputer, for example, and in the other case you will be limited by the amount of money that you have. So we will try to answer these two questions here. Question one is how optimal performance can be obtained on a given cluster, and the second part of the talk is about what actually is the optimal hardware to run GROMACS on, if you start by building your own cluster.

Okay, let's start with question one: how can optimal GROMACS performance be obtained? Already before the simulation you lay the foundation of good performance, for example by choosing a recent compiler, by choosing the most recent SIMD instruction set that your CPU supports, and by choosing a good MPI library. Also when you set up your system you can already do a lot, for example by choosing virtual sites, which enable a far longer time step compared to the usual settings. When actually launching mdrun, the main benefits come from optimizing the parallel run settings: you will want to make sure that you reach a balanced computational load between the resources that you have, and you will want to keep any communication overhead small.

We have several settings that we can change when we start GROMACS, and to better understand what these things do and what we can do as users to speed up the computation, I want to quickly recap what happens in a GROMACS time step. We want to look at the parallel time step, but we start with a serial time step first. On the left-hand side you see a sketch of what happens during a time step, and we just look at the colored things right now, the blue and the orange parts. Most of the time GROMACS calculates Coulomb and van der Waals interactions, and usually with periodic boundary conditions we use the particle mesh Ewald (PME) method. PME decomposes these interactions into short-range and long-range contributions. The short-range part, which is blue here, can be efficiently calculated in real space, and at the same time, independently, the long-range part can be efficiently calculated in reciprocal space. For the reciprocal space you need the Fourier transformation of the charge density, and this is where the problems typically begin, because doing the Fourier transformation in parallel is quite communication intense. We will come back to that later. A nice feature of PME, though, is that it allows you to shift work between these two parts at the same, or comparable, numerical accuracy, simply by what you see at the lower left: if you make the cutoff larger and at the same time the grid a little bit coarser, scaling both by the same factor, you reach the same numerical accuracy as with a small cutoff and a very fine grid. This will come in quite handy when we use GPUs.
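To make the cutoff/grid trade-off concrete, here is a small illustrative sketch; the numbers are invented for illustration, and rcoulomb and fourier-spacing are the relevant .mdp parameters. At run time mdrun performs this kind of rescaling for you automatically unless you switch it off.

    # Illustration: two PME settings of comparable accuracy (.mdp values, in nm):
    #   set A: rcoulomb = 1.0, fourier-spacing = 0.12   (small cutoff, fine grid)
    #   set B: rcoulomb = 1.2, fourier-spacing = 0.144  (1.2x cutoff, 1.2x coarser grid)
    # Set B does more work in real space and less in the FFT part, which is exactly
    # what mdrun's automatic PME tuning exploits at run time.
    gmx mdrun -s topol.tpr -notunepme   # keep the cutoff/grid from the .tpr exactly as given
    gmx mdrun -s topol.tpr -tunepme     # (default) let mdrun rescale cutoff and grid at run time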
In this slide we see the parallel time step. GROMACS uses a combination of MPI and OpenMP to parallelize the system over the available processors. Here we see an example with three MPI ranks, and within each MPI rank there is a group of, in this example, four OpenMP threads that act on the data of that rank. How the work is distributed over the ranks depends on whether we look at the direct-space or the reciprocal-space contributions. For the direct-space part GROMACS uses domain decomposition, which simply means that your MD system is chopped up into X by Y by Z domains, and each of these domains is then assigned to an MPI rank which calculates all the interactions of that domain. In the reciprocal-space part GROMACS uses either slabs or pencils, and each of these slabs or pencils is then assigned to an MPI rank. Obviously there is some boundary volume that needs to be communicated so that all the interactions can be calculated. For the direct-space interactions that is not so much a problem: this is simply a small layer of atoms at the boundary of the domains, and the size of these boundary layers does not directly depend on how many domains you have. However, in the reciprocal-space part you need at some point to do a transpose of the grid from, let's say, a vertical to a horizontal layout, and in that process, if you have R ranks, R squared messages are involved. So while the PME calculation itself is known to scale as order N log N, where N is the number of atoms, in parallel the communication is usually the bottleneck. In the lower left plot we see an example for a small benchmark system, 80,000 atoms, distributed over more and more cores; this is the x-axis, and I plotted simply the time spent in MPI routines, in blue, during the whole benchmark. This gets larger and larger as we go to a larger number of cores, and this is mainly due to the time spent in the MPI all-to-all routine, the green triangles here. This MPI all-to-all routine is used exclusively during the FFT grid transpose in GROMACS. To alleviate this problem of R squared messages during the FFT a bit, GROMACS can offload the long-range part of the electrostatics to a subset of the MPI ranks, which is then typically on another set of nodes. Usually you need about a quarter of all ranks to calculate the long-range part of PME, and this then reduces the number of messages to be sent by a factor of 16. This is done by sending the charges and their positions over to a set of PME processors, here on the right-hand side in orange. They do PME, and near the end of the time step the long-range part of the forces is transferred back to the other set, the particle-particle (direct-space) processors, which then add this contribution to the calculated short-range forces.
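As an illustration of that separate-PME-rank setup (rank counts and file names are placeholders; -npme is the relevant mdrun option, and -npme -1, the default, lets mdrun estimate the split itself):

    # 64 MPI ranks in total, 16 of them (one quarter) dedicated to long-range PME,
    # the remaining 48 doing the short-range (particle-particle) work.
    mpirun -np 64 gmx_mpi mdrun -s topol.tpr -npme 16

    # With -npme -1 (the default), mdrun estimates a suitable split on its own.
    mpirun -np 64 gmx_mpi mdrun -s topol.tpr -npme -1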
Okay, what you can also do: if you have GPUs on your system, then GROMACS can offload the short-range non-bonded forces to the GPUs, and this works similarly to what we just saw in the last slide. So while the CPUs are doing PME and also the integration, updates and things like that, the GPUs calculate the short-range non-bonded forces. In order for this to work well, there are three load balancing mechanisms implemented in GROMACS, because if you have different sets of computations on different sets of hardware, you somehow need to balance the work. In the upper part here we see that the number of particle-particle (short-range) MPI ranks needs to be balanced against the number of PME ranks. In the middle we see that you somehow need to find the optimum between cutoff and grid spacing. And at the bottom you see that there is a load balancing mechanism that balances any uneven work between the direct-space domains. The first mechanism, the number of PME nodes, is assigned statically at the beginning of the simulation; GROMACS does that for you. The second load balancing mechanism is also evaluated at the beginning of the simulation, but then it stays fixed for the rest of the simulation. And dynamic load balancing, that's also why it's called dynamic, continuously adapts to any uneven load between the domains throughout the whole simulation.

The good news is that these automatic load balancing mechanisms work so well that if you have single nodes with one CPU and optionally one GPU, then in most cases the GROMACS automatic settings give you optimal performance if you use thread-MPI. So here I could already end this webinar: if you have just nodes like this, a single CPU and a single GPU, then in principle you don't need to hand-tune anything. However, today's nodes are usually more complex. You can have multi-socket CPU nodes, you can have one or more GPUs attached to them, and you could even have a cluster of such nodes. In these cases, manual tuning can quite drastically enhance the performance that you get. In the next part I will talk about what you can do: tips and tricks for performance tuning in GROMACS.

The most important tip, which I found out for myself and often tell to others: if you are in doubt whether you are getting good performance or not, make a benchmark. A benchmark just takes a few minutes, you can test different settings easily, and you will directly see from the results which of these settings are better and which are not so good. These are my pet benchmarks that we see here; I often use these. The left one is a typical MD system, I would say, about 80,000 atoms: a membrane channel embedded in a lipid membrane, surrounded by water and ions. The right one is a larger MD system, a ribosome in water, about 2 million atoms. This is a nice test system if you want to scale out to larger machines, for example.

Okay, let's start with the tips and tricks. The first tip is actually about how to get useful performance numbers from benchmarks in GROMACS. The automatic load balancing mechanisms that we talked about, the dynamic load balancing between the direct-space domains and the balancing between Coulomb cutoff and PME grid spacing, need some time to reach the optimum. That's why it's important to exclude these first time steps from the timing measurements. There are switches to mdrun for that, for example -resetstep or -resethway, with which you can simply exclude these first time steps, where the load is not yet balanced, from the benchmark measurements. On the right-hand side we see an example: the time it takes for a time step, relative to the first time step, and we see that for various settings this goes down with the time step number; here it takes about 200 time steps to reach some kind of equilibrium. However, this can take quite long. On a large parallel machine this could take 2,000 or even more time steps.
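A small benchmarking sketch along these lines (step numbers are arbitrary; -nsteps, -resethway, -resetstep and -noconfout are mdrun options, the last one just skips writing the final coordinates):

    # Short benchmark: 10,000 steps, timing counters reset halfway so that the
    # initial, not yet load-balanced steps are excluded from the reported performance.
    gmx mdrun -s bench.tpr -nsteps 10000 -resethway -noconfout

    # Alternatively, reset the counters at an explicit step:
    gmx mdrun -s bench.tpr -nsteps 10000 -resetstep 5000 -noconfout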
So it's a good idea to look at the md.log output file. There we will see lines like, for example, the first line in the black box, "DD step 39 load imb.: force 40.8%". That tells us that there is still load imbalance in the force calculation between the domains. If we look at the bottom of the page here, at step 5,000 or 10,000, this imbalance goes down to about 1%, which is pretty much the best you can reach; there you have a balanced force computation between the domains. In between we see lots of lines like "step 80: timed with PME grid ..., Coulomb cutoff ...", where you see that GROMACS adjusts the PME grid and the Coulomb cutoff for you, and at some point it will tell you: okay, I found the optimal PME grid, it is 96 by something, with this Coulomb cutoff. That's the point from which on the counters should be reset.

The second thing to do, if you have CPU-only nodes, is to optimize the number of PME nodes. GROMACS will estimate for you how many PME nodes it needs; for example, if you run on 16 nodes, or if you have 16 MPI processes, it will choose some of them for the long-range part of PME. This is based on the cutoff and grid settings. However, GROMACS cannot know your network, whether you run on Ethernet or InfiniBand, or how big the latency is. This is where the gmx tune_pme tool comes into play. It takes the estimate which GROMACS computes for you from the cutoff and grid settings and tries out settings around this value. Usually, if you have more than 8 MPI ranks, separate PME nodes will perform better. On the right-hand side we see how much you can typically improve upon the automatic settings when running the tune_pme tool: you get another 10 to 30% performance by tuning the number of PME nodes.
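A usage sketch for gmx tune_pme (this assumes an MPI-enabled build; the tool runs a series of short mdrun benchmarks with different numbers of separate PME ranks around the estimate and reports the fastest one; the rank and step counts are only examples, and the exact launch mechanism depends on your system):

    # tune_pme reads these environment variables to know how to start parallel runs.
    export MPIRUN=mpirun
    export MDRUN="gmx_mpi mdrun"

    # Try different PME-rank counts for a 16-rank run, timing 2,000 steps per trial.
    gmx tune_pme -np 16 -s topol.tpr -steps 2000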
Another thing to consider is the optimal mix of threads and ranks, especially if you have GPU nodes. Due to this heterogeneous parallelization based on MPI and OpenMP, work can be distributed in various ways. Pure OpenMP usually performs well on single sockets, but it does not scale so well across CPU sockets. This is the blue line in the graph on the right, for a 2-socket, 16-core node, and we see that OpenMP for this benchmark performs best as long as we stay on 8 or fewer cores. However, if we go to more cores, then a pure MPI parallelization is faster. So on multi-socket nodes usually pure MPI is best. You could also choose a combination of MPI and OpenMP, but this often adds additional overhead, so at least in this case, on CPU nodes, it is slower than either pure OpenMP or pure MPI. With GPUs attached to your CPUs it is slightly different, because for GPUs it is beneficial to have a few large domains that offload their data to the GPU. So generally, on single-socket nodes it is beneficial to use pure OpenMP, but if you have multi-socket CPU nodes with one or more GPUs it is not so clear, and there it pays off to find the optimum. That's what I did here for the two benchmark systems: we see the performance of the benchmark system on the y-axis, and on the x-axis different combinations of threads and ranks. This is a two-socket CPU node with two times 10 cores, so altogether 40 hyper-threads, and I used either 40 MPI ranks with one thread each, or 20 MPI ranks each with two threads, and so on. And we see in the lowermost black lines what I just mentioned: on CPU nodes pure MPI is fastest. If we add OpenMP threads, the performance goes slightly down towards the right. However, if we add 1, 2, 3 or 4 GPUs, the blue, green, red or light blue lines, then there is an optimum somewhere in the middle, at several threads per rank, often at about 4 to 5 threads per rank. And this can be 30% more than the standard settings that GROMACS would choose for you, so it really pays off to look for that optimum.

Here's an example of a cluster of such nodes, again two-socket nodes, with two GPUs on them. The black lines show the benchmarks on the CPUs only and the blue lines show the same with GPUs. We again see that in the CPU case we end up with either pure MPI or two OpenMP threads per rank, and in the case with GPUs we have a couple of OpenMP threads per rank; 2, 4 or 5 turns out to be optimal here. Hyper-threading, which is supported by modern Intel CPUs, is usually beneficial in the single-node case; it almost always gives you 10 or 15% extra performance, so it is useful to have it switched on. If you scale out your system to many nodes, this effect decreases with higher parallelization. In this example I have marked all the benchmarks that had the highest performance with hyper-threading with a yellow circle, and you see that up to about 16 nodes or so hyper-threading is beneficial, and later the fastest settings were found without hyper-threading. Generally, if you want to have a good parallel efficiency, it is good to have about 1,000 atoms or more per core. Here we see a parallel efficiency of about 71 or 73% at about 8 to 16 nodes, and while you can scale down to about 100 or even fewer atoms per core and still see a performance benefit, this comes at the cost of reduced parallel efficiency, so basically you are throwing away most of your cycles.

If you want to use separate PME nodes on GPU nodes, then doing that in a naive way would leave GPUs unused. Here's an example where you see 4 nodes: if you made the normal separation into particle-particle and PME nodes, the PME node, the orange one, would leave its GPU unused, simply because PME cannot be executed on GPUs. Usually it is not a good idea to leave resources unused, so what you can do here is assign, for example, half of the MPI ranks to PME, with the command line we see at the bottom here. You then end up with the same number of long-range (PME) and short-range ranks on each node, and on each node you are able to use the GPU, because there are short-range processes on every node. This is the better approach, and you can still balance the load between the PME and the PP processes. Here's an example with big nodes, 2 GPUs and 20 cores per node, so each core is a square here, and with the command-line options -ntomp and -ntomp_pme you can separately choose how many resources to allocate to the short-range and to the long-range part. So generally, even with these complex hardware setups, you can achieve optimal performance, or optimal settings, for GROMACS.
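A sketch of such settings for a node like the one in this example, two sockets with 10 cores each and 2 GPUs (all numbers are illustrative and worth re-benchmarking on your own hardware):

    # Single node, thread-MPI build: 8 ranks x 5 OpenMP threads (40 hardware threads),
    # short-range non-bonded work offloaded to the two GPUs, threads pinned to cores.
    gmx mdrun -s topol.tpr -ntmpi 8 -ntomp 5 -pin on

    # Several such nodes, MPI build: give half of the ranks to PME so that every node
    # keeps short-range (PP) ranks that can drive its GPU, and assign different
    # OpenMP thread counts to the PP and PME ranks.
    mpirun -np 16 gmx_mpi mdrun -s topol.tpr -npme 8 -ntomp 4 -ntomp_pme 2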
This slide shows the impact of the compiler. While there are newer compilers around, I just want to make the point that the performance can be quite different depending on which compiler you choose. In orange we see benchmarks on CPU nodes, and the white blocks show the same nodes with GPUs added. In the CPU case we see that, depending on whether you choose an early GCC, a later one, or ICC, the performance can vary by up to 20 or 25%. Generally, GCC from 4.7 on produces the fastest binaries with GROMACS; ICC compilers don't work as well on AMD hardware. However, when you look at the white entries, so with GPUs, the effect is not as drastic; the most pronounced effect is seen on CPU-only nodes.

Another tip to improve the throughput of your simulations is to use multi-simulations. This is especially useful if you run on GPU nodes. Due to the offloading approach, where the short-range non-bonded forces are offloaded to the GPU, the GPU is idle during part of the time step, about 15 to 40% of it. If you now run several replicas of the same system on your GPU node, for example with different starting velocities, then the individual replicas can run slightly out of sync, and while one replica doesn't use the GPU, the GPU can be used by the other replicas. In effect you get a drastically enhanced aggregate performance: obviously you get several trajectories, in this example four, which are each a bit shorter, but if you sum the trajectory lengths together you get a lot more than when running a single replica on this system (see the launch sketch at the end of this part). And the benefit comes not only from using the idle time on the GPU; you get additional benefits from the higher efficiency at lower parallelization, because each replica uses fewer cores on the CPU, for example. Here's an example of what you can achieve with that. We see the membrane benchmark again, run on a node with two 10-core processors and two GTX 980 GPUs. The lowest bar, the light blue one with a performance of about 26.8 nanoseconds per day, is without GPUs; the one above is with one 980 GPU, and this is with optimized settings, so OpenMP threads and ranks have already been manually optimized. But if you then run four replicas on the same node, you get an additional benefit of a factor of 1.4 to 1.5 on top of the already optimized settings. The three bars at the top, the black, dark blue and grey bars, show the same for the same node with two GPUs, and even there you get about the same benefit. So this really makes sense. And four replicas is not a lot, I mean you could use more; it's a good idea to use more than two replicas, but from four on you already see most of the benefit you can get.

The last tip I have here concerns MPI libraries; this is for when you run on larger compute clusters. I always found that there are usually several MPI libraries that you can use there, and typically they all perform differently, and often there is one that really performs well and others that have problems. Here I compare Intel MPI and IBM MPI, and you see that you don't get as good performance on many nodes with Intel MPI. This has nothing to do with Intel MPI itself; it's more that on this particular machine the configuration of Intel MPI was simply not optimal. You could spend lots of time as a user trying to tune all these settings, or you could simply try out the other MPI libraries, and this often saves you a lot of time; in this case the IBM MPI library showed nice results. So sometimes the problem is not even in the code, but in the MPI libraries or their settings.
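Returning to the multi-simulation tip above, a launch sketch (directory names are placeholders; each directory holds its own topol.tpr, for example the same system with different starting velocities, and the total rank count must be divisible by the number of directories):

    # Four replicas running side by side on one GPU node, 5 MPI ranks each,
    # so that the replicas can interleave their use of the GPU(s).
    mpirun -np 20 gmx_mpi mdrun -multidir run1 run2 run3 run4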
Well, to sum up question one: there are several things that you can do to get higher performance. These are virtual sites, using multi-simulations, optimizing threads per rank on GPU nodes, using the PME tuning tool, using a recent compiler, and using hyper-threading. On the right-hand side I plotted the typical trajectory gain that you can expect, and you should keep in mind that in most cases these gains are orthogonal: if you use virtual sites, this will give you a factor of 2, the compiler can give you an extra 20%, hyper-threading can give you an extra 10%, and so on. So this really pays off.

Okay, now let's come to question two: what is actually the optimal hardware to run GROMACS on? You might ask, okay, optimal, what do you mean by optimal? Optimal might mean the best performance-to-price ratio, it might mean the highest single-node performance that you can get, optimal might for you be the lowest time to solution or the lowest energy consumption, but there could also simply be requirements on rack space, so that you just have a limited amount of space to put your cluster in. Our goal in this investigation was cost efficiency of simulations: we wanted to maximize the trajectory that we get on a fixed budget. What we did is we determined the prices and the performance for about 50 hardware configurations. We tested them with the two MD systems I showed you, and altogether we tested about 12 CPU types and 13 GPU types in different combinations. The performance that is reported in the following slides is always the optimized performance: we always optimized the threads-to-ranks ratio, we optimized the number of PME nodes, and we used hyper-threading where beneficial.

Maybe a quick word on the GPUs that we used in the test nodes. These can be put into two categories. The upper category, in red, are the professional Tesla GPUs: they have a high double-precision throughput, they usually have a large memory, and they have ECC memory. The other GPUs, the GeForce GTX ones in the green part of the table, don't have ECC memory, they just have a good single-precision throughput, and typically, at least it was the case back then, the memory was smaller, but that's not the case anymore. However, GROMACS typically runs in mixed precision, so on the GPUs it uses single precision only anyway. I should also say that this investigation already started about two years ago; back then we could only use NVIDIA, CUDA-compatible GPUs, but since GROMACS 5.1 it is also possible to use AMD GPUs via OpenCL, so keep that in mind, these are also a good option.

So this is the result of our investigation. We see on the x-axis the simulation performance, this is for the smaller benchmark system, and on the y-axis the total cost for the hardware. We have several white circles here, these are CPU-only nodes, and some of them are connected with a dotted line to blue or red circles; the colored circles mean that we plugged a GPU into exactly that same node. If you plug a GPU into a CPU node, then the performance goes up and the price goes up as well, so you follow these dotted lines to the right and a bit towards the upper part of the plot. All the blue dots are nodes with consumer-class GPUs. By plugging a consumer-class GPU into a CPU-only node we typically see an increase in performance by a factor of 2 to 4. If instead you plug a professional Tesla GPU into a CPU node, for example here at the first red dot with the one in it, you mainly move along lines of constant performance to price:
the performance goes up, but at the same time the price goes up. So in general, if you want a good performance-to-price ratio, it is always a good idea to put a consumer-class GPU into your node, while professional Tesla GPUs only give you a higher performance but don't change your performance-to-price ratio.

This is the same plot we just saw, just a bit more complicated; don't be scared by all the details. Let's just look at the right half of the plot, this is the membrane benchmark; on the left we see the ribosome benchmark. These gray lines are lines of constant performance to price, so we want to get to the lower right part of the plot: we want a high performance and a low hardware cost. What I want to point out here is just that if we look at these nodes, all the circles that have a white fill are CPU nodes, and if we again move along the dotted lines to the right we add one or more GPUs. In this case here, an AMD node, we see that if we add the first GPU we get a lot better in performance and only slightly higher in hardware cost; however, if we add more GPUs the effect is not as pronounced. I should also point out that the highest performance-to-price ratios are reached here with, for example, Core i7-5820K nodes with consumer-class GPUs, so these are workstation CPUs paired with consumer-class GPUs.

Energy efficiency might also be a concern: if you have to pay for the energy yourself, then over a typical cluster lifetime of maybe five years or so the costs for the energy are more or less the same as the costs for the hardware. Here we see nodes with 0 up to 4 GPUs, and for a 5-year operation we assumed 20 euro cents per kilowatt hour, including cooling. We see that about half of the total costs come from the energy, and again the trajectories are cheapest if we have one or two GPUs in the node, and most expensive if we don't have any GPU in the node. This slide is a slightly different view of the same effect: we see the trajectory yield, the nanoseconds per 1,000 euro, for these nodes, now for the bigger benchmark. We see blue, green and violet categories; the upper ones compare the same CPU with 0, 1, 2, 3 or 4 GPUs, and the lower ones compare the same GPUs with different CPUs. Again, what we see here is that it is not good to plug in too many GPUs; the optimum is at about one or two GPUs, there you get the highest trajectory yield per invested euro. So clearly, balanced CPU and GPU resources are beneficial.

Okay, that already brings me to the conclusions of question two. Generally, if you add a GPU to a node, this yields a 2 up to 4 times increased node performance. Consumer-class GPUs generally increase the performance-to-price ratio of a node more than twofold; these are the blue dots in the right diagram. Adding more GPUs than CPU sockets usually yields diminishing returns, and the highest energy efficiency is reached for nodes with balanced CPU and GPU resources. I'd like to point out that this whole investigation is summarized in the paper on the lower right here, "Best bang for your buck". There are lots more details and tweaks in that paper and in the supplement, together with scripts for how you could do these benchmarks, and there is also a figure with GROMACS performance tweaks for getting high performance. Okay, with that we are at the question and answer session, and I'd like to thank you for your attention.

Thank you, Carsten.
This was a really interesting talk with a lot of information and a lot of good tips for the users. As a general comment, would you recommend that users initially do some proper benchmarking before they start large production runs on whatever systems they have access to? As we've seen, users can save a huge amount of computing time.

Right, so I mean, if you just run a single simulation of this type, maybe it doesn't matter whether you do benchmarking or not, you would just let GROMACS figure out the details. But if you run lots and lots of similar simulations, then it really pays off. You can easily get 20, 30, 40% extra performance by tuning, and then you will get a lot of trajectory back for your project. Especially if your group has a very large allocation on a very large HPC system, with millions of core hours, you should definitely do some benchmarking beforehand. Usually you will have to do that already if you want to get the compute time: you have to show that your code scales well, and then it really makes sense to put in the effort of doing a few benchmarks, and maybe you get more compute time if you can show that you scale well on the machine where you want to compute.

Yes, and this also holds for what you presented here, which is a very big variety of mixtures of CPUs and GPUs. Now we have Xeon Phis coming to the market, and more and more centers will offer them, so this will be another area for users to benchmark and see how the application performs.

That's true. We didn't consider these up to now because they are quite expensive, so I think if you buy the hardware yourself it is probably still the best bet to buy consumer-class GPUs for high node performance. I mean, if you have a cluster there with Xeon Phis, then of course, yeah, why not use it.

Well, we are nearing the end. We also have online another of the core developers, Szilárd, who was working with Carsten on this. I guess we had some problems with the audio, but Szilárd, is there something you would like to add, since you worked with Carsten on this so extensively?

I would first like to thank Carsten for the great talk. I don't think I have a lot to add, except that I would like to note that although this work has been done with slightly older versions, most of the conclusions still apply up to the current versions. Things may change in the future, because hardware is changing and the direction of our project is changing slightly, but a lot of the tips and tricks and general conclusions do apply. However, there have been quite a lot of performance improvements here and there, so as always it is worth upgrading and looking into the performance of newer releases rather than trying to stick to old ones. That is one of the things that I would highly recommend users do, to exploit and make use of the work that we do for every release. And last but not least, yes, as Carsten mentioned, AMD GPUs are also an option with our OpenCL implementation, and with the most recent release they are quite competitive both in price and performance, so it is worth considering, especially if you have a decent AMD GPU around, to try the performance of those.

Thank you, Szilárd. We have a question by Adam. Adam, can you hear us?

Hi, yes, thank you, it's Adam Carter here at EPCC. I was just going to ask: I think most of the results that you presented here were for an InfiniBand network, is that correct? And I just wondered whether you think any of the results would change qualitatively very much in the case of either slower networks, like Ethernet, or faster ones, if there is some specific interconnect you would recommend?
Well, it depends on what you want to do. Usually, if you just want to sample a lot and you don't have excessively large systems, and if you want to buy the hardware yourself, it's good not to even invest in InfiniBand, because this can cost you quite a bit, and it is always more efficient to run more simulations on just single nodes; the single-node performance has gone so amazingly high in the last years, due to GPU acceleration for example, that this is often possible. However, if you really need, let's say, one big trajectory of a big system, you definitely need many nodes for the computation, and for extreme scaling it is always good to have the best network that you can get. But this comes at an extremely increased price for the trajectories; it can be up to a factor of 10 or so that you pay more for your trajectory if you run near the scaling limit, or just if you scale out to many nodes, as compared to running on a single node. So what changes with a slower or faster network is mainly where the scaling curve reaches saturation. If you have InfiniBand you can go up to, let's say, 16 or 32 nodes with your system, and if you go to slower and higher-latency Ethernet, you can typically go up to 4 or 8 nodes or so. But I think the general picture remains the same, although I'm not sure whether today you are really wasting cycles if you run on Ethernet, at least if it is 10 gigabit Ethernet or so.

Okay, that's useful, thanks very much.

Yes, thank you. I would always compare the performance on one node to 2 or 4 or 8 nodes over Ethernet, and then you can really decide how much you are throwing away by going to, let's say, 4 or 8 nodes over Ethernet; usually that will be quite a bit.

Great, thanks. Thank you.

Well, with this we are nearing the end of the webinar. Can I have the next slide? I would like to remind everyone that the recording of today's webinar will be put on the website, where you can watch it again and reference a lot of the useful tips and tricks that Carsten shared. At ask.bioexcel.eu we have support forums and the interest group specific to performance tuning, which you are welcome to use to continue the discussion. I would also like you to know that in October we will have a hackathon in Barcelona; if you are developing GROMACS and would like to meet some of the core developers and do some work on your code, please join us. You can find information about this on our website. So thank you all for today, and we will continue our webinar series in the future; check our website for updates on that. Thanks Carsten, thanks Szilárd, and we'll see you again. Bye. Thank you, bye. Thanks, bye.