Good afternoon everybody, or good morning if you're on the other side of the world. Welcome to BioExcel webinar number 62. Today's topic is improvements in the GROMACS heterogeneous parallelization, and today's speaker is Szilárd Páll from the KTH Royal Institute of Technology, Sweden. I'm Alessandra Villa, and I'm hosting this webinar together with Stefan Far from the University of Edinburgh. Something about Szilárd: Szilárd is an HPC researcher at the PDC Center for High Performance Computing, located at the KTH Royal Institute of Technology in Stockholm, Sweden. He has a background in computer science, but also a lot of experience in computational biophysics, and he has worked with GPU acceleration for more than 10 years. He formulates parallel algorithms in molecular dynamics for modern processor architectures, and he is a co-author of the first heterogeneous CPU-GPU parallelization of GROMACS. Currently his focus is on task scheduling and strong scaling in molecular dynamics simulations on exascale heterogeneous architectures. I'm very happy that he's here, and now I give him the floor.

Hello everyone, and welcome to this BioExcel webinar, where I will be talking about the improvements in the heterogeneous parallelization in GROMACS. First I will give you a bit of background on why we parallelize molecular simulations and what is behind the scenes of the heterogeneous parallelization in the GROMACS molecular simulation engine. This will be followed by a description, with some details, of the most recent work we have put into this parallelization to improve the performance, robustness, and various features of the code.

First, a very brief introduction to GROMACS; because most of you probably know the code fairly well, I will not dive into too many details. GROMACS is a classical molecular dynamics code focused on performance and flexibility. It is free to use, open source, and has an open development scheme. One of the key things it is known for is its bottom-up performance optimizations, its broad support for very efficient algorithms, and a diverse set of parallelization schemes.

So why do we parallelize? Why do we need to parallelize at this day and age, and what is the role of parallelization in computational science and specifically in molecular simulations? This plot might be a little busy and complex, but I would like to draw your attention to two features of it. The black curve I'm pointing at shows the number of logical CPU cores over time, and most important is the green curve, which shows the frequency of microprocessors. As you can see, until around the mid-2000s the number of logical cores stayed constant while the frequency of processors, shown in green, steadily increased.
This frequency scaling was what provided the performance improvements in most computational codes, including molecular simulations, and researchers could just sit back, wait for the next generation of architectures, and focus on algorithms and methods to improve the performance of simulations. However, this scaling has stopped, and performance gains have since translated into an increase in the number of cores, and these cores can only be made use of through parallelization. That is why parallelization has been key since the mid-2000s, and it keeps increasing in importance, because parallelism is present at multiple levels, all the way from multiple nodes in a cluster to multiple CPUs in a node, combined these days very often with GPUs, and within a node we have cores, accelerators, and vector units. It is a complex landscape, and this landscape is changing fast. As you probably already know, exascale is here, and the major exascale challenges are the vast increase in parallelism, the increasing complexity in how this parallelism is presented by the hardware, and the increasing diversity in terms of architectures and heterogeneity, which we will talk more about.

In order to exploit this hardware parallelism, we use multiple levels of parallelization: we expose parallelism using algorithms and express that parallelism in the implementation in order to map it to each level of the hardware. For instance, as illustrated here, at the highest level we have an ensemble; each member of the ensemble may use domain decomposition, which is then further parallelized over CPUs, CPU cores, GPUs, and SIMD units using data parallelism. What is very important here is choosing the right granularity and abstraction, and that is where our work is focused; a small sketch of this multi-level mapping follows below.

What I would like to highlight here is a very important feature of heterogeneous hardware, where by heterogeneous I mean a combination of CPUs and GPUs, or CPUs and accelerators. Note that there has long been a wide variety of both CPUs and GPUs: large CPUs, large GPUs, often a large CPU in a server with a smaller GPU. In gray I represent the relative speed of the interconnect between the CPU and GPU, which reflects how fast data can be exchanged between these components. The same applies to servers: in the early to mid-2010s we had one CPU and one GPU, or one CPU and two GPUs, with a fairly fat link between them. How did this change later? As you might notice, the most prominent change I illustrate here is that the GPUs became big, but even more prominent is that these links between CPUs and GPUs became, relatively speaking, small. This is a hallmark of much of the hardware development over the last 10 years, and it has made it more difficult to make use of such hardware, because this link between CPU and GPU, typically a so-called PCI Express interconnect, has stagnated and has not increased in performance, while there has been a big leap in compute performance. Don't look too much at the details, but I want to show the big leap in raw performance on the top, and on the bottom one of our most important GROMACS kernels: over three to four GPU generations, the pair interaction kernel has increased in performance by nearly 10x.
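Before continuing with the hardware trends, here is a small sketch of the multi-level parallelism mapping mentioned a moment ago. This is not GROMACS code; the function and all names are illustrative, showing only how MPI ranks (one per domain), OpenMP threads (within a rank), and compiler SIMD vectorization (in the innermost loop) nest.

```cpp
// Minimal sketch (not GROMACS code; all names illustrative) of nested
// parallelism levels: MPI ranks for domains, OpenMP threads within a
// rank, and SIMD vectorization in the innermost loop.
#include <mpi.h>
#include <vector>

void scaleForces(std::vector<float>& f, float s)
{
    // Level 2: OpenMP threads over this domain's atoms.
    // Level 3: 'simd' asks the compiler to vectorize the loop body.
#pragma omp parallel for simd schedule(static)
    for (long i = 0; i < static_cast<long>(f.size()); i++)
    {
        f[i] *= s;
    }
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Level 1: one rank per domain

    std::vector<float> forces(100000, 1.0f); // this rank's domain data
    scaleForces(forces, 0.5f);

    MPI_Finalize();
    return 0;
}
```

Choosing at which of these levels to place each piece of work is exactly the granularity question mentioned above. Now, back to the hardware trends.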
This interconnect, the gray link I showed you, has not really improved much; its performance has stagnated for nearly eight years. In addition, the background of the previous picture has suddenly become more colorful, representing the increasing diversity in architectures: more and more vendors are showing up with GPUs and competing in this space. On top of that, HPC nodes, which many of you might be using or interfacing with, have also been increasing in complexity and come in various shapes and forms with multiple types of interconnects. What I would like to point out here, and which we will discuss a little more later, is how the blue blocks, the CPUs, and the GPUs are connected to each other: in compute nodes the GPUs tend to get connected directly between each other, exactly to avoid that thin black or gray link between the CPU and GPU and to be able to rely on fast communication between GPUs. This is also going to be reflected in the next-generation exascale architectures, like the AMD exascale architectures in LUMI and Frontier, or Intel's machines like Aurora: there will be a complex interconnect both inside the node and across nodes, and we need to get ready for this. We need to ready our applications and algorithms to be able to utilize these machines efficiently.

Next, I want to illustrate the molecular dynamics algorithm, the so-called MD step, which consists of a sequence, or rather a collection, of tasks. The tasks that need to be computed are shown as colorful boxes; integration and constraining finish the main iteration, we repeat that in an inner loop, and occasionally, every 50 to 200 iterations, we do a pair search and domain decomposition step. We want to do all of this as fast as possible. Why? Because we want to close the gap between our time step, which needs to be small, and real-world time scales. This is the main goal of parallelization in studying molecular systems: to tackle this time-scale, or sometimes length-scale, challenge, and it typically requires strong scaling and, increasingly, ensemble scaling.

To give you a bit of insight into where the computational costs are, here I show in a pie chart on the left the floating-point operations of the most computationally expensive tasks; you can see that the pair interactions dominate and PME is next. If we translate this into a wall-time breakdown, things change slightly; that is because some tasks perform and scale better than their flop counts would suggest, and others less so. But the picture is still very similar: the non-bonded tasks, PME, and the bonded tasks, that is, our force computations, take most of the computational time.

One important thing, especially when it comes to heterogeneous parallelization, is the structure of this algorithm: we do not need to compute these force tasks sequentially. We can compute them in essentially any order, either concurrently or sequentially, but once we have computed the forces we need a reduction step, where we converge and reduce the partial results to obtain the total forces, which can then be used for integration.
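To make this task structure concrete, here is a minimal sketch, not the GROMACS scheduler, with placeholder task names: the force tasks are independent and may run in any order or concurrently, and the force reduction is the single synchronization point before integration.

```cpp
// Minimal sketch (illustrative, not the GROMACS implementation) of the
// force-task structure: independent force tasks run concurrently; the
// reduction is the only synchronization point before integration.
#include <cstddef>
#include <future>
#include <vector>

using Forces = std::vector<float>;

Forces computeNonbonded() { return Forces(1000, 0.1f); } // placeholder task
Forces computeBonded()    { return Forces(1000, 0.2f); } // placeholder task
Forces computePme()       { return Forces(1000, 0.3f); } // placeholder task

int main()
{
    // Launch the independent force tasks; any order or concurrency is valid.
    auto nb  = std::async(std::launch::async, computeNonbonded);
    auto b   = std::async(std::launch::async, computeBonded);
    auto pme = std::async(std::launch::async, computePme);

    // Reduction: converge the partial forces into the total force.
    Forces f  = nb.get();
    Forces fb = b.get();
    Forces fp = pme.get();
    for (std::size_t i = 0; i < f.size(); i++)
    {
        f[i] += fb[i] + fp[i];
    }
    // ... integration and constraints would consume f here ...
    return 0;
}
```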
This independent task structure is an important feature for any heterogeneous code that tries to do task-based parallelization, because it allows doing the force computations in parallel.

A quick summary of the GROMACS parallelization before moving on to the concrete improvements we have made in recent years. GROMACS employs a multi-level hierarchical parallelization in order to target each level of hardware parallelism individually. Within a node it uses OpenMP multithreading, which allows combining a set of CPU cores to work alongside a GPU; GPUs are used through APIs, which we will talk about a little later; and there is also SIMD parallelization, which gives great performance on CPUs and uses library abstractions for portability. Across nodes we use MPI, with features like dynamic load balancing and task balancing, which improve scaling and performance.

One important thing I wanted to mention is that when we started working on bringing GROMACS onto GPU accelerators and heterogeneous architectures, an algorithm redesign was necessary. This algorithm redesign was what unlocked the world of GPU accelerators and allowed GROMACS to perform well, and it really had to come from a bottom-up rethinking of algorithms, all the way up to adjusting the domain decomposition and load-balancing schemes.

Another highlight is one of the interesting and exciting algorithms that came out of this. We faced the issue that when we offloaded to GPUs or parallelized over CPUs, we had to trade cost between the blue box and the green box: if we did the pair search, the outer loop, less often, we directly increased the cost of the non-bonded forces. To avoid this inconvenient trade-off we went back to the algorithms and used an accuracy-based approach to create the so-called dual pair list algorithm: the pair list and domain decomposition data generated during the blue step is kept for much longer, for 150 to 400 or even 500 iterations instead of around 20. Instead of this resulting in a very large non-bonded computational cost, we mitigate it with so-called dynamic pruning, a pruning step that can be done often. On the next slide I'll show why it can be done often: the way we laid out the algorithm, we create a large pair list for a large interaction sphere. If we computed interactions within that whole sphere we would be computing a lot of zeros, because there would be a lot of non-interacting atoms. Instead, we create a smaller inner list, shown by the dashed green line, and periodically we reduce the large list, built from the outer radius, to the inner list and compute forces with that. Because data access is optimized by our interaction algorithms and the pair search code, this list pruning is very cheap, it can be done very frequently, and it gives us this accuracy-based balancing between the different phases.

What I would like to spend the next part on is the evolution of the GPU hardware and the APIs supporting GROMACS: where we have come from and where we are now. The work started in late 2009 and early 2010, and it took quite some time to port the initial code to GPUs and tune the code for heterogeneous parallelization.
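As a brief aside before continuing with that history: the dynamic pruning idea of the dual pair list described above can be sketched minimally as follows. The names here are invented for illustration, and the real implementation prunes clusters of particles in an optimized data layout rather than individual pairs, which is part of what makes it so cheap.

```cpp
// Rough sketch of dynamic pruning in a dual pair-list scheme: a
// long-lived outer list built with radius rOuter is periodically pruned
// against the smaller rInner to produce the cheap inner list that the
// force kernel actually iterates over. Illustrative names throughout.
#include <vector>

struct Vec3 { float x, y, z; };
struct Pair { int i, j; };

static float distSq(const Vec3& a, const Vec3& b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Keep only the pairs currently within the inner radius. A simple
// streaming pass over well-laid-out data, so it is cheap to run often.
std::vector<Pair> prune(const std::vector<Pair>& outerList,
                        const std::vector<Vec3>& x, float rInner)
{
    std::vector<Pair> innerList;
    for (const Pair& p : outerList)
    {
        if (distSq(x[p.i], x[p.j]) < rInner * rInner)
        {
            innerList.push_back(p);
        }
    }
    return innerList;
}
```

Back to the porting history.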
Later on, OpenCL parallelization was added in order to improve the portability and, with that, the robustness of the code; this support was later extended across NVIDIA and Intel architectures. Further features were added throughout the years, improving communication and offloading more code. Most recently we have been working on new APIs and better support for core features and functionality. In this release we have further optimized the GPU direct communication, which I'll be talking about; we have integrated new distributed FFT back ends; and, quite excitingly, we have nearly full support in our SYCL back end for the GPU-resident loop, which will be the main means of targeting Intel and AMD architectures starting from the second half of this year.

To compare the current API support in GROMACS: on NVIDIA we have the very mature CUDA, which is the bread and butter of GPU programming, but it is NVIDIA-only and not portable. Therefore we have long invested in OpenCL, which is fairly mature and an open standard, but a bit awkward to use in modern C++ code. New development has therefore shifted to SYCL, with early support in the 2021 release, and as I mentioned, it will be the primary means of supporting the Intel and AMD architectures, planned from around Q2 to Q3. It is still under active development, not least because not all of the hardware is available yet.

Next, I would like to explain why we chose heterogeneous parallelization instead of a homogeneous parallelization where we forget about CPUs and only use GPUs. There were two main reasons: flexibility and performance. We wanted to maintain the versatility of GROMACS and keep supporting the majority of its features; doing a full port of the large code base, more than a million lines of source code with a very versatile feature set, is very difficult, nearly impossible. And additional performance can be obtained, as I'll show shortly, if we allow flexibility in the parallelization. We also wanted portability based on abstraction layers, which allow easy porting to new architectures, because we knew and anticipated that things would change fast: GPUs would be the main drivers of computation, but there would not be a stable foundation of APIs and programming models.

There are challenges in finding a good balance between flexibility and complexity: because we have fast compute code on the CPU side, the CPU is often worth using, but moving data between the different compute units is difficult, as I illustrated with the thin interconnect lines. So we have a complex set of problems to solve, and the way we tried to tackle this over the years is shown in this progression. At the top is the homogeneous scheme, where everything executes on the CPU. Then came an initial, so-called force-offload parallelization, where we gradually offloaded more and more of the force computation to make use of the faster and faster GPUs that were coming out during the mid- and late 2010s. However, we later had to shift to a so-called GPU-resident parallelization, for very good reasons that I will come to next. The challenge with the force offload is that we needed significant data movement.
To illustrate: the colored boxes are the force computations on the GPU, but then everything gets brought back to the CPU, and the black boxes illustrate the reduction where we sum up all the forces before integration. Now, as GPUs get faster and faster, Amdahl's law dictates that the relative cost of integration, during which the GPU is left idle, will keep increasing. And as I illustrated earlier, moving data between CPU and GPU has also been becoming increasingly costly relative to computation. These were becoming major bottlenecks, and this is why the GPU-resident mode was born: instead of trying to balance load between the CPU and the GPU, we prioritize keeping the GPU busy as much as possible. This moved a lot of work from the CPU onto the GPU and keeps the forces and coordinates on the GPU for as long as possible. This is of course a trade-off, because now we are leaving the CPU empty, but it also gives us an opportunity to obtain more performance by making use of the CPU, and it maintains our ability to support less common features which may not be offloaded to the GPU but may be very valuable to many researchers, like pulling or free energies. These can now be taken by the CPU and computed while the GPU is busy with the main part of the work, as shown here in the magenta box. Importantly, in most cases the CPU will be done well before the GPU needs that data, so the reduction and integration will depend mostly on GPU work, plus possibly a smaller amount of CPU work if there are features requiring CPU compute.

What this also allows is the next topic I would like to talk about, because our data now resides on the GPU; recall the topology and the complex interconnects on high-performance computing nodes that I mentioned earlier. But before that, let's take a quick look at what performance looks like in the different heterogeneous offload modes. Here the different colors represent offloading more and more computation to the GPU, where the black line is the CPU-only performance on a single node with a typical biomolecular system. Going from red to blue to light blue to green, the performance keeps increasing, because we have a fast GPU: the more we offload, the more performance we get. However, note that the x axis is the number of CPU cores, and this is where the flexibility of our heterogeneous parallelization comes into play: the dark yellow line crosses the green line. If we have a sufficient number of cores, from around three to four cores, maintaining the ability to execute code and do work on the CPU is a benefit, because we can get more performance by leaving the bonded interactions on the CPU. This is how we benefit from heterogeneous parallelization.

Now moving on to the so-called direct GPU communication, which is enabled primarily by the new architectures that have various interconnects directly between GPUs, bypassing the CPUs, to avoid the slow connections between CPU and GPU. We essentially have two flavors of this: one with our own thread-MPI and one with CUDA-aware library MPI. Most important is that both aim to do efficient communication between the different accelerators in the system.
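To make the library-MPI flavor concrete, here is a hedged sketch of what it boils down to: with a CUDA-aware MPI, device pointers can be handed directly to MPI calls, so coordinates produced by a GPU kernel can be exchanged GPU-to-GPU without staging through host memory. The function and buffer names are illustrative, not the GROMACS implementation.

```cpp
// Hedged sketch: direct GPU halo exchange with a CUDA-aware MPI.
// d_xSend / d_xRecv are device pointers; no host staging copies appear.
#include <mpi.h>
#include <cuda_runtime.h>

void haloExchange(const float* d_xSend, float* d_xRecv, int nHalo,
                  int sendTo, int recvFrom, MPI_Comm comm)
{
    // With library MPI we must first wait until the producing kernel has
    // finished, so the coordinates are actually ready on the device.
    cudaDeviceSynchronize();

    MPI_Request reqs[2];
    MPI_Irecv(d_xRecv, nHalo, MPI_FLOAT, recvFrom, 0, comm, &reqs[0]);
    MPI_Isend(d_xSend, nHalo, MPI_FLOAT, sendTo,   0, comm, &reqs[1]);

    // While the transfers proceed GPU-to-GPU (e.g. over NVLink or the
    // network), the CPU is free to do other work before this wait.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```

The thread-MPI flavor can instead enqueue peer-to-peer copies into GPU streams, avoiding even this wait on the CPU side.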
This work was carried out in an NVIDIA co-design project, and now I would like to highlight the main benefits of this direct GPU communication mode. What I show here is the node layout of a modern GPU system, where we have these fast green links between the GPUs. Before we had this functionality, when communicating data, for instance for halo exchange between GPU 0 and GPU 3, we had to first go to CPU 0, then to CPU 1, where the MPI rank that has GPU 3 associated with it resides, and then copy the data to GPU 3. This is called staged data movement; it goes through the slow gray links and completely ignores the fast links between the GPUs. By using the direct GPU communication features instead, and because our coordinate data is already on the GPU (we do the integration there, so the new coordinates are already in place), we can communicate directly between GPUs. To make use of special interconnects like NVLink, such fast links need to be present, but there is a benefit even without them: we no longer need to explicitly copy, in the application, to the CPU, then to the other CPU, then to the GPU, and with that block our CPU from being able to compute, for instance, free energy interactions. With the new direct GPU communication mode we can just tell the communication layer to move the data between GPU 0 and GPU 2, and then move on and start computing, for instance, free energy interactions. That can be a benefit even on systems which do not have these high-performance interconnects.

Where this becomes even more important is when we have multiple nodes; here I show one node on the left and half of a node on the right. The same staging happens: if we want to communicate from GPU 3 on the left node to GPU 2 on the right node, we need to first copy data to the CPU, then hop through the network to the network card of the other node, then to that node's CPU, and then to the GPU. Instead of having to make all these hops in the application, we can just tell the communication library to send this data from GPU 3 on this node to GPU 2 on the other node, and the communication will proceed. The hops will still happen, but without the involvement of the CPU and without the application coordinating the staging.

What this looks like in a timeline-based picture is the following; here, focus on the vertical lines, which represent data movement. This is the force-offload scheme: the colorful boxes are on the GPU, where we compute forces, but we are moving data back and forth, as you can see, and the black reduction box is on the CPU together with the integration and constraints. When we move the reduction and integration to the GPU with the GPU-resident mode, we manage to avoid some of the communication, or at least overlap it with computation. But the challenge remains that these dashed lines, corresponding to the dashed links in the previous cartoons, are the staged communication phases, and they are the bottleneck: the staged communication of the coordinates I'm pointing at will delay the computation of this magenta box, for instance free energy calculations.
Instead, our implementation can transfer the data directly between GPUs, as shown by these green lines, and with that avoid a lot of the communication; you can now see that many of the vertical lines are eliminated. This allows us to essentially encapsulate the entire inner loop, including communication, in some of the parallelization modes, and hand it over to the GPU runtime: for instance, tell the CUDA runtime to do 100 iterations, including communication and computation, and schedule them as efficiently as possible. This opens up the path to further improvements and optimizations, for instance using CUDA Graphs; for now we are not making use of these features, but it definitely unlocks the performance and enables further benefits along the way. Library MPI works a little differently: with MPI we still need to stop and wait for the data to be available on the GPU, and then communicate directly from our local GPU to the remote GPU. So there is still some involvement from the CPU, but the efficiency benefits are still there.

And what does this look like in practice, in terms of performance? On the left I show the performance of a reaction-field simulation, because PME simulations carry some more complexity; I will come to those next. As you can see, the blue line shows staged GPU communication. If we do direct GPU communication, the red line, we get better performance, but here we are doing direct GPU communication without residency. When we enable GPU-resident steps, that is, keep the forces and coordinates on the GPU, we unlock the full performance. The same pattern can be seen on the right panel with PME, but the gains there are somewhat more limited, because PME is costly, and also because PME, and the FFTs inside it, are very hard to scale in a distributed fashion. This is why we initially chose to run PME on a single dedicated GPU, and that is what you can see here: up to eight GPUs, having one of them dedicated to PME is sufficient, but at 16 GPUs it is not.

However, if we look forward to the latest performance features of GROMACS 2022, we can see the same system, a one-million-atom STMV benchmark: if we run it with reaction field only, without PME, we scale really well on the JUWELS Booster nodes. And if we further optimize locality and the mapping to the hardware, and I won't go into details here (if you use these machines, please read the documentation, because there are some tricky details in getting these things right), we get very good parallel efficiency, up to 50% strong-scaling efficiency at 12 nodes, peaking at more than 500 ns/day for a one-million-atom system on only 48 nodes. However, we don't have PME here, and we want PME. That was the next NVIDIA co-design project we worked on: PME decomposition, which required a distributed FFT library able to keep up with the rest of the code, or at least not slow things down, because FFTs are very hard to scale. The initial support for this in the 2022 release is somewhat limited because of the lack of availability of various libraries: what we support in the 2022 release is a hybrid mode, which uses the CPUs for the distributed FFT, and basic support for heFFTe, an FFT library from the US Exascale Computing Project.
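As an illustration of what using such a distributed FFT library looks like, here is a hedged sketch using heFFTe's FFTW (CPU) backend, in the spirit of the hybrid mode. The box layout is deliberately trivial (the whole grid on one rank), and the API details should be checked against the heFFTe documentation of the version you use rather than taken from this sketch.

```cpp
// Hedged sketch: distributed 3D real-to-complex FFT via heFFTe, with the
// FFTW backend (CPU FFTs, as in the hybrid PME decomposition mode).
// Each rank describes the brick of the global grid it owns.
#include <mpi.h>
#include <complex>
#include <vector>
#include "heffte.h"

void pmeLikeFft(MPI_Comm comm, int nx, int ny, int nz)
{
    heffte::box3d<> inbox({0, 0, 0}, {nx - 1, ny - 1, nz - 1});
    heffte::box3d<> outbox({0, 0, 0}, {nx - 1, ny - 1, nz / 2});

    // r2c along the z direction (index 2): that dimension shrinks to nz/2+1.
    heffte::fft3d_r2c<heffte::backend::fftw> fft(inbox, outbox, 2, comm);

    std::vector<float> realGrid(fft.size_inbox(), 1.0f);
    std::vector<std::complex<float>> cplxGrid(fft.size_outbox());

    fft.forward(realGrid.data(), cplxGrid.data());  // real -> reciprocal
    // ... PME would apply the influence function in reciprocal space ...
    fft.backward(cplxGrid.data(), realGrid.data()); // back-transform
}
```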
This still does not unlock the full capability of the code. In the current development version we have much improved performance, making use of cuFFTMp, a recently released distributed FFT library developed by NVIDIA. It was partially motivated by our needs, and we worked closely with NVIDIA to get good performance out of this library. As a preview of the performance of this feature: on an older type of DGX node, you can see in the dashed green line that PME performance is no longer limiting us at 16 GPUs. And if we take the same benchmark to the nodes I showed a little earlier, sorry, this is on the Selene cluster, we can see that we now scale up to 24 nodes, and with our optimized scheme, shown in the green curve, we get more than 200 ns/day. Further improvements are planned, but this is already fairly good, and we expect that the next-generation hardware, together with improvements in the communication libraries, will help to improve this further.

Finally, before closing this talk, I would like to mention work we have done on hierarchical ensemble parallelization. This is very important because, as most of you know, biomolecular systems are often limited in size: biological molecules are just of a given size, and we cannot really weak-scale the way other fields, like computational fluid dynamics, can obtain more science by using larger systems. However, methods like advanced sampling do allow exposing more parallelism along different dimensions, and essentially allow scaling using a coupled set of otherwise independent parallel simulations. Here I show how a typical such run looks: in the innermost loop we do our regular MD sampling steps; occasionally we calculate a bias, for instance with the AWH (accelerated weight histogram) method; and then, depending on the frequency requirements of the problem, we occasionally do a bias-sharing step. In this setup we can parallelize better, because we only need to tackle the strong-scaling parallelization challenge for the innermost gray loop, while the outermost blue dashed loop can scale across many nodes without being hampered by strong-scaling limits.

How does this look in practice? It means we can use multiple walkers in a coupled ensemble, and we can get even more than linear scaling; in this case we show superlinear scaling with multiple walkers, which can offset the limitations of strong scaling when the problem allows it. For instance, on the right side I show time to solution against the number of GPUs, and as you can see we get very good scaling up to 16 and even 32 walkers, both with one GPU per walker and with four GPUs per walker, thanks to the good parallel efficiency across the multiple-walker simulations. The dashed lines show the parallel efficiency with multiple walkers, which is 85 to 90% even at 32 walkers on a noisy production machine, CSC's Puhti. On the left side I show the performance per AWH walker: even for such a small system, in this case a channel system with 90,000 atoms, we get more than a factor of two performance improvement from one to four GPUs, so that is pretty decent even in terms of strong scaling.
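To make the hierarchical loop structure concrete, here is a minimal hedged sketch of the nesting. The communicator layout, update frequencies, and bias representation are all invented for illustration; the real AWH bias sharing is implemented inside GROMACS.

```cpp
// Hedged sketch of hierarchical ensemble parallelization: each walker is
// an (internally parallel) simulation; only the innermost loop needs
// strong scaling, while bias sharing across walkers is infrequent.
// All names and frequencies are illustrative.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    const int stepsPerBiasUpdate = 500; // inner MD steps per bias update
    const int updatesPerSharing  = 10;  // bias updates per sharing step
    std::vector<double> bias(128, 0.0); // this walker's bias grid

    for (int update = 0; update < 1000; update++)
    {
        for (int step = 0; step < stepsPerBiasUpdate; step++)
        {
            // ... innermost (gray) loop: regular MD steps; this is the
            //     only part that needs strong scaling ...
        }
        // ... update this walker's bias from its accumulated samples ...

        if (update % updatesPerSharing == 0)
        {
            // Outermost (blue dashed) loop: infrequent bias sharing
            // across all walkers, so it scales to many nodes cheaply.
            MPI_Allreduce(MPI_IN_PLACE, bias.data(),
                          static_cast<int>(bias.size()),
                          MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```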
In addition, a feature that we added is flexible bias sharing, where not all simulations need to be coupled: different subsets of the simulations can exchange biases while others proceed completely independently. This allows more efficient scheduling, and less work in setting up simulations, because one can just use GROMACS's multidir functionality.

In closing, I would like to acknowledge the work of colleagues and collaborators; a lot of this work would not have been possible without their contributions. Artem Zhmurov and Andrey Alekseenko have contributed a lot to the GPU code, and in addition I would like to thank Alan Gray and Gaurav Garg from NVIDIA, with whom we have collaborated on the co-design projects. And with that, I will close here, and we'll take questions.

Alright, thanks for the talk, it was really interesting and detailed. Now we have the Q&A, so if you have any questions you can put them in the Q&A section, which you'll see at the bottom of your Zoom window. So, if I begin: we have a question from Arno, which I'll read out. "I think Szilárd's slide on GPU direct communication between nodes showed comms going via CPU host memory and via InfiniBand. If NVLink or NVSwitch connections between nodes are available, will direct GPU communications, as they have been implemented, make use of those instead of going via InfiniBand?"

The answer to that, and I'll go back to that slide quickly, is yes, primarily because when it comes to inter-node communication we chose to rely on the implementations in MPI. There is exploratory work ongoing to use more efficient libraries for communication, but in the case of NVSwitch, if we just swap out this yellow box, which is InfiniBand here, with an NVSwitch, we will be able to communicate through those NVSwitches equally efficiently, and actually more efficiently, because the overhead of going from this GPU to that GPU will be lower and the bandwidth will be higher. And it will be transparent, because we rely on MPI.

Thanks. A question from Valeria: are you planning to release tabulated potentials with GPU acceleration? I am not familiar with the features planned for the upcoming releases, so I cannot comment on that, but I am confident that when such a feature gets implemented we will strongly consider GPU acceleration.

Then we have a question from Ali, which says: thanks for a nice talk; is the CUDA toolkit necessary to install for compiling GROMACS with a graphics card? To answer that question I'll quickly go back to this table. The answer is no, it is not; that is why these two additional columns are here. Both of the last columns are viable solutions to target NVIDIA GPUs, in particular the middle column, OpenCL. However, if you need performance, I would strongly recommend using NVIDIA CUDA, because in terms of performance it is the best approach.

So I have a question for you: in terms of system size, you reach a limit where you can't really use more GPUs if you need to compute PME. So is there a system size where you're better off just using a pure CPU system because of the better strong scaling? Yes, that is true at the moment.
To answer that question in a general way: in general it does not need to be the case, and I will try to explain that quickly with this slide. When it comes to CPUs, most CPU systems do not have any of this additional complexity; it's just the CPU going through the network to the other CPU. On current machines there is always an added complexity in terms of hardware, which has a certain amount of overhead, and an added complexity in terms of the communication libraries, which need to handle it. The combination of the two adds an inherent overhead to strong-scaling molecular simulations. What will come with the next generation of hardware is better integration, and an increasing level of integration will remove these additional boxes and additional complexities in hardware. We can only hope that the software will keep up and MPI will allow better scaling from GPU to GPU. As I said, there is simply a very hard limit to how far you can parallelize FFTs: distributed FFTs will not scale very well past a couple of nodes and a couple of CPUs or GPUs. Therefore the main solution, I think, is looking into different algorithms; for instance, the fast multipole method could allow better scaling on exascale architectures.

Okay, yes, we have a question from Vedran: what is the performance of SYCL on modern AMD GPUs? Is it comparable to CUDA on modern NVIDIA GPUs? Thanks for the question; that is indeed an important one. One caveat before I comment on the performance: things are changing fast, and therefore improvements to many of these components are expected. As I said, currently our SYCL support is stable but preliminary. However, the performance on a single GPU is quite good, I would say maybe 20 to 30% lower at worst compared to the native code, and we are working on improving that. I see no strong reason why we could not get nearly on-par performance by the time the AMD machines arrive. And let me emphasize that our main target for now is the big pre-exascale machines; next, we will look at commodity hardware and gaming GPUs when those get enabled in the software stack. So we should be able to get performance on par with the native code when it comes to a single GPU. Going beyond that, I think a lot will depend on the maturity and performance of the communication libraries and runtimes, and there I expect that some gap will remain: NVIDIA's libraries and the communication support in MPI implementations are far better. When it comes to strong scaling, even though our algorithms and implementation are highly portable, I expect that when these machines arrive and we compare, for instance, SYCL with MPI to our CUDA-aware-MPI-based implementation, there will be a significant difference.

So I think that might be it for questions, unless there are any more. I'll leave it to Alessandra to tell you about our next webinar. The next BioExcel webinar will be on one of the use cases, and one of the tools, that we have in BioExcel. It is by Vytautas Gapsys, and it will be on GROMACS and pmx for large-scale alchemical protein-ligand binding affinity screening. It will be on the 21st of April, which is a Thursday; please note that it is a Thursday, not a Tuesday as we usually have the webinar. Okay.
And the following webinar, at the end of June, will also be on a use case dealt with inside BioExcel. So I thank all the attendees for attending and for asking questions, and I thank you, Szilárd, very much for joining us and giving a webinar in the BioExcel webinar series. Thank you everybody, and I hope to see you next time. Thank you. Thank you.