Thank you so much, and sorry for the few minutes' delay. My name is Julieta Inga, and today I'm going to show you how you can do multiphysics simulations using preconditioners, and how preconditioners can speed up your simulations. I divided this talk into three parts: I'm going to talk about what STEP means and how we are "minding" it, what preconditioners are, the multiphysics simulations I'm working on, and some HPC results. Finally, I'll close with some conclusions. If somebody has questions, it's okay to interrupt me. I will start by asking: you can hear me right now, right? Yep, we can hear you. Okay.

So, what do I mean by minding STEP? I'm presenting here an image of our STEP project at UKAEA. I am currently working at UKAEA, which is the UK Atomic Energy Authority, and STEP stands for Spherical Tokamak for Energy Production. This is an ambitious programme because it challenges professionals: you have to be involved in science as well as technology and engineering, and you also need some maths background in depth, linear algebra for example; in my case I had to revisit many concepts to do these simulations. The STEP team aims to generate electricity from fusion by 2040. The image shows a mock-up, and I will share my slides with the link where they explain what STEP means and what its components are. As you can see, there are several components here; it's a huge machine. So we need to understand how these large-scale infrastructures work, and we have to orchestrate each component with the others. We need scalable multiphysics solutions and solvers, and we need to integrate and orchestrate these computational simulations. How can an engineer know how to run these simulations on a computer? How many resources do we need? How many cores? How many minutes or hours do we have to wait for each simulation? And, like everyone in the world, we want faster results; we cannot wait two or three days to see how a simulation went. So my team is working right now on how to speed up these simulations.

How are we minding this goal for STEP? Lately we have been using preconditioners to speed up these simulations. I'm working with basic multiphysics, because to have fusion you need complex physics with many phenomena, and I'm also mindful of the HPC resources. So I am combining these three concepts using a tool called MOOSE. MOOSE stands for Multiphysics Object-Oriented Simulation Environment. It's an open-source project based on the finite element method for solving PDEs: it does the discretisation and meshing using libraries such as libMesh, and it does the solves using PETSc, another mathematical library. MOOSE is very user friendly: you don't need to worry about how to develop the code for the physics, because they provide kernels already. They have developed standard equations such as heat conduction, and they have also been including kernels for fluids; many kernels come pre-built.
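For context, the kind of PDE behind a heat conduction kernel is the standard transient heat equation (shown here only as a reminder; the exact form of MOOSE's kernel may include extra terms):

$$\rho c_p \frac{\partial T}{\partial t} = \nabla \cdot \left( k \nabla T \right) + q$$

where $T$ is temperature, $k$ the thermal conductivity, $\rho$ the density, $c_p$ the specific heat and $q$ a volumetric heat source. Discretising equations like this with finite elements is what produces the large linear systems discussed next.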
You can see more about training on MOOSE on its website; I am putting the link here. Oh, okay, I think this is a previous version of my slides; I will try to upload the latest version, sorry. But what I wanted to show is Andy's work; he is my boss. He was doing simulations for another UK project called Chimera, and he is combining multiphysics: thermal conduction in steel materials coupled with displacement. He did something very impressive: he ran that simulation on HPC using around 500 cores.

But okay, let's go back to my work. When we talk about preconditioners, I cannot see your faces, so I don't know how many of you are familiar with this concept. When you discretise PDEs with the finite element method, you end up with matrices, and the problem becomes a linear system. You can solve a linear system of equations with Gaussian elimination; that's what we all learn at school. That is fine when you have two variables, x and y, and a small number of equations, but what happens when we have millions of equations? It becomes a nightmare for the computation; it's very expensive if we only use a simple computer. This is part of the history of numerical mathematics: first we solved equations with direct methods, and then iterative methods were created. I don't know if you are aware of these methods, things like LU (lower-upper) factorisation and the relaxation methods such as successive over-relaxation; there were several iterative methods. And then, because we wanted to do three-dimensional simulations with many variables to solve, we had to use preconditioners, because some iterative methods could not converge: they sometimes diverge, or they are very slow.

In the image on the left you can see several iterations; the preconditioner helps by, in this case, flattening those ellipses, so that in just a few iterations you can have a converged solution. What the literature says is that the problem may not be well conditioned: it depends on the nature of the matrix, on whether your matrix is well-conditioned or not. So, in a one-line summary, preconditioners help the iterative method reach a solution. Over the last thirty years there have been many preconditioners in the literature, so I am putting another link where you can read more about them: what the condition number is, and how to know whether a matrix is well-conditioned or not. You need some notion of eigenvalues, how to compute them, and how to compute the spectral radius, in order to get a good preconditioned solution. That is the theory behind it, but as I told you, MOOSE already has this developed, so if you are an engineer who wants to run these kinds of simulations you don't have to worry; I will show you how to do the setup in a few minutes. I'm just showing you small graphs of what is behind this.

I'm also talking about Krylov subspace methods combined with preconditioners. Other studies indicate that preconditioners work better when used with Krylov subspace methods, so combining preconditioners with Krylov subspace solvers is a crucial combination these days.
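To make the idea concrete (standard textbook notation, not taken from the slides): a left preconditioner $M \approx A$ is applied so that the Krylov method iterates on a better-conditioned operator, and the condition number measures how difficult the original matrix is:

$$M^{-1} A x = M^{-1} b, \qquad \kappa(A) = \lVert A \rVert \, \lVert A^{-1} \rVert,$$

and for a symmetric positive definite matrix $\kappa_2(A) = \lambda_{\max}/\lambda_{\min}$, the ratio of the extreme eigenvalues. A smaller $\kappa(M^{-1}A)$ is what makes those ellipses flatter and the iterations converge faster.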
This also motivates the search for more effective preconditioners. There is a doctoral dissertation that compares preconditioners and KSP types, but that study is more mathematical, while mine is more computational; that is the focus I am giving to this study. He got some findings. Don't worry if you don't recognise these letters: GMRES stands for Generalised Minimal RESidual, and it belongs to the list of Krylov subspace methods. What he found is which combinations work better together, for example with Hypre BoomerAMG, and each preconditioner also has several options inside. I have not found another project where people are doing this around the world. So, starting from those findings, here is what I did; this work has taken me around two years. I worked with 42 preconditioners, testing them first in combination with these Krylov subspace methods; this is the list.

My first multiphysics simulation is very simple: I started with a two-dimensional steady-state heat conduction problem. I had a steel plate of one metre on each side, so it's a square, and I established some boundary conditions. From the bottom to the top I fixed the temperature; I'm sorry that I didn't put the units on the slide, but the temperatures are in Kelvin. I start from 298 K and increase the temperature up to 373 K. On the sides, from right to left, the heat flux varies from the steady value up to 3000. In the image you can see time step zero, the initial state, with this temperature in the steel plate; at the end we don't quite reach the full value I set, but we get a temperature around this number, and you can see the heat flux is working as well. So MOOSE is working.

But how did I measure the HPC resources? I am working on HPC clusters, currently the Cambridge Service for Data Driven Discovery, which we call the CSD3 cluster. It has 56 cores per node, and each node has 192 gigabytes of memory; you can see more detail at this link. For this particular problem, the 2D steady-state heat conduction, I tested three meshes: a small mesh, a medium mesh and a large one. Why? Because my purpose was to increase the degrees of freedom, to stress the problem and see how many degrees of freedom it can handle. The setup of this first heat conduction problem was: first-order Lagrange shape functions, QUAD4 elements, and the results I'm going to present were solved using Newton. The parallelisation was done with MPI through MOOSE. At the beginning I was working with the Main Stage; I'm filtering the outputs with this line because I thought, okay, I'm going to measure the execution time, how long the whole simulation takes. I did 10 runs per option, and for this heat conduction case the measured time also includes the creation of the mesh.
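To give a rough idea of what this kind of solver setup looks like in code, here is a minimal petsc4py sketch of a Krylov solve (GMRES) preconditioned with Hypre BoomerAMG, which is roughly what MOOSE asks PETSc to do underneath. This is an illustration only, not the MOOSE input from the talk; it assumes a PETSc build with Hypre, and the matrix is just a toy stand-in for the finite element system.

```python
# Illustrative serial toy example (petsc4py), not the MOOSE input from the talk.
from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n], nnz=3)   # tridiagonal, Laplacian-like stand-in
for i in range(n):
    A[i, i] = 2.0
    if i > 0:
        A[i, i - 1] = -1.0
    if i < n - 1:
        A[i, i + 1] = -1.0
A.assemble()

b = A.createVecLeft(); b.set(1.0)          # right-hand side
x = A.createVecRight()                     # solution vector

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType(PETSc.KSP.Type.GMRES)          # Krylov subspace method
ksp.getPC().setType(PETSc.PC.Type.HYPRE)   # Hypre preconditioner (BoomerAMG by default)
ksp.setFromOptions()
ksp.solve(b, x)
print("iterations:", ksp.getIterationNumber())
```

In MOOSE itself these choices are made through the input file and PETSc options rather than code like this; the sketch only shows which pieces are being combined.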
I didn't use any mesh split option. These are the results. I was testing 3,360 combinations of options; I mean, for Hypre BoomerAMG as one preconditioner, and for the other preconditioners: ASM, ILU, GAMG, Jacobi, block Jacobi, and LU. What about the others, since I was testing 42? I'm just plotting the best ones. For this small mesh you can see that Hypre BoomerAMG, ILU and LU show strong scalability, and in most of the cases the preconditioners reach their lowest time using 32 MPI tasks. These are the options I'm using. And if you look at the red line, that is Jacobi, which is not efficient at all; this was the first option I got, and I said: this is not good, this is not parallel efficiency. I plotted all the results, but for this conference I'm only going to show you what I got from Hypre BoomerAMG, which reached the lowest time among them all. I hope everyone is understanding me so far; I cannot see faces.

That is what I already said, so what I'm going to show you now is about BoomerAMG. I said, okay, the lowest was Hypre BoomerAMG, but the black curve is the default, with no options. Was I able to reduce this time? Yes, I was able to do it by using 32 MPI tasks and other options. And this curve here is not good; you see it goes up, which is quite strange behaviour. That is the CGS option as the KSP, a Krylov subspace method with preconditioning on the left. So what if I just take those out? I want to keep only the best solutions, the ones that scale well, and I end up with this.

I have these results, but in my study I don't only care about which one is faster; I also have to look at the memory usage, how many iterations are taken, and the final residual value, which has to be close to zero. These are the results. Somebody in the audience might be very curious and say: but we are talking about 30 seconds, not even minutes; how can you do a comparison and show me differences of 30 seconds? So I decided to increase the mesh. Now I'm working with the medium mesh to see if there really is a difference between the performance of these Hypre BoomerAMG options. With this larger mesh, these are the sizes, and with the same Hypre BoomerAMG options, these are the results. In this case we are no longer at 32 MPI tasks; we have to use around 112 MPI tasks. But they are still very close, we are talking about less than five seconds; this is not rocket science. So what did I do? I decided to study only the best cases, the ones below the default configuration of Hypre BoomerAMG, and to see what is going on in terms of the time, the memory, the residual values and the iterations taken. In both cases we see differences of about five and eight between the KSP types, the conjugate gradient and FGMRES: the conjugate gradient takes less memory in comparison to FGMRES. But still, I have to emphasise, we are talking about differences of just a few seconds. So I decided to increase again, and I am now working with a larger mesh; these are the elements in the mesh I'm working with.
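The sweep over combinations can be pictured like this: loop over KSP / preconditioner pairs and record the solve time, the iteration count and the final residual for each. The sketch below is hypothetical, not the actual test harness from the study; the combination list and the toy matrix are examples, and the Hypre entry again assumes a PETSc build with Hypre.

```python
# Hypothetical sketch of the comparison loop: time each KSP/preconditioner pair
# on the same system and record iterations and final residual.
import time
from petsc4py import PETSc

def toy_system(n=200):
    """Tridiagonal stand-in for the assembled finite element matrix."""
    A = PETSc.Mat().createAIJ([n, n], nnz=3)
    for i in range(n):
        A[i, i] = 2.0
        if i > 0:
            A[i, i - 1] = -1.0
        if i < n - 1:
            A[i, i + 1] = -1.0
    A.assemble()
    b = A.createVecLeft(); b.set(1.0)
    return A, b

combos = [("gmres", "hypre"), ("cg", "hypre"), ("gmres", "ilu"), ("cg", "jacobi")]

for ksp_type, pc_type in combos:
    A, b = toy_system()
    x = A.createVecRight()
    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType(ksp_type)
    ksp.getPC().setType(pc_type)
    t0 = time.perf_counter()
    ksp.solve(b, x)
    elapsed = time.perf_counter() - t0
    print(f"{ksp_type:5s} + {pc_type:6s}: {elapsed:.4f} s, "
          f"{ksp.getIterationNumber()} its, residual {ksp.getResidualNorm():.2e}")
```

In the actual study each option was run 10 times on CSD3 and the memory use was recorded alongside the timings.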
And now you can see quite a difference in seconds. This is the default again, and you can see the worst one is the green one, which is the CGS option. So, okay, now it is very noticeable: you cannot really work with the CGS Krylov subspace method, because in many cases you are watching bad behaviour. So I'm just going to keep these, which belong to the conjugate gradient and FGMRES. Then I wondered: should I precondition on the right, or maybe on the left and right at the same time, and what about using NEWTONTR? You can see that NEWTONTR is not a good option; I will show you now. I decided to show you the best and the worst cases of these options. The best ones don't include NEWTONTR, and this time what I'm using is the Hypre BoomerAMG P_max option, one of the BoomerAMG interpolation settings, and this option really improves things, by more than 34 seconds here. So it's good. In the end I said: I'm doing my job well, but I don't feel I'm using the HPC resources very well, because I'm not halving the time while I'm doubling the HPC resources; we are talking about differences of only a few seconds.

I want to stop a little here, because I want to say something. I was using the full resources; this is 448 cores. Each node has 56 cores, as I mentioned, so multiply by the number of nodes and you get these figures. I was using all the cores on every node: I didn't do any distribution of the cores, and I didn't consider how the memory impacts the execution time. I will show you in two more minutes how distributing the cores improves this.

But what about the other preconditioners? I showed you a lot of charts for Hypre BoomerAMG, but what about LU, or GAMG, another multigrid preconditioner? These are the results I also got for the medium mesh, and you can see Hypre BoomerAMG is still one of the best in comparison to Jacobi, the successive over-relaxation methods, or any of the others. When I test this with the large mesh, again I can see Hypre BoomerAMG and GAMG; both are multigrid. And I am not plotting ILU or LU: in that case, even when I tried to test with 56 cores on only one node, I crashed the nodes on my cluster. I was running the wrong kind of job there. That's one lesson learned for me, and I think it could be useful in case you are doing some simulations on HPC. This is the number of degrees of freedom for the large mesh, and these are the computational resources you can run it with, but it is not efficient.

For the next problem, I'm not talking about 2D any more: this is a displacement problem on a steel box in 3D with these dimensions. I set a basic Dirichlet boundary condition on both sides, in Z and in X, sorry, and the displacement happens in Y: I apply a displacement of minus one centimetre, 0.01, in Y. You can see here that the simulation works: at time zero there is no movement, and the displacement appears by the time the simulation ends. These are the material properties, the physical setup for this simulation; my part is the computational side. So what did I do? This time, I split the mesh for this steel box.
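For the options discussed here, the preconditioning side and the BoomerAMG P_max setting can be expressed through PETSc's options database. Below is a short, configuration-only sketch; the option names are PETSc's, the values are just examples, and in the study these were set through MOOSE rather than Python.

```python
# Hedged configuration sketch: conjugate gradient + Hypre BoomerAMG,
# preconditioned on the left, with the BoomerAMG P_max option set to 2.
from petsc4py import PETSc

opts = PETSc.Options()
opts.setValue("pc_hypre_type", "boomeramg")      # use BoomerAMG within Hypre
opts.setValue("pc_hypre_boomeramg_P_max", 2)     # interpolation truncation option

ksp = PETSc.KSP().create()
ksp.setType(PETSc.KSP.Type.CG)                   # conjugate gradient
ksp.getPC().setType(PETSc.PC.Type.HYPRE)
ksp.setPCSide(PETSc.PC.Side.LEFT)                # precondition from the left
ksp.setFromOptions()                             # pick up the options set above
# ... then ksp.setOperators(A); ksp.solve(b, x) as in the earlier sketches
```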
I used the element type HEX8, with first-order Lagrange shape functions as well. This is the number of degrees of freedom I'm handling now: more than 121 million. I am only focusing on the Hypre BoomerAMG preconditioner for this part; I am not testing every KSP type, only the top five you can see here: GMRES, flexible GMRES, another GMRES variant, and the conjugate gradient and biconjugate gradient stabilised solvers. I am also including the case of using no KSP type at all. The solve method I'm using for this 3D displacement problem is Newton, the boundary conditions are Dirichlet, and the parallelisation is again done with MPI. The only thing I'm changing here is how I compare the times: I am not including the mesh creation or anything else; I learned how to do that comparison, and I'm only using the finite element problem solve time. This time I'm comparing how the preconditioner works on the finite element solve itself, so I think this is a fair comparison.

Now, instead of filling each node completely, I start by requesting just 10 nodes with all their cores; you remember, 56 multiplied by 10, so that is the number of cores. But I wanted to know what happens if I don't use the full 56 cores per node. You cannot always divide the task count evenly, so I tried different distributions, requesting different numbers of nodes: 10, then 14, then 12 and 18. What do you think the answer is? This is an HPC strategy we came up with while working last year, to confront the memory-bound behaviour we showed before for the thermal 2D problem. It is also useful for predicting HPC resources for future simulations that will handle millions of degrees of freedom as well. So far, are there any questions? Justin, do you see questions in the chat? Nope, there's no questions so far. You have about five minutes left. Oh, okay, thank you, I'm sorry. I will speed up my presentation, we can precondition my presentation as well, and go to the answers.

I will just show you the answers I got with 10 nodes and 20 nodes, and with 14 nodes and 28 nodes. This time you can see that the time is not halving, but it is quite close to the ideal behaviour, so this is good; this is something I was expecting to happen. We were able to improve: you see I'm using 140 MPI tasks here and here as well, but if I double the number of nodes I can achieve a lower time, around 200 seconds less. So this is good. And these are the results I got for each KSP with Hypre BoomerAMG: the flexible GMRES, the GMRES and the conjugate gradients. Roughly, I can tell you that GMRES didn't solve because of memory, it needs more memory, so I don't recommend it here; the conjugate gradient is good. And you can see as well that if I use the same number of MPI tasks but a different distribution, this time we improve by 300 seconds, which is very good. In the end, I think this is the answer of my study: the best combination is Hypre BoomerAMG with the conjugate gradient. We had a small standard deviation, which guarantees homogeneous samples if you want to repeat this experiment, so it is replicable.
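The node-distribution idea is simple arithmetic: keep the number of MPI tasks fixed and spread them over more nodes, so each task gets more memory. Here is a hypothetical back-of-the-envelope helper; the task total and node counts are examples, not the exact figures from my runs, and it assumes CSD3-like nodes with 56 cores and 192 GB each.

```python
# Hypothetical helper: fixed total MPI tasks, varying number of nodes.
CORES_PER_NODE = 56
MEM_PER_NODE_GB = 192

def distribution(total_tasks, nodes):
    """Return tasks per node and approximate memory available per task."""
    tasks_per_node, rem = divmod(total_tasks, nodes)
    assert rem == 0, "tasks should divide evenly across nodes"
    assert tasks_per_node <= CORES_PER_NODE, "too many tasks per node"
    return tasks_per_node, MEM_PER_NODE_GB / tasks_per_node

for nodes in (10, 14, 20, 28):
    tpn, mem = distribution(560, nodes)
    print(f"{nodes:2d} nodes -> {tpn:2d} tasks/node, ~{mem:.1f} GB per task")
```

Under-populating the nodes in this way is what relieved the memory-bound behaviour seen in the earlier runs.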
Okay, this is another approach we tried later this year: what if, instead of changing the number of nodes, we keep one fixed number of MPI tasks and just change the distribution, using 14 nodes, 18 nodes, 21 nodes? These are the numbers of nodes and the distributions I used, and you can see the time is decreasing. I'm only using the conjugate gradient and the FGMRES here, but I am including the P_max 2 option, and it was faster, so I want to emphasise that this option is very good. In general, I can say that the best option for this 3D displacement problem, the linear elasticity problem, is again Hypre BoomerAMG with the conjugate gradient, preconditioned on the left and using the P_max 2 option.

And the conclusions: the best combination of preconditioner and KSP type for the heat conduction problem was Hypre BoomerAMG with flexible GMRES. For the 3D problem, the linear elasticity problem, I found that the conjugate gradient in combination with Hypre BoomerAMG, again with the P_max 2 option, helps to achieve a faster execution time compared to using Hypre BoomerAMG with its defaults. I want to emphasise again that the distribution of cores across different nodes matters: it helped to overcome the memory-bound issues. I would also suggest, if you are doing similar studies, to use the mesh splitting options, and to use the FE solve time for the comparisons. And that's another lesson learned about the output: if you include the Exodus output, because you have to produce the images, it will affect the FE solve time.

These are the references I used for this study. And I want to give a really big thank you to these organisations, Ginon, Fedora, IVA, Oak Ridge, USA, and the universities I was involved with, because all of this path to becoming a scientist is a long one: you need to know maths, science, some physics, statistics. You can follow me on Twitter if you are interested in helping me understand why, when I use more nodes, the times keep decreasing; I want to do some profiling work in the following months, and to combine these two physics, the thermal 2D and the displacement. Thank you so much. Have a good day.