Let's start this session. Today we have Pietro Bonfa, who will give a lecture about Quantum ESPRESSO on high performance computing and GPU systems: parallelization and hybrid architectures. As usual, keep your questions for the end of the talk. When Pietro finishes speaking, I think we will have room for one or two questions on voice, so you can raise your hand and I will let you speak. You can also write on Slack as usual, and we will relay your questions from the streaming. So I think we can start. Please, Pietro. Thank you, Ivan. And thanks for this opportunity. I'm very happy to be here and spend this couple of hours, slightly more, telling you something about how Quantum ESPRESSO has been developed over the years in order to run efficiently on HPC and hybrid systems. The topic that I will cover today is certainly quite different from the ones that you saw in the previous lectures. And I believe that most of you may not be so familiar with HPC or hybrid systems, so I will give a very short introduction to how you execute programs in these environments. I will try to keep the technical details to the strict minimum, but I believe that understanding a little bit how Quantum ESPRESSO really exploits multiple cores, multiple nodes or accelerators allows you to predict the time that your simulation will take and, more importantly, whether you are exploiting your resources correctly. First of all, let me start with a kind of joke, or let's say a game: basically, why you should bother. A few days ago I had an idea, I was performing a simulation, I tried with a simple toy model and I did what you probably did in the last days: I configured Quantum ESPRESSO, I needed PW, I built it on my laptop without caring much about how Quantum ESPRESSO was compiled and executed. Well, it took me this time to run my simple example. Then, in preparing this lecture, I created an optimized version of Quantum ESPRESSO, using the optimal parameters. Now, I would be a little bit arrogant if I said that after this lecture I will have covered all the details; this is probably not the case, but at least I hope that with the tricks and the details that I will share with you today, you will be able to improve the time to solution. And this is what I got on my laptop. You may say, well, why should I bother? Why should I spend these four hours this morning to gain, let's say, a minute or so? And I would agree with you: there are probably better ways to spend the time if this were the case. But as it usually happens, or at least when I'm lucky, sometimes you find an interesting result. And you want to find out if you can apply it to more interesting cases, something cutting edge: say a magnetic system with long-range magnetic order, or a structure with a large supercell and distortions. It could be many things. And in general you end up with your laptop not being powerful enough to run your simulation, so you end up on an HPC system or an HPC node. And so I did the same: I built my Quantum ESPRESSO very simply by doing ./configure and make pw. And on this system, with PW compiled like this, it took this amount of time. If instead you optimize the compilation and the way Quantum ESPRESSO can exploit both many-core systems (here there are 48 cores) and accelerators, this is the amount that you can gain.
And so I think it's pretty substantial, especially if a deadline is approaching. Now, let me finish quickly with this game and go to the real topic of this lecture. The point is that Quantum ESPRESSO has been developed over the years to really exploit at best both HPC systems and hybrid systems. And just to give you another idea, going outside science for a moment, I'm quoting here an IT website, very well done, where they are using Quantum ESPRESSO to benchmark various CPUs, meaning that Quantum ESPRESSO is really able to exploit the CPU efficiently, at best, and you can use it to benchmark the CPU. So this is basically my task for today: how to run efficiently on these systems, showing you what the parameters are and which configuration options you should be careful about. But before doing so, I want to explain why and how these options impact the performance. And to do so, I need to introduce very briefly a few concepts, really a few concepts, about parallel programming. Then I will move to how these concepts are implemented in Quantum ESPRESSO. And in the last part of this talk, I will describe accelerated systems. Now, the way you run Quantum ESPRESSO on an accelerated system is quite different from the way you use it on a standard or conventional HPC system, so I will focus the attention on this particular issue in a separate section of the presentation. But of course, in general, you will need to optimize the execution both on the accelerator and on the CPU. So this is something that we will cover in the final part of this presentation and probably also in the hands-on. Okay, so first step. Let me start with a disclaimer: I will not give a general description of parallel computing. I will only introduce the concepts that are useful to understand how Quantum ESPRESSO exploits multiple workers. In our case, we will be starting with multiple cores. But I will focus on the way Quantum ESPRESSO really implements parallel computing techniques and not provide you with a general overview. Parallel programs can be written in very different ways, and I will therefore only describe these concepts in light of the way they have been introduced in Quantum ESPRESSO. So, first comes a very general statement, which is Amdahl's law, and which is quite bad news for us. Let me go through this quickly, because it's a very simple concept, but it's useful to start with some bad news. Think of a task that takes a time t to run. And let me give you an example, because I will use this example later on when we're dealing with something more difficult. Our task could be, for example, adding the numbers in a file. A very simple task: I have my file with a list of numbers, and the task is to sum all these numbers. There are two things that I have to do: first, read the file; next, do the sum. If I have a single worker, say a single core, this task may take the time t. Now, I want to do it faster. How can I improve the performance? Well, there is a portion of this task that can take advantage of multiple workers. But there is also a portion that doesn't really take advantage of having multiple cores, in our case. Because when you read a file, in a simplified picture (but not too simplified, close to reality), having more cores is not really an advantage.
You just have to wait for the hard drive to access the data and put it in memory. So that portion doesn't really benefit from having multiple workers. Instead, if you need to sum these numbers, you can easily think of an algorithm that splits this list, for example in two if you have two workers, assigns the first set of data to the first worker and the second set to the second worker, and hopefully, in this way, this portion of our original task will take half the time. So it will become S times faster. S is the speedup of the portion of our original task that can benefit from parallel execution. Now, let's go back and think how long our original task will take if we use two workers instead of one. Well, that is very simple. There will be a portion of our task that cannot take advantage of having multiple workers, and another portion (the original time was t, so p times t) that will be S times faster, so its time will be divided by S. In our previous example, it will just be divided by two. And of course, this is an ideal situation where you can make this portion of the task arbitrarily faster: the more workers you add, the faster it becomes, linearly. Clearly, this is an approximation: if you have, say, six numbers and seven workers, the last one will have nothing to sum. But let's stay with this approximation for the time being. Now, this is quite a concern for us, because it means that the speedup of your original task, the ratio between the time it took when you had one worker and the time it takes when you put more workers into play, will basically be defined by the portion of time that you cannot accelerate. And this doesn't seem a problem at the beginning, but think of this: you have your program and you've been very good, you have changed it in such a way that 95% of its time can exploit multiple workers in an ideal way, so the parallel portion can be made arbitrarily fast. Even if you do so, the maximum speedup you will obtain, the ratio between the original time and the time it takes once you have made the parallel portion work ideally well, is 1/(1 - 0.95) = 20. So you are really just paying the price of having 5% of your program not working in parallel. And this is very bad news, because it tells us that we will have to be extremely good and work out parallel algorithms that work extremely well for basically all the tasks in our program. Moreover, we have been discussing the case where the parallel tasks can benefit from any number of workers, but in general this will not be the real situation, and some tasks will start to enter the portion that cannot be further accelerated once you add more workers. So this is the situation we are dealing with. And therefore, our task now is to organize basically all the work that Quantum ESPRESSO has to perform into algorithms that can be executed by many workers in parallel. Now, how is Quantum ESPRESSO doing this? Quantum ESPRESSO mainly uses the message passing approach. Let me spend a couple of words about this. What is message passing? You have probably seen this command before: mpirun. mpirun, in the example that I'm showing here, starts a program three times.
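As a minimal sketch, a launch line like the one on the slide looks more or less like this (the input file name scf.in is just a placeholder, not from the slides):

    # start three copies of pw.x; mpirun gives each copy a label (a rank)
    # and the possibility of exchanging messages with the other copies
    mpirun -np 3 pw.x -in scf.in > scf.out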
So instead of PW I could put whatever other program I want here; I could put, for example, hostname. What would happen if I ran this command? It would print the host name of the machine I'm working on three times, nothing more. But instead, what happens when you start pw.x this way? Well, starting the application is not the only thing that message passing provides. It does two other things. It gives a label to each of my executables, so each of these programs will have a label that it knows, and it gives them the possibility of sending messages. That is basically all it does. Because all our PW processes, that's how they are called, will be independent from each other. They will be separated by what I'm showing here as a wall. They will not know about the other instances that are running unless they send messages. Messages are just data being sent; they could be whatever you want. For example, you can imagine that Mr. Yellow here, the yellow PW, is reading the input and sending it to the other PW instances. But you could also have Mr. Yellow here sending data to Mr. Green. And of course, the larger the message, the more time it will take to be delivered; that's very easy to guess. And if you think about what I was saying before, you need algorithms that can exploit multiple workers in parallel very well. In these days, you have experienced working in parallel, in separated environments: you have a job to do, you do a Zoom meeting, for example, and you may spend time doing many Zoom meetings, but the work doesn't get done. Communicating is a way to organize the work, but it doesn't get the job done. It's additional time that is generally useful to organize the work, but it can delay the execution of the real job that you have to do. And as I said, there is more than just plain messages. You also have the option to synchronize your executions, which otherwise proceed freely: each one will proceed in time freely until it reaches some point where we want to synchronize. And the last concept, which is not really useful for us, but I have to mention it because it appears in the output of Quantum ESPRESSO, so you know what it means: you also have the possibility to create groups. Now, groups add nothing really new from our point of view: you have labels already, so you could just say that my group is processes zero and one and that would be enough. But in order to simplify the organization, in general you create groups that can exchange messages. And this is just a way to easily identify the workers that are assigned to a given task. Now, of course, workers need resources to run. And this is how it goes: basically, in a multi-core system, you assign each process to a core. And this is actually done by mpirun. If there are enough cores, you shouldn't be too worried, at least for the time being, about which core is selected and assigned to each PW process. But all your processes will also allocate some memory. And the point that I want to make here is that these memory spaces are requested by and assigned to a given process, and cannot be seen by the others. So if, for example, Mr. Yellow here wants to know something that Mr. Green has, well, he has to send a message and ask for it.
And the message, the data, will be moved from Mr. Green's memory through this message to Mr. Yellow here, who will store it in his own portion of memory that he has previously allocated. Okay. And now probably another question. The first time I saw this, I thought: but this is absurd, this makes no sense. Why should I think of these objects as separate entities that have to exchange messages? I have a system which is made of many cores and all of them are attached to the memory. So why should I create these funny, separated processes that can only talk with messages? Well, that's of course the case when you are dealing with a single node, but in general you're running Quantum ESPRESSO on many nodes. And with this approach, the only thing that you have to guarantee is that your workers can communicate. So the only thing that you have to guarantee is that there is a channel allowing communication between the multiple nodes where your MPI processes are placed. And basically, this is the reason why Quantum ESPRESSO was originally developed using message passing. In this way, you can place your processes anywhere on a large HPC system, and the only thing that you have to provide is a communication channel. And clearly, you gain if the communication channel is fast. In general, having all the processes on a single node allows for fast communication, because you just have to move the data from one portion of memory to another, while across nodes you would probably have a network device doing the communication. But no matter what channel of communication you have, you can run multiple processes spread over different nodes in your system. And this is how Quantum ESPRESSO has been structured for working in parallel on a series of tasks. Now, let me move to a slightly more advanced topic. Of course, you may still think that this makes no sense and that you really want to exploit the fact that you have many cores on a single node. For this reason, a different approach has also been implemented, and this is the approach called OpenMP. This is a totally different idea. In this case you start, for example, one process, the PW application here. It will do some work: for example, here it will run alone and allocate some memory, let me draw it here. Then it will eventually arrive at a point where it has some work to do and it decides to exploit multiple workers. The number that you specify with this command (this should not be a capital D, that's PowerPoint messing up) is the number of workers that you want to be created when you want to work in parallel on a given task. So you will have three of these so-called threads working in parallel on a given task. And the real difference here is that the memory allocated previously can now be accessed by all our three workers. No messages need to be sent. And clearly, this makes sense. But of course, you are limited by the fact that all your workers, all your cores, need to be able to access the same memory. And this is not true when you have cores spread over different computers, over different nodes.
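To make this concrete, a hybrid launch combining the two approaches could look roughly like this (a sketch only; the process and thread counts and the input file name scf.in are illustrative):

    # 3 MPI processes, each allowed to spawn 3 OpenMP threads:
    # the job needs 3 x 3 = 9 cores in total to run without oversubscription
    export OMP_NUM_THREADS=3
    mpirun -np 3 pw.x -in scf.in > scf.out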
Well, the only additional point that I want to make clear is that, even if you start with a single process, so even if at the beginning you need only a single core to do the work, when you reach the parallel region where the job needs to get done and you create these three workers, each of them must have some resources to run, that is, a CPU core to run on. And if you don't do so, if you just provide a single core, these threads will be competing for resources on the single core they have been assigned to. And this will make the job even slower, because not only will they all be working, in part, on the same core, they will also spend time deciding which one gets to run. So never forget to assign the required amount of compute power when you enable OpenMP. This seems rather trivial, but it's a mistake that I've seen quite often, because what you will be doing in the hands-on in the next hours is running with both MPI and OpenMP. And therefore, you need to grant enough resources for the OpenMP threads that may be created during the execution of each MPI process, for all the MPI processes. So if you go back to the slide you had before, now you need three cores for each process, and a total of nine cores in the example that I'm showing here. This is a bit complicated probably, but I hope that I made myself clear on this important point. Now let me go back and simplify, because as I told you, Quantum ESPRESSO has really been organized thinking of messages going around. So for the time being, let me focus only on a number of PW processes, and I will mainly talk about PW in the next slides. So let's focus on this program that solves the Kohn-Sham problem and gives you the ground-state electron density and the total energy. Let's think of this situation here: we have a number of PW executables that can send messages around to organize the job and possibly also share data, well, not really share, send data. So the question now is: we want to get the job done. Remember the example that we were discussing before, the sum of a list of numbers. What did we do? The thing that we did was to identify a way to split our initial set of data and give it to the workers that had to do the job all together in parallel. And that's indeed what we want to do now. We want to start focusing on the data rather than on the task, and think how we can give some data to our workers in order to get the job done. What is the job this time? The job is solving the Kohn-Sham equations, but eventually what we want to compute is this, and I stole this slide from Paolo, who gave a great introduction to what I want to say today. Hopefully this will be a little bit more technical than what he said, but I hope I'll make the examples easy enough for you to understand. So, what I really want to compute is this: the density. You have seen the details of how to compute and symmetrize it, but let me just say that this is the density, and this is the whole amount of information that I want to compute over time and eventually obtain. It will also give me the total energy. So let's think of this in a different way, with the idea of splitting this data and giving it to multiple workers that can work on it at the same time. Let's look at how you build this. You build it by considering a number of k-points in the first Brillouin zone.
Actually, if you have symmetry, you consider the irreducible Brillouin zone, so a portion of it, and you will have to solve the Kohn-Sham problem for a number of k-points. And of course, you will also need to solve it for a number of bands. In the case of the density, you need the valence bands, but in general you will have a certain number of bands to consider. And finally, what you need, as is evident here, is the wave function. And the wave function itself is defined as an expansion in plane waves. So here you have a lot of data that you are storing somewhere to define the wave function. These three sums that appear here basically identify the whole amount of data that, sooner or later, you have to deal with when you want to compute the density. So let's visualize this and think of how we can split this data and give it to multiple workers dealing with the Kohn-Sham problem at the same time. Now, the first idea that may come to your mind is that, well, you've seen Bloch's theorem, you know that you can solve the Kohn-Sham Hamiltonian, in our case with plane waves and periodic boundary conditions, at different k-points independently. So you think: I have a smart idea. I will cut this whole set of information, oops, let's use a different color, like this: I will cut this set of information along these lines and have my, let's say, four workers (well, it's three here) work on different k-points at the same time. They will solve the Kohn-Sham equations at different k-points at the same time. And this makes sense, and indeed we will come back to this. But think about it twice. You are going to a supercomputer, a system that has a lot of compute power, so you are probably dealing with a material which has many electrons or a large number of atoms, and therefore probably a large unit cell. This means that the reciprocal space is instead quite small: the larger the real space, the smaller the reciprocal space becomes. So the number of k-points that you will be dealing with becomes smaller and smaller, and some workers will eventually have nothing to do. You will eventually end up using only the Gamma point, so there is no work to distribute at all. So okay, let's forget about this idea for the time being and look in another direction. There is another direction that is probably interesting: the number of Kohn-Sham states. And well, again, let me put this symbol; I will probably come back to this at the end or during the hands-on. But let's make it simple: the number of Kohn-Sham states, on a system that you may run on a single node, is of the order of 100 or something of that order of magnitude. And the number of cores that you have on a single node today is of the same order of magnitude. So don't worry, we won't waste time on this now. You can immediately see that the really large dimension is this one: the number of coefficients of the expansion in plane waves that you have for your wave function. And that's indeed what guarantees that you have enough data for each of your workers to do some portion of the job. And that's indeed what Quantum ESPRESSO does by default. How it goes is more or less like this: you have your four workers, in this case working on the first k-point, let me write here k1. They will work together on the first k-point, and each of them will have a fraction of the wave function.
Once they are done with this k-point, so with all the bands, all the Kohn-Sham states, they will move to the second k-point, let me call it k2. And again, they will all work together, each having a fraction of the wave function. So the first worker will have this fraction here, the second worker will have this fraction here, but they will be working together, eventually synchronize, move to the next k-point, and so on. Now, this, as I said, is what Quantum ESPRESSO does by default, and it works pretty well, because this is the large dimension of the total amount of information that we want to collect. But there is a problem with this: it leads to quite a lot of communication. You have seen that Quantum ESPRESSO is built using message passing, and a lot of messages are sent with this approach. I will come back to this later and try to explain why this is the case, but for the time being let me just tell you that this works great, but we have a problem with the number of messages, which grows fast. So let's go back to our original idea. Our original idea was to split over the k-points. How will this go? Well, we will have four k-points, like I'm showing here, up to k3 and k4, and we will have our four workers, each dealing with entire wave functions, working at the same time on the Kohn-Sham problem at different k-points. So all my four workers will have, at the same time, an entire wave function, and will be working at the same time on the Kohn-Sham problem at these four k-points. Now, for a reason that, again, I will explain in a second, this reduces the communication. But there are two problems with this. One is immediately apparent: you now have the four workers working at the same time on the four k-points, and each of them will need to allocate an entire wave function. What we said before is that our four workers were all acting on the first k-point, and they each had a fraction of the wave function. So the whole amount of memory that was allocated in the first case was this fraction plus this fraction plus this fraction plus this fraction: four fractions making up one entire wave function. In this case, you have an entire wave function on the worker that is working on k-point one, another entire wave function on the worker that is working on k-point two, and so on. So clearly, much more memory is being used in this case. And this can be a problem. Generally you have enough memory to exploit this, but you may also run out of memory and be unable to load all these wave functions at the same time. This is something that we will investigate further during the hands-on. But there is one additional problem here. Think of the case where the total number of k-points is, say, six, and you have your usual four workers. How do you do it in the scheme that I'm showing here? Well, it's simple: you assign two k-points to the first worker, two k-points to the second worker, one k-point to the third, and one to the fourth. And when they're done, you will have the entire knowledge: they will have computed all the k-points, all the bands, so you will have the entire knowledge you were seeking before. But the problem here is that these two guys here will each take care of one k-point first, and these two as well.
But then these two, when they are done with the first k-point, have nothing to do, while the other two still have an additional k-point to do. So we have half of your processes sitting idle, while the other half are still working on the Kohn-Sham equations at a given k-point. This is something that goes under the name of load imbalance. And load imbalance is not good, it's bad, and you should be careful about it, because leaving some processes idle clearly doesn't get the job done. So what you end up doing, and what you will be doing in your exercises, is just trying to find the best compromise: working with both options, getting the job done by solving the problem at multiple k-points in parallel, but also splitting the wave function among multiple workers. And you will have to find the compromise that reduces communication, memory usage and load imbalance. This is, in general, far from trivial. There are some strategies to tackle this problem, and you can actually easily find a good starting point, but finding really the best time to solution is in general not so trivial. Now give me a second. Okay. Now I want to attempt something difficult, and that is to explain, or try to explain, the problem with communication. This is quite an advanced topic, so if I fail to make you understand this point, just keep in mind that with the k-point parallelism you reduce the communication, and otherwise, if you don't use it, there are quite a lot of messages going around. But let me try to explain why this happens. Again, I'm stealing a slide from Professor Giannozzi's lecture on day two, and I hope he doesn't mind. This way you can also go back and revisit my starting point. My starting point is the dual-space technique that was introduced quite a few days ago: the idea that you can compute the application of the Hamiltonian on the wave function by working, for each operation, in the space that is best for it. So reciprocal space for the kinetic part and, for example, real space for the local part of the Kohn-Sham potential. And you have also seen how to do this quickly with FFTs on this grid. Now let me quickly estimate how many times you are doing this, basically how many FFTs you are doing. Let's say you start from reciprocal space and you want to compute this term here, which is really one of the most time-consuming parts of the task that I'm showing here. How many FFTs will you be doing? Well, you will have to do it for a number of Kohn-Sham states, times a number of k-points, and you will be doing it for many iterations, so I can add here the number of iterations. If you do this product, this is of the order of, let's say, 100, this is of the order of 10, and if you are lucky this is also of the order of 10, and this is already 10 to the power of 4. To be honest, there is also a factor 2, because you are going from G-space to real space and back to G-space when you want to add all these terms. And to be honest, there are also other factors here. But even if you forget about those, we are still doing really a lot of FFTs. So let's think about how these FFTs can be done if we have the wave function split across the memory of different workers, different MPI processes. Now, to simplify the picture: you've seen, again on day two, how you can write the density as an expansion in plane waves.
And let me talk about this. The concept is exactly the same if I talk about the plane-wave expansion of a Bloch state, but let me talk about the density because it's easier to visualize, certainly easier to visualize: it's a real quantity and you can clearly imagine it in real space. What I'm showing here is real space, discretized on a grid. This is something you heard about probably on the second day. So you have your density in real space, and this is your data now. This picture here has nothing to do with the scheme that I was showing before. This is now, let me say, the density, but it could also be the plane-wave expansion of the wave function; I just think it's easier to visualize by thinking of a density, a real quantity on a grid, something that you can really see. And how do you distribute this data among many workers? Well, you certainly have a smart idea. You have a set of points, a grid. Suppose you have many points in these planes, but four planes, and you also have four workers. So the idea is that, again, the worker, the MPI process that I now call Mr. Green, will have the first plane of our grid of four planes. Mr. Blue will have the second plane, Mr. Red will have the third plane, and Mr. Yellow here will have the fourth plane. So you're done: you have distributed this real-space data among your workers, your processes. And now you ask them to do the FFT. They send a message and share the fact that it's time to do the FFT. This is a very small message: just do the FFT, guys. So they start, and they all start with the x direction. They will all compute the x direction together, and then they will move to the second direction here. And this is something they can do on their own; they don't have to exchange messages. They do the FFT; I hope you know that if you want to do an FFT in three dimensions, you can do it with 1D FFTs along the three directions. So they have all individually done the FFT along x and y. But now it's time to do the FFT along z, and here comes the problem, because none of them has all the data along z; each of them only has a fraction of the data. So they cannot do the FFT along z that was introduced on the second day. How can they do that? Well, here comes the smart trick. It works like this: Mr. Green here tells all the others, okay guys, send me the column that I'm drawing here, send me this first column, and I will take care of doing the FFT along the z direction on this first column of data. And Mr. Blue does the same: he tells everybody, send me the second column here, and I will take care of that one. And it goes on like this. Now, this drawing is probably not very accurate, it's too small, but what happens is that they all ask for data and compute portions along the z direction. So they will eventually have columns in reciprocal space. And these columns, as you have seen, of course live on a cubic grid, but the relevant data do not fill the cube: the relevant G-vectors are selected by the cutoff. If you're dealing with the density, and let me forget about ultrasoft pseudopotentials for the time being, the cutoff is four times the wavefunction energy cutoff; if you are dealing with a wave function instead, you have a sphere whose radius is half the radius of the density one. But that's the point: you will have the columns.
And these columns are called sticks in the Quantum ESPRESSO output; that's where they come from. But the point I wanted to make is that, as you have seen, when you want to compute an FFT at a certain point, all your workers have to ask all the other workers to send some data. And this is not a message like "let's do the FFT"; this is a message where you send the data, your wave function or your density, to the other workers. And this is one of the most time-consuming operations that you can do. This is called an all-to-all operation: all processes are sending messages to all the others to get the job done. So they are spending quite a lot of time communicating. And the more communication you have to do, the longer it will take. Each message will probably be short, small, but you still have a lot of communication, with a lot of processes to send data to or receive data from. And just to tell you where this appears in the output: it's at the top, and I hope that you now understand what a stick is. And just to give you some orders of magnitude, this is an example I was working on a few days ago, running with 16 processes. This is the total number of sticks that I had, and this is how they are distributed among my 16 workers: some of them have 13 sticks, some of them have 14. And when they're doing the z direction, they are really performing just a few FFTs (this is the number of FFTs they were performing along the z direction) with a size of tens of elements. So this is really something small. And if you add too many workers, you will be doing a few small FFTs and then sending messages around quite a lot. So a number to check when you're running a simulation is certainly the number of sticks that each of your workers is dealing with. If this becomes too small, they will be doing very small FFTs and communicating about them quite often. Okay, I don't know if this is clear enough, but I will not go into any further detail right now. Let me instead summarize what we've been talking about. We've been talking about the possibility of solving the Kohn-Sham problem in parallel at multiple k-points, and about the possibility of splitting the data, the wave function, over multiple workers, each worker holding a fraction of the wave function and working at the same time on the same k-point. And I told you that in general you have to use both these options at the same time. Really, what you do is select this option here, and Quantum ESPRESSO has a hierarchical set of parallel levels that are used to organize the work. So you start with all your MPI processes, your PW executables, running, and you decide how many groups of workers you want working on k-points. Let's make this simple: say we have six k-points and you decide to have three groups of workers. The first group will work on k-points one and two, the second group on k-points three and four, and this one on five and six. But you may have more processes, so inside each group the workers will split the wave function over G-vectors. And this is what is called, in the Quantum ESPRESSO output, the band group. Now, "group" refers to the thing I was saying at the beginning of the lecture, how you organize your MPI processes, which are your workers. And "band" refers to the fact that you are splitting the wave function over G-vectors.
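As an illustration of these two levels on the command line (a sketch with made-up numbers; -nk is the option that sets the number of pools):

    # 8 MPI processes split into 2 pools of 4 processes each (-nk 2):
    # each pool works on its own subset of the k-points, and within a pool
    # the wave-function data are distributed over the 4 processes
    mpirun -np 8 pw.x -nk 2 -in scf.in > scf.out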
Well, now it's really getting late. I have a couple of additional things that I want to tell you about conventional HPC systems, and these regard additional levels of fine-grained parallelism that we can tune. This is the topic that we will cover next. Let me just add one more point, to be general enough so that my presentation reflects what the distribution of MPI processes actually is. What I've told you is what happens for pw.x, the plane-wave code, but this is not the only code that you have in the Quantum ESPRESSO suite. There are codes that can exploit other levels of parallelism, specific to their task. And you will also find this image group, which can be used to split all the workers over different tasks. I will come back to this later. But now let me move to this last point here. I have to talk a little bit about how you actually compute the eigenpairs of your Kohn-Sham Hamiltonian. I think you heard, or you know, that when we're working with plane waves, yes, we do have a basis set of plane waves, and we could in principle just solve the secular equation and find eigenvectors and eigenvalues. So why don't we do that? Well, we don't do that because we have a huge number of basis functions, a huge number of plane waves, and doing so would be, well, maybe not even possible, but in any case very slow. So what you actually do in plane-wave codes is iterative diagonalization. Now, I certainly don't have the time to go into the details, because it's almost an hour that I've been talking, but I want to mention that you have many options for this iterative diagonalization, and I will say something very briefly about one of these approaches. This is a field where active development is really going on these days in Quantum ESPRESSO, so keep an eye on it, because the different options for the iterative diagonalization grant different speed and memory consumption, and also scale differently when you work on a growing number of workers. Davidson is what is selected by default, and that's the reason why I will spend some time, well, one slide, on it. But if you need to save memory, you may opt for the conjugate gradient approach, which is much slower but allows you to save some memory. And new options are coming that can optimize both memory allocation and speed. So you can certainly start with Davidson, but if you find yourself in need of saving some gigabytes of memory, you may also consider the other options. Now, I want to tell you very briefly something about the iterative diagonalization done with Davidson, because that brings me to an important option that you have to tune when you run large systems, with large input files, on HPC systems. So how do you solve the Kohn-Sham problem? How do you find the eigenpairs of the Kohn-Sham problem when you are working with a plane-wave based code? Well, what you actually do is start with a guess of your eigenpairs. If you are super lucky, but that basically never happens, these are already your eigenstates. Instead, Ernest Davidson came up with an idea for extending the space where you look for the solution of your Kohn-Sham Hamiltonian. And therefore, what you actually end up doing is applying, as Professor Giannozzi showed, the Kohn-Sham Hamiltonian to a set of states whose number is of the order of the number of Kohn-Sham states.
Well, actually a bit more than that: this is your initial guess, then you expand it with additional vectors and consider this reduced space where you look for the eigenstates with the lowest energy. And therefore, what you end up computing in the Davidson method is a generalized eigenvalue problem. I don't want to go too much into the details. What I really want to focus the attention on is the fact that you're dealing with a set of states, and therefore with an eigenvalue problem, whose dimension is a few times the number of Kohn-Sham states that you have in your system. And you may say, well, that's fantastic: we started from a huge number of plane waves, and now we're dealing with matrix sizes that are of the order of hundreds of elements for small inputs. And this is true; basically this is to say that you shouldn't be worried about this generalized eigenvalue problem that you have to solve. These hundreds of elements will take milliseconds, really a small fraction of a second, to compute on any single-core machine that you may have. But when you have larger inputs, this eigenvalue problem can become larger. And yes, I tried this a few days ago. This is something that I already knew, but I prepared this result a few days ago on an Intel-based system. The details don't matter now, but I used the best library that I had to solve this generalized eigenvalue problem with one core, and the best library that I had to solve the same problem with many cores at the same time. And you immediately see that this is a problem that scales like the cube of the matrix size, and whether you use a single core or multiple cores to solve the generalized eigenvalue problem, as the matrix becomes larger and larger, the difference becomes substantial. This curve doesn't even fit here, it goes off the top of the plot somewhere. So, for systems which are large enough (and you should look at this number: here I have a matrix of the order of 1,000 elements, so we're talking about systems with hundreds of Kohn-Sham states), if you forget about diagonalizing the eigenvalue problem on the subspace in parallel, you're really losing quite a lot of performance. But at the same time, most of your simulations, if you're dealing with small input sizes, are on this side of the graph. And on this side, you really lose quite a lot of time if you use a parallel eigensolver; it's much more efficient to use a single core. So the final message here is that you should certainly use the parallel eigensolver if you're dealing with large inputs, meaning hundreds of Kohn-Sham states, while you should definitely avoid it if you're dealing with small systems, tens of Kohn-Sham states. Okay. So, I think that I'm really late, so I will not tell you much about OpenMP, because I think Ivan will cover this during the hands-on, so maybe it's better if I just let him describe how OpenMP can be used to further improve the performance of the code. Let me skip this and mention instead the image option that I discussed before. The image option, in this hierarchical organization of work that I was discussing, is the first one: the first level of splitting of the job to be done.
And what you have, for example, in NEB or in the phonon code, is that images map onto tasks that require very little communication, like for example the computation of the forces on the images of a NEB simulation, or the irreducible representations in a phonon calculation. So this image parallelism is really a sort of abstract layer for tasks that do not really require much communication, and it can be used at the topmost level to start splitting the work to be done among the many workers. And below that stage, if you are dealing with a problem that involves the solution of the Kohn-Sham equations, you still have all the additional layers that I was showing before: the subdivision of the work over k-points, and the subdivision of the wave function within the band groups. And, okay, in the very last part of this discussion on conventional, let's say non-accelerated, HPC systems, let me very briefly mention how you actually select all these options that I've been discussing. The image parallelism is activated with this option here, and I'll give you the example for NEB. And well, what value should you put here? Clearly, this is a parallelization strategy that requires very little communication, so it's very good, but in general it leads to more memory allocation. So you should use the largest possible value, but that is not always possible, and you should experiment a bit and check whether you have enough memory to use what would otherwise be the best option. Then, pool parallelism. Now, the name pool is probably a bit misleading for some of you, but this is just how many groups of workers will be created, working on multiple k-points at a time. Basically, this will create X groups of workers, each of them computing the Kohn-Sham Hamiltonian at different k-points at the same time. And as I told you, this reduces communication, and therefore you should exploit it as much as possible. But this is again not always possible. Clearly, you need to provide at least one k-point per pool, otherwise your pool of processors will have nothing to do. But also, as I told you, this leads to memory duplication, or at least to larger memory allocations; therefore, you are sometimes limited by the amount of RAM available on your system. And then there is the last thing I mentioned earlier: the diagonalization, for example in the Davidson iterative diagonalization, but that's not the only case, Davidson is just the most common one. You have to do it in parallel when you're dealing with large systems, and I would say more than 100 Kohn-Sham states as an order of magnitude; that really depends on the hardware, so it's not so easy to give advice on this. Now, clearly, this number must be smaller than the band group size, meaning that if you have a number of pools working on the Kohn-Sham equations at different k-points, each pool will have the so-called band group, as you've seen before, all of its processes working at the same time on the same k-point, and they will have to perform the iterative diagonalization, so to solve this eigenvalue problem. So this number must be smaller than the band group size. But this is something that the code takes care of, let's say: you can put a large number here, and the code will find, no, not the best, but the largest possible set of processes working in parallel on the eigenvalue problem.
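Putting the three levels together, a purely illustrative NEB launch could look like this (the counts and file name are made up; -ni, -nk and -nd are the short forms of -nimage, -npools and -ndiag):

    # 48 MPI processes split into 4 images (-ni 4); each image split into
    # 3 k-point pools (-nk 3); within each pool, a 2x2 grid of 4 processes
    # (-nd 4) handles the dense subspace diagonalization
    mpirun -np 48 neb.x -ni 4 -nk 3 -nd 4 -inp my_neb.in > my_neb.out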
The value the code picks is not necessarily the best value that you could choose, but at least it guarantees that you're not solving this in serial, with a single core, when you have large problems. But remember instead to use a single core when you have small problems. And well, that is basically it for the parallelization of Quantum ESPRESSO on standard, conventional HPC systems. It is certainly far from trivial, and I hope that the options now make a little bit more sense, at least these two. Finding the optimal solution is really a matter of balancing the memory usage and the amount of communication that is required between the MPI processes. And probably this will become a little bit clearer during the hands-on. Now, before it's too late, let me mention one important thing. As I said at the beginning, if you just do ./configure, you may be lucky, but you may also not be. What Quantum ESPRESSO does is try to find a set of libraries to perform the operations that the algorithms require, basically linear algebra and FFTs. But it may well be that it doesn't find anything and uses the versions that are embedded in the package. Now, these are very basic, definitely not optimized for the systems that you may have; they're just meant to have the code running no matter what. But you should really look for libraries that can perform effectively on the system where you are building Quantum ESPRESSO to do your simulations. I'm showing here a number of open-source options that can be used for linear algebra, FFTs and parallel eigensolvers on basically any cluster. This is at least the minimum that you should be looking for, and if you plan to deal with large inputs, a parallel eigensolver like ScaLAPACK is certainly required. If you are dealing with an Intel-based system, you have the MKL library, and this provides you with all you need for effectively compiling Quantum ESPRESSO with something close to the best possible performance: it provides linear algebra, FFTs and a parallel eigensolver. And you may also exploit this library on AMD systems with some tricks; there are also options with the AMD libraries, which are still not much explored. But if you are on an Intel-based system, MKL is definitely the way to go; or, if you want open-source options, this is the set that you may check. Okay, a last word of advice. When you write to disk: Quantum ESPRESSO writes quite a lot of data to disk, and you have the outdir option in your input. Well, you know that in an HPC system you need to access all the files from all the nodes, and this parallel file system is nothing but a parallel application itself, which has been optimized for specific purposes. So always remember to use a scratch space when you are storing a lot of data; in the case of Quantum ESPRESSO, this is what is written during the simulation and saved in the folder indicated by outdir. Okay, in the 15 minutes that I still have, I will finally discuss something about GPU acceleration in Quantum ESPRESSO. Let me give you a very brief introduction to this and explain why we decided to add GPU acceleration to Quantum ESPRESSO. Well, you can look at the top 10 HPC systems in the world; this is a list taken from the Top500 website.
And these are the 10 most powerful HPC systems in the world. A number of them, I think there should be six out of 10, let's see if I find all of them, voilà, are adopting GPUs. And the reason why is that, if you come back to this slide and compare the peak performance and the power that is required to achieve this peak performance, you will notice that if you use GPUs, you generally obtain a nice reduction in the energy that you need to run your system. Now, you don't pay the bills, but whoever pays the bills is probably interested in this. And therefore, many GPU systems are appearing on this Top500 list. And how can the GPU do this magic? How can it magically reduce the power, and why? Well, it's not magic. It does it by changing the structure, the architecture of the device. Basically, it devotes many more transistors to the arithmetic and logic units, the ones that do the computation. And therefore, where has the other part of the CPU gone? Where is the pipeline, the part that controls the execution? Well, it is much smaller. And who is doing that work then? The developers are doing that work. And it takes quite a lot more effort to develop code for the GPU than for the CPU. You are lucky, actually, because the developers of Quantum ESPRESSO did it for you, and the only thing you have to care about is a few details of CUDA programming, which is the approach that has been used to accelerate Quantum ESPRESSO on GPUs. Let me skip this slide, because it's not so important, and tell you instead something that you may need to know and remember: when you compile Quantum ESPRESSO for GPUs, you will produce a very specific executable that can only run on a specific card, which has a specific architecture. This is summarized in the so-called compute capability, a number that you can find online. And this is still a number that we have to specify when building Quantum ESPRESSO for GPUs; we will go through this during the hands-on. Now, you are lucky, because in order to build Quantum ESPRESSO for GPUs you need the NVIDIA HPC SDK, and it embeds all the compilers and all the libraries that you need to efficiently build Quantum ESPRESSO and exploit the GPU at best. So during the hands-on you will see how this process is done, and it is quite simple. But the message here is that you really need to use this package: for the time being, you cannot compile the GPU version of Quantum ESPRESSO with other compilers, or take advantage of accelerators other than NVIDIA ones, that is, the ones that can be exploited with CUDA. Now, let's leave this for the hands-on; probably Ivan will talk about this a little. Let's instead focus on what you can run on hybrid systems. For the time being, all the core libraries of Quantum ESPRESSO can take advantage of the presence of accelerators, and PWscf, the PW code, is accelerated on GPUs. There is also an experimental port of some of the other packages, but you should be careful, this is really a work in progress. So it's probably better if you wait a little bit before trying anything; but just so you know, work is ongoing, and soon you will probably have additional programs ported to GPUs, because, as you can see, these layers show you that the external packages depend on the internal ones.
And therefore, having an accelerated PW will allow faster development for the other codes in these external rings. Now, another thing to keep in mind, for a reason that I will discuss in a second, is that not all features of Quantum ESPRESSO can take advantage of the accelerators. As you can see, the list is growing, and in the latest versions quite a lot of features are accelerated and can run on GPUs, but some of them still run on the CPU. Some of them also, for example the parallel eigensolver, are not yet ported to the GPU. This may seem like a limitation, but really, for GPUs, this is not a big deal, and we will see in a second why. But keep this in mind, and I will tell you in a second why it is important. Now let's look at a GPU card, for example, and discuss how Quantum ESPRESSO runs when it is exploiting the power of a GPU. Well, first of all, let's focus not on the computational power but on the memory. You see that, for example, on Marconi 100, you will have, in a single node, four Tesla cards, and each of them has 16 gigabytes of memory. Well, 16 times 4 is 64 gigabytes, and this is a much smaller number than the amount of memory that you have on a single node, which is probably four times larger, if I remember correctly, but in any case really much larger. And this tells you immediately that you will have to fight even harder with the number of pools that you can fit on a single GPU card in order to optimize the parallel execution. Another aspect that I should mention is that the cores of your GPU are much slower than the cores of the CPU, but you have thousands of them. This clearly tells you that the number of workers that you have, and therefore the amount of data that you have to provide, must in general be larger than what you deal with on the CPU. If you don't provide enough data, you will have idle workers and end up running much slower than on the CPU. So when you are trying to exploit a GPU, you will have to deal with inputs that are large enough, meaning that you are providing enough data for all your CUDA cores to work on. And this means that if you have small systems, for example if you want to do some high-throughput computing with small inputs, well, the GPUs may not give you a real boost in performance. But now let's focus on how you run Quantum ESPRESSO on hybrid, on accelerated systems. I'm considering here the example of Marconi 100, but you find similar architectures in many HPC systems and probably also in various workstations. The example of Marconi 100 is a bit extreme, but not unconventional, and it's certainly useful to understand one problem here. If you look at the entire CPU, you see that the computational power that it provides, the number of floating point operations it does per second, is 700 gigaflops, while that of the whole set of accelerators is 31.2 teraflops. Now, look at this ratio: the whole set of cores in the CPU, 32 cores in this machine, provides basically one tenth of a single GPU, which is about eight teraflops. So a single GPU is 10 times faster, and essentially all the computational power comes from the GPUs. This means that if we want to run Quantum ESPRESSO efficiently, well, we must really forget about the CPU cores and use the same machinery that we created for the CPUs, now for the GPUs.
And this means that the processes we launch with mpirun, which have been organized in such a way as to split the job to be done, split the data, and work on a given task together, will now have to use the real source of compute power, which is the GPU. This clearly tells you that you will want to associate each MPI process with a single GPU. Therefore, in general, you will be running Quantum ESPRESSO quite differently from what you would do on a standard HPC system, where you have tens of cores, like 24 or 48. Here, instead, you generally have a few GPUs, so you run with just a few MPI processes, and PW, for example, already organizes the work internally and allocates the tasks to the various GPUs, where the computational power comes from. This would be nice as it is, but to run on a GPU you also need a CPU. And there is an additional point: not all tasks benefit from the GPU, and, as I told you before, not all tasks have been ported to the GPU. So when you run pw.x in this way, a few portions of your problem, a few tasks, will still run on the CPU, and they will run on the CPU quite badly, because each process is assigned to a single core. This core controls the GPU, which is really doing the job whenever we have an accelerated version of a given algorithm, but we have only a single core per process working to get the non-accelerated parts done. This is quite bad, but you can overcome it by exploiting OpenMP parallelism and in this way add multiple cores to each MPI process. This is again something a bit more advanced that we will cover in the hands-on. The important message is that the machinery we prepared with message passing, so each MPI process that organizes the work with this message-passing architecture, should be assigned to a single GPU, because that is where the computational power comes from. And this is basically the message I just discussed. Also, there is no parallel eigensolver on the GPU right now. This may change in the near future, but in general, except for very large inputs, it does not represent a problem. Now, it's getting quite late, and I still have a few slides to discuss, so maybe I will just fly through them; these are just a few best practices, I would say. These are the sets of options that you have to keep in mind when building Quantum ESPRESSO on HPC systems. As you will see, it is better not to forget about OpenMP, especially if you are building for hybrid architectures. You enable CUDA with this set of options, and you should do it explicitly; otherwise, the GPU-accelerated subroutines of Quantum ESPRESSO will not be compiled. And these are instead the options that you should consider if you are dealing with large problems that require the parallel eigensolver, that is, if you work with many cores. Okay, here I am just showing how make.inc should look for an Intel-based platform and for a GPU platform. Ivan and I can go through this during the hands-on and have a look at these make.inc files. Okay, so, very quickly: this happens so often that I always feel obliged to mention it. Look at this example; it really happens quite often. In the top part of the output you see the number of nodes, this is on the old machine, so these are the nodes that Quantum ESPRESSO is using.
And it also prints the total number of processes, or rather of cores, that it believes it found. But clearly there is a problem here: there are not 15,000 cores in 12 nodes, right? The problem is that the launcher is creating 36 MPI processes on each node, and each process is also using 36 threads. This is the situation I was introducing at the beginning, with all the OpenMP threads fighting for the same CPU, and this will make the simulation horribly slow. So keep an eye on this and make sure that the number of threads is what you intended to have; otherwise, the run will take forever. Okay, very briefly: in order to understand whether you should use the parallel eigensolver, and what the dimension of your problem is, you should really check, as I told you during this hour and a half, the initial part of the output of Quantum ESPRESSO, where you have the dimensions of the problem; I think you have already gone through this. And in the final part there is a list of timers that details how time was spent in nested subroutines. So, very quickly: the computation of the ground state is what you see here, and forces take, as you can see, a certain fraction of the total time. It is not negligible, and this is fine, but if it grows a lot, it may be that you are working with a GPU build where the specific algorithm that you are using is not accelerated on GPUs, or maybe you are considering the brand new projection scheme for DFT+U. Anyway, this is more or less the ratio that you should observe. Then, moving inside the electrons routine itself, you can see that most of the time is spent solving the band (eigenvalue) problems. And this sum_band, you can guess what it is: it is the time taken to sum over the bands to generate and symmetrize the density. Again, this is more or less the ratio that you should observe. And finally, this is the time spent computing the application of the Hamiltonian to the wave functions, and this is the time taken by the Davidson iterative diagonalization, in this case at each k-point. And this is instead the dense eigenvalue problem in the subspace that I mentioned before. So if you see this one growing too much with respect to this stage, it is probably time to activate the parallel eigensolver. Now, I think I really should stop here, it is quite late. So I think that is it. Thank you very much, Pietro. Yes, the topic is quite broad, and you really gave us many hints and covered an overview of many aspects of HPC. So I think maybe we have time for one question: if one of the participants wants to speak, you can raise your hand and I will let you speak. We also have a few questions on YouTube, a few of which we have already answered in the chat. I don't see any raised hands; if you want, you can also write in the Zoom chat now if you have questions. Okay, yes, this is a very interesting question: can Quantum ESPRESSO be used with consumer-grade GPUs? I think this is a really interesting point, because many laptops actually have GPUs, so yes, it is a good question. Yes, this is a question that comes up quite often, and the answer is actually in a technical detail of the device that you own. The point is that Quantum ESPRESSO is basically always dealing with double precision operations, while GPUs were originally meant to do graphics, where you don't need double precision.
And most of them are quite weak on double precision operations. Therefore, even if their advertised performance is very high, it could well be that this is the performance for single precision operations, which are not what Quantum ESPRESSO is performing. So the answer is that consumer-grade GPUs are generally weak on double precision. But if you really want to check, you can find the technical details of your GPU and look at the ratio between the number of cores devoted to single and to double precision operations. This is also something that the CUDA utilities print; I think pgaccelinfo also prints this. Maybe we can show this during the hands-on and discuss it further with the output in front of us. Okay, thank you very much, Pietro. We have to stop now. If participants have more questions, you can just write on Slack and we will answer, and of course, if more questions come, we can continue during the hands-on. So let's have the coffee break now. Thank you, and see you in ten minutes, at 10:30. Maybe, Massimo, we can put up screen seven now. So, good morning again. I think we can start. Yes, can you hear me? Yes, we can hear you. Okay, so good morning again. Wait, I'll share the screen. Okay. So, I am Ivan Carnimeo and I work at SISSA on the development of the GPU version of the Quantum ESPRESSO code. Now we are going to have the hands-on, in which we will see a few practical test cases, trying to work with the concepts that Pietro gave us in the theoretical lecture a few minutes ago. So let me start, and let me say a few general things about the hands-on. From what Pietro said before, we learned that Quantum ESPRESSO is a big code with a large parallel part. The speedup of the code, according to Amdahl's law, depends on the size of the problem and is limited by the serial part. So we have a serial part and a parallel part, and the parallel part is implemented using different approaches and strategies. Pietro spoke about MPI, the message-passing protocol, which has several advantages, such as distributing the memory. MPI is the preferred choice for parallelizing Quantum ESPRESSO because it allows you to distribute memory among a large number of nodes, and this allows you to treat large systems, because for large systems a lot of data needs to be stored and kept in memory, and this data can be distributed among nodes. That is why MPI is used so much and is the basic parallelization protocol in Quantum ESPRESSO. We also have OpenMP, which also helps in distributing the workload, but it is a shared-memory approach, and we will see a few examples in the following. And we also have GPUs. So this hands-on will be divided in two: the first part will be about HPC, so how to optimize calculations on regular HPC clusters based on CPUs, and the second part will be about working with GPUs. Other important things that Pietro mentioned are npool and ndiag; we will see a few examples in the following of how to optimize calculations with pools and ndiag. Let me say one thing: all these strategies and approaches have been developed in order to treat large systems and to optimize the execution of your test cases and production calculations. So they are meant to accelerate the calculation, especially when you run large systems.
So, for this reason, here in the hands-on we will not see drastic improvements in the calculations we test, because we are forced to work with small systems: there are more than 100 participants, and if each of you ran calculations using 5-10 nodes we would run out of computational resources. So you will be running small test cases. Please, when you run the test cases, focus on learning the general approach, on learning how the calculation is set up, and try to understand the concepts. You will not see drastic speedups, or at least not always, because we are quite limited by the fact that the calculations are small. Anyway, if you learn this procedure, you will be able to run Quantum ESPRESSO calculations more efficiently on your own production cases. So if you have to write an article, if you want to do a test, if you want to learn about a new material with Quantum ESPRESSO, hopefully after this hands-on you will know how to optimize the calculation, how to run large systems, and how to fully exploit the computational architecture on which you are running. In order to also give you an idea of what happens on large systems, since the test cases are quite small, I will show you a few slides after each exercise, to show how what you have just done on the small case is reflected on a large test case. For this reason, I would like to stay synchronized with you. My idea is that you try as much as possible to follow me; I will wait for you while you do the exercises. Let's try to go together, and I will try to wait for all of you to finish the exercises as much as possible. This said, I think we can proceed. The topics of this hands-on are basically five exercises. The first one is preparing the CPU version of Quantum ESPRESSO, so compiling Quantum ESPRESSO using the different libraries that Pietro explained before. Exercise two is how to run Quantum ESPRESSO on HPC architectures without GPUs. In exercise three, if we have time, I will explain a few things about how GPUs work, so some very basic things about GPU programming and GPU setup, just to let you understand why we are using GPUs and what improvement we might get. And then exercises four and five are similar to one and two: exercise four is compiling Quantum ESPRESSO with GPU support, and exercise five is running calculations using GPUs. First of all, for this hands-on we cannot run either on the laptop or in the virtual machine, first of all because this hands-on is about HPC calculations, and it is very unlikely that you will ever run HPC workloads on your laptop. So we need to run on a cluster, and we will be using the Marconi 100 cluster for this. So first of all, please connect to the HPC cluster. Yesterday you received an email with the credentials, and I have also seen on the Slack workspace that most of you have already tried the connection. So now, in order to stay synchronized, I would like to ask you to connect to the cluster with the credentials you have received, and when you are ready on the cluster, raise your hand so I can check how many of you are set. Okay, now I have the window here. So let's check: please connect to the cluster with SSH, with this command here. You will find these instructions also in the README.md file in the day nine folder.
So please connect to the cluster and, once you are connected, raise your hand so I can see how many of you there are. Yes, I see 31. If you have problems with the connection, you can write on Slack; of course, in that case you cannot raise your hand, because we are using it for a different reason. So we are now about 31 people. Okay, now I see 32 raised hands, so I assume that about 30 people are following the exercise synchronously. Good. Someone says raising the hand seems to take a lot of time. Okay, if you have connection problems and the problem is your own connection, I'm afraid we cannot do anything about that, but I think you can still follow the exercise. Okay, about 33 people, plus someone who cannot raise their hand but connected successfully. Good. In order to follow the hands-on, I suggest you refer to the README files in the repository. Once you are connected to the cluster, you will find this directory in your home. When you are in your home, do cd into it; and now you can lower your hands, everyone, so we can proceed. Once you are in your home directory on the Marconi 100 cluster, go inside this directory and type git pull. In this way, you will update the repository in which the files for the exercises are kept, so you will download the latest version of the repository. So do this git pull. Once you have updated the repository, you will see the updated versions of all the README files, and if you open them you will find pretty much the same instructions that I am giving you here with the slides. So I suggest you follow the hands-on and try it by looking at the README files, because they are the most up-to-date version of the hands-on available at the moment. So please lower your hands; if you have questions, just ask on Slack or here in the chat. One more thing: we are now working on a really huge cluster, and even if in the future you work on clusters like this, it is not advisable, at least in this case, to work in the home directory. These clusters are engineered for many users, so usually the home directory is quite small and it is often backed up. The good practice on these clusters is to keep in your home directory only the important files that you really need, because the home directory is usually backed up, so if a hard drive breaks, the data are not lost. But in order to run calculations, the cluster has another partition, called the CINECA scratch partition, which is better for calculations, because calculations usually produce many files, for example the wave functions or temporary files, and the size of these files grows and there are many of them. So if you run calculations from your home directory, you will soon run out of disk space. For this reason, the good practice is to move to the scratch partition, run the calculations from there, and then copy the important files, the output files, back to the home directory in order to save them.
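Just to make this home-versus-scratch workflow concrete, here is a minimal sketch of the idea; the directory and file names below are only placeholders, the actual paths are the ones written in the school's README:

```bash
# Work from the scratch area (large, not backed up), keep only results in $HOME.
cd $CINECA_SCRATCH            # scratch partition: run jobs here
mkdir -p my_run && cd my_run  # hypothetical working directory
# ... run the calculation here ...
cp my_run.out $HOME/          # copy back only the outputs worth keeping
```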
So, precisely for this reason, after you have done the git pull, which I assume all of you have done by now because it was not a long-running command, we run this other command: cp -r, where -r means recursive, so we can copy the entire folder. So we copy the directory. Before, you did the cd, so you are inside this school material directory; we update it, and then we copy the day nine directory, with our input files and job scripts, to $CINECA_SCRATCH, because the scratch directory is where we want to work from. After that, you do cd $CINECA_SCRATCH, so you move yourself to the directory where you have just copied the files, and you go inside day nine, because this is the folder you have just copied to the scratch. Then, just to check, type pwd, and you should see m100_scratch in the path. If the result of pwd looks like this, it means that you have done things correctly. So please raise your hand when you are done with these commands. Okay, now I see 24; before I could see about 30 people, so I assume a few people are still working. Someone asks whether the PDF is available: not yet, but of course I will give you the slides after the end of this hands-on. Someone says cp is not working: note that it is cp -r, because you are copying an entire directory, so don't forget the -r, otherwise it will complain that it cannot copy the directory. Okay, now I see again about 30 people, the same number as before, so I assume that most of you are done; I see other confirmations here, so we can proceed. Most of you are now in the right partition, and we can start with exercise one, which is preparing a working installation of Quantum ESPRESSO. Let me just recap what Pietro said before. From a practical point of view, there are three things you should keep in mind when compiling a version of Quantum ESPRESSO. The first one is the compiler. Quantum ESPRESSO can be compiled with several compilers; the most popular one is gfortran, of course, which can be downloaded for free, and if you are on Ubuntu, you can easily install it with apt-get install. Another option is ifort, which is the Intel compiler. And we also have nvfortran. You might have known it under its previous name, PGI; about a year ago it was renamed, and now it is nvfortran. So nvfortran is the compiler previously named PGI. And if you have ARM architectures, there is Flang. There are other options too, but these are the most popular ones. So, the compiler: Quantum ESPRESSO is written in Fortran, so you need a Fortran compiler, and these are the most common ones. Of course, if you are trying to compile Quantum ESPRESSO on your laptop, you can do basically whatever you want: you can download gfortran from the Ubuntu repository, or you can download ifort or nvfortran; you just choose, download and install the compiler, and then you compile Quantum ESPRESSO.
If you are working on an HPC cluster, you should check which compilers are available on the cluster. For this exercise, we will be using nvfortran. The other two important things to keep in mind concern how the calculations are performed: the most relevant mathematical operations are linear algebra and fast Fourier transforms. So we need libraries for linear algebra operations, matrix-matrix products, diagonalization, QR decompositions, things like that, and libraries for fast Fourier transforms. Of course, if you don't have any external library installed on your system, Quantum ESPRESSO provides its own internal version of both the linear algebra and the FFT libraries. So the minimal setup requirement for installing Quantum ESPRESSO is just having a compiler, and if you don't have any other external library, Quantum ESPRESSO will use its own internal ones: internal BLAS for linear algebra and internal FFT for fast Fourier transforms. However, there are other options. A very popular one is OpenBLAS, which is also a free library that you can easily download, for example on an Ubuntu system with apt-get install. There is also MKL; MKL is usually very fast, a bit faster than the other options, so it is advisable in case you have it, but it works only on Intel and AMD CPUs. In this case, since Marconi 100 is based on IBM POWER9 processors, we cannot use MKL. For fast Fourier transforms you also have FFTW3, which is a very fast FFT library and a good option; MKL also has its own version for fast Fourier transforms, but again it works only on Intel and AMD processors. So in this hands-on we will use FFTW3. Let's see how to set things up. First of all, you cd inside the first directory, exercise one (CPU setup), and you download the source code of Quantum ESPRESSO from the repository. I suggest you copy and paste this command from the README.md file that you will find in that folder; you can open it with vi or another editor (I'm not sure emacs is available on Marconi), or you can just cat it to the screen or use less. You can even open the README file from your virtual machine or from your laptop and copy and paste the commands to the shell on the Marconi cluster. Anyway, I think from there it is easier to copy this URL here. When you run wget with this URL, you will download the source code of Quantum ESPRESSO as an archive; .tar.bz2 is an archive format. At the end of that command, you will have the qe-6.7MaX-Release .tar.bz2 file in your current folder, so if you do ls you will find this file. Since this is an archive, we have to unpack it: tar xjf is the command meant to unpack this tar archive. Once you have done that, you can rename the folder for your convenience, because then it is easier for you to browse inside and outside. So you move, that is rename, the folder created by the previous command: after you run it, you will have this folder created on your system; you can rename it to qecpu and go inside qecpu. Once you have done that, you can load the environment.
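To recap the download and unpacking steps just described in one place, a sketch like the one below should work; treat the URL, the archive name and the folder names as indicative, the exact ones are in the README:

```bash
cd exercise1                                 # hypothetical folder name for exercise one
wget <url-from-the-readme>                   # download the QE 6.7 source archive
tar xjf q-e-qe-6.7MaX-Release.tar.bz2        # unpack the .tar.bz2 archive (name indicative)
mv q-e-qe-6.7MaX-Release qecpu               # rename the unpacked folder for convenience
cd qecpu
```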
So we load the libraries that we will be using to compile this version of Quantum ESPRESSO. On the cluster there is a module system. With module purge you remove all previously loaded module files, so you clean your environment. Then we load only the modules that are needed for the compilation. For this compilation we will be using the HPC SDK: this is the package which contains the nvfortran compiler, the one we will use. So you load hpc-sdk, and inside it you find the nvfortran compiler. Then you load Spectrum MPI, which is the MPI implementation for running parallel calculations, and you load fftw, the FFTW3 library I was telling you about before, for fast Fourier transforms. Once you have run these commands, your environment is set up: you have loaded all the libraries needed to compile Quantum ESPRESSO on the cluster, and you can run ./configure with these options. Please do it, and once you have done it, raise your hand again. I can also do it on my side, so I can show you. I am on the m100_scratch partition, this is my username, I go inside day nine, which I already copied before, and then inside exercise one. I already have the archive that I previously downloaded with the wget command, so I have this directory here, and inside it there are all the folders coming from Quantum ESPRESSO. So I run configure with these options. This is what should happen. But I see that you are going much faster than me, which is good. So, nine people with raised hands; please, when you have run the configure command, raise your hand. I see 12 people now, 14, 16, let's go on. Done, done. Someone says: when I load the modules, nothing happens. It is normal: when you load modules, they are just loaded, it's fine. Done, done, done. Good, 23 people. Before we were 30, so I expect a few more of you to accomplish this. Please raise your hand once you have run the configure script. We are still at 23, so let's wait a few more minutes; it shouldn't be difficult. Yesterday I really tried to copy and paste the commands from the README.md files and everything seemed to work, so you shouldn't find particular problems with this. I see 24 people, 25, so I think we still have six or seven people working on it. Okay, in the meanwhile I can explain. When you run configure, let's recap: we have loaded modules, and if you type module list you will see which modules are loaded on your system; in this case we have the HPC SDK (so the compiler), Spectrum MPI, and FFTW for fast Fourier transforms. Once you do this, with configure you are basically asking Quantum ESPRESSO: okay, I'm about to compile you, which libraries do you see? I have loaded my libraries, do you see them? And Quantum ESPRESSO answers: yes, this is what I see; I can find BLAS libraries for linear algebra, and I can find LAPACK libraries, again for linear algebra. And here it is telling you that the linear algebra libraries will be taken from the HPC SDK, which means that we will be using the linear algebra libraries that are already provided with the nvfortran compiler. And here it is also telling us that it is using the FFT libraries, FFTW3, which are the libraries that we wanted to use and that we loaded with the module. So now everything is fine.
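For reference, the environment setup and the configure step look roughly like this; the module names and configure options shown are only indicative of what is used on Marconi 100, the exact ones are in the README:

```bash
module purge                    # clean the environment
module load hpc-sdk             # NVIDIA HPC SDK: provides nvfortran and its BLAS/LAPACK
module load spectrum_mpi        # IBM Spectrum MPI
module load fftw                # FFTW3 for the fast Fourier transforms
module list                     # check what is loaded
./configure --enable-openmp     # let QE detect compiler, BLAS/LAPACK, MPI and FFTW
```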
And we can compile the code with make -j4 pw. So do it, and let's wait a few minutes here, because it will take I think two or three minutes. You can lower your hands now. I'm not sure you will all be able to finish the compilation, not because it is not working, but because we are many people running on a few nodes, so it might go slower or faster; don't worry about it. The important thing is that you have learned which are the most important ingredients to compile Quantum ESPRESSO (compiler, linear algebra and FFTs), how to load modules on the HPC system, how to configure the installation, and how to launch the compilation. At this moment, okay, I see that it is going very slowly even for me; maybe it is because many of you are running this command and the cluster is slowing down. But don't worry, because for the next exercises we will use a version of Quantum ESPRESSO that we compiled yesterday and that we know is working. So even if you cannot finish the compilation now, don't worry: the important thing is that you learn how to do it and that you understand what we have done so far. You can lower your hands now, please, and let's wait a little bit. Excuse me, Ivan, somebody said that you are going too fast. Am I going too fast? Somebody tell me. Okay, so now we have a few minutes while waiting for the compilation; you can use them to catch up on the exercise, and you can even skip the last step, the make -j4 pw: if you are late, you don't have to do it, because it is just a mechanical thing. The important things are loading the modules, configuring, and seeing what happens after the configure. After that, when you launch the compilation, you usually don't get any problem from it. Let me check the chat. Okay, we have one person who is reading the build tips for pw, this is good. What is the -j4 flag? This has been answered. Someone could not download with wget, it says temporarily unavailable; maybe this can be fixed in a breakout room. If wget is not working on your side, it is probably worth solving it inside the breakout room. Oh, I managed, okay, good. So, configure, thank you. What is the option? I don't see a particular problem here. So yes, I think the cluster is slow; maybe we can wait a few more minutes. For those of you who managed to build the pw executable, you can do the next step, which is written in the README file and is this one: try running this calculation. Of course, the calculation will not complete, because you don't have the pseudopotentials here, but even if you see the error message that it cannot find the pseudopotential, it means that pw.x is working, because that error message is printed by pw.x itself, which means it has been compiled and it works. So forget about the error message for now, just try it. This is the executable you have just built; you will find it in PW/src/pw.x. And yes, if you try it, you will get an error message about the pseudopotentials, which is fine, because it means that pw.x is working. In my case, I am still compiling; I hope that most of you will get there.
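For those who got to the end of the build, the quick sanity check just described looks more or less like this; the input file name here is only a placeholder:

```bash
# After make -j4 pw finishes, the executable is in PW/src/
ls PW/src/pw.x
# Quick check: it should stop with an error about missing pseudopotentials,
# which proves that pw.x was built and runs (input name is hypothetical)
./PW/src/pw.x -input test.in
```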
Okay, please raise your hand again, those of you who managed to compile pw.x; when the compilation finishes, raise your hand again. Yes, two people. Unfortunately, I think this is going much slower than usual because many of us are on the same cluster nodes. In fact, if you are alone on the machine, you can compile using even more than four processes. With the -j4 option we are compiling with four processes; of course you can also compile with one process, but it goes slower. You can also, but please don't do this now, use -j without any number after it, so just make -j pw, and it will use all available cores. Please don't do that now, otherwise for sure nobody will be able to run the compilation. I see that it is going very slowly because there are many people connected, so I think it is better if we move on, otherwise we cannot do the other exercises. I suggest you try the compilation on your laptop, for example, if you have the compiler, so you can apply what we have learned here. Maybe we can stop it now; let me just check if at least I can finish my compilation so I can show you the end of the exercise, but I don't want to spend too much time on this, otherwise we don't have time to run calculations. So yes, let's press Ctrl-C and interrupt the compilation. I'm sorry, but I don't want to waste too much time on this. So please lower your hands, keep note of what you have done so far, interrupt the compilation with Ctrl-C like this, and let's move to exercise two. Exercise two is about how to run calculations exploiting the HPC cluster architecture. On these very complex architectures there is usually a job scheduler, which means that if too many people send jobs at the same time, there is a manager that queues the jobs, in such a way that if I submit a job when all nodes are occupied, I will wait until one of the nodes is free, and then my calculation will run. All this queuing is managed by the scheduler, here SLURM, and we need to submit a job script with sbatch in order to run a calculation. This job script is here, and you will find it in the exercise two directory as job.sh; you can see this file in your folder. The job script is divided in two main parts. The first part is the set of directives for the sbatch manager: we are telling it how many cores we need, how many processors, how much memory, and other things like the reservation and the account. Everything has been set for you, so don't change these fields. Just to let you know: the processors are organized in nodes, and we are asking for one node and 16 processors per node, so each of you will be running using 16 processors from the same node. This line manages the MPI processes, and this other variable manages the number of OpenMP threads that we are running. Here this variable is the path to the executable: if you managed to compile your own version of Quantum ESPRESSO, you could insert your path here, so if you substitute this string with your username, you should be able to point to your own version. And if you didn't manage to compile your own version, as in my case, since I interrupted my compilation, that is not a problem.
Just leave the variable as it is, because this version of Quantum ESPRESSO is one that we compiled yesterday, so it works; you don't need to change this field. I just want you to know that this variable points to a working version of Quantum ESPRESSO, so let's use that one. This is the job script. Now you just submit it using this command. So I go into the exercise two directory; you will see the README file here, where you find all the instructions, and the job.sh file, which is the job script I have shown you in the slide. It is enough to type sbatch job.sh; it will answer "Submitted batch job" followed by your job ID, which means the submission went through. You can check the status of your job with squeue -u followed by your user ID. My username is this one; I submitted my job with sbatch job.sh using that job script, and this squeue -u command is telling me that my job is running, because the status is R. So the cluster is doing the calculation, the job ID is this one, which is the same as before, and I am running on one node, and this is the node on which I am running. So please submit this job. Okay, I see that 30 people have submitted it, which is good; it means that pretty much all of you are working. In the meanwhile, let me explain. This job is running and it will take a bunch of minutes, so while it runs let me explain what we have done. Here we have sent one job, with this input file, using one pool and ndiag equal to one. If you remember from Pietro's lecture, npool equal to one means that we are not exploiting any parallelization over k-points, and ndiag equal to one means that we are not distributing the matrix for the diagonalization. At the moment, this is our baseline job. The output file name you can choose as you prefer; I just chose this one so that, when you change one of these parameters, you change the file name accordingly and keep track of what you are doing. I think we can gain some time: while these jobs are running, we can go further and explain things. This exercise is about pool parallelism. We have sent this job with one pool and ndiag one, and in the next step we will send more jobs using more pools, so two, four and eight pools, keeping ndiag the same. This exercise will show you how to practically exploit pool parallelism. If you look inside the input file, you will see that this input has eight k-points. If we have eight k-points, the reasonable numbers of pools to split them into are two, four or eight. If we split eight k-points into two pools, we will have two pools with four k-points each; if we split eight k-points into eight pools, we will have eight pools with one k-point each, and each pool will be assigned to a group of processors. I don't want to waste time, so let me anticipate: once these jobs are done, you will see that the computational times look like this, that is, increasing the number of pools results in a reduced computational time. And this is because, as Pietro explained in the morning, increasing the number of pools allows us to reduce the communications among MPI processes.
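Just as a reference for the structure being described, here is a stripped-down sketch of such a job script; the account, partition, paths and file names are placeholders, and the real job.sh in the repository is the one to use:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16      # 16 MPI tasks on one node
#SBATCH --time=00:30:00
#SBATCH --account=<account>       # placeholder
#SBATCH --partition=<partition>   # placeholder

export OMP_NUM_THREADS=1          # one OpenMP thread per MPI task
QE=/path/to/pw.x                  # placeholder: the precompiled executable

# baseline run: no k-point pools, no distributed diagonalization
mpirun -np 16 $QE -npool 1 -ndiag 1 -input scf.in > scf_npool1_ndiag1.out
```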
Reducing communications is very useful because, as we have seen before, MPI has the great advantage that it lets you distribute the memory and parallelize the parallelizable part of the code, so distribute both the workload and the memory. But it also has a drawback: the more MPI processes you use, the more these processes need to communicate with each other, because you are splitting your data into smaller parts, and in order to get the final result you have to gather all the parts together, so the processes need to exchange data among themselves. For this reason, if you push MPI too far, the communications will saturate the calculation. Pools are a good way to reduce the number of communications at a fixed number of MPI processes. Let's check how our calculations are going. Okay, now we have 43 people running here. And my run, okay, is still going; we are at iteration three, and we will do just four iterations here. So it is running. Of course, you can check the number of k-points: two times two times two means eight k-points, and in fact in the output file, if we search for "k points", it tells us that the number of k points is eight. So from there we get this information. Also, at the beginning of the file, we find again what Pietro told us about the computational resources, because in this case we are running on 16 processors: if you look again at the job script, we asked for 16 tasks per node, and here we find 16 processors, which means we are running with 16 MPI processes, and here it tells us that the MPI processes are distributed on one node, which is what we asked for. So in the output file you can find a recap of your hardware configuration. Yes, unfortunately you see that these calculations take a while. The problem, as I was telling you before, is that these facilities are meant to treat large systems, and we have two hours for the hands-on and limited resources, so we had to find a balance in the size of the test systems. Anyway, I see that many of you are running, which is good; it means that maybe a few of you already knew some of these things. Okay, my calculation is almost finished, because it has reached iteration four, which is the last one: in this input file we set the maximum number of SCF steps to four, so we do only four iterations of the SCF. After it finishes, I should change the number of pools and rerun. Yes: after you finish this first run, the one with npool 1 and ndiag 1, if you read the README file, you can change the job script and add these lines: of course you change npool from 1 to 2, and the output name accordingly, and then you can append the other runs in the same way. The basic idea is that we prepare a new job script after this one. Okay, my calculation is now finished, so if I look at the output file I find "JOB DONE" and the computational time here. You will see two times at the end of the job, the CPU time and the wall time. Usually the most reliable one is the wall time, because the CPU time is just the time during which the CPUs were actually working.
Also, if you use GPUs, the CPU time is not meaningful, because the workload is not on the CPUs but on the GPUs. So always refer to the wall time when you want to check the execution time of your job. The wall time also takes into account communications and input/output, so it gives a more complete view of the job execution time. Here the job took nine minutes and 35 seconds, so yes, this one is done. The next step, and you can find these instructions in the README file, I will do just as written there: I open my job script again, and here I want to try pool parallelization, so I set npool to two and I also change the output file name accordingly. Then I can use the same job script to run npool four and npool eight as well, with npool four and npool eight here. Save the job script. So: npool one we have already done; now we have npool two, npool four and npool eight. We can put these commands in one job script, one after the other, and again we submit it with sbatch. I can see my new job running here, and if I do ls in the directory, I see the new output here. The first calculation was the longest one, so the next calculations will of course take a shorter time. I think we still have to wait a little bit, because we are running three calculations now. But at the end, if you check the wall time at the end of this file and compare it with the wall times at the end of the other output files you will be getting in a while, for npool two, four and eight, you can plot them or just look at them, and you should find a trend like this one, because increasing the number of pools decreases the communications at a fixed number of MPI processes. In all cases we have used 16 MPI processes; this has not changed, it is fixed. But using one pool, the MPI processes communicate with each other very much, whereas using eight pools the communications are strongly reduced. So, I think you are all running, because I don't see any particular issue; 39 people are running. While our calculations are still running, I can show you something about why this works, about what we are doing now. This is pretty much similar to what Pietro told us. Pietro showed us that the wave function can be represented as a three-dimensional array: here we have our orbitals, which are expanded in plane waves, and each plane wave has its own coefficient. These coefficients have three indices: G is the index of the plane wave, i is the index of the Kohn-Sham state, and k is the index of the k-point. So we can represent the coefficients in which our wave function is expanded in a three-dimensional way. The longest dimension is the number of plane waves, because if you run a large system you can easily reach one million plane waves, so these arrays are very tall; on the other dimensions we find the number of Kohn-Sham states, or bands, and the number of k-points. Each coefficient can be seen as a point in this three-dimensional array. When we use MPI we are distributing the memory, which means that this array of coefficients is distributed among processes. But what happens then?
When we calculate whatever property, for example a potential, the Hamiltonian applied to a state, the kinetic energy, we are usually applying an operator to one orbital. This means that if we split the orbitals along the plane-wave dimension, this core here does not have the entire orbital, and this MPI process here does not have the entire orbital either. But the quantity I want to calculate usually depends on one complete orbital, and for this reason I need to sum along the plane-wave dimension. So the communications among MPI processes are much heavier along this dimension, the plane-wave dimension, than along the other two dimensions. If I split the wave function along the plane-wave dimension, the MPI processes will need to communicate very much among themselves. Let's make an example. If I split the wave function among four MPI processes along the plane waves, since most of the communications occur along this dimension, all the processes will need to communicate with each other, because none of them holds a single complete orbital: the orbitals are split among the different ranks, the different processes, so they have to exchange a lot of data. Otherwise, if I use npool, pool parallelization, still with four MPI processes, so two pools of two processes each, what changes is the way I distribute the wave function: in this case one single orbital, one stick here, is distributed among only two processes, because the wave function is also distributed along the k-points, not only along the plane waves. It means that we still have the same number of MPI processes, four in this case, but with this distribution a single orbital is split among two processes instead of four. So, in order to compute the properties, only two processes need to communicate, and we have strongly reduced the communications along the plane-wave direction, because the orbitals are split among two processes rather than four. I hope this is clear, because it is the main reason why pool parallelism is so efficient. So please raise your hand when you have finished the pool parallelism exercise. Okay, a few of you are done. I assume that the people raising their hands now have finished the jobs with one, two, four and eight pools, have checked the wall time in the output files, and have found that using eight pools is faster than using one pool, in agreement with this plot. Unfortunately, the calculations on CPUs are a bit long; we will see in the next exercises that with GPUs they will go much faster. We still have 36 people running, and my own run is still going with pools; okay, now I am at two pools. Here, for example, I can compare npool two with npool one, and we see that it was nine minutes 35 with one pool and eight minutes 10 with two pools. So we now have five people who have accomplished the exercise, and I think the rest of you are still waiting for the calculations.
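Once the different runs have finished, a quick way to compare them is to grep the final timing line of each output file; the file names below follow the naming convention used above, so adapt them to yours:

```bash
# The last timing line of pw.x reports the total CPU and WALL time of the run
grep -H "PWSCF.*WALL" scf_npool*.out
```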
So I assume that most of you have submitted the calculations and are waiting. In the meantime, I can show you another slide regarding pool parallelism. On the basis of what we have just said, so that I can split the wave function among plane waves and among pools, there is one subtle thing to keep in mind. When you use pool parallelism, you have to think about how many k-points you have and how many processors you have, because you are trying to fit pools into groups of processors, and the numbers are not always compatible. Let me show you an example. Here we have a crystal, and this system has 30 k-points, and I tried these calculations on five nodes. Of course, this is an exercise that would have been nice to do in the hands-on, but five nodes each for 100 people is far beyond our possibilities, so I ran it myself and I am showing you the results here. So, 30 k-points: if I divide 30 k-points into three pools, that looks fine, because I have three pools with 10 k-points each. This subdivision is well balanced, because 30 is exactly divisible by three, so the balance among the pools is fine. But what happens here? I am running the calculation on five nodes, so these three pools have to be mapped onto the nodes, and since I am using three pools and five nodes, a pool cannot fit entirely in one node. I have three pools and five nodes, which means that in order to accomplish the calculation, the nodes need to communicate with each other. It is much more clever to use five nodes and five pools, and you can see that the computational time is much shorter, because in this way I have divided 30 k-points into five pools, so each pool has six k-points, and I can fit one pool exactly in one node. In this way the nodes do not need to communicate with each other, because I never have a pool that is split across different nodes, and you can see that with this configuration the calculation goes much faster. So this is another trick, another suggestion I can give you: when you use pools with k-points, think about what you are doing. Remember that pools are effective at reducing communications among MPI processes, so the better you can fit the pools inside the nodes, and the more exactly you can do this, the faster the calculation. You should avoid having pools that are split across different nodes, because if one pool is split across two different nodes, those two nodes will need to communicate with each other and you will lose performance. So keep this as a suggestion for when you run your own production calculations on your own system: look at how many k-points you have and think about how many pools and how many nodes you want to use. If you do this in a clever way, you will see that the speedup is really significant. Let's check our calculations again. First of all, I see we still have five people who managed to do the exercise. Okay, I am still running with four pools.
Okay, now I am at the last iteration. Let's check whether there are questions. How do OpenMP threads and pools interact with each other? Yes, this is another good question, and since we are still waiting here, I think we can also speak about OpenMP. Maybe, Ivan, while you bring up the OpenMP slides, I can add one point. Yes, sure, please, Pietro. This is an interesting question that comes up often, so maybe I wasn't so clear. We mentioned the hierarchy of parallelization levels, right? Starting from pools, which distribute the work over k-points, so the Kohn-Sham equations at different k-points are solved at the same time. Then the wave functions themselves are distributed among many workers. Then there is a lower level of fine-grained parallelism, and that is where OpenMP sits: it is really used to squeeze out the remaining performance in all the cases where you cannot further subdivide the data we have been discussing, the work on different k-points and the wave functions themselves. So in that hierarchy it is one of the last levels; it was the purple box that we were calling fine-grained parallelism. And that's it. Please go ahead, because you have prepared very nice material on this OpenMP part. Yes, thank you, Pietro. So we also have OpenMP. Let's first look at it without pools, because at this point things risk getting complicated, since we have pools, MPI and OpenMP. So let's forget pools for now and focus on MPI and OpenMP, for calculations at the Gamma point. Here I have this example. Let me just say that with OpenMP you can see some speedup, especially for very large systems; unfortunately, it is something we could not show here. We will see something in the next hour, when we run the GPU computations, and you will try OpenMP there, but unless we use very large systems it is very difficult to see a speedup coming from OpenMP. So I ran the calculations for you and I am showing you the results here. Here we have a medium-sized system, 32 water molecules at the Gamma point, and I ran the calculations using one node, so 48 cores. So my computational facility here provides 48 cores, and I can choose whether to use these cores with MPI parallelization or with OpenMP parallelization. What is the difference? If I use 48 cores with MPI parallelism, the wave function is distributed among 48 processes, so the memory is distributed, it is sent to different MPI processes, and these processes have to communicate with each other. Otherwise, if I use OpenMP, I will have one single process, but on this process I will have 48 cores working. So please keep in mind that the basic difference between the two approaches is that with MPI I am dividing the memory: my data are divided into 48 smaller batches and I have one core working on each batch. So the memory is distributed: if I have 48 cores, my wave function, my data generally speaking, is distributed in 48 smaller batches, with one core per batch. Otherwise, with OpenMP, I have one single process, so my data stays together, but on that data I have 48 OpenMP threads working.
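To make the two extremes concrete, this is roughly how they would be launched on one 48-core node; the thread count is controlled by the standard OpenMP environment variable, and the executable variable and file names are the same placeholders as in the job-script sketch above:

```bash
# Pure MPI: 48 processes, memory split into 48 pieces, 1 thread each
export OMP_NUM_THREADS=1
mpirun -np 48 $QE -input scf.in > scf_mpi48.out

# Pure OpenMP: 1 process holding all the data, 48 threads working on it
export OMP_NUM_THREADS=48
mpirun -np 1 $QE -input scf.in > scf_omp48.out
```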
Quantum ESPRESSO has been natively developed around MPI scaling. The code has been written specifically for MPI because it is very important to distribute the memory when you want to treat large systems, large molecules: if the memory requirement is too high, you are simply not able to run the calculation, because you get an out-of-memory error. For this reason Quantum ESPRESSO is mainly parallelized using MPI, and for this reason you see here that the MPI scaling is more efficient than the OpenMP one. Pietro told you about Amdahl's law; here I tried to fit these data with Amdahl's law, and we see that the parallel portion, neglecting communication latency, so just focusing on the code, is about 95% with MPI, whereas with OpenMP it is about 76%. So we see that, since Quantum ESPRESSO mainly relies on MPI, MPI is more efficient in these calculations. But there is a but: if we go to larger systems, larger calculations, we see that OpenMP starts to become important. For example, here I have a larger system, 190 water molecules, while before we had 32, and I run on one node, so 48 cores. And I asked myself: how should I distribute the cores? On the leftmost part of the curve I have one process with 48 threads; on the rightmost part I have 48 MPI processes with one thread each; and in the middle I have different combinations. All these calculations have been done using the same total number of cores, 48, but on one side I am mostly using OpenMP, whereas on the other I am mostly using MPI. You may notice that this curve is not monotonic, and the minimum is somewhere around here. If I run the same calculation using two nodes, this minimum is more pronounced. So here: full OpenMP, full MPI. As I said, MPI, generally speaking, is more efficient than OpenMP in Quantum ESPRESSO, so I could say, okay, let's just forget about OpenMP and use only MPI. With two nodes that is still fine: here I have 96 processes, so all my cores are devoted to MPI. But I see that there is a minimum here. If I run with five nodes, so increasing the total number of cores that I am using, I see that the curve is clearly not monotonic, and here I have six nodes. For example, if we focus on the five-node curve, you see that there is a minimum here. With five nodes, this point is a full OpenMP calculation, which means one MPI process per node, five MPI processes, each with 48 threads; this other point is a full MPI calculation. But the best option in this case is this one: 60 MPI processes with four threads per process. Why is that? Because, as we said, MPI is very good, it can distribute memory and it is very efficient, but it has the drawback that it involves communications. The more MPI processes we use, the more the communications grow, and if we use too many processes, the communications start to reduce the computational efficiency. So on one side we have communication, on the other side we have workload: increasing the number of MPI processes, the workload for each core is reduced, whereas the communication increases. If you keep adding MPI processes, the workload per process becomes very small and the communication grows faster. For this reason, it is not advisable to push MPI to the extreme limit.
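Just to make the Amdahl fit quantitative: with parallel fraction p and N cores, the ideal speedup and the values implied by the two fitted fractions on one 48-core node are roughly the following (illustrative numbers, communication overhead neglected as stated above):

```latex
S(N) = \frac{1}{(1-p) + p/N},\qquad
S_{\mathrm{MPI}}(48)\big|_{p=0.95} \approx 14,\qquad
S_{\mathrm{OpenMP}}(48)\big|_{p=0.76} \approx 4.
```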
But whenever you reach the point where communications start to become important, then if you have more cores it is better to exploit them using OpenMP, because OpenMP still retains a certain degree of parallelism in the code but reduces the communications between MPI processes: in one case we have 240 processes that need to communicate with each other, in the other only 60. Of course there is no general, universal recipe for these calculations; partly it comes with experience, and partly it is something you can reason about. It also depends on the test case, because with small test cases you will not see anything, and this is the reason why we did not put any exercise on this. But when you move to large test cases you have to start thinking about what you are doing. If it ever happens that you need to use OpenMP, in the job script you just change this variable, export OMP_NUM_THREADS: from one you set it to two, three, whatever you want, and you can couple OpenMP with MPI. You will see this with the GPU calculations in a while; a sketch of both options follows below. So yes, I think we now have to move ahead. Let me check here: okay, 11 people. My calculation has now finished, so we can compare the computational times: one pool, nine minutes; two pools, eight minutes; four pools, seven minutes; eight pools, six minutes. So we are reproducing more or less this plot. I see that some of you have done the exercise, so I think we can go faster now, because I want to speak about GPUs; I will just show these other exercises and we will not do them together. The other thing Pietro was telling us about was parallel diagonalization. This is another case where you will see an effect only for very large systems, because basically what we have to do is diagonalize the Hamiltonian matrix, and what we can do with parallel diagonalization is distribute the matrix among different processes so that we increase the speed of this diagonalization. Of course, as usual, we have the drawback of communications: you have to spend time finding a good way to split the matrix, then you have to send a portion of the matrix to the different processes, they have to do their work, and they have to communicate. So this pays off for very large systems, and that is not the case for our test case. I will show you now how to run the calculation; if you want you can try it, but I will go fast, otherwise we will not do anything on GPUs. You just have this option, -ndiag 4, which you add on the command line when you launch pw.x, and it means that we are assigning four MPI processes to the diagonalization of the Hamiltonian matrix. So the matrix will be divided into pieces, each piece will be sent to one process, and we will have four processes working at the same time to get the eigenvalues. This is the way you launch the calculation, and in the output file you will see the message about subspace diagonalization. Of course -ndiag 4 means two times two, because we are dividing a matrix: four means we will work on a two-by-two grid of blocks.
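Putting together the two knobs just mentioned, the OMP_NUM_THREADS variable in the job script and the -ndiag option on the command line, a hedged sketch of the two launches discussed above could look like this (the process and thread counts are the example values from the slides, and the file names are hypothetical):

  # hybrid run: 60 MPI processes with 4 OpenMP threads each (240 cores in total)
  export OMP_NUM_THREADS=4
  mpirun -np 60 pw.x -inp pw.in > pw_hybrid.out

  # parallel subspace diagonalization: 4 processes arranged on a 2x2 grid
  mpirun -np 16 pw.x -nk 4 -ndiag 4 -inp pw.in > pw_ndiag.out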
And here, if you try the calculation, you will find something similar to this, in the sense that, of course, this is a small system so we do not see any significant speedup. Here we are using four pools, and with four pools plus -ndiag 4 the computational time is actually a bit higher. This is because the system is too small, and also because we are using the internal Quantum ESPRESSO routines for the parallel diagonalization; there are other libraries, like ScaLAPACK or ELPA, which Pietro mentioned before, that can do this parallel diagonalization really efficiently. I think it is better to skip the exercise now, or at least I will just show it: if you want to run this calculation, you take your job script, you can comment this line, and it is enough to change the four here and the four here, and we keep track of it in the log file. Then you submit the job and wait a few minutes for it to finish. I prefer to move to the GPUs now. If you want to do it, you have time: the hands-on finishes at 12:30, but you will have access to the machine and you can run calculations until 1:30, so you will have one hour more. So I suggest we go directly to the GPU part, and if you want you can do this exercise at the end of the hands-on. So let's move on: this was OpenMP, we discussed it; let's move to the GPU part. I see nine people, only nine people managed to do that so far, and I think other people are still running, right? Okay, we have 12 people running and 13 people who managed to do the exercise. Good, you can lower your hands now. For those who are still running, just wait and keep the jobs running; I think the exercise is clear and it is really better to move to the GPU part. So please leave your calculations running on the cluster, you will see them later, and look here at the slides again.

For the GPU part, again, the three things to keep in mind are the compiler, the linear algebra libraries and the fast Fourier transform libraries. As Pietro was saying, for compiling Quantum ESPRESSO for GPUs we only have the nvfortran option, because at the moment this is the only compiler which supports CUDA Fortran, the language in which the GPU version of Quantum ESPRESSO has been written. nvfortran comes with its own libraries for linear algebra, called cuBLAS: BLAS is the usual name and they added this "cu" in front. The cuBLAS libraries run the linear algebra operations on the GPU cores, and the same applies to the FFTs: nvfortran comes with its own version of the fast Fourier transform libraries, called cuFFT. The main point to keep in mind when working with GPUs, also when running calculations, is that each MPI process sends its own workload to one GPU. MPI processes receive a portion of the data and work on it; if they have GPUs, what they do is receive this portion of the data but, instead of processing the data themselves, they send it to the GPU. This operation is usually called offloading, or data transfer, because you really transfer the data from the RAM, the memory of the CPU node, to the memory of the GPU. So it is really a transfer of data that is then written in the GPU memory.
The important thing to keep in mind is that the MPI processes receive their part of the data and send it to the GPU; the GPU does the calculation, and then the data return to the CPU. Here we can see the same concept with the wave function. The wave function is parallelized over the plane-wave components; assume that we have two GPUs. The good practice is to use two MPI processes: we divide the wave function between two MPI processes, each process offloads its calculations to one GPU, and the processes communicate with each other. If you use more than two processes, you have to remember that the calculation is still done by the GPUs. So yes, you can also use four MPI processes, but if you have two GPUs each of these processes will send its calculation to one of the two GPUs. Sometimes you may observe a very, very small improvement, but most of the time it is negligible, and you will find that two, four or six MPI processes give just the same time, because the computational burden is now on the GPUs and you still have two of them: the pivotal part, the most important part, is done by the GPUs. In the worst cases, if you use too many MPI processes, they will even worsen the computational time because of the communications, so you would really be increasing the computational time rather than reducing it. The good practice is therefore to use a number of MPI processes equal to the number of GPUs. Here we have exercise three, for which I wrote a few pieces of code just to let you see what is happening, but I think we can go fast here, because it is probably better to run calculations with GPUs. Let me just tell you how it works: in the folder, if you want to check later on, you will find three Fortran codes which basically run a DGEMM, so a matrix-matrix product. The CPU code runs it on the CPU cores, the GPU code runs the same DGEMM on the GPU, and the mixed code is something I am going to describe. If you run the DGEMM on the CPU, and you can try it later on, you will find that it takes 63 seconds; exactly the same calculation on the GPU takes less than one second. This is to give you an idea of how fast the GPUs are. In this case, Pietro and I worked out that it is a factor of about 20, which is similar to the speedup expected in terms of teraflops for the Marconi 100 architecture. So the full CPU code for a matrix-matrix product takes 63 seconds, the full GPU code less than one second. What happens in Quantum ESPRESSO? Of course this is an ideal case, because as I said you also have to consider the time for moving data: each MPI process offloads data to the GPU memory, so you have to add this communication, this offloading time, to the overall computational time. If you do this, you will see that the total time for the DGEMM with the GPU is higher than the ideal one, because in the ideal case we count only the time for performing the operation, while here we are allocating the matrix on the CPU, moving it to the GPU, and then performing the calculation.
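If you want to reproduce the little DGEMM comparison yourself, a possible way to build and run the two extreme cases is sketched below. The file names are hypothetical, and while -cuda and -cudalib are the usual nvfortran switches for CUDA Fortran and the cuBLAS library, the exact flags and the BLAS linking may differ on your system:

  # CPU version: plain Fortran calling the host BLAS dgemm
  nvfortran -O2 -o dgemm_cpu dgemm_cpu.f90 -lblas

  # GPU version: CUDA Fortran offloading the product to cuBLAS
  nvfortran -O2 -cuda -cudalib=cublas -o dgemm_gpu dgemm_gpu.f90

  ./dgemm_cpu    # about 63 seconds in our test
  ./dgemm_gpu    # well under one second for the product itself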
So what is happening is that here, but also in Quantum ESPRESSO, we do not get the ideal time but something worse, because we have to consider communications, and beyond the communications among MPI processes, when working with GPUs we also have the offloading: the MPI processes have to move the data to the GPU memory. Anyway, going from 63 seconds to 0.3 is still a huge computational speedup. So now we can go faster and see how to compile Quantum ESPRESSO for the GPU. You already know how to do this, because we do exactly the same operations that we did for the CPU. We go to the Cineca scratch directory, we go inside the exercise four directory, and we download this version of Quantum ESPRESSO; you see that in this case we are downloading the GPU version. For this hands-on we use version 6.7, in which the CPU and GPU repositories were still kept separate, but in the newer releases you will find that the code has been merged, so in the future there will be one repository and you will download a single source code that can be compiled for both GPUs and CPUs. At the moment, though, we still rely on Quantum ESPRESSO GPU 6.7. So you download this version of QE, you untar this file, the same operation you did for the CPU installation, you rename the directory for your convenience so that the name is clearer, you go inside the newly created directory, and here you load the module files. We are still loading the same modules as before up to here: HPC SDK, where nvfortran is included, Spectrum MPI for the MPI parallelism, FFTW because these are the FFTs we are using; ah, a question has come in, let's look at it in a moment; and CUDA here, because this is the real difference between the CPU and the GPU environment setup, since CUDA is what allows us to really use the GPUs. Once you have loaded these module files, you can run configure, and you see that in this case the configure command is a bit more complex: we still have the PGI prefix, basically telling configure which compiler and MPI wrapper to use, but here we also enable OpenMP, which we did not do before, and we specify that we are using CUDA, in particular CUDA 11.0 because this is the module that we loaded here, and that our GPUs have a compute capability of 7.0, passed as 70. This number depends on the hardware, so on the underlying GPU that you have: in this case Marconi 100 is equipped with Volta V100 GPUs, which have a compute capability of 7.0. Older generations of GPUs may have lower values, 6.0 for example, and the following, newer generations will have a higher compute capability. You run configure and you check again that all values have been picked up. I can do this now: I go into exercise four, I first do module purge, so I make sure that no leftover modules interfere, then I load the modules, HPC SDK, Spectrum MPI, FFTW, CUDA; if I list them now I see that they have been loaded; and I run configure in this way, configuring my GPU version of Quantum ESPRESSO. And yes, here we have a good question: "I didn't find the FFT libs; is that because cuFFT is not on Marconi and will be internally compiled?" Yes, we will talk about this in a moment. I guess it is going slowly because everyone is doing this configure now, I hope. So could you please raise your hand?
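To summarize the environment setup and the configure step just shown, here is a minimal sketch; the module names follow the ones used on Marconi 100 in this hands-on and will differ on other clusters, and the CUDA path variable is an assumption:

  # load the GPU toolchain (module names are cluster specific)
  module purge
  module load hpc-sdk spectrum_mpi fftw cuda

  # GPU build of QE 6.7: OpenMP enabled, CUDA runtime 11.0,
  # compute capability 7.0 (passed as 70) for the Volta V100 cards
  ./configure --enable-openmp \
              --with-cuda=$CUDA_HOME \
              --with-cuda-runtime=11.0 \
              --with-cuda-cc=70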
Now many of you are configuring the GPU version of QE; could you please raise your hand so I can see? Okay, 15, 17, good, we have about 20 people configuring, that's great. Usually it is much faster, but of course we are many people running on the same node. Okay, now my configure has finished. You can see the BLAS libs here, because of course it found the BLAS libraries, and here we see LAPACK again; these two together are for the linear algebra. And we see that these LAPACK libs have been taken from the HPC SDK, so we are using the internal libraries that come with nvfortran, which is good because we need the GPU version, and the GPU version of these libraries is included with nvfortran. That is why it is good to see HPC SDK here. We do not find the FFT libs here, exactly because of the point raised in that question before: the FFTs are now done by the GPUs, and it is true that we loaded FFTW3, but those libraries are for the CPU. The FFTW3 setting has effectively been overridden by the fact that nvfortran has its own version of the FFT libraries; indeed FFTW3 does not appear here because we will be using the GPU FFT libraries, which are internal to nvfortran, and that is why they are not listed here. So this is the configure; you should have found something similar to this. I think we can skip the compilation step, which is just a mechanical procedure at this point; I will not launch this command because we are running late, but of course you can try later on: make -j4 pw will compile your GPU version of QE with the settings that we have just chosen.

So let's move to exercise five, where you see how to run the calculations. The procedure is just the same as before: we compile, and finally we test the executable, and you will find this error message, which simply means that the compilation worked. So let's do exercise five now, in which we see how to run a GPU calculation on the cluster. This is the batch script; you can take it from the exercise five folder, where you again find the readme with all the instructions, and the job scr... sorry, the job script, the same one I have pasted here. You will see again two parts: the batch part, where we ask for the resources to run the calculation. In this case we are asking for one node again, but two tasks per node, because we are asking for two GPUs. In this architecture each node is equipped with four GPUs, but each of you will use two of them, otherwise there is no room for everybody; so you ask for two GPUs per node, and for this reason we will be using two MPI processes for the two GPUs. This is the main thing to keep in mind. Then we have the Quantum ESPRESSO executable here: if you have your own compiled version of QE you can put your path here, otherwise you can leave it untouched, because it points to the folder where we compiled the version yesterday, so you can use that one. And here we have again the OpenMP setting, in this case one thread, and then the mpirun line with the usual command, one pool, one ndiag, and off we go. The nice thing now is that you can submit the job script as it is; the input file is exactly the same one we ran in the previous exercise on the CPU. Remember that before this input file took about ten minutes, in my case nine minutes and something.
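For reference, the job script described above boils down to something like the following sketch; it assumes a SLURM scheduler, and the partition, account and time limit are placeholders you would adapt to your own allocation:

  #!/bin/bash
  #SBATCH --nodes=1              # one node
  #SBATCH --ntasks-per-node=2    # two MPI processes ...
  #SBATCH --gres=gpu:2           # ... because we ask for two GPUs
  #SBATCH --time=00:30:00
  #SBATCH --partition=<partition>
  #SBATCH --account=<account>

  module load hpc-sdk spectrum_mpi fftw cuda

  export OMP_NUM_THREADS=1       # one thread per MPI process to start with

  # one MPI rank per GPU, no pool or diagonalization parallelism
  mpirun pw.x -nk 1 -ndiag 1 -inp pw.in > pw_gpu_1thread.out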
So let's see how much time it takes now. You can do this too, of course. 25 people, so I guess you are trying; some of you still have previous calculations running from the CPU exercise, but I think many people are trying the new one. In my case the calculation has already finished, so if I check the output file now: one minute, seven seconds. Let me also open the other one, from exercises two and one, yes, this one. So here on the left you have the CPU test and on the right you have exactly the same calculation run with the GPUs. First of all, when you run with GPUs you find this string, "GPU acceleration is active", which confirms that the calculation has effectively used the GPUs; if I search for the same string in the CPU output, of course I do not find anything. You also find here the execution times on the GPU. But the most important thing is that here it takes one minute and here it takes ten minutes, just by using GPUs, and we used two of them. Let's scroll up: in the CPU case we used 16 processors, that is, 16 MPI processes, and here we use two MPI processes; but in the GPU case these two MPI processes have offloaded the calculation to the GPUs, which are much faster, and for this reason here we have one minute instead of the nine minutes there. So I guess it is now much easier to complete this exercise. What you can do now is add OpenMP threads, so you can try them. The point here is the following: you can change this export here, put two and run the calculation, then put four and run it again. Why would you do this? Let me compare again: here we each have 16 processor cores available, because this is the amount of resources that we ask for in the job script, but we are not using all of them, because we are bounded by the number of GPUs. So in this case we have 16 CPU cores available, but we are using only two of them, because we have only two GPUs. What you can do is use OpenMP to exploit the remaining CPU cores that have not been employed. We cannot simply use 16 MPI processes here, because we would mess things up with communications, since we have only two GPUs; so we use two MPI processes, maybe you can try with four, but if you use too many MPI processes you will spoil things with communications. You have to keep the number of MPI processes tied to the number of GPUs, or approximately so. But we still have 16 cores here, so in order to exploit the remaining cores it is a good idea to use OpenMP. What you can do now, since these calculations are also quite fast, is change the OpenMP setting here and ask for two, four, eight threads; remember to change the output file name accordingly, so if you set two threads here, remember to put two here as well, the number of threads. I submit this calculation again. I hope that each one of you at least tried to run the calculation with GPUs, because I think it is very interesting to compare the time between the CPU and the GPU execution.
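A quick way to make that comparison and to verify that the run really used the GPUs is to grep the output files, along these lines (the file names are hypothetical, and the exact wording of the acceleration message may vary between QE versions):

  # confirms that the run offloaded work to the GPUs
  grep -i "GPU acceleration is active" pw_gpu_1thread.out

  # total wall time reported at the end of the run, CPU versus GPU output
  grep "PWSCF" pw_cpu.out pw_gpu_1thread.out | grep WALL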
Now my calculation has finished, so I can compare one thread with two threads. The first thing we see is that with one thread we were running on two processor cores, two MPI processes with one thread each: so we are using two GPUs and two CPU cores, and the CPU cores host the two MPI processes. Here instead we are using four cores, with two MPI processes and two threads each. The number of MPI processes is the same in both cases, because this number is bounded by the number of GPUs; I repeat it again: with two GPUs it is better to use two MPI processes, or you can try four, but you cannot go much beyond that. What you can do is use the rest of the CPU cores with OpenMP threads. Okay, in this case it actually took more time than with one thread; the test system is small, so let me show you the final plot. Here we compare the computational times with the CPU, the black bars, and with the GPUs for different numbers of pools and different numbers of threads. Of course this test system is small, so as I said at the beginning you are not seeing any really significant effect from these parameters; what you can observe is the great advantage of using GPUs in general, because with GPUs, in whatever configuration, the execution time is almost one order of magnitude lower. Then, going from one pool to two pools we see some improvement, because all these bars are a bit smaller than the calculation with one pool, while with threads we are not able to see significant advantages; maybe with four pools the red bar with one thread is a bit higher than the ones with two, four, eight threads, so there we can see something, but honestly in this case the speedup from threads is very small, and it is something you only really see with very large systems. What I said regarding pools is true for GPUs as well: if you distribute the wave function over the plane waves, the MPI processes need to communicate with each other, whereas using pools the communications are reduced. So this is the end of the hands-on material. Since so far I have said many times "you would see it with large systems", I think it can be useful to show you some benchmark cases on large systems, to show the performance. This is quite a large system, 500 electrons; it is thallium oxide or tantalum oxide, I don't remember. Here we see the scaling with the number of nodes, so using more GPUs: we used one, two and four nodes, and each node here is equipped with four GPUs. We see that the time to solution is drastically reduced as the number of nodes increases. And here the same with pools: we use one pool, two pools, up to four pools, and we are running on four nodes, so when we use four pools on four nodes the communications are really reduced, and for this reason we see a significant speedup using pools. So what we saw with CPUs is true with GPUs as well. And here we have an even larger system: in the previous case we had 500 electrons, here we have four thousand electrons, and again we see a very impressive computational speedup going from two nodes to six nodes. We have to remember that these nodes have eight GPUs per node; earlier I may have said four, but each node here has eight GPUs. I have to say that these calculations have been done with the newest generation of GPUs, the Ampere family, which has been very recently released by NVIDIA, and they have very good performance on double-precision floating point.
So here again we see the scaling, varying the number of MPI processes and pools, using six nodes: if you use a good combination of pools and MPI processes, you can really reduce the computational time drastically. And here we see another test case. This is a very interesting molecule, a carbon nanotube functionalized with porphyrins. This is a molecular system that is really production quality, something you can really use for publishing articles and doing interesting science, and it has five thousand electrons. It is gamma only, so in this case we cannot exploit pool parallelism, but of course we can exploit GPUs. One SCF iteration on this giant molecule, in our benchmark, took 90 seconds running on more than 3,000 CPU cores; so running on regular CPU nodes, one SCF iteration on this very large molecule with 3,000 CPU cores takes 90 seconds, and if you need, say, 10 or 20 iterations, that is 900 seconds or more. The same single iteration can be done in 30 seconds using 144 Volta GPUs, which are the same GPUs as on the Marconi 100 cluster we are using now, and in about the same time, roughly 30 seconds, using only 24 GPUs of the new Ampere family. So 3,000 CPU cores here, 144 Volta GPUs here, and 24 Ampere GPUs there. I think this gives you an idea of what can be done when all these parallelization strategies of Quantum ESPRESSO are used properly, in a smart way. I hope I have given you a few hints about how to use Quantum ESPRESSO and how to exploit HPC hardware and GPUs, and I really thank you for your attention. And yes, if you have questions, yes, thank you. Okay, I don't know now whether the raised hands are for questions or because you did the exercise; I think I confused myself because I used the raised hands for both. So if your hand is raised because you have done the exercise, please lower it, whereas if you want to ask a question, keep it raised. Okay, I guess we have two questions here; maybe the two of you with raised hands want to speak? Sorry, actually I raised my hand because I was doing the GPU calculation. Oh, okay, sorry, it was my fault because I mixed things up with the raised hands. We also have a question from the Zoom chat, and it is a nice one: what would the CPU do if the linear algebra and FFT are done by the GPU libraries? So what is the CPU doing? Good, this is a good question. I have to say that there are small parts of the code that have not yet been ported to GPUs, because what we did was to port the most computationally intensive parts of the code to the GPU, but other parts of the code remain on the CPU. So there are two cases. When the GPU is working, because the code is processing a part that has been ported to the GPU, honestly I do not know exactly what the CPU is doing; I don't know whether it is just running idle or waiting. But it is not so important, because the main computational part, though not all of it, is done by the GPU. Then there are the parts of the code that have not been ported to the GPU, and also parts for which porting is not very important because we are happy with the CPU performance.
So in that case the CPU is still working, and if you linked some linear algebra or FFT library, it will be used for that part. Yes, there are a few such portions, and one of them is the FFT. We have said many times that the FFT is done over and over, so a lot of attention has been focused on it, and that is one of the few portions of code where, while the GPU is computing, the CPU is doing other things, basically taking care of the communication. So that is one portion of code where the CPU and the GPU overlap, doing useful work at the same time. But then, as Ivan already said, sometimes the CPU is just waiting idle for the GPU to finish. Of course, in general you may want to exploit both the CPU and the GPU at their best at the same time, but it is not always easy, and also not always really useful: on Marconi 100, for example, the CPU is quite weak while the GPU is very fast, so the CPU would really contribute very little. There is a second, very important question, I think, and it is probably due to some misleading number that I gave during my presentation, so please, Ivan, correct me. About npool on GPUs: should we be careful to pick an integer fraction of the number of cores of the GPU this time, or should it still be an integer fraction of the number of CPU cores? I clearly created some confusion by mentioning GPU cores, so we have to clarify this point, which is very important. Well, yes: the point is that when you are using pools, what you have to care about is the MPI processes, and if you are using GPUs, the number of MPI processes is essentially bounded by the number of GPUs. So you start from how many GPUs you can use, and that fixes the number of MPI processes; then you check the number of k points and see how many pools you can use in order to exploit your architecture. When you think about how to run a calculation with pools, you just have to think about the MPI processes, and if you have GPUs, the number of MPI processes should, to a first approximation, be equal to the number of GPUs. The number of GPU cores is something deeper inside the hardware: as a user you never see the GPU cores, because they are inside the GPU and they are the reason why the GPU is that fast. From your perspective as a user, and even from our perspective as developers, the relevant object is the GPU as a whole. Inside the GPU you will find cores, usually organized in multiprocessors or similar structures, but we never work explicitly with the cores inside the GPUs, especially when you are just running calculations. So from your point of view you just look at how many GPUs you have, and then you decide how many MPI processes to run. Pietro, if you want to add anything else, please. No, it was just perfect, and sorry for the confusion: mentioning the number of workers on the GPU was just to highlight the fact that, in general, GPUs require more data than CPUs to run efficiently, so sometimes for a small input you do not gain much from GPU acceleration. That was the whole point; so forget about GPU cores, please.
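In practice, the rule of thumb just described can be applied mechanically: first count the GPUs, which fixes the MPI ranks, then look at the k points to choose the pools. A hedged example, assuming two nodes with four GPUs each and an input with eight k points:

  # 2 nodes x 4 GPUs = 8 GPUs  ->  8 MPI ranks, one per GPU
  # with 8 k points we could use up to 8 pools; here we pick 4,
  # so each pool (2 MPI ranks) handles 2 k points
  export OMP_NUM_THREADS=4     # use some of the leftover CPU cores as threads
  mpirun -np 8 pw.x -nk 4 -inp pw.in > pw_gpu_pools.out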
Yes, there is probably a final question. I don't know if we still have time, maybe we are a bit late. Well, maybe we have time for one or two more questions. I think you are the expert in this field, so this question is really for you: "I'm curious about band parallelization. I've never obtained a speedup from using it on my systems. Could you comment on what type of calculations this kind of parallelization is useful for?" Yeah, there are two main aspects. The first point is that, in order to see real advantages from band parallelization, you need to have a huge number of Kohn-Sham states, otherwise it is worthless. The second point is that there is indeed a band parallelization in Quantum ESPRESSO, but I think it is not really distributing the memory among the band groups: it only divides the workload, not the memory, so I think it has suboptimal efficiency at the moment. And it is used for exact exchange, right? I think band parallelism becomes very convenient with exact exchange. Yes, because in that case you have to process many pairs of bands to compute the exchange integrals, and, as far as I know, even in that case you will see benefits only when you run on properly large architectures and large molecular systems. Let me add a few practical things. If you want, you can keep doing the exercises on the Marconi 100 cluster until 1:30, half past one, Central European time, so you still have 45 more minutes. You can just take the readme files, read them, and you are allowed to submit jobs until half past one; so if you want to redo the exercises more slowly, or do the ones we had to skip or that I went through quickly, you can do that, and if you have problems I will answer you on the Slack channel. Regarding the data, if you want to save the data you have been working on directly from there, your account is available until Sunday, so in the next one or two days, or maybe better if you do it now, you should copy the whole directory to your local machine, your laptop or a virtual machine. In order to do this, maybe I forgot to write the command in the readme files, but I can write it to you in the chat: it is enough to use scp -r; let me do it. Okay, I wrote it in the chat. The command is scp, which stands for secure copy: cp copies files on your local machine from one point to another, scp copies over the network; sorry, the first one I wrote was missing something, scp -r and then the user, this is the right command. So you open a shell on your virtual machine or your laptop and you run scp -r, where the user is the username you have been using on the Marconi 100 cluster, then the at sign and the Cineca login node, so you give the host name here, and then you copy this directory, day nine, which is the directory in which you did the exercises, and the final dot means here, the current directory. Yes, I will also put it on Slack as soon as we close. So you have time until Sunday to copy all your files: you can keep working until half past one, and then you have two more days to copy your directories and files back to your local machine, so you can keep them.
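Written out explicitly, the copy command described above looks like this; the login host name and the directory name are placeholders, use the ones you have actually been using, and run the command from your laptop or virtual machine, not from the cluster:

  # copy the whole exercise directory from the cluster to the current local directory
  scp -r your_username@<marconi100-login-host>:<exercise-directory> .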
I don't see any other questions; maybe we can close. Well, no, there is another one, maybe, I don't know if we are running out of time: how to run a similar job on the GPUs that my institution has, what should I suggest to the person in charge? Well, I'm not sure, but certainly this means that the people at your institution can help you write a job script that allocates and starts a number of MPI processes, and also allocates a number of GPUs for your job. So I think that if you tell your sysadmins, the people in charge of the cluster, that you need to run a job with, say, four MPI processes to use the four GPUs that you have, they can help you set it up. And by the way, for those who asked about GPU performance for Quantum ESPRESSO, I linked a Wikipedia page with, I think, the entire list of NVIDIA cards, and in one of the columns on the right there is the performance in single and double precision; those two numbers are the ones you want to check. You can compare the single and the double precision performance and immediately realize that there are two groups of cards: those where this ratio is about one third, and those where the ratio is much larger and the double precision performance is much weaker. The latter are the cards from which you will not get much for Quantum ESPRESSO; this was a question asked before. Yes, I see maybe two more raised hands. Yes, thank you, I have this question: I have been working recently with this code, WannierBerri, written by Stepan Tsirkin, and he says that he uses a combination of slow Fourier transforms and fast Fourier transforms. I would like to ask you, I don't know if this is the place, why is this better than using only fast Fourier transforms, and is this kind of machinery implemented in Quantum ESPRESSO? I didn't get what the alternative to the fast Fourier transform was. He says he used a combination of FFT and DFT. Yes, the discrete Fourier transform, which is the slow one; why did he use this combination? Perhaps I just wanted to ask because maybe you know. Well, honestly, what I can tell you is that this is not done inside Quantum ESPRESSO, and I don't know exactly the particular problem that is being optimised with this approach, so I would prefer not to give a precise answer, because I don't really know it. What is sure is that sometimes even the best optimised libraries and approaches may not be the best option, depending on the problem that you want to solve. Let me give you an example from a different context: matrix multiplication is something that everybody does, and it has been optimised for so many years, by so many people, that eventually we have a number of libraries that, depending on the system you are considering, can do a matrix multiplication very well; with those you basically get the peak performance, the theoretical performance of your CPU, for example. But if your problem has a peculiar size, say very tall and very thin matrices, so skinny matrices, then you may gain something from avoiding the general implementation and using a specific implementation for your specific problem. Honestly, I cannot comment more because I don't know the details of this particular case; maybe someone who knows this code can say more. Okay, I was just curious because this code is a thousand times faster than the post-processing of Wannier90, so I was curious why it is so fast. Thank you, thank you very much, it was very, very helpful. Thank you. Maybe we have to close now; if you have more questions, you can write to us on Slack and we will answer there.
And yeah, that's all. So I think we can close the Zoom connection now. Thank you and bye-bye. Thank you. Bye-bye. Bye. Thank you.