Allgather does the same, but then all the processes get the information. So allgather is like gather, but instead of the information going to a single process, all of them have it. And all-to-all is basically a matrix transpose: if at the beginning each process has one row of the data, then after the all-to-all each process will have one column of the original data. So this row here becomes a column here. And all-to-all is the collective call whose communication really is quadratic: the number of messages grows as the square of the number of participants.

So what do you need to know about MPI? MPI is the most important thing that you need to look into, and it is something you will have to work with every time you sit at an HPC system. It is available on all kinds of hardware, including laptops; it is not limited to HPC systems. All compute nodes in the HPC system participate in a pool, and MPI decides to which nodes and cores each copy of the process goes. Each copy of the process is identified by a number called its MPI rank. The rank is usually written in the log file, so it may help trace some kinds of faults. And similar to thread affinity, the MPI system may have a way for you to specify process pinning: which rank goes to which core, which node.

The program is usually run by the mpirun command: you specify how many copies you want and then the command itself, for example mpirun -np 4 ./my_program. We'll see it later in action. On an HPC system this is usually not the case, because it is handled by a job scheduler like Slurm, but in any case this command will appear somewhere.

Now, the cores of a node can also be used for MPI tasks instead of OpenMP threads. So if we have four nodes with 16 cores each, we can do the following. Let's say I want 16 threads on four processes: then I only have four distinct processes, but each of them runs 16 different threads.
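As an aside, to make the all-to-all transpose concrete, here is a minimal sketch in plain Python with no MPI involved; the function and variable names are illustrative, not part of any MPI API:

```python
# Sketch of the semantics of an all-to-all exchange among n processes.
# Process i starts with row i of the data and sends its j-th element to
# process j, so afterwards process i holds column i of the original matrix.
# A naive implementation needs one message per (sender, receiver) pair,
# i.e. on the order of n^2 messages -- hence the quadratic cost.

def all_to_all(rows):
    n = len(rows)
    # result[dst] collects element dst from every source row: a transpose
    return [[rows[src][dst] for src in range(n)] for dst in range(n)]

data = [[10, 11, 12],
        [20, 21, 22],
        [30, 31, 32]]
print(all_to_all(data))  # [[10, 20, 30], [11, 21, 31], [12, 22, 32]]
```

In real MPI this is a single call (MPI_Alltoall), but the data movement it performs is exactly this transpose.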
Or the other way around: I have 64 different processes, each of them running a single thread. These are both possible, as well as all the intermediates like 2 × 32 or 4 × 16 or something else. Usually the first one is more beneficial. The rule of thumb here is: first try running as many MPI processes as there are nodes, and inside each node run as many threads as there are cores. That is usually the most beneficial, but not always; you have to try.

So what do you need to know? You need to know the number of nodes available, and again the core and socket configuration. You need to know how to run MPI jobs, particularly on the HPC system you are using: the batch scheduler syntax, and how to query the scheduler for the status of your jobs. You need to know your software. Does it support multi-threading? If yes, what is usually most beneficial is, as I said, the number of ranks equal to the number of nodes and the number of threads equal to the number of cores per node. If it doesn't support multi-threading, then the number of ranks is simply the total number of cores. And lastly, you have to know the limitations of your software and its scalability, how it scales, so you don't waste any resources.

Also, as I said previously, there are sometimes GPUs on some of the machines, and a compute node may have other accelerators. The main problem with GPUs is that they don't have access to the same RAM as the CPUs. So if they need to work on a task, you have to transfer the data from the CPU to the GPU, and that piles up on top of the MPI transfers that are already needed between the nodes. Some systems support direct GPU-to-GPU transfers, which might be faster, but not always; still, you need to ask for it. If you're on a system that has multiple GPUs or multiple nodes, you need to check whether the software supports direct GPU-to-GPU transfers, because that can sometimes be faster.
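The possible rank-and-thread decompositions mentioned above can be enumerated; this is a minimal sketch assuming a homogeneous cluster where the threads of one process stay within a single node (the function name is made up for illustration):

```python
# Enumerate MPI-ranks x OpenMP-threads combinations for a cluster of
# `nodes` nodes with `cores_per_node` cores each, using every core once.
# Threads of one process share memory, so they must stay within one node.

def decompositions(nodes, cores_per_node):
    total = nodes * cores_per_node
    combos = []
    for threads in range(1, cores_per_node + 1):
        if cores_per_node % threads == 0:  # ranks pack evenly onto nodes
            combos.append((total // threads, threads))  # (ranks, threads/rank)
    return combos

for ranks, threads in decompositions(4, 16):
    print(f"{ranks} MPI ranks x {threads} OpenMP threads each")
# Runs from 64 x 1 (pure MPI) down to 4 x 16: one rank per node with as
# many threads as cores -- the rule of thumb to try first.
```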
Now, this last question, the limitations of the software and its scalability, is a very important one, and I will now turn my attention to the last part of this lecture, which concerns strong and weak scaling.

Let's first say that an algorithm is a sequence of steps, not much different from a cooking recipe. Some of the steps are independent of one another, so they can be executed in parallel. By independent I mean: step A is independent of step B if step B cannot influence step A's input in any way; there is no data dependency between them. The portions of the algorithm that can be parallelized are called the parallel regions, and the rest are called the serial regions. Some things just cannot be parallelized; a blunt example is that if you take nine pregnant women, they cannot deliver a baby in one month, no matter what you do.

Now, the more time an algorithm spends in parallel regions, the better it is suited for HPC, because in the end that makes it more scalable. When we're talking about scalability, we usually introduce something called speedup. The speedup is S = T1/Tn, where T1 is the time for running on one processor and Tn is the time for running on n processors. So say something takes 10 seconds when I run it on one core, but one second when I run it on ten cores; then the speedup is 10. The ideal situation is when S equals n: if the speedup is equal to the number of processors, we have so-called linear scaling, and this is the best we can hope for. But this is rarely achieved in practice. Sometimes, for some small n, S is actually bigger than n, so the speedup is, let's say, 110%; this is due to some kind of caching here and there, and it's fun, but as I said, this usually happens only for very small n, like two or so. Now, Amdahl's law is something that tells us what speedup we can expect from a program in which p is the portion of the parallel regions.
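The speedup definition from a moment ago, with the 10-second example, fits in a couple of lines (plain Python, names illustrative):

```python
# Speedup as defined above: S = T1 / Tn.
def speedup(t1, tn):
    return t1 / tn

s = speedup(10.0, 1.0)  # 10 s on one core, 1 s on ten cores
print(s)                # 10.0 -> S equals n: ideal linear scaling
print(s / 10)           # parallel efficiency: 1.0, i.e. 100%
```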
Let's say that 70% of the program is parallel; then the speedup we can expect is given by Amdahl's formula, S(n) = 1 / ((1 - p) + p/n). As you can see, it depends on p and on the number of processors n. This is valid for fixed workloads: we have a problem that we keep fixed, and we try to throw more processors at it. So we try it on one processor, then the same problem on two processors, on four, on eight, and so on. The dependence of the speedup on the number of processors for a fixed workload, for the same problem we are trying to solve, is called strong scaling.

If you go to the Wikipedia page for Amdahl's law, you will see this particular graphic, and I want to point your attention to the green line. The green line is for a program whose parallel portion is 95%. Even though only 5% of the program is serial, the speedup can never go above 20. Even if we run it on 65,536 processors, it will still be about 20, so it would be a huge waste of resources to try. Let's see: for eight processors it has a speedup of about 6, which is okay. For 16 processors the speedup is about 9, which is again more or less okay. For 32 it is about 13, for 64 about 15. So somewhere around here it just stops being beneficial; you should stick with something like 64 processors and not try a bigger number.

For an even starker example, I want to show you this plot, which I generated myself. The portion of parallel regions here is 99.9%, so there are only 0.1% serial regions, and the speedup cannot go above 1,000 no matter what we do. Even on 30,000 processors it is still about 1,000, and it is basically the same on 25,000 processors. This is because when n goes to infinity, S goes to 1/(1 - p).
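Amdahl's law is easy to evaluate numerically; here is a small sketch reproducing the numbers quoted above (plain Python, the function name is made up):

```python
# Amdahl's law for a fixed workload (strong scaling):
#   S(n) = 1 / ((1 - p) + p / n), where p is the parallel fraction.
def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (8, 16, 32, 64, 65536):
    print(n, round(amdahl(0.95, n), 1))
# 8 -> 5.9, 16 -> 9.1, 32 -> 12.5, 64 -> 15.4, 65536 -> 20.0:
# the p = 0.95 curve saturates at the asymptote 1 / (1 - p) = 20.

print(round(amdahl(0.999, 30000)))  # 968: still below the ceiling of 1000
```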
And so, yeah, it's easy to do the math yourself, and you will see that in that particular case, if p is 99.9%, S goes to 1,000. So this looks very hopeless and discouraging. And it is. And the communication time most notably counts towards the serial regions: if there is any communication in the program that is not hidden behind some kind of overlapping computation, there is no hope for strong scalability.

I can see we're almost out of time, but I still need to say something about Gustafson's law and weak scaling. As I said, Amdahl's law applies when there is a fixed workload: we don't change the problem size, we just throw more processors at it. There is another way to utilize the resources, and that is to solve bigger problems. This is where Gustafson's law comes into play. It is valid for a fixed time, while the workload and the number of processors vary, so we have a fixed problem size per processor. Say we start with a problem that is one unit big and run it on one processor; then we take a problem that is two units big but run it on two processors, and we might expect it to finish in the same time. And if we have a problem that is 128 times bigger but we run it on 128 processors, we might as well expect more or less the same time. Again, this depends on the amount of parallel regions in the program; the scaled speedup is S(n) = (1 - p) + p·n, where p is the same as before and n is the number of processors.

This looks a little better. The red line here at the top says that if 90% of the program is parallel regions, then on 120 processors we should expect a speedup of about 108, which is more or less good enough. So, yeah, weak scaling is generally easier to achieve. But if the software uses some heavy collective MPI call, all-to-all for example, weak scaling will degrade as well. Also, not all algorithms are created equal.
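Gustafson's law can be put side by side with Amdahl's in the same way; a minimal sketch (plain Python, names illustrative):

```python
# Gustafson's law for a workload that grows with the machine (weak scaling):
#   S(n) = (1 - p) + p * n, where p is the parallel fraction.
def gustafson(p, n):
    return (1.0 - p) + p * n

print(round(gustafson(0.9, 120), 1))  # 108.1: near-linear at 120 processors
# Contrast with strong scaling at the same p: for a fixed workload,
# Amdahl gives 1 / (0.1 + 0.9 / 120), which is roughly 9.3.
```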
If an algorithm's complexity is O(n³), then doubling the resources only lets us grow the problem by about 26%, because the cube root of 2 is about 1.26. So not all algorithms behave well; usually the ones you will need do behave well. At some point it becomes pointless to throw resources at a given problem, and it's up to you to be able to judge that limit. For that you will need to perform scalability tests. Run your simulation, but with only a limited number of time steps: if your simulation requires 50 million steps, don't run it with 50 million steps; run it with 100 or 1,000. But run it in different configurations: different numbers of threads and MPI tasks, increasing the number of cores and nodes, and find out where it stops making sense to increase them any further. When you have this scalability data, you know how many resources you will need, so you don't waste any, and that is how many resources you will request for your final simulation.

Well, that was all from me. Thank you for your attention. I'm happy to take any questions, and I'm sorry I'm a little bit over the time limit, but it was a big topic. Thank you.

Thank you very much for the systematic and nice explanation, and thanks for the good advice. A little bit over the time limit, but on the other hand we are facing the lunch break, so I guess we have some time for questions. Here is the first one: can you use OpenMP and/or MPI on a Linux workstation to optimize performance, for instance on a machine with N CPUs and GPUs, or are they only beneficial on supercomputers?

Yes, sure. If your workstation or laptop or whatever supports multiple cores, why not? That's totally acceptable, and that's what usually gets done, for the limited number of cores you have. For example, when we do some kind of benchmarking in GROMACS, because we did that previously, at least up to eight MPI tasks or eight threads, we perform these scalability tests on the laptop or the workstation themselves. But when we need to test the scalability for a larger number, we just have to go to an HPC machine, because there are simply not enough cores. So definitely use MPI and OpenMP on laptops; just don't go over the number of cores you have.

Okay. Then the next question from our attendees: when would you recommend "close", and when "spread"?

That depends on the software; I cannot recommend one over the other. Sometimes you just have to try which one works better for the software at hand, because you don't know what the programmer did when he wrote that parallel region. Some software works better with "close", other with "spread"; it just depends on how the code was written. So unless you have some kind of documentation from the developers that says to use one or the other, you have to experiment.

Okay, thanks. I don't see any other questions in the chat. So, thank you once again.