Yeah, let's go there. So what's the difference between the array-job parallelism and the parallelism we're going to be talking about now? The difference is that in array jobs, all of the jobs were independent, so that's called embarrassingly parallel. Each one was still running with only one CPU, but they were all running in parallel; the queue just manages the parallelism. But for some situations you actually want to use multiple CPUs or multiple processors for the actual calculation, to speed it up. This is especially popular in physics and similar fields where you have big models to run, but really in any field, if the software supports it. Yeah, and there are several different strategies to do this. And really, none of these strategies are specific to high-performance computing; they can be used on your own computer. You can use MPI or OpenMP on your own laptop. NumPy uses OpenMP for the calculations it does. But the cluster is designed to make this easier and let it scale to even larger things than before. So should we look at the models? Probably explain the main types of parallelism first. So besides the embarrassingly parallel one that we already experienced with the array job, there is so-called shared memory parallelism, where all of the processes are in the same physical machine. We were trying to find analogies for these with Richard during the break, and the analogies got pretty wild sometimes. But the best one we managed to get was to think of an office: an office where people are working with papers on, let's say, a collaborative project. Everybody's working on their own chapter, and they're all in the same office, so you can give a bunch of paper to the person next to you just by handing them the papers. That's basically the situation with shared memory parallelism. The memory is the text that you're writing: everybody's writing text into documents, and they're in a shared room. Every worker, every office person is in the same room, and they can just hand stuff to each other. So basically everybody's in the same place. And in reality we have the compute nodes, which are computers, and the computers have the CPUs in them, and also the RAM that we talked about, the local memory where the programs are being run. The processors and the RAM are limited by the actual physical machine. It's an actual machine, like a computer, similar to your laptop; it's a physical thing, and that's the limit of the machine. In shared memory parallelism you utilize multiple processors within the same machine. And this kind of parallelism is very popular nowadays. Richard also already mentioned OpenMP, which is a coding style for C, Fortran, and these low-level languages: you can code with OpenMP to run multiple things, multiple calculations, at the same time. But also Python, Matlab, R, and so on either utilize low-level libraries that let them do this OpenMP stuff in the background, or you can have multiprocessing going on, where you have multiple Python processes.
Often in some code that you're using you see a parameter called number of workers, number of jobs, or number of something, and that usually just means how many processes you want to run this problem with. That is basically the thing: if the frameworks or programs that you're using support running with multiple processors, they can utilize multiple processors. But they're still limited to that one machine. You cannot create more CPU cores out of thin air just by asking for them; you're still limited by the actual hardware of the system. So I guess you're limited to the several tens of cores that you can get on a server these days. Yeah, something like 40 is the maximum in Triton currently. Yeah. It's still a lot more than your computer might have, like four or eight, so it's a lot more, but still limited. But the main thing here is also to recognize that not everything works like that. Not everything can be parallelized, and it needs to be supported by the code itself. So you need to have that number-of-workers flag there, or you need to have libraries that can run on multiple processors. Otherwise adding more processors doesn't help. If you only have one kettle or one pan where you want to cook your pasta, adding more stoves won't help, basically. Or if you're telling the person to only use one pot, giving them more pots isn't going to help anything. Yeah. Which is actually, well, do we talk about that later, configuring the code? Maybe let's talk about MPI first, then we can talk about the difficulties of getting code to use the processors. Yeah. So what's the idea? Yeah. So the other parallelism paradigm is MPI, or message passing interface, programming. There are other alternatives that use other mechanisms as well, but MPI is the most common one. The main thing here is that there has to be some external network that the processors can communicate across. So if you think about the office analogy: you're in the office, working with your coworkers, and let's say your company has another office in India that you need to work with. You simply cannot hand the documents to each other by just reaching out to India; that's impossible. Yeah. So in your program you have to have an explicit step of going and sending the communication. Yes. And with this message passing interface, in the office analogy, you send a request by email, hey, can I have that document, and they send you an attachment back. The message passing interface is a layer that handles all of this nasty network communication so that you can transfer data from one processor to another across a network. So in an HPC cluster you usually have a fast interconnect network in the background, and MPI handles all of these kinds of discussions, the networking and the handshakes between all of the different processors, so that you can have communication across multiple computers. Yeah. Which is essentially what srun does. So if you're doing something with MPI, you would normally compile it so it knows about Slurm, and then when you put srun in your script, it sort of magically handles all of this setup for the communication and then tells the program how to use it. Yes.
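To make the number-of-workers point above concrete, here is a minimal sketch of how the program's own worker setting and the Slurm request go together. The program name my_analysis and its --n-workers flag are hypothetical; only the Slurm options are real:

```bash
# Ask Slurm for four CPU cores for this single task, and tell the
# (hypothetical) program to start the same number of workers.
srun --cpus-per-task=4 --mem=4G --time=00:10:00 \
    ./my_analysis --n-workers=4 input.dat
```

The important part is that the two numbers agree: Slurm only gives you the cores you asked for, and the program only uses the workers it is told to start.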
There are layers upon layers of middleware, software in the middle, so that the data goes as fast as possible straight from one computer's memory to another computer's memory, because otherwise there would be a lot of time spent on handshaking and a lot of time wasted on communication: I want to send you something, and the other person needs to respond. MPI and the other software make this smoother, so that you have a direct communication channel and the programs can talk to each other. And the difference here, from the user's perspective, is whether your program supports either of these paradigms. There are also hybrid programs that support both, but that's another can of worms. But if your program supports either MPI or shared memory parallelism, you usually need to look at the manual. If it mentions MPI somewhere, or something about networking and going across networks, then it probably supports working with multiple computers. But usually the vast majority of parallelism is shared memory parallelism. Yeah. And this point about reading the manual is quite important. Sometimes someone will come to us and say, okay, this program should use multiple processors, how do I run it? Okay, well, let's see, and we look at the manual. In the worst case, the author doesn't say how it works, or it's designed so that it tries to detect how many processors are on the computer and use all of them. So you might run it in Slurm and say, okay, please run with two processors, and it goes, okay, I'm on a 40-core node, I'm going to try to use 40, except Slurm tells it it can only use two, and then it's actually a lot more inefficient that way. So not all codes are even smart enough to realize they're running in Slurm and that there's some limitation, beyond trying to use everything that's there, because they were just designed for a desktop computer where there's usually one thing running at a time. So yeah, that's what this big box is about, basically figuring out how the code works. And if you can't figure it out, then you can do some tests yourself and see, or you can contact us and we will try to figure some things out. Yeah, it's usually a good idea to do something like this, a kind of A/B test, or just testing your assumptions about how the code is supposed to work. If you assume that the code works better with multiple processors, then when you increase the number of processors, you expect the code to perform better: you expect the time to run the code to get shorter, because there are more processors working on it. If this is not what happens, if the run time, when you use let's say seff or slurm history to look at what the code did, doesn't match the expectation, then you know that something is wrong, either the program or how the program is configured. Yeah. So basically, as an analogy, let's say you're in the office working there and you need to write something together on a computer, and you only have one keyboard. Everybody has to constantly switch places to write their own stuff, and then another person comes in and writes their own stuff with the same keyboard. There's obviously going to be a huge bunch of people just waiting in the queue. It's like if you're in the kitchen and you only have the one pot for cooking the pasta, but you have ten chefs standing around: it doesn't make it any faster to add more workers to the situation.
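One way to do the A/B test described above is to submit the same calculation with different CPU counts and compare the elapsed times afterwards. A rough sketch, assuming a hypothetical batch script run_my_program.sh that runs the code:

```bash
# Submit the same job with 1, 2 and 4 CPUs per task
for n in 1 2 4; do
    sbatch --ntasks=1 --cpus-per-task=$n --job-name=scaling-$n run_my_program.sh
done

# After the jobs finish, compare wall-clock time and CPU usage per job
sacct --name=scaling-1,scaling-2,scaling-4 \
      --format=JobName,AllocCPUS,Elapsed,TotalCPU,State
```

If the elapsed time does not drop as the CPU count goes up, the code is probably not using the extra processors, or not using them well.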
So in these kinds of situations you need to look at what number of processors you're actually requesting and what number of processors the code assumes it's getting, and this can be hard to decipher sometimes. But the best way of checking is usually to look at the time: did it run faster? If it did run faster with multiple processors, it probably used them in some form or fashion. Yeah. So we talk about the different ways of running things. If it's embarrassingly parallel, you would use the array jobs that we already discussed, and really these days, unless you're using some massive physics code, a lot of software uses array jobs plus shared memory on one node, so basically multiple processors on one node, and then you scale by using more array tasks. Okay. So next, if it's OpenMP or multi-threaded or multiprocessing, how do we run this? What we now need to tell Slurm is that we want multiple processors for this one Slurm task that we're running. There's this kind of different lingo inside Slurm about tasks and CPUs per task and so forth, but tasks are basically MPI tasks. So tasks in Slurm's internal lingo means how many MPI workers you want. And if you don't have an MPI code, you always want one task, just one; you don't ever want to go above that. The default number is one, and what you want to do is add more CPUs available to that one task, using the --cpus-per-task option. Yeah. So if you have --cpus-per-task equal to, let's say, four, you will get four CPUs for that process, and that process will run with four CPUs. So if your code doesn't have MPI in it, you never want to modify the tasks parameter; you just want to specify the CPUs per task parameter. Okay. And then we would specify some memory. Or I guess we can specify memory per CPU or a total amount of memory. Yeah. So some programs basically copy the data for each worker, or they have some kind of internal allocation per worker going on, so by adding more CPUs you might increase the memory requirements. In those situations it's usually a good idea to have the memory request multiplied by the number of CPUs you're using: if you add more CPUs, you automatically add more memory as well, and you don't have to change both parameters in the Slurm script when you submit jobs. The other situation is when the number of CPUs doesn't affect the memory usage, and in those cases it's better to just put a flat memory limit for the whole program and then only change the CPUs number. Yeah. Okay. Let's see. So here's an example of running an OpenMP program, and by this point, I guess we're not going into all the details. When you need to run something like this, you can go down and see it. This is actually compiling a... Yeah, maybe I could quickly show the program, so that we can focus on the actual important part, which is the CPUs per task. Okay. Your screen is shared now. Yeah, I can run the example. If you're not compiling OpenMP code, this part is probably not relevant for you. So I'll actually make a new directory. Let's clear this up. So we have some OpenMP code with something that can work in parallel.
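Putting those options together, a minimal batch script for a shared-memory (OpenMP, multi-threaded, or multiprocessing) job could look roughly like this. The program name is hypothetical, and whether you use --mem-per-cpu or a flat --mem depends on how the program's memory use scales, as discussed above:

```bash
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --ntasks=1            # one task: this is not an MPI job
#SBATCH --cpus-per-task=4     # four CPU cores for that one task
#SBATCH --mem-per-cpu=2G      # memory grows with the CPU count
                              # (use --mem=8G instead for a flat total)

# hypothetical shared-memory program
./my_threaded_program input.dat
```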
If you're not familiar with OpenMP, if you're not familiar with compiling, it doesn't matter. The main thing is that this is a program that can utilize multiple CPUs. Well, it doesn't do anything interesting, but it can utilize multiple CPUs. So if I now run it with srun, so interactive running, it will run on some compute node, and I give it the hello OpenMP program that I have. It runs there, and it says, okay, hello from thread zero. And if I run it with --cpus-per-task equal to two, let's say, so now we are asking for two processors, the program itself understands that, okay, there are two processors available, and it prints hello world from thread zero and from thread one. So the only thing you need to think about, if you have a program that can utilize multiple processors, is to modify the CPUs per task parameter. That is the only thing you need to worry about from the Slurm side. The other side is that you need to worry about whether your program can actually utilize the CPUs, but from the Slurm side you can simply put this number there and it will get you there. Yeah, okay. So that ran two threads. Yeah. I'll also mention that sometimes, in your code, you want to know how many CPUs you have reserved. Slurm will set this SLURM_CPUS_PER_TASK environment variable when you run something with multiple CPUs, so you can use this environment variable to get the number of CPUs while the code is running, how many CPUs you have reserved, so that if your code requires you to tell it how many CPUs to use, you can use that number. And it's always a good idea, if you're asking for, let's say, four CPUs, to utilize four CPUs, not eight or anything like that, because that will create competition for the same resources within the code and it's not efficient. Okay. Okay, should we look at MPI? Yeah, should I go back to my screen? Well, I just did. So as we scroll down, okay, MPI. So would you say that when you're getting to MPI, you're getting to serious business? Hmm, well, everything is serious business, but I would say you're getting towards the traditional large-scale HPC situation. I think yesterday or the day before I mentioned weather forecasting, for example: the simulation might need thousands of processors to run so that you can get the forecast done for the next day. It needs to be done fast, and the simulation is usually really big, so you need to do it with multiple CPUs. So in these kinds of situations, especially in physics, when you have these large-scale simulations, all kinds of simulations, you usually use MPI to manage a huge number of computers. And this is especially important at, let's say, CSC. We're going to hear about CSC resources more, but if you're going to be doing this kind of large-scale MPI, HPC kind of thing, it's a good idea to use this. Okay. And the main idea is that in MPI programs you usually have things like... let's say you have a simulation area, which is a cube, and you split the cube into smaller cubes, and each processor handles its own cube, and then everybody communicates with its neighbors, and then you calculate something inside each cube. Those kinds of situations are usually handled by MPI programs.
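Before the MPI part, a small sketch of the SLURM_CPUS_PER_TASK point: inside the job, Slurm sets that environment variable to whatever you requested, so you can pass it on to the program instead of hard-coding a number. The program and its --n-workers flag are again hypothetical; OMP_NUM_THREADS is the standard variable that OpenMP programs read:

```bash
#!/bin/bash
#SBATCH --cpus-per-task=4

# OpenMP programs read OMP_NUM_THREADS; set it from the Slurm request
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# or pass the same number to the program's own worker flag (hypothetical)
./my_program --n-workers=$SLURM_CPUS_PER_TASK
```

That way, if you later change --cpus-per-task, the program follows automatically and never tries to use more CPUs than it was given.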
So do you want to demonstrate the MPI program, or should we go straight to... So I propose... I can run the MPI demo. Yeah. You can run this after the course; if you want to use MPI, it's a good idea to test it out. So I will copy this hello MPI program from the hpc-examples. I have the program here, and then I'll use... Usually when you're running MPI, you need to use the cluster's MPI, because, like I mentioned, there's a huge bunch of middleware that handles the communication with the networks, and the MPI needs to be able to handle those. So you usually have software installed by us that can handle these kinds of discussions, because otherwise you won't get the performance that you want. Yeah. So then we just compile this, then we have an executable here, and now when we run it, let's say we run it again interactively: if we run it simply with srun, we run it with one processor, and it says hello world from processor something out of something. In MPI this is usually called the rank; the rank of the program is the ID of a single process. But let's say we want to run it with two tasks. So we didn't mention this yet, but if you wanted to run shared memory programs, you had the CPUs per task and you kept the number of tasks at one. If you want to run an MPI program, you usually want to keep CPUs per task at one, but set the number of tasks to a higher number. You specify that with the --ntasks option. So with two tasks, you have two hellos from two processes that are running together. Okay, so there we see ranks zero and one. Okay. With MPI, I would quickly mention that if you want to run it bigger, you want to run more tasks, you can either just increase the task number, but it's usually better to split the problem based on how you're handling it internally, so that you have a certain number of tasks per node if you have a very large problem. So then you can use --ntasks-per-node and ask for, let's say, two nodes. Now you have two nodes times two tasks, so actually four processes running. So basically there are all kinds of options for requesting whatever you may need. Yeah, but like mentioned before, if you're not using MPI, do not use this, because then you will simultaneously run four copies of the same program: Slurm will try to launch the program four times, and that's not what you want. So here we see that two of the processes were on one machine and two of them were on a different machine. Should we quickly go to monitoring performance and then maybe see one exercise, and then a break? Yes. So about multi-processing performance: yesterday Richard already talked about seff, which will give you the efficiency of a job, and you can use it on a job ID to get the efficiency for any kind of program with any kind of resource request. So here you see that for this previous program that ran with two nodes and two tasks each, it had two nodes, two cores per node, and the CPU efficiency was 25%. If you think about it, two nodes times two tasks equals four CPUs, and 25% of four CPUs is one CPU, so it was effectively running on one CPU. Sometimes these calculations can get a bit muddled. I guess it's so small that maybe it doesn't measure it very accurately. Yeah, most likely.
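As a batch-script version of that MPI demo, a sketch could look like the following. The module name is an assumption and depends on the cluster; the Slurm options are the ones discussed above:

```bash
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=2               # two compute nodes
#SBATCH --ntasks-per-node=2     # two MPI tasks on each node, four in total
#SBATCH --cpus-per-task=1       # one CPU per MPI task
#SBATCH --mem-per-cpu=500M

module load openmpi             # the cluster's MPI; the module name may differ

# srun launches all four MPI ranks and sets up the communication between them
srun ./hello_mpi
```

Afterwards, seff with the job ID shows how well those four CPUs were actually used.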
But the most important thing is whether this is close to 100% or not. Whatever you're running, if it's close to 100%, at least you know it's utilizing the CPUs. It doesn't tell you whether you're utilizing them effectively, though. Let's say you have the ten office people working with one keyboard. The keyboard might be on fire because everybody's typing on it, but you don't get that much done, because ten office people are each writing one word at a time. It's not very efficient, because you still have a bottleneck in your system, the lack of available keyboards, but you still get 100% efficiency relative to the resources you actually requested. Yeah. So running at 100% might also mean that the program is trying to use more resources than the request that you have. Yeah. So, okay. It's 2 o'clock in nine minutes, which is when the Laptops to Lumi CSC presentation is scheduled. I guess that can be pushed back a little bit if needed. Do you think it's even... I have a feeling that most of these exercises, except for maybe number one, could be done alone if someone wants to, so we don't need to take time for them. Do you want to do exercise one as a demo, or a type-along? Yeah, maybe we could do it as a type-along. So unfortunately, like we mentioned, many of these depend very heavily on the software. In the separate application pages in our documentation there are examples of running different multiprocessing programs, like R, Python, Matlab. There are examples on other sites too, of course, and also some MPI programs, the most common stuff, at least for those that we have installed, and so forth. So you can use them or run through them based on what you actually need. But basically the main thing you need to do is remember which type of a program you're running: are you running an MPI program or a shared memory program? So really, unfortunately, there aren't that many good exercises we could give that would fit everybody's needs. But this first exercise that we have here is really good, and I highly recommend you do it at the same time as I do it right here, because it will demonstrate how Slurm thinks about these different parameters. So in this exercise we have this... What do you want to... I have another idea. What if we have a 10-minute break now, we go to Laptops to Lumi, and then we come back and do the parallel exercise? That sounds fine. Okay, yeah. We can have a break until... let's see... 04 past, and then... yeah, let's do that. So you can continue writing stuff or asking questions, and we will see you afterwards. Okay, see you later.