Hello, we are back. And next up is parallel computing. So Simo, what should people get from this lesson? How deep are we going?

Yes. This is quite a complicated thing to explain. Parallel computing is all about having multiple workers, multiple CPUs, working together at the same time on the same problem. In an array job, we had tasks that could happen at a different time and in a different place, as long as they all happened: each task just needed to run with a different parameter, and the tasks were independent. They didn't need to communicate; they didn't need to work together. Parallel computing is a different thing. With array jobs we wanted to do more; in parallel computing we usually want to go faster, and more isn't always faster. Quite often the most efficient way of getting work done is the embarrassingly parallel array job: if you can run 100 simulations separately, that's usually better than having 100 CPUs run one simulation at a time in sequence.

So if we look at the pasta analogy again: previously we had different kinds of pasta we wanted to cook, and now let's say we just want more pasta. We could have four cooks, each with their own pot. That would work, and that would be the array job: four cooks cooking their own thing, each producing one pot of pasta, so in the end we have four pots. But we could also have one cook and just add more pots to the stove.
So we do more with just one cook. In that case, you should usually get the same result: if the work parallelizes optimally, you get the same pasta done in the same time. But if you're the only cook and have to keep moving pots around, some of your time is wasted on that. So it's usually not as efficient to just add more pots and burners to a single job; it's usually better to do the embarrassingly parallel thing instead.

But if you do want to run in parallel, you need to know what your program can do, because it all depends on the program you're using. There are two main paradigms, in HPC and in general: MPI, and multithreaded or multiprocessing programs. If your code supports MPI, it usually says so: if the documentation says it's an MPI program, it's an MPI program, and if it doesn't say that, it isn't, in which case it might be a multiprocessing program. It can be hard to tell which one you have. But usually, if you see MPI mentioned, you know it's an MPI program. And if MPI isn't mentioned but the program talks about workers, a number of jobs, a number of processors or similar, it's usually a multiprocessing program. The words get mixed up easily in this field, because everybody uses them however they want, so there's no single trick for recognizing which kind of program you're running.

When it comes to the resource allocation, though, the rule is simple. If your program is multithreaded or does multiprocessing, you specify the --cpus-per-task option to make multiple CPUs available to the process.
So you basically ask Slurm: I'm going to run this job, I want multiple CPUs available, can you provide them? And that's it. Slurm makes certain that there are CPUs available and then takes its hands off. It makes certain that you have a big enough stove, enough burners and enough working space to do your work, and then it steps back.

With MPI jobs, Slurm does a lot more: it launches a kind of world, a network between the many workers. That matters when you do multi-node parallel computing, where multiple computers work in tandem. Think of a big weather model: you might have hundreds of computers running one simulation together. This is the traditional high performance computing, the big-scale stuff, and it's still very popular. It's done with MPI, where Slurm gives each task its own place to work and then makes the communication between the tasks possible. For those kinds of jobs, you use the --ntasks option.

So it's one or the other. There are also hybrid jobs that use both, but we don't want to talk about them here; it's too complicated, so let's limit ourselves to these two options. If your job is an MPI job, you talk about tasks. If your job is not an MPI job, you talk about the number of CPUs, that is, how many CPUs it can use.

So let's go and look at an example job. In the documentation there's a lot of material about what multithreading and multiprocessing actually are, which was asked about earlier, but we won't go through that here because it's technical and I don't think it helps.
There's also talk about OpenMP, which, if you're writing your own code, is an easy way to make it multithreaded. But in most cases you do something with an already existing framework. So here's an example in Python, and this is also relevant for people who use Matlab or R or whatever: those programs are internally multithreaded. In Python there's the NumPy library, which uses other libraries underneath it that can utilize multiple processors, and if they see multiple processors, they use them. That's how they work. If you've ever started a simulation on your own laptop and noticed that the fans turn on and all of your processors are in use, that's usually because the program itself recognized that multiple processors were available.

So Richard, if you want to take the screen and show this example. In the example there's a simple linear algebra calculation, computing the pseudo-inverse of a matrix, which is done multiple times and timed: how long does the calculation take?

So if we just create a script and copy-paste the whole code... I guess I don't even need to look at the code, do I?

Yeah, usually if you see that a program supports multiple processors, you don't need to know how the source is written. If it's supported, it works.

And we make a new script? Yeah, let's go straight to the submission script; there's an example of how to run it. So we go straight to the common pattern: we have the code, and we have a separate script that controls it.
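The computational part might look roughly like this sketch. The matrix size, repetition count, and variable names here are illustrative, not necessarily the course's exact file; the point is just that np.linalg.pinv is the kind of call whose underlying BLAS library will use however many CPUs it sees:

```python
import time

import numpy as np

# Repeatedly compute the pseudo-inverse of a random matrix and time it.
# NumPy's underlying linear algebra libraries use multiple threads when
# multiple CPUs are visible, so the wall time should drop with more CPUs.
rng = np.random.default_rng(seed=0)
matrix = rng.standard_normal((500, 500))

start = time.perf_counter()
for _ in range(20):
    pinv = np.linalg.pinv(matrix)
elapsed = time.perf_counter() - start

print(f"20 pseudo-inverses of a 500x500 matrix took {elapsed:.2f} s")
```

Nothing in this code mentions threads at all; the parallelism happens inside the library.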
Even though in theory they could be combined with some effort. Yeah, so let's see: the time limit, one task, two CPUs, one gigabyte. This example might differ a bit if you're running on a different cluster, because a different Python installation might be available there. So what new things do we see in this script, especially in the #SBATCH section? We see these two lines, --cpus-per-task and --mem-per-cpu. We also have --ntasks set here, but you don't really need to set it. The idea is that you want exactly one of these MPI tasks (the number of tasks shouldn't be anything other than one) and multiple CPUs available for the job. In this case we set two CPUs.

And instead of specifying the total memory with plain --mem, in this case we set the memory per CPU. So you can think of the total memory requirement of the job as the memory per CPU multiplied by the number of CPUs.

And the advantage is that if you want more CPUs, you only have to change one place, instead of also changing the memory separately?

Yes. Although in some cases adding more CPUs doesn't increase the memory requirement of the job; in those cases you might want to use a flat --mem statement. But if you know that adding CPUs adds memory requirements, you can use --mem-per-cpu, so these kinds of helpers can help you here. The rest of the script is familiar: we use Python via the anaconda module that has been installed, and then we run the code. So that's a program that supports multithreading. So, if you now submit it.
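Put together, the submission script being walked through looks roughly like this sketch. The module name, time limit, and file names follow the Aalto setup described here and may well differ on your cluster:

```shell
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --ntasks=1              # not an MPI job: keep this at one
#SBATCH --cpus-per-task=2       # CPUs for the multithreaded program
#SBATCH --mem-per-cpu=1G        # total memory = mem-per-cpu x cpus-per-task
#SBATCH --output=python-openmp.out

module load anaconda            # module name is site-specific
python python-openmp.py
```

To try a different CPU count, you change only --cpus-per-task; the memory request scales along with it.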
Okay, it claims it's done. If you look at the output... can you see python-openmp.out? If you didn't notice, I press Tab and it automatically completes many things, so I can type file names really quickly.

Hmm. Okay... oh, the script isn't complete. Can you use the wget command there? The Python code shown on the page is only part of the script; it needs to import time. Yeah, it's missing some imports, import time and the numpy import, I think. You can just download the whole thing; in the example there's a command to download it. Unfortunately, yes, we'll have to update the example. So if you're typing along, do notice that there's this download command.

Okay, so now if we run it again, it probably takes a bit longer. Should we see what the differences were? Probably a waste of time, but okay: we see it's the imports, plus a few more things, the additional output.

Okay. So there's also a question about OpenMP versus Open MPI. This is the kind of mess of technical jargon you get in the HPC world. There's the OpenMP standard, which says how you write code that parallelizes using the OpenMP paradigm. Then there's MPI, the Message Passing Interface, which is a standard for how you do message passing. And then there's Open MPI, which is just one implementation of the MPI standard.
So yeah, that's the computer science world for you: whenever technical people get to make up naming conventions, you end up having to spot the difference, whether we're talking about OpenMP or MPI. It's unfortunate.

So, it took 13 seconds. Let's try with one CPU and see how it behaves. We need to edit the submit file with nano... yeah, one CPU per task. And submit it. It will take a bit longer. The important part here is the resource requirements: if you don't know whether your program can utilize multiple processors, you can always try giving it several. The main thing is then to figure out whether it actually utilized them.

After this has finished, if you look at the output: 22 seconds. So previously it was 13 seconds and now it's 22. Now let's look at the Slurm history for these jobs. Yesterday we already used the seff command to monitor how a program behaved. So if you run seff for the one-CPU job, the second one in the list... yeah. We see that it took 25 seconds to run; that is the wall time, the wall-clock time. And it used 90% of the CPU, so with one CPU the utilization was 90%. If we now look at seff for the two-CPU job you ran previously: the wall time is 15 seconds, which is less, and the usage is more. You can see that it used two cores on one node, and the CPU efficiency is still pretty much the same. So it actually managed to utilize both of the CPUs.
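A quick way to sanity-check those numbers, using the wall times reported by seff above (25 s with one CPU, 15 s with two):

```python
# Sanity-check parallel scaling from two timed runs of the same job.
t_one_cpu = 25.0   # wall time with 1 CPU, in seconds (from seff)
t_two_cpus = 15.0  # wall time with 2 CPUs

speedup = t_one_cpu / t_two_cpus        # how much faster it actually got
ideal_speedup = 2.0                     # perfect scaling with 2 CPUs
efficiency = speedup / ideal_speedup    # fraction of the ideal achieved

print(f"speedup: {speedup:.2f}x (ideal {ideal_speedup:.0f}x)")
print(f"parallel efficiency: {efficiency:.0%}")
```

Part of the gap between the measured and ideal speedup is startup overhead and the serial parts of the program; if the speedup stays near 1 when you add CPUs, the program isn't using them.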
So if you're unsure whether your program supports multiple CPUs, it's a good idea to just give it a few and see if the CPU efficiency is good. There's another side to this: if you know that your program supports multiple CPUs, and the program has some variable for how many processors to use (in Matlab, for example, the size of the parallel pool you create), you want to make certain that that number matches the CPUs you're asking Slurm for. Otherwise you can end up in a situation where your program launches more processes than it actually has access to. Say your program tries to use 20 CPUs but only has access to one: then there are 20 processes fighting over one CPU. Basically, you're trying to cook 20 pots of pasta on one burner. It's going to be hell; you're not going to boil any of the water. You cook every pot for a second, then swap another pot onto the burner, constantly shuffling which pot is on the stove. It's going to be a mess. So you always want to make certain that however many burners you request, in this case two or one, your program puts exactly that many pots on them. Then the system is utilized but not overutilized. It can go wrong both ways: if your code doesn't support parallelism, it might not use the extra CPUs at all.
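One way to keep those numbers in sync from Python, sketched here under the assumption that the program's numerical libraries read the usual BLAS and OpenMP environment variables (as NumPy's backends do), is to copy Slurm's allocation into them before importing the libraries:

```python
import os

# Use however many CPUs Slurm allocated to this task; fall back to 1 when
# running outside Slurm. SLURM_CPUS_PER_TASK is set by Slurm for jobs
# submitted with --cpus-per-task.
n_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

# These must be set before the numerical libraries are imported, because
# the BLAS thread pools read them at startup.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = str(n_cpus)

import numpy as np  # imported only after the thread counts are pinned

print(f"pinned thread pools to {n_cpus} CPU(s)")
```

With this pattern, the number of pots always matches the number of burners, whatever you put in --cpus-per-task.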
And if it supports parallelism but doesn't know how Slurm tells a program how many CPUs are available, it might say: I'm on a computer with 20 processors, and try to use 20, even though it has only been assigned four of them. So many things can go wrong. It's usually a good idea to just test, and to use seff. The main thing is that if you know how to do division, it's quite simple. Launch a job with one CPU, let it run for some time, and note the runtime. If you then increase the number of processors, in a perfect world you expect the time required for the job to complete to be the one-CPU time divided by the number of processors. So if you see a discrepancy, say you doubled the number of processors but the time didn't drop at all, then you know it didn't work as intended.

So that's about all we can say here. There's all kinds of material in the documentation about these multiprocessing jobs, but it depends so much on the program that we can't give general instructions. Still: if the program doesn't say MPI, it will most likely use this other paradigm.
MPI is then a lot more complicated, but for most users the complexity stays under the surface, because usually you don't compile your own MPI programs; you use MPI programs that other people have provided. MPI is complicated because it creates communication between multiple different nodes. For that to work, it needs to know all about the system libraries and network libraries, so that information can move across the nodes fast, because there can be a lot of communication. Think of a traditional MPI program: say you have a finite difference method, a grid describing some place, weather data over Finland or something like that, where each grid point is a point in space. For each grid point, or each few grid points, you have one MPI worker that calculates what happens there. And all of those workers have to communicate with their neighbors, so there's usually a huge amount of communication. That's why it needs these special libraries. MPI programs usually work like this: you load an MPI module, provided by the system administrators, because that MPI module really needs to work, and usually only a few versions are actually tested and known to work. Then you compile your program against that version, and then you do your stuff.
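The load-and-compile step just described might look roughly like this sketch; the module names follow the pattern used in this lesson, but the exact names and versions are site-specific, so check what your cluster provides:

```shell
# Load a compiler and the matching system-provided MPI library,
# then compile the code against them. Module names vary by site.
module load gcc openmpi

# mpicc is the MPI wrapper around the C compiler; it adds the MPI
# headers and libraries to the compile and link steps.
mpicc -o hello_mpi hello_mpi.c
```

The key point is that the program must be compiled against the same MPI library it is later run with.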
If your program is provided by the system administrators and it uses MPI, you should use the MPI libraries the site provides, because if the libraries don't work together, nothing works.

Richard, if you scroll to the hello-world example. Up above, I guess... yeah, over here. Typically what you do is load some compiler, in this case at Aalto the GCC compiler, and an MPI library. At other sites the version numbers and module names might be different, but usually you load some compiler and some version of MPI, and then you compile your code. And if you scroll to the Slurm script... here we go. In the Slurm script you typically specify the number of tasks. So instead of the number of CPUs, with MPI we talk about tasks, because of the networking and all that jazz. If you're working with MPI programs, you specify --ntasks; if you're not, you're working with the other paradigm.

There's also a lot in the documentation about running large-scale MPI programs: how to spread the workers evenly, how to choose the number of workers, and so forth. Complicated stuff, but not something most users need nowadays. HPC used to be all MPI, but things have changed a lot in the last 20 years: there's much more multiprocessing, much more data parallelism, and many more array jobs.
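A submission script for an MPI program like the hello-world just discussed might look roughly like this sketch (the counts, time limit, and names are illustrative):

```shell
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --ntasks=4              # MPI jobs ask for tasks, not cpus-per-task
#SBATCH --mem-per-cpu=500M

module load gcc openmpi         # the same modules the code was compiled with
srun ./hello_mpi                # srun starts the 4 MPI tasks and connects them
```

Here srun, not the program itself, is what launches the four workers and sets up the communication between them.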
So those are the two paradigms. If you're not in physics, you're most likely to do the non-MPI stuff; the physics people are stuck with MPI, because their problems are usually so big that they need MPI to solve them. Yeah.

Okay, so we've got 10 minutes left. And did we already say that we didn't plan on doing dedicated exercises here? I guess that should be obvious by now. Should we do some exercises as demos? Yeah, maybe we should do exercise one as a demo, because it shows the concept in detail. So if you scroll down to exercise one... here we are.

So I'm running srun --cpus-per-task=4, so there's one task and four CPUs, and then I run hostname, which tells me what computer it's running on. Which of the paradigms is this? The multiprocessing one?

Yes, yes. In this case, similarly to the serial job where we asked for memory and time resources, we're now just asking for CPU resources. We're telling Slurm: whatever kitchen I end up in, I'd like four burners available, can you make that possible? And Slurm says: sure, I can find you a place to work, and here are your four burners. The hostname command, of course, doesn't care about the four CPUs, but if there were a real program here, it could use them.

Okay, so now let's try the number of tasks: srun --ntasks=4. This is the MPI paradigm? Yes, I guess. It's waiting... So now we asked Slurm to provide us four MPI workers, and you see that the output is quadrupled.
So we get four times the output. This is because the hostname command doesn't understand anything about MPI, so what happens is that it just gets duplicated. In this case on the same node, but in general in different places, we get four CPUs and four copies of the program all doing the same thing. Basically, we asked for four cooks, but we didn't check that they all speak the same language. One speaks Italian, one Finnish, one Polish, one Hindi, and none of them can communicate with each other.

Is it more that they're not told to work with other people? Like they're not told there's even anyone else in this kitchen? So you give everyone the same recipe and they all do it independently, because nobody said: here's your leader, and the leader will tell you which part to do.

Yes. So if your code doesn't use MPI and you use --ntasks anyway, exactly this happens: you launch the same program multiple times. In the best case, one of them finishes and three of them crash, like here. In the worst case, they all try to write the same output file and they mess up your data. So you don't want to do this if your code doesn't use MPI.

So let's try the number of nodes. There's also this option: for big MPI jobs, you might want to spread the work across multiple nodes.

Does --ntasks automatically take multiple nodes if you ask for enough of them? I think the default is that if you specify a number of nodes, the number of tasks you asked for gets multiplied across the nodes.
But I'm not completely certain. Which brings up a point: if you don't know what the program is doing, don't trust the defaults; actually set the values you want it to use. So in this case we would get four tasks on four different nodes.

The main thing is to know what kind of parallelism your code supports. If it doesn't support any, tough luck; then it's better to use the embarrassingly parallel approach. If you want to write parallel code yourself, it's a good idea to check out MPI, OpenMP, and the huge range of other libraries and systems that provide parallelization. But usually it's better to use ready-made products that somebody has already written, because if the code isn't optimized, a lot of time can be spent doing nothing, basically. In every program there is going to be serial time that cannot be parallelized; that's provable mathematics, you cannot parallelize everything in your program. So it's a good idea to check what is possible and what is not. And usually embarrassingly parallel is the best way to parallelize.

Maybe we should make a flowchart of the best way to do things. But anyway, if you want to switch to my screen, I'll quickly show this. I don't know if it helps; maybe I'll make a graph out of it for the documentation. Basically, this is a hierarchical structure of what jobs can be. At the top, you either have an array job or you don't.
An array job basically copy-pastes everything underneath it: it just sets the array index, and everything underneath stays as it is. Then each job has its own requirements in the #SBATCH comments, like when we wrote the serial job. Each job step that we run with srun inherits the requirements from the main job. Each job step can have multiple tasks, the MPI tasks, but only for MPI jobs; otherwise the number of tasks is one. And each of those tasks can have one or more CPUs allocated to it. So it's a complicated mess, but you don't have to think about it too hard. The easiest question to ask is: does my program say something about a number of workers? If it does, try giving it multiple CPUs; if it doesn't, it doesn't matter.

Do we have any questions? Nothing? Let's check... My hands are getting a bit full here.

So what comes next is a talk by CSC. When you need even more resources than there are at Aalto, this is a good option. And it's not just computing: there are resources for data, secure data, many different things, and I think we'll hear about several of these. After that, we come back and talk about GPUs. The GPU part is not about writing your own GPU code, because most people don't do that, but about running things on GPUs, and that turns out to be quite important for many people these days. So with that, should we go to a break? We'll be back at zero-zero. Yeah. Okay. See you all then.