And here I think we talk more about the concepts of parallelism, because every code is different. At least I hope I learned from Simo what the different kinds of parallelism are and how to approach each of them. So, what are the different parallel models?

Well, actually we have already encountered one of them, and that was the embarrassingly parallel model. We already did that with the array jobs. In a cluster sense, in an HPC sense, parallel means that you do separate things separately. You don't run a serial job from one end to the other; you do separate pieces individually. That is the embarrassingly parallel model we used with the array job. If your whole workflow is something that can be split up in a natural way, then embarrassingly parallel is probably the most efficient way of parallelizing it, so you shouldn't think about the other parallelization models before you have considered that one first.

However, in computing, parallel has another meaning as well, outside of the HPC context: you run multiple workers or multiple CPU cores together so that they work on one task, within one job. In the array jobs we had individual jobs that weren't communicating with each other, individual CPUs doing their own work with their own seeds or datasets or whatever. Here, each job has multiple CPUs assigned to it, and it uses them together to solve one problem.

This is usually done in two ways. There is shared memory, or multiprocessing, parallelism, and then there is MPI parallelism, which is message-passing programming. These are completely separate things. Think about your laptop: if you run, say, R or Python or Matlab, you have one computer with multiple CPU cores. All the cores have access to the same memory, and that is where the name shared memory parallelism comes from: all of the CPUs can access the same memory and work together on the same machine. Many of the programs you work with on your laptop can be run in a similar way on Triton, so that they use many CPUs at the same time.

Could you even say that many programs these days can easily use shared memory, whether it's games or desktop applications and so on?

Yes. If you are in a Zoom meeting, for example, and you look at the CPU utilization with top or htop, you will see multiple Zoom processes running in the background. That is multiprocessing: multiple individual processes doing the same thing, in this case running the Zoom application. In an HPC context, if you are running Python, the libraries you usually use, especially NumPy, are written so that they can parallelize, so Python can utilize multiple CPUs at the same time. You can also use Python multiprocessing, the multiprocessing module, so that multiple individual Python workers each do part of the work and you then collect the results together.
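To make that last point concrete, here is a minimal sketch of the multiprocessing idea. It is not code from the course materials; the file name and the simulate function are made up for illustration. The parent process starts one worker per CPU that Slurm allocated and collects the results back.

```python
# multiprocessing_sketch.py -- a minimal illustration, not a file from the course materials.
# One worker per allocated CPU; the parent collects the results at the end.
import os
from multiprocessing import Pool

def simulate(seed):
    # Stand-in for the real per-worker task (one seed, one dataset, ...).
    return seed * seed

if __name__ == "__main__":
    # Use the CPUs Slurm gave us, not every CPU on the machine.
    n_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    with Pool(processes=n_cpus) as pool:
        results = pool.map(simulate, range(100))
    print(f"Ran {len(results)} tasks on {n_cpus} workers")
```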
In R you can use, for example, the parallel package, which starts multiple R workers, does something, and then combines the results. In Matlab you have the parallel pool, which basically starts multiple processes, does something, and collects it again. But all of these run within one computer, one shared-memory system. This is different from the MPI processing we will talk about later: when you run your Matlab code or your Python code on your laptop, you are running it on one system, and these are multiprocessing codes.

We should also note, and this is what Richard is showing there, that there are always losses when you run this kind of program. Some part of the program, say I/O, cannot be done in parallel, and those parts have to be done in serial, so the rest of the job has to wait. There are even theoretical computer science laws saying that you cannot get infinite parallelization; you only get some fraction of improvement. So if you give your job four CPUs, you don't necessarily get four times the speed; you get something like three to three and a half, because there will always be parts that are not parallel.

And even then: if you just write code and don't specifically make it parallel, it won't be parallel at all, it won't even try. If you find some code someone else has written, you can take it and give it 10 CPUs, but if it isn't able to use 10 CPUs, it will just use one and you've wasted all the others.

And not only that: it has to be able to detect that you've given it 10 CPUs, as opposed to one CPU, and as opposed to every CPU on the machine, which may not be allocated to you.

Yes. So, just as with GPU jobs, when you run parallel jobs you need to know what your code is actually trying to do: whether it does shared memory parallelization or MPI parallelization, and how it behaves. This is one of the most annoying things when someone asks me for help making something parallel. I ask what they are running, I go to that program's instructions, and they don't say how to control the parallelism or how it works, which makes running it on the cluster a lot more difficult. So if you are writing code that you want others to be able to use, remember to document this, and actually think about it and make it controllable.

There are further complications, usually caused by the terms multi-threaded and multi-process. Usually the difference doesn't matter that much, though there are some cases where it can. Basically, threads are the small computing tasks, and processes are what run those threads: one process can run multiple threads, and you can run multiple processes at the same time, so it can get quite complicated. What you usually want to do is make certain that you choose one way of parallelizing.
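The rule of thumb above is Amdahl's law. If a fraction p of the work can be parallelized over N CPUs, the best speedup you can get is

speedup(N) = 1 / ((1 - p) + p / N)

As an illustration (p here is an assumed number, not something measured in the course), with p = 0.9 and N = 4 this gives 1 / (0.1 + 0.225) ≈ 3.1, which is exactly the "three to three and a half out of four" range mentioned above.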
You don't want to create the situation where you have reserved four CPUs, you launch four processes, and each of those four processes launches four threads. Then all of the threads compete for the same resources.

So, for example: if you are using something like Python straight out of the box, you can use OMP_NUM_THREADS to set the number of underlying threads it uses for parallelization. If you are using something like Python multiprocessing, R parallel, or the Matlab parallel pool, you want to set that underlying thread count to one, so that you don't get this kind of multiplication.

There are lots of other examples here, but this is very specific to each person's code. Is there anything worth demonstrating? I think we can quickly demo the Python OpenMP parallelization; it is a good example of what you might have. Here we have a small Python script that just computes the matrix inverse of a big array. It doesn't do anything meaningful, but it is somewhat heavy to calculate, so it takes some time. It is in the hpc-examples directory, in the Python OpenMP example, so I will change into that directory.

If you scroll down and look at the sbatch script, you can see the cpus-per-task parameter, which tells Slurm to allocate multiple CPUs for this job. That means that when you run this job, you can utilize multiple CPUs. Of course, you don't know whether the job actually utilizes them unless you know the code uses them, but you can ask for multiple CPUs. It is just one flag, similar to the GRES statement: you specify cpus-per-task and you get multiple CPUs. I also put export OMP_PROC_BIND there, which tells Python, or really the libraries underneath Python, to spread the work over the different CPUs, so that when you ask for the CPUs, the job actually uses them.

Okay, let's see if it works. It's queued, now it's allocated. Remember roughly how long it takes to run. Basically it is just another flag: if you need multiple CPUs, you specify the number of CPUs you need.

Okay, let's run the same kind of command again; let's check the shell history. You can see that the code printed how many CPUs it was using, because that was written into the code. Take the job ID, and look at the runtime and the total CPU time together: the wall time, how much time it actually took in the real world, was about 16 seconds, but the CPU time used was 30 seconds, so it was actually doing something for 30 CPU-seconds. We asked for two CPUs, it ran for about 15 seconds, and it used 30 CPU-seconds, so it seems it utilized both of them. And by using the seff command we can look at the CPU utilization: the CPU efficiency was 93%, so we can see that it really used all of the CPUs.
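For reference, here is a rough sketch of the kind of script being demoed. It is not the exact file from the hpc-examples repository; NumPy's underlying BLAS/LAPACK does the threading, so the CPU usage is controlled from outside via --cpus-per-task and the OMP_NUM_THREADS / OMP_PROC_BIND environment variables.

```python
# A rough sketch of the demoed script, not the exact file from hpc-examples.
import os
import numpy as np

# Print how many CPUs this process is allowed to run on (the kind of
# "how many CPUs am I using" print mentioned in the demo).
print("Running on", len(os.sched_getaffinity(0)), "CPUs")

n = 4000                      # big enough that inverting takes a noticeable time
a = np.random.rand(n, n)
a_inv = np.linalg.inv(a)      # heavy linear algebra that BLAS can thread
print("Done, checksum:", a_inv.sum())
```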
Let's try the same command with more CPUs, say four, and see what sort of performance we get. So we just change cpus-per-task to four. If you think about it, the total CPU work was about 30 seconds, so divided by four we would expect something like seven and a half seconds, but we know there are going to be losses anyway. And you see it wasn't 7.5 seconds, it was 9.7, so there is some serial part that the code could not parallelize.

This is an example where the parallelization is close to the maximum, because matrix operations are something that parallelizes very easily; in many other cases the parallelization will be much worse. What we can learn from this is that in these higher-level languages like Python and R, you can easily just ask for more CPUs and check whether you still get an efficiency increase; here the efficiency was still around 90%. So you can try and see whether it works, and if it doesn't work, it doesn't work. You can keep going up until the efficiency gets too low, and then you know you have to stop adding more CPUs.

But think about the cost: say the code we just tested had taken an hour to run; then we would have wasted 10% of that hour, six minutes. And say we wanted to run four instances of the same thing. We could run them as four of these parallel jobs, or we could run them as an array job of four serial jobs; either way the total works out to roughly the same hour of machine time, but the array job gives better efficiency because no CPU time is lost to the serial parts. So you have to decide whether you want one job to finish faster or the whole set of jobs to use the machine most efficiently; if you want maximum efficiency, it is better to run the smallest jobs you need and simply run more of them.

Yes, unless of course the runtime of a single job gets so long that it becomes hard to get through the queue. If one run takes a day and you need results today, then you may need to increase the parallelization. But parallelization only works when it works, and you should know whether your code actually supports it.

Okay, but there is the other form of parallelization, which is much more complicated but common on HPC systems and supercomputers, and that is MPI parallelization. This is something where, if your code doesn't support it, it doesn't support it; whether it does is determined entirely by the code. But this is what supercomputers use to run on thousands of nodes: you can scale it up to huge exascale computations, although of course there are additional complications when you do that. What MPI basically does is that you have a large number of tasks working together, and they communicate through MPI, the Message Passing Interface. Instead of working within one computer, as we were when we asked for multiple CPUs, you ask for a number of workers that can end up on any computers. You are working across node boundaries, across actual physical machines, with more than one machine.
And for this to work, you need a specific MPI library. The libraries are not compatible with each other, so you have to pick a specific one, for example OpenMPI, and your code needs to be compiled against that library. Then you give it the --ntasks option to say how many workers you want; the workers get distributed across the cluster and communicate with each other.

Many physics codes use this. There is also the mpi4py package if you want to do MPI parallelization in Python, although it is more involved, and there are other libraries in R and Matlab. Matlab quite rarely uses MPI, but there are packages in other languages, for example in Julia, that use MPI as well for multi-node processing. This is more complicated, and you need to read up on MPI or go to a course on it if you want to write code that uses it and your code doesn't support it already.

But if you are using something we have already installed, like OpenFOAM or CP2K or LAMMPS, these kinds of physics codes are already compiled by us, and in those cases you usually don't need to learn how to write MPI yourself; you can just use them and specify how many tasks, how many workers, you want.

I want to reiterate, though: if you are working with shared-memory programs, do not specify the ntasks number. Otherwise you will start multiple copies of the same job, copies that don't know the others exist because they don't understand MPI; you will confuse the system, and in the worst case they will write to the same data files and cause all kinds of problems. So if you are working with a parallel pool, Python multiprocessing, or OpenMP, don't use the ntasks option at all; if you are using MPI, then do use ntasks.

Here in the example there are simple C and Fortran MPI Hello World examples. We don't necessarily need to go through them, but we could look at the sbatch script at the bottom. What you usually want to do with an MPI job is to specify a certain number of tasks, and maybe also a constraint on the CPU architecture, so that all of the nodes end up on a similar kind of CPU. If you look at the Slurm features output, there are different architectures, Haswell, Broadwell and so on, and you may want to pin the job to one of those so that you can estimate the runtime more easily when running these big jobs.

These MPI jobs can go up to hundreds of tasks, whereas cpus-per-task is limited by the number of CPUs in the actual machine. All of our nodes have either 14, 20, 24 or 40 physical CPUs, and that is the maximum for shared-memory parallelization. For MPI parallelization you usually want to reserve full nodes, or half nodes or something like that, because then the performance is best; so what you usually do is ask for a certain number of nodes and a certain number of tasks.
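Since mpi4py was mentioned, here is a minimal sketch of the same hello-world idea in Python rather than C or Fortran. It is not a file from the course examples, and it assumes an MPI module and mpi4py are available; you would launch it with something like srun --ntasks=4 python mpi_hello.py.

```python
# mpi_hello.py -- a minimal mpi4py sketch (not from the course materials).
# Each task finds out its rank and reports it; rank 0 gathers one value from everyone.
from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # which worker am I?
size = comm.Get_size()   # how many workers in total?

print(f"Hello from rank {rank} of {size} on {socket.gethostname()}")

# A token piece of message passing: gather one number per rank onto rank 0.
values = comm.gather(rank * rank, root=0)
if rank == 0:
    print("Gathered:", values)
```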
Yeah, over here are the instructions for running these MPI programs, including how to spread the tasks evenly over nodes with the number of nodes and the number of tasks per node. And in monitoring performance we already demonstrated the seff program.

The exercises further up run the different examples, like the Matlab and Python scripts we were talking about, but we could demonstrate the first exercise; it is a good one. There are really three different options here: cpus-per-task, which is the shared memory paradigm; the number of tasks, which is the MPI paradigm; and then the number of nodes.

If we run a plain srun hostname, you can see that it runs on csl48, and you get one task running there, one process. If we add cpus-per-task=4, we still see one hostname: you get one process that has access to four CPUs. Of course the hostname command doesn't utilize those CPUs, but still only one program runs. But if you run it with the number of tasks set to four, --ntasks=4, you will notice something else happening: you get four responses. This is because when you run with ntasks, Slurm starts four different processes, and it expects those processes to understand MPI communication and to be able to agree among themselves which part of the whole each individual process does. If your code doesn't understand MPI, this means it starts four competing processes that all do the same thing. So if you are trying to do simple CPU parallelization, don't use ntasks: you will get mistakes like this, it will tank the performance, and in the worst case it will cause issues with overwriting I/O. And if you specify four nodes, you will see that the tasks all end up on different nodes.

So when you are using the simple parallelization techniques, the same ones you use on your own computer, please use cpus-per-task. That is all you need; you can just specify that and then try to figure out how to get the best efficiency out of the system, and you can come and discuss with us if you are unsure whether the program is using the CPUs correctly or not.

What if we combine them? You will get the same thing. But if we ask for 16 tasks, you will see that they are not evenly distributed. This is why, when you specify nodes, you should also specify that you want the tasks distributed as evenly as possible, because otherwise you get performance problems when all of the tasks try to communicate with the tasks on the other nodes. When you are running big MPI jobs, it can be a good idea to ask us what the best geometry on the cluster is, what the best way of running the job is.

Okay, so what now? We had this idea of demonstrating some Python, R and Matlab things. Should we take a quick break and then go on? I guess we are sort of out of time right now, but we have covered all the actual tutorials, and what we are about to do can just as easily be watched later from the video recordings.
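If you want to see for yourself that --ntasks just launches identical copies when the program is not MPI-aware, here is a tiny sketch, again an illustration rather than a course file. Slurm sets SLURM_PROCID and SLURM_NTASKS for each task it launches, so every copy can at least report which one it is.

```python
# which_copy.py -- a tiny sketch (hypothetical file name), run with e.g.
#   srun --ntasks=4 python which_copy.py
# Slurm launches four identical copies; without MPI they don't communicate,
# but each can see its own task number from the environment.
import os
import socket

procid = os.environ.get("SLURM_PROCID", "?")   # 0 .. ntasks-1
ntasks = os.environ.get("SLURM_NTASKS", "?")
print(f"I am task {procid} of {ntasks} on {socket.gethostname()}")
```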
Yeah, but I want to reiterate and make clear: it might seem that this day is a huge jump in difficulty, and that's because it is. The tasks and the questions at hand are much more difficult and much more complex, and when you scale your work up to a bigger scale it eventually becomes complex. But if you felt at home with the serial jobs, it is not a big step from there to these other parts, if you take it step by step. What I recommend is first getting hands-on with running one serial job in Triton. Then, if the work can be split up, try to parallelize it with an array job. And if you know that your code is capable of utilizing GPUs, shared memory parallelization, or MPI parallelization, then come discuss with us and we can help you get started with those. It is good to know that these possibilities exist, but the serial job is really all that most people need.

And if you make an attempt and would like someone to look at it, come to our daily garage: every day at one we are in a Zoom meeting, and we are happy to look at your things, give you some comments and see if we have any suggestions. So you don't have to worry about getting it right all by yourself; we are here for you.

Should we have a small break, and then some extra material on how to run a simple workflow? We will be back in maybe five minutes with some more examples. Please give us feedback, either in the HackMD or via the chat, whatever works: let us know how this went, how the material can be improved, whether this was on topic for your needs, and any other courses you would like us to give. This is very important, and we especially want bad feedback, because we can do something about bad feedback; good feedback just means we keep doing the same things. Okay, thanks a lot for attending. Yeah, thanks.