So, parallel programming, or parallel running, in Triton. There were many questions already in the HackMD along the lines of: I was given the impression that stuff will run faster in Triton, and it's a busy cluster, so why doesn't it run faster? The answer is usually: you need to reserve the resources your code will use, and you need to know what your code can use, and then it will run faster.

When we talk about parallel running, this kind of parallel running where we have multiple CPUs or multiple workers, there are basically two different paradigms. There's the shared memory paradigm, which means that basically you have one computer, let's say your laptop; the laptop has four CPUs, or maybe six or eight CPUs nowadays. It's all in the same system, and they share the same memory, so they're in a shared memory system: they share the same physical machine, they're all there. And you can run programs so that they utilize the multiple CPUs in your laptop. For example, the Zoom that we are talking through right now is probably using multiple CPUs to do different things at the same time. This is really common nowadays for all kinds of computation. When you're running, say, Matlab, R, or Python, underneath it all the program will recognize how many CPUs you have and will usually try to utilize all of them. Sometimes when you're running something CPU-heavy, you notice your laptop fan starts spinning, and you get this feeling that okay, it's now most likely utilizing all of the resources available in the system. And it usually does this without you even noticing. But it is noticeable by the programmer: the user may not notice, but these programs are all written exactly with the programmer saying, okay, here's how you can divide this work.

There are many ways of doing this. The most common are multi-threading and multi-processing. For example, in Matlab you might start the parallel pool, so multiple workers work together; in Python you might use the multiprocessing module; in R you might use the parallel package or the future package or something like that. Or you might use multi-threading: for example, if you are using NumPy in Python, underneath there are these linear algebra libraries, and those have been written in a way that they can utilize multiple CPUs to do matrix inversions, matrix calculations and so on. So NumPy can use multiple processors, and this is implemented at the C level in there. The important thing to know is that with programs like these, you're asking for multiple CPUs, multiple processors, on the same machine. In Slurm there are two different kinds of requests for this: multiple CPUs and multiple tasks. For the shared memory paradigm you ask for multiple CPUs; tasks are related to the other paradigm.
That other paradigm is MPI, the message-passing interface. This is used on supercomputers, when you want to run on thousands of computers at the same time and they need to be able to communicate with each other. Usually, let's say you have a physics program: you split the problem into smaller pieces, each piece is given to a separate processor as a separate task, and each of these tasks communicates with the others to solve the bigger problem. In this case they use MPI, the message-passing interface, to do the communication from node to node, from task to task. This is the massively parallel stuff. But this is not something that is supported everywhere, unlike the shared memory approach: the program actually needs to support it. It basically needs to say on its manual page that MPI works; otherwise it's not going to work with this. If it does support it, then you can ask for multiple tasks, those individual tasks can be placed onto multiple computers, and you get really big simulations. As it says here, for example 10 or 20 years ago computers were not so powerful, so you had to use MPI in order to use 10 or 100 nodes to do something. But now a much larger percentage of work can fit on one node with its tens of CPUs, so that's enough for most people.

So how do you actually call these? How do you do this? And should you do this? That's a good question. Make certain, when you run these, that the program actually knows how many CPUs you're trying to use and so on. Because if your program doesn't know that it's supposed to use multiple CPUs, or if it assumes it can use multiple CPUs when it hasn't been reserved multiple CPUs, you might get into trouble, and seff is the best tool for monitoring this kind of behavior. A common thing that might happen is someone uses, say, R or Python multiprocessing or some parallel library, and they run it and say, okay, let's give the program four CPUs. Then the program starts running on a node and says, oh, I'm here on this node with 20 CPUs, so let's run 20 things at once, but Slurm constrains it to using only four. So basically you have all these 20 things trying to use the same four processors, and it's much slower.

So basically, with Slurm, you tell Slurm... if you scroll a bit up in the documentation, about running multithreaded applications, over here. A bit down. Was I already there? Yeah, just in the middle. Here we go. There are two things happening here. First, the program needs to know how many processors it needs, or wants, to use, and then you need to reserve those CPUs from Slurm for the program. These two need to match. It's like if you have a Ferrari on a normal city street: you can't go above the speed limit. You have the power there, but that's basically the idea, you need to reserve time at the track or something if you want to drive really fast.
And if you ever make your own program that can use multiple CPUs, do everyone a favor: in the documentation, in the README, wherever, write down how it uses multiple CPUs and how that is configured. Otherwise you get people who will try to run it on a cluster, they won't know how, and we basically have to reverse engineer it. Well, you can tell we've done that a bit too many times.

For these shared memory programs, getting multiple CPUs is as simple as we already had it in the example: you set --cpus-per-task, you say how many CPUs you want, and you get those CPUs. Of course, that doesn't mean the code actually uses them, so you need to use seff to monitor whether it actually utilizes the CPUs, but the reserving part is very easy: just specify --cpus-per-task. In most cases you just want to specify a hard memory limit for the whole job, but in some cases increasing the number of CPUs might increase the memory requirement, for example if you start separate processes. In those cases it might be useful to use the per-core memory request, but usually you just set the total memory requirement. So basically, adding this one line, --cpus-per-task, to your script allows the program to use multiple CPUs. Whether it then actually uses them depends on the program. In the documentation there are many examples of how to run Python programs in parallel, how to run Matlab programs in parallel, how to run R programs in parallel with this flag, but you need to check that documentation for your specific program.

We're at about half our time now, so should we go to some demos? Yeah, let's do for example this very simple OpenMP program. OpenMP is a standard where, if you're coding a C program or something like that, you can use it to run for example a for loop in parallel. It's also in the examples repository. Oh, okay, this is in hpc-examples, master. Should I change to that directory? Okay, so here I am. This should work on other sites too, though the setup might be a bit different there. There you can see that usually when you run these OpenMP programs, you want to set OMP_PROC_BIND to true in your script as well. What does this do? Well, basically this program just says hello world from multiple CPUs: in this case it runs on four CPUs for five minutes, and it prints the different threads. Should I run it like this, or run it like that?
I guess we can run it like this, but if you run it with srun you can see it like that. And if you run it with a Slurm script, you see below: you just add this one line, well, one line for OMP_PROC_BIND if your code is OpenMP, it depends on the code whether you need to set that environment variable, but --cpus-per-task is the important part. Add this one line and you get four CPUs.

There's a question in the chat: is the maximum number of CPUs 40? Usually yes; it's the actual maximum number of cores in the computer. The biggest computers we have usually have two sockets, so two physical CPUs, and those have 20 cores each, so we get 40 CPUs per node. That is the actual limit you can get with shared memory parallelism: the number of physical cores on the machine.

Okay, so what should we do as the next example? Maybe... yeah, should we look down towards the exercises or something? I think we might, in our limited time. There's a question: so in this course we're not learning how to make your own multiprocessing or threaded programs? Well, it's not the most complicated thing, but it's more than we can possibly cover in this course, so we're assuming that you have such a program and you can run it.

Okay, so MPI. Yeah, it's a whole can of worms; it really depends on the program how you do it. For the MPI part: if your program supports shared memory, use --cpus-per-task, that's all you need, you don't need to think about anything else. If you are using MPI code, then instead of --cpus-per-task you use the number of tasks. If you're using MPI code you're usually running some physics code or something like that, and you typically load some MPI library. We have these MPI libraries; they need to be installed by us, because they are very finicky and they need to be optimized for the networking infrastructure so that the communication between the nodes is as fast as possible. And the usual way you work with them is that you load the module and compile whatever code you have.

So, Richard, do you want to do the example over there? This one, yes, this is the Open MPI example. Okay, let's see, it's in hpc-examples, in the openmpi folder, or the mpi folder, sorry. Okay, hello MPI, yes. So let's load some modules; this is already loaded, this needs loading. And below there you can see how to compile the C version of the MPI program. So with MPI you usually need to compile programs for the specific MPI version that you have; you need to have that specific version of MPI for the code to work, so whenever you run the code you need to load the same MPI version. MPI is much more complicated for the code user than normal shared memory parallelism, but of course you can then run with hundreds of computers. Here we see that we got, well, they ended up on the same computer, but we got four of these processes running: four independent tasks, and they communicate with each other through MPI. So this is how it goes: if you're using MPI, use --ntasks; if you're using shared memory, use --cpus-per-task.
And don't mix them. You kind of mix and match them only if you're using so-called hybrid codes, but those are specialized stuff, and if you don't know what they are, I wouldn't recommend mixing and matching. So with MPI you can scale up to hundreds of nodes or something like that. In our cluster the limit is, I can't remember how many nodes it currently is, but in, let's say, CSC you can use hundreds of nodes, so we're talking about thousands of CPUs. But it's only when you have this kind of MPI-enabled code that you can scale to these sizes.

If you look at the first example in the exercises, that is a very good exercise: what is the difference between these four lines? So, --cpus-per-task: okay, we see it printed once. What that means is that in this case we ran the command hostname and gave the command access to four CPUs; this is the shared memory situation. Then what happens if we use --ntasks? Here we see that we run four separate programs. If your code understands MPI, these programs will recognize that they are MPI processes and start to communicate with each other. But if your code is not an MPI program, then you will run the same identical program multiple times, and that's not what you want, because then you will get output errors and all kinds of problems, and you will basically do the same calculation four times and waste resources, and you don't need that. So if your code doesn't use MPI, don't use --ntasks, ever.

Let's try the number of nodes. If you're running MPI tasks, you can get multiple nodes as well. And how many processors would this be using with four nodes? In this case it would use four, because you haven't specified the number of tasks; you usually need to also specify the number of tasks per node or something like that in order to get more. Is there a tasks-per-node option? Yes, --ntasks-per-node.
But these are things you only need to worry about if you're running MPI programs; if you're not, then don't worry about them. And we should emphasize that this is really not the good case here: you would prefer all the tasks to run on the same node, because then communication will be as fast as possible. So first you scale up to one full node, and only then start going to other nodes. But really, few people these days make it that far, because it's not necessary. Okay, here we go: we see it ran eight times on four different nodes, twice per node.

And our time for parallel is almost up, so what's the summary here? Yeah, I'd say there are a few things I would emphasize. After this we can talk about array jobs, and for the vast majority of cases, array jobs are a better option than actual parallelism. Parallel sounds fancy, it sounds really nice. As an example, let's say we're going to our cottage somewhere and we want to take stuff with us. We can take a really fancy, fast Tesla and fill it up to the brim, but it's so small that we can't bring all our stuff with us, so we have to drive the fast Tesla to the cottage, drop our stuff there, go back home, take more stuff, and take that to the cottage as well: two rounds with a really fast car. Or we can get two cars that are not as fancy, but we can fill those up, drive the two cars, and do everything in one go. Maybe the analogy isn't that clear, but basically, sometimes speed isn't everything; sometimes it's not worth optimizing the speed, it's more important to optimize the throughput, how many jobs you can get done. Usually with parallelization you have these kinds of limits on how fast you can do stuff and how much you can parallelize, and usually it's not worth the effort of doing too much parallelization. Of course, in some cases you have big problems that can't be run on a single machine, or they take too long; in those cases parallelization is the only option. But in many cases it's better to run individual jobs that take longer, because you get more of them done and it causes less headache with the CPU reservations and such.

So, of our remaining time, what should we do? I think we should have a small break and then go to the arrays and GPUs. Yeah, okay, should we make the break until six or something? Slightly less than 10 minutes. Okay. And you can use this time to mess with some of these examples. If you're doing these kinds of things, it's okay to come to our garage and ask us for advice before you start, because really, maybe it's better to have someone help you get started rather than trying to understand everything yourself; there are so many traps. Yes, I highly recommend joining the garage when you think you need to do something in parallel. Okay, so see you in a bit.