Yeah, so here's the overall schedule. We've got advanced cluster usage: we start with array jobs, then parallel jobs, then the talk from CSC, and then GPUs. So Simo, your screen is ready to share. So, about array jobs, what can you tell us?

Yeah, before we go to array jobs, let's go back one step and remember what we did with the serial jobs. What we did was make a recipe for our program: instructions telling the computer what it should do, what program it should do it with, and what kind of data it should read. We told the queue system, "we want these things done, please run this somewhere where you have the resources for it." So basically we told the system what tools we wanted, what recipe we wanted to run, and what ingredients we had — the data and the program — and we asked it to do something with them. That's the normal, non-interactive serial job that you run (a sketch of such a script appears at the end of this intro).

Well, what if you wanted to make different kinds of pasta? Instead of only spaghetti, you want one pot of spaghetti, one pot of penne, one pot of tagliatelle, one pot of ravioli — I'm running out of pastas here — and maybe somebody wants their spaghetti broken in half and somebody doesn't. So you want to make several different pots of pasta.

I guess the metaphor here might be something like data science, where you have ten different sample datasets you want to run the same analysis on?

Yes, or let's say you want to run the same analysis with different seed values for the random number generator, so you get statistics of a system. Or maybe you have different topologies or different initial states for a physics system, or different forces at play. Some sort of parameter change, some sort of data change — something changes, but each run is separate from the others. If you throw penne and spaghetti into the same pot, you get a mess, not separate spaghetti and penne. So if you gave all of this to a single cook, they would first have to put on one pot of spaghetti, wait for it to boil and finish, take the pot off, put on another pot with, let's say, penne in it, and so on. You'd have a for loop of serial jobs — a sequence of serial jobs. And that of course means the overall time you need to cook all the different pastas is the sum of all of them. This can be improved a lot with array jobs.

So the idea behind array jobs — if we go to the array jobs tutorial — is that you define one recipe. Thinking about it again: we want to change the ingredients, the data that the program takes, but we want to use the same program to do the same things. We don't change the recipe. Of course, in real life you'll cook spaghetti and penne for slightly different times, and maybe the program finishes faster with one dataset than with another, but the point is: we change some of the ingredients — certain parameters, certain data — without changing the recipe.
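For reference, the serial "recipe" described at the start looks roughly like this — a minimal sketch, where the program name and data file are made-up placeholders; the array version shown below differs from it by only a few lines:

    #!/bin/bash
    #SBATCH --time=00:10:00              # instructions: how long we may run
    #SBATCH --mem=500M                   # resources: how much memory we need
    #SBATCH --output=serial_example.out

    # The recipe: which program to run and which ingredients (data) it reads.
    srun python3 my_analysis.py dataset.csv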
So in array jobs, you define an array of serial jobs that can all run independently of each other, but each one can distinguish itself based on a number called the array ID. Based on this identification number, they can behave differently: they can choose, say, a different dataset or different parameters. So you run, let's say, ten copies of the same simulation, and they all run independently with the same resource requirements and the same recipe, the same program.

So does this choice get made in your code or in Slurm? How does it know what data to use?

Yes, this is the important point: Slurm only knows the script you give it, and you tell it to run that script as an array. Inside your script, you write whatever logic describes how to map this number onto a parameter, a piece of code, or a dataset. If you know about countable infinity: the array IDs run from zero up to some limit — I don't know what it is exactly — so you basically have integers to work with. If you can map an integer onto something else, say a dataset, then you can use it in an array job, but you need to write that mapping logic yourself in your script.

So should we do an example? Yeah, let's go straight to an example — if you scroll a bit down here, there's "your first array job", and it's best to start with this kind of example. Richard, if you want to take the screen.

Okay, here we go, this is my screen now. So we're basically starting from the beginning here with a new array job, a new project. I'll do the usual stuff: I'll change to my work directory — yeah, that's there — and I'll make a new directory, 2022. I'll put the array example directly in here, so nano arrayexample.sh, and I'll save time by copying and pasting some of it.

The important part is this #SBATCH line. It looks like a normal Slurm argument, just like we've been learning — with an equals sign. That's an equals sign, yes. So --array is just a normal Slurm argument to sbatch, and then we put the commands below.

So as Richard shows: if you look at the script as a whole, you notice it's basically the same kind of script we ran previously. You have a time and a memory requirement, and an output file defined there — the output name looks a bit strange, and we'll talk about that in a second. The only real difference is this --array. What --array tells Slurm is: "I want 16 of these, numbered from 0 to 15." So you get 16 copies of the same job, and for each of them Slurm sets SLURM_ARRAY_TASK_ID to a number from 0 to 15. Each job gets its own number. If we run it now, we'll see what happens.

Ctrl-X to save. I submit with sbatch as usual: sbatch arrayexample.sh. Okay, slurm q. So okay, I see here it says these things are running. This job ID, and the 0, 1, 4 — those are the array IDs, correct? Yes.
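The script from this demo looks roughly like this — a sketch reconstructed from the description; the exact time and memory values are illustrative:

    #!/bin/bash
    #SBATCH --time=00:05:00
    #SBATCH --mem=100M
    #SBATCH --output=arrayexample_%A_%a.out   # %A = master job ID, %a = array task ID
    #SBATCH --array=0-15                      # 16 copies, numbered 0..15

    # Each copy sees its own number in this environment variable.
    echo "I am array task number $SLURM_ARRAY_TASK_ID"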
So you notice that now you have multiple jobs in the queue at the same time. All of them get the same master job ID — the first number, which tells you which job this is — and then a subscript, the array ID, which is the number of that individual task.

In the chat, or HackMD, there was a question: do these run on different cores, and are they parallel? The answer is that they are "embarrassingly parallel". Embarrassingly parallel means you basically copy-paste the work: you do the same thing many times, with a small difference each time. How would you describe it with the cooking metaphor?

So, let's say 500 years ago people were not that efficient with cooking: the pots weren't as good, and someone had to maintain the fires. To cook a meal, "parallel" meant you had several people actually working together — the person tending the fire had to communicate with the person putting the pot on the fire. Embarrassingly parallel means you have different people and they don't need to communicate at all. They could work in the same kitchen, or in different kitchens. In the past there was a lot more parallelism-with-communication, because computers weren't so powerful, but these days more and more work can be made embarrassingly parallel. You can make an array job that uses one node, and that one node, running TensorFlow or whatever modern application, can use the 20 to 40 cores on it to do a job. So array jobs make sense for a lot more things these days.

Yeah, and you should think of the array tasks as completely independent of each other. Running an array job is like submitting 15 different jobs: you could do sbatch, submit the job, sbatch, submit the job — you could copy-paste the submission script 15 times and submit all of those scripts. But then it's really hard to manage; instead, you have this one script that manages all of them. And this becomes very important when you have, let's say — remember yesterday I showed the example where we had to do feature extraction for different music genres — one script that launches ten jobs that all use GPUs to do that analysis. We'll talk about how to do this mapping later on.

Should we show the output here? Yeah, let's look at the output. Let me list the files in this directory, and we see there's arrayexample 0, 1, 2, 3, 4 and so on. Let's look at zero using — oh, let's cat it, that seems suitable for the theme of today, huh? So we cat this: "I am array task number 0". And if we cat one, and so on. So basically we had 16 completely separate runs here, each of which did something. And if our shell script had done something different with each of these numbers, we'd have been able to analyze 16 different datasets at once.
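In commands, the demo above was roughly the following (slurm q is the site wrapper used in the course; plain squeue -u $USER shows similar information):

    sbatch arrayexample.sh       # submit once; 16 tasks enter the queue together
    slurm q                      # shows JOBID_TASKID for each array task
    ls                           # one arrayexample_*.out file per task
    cat arrayexample_*_0.out     # prints: I am array task number 0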
And you noticed that in the output file name definition in the Slurm script, Richard had a percent sign with a capital A and a percent sign with a lowercase a. These are placeholders you can use in your output file names to make sure that every job gets its own output file — otherwise they would all cram into the same output file, and it would be impossible to read what's happening. The capital %A means the main job ID, and the lowercase %a means the array task ID.

So should we have a small break here — five minutes or something — so people can run this exercise themselves? Have they already been doing it, I wonder? Well, if not, you can try it out yourself: just copy-paste this into nano, submit it to the queue, and check that you get a similar kind of result.

We're getting questions in HackMD that will be answered if we go further down, so I propose we show some of the ways this is used. Okay, yeah, maybe the concept is already clear. Let's go. If you look at more examples — your screen, okay, yeah, there you go.

So if you go below here in the array job examples, here's an example where, let's say, your code takes a different input file each time. Quite often input files — if you have datasets or something — can be indexed by some number, or at least they can be renamed so that they're dataset 0, dataset 1, and so forth. Then it's easy to do something like this: you have srun, your application, take the input, and the array task ID sits right there in the file name to determine which data file you use. This is the most embarrassingly parallel of them all, because you don't have to do any mapping: you get a number and you use the number directly. So this is a very easy way of doing it.

Another way: you can have hard-coded arguments in your Slurm script. This was the strategy I used in my example, where based on SLURM_ARRAY_TASK_ID you use a case statement in Bash to set a variable — say, the seed — to a different value depending on the number. Of course, this usually only works if you have relatively few cases, because otherwise you end up with a lot of lines. So it's easiest when you have only a few values, but it gives you that kind of mapping.

A third option is to take the parameters and put them into a file. For example, here we have iterations.txt, and based on the array task ID we choose a line from this file. This gets a bit more bash-y — you don't necessarily need to understand exactly how it works, it's a bit messy, but you can just copy-paste it and it works. Basically the sed command picks out one particular line from the file. And this is a very useful way to keep track of your parameters: if you think of doing different experiments over different parameter combinations or values, you can always add more lines to the file and increase the array index range.
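The three strategies as sketches (the application names, option names, and file names here are placeholders):

    # 1) Input files indexed directly by the task ID:
    srun my_application input_data_${SLURM_ARRAY_TASK_ID}.dat

    # 2) Hard-coded arguments chosen with a Bash case statement:
    case $SLURM_ARRAY_TASK_ID in
        0) SEED=123 ;;
        1) SEED=456 ;;
        2) SEED=789 ;;
    esac
    srun my_application --seed="$SEED"

    # 3) One parameter set per line in a file; sed -n "Np" prints line N,
    #    so task N picks up line N of the file:
    PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" iterations.txt)
    srun my_application $PARAMS

Note that the sed strategy pairs naturally with tasks numbered from 1 upwards, since sed counts lines from 1.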
The array indices don't need to start from zero. So you could write down ten parameters that you want to run, then submit an array job from 1 to 10, so each task takes the line whose number matches its ID, from 1 to 10, and does those experiments. If you then want to do ten more, you add ten more lines to the file of experiments, and run an array job from 11 to 20, and so forth. You can keep growing the list of parameters, and you never have to remember "did I already run this combination?" or re-run the whole thing again and again.

So yeah, maybe we should go to the exercises — that's probably the best way to test this out for yourself. The exercises are in the hpc-examples repository. In the serial jobs session we had this memory-hog program that we ran. So in the first exercise, you can choose whatever strategy you like — you can even try several of the strategies outlined in the array jobs page — to map the array number onto an amount of memory. You can use a file, you can use the case structure, whatever structure you want for the mapping. But basically you should end up running an array job that runs this memory-hog script with different values of memory, and see how it works.

Okay, I've put this in HackMD: array exercises. How long should we give, 15 minutes, 20 minutes? Maybe 15 minutes, and then we look at the situation and continue.

And there's a good question being written there right now, so I'll quickly answer it: does the memory parameter specify the total amount of memory for the full array job, or for an individual job in the array? It's for the individual job in the array, so you don't have to do any multiplication — and the same goes for all the other parameters of the script. You can think of it like this: when Slurm sees the array option, it basically makes copies of the script, removes the array line, and treats the rest as a normal serial Slurm script. So each task behaves exactly as it would without the array line. If you specify a memory requirement, that's the memory requirement for a single task; if you specify some parallelism requirement or a GPU requirement or whatever, those apply to each array task individually. It's basically a serial job copy-pasted multiple times, with this one parameter changing.

Okay, so we're back in 15 minutes. Okay, see you then. Bye.

And we're back. So I guess now we go over the exercise — we'll do it as a demo, and you can keep working now if you'd like, and then we have a break. Simo, would you like to answer that one question now, or after? Maybe after the demo. Yeah, okay, let's do it. I'm going to my screen.

Maybe a quick check first: what's the status in Zoom? If you have done the exercise or not done the exercise, can you add it to the count, so we know the status and whether we gave enough time? But yeah, let's go on with the exercise.

So here we go, we're doing number one, the basic array job: make an array job that runs this with five different values. So I'll start the usual way and take the existing file and copy it.
So I'll copy arrayexample.sh to arrayexercise.sh, and then I'll nano it. Let's see what looks the same here: bin bash, okay; array; 15 minutes, that's good; memory — we're going to need more memory. I guess now we actually have to answer that question: the --array argument says that it's an array job, and everything else is the same for all the tasks. So the time and the memory are the same for every one of them — each array task gets the same requirements, but they don't share them with each other. Each task has 15 minutes to do its own array task, and each of them has five gigabytes of memory in this case, to do its own thing. Yes.

Okay, there's "put the commands below" — let's get rid of that. Now we need the example, so I'll do what I usually do: scroll up, say "oh, this looks good", copy this part, and stick it down here. Richard is using the hard-coded example, because there aren't that many parameters we need — only five of them — so it's easy to hard-code. So he writes 50M — I think the M needs to be capital, yeah — then 100, 500, 1000 and 5000. So now every task gets its own value. And now he changes the srun line, the executable: it's python, hpc-examples, slurm, the memory-hog script, and then the memory argument — this is the environment variable. And just for demonstration purposes, let's add a quick sleep there: there was a parameter called sleep, I think, that you can add so the program doesn't quit immediately, so we can see what happens in the background.

Okay, does this look good? I'll exit and save. Yep — I know I need to clone the hpc-examples repository, so I'll copy and paste the git clone command and run it. I had to enter the passphrase for my SSH key. And now if I list... should I clean up some of these older files? Yeah, maybe we could clear them. So I'll use the remove command with a glob, *.out, so all the output files of the previous array job go away. Okay, this looks pretty clean. So let's sbatch it: sbatch arrayexercise.sh.

Okay, why is it taking so long? I'm guessing there are so many tasks going. slurm q — oh, I didn't change the array line, it's still set to 15. Oh yeah — well, now we see what happens when we have a bug in the script: for those extra tasks we'll probably see something fun happening. But now you see that all of them are queuing at the same time, and Slurm gives this nice bracket notation that makes it easy to see the jobs. Yeah, so now some of them are running. And notice on the right side — in the node list in Richard's output — that some of them run on one node and some on another. In the eyes of Slurm, it only sees the number of CPUs and the memory and time you request; it doesn't care which machine. You didn't tell it to run on a specific machine or a specific system, so it will find you a suitable one. Some tasks might end up on the same machine if it's empty, and some might end up on a completely different machine, but because all of these tasks are independent, it doesn't matter: the only thing that matters is that each program gets what it asked for.
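Put together, the exercise script from this demo looks roughly like this — a sketch assuming the memory-hog.py script from the course's hpc-examples repository and the sleep option mentioned above; the exact path, option name, and values may differ:

    #!/bin/bash
    #SBATCH --time=00:15:00
    #SBATCH --mem=5G
    #SBATCH --output=arrayexercise_%A_%a.out
    #SBATCH --array=0-4        # the demo left this at 0-15, which causes the errors below

    case $SLURM_ARRAY_TASK_ID in
        0) MEM=50M   ;;
        1) MEM=100M  ;;
        2) MEM=500M  ;;
        3) MEM=1000M ;;
        4) MEM=5000M ;;
    esac

    # Allocate $MEM of memory, then sleep a while so we can watch it in the queue.
    srun python3 hpc-examples/slurm/memory-hog.py "$MEM" --sleep=60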
There's a nice question: could you modify the memory requirement per task, like %a times 100?

In something like the shell, abstractly, you could do this — or even in your code: take the array task ID and multiply it by something to get what you need. You can write some sort of wrapper that generates scripts, but when you submit an array job, all the tasks get a shared requirement. Usually the idea behind an array job is that you want to do the same kind of thing. If you have completely different kinds of things — if you want to cook pasta and you want to cook lasagna — those are completely different; you cannot use the same recipe for both. So if you need a different recipe, you should make a different array job. Most of the time, though, the stuff will be the same. So yeah, it works out fine.

Okay, let's look at the outputs. My prediction is this won't work, because there was no task zero in there — or was there? Ah, okay, it worked: it says it's trying to take, yeah, 50 megabytes of memory. And then task one, the 100 megabytes, then up to three, then four, then five — and this one probably has an error, it looks like "too few arguments". So basically the MEM parameter didn't get set. Yeah — because Richard didn't change the array indexing. If you show the script again — so this went to 15? Yeah, the array index went to 15, but we only had a mapping for the first values, so for the rest of the tasks the parameter wasn't set and you get an error.

So the important thing about arrays is that you have some idea of how to map these numbers that Slurm sets onto parameters that your code understands. One other option is to just give your code the number, and let your code do the logic for you: your program can open a configuration file and read a line from it, for example. You can use whatever logic you want to do this mapping. But the main question is: if you're given the number three, how should the job behave? Once you have that kind of mapping worked out, you can do whatever you want.

Let's look at HackMD — there are some interesting questions down here. "Could you do that with something that makes a sequence, like in Python?" This is in theory possible — oh, wait. Well, that's not the right syntax, but the general idea, yes. But note: the %A is only evaluated for the output. The placeholders use the same notation, but those wildcards are only expanded in the output file fields.

Should I quickly show these two here, because they're quite similar? Yeah, okay. In this case I had actually specified 5000 megabytes, so let's see, I will copy it. Yeah — in the example there was this question: if you set a lower memory limit, which of these jobs get killed? You notice that all of them have the same requirement, and if you set the requirement to something that some tasks don't fit into, those tasks will get killed.

So what I'm about to do is somewhat advanced — don't just watch, don't worry about following it exactly.
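Roughly what gets typed here — a sketch, with the same memory-hog.py assumption as above; the memory limit and sleep value are illustrative:

    #!/bin/bash
    #SBATCH --time=00:15:00
    #SBATCH --mem=100M         # deliberately low, so the larger tasks get killed
    #SBATCH --output=arrayexercise_%A_%a.out
    #SBATCH --array=1-10

    # Bash arithmetic: derive the memory amount from the array index,
    # then append the M unit with ordinary variable expansion.
    MEM=$(( SLURM_ARRAY_TASK_ID * 100 ))
    srun python3 hpc-examples/slurm/memory-hog.py "${MEM}M" --sleep=60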
This is Bash scripting, which we're not teaching here, but it will show you the kind of things you can do with Bash, and it answers two of the questions here. So in Bash, to do math it's like this — do I need a dollar sign here or not? Yeah. Maybe. No time to check. So here I've multiplied the array index by 100, and to make sure the M gets put in there, it's appended using this other Bash syntax. And let's go from 1 to 10. Yeah — but set the memory requirement to be lower, so that some of the jobs will fail. What about 100? That should fail some. Yeah, like in the example then.

These sbatch requirements you cannot set programmatically per task, because the idea is that all the tasks in an array job should be similar — they should have similar kinds of requirements, because otherwise the script doesn't fully describe the jobs if they depend on something external.

Maybe I'll copy this to HackMD, and someone can format it. Do you want to see what happens when this runs? Actually — should we do a test first, or do you think this is correct? I guess it probably works. Yeah, it's correct, I tested the syntax. Okay.

So again, let's emphasize what we're showing now: we haven't taught you how to do this in this course, but it's the kind of thing you can do. And if you come by the garage, or ask someone, or read shell scripting guides, you can learn how to do all kinds of things like this. And that will be quite nice for you.

Okay, so we see it's running. Notice that the ones with high memory usage died quickly, and now the lower ones are sleeping — so I guess that means it's working, huh? Can someone add a link in HackMD to the shell scripting course? That would be a great resource. Yeah — so you can do all kinds of stuff with the shell, all kinds of mappings like this.

There's also the advanced example there. So, about array jobs — we maybe should have said this earlier, and actually yesterday as well: normally, jobs that we submit to the queue should be somewhat heavy; they should really do something. And in this kind of cluster environment, "something" usually means they run for at least about half an hour. Half an hour is the recommended minimum size of a job, because if your job runs only for minutes or seconds, it's bad for the cluster and bad for you: most of the time is wasted on setting the job up and so on. So it's not a good idea to run very short jobs. And with array jobs it's easy to fall into the idea of "okay, I'll make a for loop that launches array tasks", and then you have thousands of one-second jobs. That's not the point. The point is to have jobs of at least about half an hour. So let's say your job has to do analysis for 10,000 parameter values, and every one of them takes one second. You don't want 10,000 one-second jobs; instead, you want to bunch them up together so that you have about half an hour of work per job, and then use the array construct to run them all.
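A minimal sketch of that bunching idea — hypothetical numbers, with ten array tasks each covering a thousand of the 10,000 parameter values, and a placeholder program:

    #!/bin/bash
    #SBATCH --time=00:40:00
    #SBATCH --mem=500M
    #SBATCH --output=chunked_%A_%a.out
    #SBATCH --array=0-9                # slow-running index: ten tasks

    CHUNK_SIZE=1000
    START=$(( SLURM_ARRAY_TASK_ID * CHUNK_SIZE ))

    # Fast-running index: this task loops over its own thousand values,
    # so each array task does roughly half an hour of work, not one second.
    for (( i = START; i < START + CHUNK_SIZE; i++ )); do
        ./my_analysis "$i"
    done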
So there's an advanced example in the tutorial that describes how to do this chunking: if you have an index, you can have a fast-running index and a slow-running index — "I heard you like loops, so we put loops in your loops" — this kind of situation where you have multiple indices running in the array job. So in that case, you could have 10,000 work items, and each array task would take, say, a thousand of them: ten tasks that each run through a thousand indices. You get this loop-in-a-loop arrangement. And the overall idea is that the runtime of each job should be at least about half an hour — and hopefully not days and days either, because then there's a real risk that it fails along the way. So it's a good idea to check that your jobs run for a sensible amount of time, and if you have lots of small things to do, it's better to collect them together into single jobs that run a bit longer. The structure for handling these fast-running indices is in the documentation.

Yeah, so now it's time for the break. I'm excited about all of this discussion here — unfortunately we have to go on, but please keep the questions coming. And if this is fascinating to you, read more about Bash; you will learn so many interesting things. Because if you think about it, lots of people in the Windows world and the Mac world use hotkey tools and macros every day — things that do the clicking for you, where you program your mouse to click for you, or a hotkey presses multiple buttons or types something, and it saves you time. Bash has lots of similar things. It's command-line, of course, but you can write out what you want to happen, so that a lot of things happen without you always having to type the same commands again and again.

And the array structure is one of those things: if you know that you're going to be doing something again and again and again, you can easily write it as an array job, because you can see the whole picture and split it up — "okay, I will split this job into multiple pieces." So learning about Bash and learning about array jobs is very helpful, especially if your problem is not a one-off — if you don't work with only one simulation, but have to do ten simulations with different parameters, it's usually very easy to do. And then you suddenly have a lot more resources at your disposal: with array jobs you can have hundreds of CPUs running at the same time — hundreds of simulations running at the same time — with minimal changes to your code.

Okay, we have to go for a break, 10 minutes. See you soon then. Bye.