So array jobs — what is the main point here? The main point of array jobs is that you have individual tasks, individual problems, that don't involve any communication between them; they are independent. What do we mean by a problem here? Well, you want to run some code. I can give my metaphor: as I said earlier, my old advisor came to my office and said, okay, get the whole group here — you run with these parameters, you run with these parameters, you run with these parameters, the same code, and then send me the results. So in this case a problem is each set of parameters he gave each of us. In that case the parameters were the number of iterations or the parameters of the graph it was building, but they could just as well be which input data set to use, or a random seed, or other things like that. Yes, so basically it's for when you have something that needs to be done multiple times with different parameters or data sets: something changes between the runs, but the code itself doesn't change. The code is the same, the resource requirements are the same for each individual run, but something varies between them — something you can enumerate. There's one question from the chat that I forgot to answer: how do you cancel jobs? If you have submitted a job and then realize it's something you actually didn't want to run, there is scancel — you give it the job ID and it will cancel the job request, or kill the job if it's already running. But back to array jobs. For example, say you have something that needs to be run with different seed numbers: you have a data set and you want to, let's say, scramble it and build a distribution of your results.
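The cancellation command mentioned above looks like this (the job ID here is purely illustrative):

```shell
scancel 123456        # cancel the queued job request, or kill it if already running
```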
So usually, let's say you want to run a model on a data set. You will get some results from that model — how good the model is at predicting something. But let's say you want to make certain, for your paper, what the mean prediction or the mean accuracy of the model actually is. You need a distribution of results. So you need to run the same model again and again with, let's say, differently shuffled data sets, and you need a seed to specify what kind of shuffling you want to do — which way you want to shuffle. And you want to do this 1000 times with the same model on the same data set, so that you can get a nice graph for your paper and say: okay, I tested my model, the average is there, it works fine, and we have run enough testing. In these kinds of cases you could write 1000 Slurm scripts and submit them all, or submit the same script 1000 times. But there's an even better way. In Slurm there is this array construct. What array means in this case is basically: run the same thing, but between runs, change this one number. When you tell it, for example, give me an array from zero to four, it will run five instances of the same job separately — the same shell script, the same Slurm script. Yes, it runs them separately, so you don't need to multiply the run time request or anything like that. It basically submits five copies of the same script. But each of these copies, which run independently whenever they get scheduled, gets a single number that determines which job in the array it is. This is called the Slurm array task ID, and this is the number the jobs can then use to determine how they should change their behavior.
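A minimal sketch of such an array script — the job name and resource values are illustrative; the `--array=0-4` line is what turns one job into five independent tasks:

```shell
#!/bin/bash
#SBATCH --job-name=array-demo    # hypothetical name
#SBATCH --time=00:10:00
#SBATCH --mem=200M
#SBATCH --array=0-4              # five independent copies, task IDs 0,1,2,3,4

# Outside Slurm (e.g. when testing locally) the variable is unset,
# so default it to 0 here.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-0}

echo "I am array task number ${SLURM_ARRAY_TASK_ID}"
```

Each of the five tasks runs this same script; only the value of `SLURM_ARRAY_TASK_ID` differs between them.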
So basically: if I'm job number zero, I should use parameters X; if I'm job number one, I should use parameters Y; and so forth. You can distinguish the individual jobs based on that ID. Okay. Can we find an example below? Yeah — most likely our first array job that looks useful. Should I start with this? Yeah, let's look at it first before we start running it. What we have here is a normal Slurm script: we have specified a job name and an output file, and then a time and a memory limit. But we also have one extra parameter, which is this array. The array directive tells Slurm: from zero to 15, run all of the numbers between them. So we have 16 numbers there. By adding this one line — we have a normal serial script, and then we add this one array line — we multiply it 16 times. We run 16 copies of the same serial script, and the array ID changes between these copies. Let's check what happens when we run this script. I think I already called something hello.sh, so I'll make a new one. I will just copy and paste everything into it. And it looks like it works — I just need to adjust this to my actual location. So you can notice here there is a percent capital A and a percent lowercase a. Can you tell us what those do? Yes. In these output parameters — because, if you think about it, when you submit this job, Slurm sees the array directive and says: okay, I need to make many of these same things. But each of them wants to write an output file. So what you should do in the output filename — or, if you leave it out, Slurm does this automatically for you — is let every job write to its own file. Otherwise you end up in a situation where all of the jobs write into the same file and the output is completely scrambled, if it even works at all, because everybody tries to write to the same file.
So this percent capital A means the job ID — the job ID of the whole array job. And the percent lowercase a means the array task ID of this task — in this case, the task number. Uppercase A: put in the job ID of the whole array job; lowercase a: put in the task ID. Maybe we can submit it and see. Yeah. Should I submit it? Yes, let's do that. Okay, with sbatch as usual. Hmm, let's try watching the queue. So we see — okay, there's a bunch of jobs here: 10, 9, 4. Some say completing, some say running. And I guess... yeah, they're most likely all completed already; the display just doesn't have an update yet. Okay. When submitting a large number of jobs, you might even want to restrict the number of jobs running at once — there's a possibility in the array construct to do this, but we'll talk about it in a second. Yeah. Let's look at the output. What did we get? Here we see a bunch of new files. This is the number of the job that got submitted, and then we see 0, 10, 11, 12, 13, 14, 15, 1, 2, 3, 4, 5, and so on. Let's look at one of them — as you can see, they're all here. Let's look at the first one. Yeah. So as you see in the script below, the command the job ran was the echo: "I am the array task number", followed by the task ID. This was the output of task zero, and it says that it is task zero. Let's look at another job to verify it actually ran the way we want. One more example — yeah, 10 has 10. Let's look at the history. Here we go — let's do the last ten minutes again. Yeah, that's a good idea. What you see here on the left side is the job ID, then an underscore, and then another number. This means that it's an array job, and the array task ID is at the end. Like we were talking about canceling a job with the job ID:
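The placeholders just described would appear in the output directive along these lines (the filename pattern is illustrative):

```shell
#SBATCH --output=array-demo_%A_%a.out
# %A -> the job ID of the whole array job
# %a -> the array task ID of this particular task
# e.g. array job 123456, task 7  ->  array-demo_123456_7.out
```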
You can cancel the whole array job by giving scancel the bigger number, or cancel one specific task by giving the array-specific job ID, so you can kill just one of them if you want to. But you can see that it basically ran the same thing multiple times, with just the array task ID differentiating the jobs. And each of them got the same requirements: each got the 200 megabyte memory request, each got a different output file, each got the same time limit. Yeah. Okay, so what comes next? I should quickly add a warning here, because this is such a powerful tool for running multiple tasks. As I mentioned, you can limit the number of simultaneous tasks with the percent sign: in the array directive you can tell Slurm to run only a certain number of jobs at a time. For example, Richard here in the chat is putting five at a time, so Slurm will only run five at a time. If your jobs cause a lot of I/O, or load big Python environments or other big setups at startup, you might want to specify this limit so that you don't overload anything — you don't necessarily want to start thousands of jobs at the same time, and with an array job it's quite easy to do that by accident. The array size limit is currently, I think, something like 30,000, but you should probably put in at most 1,000 jobs at a time, and even then you should think about what your individual jobs are doing. For instance, you shouldn't do steps like compiling a program inside an array job, because that might cause a lot of I/O. But the important thing with the array job construct is that it's very versatile.
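The throttling syntax mentioned above uses a percent sign inside the array range, and cancelling works on either the whole array or a single task (job IDs illustrative):

```shell
#SBATCH --array=0-999%5   # 1000 tasks, but at most 5 running at any one time

# Cancelling, from the command line:
#   scancel 123456        # cancel the entire array job
#   scancel 123456_7      # cancel only array task 7
```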
It's versatile, but it can sometimes be complicated to figure out how to use it, because you only get this one number, and this one number is what you must use to differentiate the jobs. With seed numbers it's pretty self-explanatory: you give the seed number to the program, and if the program understands that this is the seed to use, it will work with that seed. Should we see a few more examples? Yeah, there are other ways you can work with the array job. One example would be to read different input files based on the array task ID. If you have input files with some sort of number at the end, you can tell each array task to take a different input file based on its task ID. You should also note that there are plenty of ways to write the array range. It doesn't need to start at zero: it can start at, say, 131 and go to 140, and then go again from 150 to 212. You can specify different ranges and different spacings — you can use step sizes in the array indices if your code requires it. So let's say, as in the example, you have 30 files that you need to analyze, numbered 0 to 29. With an array job, if this one number can differentiate the data files, you can run each individual file through the same program. This is very useful when you have lots of different data files that you need to analyze with the same tools. Yeah, okay. And the next example, I believe. Yes — the next example is that, let's say, your code needs some parameters that you must specify, and you have several different parameter sets to test. You can, for example, use a case structure. This is a bash case statement.
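A sketch of the input-file pattern above, assuming hypothetical file names `data_0.txt` through `data_29.txt`:

```shell
#!/bin/bash
#SBATCH --array=0-29    # one task per data file

SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-0}   # default for testing outside Slurm

# Each task selects its own input file based on its task ID.
datafile="data_${SLURM_ARRAY_TASK_ID}.txt"
echo "Task ${SLURM_ARRAY_TASK_ID} would analyze ${datafile}"
```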
The shell will run through this case statement and, based on SLURM_ARRAY_TASK_ID, choose the seed number to use. Yeah. So if you have different parameters that you need to go through, you can use this, for example, to map SLURM_ARRAY_TASK_ID to the different parameters. Here it sets a seed and then uses the seed in the actual command. Another way is to read the parameters from a file. Let's say you have a file that — for example, here, for the pi program — contains the number of iterations on each line. You can use the array task ID to pick the line number you need: array task ID 1 takes the first line, task ID 2 takes the second line, task ID 3 takes the third line, and so forth. If you have a big parameter space that you need to go through and you don't want to redo iterations, you might want to use a parameter file like this: add new lines to the file as you find more parameters to test — keeping track of which parameters you have already gone through — and extend the array index range. So don't always start from one; increase the array indices as you go. Here, for example, the sed command is used to pick the correct line of the file, and you plug that into your code so the code knows which parameters to use. Back when I did array jobs, it was very important that sometime in the future I could tell which parameters corresponded to each job ID. So I had this master file, and I never reused array indices between different parameter sets. If I needed to run something again, I would run it with the old index; otherwise I would add new indices at the end.
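Both techniques can be sketched like this; the seed values and the `iterations.txt` file are hypothetical stand-ins for whatever your parameter file actually contains:

```shell
#!/bin/bash
#SBATCH --array=1-3

SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-1}   # default for testing outside Slurm

# Option 1: map the task ID to a parameter with a bash case statement.
case $SLURM_ARRAY_TASK_ID in
    1) SEED=123 ;;
    2) SEED=38  ;;
    3) SEED=22  ;;
esac
echo "case statement chose seed ${SEED}"

# Option 2: read the parameter from line N of a file with sed.
printf '100\n200\n300\n' > iterations.txt    # one parameter per line
ITERS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" iterations.txt)
echo "line ${SLURM_ARRAY_TASK_ID} of the file gives ${ITERS} iterations"
```

The `sed -n "Np"` idiom prints only line N, which is why task ID 1 maps to the first line, 2 to the second, and so on.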
You may know the proof from mathematics that the rational numbers are countable — that there are as many rationals as integers — where you go through the rationals in a zigzag pattern so that each rational number gets its own integer. This is the same kind of idea: if you have a countable set of parameter combinations, you can map the integers onto them and go through all of them in some fashion. You just need to figure out the mapping you want to use. One thing we should also point out here, which we forgot in the serial job part, is what the minimum size of your job should be. So far we have run many very small example jobs, but when you actually start to use Triton, it's important to make the jobs a bit bigger — make them run for at least half an hour, because otherwise there may be problems. So we recommend a minimum of half an hour of run time per job, and this is especially important when you're running an array job. If you're running, say, a thousand 10-second jobs, it can quickly cause problems for the queue system: the jobs keep arriving and finishing almost immediately, and the scheduler has to constantly recalculate where jobs should go and do a lot of this re-fixing, which can cause problems in the queue. So make certain that when you're submitting a lot of jobs, each job is big enough.
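Coming back to the zigzag-counting idea earlier: in practice you can map one array index onto a two-dimensional parameter grid with integer division and modulo. The grid sizes and parameter names below are hypothetical:

```shell
#!/bin/bash
# Sketch: one array index -> a (seed, data set) pair on a 4 x 5 grid,
# so --array=0-19 would cover all 20 combinations.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-7}   # default for testing outside Slurm

N_SEEDS=4
seed_index=$(( SLURM_ARRAY_TASK_ID % N_SEEDS ))   # cycles 0,1,2,3,0,1,...
data_index=$(( SLURM_ARRAY_TASK_ID / N_SEEDS ))   # advances every N_SEEDS tasks
echo "task ${SLURM_ARRAY_TASK_ID} -> seed index ${seed_index}, data set index ${data_index}"
```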
With the array construct it's easy to run, say, a thousand jobs, but you need to remember that you're basically running a huge number of jobs on a huge number of computers at the same time. So it's important to know what one of those jobs does: if you know what one does, you should think about what can happen when you multiply it by a thousand. That's why it's important to run big enough jobs, so that they don't go through the queue too fast and cause problems that way. Now we're getting a bit short on time. I think it was really good that we spent time on the serial jobs, because that's the most important part here and it underlies everything. Should we do the array job exercises, or should we carry on to — is it GPUs next? I think maybe we should have a small break, and people could run the first array job from the exercises. They don't necessarily even need to do the full exercise — just start an array job. Even with the script you have already written, like the hostname script, you can try it as an array job. Then after the small break we go to the GPU part. You can of course do the exercise as well, but even just testing out the array construct is worthwhile: what it means to move from a serial job to an array job, what happens when you add the array directive, and what you need to change so it works correctly. Should we give eight minutes for exercises and then ten minutes for break, and resume on the hour? Yeah, I think that's a good idea. And of course you can ask questions in the chat and we will respond there. Okay, see you in 18 minutes.