Okay, here we are. Serial jobs. Yes, so serial jobs are basically jobs that are not interactive. The idea is that you run something from beginning to end completely in one script, in one go. That might involve many separate steps, or it might be a single program call. Serial in this case also means it's not running in parallel — you're not doing multiple different routes at once, you're just going from the start of the script to the end. Yeah, like one script, one data set, one submission. Yes, something like that, which I guess is always the starting point for doing anything: you have to submit a job once before you can make it do multiple things. So how are serial jobs actually written? They are written as these Slurm scripts, which in reality are simple shell scripts. This is why shell programming is so important for using these clusters: you need to be able to tell the machine what you plan to do, and everything needs to be in that same script — the software module loading, changing to a certain working directory, and so forth. You need to tell the computer what to do, and you need to tell it in the language it understands, which is the shell language: go to this folder, open this data file, run this program — these are the kinds of commands you give the computer. Should we scroll down and look at the first job script? Yes, let's do that. Here we go. If we look at this first job script, you will see a couple of special lines that differentiate it from a normal shell script.
The first one is the actual first line, the shebang directive. When Slurm executes this script, this line tells it which program to use as the interpreter. You can use other interpreters besides a normal shell, but they are sometimes a bit more complicated to use. So I could have Slurm start Python directly, for example? Yes, it's possible, but you usually need to do some preparatory steps — loading modules, something that involves the shell or the folders and so forth — so you usually need a shell, and bash is the most common one and the one we recommend. So you will always have this first line in your script, and it just tells Slurm: when you run this script, use this interpreter. Okay. The next lines are the meat of the Slurm script, in the sense that they are the resource requirements your job will want. These are given as comments to the Slurm scheduler — written as `#SBATCH` followed by a resource requirement. They are exactly the same options as previously, when we were running with srun and giving these requirements as command-line arguments. Instead of reading them from the command line, Slurm reads them from these comments in the file. So these comments don't affect your program directly, but they are hints or instructions to the scheduler about what kind of resources should be allocated when this script is run.
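The script being discussed looks roughly like this — a minimal sketch assuming the values mentioned in the session (five minutes, 100 MB per CPU, the debug partition); the exact numbers on the course page may differ:

```shell
#!/bin/bash
#SBATCH --time=00:05:00          # five minutes of run time
#SBATCH --mem-per-cpu=100M       # 100 megabytes of memory per CPU
#SBATCH --output=hello.%j.out    # %j expands to the job ID
#SBATCH --partition=debug        # small partition for quick test jobs

# Everything below is ordinary shell, executed on the allocated node.
echo "Hello, I am running on node $(hostname)"
```

To the shell the `#SBATCH` lines are just comments; only the scheduler reads them.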
Yeah, I can see that the hash is the standard comment character in shell scripts, so as far as the shell knows these are comments — but I guess Slurm looks at them and processes them before the script gets to the shell. Indeed. There are two steps to each job submission. First you submit the job: you give it to the scheduler, and the scheduler looks at the shell script and tries to determine what resources the job needs. It checks the command-line arguments, it checks the defaults, and it also scans the script itself for these comments. When it encounters them, it makes a note: okay, in this case I will need five minutes of time and 100 megabytes of memory per CPU; I will put my output in this specific file; I will run in this debug partition — we have the debug partition for running small jobs fast. It stores all of that for when the job is actually run. Then, once the job has been submitted and the requested resources become available, the job actually runs with the allocated resources and executes whatever comes after the comments. Should we try this? Yeah, let's do it. Okay, it will take me a few seconds to get ready. So here I am on Triton. I should change to my work directory, which I will do with this variable, and make a directory for the project. Let's look at the file. How would you like me to arrange the windows here — is this good? Yeah. Okay, I will edit the file — hello.sh — with nano.
And I will copy and paste all of this in here. I think there's a window at the bottom right — that window shows the previous commands Richard has been running, so for everybody watching: it's not there by mistake. Okay, is there anything I need to edit here? I notice this output line — I guess this is my username and then the job ID? Yes, each job gets an ID. I think you need to put the `%` there before it. Yes, right. And it goes to your work directory? Yes — I made something called kickstart2021, and hello.jobid.out is where it will be saved. Okay, I'll save with Ctrl-X, then y and Enter. There we go. Let's ls to see what's here. So after you have written this kind of script, you can submit it with the sbatch command. sbatch is basically Slurm's command that says: okay, submit my task. Should I do it? Yeah. I think a bit of the shell is hidden — the font might be too large. Hopefully it will scroll down and work; I'll keep an eye on that. Okay, so once we submit the job, we get a job ID back and see that the job has been submitted. Richard tried to catch the job in the running state by running `slurm queue` right afterwards, but it runs so fast that it wasn't caught. With `slurm history`, though, one can see that it has actually finished. There's a lot here — I'll do `slurm history 1day`... still a lot of stuff. What about one hour? One hour, yeah. And I need a smaller font size here, though I can't easily make this font smaller. Anyway, we see from the output that the job has run.
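The submit-and-check cycle just demonstrated can be sketched like this. Note that `slurm queue` and `slurm history` are the Triton-specific wrapper commands discussed later; on a plain Slurm installation the equivalents would be `squeue` and `sacct`:

```shell
$ sbatch hello.sh       # submit; prints something like "Submitted batch job 12345678"
$ slurm queue           # list your jobs that are still pending or running
$ slurm history 1hour   # list your jobs that finished within the last hour
```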
We see here that there is a job ID on the left side, and a job name, which is the name of the submission script — you can change the name if you want, but by default it's the submission script. Yeah. And you will see these underlying steps. When sbatch runs the script, it runs the whole thing, but it can also run individual steps inside it. If we look at the Slurm script in the browser, we had an srun statement before the echo, and in the history output we see that echo as its own step. So inside an sbatch script you can have individual srun calls, and these show up as individual job steps in the history output. This is not something you have to do, but it's sometimes very useful — especially if you're running something like an MPI job that needs information from Slurm, it might need the srun; we will talk about that later. But also, if you're running a job that has, say, a preprocessing step, then an actual calculation step, and then some sort of post-processing step, you might want to have these as individual steps so you can monitor, for example, resource usage on a per-step basis. So there are two purposes here: one is to record resource usage separately per step, and the second is that srun somehow makes the parallelism work for MPI programs. Yes. For the main program calls — not necessarily for things like cd-ing to a folder or a module load, but for the main program calls — you should use srun. Then each step gets the resources it needs, and Slurm records what the resource usage was for each individual step. But for something like module load it's not really necessary? No, no.
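A sketch of what such a multi-step script might look like — the program and module names here are hypothetical; the point is that each srun call becomes its own job step in the accounting:

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G

module load mymodule             # plain shell: no srun for module loads

srun ./preprocess input.dat      # step 0: preprocessing
srun ./compute                   # step 1: the actual calculation
srun ./postprocess results.dat   # step 2: post-processing
```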
Actually, for module load srun wouldn't work at all — but other things like that, copies or moves, are not important. Yes, unless your copy takes several minutes, and then you might want to track it. Okay. So where's the output of the job? In the Slurm script we had the output statement, which describes where the output is written, and Richard changed it a bit by adding the kickstart folder — but anyway, it should be there. Yeah. Okay, so if I look at the contents of this file, which I can do with cat — I guess that's the job ID that got submitted. Yes. And it says: hello me, I'm on node csl48. Yes. Because we used the echo statement in the sbatch script, this is just the standard output of what happens while the script is running. And this brings up an important point with sbatch scripts: the communication — how you monitor these scripts — happens via the filesystem. You monitor the job via standard output or the output files of your code. When you submit the code, you will only get output back either through the program's standard output, if it prints some statements, or through files it saves in the folder, which you can then inspect to check the output. If I don't specify an output path, what happens — is there some default? Yes. By default it creates `slurm-<jobid>.out`; if you are running an array job, it adds the array number as well. So you get these default output files if you don't specify the output yourself. And what you usually want to do is add some verbosity, some regular output, to your code, so that you can monitor its state: what is it doing, what stage is it currently at, and so forth?
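As a small sketch of that kind of verbosity — plain bash, nothing Slurm-specific, and the stage names are made up — printing a timestamped line before each stage gives you something to watch in the output file:

```shell
#!/bin/bash
# Print a progress line before each stage so the job's output file
# shows how far it has gotten (the stage names are hypothetical).
for stage in load preprocess compute save; do
    echo "$(date +%H:%M:%S) starting stage: $stage"
    # ... the real work for this stage would go here ...
done
echo "all stages finished"
```

While the job runs, `cat` or `tail -f` on the output file then shows which stage it has reached.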
That is, so you can monitor it non-interactively. If you are currently running, say, a Python script or a MATLAB script, you press play in your IDE or development environment, it produces no output for two hours, and you just watch to see whether it's finished or not. That can be really complicated to monitor on Triton, because you will only get the answer of whether it finished or didn't finish, and that's usually not what you want. You want to write the program so that it gives you some hints about what step it's actually taking at this point, what kind of output it's creating, and so forth. Submitting a job is basically a fire-and-forget kind of thing: you tell Slurm, okay, run this script, and it tells you, okay, I will run it, and you can ask Slurm whether it's still running or has completed. But you have to have some phone number, basically, to the worker itself if you want updates on what the worker is doing — and in this case, that's the standard output. Okay, good. So what comes next in our lesson? Should we continue? I think we've covered most of the things here. This warning is something I've seen a lot — are you going to talk about that? Yes. It's good to make the distinction immediately in your head that you shouldn't use bash to submit the jobs, but sbatch. With sbatch, Slurm takes hold of your script and runs it; bash means run this script now, right here. So if I run bash hello.sh, like this... Yes. Now it looked like it submitted, but it was the srun inside that was submitting. So this actually ran as an interactive job — it says it's me and I'm running on this login node — and it knows nothing about the other Slurm parameters here. Yes.
So sbatch is the wrapper that takes the script, sends it to the Slurm queue, and then runs it somewhere else; bash is the interpreter that is then used to actually run it once it reaches the place where it's supposed to run. If you just run bash on the script, well, you won't get far from the login node — you're actually running on the login node, and that's not what you want. So remember to use sbatch when you're submitting things to the queue. Yeah. Okay, what else? We looked at the status with `slurm queue`, although there was nothing there because it finished too fast. You might want to add a sleep to the code, and we can demo it. Do you want to do that? Yeah, let's do a quick demo. hello.sh — let's add a sleep. We can sleep for one minute: `sleep 60` means just do nothing for 60 seconds. Okay. Yes, Ctrl-X, save, Enter, and I will go ahead and decrease my terminal size, because people say it's more readable when it's smaller. Okay, should I submit it? Yeah, let's do it. And I remember to use sbatch. You see that the job is submitted. We do `slurm queue`, and here we see the job ID, which matches the submitted job ID, the name, the time, the state, and where it's running. There are various things you might see here. Quite commonly, if you're running a job that actually has some requirements, unlike this small one, you might see PENDING in the state. Pending means it's waiting for its place in the queue — basically waiting for a suitable resource where it will be placed. I guess maybe later we can talk about how Slurm prioritizes what's running. Yes. One other thing you might see is BadConstraints. That's an internal thing; it basically means the job is still waiting for the right kind of node, but it's nothing to be scared about. Okay. So let's see what's next: it asks about resource parameters.
But I guess these are similar to the interactive job resource parameters? Indeed they are. When you submit a job, you will need to give it some resource parameters. There are default values set, but you should always set these yourself once you know what the job actually requires. Right. The two parameters you will almost always need to set are the time and memory requirements. This is important because of what Richard said about how Slurm prioritizes jobs: the job size is calculated basically as the number of CPUs, or the amount of memory required, times the time requirement. So Slurm reserves this kind of block of resources for you and calculates priorities based on it. I don't know if I explained that very well, but basically what you need to do is tell Slurm to give you a certain block of resources, and for that you need to give it a time and a memory constraint. And as we talked about yesterday, you can go a bit over the memory limit, and you can go a bit over the time limit as well — but if you go hard over them, Slurm will eventually kill your job. Yeah. You know, once I heard this description: it used to be that on clusters every job fit into a rectangle — a certain amount of time and a certain amount of resources, and that's all it needed. But now, with for example machine learning methods and all these advanced things, the resource usage may vary over time: for a short time it needs more memory, then it shrinks some, and that makes it a little harder to set the parameters. Yes, that's true. Yeah. So maybe that's something we can talk about later.
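Concretely, those two ceilings are set with lines like these — the values here are placeholders; pick them from what you measure your own job actually needing:

```shell
#SBATCH --time=02:00:00     # ceiling on run time (hh:mm:ss)
#SBATCH --mem-per-cpu=4G    # ceiling on memory per allocated CPU
```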
So you should think of these as ceilings. Think about driving a truck under a bridge: if your truck has one spot where it's really tall and otherwise it's really, really tiny, you're not going to say "on average my truck should fit under that bridge", because that's not what's going to happen — the truck will hit the bridge, because of that one huge spike. Similarly, in your code, if you have these kinds of spikes in, say, memory usage, you should try to write your code so that you don't, for example, load the whole data set just to work on a part of it, because then you will need enough memory for the whole data set. Instead, write your code so that it loads only the part of the data set it needs; then you can lower the maximum height of your job. So you should really think of your job as this truck that needs to be a certain width and a certain height. All these things are somehow interrelated, and that's what makes the cluster interesting — or fun. Okay, we're half an hour in; let's see what else there is to discuss here. We talked about the resource parameters; monitoring your jobs I guess we've already sort of looked at. Yes — there are very many tools in Slurm to see detailed information about what your code was actually doing, like sacct; the `slurm` command is actually just a wrapper around sacct underneath. You can see various other things too — I don't know if you need them, because there's a huge amount of information there — but we have put many of the most common things into the `slurm` command. sacct on its own gives you this sometimes pretty complicated information.
So we have written this `slurm` script so that it gives you the information in much more manageable pieces. Okay, so `slurm` is an Aalto-specific thing? Yes, yes. You can look at it — it's Perl code written years ago, but it's very good, so we're still using it. If you look a bit below, we have this list of very useful commands that you might want to use yourself. The most interesting ones for you are probably the ones that show the state of the whole cluster, like `slurm partitions`, so you can see what's currently happening and get a bit of an expectation of the cluster's state. We were talking yesterday about how the cluster is heterogeneous — there are multiple different kinds of computers — and it has been split into these partitions based on what kind of computers they are. We have set things up so that you usually don't need to request any of these partitions manually; you don't have to think about it, you just give the resource requirements. But you can see here that sometimes some partitions are full, some are allocated, some have GPUs, some don't. You see columns here: allocated, idle, other, and total. So for example there are some 42 idle nodes in the batch partition — plenty of resources to go around currently. Okay, but this is something that users don't normally need to look at? Yeah, not normally, but if you want to see, for example, what the time limit is for, say, the GPU partition, you will see it here. The partition time limit for GPU says five days. Yes, by default it's five days.
Another important command — well, not necessarily the most important, but one of them — is `slurm features`, which is somehow missing from this list, actually. Okay, maybe someone can add it quickly. Yeah, we should add it to the list. This command tells you what types of machines are available. Maybe you can make the font a bit smaller? I think I can't — this terminal only has certain fixed sizes. Okay. Well, if you run it on Triton, you will see for yourself what kinds of computers we have. Like yesterday, there was a question about trying to run TensorFlow on kosh: you get these AVX errors, because the kosh machine — the Aalto shell server, where you shouldn't necessarily run TensorFlow anyway — doesn't have the AVX2 instructions. And here you can see that some of the machines, the Ivy Bridge machines, are missing the AVX instructions. Is AVX a processor feature, a hardware thing? Yes, it's a vector instruction set. So here you will see that certain features are available on certain machines, certain GPU cards are available on certain machines, and so on. Okay. So let's see — we should probably work towards the exercises. What else is important? I think people can read all these commands themselves if they need to; there are basically lots of different ways to get information about the system. Yes, and the most common ones you will use are `slurm queue`, `slurm history`, and sbatch — those are 90% of what you will do anyway. You will use sbatch to submit a job, `slurm queue` to look at what's happening in the queue, and `slurm history` to check what the output was. That will be about 90% of your use. But you should note that there are plenty of other tools to see more information about your jobs.
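If your code does need a specific machine feature, you can request it with a constraint. A hedged sketch, assuming the feature name appears as `avx2` in the `slurm features` output (check the actual spelling there):

```shell
#SBATCH --constraint=avx2   # only run on nodes that advertise the avx2 feature
```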
One thing — well, we can actually look at the seff command later. But now we should probably do some of the exercises. Okay, let's see what else is on the page. We already talked about partitions some. And like Simo said — on some clusters you have to care about which partition you're submitting to, but on Triton it's mostly automatic: just say what you need and it will run somewhere. Which is a nice little feature. Okay, the full reference is just more lists of commands, so... yes, exercises. I guess we will break up into the Zoom breakout rooms for doing the exercises. So how should we do this? I think we should have at least 15-20 minutes to run the exercises, because these are really the meat of the cluster: you should get the hang of how to write these sbatch scripts. After that, I think we should demo a few of the exercises here. Maybe we can have some time in the Zoom breakout rooms for questions, but then come back to do the exercises together here. I guess we can go to a short break, you can discuss in the Zoom room, and then we instructors will talk with the Zoom host and decide what to do. Does that sound good? Okay. So how many exercises are there — one, two, three, four, five? And how many should people try to do? I think we should give a lot of time here, because this is the first one where, if people can do these, then everything else will be relatively straightforward. Yes. Should we give maybe 15 minutes for independent work and then come back and demo it together? Yeah — 15 minutes for the exercises and then a 10-minute break before we continue? Yes. Okay. And please — we're sort of combining the self-working time with the break time, but don't forget to actually take the break. We will resume at 1:05. Does that sound good? Okay, great.
In HackMD it will clearly say what the exercises are.