So we are going to talk about serial jobs next, and the nice part is that we have already covered all the important pieces. So we will have a quick introduction, put together the pieces we have already talked about, give you a good chunk of time to work on the exercises yourself, or read, or whatever you do, and then we come back and quickly go over them. So let's get started. Simo's screen is here. Okay. So, yesterday we connected to the login node and managed to run stuff there. Today we also used srun to run stuff on the compute nodes. So, once we have managed to get a job running on a compute node, what if we don't want to stay there anymore, keep the connection open, and watch it run? That is the non-interactive job, or the serial job, that we will be focusing on. And you remember yesterday's analogy. Do you want to remind people of it, Richard? So we have the recipe. The important part for now is that the recipe is basically your code, and we are writing up the daily instructions for a cook. The cook will be told: you have this much time and this many resources, go and prepare this recipe, and come back when you are done. If you have ever started boiling water, the recipe says boil water for five minutes, and it's boring stuff, you don't want to stand there watching. It would be better to hire your own Italian cook and tell them: I want this recipe done, I expect it to be finished in ten minutes or so, do it for me, and I'll come back and see what happened. And this is basically the idea of a non-interactive job: you write down the recipe of what you want to be done.
Then you tell the cook to do it, and the cook hopefully has the resources, the pots and pans and all the stuff they need, to actually do the job. Okay, should we jump into this? Let's go. So, serial jobs. Our first job script. This is a text file, so we make it with nano, our preferred text editor. What directory do you put it in? Well, wherever you are now, although there were already questions about the work directory, about where we should run this stuff. On clusters, like in this demo, there is usually some kind of big data directory where people store their stuff. On Triton it's the work directory; you can type, for example, echo $WRKDIR to see where it is. pwd tells me where I am now, in this subfolder. So here I will write the script with nano: first_serial.sh. Okay, so we open it and copy in the contents: the #!/bin/bash line, then #SBATCH --time with five minutes. What is this first line, Richard? So this looks a lot like what we did for the interactive jobs. These #SBATCH lines mean Slurm parameters, and we give them exactly the same kind of arguments we learned last time. So --output tells it that the output text of the program will be stored in a file called hello.out, which makes sense. And then we give the actual job with srun echo and the rest. This is basically a shell script, so things like $USER mean "insert my username here", and $HOSTNAME is the name of the machine I am on. So we don't need to worry too much about what this text is. Yeah, the first line is a so-called shebang: it means which program is used to execute the script. You don't have to worry about it too much; you can just copy-paste #!/bin/bash into every Slurm script. Okay.
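The script being typed here looks roughly like the following. The exact echo text in the course materials may differ slightly; treat this as a sketch, and note it only does something useful when submitted to a Slurm cluster:

```shell
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --output=hello.out

# srun marks this command as a Slurm job step;
# $USER, $HOSTNAME and $(date) are expanded by the shell at run time.
srun echo "Hello $USER! You are on node $HOSTNAME. The time is $(date)."
```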
So let's save it with Ctrl-X and save. Okay. Now we have a file called first_serial.sh, and we can use cat to check its contents. And we have a program to run it, which is sbatch. So we do sbatch first_serial.sh. Let's run it, and maybe immediately run slurm queue afterwards. Too fast, we don't see it; basically the cluster ran it right away. Okay. So we see an output file, hello.out. Should we use cat to view it? Yeah. So we see: hello, says my username, you are on node csl46, the time is whatever time it is. If we compare it to the script over here, it filled in my username here and the hostname here, and then it ran the date command, so it printed the date. So it actually did something there. And you can see that it did it on a different node: if I look at slurm history, I can see that it ran on a different node; it says csl46 there. Okay. So you notice that there is this srun command in here. Why is there srun in my sbatch script? The thing is that srun tells Slurm that this is a job step. This is especially important if you are running MPI codes, because it allows Slurm to do many things in the background. But in regular codes, you usually want to put srun in front of those pieces of the program that actually do something. The reason is that you then get more detailed output for each of these job steps: the job ID and the individual job step zero are recorded separately, on separate lines, so you can see where your code is and when it is running. So let's say you have a code that has multiple things to do; you can have them as individual job steps.
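The substitution that filled in the username and hostname is plain shell behavior, nothing Slurm-specific, so you can try it on any machine. A small local sketch (using id -un as a fallback in case $USER is unset):

```shell
# No cluster needed: the shell does the substitution before echo runs.
echo "Hello ${USER:-$(id -un)}! You are on node $(hostname)."
echo "The time is $(date)."
```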
That way you can know which program did what, what kind of resources they used, how efficient they were, whether they crashed on a certain step, and so forth. Okay. The next section on the page is setting resource parameters, and really there's not much more to say here. We've already experimented with these resource parameters in interactive jobs, and you use #SBATCH to specify them in the script. It's usually good to include them in the script so you have a reference for what you ran in the future. So, like this recipe over here: if you don't cook with recipes, you might get different kinds of food every time you cook. If you just wing it and put whatever you feel like into the research that you're doing, something happens and you get something. But if you want to replicate the results, which in science is usually a good idea, you want to document the kind of recipe you used to cook up the results you got. So writing everything into a serial job script helps you document the whole process of running your code, and it helps you make certain that you will always get the same results: run the same serial job, get the same results. So it's a good idea to use serial jobs to document how to run your code. Okay. There's a little bit on monitoring your jobs here, but we already talked about this a little under interactive jobs; you know how to use slurm history and so on, and it's also the topic of the next section. There was also a question in the HackMD that on some sites the slurm command might not be there. It's available on GitHub.
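As a sketch of the "script as documentation" idea: a header that records its own resource requests might look like this. The values and the program name my_analysis are made-up examples, not recommendations:

```shell
#!/bin/bash
#SBATCH --time=01:00:00        # run time limit (hh:mm:ss)
#SBATCH --mem=500M             # memory for the job
#SBATCH --cpus-per-task=1      # CPU cores for the job
#SBATCH --job-name=myjob       # name shown in the queue
#SBATCH --output=myjob.out     # where the program's output goes

# The script itself is a record of exactly how the result was produced.
srun ./my_analysis             # hypothetical program name
```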
There's a link in the HackMD; other sites might want to provide it as well if they're interested. But you can also download it yourself, or you can use the squeue and sacct commands; they're a bit more cumbersome, because the slurm command is basically a wrapper for them that provides sensible output by default. So you can use whatever you want, but the slurm command is what we recommend here at Aalto. Okay. So next is partitions. This is something that we've gone through a lot of work to hide, so you don't need to specify them. But first, quickly: cancelling jobs. If you have a job that is running and you want to cancel it for some reason, you can just use scancel and the ID of the job. Okay. So partitions are something you don't need to worry about at Aalto, but at some sites you might. Basically they are different groups of nodes. You might say: this is a debugging job, so I'm running it in the debug partition, which is for jobs that finish very quickly; or I'm running this on the short nodes, which are designed for things that take less than four hours. But really, for the most part, this is specified automatically here at Aalto, so let's not dwell on it very much. On other sites, on CSC machines and other places, you might need to specify -p and the partition name to submit your jobs to a certain partition. Basically, machines with different kinds of capabilities are usually in different partitions, so GPU jobs go to a GPU partition and so on. At Aalto we try to do it automatically on the back end, but it's good to know about. And with that, we're at the exercises. It wasn't too hard, was it?
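On sites where you do have to pick a partition, it is just one more option, either on the command line or in the script. The partition names below ("short", "gpu") are examples that differ per site, and the job ID in the scancel line is a placeholder:

```shell
# On the command line, submit to a partition named "short":
sbatch -p short my_job.sh

# Or equivalently inside the script itself:
#SBATCH --partition=gpu

# Cancelling a job you no longer want, by its job ID:
scancel 123456
```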
And it's good to remember that what we're doing now builds one thing on top of another. If you think about the cooking analogy again, we have to do less cooking ourselves and we get more done; somebody else is doing the cooking for us and we just need to say what we want. We're climbing the ladder, becoming restaurant managers instead of doing the cooking ourselves constantly. Everything builds on top of these scripts and this kind of syntax, so it's a good idea to get the hang of it as early as possible. So now you have about 25 or 30 minutes to work on the exercises, alone or in Zoom, and then we'll come back and briefly review them. Then we'll have a break, and then we'll talk a little more about monitoring. I see mostly questions that are already answered in the HackMD, so we can continue in a bit. Okay, see you later then. Bye.

And we are back. So we're going to try to quickly go over the exercises. I see some good questions here, so let's do the exercises quickly. Simo, your screen, you've begun this. So, number one, a basic batch job: submit a batch job that just runs hostname. So we make the script, we open it, and we basically copy the same things that we've had before. Once you start doing this a lot, you have some scripts that work, and you keep copying them over and over again; there's nothing wrong with that. Simo is setting the time to one hour and 15 minutes, the memory to 500. For the job name and output file, we might have to go look at the reference up above for what these arguments are. So I'll quickly mention that: let's look at the job name. You can give the job a name.
These are documentation-type things. Some options, like memory and time, you use all the time, so you just do them over and over until you remember them. But things like the job name are documented, and I personally go back to the documentation all the time, because who has the brain capacity to remember all of this? It's not worth the effort. Okay, so what name shall we take? hostname-test, maybe. Yeah, hostname-test is a good one. Okay, chat, set the output file: so this is --output=hostname-test.out. And then the job itself is just hostname. Since we know hostname is a trivial program, we aren't adding srun in front. So we save and we submit it with sbatch. It's submitted; let's do slurm queue. Okay, it finished very quickly, and if we do ls, we can see hostname-test.out. So it ran on that node, and if we do slurm history, it will show the same node name, csl46. So that's exercise one. We did it really quickly because it's basically what we did above.

Next is submitting and cancelling a job. So, sleep_job.sh or sleep_script.sh. All the same stuff as before: the #!/bin/bash line, the #SBATCH lines. Let's put ten minutes for the time and 50 megabytes for the memory. Okay, so now we run sleep this time. sleep is a Unix command that says: do nothing for the given number of seconds. So this will wait for five minutes, just sitting there. So we save, we submit it with sbatch as usual, and if we do slurm queue, we see, well, what do you know, this time it's actually running, because it takes a while. The point of this exercise is to cancel it. So what do we do? We have to find the job ID, which we see listed in slurm history and also when we submitted it. So we copy that and run scancel with it.
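The sleep job script being written here would look roughly like this; the echo-free body makes the job do nothing visible, it just occupies the queue long enough to be cancelled (the job ID in the last command is a placeholder):

```shell
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=50M

# sleep does nothing for the given number of seconds, so the
# job stays visibly running long enough for us to cancel it.
srun sleep 300
```

Then the workflow is: sbatch sleep_job.sh, note the job ID it prints, check with slurm queue that it's running, and scancel followed by that job ID to cancel it.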
And it takes a second, and it's done; slurm queue shows it's not there anymore. Should we look at the output? Yeah, let's do that. You notice that this time I didn't set an output name for the job, so we got the automatic output file name: slurm- followed by the job ID. That is the automatic output. You can also use filename patterns; there are patterns in the documentation that you can use if you want the job ID or various other IDs in your output file names. Okay. So the last exercise is a little more interesting: checking the output. We make a new Slurm script that has a little shell script in it, which says: for the numbers one to 30, print the date, sleep for ten seconds, and repeat. The point here is to have output that keeps coming, so we can see that it takes a little while for the output to actually start appearing in the file. So it's submitted, and we can do slurm queue. Okay, what's the output file here? This one, we can cat it. Okay, so we see two lines in there, and if we keep catting it, more keeps appearing. This is a common thing you might do: you submit a job that keeps producing output, because you wrote your code with print statements that say things like "loading data", "processing the data", "done 10,000 trials", "20,000 trials", and so on, and by catting the file you can see how it's going. Simo is using a command called tail -f, which basically keeps the file open, and every time there's a new line, it prints it immediately. This is pretty useful for following jobs. Ctrl-C to quit it, as usual. And yeah, that's the exercises we asked you to do.
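The loop script from this exercise is plain shell inside a Slurm script, roughly as follows (the time and memory values here are assumptions; the transcript doesn't state them):

```shell
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=50M

# Print the date 30 times, ten seconds apart, so new output
# keeps appearing in the file while the job runs.
for i in $(seq 30); do
    date
    sleep 10
done
```

While it runs, tail -f on the output file follows new lines as they arrive; Ctrl-C stops tail without affecting the job itself.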
There was also this log-out-and-log-back-in exercise, which was meant to demonstrate that if you leave while the job is running, you can close all your connections to Triton and it's still running in the background. It's not dependent on the login node either: if the login node goes down, which can happen with security updates or for some other reason, the job is not affected; it's completely independent. It has flown the nest and is doing its own thing. Okay, let's check the HackMD before the break. Let's see, questions on the exercises. Yes: if you submit something without specifying these parameters, it uses the defaults. If it's really a small test, that's okay, but for anything significant that you're running a lot, it's better to be explicit, just so you know what you asked for. I think this one was answered here: sbatch versus srun plus screen. I think these answers cover it pretty well. Screen is a tool that lets you run a program interactively, log out, log back in, and see it again. If you don't know what screen is, don't worry about this answer. So if you do this and have only one thing running, you're not going to forget about it. But once you have two or three different things running, you eventually get to the point where it's hard to keep track of everything that's going on, and then using sbatch is better. And once you start running 10 or 20 jobs, or array jobs with many trials, then you really need sbatch. There's also the fundamental idea that if you're doing something non-interactively, why not use the tool that is designed for non-interactive work?
So basically, say you see a cross-shaped screw: maybe you can open it with a Phillips screwdriver, or maybe with your keys or something like that. You can use whatever tools to get the same thing done, but it's usually better to use the tool that's been designed for that specific thing. These things were designed for a reason, and the reason is that they make the user's life easier; if you're battling against the tools, it's usually not worth it. Yeah. There's a question about what I meant when I said hostname was a trivial program. Basically, we knew hostname would take zero seconds to run and essentially zero memory, so it wasn't worth getting a separate job-step line in the history. For any program we run for real science and real calculations, I would probably add srun. There are a few exceptions. For example, in our cluster there is currently a bit of a bug with GPU allocation and srun, so some GPU jobs don't work with srun; that's mentioned on the GPU page, for technical reasons. You can also use srun to launch individual steps with different resources within the same job, with multiple calls of srun, but those uses are quite esoteric. If you're using something like Spark, or you want to create your own little compute cluster inside the job, then it's feasible; but in most cases you don't strictly need srun, it's a helper. Yeah. So let's look at that for loop, this question here. Simo, can you show the for loop again? Here we go. Sure. So there's no srun in front of the for, because for is a shell construct.
You could put srun in front of each date here, and then you would have 30 individual job steps. That's actually a good point; let's do it, just to show what happens. If we look at slurm history, we can see, for example, that the previous date script produced only a "batch" and an "extern" line. The batch step collects information on the batch script itself, including those commands that don't have srun in front of them, and the extern step relates to how the job is initialized. Now that we've added srun, if we submit the date script and look at slurm history, once it has run a few cycles, we can see that it produces a whole bunch of these steps. Let's wait for one more cycle. Okay, let's cancel it. In the history we can now see that this one job has all of these lines. You can see that the first steps completed, and then the whole job was cancelled, but many of the steps had already completed. So let's say you have a job with a for loop that analyzes, within the job, let's say a hundred different data files, and then the job runs out of memory or something. You can use the information here to determine which files were already analyzed, or which file is corrupted, because you know at the job-step level which steps worked and which didn't. Of course it depends, and you shouldn't have tens of thousands of these, because then you would create problems for the accounting and queue system; but at a scale you can manage yourself, it's a good thing. Yeah.
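What Simo does here, as a sketch: the same loop with srun added inside it, so each date becomes its own recorded job step (resource values are assumed, as before):

```shell
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=50M

for i in $(seq 30); do
    # Each srun call is recorded as a separate job step,
    # visible as its own line in slurm history / sacct.
    srun date
    sleep 10
done
```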
So, should we go for a break now, ten minutes, and come back with monitoring, which is basically the last part of the day? So yes, thanks a lot and see you soon. Bye.