So we continue to serial jobs, which is basically making a file that defines what to run and then submitting it. You can leave, come back, and watch it. Then we talk about monitoring the job's progress and so on, and this answers a lot of questions people had yesterday, like: how do I tell how much memory and time my job needs? What do I request, and so on. Then array jobs, which is serial jobs multiplied by many things happening at once. Then, as a sort of intermission, we talk about the module command. This used to be on day one, but we wanted to spread out the non-example-based stuff even more. And then we talk about GPU computing and parallel computing. We're not talking here about how to write your own programs that do these, but if you have a program that uses GPUs or parallelism, how to run it on the cluster. I think we probably will not get to this last part here, so we'll skip it. We'll talk about this at the end, but you can read these follow-up suggestions for where to go next. With that being said, any other introductory comments?

Yeah, I'll quickly mention that basically today's work builds on yesterday's: we were interactive users of the cluster. We connected there.
We put some files there, we ran some jobs interactively, and we got some interactive work done. So today we are basically trying to tell the machine to do our work for us. We tell the cluster, the machine and the queue system, that we want something to be done, then we let it do it, and then we reap the benefits. Everything is built around that principle. You should think about it as a bit like programming your workflow. This is the kind of thing we are going to be doing, and it is done through these non-interactive jobs. Everything from here on extends from non-interactive jobs. So should we go forward to the first topic? Yeah, let's see serial jobs. I can close some extra things. Okay. So how do you want to arrange this? What do you want to talk about, and I'll type? Sure, yeah, we have plenty of small examples littered around this page, so we can go through the examples. So what are serial jobs? Previously we ran interactive jobs: basically just commands that were executed on the compute nodes.
So we told the queue system: okay, here's a command, run it on the node. In some cases we just wanted an interactive command line on a compute node. Okay, this works up to a point, but everything here relies on human interaction with the system. You need to be there and write the commands you want executed, and that scales only to the point of how many command lines you want to keep open. Like we mentioned yesterday: if you had a thousand laptops, could you run your code a thousand times on those thousand laptops? Most likely not; you would spend your whole day writing commands on the different laptops, and it would be a really inefficient way of working. I have a good analogy about this, which I'll tell in the array jobs section. Yeah, but in the cluster system what you want to do is codify what you want the program to do into these scripts. The scripts are basically a collection of commands that you want executed, and then you give them to the queue and tell it to execute them. Once the resources you asked for are available, the queue executes the commands for you. So that is how it basically works. You want to create this kind of file that contains all the information, basically the instructions.
So basically, the queue system is a really good cook that can do whatever you want for you, as long as you give it a good recipe. That is the idea. You would need to specify: okay, put in the water, make it boil, put the pasta into the water. You need to give it the instructions it requires, and then it will produce the dish that you wanted. But if you leave out some of the instructions, it doesn't try to fill in the blanks. It doesn't understand what you're trying to do unless you tell it what it's supposed to do. And how this works is that we write these scripts. So Richard, if you scroll to the first script example: here's a simple first job script that you could write. So what do we have here? The first line is usually the same: usually people run their scripts with the bash interpreter, and bash is basically the shell, the command line. With that first line you tell the system: okay, run these commands as if in a command line. That usually never changes. The next few lines: if you remember from the interactive session we had yesterday, we gave these resource parameters to the queue system on the command line with srun. We said: give me some memory, give me some time. We specified these on the command line, but if you want to submit a script instead, you usually write the requirements into the file itself. So you write these directives like this; this is the static form of those options.
So the format is a hash sign, then SBATCH in capital letters, and after that you can have parameters for the queue system. The queue system basically parses through this file: it goes through the file looking for these comments, and when it sees them it says, okay, I will add these requirements to the job requirements. When it encounters the first line that is not a comment, it stops looking and assumes that all of the requirements have been specified. In the example here, we just ask for five minutes of time. We should start doing this. Yeah, you should probably write this. Yeah, so to organize, I'll make a directory called kickstart. By the way, did you get my screen share? Yes. Okay, great. So this will be my working space for now, or should I put this in the work directory? The work directory would probably be better, but it doesn't really matter for these small cases. As a good rule of thumb, it's better to go to the work directory and work from there. Okay. Okay, so, new directory. Oh, here I am. Yeah. So should I make this script? Yes. You can use nano, an easy editor that works from the command line. There are other, more powerful editors as well, but they are more complicated, and nano gives you lots of feedback, so it's easy to use. So you can type nano and, let's say, hello.sh. Okay. And then basically copy-paste or write the contents there. The first line, like I mentioned, is the interpreter, who executes this stuff. On the second line we ask for time resources: let's say that this job should take five minutes to run. On the third line we ask for 100 megabytes per CPU, in this case. We'll talk about this memory option; you can use --mem or --mem-per-cpu.
We'll talk about the difference later on, but basically: 100 megabytes of memory. And then we set up an output file for the script, in this case hello.out. This is where the output will end up. We'll talk about the output a bit more in the monitoring section, but the output means what the script prints to the command line. So what you would see if you ran these commands yourself, whatever print statements they produce, will be redirected into the output file. In this case we just run this one srun command with echo: basically just print "hello username, you are from host nodename". And save and run? Yes, Control-X to exit, and "save modified buffer": yes. And then: hello.sh. If I list what's here, there's this. Yeah, okay. So if we look a bit below in the documentation, we can see that if you want to run this in the queue, you need to use this sbatch command. sbatch basically means: run this as a batch job, and a batch job means, well, do it non-interactively in the queue. When you run this with sbatch, the queue will take over and run it for you. So Richard, should we press... yeah, Enter. Yes. Now we can use, for example, slurm queue, which also works at some other places, so let's use it. It should work... now it works. Yeah, it should be available at other places as well. So it already ran; it doesn't show anything because it already ran, and it produced hello.out.
So if we look at it: I'll use a program called cat, which basically takes the contents of a file and prints it, which in this case is nicer than starting an editor that I would then have to close. So here we go: it shows I ran something. Yes, the print statement in the script was printed out and redirected to the output file, and we got what we wanted. Of course, this is a really simple example; we didn't do anything really, but we could have done anything we wanted in the middle part. If you show the Slurm script again, Richard: instead of the srun echo here, we could have done whatever we wanted. We could have loaded some modules to get some software, we could have run some software commands, and they would have been executed on the node that we ended up on. And we could have specified some other parameters for the job. So basically the whole workflow of our analysis, these commands that you would normally write as a series: okay, I'll first run this command, then I'll run that one, then I'll run this group; we could have put them all in there, after we have specified the resource parameters. And this is the idea of the serial job. So basically you write this one script that you submit, and you put it there.
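The script and workflow just walked through can be sketched like this. It's a minimal sketch: the time and memory values are the examples used above, and `slurm queue` is a wrapper used at Aalto (elsewhere `squeue -u $USER` gives similar output).

```shell
#!/bin/bash
#SBATCH --time=00:05:00        # five minutes of run time
#SBATCH --mem-per-cpu=100M     # 100 megabytes of memory per CPU
#SBATCH --output=hello.out     # command output is redirected here

# Slurm stops looking for #SBATCH lines at the first non-comment line,
# so keep all of them at the top.
srun echo "Hello $USER! You are on host $(hostname)"
```

Save this as `hello.sh`, submit with `sbatch hello.sh`, check the queue with `slurm queue` (or `squeue -u $USER`), and once it has run, `cat hello.out` shows the printed line.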
By "one script" I don't mean that you need one script for all of your jobs; it means one for the specific task that you're trying to accomplish. You can have multiple of these, and the good thing about them is that they document what you're trying to do. If you've ever used, let's say, Jupyter for coding, or MATLAB, or any IDE really, you end up in this situation where you run different commands, and you run them in different orders, and you no longer know how you ended up at the end point you got to. When you have this Slurm script, you basically have a set order of execution, a set order of commands that the program runs, and then you know: okay, this is what the program should do. We should let people try this out. Sure. So if we put a quick poll into the chat... Yeah. Are you doing it? Oh yeah, I can write it: try running that example that Richard just did. I'm not certain about the formatting; maybe somebody in the HackMD can fix the formatting so it doesn't become so capitalized. Yeah, we needed more blank lines there. Oh yeah. By the way, I really like whoever adds extra characters to the bar graphs. Yeah. But it wasn't that hard, right? It's very easy to create these scripts. Basically the only idea you need to have in your mind is: okay, what steps need to be done when I run the program? Of course this can be complicated: if you have a very complicated program, you might have lots of complicated steps. But usually it's very easy to codify what you're trying to do. Okay, let's look at this quickly, since most of the people have managed to do it.
Yeah, I will quickly... Richard, if you want to bring up the page again. A quick note about the warning that we have on the page there: run it with sbatch, not bash. That is a very common mistake. Even though the file is named with .sh, and it will indeed be executed by bash, the idea is not to run it immediately; the idea is to let the queue run it in the background on a compute node. So you need to use sbatch to submit it to the queue. The mistake is very common because they sound the same, sbatch and bash. Should I try it and see what happens? So this is what you should not do, unless you're testing locally. Let's see... it says it's queued? But notice that now it puts the output to the screen. What actually happened is that on the login node it went and ran each line in the file. And if we look at the file, we see this is srun echo hello, so it did request an allocation on another node. It filled in those user and hostname variables with the currently existing user and hostname, so login3, the login node, is the current place where it's running. It filled those in and then ran the command on a compute node, where the already-filled-in text was printed to the screen. So it's a bit of a mess. Usually it's a good idea to add something like hostname at the start, so you know where it ran, and you know that everything is running on the compute node and everything works. But with our pronunciation of sbatch and bash getting mixed up, just remember: if you're trying to run it through the queue, you need to use sbatch. Yeah.
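The suggestion above, putting hostname as the first command, looks like this as a small sketch (the resource values are illustrative):

```shell
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=100M

# Printing the hostname first makes it obvious where the script ran:
# a compute-node name means it went through the queue; a login-node
# name means it was accidentally started with bash on the login node.
hostname
srun echo "Hello from the batch job"
```

If the output file starts with a compute-node name, the job really went through the queue.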
Yeah Okay, so what's next is there really much more for Yeah, yeah, we could we could talk about like quickly about the uh, the resource parameters So yesterday we already have uh had had a discussion about when we did the interactive jobs Like what happens to jobs if they are They run out of memory or if they run out of time. So they are killed. So there's a bit of a leeway to the To the job. So so usually there's like an hour worth of leeway in the time department. So or something like that So if the job runs a bit over time, it's not immediately killed when the time is hit And same with the memory like if if it goes a bit Above the memory limit. It's not automatically killed. It's only killed if some other job needs that memory or it stays Like it goes way above the memory limit and but uh, so There was lots of questions, uh, uh, like how do you specify? How do you figure out these limits in the monitoring section? We talk about a bit more how how you can figure this out But like like said yesterday, it's good idea to like compare them to your local computer So if at your computer it takes like hour to run a simulation And it uses like you have 16 gigabytes of memory You might put like, I don't know two hours and 16 gigabytes of memory as the limit And then you can after the fact you can look what the actual usage was Usually you run these like pilot jobs. So these kinds of like test jobs Uh, when you start a new new thing new analysis or something You usually run one or two of these test jobs and see how it behaves and then you adapt Your parameters after that. 
Yeah, and if it gets killed, you just increase the limits and put another one into the queue. Yeah. So, one more thing: if you submit jobs, you might sometimes also need to cancel them. We'll talk about monitoring in the next chapter in much more detail, but canceling a job is also important. Each job gets a job ID. You can figure out the job ID with, let's say... Richard, could you submit some sleep job or something? Yeah, okay. Here Richard will create this kind of a sleep job, say five minutes. Okay, so it prints the name of the computer and the time, sleeps for five minutes, and then prints the date. You can see that Richard is not putting in any sbatch parameters, so when you submit this kind of script it will use the defaults. The defaults here are something like an hour of runtime and 500 megabytes of memory. Yeah. So the job is now there. Can you see if it produces any output? Okay, with ls... I do see a file here. Yeah, and you can see that the file name is a bit strange: if you haven't specified an output file name, it's slurm-, then the job ID of the job. If you run slurm queue now, what does it show? And for people at other universities who asked whether this exists: yes, this is now supported at FCCI sites. Yeah, that's great. It should also work at the University of Helsinki, but they have multiple clusters and a bit of a different situation, so you need to talk to the Helsinki people. What we see here in the output is that we have this job running.
You can see the starting time, how long it has run, that it's in the running state, where it is running, and on the left side you can see the job ID. Whenever you submit a job, there's a huge database in the background that keeps track of all the jobs of all the different users, and the queue system allocates resources for the jobs based on resource priorities and an algorithm that it runs in the background. Basically, your job is given a priority based on your past usage and the resources requested by the job. These kinds of small jobs basically start immediately: they move through the queue very quickly, and this one is currently running in the background. So if we want to cancel this, Richard, what would you do? We first find the job ID, which I can get from slurm queue, or from when I submitted it, and then: scancel. Yeah. And now if we look at slurm queue... yeah, there, it's gone. If we go to slurm history now: for slurm history you can also give a time window, let's say one hour, so that you don't get all of the previous jobs that Richard has run; there are probably quite a lot of them. But you can see the last few lines there. Okay, this is really wide. Yeah, but in the last few lines, at the bottom, there's the job ID, and on the right you can see it says cancelled status. So you can see some of the jobs completed and some of them cancelled.
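The submit-and-cancel cycle demonstrated above can be sketched end to end. `--parsable` is a standard sbatch flag that prints only the job ID; the defaults mentioned (about an hour and 500 MB) are Aalto's and vary by site:

```shell
# Create the sleep job script shown above
cat > sleep.sh <<'EOF'
#!/bin/bash
hostname; date
sleep 300      # pretend to work for five minutes
date
EOF

JOBID=$(sbatch --parsable sleep.sh)   # submit; capture the job ID
squeue -u "$USER"                     # the job shows up as pending or running
scancel "$JOBID"                      # cancel it
# slurm history (or sacct) will now show the job in CANCELLED state
```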
So basically, when you're using the queue system and these non-interactive jobs, you're firing jobs, brief program tasks, away into the queue, and if some of them fail, let's say for insufficient time, you fix them up and put them back into the queue again. You usually manage these scripts this way. And sometimes, say you submit a job that you know won't finish in time: it's usually better to just cancel it and put another one with the correct time requirements into the queue. Yeah, that's pretty much it for serial jobs. One thing that is under the hood: there are these partitions. There are lots of different compute nodes in a cluster; like we mentioned on the first day, a cluster is comprised of multiple nodes and multiple kinds of computers. For example, here at Aalto we have many different generations of computers, purchased at different times, and we have assigned some compute nodes to some partitions and some to others. At some sites, like Aalto, you usually don't need to specify the partition; Slurm should automatically put you into the correct partitions. But just in case: at many other sites, for example at CSC, or some of the FCCI sites, you most likely need to specify the partition. You can find the information either with the slurm command or with the sinfo command; the slurm command is a wrapper around these different commands. If you do need to specify the partition when you submit a job, you can do it with either --partition=partition-name or simply -p and then the partition name. So Richard, do you want to submit, for example, the sleep job into the debug partition or something like that? At Aalto we have this debug partition for really fast jobs, meant for debugging, for really small jobs. But usually you don't need to specify it; it depends on the site. Yeah. And this is really something we worked on a few years ago, trying to make the defaults as useful as possible; the best way to teach something is to make it so you don't have to teach it. Okay. Yeah, if you look at the queue now, does it say what partition it's in? Yeah, it ends up there, probably because the node is in multiple partitions. Oh yeah, I think that's probably the reason: I guess for us it expands debug to include all possible partitions, because why not? But at other sites your mileage may vary; just in case, you now know that this exists. Basically, there can be various of these partitions that nodes belong to. Yeah, okay. So we have a full reference here of many useful commands. Yes, there are various commands and various features in sbatch and in the Slurm commands themselves. A few of the interesting ones: we'll be talking about how to get multiple CPUs and MPI tasks later on, but there are also quality-of-life features, such as telling Slurm to send you an email once the job is finished, and things like that. But if you want to use those, you should probably discuss with us first. You can test them out, but I think we once got banned from the Aalto mail system when somebody made an array job that sent a huge number of these mails.
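For sites where a partition must be given, the two spellings mentioned above look like this; the partition name debug is an Aalto example, and sinfo lists what your site actually has:

```shell
sinfo -s                            # summary view: one line per partition
sbatch --partition=debug sleep.sh   # long form on the command line
sbatch -p debug sleep.sh            # short form, same effect
# or set it inside the script instead:
#   #SBATCH --partition=debug
```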
Yeah, so with those email options you should be careful not to spam too much. But there are lots of interesting features here that you can use. So, without further ado, we should probably go to the exercises so you can test these out, and we'll check the HackMD for any interesting questions. Basically, here are a few things you could try. In the first exercise, just create a batch job that runs hostname. It's a good idea to write it yourself, so you really get a grip on how to write these scripts, because it's very easy not to follow the exact form of the commands. In many cases people come to the garage and ask: why doesn't my job work, why is it killed, or something like that, and then we notice that there's a small typo in the sbatch parameters. That is very common, and it happens to everybody, but it's better to try writing these yourself so that you see the form of the parameters; then, when we see a script after it doesn't work, we quite quickly recognize: okay, there's a typo there. So it's a good idea to test it out. So which exercises should people work on, and how much time should we allocate? Maybe something like 15 minutes, and then 10 minutes of break, and we resume at the top of the hour. Yeah. I mean, what do you gain from here? Once you're familiar with the editor and these parameters, everything else is simple incremental work from that. So it's worth it.
A bit more time now rather than struggling later. This is really the meat of the thing. It's very important to get a grasp of this, because everything is built upon these non-interactive jobs. If you want to run more than what you can manage in open command lines, you need to be able to do this, and that's why it's good to do these exercises and try this out. In the first exercise, like I mentioned, you write a simple sbatch script. In the second exercise you can test the canceling of a job, which we just did. And in the third exercise, try to see for yourself how the non-interactive workflow works: submit a job that keeps doing something in the background and will produce output, then log out of the cluster completely, come back, and see that it's still running. It is very important to get this kind of workflow, where you go to the cluster just to submit your jobs and then do other stuff, so that you don't have to think about it until it's finished. Should we actually give 20 minutes for the exercises and take the break after? I think maybe we should have 15 minutes and then go through the exercises. But then should we go through the exercises after the break? We shouldn't put people in a position where it's a choice between the break and doing the exercises. That's true. So how about: we work until five to one, then a 10-minute break, and then we go through the exercises.

Welcome back. So, let's see how we're doing here. Simo has gone and added a poll to the bottom: did you manage to run the exercises?
I guess this is a multi-answer poll. Yeah, we can use it to see whether some of the exercises are good or bad. And since it's multi-answer, you don't have to do the exercises in sequential order. Okay, so we thought now we can do the exercises, or at least the first few, as a demo. Or do you think there's enough here that we should move on? I think we could at least go through the first one. We basically already did exercise two, which was the canceling, but we could go through exercise one and exercise three, or maybe just exercise one. Yeah. Okay, let's take a look then; I'll get my screen arranged. Okay, so, exercise one: what do I do? So I'm in my kickstart directory here. Yeah, so let's create a new file for exercise one, with a .sh name. So it says: make a batch job that just runs hostname. Yeah. So that's the command to run; what's the other framework around it? Well, first we always have to start with the bin/bash line at the top. Does it have to be bash?
Well, it doesn't have to be, but that is the most common. You can also have other interpreters, but it's usually tricky, and usually you want to do other things in your job: run other commands, such as loading modules, or move to a folder, or something like that; you usually want to do command-line stuff, so bash is usually the best one to use. And these first lines are something it's a good idea to get accustomed to writing every time: just put something into time and memory. There were questions in the HackMD about what happens if you forget these, and of course you get some defaults, but I highly recommend getting into the habit of writing these first lines all the time, like boilerplate: when you need to create a new script, you just write them automatically. Yeah. So, okay, we need to change the job name. Should we look at the reference? How do we set the job name, and what does that actually mean? Yeah. I'll push Control-Z to minimize nano... why does that not work? We can look at the reference in the documentation. Oh yeah, okay, I was going to show opening a manual page, but that can be left as an extra for the bash course. But yeah, you can specify a job name. If you have a lot of different jobs, it's usually a good idea to set a job name; otherwise it's automatically taken from the command or the name of the script that you're running. It might help you figure out what kind of program you're running, so it's easier to parse the output of slurm history and so forth. Yeah. Okay, and then we need to specify the output file. So how do we do that?
So I remember -o from somewhere. Yeah, it's either -o or --output. You'll notice that many of these options have a long form and a short form; in the documentation here we usually use the long form, because that reminds you what the parameter does. It does what it says on the tin, basically. But you can also use the short forms. And here Richard has specified this wildcard, this really nice %j. What does that mean? It means it will use the job ID there. So that means if I run it multiple times, the output won't be overwritten. Yes. There are many of these kinds of wildcards in the sbatch manual, so you can use various wildcards to auto-generate fancier output names. Yeah. Okay, so I will exit and save with yes. Okay, and let's submit it. Check the output... can you check slurm queue? Okay, it looks like it ran. And we see already that the output file, since Richard specified this %j, got the job ID in its name. Can you run slurm history? Let's take a short history. Yeah, so there we see that, I think it was on csl48, it ran this command, and we can see that it ran. Yes. Okay, so what does the output file say? Yes, so it matches. Yeah, so it ran there. Okay. Good question in the HackMD about why we need to specify srun within the sbatch script. The reason behind that, we'll cover in even more detail in the monitoring section, but basically: when you submit an sbatch job, you can see there that there's this hierarchy of different steps in the history output. The first line shows you the job ID and what the full job consists of, and then you see these separate steps, named batch, extern, and 0. So what these mean:
So what these mean is that like these are The job is like run in multiple steps and unless the steps are named Then the steps are like clumped together into these batch steps So so basically you will just run them and you don't necessarily know what what they actually did but sometimes you want to know what is the output of the of all like you need to see Like specify Uh, how how different steps work. So so basically see how how different steps behaved. So in that case, uh You if you run the s run you get like this for example there you get like the zero step Uh, the chopper name for that is host name. So that is the command host name So basically like you can so for for example, if you had a massive job and there was part of it that Read in the data and part that did a computation Or part that process data part did a computation. You could Allocate the time between them and realize which one was actually the slow one or not Yeah, yeah that kind of stuff and and also like you can do various other monitoring tricks and also, uh, when you're running these massively parallel chops or big parallel chops using mpi You usually need to specify this s run so that the slur and the mpi can communicate to know where the chops will be running There was also a good question that that So can you specify other parameters such as the partition or something in the s batch? a thing and The short answer is yes, like you can you can specify whatever s batch parameters There like arguments. So for example these mail mail Like where do you want the mail to end up or partition or partition to use or whatever and uh All kinds of stuff you can specify in the In the s batch comments like this where lots of different parameters that you can specify there and they're basically like You can also override these from the command line. Uh, so Richard do you want to try out? 
So for example here we have the output file specified, but let's say... okay, let's leave it as it is. Yeah, you can comment these out if you don't want them: you can add another comment character. But let's say Richard wants to submit this job but wants the output file to be different in this specific case. He can override this from the command line when doing the sbatch call. These are basically the same parameters, and there's a hierarchy: if no parameters are set, the defaults of the system are used; if you have parameters in the script file, those are used; and if you specify something on the command line, that overrides those. Okay, so should we run it? Yeah. Okay. Yes, well, it's probably done; if I list, we see a new output file, and the old one. So yeah, I think we can move on to the monitoring section now. The other stuff is basically more of the same kind of thing. What is important to notice about these serial jobs is that all of the work is done in the background somewhere. You basically fire and forget: you give the task to the queue, the queue does it for you, and it does exactly what you told it to do, exactly the kind of stuff you needed done. Yeah, but let's look at the monitoring, I think. Okay.
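Putting the last pieces together, here is a sketch of a script with a job name, a %j output file, and named srun steps (so they can be told apart in slurm history), along the lines of the override just demonstrated. The file and step names are illustrative:

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=00:05:00
#SBATCH --mem-per-cpu=100M
#SBATCH --output=example_%j.out   # %j expands to the job ID

# Named steps show up individually in slurm history / sacct
srun --job-name=step-hostname hostname
srun --job-name=step-date date
```

`sbatch ex1.sh` takes the values from the file; `sbatch --output=other_%j.out ex1.sh` overrides just the output file for that one submission, since command-line options beat the in-file #SBATCH lines, which in turn beat the system defaults.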