Yes. So what's next? Yeah, so as a reminder, what we've just done was just a demo, so we didn't expect you to be able to follow along. Now we're going to the part where you actually do follow along: interactive jobs. This is more hands-on typing stuff. So what do interactive jobs mean? Basically, whenever I was preparing this, instead of writing something to submit and seeing if it worked, I first tried running it interactively, with the resources right there. That let me debug faster than submitting, waiting, looking, rewriting, and so on and so on. And because this is a good way to get your hands dirty and actually see how things happen, this is how we begin. So, like last time, maybe you're the talker and I'm the typist. Sure, okay. By the way, if you hear some zoomies in the background, our new cat is running around there, so don't be alarmed. Yeah, so introduction to Slurm, let's start with this. We already talked about the queue system, but let's look at it in a bit more detail. There's this analogy that is pretty good: the HPC diner. You can think of a restaurant. If you walk into a restaurant that has a "wait here for seating" sign at the entrance, the HPC system is basically that diner. It has different kinds of tables: for a party of two, a party of four, a party of eight. There's usually a greeter or a server who arranges how the tables are organized so that the diner can fit as many people as possible. You don't want a party of two occupying a table for eight while someone else is waiting for it. Yes, and you always have to tell the greeter what kind of resources you want.
So if you keep that analogy in mind throughout today and tomorrow, it will serve you fine. Whenever you walk into the HPC diner, you need to take into account what you want to do there. In this analogy, running non-interactively is basically takeout: you make your order and then come back to collect the food once it's ready. You're not eating it there in place. Interactive use is eating in: you go to the table, you sit there, and you do your work there, or eat your food there. Interactive use is limited by you, that's the main thing. We recommend non-interactive use because it doesn't require human intervention. If you've ever updated your computer, pressed "yes", expected it to update, come back the next day, and noticed there was a dialog box you still needed to click "yes" on but didn't notice last time — and now you have to wait another three hours for the update to finish — that's basically interactive usage. We don't recommend it, because it wastes resources: the human has to be there to drive the process. But there are times you might need interactive usage. For example, some people need to analyze big data sets interactively. So you request the resources, you do your thing, and then you finish. It's not the most efficient for the cluster, but it's more efficient for the human than trying to program everything. Yeah. If you want a playground on a larger system, or you just want to try out, let's say, compiling your program on the cluster, that might be a good way of doing it. So should we just jump into it and start running? Let's see. I will change to my home directory again here. Yeah, here we go. Okay, let's see.
How do we want to arrange this? I guess this is good. It goes a bit off the edge. Yeah, but I'll keep adjusting it. Okay. So on the small screen you can see which commands have been typed, the history of the commands. So yeah, let's see. Yesterday, in the connecting-to-Triton part, we used the hostname command, so let's start with that, for example. We just run hostname. hostname is a command-line command that tells us where we are, basically. And right now we are on the login node. Where we are is where our commands will be executed: the terminal window you have is connected to some machine, and the commands will be executed there. In this case, everything we type in this command box will be executed on the login node. But let's say we want an interactive session — say a one-hour session with 500 megabytes of memory or something like that. How would we type it? To start off, there's srun, which lets us run things directly from the shell. Basically, anything I might do, I can add srun in front, and it will run instead on a compute node, in the same environment, with whatever resources we ask for. Should I try running this, or should I try requesting more resources first? Let's do the simple case first. Yeah, that's actually a better example than what I was thinking. So what we are doing now is running the command hostname, but running it through the queue, somewhere other than the login node. You see that we queue for a moment, then we get some resources, and then we get the output of the hostname command. So do try this out at home. This is now the point where we recommend trying out the same commands that Richard is running here. Try it out and see what node you get. You will get essentially random nodes.
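The two commands in this demo can be sketched like this (the node names you see will differ, and you need to be on a Slurm cluster for `srun` to exist):

```shell
# Runs on the machine your terminal is connected to: the login node
hostname

# The same command submitted through the queue: Slurm waits for an
# allocation, runs `hostname` on a compute node, and prints its name
srun hostname
```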
The name doesn't matter; it's some node somewhere, and you get your resources from it. But notice that if Richard now runs hostname again without the srun — so, I'm back — yeah, Richard is back on the login node. When you just run with srun, it's like you go there and then come back: you go to the compute node, you do something there, and then you come back where you left off. You went to the office to do one thing, and then you came back home, basically. And maybe we can clarify: I would often do this when I'm first testing things, but I wouldn't do this for any real kind of work, because if I get logged out of Triton, everything gets lost. This is just for getting started; all we're doing is exploring the different Slurm options here. Should we try adding some other options? Like, what if I wanted more memory or something? Yeah, let's do that. By default you get 500 megabytes of memory — that's the default we have configured — but if your code requires less or more, you should specify the memory requirement. You do it by typing --mem= and then a number with a suffix for kilobytes, megabytes, or gigabytes. 100 megabytes, for example, might be a good one. And now if you — yeah, you can run it. And maybe afterwards we can check the history to see what we have. So now we're running it with a different memory limit specified. About these limits: jobs are given a sort of memory cap, and if the job goes a bit over it and nobody else needs the memory, it will still run. But if somebody else needs that memory, the job will be killed, because the queue system has to fit all of these different Tetris blocks together.
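A sketch of the memory request shown here; `--mem` takes a number plus a unit suffix:

```shell
# Override the 500 MB site default with an explicit 100 MB request.
# Valid suffixes include K, M, and G.
srun --mem=100M hostname
```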
The queue system doesn't want the Tetris to go overboard so that the game ends: it always has to be able to fit the Tetris pieces into the compute nodes, so it always has to know what memory requirement you have. In the restaurant example, it's saying how many seats you need. Yes, basically. Okay. And now if we run slurm history, we see — yeah, we see the latest jobs there. Each of them got a job ID, and you can see what command was run. The first command was run with a requested memory of 500 megabytes and the second with 100 megabytes. And next to them, in the MaxRSS column, you can see the actual memory usage, the maximum memory used. Yeah, that looks about like what I'd expect from these. Okay, so. Memory is one thing you want to specify, but the other major thing you always want to specify is time: how long does the job run? The default is, I think, one hour. But if your job runs faster or slower than that, you need to specify a time limit. This is basically how long you want the reservation at your restaurant table to be: are you staying for a full three-course meal or a simple lunch? That helps the server — in this case the queue system — to arrange your place. And you shouldn't request resources you're not going to need, because that makes it harder for the server to fit you into the restaurant, and it makes you wait longer in the queue. If you come to a restaurant with eight people and say you want to stay for four hours, they'll have a hard time fitting you in, and you'll probably have to wait until the next day for a free slot. So if you booked a table for eight but know only four people are arriving, ask for a table for four instead; you'll probably fit in better.
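A sketch of adding a time limit, plus the history check mentioned above. `slurm history` is a site-specific wrapper around Slurm's accounting; on clusters without it, `sacct` gives the same information:

```shell
# Ask for at most 10 minutes of runtime together with the memory limit.
# Slurm accepts time formats such as MM, MM:SS, HH:MM:SS, and D-HH:MM.
srun --time=00:10:00 --mem=100M hostname

# Afterwards, compare requested versus used memory (MaxRSS column)
slurm history
```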
And we will talk a bit more about how to know what the limits should be. Yesterday there were already questions about how to know what a job needs; we'll talk about that in more detail later. So let's see, where are we in the lesson? We have a quick poll in the HackMD. Let's see. And again, what we're doing here is so simple it seems boring, but it's like assignment zero: if you can start with this, the other things become much easier. It's better to do ten things that are easy than to combine them all into one thing that's hard. So should we do an interactive session and then have a small break, maybe? Like a plain sinteractive one? Yeah — you mean with graphics, or? Just sinteractive. Okay, so we're going back to my shell — is this the section we want to be on? Yeah, yeah. The graphics part might be relevant for some, but it's not necessary; sinteractive works, and you can get some graphics out of it, but you don't have to. Basically, the idea here is this: let's say you want to run more than one command with srun, and you don't want to queue each time you run something. You want to run some interactive stuff in the queue, but you get bored — okay, I need to wait in the queue again just to run my next command. You want a place where you can test out a few things and then close the connection afterwards: a small place where you can fiddle with your things, an interactive table for a few hours or so. You can request that kind of resource, and that's done with this sinteractive command, which also gives you the possibility of using X forwarding. That's more technical, but you can do some basic plotting with it, or something like that. So, should I run it? Do you want to describe, Richard, what we have here on the command line now?
So sinteractive is a wrapper we have. You can actually do a similar thing with plain srun, but it uses all the same Slurm options, like the memory and time, and then it launches the job and connects us to a shell there. So what Richard is doing now — let's say Richard wants to open a data set and look at part of it — maybe you need to run the srun --pty option. Okay. I don't know why that didn't work. Okay, we have a demo effect running here, but well — the demo effect is a very long tradition in live courses. Yeah. If sinteractive doesn't work for you, you can also run — do you want to run the srun with just --pty? Yeah. So what Richard is running now launches an interactive terminal on a node. Basically, similar to how you took the connection to Triton, to a login node, you can then ask for another interactive session, on a compute node, where you can play freely. With either this sinteractive or this srun --pty — and at the end you usually need to give bash — you get this kind of interactive playground. So if you want to run this command, Richard — oh, you're already running it. Yeah. This will probably take a bit longer to start, because there's a 20-gigabyte memory requirement, but the queue is constantly calculating in the background who goes where, and it's trying to find a place for Richard to run his stuff. Yeah. So this is good if you know what you're doing, you want to do some stuff, and you want to do it interactively. But it is of course a bit of a waste of resources, because again it depends on you being there, in the queue. Here, to try to make it go faster, I selected the interactive partition, but it's still — maybe you need to specify — okay, you got the resources, yeah. Okay. So now you can see that, yeah.
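A sketch of the two equivalent ways to get an interactive shell discussed here; `sinteractive` is a Triton-specific wrapper, and the exact resource values are illustrative:

```shell
# Plain Slurm: allocate resources and attach a pseudo-terminal
# running bash on the compute node
srun --time=01:00:00 --mem=500M --pty bash

# The wrapper form used on this cluster (also sets up X forwarding)
sinteractive --time=01:00:00 --mem=500M

# Either way: run your commands, then `exit` to release the resources
```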
If Richard runs hostname, he notices that he is running these commands on this pe5 node. So Richard is now on some compute node where he has the resources he requested, and he can do whatever he wants. This is a fine way of working if you have something that actually requires interactive use. But of course, as I said, we don't recommend this for long-term stuff, because if your computer crashes, or you close the terminal window, the connection closes and you lose whatever work was in progress. It also requires you to be the gatekeeper of the machine. The actual benefit of these HPC systems is that you can tell the machine to do bigger stuff than you could manage yourself. Like I said yesterday, you can place an order somewhere completely out of your reach: you make the order, and it gets filled for you, but you don't have to be there doing it manually yourself every time. You just fill out the order and get the results afterwards. That is better, because then you don't have to be constantly there watching; the computer does the work for you. Okay, so where are we now? I think we've gone over most of it. We've shown monitoring your usage as part of what we've done before, and we've talked about setting different resources, so maybe we should go to a break and let people work more themselves. Yeah, we could have a ten-minute break and then — maybe a break and an exercise session; let's see what our time is. We could resume in half an hour; that leaves ten minutes for the break and twenty minutes for playing with some of these other Slurm examples down here.
Yeah, maybe we could, how should I say it, do both at once: take a ten-minute break so that people can stretch their legs, and then we can answer the questions and do a few of the interactive exercises. After that we can start focusing on non-interactive usage. Should we have a ten-minute break and then start the exercises? Okay, so we'll come back in ten minutes to describe the exercises. Okay, sounds good. So break until, break until XX:01. Welcome back. So now we're going to introduce the exercises. I saw in the HackMD there was a little poll, and some people said this was too fast. That's sort of expected — we haven't even gotten to the exercises yet. Now you're going to have twenty or thirty minutes to go and actually do this on your own, so take all the time you need, and we're going to give you some things to work on. Let me point out right now: this is the risk of sinteractive. I left the interactive shell open, so it's continuing to reserve my 20 gigabytes of memory even though I'm not using it. So I should exit from that — I should have before the break. Okay, so, exercises. Here we see there are five different things you can try. Some of them are more practical, some more philosophical. They give you some programs you can actually try running interactively, and then later we will go and do the same thing in a script. During the break, someone asked us: why do we even do this interactively, why not just go straight to making the script? And some people will say, okay, I need the script anyway, I'm just going to set it up and make it. But I sort of like being able to see it live, see the output as it comes out; it makes development faster. Anyway, you'll have some options. In these exercises there is a Git repository called hpc-examples. So if I run this command — let's see. Okay, I don't have it here. Actually, I'll just clone it here.
I run git clone with this URL, and, okay, I'm going to enter my key. So what Richard is doing — the command is in the documentation, and I'll post it into HackMD as well. There we go. The command copies these examples from a public repository via Git, this version control system. We keep updating the examples, we try to make them better, and we use version control to do that, so using Git you can just get the examples quickly. Let's see. Yeah, so now that I've cloned it, we have this code here. And by the way, this is how we'd recommend moving code around onto the cluster and so on. So here we are, and we have the different programs we need. For example, if we run ls hpc-examples/slurm/, there's a program that will use a lot of memory. And then, if I press the up-arrow key and edit the line, I can run it with Python. It says it needs an option, a memory option — so let's say 50 megabytes. And there it goes, allocating a bunch of memory, and its usage goes up to 50 megabytes. With that, you should be able to get going. Exercise one is basically the basic experimentation with the above: all of the different Slurm options, slurm history, and so on. Exercise two starts running something with time, so you see how much faster it gets when you use more processors, and so on. During the exercises, if you're part of the Zoom call, you can ask the instructors there if you have any quick problems. If you have other problems, you can also ask in the HackMD, and we'll go through the solutions afterwards. Yeah, please put all the problems you have in HackMD so other people can see them too. And let's see, number three lets you explore some of the other commands that tell you what's going on in the background, and some more things. So yeah. And just to keep it simple: right now we're just practicing the etiquette of asking for a table.
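The cloning and first test run can be sketched like this; the repository URL and script name are taken from the public Aalto course materials, so adjust them if your course uses a different copy:

```shell
# Fetch the examples with version control, so updates are easy to pull
git clone https://github.com/AaltoSciComp/hpc-examples.git

# See what's in the Slurm exercise folder
ls hpc-examples/slurm/

# Run the memory allocator by hand: it needs a size argument,
# and here allocates roughly 50 MB
python hpc-examples/slurm/memory-hog.py 50M
```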
I remember on a holiday trip when I was small, I was the one asking for a table even though I was the smallest of our family, because a child needs to learn how to behave in a restaurant and ask for these kinds of things. Not saying that anybody here is a child, but the idea is that you have to learn how to get the resources you need, and the best way to get them is to just run these first exercises and learn the language of the queue system. You have to start with just getting an interactive table, just running one command in the queue, and then we can get to more complicated stuff later on, because it will get complicated. So it's good to learn the basics, to have a good grasp of them. Yeah. And there's a good comment in HackMD: it's good to change to your work directory before cloning. We'll talk about this at the end of the day, but there are different places to store your data, and the one for big research data is the work directory, like this here. So, okay, should we take, what, twenty minutes? We'll resume on the stream after that, so until XX:38. And do follow HackMD: you can comment when you are done, or if you need more time, and also post your problems there. So see you later then. Bye. Welcome back everyone. If you haven't noticed, there's a little poll in HackMD you can use to let us know how you're doing. If you're not done, we're going to demo it anyway, so you can keep working while we do that. Yeah, so remember what we are trying to focus on now: just reserving the resources, right? That is the first thing — everything we do on the cluster goes through the queue, and that's why we need to learn how to reserve resources. Everything is about that. At the same time, I know that for many people there are probably a lot of new things with the shell and the command line.
And that's unfortunate; we'll probably have to add more material on the shell, but for now let's try to keep it minimal. There were also questions in the HackMD from users of other clusters. On some other clusters you might need to add different flags, for example --partition or --account, to get your job running in the queue. So do be mindful of that; in the Zoom rooms there are instructors who should be able to help you find the correct flags. Yeah, sorry if that confused anybody. This is a common thing: you go to a new cluster and you have to figure out what the particular options are. But okay, I think Simo will do a demo now, so let's switch to his screen, and here we are. So, I have the git clone command that Richard posted, and it clones the hpc-examples repository for me. I run this ls command to list files, and I see there's an hpc-examples folder here. Now let's look at the first exercise. The first exercise was to run this Python program from hpc-examples — I can use Tab for auto-completion, if you want to fill out these commands faster. So, slurm/memory-hog.py with 50 megabytes. Now we are running something in Python that asks for 50 megabytes of memory — this is what Richard was running previously — and we see here that it allocated 50 megabytes. But if I run the hostname command, you'll notice that I'm still on the login node. So I just used 50 megabytes of memory on the login node, and of course we want to run our programs in the queue instead. So that was part a of exercise one. For part b, let's try running it in the queue with srun. I'll clear this up — there was a question in the HackMD about how to clear the screen: just type clear and press Enter, and it clears the terminal if you want to start from a fresh view.
So let's run srun, add this --mem=500M, and then run python hpc-examples/slurm/memory-hog.py 50M. And now we're waiting in the queue. On other clusters you might need to add an --account or --partition flag here to get it running. But now it's queuing, yes. We never know exactly how long it will take — most likely not that long. Yeah, okay. So now it has run, and we see the memory output here. Now let's increase the amount of memory the program tries to allocate. I press the up arrow in the terminal to get the previous command — you can use the up and down arrows to scroll through your command history, which makes it faster to try this yourself. Let's put in a requirement of, say, 5000M — add two zeros to the command — and run it. Let's see what happens. Now we're queuing; it should take about the same time as before, because we're still asking Slurm for 500 megabytes. Okay. But now we got an error. We see that the job queued and waited for resources, but then there's an OOM kill — an out-of-memory kill. It ran out of memory. So let's put it somewhere a bit below, say 2000 megabytes, and see if that works. Okay, that ran. So why did it run, even though we asked for only 500 megabytes? The answer is what we mentioned briefly before: there's some leniency. It depends on the cluster; our cluster has a bit of leniency, so you can go somewhat over the memory request if nobody else wants to use that memory. But usually you should specify a memory requirement such that your stuff actually fits into it. Okay, so then we have part d: let's use slurm history to check how much memory it actually used.
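The over-allocation experiment can be sketched like this; how far over the request a job can go before being killed is cluster-specific, as noted above:

```shell
# 50 MB allocation inside a 500 MB request: runs fine
srun --mem=500M python hpc-examples/slurm/memory-hog.py 50M

# 5000 MB allocation inside a 500 MB request: killed by the
# out-of-memory (OOM) handler
srun --mem=500M python hpc-examples/slurm/memory-hog.py 5000M

# 2000 MB may still run, if nobody else needs that memory right now
srun --mem=500M python hpc-examples/slurm/memory-hog.py 2000M
```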
On some other clusters — at CSC, I think — you need a slightly different command, but there are other commands you can use to check as well. This slurm wrapper should be available at many universities for checking your history. I can give it a time window of one hour so that it shows the most recent jobs. Now it's a bit hard to read because the lines wrap around, but we can see the command that ran with 50 megabytes. Let's look again. We have one here that — okay, it ran so fast that no usage was recorded; it didn't record the memory use. As it says in the exercise, the accounting only samples memory every 60 seconds or so, so if there's a short spike in memory usage, it won't necessarily get recorded. So let's clear the output and run the same srun command — say with 5000 megabytes — no, we know that will fail; say 1000 megabytes, so it doesn't fail and we get the recording. Oh, actually, sorry — I just noticed that there's a better example in the exercise: this Python command with 50 megabytes and sleep 60. Yeah, my bad. So now — oops, sorry — let's run it through the queue, srun at the front. And off to sleep it goes. Now it's running: it allocates 50 megabytes of memory and waits for 60 seconds, so now we get some recorded history for the job. So, Richard, when you're normally running stuff, how do you use slurm history? Yeah, well, I take a look at it pretty often. Okay, my cat's trying to enter again, I think. Yeah, the cat problem is shared between us. I use it especially when I'm first running something, because people were asking yesterday: how do you know how many resources a job will take? And the only way to know is by actually looking. So basically, start with small stuff, run it, make your best guess.
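A sketch of the sleep variant; the `--sleep` option name and the wrapper's `1hour` time-window argument are how they appear in the course materials, and the sampling interval is approximate:

```shell
# Hold the 50 MB allocation for 60 s, long enough for Slurm's
# accounting sampler (roughly one sample per minute) to record it
srun --mem=500M python hpc-examples/slurm/memory-hog.py 50M --sleep=60

# Then check the recorded MaxRSS for jobs from the last hour
slurm history 1hour
```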
If it's too large, make it smaller. If it's too small, it will fail and then you know to make it larger. That's just how it goes. So here we ran it with 50 megabytes and then slept for 60 seconds. Now when we run slurm history — I'll clear the shell again and run slurm history for one hour — yeah, for the last job we see that the memory request was 500 megabytes and the actual usage was something like 70 megabytes. We asked the program to allocate 50, and there's a bit of overhead from Python, so it looks like what we'd expect. Then there's the second exercise; let's quickly go through that as well. There we run this pi-calculating program, a program that estimates the digits of pi. So let's run it: time python hpc-examples/slurm/pi.py with 500 iterations. We get an estimate for pi — not a very good estimate, but it doesn't take long. Let's add one zero: now we get a bit better, still pretty bad. One zero more, and, well, we still only get about one digit of accuracy. Okay, let's try running it in the queue — so far we've been running on the login node again. I can use the left arrow to go to the start of the line and add srun at the front to run it in the queue. Let's add one more zero, say, and run it in the queue. Again, this is just asking for resources and running through the queue. The output is a bit garbled now, but we get a much better estimate for pi. In the example there's also some more advanced stuff about running it with multiple CPUs. We'll probably talk about that more later, so maybe we shouldn't focus on it right now — let's just focus on the main thing, which is running through the queue. And I think — should we do the next exercise or move on towards serial jobs? I guess we can discuss these briefly. So, what does sinfo show?
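The pi runs can be sketched like this, with the iteration counts growing by a factor of ten each time as in the demo (script name assumed from the course repository):

```shell
# Time the estimator locally with increasing iteration counts;
# the estimate improves slowly as iterations grow
time python hpc-examples/slurm/pi.py 500
time python hpc-examples/slurm/pi.py 5000
time python hpc-examples/slurm/pi.py 50000

# The same program through the queue, with one more zero
srun --mem=500M time python hpc-examples/slurm/pi.py 500000
```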
Okay, so there are commands for the queue: you can ask the queue manager things about the cluster itself. sinfo gives you a quite complicated-looking view — it depends on which cluster you're running on — that tells you what kinds of machines there are in the cluster. You can also give it various flags, like the -N flag, to see things at the node level. Then there's squeue. Let's not run that, because it doesn't show only your jobs; it shows everyone's jobs. For many clusters we have this slurm wrapper that helps you — the slurm history command, for example, parses some of these more complicated commands. You can type just slurm to see the various features of the slurm command. We'll talk about this more when we start on serial jobs, because there it's more important. To make it clear — some people noticed that slurm isn't installed on other systems — the slurm wrapper is an extra thing that makes the native commands easier to use, but everything it does you can also do with the standard Slurm commands. Okay. Yeah. One more thing I'd like to mention: in the documentation, one of the exercise questions was, why not use shell scripting and write multiple srun commands in a script? The problem is that if the connection breaks, you lose your progress. And also, with multiple srun commands, you always have to queue between them. You don't necessarily want that: you want to queue once and then run multiple commands, instead of queuing for every command. And that's where serial jobs come in. So the next step on our way is this: we've run many of these commands by hand, but we can also tell the computer to run them for us.
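A sketch of the inspection commands discussed here:

```shell
# Partition-level overview: node counts, states, time limits
sinfo

# Node-oriented view, one line per node
sinfo -N

# The whole queue (long!), and then just your own jobs
squeue
squeue --user=$USER
```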
So we don't have to type them like spells every time we want to run these commands. Yeah. Okay. Is that enough for interactive jobs, then? Should we go on to the next section? Yeah, I think so. I think most of the questions have been answered, either in HackMD or here. We could take a quick look — someone's asking for another break. Yeah, maybe we should do the break and then go to serial jobs. Yeah, let's have a five-minute break — oh, ten minutes, okay, let's do a ten-minute break. If you have any questions still related to interactive jobs, post them into the HackMD. I will also put a poll there: was the section clear? Yeah. Okay. And when we come back, we'll start making these scripts to run stuff. Okay. See you shortly.