screen. So we go to interactive jobs. It's scrolling down. There we go. I hope you enjoyed that demonstration of what work is really like: everything going wrong, things changing, stuff that in theory worked last year not working anymore. So, interactive jobs. We've talked about the cluster a lot, and we've talked about the workload manager. We have another metaphor here: the HPC diner, where the scheduling system is like the host in a restaurant. There's a queue of people coming in, and the host tries to seat them in an optimal fashion. For example, they put parties of two into the section that has tables for two, and leave some bigger tables open for bigger parties which may not even have arrived yet. They also take reservations: if they know a big party is coming, they'll try to clear up a section so they can seat people as soon as they arrive. I think we don't need to go over this metaphor much more. So what's next? Basically, the job of the queue is to make certain that whatever you require, whatever features and resources you ask for, your jobs will get. Whenever you request a seating for two, you will get a seating for two: you will get the resources you need. We'll talk about the various resources a bit later on. So why interactive jobs? One of the questions in the HackMD got exactly to this point. The question was: is testing and debugging handled differently, or is it the same as submitting jobs? For example, we'd like to quickly test-run some script. That's exactly why we talk about this first. It used to be that we did a lot of talking about writing the script and submitting it. But just like this question implies, submitting something and waiting five or ten minutes for the result to come back before you can debug it is just not a great experience. So now we explain interactive use first. The basic idea is that you can run a few commands that still go through the queue, but instead of having to write a script, you enter them directly on the command line and see the results as they come out. Or you can submit one job that's interactive: you might wait a few minutes, and then the job runs and you have a complete shell session in interactive mode, where you can run lots of things without queuing again. So, should we just start? Should we run it, and everybody in the audience can run it too? Yes. So now we type along. You're on your cluster. There's a command here, a Python one-liner that basically just prints "hi". If you copy it from here, there's a dollar sign in front; we'll need to remove that in the future so that it's easier to copy. Okay. So what we got was "hi" from the login node: the login node responded with hi, and it's me who's running this command. For what we're doing now, you can be in any directory, since we're not saving any files. Okay. Now let's say we want to run this on a compute node, basically in someone else's kitchen. We copy srun in front, and what's the difference here? At the beginning, we see srun --mem=100M --time=0:10:00.
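For reference, the two commands from this demo were approximately the following (the exact one-liner and flag values are reconstructed from context, so treat this as a sketch):

    # run directly on the login node
    python3 -c 'import socket; print(f"hi from {socket.gethostname()}")'
    # -> hi from <login node>

    # same command, but sent through the queue to a compute node
    srun --mem=100M --time=0:10:00 python3 -c 'import socket; print(f"hi from {socket.gethostname()}")'
    # srun: job <jobid> queued and waiting for resources
    # srun: job <jobid> has been allocated resources
    # -> hi from <compute node, e.g. csl38>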
If you notice, this looks a lot like what we put in the batch script that I was demonstrating. srun is the thing that says "let's talk to the queueing manager", then come the resources, and then the command we run. And it just runs directly. So let's see: we see "job queued, waiting for resources", then "job has been allocated resources", and then "hi" from csl38. So we were able to run this in the environment of some other node. Basically, instead of cooking in our own kitchen, we suddenly cooked in a random kitchen; we just wanted any kitchen with a burner we could use. This is useful for testing things. For example, in my LAMMPS example, I could have used the -n option here to start requesting multiple CPU cores, or in that case, multiple tasks. We'll talk about parallelism later. So how would this work in practice? If you have code that you think works on the login node or your own computer, you just add srun in front and do the very minimum verification that it works somewhere else. Or if you're doing development, you can quickly run this and see the output without having to worry about all the scripts. Here's an example: if you want to see what's in the queue, you can open a new terminal. If opening a new terminal is complicated for you, I'll just run it for you. So I connect a new terminal to Triton, and with this "slurm queue" command you can see what jobs you have in the queue. So instead of submitting the previous job, I'll submit a job with a sleep, so it just hangs around for, let's say, 30 seconds, so I can actually see it. If I submit it and type "slurm queue" in the other terminal, I can see that it's actually there in the queue, running in the background. So the next thing is an interactive session. With this, we can get a complete shell somewhere else. I guess we can wait for the sleep job to finish... now it's finished, and the queue is again empty for me. Should I just run it? Yeah. What happens here might be different at different universities; you might need different flags to get an interactive session. At Aalto, it's these flags. This would be a good thing to do if, say, you need to open a bunch of data and do some visualization or processing, and you got an email saying "please don't run this on the login node": you can use this to get a session on some other node. And then, what node does it say you're on now? It might look like nothing happened, like what was the point of this command: the terminal looks completely the same. But if I type "hostname", which tells me which machine I'm running these commands on, it suddenly shows me not the login node, but this pe8 node. And if I bring back the other terminal, the one I used to view the queue, and run "slurm queue" again, I can see there's this interactive bash running on the pe8 node. So now I have the exact same kind of terminal, but it's not running on the login node anymore; it's running on a separate machine. Yeah. Okay.
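A minimal sketch of the interactive-session workflow just shown (the exact flags and partition names vary per cluster, so check your own cluster's documentation):

    # terminal 1: request an interactive shell on a compute node
    srun --mem=500M --time=1:00:00 --pty bash
    hostname          # now prints a compute node (here, a pe8... node), not the login node

    # terminal 2: watch your jobs while the shell is open
    slurm queue       # Aalto wrapper; plain Slurm equivalent: squeue -u $USER

    # terminal 1 again: end the session and return to the login node
    exit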
And finally, we have an interactive shell with graphics. Let's say you need to run a graphical program: there's another command for that, sinteractive. But first I will log out of this job that is running. So, exit. And you see that the job finishes, and now if I type hostname, I'm back on login3. Running hostname is usually a good idea, because then you know what system you're actually running on: am I where I think I am? If you just look at the terminal, it looks the same all the time, and it's very hard to tell where you are situated. But the hostname command tells you which machine will execute your commands at any time. Okay. So, interactive with graphics. This is a bit more complicated behind the scenes than the other commands that were run, but it's done in such a way that you can run graphical programs. Maybe we don't need to dwell on that; let's try to go faster so we can get to the exercises. Typically, because everything in the cluster happens through these terminals, it's usually not the best idea to run many graphical programs on the cluster. It's better to run graphical programs on your own machine and get the data from the cluster in some way you can then visualize. We'll talk about that later today when we discuss remote connections. Okay. So what about monitoring your usage? There was a question above: how do I know how much time and memory I need? What you'll usually do is run the job a few times and see how much it takes. You might run it on your own computer and check the memory, or run one job requesting some huge amount of memory, see how much it actually uses, and then reduce the request to that amount. Or start small, and if it crashes, increase it. A good example: your laptop might have 16 gigabytes of memory, so if the job works on your laptop with 16 gigabytes, it should work in the queue with that request as well. If it takes an hour to run on your own machine, I would put something like two hours in the request and see how long it runs; if it gets killed, increase the time. It's a trial-and-error process, but it's usually a good idea to start from the larger end of the scale and work your way towards the smaller end, instead of putting too strict a limit and having the job fail because of it. (A sketch of this pattern follows below.) Okay, so, do you want to show an example of the monitoring?
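As a rough sketch of that trial-and-error cycle (my_analysis.py is a hypothetical placeholder and the numbers are only illustrative):

    # first run: request generously, on the order of what your laptop has
    srun --mem=16G --time=2:00:00 python3 my_analysis.py

    # then check what it actually used (monitoring commands are demoed next)
    slurm history 1hour

    # later runs: request a bit above the observed usage
    srun --mem=6G --time=1:15:00 python3 my_analysis.py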
So if I use the slurm history command... well, you can see a lot of history here, so I will limit it a bit and say, let's say, one hour, like this. Can you make your terminal wider? The lines are wrapping here. It's okay if it scrolls off the screen a bit. The font is a bit small now, but there are all the job IDs we got when jobs were submitted, when each was submitted, and how much memory each actually used, things like requested versus used. Let's see what was running: I see one used about five gigabytes of memory and took about three minutes. I cancelled those, so they didn't completely use the memory they were allocated. Okay, if I limit the output a bit more, say to the last 15 minutes, we see only the ones we have run in this session. And you see here, for example, the Python job that we ran. The output wraps around; don't be too worried about that. You see the job name is the Python one, the start time when it ran, how much memory it requested, and then the memory actually used and how much time it took. And we previously already saw slurm queue, which you can use to see whether there are any jobs in your queue currently, with their status and so on. (There's a sketch of these monitoring commands below.) So I think we've basically covered everything important on the page. We haven't covered it in the exact same order, but you should have the basic idea. Should we split to a break for 10 minutes and then give 20 or 30 minutes for independent work on this? Yeah. So, the exercises: I'll quickly show them. We have this GitHub repository that you can download using the command over here, and it has the exercises we'll be using in the coming days as well. In the first exercise there's a program that tries to use up memory, and you'll run it in different contexts: directly on the login node, through srun, and so forth. It hopefully demonstrates in action how to use these Slurm commands to submit jobs. It's also a good idea, if you want, to open a new window and use the "slurm watch queue" command to watch what jobs are running. You can leave it open in the background and then submit some jobs. So if I now run here something like srun --mem=500M with a sleep command, you can follow how it goes through the queue over here: once it's in the queue, the watch will show what happens to it, and you can see it running in the background. That can be helpful once you're doing the exercises. And if you're stuck with the exercises, join the Zoom that registered participants have; we have helpers from the different universities there. What else? Basically: explore. The main point is to get experience running srun and slurm history and things like that, and be ready for the next part where we write the scripts. Okay. Great. So I guess let's give 10 minutes for the break and maybe 25 minutes for exercises. Break until 30 minutes past the hour; does that sound good? Yeah. Okay, great. See you later then. Bye. Hello. Hi. So, how were the exercises? Let's see what's written here. I don't see many questions, which is either a good sign or a bad sign. So let's make a straw poll here: if you have done the exercise, please put a mark over here. Not your name, but...
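A hedged sketch of these monitoring commands (slurm history is part of Aalto's slurm wrapper script and its time-argument syntax here is approximate; on clusters without the wrapper, sacct gives the same information):

    slurm history 1hour    # jobs from the last hour: IDs, names, memory requested/used, run time
    slurm history 15min    # narrow it down to the current session

    # roughly equivalent plain-Slurm command
    sacct --starttime=now-1hour --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State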
Put a plus or something like that. So, someone requested five extra minutes. Should we do the exercises as a demo for five minutes and get it in the recording for people in the future? Yeah, sure. Okay. So now, either pay attention to what we're doing, or zone out and keep working by yourself and watch this later. So, Seema, I'm going to your screen. Yeah, there you are. In the HackMD there was a good question: where are we supposed to run these exercises? And the answer was: on the login node. I'll return a bit to yesterday's talk by Enrico about where we are, just to reiterate these concepts so we get familiar with them. If you rent a cottage in the countryside, when you get there you might want to get comfortable with the surroundings before you start unpacking your stuff and grilling your food. So, in the cluster now, in this slide from Enrico's talk, we see the concept of a cluster: we have some machines over here where we actually run the stuff, then we have the data storage, and then we have the login node. The login node is the place where we're supposed to do things like job submissions and menial tasks such as editing our scripts. It's the entry point to the cluster. So whenever you're submitting jobs, it's best to do it on the login node; there, you are basically in a normal computer. But of course you shouldn't run complicated, heavy stuff there. From the login node, you tell the system that the heavy stuff should go into the queue and run on the compute CPUs and so forth. And to do that, we use the queue. So let's go through the examples. First we need the git repository over here, so we run this git clone. (Version control is very important to learn, and git is the most popular version control system; we have a course on that and materials on the web page if you want to learn more. It's a very good tool to learn.) So what are we doing now? We have this program, memory-hog, that uses a lot of memory to do nothing. That doesn't sound very useful as a program, but we can use it to demonstrate some features of the queue. So let's play around with it. We call it with Python, then the name of the program, and then we tell it how much memory we want it to use. On the command line here, I use tab to autocomplete once I've written a few characters, so I don't waste keystrokes. So now that we have this command line, we can call it with 50 megabytes. And what did it do, Richard? It was requesting more and more memory progressively, and we see how much it got (I wish that output were more readable). Anyway, it tried to request 50 megabytes, and it used something in that ballpark.
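A reconstruction of those setup steps (the repository is the course's hpc-examples repo on GitHub; the script path within it is assumed):

    git clone https://github.com/AaltoSciComp/hpc-examples.git
    cd hpc-examples
    python3 slurm/memory-hog.py 50M    # allocates roughly 50 MB and reports it -- here, on the login node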
So here, because we are again at our entry point to the cluster, the login node, whatever commands we run here run on the login node itself. In this case, we used 50 megabytes of the login node's memory. 50 megabytes on a modern system doesn't matter that much, but if everybody starts doing it, of course it can cause problems. So instead of doing that, we want to run it in the queue using the srun command. This is interactive use, but not an interactive shell. So let's do this: srun --mem=500M python, and then the script from the examples. We're telling Slurm to give the job 500 megabytes of memory, and the program itself is requesting 50 megabytes, so that should work. There are default values for all of these parameters in Slurm, but it's usually a better idea to specify them yourself; Slurm will then make certain you get these resources. By default, at least at Aalto, you get one CPU, one hour of compute time, and 500 megabytes of memory. So the flag is redundant here, but this way we know what we're asking for when we request the 500 megabytes. Okay. Now when we run it, we notice that it doesn't run instantaneously: there was some delay because of the queuing, since Slurm had to allocate. But that was pretty fast, just a few seconds. Okay. So, should we check how much memory it used? Is that with Slurm? Before that, let's try to increase the memory. Let's put, say, 4,000 megabytes here, so now we're using more memory than we are asking for. Let's see how it goes. Okay, so now we get an oom-kill event; OOM means out of memory. So basically Slurm said "this person is using way too many resources" and kindly killed the job for you, so you can update your request. Let's try a bit lower memory usage, but still more than we're requesting. And for some reason it worked. Richard, can you explain this discrepancy? We asked for only 500 and still got it to run with 2,000. Yeah, so the way Triton works is we've configured it so you can go over the memory limit by a certain amount, something like a factor of two or three, maybe more, before it kills you. And if the node itself is starting to run out of memory, as in everyone on the node together is using too much, whoever is going over their allocation by the largest amount is the first one to get killed. So this gives you a little bit of leeway and means you don't have to request a huge amount of memory if you only need it for a short segment, like the initial loading. But it's risky, because you could very easily get killed. And it would be really bad if you ran something for several days and then it died while trying to save the data, because that's when it used more memory. So it's usually still a good idea to request a ceiling for the job that its actual usage fits under.
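The sequence just demonstrated, approximately (the error line is paraphrased from typical Slurm output, not quoted exactly):

    srun --mem=500M python3 slurm/memory-hog.py 50M      # fits well inside the allocation: runs fine
    srun --mem=500M python3 slurm/memory-hog.py 4000M    # far over the allocation:
    # slurmstepd: error: Detected 1 oom-kill event ... job was killed
    srun --mem=500M python3 slurm/memory-hog.py 2000M    # over, but within the leeway: still runs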
But yeah, like Richard said, some jobs have this kind of bump during, say, data loading or data writing, where for a short while they require much more memory than their average usage. For that kind of job, if we put a hard ceiling at 100%, a lot of jobs would fail, and then you would have to request more than the average, which would mean fewer jobs getting through the queue. So the leeway lets you be a bit more relaxed with the memory. Okay, should we go on with the examples? So let's look at slurm history; I'll put, say, 10 minutes here. Of course, the output is a bit messy, so I'll make the font a bit smaller so that it fits. Hopefully you're not watching this on your phone. Anyway, I see a column that says MaxRSS, and RSS stands for resident set size. You just need to know it's the memory that is actually resident: programs often allocate much more memory than is actually loaded, but that's not counted against your memory limit. So you can see here that even though we used two gigabytes of memory, the usage is not necessarily recorded, because Slurm does a sampling measurement of the memory usage, and it does this sampling every 16 seconds or so. You don't get the exact actual usage, because measuring that continuously would reduce the efficiency of your job; you get a sample. So if we add the sleep parameter here, say 60 seconds, to keep the job alive, and lower the memory to, say, 300 megabytes, and run it... now it's sleeping for 60 seconds, so Slurm will capture it. (A sketch of this follows below.) So MaxRSS is still a good measurement to use if your jobs run for more than 60 seconds; if they don't, it's not necessarily reliable. Okay, so we're waiting. Should we do the next example? We're halfway to the next break, so how should we schedule things? Let's look at the HackMD: have most people managed to do exercise 2? The last part of exercise 2 already has some spoilers from tomorrow, about how to use multiple processors, so maybe we do it after we discuss those in detail tomorrow. The first three are the most important ones. So, the job ran in the background while we were talking, and now if we look at slurm history, you can see that the actual memory usage is recorded, and it's correct. Yeah. Also, Slurm has its own commands that you can use to check lots of information about the cluster, like what kinds of nodes we have. Many of these are also wrapped in the slurm command itself: you can use "slurm help" to see the various subcommands for checking, say, the status of the cluster, how many nodes are in use, and so on. These commands are pretty much the same either way; just be mindful that they exist.
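A sketch of the sampling caveat and the workaround (the memory-hog script's sleep option is assumed from the demo; check the script's own help for the exact flag name):

    # a job this short can finish between two sampling points,
    # so slurm history may show an empty or misleading MaxRSS
    srun --mem=500M python3 slurm/memory-hog.py 300M

    # keep the job alive past a sampling point so the usage gets recorded
    srun --mem=500M python3 slurm/memory-hog.py 300M --sleep=60
    slurm history 15min    # MaxRSS should now show roughly 300M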
So let's say we want to run multiple things. If you think about running something on your own computer, you might have a script that runs a lot of stuff. And on Triton, you might think that running the srun command in a loop, or putting multiple srun commands in the background, is a good way to run stuff on the cluster. But actually, it's not. We'll be transitioning to scripted jobs, batch jobs, in a second. The reason this isn't good is that, if you think again about the structure we have, you are running the commands on the login node. If you have a for loop that runs many of these srun commands, the srun commands themselves are still running on the login node. So if I run here "srun sleep 60" and then close this window, the job is killed: the job dies, it's gone, and it doesn't do what you needed it to do. That's why we'll be talking about non-interactive jobs next. If you had this kind of loop running on the login node, the login node would be a single point of failure: if it goes away, your jobs go away. You don't want that. You want to use srun and sinteractive when you do actual interactive work, where you're actually watching what you're doing. If you're not working interactively, it's much better to use the queue non-interactively, because then you don't have to sit there watching it. Okay, should we go to a break now and be back at the top of the hour? Then we're going right on time, I think. Yeah, if you have any questions about the exercises or anything else, please ask in the chat or the HackMD and we'll respond there. Okay, cool. So let's see the HackMD and make a note of the break. Great. Okay, see you in a little bit, 10 minutes. Bye.