Okay, so now do we go to interactive jobs? Yeah, let's do it. Now we're heating up the sauna, now we're getting somewhere. This is the point where we actually start doing something on the cluster. So far we have just unpacked our things: our stuff is stored somewhere in the work directory, we have put things in the cupboards, we have organized, we are ready to stay at this place. We have connected there, we have found the firewood, the applications, the things we need, and we have put our belongings in the proper places. Now we actually want to run something.

Yeah, so I guess first we need to talk about what Slurm is. How would you describe it, and why is it the most important concept here? On day one we described the cluster as a collection of compute nodes plus a login node, and there are hundreds of people using the same system at the same time. If you think about the computer you have at home and imagine hundreds of people working on that same computer, it wouldn't really work; there's only one keyboard, for a start. Slurm helps with exactly this kind of situation. Slurm manages all the different tasks that are requested of it, allocates resources for these tasks, these "jobs" as they are called, and they are then run on the compute nodes.

So Slurm is like a very good, what's it called in a restaurant, host or maître d', someone who arranges tables for your group. You come with a job that has certain requirements: certain resources, a certain placement at a table, certain needs for its night at the restaurant. A very good host manages to seat everybody in the proper places so that the restaurant stays as full as possible. That is basically what Slurm does. It's a queue manager: all the jobs we submit end up in a queue, and the queue manager then arranges resources for them.

Right. So what is our interface to this? What is the part we need to know when trying to run something? In the list currently visible here there are good examples of what we need to know. To find the correct seating, the host needs to know how many people are in the party and how long they are going to be dining. These correspond to how many CPUs your job needs, how much memory it needs, and how long it is going to run. Then the host knows that, okay, this table is reserved from a certain time onwards, but in the meantime you can sit here, because your dining time is short enough that you can finish before the next people arrive. So what you usually do is use the sinteractive, srun or sbatch commands to ask for these allocations. You are given a number, your job ID, which is basically your reservation number, and then you are given the correct seating. After that you can ask the host: what is my situation? Am I allowed to run? What is happening?
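In command form, that reservation-and-status dance looks roughly like this (a small sketch with made-up resource numbers; slurm queue is the Triton convenience wrapper and squeue the generic Slurm command):

    # Ask for a seat (one CPU, 500 MB of memory, 15 minutes) and run one command
    srun --cpus-per-task=1 --mem=500M --time=00:15:00 hostname

    # Ask the host how your reservation is doing
    slurm queue          # Triton convenience wrapper
    squeue -u $USER      # generic Slurm command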
And then afterwards you can look at what happened: how long the actual runtime was, and so on.

So what happens if you really have no idea what these parameters are? Yeah, the first thing is that you can just try running something in the queue without setting these parameters at all. For example, here in the example... do you actually want to try and run it? Yeah, sure. So here I am, on Triton. Maybe I should make sure both windows are visible. The example is a very simple Python script that just prints the name of the machine it is running on, and we want to run it. The easiest way to test how it would run in the queue is to just add srun at the start. What srun means is that at that point Slurm understands: okay, this will be run in the queue, and whatever comes after it is what will be run. As it says here, it puts the task into the queue and waits for resources, and once the resources are available it runs the command. In this case, because we are asking it to print the name of the machine, you can see that once it was allocated resources it ran on this csl46 machine, a separate system. It was allocated resources, it got them, and it ran. It ran with our default resources, which are 500 megabytes of memory and an hour of runtime.

But as in the example on the web page, you can specify these resources yourself. Let's say we want to ask for 100 megabytes, and because this is very fast, say 10 minutes of runtime. Is this 10 minutes or 10 hours? Well, it's hard to tell, so I would rather put two zeros at the front so that you know what it is; that way it reads like a clock, otherwise it's very hard to tell. And you can see that the memory is 100 with a capital M; you could also ask for gigabytes with a G, but this is megabytes. Do you want to look at slurm queue? We can, but we'll do the monitoring later on. You can look at what's happening in the queue with slurm queue and related commands; we'll talk about monitoring in detail a bit later, so it's not that important right now, but it tells you the status of the queue. That command is Aalto-specific, I think; other sites have a different command, and we'll talk about that later.

Okay, so there were good questions in the HackMD about how you know what resources your job needs. I usually put it like this: if you have a computer, say a laptop, and something takes 15 minutes to run on your laptop, you can expect it to take roughly 15 minutes on the cluster as well. They are both computers, so it should work pretty much the same, and that's usually the case. It depends, but as a first guess I would assume something similar to how it used to run. And if you're unsure, and in any case, you should add some leeway, on the order of the job's runtime. So if your job takes an hour, I would ask for maybe two hours, so it has room to go above the hour.
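Written out as flags, with made-up numbers (my_script.py is just a placeholder for whatever you are running):

    # 100 MB of memory and 10 minutes of runtime; writing the time as a full
    # clock (hh:mm:ss) keeps it unambiguous. Slurm would read a bare 10:00 as
    # minutes:seconds, which is easy for a human to misread.
    srun --mem=100M --time=00:10:00 python3 my_script.py

    # For something that takes about an hour and runs on a 16 GB laptop,
    # ask for leeway on the time and start from the laptop's memory size
    srun --mem=16G --time=02:00:00 python3 my_script.py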
Same with the memory: your computer might have 16 gigabytes of memory and the program runs, so you know that, okay, maybe I'll start from 16 gigabytes. We'll talk about how you can monitor these after the fact, how you can see how much the job actually used once it has run, but first you need to give it some ballpark. If you aim too low, the job is killed; that is what happens when your allocation is too small, and then you know you need to ask for more. Usually it goes like this: you first need to get a feel for your program, and then you can set better, tighter limits.

Okay, so what's next? Yeah, let's talk about interactive jobs. Over here, when we ran something, we typed srun on the login node and the command ran on a compute node. But let's say we want to test things out and we don't want to constantly run srun and wait in the queue; we just want to work on a compute node for a while. For example, there were questions like: I want to test my code, but I don't want to mess up other people's work on the login node. That is a good thing to think about. What you can do is start an interactive job. An interactive job basically allocates resources for a job and gives you a shell on the compute node, so you get a terminal on that node. Does this work outside of Aalto? I think that should work. At Aalto we also have the sinteractive command: similarly to srun, you run sinteractive with your requirements and you get a terminal. So now you see that when Richard's job was queued and then allocated, the left side of his command line shows that he is on a compute node. He is now running an interactive terminal on a compute node, and there he can run whatever he wants without bothering other people on the login node. So this is a fairly simple way of getting interactive resources.

Do I need to exit from here? Yes, once you have finished your work and don't want to be there anymore, you can just exit or log out, and it will tell you that... does it say anything to you? Hmm, maybe try logout. The shell did end, though. Ctrl-D is the usual thing that I use. Could it be because this session is not fully interactive? Well, anyway. What happened? I did lose the shell. Okay. (Can you tell I still have to get this setup sorted out again?) Can you check slurm queue, is it still running? No, it stopped. Okay.

So what does this slurm queue do? slurm queue is a shorthand around the squeue command; it gives information about what is happening in the queue and tells you the status of your jobs. We will talk about these more when we get to non-interactive jobs, jobs where you are not running something interactively.
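To recap the interactive workflow we just showed, a minimal sketch (resource numbers made up; on Triton the sinteractive command should accept the standard Slurm flags, and on other sites the generic equivalent is srun with --pty):

    # Aalto convenience command: a shell on a compute node with these limits
    sinteractive --time=01:00:00 --mem=1G

    # Generic Slurm way of getting a shell on a compute node
    srun --time=01:00:00 --mem=1G --pty bash

    # ...work on the compute node, then give the resources back
    exit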
So maybe we should have a quick exercise of people trying to run, for example, the Python example. Okay, let's see, five minutes or so and people can try it themselves. You mean what we just did? Yeah. Which part? Well, let's say the srun part, because that is the first step you take towards running things in the queue. So try running something in the queue and see if it works. Let's say five minutes on this, and if you have any problems we'll respond in the HackMD at the same time.

Okay, so we're back; hopefully you've been able to do that. I see a lot of messages about different commands and partitions being different. On some clusters you're told to use an interactive command or partition rather than plain srun for interactive work; on Aalto you can also use sinteractive. There was a comment further up asking whether grun is the same as srun, and whether you need a different partition for interactive jobs. It's kind of a mess, isn't it? Yeah. With grun you are most likely not working with a Slurm system at all. Every cluster in the world has some sort of queue manager, and there are many of them: PBS, Slurm, Cray has their own, I can't remember them all. Slurm is the most popular, partly because it's free, but also because it's very good. So I can't say whether that command is the same. And on other sites the interactive partition might not exist, so you might need to leave the -p flag out. We will go into detail on what those different flags mean when we get to serial jobs.

There's an interesting question here: if both Triton and my computer run a program in the same one hour, what's the point of using Triton to run the program? I was thinking it should make it faster somehow. Yeah, that's actually a very good philosophical question: why would you use it? The idea behind the cluster is that something that would take your laptop, say, a full night of whirring away in the background of your apartment and making you sleep worse can instead be put onto the cluster. You run it there, you sleep well, and in the morning you come back to the results and look at them. Or you can run the same analysis with multiple parameters, as separate jobs, so you suddenly get a benefit because you can run many of them at the same time. Or, if your program supports it, you can use multiple CPUs. So when we talk about improvement, or making something faster, we are not necessarily talking about you sitting and watching the computer; we are talking about what gets you to the end goal faster. It's a bit of an XY problem. Say the end goal is that you need to do a thousand simulations. If you wait a thousand runtimes in front of your laptop, you can't even join a Zoom call, because your laptop is so loud that you annoy the other people there, or you have to stop the simulation while you are on the call. The idea is that you can offload this work to the cluster, where you can run many of these simulations at the same time, and there is basically an army of little workers who will generate the results for you once you tell them what you want done. So it is offloading the computation somewhere else, and that reduces the overall time to results. And of course some programs can also be faster in real time, if they can use multiple CPUs and so on.
But in many cases the real question is: what is faster for the whole project? Something to ask yourself is: if I had 100 laptops, would I get this done faster? Because I could put one simulation running on each of them. Of course you don't want 100 laptops in your home, that's insane, and you couldn't even use them all: who is going to run the commands on every laptop? It would become really laborious. But that is completely trivial to do on a cluster. We'll talk tomorrow about array jobs, and with those you can immediately run 100 simulations simultaneously; it's not a big deal in a cluster environment. And when thinking about speed-up and what is faster, you shouldn't necessarily think about clock time; you should think about your own work time. That is the most important thing. Because sure, if you really have only one program to run for one hour, there is no point in the cluster. I usually think about it like this: I don't want to sit in front of a terminal window watching numbers whizz past while the simulation is going on, twiddling my thumbs and waiting for the results to appear. I want it to be gone, running somewhere else, so I can do other things in the meantime: read literature, improve something else. And then when I come back, almost by magic, the results are there, ripe for the picking. That is the real power of the cluster system.

So what should we do now? Yeah, at this point we are getting close to four o'clock, so we can demonstrate what we will be starting at the beginning of the next day. What we just talked about is basically the next step: the change of perspective from "I'm running this on my laptop" to "I'm running it on the cluster with the srun command". At that point you are running the same thing there, but you are not getting any benefit yet; you are running the same stuff on a command line somewhere else. The real benefit comes when you disconnect yourself and your terminal from the simulation, run it somewhere in the background, and don't have to think about it anymore until it is finished. That is the non-interactive workflow, and it is what basically everybody uses in cluster environments.

Should we do some more of these interactive exercises down at the bottom, or should we go on? Let's do a few of them before we move on to the serial-job examples. There were good questions about what happens if the job runs out of memory or runs out of time, and there are examples here for that. If you haven't yet cloned the hpc-examples repository, you can do it now, into your work directory or your home directory. It doesn't really matter which, but the work directory is a good choice because that is where you will be running things in the future, so it is a better place to store it. In the repository there is a memory-hog script that just allocates bigger and bigger amounts of memory. Try running it; let's do it interactively.
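For reference, the exercise commands look roughly like this (assuming the AaltoSciComp hpc-examples repository and its slurm/memory-hog.py script, with the memory size given as an argument; check the course page for the exact syntax):

    # Get the examples into the work directory (Triton's $WRKDIR)
    cd $WRKDIR
    git clone https://github.com/AaltoSciComp/hpc-examples.git
    cd hpc-examples

    # Within the limit: ask Slurm for 500 MB and let the script allocate only 50 MB
    srun --mem=500M --time=00:05:00 python3 slurm/memory-hog.py 50M

    # Over the limit: the script tries to allocate far more than the 500 MB we
    # asked for, and the job is killed with an out-of-memory error
    srun --mem=500M --time=00:05:00 python3 slurm/memory-hog.py 5G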
You don't need to do this part; I'm just removing my existing copy of the examples first. I hope there was nothing important in there. Now I run git clone to get the repository, and now I am in hpc-examples. So if you run this memory-hog script... what happens if I run it right here? If you run it now, it runs on the login node, so you will be using the login node's resources. Try running it. Does it say at the top where it is running? Probably not. No. Okay. But you can see that it keeps trying to allocate, and there is no limit on the login node. Users do sometimes fill the node with their jobs and we have to kill them, because we haven't put a memory limit on the login node; but you are not supposed to run memory-heavy jobs there.

So let's say we want to run it in the queue instead, the next exercise on the page. Okay, so we'll use srun with 500 megabytes of memory, then python, the memory-hog script, and 50 megabytes as its argument. So we're waiting... okay, it worked. Yep, as simple as that. Can you run slurm history now? This is another command, and for other sites there is a different one, I think it's the last one listed over there. We'll talk about monitoring later on; it hasn't registered the memory usage yet.

But let's try the next step: increase the amount of memory the script allocates to something bigger than the 500 megabytes, like five gigabytes. The idea here is to demonstrate what happens if you run out of memory in a job. We'll go into the slurm history command in more detail next week... or tomorrow, rather. Okay, I've got resources... and you can see that we got an out-of-memory error here. Interestingly, it didn't even print any output. I'm pretty sure that's because Python buffers its output; if you run Python with unbuffered output, python -u, it should work. Let's add that. Okay, now we get some output, there we go. At about two gigabytes the job was killed. So there is a bit of leeway in the memory you are allowed to use, unless some other task needs that memory, in which case your job is killed automatically. So this is what happens when you run out of memory.

Maybe we should show sinteractive next. I see a comment in the HackMD about using the interactive command in Helsinki. How about I demonstrate this other command we have, sinteractive; it is probably similar to the Helsinki interactive command. So we run it. We could give it the same Slurm options for memory and time and so on. The main reason people use this is that they want graphical applications to work. Once this starts, we get a new shell, and this shell is on the other computer. My shell tries to start something automatically, so we see a big mess there; let's ignore that. But now, say I run the Python thing, or anything here: it runs and then it stops, but notice that I am still on the node I requested, so I have to exit this shell myself.

So there are two ways of doing these interactive things. One is by requesting a session: then you can do multiple things, but you have to remember to close the session yourself, or else it will keep holding the allocated resources until it eventually times out, and that's not good.
The other option is what we did with srun, where the command is directly on the srun line and we wrap the command in srun. In that case srun allocates resources, runs exactly one command, and then exits right away. This is a very important, fundamental distinction, and I guess we will see it later on too: when we do the asynchronous serial jobs, you give it a command, it runs it, and then it frees the resources immediately. Let's see, that shell went crazy. Okay.

What should we do now? Yeah, I want to clarify a few things from the HackMD, because there are very good questions there. They relate to why we are suddenly using these queues and why that is important. The important thing is that we have a lot of resources, and those resources need to be shared among many users. The queue system is there to make sure that everybody gets what they need. If we just pointed at the machines and told people to go there, some servers would be completely empty and some would be completely full, and nothing would work on the full ones because there is too much running there. The queue system balances the load so that everybody gets what they need and what they ask for. And the system is efficient: utilization is something like 90% in our clusters, so the system is essentially full. That matters because it means we don't waste servers and we don't waste money. When you are doing things at this scale, you need a queue system, and you need people to abide by it, so that everybody gets what they need.

There is also a good question there: if I don't run with srun, the same thing runs much faster on the login node; why is that? The reason is that with srun you are asking for one CPU, some amount of memory, and some time. On the login node, if you just run things there, you are basically taking resources from other users, and your code is probably using all the CPUs on the machine. That is why it is really important to set up your job so that it uses the resources it has been given and knows how to use them. We'll talk about parallel jobs tomorrow, but when you ask for resources, it might be that your job runs on one CPU, or it might be that it starts multiple workers and they all get squeezed onto that one CPU; either way it will run slower than on the login node. So it is important to know what your program is doing and to ask for the correct resources for it. Then it should run at the same speed, once you ask for resources similar to the login node.

Okay, any other really important questions here? I see we've noticed that on the Helsinki cluster the job is not getting killed. So be careful there, because it might affect other jobs if the memory killer isn't engaging properly; don't make the memory-hog too big. Can you try running that last step again, but srun the command and then tell it to sleep for two minutes? It might be that it runs: it allocates the memory and then just sits there.
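A sketch of that variant, wrapping the allocation and a sleep into one shell command so the job stays alive long enough for the periodic memory accounting discussed next (same assumed paths and sizes as above):

    # Allocate the memory and then sleep, so the job is still alive when the
    # periodic memory check comes around
    srun --mem=500M --time=00:10:00 \
        sh -c 'python3 slurm/memory-hog.py 5G; sleep 2m'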
It used to be that in Triton there was a process that ran every 60 seconds, checked whether a job was using too much memory, and then killed it. So you could go over your memory limit until the next 60-second check happened, and then you would get killed. That could be what is happening here, and that is why the sleep was suggested: so that the job doesn't end before the check has time to run. Interestingly, when you run the slurm history command that I showed before (we'll go over this more tomorrow), this 1M here is the actual memory usage. It says one megabyte, which is just wrong, because the usage is still only sampled every 60 seconds rather than instantly, even though Triton kills the job instantly. So it can be a bit tricky to keep these consistent: the memory killers only engage at set intervals, and if memory is running out they kill the jobs that are going over their limits. And interestingly, this is the kind of thing we talk about in our Triton admin meetings all the time: small things that seem obvious to do, but then either they don't quite work or there is some other trade-off.

But yeah, to set up tomorrow: what you should take away from this interactive part is that we got our first glimpse of the queue system, and the queue system is really what powers the cluster. Everything we covered before is what the cluster needs in order to work: you need storage, you need login nodes, and so on. Basically you are seeing the curtain being raised. You are at the theater, the curtain is going up, but the show really starts tomorrow when we go into non-interactive use. Because right now you are still opening a terminal somewhere and running commands interactively, and that is not the main point of the cluster. The main point of the cluster is that you tell somebody else to do the work for you, and that somebody else is the queue manager. You say what you want done, you go to sleep, you come back tomorrow, and it is done; that is how the queue works. You run things non-interactively, and you use the command line and the data copying only to give the instructions and to read the output. These interactive jobs are just the first glimpse of the queue manager: you can run things interactively and give them limits, a memory limit, a time limit; we'll talk about other limits tomorrow. But the idea is that tomorrow we'll have a full day of running non-interactive jobs, and that is how 99% of our jobs are run. The most powerful users are the ones with the biggest pool of workers working for them; they are basically CEOs of jobs, managing the jobs and the workers rather than doing the work themselves. So that is the non-interactive idea we will be focusing on tomorrow. And this is why shell scripting is so important, and why we started with it: if you can't make your things run automatically, you can't use 6,000 processors at once, or, on the new CSC LUMI computer, tens of thousands or hundreds of thousands of processors at once.
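As a small preview of that non-interactive workflow, a minimal batch script might look roughly like this (just a sketch; the real thing is tomorrow's topic, and the resource numbers are made up):

    #!/bin/bash
    #SBATCH --time=00:10:00
    #SBATCH --mem=500M
    #SBATCH --output=hostname.out

    # The same kind of command we ran interactively, but now the queue manager
    # runs it for us while we do something else
    srun hostname

You would save that as a file, submit it with sbatch, log out if you like, and read the output file whenever the job has finished.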