So, welcome back. Now we get to the really interesting part, which is actually running stuff. Let's see, I'm switching to Simo's screen here. So, we have this large cluster that has access to how many nodes? Several hundred? Quite a few, anyway. So, how do we manage to share hundreds of nodes among the hundreds of people using them at any particular time? That is what the job scheduler, also called the workload manager or batch system, is for. On Triton, our workload manager is Slurm.

Slurm is basically like the maître d' at a restaurant. You walk into a restaurant and there's a queue of people who would like food, a limited number of tables available, and a limited number of cooks to produce food for those tables. If you let people just take whatever table is available, it may not be very efficient: you may have parties of two taking up a table for eight, or people not joining the same table when they could share. So, imagine this HPC diner analogy. You arrive at the restaurant and there's usually a host there. They take some information about you, such as how many people you are. I guess they don't usually ask how long you'd like to stay, but in theory they could. Then you're given a number and asked to wait. Once a table is available that the host believes will be efficient for the restaurant overall, they take you there and seat you. If you're just two people, it's easy to squeeze in somewhere. If you're a party of ten coming for some major event, you might have to wait quite a while, or in reality, they'll want you to have an advance reservation and make space for you. On a cluster, there actually is a concept of reservations, but in practice it works like this: the bigger the job, the longer you have to wait until space is available, and everything is continuously and dynamically recalculated. The maître d' also wants to keep people balanced: if someone is coming to eat all the time, they might have to wait a little longer to make space for other people who are new. That's an important role here that doesn't quite exist in a real restaurant.

What's also important to note is that the greeter doesn't adapt the request: if one person in the line says "I need a ten-person table", the greeter doesn't second-guess it, they just fulfill the request, according to the priority of the customer. So, one person can ask for a ten-person table and they will get it, but after a certain waiting period, of course. Right, yeah. The greeter just wants to get as many customers as possible through the queue. If the balance is off, it's because of what the users ask for, not because of the greeter.

So, let's get straight to it. Here we are on the cluster. Simo has logged in, so we see username@login2. And if we type hostname, this tells us the computer that we're actually running on. So, this is running on the login node. And what you don't want to do is take this and just run all your code here.
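For reference, that check looks something like this (the prompt and the exact hostname will differ from cluster to cluster):

    $ hostname
    login2.triton.aalto.fi

If hostname prints a login node name like this, you're still in the shared lobby of the restaurant, not at a table.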
While the login node is a pretty powerful computer, it's not powerful compared to the hundreds of people that may be using it. So, if you run stuff here, you'll get a polite email from us saying: please use the queue.

So, here we see Simo has copied the first example, which is basically Python 3 from the command line. It imports a module and prints the node name, basically the same as hostname here. So, let's say we want to run this on the cluster. As I like to say: just add srun. srun here means "Slurm run". What it does is talk to the host of the restaurant, the workload manager, and say: I have a request, I would like to run Python 3. The default parameters are 15 minutes of runtime, something like 500 megabytes of memory, and one processor. Since we're fine with those defaults, we don't say anything else. Here we see it says "job queued and waiting for resources", then that it's been allocated resources immediately, and now we see the output: hi from node csl48.int.triton.aalto.fi. So, this was actually run on a different computer, and we got resources.

So, let's say we need more resources. The example here asks for 100 megabytes of memory. Do you want to try that? We run it, and it also works immediately. What if we ask for 10 gigabytes of memory? Do you think it can give us that immediately? So, here we change 100 megabytes to 10 gigabytes. Notice it takes a little bit longer, but it still ran, and on the same node.

So, we've got various parameters here for the jobs that are running. This next example won't actually work so well, because right now jobs are running very quickly on Triton. But in general, you can run the command slurm queue to print all the things that you have waiting to be run or running. This is a little bit of a difference between Triton and other clusters: this slurm command is a custom script we have installed. It's also available for other sites. As some documentation or other puts it, using the native Slurm commands can sometimes require an impossible combination of other commands, so it's not the easiest thing. Let's see, the slurm queue isn't really going to... there we go. Simo just ran srun sleep 10 in one terminal, which waits. And down below we see slurm queue, and it says that the job is running: the job ID matches, the command that's running is just called sleep, and, well, it's going.

Okay. Yeah. So, you may wonder about this: right now we're doing this from a single shell, so you can only run one thing at once, or as many things as you have shells open, which doesn't really scale. But this is just the introductory step; in the next lesson we will be getting batch jobs running. These interactive jobs are useful for debugging, for testing stuff, and for getting familiar with the cluster. And they're useful for this teaching purpose, too. But you're going to need to use the serial batch jobs later.
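Put together, the sequence we just ran looks roughly like this; the job ID, node name, and queue output are made up for illustration, and the Python one-liner is a stand-in for the course's example:

    $ srun python3 -c 'import socket; print("hi from", socket.gethostname())'
    srun: job 1234567 queued and waiting for resources
    srun: job 1234567 has been allocated resources
    hi from csl48.int.triton.aalto.fi
    $ srun --mem=10G python3 -c 'import socket; print("hi from", socket.gethostname())'

    $ srun sleep 10     # terminal 1: a job that just waits
    $ slurm queue       # terminal 2, while the job runs: shows job ID, name, and state

Note that slurm queue is the Triton wrapper script; on a stock Slurm installation, the closest equivalent is squeue -u $USER.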
Okay, so what's the issue? Well, currently, when we run srun, we are running srun itself on the login node, and Slurm runs whatever srun was supposed to run on a compute node. So, let's say we need to take the login node down for maintenance. Then we have to kill the processes that are running on the login node. This happens maybe, I don't know, once a month or once in two months, when there's a major kernel update or something like that. So, if you're running a long job on the login node with this srun, it will be canceled. If you're running in the non-interactive way that we're going to demonstrate properly tomorrow, nothing will happen: it runs in the background, the login node can go down, and you don't have to keep your terminal open to the login node for it to run.

Yeah. Maybe we should go over a little bit of this terminology. We gave the metaphor, but not the exact terms. So, the login node is the first node you connect to, and it's generally used for accessing the rest of the cluster. You can use the login node for small-scale tests, debugging, and things like that. The login node is basically like the lobby of the restaurant: you wouldn't want to eat in the lobby where other people are constantly bustling around. You want to ask the greeter to get you a table to eat at instead of eating in the lobby. Right. Yeah. And then the compute nodes are where things actually run; compute nodes sit in huge racks and are basically like the tables. Each table has multiple chairs, the CPUs. And if we want to take the analogy even further, the tables are different sizes based on memory: some nodes have more memory, some have less. The greeter's job is to fit everything together as well as possible. Yeah. And srun is basically making the request to the greeter to ask for a table.

So, this is all well and good. But what happens if you actually are doing something interactive, like debugging something, and you would like a shell with more resources? Let's say you want to use Python to open up a huge data frame in pandas, and you need 100 gigabytes to do that. So, we have a way: interactive shells. You can use srun with the -p option. I was about to say -p means make it interactive, but no, -p means partition, so we say we would like the interactive partition. We say we would like the time to be two hours, and the memory to be 600, which I guess is 600 megabytes by default. Then --pty means give us a terminal, and bash means run bash. And now here we are: you see from Simo's prompt that he is now on a node called pe6. This is just like a shell on the login node, but he can do basically whatever he would like here. He can start Python from Anaconda and interactively process some data, run IPython, whatever it may be. This is pretty convenient when you just need a bunch of memory but don't want to do it as a batch process. Okay. So, in a little bit, you'll have an example to try this yourself; there are different ways to get an interactive session. It's also important to note that with interactive jobs, if you ask for an interactive shell and you don't close the shell properly, the reservation doesn't necessarily go away. So, always remember, when you have an interactive session, to close it afterwards, so that it doesn't keep reserving CPU time. Yeah. Okay.
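Putting the interactive-shell pieces together, the command was along these lines (the partition name and defaults are site-specific, and pe6 is just the node Simo happened to get):

    $ srun -p interactive --time=2:00:00 --mem=600M --pty bash
    user@pe6:~$ hostname       # the prompt already shows we've moved off the login node
    pe6.int.triton.aalto.fi
    user@pe6:~$ exit           # leave when done, or the allocation keeps holding resources

The exit at the end is the important part: the reservation lasts until you leave the shell or the time limit runs out.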
So, if you need an interactive shell but you also need graphics along with it, there's a different way to do that, which is called sinteractive. It basically works a lot like srun: you give a time and memory and whatever else you may need. It works slightly differently in the background, but if you need to run a graphical application, this will work. Okay. So, I'm betting Simo doesn't have... do you have graphics connected here? Does xeyes work? That's x-e-y-e-s. Oh, yeah, I saw it there. Because this is on the interactive node, so it's not there. Yeah. Okay. So, let's try it. Does it work? Yeah. So, for example, if you want to run Matlab, this might take a second to load, but Richard can continue describing it meanwhile. This is something that you can investigate more later if needed, or try out during the hands-on sessions. So, it looked like Simo's Matlab didn't start graphically. Yeah, I probably didn't have an X session open. But if you want to run a graphical application, it's usually a better idea to use the VDI and access the data on Triton from there, instead of running the graphical application on Triton itself. For some applications, though, say you want to do something with ParaView, or physics rendering, that's completely possible on Triton. But graphical applications are not the main forte of the system.

Yeah. So, you may ask: how do you know how many resources to request and how much you actually used? That brings us to monitoring. We have this command, slurm history, again using the custom script. Can you make your terminal wider? Smaller, wider, whatever it may be. Anyway, every time you run something on the cluster, it keeps detailed records about what you ran, how many resources you requested, and how long it took. So, here we can see the recent jobs Simo has run. We have the job ID, under job name it says python3, then the start time and the requested memory. MaxRSS is how much memory the job actually used; we see that these Python processes have not used much. Then how much CPU time it used, and so on.

So, what we would normally recommend, when you're first using a cluster, is to submit the job with a guess at how many resources it needs, then go and look at how much it actually used, and adjust your request to match. We will get to this in a little more detail later, once we start running things in parallel. But a good rule of thumb: if you're currently running the software on your laptop or desktop computer, compare to that machine. What do you have there? Let's say you have a MacBook with 16 gigabytes of memory, and you run some simulation code for four hours; then you might guess that 16 gigabytes of memory and four hours might be suitable for Triton as well. Of course it may vary, but the resources you currently have available, and how much of them you're using, make a good first guess. If you're running a small Python process that isn't using very much, you might lower the request; if you're running a big simulation, you might raise it. Yeah. And interactive jobs are good for deliberately requesting too many resources to see how much is actually needed. But once you start running tens or hundreds of the same thing, it's really worth taking a little bit of time to get your resource usage correct. Okay.
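The guess-measure-adjust loop, sketched schematically (the slurm history columns are abbreviated, the numbers are invented, and my_simulation.py is a hypothetical placeholder):

    $ srun --mem=16G --time=04:00:00 python3 my_simulation.py   # first run: the laptop-based guess
    $ slurm history
    JOBID     NAME     REQMEM  MAXRSS  TOTALCPUTIME  ...
    1234569   python3  16Gn    3G      01:12:34      ...

Here MaxRSS says the job peaked around 3 gigabytes, so the next submission could safely request something like 4G instead of 16G. On a stock Slurm installation, sacct gives the same accounting data.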
So, yeah, I just talked about these resource parameters here. The main parameters are memory, time, and number of CPUs. This will become a lot more important in the next lesson. Okay. Yeah. About the resource parameters: it's usually a good idea to match the request as closely as possible to what you're actually using. Basically, same as the restaurant thing: if you know that you're going to be there the whole day, let them know beforehand. Otherwise they will kick you out, because the table has a bigger reservation from a certain time onwards. So, you need to let them know beforehand how many resources you're going to need, or you will be kicked out. And there are certain magic numbers, certain combinations of resources, that are best suited for the system. We'll talk about them in more detail later.

Yeah. So, here we are at the exercises. These are pretty easy things to do, or at least easy to start with. So, Simo, would you like to maybe do the first exercise with me, so we go through the whole example? And then we can divide up into the breakout rooms. In this first exercise, we clone a Git repository that has some sample code in it. Inside, there are programs that use a bunch of memory, or a bunch of CPU, or whatever, so we can use these to test the different resource parameters. And since it's in Git, this works on any computer. So, here we see Simo has changed to his work directory and has now cloned the repository, and he can change into the different directories. Here we are in the slurm directory now. So, let's run it: Python, and then the memory hog program with 50 megabytes. Since Simo has changed into the slurm directory, he doesn't need the whole path. But, yeah, let's run it. You changed into the directory, but also gave the directory on the command line. Oh, yeah. Well, maybe I'll go back.

So, here we see the program has gone through and allocated more and more memory. But this is on the login node, which is not a good place to be asking for lots of memory. Let's see. So, now let's use it with srun: srun, and we'll ask for 500 megabytes of memory. Yeah, I previously also ran it with srun. So, let's slowly increase the amount of memory the program allocates. How about we make it allocate five gigabytes and see if it works? How many? Five gigabytes. Five gigabytes. So, it's waiting... oh, we see an error: it detected an out-of-memory event in the job step. So, Slurm is limiting the amount of memory you can use to the amount you requested. Actually, you can go over by a little bit, but as soon as you go over, you might get killed, and if you go too far over, you definitely will get killed.

So, now let's look at slurm history. At the very bottom here, we see... so, this actually doesn't work, surprisingly. The reason for this is that Slurm only measures the memory use every 60 seconds, and this job died immediately. That's a little bit of a problem in the measuring system. So, what we can do: let's change the program to allocate, say, one gigabyte of memory, and then let's add the sleep option to the memory hog program. Yes, like that. So, now it will occupy the memory but then wait for 60 seconds, which should be enough for Slurm to measure the memory it's using.
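Assuming the layout of the course's hpc-examples repository (double-check the URL and the script's exact flags against the exercise page; the --sleep spelling in particular may differ), the whole sequence was approximately:

    $ cd $WRKDIR
    $ git clone https://github.com/AaltoSciComp/hpc-examples.git
    $ cd hpc-examples/slurm
    $ python3 memory-hog.py 50M                        # on the login node: fine as a tiny test only
    $ srun --mem=500M python3 memory-hog.py 50M        # the same thing, but on a compute node
    $ srun --mem=500M python3 memory-hog.py 5G         # allocates far past the request: oom-killed
    $ srun --mem=500M python3 memory-hog.py 1G --sleep=60   # holds memory long enough to be measured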
And you'll notice here that even though we requested 500 megabytes of memory, Slurm gives the job some leeway. If some other job isn't actually needing the memory, if there is free memory to go around, Slurm will give that memory to the job. But if some other job that has reserved the memory suddenly needs it, then this job will get kicked out. So, it's usually a good idea to request what you need and then add some extra headroom, so that a sudden memory spike doesn't get the job killed. Yes. Okay. So, there we go, 60 seconds passed. We do slurm history and, let's see, here at the very bottom we see it requested 500 megabytes and used 1,030 megabytes. I guess the extra 30 is the general Python overhead. You may wonder what Mn means: why does it say 500Mn instead of just M for megabytes? The n means 500 megabytes per node, as opposed to per CPU.

Let's see. So, now should we go to the hands-on time? You see a few more exercises here that you can play with. I propose that we take maybe 15 minutes, regroup at five minutes past the hour, and summarize what we have learned. Then we can give you a quick preview of what comes tomorrow, when we actually start scripting these jobs. Yes, now is a good time to get accustomed to the system: how to use it interactively and how to access it. Tomorrow we'll go into much more detail on how to do big-scale or specialized stuff on Triton, and how to do it non-interactively. Okay. So, 15 minutes of exercises.

Okay. And if you're leaving us already, please give us some feedback. Either write in the HackMD what you thought of today, or let us know via Zoom or the Twitch chat, or however; it's very important to us to get this feedback. We know that today seemed quite slow, but that sort of goes with the territory, we think: there's a lot of basics to get through before we can move beyond interactive jobs and things like that. But don't worry, as soon as we start tomorrow, it gets very fast. And please also go and review the things you've done today, because they're an important basis for tomorrow. Any other final comments, Simo? If you feel that all of this is clear as day, nothing complicated, I recommend checking our reference page on the Slurm commands and the different flags you can set, because time and memory aren't the only ones. You can start reading the more complicated material. But yeah, nothing else really. Yeah, okay. So, let's head to the break. See you soon.