Okay, now the interactive jobs, I guess. So, as a reminder... Yeah, let's do it. Okay, so back to Simo. Now we're at the point where we're actually connecting to the cluster. Our general strategy for these lessons is to go over the main points quickly, maybe do a couple of quick demos, and then leave a really long time for you to do the exercises and practice yourself. Meanwhile, we'll be watching the notes to see what questions come up, and at the end we'll come back, go over those questions, and demo the exercises. I think people don't really want to hear us talking; they want to do things themselves. At least that's what we always hear. So that's the plan for the next three lessons. Second, we had a poll earlier asking whether you're already connected to the cluster. If you haven't got your cluster connection set up yet, you'll probably have problems today, because we're not going over that. If you haven't done the shell practice from yesterday, there might also be a few issues. You can read those materials and try to catch up, but we're not covering them now. And join the Zoom room if you have connection problems or something like that. At this point it might be a good idea, in your mind or when we start the first exercises, to make a connection to Triton. Okay, here we are. In this cluster schematic figure, what we're doing right now lets us use one CPU on one node, but we'll expand this later. So, interactive jobs. Why run interactive jobs? We've been talking about non-interactivity a lot, so why interactive? There's a specific pedagogical reason we're teaching it this way. Well, that's not the question you asked. So why interactive jobs?
Because sometimes, like when you're developing, you don't want to have to submit something, wait, check an output file, and see if it worked. You want to be able to run it, see the result, change the code, run again, and see it right on your screen with little overhead. That's one very useful case that almost everyone goes through. Also, there have been times in my career when I've said, okay, I need to process this data, but it needs 200 megabytes to load. So I get one interactive job, request 200 megabytes of memory and a few processors, go there, load it up in Python, do my thing, save it out, and then go on with my work. I've done that kind of thing before. So why do we teach it this way? Because it's easier: it gets people running interesting things and seeing results faster. Okay. So should we try running something interactive? Yeah, let's try. This is a demo, so you don't have to run the commands yourself. Oops, sorry, I grabbed the wrong window and resized it. Okay, there we go. I have a terminal here on Triton, so you can follow what we're doing. Simo has cloned the HPC examples repository from yesterday and changed into that directory. This is why we spent so long talking about files and directories yesterday: it needs to be second nature to you now. So, here we are in HPC examples. Let's try running our little pi script. We use python3 and give the relative path to the pi.py file. Notice we're not in the directory that contains pi.py, we're somewhere else, but that's okay. And we run it with 1,000 iterations. Remember, we shouldn't be running stuff on the login node, but this takes under a second, so it doesn't matter. This is what I'd usually do first: does the program even run at all?
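The slurm/pi.py script itself isn't shown in this lesson. To give a rough idea of what such a script does, here is a minimal sketch, assuming it estimates π by Monte Carlo sampling and takes the number of iterations as a required positional argument. The function names are hypothetical, and the real script in the repository may differ.

```python
import argparse
import random


def estimate_pi(iterations, seed=None):
    """Monte Carlo estimate of pi: sample points uniformly in the unit
    square and count the fraction that land inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (quarter-circle area) / (unit-square area) = pi / 4
    return 4.0 * inside / iterations


def build_parser():
    parser = argparse.ArgumentParser(prog="pi.py")
    # Required positional argument: leaving it out makes argparse exit with
    # "pi.py: error: the following arguments are required: iterations"
    parser.add_argument("iterations", type=int, help="number of random samples")
    return parser


def run(argv):
    """Parse command-line style arguments and print the estimate."""
    args = build_parser().parse_args(argv)
    estimate = estimate_pi(args.iterations)
    print(f"pi is approximately {estimate:.4f}")
    return estimate
```

Calling run(["1000"]) corresponds to python3 slurm/pi.py 1000. In a standalone script you would call run(sys.argv[1:]) under an if __name__ == "__main__": guard. Because iterations is required, run([]) exits with the same kind of argparse error discussed later in this session.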
As soon as it runs, I cancel it. Now let's add srun. Simo can push the up arrow key, which goes back through the shell history so we don't have to retype the command, then scroll to the beginning and add srun to it: srun with memory equal to 100 megabytes and 10 minutes of time. When Simo pushes Enter now, notice we see something different. It says job something-something queued and waiting for resources, then resources allocated, and it ran. Can we show that it actually ran somewhere else? Can you type the command hostname just here? Yeah. hostname prints which computer it runs on. Now can you do srun hostname? It's queued, and look, it says something else: csl48. Okay, is there anything else we really need to talk about, or is this basically it? The interactive shell would be good to mention. Yeah. The problem here is that we're waiting for every job to run. Maybe not very long, but sometimes it can be a while. So there's a way we can use srun for something different. If Simo does srun... I'll quickly say what we did here. Previously, when I ran the command hostname, it ran here on the login node. But as soon as we use srun, the queue system takes over. It sees that I want to run something in the queue, reserves resources based on the resource request we specified, and then puts the command onto some compute node. I don't know which compute node it will be; that's decided by the queue system, but it ends up somewhere over here. That's basically what we did with srun. There are already some questions in the notes about this not working at some other universities. If it doesn't work, this is the kind of thing that differs between sites. You might need to add the partition or something like that. Yeah, write in the notes and see if someone answers there. Okay, where were we? The interactive shell.
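All the srun calls in this demo follow the same pattern: resource options first, then the command to run. As a sketch, here is a hypothetical Python helper, srun_command, that assembles such an argument list. The options --mem, --time, --partition and --pty are standard srun options, but whether you need --partition, and which partition names exist, depends on your cluster. The "batch" partition in the test is only a placeholder.

```python
import shlex


def srun_command(command, *, mem, time, partition=None, pty=False):
    """Build an srun argument list (hypothetical helper, for illustration).

    command   -- the program and its arguments, as a list of strings
    mem       -- memory request, e.g. "100M"
    time      -- time limit, e.g. "00:10:00" (hours:minutes:seconds)
    partition -- only needed on sites that require an explicit partition
    pty       -- attach a pseudo-terminal, as for an interactive shell
    """
    args = ["srun", f"--mem={mem}", f"--time={time}"]
    if partition is not None:
        args.append(f"--partition={partition}")
    if pty:
        # --pty must come before the command it applies to
        args.append("--pty")
    args.extend(command)
    return args


# The pi run from the demo, as a shell line:
# srun --mem=100M --time=00:10:00 python3 slurm/pi.py 1000
print(shlex.join(srun_command(["python3", "slurm/pi.py", "1000"],
                              mem="100M", time="00:10:00")))
```

The list form can be passed straight to subprocess.run on a machine where srun is installed. Building a list instead of a single string avoids shell-quoting surprises.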
Yes. So Simo does srun --pty bash, where PTY means pseudo-terminal, plus some time, let's say an hour, and some memory, each with two dashes in front, I think. Oh yeah, yeah. Okay, so when we run this, we request one hour of time and 500 megabytes. It looks really similar, but if we run hostname, we see we're on the other computer. So now Simo can run multiple things in the same allocation: start Python, try some stuff, close Python, look at the data, start Python again, whatever. Okay, what else is in this section? There's interactive shell with graphics. That's another section, but we don't need to go through it; you can read it yourself. Checking your jobs. Should we show slurm history? Yes. We'll talk about monitoring later on, but slurm history is... this might be different on other clusters... maybe I need to make it wider. Yeah, it needs to be wider. I'd say just make it really wide, drag the other half off the side. Basically, Slurm records everything that runs. Can you make the terminal wider than the shared area? Yeah, there. So we see basically everything that ran: what command, when it started, what it requested, and so on. You'll explore this a little in the exercises, and two lessons from now we'll talk about monitoring and do a lot more with it. slurm history might not be available at some places; there you might need to use squeue -u $USER, which gives a different kind of output, but most clusters, at least in Finland, should have these standard commands. Yeah. If you scroll down to where it shows slurm history. Yes. Even more. There. So we often say slurm history, and if slurm isn't available, sacct --long shows similar information. So we give two commands: one that is a bit more advanced and custom, and one that should work anywhere.
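slurm history and slurm queue come from a site-specific add-on, while sacct --long and squeue -u are part of Slurm itself. Here is a small sketch of choosing between them, assuming the add-on is installed as a command named slurm on your PATH. The helper name pick_commands is hypothetical.

```python
import getpass
import shutil


def pick_commands(which=shutil.which, user=None):
    """Choose job-history and queue commands, preferring the 'slurm' add-on.

    `which` is injectable so the fallback logic can be tested anywhere.
    """
    if which("slurm") is not None:
        # Custom add-on (not part of Slurm): shorter, friendlier output
        return {"history": ["slurm", "history"], "queue": ["slurm", "queue"]}
    user = user or getpass.getuser()
    # Standard Slurm commands, available wherever Slurm itself is installed
    return {"history": ["sacct", "--long"], "queue": ["squeue", "-u", user]}
```

On a cluster, run the chosen list with subprocess.run(pick_commands()["history"]). The two variants print different columns, so don't parse one as if it were the other.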
Okay, and with that, there's only a final note about setting resource parameters. Basically, all of the different options we talked about under Slurm, which you can find in the quick reference, can be added to this srun command: to request more memory, more time, exclusive nodes, more processors, GPUs, and so on. We're not actually talking about them now, but this is the basic interface. And with that said, it's time for exercises, and we're actually doing quite well on time, I'd say. I think the exercises are described pretty well, so we probably don't need to say much. Just remember that all of them use the HPC examples repository, so make sure you're in that directory when you run things. I'd also mention that in the notes there was a good discussion about the difference between running on the login node and using srun. The main difference, which we also tried to emphasize, is that the login node is a shared resource: it's shared among everybody on the cluster. So it's very important that you don't run anything computationally heavy on the login node, because that affects other users and it's not the fastest way of doing it anyway. If you're uncertain where you currently are, because in the terminal you don't necessarily see it easily, you can always run the hostname command to check whether you're on the login node. If it says you're on the login node, then you know that maybe you shouldn't be running this here and should use srun instead. Yeah, that's a good point. Also, there's a wrong example here: it's missing the iterations argument. But this is a good example of reading an error message. It says "pi.py: error: the following arguments are required: iterations", which is a hint that something is wrong, that the script needs this other argument, and that maybe the docs are wrong.
Should we go to the exercises? How long should we give? I'm not certain. I think something like 10 minutes, and then we go through the exercises. It'll also be combined with a break. What if we're back in half an hour, so at 15 past the hour? That leaves enough time for a break and for exploring things. Or, oh yeah, 10 past the hour maybe. Yeah, 10 past the hour, and then we can go through the exercises and their solutions. Okay, exercises plus break until then, and you can keep working after that time. Okay, I'm putting it in the notes. See you soon. Bye. Bye. Welcome back, everyone. Hopefully you had a break, and if you're still working on things, you can keep working while we're talking. There were a lot of questions, but overall I think it's gone pretty well compared to some past years. In the exercises section of the notes, can you vote on whether you managed to do the exercises? Whether you did them, had problems, need more time, or aren't even trying them: that tells us the current status and whether we should focus more on getting through these exercises. Okay. Maybe we should also have asked whether you want us to go over them. Well, let's scroll up in the notes and see what kind of questions we had. Does that sound good? Yeah. So these are questions on all of interactive jobs, not just the exercises. Does the text size need to be larger? I'd say it's fine, but leave a note in the notes if not. Okay, interactive jobs. I'd say interactive jobs work on very many different clusters; it's a fairly standard thing to have.
In some years, for example, the University of Helsinki cluster needed a different command to get them, but it's a thing. Sometimes they're discouraged, because being interactive means the resources aren't being used fully efficiently. But everyone understands that using a small amount of resources for development is normal. The third question is about software; we'll return to that in the software section later today, but it's a good question. Next: if you see errors saying something like it's unable to allocate resources because of an invalid account or account/partition combination. Different clusters have different ways of separating different types of nodes and different users into their own places. Take accounts: at CSC, for example, you usually get a project account for each project you're doing there, so you need to specify which billing unit the resources should be charged to. At Aalto we don't use them; it's handled automatically behind the scenes. Similarly with partitions: on different clusters, different nodes may be assigned to different partitions. In those cases you need to check the documentation for that specific cluster. And that's why having a cluster quick reference or something like that is good: one page new users can scroll through to see what's specific to your site and how to get started. The question below asks what it means if you see that you're waiting for resources. That's exactly what it's doing: waiting for resources, waiting for the right slot to open up. It does exactly what it says on the tin.
That can happen because it's a shared resource, and if all of the resources are in use, you have to wait. This is partly why we don't recommend doing too much interactively: you'll wait, and waiting while you sit there is annoying. If you're running non-interactively and not sitting there, waiting isn't so annoying. So we'll be talking about non-interactive jobs after the interactive session. Next: do we submit jobs from the login node? Yes. You can usually submit jobs from other nodes too, but let's ignore that. You connect to the login node, submit stuff, and check the output from the login node; you don't run big things on the login node itself, because you slow it down for everyone else. To use the dining metaphor: you order food in the restaurant hall, not in the kitchen. You don't walk into the kitchen and say you want pasta. You stay in the main hall and ask the staff for a table and pasta, and they organize it for you. The login node is where you want to be for housekeeping: submitting jobs, moving files around, that sort of thing. The next question is pretty similar, so I think we've answered it. There's also a question further down asking whether everything should be run in the queue with srun. Not everything is computationally expensive: the git clone we did, for example, or moving folders, or changing to a different folder. You need to think about how much CPU and how much memory a process uses when you run it. When I run the hostname command, it basically does nothing, so it doesn't cost anything.
So it doesn't matter if you run it on the login node, and the same goes for git clone and that sort of housekeeping. But if you actually need to cook the food, like running an hour-long simulation that takes time, memory, and CPU, that's what you want to push to the queue. The next one, about the missing argument, I already discussed. I realized I should have said that reading the error message was an exercise in itself, but okay. Git clone: does it matter where you run it? Simo just answered that: small things can be run anywhere. About the work directory: we'll cover that when we talk about disks in general later on. For now, let's just work in the home directory; that's good enough for these small exercises. Next: I can see and access everyone's files. Is that on Triton? Because that shouldn't be the case on our cluster. Yes, it shouldn't be the case. You can usually see the folders, but you shouldn't have permission to go inside them. If you can, let us know, or let the managers of your cluster know, because that's usually not what you want. Like yesterday, you can actually get inside there. Okay: yesterday we ran without Slurm. I think we answered that already. Does it happen often that people forget about Slurm? Well, you'd think so, because it seems we're always reminding people about it. But there are hundreds of users, so even if it's not a high percentage, it's enough that we talk about it over and over again. Next: git clone on the login node, the clone runs. So where did we get the test code? At the top of the exercises it says how to clone it; it actually links to what we did yesterday. But it could be that you're not in the right directory.
For example, if you haven't done cd into hpc-examples, then the relative path slurm/pi.py doesn't exist. Or you might have changed into the slurm directory and still be running it as slurm/pi.py. The main point of the cluster shell lesson from yesterday was to give you some experience with directories, moving around, and knowing where you are so you can run stuff. If what I'm saying doesn't make a lot of sense to you, I'd really recommend reading that lesson again after this course is done and seeing if it makes more sense with a bit more experience. This is something that will make everyone's life a lot better, and it's worth spending five or 10 more minutes to learn it. You can always run the command pwd to check your current working directory: where am I now? And you can always run the hostname command to check which machine you're running commands on. Those are good tools for navigating the system, because it can be hard, especially if you haven't used a terminal before, to get your bearings when the screen only shows words and numbers. I'd recommend checking the output of those commands to work out how deep in the directory hierarchy you are and which machine you're running these commands on. The next question: changing from the login node to our workspace. Do you mean your storage space, or your job allocation with its resources? The storage space is shared among all nodes of the cluster, so it's just there. And the commands we've been running now get you the resource allocation. Okay, this one is about Olu, which we don't know. No. Same error. There's a question about compiling code; we can talk about that when we cover applications later today.
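The two checks mentioned above, pwd and hostname, can also be done from inside a Python session, which is handy once you're in an interactive shell and aren't sure where you ended up. A minimal sketch follows. The helper name where_am_i is hypothetical, and using SLURM_JOB_ID as the signal relies on the fact that Slurm sets that variable inside job allocations; it is a heuristic, not a guarantee.

```python
import os
import socket


def where_am_i():
    """Report working directory, host name, and Slurm job ID (if any).

    SLURM_JOB_ID is set by Slurm inside a job allocation, so None here
    suggests you are on the login node or on a machine without Slurm.
    """
    return {
        "cwd": os.getcwd(),                          # same as `pwd`
        "host": socket.gethostname(),                # same as `hostname`
        "slurm_job": os.environ.get("SLURM_JOB_ID"),
    }


info = where_am_i()
print(f"{info['host']}:{info['cwd']} (Slurm job: {info['slurm_job']})")
```

On a login node you'd expect slurm_job to be None. Inside srun --pty bash it should show the job number that srun printed when the allocation was granted.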
As for what counts as heavy: if something takes longer than you're willing to watch the screen, I usually move it into the queue. If I'm not actually watching it execute, because it's taking too long and I'm doing something else at the same time, then it's usually something that could go in the queue. I'd give that as a good rule of thumb: if you're not bothering to watch the output, it's running too long for an interactive session. Okay. Next question. This is a good one. We're using this command slurm, which is custom: it's not part of Slurm itself but an add-on that someone wrote to make some of the other commands easier to use. In the docs we've tried to give both the slurm version and the generic equivalent, for example slurm queue versus squeue -u $USER. You can ask your admins whether they can install it to make life easier for you. Let's see, maybe we should go on. If you allocate one hour and a process finishes in 30 minutes, do the resources get freed? Yes. If the job actually ends and the monitoring says it lasted half an hour, then the resources are freed and you're not charged for the extra time. So it's okay to request a longer time limit. It might mean your job takes a little longer to start, but that's okay. This matters especially for an interactive terminal like the one we ran here. Say you want to test some stuff out: you might put two hours there, because otherwise it shuts down right when you've found the thing you want to do. So you might want to overestimate a bit. Usually you want to request close to the expected runtime, but with some leeway so that if something runs longer than expected, it doesn't get killed. Yeah.
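To make the "close to the runtime, plus some leeway" advice concrete, here is a hypothetical helper, time_limit. It pads an expected runtime by a chosen factor and formats the result as hours:minutes:seconds, one of the formats srun's --time option accepts. The 1.5× default is an arbitrary assumption for illustration, not a recommendation from any site.

```python
import math


def time_limit(expected_minutes, leeway=1.5):
    """Pad an expected runtime and format it for srun's --time option.

    Per the discussion above, time left over when a job ends is freed
    again, so rounding up costs little; a much larger limit may only
    delay when the job starts.
    """
    if expected_minutes <= 0 or leeway < 1:
        raise ValueError("expected_minutes must be > 0 and leeway >= 1")
    # round() first so float noise (e.g. 10 * 1.1 = 11.000000000000002)
    # doesn't push the ceiling up by a whole extra minute
    total = math.ceil(round(expected_minutes * leeway, 6))
    hours, minutes = divmod(total, 60)
    return f"{hours:02d}:{minutes:02d}:00"
```

For a job expected to take 30 minutes, time_limit(30) gives "00:45:00", which you could pass as --time=00:45:00. For an interactive testing session you might choose a larger factor, as in the two-hour example above.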
Okay, I guess we should move on, because my clock says it's time. Can anyone on the cluster see my files? No, only admins can see or mess with your stuff. The whole system is designed so that users can't interfere with each other. I'll quickly mention that there are programs such as Jupyter where, if you run them without the normal security settings, you can accidentally give access to other people. So be careful when you run a web application or something like that, because it can sometimes be tricky. Normally Jupyter gives you a URL with a token in it, which works like a password, so other people can't use it. But because it's a shared system, some programs might give access to other people if they're set up to. So check your application. The next question is actually a really good lead-in to the next lesson: if I run a program using the srun command and accidentally close my shell, how can I get the results back? Well, you can't. That's exactly why you shouldn't use srun for anything that might take a while or whose results you want to keep, and that's why we do batch jobs, which is what we'll do next. I'll add that here. Interactive basically means it's only there while you are there too. By definition, interactive involves you being present, and if you're not there, if you close the connection, the session closes. So you shouldn't use it for anything that takes long. Yeah.