We are back. So our plan for now is to talk about monitoring for around ten minutes, have fifteen minutes for monitoring exercises, then cover some of the remaining material, and finish with the Q&A. We'll be a bit behind the published schedule, but that's about what we expected; the Q&A was designed as buffer time. Simo, can you share your screen, or should I? Yeah, this is the one I want to show: the monitoring page.

We have already gone through many of these monitoring tools, such as slurm queue and slurm history, but now we'll talk a bit more about monitoring from the hardware side and also from the conceptual side. So far we have mostly been monitoring the queue: what's in it and what it's doing. But when you submit jobs, what you usually want to do is first check how they enter the queue: did they start, did they crash immediately, did you have a typo somewhere? Then the more important part is that while the job is running, you want to monitor its state and how it's performing.

Think about the diagram Enrico showed yesterday of what usually happens when you run something. You work with an application that runs on some operating system, which runs on an actual machine; the results come back from the calculations, get stored somewhere on a disk, you maybe visualize them with an application, and then you see the results for yourself. So far we have been doing the left half of that picture: we have used different applications to run on the cluster hardware, on some compute nodes. The question now is: how do you actually get the results out of the cluster? How do you see what your program is doing? And this is really up to the program you are using.

It's a good idea to add something to your program that tells you what it is doing. It can be, as Richard said previously, something like "I have done 10,000 iterations": some sort of print statement that describes the program's progress, or saving a temporary checkpoint if you're running a model. You cannot see the program while it's running in the background, so your interface for monitoring it is essentially reading its log output with the cat command that we used earlier on the command line. You need to be able to decipher the program's state from the logging it produces; you need to be able to see some sort of progress from the command line.

After the program has finished, you can then use Slurm, which is essentially the operating system of the cluster, to see what the program actually did: did it finish correctly, and what resources did it actually use? Then you can use some other program yourself to visualize the results. But the main point is that this pipeline, this loop you run through when working on a cluster, has to be programmed largely by you. Your code needs to be able to tell you what it's doing.
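As a rough illustration of that kind of progress logging (this is a minimal sketch, not the course's actual script: the file names, messages, and resource values are all assumptions):

    #!/bin/bash
    #SBATCH --time=00:10:00
    #SBATCH --mem=500M
    #SBATCH --output=progress_%j.out

    # Hypothetical long-running loop that reports its own progress.
    for i in $(seq 1 10000); do
        # ... one unit of real work would go here ...
        if (( i % 1000 == 0 )); then
            echo "$(date +%T): done $i / 10000 iterations"
            # Save a hypothetical checkpoint so the run could be resumed.
            echo "$i" > checkpoint.txt
        fi
    done
    echo "all done"

While the job runs, you can follow the log from the login node with, for example, cat progress_JOBID.out, or tail -f progress_JOBID.out to watch it update live.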
There's some information on this monitoring page if you want to go into detail; it describes the different kinds of monitoring you can do and how you should do monitoring in your program itself. You can read it if you want. We can of course help you with that, but we cannot give one single good solution, because it depends on the program you are running.

Okay, so that's what you need while your job is running: you need to write some sort of output from your program. For checking a job after it has finished, we already went through slurm history and slurm queue, which tell you whether the job finished correctly in the queue. But there's one extra tool called seff, which shows the efficiency of your job. So Richard, here's an example output: what does this output tell you, and how do you get it?

I see we ran seff with the job ID. It tells us some basic information, like the job ID again and who ran it, but the most interesting part is down below, where it says CPU efficiency: 90.62% of 32 seconds of core-walltime, and memory efficiency: 0.08% of two gigabytes. seff basically does the math for you and tells you overall whether your resource requests were about right. The CPU efficiency of 90% should ideally be a bit higher, but it was a short job, so it's probably inefficient while it's first starting. The memory efficiency seems really low, but that's also because the job was so short that Slurm wasn't able to gather good statistics. This is something you should be running on your jobs regularly.

Yeah, so in this kind of output you can see how long the job was running and how well it utilized the CPUs. Notice that this job used two CPUs, and the efficiency is reported relative to everything you requested: it utilized both CPUs at about 90% efficiency. The maximum is 100%, so if the job fully utilizes the CPU resources it requested, the efficiency will be 100%. Memory efficiency can go above 100% if your job uses more memory than it requested, so you want it to be close to 100, but not over.

Okay, and I guess that's basically it. GPU job monitoring we will talk about tomorrow with the GPU material. Yeah, we'll mention this in the GPU part: GPU utilization is a bit harder to monitor, but there are tools for it.

Let's go to an exercise now. We will have 15 minutes. Exercise number one is a basic example where you run something with different parameters, and you will be able to see how seff reports the efficiency; you'll see that the larger problem sizes are more efficient. You can consider these as serial-job-writing exercises as well: the first part is basically how to write a serial job and how to monitor it; the second part is how to monitor individual job steps (we were discussing the benefits of job steps); and the third one is basically the same thing. So let's do 15 minutes on this, and try what you can in that time. For number two, we haven't really gone over the different threaded parts yet, so it can be a preview for tomorrow if you're interested. If you can just do number one, that's pretty good.
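A minimal sketch of that submit-then-check workflow, assuming the slurm wrapper commands mentioned above are available (the script name and job ID here are hypothetical):

    $ sbatch pi.sh            # submit the job; Slurm prints the job ID
    Submitted batch job 12345678
    $ slurm queue             # watch the job while it is queued or running
    $ slurm history           # after it finishes: state, runtime, memory
    $ seff 12345678           # CPU and memory efficiency summary

If CPU efficiency stays low across runs, you likely requested more cores than the program can use; if memory efficiency is far below 100%, you can lower the memory request next time.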
So yes, okay, see you in 15 minutes then. Okay, bye.

And we're back. Hello. So we cheated a bit and did the exercise ourselves in the meantime, so that we can go through it faster, because running some of the simulations takes a bit of time. So we'll just show the script that we created.

Here is the script. We have the usual sbatch headers: time and memory. We had to increase the memory limit a bit, because we got an out-of-memory (OOM) error with the lower limit. This script does several of the exercises at the same time: it runs the largest, 100-million-sample simulation, but also all of the smaller ones.

If we look at the output, the first thing to note is that the script produces its output in the Slurm output file, so we could easily monitor what the job was doing. There's also a flag that makes it write results to a separate output file.

If we run slurm history, we get output like this. You can see that earlier there was the cancelled job that ran out of memory, and below it the one that didn't. The rest of the job steps don't seem to use any memory, because they ran so fast that Slurm didn't necessarily capture their usage. But the last step, we can see, used about three gigabytes of memory. That's why the job was failing before: the last step was too heavy for the original limit, so the job couldn't finish it. We can also see here how long it took to run each of these steps.

Now we can use seff on that job, first on the job as a whole. Actually, let's do it in the other window with the better font. If we run seff (I'll move this up a bit so you can see the history of the commands), you can see that the memory efficiency was 76% and the CPU efficiency was quite good, 96%. Memory efficiency here of course means the maximum memory used relative to the limit; it doesn't mean the average memory usage was good, just that we were close to the limit we had set.

We can run seff on the individual job steps as well. If we check step one, for example, which was one of the small runs, we notice that it doesn't even show the efficiency data, maybe because it finished so fast. If we check step five, the CPU efficiency is pretty bad and the memory efficiency is pretty bad, because again it's too fast, too small a job to actually show anything. But if we look at the last step, we notice that basically all of the time was spent in that step. So this is basically the same as if we had run it without the steps, because most of the work was done during this step; the point is that we can use seff to monitor individual job step efficiencies.

Yeah. Okay, so that's about efficiency. Should we
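A rough sketch of what a multi-step script like the one shown might look like (this is an assumption, not the actual course file: the program name pi.py, the sample counts, and the memory value are illustrative):

    #!/bin/bash
    #SBATCH --time=00:30:00
    #SBATCH --mem=4G            # raised after the earlier OOM failure
    #SBATCH --output=pi_%j.out

    # Each srun call launches a separate job step, so Slurm accounts
    # for the runtime and memory of each problem size individually.
    for n in 10 100 1000 10000 100000 1000000 10000000 100000000; do
        srun python3 pi.py "$n"
    done

Afterwards, each step shows up as its own line in slurm history (or sacct), and you can ask about an individual step with something like seff 12345678.5, where 5 is the step ID; whether seff accepts step IDs like this depends on the version installed at your site.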