So what's the point of monitoring here? We've already been doing it a little bit, but people often wonder: something's running, so how do I know if it's doing what I expect? There's a good breakdown here, three types of monitoring. First, when the job is submitted: is it running yet? You monitor its position in the queue, when it will run, when it should end, and so on, which we've already seen a little bit with the Slurm queue command, squeue. Second, when the job is running, you monitor the job state and how the simulation is going, and we just did that exercise, basically, watching the output file, provided your program writes the right amount of output. And finally, when the job is done, you monitor the performance and resource usage. This answers questions like: did we reserve 10 processors and it's only using one? Did we reserve 50 gigabytes of memory and it's only using two? That lets us adjust our requests closer to what is actually needed. A lot of what's here is probably self-explanatory from the exercises or from what we've already talked about. Should we scroll down to monitoring during queuing? Is there anything new to say here?

Yeah, the output itself is fairly self-evident, but it's a good idea to go through the list of terms at least once and check what they mean, for example start time versus time. The start time is when the job started, in clock time, and the time is how long it has been running, so it's relative to the start time. It's worth looking through that list once. But checking the queue is the easiest way of monitoring whether the job started, or whether it's still waiting because it couldn't get resources.
Maybe there are no resources available, or you're submitting to the wrong partition, something like that. Another example: if you've submitted 1,000 jobs, what's their status? Are they all done? You will usually also see jobs in the queue that are not yet running. In these exercises we can't really run jobs that big, but if you submit a really big job, you will usually see an estimate of when it will start. And if it says it will start in a week, you might ask yourself: am I actually requesting the right resources? Is this really the kind of job that should queue for a week? So from the queue you can also determine whether you made the correct resource request.

Okay, so what's next? Monitoring while it's running. We actually just showed this: you give the job an output file and you can look at it while it's going, just like the question in the notes said. And there are different kinds of outputs here. Can you explain these? What should someone have the program write while it's running?

This is a bit wordy, so I recommend reading it later, but I can try to condense it into a few sentences. First off, there's monitoring output. This is what we previously saw: the program saying "I'm doing this thing right now". For example, the date command printed the time as it ran, so we could see which step was being taken. A typical thing would be output that says "loading data" or "running iteration 31". If your program writes outputs like this, you can easily see what it's actually doing while it's doing it.
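That kind of monitoring output can be sketched in a few lines of Python; the step names here are just illustrative:

```python
import datetime

def log_step(message):
    # A timestamped progress line, similar to what the date command showed.
    now = datetime.datetime.now().isoformat(timespec="seconds")
    print(f"{now}  {message}")

log_step("loading data")
for i in range(3):
    # ... the real work for this step would happen here ...
    log_step(f"running iteration {i}")
log_step("done")
```

Because the lines are timestamped, comparing them also tells you roughly how long each step takes.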
Even if you're not running the program interactively, you still know which step it's currently on. It's a good idea to use this kind of output in your programs, because then you can easily see the progress from the output file that Slurm stores for you. The second kind is debugging output, where you print the actual variables and the internal logic of the code, so that if there's a problem you can track it down. This is often too verbose for a human to read; it might be something you read with a special tool, or you store the actual arrays or the crash state and look at it with a debugger. You don't usually want to enable this if you're not going to read it — then what's the point of the output? Debugging output is good when your program is crashing and you don't know what the problem is. But if you're constantly printing stuff you know you're never going to read, don't store it, because every write to disk has a cost, and if you have thousands of jobs writing constantly it can create problems. Keep the verbosity at the level where you actually read the output. The third kind of output: yesterday we talked about choosing the time limit for your program.
If you notice your program is about to run out of time — it has been running for three days and you only save output at the very last second — you know you're about to lose three days of calculation. So in many cases you want to write your program so that it can store the current situation: it saves a checkpoint. Common examples are physics codes, which periodically write a checkpoint of the simulation state, and deep learning, where you store the weights of the model being trained so that you can continue where you left off. This is a very good idea, because if something goes wrong you don't lose the whole simulation, only the work since the latest checkpoint. Then of course there's the final output, the simulation output. Before you start the simulation, it's usually a good idea to know what the output is going to be, so you don't end up thinking: I needed to calculate the mean but I only calculated the variance, so I guess I have to run it again. Decide before you run the program what sort of output you want.

Yeah. And there's a really good question: can we explore the result directory while the calculation is running? Basically, yes, and it's great. Especially when you're developing, write the program so that it does write these outputs, and then you can check whether it's doing the right thing. For example, quite often nowadays people doing deep learning use tools like wandb, that is, Weights & Biases.
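Going back to checkpointing for a moment, the pattern can be sketched in Python; the filename and the saved fields are placeholders for whatever your program actually computes:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical filename

def load_checkpoint():
    # Resume from the last saved state, or start from scratch.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"iteration": 0, "state": 0.0}

def save_checkpoint(data):
    with open(CHECKPOINT, "w") as f:
        json.dump(data, f)

def run(n_iterations):
    data = load_checkpoint()
    for i in range(data["iteration"], n_iterations):
        data["state"] += i       # placeholder for the real computation
        data["iteration"] = i + 1
        save_checkpoint(data)    # if the job dies, we lose at most one iteration
    return data

run(5)
```

If the job is killed and resubmitted, it picks up from the last completed iteration instead of starting over. In real codes you would checkpoint less often than every iteration, to balance the disk cost against the work you're willing to lose.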
Right, Weights & Biases — tools that show the training state in a fancy app while the job is running. People use all kinds of tools; the monitoring output doesn't always have to be the terminal. It depends on your program and what sort of output it can create. It could create a plot for you every 10 iterations, say, and then you view the plot and see whether things look like they're going in the right direction.

Okay, should we move on towards the exercises? Slurm history — we've been using that, and this shows more about what comes out of there. I don't think we need to say much here other than... Yeah, the main thing would be to mention these fields: you can see how much memory you actually requested, MaxRSS shows the maximum resident set size, so the maximum amount of memory that was actually used, TotalCPU shows how much CPU time was used in total, and the elapsed field shows the wall time, the actual clock time. So you can get this kind of timing information from here. But there's a handier tool for this, which is seff, Slurm efficiency. If you give seff a job ID after the job has finished, it will tell you your CPU efficiency and your memory efficiency. You want the CPU efficiency to be as close to 100% as possible, and the memory efficiency well below 100% but still near it, so that the memory you requested is actually being used, and the CPUs you requested are actually being used. These are very good tools to use.

Yeah. Okay. There's a section on GPU monitoring. You can read it, but we'll actually get to this tomorrow. And I think this section might currently be Aalto-only. Right.
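Back to seff for a moment: the CPU efficiency it reports is essentially the total CPU time divided by the wall time multiplied by the number of cores reserved. A small illustration of that arithmetic, with made-up numbers:

```python
def cpu_efficiency(total_cpu_seconds, elapsed_seconds, n_cores):
    # Fraction of the reserved CPU capacity that was actually used.
    return total_cpu_seconds / (elapsed_seconds * n_cores)

# Made-up job: 4 cores reserved for one hour of wall time,
# but only two hours of CPU time used in total.
eff = cpu_efficiency(total_cpu_seconds=7200, elapsed_seconds=3600, n_cores=4)
print(f"CPU efficiency: {eff:.0%}")  # 50%: half the reserved capacity sat idle
```

An efficiency far below 100% like this suggests requesting fewer cores next time, or checking why the program isn't using the cores it was given.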
Yeah. There's also an SSH section that might be applicable, but GPU monitoring is a bit more tricky; we are still discussing technical solutions for it. It's a bit annoying, but we'll talk about it later in the GPU section. So let's go to the exercises.

Okay. Is there anything we need to mention when starting here? I think they should be mostly self-explanatory, and you can always check the solutions. Let's just go. We noticed before that there are lots of problems with doing all of this from the terminal, typing at the command line and so on. We know it's hard, but it gets better, and it makes this kind of work possible. The reason we have all these exercises is to give you more familiarity with it. So how long should we give? I guess we're coming close to break time. Since this is not so critical for the later material, should we come back five minutes after the hour? Yeah, I think that sounds good. Okay, I will write it down in the notes. And I'll post the link there. Okay, thanks and see you soon. Bye.

Okay. You should have said earlier. So what were you saying? I forgot to push the button to activate the stream. Yeah, that must have looked funny. Okay. So please comment in the notes about whether or not you were trying the exercises. We don't see many new questions here, so if people don't have that many questions, maybe we should just go on.

Yeah, I will quickly mention that this monitoring part might feel at this point a little bit like: what am I really monitoring? I'm not running anything important. But this will become more and more important the more you do. Let's say you start to work on multiple projects and try different things. Instead of running one simple thing, you start to become a manager of things.
You start to manage multiple scripts, multiple simulations, multiple different things. You submit array jobs, parallel jobs, all sorts of things. So it's a good idea to get a grasp of this: I have these many jobs that I need to manage and put into the queue, and afterwards I need to verify what they did. Whenever you run something, always do the monitoring, because once you start to run hundreds or thousands of jobs and scale your work up, if you do not use these tools you can suddenly hit a wall where you cannot manage them anymore. It's a good idea to start with small jobs and get into the habit of always monitoring your jobs.

Yeah, okay. There's a good question here: with the tail command, why can't we see the steps of the loop as they happen? Do we only see them at the end, when it's done? And the answer here about buffering is really good. Writing to storage is one of the slowest things code can do, so a lot of the system is designed to avoid writing to disk when it doesn't need to, which is why there's something called buffering. The idea is this: when a program's output goes to a terminal, it's unbuffered, because you need to see it right away. But when the output is going to a file, the system waits for a certain amount of time, or until a certain amount of data has accumulated, and then writes it all at once. So what's happened here is that the program is writing these short lines, and the system is saying: I don't actually need to write this yet, let's wait until there's a bit more. There are different strategies to get around this.
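One such strategy, sketched in Python: flush explicitly at the points that matter, rather than unbuffering everything.

```python
def report_progress(iteration):
    # Prints may sit in the buffer when stdout goes to a file;
    # flush=True pushes this line out immediately.
    print(f"finished iteration {iteration}", flush=True)

for i in range(3):
    # ... the actual work for this iteration would happen here ...
    report_progress(i)
```

This way each progress line appears in the Slurm output file as soon as the iteration finishes, while any other output stays buffered as normal.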
So there's an option here, export PYTHONUNBUFFERED=TRUE, which tells Python to always write output immediately. Also, whatever programming language you're using, if you search for how to flush your buffers in that language, you can tell it to write a given line immediately. Which exercise was it that showed this? I think it was the serial one. Yeah, the serial one. I think we've said what we need to say; you can look at the exact exercise to see why. But the main thing is: if you see this kind of behavior that surprises you, there's usually a reason for it, and usually also a solution. And the solution isn't to always keep everything unbuffered; it's to unbuffer the output that is actually important. Yeah, you'd flush the buffer when you're done with an iteration instead of unbuffering everything, because that's not good for the system.

There's another good question: the wall time was supposed to be shorter in that exercise, but it wasn't. There's a good answer here: maybe with one processor it ran on a faster CPU than the others. But also, the pi program is just so fast that you may not be able to see the difference, because the time it takes to start Python dominates. We will get to this more tomorrow when we're talking about parallelization: it might have ended up on a different machine with a different, faster CPU, or something like that. There are multiple variables that could affect the result.

Okay, so let's move on. If we go to the schedule, next up is a little half-hour period. The things we're going to talk about, like I said, we used to do first, and they would take up most of the day. So in what you're about to hear, we can't possibly tell you everything you need to know.
And we're sorry about that, but we're going to try to give you a quick overview of what's there, and then you can keep reading and figure out the solutions to your own problems later. One thing I just noticed: there's software modules here, but not applications. So I wonder, did we lose applications, or did we just always skip it? I actually think we should do the new applications page that I just rewrote. I will put a link in the notes. Yeah, I have it. Okay, and I'll also post it to the Twitch chat. Oh, can someone else post it there, because it'll take some time for me to move to that computer.