Okay, we are back. We're in the final stretch now, and we have three things left to cover: using GPUs, parallelization, and then the demos at the end. Both GPUs and parallelization are small incremental extensions of what we've already done. They can also be combined with the array construct: since an array job basically repeats the same thing over and over, all of the requirements you specify here, the GPU requirements and the parallel options, can be used with array jobs as well. But fundamentally everything is an extension of the serial job. You just add a bit more information for Slurm, and it figures out the rest.

So let's take a look, once I get my page ready. GPUs. The GPU stuff is a bit complicated, because code doesn't automatically work with GPUs; it's not a magic word you can add to make things run faster. There's a lot of hype around GPUs, a lot of promise, and a lot of real use, but they're not something you can apply to everything. The code needs to be written specifically to use GPUs. If the code doesn't explicitly support GPUs, it won't benefit from them.

People should also understand that the GPUs in Triton are NVIDIA GPUs, so they use the CUDA framework in the background. CUDA is written by NVIDIA, and it's essentially a set of libraries with operations like matrix multiplication that computational codes can use; CUDA handles the communication between the GPU and what the code wants to do. In LUMI at CSC there will be AMD GPUs that use the ROCm platform instead. That's a similar kind of thing: libraries written by the manufacturer. In our case that's NVIDIA; in LUMI it will be AMD; Puhti also uses NVIDIA GPUs. Your code needs to be written for these specific libraries or it won't work. We will probably purchase some AMD GPUs at some point as well, but currently there's no estimated time of arrival.

If you know that your code can utilize GPUs and you want to use them in Triton, what you need to do is add the one line that Richard has here: the --gres, or generic resource, option, where you tell Slurm that you want one GPU. You can ask for more than one GPU, but it's usually not worth the effort, because these GPUs are very efficient and very fast, which means you usually need a bunch of CPUs just to feed data to a single GPU. Some physics codes can work with multiple GPUs without much hassle, but most of the time you can't just run on multiple GPUs and expect the speed to scale; you need specific tricks to make it faster, because the GPUs are so fast that the CPUs and the storage can't keep up. And then that's basically it: you say you want a GPU, and it's up to you to figure out how your code works with it. Hopefully that makes it clear.
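To make that concrete, here is a minimal sketch of what such a batch script could look like. The time limit, memory, CPU count, module name, and script name are placeholders for illustration; the essential GPU-specific part is the --gres line.

    #!/bin/bash
    #SBATCH --time=00:30:00        # short time limit for a test run
    #SBATCH --mem=8G               # CPU-side memory
    #SBATCH --cpus-per-task=4      # a few CPUs to feed data to the GPU
    #SBATCH --gres=gpu:1           # request one GPU (generic resource)

    # load whatever environment your GPU code needs (module name is an example)
    module load anaconda

    # run the GPU-enabled program (placeholder script name)
    python train_model.py

You would submit this with sbatch as usual; everything else works like a serial job.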
One thing you might need to do, if you are writing your code yourself or if your code depends on a certain generation of GPUs, is to specify a constraint. Constraints can also be used with CPUs: we have different CPU types, as we showed with the partitions and features commands, where you saw the AVX features. In the same way you can set a constraint on the GPU generation. Kepler is the oldest generation and we have a few of those left; after Kepler came Pascal, which was the previous one, then Volta, and the newest is Ampere, but we don't yet have Ampere GPUs. Most of our GPUs are Volta, but we also have a lot of Pascal GPUs and some Kepler GPUs. If you know that your code requires a certain kind of GPU, or one of the recent generations, you should specify that with the --constraint option. It's the same constraint syntax as the other Slurm directives. Okay, so that's how you pick a certain one.

Next up on the page is frameworks. Most people who work with GPUs use TensorFlow, Keras, PyTorch, or MATLAB GPU arrays; those are probably the most popular, since deep learning is very popular. But there are also people who do GPU offloading with physics codes and such, and for those we already have GPU-enabled versions installed, CP2K or GPAW for example. Many use cases are Python-based, and for those we already have the frameworks installed in the Anaconda modules. Installing them yourself can be a bit of a hassle, so I wouldn't recommend it, because you need to get the versions of the CUDA packages right so that everything works together. We already have these installed, and if you need help with your own environment, ask us. But if you find what you need already installed on Triton, in the applications, you should probably use that.

If you need to compile your own CUDA code... first off, who would compile their own CUDA code? Mainly people who have written CUDA extensions to existing frameworks like TensorFlow, their own kernels or something like that, or people who write their own C code for physics programs. If you do need to compile your own code, you basically load the correct CUDA module and compile with the CUDA compiler, nvcc. You can do it, but most of the time people don't need to. If you are using an existing framework from Anaconda, you don't need to load the CUDA modules, because those environments already come with CUDA; see the actual application page for more information.

One error worth mentioning: "libcuda.so.1: cannot open shared object file". That means you're not running on a GPU-enabled node. If you try to run GPU code on, say, the login node, it won't work, because there are no GPUs on the login node; you should reserve a GPU resource for yourself to run it. This often happens when people try to test on the login node and then wonder why it doesn't work.
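As a rough sketch of both of those points, the constraint and the compilation step could look something like this. The feature names, module version, and file names are assumptions; check the reference page or the features command for the names actually in use on Triton.

    # in your batch script: request one GPU, but only of certain generations
    #SBATCH --gres=gpu:1
    #SBATCH --constraint='volta|pascal'   # feature names are examples

    # compiling your own CUDA code (only needed if you write CUDA yourself):
    module load cuda                      # pick the version your code needs
    nvcc -O2 -o my_kernel my_kernel.cu    # my_kernel.cu is a placeholder file name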
Whenever you're running on Triton, you can also use the --gres requirement with the srun command, not just sbatch. So if you want to try something out on a GPU node, you can use srun as well. Below here there are examples we don't necessarily need to go through, but if you know you're going to be working with one of these environments, there's an example from the TensorFlow tutorial on how to run Keras code on GPUs, both with srun and with sbatch, and there's also a PyTorch example and a CNTK example if you're doing deep learning. These are basically download-and-run; the page shows the different examples and how to run them with srun and so on.

The one thing that is important, if you go a bit further down, is how to monitor GPU usage. As I mentioned, the GPUs in Triton are very, very fast, so what usually happens is that code which runs fine on your workstation doesn't necessarily run at full utilization on Triton. This can happen for multiple reasons; the most common is that the GPU isn't fed enough data, so it isn't fully utilized. To monitor this you can use the 'sacct -j <jobid>' command: we have an automatic script that writes the GPU utilization of a GPU job into the comment field. So after you have run a GPU job, you should check what the GPU utilization was, because many times it is low, say 10 to 30 percent. In that case there may be efficiency to be gained not by changing the GPU code itself but by modifying the things around it: using the local disks for caching the data, for example, or having enough data loaders so the GPU gets enough work.

I remember we once calculated the overall GPU efficiency across all of Triton, and it was about 50%, and basically one person was using half the GPUs at almost 100% efficiency. It's not too hard to do the mental math and realize that most people are barely using them. This is a genuinely hard problem, so when you're using GPUs and you see low utilization, it's worth asking us for help; we have people who can help improve your code directly and get it right. Even if you don't need to learn how to program GPUs themselves, in the sense of writing C or CUDA code, you usually still need to learn how to program around the GPUs. Even if your framework supports GPUs, you need to support that framework: use the local disks, use multiprocessing to read the data in, and so on, because the GPUs are so hungry for data and so fast that it's really hard to keep them occupied.

We usually recommend that if you're doing development work, you do it on your workstation, in an interactive session, sometimes with srun, or on the VDI machines. Once you know that your model works, you run it with sbatch and check what the performance is. And then, once you see what the performance is, you scale it up.
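As a rough sketch of that workflow; the script name and job ID are placeholders, and note that storing GPU utilization in the comment field is a Triton-specific addition, not standard Slurm behaviour.

    # quick interactive test on a GPU node
    srun --gres=gpu:1 --time=00:15:00 --mem=4G python test_model.py

    # after a batch job has finished, check the utilization written to the comment field
    sacct -j 12345678 -o JobID,Elapsed,State,Comment
    # a low percentage in Comment usually means the GPU is waiting for data
    # (slow I/O, too few data loaders), not that the GPU itself is slow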
With GPUs the development cycle is a bit longer, because you usually need to do this kind of fine tuning to get things running properly. So if you see papers where Google or DeepMind have built some very fancy deep learning model, they have very sophisticated procedures for loading the data in, processing it, and running on multiple GPUs, and that's how they actually get the performance. If you just run things naively, it would take you years to train the same models. It's usually better to design the code properly, and to ask us before you run into a situation where you don't get the performance you need.

Okay. So here we see the different GPUs we have available. Is this up to date, even? Yeah, I think there's one DGX node. We also have these DGX nodes that run an Ubuntu system, so they're a bit different, but those were bought by certain research groups. They are very heavy systems, meant for running big parallel GPU jobs, and they're also available to other users through Singularity containers, but the people who bought them have priority on them. So if you're going to use them, note that your jobs might be thrown out because the other people have priority, since they paid for them. But they are available if you want to do multi-GPU processing and such.

Okay. So which of these exercises or examples should we go over? I think we could run exercise one quickly, for example; people can try it out, it's a very simple exercise. Do note that when you're submitting GPU jobs you should specify a short time limit, because the GPUs are also the most heavily requested resource in our system. So when you submit a job that requires a GPU, it's very important to get the resource requirements right, because otherwise you will wait in the queue quite a bit longer.

I think we should take maybe until 25 past. Okay, and is that enough time for parallel and then the demos at the end? I think it is; that shouldn't be a problem. Yeah, I'm a bit hesitant to run this command here because usernames or something might show up on the screen, which I've been trying to avoid. So let's have the short exercise time, see what happens, and resume at 25 past. You can run the first exercise, and if you have time, the second exercise where you compile this code, or try one of those GPU examples, the samples above. See you soon.
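If you want to check for yourself what GPU resources exist and do a quick sanity test, something like the following could work; the exact output and feature names vary per cluster, so treat this as a sketch.

    # list partitions, their generic resources (GPUs), and node counts
    sinfo -o "%P %G %D"

    # a quick GPU test with a short time limit, so it gets scheduled faster
    srun --gres=gpu:1 --time=00:10:00 nvidia-smi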