So, GPUs. Okay, let's see, we're right here. Would you like me to share the screen, or would you like to? You can share it and I can run the example. Yeah. So Simo, what's a GPU and why is it the thing that everyone talks about?

Yeah, so GPUs, graphical processing units, or sometimes GPGPUs, general purpose graphical processing units, because people love acronyms. They are extension cards that can do certain types of calculations really fast. The "graphical" in graphical processing unit doesn't mean they can only do graphics; that's just the historical reason. When people started doing 3D graphics for gaming and things like that, or visualization for CAD, computer aided design, or animation, they had to do a lot of vector calculations. If you need to move a triangle and draw a monster somewhere, you need to do lots of multiplications of vectors and matrices, and you need to do them really fast.

So would you say it's correct that the whole thing was designed to do these vector calculations very efficiently, unlike the typical processors, the Intel-architecture processors we've ended up with?

Yes, the architecture was designed for that. I'm not certain about the theoretical side, like whether GPUs are Turing complete, or whether you could run an operating system on a GPU, I don't know. But the main idea is that the hardware, how the actual processor has been designed, is designed to pump through as many of these calculations as possible. Whereas a normal processor, the Intel or AMD processor you most likely have in your computer, or the Apple one, is it M1? Or the one you have in your phone. Those are general purpose processing units. They can do all kinds of stuff: calculate numbers, concatenate strings, do all kinds of small things. And of course they can also crunch lots of additions and multiplications, like you have in, say, simulation codes. But a graphical processing unit is for when you have a huge bunch of numbers in a matrix and you want to multiply it with a huge bunch of numbers in another matrix, and you want to do that fast. Many operations can be reduced to these kinds of matrix calculations, and if you can reduce your problem to matrix calculations, you can do it fast on the GPU.

And this whole field exploded when people realized that you can train AI models, deep learning especially, on GPUs. Everybody got really interested in AI because they realized it's basically matrix calculations all the way, or at least it can be written as matrix calculations.

So I guess the use of GPUs has grown in step with people being able to write more algorithms in a way that could use them. GPUs started with graphics processing, and at some point people realized, oh, I can do molecular dynamics on GPUs, and then suddenly that becomes the thing. Yeah.
And nowadays there's a huge amount of software that can utilize them. But the important thing to realize is that because it's all so tied to the hardware and how the hardware is organized, there are usually libraries that you use to make it easier to code for them, but it's still different from a normal program you run on a computer. GPU programs need to be written for the specific GPUs, sometimes even for the specific card or type of GPU you're using, because everything is ultimately written in terms of low-level machine code that can run very fast. So similarly to what we mentioned with the MPI standard, programs that use GPUs are written for GPUs, sometimes even for certain kinds of GPUs. It's usually quite specialized, and you need to make certain that your program supports the GPUs: you usually need to install a version of the program that has GPU support in order to use the GPU.

So how often do people these days write their own GPU code, say at the C level, and how often are people using codes that have already been made for GPUs by someone else?

That's an excellent point. Nowadays it's getting to the point where there are more and more existing GPU codes, so you don't have to do it yourself. In Python, all of the deep learning stuff like TensorFlow and PyTorch utilize GPU libraries, so you don't have to look at the GPU code. If you are running MATLAB, there's gpuArray, so you don't have to think about the GPU code. If you are using some physics program, you usually just need to compile it with GPU support and then it can use the GPU, like LAMMPS or CP2K or something, and many of these we provide already. Most programs just utilize the GPU if they've been made aware of it when they were installed, and the support is increasing. So most likely you won't ever compile GPU code yourself, but if you do, more power to you. Basically, if your code can utilize a GPU, you can try to use GPUs in the cluster. And now we can get closer to how you actually use the GPUs in the cluster.

I have one more comment there. Even though there are many things that are already written, do people end up needing to script it together? If you're using, say, TensorFlow, it's not like TensorFlow does exactly what you need; you need to write your Python program that uses TensorFlow, and you need to know a little bit about the GPU and how it works, but the core that's actually using the GPU is already written and optimized by someone else. Is this the way it is?

Yeah, that's a good way to put it. Most of the time with a GPU you don't just get the card and walk out of the shop; you get a bunch of software with the GPU. When we install a GPU, it comes with a bunch of software, and that software already has a lot of stuff in it, like libraries for doing these calculations, and then the higher-level libraries like TensorFlow or MATLAB know how to formulate their problems in terms of these driver-level libraries. In Triton, most of our GPUs are NVIDIA GPUs.
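Since these higher-level frameworks pick up the GPU on their own when their installation has GPU support, a quick sanity check is to ask the framework itself whether it can see a device. A minimal sketch, assuming a Python environment like the anaconda module mentioned later in this session, and assuming you run it on a node that actually has a GPU:

    module load anaconda
    # TensorFlow: prints the list of visible GPU devices (an empty list means none was found)
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
    # PyTorch: prints True if a usable CUDA device is available
    python -c "import torch; print(torch.cuda.is_available())"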
So we have one AMD GPU for testing, but most of our GPUs are NVIDIA GPUs, where we have the CUDA framework; you might have heard this term. It's a bundle of libraries that can utilize the GPUs so that you don't have to write everything yourself, but your code needs to use CUDA to access the GPU. On the CSC side, if you want to play with LUMI, you need to deal with AMD's own library, ROCm, which provides the same kind of functionality as CUDA. So you usually need to look for certain words: if you find CUDA somewhere in the documentation, it usually means the code tries to use NVIDIA GPUs; if it mentions ROCm, it usually means it tries to use AMD GPUs.

So how many common programs can use AMD GPUs these days?

I would guess the support is increasing all the time, but NVIDIA has a bit of a lead in this competition currently, and nobody knows what the future will bring.

Yeah. So let's see, have we covered most of the introduction, or did you have more?

Yeah, I'll mention that Triton, as we said yesterday, is a heterogeneous cluster, so we have multiple different kinds of nodes and machines, and we have multiple different GPU architectures as well. We have architectures running all the way from Kepler to the modern Ampere: Kepler, Pascal, Volta, Ampere, so four different architectures, four different types of cards. And if you compare these to a gaming card you'd buy from some online retailer, Amazon for example, these are much more powerful, something like four times more powerful than a typical gaming card you would have at home.

How do you define power, then? Is it speed, or memory, or...?

Yeah, that's actually a really good point. The power is not necessarily speed as in gigahertz; it's usually not clock frequency. They might even run at a slower clock frequency in the cluster. The power comes from the number of parallel computational units. Each of these cards is like a cluster in and of itself: it contains a huge number of small computational units that can do these matrix and vector calculations. So the power is that they can run bigger problems, and they have specialized compute units inside them, optimizations meant for scientific calculation, so they are much faster at, say, single or double precision floating point numbers and other specialized cases for scientific computing. They also usually have more memory, the VRAM or video random access memory, which is the memory of the card itself. So whenever you run some GPU code, the data usually gets transferred to the GPU, into the memory of the GPU, and the calculation happens inside the card itself. Whenever results need to be brought back, they are brought to the CPU and the random access memory of the machine, where they can be accessed.

Okay, and is the memory bandwidth a limiting factor?
Yes, often it is. There are often cases where people send, let's say, a job to Triton and they don't see much improvement when they run it on a GPU in Triton compared to the GPU in their own machine. And the problem might be that the data is not getting to the GPU: the GPU is too fast for the amount of data that goes there. These GPU cards are very powerful, and that means that if you give them a simple addition to calculate, they are done with it immediately and then they just idle, and all of the time goes into the communication between the processing unit, the CPU, and the GPU. So you often need more than one CPU feeding the GPU to give it enough to do.

You can think of it like an office: you have different people working there, and then you have one person who's a super expert and can answer questions really quickly. But if nobody asks the super expert anything, they don't do anything. Everybody in the office can ask the super efficient person their questions and get an answer quickly, but if nobody asks, nothing happens. That's basically the situation with GPUs: GPUs need to be fully utilized in order to get the most benefit out of them.

This also brings to mind the different generations I mentioned. Running on the newest one isn't always the best option if you don't need it, because if you're only utilizing, say, 50% of the earliest generation, you're going to utilize maybe 10% of the newest generation, and your code will run just as slowly. So it's often better to run more instances on the slower generation, with, say, an array job, than to try to use the newest generation. Also, adding more GPUs doesn't usually help unless your code knows what to do with them. You can run multi-GPU jobs on Triton, but then your code needs to be able to distribute the work. Adding more experts to the room doesn't help if nobody asks them anything; it doesn't improve the overall efficiency of the system. Yeah.

So, okay, what's the moral of the story here? I guess GPUs are easier to use because more software supports them, but harder to use because there's more you need to know in order to use them efficiently, and more things that can go wrong.

Yeah, it's basically this kind of situation: if you have a supercar and you want to drive really fast, you need to reserve a circuit where you can drive the supercar, and you must make certain there are no slow drivers, no bottlenecks, no traffic lights on that circuit, because otherwise you're just going slowly. You're not going to go 200 miles an hour with a supercar through the center of the city, because the speed limit is 40.
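Coming back to the point about feeding the GPU: on the request side this often just means asking for a few CPU cores alongside the one GPU. A minimal sketch, where the time, memory, core count, and the script name train.py are all placeholders, not a recipe:

    #!/bin/bash
    #SBATCH --time=01:00:00
    #SBATCH --mem=8G
    #SBATCH --gres=gpu:1           # one GPU
    #SBATCH --cpus-per-task=4      # a few CPU cores to keep the GPU fed with data

    # train.py stands in for your own code; inside it you would point the data
    # loaders at these cores (for example num_workers=4 in a PyTorch DataLoader).
    python train.py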
So it's important to make certain that when you're utilizing GPUs, you keep the GPU fully occupied. There was already a Triton-specific question about whether GPUs are available in the Jupyter notebooks, and the answer is no, for exactly this reason: it's so hard to keep the GPU occupied, they are the expensive cars, and they're in high demand, a lot of people request them, so we want to keep them fully utilized. Interactive usage basically means that whenever you're not doing something yourself, the GPU just idles, and that means it's burning money all the time. Okay.

And usually I would say: don't be alarmed by all of this. It might seem like, okay, I'm not good enough, I'm not worthy of using the GPUs. That's not the point. The point is to make certain that you are actually benefiting, because if you have a supercar but you're driving it in the city, it's not benefiting you and you're not having any fun. So we usually recommend that you test a trial version of your program first, some sort of mock-up: a smaller program, a shorter runtime, something like that. If your department or research group has provided you with a workstation with a GPU, a consumer GPU, you first test it there and see how it performs. At Aalto we also have shared workstations in the classrooms where you can try things out. Try some toy models, check whether in principle it should work efficiently in the cluster, and if it does, then you run it in the cluster and test it there, and if it runs efficiently, then you can just fire away.

And maybe we could also say: if you're ever not sure, just try it, run a few tests, then come by and consult our research software engineer service. We can look at your stuff with you for an hour or so, or check your stats and say good to go, you've done it well; and if not, we can help you figure out what the bottlenecks are and improve it. So don't feel that you're alone in figuring all of this out.

Okay, should we go on to some examples of using it now? Yeah. Okay, so here we are at GPU jobs. I guess by now, knowing everything we know about Slurm, we know it must be some simple option to add, and that must be what I see here, the --gres option.

Indeed. In the eyes of the Slurm queue, GPUs are generic resources, or GRES: resources that are something special compared to the normal CPUs and memory that the compute nodes have. They are marked internally as available resources that you can request, and when you request them, Slurm will limit you to those nodes that can fulfill the request. On other systems you might need to specify a partition, for example a GPU-specific partition. In Triton we don't have that, but be mindful of it elsewhere. Usually you only need to specify that you want this generic resource to be present, and in the case of GPUs you basically say gpu:1 to get one, gpu:2 for two, and so on. But again, remember that adding more GPUs doesn't necessarily make it faster. Yeah.
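In a batch script that request is a one-line directive. A rough sketch of the forms just described; the partition line is only an example of what some other sites require, Triton doesn't need it:

    #SBATCH --gres=gpu:1        # any one GPU
    #SBATCH --gres=gpu:2        # two GPUs, only useful if your code can actually use both
    # Some other clusters also require a GPU partition, for example:
    # #SBATCH --partition=gpu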
Would you like to give a demonstration, or is there even anything to really demonstrate here? Well, I guess we've shown the basic idea. Do you want to talk about the machine learning frameworks, or what do you think is the best order? I guess what we've been saying so far is what people mostly start with, and from there it depends on your particular framework. There's how to request it and how to use it in your code, and we mainly talk about how to request it.

Yeah, because the request is usually the same: you just add that you want the generic resource, the GPU. Then if it's MATLAB code you use gpuArray, if it's Python code you call TensorFlow or something, if you run some LAMMPS simulation or something you just run it and it will use the GPU. The request side is the same all the way through, so it's nothing magical. I can, for example, run the TensorFlow example. Are you sharing your screen? Just a second. I think it's better if you do it, because my cat is not going to let me keep typing here. Yeah, it's getting late on a Friday and the cat wants to go and enjoy the weekend. And wants its food in 33 minutes. Okay, here you go.

So here we have an example from TensorFlow's tutorials. Okay, it looks like the tutorial page has changed, but the example code itself still works because we've made only minimal changes to it. This is a typical MNIST classification problem that uses convolutional neural network layers to do a simple deep learning problem. It's something you could run on a CPU as well, so it's nothing complicated. I first download the file, and then let's look at the batch script. Maybe I'll just write it myself. So I write this tensorflow_example.sh and add the usual directives, the #SBATCH lines. Hopefully by now you're accustomed to this syntax; when you write these scripts you write them almost automatically, nothing special here. Let's add a bit more memory; I can't remember how much memory it actually needs. So now we have the normal time and memory requirements. And this is such a simple model that it doesn't require multiple CPUs for data loading. Usually with deep learning you need multiple CPUs to keep the GPU fed with data, but in this case the data is so small that it's not necessary. So we can go with --gres=gpu:1: give us any one GPU, it doesn't matter which. And then we give python and the name of the Python script. But before that, actually, I forgot: I have to load Anaconda. Different sites probably have different ways of doing this, but in Triton we have these anaconda modules that contain basic installations of PyTorch and TensorFlow and all kinds of Python packages, so it already contains the TensorFlow that you need.
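As a rough sketch, the batch script for this example ends up looking something like the following. The resource numbers and the Python file name are placeholders (the downloaded tutorial file isn't named here), and the anaconda module name is the Triton convention:

    #!/bin/bash
    #SBATCH --time=00:15:00
    #SBATCH --mem=4G
    #SBATCH --gres=gpu:1

    # Load a Python environment that has a GPU-enabled TensorFlow;
    # "anaconda" is the module name on Triton, other sites differ.
    module load anaconda

    # mnist_convnet.py is a placeholder name for the downloaded tutorial script.
    python mnist_convnet.py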
Yeah, there was a question on HackMD: can you install newer versions of TensorFlow? And yes, you can install your own Anaconda environments or ask us to make an update. If you look at our Python documentation, there are instructions on how to create your own environments, and in those environments, if you want to use GPUs, you need to make certain that the CUDA toolkit is installed as well, so that they know how to utilize the GPUs.

So let's submit this example with sbatch. It's pending now. Now it's running on one of the nodes with V100 cards, and I'll use tail to look at the output. You see a litany of text at the start, where it chooses which GPU card to use and so on, and then it does the actual training. Here it does five epochs of training and ends up around 98% classification accuracy. So this is a simple model, but you can put whatever you want in your own code; whatever GPU code you have, you only need to add this GPU request. Then, if your code requires a specific architecture, you might want to specify the constraint parameter. This might be Aalto-specific, different sites probably have different conventions, but you can use the constraint parameter to limit yourself to certain GPUs, certain architectures.

So, okay, let's look at the monitoring. There's one important point here. You may remember the srun we would put in front of commands to profile them individually. In theory that should be done here too, but the latest version of Slurm has some sort of, well, we don't know if it's a bug or something else, but if you put the srun there, the command won't get allocated the GPU. This is Aalto-specific, and we are in the process of fixing it; at that point we'll probably make some changes to this syntax as well and switch to the newer syntax, but you don't have to worry about it, we'll notify you if something in the syntax changes. But currently, before we have fixed it, don't put srun in front of the GPU code in Triton, because it might hang when it cannot get access to the GPU. It's this annoying bug that we encountered last autumn when we updated our Slurm version. So don't be alarmed by it: if you get errors, or your code doesn't produce any output when you're running it through the queue with GPUs, check whether there's an srun in front of it. If there is, take the srun out and try again. That's unfortunate, but it was a really good find by Richard.

Okay, so, shall we check the monitoring? Yeah, let's check. This might also be Aalto-specific; it might be different on other sites. Yeah, I think it's probably Aalto-specific. How do other sites monitor their GPU performance? I'm not completely certain. One way to monitor GPU performance is that if you have something running on a node, you can temporarily SSH to the node and use nvidia-smi to check the GPU utilization. There are other tools as well, and we are in the process of creating better documentation on how to monitor, because this is a known pain point: GPU utilization is sometimes hard to get out of the system. But at Aalto you can use this specific bit of magic that I always just look up in the documentation; you just need to know that it's there. If you type "GPU monitoring" into the search bar, you will get here, and if you put the job ID in, you can get GPU monitoring statistics, provided the program has run long enough.
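Two bits from this part sketched out, with everything hedged: the feature name "volta" and the node name "gpu23" are only illustrative placeholders, so check your own cluster's documentation for the names it actually uses.

    # Pinning a job to a GPU architecture with a constraint:
    #SBATCH --gres=gpu:1
    #SBATCH --constraint=volta     # only nodes whose GPUs advertise this feature

    # Checking utilization by hand while the job runs:
    squeue -u $USER                # find the node your running job was placed on
    ssh gpu23                      # log in to that node while your job runs there
    nvidia-smi                     # shows GPU memory use and utilization right now
    watch -n 5 nvidia-smi          # refresh the same view every 5 seconds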
Yeah, and also occasionally there are bugs with these monitoring tools, so they might randomly not produce the GPU stats. So if you're interested in monitoring your GPU usage, you might want to consider joining us in the garage and discussing with us how to monitor it and how to do performance testing, or make an issue about it. One option, when the job is running, is to go to the node and run nvidia-smi, and it will tell you the GPU utilization if there's something running on the GPU. But if you try this GPU monitoring and you don't get any results out, then definitely open an issue or come to our garage, because this is our problem and something that we need to fix for everyone.

Yeah, compared to seff, which is a really great tool for monitoring CPU performance and memory performance, GPU performance is a lot more nebulous, but it's very important. Similarly to what we've talked about with recognizing scaling, it's about recognizing whether something is supposed to run faster. If you run something on your own laptop and then you run it on Triton, in the cluster, and it's suddenly slower, you know that you expected it to be faster, so there might be some configuration option you overlooked: maybe you didn't book enough CPUs for the job, which it got on the laptop without any specification. Similarly, if you're doing GPU training on a workstation and then you run it on the cluster to make it faster, and it's somehow slower in the cluster, you know there was this assumption that it would be faster, but it's not. It's not that the hardware is broken, it's not that the machines are somehow flawed; there's a barrier of implementation and it can get tricky. In those cases it's best to just come and ask us and discuss with us so that we can look at your specific case, because there's no one sure way of doing it. But be mindful of that feeling of "why isn't this faster when I thought it would be?" If you recognize that feeling and that situation, you know there has to be a better way, and then you realize, okay, these people are meant to be maintaining the system, let's ask them why it isn't faster, and then we can try to help you.

Okay. So we have 20 minutes left. What is left? Monitoring, input/output. What else is there to discuss? Yeah, there was an interesting question, how much extra... actually, no, let's go to questions at the end. Are we basically at the end of this? We've talked about the input/output. Do you want to show the list of the different available GPUs? And then there are the exercises.

Yeah. So, as I mentioned, we have these K80s, then these P100s, V100s, and A100s. And some of these cards have NVLink, which is a fast interconnect so that you can run faster multi-GPU jobs. But then again, your framework needs to support it to be able to utilize the NVLink, so it's more advanced, and if you feel like you want to utilize it, do come and discuss with us if you want to check how it performs.
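If you want to check the available GPU types yourself, one generic, not Triton-specific, way is to ask Slurm which generic resources each node advertises. A small sketch:

    # Lists each node with its GRES entry (e.g. gpu:v100:4); sort -u collapses duplicates
    sinfo --Node --format="%.20N %.15G" | sort -u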
There's also one machine that has an AMD GPU. So if you want to try out GPU codes, if you want to test something that you might want to run later on CSC machines, maybe you want to make your code compatible with AMD. If you're writing GPU code and you want to make it compatible with AMD GPUs, do ask us about that and we can help you. Yeah. Okay.

Are any of these exercises... let's see. Do you want to do the exercises as demos? Yeah. And also, a reminder: please look at the bottom of the HackMD and answer our feedback survey about what was good and bad, if you're leaving early. You can also describe what you would want to learn, or what kinds of topics would interest you in the future, so that we can focus on that and maybe make a full course out of some subtopic of this course, or some topic that was mentioned in passing but that you would want to learn more about.

Okay. So let's run this nvidia-smi utility. This is exercise one. It's a utility that shows you the GPU cards available. So if I run it without the GPU request, let's try that first: I just srun nvidia-smi. I didn't specify the GPU resource that I want to use, so I won't get a GPU resource; I will get some compute node without a GPU, and because there's no GPU, there are no NVIDIA utilities, so there's no nvidia-smi, and it gives me an error. So let's run it now with the GPU specified, with --gres=gpu:1. Now, when I run it through the queue on some compute node, it will give me, let's see if I can make this better, okay, yeah, that looks nice. The output looks like this: I got onto one of the GPU nodes with a V100 card with 32 gigabytes of memory, and there was nothing running there because I wasn't running anything, but basically I could have run something there. Yeah, okay.

I would also quickly mention that if you're running GPU jobs, say creating machine learning models or deep learning and things like that, it's a good idea to first make certain that it runs for a small amount of time before you put it in for five days of running. In general this applies to other jobs as well: you want to check that your code runs for a short time before you run it for the full time, because you don't want a typo at the end of your script, just before you save the output, after five days of computation. So it's a good idea, especially with GPU resources because they are more heavily queued for, to go through this kind of incremental step where you first put a test job into the queue and see how it performs. Based on the test job you set the memory limits and the time limit, and then you run it for the full time. So: a reduced number of parameters, maybe a reduced runtime, maybe run only one iteration instead of a thousand, check how it performs, and then extrapolate from that information to the full job.

Yeah, but I think the rest of the exercises are pretty specific to whether you're compiling code or running certain frameworks.
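For reference, the two variants of that first exercise look roughly like this; a sketch, and on a busy cluster you may also want a short --time limit on the request:

    # No GPU requested: lands on a CPU-only node, so nvidia-smi is not found
    srun nvidia-smi

    # GPU requested as a generic resource: lands on a GPU node and shows the card
    srun --gres=gpu:1 nvidia-smi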
Yeah, should we take some final questions then? Broad questions about any topic? Yeah, there's a good question there: how much faster should a job be on the GPU compared to the CPU?

So it typically varies between problems, but nowadays it's usually something around 20 to 40 times, so 20 to 40 CPUs' worth of computation time. In some deep learning cases it might even be 100 or 200 times; you would need a huge number of CPUs to do the machine learning training compared to GPUs, especially because of the communication that needs to happen. So usually it's on the order of tens of CPUs. If your code runs on one CPU in a full day, 24 hours, I would guess that, to be worth running on a GPU, it has to run in two or three hours at most. So you need something like a ten-fold speedup, at least, for the GPU to be worthwhile, and it depends on the GPU. And the main question is whether you even utilize the full GPU then. Yeah. You can get something like 10 to 100 times faster running on a GPU, but it depends on the algorithm. Some things are not good for GPUs, and instead of running a GPU job it's better to run an array job of single-processor jobs, which might get you much more done. Let's see.
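If that array-job alternative is the better fit, a minimal sketch looks roughly like this; the resource numbers and the script name analyze.py are placeholders, and each array task runs independently on a single CPU:

    #!/bin/bash
    #SBATCH --time=02:00:00
    #SBATCH --mem=2G
    #SBATCH --array=0-9            # ten independent single-CPU tasks

    # analyze.py is a placeholder; use the array index to pick each task's share of the work.
    python analyze.py --task-id $SLURM_ARRAY_TASK_ID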