So, GPUs: the thing everyone cares about. Incidentally it comes last, not in order to keep people staying here the whole time, but because it really depends on everything else we've been talking about. Yeah, there are so many links to other places.

So first of all, about GPUs: if a code doesn't use GPUs, if it doesn't know about GPUs, if it's just normal code, it cannot use GPUs. It needs to be built for GPUs. If you scroll to the visualization of the GPU program over here: GPU programs typically have a GPU part that has been compiled against some GPU type, for example your workstation's type of GPU or, in our case, the cluster GPUs that we have. This GPU part of the program works in conjunction with the CPU part of the program, and the CPU part tells the GPU part what it should run. It gives the data to the GPU part, and then the GPU part runs a so-called kernel, which is usually a small bit of program that runs in parallel across the GPU cores.

So say you have vector addition or something: it gets two vectors, two arrays of numbers, and sums them together. In that case one thread could add the first pair of numbers, a second thread the second pair, and so forth, and you would get all of the additions done in parallel. Nowadays GPUs are very commonly used in physics codes, in deep learning, all kinds of things, because when you have so many of these individual cores they can do a lot of matrix multiplications and additions and that sort of thing, which lets you do complex things like deep networks a lot faster.

The important part is that there needs to be this GPU part, something compiled for the GPU, so that the GPU can run it. But most users use code that has already been compiled for GPUs, for example PyTorch or TensorFlow installations, or some GPU code that we have installed. Those already have the kernels, the GPU programs, inside them: the many small programs that calculate, I don't know, a matrix multiplication or something are already in there. The program is mainly a CPU program, but sometimes something goes to the GPU, gets calculated there, and comes back. Some programs also do this compilation automatically, they compile stuff for the GPU in the background. So you don't necessarily have to write GPU code yourself; you can use GPUs if the library or framework you're using supports it.

There's a question: do you have an example of a code created for a GPU? Yes, we have an example that we'll compile when we run it, so should we go and do that? Yeah. And I'll quickly mention that Slurm thinks of these GPUs as generic resources, so we reserve these resources.
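As a quick aside on how that looks from Slurm's side, here is a hedged sketch; the format string is just one way to print it, and the GPU names in the output depend entirely on the cluster:

    # List each partition and the generic resources (GRES) its nodes provide;
    # GPU partitions show entries along the lines of "gpu:v100:4".
    sinfo -o "%P %G"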
So we reserve a GPU, and when we reserve it, it's basically booked for us and then we can do something with it. These GPUs are so popular and so expensive that they are not usually in the interactive partitions, and they are very powerful, so you usually need to put in a lot of work to get them fully utilized. But let's look at the example GPU code and how we reserve one. I'm scrolling down to "Running a typical GPU program"; I guess we'll see this when we do the example.

Okay, here we go: running an example program that uses a GPU. We find the example repo; let's get straight down to it. So here we have an example GPU program. This is the same algorithm that we previously used for the Python pi example, but because it runs on the GPU it's going to be much faster. Here we are going to compile it: because this is C, or C++, code, it will need to be compiled. As I mentioned, most people run stuff that is already compiled, so you don't need to worry about this, but for this case let's just compile it. I will clean out all the modules I had and load these ones.

If you're using GPUs you will hear the word CUDA all the time. CUDA is a framework that NVIDIA has created that contains lots of stuff so that users don't have to learn about the GPU hardware itself. It has a lot of libraries, and there are lots of other things built on top of CUDA: for example TensorFlow and PyTorch use CUDA. They don't use the GPUs directly themselves, they go through this CUDA framework.

So if you just copy this monstrosity of a command: this basically tells it to compile the code for all the different GPU architectures that are available, into the output binary it names. Okay, it takes a little bit of time, but it's done.

Okay, what if I try running it here, without Slurm? Well, if you try, you can see what happens: it doesn't like it very much. Basically it doesn't work, because there's no GPU device here, so its GPU parts are not going to be executed.

Okay, so we request one GPU and run. Should we do it? Yeah, let's do it. So it finished; now we actually get an answer: this many throws, and pi is 3.14. Yeah, hit it with a bigger one, that's too low. So how much is this, one billion? Add a zero there as well; okay, one extra zero, so it's ten billion now. So even at this size, once you get it running it's very fast.

You notice here what we are asking for: this generic resource, gpu:1. In some clusters you might need to specify the GPU type as well. You can reserve specific GPU types; at least in our cluster we have different generations of GPUs, and usually the bigger or newer ones are more heavily contested. So if your code doesn't use all of the newer features of the GPUs, I would recommend using the less contested ones. Maybe you should try the debug queue, the short GPU queue; in Triton we have this short GPU partition with quite a bit older GPUs.
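To recap the whole sequence in one place, here is a minimal sketch; the module name, source file name, program argument and time limit are assumptions rather than the exact commands from the course materials:

    module purge                      # start from a clean module environment
    module load cuda                  # provides nvcc and the CUDA libraries; the module name can differ per cluster
    nvcc -o pi-gpu pi-gpu.cu          # compile the example (the course command adds flags for several GPU architectures)
    srun --time=00:10:00 --gres=gpu:1 ./pi-gpu 1000000000   # reserve one GPU through Slurm and run on it
    # for quick tests there may be a short or debug GPU partition, something like:
    # srun -p gpu-short --time=00:10:00 --gres=gpu:1 ./pi-gpu 1000000000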
So yeah, the GPUs are very popular because they are so heavily used nowadays. That queue should probably give you one faster; or maybe not, maybe everybody is running there right now. I guess everyone's doing this exercise right now.

Okay, the main thing is that you basically ask for a GPU and then you're granted a GPU. On some sites you might also need to specify a GPU partition. There's a question in the notes about this: you might not have a GPU node available in your partition, so then you might need to add the -p option and give it a GPU partition name or something like that; it depends on the site. There's the sinfo command in the notes as well, to see all of the available GPU architectures. But yeah, it looks like we're still waiting; this can of course happen to anybody.

So, special cases and common pitfalls: what can go wrong when we're using GPUs? Well, the first thing you should know is that because GPUs are so different from CPUs, utilization is a completely different kind of thing. If you have a problem which is not very big, the GPU will just plow through it and then say to the CPU, "Okay, I'm done, continue where you left off." And if your CPU part is very slow, you can often have a situation where the GPU is actually not doing anything, or it's working maybe 10% of the time, and 90% of the time it's just waiting for the CPU to give it something to do. In these cases it's important to spot this low utilization. While the job is running you can open an SSH connection to the node where the GPU is and check the utilization using the nvidia-smi command that is presented here, or you can use this slurm history command or the sacct command to see the GPU utilization (a short sketch of this follows below).

But this is often a complicated thing, and it's a good idea to come and talk with us if you run into a situation where you feel the utilization is not optimal, or the job takes longer than you expected. The GPUs in clusters are much bigger than the GPUs in your workstations, and it's very important that you get them fully utilized, because they are so powerful: if they're not fully utilized they will just go through the material immediately and then throw the work back to the CPU.

There's also a question about the cooking metaphor: someone can be cooking so efficiently that they're running out of ingredients, and they're spending more time walking to the fridge or the pantry to bring in more materials, so the stoves can't be fully occupied. In that case you need more CPUs to bring in the data, or you need more data bandwidth to be able to carry more at once, and the code needs to be written so that it's optimized for reading in this data.
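Here is a hedged sketch of that monitoring workflow; the node name and job ID are placeholders, and the exact accounting fields that report GPU usage vary between sites:

    squeue --me                                             # find which node your GPU job is running on
    ssh gpu27 nvidia-smi                                    # "gpu27" is a placeholder node name; shows GPU utilization and memory use
    sacct -j 12345678 -o JobID,Elapsed,State,AllocTRES%50   # job accounting; 12345678 is a placeholder job ID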
Okay, should we go on to other common problems? Yeah, the other common problem is that you might have the libraries loaded multiple times, or you might not have the libraries loaded at all. Like I mentioned, you usually access the GPUs through these libraries, the CUDA libraries. In this case we had compiled code, something that we compiled ourselves, so we need to load the same CUDA module when we are using it.

If we are running something like TensorFlow or PyTorch or some other code that is built with CUDA, remember that there are built-in pieces inside those installations: when you install PyTorch you also install code that has been compiled against these CUDA libraries, and if it doesn't find those libraries it will break. So when you're running stuff you need to make certain that you have the correct CUDA toolkit and the correct things installed or loaded in your environment, and that you don't have multiple conflicting things loaded at the same time. There are usually instructions in our documentation on how to build, for example, conda environments if you are using PyTorch, TensorFlow, that sort of thing (a rough sketch of creating such an environment follows at the end of this answer). If you're using other frameworks, you need to make certain that the code you're using finds the same CUDA it was compiled against.

There's a good question here: if my program does not use the full GPU memory, is it okay if I run several jobs on the same GPU? Well, this is a good question. Memory is not usually the limiting thing here: GPUs work with a large number of threads. In the A100 cards that we have, for example, there are, if I remember correctly, something like 7,000 to 8,000 of these GPU cores. If you're already utilizing all of those 8,000 but not the full memory, adding more jobs for the GPU to do doesn't really help, because all 8,000 are already in use. It's much more important to use all of the parallel threads, the whole width of the calculation.

You can think of a GPU as something like a wide conveyor belt: there's stuff coming onto it, and every cycle it plows through all of the calculations it's holding. If you're using the whole width of the conveyor belt, that is, all of the compute threads, adding more stuff doesn't make it any faster, because the belt is already full. And if you're running the trivial examples given in deep learning tutorials, like MNIST training, these use maybe five percent of the whole conveyor belt. So the speed might be similar to what you get on your laptop or workstation, because the power of the GPU only shows when it can use the whole width of the calculation: if the calculations are really small, it's not faster, but if they are wide and you have a lot of things to do, it is. So with GPUs you usually need to think of the work not as a sequence, but as a wide lane that you go through.
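Picking up the point about environments above: as a rough illustration, and not the cluster's official recipe (the module name, environment name, package source and CUDA version are all assumptions, so follow the cluster documentation for the supported combination), creating and sanity-checking such an environment could look something like this:

    module load miniconda                     # or whatever provides conda on your cluster
    conda create -n torch-gpu python=3.11 -y
    conda activate torch-gpu
    pip install torch --index-url https://download.pytorch.org/whl/cu121   # PyTorch wheel built against CUDA 12.1
    # on a GPU node, check that the framework finds a GPU and which CUDA it was built with
    python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"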
So could we say that memory of GPUs is usually not the limiting factor, and if a job is not using all the processing cores then something is wrong in itself? Yes; there's rarely, if ever, a case where you'd want multiple different tasks on the same GPU. Memory usually becomes the limiting factor only when you start running really big things, let's say language models or something like that, where the model no longer fits into memory. The A100s that we have, they have 80 gigabytes of memory, and the V100s have 32 gigabytes. If it doesn't fit into the memory anymore, then you start to have problems, but there are solutions for that as well.

Okay, let's keep looking at the other pitfalls: reserving specific GPU types. There's a question: do we manually select which GPU our job runs on? By default it will give you any available GPU, but of course some of them may be older than others, in which case you can use the --constraint option to say "I only want to run on these newer types of GPUs", so that the code won't crash (see the batch-script sketch after this answer). But if possible, you should accept any type. Also, the billing of the GPUs is arranged so that the more powerful ones cost more, and you may also need to wait longer for them than for the older ones.

There's an excellent question in the notes: when you request a GPU, do you really need to get the full GPU, or can you get only a part of it? And this is exactly how it is: because these are generic resources, there's only one GPU right there, and with a GPU, having other people running on it at the same time doesn't really work, so multi-user running is not really an option. You will get one GPU, an actual physical GPU, when you're running GPU codes and reserving a GPU: you get the whole thing. If you only use a part of it, the rest of it is not doing anything, and nobody else can use that rest.

The GPU itself can run multiple things at the same time: you can run multiple of those kernels I mentioned on the same GPU, but that all usually happens in the background, so you don't have to think about it. The GPU is reserved only for you, and that's why it's very important that you utilize it to the maximum extent. For example, what we see now while Richard is waiting there in the queue: if all of the GPUs are occupied, it just means that there's less resource for everybody.
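As a hedged sketch of how such a request can look in a batch script; the feature names inside the constraint are made up for illustration, so check the cluster documentation or sinfo for the real ones:

    #!/bin/bash
    #SBATCH --time=00:30:00
    #SBATCH --gres=gpu:1
    #SBATCH --constraint="volta|ampere"   # accept either of two GPU generations; names are cluster-specific
    srun ./pi-gpu 1000000000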
So whenever you ask for a GPU, it's a good idea to try to utilize it to the fullest extent. But of course, at the same time, do play around and test things out. If you're training a model, train it first for a few steps and see if it works; if it works or seems promising, then train it properly. Don't train it for five days and then realize that it's no good; cancel it beforehand if you see that it's going in the wrong direction or doesn't look promising, because then somebody else can use the GPUs. It's a shared resource, and they are so expensive that everybody needs to try to use them to the best of their abilities.

Yeah, should we go back to the lesson and quickly go through the remaining points? So, we talked about reserving specific GPU types. There's a short queue for quick debugging, which can sometimes even be full. If you get a message that says "libcuda.so.1: cannot open shared object file: No such file or directory", what that basically means is that you're running GPU code on a node that is not a GPU node, and it doesn't work. This often happens when people try to test their code on the login node; so then use srun to grab a GPU.

What about the different Python deep learning frameworks, what special things are needed in order to use those? Yeah, quite often people ask us how to create an environment where CUDA and the Python deep learning framework work together, because Python is nowadays so popular, especially in deep learning, and quite often the installation didn't go through, something went awry, and the install just doesn't work. I recommend checking our documentation, because it documents extensively how you create these environments. You need a framework that comes with the correct CUDA version, and there are tools, conda for example, that can manage this trouble and find you matching pieces, but it gets very technical quickly and I don't want to bore anybody, so check the docs so that you know how to create an environment suitable for you.

Okay, CUDA architecture flags: basically, if you're compiling your own code, these let you make it run on every GPU and not just some (a sketch of these flags follows just below). Keeping GPUs occupied was the metaphor about keeping enough supplies around while you're cooking, so basically having enough incoming data bandwidth to supply the data as fast as possible. And I guess you can read this as well as we can explain it.
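Coming back to the architecture flags: a hedged sketch of compiling for several GPU generations at once; the compute capabilities 70, 80 and 90 are examples, so match them to the GPUs your cluster actually has:

    nvcc -o pi-gpu pi-gpu.cu \
        -gencode arch=compute_70,code=sm_70 \
        -gencode arch=compute_80,code=sm_80 \
        -gencode arch=compute_90,code=sm_90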
Yeah, I'll quickly mention that this is most likely the biggest culprit behind underutilization of our GPUs. Unless you're running some physics code that runs entirely on the GPU and never leaves it, you usually have a data pipeline when you're doing GPU computing. For example with deep learning: you have data, the data is read by the CPU, the data is pre-processed by the CPU into the correct format, then it's converted into tensors that are sent to GPU memory, and the GPU calculates, okay, this is how the model sees the batch and this is how the model should change. Then control comes back to the CPU, which is asked for more data. If the CPU has just been hanging around waiting while the GPU does the calculation and is then surprised by "okay, I now need to get more data", the GPU in turn just waits around while the data is fetched. This is not good, and that's why all of these frameworks have massive documentation on how to optimize these data pipelines. It's an annoying part of the whole program, but spending time on data pipeline optimization is usually the most cost-effective way of speeding up your code, and it's something that usually needs to be done.

For example, the large language models that we have been mentioning use something like 200 petabytes of data, or I think it was 40-plus petabytes for some of them. Can you imagine how long it takes to read 200 petabytes? It's an insane amount of data, so you need a massive framework to read it in and feed it to these language models; you need a lot of engineering around the data loading. This is something that we can help you with, so come and talk to us if you have problems related to this.

Okay, what's next? Profiling GPU usage with nvprof. There's a tool that will profile things; you can read about it yourself (a one-line sketch of running it under Slurm is included at the end of this section). I'm trying to get quickly to the general Q&A part. And here's a list of the available GPUs and architectures. There's a question in the notes about whether this is up to date: it looks like it might be, or there might be more here that aren't listed. I think it's up to date, but I might be mistaken. I think that figure for the amount of training data was for the Stable Diffusion kind of models, because they are image models; ChatGPT requires less data, but still a lot.

Yeah, okay, so these exercises: do we need to do them together? It would have been good to run them, but as we can see from Richard's terminal, the GPUs are fully utilized, and it's very hard to do an exercise when you don't have a GPU to run it on. So I think it's better to leave them as homework. But should we go to the general Q&A?
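For reference, the profiling step mentioned above would look roughly like this under Slurm; note that nvprof ships with older CUDA toolkits, and newer ones replace it with Nsight Systems (nsys), so the available tool depends on the CUDA module you load:

    srun --time=00:10:00 --gres=gpu:1 nvprof ./pi-gpu 1000000000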