We are back. All three of us. Well, now all two of us. Okay, so: GPUs. We talked a lot during the break about how to present this, because it's the kind of thing that can go anywhere. Array jobs are really easy to describe as an extension of serial jobs, and parallel jobs are somewhat easy to describe, because it's just "okay, you ask for more resources". But when it comes to GPUs, it gets more complicated, because you have more requirements from the software itself: the software itself has to support it. So this can unfortunately turn into a bit of a talk fest. We have examples here, but these examples depend on our system; hopefully they run on other systems as well. It's a bit of a "nobody can be told what the Matrix is, they have to experience it for themselves" kind of situation. Or like the elephant that somebody describes by touching the tail, somebody the trunk, somebody the leg: it's really hard to describe the whole thing in simple words. So for this talk, if you have any questions related to GPUs, just put them underneath here in the HackMD so that we can try to respond to them.

Okay, what is a GPU? There was a question yesterday where somebody asked what a GPU actually is. GPU stands for graphical processing unit, but even though it says "graphical" in the name, it doesn't mean it's actually doing graphics, especially when it comes to computing. In clusters and in scientific computing we often talk about accelerators, or general-purpose GPUs, which just flows out of your mouth. GPUs originally were about graphics processing, but at some point somebody realized that these machines, designed to do lots of vector calculations (in graphics you do lots of translations, and for 3D perspectives you have to do 3D rotations and so on, which is a lot of vector math), could also be used for general-purpose computing, if your calculations are vector calculations. Then lots of people realized that many physics codes, differential equations, and nowadays especially machine learning and deep learning, could be solved using GPUs. And because GPUs are specialized for these kinds of things, they have hundreds or thousands of calculation cores, basically small calculators, that do these operations in parallel. They're highly parallel systems: you give them a huge bunch of numbers, tell them to do something (calculate the difference of these numbers, lots of matrix operations and so on), and they do all of those operations very rapidly. They can be used in various fields, but they're still not like the central processing units in computers, which can do whatever you want them to do.
But these GPU systems can still only do certain kinds of operations fast. That's why they depend on libraries, and on the code itself, to utilize them properly, and they often depend on external libraries to do their stuff correctly. If the program doesn't support GPUs, then it doesn't support GPUs; if it does, then it does. It's again that kind of either/or situation. And usually nobody wants to write low-level code for the GPUs themselves, so instead they use libraries, or libraries that use other libraries, to abstract it away and make it easier. Nowadays most people use ready-made software: physics people use, say, GROMACS or CP2K or LAMMPS compiled with GPU support, and then those run on the GPUs, and there's all kinds of different software that can use GPUs. Deep learning and machine learning people use libraries that utilize the GPUs for them, so you don't have to worry about it too much.

From the side of Slurm, from the side of the queue system, the GPU is very simple. Richard, if you want to switch to my screen. Okay, there you go. In general, what you want to do is specify `--gres`, the "generic resource": you specify that you want this resource to be available, the resource is usually `gpu`, then a colon and the number of GPUs. It's usually best to start with one GPU; we'll come to that later. In some clusters you also need to specify a GPU partition, because the GPU resources live in their own little world, so you need to ask specifically for GPUs from the GPU partition. In Triton, don't use `srun` inside your sbatch script, because there's a bug. Actually, I just remembered we might have already solved this, but we haven't updated the documentation; anyway, in Triton there might be this kind of thing.

Let's look at the next example: what you might want to add to your GPU job. You have exactly the same script, but you just add `--gres=gpu:1`. And like we said about the `--cpus-per-task` option for multi-process jobs, it's up to the program to actually utilize the GPU. You basically tell Slurm "I want this, I need this", but it's up to the program to actually utilize the resource; specifying the GPU doesn't necessarily mean the program uses it. In Triton there are multiple different GPU types (other clusters may have them as well), and you can either give a constraint, which says "I want a certain kind of GPU architecture", or give a GPU type in the `--gres` option itself. In other clusters the constraints and options might be different. It's good to know about this because of the generational differences between GPUs: the newer ones are always faster, but sometimes the newer ones are so popular that the older ones would suffice.
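For reference, a minimal GPU job script might look like the sketch below. The partition name, constraint, and GPU type are Triton-style examples, and the module and script names are placeholders; check your own cluster's documentation for the real values.

```bash
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=4G
#SBATCH --gres=gpu:1           # request one GPU of any type
##SBATCH --partition=gpu       # some clusters keep GPUs in their own partition
##SBATCH --constraint='volta'  # or pin a GPU generation via a node feature
##SBATCH --gres=gpu:v100:1     # or name a specific GPU type in --gres itself

module load anaconda           # module names are cluster-specific
python my_gpu_script.py        # placeholder: the program must support GPUs itself
```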
And yeah, sometimes you might want to choose an older one instead of a newer one. For all of these different arguments, there's a table at the bottom of the page that says what to use. The idea is that the arguments are there, so you obviously don't have to memorize them.

In general you might see words like CUDA, or, in the previous presentation from CSC, HIP or ROCm, which are the AMD versions of CUDA. CUDA is the library that basically does all of the heavy lifting when you're using NVIDIA GPUs, which in our case is what we have. So you might need to use CUDA, but usually nobody wants to use CUDA themselves; they want to use something that uses it. Maybe I said that complicatedly; put simply, nobody wants to compile their own CUDA code. If you need to compile your own CUDA code, this is relevant, but in most cases, if you want to use, say, TensorFlow, you create a conda environment that brings you the CUDA libraries that work with TensorFlow, and then you use those.

Maybe we should run this example to see how to run something on the GPU. As usual, I'll type it. Okay, here we are in the examples; let's configure the screen. This example is from TensorFlow's tutorials, and you need to download the actual source code (that's the `wget` command), because the actual code is a lot longer than the example shown there. It uses TensorFlow and dense networks to predict the MNIST dataset, which is the first example everybody sees when they do deep learning. It's a bit complicated to code. And then the submission script, or do we run it ourselves first? Yes: `module load anaconda` into this very shell, and now we do it interactively, because that's what we do for testing. One GPU, and I guess this runs on any type of GPU we have. Yes. So: Python... What's important is that we just add the one-GPU flag, and because TensorFlow itself understands GPUs, it will use them. You can see it prints lots of stuff, but it also says "Tesla K80" there: it was using one of the older GPUs, because we didn't specify what we wanted. For a problem this small, that's completely fine. Okay, so it ran.

So typically the problems are installing the software, so that it works and can recognize the GPUs, and then efficiently using the GPUs. It's not usually "how do you queue for the GPUs"; that's easy. If you run the `slurm features` command, it will show the available GPUs, or at least the available features, that we have. In other clusters I highly recommend checking the documentation for which GPUs there are. So we have plenty of different GPUs; here we go. It's rather small, but we can see it. For the nodes, the columns A, I, O and T mean allocated, idle, other, and total.
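As commands, the interactive test and the feature listing look roughly like this. The example filename is a placeholder for whatever you downloaded; `slurm features` is an Aalto/Triton wrapper, so on other clusters use plain `sinfo` or the local documentation.

```bash
module load anaconda                           # cluster-specific module name
srun --gres=gpu:1 python tensorflow_mnist.py   # placeholder filename; TensorFlow
                                               # finds and uses the GPU on its own,
                                               # and its log names the device, e.g. "Tesla K80"

slurm features                   # Triton wrapper: list node features and GPUs
sinfo -o '%20N %10G %25f'        # plain-Slurm near-equivalent: nodes, GRES, features
```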
So in that column you can see there's an A100 on some nodes, a Tesla K80 on some, and there are V100 and P100 nodes as well. Many generations of GPUs are represented. If your code needs a specific one, you should specify it via a constraint or with the `--gres` parameter. But if you don't know what you need, I'd recommend just leaving it out.

The usual problem with GPUs, then, is the efficient use of them. Like I mentioned, GPUs are very powerful at the specific task they do; the rest of the tasks that a normal program runs usually have to be done by the CPU part. The GPU part is very fast at what it does, for example at these kinds of matrix operations. But especially in machine learning and deep learning, you usually need to feed the GPU with data, and the problem usually becomes that the data doesn't reach the GPU fast enough. So what you usually need to do is choose a better data format for your data, or have multiple data loaders, meaning multiple processes. You can have many of these different requirements in the same Slurm script: you can ask for multiple CPUs at the same time as you ask for the GPU (there's a small sketch of this below). It's usually recommended to have something like six data loaders per GPU: processes running in the background doing data loading while the GPU does the calculation. The CPUs do the other stuff while the GPU is running, because the GPU is so fast that you need to keep it fully occupied constantly; otherwise you won't get the maximum benefit from the system. In our cooking metaphor: your cooks are really fast, but your supply chain can't bring in the supplies fast enough, and you're wasting all these resources sitting around waiting for trucks to arrive. Yeah, basically you have a really fast burner and you could cook a lot of spaghetti with it, but you can't get enough spaghetti to cook, so it's just boiling there doing nothing. That's basically the situation.

This can be helped with various techniques. In the CSC talk about their workflows, they have an example of how to use containers and SquashFS and all kinds of stuff to speed it up, and there are various dataset formats and data loaders. But it gets very technical (you probably already zoned out while I was talking), and it's very dependent on the program you're using. So if you want to use GPUs, I highly recommend first looking at whether your program can utilize them. And GPUs are all the rage nowadays, but it's not about how you do it, it's about what you do; you can do a lot of research with other tools as well.
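Going back to the data-loader point: in practice it means requesting CPU cores alongside the GPU in the same job script. A minimal sketch, where the figure of six is just the rule of thumb above, and the `num_workers` remark assumes a PyTorch-style loader:

```bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=6   # CPU cores for the data-loading workers feeding the GPU
# ...then tell the framework to use them, e.g. in PyTorch:
# DataLoader(dataset, num_workers=6, ...)
```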
You can use traditional machine learning for many problems; CPUs are not out of the loop. There are a lot of tools, and GPUs are one tool, one resource, for certain kinds of problems. So don't feel left out if you cannot use GPUs for your problem, because science is about ideas and about creating models that explain things. Sometimes you need a GPU for doing machine learning, but in many cases you don't, and then you shouldn't waste too much time on it.

About the "read the documentation" point: I guess we can say that if you learn just enough to request the GPU and run the program, that's not enough. Read more, and read about the different data loaders your framework recommends and things like that, because these are important. Yeah, like the story I told: once I wanted to try coding some games myself, and what did I do? I started learning OpenGL, how to compile C++ code to draw a triangle. And that's not how games work. You don't get a game by starting from the graphics library and trying to write your own graphics engine. Of course you can do it, but it's a waste of effort, because you can find a good graphics engine like Unity or Unreal for free, and then focus on the game-making part of game making. It's a similar kind of thing with GPUs: when people want to use GPUs nowadays, they usually leave the GPU-coding part to, say, Google or Facebook, who spend millions of dollars and euros on TensorFlow and Torch and all of these different frameworks, and then maybe sprinkle some of their own GPU code on top. You can still learn how to use GPUs in the context of a certain framework, and you will still utilize the GPUs, but you don't have to start from the bottom in order to learn. That's really quite representative of a lot of what happens on the cluster: you very rarely do things completely from scratch, but rather find the best libraries and then do the next little bit on top. And unfortunately, in many courses you learn how to make new things, but you don't so often learn how to really reuse things or make things reusable. Yeah, it can be a bit disheartening to expect to learn about GPUs, for example in this course, and then be told "read the manual", but the unfortunate reality is that almost nobody uses the low-level stuff (somebody does), and there are tutorials for all of these different fields.
And it's highly recommended to look at what other people are doing, because that's the fastest way of learning. For example, if you're interested in deep learning, go check what kind of repositories, say, DeepMind has created: they do cutting-edge work, they have a nice blog where you can read about all of these models, and they provide the code on GitHub, so you can check what tools and techniques they used to create their cutting-edge technologies, and try to learn from those. If you're interested in physics and such, check how the big physics codes (LAMMPS, CP2K, GROMACS, CHARMM or Charm++, or some other suites) do this stuff, and use those. It's a good idea to broaden your horizons and look at those.

So, what's next? Is there much else on this page? Well, I think there were three topics we wanted to cover: GPUs aren't magic, how to request GPUs, and how to monitor them. What was the efficiency of the job I just ran? Let's see if Slurm even kept it: `slurm history`... five... well, ten minutes... okay, twenty minutes. There we go, here's the job. Now, what we're about to show you is special to Aalto. We're working on getting better monitoring solutions for our users, and if we figure out good tools for monitoring, we can hopefully share them with other universities and other sites. Sometimes it requires a bit of detective work to see the utilization. So we use `sacct` with `-j` and the job ID, tell it we want the `comment` field, and add `-p` for parsable. Here in the comment we see the maximum GPU memory, the power consumed by the GPU in some unit, that they requested one CPU and one GPU, and the GPU utilization: four. Does that mean 4%, or what's the unit here? Yeah, it's 4%. This problem is so small that it doesn't even tickle the GPU. These GPUs are very powerful, but to get that power you need to actually feed real work to the GPU so that it can be fully utilized. And again, this is implemented with something that someone here at Aalto created, which watches the GPU as the job runs, records the data, and puts it in the comment field. I don't know what the solutions are at other places; maybe someone can comment in the HackMD about that. But we can't say enough how important it is that you watch this number and make sure the GPU is being used well, and if you don't get what you want, ask us.
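For reference, the monitoring command from the demo looks roughly like this. The job ID is a placeholder, and the GPU statistics in the comment field are an Aalto-specific addition, so other sites will show something different there, or nothing:

```bash
sacct -j 12345678 --format=comment -p   # -p = parsable output
# On Triton the comment carries locally collected GPU stats: maximum GPU
# memory used, power consumed, and GPU utilization (e.g. "4" meaning 4%)
```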
What are the common things that can go wrong with GPU efficiency that we haven't already mentioned? We already mentioned the data coming in too slowly and not enough CPU power. Well, if you're running something on your own machine, and you happen to have a fast gaming GPU, you're probably running off a local SSD, maybe a fast NVMe solid-state drive. Then you go to a cluster without an efficient data loader, or with the data in a bad format, and you might end up in a situation where it seems faster to run on your own machine than on the cluster. The problem is that the data comes from the wrong place. Usually these GPU machines have local disks that you can utilize to store the dataset while you're doing the analysis, so you usually should use the local disks (there's a small staging sketch a bit further down), and sometimes you need to write the data loader in a different way. So if you encounter a situation where something doesn't seem right ("I expected this to be easier, I expected this to work better"), then come and ask us what the reason is, because it usually means there's some underlying technical thing that is annoying and needs to be fixed in the code to get the maximum performance.

The other thing: if you do multi-GPU stuff, make certain that your program actually supports it. Some of the physics codes can do it quite easily; they can do MPI multi-GPU stuff, and because their problems are relatively straightforward, many of them can utilize multiple GPUs. But especially when it comes to deep learning, you can easily make a bad situation worse: if you have one GPU that is not fully utilized, and you add another GPU that is even less fully utilized, you can actually slow your code down by adding more GPUs to the mix. So it's usually a good idea to check what happens. I highly recommend the frameworks' own monitoring and profiling tools (PyTorch and TensorFlow both have them) to see how the code is progressing and what the GPU utilization is; with that kind of tool you can easily spot whether multi-GPU works or not.

Okay, so what now? That was basically the page; there are more examples up above. I think these exercises are maybe too esoteric to be useful. Or should we try the one idea we had, which is really risky for us: does anyone have a code you would like to try running on a GPU, that's publicly available? We could try to get it running here live, and you can see all the things that go wrong with us. We might not even have time to do it, but if someone pastes something, we can.

There's a good question in the HackMD: can you comment on the different types of GPUs? I think there are two main classifications here: the top entries are the NVIDIA GPUs, and this one is an AMD GPU, but beyond that, Simo can tell a little bit more. Yeah, we bought one of these AMD GPUs just so that some of our users could port their code to the upcoming LUMI system, or test whether it's possible to put their stuff on LUMI. Like I mentioned about these libraries, CUDA and HIP and so forth: because the underlying machine is completely different, the libraries are also completely different, and they come from different companies. Of course they don't want to standardize the stuff; they each want their own, because their stuff is "better". So they have different libraries, and you cannot simply port from one to the other. There are all kinds of tools that can port between the GPUs, and you can try them, but it gets really technical really fast.
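Coming back to the local-disk point: here is a sketch of staging data onto node-local storage before training. The paths and environment variables are illustrative, since clusters differ in what local disk they offer and where it's mounted; check your cluster's documentation.

```bash
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4

# Stage the dataset onto the node-local disk, then read it from there
LOCAL_DIR="/tmp/$SLURM_JOB_ID"             # hypothetical local-disk location
mkdir -p "$LOCAL_DIR"
cp -r "$WRKDIR/my_dataset" "$LOCAL_DIR/"   # $WRKDIR and the dataset are placeholders
python train.py --data-dir "$LOCAL_DIR/my_dataset"
```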
So if you're into that kind of porting work, then yeah, I think there's lots of it coming for you in the upcoming years. Definitely beyond what this course does. And about the other differences between GPU types: every GPU generation introduces new things. I think in the Volta generation the main thing was that they implemented efficient double-precision compute units, so double-precision performance became much faster; and in the Ampere generation I know they at least implemented some half-precision units. If this doesn't tell you anything, don't worry, it's very technical. But basically, if you put different kinds of numbers into the GPU, it can suddenly perform a lot better. There are lots of very technical things like this, where some technical change to your code makes it perform a lot better. It's usually a good idea to check the best practices of the framework you're using: they usually give good recommended options and settings, and if you just follow what everybody else is doing, you usually get the best results. These things are so technical that if you're not interested in the technical aspect, only in the using-them aspect, it's better to let somebody else figure out the technical details and give you the short version ("use these flags") than to get all technical yourself, because that's a never-ending battle.

Can we also summarize, since someone asked here, what a CUDA core is, practically speaking? A CUDA core is basically a compute unit: one calculator that calculates things. It has operations like "do an addition" or "do a vector addition", and some of these it can do a lot of, very fast: say, adding two 32-bit numbers. But if you take an older-generation GPU card and tell it to do a multiplication of double-precision, 64-bit numbers, it's suddenly really slow, because of how the hardware works; it gets very technical pretty quickly. But the CUDA core count is basically how many processors there are. Nowadays it's around 3,000 to 4,000 in the newest ones; in the older ones it might have been 1,000 or 2,000, something like that.

And my summary of one of the previous questions Simo answered, on what the differences between the GPUs mean practically speaking: there might be a code you have that needs a newer GPU. Newer tends to be able to do more stuff at once, so it can be faster; but in effect, newer might have features that a code needs.
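A quick way to see which GPU model, and how much memory, a job actually got. This is standard NVIDIA tooling, nothing cluster-specific:

```bash
# Ask the driver which card the job landed on
srun --gres=gpu:1 nvidia-smi --query-gpu=name,memory.total --format=csv
```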
But again, like we said when we talked about array jobs: faster doesn't necessarily give you more. If you use the most recent ones, they are the most sought after, so somebody else wants to use them as well, and you wait until you get access to them. At the same time, you could have been using some older one that nobody wants, and gone to the bank with your research. It's usually best to think about what the end goal is: is my end goal to do this 10% faster, or is my end goal to get it done? If you just want to get it done, it's usually best to use the resources that are most readily available. That's why embarrassingly parallel is usually the best way to do stuff: you get the most done, instead of trying to optimize. And you can use anything; you can even use the smaller stuff more. Of course, in some cases the most recent ones are the only thing that works, because of the technical reasons I mentioned: some older things aren't even supported by the frameworks anymore, and in those cases you don't want to use those. But in many cases, what is available is better than something that is faster in theory.

There are several more interesting questions down here. Actually, can we answer first: how powerful are the Triton GPUs, like the A100, compared to consumer GPUs? I'll have to double-check, but I think the most powerful consumer GPU is the RTX 3090 Ti, and that's basically an A100 that's been cut down, or something like that. It might be a bit less, I'm not completely certain, but it's in the same ballpark. The main difference between these gaming GPUs and these machine learning, or compute, GPUs is the technical stuff inside them. In gaming you usually don't care about double precision: if some pixel is one pixel off, you don't worry about it, because you're playing some game and you don't care if there's an error that could propagate. And at the same time, you don't usually care about very low precision either, because that would mess up all of the graphics. So when they create those gaming GPUs, they leave some of these compute units out; the cards have different kinds of compute units for these kinds of cases. Basically: the graphical GPUs you buy from a store are designed for gaming, and the GPUs you buy from vendors for computing are designed for scientific calculation. So you can do high-precision physics simulations where you don't want mistakes that can propagate, or the kind of low-precision stuff you use in deep learning to make faster and bigger models. They also usually have a lot more memory, like 80 gigabytes, and there are lots of other technical reasons.
It's like the difference between going to a regular store to buy a drill, where you get consumer products, and the professionals, who go to their own store with professional products. They're different kinds of tools: professionals don't go to some supermarket or neighborhood store to buy their tools, they go to an actual hardware store. And yeah, error-correcting memory is also one thing. It doesn't matter so much if the computer crashes while you're gaming; you just restart. But if you've been training for many days and your A100 or something crashes, that's a bad thing.

What else? There's a good question; it's also quick. Go ahead. Yeah, about reproducibility. With machine learning and that kind of stuff, reproducibility is always a question; there's lots of talk about whether it's even possible, because there's usually randomness that is designed to be in the system. But usually, when it comes to reproducibility, having good documentation is paramount: what software did you use, what kind of environment, what kind of GPU, what data did you use, did you load the data in a different order, how did you deal with the randomness of the system, and so on.

I guess we're basically in Q&A mode now; we'll stick around until questions stop coming. Please give us feedback for the course. I'll quickly mention one more thing about GPUs: on the first day I presented the conda environment documentation. If you're planning on using Python packages, this stuff that requires GPU capability, I highly recommend you check that documentation for how to install them, because there's no simple answer and everybody hits their head on that problem. That's why we've rewritten our documentation: a lot of people encounter the same problems with those. So I highly recommend checking that documentation if you're planning on using, say, TensorFlow or PyTorch or whatever requires GPUs, because making the libraries and everything work together is really complicated. But yeah.
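One hedged sketch of the "good documentation" idea, recording the environment alongside your results. The filenames are placeholders, and it assumes a conda/TensorFlow setup like the one in the demo; note that `module list` prints to stderr on most module systems:

```bash
module list 2> loaded_modules.txt      # record the loaded modules (goes to stderr)
conda env export > environment.yml     # if you used a conda environment
python -c 'import tensorflow as tf; print(tf.__version__)' > versions.txt
nvidia-smi --query-gpu=name,driver_version --format=csv >> versions.txt
```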