Thank you for attending our session. We were not sure we were going to receive such big interest in our topic, but we're very happy to see you, so let's start. This is a GPU talk: too many users, very little resources. We're presenting efficient access to shared GPU resources. I am Dio Guerra, a Computing Engineer with the CERN Kubernetes team. And I'm Diana, also a Computing Engineer at CERN.

For a bit of context, CERN is the European Organization for Nuclear Research, the largest particle physics laboratory in the world, situated in Geneva, Switzerland. One of our biggest instruments is a circular accelerator, which accelerates protons close to the speed of light, clockwise and anti-clockwise, and at certain points of the accelerator we make them collide. This is one of those points: the Compact Muon Solenoid, which is not very compact. It basically acts like a big photographic camera that takes 40 million pictures per second, and this is an example of one of those pictures. Each detector can produce on the order of a petabyte of data per second, so we filter this data with hardware and software filtering to bring it down to a more manageable size, on the order of 10 gigabits per second. Still, across the four detectors we generate about 70 petabytes of data a year, and this is only going to grow. So we have a very large data set, but the good thing is that it is highly parallelizable, which is great for GPUs. In the community we use GPUs for simulation, for the event filtering I just explained, and for event reconstruction and other physics data processing.

We have some challenges when using GPUs. Some users have suboptimal code, for example code with strong interaction between CPU and GPU, or code that moves memory around a lot. Another challenge is legacy code, of which we have plenty: code that was designed for CPUs and that people just port to GPUs, and it does not work the same way. We also have workloads that are spiky in nature; think of a scientist developing an algorithm in a notebook that mostly sits idle and barely touches the GPU. From the infrastructure perspective there are GPU power density constraints, but probably the most important challenge is the limited resources to meet all users' demand, because GPUs are scarce, expensive, and a lot of people are looking for them.

So, story time. We're going to present three use cases, which basically reflect our users' workloads. Mr. Brown is a badly coded simulation job: it has some VRAM requirements, but it doesn't really take advantage of the GPU; on average it uses about 20% of the GPU processing power. Mr. Pink is an inference service, occasionally triggered by outside events; it's spiky and unpredictable, it mostly sits idle, but when it runs it actually takes advantage of the GPU, and it also has its own memory requirements. And last but not least, Mr. Orange, our wildcard: think of a physics user on a GPU notebook with a potential memory leak, or a user who just leaves and stops using the GPU, so the GPU stays locked to that user and is never returned to the pool.
To configure all of this, we're going to assume we're using NVIDIA cards. To make them usable on our Kubernetes cluster, we install the GPU operator Helm chart, which makes a new resource available in the cluster: the nvidia.com/gpu resource.

With this in mind, let's onboard Mr. Brown. As we said before, Mr. Brown does not really take advantage of the GPU when he uses it; the GPU utilization is on average 10%, 20% at best. Clearly it would be really nice to share this GPU with other users, and we can.

To do this, we first set up time slicing. With time slicing, the scheduler gives an equal share of time to the GPU processes and alternates between them in a round-robin fashion. The memory of the card is shared between all the processes assigned to it, but compute only runs for one process at a time. To set it up, we go back to our GPU operator Helm chart and add some configuration saying that we want the nvidia.com/gpu resource we had before to have four replicas. Because we use the renameByDefault option, the renamed resources get the .shared suffix, so the node that had one GPU now advertises four gpu.shared resources (a configuration sketch is shown right after this part).

Now we can onboard Mr. Pink, because we can share the GPU: Mr. Brown and Mr. Pink run together using time slicing. In the same example, at some point Mr. Pink starts executing, and we see better memory consumption and improved GPU utilization. We can still clearly take more advantage of this, so we're going to risk it and onboard our wildcard, Mr. Orange. As I told you before, Mr. Orange has memory consumption problems: at some point he starts executing, and he just consumes memory with no regard for other users. What we see is memory climbing up, climbing up, climbing up, and then at some point Mr. Pink starts executing and dies right here, because, well, stuff happened. This is really bad, because we are offering a service and we don't want users complaining that they just ran their simulation last week and now have to do it all over again.

This is the problem with time slicing: there is no memory isolation, and there is no way to set priorities between processes, which also means we cannot run latency-sensitive applications this way. But as you saw, we could share a GPU between two and a half users and it was going really well. One of the advantages is that this works on a wide range of NVIDIA architectures, and it's an easy way to set up GPU concurrency.
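Here is a minimal sketch of the time-slicing setup described above, assuming the NVIDIA GPU operator's device-plugin sharing config format; the ConfigMap name, namespace and data key are placeholders rather than the exact names we used:

```yaml
# ConfigMap holding the device-plugin sharing configuration (names are placeholders).
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true      # advertise nvidia.com/gpu.shared instead of nvidia.com/gpu
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # one physical GPU shows up as four schedulable slices
```

The operator is then pointed at this ConfigMap (for example via the devicePlugin.config.name value of the Helm chart), and a pod such as Mr. Brown's simulation simply requests one slice:

```yaml
# Container resources for a pod using one time slice (illustrative fragment).
resources:
  limits:
    nvidia.com/gpu.shared: 1
```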
Can we do better? I don't know, can we? Let's see; yes, I think we can. In this context, let's talk about MIG. MIG stands for Multi-Instance GPU. It's a technology from NVIDIA that allows us to share a GPU, but now with isolated partitions: isolated memory, isolated cache, isolated compute cores, so that the same story does not happen again. I love this image because it lets us see the GPU as an abstraction of eight memory units and seven compute units. It also shows that we cannot just take partitions at random; we need to follow certain rules. We can go down, down, down in this image and divide the GPU into more partitions with fewer resources each, as we see fit for our use cases. Please remember that we have eight memory units and seven compute units, and eight memory units in this context, for an A100, means 40 gigabytes of GPU memory.

In terms of code, we just need to set a few things in the GPU operator. We set the MIG strategy to mixed, which enumerates a separate resource type for every MIG device profile that is available. Other than that, it's the same as with time slicing: we have a ConfigMap where we say which devices we want to be available (a sketch follows right after this part). Just make sure that when you sum up the devices you get the full GPU, because with MIG you can genuinely lose compute or memory if you're not careful. For example, here we have two instances of 2g.10gb and one instance of 3g.20gb; if you count them, you have seven compute units and 40 gigabytes of memory, which is what you want. If we apply this, the node's one nvidia.com/gpu resource turns into zero, and we get our MIG instance resources instead. Now we can assign instances to our actors. We assign the most powerful one to Mr. Orange, because we saw that he wants more; he tries to grab all of it. Let's see what happens.

One cool thing to notice is that we now get telemetry data for every partition, so we have nice pink lines, brown lines and orange ones. What is this dashboard telling us? First of all, memory consumption and GPU utilization are very good: everyone is trying to take advantage of the resources they have, which is very cool. On the other hand, look at the orange line for memory utilization: it goes up, and up, and up, and at some point the GPU utilization for Mr. Orange drops, because Mr. Orange is trying to allocate even more memory and now he cannot, because he is isolated, so only his own process gets terminated. Mr. Brown and Mr. Pink have no idea how close they were to a very bad outcome, but lucky them.

The conclusions here are that MIG is cool: we have hardware isolation now, we have monitoring data, and we are very flexible based on our use cases. The disadvantage is that MIG comes with a price, and I don't mean a metaphorical price; it's just very pricey. You need Ampere or Hopper architecture, and these server-class GPUs are very, very expensive. Also, if you're not careful, you might lose compute or memory, as I already told you, and you need to evict all the running processes in order to change the layout, which is something you want to keep in mind.
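As a rough sketch of the MIG layout described above (two 2g.10gb instances plus one 3g.20gb instance), assuming the mig-parted style configuration consumed by the GPU operator's MIG manager; the ConfigMap name and the profile name cms-mixed are made up for illustration:

```yaml
# Custom MIG layout for an A100 40GB (names here are illustrative).
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |-
    version: v1
    mig-configs:
      cms-mixed:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.10gb": 2      # 2 x (2 compute units, 10 GB)
            "3g.20gb": 1      # 1 x (3 compute units, 20 GB) -> 7 compute units, 40 GB total
```

With mig.strategy set to mixed, each profile is exposed as its own resource and the layout is selected per node through the nvidia.com/mig.config label; Mr. Orange's pod would then request the biggest instance:

```yaml
# Container resources requesting one 3g.20gb MIG instance (illustrative fragment).
resources:
  limits:
    nvidia.com/mig-3g.20gb: 1
```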
So the question you should ask yourself is: do we have performance trade-offs? Are we losing something, since we are doing the extra work of sharing? For that, we need to do some benchmarking. I don't want to spend too much time here, just a few words on the setup: we're using an NVIDIA A100 40 GB PCIe GPU on a Kubernetes 1.22 cluster, and we're running a simulation script that generates collision events. The script is built with Xsuite — please check that project out, it's very cool and it's written in Python. The script is very heavy on GPU utilization, but not very heavy on CPU-to-GPU communication or memory accesses, just so you have an overview of what is happening. Let's see the results.

Our first benchmark case is what happens if we enable time slicing but don't actually use it, onboarding only one process; that's why it's called shared-1. The loss is very small: initially it was less than 2%, and then it drops to less than 1%. I would say this loss is negligible if you take into account all the good stuff you get from actually sharing the GPU. The problems appear when we actually have to do the context switching. With shared-2 we expect the time to double, but look at the last row, 30 million particles: we would expect the time for shared-2 to be around 300 seconds, and it's actually more like 420, which is a very big loss — more than 100 seconds, the equivalent of almost 30%. That's not something you want. So this is what you have to remember: when you have to do context switching and you have long-running processes, it comes with a price; don't do it with processes that already use the GPU a lot. On the bright side, if you onboard more processes, like four or eight, the time just doubles accordingly; you're not losing anything extra. You pay once for the context switching, and after that you're safe. But yes, you need to keep the context-switching cost in mind, otherwise you're not on the right path.

What about MIG? With MIG, things are a bit different. The point here is that an A100 GPU has 108 streaming multiprocessors — I'm going to say SMs, because that's too long to keep repeating. When you enable MIG, you lose 10 SMs by default, by doing nothing, just by enabling it, and 10 out of 108 SMs is the equivalent of about 9.25% of the performance. You care about this because the SM count defines how many CUDA cores and tensor cores you have, so if you lose 10 of them, you lose a lot of cores, which is what you can see here: the difference is quite big and you care about it. That's the theory; we wanted to benchmark it and see whether it aligns with the theoretical numbers, and it does. The 7g.40gb profile is actually the full GPU but with MIG enabled — it's called 7g because of the seven compute units I talked about, and 40gb for the full memory — and the loss is about 9%, which is what we expect. So remember: if you enable MIG, actually share the GPU. If you just enable MIG but don't use it, you lose cores for nothing; it doesn't really make sense.

Mr. Pink is here to help us, because it's a lot of tables — it's benchmarking; I tried to make it interesting, but there's only so much I can do. The first table just gives you the baseline for our calculations; I want to focus on the second one, and specifically its last column. We have 1g.5gb, which means one compute unit and five gigabytes of memory, and 2g.10gb, which is double the resources: we're doubling the cores, doubling the memory and doubling the bandwidth, so we expect the time to be two times better; that's why we say the ideal scaling factor is two. Our measurements give us 1.97 and then 1.98, converging to the ideal value, which is very cool.

Now, some final conclusions.
First, MIG is very cool: it's suitable for pretty much everything, because MIG gives you the feeling that you're the only one using a full GPU, and you really don't care that there are other users. With time slicing that's not the case, as we already know. Time slicing is very good when you have a lot of idle time and want to take advantage of it, and it's great for low-priority jobs that can run for longer without anyone caring. If you use time slicing for users that have a lot of idle time, you should add some kind of memory management procedure or script, because you don't want what we saw earlier to happen again, especially in production. Time slicing has its use cases, but don't use it for anything very latency-sensitive or very performance-intensive, because the whole context-switching business comes with a big performance price.

Here is something we didn't mention much in this talk, but it's very nice to know, and it took us a while to figure out that we could do it: monitoring the utilization of the different GPU pipelines. We can see whether the tensor cores are utilized, or the FP16 or FP32 pipelines, and that helps us understand what kind of jobs our users are running on the GPUs. A100s and H100s come with tensor cores, which is a big part of what makes them as pricey as they are, and if you see in a graph like this that the tensor cores are really not utilized, you want to look into that and do something about it. Everything you saw here is what we use for our GPU monitoring, so check the link in the QR code if this is of interest to you; some of it took us a lot of time and it might help you.

I'll give you a second here. We did a lot of benchmarking, and this is of course not everything — there is not enough time, and you would get bored pretty fast by tables and tables. But please check out the first link, which goes to our blog post about GPU utilization, setting up the GPU operator, time slicing and the different approaches to this. We're still publishing and still working on it, and feedback is always very welcome, so please make sure to subscribe and push the button. And of course the NVIDIA documentation: the GPU operator just makes our life so much easier; I cannot imagine managing all this stuff myself. Special thanks to our colleagues Ricardo and Dejan — Ricardo has a keynote tomorrow, make sure to check it out — they helped us a lot and we are very thankful for their support on our benchmarking journey. And of course the amazing movie Reservoir Dogs, which served as an inspiration for everything you saw today. Thank you — questions?

We have a question here. Yeah, there's a microphone, it's on you.

Hi, Kevin Klues from NVIDIA. I have more of a comment than a question. I'm the one that built the time-slicing and the MIG support, and I drew that picture that you like so much. I just want to add: at some point you said, can we do better? And I'll let you know that yes, you can do even much better than what you have here, so we should talk after this about what you can do.

When you said you're from NVIDIA, I got stressed; I thought I had gotten something very wrong.

Can you combine time slicing with MIG?

Yes, you can do this, this is possible. You basically just replace the resource in the time-slicing configuration: instead of nvidia.com/gpu, you use the resource name of the MIG profile that you want to time slice.
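As a rough sketch of that answer, assuming the same device-plugin sharing config format as before; the profile and replica count here are purely illustrative:

```yaml
# Time-slicing a MIG instance instead of the whole GPU (illustrative values).
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/mig-1g.5gb   # the MIG profile's resource name
        replicas: 2                   # each 1g.5gb instance is advertised twice
```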
An additional question: say you have a new Mr. Red. How do you onboard such a user? Do you use profiling and then give them an indication of what their ideal MIG profile would be, or how does that look?

For this we have a Kubeflow cluster, which is used by multiple users, and that is where we worked on this. We do some profiling for some users, and if their workload actually requires a lot of time to run and a lot of GPU, we test it and point them to the cards that are appropriate for them. Thanks.

More of the same question: how do you define the MIG profile? Because in your example it's very nice, you have a four, a two and a one, but how do you allocate this to the users? Isn't it easier to just do seven 1g instances and let all the jobs run? The overhead of determining the optimal MIG profiles seems way too high.

Yes, this is why we do the benchmarking, to understand the use cases better.

For each user?

For groups of users, let's say GitHub runners, or machine learning training, or CI jobs, things like that; we're kind of grouping them by use case. We are thinking about just partitioning into seven instances, but I think if you do that you lose five gigabytes of memory, and you don't want that. Still, even the smallest partition is quite powerful, so it's a possibility; we're still testing the waters here.

And then, always the same question: would it be better to have some worker nodes with one profile only and other worker nodes with a bigger profile, and then just assign users, or to mix them like you have done?

I think that is a reasonable way to do it: have nodes with bigger GPUs or partitions and nodes with smaller ones, and then assign each user to whichever fits. If in the future you want to centralize the whole GPU effort, to make sure you're using the GPUs to their fullest, then yes, you would have different partitions on different nodes. We're not doing it dynamically, because you have to evict everything, and that just doesn't make sense for us. But yeah, I'd have to agree.

Do we need to buy a specific license? I think you only need the license to have the GPUs, not for the software itself — if I remember correctly; this was a long time ago. For vGPU you do need one, and we talk about it in the blog post. vGPU is actually a bit more difficult to set up, I found, and you need licenses, and it doesn't fit our use case, so we didn't really go down that path, but we do cover it in the blog post, just check it out.

Hi. First of all, thanks for producing all that benchmark data; it's a really useful resource for the community, and I know it takes a long time to produce, so thanks for putting it together. My question is: since you've spent all this time evaluating MIG, have you found it to be better than simply using smaller GPUs, combined perhaps with an autoscaler?

We do use smaller GPUs; T4s are the smallest I think we have used on a cluster. I don't think that would be appropriate, because the more hardware you have, the more power you consume, and it would probably end up costing more, I would say. And basically, what we have is a resource problem.
We have a lot of users who want GPUs, and if they don't find them here, they will go somewhere else, so we try to offer GPUs locally. One of the biggest problems is that users grab a GPU and keep it, so they don't lose access to it. If we split an A100 into seven instances and we have 10 A100s, we have 70 instances, which is much more for users to work with; they can experiment in these small environments, and when they have production code or something that actually requires more processing, we assign them to a new graphics card.

That's great, thank you. I was just wondering where the data in the Grafana charts is coming from. Is it just the DCGM exporter, or do you have some other utility?

Yes, this is all very nice: the GPU operator from NVIDIA provides the DCGM exporter, then we have Prometheus scraping the data, and it's made available — that's it.

We have a question here. Did you ever try to use automatic mixed precision? Sorry? Automatic mixed precision, AMP? I think we do have users doing that, because there is a lot of training going on, but I don't have concrete use cases to give you, unfortunately. I can ask around and come back to you if you want, but yes, we need this and we are really looking into it. Thank you. You're welcome.

Hi, thanks for the interesting talk. I have more of a side note. I saw a lot of monitoring; one case you showed is a user trying to grab too many resources, but we also talked about under-usage. Are you actually using this monitoring to flag users that are not using as many resources as they request — for example, with MIG, someone asking for a bigger partition than they actually need? Are you using that kind of data to guide users to the right profiles, or is it just for your benchmarking?

I'd say we are trying to improve GPU utilization because it's a pain point: we have users that need GPUs. This is an ongoing effort, and the benchmarking is just the beginning. I think we will do this in the future — not really flag users, or send them an email; that sounds bad — but we are going to use the data to understand the users better and give them more of what they need. That would probably be my answer, but not for now. Actually, one other important thing is the profiling metrics: users are users and they want the best, so they just ask for an A100, but their use case doesn't actually use the tensor cores, so maybe a less recent, less expensive GPU would be more appropriate. We do watch this, and we take it into consideration to optimize the resources. Thanks a lot.

Hi. It sounded like some of your workloads are really responsive, like filtering on a 10-gigabit incoming data stream, and on the CPU side you have a lot of history with CFS and optimizing for responsiveness and bulk workloads at the same time. Do you have any experience with how well time slicing works for these mixed cases, putting time-shared and responsive workloads on the same GPU, or is that something you haven't looked into?

I would say — and I'm not sure I'm going to answer your question.
I'm sorry if I don't, but I think that once a workload has lower GPU utilization, because it has some additional communication to perform, the actual loss from sharing gets a bit smaller. This is what we saw in the benchmarking we did: for example, we ran some machine learning jobs with epochs, and there the overhead was definitely not 40%, it was more like 10%, because the workload is already introducing extra latency of its own, and that influences the result. We're still researching it, we don't know everything, but that's probably the direction. Okay, thank you. You're welcome.

Hi there. I was wondering how you select the size of a slice — say one GPU and two gigabytes of memory, or something like that — and how do you monitor and manage it?

As we said before, we don't really select this. For some use cases we do go through the effort of profiling and assigning users, because they are very active users, but generally we want to make as many GPUs available as possible; users test with a small GPU, and then they can request an increase of their quota. That's probably the best scenario you can aim for. Thank you.

I think that's it — no more questions. Thank you for coming, and enjoy your KubeCon.