Hello, everyone. My name is Jack Min Ong. I'm a machine learning engineer at Jina AI, and today I'll be talking about scaling AI workloads with Kubernetes — specifically, sharing GPU resources across multiple containers, and more specifically NVIDIA GPUs, because if you're running AI workloads, the GPU you're using is most likely going to be an NVIDIA one. So this talk will be quite specific to NVIDIA GPUs for now.

Okay, so AI can already do a lot of very interesting tasks, such as document similarity, text classification, and automatic speech recognition, though the ones that have been making the most waves are the text generation models — I believe all of you here have probably used ChatGPT. There are also open-source alternatives that maybe you've tried out, like Phi-1, Falcon, and Llama 2, and the image generation models like Stable Diffusion and Midjourney. What all of these AI services have in common is that they run on neural networks, and neural networks run on GPUs.

But GPUs are expensive. GPUs are also scarce, so if you have been trying to provision GPUs recently, you may have been greeted with a page that looks something like this, where not only is it very expensive to provision your GPUs, but the clouds often don't even have the GPUs for you to provision. So for the limited amount of GPU resources that you can provision, you want to maximize your GPU utilization. Or perhaps one can argue that you must maximize your GPU utilization, because you don't have enough GPUs, and the GPUs are so expensive that your cloud bill may bankrupt your company.

So why Kubernetes? Why not just run my services on bare metal? If you're running services that don't have very high QPS, it probably doesn't make sense to use Kubernetes. But once you start needing scale, you want reproducible services, and you want to be able to port and scale them, so it makes sense to move them onto Kubernetes, because then they will be infrastructure agnostic: I can deploy them on my own infrastructure, or on Amazon's EKS, Google's GKE, or Azure's AKS, and it will all work.

Another reason you might want to go for Kubernetes is that Kubernetes with the GPU operator supports managing the drivers and the shared libraries. One of the issues, if you've ever tried to run GPU workloads, is managing all these drivers and shared libraries. If you're on Kubernetes, you can manage all of that in Kubernetes, and all your nodes will have the drivers and libraries ready for your containers. You can also manage your GPU configurations: if you're deploying a lot of nodes, it gets very annoying to go into each node and set up its GPU configuration, but if you define it in Kubernetes, for example in a ConfigMap, it just applies to all your nodes, and that's really nice. It also comes with GPU monitoring and telemetry, so you can see the usage of your GPUs. And another reason to put your GPU workloads on Kubernetes is simply that your non-GPU workloads were already on Kubernetes, your infrastructure team is used to managing them there, and it would be nice if they could manage everything in one cluster — but I guess this depends on your organization.

Okay, so this will be the outline for the talk.
So I'll first start off by talking a little bit about resource management, specifically about Multi-Instance GPU, because I guess people aren't too familiar with that, and then about how to manage your GPU resources in Kubernetes. After that I'll talk about sharing a GPU between containers — how you can get multiple containers to use the same GPU device, so you can utilize it better. And at the end I'll briefly mention some techniques for optimizing deep learning workloads, because Kubernetes can help you allocate your infrastructure efficiently, but if your workloads themselves aren't running efficiently, then it's not going to be that helpful.

Okay, so let's start with understanding GPU resource management. For the newer and more powerful cards, NVIDIA has a feature called Multi-Instance GPU, or MIG. It grew out of the observation that when you're running inference workloads, a lot of the time they are not going to fully utilize your card: your card may have 80 gigs or 40 gigs of VRAM and a lot of streaming multiprocessors, but your inference workload maybe only peaks at five gigs or so. So they have this feature called Multi-Instance GPU where you can basically shard your GPU: you only have one physical GPU device, but it can advertise itself as multiple devices — specifically, a maximum of seven GPU instances. So you can have a node that only has one A100, but it will advertise itself as seven smaller A100s.

The way it works — in this example I'm using the A100 with 40 gigs — is that the card is divided into eight memory slices and seven compute slices. Each memory slice contains one eighth of the total VRAM you would have had on the full card, and each compute slice contains around one seventh of the total number of streaming multiprocessors on the GPU. For those of you who aren't super familiar with GPUs: streaming multiprocessors are the units that execute your kernels — basically, the compute resource on the GPU. I say around one seventh because you'll notice that 108 is not divisible by seven, so it's not exactly one seventh, and sometimes when you combine slices they will actually give you a few more streaming multiprocessors than the individual slices add up to. This is just something specific to the card, but generally each compute slice is around one seventh of the compute resources.

Here is the same picture for the other cards. If you're using the 80-gig A100 or the H100, it's still eight memory slices and seven compute slices, just with different amounts of resources in each slice: now each memory slice has 10 gigs of VRAM, and each compute slice is still one seventh, just with more streaming multiprocessors.

And why seven? There are eight memory slices but only seven compute slices — it's kind of a weird number, right? Why did they go for seven? NVIDIA's official reasoning is that with eight, the chip yield is not good enough; the way they designed the chip, the yield is acceptable at seven compute slices but not at eight. So this is just a peculiarity about the way Multi-Instance GPU works.
You just have to remember that there are eight memory slices but only seven compute slices, and that does cause some issues later on — I'll point them out as I show you how to partition.

Okay, so let's start partitioning. The first example here is the simplest: you can request a slice that contains one compute slice and one memory slice. The way you specify slices is with a canonical name you just have to remember. The name here is 1g.5gb, which means I am taking one GPU compute slice and the corresponding 5 GB of VRAM. You can also specify a 2g.10gb slice. When you go for three, for some reason you cannot specify 15 GB — it suddenly becomes a 3g.20gb slice. This is just something you have to remember; NVIDIA designs their cards in peculiar ways. Then when you go for four, you get a 4g.20gb slice, but you cannot go for five, and you also cannot go for six; the next number you can go for is seven, and when you provision seven compute slices you get 40 GB of VRAM. So the rule to remember is that you can only provision one, two, three, four, or seven compute slices, and if the number of compute slices is not a power of two — so not one, two, or four — you get an extra memory slice with it. That's why seven gives you 40 GB rather than 35, and three gives you 20 GB rather than 15.

You can obviously also create multiple partitions. Here I've partitioned the card into seven 1g slices, so this card will advertise itself as seven GPUs, each containing one slice worth of resources. And you can also do a 4g slice plus a 3g slice. The important thing to note here is that if you're not partitioning through Kubernetes — if you're doing this with nvidia-smi — the partitions are placed from left to right, which means the order in which you create them matters. You can see here that I can actually fit two partitions, one 4g and one 3g, but if I define the 3g one first, it will not work: the 3g.20gb takes up this part of the chip and locks you out of one compute slice, and then when you try to provision the 4g.20gb, you don't have enough compute slices, because it can't reach the compute slice sitting next to that first 3g partition. The same thing happens if you create the 3g partition and then try to provision four 1g slices: the first three 1g slices are fine, but for the last one, that part of the card does not have a free compute slice, and it can't use the one already taken by the 3g partition. So the general rule when you partition your GPU is: just don't create the 3g partition first. If you create your four 1g slices first and then your 3g slice, everything fits on the card and everything is nice.

You can also further partition the GPU instances into compute instances. The compute instances share memory, but they have separate amounts of compute. This also means you lose some isolation: with GPU instance slices you get memory isolation, compute isolation, and also error isolation, so if a process running on one slice errors out, it won't kill the processes on the other slices.
But with compute instances, since they share the same GPU instance — you can specify which compute instance you want for your process — they are in a way tied to each other: because they share memory, one can OOM the other, and they don't have error isolation, so if one errors and the GPU instance has to reset, it kicks off the other process too. Compute instances aren't too important for this talk, though, because precisely since you can't get this nice isolation, Kubernetes does not support them at the moment. So all you really need to remember is that we can partition GPUs; the compute instance detail is not super important.

Okay, so now let's get to the main deal, which is: how do I manage my GPUs on Kubernetes? This is pretty easy, because NVIDIA has made it very easy for us. Basically, you just have to install NVIDIA's GPU operator. I've put a link over here to the installation instructions. They're pretty comprehensive — I guess NVIDIA has helped a lot of their clients install the GPU operator — so they cover a lot of different types of clusters. All the main cloud offerings are supported, so if you're using EKS, GKE, or AKS, you're covered. It also supports bare metal and virtual machines, and if you're on NVIDIA AI Enterprise you may have vGPU, which is also supported. And it supports all the container runtimes: Docker, containerd, and CRI-O. I've put an example of how you would install it on bare metal, configured for Ubuntu. Basically it's just how you would install any other operator or Helm chart, so if you're familiar with installing Helm charts, this should just make sense to you.

Okay, once you have installed the GPU operator, there will be a DaemonSet called the NVIDIA device plugin. What the device plugin does is go to your nodes, find out how many cards each node has, and expose that in a label — you can see here that this node has two GPUs, so it's labeled with a GPU count of two. There is then another DaemonSet called GPU Feature Discovery that finds out the capabilities of your card: what family it is — in this case an A100 PCIe with 40 gigs of VRAM, Ampere architecture — as well as the driver version, the CUDA runtime version, and the compute capability. It exposes all of these as labels too.

Okay, so once you have your nodes with GPU resources, it's pretty easy to consume them. Basically, you just define a pod spec, and inside the resource limits you specify how many GPUs you want under the key nvidia.com/gpu. So in this case, this pod spec requests one GPU. If you want to select which type of GPU you get, then the labels that the GPU Feature Discovery DaemonSet set up in the previous slide become helpful: you use a node selector and specify the particular GPU product that you want, and the pod will be provisioned on a node with that GPU type, and everybody's happy.
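To make that concrete, here is a minimal sketch of such a pod spec. The pod name, container name, and image are just placeholders, and the exact value of the product label depends on what GPU Feature Discovery reports for your own cards:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                 # hypothetical name
spec:
  nodeSelector:
    # Only schedule onto nodes whose GPU Feature Discovery product label matches.
    # The value here is just an example; check the labels on your own nodes.
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40GB
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any CUDA-capable image
      resources:
        limits:
          nvidia.com/gpu: 1         # request one whole GPU device
```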
Okay, when you have nodes that contain GPUs, you want those nodes to be dedicated only to GPU workloads. The reason is that if pods that aren't requesting GPUs run on a node that has GPUs, they can take up the node's other resources and block the pods that do need the GPU from being scheduled, and then you're wasting the GPU: the node is an expensive one, but it's not actually using its full resources. You may also get into a situation where your autoscaler should have been able to spin down the node, but because a non-GPU workload is running on it, it can't. So you want to make it so that the nodes that contain GPUs only run workloads that need GPUs. The solution for this is basically just to taint those nodes, and then in the pod specs of your GPU workloads you add a matching toleration, so that the pods that need GPUs can still be scheduled on the GPU nodes, but the pods that do not need GPUs cannot.

Okay, you can also define resource quotas. If you have one cluster that is used by multiple teams, and each team has access to a different namespace, you can put a resource quota in each namespace and limit the number of GPUs that team can provision. This is a classic Kubernetes thing — it's not really specific to GPUs — but this is how you would specify resource quotas.

Okay, so now let's talk about sharing a GPU between containers. If you are familiar with writing pod specs for non-GPU workloads, you may know that you can make fractional requests. You can see over here a fractional request for CPU, 100m, i.e. 100 millicores, which is 10% of a CPU — it specifies that when this pod is running, only 10% of the CPU time will be spent on it. But for GPUs, because they are exposed through a device plugin, there are some limitations. Device plugins in Kubernetes have two notable limitations: they only support integer resource requests and cannot be overcommitted, and devices cannot be shared between containers, which I guess makes sense. That means that if you write a pod spec that requests a fractional GPU, it won't work — I can't ask for 20% of a GPU here. It also means that you cannot undercommit your GPU requirements: you can't specify a limit that is different from your request; your request must always equal your limit. That's why in the pod spec I showed you earlier you just directly define the limits — if you don't give a request, your request defaults to the limit, and the pod spec just works.

Okay. So the way to get around this and still be able to get multiple containers to use the same GPU is basically to cheat Kubernetes: you expose integer resources, but those integer resources actually represent fractions of a card. I guess that this is why I introduced Multi-Instance GPU. One way to do this is through MIG: basically, you just label your node, and in this example I've provisioned one A100 node to expose seven 1g slices. So you can now see that even though it is one physical card, it is exposing seven GPU resources, and when you define a pod spec that takes one GPU, to Kubernetes it looks as if you got one device, but that device is actually one seventh of the power of the full card.
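As a rough sketch of what that looks like — assuming the default setup where the slices still show up as plain nvidia.com/gpu (I'll get to the single versus mixed naming in a moment), and with hypothetical node and pod names — you label the node with the MIG profile you want, the MIG manager repartitions the card, and the pod spec itself doesn't change:

```yaml
# Label the node so the MIG manager reconfigures its A100 into seven 1g.5gb instances,
# e.g.: kubectl label node my-gpu-node nvidia.com/mig.config=all-1g.5gb --overwrite
# Afterwards the node's allocatable resources show nvidia.com/gpu: 7.
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-example           # hypothetical name
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
      resources:
        limits:
          nvidia.com/gpu: 1         # one 1g.5gb slice, not the whole A100
```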
Okay. In order to do this, you basically just provide the right values in the Helm chart: you enable the MIG manager, and there's also this strategy setting. For the strategy you have two choices, single or mixed. If you go for the single strategy, the resource is exposed as nvidia.com/gpu, the same as a full card. If you go for the mixed strategy, it is exposed as nvidia.com/mig- followed by the partition name, for example nvidia.com/mig-1g.5gb. The mixed strategy makes life easier if you have some nodes with full GPUs and some nodes with sharded GPUs, because when you write your pod spec you can specify which one you want — the full GPU or a particular shard — whereas with the single strategy, your sharded GPU resources are exposed as if they were full devices. This is also quite nice, though: with the single strategy, the product label actually changes. The product was originally A100-PCIE-40GB, and with the single strategy it gets an extra 1g.5gb appended, so if you write your node selectors carefully, you can still tell that this node carries sharded GPUs. With the mixed strategy, since the information that the GPU is sharded is already in the allocatable resource names, the GPU product label doesn't need to change.

Okay, so just now I told you that you can label your nodes with this all-1g.5gb — where does that come from? Basically, it comes from profiles, and the profiles are in the same format as the mig-parted CLI uses. When NVIDIA first released MIG there was no GPU operator, so you had to configure everything manually, and I guess people got annoyed, so they released a CLI called mig-parted that lets you save profiles and apply them automatically to configure your GPUs. When they later released the GPU operator, they didn't want to reimplement it, so the MIG manager just uses mig-parted internally, and they share the same config format. So basically, you define a config map, you restart your MIG manager, and it reads the config.

Okay, another strategy you can go for, other than Multi-Instance GPU, is time slicing. With time slicing you take the full GPU — or actually you can time-slice your MIG instances as well — and you have multiple processes that take turns on the GPU. In this illustrated example, it runs a little bit of the first process, then context-switches and runs the second process, context-switches and runs the third one, context-switches and runs the fourth one, and then goes back to the first — basically round-robin context switching.

Okay, the way to define this is again with a config map, and here the important thing to note is the replicas: the number of replicas is the maximum number of pods that can be scheduled onto one of your time-sliced GPUs. I have defined two profiles here: one slices each GPU into four replicas, the other slices each GPU into ten replicas.
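Here is a hedged sketch of roughly what that config map can look like. The config map name and profile names are placeholders, and the field names come from my reading of the device plugin's time-slicing config, so double-check them against the GPU operator documentation:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config         # hypothetical name, referenced from the GPU operator config
  namespace: gpu-operator           # the namespace the operator was installed into
data:
  # Profile 1: advertise each physical GPU as four schedulable replicas.
  four-replicas: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
  # Profile 2: advertise each physical GPU as ten schedulable replicas.
  ten-replicas: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 10
```

Each node is then pointed at one of these profiles with another node label, and the profile a node picks determines how many replicas it advertises.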
And you can see that if you slice your GPU into four replicas, your node will now be labeled with nvidia.com/gpu.replicas, and the number of allocatable resources it exposes is your GPU count multiplied by the replicas. This node originally contained four GPUs, and I specified four replicas for each GPU, so the total number of GPU devices exposed to Kubernetes is now 16. You can also see that if I specify ten replicas, the label changes and I now have 40 allocatable resources.

Okay, there's also this setting over here, renameByDefault. If you set renameByDefault to true, the resource is exposed as nvidia.com/gpu.shared, so when you write your pod spec, it's explicit that you are requesting a shared resource. If you set it to false, it is just exposed as the plain vanilla nvidia.com/gpu resource — but, same as with MIG, the information is still kept, just inside the labels: the nvidia.com/gpu.product label now has a -SHARED suffix at the end, so if you write your node selectors accordingly, you can still differentiate between nodes that are exposing shared resources and nodes that are just exposing full GPUs.

Okay, so how does MIG, the Multi-Instance GPU, compare with time slicing? Basically, if you use MIG, the partition is physical — it's literally on the chip — and on the A100 and A30, enabling MIG actually resets the GPU; I think on the H100 you can do it dynamically. Time slicing is logical: to the device you still only have one GPU, and Kubernetes manages all the logic of the context switching, the round robin, and making sure the pods get their turn on the GPU. With MIG you have a maximum partition count of seven, since it's baked into the chip; with time slicing you can have any arbitrary number of partitions, although you probably don't want too many, because then the context switching costs you a lot. In terms of QoS, with MIG your streaming multiprocessors and your memory are always the same, whereas with time slicing it depends on how many other pods are on that node: if the node contains two pods, each of them gets 50% of the time on the streaming multiprocessors, but if the node contains ten pods, you only get 10%. So it is variable depending on the occupancy of the node, and you don't get nice QoS on compute. You also don't get nice QoS on memory, because Kubernetes does not enforce memory constraints on the GPU, so the pods can OOM each other. And you don't get error isolation: if one of the processes fails and the GPU has to reset, it kicks out the other pods. For reconfiguration, MIG requires a GPU reset, whereas time slicing can be changed dynamically. And in terms of GPU support, MIG at the moment only supports the H100, A100, and A30, whereas time slicing can be used on most GPUs, since it's not something that has to be on the card — it's handled logically in Kubernetes.
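Before moving on, here is roughly what requesting these shared resources looks like in a pod spec — the first assuming the mixed MIG strategy, the second assuming time slicing with renameByDefault turned on. Names and images are placeholders:

```yaml
# Mixed MIG strategy: ask for a specific MIG profile by its resource name.
apiVersion: v1
kind: Pod
metadata:
  name: mig-mixed-example
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1      # one 1g.5gb MIG instance
---
# Time slicing with renameByDefault set to true: ask for the renamed shared resource.
apiVersion: v1
kind: Pod
metadata:
  name: time-sliced-example
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu.shared: 1      # one time-sliced replica of a GPU
```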
Okay, so now I'll briefly go through some techniques for optimizing deep learning workloads. The first optimization — I think quite a lot of people know this one — is to go for lower-precision arithmetic. If you have your model in full precision, a lot of the time you can just run it at half precision: it runs about two times faster at half the VRAM. You may lose some amount of accuracy, so you want to test this before you actually do it, but most of the time you don't lose that much, especially if you're doing diffusion models — they basically always run at half precision, because to the human eye there's no difference between images generated at half precision and full precision. So you just get the two-times speedup and half the VRAM requirement.

If you're running transformers, you want to look into attention slicing and FlashAttention. If you're familiar with how transformers work, the attention mechanism has to calculate an attention score matrix, and that matrix scales as O(n²) in your sequence length. But you end up taking the softmax of this attention matrix anyway, so you can fuse the operations and compute the softmax as soon as you've computed the attention scores for a whole row. If you use these fused operations, either through attention slicing or through FlashAttention, then you can do the attention operation in O(n) memory.

If you're running language models, you also want to look into speculative decoding. Speculative decoding comes from the observation that small batch sizes are very inefficient for language model inference — one of my colleagues tried this out, and if you run inference at a batch size of one versus a batch size of ten, you basically get your response in the same time. So what you want to do is run a small model that drafts, say, ten tokens ahead, then batch those tokens through your large model to verify them; if the tokens happen to match, great, you just skip ten decoding steps. The main big idea is that you can use a small model to increase the effective batch size of a large model. I haven't explained it very well, so if you actually want to implement this, just Google speculative decoding and read up on it.

Okay, the other method you can go for is distillation, where you train a small model using a large teacher model — basically you're trading off accuracy for performance.

Okay, so in summary: Kubernetes can be used to easily share GPU resources among your AI workloads; the NVIDIA GPU operator can be used to manage and configure the nodes with GPU resources; and you can use MIG and time slicing to oversubscribe your GPU devices — that is, to have multiple containers use the same GPU device.

Okay, so here are the references if you want to check this out yourself — most of them are docs, some of them are talks. And okay, that's it. Thank you.

Okay, so now I guess I will take questions. Any questions? Is there a mic for the people with questions, or do they just shout? I'll repeat the questions if that makes it easier.

Okay, so the question was: when you reconfigure MIG, is the reboot just for the GPU driver or for the whole system? The answer is that it's only for the GPU — basically, you have to get all processes off the GPU, and then the GPU will reset.
So if you manage your processes well, you can kick all the processes off the GPU and you won't need to reset the whole system. But a lot of the time there are processes that are hard to track down — you may have a telemetry process looking at your GPU, and because of that the GPU can't reset, and then you're forced to reboot the whole system. So from my experience it often translates into rebooting the whole system, but you can actually get away without that if you are able to clear all the processes off the GPU.

Okay, any other questions? Okay, so the question was: can you also mix MIG partitioning with time slicing? The answer is yes, you can. It means you get a MIG partition, and that MIG partition is then time-sliced across the pods that are scheduled on it.

Sorry, can you come again? I couldn't quite hear you. Oh, okay — yes, you can enable that in the GPU operator. But for the GPU operator, I think if you're on enterprise use cases, yes, you need a license. Okay, any other questions?

Okay, if not, then I'll briefly introduce our company. We are Jina AI, an AI company working on multimodal neural search, founded in February 2020. We currently have 60 members spread across three offices: one in Berlin, where I'm from, one in Beijing, and one in Shenzhen. We raised a $30 million Series A, and yeah, we're a top AI company. So if you want to join us, you can check out the link over here, jina.ai/careers. We also have an internship program at jina.ai/internships — that was on a slide I haven't uploaded yet, so you'll have to find it yourself, or just Google "Jina AI internships"; it should be the first result. So yeah, thank you for coming to my talk. Have a nice day.