So hello everyone, I'm Jack Min Ong. I'm a machine learning engineer at Jina AI, and today I'll be talking about scaling AI workloads with Kubernetes — specifically, sharing GPU resources across multiple containers, and particularly NVIDIA GPUs, because if you're running AI workloads, chances are you're using NVIDIA GPUs.

First, why do you need GPUs for AI workloads? Most of you are familiar with the big advances AI has made in the past year, particularly in text generation, which came up a lot during this morning's keynote, but models can also do a lot of other things: document similarity, text classification, automatic speech recognition, and so on. Right now the state-of-the-art models are deep learning models — neural networks — and they get the highest throughput and run most efficiently on GPUs. So if your company is interested in serving these kinds of models, you'll want GPU capacity.

But if you've ever tried to provision GPUs, you'll have quickly noticed that they're expensive, and also scarce. If you've tried to provision a GPU recently, you may have been met with a screen that looks like this — I think this one is from Lambda — saying that even if you have the money, they don't have the capacity to serve you. So with the limited number of GPUs you can provision, you want to maximize your GPU utilization. Really, since you can't provision as many GPUs as you'd like, you have to maximize utilization.

Then, why Kubernetes — why not just deploy on bare metal? First, you get the usual niceties of running container images: your services are portable and easy to scale. If your container runs locally, it will most likely run on the cluster, and it will behave the same whether you're running two pods or scaling up to 10 or 100 deployments. Another reason, specific to GPUs, is that the annoying part of scaling GPU resources is managing the drivers and shared libraries; if you use Kubernetes to do this management, it's unified and all your nodes look the same, so it helps you manage your GPU drivers and shared libraries. Another annoyance is configuring the GPUs themselves if you're doing any partitioning, and Kubernetes can also help you manage GPU configurations at scale. It also comes with GPU monitoring and telemetry, so you can see your GPU usage and whatever other metrics you're interested in. A final, more organizational reason is that your non-GPU workloads may already be running on Kubernetes: your SRE team is familiar with Kubernetes, and it would be nice if they could use the same stack they already know to manage all the resources in your company.

Okay, here's the outline for my talk. I'll first talk about GPU resource management, specifically multi-instance GPU, because I think people aren't super familiar with it.
Then I'll talk about how to manage GPU resources in Kubernetes, then sharing a GPU between containers, and lastly some techniques for optimizing deep learning workloads — because even if you can provision GPUs efficiently, there's not much point if your workloads can't use them efficiently. And most of the time, if you want GPUs, you're running deep learning workloads, so I'll briefly cover some optimization techniques.

Okay, so first, understanding GPU resource management, starting with multi-instance GPU (MIG). The idea behind MIG is that these big, powerful cards come with a lot of resources, but when you're running inference — sometimes with a smaller model — you don't actually need the full power of the card. MIG lets you take one big, powerful GPU and shard it into smaller GPUs, up to a maximum of seven GPU instances.

When you shard the GPU, it is split into seven compute slices and eight memory slices. Each memory slice contains one eighth of the total VRAM on the card. In this example — the A100 with 40 GB of VRAM — each memory slice has 5 GB. For compute, each compute slice gets roughly one seventh of the total streaming multiprocessors on the card. For those of you who aren't familiar with GPU architecture, streaming multiprocessors (SMs) are basically the cores on the GPU that run the kernels and do the operations; the more of them you have, the faster you can run parallel operations. So each compute slice gets roughly one seventh. You'll notice that 108 is not actually divisible by seven, so it's not exactly one seventh: you get slightly less when you provision one slice, and sometimes slightly more when you combine slices, but if you combine all the slices you get the full 108.

These are the specs for the other cards. On the A100 with 80 GB of VRAM, the SM count is the same, but since the memory is still divided by eight, each memory slice now has 10 GB. On the H100 there are more streaming multiprocessors, so each compute slice gets more SMs, but it's still roughly one seventh.

Why seven? It is kind of weird that there are eight memory slices but only seven compute slices. The official reason from NVIDIA is that the chip yields are better with seven than with eight. I don't really understand the details, but you have to remember this peculiarity when using MIG: eight memory slices, seven compute slices. It causes some issues with partitioning, which I'll come back to when I talk about how to do the partitioning.

When you want to do the partitioning, you specify GPU instance profiles. Each profile has a canonical name with two parts: the first part specifies how many compute slices you want for the GPU instance, and the second part specifies the VRAM. So what you see here is a 1g.5gb GPU instance profile.
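To make this concrete, here's a rough sketch of how you'd look at these profiles directly on a node with nvidia-smi (assuming a MIG-capable card like a 40 GB A100 at GPU index 0; enabling MIG mode needs root and may require the GPU to be reset):

```bash
# Enable MIG mode on GPU 0 (requires root; the GPU must be idle and may need a reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the card supports (1g.5gb, 2g.10gb, 3g.20gb, ...)
sudo nvidia-smi mig -lgip
```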
So with 1g.5gb you get one compute slice and 5 GB of VRAM, which is one memory slice. You can also specify two, and you get two memory slices — a 2g.10gb GPU instance. When you go to three, you somehow don't get 15 GB; you actually get 20. And with 4g you also get 20. You can't specify just any number: five doesn't work, six doesn't work, and the next one that works is seven. With seven compute slices you get 40 GB of VRAM.

The VRAM numbers do look odd, but there's a rule: you can only ask for one, two, three, four, or seven compute slices, and if the number of compute slices is not a power of two, you get one extra memory slice. For example, one is a power of two (two to the zero), so you get one memory slice. Four is also a power of two, so you get four memory slices. But three and seven are not powers of two, so you get one extra: three compute slices gets three plus one, i.e. four memory slices, which is why the 3g profile comes with 20 GB; seven gets seven plus one, i.e. eight memory slices, which on the 40 GB A100 means you get the whole 40 GB (and on the 80 GB card, the whole 80 GB).

You can obviously specify multiple GPU instances. Here the GPU is partitioned into seven one-compute-slice 1g.5gb instances, and you can also specify a 4g instance together with a 3g instance. If you're using nvidia-smi to do this partitioning, the ordering of the slices actually matters: nvidia-smi gives you profiles to choose from and fills up the chip from left to right. The implication is that if you specify the 3g instance first, the layout won't work. You can see that if I specify the 3g first, it takes four memory slices (20 GB); when I then try to specify the 4g, it fails, because it sees four free memory slices but the compute slice it's missing is locked away behind the 3g instance, and it can't jump over to take it. The same goes for this partition: the 3g is fine, the first few 1g instances are fine, but the last 1g can't be placed, because the remaining compute slice sits on the other side of the 3g instance and can't be reached. So if you're provisioning manually, the general rule is: don't provision the 3g instance first. If you put it last, your configuration basically always works — specify all the 1g instances first and do the 3g last, and it will work.
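As a sketch, creating that last partition with nvidia-smi would look roughly like this (the 1g instances listed before the 3g, per the rule above; profile names or their numeric IDs both work with -cgi):

```bash
# Create four 1g.5gb GPU instances first, then the 3g.20gb one;
# -C also creates the default compute instance inside each GPU instance
sudo nvidia-smi mig -cgi 1g.5gb,1g.5gb,1g.5gb,1g.5gb,3g.20gb -C

# Verify the resulting GPU instances
sudo nvidia-smi mig -lgi
```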
There's also the concept of compute instances. So far I've talked about slicing the GPU into smaller GPU instances; you can further slice a GPU instance into smaller compute instances. Each compute instance gets a subset of the parent GPU instance's streaming multiprocessor slices, depending on how many compute slices you specify for it. In this example the profile is 2c, so each compute instance gets two compute slices.

These two compute instances share memory and engines, and because they share memory and engines, you don't get nice isolation: one process can OOM the other, and if a process on one compute instance errors, it resets the whole GPU instance, which also kills the process running on the other compute instance. Because of this, you don't really need to be familiar with compute instances if you're using Kubernetes — they aren't supported there, precisely because you don't get the error isolation: one pod erroring could kill another pod that happens to share the same GPU instance.

Okay, so now let's talk about managing GPU resources in Kubernetes. First you have to install the GPU operator. This is pretty easy: go to the documentation and find whatever configuration you have. NVIDIA has many customers, and those customers have many different Kubernetes deployments, so the support matrix is pretty comprehensive. For the classic cases — am I supported on the three big clouds? — yes: EKS, GKE, and AKS are supported. If you're using bare metal, virtual machines, or vGPU, NVIDIA's GPU virtualization offering, you're also supported. And if you're running bare metal, you may also care about which container runtimes work — basically all the ones you'd be interested in: Docker, containerd, or CRI-O. The way to install it is to add the NVIDIA Helm repo and install the Helm chart. If you're familiar with installing things on Kubernetes, this should just make sense, because everything is a Helm chart, right?

Once you've installed the Helm chart, here's what happens. You get a DaemonSet called the NVIDIA device plugin, which goes to your nodes and labels them with the number of GPUs on each node. So if there are two GPUs on your node, you get the label nvidia.com/gpu.count=2. There's also another DaemonSet called GPU feature discovery, which is in charge of discovering the capabilities of the node: it labels the GPU architecture (Ampere in this case), which GPU it is (an A100 PCIe with 40 GB of VRAM), and some information about the CUDA runtime and driver versions.

Then, when you want to consume a GPU resource, you just define a pod spec, and in the resources you specify how many GPUs you want in the limits. To choose the type of GPU, you use a node selector: as I showed on the previous slide, GPU feature discovery labels your nodes, so if you want a specific type of card, you add a node selector so the pod is scheduled onto a node labeled with that card.
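For illustration, a minimal pod spec along those lines might look like this (the image name is a placeholder, and the exact gpu.product label value depends on your GPU feature discovery version — check your node labels first):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-example
spec:
  containers:
    - name: inference
      image: registry.example.com/inference-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1            # whole GPUs only; no fractions
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40GB   # label set by GPU feature discovery
```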
Now some best practices. If you're going to run GPU workloads, you want nodes that run non-GPU workloads to be treated differently from nodes that can run GPU workloads. You don't want pods that don't need a GPU to be schedulable on nodes containing GPUs, because you can end up wasting resources: a workload that needs a GPU might not be able to schedule onto a GPU node because some non-GPU workload is already running there and consuming too much CPU or RAM. It also affects your autoscaling: you might have been able to scale a node down, but you can't because a non-GPU pod is sitting on a GPU node — so you're paying GPU prices without actually using the GPU. The solution is to taint your GPU nodes, for example with the key nvidia.com/gpu, and then add a toleration to the pod spec of your GPU workloads so they can still schedule onto the GPU nodes. With that in place, any non-GPU workloads running on your cluster are no longer schedulable on nodes that contain GPUs, because they can't tolerate the taint.

You can also specify resource quotas. If you have different teams and you want to limit the amount of resources each team can use, you give each team access to certain namespaces and define a ResourceQuota inside each namespace. This isn't specific to GPUs — it's just a Kubernetes feature — but that's how you do it.

Okay, now let's talk about sharing a GPU between containers, starting with the pod spec. For CPU and memory you can specify both a request and a limit, and you can specify fractional resources — for example 100m of CPU, which consumes 10% of a full core. But because the GPU operator exposes NVIDIA GPU resources through the device plugin API, there are certain limitations. One is that your resource specifications must be integers — whole non-negative numbers, in fact, since you can't request a negative GPU. Another is that GPUs cannot be overcommitted, meaning you can't specify a request that differs from your limit. And devices cannot be shared between containers. So, concretely: you can't request 200m of a GPU — 20% won't work, it has to be an integer. And you can't say "my limit is two GPUs but I only request one". Your request must always equal your limit, which basically means: don't specify a request at all; specify only the limit, and the request will be the same as the limit.

Okay, so let's talk about MIG. If you want to do MIG through Kubernetes, you have to enable the MIG manager in your GPU operator Helm chart. There are two values of interest: migManager.enabled, which I think defaults to true, so you may not need to set it, and mig.strategy, which I'll talk about in a moment.
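A rough sketch of what that install looks like with the MIG-related values set (chart and value names as in the GPU operator documentation; adjust the namespace and strategy to your setup):

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU operator with the MIG manager enabled and the "single" MIG strategy
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set migManager.enabled=true \
  --set mig.strategy=single
```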
Once you have the MIG manager enabled, you need to create a ConfigMap containing YAML that defines profiles for how to partition a node. For example, the all-1g.5gb profile here says: I want seven 1g.5gb slices. And the all-balanced profile says: I want two 1g.5gb slices, one 2g.10gb slice, and one 3g.20gb slice. The syntax follows mig-parted, the CLI tool you'd use to do MIG partitioning on bare metal — the MIG manager actually uses mig-parted internally, which is why the syntax is the same.

Once you have these profiles, all you need to do to apply one to a node is label the node with the profile name (the MIG manager watches the nvidia.com/mig.config label). So since I defined this all-1g.5gb profile, if I label the node with all-1g.5gb, it applies that profile to the node. What happens is that the gpu.product label now has a suffix of MIG-1g.5gb, and the number of allocatable GPU resources is now seven — it would have been one without the partitioning.

That was with mig.strategy set to single. With the single strategy, your MIG devices must be homogeneous, because the type of GPU instance you're using is encoded as a suffix in the node labels, while the allocatable resource is still just nvidia.com/gpu — if you had used a profile that mixes 1g, 2g, and 3g instances, you wouldn't be able to tell them apart. If you go for the mixed strategy instead, you can actually mix partitions, and it changes the naming convention: the gpu.product label keeps the original A100 PCIe 40 GB name with no suffix, and the exposed resource now carries the MIG partition instead — it exposes nvidia.com/mig-1g.5gb, seven of them. If you had applied a mixed partition profile, it would expose whatever mixture of GPU instances you configured.

Okay, now let's talk about time slicing. Time slicing is the idea that, instead of doing a hard partition of the GPU, you just let a certain number of processes take turns on it. If I have four processes, I want them to be round-robin context switched, each getting roughly 25% of the GPU's time. The way to do this is to create a ConfigMap and specify the number of replicas you want for the GPUs. For example, in the top-right example the original GPU count is four; if you specify a replica count of four, you're saying that each GPU may be shared by four different processes. Your actually exposed resources become the product of the GPU count and the replicas, so the number of shared GPU resources you have is now 16, i.e. four times four; with 10 replicas you'd have 40. You'll also notice the flag renameByDefault: true in this config — what it does is make the allocatable resource be called nvidia.com/gpu.shared.
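As a sketch, such a time-slicing ConfigMap might look like this (the schema follows the device plugin's sharing config and may vary slightly between operator versions; you also have to point the operator at it, e.g. via the devicePlugin.config.name Helm value):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  default: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true        # expose nvidia.com/gpu.shared instead of nvidia.com/gpu
        resources:
          - name: nvidia.com/gpu
            replicas: 4              # each physical GPU is advertised as 4 shared GPUs
```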
You can also set it to false — it's really a preference for how you want to present your infrastructure. With false, the allocatable resource goes back to the original nvidia.com/gpu name, but the product label changes instead: it gets a SHARED suffix, so you still know that the resource you're consuming is shared.

If you're trying to choose between the two — and you can actually use both at the same time — there are some considerations. First, the partition type: MIG is a physical partition, on the card, while time slicing is logical — it's just oversubscription with round-robin context switching handled in software. For MIG, the maximum number of partitions on any card is seven; that's fixed by the hardware. For time slicing it's unlimited, since it's logical, but you generally don't want it too high, because the context switching can end up costing you more than you save. In terms of QoS, since MIG is a physical partition you get both compute QoS and memory QoS: the memory and the compute throughput each pod sees is always the same. With time slicing it depends on how many pods share the GPU — with three processes each gets about 33% of the GPU, with two they get 50% — so you may see different behavior depending on how busy the node is. And you get no memory QoS at all, because there's no limit on how much VRAM any one process can allocate; one process can OOM the others, since going over the memory resets the whole shared GPU. You also don't get error isolation: with MIG, an error on one GPU instance resets only that GPU instance, whereas with sharing it resets the whole GPU, which may be hosting multiple pods — so the other pods, which did nothing wrong, are also forced into the reboot and error as well. In terms of reconfiguration, changing a MIG layout requires a GPU reset, whereas time slicing can be reconfigured dynamically. And in terms of GPU support, MIG is a hardware feature, so only some cards support it — the A100, the A30, and the H100 — while time slicing, being logical, works on most GPUs. To check a card, you can run nvidia-smi and look at the MIG mode field ("MIG M."): N/A means the card has no MIG capability, and Enabled or Disabled means it is MIG-capable and tells you whether MIG mode is currently on or off.
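A quick way to check that from the command line (assuming your nvidia-smi version supports the mig.mode query fields):

```bash
# "N/A"      -> the card has no MIG support
# "Disabled" -> MIG-capable, MIG mode currently off
# "Enabled"  -> MIG mode currently on
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv
```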
Okay, so now let's talk about some techniques for optimizing deep learning workloads. First, low-precision arithmetic. Most models are distributed in full precision — they're trained in mixed precision, but the master weights are kept in full precision. When you're running inference, though, most of the time you don't lose any accuracy by going to half precision, and you'll actually run more than 2x faster, because you get to use the Tensor Cores, and you'll use half the VRAM, since each weight takes half as many bytes as in full precision.

You can also do attention slicing or FlashAttention. The idea is that the classic attention calculation requires you to materialize the full attention matrix, but with FlashAttention you compute it in tiles with a running softmax. That means you don't need as much VRAM, and because you never materialize the whole attention matrix you also get a caching benefit: it runs faster because there are fewer transfers between the device VRAM and the GPU's on-chip memory. So look into FlashAttention and attention slicing. One caveat with FlashAttention is that it isn't implemented for every attention variant: the standard attention calculation is covered, but if you're using attention biases, support is still being worked on — there's a PR that's been open for a few months that hasn't been merged.

The next one is speculative decoding. When you're generating with a language model, you're often wasting compute, because the model spends time producing tokens that are so obvious a much smaller model could have generated them. In speculative decoding, a small model predicts ahead, and then you pass the whole draft into the big model in one batch. The issue normally is that generation is sequential, so the big model is effectively stuck at batch size one; but if the small model has predicted, say, 10 tokens ahead, the big model can verify those 10 positions in a single pass. Based on where the big model agrees with the draft, you can skip ahead in the generation — if the generations match up to draft token number five, you jump straight to five. This gives your large model a higher effective batch size and lets you run it faster.

You can also look into distillation: you have a large model that performs very well, but you don't have the resources to serve it, so you try to distill the behavior of that large model into a smaller one. You obviously trade off some accuracy, but you may gain a lot in performance, and for some production use cases you don't actually need the highest accuracy, so you can trade it for performance through distillation.

Okay, so in summary: Kubernetes can be used to easily share GPU resources among your workloads; the NVIDIA GPU operator can be used to manage and configure your nodes with GPU resources; and if you want to oversubscribe a GPU device, you have two options — MIG, where you physically partition the GPU, or time slicing, where processes share the GPU through round-robin context switching.

Okay, so here are the references — there are about three pages of them. And that's all for my talk. Thank you. I'll now take questions.
I think I have about two minutes for questions. Anyone? Yes? Uh-huh. Okay, so the question is: for partitioning there are two options, MIG and time slicing — are these mutually exclusive, or can you do them together? The answer is yes, you can do them together: you can have a MIG partition that is shared using time slicing. Yes — if you have the conference schedule app, the slides should be attached there; you should be able to find them. Yes? Yes. Right — Kubernetes supports something called the device plugin API, which allows this, and what NVIDIA did is implement that device plugin API. So you use the NVIDIA GPU operator, and through the device plugin the GPUs show up as resources. Oh, yes, there is an equivalent operator for AMD, but I think it has less support for things — some of the slicing features, for example; I don't think AMD has GPU partitioning. But if you want a cluster that contains AMD GPUs, AMD has their own tooling: I'm not sure it's a full operator, but they definitely have a deployment that handles the device plugin part. Any other questions? Yes? Mm-hmm. At Jina, no. I'm aware that some companies do use Kubernetes for training, but I'm actually not too sure how they do it, because you get a lot of overhead with Kubernetes. For training, most of the time you just want raw compute, and you don't want the overhead of going through Kubernetes. When you scale very large you start needing fault tolerance, and there are Kubernetes-based setups for that, but I'm not too familiar with them. These slides are mostly about inference. There's also another issue: with the MIG partitioning I was talking about, the GPU instances can't talk to each other, and for multi-node — or even just multi-GPU — training, you want the GPUs to be able to talk to each other. NVIDIA really only intended MIG for inference, so it doesn't support multi-GPU training when MIG is enabled. Okay, so what he's referring to is that the Kubernetes community is implementing DRA, dynamic resource allocation. I'm not super familiar with that work, but you can read up on it.

Okay, I think that's it for questions. I'll briefly introduce the company and then I need to get off stage for the next speaker. So, Jina AI: we're an AI company based in Berlin. We build embedding models as well as products like SceneXplain and PromptPerfect — SceneXplain is a visual language model product, and PromptPerfect is a prompt engineering tool. And then the embedding models: they were featured on Hacker News, and I think LlamaIndex also says our embedding models are good. The company has 40-plus members and three offices: we're based in Berlin, with offices in Beijing and Shenzhen. We raised 30 million for our Series A, and we're a top-tier AI company. If you're interested in joining as an intern, there are internship programs at jina.ai/internships.
If you're interested in a full-time position, you can go to jina.ai/careers — that's our careers page. And that's it. Thank you for coming.