Good afternoon, everyone. My name is Malin, and I'm a group product manager in Google Cloud working on Kubernetes Engine. We all know that GPUs are a very expensive resource, and utilization of a GPU is a core concern for all GPU users; poor utilization of GPUs costs them dearly. So in this talk, we're going to show you how to improve GPU utilization using Kubernetes.

In my humble opinion, Kubernetes is an ideal platform for AI/ML and high-performance computing workloads, and there are three core reasons why. Number one is portability. Kubernetes provides open, standards-based, cloud-native APIs, which allow practitioners to seamlessly port workloads between their laptops, private data centers, and the public cloud. Second is scale. Kubernetes can seamlessly scale from a single node to thousands of nodes, and it supports autoscaling, auto-provisioning, GPUs, TPUs, and many other advanced features that enable very large-scale training and inference. Third is productivity. Kubernetes makes practitioners more productive by freeing them from having to manage the underlying resources and compatibility issues, so they can focus on their core business mission, be it training, serving, or high-performance computing.

Let me quickly walk you through the architecture of Google Kubernetes Engine. Google Kubernetes Engine is a fully managed container orchestration platform provided by Google. It has two main components, the control plane and the data plane. The control plane comprises many things, including the master nodes, the API server, the scheduler, etcd, and many other services. The control plane provisions the data plane, which comprises the worker nodes. Worker nodes are where workloads run, and they can run workloads using CPUs and GPUs. Worker nodes are grouped together into node pools, and all the nodes that belong to a single node pool share the same configuration. The node pool is also the basic unit of autoscaling. All the nodes in a node pool have either a CPU or a GPU to run the workload.

As I mentioned before, GPU utilization is a core concern for GPU users, and poor utilization costs them very dearly. What we have observed across our GKE fleet is that GPU utilization for typical workloads is quite low, and it is getting worse day by day as GPUs become more and more powerful; a single workload may not even be able to saturate a very powerful GPU. The underutilization problem is especially acute for certain types of workloads, such as inference, gaming, visualization, and notebooks.

Let's look at some examples. Data scientists build models using notebooks, and most notebooks today are attached to GPUs. As we all know, notebooks stay idle for prolonged periods of time, wasting a very expensive resource. Here are some more examples: chatbots, vision product search, and product recommendations. These are all real-time applications that are latency-sensitive and business-critical. Kubernetes autoscaling and auto-provisioning features are essential for such applications, but they are not sufficient, for two reasons. First, it takes minutes to spin up a new node in Kubernetes, and most of these applications are latency-sensitive, so they cannot tolerate that delay. Second, until now, we could not do very effective bin packing for GPU workloads. So how do we help these workloads be more cost-efficient?
That's the main purpose of today's talk. The main challenge we face today is that Kubernetes allows fractional utilization of CPUs, but it does not allow fractional utilization of GPUs. What that means is that a Kubernetes workload can ask for 0.5 virtual CPU, and Kubernetes knows how to give the workload 0.5 CPUs. But today you cannot ask for 0.5 GPUs in Kubernetes. So what happens is that one GPU is fully allocated to one workload, even though the workload may need only a fraction of the GPU to execute its task.

So how do we fix this? There are many solutions that allow workloads to share a GPU. Some work at the application level, some work at the GPU system-software level, and some work at the hardware layer. Today I'm going to talk about two solutions that we recently launched. The first one is time-sharing, and the second one is multi-instance GPU. Both are very popular mechanisms for sharing GPUs, and together they address most use cases and most workload needs. So let's walk through them one by one.

First I'm going to explain the temporal multiplexing solution, popularly known as time-sharing. Time-sharing allows multiple containers to run on a single GPU. Each container gets a time slice, so the GPU is allocated fairly to all the containers. Under the hood, this is done through context switching. At any given point in time, one container has exclusive use of the GPU, but at a periodic interval a context switch happens and the next container gets to exclusively use the GPU. This is done in round-robin fashion, so every container gets the fair time slice it deserves. The beauty of this solution is that when only one container is allocated to a GPU, it gets to use the entire compute time of the GPU. But as soon as you add a second container, the two containers share the GPU, so each one gets about half the compute time, and that's how you enable fair sharing of GPUs.

Older generations of NVIDIA GPUs did context switching, or preemption, at the CUDA kernel boundary. Newer generations of NVIDIA GPUs, Pascal and later, do preemption and context switching at the instruction-level boundary, and that facilitates fair sharing of the GPU. The solution I'm explaining today is fully managed by GKE, so all the configuration, management, and underlying heavy lifting is done by GKE.

Now that I've explained what time-sharing means, let me show you how to enable time-sharing on a GPU and what the user experience looks like. To create a GPU node with time-sharing enabled, you provide a configuration parameter either at cluster creation time or at node pool creation time. In this chart you can see an example: there is a configuration parameter called max-shared-clients-per-gpu, and the value is set to 10. What that means is that in this example, an NVIDIA Tesla T4 GPU can be shared by up to 10 containers; 10 is the upper limit, so anywhere between one and 10 containers can use this GPU. All of this configuration can be done with API calls, and we are also going to launch a user interface, so you will be able to do the same configuration in the UI. Once you run the command, either cluster create or node pool create, it will automatically set up the nodes with the time-sharing configuration.
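To make that concrete, here is a minimal sketch of the node pool command. The flag names (gpu-sharing-strategy, max-shared-clients-per-gpu) are as I remember them, so treat them as assumptions and check the current GKE documentation for the exact syntax.

```shell
# Sketch: create a node pool of T4 GPUs, each shareable by up to
# 10 containers via time-sharing. Flag syntax may differ by release.
gcloud container node-pools create timeshare-pool \
  --cluster=my-cluster \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=10
```

The same parameters can be passed at cluster creation time with gcloud container clusters create.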
After the nodes are created and the drivers are installed, you can inspect a node. Let's inspect the node by running kubectl describe node. You can see here that the nvidia.com/gpu resource value is set to 10. What does that mean? It means 10 shared GPU resources are available, and each resource represents a time slice. In simple terms, up to 10 containers can share this one NVIDIA T4 GPU.

The next thing GKE does is label the node. Each node created through this configuration is labeled so that workloads which request a time-shared GPU can land on it. In this example, two parameters are reflected in the labels: the sharing strategy, time-sharing, and the maximum number of containers that can share the GPU. The third thing GKE does is taint the node. Why is this needed? It prevents whole-GPU workloads from being scheduled on this node. You only want workloads that want a shareable GPU to land here; you don't want a workload that needs a full GPU to land on this particular node.

Now that we have set up the nodes, the next step is to configure the workload. We specify a deployment, and within the deployment spec we specify node selectors or affinities to schedule the workload onto a time-shared GPU. If such nodes don't exist, GKE is smart enough to either autoscale an existing node pool that matches the configuration or create a brand-new node pool that matches the workload's request; I'll talk more about that throughout this talk. The workload then has to request nvidia.com/gpu, and the request count in this example is 1. What you have to remember is that 1 is not a measure of the GPU time allocated to a particular container. How much time a container gets depends on how many containers are running on the GPU: if there are 10 containers, each gets one tenth of the time; if there is only one container, it gets to use the GPU exclusively.

So now we have figured out how to provision a node and how to land workloads on those time-shared nodes. But there are some nuances to be familiar with, and some issues and corner cases to be mindful of. With time-shared GPUs, all the processes get their own separate address space, so there is no issue of data overlapping. However, no memory limits are enforced. That means if the containers are not well behaved, you can get into out-of-memory situations, so the responsibility for restricting memory usage falls on each workload. How can we do this? There are two ways to avoid an out-of-memory situation. The first is to use CUDA unified memory, which enables on-demand paging between host and GPU memory and thereby avoids running out of GPU memory. The second is to configure this in the application: frameworks like TensorFlow and PyTorch expose knobs you can set to avoid out-of-memory situations. This is something to remember when you share many containers on the same GPU. Below is a sketch of a deployment spec that puts these pieces together.
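This is a minimal sketch only. The node label keys are the ones I recall GKE applying to time-shared node pools (cloud.google.com/gke-gpu-sharing-strategy and cloud.google.com/gke-max-shared-clients-per-gpu), the image name is hypothetical, and the TF_FORCE_GPU_ALLOW_GROWTH environment variable is one example of the TensorFlow memory knobs mentioned above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeshare-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: timeshare-inference
  template:
    metadata:
      labels:
        app: timeshare-inference
    spec:
      nodeSelector:
        # Node labels GKE applies to time-shared pools, as I recall them.
        cloud.google.com/gke-gpu-sharing-strategy: time-sharing
        cloud.google.com/gke-max-shared-clients-per-gpu: "10"
      containers:
      - name: inference
        image: us-docker.pkg.dev/my-project/repo/app:latest  # hypothetical image
        env:
        # TensorFlow knob: grow GPU memory on demand rather than reserving
        # the whole GPU up front; helps avoid OOM when sharing.
        - name: TF_FORCE_GPU_ALLOW_GROWTH
          value: "true"
        resources:
          limits:
            nvidia.com/gpu: 1  # one time slice, not a measure of GPU time
```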
Now I'm going to walk you through how autoscaling works when you have a time-shared GPU. Autoscaling is a key feature of Kubernetes: it enables workloads to avoid overprovisioning and underprovisioning, thereby saving cost while offering better performance. Autoscaling is quite a complex topic, so I'll walk you through a very simple workflow. In this case, we already have a time-shared GPU node. This node exposes a per-container GPU utilization metric, and you can also use custom metrics; for example, queries per second per container could be your custom metric. The horizontal pod autoscaler watches this metric, and when the metric exceeds the threshold you specify, it adds replicas of your container. Let's say you are watching the GPU utilization metric and the threshold is 70%. When utilization goes to 71% or higher, the autoscaler starts adding replicas, because it thinks the application is running hot and needs some help.

Once more replicas of the pod are added, each pod needs to land on a node. If there is an existing node that can accommodate it, that's what happens. But if no such node is available to land the extra pod, new nodes are automatically added to your cluster; the cluster autoscaler is smart and takes care of this for you.

There are three main scenarios when we talk about autoscaling: scale-up, scale-down, and auto-provisioning. I'm going to quickly walk you through all three. When pods are unschedulable, GKE is smart enough to scale up the most cost-effective node pool, which is to your advantage. How does it do that? It looks at the pod spec and the node spec, determines which nodes can satisfy the pod spec, and out of those, which would be the most cost-effective to scale up. If there are many pods waiting to be scheduled, it also figures out how many nodes it should add, whether one node, five nodes, or ten nodes, to service all the outstanding pods. You might ask: what happens if there is no existing node and you are starting from scratch, with no node pool running in the cluster? In this case too, GKE is sophisticated enough to automatically provision nodes that satisfy the needs of the workload. We call this auto-provisioning.

Now let's say you're running your cluster and suddenly the load drops. GKE is smart enough to scale down. It does this by monitoring the utilization of all the nodes in the cluster. When utilization drops below a threshold, it tries to determine whether the workloads running on underutilized nodes can be safely consolidated onto fewer nodes. If so, it moves the pods from the underutilized nodes onto fewer nodes and frees up the extra resources. This saves you money by reducing the number of nodes needed to handle the workload.

So far we have talked about autoscaling in the context where you already have an existing node pool and scale it up when you have pending pods. What happens if there is no existing node pool and you want to bootstrap from scratch? In that case, if you enable auto-provisioning on GKE, it will automatically figure out the best node and node pool configuration with which to bootstrap the workload, and it will add those nodes starting from zero. That's what we call auto-provisioning, and it saves you the time and effort of configuring node pools. On the right-hand side is an example: say you have enabled auto-provisioning and you're just starting a cluster with no node pools. Based on the deployment spec, GKE knows you have enabled time-sharing, figures out that it needs to add a time-shared node to the cluster, and node auto-provisioning does that for you automatically. Before we move on, here is a sketch of the horizontal pod autoscaler piece of this workflow.
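A minimal sketch, with caveats: the autoscaling/v2 HPA API is standard Kubernetes, but the metric name gpu_utilization is a hypothetical per-pod custom metric; how GPU metrics are actually surfaced to the HPA depends on your metrics pipeline, which isn't shown here. It targets the hypothetical timeshare-inference deployment from the earlier sketch.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: timeshare-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: timeshare-inference   # deployment from the earlier sketch
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization   # hypothetical per-pod custom metric
      target:
        type: AverageValue
        averageValue: "70"      # scale out above ~70% utilization
```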
We've talked about time-sharing, and at the beginning I said we would cover two distinct mechanisms for GPU sharing. So now I'm going to switch gears and talk about spatial multiplexing. This is a relatively new technology launched by NVIDIA, known as multi-instance GPU, or MIG. It allows MIG-enabled GPUs to be partitioned into GPU instances. The key difference here is that the partitions are physically isolated, with dedicated compute and memory. That physical isolation is why it's called spatial multiplexing; in the previous case, temporal multiplexing, you were just time-slicing a single GPU across multiple containers. MIG supports simultaneous workload execution with quality-of-service guarantees: because you have physical partitions, the containers run in parallel, all executing at the same time, and that gives you better quality of service. This is only supported on A100 GPUs as of now. We have done a lot of testing on this, and we have found that throughput increases linearly as you add more instances, which makes logical sense.

The A100 GPU has seven compute units and eight memory units, and each memory unit is about 5 GB. You can combine these compute and memory units in different configurations to slice the A100 into different instances, and this table shows which combinations are allowed. Each combination is tagged as <compute>g.<memory>gb. Let's take an example: 1g.5gb implies one compute unit and 5 GB of memory, and with that configuration you can create seven instances. If you pick a different configuration, like 3g.20gb, you get two instances. With seven instances, you can run seven containers on this particular GPU; with two instances, you can run two containers in parallel on it.

Similar to time-sharing, you can configure the GPU with any of the partition sizes listed in the table. To do that, you specify a parameter called gpu-partition-size. In this example I picked 1g.5gb, which, as we saw in the table, creates seven instances, so the node can run up to seven containers in parallel. When you inspect those nodes, you will see the nvidia.com/gpu resource with a count of seven. This is just one example; you can slice it differently depending on your needs.

Now, how do we deploy workloads on those nodes? Very similarly to time-sharing. You will have a deployment spec, and the first thing you will notice is the resource count; in this case, you request one. And as we discussed earlier, the nodes are already labeled with the kind of sharing solution they are configured with. A sketch of both the node pool creation and the deployment spec follows.
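A rough sketch, assuming the gpu-partition-size flag and the cloud.google.com/gke-gpu-partition-size node label as I remember them (verify against the current GKE docs), plus a hypothetical container image. If I recall NVIDIA's documentation correctly, the allowed profiles on the 40 GB A100 are 1g.5gb (seven instances), 2g.10gb (three), 3g.20gb (two), 4g.20gb (one), and 7g.40gb (one).

```shell
# Sketch: an A100 node pool partitioned into 1g.5gb MIG instances.
gcloud container node-pools create mig-pool \
  --cluster=my-cluster \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb
```

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mig-inference
spec:
  replicas: 7                    # one replica per 1g.5gb MIG instance
  selector:
    matchLabels:
      app: mig-inference
  template:
    metadata:
      labels:
        app: mig-inference
    spec:
      nodeSelector:
        # Node label GKE applies to MIG pools, as I recall it.
        cloud.google.com/gke-gpu-partition-size: 1g.5gb
      containers:
      - name: inference
        image: us-docker.pkg.dev/my-project/repo/app:latest  # hypothetical
        resources:
          limits:
            nvidia.com/gpu: 1    # one MIG instance, not a whole A100
```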
With that combination of node selectors, the scheduler can figure out which workloads can land on which slice of which GPU. In this case, you can see the replica count is seven, because there are seven instances; you can run up to seven containers on a single A100 GPU partitioned into seven instances.

Now let's compare and contrast the two mechanisms, so you understand when to use MIG and when to use time-sharing. As I mentioned before, with MIG the partitions are physical; with time-sharing, the partitions are logical. Physical partitioning comes with a maximum partition limit, which is by design a hardware limitation: an A100 can be partitioned into at most seven instances. In the logical case, you can partition a GPU as many ways as you want, and you can load a great many containers onto a single GPU. But I will caution you that if you add too many containers sharing the same GPU, you have to watch out for the overhead of context switching. So be careful about how many containers you let share a single GPU.

By virtue of physical partitioning, MIG provides a lot of benefits. It provides physical isolation, and in many applications isolation is an important requirement. It also provides memory protection, which avoids the out-of-memory situations we discussed; very beneficial. And it provides quality-of-service guarantees. So clearly, when you're looking for quality of service or better isolation, MIG is the better choice. None of this is possible in a time-sharing scenario, because you're just sharing one physical GPU across many containers. On the other hand, because MIG partitions are physical, reconfiguring a MIG GPU to change the partitioning requires a bit of effort. With time-sharing, reconfiguration is very easy: if you specified that a single physical GPU should be shared by 10 containers, and tomorrow you decide you only want to share it with five, you change one parameter and you are done.

So when do you choose which? As I mentioned, if quality of service, isolation, or protection from out-of-memory situations are your main criteria, then MIG will certainly be beneficial, because it provides those guarantees. On the other hand, what we have found in practice is that time-sharing is very good for bursty workloads. The benefit is that if you configure a GPU to be time-shared across 10 containers but only have one container to begin with, that container gets the full power of the GPU; with two containers, together they still get the full power of the GPU. By contrast, if you're working with MIG and you configure seven slices but run only one container, you get only one seventh of the performance, because the other six slices sit idle. That's the trade-off. So when you have bursty traffic, time-sharing gives you much better utilization of the GPU and much more flexibility. The other benefit of time-sharing is that it works on all the GPU families we have, including the A100 and even on MIG partitions, whereas MIG only works on the A100, so you don't have that flexibility on every GPU family. And there are things to consider beyond time-sharing versus MIG.
We recommend that you do GPU sharing only within a single trust boundary. What that means is that if a single user needs to run multiple applications, it's totally legitimate to do GPU sharing, because you are working within a single trust boundary. Similarly, if you have a single company or single tenant but multiple users, for example multiple data scientists running notebooks that want to share a GPU, that should be okay; again, you are within the same trust boundary. However, we don't recommend these solutions in a multi-tenant scenario. If you have multiple customers, where you have to cross a trust boundary, we do not recommend sharing, because we don't believe the isolation properties of either solution meet the security bar for sharing across customers. So please keep in mind: sharing within a single customer is totally okay; sharing across customers is not advisable at the current state of the technology.

In summary, the key takeaways from this discussion are that we offer two solutions for GPU sharing on Kubernetes. The time-sharing solution works on every GPU family and is the better solution for bursty workloads. MIG works only on the A100 GPU, but it provides better isolation, quality of service, and out-of-memory protection. Depending on your workload needs, you can choose either of them, but keep in mind that both are only appropriate within a single trust boundary; don't use them across customers. Thank you for listening to my talk; I'm open to any questions.

Thank you, everyone. I'm your moderator for this session, so I'll be running around with the mic. I see there's one over there, and I'll get to you in just a second. First, I'm going to ask a question from online: can you force different containers to use the same physical GPU? For example, you want to run an X server in one container and a desktop in another; these two containers need to share a single GPU, and it won't work otherwise. You would ask for a GPU in both containers, but if a node has more than one GPU, are there any guarantees that they get the same GPU? Does that make sense?

Yes. I missed the last part, but yes, you can share GPUs across two different applications. The one thing to watch out for is the out-of-memory situation: if you're not careful, in the time-sharing case you can encounter out-of-memory situations. But if you're doing it with MIG, that should work fine. And we do have customers sharing a single GPU across two totally different applications.

Thank you. Okay, I'll bring this over to the person over here who had a question.

Thanks a lot for this talk. In the time-sharing case, do you observe any computation performance drop due to cache refresh during context switching?

We have done extensive testing, and the results are very workload-specific, so they cannot be generalized across workloads. However, if you limit the number of containers sharing a single GPU, the performance hit is negligible. But if you go to an extreme, say 50 containers on a single Tesla T4 GPU, then you will certainly have too much overhead from context switching. In general, NVIDIA does a really clever job of managing memory and other resources, and we haven't seen a huge performance hit from context switching in this scenario.
So as long as you limit the number of containers, the overhead is very small.

I saw someone back here first; I don't know who it was, though. Okay, I'll come back.

Hi, thank you for your talk, very insightful. I have a question about the memory. I know that for MIG you can use only one slice of the memory, so if you use two instances, each can use only half the memory that the GPU actually provides. With time-slicing, is it the same or not? It didn't really become clear to me from the talk. The question is: can two individual containers both use the full memory, one after the other, or can they each use only half, and is this capped?

Yeah, let me first clarify the question. In the case of MIG, you specify a memory slice for a particular instance, and that's the memory available to that slice; that part is very clear. In the case of time-sharing, it's not as clear. What you have to watch out for is that the sum total of memory used by all the containers does not exceed the total memory available on the GPU. That's why we mentioned that the out-of-memory situation is real with time-sharing: you have to make sure the applications are well behaved and don't claim more memory than is physically available on a single GPU.

Right, so together they all have to fit in the total amount of memory, because it's not being offloaded between one slice and the other. Correct, yeah.

All right, we've got a couple of other questions over here, but first I'm going to do one online again. If you run something like TensorFlow, it likes to allocate and keep the whole GPU memory, leaving no space for sharing. Any commentary on those types of situations?

Yes, but TensorFlow also lets you limit the memory your container requests. So as long as you use TensorFlow carefully, and when you're sharing you make sure the total requests don't exceed the physical capacity of the GPU, you can do it. We have customers using TensorFlow and sharing GPUs.

Okay, cool. And...

Hi, thanks for the talk. Just a clarification in the case of time-sharing: I can allocate uneven time to the different containers, right, by requesting a number of GPUs greater than one? Like in your example there was one, but I can request two, and it will get two times the time?

That's a great question, thanks for asking. There are a lot of nuances to that. If you ask for more than one, it basically allows you to bin-pack heterogeneous workloads on the same GPU. But in the end, the compute time is evenly divided, so each container gets the same amount of compute time. In terms of memory and other requirements, though, you can fit heterogeneous workloads onto the same GPU; that's the trick you can play with that count.

Oh, I see. So it's static: time-wise, everybody gets the same time slice, but in terms of memory, you can play clever tricks with that count to fit different, heterogeneous workloads on a given GPU. Got it, got it. Thank you.

One more online. Someone asked: in terms of time-sharing, are MPS and a custom device plugin what you're using under the hood? The solution we launched does not use MPS; that's on our roadmap.

Cool. And there was another question over here. Hi, thanks for the talk. I have two questions. First one: is this solution in GA, or still in preview? It's generally available, yes. And a technical question: can I request more than one slice for a single container?
For example, I decided to use 1g.5gb, but I need two slices for my container. Can I do it? Yes, you can do that. So I put two in the request? Yes. Okay.

In the case of MIG slicing, it's pretty straightforward: you ask for two, you get two slices, and you can run the container on two slices. The time-sharing case is a little trickier. If you ask for more than one, you still get a proportionate time slice. Intuitively, it's a little hard to make sense of: I asked for five, but if there are two containers, each one is going to get an equal amount of time. But you can use that cleverly to fit heterogeneous workloads if they have different memory requirements. Thank you.

All right, we're a couple of minutes over time, so I'm going to go ahead and cut off questions there, but I'm sure Malin will probably hang out for a few minutes if anyone would like to come up and ask him questions. Yes, please come up, and I'm happy to answer more questions.