Hello, everyone. My name is Gaurav Kumar. I work on Uber's Compute Platform team in Amsterdam. Unfortunately, my colleague Amit couldn't be here today due to logistical reasons, so I'll be leading this session solo. On the Compute Platform team, we are mostly focused on running different kinds of workloads, and in this particular talk I'll be talking about scheduling GPU workloads on Kubernetes and the kind of insights we have gathered so far while scheduling such workloads. So yeah, let's get started.

As an outline of the talk, I first want to clarify what this talk is not about. It is not about how to set up your clusters to run GPU workloads. Rather, it's about the challenges and requirements you face along the way once your clusters are already set up. I'll go through a brief overview of how Compute works at Uber and give an overview of Kubernetes clusters for GPUs at Uber. After that, I'll cover the sections on efficiency and precision: the goals, the challenges, and the solutions we implemented to hit those targets. I'll also cover a little bit about the benefits of scheduling efficiently and precisely on Kubernetes clusters. We'll cover some common pitfalls we encountered along the way while enabling such GPU workloads. I'll briefly touch on the future work we have planned, and if we have time, we'll have a short Q&A session as well.

This is the overview of Compute at Uber. At the top, we have two federation layers: a stateless federation layer and a batch federation layer. These federation layers manage resources efficiently, and they are responsible for different types of workloads. For stateless, we have a federation layer known as UP. For batch, we have our own federation layer known as the Batch Federator. I would recommend everyone watch the talk on batch federation from KubeCon North America 2023, in which my colleagues talked about the batch federation layer. Our Compute platform has been consistently evolving: we were based on Mesos and our in-house scheduler known as Peloton, and over the years we have transitioned to Kubernetes. Below that stack, we have a presence across multiple cloud providers, such as OCI and GCP, and we have our on-prem data centers as well, where we run our own hosts.

Here is the overview of GPU use at Uber. We use GPUs for model training for our AI and ML workloads. We have data science notebooks, which are used for exploratory data science work. For example, if you're a machine learning scientist or a data scientist and you need access to a GPU, you can just launch a session, work in that session with access to the GPU, and return it back to the cluster. Some use cases for the models we train include ETA prediction: on the app, you can see how long it will take for your cab or your food to arrive, or when you'll reach your destination. We also have LLM-based chatbots, which do automated customer support, and we have LLM use cases for our own internal users.

In terms of scale, we have more than 4,000 GPUs, and about 50,000 jobs run on those GPUs every week. In terms of GPU models, it's a mixed bag: we have NVIDIA H100s, NVIDIA A100s, some Quadro RTX 4000s, and some 1080 Tis. So it's modern GPUs as well as some legacy GPUs.
So yeah, this is what the cluster-level overview for a GPU cluster looks like at Uber. We have an in-house component known as the Kubernetes Object Pusher, which is responsible for syncing the NVIDIA device plugin DaemonSet spec to all the clusters. It reads from a repository where we describe the spec for the device plugin. Once that spec is synced, the device plugin DaemonSet pod gets launched on the node, and we are able to advertise GPU resources to the cluster. The device plugin and the kubelet communicate over a Unix socket, as is standard.

On the node level, this is what it looks like. We use the NVIDIA Container Toolkit, which contains the implementation of the alternate container runtime. Our container runtime in production is mostly based on containerd, so we have configured containerd to hook into the NVIDIA container runtime whenever we receive a pod whose runtime class name is nvidia. If it does not have that runtime class, it goes through the default runtime; those are mostly stateless pods and non-GPU workloads. We also use cAdvisor to monitor GPU metrics, and we have a wrapper on top of cAdvisor to label the metrics at the workload level, because we want to attribute which jobs are using which GPUs. For example, here's a screenshot of a job which is using four GPUs. Those four GPUs span two pods, and those two pods each have two GPUs, on different nodes.

So now that you have set up your clusters to be able to use GPUs, life should be fine. But unfortunately, it isn't. Once you've set up your clusters for GPU workloads, you run into requirements for efficiency and precision. I'll get into the details of how that happens here. First, you need to support a heterogeneous cluster. A cluster can have CPU-only nodes as well as GPU nodes, and theoretically CPU workloads can run on GPU nodes, but that's not the best utilization of them. So your cluster should be able to support heterogeneous nodes; that would be one requirement. We ran into this requirement mostly when our machine learning scientists wanted to use Ray jobs. Ray is a framework by Anyscale where you can launch a job that has CPU-only workloads in some pods and GPU-dedicated workloads in other pods. You also have to be able to support multiple stock-keeping units (SKUs) for GPUs: GPU workloads should be scheduled on the right SKU based on what the customer is asking for, and workloads which do not require a specific GPU SKU should not be running on those specialized GPU SKUs. You would also need to minimize the fragmentation of GPU resources to be able to schedule efficiently. I'll get into the details of all of these in the slides later.

So, when you have to support heterogeneous clusters, the key point is that GPU resources are limited and they are costly. You don't want to waste GPU resources by running non-GPU workloads on them. As you can see in this example, we have a cluster where a CPU-only pod is scheduled on a GPU node. This tends to happen if you just let the cluster run amok and you don't have any guardrails in place to prevent it. So a CPU pod is running on a GPU node, and a generic GPU pod will in any case run on a GPU node. There's also one catch: you need your device plugin to run only on GPU nodes and not on CPU nodes, because that's not optimal, and your device plugin will just keep crashing in that case. So how do you solve that?
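Before getting to the solutions, here is a small sketch, just to make the setup above concrete, of what a GPU pod looks like on a cluster wired up this way: it asks for the extended resource advertised by the NVIDIA device plugin and selects the nvidia runtime class so containerd hands the container to the NVIDIA container runtime. The node label key used here is hypothetical, standing in for whatever your node-labeling mechanism produces; this is not our exact production spec.

```go
package gpuexample

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gpuPod sketches a GPU workload pod: runtime class "nvidia" so containerd
// picks the NVIDIA container runtime, a request for the extended resource
// advertised by the device plugin, and a (hypothetical) GPU node label.
func gpuPod() *corev1.Pod {
	runtimeClass := "nvidia"
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "trainer-0"},
		Spec: corev1.PodSpec{
			RuntimeClassName: &runtimeClass,
			// Hypothetical label; in practice this comes from node labeling (NFD or a custom daemon).
			NodeSelector: map[string]string{"gpu.example.internal/present": "true"},
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "trainer:latest",
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Extended resource registered by the NVIDIA device plugin.
						"nvidia.com/gpu": resource.MustParse("2"),
					},
				},
			}},
		},
	}
}
```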
Well, there can be multiple solutions to this. First, you can label your GPU nodes to advertise that a node has a GPU and what kind of GPU it is. For node labeling, you can use different solutions; as I think Kevin and Shiva mentioned, you can use NFD. For us, we use a custom way of labeling GPU nodes. We already have an on-host daemon which uses the NVIDIA libraries to fetch information about the GPUs. At boot time, when we launch the kubelet, there's a wrapper which calls that daemon, gets the information about the GPUs, parses it, and labels the node once the kubelet is up. Once you have labeled your GPU nodes, you need to add the necessary node selectors to the pods of your device plugin so that the device plugin only runs on those GPU nodes. And you can use a GPU management filter plugin to ensure resource isolation between CPU workloads and GPU workloads.

Just a bit of context here: what do I mean by a GPU management filter plugin? Kubernetes allows you to write plugins for various points in the scheduling cycle of a pod. A filter plugin filters out any candidate nodes which are not suitable for scheduling a particular pod. So here is what it looks like if you implement a GPU filter plugin: if you have a pod which is not requesting GPUs, the GPU filter plugin filters out all GPU nodes from its possible candidate placements, and your CPU-only workloads will only run on CPU nodes. If you have a pod which is requesting GPUs, the plugin is a no-op, and the pod will get scheduled directly on GPU nodes. So the GPU management filter plugin is essentially a no-op for pods requesting GPU resources.

Well, once you've ensured the isolation between CPU and GPU nodes, you think that, again, life should be comfortable. But again, unfortunately, it is not. The reason is that now you have another problem: you have to support multiple types of GPU SKUs in the same clusters. Your product teams will come to you saying that now we want to train LLMs, and LLM training requires specialized hardware, for example NVIDIA A100s. And the best use of that specialized hardware is to train those particular workloads and not let anything else run on it. I mean, you could potentially run generic GPU workloads on that specialized hardware, but again, it's not a very optimal use of it. Also, GPU features differ completely across SKU types; not all GPUs are created equal. Some GPUs have more VRAM, some GPUs have different kinds of compute, and each GPU costs a different amount, so as a platform you have to do this cost optimization. One more key thing is that model accuracy is highly dependent on the kind of SKU that you use: if you train your workloads on a certain kind of SKU, we've seen in production that the model accuracy varies by a few percentage points. So as you can see here in the cluster, since we have ensured that CPU-only workloads run on CPU nodes and GPU workloads run on GPU nodes, the CPU pods are fine. But if you have a generic GPU workload, it can get scheduled on specialized GPUs, and you have to prevent that. So again, how can you address this problem? Well, you can, again, write a new filter plugin, which is called the specialized GPU management filter plugin.
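Before getting into how the two plugins divide the work, here is a minimal sketch of what that first, generic GPU management filter plugin could look like against the Kubernetes scheduler framework's FilterPlugin interface. It is a sketch under assumptions, not our production code: the node label key is the same hypothetical one as in the earlier example, plugin registration and scheduler configuration are omitted, and the system-pod exception discussed later in the talk is deliberately left out.

```go
package gpufilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Hypothetical label set on GPU nodes by whatever node-labeling mechanism is in place.
	gpuNodeLabel = "gpu.example.internal/present"
	gpuResource  = v1.ResourceName("nvidia.com/gpu")
)

// GPUManagement filters GPU nodes out for pods that do not request GPUs;
// for pods that do request GPUs it is a no-op.
type GPUManagement struct{}

var _ framework.FilterPlugin = &GPUManagement{}

func (g *GPUManagement) Name() string { return "GPUManagement" }

func (g *GPUManagement) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	if podRequestsGPU(pod) {
		return nil // no-op: GPU pods may land on GPU nodes
	}
	if _, isGPUNode := node.Labels[gpuNodeLabel]; isGPUNode {
		// CPU-only pod on a GPU node: reject this candidate node.
		return framework.NewStatus(framework.Unschedulable, "GPU nodes are reserved for GPU workloads")
	}
	return nil
}

func podRequestsGPU(pod *v1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if requestsGPU(c) {
			return true
		}
	}
	for _, c := range pod.Spec.InitContainers {
		if requestsGPU(c) {
			return true
		}
	}
	return false
}

func requestsGPU(c v1.Container) bool {
	q, ok := c.Resources.Limits[gpuResource]
	return ok && !q.IsZero()
}
```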
So plugin one will be your global GPU management filter plugin: it filters out all GPU nodes for a pod which does not request GPUs. Plugin two will be a specialized GPU management filter plugin, which filters out all the specialized GPU nodes for any pod which is just requesting generic GPU resources and does not ask to be placed on a specific SKU. And at the end, we have placement strategies based on node selectors: if your pod is specifically requesting, let's say, "I want to run on an A100", you put a node selector there, and your pod will get placed accordingly.

What does this do? Once you have done this, you are able to isolate between CPU nodes, generic GPU nodes, and specialized GPU nodes. This ensures better hardware utilization in our clusters, and obviously, once we are able to schedule workloads on the right nodes, we are able to achieve better model accuracy. This also gives us cost savings because, obviously, we are able to utilize the hardware better. The last time we evaluated the kind of cost savings this gives us, it was around half a million dollars a year.

OK, now since you've done that, you're able to isolate resources, and you should be able to rest on your laurels. But again, you see another problem, and that problem is GPU fragmentation. What tended to happen in our case, as with scheduling other kinds of workloads, was that we used to prefer load-aware placement. With load-aware placement, we try to place the workloads on the nodes which are not being utilized to the fullest, so that we ensure a uniform utilization of resources across our clusters. This works well in the CPU world, but in the GPU world, it results in fragmentation. Here you can see a cluster where we have nine GPUs available to schedule your workloads, but if you want to place a pod which requires more than two GPUs, you won't be able to place that pod anywhere in the cluster. And we've had complaints from our customers saying that even though there are GPU resources in the cluster, they are not able to schedule their pods because of this fragmentation issue.

So how do you mitigate fragmentation? Well, you could use something like bin-packing placement. Kubernetes by default has the NodeResourcesFit plugin, and you can configure it to use the MostAllocated scoring strategy so that you bin-pack your workloads more tightly. And then, as you can see, you have nodes available with more than three GPUs, so your pod can be placed on those nodes. With that being said, I'm not saying that we have completely solved the fragmentation problem in our clusters. The reason is that it's more of an art than a science, because you have multiple scorers inside your scheduler competing for the placement, and you have to tune the weights of the different scheduler plugins that you write so that you don't run into issues.
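To make the bin-packing idea concrete, here is a rough sketch of how a MostAllocated-style score behaves for a single resource (GPUs). This is a simplified illustration of the strategy, not the upstream NodeResourcesFit code, which combines weighted ratios across several resources.

```go
package gpuscore

// mostAllocatedGPUScore illustrates the idea behind the MostAllocated scoring
// strategy for GPUs: nodes that would end up more tightly packed score higher,
// so new GPU pods gravitate toward already-used nodes instead of spreading out
// and fragmenting free GPUs across the cluster.
func mostAllocatedGPUScore(allocatableGPUs, alreadyRequestedGPUs, podGPURequest int64) int64 {
	const maxNodeScore = 100
	if allocatableGPUs == 0 {
		return 0
	}
	requested := alreadyRequestedGPUs + podGPURequest
	if requested > allocatableGPUs {
		return 0 // the pod doesn't fit; the Filter phase would have rejected this node anyway
	}
	return maxNodeScore * requested / allocatableGPUs
}
```

For example, for a two-GPU pod, a node with eight allocatable GPUs of which six are already requested scores 100, while an empty eight-GPU node scores 25, so the scheduler prefers filling the partially used node and keeps whole nodes free for larger jobs.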
Well, so now you are able to isolate resources, you're able to schedule workloads on the appropriate SKUs, and you're able to bin-pack your workloads. You should be fine with most of the scenarios that you see. But then, life is again not so easy: you run into some common pitfalls. Pitfalls is just a fancy name for bugs that we encountered along the way. One of the pitfalls that we found, maybe after a week of deploying the plugins, was that the device plugin pod and the filter plugin do not play very well together. And the reason for that was really simple.

I'll get into the reason first. What basically happens is that when your node joins the cluster, it is effectively like a CPU node: GPU resources are not yet being advertised on the node, but you have the GPU filter plugin in place. Once the DaemonSet spec is synced and the DaemonSet pod gets scheduled on that node, you start to advertise the GPU resources on that node, and all is well. But when that NVIDIA device plugin DaemonSet pod crashes for some reason, or you're trying to upgrade the device plugin DaemonSet, you run into the issue of the GPU filter plugin preventing your NVIDIA device plugin pod from being placed. Basically, the GPU filter plugin was making the decision: is the pod requesting GPUs, and is the node a GPU node? The device plugin pod is essentially a CPU-only workload, so the filter plugin refused to schedule it on the GPU node. Well, that's not ideal. So, just for the simplicity of the solution, you can add an exception for your device plugin pod in your GPU management filter plugin. And it applies not only to the device plugin pod; it applies to all the system pods that need to run on your GPU nodes but are not GPU workloads.

Another common pitfall that we encountered was pods requesting more GPUs than are available on any node. For example, take a theoretical cluster with nodes that have four GPUs, six GPUs, and eight GPUs. If someone tries to place a workload which requests ten GPUs in total, what used to happen was that it would go through admission control, we would try to place the workload on the cluster, go through multiple scheduling attempts, and keep facing placement failures, because essentially no node can accommodate this pod. And if no node can accommodate this pod, you get alerted that your job has been stuck for a while and you cannot run your workloads. Well, what you could do is prevent the admission of such pods into your cluster: pods which request more resources than what your cluster can provide at that point in time. This can also be solved at the federation layer, but I'm just talking about it at the cluster level. So if you try to place a pod which requests ten GPUs and your cluster does not have a node which can accommodate ten GPUs, you just deny admission of that pod.

Well, it's again not so simple, because you can't just hardcode it, saying that in my particular cluster I have nodes which can support eight GPUs, so I'll just hardcode eight, ask users to send their workloads, and deny admission to pods which request more than eight GPUs. Your cluster is essentially a living entity: nodes keep coming in and nodes keep going out. Tomorrow you can have a node which contains sixteen GPUs, or ten GPUs, or your eight-GPU node can move out of the cluster for repairs or maintenance. So how do you do that? Well, you can write an admission check using a node informer. You watch for node updates in the cluster and keep track of the maximum number of GPUs that any one node in the cluster has. The admission check then just validates: is my pod requesting a number of GPUs higher than any node can supply? If yes, then I deny admission. If not, then I allow the pod to be admitted.
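Here is a rough sketch of that idea, assuming a validating admission webhook whose plumbing is omitted: a node informer keeps track of the largest GPU count any node advertises, and the check rejects pods that could never fit. It is deliberately naive, only summing the regular containers' requests; package and names are illustrative, not our production code.

```go
package gpuadmission

import (
	"fmt"
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

const gpuResource = v1.ResourceName("nvidia.com/gpu")

// maxGPUTracker watches Node objects and remembers how many GPUs each node
// advertises, so the admission check always reflects the live cluster.
type maxGPUTracker struct {
	mu    sync.RWMutex
	nodes map[string]int64 // node name -> allocatable GPUs
}

func newMaxGPUTracker(factory informers.SharedInformerFactory) *maxGPUTracker {
	t := &maxGPUTracker{nodes: map[string]int64{}}
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { t.update(obj.(*v1.Node)) },
		UpdateFunc: func(_, obj interface{}) { t.update(obj.(*v1.Node)) },
		DeleteFunc: func(obj interface{}) { t.delete(obj) },
	})
	return t
}

func (t *maxGPUTracker) update(n *v1.Node) {
	q := n.Status.Allocatable[gpuResource]
	t.mu.Lock()
	t.nodes[n.Name] = q.Value()
	t.mu.Unlock()
}

func (t *maxGPUTracker) delete(obj interface{}) {
	if n, ok := obj.(*v1.Node); ok {
		t.mu.Lock()
		delete(t.nodes, n.Name)
		t.mu.Unlock()
	}
}

// max returns the largest GPU count currently advertised by any single node.
func (t *maxGPUTracker) max() int64 {
	t.mu.RLock()
	defer t.mu.RUnlock()
	var max int64
	for _, gpus := range t.nodes {
		if gpus > max {
			max = gpus
		}
	}
	return max
}

// validate denies pods whose total GPU request can never fit on any node
// currently in the cluster. Simplified: it only sums regular containers.
func (t *maxGPUTracker) validate(pod *v1.Pod) error {
	var requested int64
	for _, c := range pod.Spec.Containers {
		q := c.Resources.Limits[gpuResource]
		requested += q.Value()
	}
	if max := t.max(); requested > max {
		return fmt.Errorf("pod requests %d GPUs but the largest node has %d", requested, max)
	}
	return nil
}
```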
This is a very simplified solution. We also need to take into account the fact that your pod spec can contain init containers. When you have init containers, you need to take the maximum GPU request across the init containers and compare it against the sum of the GPUs requested by the regular containers, since init containers run one at a time rather than all together.

Another thing that you would probably run into is privileged containers running in your clusters. Just for a bit of context, a privileged container is a container which has elevated privileges that are not limited by your container runtime. If you run privileged containers in your cluster, apart from the security risks associated with such containers, you'll see that a privileged container has access to basically all the GPUs on a particular node. This affects resource attribution, and it also affects your metrics in terms of which GPUs are accessible to that container. And yeah, you run into problems. Well, we took the decision to avoid running such containers in our clusters; the necessary use cases for privileged containers are handled by some other deployment route rather than Kubernetes. So we do another admission check. If the pod requests privileged capabilities, for example CAP_SYS_ADMIN or CAP_SYS_CHROOT or anything like that, we just deny admission. Also, if the pod's security context allows its privileges to be escalated, we do not admit such pods into our clusters either. And if the answer to all of these is no, then it's a no-op and your pods can get placed.

So yeah, I'll just briefly talk about the future work that we plan on doing. We plan on supporting fractional GPUs. In our current setup, GPUs cannot be shared by containers. With that being said, we do train multiple models on a single GPU using software techniques such as low-rank adaptation. Workloads that are not able to fully utilize the GPU will essentially waste resources, so we are planning to use MIG (Multi-Instance GPU) in order to enhance the utilization of GPUs in our clusters. We found use cases of certain inference workloads which scale well with MIG, so yeah, the plan is to support this. In the future, we also plan to support multiple GPU vendors, like Intel and AMD. A few reasons we are planning to do this are the price-to-performance ratios of certain workloads on these different GPU providers, the fact that these providers have open-source drivers in their stack, and the fact that the device plugins for these vendors are also open source, so you can just check them out. With that being said, I don't expect it to be very smooth, just like the experience that we have had with NVIDIA GPUs.

A bit of a shameless plug: this is the team that has worked on most of these features. Feel free to reach out to us on LinkedIn if you have any questions or feedback. And with that, I'll open the floor for questions if you have any. Please scan the QR code if you have any feedback for the session. Thank you.

You have mentioned Ray. Are you using KubeRay for Ray? And are you using Ray Serve? I'm sorry, can you repeat the question? Yes, you mentioned Ray at the beginning. I guess that you are using the Ray framework on Kubernetes, with KubeRay? Yeah, so we are using Ray workloads; those are running on our clusters to train the machine learning models. We have multiple sorts of batch jobs that run on our clusters, and one of them is Ray, which is a framework by Anyscale. Exactly.
And I think that there is an option to use fractional GPUs with Ray. And are you using anything other than Ray Serve? Because currently what we are using is mainly serving the ML models with Ray Serve, nothing else. So I guess that you are also using Ray jobs? Yes, there are Ray jobs, and there are also Spark jobs. So yeah, there are multiple sorts of jobs that run on the batch platform; it's not just Ray. Super nice, thank you so much. Thank you.

Thanks for the talk. I have two questions. For the first one: are you separating the federation layer based on CPU and GPU workloads? Is that what you meant on the first slide? So the question is whether we are separating GPU and CPU clusters when I said we have separate federation layers. Yeah, we have separate federation layers, but it's mostly by job type. We have a federation layer for stateless workloads, which run things like microservices, and we have a separate federation layer for batch workloads, which run batch jobs like Spark jobs, Ray jobs, and other machine learning workloads. I see. For batch workloads, how are you approaching queueing on the federation layer? That's a good question. We already handle batch federation through something called a multi-cluster federator. What that federation layer does is admit the job from the customers and then try to distribute the job to the appropriate clusters depending on the resources available in those clusters, in a sort of multi-cluster queue. So it will redirect the job to the appropriate cluster depending on the amount of resources that particular cluster has. I would encourage you to watch the KubeCon North America (Chicago) talk if you want to learn more about the federation layer for batch that we have at Uber; I'll link it in the presentation as well. Awesome, thanks. Just one final follow-up question. Sure. Could you share more on what you are prioritizing in terms of that queue? Are you prioritizing fair sharing, or maximum GPU utilization rate, et cetera? Yes, we have a fair-sharing model. I can maybe get back to you on the details of how that fair sharing works at the cluster level; I am mostly part of the platform team, which runs the workloads on the Kubernetes clusters, and that federation layer sits on top of the platform. Awesome, thank you. Thank you very much. Sure.

Thank you so much for a great talk. I have a question. I'm just curious: if you follow all the best practices and avoid those pitfalls, what is the highest CPU and GPU utilization in the cluster that you've seen? That's a really nice question. When it comes to CPU utilization, we are highly overprovisioned, and we are still working on improving the overall CPU utilization across our clusters. But just to answer your question, the peak CPU utilization that we reach in a cluster is close to 80%. You can never really reach a very high CPU utilization, because then you will run into issues with services not being able to perform as expected. And when it comes to GPU utilization, it's more than 50 to 70 percent, but never beyond 80. Thank you so much. OK, thank you very much for your time, guys.