Hello everyone. Welcome to the session. My name is Amit, and I have Kevin here with me. Both of us work on the compute platform team at Uber, which is part of the platform engineering group. In today's session, we are going to talk about efficient resource utilization for batch compute on Kubernetes. We will walk through the challenges we faced in achieving efficient resource utilization and how we resolved them.

Coming to the agenda: first, we will give an overview of batch compute at Uber. Then we'll talk about the importance of resource sharing and the kinds of challenges we faced in efficient resource utilization. Then we'll jump to the solutions, the key ones being regional resource management and federation, and specialized hardware efficiency. After that, we'll talk about some of the future work planned for next year, and then we'll happily answer your questions.

A quick overview of the compute team: we have two forms of compute, stateless compute and batch compute. Aditya and Apurva already talked about stateless compute in the previous session; in this session, we are going to focus on batch compute. The underlying architecture remains essentially the same for both. Earlier, we were running on Peloton, powered by Mesos, and now we are in the process of migrating to Kubernetes. All of the lower-level architecture, such as Crane (host as a service), the physical hardware, and the network infrastructure, remains the same as for stateless compute. Within batch compute, we'll largely focus on the machine learning jobs, which are powered by our sister team, Michelangelo. We'll also talk about Spark jobs and Data Science Workbench.

Coming to the use cases: at Uber, we have both end-user use cases and platform use cases that are served through batch workloads. Some of the key end-user use cases are rider pricing intelligence, ETA estimation, destination suggestion, and so on. When you open the Uber app and see the ETAs, those are powered by batch workloads. On the platform side, we serve use cases like AI model training and data science notebooks through batch workloads as well. To serve all these use cases, we largely depend on three types of processing frameworks: Spark, Ray clusters and Ray jobs, and the native Kubernetes Job. Spark jobs run through the SparkApplication CRD and the Spark operator open-sourced by Google Cloud Platform. Ray jobs run through the RayCluster CRD and the Ray operator. The Kubernetes Job is already present in upstream Kubernetes, so we use it as-is.

Before we move on, a quick status update. We have already migrated all Data Science Workbench jobs and sessions to Kubernetes. Most of the machine learning and AI jobs are already running on Kubernetes; some of them are still on Mesos and will be migrated in the coming half. For Spark jobs, we have finished building the solution and have just initiated the migration. In the coming half and next year, we target finishing these migrations, so by next year we expect all batch workloads to be running on Kubernetes.

Coming to the scale of batch compute at Uber: we have around 30,000 hosts dedicated to batch compute, which comes to around one million cores. We also have about 4,000 GPUs serving the machine learning jobs and AI model training.
And per day, we launch around 3 million containers for batch workloads; at peak, we go up to a launch rate of 500 pods per second.

Now that we understand the use cases and the scale we're serving through batch workloads at Uber, let's also understand the importance of resource sharing, because there are multiple teams solving their use cases through batch workloads, and it's important to see how we solve the resource sharing problem for them. This graph shows the resource utilization for one particular team. The red line in the graph is the reservation that team has made, but most of the time they are not adhering to it: they are either above the reservation or underutilizing what they have reserved. In other words, they cannot really predict how much they will need. When they go above the reservation, they need resources from somewhere else, from other teams; when they are low in utilization, they can share resources with someone else. This pattern is largely the same for most teams; the only thing that differs is the timing. They go up and down in utilization at different times.

The next graph shows the utilization aggregated across multiple teams. If we aggregate the resource allocation for multiple teams and plot it over time, we find that it is pretty close to the total available resources. That is why resource sharing is important: if we did not share resources, each team would have to reserve for its peak, and that would cause problems in terms of cost and efficiency.

So how do we solve this sharing problem? At Uber, we have a concept of resource pools. What is a resource pool? A resource pool is a logical abstraction over a set of resources such as CPU, memory, and GPU. In this picture, the left half shows the organizational hierarchy of Uber (it could be any other company as well), and the right half shows a similar hierarchical structure for the resource pools. The total resources available at the root level get distributed among the children, which means the sum of the children's resources equals the resources available at the parent level. This hierarchical structure of resource pools is what lets us achieve sharing between teams, projects, and orgs. Going forward, we'll see how these resource pools are used for sharing; Kevin will explain that.
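To make that hierarchy concrete, here is a minimal Go sketch of such a resource-pool tree and its invariant. The type names and fields are illustrative assumptions for the example, not Uber's actual data model:

```go
package main

import "fmt"

// Resources bundles the dimensions mentioned in the talk (illustrative only).
type Resources struct {
	CPUCores int
	MemoryGB int
	GPUs     int
}

func (a Resources) add(b Resources) Resources {
	return Resources{a.CPUCores + b.CPUCores, a.MemoryGB + b.MemoryGB, a.GPUs + b.GPUs}
}

// ResourcePool is one node in the hierarchy: an org, a project, or a team.
type ResourcePool struct {
	Name     string
	Capacity Resources
	Children []*ResourcePool
}

// Validate enforces the invariant from the talk: the sum of the children's
// capacities equals the capacity available at the parent level.
func (p *ResourcePool) Validate() error {
	if len(p.Children) == 0 {
		return nil
	}
	var sum Resources
	for _, c := range p.Children {
		if err := c.Validate(); err != nil {
			return err
		}
		sum = sum.add(c.Capacity)
	}
	if sum != p.Capacity {
		return fmt.Errorf("pool %q: children sum to %+v, parent has %+v", p.Name, sum, p.Capacity)
	}
	return nil
}

func main() {
	root := &ResourcePool{
		Name:     "company",
		Capacity: Resources{CPUCores: 1000, MemoryGB: 4000, GPUs: 8},
		Children: []*ResourcePool{
			{Name: "org-a", Capacity: Resources{600, 2400, 8}},
			{Name: "org-b", Capacity: Resources{400, 1600, 0}},
		},
	}
	fmt.Println(root.Validate()) // <nil>: the hierarchy is consistent
}
```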
Moving on, let's talk about the first version of the architecture that we implemented to solve these problems. We created the resource pools inside the clusters, and since each cluster at Uber is zonal in nature, each of these resource pools also became zonal: the resources in a resource pool could come from only one zone. (We could still create multiple resource pools inside the same zone; that was not a problem.) So this was our V1 architecture: multiple zonal clusters, each with a Kubernetes control plane, a set of kubelets, and the operators that execute the batch workloads like Spark or Ray. Our customers, who are internal customers, were hitting these clusters, the kube API servers, directly. That means each customer had an integration with these clusters and submitted workloads directly.

These two architectural decisions, first the zonal resource pools, and second the customers interacting with the clusters directly without any federation layer in the middle, caused us multiple challenges. The key ones: first, fragmentation; second, non-uniform cluster usage; third, zonal availability of resource pools; and then multiple issues in cluster management and operations. We'll talk about each of these in the next slides.

First, the problem of fragmentation. In this picture, we again have three clusters, with multiple resource pools created on them, and each cluster also has some buffer capacity. The total buffer capacity shown here is 80 units. But if we want to create a new, incoming resource pool, resource pool 9, that requires exactly that total buffer capacity, we cannot: a resource pool can be created on only one of the clusters, and none of the clusters has 80 units of capacity available on its own. The capacity is fragmented. Even though the total capacity is available, we cannot create the new resource pool, and that's a big loss.

Moving on, the next challenge we faced was non-uniform cluster usage. This graph shows the cluster utilization for four clusters. One cluster, in green, is far below average utilization, around 20% or even dipping to 10%, while another is peaking at 100% multiple times. Whenever we submit workloads to cluster 4, which is peaking a lot, the workloads go to a pending state because resources are not available for execution; at the same time, resources on the cluster with very low utilization are wasted. Why does this happen? Because the resource pools to which the workloads are submitted are pinned to clusters, and teams can submit workloads only to their own resource pools. So sometimes we waste resources, and sometimes workloads wait for execution.

The next challenge was the zonal availability of resource pools. Again, we have three zonal clusters here. If zone 1 goes down, zonal cluster 1 also goes down, which means the resource pools on that cluster go down with it. The teams that own resource pools 1, 2, and 3 cannot submit their workloads to those pools, and that is a big problem from a disaster recovery and availability point of view.

Moving on to the next challenge: we faced multiple problems in cluster management and operations due to this first architecture. The first was the coordination required to turn clusters up and down. Whenever we created a new cluster or brought one down, we had to create new resource pools, pass the information to the different teams who were going to use them, and, when bringing a cluster down, get approvals from the different teams. This became very process-heavy.
The second problem was cluster selection. Teams could see only their own resource pools, which means only some of the clusters, not all of them at once, so they never had the global view needed to choose the best cluster for a new workload. Also, whenever we had to do releases and upgrade the kube version, bringing down some of the kubelets in a cluster lowered the availability of the resource pools on it, which impacted the customers. These are the key challenges we faced with the V1 architecture. Next, Kevin is going to talk about how we solved these challenges with federation in batch compute. Kevin.

Thanks, Amit. To solve those problems, we implemented a new service called the batch federator. With the federator service, all cluster details are abstracted away from the clients. It provides simple APIs to create, read, update, and delete batch workloads. How do those APIs work? To create a job, the batch federator first looks at the properties of the incoming workload, for example, the amount of resources the workload is requesting and the type of resources, such as whether the workload is requesting GPU or not. Based on those properties, the cluster selector module within the federator decides which cluster to place the workload on, and then the federator creates the workload CRD object on the respective zonal Kubernetes cluster. Once the workload CRD object has been created, depending on the workload type, whether it's Spark or Ray, the respective operator running on that zonal Kubernetes cluster creates the pods and continues to manage their lifecycle. For the remaining operations after the workload has been created, that is, the read, update, and delete API calls, the federator uses a module called the WorkloadTracker to maintain an internal mapping between workloads and clusters. Based on that mapping, each API request is routed to the kube API server of the cluster that contains the workload.

So how does the cluster selection work? The federator has a list of clusters known to itself, and it first runs them through the cluster filter to get the set of clusters eligible to place the workload. Say the workload is requesting GPU resources: the cluster filter will filter out any clusters that do not have GPU hosts. Next, the set of eligible clusters goes through the cluster ranker, which ranks the clusters based on various characteristics such as available resources in the cluster, available resources in the resource pool, the amount of pending resources, and so on. After ranking, the most suitable cluster is selected to place the workload. Let's take a simpler example. Say the workload is requesting 10 CPU cores and all clusters have passed the cluster filtering, meaning they all have sufficient resources. Cluster 1 has 100 CPU cores available, cluster 2 has 50, and cluster 3 has 20. Because cluster 1 has the most CPU cores available, it is ranked as the most favorable cluster and selected for placing the workload.
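As a rough illustration of that filter-then-rank flow, here is a small Go sketch. The struct fields and the single ranking criterion are simplifying assumptions; as noted above, the real ranker weighs several signals:

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster describes what this toy selector knows about a zonal cluster.
type Cluster struct {
	Name         string
	HasGPUHosts  bool
	FreeCPUCores int
}

// Workload captures the request properties the federator inspects.
type Workload struct {
	CPUCores int
	NeedsGPU bool
}

// selectCluster mirrors the two-stage flow from the talk: filter out
// ineligible clusters, then rank the rest and take the most favorable one.
func selectCluster(clusters []Cluster, w Workload) (Cluster, error) {
	var eligible []Cluster
	for _, c := range clusters {
		if w.NeedsGPU && !c.HasGPUHosts {
			continue // filter stage: no GPU hosts, drop the cluster
		}
		if c.FreeCPUCores < w.CPUCores {
			continue // filter stage: not enough free capacity
		}
		eligible = append(eligible, c)
	}
	if len(eligible) == 0 {
		return Cluster{}, fmt.Errorf("no eligible cluster for workload %+v", w)
	}
	// Rank stage: a single criterion here (most free CPU wins); the real
	// ranker also weighs resource-pool headroom, pending resources, etc.
	sort.Slice(eligible, func(i, j int) bool {
		return eligible[i].FreeCPUCores > eligible[j].FreeCPUCores
	})
	return eligible[0], nil
}

func main() {
	clusters := []Cluster{
		{Name: "cluster-1", FreeCPUCores: 100},
		{Name: "cluster-2", FreeCPUCores: 50},
		{Name: "cluster-3", FreeCPUCores: 20},
	}
	best, _ := selectCluster(clusters, Workload{CPUCores: 10})
	fmt.Println("selected:", best.Name) // cluster-1, as in the talk's example
}
```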
However, the federator alone only solved half the problem. As Amit mentioned earlier, in our V1 architecture the workload from a certain team can only run on the zonal clusters that have the resource pool owned by that team. This means that unless we create resource pools on every single cluster, which is quite wasteful, the workload cannot be freely placed on any cluster. So how do we solve this? Instead of having the resource pools pinned to a specific zonal Kubernetes cluster, resource pools are now managed regionally by a service we call the regional resource manager. The regional resource manager is responsible for tracking the amount of resources consumed by resource pools regionally, across clusters. Most importantly, it constrains the amount of resources a resource pool can consume. And lastly, it provides the capability to share resources elastically among resource pools.

How does the regional resource manager work? At a very high level, it monitors the pods within the clusters in order to track the usage of every resource pool. It aggregates the available capacity from the clusters within the region, then distributes those capacities to each individual workload fairly while respecting the constraints of the resource pools.

As a summary, with federation and regional resource management, we are able to solve the problems mentioned earlier, and I will get into each of the points in detail in later slides. First, we eliminate resource fragmentation. Second, we make cluster usage significantly more uniform across clusters. Third, we provide regional availability of resource pools. And lastly, we greatly simplify cluster management. Meanwhile, with the regional resource manager, we provide elastic resource sharing across different resource pools within the region. I will also cover the additional work we have been doing to further improve cluster efficiency by supporting specialized hardware, as well as the future work we are planning for Kubernetes.

As you may recall from the earlier slide, if we place resource pools statically within the zonal Kubernetes clusters, there is buffer capacity inside each cluster, but we are not able to create a single resource pool that utilizes that buffer capacity. That is the resource fragmentation problem, and it is solved now that we manage the resource pools at the regional level. In the V2 architecture, we only need to make sure the sum of the resource pool capacities is less than or equal to the cluster capacity within the region. As mentioned earlier, the cluster selector picks the best cluster to run the workloads, and the regional resource manager makes sure the resources consumed by the workloads stay within each resource pool's constraints. As you can see in the diagram, we are now able to utilize all capacity within the region and create a resource pool for a team to consume using the buffer capacity, without fragmentation.
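In code terms, the V2 feasibility check reduces to a regional sum rather than a per-cluster one. Here is a toy Go sketch of the idea; the capacity numbers are made up, with only the 80-unit buffer taken from the fragmentation example earlier:

```go
package main

import "fmt"

// canCreatePool checks the V2 constraint: the sum of all resource pool
// capacities (existing plus new) must fit within the region's total
// cluster capacity, regardless of how it is split across clusters.
func canCreatePool(clusterCapacities, poolCapacities []int, newPool int) bool {
	regionTotal, reserved := 0, 0
	for _, c := range clusterCapacities {
		regionTotal += c
	}
	for _, p := range poolCapacities {
		reserved += p
	}
	return reserved+newPool <= regionTotal
}

func main() {
	// 80 units of buffer spread across three clusters can now back a
	// single new 80-unit resource pool, which V1 could not place.
	clusters := []int{100, 100, 100} // per-cluster capacity (illustrative)
	pools := []int{90, 70, 60}       // existing pools: 220 reserved, 80 free
	fmt.Println(canCreatePool(clusters, pools, 80)) // true in V2
}
```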
The next benefit we saw is significantly more uniform cluster utilization. Here is a test run that we performed. On the left is the cluster utilization graph for the non-federated clusters, and on the right the graph for the federated clusters. The test was performed using the same dataset, meaning the same precisely controlled workloads, workload resources, workload iterations, and the same resource pool configuration. The only difference is that on the left, the resource pools are created statically, whereas on the right, resource pools are managed dynamically by the regional resource manager. As you can see, for the non-federated clusters, similar to our V1 environment, the static resource pool placement leaves a few clusters overly utilized while other clusters are underutilized. For the federated clusters, the utilization is uniform across the clusters. Also, because cluster utilization peaks at different times for different clusters in our V1 architecture, having uniform cluster utilization gives us the opportunity to shrink the overall cluster size. Meanwhile, on the workload side, we observed a 3x reduction in P95 task scheduling time on the federated clusters, which means users see a much shorter time to run their workloads.

The next benefit is regional availability of the resource pools. Previously, when a cluster's zone went down, since the resource pools were cluster-local, the workloads depending on those resource pools would fail to execute. With the regional resource manager, we can place the workload on any cluster that has capacity, in most cases. As you can see in the diagram, after cluster 1 goes down, workloads 1 through 4 can run on clusters 2 and 3, given that those clusters have available capacity.

Lastly, since clients use the APIs provided by the federator to interact with the underlying Kubernetes clusters, cluster management becomes much easier. Clients no longer need to maintain the list of available clusters, so no coordination is required to turn a cluster up or down. Clients also no longer need to worry about picking the best cluster to use. And it makes our platform team's life much easier, since part of a cluster can now be easily turned down for release and upgrade purposes.

Next, I will talk about the elastic resource sharing feature inside the regional resource manager, and how we share resources among the resource pools within a region. Let's take a simple example. We have resource pools 1, 2, and 3 in region 1. Resource pools 1 and 2 have free capacity, whereas resource pool 3 is fully utilized. Now say a workload is created on resource pool 3. Without elastic resource sharing, the workload would stay pending until other workloads in resource pool 3 finished. With elastic resource sharing, the regional resource manager sees that there are pending workloads in resource pool 3 but also unused resources in resource pools 1 and 2, so it borrows the resources from resource pools 1 and 2 and uses them to start the workload created on resource pool 3.

Now, what if an additional workload is created on resource pool 1? Since resource pool 1 still has some free capacity available, that capacity is used to run some of the newly created workload, but part of the workload remains pending. Because resource pool 3 is consuming borrowed resources, it has lower priority on those resources than resource pool 1. So the regional resource manager evicts the pods within resource pool 3's workload, giving the borrowed resources back to resource pool 1 so that it can meet the resource requirements of the pending workload. Finally, the additional workload created on resource pool 1 is able to take those resources back from resource pool 3.
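To sketch that borrow-and-reclaim cycle in code: a toy model under assumed numbers, not the production algorithm (Go 1.21+ for the built-in min):

```go
package main

import "fmt"

// Pool is a simplified view of a resource pool for this sketch: what the
// team reserved, what its own workloads use, and what it has lent out.
type Pool struct {
	Name      string
	Guarantee int
	OwnUsage  int
	LentOut   int
}

func (p *Pool) free() int { return p.Guarantee - p.OwnUsage - p.LentOut }

// borrow satisfies pending demand in a full pool from other pools' slack,
// returning how much could be granted.
func borrow(lenders []*Pool, demand int) int {
	granted := 0
	for _, l := range lenders {
		take := min(l.free(), demand-granted)
		l.LentOut += take
		granted += take
		if granted == demand {
			break
		}
	}
	return granted
}

// reclaim runs when a lender's own workloads need capacity back: borrowed
// pods are evicted until the requested amount is returned.
func reclaim(lender *Pool, needed int) int {
	evicted := min(lender.LentOut, needed)
	lender.LentOut -= evicted
	return evicted
}

func main() {
	rp1 := &Pool{Name: "rp1", Guarantee: 100, OwnUsage: 60} // 40 free
	rp2 := &Pool{Name: "rp2", Guarantee: 100, OwnUsage: 80} // 20 free

	// rp3 is fully utilized and has 30 units of pending work, so the
	// regional resource manager borrows from rp1 (and rp2 if needed).
	fmt.Println("granted to rp3:", borrow([]*Pool{rp1, rp2}, 30)) // 30

	// New demand of 30 arrives on rp1: its remaining 10 free units run
	// part of it, and the other 20 are reclaimed by evicting borrowers.
	rp1.OwnUsage += rp1.free() // use rp1's own slack first (10 units)
	fmt.Println("evicted from borrowers:", reclaim(rp1, 20)) // 20
}
```

In the real system, as mentioned in the Q&A later, the manager decides which pods are safe to evict based on an annotation marking the workload as preemptible.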
So, next, I will pass it back to Amit to talk about specialized hardware efficiency. Thank you, Kevin. Now that we have seen how resource sharing works, how we changed our architecture from zonal resource pools to regional resource pools and introduced a federation layer at the region, let's talk about some more efficiency efforts, a key one being specialized hardware efficiency. Our clusters are largely heterogeneous in nature; by heterogeneous, I mean they have both CPUs and GPUs available in the same cluster. Some of the GPUs are also different from the general GPUs: they are dedicated to training special kinds of machine learning and AI jobs, and they are much more costly. We don't want to waste those special GPUs by running other GPU jobs, or jobs that don't require any GPU, on them.

In this diagram, we have three types of pods: pods that require only CPU, in green; pods that require general GPUs, meaning they can run on any GPU; and pods that require special GPUs, in pink. Two scenarios are shown here, and both are bad for us. In the first, the CPU-only pods, the general GPU pods, and the special GPU pods all get placed on the special GPU nodes, and that's wastage. In the second, a CPU-only pod gets placed on a GPU node, which is again a waste of hardware.

So how did we solve this? We implemented a pair of scheduler plugins: a global GPU management filter and a special GPU management filter. The global GPU management filter filters out all GPU nodes whenever a pod requires only CPU, so CPU-only pods are never placed on GPU nodes. The special GPU management filter filters out all special GPU nodes from scheduling for any pod that requires a general GPU. And finally, node-level selectors match the special GPU pods to the special GPU nodes. That way, we correctly place CPU pods on CPU nodes, general GPU pods on general GPU nodes, and special GPU pods on special GPU nodes, and we don't waste any of the costly hardware executing pods that don't require it.
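The net placement rule produced by the two filters plus the node selectors can be sketched as a small feasibility matrix. The real implementation lives in kube-scheduler filter plugins that inspect GPU resource requests and node labels; this Go sketch only captures the resulting placement logic, with made-up type names:

```go
package main

import "fmt"

// PodClass and NodeClass reduce the example to the three categories from
// the talk (illustrative names, not the real plugin types).
type PodClass int
type NodeClass int

const (
	CPUOnlyPod PodClass = iota
	GeneralGPUPod
	SpecialGPUPod
)

const (
	CPUNode NodeClass = iota
	GeneralGPUNode
	SpecialGPUNode
)

// feasible encodes the combined effect of both filters and the selector:
//   - global GPU management filter: CPU-only pods never land on GPU nodes
//   - special GPU management filter: general GPU pods never land on
//     special GPU nodes
//   - node selectors: special GPU pods land only on special GPU nodes
func feasible(p PodClass, n NodeClass) bool {
	switch p {
	case CPUOnlyPod:
		return n == CPUNode
	case GeneralGPUPod:
		return n == GeneralGPUNode
	case SpecialGPUPod:
		return n == SpecialGPUNode
	}
	return false
}

func main() {
	fmt.Println(feasible(CPUOnlyPod, SpecialGPUNode))    // false: no wasted GPUs
	fmt.Println(feasible(GeneralGPUPod, GeneralGPUNode)) // true
}
```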
Next, coming to the future work. So far, we have solved the problem of running batch workloads on Kubernetes for the machine learning jobs from Michelangelo, Data Science Workbench, and Spark jobs. Next year, we are going to target Presto, Apache Flink, and our internal pipeline execution framework, Piper, all running on Kubernetes. So we are targeting all these solutions running on Kubernetes, on the shared clusters, in the coming year. That's all we had for today. Thanks, all of you, for patiently listening to us. Please share your feedback via this QR code. Thank you. If you have any questions, please.

Q: For placing pods on CPU nodes and GPU nodes, Kubernetes already has taints, tolerations, and node affinity, right? So what made you implement scheduler plugins beyond that?
A: Taints and tolerations are certainly one alternative, and we did consider them in our design discussions. We wanted to keep this controlled mostly on the platform side, so that the customers of the platform group who consume these solutions don't have to make many changes. That was the reason we implemented it through scheduler plugins.
Q: So the pods coming from the customers, they don't have taints?
A: They don't have any tolerations on them. They don't have to make any changes at all.
Q: Okay, thank you. I have a question about how the regional resource manager works. It sounds like it aggregates across different zones, but does this assume that jobs can be disrupted and moved across zones? Does it assume jobs are preemptible while it aggregates across zones and dynamically sizes the resource pools?
A: Most jobs are certainly preemptible, I agree with you. But if the resource pools to which they are submitted are present in only a certain zone, and that zone itself is unavailable, then the hardware capacity that the team reserved for itself, captured via the resource pool, will not be available. Once we moved the resource pool itself to the regional level, that availability problem was solved.
Q: Yeah, I guess I'm more interested in whether, when you're doing defragmentation, you assume jobs are preemptible and move them between zones.
A: Yes, we do assume that jobs are preemptible.
Q: Okay, thank you. Thank you for the talk, really interesting. I have a question about the preemption you use to reclaim resources that were borrowed by another resource pool. Do you use the Kubernetes-native priority for pods, or do you have your own priority logic implemented in the resource manager?
A: That's a good question. We implement our own logic in the resource manager. Basically, we tag the pod, via an annotation, as to whether it is a preemptible workload or not, and based on that annotation we decide whether the pod can be preempted.
Q: Cool, thank you. Hi, this is on cloud, right? This is not a private cluster?
A: Right now we solved it on-prem, but the hardware layer, as you saw in the first diagram, is provided by a unified layer called Crane (host as a service), so it should work equally well on-prem or in the cloud.
Q: The reason I ask is that if it was cloud, you could have kept version one and used a cluster autoscaler to manage the cluster size; there's no reason a zonal cluster should be limited in size, right?
A: Size was not a problem at all. Even with zonal clusters, we were able to create clusters of 3,000 to 4,000 nodes.
Q: Then why couldn't a cluster autoscaler solve the problem? Why did you have to move the jobs and do all the rest?
A: When someone reserves capacity using resource pools within a single zone, that zone itself can become unavailable. Sometimes an availability zone goes down, and the probability of an availability zone going down is much higher than the probability of a whole region going down.
Q: Okay, so it's mainly from an availability perspective and a resource optimization perspective. Okay, yeah, thank you.