Hello everyone. Thank you for joining me for this session. I'm Yodal Shafir, and today we are going to discuss fair scheduling for deep learning workloads in Kubernetes. But first I would like to introduce myself. I'm a software team lead at RunAI. RunAI provides a compute management platform for AI workloads. I am also a contributor to Kubernetes, and I live and work in Tel Aviv.

As you probably know, data scientists run their workloads on GPUs. The GPUs in the cluster are shared between the data scientists, and we see more and more clusters moving to Kubernetes. Now, in deep learning, workloads can run for days or even weeks. So when data scientists share the GPU resources, deciding which user should get the next GPU can be critical: a wrong decision can cause starvation. In this talk, I will explain how the Kubernetes scheduler decides which pod should be allocated next, and discuss how we can achieve fairness between users when sharing GPUs in Kubernetes.

Let's start by explaining the way Kubernetes schedules pods. In a very simplified way, the scheduler picks a pending pod and then attempts to bind that pod to the most suitable node. Today we will focus on the part where the scheduler picks the right pod for scheduling. The default scheduler stores the pending pods in a heap, sorted by the priority and the creation time of the pods, so the next pod to be allocated sits at the top of the heap. This is a very simplified way of presenting how the default scheduler decides which pod should be allocated next. There is also another heap called the pod backoff queue, and a map for the unschedulable pods, but they are not relevant to what I'm about to explain, so we will not get into them.

Using the default scheduler of Kubernetes to share a cluster with a limited number of GPUs can cause fairness issues, because data scientists can monopolize the cluster.
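To make that ordering concrete, here is a minimal Python sketch of the pending-pod heap; the pod names, priorities, and timestamps are made up for illustration, and this is not the real kube-scheduler code (which is written in Go):

```python
import heapq

def queue_key(priority, creation_time):
    # heapq is a min-heap, so negate the priority to pop the highest-priority pod first
    return (-priority, creation_time)

pending = []
heapq.heappush(pending, (queue_key(priority=0, creation_time=1), "alice-pod-0"))
heapq.heappush(pending, (queue_key(priority=100, creation_time=2), "bob-pod-0"))
heapq.heappush(pending, (queue_key(priority=0, creation_time=3), "alice-pod-1"))

_, next_pod = heapq.heappop(pending)
print(next_pod)  # bob-pod-0: higher priority wins even though it was created later
```

Note that a pod's priority dominates its creation time; creation time only breaks ties between pods of equal priority. That detail is exactly what enables the monopolization problem we look at next.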
One way for users to monopolize the GPUs is to submit pods with the highest priority they are allowed to use. This moves their pods to the top of the heap, causing them to be allocated first. As you can see in this example, Bob submits pods with higher priority than Alice, so his pods will be at the top of the heap and therefore will be scheduled first. Eventually Bob can monopolize and use all the GPUs in the cluster.

Another way to monopolize the GPUs is to submit as many pods as possible. More of the user's pods will be in the heap, and those pods will also have earlier creation times. As you can see in this example, Bob submitted many pods, so the heap contains mostly his pods, and therefore he will also get more GPUs allocated.

So our motivation is to build a fair scheduler: one that knows how to share the GPU resources between the users fairly, regardless of creation times and pod priorities. We want that even in our examples, where Alice submits fewer pods and submits pods with lower priority, she would still get the same number of GPUs as Bob.

Now let's discuss how we can achieve such a scheduler. In v1.15 of Kubernetes, the architecture of the scheduler was changed to use the scheduling framework. The scheduling framework is more pluggable and provides several extension points to change the behavior of the scheduler. It was a big game changer for building a scheduler different from the default one, because it reduces the need for developers to build their own scheduler from scratch and try to keep up with all the Kubernetes features. We examined the scheduling framework for implementing a fair scheduler, and we decided not to use it. The reason was that we wanted a different data structure for the pending pods, rather than just a simple heap of pods, and the framework simply doesn't let you change this data structure. So we decided to build our own scheduler.
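To illustrate that limitation: the framework's QueueSort extension point essentially lets you supply a comparison function between two pending pods, roughly the shape sketched below in Python (the real interface is in Go, and the pod dictionaries here are illustrative). You can change how the single heap is ordered, but you cannot replace the heap itself with, say, a per-user structure:

```python
def default_less(pod_a, pod_b):
    """Mimics the default queue sort: higher priority first, then earlier creation."""
    if pod_a["priority"] != pod_b["priority"]:
        return pod_a["priority"] > pod_b["priority"]
    return pod_a["creation_time"] < pod_b["creation_time"]

# Whatever Less function you plug in, the queue stays one flat heap of pods;
# the comparator only sees two pods at a time, with no view of per-user usage.
bob_pod = {"name": "bob-0", "priority": 100, "creation_time": 2}
alice_pod = {"name": "alice-0", "priority": 0, "creation_time": 1}

print(default_less(bob_pod, alice_pod))  # True: Bob's pod sorts first purely on priority
```

A fairness policy needs to know how many GPUs each user already holds, which is global state a pairwise pod comparator cannot express; that is why we needed a different data structure.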
The data structure we use is a two-dimensional heap. In the main heap, every node represents a user, and we sort the heap by the number of GPUs that user currently has allocated. Every user node is then associated with another heap of that user's pending pods. The pending pods are sorted just as in Kubernetes, by priority and by creation time. So if I once again compare this data structure to the heap of the scheduling framework, we can now associate the pending pods with a specific user, and sort the users according to their load on the cluster, rather than just using a simple heap of pods.

The way we allocate pods is this: first we pop the first user from the users heap, then we pop the first pod of that user and allocate it. In the following example, Alice is the first user in the users heap, so we allocate her first pod and update the number of GPUs Alice currently has allocated. Once we do that, the heap is updated and Bob moves to the top. Notice that at the top of the users heap we always have the most starved user. This helps us achieve fairness, because we know who the most starved user is, and they are the one who should get the next GPU. Once again we pop the first pod from Bob's pods and allocate it. We continue this flow until there are no more free GPUs in the cluster. So once we update the heap, Alice moves back to the top and we allocate her a pod, and then we do the same for Bob. As you can see, the GPUs are now shared fairly between Alice and Bob.

This solution provides fairness when deciding which pod should be allocated next. But is it good enough to provide fairness between the users? Let's examine the following scenario. Let's assume that Bob submitted many long-running pods while the GPUs were free.
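Here is a simplified Python sketch of this two-dimensional heap and the allocation loop. It is an illustrative model, assuming every pod requests exactly one GPU; it is not the actual RunAI implementation:

```python
import heapq

class User:
    def __init__(self, name):
        self.name = name
        self.allocated_gpus = 0
        self.pending_pods = []  # min-heap of (-priority, creation_time, pod_name)

    def push_pod(self, pod_name, priority, creation_time):
        heapq.heappush(self.pending_pods, (-priority, creation_time, pod_name))

def allocate(users, free_gpus):
    """Repeatedly give the next GPU to the most starved user who has pending pods."""
    allocations = []
    user_heap = [(u.allocated_gpus, u.name, u) for u in users]
    heapq.heapify(user_heap)
    while free_gpus > 0 and user_heap:
        _, _, user = heapq.heappop(user_heap)  # most starved user is on top
        if not user.pending_pods:
            continue  # nothing pending: drop this user from the heap
        _, _, pod_name = heapq.heappop(user.pending_pods)
        allocations.append((user.name, pod_name))
        user.allocated_gpus += 1
        free_gpus -= 1
        heapq.heappush(user_heap, (user.allocated_gpus, user.name, user))
    return allocations

alice, bob = User("alice"), User("bob")
for i in range(3):  # Bob floods the queue with high-priority pods
    bob.push_pod(f"bob-{i}", priority=100, creation_time=i)
alice.push_pod("alice-0", priority=0, creation_time=10)
alice.push_pod("alice-1", priority=0, creation_time=11)

result = allocate([alice, bob], free_gpus=4)
print(result)  # GPUs alternate between the users despite Bob's priority and head start
```

Even though Bob submitted more pods, earlier, and with higher priority, the allocation alternates between the two users, because the outer heap re-sorts on allocated-GPU count after every allocation.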
Because he was the only active user, we want to allow Bob to use the whole cluster and maximize GPU utilization, so we allocate all the GPUs to Bob. Now, when Alice submits pods, her pods will remain in a pending state, and Alice will be starved until Bob finishes using the GPUs. Bob's pods can run on the GPUs for weeks, so Alice can be starved for a very long time. This means that a user can still monopolize the GPUs.

To solve this problem, we added another algorithm that enables preemption between the users. The way we handle preemption between the users is the following. We calculate the number of GPUs every active user deserves at a given moment. Then we run a simulation where we preempt the pods of the users who are monopolizing the GPUs and attempt to allocate the pods of the starved user. If the simulation succeeds, we preempt the pods and simply let our allocation algorithm apply.

Let's see an example of this flow. We have four GPUs in the cluster, so we calculate how many GPUs every user should have: that's two GPUs each, which also means Bob uses two GPUs more than he should. So we preempt two of Bob's pods and move those pods back to the pending-pods heap. Now we have two free GPUs, and we simply apply our allocation algorithm. We allocate Alice's first pod and update her number of allocated GPUs, but she remains at the top of the heap because she is still starved, so we allocate her second pod as well. As you can see, Bob and Alice are now sharing the cluster fairly.

To summarize: with the allocation and preemption algorithms combined, a user can get most of the GPUs when the GPUs are free, and when other users submit pods they are able to get the GPUs they deserve and will not be starved.

Let's see an example of all of this from one of our customers. When you look at the marked time window, you can see that the blue user uses about 20 GPUs while the pink user uses about 10 GPUs.
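The fair-share calculation and the preemption step can be sketched like this. Again, this is an illustrative model, assuming one GPU per pod and an even integer split between active users; it is not the actual RunAI code:

```python
def fair_share(total_gpus, active_users):
    # assume an even integer split between active users, for simplicity
    return total_gpus // active_users

def pods_to_preempt(running, total_gpus):
    """running maps each active user to the list of their running pods."""
    share = fair_share(total_gpus, len(running))
    victims = []
    for user, pods in running.items():
        over_quota = len(pods) - share
        if over_quota > 0:
            victims.extend(pods[:over_quota])  # these go back to the pending-pods heap
    return victims

running = {"bob": ["bob-0", "bob-1", "bob-2", "bob-3"], "alice": []}
print(pods_to_preempt(running, total_gpus=4))  # two of Bob's pods are reclaimed
```

In the real flow, preemption only happens after the simulation confirms the starved user's pods can actually be placed on the freed GPUs; the reclaimed pods then re-enter their owner's pending-pods heap and compete again through the normal allocation algorithm.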
We allow the blue user to get 20 GPUs because the GPUs are available and we want to get the maximum utilization out of the cluster. After a few hours, in the newly marked window, the pink user submits more pods, so we preempt the blue user and let the pink user use those GPUs. As you can see, the lines are getting closer to each other; this is because we took GPUs from the blue user and gave them to the pink user. So this is what fairness actually looks like.

Thank you all for your time. You are more than welcome to contact me if you have any questions or want to discuss scheduling with me. Thank you.