Hello everyone, thanks for joining today's session, and we hope you're all having a great time at KubeCon. My name is Kante, I'm from DaoCloud, based in Shanghai, China, and... Hello. Hi, my name is Aldo. I'm from Google in Canada. And together, with Kante and multiple other contributors, we have built this queueing system that we're going to present to you now.

Yeah, so today we'd like to talk a bit about building a batch system in the cloud with Kueue. Before we talk about Kueue, I think it's worth saying a bit about what a job is. It may be a little funny to talk about this at a batch session, but just in case we have some newcomers who are interested. So what's a job? We think it has several key points. First, a job runs to completion. This is different from long-running workloads like Deployments and StatefulSets, which restart their pods when they run into failures; jobs run to completion. And usually a batch job runs as a group of pods. They may run independently or collaboratively. For example, an Indexed Job runs its pods in parallel, each with its own completion index, while an MPI job has a launcher and workers that communicate with each other. Jobs are also flexible: you can run them at different times and in different locations, and they are compatible with a variety of resources. Let's take time as an example. Say we have some latency-insensitive workloads. We can deploy them together with latency-sensitive services, and when we hit a sudden spike, the insensitive workloads can be preempted. So that's what we call a job.

And the next question: why job queueing? I think this is easy to answer, because resources are limited. On premises we have static clusters, so if we want to run more jobs in parallel, we may have to jump to the cloud for its so-called infinite resources. But usually the cloud cannot grow as big as we want, maybe because we have to account for budgets and discounts in the cloud, and also because a Kubernetes cluster has a scalability limit of around 5,000 nodes. Besides limited resources, we may have jobs with different priorities, where the higher-priority ones should have more chance to run than the lower ones. Job queueing can help us there.

Okay, back to the key point today. What is Kueue? In one word, Kueue is a Kubernetes-native job queueing system. It can help us manage resource quotas across multiple tenants, with quota borrowing and preemption between different tenants. It also supports resource fungibility in heterogeneous clusters. Resource fungibility means we can define a set of resource flavors for a job, and the job will fall through them according to policy. Yes, about the integrations Kueue supports: we supported the Kubernetes-native Job in the first place, and in the latest release we support the Kubeflow MPIJob. We did a lot of work to implement this; for example, we implemented the suspend semantics in the Kubeflow MPIJob. With the experience from MPIJob, we believe it will be easier to integrate with Kueue if we provide a controller library as scaffolding. So now, if you want to build your own controller, you can just implement the interface, which shapes the behaviors of the job. Yeah, we are looking forward to collaborating with more communities, and the exciting news is that other cloud-native communities are also searching for a job queueing solution, and we hope we can do something together. Then we'll talk a bit about the relationship between Kueue and Kubernetes.
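For reference, here is a minimal sketch of the suspend semantics Kueue builds on, using only the stock batch/v1 Job API; the job name, image, and command are illustrative, the fields are upstream Kubernetes:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-batch-job
spec:
  suspend: true             # the job controller creates no pods while this is true
  parallelism: 3            # a batch job runs as a group of pods
  completions: 3
  completionMode: Indexed   # each pod gets its own completion index
  template:
    spec:
      restartPolicy: Never  # run to completion; pods are not restarted in place
      containers:
      - name: main
        image: busybox      # illustrative image
        command: ["sh", "-c", "echo processing index $JOB_COMPLETION_INDEX"]
```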
So there may be a question: we do have some projects under the CNCF that also focus on job queueing, so why do we want to incubate Kueue? I think the biggest difference is the starting point. Kueue focuses on separation of concerns with the other Kubernetes components, especially with the kube-scheduler. The kube-scheduler focuses on pod scheduling, while Kueue focuses on a higher-level workload, the job. So in terms of functionality, Kueue does not overlap with these components. And yes, Kueue is compatible with other Kubernetes components, like the Cluster Autoscaler, which plays an important role in batch processing. As a result, Kueue has close relationships with other Kubernetes SIGs, like SIG Scheduling and SIG Apps, and we are now under the governance of the Batch Working Group. So if you are interested, don't hesitate to submit your ideas there.

Okay, so let's go through how Kueue works inside a Kubernetes cluster. In Kueue, we have different responsibilities for different roles. The batch admin behaves like the control plane: they can create different tenants and set the resource usage limits for each tenant. Batch users, on the other hand, can only submit jobs and run jobs. So, as the first step, the batch admin creates the Kueue API objects, and we assume that with these objects applied, we have built up a batch system. Then the batch user submits jobs with a label pointing to the queue, the LocalQueue. One thing I want to highlight here is the suspend field. Whether or not the job is suspended when it's submitted, if it's under the control of Kueue, it will be suspended first. Then Kueue will try to admit the job based on things like priorities and quotas. If the job fits, Kueue will inject scheduling directives, like node affinity, and unsuspend the job. Then we go through the normal Kubernetes flow: the job controller watches the unsuspended job and creates the pods, the scheduler tries to schedule the pods, and if there is not enough resource in the cluster, the Cluster Autoscaler will bring new nodes on board. Yes. And then I will hand over to Aldo for the API part and several use cases.

So next I'm going to explain some use cases, some close-to-real-world use cases that you can enable through Kueue. For that, we need to understand a little bit about the APIs that we use. As we mentioned earlier, we look at two types of users: the batch administrator and the end user, the researchers, et cetera. Let me start with the APIs that researchers have a view of. Most importantly, a researcher will just use the Job API. That's traditionally the v1 Job API, but it could be others, like MPIJob or the Kubeflow APIs for TensorFlow, et cetera. So that's the primary API that the user will send. The user just writes a regular Kubernetes job, and we will see that there is a special label that Kueue understands, which has the name of the queue. This queue is an object that the end user can observe: they are able to list it and see which queues are available to them, but they don't really need to create it or interact with it. Additionally, Kueue manages some metadata for the job, like whether it's pending, why it's pending, how many extra resources are needed, et cetera. This is all hidden, stored in an internal API that we call Workload.
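As a sketch of that user-facing side: the queue name and namespace below are invented for illustration, but the `kueue.x-k8s.io/queue-name` label is the one Kueue watches for:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # points at a LocalQueue in this namespace
spec:
  parallelism: 3
  completions: 3
  # Kueue keeps the job suspended until it is admitted, whatever the user sets here.
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox       # illustrative
        command: ["sleep", "30"]
        resources:
          requests:
            cpu: "1"         # requests are what count against the queue's quota
```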
And the ClusterQueue API is the next side of the equation that I will explain, which is basically the administration point of view. As I was saying, the other user of Kueue is the administrator, and they have potentially the most complicated task: setting up the cluster to enable resource sharing, et cetera. Down here, we have the nodes of the cluster. As in most Kubernetes clusters, you will have different types of nodes. Here I'm highlighting standard and spot; these are different provisioning guarantees for nodes that happen to have different prices. You could have those, or you could have different models of GPUs. All of these details about the nodes can be abstracted using ResourceFlavor objects. And then the administrator can set up quotas for each of these resources. So here we have two ClusterQueues using these resources. Here on the left you have an example of a ClusterQueue object. In this case, the ClusterQueue governs quotas for two resources, CPU and memory, and we have two flavors: a flavor for standard VMs, also called on-demand, and one for spot VMs. And then we can define the quotas for this queue.

So with that background, let's jump into one example. In this case, let's imagine the simplest setup, where we have one user and one cluster. Probably not the real world, but let's start from there. But we actually have a heterogeneous cluster, where we have spot VMs and on-demand, or let's say we even have committed-use discounts in this cluster. So the user has bought 40 CPUs of committed-use discount, but they also want to be able to go over their commitment if there are too many jobs, and they have a certain quota for spot. So this is the ClusterQueue object. Now, the ClusterQueue is primarily an administration object. The end-user-facing object is the LocalQueue, which is namespaced. This is going to be in the namespace of the user; because in this example we only have one user, that's the default namespace. This object points to the ClusterQueue for the resources. And then we have a couple of ResourceFlavors: one resource flavor representing, in this case, the reservation discounts, where I'm using the official labels from GKE that are associated with standard VMs, and another resource flavor that represents the spot VMs, using the corresponding label. Each of these resource flavors is associated with a set of nodes in the cluster.

So that's the administration side of things. On the other side, we just have the researcher running a job. Again, it's the basic Job API, the same as always, except that it has this extra label that points to the queue. So now all the jobs will be queued in this queue. If I still have quota, they will keep starting; if we don't have quota, they will be suspended. Now, initially, let's say I create a job and there is quota for my reservation. That's the first flavor that is going to be used, so Kueue will select the reservation flavor, and it will inject the corresponding node selector into the job. This guarantees that all the pods in the job will run on the same set of VMs; you generally don't want your job to have pods on nodes of different flavors. But let's say the capacity or the quota for the reservation is exhausted; then Kueue might decide to put everything in the spot flavor instead. And this is the primary API through which Kueue communicates with the rest of the Kubernetes system.
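A hedged sketch of what those objects could look like for this single-user example, assuming the v1beta1 schema from the 0.3 release; the object names and quota numbers are illustrative, the spot node label is the real GKE one, and the reservation label is a stand-in since the exact label from the slide isn't in the transcript:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: reservation
spec:
  nodeLabels:
    # stand-in for the GKE standard-VM/reservation label shown on the slide
    cloud.google.com/reservation-name: my-reservation
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot
spec:
  nodeLabels:
    cloud.google.com/gke-spot: "true"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}      # any namespace may use this queue
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: reservation      # tried first: the 40 committed-use CPUs
      resources:
      - name: "cpu"
        nominalQuota: 40
      - name: "memory"
        nominalQuota: 160Gi
    - name: spot             # fallback flavor once the reservation is exhausted
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 400Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default         # the single user's namespace
spec:
  clusterQueue: cluster-queue
```

On admission, Kueue copies the selected flavor's nodeLabels into the job's pod template as a node selector and unsuspends the job.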
You just inject the node selector, and when the kube-scheduler picks the pods up, it knows exactly which kind of nodes they need to be scheduled on. With that... so that's the first use case.

Before we talk about the next use case, I want to talk about a new feature that we introduced in the latest release, which is preemption. Preemption is basically the idea that as long as I have enough capacity, I can keep adding jobs, right? But if a new job comes in that has a higher priority, we should make space for it. That's one case. But in Kueue, there is also the possibility of sharing resources. So in this case, we have two ClusterQueues: one ClusterQueue for team A and another ClusterQueue for team B. Let's imagine they both look roughly the same. So this is the ClusterQueue definition for team A, and we can see that it has an assigned quota of 40, the 40 we see here, and a borrowing limit of 20. So it can actually borrow anything that the other ClusterQueue is not using, up to 40 plus 20, that is, 60. So let's say I'm running like this; these are all the jobs of each of the teams. A new job comes in, and it can borrow because the other team is not using the resources, so we borrow those resources. Another job comes in, and we can still borrow. But then team B wants to run a new job. Now, team A was actually borrowing from team B, right? So we want to recover that quota for team B, because it belongs to them. So we can issue a preemption: we basically evict one job from team A, which was borrowing, and then we can admit this job from team B. And now we are happy; we are respecting everybody's quotas. But again, when not all the resources are used, we can borrow resources back. So that's one case of preemption.

Here we have another case of preemption, where we have jobs with different priorities. Here, a bigger number means higher priority. So we have all these jobs running, and then a new job comes in, in this case from team A. We cannot really preempt jobs in team B, because we would be violating their quota. But this job D has higher priority, so we can inspect which of team A's own jobs has lower priority, evict it, and then admit this job. All of this can be handled by Kueue. In this example I reduced things to only one resource, so we can look at it linearly, but of course it works across CPU, memory, whatever you want to define; it works on all those dimensions, and it works across different flavors as well. So that's preemption.

And with these concepts, we can implement a bigger system with, in this case, two teams that can borrow resources from each other. So how do we manifest this in the Kueue APIs? First, we have one team, team A, which has two researchers. Each researcher has their own namespace, and the way we are defining the team is using labels: the same label, team A, means the namespaces are part of the same team. Then we can define a ClusterQueue, and as you can see here, there is a namespace selector that matches these labels on the namespaces. This is how we define that the ClusterQueue is for these namespaces, or in this case, this team. And we have some quota here that they can use; they can completely share this quota. And then we establish the LocalQueue objects, which are the ones that are actually namespaced, as sketched below.
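A hedged sketch of one team's side of this setup; the team names, quotas, namespace labels, and flavor name are illustrative, and the preemption stanza expresses the two policies described above:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  cohort: org                # ClusterQueues in the same cohort lend unused quota
  namespaceSelector:
    matchLabels:
      team: team-a           # selects every namespace labeled as team A
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor   # assumes a ResourceFlavor of this name exists
      resources:
      - name: "cpu"
        nominalQuota: 40     # team A's own quota
        borrowingLimit: 20   # may borrow up to 20 more from idle cohort quota
  preemption:
    reclaimWithinCohort: Any          # evict borrowers to reclaim this quota
    withinClusterQueue: LowerPriority # evict own lower-priority jobs if needed
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a-researcher-1   # one LocalQueue per researcher namespace
spec:
  clusterQueue: team-a-cq
```

Team B's ClusterQueue would mirror this with the same cohort name and its own nominal quota.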
So these two LocalQueues live in the two researchers' namespaces and point to the team's ClusterQueue, and it's the same for the other researcher. And that's one team. If we have another team here, maybe a smaller team, it has less quota, a nominal quota of 40. And the crucial part of the API here is that they are part of the same cohort. When two ClusterQueues are part of the same cohort, it means they can share the quota that is not being used with each other. And you can also control how much can be borrowed: you might allow a queue to borrow everything, but you can also say, I don't want it to borrow all the unused quota, I want to control to some level how much we can borrow. So that's it for the use cases.

Yes, we just released a new version several weeks ago, 0.3.0, and we introduced quite a lot of features in this release. The first part is about the API. We bumped the API to beta, and we respect the Kubernetes deprecation policy, so if you are using a lower version, like the alpha level, we encourage you to migrate to the latest one. We also added quite a lot of validation in the webhook, for the robustness of Kueue. The next is preemption; yes, I just introduced quite a lot about it. We now support preemption inside a ClusterQueue and between different ClusterQueues. In this release we added support for the MPIJob v2beta1, and we also provide a controller library for easy integrations; you can refer to it as a template. Next is an opt-in feature, waitForPodsReady. It acts like sequential admission: a job will not be admitted until all previously admitted jobs have their pods ready. It's disabled by default, and if you want to use it, you should enable it in the configuration. Also, because Kueue is sensitive to resources, we made it aware of LimitRanges and pod overhead. And last but not least, we send thanks to all the contributors in the last release. They are from different companies, different countries, and also different communities. Much appreciated.

For our next release, we will mainly focus on the existing features, like waitForPodsReady and preemption. It would be nice to have some features like dynamic quota reclaiming, because right now we only reclaim quota when the whole job finishes. We also want to add support for other jobs, like Ray, and the other jobs in Kubeflow. Yes, and if you are interested, you can contribute your ideas here. So if you have some interest in the project, you can visit our new website; yes, we have a new website. Also participate in the Batch Working Group, find some issues, and we welcome every kind of contribution. So this is our new logo. That's pretty much it for today's session. Any questions?

Thank you guys for the amazing presentation. I have a couple of questions about the system. The first one, I want to understand: if I keep sending jobs... could you go back to the slide where you show the baseline architecture, from when the user submits a job? I think it was one of the first ones. Yeah, this one. So basically, here on the right side we see a user who creates a job, and you put it into suspended mode. So if I keep sending jobs, will I eventually overload the API server? You don't have any external storage component here, so I assume it all ends up in etcd. So basically, what's the limit for the system? Can I basically overload the cluster with jobs in the end?
Yes, so the key thing here is that we are working at the job level. Because we are suspending the job, even if your job is a parallel job of 1,000 pods or even bigger, there is only one object, the Job. So from that starting point we already have much, much lower load on the API server, because we are not allowing those pods to even be created. That's a big win for scalability. But you can still use resource quotas to put in failsafe mechanisms, not about the usage of the jobs, but about the tipping points of your cluster, because any system will have a tipping point, right? You can still use Kubernetes resource quotas to define those scalability limits, and if a user exhausts them, well, they will have to retry the job creation.

Okay, then my next small question. If we take Airflow as an example, as a system I can use basically for the same purpose with bits of Python code on top, I can also use Airflow to do job scheduling across multiple different clusters, but in this case I'm limited to a single cluster. That's correct, right? But what about namespaces? Does this at least work across namespaces in the cluster, or are we bound to a single namespace? If I want separate namespaces for different teams, can I have ClusterQueues that target different namespaces as well?

The Kueue APIs give you all that flexibility. You can define a team to be one namespace, or you can define a team to be a set of namespaces. That's something you can control within the APIs. So yes, it's multi-tenant within a single cluster. And to answer the other part of your question: today Kueue is single-cluster, but we are already thinking about multi-cluster, and there is a contributor who will share some news, some initial design, in the working group. So if you want to participate and learn more about the multi-cluster plans, you can join the next Batch Working Group meetings, and we will have this discussion sometime soon.

So I think we're out of time. Our next speakers are Richard and Shiwei; I couldn't see them. Can you raise your hand if you're here? Hopefully they're there. Okay, cool.