Hi everyone. Again, my name is Abdullah, and I'm a staff software engineer at Google. Today I'm going to present our open source work named Kueue, which is a play on "queue" with a K. It's a Kubernetes-native job queueing controller. I just want to stress that this is a community project, an open source project. It started a few months ago, so it's new; we had our first release just a couple of weeks ago. We're here to present the high-level idea of how we think queuing should work, or at least one way it could work, on top of Kubernetes.

Here's the agenda. I'm going to quickly describe the problem we're trying to solve, our core proposal for how to handle job queuing within Kubernetes, the APIs we've introduced, how it operates in general and how it interacts with the different controllers we have in Kubernetes, and finally a summary.

Since this is the first talk in this batch working group session, maybe we should start by defining what a job is. The way we look at it, a job is any computation that has a start and a finish; mostly, these are computations that run to completion. Compare that with serving-type workloads, where you have a process listening on a port all the time. It can run indefinitely, receiving requests and processing them. That's why most application APIs in Kubernetes (Deployment, ReplicaSet, StatefulSet) will immediately recreate a pod if it dies or finishes: the assumption is that the workload should always keep running. In contrast, the Job API allows a pod to run to completion, and it will not restart that pod. So we have this concept of pods running to completion; we count how many pods have completed, and that is how the job's status determines whether the job has finished. It's a very simple definition of a job, but we find it fits most types of batch workloads.

This brings us to the second point: these batch workloads usually run more than a single pod. If we cast this definition into Kubernetes terminology, with a pod as a task, the tasks are either independent (think Monte Carlo simulations, processing the frames of a video, or processing trading days if you think about fintech) or they collaboratively process a task, with a ton of communication happening between a group of tasks. Think about MPI running an n-body simulation, for example, to simulate how particles interact under gravity over time. I just want to mention here that the gray color comes out nearly white on this screen, so please use your imagination.
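To make that concrete, here is a minimal batch/v1 Job for a set of independent tasks; this is an illustrative sketch, not something from the slides. parallelism controls how many pods run at once, completions defines when the job is considered finished, and the pods are not recreated once they succeed:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  completions: 3     # the job is done once 3 pods have completed successfully
  parallelism: 3     # run up to 3 pods at a time
  template:
    spec:
      restartPolicy: Never   # run to completion; don't recreate pods like a Deployment would
      containers:
      - name: pi
        image: perl:5.34
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
```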
All right. A very important characteristic of batch workloads that we can capitalize on, as people trying to build batch schedulers or batch orchestrators, is that they are often flexible along multiple dimensions. They are flexible in time: they don't necessarily need to run right now (it's not like a request you're serving to an end user), so their start time can sometimes be pushed farther out. They are sometimes flexible in location: on the cloud you have multiple regions and multiple zones, and a workload doesn't always need to run in a specific region; it could run in a different zone, for example. Of course there are data locality restrictions, but taking those into account, it can still run in different places. They are also flexible in the types of resources they use: they could run on different GPU types, or on different CPU architectures, Intel versus AMD for example.

So what, in our view, is the problem we're trying to solve? At a high level, it's managing a limited pool of resources shared by multiple tenants. So you have multiple users... yes, go ahead.

Thank you very much. There's also the time it takes to finish the job. In my experience there's a huge difference between micro-batch jobs that run for only seconds and jobs that run for minutes or hours. Where I work we have a big problem with our micro-batch jobs, so maybe that's something to consider.

Yeah, definitely, that's a very good point. Different batch jobs have different running times, and that can be taken into consideration when deciding how to schedule them. This slide by no means summarizes everything; we're just highlighting some of the aspects we take into account when solving this problem in the context of Kueue.

So what do we mean by job queuing? Again, we try to decide which jobs should wait and which jobs can start right now. And why do we need job queuing? As I mentioned, in most cases we have lots of jobs that need to be processed but a limited amount of resources. We need to maximize the utilization of those resources, and if not all jobs fit right now, we need to decide which ones should start and which ones should wait.

And we usually have two types of infrastructure, right? On-prem infrastructure and the cloud. On-prem clusters are fairly static and typically small in scale, so you have a limited amount of resources by design. On the cloud there's the perception of flexibility, that you can grow infinitely, but that's not always the case. You have discounts, for example: you buy a discount on a specific number of core-hours, so you want to fit your processing within that discount. Or you have spending limits: you don't want to keep buying more capacity just because you have more jobs; there's a limit to how much money you want to spend. You also have per-tenant limits: different users can't just spend as much as they want, and some tenants may be allowed to spend more than others. And there are cluster limits: in Kubernetes, by default, I think the limit is around 5,000 nodes; in GKE you can go up to around 15,000. But in the end there's still a limit to how far the control plane can scale and how many nodes you can have in a cluster.

So here I'm trying to summarize what we want to solve with job queuing. Obviously we want queuing: jobs that don't fit within the existing capacity should wait. Next is execution order: who gets to execute now, whether that's based on, say, the typical FIFO order (submission time) or on priority. You need knobs that allow users to configure how the ordering of jobs happens. Also fair sharing; we have a talk later today about fair sharing as well. That is the ability to share resources between tenants: we may want to set limits on how much each tenant can use, while unused capacity can be shared between them.
Then budgeting: it's not always the case that you want a fixed quota; sometimes you want to cap how many resources can be consumed over a period of time. The ability to set policies, maybe on the runtime of a job. And finally, as I mentioned at the beginning, flexible placement: the ability to say, if I can't run on on-demand capacity, I want to fall back to spot VMs, for example. Again, this can also be flexibility in time and location.

So why are we proposing a new controller? Well, plain Kubernetes continuously tries to start workloads. Every time you create a job, the job controller creates the pods, and if you have a limited amount of resources, those pods stay unschedulable and the control plane keeps trying to schedule them, to no avail. That's not desired behavior, and there's no built-in way in Kubernetes to say, okay, let me stop now and try again later.

We do have one mechanism for limiting consumption: Kubernetes resource quotas. But the way resource quotas are built doesn't really fit the job queuing paradigm. They mostly try to protect the cluster from being overloaded, and they apply at resource creation time: either you get to create the pod in the first place or you don't. There's no way to say, okay, let me wait and then try again; it's up to some external controller to keep retrying, and even then that's not a good approach, because it can keep hammering the API server. So it doesn't really fit the concept of a queue.
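For reference, this is what that existing mechanism looks like: a standard ResourceQuota, shown here as an illustrative sketch. It caps what a namespace can request, but enforcement happens at admission time; a pod that would exceed the quota is rejected outright rather than queued.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "100"      # total CPU the namespace may request
    requests.memory: 200Gi   # total memory the namespace may request
    count/jobs.batch: "50"   # at most 50 Jobs may exist at once
```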
There are multiple existing custom schedulers, and they all fit nicely into different use cases. But one thing we wanted to avoid is re-implementing existing functionality. We didn't want to re-implement pod-to-node scheduling, for example; we want kube-scheduler to keep doing that. We didn't want to re-implement the job controller, the job lifecycle manager; we want the core Kubernetes job controller to keep managing the lifecycle of a single job. The second thing is that existing schedulers didn't have a clear integration with autoscaling, which is a really important characteristic on the cloud. And last but not least, we found that nobody had actually solved the problem of resource flexibility, or fungibility, either at the API level or mechanically. We want to make flexibility a first-class citizen of job queuing, because we come from the cloud: the cloud is massive and has many different types of resources, and we want to make sure you have a really good way of expressing flexibility.

So what is our proposal? The main design principle for Kueue is that we don't re-implement anything that already exists: no re-implementation of autoscaling, pod scheduling, job lifecycle management, or even admission control. For autoscaling we have the cluster autoscaler, plus a bunch of other autoscalers beyond the one hosted by Kubernetes. kube-scheduler handles pod scheduling. Job lifecycle management is handled by the job controller. For admission control there are many policy controllers out there, like Gatekeeper. And we want native support for the batch/v1 Job API. The idea is to have a really easy on-ramp for onboarding batch onto Kubernetes: you start as a user by just creating batch jobs with the v1 Job; then your needs grow and you want queuing. That's it. You just install our controller, and the v1 Job is still the API that's supported: you can queue those jobs, and so on. You don't need to adapt to a different job API.

The advantages of this, as I mentioned, are that we reuse a significant amount of existing functionality (there's no duplication and no divergence in functionality) and it enforces separation of concerns: we don't want two schedulers in the cluster, two job controllers, and so on. This comes with some limitations; the last column here lists some potential drawbacks. It creates two levels of resource management: if we have a controller at the top that only does job-level management, it might diverge from the lower layer that does pod-level management, which is what most Kubernetes controllers do; they mostly work at the pod level. There's also a concern related to starting the job: if I manage the job and decide it should start, I start it, but then the job controller has to react and create the pods, and so on. So there is that, but we think these trade-offs are worth it compared to the advantages we get.

As I mentioned, we're focusing on the Job API; the existing Job API is the main vehicle for deploying jobs. I'll just quickly mention that we have been fixing the Job API as well. We understand it did not serve all use cases, and we're addressing that as part of the batch working group that we initiated. We've fixed a bunch of things and we're looking at more things to add, not only to run normal batch jobs but hopefully also to use it as a building block for running, for example, machine learning workloads or even MPI.

The two personas the queuing controller focuses on are the batch admin and the batch user. The batch admin sets up the tenants, the queues, and the users' limits. The batch user's journey should be simple: they use the v1 Job as before, and the only thing they have to do is select a queue. That's it.

Okay, so let's get to the heart of it. I've spoken a lot; where is Kueue? Show me. The resource model introduces two main resources: the Queue and the ClusterQueue. Namespace is an already existing concept, as you know; it's what represents the tenant. If you have a new tenant, say a new team, you create a namespace for them; that's their working space, basically. A Queue is a namespaced resource; it represents the closely related jobs of a single tenant. And then there's the ClusterQueue. Think of these two like Roles and ClusterRoles in RBAC: the ClusterQueue is a cluster-scoped resource, and it's where we define the actual quotas for a specific group of tenants. In this model the Queue, the namespaced resource, is really just a pointer to the ClusterQueue where the resources are going to be consumed. A user within a namespace usually only has access to that namespace, right? They can only list the resources that exist in that namespace, so they don't really need to know much about the ClusterQueue. The only thing they need to know is which queues are available in their namespace; they list them and submit the job to one of those queues.
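As a rough sketch of that user journey (the field names and the annotation key follow the talk's description and the early alpha API, so they may differ from the released version), an admin creates a namespaced Queue pointing at a ClusterQueue, and a user submits an ordinary batch/v1 Job that simply names the queue:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: research-pool   # the cluster-scoped queue where quota is consumed
---
apiVersion: batch/v1
kind: Job
metadata:
  name: simulation
  namespace: team-a
  annotations:
    kueue.x-k8s.io/queue-name: team-a-queue   # the only change a batch user makes
spec:
  suspend: true        # Kueue admits the job by flipping this to false
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo running && sleep 30"]
```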
And here I want to highlight that multiple queues in different namespaces can point to the same cluster queue, which is how multiple tenants can share the same pool of resources. The actual queuing and the ordering of job execution happen at the cluster queue level. So here you have jobs one and three from one team and jobs two and four from another; the actual starting order is managed at the cluster queue level, in this example the robotics teams' pool. One last thing here: you can of course have multiple cluster queues in the cluster, for example one for the robotics team and another for the natural language processing team, each with its own resource pool. We also have this concept of a borrowing cohort; you can see there's a cohort parameter, which basically decides how fair sharing happens and who can borrow from whom. I'll highlight this in the next few slides.

On to the APIs. Again, I've talked about this a lot: we use the normal Job API here, and the only thing you need to do to use a specific queue is set the queue annotation. We're trying to make that a first-class parameter of the Job API; we want to introduce a queue name in the job spec.

As I mentioned, the Queue API is really simple. It's the namespaced resource; in its simplest form the only thing it has is a pointer to the cluster queue where the resources will be allocated. Splitting the queue into two resources, a namespaced one and a cluster-scoped one, fits the Kubernetes model better in general, given that the namespace is the organizing concept for teams and where you do RBAC, and so on. That's why we came up with this model.

The more exciting API, where everything lives, is the ClusterQueue API. The first thing I want to highlight, as I mentioned, is that it defines quotas. One thing we've introduced is the concept of flavors. For a specific resource (here I'm only showing one resource, CPU) you can define multiple flavors. Here I'm defining two: an on-demand flavor and a spot flavor. You can define different quotas for each: the min is the minimum amount of resources you're guaranteed to get, and the max is how much you can scale up to by borrowing from another cluster queue in the same borrowing cohort. So, as I mentioned, the quota defines the total amount of resources that can be requested by the jobs that start through this cluster queue. It does not guarantee that those pods are actually going to get scheduled. I want you to think of this as a dynamic quota manager, not really a resource manager per se: similar to resource quotas, it just says, we'll allow you to start and this is the amount of resources you'll have, and we depend on the lower layers of the infrastructure to actually provision the resources for you. Here I'm defining only quota, but we're also planning to support budgets in the future, which is basically quota over a period of time rather than at a single point in time.
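Here is a rough sketch of what such a ClusterQueue might look like, following the description above. The exact field names are approximations of the early v1alpha1 API and may differ from the released version: one CPU resource with two flavors, on-demand and spot, each with min/max quota, grouped into a borrowing cohort.

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: ClusterQueue
metadata:
  name: research-pool
spec:
  cohort: research-department     # cluster queues in the same cohort may borrow unused quota
  requestableResources:
  - name: cpu
    flavors:
    - name: on-demand
      quota:
        min: "100"                # guaranteed cores from this flavor
        max: "200"                # up to 100 more cores borrowed from the cohort
    - name: spot
      quota:
        min: "500"
        max: "500"                # no borrowing for this flavor
```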
The second thing I want to highlight is how we differentiate between flavors. We've introduced two parameters: labels and taints. These match how we differentiate between partitions of the cluster, between nodes, right? You partition nodes using labels and taints, and I want you to think about flavors in the same way. Flavors are associated with taints and labels, and when a job starts, I assign it to a specific flavor. For example, when a specific job starts, I say, okay, I'll take 10 cores from on-demand. How do I force it to actually land on on-demand nodes? I take that label, translate it into a node selector, inject it into the job, and that's how I force the job to consume only that specific quota.

Taints are an interesting parameter here. They give users an opt-out-by-default mechanism: jobs that don't tolerate a flavor's taints won't be assigned to it. For example, going through the flavors from top to bottom, the Kueue controller says, okay, let me try to give this job some cores from on-demand; if there's no quota left, I fall back to spot. But if the job doesn't explicitly tolerate spot, it stays pending. That's the nice thing about this approach; again, it's Kubernetes-native. The same way you steer pods to nodes, you use a node selector and tolerations here to say whether you're opting into spot, for example, or not. And you can also say, I always want spot. That's the last point: if on the job's node selector you ask for spot, I skip the first flavor and try to assign you quota from the spot flavor directly.

So those are the two main APIs, and this is the main API for defining quotas. One last thing in the API I want to mention is the borrowing cohort, which I mentioned before in the resource model. Different cluster queues can be grouped into a borrowing cohort, which lets you do a sort of fair sharing using unused quota from other cluster queues. Imagine a research-department cohort: if the robotics team isn't using all of its quota, a different team within the research department can use it.

So how does all of this work? As I mentioned, you start with the batch admin creating the namespaces, the queues, and the cluster queues. The only thing the user does is create the job and assign it to a queue. One thing we do here is force the job to start in suspended mode, so no pods get created at the beginning; we now have a suspend flag in the Job API, and we can use a webhook to force it on. The second part is our queuing controller: it watches for these suspended jobs, and based on the cluster queue the job is assigned to, the available flavors, the available quota, the job's order, and so on, it admits the job. What that means is that it assigns a specific flavor, gives the job quota, injects the corresponding node affinity into the job, and unsuspends it (you can see a sketch of what that looks like on the job below). That's basically all it does. It doesn't create pods, and it doesn't do anything that an existing controller already does. The rest is the same as before: the red boxes on this slide are existing controllers, and the blue one is the controller we've introduced. Once the job is unsuspended, the job controller creates the pods and the scheduler assigns them to nodes.
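As a hedged illustration of that opt-in mechanism (the label and taint keys below are made up for the example, and the injected fields follow the flow just described rather than a guaranteed API contract), a job that tolerates the spot flavor's taint can fall back to spot; on admission, Kueue injects the flavor's label as a node selector and flips suspend to false:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tolerant-job
  annotations:
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  suspend: true                      # set to false by Kueue when the job is admitted
  parallelism: 10
  template:
    spec:
      restartPolicy: Never
      tolerations:                   # opt in to the spot flavor; without this the job
      - key: cloud.example.com/spot  # stays pending once on-demand quota runs out
        operator: Exists
        effect: NoSchedule
      # nodeSelector:
      #   cloud.example.com/capacity-type: spot   # injected by Kueue on admission
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "sleep 60"]
```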
So how do we handle custom workloads? Not everything is a v1 Job, and not everything can be modeled as a v1 Job, so we do have a hook for that. The only things we require from a custom workload to work with Kueue are that it is represented by a higher-level API object, like an MPIJob, that controls creating the pods and so on, and that it implements suspend semantics, which is the whole enabler for everything in Kueue: we need to decide when the job should start, so we need to be able to suspend and unsuspend it.

We've introduced a new API called Workload. It's basically an abstraction that decouples Kueue from being directly tied to the MPIJob or TFJob type of workload. The Workload API is what lets Kueue understand how many resources you want; really, it's just a set of pod sets, where a pod set is just a pod template plus how many pods you want out of it, and that's it, plus the queue name.

So how does that work in action? We envision that each custom workload needs its own workload controller, which does the syncing between the actual custom workload and the Workload type we have in Kueue. The batch user creates the MPIJob, for example, and we force it to start in suspended mode. The workload controller watches for these suspended MPIJobs and creates the Kueue Workload object for them. Kueue watches for the Workload object and reflects the admission, the flavor assignment, in the Workload's status. The workload controller, watching that status, sees that the workload has been admitted; it then injects the node affinity based on the selected flavor and unsuspends the job. In fact, we implemented the Job API integration itself using this model; it's not really built into Kueue. We have a tiny job controller that does the syncing between the Kueue Workload type and the actual Job API.
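Here is a rough, hypothetical sketch of what such a Workload object might look like based on the description above; the structure and field names are illustrative, not the released schema. It carries one or more pod sets, each a pod template plus a count, and the name of the queue:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Workload
metadata:
  name: mpi-training-run
  namespace: team-a
spec:
  queueName: team-a-queue
  podSets:
  - name: workers
    count: 10                      # how many pods are requested from this template
    spec:                          # an ordinary pod spec, so Kueue can total up the resources
      containers:
      - name: worker
        image: example.com/mpi-worker:latest   # hypothetical image, for illustration only
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
```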
One last integration point I want to highlight is autoscaling. It's really important for us to support autoscaling natively, because in the cloud everything is autoscaled. That's the promise of the cloud: when you need resources, we scale up for you; when you don't, we scale down. But I believe I have 10 more minutes, right?

So how do we envision integration with autoscalers? Well, one way is to say, okay, I'll just create the pods. Some cluster autoscalers, like the standard cluster autoscaler, react to unschedulable pods: they provision resources when they see a pod that is unschedulable. But we can do better than that. We envision a resource-provisioning API object, let's call it a capacity request, that cluster autoscalers would support. Instead of just reacting to pods, we want an explicit API to request a specific amount of resources from the cluster autoscaler. The idea is that the cluster autoscaler watches for these capacity requests and potentially scales up the cluster to satisfy them.

And why do we want to enable this? Well, this approach is an enabler for location flexibility. Without it, the cluster autoscaler isn't aware of a group of pods. When I want resources for a job, I want to say: I want 10 pods with one core each, but I want all of them in this region, or all of them in one zone; don't spread them across zones. I want "any single zone" semantics, and I want an API where I can express that to the cluster autoscaler.

It also lets us handle stockouts better: before I actually unsuspend the job, I create a capacity request and find out whether I actually have 10 cores to provision or not. The cloud, again, has this promise of infinite scalability, but we do face stockouts in some cases, on Black Friday or Cyber Monday, or in specific regions that have smaller capacity, and so on. And last but not least, it lets us handle scale-up and decide when to use already-provisioned capacity: you don't always want to trigger a scale-up. I want the cluster autoscaler to be aware of existing provisioned resources and not trigger a scale-up every time if the capacity is already there.

The way we envision this being integrated with Kueue is that after the job is created, Kueue doesn't always just assign a flavor and unsuspend the job. Before unsuspending, it creates a capacity request. The cluster autoscaler watches for these capacity requests and fulfills them, and Kueue in turn watches the capacity request. Once it is actually satisfied, Kueue injects the node affinity based not only on the flavor but also on the capacity request; for example, if the cluster autoscaler picked a specific zone, we inject a node affinity to that zone. Then it unsuspends the job. The rest is the same: the job controller creates the pods and the scheduler places them.

One last thing I want to highlight and stress is that we're trying to use existing controllers as they are. We use the primitives they give us to control when to schedule, what to schedule, and where: we rely a lot on node affinity, node selectors, and taints. We see those as really powerful, generic features that let us make higher-level decisions while delegating the low-level pod management to the lower-level controllers. And so far, so good; it has worked really well for us with the Job API.

And this is my summary. What is Kueue? As I mentioned, Kueue is a queuing controller; you can also think of it as an elastic quota manager. It does not re-implement existing functionality. It has worked really well for us with the v1 Job API so far, we have an API to integrate with custom workloads, and it supports fair sharing and resource flexibility using flavors. Sorry again about the font here, but the status is that we've released our first version, 0.1, which includes everything I discussed apart from the cluster autoscaler integration. What's next? We want to prove that this also works for custom workloads, starting with integration with common operators like Spark and Kubeflow. We want to implement job preemption. Think about suspend again: it's a really powerful, simple concept. When you unsuspend, you start the job; when you suspend it, you can think of that as job preemption. So we want to use that to do more advanced job management, maybe based on job runtime. Budgets and multi-cluster support are on our roadmap as well.

I want to thank all the contributors to Kueue so far; again, it's really in its very first days. And that's it. You can visit our repo under the kubernetes-sigs organization. Thank you. I hope we have time for questions. We have a question here first.

Question: with jobs, are the pods of the job created in the namespace where the queue was created, or in some different namespace?
They will be created in the namespace where the job is created.

I'm going to jump to a question from Slack: with regular HPC, I'm able to launch thousands of jobs in parallel, but Kubernetes limits the number of pods to 110 per node; how could we accommodate an HPC-scale number of jobs in a Kubernetes cluster? I'm guessing this isn't really related to job queuing per se, but in general, I don't think the 110 limit is fixed; you can adjust it to increase the number of pods you can have on a node. It basically comes down to how many IPs are available. I think that's the main limiting factor, in addition to other resources. Typically we assign, I think, a /24 subnet per node, which gives you 256 addresses; that gets divided by two because IPs can't be reused immediately, and I think that's where the 110 comes from. But generally it's configurable.

Hey, thanks for your talk. I was wondering, how does Kueue interact with indexed jobs? Does it only launch the pods when all of the pods are ready to go, or when a subset is?

Right. For now, the assumption is that we don't really know the semantics of the job or how its pods get created; we only manage the whole job as a single unit. So you can pretty much think of it as all-or-nothing, but it's not really all-or-nothing scheduling, it's all-or-nothing quota management: I have quota for the whole indexed job, so I start it. And here we look only at the parallelism field, not completions, because parallelism is what dictates how many pods the job controller is actually going to create.

I'll jump to Slack again: how does Kueue handle workloads such as Apache Spark and other frameworks that might need to dynamically ask for more resources? That's a very good question. It's on our roadmap; we don't have that explicitly right now, but we've discussed in the community how to handle dynamic resource management. We think of it as somewhat similar to vertical pod autoscaling (I don't know if some of you have been following that in the community, how you adjust the resources of a pod), and we're leaning towards a solution in the same vein.

I think you may have just answered my question. I was going to ask about right-sizing, but that might be vertical scaling. Right-sizing for the pod or the job? The job, like how do you pick those sizes? Right, so that is on our roadmap. We're having discussions about how to dynamically change the size of the job; for example, you might want to start with one pod but then change that later.

OK, yeah, there's one more. Are there also plans for priorities on queues? If a job comes into one queue with priority X, Y, Z, and a job comes into another queue with a higher priority, do you then order at the cluster queue level? Yeah, so we have job-level priority, not queue priority. Jobs are ordered first by priority and then by submission time. We do have a vision for different orderings, and that is a parameter at the cluster queue level.

With that, we'll conclude this session. Thank you, Abdullah. Have a good day. Thank you. Thank you.