Hi everybody, thanks for coming to this maintainer track session for SIG Scheduling. My name is Aldo Culquicondor. I work at Google and I'm a TL in the SIG. And I'm Kensei from Mercari, Japan. I'm an approver in SIG Scheduling.

So we're going to start with an overview of the scheduler, which is the main project within our SIG. We'll talk about the basics of it, and also about what's been happening in the past year. Then we'll move on to subproject updates. As a SIG, yes, the scheduler is our main project, but we also have, let's call them incubating or maturing projects that we're working on, such as the kube-scheduler-wasm-extension, Kueue, the descheduler, and more recently KWOK. We're going to go through all of these. But let's start with the scheduler.

This is an overview of the multiple steps that the scheduler executes in order to assign one pod to a node. The computation for each pod is split into two stages. First, the scheduling cycle, the big green box, which happens serially: every pod that goes through the scheduler goes through this box in a serial manner. Then we have the binding cycle, which basically handles the communication with the API server, and this part can be parallelized.

Overall, the scheduler is built in a way that we can modularize it to implement different mechanics, different scheduling attributes. For example, the most basic scheduling parameter is whether or not your node is full. That's a yes-or-no decision, or what we call a filter: based on the total allocatable capacity of your node, a pod comes in, and you just answer, does it fit or does it not fit? And of course, we have multiple APIs that define different behaviors, like node affinity, pod affinity, topology spreading, and so on and so forth, but all of these filters produce a given yes-or-no answer. Further, once we have a set of candidate nodes where the pod can be scheduled, we do a scoring based on different criteria, which could be similar criteria. For each node, we calculate a number that tells us how good it is to put this pod on this node; that's the scoring phase. Those are the main extension points for the scheduler, but once you go into the weeds, there are so many things you can tweak, so we have a few more extension points.

More recently, we introduced a new extension point even further to the left, the purple box, which is what we call PreEnqueue. PreEnqueue allows us to make decisions even before we consider the pod for scheduling. The most important feature that uses these semantics is scheduling gates, which simply allows us to hold a pod back from scheduling until some other controller says yes or no. You could have, for example, a quota manager that decides whether or not the pod is ready to be scheduled, or other extensions or mechanisms to decide. So that's where we are today with the scheduler.

From here, let me introduce the recent improvements in kube-scheduler. The first one is a new feature called matchLabelKeys in pod affinity and pod anti-affinity. matchLabelKeys specifies the keys of labels whose values should match the incoming pod's labels. We actually introduced the same feature in pod topology spread previously, and the biggest use case here is to improve the calculation accuracy during rolling upgrades of a Deployment. So let's see how it works with an example; a minimal manifest sketch follows below.
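To make the upcoming example concrete, here is a minimal sketch of such a Deployment, assuming the MatchLabelKeysInPodAffinity feature gate is enabled (alpha at the time of this talk); the names and image are placeholders, not from the talk.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 4
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: demo
            topologyKey: topology.kubernetes.io/zone
            # matchLabelKeys additionally requires existing pods to match
            # the incoming pod's value for these keys. For a Deployment,
            # pod-template-hash differs per revision, so only pods from
            # the same rollout are taken into account.
            matchLabelKeys:
            - pod-template-hash
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9  # placeholder image
```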
So we have a Deployment whose pods have a required pod affinity, and this pod affinity requires the pods to be in the same zone. Let's first see what happens during a rolling upgrade of this Deployment if we don't have matchLabelKeys. The desired outcome is that all pods from the new version are placed in the same zone. But the scheduler takes both old and new pods into consideration during the pod affinity calculation, so the new pods can be scheduled into either zone, because there were pods in both zones before. It could end up with only one pod in a zone, which we don't want.

So let's use matchLabelKeys with pod-template-hash. pod-template-hash is a label key that all pods from a Deployment have, and its value is different for each version. So we can make the scheduler take only the same version's pods into consideration during the calculation. Let's look again: the scheduler ignores the old version's pods and only counts the orange ones, the new ones. As a result, all new pods are always placed in the same zone.

At the same time, we introduced mismatchLabelKeys, which sounds very similar. mismatchLabelKeys specifies the keys of labels whose values should not match the incoming pod's labels. Let's look at one use case to understand how it works. This pod has a pod affinity and a pod anti-affinity. The pod affinity ensures this pod goes to a node pool that already has pods from the same tenant, and the pod anti-affinity ensures this pod avoids node pools that have pods from different tenants. Say this pod belongs to tenant A: then it cannot go to a node pool that already has pods from tenant B. This combination makes it possible to guarantee that pods go to a node pool that has only pods from the same tenant, and that all pods from the same tenant are scheduled into the same node pool (see the sketch after this segment).

The last thing we introduced is QueueingHint. Before getting into it, I have to describe some background: how, in general, pods are moved around inside the scheduler. Let's take this example: we create a pod with a required pod affinity. When a pod is created, the scheduler notices the newly created pod via its event handlers and puts it into the scheduling queue. The scheduling queue is composed of three places: the active queue, the backoff queue, and the unschedulable pod pool. Basically, newly created pods go into the active queue first, unless PreEnqueue rejects them. The scheduler then takes pods from the active queue one by one and schedules them. In this example, the scheduler cannot find any node that satisfies the pod affinity. As we described, each scheduling constraint is implemented as a plugin, so in this case the pod affinity plugin rejects this pod because there is no place that satisfies its pod affinity. Such rejected pods basically go into the unschedulable pod pool; I emphasize "basically" because there are some exceptions, but we don't have time to get into that detail. At this point, the scheduling queue remembers which plugins rejected each pod, and the rejecting plugin is used during the requeueing process in the scheduling queue, which is driven by QueueingHint. QueueingHint is a callback function that lets a plugin decide when to requeue, when to retry, these pods.
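Before continuing with QueueingHint, here is the tenant-isolation pattern from above as a minimal sketch, roughly following the upstream enhancement proposal's example. The `tenant` label key and `node-pool` topology key are assumptions for illustration, and it relies on the same (alpha) feature gate as the previous sketch.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tenant-a-pod
  labels:
    tenant: tenant-a
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # Land on a node pool that already runs pods with the same
      # tenant label value as this pod.
      - matchLabelKeys:
        - tenant
        topologyKey: node-pool
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # Avoid node pools running pods whose tenant label value
      # differs from this pod's.
      - mismatchLabelKeys:
        - tenant
        labelSelector:
          matchExpressions:
          - key: tenant
            operator: Exists
        topologyKey: node-pool
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9  # placeholder image
```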
So the scheduler keeps watching what happens in the cluster via event handlers, and the QueueingHints of the plugins that rejected a pod are called every time such an event happens. In this example, pod affinity is the failure this pod encountered, so the scheduling queue uses the QueueingHint from the pod affinity plugin to decide when to requeue this pod. That makes sense, because this pod was rejected by pod affinity in the last scheduling cycle. In other words, this pod will never be schedulable until the pod affinity failure is resolved somehow, and QueueingHint is the way to know when it's resolved. Via QueueingHint, we aim to avoid wasteful scheduling cycles as much as possible, by retrying scheduling in a smart way.

Back to this example: let's think about how the pod affinity failure could be resolved. One possible scenario is that an existing pod gets a new label value which matches the unscheduled pod's affinity. To cover this scenario, pod affinity has a QueueingHint for the pod update event. When that QueueingHint sees such an event, the scheduling queue requeues this pod into the active queue or the backoff queue, and the scheduler retries scheduling the pod. This is how QueueingHint works in general. We've been actively working on this QueueingHint enhancement recently: we started development in 1.28, and some in-tree plugins support QueueingHints in 1.29, the next release.

Let's see the other notable updates. We introduced the Skip status in PreFilter and PreScore to avoid executing the corresponding Filter and Score. For example, when a pod does not have any pod affinity requirement, the pod affinity plugin can return Skip in PreFilter, and the scheduler won't execute the pod affinity plugin in the Filter phase. Also, scheduler configuration v1beta3 is removed in 1.29, so please migrate your configuration to v1.

From here, we're going to talk about the SIG subproject updates. The first one is the wasm extension. It's a quite new subproject in SIG Scheduling: an experimental extension for building scheduling plugins on WebAssembly. You may have heard of wasm in the context of web browsers, but it has started to be used not only in the browser but also on the backend side to provide extensibility. So let's see how it works. This project has a Go SDK, which allows people to develop plugins with an experience very similar to existing Go scheduler plugins. With the SDK, you just need to implement the interface, and you can compile it into a wasm binary by following the instructions. Then you can enable your wasm plugin via the scheduler configuration: you pass the file path to the binary and just enable it (a hedged configuration sketch follows this segment).

When you want to bring your own logic into the scheduler nowadays, you probably implement your own Go scheduler plugins; that's the most popular way. We also have another, webhook-based extension called the Extender, but it's not as popular as Go scheduler plugins because it's less performant and less extensible. And it's not only the scheduler: Kubernetes itself has extension points in several components, and usually those extensions are provided via webhooks or via Go plugin mechanisms, like the Extender and Go scheduler plugins in the scheduler. However, those come with some drawbacks: performance concerns with the Extender, or the need to rebuild your component, the scheduler, and replace it to ship your Go plugins. So the wasm extension could be a solution to these challenges.
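As an illustration of the configuration step just mentioned, here is a minimal sketch of enabling a wasm plugin in a v1 KubeSchedulerConfiguration. The plugin name `wasm` and the `guestURL` args key are assumptions based on the subproject's general shape, not confirmed by the talk; check the kube-scheduler-wasm-extension repository for the exact schema.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: wasm              # assumed plugin name, for illustration
  pluginConfig:
  - name: wasm
    args:
      # File path to the compiled wasm binary; this args key is an
      # assumption, see the repository for the exact field name.
      guestURL: file:///path/to/plugin.wasm
```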
It allows users to build plugins without worrying about recompiling and replacing the scheduler, and it also performs better than webhook-based solutions. So yes, I personally believe the wasm extension could be a big step, not only for SIG Scheduling but also for the Kubernetes community, in exploring the possibilities of extension via wasm.

On the other hand, there are some disadvantages to wasm. The first one is latency impact. As I said, it's much, much faster than Extenders, the webhook-based extension, but still slower than Go scheduler plugins, obviously, because calls need to go through the wasm runtime. So we don't expect it to eventually replace the existing scheduler plugins: if you have a very large cluster, the performance of the scheduler is crucial, and this latency impact may not be acceptable for such a huge cluster. Another one is wasm-specific limitations; we don't have time to get into the details here, but hopefully we'll have a chance to share what we've learned. As for the current status: it's at an early stage, as I said, and there are many things to do. It currently supports only the common extension points, such as PreFilter, Filter, PreScore and Score, and we'd like to see all extension points supported as soon as possible. We have some example implementations of wasm-based plugins in the repository, so you can go and see what's there.

The next project we want to highlight is Kueue. This is probably, at the moment, one of the biggest projects in SIG Scheduling, other than the scheduler, of course. What is it offering? You can think of Kueue as a second-level scheduler that makes decisions at the job level, and the decision is whether or not a particular team has enough quota to run their job. It makes decisions at the job level, as opposed to the pod level like the scheduler, so Kueue can make its decisions independently of the size of the job. It's a resource quota manager that offers flexibility around quota management. It offers a two-level hierarchy so that you can define fair-sharing and borrowing semantics. And once you define borrowing, you might also want preemption, and preemption here happens at the job level, as opposed to preempting individual pods.

Also very core to Kueue's design is fungibility. Maybe not everybody is familiar with the term, but it's the ability to burst from one type of resource to another. That could mean different levels of availability: you could have reservations, on-demand VMs, or spot VMs. Or, if we're talking about accelerators, maybe there are different models: you'd like to run on the newest, shiniest GPU, but you're okay running on another model. All these semantics can be defined in Kueue.

Again, Kueue works at the job level, so it needs to understand particular APIs. We already support the Kubernetes-native Job API, as well as a newer API from SIG Apps, which is JobSet. We also support all of Kubeflow, KubeRay through RayJob, and more recently plain Pods. And because not everybody runs these standard or well-known CRDs, we also have a library, so you can extend support to any CRD. More recently, we added a new mechanism to inject extra decisions about whether or not a job can be admitted. Now I want to give a quick overview of all the components, how Kueue, the scheduler, and the other components talk to each other, and how it fits into a batch system; a sketch of the admin-side objects follows below.
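To ground the walkthrough that follows, here is a minimal sketch of the admin-side Kueue objects: a flavor, a cluster-level queue holding the quota, and a namespaced queue pointing at it. All names and quota numbers are illustrative, assuming the v1beta1 Kueue API.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot                      # illustrative flavor name
spec:
  nodeLabels:
    instance-type: spot           # injected as node affinity on admission
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  namespaceSelector: {}           # accept workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: spot
      resources:
      - name: cpu
        nominalQuota: 100
      - name: memory
        nominalQuota: 400Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a-cq
```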
First of all, at the top left, you have a cluster administrator or cluster operator. They define how much of the cluster's resources each tenant can have: they define what we call flavors, which describe the resources you have, and queues, which hold the quotas. On the other hand, you have a researcher who is probably not familiar with all the internals of Kubernetes. They just want to run their backlog of jobs. The journey for them is very simple: they just create a job, which again could be a Kubernetes Job, a Kubeflow job, and so on (a sketch of such a manifest follows this segment).

That job arrives at the API server, and Kueue reacts to it. Kueue makes a quota-based decision: it decides whether or not to admit the job. It looks at the resource flavors, and if there is quota in a particular resource flavor, it injects node affinities associated with that flavor, so that the scheduler can later pay attention to those node labels and place the pods on that particular flavor. That's step two.

Here is where the new features come in, 2.1 and 2.2, highlighted with red asterisks. We added this external admission checks logic, admission check extensions, which hands control over admission to other controllers. One such controller is the cluster autoscaler. If your cloud provider supports this API, which in the cluster autoscaler is called the ProvisioningRequest API, you are able to request a bulk of resources from the cloud provider. You can say: I have 10 or 100,000 pods with this shape, can you provide for them? Kueue is able to use this API for your job, and it will request the scale-up; once the cluster autoscaler responds "yes, I made the scale-up", we can proceed with the rest of the process. On the lower side, 2.2, we have external checks that you can define yourself. You could, for example, implement budgets, and we're thinking of implementing multi-cluster through this extension mechanism. There are multiple things you can do. So that's 2.2.

Now we've decided the job can run. Great. What does this mean? It means that at this point, the Job controller or the Kubeflow controller is able to create pods. So these job controllers go and create the pods, and note that up to this point we haven't used much etcd storage: only at this point, once we know the job can run, do we create all these pods. And once we have the pods, kube-scheduler can do its usual task and assign a node to each of them. If you didn't have ProvisioningRequest support, your cluster autoscaler can still scale up if necessary. So, as you can see, Kueue doesn't replace any of the existing components. It's a higher-level scheduler, or, as we like to call it, a queueing system, that you can plug into your existing cluster; you can even use other schedulers with Kueue.

Quickly, because I'm running out of time: these are the things on our minds for the next releases. We just did a release two weeks ago, but we're thinking of more visibility, groups of pods as opposed to single pods, more policies for requeueing, and deeper hierarchies: we have a two-level hierarchy, and our users are asking for more levels. We're also thinking of multi-cluster and workflows. We don't want to implement workflows in Kueue; again, there are many great workflow managers, like Argo, Tekton, Kubeflow Pipelines, Snakemake, and so on and so forth. They do their job great.
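For the researcher-side journey described above, a minimal sketch might look like this; the queue name matches the illustrative LocalQueue from the earlier sketch, and the image and sizes are placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue  # target LocalQueue
spec:
  suspend: true      # created suspended; Kueue unsuspends it on admission
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.k8s.io/pause:3.9     # placeholder image
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
```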
We want those to work with Kueue, as opposed to Kueue having to re-implement all that logic.

Another project we have in SIG Scheduling is the descheduler, which, as its name suggests, is kind of the opposite: you have lots of running pods, and you have certain policies for when these pods need to be evicted. The new version of the descheduler, v0.28, follows a model similar to the scheduler: there is now a descheduler framework, so more contributors can contribute plugins to the descheduler to implement different policies. In this new version there are two types of plugins, Deschedule and Balance, with the guarantee that the Deschedule plugins run before the Balance plugins, plus some bug fixes in the v1alpha2 API. Over the last two releases they got 29 contributors, which is pretty exciting to see.

So this is the descheduler framework; it has some resemblance to the scheduler framework. What I want to highlight here is that in this example there are two profiles, and you can have as many different profiles as you want. Within a profile, you can have Deschedule plugins and Balance plugins. What are Deschedule plugins? Very simple things: they make a decision for one pod at a time. For example, you can have a simple policy such as a TTL: after X minutes, you want the pod to be evicted. That's a simple example of a Deschedule plugin. On the other hand, a Balance plugin makes decisions for groups of pods. For example, if you want to maintain high availability, you probably want to maintain a certain level of homogeneity between different zones; you can implement a plugin to define those semantics. The descheduler already ships plugins like these (a policy sketch appears just before the Q&A).

Next, and I think this is the last subproject: KWOK. Also quite new, but already at 2K GitHub stars, which is a lot given how long it's been around. What is this? We have all these scheduling projects; how do we measure their performance? How do we simulate before we go into production? How do we simulate all these scheduling semantics? This is where KWOK comes in. KWOK stands for Kubernetes WithOut Kubelet. Basically, it's a controller that simulates what the kubelet would do if all these nodes existed. You can easily run, for example, a 10,000-node simulation on one laptop. You don't have to create a big real cluster, but you get as close as possible to one: you still have health checks and all the extra load on the API server from the kubelets, and pods can start and finish.

There's been a lot of feedback in the past months. Some highlights: there is support for configuration through CRDs; KWOK is deployable as a single image; and recently KWOK gained Podman support, so you can run KWOK pretty much anywhere: Docker, Podman, nerdctl, or even as a binary directly on your laptop. There is also support for snapshot and restore through kwokctl. What's coming next? Resource usage simulation for CPU and memory, and extra commands in kwokctl.

Lots of projects! I hope we didn't overwhelm you too much, but now we have some minutes for questions. Is there a microphone? There is a microphone right there.
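Before the Q&A starts, one last sketch: a descheduler policy expressing the two plugin types above might look like this, assuming the v1alpha2 API and the upstream PodLifeTime (Deschedule) and RemoveDuplicates (Balance) plugins; the profile name and TTL value are illustrative.

```yaml
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: default                     # illustrative profile name
  pluginConfig:
  - name: PodLifeTime
    args:
      maxPodLifeTimeSeconds: 3600   # evict pods older than one hour
  plugins:
    deschedule:
      enabled:
      - PodLifeTime                 # per-pod "TTL"-style policy
    balance:
      enabled:
      - RemoveDuplicates            # spread duplicate pods across nodes
```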
Thanks for your presentation. One question: you mentioned the concept of JobSet. Can I use it together with MPI jobs, build a JobSet from MPI jobs, for example?

Yes. This is SIG Apps' domain, although I'm fairly familiar with JobSet. The idea of JobSet is that you can define a job with multiple templates, and you can define some networking for these pods. Yes, you can use it to implement MPI jobs. At the moment, there are some complications, because in MPI you need the hostfile that contains the names of all the pods. That's something you would probably have to do yourself; JobSet by itself doesn't provide it. In contrast, with the Kubeflow MPIJob kind of thing, Kubeflow solves all of this already, but only for one template. Actually, well, two templates, the driver and the workers, but you only have one shape of workers. So yes, Kubeflow already supports all of that, out of the box. JobSet tries to go a little bit beyond, but MPI is tricky. Thank you.

Hi. I don't know if this is something that's supported in JobSet, but is there going to be any support for considering a group of jobs to be handled by the queue? In your example, you can consider a single job, but what if I want to schedule a group of jobs, and I only want them scheduled if all of them can be scheduled, conditions like that?

JobSet itself is about creation: all it does is create pods based on the templates. Scheduling is still on the scheduler and Kueue side. Kueue knows how to take into account the entire size of the JobSet, so it can make the admission decision for the whole JobSet. The scheduler would still take individual decisions for each of the pods, but there is an open KEP, and still an open issue, to implement gang scheduling. We have the coscheduling plugin in another subproject, which needs to be graduated and brought back upstream. All these pieces, we are trying to make them supported all the way through, but we are still on the way.

Okay, we're out of time. All right, thank you. Thank you for coming. Thank you very much.