Hi, welcome to the Maintainer Track, SIG Scheduling deep dive. My name is Aldo Culquicondor, I work for GKE, I'm based in Waterloo, Canada, and on the Kubernetes side of things I'm a maintainer and TL for SIG Scheduling. And my name is Kante, I work for DaoCloud, based in Shanghai, China, and I also work on Kubernetes upstream together with Aldo, mainly on SIG Scheduling.

We're going to have a quick overview of the scheduler for those who are not familiar, and the rest is all about updates in the various projects that SIG Scheduling hosts.

So let's start with the scheduler. The scheduler is a single component, a controller that is part of the Kubernetes control plane, and it's basically watching for pods, which at a high level can be in two states: scheduled pods, which are already assigned to nodes, and new pods, which have not been assigned to a node yet. The pods that are not assigned yet go into a scheduling queue, and the pods that are scheduled go into the internal scheduler cache. One at a time, we pop a pod from the queue, and then we execute all the scheduling algorithms to find, first, where the pod can fit; among all the possibilities, which one is the best option according to the configuration; and whether we need to preempt any existing running pods because we have a higher-priority pod. If the pod is schedulable, it enters the binding process, which is simply telling the API server that the pod is scheduled and where it should go. If it doesn't fit anywhere, it goes back to the queue, and it also gets a pod condition that says it is unschedulable, so other components can react to it, such as the cluster autoscaler, for example. If the binding is successful, then that's all that kube-scheduler does. From the point that the pod has a node assigned to it, the scheduler no longer owns the pod, so anything that happens to it is the responsibility of the kubelet, except for preemption, which is when we need to schedule a higher-priority pod.

Over the years, we worked on a framework to make the scheduler more extensible, because obviously different companies and different users have different needs, so we needed to establish a structure where people can contribute more easily to the scheduler. As a result of that, we introduced a series of extension points where people can extend the scheduler. Let's start with the green box, the scheduling cycle. There are basically two main processes here: PreFilter and Filter. Filters are all about yes or no: we take a pod, we take a node, and we ask the question, can this pod go on this node, yes or no? If we end up with a "no" for all the nodes, then we trigger PostFilter; the typical example there is the preemption mechanism, which may evict some pods. If we have more than one node that can fit the pod, that responded yes to the filters, then we enter the scoring phase, where we run multiple score plugins for each of the nodes, and whichever node has the highest score is the one that is selected. Then we enter the binding cycle on the right, which communicates the decision to the API server. Somewhere in here there are also mechanisms to do volume binding, but I won't get into those details.
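As a rough illustration of how these extension points surface to users, here is a minimal sketch of a scheduler component configuration that tweaks the plugins running at the Score extension point; the plugin choices and the weight are arbitrary examples of built-in plugins, not a recommendation.

```yaml
# Minimal sketch: adjust which plugins run at the Score extension point.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: ImageLocality                      # drop one built-in score plugin
        enabled:
          - name: NodeResourcesBalancedAllocation    # weight another one higher
            weight: 2
```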
Now, we have introduced this new box on the left, the purple box, which is a new extension point that we call PreEnqueue. What it means is that it can hold a pod back from the rest of the scheduler's machinery. Why is this useful? We'll give you some insight into that in a moment, but this is a new extension point that you can tweak. The big difference between PreEnqueue and Filter, let's say, is that PreEnqueue is a global decision, a decision for the cluster, while Filter is a decision for a node. Also, if you fail PreEnqueue, you're not even entering the queue, so it's as if your pod didn't exist, kind of. So if your pod fails PreEnqueue, for example, the cluster autoscaler won't scale up. That's one of the differences with Filter: if a pod fails the filters, the cluster autoscaler can still react to it. We'll see why that is useful in a second.

Okay, let's go through several updates and improvements we made in the last several releases. The first one, as Aldo said, is pod scheduling readiness. Most of the time, when pods are created, they are added to the scheduling queue and popped out of the queue one by one for scheduling. That makes sense most of the time, but it's not always the case. For example, the pod may be missing essential resources: take a pod requesting a GPU. If the GPU hasn't been registered in the cluster, the pod will never be scheduled successfully, right? Or, in a quota management system, say we have a pod with a lower priority, and there may be scenarios where there are not enough resources for the pod, so the pod will be stuck pending for a period of time. However, kube-scheduler will keep trying to reschedule the pod whenever it gets the chance, which somehow leads to a degradation of the scheduling throughput. Although we have some mechanisms underneath, like the backoff algorithm that handles pod scheduling failures, we cannot avoid it entirely, right? There is another case too: let's say we have several custom controllers that want to make the decision together with the scheduler, but without modifying the default kube-scheduler. Without this feature, we cannot achieve that. So we need a knob to help us control the scheduling process. That's the general idea of this KEP.

So in this new feature, we introduced a new field named scheduling gates. It's a set of strings. When kube-scheduler finds that a pod carries scheduling gates, it will not pop the pod out for scheduling. Here is an example of how this feature works. Let's say we have a normal Kubernetes cluster, and we have a custom webhook and an external controller. When we apply a pod, the request goes to the API server, invoking the webhook, and the webhook injects the scheduling gates into the pod. Meanwhile, kube-scheduler watches the pod create event, finds that the pod carries scheduling gates, and therefore does not pop the pod out. Then comes the external controller: it removes the scheduling gates when it finds the pod is ready, for example when we have enough resources for the pod in the quota management system. So the external controller removes the scheduling gates, kube-scheduler watches the pod update event, finds that the scheduling gates were removed, and pops the pod out, and the pod enters the normal scheduling process. That's how it works in general.
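As a rough sketch, a gated pod looks like the manifest below; the gate name is a hypothetical example, since any opaque string your external controller recognizes will work.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gated-demo
spec:
  schedulingGates:
    - name: example.com/quota-check   # hypothetical gate name owned by the external controller
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```

While the gate is present, the pod never enters the active scheduling queue; the external controller later clears spec.schedulingGates (gates can be removed but not added after pod creation), and the pod becomes schedulable.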
And the next KEP is about mutable scheduling directives for suspended Jobs. The motivation for this is that in batch, the pods of a job usually run with specific constraints: they should run in the same zone, maybe for communication performance, or they prefer the same model of GPUs. So these fields should be mutable while the job is suspended, and a higher-level job queueing controller can take advantage of this feature for better pod placement. Let's look at the right side of the slide. These are all the mutable fields we support right now: node affinity, node selector, tolerations, annotations, labels, and the scheduling gates. The scheduling gates were introduced in the last release, 1.27. We didn't introduce any new APIs in this feature; we just relaxed the API validation. So, as you see, when the Job is suspended, we can inject things like scheduling gates. Kueue, a subproject sponsored by SIG Scheduling, takes advantage of this feature for job queueing. There is a sketch of this after this part.

The next feature is about mutable pod scheduling directives when a pod is gated. It's quite similar to the previous one, but it applies to pods only, and the motivation is quite similar: external workload controllers can influence the pod placement with this feature. The general idea is that when we find a pod gated by scheduling gates, we can inject the node selector and node affinity. But one thing I want to highlight is that not just any node selector or node affinity is allowed here. If the existing node selector or node affinity is empty, then anything is okay; but if not, only more restrictive updates are allowed, meaning the new set of node candidates must be a subset of the original nodes. So if you have custom jobs that have no suspend semantics, you can still achieve the same behavior as the Kubernetes native Job with this feature.

Then the PodTopologySpread plugin. This plugin helps to control pod spreading across the cluster. We did a lot of work on it in the past several releases, but I will not go into all the detail, because we already did that at the last KubeCon North America. The first item is minDomains. It's a new field under topologySpreadConstraints, and it helps control the minimum number of topology domains; if there are not enough domains, the cluster autoscaler can keep bringing nodes on board until there are enough. The second item is two new fields, nodeAffinityPolicy and nodeTaintsPolicy. We can use these two fields to define whether we should respect the node affinity or the node taints when calculating the spreading skew. All three of these fields live under topologySpreadConstraints; there is a combined example below.

Another feature is about rolling updates. Before, when a deployment was doing a rolling update, kube-scheduler could not distinguish the old replicas from the new replicas, so when the old ones were torn down, the spreading would end up in an unbalanced state. One workaround is to deploy the descheduler at the same time; with this feature enabled, there is another approach. The feature introduces a new field named matchLabelKeys. It's a set of label keys, and it does not care about the label values. For Deployments, we can put pod-template-hash here; pod-template-hash is a label added by the Deployment controller, so the pods of different ReplicaSets don't overlap with each other.
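Going back to the suspended-Job feature for a second, here is a hedged sketch of the idea: while spec.suspend is true, a queueing controller may rewrite the pod-template fields below and only flip suspend to false once it has decided on a placement. The label and taint keys are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-demo
spec:
  suspend: true                            # while true, the fields below stay mutable
  template:
    spec:
      nodeSelector:
        cloud.example.com/gpu-model: a100  # hypothetical label; a controller may rewrite this
      tolerations:
        - key: example.com/reserved        # hypothetical taint key
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: registry.k8s.io/pause:3.9
      restartPolicy: Never
```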
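And here is the combined topology spread sketch mentioned above, showing all four new fields in one constraint; the replica count and the Honor/Ignore choices are just illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # minDomains requires DoNotSchedule
          minDomains: 3                      # pods stay pending until 3 zones exist, so the autoscaler can react
          nodeAffinityPolicy: Honor          # respect nodeSelector/nodeAffinity when computing skew
          nodeTaintsPolicy: Ignore           # ignore node taints when computing skew
          labelSelector:
            matchLabels:
              app: web
          matchLabelKeys:
            - pod-template-hash              # separates old and new ReplicaSets during a rolling update
      containers:
        - name: web
          image: registry.k8s.io/pause:3.9
```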
And yes, so kube-scheduler can use these label keys to distinguish the old replicas from the new ones, which leads to balanced scheduling results. That's how it works in general. The blog post about these three KEPs was just published yesterday, so if you want more information, you can follow the link for the details.

Besides these features, we have several notable improvements; at least these three. The first one is about the Skip status. If a pod has nothing to do with a Filter plugin, then we think the Filter plugin can be skipped directly, right? For example, take a pod that carries no node affinity fields: the NodeAffinity plugin will have nothing to do for that pod, so the plugin can be skipped directly. Now a plugin can return a Skip status in the PreFilter extension point, and the corresponding Filter plugin will be skipped. The second is quite similar to the previous one, but it's for the Score extension point: you can also return a Skip status in the PreScore extension point to skip the Score plugin. The next one is about a new metric, plugin_evaluation_total. Each time a plugin is invoked, the value of the metric is incremented by one. With this metric, we can get an overview of how each plugin is influencing the scheduling results. I think it's quite useful. And now I'll hand it over to Aldo for several subproject updates.

Yes. So as SIG Scheduling, our main component of course is kube-scheduler, but we also host a few subprojects. The first one, Kueue, is one of the newest subprojects. It's a controller for job queuing. We offer resource quota management with borrowing and preemption semantics at the job level. It also has support for resource fungibility, meaning that if you have a heterogeneous cluster, you can establish quotas for different resources for different flavors, as we call them, in the cluster. Then if you submit a job, Kueue decides whether it fits in a certain quota and assigns it to that flavor; otherwise, it assigns it to the next flavor that has available quota. We directly support the batch Job API, we offer support for MPIJob from Kubeflow as well, and we are working on more integrations. We also published a library for supporting any custom jobs that you might have, so you can queue them with Kueue.

Here, you might be wondering why we are investing in a new scheduling project, and the answer is that we are not: we are not reinventing kube-scheduler. We are following a separation-of-concerns principle. Kueue takes decisions at the job level, so it's, let's call it, a first-level scheduler, and once the decision has been taken for the entire job, we delegate the rest of the scheduling to the existing components. So Kueue is compatible with kube-scheduler, it's compatible with the controller manager, and it's also compatible with the cluster autoscaler.

We just released version 0.3, I think it was last week. Some of the highlights: we released the first beta-quality API, meaning that from now on we respect the deprecation policy established by Kubernetes. We improved validation for better usability. We added preemption support; that was a big request from our users, and we finally have it for different scenarios. We added support for MPIJob, the v2beta1 version of it.
We added an optional sequential admission, for a quasi all-or-nothing, gang-scheduling behavior, if you will. We added support for limit ranges and runtime classes, so we can better calculate quota usage. And, as I said, we published this library for integrating any CRD with Kueue. If you want to learn more, we did a session yesterday during the Batch + HPC Day; you can find a link in the slides, and hopefully the recording will be uploaded soon. There is a quick sketch of Kueue's flavors and quotas below.

Another project that we host at SIG Scheduling is the descheduler, which just had a new release, and they just got a new logo. So what is new in the descheduler? First of all, they introduced the v1alpha2 API and the descheduler framework, which is a framework similar to the scheduler framework: a plugin-based refactor of the project, so more people can easily collaborate without bumping into each other's features. There is a new config API, and v1alpha1 is deprecated. They also introduced descheduler profiles, similar again to the scheduler profiles in kube-scheduler; I'll expand on this a little bit in a bit. And they added a feature for the node utilization plugin: it now has a namespace filter. One inspiring thing that they wanted to share is that they had more than 10 contributors in the latest release.

So the descheduler framework: what is the motivation? The project was becoming too big, with too many contributors, hopefully some of you here. They all wanted to introduce new features, new strategies, and they were having issues merging those changes. So they decided to follow a similar pattern to kube-scheduler and introduced the descheduler framework. It works a little differently from the scheduler framework, and it also has a few different names. Let's say we have a profile here. Each pod is passed through all of these plugins, and each plugin implements some sort of filtering. There are two types of plugins: Deschedule plugins, which kind of observe one pod at a time, and Balance plugins, which observe kind of the entire cluster at a time to take decisions. So a Deschedule plugin could take a decision based on, let's say, how long a pod has been running, and a Balance plugin could take a decision about whether my deployment is well spread. And that's one profile; each pod runs through every profile. The profiles can be different, so the first profile can filter specific pods, and the next profile can filter other types of pods. That's the idea; there is a sketch of a policy after this part. The descheduler project is asking for use cases, so if you have any more use cases for descheduling, feel free to reach out. And if you want to learn more about the descheduler framework, you can find a link down there to the original design.

The next project is scheduler-plugins. This is an out-of-tree repository. There is basically one kube-scheduler in Kubernetes, but we wanted to give the community the opportunity to incubate certain plugins with as high quality as possible, and that's why we introduced this repo. There are some quite experimental and also some quite production-ready plugins in here. So it's kind of a staging project: if a plugin matures enough, we will consider upstreaming it to kube-scheduler, but it still needs to go through the KEP process and production readiness reviews.
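Going back to Kueue for a moment, here is a hedged sketch of the flavors-and-quota idea under the v1beta1 API; the flavor name, node label, and quota numbers are made up, and the exact schema is worth double-checking against the Kueue docs for your version.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: on-demand                         # made-up flavor name
spec:
  nodeLabels:
    cloud.example.com/pool: on-demand     # hypothetical node label backing this flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {}                   # accept workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: on-demand
          resources:
            - name: cpu
              nominalQuota: 100
            - name: memory
              nominalQuota: 400Gi
```

With several flavors listed in order, Kueue tries to fit a job in the first flavor's quota and falls back to the next one, which is the resource fungibility described above.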
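And here is the descheduler profile sketch promised above, using the v1alpha2 config API with one Deschedule-type plugin and one Balance-type plugin; treat the plugin names and args as illustrative and check the project's docs for the exact schema.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: demo-profile
    pluginConfig:
      - name: "PodLifeTime"
        args:
          maxPodLifeTimeSeconds: 86400   # evict pods that have run longer than a day
      - name: "RemoveDuplicates"
    plugins:
      deschedule:                        # plugins that look at one pod at a time
        enabled:
          - "PodLifeTime"
      balance:                           # plugins that look at the whole cluster
        enabled:
          - "RemoveDuplicates"
```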
Anyways, I want to highlight the plugins that are there today, starting with the most popular one, probably the coscheduling plugin, which offers all-or-nothing scheduling for a group of pods at the same time; there is a sketch of it below. I've been told by the contributors that OpenAI is one of the users of this plugin. The next one is node resource topology, which is about adding topology awareness to the scheduler: topology inside the node, like NUMA nodes, for example. Capacity scheduling is about elastic quotas. Preemption toleration is a variation on the preemption plugin that offers certain opt-out capabilities. And then there is Trimaran, which is actually a set of load-aware scheduling plugins, and a network-aware scheduling plugin as well. Major changes in the last few months: a new component config API; several updates to the node resource topology match plugin, where there is now a new scoring strategy for NUMA nodes and an implementation of reservations that aims to reduce conflicts with the kubelet; and the APIs for PodGroup and ElasticQuota now serve a status, so you can subscribe to those events. More importantly, there are breaking changes in the newest release: we changed the API group to scheduling.x-k8s.io, so if you upgrade to the newest version, you will have to update your CRDs.

Okay, the next project is the Kubernetes scheduler simulator. It helps you simulate kube-scheduler without a real cluster and, more importantly, it can display the scheduling decisions in detail. The major change in the last release is that we can now display the scheduling results across all the extension points; before, we only supported the Filter and Score extension points. One thing I want to mention here is that the simulator has a web UI, but up to now the UI only supports the Filter and Score extension points; for the results of the other extension points, you have to refer to the pod annotations. Next, we are trying to implement scenario-based simulation and benchmarking; you can refer to the proposal for more details. And this is how it generally looks: you can see we have the scheduling results for all the extension points in the pod annotations.

Okay, the next project is KWOK, Kubernetes WithOut Kubelet. This is a new project. It's a toolkit to build up a cluster, and it can create thousands of nodes in seconds. It's lightweight, it's fast, and it's also flexible. From the test reports, it can sustain thousands of nodes and hundreds of thousands of pods on a laptop. That's quite a large amount of resources; for more information, you can refer to the blog post. We are happy to announce that we just released the first version, and we have more than 1,000 GitHub stars right now; it's quite popular. Actually, the architecture of KWOK is not that complex. There is no kubelet, there is no real node, there is no data plane: all of this is mocked by KWOK. To achieve this, KWOK has two controllers: the node controller and the pod controller. The node controller is responsible for simulating the node lifecycle, and it also maintains the node heartbeat to the API server. The pod controller is responsible for simulating the pod lifecycle. So you can mock any node or any pod, in any state, at any time. It's quite flexible, and it can achieve this because in KWOK, everything is API objects. KWOK also has a client tool, kwokctl; it's kind of like kind.
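To make the coscheduling plugin concrete, here is a hedged sketch of a PodGroup and a member pod under the renamed scheduling.x-k8s.io API group; the scheduler name and the exact pod-group label key are assumptions, so check them against the scheduler-plugins release you deploy.

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-group
spec:
  minMember: 4                  # all-or-nothing: run only when 4 pods can start together
  scheduleTimeoutSeconds: 10
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    scheduling.x-k8s.io/pod-group: demo-group   # assumed label key tying the pod to its group
spec:
  schedulerName: scheduler-plugins-scheduler    # assumed name of a scheduler built from scheduler-plugins
  containers:
    - name: worker
      image: registry.k8s.io/pause:3.9
```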
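Since everything in KWOK is API objects, creating a fake node is just applying a Node manifest that KWOK's node controller recognizes. This is a trimmed sketch based on the pattern in the KWOK docs: the annotation marks the node as fake, and the taint keeps real workloads off it.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: kwok-node-0
  annotations:
    kwok.x-k8s.io/node: fake        # tells KWOK's node controller to manage this node
  labels:
    type: kwok
spec:
  taints:
    - key: kwok.x-k8s.io/node       # keeps real workloads off the fake node unless tolerated
      value: fake
      effect: NoSchedule
status:
  allocatable:
    cpu: "32"
    memory: 256Gi
    pods: "110"
  capacity:
    cpu: "32"
    memory: 256Gi
    pods: "110"
```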
So we think KWOK can cover several use cases, like scheduling simulation. One interesting thing here is that the kube-scheduler-simulator is planning to integrate with KWOK; that's quite interesting. And for scalability tests: if you want to test performance with, say, 5,000 nodes, you no longer need 5,000 real nodes; you can use KWOK and bring them up on your laptop. Also, for integration with the cluster autoscaler, you no longer need to go to a cloud provider; you can bring that up on your laptop as well. And yes, functionality tests: if you are developing a CRD and you want to verify whether the CRD behaves well, you can use KWOK.

Okay, that's pretty much it for this session. And I guess some of you may be interested in joining the community. If you are a beginner, you can refer to the "good first issue" or "help wanted" issues. If you have any questions, please join us on Slack; we are all there. We also have a bi-weekly meeting, and if you want to get more involved, you can talk with the tech leads and the chairs. And yes, we have an APAC meeting for the contributors from other regions. If you want to know more about the architecture of kube-scheduler, or you want to know how the features are designed, you can refer to the KEPs and the development docs. Last but not least, thanks to all the contributors in the last release. And yes, we are calling for reviewers.

So, the last part: Q&A. Any questions?

Thanks for the great presentation. I wanted to ask about the features that allow delaying the scheduling of pods, so the scheduling gates and the suspended jobs. Is there some sort of status condition or timestamp where the status changes to the pod becoming schedulable? I'm asking because, for example, in autoscaling, the scheduling latency is a major thing we monitor, and I'm wondering if there is now an arbitrarily long time for which pod scheduling can be delayed. Is there a way for me to actually monitor the pod going through the system?

There is a pod condition; if I remember correctly, it's called SchedulingGated, and it contains a timestamp. But yeah, you will have to subscribe to that to be able to measure SLAs.

Thank you. That's great. Thank you very much.

Yes, I just wonder if there's a strategy or plugin which allows packing pods densely on nodes, so that they're not spread across nodes, but we utilize the nodes to the maximum?

Yes, that's one of the modes of the node resources utilization plugin. I don't remember the exact name right now, but you can configure it to target maximum utilization, or minimal utilization for spreading. The default is minimal utilization, so by default kube-scheduler spreads.

Yeah, that can be a problem, because then you get underutilized nodes and they cannot scale down, because the pods are pinning the nodes.

Yes. So some providers, you know, in GKE we provide multiple scheduling profiles, so you can basically select the scheduling profile that fits your use case better. So that's the recommendation for cloud providers to provide. One at the back.

A follow-up question to that. How does it compare to the descheduler? It could also help with maximizing bin packing, right, by clearing out underutilized nodes?

No, it's not about cleaning up. The question was about scheduling: at scheduling time, maximizing utilization. And I guess the benefit of doing it during scheduling time is that, well, there are no preemptions, no disruptions to the pods.
But of course, you can still pair them: you can deploy the scheduler and the descheduler together for maximum defragmentation.

Hi, thanks for the talk. A question about Kueue. You mentioned that the decision is made at the higher level, for the job itself; basically, the decision to schedule or not to schedule is made on the job. How does preemption work then?

Yeah, so that's based on the suspend field: the Job API has a suspend field that Kueue leverages to either start or preempt a job as a whole. The point here is that Kueue takes the decision at the job level, and then it is the job controller that actually executes the deletions, and handles the graceful termination and so on. Basically, the approach we've taken in Kueue is that any time upstream Kubernetes doesn't work for us, we go and change it. We talk to SIG Apps, we talk to SIG Autoscaling, even to SIG Node, and we work with them to make this communication channel, this separation of concerns, work well for us. So as a result, we might have a slightly slower release cadence, but we are here for the long term.

Thank you. Yes, and you can also refer to the session we did yesterday about Kueue for more detail about the preemption.

Thanks, I for sure will review it.

All right. Thank you. Thank you all. Thanks everyone.