All right, hi everyone. Welcome to literally the last session of KubeCon; I hope you've been enjoying this KubeCon. This is the SIG Scheduling maintainer track session. My name is Wei Huang, I'm a co-chair of SIG Scheduling, and I work for Apple as a software engineer in the AIML department. My name is Ken, I'm from DaoCloud. I also work on community stuff, and I'm also active upstream, working together with Wei. Yeah, that's all.

Okay, let's dive into today's session. We'll first go through what the scheduler is and what the current framework looks like, then give you some recent updates on features completed in 1.29 and in the upcoming 1.30 release, mention some other notable updates, and then Ken will cover updates on the subprojects sponsored by SIG Scheduling.

Okay, so the first part is a scheduling overview. You may already know that since 1.19 the scheduler has used the scheduling framework to orchestrate the whole scheduling workflow. Basically, all the plugins shipped with the default Kubernetes offering are built on this framework, and a lot of out-of-tree plugins developed to satisfy in-house scheduling requirements use it as well. I'll intentionally skip the first part, the purple part, and come back to it later in detail.

We start with a sorting interface: internally we maintain a scheduling queue to accommodate incoming pods, and by default it sorts by pod priority. The scheduler, the green part, pops the head of the queue and sends that pod into the scheduling cycle, where it goes through a series of extension points. The first two are called PreFilter and Filter, which mainly decide whether the pod can be scheduled according to predefined hard requirements. By the end of this phase you get a result saying how many nodes can fit this pod. If there are some, we continue on the happy path to PreScore, Score, and so on, which basically prioritize the candidate nodes to give a final optimal node for the pod to be scheduled on. Once the last phase of the green box finishes, the scheduler starts from the head of the scheduling queue again to pick up the next pod.

Why are the scheduling cycle and the binding cycle separated? Because the binding cycle is sometimes time-consuming: it has to interact with the API server to send a binding request. So the yellow part, the binding cycle, sends a binding request to the API server, which underneath sets spec.nodeName on the pod.

That is the happy path, where you have at least one node to accommodate the incoming pod. But it can happen that no single node can host the pod. Then it goes to the upper branch, if you look at the red sign here, and calls PostFilter, meaning no single node can accommodate the pod. Right now the default implementation of PostFilter is preemption: it evaluates whether there is any less important, low-priority pod that can be preempted to make room for the incoming pod. So that is preemption.
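Before moving on, to make the framework shape a bit more concrete, here is roughly what a minimal out-of-tree Filter plugin looks like. This is only a sketch: the plugin name and the label keys are made up for illustration, and the framework package is the one out-of-tree plugins typically import from the Kubernetes repo.

```go
// Package gpunodeonly sketches a minimal out-of-tree Filter plugin.
package gpunodeonly

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "GpuNodeOnly" // hypothetical plugin name

// GpuNodeOnly filters out nodes that lack a (made-up) readiness label for
// pods that ask for it. Filter is the "hard requirement" hook described
// above: returning an Unschedulable status removes the node from the
// candidate list for this pod.
type GpuNodeOnly struct{}

var _ framework.FilterPlugin = &GpuNodeOnly{}

func (pl *GpuNodeOnly) Name() string { return Name }

// Filter is invoked once per candidate node during the scheduling cycle.
func (pl *GpuNodeOnly) Filter(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if pod.Labels["example.com/needs-gpu"] != "true" {
		return nil // nil status means this node passes the filter
	}
	if nodeInfo.Node() != nil && nodeInfo.Node().Labels["example.com/gpu-ready"] == "true" {
		return nil
	}
	return framework.NewStatus(framework.Unschedulable, "node is not GPU-ready")
}
```

You would compile something like this into your own scheduler binary (typically via app.NewSchedulerCommand with app.WithPlugin) and enable it in your KubeSchedulerConfiguration.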
Back to preemption: because each Kubernetes pod has a termination grace period, which defaults to 30 seconds, the scheduler cannot just sit there for 30 seconds waiting for the victim pod to be physically deleted before placing the incoming pod. So if the pod has to preempt other pods, PostFilter should return immediately. Basically, the output of PostFilter is to temporarily nominate a node name on the pod and then return right away; the pod then waits for its next turn to be really scheduled. By that time the scheduler will also check whether the victim pods have really been physically deleted. In the next round, the scheduler will prefer the nominatedNodeName set on the pod's status field.

Interestingly, this feature is also used by some external components like Karpenter, which does node provisioning. It gives them a very fast path: they don't need to go through the regular scheduling flow across, say, the previous 5,000 or 2,000 nodes that have already been proven unable to fit the pod. So that's basically an optimization, and how it's used by other components.

Now, the recent updates section. We'll mention a few KEPs. A KEP, if you're not familiar, is short for Kubernetes Enhancement Proposal, which is a formal process to propose features and track their implementation progress.

The first one I want to introduce is called Pod Scheduling Readiness. As I mentioned earlier, for a pretty long time in the scheduling framework, once a pod gets created it immediately goes into the scheduling queue and waits for its turn to be picked up and scheduled. But that doesn't fit some user scenarios: a pod being created doesn't really mean the pod is ready for scheduling. For example, the pod may need several external components to check its readiness, for example whether some in-house quota has been satisfied, et cetera. That is why we introduced an extension point called PreEnqueue, which happens before the sorting and before the whole scheduling lifecycle. Once the pod gets created, you can implement the PreEnqueue extension point to add customized logic to pre-check the pod; once the check is satisfied, the pod is pushed into the scheduling queue. This feature was alpha in 1.26, beta in 1.27, and is GA in the upcoming 1.30 release.

This may be a little abstract for end users, so for users to consume this feature we introduced the schedulingGates field in the pod. In the left rectangle box you can see the pod has two scheduling gates, one called foo and one called bar. The gates have to be specified at pod creation time, whether by the controller that owns the pod creation or by some mutating webhook. But once you set them and create the pod via the API server, it's a one-way direction: you cannot come back some time later and add additional scheduling gates. That's not allowed. As long as some scheduling gates are present, at the bottom you can see the pod stays in the SchedulingGated phase, which means that inside the scheduler the pod has not entered the regular scheduling cycle yet. It waits there for some external signal to remove the scheduling gates: for example, foo has been checked and satisfied, so it gets removed, and then bar gets removed. That lifts the gate and pushes the pod into the regular scheduling cycle.
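As a sketch of what that looks like from the owning controller's side (the gate names foo and bar follow the slide; the clientset wiring and the namespace are assumptions for illustration):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createGatedPod creates a pod that stays in the SchedulingGated phase until
// every gate is removed. Gates must be present at creation time; adding new
// ones later is rejected by the API server.
func createGatedPod(ctx context.Context, cs kubernetes.Interface) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "demo", Namespace: "default"},
		Spec: corev1.PodSpec{
			SchedulingGates: []corev1.PodSchedulingGate{{Name: "foo"}, {Name: "bar"}},
			Containers:      []corev1.Container{{Name: "app", Image: "nginx"}},
		},
	}
	_, err := cs.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{})
	return err
}

// liftGate removes one gate once the external check it represents has passed.
// When the last gate is gone, the scheduler starts scheduling the pod.
func liftGate(ctx context.Context, cs kubernetes.Interface, name, gate string) error {
	pod, err := cs.CoreV1().Pods("default").Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	var kept []corev1.PodSchedulingGate
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != gate {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
	if _, err = cs.CoreV1().Pods("default").Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		return err
	}
	fmt.Printf("lifted gate %q\n", gate)
	return nil
}
```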
And then you will see what you're normally familiar with: the pod goes from Pending to Running. So basically this feature introduces another phase, called SchedulingGated, for you to integrate with your own components to implement very customized scheduling requirements. One typical case is an organization with in-house quota requirements that must be checked before a pod can be scheduled. I think this is used a lot in the Apache YuniKorn project to increase their throughput.

The next one, which is also a KEP, is called minDomains in pod topology spread. It's a sub-feature of pod topology spread. The motivation is that a lot of users run a cluster autoscaler in the cloud offerings. As time goes by, the topology domains may shrink, for example down to one, and in that case pod topology spread's maxSkew is actually a no-op, because you only have one domain. So all your pods would be stacked into the same domain. That may not be what the user wants, because they may prioritize the application's resiliency and high availability over other things. In that case, what they want is: I don't want all the pods to be scheduled to the same domain, that doesn't make sense to me; just let the pod be pending, because I do want the pod to land in another domain. What you need to do is add minDomains: 2 to the pod's topologySpreadConstraints field, which in this case simply means I want at least two domains. This information, together with the pending pod, will be detected by your cluster autoscaler, whether Karpenter or the vanilla Cluster Autoscaler; it will create an additional domain for you, and then the pod gets scheduled to that domain, which is what you want. Configuring it is pretty simple: minDomains is an optional field, treated as one by default, and you have to set a positive number here.

Another feature is called queueing hints. It's a very internal feature, not oriented toward end users, but I'll mention it a little. In the scheduler's history, before 1.22, the way a pending pod could be re-queued was pretty wild, or aggressive: for every cluster event, no matter whether it was a node or a pod being deleted, we triggered moving all the pending pods back into the scheduling queue. You can imagine that's super inefficient, because high-priority pods get pushed back to the head of the queue again and again, at the cost of the low-priority pods, which don't even get a chance to be scheduled. This is classically called head-of-line blocking. Then in 1.22 we introduced a coarse-grained approach that allows plugin developers to define which events are related, or we could say which events can help make a failed pod schedulable again. For example, in your volume plugin you can define that only when some PVC or PV event happens will the scheduler move the pod back; otherwise it just stays there. But this is sometimes still not fine-grained enough. So in the latest releases we introduced queueing hints, which allow plugin developers to register not only related events but related callback functions: we give enough context information in the callback function's signature, so you can choose how to implement those callbacks. The original interface is very simple: it just returns a list of cluster events.
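Roughly, the two shapes compare like this. This is a simplified sketch with stand-in types; the real definitions live in the kube-scheduler framework package and have shifted a bit across releases:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// Simplified stand-ins for the framework's event types.
type ClusterEvent struct {
	Resource   string // e.g. "Pod", "Node", "PersistentVolumeClaim"
	ActionType string // e.g. "Add", "Update", "Delete"
}

// Old style: a plugin just declares which coarse-grained events may make a
// pod it rejected schedulable again. Any matching event requeues the pod.
type oldStyle interface {
	EventsToRegister() []ClusterEvent
}

// QueueingHint tells the queue whether this concrete event is worth a retry.
type QueueingHint int

const (
	QueueSkip QueueingHint = iota // ignore this event for this pod
	Queue                         // requeue the pod
)

// New style: each registered event carries a callback. The callback receives
// the rejected pod plus the old/new objects of the event, so the plugin can
// decide per event whether requeueing could actually help.
type QueueingHintFn func(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (QueueingHint, error)

type ClusterEventWithHint struct {
	Event          ClusterEvent
	QueueingHintFn QueueingHintFn
}

type newStyle interface {
	EventsToRegister() []ClusterEventWithHint
}
```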
The new interface is much more complex, I would say, but it can achieve better throughput. It's also used in the DRA implementation. But right now we've found a bug, I would say: the internal implementation uses a doubly-linked list, and sometimes the objects cannot be garbage-collected soon enough, which causes excessive memory usage. So in one patch release of 1.28 we disabled it by default, and we're still working on optimizing the internal data structure to bring memory usage back to the same level as before. So that's queueing hints.

The next feature is called matchLabelKeys in pod affinity and pod anti-affinity. If you're familiar with pod topology spread, there's also a field called matchLabelKeys there, and pod affinity is adopting the same philosophy here. The requirement is that, as time goes by, upgrading an application is definitely a need. In Kubernetes, a Deployment's default upgrade strategy is: spin up a new replica, delete an old replica, and repeat until all replicas are replaced. But think about a case where you use pod anti-affinity to specify, say, two replicas on two hosts. When you want to upgrade the Deployment and a new replica comes up, it actually cannot find a place, because the old replicas are still there and the new one matches the old replicas. So your upgrade is stuck: the new pod cannot be scheduled, and the old pods keep staying there. To resolve this, we introduced matchLabelKeys. You can specify, in this case, pod-template-hash as a key, and at runtime the scheduler injects the concrete value for you. You don't need to be aware of the value; you just specify on your side which keys uniquely identify the new replicas and differentiate them from the old replicas. That achieves what you want, so you're no longer stuck in the rolling upgrade.

The same idea applies to mismatchLabelKeys, which comes in a pair with matchLabelKeys: at runtime the operator is NotIn, compared with In for matchLabelKeys.
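A sketch of the anti-affinity case described above, assuming the matchLabelKeys field in PodAffinityTerm that went alpha in 1.29; pod-template-hash is the label the Deployment controller stamps on each revision's pods:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// antiAffinityPerRevision spreads replicas of the *same* Deployment revision
// across hosts while ignoring pods from the old revision during a rolling
// upgrade: matchLabelKeys makes the scheduler add the incoming pod's own
// pod-template-hash value to the selector at scheduling time.
// (mismatchLabelKeys is the pair of this field, with a NotIn operator.)
func antiAffinityPerRevision() *corev1.PodAntiAffinity {
	return &corev1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "web"},
			},
			TopologyKey: "kubernetes.io/hostname",
			// Alpha in 1.29 behind the MatchLabelKeysInPodAffinity feature gate.
			MatchLabelKeys: []string{"pod-template-hash"},
		}},
	}
}
```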
Okay, the last KEP I want to mention is the taint manager decoupled from the node lifecycle controller. This is a cross-SIG feature: it's related to scheduling, but mainly to SIG Apps and SIG Node; historically, though, SIG Scheduling was chosen to host it. The motivation is that in the old implementation of the node lifecycle controller we didn't differentiate responsibilities: one single controller decided both how a node is tainted upon which conditions and, once the taints are applied, how the pods should be evicted. That squeezes two responsibilities into one controller. For some use cases, users want to separate them, so they can keep using the upstream default implementation that defines how a node should be tainted under which conditions, while replacing the eviction part with another component that does what they want, like deciding that a particular taint should not trigger eviction, or should trigger it in a different way. So this feature is basically a refactoring of the code base that gives users more freedom for that kind of customization.

Under other notable updates, one thing I want to mention is that the kube-scheduler configuration has been pretty stable over the years. In the upcoming 1.30 release we simply serve one version, which is v1; the old ones like v1beta3 were all removed in 1.29. Okay, I'll hand over to Ken to introduce the subprojects.

Thanks, Wei. Next I'll go through the subprojects and see what progress we made in the past several months. The first one is KWOK. Before we dig into the details, I'd like to know how many people sitting here have ever used KWOK before; if yes, please raise your hand. One, two, three, not many, but okay, good. So KWOK is a toolkit that helps you build a simulated cluster. You can take it as something like kind, but the difference is that kind still consumes real resources, while with KWOK nodes and pods are only API objects, so it won't use actual resources. Let's take a look at the picture. The control plane communicates with KWOK directly, and there is no kubelet, so KWOK can fake the status of your API objects. That's why you can build up a cluster with thousands of nodes on a laptop: they are only API objects, and no real resources are being used.

In the latest release we achieved several features. The first one is simulation of CPU and memory usage, so it can look as if real pods were running in your KWOK cluster. The second is the Stage API. A Stage is a CRD where you define the desired state you want; you can define several stages, and objects can transition from one stage to another, so you can model almost the whole lifecycle of your API objects. Also, KWOK ships a CLI named kwokctl, and we added three new commands: hack, record, and replay. The hack command can modify resources in etcd directly, bypassing the API server. This is not that safe, because you're skipping API validation and defaulting, so you should be careful when you use this command. As for the other two: when you have a bug that's hard to reproduce, you can use the record command to write down all the events that happened at that moment and replay them, repeating this again and again until you find the root cause, which lowers the debugging cost. The third one is integration with the metrics server; I think this is what enables the CPU and memory usage simulation.

Everything about KWOK is about simulation, so for the next steps we'd like to simulate things like GPU usage, because GPUs are expensive, and also volume provisioning, port-forwarding, and more node scenarios. This page shows how KWOK is widely used across the community: a lot of organizations and communities use KWOK already, and it has gotten more than 2,000 stars as of today. Quite popular, right?

Next, about Kueue. I'm also a contributor to Kueue, and at this KubeCon I heard several talks mention it, so I'd like to know how many people here have ever used Kueue before. One, two, oh yeah, great to see that. Basically, Kueue is designed for job management and job queueing. It offers quite a lot of capabilities, like priority-based job ordering and job preemption. And Kueue is quite good at resource quota management: you can fair-share your resources across different tenants and borrow resources from other tenants, which can somehow improve your resource utilization.
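As a tiny sketch of how a batch user points a Job at Kueue (assuming a LocalQueue named team-a-queue already exists in the team-a namespace; the queue-name label is Kueue's standard opt-in):

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// queuedJob builds a Job that Kueue will manage: it is created suspended,
// and Kueue un-suspends it only once quota in the target queue is admitted.
func queuedJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "sample-job",
			Namespace: "team-a",
			Labels:    map[string]string{"kueue.x-k8s.io/queue-name": "team-a-queue"},
		},
		Spec: batchv1.JobSpec{
			Suspend:     ptr.To(true), // Kueue flips this to false on admission
			Parallelism: ptr.To[int32](3),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "busybox",
						Args:  []string{"sleep", "60"},
					}},
				},
			},
		},
	}
}
```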
The next piece of functionality can somehow help with cost saving: say you have spot instances, and if there are not enough of them, you can fall back to your on-demand ones. Next, multi-cluster support: yes, Kueue now supports multi-cluster, and we'll talk a bit more about that later. And Kueue actually supports a lot of job types: the batch Job, no doubt, all kinds of Kubeflow jobs, RayJob, and plain Pods and pod groups. I think the most important thing about Kueue is that it works smoothly with the Kubernetes-native components, so there is no migration work there. Oh, I almost forgot: we got more than 1,000 stars a week ago. I think that's a big milestone for our project.

This is how Kueue works at a high level. In short, Kueue manages quota: it decides when a job will run and when it should be queued. Once the job is admitted, it leaves the actual scheduling to the default kube-scheduler, and if there is not enough resource in the cluster, the Cluster Autoscaler takes over. That's how Kueue works.

A few weeks ago we released a new minor version, 0.6.0, which introduced a bunch of big features. The first one is multi-cluster support, aka MultiKueue. It's a manager/worker model: you install Kueue in each cluster, and the manager Kueue decides which worker cluster to dispatch a job to. If you're interested in this feature, you can read the KEP for more details. The next one is the visibility API, which gives insight into pending workloads: you can get an overview of what your cluster looks like, how many jobs are running, and how many are pending right now. Next, the lending limit. Before this feature, say we have two tenants in Kueue: one tenant could borrow all the resources from the other. It's greedy. With the lending limit you can guarantee some resources for the ClusterQueue's own use, which will never be borrowed by your neighbors; there's a small sketch of this at the end of the Kueue part. Now, all-or-nothing queueing for plain pods. I want to highlight that this is completely different from the coscheduling pod group; it's implemented purely at the queueing level in Kueue. The capability is quite similar, though: you annotate the pods so that a group of pods is queued as a unit. Also, before we supported RayJob, but some of our users said they also have RayClusters, so please support RayCluster too. So here it comes. And we also shipped some enhancements to preemption and some new policies.

For the next steps, we mostly focus on three things. The first is fairer resource sharing, like dominant resource fairness. Second, right now Kueue only supports a two-level hierarchy, the cohort and the ClusterQueue, which is not enough in a big company, so we're now considering supporting hierarchical cohorts; it's already designed. The last one is integrations: integrating with KServe, so we can manage online inference services and offline inference workloads together, and also with Kubeflow Pipelines.
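Here is the lending limit sketch I mentioned. It's written against what I believe the Kueue v1beta1 API in 0.6 looks like; the queue, cohort, and flavor names are made up, so double-check the field names against your Kueue version:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// lendingClusterQueue sketches the lending limit: team-a sits in a cohort
// with its neighbors, but of its 10 nominal CPUs at most 4 may ever be
// borrowed by other ClusterQueues; 6 stay guaranteed for team-a itself.
func lendingClusterQueue() *kueue.ClusterQueue {
	lendable := resource.MustParse("4")
	return &kueue.ClusterQueue{
		ObjectMeta: metav1.ObjectMeta{Name: "team-a"},
		Spec: kueue.ClusterQueueSpec{
			Cohort:            "shared-pool",
			NamespaceSelector: &metav1.LabelSelector{},
			ResourceGroups: []kueue.ResourceGroup{{
				CoveredResources: []corev1.ResourceName{corev1.ResourceCPU},
				Flavors: []kueue.FlavorQuotas{{
					Name: "default-flavor",
					Resources: []kueue.ResourceQuota{{
						Name:         corev1.ResourceCPU,
						NominalQuota: resource.MustParse("10"),
						LendingLimit: &lendable, // alpha in Kueue 0.6
					}},
				}},
			}},
		},
	}
}
```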
And next, about the wasm extension. I guess only a few people here have heard about this project. Today we have several ways to extend the scheduler: the first one is the scheduling framework, which is the recommended way; the second is running multiple schedulers; and the third is the scheduler extender.

The scheduling framework is recommended, but the problem is that you have to recompile your scheduler once you have a new plugin. The scheduler extender is quite flexible, but it's slow, because it involves extra API requests. So we still lacked a way to extend the scheduler dynamically, without recompiling, that is still efficient. That's why we introduced the wasm extension.

This is a high-level picture of how the wasm extension looks. We have a wasm plugin inside the scheduler, which we usually call the host plugin, and we have several guest plugins. The guest plugins are where your scheduling logic actually lives: say you have several guest plugins, you implement the functions, and we compile them to wasm files. The wasm host plugin then loads those wasm files. That's how it works in general. So with the wasm extension you don't need to recompile the scheduler; you just replace the wasm files. And it's faster than the scheduler extender. You do have to implement the extension points in the wasm guest plugin as well, but the interface is quite similar to the scheduling framework, so you'll have the same experience as developing traditional scheduler plugins. Also, it's a safe sandbox: even if your guest plugin panics, the kube-scheduler will not be impacted. But yes, it's still slower than the native scheduling framework right now.

So how do you run this if you want to use the wasm extension? First of all, you have to recompile the scheduler with the host wasm plugin, because it's still out of tree; one day, when it becomes mature, we may make it in-tree, I don't know. Second, you enable the guest plugins, which is quite similar to what we do today: you just configure your KubeSchedulerConfiguration. Here we enable the two wasm plugins, plugin one and plugin two, specify the guest path, and then it works; there's a rough sketch of this below. We are very close to our first release, and if you want to try it in advance, we have a sample here, so please do.
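A rough sketch of the configuration idea, using the kube-scheduler v1 config types. The plugin names and the guestPath argument mirror the slide's description and are illustrative, not a verified schema; whether the host plugin is enabled via multiPoint or per extension point depends on the project:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/runtime"
	schedconfig "k8s.io/kube-scheduler/config/v1"
)

// wasmProfile sketches the idea from the talk: the host wasm plugins are
// enabled like any other plugin, and each one points at the .wasm file
// that holds the guest logic.
func wasmProfile() schedconfig.KubeSchedulerProfile {
	return schedconfig.KubeSchedulerProfile{
		Plugins: &schedconfig.Plugins{
			MultiPoint: schedconfig.PluginSet{
				Enabled: []schedconfig.Plugin{
					{Name: "wasmplugin1"},
					{Name: "wasmplugin2"},
				},
			},
		},
		PluginConfig: []schedconfig.PluginConfig{
			{
				Name: "wasmplugin1",
				Args: runtime.RawExtension{Raw: []byte(`{"guestPath": "/plugins/plugin1.wasm"}`)},
			},
			{
				Name: "wasmplugin2",
				Args: runtime.RawExtension{Raw: []byte(`{"guestPath": "/plugins/plugin2.wasm"}`)},
			},
		},
	}
}
```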
Next, about scheduler-plugins. We host some out-of-tree plugins in this repo, because not every plugin fits every user. Besides the stable ones like coscheduling and capacity scheduling, we introduced two new plugins. The first one is about system-call-based scheduling: pods share the same host kernel, so system calls can open the door to attacks on a node. The scheduler can make decisions based on system call usage, so even if an attack happens, there will be fewer victims in your cluster. You can take a look if you have similar demands. The next one is disk-IO-aware scheduling. It's quite straightforward: it makes scheduling decisions based on real-time disk IO. To achieve this you have to install a driver on each node; it reports the real-time disk IO into a new CRD, and the scheduler reads that information while scheduling. This repo also got more than 1,000 stars.

Next, about the descheduler. The kube-scheduler makes its scheduling decision at a certain point in time, but our clusters keep changing: some nodes may come in, some pods may be deleted, and this leads to an unbalanced cluster. That's where the descheduler comes in: it evicts the offending pods and makes them go through scheduling again with the default scheduler. In the new release, 0.29, we didn't introduce many big features, but we do have some enhancements, like respecting nodeAffinityPolicy, nodeTaintsPolicy, and matchLabelKeys in the pod topology spread plugin, and also respecting image pull backoff in the PodLifeTime plugin. We also have some significant fixes, like user experience improvements, docs and Helm installation fixes, and several CVE fixes. And each time I talk about the descheduler I find there are new contributors; this time as well, we got five new contributors. That's inspiring for the maintainers, right?

So that's all for today's session. If you have questions or inspiring ideas, you can bring them to our SIG Slack channel; you can also subscribe to our mailing list and join our meetings, SIG Scheduling, the subprojects, whatever. And thanks to all the contributors. Thank you. You can rate the session and leave feedback by scanning the QR code. And if you have any questions, please go ahead.

Thank you for the session. I have a question about scalability of the PreEnqueue plugin, the new extension point you showed. Are there any tests on how many pods it can hold?

For PreEnqueue, actually, the pressure is not on the scheduler; it's the opposite: PreEnqueue was introduced to release pressure on the scheduler. As I mentioned, PreEnqueue sits at the part of the lifecycle that only goes in one direction, so it imposes no extra pressure on the API server and no extra cycles on the regular scheduling cycle, because the PreEnqueue check can be very lightweight. For the default implementation in the scheduler, you can think of it as one line of code: check whether the length of the pod's schedulingGates equals zero or not. When it comes to external plugins, it depends on your implementation: if your plugin makes an API call to AWS or some other external system, that definitely adds latency to scheduling. But in terms of scalability it doesn't impose pressure by itself; it just depends on your implementation's latency. One more thing I want to share: internally, PreEnqueue is only invoked when the pod gets enqueued, which means the pod is only enqueued once, if you will. After that, everything is driven by signals sent to the scheduler in other ways; for example, the scheduling gates implementation reacts to the gates being removed. So if your implementation is reasonably graceful, I don't think handling a thousand pods is a pain.
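For reference, the PreEnqueue hook discussed here is a one-method interface, and the built-in scheduling gates check really is essentially that one line. A sketch (the plugin name is made up; the interface shape matches the framework package of recent releases):

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// gatesCheck mirrors what the built-in SchedulingGates plugin does in its
// PreEnqueue hook: a pod may enter the active scheduling queue only when
// its schedulingGates list is empty. No API calls, just a length check.
type gatesCheck struct{}

var _ framework.PreEnqueuePlugin = &gatesCheck{}

func (g *gatesCheck) Name() string { return "GatesCheckSketch" }

func (g *gatesCheck) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	if len(p.Spec.SchedulingGates) == 0 {
		return nil // admit the pod to the scheduling queue
	}
	return framework.NewStatus(framework.UnschedulableAndUnresolvable,
		"waiting for scheduling gates to be removed")
}
```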
You mentioned that there are some enhancements in the framework, right? For example, the callback interface, if I recall. Do existing plugins need to implement anything to be able to cope with these enhancements in the framework? So, the scheduler framework enhancements are all designed in a backwards-compatible way. That means if your existing plugin doesn't want to use the latest SDK interface, it can still work as is; the behavior stays as before. For example, for queueing hints, you can still stay with the older interface that registers coarse-grained events. But if you want to opt in, you then adapt to the latest SDK and choose to opt in. Yeah, this has always been the design rationale: we don't break the old ones. Sure. So that also means that if... I think we are running out of time, right? Yeah, I will be here to answer afterwards. Okay, sure, thank you. Thanks to everyone for coming. I think we are the very last slot of the sessions, so enjoy the rest of your KubeCon. Thank you.