Hello, everyone. Welcome to today's session. I hope all of you are having a great week in Detroit at KubeCon. Today we will talk a little bit about SIG Scheduling. It's a maintainer session, so I hope this session is more interactive, and we can leave enough time for you to ask questions. Basically, this is a session where we'll talk about the latest updates from the SIG. We were supposed to have four speakers today, but Chin Chan from Alibaba and Kante from DaoCloud cannot attend in person. So it's just me, Wei from Apple, and Kensei from Mercari. We will cover the topics.

All right, here's today's agenda. As usual, we will give a very short introduction to the scheduler itself, and then give some details on recent development updates over the last two or three releases. And then, as a SIG, we sponsor some subprojects related to scheduling, so we will give some updates on those subprojects as well. Lastly, we will have the Q&A.

All right, the first thing: what is the scheduler, anyway? The scheduler is a core component of Kubernetes. Put simply, it looks at the cluster's utilization, pod distribution, and other resource allocations, and assigns the best node to the pod, so the pod has the best node to run on.

If you look at the picture on the left (sorry, the clicker doesn't work), the pod at the top left is the input. It's usually created by the user, or created by a Deployment: the Deployment controller in the controller manager spins up the pod. The pod comes in and enters the scheduling queue. The scheduling queue has some internal mechanisms to ensure the pods get sorted properly and popped out fairly.

Once the pod pops out of the scheduling queue, it enters the scheduling cycle. The scheduling cycle is the core logic of how the scheduler schedules the pod. It leverages another internal component called the scheduler cache. The scheduler cache listens to all the objects it's interested in, like node objects, running pod information, as well as others like storage (PVs and PVCs), RuntimeClass, and a lot of other information. The scheduling cycle leverages that information so it has an up-to-date view of what the cluster looks like. Then, based on some predefined algorithms and policies, like preferring one node over another, these kinds of pre-configured policies, it will try to choose the best node for the pod to run on.

Next, it checks whether the pod is schedulable. If yes, the pod goes to the binding cycle, which, implementation-wise, runs as a separate goroutine. The binding cycle does nothing but bind the node to the pod. So the scheduling cycle ends with the pod carrying the nodeName the scheduler thinks is the best node for it. But it can also happen that the cluster is pretty occupied, so there's not a single node that can accommodate the pod. In that case, the pod gets put back into the scheduling queue, where it goes through a predefined back-off timer so that it can fairly be put back at the head of the scheduling queue and get retried.

So basically, this is the very high-level structure of how the scheduler works, and what the inputs and outputs of the scheduler are.
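To make that input and output concrete, here is a hedged sketch. The pod below is illustrative (the name, image, and node name are made up); the point is that the scheduler's whole job is to fill in spec.nodeName and report the result through the PodScheduled condition:

```yaml
# What the user (or a Deployment controller) creates: a pod with no node.
apiVersion: v1
kind: Pod
metadata:
  name: web-0                          # hypothetical
spec:
  schedulerName: default-scheduler     # the default; shown here for clarity
  containers:
  - name: app
    image: nginx:1.23
    resources:
      requests:
        cpu: "500m"
        memory: 256Mi
# After a successful scheduling and binding cycle, the scheduler has
# effectively set:
#   spec.nodeName: node-3                      (the chosen node)
#   status.conditions: PodScheduled=True
# If no node fits, the pod instead gets PodScheduled=False with reason
# Unschedulable and goes back into the queue with back-off.
```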
Zooming into the green rectangle, the scheduling cycle can be simplified into a two-phase scheduling process. One phase is called Filter; the other is called Score. For Filter: once we pop out the pod, the first phase looks at each node in the cluster to see whether the node fits the pod or not. For example, if the pod asks for a very large amount of CPU, some nodes may fit and some may not. In this case, we can see that node 2, node 3, and node 4 fit, but the other nodes don't. We call these three nodes the feasible candidates for the pod to run on. After that, based on some predefined policies, we decide which node we should place the pod on. In this case, we rank the three nodes and give each a final score based on the predefined scoring policies. Finally, we give node 3 the highest score, 90. So the output of the scheduling cycle is that we come up with node 3 for the pod to bind to.

This is a pretty simplified two-phase view, but if you want to know more details, we provide fine-grained extension points that not only the internal scheduler plugins use; your custom scheduler can use them too. Basically, we are all using the same thing: no hidden APIs. Everything the in-tree scheduler uses is what you can use. There's another talk on Wednesday, a session I'm giving, detailing all the design patterns, why we exposed so many extension points, and what each extension point is for. So if you're interested in the details of the scheduling framework and each extension point, you can check it out.

OK, the next section is about recent development in the last two or three releases. First one: simplified scheduler configuration. We provide the scheduler framework and a bunch of extension points, but unfortunately that is mostly developer-oriented: when you implement a plugin, you know exactly which extension points you want to implement. It can be troublesome for the end user or the SRE, though, because they just know "I want to enable the node affinity or pod affinity plugin"; why should they need to know which specific extension points to enable or disable?

Before, if you wanted to enable one particular plugin, you had to know exactly which extension points it implements. In this example, to enable the plugins Foo and Bar, you have to know that Foo actually implements four extension points (PreFilter, Filter, PreScore, and Score) and specify each of them explicitly, and the same goes for Bar. It should be something more user-friendly.

That's why we introduced the so-called multiPoint extension point. It's not a single-purpose extension point; it's a compound one. You just specify the plugin names under the multiPoint section; in other words, you just tell the scheduler in the config that you want to enable Foo and Bar, and that's it. Internally, under the hood, it checks which exact extension points each plugin implements (by checking which plugin interfaces the type satisfies) and enables them for you. So as a user or SRE, you don't need to worry about which extension points a plugin implements underneath.

It's available since Kubernetes 1.23 if you're using the v1beta3 config, and also in the newer v1 config. And we didn't abandon the old-style configuration: you can still use the old style, or use it along with the new style. It's all tested, so give it a try.
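To make that concrete, here is a hedged sketch of both styles in one KubeSchedulerConfiguration. Foo and Bar are hypothetical out-of-tree plugins, as in the talk's example:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    # Old style: you must enumerate every extension point each plugin
    # implements, for example:
    #   preFilter: { enabled: [{name: Foo}] }
    #   filter:    { enabled: [{name: Foo}] }
    #   preScore:  { enabled: [{name: Foo}, {name: Bar}] }
    #   score:     { enabled: [{name: Foo}, {name: Bar}] }
    # New style: multiPoint expands to whatever each plugin implements.
    multiPoint:
      enabled:
      - name: Foo
      - name: Bar
```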
The second thing I want to introduce is a work-in-progress feature called Pod Scheduling Readiness. If you look at the left picture: today, once the pod is created, the pod immediately enters the scheduling queue and waits for its turn to be popped out and worked on. But there can be cases, especially in custom solutions involving things like a quota check or other in-house requirements, where pod creation doesn't mean the pod is ready for scheduling. Today the pod gets scheduled anyway; although we have the back-off timer, scheduling a pod that is literally not ready is just wasting cycles. Especially if you're running a multi-tenant environment, scheduling pods that aren't ready postpones the scheduling of other pods, so it impacts the overall scheduling throughput as well as the pods that should have been scheduled earlier, and therefore the users' SLOs.

So one thing we're trying to improve: the path from the scheduling queue into the scheduling cycle is currently unconditional. The pink box on the slide is the knob we provide in between, so that before the pod is popped out and worked on, the user has the choice of declaring whether or not the pod is actually ready for scheduling.

The feature being worked on is internally implemented by adding a pod-level field called schedulingGates. The regular workflow would be: the user creates the pod, and the pod carries some predefined gates. Each gate is associated with an external controller, and that controller is responsible for removing its gate when it thinks the pod is ready for scheduling. You can define more than one scheduling gate, and internally there will be a default enqueue plugin that checks the scheduling gates; only when all the gates have been removed does the pod become eligible to be popped out. That is the design, but it's still in progress, so hopefully it will be available in the next release, Kubernetes 1.26.
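Here is a minimal sketch of what a gated pod could look like under that design. Since the feature was still in progress at the time of the talk, details may shift; the pod name and gate name below are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: quota-checked-pod            # hypothetical
spec:
  schedulingGates:
  - name: example.com/quota-check    # removed by a hypothetical external quota controller
  containers:
  - name: app
    image: nginx:1.23
# While any gate remains, the pod stays Pending and is never popped out of
# the scheduling queue. Once every gate has been removed by its controller,
# the pod becomes eligible for scheduling as usual.
```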
Next, a couple of enhancements are about pod topology spread. I will hand over to Kensei.

Yes, the first one is minDomains in pod topology spread. It has now graduated to beta in 1.25. It adds a new field, minDomains, to the topology spread constraint, which defines the minimum number of topology domains. The use case we have is when you want to spread pods over at least a minimum number of domains, and if there aren't enough domains already present, make the cluster autoscaler provision them. It can only be used with DoNotSchedule, which is the hard constraint in pod topology spread.

Let's see the example. Say we have a Deployment with two replicas, using a pod topology spread constraint with maxSkew: 1, minDomains: 2, topologyKey: kubernetes.io/hostname, and whenUnsatisfiable: DoNotSchedule. In this case, one replica lands on the node, but the other one is pending. That's actually because of the minDomains we set: the number of domains is currently one (there's only one node), and it should be two. So the cluster autoscaler notices the pending pod and creates another node, and the pending pod can be scheduled. There we go.

The next one: taking taints and tolerations into consideration when calculating skew. It's an alpha feature now, and we plan to graduate it to beta in the next release. It adds new fields, nodeTaintsPolicy and nodeAffinityPolicy, to the topology spread constraint, and those options specify whether taints and tolerations, or node affinity, are taken into consideration when calculating the skew.

Let's see the example. Say we have a Deployment with a pod topology spread constraint of maxSkew: 1, topologyKey: kubernetes.io/hostname, and whenUnsatisfiable: DoNotSchedule, which is the hard constraint, and we have two nodes, where node 2 has a taint. Of course, node 2 can't host any of the pods, because it has a taint. So the first replica goes to node 1. The problem here is the second pod: the topology spread constraint wants to schedule the next pod to node 2, because node 1 already has a pod, but there is a taint on node 2, so it will be pending. This is the problem we want to solve with this proposal, this new feature. We can solve it by setting nodeTaintsPolicy: Honor. With that, the scheduler takes taints and tolerations into consideration when calculating skew. There we go.

The next one is pod topology spread again: respecting pod topology spread after rolling upgrades. It's alpha since 1.25. It adds a new field called matchLabelKeys to the topology spread constraint. The problem behind it was that during a rolling upgrade, the scheduler takes both old replicas and new replicas into consideration when calculating the skew. So, say we have a Deployment with a pod topology spread constraint, and we do a rolling upgrade. The scheduling of the new replicas is affected by the existing old replicas, so after the rolling upgrade, the distribution may be imbalanced. So we introduced matchLabelKeys for the topology spread constraint. By setting pod-template-hash in matchLabelKeys, the scheduler takes only pods with the same pod-template-hash into consideration when calculating the skew. In other words, the scheduler takes only the new replicas into consideration, so we can solve the problem. A combined manifest showing all three of these new fields follows below.
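Here is a hedged sketch pulling the three new fields into a single topology spread constraint. The names and labels are illustrative, the pod-template-hash value is normally set by the Deployment controller, and keep in mind that nodeTaintsPolicy, nodeAffinityPolicy, and matchLabelKeys were still alpha at the time of this talk:

```yaml
apiVersion: v1
kind: Pod
# In a Deployment, this would live under spec.template.
metadata:
  name: spread-example               # hypothetical
  labels:
    app: web
    pod-template-hash: 6d9f8c7b5     # set automatically by the Deployment controller
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule # required for minDomains
    minDomains: 2                    # beta in 1.25: spread over at least 2 domains
    nodeTaintsPolicy: Honor          # alpha: respect taints/tolerations when computing skew
    nodeAffinityPolicy: Honor        # alpha: respect nodeAffinity/nodeSelector as well
    matchLabelKeys:                  # alpha in 1.25: only count pods from the same rollout
    - pod-template-hash
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: app
    image: nginx:1.23
```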
There are a lot of other noteworthy features and fixes. One is a performance improvement for DaemonSet pods. We updated the PreFilter interface to return a PreFilterResult. PreFilterResult is a Go struct that contains information about which nodes to evaluate in the downstream extension points. DaemonSet pods are pinned to their node using node affinity, so by returning a PreFilterResult from the NodeAffinity plugin's PreFilter, we can improve scheduling performance for DaemonSet pods.

The next one is a memory leak on preemption. That was due to a leaked context, and we refactored the scheduling code to prevent such a memory leak in the future.

Next, the flush interval. Currently, the scheduler moves unschedulable pods to the active queue at a certain interval. It was previously 60 seconds, but it changed to five minutes. The reason behind it is that we want to eventually remove this flushing in the future, and this is the first step.

For the component config: it's stable now, and the legacy scheduler Policy config was removed in 1.23, so if you still use it, you should migrate to the component config.

That's it for the upstream updates. From here, we are going to talk about subproject updates. The first one is the scheduler simulator. The scheduler currently has a lot of ways to be expanded, extended, or configured, like plugins, extenders, and the component config, of course. This simulator helps you check and evaluate your scheduler's behavior. It's a young project, so it has only simple features right now. Basically, the simulator runs its own kube-apiserver, and you can do anything with that kube-apiserver: create resources, apply resources, delete resources, whatever. When you create pods against the simulator's API server, detailed scheduling results get added to the pod's annotations, such as each plugin's filtering result and each plugin's scoring result. Usually, we cannot see that information from a real cluster's scheduler, so it should be helpful. And there are some interesting discussions going on. The simulator also has a web UI, so you can easily check the results there: filtering results, scoring results, and normalized scores, like this.

Next, Kueue. It's also a young project: a job queueing controller designed to manage batch jobs as a single unit. It has these core features. Quota management: it controls who can use what, and up to what limit. Next, fair sharing of resources between tenants: to maximize the usage of resources, any unused quota assigned to an inactive tenant should be shared fairly between active tenants. And flexible placement of jobs across different resource types based on availability: you can have various resources, such as different architectures of GPU or CPU, or different provisioning modes like spot or on-demand. Other than that, we have different queueing strategies. StrictFIFO is strictly first in, first out. BestEffortFIFO is basically first in, first out, but if the oldest workload cannot be admitted, it will not block younger, newer workloads that can be admitted. Kueue has these APIs, and the ongoing work is also interesting. This is just a brief overview of Kueue, so if you're interested in it, we have official documentation, of course, and we also recently published a blog post about Kueue, so you can refer to that as well.
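For a flavor of the API, here is a hedged sketch of a queue setup. Kueue's API has been evolving quickly; the field names below follow roughly the v1beta1 API and may differ in the version you run, and all the object names are hypothetical:

```yaml
# Cluster-scoped quota shared by tenants.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-queue                 # hypothetical
spec:
  namespaceSelector: {}            # admit workloads from any namespace
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor         # a ResourceFlavor object, defined separately
      resources:
      - name: "cpu"
        nominalQuota: 40
      - name: "memory"
        nominalQuota: 100Gi
---
# Namespaced entry point that tenants submit jobs to.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-local-queue           # hypothetical
  namespace: team-ns               # hypothetical
spec:
  clusterQueue: team-queue
```

A batch Job then opts in by pointing at the LocalQueue, for example via the `kueue.x-k8s.io/queue-name` label (older releases used an annotation).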
Some more subprojects. One is called the descheduler. The descheduler has been there for a couple of years. Basically, the motivation is that the scheduling decision was made based on the cluster's resource utilization at that moment, but as time goes by, the cluster state may change, so the decision that was optimal back then is not that optimal anymore. If you want to make the most of the cluster's utilization, you may want to rebalance by moving some pods to other nodes. The descheduler is this kind of project: you define a couple of strategies, and when some criterion isn't met, it kicks in to evict the pods, because they don't follow the strategies you defined. Basically, it runs as a control loop: you define a couple of strategies, it checks whether the pods fit the strategies or not, and if not, it kicks in and evicts them. That's the idea of the descheduler.

As time went by, even before all the planned features were developed in this project, a lot of demanding requirements came in, and some of them are, I think, not that general, but still real requirements. To fit such diverse requirements, the project recently brought up a proposal called the descheduler framework, so that not every feature needs to be developed in this project. Instead, you can develop in your own repo and then just glue things together to come up with a descheduler-plus-plus, I would say, that fits your in-house needs. If you're interested, there is a new community meeting dedicated to the descheduler; you are welcome to join. That's it for the descheduler.
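For a flavor of what those strategies look like, here is a hedged sketch of a descheduler policy in the v1alpha1 format; the threshold numbers are made up:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  # Evict pods from overutilized nodes so the scheduler can place them
  # onto underutilized ones on the next pass.
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below all of these count as underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:    # nodes above any of these count as overutilized
          cpu: 50
          memory: 50
          pods: 50
```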
Next is a new project called KWOK. When we talk about scheduling tests or scalability tests, we normally use both the control plane and the data plane. But for some specific areas, like testing just the functionality and scalability of the scheduler, you don't really need the data plane, because everything involved is just API objects. You don't need to spin up real containers backed by real nodes. In that case, you can save a lot of cost, and even run the simulation or evaluation tool on your laptop. That is why KWOK, the Kubernetes WithOut Kubelet project, comes in. Basically, as you can see in the picture, it removes the need for kubelets, and it spins up a controller called kwok. kwok watches the node and pod objects and does some magic to make the nodes and pods behave as if they were backed by a kubelet: it maintains the heartbeat from the node to the API server, and if a pod carries a nodeName, it transitions the pod from the Pending to the Running state. So basically it's transparent to you: it looks like you have 5,000 nodes on your laptop, but they're backed by KWOK.

The use case for KWOK is that you can do functional testing at very low cost for areas like scheduling, and also scalability testing for scheduling, and you can integrate it easily into your pipeline without spinning up a real runtime and kubelet to run containers. There can be other usages too. For example, if you want to test a work-in-progress cluster autoscaler implementation, normally you'd need to jump into AWS to buy the machines and try it out, but instead you can just test against KWOK-backed nodes. So that is the KWOK project.

The last one is scheduler-plugins. Basically, both the default scheduler and your custom scheduler can follow the same pattern of using the scheduler framework to fulfill your custom requirements. But not every plugin can be in-tree, so what should you do? This repo is the place, like a marketplace, for you to contribute ideas, exercise some innovative thoughts, and work on a specific area. Right now we have a lot of useful plugins: coscheduling, elastic quota, topology-aware scheduling, et cetera, and some really innovative, maybe cutting-edge ones like network-aware and load-aware scheduling. So if you're interested, or you happen to have similar requirements, you're welcome to check it out.

All right, I think that's pretty much it for today's session, and the rest of the session is for Q&A. Any questions? Yeah, go ahead.

Hi, thank you for the talk and for the work done on the scheduler. I have a question about the Kueue project. What was the motivation to do it, if there is already Volcano, which solves the same problem and is part of the CNCF?

It's right now a Kubernetes SIG project, so basically it belongs to Kubernetes. I'm not sure whether the goal is to go to the CNCF as a standalone project or to stay there as a Kubernetes subproject, but I don't think there's a big difference.

My question is less about the CNCF and more about what the motivation was to create Kueue in the first place, if this problem is already solved by other projects. Is there any particular functionality, or any particular problem, that Kueue will be addressing that is not addressed yet?

The Kueue project, although I'm not a maintainer, is trying to move some of the queueing domain outside the scheduler itself. It doesn't necessarily follow that everything needs to go into the scheduling core to be resolved. Instead, standing at the controller's perspective, some external requirements, like sharing between multiple tenants, can be handled outside. That is the motivation for Kueue. It may overlap with some CNCF projects like Volcano and some others, but they do want to walk in slightly different directions. Kueue, as you rightly said, will not cover the scheduling queue itself, while other projects like Volcano put everything together, and you have to deploy all the things they provide; it's basically coupled. With Kueue, you can hook it up with any other scheduler implementation, so it's loosely coupled, focused on the one particular domain it wants to resolve. I think that's it.

Thank you.

Hi, are you able to customize the scheduler when you're using a managed version of Kubernetes like EKS?

It depends, I guess. On the public Kubernetes offerings, you don't have control of the control plane. So in that case, one thing I usually suggest is that you run the custom logic as a secondary scheduler alongside the default scheduler that runs in the EKS control plane, so that the scheduling flow works through your custom scheduler. Basically, if your implementation is based on the scheduling framework, you have 100% compatibility with the in-tree plugins: pod topology spread and all the other things, you can use them. The one thing you may want to tweak is to point your workloads' schedulerName to the custom scheduler you define. So, a little bit of work, but I think that's the way, because you have zero ability to modify the EKS control plane.

Thank you. Any additional questions?

Sometimes you want to modify the pod spec after the scheduling decision has been made.

Modify which spec?

The pod spec.

The pod spec has a lot of fields.

Let's say the container image.

The container image? Basically, that's not quite a scheduling question. In the API's definition, there are some fields that can be mutated after the pod gets created, if you check the documentation. I think the image and some others can be mutated, but others are just defined as immutable. I'm not sure if that answers your question.

So can we mutate it in the scheduling framework itself?

That's a good question. The suggestion is not to. Basically, a scheduler is not supposed to mutate particular fields; it's just doing one simple job: finding a node for the pod. The one thing the scheduler does modify on the pod is the condition, the status condition: if the pod is not schedulable, it will append a condition saying so. That is all the scheduler does. Another scenario is preemption: it injects status.nominatedNodeName to suggest preempting some pods, so that later on the pod can try that nominated node. It's a hint. Other than that, there's just the nodeName, because that is the ending stage of scheduling a pod. So in your custom scheduler, it's not recommended to mutate the pod spec.

Any other questions? Thank you, gentlemen. We'll still be around if you want to have an offline chat. Thank you.