Hi everyone, welcome to today's session. Today I'm going to give you a deep dive on SIG Scheduling. My name is Wei Huang. I work for IBM, and I'm also a current co-chair of SIG Scheduling. Today's agenda contains three parts. First, I will give you a very brief introduction to the scheduler. Then I'll share some best-practice tips with you so that you can make the most of your scheduler. And in the third part, I will give you an update on the latest development progress. So, first of all, what does the scheduler exactly do? In short, the scheduler does just one thing: it puts the pod onto a node. That means it isn't involved in pod creation, and it isn't involved in spinning up the underlying containers. It just finds a node for the pending pod. In technical terms, the scheduler watches for pending pods, meaning pods that don't have spec.nodeName set. It then looks into the pod's explicit scheduling constraints, like CPU and memory requests, node affinity, etc., takes some predefined implicit scheduling preferences into consideration, looks at the cluster's resource usage, and makes a decision to place the pod on the optimal node. That is what the scheduler does. Inside the scheduler there are basically several components. The first one we call caching. To make the most optimal decision, the scheduler has to maintain a cache to keep an up-to-date view of the whole cluster: how many nodes there are, how many pods, how many PVs and PVCs, and so on. So the scheduler runs internal informers to watch the API objects so that it can maintain its internal cache properly. The second part is called queuing. When a pod comes in, we queue it in a reasonable order, so internally we have a priority queue.
And if no node fits a pod, we also have a backoff mechanism to ensure that lower-priority pods get a chance to be scheduled, and that the failed pod gets a fair chance to be popped again and retried. Next, we have the main scheduling goroutine and binding goroutine, which are the serial processes that carry out the core scheduling flow. The core scheduling flow looks like this. First of all, we pick a pod from the scheduling queue, and we go through a series of what we call extension points. For each extension point we have a series of plugins, and we go through those plugins either sequentially or in parallel, depending on the phase. The pod first goes through PreFilter and Filter to check its hard requirements: how many CPUs it requires, its taints and tolerations, its affinity, etc. If some nodes fit the pod, we then do prioritization on those nodes, which means giving each candidate node a score according to the predefined score plugins. Finally, the node with the highest score is chosen as the final node, and we go to the binding cycle to assign that node to the pod. This is the happy path. So what about the negative path? What if in the Filter phase not a single node can fit the pod? Then we go to the PostFilter phase. Right now, by default, we have the preemption plugin there, which basically tries to preempt lower-priority pods to make room for the incoming pod. If it succeeds, we can find a node to serve the pod, at the cost of preempting some victims. We then set status.nominatedNodeName on the pod and put it back into the queue, because we cannot immediately assign the node name to it — the victims are still being deleted, so we should leave a grace period for those pods to terminate. So that is the basic core scheduling flow, and that concludes the basic introduction to what the scheduler does.
So next, I call this part "make the most of your scheduler". The first recipe I call policy customization. The scheduler is a box with a lot of knobs — a lot of options. If the default options don't work for you, you can adjust them. One thing to keep in mind is that if you want to use the latest features and options, you have to use the --config flag followed by a scheduler config YAML. This YAML is the most standardized and most organized way to use these kinds of options. The latest version of the scheduler config is v1beta2. This is the example; you usually pass it to the kube-scheduler binary. The second thing to keep in mind is that by default there are a lot of scoring plugins, each favoring a particular kind of preference and applying its impact on the final decision, and by default each weight is usually one. So if you prefer one scoring policy over another — for example, in this case I picked ImageLocality, which means you prefer to schedule the pod to nodes that already have the image downloaded — I gave that plugin a very large weight, 50000 in this case, so that it impacts the final decision and the pod is preferred to land on a node that already has the image. This is called plugin-specific weight, because the weight lives in the plugin's configuration context. And there's another way to configure weights: in the workload itself. In node affinity, pod affinity, and pod anti-affinity there is a field called preferredDuringSchedulingIgnoredDuringExecution, which also carries a weight for your pod to express its preference toward existing pods — whether you want to co-locate with them or not. So this is also a preference with a weight associated with it. Basically there are two types: plugin-specific and workload-specific.
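To make the two weight types concrete, here is a rough sketch assuming the v1beta2 config API and the 50000 weight from the example above; the pod manifest is hypothetical and just illustrates the workload-side weight field:

```yaml
# Plugin-specific weight: a kube-scheduler config passed via --config
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: ImageLocality
            weight: 50000   # heavily favor nodes that already cache the pod's image
---
# Workload-specific weight: a hypothetical pod with a soft anti-affinity preference
apiVersion: v1
kind: Pod
metadata:
  name: demo
  labels:
    app: demo
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80        # 1-100; how strongly to avoid co-locating with app=demo pods
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: demo
            topologyKey: kubernetes.io/hostname
  containers:
    - name: main
      image: nginx
```

The first weight biases the scheduler globally for every pod it places; the second is declared per workload, which is why the talk distinguishes plugin-specific from workload-specific weights.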
The next thing is that we ship a default set of plugins, but that doesn't prevent you from replacing a default plugin with another one, or adding additional plugins. For example, by default we enable the least-allocated scoring behavior, which means we prefer to spread pods evenly across the cluster. But some use cases, like Cluster Autoscaler, are totally the opposite: CA prefers to make use of existing nodes as much as possible, because otherwise it has to spin up a new machine, which wastes money. So in the CA case you may want to use the most-allocated behavior and disable the least-allocated one. But wait, you may be wondering: I may not have full knowledge of which plugins exist and which should be enabled. What if I enable the most-allocated plugin but forget, or don't know, to disable the least-allocated one? In that case these two plugins would work in opposite directions, which is not the expected behavior. That is something we think about all the time, and for this kind of conflicting plugin we try to refactor and combine them. In this case we came up with a consolidated score plugin called NodeResourcesFit, and we enforce that only one type of behavior can be enabled at a time. You specify the plugin arguments — in this case scoringStrategy.type set to MostAllocated — and since only one type can be specified at a time, you don't need to worry about the mistake of enabling two conflicting plugins simultaneously. So this is an example of adjusting your scheduling behavior by modifying the defaults or modifying the plugin arguments. Here I've put a link for you to get familiar with all the plugins, their arguments, and their semantics.
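A minimal sketch of the consolidated approach just described, assuming the v1beta2 API; the resource weights shown are illustrative defaults, not from the talk:

```yaml
# Bin-packing via the consolidated NodeResourcesFit plugin: exactly one
# scoringStrategy.type may be set, so conflicting behaviors cannot coexist.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated   # pack pods onto already-used nodes (Cluster Autoscaler case)
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

Switching back to spreading is just a matter of changing `type` to `LeastAllocated`, which is the design point the talk is making: one knob instead of two plugins that can contradict each other.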
So basically that's the first recipe I want to share with you: customize your scheduling behavior by modifying the default scheduler config, like changing plugin weights, changing plugin arguments, etc. The second part I want to talk about is multi-profile versus multi-scheduler. With multiple schedulers, as you may know, in addition to the default scheduler you can run your own scheduler to cover the scheduling of particular pods. So what's the problem with running multiple schedulers? The biggest problem is resource conflict. Because the two scheduler binaries don't communicate with each other, they may compete for limited resources. Suppose a node has only one CPU left, and the default scheduler and your own scheduler — say, a "foo" scheduler — simultaneously assign their pods X and Y to that node, with both pods requesting one CPU. In this case the kubelet will be the eventual arbitrator: it will reject the later-arriving pod, say pod Y. So there will be a very severe and weird symptom: the later-arriving pod ends up on a node without a CPU for it. From the API's perspective, that pod has already finished the scheduling phase — it already has its node name set — which implies the pod will hang there until resources on that node get released. That's very unexpected behavior, and sometimes it even requires an operator to resolve it manually. So this is the biggest problem with multiple schedulers. Do we have solutions? Yes, we have some mitigations, but they're not perfect. For example, some users divide the cluster into several pieces and delegate one piece to one scheduler and the others to another. This works, but it causes fragmentation and thus low cluster utilization. Another solution is to set up an extra arbitrator on top of the multiple schedulers, so that each scheduling decision is reported to the arbitrator.
The arbitrator then works as the central place to accept or reject each decision. This works, but as you can see, it definitely costs extra work, and because you've introduced another component, you take on the extra effort of managing it. So what's the community direction on this? The community direction is to use a multi-profile scheduler. A multi-profile scheduler is still one scheduler, but inside it you can define different profiles, and you can think of each profile as a sub-scheduler, because it has its own set of scheduling policies — its own weights, its own plugins, its own plugin arguments, and so on. In this case I've provided a scheduler config as a multi-profile demo. Here you can see we have four scheduler profiles. The first is the default scheduler, kept as-is for regular workloads. The second is "image-first", favoring image locality — maybe some serving workloads care about cold-start time, so they want their pods scheduled to nodes with the image already downloaded. The third is "bin-pack" — Cluster Autoscaler users may benefit from this, as they want to save the cost of running machines. The last is "skip-score". The score phase does the ranking to express preferences for the pod, but in some cases — say the cluster is highly utilized and a lot of pods are queuing up — you may want to schedule pods, in most cases low-priority or offline workloads, as quickly as possible: "I don't care about finding the best node; any node satisfying the hard requirements works for me." So in this case you have four profiles, and your workloads can specify spec.schedulerName to map to a profile name. There are two ways to specify the scheduler name: one is to specify it manually, and the second is to do it dynamically.
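A sketch of what the four-profile demo could look like, assuming v1beta2; the profile names follow the talk, while the exact plugin settings under each are my illustrative reconstruction:

```yaml
# One scheduler binary, four profiles; pods opt in via spec.schedulerName.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler   # regular workloads, defaults untouched
  - schedulerName: image-first         # serving workloads that care about cold start
    plugins:
      score:
        enabled:
          - name: ImageLocality
            weight: 100
  - schedulerName: bin-pack            # pack nodes tightly, e.g. with Cluster Autoscaler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
  - schedulerName: skip-score          # skip ranking entirely for fast placement
    plugins:
      score:
        disabled:
          - name: "*"
```

A low-priority batch pod would then simply set `spec.schedulerName: skip-score` to take the fast path, while everything still runs inside a single scheduler with a single shared cache — which is exactly why the resource-conflict problem of multiple schedulers goes away.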
For the dynamic way, you can set up an admission controller to intercept the creation of pods, so you can programmatically assign a pod to a scheduler profile based on, for example, the pod's labels, or even on external metrics like how busy the scheduler currently is — how many pods are queued up in it. It's very dynamic; you can build a lot of innovative ideas on top of this. In terms of the scheduler config, we are also raising a KEP to simplify it. We will introduce a multi-point plugin field so that, as a user, you don't need to know whether a plugin implements Score, Filter, or PostFilter — you just provide the plugin by name, and internally the scheduler will detect which extension points it implements and wire it up properly. This KEP is still being developed, so it should be available in 1.23. So that's the second part, multi-profile versus multi-scheduler. Running multiple schedulers is fine, but you should know the limitations, and when resource conflicts occur, some manual operational work may be involved. The third recipe is improving your throughput. On a large cluster especially, the scheduler sometimes doesn't need to iterate over all the nodes to find one that fits the pod. There's an option called percentageOfNodesToScore. By default it's roughly 50, but not exactly 50: an adaptive algorithm calculates the default percentage. For example, suppose you have 500 nodes; the adaptive default percentage will be 46. That means in the Filter phase, once 230 nodes have been found feasible as candidates for the pod, the Filter phase will stop instead of going through all 500 nodes. Those 230 nodes are then passed down to the scoring phase, where each node gets a score.
So in this case, with 500 nodes, in the best case you only need to filter 230 nodes and score only those nodes. That's the default percentage. In your own case, you may want to tune this value according to your cluster size: for example, if you have 1000 nodes and specify the percentage as 10, only 100 nodes will pass the filter and be scored. That's the first parameter I want to highlight. The second one follows the same idea of iterating over fewer nodes instead of going through all of them: in the DefaultPreemption plugin there are parameters called minCandidateNodesPercentage and minCandidateNodesAbsolute. By setting these two parameters, you can define how many nodes are evaluated in the preemption phase instead of going through all the nodes. Another option is called PreferNominatedNode. It's a feature, not an option — just a feature gate, introduced in 1.21 and already enabled by default in 1.22. It's a yes-or-no choice, and we believe it's a very useful feature, so we don't want to expose it as a config option. It's a shortcut leveraging status.nominatedNodeName: when a pod already carries that field, we won't go through all the nodes; instead we just check once whether the previously nominated node still fits. If it fits, we assign that node to the pod rather than going through the regular scheduling flow. If not, don't worry — we fall back to the regular scheduling flow. That's why I say this feature will mostly increase scheduling throughput, especially preemption throughput. Another parameter is called parallelism. Across the scheduling flow there are two phases where we run the plugins in parallel, processing the nodes in batches, and this defaults to 16.
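Gathering the throughput knobs just mentioned into one place, a sketch assuming v1beta2 — all the values here are illustrative, not recommendations:

```yaml
# Throughput tuning: trade scheduling precision for speed on large clusters.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 10   # stop filtering once 10% of nodes are found feasible
parallelism: 32                # goroutines used in the parallelized phases (default 16)
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: DefaultPreemption
        args:
          minCandidateNodesPercentage: 10   # cap nodes evaluated as preemption candidates
          minCandidateNodesAbsolute: 100
```

Note that PreferNominatedNode does not appear here: as the talk says, it is a feature gate on the kube-scheduler binary rather than a field in this config.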
That 16 sounds like it should correspond to the number of CPU cores, but underneath it involves the operating system as well as the Go runtime scheduler, so it can't simply be hard-coded to the CPU core count; it needs some tuning to get the best performance. We used to have a PR that wanted to set this to the CPU core number, but it didn't work properly, so we reverted that PR. If you have a lot of CPU cores, the default of 16 may not make full use of them, so you may want to tune and adjust it. So that's the third recipe I wanted to share, on throughput optimization. And the final one is more a piece of knowledge I just want to highlight: in addition to the regular core resources like CPU and memory, you can model your custom resource in the format of an extended resource. I'll give a very easy example. As a user, you can model your resource as a key-value pair and patch it into the node's capacity. Because we manage node capacity in a very generalized way, once you patch the node, its status will have the capacity populated, and the scheduler then knows you have this — well, in this example, fake GPU — resource. Then you can submit a pod requesting this kind of GPU, and the scheduler will use the default NodeResourcesFit plugin to schedule your pod. I see a lot of users implement their own scheduling plugin for this, but it's not necessary: the default plugins already take extended resources into consideration and support them well. All right, those are the four recipes I wanted to share, and I hope they help you make the most of your scheduler, with more customized options and without any further coding. But maybe you'll say the scheduler still doesn't fit your needs. What should you do?
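Before moving on, here is a concrete sketch of that extended-resource recipe. The resource name `example.com/foo-gpu` is made up for illustration, and I'm assuming the node's status capacity has already been patched through the API to advertise it:

```yaml
# Once the node advertises the extended resource in status.capacity,
# a pod can request it like any built-in resource; for extended resources
# the request and limit must be equal, and the default NodeResourcesFit
# plugin accounts for it with no custom plugin needed.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  containers:
    - name: main
      image: nginx
      resources:
        requests:
          example.com/foo-gpu: "1"
        limits:
          example.com/foo-gpu: "1"
```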
Are there any viable options to fulfill more diverse scheduling requirements, like batch workloads? The answer is yes. We started thinking about supporting more diverse workloads two or three years ago. That's when we initiated the scheduler framework design, which focuses on exposing the internal mechanics to external developers so that you can use the core scheduling capabilities easily and build your own scheduler plugins. With your customized plugins plugged into the default scheduler, you make it a "scheduler plus plus" — still one scheduler binary, with extra functionality from your developers. The other thing we've been thinking about is that in some cases you have to involve other components to facilitate scheduling. For example, with batch workloads, sometimes you know exactly when and how to control the creation of the pods. If you can control pod creation more smartly, instead of just dumping the pods out immediately when a workload is created, you may save a lot of scheduling effort in fulfilling the workload's scheduling constraints. So recently we've been thinking along those lines: you can define your own CRD, with some extra controllers and admission webhooks, to better control the pods' creation time. We've also done some upstream work, like the suspended Job, and in the next release we're doing features like mutable scheduling directives for suspended Jobs. So on one hand we keep perfecting the internal scheduling mechanics, and on the other hand other components can also help; combined, these two make the scheduling experience smoother and able to adapt to more diverse workloads. All right, that's pretty much what I wanted to share about making the most of your scheduler. The last part is the update on the development progress. The first thing is the scheduler component config.
In 1.22 we upgraded the version to v1beta2. We removed some options that either aren't used very often or don't quite make sense. In the next release we'll also come up with the simplified scheduler config, so that you have an easier way to configure these kinds of plugins, and we will try to make it fully backwards compatible. There are also the features I've already mentioned, like PreferNominatedNode and mutable scheduling directives. And in 1.22 we added a namespace selector for pod affinity: before, pod affinity only let you specify a list of namespaces, but now we support a namespace selector. The last thing we did is a series of scheduling queue enhancements to improve efficiency. Before, we re-queued pods in a very brute-force way: whenever any event came in, we moved the pods back to the head of the scheduling queue. Now we define a series of rules so that a pod is only moved to the head when its matched events come in. For example, if a pod failed due to a PVC or PV restriction and an irrelevant event comes in, the pod will not be moved to the head. That saves a lot of scheduling cycles and also mitigates head-of-line blocking problems in the scheduling queue. Also, as more and more users are using the scheduler framework to build their custom plugins, we did some work to support dynamic event handlers, so that you can define your CRDs and have them detected by the scheduler framework as well — internally we use a dynamic informer mechanism to do this. And we provide an option for plugin developers to proactively move pods to the active queue.
That can sometimes be useful for, say, the coscheduling plugin: if you have a Spark job, when the driver pod comes in, you may want to move the associated executor pods back to the active queue at that moment, so the executor pods can be scheduled right afterwards. Okay, next is about some sub-projects. The first one is scheduler-plugins. scheduler-plugins is a repo we initiated because we want to engage different vendors to contribute their plugin implementations. Here are two KubeCon talks you can check out; both use the scheduler framework to build custom scheduler plugins to fulfill their business needs. The second sub-project is the descheduler. The descheduler provides rebalancing policies to redistribute pods according to defined rules, because scheduling decisions happen at scheduling time, and as the cluster evolves over time, some constraints may no longer be satisfied. The descheduler regularly checks these policies, and if a policy is violated it proactively deletes the pods so they get another chance to be rescheduled. The third one is kube-scheduler-simulator. It's a new GSoC project which provides a web UI as well as simulation SDKs for you to easily build your own toolkit: set up a cluster programmatically, replay a series of requests, and see what the scheduling result is. Right now it's still in an early phase, but we want to spend more effort on it to give it a very solid SDK so that integrators can benefit from it. All right, you can contact us in the Slack channels and on the mailing list, and you're also welcome to join our biweekly meeting — you can raise your agenda items there, and we will go over your proposals, etc. All right, thank you for joining today's session.