Hi everyone, welcome to today's session. Hope you're having a great week in Detroit. Today we're going to share some hands-on experience on running kube-scheduler, including configuration, extension, and operation. We hope that by the end of this talk you will get some practical ideas and actionable practices to run your scheduler more efficiently in your production environment. First, let's introduce ourselves. Hi, I'm Yuan Chen from Apple Cloud Services. Glad to meet you — it's cool to see so many people show up, and hopefully there are many more people online. I'm Wei Huang, also from Apple, and I'm also a co-chair of SIG Scheduling. I'm Yibo; I work as a software engineer in Apple Cloud Services. This is actually my first KubeCon, so I'm pretty excited. Hello everyone, I'm Chen Wang from IBM Research, and I actively contribute to scheduler-plugins and autoscaling. On the other side, because I'm in research, I also do a lot of research work in resource management for Kubernetes, and I try to enhance Kubernetes for all types of machine learning workloads. I'm very much looking forward to talking to you in person. Cool. So this is today's agenda.
Firstly, we will give a very high-level introduction to kube-scheduler, and then we will deep-dive into each part — configuration, operation, and extension — and at the end we hope we can have five to ten minutes for questions. All right, the first part: what is scheduling, anyway? When we talk about scheduling, we usually talk about the typical pod lifecycle, because the scheduler just plays a certain part in that whole lifecycle. It starts with pod creation, usually either by a user directly creating a pod, or by creating a Deployment, where the controller manager is responsible for spinning up the pods. After that, the job of the controller manager is done, and it's the scheduler's turn to use its knowledge of the whole cluster to find the best node for the pod. For now, just treat the scheduler as a black box: the input is a pending pod, and the output is the pod associated with a node. After that, it's the kubelet's responsibility — the corresponding kubelet gets notified: OK, there's a pod coming to my node, so I'm responsible for bringing it up — and it spins up the corresponding containers. The pod then gets into a Running state. After that, optionally, if it's a run-to-completion pod, it will run its job, finish, and the pod gets into a Completed state. But it can also be that the pod is a long-running service, and the pod just stays running forever. So that is basically the whole pod lifecycle, and the scheduler is focused just on the part between pod creation and pod running. So let's zoom into the red rectangle box to look a little bit into the internals of the scheduler. The first thing to look at in the scheduler is: what are the input and output?
One type of input is definitely the pending pods that the scheduler should be diligently working on, to find the best node for each of them. The other thing needed to make the pod placement decision is awareness of the up-to-date cluster status, including not only the running pods but also node information, storage information — PVs, PVCs, et cetera. Internally, the scheduler caches this information so that it can make the right decisions for placing the pods. That's the first part. The second part is what we call the internal queues. As pods come in, we need proper mechanisms to sort them, and also some backoff mechanisms, so that it's fair to schedule both the high-priority pods and the low-priority pods. That's the second part, the queues. The third part is called the core scheduling. A typical workflow is that a pod is popped from the internal queue, then the core scheduling works on it, going through a series of actions, and the output comes out one of two ways. One is: OK, we can find a node to host the pod, so we go to the binding cycle — the binding cycle is nothing but associating a node name with the pod. The other path is: sorry, I cannot find a node to host the pod — then the pod goes back to the internal queue, goes through some predefined backoff timers, and gets another chance to be retried later. So this is a pretty high-level view of the internals of the scheduler. Then, if you zoom into the core scheduling a little bit, we came up with an extensible framework that defines a series of extension points, and at each phase you can associate the corresponding logic to handle a particular scheduling constraint. I won't go too much into the details here, because later
I will go over each extension point in detail. For now, I just want you to understand that there are three typical phases in a regular scheduling cycle. One is called Filter. The filter's output is a yes-or-no answer: can we find a node — or multiple nodes — to host the pod, or not? It's a binary result. After the filter, if we can find at least one node, we go to the Score phase. The score phase ranks the feasible nodes by some predefined algorithms and policies, and comes up with a final node where we suggest the pod should run. So that is the happy path. But there can also be a negative path, right? We cannot find any nodes, and in that case it goes to PostFilter — in the red rectangle, the post-filter phase. A typical implementation of post-filter is preemption. Preemption's effort is: OK, I may want to sacrifice some low-priority pods to make room for the high-priority pod to run. So those are the three most critical phases I want you to know for now; later I will go a little deeper into them. Another thing I want to mention is that scheduling is not just a fixed block — it can be very dynamic, in two ways. First, you can craft different scheduling flavors with a concept called profiles.
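As a rough mental model of the filter-then-score flow just described — this is not the real kube-scheduler code, and every name here is made up purely for illustration — the cycle can be sketched like this:

```python
# Hypothetical sketch of the Filter -> Score flow described above.
# Names and data shapes are illustrative, not the real kube-scheduler API.

def schedule_one(pod, nodes, filters, scorers):
    # Filter phase: a binary yes/no answer per node.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        # Negative path: would hand off to PostFilter (e.g. preemption).
        return None
    # Score phase: rank feasible nodes; the highest total score wins.
    return max(feasible, key=lambda n: sum(s(pod, n) for s in scorers))

# Toy example: filter out nodes without enough free CPU,
# and prefer the emptiest node (a spread-style score).
nodes = [{"name": "n1", "free_cpu": 1.0}, {"name": "n2", "free_cpu": 4.0}]
pod = {"cpu": 2.0}
fits = lambda pod, node: node["free_cpu"] >= pod["cpu"]
spread = lambda pod, node: node["free_cpu"]  # more free -> higher score
chosen = schedule_one(pod, nodes, [fits], [spread])  # picks "n2"
```

The point of the sketch is only the shape of the cycle: filtering is binary, scoring ranks whatever survives, and an empty feasible set is what triggers the post-filter path.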
Each profile is associated with a particular scheduling pattern or flavor — for example, you want the pods to be more packed, or more spread — and each profile consists of a number of plugins. A plugin is the minimal unit that resolves a particular scheduling-domain problem, and you can build a profile using plugins like Lego blocks. Yibo will go pretty deep into profiles and plugins in the next section about scheduler configuration. Over to Yibo.

Thanks, Wei. So I will be going over the scheduler configuration itself. You can think of the Kubernetes scheduler configuration as being in a declarative format, similar to a pod spec, for example. You can see here that we have a sample kube-scheduler configuration. It mostly consists of global configurations, or global parameters — things like the kubeconfig to be used to connect to the API server, the leader election configuration, and things like the pod initial backoff seconds and max backoff seconds, which I'll go over in a bit. We also have a list of profiles that you can configure, which, as we mentioned earlier, basically define the exact behavior of this particular scheduler instance. Also, at the profile level, you can have per-plugin configuration, where you can control, for example, which set of plugins to enable and disable, as well as the parameters that you pass into the specific plugin itself. One thing to note here is that the API versions currently supported are v1beta2, v1beta3, and v1, and v1beta2 is in the process of being deprecated. At the global parameter level, we've got a couple of things you can tune and then see how the scheduler performs. percentageOfNodesToScore is basically the percentage of all the nodes used for the initial search for feasible nodes. By
default, it uses an adaptive percentage between 5 and 50 percent. This is to make sure the scheduler performs well enough, especially for large clusters, where you may not necessarily need to consider all possible nodes all the time. But if you do want to, you can tune this percentage yourself. One of the things that my colleague Yuan has recently submitted a pull request for upstream is being able to configure this at the per-profile level, and not just as a global parameter. The next section here is leader election. You can use this to have your scheduler run in leader-election mode; this is to enable high availability, and the locks supported are Leases, Endpoints, as well as ConfigMaps. As for the initial backoff seconds and max backoff seconds: these are for the unschedulable-pod case, where the pod may go through a sequence of exponential backoff. This is to avoid things like head-of-line blocking, where you don't want to constantly try to reschedule these unschedulable pods over and over. At a high level, this is a very simplified view of the scheduler's internal queues. There are three main queues that make up the scheduler. There is the active queue, where all the pods placed in this queue are ready to be scheduled. You can think of it as: at every scheduling interval, a call to scheduleOne pops a pod off the active queue and tries to find a feasible node for it. If it's unable to find a feasible node, it places that pod into the unschedulable queue, and there is a set of events and triggers that will flush the unschedulable queue into either the backoff queue or the active queue, to be considered for rescheduling.
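The global parameters covered above — client connection, leader election, the backoff seconds, and percentageOfNodesToScore — fit together in a configuration like this (a minimal sketch; the kubeconfig path is just an example):

```yaml
# Minimal KubeSchedulerConfiguration covering the global parameters
# discussed above; values shown are illustrative defaults.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf   # example path
leaderElection:
  leaderElect: true
  resourceLock: leases
percentageOfNodesToScore: 0      # 0 = the adaptive 5-50% default
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
  - schedulerName: default-scheduler
```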
In the case of the backoff queue, this is essentially where the pod goes through the exponential backoff, starting from the initial backoff seconds all the way up to the maximum backoff seconds. For example, you may want to set your max backoff seconds quite high for a very large cluster: for a pod that may be unschedulable for a long time, you may want it to go through a longer backoff so that it doesn't cause head-of-line blocking. Now, moving on to the specific profile configuration. The profile configuration allows for granular control of the extension points. Those extension points are things like queueSort, preFilter, and so on — I'm not going to go over all of these; there's going to be a specific section covering every single extension point. But you can see from the example here that I've got a "my awesome sort" custom plugin that I want to enable for the queueSort extension point, with some parameters — like a percentage of nodes reserved, or something like a learning strategy, if this plugin were doing some kind of learned placement algorithm. You can control these parameters for that specific plugin in the pluginConfig section. You can also see that I've not only enabled my custom plugin A, I've also enabled my custom plugin B with a different weight.
The weight here really allows for favoring a plugin's score, during what's called normalization. How that works is: at the scoring layer, the scheduler goes through every single plugin, finds a per-plugin score for every single node, and then creates a normalized final score that factors in the weight — it takes each plugin's score, multiplies it by that plugin's weight, and sums across all the enabled plugins to find the final score for that particular node. Basically, the node that ends up with the highest score will be chosen as the node for placement of the pod. Starting from v1beta3 of the scheduler configuration, there is added support for multiPoint inside the plugins configuration, which simplifies enablement and disablement across several extension points. Prior to this, if I have a plugin that extends a series of different extension points, I would have to go through and turn it on for every single extension point, which can be cumbersome — for example, if I just want to enable one plugin across a bunch of extension points. So let's run through an example. I've got my default queueSort, which is the default one that ships with the scheduler, but I want to disable it and use my custom queueSort extension. I've also got two default plugins that ship with the scheduler, but I want to, for example, disable plugin one for the filter stage, and enable plugin two for only the scoring stage. I also have two custom plugins, with both of them extending all the filter and scoring stages. If I were to do this without the multiPoint approach, I would have to go in and say: preFilter, enable plugin one and plugin two; filter, enable plugin one and plugin two — which is pretty tedious, and there are a lot of lines of YAML that you have to
write. But in this particular case, I can simply say: multiPoint, enable my custom plugin one with a scoring weight of three, and that will enable it for all the extension points it implements — preFilter, filter, preScore, score. We've briefly touched upon multi-profile, but I want to dive a little deeper here. Basically, a single instance of kube-scheduler can run multiple profiles, and under each profile you define a name for that particular scheduler, as well as a set of plugin-specific configurations — things like multiPoint for enabling one or more plugins, or specific plugin arguments. Once you have done that, for your pod to target a specific profile, you set that in the spec.schedulerName of the pod spec. If you don't do any customization of the kube-scheduler, out of the box you get a single scheduler profile with the name default-scheduler, and pods will by default have spec.schedulerName set to default-scheduler. But in this particular example, I've got two profiles — one default-scheduler, and one second-scheduler with some customizations — and if I want to target my pod at the second scheduler, I set that in spec.schedulerName. One thing to note here is that all the profiles must use the same plugin for queueSort. This is because the scheduler itself has only one pending-pod queue, so you must ensure, when specifying multiple profiles, that the queueSort layer — whether you enable a custom sorting algorithm or not — is the same across all the profiles. The kube-scheduler itself is already plugin-based, so there's a list of default plugins that ships with the kube-scheduler.
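Putting those pieces together, a two-profile configuration using multiPoint, plus a pod that targets the second profile, might look roughly like this (the plugin name and image are placeholders):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler      # untouched defaults
  - schedulerName: second-scheduler
    plugins:
      multiPoint:
        enabled:
          - name: MyCustomPluginOne       # placeholder plugin name
            weight: 3                     # weight applies at the score stage
---
# Pod targeting the second profile via spec.schedulerName
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  schedulerName: second-scheduler
  containers:
    - name: app
      image: nginx                        # example image
```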
I don't have the entire list here, but there are some notable ones that are important to tune and to make note of. One of them is the NodeResourcesFit plugin, with a default weight of one and the default scoring strategy being LeastAllocated, with a CPU weight and a memory weight of one each. I will be going over NodeResourcesFit in depth in the next slide. There is also the InterPodAffinity plugin, as well as the NodeResourcesBalancedAllocation plugin, which is used to prefer nodes that would end up with a balanced CPU and memory allocation after placement. So let's talk a little bit about bin packing itself. The default NodeResourcesFit plugin uses the LeastAllocated strategy. What this does is: if you have N nodes and you're trying to place pods onto them, it will always pick the node with the least resources being used. So it's attempting to be more of a spread strategy — you can think of it as horizontally placing pods until everything fills up. But one thing you can do, let's say you prefer to bin-pack more aggressively — I want to place pods in a way such that I fill up one node first before I move on to the next node — is to use the MostAllocated strategy in your scheduler configuration. You just set the scoring strategy type to MostAllocated under your NodeResourcesFit config, and what it will do is basically start packing pods in a way that fills up the first node, and then moves on to the next one, et cetera.
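In configuration terms, switching NodeResourcesFit from the default spread behavior to bin packing is a pluginConfig change along these lines (a sketch of the relevant fragment):

```yaml
# Switch NodeResourcesFit from the default LeastAllocated (spread)
# to MostAllocated (bin packing) via pluginConfig.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```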
And here's the algorithm, which is basically: for each resource, take the resources requested divided by the resource capacity, as a percentage, multiplied by the weight associated with that particular resource, and sum it up. So you can tune this — for example, a higher weight for CPU, or a lower weight for memory, or maybe a different weight for a custom resource like a GPU — according to how your cluster and your node resources look. Another strategy that comes with NodeResourcesFit is what's called RequestedToCapacityRatio. This allows really fine control of the scoring shape, by giving an exact mapping from node utilization to the actual score you want. In this particular example, I've got my scoring strategy set to RequestedToCapacityRatio, and my shape says: if my node's utilization is 0%, then the score is 0, and if my node's utilization is 100%, then I want a score of 10, the max here. What this will do is essentially draw a linear line from 0 to 10, and if your node's utilization lands between the points, it gets a score according to what the line says. This can be really useful — I mean, this is a very simple example, but if you want a more complicated shape, you can really fine-control your utilization-to-score mapping. In this other case, you can see I've got a more parabolic kind of shape, where as the utilization gets really high, my score changes much less significantly than at lower utilization percentages. So I'm going to do a demo of NodeResourcesFit here. I'm inside a VM, and I'm going to be using a kind cluster to show this.
I've got four nodes here, with one control-plane node and three worker nodes. What I'm going to do is show how I would configure this by actually deploying a second scheduler into the cluster. For this second scheduler, I've specified my configuration with two profiles. The first profile is going to be a spread-scheduler, and what I'm going to do is turn off all the other scoring plugins except for NodeResourcesFit — this is just to really amplify what the RequestedToCapacityRatio does in this example. In the plugin config itself, I say: I'm going to weight CPU and memory equally, and my requestedToCapacityRatio shape says that if I'm at 0% utilization, give it the max score, and if my utilization is 100%, give it the min score. So the spread-scheduler really just behaves like the LeastAllocated scoring strategy. Then I've got a second profile called the pack-scheduler, and this pack-scheduler does the opposite: if my utilization is 0%, give a score of 0, and if my utilization is 100%, give a score of 10. The basic thing I want is to pack, such that the most-used nodes will be chosen for my placement. So what I'm going to do here is deploy this scheduler. I've got my second scheduler deployed now. Now I'm going to test this out: I'm going to create a Deployment of six replicas, and I'm going to set my schedulerName to target the spread-scheduler, since I want to spread my pods out across the three worker nodes. I'm going to apply the spread case, and you can see that all six of my replicas are running, and they're placed evenly across the three worker nodes.
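The shape-based scoring driving this demo boils down to linear interpolation between the configured points. A small sketch of that idea — mirroring the concept, not the real kube-scheduler implementation:

```python
# Sketch of how a RequestedToCapacityRatio "shape" maps node utilization
# to a score by linear interpolation between the configured points.

def shape_score(utilization, shape):
    """utilization: 0-100; shape: sorted list of (utilization, score) points."""
    if utilization <= shape[0][0]:
        return shape[0][1]
    if utilization >= shape[-1][0]:
        return shape[-1][1]
    # Walk consecutive point pairs and interpolate on the matching segment.
    for (u1, s1), (u2, s2) in zip(shape, shape[1:]):
        if u1 <= utilization <= u2:
            return s1 + (s2 - s1) * (utilization - u1) / (u2 - u1)

# The demo's two profiles, expressed as shapes (max score = 10):
spread = [(0, 10), (100, 0)]   # emptier node scores higher -> spread
pack   = [(0, 0), (100, 10)]   # fuller node scores higher -> bin pack
```

With these two shapes, a node at 50% utilization scores 5 under either profile, but an empty node scores 10 under `spread` and 0 under `pack` — which is exactly why the two profiles place the six replicas so differently.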
I've got two on each node. Now I'm going to delete this Deployment, and then I'm going to deploy a second example targeting the pack-scheduler — exactly the same configuration, the only difference being that I'm targeting a different profile. I'm going to apply this, and all of them are running, and you can see that basically all six pods got placed on that one node. So we are essentially bin-packing as much as possible, filling this node up before we move on. That's all I wanted to show in the demo; let me jump back to the keynote. I think next, Yuan is going to be talking about scheduler operation. Do we want to take questions now, or at the end? — Yeah, we've got a couple of questions; maybe we can take them right now. Sure. So the question was: did we have any specific use cases where we wanted to actually configure these profiles rather than just use the default one? We have some use cases with particularly large clusters, where some customers may want to pack really tightly because they run a very large-scale deployment, and in those cases we do need to help them fine-tune, because the default spread strategy may not be the most efficient at placing pods. — The short answer is: it's up to the operators how to make their control plane extensible and adjustable for the end user. But from a practical view, you can deploy your custom plugins outside, when you don't have control over the control plane, by specifying a secondary scheduler. Ideally it doesn't conflict with the default scheduler if you don't touch the default at all — used this way, you have 100% compatibility with the first scheduler, and you just have pure add-ons to fit your customers.
I think we should continue and finish; by the end, in the Q&A session, we will have a little time. — OK, so we can take more questions after the talk. Yeah, just come to us. Thanks to you both for the great presentation and the demo. I think Yibo has covered everything, and the demo worked, so maybe I could just skip my next session — but anyway, my next section is going to focus more on how to operate the scheduler, and in particular share some experience and knowledge with you: how to, for example, build and deploy a scheduler. When you run a scheduler, the events and the log information are very important — if you maintain a scheduler, you will know, most of the problems come as users asking: why are my pods not scheduled? Why is it so slow to schedule? So: how to troubleshoot some of the typical problems. Then I will also show you some key metrics and some example dashboards, and hopefully you find it helpful. I have created some examples and uploaded them to a GitHub repo. If you go to our presentation — we updated the presentation PDF — you can click the link on these slides if you want to play with it. As long as you have Go 1.19, you can download and clone the latest Kubernetes version; with a local Kubernetes environment — either minikube or kind, both should work — and the bunch of YAML files there, you can play with it if you're interested. So one thing I want to mention: Yibo and Wei explained in detail how the scheduler makes its decisions — the different queues, the different plugins, the different placement strategies. But to put it simply, what does the scheduler actually do?
It's basically just: get an unscheduled pod, choose a node, and assign that node's name to the pod — that's all, right? You can use a very advanced algorithm, or, the simplest way, you could use random: just randomly choose one node, and if it fits, place the pod there. Also, if you want to run a scheduler: from the API server's perspective, the scheduler is nothing special — it's just another client, another controller. As long as you can connect to the Kubernetes API server, you are fine: you can get the pod information and you can update the pod. It's as simple as that. Also, you can run as many schedulers as you want, as long as each scheduler has a different, unique name — Yibo already covered that. OK, so going back to the scheduler: here I just want to show you quickly, so everyone can play with it in your local environment and customize it. This is the Kubernetes repo — you simply run `make kube-scheduler`, and by default it generates this binary file here, called kube-scheduler; that's the default name. Then you can run it. The other interesting thing is how you want to run the scheduler.
It's also just simple: you run this command, and basically the only parameter that really matters is that you specify the scheduler's configuration file — Yibo already covered it. In the simplest configuration file, the only parameter that really matters is the kubeconfig — I hope you know it — which specifies your certificate and key, and how you connect to the API server. If you have that, that's all; you can just run the scheduler. Option two: of course, most production systems would containerize the scheduler and run it in pods and containers, but it's nothing different — you create the image, and in the command line you just start it. So if we go back to my demo environment here, you will see I also have the script there. To run this scheduler, I just build it and specify the configuration file. As for the configuration file — Yibo already covered this — I'll give you a simple example: I created basically two profiles. One I call the default one — no plugins customized, I did not customize anything. The other one I call best-fit. Yibo mentioned the scoring curve; here I use one that's already upstream, but it's not the default. The default is called LeastAllocated — you can think of it as worst-fit, trying to find the most idle, least-allocated node. The one I use, MostAllocated, is a best-fit: it tries to find the most allocated node, which is like bin packing — yeah, something like this.
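For "option two" — running the scheduler containerized — a pod manifest might look roughly like this. Everything here is a placeholder sketch (image name, paths, service account), not a recommended production manifest:

```yaml
# Sketch of running a second scheduler as a pod; image and paths
# are placeholders, and the service account needs RBAC that allows
# getting and updating pods (and binding them).
apiVersion: v1
kind: Pod
metadata:
  name: my-second-scheduler
  namespace: kube-system
spec:
  serviceAccountName: my-scheduler-sa          # assumed SA with RBAC
  containers:
    - name: kube-scheduler
      image: registry.example.com/kube-scheduler:latest   # placeholder
      command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/my-scheduler-config.yaml
      volumeMounts:
        - name: config
          mountPath: /etc/kubernetes
  volumes:
    - name: config
      configMap:
        name: my-scheduler-config              # holds the config file
```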
The other parameters Yibo covered; I just specified them there. Then you just run it, and if you see all this information, it's working — it's as simple as that. Then you can debug, play with it, and test it. OK, going back to my talk here — I've already covered this. As I mentioned: first, of course, if the scheduler didn't start successfully, most likely you should make sure your paths are correct and you have the right kubeconfig. You should also configure RBAC and make sure your scheduler can get the information it needs and also update pods. But one tip I want to show you here is a relatively new feature — I don't know exactly when it became available. If you want to debug your configuration, you can run the command with a flag called `--write-config-to`. It will generate the configuration file for what you are running. For example: normally, of course, I really run the scheduler, but I have another script I call write-config, and the only difference is that, instead of starting the actual scheduler, it just generates the profile configuration file. If you use this one — I have it all scripted there — it basically generates the scheduler configuration file for you. If you check it, it's a large file: it's fully populated with all the default configuration information. Then you can see whether anything in your configuration is right or wrong against the default values — you can see, for example, whether leader election is true or false. So this will be very useful for debugging if your configuration or your scheduler didn't start.
OK, then switching back. Once it's started, as I mentioned, the most important thing, I think, is that you should understand and check the scheduler log. Typically there are different ways to configure it: in early versions you could specify the log file with a command-line flag; now I think there are different logging utilities you can configure. In the scheduler log — we already gave a brief introduction to the lifecycle — the most important information is whether a pod is scheduled successfully or not, and you need to check these key events in the log file. When a pod is submitted, you will see an "add event for unscheduled pod" — basically, it's added to the scheduling queue. Then it has to wait in the queue until the scheduler picks it up. Once it's about to be scheduled, you will see this keyword, "attempting to schedule pod" — this is when the scheduler runs the plugins to try to find a node that fits. If it's successfully scheduled, you will see the message "successfully bound pod to node", with the node name, and then you will see the "delete event for unscheduled pod" and the "add event for scheduled pod". On the other hand, if a pod is not scheduled, then in addition to the queue and other event information, you will see, for each node, the reasons it didn't fit — the reasons that caused the scheduler to fail to schedule the pod, for example how many nodes didn't match. That's how you can debug
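Pieced together from the messages just mentioned, a successful scheduling attempt leaves a trail roughly like this in the scheduler log (the exact wording and formatting vary by Kubernetes version):

```
"Add event for unscheduled pod"      pod="default/my-pod"
"Attempting to schedule pod"         pod="default/my-pod"
"Successfully bound pod to node"     pod="default/my-pod" node="node-1"
"Delete event for unscheduled pod"   pod="default/my-pod"
"Add event for scheduled pod"        pod="default/my-pod"
```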
why a pod was scheduled successfully or not. Typically, there are tons of reasons why a pod is not scheduled successfully, but I would like to summarize three high-level categories that we have seen in production and in practice. One is misconfiguration, particularly related to storage — a PersistentVolume or PersistentVolumeClaim not found, things like that. Different customers or cloud providers have different storage solutions, so that's something you should definitely check and be careful about. The second one, of course, is when the scheduler is not able to find a feasible node to fit the pod — that's also very typical, and you will see it depending on the availability of allocatable resources and the physical capacity. The last one is that the pod can specify a number of additional constraints — affinity, node affinity, node selector. If it's a soft constraint, a preference, it's OK; but if it's a hard constraint, then you may see that even though you have nodes available, with resources available in general, the pod still cannot be scheduled. Now, if we look at the log here, I will quickly show some examples as well. Again, I'm going to start the scheduler. Now, I have a very simple pod here — nothing special, it specifies 0.5 CPU — and I use this profile name. You remember our scheduler: we created two profiles, and you can use either one. ... OK, what's going on? Oh, I have this many pods because I already ran it earlier — otherwise it can only run one pod. OK, so as I mentioned earlier:
You will see the information about the pod: basically "Attempting to schedule", then attempting to bind, then "successfully bound" the pod to the node, then the add event for the scheduled pod and the delete event for the unscheduled pod. But now let me run a larger pod requesting two CPUs. If you look at my node, it's already about 60 percent used — roughly 1.2 CPU — and if you check the allocatable resources, the node has two CPUs, so less than 0.8 CPU is available. So I purposely create a larger pod requesting two CPUs; it should not be schedulable, right? You can see it's pending, and you see this message, as I just showed: "0/1 nodes are available: 1 Insufficient cpu". It says one node because I'm using minikube and the cluster has only one node. Anyway, that's the most standard way: if you want to check why a pod is not scheduled, describe the pod, see that it's unschedulable, and look at the reason. Okay, let me switch back. Unfortunately, due to time, there are more complicated cases I won't demo, but these are the typical messages you can check: misconfigured persistent volumes, as I mentioned, and also taints and affinity. Okay, the second problem is: maybe the pod is scheduled, but too slowly. As maintainers of Kubernetes we hear this from customers all the time: "it's way too slow to schedule my pods."
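Conceptually, the "Insufficient cpu" verdict in that message boils down to a per-node fit check: does the pod's CPU request fit into the node's allocatable CPU minus what's already requested? Here is a minimal Go sketch of that check, using simplified numbers from the demo; it is an illustration only — the real NodeResourcesFit plugin does this for every resource, accounting CPU in millicores:

```go
package main

import "fmt"

// node holds CPU in millicores, as Kubernetes accounts it internally.
type node struct {
	allocatableCPU int64 // e.g. a 2-CPU minikube node = 2000m
	requestedCPU   int64 // sum of requests of pods already on the node
}

// fits reports whether a pod's CPU request fits on the node,
// mirroring the check behind "Insufficient cpu".
func fits(n node, podRequestCPU int64) bool {
	return podRequestCPU <= n.allocatableCPU-n.requestedCPU
}

func main() {
	n := node{allocatableCPU: 2000, requestedCPU: 1200} // ~60% used, as in the demo
	fmt.Println(fits(n, 500))  // small pod: true
	fmt.Println(fits(n, 2000)) // 2-CPU pod: false -> "0/1 nodes are available: 1 Insufficient cpu"
}
```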
So here I also want to mention some common or tricky cases. The first: the scheduler needs to talk to the API server, and maybe even to some admission controllers, because it needs to update the pod, so it needs network connectivity with acceptable latency. A lot of times we've noticed that because of network latency, the scheduler makes a good, quick decision but cannot update the pod status, and this slows down the entire pipeline and increases scheduling latency. Normally, if you check the log, you will see things like connection timeouts or "error updating pod". The second thing, which Yibo mentioned and which is very important, is the pair of backoff parameters. If a pod fails to schedule on the first attempt, it is put into the unschedulable queue and then the backoff queue, waits for its backoff time to expire, and is then moved back to the active queue to get another chance. It's important, and sometimes tricky, to set these parameters. The defaults are one second for the initial backoff and ten seconds for the maximum, and the backoff increases exponentially, so after about three retries it reaches the ten-second cap, and a pod waits at most ten seconds in the backoff queue. I would say the defaults are quite good if you want responsiveness, because cluster state keeps changing: the cluster may not have had the resources one second ago, but maybe ten seconds later some pods finish and the pod can schedule. But be careful: if you set it too small — or even with the defaults — we've noticed that in large clusters, with thousands of nodes or tens of thousands of pods and different pod priorities, it can cause head-of-line blocking. What causes that? Suppose your highest-priority pods can't find sufficient resources to run; they go back to the backoff queue, wait only a few seconds, and come back to the active queue, which by default is sorted by priority. If you have a large number of these unschedulable high-priority pods — unschedulable due to misconfiguration or whatever else — they keep returning to the active queue and get rescheduled too frequently, which blocks the lower-priority pods from being scheduled. So this is a trade-off you have to make between responsiveness and fairness. Maybe I can quickly show an example. Let me check the current configuration... okay, I started the scheduler. So now it's the defaults — you don't need to specify anything: the initial backoff is one second and the maximum is ten seconds, meaning an unschedulable pod waits at most ten seconds in the backoff queue. As I said, one pod — the large-CPU one — is pending because it didn't get enough resources. What happens if I delete the small pause pod? Let me start from the beginning: the node is at 1.2 CPU used, I start a small pod, it's running; I start the large one, it's pending; now I delete the small one — and you can see the large one gets scheduled quickly.
It took maybe 17 seconds, because the maximum backoff time is ten seconds. Now, what happens if we change this? Let me set both the initial and the maximum backoff to 60 seconds. I restart the scheduler, because I changed the configuration file. Okay, let's see what happens: we start the small pod, it runs; we start the large one, it's pending; now we delete the small one — and you see the large one is still pending. We don't need to wait it out here, but we can come back and check: it has to wait for at least 60 seconds. So hopefully you get the idea; these are parameters we found useful in our practice. The last thing I want to mention is percentageOfNodesToScore. Yibo already covered it, but I want to emphasize it again because it balances scheduling quality against scheduling performance. You can imagine the fastest scheduler is one that just picks a node at random — it's definitely the fastest, but the quality could be poor and you may not find the best node. So this percentage controls that combination. You can specify it yourself; by default an adaptive formula is used: 50 minus the cluster size divided by 125. Here is an example: if you specify 60 percent, then for 250 nodes that's 250 times 60 percent, i.e., 150 nodes. It means the scheduler will scan the cluster's nodes only until it finds 150 feasible nodes.
For this example you can see here, from some logs I captured offline, it evaluated a total of 232 nodes before finding the 150 feasible ones. If you don't specify the parameter, the default formula applies; in that run the percentage came out to 28 percent, and 28 percent of 250 nodes is less than 100. The current implementation has a default minimum of 100 feasible nodes, so because the result was below 100, it searched 172 nodes until it found 100 feasible ones. As for use cases: for batch workloads — say a Spark job with thousands of pods — you may not care where the pods land, so you can set a small percentage to improve throughput: just find a node for me quickly. But for long-running services, you may want all the soft constraints — affinity and anti-affinity rules — to be met, so you may want to scan a larger number of nodes. Recently we found that with the default value, a pod inter-affinity rule might not be satisfiable: one pod gets one set of feasible nodes and another pod gets a different, disjoint set. Even though the cluster as a whole contains nodes that could co-locate the two pods to minimize network latency, each pod individually sees a disjoint node pool, so the scheduler cannot find the co-location.
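The "scan until enough feasible nodes are found" behavior can be sketched as follows. This is a simplified version of the scheduler's numFeasibleNodesToFind logic: if percentageOfNodesToScore is unset, an adaptive percentage of 50 − numNodes/125 is used (floored at 5 percent), and the result is never below a minimum of 100 feasible nodes. Treat the exact constants as implementation details that can change between releases:

```go
package main

import "fmt"

const (
	minFeasibleNodesToFind           = 100 // absolute floor on feasible nodes
	minFeasibleNodesPercentageToFind = 5   // floor on the adaptive percentage
)

// numFeasibleNodesToFind returns how many feasible nodes the scheduler
// looks for before it stops filtering and moves on to scoring.
func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore >= 100 {
		return numAllNodes // small clusters scan everything
	}
	adaptive := percentageOfNodesToScore
	if adaptive <= 0 { // unset: use the adaptive formula
		adaptive = 50 - numAllNodes/125
		if adaptive < minFeasibleNodesPercentageToFind {
			adaptive = minFeasibleNodesPercentageToFind
		}
	}
	numNodes := numAllNodes * adaptive / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	fmt.Println(numFeasibleNodesToFind(250, 60)) // 60% of 250 = 150, as in the demo
	fmt.Println(numFeasibleNodesToFind(5000, 0)) // adaptive: 50-40 = 10%, so 500
}
```

The important part for operators is the 100-node floor: small percentages on mid-size clusters all end up scanning until 100 feasible nodes are found, which is why the offline log above searched 172 nodes for 100 feasible ones.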
That's something I think is very interesting to look at, and it can be a useful knob. The next topic is metrics. Finally, beyond the log and the parameters Yibo and I already discussed, there are tons of scheduling-related metrics you can check. Here I've put three key categories of metrics you may find useful when operating the scheduler. The first is performance-related. There is a bunch of metrics here: at a high level, you can get the end-to-end latency from a pod's submission until it is successfully scheduled — even across multiple cycles; if it goes through three scheduling attempts, this is the total end-to-end time. You can also get the latency of a single scheduling cycle, and even the per-plugin latency we mentioned — filter, scoring, and even preemption performance. The second category is scheduling results: how many pods were scheduled, how many are unschedulable, and also the queue depths — how many pods are currently in the active queue, the unschedulable queue, and the backoff queue. Finally, preemption is important: if your high-priority pod isn't scheduled, the scheduler will try to evict other pods, and you can get a bunch of preemption information, such as total preemption attempts. If that ratio is too high, it probably means your capacity is in trouble. You can also see how many pods were actually evicted or preempted, so you can check this.
I have the link here; you can check the metrics file in the scheduler package, which lists all these metrics along with their descriptions. Of course you can create a Grafana dashboard to monitor all this information — the latency information, the pod information, the preemption information — and that's definitely a great tool to get visibility into the performance and overall status of the scheduler. Okay, I think that's all I have, and I probably don't have time to take questions now, but hopefully we can answer some at the end of the talk, or you can come talk to us afterwards. — Thanks, Yuan and Yibo, for the operation and configuration parts. The next part is about scheduler extension. The scheduler provides a lot of in-tree functionality and the flexibility to configure its behavior differently, but it's also quite possible that the default scheduler misses a piece of functionality you need for your customers or clusters. So what should you do? Is there a way to push the boundary of the default scheduler? The answer is yes: there are a couple of extension mechanisms for kube-scheduler, in line with the extensibility principle of the overall Kubernetes platform. The first is called the scheduler extender; this is the first mechanism we introduced, a couple of years ago. Basically, an extender is an external HTTP webhook that you can associate with particular phases of a scheduling cycle. But there are two problems. First, it only provides a pretty limited set of phases for you to hook into.
If I remember correctly, you can currently hook into filter, score, and preemption. The second problem is that this mechanism's design involves network cost to exchange data between your webhook and the scheduler, and the marshal and unmarshal cost cannot be avoided. Because of these two issues, right now we don't really recommend extenders for large clusters; we have received reports that an extender can slow down the overall scheduling throughput. The second way is scheduler plugins — the most recommended way to extend kube-scheduler right now; the latest releases are basically built on this framework. It resolves the two problems I mentioned with the extender. First, it provides a bunch of extension points: at basically every place you could think of extending, there is an extension point for you to use. Second, there is no longer an HTTP or RPC connection between your extension and the scheduler; instead, we provide a language-level interface, so you implement that interface and recompile your extra plugins on top of the default scheduler. You get 100 percent compatibility, plus the net benefits of the plugins you develop. The third way: because Kubernetes doesn't have hidden APIs, anything kube-scheduler can do you can also build yourself from scratch. In that way you own everything — the queues, the cache — but you also have to implement all the scheduling constraint primitives yourself. So yes, you're on your own. This talk will focus on the second way. If you look at a typical scheduling cycle, it first starts with the so-called QueueSort interface; QueueSort sits between a pod being popped from the queue and the start of scheduling for that pod.
QueueSort provides a stateless function interface: given two pods, you tell me which one should be prioritized over the other. With this function, the default scheduler knows how to sort all the pods in its internal queues. In terms of implementation, right now the default QueueSort plugin just looks at the priority value associated with the pod's PriorityClass, so it prioritizes higher-priority pods over the others. But nothing prevents you from implementing your own logic: the co-scheduling plugin in the community, for example, sorts pods belonging to the same pod group next to each other so that they are scheduled back-to-back, which gives you a better chance of scheduling a whole group of pods together. So that's QueueSort; after it, you have a very high chance of popping the pod you think is most important, which immediately enters scheduling. The next phase is called PreFilter. As the name suggests, it is the pre-step before Filter, and there are several use cases for it. The first is simple: a pretty lightweight check that tells the scheduler whether the pod should continue through the scheduling cycle or not. If we can determine in a very early phase that the pod shouldn't continue, we just stop here — return as early as you can. The other use case is that we provide a scheduling-cycle-scoped data structure where your plugin can store its own precomputed data for use later on. For example, complicated scheduling requirements like pod topology spread and pod affinity need to look at the pod distribution across the whole cluster; the Filter interface alone doesn't quite fit, because it doesn't see the entire cluster's pod distribution. So the usual best practice is to precompute that data structure in PreFilter for the functionality you want to apply later; the value is then passed down within the same scheduling cycle. You can find similar implementations in pod topology spread and pod affinity. So that's PreFilter. One more thing I want to mention: in the latest Kubernetes releases, PreFilter can return a result — before, this didn't exist. Basically, it gives you the ability to return a much smaller list of nodes, so that the subsequent scheduling logic runs only among the nodes you provide. This is pretty useful for things like scheduling DaemonSet pods: a DaemonSet pod targets a single known node, so you don't need to search the whole cluster. You can apply similarly creative ideas in your own PreFilter implementation. Next, Filter: given a pod and a node, you tell me whether the node fits the pod or not. You can use the precomputed information stored in the cycle state, or, if your logic is pretty simple, you can implement it without the help of the cycle state. After Filter, we come out with a bunch of candidate nodes, and among those candidates what we do is score and rank them to come up with a final winner node for the pod to bind to. In this phase, PreScore is to Score what PreFilter is to Filter.
The difference is that in this phase we don't need to provide a yes-or-no answer, because when PreScore runs, the nodes in the node list are all feasible. So the most practical usage is, again, to precompute some data that you can use later in the Score phase. In the Score phase, along with what we call NormalizeScore, each plugin comes up with a score for each node by its own logic; the scores are accumulated, and finally we arrive at a total score and pick the node with the highest score for the pod to bind to. That's Score. After Score, you might think we're all done and can enter the binding phase — but hold on a little. There are two more phases we provide for some extra enhancement and accuracy control. The first is called Reserve. Take a step back: what is the source of truth for a pod's state? The source of truth is the persisted pod object: only once the binding is persisted can we say the pod is really running there and occupying the resources. That means that by the end of Score, the pod hasn't been bound yet; binding can still fail or succeed. So in this phase, to temporarily reserve resources whose allocation we don't yet know will be persisted, we provide the Reserve phase, where your plugin can reserve a specific chunk of resources temporarily. In the happy path, binding completes quickly afterwards and the reservation is settled; if some unexpected failure happens after Reserve, we roll back the changes we reserved. This pattern is pretty important, because in a declarative system there is a lag between the moment the pod is scheduled internally and the moment the pod's binding is persisted.
So there is a gap, and during that gap we still want to schedule workloads accurately; that is why we provide the Reserve phase, based on this optimistic pattern. Okay, Permit is for a similar purpose, but it usually works a little differently. If you look at the interface, it can return a duration: in this phase we semi-approve the pod's scheduling, but with a timeout. By default, if by the end of the timeout nobody tells the scheduler to fully approve the pod, it will be rejected. The typical use case is co-scheduling, or gang scheduling: the sibling pods of a pod group wait at Permit, telling the scheduler "here we come", and once the quorum is finally reached, all the sibling pods are approved within the timeout. That's the very typical use case for the Permit extension point; the default in-tree plugins don't use the Permit interface yet. All right, after Permit we enter the other cycle, called the binding cycle. It's pretty straightforward: purely for performance considerations, we put it into another goroutine. That means that by the end of the scheduling cycle — by the end of Permit — the scheduler jumps right into the next cycle to schedule the next pod, and meanwhile, in another concurrent goroutine, the binding runs. Binding includes three parts. PreBind lets you do some final checks on whether the pod can be bound; a typical in-tree use case is a final check on the PV/PVC association. After PreBind comes Bind, which does nothing but bind the node to the pod — so basically, you don't need to implement your own Bind. After Bind, PostBind is usually for post-processing or logging information that may be useful in your scenario. So this is basically the happy path of how the scheduler internally exposes extension points for you to hook into to schedule a pod. But there can also be a negative path: again, what if no node fits the pod? There's another extension point for that, called PostFilter. The intention is for you to implement logic like preemption, to evaluate whether you have alternatives that would make the pod schedulable. The default preemption tries to sacrifice some low-priority pods to make room for the high-priority one, but your implementation can be quite innovative — you can invent whatever logic fits your business needs. If I remember correctly, some community plugins depend on custom resource objects, so their logic is pretty customized and they re-implement the preemption logic in PostFilter. So you can implement your own. That is pretty much the overview of what each extension point is, and the framework has seen a lot of adoption in the industry — for example, IBM and Red Hat have built plugins into offerings like OpenShift.
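Putting the happy path together, here is a toy end-to-end sketch of the filter → score → pick flow, with simplified local types instead of the real framework interfaces; the "least allocated" scoring policy here is just one illustrative choice:

```go
package main

import "fmt"

// nodeInfo holds CPU in millicores.
type nodeInfo struct {
	name           string
	allocatableCPU int64
	requestedCPU   int64
}

// filter: does the pod's request fit on this node?
func filter(n nodeInfo, podCPU int64) bool {
	return podCPU <= n.allocatableCPU-n.requestedCPU
}

// score: a simple "least allocated"-style score in 0..100,
// preferring nodes with more free CPU after placement.
func score(n nodeInfo, podCPU int64) int64 {
	free := n.allocatableCPU - n.requestedCPU - podCPU
	return free * 100 / n.allocatableCPU
}

// schedule runs the happy path: filter all nodes, score the feasible
// ones, and return the winner. In the real scheduler, binding the
// winner then happens in a separate goroutine.
func schedule(nodes []nodeInfo, podCPU int64) (string, bool) {
	best, bestScore := "", int64(-1)
	for _, n := range nodes {
		if !filter(n, podCPU) {
			continue // this node would show up in the "didn't fit" log
		}
		if s := score(n, podCPU); s > bestScore {
			best, bestScore = n.name, s
		}
	}
	return best, best != ""
}

func main() {
	nodes := []nodeInfo{
		{"node-a", 2000, 1800}, // too full for a 500m pod
		{"node-b", 2000, 500},
		{"node-c", 2000, 1000},
	}
	winner, ok := schedule(nodes, 500)
	fmt.Println(winner, ok) // node-b true: most free CPU after placement
}
```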
So next, Chen from IBM will give us some practical examples of how they use the scheduling framework to build production plugins. — Thank you, Wei, for the nice introduction to all the extension points and plugins, and thank you all for still staying with us; I know you must all be hungry, so I will be quick. I will introduce some use cases of using scheduler plugins to customize our own scheduler for different types of workloads and clusters, and then I will give a short tutorial on how you can start writing your own scheduler plugin for your particular use case. The first use case we worked on at IBM, together with PayPal, is the load-aware scheduler plugins. The default Kubernetes scheduler only considers the pods' request and limit values when placing a pod on a node, and what you end up with is developers over-allocating resources: they specify very high request values. The trace I show at the bottom right — a real Google trace — shows that the request values set by developers are usually much larger than the actual usage. So you end up significantly over-provisioning resources for all the pods in the cluster; the cluster is poorly utilized and you cannot schedule more pods. That's why we came up with the load-aware scheduler plugins: we want to schedule pods based on the actual usage of the node, not the resource allocations on the node. There are three different plugins in this series, and I will introduce two of them here. The first, the TargetLoadPacking plugin, is the one we collaborated on with PayPal, and it is a very simple plugin: it allows you to try to maintain a certain target percentage X of utilization on all the nodes in your cluster. The benefit is that you make sure all the nodes achieve a certain utilization and are not underutilized, and by keeping the remaining (1 − X) percent as headroom you still have a safety margin for bursty workloads — so you get a nice balance between utilization and workload performance. The second one comes from a particular IBM cluster use case. We noticed that some pods have very high variation in utilization, and if those pods are placed together on a node, the node may have low utilization on average, but at certain periods, because the variation is so high, the workload bursts, performance degrades, and you end up with out-of-memory evictions of those pods, losing availability and performance. So what we do is balance the risk of pod evictions and performance issues by scheduling pods considering both the average utilization of the node and the variation of the load on the node; what the plugin actually does is keep the sum of the node's average utilization and its standard deviation close to a constant value. Another use case, which Wei already introduced: machine-learning training jobs and Spark jobs where you want to schedule a group of pods all together and make sure they run at the same time. We already introduced all the relevant extension points, and the co-scheduling plugin uses them to make sure a certain number or ratio of pods in a group run together — you can check out more details later. It is widely used for Spark jobs and distributed training jobs. So next, I will give a very simple tutorial in just four steps.
Let's go through it; you can scan the QR code to get the full tutorial if you want to try it yourself. Usually, the first step to start developing your own scheduler plugin is to go to the scheduler-plugins repo, clone it, and create a package for your own plugin. Let's take a scoring plugin as the example here. The first thing you want to do is define the plugin struct; here we define ScoreByLabel. In this example we just implement a simple scoring plugin that takes a label's value on the node as the node's score for scheduling, and we give it the name score-by-label. You also need to define the Name function to return that name, as part of the interface required by the plugin framework. Then the most important thing is to write your own Score function; in this Score function we implement some simple logic: read the node's labels and use the configured label's value as the score for the node. And if the scores you derive are not within 0 to 100, you can also use the score extensions interface and implement the NormalizeScore function to normalize all your scores into the 0-100 range. Next, when you develop real algorithms, you will want input arguments to configure your score plugin. The way to do that is to add, for example, a "plugin name plus Args" type in the APIs folders — v1beta2 or v1beta3. Let's take a look at the ScoreByLabelArgs struct, for example: here we want the user to be able to configure their own label key.
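A minimal sketch of what the Score logic described here might look like. The types are simplified stand-ins (the real tutorial plugin implements the framework's ScorePlugin interface and reads labels from a node object); labelKey is the configurable key from the args:

```go
package main

import (
	"fmt"
	"strconv"
)

// ScoreByLabel is a toy version of the tutorial's scoring plugin:
// a node's score is the numeric value of one of its labels.
type ScoreByLabel struct {
	labelKey string // configurable via the plugin args
}

func (s *ScoreByLabel) Name() string { return "score-by-label" }

// Score returns the label's value as the node score, or 0 if the
// label is missing or not a number.
func (s *ScoreByLabel) Score(nodeLabels map[string]string) int64 {
	v, err := strconv.ParseInt(nodeLabels[s.labelKey], 10, 64)
	if err != nil {
		return 0
	}
	return v
}

func main() {
	p := &ScoreByLabel{labelKey: "score-by-label"}
	fmt.Println(p.Score(map[string]string{"score-by-label": "10"})) // 10
	fmt.Println(p.Score(map[string]string{}))                       // 0: label missing
}
```

If the raw label values can exceed 100, a NormalizeScore step would then rescale each node's score, for example to score*100/maxScore across the candidate nodes.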
In the args struct we put the label key as a string field, and you can also add functions to set default values for your inputs if they are missing. Lastly, as Wei already mentioned, you can run your scheduler plugin as a secondary scheduler for certain types of workloads in your cluster, and the only thing you need to configure is the kube-scheduler configuration profile. Here, because we want to test the score-by-label plugin, we just enable it and disable the other plugins. The way a workload selects the secondary scheduler is by the scheduler name: here we name the scheduler score-by-label, and we specify that scheduler name in the pod spec. Next, let's try to deploy it. Here we are in a cluster with three nodes; let's remember the node IPs ending in .232, .243, and .253. We don't have any workload right now. To test the score-by-label plugin, the first thing we set up is labeling the nodes with scores. The key is score-by-label: we label .232 with 10, .243 with 5, and .253 with 1, so node .232 now has the highest score. We double-check that the labels are there.
We can now try to deploy the score-by-label scheduler using a simple Deployment, pulling an already-built image, together with all the RBAC rules, which are also available in the scheduler-plugins repo, and we mount the scheduler configuration file into the pod. We go ahead and deploy the score-by-label scheduler as a secondary scheduler; now it's running. What we do next is stream its logs in another window, and then let's take a look at the testing workload: a test pod that uses the score-by-label scheduler. We go ahead and create the pod. Now you can see the score-by-label plugin is actually running: the logs show the scores collected from the different nodes, and the highest score, 10, is on .232. It attempts to bind the pod to node .232 and then finishes the binding for the pod on that node. We can double-check that the pod bound successfully using the events: it says the pod started and was successfully assigned to .232 — and those are the useful debugging techniques introduced by Wei and Yuan. Then let's change node .253's score label to 100; now .253 becomes the highest-score node. Let's deploy the workload again and see where it is scheduled: I delete the pod and create it again, and this time it adds the binding event and attempts to bind the pod to .253 — no surprise, right? We double-check the events, and we see it successfully bound the pod to .253. Please follow the tutorial, and I'll hand it back to Wei to wrap up today's session. — Here are some references, and also, on Friday there's the SIG Scheduling deep dive; that talk will cover some of the latest updates on the scheduler itself and on the sub-projects of SIG Scheduling. So yeah, we have a few minutes for questions. [Audience question] So the question was whether we have plans to move some out-of-tree plugins into the core. It depends on a few factors: how mature the plugin itself is, how widely it is needed by the community, and — the last piece — API compatibility, whether it needs to introduce a new high-level pod API. So I'd say it's possible, but it has to be worked case by case; there's no general checklist that says check this, this, and this and you can push it into the core. The good thing is that everything is fully compatible, so it's mostly some extra effort to recompile that package in. [Audience question] If I understood, the question was about performance: whether in the first round of scheduling one pod evaluates, say, nodes zero to five hundred, and the next pod evaluates another range of nodes — is that your question? Yeah. I think if you have a use case, let's talk about it offline. Again, thanks everyone for coming to the session, and particularly for staying until the end — we really appreciate it. Thank you again; we'll be around if you want to discuss.