Good morning, good afternoon, good evening. I'm Yuan Cheng from the Apple Cloud Infrastructure team. Today I'm glad to present our work on enhancing Kubernetes scheduling for diverse workloads in large clusters. This is joint work with my colleague Yuan Xu and many collaborators from the upstream community.

Here is the agenda of my talk. I'd like to start with an introduction to why we need to extend and enhance the existing Kubernetes scheduler. Then I'm going to present a few use cases showing how to use the scheduling framework to develop new features that support stateful applications and batch jobs, and that improve scheduling performance in large clusters. I'll conclude my talk with a summary.

Why is the native Kubernetes scheduler not good enough? Let's look at the history. The Kubernetes scheduler was designed mainly to support stateless applications. It uses a pod-by-pod scheduling strategy with fairly simple scheduling logic, and its scheduling decisions follow an optimal-placement strategy: it filters the nodes, scores them, and assigns the pod to the best node. But if we look at today's workloads, many other kinds of workloads, such as stateful applications, batch jobs, machine learning, deep learning, and HPC applications, have started to run in Kubernetes clusters. To some extent they require advanced scheduling, from gang scheduling to topology awareness to advanced bin packing. Also, Kubernetes cluster sizes are increasing very fast; today we see clusters with thousands, even tens of thousands, of nodes, so performance is very important. And as we run more diverse workloads in Kubernetes clusters in a multi-tenant environment, a single scheduler policy or strategy cannot meet all the different requirements. One size does not fit all. We need to support customizable policies and algorithms for different workloads.
So here is just a partial list of the new scheduling features that we see as very important for supporting today's workloads in Kubernetes clusters.

How do we develop new scheduling features? I'd like to give a quick overview of the scheduling framework, which provides a powerful and uniform mechanism that enables us to write custom logic to extend and enhance the existing scheduler's logic and workflow. If we look at the flow here, from a pod being submitted to the pod finally being placed on a node, it has to go through many steps, or stages. The scheduling framework provides an API, an extension point, before and after each of these stages: for example, the filter stage, which finds the feasible nodes that can run a pod, or the scoring stage, which scores each node and ranks them all to choose the best one. Using the APIs the scheduling framework provides, we can create many different kinds of new scheduling features and custom scheduling logic.

Compared to the existing approaches for extending the scheduler or developing a new scheduling policy or algorithm, the scheduling framework is highly extensible and customizable, because each plugin is part of the native scheduler itself. We run only a single scheduler, and plugins share its cache and error handling, so it provides much better performance and much better error handling. Finally, because everything runs in a single scheduler, there are none of the conflicts or race conditions that are inherent in multi-scheduler setups.

Here is a summary of the existing, older ways to extend the Kubernetes scheduler. The first is to simply change the scheduler code.
That is definitely not recommended, because it is not compatible going forward: it is very difficult to incorporate new features or new versions of Kubernetes. The second option is the scheduler extender, but the key problems there are, first, that it has very limited extension points; you can basically only run the extender after the filter stage and after the scoring stage. Also, it runs as a separate service or webhook, so the communication overhead of serialization and deserialization is very high. The third approach is to run a separate custom scheduler alongside the native scheduler. The problem with this approach is that it is very hard to coordinate the two schedulers and resolve conflicting scheduling decisions; it introduces race conditions, and so far it is still an open question how to solve that at large scale in production systems.

By contrast, the scheduling framework provides a uniform, lightweight, fine-grained extension mechanism. You can customize and extend the scheduling flow, logic, and algorithms before and after each stage of the scheduling cycle. Also, because the plugins are built into the scheduler binary and run as a single scheduler, it eliminates the serialization and deserialization overhead and can offer much better performance.

So now let's look at some of the use cases where we used the scheduling framework to develop new features to support stateful pods and batch jobs, and to improve scheduling performance in large clusters.

The first use case is a new feature to support stateful pods that require a static, fixed IP address. This means that whenever a stateful pod is assigned an IP address, it should keep that IP address during its entire life cycle. This is not an uncommon requirement.
We have seen this requirement in many end-user environments, either to maintain compatibility with legacy systems or to simplify application logic. It requires some changes to the existing scheduling logic. First, we have to track the IP information: at a minimum, we have to know which stateful pods are already assigned which IP addresses. Also, depending on how you manage your IP addresses, if the IPAM is static and assigns a fixed number of IP addresses to each node, you have to maintain the mapping between IP addresses and nodes. Finally, during the filter stage of the scheduling decision, when finding feasible nodes, we have to make sure we assign the pod to the right node if it already has an IP address, or to a node that still has free IP addresses available.

Looking at how we implement this using the scheduling framework, the simplest design introduces two plugins at two extension points. The first is a pre-filter plugin. During pre-filter, we sync up and make sure the scheduler has correct information about the IP state as well as the IP reservations. We also classify each pod into one of three categories: a new stateful pod without an allocated IP, an existing stateful pod with an IP, or a regular, stateless pod. The next new plugin runs during the filter stage, where the static-IP scheduler has to do some additional checks.
For regular pods or new stateful pods, it needs to check whether the node, or its rack, still has free IPs available, and reject the node if it doesn't. For an existing stateful pod, it has to make sure the allocated IP and the node actually match: if a static IP is already tied to a node, the pod's allocated IP should be available on that node.

On the right side is a code segment showing how we could implement this static-IP scheduler plugin; this is the filter plugin. As you can see, it's quite straightforward and simple.

Compared with the existing approach of using a scheduler extender as a webhook, the scheduling framework plugin provides a much simpler implementation. Because it runs as a lightweight plugin, without managing its own events and cache, and with much easier error handling, it is much more robust and stable, and also easier to maintain and manage. Finally, because it does not run as a separate component, there is no serialization and deserialization overhead, so the performance can be improved significantly. Here are some benchmark results. On top is the webhook implementation: the time spent in the extender's predicate logic, measured as a percentage of the entire scheduling algorithm time, is almost up to 50%. By contrast, the filter plugin only takes up to 4% of the scheduling algorithm time.
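To make the filter check concrete, here is a minimal sketch of its core logic in Go. The types and names here (StaticIPFilter, ipByPod, ipToNode, freeIPs) are illustrative stand-ins, not the real plugin's API; the actual plugin implements the framework's Filter interface from k8s.io/kubernetes/pkg/scheduler/framework.

```go
package main

import "fmt"

// StaticIPFilter tracks IP reservations and the static IPAM layout.
type StaticIPFilter struct {
	ipByPod  map[string]string // stateful pod name -> its reserved IP
	ipToNode map[string]string // static IPAM: each IP belongs to one node
	freeIPs  map[string]int    // node -> number of unallocated IPs
}

// Filter returns nil (success) if the node can satisfy the pod's IP
// requirement, otherwise an error explaining the rejection.
func (f *StaticIPFilter) Filter(podName, node string) error {
	if ip, ok := f.ipByPod[podName]; ok {
		// Existing stateful pod: it must land on the node hosting its IP.
		if f.ipToNode[ip] != node {
			return fmt.Errorf("reserved IP %s is not hosted on node %s", ip, node)
		}
		return nil
	}
	// New stateful pod or regular pod: the node just needs a free IP.
	if f.freeIPs[node] == 0 {
		return fmt.Errorf("node %s has no free IPs", node)
	}
	return nil
}

func main() {
	f := &StaticIPFilter{
		ipByPod:  map[string]string{"db-0": "10.0.0.5"},
		ipToNode: map[string]string{"10.0.0.5": "node-a"},
		freeIPs:  map[string]int{"node-a": 3, "node-b": 0},
	}
	fmt.Println(f.Filter("db-0", "node-b")) // rejected: its IP lives on node-a
	fmt.Println(f.Filter("db-0", "node-a")) // <nil>: success
}
```

Because the maps live inside the scheduler process, these lookups are just in-memory checks, which is exactly why the plugin avoids the webhook's round-trip cost.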
So it's a very significant performance improvement.

Next, I'm going to share a new scheduling feature to support gang scheduling for batch jobs. Many batch jobs, like big-data workloads such as Spark, machine learning, and deep learning, require gang scheduling, which means that given a group of pods, we schedule all of them or none of them. Because the existing native scheduler only schedules pod by pod, it is missing this feature. One solution we already have in the community is the lightweight co-scheduling plugin. It was originally proposed by Qingcan Wang and his colleagues at Alibaba. Over the last few months we have been working very closely with the community to improve and enhance this co-scheduling plugin to better support gang scheduling for batch jobs, with collaborators from Alibaba, Apple, IBM, Tencent, and many others.

The idea is that it introduces two labels. The first label defines a pod group; you name it, for example, my-batch-job. The second label specifies the minimum number of pods that must be scheduled together as a group. In order to support gang scheduling, we have to introduce multiple extensions at different stages.
The first step is pod sorting. When a pod is submitted to Kubernetes, it goes into a scheduling queue; instead of sorting pod by pod, the sort plugin makes sure that pods from the same group are sorted together, avoiding interleaving across different pod groups. The second extension is a pre-filter that checks the pod groups managed by the co-scheduling plugin, so we know which pod belongs to which group. It also verifies whether the cluster can meet the pod group's minimum-number-of-pods requirement, so we can reject pods early and save resources. The third plugin is a reserve plugin: because we now schedule a pod group and cannot bind a pod immediately after finding a node for it, we just reserve that pod. The reserve plugin also needs a timeout mechanism to make sure a group won't occupy resources forever. The fourth plugin is the permit plugin. It checks whether the pods belonging to the same group have reached the group's minimum number of pods to run; if they have, we allow all these scheduled pods to move to the next stage, the binding stage, where the containers for the pods are created and start to run. We also have to manage these pod groups and make sure that deleting a pod group cleans it up.
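The permit-stage behavior described above can be sketched as a small quorum check. This is a simplified illustration, not the real plugin's code: the actual co-scheduling plugin in the scheduler-plugins repo returns framework status objects and handles timeouts and rejection, which are omitted here, and the type names (GangPermit, Decision) are invented for this sketch.

```go
package main

import "fmt"

// Decision is what the Permit stage returns for a pod of a group.
type Decision int

const (
	Wait  Decision = iota // hold this pod at the Permit stage
	Allow                 // quorum reached: release the group to binding
)

// GangPermit holds pods at the Permit stage until minMember of the
// same group have each had a node reserved for them.
type GangPermit struct {
	minMember map[string]int // pod group -> minimum pods to run together
	waiting   map[string]int // pod group -> pods currently held at Permit
}

// Permit is called once a node has been reserved for a pod of the group.
func (g *GangPermit) Permit(group string) Decision {
	g.waiting[group]++
	if g.waiting[group] >= g.minMember[group] {
		// In the real plugin, every waiting pod of the group is now
		// signaled to proceed to the binding stage together.
		return Allow
	}
	return Wait
}

func main() {
	g := &GangPermit{
		minMember: map[string]int{"my-batch-job": 3},
		waiting:   map[string]int{},
	}
	for i := 1; i <= 3; i++ {
		// prints false, false, true as the quorum fills
		fmt.Printf("pod %d allowed: %v\n", i, g.Permit("my-batch-job") == Allow)
	}
}
```

The timeout mentioned next is what prevents a half-filled group from holding its reserved nodes indefinitely.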
Also, if there is a timeout, we have to clean up the pod group as well. All of this behavior can be customized through parameters.

In summary, this co-scheduling plugin provides a simple mechanism to run batch jobs as a group, supporting the gang scheduling required by many batch jobs. It also supports defining a pod group across jobs and deployments. There are mechanisms to make sure a pod group is cleaned up when it is deleted or times out, and there are nice error-checking mechanisms as well.

Here is an example policy file defining a scheduling profile to support gang scheduling, or co-scheduling. As you can see, we just need to add the co-scheduling plugin at the different stages: the sort stage, the pre-filter stage, the permit stage, and the reserve stage. We can also customize the parameters for managing the pod group. For more information, please visit the scheduler-plugins repo; users as well as contributors are very welcome.

Moving forward, there are a number of interesting, more advanced features related to co-scheduling, and the community is working very hard to address them. A new proposal, with a PR, tries to improve this label-based co-scheduling into a pod-group-CRD-based implementation. We are also looking at more advanced, customizable preemption: for example, when we preempt and evict a pod from a pod group, should we evict the entire group, or just the single pod, even though that may violate the minimum number of pods defined in the pod group? Also, can we improve utilization when we don't want a large pod group's reserved resources to block small pod groups from running? What about introducing a backfill strategy, or more advanced rescheduling strategies? Finally, as we can see, there are a lot of new requirements around this pod-group-based management.
So can we introduce a more generic sorting plugin that treats even a single pod as a pod group consisting of just one pod, so that we can manage pods and pod groups in a uniform way? For more information, you can look at the related discussions, issues, and KEPs.

The third use case is scalable scheduling: how do we improve scheduling performance at scale? We have seen many very large clusters with thousands or tens of thousands of nodes. There are also a lot of large jobs and services running in such clusters, each with thousands or tens of thousands of pods. For some real-time, interactive workloads, autoscaling is important to quickly scale the number of pods in a service to handle increasing load. All of this requires a scheduler with very good performance. Unfortunately, today the native Kubernetes scheduler's performance can be limited by its pod-by-pod scheduling algorithm and its optimal-placement strategy in very large clusters.

To address that, we have made two proposals. The first: can we customize some key scheduling parameters for different workloads or tenants, to better balance scheduling quality and performance? The second: instead of scheduling pods one by one, can we assign nodes to a group of pods at a time?

First, let's look at customized scheduling parameters. One of the key parameters with a big impact on scheduling performance is called percentageOfNodesToScore, which determines how many of the nodes the scheduler should check, score, and rank. For example, here are the results of a benchmark experiment run on a 2,000-node cluster. A value of 5% for this parameter means we will score up to 100 nodes.
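The arithmetic behind that cap can be sketched as a tiny helper. This is a simplification for illustration: the real kube-scheduler also applies an adaptive default percentage and minimum bounds when the value is unset, which this helper ignores, and the function name is invented here.

```go
package main

import "fmt"

// nodesToScore returns how many feasible nodes the scheduler will
// collect before it stops filtering and moves on to scoring, given
// percentageOfNodesToScore and the cluster size.
func nodesToScore(totalNodes, percentage int) int {
	n := totalNodes * percentage / 100
	if n < 1 {
		n = 1 // always score at least one node
	}
	return n
}

func main() {
	// The 2,000-node experiment from the talk: 5% stops the search
	// after 100 feasible nodes; 100% can cover the whole cluster.
	fmt.Println(nodesToScore(2000, 5))   // 100
	fmt.Println(nodesToScore(2000, 100)) // 2000
}
```

The trade-off is scheduling quality versus throughput: a lower percentage scores fewer candidates per pod, which is exactly what the benchmark below measures.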
So once we find 100 feasible nodes, we stop searching, rank those 100 nodes, and choose the best one; a value of 100% means we look at up to the entire cluster, all 2,000 nodes. Looking at the results on the left, the graph shows that for a very simple strategy, 5% and 100% don't make much difference; both reach about 120 pods per second, the maximum throughput we have seen so far. But for pods that require advanced placement constraints, like affinity and anti-affinity, 5% and 100% show a huge difference, almost 3x in scheduling throughput. To take advantage of that, we propose a per-profile parameter. Currently percentageOfNodesToScore is a global parameter; what we propose is to make it per profile, so that we can assign a different percentageOfNodesToScore to each scheduling profile, and different workloads or tenants can use different profiles to better balance their scheduling performance and scheduling quality.

The second proposal is to score nodes and assign them to pods as a group. The key idea is that we can consider a group of homogeneous pods with the same resource requirements as one unit; then we only need to score the nodes once and assign the top-k scoring nodes to the k pods at a time. To implement that, we first need a pod-group sort plugin, similar to the co-scheduling plugin's, that sorts the pods of a group together. The second plugin is a group score plugin, which applies the top-k scoring nodes to the k pods. This is still in the design phase. The challenge is: how can we support this pod group as the unit of scheduling?
With the current scheduling framework, we can customize the scheduling logic before and after each scheduling stage, but it is still impossible to customize the scheduling flow itself; for example, can we skip a stage for certain pods, or customize the processing logic within a stage? This is a very hard problem, because we have to strike a good trade-off between simplicity and robustness on one side, and performance and advanced features on the other, in the scheduler's design and implementation.

Okay, so we have discussed three use cases and how to develop new scheduling features to support them. Finally, here is an example of how we can put it all together and create scheduling profiles that use all these scheduling features. Here we show an example policy with four different profiles. Each profile has a different name and is associated with different plugins. The only constraint is that the QueueSort plugin must be the same across profiles, because all profiles share a single scheduling queue. For each profile we can enable different plugins: sticky-IP scheduling, co-scheduling, default scheduling, or an integrated profile that combines all three types of plugins. As you can also see, we can set a different percentageOfNodesToScore for each profile. Then different workloads, tenants, and users can specify different profiles to better meet their performance, scalability, and scheduling-quality requirements.

To summarize: we have seen increasing needs for new features in the Kubernetes scheduler. The scheduling framework provides a very powerful and generic mechanism that enables us to develop new scheduling features; many of them are already available, and more are under development. Last but definitely not least, community collaboration is the key. The work we have presented today is the outcome of
close collaboration with the upstream community, especially the SIG Scheduling community. More users and contributors are highly welcome to join the community; for more information, please visit the scheduler-plugins repo. We'd like to really thank the contributors in the upstream community, our collaborators: Wei Huang and Abdullah Gharaibeh, both co-chairs of SIG Scheduling; Qingcan Wang and Kai Zhang from Alibaba for the co-scheduling and elastic scheduling contributions; and Wei Dongcai from Tencent for the CRD-based co-scheduling. Finally, we are part of the Apple cloud services team. For more information, please visit the Apple virtual booth. We are hiring Kubernetes engineers across the stack, from core Kubernetes infrastructure to the platform to the application layer. Thank you very much.