Alright, well, good morning, almost afternoon, everybody, and welcome. My name is Randy, I'm a Cloud Native Ambassador at the CNCF, and this is the Capacity Scheduling for Elastic Resource Sharing in Kubernetes session. I'd like all of you, online and here in person, to welcome Yuan Chen from Apple.

Thank you, Randy. Good morning, good afternoon, good evening. I'm so excited to be here. It's the first on-site event in almost two years. It's my sixth KubeCon and also my sixth talk, and I hope I can make new friends and get connected with the community. For those of you who are virtual, I look forward to seeing you next time.

Today I'm co-presenting our work on capacity scheduling in Kubernetes with Qingcan. Unfortunately, Qingcan is not able to attend in person; he will present a demo in a prerecorded video later. A little bit about ourselves: Qingcan is a software engineer from Alibaba Cloud. He has been a very active contributor in Kubernetes, especially in the SIG Scheduling group; he has made a lot of code and design contributions to the Kubernetes scheduler. I'm Yuan Chen from Apple Cloud Services. I'm a Kubernetes contributor, and my work is mainly focused on Kubernetes scheduling performance and scalability.

Okay, I'd like to start with some background, context, and motivation for the work. As all of us know, Kubernetes has become the de facto cluster and cloud resource management platform, so more and more workloads, and different types of workloads, are running on Kubernetes. In addition to traditional service workloads, there are now more batch workloads, from machine learning and deep learning to big data workloads, running on Kubernetes. A key issue we realized after talking to our customers: let's look at how today's Kubernetes manages resources
between different users or namespaces. So quota is a basic concept, but quota is mainly used just for admission control or capacity planning; it is not involved in scheduling in a more dynamic way. Second, for a quota, so far we can define just a single value, for example for each namespace: that's the total request a namespace can ask for. Third, at runtime, resource coordination is purely based on priority: the default preemption and eviction use priority to achieve fairness in resource allocation.

As a result, Kubernetes lacks some of the more flexible, dynamic resource sharing that traditional cluster management systems like Hadoop YARN can offer. This can also cause low resource and cluster utilization. These shortcomings have become a blocker for many customers migrating from traditional cluster management systems to Kubernetes.

To address this, we have been working in the upstream community, collaborating with a lot of contributors, on two key concepts. The first is termed elastic quota. The idea is to introduce a more flexible building block that can support dynamic resource sharing across namespaces and users, while also providing some resource guarantees and fairness. Second, we extended the flat quota structure to a tree, or hierarchy, to better manage resources across different users and organizations. That's what systems like Hadoop YARN offer with capacity scheduling or fair scheduling.

So now let me talk in a little more detail about what elastic quota is. Elastic quota is a CRD. The difference from the traditional, built-in resource quota is that we introduce basically two fields, or values, for each elastic quota. One is the minimum.
That is the guaranteed quota for a namespace. The second one is the maximum; that is the limit for the namespace. Be aware that when we talk about quota here, we basically refer to the request, not the limit, because that's what the scheduler makes decisions on: it looks at requests only. It supports multiple resource types, from built-in CPU, memory, and disk to extended resources like GPUs.

As I already mentioned, it's a different concept from the built-in, native resource quota. Resource quota is more for admission control, while elastic quota is used by the scheduler to coordinate resource sharing at runtime.

On the right side is a simple example. Here is an elastic quota named "test"; it's associated with a namespace "test". As you can see, it's quite straightforward: it specifies the maximum and minimum resources in terms of CPU, memory, and GPU. By the way, throughout my presentation I'm using GPU as an example, but this can apply to any type of resource, like CPU or memory.

So what do we guarantee? The basic idea is that for each namespace with an elastic quota, the minimum quota is guaranteed: you can always get that much quota. The maximum is optimistic: if there are additional resources available in the cluster, you can borrow quota from other namespaces or users. But what happens if there is contention? This is where the resource guarantee and fairness become very important.

To implement this, we basically implemented a custom preemption. At runtime, if a namespace submits pods but they cannot be scheduled because its quota has been used, or borrowed, by other namespaces, the preemption tries to identify the overused namespaces, meaning the ones already using more quota than their guarantee. Then, among all those candidate namespaces... I should take off my mask.
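As a rough illustration of the spec just described, here is what such a manifest might look like, modeled on the ElasticQuota CRD in the sigs.k8s.io/scheduler-plugins repo. The exact API group/version and the specific resource amounts are assumptions and may differ by release:

```yaml
# Sketch of an ElasticQuota manifest as described above. The apiVersion
# and the resource amounts are illustrative assumptions; check the
# scheduler-plugins repo for the exact schema in your release.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: test
  namespace: test          # associated with the "test" namespace
spec:
  min:                     # guaranteed quota (in terms of requests, not limits)
    cpu: 10
    memory: 20Gi
    nvidia.com/gpu: 2
  max:                     # upper bound; usable only by borrowing idle capacity
    cpu: 20
    memory: 40Gi
    nvidia.com/gpu: 4
```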
Sorry about that. So after it identifies those namespaces, it will identify a list of candidate, or victim, pods. So far we have implemented a simple policy based on priority that also tries to minimize the number of evicted pods. Of course, one possible extension is to add more fairness and keep things balanced across different namespaces; that can be added in future versions.

So the elastic quota is a CRD, and the custom preemption is implemented as a scheduler plugin. For those who are not familiar with the scheduler framework: it basically provides a uniform platform that allows developers to customize and optimize the scheduler. All of these plugins are built into the main scheduler, so we are not introducing a second scheduler. I think this is a key advantage: it's a Kubernetes-native solution. You still run just a single scheduler with additional plugins, so there is no race condition problem, you can share the cache and everything else, and you can disable or enable the scheduler plugins.

I won't go into too much detail here, but the key thing is that there are three plugins. In PreFilter we check the quota: when scheduling a pod, the scheduler has to make sure the pod cannot exceed the maximum quota. PostFilter is basically the preemption: if a pod's namespace is under its guaranteed quota, we have to perform preemption. The Reserve stage is when a pod is being scheduled to a node: the scheduler reserves some resources to avoid a race condition.

So here is an example of elastic quota. There are two namespaces. Namespace one has a minimum, or guaranteed, quota of four GPUs and a maximum of six GPUs; the other has a minimum of six and a maximum of eight GPUs, respectively. At the beginning, they use only two GPUs and three GPUs. So it's fine.
It's even less than their minimums. But later, namespace one, or user one, requests additional resources, and the scheduler can schedule more pods, up to its maximum of six GPUs. In the next step, namespace two submits more pods, or workloads, that request more resources. Now the scheduler realizes that namespace one has over-utilized its guaranteed quota, so the custom preemption kicks in and evicts pods from namespace one. Now namespace one uses four GPUs and namespace two uses six GPUs; these are their guaranteed, or minimum, quotas. That's the fairness we want to achieve.

Next, I'll briefly describe an extension of the work. So far what we've described is a flat quota structure, but more and more users want to map quota and resource management to their organizations and teams, so a hierarchical, or tree, structure would be a very desirable feature. Here is an example. We can create a tree, or hierarchy, of quota groups from the root, which aggregates the entire cluster's resources. Each group also has a minimum, or min: the sum of a group's children's minimums should be less than or equal to the parent's minimum, and each child's maximum should be less than or equal to its parent's maximum. At the leaf level, namespaces are associated with the leaf-level quota groups. This way we can manage resources in a more dynamic and flexible manner.

I hope it's readable, and I apologize if it's too small. Here is an example that just extends the previous one. To define a quota group, a user can specify the minimum and the maximum, but there is also an additional field for children, so it's a nested structure. Each quota group can specify its children quota groups; the children quota groups define their own minimum and maximum, and can in turn have their own children.

Next, I'll talk a little about batch workload support in Kubernetes that is not directly
related to scheduling: the job queue. As we know, batch workloads come, run, and are gone, so typically we need to manage, or buffer, jobs when resources are not available. That's why we need job queue management, and to address this, job queue management was introduced into the system as well. It manages workloads, not pods; pods are the scheduler's concept. So far the policy is based on job priority, job creation (or submission) time, and the quota constraints, and it decides which job should be scheduled next.

A new feature we're still working on is support for multiple queues: for example, each user or tenant can submit their jobs to a different queue. Then, based on certain policies, either fair sharing or other more advanced policies, the management system can provide more flexibility and fairness across users and tenants.

Here is a high-level architecture of how the job queue is designed and implemented. The idea is that when a job is submitted, an annotation, a suspend annotation, is added, and the job is added to the queue. Based on the policy, when a job is dequeued from the job queue, the annotation is removed; the different operators then get the notification and create the pods. One thing I want to mention is that there are all kinds of application-specific workloads, from TensorFlow to PyTorch to Spark to MPI, and they have different descriptions, or schemas, for their jobs. So we are working on extensions, with different extensions supporting different workloads. The key function of an extension is to convert, or adapt, the application-specific job definition into a unified, generic description in the queue, which we termed a queue unit.
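The suspend-annotation flow just described can be sketched roughly as follows. Note this is a hypothetical illustration of the mechanism: the annotation key "queue.example.io/suspend" and the job shape are invented for this sketch, and the actual project defines its own annotation key and queue-unit schema per workload type:

```yaml
# Hypothetical sketch of the queueing flow described above; the
# annotation key is invented for illustration.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job-1
  namespace: tenant-a
  annotations:
    # Added when the job is submitted, which keeps its pods from being
    # created; removed when the queue dequeues the job according to
    # policy (priority, submission time, quota constraints), letting
    # the operator proceed and create the pods.
    queue.example.io/suspend: "true"
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example/batch-worker:latest   # placeholder image
```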
That way we can support different types of jobs and workloads.

Okay, next I'm going to show a demo created by Qingcan. Very unfortunately, we are not able to make the audio work; otherwise Qingcan would be the best person to present this work, so I'll try to explain it. Yeah, I can just click it. No, it's not working. I have the video here, right? Okay.

In this demo, we're running a GPU cluster. We create elastic quotas and namespaces, then submit jobs from the different namespaces, and you will see how the jobs of the different namespaces, or users, share resources in a more dynamic way, and also how the job queue manages the different jobs. Because we cannot hear Qingcan's voice, let's just watch.

Okay, the first step is just to create two namespaces, namespace one and namespace two. The next step is to create the elastic quotas with the minimum, or guaranteed, quota and the maximum, or limit. As you can see here, we create an elastic quota for each namespace, with a maximum of four GPUs and a minimum of two GPUs. By the way, we assume the total capacity of this tiny, dummy cluster is four GPU cards. So each namespace will get two GPUs as guaranteed resources, but it can use up to four GPUs if the other user is not using them. You can see the min and max here.

The next step is to create some jobs in each namespace and see how these jobs, or pods, are scheduled and run in the cluster. Let me see... okay, that's the quota. Now for the jobs: namespace one will submit three jobs. Each job has two workers, because it's a TensorFlow job, and each worker is using one GPU.
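A job of the shape just described, two TensorFlow workers with one GPU each, might look roughly like the following Kubeflow TFJob; the image name is a placeholder and the exact field layout may differ by training-operator version:

```yaml
# Sketch of a two-worker TFJob where each worker requests one GPU,
# so the whole job consumes two GPUs. Image name is a placeholder.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: job1
  namespace: namespace1
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                    # two workers per job
      template:
        spec:
          containers:
          - name: tensorflow
            image: example/tf-training:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1    # one GPU per worker -> two GPUs per job
```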
So that means each job uses two GPUs, which in turn means, since there are only four GPUs in the cluster, that only two jobs can run at the same time. You can see here there are two workers; the replica count is two, and each one requests one GPU. So let's see what happens when user one, in namespace one, submits three jobs of this type. Qingcan probably has more insight here. Okay, let's see... okay, that's the first job created. Again, each job is basically requesting two GPUs, and user one launched three jobs. Let's see the status.

Okay, as we just described, two jobs are running, each using two GPUs, four GPUs in total. Job three is queuing in the job queue because there are no resources available to run it.

Next, let's see what happens when user two, in namespace two, submits a job to the system. Namespace two submits a job of the same type, requesting two GPUs, because there are two workers. What we expect is that, because there are no free GPUs available, preemption will kick in and evict some jobs, or pods, from namespace one; then the job of user two, in namespace two, can get the resources to run.

Let's see the status. Okay, so far, because it takes some time, user two's job has not gotten resources and is queuing. But during this queuing stage, the scheduler identifies the victims, the candidates. As you can see, job two in namespace one is being evicted, and its pods will be evicted. Now, okay, this is interesting: job one in namespace one completes, so job three gets a chance to run, because it requested two GPUs. Then job one of namespace two, as we described, started running as well after the preemption and eviction succeeded. And job two of namespace one was evicted, and preempted, because the namespace overused its guaranteed resources. So as you
can see, this demo showed how resources are dynamically shared between namespace one and namespace two. When namespace two wasn't using its guaranteed resources, namespace one could run two jobs; but later, when namespace two needed the resources, the scheduler used preemption to reclaim them from namespace one to run namespace two's workloads.

Right now the implementation is simple: after its pods are evicted, a job stays in a failed state. But we can definitely extend this, either at the application level, with the job management resubmitting it, or, as we're also discussing, if a pod was terminated early because of elastic quota contention, maybe we can automatically resubmit the job to the queue so it will run later when resources are available.

So that was the demo, showing how the elastic quota works and also the job queue. I hope you found it useful and have a better idea of how elastic quota works.

Okay, now for the current status of elastic quota and capacity scheduling: we have two open source projects. The first implements elastic quota and capacity scheduling; the core component is the custom preemption. It's part of the SIG Scheduling out-of-tree plugins, so you can check out the code and play with it in the SIG Scheduling scheduler plugins repo. The second one is an independent, separate project repo for managing workloads and providing the job queue management functionality. We just made it public before the conference.
It's still under active development. Right now it supports the TensorFlow and PyTorch extensions; Spark, MPI, and others are still being worked on, and we look forward to contributions from all of you. The hierarchical quota will be released in the next version.

In terms of early adopters: Alibaba has already applied elastic quota and capacity scheduling in production on Alibaba Cloud. At Apple, we are actively investigating how to provide these job queue and elastic scheduling functions to support running Spark workloads on our Kubernetes infrastructure. And Baidu is looking at using this to run their self-driving AI and simulation-based workloads.

Here is a list of references you may find useful, with the repos and some documents. This has been a joint effort in the upstream community; without the collaboration it would not have been possible to even reach this stage, so we really want to thank all the contributors who have contributed ideas, code, and designs. We also look forward to working with everyone, so please try out the projects, provide feedback, contribute code or designs, or submit issues, suggestions, and recommendations. Everything is highly appreciated and welcome. Okay, I think that's all I have, and I'm happy to take some questions. I'll also be around after the talk.

Also, I'll just note for the online folks, we'll be taking questions online too, so feel free to drop those in if you're online. And don't forget to review the session on Sched.

Thank you for this; this is amazing work. One question I had: basically, you're ticking all the boxes of what we need for scientific workloads. One that you didn't mention is gang scheduling. Is this also supported by the scheduler you described?

Sorry, could you speak up a little bit?
I was just saying that this is amazing; you're ticking all the boxes of the stuff we need. One that you didn't mention is gang scheduling. When you're doing the workload allocation, is this something the scheduler also supports, like scheduling multiple jobs at the same time, waiting for the slot, things like this?

Oh yeah, so on scheduling, I should have clarified. As you can see, there are basically two levels. One is the job queue level, where the workload, or job, is the unit: the queue management system determines which job is the next to schedule, and as I mentioned, there are some policies there. This is separate; you can think of it as outside of Kubernetes, independent of the Kubernetes scheduler. The second level is within Kubernetes, in the key component, the kube-scheduler. This is at the pod level: when a job is picked from the job queue, all the pods within that job are submitted to the API server and land in the scheduling queue. At that point it's just the default scheduler behavior, but we have a plugin, as I showed, the PreFilter plugin, which makes sure that when the scheduler schedules a pod, the total amount of resources won't exceed the namespace's maximum elastic quota. Did I answer your question? Oh, thank you. Great.

Thank you. An online question that's got a bunch of votes here: are the elastic quota and job queue components going to end up in the Kubernetes SIG Scheduling mix at any point in time?

Yeah, so far, as I mentioned, elastic quota and capacity scheduling are already part of the SIG Scheduling out-of-tree repo, and they've been there a while.
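For context on the plugin wiring mentioned above: enabling an out-of-tree plugin like this in a rebuilt scheduler typically goes through a KubeSchedulerConfiguration along the following lines. The apiVersion and the plugin name "CapacityScheduling" are assumptions here and may vary by scheduler-plugins release:

```yaml
# Sketch: enabling a capacity-scheduling plugin at the PreFilter,
# PostFilter, and Reserve extension points described in the talk.
# apiVersion and plugin name may differ by release.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    preFilter:
      enabled:
      - name: CapacityScheduling   # quota check before scheduling a pod
    postFilter:
      enabled:
      - name: CapacityScheduling   # cross-namespace preemption
      disabled:
      - name: "*"                  # replace the default preemption behavior
    reserve:
      enabled:
      - name: CapacityScheduling   # reserve quota to avoid race conditions
```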
We have tested it extensively, and it's just the CRD plus the scheduler plugin, so it's basically a Kubernetes-native solution, and Alibaba has already applied it in production. More testing will definitely help, so please try it. So far we just hope to get more adoption and have people test it so we can improve it. That's the goal: make it production-ready, widening and broadening the use. Excellent, thank you.

I may be completely misunderstanding or missing the problem, but could this be used for sort of Nomad-style scheduling? Where, instead of having an HPA that has to respond to a metric not being in the right threshold and then scheduling more pods, you make sure that you have pods scheduled maximally across all your nodes, so that if there is a surge in traffic on any workload, like a web workload, it's possible that it already has more than it needs; and if not, then you can evict something that does have more than it needs and then do the scale-up.

Yeah, so as I said, there are multiple decision points in the entire life cycle of a workload. When we talk about pods, the key function of the kube-scheduler is basically assignment: placing pods on nodes. If you are familiar with the kube-scheduler's functionality, there are different algorithms, or plugins, there: bin-packing algorithms, spreading the pods, best-fit consolidation, or other constraints; all of this is configurable. By default, for example, the kube-scheduler just tries to spread the pods across all the different nodes, but you can configure it a different way. I think that's somewhat orthogonal to what we're doing here. But the more interesting idea, and I think your comment makes a good point, is that what customers probably care more about is SLOs and SLAs, right?
Can we use some high-level metrics to drive workload or pod scheduling? I don't care what strategy is applied; I just want my pods to be scheduled and running within one or two seconds. I think that's a very interesting direction. The community is still brainstorming, but I have to admit it's still at a very early stage: how to have SLA/SLO-driven scheduling.

Hi, yeah, I was wondering if this is compatible with clusters that have the cluster autoscaler or autoscaling groups enabled. Technically, in that scenario, shouldn't a cluster have, like, infinite resources? This seems more applicable to clusters with a fixed set of resources.

So, in terms of compatibility: the elastic quota and capacity scheduling pieces, like I said, are purely compatible with any upstream Kubernetes code; you just rebuild the scheduler with the plugin and define a CRD. The job queue is a separate concept, and we try to make it extensible: the project provides different extensions that can run different workloads. Overall, we try to make it independent of, or agnostic to, whatever your application workloads or workload management system are, by providing a standard API for the job queue and by building the preemption and elastic quota in a Kubernetes-native way. But again, if there are any suggestions or concerns, or some features you feel we should have, please check out the code or submit issues, and of course, hopefully you can help us and contribute to the project and the code.

All right, well, thank you very much, Yuan. We're about at time here. Wonderful presentation, thank you very much. I wish you a wonderful rest of the event.