Thank you very much, and welcome everyone. We are delighted to be here to talk about scaling batch AI/ML workloads beyond the Kubernetes scheduler.

A quick introduction: my name is Antonin Stefanutti and I am a software engineer at Red Hat, where I work on OpenShift AI, focusing specifically on resource management for model training.

And my name is Anish Asthana. I am an engineering manager on the same team as Antonin, focused very much on resource management and distributed workloads.

For today's agenda, I will start by briefly going over the key characteristics that differentiate offline jobs from online services. We will then take an example and see how some of those characteristics make it challenging to scale the execution of batch jobs on Kubernetes with the default Kubernetes scheduler. After that we will review the different classes of batch scheduling projects that have emerged in the ecosystem to address those challenges, and finally go over some of these projects in more detail.

To start, and because it matters for the discussion we are going to have in this presentation, let's quickly look at what differentiates offline jobs from the online services that Kubernetes is known to orchestrate well.

A typical example of an online service is a web application, a microservices-style architecture, or a serverless application. These are mostly stateless applications whose replicas are homogeneous: they are the same kind of workload and they are disposable. The processing unit is the request, which gets handled by a thread or a goroutine. As a platform administrator, the KPIs or SLIs you really care about are the latency and the availability of your platform. When it comes to sizing the cluster from a compute standpoint, you take the expected load model from, say, your production environment, and it is a fairly easy job to size the cluster according to that expected load. The variability of the load is fairly limited, which means you generally adapt to it by autoscaling at the pod level.

On the other end, offline or batch jobs, such as AI/ML distributed model training, a Spark job, or a Ray job for example, are mostly stateful applications made of heterogeneous replicas, the prime example being a driver process, or a leader, that is responsible for coordinating multiple workers. The processing unit is rather the worker, which processes the data partitions or data shards allocated to it. From an administrator standpoint, you are more interested in maximizing the throughput of your platform and optimizing the usage of its resources. Sizing the cluster is also more difficult, because the activity of your users translates directly into load variability: you can have peak usage where lots of jobs get submitted at once, and adapting to that magnitude of variability generally means autoscaling at the node level, that is, at the cluster level.

To highlight the key challenges that come with scheduling batch jobs with the default Kubernetes scheduler, we have taken a fairly simple example: scheduling a thousand batch jobs, each of them having ten parallel pods, with a simulated completion duration of between three and five minutes. The key constraint is that all the pods of a job have to be ready before the job can start making progress; that is exactly the constraint you have with distributed model training, where all the worker pods and the leader have to be online for the training to progress. We also set a completion deadline of 15 minutes, meaning that when the deadline is exceeded, all the pods of that job get terminated and the job is marked as failed. We schedule those jobs on a hundred nodes, which we simulate with KWOK, a very useful project that eases performance testing of control plane components without requiring real physical servers.
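For illustration, one of these test jobs could be expressed as a plain batch Job along the following lines (a sketch: the image, command, and resource requests are placeholders, not the values used in the actual test):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job-0001
spec:
  parallelism: 10             # ten pods run in parallel for this job
  completions: 10             # the job completes once all ten pods have finished
  activeDeadlineSeconds: 900  # 15-minute deadline: past it, the pods are terminated and the Job is marked failed
  backoffLimit: 0             # fail fast instead of retrying pods
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox               # placeholder image
        command: ["sleep", "240"]    # stands in for a three-to-five-minute workload
        resources:
          requests:
            cpu: "1"                 # illustrative request only
            memory: 1Gi
```

Note that nothing in this spec expresses the constraint that all ten pods have to be ready before the job can make progress; that gap is exactly what the gang scheduling mechanisms discussed later address.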
Given all the parameters there, the total resource requests of the jobs, what is available from those hundred nodes, and the average duration of a job, we can compute the theoretical maximum throughput of the platform for the execution of those jobs, which comes out at roughly the thousand jobs completed in ten minutes. That is the key performance indicator we would hope to approach with the Kubernetes scheduler.

Here is the result of running that simple example with the default Kubernetes scheduler, and we see that a large percentage of the jobs fail. As the jobs get created, they quickly saturate the compute resources provided by the nodes, and there is simply not enough bandwidth for all the jobs to complete before their completion deadline is exceeded. As the jobs saturate the available compute, an increasing number of scheduling attempts fail: it takes an increasing number of attempts, and an increasing amount of time, for a single pod to be successfully scheduled.

Another interesting aspect of this simple test is that it also impacts the scheduling of other kinds of workloads running alongside the test jobs. The average scheduling time for pods with the same priority is impacted; on the other hand, the Kubernetes scheduler still does a fairly good job for higher-priority pods.

We are not going into the details of how the Kubernetes scheduler works today. For people who want to dig deeper, there is a good resource in the SIG Scheduling project that explains exactly how it works.

With that example, the limited bandwidth the system provides could be mitigated by increasing the completion deadline we have set, but there is a more problematic issue, which is resource fragmentation. To show it, we increased the parallelism of the jobs we schedule: instead of a thousand jobs with ten pods each, we now have only a hundred jobs, but this time with a hundred pods each. The total amount of requested compute stays the same; it is just distributed differently.
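As a rough back-of-the-envelope version of that ten-minute optimum (the average duration and the per-node pod capacity used here are assumed, illustrative figures, not the actual test configuration), note that the total amount of work is identical in both variants: 1000 jobs x 10 pods x 4 minutes is the same as 100 jobs x 100 pods x 4 minutes, about 40,000 pod-minutes in total. If each of the hundred nodes could host, say, 40 of these pods at a time, the cluster absorbs roughly 4,000 pod-minutes per minute, so the best possible makespan is about 40,000 / 4,000 = 10 minutes regardless of how the pods are grouped into jobs. Whatever changes in the results below is therefore not explained by a lack of capacity.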
We started with 51% of failed jobs, and by increasing the parallelism we now have 81%. The reason is that the Kubernetes scheduler has no idea of the constraint that all the pods of a single job have to be scheduled atomically, as a group. What happens is that we end up with many jobs that are only partially scheduled, and it turns into a kind of deadlock situation. No matter how much extra time we give to the system, it does not help: here we re-ran the test with the deadline set to one hour instead of 15 minutes, and we still have a very high failure rate. The Kubernetes scheduler just struggles; at each scheduling cycle it retries, but there is virtually no chance that, by chance, it schedules the entire set of pods of a single job.

That problem is supposed to be solved by a kube-scheduler plugin called the co-scheduling plugin. It extends the default Kubernetes scheduler at some of its extension points and sorts the pods in the scheduling queue so that the pods of each job are scheduled consecutively, and it makes sure it only binds the entire set of pods of a job when there are enough resources and the whole set can actually be scheduled.

Here we re-ran the test with the co-scheduling plugin enabled. The first thing to mention is that it does a very good job in the sense that the entire set of jobs, the hundred jobs, successfully completes. The major caveat is that it takes a very long time for all the jobs to complete, much longer than the expected theoretical maximum throughput we computed before, almost an order of magnitude longer than the ten-minute optimum. The likely reason is that each scheduling cycle operates on the entire set of pods: at every cycle it sorts the 10,000 pods, and that leads to a significant increase in the average scheduling attempt time. One positive thing to mention is that it does a good job of not impacting the other kinds of workloads that can be scheduled alongside: any schedulable pod with default or even higher priority is not affected.
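To make that concrete, the co-scheduling plugin from the Kubernetes scheduler-plugins project groups pods with a PodGroup resource, roughly along these lines (a sketch; the group name, scheduler name, and pod spec are illustrative):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-job-0001
spec:
  minMember: 10                 # only bind pods once all ten members can be placed
  scheduleTimeoutSeconds: 60    # give up on the group after a while instead of blocking forever
---
apiVersion: v1
kind: Pod
metadata:
  name: training-job-0001-worker-0
  labels:
    scheduling.x-k8s.io/pod-group: training-job-0001   # associates the pod with its group
spec:
  schedulerName: scheduler-plugins-scheduler   # or the default scheduler, if the plugin is compiled into it
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "240"]
```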
Here is the same example again, with a thousand batch jobs, but this time using one of those batch queue manager projects, in this instance Kueue. We ran the exact same test, and we see that with such a project, Kueue here, the execution of the jobs is held back and admitted in a controlled way, and it does a very good job at optimizing the usage of the available resources. It is also very efficient at executing all these jobs: the total completion time is 15 minutes, which is very close to the maximum theoretical throughput. We also see that almost all the pods get successfully scheduled in a single attempt, and similarly the average scheduling duration stays nominal. By doing that, it really relieves the default Kubernetes scheduler: the number of scheduling attempts is an order of magnitude lower, and likewise the other kinds of workload pods are not affected by whatever we do in the test; they keep a very low, nominal average scheduling time.

Last but not least, here is the example where we increased the parallelism, still with Kueue, so a hundred jobs with a hundred pods each. It does not make any difference, because Kueue operates at the job level: no matter how the job is distributed into pods, it is still able to do a very good job at maximizing resource usage, and again the completion time, the throughput, is very close to the optimal one.

So now that we have seen the challenges that come with the default Kubernetes scheduler, and an example of using a queue manager such as Kueue, I will hand over to Anish, who is going to talk in more detail about those kinds of projects.

All right. So Antonin has given us a lot of the motivation for why you would want some sort of job scheduler; now we can talk through some of the options we have. At a high level you have custom schedulers and you have queue managers. Custom schedulers replace or extend the Kubernetes scheduler, but that kind of ends up being the same thing: you are changing how scheduling works for your cluster. They operate at the pod level for the most part, and they are responsible both for queuing and admission of jobs and for the scheduling itself. That works great in practice; the only downside is that there is a pretty big impact, in that you are changing the default scheduler of your cluster. Queue managers, as the name says, are solely responsible for queuing and admission of your workloads. They usually operate at a higher level, on some CRD, so it could be Jobs, PyTorchJobs, whatever, and they are pretty lightweight: you usually just get a controller and maybe some configuration CRDs.

In our case, for OpenShift AI, queuing requirements popped up and motivated this work, and we were leaning more towards queue managers simply because there is less potential impact on customer clusters: asking you to install a custom scheduler on your cluster is going to be a little challenging.

Next up, I am going to talk through some of the options as we understand them. First up is Koordinator. Like everything else in Kubernetes, it comes with a custom scheduler, an admission webhook, and a DaemonSet running on your nodes. The interesting thing about the DaemonSet is that it profiles the actual resource utilization of your nodes, so it can really try to eke out as much usage of your nodes as possible. It provides gang scheduling via the PodGroup API: your pods will not be bound to nodes unless every single pod in that pod group can be bound to a node. It also provides custom QoS classes for when the Kubernetes QoS classes do not suffice for whatever you are trying to do. It provides quota management as well: there, they are extending the kube-scheduler ElasticQuota API with some custom annotations to support multi-level quotas. Lastly, they do have support for heterogeneous clusters: if you have different types of hardware or specialized nodes, they provide a Device API that can be used to inform scheduling, and they also support fractional allocation of your CPUs and GPUs.
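Koordinator's documentation also describes a lightweight way to declare a gang directly on the pods, using annotations along these lines (a sketch; treat the exact annotation keys and the scheduler name as assumptions to be checked against the Koordinator docs for your version):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job-0001-worker-0
  annotations:
    gang.scheduling.koordinator.sh/name: "training-job-0001"   # the gang this pod belongs to
    gang.scheduling.koordinator.sh/min-available: "10"         # bind only once ten members can be placed
spec:
  schedulerName: koord-scheduler   # assumes Koordinator is deployed as a second scheduler
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "240"]
```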
Cluster autoscaling with Koordinator is pretty standard: it just uses the built-in cluster autoscaler, and it does not have multi-cluster support. I am calling this out, but not as a con for Koordinator: we think that for most schedulers multi-cluster does not really make sense, because a scheduler should be looking at the cluster it is operating on.

Next up, I will talk about MCAD. MCAD is a queue manager. It is a controller that comes with a couple of CRDs for configuration, as well as a CRD for your actual application called AppWrapper. The way AppWrappers work is that you take whatever workload you want MCAD to queue for you and you essentially copy-paste it into the AppWrapper format; the AppWrapper just contains a little bit of additional information that is used for scheduling. This gives you a lot of flexibility: as a user, or as someone trying out a new technology, you do not have to update MCAD. All you have to do is give it the right RBAC to operate on the new CRD, and then you are good to go. One possible point of friction, though, is that as a user I may be familiar with the upstream CRs, and now I have to change my processes to work with MCAD. MCAD provides all the core things you care about with queuing, so preemption, borrowing, priorities, gang scheduling. It also provides multi-level quotas, and it uses Kubernetes extended resources for dealing with specialized hardware requirements. It supports autoscaling via the MachineSet or MachinePool APIs on OpenShift, and it does have multi-cluster support, which, as a queuing system, it is better placed to offer, since it can talk to other clusters.

Next up, I will talk about Volcano; I am sure most people here have heard of Volcano. It is a custom scheduler that comes with a controller and some admission webhooks. It is a little different in that it also has its own Job API, which for the most part is just pod template specs with some additional scheduling-related metadata. They have a lot of community integrations as well, so KubeRay, Spark, and so on all have Volcano-related integrations, and I am sure the list is much longer. They support all the core things you care about with queuing. The one thing I would call out is that Volcano uses custom APIs for the most part for everything; these APIs are pretty comparable to what is upstream in the community, but they live under the Volcano umbrella. They have quota management as well, with single-level quotas, and one interesting thing they do with queues is that you can have proportional queues: you can allocate, say, 30% of a cluster to a queue, and as the cluster scales up and down that queue's share automatically gets readjusted. It uses Kubernetes extended resources, again for heterogeneous clusters, and it is NUMA-aware, so you can do more fine-grained allocation of your CPUs. It supports autoscaling via the cluster autoscaler, and I think there was a talk earlier today about its multi-cluster support, where it delegates to external projects.
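As an illustration of the proportional queues and of Volcano's own Job API, a minimal setup might look roughly like this (a sketch; the queue name, weight, and job contents are illustrative):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a
spec:
  weight: 3            # proportional share, relative to the weights of the other queues
  reclaimable: true    # idle capacity can be reclaimed by other queues
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training-job-0001
spec:
  schedulerName: volcano
  queue: team-a
  minAvailable: 10         # gang constraint: start only when ten pods can run together
  tasks:
  - name: worker
    replicas: 10
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: busybox
          command: ["sleep", "240"]
```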
The last custom scheduler I will talk about is YuniKorn. YuniKorn can be deployed in two different ways, either standalone or as a plugin to the Kubernetes scheduler, but the end result is the same: you are really changing the behavior of your scheduling. It comes with a controller and some admission webhooks, it operates on any pods, and the main thing it does is make sure that incoming pods get transformed so that their scheduler is set to YuniKorn. It provides all the core queuing capabilities you care about. One interesting thing it does is placeholder pods: when you have a job or workload waiting for admission, YuniKorn starts creating placeholder pods on your nodes to reserve resources, and the minute there are enough placeholder pods on the nodes for the workload to actually run, it replaces those placeholders with the actual workload pods, so you basically get a guaranteed execution at that point. For quota management they have a custom configuration language, and they do the same thing for access control. It is a bit of a divergence from Kubernetes, but it also lets you do some things that normal Kubernetes RBAC does not let you do. It supports Kubernetes extended resources and it works with the cluster autoscaler as well.

Lastly, I will talk about Kueue. Kueue comes with a number of CRDs related to configuration, and it operates on a predefined set of workload CRDs, which means that if you want to start queuing a new type of workload with Kueue, you are going to have to make code changes, both in Kueue as well as possibly in whatever upstream CR you want to support. The benefit of this approach, though, is that as a user you really do not have to do much: if I am used to submitting normal PyTorch jobs, I just continue doing that, and whether Kueue is running or not does not really affect me, beyond the queuing capabilities it adds. It provides all the core queuing capabilities you care about, and it is also extensible using the AdmissionCheck API, so for cases where Kueue does not meet your needs, you can create a custom plugin for it. It provides quota management as well; the quota management only has two levels of quotas, and it allows you to share and borrow between quotas and all that good stuff. For heterogeneous clusters it provides a ResourceFlavor API that is also used at admission time: you can submit a workload saying, hey, I want this flavor of resources, and Kueue will make sure that the right tolerations and node selectors are applied to your workload as it gets admitted. It provides autoscaling capabilities via the Cluster Autoscaler ProvisioningRequest API, and they just recently introduced alpha support for multi-cluster.
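A minimal Kueue setup along those lines might look roughly like this (a sketch: the flavor name, queue names, node label, and quota values are illustrative assumptions, not recommendations):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100                       # hypothetical flavor name
spec:
  nodeLabels:
    accelerator: nvidia-a100           # assumed node label; Kueue injects the matching node selector on admission
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  namespaceSelector: {}                # accept workloads from any namespace that has a matching LocalQueue
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: gpu-a100
      resources:
      - name: cpu
        nominalQuota: 400
      - name: memory
        nominalQuota: 1600Gi
      - name: nvidia.com/gpu
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: team-a-ns
spec:
  clusterQueue: team-a-cq
---
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-0001
  namespace: team-a-ns
  labels:
    kueue.x-k8s.io/queue-name: team-a  # the only change on the user side: point the Job at a LocalQueue
spec:
  suspend: true                        # Kueue un-suspends the Job once quota is available
  parallelism: 10
  completions: 10
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sleep", "240"]
        resources:
          requests:
            cpu: "4"
          limits:
            nvidia.com/gpu: 1          # counted against the gpu-a100 flavor quota above
```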
I might be missing with something but is there any of them that predict the Workload how much for example time does it need To complete on on the workload not on the node itself So because it's a common practice in a scheduling And I want to see if there is any specific things for AI in this term like AI workloads so if your question is about like providing an ETA for a job like At the moment At least as far as I know they don't provide some such an information given that They run like arbitrary workload and and they don't have the knowledge about what is being actually Scheduled or Q on the other end you have some project that provides some visibility on on on the queue Like the position of the job in the queue Etc. I think that's the next step actually for this project like to provide those kind of like We could call it like an intelligent matrix But based on because they would have to learn about The workload they are managing basically some are like Be able to capture some insight of them and be able to like doing some even you know The same kind of stuff that you do with time series forecasting or stuff like that You try to like find similarities and then you can provide some some estimate But at the moment, this is a separate concern as far as I see it But it's generally a typical request that we have from end user like from so this project generally They have like two personas like you have the platform administrator the one that operates the project and generally You really want to make sure the platform is stable. You have a good support and and that is Basically what he care was about but then you have the actual end user like the data scientist or whatever is submitting job and then That kind of matrix that this is the matrix. They are really care about like on their own for their own job Well, I think we're at time as well. So thank you very much for your time. Thank you