Welcome everyone. I'm fully aware that we are very close to lunch, so we will try to speed this up so that you can focus on the important things. My name is Michał, I have Marcin here with me, and today we will quickly go through what the Batch Working Group has been up to over the past year and what the plans are for the upcoming features.

Quickly, the most important things first: when we meet and where you can find us. We meet every other Thursday, at times that work for various time zones, not all of them but a lot. We have a Slack channel on the Kubernetes Slack, which is probably one of the best ways to reach us, as well as an email group; both have reasonable volume, so feel free to sign up and join both.

So what are we up to in the first place, what is the Batch Working Group about? This is literally a space for all of us, for the entire community, to discuss topics related to broad batch workloads and approaches in the Kubernetes ecosystem. Our goal is to ensure that people who are working on similar problems do so, more or less, in one single place, rather than each of us building a solution on our own in various different locations. That is an important thing; sharing our opinions and our solutions is one of the main goals of the Kubernetes community as well. We are very open to hearing what people are struggling with, what kinds of solutions they are implementing, and how they are implementing them. We are trying to be very open, and we invite everyone who is already running some kind of batch workloads on their Kubernetes clusters, or who is currently struggling to do so, to show up, present their ideas, their solutions, their tooling, and discuss how we could all collaborate to create a common set of tools. Obviously there won't be one single tool, one single solution, but we would like to somehow combine all those ideas into a common, shared set, because together we can definitely reach better and much more approachable solutions.

The key stakeholders are basically the special interest groups Scheduling, Apps, Node, and Autoscaling; I'll quickly mention why each of them. If you look at the scope and the SIGs that I just mentioned, you quickly notice that the batch APIs fall under SIG Apps, which is responsible for all the controllers. Within batch workloads we are focusing specifically on Jobs and CronJobs, extending those APIs, and I'll be talking a little bit more about that in a moment. We are also working on queuing primitives.
So we're trying to combine the knowledge that we have from the scheduling SIG as well as from the controller side, SIG Apps, to improve the scheduling and queuing primitives. We are working with folks from SIG Autoscaling to better utilize cluster resources, to pack more workloads in, to literally get as much as possible out of the clusters. And lastly, with SIG Node, who are primarily responsible for the node, we are working to allow additional, specialized hardware to work with Kubernetes clusters.

So, quickly, a couple of features that we've recently added to Jobs. One of them is elastic Indexed Jobs. If you remember, Jobs, whether Indexed Jobs or regular Jobs, had a limitation that you cannot modify the completions once you set a particular value. You can modify the parallelism, which basically means how many pods are created for a Job at any given point in time, but you could not modify the completions. There are some use cases where we would love to be able to modify that, so we extended the Indexed Job API with the ability to modify completions and parallelism, but you have to do it together: if you combine those changes into one update, it is a valid action and you are able to do it. This has been a beta feature since Kubernetes 1.27, so basically two versions ago, and beta basically means the feature is available by default in the cluster.

The next feature, which has been in the works for three or four releases by now, and which we will probably keep iterating on a little bit more, is the pod failure policy for Jobs. Previously Jobs had a very strict policy: either the pod succeeded or it failed, and it counted one way or the other. The author of a Job did not have control over identifying certain failures as being not important. The pod failure policy allows the author of the Job to specify that certain pod conditions, whether the pod was evicted due to node resource exhaustion or because there was an image pull failure, so a set of pod conditions or container exit codes, should be treated as expected failures: don't count them against the limit, but actually retry those particular failures. This allows us to improve the success rate of a lot of these jobs.

Another recent addition is the pod replacement policy for Jobs. Normally, pretty much every single controller in the Kubernetes world, and the Job controller was written in a similar fashion, will replace a pod as soon as it reports that it's not healthy or it is planned to be deleted. The reason for that behavior was to ensure that the application stays available as much as possible, all the time. But there are some cases, especially in batch use cases, where you are running at your limits, or you have an assigned quota where you can only fit this much and cannot exceed it; for those cases, replacing pods immediately is not always acceptable. Also, when you're using specialized hardware, you actually need to wait for the hardware to be released to be able to reuse it. So for those cases we introduced a new pod replacement policy, which allows you to wait for the pod to actually reach a terminal state, meaning it either succeeded or failed, and only then replace it, rather than replacing it as soon as it starts terminating.
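To make those three Job-level features a bit more concrete, here is a minimal sketch of an Indexed Job that combines them. This is an illustration rather than something from the talk: the image, the container name, and the exit code 42 are placeholder values, and podReplacementPolicy may still require its feature gate to be enabled depending on your Kubernetes version.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-batch-job
spec:
  completionMode: Indexed
  completions: 4                  # with elastic Indexed Jobs, can be scaled together with parallelism
  parallelism: 4
  podReplacementPolicy: Failed    # only create replacements once pods are fully terminated
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob             # exit code 42 is a placeholder for "non-retriable" errors
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    - action: Ignore              # evictions/disruptions don't count against backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never        # required when a podFailurePolicy is set
      containers:
      - name: main
        image: registry.example.com/worker:latest   # placeholder image
```

With a spec like this, a node-drain eviction doesn't burn a retry, the placeholder exit code fails the whole Job immediately, and replacement pods only show up once the failed ones have fully terminated.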
One additional thing that I wanted to talk about is JobSet. It's not a default feature in Kubernetes, it's not standard core Kubernetes; it's an add-on, so that's an important distinction: you have to explicitly install it to have it working in the cluster. If you've ever been working with Jobs, an important limitation of a Job is that it only allows you to provide a single pod template. JobSet is the next step further in that journey, because it allows you to specify multiple templates, and then based on that you can configure what the success policy for those various templates looks like. So you can have a separate leader job and worker jobs, and then, based on the requirements of your workload, you can specify which failures or which successes actually matter for completing this particular JobSet. And because we are putting this all under a single package, you can unify the lifecycle of the whole group.
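As a rough sketch of the shape described above, here is what a leader/worker JobSet could look like. The API group and version (jobset.x-k8s.io/v1alpha2) and the images are assumptions on my side; check the JobSet documentation for the exact schema of the release you install.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2   # assumed version; may differ between JobSet releases
kind: JobSet
metadata:
  name: training-demo
spec:
  successPolicy:
    operator: All                      # the JobSet succeeds once the targeted jobs succeed
    targetReplicatedJobs:
    - leader                           # only the leader's completion matters for overall success
  replicatedJobs:
  - name: leader
    replicas: 1
    template:                          # a regular batch/v1 Job template
      spec:
        completions: 1
        parallelism: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: leader
              image: registry.example.com/trainer:latest   # placeholder image
  - name: workers
    replicas: 1
    template:
      spec:
        completions: 4
        parallelism: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/trainer:latest   # placeholder image
```

Because both Jobs live in one object, deleting or suspending the JobSet acts on the whole group, which is the lifecycle unification mentioned above.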
So with that, I'll pass it over to Marcin.

Hey. So probably the biggest endeavor that we have in the Batch Working Group is a project called Kueue. It was started about two years ago and it is trying to solve a number of problems that batch users face. For example: which of the training jobs that they have in the cluster should be running at a given time, on the limited amount of resources that they have; how to ensure that all pods that belong to a particular job start quickly, because many training jobs require all pods to be running to actually proceed, and having 99% of the pods running is not good enough. Another problem is how to allow users to use as many of, let's say, spot instances as they want, but to limit their use of on-demand or reserved capacity; regular Kubernetes quotas are not very rich in terms of specifying what users can do, they just allow you to specify one number for the entire namespace. And another problem is how to do all of the above without replacing regular Kubernetes components like the scheduler or other controllers, and to do it in a cloud environment where nodes come and go and are added when needed.

So for that, Kueue was introduced. Kueue is basically a kind of batch job scheduling and admission system that decides which of the jobs you have in the cluster should run at a given moment and on what type of machines. So it is kind of similar to the Kubernetes scheduler, but there is one major difference: the Kubernetes scheduler works on pods and tries to put those pods on nodes. Kueue doesn't put any pods on nodes; it decides whether there is enough space in the cluster to run the job, and then leaves the actual scheduling to the scheduler.

Kueue provides advanced resource controls: you can specify quota for a particular set of resources, for example for your reserved capacity, or for a particular type of GPU, or for licenses that you may have for your computations. It also allows you to do some quota sharing between teams. In regular Kubernetes you can only have one quota per namespace, and if you are out of quota in your namespace then, well, you are done; you cannot borrow from the other team, even though they are not running much in their namespace. Moreover, Kueue allows you to have different policies around preemptions, how they are performed, and how the resources that you might have borrowed in the previous steps are reclaimed when the owner of these resources is in need.

And, what's important, Kueue doesn't replace any Kubernetes components, so it works with the regular Kubernetes scheduler, it works with the cluster autoscaler, it works with Karpenter; whatever you have there, it will work with it.

Moreover, one of the main goals of Kueue was to meet users where they are. To do that, Kueue provides integrations with the major APIs that are out there. So, first of all, we have built-in integration with regular Kubernetes Jobs; we have integration with the Kubeflow training job portfolio, so MPI jobs, PyTorch jobs, TensorFlow jobs and all the others; we understand Ray jobs, JobSets, and we even work with standalone pods, if that's what you have. By integration we mean that we understand what the requirements of these jobs are, how much capacity they will need when they run, and we are able to start and stop them when needed, so that the quotas you defined in your cluster are not exceeded.

OK, so the Kueue resource model is relatively simple. You admit jobs via queues, as the name suggests. If you create a job, you just need to specify which queue it belongs to, and Kueue will automatically pick it up. For each of the queues you can specify a quota for a particular set of resources; for example, one queue may have 50 CPUs, 20 NVIDIA A100 GPUs, and 200 gigabytes of memory at its disposal. As long as this quota is not exceeded, it can admit workloads. So the flow here is relatively simple: a workload comes in, it has a label saying which queue it belongs to, and we check the quota. If there is enough quota, the workload is started; if there would be enough quota if some preemptions were made, then those preemptions are done; and if preemptions would not help, the workload is suspended and stays in that state until there are enough resources in the quota, for example because some of the earlier started jobs have just finished.
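As a minimal sketch of that model, assuming the kueue.x-k8s.io/v1beta1 API: the flavor name, namespace, and quota numbers below are placeholders that simply mirror the example above.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  namespaceSelector: {}              # accept workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 50
      - name: "memory"
        nominalQuota: 200Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 20
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a                  # placeholder namespace
spec:
  clusterQueue: team-a-cq
---
# The job just points at the queue; Kueue keeps it suspended until quota allows it to run.
apiVersion: batch/v1
kind: Job
metadata:
  name: quota-demo
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  suspend: true
  completions: 2
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.example.com/worker:latest   # placeholder image
        resources:
          requests:
            cpu: "4"
            nvidia.com/gpu: "1"
```

The job is created suspended; Kueue flips suspend to false once the requests fit within the ClusterQueue's remaining quota, and it suspends or preempts workloads according to the queue's policies.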
But this basic quota model is not enough to cover all of the use cases that batch users may have. So in Kueue 0.5, which we released just a week ago, we introduced a kind of plug-in mechanism where you can influence whether a workload is admitted or not. Instead of just doing the basic quota check, we allow you to define, for each queue, some additional checks that need to pass in order to admit workloads. These additional checks can be implemented either in Kueue or in a standalone controller that is provided by you and matches your company's needs. For example, it could block workloads that have exceeded their monthly budget: you could issue a Prometheus query in this admission check, check how many CPU hours this particular user used in the past month, and based on that decide whether the workload can run or not. Or we can use this mechanism of additional checks to talk to the cluster autoscaler.

The quota mechanism gives you some form of guarantee that your workload will be started: if you have quota and the workload fits in this quota, there is a good chance that it will actually start. But the reality is that, for example, many clouds are struggling with the availability of these new fancy shiny GPUs, and even though you have quota, the GPUs may not be available on demand immediately. So, in cooperation with SIG Autoscaling, we introduced a thing called ProvisioningRequest. It's an open source API to ask the cluster autoscaler, or any other autoscaling controller, to ensure that there is space for a given set of pods. In the API you provide a pod template, you provide the number of pods, you decide which type of machines you would like to have and what mechanism should be used for provisioning, and then the cluster autoscaler will try to make sure that these resources are provided. The ProvisioningRequest stays in a pending state until the cluster autoscaler is sure that these resources are there. Kueue will monitor it and keep your workload suspended until the cluster autoscaler gives it a green light to proceed.

This strengthens the atomicity guarantees around gang scheduling that Kueue provides: not only will we be checking quota, but we will also be talking to the underlying infrastructure to make sure that you have the resources for the job that is being started. So we don't partially start jobs, and we will not waste your money.

There are three provisioning classes being worked on in the cluster autoscaler right now. The first one just checks the capacity without actually provisioning it; it's a class that could be used, for example, in an on-prem scenario. Previously the cluster autoscaler didn't make a lot of sense in an on-prem environment, but if the cluster autoscaler can give you answers about whether a job could be scheduled or not, then it starts to make sense there too. The second class that we are working on is a generic scale-up, which will try to scale up the node groups in your cluster, and if it doesn't succeed, or if your cloud provider returns an error, we back off. And the third one is GKE-specific; that's a new thing on GKE, based on queued resources that can be requested on Google Cloud Platform.
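For illustration, a ProvisioningRequest is roughly shaped like the sketch below. Take it with a grain of salt: the group/version, the check-capacity class name, and the pod template are assumptions from memory, and when Kueue drives this it creates these objects for you through an admission check rather than you writing them by hand; the authoritative schema lives in the cluster-autoscaler repository.

```yaml
# A standalone PodTemplate that the request references (names and requests are placeholders).
apiVersion: v1
kind: PodTemplate
metadata:
  name: training-pod-template
  namespace: team-a
template:
  spec:
    restartPolicy: Never
    containers:
    - name: main
      image: registry.example.com/trainer:latest
      resources:
        requests:
          nvidia.com/gpu: "8"
---
apiVersion: autoscaling.x-k8s.io/v1beta1   # assumed group/version; may differ per cluster-autoscaler release
kind: ProvisioningRequest
metadata:
  name: training-capacity
  namespace: team-a
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io   # the "just check, don't provision" class
  podSets:
  - podTemplateRef:
      name: training-pod-template
    count: 4                               # the whole gang of four pods must fit at once
```

The request stays pending until the autoscaler reports that the capacity is, or can be made, available, which is the green light Kueue waits for before unsuspending the workload.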
OK. The other thing that we are working on is tackling the problem of multi-cluster in batch environments. One reason we need multi-cluster is, again, to help with the GPU obtainability problems, especially for spot and on-demand capacity, which has different availability in different locations at different times. We would also like to help users that have many clusters in a single region, or multiple clusters in multiple regions, or are even on multiple clouds and on-prem at the same time, and would like a convenient way of starting their jobs in all of these places.

So basically the user has a number of clusters, each of them of course running Kueue and maybe the cluster autoscaler, and we would like to put the job definition in the cluster that is most likely to admit it. One approach we could use is to create the job definition in all of the clusters, hoping that one of the clusters will eventually pick it up, and then remove the job from the other, remaining clusters; the job continues execution in the cluster that picked it up initially. Of course, you don't want to create these job definitions in all the clusters manually; you would like a controller that does it for you. Such a controller could be Kueue running in a management cluster. Then, using the admission checks that we talked about earlier, it would create the job definition in all of the worker clusters, see which cluster admits it, and remove the job definition from the other clusters. So you would only submit the job to the management cluster, and it would be dispatched to one of the worker clusters. Kueue will also make sure that the job status in this management cluster reflects the status of the job running in one of the worker clusters, so you can monitor only your management cluster and get the full status of the job there.

That is one architecture, but probably you would like something like this without a dedicated management cluster, where one cluster shares the role of management cluster and worker cluster and distributes the jobs across your other clusters. We are currently working on the first architecture; we have started working on it and hopefully we will have a working prototype in the next Kueue release, which will come somewhere in the January to February time frame. Eventually we are aiming at the second one, but that will obviously come later, because it's a slightly more complicated approach.

The pros of MultiKueue are that no new APIs will be needed for running your jobs; it will work with all of the Kueue integrations we talked about earlier; it will use the same binary in the worker and management clusters; it will work with autoscaling via provisioning requests; and hopefully it will work across regions, clouds and on-premise. But it won't try to address storage problems: it will be a good tool for jobs that are more or less location-flexible and don't require petabytes of data that need to be in the region. For the first release you will need to set up a management cluster, you will need to ensure that roles and authentication are set up correctly between the clusters, and you will need to sync the queues and namespaces across all of them. That's our idea; that's what we are working on,
and that's what we are still discussing. So if you have any comments, concerns, questions, or ideas for improvements, please reach out, either to us after the session or to the batch working group.

But this MultiKueue thing is not the only thing that is coming to Kueue in the near future. In Kueue right now we have a two-level hierarchy of quota, but it's not enough for many companies that would like to build a queue structure for multiple teams and multiple sub-teams with complicated borrowing and sharing rules, so we are starting to work on a hierarchical quota structure in Kueue. We learned that lots of users are happy with the tools that Slurm provides, which allow you to talk to your queues, check their status, and have a really nice experience doing that, so we are hoping to release a dedicated command-line tool for Kueue soon. We are working on hybrid resource assignments: right now Kueue assigns only one type of resources to your jobs, but maybe some jobs could run on different architectures, on different GPUs, at the same time. We hope to have some budgets implementation, and we hope to enhance visibility into what is going on in your batch system, in your queues, via more statuses and dashboards. And we are planning a number of other features.

So, where to find us: kueue.sh is the entry page, and we are on GitHub. If you have questions, the batch working group channel on Slack would be great, and, as a reminder, we run bi-weekly meetings of the batch working group on Thursdays; that would be a great place to ask questions if you have any about Kueue.

OK, and now it's time for your questions. There is a mic in the center of the room, so please use it.

Question: Can you hear me? Yes. OK. Thanks for the great update. I'm Yuen Cheng from Apple, and I have two questions. The first one: I really like the new Kueue features and everything, but can you comment on the other independent, external batch systems, like YuniKorn and Volcano, and I know people even run Slurm and HPC schedulers on top of Kubernetes. How do you think the ecosystem will evolve in the future? There are now the Kubernetes-native efforts, the features and functions you were talking about, but there are also these other efforts, other independent systems, trying to address this; they have some advanced scheduling features, quota management and other things. The second question is more about AI in particular: given the emerging generative AI and large language models, are there any specific efforts to address the needs of these large-scale AI workloads, like heterogeneous hardware, dynamic resource allocation, network topology awareness, all these new features? Any efforts about that? OK, thank you.

OK, so the first question: why do you need Kueue if you have Volcano, YuniKorn and Slurm? Well, that's right, but all these tools solve slightly different problems and have slightly different design principles. For example, Volcano is targeted mainly, in my opinion, at on-prem environments where things in the cluster don't change much; it doesn't work particularly great with autoscaling, and one of the principles we had during Kueue's design was to make sure that it works well in cloud environments. This also, kind of, applies to YuniKorn. And Slurm is a completely different business; it has been around for years, if not decades, and it has its own users. So why would you need
Kueue if you could have Slurm, for example? Because you might just want to have one environment in your company. If you pick Slurm, then you might need a separate solution for your serving services and deployments, and with two environments you will have two sets of IAM roles, duplicated effort in maintaining all of these environments, duplicated training, duplicated procedures, and so on and so on. Slurm at this moment provides many more features than we provide: it has very advanced scheduling capabilities, it has great tools, all of that is really nice. But the problem is that it's yet another environment that you need to maintain, and it's not great for everyone. Some people don't have such sophisticated needs for so much control over how things are executed, and they might be fine with a slightly simpler solution like Kueue, which will address their needs and allow them to have just a single type of environment in their company.

OK, the second question: can you comment on any efforts or work in progress for AI-specific workloads, for example network-aware scheduling? That is more on the scheduler side, there is SIG Scheduling, so we don't have any effort started there ourselves. It is somewhat addressed by JobSet, but we understand that more things will need to be developed to ensure proper management of the groups inside a large language model training job: if a single pod fails, then maybe a whole group of pods needs to be moved elsewhere. So there are a lot of quite complex needs that still need to be addressed. We are gathering user needs, but at this moment, apart from JobSet, we don't have anything ready. If you have any particular needs, please talk to us after the session; we would be more than happy to learn about them, and hopefully we will have a good solution for you in the future.

OK, I'll comment a little bit on the support for specialized hardware.
It was also mentioned earlier today during the keynote that we are working on the dynamic resource allocation (DRA) feature. The problem with that one is that it is deeply embedded in the kubelet, and there was a session yesterday during the contributor summit where we were discussing how extensible the Kubernetes API is, but at the same time how tightly coupled the kubelet is, and how hard it is to actually extend functionality within the kubelet, which is critical when it comes to actually using various hardware features. I personally got to review the controller side of DRA a couple of days before one of the freezes, the previous one I think; it's a massive change in the Kubernetes code base. We are making progress on that front, but I think it will take a little bit longer until we get to the point where we are satisfied with the state of the feature and can mark it as stable. But if you can, and I would encourage each and every one of you, try to use alpha or beta features in their early stages, in your staging or testing environments, provide feedback, and see whether your use cases are fulfilled or where the gaps are. Where you think it would be beneficial for the entire community, either show up at the batch working group and discuss those topics, or go to the specific SIGs, because a lot of DRA is happening in the kubelet, but not only there; like I said, it crosses several SIGs.

Thank you very much. We have time for one last question.

Hi, this is Abhishek from IBM Research. I have a question on the JobSet API. At least on the screenshot, I saw that JobSet supports multiple Job templates. Is there any roadmap to support arbitrary CRs in a JobSet? For instance, can I put Spark templates inside JobSets?

It's funny that you're bringing this up, because that was also brought up yesterday during the contributor summit. There is an issue, if you look back in the Kubernetes repository, about being able to use Jobs, or not only Jobs but sometimes other controllers, as wrappers around literally any kind of generic resource, CRDs or otherwise. As it currently stands, the API side of things is not as flexible as we would want it to be to allow us to support this kind of extension, and that's why, for now, we're focusing on Jobs specifically.

All right, thank you. Thank you very much, all.