Yeah, so welcome everyone to our bi-weekly meeting today. We have Abdullah from Google presenting the work that has been done on supporting Kubernetes-native job queuing. Maybe Abdullah, you can also briefly introduce the Batch Working Group and the work that has been put in there, because it's on topic. The link has been circulated, but just to put it in context as well. Otherwise I suggest we listen to Abdullah and then we should have plenty of time for discussion. This should be a good one. (This camera... this is the one that's working, this one's not.)

Yeah, thank you for having me. So I'm Abdullah. I'm a contributor to Kubernetes, a co-chair of SIG Scheduling and of a recently formed working group within Kubernetes called the Batch Working Group. I work for Google, as part of the GKE team, and within Google I'm focused on batch as well. As Ricardo mentioned, a couple of months ago we proposed forming a new working group within Kubernetes to focus on batch, to reduce the fragmentation in everything related to batch workloads within core Kubernetes, and to try to make batch a first-class citizen of Kubernetes. We feel that until recently batch has been a guest on the platform more than it has been at home, the way services are, and we want to push that use case forward.

The goal of the working group, based on the charter we've agreed on, is threefold. One is to come up with reasonable APIs to start jobs. We already have an API, the Job API, but we're looking to improve its capabilities, its reliability and scalability, and its applicability to various types of batch workloads: how it can be reused, for example, to reduce the fragmentation we have in the community in building job APIs. How can we run MPI workloads on top of the Job API? How can we run TensorFlow or reinforcement learning workloads on top of it? So that's one pillar.

The second pillar is job-level management. Most of the components within Kubernetes are pod-centric, whether that's the scheduler, the autoscaler, or even quota: they mostly work at the pod level, and that does not lend itself well to jobs and batch workloads in general, where most of the time you want to manage the whole job, not just a single pod. This is part of what I'll be discussing in this presentation on Kueue.

The third pillar is mostly focused on HPC, mostly node-level enhancements to use special accelerators and special types of hardware, and how that works with scheduling, like NUMA-aware scheduling, or how we better use FPGAs. They have their own quirks, like resetting the FPGA before a pod can use it, and all these kinds of hooks that allow special hardware to be better used within the Kubernetes ecosystem.

Do you have any questions about the working group? Thanks Ricardo for posting the PR for the working group; the charter has been merged. I will also post links to the mailing list, the Slack channel, and the poll. We're trying to decide on what time the meeting is going to be; it's probably going to be on a weekly or bi-weekly basis, for one hour. On that note, last time I checked in with Klaus,
I think I was the only person who responded to the Doodle to try and pick the time, so please respond if you're interested in the conversation. There is a channel in Slack. What is it called, is it batch-wg or wg-batch? It has several entries now. No, it's batch-wg; batch-wg is the one that I think is now the channel within Kubernetes for this working group. Is that the one connected to CNCF? Oh no, I think we're talking about two different things. This is the Kubernetes Batch Working Group, and then there's an initiative which is the CNCF batch system initiative, and there will be some coordination to be done between the two. But yeah, I was talking about the Kubernetes one. Interesting, we're all at the same point. Yeah, got it.

Yeah, and to your point, one of the things we're planning to do as well is to try to help defragment the community and reach out to the CNCF, and that's what we're trying to do here: presenting what we're planning and discussing within core Kubernetes to the larger community, and making sure we're aligned in our efforts. Got it, thanks.

So, without derailing the discussion, can somebody, in a couple of sentences, say what the difference is between those two groups you just mentioned, the batch group in Kubernetes and the one in the CNCF?

So, I don't know much about the CNCF one; I didn't read any charter for that group. But the Kubernetes one is focused on core Kubernetes enhancements: what do we do with the core Kubernetes feature set to make it easier to run batch workloads? The people working on it are leads in SIGs within Kubernetes. The working group is going to set recommendations on how to improve core Kubernetes to better execute batch workloads, and the individual SIGs, like SIG Scheduling, SIG Apps, SIG Autoscaling and SIG Node, will take on the execution of those enhancements. That is the goal of the Kubernetes working group, but I can't speak too much about the CNCF one.

Maybe I can say a couple of words. We had a few discussions about this in the TOC as well, and the goal is really to try to promote progress in this area as much as possible. This can happen in the Kubernetes core, like Abdullah was describing, but there are quite a lot of initiatives in other projects in the CNCF that can have a different release cycle or focus than the Kubernetes core. So the goal is really to see how these two groups go and try to align them as much as possible, because if functionality gets into core, it's probably good for everyone to reuse it. But we should maybe review this in six months or a year and see how things align. I don't know what everyone thinks, but if you have any feedback, the issues are posted in the chat, so feel free to push feedback there. The CNCF initiative is within TAG Runtime.

I would add also, this is Maciej, also working with Abdullah on GKE at Google: we are deliberately declaring a lot of very important things as out of scope for the Kubernetes group. We know, for example, that there's no intent in Kubernetes to handle workflows, and many other aspects of batch job orchestration, the experience of how a researcher or data scientist interacts with the whole stack.
So we indeed need to coordinate between the two working groups, but there is a lot of scope of work to be done that's way broader than Kubernetes, where there is a need for leadership and coordination to drive it, and it is declared as out of scope for the Kubernetes working group. The Kubernetes working group wants to make sure the primitives related to Kubernetes work very well with things outside, and we do need to work on drawing the lines and making sure it's well coordinated, but I think it makes sense to have separate, very focused and clearly defined work streams.

You'll be pleased to know that I added MCAD to the charter that we drew up for the higher-level CNCF working group. That way we'll talk about things like Armada and Volcano and MCAD, and how we should all be working together on those pieces. A huge part of working together will also be watching, reacting to and contributing to what the Kubernetes working group is doing, because that will play into our eventual aims. But yeah, there is, as we know, a whole bunch of other pieces to think about, like multi-cluster stuff and how we've all solved that, so those discussions can take place at the CNCF working group that we're trying to put together. Great, thank you, and thank you for adding MCAD to the list. Yeah, I wouldn't forget you.

Sounds good. With that said, the main attraction I'm here to present is Kueue. This is a new proposal that, again, focuses on the second pillar I mentioned for the Batch Working Group: job-level management within Kubernetes. I won't start with a discussion of what a job is and how we define it, since this audience is already aware, but quickly: we think of a job as a computation that runs to completion. It's basically a group of pods that either run independently, for example simulations, or collaboratively to process a task, whether that's an MPI job or a reinforcement learning job where you have workers and drivers.

One important aspect we're focused on is the fact that jobs are a type of workload that is often flexible on multiple dimensions: flexible on time, when the job could start; sometimes flexible on location, like which zone it could run in; and even on the type of resources. Type of resources could mean the type of provisioning, for example Spot or on-demand, or the type of accelerator, like whether it could run on GPU model X or Y. Even on-prem clusters have flexibilities in one way or another, but in the cloud this becomes a bigger issue, because in the cloud we have way too many types of resources that users look at to manage the trade-off between performance and cost. So this is a focus for us as well;
it's a problem we want to solve. At a high level, what is job queuing, or the type of job queuing we're looking at, and how do we define it? Basically, what we're trying to do is have mechanics and mechanisms to manage access to a limited pool of resources shared by multiple tenants. What job queuing does is decide which jobs should wait and which can start now, based on a number of constraints.

And why do we need job queuing? Again, on-prem this is clear: you have static and sometimes small-scale clusters. But in the cloud it's sometimes less clear why you need queuing, since people sometimes think of the cloud as infinitely scalable, able to absorb every single workload you have. That's not true. There are a number of aspects here. One is utilizing discounts: cloud providers offer discounts if you pre-declare how much resource you want. In Google, for example, we have something called committed use discounts: you can pay to use a number of cores over a three-year or one-year period. Now that you've paid for them, you always want them used, and you don't want to use more than them. So you've basically created your own static cluster within the cloud, and you want to manage access to those resources.

Another thing is that users have spending limits. They can't just keep executing every single job that gets created; they want to control their budgets, and so they have spending limits. They also want to introduce per-tenant limits: the users that run batch workloads are big organizations with different research groups, etc., and you want to set limits per tenant, even in the cloud. And last but not least, we have cluster size limits. Kubernetes itself cannot scale infinitely; in GKE we support up to 15,000 nodes, but in many other instances you can't scale beyond 5,000 nodes, or even 1,000 nodes, depending on your workload.

So what exactly do we think users want from queuing? Obviously, you want queuing itself: jobs that don't fit existing capacity should simply wait, and execute when capacity becomes available. You want knobs where users can decide on execution order. You want knobs for fair sharing of available capacity between multiple tenants. Budgeting is not only about how much resource you can use at a specific point in time, but also over a period of time. You also want the ability to set policies: who can use which types of resources, and up to what limit. We have customers, for example, who open the tap for their users on preemptible or Spot VMs, where you can use as much as you want and run as many jobs as you can, but when using on-demand you have a specific limit, or you can't use it at all. Same thing with GPUs: those are expensive, scarce resources, and you don't want to hand them to just any tenant. And last is flexible placement, again across different resource types, locations and times: when your job is submitted to the queue, you want the ability to start the job based on what is available in your infrastructure and the flexibilities of the job declared by the user.

Any questions? Do those requirements resonate, do they capture
the use cases that you have in mind for queuing?

There's a comment in the chat I think is pretty relevant; maybe Tim, can you mention the storage one? Yeah, just in general, researchers like to buy storage up front in large chunks, especially if they have a grant. If you have an NSF grant and you have to keep your data for five years, and your grant is over in two years, there's no way you can do that in the cloud. Nobody in the cloud will sell you storage ahead of time.

Gaurav, you have a question, go ahead. Hi, so is this concept of a queue sitting above a cluster, where when a job comes in it says, okay, I queue it, and then it gets dispatched to this cluster or that cluster? Or is it within the cluster?

I will get to the APIs in a second, but conceptually, initially we will implement it as a within-cluster controller. I can imagine this running in a nodeless cluster, for example, or running as a controller outside that manages multiple clusters and watches for jobs being created in them. We'd need to fine-tune these concepts a little, but I don't see a problem having this controller watching multiple API servers and trying to manage resources across multiple clusters. It's not tied to the single-cluster story.

I guess so. Is that a use case you want to focus on for the first release, or does it come later? The MVP is focused on running against a single cluster; that is the MVP. The next step is how this can run outside, to manage multiple clusters. Okay, thanks.

If you're interested in the multi-cluster conversation sooner rather than later, that's where you might be interested in the CNCF batch working group conversation, because that's one of the things we're directly talking about. Can you send me the link for the CNCF batch group? Yeah, I put a whole bunch of them in the chat, just before Timothy Middelkoop's storage comment: there's the Slack link, the charter above that, and then Ricardo posted the issue. I'll add those links in the agenda as well, so that we can go back to them. Yeah, thank you.

Actually, one thing on this "what users want" slide: I would add speed, at a certain scale of things. I don't know that it's explicitly listed here, but when we tried to do this with just the regular Kubernetes scheduler, or building a custom scheduler a couple of years ago, it just wasn't fast enough, especially when we scaled a cluster up to a really big size. So all of these things speak to us for sure, and then we have a few others.

Yeah, scale is always top of mind, you're absolutely right. We've worked on improving that for the scheduler, the pod-to-node scheduler, kube-scheduler, and it's something we're taking into consideration while designing and implementing the controller. Like thousands of jobs, or a million pods, that type of scale? I assumed as much; I just wanted to throw that in. Cool. If this captures only a subset of your requirements and there are others, please post them in the chat; we want to make sure we take them into consideration as we move forward.
Yeah, that's great. So, why a new controller? As you've noticed, plain Kubernetes doesn't really lend itself well to managing jobs with respect to queuing. Anything that you create on Kubernetes, the whole cluster is going to try to reconcile: create the pods, schedule the pods, start the pods. There's no way to say, if there are not enough resources, just don't do anything and wait until resources become available. It will continuously attempt to reconcile and will work itself to death, especially when you have thousands or hundreds of thousands of jobs being created. And Kubernetes quotas are not enforced in a way that allows this: quota is not dynamically enforced, it's only enforced at creation time. So it's a matter of whether you are able to create the job in the first place or not; if you don't have quota, you can't create the job, and there's no place to park it until resources are available to run it (there's a small sketch of this behavior at the end of this answer).

Volcano is one of the most well-known schedulers for gang scheduling. Our issue with Volcano is that it re-implements a number of existing functionalities. It is a scheduler, a second scheduler that runs side by side with kube-scheduler, and that causes a number of issues related to race conditions and re-implementation of features, and it can't catch up with the features we're actively pushing into core Kubernetes. The second thing is that it has its own job APIs: it has a job lifecycle controller, so again it re-implements the Job API that we have in core Kubernetes. The other thing is that it lacks a clear integration with autoscaling. One important design aspect of Kueue is that it needs a clear integration with cluster autoscaling, because that is extremely important for managing jobs: you want to allocate resources for a whole job. How do we do that before the job actually starts, and how do we send it to a specific location, a specific GPU model, a specific CPU, or a specific provisioning type, like Spot versus on-demand? And related to that last point, Volcano lacks clear support for resource fungibility, or flexibility. So those are the issues we have with Volcano.

I also want to mention that GKE, or Google Cloud, had a previous effort, now decommissioned, called Batch on GKE, a couple of years ago. It had similar issues: it reinvented scheduling, job lifecycle management and autoscaling. The other thing was that it was closed source, so it was hard to meet customer requirements around supportability. Customers want to run this on-prem; there's a ton of batch workloads that will continue to run on-prem, and we need to speak to those customers, right? We want them to be able to manage their jobs on-prem, and maybe sometimes spill into the cloud, or have a multi-cloud or on-prem-plus-cloud hybrid story.

So our thought is: okay, let's come up with a proposal that is open source, driven by the community, addresses the requirements mentioned before, and plays to the strengths of both the cloud and on-prem. The cloud has a ton of capability that is exposed through autoscaling, and autoscaling should be a central piece of the design of any job management controller. That's how I'd frame it. Any questions on this quick related-work review?
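[Editor's note: to illustrate the quota point raised earlier in this answer, the built-in ResourceQuota API is admission-time only. A minimal sketch, with team-a as a hypothetical tenant namespace:]

```yaml
# A ResourceQuota caps what may exist in a namespace, but enforcement
# happens only at admission: a Job that would exceed the quota is
# rejected outright (HTTP 403 Forbidden), not parked for later.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-quota
  namespace: team-a            # hypothetical tenant namespace
spec:
  hard:
    count/jobs.batch: "10"     # at most 10 Job objects may exist at once
    requests.cpu: "100"        # aggregate CPU requests across all pods
    requests.memory: 200Gi
```

[An eleventh Job submitted to team-a is simply rejected; there is nowhere to park it until capacity frees up, which is exactly the gap a queuing controller has to fill.]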
Like we want them to be able uh to run Um, you know manage their jobs on-prem and maybe sometimes spill into the cloud or have a multi cloud story as Or a multi or on-prem plus our hybrid story Really will throw it out And so our thought here is that okay, let's try to come up with a proposal against should be open source um driven by the community addresses the requirements that we've mentioned before in terms of you know, uh That plays on the strengths of both the cloud and on on-prem cloud has a ton of capabilities Um, we that is exposed through auto scaling and auto scaling should be like a central piece of the design of any job management controller That that's I guess I will look at it any any questions on the quick related work review here So Having too much focus on the auto scaler kind of leaves out the the people running it on-prem, right? So I would say like if if we say like auto scaler is going to be first and mostly Then we don't care about bare metal where you have a fixed A set of machines, right? Like it should be something that we care about auto scaler But uh, we care about people that is going to have a fix it because I feel like this story is being told as Uh batch is for auto scaling On a queuing system, right? Like if you keep repeating that the auto scaler is the most important thing that we need to integrate And if you have a batch A fixated system as they I see the list of people in the school like from Academia and universities. They have a fixed machine. So the auto scaler is out of the picture, right? I mean, it doesn't think that I'm mentioning this because of like the fact that most batch Schedules are built in the past They were fixed that they were designed for fixed clusters And I'm trying to emphasize the point here that this is changing and we need to take cluster auto scaler into consideration here In an environment where you have, you know Uh, a ton of elasticity and flexibility and and the fact that batch workloads are migrating from on-prem into the cloud Um, I did not mean to because like it's still like the idea that like for example batch on gke didn't succeed Because exactly what you mentioned. That's why we want to start from a point where it needs to be open source It needs to speak to on-prem customers. But at the same time It should Take care of environments where you have flexibility You have elasticity Those have never been top of mind And I'm trying to emphasize the point, but maybe I I overdid it Yeah, uh, trust me. I'm like I'm supporting this idea. I'm just Help to define the idea Right. It's like a pendulum. It's both like this and this and you know, it tastes nice So question, uh, so there is queue and there is batch. So In this proposal, are we going to combine both of them together or they are they're still going to be two different entities I sorry, I did not get the What do you mean that there's batch and there's queue? So, I mean queue can apply on top of normal community scheduler Right. I mean you have queue and then you can put it on a normal community scheduler with schedule things one at a time Now batch is like scheduling things in together, right? so now Are we going to combine this batch scheduler? and the queue's capability together or It's going to be two different things. Okay. That's a great question. It is one of the main design principles that we're carrying here Which is don't reinvent the wheel And when you say queue, we're talking about the queue scheduler. Am I right? This is exactly to the point of this slide. 
We don't want to reinvent the wheel. We don't want a second scheduler, a pod-to-node scheduler, running side by side with kube-scheduler. We don't want to re-implement autoscaling: the cluster autoscaler that we have, open source as part of the Kubernetes packages, we should just reuse. The same goes for job lifecycle management: I don't want to propose a new job API. We just need to manage the existing Job API and have hooks to manage custom workloads, custom jobs, that cannot reuse the Job API. We don't want to introduce a new API for creating jobs, basically. So we're not doing that.

The advantages are that we reuse significant existing functionality, we're not worried about functionality divergence, and it enforces separation of concerns, in the sense that the controller we're proposing is not going to do autoscaling, is not going to assign pods to nodes, and is not going to create the pods of a job. All of that stays with the existing components. It will only decide when a job should start, and use Kubernetes-native scheduling directives, like node affinity and taints and tolerations, to direct the job to the place where it should run based on the existing capacity of the cluster.

Can I ask a question? Yeah, sure. So first, one hundred percent with you on separation of concerns: pod scheduling is separate, this is a job meta-scheduler, I agree with you, and on not reinventing the wheel. The question I have is about the job representation, and job lifecycle management. I want something that is general enough that if I have a Spark job, or whatever kind of job, however complex it is, and it may include multiple deployments, etc., I want to be able to say: this is my job, and queue it as one entity. So I'm not sure the current Kubernetes Job specification is general enough to accommodate all those types of jobs.

This is a great point and I will address it on the next slide. We want to support both. Again, we have users whose journeys are simple: they just want to run a batch job, and for the Job API we're trying to fix the Job API itself. But let me finish this slide, then go to the next one and address that point.

We do acknowledge that there are some caveats or limitations to this approach. It creates two layers of resource management, so we need to make sure we address that. We have multiple components involved in starting the job, which may add extra latency, and debugging could become harder. All of these things mean we have to design the controller and the UX in a way that limits these drawbacks. I don't think we can completely get rid of them, but it's a necessary price for reusing significant existing functionality and preserving separation of concerns.

As I mentioned, we've been trying to fix the Job API. For example, you mentioned array jobs, or indexed jobs: we introduced Indexed Jobs in the v1 Job API.
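[Editor's note: for reference, a minimal example of the Indexed completion mode that landed in batch/v1; the name, image and counts here are just placeholders:]

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: array-demo                 # hypothetical name
spec:
  completions: 5                   # indices 0..4, one completion each
  parallelism: 5
  completionMode: Indexed          # each pod gets its own index
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        # JOB_COMPLETION_INDEX is injected by the job controller in
        # Indexed mode, so each pod can pick its shard of the work.
        command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```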
We also fixed completion status tracking. It was pretty much broken: tracking was based on pod objects, so when a pod completed, the pod object itself needed to continue to exist in the API server for the job to be tracked as complete. That did not work in environments with, for example, Spot VMs: when the Spot VM gets preempted, any pod assigned to that node, even one that completed, gets garbage collected from the API server, and so you lose progress in the job. We fixed that as well. We also introduced some new status, like tracking ready pods in the job status, which is required to implement TF and MPI jobs on top of the Job API. So the point is that we're improving the Job API to address the simple use cases and make it usable for implementing more complex workloads. But we acknowledge there will always be a percentage of workloads that will not be able to use the Job API; that's absolutely true.

That's why, in the resource model for Kueue, we have the concept of a QueuedWorkload. It's an abstract representation of any job in the queue. The idea is that this QueuedWorkload API object serves as a proxy between the actual job, whether that is, for example, a Spark job as you mentioned, and what Kueue is queuing.

We also have the concept of a resource claim. This is maybe a bit early to introduce, but it's an API we will be introducing to the cluster autoscaler to ask for resources, and this is what I meant by native integration with autoscaling. It's not necessary in, for example, on-prem environments; the whole thing still works without resource claims. But in the cloud it will be quite powerful, because before we start the job, we want to ask for resources. We communicate with the cluster autoscaler, the autoscaler tells us, okay, I have these resources in zone X or zone Y, and then we start the workload by injecting affinities into it to send it to the resources the cluster autoscaler provisioned for us.

The last two concepts are maybe less surprising. There is the queue, which is an organizing concept for grouping, managing and reasoning about closely related jobs, and there is the concept of a capacity, which defines how much resource exists for different tenants. We're reusing the namespace as the tenant concept, which seems to be a well-accepted concept in Kubernetes now. In this case you would model your teams, for example, as namespaces and create queues for them. These queues are namespaced, and they point to a capacity; the capacity is a cluster-wide resource. So usually the cluster admin, or the batch admin, is the one who manages the capacities and creates the queue resources. These are the personas we're focused on: the batch admin sets up all these queues and capacities, which decide when a job will start and how much resource exists for each tenant, and the batch user basically just creates, runs and monitors jobs.
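[Editor's note: a sketch of what these two objects might look like, going by the resource model just described. The kinds match the talk (Queue, Capacity), but the group, version and field names below are illustrative guesses at the proposal, not a settled API:]

```yaml
# Cluster-scoped pool of resources, managed by the batch/cluster admin.
apiVersion: kueue.x-k8s.io/v1alpha1   # illustrative group/version
kind: Capacity
metadata:
  name: research-pool
spec:
  requestableResources:
  - name: cpu
    flavors:
    - name: on-demand
      quota:
        guaranteed: 1000              # cores shared by all pointing queues
---
# Namespaced queue that tenants submit jobs to; it points at the capacity.
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: default-queue
  namespace: team-a                   # namespace doubles as the tenant
spec:
  capacity: research-pool
```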
So, this is a quick slide on the theory of operation. Sorry, I'm not paying attention to the questions in the chat; I hope someone like Aldo or Maciej is answering them, or please interrupt me if there's something I need to clarify. Just in the interest of time, we have around 15, maybe 20 minutes if we overflow a bit, so I would suggest we go through it and take questions at the end, unless people prefer to interrupt. I think after this slide the story becomes a little clearer, and then I can show a couple of use cases and we can have questions.

Here I'm showing how the Kueue controller is going to work. As I mentioned, we're reusing a lot of existing functionality: the red boxes are existing controllers, part of Kubernetes, and we're introducing a new one, Kueue. At time zero, in the top left corner, the batch admin creates the queue and capacity resources. You could have Gatekeeper-type policies to restrict who can submit to them. Then the batch user starts the job; it's assumed here, again, that we're using the v1 Job API. They set the queue name where the job should be queued, and the job starts in suspended mode. We're going to have, for example, a webhook, something we're also discussing within the Kubernetes community for setting policies, to enforce that jobs start suspended. While the job is suspended, the job controller is not going to act on it; it will just ignore it.

The second step is that Kueue, which is watching these jobs, assigns them to a capacity. It creates a resource claim, if it has an integration with the cluster autoscaler, to work out where the resources are going to come from. Once the cluster autoscaler fulfills the resource claim, and Kueue is watching for that, it unsuspends the job. From there, the rest works the same as it does today: the job controller creates the pods, and the scheduler places them on nodes.

One important aspect is how we direct the job, how we do job-level scheduling. As I mentioned before, we use Kubernetes-native scheduling directives: based on where the resources were allocated via the resource claim, Kueue injects affinities, or even tolerations, into the job to send it to a specific place.
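[Editor's note: to make step one concrete, here is a minimal sketch of a v1 Job created suspended and pointed at a queue. spec.suspend is the real batch/v1 field discussed here; the queue-name annotation key is only illustrative of the proposal:]

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: simulation
  namespace: team-a
  annotations:
    kueue.x-k8s.io/queue-name: default-queue   # illustrative key
spec:
  suspend: true              # the job controller creates no pods while true
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sim
        image: busybox
        command: ["sh", "-c", "echo running"]
        resources:
          requests:
            cpu: "1"         # what would be charged against the capacity
```

[On admission, Kueue would flip suspend to false and, per the affinity-injection point above, patch in a nodeSelector or tolerations matching wherever the resources were provisioned.]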
I'm going to skip this one. The next one is probably more important: the QueuedWorkload abstraction allows managing more complex workloads. The idea is that once a custom workload is created, we need a controller that understands the custom workload and translates it, creating a QueuedWorkload resource. I don't have the spec right here, but it's basically how much resources you need: essentially a pod template and a count, maybe even an array of those, because you could have a driver and workers, like in a Spark job. Kueue is aware of these QueuedWorkloads, watches them, assigns them to capacity, and then marks the workload as fulfilled; the custom controller here is the one that actually starts the custom workload.

The main requirement is for custom workloads to support suspension: basically, the ability to start in a suspended mode, and a way for us to start it by setting suspend to false. This provides an agnostic way of deciding when a job can start and when it should stop, meaning preempted, for example. We're discussing introducing a suspend subresource, similar to the scale subresource, if you're aware of it, which allows the HPA, horizontal pod autoscaling, to work agnostically across different types of deployments. We're thinking of suspend the same way.

So, just to make sure I understand: in this case, if I'm interested in, say, Spark jobs that are started by a Spark custom resource, I need to make sure that whatever Spark controller implements the suspend API. Is that how it works? Yes, you would need the top-level resource object that represents the job to support being told: okay, suspend, or resume.

So this is the integration point, and we feel it's a relatively small surface area of integration. The complexity is that Kubernetes is extremely flexible and allows you to build anything you want, and we want to manage all these types of custom resources. So that's the design we came up with; at least the integration surface seems reasonably small to us. Hopefully we won't be proven wrong; we'll see how it works.

Yeah, but we're putting a requirement on everybody: whoever is implementing the Spark controller, the Ray controller, whatever you name it, the TFJob, the PyTorchJob, everybody has to implement that suspend interface. I guess that's the uphill battle here.

It's not that uphill. I'm already a contributor in Kubeflow, so I can do this for the MPI operator, for example. And I think Alex, who is here, has discussed this already with the maintainers of the training operator, and they are fine with the idea. We just need to implement the change.
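[Editor's note: as a sketch of that integration contract, a hypothetical custom kind (FooJob and its API group are invented for illustration) only needs a suspend-style switch that its own controller honors, mirroring the batch/v1 semantics:]

```yaml
apiVersion: example.io/v1alpha1   # hypothetical group, for illustration only
kind: FooJob
metadata:
  name: demo
  namespace: team-a
spec:
  runPolicy:
    # The contract: while true, the FooJob controller must not create any
    # pods; Kueue sets it to false to admit the workload, and could set it
    # back to true to preempt it.
    suspend: true
  workers: 8
```

[Nesting the field under runPolicy just mirrors how the Kubeflow operators group run-level settings; the only hard requirement is that the field exists and is respected by the workload's controller.]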
We just need to to implement the change At least for q flow, it's I think this battle is pretty simple It's not a battle Uh, and I'm pretty sure we can Work with with other communities to integrate it as as Abdullah said is It is a simple field That doesn't require much thought I would also add that ideally actually that should not be unnecessary, but we would love that Most of these tools the job api so that we can actually consolidate on the base job life cycle Really with the core job api on kubernetes Now not all jobs in spark is a good example that probably the job spark has some specific requirements that like the job api will not be able to meet um, but like We are at least We are looking into like trying to curate like a stack ranked list of all of the tools That need to indeed have this integration and like at a later stage if we see that this gets traction at least with early adopters and first users One of the elements of work would be and help also that we could use is indeed that we Do a targeted effort and like in a stack rank start from like air flow or go AI Coup flow various flavors actually of coup flow Etc that they all make sure that that we have this integration either with the core job api instead of using their own job life cycle management approach or Or this approach of unsuspending if something more complex is needed The other thing here is that this idea helps in scaling Uh addressing scaling concerns like we don't want the pods to be created only At like from the beginning that will help us scale like if you have hundreds of thousands of jobs being created And you want to queue them you don't want all of them to create pods and just basically manage the million pods that only One tenth of them will actually execute at a time And so like I feel that this could also enforce a shift of Like a new design pattern basically That that should be more scalable moving forward um I don't have as as as you mentioned like we don't have a lot of time. So the we have the controller designed The controller it's a different beast. Uh, we'll leave it for another day, but the design document is there We created a repo. Uh, we have a like a Approved concept that we're planning to open source next week Uh, and so hopefully the community can start Looking at it and helping us shape it And improve it um I don't think we got half time for the api to think if we have uh, no questions question on That I've been thinking after reading the proposal is adding a new controller to cure netis feels like a A really heavy thing to do, right? Uh Do you see these like An actual thing that can happen. I feel that For the last year cure netis has focused on stability and maybe even to run cure netis on like edge cases and tail cost and all of that and Then adding a new controller that is a use case for a lot of people But not for I would say the 80 percent of cure netis use cases Are looking into this so adding a new controller will make cure netis heavier Uh, how does the cure netis, uh, community feel about this? That's a very good question. So we're starting as a sub project. Uh, not in core kubernetes. We want to prove the case We want for that this works We are planning to integrate this with core kubernetes That's why the way that we're designing this such that it integrates with existing controller. 
So that's one pitfall we're trying to avoid from the beginning. The other thing is that we have the Kueue controller manager: it's not going to mean a new binary for every feature, just another controller created within the Kueue controller manager's set of reconcilers. That's also why we formed the Batch Working Group: to convince the community that there's conviction around these ideas, that there's momentum, that there's a new type of workload we need to open Kubernetes up for. I can't tell you that it will happen, but we're trying; we're making decisions right now that will hopefully help us make the case in the future for having it in core Kubernetes. And again, if nobody is using it, it just sits there; it's not a new container that you need to start, etc.

All right, it sounds amazing. We're reaching six o'clock, but since we started a bit late, I suggest those who can stay, stay five more minutes, and then we wrap up. I saw a lot of activity from Kevin; do you want to raise one or two questions, Kevin?

Sure. So, we run HPC systems here at PNNL, and I was curious how these queues are going to interact with each other. Usually we give each project a namespace, so they would kind of have their own queue in this API. But on our HPC systems we have queues where each of the projects submits their jobs and can see where they are in the overall view of the system queue, so they know it's going to be two days before their job starts, or whatever. And all the projects' jobs are fairly scheduled across the different projects, so one project doesn't dominate the whole system. How does having separate queues at the namespace level work for that use case?

Queues, if you look at the API, are simply a pointer to where the actual capacity is. Having the queue namespaced solves a couple of problems. One is discoverability: users usually only have list access in their own namespace, so to know which queue to submit a job to, they simply list the queues in their namespace. The actual capacity management is in the Capacity API, and that is not a namespaced object; it's something multiple queues can point to. Even if you have multiple namespaces, you can group them, using labels for example, and say: all of them point to the same capacity, and they share it.

The other thing a namespaced queue helps with is a case someone brought up while discussing this in open source. Consider the case where a user wants to run an experiment: thousands of jobs they want to run, but they don't want to use more than, for example, a few GPUs, for whatever reason. Namespaced queues give users the ability to create a queue in their own namespace and set limits within the queue. Those limits don't give you any promise on whether you will get the capacity, but they cap the maximum amount of resources your experiment will use. So I could imagine users creating a queue per large-scale experiment and setting those limits for that specific experiment. At the end of the day, what controls how much capacity you actually get is the Capacity object.
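[Editor's note: a sketch of the fan-in described here, two project queues in different namespaces pointing at one shared capacity, with the same caveat that the group, version and field names are illustrative of the proposal:]

```yaml
# Both projects draw from, and are fairly scheduled against, the same pool.
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: main
  namespace: project-a
spec:
  capacity: shared-pool
---
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: main
  namespace: project-b
spec:
  capacity: shared-pool
```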
Does that make sense? I think so. So the Queue API is pointing at a dedicated pool, a capacity; jobs are assigned to queues, but there's kind of a scheduler-level queue that aggregates all of the Queue API objects pointing at a capacity, looks at when the various jobs were submitted, and schedules them fairly? Exactly, yeah. At the end of the day, the actual queue is per capacity; that's where we decide which job gets executed first, and they are all dependent on the capacity.

Right, I think we have two minutes left. I'm pretty sure we'll have more to talk about; this has been very interesting. Maybe in a couple of months you can come back with a demo? Yeah, absolutely, that would be amazing. The other thing I would ask is: what's the best way for people in this group to provide input and feedback and try to help out? Maybe just... yeah, here's the repo. So we will have everything there: this is the repo, we will upload the code there, and we'll have the links to the design documents and the API, etc. If you have specific suggestions, please create issues to help us better shape this project. Right now it's just a template, there's nothing in the repo, but we should upload something this week or next week.

So your goal is to have it as an incubating project in the CNCF? In Kubernetes, sort of: this is a subproject sponsored by SIG Scheduling. Okay, so it's not a CNCF project? I don't know the details there, but it's a subproject within Kubernetes. Got it, thanks.

And I guess if people can have a look at the proposal in the Google Doc as well, put as much input as you can there. There was a lot of discussion going on; I saw we had some time, but people clearly have a lot more feedback, so I think we can interact there. There's also the mailing list; I linked that in the agenda too, so I suggest everyone checks those links. And let's say we sync again in a couple of months; we'll make sure there's a slot for this. It's been great, super nice.

I saw a couple of first-timers today, so I hope you're here again in two weeks so that we can have a proper introduction. Otherwise, does anyone else have anything to raise? If not, thanks again to Abdullah, Maciej and Aldo for the really nice presentation. We meet again in two weeks, March 2nd; in principle the topic will be air-gapped solutions. We'll stick with that topic for now, but we'll send the reminders as usual. Thank you, thanks everyone for attending, and thanks a lot for the discussion. Thanks, bye. Bye. Thank you.