 Hello everybody my name is Aldo. I'm from Google, but in this talk I'm representing sick apps from Kubernetes community So basically What I want to to say is that well we understand that We understood Front since that since the start of Kubernetes the job API Was not meant or was not ready to run all kinds of applications So All kinds of applications in the HPC AI ML ML Space so I'm sorry need to I cannot really see what I'm doing here Sorry about this Sorry about that So in particular When you wanted to run part of the applications on Kubernetes you run into a number of problems In as you might know in the Kubernetes API in the Kubernetes documentation we proposed a set of Patterns for how to run a job part of jobs and they were actually cumbersome for example You you had to set up an external queue just to do data partitioning Or another recommendation was if you wanted to run the same job across a set of Different sets of data you had to just set up your own orchestrator around the job controller and Of course there is limited control over over startup or termination Of the pods how to control retries or rather fail a job etc. And this cost this has caused that a number of Developers in the ecosystem and you've probably already heard of some of them during this talk during this batch APC day that there were a number of people projects rewriting the job controller and This of course leads to fragmentation, which makes it hard for Providers to support We're hearing earlier that well now we have all these different APIs We have to support which ones my provider supports So that's one problem and also a number of bugs that Were done in one place are also replicated in all these places So looking into that Basically around the 121 release of Kubernetes we have been working on this number of of features in the job controller to finally Make it possible that everybody or at least most of the developers can run in a single API I'm going through basically all of these ones during this talk But this is for reference so you can you can once you download the slides you can go through all this The documentation for all of these ones I'm also going to talk a bit about what are we thinking for the future and what are the moon shots? What do we want to bring the job API next so the first one is maybe some of you are already familiar with this index jobs Index jobs is simply the ability of creating a single job For a parallel application where each each part in the job has a different index and Well this This index is available as simply a number as an environment variable So how do you set up an index job? You just set the completion mode to index and just you set the number of completions the number of pots you want to run and then in the command line or well in your in your pot in your workload you just access the the Environment variable and well execute your application and Well, this is this is how it looks in the This is how we can think about index jobs. So this is such a fundamental feature that He hasn't existed since the start of Korea is a couple of years back. We finally added it The next one is Job suspension and scale and mutable scaling directives so This provides you with the ability of creating a job but not start the the the pots yet think of it's this is this has already been mentioned but You could have a case where You don't want you have way more Workloads running or you desire to run more workloads than you have capacity in your cluster And you don't want to make your users Simply well Try to run the application and run into scaling errors or or simply fail the the pod creation The job creation you would like to queue this these workloads. So we added these fields into the into We added just this one field in the job API to say well this this job is suspended and You can implement an external controller To do the unsuspension Start or resume the job even you can suspend it back to to do preemption and stuff like that Then once this external controller Considered that there is capacity in the cluster. There is quota We've dealt with first sharing within the cluster for all our users we can start the job and Maybe even Place it in a particular place in this case We have decided that this job is gonna run in the In this zone the US central one a and it's gonna run on spot VMs because this is what we have capacity now of spot VMs and This is exactly what one of the projects in six scaling is doing. It's using these APIs to to implement this these semantics and But the interesting thing here is that well queue is one project that can use this but any or any other job Job queuing operator can can use the same APIs without having to read rewrite the job controller or the job or the job API This one is a Feature that is the the most hurtful for me Basically, well, I implemented this but it was quite challenging and this is why I was mentioning Initially about bugs How this this bug that is present was present in the job controller was present in a lot of these other Custom job APIs and it was very hard to fix So what was the problem? Basically the design of a job controller What's meant it was originally thought of low scale and also just to serve Services so it It at some points when he interacted with the garbage collector, he would start losing track So you could say you could see that your pot said I have five pots completed And then later he says I only completed actually three Why because the pots were disappearing and the job controller was only looking at the pots that existed in the cluster So this led of course to pot recreation and then people ended up paying more to render jobs and of course this meant that jobs were unusable at scale people running a five thousand Parallel job wouldn't be able to use the job API or maybe one yes, but not not more than three So this this was a big problem. And of course if you had preemptible VMs spot VMs Once these nodes disappeared the the the progress will be lost as well So and as I was saying custom jobs in the ecosystem had a similar design similar problems The fix is rather an entire rewrite of the of the job controller, which I don't I won't get into the details but during lunch you can look for me and we can talk about it and It was such a complicated fix that the feature ended up being disabled by default in Open source release of one two one twenty three and one twenty four even though the feature was already beta and hopefully we fixed all the the problems and We re-enabled it back to one twenty five with Incorporate is we don't really Renable features by default again, but if you if you need the feature you can still Re-enable even one twenty three and twenty four because all the fixes were backported So this is mostly Non-visible to users you you you don't say I want to use job tracking with finalizers We just track with finalizers And you can see some traces of what's happening. You could you will have of course pots with a finalizer and And of course you will see that your job is finally tracking progress correctly and not losing not losing Not decreasing counters of completed and completed jobs And actually We've heard from some of our customers That they are successfully running index jobs with 5,000 pots or more Which of course was not possible before This is another feature we are currently working on it's it's in alpha today And we are one of our engineers is working on the on the beta release so basically in Kubernetes you have a number of Problems that could cause your pot to to stop running right you could have of course software errors The hardware could fail you could have a maintenance event like upgrades or simply maintenance or You could have the eviction API or maybe a high priority job comes in and keep scheduler simply Kills the pot There's the pot garbage collector You can add a taint There is a cluster auto-scaler that could decide that suddenly wants to pre-empt some pots because it wants to optimize utilization The API server and you could maybe write your own controller and Affect a running pot so all of these components can can really kick kill your pot and You of course you might want to retry you might want not to retry because you Your workload doesn't support preemption And really there was no control over over this All these problems you just had a face in the process the test failed and the job controller could only say Oh, I want to support up to six failures So in Kubernetes 125 we added first a pot condition to Observe what's what happened to my pod? We just say well this pot has been disrupted It has been disrupted by skew scheduler for because of a taint etc. All of these all these problems And in 126 we are working on the support from cuba all the things that cuba might do to kill your pod Which includes? exceeding memory limits for example or Exceeding the usage of ephemeral storage and maybe we'll think about some other some other conditions in the future So what what do you do with this? This is just the information being passed by Pass from other controllers and then from the job API you can use these conditions to decide whether or whether to Terminate the job or retry it or how many times? And also we have support for looking at the exit codes of the of the Pots so if you know for example the The exit goes 40 41 and 42 are non-recoverable. You can immediately Fail the job without having to go through all the counters But maybe you want to ignore if the pod was disrupted disrupted by any of the controllers So you can retry indefinitely for any infrastructure problems Maybe you want to terminate. Oh, sorry. Yes, maybe You want to fail your job because of any error that happens in the In the in the binary But you want to retry if there is a disruption by the control plane But only up to a limit in this case the back of limit is three So I want to count to this failure and once it fails three times. I'm done with it I don't want to retry again So this is all the features we've been working on Are either most of them are finished one of them is ongoing and then in the near future we want to add this other Capability to index jobs where you control the back of per index and You can basically guarantee that all your indexes run at least once or up to the limit And you can say even okay if this limit if this index fails I just want to fill this index, but all of the other indexes can continue running We haven't started the sign yet, but this is kind of like the potential API So and What what comes next? Well, we are thinking of a some some moon shots But but really what we want we are here in in much HPC day basically to present These enhancements and also to ask you what do you want to see next in the job API? Well, how do you think job V2 API could look like for example Maybe why do we not index jobs? Maybe all indexes all jobs could be indexed Would be a huge simplification for the current is control plane That would lead to even better performance Or maybe you want to see multiple pot templates pattern with a startup sequence, which is something we haven't Looked at or maybe you want a resizable jobs. The jobs currently are fixed fixed number of completions Maybe that's something you you want to to have so with all of this How can you get involved? Why would why do it should you get involved? Well, we are building these API's We only have three credit releases per year, right? So we want feedback as early as possible from the community So once we publish our API design, we would like to know if it fits your use case and Well to do that, of course, we also want contributors if you Want to contribute to to the Kubernetes core components? But this is all the places we communicate in in particular the working group batch or the c-gaps In slack or the the meetings we have So, yes, thank you. Thank you for listening to me and Please keep in touch. Thank you. Although I think we're running out of time But anybody has any questions we run or two questions Yeah, so what's your take on the pod groups that are co-scheduled and how that relates to jobs? Rather that's well from the perspective of the job controller does tangential the job controller doesn't care about scheduling it's all about the how how it presents to the developers to the application developers If you want my take as a six scheduling leap, that's that's a different question. Is that what you're looking for? Yeah, essentially, so we have this idea that for us a job typically is not a single pod It's a collection of Kubernetes resources with a lifetime that is tied together with all of them And that may be multiple pods and services, right? Yes. Yeah, we actually had this question recently in the working group batch About yes, what if the job also encompasses services in particular right for interactive jobs? Yeah, for example, that's one use case There are two ways to think about this one is that Maybe a service is not that expensive and it could be created from the beginning and it's gonna be there sitting idle Because the job the pods are not created. So the services don't have endpoints and whatnot The other way to think about it is well, you can Implemented in the controller in your application controller. For example, let's say we have interactive interactive job Controller this interactive job controller has a job a Kubernetes job And you say well your controller only will create a service once The job is unsuspended the Kubernetes job is unsuspended So you could actually just listen to the signal from the job API and control the The rest of the resources based on that Those are the two ways I've So far thought about it, but of course, it's a it's an open discussion still Right. Thank you. We're having an interesting discussion during the panel about this topic as well Any other questions? I guess one thing I guess it's also on the panel So maybe I'll just ask it now a little bit But how so for all these different projects like Argo workflows airflow They all kind of have gone ahead and most of them are probably scheduling raw pods and then I guess is the the hope from From like the working group batch is like is this a place for people to come to try and like ask how to like migrate to Using the jobs and like if it normally is that like what the Guess the goal of some of these batch groups is to try to get this feedback from them Absolutely, so I don't blame them for rewriting their own or writing their logic in robots because the job controller was not ready Now that we have fixed a number of things we hope that all of these projects would Use the job API in particular about Argo Argo has had the same problem. I don't know if still does about losing track of progress So if they were a if they were migrating to the job API They would basically get rid of a bit without having to re-implement what we already went through Yeah for index jobs Some some common features I see in HPC based schedulers for job arrays our step and cap So like a step would be how much to increment between each Run so like if you're saying one through a hundred you could do step 10 and Like would be 10 20 30 40 for each index And the other thing is a cap Which is like a limit So like if you had a index of a hundred that you are running You could give it a cap of 10 to only run like 10 at a time Are those features that you guys are looking into implementing? That's actually already supported There we go, so if you see here, there is two fields completions and parallelism Completions is how many you will run? Oh I don't think I But I can point here. So you have to two fields completions and parallelism. So completions completions is how many? Are in total and parallelism how many run at a time? So this is actually what the graph is showing here. So it's two at a time running Now in terms of steps We believe that this kind of things can be done in the application level I mean you can always multiply inside although there have been requests about Basically I want to you know, I run my job and then certain indexes fail And then I just want to retry the indexes that failed So that's something we we might be looking into but we don't have support for running all the indexes yet So that would be the first step and then we can implement Selected indexes to run But yeah, that's something we are definitely looking Forward in the in the next releases How this will handle that are the part deletion I miss that the part deletion like a part eviction Suppose if the node is draining then the Kubernetes will automatically evict the part Will it retry the part or? Yes, so that's that's the API for job retries That's one of the disruptions the control plane can do and then you can express in the API whether You want to retry for that particular kind of disruption or or if your workload doesn't support it you you can stop Does that answer your question? Yeah Yeah, that's a name of course So do we have the ability to query the job status of because jobs are long-running thing I mean they tends to go Multiple days or hours. So is there a way to query the job status? What is the status and you know ability to understand when it's going to be finished things like that? Yeah, I don't think I clearly here, but you can in the job status You will tell you exactly which index is already finished And from that you can derive which ones still didn't run or complete. So yes, that's okay. That's visible One last question I guess before we go to lunch You mentioned earlier that you expected multiple services to potentially interact with suspend and other attributes of the job once it started Do you at this? You've mentioned specifically that you might have a queuing system And then you mentioned later talking about interactive jobs on with the other questions That you might have something that would suspend the job while it spins up say Let's for example a service and wait for endpoints to be populated Is there any way to coordinate that at this time so that you know the job doesn't actually run until both of these things? Because we'd have the point where the queuing system sets it on to spends it and the other system then tries to suspend it and I Was wondering if you had any ideas how to prevent it from starting in that brief gap So I can tell you a little bit what we are doing in queue which is our Implementation for how we think this should work and Well, there's a few things First having our back to prevent a job from starting is one possibility But ultimately the controller can You know if you observe a job try to unsuspend You can suspend it back from from the controller Is that is this in the lines of your question? I'm not sure yeah, I guess I I Was wondering if that might lead to like a race condition or something where the job starts In that gap and if there's any way to get that. Yeah, absolutely Well, this is kind of we're feeling within the pattern of communities, which is eventual consistency, right? in communities nothing is like We don't issue something and wait for it. We rather we do we do We have a desired state and the control plane converges towards the desired state state You again you can control some things with our back, but ultimately the controllers should remedy those situations where you have these these Problems and then that's that's when retryable APIs Come in place too, so you can control whether You should retry recreate bots or or simply stop The the job. Thank you. Thank you all. I'm sorry for running out of time. So we're gonna be back 55 1255 for lunch Thank you so much. Thank you