Hello everyone. Welcome to this introduction to the Kubernetes Working Group Batch. I'm very excited to do this presentation because this is a new working group. How many of you were at the batch day? Okay, a few. So you heard about this working group through that day, I assume. How many of you heard about the working group today through the keynotes? That's cool. Thank you, Ricardo, for that amazing presentation that highlighted some of the things we have been doing. I'm going to skip this slide.

I want to start with the motivation: why did we start this working group? Well, Kubernetes, as you might or might not know, had a humble beginning. Kubernetes was built for serving applications, stateless applications. Over time, we have been increasing support for stateful applications; of course, there are still things to be done there. But there was never a focus on other types of workloads, which are very different from serving applications. And of course, I'm referring to batch workloads.

Yes, you can run batch workloads: there is the Job API, and of course you can create your own CRDs and run batch applications. But there are feature gaps. Back to the Job API: there are very few policies around completion and failure modes, and that limits the usage of the Job API itself, not just in the kubelet but in the overall Kubernetes system. There is also a lack of support for specialized devices. Yes, we have the device plugin API, but it has limits: if I want to say how much memory I want for my GPU, I cannot do that. Or if you have NUMA nodes within a node, you cannot target them. Well, there is some progress; I'm going to talk about it during the presentation. But there are all these different things about hardware that you might need for high performance which are lacking.

All-or-nothing pod scheduling: this has been a request in the kube-scheduler for multiple years, I would say. Today you cannot schedule a group of pods together; you have to schedule them one by one. And, related to that, once you have multiple jobs, each being a group of pods, and we don't have all-or-nothing scheduling, we also don't have job queuing: either you create a job or you don't.

So far, the status quo in Kubernetes has been to support these features through CRDs, through replacing the scheduler, all sorts of ways around the problem. And this somehow works, but it causes other problems, such as fragmentation in the ecosystem. For example, we have four pod schedulers, we have four job APIs, and we even have different CRIs to do the hardware support. Once you have all of these things, it becomes harder for you to manage your cluster, to install it. And if you have two competing schedulers, they might take conflicting decisions, and then you run into problems.

So, yes, about the working group: when I submitted the application for this talk, the working group was not created yet. I could do it because I'm a TL in SIG Scheduling, so I was allowed to submit the application. But we had already asked the steering committee to create this working group, and it looked like it was going to be approved. Fortunately, it was: in February, the working group was formally created. But the work didn't start this February.
We have been doing some work in different SIGs over the past few years. In SIG Apps, we have been working on completing the Job API, making it support a wider range of use cases. For example, we introduced Indexed Jobs, which are now GA; I think it was in 1.24 that we made them GA. And this is simple: with a classic Job, all your pods look the same, so it's up to you to differentiate those pods among themselves. If you introduce an index, so that each pod gets a number, it's very easy, from within the pod, to access a certain slice of your data. So you can do static partitioning of your workload, which makes parallel or distributed programming much easier.

We introduced suspended Jobs as well. This might be hard to understand on its own, but basically it's the ability to create a Job but defer the pod creation until later. You'll see why this is important in the next slides. We also introduced TTL-after-finished: once Jobs finish, they otherwise stay in the API server consuming, well, etcd resources, so with a TTL they can be cleaned up automatically.

This next one is a very important feature, which is still in beta. There was a very bad bug in the Job controller where we could actually lose track of the Job's completions or number of failures, and it had been there forever. The reason is that we depended on the pods staying in the API server to keep track of completion. So we introduced an approach, basically using finalizers, to do this synchronization: we can remove pods without losing progress in the Job status. This feature has actually gone through some ups and downs; it was very hard. And I want to highlight that this is very hard: managing pods is very hard. We don't want you as developers to have to manage pods; Kubernetes should do it, right? So we hope that by solving this problem, and by increasing the use cases of the Job API, you don't have to think about this. You can create your CRDs, of course, but when it comes to managing pods, you just create a Job, and you don't run into these bugs over and over.

We also introduced a small feature to track the number of ready pods. In Deployments, or sorry, ReplicaSets, you have a field for active pods; that was not in the Job API, so we added it. And CronJob graduated to GA, which is also very important; there were several optimizations there. Somewhat related, and this might be more relevant to gaming, or maybe Spark jobs actually: if you want to reduce the size of your ReplicaSet, there was no way to tell which pods should be removed. Pod deletion cost doesn't provide full guarantees, but it still gives you some flexibility to decide which pods to downscale.

Quickly, I just want to show what an Indexed Job looks like. Very simple: we have the different tasks we have to accomplish at the top. We set completions to five and parallelism to two. This means that at each point in time we only have two pods running, but the first pod will take the first work item, the second pod the second one, and so on and so forth. You get a window that progresses through the data.
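As a rough sketch of the manifest behind that slide (the image and command here are illustrative, not from the talk), an Indexed Job looks something like this:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo
spec:
  completions: 5           # five work items in total
  parallelism: 2           # at most two pods running at any time
  completionMode: Indexed  # each pod gets an index from 0 to 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        # The Job controller exposes each pod's index as the
        # JOB_COMPLETION_INDEX environment variable (and as an
        # annotation), so each pod can pick its own slice of the data.
        command: ["sh", "-c", "echo processing item $JOB_COMPLETION_INDEX"]
```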
I actually forgot to introduce my co-presenters. One of them couldn't make it to this KubeCon, but he sent me a video, so I'm going to play this part.

This is Alex Wang, one of our main contributors to the Working Group Batch initiative. There we go. The first one is coscheduling: if only some pods of a job can start, we are wasting the cluster resources. We define a CRD called PodGroup to allow the user to define the minimum number of pods required for a job to start. When creating jobs, you can associate each one with a PodGroup by a label on the pods. The second one is capacity scheduling. We define a new API called ElasticQuota to declare the quotas for different tenants. Users can define the guaranteed resources, min, and the maximum resources, max, using this object. It makes it possible to share resources dynamically between different namespaces in your cluster and to improve resource utilization. You can see the usage on the right: when namespace two has free resources, namespace one can borrow resources from namespace two to run more pods; and when namespace two needs its resources, the borrowed ones are reclaimed, at which point both namespace one and namespace two can use their guaranteed resources. And the last one is bin packing. It schedules pods onto the nodes that have the least unused resources, such as CPU, memory and other extended resources like GPU. In this way, we can reduce resource fragmentation by packing small tasks together.

Thank you to Alex for showing us these plugins, which were introduced in the community scheduler-plugins repository maintained by SIG Scheduling.
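To make the coscheduling piece concrete, here is a minimal sketch based on the scheduler-plugins documentation; the names are illustrative, and the exact label key has varied between plugin versions:

```yaml
# A PodGroup saying "don't start any pods of this job unless 3 can run".
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-group
spec:
  minMember: 3
---
# Pods opt in by referencing the PodGroup through a label.
apiVersion: v1
kind: Pod
metadata:
  name: demo-worker-0
  labels:
    pod-group.scheduling.sigs.k8s.io: demo-group
spec:
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
```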
Now back to SIG Node. There have also been some initiatives there, in particular around NUMA nodes. For example, we have the kubelet Topology Manager, where you can tell the kubelet to place containers, or the entire pod, on one NUMA node or on a set of NUMA nodes. This has the advantage of improving runtime performance, because the processes are closer to each other within the NUMA node. Now, the problem is that the kubelet takes these decisions without coordinating them with the kube-scheduler. So SIG Node has been working, and is still working, on a CRD for exposing the decisions that the Topology Manager made, so that a scheduling plugin could use that information when scheduling.

All right, so all of this happened before February. In February, we finally got approval to create Working Group Batch, which has this mission: discuss and enhance the support of batch workloads in core Kubernetes. The goal is to unify the way users deploy batch workloads, to improve portability, and to simplify supportability for Kubernetes providers. Let me quickly highlight what I mean by batch workloads. We could be talking about ML, we could be talking about HPC, and we could be talking about something less obvious like CI/CD: when you have CI, you have a job just to run your tests, for example, and those behave similarly to the other types of workloads. So we want to improve the support for all of these things in core Kubernetes, and unify it.

I don't know how many of you know what a working group is; maybe some of you might not even know what a SIG is. I'll quickly explain the difference: a SIG owns code, a working group doesn't own code. It's more of a forum for different SIGs to coordinate changes that span multiple areas. So of course we have SIG Apps as a participating stakeholder; SIG Apps owns all the batch APIs. SIG Autoscaling, and they are listed in alphabetical order: when we have multiple jobs, maybe our small cluster is not enough, so we want to scale, and with that we want to improve cost, scaling up and scaling down. So we wanted SIG Autoscaling to be part of the initiative as well. SIG Node: we already saw some work being done there, and there are multiple KEPs open around this area, which is very encouraging to see. And SIG Scheduling, of course; that's where I come from. So yeah, those are the stakeholders.

Once we created Working Group Batch, we decided to define what things we want to work on. The first work stream is advancing the Job API. You already saw some improvements, but we want to keep listening to the community about what they need from the Job API to support a wider range of use cases. We already saw static partitioning, but there is still work to be done to support MPI, ML, AI. Again, if you are using, for example, the Kubeflow MPIJob, that's great; but managing pods is hard, so we want MPIJob to somehow use the Job API, or some other API that we build for groups of pods. The Job API is the main API we're working on, but of course all the APIs are open for discussion.

The second work stream is about scheduling, in a sense. I particularly don't like calling it scheduling, because when we talk about scheduling in Kubernetes we are referring to pod-to-node scheduling, so I prefer to call it job management. We want to work at the level of jobs, and at that level we want to do queuing and auto-provisioning, making the autoscaler provision more resources. Of course, there will be scheduling involved too; and, well, I already mentioned autoscaling. The last work stream is around hardware: we want to discuss what we can do about runtime and scheduling support for specialized hardware. Accelerators, NUMA, RDMA, FPGAs, you tell me.

So where are we now? We started in February; what have we been doing for, what is it, three months? We actually made some progress. SIG Scheduling started sponsoring a new project for job queuing, Kueue. I won't do a full presentation of Kueue today because we already had one on Tuesday at the batch day; I believe that's the link on the right, a full presentation of the system. The other link is for Ricardo's keynote presentation from this morning, so go ahead and watch it; it was pretty cool. I'm just going to quickly explain what it is.

Kueue is a job-level manager. So again, we're not dealing with pods; we're dealing with jobs. It decides when a job should start, meaning that its pods can be created. Or, if there is no capacity and I have a high-priority job, it can reclaim that capacity: stop a job, delete all its pods, and start my new job. That's what Kueue does at a high level. For that, as I showed you much earlier, we added the suspend field to the Job API; we did that to enable Kueue. It's very simple: you suspend the job, no pods; you unsuspend the job, pods are created. This way we prevent the scheduler from trying to schedule something that might not fit, and we take the decision in this other controller, which has a view of all the jobs that are running. Now, I've been talking about Jobs, but in Kueue there is a concept of Workload, which is a shadow representation of a job, such that we can support not just the Job API but also CRDs. Because, of course, today the Job API doesn't support all the use cases, and there is always a place for CRDs, so we want those to be included.
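As an illustration of that suspend mechanism, here is a sketch of a queued Job; the kueue.x-k8s.io/queue-name key follows current Kueue documentation (early releases used a slightly different convention), and the queue name and image are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queued-demo
  labels:
    # Points the Job at a Kueue queue (key per current Kueue docs).
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  suspend: true      # created suspended: no pods until Kueue admits it
  completions: 3
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo working"]
```

When Kueue admits the corresponding Workload, it flips suspend to false and the regular Job controller creates the pods; to reclaim capacity, it sets suspend back to true, which deletes the running pods without losing the Job object itself.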
The main design principle of Kueue is no duplication. Kueue coexists with everything we already have in Kubernetes: with kube-controller-manager, which knows how to manage the pods of a Job; with kube-scheduler, which knows how to assign a pod to a node; and with the cluster autoscaler, which knows how to add capacity to my cluster. Kueue coordinates with all of them rather than replacing those components. That's the main design principle.

As for usage: the admins define queues, with resource flavors, quotas, and cohorts for borrowing. They do the heavy job, let's say. And users don't change their behavior: they just create jobs, they just say which queue, and that's it. We released v0.1 in April, and we're working on v0.2, where we want to add more metrics and make it more robust. So go and watch the two presentations.
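To give a flavor of that admin/user split, here is a minimal sketch of the queue definitions using the API shapes from current Kueue documentation (v1beta1); the v0.1 API from the time of this talk had different names and shapes, and all names and quantities here are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# Cluster-scoped quota that the admin defines once.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-queues
spec:
  namespaceSelector: {}   # which namespaces may submit to this queue
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
---
# Namespaced handle that users point their jobs at.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-queues
```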
What else is happening in this space? We have a few open debates. We are talking about more policies for Jobs. We are thinking about how to still do all-or-nothing scheduling in the kube-scheduler, for the cases where we have a static cluster with very tight resources, where it might still be needed. An API to reserve node resources, so that the autoscaler can preemptively add more capacity. And maybe a plugin model: there was a proposal for a plugin model for the kubelet, so we can more easily support other hardware.

We are welcoming presentations from all of you. If you have a batch project, a CRD, we want to understand what feature gaps you have, and we want to keep the discussion going. Of course, how can you get involved? Well, come present. We have the links; the last link is for Working Group Batch in the official community repository. There you can find the Slack channel, you can find the mailing list, and you can find our schedule. Go ahead, put your topic in the agenda; our lovely organizers, who are sitting right here, will accommodate you in the next working group meeting. That's all. Oh, yes, before I forget: Google is running a live panel on June 15th. We invited the relevant leads from the different SIGs within Kubernetes to chat more about what is missing in Kubernetes to support batch. I invite you to go there and register.

Questions? Before that, I completely forgot to introduce my co-presenters. Abdullah is right here for questions; we are both from Google. And Qikang, or Alex Wang, from Alibaba; you can find us on GitHub with those usernames. Thank you. We have about eight minutes for questions. If you want, raise your hand and I will give you a microphone.

Thank you for the talk. I wanted to know how this integrates with some existing operators, for example the training operator, formerly the TFJob operator, because I remember there were indexed pods already at the time: if you have multiple replicas, for example, the pods would be indexed. I don't know if these features are now provided by the batch API. How will it integrate? The TFJob operator would have to use the Job API: the TFJob controller would have to create Job objects instead of creating pods directly, and then it would benefit from Indexed Jobs. We have engaged with the Kubeflow community. We wanted to get this done, we wanted them to be able to use what we have been working on, but it's still not possible: they have extra policies in terms of termination and failures, and we need to include those in the Job API so that they can migrate. That's a discussion that is going on. Was that your question, or were you asking about Kueue? Yeah, exactly. I guess it will make them have less code on their side. Correct, yes. But still more constraints. What I usually say is: we have to solve the problem once, and everyone benefits.

No questions? Then I'm going to ask what I'm going to talk about tomorrow, which is: how does this relate to the CNCF batch initiative? Yes. Well, actually the two proposals started at around the same time. We have somewhat different objectives, because Working Group Batch is within the Kubernetes organization. We want to improve what is already within the core: the Job controller, the kube-scheduler. I don't know if it's going to be possible, but we want to provide these smaller core features that other batch schedulers can use. So Kubernetes can focus on starting jobs, and the batch schedulers can focus on, well, maybe you don't like our fair sharing, so you can implement your own fair-sharing algorithm; but for starting the jobs, and maybe some basic queuing, FIFO queuing, you would still use the core. And if we are talking about multi-cluster, the Kubernetes core should support some of those operations. Now, the CNCF batch working group is looking at the problem, let's say, outside of Kubernetes: they are trying to integrate, to unify, multiple different batch projects, which might or might not be on Kubernetes. We hope to engage with them so that we are somehow aligned, but that's a conversation that hasn't started yet.

Looking at what you're proposing: a Job is a fixed thing, so you create it beforehand. But what happens if the job changes while it's running? It grows or shrinks, so that means every single time you need to recreate something. Well, if you look at a TensorFlow or a Spark job, the driver picks up and says, I need these kinds of resources. How does that fit with this proposal? So, I suppose you're talking specifically about Kueue? Oh, the Job API. Yeah, so when we are talking about scaling a Job, the main question is this: when scaling up, all is fine, just add more pods; when scaling down, that's the question: which pods do we remove? We need to discuss this; we need to come up with an API where we can safely say which pods we can remove. Abdullah, you wanted to add something? Yeah, I just want to mention that in the Job API, I think one of the things that is fixed is completions. And this is what you're mentioning: you define that your job is complete when you complete ten instances of this pod. But if you want to scale it up and down, it's the same thing: if we make completions a mutable field, so it starts at one, for example, and then you increase it, you could achieve similar semantics, probably.
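For reference, Job parallelism is already a mutable field today, so the kind of scaling being discussed can be sketched with a simple patch (the job name is illustrative); completions, by contrast, was immutable at the time of this talk:

```sh
# Scale a running Job from its current parallelism up to 5 pods.
kubectl patch job my-training --type=merge -p '{"spec":{"parallelism":5}}'
```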
The case that I'm talking about is: I run, let's pick a Spark job. I give it a request, and based on the amount of data it processes, it says I need 10 executors or 100 executors. You don't know when you define the job. The Spark driver looks at the information it needs to process and, based on what it needs to do, dynamically generates the pods that it needs. So you submit the job, it consists of just the driver, and the rest is code within the driver that decides what it does. So what you're almost asking for is for the driver to create Job API objects to get the executors up and running and scheduled. That's a huge requirement to put on the applications that are going to run using this; you want to do it without changing the application that runs. You don't want to change Spark or TensorFlow or things like that.

So, Spark jobs and MPI jobs, and there are even elastic MPI jobs: when we're talking about them, they 100% don't fit the Job API today, right? The Job API is designed around completions, and in these jobs, yes, the entire thing is a job, but the individual workers are not tasks; they are more like small servers. So they are more similar to services, more similar to Deployments and StatefulSets. So my answer today is: we don't have an API for that today. The closest thing might be StatefulSet, but that's a discussion that needs to happen, and it hasn't started yet.

The last question we got from the virtual audience: what is the performance of running jobs in Kubernetes? Are the overhead and the rate at which pods can be created going to improve? I suppose this is asking about the Job controller. Well, for starters, it depends a lot on the API throttling that the Job controller is configured with; if you increase that, it's probably going to behave better. At the same time, that's something we want to work on. We recently introduced some load testing for the Job controller, just this past release, and, well, we just started with the load tests, so we don't have a concrete answer today. We want to improve it, and for that, we welcome contributors.

All right, good. We ran out of time, so thank you very much. Thank you.