Hello everybody, welcome to another maintainer track session, this time for Working Group Batch: what's new and what's next. My name is Aldo Culquicondor. I'm a TL at SIG Scheduling, and I'm an active member of Working Group Batch.

Hi everyone, I'm Swati Sehgal. I'm a principal software engineer working for Red Hat, and I've been involved in the resource management space for a few years. My focus has been on NUMA-aware scheduling as well as the resource managers, and I'm happy to be here and happy to engage with you all.

So first of all, I want to start by explaining what a working group is within Kubernetes, for those of you who might not be familiar. Hopefully you are familiar with SIGs: SIGs are special interest groups within Kubernetes, each with the responsibility to maintain a set of components of the Kubernetes project. A working group is a bit different, in the sense that, for one, it doesn't own any components, and two, it has a temporary nature. We get together to solve a specific problem, and then we decide: do we dissolve, or do we maybe evolve into a SIG?

In this case, I'm here to talk about Working Group Batch, which is a forum to discuss enhancements to better support batch workloads. Batch might mean different things to different people, so just to give some examples, we deal with HPC, AI/ML, data analytics, or even CI/CD applications: in general, jobs that run to completion.

One of the primary goals of the working group is to reduce fragmentation in the ecosystem. If you're in this room, you might have already seen a lot of talks about batch, everybody doing different things, and even years ago people were already doing many different things. We want to bring some cohesion to batch in the Kubernetes project. To do this, we gather a set of stakeholders from the community: SIG Scheduling is one of them; SIG Apps, primarily because of the Job and CronJob APIs; SIG Node, because we still need our workloads to run on nodes, and we have accelerators and whatnot; and SIG Autoscaling, obviously to scale up your clusters when you need more resources. But we are not limited to Kubernetes developers and maintainers: we welcome a huge diversity of ecosystem developers. Regular attendees of the meetings are folks from Kubeflow, from Armada, and even from, let's say, competitor schedulers such as YuniKorn. We welcome all communities to come and bring their ideas, feature requests, or code.

So what's in scope? We're going to go through all of these topics in this session. First, additions to the Job API: the Kubernetes project has had the Job API for a while, but it needed some features, some love, so we brought those to Job and CronJob. Then we can start talking about job queuing and maximizing the utilization of your clusters. And last, we're also discussing topics around specialized hardware: GPUs, TPUs, you name it.

So we'll go through all of this, starting with the Job API. Let's start with the feature I'm most excited about, because it's the enablement for the rest of the features: we reached general availability of job tracking with finalizers in the 1.26 release, so the second to last. What's important about this feature is that, before it existed, the job controller could actually lose track of the status of a job: the number of completions, the number of failures, and so on. And that happened simply when pods were deleted.
So that meant that the job controller was actually incompatible with the pod garbage collector, which is ironic, because both are Kubernetes components, and yet they didn't really work with each other. So what did we do? We introduced a finalizer to control when pods are deleted. We use this finalizer to keep correct tracking of pods and to make sure they are accounted for in the job status before they are removed. We have done testing in different environments, and we've observed that we can process jobs of 100,000 pods in minutes, particularly for Indexed Jobs. And this is really not a visible feature: you don't opt in or opt out; starting from 1.26 it's always on. The result you see is simply that your jobs are tracked correctly. So hopefully more folks, more downstream applications and operators, can trust the Job API to handle pod management, so they don't have to re-implement it. If you want to learn more about how the feature was specifically implemented, you can refer to the blog post we wrote for the 1.26 release. It was actually an interesting journey, with some pitfalls and whatnot, so I encourage you to read it if you're interested.

The next feature I want to highlight is pod failure policies. As you might know, in a Kubernetes cluster pods are pretty much ephemeral: they can go down at any point, and we need some control over what to do when there is a failure in a job. Failures can come from many places. They can come from the scheduler, because of preemption; from the API server, via the eviction API, or from different controllers that watch different events and decide to evict a pod; from the kubelet, when the node is getting hammered and needs to free up resources, so it evicts pods; and ultimately you might have your own user-supplied controllers, the descheduler being one example. Any of these could disrupt your pod, and then you want full control over what to do.

We had an API to give some control, the backoff limit, but it's not enough: there is only one number you can set. If you set a high number, you could get too many re-creations; if you set a low number, you get basically no resilience. So we introduced a new API. It's a fairly involved configuration, and it could be simpler, but you can decide exactly what to do on each specific failure. There are two types of matching rules: you can match on pod conditions, for example DisruptionTarget, or a custom one such as the ConfigIssue condition shown here; or you can react to exit codes. As actions, you can fail the job, you can ignore the failure, and there are a few more options. The DisruptionTarget condition is set by the Kubernetes controllers, including the kubelet, while ConfigIssue is a sample: you can actually introduce your own pod conditions and react to those failures accordingly.

So what's new in 1.27? We introduced some guarantees in the kubelet so that all pods reach a terminal phase, which lets us reliably apply the pod failure policies. This is still a beta feature, so we are still learning and still improving it. If you want to learn more, Michał, who is sitting over there, is giving a presentation on Thursday about pod failure policies and some other features of the Job API, so you can join that session.
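To make the rule structure concrete, here is a minimal sketch of a Job using a pod failure policy. It follows the shape of the published batch/v1 API; the exit code, image, and names are illustrative only:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # illustrative name
spec:
  completions: 8
  parallelism: 2
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # A software bug signaled by exit code 42: retrying won't help,
    # so fail the whole Job immediately.
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    # The pod was disrupted (preemption, eviction, node drain): retry it
    # without counting the failure against the backoff limit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never   # required when using a pod failure policy
      containers:
      - name: main
        image: registry.k8s.io/e2e-test-images/busybox:1.29  # placeholder image
        command: ["sh", "-c", "echo working && sleep 10"]
```

With rules like these, a deterministic failure fails the job right away, while infrastructure disruptions are retried without eating into the backoff limit.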
The next feature we've worked on is rather simple: we made the completions field mutable. Just that; it's a relaxation of validation. And there is only one requirement: parallelism and the number of completions need to match. That's all. And why? Because multiple frameworks can actually scale jobs up and down, PyTorch being one example. We just wanted the Job API to be useful to a wider set of applications. So, simple: you have configured parallelism and completions, and you just mutate them to get a different scale.

So what's next? In the Job API, we are investing in a number of different features. One of them is actually a new sub-project called JobSet. A JobSet, as its name implies, is a set of jobs that work together on one computation. One example is a driver-worker relationship, like Spark, where you have a driver and you have the workers, or Ray; many applications can use this API. The idea is that we have centralized, let's call it pod management, but in reality we are implementing it as a set of Indexed Jobs. So a JobSet is a set of Indexed Jobs: each Indexed Job is a role, and you can have a number of workers, or a number of replicas, for each job. We also want to automate pod-to-pod communication and authorization keys if the application requires them, like MPI does. As a corollary, to get this implemented we are proposing a change in the kubelet to craft environment variables from a file, so we can more easily support more APIs.

Another KEP that is still in progress, where we haven't finished the design yet, is backoff limit per index. Naturally, since we released Indexed Jobs some years ago, people have been asking for failure control per index, so that's work we want to do. This is useful for parallel applications where there are a lot of independent workers working on independent pieces of data. A corollary of this is the ability to retry specific indexes instead of retrying the entire job.

The next idea, or work in progress, is counting terminating pods as active. This is a necessary feature for tightly-coupled applications that don't support more than one running worker per index, so we want to accommodate those applications too. There are a lot of discussions going on; I wanted to bring them to your attention so you can join the meetings, or just the GitHub issues, and talk about your use cases.

We also have some open discussions that are more abstract, or less fleshed out, let's call them. For example, we've been talking about mutable scheduling directives for jobs while they are suspended. We've also been talking about terminating pods that are pending; there is a discussion there, and Kevin here is working on that. DaemonJob is another proposal we received recently: the ability to execute a pod per node, but one time, as opposed to DaemonSets, which run continuously. These are open discussions; we're still collecting feedback and thinking through the implications, so don't take anything for granted, but we of course welcome your input. The last one is stateful Indexed Jobs, to have, say, PVCs per index.

The next set of things we're working on are primitives around job queuing and maximizing the utilization of clusters. The primary investment here is Kueue. Kueue is a Kubernetes-native controller that implements job queuing. It offers resource quota management, and it has borrowing and preemption semantics, so you can maximize the utilization of your cluster.
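As a rough sketch of the user experience with Kueue, loosely following its examples from around this time (the queue name, image, and resource numbers are made up; early releases selected the queue via an annotation, which later became a label):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-batch-job-
  annotations:
    kueue.x-k8s.io/queue-name: user-queue  # hypothetical LocalQueue name
spec:
  suspend: true        # created suspended; Kueue unsuspends it once quota is available
  parallelism: 3
  completions: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.k8s.io/e2e-test-images/busybox:1.29  # placeholder image
        command: ["sleep", "30"]
        resources:
          requests:
            cpu: "1"   # counted against the queue's quota
```

Once the workload fits the quota, Kueue flips suspend to false, and from there the regular job controller and scheduler take over.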
Kueue also supports resource fungibility, which means you can establish quotas for each kind of resource you have and fail over to the next one if there is not enough capacity, evaluated for the entire job instead of per pod. Naturally, we have support for the Job API; that's what we've been working on, so it is the first-class citizen. But we also added support for the Kubeflow MPIJob, and we want to add more: that's why we're providing a library, so implementers can integrate with Kueue, and we are talking with different communities about more integrations. The latest version is 0.3, released a couple of weeks ago.

Just to give you an overview, one of the primary design decisions we made with Kueue is that we don't want to re-implement anything. We are fully compatible with kube-scheduler, with the controller manager, and with the cluster autoscaler, because we take decisions at a different level. Here you can see where Kueue is injected: the only decision Kueue takes is whether a job should be started or should remain suspended, as a whole. The rest of the operations are handled by the existing components, such as the scheduler, which actually assigns pods to nodes, and the cluster autoscaler, which scales up your cluster.

Next, the roadmap for 0.4: some improvements to features we already have; we have a PR open in the Ray operator for an integration, and we are working with the Kubeflow community to support more APIs. As I was saying, the design principle in Kueue is that if something doesn't work in the rest of the ecosystem, in the scheduler or the job controller, we propose fixes or features on those components, so that we can implement the higher-level decisions in Kueue; these are a couple of discussions we are having along those lines. And if you want to learn more about Kueue specifically, you can watch our presentation on Tuesday at the Batch + HPC Day. Next, I'll hand it over to Swati.

Hello. For the next part of the talk, I'm going to focus on support for specialized hardware. I'm going to cover what we've been doing as part of topology-aware scheduling, starting with a brief overview, and then we'll dive into the developments that have happened since we last discussed the topic at KubeCon.

As you see on this slide, we have a Kubernetes cluster with two worker nodes. They look very similar to each other: each node has four instances of device A, four instances of device B, and eight CPUs. So, as I said, from the scheduler's point of view both nodes are completely identical, and resource-wise they appear to be exactly the same. So if we were to schedule a pod requesting both device A and device B, we would expect the pod to succeed. Let's see what happens when we place this pod. If the pod is placed on worker 2, it succeeds and goes into the Running state. But if the pod ends up on worker 1, we get unexpected behavior, and it is rejected. And if we inspect the pod, we see that it has a topology affinity error. So let's try to understand why we ran into this unexpected behavior and what exactly a topology affinity error is. To do that, we're going to zoom into this cluster and get a better understanding of how the system has been configured and how those resources have been distributed.
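For context, node-level alignment of this kind comes from the kubelet's resource managers. A kubelet configuration along these lines is the kind of setup being described; the field names are real kubelet.config.k8s.io fields, but the values here are illustrative:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Pin exclusive CPUs for Guaranteed-QoS containers.
cpuManagerPolicy: static
# Reject a pod at admission unless all of its resources (CPUs, devices)
# can be aligned on a single NUMA node.
topologyManagerPolicy: single-numa-node
topologyManagerScope: container
# Illustrative: hold some CPUs back for system daemons.
reservedSystemCPUs: "0,8"
```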
So, as we see at the top, the topology manager has been configured with the single-numa-node policy, and the resources are distributed across the NUMA nodes in such a way that on one node the resources are interleaved between the NUMA nodes, whereas on the other node, NUMA node 0 has all instances of device A and all instances of device B are on NUMA node 1. And this clearly shows why we were getting the behavior we saw. The topology manager is responsible for taking care of resource alignment at the node level; however, the scheduler is topology-unaware, and that is what leads to suboptimal scheduling decisions. So, to optimize scheduling behavior, cluster performance, and resource utilization in general, we need to ensure that the scheduler is aware of how those resources are distributed and has granular information about them. And to do that, we came up with topology-aware scheduling.

This slide shows the phase one implementation of topology-aware scheduling, which was done as an out-of-tree solution. We started back in 2020, around the Kubernetes 1.22 timeframe, so we've come a long way. There are a few components. The first one is the topology-aware scheduler plugin. The kubernetes-sigs scheduler-plugins repository houses a number of out-of-tree scheduler plugins that are based on the scheduling framework, and we contributed the NodeResourceTopology scheduler plugin to it. This plugin gives us NUMA-aware scheduling decisions: it runs a very simplified version of the topology manager's algorithm, and it helps us determine whether a particular node is suitable to run a pod based on the resources it requests.

For the scheduler to have enough information, we have to provide it. We do that through the NodeResourceTopology API, which we often call the NRT API. It is a CRD-based API, and with it we are able to describe how resources are placed on different NUMA nodes, or on other hierarchical structures in general.

The third component is the topology updater agent. In addition to the two components above, we need a component that runs on all the nodes and exposes the hardware information; the scheduler plugin then uses that exposed information to make the topology-aware placement decisions. We have two software components that can serve as topology updater agents: NFD (node feature discovery) and the Resource Topology Exporter (RTE). The diagram on the right shows the interaction between the different components: the control plane, where we have the NodeResourceTopology API, the kube-apiserver, and the topology-aware scheduler plugin; and the worker nodes, where we have the node-side components. The topology updater agents on the nodes interact with the kubelet via the pod resources API, and then we have the workload, which gets placed by the topology-aware scheduler.

As part of phase one, we were able to come up with an end-to-end solution with all these software components, but there was still a lot to be done. We had to make several optimizations across the stack, as well as in all the components we just discussed. One of the issues is what this slide shows about the interaction between the components: there can be scenarios where some of these operations, being time-sensitive, lead to race conditions.
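Before looking at how that race is resolved, here is a hedged sketch of the per-node object the NRT updater writes, roughly in the shape of the noderesourcetopology API; the group/version and exact fields have evolved across releases, and all the numbers here are invented:

```yaml
apiVersion: topology.node.k8s.io/v1alpha2
kind: NodeResourceTopology
metadata:
  name: worker-1                    # one object per node, named after the node
attributes:                         # top-level attributes, added in v1alpha2
- name: topologyManagerPolicy
  value: single-numa-node
zones:                              # one zone per NUMA node
- name: numa-node-0
  type: Node
  resources:
  - name: cpu
    capacity: "8"
    allocatable: "7"
    available: "4"
  - name: example.com/deviceA       # hypothetical device resource
    capacity: "4"
    allocatable: "4"
    available: "4"
- name: numa-node-1
  type: Node
  resources:
  - name: cpu
    capacity: "8"
    allocatable: "7"
    available: "7"
  - name: example.com/deviceB
    capacity: "4"
    allocatable: "4"
    available: "4"
```

The scheduler plugin reads the per-zone available counts to decide whether a pod's request can be aligned on a single NUMA node of a candidate node.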
So we need to understand how we can overcome those scenarios, so that the scheduler has up-to-date information and makes the right scheduling decision. This slide gives an example of a race condition between the scheduler and the updater. The blue arrows at the top indicate the pods being scheduled over time, and the green arrows at the bottom indicate the NRT object updates that happen after pod creation. As I said, the accuracy of the scheduler plugin and of its decisions relies on having accurate and fresh data. The scheduler can end up with stale information, as you see in that corner: there can be scenarios where a pod has come up and occupied resources, but the resource information has not been updated yet.

There are several ways of overcoming this. You can increase the updater frequency, or you can have a watchable pod resources endpoint, or other kubelet endpoints for that matter. But the problem is that the updater is always going to be slower than the scheduler plugin, and the main reason is that, after a pod reaches the API server, the scheduler is the next component that serves and processes it. So it made sense for us to make changes in the scheduler itself. We optimized the scheduler to take advantage of a local cache with the help of a reserve plugin: we essentially used the reserve extension point that the scheduling framework provides.

So how did we enable the reserve plugin and implement the local cache? The first step was to keep track of all the resources that were allocated but not yet reported back; we call those unreported resources for the rest of the slides. This cache update happens every time a pod is scheduled onto a node. Second, the scheduler doesn't have visibility into which NUMA node the resources are eventually going to be allocated from; that's the responsibility of the topology manager. So the scheduler takes a pessimistic approach and deducts the unreported resources from both NUMA nodes. And the third step is to invalidate the cache. To do that, the scheduler constantly checks, after the NRT updates have happened, an object called the pod fingerprint, which it uses to determine whether the node state corresponds to the one written by the NRT updater.

So I have a small demo, but I'm going to go through a few slides first to explain what I'm going to show and what the expected behavior should be, and then I'll show it on the command line. As you see here, we have a cluster deployed through minikube, consisting of three VMs: the first one is the control plane, running the API server and the topology-aware scheduler plugin; the other two are worker nodes running the kubelet with the topology manager configured with the single-numa-node policy. Resource-wise, we are only looking at CPUs for this demo. We have 16 CPUs on each node, distributed across two NUMA nodes, with one CPU reserved per NUMA node. That means we have a total of 14 allocatable CPUs on each node, and seven per NUMA node. That's the high-level picture of the environment we're going to be talking about. What you're going to see in the demo is that we deploy a burst of pods, each requesting six CPUs.
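One of the test pods might look like the sketch below; the names, image, and scheduler name are assumptions, but the key point is that requests equal limits, which makes the pod Guaranteed QoS so the static CPU manager assigns it exclusive CPUs:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-test-pod-1
spec:
  # Hypothetical name of the secondary scheduler running the
  # NodeResourceTopology plugin.
  schedulerName: topology-aware-scheduler
  containers:
  - name: test-container
    image: registry.k8s.io/e2e-test-images/busybox:1.29  # placeholder image
    command: ["sleep", "inf"]
    resources:
      requests:
        cpu: "6"
        memory: 256Mi
      limits:
        cpu: "6"        # requests == limits => Guaranteed QoS
        memory: 256Mi
```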
In aggregate, this cluster is able to accommodate that much resource. But is that what we are going to see? Especially given that we already spoke about over-reservation, with the scheduler taking the pessimistic approach of deducting resources from both NUMA nodes. So when the first pod comes in, it occupies resources from NUMA node 0 and NUMA node 1; the update you see here is a reflection of the local cache and how it's updated. Once the second pod comes in, on the other node, it again occupies both NUMA nodes. Then what happens to the third pod? Because there are not enough resources left in this cache, the scheduler keeps it in the Pending state. But after the system has reconciled, and the state information written by the NRT updater matches the state captured by the scheduler plugin, we see the pod go into the Running state.

So I have a recording here that I'm going to walk through. In this demo, as I said, we have a minikube-based cluster; the nodes are the VMs, and the cluster is deployed with Kubernetes 1.26. You can see in this demonstration that we have three nodes in the cluster: one control plane and two worker nodes. This is the environment we're using to demonstrate the NUMA-aware scheduler. The workers have two NUMA nodes each and, as I already explained, 14 allocatable CPUs per node. You can see over here that this corresponds to the NUMA nodes: we have seven CPUs per NUMA node. This showcases that we have a system with NUMA nodes and that the resources are split across them. Configuration-wise, the kubelet has been configured with the single-numa-node topology manager policy. And if we look at the resource allocation, this shows what the NRT API looks like: here are NUMA node 0 and NUMA node 1 of node 1, showing allocatable, available, and capacity. Actually, I'd like to show one more thing here: in addition to all of this, we have something called the pod fingerprint, which had to be enabled for us to do the cache invalidation.

Next, we are going to see our test workload. As we already saw, each pod is going to request six CPUs, and we have three pods here. This test workload simulates a burst of pods, essentially to replicate the race condition I was talking about. We have seven allocatable CPUs per NUMA node, as I already explained, and the six-CPU requests are there to saturate the cluster. Once we create the test workload, we want to see whether the pods were created. As I explained earlier, the two pods are created, but the third pod stays in the Pending state. Now we have to wait a little bit for the system to reconcile, and we are constantly polling to check whether the state has reconciled. After a little while, what we'd see is that the third pod goes into the Running state and gets scheduled. So let's wait a few more seconds. There we have it: test container three of pod three has gone into the Running state. And now we can look at the resource allocation. Yes, that's the pod running, and when we look at the resource allocation, that's what we see: the resources have been updated, corresponding to the second node. We'll take a quick look at the pods, and we can see that the pod has gone into the Running state successfully. And prior to that, actually, let's look at this: it was attempting to schedule the pod.
And here it was basically showing that it wasn't able to schedule the pod. Let me see here... that's just me fumbling. Yes, so this is where you see that it didn't admit the pod at that stage. And that completes the demo; I'll move back to the slides.

In general, for the phase two developments of the scheduler plugin, we have the reserve plugin implementation, and we have the PRs corresponding to that. We have different caching strategies, such as overreserve and discard-reserved. For scenarios where we have to allocate resources from multiple NUMA nodes, because there are cases where you can't allocate everything from a single NUMA node, you may still want to use as few NUMA nodes as possible, so there is a scoring strategy for that. And then we had to make a few changes to catch up with the changes to the NRT API.

As I said, we had to make changes to the NRT API itself, and this was in collaboration with other community members, who gave feedback on other use cases that can use the NRT API; NRI plugins is another component using this API, in addition to topology-aware scheduling. We introduced top-level attributes for exposing topology policy information. The next step for us is to gather more feedback and move towards v1beta1.

Similarly, we had to work on the third component of topology-aware scheduling, the topology updater agents. As I said, we have two of them, RTE and NFD, and at Red Hat we had previously been focusing mostly on RTE development. We are currently working hard to bring these two software components up to speed and to feature parity with each other, so we made a bunch of changes to make sure NFD catches up with RTE, and this work is still in progress. As part of the demo, I showed the pod fingerprinting: to enable the reserve plugin, we had to make changes to NFD itself so that it supports pod fingerprinting, and this change corresponds to that. And finally, we had to adapt to the v1alpha2 API changes, and this work is going to continue, so watch out for that.

Our future plans include scalability testing: we would like to test this solution at scale and evaluate how we can improve it, and to integrate it with dynamic resource allocation as well as the VPA. In the long term, we would like to see this solution move in-tree in Kubernetes, which would naturally enable integration with DRA and the VPA.

I just want to call out all the people who've been involved in this effort; it has been a huge effort spanning multiple years, so thanks to all the contributors and reviewers who've been working tirelessly on it. I want to especially thank Francesco Romani, who is here, for creating the demo I just showed. And last but not least, I just want to mention that we are looking for input from community members: if you have use cases, thoughts, or any questions, feel free to reach out to the batch working group via Slack or the mailing list. We meet every other Thursday at 2 p.m. UTC, so it'd be nice for you to join us there. Thank you very much. I'm happy to take questions if there are any.

Hello, thank you for the great talk. Quick question: in the pod failure policy, you mentioned two actions right now, failing and ignoring. Is there any way to implement custom actions? No.
I would ask you to open an issue so that we can discuss new additions, because, if I remember correctly, it would have to be implemented in the job controller. But yes, we can discuss it. We are looking to implement per-index failure rules, for example, but that's still being designed. Thank you.

And can I ask one more question? The corollaries you mentioned in this scheduler: what happens if that corollary step fails? How can I handle it? Is there a specific place where I look for the logs for that? Like pod resources, not the pod worker, or closed resources? I'm not sure I follow the question. Yeah, there was a step, when there is a correlation step, which, like, resource correlations. If you can go back to the slide. Tolerations, you mean? No, no, correlations. Oh, corollaries? Yeah, probably. No, a corollary in this context is simply an issue that follows from the other issues we discussed. No, no, there is a step when you manage to find the proper resources. You have some predictions, and then you realize you have other resources which are available, right? And there is another step when you update the scheduler. Right. So, no, the way it works is that, on a regular interval, the NRT updater, the component that populates the NRT API, gathers information from a component within the kubelet and populates the data. The scheduler maintains its own local cache and keeps track of the resources internally, because the NRT update can happen later, from a data standpoint. So it keeps its own cache and makes the scheduling decisions based on that. The whole idea of implementing the local cache is to make sure we prevent that failure in the first place. You saw that we could have ended up with a topology affinity error; what we are trying to do is avoid that by keeping the pod pending. We feel that keeping the pod pending is better than having it fail, because these are important pods for important use cases. It's better to keep them pending than to have them fail. I hope that answers your question.

I think we are out of time. We can chat after the session. Thank you very much for today. Have a good day.