Hi, everyone. My name is Abdullah. I work for Google, and I'm going to present our work on the JobSet API, specifically for on-demand systems, HPC systems, and large-scale training. This is a presentation I prepared with Vanessa Sochat from Lawrence Livermore National Laboratory. Unfortunately, she was not able to attend today, but she made a recording. So this presentation is going to be in two parts: I'm going to present the first part, and I will play a recording for the second part.

For a while, we've been talking about Kubernetes as not being adequate for batch workloads. But I'm going to start by saying Kubernetes is adequate for batch workloads. We've been working for a while on improving Kubernetes to support batch workloads, as you've seen in the past few KubeCon talks, mainly through the Job API. We've been improving it to support various kinds of batch workloads, including training and HPC. The main feature we added to the Job API was Indexed Job. In addition to that, we have new features such as pod failure policies, and the ability to set up stable network identities for pods created through an Indexed Job, all of which are requirements if you've ever worked with MPI or distributed training.

Even though I claim that Kubernetes is now a home for batch, I think we can still do better; there's always a "but". So we've been working on a new API in a sub-project under Kubernetes called JobSet. As I mentioned, it's a new API. The idea behind it is that it manages a group of Jobs as a single workload. We're not trying to reinvent the wheel; we're trying to reuse the core Kubernetes Job API and expand its use cases. As you can see in the graphic on the right side, with JobSet you can specify multiple job templates, and for each template you can specify how many jobs, or replicas, of that template you want. It also allows you to set up success and failure policies for this new workload that is represented as a collection of jobs.

In general, what JobSet tries to do is automate multiple patterns that we've discovered in training and HPC type workloads. As I mentioned: multi-template jobs, and setting up pod-to-pod communication. With Indexed Job, you used to create the Indexed Job, then create a headless Service yourself and set everything up, and then clean it all up manually. With JobSet, we're trying to automate those things. We're also trying to provide some common failure and success policies, like: when do we consider the whole workload failed? In some cases, for example, you have a leader/follower pattern; if one of the followers fails, you don't necessarily need to fail the whole workload, but if the leader fails, you want to say, okay, fail the whole workload. Same thing with success: when do we consider the whole workload successful? If we are managing a group of jobs, which job, when it succeeds, means the whole workload has succeeded? Or should all of the child jobs have to succeed for the workload to be considered successful?

JobSet also tries to provide knobs to decide how to place these jobs. In some cases, with distributed training, you want to place the jobs on different parts of your infrastructure. Consider, for example, that you have different racks, and each job represents a shard of the training workload.
You want really high-bandwidth communication between the pods that are responsible for each shard, so you create a job per rack, and you want to ensure that each job lands on a different rack.

I'm going to talk about two use cases we use JobSet for. The first one is large-scale distributed training with TPUs, and Vanessa is going to talk about the second one, which is the HPC use case.

So what is a TPU? This is the only pitch for TPUs: it's Google's accelerator, designed specifically for machine learning. It works with most frameworks, supporting PyTorch, JAX, and TensorFlow, and it integrates with Kubernetes. There are two main form factors for TPUs. One is what we call the TPU device, where you have a single VM attached to a TPU device; there is no special network communication with other devices in the cluster. It's usually intended for inference-type workloads, but that's not our focus for this talk. Our focus is on the other form factor, TPU slices, where you actually provision a group of VMs as a unit. Those VMs have the TPU devices attached, and they are connected with special high-speed interconnect links, essentially point-to-point fiber links between the devices. It is specifically designed for distributed training workloads. Those slices can be provisioned in different shapes; here I'm showing a two-by-two, but it can go all the way up to 64 VMs, or nodes, in a single slice.

So how do we train workloads on TPU slices? The training setup is really simple from a Kubernetes perspective. You need a pod per node, and that pod consumes all the TPU devices on its node. Each pod needs a unique ID, and we need to set up stable network identities between the pods so that the lower-level communication library, called libtpu, which is pretty similar to MPI, can do the distributed communication. And for failures, it's like any distributed training framework: if any pod fails, we need to restart the whole thing.

So how do we do this with Kubernetes? Well, easy: just use Indexed Job, which I just described. You specify the number of workers in the parallelism and completions parameters, and you set backoffLimit to zero, meaning if any pod fails, just fail the whole job, because even if that pod were recreated, the job wouldn't be able to continue training. Then we manually create the headless Service and set the pod subdomain so that the pods get stable hostnames, and we set some environment variables for the training framework to work.

Now the issue is, we're talking about LLMs that are significantly larger; no single TPU slice can train the kind of LLMs we're trying to train. So we're now looking at multi-slice training, where the training is sharded across multiple slices and we have two levels of orchestration: orchestration within the same slice, which I just described, and coordination between slices. The failure policy is similar: if any pod in any slice fails, it's not enough to just recreate that pod; everything else needs to restart as well. So how do we do this? Well, easy: just create this massive YAML file with an Indexed Job for every single slice that you have.
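For reference, one of those per-slice Indexed Jobs, together with the manually created headless Service, might look roughly like the sketch below. The names, image, slice size, and TPU chip counts here are placeholders for illustration, not our exact production manifests:

```yaml
# Headless Service: gives each pod of the Indexed Job a stable DNS name.
apiVersion: v1
kind: Service
metadata:
  name: tpu-slice-svc
spec:
  clusterIP: None               # headless: per-pod DNS records, no load balancing
  selector:
    job-name: tpu-slice-0       # matches the pods created by the Job below
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-slice-0
spec:
  completionMode: Indexed       # each pod gets a unique completion index
  completions: 4                # e.g. one pod per VM in a 2x2 slice
  parallelism: 4
  backoffLimit: 0               # any pod failure fails the whole job
  template:
    spec:
      subdomain: tpu-slice-svc  # stable hostnames like tpu-slice-0-0.tpu-slice-svc
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/trainer:latest   # placeholder training image
        resources:
          limits:
            google.com/tpu: 4   # consume all TPU chips on the node (example count)
```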
Set up some environment variables to make it work. But this is hard to manage, right? What if my training workload spans hundreds of slices? How do I monitor the status of the whole job? Now you have 100 or 200 Indexed Jobs, and there is no single place to look for the status of the whole workload. How do I manage failures? How do I fail all the other jobs if one job fails? Within a single Indexed Job we had that covered: if one pod fails, I can set backoffLimit to zero and the job controller will mark that job as failed. But now I need that behavior across all the other jobs as well. And last but not least, how do I ensure that each job lands exclusively on one slice? This is where JobSet helps us.

This is an example workload we ran on GKE. At this scale, we had over 12,000 nodes and over 50,000 TPU chips. We had almost 200 TPU slices, basically 200 Indexed Jobs that we had to create, and each Indexed Job had 64 pods. We managed it using JobSet, and we also had Kueue installed in the cluster, because we had other, smaller jobs running and we needed to manage the resources they used.

So how did we do that with JobSet? As I mentioned, JobSet allows us to create a number of replicated jobs. For the use case I'm describing here, we didn't really need jobs with different templates; that's the second use case, which Vanessa is going to talk about. For our use case, it was mostly about creating multiple Indexed Jobs from the same template; they are replicas. In the blue box you can see the template, which is really an Indexed Job; it's a job template, and you specify the same things you specified before. The replicas parameter specifies how many Indexed Jobs need to be created. In the success policy, you can specify that the whole workload is successful only if all the child jobs finish successfully. With the failure policy, right now we have only one type, which is: if anything fails, the whole JobSet fails, but you can specify how many times you want to restart it. So this also gives you an automated way to restart the whole workload when that happens; the way it does this is by recreating all the jobs when a failure occurs, to force a restart. Then there are a bunch of labels and annotations that JobSet injects into each job it creates, like the index of each job, which makes it fairly straightforward to map them to the lower-level frameworks as environment variables.

One last thing here is still under development, which is why we only have it as an annotation and still need to migrate it to a proper API: exclusive placement. With this, what we're saying is that each job the JobSet creates should land on a unique set of nodes that share a TPU slice ID. On the right side, you can see that each TPU slice has a bunch of VMs, and they get a shared TPU slice ID; for example, this is TPU slice ID zero, one, and two. With JobSet, by specifying this parameter, it's going to ensure that, for example, job zero's pods land only on one slice, and only job zero lands on that slice; it will repel all other jobs.
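Putting those pieces together, a JobSet spec for this kind of multi-slice workload might look roughly like the sketch below. The API version, annotation, topology label, replica counts, and image are illustrative; check the JobSet project for the current schema:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-training
  annotations:
    # alpha exclusive-placement annotation: pods of each child job land on nodes
    # sharing this topology label, and no two child jobs share a topology domain
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3              # recreate all child jobs up to 3 times on any failure
  successPolicy:
    operator: All               # the JobSet succeeds only when every child job succeeds
  replicatedJobs:
  - name: slice
    replicas: 200               # one Indexed Job per TPU slice
    template:                   # an ordinary batch/v1 Job template
      spec:
        completionMode: Indexed
        completions: 64         # one pod per VM in the slice
        parallelism: 64
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: example.com/trainer:latest   # placeholder training image
              resources:
                limits:
                  google.com/tpu: 4
```

Each child job also gets its index injected as a label and annotation, which can be surfaced to the training framework as environment variables, for example via the downward API.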
I find it interesting to dive a bit into how we implemented this. We implemented it using just pod affinity and anti-affinity; we didn't integrate a new scheduler into the JobSet operator. What we did was inject two scheduling constraints: a pod affinity constraint that ensures all the pods of a job land on the same slice. But that's not enough, right? I also want to make sure that no pod from any other job lands on that slice, and avoid those kinds of race conditions. So we also had to insert an anti-affinity constraint saying that any pod from any other job cannot land on the same slice. This combination of pod affinity and anti-affinity lets us implement exclusive placement. Our plan is to enhance this API and transition it to a proper field in the spec, but we also want to include other semantics. Exclusive placement is what we needed first, but I can see other types; for example, a job should land on the same slice, or think of it as a rack, but not in an exclusive way, so you could fit multiple jobs on the same rack. And in this case it's required, a hard requirement; we could also allow you to specify that it's preferred, so that for performance reasons we prefer that the pods of a job land on the same topology, but it's not a requirement, and we can continue to make progress if that topology isn't available. So that concludes the first part of the talk. For the second part, I'm going to play Vanessa's presentation on the HPC use case.

Hey, KubeCon. I'm Vanessa Sochat, I'm a computer scientist at Lawrence Livermore National Lab, and today I'm going to be talking to you about application building in Kubernetes using JobSet. About a year ago, I started my adventure as a developer for Kubernetes, and specifically I had one goal: to implement our job manager at the lab, called Flux Framework, inside of Kubernetes. So I want to take Flux and I want to take Kubernetes and I want to just touch them together, because that's how software engineering works, right? Not exactly. So there are these two communities, cloud and HPC, and in between them is this beautiful opportunity for the convergence of technologies and culture. We call this converged computing, and there was some rhyme to our reason in choosing Kubernetes and Flux: both of our communities use them for running workloads and jobs, and they're also both modular. So, for example, I can take a component from Flux and implement it inside of Kubernetes, and vice versa. So TL;DR, long story short: we created the Flux Operator to implement the entirety of Flux Framework inside of Kubernetes. It was an awesome experience, our first taste of convergence. So our next logical question was: okay, we have a workload manager, now how do we run applications? Let's go and talk to some fish in that bay. How are these cloud and HPC apps actually different? First, looking at application coupling: in cloud, we have very loosely coupled apps; in HPC, we have very tightly coupled apps, and we'll talk more about what that means later in the talk. For resource scheduling, in cloud we may want to run a pod with a certain amount of CPU; in HPC, we're actually scheduling more than software, we're scheduling hardware down to the PCI bus. For job queuing, in cloud we may do something like calculate a priority score.
In HPC, we have very sophisticated queuing algorithms that are often graph based. Finally, workflow management: cloud does this really well with automated, declarative management, bring in the YAML. In HPC, we also have workflow tools, but sometimes it's not really great; it's bash scripts all the way down. So our first step, after obviously getting our workload manager into Kubernetes, was: let's run some applications. Let's get into our bathing suits and dive into that bay. And our applications didn't run. Oh, why doesn't it work? The developer question of all time. Please, don't tell me "why doesn't it work?" So we needed to look deeper into this mystery. We needed to dive into the trench of treachery. Wait, how did that get there? I mean, the trench of discovery, to really debug this problem. So my team and I loaded up in our submarine, except I was the one on the outside who, in the movie, gets eaten by the shark or the underwater sea monster. No, I'm just kidding. So we went down, down, down, and our first stop was to try to better understand how we could model our complex HPC applications in Kubernetes.

We already had some experience with this because of the Flux Operator. We had used an Indexed Job: we had taken Flux, put it in a container, the container goes into pods, and by way of the Indexed Job we get duplicates. We can add some system configuration files, and then a headless service gives us unique hostnames so they can all talk to one another. This whole thing, the Flux MiniCluster, was awesome. But there were, hashtag, developer problems. There are always developer problems, aren't there? The first one was that we had to have different entrypoint logic based on the index, all in one script. We also needed to manually create the headless service. And finally, we really needed a way to say: when this main index zero is done, the entire job is done.

So okay, we understood the limitations of Job. We kept going on our adventure, and we encountered JobSet. Hey, JobSet, what's going on? What are you doing under the ocean? You know what, don't answer that, it's totally cool, I do not judge. Yes, we would absolutely love your help. With JobSet, we could add this other abstraction of a JobSet, and by way of having more than one replicated job, we could separate the logic of our different components, specifically into a lead broker and follower brokers. We could still have them on the same network, and with a success policy, we could actually say: when this lead broker is done, the entire job is done (a rough sketch of that pattern is shown below). Awesome. So we knew that JobSet would be the abstraction to allow us to build these complex HPC applications.

But we still had this problem, which is why we were underwater in the first place: our applications didn't run. Let's not forget about that. So we needed to take a trip into the cave of debugging. It really wasn't such a bad cave, because we found some friends in there, and our friends had some good ideas about what might be going wrong. I had some ideas too. What I ultimately found is that when I added a local DNS cache, the cluster that at first didn't come up at all suddenly came up. So who was lurking in the darkness? It was the DNS fish. Yes, you, get out of here. Nobody wants to hear about you, and yet you somehow are always around. So I said, okay, great, let's run our application again; maybe we'll see it run this time.
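To make that lead broker / follower broker idea concrete, here is a minimal JobSet sketch of the pattern. The replicated-job names, sizes, and image are illustrative only, not the Flux Operator's actual manifests:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: flux-minicluster          # illustrative name
spec:
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - lead-broker                 # the JobSet completes when the lead broker's job completes
  replicatedJobs:
  - name: lead-broker
    replicas: 1
    template:
      spec:
        completions: 1
        parallelism: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: broker
              image: example.org/flux-app:latest   # placeholder application image
  - name: follower-brokers
    replicas: 1
    template:
      spec:
        completionMode: Indexed
        completions: 3            # e.g. three follower brokers
        parallelism: 3
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: broker
              image: example.org/flux-app:latest   # same placeholder image
```

All of the pods share the JobSet's automatically created headless network, so the followers can reach the lead broker by a stable hostname, and the separate replicated jobs let each component have its own entrypoint instead of branching on the index in one script.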
So, just to give you a sense of what I'm seeing: what you're looking at here are application times for a fairly good problem size for our application, and you're looking at time as a function of the cluster size. What we're doing is called strong scaling, where you hold the problem size constant and give it more resources, and what you'd expect to see is that it gets faster. To some extent we do see that, up to size 32, but then at size 64, oh, something happens, and that something suggests that the cost of communication offsets any potential benefit of adding more resources. Hmm. And I also just want to point out that these times are not very good. So, oh, it's still slow. Thankfully, Antonio swam out of the collaborator sub and he was like, I can help, V. You see, there are all these things you aren't considering, from the base image to how you're building MPI, networking flags, and, importantly, the instance type. So we did many more experiments, and by the way, this is over many months, and we ran into new errors. Yes, so both of us were like, oh no, oh no. But I had this insight, and many developers have this insight: when you don't understand something, you need to measure things. We needed to measure the network and understand what was going on with our application. And because I had just done this work with JobSet, it was at the forefront of my mind. I was like, hey, JobSet, can you come down here? And JobSet was like, I don't know, it's kind of creepy down there. But JobSet gives us the flexibility to model not just our applications but also tools for performance analysis. The insight I had specifically was that so many of our applications, along with machine learning applications, have this launcher-and-workers design, so I could implement everything from proxy apps to networking, storage, and I/O tools using this common design. This turned into the Metrics Operator; yes, he's absolutely adorable. With the Metrics Operator, you create something called a MetricSet, and the MetricSet is going to allow us to look at that networking, which we thought was the culprit, specifically using something called the OSU benchmarks.

So we needed to dive deeper. First, a change of wardrobe, and we again went down, down, down, down, into the abyss of expectations, which is a very dangerous place to be, might I add. My first expectation was that HPC applications need low latency. We always have these beautiful InfiniBand fabrics; why would they not need it? So let's go back to this idea of tight coupling. What does that mean? I want to talk more about MPI, the Message Passing Interface. If we look again at this launcher/worker design, HPC is going to be using a tightly coupled design. We may start by thinking in terms of nodes; here we have a launcher node and worker nodes. But in HPC, because we need to run complex scientific simulations, we actually care about the more relevant unit of a process. So instead of looking at nodes, what you're looking at here are 14 CPUs that do the work, and they are going to be communicating using the Message Passing Interface, which is called an interface because it's more like a standard and there are different implementations of it. And I need to stress that because these processes are on different nodes, that often means we're going to have communication between processes on different nodes.
And that's why the networking is so important. There are different patterns of communication that we define, and we're going to talk about two families today: point-to-point and collective calls. Point-to-point communication might be a send and a receive, and this is exactly what it sounds like: you have one CPU sending a packet of data to another one. Hello, it's me, I've got a packet of data that you might want to receive. And then hopefully the other CPU sends a message back, or otherwise the first CPU is going to be making music videos in black and white that are really sad, and you know how that goes. The second pattern is called collective, and these are communications between collections or groups of processes. Here you see an example of an MPI sum reduction; this is a little like map-reduce, where you start with a bunch of numbers and they reduce down to one sum. Now, the most common collective call is MPI_Allreduce. This adds something called a broadcast, MPI_Bcast, which sends that result back out to more than one CPU. We can actually look at these collective and point-to-point calls across HPC applications, and as I mentioned, MPI_Allreduce is the most popular collective call, and send and receive are the most popular point-to-point functions. I must stress that with these communication patterns, without a low-latency network you absolutely cannot be performant.

So the OSU benchmarks are going to allow us to measure these, using two things: OSU allreduce for collective, and OSU latency for point-to-point. And we're going to be using the Metrics Operator. Here's what it looks like to create a MetricSet, and the goal of the Metrics Operator is really to make it stupid easy to run these HPC apps and benchmarks, powered by JobSet.

Alrighty, so here we're looking at OSU latency; this is a point-to-point benchmark on the cloud. We see latency in microseconds, in log scale, as a function of the packet size. And you're probably looking at this like, well, how do I know if this is any good? Well, we can compare it to HPC, which is orders of magnitude better, between 10 and 20 times depending on where you are on this graph. So I looked at this in my little abyss of expectations and I was like, oh, it's the latency, I knew it, that's the culprit. But I also had another malformed expectation, wink wink: I thought I knew that my application would run better on HPC. I assumed it would run better, but I hadn't actually run it yet. Oops. And at this time there was also sort of a blessing in disguise, which sucked at the time, but the blessing was that I did not have quota. The instance type I wanted to use, I didn't have quota for, so I had to fall back to a different instance type that I hadn't really tested as much. And I was shocked. So remember we talked about strong scaling? The strong scaling looked beautiful: we saw that as we increased the MPI ranks, the time went down. And although this wasn't the best I'd ever seen, I was in shock, because this was the result that I had wanted to see way back in May, many months earlier. So I'm sitting in my abyss of expectations with a nice stewing pot of confusion, like, what just happened? This didn't even work before. And of course my confusion very quickly turned into delusion, and I was like, it's okay, I'm still going to try on HPC, and I'm sure it will be better. Again: shook. So in what you're looking at here, we added in HPC; the cloud has the red boxes around it.
And what we actually see is that the cloud times were faster for our MPI application, except for the largest size, where one of our HPC clusters was slightly faster. What that suggests is that there's still some issue with communication between nodes when we get to those larger sizes. What could that cost be? What's going on here? We can then look at our allreduce benchmark, because that is a proxy for communication. I know this looks like someone sneezed on the slide, but let me try to explain what you're looking at. You're still looking at average latency in microseconds, in log scale, as a function of the packet size. There are two groups here. The top group is cloud, and those little dots in between are actually an increase in cluster size; the dot on the bottom is the smallest, and the one on the top is the largest. What you see is that as you add more nodes that need to talk together, the time goes up a lot. Now, HPC is the second group, and it's much lower; again, orders of magnitude. This is a cluster that has an InfiniBand fabric. And look at the difference in units: these are in the tens versus over a hundred, so there is a very large difference. So we do suspect that the network is still becoming an issue at these larger sizes. How could we look at this further? Well, we brought in our Metrics Operator again, which can also run a tool called HPCToolkit, because we wanted to verify that the time spent in allreduce was actually going up. I was running out of cloud credits, so I could only run this on three sizes, but what we saw is that as the size increased, indeed the time spent in allreduce, the time spent communicating between nodes, was increasing, and this in a way validated what we thought: communication is definitely the bottleneck at larger sizes.

I can say this comically now, but I woke up the next morning after all of this, after these results and being surprised, and I was overwhelmed. There were so many variables to think about, the story was not clear and logical, there were so many things I wanted to test, and it totally went against my expectations. This is why it's so dangerous to be in this abyss, and it felt really bad. But this is also why it's so important to have a supportive team, because my team came down and they were like, hey V, it's a little dark down here; I know you're having some kind of party or something, but maybe it's time to come back up. So we left the cave of debugging, and we had learned so much. The first point was almost joyful: we were like, wow, we were actually wrong about always needing the lowest of latency. The latency doesn't have to be super low, it just has to be good enough. And in fact, our application problem is CPU bound at lower ranks and network-latency bound at higher ranks, and the reason it works so well on cloud is that the cloud has really awesome, shiny, new CPUs. So at this point we could kind of return to the surface, and this is where we are today. These are the things we still do not understand. We do not understand why this particular instance type did not work originally. We also need to take a better look at our expectations: what expectations, what lore, do all of us carry from our communities that might just be entirely wrong? And then application design, this is the coolest bit: how can we take our applications, understand the patterns that they need, and map them to the right environments?
And so, to kind of summarize the story: we started in this new space that we didn't understand. We found a design strategy to map between spaces; this was JobSet. Then we figured out a means to easily measure and run things; this was the Metrics Operator, which ran everything from proxy apps to benchmarks. And operators in Kubernetes are so cool; they are like developer Legos for building things. I would recommend them. In parallel, no pun intended, we also need to understand application patterns and needs so that we can really best optimize the environment for them. And you know what, these are really complex, hard problems. This is not work that just one team can do, not even one lab. We need to bring in collaborators from other labs, industry, academia, all over the place, and we have to work together to solve these hard problems. And this is just the start. We are hoping to build a little bridge to get across from HPC to cloud land, but really what we want to move toward is this future where we can have the best of both worlds, this converged computing. And you know what, there are a ton of adventures ahead, and we hope that you join us. I can't promise that I won't be out somewhere on an adventure, but you know what, you are invited to come too. KubeCon, back to you.

All right, you can see why I went first; it's hard to match Vanessa's energy. So, just one last piece here on future work for JobSet. I mentioned the placement policy a little bit, and our plans to have a proper API with better configurability. The other thing is, as you might know if you've ever used pod affinity, it's not the most performant constraint you can put on pods, so we're thinking about approaches to accelerate it. One solution we have is a leader/follower pattern: only one pod of each job, for example index zero, would carry the pod affinity and anti-affinity constraints; you can use a webhook to do that. Then we would have a validating webhook that blocks creation of all the other pods of that job until the leader is scheduled. Once the leader is scheduled, we inject node affinity into the followers so they follow the leader pod onto the same topology; for example, if it's a TPU slice, they would go to the same slice ID. That makes it much faster: we split the scheduling into two pieces, the 200 pods that represent the leaders of the 200 jobs we create, which will schedule a bit slowly, and the rest, which will be fast.

On the failure policy as well, we have some ideas. Right now we recreate all the jobs when a failure happens, which again is a big hammer; it's too expensive. We're thinking about ways to do in-place pod restarts in a reasonable way, and the main issue is how we broadcast to all the pods that they should restart. One idea is that maybe we can have a ConfigMap that all the pods mount, with a sidecar that receives that broadcast, to make it more composable; once it receives the restart signal, it goes and sends a kill signal to the main container. And that's pretty much it. So this is our repo; it's sponsored by the Batch Working Group, and if you have any questions, please reach out to us on the Batch Working Group Slack channel or email. Also, a special acknowledgement to Daniel, who helped implement JobSet and couldn't be here today. And thank you.