Hello. Welcome. We are going to talk about the Batch Working Group, the CNCF Batch Working Group. Thank you all for coming. It's lovely to talk to so many people about what we're doing here. To start, I'm Alex, Alex Scammon. I work for G-Research, and I'm one of the chairs of this batch working group that we're going to talk about. Hello. This is Klaus. I have a lot of experience with batch systems; I used to be a SIG chair in Kubernetes, and I'm the founder of Volcano. So I'm working with Alex on the CNCF batch working group right now. Yeah. And we have another chair who isn't here, Weiwei, who currently works for Apple but spent a lot of time working on YuniKorn at Cloudera. But it's not just us. It's also a bunch of people who show up for some reason every couple of weeks, and we chat about batch systems: people from SchedMD, and you can read the list up there, Red Hat, IBM, Huawei, Alibaba Cloud, ByteDance. There are lots of people involved in the conversation. What we set out to do was to align all of these different projects that we'll talk about in a minute, to see if there are any commonalities, anything that we can share together, and to create protocols and standards and things that we can all share. Over the course of a year or so, we realized that we needed to rein in our expectations, and we have mostly been starting with education: trying to get a grasp, and help other people get a grasp, on what all the options out there are right now, because there are lots, and we'll talk about them soon. Eventually what we'd like to do is broaden the discussion from just batch schedulers, which we'll talk about today, to more of a system view of batch, and maybe get back to that lofty goal of aligning everybody. But we'll see; we'll start small. Yeah. And as you know, there is also another working group in Kubernetes: the Kubernetes batch working group. We have lots of interaction with them.
But in the CNCF working group, we focus on several projects: not only CNCF projects, but also Apache projects such as YuniKorn, and some projects that aren't in the CNCF at all. We're going to provide solutions and suggestions across the different projects. This is a bit different from the Kubernetes batch working group, but we have lots of connections and work closely together. Yeah. It's very confusing: two groups both called the batch working group, one Kubernetes, one us. We're all nice. And there's a question, I suppose, about why there's even a discussion about batch scheduling to be had. The roots of the conversation are in the traditional view of Kubernetes as a place where you run long-lived services: your web server, things that need to stay up forever and ever. For people who want to run traditional HPC workloads, it's slightly different. The workloads look different in shape, in duration, and in how you use your farm. When people tried to use Kubernetes for those things, they ran into all sorts of problems, and it's been a discussion ever since about how to make Kubernetes more amenable to that kind of workload. There's also a discussion around all the people who have been doing traditional HPC batch for 20 or 30 years: they know their tools, they love their tools, they hear about this cloud thing, and they don't want to deal with it. Should we be bridging the gap between the traditional old HPC world and where we are in cloud? That's the fundamental question of this conversation. The other part, as I alluded to earlier, is that we're only going to go through schedulers today, but batch scheduling actually encompasses a lot of other pieces needed to make a batch workload work, and that's part of the holistic thing that we need to address in Kubernetes to be able to run these workloads.
So right now I'm going to give you an overview of all of the existing cloud options that you have, and you will walk out of here in about two minutes understanding deeply all of the architecture. First we have Kueue: there's a controller, it's just like Kubernetes, you don't have to do much else, great, that's done. Volcano: we've got a scheduler, the Volcano scheduler, Volcano admission, we've got jobs, okay, that's fine. Then we've got YuniKorn: okay, it's using the scheduling framework, great, brilliant, you have some shims in there, got it, awesome. Koordinator: I just heard about this one last week, but I deeply understand all of the infrastructure about it. Godel: this one isn't even out yet, this just got added last night; brilliant, pods, I think, get scheduled somewhere. MCAD: this diagram doesn't even show that this is all about multi-cluster stuff; it gets even more complicated than this. And Run:ai: if you don't even want to run your own thing, you can have Run:ai run it. Brilliant, genius. Let's go to Armada, the one that I'm affiliated with; again, this is multi-cluster stuff, look at how simple that is, brilliant, genius. LSF: if you're in the traditional world, the LSF scheduler just replaces the Kubernetes scheduler, and hey presto, you're running LSF. SUNK: kind of the same thing, you're going to run Slurm, brilliant, genius, look at that diagram, everyone gets it, right? Okay. So we've only got 20 or 25 minutes here, and there's no possible way that we're going to give you an in-depth view of how all of these things work. But there is a lot of similarity between them, and there are also lots of very subtle differences, which is the basis of the conversation that we have every two weeks: we sit down and try to puzzle out what works and what doesn't in this ecosystem. So we've tried to normalize the views of all of these batch schedulers.
We'll start with Kueue, which is a native batch scheduler for Kubernetes. This is actually being driven out of our sister team, the Kubernetes batch working group, not us (the CNCF batch working group). The Kubernetes batch working group has been driving this project, Kueue, and this picture should look very familiar to you; it's bog-standard Kubernetes: your control plane, your worker nodes, they talk to each other, there is a Kueue controller, and it will run your batch workloads. Great. And the little bento box, the sushi up there, that's your standard bento box for batch scheduling. Now we'll move to another one, which Klaus is very familiar with. For Volcano, this is going to be a long story. It used to be kube-batch, the first batch scheduler in Kubernetes, and then we expanded the scope with a job controller, device plugins, and so on, to include everything there. So it was separated from Kubernetes and donated to the CNCF, and Volcano is now an incubating project in the CNCF, so I think you'll see a lot more of Volcano in the future. We built Volcano based on experience from the HPC world; I used to work at Platform Computing. I'm not sure how many people know that company, but I have a lot of experience there. Volcano is going to keep building batch support in Kubernetes and integrating with new Kubernetes features. Yeah. And importantly for this diagram, going from Kueue to Volcano, all we're showing here, boiling away a lot of complexity, is that essentially we just replaced the scheduler with Volcano. That's really all that we've done. Before we go to something like Fluence, another important point: there's still the job controller, and you can still submit jobs, just like you can in Kueue; you see the job controller down there on the lower left.
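That admit-from-a-queue pattern (jobs are created suspended, and a queue controller unsuspends them when quota frees up) can be sketched in a few lines. This is a hedged toy model, not Kueue's actual API; the class and field names here are invented for illustration, while the real controller watches Kubernetes Job objects and flips `spec.suspend`:

```python
# Toy in-memory sketch of suspend-based admission. All names are made up.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    cpu: int                 # CPUs requested by the whole job
    suspended: bool = True   # jobs start suspended until the queue admits them

@dataclass
class ClusterQueue:
    cpu_quota: int
    admitted: list = field(default_factory=list)

    def try_admit(self, job: Job) -> bool:
        used = sum(j.cpu for j in self.admitted)
        if used + job.cpu <= self.cpu_quota:
            job.suspended = False    # unsuspended: the scheduler can now place its pods
            self.admitted.append(job)
            return True
        return False                 # over quota: the job stays suspended in the queue

queue = ClusterQueue(cpu_quota=10)
a, b = Job("train-a", cpu=6), Job("train-b", cpu=6)
print(queue.try_admit(a), a.suspended)   # admitted, no longer suspended
print(queue.try_admit(b), b.suspended)   # over quota, still suspended
```

The design point this illustrates is that admission is decoupled from placement: the queue decides *whether* a job may run, and the ordinary scheduler decides *where* its pods go.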
When you get to Fluence, they also replace the scheduler, but you don't create jobs through the job controller, so I've removed that. Again, this just shows how similar these things are at some level, but also very different. This one again uses the scheduling framework to override the scheduler and replace it with Fluence. There's a complicated history, a lineage of names here: they changed it from Flux to Fluence, I believe, because it collided with the Flux CD project. But this is yet another one where we just swapped the scheduler out. Then we go to Run:ai, and I think somebody might be here from Run:ai. Wow, that's cool. So Run:ai, that's interesting. I hear about this project from lots of customers, and we have a collaboration with Run:ai. I think Run:ai provides a good interface for the end user and can also manage lots of clusters for them. The design is just great: Run:ai can control multiple clusters and provision resources to improve resource utilization for the end user. For multi-tenant environments, it also manages the cluster and shares resources between the different tenants. That's great. And again, it's another one that just replaces the scheduler, again through the scheduling framework. YuniKorn, this one's from Cloudera. I don't know if we have anybody here from there; they sometimes send somebody to KubeCon. This one has some very interesting things around hierarchies of queues that you can create for your projects. That's an absolute necessity: being able to organize your queues to match your organization. A lot of these also have very fancy ways of doing the scheduling. All of these so far are interested in priority and fairness in queues, so that you can get fair use out of your farm. I could go through each one, but a lot of them have preemption and gang scheduling, all the sort of things that you would be familiar with coming from a traditional HPC universe.
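Gang scheduling, which several of these schedulers share, is worth a tiny sketch: either every pod of a job can be placed at once, or none is, so a half-placed training job never deadlocks while holding GPUs. This is a first-fit toy under assumed names, not any particular project's algorithm (real schedulers score nodes and bin-pack rather than taking the first fit):

```python
# Toy gang-scheduling check: place the whole gang or reject it atomically.
def gang_schedule(pod_demands, node_free):
    """pod_demands: list of GPU counts, one per pod in the gang.
    node_free: dict node -> free GPUs. Returns a placement dict or None."""
    free = dict(node_free)          # work on a copy; commit only on success
    placement = {}
    for i, need in enumerate(pod_demands):
        # first-fit over nodes; real schedulers score and bin-pack instead
        node = next((n for n, f in free.items() if f >= need), None)
        if node is None:
            return None             # one member doesn't fit, so reject the gang
        free[node] -= need
        placement[f"pod-{i}"] = node
    return placement

print(gang_schedule([2, 2], {"node-a": 4}))      # both pods fit on node-a
print(gang_schedule([2, 2, 2], {"node-a": 4}))   # third pod can't fit -> None
```

The key move is mutating only a copy of the free-capacity map, so a failed gang leaves the cluster state untouched for the next scheduling cycle.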
And a new one again: Koordinator. Yeah, that's interesting. Recently in the field there have been several requirements to support online workloads and offline workloads together in a single cluster: web servers alongside batch workloads such as transcoding, image processing, things like that. We call that colocation, workload colocation. The workloads have time-based resource requirements: for example, during the day the web servers need the resources, and at night the batch workloads, the big data jobs, need more resources. So the Koordinator project tries to balance resources between these kinds of workloads and also meet the QoS requirements of both. This is a common requirement for internet companies and for other offline workloads too. Yeah. So all of these so far have been single-cluster schedulers, and you've seen most of them just change out the scheduler; some of them have a job API that you can still submit to, and some don't. But the next one is MCAD, and this changes how we look at things somewhat, because it's aiming at multi-cluster batch scheduling, so that you can batch schedule across a whole bunch of different Kubernetes clusters. You'll see that I changed the diagram so that the control plane sits in its own control cluster, and then you have target deployment clusters where you can send the jobs out to. This one has an MCAD controller that is doing the work. You submit AppWrappers, I think; they have a special CRD to submit to Kubernetes. This is similar to the next one, which is the one that I'm affiliated with, Armada. Again, this is a multi-cluster kind of approach, and here you can see that I moved some other things around, because Armada isn't actually submitting to the API server or using etcd.
Part of the rationale here is that we wanted to avoid overloading etcd or the API server, so you talk directly to the Armada API, and that distributes the work out to the executor clusters that you see here. Oh, one little thing on multi-cluster stuff: Kueue, which I mentioned at the beginning, has just added MultiKueue, so it's doing multi-cluster stuff too. Next: okay, I'm not sure how many people know LSF. This one is traditional, by the way; that's why we switched to nigiri here. It's no longer a bento box, it's now nigiri. Yeah, LSF is, you know, the first generation of traditional HPC schedulers, and in the top 500, several of the top 500 clusters in the world use LSF for batch scheduling, so it's powerful, with helpful features. A few years ago we provided a kind of solution to integrate Kubernetes with LSF. The major purpose of that project was to leverage the scheduler capability of LSF to enhance the Kubernetes side. So in this case you can see that LSF still has its scheduler component and provides scheduling capability for Kubernetes. There are also agents on the compute nodes, because they need to report some internal data to the scheduler, and then Kubernetes and LSF can work together. Yeah. And another traditional one, if you're a traditionalist; this again is for people coming from the traditional world wondering how to get engaged. HTCondor: I debated whether to put this on the slides, because it has the least direct integration with Kubernetes. But in talking with the community, there are plenty of people who do run Condor on Kubernetes. In general, they either helm-install a Condor cluster directly into a Kubernetes cluster, or they'll have execution points (EPs) running in a pod itself.
And then an external Condor installation knows that those pods exist, knows that those EPs exist, and directs workloads over there. So this is yet another way that people use Kubernetes to run batch workloads with a traditional scheduler. Yeah. This one is about Slurm on Kubernetes. Slurm is another batch scheduler from traditional HPC, and again, in the top 500 I think there are lots of clusters using Slurm as the batch scheduler. A bit different from the LSF-with-Kubernetes approach, SUNK supports launching Slurm in a Kubernetes cluster. It also provides several features for the sysadmins or SREs to manage the cluster: it provides another scheduling and control plane and several features by leveraging Kubernetes, which already supports things like network management and storage management that Slurm can use. So the end users can just do what they've always done and submit their workloads to Slurm. That makes it easy for the end user, and also for the operator. Yeah, which is an interesting bridge, I think, for the traditionalists out there. When I talk to people who are firmly entrenched in HPC, the default is Slurm, and this might be a way to cross that divide. So I think we've gone through 10 or 11 different schedulers here, and the point, as I said at the start, is not the guts of any of them. The point of going through all of these schedulers is that they're actually very similar. They approach this either by replacing the scheduler or by having a controller. Sometimes there's a jobs API that you can submit to; sometimes there isn't. Sometimes there's an agent on the worker node; sometimes there isn't. There are some slight differences at this high level.
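As a rough illustration of the multi-cluster pattern that MCAD and Armada follow, here's a toy dispatcher: a central component owns the queue and forwards each job to the member cluster with the most free capacity. All names are invented for the sketch; neither project's real API looks like this:

```python
# Toy multi-cluster dispatch: a central queue forwards jobs to the
# least-loaded member cluster, or holds them when nothing fits.
def dispatch(jobs, clusters):
    """jobs: list of (name, cpu). clusters: dict cluster -> free CPUs.
    Returns dict job name -> cluster name (or 'queued' if nothing fits)."""
    free = dict(clusters)
    assignment = {}
    for name, cpu in jobs:
        # pick the cluster with the most free capacity that can hold the job
        best = max(free, key=free.get)
        if free[best] >= cpu:
            free[best] -= cpu
            assignment[name] = best
        else:
            assignment[name] = "queued"   # stays in the central queue, retried later
    return assignment

result = dispatch([("sim-1", 8), ("sim-2", 8), ("sim-3", 8)],
                  {"cluster-east": 10, "cluster-west": 9})
print(result)   # sim-3 waits: no cluster has 8 CPUs free anymore
```

The sketch shows why this topology eases pressure on any one API server: only the small dispatch decision is central, while the actual pod churn happens inside each executor cluster.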
And as I mentioned at the beginning, one of the things that we're trying to do is help people know that these things exist. One of the themes we've seen over the years is that because people didn't know these things existed, they went out and built new ones. We're both guilty of that, and really it would be far better if we could pool our resources, not build more of them, and work on these things together. So the first thing is just: hey, these things exist, and they're all actually pretty similar, so if you do choose one, you're going to have to dive in under the hood and look at all the details. There was actually a good comparison, a shout-out to Anish, I think from Red Hat, in the very last talk on the batch and AI day: they compared five of these tools, and it gives you a head start on what's actually running, what the policies are, what the controllers are. So check that talk out; it goes one level deeper than we've gone here for half of these things. Yeah. And the batch working group, I think, is a good place for the community and for users to provide more feedback on each project. That's going to be very important input to improve their features: what the open points are, what the high-priority features are, and whether there are new use cases for AI, for inference, for training, for distributed training, several things here. We built this working group to get more feedback for all of the projects, and when we can share some of these things, it's going to be great for everyone. Yeah. And one of the things we're working on sharing is adding a batch landscape to the landscapes of landscapes that we have in the CNCF. This isn't out yet; there's a PR discussing how exactly we want it to look. But again, we're trying to be a signpost for people who are interested in batch and don't know where to start.
These are existing projects that you should know about; please check them out. So that's the first thing we're working on. We also have some new entrants that we'd like to mention. Yeah. As mentioned, there are several new projects here. Koordinator covers colocation features for web servers, batch, and big data, several things there. I think another interesting topic is FinOps; colocation is a sub-topic of FinOps, about how to reduce cost. Right. There's also another project called OpenCost that does more cost analysis, and Crane, which I think is more about the colocation part. And then there's Godel, which has more features here; it's also working on colocation and on gang scheduling and job-level affinity, several features there. Yeah. And we do realize the irony of telling you about four new projects that have arrived in the last year when we've just been saying please don't create new ones. It's just that we were too late; these were already going. It's fine, they got in under the wire, but no more, okay? So what's next? As I mentioned, we've got the batch landscape that we're going to publish; hopefully people start seeing that there are options out there. We aspire to be very complicated very quickly, even in just how we define what a scheduler is. Normally we just start talking about definitions. It's great; you should join us. And thirdly, we want to broaden the discussion, as I said at the beginning, beyond just batch schedulers: there's a whole bunch of questions about how they work with workflow engines, how you work with underlying storage layers, how you work with all the distributed training frameworks that you want to use. Will we get information from DRA or the cluster inventory that can inform all of these batch schedulers so they can make better choices?
All of those things are part of the batch system initiative that we are talking about here. For the white paper, I'd like to say that we would like to get more input from the community, because one of the important items is terminology. As you know, different projects and different areas sometimes introduce interesting new terminology, and that makes users confused. So one of the important things in the white paper is that we will explain what a job is, what a task is, what the relationship between a task and a pod is, all of these things. We will try to align on and explain these terms. Definitions don't sound terribly scintillating, but we try to make it fun; we try to spice it up. It's fun. And it's not just us creating new terms; often the traditional HPC crowd has its own terms for what they call a scheduler versus what we call a scheduler. It's messy. So, how to join us. We've been talking this whole time, getting you excited about joining our meeting; here's how you can actually join it. You can scan this QR code; it'll take you to some notes about how to join. The meeting invite is in this presentation, and the PDF will be up a little later. It's the first and third Mondays of every month at 8am Pacific time (I will have to change that slide). There's also a Slack channel in the Cloud Native Slack; it's just the batch working group channel. You can also reach out to any of the chairs if you have questions about how to get connected. Please do. I think that brings us to the end, and if you have questions, we'd love to hear some. Yeah. Thank you. Hi. You talked about the top 500 a bit and how a lot of them use Slurm, etc. How do you see the landscape in the future? Do you see a high percentage of top 500 supercomputers using some of these, or do you think the performance crowd will stay with just the basic HPC stack? I don't know. What do you think? How do you feel today? Do you feel hungry?
Will the people who use traditional HPC right now, the research institutions and everyone entrenched in HPC, move to cloud? I go back and forth. Two years ago, when I went to SC22, I actually had somebody shout at me for mentioning Kubernetes. At SC23 they seemed more amenable to the word. I think they're thawing. I think if we can pool our resources and come up with a really good answer for batch workloads in cloud, it will be really appealing even to that crowd. There's also this sort of philosophical difference between being in cloud and viewing cloud as this fungible resource that you can just burst into, versus traditional HPC people who say, "I have bought this billion-dollar piece of hardware and I need to maximally use it all the time," and that changes how you philosophically build a batch scheduler. So there are those questions to answer, and if we can answer them all together, then there are signs that people are, as I say, thawing, and open to being able to do both cloud and traditional HPC in the same place.
What's more, as far as I know, there are also cases where researchers want to launch a cluster very quickly, so they'd like to just use Kubernetes to launch it. That makes it easy; they don't care about scalability, they don't care about stability, several things like that. Another thing is that in traditional HPC there's also the case where they leverage an AI stack or AI-related components, and in that scenario they have to consider how to integrate Kubernetes. I think everyone already knows that these days AI and Kubernetes are everywhere. So when we talk about HPC clusters with AI workloads, they may consider Kubernetes, and when they try to run AI or HPC workloads on Kubernetes, they'll have a question about which project to use, since each project has similar features, such as gang scheduling, fair share, and queues. That depends, but there may still be corner cases where one meets your requirements; if you have more input, we can discuss it and link you with the owner of each project so we can have more discussion. Thanks for the great overview. I had a question because you were talking about unifying some of these projects, but, for example, for me it was a novelty, because I was aware of the Kubernetes working group but not the CNCF one. So is there some kind of unified effort, or are these really two different scopes? I mean, there are two different scopes, but we all talk and we all like each other, and we're all in league. Even though we've made Armada, there are people on my team who contribute to Kueue directly, and that's the project that the Kubernetes batch working group is driving, sort of the main thing for them. So no, we're all friends and working together. Yeah. Oh yeah. Hey, so you already mentioned there are quite a lot of batch schedulers and kindly asked us not to create more. What I'm interested in is this: the need for computation
will just grow, and we see that in some places we're pretty much hitting the limits of a single Kubernetes cluster. I think it still might not be too late to standardize multi-cluster in Kubernetes. Say, Armada has one approach to how it solves multi-cluster; Kueue and MCAD use an almost orthogonal approach; and multi-cluster is also being used in a couple of other Kubernetes SIGs, which solve it in other scenarios with other approaches. So what is the stance of the batch working group on multi-cluster? Are there any efforts to at least standardize that? Because at least for the batch scheduling layer, it might be easier to merge some batch schedulers into one if they at least agree on the multi-cluster approach, because multi-cluster will definitely be the future. Um, is there a conversation to standardize? I would say no; it hasn't been a conversation yet. I think you're hinting at what I was referring to earlier, in reference to the first question, where there's a philosophical difference between people who have on-prem gear and those who are in the cloud, and that informs the choices you make, in this case for multi-cluster batch scheduling. So you should come and join us, and you should drive that conversation. I'd love it; that would be brilliant. One more? Okay, so you're telling us engineers that we cannot create more batch job schedulers, no more batch schedulers. Are we allowed to do other kinds of schedulers? In other words, is that part of the scope of the batch working group? What kind of schedulers, sorry?
DAG schedulers. So, DAG schedulers, yeah. Jonathan, no, you're not allowed to build any more DAG workflow engines. No, definitely not on Wasm. Don't do any new Rust thing. No. Jonathan is just trolling me here; he knows that I deplore new creations, new batch schedulers and new workflow engines. It's my pet peeve, so thank you, Jonathan, for queuing that up. Thank you all for coming; we really appreciate you. Please join the conversation; we would love to have you. Yeah, thank you so much.