So for our next presentation, we have Olivier from IBM, who is a principal research scientist, and Abhishek, a senior software engineer at IBM, to give us the next presentation. Thank you.

Thank you all for being here. I think I've learned a lot today, but I also find that the day has been a bit long. So thank you for sticking around for our talk and for the lightning talk afterwards. My name is Olivier Tardieu. Today I'm joined by Abhishek Malvankar, who is a senior software engineer and a tech lead on this project. And we want to tell you a little bit about the software platform we've been developing at IBM Research to train large models. So we've heard a lot about large language models already today, so that's good; I don't have to repeat too much. I just want to point out on this slide that we like to call them foundation models, because it's not just about language. The principles, techniques, and tools that have made these large language models very popular are applicable to many other modalities. And at IBM Research, we've been looking at other such things. In particular, we've been teaming with NASA to build geospatial models, models that will help predict flooding, fires, think about natural disasters, environment-related things, with the same kind of tools and techniques again. And the one thing that all of these models have in common, as you know, is that they are very large, right? Anywhere from hundreds of millions of parameters to maybe 10 billion or even more. So that means a lot of compute: you need very serious hardware to train these things, but also a very robust software stack to manage all of that. So briefly, just a quick picture of the hardware, because people always have questions about this. At IBM Research, we've been building the next generation of cloud-native AI supercomputers. The keyword here, of course, is cloud-native, which means different things. The first obvious one is running on Kubernetes or, in our case, on OpenShift. But it's more than that. It also means being able to scale from very small systems to very large systems, to actually dynamically scale as we are running, and so on. And in research, we've been, in particular, pushing the limits on how large we can make Kubernetes clusters, right? Many hundreds of nodes, growing clusters like that. But what we really want at the end of the day is the ability to either pool a large number of nodes into large clusters, or maybe rapidly switch configurations and divide that into many small clusters if that's more adequate for what we want to do with them. We've also heard, for instance, from Branii earlier today, some numbers about the cost of virtualization, the cost of Kubernetes, and so on. And we concur; we believe we can do this with very low overhead. So this is, for instance, a screen capture of a dashboard that we built for this cluster, sometime early this year, when we were ramping up and starting to deploy our software stack on these systems. At the time, there were about 1,200 A100 GPUs in the system. And the important point, and the whole thing we're going to talk about today, is that we want this system to be fully utilized. The number of idle GPUs needs to be really, really low for everybody to be happy. So now, when we talk about these systems, OK, there's the hardware. But the next question is, what's the software for this?
And if we want to build good software for this, we have to start with user requirements. And obviously, no matter the users, there are a few things that transcend every user category. We want performance, performance, performance, of course. And people want to know what's going on with the system, so we want observability. But beyond that, we have to look a little more carefully. The obvious first class of users of these systems are the data scientists, the AI experts that are actually using the systems to do what they're supposed to do, which is to train, validate, et cetera, large models. And they have requirements such as using the framework of their choice. Maybe they do the actual pre-training of the models using PyTorch, but for data preparation, they want to use Ray. And what they really want is to just be able to submit jobs to these systems and forget about them, in the sense that the system will take care of them, will babysit the jobs. Maybe a job will run for weeks. Maybe a job will start tomorrow. The jobs will just run when the time is right, and the system, as much as possible, should take care of all the small things that can happen when you run a job, like a node crashing, or a GPU, or a PCI link failing, et cetera. We'll talk about this more. That's the first category, and really the main characteristic of these users is that, no matter what, they are AI experts. They are not Kubernetes experts. And frankly, they're making a lot more money being AI experts than Kubernetes experts, so that's not going to change. The second category of users is the admins, namely the people who are there to keep the system running. What they want is maximum flexibility and minimum effort. What they also are is people who live and breathe Kubernetes. So they want to see the system through the Kubernetes lens, because that's what they understand. And finally, of course, we have the executives and the accountants, never to forget. These are the people who are there to decide what's important, what's going to run on a given day, what the quotas are, what the priorities are, and so on. They want to make sure that the money spent on this system is actually giving a return on investment, so utilization is high, we can scale on demand, and so on and so forth. So in order to serve all these categories of users, we've been building a stack in open source. Everything we're going to talk about is actually part of this community called Open Data Hub. At the top, and we'll describe this in more detail in a minute, we have the ML-expert-facing part of the stack, which is essentially Jupyter notebooks and Python that let you describe and implement jobs, run them, and so on. At the bottom, we have Kubernetes, or in our case, OpenShift. And in the middle, we have the meat of the talk for today, which is the workload management system, consisting of a mechanism to queue these batch jobs, run them, and repair them if necessary, and the matching component for cluster autoscaling. So before I let Abhishek dive into the stack, I just want to point out that this is, of course, only a small piece of the big puzzle that is bringing AI to the enterprise. Training and validation, to start with, even just creating and using a model, is just one part of it. Once we have trained and validated models, then we have to fine-tune them, then we have to serve them, we have to run them. So we're also working on that. And again, the AI piece is just one piece of the puzzle.
Data, we've already said this multiple times, is really critical to this process, and governance as well: how do we ensure traceability, and so on. So of course, IBM and Red Hat have products and services in this space that are building on top of the stack we're going to talk about today. I'm not going to talk about this in this talk, but please come see our booths, the IBM booth and the Red Hat booth, and we can fill you in on the details. Thanks. Over to you, Abhishek.

Thank you, Olivier. Hopefully, everyone can hear me. We are now going to learn about the different ingredients of the stack, and then have a view of the whole recipe. Let's learn about the first ingredient of the stack, the CodeFlare SDK. It's a simple Python interface, typically catered to researchers, ML engineers, and data scientists, for interfacing with the stack using Jupyter notebooks or a CLI. These users can create clusters and submit jobs on the created clusters using the CodeFlare SDK. A few notable APIs of the SDK: the cluster configuration object, which provides users with the ability to bring up a cluster, view details of the cluster, and finally interact with the spawned clusters by submitting jobs, viewing their statuses and logs, and finally performing cluster down operations; we'll see a small sketch of this in a moment. Before we move on to other ingredients of the stack, let's try to understand the difference between dispatching and scheduling through the stack's lens. When a user submits a workload, aka a custom resource, it lands inside the MCAD queue, as shown in the picture. At some point in time, the workload gets sent to the respective controller to spawn one or more pods. The process of sending the workload to the controller is dispatching. Once the controller creates pods, they rely on the scheduler to bind those pods to the nodes. This process is called scheduling. Since we typically deal with gangs, they should be gang scheduled, with packing enabled on the target cluster. Now that we understand queuing and dispatching, let's move on to the second ingredient of the stack. MCAD, or the Multi-Cluster App Dispatcher, provides batch computing capabilities on multiple Kubernetes or OpenShift clusters. It dispatches AppWrappers when aggregated resources are available, thereby guaranteeing workload execution and just-in-time pod creation. All this is done with zero code changes on the target operator. It provides features such as bring your own scheduler: it supports any upstream Kubernetes scheduler. It supports bring your own framework: you can bring Spark, Flink, PyTorch, Ray, or TensorFlow workloads. It also provides standard batch computing features such as priorities, preemption, and quota management, which we'll learn about shortly. And it provides fault tolerance for any Kubernetes compute object. Let's learn a bit more about the bring-your-own-framework capability of MCAD. This is possible with the help of the AppWrapper CRD. We see a few notable sections of the AppWrapper CRD as used in the stack. First, we see the resource version and the name. In the metadata section, we see quota trees, which are specified by a label, and which we'll learn about shortly. Later, we also see priorities supported in the spec section; these are integer priorities. And we also support fault tolerance in the schedulingSpec stanza. The bring-your-own-framework capability is supported by generic items: it has the ability to wrap, or simply append, any custom resource that you want, and it gets queued inside the MCAD queue. All this happens with zero code changes on the target operator.
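To make the SDK piece above concrete, here is a minimal notebook-style sketch of the cluster lifecycle. This is illustrative only: the import path and parameter names follow the public codeflare-sdk examples from around the time of this talk and may differ in other releases, and the cluster name, sizes, and counts below are made-up placeholders.

```python
# Minimal sketch of the CodeFlare SDK cluster lifecycle (illustrative; parameter
# names vary across codeflare-sdk releases, and the sizes below are placeholders).
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Describe the Ray cluster we want MCAD (and, if needed, InstaScale) to provision.
cluster = Cluster(ClusterConfiguration(
    name="demo-cluster",
    namespace="default",
    min_worker=2,                    # gang of 2 workers
    max_worker=2,
    min_cpus=8, max_cpus=8,
    min_memory=32, max_memory=32,    # GiB per worker
    gpu=4,                           # GPUs per worker
    instascale=True,                 # let InstaScale acquire nodes if needed
))

cluster.up()          # wraps the cluster as an AppWrapper and queues it in MCAD
cluster.wait_ready()  # block until the AppWrapper is dispatched and pods are running
cluster.details()     # view cluster details: status, resources, dashboard URL

# ... submit jobs, check statuses and logs (see the end-to-end recipe later) ...

cluster.down()        # tear the cluster down and release the resources
```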
Let's talk a bit about fault tolerance. Fault tolerance is needed for various reasons. GPUs fail very often at scale. There are degraded PCI links, which slow down the entire training at scale. Stuck pods, due to VM or node failures, also happen. And if nothing else, there are user errors, which definitely happen at scale. The schedulingSpec stanza in the AppWrapper provides the capability to monitor and maintain the state of the gang. It also has the ability to customize timeouts for the gangs submitted by the user. If these features are enabled, then MCAD can requeue a gang automatically during failures, with the ability to force-terminate stuck pods. To summarize, MCAD can detect nodes that are added, removed, or failed, while nannying gangs of pods. Now let's move on to the quota management feature, or hierarchical quota management feature, of MCAD. Let's try to understand this with the following borrowing example; we'll also see a small toy sketch of this logic in a moment. Consider a simple quota tree whose root has all the cluster resources and whose leaf nodes have their respective quotas. When a user submits AppWrapper 1 with the resource requirements shown for team B, we can see that team B does not have enough quota to satisfy the resource requirement of AppWrapper 1. When this happens, since there are no other jobs running inside the cluster, MCAD automatically borrows quota from team A and allows AppWrapper 1 to run. This is really important if you want to increase cluster utilization when there are idle resources in the cluster. At some point in time, a user submits AppWrapper 2, which belongs to team A. As we can see, the resource requirements of AppWrapper 2 are well within team A's quota. But team A has been sharing its quota with team B. MCAD realizes that, and as part of the fair-share mechanism, MCAD preempts AppWrapper 1 and allows AppWrapper 2 to run. MCAD does both a quota check and a resource check to dispatch an AppWrapper. If the resource check fails, then MCAD triggers InstaScale, which we'll learn a bit more about shortly. But from our experience, quota may not always be physical resources. For instance, some workloads may need tokens or licenses to be shared among users. To enable this use case, MCAD supports quota forest evaluation, meaning MCAD can evaluate multiple quota trees before an AppWrapper is dispatched. Let's talk a bit about the bring-your-own-scheduler capability. Different teams on the Vela supercomputer run different types of workloads. Even OpenShift prefers running its default scheduler to bind system pods. Hence, on the Vela supercomputer, for training workloads, we use the co-scheduler with packing along the GPU dimension to avoid fragmentation at dispatch time and potentially increase utilization. We covered a lot of internal features; now let's visit the UI aspect of MCAD. MCAD has a dashboard that supports different personas. It shows aggregate cluster utilization for the admin persona and also shows AppWrapper statuses for the user persona. Let's move on to the next ingredient in our recipe, which is the node autoscaler called InstaScale. If an AppWrapper is pending inside the MCAD queue due to insufficient resources, then InstaScale picks up such AppWrappers to acquire the gang of resources needed to run them. Acquiring gang resources can be time consuming. Hence, we use reuse policies to transfer resources acquired for previous workloads to the next workload. It has the capability to scale down to zero and works on a wide variety of OpenShift flavors.
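To make the borrowing and fair-share behavior above concrete, here is a small toy model in Python. This is not MCAD code: it uses a single GPU-count dimension, two made-up teams, and made-up quotas purely to mirror the order of decisions described in the example, namely dispatch within quota, borrow idle quota, or preempt a borrower when the owning team reclaims its share.

```python
# Toy illustration of MCAD-style hierarchical quota borrowing and fair-share
# preemption. Not actual MCAD code: a single GPU dimension, two teams, and a
# trivial in-memory "queue" are used purely to mirror the example above.

class QuotaTree:
    """Two-level quota tree: a root holding all GPUs, one leaf per team."""

    def __init__(self, team_quota):
        self.team_quota = dict(team_quota)   # e.g. {"team-a": 8, "team-b": 8}
        self.running = []                    # list of (appwrapper, team, gpus)

    def used(self, team):
        return sum(g for _, t, g in self.running if t == team)

    def free(self):
        return sum(self.team_quota.values()) - sum(g for _, _, g in self.running)

    def submit(self, name, team, gpus):
        if self.used(team) + gpus <= self.team_quota[team]:
            # Request fits the team's own quota: reclaim capacity lent to
            # borrowers if needed (fair share), then dispatch.
            while gpus > self.free():
                borrower = next((r for r in self.running
                                 if self.used(r[1]) > self.team_quota[r[1]]), None)
                if borrower is None:
                    return f"{name}: queued (no borrowed capacity to reclaim)"
                self.running.remove(borrower)          # requeued, not lost
                print(f"  preempted {borrower[0]} (was borrowing quota)")
            self.running.append((name, team, gpus))
            return f"{name}: dispatched within {team} quota"
        if gpus <= self.free():
            # Over the team's quota, but idle quota exists elsewhere: borrow it.
            self.running.append((name, team, gpus))
            return f"{name}: dispatched by borrowing idle quota"
        return f"{name}: queued (insufficient quota)"


tree = QuotaTree({"team-a": 8, "team-b": 8})
print(tree.submit("appwrapper-1", "team-b", 12))  # borrows idle quota from team A
print(tree.submit("appwrapper-2", "team-a", 6))   # team A reclaims its quota
```

Running this prints that AppWrapper 1 is dispatched by borrowing idle quota, and that it is then preempted so AppWrapper 2 can be dispatched within team A's quota, which is exactly the sequence in the slide example.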
Finally, let's see the entire recipe with all the ingredients added. Users use the CodeFlare SDK to create and interact with the cluster through Jupyter notebooks or a CLI. These workloads are then queued in MCAD as AppWrappers. If aggregated resources are available in the cluster, then such AppWrappers are dispatched. Otherwise, pending AppWrappers knock on the door of the InstaScale controller. The InstaScale controller acquires a gang of resources and adds it to the cluster, and at a later point in time the pending AppWrapper is dispatched to the cluster. Once the AppWrappers are dispatched, the respective controller spawns pods, which get bound by the scheduler; in our use case, we use the co-scheduler with packing. Later, the user starts interacting with the spawned workload. And finally, when done, the cluster can be destroyed with the cluster down operations supported by the SDK; we'll see a short notebook-side sketch of this whole flow right after the recap. I hope you learned a little bit more about all the ingredients of the stack. I'll now pass it back to Olivier.

Thanks, Abhishek. So to recap what we've seen today: we've talked about MCAD, we've talked about InstaScale. These are dispatching systems for dealing with large distributed workloads, such as the ones we find in AI training. One of the key characteristics there is that we're looking at jobs that use a large number of GPUs, therefore a very large number of nodes, hundreds of nodes typically on the cluster, and therefore a large number of pods, because we have at least one pod running on each node. What that led to is a design decision in MCAD that we cannot just queue pods or scale based on pods, because if we want to queue 10 different jobs, that's 1,000 pods; if we want to queue 100 different jobs, that's 10,000 pods. Everything in MCAD, everything in InstaScale, is designed to work with custom resource definitions, like PyTorch jobs, Ray jobs, batch jobs, you name it. The second design decision, which Abhishek also insisted on, is that we wanted this to be completely extensible, in the sense that Kubernetes is popular to a large degree because it's extensible. Lots of companies have their own operators, their own resources, and so on. So we wanted MCAD, out of the box, to work with any such thing without requiring any changes of any kind. If you have your own operator, your own resource type, your own Kubernetes kind, it already works with MCAD; there's nothing to do there. And InstaScale, in the same way, can make scaling decisions not just based on the pods that are pending, but on everything that is in the queue, right? So that's what we talked about today. The elephant in the room is what we didn't talk about today, which is that MCAD stands for Multi-Cluster App Dispatcher, right? In fact, MCAD started as a multi-cluster thing. In contrast with maybe some of the other projects we've heard about today, which are evolving from single cluster to multi-cluster, we kind of did the reverse in MCAD: we started with the multi-cluster use case; that's where we came from. Eventually, for instance in training, we realized that the single-cluster case was also really important, and so in the last couple of years at least we've put the focus on that. But MCAD still is a multi-cluster app dispatcher. It can manage a pool of clusters, a dynamic pool of clusters: you can add and remove clusters, it can monitor resources and make decisions about quotas and resources across different clusters. And it can make scheduling decisions based on resource availability, right?
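To tie Abhishek's recipe together from the user's side, here is a minimal notebook-style sketch of submitting a training job to a cluster spawned as in the earlier sketch and then tearing it down. This is illustrative only: the DDPJobDefinition class and its arguments follow the public codeflare-sdk examples from around the time of this talk and may differ in other releases, and the script and requirements file names are placeholders.

```python
# Illustrative job-submission flow on a cluster created with the CodeFlare SDK.
# Class and argument names follow codeflare-sdk examples of this era and are
# assumptions; mnist.py and requirements.txt are placeholder file names.
from codeflare_sdk.job.jobs import DDPJobDefinition

# Describe a distributed (DDP) training job to run on the spawned Ray cluster.
job_def = DDPJobDefinition(
    name="mnist-train",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"},
)

job = job_def.submit(cluster)   # 'cluster' is the Cluster object from the earlier sketch

job.status()    # poll the job status
job.logs()      # fetch the job logs

cluster.down()  # when done, tear down the cluster and free the resources
```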
So now, a little bit about our roadmap. There are a lot of things we want to do, again similar to things you've heard today, right? Today, MCAD, like many systems, thinks about GPUs as units, right? A GPU goes to this job or that job. But we know, for instance with NVIDIA MIG, with MPS, with DRA, that there are a lot of mechanisms coming up for dividing GPUs and being smart about sharing GPUs between different jobs in various ways. So that's obviously something we're working on and want to support. One important dimension of a dispatching system is in which order do we dispatch jobs, and when do we dispatch jobs? MCAD provides a foundation for this, and that's something we're building on to do new, interesting things. Again, we've heard already today about the cost to the planet of training large models and doing AI. So we're working on various policies and extensions there, for instance, to do training, or batch jobs in general, I should say, in ways that are more power efficient or that use greener energy, right? That choose clusters depending on the time of day, and so on and so forth. Finally, when it comes to cluster autoscaling: cluster autoscaling in general has been a challenge, because it is not quite just a problem about the cluster; it is a problem about what's hosting the cluster, which provider, which technology is being used, right? So cluster autoscaling has been kind of different across cloud providers and so on. InstaScale as it exists today is primarily focused on OpenShift, but there's an effort in the community to standardize that, to make it much more portable across providers, across kinds of managed clusters, and so on. And so we're definitely also very interested in that effort and in bringing it to MCAD and InstaScale. So, as I said before, everything we discussed today is open source. We have the CodeFlare community, you know, GitHub, Slack. Please join us, please let us know what you think. I also want to point out that later this week we're going to have another talk that is going to tell you more about MCAD. That particular talk is actually about how to do performance testing at scale, simulating large clusters, but it will use MCAD as the running example. So please come join us and watch Sarah's and Vishak's talk on Wednesday. And I think that's it for today. Thank you very much.

Thank you so much for the great talk. I think we have time for one question. I think it showed up at the right time. Yeah, I think so. I was just going to ask, do you find it easy for the ML engineers or the data scientists to plan resources when they're going to write the requests? So, I have to repeat the question. Yeah, sorry, so the question is: is it easy for engineers to plan resources? The real answer to that, something I could have added to the roadmap, is that we're looking at elasticity of all kinds: there's job elasticity, there's cluster elasticity, quota elasticity, so it's kind of a holistic question. I mean, AI experts think in terms of a number of GPUs and maybe a number of nodes, right? So that's essentially the input to our system, and that's what we work with. And that, I think, works well enough, right? Where they need help is adjusting CPU and memory and so on, but we can work that out from the request for the GPUs, essentially.
Typically, I mean, the workloads that researchers submit in our facility are all gangs, and gangs have this notion of all or none, right? They need all the resources at once to start. So they have a good understanding of the scale of their model and the kind of gang resources they need to start their computation. So at least internally, our users and model trainers, and externally too, I believe, know a priori what kind of resources they need before submitting a job. And you can always shape that, right? At some point, you put in some heuristic to say, I think it's this type of thing, more or less. Yeah.

I had one more question, but it's okay. Yeah, sure, go ahead. I saw on one slide you're using Ray for training and KServe for serving. Do you have any thoughts on KServe versus Ray Serve? Or have you not explored Ray Serve yet? So we explored Ray Serve initially. We worked with Anyscale for a while, actually, on trying to bridge the two together. I think this is still an ongoing discussion. We've had requirements in terms of, you know, the size of collections of models, for instance the size of our fine-tuned model collections; we have cases where we have hundreds of thousands of models in our collections, right? So Ray Serve wasn't designed to handle that in the first place, but we worked with Anyscale to a degree to improve on different things, right? So I think that's an ongoing effort, yes. Got it. Thank you. Great job. Thank you.