Hello, everyone. Thank you for joining us this morning. My name is Sarah and this is Vishaka, and we're here today to talk to you about best practices for improving batch scheduling performance at scale using MCAD and KWOK. MCAD is the Multi-Cluster App Dispatcher, and KWOK is the Kubernetes WithOut Kubelet toolkit. To give you an idea of what we're going to cover today: I'm going to give some background on the motivation and the existing challenges we see, and then introduce a potential solution, Kubernetes WithOut Kubelet, or KWOK. Then I'll hand it over to Vishaka, who will give you some insight into best practices for getting started with tools like KWOK and some things you might want to look out for when you're doing this sort of testing, and then walk you through some example use cases, so that hopefully you'll take away an understanding of how you might adapt this approach to suit your own resource management and performance testing needs.

To give you some motivation and cover some of the existing challenges: we're coming from IBM Research, where we've been working a lot with foundation models. You may have heard about large language models such as ChatGPT. They're part of a broader class of AI that we refer to as foundation models, because a lot of the same ideas and techniques can be used to represent different modalities and train these very large models. To give some background on the different areas we often see in AI, there's a life cycle from model creation to deployment. One of the major steps is data preparation: how do you process and accumulate data and make sure it's high quality? Then there's distributed training, which is the use case focus for this talk and which typically means long-running jobs with massive infrastructure requirements and heavy GPU utilization. From there you can move into other stages of the life cycle, such as model adaptation or fine-tuning, all the way to inferencing. But since we work a lot with the distributed training teams, that's what we're going to focus on.

As part of IBM Research, we really believe in applying an open-source-first approach to enable and scale these foundation model workloads. Here we highlight our training and validation stack, all running on Kubernetes or Red Hat OpenShift. At the top level, we have the CodeFlare SDK, which provides data scientists and machine learning engineers with a simple and robust set of Python APIs to interactively define and request resources for their large-scale training jobs. Under that we have the Multi-Cluster App Dispatcher, or MCAD, which provides a lot of the resource management capabilities we need; we'll go into more detail on that. It integrates with common machine learning and AI software such as Ray, through KubeRay, and PyTorch, through TorchX. And then, how do we guarantee that these large-scale AI workloads being scheduled on the cluster have the resources they need? That's where InstaScale comes in. It's a proactive node scale-out mechanism that provisions aggregated resources for queued jobs before pending pods are even created. On the other hand, it can also help free up cluster resources by rapidly scaling down to zero when there is no demand in MCAD's queuing system.
And through all of this, we provide open-source mechanisms, whether through the CodeFlare umbrella project or through upstream projects like PyTorch and Ray, and we're also delivering all of this through Open Data Hub. So you can check us out in a lot of different places.

So what is MCAD, the Multi-Cluster App Dispatcher? This is really what we're using to provide batch computing capabilities on multiple Kubernetes clusters with no code changes. It's essentially a queuing system: it dispatches queued jobs made up of any Kubernetes objects once the aggregated resources they need are available. It provides the AppWrapper CRD, which gives you the ability to wrap any Kubernetes objects. So whether that's a custom resource definition from one of your favorite libraries or one of the core Kubernetes object definitions, it offers you the option to wrap and holistically view any of these Kubernetes-compatible objects. By doing so, it also helps improve control plane stability through just-in-time creation of those objects. It lets you bring your own scheduler, so it integrates with the upstream Kubernetes scheduler stack for features such as co-scheduling, which we'll talk about a little later and which helps us with packing along the GPU dimension, something that's really important for training jobs. It also offers options for queuing with different policies, and it includes features like priority and preemption, quota management with borrowing, and, one of the nice things about MCAD for these large-scale training jobs, fault tolerance for any of these Kubernetes objects.

To get into a little more detail, MCAD gives you the option to submit an AppWrapper and queue it, and by looking at the available capacity it will dispatch it only when the resources are available. As an example, we're going to use this pseudocode to represent our training jobs. Whether you're using Ray or PyTorch, for the simulations we're going to run to test MCAD's performance we're going to wrap both the pod groups coming out of the co-scheduling plugin and batch/v1 jobs representing the training workloads in AppWrappers. The AppWrapper really lets you create a single object where you can append all of these and bring them together into one holistic object; a rough sketch of what that might look like follows below.

And of course, at IBM we have a long history of working with high-performance computing. We announced the AI supercomputer Vela, which is our cloud-native AI system for jobs like training and inferencing. If you're interested in the specs shown here, there's also a blog post with more details. The stack we just talked about has been running on the Vela supercomputer. As you can imagine, when you have large-scale training jobs that require hundreds to thousands of GPUs, there are challenges in managing them, so it became really important to have a robust system like MCAD to handle all the situations you might run into with multiple users on the same Kubernetes-based system. MCAD's capabilities, such as queuing these large training jobs and dispatching them as resources become available, along with priority, preemption, and retry support, are part of the critical path for training some of these multi-billion-parameter models at IBM Research.
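To make that concrete, here is a minimal sketch of what wrapping a batch/v1 job in an AppWrapper might look like. Treat it as illustrative rather than exact: the API group, the field names such as GenericItems, custompodresources, and generictemplate, and the image and resource numbers are assumptions that depend on the MCAD release you're running, so check the AppWrapper samples in the MCAD repository for your version.

```yaml
# Illustrative AppWrapper wrapping a batch/v1 Job (field names vary by MCAD release).
apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: training-job-0
  namespace: default
spec:
  resources:
    GenericItems:
      - replicas: 1
        # Aggregate resources MCAD should account for before dispatching.
        custompodresources:
          - replicas: 1
            requests:
              nvidia.com/gpu: 8
            limits:
              nvidia.com/gpu: 8
        # Any Kubernetes object can be embedded here; MCAD creates it only at dispatch time.
        generictemplate:
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: training-job-0
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: trainer
                    image: busybox
                    command: ["sleep", "72"]
                    resources:
                      limits:
                        nvidia.com/gpu: 8
```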
And so what happens when you're deploying these at scale and you need to make an update? MCAD is an open-source project: we make changes, we improve things. How do you do that testing and make sure the changes can run at scale before you deploy them on your critical systems? That's where we discovered Kubernetes WithOut Kubelet, or KWOK. We think it offers an option for testing the scalability of your changes or updates before you even take them to a cluster. It comes out of the Kubernetes SIGs, and it's a toolkit you can use to create simulated pods and nodes that behave like real ones, with a really low resource footprint. The key tools in KWOK give you the ability to simulate fake nodes, pods, and other resources, with two main modes of installation, in cluster or out of cluster; we're just going to cover how we do this with a kind cluster. It also offers tooling to help you simulate and test your resource management needs.

One of the questions we get is: how does this compare to other options? Something like Kubemark might be a good fit for some of your testing. Kubemark is essentially a kubelet that doesn't run containers, but for this kind of scale-up testing it can be quite memory intensive. If you're really just looking at the resource management performance testing you need to do, KWOK offers a way to simulate that node behavior, and it's a very slim tool: you can reliably maintain 1,000 simulated nodes and roughly 100,000 simulated pods locally on your laptop, while doing something similar with Kubemark could consume roughly 60 gigabytes of memory alone. So with that, I'm going to hand things over to Vishaka, who will talk about what we've learned from working with KWOK and scale-testing MCAD, to help you get started with testing at scale with such a tool.

Thank you, Sarah, for the introduction. To get started with this toolkit, KWOK, it helps to understand some basic building blocks, which will then let us simulate an end-to-end lifecycle. The building blocks are as follows. First, how would you create a fake node? This is an archetypical YAML file for a fake node, and it looks very similar to an actual node YAML file, except that you need to add a specific annotation so the node will be managed by the KWOK controller, and you need to add a taint to avoid scheduling actual running pods onto fake nodes. With these two minor modifications, when you do a kubectl apply on this file, you'll create a fake node that is managed by the KWOK controller. Moreover, you can customize the node resources; for example, here we have set the node to simulate a GPU capacity of 8. The next building block is to create a fake pod. This is an archetypical example of a pod, except that now we want to ensure the simulated pods land on the simulated nodes. To do that, we add a node affinity, a property of pods that attracts them to a set of nodes, in our case the fake nodes. We also add a pod toleration, because we added a taint to the fake node to repel actual pods; the toleration allows the scheduler to place these pods on the fake nodes.
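To make those two building blocks concrete, here is a minimal sketch of a fake node and a fake pod along the lines of the upstream KWOK examples. The annotation and taint key kwok.x-k8s.io/node and the type: kwok label follow the KWOK documentation, while the node name, the CPU, memory, and GPU numbers, and the image are placeholders, so double-check the manifests for your KWOK version.

```yaml
# A fake node: the annotation hands the node to the KWOK controller, and the
# taint keeps real pods from being scheduled onto it.
apiVersion: v1
kind: Node
metadata:
  name: fake-node-0
  annotations:
    kwok.x-k8s.io/node: fake
  labels:
    type: kwok
spec:
  taints:
    - key: kwok.x-k8s.io/node
      value: fake
      effect: NoSchedule
status:
  allocatable:
    cpu: "32"
    memory: 256Gi
    nvidia.com/gpu: "8"   # simulated GPU capacity of 8
  capacity:
    cpu: "32"
    memory: 256Gi
    nvidia.com/gpu: "8"
---
# A fake pod: node affinity steers it onto the fake nodes, and the toleration
# lets it pass the taint we added above.
apiVersion: v1
kind: Pod
metadata:
  name: fake-pod-0
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: type
                operator: In
                values: ["kwok"]
  tolerations:
    - key: kwok.x-k8s.io/node
      operator: Equal
      value: fake
      effect: NoSchedule
  containers:
    - name: fake-container
      image: fake-image
      resources:
        limits:
          nvidia.com/gpu: 8
```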
Okay, so now we have fake nodes and we have fake pods. How do we simulate the lifecycle of a Kubernetes resource? For that, KWOK provides what is called the Stage API. It's a KWOK configuration that lets you simulate the different stages in the lifecycle of any resource, for example a node or a pod, and you can stack these stages on top of each other to build an end-to-end lifecycle. For example, a pod goes through these stages: a pod creation stage, then a pod ready stage, and finally a pod complete stage.

If we focus on just the pod ready stage, let's look at what the Stage resource and the corresponding YAML look like, and which fields are of interest to us. First of all, you need to specify the kind of resource you want to apply the stage to; because we're working with the pod ready stage, it of course applies to pods. Next, the Stage API has a selector field that lets us define when to execute the stage; for example, a pod ready stage should be executed only once the pod's container, or containers, have been initialized and are ready. There is also a delay field to specify when the stage will be applied, with an option to add jitter to the delay, which specifies the latest time the stage can be applied and makes the simulation more realistic. And finally, there is a field called next, which defines the new state of the resource, that is, the changes that will be made to the resource when the stage is applied. So by configuring the selector, the delay, and the next field in a stage, you can control when, how, and to what resource the stage is applied, providing a flexible and scalable way to simulate real-world scenarios in your Kubernetes cluster.

To get started with KWOK, this is really all you need to know, so it's worth simulating a basic real-world scenario. For that, we took a real cluster, a kind cluster with just one node, and the workload is just one job with a pod; inside that pod is a container that runs the command sleep 10. When we do a kubectl apply on this file, these are the events we observe and the corresponding timestamps: the running stage was reached at 10 seconds, the pod then slept for 10 seconds, and the pod completed stage was reached at 22 seconds. It would be good if we could use this toolkit, KWOK, to simulate the same scenario. For that, we took the pod ready stage and just modified the delay field, setting it to 10 seconds so that the stage is applied, and the pod becomes ready and running, after 10 seconds. And finally, you can grab the argument of the sleep command from your YAML file and supply it to the delay field of the pod complete stage. When you do a kubectl apply on a fake pod, these are the events we observe and the corresponding timestamps: the running stage is reached at 11 seconds and the completed stage at 22 seconds, closely matching the real cluster. A rough sketch of what such a modified pod ready stage might look like follows below.
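Here is an abridged sketch of a pod ready Stage with a 10-second delay. It is loosely modeled on the default pod stages shipped with KWOK but heavily trimmed: the real stages set many more status fields, and the exact selectors and field names can differ between KWOK versions, so treat every value below as an assumption to verify against the stages in the KWOK repository.

```yaml
# Illustrative pod-ready Stage with a 10-second delay (trimmed from KWOK's defaults).
apiVersion: kwok.x-k8s.io/v1alpha1
kind: Stage
metadata:
  name: pod-ready
spec:
  # The kind of resource this stage applies to.
  resourceRef:
    apiGroup: v1
    kind: Pod
  # When to execute the stage: only pods that are not being deleted
  # and have not been brought up yet.
  selector:
    matchExpressions:
      - key: '.metadata.deletionTimestamp'
        operator: 'DoesNotExist'
      - key: '.status.podIP'
        operator: 'DoesNotExist'
  # Apply the stage 10 s after the selector matches; jitter bounds the latest time.
  delay:
    durationMilliseconds: 10000
    jitterDurationMilliseconds: 11000
  # The new state of the resource once the stage is applied.
  next:
    statusTemplate: |
      phase: Running
      conditions:
        - type: Ready
          status: "True"
```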
So it looks like this toolkit is working, but there are a couple of things to look out for. What we observed is that with an actual cluster, for example a kind cluster, the bottleneck is the kubelet and the container runtime: as you increase the workload on your node, the time to start the containers increases. With KWOK, this is all simulated, so even if you increase the workload to, say, 100 pods on a node, it's not going to affect the time to start containers. And this is something that you should look out for.

Okay, so we now understand the pros and cons of this toolkit, but how did we at IBM Research use it to evaluate our systems, and why? The title of this talk is improving batch scheduling performance at scale using MCAD and KWOK. And why batch scheduling? Because we're interested in these large AI model training workloads, and they have distinctive characteristics: they're long-running workloads, and they occupy multiple resources at the same time. It has been observed that the native Kubernetes scheduler is not well suited to such workloads, which is why IBM Research came up with the MCAD system. MCAD, together with the co-scheduler, offers resource management capabilities for batch workloads, but identifying and resolving bottlenecks is crucial for optimal performance, and that's where KWOK comes in: we wanted to test the various versions of MCAD to help identify the performance bottlenecks.

Now, how did we do those tests and what did they look like? A few timestamps were of interest to us, both for the system without MCAD and for the system with MCAD. In the system without MCAD, we wanted to capture the job completion time, the time between when the job was created and when the job finished, and in particular the scheduling latency, the duration between when the pod was created and when the pod was eventually scheduled on a node. With MCAD, the timestamps we captured were when the AppWrapper was created, when MCAD dispatched that AppWrapper, and finally when the job finished running; this end-to-end time is what we refer to as the MCAD time. One duration of interest to us was the dispatch time, between when the AppWrapper was created and when it was dispatched, and we also measured the scheduling latency in the MCAD case. An important thing to note here is that, because we're using KWOK, the duration between when the pod is scheduled and when it reaches the pod completion stage is entirely simulated; this is what we refer to as the simulated pod occupancy.

Okay, so with these events of interest defined, our baseline use case, just to check that the system behaves as expected, was as follows. We created a cluster on actual kind, using one kind node, plus four fake nodes, each with a simulated capacity of eight GPUs. We tested the system both with and without MCAD, that is, the kube-scheduler alone versus the kube-scheduler with MCAD in front of it. The workload is just one job, and inside the job is one pod that requests eight GPUs, with the pod occupancy time set to 72 seconds. Some basic napkin math tells us that four nodes of eight GPUs can run four such pods at once, so with a 72-second occupancy the cluster drains a job every 72 / 4 = 18 seconds; submitting jobs more than 18 seconds apart shouldn't induce queuing anywhere, whether at the scheduler or at MCAD. As a validation test, yes, we observed that with both the no-MCAD and the MCAD setups, when we set the inter-job request interval to more than 18 seconds, in this case 20 seconds, there was no scheduling latency, the job completion time was 72 seconds as expected, and the dispatch time was negligible.
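For reference, the baseline workload just described could be expressed roughly like this: one batch/v1 job with a single pod that asks for eight GPUs and sleeps for 72 seconds, steered onto the fake nodes. This is only a sketch; the names, the image, and the node selector and toleration values are placeholders that assume the fake-node conventions shown earlier.

```yaml
# Illustrative baseline workload: one pod, 8 GPUs, 72-second occupancy.
apiVersion: batch/v1
kind: Job
metadata:
  name: baseline-job-0
spec:
  template:
    spec:
      restartPolicy: Never
      # Steer the pod onto the simulated nodes and tolerate their taint.
      nodeSelector:
        type: kwok
      tolerations:
        - key: kwok.x-k8s.io/node
          operator: Equal
          value: fake
          effect: NoSchedule
      containers:
        - name: trainer
          image: fake-image
          command: ["sleep", "72"]
          resources:
            limits:
              nvidia.com/gpu: 8
```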
But what happens when you increase the load and push the system past that stable point? Say we submit jobs every 17 seconds. With the system without MCAD, this is what we observe for the job completion time. This was anticipated: the increase from 72 seconds to higher numbers is due to the scheduling latency. The pods were getting queued in the scheduler queue, and if you overlay the bottom curve on the red dashed line, you'll see that what added to the job completion time was in fact the scheduling latency. For the MCAD system, we performed the same experiment. Now the job completion time for most jobs stays at 72 seconds, except for one job where MCAD made an incorrect decision, a false positive: it dispatched the job even though not enough resources were available, the job ended up queued, and that induced the scheduling latency you see as a spike for that job.

So now that we understood that the system behaves well in the baseline use case, we wanted to use the co-scheduler together with MCAD, because, as discussed earlier, we want to test MCAD against these batch jobs. For that we use the co-scheduler plugin. One of its key items is the introduction of the pod group, a group of pods that should be scheduled and run together; we'll show a rough sketch of what that looks like in a moment. And because an AppWrapper supports arbitrary Kubernetes object definitions, it's easy to insert the pod group definition into your AppWrapper along with the other workload definitions.

When we performed the co-scheduler use case, we changed the cluster architecture. The cluster now has one actual node, but 64 fake nodes, each with a simulated capacity of 8 GPUs, so overall we're simulating 512 GPUs. Again, we test the setup both with MCAD plus the co-scheduler and without MCAD, with just the co-scheduler. The workload also changed: each job now has a pod group with 16 pods, each pod requests 8 GPUs, and the pod occupancy time on each node is 144 seconds. Again, some basic math shows that four such jobs fit at once, so an inter-job request interval greater than 144 / 4 = 36 seconds should not induce queuing. I'm not going to present the results with no queuing; what was of interest to us was the high-load scenario. With high load, for example when we set the inter-job request interval to 30 seconds, what we observe with the co-scheduler alone is significant scheduling latency, as expected: the pods were all queued, and the job completion time was around 332 seconds. But when I ran these experiments with MCAD version 1.31.0 during my summer internship, the average job completion time was 186 seconds. This was not anticipated; something was going wrong. In fact, there were a lot of false positives and a significant dispatch time, so we needed to optimize MCAD. The team worked on improving MCAD's performance, I went back to my PhD, and I ran these experiments again just last month with the new version. Now the average completion time is 144 seconds, as we expected, with almost no scheduling latency. So MCAD resolved the false-positive issue, and we also observed a 50% reduction in the average dispatch time.
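To give a feel for the pod group piece, here is a minimal sketch of a PodGroup from the scheduler-plugins co-scheduling plugin, together with one member pod. Be careful with the details: the API group, the pod-group label key, and the scheduler profile name have all changed across scheduler-plugins releases, so the values below are assumptions to adapt to your deployment; in our setup the PodGroup definition and the workload would both sit inside a single AppWrapper.

```yaml
# Illustrative PodGroup: all 16 member pods must be schedulable before any is bound.
apiVersion: scheduling.x-k8s.io/v1alpha1   # API group differs between scheduler-plugins releases
kind: PodGroup
metadata:
  name: training-gang-0
spec:
  minMember: 16            # gang size: 16 pods of 8 GPUs each
---
# One member pod of the gang (repeated 16 times in the real workload).
apiVersion: v1
kind: Pod
metadata:
  name: training-gang-0-worker-0
  labels:
    scheduling.x-k8s.io/pod-group: training-gang-0   # label key also varies by release
spec:
  schedulerName: scheduler-plugins-scheduler          # profile name is deployment-specific
  containers:
    - name: trainer
      image: fake-image
      resources:
        limits:
          nvidia.com/gpu: 8
```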
So this is how we used this toolkit, KWOK, to find the performance bottlenecks in our systems. So, Sarah, do you want to conclude?

Yeah, so we just wanted to give you some insight into how we're testing our scheduler and resource management. Just to recap: MCAD, the Multi-Cluster App Dispatcher, is our main resource manager for these batch AI training jobs in our environments, and we really needed a way to test these resource management options at scale before rolling them out into production. So we hope we've given you some insight into how you might use tools like the ones we explored, such as KWOK, to work out your own resource management testing at scale. And hopefully we've shown that these are indeed lightweight simulators, but they can realistically show you performance changes as you make new commits and update your code. With that, we'd like to thank you all for your time. We hope you'll check out MCAD; we're always happy to meet new people and gain new contributors. And we hope you'll leave some feedback for this talk. Thank you. If anyone has any questions, I believe there should be a mic in the center of the room.

Great talk. The question I have is: have you used the same strategy to simulate actually overtaxing the cluster itself, where you overload a complete cluster and simulate things that can sometimes happen, like bringing etcd down, that sort of thing?

So KWOK offers some options for setting up and playing with your cluster configuration, your different stages, and so on. We've done some exploration in that space, but I think there's a lot more we can do there, really seeing how far we can push these tools to test some of these potential system failures and their detection.

So it's potential, but you haven't actually done it?

Not with KWOK, but that training stack does run in a real production environment.

Gotcha. Thank you.

If there are no more questions, please reach out to us, connect with us, and thank you for your time.