All right, everybody, I'd like to thank everyone who is joining us today. Welcome to today's CNCF webinar, Kubernetes-Native Two-Level Resource Management for AI/ML Workloads. My name is Daniel. I work for Red Hat as a technical marketing manager, and I'm also a CNCF Ambassador, so I will be moderating today's webinar. I would like to welcome our presenters: Diana Arroyo, a software engineer at IBM Research, and Alaa Youssef, manager of the Container Cloud Platform at IBM Research. A couple of housekeeping items: as an attendee you are not able to talk during this webinar, but there is a Q&A box at the bottom of your screen, so please feel free to drop your questions in there and we will get to as many as we can at the end. This is an official CNCF webinar, so it is subject to the CNCF Code of Conduct. Please do not add anything to the chat or the questions that would be a violation of the Code of Conduct, and please be respectful of all your fellow participants and presenters. Please also note that the recording and slides will be posted later today on the CNCF webinars page, www.cncf.io/webinars; you can find them there later today. And with that, I will hand it over to our presenters to kick off today's presentation. Take it away.

Thank you, Daniel. Let me share my screen... I hope you can see it. Yes, we can see it. All right. So thank you all for joining today. I'm happy to take you through this presentation, together with my colleague Diana: Kubernetes-native two-level resource management for AI and machine learning workloads. On our agenda today, we want to start with why the Kubernetes scheduler is not enough for the scheduling and resource management of these AI and machine learning workloads. Then we will talk about a few additional desired capabilities, on top of the shortcomings we discuss in the first point. After that, we will show you our proposal to address these shortcomings and desired capabilities, which is the Multi-Cluster App Dispatcher, an open source project that we want to share with you. We'll tell you a little bit about it and how it works, show you a short, cool demo, and then close with a call to action, which you can guess.

So why is the Kubernetes scheduler not enough? We published a recent CNCF blog post with the same title about a couple of months ago, and in that blog post we tried to motivate the need for the second-level resource manager we are going to talk about today. Some characteristics of the AI workloads we are targeting here: you all know that the use of the Kubernetes platform for running AI and machine learning workloads is on the rise. These workloads typically have multiple concurrent learners or executors. For example, if you're using Spark, you have Spark executors; if you're doing deep learning, you have learners. Typically they need to run concurrently, in a distributed learning fashion. They have collocation or affinity constraints, and they may have specific hardware requirements, such as GPUs, with a specific number of GPUs per learner. We're also seeing nowadays an increase in massively parallel jobs, where a big number of short-running tasks needs to be executed, like array jobs, for example.
And these jobs are resource hungry, meaning if you give a job all the resources to run the thousand tasks it's composed of, it will take all those resources and run the thousand tasks in parallel. If you give it fewer resources, many of these jobs have nice elastic features: if you give it enough resources to run only a hundred tasks at a time, it will do a hundred tasks at a time until it completes. It will take longer to complete, but the total amount of consumed resources is going to be the same. And why would you want to do that? There are two reasons. First, the resources are not infinite. Second, you may want to regulate the flow, so you allow parallel jobs to proceed, especially with the increase in the use of interactive mode, for example a data scientist sitting in front of a Jupyter notebook, running some of these machine learning jobs, and wanting to see results interactively. So they are not all batch jobs that can be run later; when a job is done, somebody is waiting for the result.

So in this kind of environment you have jobs and tasks, and there are advantages and disadvantages to managing at the job level versus at the task level. Typically a task maps to a Kubernetes pod, and a job is composed of multiple of those pods or tasks. Now the question becomes: at which level should you be specifying things like priorities, or classes of service like gold, silver, bronze, or quotas, say the maximum quota on the memory resource or GPU resource for a particular task, job, user, or organization? At which level do you set these things? And at which level do you do queuing of these jobs or tasks, then allocate resources and do preemption? These are all questions that don't have one answer, but we are trying to argue here that the right level of management is the job level. You need to do a lot of these functions at the job level, not at the individual task level; obviously the task will inherit from its job. And think about what happens to your scheduler, the different controllers in your Kubernetes environment, and your etcd when you have thousands and thousands of pods, belonging to thousands of jobs, that are pending and need to be scheduled. If, as these jobs arrive, you create the pods and hand them over to the Kubernetes scheduler, it will be overwhelmed, as will the other controllers in the system.

Briefly, you may all know that the Kubernetes scheduler is a pod scheduler. Given a pod with particular dimensions, like resource requests in the memory, CPU, and GPU dimensions, it filters out the nodes that don't fit that pod, or that shouldn't be used for it due to other constraints. Then, for the remaining nodes in the cluster, it ranks them according to priority functions, gets the top candidate to put the pod on, and binds the pod. So in this example, you have a series of pods that are arriving and queued at the scheduler. They happen to belong to three jobs, where each job has four executors or learners, so four pods, in this simple example. And the job of the scheduler is to schedule the pods one by one.
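For concreteness, here is a minimal sketch of the kind of object the default scheduler reasons about, one pod at a time. The names and numbers are illustrative, not taken from the talk:

```yaml
# One learner pod as kube-scheduler sees it: resource requests in the
# CPU/memory/GPU dimensions, with no notion of the 4-pod job it belongs to.
apiVersion: v1
kind: Pod
metadata:
  name: job-a-learner-0          # hypothetical name; one of four learners in "job A"
spec:
  containers:
  - name: learner
    image: learner:latest        # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: 1        # a specific number of GPUs per learner
      limits:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: 1        # extended resources must match requests
```

Nothing in this spec ties the pod to the other three learners in its job, which is exactly the gap being described.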
Now imagine that you have thousands of these jobs, and all those pods are waiting for the scheduler to schedule them, and you obviously have limited capacity in your cluster. It will keep trying to schedule them and failing, over and over, until once in a while it's able to let in one of the pods. In this example, this is what happened: the pods got scheduled this way, and you ended up with two of the three jobs not completely placed in the cluster. If the whole job needs to be placed in order for the job to proceed, to do the learning or whatever it needs to do, then you have those two jobs, the purple and the red in this example, with some of their pods occupying space that is useless until the remaining pods can be scheduled. You have partial deadlocks. A gang scheduler can obviously help in resolving these partial deadlocks, but remember that when you have thousands of pods, you could have avoided that altogether if you had a second-level controller that is able to look at the cluster at a coarser granularity and let into the scheduler only jobs that are likely to be placed.

As I mentioned, cluster resources are limited. Scaling up the cluster by adding nodes takes time; it's not something you want to do instantaneously for every arriving or leaving job, it's probably something you want to do on a somewhat longer time scale. And as I said, some of these jobs are resource hungry: no matter how much you scale the cluster, they're going to eat it up. So practically speaking, at every point in time there is a limit on the available resources, and even if that limit is not permanent and is going to change, you want to be using the available resources efficiently. The common saying is that the sky is the limit, but remember, it doesn't say that the cloud is the limit. The cloud has limits in terms of the resources available at every point in time.

In addition to that, we are seeing the rise of different patterns, like multi-cluster patterns, where organizations own tens of Kubernetes clusters. These clusters are basically easier to manage than one huge cluster, and that's contributing to the rise of the multi-cluster pattern we are seeing. In the hybrid cloud world, where by default you have some on-prem resources or on-prem clusters plus multiple public cloud clusters, by definition you have a big number of clusters to choose from when placing the next job you need to run. Static assignment of users or certain apps to a particular cluster is feasible, but it's not an efficient way to use all your available clusters. In addition to on-prem and public clusters, you may have edge clusters as well, which increases the number of clusters tremendously, and in that case the static allocation of a particular job or user to a certain cluster is not the best way to go. People have seen and talked about bursting scenarios for a while, but you want to be able to do that in a more automatic way, where a job is submitted and then, if there are no resources available in my on-prem cluster, it's automatically routed to a public cluster that I have a subscription to. Finally, I mentioned the rise of the edge computing paradigm, what it means in terms of the number of clusters an organization owns and manages, and the options for placing a job.
In addition to what I mentioned, there are some desired capabilities we're looking for. I alluded to the need for some form of queuing at the job level, before admitting the pods to the scheduler to do the binding to specific nodes. If you are able to do that queuing, and dispatch the job to a specific cluster before it reaches the scheduler in that cluster to do the fine-grained placement of the pods and bind them to specific nodes, then you have achieved a lot. This is also a good point of control where you can specify priorities and classes of service, and have multiple queues for different priorities or different classes of service, where jobs wait to be dispatched to the appropriate cluster based on available capacity. You can enforce quotas, like hierarchical quota management: if you want to say this is the quota, in terms of resources, that this particular organization or department or user is allowed to use in my system, you can do that at this control point in an easy, global way. And you can think about all sorts of soft and hard quotas, and of course quotas that span multiple resource dimensions. You may also be able to do things like preempting low-priority jobs in order to admit higher-priority jobs, or preempting jobs that borrowed from the soft quota of another organization, now that that organization's own jobs are coming in and have the right to go in and preempt the jobs that were borrowing above their quota. You can do all these controls at that second-level resource manager, because it has a unified view of the jobs belonging to all the different AI and machine learning frameworks.

So our solution to the shortcomings I mentioned and the desired capabilities I talked about is the Multi-Cluster App Dispatcher, an open source project that we are going to tell you all about, including how it works. I'm going to hand it over to my colleague Diana, and she will take it from here, tell you a little bit about how it works, and then show you a cool demo. So Diana, do you want to share, or do you want me to flip the charts? If you can release the sharing. Okay. I'll take over, if you don't mind. Let me make this big. Okay, can you see my screen? I think I picked up where your last chart left off. Yes. Okay, great.

Okay, thank you all. So as Alaa mentioned, we've developed a controller that tries to address some of these capabilities. We have an initial framework, and I'll talk to you a bit about what it does now, and then about some things we're looking at adding on the roadmap. Essentially, as mentioned, we would like a way to dispatch either within one cluster or across multiple clusters, and the ability to queue. In this high-level picture, we're showing how, in a multi-cluster environment, you would submit a job to what we call a dispatching cluster, and I'll give some more details about that, where you essentially submit all the components of a job. The MCAD controller, as mentioned before, does the initial evaluation of whether that job is runnable, and then, depending on whether it's runnable or not, dispatches it to other clusters to actually be realized, and for the binding to happen within that cluster. Just a quick note: as I mentioned, we have an open source project for this. It's on GitHub, and it's on OperatorHub as well. So let me dig a little deeper into how this actually works.
So there's the MCAD controller, the Multi-Cluster App Dispatcher, and with it we operate on a custom resource definition called the AppWrapper. The AppWrapper is essentially any and all of the Kubernetes resources you create, the compute-consuming ones: Deployments, Pods, StatefulSets, any of the resources that you define as part of your complete job, we wrap in this AppWrapper. In addition to that, we also allow you to wrap any non-compute-consuming resources; for instance, many times you'll deploy a service as part of your job, maybe even under different namespaces. So with all of those components that represent your job or application, we have a CRD that you submit them under, wrapping it all up so that it can be evaluated holistically; a sketch of what this looks like follows a little further below. The MCAD controller operates on the AppWrapper: it takes the AppWrapper, inspects it, and determines whether all the resources for that job can be runnable. It does this in a holistic manner, meaning it looks at all the items you've listed in the wrapper and evaluates whether the job can be runnable or not. If it determines that the job is runnable based on the policies, it unwraps all of the Kubernetes objects you put in it and creates those objects in the cluster they're dispatched to. Obviously, if they're not runnable, we put the wrapper in a queue and re-evaluate it over time, to make sure we can dispatch it once resources become available. Another thing, as I mentioned earlier, is that we support preemption and re-queuing, and with that we allow AppWrappers to have dispatching priorities. You can define those, and I'll show how that actually works in the YAML as well as in the demo.

Okay, so this is pictorially what I talked about on the first chart: we have the controller, the Multi-Cluster App Dispatcher, operating on the AppWrapper. You take whatever Kubernetes objects you were creating, put them inside an AppWrapper, and submit that AppWrapper to a Kubernetes cluster, and the Multi-Cluster App Dispatcher will operate on it. We have two configurations that we run. We actually started out with the standalone version, meaning it just runs within one Kubernetes cluster. We found that worked really well, and it was a nice extension to move it to multiple clusters; that's the second bullet here, where we have a dispatcher mode and an agent mode for a Kubernetes cluster, and that last bullet is really the multi-cluster environment.

Okay. A little bit deeper into what's happening behind the scenes: this big box right here is the MCAD controller, and it runs inside the cluster. It determines the available capacity of the cluster, so it finds out what's actually running and how much of the resources are available, using the normal Kubernetes model, and it tracks what's available over time. In addition to that, as you bring AppWrappers into the system and submit them, the AppWrappers go into a queue. Again, the AppWrapper wraps all the resources you're trying to realize in that cluster. You put them into a queue, and this queue can be FIFO.
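For reference, here is a minimal sketch of what an AppWrapper might look like. The API group/version and the exact stanza names are assumptions based on how the talk describes the CRD; check the GitHub repo for the authoritative schema:

```yaml
# Minimal AppWrapper sketch (schema details assumed; see the GitHub repo).
apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: my-training-job          # hypothetical name
spec:
  priority: 0                    # dispatching priority; unset/0 queues as plain FIFO
  resources:
    items:                       # every object that makes up the job goes here
    - replicas: 1
      type: StatefulSet
      template:                  # the ordinary manifest you would otherwise create
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: learners
        spec:
          serviceName: learners
          replicas: 3
          selector:
            matchLabels: {app: learners}
          template:
            metadata:
              labels: {app: learners}
            spec:
              containers:
              - name: learner
                image: learner:latest    # placeholder image
```

Submitting the wrapper (for example with kubectl apply) hands the whole job to MCAD for holistic evaluation, instead of handing a pile of individual pods straight to the scheduler.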
If you don't set priorities, that's how the queue behaves; if you do set priorities on these AppWrappers, the queue will recognize that and put the ones with the higher priorities at the beginning. Another goal of the MCAD controller is to take the available capacity, look at what's in the queue, pull jobs off the queue, and determine their runnability, as I mentioned before; if they're runnable, it dispatches them and actually creates the objects inside Kubernetes. So that's the standalone version.

Now for the modifications we made to support multiple clusters. In this picture you see these big blue boxes, and they're actually separate, complete clusters. The two clusters on the right are what we call agent clusters: when you deploy MCAD into such a cluster, you deploy it in the agent configuration. We have another cluster that acts as the dispatcher cluster; this is where jobs get submitted. What happens is that each agent cluster's agent controller collects the available state, the available capacity of its individual cluster, and makes it available to the dispatcher. The dispatcher cluster, or the dispatcher controller, keeps track of the available resources on the different clusters it supports. Then, when you submit AppWrapper jobs to the dispatcher cluster, they're put into a similar queue, they're evaluated for where they can run, and the dispatcher cluster dispatches each job to the cluster it chooses.

Here's an example; I thought it would be useful to have a use case so you can see how this works. Let me see if I can enlarge this and get it out of the way, since it's covering things up, I apologize. This is an AppWrapper here, but before we actually create the AppWrapper, you first bring the MCAD controllers up on the various clusters: this is the dispatcher cluster that I showed before, plus two agent clusters, with the MCAD controller running in the different modes. When that happens, the first thing that happens is that the state, meaning the available resources on the local agents, is collected, and that state is made available to the MCAD dispatcher controller. It's then ready to receive AppWrappers. In this example, we worked with a team who had a bunch of resources they were creating that represented one job; what I put in here is really a subset of them. In the job they submitted, they had multiple services, a namespace they were creating, a network policy, a PVC, and a couple of deployments: one deployment had just one pod, and the other had multiple pods. They were already creating those objects, so they took our AppWrapper and simply wrapped all of the objects they were already creating inside it, as sketched below, and I'll show how those actually look inside our CRD. They submitted it to the dispatcher cluster; the MCAD controller on the dispatcher cluster picks it up, inspects all the objects inside it, determines whether it's runnable and where it's runnable, and then dispatches it to the appropriate cluster. In this example I show that it's dispatched to agent one, where it's created, and then the local MCAD controller takes that object and unwraps all of the objects.
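As a hedged illustration of that use case, the items stanza would hold each of those pre-existing manifests as a wrapped item. Stanza and field names are approximate, and the templates are trimmed to minimal valid manifests:

```yaml
# Illustrative items list for the use case above: compute-consuming and
# non-compute-consuming objects wrapped together (names/fields approximate).
spec:
  resources:
    items:
    - replicas: 1
      type: Namespace
      template:
        apiVersion: v1
        kind: Namespace
        metadata:
          name: team-a              # hypothetical namespace from the example
    - replicas: 1
      type: Service
      template:
        apiVersion: v1
        kind: Service
        metadata:
          name: workers
          namespace: team-a
        spec:
          selector: {app: workers}
          ports:
          - port: 80
            targetPort: 8080
    - replicas: 1
      type: Deployment
      template:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: workers             # the multi-pod deployment from the example
          namespace: team-a
        spec:
          replicas: 4
          selector:
            matchLabels: {app: workers}
          template:
            metadata:
              labels: {app: workers}
            spec:
              containers:
              - name: worker
                image: worker:latest    # placeholder image
    # ...the network policy, PVC, and single-pod deployment are wrapped the same way
```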
Here the objects are labeled A through G; it unwraps them and creates them inside this cluster. And then, let's see here, my screen is not working... there you go. So that's what we have as of today, and I wanted to give you some insight into the things we're working on now as part of our roadmap. This is the current work happening now: this is the box you just saw previously, with the MCAD controller, the queue, and the available capacity, where objects get submitted. As we mentioned, we want to add more policies to support the capabilities that Alaa described. One of them is not only evaluating whether jobs are runnable based on available resources, but extending that to additional capabilities, and one of the first ones we're looking at is quota management. What we want, and this is just an example here, is to let you define quota not just at a namespace level, which is already available in Kubernetes now, but at some more abstract administrative level. So you may want to set quotas on organizations, or on departments within those organizations. You may also want to set quotas on projects, so you could have multiple projects that you set quotas on. You also want to support soft and hard constraints on these quotas. So there are various aspects to enhancing the whole quota management capability, beyond the namespace level to something more abstract. This is the work we're doing now; I just wanted to give you some insight into how we would take advantage of quota management evaluation: yes, we pull the available resources into the controller, but we would also evaluate whether a job is runnable based on quota, in the sense that we may have enough resources, but there may be hard limits you want to enforce on some of these jobs. So we would queue the job up and then dispatch it once there's enough quota available. That's some insight into what we have and how we're moving forward with enabling more and more capabilities in the evaluation of these jobs at a higher level.

One more chart here, just to show you, as I mentioned before, the actual CRD that we have, the AppWrapper. This first arrow shows that you express the job you're going to submit as an AppWrapper. I'm going to skip these two stanzas for a moment and go to the last one: as I mentioned, you put all your objects inside your AppWrapper, so you wrap them, and you do it under the items stanza. If you saw in the other chart, there were seven or eight different Kubernetes objects being created; they are listed under items, so you just have the full list of things that you add here. And then, as I mentioned before, you can set priorities on these jobs as well; this is a dispatching priority, determining whether this job can dispatch and giving it higher priority over other jobs. And finally, some insight into where we might express quota information: this job is assigned a specific quota name, and we would evaluate it against the quota tree that gets built. This, again, as I mentioned, is part of the roadmap we're working on now. Okay, so I thought it would also be helpful to give some insight into how this works by showing a demo.
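A hedged sketch of the two stanzas described on that chart, before the demo. The priority stanza matches what the demo uses; the quota reference is hypothetical, since quota management was described as roadmap work in progress:

```yaml
# Priority and (hypothetical) quota stanzas on an AppWrapper.
apiVersion: mcad.ibm.com/v1beta1  # assumed group/version
kind: AppWrapper
metadata:
  name: high-priority-job
  labels:
    quota-name: org-a.project-x   # hypothetical hook into the quota tree (roadmap)
spec:
  priority: 10                    # dispatching priority; other jobs default to 0
  resources:
    items: []                     # the wrapped job objects, as shown earlier
```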
So let me jump over to the demo I have here; I recorded it, so I'll talk through it. It's a very simple demo, but I wanted to show you how this works, with examples of queuing and then of preemption with priority. Let me start this up. On the left here, the black window is the dispatcher controller, and it's on a separate cluster; the green and the blue boxes are also separate clusters. I've made them very, very tiny because I wanted to be able to fill up the clusters and show you the queuing, so each of these clusters has only one node, and each node has about five CPUs available right off the top (a sizing sketch follows below). Down here on the bottom, as I submit the AppWrappers, I'm going to show you their state, and also show you that pods don't actually get created on the dispatcher cluster; they only get created on the agent clusters.

The first job I'm going to submit is job one. I'll show you the contents here; I showed some of this in the charts. I'm submitting an AppWrapper, and I list the items here; I only have one item in it, a StatefulSet, but again, you would put as many items as you need to represent your job. And of course, since we're evaluating whether it's runnable or not, here is the CPU we're actually allocating, and in this example we have three replicas, so it's three replicas with that CPU and memory limit. The first thing I'm going to show is filling up both clusters. As I mentioned, right now they have about five CPUs free before submitting any jobs, so if you create one of these jobs, it's going to fill up a little bit more than half a cluster. So we submitted it, it's dispatched to the blue cluster, and it's running. The next job we'll submit is job number two. What happens here is that the dispatching policy we have running is just a random policy: we randomly pick a cluster, and if the job fails on that cluster, it goes into backoff mode and tries again in about 20 seconds. So what you're seeing here is that it tried the first cluster, which was full, so it backed off for about 20 seconds and then retried on a different cluster. The second job got submitted, and it got dispatched to the second cluster.

Next, what I'm going to show is that these two clusters now can't fit a third job of the same size. This is really to show you the first capability, which is to queue the job and wait for resources to become available. Again, I'm just showing that it's the same-size job, and since the clusters are full, it's not going to be able to create the job at all. Because of that, it queues it up, and there won't be any pending pods, which you would normally see if you just submitted these StatefulSets without the AppWrapper. There won't be any pending pods on any of these clusters; it will just be the AppWrapper that's pending. What that means is that the scheduler is not trying to do any work where it can't fit all of the pods. You'll see here, in the bottom left-hand window, that the third job is pending.
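As an aside, here is roughly how the demo jobs are sized. The numbers are inferred from the narration (single-node clusters with about five CPUs free, each job filling a bit more than half), so treat them as illustrative:

```yaml
# Fragment of the wrapped StatefulSet used by the demo jobs (numbers inferred):
# 3 replicas x 1 CPU each = 3 CPUs, a bit more than half of a ~5-CPU node,
# so one job fits per cluster and a third same-sized job has to queue.
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: worker
        resources:
          requests:
            cpu: "1"
            memory: 256Mi
          limits:
            cpu: "1"
            memory: 256Mi
```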
So now I'm going to show what happens when a job completes and the resources get freed: the pending job automatically gets dispatched into one of the clusters that becomes available. Job one completes, and job three will start to dispatch shortly. There it goes. It got dispatched: the MCAD controller detected that there were enough resources, so it dispatched the job to an available cluster.

Finally, I want to show another simple demo where we use priority for preemption. Here I'm setting the priority of this job to 10. All the other jobs didn't have a priority defined, so by default they have a priority of zero. We're going to create this job; again, it's the same size, but since it has a higher priority, we want to evaluate what's already running, determine that everything else is lower priority, preempt a lower-priority job, put it back in the queue, and allow the higher-priority job to get dispatched. You want to use this kind of capability when you have jobs that checkpoint and track their own progress, so that they can be restarted easily. The job got preempted, I think it was job number two, and then, of course, when resources become available again, which I'll show next when job number three frees up, job number two will get redispatched. And I believe in this recording, job two was originally running on the green cluster, and now we're redispatching it to an available cluster, which is the blue cluster. I'm going to stop here, because we're running close to being done and I want to leave time for questions; you can see that it got redispatched. So I'll stop this recording, and I think that's it. The last chart I had was really just a call to action: feel free to give this a try, that would be great; here are the links and the name on OperatorHub. We'd love your feedback, and it would be even greater if you can contribute as well. And that's it. We can take questions now.

Awesome, thanks Diana, a really great presentation and an awesome demo. We now have some time for questions; if you have a question you would like to ask, please drop it in the Q&A tab at the bottom of your screen and we will get to as many as we can in the time we have. It must have been very clear; we're seeing very positive comments, a pretty awesome and understandable demo. Oh, we got one question, actually: how can I configure the priorities of jobs? Do you want to take this question? Yeah, so the priority of the job is set as part of the AppWrapper spec. All you need to do is add the priority stanza inside the AppWrapper. Let me see here, I can show you again; this is an example YAML. When you submit your AppWrapper, if you want to take advantage of the dispatching priorities, you just add this stanza and set the priority you want that job to be assigned. So this is all configurable at the submission level: whenever you submit a job, you set the priority you would like the job, the AppWrapper, to have. And a follow-on question: can the priority be changed later, when the job is already in the queue? Right now we don't support that, but we definitely have plans on our roadmap to address it.
Also, I showed you a highlight of the major things we're doing; one of the things we're also working on, to address another kind of complexity, is starvation. We initially started out with user-defined dispatching priorities, but as a cluster administrator you may want to handle starvation, where there are low-priority jobs and you still want to get a few of them in. So we're currently developing a system priority that considers the user-defined priority, but where, over time, those priorities will change. We have that as part of our major roadmap as well.

And another question just came up in our Q&A box: what is the role of ML in resource management? I can take that one. In terms of what we've shown today, we are doing resource management for AI/ML workloads, so it's the opposite direction of what the question is about. But of course, the use of AI and machine learning in the resource management function itself has an obvious benefit, and we have other, orthogonal work there: for example, focusing on using AI to decide on the scaling of elastic applications, vertical or horizontal scaling, and the use of reinforcement learning to work around and avoid failures. But those are different, orthogonal pieces of work. The MCAD controller is not really using AI to control resource management for the jobs, at the moment.

All right, I think that's all the questions we have time for today. Any other questions? All right, thanks Diana and Alaa for a great presentation. And everybody, thanks for joining us today; the webinar recording and slides will be online later today. We are looking forward to seeing you at a future webinar, as well as at KubeCon + CloudNativeCon next month, in November; please register, and we will see you again. Have a good rest of the day. Thank you. Thank you.