Hello everyone, welcome to this talk on Caspian, a project on carbon-aware scheduling and placement. First, let me congratulate you on your endurance and tenacity in attending one of the very last sessions of the conference. Let me introduce myself: my name is Asser Tantawi, I'm with IBM Research, and I'm here with my colleague. Hello everyone, my name is Tayebeh Bahreini; I'm a postdoc researcher at IBM Research.

We'll walk you through this project that we started. It's not in the CNCF, but we're planning to get there. We're going to talk about scheduling, dispatching, jobs, multi-clusters, queuing, KubeStellar, but most importantly, carbon and energy.

So what is Caspian? The motivation for this project is, obviously, that the world is trying to get greener and greener. We all know that in data centers and Kubernetes clusters, especially now, we have a lot of GPUs running a lot of training jobs, and they consume GPU cycles for very long periods of time: minutes, hours, days, even months in some cases. What we're trying to do is schedule jobs so as to minimize the carbon footprint of running them on those power-hungry devices. There are many things one could do: make the devices more efficient, improve the cooling, and so on. With this project we try something very simple: schedule jobs at the right time and at the right place, and I'll go through what I mean by that. That's what Caspian is all about.

I don't have to remind you of the carbon footprint of LLM training and similar workloads; the equivalents are usually quoted in terms of round-trip flights or the lifetime emissions of cars. You all know that, and the problem is that it's increasing; it's getting worse by the day. The good thing, however, which we exploit in this project, is that the electricity that goes into these data centers comes from a mix of energy supplies, some renewable, some not. The mix you see here on the left depends on the geography of the data center; some locations have a higher renewable percentage in the mix than others, as you can see here. This is real data from Canada, for example, in Toronto; they have a lot of nuclear in that mix. So the mix varies from location to location. The other aspect is that the mix changes over time, depending on how much sun there is, whether it's day or night, and so on. We're going to exploit that: if you have a multi-cluster arrangement where the clusters are in different locations, and over time the mix is changing, we'll try to schedule those long-running, power-hungry jobs at the right place, on one of those clusters, and at the right time, if we have the luxury of delaying a job a little, within a tolerance specified by the user. That's the whole idea.

If you're not familiar with it, there's a quantity called the carbon intensity. The carbon intensity takes the mix I showed you and reduces it to one value: how much carbon is emitted per unit of energy, given that mix.
For example, for every kilowatt-hour of energy, how much carbon is emitted; that depends on the mix. If you have a lot of renewable energy in the mix, the carbon intensity is low, and it is higher otherwise. There is a way to compute it, and this carbon intensity is a function of time, as I showed you before.

Now, since we're after carbon, we have to figure out what carbon is. The carbon emission due to a running job is really a product of three things. One is the carbon intensity I just talked about, at that location and at that time, times the power consumption; the units are shown here in yellow. Carbon intensity is carbon per unit of energy, and the other two terms together are the energy: energy is power times time. Power is the consumption of the devices the job is running on; that depends on how efficient the devices are, and maybe on other things. If you attended the PEAKS talk yesterday, PEAKS handles exactly that part; it's a power-aware scheduler that tries to minimize power. The third component is how long the job takes. In Caspian we don't touch that; there are of course ways of dealing with it, for example making the run somewhat shorter by adjusting hyperparameters while watching accuracy and precision, but we don't deal with that in Caspian. We deal to some extent with the choice of hardware, if one data center or cluster has better hardware than another, but more importantly we deal with the carbon intensity in order to minimize the total carbon. That's the whole idea: we minimize the total carbon, the left part of the equation, by scheduling jobs on whichever cluster and at whatever time the carbon intensity is low.
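In symbols, the relation just described looks like the following; the subscripts are ours, and treating intensity and power as constant over the run is the simple version, with the slot-by-slot sum as the time-varying one:

```latex
% Carbon emitted by job i on cluster j: intensity x power x time,
% with the units mentioned above (e.g., gCO2eq per kWh for intensity).
\underbrace{C_{ij}}_{\mathrm{gCO_2eq}}
  \;=\; \underbrace{CI_j}_{\mathrm{gCO_2eq/kWh}}
        \times \underbrace{P_{ij}}_{\mathrm{kW}}
        \times \underbrace{T_i}_{\mathrm{h}}
\qquad\text{or, since } CI_j \text{ varies over time,}\qquad
C_{ij} \;=\; \sum_{t} CI_j(t)\, P_{ij}(t)\, \Delta t
```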
Of course, not all jobs, not all workload, is shiftable as we'd like. We can only shift a job forward in time; we cannot shift it backward, before it arrives, although if it's a schedulable job, one that runs every day or something like that, you can effectively do that. And some jobs are not shiftable at all. So there is some tolerance involved, which is what I talked about before: we assume that when a user submits a job, a training job for example, it comes with a deadline, like "you have to finish my job by six o'clock tonight." It could be hard or soft; our assumption, in what we're going to show you today, is that it's a soft deadline, but we try to stay as close to the deadline as possible and not go much beyond it, and you'll see that in the demo.

Pictorially, here is the problem we're trying to solve. We have multiple data centers, or multiple clusters, and we have jobs that arrive. A job is described in a manifest, the YAML file: an estimate of how long the job will run (an estimate), a resource requirement, and a deadline. There is, obviously, a queue of these jobs and a dispatcher; that's not Caspian. Caspian is really the brain, the decision maker; it's the thing that decides, optimally, for each job, where it should go and when it should run.

The way we do that is we formulate an optimization problem. We discretize time, so we look at time slots; it's a time-slotted system. Typically the carbon intensity doesn't change that often, we're talking about 15-minute or half-hour granularity, so our optimizer runs at about that scale. And again, remember we're talking about training jobs, long-running jobs. So we solve an optimization problem that we formulate, and I'm going to show you a glimpse of it. We make some assumptions: we look at the cluster as a whole, so we aggregate the efficiency of the cluster into an approximate power profile that is linear in the utilization of the resources, and we assume that allocation is equivalent to utilization, which is not quite true, but that's the assumption we make.
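As a sketch of the kind of linear power profile meant here, under the stated allocation-equals-utilization assumption (the symbols are ours, for illustration):

```latex
% Cluster-level power, approximately linear in utilization u_j(t) in [0,1];
% allocation is used as a proxy for utilization.
P_j(t) \;\approx\; P_j^{\mathrm{idle}}
  \;+\; \bigl(P_j^{\mathrm{peak}} - P_j^{\mathrm{idle}}\bigr)\, u_j(t)
```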
There are some things that Caspian doesn't get involved with, for example the scheduling within a cluster. We say that job goes to that cluster, but the scheduling within the cluster is not a Caspian thing; maybe it's a PEAKS thing, or some other scheduler plugin in that cluster.

So, as I said, Caspian is a decision maker, this box here in the middle. It relies on a multi-cluster management platform of some kind, on a job queuing and dispatching system, and on a mechanism to transfer jobs transparently from a central location, which is itself a cluster. This thing on the left lives in a cluster, a management cluster or what we call a hub cluster, and then you have the multiple clusters where the jobs run, the spoke clusters. The management of the job lifetime, for example, is part of that multi-cluster management platform, which is not Caspian.

I didn't talk much about preemption. Even though you will not see it in this presentation, if a job is preemptible and Caspian finds that it's better to cut it into pieces and spread the pieces over times when the carbon intensity is low, it will do that. It will preempt if the job allows preemption, if it's checkpointable. That's all part of the optimization problem.

Speaking of the optimization problem, here it is. I'm not going to go through all the details, but I'll give you a glimpse. This is a language in which the problem is specified; think of it as a programming language, but a mathematical one. The idea is that job i is going to be scheduled on cluster j at time slot t if this variable is one; that's the variable of the optimization problem. It's an integer program, so the variable is either zero or one: you either run the job at that location at that time, or you don't. There are a bunch of constraints and an objective, which is in red. Our objective is a multi-objective one: one part has to do with carbon; the second I talked about briefly, we want to finish the job as close as possible to the deadline; and the third is that we try to schedule jobs as early as possible, not only by the deadline, so that the finish time, the makespan if you will, of all jobs is also minimized, subject to the constraint that the resources are satisfied.
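A simplified sketch of such a formulation, in our own notation; the actual model has more constraints (for example, contiguity for non-preemptible jobs), and the weights are illustrative:

```latex
% Sketch: x_{ijt} = 1 iff job i runs on cluster j in time slot t.
% f_i = finish time of job i (derived from x), d_i = its deadline,
% r_i = its resource demand, R_j(t) = capacity of cluster j in slot t.
\begin{aligned}
\min_{x \,\in\, \{0,1\}}\quad
   & w_1 \sum_{i,j,t} CI_j(t)\, p_{ij}\, \Delta t\; x_{ijt}
   && \text{(carbon)} \\
+\;& w_2 \sum_{i} \max\bigl(0,\; f_i - d_i\bigr)
   && \text{(lateness beyond the deadline)} \\
+\;& w_3 \sum_{i} f_i
   && \text{(finish as early as possible)} \\
\text{s.t.}\quad
   & \sum_{i} r_i\, x_{ijt} \;\le\; R_j(t)
   && \forall\, j, t \quad \text{(resource capacity)}
\end{aligned}
```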
As for the components of putting the system together, we rely on a couple of open-source projects. Caspian, again, is the decision maker, the optimizer, the solver; it takes the problem periodically and solves it. We chose MCAD as our queuing component, the thing that queues jobs. There's something we call a dispatching gate, and the gate is initially set, similar to what Kueue has; Kueue also has a suspend kind of gate. Caspian removes that gate when the job is ready to be executed, and it also puts the target cluster that has been decided on into the job specification. MCAD does the job queuing and lifecycle management. Then we rely on another open-source project, KubeStellar, which provides the whole mechanism for sending the job to the spoke cluster, managing the job lifecycle there, and syncing the results back.

In Kubernetes terms, we added a couple of custom resources that we needed. The one on the left here, in purple, is ClusterInfo: each spoke cluster has a ClusterInfo object with information about that cluster, mainly the resource availability on that cluster, which changes over time, so a controller keeps updating it, as well as the geography of where the spoke cluster is. We then get the carbon intensity, typically through web services, for that location; we fetch it through a carbon-monitor component. The ClusterInfo custom resource is synced through KubeStellar up to the hub cluster, where Caspian lives.

The other custom resource is called an AppWrapper, and that's really part of MCAD, so it's not new. MCAD has this AppWrapper custom resource that wraps everything related to the job in one object: whether it's a pod, a deployment, a collection of deployments, the secrets, the ConfigMaps, everything goes into this one thing, an AppWrapper. The AppWrapper is the unit, the granule, of scheduling; that's what sits in the queue.

MCAD has two components. One runs in the hub cluster: the dispatcher, which is where the queue is. When it's time for the job to be scheduled, when the scheduler/optimizer, which is Caspian, sets the gate, the job is dispatched and goes down to the particular target spoke cluster through KubeStellar. The MCAD runner takes the job as an AppWrapper, unwraps it, and all the API objects of that job then live and run in the spoke cluster. If it's not preempted (let's not talk about preemption now), the app runs there and is watched by the MCAD runner; eventually it finishes, and that status is synced back up to the hub cluster, where the user interacts with the job. And with that, I'm going to hand it over to Tayebeh to show you a demo.

Thank you, Asser. Now, in the second part of our presentation, I'll show you a demo of Caspian. You can also scan this QR code to get more details on how to set up the demo. Here we consider four local clusters, all built with k3d: one is the hub, and we have three spoke clusters. On the hub we deploy Caspian, which has two main components: carbon monitoring and a scheduler. Carbon monitoring periodically interacts with a carbon-intensity service to get updated carbon-intensity values for the spoke clusters. The scheduler also runs periodically; at the beginning of each period it gets all the workloads in the system, the ones queued in the hub and the ones running in the spokes. Then, based on the status of the workloads and the status of the clusters, their carbon intensity over the next 24 hours and their resource availability, it decides which workloads should execute in the next time slot and on which cluster. It may also decide to suspend the execution of some workloads for sustainability reasons. We also have the MCAD dispatcher here: for the workloads whose target has been set by Caspian, the MCAD dispatcher dispatches them to the target cluster with the help of the syncer, which down-syncs these workloads and up-syncs their status from the spokes to the hub cluster. On each spoke cluster we deploy the MCAD runner, which unwraps all the Kubernetes objects in an AppWrapper and executes them. We also have a ClusterInfo controller in the MCAD runner that updates the ClusterInfo custom resource: it periodically gets the status of the spoke cluster, the available resources, GPU and CPU, as well as the geolocation of the cluster. In the demo I'm going to show you, we set the geolocation of spoke 1 to Germany, spoke 2 to a region in Japan, and spoke 3 to Ontario in Canada. We also have a load generator that, over 24 hours, submits workloads in the AppWrapper format to the hub cluster.

Let me show you an example of an AppWrapper. As was stated before, you can wrap as many objects as you need in a single AppWrapper YAML file; you just need to list them in the generic-items part. Here we have a single pod, but you could also have a Job or any other Kubernetes object. In addition, you can specify some fields needed by Caspian: the expected runtime for this AppWrapper, and the deadline for finishing it. If the user doesn't specify them, Caspian uses default values.
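A minimal sketch of what such an AppWrapper might look like. The group/version and the names of the Caspian-specific runtime and deadline fields are assumptions for illustration only; check the demo repository for the actual schema.

```yaml
apiVersion: workload.codeflare.dev/v1beta1   # MCAD AppWrapper; group/version varies by release
kind: AppWrapper
metadata:
  name: training-job-001
  labels:
    # Hypothetical labels carrying Caspian's scheduling hints:
    caspian/expected-runtime-minutes: "120"  # estimate of how long the job runs
    caspian/deadline-minutes: "360"          # soft deadline for completion
spec:
  resources:
    GenericItems:        # list as many wrapped objects here as you need
    - replicas: 1
      generictemplate:   # here a single pod; could be a Job, Deployment, etc.
        apiVersion: v1
        kind: Pod
        metadata:
          name: trainer
        spec:
          restartPolicy: Never
          containers:
          - name: trainer
            image: registry.example.com/llm-trainer:latest  # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 2   # demo jobs request one to five GPUs
```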
Here is the distribution of the arrival rate of workloads from our load generator; it is a Poisson-like arrival process. We assume the workloads are long-running, with running times between one and four hours, and GPU requirements between one and five GPUs. We also assume here that the jobs can tolerate delays; the slowdown tolerance is set to three.

You can also set other parameters to run the demo. For example, you can set the period length, how often you want to run Caspian, hourly or in minutes; for example, 120 seconds means I want to run it every two minutes. You can also set the optimization mode: the current version of Caspian runs in two modes, sustainable mode and QoS mode. In sustainable mode, Caspian considers not only minimizing the carbon footprint but also minimizing the completion time and lateness of workloads. In QoS mode, Caspian doesn't put any weight on the carbon footprint. You can also set other parameters, like the zones of the clusters and the power characteristics.
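Purely for illustration, the knobs just listed might be collected in a settings file like the following; the parameter names, zone codes, and values here are hypothetical, not the demo's actual schema:

```yaml
# Hypothetical Caspian demo configuration (illustrative names/values only)
periodSeconds: 120            # run the optimizer every two minutes
optimizationMode: sustainable # "sustainable" (carbon + QoS) or "qos" (QoS only)
clusterZones:                 # Electricity Maps-style zone codes per spoke
  spoke1: DE                  # Germany
  spoke2: JP-TK               # a region in Japan
  spoke3: CA-ON               # Ontario, Canada
powerCharacteristics:         # per-cluster linear power profile (see earlier sketch)
  spoke2:
    idleWatts: 300
    peakWatts: 700
```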
First, let me play this recorded video, which shows the steps you need to go through to run Caspian; then I'll also show you some experimental results. Before everything, you need to clone the repository and run the create-clusters script. This script creates three spoke clusters and one hub, and you can pass some parameters: here, 3 is the number of spoke clusters I need, 1 is the number of agent nodes, and 16 is the number of GPUs I need per node. The script then creates everything; it takes a little bit of time. So here we have the three spoke clusters and one hub. The next step is to run MCAD on each cluster: with this command, the MCAD runner is deployed on each spoke cluster and the MCAD dispatcher is deployed on the hub cluster. The next step is to run the monitoring script, which I'll explain a little later; what we're showing with it is the performance of the clusters, the live carbon intensity in each cluster, the GPU allocation, and the overall performance of the system.

The terminal on the right shows the output of the run-Caspian script, that is, the output of Caspian. In the first period I haven't run the load generator yet, so there are no AppWrappers in the hub cluster; that's why you don't see any decisions made by Caspian. Let me explain this script a little. The first lines list the AppWrappers in the system, in the hub; then it gets all the spoke clusters, with the available CPU and GPU on each and their geolocation; and then it lists the decisions made by the optimizer in Caspian. At the beginning we don't have any AppWrappers, but now we run the load generator, so we should expect more decisions from Caspian in the next time slot. Now you can see two jobs have been submitted to the hub so far, and here you can see the decisions made by Caspian: both jobs are assigned to spoke 3. Why did Caspian make this decision? Because of the carbon intensity: if you look here, the geolocation of spoke 3 is Canada, Ontario, which has the lowest carbon intensity. Now let me run the remaining part of the video.

If you look at the figures on the left, the first row from the top shows the carbon intensity of each spoke cluster: the leftmost is for Canada, then Japan, then Germany. As you can see, Canada has the lowest carbon intensity; that's why, in the second row, more GPU allocations go to that cluster. After some point, because Caspian also takes care of the QoS of the workloads, it also uses the second cluster, in Japan, to schedule workloads. Japan and Germany have carbon intensity in the same range, but because Japan is more power efficient here, Caspian chooses spoke 2. I'm going to explain the last two plots on the next slide, but let me briefly note that at the bottom you see two figures: the left one shows the total carbon footprint generated by executing the workloads, and on the right you see a histogram of the percentage of jobs by the value of alpha. What is alpha? Alpha is the completion-time ratio, the final response time over the deadline. If this value is less than or equal to one for a job, it means the job was executed by its deadline; if it's greater than one, it means the job missed its deadline.
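In symbols, as just defined:

```latex
% Completion-time ratio of job i: response time relative to deadline
\alpha_i \;=\; \frac{\text{response time}_i}{\text{deadline}_i},
\qquad
\alpha_i \le 1 \;\Rightarrow\; \text{job } i \text{ finished by its deadline}
```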
So let's see how Caspian can guarantee this; let me go to the next slide. Here we run Caspian in two different modes: on the left we run Caspian in sustainable mode, and on the right in QoS mode. There is no sustainability concern on the right side, while on the left side we consider both QoS and carbon footprint in our decision making. The first row shows the carbon intensity of each cluster, the second row shows the GPU allocation, and the third row shows the overall performance of the system. Over 24 hours, workloads are submitted to the system; what you're seeing here is just played in fast mode. You can compare how workloads are allocated in each mode: in sustainable mode, the scheduler really starts by assigning jobs to the greenest cluster, which is Canada, then it uses Japan, and after that Germany, while in QoS mode we see a more balanced distribution of workloads. If you look at the last row, you see the total carbon footprint generated in each mode; of course, in sustainable mode we have lower carbon emissions compared to QoS mode. But what about the other performance metrics, about QoS? If you look at the value of alpha in each mode, you'll see that in sustainable mode about seven percent of jobs miss their deadline, while in QoS mode, where the main concern is QoS, less than one percent of workloads miss their deadline. But if you look at the values of alpha, most of the late jobs finish with alpha between 1 and 1.2, so only a small percentage of jobs are very late, which we can live with. That's the end of our demo. Thank you, everyone. If you have any feedback, please scan this QR code. Any questions?

Q: Thank you for the presentation. Do you make any decision about delaying a job, instead of putting it in the cluster with the lowest carbon footprint?

A: Thank you. Our scheduler actually makes two decisions: when to schedule a job, and where. So we make both decisions; some jobs will be delayed if they can tolerate delay.

Q: In that case, how do you choose to delay a job? You don't know the future; you don't know what the carbon footprint will be.

A: The look-ahead time window of our optimizer is the next 24 hours, so we know what will happen to the carbon intensity over the next 24 hours, and based on that we make the decision. [Asser:] Let me add to that: you're absolutely right, it's a good point. We work with predictions of carbon intensities, and nobody knows for sure what's going to happen over the next 24 hours, but given the cyclic nature of the way energy is generated, you can predict it with good accuracy. Any other questions?

Q: Thanks a lot for the presentation, very nice. Are you thinking about adding other parameters to the scheduler, for example the price of running in one specific location, or the latency? I'm thinking of enterprises using this and maybe having broader constraints. I assume one of the assumptions here is that the three locations have the same kind of servers, so their profile is the same?

A: The power profiles are not necessarily the same in all clusters; that's the assumption we make and work with. But I think you're bringing in an interesting dimension, which is cost, and there may be other dimensions as well. The optimization problem we showed earlier is a multi-objective one, and therefore adding another objective should be straightforward. What's not straightforward, though, is the way to solve it. You could take this problem formulation, get a solver, and solve it as an integer programming problem, but if you did that, you'd be dead in the water; I can tell you that with hundreds or thousands of jobs on many clusters it would take forever to solve. So we have an approximation algorithm that solves that particular optimization problem, which we didn't get into.

Q: When can we expect this to be available for production workloads?

A: It's still in research.

Q: Is it some kind of open research, where you accept issues or pull requests?

A: Yes, it's on GitHub. We've only just started to work on this, but you're welcome to download it and try. Did we put up the QR code for the GitHub repo, or was it at the end? The one on the left is Caspian, and the other two are already open source: MCAD and KubeStellar. MCAD is one way of doing job dispatching; otherwise, as I mentioned before, it could be Kueue, so we could integrate with that as well; we haven't done that yet. But Caspian is the one on the left.

Q: I was wondering, are you using the Green Software Foundation's Carbon Aware SDK to fetch the data about carbon intensity, or are you doing it your own way directly?

A: We don't do it our own way; here we use Electricity Maps to get the updated values. We rely on other open source.

If there are no other questions, thank you very much. Thank you.