Good morning, everybody. Welcome to the talk on Trimaran, the load-aware scheduler plugins. My name is Asser Tantawi, from IBM Research. And my name is Chen Wang; I'm also at IBM Research, and we have been working on these scheduler plugins for years. All right, so let me, or rather both of us, walk you through Trimaran if you haven't heard about it. It's a family of scheduler plugins that are load aware, and there are three of them. That's a bit confusing, and we're trying to resolve that here by making a clear distinction between the three plugins. They are all upstream; you can download them and try them if you haven't yet. And we're going to show you some use cases and demos of what these Trimaran plugins do.

So, the symptoms you might see in your cluster: some nodes may be experiencing high load and are congested while others are not; some nodes have spiky CPU and/or memory load while others do not; or some pods in your cluster are able to burst up to their limits while others in the same cluster cannot. These are all symptoms of a need for one of the Trimaran plugins. It's not necessarily true that Trimaran will solve them, they might be due to other causes, but we list them as motivation for the need for a Trimaran plugin.

Trimaran is load aware, as I said, and it relies on the load watcher, an open-source component that Chen and others have been working on. As you can see in the diagram, it talks to either the metrics server, a Prometheus server, or a SignalFx server, and it collects the utilization metrics for CPU and memory, both the averages and the variations around the average. Those metrics are smoothed by the load watcher and exposed to the Trimaran plugins, which use them to make their decisions. All right, so what are the goals of these three plugins?
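To make the load-watcher piece concrete, here is a sketch of how a Trimaran plugin is pointed at its metrics source. The field names follow the upstream scheduler-plugins documentation, but the addresses and the scheduler name are illustrative assumptions; check them against the version you deploy.

```yaml
# Sketch (illustrative addresses): a Trimaran plugin reads metrics either
# directly from a metric provider or via a deployed load-watcher service.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: trimaran
  pluginConfig:
  - name: TargetLoadPacking
    args:
      metricProvider:
        type: Prometheus          # or KubernetesMetricsServer, SignalFx
        address: http://prometheus-k8s.monitoring.svc:9090
      # Alternatively, point at a separately deployed load-watcher:
      # watcherAddress: http://load-watcher.default.svc:2020
```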
They have somewhat separate goals, and that's why there are three of them. The first one tackles energy: it works to keep node utilization, currently CPU only, though we hope to expand to other resources such as GPU, around a particular target utilization, and I'll talk about that in a second. The second plugin is about avoiding congestion and interference: it avoids spiky load distributions on a node and places pods away from nodes that have those spikes. And the third one has to do with risk due to limits. As you know, the default scheduler does not consider limits at all; it works only with requests, whereas the third Trimaran plugin takes the limits into consideration. The reason is that it lets pods that need to burst reach their limits as freely as possible.

All right, here it is in table form. I have been calling them the first, second, and third plugin, which is the order in which they were created and also the order of their complexity, but their official names are TargetLoadPacking, LoadVariationRiskBalancing, and LowRiskOverCommitment. The first one packs nodes to reach a target utilization. The second one is about balancing the variation in utilization. The third one is about the limits, the risk of overcommitment. In terms of their use of load awareness: the first utilizes only the average; the second the average plus the variation around it, the standard deviation; and the third uses the tail of the distribution, so the entire distribution, not only the average and standard deviation: it fits a distribution and uses its tail to make decisions. And as you see at the bottom, these map to the three goals I talked about before. So let's talk about the first one, TargetLoadPacking.
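Before diving into the first one, here is a sketch of how the three plugins could be enabled and configured, one scheduler profile each. The parameter names match the upstream scheduler-plugins docs as I recall them, and the values are the ones used later in the demos; treat this as a sketch and verify against your version (each profile also needs a `metricProvider` or `watcherAddress` in its args, omitted here for brevity).

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: trimaran-tlp
  plugins:
    score:
      enabled:
      - name: TargetLoadPacking
  pluginConfig:
  - name: TargetLoadPacking
    args:
      targetUtilization: 70        # pack nodes up to ~70% CPU
- schedulerName: trimaran-lvrb
  plugins:
    score:
      enabled:
      - name: LoadVariationRiskBalancing
  pluginConfig:
  - name: LoadVariationRiskBalancing
    args:
      safeVarianceMargin: 1        # multiplier on sigma: balance mu + 1*sigma
      safeVarianceSensitivity: 1
- schedulerName: trimaran-lroc
  plugins:
    score:
      enabled:
      - name: LowRiskOverCommitment
  pluginConfig:
  - name: LowRiskOverCommitment
    args:
      riskLimitWeights:
        cpu: 0.5                   # balance limit risk vs. load risk
        memory: 1
```

A pod then opts into one of these schedulers via `spec.schedulerName`.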
So, as I said, there's a particular utilization that the administrator, whoever configures this plugin, sets as the target; currently it only works with the CPU resource. This picture depicts the fact that there is always a sweet spot for utilization. What you see here is a study that looks at two resources, CPU and disk, and it shows that there is a sweet spot in utilization, a point you want to operate at. At a very high level, take CPU for example: power, which is the rate of energy, starts at some idle value at zero utilization and keeps increasing, whether convex, concave, or almost linear, as utilization increases. But power is not everything. If you were at the talk yesterday about PEAKS, that dealt with power and finding the sweet spot for power; here we're talking about energy, and there are other aspects. When you run on a node that is very highly utilized, even though hypothetically containers should be performance isolated, in practice they are not: some resources get congested, networking, buses, links, and what have you, so performance starts to suffer. And when you reach very high utilization there are other effects, the fans start to spin up and you lose some energy that way. So you don't want to be very close to one hundred percent; you want to be a little below that, and whatever that value is gets set by the administrator or whoever configures the plugin. How does the plugin work to reach that target? As you can see here, a depiction of a cluster with three nodes: the dotted line is the target utilization, and the blue is, this is not the allocated or requested, this is the average
load. So what happens is the plugin places pods on the first node until it reaches the target utilization, or thereabouts, then fills the second node up to the target, and so on. So during that phase it is basically doing packing. Once the cluster reaches its target utilization on all nodes, approximately, you still don't want to get close to one hundred percent, so beyond that point the plugin spreads the pods across the nodes in the cluster, so that nodes stay close to the target utilization rather than drifting toward one hundred percent. I forgot to mention that the three Trimaran plugins are all score plugins, and they share some basic common components, I talked about the load watcher before, so it's really easy to add a fourth or a fifth Trimaran plugin, and I invite you all to do so. You can see here the scoring function that this particular plugin works with.

Let me move on to the second one. The second one is not about the average; it takes the spiky nature of the utilization into consideration, and it deals with both CPU and memory. We're hoping to add GPU, but that's where we are at this point. Here is a depiction of two nodes, node one on the left and node two on the right. They are both at the same total requested, which is all the default scheduler cares about. A load-aware scheduler would also look at the green, the average utilization; again, it's about the same for node one and node two. The only difference is that the variation around the average is a lot less for node one than for node two; node two is a lot spikier around the same average. So if you have a new pod, where would you place it, node one or node two? Obviously node one, so that it doesn't suffer from the spikiness and the interference with the other containers running on that node. And that's
exactly what the LoadVariationRiskBalancing plugin does. And how does it do that? Think about balancing your cluster. This diagram is a bit involved, but let me walk you through it quickly. What you see is a space where the x-axis is mu and the y-axis is sigma: mu stands for the average utilization and sigma for the standard deviation of utilization, the spikiness, the variation. Each dot is a node in the cluster; all the dots in the space are all the nodes in the cluster. Depending on a node's average and standard deviation of utilization, that's where its dot lies in the space. Typically, when people talk about load balancing, they balance the average; that's the top-left case, where all the nodes line up with the same average, but you might have some nodes very low in variance and others very high. No good. Another possibility is to balance the variability, so all nodes have the same spikiness, but some are very low in average utilization and others very high. No good either. On the bottom left is the coefficient of variation, if you're familiar with that, the standard deviation over the average; you could try to line the nodes up on that. No good either. What this Trimaran plugin, LoadVariationRiskBalancing, does is try to place the nodes along that red line. What is that red line? It's mu plus sigma equals a constant. Why mu plus sigma? If you're familiar with statistics, given some distribution, mu plus sigma means most of your data falls in that range, roughly two-thirds of it; mu plus two sigma covers more, and mu plus six sigma covers essentially everything. The default is mu plus sigma, but the multiplier for sigma is a parameter you can specify in this plugin. For now, for simplicity, the line is
mu plus sigma. Why is that good? If you see a dot at the very right, that's a node with high utilization, say 90% utilized, but very flat, whereas one on the left might be at 20% or 40% utilization but very spiky, and as far as this plugin is concerned they are roughly equal. I'm taking a lot of time here; let me go a little faster. Okay, here is an example of a three-node cluster with some data about the current utilization, plotted in the same space I talked about before: load on the x-axis, variation on the y-axis. You see the three dots for the three nodes, red, green, and blue, once for the CPU resource and once for the memory resource; that's the top, the before-placement case. Before placement meaning: I now have a pod to place; where should I place it? Remember that we're trying to balance the nodes along a line. Obviously the node furthest from the line is the red one, so you try to bring the red one closer to the other two. We compare both CPU and memory, and whichever is furthest from the line is the one that impacts that node's score. After we place the pod, you see the red dot got a little closer to the others, and if you keep doing this, one pod at a time, the nodes line up. Of course they will never line up perfectly, but as much as possible.

All right, the third plugin is LowRiskOverCommitment, and as I said before, this one deals with limits in addition to requests, to allow the containers in a pod to reach their limits whenever possible. A use case: Spring Boot. If you have many pods that each have an init container that spikes, it needs to do a lot of work, say on CPU, and then dies away while the main container doesn't do much, then if a lot of these pods start at the same time, at time zero you have an
issue. So you want to place these pods across your cluster, not on the same node, and that's exactly what LowRiskOverCommitment does. All right, I'm going to have to be a little quick here; I won't walk you through all of this. This Trimaran plugin looks at two things: one is the value of the limit as specified by the containers in the pod, and the second is load awareness. Why? The first one: a container is telling you, this is my limit, I plan to go up to it. Whether it will or not is a separate story, but there is the potential, you don't know, that the container may reach that limit, so you have to take that into consideration, and that's the first component this plugin considers, the limit awareness. The second one: are the containers running on a node really going beyond their requests, into their limits, and how many of them are really reaching their limits? That's what we get from the load watcher and the observation of the utilization. That's why I said we have to look at the tail of the distribution: how much of the utilization really sits above the requested, into the limits, and I think the demo will clarify that. These two factors go into a weighted sum, and that weight is a configurable value; the default is one half, but you can configure it differently for CPU and memory, for example. I think I'm going to stop here and let Chen walk you through the demos.

Yeah, well, thank you. I'm always intrigued by these interesting theories, but here I just want to share three very, extremely simple demos, to give you an intuition about which use cases and which types of load in a production environment call for which scheduler plugin. And here is the QR code: we have a whole tutorial to deploy these scheduler plugins in one click; feel free to scan it. So, to demonstrate
the TargetLoadPacking plugin, we emulate a very simple scenario with a three-node cluster, where the three nodes are configured differently with some background load. The first node has low allocation but its actual usage is really high; on the second node, the pods placed on it have requests matching what they actually use; and the third one is underutilized. So how will the TargetLoadPacking plugin behave in this scenario? Let's just work through the demo. If you download the folder from GitHub, you get this TargetLoadPacking YAML deployment. It includes the service account and all the necessary cluster role definitions, and you combine those with the cluster role binding for your namespace and your scheduler plugin deployment. Here we deploy the TargetLoadPacking plugin as a secondary scheduler in the cluster, so the testing pod can select it afterwards. Okay, so we go ahead. This network policy is just to allow Prometheus access to your secondary scheduler; you can define similar things if you build your own secondary scheduler based on these scheduler plugins. We go ahead and create the deployment for this TargetLoadPacking scheduler, granting the access permissions, and let's see if the scheduler is running. We get the pod and watch the pod logs in real time. Okay, the logs look fine, and we have the secondary scheduler running, which runs the TargetLoadPacking plugin. In the background we have already deployed three types of workload, matching the example I showed in the previous slide. This is our cluster state; we forced the background workloads onto specific nodes, so we know the first one has high utilization: we spawn a stress workload on three CPUs but only request 200 millicores, so the allocation is well below the actual utilization. The second one is matching; I'll probably skip
that. The third one is really doing nothing, just sleeping, but it's also requesting some resources. We also have the dashboards, all available in the dashboard folder; you can check the GitHub link. The testing pod is pretty simple: it uses one CPU but requests only 500 millicores, and the limit is one, and we specifically select the Trimaran scheduler for this pod. So if we go ahead and create this testing pod, we will see, we enabled all the logs, of course, for demo purposes, that it successfully binds to the second node, whose IP ends with five, and you can see it's because Trimaran scored the middle node highest. If we go back to the Grafana dashboard, we can see the highest utilization is actually on the first node. For the configuration, we configured TargetLoadPacking to target 70 percent utilization; the middle node is using about half of its capacity, so it hasn't reached 70 percent yet, and the third one is pretty empty. So we pack the testing workload onto the second node until the nodes reach 70 percent, and only then start packing onto the third. The benefit is that the third node stays empty and you can pack onto it later, and for the second one, as Asser mentioned, if your target utilization is 70 percent, you gain some efficiency while leaving a certain safety margin for your workload if it varies a lot. Okay, I think that's the first demo; the pod really goes to the middle node.

Then we change the scenario a little bit. We still have high utilization, but now also high variation, on the first node, with low allocation. For the second node, the average utilization matches the allocation, with some variation. Remember that Asser mentioned that LoadVariationRiskBalancing is all about understanding the variation of the load on your nodes. And the third one is really a steady
usage with little variation. We can be a little quick here. Similarly, we have this deployment, plus some parameters we need to configure for the load variation, well documented in the scheduler-plugins repo. Basically, you want to configure the safeVarianceMargin and the safeVarianceSensitivity; those two parameters determine how many multiples of the variation you add on top of the mean in the scoring function. Okay, we go ahead and create the Trimaran scheduler with the LoadVariationRiskBalancing plugin and show the logs as well. Similarly, let's take a quick look at the background load; I'm sure you already get the gist of how we emulate it. In this case, on worker one we have some variation: a spiky workload for five minutes, then nothing for five minutes, so you can think of the average CPU utilization as two, but the request we allocate is only one, so it's under-allocated. On the second one, the request roughly matches the utilization, still a little below: the request is two and the peak CPU load we add is five. On the third one we are just stressing steadily: the CPU request is two, the CPU usage is also two, and the limit is four, so there is no variation of the load on the node over time. Similarly, we have a simple test pod with a load of two, requesting only 500 millicores. There are three nodes; the leftmost ends with .4 and the last one ends with .6. You can see that on the .4 node and the .5 node there is high variation in usage, and the limit is of course way beyond the allocatable, while the third one has more stable utilization. Because LoadVariationRiskBalancing tries to minimize the risk caused by load variation, it prefers the third node, the one ending with .6. I think we can quickly skip to the third
one. The third demo is simple: instead of deploying the high-variation load on node two, we copy the same load from node three to node two. However, on node two the background workload requests a really high limit, much higher than the allocatable, while on the third node we are conservative and estimate the limit more accurately for the background workload, so the limit is actually below the allocatable. Let's see what happens this time. Again, there's a parameter here, riskLimitWeights; from the formula you can understand whether you want to assign a higher weight to the risk of hitting the high limit, or more weight to the risk from the load. Here, because we don't really add any load on memory, we don't consider the memory limit, so we configure the memory weight as one, and the CPU parameter as 0.5, which balances the risk of exceeding the limit against the risk from the load. Quickly, we deploy the third scheduler with the third Trimaran plugin, and in the background we again have three different background workloads. The first one has high variation and low requests; this time that node ends with .16. The second is a very stable workload with a matching request and a high limit, way beyond the usage; that node ends with .17. On the third one, the limit matches the request, which matches the actual usage. So you can see the .17 and .18 nodes both have steady utilization around 0.5, and the first one has an average usage of 0.5 but with higher variation. Presumably, higher variation means a higher risk of overloading, so .16 is not preferred. The .17 and .18 nodes have exactly the same usage, but .18 is more conservative in allocating limits, whereas on .17 the total limit on the node is way beyond the allocatable. Let's see what the third Trimaran plugin will do. Okay,
of course, in all the tests we force the pods to use the Trimaran scheduler, and you can see it successfully binds the pod to the node ending with .18, because .18 apparently has the lower risk of exceeding the allocatable in terms of limits. I think that wraps up all our demos. If you have more questions or feedback, please let us know.
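For completeness, here is a minimal sketch of the kind of test pod used throughout the demos. The scheduler name and stress image are illustrative assumptions; the request and limit values match the first demo.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trimaran-test
spec:
  schedulerName: trimaran      # must match the secondary scheduler's profile name
  containers:
  - name: stress
    image: polinux/stress      # illustrative stress image
    command: ["stress"]
    args: ["--cpu", "1"]       # actually burn ~1 CPU
    resources:
      requests:
        cpu: 500m              # request below actual usage
      limits:
        cpu: "1"
```

Because `schedulerName` selects the profile, the default scheduler ignores this pod and the chosen Trimaran plugin does the scoring.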