Okay, good afternoon and welcome, everybody, in person and online. Thank you for joining us for "Trimaran: Real Load Aware Scheduling in Kubernetes." Would you all join me in welcoming Abdul Kadir from PayPal and Chen Wang from IBM?

Good afternoon to everyone present here, and my greetings to those attending virtually. I'm Abdul Kadir, a senior software engineer at PayPal. I'll be presenting on load-aware scheduling with my collaborator Chen Wang, who is a research scientist at IBM. The image you see here is of a trimaran, which is a three-hulled boat that provides more stability and safety in the ocean.

All right, let's get started. This is the agenda for today. We'll go over the motivation, the background, and the problem definition, followed by the Trimaran architecture and design and the plugins we contributed to the open source community: the first one is target load packing, and the second is the load variation risk balancing plugin. Then we have a demo, then some of the challenges Trimaran faces and how we overcome them, followed by good practices for using Trimaran in production, and some future work. So, without further ado, let's get started.

As many of you may know, Kubernetes provides a declarative resource model for its pods. What I mean by a declarative model is that the resource usage for your pod or container needs to be defined in a spec before you can run your workload, and the core components in Kubernetes, namely the kubelet and the scheduler, honor it so as to behave consistently with respect to the quality of service guarantees. Given this model, developers tend to over-provision resources, and one of the reasons is that they want to avoid the penalties of evictions and CPU throttling that Kubernetes would apply. One way to deal with this is to benchmark your applications, but that can be cumbersome to do for all your production applications, and in general estimating real load is hard.

With that background, the main problem we are trying to solve is that the default scheduler in Kubernetes uses an allocation-based scheduling model, which can lead to under-utilized nodes and fragmentation of cores across the cluster. What you see on the bottom right is a graph from a 29-day Google trace of their cluster, and you can observe that the usage is about 40% of the requests, which implies a lot of the capacity is over-provisioned.

Okay, now moving on to the Trimaran design and architecture. Around the time we started working on this problem, the Kubernetes scheduler framework was a beta API moving towards stable, so we decided to leverage it to contribute our plugins. The framework provides the flexibility to extend the scheduler through different APIs, and the Trimaran plugins do so at an extension point called the scoring extension point. There's an upcoming talk by the SIG Scheduling folks, and you're more than welcome to attend if you're interested in a scheduling deep dive. For a given scoring plugin, the input is a set of nodes, and the output is a score for each of those nodes, depending on the algorithm you define in your scoring plugin. In the diagram you can see that the Trimaran plugins are part of the Kubernetes scheduler, so they run in the same binary.
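To make the scoring extension point concrete, here is a minimal sketch of what a scoring plugin looks like under the scheduler framework. The interface is the framework's own; the plugin name and the trivial scoring logic are placeholders for illustration, not the actual Trimaran code.

```go
package examplescorer

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// ExampleScorer is a placeholder scoring plugin. Trimaran's plugins
// (TargetLoadPacking, LoadVariationRiskBalancing) implement this same
// framework.ScorePlugin interface and are compiled into the scheduler binary.
type ExampleScorer struct{}

var _ framework.ScorePlugin = &ExampleScorer{}

func (s *ExampleScorer) Name() string { return "ExampleScorer" }

// Score is called once per (pod, node) pair during the scoring phase. The
// framework expects a score between framework.MinNodeScore and
// framework.MaxNodeScore, i.e. 0 to 100; higher means the node is more
// likely to be selected.
func (s *ExampleScorer) Score(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	// A real load-aware plugin would look up node utilization here
	// (for example from the load watcher) and score the node accordingly.
	return framework.MaxNodeScore, framework.NewStatus(framework.Success)
}

// ScoreExtensions can return a normalizer; nil means scores are used as-is.
func (s *ExampleScorer) ScoreExtensions() framework.ScoreExtensions {
	return nil
}
```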
The next major component in our design is the load watcher. The load watcher can be defined as a cluster-wide aggregator of metrics, backed by a third-party metrics provider; I'll explain on the next slide which metrics providers we support. The load watcher maintains a cache in memory for fast lookups whenever the plugins need to talk to it to get live metrics, and it also checkpoints its cache in a DB to provide some fault tolerance in case of failure and fast recovery during restarts.

More about the load watcher on this slide. Another way to think about the load watcher is that it is a wrapper which unifies metrics from different providers into a common format that the Trimaran plugins understand. The currently supported providers, which we added, are Prometheus, SignalFx by Splunk, and the native Kubernetes metrics server, and it can be extended by users to support any other metrics provider if needed.

From the diagrams you can see that there are two options for using the load watcher: the first is to use it as a service, and the second is as a library. Each of them has its pros and cons, and it's up to the users to decide what works best for them. In the case of using it as a service, you get separation of a failure component from the scheduler process, and the additional API call latency from going over the network can be minimized if you run it as a container local to the scheduler pod. The other point is that it scales separately from the Kubernetes scheduler. This is important for us because in the future we want to make it a bit more complex by adding machine learning models, which could make the watcher process more CPU intensive, and we don't want it competing with the scheduler for resources. If, on the other hand, you use it as a library from inside the plugins, there is of course no API call over the network, which matters when you have fast scheduling cycles, and you get a much simpler deployment. We at PayPal use it as a service, while the Red Hat folks, who integrated Trimaran into their Kubernetes product offering, OpenShift, use it as a library.

What you see in the rightmost column is sample data that the load watcher process maintains in its cache. It maintains windows of the different metrics needed by the plugins; the window durations range from fifteen minutes to five minutes, and we use each of them as needed depending on the use case. It also adds some more metadata to enrich the watcher cache.
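As a rough illustration of the service mode, here is a sketch of how a plugin-side client might fetch the cached metrics over HTTP. The endpoint path, the port, and the JSON shape below are assumptions for illustration, not the load watcher's documented API.

```go
package watcherclient

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// NodeMetrics mirrors the kind of per-node window data the load watcher
// caches; the exact field names here are assumptions.
type NodeMetrics struct {
	Node    string  `json:"node"`
	Metric  string  `json:"metric"`  // e.g. "cpu"
	Window  string  `json:"window"`  // e.g. "15m"
	AvgUtil float64 `json:"avgUtil"` // average utilization, percent
	StdDev  float64 `json:"stdDev"`  // standard deviation, percent
}

// WatcherResponse is an assumed wrapper for one cache snapshot.
type WatcherResponse struct {
	Timestamp time.Time     `json:"timestamp"`
	Metrics   []NodeMetrics `json:"metrics"`
}

// FetchMetrics queries a load watcher running as a service. A hypothetical
// address would be "http://localhost:2020" when the watcher runs as a
// sidecar container local to the scheduler pod.
func FetchMetrics(watcherAddress string) (*WatcherResponse, error) {
	resp, err := http.Get(watcherAddress + "/watcher")
	if err != nil {
		return nil, fmt.Errorf("load watcher unreachable: %w", err)
	}
	defer resp.Body.Close()

	var out WatcherResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, fmt.Errorf("decoding watcher response: %w", err)
	}
	return &out, nil
}
```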
Coming to the meat of the design. For the first plugin we contributed, target load packing, we had two main objectives in mind. The first is that we want to achieve high utilization across all the running nodes in the cluster; specifically, we want utilization to stay around a given threshold that we set before running the plugin, and we do so by packing pods onto nodes as long as the CPU utilization is under that threshold. The other objective is to maintain some safe margin for CPU usage spikes, which can happen due to unpredictable loads. We do that by spreading the incoming pods across nodes once the threshold is reached, which results in higher safety; higher safety here means there are more leftover CPU cores.

Here is the algorithm for the target load packing plugin; I'll try to explain it in simpler words. What you see on the right is a graph of the score the plugin produces versus utilization. The Kubernetes framework wants scores in the range 0 to 100, and a node scored higher has a better chance of being selected for scheduling. The first segment you see is an increasing function: as the utilization increases from 0 to 50, we favor the nodes by scoring them higher. This is the packing I talked about; we keep packing pods onto those nodes. When you reach 50% utilization, which is our threshold (the graph shown here is for a threshold of 50, by the way), there is a drop; the function is discontinuous. What that implies is that we penalize nodes which would exceed the utilization we set as the threshold, so we favor nodes less and less as the utilization goes beyond 50%.

Points two and three explain how we estimate the incoming pod's usage, because we don't have usage metrics for it yet, so we predict it based on the quality of service of that specific pod. If the pod has both requests and limits, we take the limits as the predicted usage. If there are no limits, we apply a multiplier to the requests. And if it's a best-effort pod, with no requests or limits, we assume some configured minimum utilization. Another way to look at the algorithm is that it switches from a best-fit variant of bin packing to a least-fit variant as the utilization goes from 0 to 100.
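Putting the pieces together, here is a sketch in Go of the piecewise scoring just described, together with the QoS-based usage prediction. This is a paraphrase of the algorithm as presented on the slide, with assumed parameter names and values; it is not a copy of the actual plugin code.

```go
package targetloadpacking

import "math"

const (
	maxScore           = 100.0 // framework's maximum node score
	targetUtilPct      = 50.0  // configured threshold, 50% in the talk's example
	requestsMultiplier = 1.5   // assumed multiplier for pods with requests but no limits
	minPodUtilPct      = 5.0   // assumed minimum utilization for best-effort pods
)

// predictPodCPU estimates the incoming pod's CPU usage (as percent of node
// capacity) from its QoS class, since no live metrics exist for it yet.
func predictPodCPU(limitPct, requestPct float64, hasLimits, hasRequests bool) float64 {
	switch {
	case hasLimits:
		return limitPct // pod with limits: assume the limit as usage
	case hasRequests:
		return requestsMultiplier * requestPct // requests only: scaled request
	default:
		return minPodUtilPct // best effort: assume a configured minimum
	}
}

// score implements the piecewise function from the slide: rising from the
// threshold value to 100 as predicted utilization approaches the target,
// then dropping and decaying toward 0 beyond it.
func score(nodeUtilPct, podPct float64) int64 {
	predicted := nodeUtilPct + podPct
	if predicted > 100 {
		return 0
	}
	if predicted <= targetUtilPct {
		// Increasing segment: score(0) = target, score(target) = 100.
		return int64(math.Round((maxScore-targetUtilPct)*predicted/targetUtilPct + targetUtilPct))
	}
	// Penalized segment: score(target) = target, score(100) = 0.
	return int64(math.Round(targetUtilPct * (100 - predicted) / (100 - targetUtilPct)))
}
```

The discontinuity at the target is what flips the behavior from the best-fit (packing) variant to the least-fit (spreading) variant.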
Here are some results from an experiment we did with about 100 nodes and 400 pods, using an open source tool, the k8s cluster simulator, with some changes to incorporate our Trimaran plugins, which is what we contributed. The pods here have burstable QoS, so the limits are larger than and different from the requests, and they have a normal distribution of utilizations which remains constant throughout their run. The graph on the left is for the native Kubernetes scheduler and the graph on the right is for Trimaran. There are multiple observations you can make by analyzing these two graphs. The first is that the area under the threshold mark, the red line you see, is much higher for target load packing, which implies better capacity utilization across the cluster. The second is that there are fewer hot nodes in the case of target load packing; by hot nodes I mean nodes which exceed the threshold utilization we had as an objective, and they are minimized with the target load packing plugin. The last is that there are fewer fragmented cores. The way you can tell is that each bar here corresponds to a specific node, and the variance in the height of the bars is much lower in the case of target load packing, which implies fewer fragmented cores compared to the default scheduling.

Okay, moving on to the next important question for production: how well does Trimaran perform? These tests were done with the scheduler_perf tests from the Kubernetes main repo. Just for background, this is what Kubernetes uses to publish its scheduler metrics, so that they know the metrics, for example the algorithm latencies and the end-to-end scheduling duration, are within the SLOs. We modified it to include our own plugins and ran two experiments: the first used 5,000 nodes with 1,000 pods to schedule, the other 500 nodes with 1,000 pods, with the same 500 init pods in each. These are a set of standard tests you can use for comparison, and we can see that Trimaran beats the default scheduler in both experiments.

Before I explain why this happens, I'd like to mention that the default scheduler has a set of scoring plugins configured by default. Trimaran disables two of those and adds its own, so essentially you have one less plugin compared to the default. So the first reason for the performance benefit is that there is one less plugin doing scoring, which makes it faster. The other is that we optimized the plugin to use fast JSON encoding and decoding libraries, and we have a background thread that prefetches metrics from the load watcher cache before each scheduling cycle.
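The prefetching idea is simple to sketch: a background goroutine refreshes the watcher metrics on a timer, so the scoring path only ever reads a local snapshot. This is an illustrative sketch of the pattern, reusing the hypothetical FetchMetrics helper and WatcherResponse type from the earlier client sketch; it is not the plugin's actual code.

```go
package watcherclient

import (
	"log"
	"sync"
	"time"
)

// metricsCache holds the most recent load watcher snapshot; the scoring
// code reads it without any network call on the scheduling critical path.
type metricsCache struct {
	mu       sync.RWMutex
	snapshot *WatcherResponse
}

func (c *metricsCache) get() *WatcherResponse {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.snapshot
}

// startPrefetch refreshes the cache every interval in the background.
func (c *metricsCache) startPrefetch(watcherAddress string, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			resp, err := FetchMetrics(watcherAddress)
			if err != nil {
				// Keep serving the last good snapshot on transient failures.
				log.Printf("metrics prefetch failed: %v", err)
				continue
			}
			c.mu.Lock()
			c.snapshot = resp
			c.mu.Unlock()
		}
	}()
}
```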
Moving on to my last slide, this is about the status of Trimaran at PayPal. As I explained, the main use case for us, working in the infrastructure team, is to maintain and assure efficiency of the fleet; PayPal, relying on the cloud, would also like to minimize costs, and this translates to the technical requirements I mentioned as the target load packing objectives. One thing I didn't mention is that we also have a requirement for the replicas of an application to be spread across topologies; by topologies here I mean nodes, node pools, or different zones, for example. So we combine our target load packing plugin with pod topology spread to achieve these requirements. The current status is that it has been tested for deployment in QA, to be followed by production, so this will be enabling an efficient fleet for a majority of the PayPal applications. Over to my collaborator.

Okay, thank you, Abdul. Next I will introduce the load variation risk balancing plugin, which is designed to solve an issue we observed in one of our production clusters. We observed that even when nodes have the same average utilization, which is what the scheduler uses as its preference when placing pods, they may have significantly different load variations, and it's very important to account for the usage variations on nodes when scheduling pods. Namely, the higher the variation of the workload on a node, the higher the risk of scheduling a pod there: if the pod runs there and there is some bursty workload, you can easily end up with pod evictions or performance issues. This plugin is designed to balance such risk.

Let's take a look at an example. Suppose we have two nodes, each with a capacity of eight CPUs and only five requested on both. In that case the two nodes are deemed equivalent when the default scheduler tries to schedule a pod, since the pod can fit on both. And if you only consider the average utilization, you may have exactly the same average utilization on both nodes, so by that measure you would also regard the two as the same. But if you look at the variations on these two nodes, apparently on node two the peak usage, the maximum utilization, is much higher. So the scoring plugin we propose considers not only the average utilization but also the variation on the node. If a pod can fit on both nodes, as in this case, then in order to minimize the risk of overcommitting at the peak hour it is much safer to choose node one than node two, right?

To better understand what we can balance based on the average and standard deviation, let's draw every node's average and standard deviation in a mu-sigma plot, where mu is the average and sigma is the standard deviation. If a scheduler balances the average utilization across all nodes, the node utilizations will look like the top left chart, a vertical line where all nodes' average utilizations are the same. If you want to balance the variations across the available nodes, it will look like the horizontal line in the top right plot. If you want to balance something where the variation is proportional to the average, it looks like the lower left chart; I don't know if there is a practical use case for that, but it's a good illustrative example to explain the last one, which is balancing the risk. The higher the average utilization on a node, the lower the variation you want, and the higher the variation, the lower the average utilization you may want; that looks like a line which is basically mu plus sigma equals some constant.

This is exactly how we designed the scoring plugin. Assuming a pod is coming in with requested resources denoted as r, the algorithm goes through all the nodes, and for each node it gets the sliding-window average and standard deviation of resource utilization. It uses the average utilization plus the arriving pod's request as the prediction of the future average utilization on that node, and it takes the standard deviation observed in the past time window, assuming it will be similar in the future window. Adding them together, just as I showed in the mu-sigma chart, gives our risk. We want to balance this risk, and of course we bound it to the range 0 to 1 so we can scale it up to a priority score: the lower the risk, the higher the score and the more we prefer that node, so the node score is one minus the risk, multiplied by the maximum priority score, and finally the scheduler just chooses whichever node has the highest score.
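Written out, the scoring just described amounts to the following. This is a transcription of the formula as explained in the talk; the explicit normalization by node capacity, and taking CPU and memory as the two resources, are assumptions made for concreteness (the bottleneck-resource minimum is described later, in the demo).

```latex
% Predicted average utilization after placing the pod, with node average
% utilization mu, pod request r, node capacity C (per resource):
\hat{\mu} = \mu + \frac{r}{C}

% Risk combines the predicted average with the observed variation sigma,
% bounded to [0, 1]:
\mathrm{risk} = \min\!\left(1,\; \hat{\mu} + \sigma\right)

% The node score scales inversely with risk (MaxPriority = 100 in the
% scheduler framework), and the final score takes the per-resource
% minimum, i.e. the bottleneck resource:
\mathrm{score} = \left\lfloor (1 - \mathrm{risk}) \times \mathrm{MaxPriority} \right\rfloor,
\qquad
\mathrm{score}_{\mathrm{node}} = \min\left(\mathrm{score}_{\mathrm{cpu}},\; \mathrm{score}_{\mathrm{mem}}\right)
```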
Next I will show two simple demos of how our two plugins can be deployed and how they behave in a real cluster. Here we are in a cluster of three nodes, and we are going to first deploy the target load packing plugin. Let's take a look at the details of what you want to create: a service account to use for this scheduler; all the RBAC rules needed by the scheduler plugins, binding these roles to your service account, including the kube-scheduler cluster role; and then a ConfigMap to wrap all the scheduler configuration. Within that ConfigMap you want to disable the conflicting plugins and enable our target load packing plugin. The important parameters to configure for target load packing are the utilization percentage you want and the endpoint of your metrics provider. Then we mount it into the deployment of the scheduler, where we just use the upstream scheduler-plugins image. So we go ahead creating the namespace trimaran, and in the trimaran namespace we deploy the scheduler, and we can see it's running now.

I recorded the whole demo because it runs for about an hour, so we can speed up a little when deploying the workload. Now we log into the scheduler pod to see all the detailed messages, and then we take a look at the workload we are going to deploy in the cluster. This is a simple hamster application: it uses around 400 millicores but requests only 200 millicores, and we are going to see target load packing pack nodes up to the predefined utilization percentage. Here we prepared a simple script to generate the workload (I'll show a sketch of what such a generator looks like after this demo): it creates a pod every once in a while, we can define how many pods in total we want to create in the cluster, and all it does is replace the ID in the pod template. In total we are going to create this pod 36 times, and here I speed it up a little because it runs for longer. In the lower bottom window you can see the target load packing plugin logging messages about the pending pods, and on the right-hand side the first panel is the actual pod usage versus the requests, the second is the utilization on each node, and the third is the number of testing pods scheduled on each node. What you are seeing is that it goes ahead and packs on one node until reaching the target utilization percentage, and then it starts on the other node. So similarly, it keeps packing on one node and then spreading; when all of the nodes reach the target utilization, the scheduler tries to spread the pods across nodes to balance the risk.
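For reference, the workload generator just described can be sketched in a few lines. This version shells out to kubectl and assumes a pod template file with a POD_ID placeholder; the file name, placeholder, and interval are illustrative stand-ins for the demo's actual script.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

func main() {
	const totalPods = 36              // the demo creates the pod 36 times
	const interval = 30 * time.Second // assumed pause between pod creations

	// hamster-template.yaml is an assumed file holding the demo's pod spec
	// with a POD_ID placeholder in the pod name.
	tmpl, err := os.ReadFile("hamster-template.yaml")
	if err != nil {
		panic(err)
	}

	for i := 1; i <= totalPods; i++ {
		// Replace the ID in the pod template, as the demo script does.
		manifest := strings.ReplaceAll(string(tmpl), "POD_ID", fmt.Sprint(i))

		cmd := exec.Command("kubectl", "apply", "-f", "-")
		cmd.Stdin = strings.NewReader(manifest)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			panic(err)
		}
		time.Sleep(interval)
	}
}
```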
Okay, next, let's take a look at the load variation risk balancing plugin. Similarly, we create the service account, RBAC rules, and deployment; it's all in our documentation, which you can find in the Trimaran plugins repo. I'll explain a little what is on the right-hand side: we have the testing pods' requests versus their usage, and also the CPU utilization on all nodes. Sorry, that was replaying the previous demo; okay, here we have the behavior of the load variation risk balancing plugin in the cluster. We have three nodes, shown on the left with their allocatable resources and the variation of the usage on each node, and in the middle we show the statistics of the average and standard deviation. Then we go ahead and create a testing workload; it is just a simple stress workload with some variations. Now we are creating a testing namespace for the workload and creating the pods, and on the right-hand side we have all the logs from the load variation risk balancing plugin. You can see the details: it computes the CPU and memory average utilization and standard deviation, adds them together to compute the risk, and then based on the risk it comes up with the score. It also combines the resources: if you have a higher score on CPU but a lower score on memory, it will always choose the memory score, the bottleneck resource score. Now you can see there are bigger variations on the first node and the third node, and that's why, even though their average utilizations are similar, the pod eventually gets scheduled to the middle one.

Okay, so, just a summary of some challenges. If we have a bursty arrival of pods, during that time we may not have fresh metrics: we collect the metrics, say, every five minutes, but when you have a hundred pods arriving in one minute, they are not yet reflected in the node metrics. What we do in that case, where the metrics would be missing, is use the previous utilization window together with the arrivals of pods to predict what will happen in the next time window. For those pods that have limits, we always count the limit: so if ten pods arrive and each one asks for one CPU, we add up to ten CPUs. For those pods with only request values, we can configure a multiplier to leave some safe margin for the inaccurate prediction, and for best-effort pods you can also configure what the stretch of the pod usage might be. And in the case that one metric provider is not accessible, you can always fall back to the metrics server solution. In the future we also plan to use multiple metric sources and do cross-validation in the load watcher, so if one metric provider fails we can switch to another.
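A fallback chain like the one just mentioned could look like the following sketch, reusing the hypothetical FetchMetrics client from earlier; the provider ordering and the idea of listing the metrics server last are assumptions for illustration.

```go
package watcherclient

import "fmt"

// fetchWithFallback tries each metrics source in order and returns the
// first successful snapshot, e.g. Prometheus first, then the Kubernetes
// metrics server as the fallback of last resort.
func fetchWithFallback(addresses []string) (*WatcherResponse, error) {
	var lastErr error
	for _, addr := range addresses {
		resp, err := FetchMetrics(addr)
		if err == nil {
			return resp, nil
		}
		lastErr = err // remember the failure and try the next provider
	}
	return nil, fmt.Errorf("all metric providers failed: %w", lastErr)
}
```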
From the development and testing of these plugins we learned several lessons along the way. First, we should be very careful about how frequently we retrieve metric data, so that the load on the metrics provider is acceptable and we are not overwhelming, say, Prometheus; but the interval should not be too large either, or we end up with outdated metric data. This is exactly why we introduced the load watcher library: we should only fetch data from providers periodically and cache it in the scheduler plugin, so it's ready to be used whenever a pod arrives. Second, we should always have a fallback solution, especially for plugins that use a lot of metric data, because a metrics provider can become inaccessible. Third, it's important not to use plugins with conflicting objectives together. For example, target load packing targets the average utilization, while load variation risk balancing targets both the average utilization and the standard deviation; they have conflicting objectives, so you should not use them together.

But even with the Trimaran plugins, we can never predict the load variation or load usage one hundred percent accurately, so you can still end up with an over-utilized node if the pod usage climbs up. Here is some future work we want to do. First, we want to integrate the load watcher with, for example, the descheduler, to re-schedule pods based on their actual usage and address the reported issue I show here. Second, we plan to include additional resources such as I/O and network bandwidth; here we made a new KEP proposal in the community, the network-aware scheduling proposal, which considers not only bandwidth allocation but also the latency requirements you may have in either a distributed cloud or a data center cloud. And in the future we also want to try some machine learning models to get a more accurate prediction of utilization for load-aware schedulers.

Here are some references if you want more details. Lastly, we would like to express our appreciation to those who could not attend the conference in person, including Tan and Rena from PayPal and Asser from IBM Research; they are all contributors to this project, and we really appreciate the help from all members of the SIG Scheduling team. And with that, we're happy to take questions.