Good afternoon. Welcome to this session on sustainable scaling of Kubernetes workloads with in-place pod resize and predictive AI. I am Vinay. I work for eBay Cloud, where I'm helping build eBPF-powered Kubernetes networking at global scale, for large clusters with a lot of pods. Haoran, please introduce yourself. Hi everyone, I'm Haoran. I'm currently a PhD student at UIUC, working on microservices and cloud resource management with machine learning. Thank you, Haoran.

So, the agenda for today: we'll start by describing the overprovisioning problem and look at the environmental impact and the dollar cost of overprovisioning. We'll touch upon our roadmap, outline why this is important for eBay, and quickly recap the in-place pod resize feature. We'll also look at the cluster autoscaler use case and see how in-place pod resize can help improve it. We'll then switch gears, and Haoran will walk us through the current state of AI in the cloud. We'll take a look at the role that reinforcement learning can play in autoscaling, and see how AI can help us go from being reactive to being proactive with autoscaling. To conclude, we'll look at how RL training is done and review evaluation results.

So, accurately estimating what resources your pod needs is a very hard problem. Those of you who have containerized your apps in Kubernetes may have come to the resources section of the YAML and wondered: gee, how much CPU do I need? How much memory do I need for my pod? How much storage? This is a challenging problem, for various reasons. For example, Java apps use more CPU at startup time and consume only a fraction of that initial CPU during normal runtime. So if you use a fixed limit for a guaranteed QoS class pod, it means choosing between a slow startup and overprovisioning. You could get the provisioning right, but still be subject to load shocks due to external factors, such as other pods going down or the load balancer misbehaving. Your code may take slower paths more often due to the varying nature of requests, and the services that you depend on may experience outages, causing your service to get backed up with requests. And even if you do everything right, if you get the provisioning right, profile your code with real traffic, use the VPA recommender, and perfectly tune the load balancer and HPA, well, as of today, Kubernetes does not allow you to mutate your pod resources out of the box. So when it comes to dealing with change, there's not a lot that you can control.

And overprovisioning comes with a cost. First, there is the environmental impact. Steven Gonzalez Monserrate did a case study two years ago, when the cryptocurrency mania was in full swing. He estimated that a single data center consumed the power equivalent of 50,000 homes. That crypto energy hunger is now being replaced by AI energy needs, and AI workloads tend to be compute intensive and network intensive. Data centers also have significant cooling needs; they require a lot of water. And then there is the CO2 emissions impact, and the noise and electronic waste that come with it. So why should we care? It's simple: because there is no Planet B.

And then, of course, there is a dollar cost to overprovisioning. Jay Chapel, in a blog, estimated that 26.6 billion dollars of cloud spending was wasted in 2021. He found that 40% of IaaS instances were over-allocated, which tallied up to 8.7 billion dollars in overspend.
In another report, a vendor named Cast AI estimated that 37% of compute resources allocated were never used. And last year, a company named StormForge did a survey where respondents said that 47% of cloud waste came from overprovisioning; a majority of the respondents also felt that Kubernetes complexity was a contributing factor. The theme is that about half the resources are wasted. In a nutshell, this is an expensive problem: it affects the bottom line for companies, and in the end it impacts consumer wallets. But it also means that we have an opportunity here to do better.

Sustainability has been an important business goal at eBay. When your business is about finding a new purpose for once-loved but abandoned items, reuse and sustainability come naturally. Specifically at the data center level, the goal is exclusive renewable energy use. HPA with predictive AI to better estimate replication needs is an ongoing effort at this time, and next year there are plans to take up VPA to right-size pods and containers. This is where in-place pod resize is an important piece of the puzzle, as it avoids workload disruption due to vertical scaling and the overhead of scheduling new pods and starting them up. Eventually, we want to get to multi-dimensional pod autoscaling.

Now, on to in-place pod resize. Up until earlier this year, you could not edit the resources given to your pod; you had to restart your workload if you wanted to change its resources. And then, finally, we merged the pull request in Kubernetes 1.27 that enables us to resize pods without disruption. I have a blog that you can visit to learn more about this feature and how to use it. I presented a detailed design for it in a talk last year. The field names in the API have changed slightly since, but the core design that was presented remains the same. And if you're interested in the gory details, there is a link to the KEP. KEP stands for Kubernetes Enhancement Proposal; that's our design document.

One application of in-place pod resize with sustainability benefits came from a recent blog post by Peter Mankowski. He noticed that Java apps needed a lot more CPU during startup than when doing regular work, as I mentioned earlier. So if you have a guaranteed QoS class pod with hard CPU limits optimized for the runtime need, it results in long startup times, and the alternative is overprovisioning. In his use case, he resizes the CPU limits down after the app's startup phase is complete. This means a job can start and finish faster, and we can power down unneeded nodes sooner; in other words, it helps you become energy conscious and cost conscious.

So, cluster autoscaler mainly does two things for us: number one, it scales up clusters when pods are pending due to insufficient resources, and number two, it scales down clusters by removing underutilized nodes. Consider this scenario. You have a Kubernetes cluster with a couple of nodes, and they are at capacity running pods. A couple of new pods show up. They are pending due to lack of resources, and they will remain pending until some pods finish and free up some room. Cluster autoscaler sees this and calls into the cloud provider API to allocate a new node. Then the scheduler can assign these pending pods to that new node.
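To put rough numbers on that scenario, here is a minimal Python sketch of the request-based check; the figures mirror the demo coming up, and this is purely an illustration of the logic, not the actual cluster autoscaler code (which is written in Go):

```python
# Illustrative sketch: cluster autoscaler triggers scale-up from requests alone.
# Numbers mirror the demo below; actual CPU *usage* never enters the decision.

NODE_ALLOCATABLE_CPU = 2.0          # node1: 2 CPUs allocatable
running_pod_requests = [1.0, 0.5]   # pod1 requests 1 CPU, pod2 requests 500m
pending_pod_request = 1.0           # the new pod asks for 1 CPU

free_cpu = NODE_ALLOCATABLE_CPU - sum(running_pod_requests)  # 0.5 CPU free
if pending_pod_request > free_cpu:
    # Requests don't fit, so a new node is requested from the cloud provider,
    # even if the running pods are sitting idle at ~0% utilization.
    print("scale up: requesting a new node")
```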
The issue with this is that cluster autoscaler only considers the pod resource requests. It does not take into account the resource utilization of the running pods before allocating new nodes. This means your cluster could be very underutilized, yet you end up bringing new nodes online. In other words, your carbon footprint goes up and you waste money.

Let's have a quick look at how cluster autoscaler works today. This is using the kubeadm cloud provider, and I'm gonna play a prerecorded demo for this part. I apologize if people in the back find it hard to see, but there is an uploaded video for this as well.

This is a demo of cluster autoscaling with kubeadm as the cloud provider. In this demo, we have a simple cluster with two nodes, a master and a node called node1, and we have two pods running on node1. Taking a closer look at node1, we see that pod1 has requested one CPU and pod2 has requested 500 milli-CPUs. We also note that this node has two CPUs allocatable and a capacity of two CPUs. These pods are idle, though. And if we want to schedule one more pod that requests one more CPU, it's not gonna be able to schedule, because there's not enough room in the cluster, even though the cluster is underutilized. Let's start the pod. When we create the pod, we see that one more pod is up in the API, and it's pending. It will remain pending until there's more room in the cluster. Let's take a look at the reason: we do a kubectl describe pod on this pod, and in the events we see it failed scheduling, and the reason is insufficient CPU.

In order to get it the CPU it needs, we will add a new node to the cluster using the kubeadm cloud provider. With this YAML file, we can deploy the cloud provider, and it's gonna listen for requests to scale up the cluster on the localhost address at port 8086. So now that we have deployed it, the kubeadm cloud provider is up and running, and if we look at the logs, we see that it's listening on the localhost address at port 8086. Next, we start the cluster autoscaler. We tell the cluster autoscaler how to reach the cloud provider by providing this config file, which just points to the address of the cloud provider that we just started. So now we start the cluster autoscaler. It's going to connect to the API, see that there is a pod pending, and request a scale-up of the cluster from the kubeadm cloud provider. There is the request for scale-up, and the kubeadm cloud provider has added a new node. This node is up and running, that will be reflected in the API, and the scheduler will see that a new node is up and running shortly. When it does, it will look to schedule this pending pod. And there we go: the pending pod is now up and running. This concludes the demo of kubeadm as the cloud provider for cluster autoscaling.

Okay, I'm glad that didn't crash. So what you saw is just the vanilla cluster autoscaler. We allocated a new node and then scheduled this pod on that new node, node2, even though node1 is underutilized. That's what motivated this talk, and it sets the basis for the next demo we're gonna show in a little bit.

Now let's look at what we can do differently with in-place pod resize. What we have now is the ability to quickly resize the pod without disruption. So that means pod disruption budgets are not an issue.
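As a quick illustration of what a resize looks like from a client's point of view, here is a minimal sketch using the Kubernetes Python client; it assumes a cluster at 1.27+ with the InPlacePodVerticalScaling feature gate enabled, and the pod and container names are hypothetical:

```python
# Minimal sketch: in-place resize of a running pod's CPU, no restart.
# Assumes Kubernetes 1.27+ with the InPlacePodVerticalScaling feature gate on;
# the pod name ("pod1") and container name ("app") are hypothetical.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Strategic-merge patch: shrink the container's CPU request/limit to 200m.
patch = {
    "spec": {
        "containers": [{
            "name": "app",
            "resources": {
                "requests": {"cpu": "200m"},
                "limits": {"cpu": "200m"},
            },
        }]
    }
}
v1.patch_namespaced_pod(name="pod1", namespace="default", body=patch)
```

The kubelet then reconciles the new allocation, which you can observe in the container status's allocatedResources field, as we'll see in the next demo.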
We can make a small tweak to the cluster autoscaler logic: instead of immediately requesting a new node from the cloud provider, we check to see if current pods can be resized down. If we can do that and create some more room, then we end up scheduling the pod without firing up new nodes; we save some money, and we become more sustainable. What does this look like in terms of the code change? It's still a very simple tweak. We arm the cluster autoscaler with what I call a pod smusher. The difference is that we check for pods that are underutilized and shrink them before we scale up the cluster.

Now, what does this look like? For this, we'll switch to a live demo, because I'm feeling adventurous. Let's see how that goes. We have the same setup as before: two nodes, the master and node1, and a few pods running. There is the kubeadm cloud provider pod standing by to receive requests to scale up the cluster if need be. And there are pods pod1 and pod2, which are underutilized; they're tailing the null device, so they're really doing nothing. If we look at the node, we see that pod1 has requested one CPU and pod2 has requested 500 milli-CPUs, on a node that has an allocatable of two CPUs and a capacity of two CPUs. These pods are great candidates to be resized down, and that's what we will do in this case. You can see here that pod1 has a request of one CPU and an allocation of one CPU by the kubelet, when we describe the pod and look at its container status allocatedResources, a new field that was added with in-place pod resize.

Next, we want to schedule one more pod as before, which requests one more CPU; it won't be able to schedule, just as before. So let's do that. I'm gonna create this pod, and it's showing up in the API. It's pending as before, and to see why, I'm gonna do a kubectl describe pod. In the events, we see that it failed scheduling, and the reason is insufficient CPU. This time around, though, we are gonna do something slightly different. Instead of running our vanilla cluster autoscaler, we are gonna run this pod smusher cluster autoscaler, which is hard-coded to resize pod1 down to 200 milli-CPUs. So let's hit it. It's gonna connect to the API, and there we go: it has detected that a pod is pending, but before scaling up the cluster, it checked whether it could smush pod1 down to 200 milli-CPUs. We'll look at this again: we are now at 200 milli-CPUs for pod1, and we see that one more pod has been scheduled, but this time it's scheduled on node1, because we created room by shrinking the pod, and we did not need to allocate node2. Thus, we save some money. So that's the demo for the second case, with in-place pod resize.

Now, does this mean we are done? Well, not quite. There are many ways this could go wrong. Let's look at a couple of ways things can play out that you do not expect. You just took away memory from a pod that's known to get OOM-killed during spikes, and a spike is about to occur. Or that idle CPU you repurposed degrades the shopping experience for your site's users later, when a batch job starts. We can speculate all day about the different ways in which things could go wrong, but the real question to ask is: can we as humans come up with a smarter set of heuristics? Sure we could, if we had infinite time. But is there a better way? Well, it turns out there is.
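Before we get there, for reference, here is a rough Python sketch of the kind of hand-written heuristic the pod smusher demo hard-codes; the real cluster autoscaler is written in Go, and the headroom margin and helper names here are my own illustrative assumptions:

```python
# Rough sketch of the "pod smusher" tweak to the scale-up path.
# The real cluster autoscaler is Go; thresholds and names here are illustrative.

HEADROOM = 1.2  # assumed safety margin: keep 20% above observed usage

def try_smush(pending_cpu, node_allocatable, pods):
    """pods: list of (name, cpu_request, cpu_usage). Returns a resize plan or None."""
    free = node_allocatable - sum(req for _, req, _ in pods)
    plan = []
    for name, req, usage in pods:
        target = max(usage * HEADROOM, 0.1)   # never shrink below 100m
        if target < req:
            plan.append((name, target))
            free += req - target
        if free >= pending_cpu:
            return plan  # enough room freed; resize instead of adding a node
    return None  # fall back to asking the cloud provider for a new node

plan = try_smush(1.0, 2.0, [("pod1", 1.0, 0.05), ("pod2", 0.5, 0.02)])
print(plan if plan else "scale up the cluster")
```

A static rule like this is exactly the sort of heuristic that can misfire in the ways just described, which is what motivates handing the decision to a learned policy.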
Making recommendations based on a large set of parameters in a reasonable amount of time is a task best suited for AI. So let's hear from Haoran how AI can help. Take it away, Haoran.

Thank you, Vinay. So, let's step back a little bit and look at cloud platforms in general. You may find that resource management is actually everywhere in cloud platforms. Workload autoscaling is one example; others include job scheduling, VM or container placement, congestion control, et cetera. Such problems have been around for a long time, both in theory and in practice, and yet they remain significantly challenging. Currently, most solutions rely on human-engineered heuristics. On the other hand, we have learning-based approaches such as reinforcement learning, which allow us to use deep neural networks to capture complex dynamics from raw and noisy signals, and to express the decision-making policies. Learning-based approaches are viable because we have abundant data generated in modern cloud platforms; examples include monitoring data, system metrics, and application performance metrics, all available thanks to the improvement of observability tools.

So let's look at what people do today for those system management tasks. There are two main categories: human-driven engineering, and reinforcement-learning-based approaches representing the learning-based solutions. At a higher level, they actually share similarities, so here I make a side-by-side comparison. On the left-hand side, we have human-driven engineering. People usually start with a simple system model, based on, for example, queueing theory, and they need to manually produce some heuristics or parameters to make it work. And of course, we need to test and tune those parameters with extensive profiling, until relatively good heuristics are found and tuned for the average case. And whenever there's any change to the applications or the cloud platforms, we need to redo these steps again and again.

On the other hand, there is the opportunity to use learning-based solutions such as reinforcement learning, where an artificial agent is created to interact with the environment, and the system management task is usually formulated as a Markov decision process, essentially a sequential decision-making process. The agent starts from a random policy. This policy maps from states to actions and is usually parameterized by neural networks. At each time step, it observes the state, takes an action, and then gets a reward indicating how good the action was given the current context, and the policy is optimized based on that reward. As you can see, this is also a loop, and it continues until convergence, when there is no more improvement from updating the model parameters.

There are two main reasons why learning-based approaches such as RL are suitable for cloud system management. The first is that they provide a systematic framework for automatic retraining, reducing repeated human-driven profiling and tuning. The second is that they reduce costly optimization or search to constant-time inference, which makes them scalable to the large state-action spaces of dynamic cloud environments with heterogeneous applications.

As a primer, RL is an approach that falls in between supervised learning and unsupervised learning: it doesn't require any labeled data, but it needs a reward. An agent interacts with the environment in a step-by-step manner. At each time step t, it gets a state s_t, takes an action a_t, and then receives a reward r_t. As I mentioned, the reward serves as the feedback, like a loss function, which directs the RL policy or model training. And the goal of agent training is to maximize the expected cumulative reward over any T-step trajectory.
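Written as code, the agent-environment loop looks roughly like this; it is a generic, framework-agnostic sketch, where the env and policy objects stand in for the autoscaling environment and the neural-network policy, and all names are illustrative:

```python
# Generic RL training-loop sketch (framework-agnostic; all names illustrative).
# `env` stands in for the autoscaling environment (cluster plus workload),
# `policy` for the neural-network policy mapping states to actions.

def train(env, policy, episodes=100, horizon=50):
    for _ in range(episodes):
        state = env.reset()
        trajectory = []
        for _ in range(horizon):
            action = policy.act(state)        # a_t ~ pi(. | s_t)
            state, reward = env.step(action)  # environment returns s_{t+1}, r_t
            trajectory.append((state, action, reward))
        policy.update(trajectory)  # adjust parameters to raise cumulative reward
```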
So let's look at how we formulate the workload autoscaling task as an RL problem. In a Kubernetes cluster, application workloads are usually deployed as pods of containers, and need to continuously meet application SLOs while achieving high resource utilization. The RL-based autoscaler is responsible for autoscaling in the vertical dimension, such as resizing the container's CPU and memory limits, and in the horizontal dimension, adjusting the number of containers. With RL, we get rid of the human-driven application profiling and parameter tuning of heuristic-based approaches. For example, in threshold-based autoscaling, the optimal threshold for CPU utilization that avoids violating the application's performance SLO actually varies across different applications and platforms. RL automates policy learning with a systematic, dynamic feedback-control loop.

To support RL training and inference in Kubernetes, we have a multi-dimensional pod autoscaler, or MPA. The design of MPA follows a similar style to VPA in that it separates the scaling recommendation from the actuation. By doing so, it supports customized, plug-and-play multi-dimensional autoscalers, such as RL-based ones. First, we have the application deployments and metrics servers in the Kubernetes cluster. The MPA recommenders get the measurements from the metrics servers and get the scaling recommendations, from either the RL agent in our case or the traditional VPA or HPA controllers. The recommender then sets the scaling configuration in the MPA API, and the updater executes the horizontal or vertical scaling configuration updates.

Here, we took the open-source implementation of FIRM, which was published at OSDI 2020, as the RL-based autoscaler. It is a reactive approach, meaning that the RL agent decides how to autoscale in reaction to the perceived states, such as the current utilization or the current application performance metrics. But we can make it a proactive autoscaling approach by prepending a predictor, following what DeepScaling does, which was published at SoCC last year. Instead of taking the current measurements as the time-series input, the predictor forecasts the next time window's time series of utilization and load, and passes that to the RL agent to make the decision.

The RL agent is really just trained to make resource provisioning decisions directly from experience, and it is optimized for the end-to-end objectives. What does that mean? As I mentioned, the reward function in RL acts as the loss function that points the RL agent in the right direction. In our case, the reward function consists of two parts. The first part is meant to mitigate SLO violations fast. Here SM_t is the SLO maintenance ratio at time t, defined as the SLO latency divided by the current latency; a lower value means worse performance degradation, and we penalize it with a lower reward. The second part is meant to avoid over-provisioning: it is defined as the resource usage at time t divided by the assigned resource limit, or allocation; a higher value means higher resource-utilization efficiency and less over-provisioning.
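Putting the two parts together, one natural way to write the reward at time t is as a weighted sum; note that the weight alpha and the exact combination are my notation for illustration, as the talk only states that the reward consists of these two parts:

```latex
% Reward at time t, combining the two parts described above.
% SM_t = L_SLO / L_t  (SLO maintenance: SLO latency over current latency)
% U_t / A_t           (resource usage over assigned limit/allocation)
% The weight \alpha is assumed notation, not stated in the talk.
r_t \;=\; \alpha \cdot \mathrm{SM}_t \;+\; (1 - \alpha) \cdot \frac{U_t}{A_t},
\qquad \mathrm{SM}_t = \frac{L_{\mathrm{SLO}}}{L_t}
```

Both terms are higher when things are going well: the first drops below 1 when the current latency exceeds the SLO, penalizing violations, and the second approaches 1 as the allocation tracks actual usage, penalizing over-provisioning.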
This aligns with our objectives in an end-to-end manner.

We did evaluations on microservices deployed on Kubernetes. Overall, we found that the RL-based autoscaler reduces the SLO violation mitigation time by up to nine times, compared with the baseline Kubernetes autoscalers. Breaking it down, it reduces the average tail latency by up to 11 times while reducing the overall requested CPU limit by up to 62%; in the meantime, it reduces the number of dropped or timed-out requests in the microservice applications by up to eight times.

So, to summarize our talk today: we saw that cloud computing comes with a significant environmental impact and dollar cost. Thankfully, the in-place pod resize feature helps us drive towards the goal of multi-dimensional pod autoscaling. We also looked at how reinforcement learning, which has shown a very promising role in Kubernetes autoscaling, helps us further improve efficiency: it saves us from laborious work, and it helps us go from being reactive to being proactive. Our next steps are to drive the in-place pod resize feature to GA, and to realize cost savings and reduce the carbon footprint via holistic autoscaling. And we could use the community's help with this.

Here is a list of references whose work helped us put together this talk, including several papers mentioned in the talk. And we would like to hear back from you so that we can learn what we could do better: please scan this QR code, and it will take you to a place where you can leave us feedback. With that, we will conclude this talk and open the session for Q&A. Thank you. If you have any questions, please come up to the mic over there and we can answer them.

Really neat stuff, appreciate it. Could you back up one slide so we can get the references? I have uploaded the slides, so you can get them from there too.

Hi, this is Abhishek from IBM Research. I have two questions. You covered a little bit about the AI workload use case. The first question is: GPUs are currently expressed in Kubernetes as fixed integer quantities. How would scale-down work in that respect? And the second question is again regarding the AI workload use case: most of these workloads have gang-scheduling semantics, so will this technology help in resizing gangs of pods that are related to a single application?

Let me take the first question. The GPU workloads, yes, they currently show up as extended resources, and the spec that we have today for in-place pod resize only covers CPU and memory. It does not cover extended resources today. That was mainly a decision to keep the scope in check; it's a large project as it is. A future KEP that can scale GPUs up and down is welcome. Whether that could benefit from in-place pod resize depends on whether you want GPUs in units of one or something smaller than that. But I can see extended resources potentially being one of the things that comes next: you want to scale the number of GPUs that you have for your pod up and down. Yeah, that could be there. And regarding the second question, let me think. The scheduling approach that we have today with in-place pod resize doesn't really assist; the scheduler just observes.
And this has come up: the scheduler could potentially become something that assists. What it does today is steer new pods that are coming in away from nodes that have a pending resize, so that the resize gets a little bit of priority. As far as gang scheduling goes, whether we're gonna take that into the scope of in-place pod resize, I don't see that happening anytime soon. But if you have a strong proposal, a KEP with the requirements in mind, that's totally welcome; the community would love to see that, and we want to get community review. These are fairly big asks, because they cut across multiple components: you need to change the API, the scheduler, the kubelet, all critical components of Kubernetes. So they will go through thorough reviews. Thanks, Vinay.

Thanks for the talk, really interesting. I was curious: when you're doing your reinforcement learning, are you training against a live cluster, or do you have some sort of simulation environment that you do the training in? What does that look like?

Yeah, thanks for the question. So the question is: during the reinforcement learning agent training, are we using a live cluster or a simulator? Actually, in these experiments, we are using a live cluster. We first deploy the microservice benchmarks on the cluster, and then deploy a workload generator to drive the microservices serving requests. The reinforcement learning agent training then happens by interacting with the MPA: setting scaling recommendations and receiving the feedback. Yes. Cool, thanks. Thank you.

Hello, thank you for the session. I'm Feben, and I'm from the healthcare domain. With node autoscaling, will nodeSelectors, taints, and tolerations be considered while the autoscaling is performed?

Let me take the first part of the question; I think there are two parts to the answer. The first one is vertical scaling, in-place vertical scaling. That happens after scheduling, so taints and tolerations don't really come into play there. With regards to multi-dimensional pod autoscaling, the author of that KEP is standing right next to me, and I'll let him answer that question.

So, could you repeat the question? Okay, so while you do the node autoscaling, yes: will the nodeSelector, pod affinity rules, and taints and tolerations be taken into consideration? Mainly when you scale down. Scaling down. When you remove a node from the cluster. Yeah, I'm trying to understand. So you're asking about when we are scaling down the resource limit? Let me put it this way: an application has only two pods, and you are scaling the node down. Yes. Say you only have two nodes with a certain nodeSelector, and the pods can run only on those two nodes. Oh, okay, okay. Yeah. I see, so you're asking: when there's a constraint that the pods can only run on those two nodes, but there's no capacity on those two nodes, what do you do? No, the capacity is there on the two nodes. Actually, one of the nodes is underutilized, so you don't need that node in the cluster, and you decide to scale it down and remove the node from the cluster. Oh, so let me take that question. The current cluster autoscaler, the way it works, I think it takes into consideration whether a pod can be evicted. If a pod cannot be evicted, we won't get a scale-down.
But frankly, I think the current cluster autoscaler is something the autoscaling community is looking to replace with Karpenter, so let's see if that gives us more features. Thank you. Correct, yeah. No, it's a node. But you asked about the node as well, right? I think the pod disruption budget will stop the scale-down, as mentioned before. I wanted to know all the things that are taken into consideration. No, yeah, the resize itself just changes the sizing. Thank you.

Thank you for your tremendous work. As someone operating solely on-prem, I've been looking at this proposal for three years, and every time it hasn't met the release milestone, it was like Christmas being taken away from me. So my question is this: you are performing reinforcement learning on actual, good data, but in practice we operate in environments with microservices, where the reward function and its metrics could be heavily impacted not only by the performance of the service itself, but by the performance of dependencies like databases, or of other services that this one depends on, upstream and downstream. So what are your ideas on how we can incorporate filtering of the latency or other reward targets, to smooth out impacts that come not from the decisions made by the reinforcement-learning autoscaler itself, but from the surrounding environment? Is that clear?

So let me repeat the question. You're saying that in microservices there can always be third-party services like databases, and asking how we can isolate the root cause of SLO violations? Yeah; how can we, at the end of the day, stop the autoscaler from triggering excessive upscaling or downscaling that is not actually correlated with the decisions made previously, given that there is noise in this kind of environment? Yeah, that's a good question. Actually, if you check out our paper there, number eight on the list, we treat the microservices as a graph. We first use tracing tools like Jaeger to nail down the root-cause candidates for those SLO violations. So if it's not microservice A, then we focus only on the root cause of the SLO violations, and our agent focuses on that particular microservice component, not on all the microservices at the same time. By doing that, we are not excessively scaling up peripheral microservices; we only target the root-cause candidates. I have extra questions, but I think I first need to check the paper. Yeah, feel free to shoot me questions. Thank you.

Okay, we are out of time, so I think we need to stop now. We can certainly hang around outside and take more questions, if that's okay. All right, thank you guys. Thank you very much for coming.