Konnichiwa, we came from Japan, and today we are going to talk about a new alpha feature called in-place resource resize. Among the attendees, have you ever heard about this feature? Okay, a good number. So over half of you will learn what it is.

Let me start with a self-introduction. My name is Kohei, and I work at Apple as a senior field engineer. I work on open source, and my focus right now is cloud native technologies. Hi, I'm Aya Ozawa. I'm working at CloudNatix, working on FinOps things like autoscalers and automating Kubernetes operations. Thanks.

So how do we set the right pod resources? Resource management is key to smoothly sharing and maintaining Kubernetes clusters. Pod resources can be configured using requests and limits. A request means the minimum resource requirement, and a limit caps resource usage. They seem like simple fields, but there is a lot of complexity under the hood. To maintain Kubernetes clusters effectively, it's important for us to understand them.

This is a rough diagram of pod creation. We'll look into the details later, so you don't need to understand everything here. But as you can see, resource requests and limits are used in several components: the scheduler checks requests, the kubelet converts requests and limits into container settings, and then the runtime applies them to the container.

Now, let's dive into what happens when we set these values. First, let's think about scheduling. The scheduler uses resource requests to check whether a pod fits on a node. As the left diagram shows, if the pod's requests fit within the available resources, the scheduler assigns the pod to the node. On the other hand, the scheduler does not consider resource limits or actual usage, even though a pod can actually use resources up to its limit. So CPU and memory can be overcommitted. As shown in the right diagram, the pod is allocated regardless of its resource limits.

Next, let's explore what happens when a running container hits a limit. As you can see from the left diagram, if a container goes over its memory limit, it's killed by the OOM killer. As for CPU, it's a time-shared resource: if a container exceeds its limit, it will be throttled until the quota period ends. Of course, it is allowed to resume in the next quota period, but this delay leads to performance degradation. Keep in mind that resources can be overcommitted. This means that if other containers on the same node have already used up the node's resources, your container can run into the same situation even before reaching its own limit.

Now, let's consider the resources of the entire node. If a node runs out of memory, the OOM killer kills the process with the highest OOM score. The kubelet adjusts this score based on the pod's quality of service, which we usually call the QoS class. As you can see from the manifest in the upper right corner, you can find a pod's QoS class in its status field. Pod QoS has three classes, and the classification is based on how pod resources are configured. BestEffort is the class with the highest OOM score; if a pod has no requests or limits on any of its containers, it falls into this class. The next, Burstable, is the class used when at least one container has a request; limits are optional. The last, Guaranteed, is the class used when all containers have both requests and limits, with requests equal to limits. In this way, the kubelet adjusts the OOM score based on the QoS class. When a node goes OOM, the OOM killer kills BestEffort pods first.
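To make the QoS classification concrete, here is a minimal pod manifest sketch; the name, image, and values are hypothetical stand-ins, not taken from the slide. Because the single container sets requests equal to limits, Kubernetes reports this pod as Guaranteed in its status:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m           # minimum the scheduler reserves
        memory: 256Mi
      limits:
        cpu: 500m           # equal to requests on every container
        memory: 256Mi       #   -> status.qosClass: Guaranteed
```

If you dropped the limits and kept only the requests, the pod would be Burstable; if you dropped both, it would be BestEffort, first in line for the OOM killer.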
Next, we move on to the relationship between resource requests and pod eviction during node pressure. Once OOM happens, the workload is affected and it makes debugging difficult, so we want to avoid OOM as much as possible. The kubelet provides a feature to make resources available through eviction. When memory usage exceeds the eviction threshold, which is one of the kubelet's settings, the kubelet evicts pods to make node resources reusable.

So how does the kubelet decide the order for evicting pods? It depends on requests, usage, and priority. Taking this diagram as an example, the kubelet selects pods in order from right to left. The rightmost pod is evicted first because its usage is higher than its request. Then, looking at the middle pods, the pod with the lower priority comes next. If the priorities are the same, precedence is based on the gap between request and usage. The leftmost pod will be evicted last.

Let's take a moment to reflect on the importance of pod resource management. What can go wrong if we set the wrong values? Consider this: if we set requests too low, it could damage service performance and availability. Overcommitting is useful to cover temporary spikes, but if pods concentrate and compete for resources, it's a disaster, leading to frequent evictions and OOM kills. Conversely, setting far more resources than the pod actually needs is also a problem; computing resources are not free, so this is a waste of money. Of course, resource settings affect the entire cluster, not just individual pods. Taking the Cluster Autoscaler as an example, it also relies on pod requests to make decisions about scaling the cluster up and down. If you do not set the right amount of resources, the entire cluster will not scale properly. Therefore, allocating the resources a pod actually needs, no more and no less, is important.

So how do we set the right values? The key is to monitor the actual resource consumption, understand the usage patterns, and then adjust the settings to match these observations. Bear in mind that setting requests is not a one-time task. Let's say a new feature is added or your service grows: resource usage may change, so we need to continuously adjust them. Furthermore, we are typically managing multiple workloads, and it takes real effort to handle all of them. So is there a better way to do this?

The Vertical Pod Autoscaler can be your hero in dealing with that. The Vertical Pod Autoscaler, which we usually call VPA, is one of the Kubernetes sub-projects. If you write a manifest like this, VPA will analyze the resource usage and then provide scaling recommendations. Supported workloads are scalable resources such as Deployments and StatefulSets. As the left manifest shows, VPA calculates the recommended values and sets them in the target field. Now, you'll notice other fields besides the target recommendation, but they are mainly for reference. First, lowerBound means that if you set a value lower than that, the container may suffer in performance and availability. Conversely, setting a value higher than upperBound means it's likely to be wasted. Target is the recommendation for resource requests; VPA sets it into the pod's request fields. In this example, the CPU will be 72 millicores and the memory will be about 230 megabytes. Now, what about limits? VPA keeps the original ratio between requests and limits. Looking at this memory example, the original ratio of requests to limits is one to one, so VPA sets the same value for the limit.
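To make that manifest concrete, here is a minimal VPA sketch; the object names are hypothetical, and the recommendation numbers are illustrative, echoing the 72-millicore / ~230-megabyte example from the slide:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa          # hypothetical name
spec:
  targetRef:                # the workload VPA analyzes
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"       # recommendations only; the modes are covered below
```

The recommendation then shows up in the VPA object's status, roughly like this:

```yaml
status:
  recommendation:
    containerRecommendations:
    - containerName: app
      lowerBound:           # below this, performance and availability may suffer
        cpu: 23m
        memory: 180Mi
      target:               # what VPA writes into the pod's requests
        cpu: 72m
        memory: 230Mi
      upperBound:           # anything above this is likely waste
        cpu: 200m
        memory: 500Mi
```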
So how does VPA estimate the recommendations? Periodically, VPA collects container metrics and adds them to histograms. Here, it uses what's called a decaying histogram with a half-life: after 24 hours, a sample's weight is halved. Why do this? By giving more weight to newer data, VPA can quickly adapt to changes in resource usage trends. Also, VPA handles metrics differently for each resource type. Let's start with CPU. Here, things are pretty straightforward: each metric sample directly gets placed into the matching bucket. For memory, on the other hand, VPA finds the peak usage within a five-minute window and adds only that peak value to the bucket. Additionally, VPA has a special consideration for memory: when an OOM happens, VPA does not just throw the sample into the bucket, but multiplies the last usage by 1.2. That 1.2 is the default, so it depends on your settings, but in this way VPA calculates recommendations considering both actual usage and OOM events.

Next, let's look at the relationship between the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler. These are two different methods of pod autoscaling. Many of you are likely familiar with HPA, as it's a standard Kubernetes API. It scales horizontally by adjusting the number of replicas. On the other hand, as we've explored already, VPA scales vertically by adjusting the pod resources themselves. One thing to note is that the current open source version of VPA cannot be used together with HPA on the same metrics. Let's consider the following scenario: when resource usage exceeds its threshold, HPA creates a new replica. At that point the new replica is still underutilized, so VPA finds that utilization is low and scales the pods down. If this kind of cycle repeats, eventually many tiny pods will be created. Therefore, please use custom metrics for HPA instead of CPU or memory until proper HPA integration is introduced.

Next, let's move on to the update modes. There are four distinct modes to apply a recommendation to pods. Let's break them down. First up, Off mode is a hands-off approach: it calculates recommendations but does not make any changes to the pods. This mode is great for those who want to make manual adjustments. Next, there's Initial: this mode applies the recommendation only at pod creation time, so it mainly kicks in when you're changing the replica count. The third, Recreate, is the most proactive approach: it not only applies recommendations to new pods but also evicts and recreates running pods. It's very good when requests do not align with the recommended values. And last, Auto mode: currently, it works the same as Recreate.

Let's dig into Recreate mode a bit more. Take a look at this graph: the blue line shows resource usage, while the orange line shows the recommended values. Recreate mode can optimize many applications well. Of course, pods will be disrupted by eviction to apply a new recommendation, but a PodDisruptionBudget, which we call a PDB, can mitigate the impact on service availability. A PDB limits how many pods can be disrupted at the same time. Therefore, the combination of Recreate mode and a PDB allows resource optimization without sacrificing service availability.
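As a minimal sketch of that PDB mitigation, with hypothetical names and labels: evictions issued by VPA's updater go through the Eviction API, which honors this budget, so at most one matching pod is disrupted at a time.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb          # hypothetical name
spec:
  maxUnavailable: 1         # allow only one voluntary disruption at a time
  selector:
    matchLabels:
      app: my-app           # must match the pods VPA recreates
```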
But here's the catch: Recreate mode doesn't work for all workloads. Consider this: some workloads fluctuate in usage only during certain periods of their lifecycle. As you can see from the right graphs, Recreate mode could not truly optimize in this case. Another example is cases where even a brief disruption has a serious impact on availability or cost, like long-running ML jobs that cannot be paused and resumed without consequences. So how do we manage these kinds of workloads? Here's the thing we want to introduce to you.

A new alpha feature called in-place resource resize for Kubernetes pods has actually been available since 1.27. It adds a field called resizePolicy to the pod spec. Until now, the only way to apply a change to the resource fields in a pod spec was to recreate the pod; that's basically how current pods behave. The new in-place mode enables a pod to modify its resource request and limit fields without recreation or restart. To enable this on current Kubernetes versions, you just need to turn on the feature gate called InPlacePodVerticalScaling. We will show this in the demo that comes after, fingers crossed.

As I mentioned, this feature gate adds the new pod field resizePolicy. The field takes two parameters: a resource name and a restart policy. This example shows that a CPU change does not trigger a container restart, which means you can resize the CPU value on this pod without a restart or recreation. RestartContainer also does not trigger pod recreation, but it does restart the container. Either way, the pod will not be recreated when those resource values are modified. So the resource values become mutable, regardless of the pod lifecycle.

This feature is useful for many kinds of workloads. I mentioned that it can be useful for ML workloads, but my personal favorite is a game lobby server, where clients expect a persistent connection while playing a multiplayer game. You don't want to get disconnected while you're playing Apex, or some other game like Minecraft even. With this alpha feature, the server can scale up and down without that disconnection.

If the resize fails because the new size cannot be scheduled, the pod's resize status becomes Infeasible. In such cases, the pod will not be modified and keeps running with its previous configuration, even while the values in the spec are updated. Let's see what happens in the demo.
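For reference, here is roughly what the demo setup looks like. The file names, pod name, and image are my own stand-ins rather than the exact ones from the demo, and the plain patch reflects the alpha behavior shown in the talk (newer releases move resizing behind a dedicated subresource):

```yaml
# kind-config.yaml: a kind cluster with the alpha feature gate enabled
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  InPlacePodVerticalScaling: true
nodes:
- role: control-plane
```

```yaml
# resize-demo.yaml: a pod with per-resource resize policies
apiVersion: v1
kind: Pod
metadata:
  name: resize-demo
spec:
  containers:
  - name: app
    image: nginx
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # CPU changes apply in place, no restart
    - resourceName: memory
      restartPolicy: RestartContainer # memory changes restart the container, not the pod
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```

```sh
kind create cluster --config kind-config.yaml
kubectl apply -f resize-demo.yaml

# Resize CPU from 1 to 2 in place
kubectl patch pod resize-demo --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"2"},"limits":{"cpu":"2"}}}]}}'
```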
Give me a moment. So I'm in the directory called KubeCon EU 2024 demo, and I have the feature gate in my kind cluster configuration. Let's create the cluster now. I hope this doesn't blow up, but yeah. Okay, my control plane is a good boy. The cluster is created. Not yet, sorry. Okay. This is the pod spec I shared: it has the resizePolicy, and resource requests and limits of one gigabyte of memory and one CPU. Then let's see. Don't worry, I have a plan B. Here we go. Oh no, sorry. So we have the pod spec, and we create the pod. Yeah, I knew this was going to happen, sorry. The container is creating here... and I see it's running now. It has an age of 20 seconds and no restarts so far, and it has the resizePolicy inside: CPU is NotRequired and memory is RestartContainer. So if you change the CPU, it doesn't restart anything; it just changes the value in the cgroup. Right now it has a resource limit of one CPU. I just checked the cgroup value inside the pod, and it shows one CPU core. Now I apply the patch with the new resource request and limit of two. Why is it taking so long? Okay, it's changed; let's see if it was recreated. Oh, before that, let's check the cgroup value. Yeah, so it doesn't restart, and the age is 296, which means it is the same pod.

Now let's try something else. Do you see it? That's 10,000 CPUs. Do you have 10,000 CPUs at home? I don't, but this is an experiment to show the Infeasible status I explained: the new CPU value cannot be scheduled. The spec now has 10k, which means it was modified, but in the container status... where is it? Yeah, the allocated resources have not changed; CPU remains two. And the status is Infeasible. So this is what happens when you ask for more resources than the node's capacity. Okay, we can go on.

I lost my mouse cursor... I lost my presenter mode. Well, okay, I can explain. This diagram shows what actually happens inside the Kubernetes components. The API server takes your pod request: when you run kubectl apply, the new manifest goes through the API server and is stored in etcd. The kubelet on each node periodically checks for pod spec modifications from the API server. Then, if a change is detected, it goes all the way through the CRI runtime and the OCI runtime, and eventually the new CPU or memory value is written into the cgroup.
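If you want to observe this end to end yourself, here is a sketch of the checks shown in the demo, assuming the hypothetical resize-demo pod from above on a cgroup v2 node:

```sh
# Same pod before and after the resize: age and restart count are unchanged
kubectl get pod resize-demo

# cgroup v2 exposes the applied CPU quota inside the container
# (e.g. "200000 100000" after resizing to 2 CPUs: quota and period in microseconds)
kubectl exec resize-demo -- cat /sys/fs/cgroup/cpu.max

# What the kubelet has actually allocated, as opposed to what the spec asks for
kubectl get pod resize-demo \
  -o jsonpath='{.status.containerStatuses[0].allocatedResources}'

# The resize status reports Infeasible when the request exceeds node capacity
kubectl get pod resize-demo -o jsonpath='{.status.resize}'
```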
To make this work, several things have been implemented behind the alpha feature gate: the API spec, the scheduler, the kubelet, and the runtime. The runtime especially was a big change to me, because this feature relies completely on cgroup v2. Testing as well. But these are changes to the Kubernetes components only, which means autoscalers need to implement support for this feature separately. So when do we get that? Actually, we really don't know. I am not a developer on this, I'm just introducing the feature, but the community is trying to make the API stable, and there are still a lot of open considerations. So we don't know when VPA support will eventually get implemented.

These are the considerations we have right now on GitHub; here are some examples. In the current alpha implementation, behavior depends on the runtime, like containerd or CRI-O, and the kubelet makes the decision about restarting containers. But according to the proposal, that decision has to live in VPA, so the decision-making logic needs to move there. Also, even though this is exciting, and I think you now understand how it can resolve many use cases and problems, this feature doesn't get much attention. More use cases and feedback are wanted from you, and from everyone watching this video on YouTube. That's one big concern as well. And when it comes to scaling down, we really need to be careful, because if you have a pod whose memory usage is slowly shrinking, you cannot just scale it down: what if a few more requests come in and memory usage spikes a little? That brings back the OOM killer. So it's not that easy.

So, these are the takeaways. As we explained, resource management is key for both the cluster and the workloads. VPA has several components, like the recommender and the updater, to make resource management autonomous. The current approach has a pod-restart problem: it works for many workloads, but there are use cases that specifically need no restarts. In-place resource resize can solve this problem, and if you combine VPA with in-place resource resize, as we saw, it can bring even more autonomous resource management. But there are still so many things to do. So if you are interested, I highly recommend you take a look at the KEP and the AEP; an AEP is an Autoscaler Enhancement Proposal. I've shared the links and the KEP and AEP numbers in the slides, so you can check them later.

That's it; that is our presentation. Thank you for listening. If you have any questions, please ping us on Slack or email or Twitter, or X, I don't know, X, yeah. Feedback means a lot to us. So please scan the QR code and press the good or bad button. I mean, hopefully good, but it's up to you. Please. Thank you.

And if you have any questions, I think we can do it now, but a lot of people are going out, so I don't know. Any questions? Okay, actually, before that: Vinay is one of the big contributors to this feature, and he's actually here, so I want him to come up on the stage and answer the questions with us. Please come up. You don't have to, but yeah. Okay, let's keep going with the questions. You can sit there, actually. Okay, come up.

Yeah, I was just wondering: if your cluster is being managed by a tool like Flux, for example, that is constantly reconciling with a Git repo, how do you solve the conflict where VPA is constantly adjusting the resources within the cluster, but Flux pulls the resources you set in the repo and will probably override them? I'm asking because I had the same problem with HPA, but with the number of replicas rather than the amount of resources.

So the question is: if you set up VPA or HPA, GitOps can bring a problem, like a scaling conflict. Is it about the replica number or the resource requests and limits?

Yeah, for example, the way I solved it with HPA was to just remove the number of replicas entirely from the deployment.

I can answer that question: don't set any replicas on the deployment, because that should be taken over by HPA.

Yeah, but I'm asking about VPA. Do you remove the resources from the deployment as well, the requests and limits, and just let VPA take over completely?

I can try to answer this one. You normally deploy Deployments with a GitOps pipeline, right? And this works on the pod level, so GitOps probably won't interfere with it; you don't directly deploy pods with Flux or the like. So it should be just okay.

For me, solved. There's another one raising a hand. You want to say something, Vinay?

No, it's okay. I was just going to say that if you have something else that's going to overwrite what VPA is going to do, you'll have to disable it. You can't have two people drive the car; it's not going to go anywhere. So that is the solution: it's a problem that needs to be solved at the GitOps level, because the true master for this is going to be VPA when it's running in Auto mode, making recommendations as it sees fit. If somebody else is overriding it, then that has to be disabled.

Yeah, so what we ultimately manage in Git is the VPA values, not each pod's resource sizes themselves.

Well, from the demo it was clear that it's possible to change CPU resources without restarting pods. But what about memory? Will it still be required to restart pods to change memory resources? Because in the demo, the restart policy was set to restart the container on memory changes, but will it be possible not to do that?

The question was memory without restarting. Yes, it really depends on the workload, I believe. For example, some ML workloads you shouldn't restart. So it is actually possible, but for something like the JVM you have to specify the heap size up front, which means you have to restart either way.
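To make that answer concrete, a container can opt into in-place memory resizes as well. A hypothetical snippet, not from the demo; this only helps when the application can actually adapt to a changed memory limit at runtime, while something with a fixed heap, like the JVM, would still want RestartContainer:

```yaml
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired   # resize CPU in place
    - resourceName: memory
      restartPolicy: NotRequired   # resize memory in place too; no container restart
```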
Thank you. Just the last question.

Yes, a bit more of a non-technical question. The situation you talked about at the beginning: I feel like a lot of companies are dealing with this problem but not really addressing it. What was the compelling event for you to start working on it and addressing it, within your organization or as a project? Most people just kind of ignore it and over-provision.

Well, resource utilization and cost management is one of the biggest motivations for me, but I can ask Aya and Vinay about that as well.

So yeah, what you said is right. You can solve this by giving too much memory and too much CPU, and that will kick the can down the road, but eventually you're going to hit the wall of: okay, how much am I paying? What's my cloud bill? And in this economy, people are starting to watch that, so it's starting to take on a little more priority. I don't know whether there's enough momentum here that companies will put more resources on this and push it all the way through so that it can be used. But yeah, right now the use cases aren't compelling enough that enough resources are going to be put on this at this point.

Hi, I'm Dixita, and I'm a contributor in the SIG Node community. I don't have a question, just an announcement. First of all, great presentation, and thanks, Vinay, for actually starting this feature and writing the KEP for it. I wanted to say that this feature is in alpha right now; we have been trying to address the issues in the alpha feature for a long time, and we want to promote it to beta soonish. We are looking for feedback, so if you have any, and if you could try this feature out, it would really help us work on the beta functionality. Also, feel free to contribute, and attend the maintainer track session for the SIG Node community if you are interested in contributing to the feature. That's all, thank you.

Sorry. So I guess the questions are over now. Yeah, okay, so thank you so much. Also, please give another round of applause to the contributors. Thank you.