So today, we're going to talk about resource efficiency in the context of allocation in a Kubernetes cluster. Hi, my name is Vincent Sevel. I'm working for a company called Lombard Odier. I'm an architect, involved in a cross-unit team called PlatformOps, which aims at delivering the platform as a product. In that context, I'm involved in different projects. I've been involved with OpenShift for quite some time now, specifically deployment on OpenShift, and lately in our deployment of Kafka on the OpenShift platform, plus the Quarkus application framework.

Lombard Odier is a private bank based in Switzerland. We've got the traditional lines of business that you'd expect, plus something that we call technology for banking: we develop our own core banking system internally, which we make available to the bank itself, obviously, but also to other, external banks. In that context, we instantiate our core banking system so that we can operate their activity from our data centers based in Switzerland. We have four traditional functional development streams, and a very modular architecture that was started 25 years ago with many technologies. Over the last 15 years we invested a lot in Java, but that's not the only technology. This architecture gives our system a lot of flexibility, but a few challenges as well.

In 2020, we started a large modernization initiative called GX, where we are looking at the functional side, but also at the technical side. And we started introducing new technologies, OpenShift being one of the first ones.

So the story starts probably a year and a half ago, when we were starting to ramp up the workload on our Kubernetes cluster, and we started seeing this kind of message a little bit too often. There were basically two reasons for this. The first was that the cluster we were using was pretty small, and the workload was starting to pile up. The other was that we did not have a good understanding of how resource allocation works inside a Kubernetes cluster. So for some time, we tried to solve that with extra capacity that we kept adding to the cluster, but that didn't get us very far. Eventually, we actually had to handle the problem.

All right, so what we're going to talk about today is optimizing the placement of pods in a cluster, so that you can tune your worker nodes adequately, size the underlying hardware appropriately, and, with all of that, avoid waste and save money, but without sacrificing the behavior. That's where the efficiency comes from: when you can respect your SLAs while squashing the resources you're using in your cluster as much as possible.

At that point, you could be wondering: OK, what's so particular about all of this? I've been doing capacity planning on traditional workloads for many, many years. I guess the big difference is that we are moving from static workloads to dynamic workloads: as opposed to treating your VMs as pets, your worker nodes are now cattle. You standardize the configuration of your worker nodes, so you have to be better at tuning the workload underneath.

All right, so at Lombard Odier, we've currently got three main clusters: one for development and integration, one for system test and UAT, and one for production. We've got another, smaller one for production, for different use cases. And we're expanding the number of clusters we're installing.
So we're working on increasing that, and we are in the process of deploying another cluster into the public cloud. Here I'm just showing the applicative pods, that is, the pods for the components that we are writing ourselves. In reality, our cluster has something like 1,300 or 1,400 pods when you take everything into account, including the third-party and all the supporting workloads. We're running OpenShift 4.8 on a virtualized VMware infrastructure for our worker nodes, and one thing about the hardware, at the hypervisor level, is that we're using a CPU overcommit ratio of 1 to 5. By the end of the GX project, we'll probably be running more than 20,000 pods across all clusters. And, by definition, we won't be able to do fine tuning on all those workloads with this amount of containers and pods running on our platform.

All right, so when we talk about resources on a container, we need to take into account two different notions. The first one is the request, which is the minimum amount of resource required for Kubernetes to schedule your container, and your pod, on a specific worker node. So we say it's a schedule-time notion, although for CPU it also maps to the CPU shares in the cgroups, so it's used at runtime as well; this discussion will focus on the schedule-time aspect. And there is the limit, which is the maximum amount of resource that your container will be able to use on your worker node. This is a runtime notion.

In terms of resource types, we have CPU, which is said to be a compressible resource. That means that in these two situations, when the usage reaches the limit (assuming one has been defined for your pod) or when the node reaches 100% of its capacity, the pod is going to be slowed down. It's not going to crash; it's going to be throttled. By opposition, memory is a non-compressible resource, so there are two situations to consider. When usage reaches the limit, your pod gets killed out-of-memory and restarts, either on the same worker node or on a different one. The other situation is when you use all the memory on your worker node: in that case, you've got a memory-pressure situation on the node, and Kubernetes will start evicting some of your pods and restarting them somewhere else to free up resources.

So in the situation we had, with the error message I was showing, this was basically a request problem on CPU: we were reaching the full capacity of our nodes. And that's what we're going to talk about.

OK, so we need to find the right balance between oversizing and undersizing the requests. Oversizing gives you high behavior predictability, because you're going to over-reserve resources for your workload, but it gives you a low density, so a low efficiency. To some extent, if you're using overcommit at the virtualization layer, you'll be able to compensate some of it, but not everything. So we'll say this is a performance-optimized approach. And you need to balance that against the opposite, undersizing the requests in your cluster, where you're going to have low behavior predictability because you will have high density. You'll be gambling the usage of your resources against the performance of your workload: if we're talking about CPU starvation, you'll get some throttling on your pods; if we're talking about memory, you may get some evictions.
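To ground those two notions, here is a minimal sketch of how requests and limits appear on a container spec; the names and values are hypothetical, not from the talk:

```yaml
# Hypothetical deployment snippet illustrating the request (schedule-time)
# and the limit (runtime). All names and values are made up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: registry.example.com/example-app:1.0
          resources:
            requests:
              cpu: 100m     # the scheduler reserves this much CPU on a node
              memory: 1Gi   # and this much memory
            limits:
              cpu: 500m     # beyond this, the container is throttled (compressible)
              memory: 1Gi   # beyond this, the container is OOM-killed (non-compressible)
```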
Coming back to undersizing: we could say that it gives a high density, so a high efficiency. But if you start having throttling or evictions, you're going to lower your efficiency, because you're sacrificing the behavior of your pods. So we'll say that's a cost-optimized approach.

So how do we sit right in the middle, where we assess precisely the requests we need to set on our workloads? Well, one idea is to look at past behavior. Let's take a small example. We've got four pods running on an ideal worker node, with resource usage ranging from 25 to 1,700, and workloads behaving very differently: some with peaks and low activity, some with a lot of activity all the time.

So the first question is: what's the optimal worker node to run this workload, the one where you've got the least amount of waste without sacrificing any of the behavior? But actually, the better question is: what do I need to reserve for each of my pods? Because you're going to configure them all independently from one another; each pod doesn't know where it's going to sit, or who its neighbors are going to be. Do you take the average consumption? Do you take the max consumption of each pod?

Another way to look at this graph is to look at the cumulative consumption over time. It's very easy to see that the optimal host would be the one sized at the peak of this graph: in that situation, 1,790. Now, if you took the average consumption of each pod and summed that, you would get a host with a capacity of 1,200. You've got a lot of density, because you're using most of the resources available on the node, but what you can see is that five times out of ten, the workload is going to try to consume more capacity than you've got in your worker node. If we're talking about CPU, you'll get some throttling; if we're talking about memory, you'll get some evictions. And we don't necessarily want that. Now, if you took the max, you've got the opposite situation, where you over-reserve resources for your workload. You've got lots of extra capacity, so you're not going to get any throttling or any eviction, but you're going to waste a lot of resources.

So we need something in the middle, which is what VPA is doing. One option is to take a percentile, which basically covers most of your activity and gets rid of the peaks of activity. In that situation, you will get a host with 2,500 of resources. It sits somewhere between the optimal and the max: you have a little bit of waste, but not too much, and more importantly, you don't sacrifice the behavior; you don't get throttling. If you look at pod 1 in particular, you can see that the P90 covers your resource usage nine times out of ten. So the question is: what happens during your peaks? Well, you're going to be financed by your neighbors, which are hopefully not doing their peaks at the same time. So there is a little bit of gambling, and you benefit from the extra capacity that is reserved by the other pods but not used. And that's basically the exact same overcommit that we were doing initially at the virtualization layer, but done at the pod level.

All right, so: the vertical pod autoscaler. This is a Kubernetes project, and it can provide recommendations on resource usage, specifically on the amount of resources to reserve for your pods. So basically, it helps you configure the requests. It can help you as well in configuring the limits.
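Before going further into VPA, here is one compact way to write down the four sizing strategies just compared, where $u_i(t)$ is the resource usage of pod $i$ over time; this formalization is a summary of the example, not notation shown in the talk:

$$
\begin{aligned}
C_{\mathrm{avg}} &= \sum_i \operatorname{mean}_t\, u_i(t) && \text{(sum of averages: 1,200, starved about half the time)}\\
C_{\mathrm{opt}} &= \max_t \sum_i u_i(t) && \text{(peak of the cumulative curve: the optimal host)}\\
C_{\mathrm{P90}} &= \sum_i P_{90}(u_i) && \text{(sum of per-pod P90s: 2,500, a little waste, peaks mutualized)}\\
C_{\mathrm{max}} &= \sum_i \max_t\, u_i(t) && \text{(sum of maxes: no throttling, lots of waste)}
\end{aligned}
$$

The ordering $C_{\mathrm{avg}} \le C_{\mathrm{opt}} \le C_{\mathrm{max}}$ always holds, while $C_{\mathrm{P90}}$ typically lands between $C_{\mathrm{opt}}$ and $C_{\mathrm{max}}$ when the pods' peaks are not correlated. VPA's target recommendation is essentially the per-pod ingredient of the $C_{\mathrm{P90}}$ strategy.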
The interesting thing is that it's going to watch your containers all the time, so it can give you up-to-date recommendations, and these recommendations can change over time. This is not a tuning that you did six months ago and keep while your workload's behavior is actually evolving. It's CRD-based, so it's completely compatible with a GitOps approach, and you can either connect it to Prometheus to gather the metrics, or it's going to watch the resource usage on your pods directly and consolidate and compile those metrics into a format which is itself stored inside a different CR, called the checkpoint. Interestingly, it can upscale or downscale the resources that have been configured, depending on the situation: maybe you need more CPU, maybe you need less. It can work on memory and CPU, and, optionally, it can apply the recommendations on your workload. It's available on the public cloud through different offerings, and at Lombard Odier, we're using it through a dedicated and supported OpenShift operator. That's what we're doing.

So let's take an example. Here's the main CR that you would create for a specific workload; in this context, it's a deployment. For the controlled resources, you can say that you want to work on CPU, or memory, or both. For the controlled values, you've got the choice of requests only, or requests and limits; you cannot choose limits only. And for the update mode, you've got four options. Off means VPA is going to provide a recommendation, but it's not going to apply it for you. Initial means the recommendation will be applied at the next startup; it could be a crash, or it could be a normal rollout. Recreate means that if the recommendation is too far off from what you've configured on your workload initially, for instance on your deployment, then VPA is going to detect that, trigger a restart of your pod, and the pod will restart with the recommendation applied. And the last mode, Auto, does the same thing as Recreate today, because the only way to change requests and limits in Kubernetes today is to do a restart.

After watching your container work for some time, VPA will calculate a recommendation that appears directly in the CR, in the status section, and there are four values. The most important one is the target, which is based on the P90, like we saw before, for all the controlled resources. You've got the lower bound, which is the P50, and you've got the upper bound, which is the P95. So in this particular example, we're running in Initial mode: our deployment was originally configured with 100 millicores and 1 gig of RAM, and at the next startup the controller will replace those resources with the calculated values.

In terms of use cases, we're using VPA at Lombard Odier for stateless workloads, and for jobs and cron jobs. But there is also an additional use case which is interesting, which is stateful workloads, where scaling horizontally might be specific to the technology that you're using. In that situation, VPA is a very good option to grab more resources.

There are a few limitations, however, when you're using VPA. It is said in the documentation that you shouldn't be using VPA to control memory on JVM-based workloads. This is essentially related to the way the JVM manages memory, and the limited visibility it provides on the heap to VPA or, I mean, to the metrics that are coming from Prometheus.
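Here is a sketch of what that CR looks like; the structure and field names follow the upstream VPA API, while the workload name and the numbers in the status are illustrative (the 38 millicores echoes a value discussed below):

```yaml
# Hypothetical VPA object matching the example: control CPU and memory,
# requests only, recommendation applied at next startup (Initial mode).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: "Initial"            # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsOnly   # or RequestsAndLimits; never limits only
status:
  recommendation:
    containerRecommendations:
      - containerName: app
        lowerBound:                  # ~P50
          cpu: 25m
          memory: 256Mi
        target:                      # ~P90, the value that gets applied
          cpu: 38m
          memory: 384Mi
        uncappedTarget:              # target before min/max policy capping
          cpu: 38m
          memory: 384Mi
        upperBound:                  # ~P95
          cpu: 120m
          memory: 512Mi
```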
Another thing to look into, where you've got to be careful, is using VPA with HPA. Typically, you cannot have a resource like CPU controlled by VPA and use it as the base metric for scaling in HPA. You can use both, but not on the same metric. You have to be careful about that; otherwise they're going to work against one another.

The Recreate and Auto modes, by design, will not by default trigger any restart on your pods if there is only one pod in the replica set, because that would mean some unavailability of your service. You have to be careful. And if you're using the Initial, or even the Recreate or Auto mode, you have to be careful about potentially excessive recommendations that VPA could be making, because it doesn't fully understand your workload. You want to make sure that the sum of all the requests calculated by VPA does not exceed your cluster capacity.

One pitfall is that there is only one VPA object allowed per workload. If you create two VPA objects for the same deployment, VPA is going to select only the oldest one and ignore the most recent one. And the last thing is that some distributions limit the number of VPA objects you can have in a single cluster. For instance, GKE allows only 500 VPA objects per cluster. OpenShift doesn't document such a limitation, but they warn you about the resources that VPA itself takes to calculate all of those metrics. So you really have to check how VPA is scaling when it monitors all of your pods.

So, where are we at Lombard Odier? We've deployed VPA on all clusters. We're using the requests-only mode; that's basically the answer to the initial problem we were having. We're also starting to experiment with OpenShift idling on the dev cluster, which explains some of the numbers on the dev cluster that are pretty low. Nothing to do with VPA, but it's another tool in the toolbox for capacity planning. One thing that we did is that we created a capacity-planning governance board, where we meet every month, and basically we track usage versus requested for memory and CPU. The goal is that at some point the requests will be aligned with the real usage. We still have a bit of a discrepancy on the CPU requests versus usage. We're still analyzing this; it might be related to the other types of workloads that are running on a worker node and that are not covered by VPA. Basically, once we can solve the CPU situation, the next challenge is going to be about memory.

So for CPU, we're running Initial mode on all clusters. We've got default values for our Java workloads, but those values can be overridden by the different development teams. One thing that was very interesting when we looked at the results calculated by VPA is that we had very low values, meaning we've got a lot of workloads that are not doing too much. With a value like the 38 millicores, we could see that even our default, which seemed not so high, was actually oversized compared to what VPA measured. And on top of that, the development teams had limited understanding of how things were working, and were actually increasing the requests, hoping to get better results. So it's being deployed progressively on each workload in the different clusters, and we're actually starting to see the savings in terms of vCPUs. It doesn't change the usage; it changes the perspective on the allocation, on what you reserve in your cluster versus what you actually use.
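Coming back to the VPA-plus-HPA caveat above, here is a minimal sketch of a combination that avoids the conflict, with hypothetical names: VPA owns the CPU request, and HPA scales on memory, so the two controllers never act on the same metric:

```yaml
# Hypothetical pairing: VPA controls the CPU request only...
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu"]   # memory is left untouched
        controlledValues: RequestsOnly
---
# ...while HPA scales out on memory utilization, a metric VPA does not control.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
```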
For memory, it's very new; we started doing that a few weeks ago, so it's deployed on the dev cluster only. What happens is that the development teams set a value which is used by default for both the request and the limit, so basically we're working in guaranteed mode by default, initially. Then VPA watches the activity and basically downscales the request on the memory. So far, we were able to save 18 gigs of RAM in requests.

All right, so: lessons learned. You have to be careful about namespace or object deletion and recreation. In our situation, we're not getting the metrics out of Prometheus, which means they're stored in a compacted format in a specific CR, the checkpoint (sketched after this section). If you remove those checkpoints, you lose your metrics, and you have to start over with your recommendations. The other thing is that the recommendation is stored as a status in the VPA object. If you remove the VPA object, or replace it the next time you deploy, then for 30 seconds you're going to lose the recommendation, and your pod is going to start with whatever is configured in the environment. So if you want to make modifications to a VPA object, you've got to update it, not replace it.

One big learning experience was the low CPU recommendations. Even if you don't use VPA to apply its recommendations automatically, it's a learning exercise, just as much as doing observability and building Grafana dashboards to watch the behavior of your workloads. One thing not to forget is that reserving resources based on the P90, as VPA does, is basically doing overcommit at the pod level. If you're also using overcommit at the virtualization layer, the two could conflict with one another, so you have to be very conservative with the overcommit you're doing at the virtualization layer. And one very interesting thing that we got out of using VPA was that not only were we able to do fine tuning from one application to another, but we got different recommendations and tuning for the same application in different environments. It's not unusual, when you start using VPA, to not have the same values in production, test, and integration; you've only got kind of a base value.

A few things to be careful about in the context of JVM workloads. One mistake we made at the beginning is that we wanted to protect the startup of critical applications. The issue with Java application frameworks is that nowadays they tend to consume a lot of CPU at startup, so you've got an initial peak. If you size your request based on that, you're going to over-reserve and over-consume requests. So do not oversize to cope with startup; you'll get better efficiency. The remaining issue, however, is what we call the thundering herd problem: when a massive amount of pods start at the same time and produce a huge peak of consumption. VPA is not helping you with that, so that's an issue.

OK, so next: like I said, the next challenge for us is keeping memory at a sustainable level. We want to continue working on that, get some feedback from the development cluster, and then actually go up into the other clusters.
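For reference, here is a trimmed, illustrative sketch of such a checkpoint object. The kind and field names follow the upstream VerticalPodAutoscalerCheckpoint CRD; all names and values are made up:

```yaml
# The recommender aggregates observed usage into histograms stored in this CR.
# Deleting it (for example, by deleting the namespace) resets VPA's memory of
# the workload, and recommendations have to be rebuilt from scratch.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscalerCheckpoint
metadata:
  name: example-app-vpa-app
  namespace: example-ns
spec:
  containerName: app
  vpaObjectName: example-app-vpa
status:
  firstSampleStart: "2022-01-10T08:00:00Z"
  lastSampleStart: "2022-04-11T09:30:00Z"
  totalSamplesCount: 123456
  cpuHistogram:
    referenceTimestamp: "2022-04-11T00:00:00Z"
    totalWeight: 42.5
    bucketWeights:        # compacted usage distribution per bucket
      12: 10000
      13: 5500
  memoryHistogram:
    referenceTimestamp: "2022-04-11T00:00:00Z"
    totalWeight: 17.3
    bucketWeights:
      25: 8000
```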
We also want to continue adding density on our worker nodes. I mean, the ultimate goal would be that at some point we don't need the overcommit at the virtualization level, and we can go to bare metal, right? You don't want to go to bare metal with a low density configured on your worker nodes. We want to start working with VPA on one side and horizontal scaling on the other side. And specifically, there would be a lot of benefit if we were able to scale down to 0 and back to 1 for particular use cases. So we'll be looking at HPA, but we'll invest some time in serverless as well, or KEDA, for instance (there's a sketch of that just before the Q&A below). And the last thing is that we want to expand VPA to workloads that we're not writing ourselves, so that the best practices are applied to everybody. Typically, what we're seeing is that vendors come with their product and ask you to configure excessive resource allocations to protect the product, even when that's not appropriate. So we want to work on that.

A few open issues to consider. The first one is being pushed by a colleague of mine, Matthias Bertschy, and aims at providing the ability to configure a target percentile different from the default P90, which is hard-coded today. We could then choose a more cost-optimized or a more performance-optimized approach depending on the workload; it would give us more flexibility. If you go back to the small example I was showing earlier, the perfect value was P78, right? So P90 was already quite conservative, and we could be saving a bit more by gambling a little bit more toward a cost-optimized approach. The second issue is very important: the idea is to give Kubernetes the ability to replace requests and limits without restarting the pod. This would open up use cases like the Auto mode, which requires that, and it might also serve as the basis for implementing something smarter for the thundering herd problem. About that particular issue, you may be interested in the discussion in the third point, and that's why I listed it there. And the last one is a bug that we found.

A few tools and resources to wrap up this talk. There is a recommendation plugin for kubectl that you can install with Krew. Goldilocks is a dashboard which relies on the VPA recommendation engine to show you, for your different workloads, the values that were recommended by VPA, without having to scan all the CRs. Harness is an interesting and complete solution which has its own recommendation engine; what I like about it is that they recognize that there are different types of profiles, and they let you choose between cost, performance, or even a custom profile that you would define yourself. So you could say: I want the target to be calculated on the P80, for instance. And if you're interested in the subject, I really recommend reading the last link, which has a lot of interesting information. And with that, I think we've got time for a few questions.
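On the scale-to-zero idea mentioned above, here is a hedged sketch of what it could look like with KEDA; the Kafka trigger and all the names are hypothetical, chosen only because KEDA supports minReplicaCount: 0, which plain HPA does not:

```yaml
# Hypothetical KEDA ScaledObject: scale a consumer deployment to 0 when idle
# and back up when lag builds on a Kafka topic. All values are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-consumer-scaler
spec:
  scaleTargetRef:
    name: example-consumer        # the Deployment to scale
  minReplicaCount: 0              # scale to zero when there is no work
  maxReplicaCount: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092
        consumerGroup: example-group
        topic: example-topic
        lagThreshold: "50"        # scale out when consumer lag exceeds this
```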
Anyone got a question in the room? There's one over there, OK. I have one from online as well.

Hi, thank you for a great talk. I was wondering if you had any chance to look at the Prometheus integration of the vertical pod autoscaler. And if you did, did you notice any reason to choose one or the other, the custom resource or the Prometheus integration?

That's a good question. No, we didn't look at this integration. There would be one good thing about it: you wouldn't rely on the internal storage in the CR, so you wouldn't have the limitation I was talking about, where removing the object makes you lose your metrics. But the good thing about not depending on it is that you don't have a dependency on Prometheus: if Prometheus is not available, VPA is standalone and can continue. So we made the choice to keep the two separate at this point.

OK, I'm going to read out a question from Slack, from Frederico Hernandez, which was: how are they practically tackling the HPA versus VPA dilemma? By which he meant that you could have HPA scaling out workloads, which would make more pods, which would leave less room for the VPA to scale.

OK, so we haven't studied that much yet, but my understanding is that you can use HPA with VPA, but you need to use it on a metric which is not controlled by VPA. So for instance, you could take requests per second, which is the basic notion in serverless. You could even have VPA control the CPU resource and use memory for scaling. But you don't want the same metric used by these two features.

OK, thank you. Who's next? At the back: can you keep your hand up? We'll come to you next.

Thanks for your talk. I was just curious, as you're starting to do the memory investigation, how are you handling the issue with the JVM?

Yeah, that's a very good question. So I reached out to different people, and my understanding is that as long as we're doing requests only, we're probably fine; that's what we are working on today in the development cluster, so we need to confirm this. The issue seems to be related to trying to change the limit, which we're not interested in at this point. Specifically, there is a way to size the heap as a ratio of the container memory, so if you start having VPA scale the limit up and down, you run into the problem of heap committed versus heap used in the JVM. We're not touching the limits today, so my understanding is that we're fine with the requests-only mode.

So my question was: I realized you had an open issue, but it was closed with no solution, about the throttling of containers starting up. Do you have any workaround in place right now that you are using?

Very good point. A lot of people have been complaining about the throttling at startup of JVM workloads; it's a common issue. The solution could come with the KEP I was showing earlier, the in-place update of resources. The idea is that, if we had that, maybe we could over-provision the request to cover the startup, and once the pod is ready, downgrade the request. In that case, you would get a natural throttling of mass startups, because the starting pods would pile up big requests and then suddenly free up resources as they get ready. But, I mean, we'll see in a few years.

OK, I think we're out of time, so I'm not going to take any more questions. OK, so thank you, and a big thank you to Lombard Odier for making it possible for me to be on stage to present all of this work that we've done, which is very important for us. Thank you very much, and have a good conference. Thank you. Thank you.