This is Mauro Pesina, and welcome to my session on getting the up-and-down service efficiency that autoscalers won't give you. These are the contents we will cover today. We'll start by discussing the problem we are going to address and the Kubernetes challenges around ensuring application performance and reliability. We will then introduce the new approach we implemented at Moviri, which leverages AI to automate the optimization process, and we'll do that through real-world examples. Finally, we will conclude with some takeaways.

Before proceeding, please allow me to introduce myself. My name is Mauro, and I'm the head of the Performance Engineering services line at Moviri. Performance engineering has been my passion ever since I graduated.

So let's start with a quick introduction to the problem. As you all know, applications are evolving. Monoliths are quickly disappearing, and modern, cloud-native applications feature dozens or even hundreds of microservices spanning a wide set of technologies. Each of these technologies comes with its own tunable parameters and options, and as a result we must deal with hundreds of thousands of possible configurations. So how do we find the configuration that best suits our workload?

Speaking of Kubernetes itself, there are more and more stories about how difficult it is to ensure performance and stability for Kubernetes applications, as you may have seen in the talks before mine. Kubernetes Failure Stories, for example, is a website specifically created to share Kubernetes incident reports, with the goal of learning how to prevent them. Many of the stories describe teams struggling with Kubernetes application performance and stability issues, such as unexpected CPU slowdowns or even sudden container terminations. But why is it so difficult to manage application performance, stability, and efficiency on Kubernetes? The simple answer is that Kubernetes is a great platform for running containerized applications, but it requires applications to be carefully configured to ensure high performance and stability.

Let's now get back to the fundamentals and see how Kubernetes resource management works, to better understand the main parameters that impact Kubernetes application performance, stability, and cost efficiency. We'll go through six key facts and their implications.

Starting from the first, which is resource requests. You may have heard many talks before mine on this topic, but let's do a quick recap. When developers define a pod, they can also specify resource requests. Those are the amounts of CPU and memory that are guaranteed to the pod, or better, to a container within the pod, and Kubernetes will schedule the pod on a node where the requested resources are actually available. In the example, you can see pod A, which requests two CPUs and is scheduled on a four-CPU node. When a new pod B of the same size is created, it can also be scheduled on the same node. All four of the node's CPUs are now requested. If a pod C is created, Kubernetes won't schedule it on this node, as its capacity is full. This means that the numbers a developer puts in a YAML file are actually used by Kubernetes to manage real cluster capacity. A quite surprising fact is that Kubernetes allows no overcommitment of requests: you cannot request more CPUs than those available in the cluster.
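To make the resource-request mechanism concrete, here is a minimal sketch of a pod manifest matching the scheduling example above; the pod name, image, and memory value are illustrative, while the two-CPU request is the one from the example:

```yaml
# Illustrative pod manifest: pod A requests 2 CPUs, so the scheduler
# will only place it on a node with at least 2 schedulable CPUs free.
apiVersion: v1
kind: Pod
metadata:
  name: pod-a                               # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0     # placeholder image
    resources:
      requests:
        cpu: "2"        # guaranteed CPU, used by the scheduler for placement
        memory: "1Gi"   # illustrative value, not part of the example
```

On a four-CPU node, two such pods exhaust the schedulable CPU: a third pod with the same request stays Pending even if actual CPU utilization on the node is low, which is exactly the requests-versus-utilization gap I come back to in a moment.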
This is very different from, for example, virtualization, where you could create VMs with many more virtual CPUs than the real physical ones. Another very important fact to notice is that resource requests are not the same as utilization. If the pod requests are much higher than the actual resource usage, you will end up with a cluster that is fully scheduled even though its CPU utilization is as low as, say, 10%. So the takeaway here is that setting proper pod requests is needed to ensure Kubernetes cost efficiency.

The second important concept is resource limits. Resource requests are the guaranteed resources a container will get, but the usage can be higher. Resource limits are the mechanism that allows you to define the maximum amount of resources a container can use, like two CPUs or one gigabyte of memory. All this is great, but what happens when resource usage hits the limit? Kubernetes treats CPU and memory differently here. When CPU usage approaches the limit, the container gets throttled: the CPU is artificially restricted, and this can result in application performance issues. When memory usage hits the limit, instead, the container can get terminated. So there is no application slowdown due to paging or swapping; with Kubernetes, your pod will simply disappear, and you may face serious application stability issues.

The third fact is about an important, less known effect that CPU limits have on application performance. CPU limits work by throttling CPU usage, and you may think this happens only when CPU usage hits the limit, but surprisingly the reality is that CPU throttling starts when CPU usage is well below the limit. We did quite a bit of research ourselves, and we found that CPU throttling can start when CPU usage is as low as 30% of the CPU limit. This is due to the way the kernel enforces CPU quotas, but let's focus on what happens. This aggressive CPU throttling has a huge impact on service performance: you can get sudden latency spikes that may breach your SLOs without any apparent reason, even at low CPU usage. So some teams, including the engineers at Buffer, for example, tried removing the CPU limits. What they got, as you can see in the chart on the right, was an impressive reduction in service latency.

So is it actually a good idea to get rid of CPU limits? Well, it depends. The answer could be no, because CPU limits exist to ensure that applications run fine and coexist with other applications. If CPU limits are removed, a single runaway application can completely disrupt the performance and affect the availability of your most critical services. This best practice is also recommended by Google. So tuning these resources is really important, and you have to think carefully before doing it.

Coming to the autoscalers: we have VPAs and we have HPAs. Let's talk about the VPA first. The Vertical Pod Autoscaler basically provides recommendations on CPU and memory requests based on the observed pod resource usage. However, our experience with the VPA is mixed. In this example, you can see a Kubernetes microservice serving a typical diurnal traffic pattern. The top-right chart shows the latency of this service and its service level objective, while below you can see the resource requests in terms of CPU and memory and the corresponding resource utilization.
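For reference, here are minimal sketches of the two autoscaler objects discussed in this part of the talk, the VPA here and the HPA in a moment; the target deployment name and the thresholds are illustrative, not taken from the case study:

```yaml
# Illustrative VPA: updateMode "Off" computes recommendations without
# applying them, so you can review them before they touch live pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service        # hypothetical target deployment
  updatePolicy:
    updateMode: "Off"            # recommend only; "Auto" would apply changes
---
# Illustrative HPA: scales replicas when average CPU utilization,
# measured relative to the CPU *request*, crosses 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service        # hypothetical target deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the CPU request
```

Two things to keep in mind: because both autoscalers key off resource usage relative to the requests, badly sized requests and limits are simply copied into every new replica; and in the experiment described next, the VPA was allowed to actually apply its recommendations rather than just suggest them.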
It's interesting to see what happened. We left the service running for a couple of days with some initial resource sizing, then we activated the VPA and let it apply the new recommended settings to the pod. The VPA immediately decided to reduce the assigned resources; in particular, it cut the CPU request in half, likely because of the apparent over-provisioning of this service, as CPU utilization was below 50%. However, with this new setting suggested by the VPA, the latency of the microservice skyrocketed, and the microservice was no longer able to meet its reliability targets.

What is the lesson learned here? The VPA is based on resource usage and does not consider application-level metrics, response time for example. We need to evaluate the effects of the recommended settings, as they might be somewhat aggressive and cause severe service performance or reliability degradation.

But we also have the HPA, the Horizontal Pod Autoscaler. What about it? In this example, we see a situation where the HPA scales out when specific memory or CPU consumption thresholds are met. As a result, from one pod we get two identical pods that provide the same service. This is great: we have two times the resources we had before to serve the workload. What could possibly go wrong? Well, things may go wrong because if for some reason you didn't properly set your memory limit, for example you undersized it, you will simply get into a situation where you get twice the out-of-memory kills you had before. On the other hand, you may have mis-sized your CPU limits, so your pods cannot get the CPU they need, directly impacting your service latency because of throttling. But let's say you want to play it safe and overprovision your whole environment and your pods. Everything works properly, functionally speaking, but at what cost? You may now be wasting twice the resources you were wasting before the scale-out, if your pods aren't tuned properly.

As we have seen so far, optimizing microservice applications on Kubernetes is a real challenge. Given the complexity of tuning Kubernetes resources and the many moving parts of modern applications, a new approach is required to successfully solve this problem. And this is where AI can help. AI has revolutionized entire industries, and the good news is that it can also be used in the performance tuning process. AI can automate the tuning of the many parameters we have in the software stack, with the goal of optimizing application performance, resiliency, and cost.

In this section, I would like to introduce this new AI-powered optimization methodology through a real-world scenario. The case is about a European SaaS provider of financial services whose Java-based microservice applications run on either Azure or AWS Kubernetes services. The target system of the optimization is the B2B authorization service running on Azure, a business-critical service that interacts with all the applications powering the digital services provided by the company. The challenge for the customer was to avoid overspending and achieve the best possible cost efficiency by enabling development teams to optimize their applications, while continuing to release application updates to introduce new business functionality and align with new regulations.
Those results were almost impossible to achieve with the tuning practice they had in place, since it was manual and it took almost two months to tune one single microservice, with mixed results, because it was always too difficult to find the right trade-off between performance, resiliency, and, obviously, cost. By design, the approach was the good old overprovisioning, and as a result what did they have? A huge overspend and a lack of performance, operational efficiency, and business agility.

So how did we address this for the customer? Let me quickly introduce our AI-powered optimization methodology in practice. The process is fully automated and works in five steps. The first step is to apply the new configuration the AI suggests to our target system. This is typically done by leveraging the Kubernetes APIs to set the new values for the parameters, for example the CPU request of the container. The second step is to apply a workload to the target system in order to assess the performance of the new configuration. This is typically done with performance testing tools; in this case, we used a JMeter test that was previously built to stress the application with a realistic workload, as we'll see later. The third step is to collect the KPIs related to the target system. The typical approach here is to leverage observability platforms, and in this case we integrated Elastic APM, the monitoring solution used by the customer. The fourth step is to analyze the results of the performance test and assign a score based on the specific goal that we set; in this case, the goal was the cloud cost of the application containers, considering the Azure cloud prices. The last step is where the AI kicks in, taking the score of the tested configuration as input and producing as output the most promising configuration to be tested in the next iteration of the same process. And so we can start again by testing a new configuration.

So, I've said that we can define a goal. In this scenario, the goal was to reduce the cloud cost required to run the authorization service on Kubernetes. As you can see, the optimization goal is declarative. At the same time, we also wanted to ensure that the service would always meet its reliability targets, which are expressed as latency, throughput, and error rate constraints, as follows. So how can we leverage AI to achieve this high-level business goal? In the AI-powered optimization methodology, the AI changes the parameters of the system to improve a metric that we have defined. In this case, the AI goal is simply to minimize the application cost, a metric that represents the cloud cost we pay to run the application, which depends on the amount of CPU and memory resources allocated to the microservices. The methodology we used also allows us to set constraints that define which configurations are acceptable. In this case, we stated that the system throughput, response time, and error rate should not degrade by more than 10% with respect to the baseline.

To assess the performance and cost efficiency of a new configuration suggested by the AI optimizer, we stress test the system with a load test. Here you can see the load test scenario that we used. It was designed according to performance engineering best practices: the traffic pattern reproduces the behavior seen in production, including the API call distribution and think times.
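To give a feel for how declarative such a goal can be, here is a purely illustrative, tool-agnostic sketch of a study definition with the cost goal and the 10% reliability constraints just described. The field names are made up for illustration and are not the syntax of any specific product:

```yaml
# Hypothetical sketch of the optimization study described above.
# Field names are illustrative, not a real product schema.
study:
  name: auth-service-cost-optimization
  goal:
    minimize: application_cost          # cost of CPU + memory allocated to the pods
  constraints:                          # configurations violating these are rejected
    - metric: throughput
      maxDegradationVsBaseline: 10%
    - metric: response_time
      maxDegradationVsBaseline: 10%
    - metric: error_rate
      maxDegradationVsBaseline: 10%
  parameters:                           # the knobs the AI is allowed to change
    - container.cpu_request
    - container.cpu_limit
    - container.memory_request
    - container.memory_limit
    - jvm.min_heap
    - jvm.max_heap
    - jvm.gc_type
```

In this scheme, step one of the loop simply means patching the chosen parameter values onto the target deployment through the Kubernetes API before each load test run.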
Before looking at the results, it's worth commenting on how the application was initially configured by the customer; we call that the baseline configuration. Let's look at the Kubernetes settings first. The container powering the application was configured with resource requests of 1.5 CPUs and 3.4 gigabytes of memory, and the team also specified resource limits of 2 CPUs and 4.4 gigabytes of memory. Remember, the requests are the guaranteed resources that Kubernetes uses for scheduling and for the capacity management of the cluster. In this case, requests are lower than the limits, which is a common approach to guarantee the resources the application needs to run properly while leaving some room for unexpected growth.

Besides the container settings, it's important to also look at how the application runtime is configured. The runtime is what ultimately powers our application, and for a Java application we know that the JVM settings can play a big role in application performance; the same happens for other languages, for example Golang and so on. The JVM was configured with a minimum heap of half a gigabyte and a max heap of 4 gigabytes. Notice that the max heap is higher than the memory request, which means the JVM can use more memory than the amount requested. As we are going to see, this configuration has an impact on how the application behaves under load, on the associated resiliency, and on the cost.

OK, we've covered how the application is configured. Let's now look at the behavior of the application under the load test we've shown before, with the baseline configuration. In this chart, you can see the application throughput, the response time, and the number of replicas created by the autoscaler. Two facts are important to notice. First, when load increases, the autoscaler triggers a scale-out event which creates a new replica. This event causes a big spike in response time, which impacts service reliability and performance; this is due to the high CPU usage, and CPU throttling, during the JVM startup phase. Second, when the load drops, the number of replicas does not scale back down, even though the containers' CPU usage may be almost idle, as in this case.

It's interesting to understand why this happens. It is caused by the combination of the container resource configuration, the JVM running inside, and the autoscaler policy, in particular for memory. The autoscaler is not scaling down because the memory usage of the container is higher than the configured threshold of 70% of the memory request. This might be due to the JVM max heap being higher than the memory request, as we've seen is the case here, but it may also be due to a change in the application memory footprint, for example after a deployment. This effect clearly impacts the cloud bill, as more instances are up than required, and it shows that configuring Kubernetes apps for reliability and cost efficiency is actually a tricky process.

So what did we do? We now know how the baseline behaves; let's start optimizing. We started experimenting with different Kubernetes and Java configurations suggested by our AI algorithm and measured the results. Let's have a look at the best configuration identified by the AI with respect to the defined cost efficiency goal. It was found at experiment number 34, after about 90 hours of trials, and provides a 49% improvement on cost with respect to the baseline.
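Before moving on to the optimized settings, the baseline configuration described above corresponds roughly to a deployment like the following; the resource and heap values are the ones just quoted, while the names, the image, and the use of the JAVA_TOOL_OPTIONS environment variable to pass the JVM flags are assumptions for illustration:

```yaml
# Baseline configuration as described in the talk (names and the
# JAVA_TOOL_OPTIONS mechanism are illustrative; the numbers are quoted).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 1
  selector:
    matchLabels: { app: auth-service }
  template:
    metadata:
      labels: { app: auth-service }
    spec:
      containers:
      - name: auth-service
        image: registry.example.com/auth-service:1.0   # placeholder
        env:
        - name: JAVA_TOOL_OPTIONS
          value: "-Xms512m -Xmx4g"   # min heap 0.5 GB, max heap 4 GB
        resources:
          requests:
            cpu: "1.5"
            memory: "3.4Gi"          # note: the 4 GB max heap exceeds this request
          limits:
            cpu: "2"
            memory: "4.4Gi"
```

The detail to watch is exactly the one called out above: the JVM is allowed to grow its heap past the memory request, which later prevents the autoscaler from scaling back down.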
First of all, it's interesting to notice that the optimization increased both the memory and CPU requests and limits, which is not at all obvious, especially for Kubernetes, because Kubernetes is often considered best suited for small, highly scalable applications. We will dive into this in a minute. The other notable changes are related to the JVM options the AI picked. The max heap was increased by 20% and is now well within the container memory request, which was increased to five gigabytes. The min heap has also been adjusted to be almost equal to the max heap, a configuration that can avoid garbage collection cycles, especially in the startup phase of the JVM.

Let's now see how the application performs with the new configuration and how it compares to the baseline. There are two important differences. Autoscaling is not triggered in this configuration, as the full load is sustained by one pod, which is clearly beneficial in terms of cost. And the response time always remains within the response time SLO; there are no more peaks. So this configuration not only improves cost, it is also beneficial in terms of performance and resilience.

Let's also compare the best configuration with the baseline in detail. Here we can notice that the pod is significantly larger in terms of both CPU and memory, especially for the requests. This configuration has the effect of triggering the autoscaler less often, as we have seen. Interestingly, while this implies a kind of fixed cost, considering the pricing of the container resources it turns out to be much cheaper than a configuration where autoscaling is triggered, and it also avoids performance issues. The container and runtime configurations are now better aligned: the JVM max heap is now below the memory request, which has beneficial effects and also enables the scale-down of the application, should scaling be triggered by high workloads.

Let's now have a look at another configuration found by the AI, at experiment number 14, after about eight hours. We labeled this configuration "high resilience" for a reason that will be clear in a minute. The score of this configuration, while not as good as the best configuration, still delivers about 60% of the cost reduction, so it can also be considered an interesting configuration with respect to the cost efficiency goal. As regards the parameters, what is worth noticing is that this time the AI picked different settings that significantly change the shape of the container: it now has a much smaller CPU request than the baseline, but the memory is still pretty large, which is interesting. The JVM options were also changed; in particular, the garbage collector was switched to the parallel collector, which in some cases can be much more efficient in its use of CPU and memory.

But let's now compare the behavior of this configuration with the baseline. There are two important differences. The peak in response time upon scaling out is significantly lower; it's still higher than the response time SLO, but the peak is less than half the value of the baseline configuration. This clearly improves the service resilience. And now autoscaling works properly: after the high-load phase, replicas are scaled back to one. This is the behavior we expect from an autoscaling system that works properly. Notice that the response time peak could be further reduced.
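As a rough sketch of what the lowest-cost configuration changes relative to the baseline deployment shown earlier: the memory request of about 5 GB, the roughly 20% larger max heap, and the min heap set close to the max heap are the stated changes, while the CPU figures and the limit values were only described as increased, so they are marked as placeholders here; the JAVA_TOOL_OPTIONS mechanism is the same assumption as before:

```yaml
# Fragment of the container spec for the best (lowest-cost) configuration,
# experiment 34 - illustrative sketch, not the exact values from the study.
env:
- name: JAVA_TOOL_OPTIONS
  value: "-Xms4800m -Xmx4800m"   # ~4 GB + 20%, with min heap ~= max heap (approximated)
resources:
  requests:
    memory: "5Gi"                # stated as ~5 GB
    cpu: "2"                     # placeholder: increased vs baseline, exact value not stated
  limits:
    memory: "5Gi"                # placeholder: increased vs baseline, exact value not stated
    cpu: "2.5"                   # placeholder: increased vs baseline, exact value not stated
```

The high-resilience configuration from experiment 14 would differ mainly in a smaller CPU request, a still-large memory request, and a different collector, which with this mechanism would simply mean adding a flag such as -XX:+UseParallelGC to the same JAVA_TOOL_OPTIONS value.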
Reducing that peak further is simply a matter of creating a new optimization with a different goal, for example minimizing the response time metric instead of the application cost.

Let's also compare the high-resilience configuration with the baseline in detail. Quite interestingly, this configuration has a higher memory request and a lower CPU request, but higher limits, than the baseline. As you may remember, the lowest-cost configuration instead had a higher CPU request than the baseline. Without getting too deep into the analysis of this specific configuration, what these results show is that as the optimization goal changes, CPU and memory requests and limits may need to be increased or decreased, and multiple parameters at the Kubernetes and JVM levels also need to be tuned accordingly. This is clearly a confirmation of the perceived complexity of tuning Kubernetes microservice applications, and here we're discussing just one microservice; think about tuning hundreds of them.

In terms of customer results, what did we get? The resizing of the service pods allowed a huge cost reduction in the Kubernetes environment. Furthermore, we allowed them to automate the tuning of their services, all of their services, in a matter of hours instead of months, enabling not only the cost reduction, but also improvements in the latency of their applications and, overall, a better user experience for their customers.

There are many other interesting configurations that we found in our study and that would be worth discussing, but I think it's time to conclude with a few takeaways. First one: tune, tune, tune. Any inefficiency is not going to be addressed by your Kubernetes cluster; if you don't think about how to optimize your platform, nobody else will. The second takeaway: today's applications are too complex, and you simply don't have time to optimize everything. AI-powered optimization works and can be the solution to your optimization needs. Thank you.

Moderator: I have a microphone for questions. If you're leaving, please do so very quietly.

Audience: Thanks a lot. This was, for me at least, very interesting. One question: can you speak a bit more about the AI you used? Is it black-box optimization, as in, here are some parameters, just find the combination that seems to work best?

Mauro: Yeah, I can share some insights into the approach we used. Basically, it is based on reinforcement learning. We provide as input the values of the parameters that we applied and the score of the last iteration. After analyzing the score and the parameters of that iteration, the algorithm learns how it has to behave and proposes the next configuration. If you want more details, we can discuss separately, because it's a long topic, as you may imagine.

Audience: How did you take into consideration the CPU-to-memory ratio of the node? At some point the high-resilience configuration was up to something like two CPUs and five or six gigabytes of memory.

Mauro: I didn't really get the question.

Audience: The nodes in the cluster, in AKS or something like that, have a CPU-to-memory ratio. If you have a pod with two CPUs and six gigabytes of RAM, you can only put one pod on the node. So did you take that CPU-to-memory ratio into consideration?

Mauro: Okay, I got it.
In this particular case, we focused on the pod sizing, not on the node sizing, and we didn't consider autoscaling of the nodes. But what we usually see is that the experimental approach allows us to monitor how the system behaves in terms of node capacity while the experiments run, so we are able to identify whether we have a shortage of CPU or memory on specific nodes. And with this experimental approach, we can work on one single microservice, or we can go full stack, for every microservice, so every container, running on a specific node. I don't know if I answered your question, but we can discuss it further if you want.

Audience: Hi, thank you. I think my question is a bit similar to the last one. This example seems to optimize against one service, but a lot of times, to minimize the cost, you need to optimize against all the services. For example, if a service ends up with a very large pod, three CPUs, your node cannot accommodate any other pods. But I imagine if you start tuning against all the microservices, that becomes a huge parameter space, and then the AI becomes very hard to run, or maybe even inefficient. How does that work in practice?

Mauro: Yeah, basically it may take longer to find a suitable solution, but we can test multiple services together. Also, in a specific case, I can add constraints related to the sizing of the node. If I want to be safe in terms of node allocation, I can add constraints such as: don't use more than, say, 30% of the capacity of a node. I can add that as a constraint in order to define a limit on the space the AI has to search.

Moderator: There are a lot of people following online, and a couple of questions came in; I'm going to read one out. Does the AI provide a predictive ability?

Mauro: Yeah, the idea is to predict the possible configuration that could lead to an improvement, so it is an iterative and predictive approach. If you followed the talk before mine, they were talking about a predictive approach; this is the same kind of thing, a predictive and iterative approach.

Audience: Hey, in one of your slides you said that you found that if you disabled the CPU limits, the response times actually dropped. As far as I remember, there was a Linux kernel bug, in 2020 if I'm not mistaken, and there was already a fix for it. Afterwards, presumably, setting the limits again would no longer trigger the bug. So my question is: do you still think that turning off the limits is the way to go?

Mauro: Okay, in my opinion, I know that in some cases newer kernels fixed that throttling issue. This argument about removing the limits is something like a war between the people who want to remove the limits and the people who say, no, you have to keep the limits. The nice part of what we do is that we don't need to take a position, since our approach is experimental. So if we find out that removing the limits gives better results for our application, we can remove them. Obviously, by removing the limits, we take on the risks associated with that.

Moderator: Okay, two minutes left.

Audience: Hi. I have a question on the problem of overfitting to the data.
In machine learning, you typically have the problem that your classifier, or whatever model, becomes too sensitive to the training data and does not perform well on real data. I see you are using performance tests, somewhat artificial tests, to set the parameters. So how does this behave in production, and how do you solve the problem of being over-fitted to the performance test?

Mauro: Basically, the accuracy of the test is very important. I come from a long practice of performance testing; performance testing is what I've done for a living since I started working. The prerequisite for this approach is to find a way to build a proper model of the workload for the performance test. You obviously have to use the same version of the application, the same version of the operating system, and a comparable sizing in the pre-production environment where the test runs. That is everything you have to take into account, and for sure there is the whole problem of how representative your performance test is. So this approach is well suited for customers, for people, who are very much into performance engineering and know how to shape their traffic well. But we are also investigating the option of doing it in production, with real traffic.

Moderator: Okay, last question.

Audience: With JVM-based applications, you usually have a problem when you try to base your requests on the actual utilization, because of the startup time: when the application starts, the CPU usage goes up. How can you solve this? And the other thing is, when the requests and the actual usage are almost the same, I will need to scale fast, which also means I will need to add more nodes.

Mauro: So basically, regarding the JVM topic, in this particular case, and it's not a one-size-fits-all solution, the best configuration was the one where we didn't trigger a new replica at all. But if you look at the other configuration, which led to some kind of 15% reduction in cost, in that configuration we managed to find a sizing of the JVM, in terms of min heap and max heap, which were increased, where the JVM wasn't suffering from throttling at startup. Obviously, a JVM always suffers in terms of CPU usage when it starts, but that specific configuration, in terms of min heap, max heap, and memory request on the pod, allowed us to cut the startup spike roughly in half. That is something you will always have to face: in every situation, if you don't want to oversize your environment, you will have some kind of spike when a JVM starts.

Moderator: Okay, we are out of time. Thank you for being here, and get in touch soon.

Mauro: Thank you.