Hi, all. Thanks for joining. Our first session for today is Optimizing Application Performance on Kubernetes by Dhinakar Guniguntala. Thanks, Dhinakar. A brief introduction about Dhinakar: Dhinakar is an architect on the Kruize project, focused on autonomous performance tuning and on exploring the use of machine learning and hyperparameter optimization in the performance domain, specifically for Kubernetes and the cloud. Dhinakar previously worked on making the OpenJ9 Java virtual machine run more efficiently in the cloud and was the official maintainer of the AdoptOpenJDK Docker images. Dhinakar loves open source, astronomy, and volleyball, and is a prolific speaker at conferences.

A brief introduction to the topic we're going to talk about today: now that you have applications running on Kubernetes, you may be wondering how to get the response times that you need. Tuning applications to get the performance you need on Kubernetes can be challenging. At the same time, there are a number of Kubernetes features that, when used in the right way, can go a long way toward getting the most out of the underlying hardware resources. This talk looks into every aspect of optimizing a Kubernetes cluster, starting from the most basic node affinities to advanced methods such as tuning microservices, each with examples and a demo. We'll also be looking specifically at tools that help not only to right-size your containers, but also to optimize the runtimes. So, we'll start the session.

Thank you for attending this session. This is Dhinakar Guniguntala. I work at Red Hat, where my primary job is to see how runtimes such as Java can be made to run better on Kubernetes, and that is what I'll be talking about today. Has it ever happened to you that you talk technology to customers, only for them to ask, "what's the mileage?" Today we will look at ways to improve the mileage you can get from your Kubernetes cluster.

Let me take a moment to define what I mean by performance. Traditionally, performance looks at three key aspects: throughput, response time, and utilization of system resources. These are the criteria that we'll be looking to optimize in today's presentation as well. However, I'll be confining myself to compute only.

Before we dive into the presentation, I thought it would be good to define the overall context. Imagine that you are an SRE and you've been given this problem to solve. You have a complex polyglot application, such as an airline booking system, deployed onto Kubernetes. As you can see, it has many microservices, a couple of databases, and each of the microservices is written in a different language and framework. Users are seeing slow response times while making a flight booking. Now it is up to the SRE or the IT admin to try and make the user experience better. So let us look in detail at the steps an SRE can take to try to solve this problem.

The first aspect to be considered is observability. This is very key: how closely we observe the system and all of the associated metrics will help us determine where the performance bottlenecks are and how to go about tuning them. There are a number of tools out there that can help you get better metrics; Prometheus and Grafana, for example, are a couple of the more popular ones. I would also suggest that you take a look at OpenTelemetry, which is slowly becoming the industry standard when it comes to observability.
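As a minimal sketch of the observability setup described here, the snippet below exposes an application's metrics endpoint to Prometheus using the common prometheus.io/* pod annotations. This assumes a Prometheus scrape configuration that honors those annotations; the pod name, port, path, and image are illustrative, not taken from the talk's demo.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flight-booking                 # hypothetical microservice from the booking example
  labels:
    app: flight-booking
  annotations:
    prometheus.io/scrape: "true"       # ask Prometheus to scrape this pod
    prometheus.io/port: "8080"         # port where the app exposes metrics
    prometheus.io/path: "/metrics"     # metrics endpoint (Micrometer, prom-client, etc.)
spec:
  containers:
  - name: flight-booking
    image: example.com/flight-booking:latest   # placeholder image
    ports:
    - containerPort: 8080
```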
One of the key things in observability is the granularity of observation. For example, if you're observing the pods on a per-second basis, you get very accurate information, but that incurs higher overhead in terms of CPU, network activity and, in fact, disk space as well. So there's a trade-off here, and you need to be careful in setting that value. Another aspect to consider is exporting additional operational metrics on a per-application basis. Things like Spring Actuator, Micrometer for Quarkus, or prom-client for Node.js can be turned on for your application, and they provide additional runtime-related metrics, such as heap usage, which, as we'll see later, can be used to tune the application for better performance.

When you have an on-prem cloud, you have the luxury of tuning the hardware all the way from the BIOS on each of your Kubernetes nodes. A common setting found in the BIOS relates to the choice between performance and power. Choosing power means you get better power savings but variable performance. The same setting bubbles up into the operating system or the hypervisor as well; in the case of Linux, it's called the scaling governor. In fact, I've seen performance drop by up to 30% with the powersave option for certain workloads. If power saving is your goal, then this is a good setting, but definitely not if application performance is the key goal. The other thing to consider, or at least be aware of, is hyperthreading: don't count hyperthreads at face value while doing capacity planning. Let us say a server has 16 cores and two threads per core; that is counted as 32 CPUs. However, hyperthreaded CPUs give at most a 20% boost over a single core, so it is best to discount this while calculating capacity.

Now that our hypothetical SRE has set up observability and has fixed the hardware, what's the next step? Let's start simple: match the application to the hardware features it needs. Node affinity is typically accomplished by setting the right labels on a node in a Kubernetes cluster. It is very useful if you want to assign pods to a specific hardware feature on a node, or if the node is reserved for a particular type of workload, namespace, or security constraint. In this example, we see that this particular pod, which is an ML application, will only run on nodes that have the GPU label.

Another way to constrain pods is to use pod affinity and pod anti-affinity. If there are pods that commonly communicate with each other, or share some common resources, then it makes sense for them to run on the same node, and we can use pod affinity rules to make sure they do. But what if you don't want pods from application A to run on a node if pods from application B are running on that node? Maybe both applications are network heavy, or both use the GPU extensively. Whatever the case may be, we want to make sure that pods from application A and application B run on different nodes, so we can use pod anti-affinity to make sure they don't land on the same node. In this example, we want a pod to be scheduled on a node only if there are other pods there with the same security policy S1, and we don't want to schedule it on a node that is running pods that use a different security policy S2. The advantage of pod affinity and anti-affinity is that they allow the admin to dynamically assign nodes for certain kinds of pods without having to dedicate nodes ahead of time.
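A minimal sketch of the two constraints just described, combining node affinity for a GPU label with pod affinity and anti-affinity on a security-policy label. The label keys and values (gpu, security, S1, S2), pod name, and image are illustrative, not taken from an actual cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-trainer
  labels:
    security: S1                       # this pod's own security policy
spec:
  affinity:
    nodeAffinity:                      # only schedule on nodes labelled gpu=true
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values: ["true"]
    podAffinity:                       # co-locate with pods sharing security policy S1
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values: ["S1"]
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:                   # never share a node with pods using policy S2
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values: ["S2"]
        topologyKey: kubernetes.io/hostname
  containers:
  - name: ml-trainer
    image: example.com/ml-trainer:latest   # placeholder image
```

Note that a required pod-affinity rule keeps the pod Pending until at least one S1 pod is already running somewhere; the preferredDuringScheduling variants can be used to soften either rule.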
There are also other scheduler mechanisms, such as taints and tolerations and pod priority, that you can explore in this context.

Now we come to the most important aspect of performance tuning in a Kubernetes cluster. Right-sizing applications greatly helps to get the best possible performance, and this is done primarily by setting the CPU and memory requests and limits. Here on the left I have an example application deployment YAML, and you can see the resources specified in the container spec section. As a best practice, it is very important to always specify the resources, to enable Kubernetes to make the best possible scheduling decisions. This usually means you should target either the Guaranteed or the Burstable QoS class and avoid BestEffort, which is what you get when you don't set anything at all. One thing we do need to make sure of, for the best possible performance, is that the requests are set to cover the consistent peaks that we observe, and the limits are set to handle any spikes. So do ensure that the limits are set high enough, based on observation, to prevent any throttling. Also ensure that the requests and limits you're setting do not clash with any limit ranges that might apply to your namespace. Now that we know that requests and limits are crucial to performance, you might have a question: how do I accurately arrive at the optimal values for requests and limits? The vertical pod autoscaler can help in that regard, but I suggest you use the Kruize tool, which I will talk about in a minute.

Let's take a closer look at the various autoscalers that Kubernetes has. Application performance depends a lot on how the app is scaled, and this is where setting the right policies for the horizontal pod autoscaler is very important. Try to use app-specific metrics to set up the HPA as much as possible, as using just the average CPU utilization, for example, might not be the best approach. For example, when a GC is triggered in Java, the resulting CPU spike might cause a new pod to be instantiated even though the actual load has not increased. You can also use external metrics, such as the number of concurrent users that your application is handling, but the best practice is to use objects that are known to Kubernetes as much as possible, such as packets per second or requests per second. So in this particular case, what we are saying is that if the average value of packets per second goes beyond 1k, then start a new pod; or, in this other case, if requests per second goes beyond 10k, then we start another pod, and so on.

Using a cluster autoscaler definitely helps to make the best utilization of the underlying resources, especially when you're scaling down, because you make sure that you free up the resources that are not being used. But you need to be very careful not to cause any service disruption in the process, especially when downscaling. Specifying the maximum unavailable pods in a pod disruption budget definitely helps in this regard.
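Pulling the requests/limits, HPA, and pod disruption budget points together, here is a minimal sketch. It assumes a metrics pipeline (for example, the Prometheus Adapter) that exposes a packets-per-second custom metric; the names, thresholds, and resource values are all illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flight-booking
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flight-booking
  template:
    metadata:
      labels:
        app: flight-booking
    spec:
      containers:
      - name: flight-booking
        image: example.com/flight-booking:latest   # placeholder image
        resources:
          requests:                  # cover the consistent peaks seen during observation
            cpu: "1"
            memory: 512Mi
          limits:                    # leave headroom for spikes to avoid throttling
            cpu: "2"
            memory: 1Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flight-booking-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flight-booking
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods                       # app-specific metric instead of plain CPU utilization
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: "1k"           # add a pod when the per-pod average crosses 1k
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: flight-booking-pdb
spec:
  maxUnavailable: 1                  # cap disruption when nodes are drained or scaled down
  selector:
    matchLabels:
      app: flight-booking
```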
Now, if you're an SRE, you'll know that every runtime has many, many tunables; Java, for example, has more than a hundred of them. But you'll also know that you should never touch them. Why? Because who knows what kind of impact a change will have, each tunable has dependencies on other tunables, and there are just way too many of them for you to manually test and figure out. Also, how these runtimes behave in Kubernetes environments is not always clear. So guess what: most of the time, an SRE is limited to tuning the app itself, or tuning just the CPU and memory. And by tuning, we all know what normally happens: we just end up doubling the resources until the problem goes away. It is difficult to be an SRE, let's be honest. You have users bugging you for better response times, finance wants to cut costs all the time, and you have developers giving you a ton of options that you find really difficult to use.

So if you're thinking there has got to be a smarter way, you're absolutely right. We are really happy to announce a new tool called Kruize Autotune. It's an open source project from Red Hat and is publicly available; I encourage you to take a look at our GitHub repo given below.

Let's take a deep dive into the whole process that Autotune uses to tune an application. The first step is that the SRE encapsulates all of the performance requirements into an objective function, which is an algebraic expression, such as a² / (b + c), where maybe a is your throughput, b is response time, and c is cost. You then want to either maximize or minimize the whole thing; in this particular case, a² / (b + c), you might want to maximize it. Each of the individual variables of the objective function is specified as a Prometheus query, and the whole thing applies to a particular Kubernetes deployment, which is selected using the selector out here.

At the heart of Autotune is Bayesian optimization, which is provided by the HPO service that you see here; HPO is nothing but the hyperparameter optimization service. Bayesian optimization is a type of black-box optimization that builds a probabilistic model of the objective function you have specified and searches it efficiently to arrive at either the global maximum or minimum, as required. Essentially what happens is that the Bayesian optimization gives you a configuration to try out for this particular deployment. We figure out the layers of the application and of the stack, and then send all of the tunables from those layers to the Bayesian optimization, which gives you a particular config to try out. The experiment manager deploys it, we monitor the pod with the trial configuration under load, get a summary of how it performed, and send that back to the Bayesian algorithm, which looks at the results and tries to find another config that will give better results, and so on. This loop continues, and after about 100 trials you'll find that the Bayesian optimization has given you a config that satisfies the objective function you came up with, and we produce a config recommendation. That's, in essence, how we do this.
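To make the objective-function idea concrete, here is a hypothetical sketch of how such a definition might be written as an Autotune custom resource. The field names and API group below are assumptions and may not match the current Kruize Autotune CRD, and the Prometheus queries, metric names, and the flight-booking deployment are illustrative; check the GitHub repo for real examples.

```yaml
apiVersion: recommender.com/v1              # assumed API group; see the Kruize repo
kind: Autotune
metadata:
  name: booking-autotune
spec:
  slo:
    direction: maximize                     # maximize a^2 / (b + c)
    objective_function: "(throughput * throughput) / (response_time + cost)"
    function_variables:
    - name: throughput                      # a: requests served per second
      datasource: prometheus
      query: 'sum(rate(http_server_requests_seconds_count[1m]))'
    - name: response_time                   # b: average request latency
      datasource: prometheus
      query: 'sum(rate(http_server_requests_seconds_sum[1m])) / sum(rate(http_server_requests_seconds_count[1m]))'
    - name: cost                            # c: a cost proxy, e.g. CPU consumed
      datasource: prometheus
      query: 'sum(rate(container_cpu_usage_seconds_total{namespace="booking"}[1m]))'
  selector:
    matchLabel: app                         # the deployment whose pods get tuned
    matchLabelValue: flight-booking
```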
So let's take a quick look at how this works. I have a small demo out here. I have Minikube running on my laptop, as you can see, with Prometheus and Grafana installed in the Minikube cluster, and I also have Autotune running here. And I have a TechEmpower application, a Quarkus RESTEasy Hibernate application, that is also running in the cluster. Now the challenge is to try and optimize this particular benchmark. What are we trying to optimize? We are trying to minimize the response time. Response time is defined as request sum divided by request count, where request sum is this particular Prometheus query and request count is this other Prometheus query. This applies, of course, to the TechEmpower deployment, and we are trying to minimize response time.

So let's apply this YAML here. You can see that Autotune starts to deploy specific configurations and comes up with different configurations that it can test to see how they do under load. Of course, this being a very short demo, we are not really monitoring the load; this is just to give you a sense of how the whole process works. You can see here that it is starting multiple trials, and you can also take a look at the list of experiments here to see the configs that it is actually trying out. Here you can see it is trying certain values of CPU and memory, along with Java options covering the HotSpot layer that it has found and the Quarkus layer. Very quickly, you can also look at all the layers it has found in the application: the base container, HotSpot, Quarkus, and so on. If you keep monitoring this, it runs the whole set of trials and then reports the best trial at the end of the experiment, saying this is the one that had the best configuration, and you can look up that trial number to see what the best configuration was.

So that's a very quick demo of Autotune. I would definitely recommend that you check out our GitHub repos; this demo is also available on public GitHub, in the github.com/kruize/kruize-demos repo. This is the one I was running just now, and you should be able to run it on your own laptop as well. And this is the main GitHub repo.

Now that you've seen a very quick demo, what's really happening here is that the Bayesian optimization is quickly trying to find a particular config that gives you the best result. I usually compare Bayesian optimization to a genie. However, there is one caveat: the genie can only be asked for one wish. You can invoke the genie any number of times, which means you can invoke the Bayesian optimization for any number of experiments, but for every experiment that you run, which may consist of up to 100 trials, there can only be one objective function, only one wish. So you really need to get creative with your wish. It's something like: I want to be on a beach in Hawaii with my wife and kids and walk into my large house with great internet, and so on. Basically, you're trying to put all of your requirements into that one objective function, and then the Bayesian optimization will try to optimize for that particular objective function.
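As an illustration of what one of those trial configurations spans across the layers Autotune discovered (container, HotSpot, Quarkus), here is a hypothetical sketch. It assumes the container image honors a JAVA_OPTS-style environment variable; the container name, image, flags, properties, and values are illustrative and are not the recommended configuration from the demo.

```yaml
# Excerpt from a Deployment pod spec: one trial configuration spanning three layers
containers:
- name: tfb-qrh                                   # hypothetical TechEmpower Quarkus container
  image: example.com/tfb-quarkus-resteasy:latest  # placeholder image
  resources:                                      # container layer: CPU and memory for this trial
    requests:
      cpu: "1.5"
      memory: 512Mi
    limits:
      cpu: "1.5"
      memory: 512Mi
  env:
  - name: JAVA_OPTS                               # HotSpot and Quarkus layers: runtime options
    value: >-
      -XX:+UseG1GC
      -XX:MaxRAMPercentage=70.0
      -Dquarkus.thread-pool.core-threads=4
```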
So you've heard all of the theory so far; let's take a look at some of the results. As I mentioned, we were using the TechEmpower framework, an industry-standard framework with benchmarks for all kinds of runtimes: Java, Golang, Rust, Node.js, you name it. We specifically picked the Quarkus RESTEasy benchmark and ran it on an OpenShift cluster with this particular configuration. And we used all of these different tunables: two tunables at the container layer, a bunch of tunables for the HotSpot layer, and a few for the Quarkus layer as well, and these were the ranges within which they were operating. We had set the Kubernetes requests to be the same as the limits, and we were using the G1 garbage collector with MaxRAMPercentage set to 70. The incoming load was constant, at just 512 users.

We started off initially by saying, okay, we just want to minimize response time. But we quickly realized that, as I mentioned, Bayesian optimization only tries to optimize that one aspect, possibly at the cost of other aspects: the low response time came at the cost of higher CPU usage. Then we did another experiment where we said, okay, fix the CPU usage, but give me a lower response time. This time we found it was giving us higher max response times, or tail latencies. So this was the third take, where we said: okay, genie, give me the best response time, the lowest response time, high throughput, and at the same time keep the max response time, the tail latencies, down, and keep the resources fixed. Essentially, we gave it weightages as well: response time has the highest weightage, throughput comes next, max response time has the least, and we made sure the CPU and memory were fixed so that the cost stays the same.

The 0th value here corresponds to the default, where no changes were made to the application configuration, with the same resources as the rest of the experiment. Here we see that the default was about 14.21 milliseconds of response time. Then we see Autotune coming up with different configurations and trying them out, and we got the best configuration around the 97th trial, where it achieved a response time of 2.39 milliseconds. So you can see that this achieved about 83% better response time, with the throughput being almost the same, and of course the tail latencies were low as well. You can take a look at all of these results; they are available publicly in the github.com/kruize/autotune-results repo. You can see that the max response time in the Autotune case is down, the CPU usage is almost the same as the default, and we got really good response times, about 83% better.

We also calculated the cost of the hardware by looking at the data from the previous experiment for both the default and the Autotune config. We measured how many instances it would take to handle one million transactions and applied that to a matching AWS extra-large configuration, which is about 4 cores and 8 GB, and we observed an 8% reduction in cost with the Autotune config as well. And this is the corresponding best configuration; the right-hand column is the value for each of the tunables that we saw previously. Interestingly, you can see that Autotune has flipped some of the defaults from what the runtime itself sets.
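For that third experiment, the individual goals were folded into a single weighted objective. Here is a hypothetical sketch of how such a combined objective might be expressed, reusing the illustrative CR format from earlier; the expression, weights, variable names, and queries are examples, not the actual function used in the experiment.

```yaml
slo:
  direction: maximize
  # Illustrative weights 2.0 > 1.0 > 0.5 reflect the stated priorities:
  # response time first, throughput next, max response time (tail latency) last.
  # Resources stay fixed so that cost does not vary across trials.
  objective_function: "(1.0 * throughput) / ((2.0 * response_time) + (0.5 * max_response_time))"
  function_variables:
  - name: response_time
    datasource: prometheus
    query: 'sum(rate(http_server_requests_seconds_sum[1m])) / sum(rate(http_server_requests_seconds_count[1m]))'
  - name: throughput
    datasource: prometheus
    query: 'sum(rate(http_server_requests_seconds_count[1m]))'
  - name: max_response_time
    datasource: prometheus
    query: 'max_over_time(http_server_requests_seconds_max[1m])'
```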
So, in summary, if you're an SRE, your first step is to set up observability. Don't forget to tune the hardware. Set the node and pod affinities. Ensure requests and limits are set for all app pods and that they're right-sized. Use app-specific scaling metrics if possible. Ensure that there is no disruption by setting a pod disruption budget. And please do check out Kruize Autotune for autonomous tuning; we do plan to come back to you with some updates. Lastly, do check out the Kruize GitHub repos. If you have any questions, reach out to us on the Kruize Slack or send us a mail. We look forward to hearing from you all. Thank you so much for listening.

Hey, Dhinakar, thanks a lot for the session. It was really informative and very helpful, and I hope the participants benefited from this very informative session. If you have any questions, you can post them in the chat, and Dhinakar is available to answer.

Thank you, Ashok. Happy to answer any questions folks have here.

Looks like there are no questions on the chat that I can see. But I have one question, Dhinakar. Say there is a fresh grad who would like to get into the open source space, what recommendation would you give them? Or maybe someone who would like to switch their career path after, say, 10 years of experience; what recommendation would you give in that case, Dhinakar?

Yeah, I think that's a general question. The first step that I would always suggest is for people to understand their own preferences. There is a wide variety of open source software available today: system software, front end, back end, machine learning, cloud, and so on. There are many different open source projects out there, and I think the best way is to first understand what your own interests are, and then find projects in that particular space. For example, I'm a guy who's been interested in systems technology all my life; I've worked on operating systems, the JVM, and now Kubernetes. So I always tend to look around in this space and see what new things are coming up. And of course, these days I'm interested in machine learning too, who isn't, right? That's the buzzword now. So if you look around, I'm sure you'll find an open source project that you are interested in. That's the first step.

The next step is to find out what the community around it looks like. Find out who the different stakeholders are. Do they have a Slack channel? Do they have Gitter? Is there a mailing list? Go join there and find out the best way to interact. Look at GitHub, obviously, or GitLab, or whatever is around. Look at the issues; most projects these days mark something like a "good first issue" on GitHub issues. Look at those and see what the issues are; they could be simple things like fixing language or other small fixes. You can use those to start getting into the project by understanding the process: how to submit a PR, the basics of GitHub, and things like that. Then you can read more on the topic, look at videos, talk to experts and community folks, see if there's a meet-up, or attend events like this KCD Chennai, understand more about the topic, and then gradually take it up from there.
So that's the way I would look at it.

Sure. Thanks, Dhinakar. That answers my question. And I like the red hat on your backdrop.

Thank you.

Yeah, folks, please make use of this time for Q&A. And if you don't have any questions, then we'll let Dhinakar go so that he can enjoy the rest of the weekend; it's already Friday. Attend the rest of the talks. Thanks, Dhinakar.

Thank you all, it's been a pleasure being here. Thank you so much for organizing this great event.

Sure. Thanks. Thanks, Dhinakar. Thank you. Bye. Bye. We'll have the next session starting in a few minutes, so please hang in there. Thanks.