OK. So hi. Welcome to Cloud Native Live, where we dive into the code behind cloud native. I am Mohamad Sharjah, a CNCF ambassador, and I will be your host tonight. Every week, we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer your questions. In today's session, I'm stoked to introduce Andy, who will be presenting on Kubernetes and automated right-sizing. This is an official livestream of the CNCF, and as such, it's subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of the Code of Conduct. Basically, please be respectful to all of our fellow participants and the presenters. With that, I will hand it over to Andy to kick off today's presentation, and let's add Andy to the session. So hey, Andy, how are you?

I'm good. How are you today?

Yeah, I'm fine. OK, thank you so much. So yeah, you can start.

All right. So my name's Andy. I'm the CTO at Fairwinds, author and maintainer of several of our open source projects, including Goldilocks, which I'll talk a little bit about today. But today, I want to talk about something I've been working on over the last couple of months, and it will slowly be making its way into Goldilocks, which is automated right-sizing. For those who aren't familiar with Goldilocks, Goldilocks is a wrapper around the Vertical Pod Autoscaler project that lets you automatically provision Vertical Pod Autoscalers and then view the resource recommendations for all of the pods in your cluster in a single dashboard. Now, up until this point, Goldilocks has been really focused on recommendations: how do we see what resources our pods are using, and how does that give us a baseline for setting them going forward and tweaking them? What I'd like to start to explore further, as we go into the future, is how we start to utilize the automatic right-sizing abilities of the Vertical Pod Autoscaler in a safe and effective way, so that we can increase the utilization percentages of our clusters. So many of the clusters that I work with are utilizing so little of the resources available in them, because we tend to over-provision. And I know that's a really hot topic right now, because we're all worried about cost across the board. At least a lot of us are right now. So today, I want to show how we can set up a cluster using four different open source projects that automatically sizes all of the workloads in the cluster and allows it to autoscale. So please interrupt me at any time with questions. Keep them coming. And we'll just kind of dive into the setup here, and I will show what we have going on. All right, is my screen share up?

All right. So I have an EKS cluster here, and I have four different technologies running in this cluster that are going to help us out today. The first one is cluster autoscaling. I need to be able to get new nodes in my cluster, and I need that to be relatively flexible, because we need different node types in order to maximize the utilization of our cluster. So we have Karpenter running here. For those not familiar with Karpenter, Karpenter is an open source node autoscaler for Kubernetes. You create an object called a provisioner.
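For reference, a minimal Karpenter Provisioner along the lines Andy walks through next might look roughly like this. It's a sketch, not the exact object from the demo cluster: the fields shown are from the v1alpha5 Provisioner API (newer Karpenter releases use NodePool instead), and the limits and node template name are illustrative.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Let Karpenter pick from compute optimized, general purpose, and memory optimized families
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"]
    # Spot only, to keep the sandbox cheap
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  # Safety cap so an errant recommendation can't blow the cluster up
  limits:
    resources:
      cpu: "100"
      memory: 200Gi
  # Allow Karpenter to evict and repack pods to compact the cluster
  consolidation:
    enabled: true
  # Expire nodes after one day to keep the cluster fresh
  ttlSecondsUntilExpired: 86400
  providerRef:
    name: default   # illustrative AWSNodeTemplate name
```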
And so we can take a look at the provisioner here in the cluster, and the provisioner essentially gives you the ability to say, OK, I want these kinds of nodes, but it also allows you to specify a certain amount of flexibility. So there are some values in here that are important. We have instance categories. I've listed three different instance categories that we can get in this cluster: C class instances, M class instances, and R class. So that's compute optimized, general purpose, and then memory optimized instances. I want Karpenter to be able to pick nodes that have different balances of CPU and memory based on the workloads that I'm going to deploy in this cluster. In order to save on costs, I'm allowing it to only provision spot instances. That's kind of up to you whether you want to do that in your environment, but I'm doing it because this is a sandbox and I don't want to spend a ton of money on it. And then you can cap the amount of resources that Karpenter can provision. This is just sort of a safety thing for me; I don't want my cluster to blow up. We'll talk about some of the pitfalls of automatic right-sizing later on, and I will explain why this has to be in there. And then the only other thing that I like to enable in Karpenter, which is not super related to...

Hey, Andy, can you zoom in a bit? I guess it would be helpful for the folks.

No problem. Yeah, I guess that's good. All right. So the only other thing that I like to do in Karpenter, which is not required for automatic right-sizing, but I think it does a nice job of keeping my cluster fresh and allows me to do upgrades more easily, is set a TTL on my nodes. So my nodes will expire after a day; no node will live longer than a day. With varying load in my cluster, that tends not to happen anyway, because Karpenter is constantly rebalancing. And that actually reminds me of the last thing, which is consolidation. This gives Karpenter the ability to evict pods, move them around to different nodes, and sort of rebalance how the cluster is structured, which is really important for automated right-sizing. So Karpenter is the first tool we have in place. Again, we've got consolidation enabled: true, we've got multiple instance types that it's allowed to use so that we can get different topologies, and then that ttlSecondsUntilExpired is just something that I think is a nice-to-have. So that's the first tool that we have configured in our cluster, and I installed it with the Helm chart. If we want to look at my values file, it's fairly straightforward. It's really just giving it the IAM role ARN that it needs to do its job, some service monitoring so that I can get metrics in the cluster, and really not much else, just some security things that I have enabled in this cluster. So there's not a ton of configuration for Karpenter other than adding that provisioner that I just showed.

So, I see we have some folks with video issues. I think we look okay on my preview, so I'm gonna keep going. The second tool that we have configured in the cluster is the Vertical Pod Autoscaler. We are going to install that with the Fairwinds chart, which is at github.com/FairwindsOps/charts. We have a VPA chart that allows you to install the Vertical Pod Autoscaler. I'm using the latest version of it, and I'm honestly using a good portion of the defaults, except I'm enabling the admission controller, which I believe is not the default in our chart.
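A values sketch for the Fairwinds VPA chart, in the spirit of what's described here and in the next segment, might look roughly like this. The value keys (admissionController.enabled, recommender.extraArgs) and the Prometheus address are assumptions to verify against the chart's values.yaml; the recommender flags themselves are upstream VPA recommender options.

```yaml
# values.yaml excerpt for the fairwinds-stable/vpa chart (keys assumed; check the chart)
admissionController:
  enabled: true   # required for automatic right-sizing via the mutating webhook
recommender:
  extraArgs:
    storage: prometheus
    prometheus-address: http://prometheus-operated.monitoring.svc:9090  # illustrative address
    pod-recommendation-min-cpu-millicores: 15
    pod-recommendation-min-memory-mb: 100
```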
And that's so that we can actually do automatic Vertical Pod Autoscaling. So we need to have a certificate in place. I'm using cert-manager to generate that certificate and manage the mutating webhook configuration. And beyond that, I think the VPA is a fairly standard configuration. Oh, the last thing we have to do: we want long-lived data to feed our Vertical Pod Autoscaler. This is super important to getting accurate recommendations from the Vertical Pod Autoscaler. So we have it hooked up to Prometheus. In the recommender, which is one of the components of the Vertical Pod Autoscaler, we give it a Prometheus address, we give it a minimum CPU and a minimum memory, and we say the storage type needs to be Prometheus. That will allow our Vertical Pod Autoscaler to reference this Prometheus in the cluster to get metrics. So if we go take a look, we can see we have Prometheus running in this cluster. We have things like machine CPU cores available, we have all the various metrics. I'm using the standard kube-prometheus-stack installation to get that. I seem to have lost my comment feed, so I'll have to rely on you to throw questions at me as they pop up.

Okay, sure, sure. I will do that.

No, it's cool. All right. So, kube-prometheus-stack is collecting all the metrics in the cluster, with a relatively default configuration, and then the Vertical Pod Autoscaler is pointing at that Prometheus, using Prometheus as its storage. So we've got Karpenter, we've got the Vertical Pod Autoscaler, and a couple of different values associated with those. Now we get into the next bit: let's talk about Goldilocks. Goldilocks allows you to create VPAs for all of your workloads. So if we look in this cluster and we do a get on Vertical Pod Autoscalers across all the namespaces, we'll see that we have one for every single workload in this cluster. There are quite a lot of different workloads. We're running Argo CD, we're running Prometheus, we're running a few different demo apps. We've got one in the team-one namespace, we've got a Yelb app, and we've got a basic demo app running as well. I'll talk about the actual applications in a minute, but we can see we have a lot of different Vertical Pod Autoscalers. In order to do that, we install Goldilocks using a Helm chart, like I've mentioned, from the same Fairwinds stable repository. And the only thing that we're doing here that's not standard is that we're setting this on-by-default flag. What this does, when we tell the controller and the dashboard on-by-default, is that we don't have to annotate the namespaces that the objects are in in order for Goldilocks to create Vertical Pod Autoscalers. It will just create one for everything in the cluster automatically, no matter what. So Goldilocks, when we configure it this way, will create a VPA for every object, but it won't be turned on. It'll be in mode off, which is just the recommendation mode, which is the default for how Goldilocks operates. So the last thing that we do is modify how Goldilocks creates these VPA objects. So let's go ahead and get the namespace yelb. This is one of our demo applications, the Yelb application. And we've added two annotations to this namespace. This is a sort of not-well-known feature of Goldilocks that allows you to modify how the VPA is created.
So the first one we have here is goldilocks.fairwinds.com/vpa-update-mode set to auto. That's going to put all of the Vertical Pod Autoscalers in the yelb namespace into this automatic mode, which means that when a pod gets created in that namespace, the mutating admission webhook is going to set the resource requests for the pods created in that namespace. And you'll notice it's turned on in auto for all of my namespaces. So every single pod that gets created in this cluster has its CPU and memory requests set by this mutating admission webhook in the cluster, which is a little bit terrifying, mutating things on the fly as you're creating them all the time, which is why I'm doing this in a sandbox cluster. And we'll talk about some more of the pitfalls of that later. And then the last thing that we have is the ability to control minimums and maximums via this container policy, or this VPA resource policy annotation. It's probably easier if we look at the VPA object that gets created itself. We'll take a look at this yelb-ui VPA in the yelb namespace, and we will see that Goldilocks has added this resource policy here that defines the behavior of the Vertical Pod Autoscaler's automatic right-sizing in this namespace. This applies to all containers in every pod, because we have a star here. And in this case, we're saying the maximum allowed is four CPUs and six gigs of memory. I think it's important to pick a value for this. I had some early experimentation that was really interesting, where the VPA, when it doesn't have a full amount of data, will sort of recommend potentially very large or very small amounts. And in the case of it recommending very large amounts, what can happen is, say it thinks it needs 16 CPUs, right? I think I had one where it said this pod needs 16 CPUs. Well, that wasn't accurate, but it went ahead and modified that pod to request 16 CPUs. And then Karpenter, in all of its flexibility, very happily obliged, and I think it created an m5.12xlarge in my cluster, which is a very large instance size that I was not expecting. So you want to have these caps on here just for a little bit of safety. Now, those are controlled again through that annotation that we showed earlier; it's just a JSON format of this container policy that we're looking at. So you can modify that on a namespace level. You can even add specific containers that are allowed to request more, but I definitely recommend having this resource policy in place just to cap things. The other thing you can use is a limit range, which is a Kubernetes object; the Vertical Pod Autoscaler respects limit ranges as well. I found this resource policy to be a little bit more flexible and easier to work with than the limit range object. So to recap, we have all of our containers being automatically resized by the Vertical Pod Autoscaler. You can see we have a lot of different recommendations. The ingress-nginx controller here is requesting four CPUs, and I can't do the translation from bytes to gigabytes in my head, but that's a decent amount of memory it's requesting there. So we have recommendations for all of our pods, we're collecting all of the metrics in Prometheus, we're letting Goldilocks create these, and then Karpenter is giving us new nodes in our cluster based on the requests coming into the cluster.
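Putting the Goldilocks pieces just described together, the namespace annotations and the resource policy they produce on the generated VPA look roughly like the sketch below. The update-mode annotation is the one shown in the demo; the exact annotation key for the container policy JSON and the generated VPA name are assumptions to confirm against the Goldilocks documentation.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: yelb
  annotations:
    goldilocks.fairwinds.com/vpa-update-mode: "auto"
    # Annotation key assumed; the value is the containerPolicies block as JSON
    goldilocks.fairwinds.com/vpa-resource-policy: |
      {"containerPolicies":[{"containerName":"*","maxAllowed":{"cpu":"4","memory":"6Gi"}}]}
---
# Excerpt of a VPA that Goldilocks would generate in that namespace (name illustrative)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: goldilocks-yelb-ui
  namespace: yelb
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: yelb-ui
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        maxAllowed:
          cpu: "4"
          memory: 6Gi
```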
So you can kind of see all of the different dynamic pieces going into this that allow all of these workloads to right-size. And so the last thing that we need to talk about is horizontal pod autoscaling. We're vertically sizing, we're setting our requests and limits, but we also have applications, like this one, let's go to the yelb namespace, that need to horizontally scale. So if we take a look here, we'll see that we have two replicas of the app server running. That should actually be more; I'm not sure why, we'll get into that in a minute. The dangers of a live demo. But we need horizontal pod autoscaling. Now, a typical way to horizontally scale your pods would be with an HPA object, and you would maybe set that to some target of CPU utilization for that group of pods, or average CPU across your pods. But if we're also vertically scaling on CPU, we don't necessarily want to horizontally scale on CPU, because those two will be at odds with each other and possibly conflict with each other. So the real key to this is being able to horizontally scale on a separate metric. And since we already have Prometheus metrics, if we go take a look here, we could get something like nginx_ingress_controller_requests. We're using ingress-nginx, we're getting requests, and we can divide that up by the different ingresses in here. So for example, for this Argo CD ingress, we can see how many requests it's getting. We also have metrics such as network traffic coming in, or latency for this particular ingress. We have lots of different metrics available, but setting up the HPA with those can be a little bit difficult. So I'm gonna add in a fourth project, actually I guess we're up to five now, because we've got Prometheus, Goldilocks, Karpenter, and the Vertical Pod Autoscaler. So I'm gonna add a fifth project, which is KEDA, or however folks say it. I don't know, I don't think we can do a poll here, but I'm curious how people pronounce it. I'm gonna go with KEDA for today. So KEDA is a nice controller that allows you to sort of create horizontal pod autoscalers with a different spec, called a scaled object. I'm installing KEDA in the keda namespace with a fairly standard set of values, I think just setting some resource requests and adding some Prometheus information so that I can get metrics. Other than that, I'm using a fairly stock install of KEDA. And so with KEDA, what we get is these scaled objects. The nice thing about these scaled objects, and let me switch tabs here for a second, is that if we take a look at the spec for, say, the yelb-appserver scaled object, this is a very straightforward spec, very similar to a horizontal pod autoscaler, that allows us to specify a Prometheus query out of the box. So we can say: here's where my Prometheus lives (this should look familiar from when we configured the VPA), this is what I want to call this metric, this is the threshold that I want to shoot for per pod, so this would be 10,000 requests per pod, and then I can put in a query. So if we take this query for the yelb app server and we go punch that into Prometheus, we can see the value of it at this given time. Right now it's very low; we just saw that it's scaled all the way down. But what this does is it creates an HPA for us and then serves this metric on one of the Kubernetes metrics endpoints so that our HPA gets automatically configured for us.
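A KEDA ScaledObject along the lines of the one shown for the Yelb app server might look roughly like this. The object name, Prometheus address, query, and replica bounds are illustrative; only the per-pod threshold of 10,000 and the general shape of a Prometheus trigger come from the demo.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: yelb-appserver
  namespace: yelb
spec:
  scaleTargetRef:
    name: yelb-appserver          # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090   # illustrative address
        metricName: yelb_appserver_requests
        threshold: "10000"        # target value per pod
        # Illustrative query; shape and labels depend on your ingress-nginx scrape config
        query: sum(increase(nginx_ingress_controller_requests{exported_namespace="yelb"}[5m]))
```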
So if we look at the HPAs currently available here, we'll see, if I'm in the right namespace, geez, we'll go back to the yelb namespace, here's that yelb-appserver HPA, and we can see that 10,000 target. Currently we're at 277, so we're at the minimum of two out of 40 pods. But I didn't have to create this HPA. I didn't have to write a Prometheus metrics adapter to do that. KEDA just did all of this for me. So I'm a huge fan of this project, and it is really the last piece of this, so that we can scale all of our pods horizontally based on metrics other than CPU and memory.

So then the last thing I need to do, since this is not a real environment, which it is not, is generate some load. So I'm gonna go over here real quick and just double-check on my load generation, because it doesn't seem to be working. I'm using a tool called k6. This just runs a whole bunch of load against various endpoints. And that seems to be working. We'll see what happens. Yeah, okay. That window wasn't important. I'm not gonna worry too much about it.

But, so to recap again, because I think there's a lot going on here and it's sort of tough to put all the pieces together in your head if you haven't done this before: we've got the Kubernetes metrics coming in through Prometheus. We've got the horizontal pod autoscaler that is configured via KEDA, using Prometheus metrics that are not CPU and memory, to scale horizontally. We've got our Vertical Pod Autoscaler scaling pods up and down in their resource requests. And then we have Goldilocks creating those VPA objects automatically for us and configuring them in that automatic mode. So in theory, everything should be completely dynamic. As I increase load on the cluster, we should see the vertical size of some of these pods getting bigger as they start to actually consume resources. We should see horizontal scaling, where all of the different horizontally scalable workloads in the cluster scale in and out. And then we should also see the cluster creating new nodes to accommodate those, and then maybe reshuffling them over time.

And so as you start to think about this, you're like, oh, how am I supposed to wrap my head around this if there's something wrong, right? How do I look at this happening? So what I'm working on putting together, and what I have here, is a dashboard that sort of brings all the pieces together in Grafana. I'm gonna switch over to that yelb namespace that we were just looking at, and I'm gonna say, let's look at the yelb-ui ingress, because it's the only one in this namespace, and let's just say all HPAs in this namespace. And let's go to the last 24 hours, because I think we have some peaks in the last 24 hours that we can look at. We have a few different graphs we can look at. We can see the HPA. We just saw that the HPA target value was 10,000. Earlier here, we had the actual value hovering right around that. So this is 10,228 requests coming in to that app server. Let's just zoom in on that time period. This is the HPA doing its job, right? The HPA wants to keep the number of replicas such that the per-pod number of requests is at 10,000. So this is good, we wanna see that. And then we have, on the lower end, the UI pod. So this is a multi-tiered thing; there's a UI and there's a backend. The UI is also scaling horizontally. So that's the HPA in action doing its work. We can look at latency.
So the next thing that we have to consider is the balancing factor, right? We can vertically scale and we can horizontally scale, but what balances that? What's on the other side of the equation? Generally, on the other side of the equation is some sort of performance metric, right? We need to have enough resources to have good performance in our cluster. So we see here, we're tracking latency on the yelb-ui ingress. This particular one has been sort of high, around 600 milliseconds. I'm not actually sure why; it's something I've been planning to look into. But we want to have another metric to balance against. And one thing that I've done in other namespaces is add this latency metric as a second scaling metric for the horizontal pod autoscaler, because that KEDA spec lets you add multiple metrics as targets, so you can do multi-metric pod autoscalers. So that is one option for balancing performance with your resource requests. Down here, we just have the raw number of requests coming into the ingress.

And then over here is where we start to look at the VPAs, the Vertical Pod Autoscalers. So if we look here, and I'm just gonna filter down to the app server because it's a little bit easier to see, we've got the target from the Vertical Pod Autoscaler, which is its recommendation, and then we have the actual request. Now, these should be the same, and they're not quite, which might be worth looking into, but they are very close. So we can track what the Vertical Pod Autoscaler is doing for CPU, and then over here we have memory. So let's go ahead and filter this one down as well. We have memory at exactly what the target is and the request. So we can see the Vertical Pod Autoscaler doing its job, modifying the requests as they come in. And then we can look down here at actual utilization. We'll filter down to just the app server again: we're using 60 millicores, and the VPA is targeting 25. And I'm looking at a very small window of time in this particular graph, just so that we can see the graph. My guess is, if this hovers up there for long enough, the Vertical Pod Autoscaler will start to bump that up. And then same with memory: if we look at just the app server, for some reason we're hovering around 1.7 gigs, not sure what's going on there. So this particular app behaves a little weirdly, it might not be the best example, but we have a place now where we can start to see all the different pieces.

And then over here we have just a little graph showing Karpenter doing its job. So this is how many machines it's created and terminated. I've been working on adding graphs for seeing the types of nodes. But if we take a look at the cluster right now, we can see we have our base instance group. So we have a single managed instance group that allows us to run the Karpenter controller and things like that. I've changed that to a C5 instance type, because I've noticed that this cluster is particularly CPU heavy. But then the ones that have a provisioner listed, these were created by Karpenter. So we have a c5.2xlarge and a c5.4xlarge. Obviously Karpenter also recognizes that we're very CPU constrained in this cluster, not memory constrained, and we can see how it's reacting to that by giving us compute optimized instances. Another way we can look at how this is functioning is a tool called kube-capacity, written by Rob Scott, if anybody's familiar with him.
And so if we look at kube-capacity and we add the utilization flag, the -u flag, we can see that we have, well, maybe, there it goes, we have a CPU utilization of 83%. That's pretty good across the cluster. I don't see that very often, not that high. And then we have a memory utilization of 20%. Now that seems a little low; I would love for that to be higher. But if we dig into that a little bit, we can see that we're using 10 gigs of memory across the cluster, and we're using 24 and a half CPUs. That is roughly a two-to-one CPU-to-memory ratio, and if you look through the instance size list available in Amazon, you will find no instances that give you that sort of spread of memory to CPU. So I've been having a debate with some of my coworkers about whether I just have some non-ideal workloads running in this cluster that are not your average workload, or whether there's something to dig into further there. But having a CPU utilization above 80% feels really good to me, and it's sort of what I'm going for here, along with, if we go back to maybe one of our other demo apps, a latency value hovering under 100 milliseconds for this particular application. So that feels pretty good to me. And so, yeah, well, it was here, it went back up here, so we'll have to look into that. But we also dropped our requests.

I'd also like to share just a little bit about the demo apps. We have, I think, three of them running in this cluster. We have Yelb, which I've mentioned a couple of times, which does not seem to be functioning at the moment. Always got to break something, don't we? Nope, all right. We have this demo application, which just constantly pings the backend and kind of shows you which pod it's talking to. The color is a little off, so you can't see that, but that's the name of the pod that's being hit. So we can see the horizontal autoscaling in action here. And then we have the Emojivoto app from our friends at Buoyant, who make Linkerd, where you vote on an emoji. And you can see how many votes they have, because I'm generating traffic against this. So some of these have a lot of votes, 180,000 or so. So those are the three apps we're running.

And yeah, so that's kind of the general setup. And obviously it only took me a half hour to get through the whole setup, but it took me probably two weeks to build all this and get it working. The goal in the future is to make this easier to do. It's such a complex process, there are so many different pitfalls, so many different levers you can pull and knobs you can turn, that we want to start to understand how all of these tools work together and then build an easier story going forward. So that's kind of my future goal here. Do we have any questions?

Do you have any questions? I guess not, I have no questions because it was very smooth. Okay.

Okay, yeah. So I think an important thing to talk about is the various pitfalls. I've talked about a couple of the issues, one of them being that the VPA requires eight days of data to really give a good recommendation. And so if we go back and we look at Prometheus and we grab a CPU utilization graph, we can grab really any graph here, we need to be able to see a week of data, right? So I've got a Prometheus instance set up here that's retaining eight days' worth of data. That can be a considerable amount of information to store for any Prometheus instance, depending on the size of your cluster.
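For reference, the retention piece of that setup is just a kube-prometheus-stack values tweak; a minimal sketch, assuming the standard chart layout, might be:

```yaml
# values.yaml excerpt for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 8d          # keep roughly the window the VPA recommender wants to see
    retentionSize: 50GiB   # optional size cap; illustrative value, tune for your cluster
```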
So that's one thing to worry about: how are we storing all of this Prometheus data? For long-term storage, a week may not be a problem. It hasn't been too bad for this cluster, but it might be in a much larger environment. So that's the first thing to consider. The second is understanding how the VPA works. Over those eight days, the Vertical Pod Autoscaler uses a decaying histogram of the CPU utilization, and it uses memory peaks over an interval, to generate its recommendation. And it can only set requests; it cannot set limits. When it sets the requests, it can adjust the limits proportionally, moving them up or down to keep the original ratio, but it won't set limits by default. So actually, in this cluster, I have very few CPU limits or memory limits set. This might be risky in certain environments or for certain workloads. That's something to evaluate if you go to set something up like this: do I need CPU and/or memory limits? I know CPU limits are a hotly debated topic, and I won't dive into the details of that today, but do I need them, and where should I put them? Because having them is going to limit the ability to scale or to utilize more resources as effectively. I had a really hard time getting to that CPU utilization of 80% across the cluster without removing limits. So that's definitely something to consider in your individual evaluation.

The next thing to think about, and let me just pull up my notes here, is the dangers of using Karpenter. I pointed out earlier, you might get an m5.12xlarge or a c5.12xlarge spun up because of an errant recommendation from the VPA, so capping is important. The other thing to be aware of with Karpenter is that there are lots of ways to use node selectors and resource requests or annotations that restrict Karpenter's ability to function. So I could create a pod that says, I want to run on a c5.12xlarge specifically, and that will force Karpenter to create a c5.12xlarge. Well, maybe that's not the best choice for the balance of price and compacting workloads that Karpenter wants. And so what I recommend is using some sort of policy engine to restrict that in your cluster. So I actually have in this cluster, oops, wrong window, there we go, some OPA policies. These are being applied by Fairwinds Insights; I won't talk too much about Fairwinds Insights today. But we have some OPA policies for Karpenter that restrict specific things, specifically the ability to use the node selector karpenter.k8s.aws/instance-family. I think there may be other node selectors that Karpenter respects that I need to restrict. But essentially we're saying, you can't create a pod in this cluster with this node selector, and we're enforcing that via OPA at admission time, so that we don't let workloads in the cluster mess with Karpenter's ability to compact. So that's one thing to be aware of with Karpenter. I think I've talked about this in other content that we've put out before, but use some sort of policy to control the workloads coming in so that they can't break Karpenter.

The other one that we have is the Karpenter do-not-evict annotation. You can tell Karpenter not to evict a pod, ever. But what that means is that you can't scale down. You can't let Karpenter move workloads around and compact the cluster, because it can't evict pods. That's the mechanism by which it moves pods around.
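The alternative described next, a pod disruption budget that lets consolidation proceed one pod at a time, might look roughly like this; the names and labels are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
  namespace: emojivoto
spec:
  maxUnavailable: 1        # evictions (including Karpenter consolidation) can remove at most one pod at a time
  selector:
    matchLabels:
      app: web-svc         # illustrative label; match your Deployment's pod labels
```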
So instead of using do-not-evict, what we use is pod disruption budgets on some of our apps. If we look at the various services for the Emojivoto app, we have a pod disruption budget of maxUnavailable: 1, so it can only evict one pod at a time. That way we're not affecting performance, we're not letting Karpenter just wipe out our whole service, but we're using these pod disruption budgets to protect us. Some of this is autoscaling 101, things that you should be doing no matter what, whether you're automatically right-sizing or not. If you're horizontally scaling and you have multiple replicas, you should probably have a pod disruption budget that'll protect you in the event of nodes being drained for an upgrade or various other events like that. So pod disruption budgets are super important.

Another thing to be aware of, we talked about performance a little bit, but especially when you're using an Ingress controller: I think one interesting thing here is we're automatically right-sizing not just the workloads, but the Ingress controller that serves those workloads. So we have a horizontal pod autoscaler that is, I believe, working on the query we're using here. I think it's requests per second, an average value of the total HTTP requests metric. So just the pure number of requests coming into the Ingress controller. But it needs to scale relative to all the workloads in the cluster, because it's serving all of the traffic, and so your Ingress controller may end up getting fairly large. If we top the pods in the ingress-nginx namespace, we'll probably see we're using five CPUs per instance of the Ingress controller, and we have four of them. So you need to be aware of the relative scaling of your funnel. If you have everything coming through the Ingress controller and then fanning out, be extra sensitive about how this particular workload scales. I think that's super important to be aware of.

And then really just monitoring. I had to do a lot of extra stuff to get all of these metrics into Grafana for these various workloads. So something to be cognizant of is all of your Prometheus configuration to get that working. In order to get the VPA state into kube-state-metrics, you have to add a custom resource state configuration for the Vertical Pod Autoscaler. So this is pulling the VPA container recommendations into Prometheus so that I can see them. I'm also starting to pull in the Karpenter labels on the various nodes, so that I can keep an eye on what instance types are being put into my cluster. That gives us that kube_node_labels metric in Prometheus, so we can see specifically, this is a c5.4xlarge here, a c5d.2xlarge there. So those nodes that we saw earlier, now we have them available in Prometheus. We can create dashboards and monitor that and keep an eye on the size of our cluster.

All right, and then just to kind of show the history of working on this, this is the last month of data I've got in this cluster. We started out at 28% CPU utilization, or down at 16 here, and we've gone all the way up to 87, and we should be hovering right around 80-ish here over the next few days. And then memory, we know from what I've shown before, is still fairly low, something that I'll be working on in this particular environment. So, yeah, I was hoping for more questions. There's a lot going on here. See if we can get any.

Yes, I guess, till now we don't have any questions, but there was one question earlier, about which you would choose.
It came in between, when you were actually showing the lab, right? But yeah, other than that, there are no questions left. I guess, viewers, you can ask your questions, whatever doubts you have. But other than that, yeah, you can continue the session, I guess.

Okay. Yeah, let's see, trying to think if I have a ton else here to cover. We can talk a little bit more about the various queries that we're using to horizontally scale. So I've been experimenting with different ways to scale. You may not have an Ingress metric available. You may only have the base Kubernetes metrics for your pod, or just what comes out of the box with kube-prometheus-stack. And so we need a metric that is not CPU, not memory, but maybe we don't have Ingress requests, maybe we don't have latency to scale on, maybe we don't have something like that. So what else can we scale on? So here we're actually scaling on container network receive bytes. So just the raw amount of data coming into that container, which is a rough approximation for how many requests, or how much load, that container has. The issue that we're seeing with this is that the threshold is wildly different depending on the application, the size of requests. Oh yeah, we have a question.

Andy, I guess we have a question.

Great. So, okay: for AKS or other flavors of Kubernetes where Karpenter is not available, what are the options to try this out?

That's a great question. I'm really hoping that Karpenter expands out into the other cloud providers in the near future, but I know that's not necessarily a priority for them. I know a lot of those folks work for AWS, and I get that, that makes sense. But I'll focus on the ones I know better; I'm not super familiar with AKS, so I'll focus on GKE. I know you have the ability to give GKE control over your instance sizes and allow it to sort of dynamically pick instance sizes. The other thing you can do is just use slightly more traditional cluster autoscaling, but be more cognizant of your node sizes, or your node types, right? So Cluster Autoscaler works in all the cloud providers. It will give you more and fewer nodes based on demand, on how many pods you have. And so I would definitely recommend starting with that. It's not as intelligent, it can't pick instance types, and so what you need to do in that case is monitor that utilization: look at, say, that kube-capacity output, or, I think out of the box actually from kube-prometheus-stack, we have the cluster utilization dashboard that we can see here. So we can see the CPU saturation and start to look at that balance between CPU and memory usage. If you have 16 cores total that you're using, or just call it 16, and you have 32 gigs of memory in use at load, then something with a one-to-two ratio, which is a compute optimized size, is gonna be appropriate. And then you can adjust your Cluster Autoscaler settings to do that, and potentially utilize multiple node groups within Cluster Autoscaler to give you options, if you have enough disparate workloads to need maybe some memory optimized and some compute optimized, or things like that. So any type of cluster autoscaler will get you closer to automatic resource management. It's just not going to be as intelligent as something like Karpenter.
And then there are commercial options, like Spot. There's, I think it's cast.ai, which has an AI-driven spot instance provisioner. I think both of those work across multiple cloud providers. So there are commercial options to look into as well. So, great question. Thanks.

Great.

All right. We were talking about other metrics that we can scale on. So yeah, container network receive bytes, or transmit bytes, or using both of those metrics, can work. But if we go take a look at this metric, we'll go back to our Prometheus, and I'm going to zoom back out just a little bit because we're going to show the graph. This is for that particular workload for the last week, but let's drop the namespace and pod requirements, and we'll just take a look at this graph for all the workloads in the cluster. Receive bytes total should be getting more than just... oh, we need to drop the sum. That's why, because we're just adding them all together; this is the entire cluster. So if we just drop the sum and we look at the rates for all the workloads across the cluster, it might take a minute to query the last week, but we'll see that the levels are so different. I mean, it's entirely dependent on the app. If we take a look at this demo app here, we're just sending a little ping request every few seconds to the pod, so that's a very small request. Whereas this particular application, when it needs to send back the full list of all the votes, that's a much larger response, so it's going to be transmitting more data. Same with the emoji one; this is much more information being transmitted back to me. So it can work, but you have to sort of tune that autoscaler in order to do that. See, I'm running out of the data I'm querying here. But these are very different apps. So, something to keep an eye on. Another thing you could use is just the raw number of connections, or other metrics that are available in Prometheus. So that's one of the things that I'm starting to look into: how can we generalize the horizontal pod autoscaler metric, to make a recommendation for a starting point, so it's much easier than trying to craft this yourself and then go tune this threshold number? So that's one area to look at.

And then the other thing is multi-faceted autoscalers. So here we can see that this scaled object for the web service on the Emojivoto app has multiple metrics. We're scaling on that receive bytes total; notice the very different threshold for this app that I've had to tweak over time. And then we're also scaling on the P95 latency, using the Ingress controller response duration metric for this app. And so we're trying to keep our 95th percentile latency around half a second, or actually this comes back in milliseconds... no, no, we're trying to keep it around 500 milliseconds or lower. So we have two metrics here that we're horizontally scaling on, that sort of balance each other out, or hopefully balance each other out. So another thing to experiment with, if you're going to go down this route, is really thinking about what's important about what you're horizontally scaling on. If latency is the most important thing for that application, maybe you should only be scaling on latency.
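A multi-trigger scaled object in the spirit of the one just described might look roughly like this. Everything here is illustrative rather than copied from the demo cluster: the names, Prometheus address, queries, and thresholds would all need tuning per application, as noted above.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: emojivoto-web
  namespace: emojivoto
spec:
  scaleTargetRef:
    name: web
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    # Rough proxy for load: bytes received by this workload's pods
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: web_receive_bytes
        threshold: "500000"   # bytes/sec per pod; highly app-dependent
        query: sum(rate(container_network_receive_bytes_total{namespace="emojivoto", pod=~"web-.*"}[2m]))
    # Performance guardrail: p95 ingress latency, in seconds
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: web_p95_latency
        threshold: "0.5"      # roughly a 500 ms target
        query: histogram_quantile(0.95, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{exported_namespace="emojivoto"}[5m])) by (le))
```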
Scaling only on latency is going to affect your automated right-sizing; you may end up with somewhat over-provisioned resources to provide that, but that may be a trade-off you're willing to make. So I think the hardest thing about all of this is choosing metrics and then tuning those different numbers to values that you want. And in that case, load testing is super important. You need to be able to generate some sort of even remotely realistic load against your service in a non-production environment in order to play with these values. So again, I used k6, which is actually now owned by Grafana, to run the load, and you just kind of write little JavaScript snippets that hit your app, and it can run those. If we go back over here to this tiny window, I'm running, let's see, this one's running 100,000 requests, this one's running 200,000 requests, and running them pretty quickly against the app. So we're generating a decent amount of traffic, as evidenced by some of our requests here. And actually, we can go take a look at just the Ingress controller graph here: we're doing 589 requests a second. It's not massive, but for a small cluster like this, that's more than it does just sitting there day to day doing nothing. And so we actually get some real numbers here. So load testing is super important to be able to tweak these numbers and set up all these different things.

All right, well, now's the time to ask all the questions if you have them. So again, six different technologies today: we've got Prometheus, we've got KEDA, we've got the Vertical Pod Autoscaler, we've got Karpenter, or Cluster Autoscaler depending on where you're at, and then Goldilocks to sort of tie it all together. So keep an eye on Goldilocks over the next six months or so. Hopefully we'll be releasing more features related to this sort of concept of automated right-sizing. It's really where I want to take Goldilocks in the next iteration. I think recommendations were a great place to start, and I think folks appreciate those. And the Goldilocks dashboard won't be going anywhere. We'll still be showing you your recommendations here; it's just that maybe you didn't set them, because you're letting the VPA take them over now, and we'll be showing that as well on the dashboard. I think that's all I have for today.

Yes, let's see if any questions pop up. All right. Okay, let's wait for the questions to pop up. Okay, so other than this, what is your insight on this? You've talked about a lot, right? So some best practices regarding these things, these tools, what do you suggest? What should be done?

You know, that's a good question. I think a lot of this, tying all of these different projects together, is very experimental at the moment. I've definitely had some mistakes in this testing process where I've blown up the cluster really big, or the entire thing's fallen over because nginx isn't getting enough CPU and it just can't serve any traffic. And so there's a lot of risk associated with trying to automatically right-size. So as a best practice, don't run it in prod right now. If you're gonna go down this route of automatic right-sizing, test heavily in your non-production environments before rolling it out and being confident about it. But I think it's a good path to being able to get better utilization out of our clusters.

Yeah, we have, I think we have a question. Can you hear it? Yeah.
So, how do you use Goldilocks for apps that have burst spikes in memory usage?

That's a good question, and that's a tricky one with the Vertical Pod Autoscaler. If it's bursty in a semi-consistent manner, in that you're going to get bursts throughout the day, the Vertical Pod Autoscaler should account for that, because it is using memory spikes within a window to calculate its target. So in theory, it should be able to handle that. Now, if you think you're gonna have, say, a spike once a week that you need to account for, then I think aggressively setting a memory limit higher than your memory request may be the right route. You do have to be careful with having memory limits higher than the memory available on a node, or doing that in too many places. But that is one potential way to mitigate it. And then, the VPA will also take OOM kills into account. So if you get OOM killed, the VPA will bump up the next recommendation to a higher memory amount, and so it will sort of self-correct over time, potentially. But that's the sort of thing where, if you're expecting specific burstiness, you can test it, right? Write a load script that generates that burst and see how that looks. And then the other option is, for those particular workloads, just don't use Goldilocks, don't use the VPA.

So I think he also added a few things, like: with a burstable QoS config also, I see apps getting OOM killed.

Yeah. You know, I did just mention that if you're seeing OOM kills, definitely bump that memory limit up and keep an eye on that. But burstiness can be inherently difficult. The other option is to drop the memory limit and let the node OOM kill it if you run out of memory. But then you've got to watch your node sizes and your system-level OOM kills, not just your Kubernetes OOM kills.

All right. Yeah, I guess there are no questions left. Let's see if any questions pop up. Okay, so yes, okay. So other than this, that was an awesome session, and we've been visually presented with a lot of graphs and so on and so forth. So yeah, that was awesome. Is there something you would like to add? Okay, there is another question.

Okay. So, what happens with KEDA when scaling on memory or CPU limits?

I don't quite follow the question, but I'll try and answer as best I can. KEDA can definitely scale on memory or CPU. I have intentionally not done that here, because it conflicts with the VPA. So if you try to run the VPA and KEDA, or any autoscaler, on CPU at the same time, you're going to get unexpected results. So that's the caveat there, which is why I'm not using CPU or memory. So I think that answers the question, but feel free to correct me if I'm wrong.

So yeah, Arpan also mentioned this one: KEDA is not VPA, if I'm not mistaken.

Yep, yep, correct. Two separate projects. Yeah.

Okay, yeah. I guess so. All right. Yeah, people are saying thanks for your great session. So I guess if there's nothing more to add, we can end this session. Yeah. Great. Okay. Thanks for listening. Yeah, thank you, Andy, for your great session. Yeah. Y'all have a good one. Another one, I guess, at the last moment. Let's take this question. One more question. All right. Yeah, let's see.

A scenario with a bunch of different daemonsets that are potentially resource hungry.
Adding a new Kubernetes worker node won't be a good idea, so vertical scaling should be considered.

Yeah. So daemonsets are an interesting thing to consider here. Daemonsets would be considered sort of the node overhead in the calculation Karpenter makes about whether it should add another node or not, and Karpenter does have the ability to take that into account. So ideally that should be handled by Karpenter. But yes, vertically scaling, using larger nodes, in a case where you have very hungry daemonsets is a great idea to avoid just wasted overhead on your nodes. But like I said, I believe Karpenter should be taking that into account. So, something to consider and be aware of if you have lots of daemonsets. That's a great point to make. Thank you.

Okay. So, how is Goldilocks different from Kubecost?

Ah, well, Goldilocks is older, that's for sure, it's been around longer. But Goldilocks uses the VPA just to provide recommendations. There's very little cost functionality in Goldilocks; there's a little bit, but it's very specific, very limited in its ability. Kubecost has a lot more cost-focused functionality, whereas Goldilocks is much more focused on just resource requests and limits. So that's kind of the big differentiator. Yeah.

Okay, so it's funny, right when we are going to end the session, people are popping up with questions. Okay, so let's, I guess, end the session now. We are done with the session, the time has ended. Okay, so thank you so much, Andy. I think all of the questions are done. So I hope to see you soon. Yeah. Okay, so let me take you to the backstage then. Bye. Bye.

Okay, so thanks everyone for joining the latest episode of Cloud Native Live. We enjoyed the interaction and questions from the audience. Thanks for joining us today, and we hope to see you again soon.