Hi, everyone. Thank you for joining us for this CUBE Clinic on Goldilocks. My name is Stevie, and I am an SRE technical lead here at Fairwinds. And joining me today is Andy. Hi, everybody. I'm Andy, CTO of Fairwinds. I've got about five or six years of experience in Kubernetes, I've been with the company for a long time, and I just love all things open source and all things Kubernetes. Thanks for joining us today. Yeah. So Andy and I work for Fairwinds, and here's our mission statement, which I'll read out loud for you real quick. Fairwinds is a trusted partner for Kubernetes security, policy, and governance. With Fairwinds, customers ship cloud native applications faster, more cost effectively, and with less risk. We provide a unified view between Dev, Sec, and Ops, removing friction between those teams with software that simplifies complexity. So there are two sides to our house. There's a side where we manage Kubernetes clusters, so we've got a lot of experience; we've seen a lot of things doing that work. And the other side of our house is using that experience to create a platform that's backed by open source solutions and software that helps users gain visibility into the state of their clusters from a security, governance, and reliability stance, and offers mitigation strategies for those challenges. So I mentioned briefly that we use open source solutions for that platform. One of the open source solutions we use is Goldilocks, which is a utility that can help you identify a starting point for resource requests and limits for your Kubernetes workloads. And so today Andy is going to be installing Goldilocks for us and showing us around the CLI: how it's installed, the commands, and how to interpret the output it gives us. And then we're also going to take a look at Goldilocks within the context of that platform, which is called Insights. Before we start, Andy, why did the FBI not catch the hacker? I don't know. He ransomware (ran somewhere).
That's great. That's great. All right. So what is Goldilocks? Can you tell us a bit about the background? Can you tell us a bit about the name? Because Fairwinds open source stuff typically follows a convention of being space related, and this is not what I think of when I think Goldilocks. Yeah, definitely. So as Stevie mentioned, we have managed a lot of clusters over the years for a lot of different customers. And we started out telling people, hey, you've got to set your resource requests and limits, because that's fundamental to how we bin-pack Kubernetes, how we autoscale, all that good stuff. And it's a fairly reasonable request to ask our customers to do that. And then they came back to us and said, well, what do we set them to? And so that usually ended up with us going into our monitoring solution and looking at graphs. And every time they wanted to size a workload, it would be this process of just using your knowledge of Kubernetes to give them a recommendation. And we said, well, there's got to be an easier way to do this, because this is kind of painful, and we get asked this all the time. So what we did was develop Goldilocks, and we decided to leverage a recommendation engine that already existed in the Kubernetes community. Goldilocks is based on the Vertical Pod Autoscaler. It basically manages Vertical Pod Autoscaler objects in what I call recommendation mode, or what they call off mode, where the Vertical Pod Autoscaler doesn't do anything; it just sits there, watches your pods, and makes recommendations on resource requests and limits. And so Goldilocks is a nice way to manage all of those objects and then show everything in a nice dashboard. And then you mentioned the name. So we were trying to come up with a name, and obviously Goldilocks makes sense because you don't want your resource requests to be too low, you don't want them to be too high; you want them to be just right.
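To make that "off mode" concrete, a VPA object in recommendation mode looks roughly like the sketch below. This is hand-written for illustration, not the exact manifest Goldilocks generates; the names are placeholders, but `updateMode: "Off"` is the real setting that keeps the VPA watching and recommending without ever modifying pods.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: goldilocks-demo          # illustrative name
  namespace: demo
spec:
  targetRef:                     # which workload to watch
    apiVersion: apps/v1
    kind: Deployment
    name: demo
  updatePolicy:
    updateMode: "Off"            # recommend only; never evict or mutate pods
```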
And that plays into the space theme nicely as well, which not everybody may know. The Goldilocks zone is the habitable zone at a certain distance from a star. The Earth sits in what is nicknamed the Goldilocks zone because it's not so cold that water freezes, but not so hot that humans can't live on the planet. So there's a little bit of a dual meaning there. That's super cool. Okay. All right. Why don't you get started with the first part of the demo. Sure. Sure. So if you're curious about Goldilocks and want to try it out, it is on our GitHub page, fairwindsops/goldilocks. If I could type our company name correctly, that would be good. And there we go. And so here you can find a link to our documentation site, you can find releases, and the documentation site has instructions on how to install it. So if you're curious about doing this on your own, feel free to check that out. So I'm just going to jump straight in here. The first thing we're going to do is kick off a kind cluster. Some of you may be familiar with kind; that's Kubernetes in Docker, a nice way to run a quick, easy test cluster on your machine. So if you want to tinker around with Goldilocks, this is the easiest way to get started. And then as soon as that comes up, I'm going to install a bunch of stuff. I'm going to do the thing that I hate when people do: I'm going to copy and paste a big old list of commands into my terminal here, and I'm going to run that. I'll put it over here on the left so I can talk about what's going on. So the first thing, go ahead. I was going to say, that's a big block of text, so if you'd walk us through it. Yeah, no problem. So there are a couple of prerequisites for running Goldilocks. The first one I already mentioned is the Vertical Pod Autoscaler, and you have to have that installed before Goldilocks can do anything.
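If you want to follow along, spinning up a throwaway cluster with kind is a one-liner, assuming Docker and the kind CLI are already installed. The cluster name here is my own choice:

```shell
# Create a local Kubernetes-in-Docker cluster to experiment with Goldilocks
kind create cluster --name goldilocks-demo

# Point kubectl at it (kind sets the context automatically, but you can be explicit)
kubectl cluster-info --context kind-goldilocks-demo
```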
So here, Fairwinds has a chart for the Vertical Pod Autoscaler. I don't believe there's an official upstream chart for it, which is why we have our own, and you can install that from our repository. I'm putting it in its own namespace. The second prerequisite for Goldilocks, and something that I just install in all of my clusters, including my kind clusters, is the Metrics Server. So we have the Metrics Server chart here with a couple of flags to make it work with the kind cluster, specifically the kubelet-insecure-tls flag. Then I'm installing a demo application. This is in our incubator chart repository; it's just a neat little app that you can hit with traffic, and it shows you what pod you're connecting to. And the last thing I'm doing, obviously, is installing Goldilocks itself. We have a Helm chart for that in the Fairwinds stable repository; that's at github.com/FairwindsOps/charts if you want to see the source code for it. And all of these instructions should be in the documentation for Goldilocks. So real quick, I always want to make sure that we're not making assumptions about things. Can you just talk really quickly about what the Vertical Pod Autoscaler does in general, on its own? Yeah, sure. So if you're familiar with horizontal pod autoscaling, that gives Kubernetes the ability to create more or fewer replicas of a pod based on some metric and some target. The Vertical Pod Autoscaler looks at a pod's CPU and memory usage and will scale it up and down in size. So it'll give it more or less CPU, or more or less memory. It does this in a couple of different ways, but essentially there are three parts.
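The big block of commands Andy pasted looks roughly like the following. This is a reconstruction based on the Goldilocks documentation, not a verbatim copy of his terminal; the release names, namespaces, and the demo chart name are my own assumptions, so check the docs for the current instructions.

```shell
# Add the Fairwinds chart repositories
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm repo add fairwinds-incubator https://charts.fairwinds.com/incubator
helm repo update

# Prerequisite 1: the Vertical Pod Autoscaler (Fairwinds maintains a chart for it)
helm upgrade --install vpa fairwinds-stable/vpa \
  --namespace vpa --create-namespace

# Prerequisite 2: Metrics Server, with a flag so it trusts kind's kubelet certs
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server \
  --namespace metrics-server --create-namespace \
  --set args={--kubelet-insecure-tls}

# A demo workload to generate recommendations for (chart name assumed)
helm upgrade --install demo fairwinds-incubator/basic-demo \
  --namespace demo --create-namespace

# Goldilocks itself
helm upgrade --install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace
```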
There's a recommender, which is the thing that watches all of the pods and all of the Vertical Pod Autoscaler objects, which select pods much in the way that a horizontal pod autoscaler would, and then calculates a recommendation for those pods. You'll see a little bit more of this later as we go through the demo. And then it has an admission controller and an updater that will actually modify those pods' requests as they get launched, and I believe in some cases it will actually start replacing pods if necessary. We don't typically use it in update mode, because it doesn't work with horizontal pod autoscalers that are scaling based on CPU and memory; the two would basically fight each other. We focus on the recommendation engine portion of it. In fact, the chart we installed by default only installs the recommender and the updater, and not the admission webhook, because it's unnecessary. You can even have it install just the recommender if you want; those are all options on the chart. Cool. Thank you. Yeah, no problem. So we have some output here from the Goldilocks install, which actually could probably use some updating, because there's an easier way to do this. But what I'm going to do is a port forward to the Goldilocks dashboard service, which is now running in the goldilocks namespace. We can see the controller and the dashboard are running. So I'm going to go ahead and kick off that port forward, and we're going to open our browser and go to localhost:8080. The chart obviously has the ability to add an ingress or whatever if you want; we run these with an ingress and an OAuth proxy so that we don't have to port forward to them. And here we see nothing, because we haven't done the next step yet. The instructions are here about what you have to do next, if you missed this part of the installation instructions when going through the docs.
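The port forward itself is the standard kubectl pattern. The exact service name and port below are assumptions based on the chart's defaults, so verify them with a quick `kubectl get svc` first:

```shell
# See what's running in the goldilocks namespace (controller + dashboard)
kubectl get pods,svc -n goldilocks

# Forward local port 8080 to the dashboard service, then browse to http://localhost:8080
kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80
```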
Basically, Goldilocks doesn't operate on anything unless we tell it to; we have to opt in to Goldilocks instead of opting out, in the default installation. We do this via labels. So we can label our namespace with goldilocks.fairwinds.com/enabled=true, and I'm going to go ahead and do that. And if you'll recall, we can see here that we also have no Vertical Pod Autoscaler objects anywhere in the cluster, because Goldilocks hasn't created any yet. So what we'll do is label our demo namespace with goldilocks.fairwinds.com/enabled=true. And what we should see is that Goldilocks immediately recognizes that, goes into a reconciliation loop, and creates a Vertical Pod Autoscaler for the deployment that exists in, sorry, the demo namespace. We see we have one deployment called demo-basic-demo, and Goldilocks has created a VPA for it. You may have noticed that the CPU and memory columns were blank there for a second and the Provided column was false. Now we see CPU and memory are filled out; this is its recommendation for the pods running underneath this deployment. So we can get that VPA, and we'll see that Goldilocks has created it to target the deployment demo-basic-demo with the update mode off, so it's not going to try to automatically update anything; it's just going to sit there and watch. And then in the status we'll see there's a recommendation. It's giving us four different types of recommendations: a lower bound and an upper bound, which we will see used in a second, and then a target and an uncapped target. I don't remember what uncapped target is, so I'm not going to go into that. But anyway, we see it has recommendations. It has looked at the amount of CPU and memory the workload is using and made a recommendation. If we do a top on those pods, we'll see they're actually using far less than this. That's because the VPA has a default minimum.
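The opt-in step and the checks around it can be sketched like this (the namespace matches the demo in this walkthrough; substitute your own):

```shell
# Nothing yet: Goldilocks only acts on labeled namespaces
kubectl get vpa --all-namespaces

# Opt the demo namespace in
kubectl label namespace demo goldilocks.fairwinds.com/enabled=true

# Goldilocks reconciles and creates a VPA per top-level workload in the namespace
kubectl get vpa -n demo

# Inspect the recommendation in the VPA status (lowerBound, target, upperBound, uncappedTarget)
kubectl describe vpa -n demo

# Compare against actual usage
kubectl top pods -n demo
```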
And that's configurable if you want to set that minimum lower, but we're getting the default minimum recommendation here from the VPA. Obviously, if we wanted to see real world numbers here, we would want to generate some load against this. So we can go back to our dashboard now, refresh, and we see that there is a namespace listed here. We can go ahead and click on that, and we will see, similar to what we were just looking at, the demo namespace that we've labeled. It's got a deployment in it, the deployment has a single container in it, and here are the recommendations. We see it's recommending 15 millicores and 105 megabytes of RAM. You can drop this down and get a little code box if you want to just copy that straight into your app. And it will do this for every top level workload in the namespace, so daemonsets, deployments, statefulsets, other things that create pods; you'll see those all listed here for this namespace. Or you can click on the detail all namespaces button over here and see all the namespaces at once. Now, if you, oh, I'm sorry. So now you would take this recommendation from Goldilocks and decide to go and make this change in your deployment, right? To change these settings? Yep. Okay. Yeah. So I could go edit the demo deployment, put that in, and see how it performs. That's kind of the idea here. And I would like to note that this is a baseline, a place to get started. Right now we're looking at about two minutes' worth of information about how it's behaving, and we're not generating any load against it. Goldilocks is only as good as the information you put into it. So it's a great baseline, a good place to get started, and it can be really useful for that. And so right now, if you want to use Goldilocks for a workload, you would have to label that namespace. So what if you don't want to manually label all your namespaces one by one?
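Applying that recommendation is just pasting the resources block into the container spec of the deployment. The numbers below are the ones from the demo; yours will differ:

```yaml
# In the deployment's pod template, per container:
resources:
  requests:
    cpu: 15m          # Goldilocks "target" recommendation
    memory: 105Mi
  limits:
    cpu: 15m          # requests == limits gives the pod guaranteed QoS
    memory: 105Mi
```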
How would you go about doing that? Oh, great question. So earlier we ran the helm upgrade --install for Goldilocks from our Goldilocks chart. There are a couple of flags, for both the controller and the dashboard, that are both called on-by-default. Basically, that says: show me all the namespaces unless I opt out of them. So we can flip that flag for both the controller and the dashboard, do our helm upgrade --install, and then I will edit the demo namespace and remove that label. And now we should see a whole bunch more VPAs; we've got one for basically every top level workload in the cluster. And this gives us a nice way to go back to our namespace list and see what it looks like when we have more than one namespace involved. I think my port forward probably died because I killed the pod. There we go. Now we see all the namespaces in the cluster. We can go take a look at, I don't know, let's look at the goldilocks namespace, because we know it has more than one workload in it. It probably hasn't populated yet, so we're going to give that a minute. No, there it is. All right. So now we see we've got two different deployments, and they've got their own recommendations here. And I think something I skipped over earlier: I didn't talk about quality of service. If you're not familiar with quality of service in Kubernetes, it's a designation that's given to pods based on how their resource requests and limits are configured. Guaranteed QoS is assigned when you have your resource requests and your limits equal to each other. You're saying, I need this much, and I promise not to go over this much. And so Kubernetes puts the pod into a higher quality of service class because you've been very explicit about that. There's a burstable QoS as well, which is where you set your limits higher than your requests, which allows the pod to burst.
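Flipping to opt-out mode looks something like this. The exact Helm value names here are assumptions from memory of the chart and may have changed, so check the chart's values.yaml before relying on them:

```shell
# Re-run the install with on-by-default enabled for both components (value names assumed)
helm upgrade --install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace \
  --set controller.flags.on-by-default=true \
  --set dashboard.flags.on-by-default=true

# The explicit opt-in label is no longer needed; the trailing "-" removes it
kubectl label namespace demo goldilocks.fairwinds.com/enabled-
```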
This is important because scheduling is based on requests, and HPAs are also based on requests, not on limits, so you're allowed to burst over them. But if you have a lot of burstable things and they're all bursting all the time, then you're most likely overcommitted; you're trying to use more resources than exist. And so we give a recommendation for either QoS class. The guaranteed recommendation is based on that uncapped, sorry, not uncapped, the target: we set both the resource request and the limit to the target from the VPA. Burstable is set to the lower bound and the upper bound, I believe; it might be target and upper bound, I'd have to check the code on that. But we give two different recommendations based on the information that the VPA gives us. And if you ever forget all that QoS stuff, there's some helpful information down here and some links out to the Kubernetes documentation on those quality of service classes. Nice. Yeah. And you mentioned that it's only as good as the information you feed it, right? And so with the information that the VPA is getting about the resources being utilized by your workload, how much data does the VPA retain to make these recommendations? How far back does it go? And what if I wanted to go back even further than whatever the default is? Great question. Great question. So the Vertical Pod Autoscaler by default can only take in so much historical data. I'm not sure what the actual timeline on that is; I think it's relatively short, because it's storing the recommendation model in memory. What you can do is use the option the Vertical Pod Autoscaler gives us to connect to Prometheus. You can connect the VPA recommender to a Prometheus instance in your cluster, and then it's actually configurable how far back you want it to look.
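A burstable recommendation, unlike a guaranteed one, sets requests below limits. The numbers here are purely illustrative, and, as Andy says, exactly which VPA bounds map to which field is worth checking in the Goldilocks source:

```yaml
# Burstable QoS: the pod is scheduled on its requests but may burst up to its limits
resources:
  requests:
    cpu: 15m          # e.g. the VPA lower bound
    memory: 105Mi
  limits:
    cpu: 100m         # e.g. the VPA upper bound
    memory: 256Mi
```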
So you can kind of align your Prometheus retention settings with your VPA settings so that they're somewhat sane, agreeing with each other, shall we say. And I can show how to do that here pretty quickly. What I'm going to do is add a Prometheus stack to this cluster. I'm going to use kube-prometheus-stack, which is a public chart from the Prometheus community, and I'm going to disable Grafana and Alertmanager because I don't need them right now. What this is going to do is install kube-state-metrics, which gathers all of the Kubernetes state information, plus a Prometheus instance and the Prometheus Operator. That Prometheus will immediately start gathering information about the resources in my cluster. So we'll get that installed, and then we have to update our Vertical Pod Autoscaler installation. I'm going to find the VPA install command that I ran earlier, and I'm going to add a few flags to it. First, recommender.extraArgs.prometheus-address, which I'm going to set to the address of the Prometheus service. If we get pods and services in the prometheus namespace, we'll see we have a service called kube-prometheus-stack-prometheus in the prometheus namespace on port 9090. And then we have to set one other flag on the recommender. This is documented in the VPA chart, as well as lightly documented in the VPA project, although sometimes I've had to go look at the code where all the flags are defined for the VPA recommender, because they're not super well documented. It sets the storage method for the VPA recommender to Prometheus. And so we can go ahead and take a look at the logs as that starts to take place; we should be getting a new recommender here in just a second. And now we should have, yep.
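A sketch of those two steps, under the assumption that the Prometheus service ends up at kube-prometheus-stack-prometheus.prometheus.svc:9090 (the address depends on your release name and namespace):

```shell
# Install kube-prometheus-stack without Grafana or Alertmanager
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace \
  --set grafana.enabled=false \
  --set alertmanager.enabled=false

# Point the VPA recommender at Prometheus for historical usage data
helm upgrade --install vpa fairwinds-stable/vpa \
  --namespace vpa --create-namespace \
  --set "recommender.extraArgs.prometheus-address=http://kube-prometheus-stack-prometheus.prometheus.svc:9090" \
  --set recommender.extraArgs.storage=prometheus
```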
Historical usage query: it's using container_cpu_usage_seconds_total from the cadvisor job, which is the default in the Prometheus installation, and container_memory_working_set_bytes. So it's going to go get all that historical information and make its recommendations based on the Prometheus data. We'd have to have Prometheus running for a little bit longer; I think the default lookback on that flag is like 30 minutes, but I believe it can be configured to be longer. Let's actually take a look real quick, for folks who are curious. Go to the vertical-pod-autoscaler folder in the autoscaler repo, then to the recommender package, and then to where the flags are, exactly, here we go. We'll just search for Prometheus. So we have that prometheus-address flag we set, and a job name flag if your Prometheus job is a little bit different and you're not using the defaults in kube-prometheus-stack. And then when we start looking at storage, we have history-length, so we'll go back eight days; history-resolution, one hour; some timeout stuff. So there are a lot of different flags to configure here, and this is probably the easiest place to see them all, because I don't think they're fully documented; the code is definitely the source of truth here. Would any of these, so I'm imagining a scenario where you have a workload that sometimes bursts up over a limit for a short amount of time. So you just have a little spike, right? That can often be hidden, or can adversely affect the recommendation that you get. Are those flags we were looking at, and Prometheus, some things you could use to adjust the granularity there to account for those kinds of spikes? Good question. I'm not sure how it takes that into account. I guess the resolution and the interval together would probably affect that behavior. I'm not entirely certain of the deep nuance of that.
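For reference, the recommender flags he's pointing at look roughly like this. The defaults are quoted from memory of the VPA source, and, as he says, the code in the autoscaler repo is the source of truth:

```shell
# Flags on the vpa-recommender binary relevant to Prometheus-backed history
--storage=prometheus                 # use Prometheus instead of the in-memory history
--prometheus-address=http://...      # where to reach Prometheus
--prometheus-cadvisor-job-name=...   # override if your cAdvisor scrape job isn't the default name
--history-length=8d                  # how far back to query (default ~8 days)
--history-resolution=1h              # sample granularity of the historical query (default ~1 hour)
```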
I do know that once it has that information from Prometheus, we see it encapsulated in this upper bound here. I've seen that upper bound go super high because the workload spikes up, and that gets encapsulated in the upper bound, as long as, like you said, your resolution from Prometheus is high enough that you're actually seeing that information. And so how does Goldilocks actually populate all this data? Okay, good question. Good question. So this dashboard runs essentially the same thing as a CLI command that we have. There's a summary command in Goldilocks. If we just do goldilocks summary, we'll see it spits out this really large JSON object, so I'll throw that into a file real quick and look at it via jq. And we see all of the VPA information for each workload summarized in this object. This is the data object that is provided to the dashboard for each namespace. And so if you wanted to get that data out some other way, or you just really like reading JSON for some reason, you're welcome to try out that goldilocks summary command. It's not fully supported, because we don't really maintain the CLI functionality all that much; it's really there for testing. But it is definitely available and usable if folks need that data somewhere else or want to look at JSON. Actually, I'd have to take a look; there may be an endpoint on the dashboard to grab that JSON object directly without having to run the summary command. But yeah, you could effectively do whatever you wanted with that JSON object. So we've talked about installing Goldilocks in a cluster. We've talked about how to label your namespaces so that Goldilocks picks them up, creates the VPA objects, and surfaces the recommendations from those VPA objects.
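The summary-and-jq step he just ran is simply this. The shape of the JSON isn't shown in detail in the demo, so the second jq filter is illustrative:

```shell
# Dump the same data object the dashboard renders, one summary per namespace
goldilocks summary > summary.json

# Pretty-print it, or pick pieces out with jq
jq . summary.json
jq 'keys' summary.json    # illustrative: list the top-level keys of the summary object
```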
And we've talked about getting our data retention to be longer so that we have more information to work from. But we've done this in one cluster. If you have multiple clusters, say multiple production clusters that you might want this data for, is there a simpler way to run Goldilocks rather than having to port forward or create ingresses for every different cluster? Yeah, definitely, definitely. And that's really where our SaaS product comes into play. Our SaaS product will allow you to collate the information from a whole bunch of clusters into a single dashboard where we encapsulate that Goldilocks data, and then we actually take it a bit further and add cost information and other things like that. So I can show you that real quick. Obviously there is a pure open source option if you really want to run it everywhere, install it everywhere, grab that JSON, and do something with it, but we've already done that for you here; that's exactly what we're doing. So this is Fairwinds Insights. When we first see it, we've got a whole bunch of clusters listed here, and they've got different scores and things. Not only do we pull in Goldilocks data, but we have a whole bunch of other open source that feeds into this: we've got Nova, we've got Pluto, we've got Polaris, and we've got other people's best-of-breed open source like Trivy from Aqua Security. All of that's in here, so you get a lot more than just efficiency and cost, but we're going to focus on those because we're talking about Goldilocks today. So here on the efficiency tab we first see a cluster comparison, where we get these cool little box charts that show us our available capacity, our CPU requests, our CPU limits, and how much of that we're actually using.
We can see here that our utilization is actually far lower than our requests and our limits, so this is a little more detailed view than what Goldilocks gives you. And we can also view memory, not just CPU, because obviously there are two sides to this story. And then you can either put in a rough cost per node that you think you have for your cluster and we'll calculate these cost estimates, or you can pull your AWS billing data into this and get a direct correlation to your CPU and memory costs, calculated for each cluster over time. So we can see our utilization percentages and our costs for all of our different clusters. And then if you want to look deeper than the cluster level, we get into this workloads tab, where we dig into a single cluster and see all of our different workloads. So say we want to filter down to just deployments and look at a specific namespace. I don't know what the cranky-nash namespace is, but I'm super curious now. But anyway, we'll take a look, you know, that seems like an auto-generated name. That's what it is. We take a look at just our Prometheus namespace here and see our relative total costs and our relative daily costs. And then we have additional quality-of-service classes that we've defined to give you different types of recommendations. We already talked about guaranteed and burstable, but here's where we add a little bit of value on top of that. We say: maybe this is a critical workload, and you want to make sure it never, ever gets killed, so let's give it a little bit extra; we're going to bump that recommendation up by a certain percentage. Or maybe it's limited: we just don't care about it, so we want to drop it all the way down and let it get killed all day long. That's fine; it's not going to bother us.
So we add that, as well as what we think it's going to cost to make the change. Sometimes you need to give things more resources and your costs go up, and that's okay, but you may want to know what that's going to look like before you make the change. So we have all those recommendations here in one place. And then we can slice and dice this different ways. We've got aggregating by namespace, so we can see what different namespaces are costing us and what our recommendations are going to cost us. And then we can start to aggregate by label too. So if you have a labeling scheme, say you want to look at your app.kubernetes.io/name label, we can filter by that as well. So it's just a much more feature-rich dashboard across multiple clusters. And then couple that with the fact that we have automation rules, and all of a sudden there's a lot going on here; I definitely don't have time to talk about it all. But that's what we've got. Cool. Yeah, so that's pretty awesome. Goldilocks seems like it'd be super useful. Certainly, I feel like one of the places where people tend to struggle with their clusters is setting their resource limits and requests, either not doing it at all or struggling to figure out where the sweet spot is. And so it seems like Goldilocks is a good tool for that. I like to think so. We get good feedback from the community, so I hope other folks find it useful as well. Yeah. All right. All right. I think that does it for us. I believe so. We do have a white paper that you can download about Kubernetes misconfigurations. The link is here on this slide, which I will leave up for a few seconds in case folks want to write it down. And then I believe you'll get an email with a link to that, as well as the recording here, which I think is only being sent out by email because we had technical challenges today.
So thank you all for watching and listening, and I hope you have a great rest of your day. Bye, everyone.