Welcome to our talk: "Where's your money going? A beginner's guide to monitoring and measuring Kubernetes costs." My name is Mark Poco, I'm a senior software engineer at Grafana Labs, and I'm Juanjo, also at Grafana Labs; we both work on the platform team.

Before we start, let's tell a little story. Imagine a scenario: you get your cloud bill, your Kubernetes cost has doubled month over month, and you're trying to figure out where that came from. You start with the bill itself, and no matter what you do, it doesn't really help you figure out where that cost is coming from. So the next thing you do is go to your cost explorer and see what other dimensions are available, and you still don't come up with much. In this example we're showing instance types: you might be able to see which instance type the cost is coming from, but nothing about the cluster or the workload. So you turn to your favorite visualization tool and look at things like CPU usage and memory utilization over time, and still nothing really stands out as to what is driving your costs. This leaves you sad.

Show of hands: how many people have had this happen before? All right, that's everybody. How many people are running kube-state-metrics in their Kubernetes cluster? All right. The good news for those who are: you're going to walk away today with some PromQL queries you can run to visualize this. For those who aren't, hopefully this is a good incentive to look into kube-state-metrics. So what can you expect today?
First, we're going to show a couple of approaches that we use at Grafana Labs to help bridge the disconnect between your billing statement and the metrics you're already collecting in your Kubernetes cluster. After that, we're going to step through a couple of PromQL examples to help measure cost; in this case we're going to show CPU, because that's usually the most costly resource. Finally, we're going to share a couple of lessons learned, both about setting this up and measuring it, and about how we helped improve that cost. With that, I'll turn it over to Juanjo.

So, to start digging into this, we need to understand the nature of our spending. There's a really simple formula, as you can see: spend equals usage times rate. Spend is the amount of money that you pay for your resources over a particular period of time, usage is the number of units of those resources, and rate is how much we're being charged for each of those units. It's important to separate these factors so you can focus the right effort on each one. Usage covers things like the number of CPU cores, the amount of memory, traffic, and so on; rate is the dollars per unit of each of these resources from the cloud provider. This separation really determines how we need to focus, and which work streams you need to develop to tackle each part.

On the usage side, we need to be aware of workload right-sizing. This is the most common, intuitive one: when you think about improving your costs, you usually think, "I need to properly right-size my resources," that is, match how much I ask for to how much I actually use. But there are several other factors that are important cost drivers. One is autoscaling: having a proper autoscaling story, both inside Kubernetes with things like HPA and VPA, and underneath it, how much your
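Written out with units, the formula the talk keeps coming back to is just the following; the numbers are invented for illustration only:

```text
spend [$]  =  usage [units]  ×  rate [$/unit]

e.g.  8 vCPUs for a month  ≈  8 × 730 h × $0.03/vCPU-hour  ≈  $175
```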
cluster can be elastic, actually creating new nodes and destroying nodes that you don't need anymore. And then another aspect, highlighted there because it's the one we'll actually be touching on for the rest of the talk, is cluster bin packing. Cluster bin packing means: given a node, how well are you able to pack all the pods onto that node so that it has the least possible slack resources?

Rate, as you can see, is a different aspect altogether. Essentially, you need to understand what kinds of discounts you can get from your cloud provider, for things like committing to a certain amount of resources that you will use over time. There are also things like the CPU architectures you're using: is it x86 or Arm, and are your workloads ready to run on those architectures? And then things like spot VMs, which someone once described really well as free chaos engineering, because these are VMs that can disappear at any time: they have a steep discount, but they can go with very little notice, maybe just a minute. The important reason to separate all this is that it also determines which teams will be in charge of each of the items I'm highlighting here.

So let's focus the rest of the talk on cluster bin packing. What we see here, by the way, is a graph in Grafana; it's the result of a query. What we see over time is spend. For example, if we focus at 5 p.m., during that hour from 5 p.m. to 6 p.m.
we will spend something like 1.4 dollars in total; the top boxes are CPU, the bottom boxes are memory. This is also a hint that we'll narrow the rest of the talk to compute resources only, because in 25 minutes that's what we can cover.

Again, this is spend, so this is dollars over time. If we apply the formula here, we see that this spend is composed of discrete units of resources. In this case, again at 5 p.m., we're graphing something like four nodes; each node might have, say, eight vCPUs. On the top you have CPU and on the bottom memory, and you can see that the number is the same. It's important to understand that this is not continuous: when you're running a Kubernetes cluster on top of a cloud, these are discrete amounts, and nodes get created or destroyed, as you can see toward the right of the graph. And then a different dimension is how much you're paying for this. As we just saw, at the end of the day what you'll be paying for is the volume you're seeing there: essentially the integral, the sum of each of these bars over a period of time.

This is still a pretty low-level view, though. We're graphing there the nodes we're renting from our cloud provider. We need to understand how Kubernetes workloads drive the creation of these units; this is key to understanding how we're going to cope with cost on each of these aspects. So let's see what drives Kubernetes cost. Here we're graphing a pod, just a simple web service, that is asking for one CPU. It's not running; it's pending until it can land on a node that you have. It asks for one CPU to be granted, and it happens to land on a node.
The node has 16 vCPUs, and you can start to sense that there's something not so good about this picture, until you actually bump up the replicas and get this kind of unicorn, awesome, purely theoretical scenario where we are exactly filling the capacity of the node. Obviously this is pretty theoretical, and Mark will address that when we show how we tackle it.

We have another pod here, essentially the same scenario as before. If you read through this graph, you'll see that you have CPU at the node level, but also CPU at the pod level. Usually when we use Kubernetes we only see the upper part: if you sum all the resources your pods are using, you'll see the aggregation of all these small boxes inside. But we need to be able to separate the two and measure both, so we can see the driver and the effect of that driver: the driver being what's inside, and the effect being the actual nodes that get allocated.
So again: in this talk we are focusing only on compute, meaning CPU and memory of nodes, and likewise of workloads. As Mark mentioned, if you have kube-state-metrics, you get these metrics out of the box. By the way, this is not meant to be a PromQL class, just a quick introduction. What you have here is `kube_node_status_capacity`; that's the name of the metric, and as you can see the metric can have several labels. Each label can have several values (its cardinality), and the combination of the metric name and the label values gives you a single stream of data points over time; this is called a time series. To give an example: you fix the cluster, because you're watching this particular cluster; you fix resource to equal `cpu`; you fix the node, because you're watching this node; and then you get a number, like 8, for that time series, because the node has 8 vCPUs.

That was looking at the box that hosts the pods. If we go to the pod level, you'll see there's a metric called `kube_pod_container_resource_requests`. We're not actually showing labels like `pod` and `container` here, for the sake of space on the screen, but it's essentially the same idea: you have cluster, resource, node, and namespace. And now Mark will help us understand how we can bring some truth out of this.

Thank you, Juanjo. All right, bringing this all back: we're going to focus on the usage side, because that's what drives the cost in your clusters. We now have two metrics we can do some work with, so let's step through three examples of how to measure compute in your cluster. The first one: how do we measure the cost of our nodes? In pseudocode and PromQL, we have this.
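As a sketch, here is roughly what those two series look like. The metric and label names are the real kube-state-metrics ones; the label values and numbers are invented for illustration:

```promql
# One series per node/resource pair; the value is the node's capacity.
kube_node_status_capacity{cluster="prod-1", node="node-a",
                          resource="cpu", unit="core"}
# value: 8

# One series per container; the value is what it requests.
kube_pod_container_resource_requests{cluster="prod-1", namespace="monitoring",
                                     node="node-a", resource="cpu"}
# value: 1
```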
We're expressing that formula again: spend is equal to usage times rate. To measure your nodes, you fill in usage with how many resources you have. In this example we use `kube_node_status_capacity`, saying we want the resource `cpu`, and what's charted on the right is how many CPUs you have for an entire cluster, summed up. As the graph goes down, there are fewer nodes; as it goes up, there are more nodes.

Now that we know the number of cores in a cluster, we can figure out what it costs to run that cluster. We're going to cheat a little bit today for rate: we go to a cloud provider's price list. There are three things that you really care about for the cost of a CPU: the region you're in, the instance type, and whether it's an on-demand machine or a spot machine. In this example we use the on-demand price, because you're most likely running on-demand machines. So we take the three cents per vCPU per hour and plug that into rate, and what's charted is how much we're spending per minute over time. The cloud provider charges you per hour, but we convert it to per-minute, and we'll come back to why in a minute. With this query you can explore your data in whatever tool you're using and chart it over time, but if you have a lot of nodes and a lot of clusters, the raw query probably won't take you further back than a week, maybe a month on a smaller cluster. So we convert to per-minute and create something called a recording rule in Prometheus. If you haven't used recording rules before, a recording rule is pretty much a job that
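A minimal version of that first query might look like this, assuming a flat on-demand rate of $0.03 per vCPU per hour (swap in your provider's actual price):

```promql
# Total vCPUs per cluster
sum by (cluster) (kube_node_status_capacity{resource="cpu"})

# The same, priced per minute: $0.03/vCPU/hour divided by 60
sum by (cluster) (kube_node_status_capacity{resource="cpu"}) * 0.03 / 60
```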
runs periodically. It's defined in YAML: you specify the metric name you want to store the data into, you give it the expression (in this case, the sum-by-cluster example we just had), and you attach a label to it; here we're saying `resource: cpu`. When this runs in Prometheus, every minute it calculates the cost per minute and stores it in the `cluster_cost_per_minute_sum` series, with a resource label of `cpu`. This means you could eventually do memory, persistent volumes, object storage; realistically, resource could be many different things. Once you have this running, it stores the data over time. So that's how you measure nodes.

Most likely, though, node totals only give you a rough sense of what you're spending; what you probably really care about is what is actually costing what. Here we're showing a slightly more realistic example, where we're running pretty much the Prometheus community stack. One of the things we do internally at Grafana is use namespaces to isolate our workloads. In this case, we have a Prometheus instance likely running in highly-available mode; it's bigger because it's most likely consuming a lot of memory; and then we have a bunch of other little workloads. So the question is: how do you measure that namespace? Because this is probably the most important aspect for your engineering teams.

Coming back, we're going to alter that equation just a little bit. In this case we sum by namespace, and instead of node capacity for usage, we look at requests: what are the workloads running in that namespace requesting?
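As a sketch, such a recording rule could look like the following; the rule name follows what was said in the talk, but the exact naming convention is yours to choose:

```yaml
groups:
  - name: cost
    interval: 1m
    rules:
      # Node cost per cluster, per minute, at an assumed $0.03/vCPU/hour
      - record: cluster_cost_per_minute_sum
        expr: sum by (cluster) (kube_node_status_capacity{resource="cpu"}) * 0.03 / 60
        labels:
          resource: cpu
```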
From kube-state-metrics, you have the metric `kube_pod_container_resource_requests`, and again we're doing CPU because that's most likely the most expensive. If you look at the graph, you might notice again that it goes up and down over time. When you're measuring nodes, that's just nodes being added or deleted; in this case it's most likely replicas. You're probably using some kind of horizontal pod autoscaling policy, so as the number of cores goes down you likely have fewer replicas, and as it goes up you likely have more. Once the newer version of Kubernetes with dynamic resource requests comes out, this will probably change a little.

Again, we cheat here for rate: we use that same three cents per hour, converted to per-minute. This is the cost per namespace charted over time; it's a made-up example where we generated some fake data, with namespace A and namespace B, and we can look at how much each costs per minute over time. Similarly, this is useful: you could run this today and get some actual data on the cost of your namespaces. But if you want to track it over a long time range, the raw query is most likely not going to work, so I fall back on a recording rule. In this case we call it `cluster_namespace_cost_per_minute_sum`, and for the expression we use what we had before in Explore, again adding the resource label of `cpu`. If you want to do memory instead, you change that resource label to `memory`, find the hourly price of memory, and do a little bit of conversion, because bytes and gigabytes don't always match up.
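The per-namespace variant as a sketch, with the same assumed flat rate of $0.03/vCPU/hour:

```yaml
- record: cluster_namespace_cost_per_minute_sum
  expr: |
    sum by (cluster, namespace) (
      kube_pod_container_resource_requests{resource="cpu"}
    ) * 0.03 / 60
  labels:
    resource: cpu
```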
So, that's two examples. We have one more that I think is really important, and it lines up with what almost every keynote says about energy efficiency. All of the previous examples are extremely unlikely to look like that in a production cluster. Most likely you'll have a lot of nodes over time, and each of those nodes will have a different mix of workloads running on it. What we really care about from a platform perspective is figuring out how much of the extra space on each node we are not using. The reason is that you're paying for the entire node, and if part of it isn't being used, you're wasting money, and you're wasting energy. It's something important for us to track.

So how do you measure that? This one is probably the most complex, so bear with me. Instead of caring about the usage or the number of requests per node or per workload, you want to know how much capacity is on each node minus how much your workloads are requesting. That's what drives your idle usage: it's what's left over. Multiply that by the rate, and that gives you how much you're spending on purely idle resources.

It's a lot if you're not familiar with PromQL, but let's step through it. On the inside you have a sum by node of the capacity of the node, minus a sum by node of the requests on that node; again, we're looking at CPU. What's charted on the right is how much we're spending per minute on resources that aren't being used by anything. And we convert to per-minute because, again, you could maybe do a week of this as a raw query; 30 days or six months is not going to cut it.

Let's take a pause: that's three PromQL queries we've looked at, and if you're not familiar, it could be quite a bit. I'd love to be able to spend more time and step through memory and persistent volumes; you're just going to have to bear with me.
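The idle-capacity query described above can be sketched like this, again with the assumed flat rate standing in for a real price:

```promql
# Idle spend per minute: per-node capacity minus per-node requests,
# summed across the cluster, priced at an assumed $0.03/vCPU/hour
sum by (cluster) (
    sum by (cluster, node) (kube_node_status_capacity{resource="cpu"})
  - sum by (cluster, node) (kube_pod_container_resource_requests{resource="cpu"})
) * 0.03 / 60
```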
Oh, I got ahead of myself. Let's look at that as a recording rule, because this is pretty interesting in what it does. We use the same `cluster_namespace_cost_per_minute_sum` metric, with the query we just saw. Notice, though, that in the query we don't sum by namespace anywhere. There's a reason for that: idle resource isn't associated with a namespace, it's associated with a node. Instead, we add an extra label, `namespace="__idle"`, and we call that a virtual namespace. Essentially, we attribute all the spend that isn't being used by anything to a single namespace, and we can do some pretty cool stuff with that.

Here's where I want to show a couple of things. Unfortunately I can't step through the entire process, but I do want to show you what we do internally at Grafana and the way that we measure this. If you put this all together (you draw the rest of the owl) you get this pretty complex-looking query. What we're graphing here is the percentage of all of our spend that's associated with idle resources. This is important for many reasons: this is money we're paying our cloud provider that isn't being used at all. We have targets and goals within the organization for what we want to do with that. What's really cool is that every week or month we get a report on how much we spend within the organization, broken down by team, and as a platform team we also care about the idle resources. You might notice there's a point where it kind of shoots up, right about July 14th. We noticed that, we were able to react and get that cost under control, so we didn't have a surprise bill. This chart is six months; we could track it over a year or 18 months.

With all of that said, this has been an iterative process.
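The virtual-namespace trick can be sketched as a recording rule that reuses the same metric name but stamps a `namespace="__idle"` label onto the idle-capacity expression:

```yaml
- record: cluster_namespace_cost_per_minute_sum
  expr: |
    sum by (cluster) (
        sum by (cluster, node) (kube_node_status_capacity{resource="cpu"})
      - sum by (cluster, node) (kube_pod_container_resource_requests{resource="cpu"})
    ) * 0.03 / 60
  labels:
    # Idle capacity belongs to no real namespace, so attribute it
    # to a virtual one; note the expression never sums by namespace
    namespace: __idle
    resource: cpu
```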
It's taken a while; we've been doing this for about a year, just getting to the point where we're at. The first main shortcoming of this approach, at least as we've shared it with you, is that it only works for homogeneous clusters. Most likely you'll have clusters with many different node types, and in our case we run across multiple cloud providers, so that's a hard problem to solve. What we first did, and it worked very well, was to average what a CPU costs us across all of our cloud providers, and instead of using that flat three cents an hour, we used that monthly average in all of our recording rules. We did that for about six months.

From there, we adopted OpenCost and deployed it to all of our clusters. Instead of using an estimated number, we can join against its metrics in a query and get the actual cost of the node; it takes into account the region, whether it's spot, and the instance type. The interesting thing for us was that when we converted from the estimated numbers to the actual numbers, our estimate turned out to be about 90% correct, and that's across three cloud providers. If I had just heard that, I wouldn't believe it; I couldn't believe it when I saw the numbers. But we wrote about this in a blog post when we adopted OpenCost, so go check that out if you don't believe it.

The other limitation right now is that this only takes into account compute resources. We can do CPU, memory, and persistent volumes, but at Grafana we also use a lot of object storage, which has a very different cost model, and we want to be able to associate that with our teams too, and let them understand how much they're spending on that resource as well.
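One way such a join can look: OpenCost exposes a `node_cpu_hourly_cost` metric with per-node prices, so the flat estimate can be replaced by a vector match. This is a sketch; the exact label matching depends on how your Prometheus setup labels clusters and nodes:

```promql
# Per-node capacity times that node's actual hourly CPU price, per minute
sum by (cluster) (
    sum by (cluster, node) (kube_node_status_capacity{resource="cpu"})
  * on (cluster, node)
    avg by (cluster, node) (node_cpu_hourly_cost)
) / 60
```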
So that's something we've built internally; we're not quite sure yet what we'll do with it, but something will come out of that. Now, you might have noticed that all my examples used AWS for billing, but I pulled the numbers from GCP. Not all cloud providers will give you an actual breakdown of CPU versus memory cost. If your cloud provider only gives you the total cost of an instance, a reasonable split is about 80 percent for CPU and 20 percent for memory, so you can do some math on that to figure it out.

And then finally, namespaces don't always match up to teams. From a platform perspective this is really powerful for figuring out where resources go, but our engineers need to know not just the namespace, but how much they are spending. So we have a kind of magical little metric that is part of our CI process that associates namespaces to teams, so that when we build these queries on top of the recording rules, we can join namespace to team, and we can then give all of our teams a breakdown of their costs.

Just to add a bit on top of what Mark said: attribution matters here, being able to attribute costs so that actual engineering teams can take care of the costs they own. And if you're wondering which team the `__idle` namespace goes to: it goes to the platform team. The interesting part is defining ownership, making it possible for teams to actually own their own cost, so they can do what's called TCO, total cost of ownership, per team.

With this last element, let's see what lessons we learned on our bin packing journey. Oops, maybe none; or maybe we need to ask ourselves a question first: do we have resource requests for our workloads, CPU and memory?
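The namespace-to-team join can be sketched like this. `namespace_team_info` is a hypothetical info-style metric (constant value 1, with `namespace` and `team` labels) of the kind the talk says is published from CI; the recording-rule name matches the earlier sketches:

```promql
# Spend per team: join each namespace's cost onto its owning team
sum by (team) (
    cluster_namespace_cost_per_minute_sum{resource="cpu"}
  * on (namespace) group_left (team)
    namespace_team_info
)
```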
Yes-ish, at least for the more relevant ones; a few hands up, okay. You cannot embark on this journey if you don't tackle this properly, because the resources you request for your pods are what drives the scheduler to land those pods on a particular node that has that capacity available, and that request is the view the scheduler has for proper bin packing. The scheduler, together with the cluster autoscaler, needs to know how much you are requesting for the pod. Again, we're talking about bin packing, not right-sizing: in this talk we're not discussing how well you use the resources you request, but how well we bin pack what you request onto the nodes you have available.

The idea here is to share some tools and approaches that have helped us. In GKE, by default, clusters run with something called the balanced autoscaling profile. Balanced mode, let's say, tries to be gentle with the pods, doing a fair distribution of pods over the nodes you have available. If you switch that (and this is a cluster-wide setting) to optimize-utilization, it does the opposite: it will actively try to bin pack. But again, if you don't have those requests in place, instead of bin packing you'll get pod smashing: you need to be able to tell the scheduler how much you're requesting. This was very useful for us; I'll give you some numbers in a bit.

For the kind of patterns we have at Grafana, we essentially run continuous deployment. There isn't one day when we deploy; we have so many teams that we're constantly rolling out new versions or new settings for the services we run, and this creates a lot of fragmentation on the nodes.
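For reference, switching a GKE cluster to the optimize-utilization autoscaling profile is a one-line change; the cluster name and zone here are placeholders:

```shell
gcloud container clusters update my-cluster \
    --zone us-central1-a \
    --autoscaling-profile optimize-utilization
```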
So let's say you have a point in time with an awesome bin packing picture, like twenty-something percent idle as Mark was showing, but then you do a rollout, and you fragment all the nodes with gaps of resources, by allocating new nodes where the new pods land. What we found, because of the way we use the platform, is that in GKE this was taking too long: after a rollout it might take two days to settle back to that idle percentage. A call-out again: measure this, because otherwise you're running blind. Everything Mark just showed is a Prometheus measurement, so we were actually tracking that idle percentage and could react and say, okay, this is taking too long.

A tool that helped us here, though it's certainly not a silver bullet and you need to understand how it works, because it has many knobs, is the descheduler. The descheduler observes how well you are using your nodes, and it has several strategies; one of them is HighNodeUtilization. If it observes that a node is underutilized, it will evict the pods so that the node can be killed. But of course you need to understand it; it's not a silver bullet, and we had some incidents because of it, so we had to treat it carefully.

That was on GKE; we run on three main cloud providers. On AWS, we replaced the cluster autoscaler. An important call-out here: in Google, the scheduler and the cluster autoscaler are in the control plane, so in GKE you don't control them. In AWS there's a difference: the scheduler is in the control plane, but you need to run your own cluster autoscaler, so it's replaceable; enter Karpenter.
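A sketch of a descheduler policy using the HighNodeUtilization strategy mentioned above; the exact schema depends on the descheduler version you deploy, so treat this as a starting point and check the project's docs:

```yaml
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: bin-packing
    pluginConfig:
      - name: HighNodeUtilization
        args:
          # Nodes whose summed requests fall below these percentages are
          # considered underutilized; their pods are evicted so the node
          # can be scaled away
          thresholds:
            cpu: 20
            memory: 20
    plugins:
      balance:
        enabled:
          - HighNodeUtilization
```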
Karpenter is an open source project that aims to replace the stock cluster autoscaler, and it has many knobs: you can configure, for example, the diversity of the underlying nodes to better match the shape of the pods (by shape I mean the ratio of memory to CPU) with the shape of the nodes, and get a better bin packing story. And we got really lucky today, literally today: we published a blog post by a couple of teammates who have been working on Karpenter, Logan and Paula, which has some well-detailed and really interesting views on how Karpenter has helped us, especially with our bin packing story. And then, finally, a call-out for the white paper by Google titled "State of Kubernetes Cost Optimization."

And with that, I think we're done. Thank you for being here. Thanks.

Sorry, wait, I forgot something: numbers. I had promised some numbers. Without touching anything on our clusters, we were hovering at 40 to 45 percent idle resources. Consider that if you have 50 percent idle resources, you are essentially paying double what you are actually using, because your nodes are half empty. We went from roughly 45 percent to 22-25 percent idle.

Questions? Please use a microphone.

[Audience] I noticed you were using requests for everything. Does that not consider limits at all? Is it the container requests, and then the limits are on top of that?

[Mark] The main thing the scheduler cares about is what you request, right?
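A trimmed Karpenter NodePool sketch showing the kind of knobs mentioned; a real one also needs a `nodeClassRef` and usually more requirements, so check the Karpenter docs for your version:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  # Actively consolidate pods onto fewer nodes to reduce idle capacity
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    spec:
      requirements:
        # Let Karpenter pick from a diverse set of capacity types and
        # architectures, matching node shape to pod shape
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
```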
So if you have upper limits, a pod can burst up to them and that's fine, but the main thing we care about is tracking what our resources are requesting. We've done some analysis comparing the two, looking at the maximum between the limit or the actual usage versus the request, and at least in our organization the difference has been negligible. So we focus on requests, and we chose to keep it simple.

[Juanjo] As we mentioned, this particular talk focuses on bin packing, so understanding how the scheduler is able to bin pack the pods. We're not focusing on right-sizing; that would be a completely different approach, with different metrics to see how well you're using the amount you requested. That's why we wanted to highlight at the start of the talk that within usage there's a different aspect, which, if I understand your question correctly, is how well you're using the resources you requested at the pod level.

[Audience] The other question is about CPU versus memory costs. I'm not sure that split is accurate. Say you reserve a node on AWS: it comes with four CPUs and, say, 16 gigs of memory. You can't really break that down into separate costs. If you reserve 16 gigs of memory but only use two CPUs, you're still taking up the whole node, right?

[Mark] Yeah, that's a great question. The way we approach this internally is that we're not trying to get to accounting-level accuracy, where it lines up exactly with our bill. We're just trying to have a rough approximation of each workload: what it's requesting and what those units cost, summed up over time. We're really looking for trends: we want to see, over time, what our workloads are requesting, and calculate the cost of those requests. Thanks.
[Audience] Great talk. The question is: is there a place where you've published the dashboards with these queries, open source or internally?

[Mark] Not yet; that's something we're trying to figure out. We use Jsonnet internally to publish a lot of our dashboards, queries, and whatnot, so it's one of those things where we're trying to find the right balance, because most people probably aren't using Jsonnet. But if it's something you're interested in, you can join our Slack channel and we'll be more than happy to share.

[Juanjo] Yeah, we actually ran short of time. We wanted to create a repo with, let's say, already-rendered manifests, because as Mark mentioned we use Jsonnet, which can be hard to get started with. So we want to share the already-rendered recording rules, but we ran out of time, and that's why we created this resource, a channel in our Slack, and we may publish some of these as excerpts. It's in our plans to publish the full thing; reach out on Slack. Thank you.

[Host] We probably have time for one more question. Anybody? No? Then thank you.