Well, everyone, thank you so much for coming to our session. It's the last session of the day, so I hope you've had a great KubeCon so far. My name is David Porter. I'm a software engineer on the Google Kubernetes team, focused on the node. I work in the upstream SIG Node community, and I'm also a maintainer of cAdvisor, a monitoring library used across the Kubernetes ecosystem that we'll talk about in this presentation.

Hey, everyone. My name is Peter Hunt. I'm a senior software engineer at Red Hat, primarily working on the container runtime CRI-O, but I also work in upstream SIG Node and on Podman, runc, and other container-related technologies.

Today we're going to talk about making sense of your vital signals: the future of pod and container monitoring. Let me start with a situation I hope you haven't been in during KubeCon, but that anyone who's been on call has probably faced. You suddenly get paged. Maybe it's your chatbot application or some other LLM thing, and the alert says the latency of your application is too high. What just happened, and what do you do? Well, the first thing you probably do is ack the page. But after that, how do you actually resolve it? That's what we hope to walk through in this presentation.

So what is the monitoring and observability space, and why is it important? Observability is really about understanding how your applications are performing. When you roll out a new version of your application, how is its resource usage changing? Is there a regression? Is it using too many resources, too few, and so on? You want to be able to identify issues and unexpected behavior in your apps, and you probably want to alert on them too — we alerted on latency, for example, but maybe you also alert on using too much memory or other scenarios. Additionally, you may have SLOs or SLIs for your internal or external customers, and you need observability to understand whether you're meeting them. And lastly, it's not just you as the cluster admin or application developer: the internal Kubernetes components also need monitoring and metric data to perform their core functions. A good example is the kubelet, the worker agent running on every single node in your cluster. It needs monitoring data to evict pods and to understand, for example, which pod is consuming too many resources. Another example: if you set an ephemeral storage limit on your pod, the kubelet needs to know how much storage your pod is actually using, because if you go over that limit, your pod gets evicted. So monitoring data is needed everywhere: you need it to understand how your app is doing, your customers need it to ensure SLOs and SLIs are met, and the core Kubernetes components need it to perform their functions.

When we talk about monitoring and observability, there's a lot in the Kubernetes ecosystem, so we want to keep this presentation scoped, but here's a quick overview of what's out there. The first category is node-level metrics. There's a popular open source project, the Prometheus node exporter, whose job is basically to collect node-level metrics about how your actual node is doing.
This is things like how much CPU my node is using, how much memory, and so on. The next category is the Kubernetes components themselves: each of the core components exports metrics about how it's doing. For example, the kubelet exports metrics such as the latency to start a pod. The next category is metrics about your API resources. There are open source projects like kube-state-metrics that generate metrics based on API resource objects — for example, how many pods are in a given namespace. And the last category, which is what we'll focus on in this presentation, is pod and container workload metrics: metrics about the actual pods and containers deployed on the node and how they're performing. I'd argue this is one of the most important categories, because it directly reflects your end applications and the customers using them.

So where do these metrics come from, and how did this all start? To explain that, I need to introduce you to cAdvisor. cAdvisor had a very humble beginning. This is a blog post from 2014, quite a while ago, on the Google Cloud Platform blog, and it's actually the first time Kubernetes was announced — you can see Kubernetes described as a cluster orchestrator back in 2014. And side by side, on the same day Kubernetes was announced, cAdvisor was announced. cAdvisor's job, as described in that blog post, is to be a tool that enables fine-grained statistics on resource usage for containers. At that time containers were just taking off — people were used to VMs and node-level applications, not containerized workloads — and Google already had a lot of experience running containers in production. So when Google announced Kubernetes, it also realized that one of the most important things would be the ability to monitor and observe your workloads, and that's where cAdvisor came in. cAdvisor's job was to provide monitoring data for all the containers that are running.

So what is cAdvisor in a bit more detail, and how does it work? cAdvisor is an open source project whose goal is to provide observability for all the containerized workloads running on a machine. How does it do that? It has built-in drivers for all the major container runtimes — Docker, containerd, CRI-O. Once it starts, it talks to those drivers to understand which containers are running on the system, then collects metrics about all of them and exports them in a variety of formats. To do that, it uses a library called libcontainer, from the runc project, to actually scrape the metric data — we'll go into a bit more detail on how that works. The other thing to call out is that cAdvisor can be used in two modes: standalone mode and library mode. Standalone mode is how you'd run cAdvisor on a single Linux box. That can be useful even outside Kubernetes — say you're running some Docker containers on a single VM and want to monitor them. Library mode is a little more interesting: it lets you integrate cAdvisor into your own Go application and get monitoring for all the running containers straight from your Go app.
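In either mode, cAdvisor ends up exposing its metrics in Prometheus text format over HTTP. As a rough illustration (not from the talk), here's a minimal Go sketch that scrapes a standalone cAdvisor's /metrics endpoint, assuming it's listening on its default port 8080 on localhost:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumes a standalone cAdvisor running locally on its default port 8080.
	resp, err := http.Get("http://localhost:8080/metrics")
	if err != nil {
		log.Fatalf("scraping cAdvisor: %v", err)
	}
	defer resp.Body.Close()

	// The body is plain Prometheus text exposition; print only the CPU series.
	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // metric lines can be long
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "container_cpu_") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("reading response: %v", err)
	}
}
```

The container_* series printed here are the same workload metrics we'll trace through the kubelet in a moment.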
And the most important thing to call out here is that the kubelet — the worker node agent — actually uses that library mode: it links against cAdvisor as a library. So whether you know it or not, every single node out there running the kubelet is already using cAdvisor to provide monitoring data.

So what kind of metrics are we actually talking about with cAdvisor and these workload-level metrics? There's a huge list in the documentation — too long to show here — but examples include how much CPU time my app is using, how much filesystem storage, memory, network, and so on. There are a lot of metrics to go through.

So where do these workload metrics come from? We said they come from cAdvisor, but where does cAdvisor get them? This is where we need to introduce cgroups. Cgroups are the core Linux kernel feature that provides resource accounting and monitoring for containers. Every pod becomes a cgroup, and every container becomes a cgroup. Cgroups have two main jobs: group a set of processes together — which is basically what forms a container — and restrict usage of different resource types. For example, when you set CPU and memory limits, cgroups are what actually enforce them. But cgroups don't only enforce limits; they also export monitoring data, which lets you drill down to a container or pod level and ask how much resource that container or pod is using, at different levels of granularity. There's also the newer cgroup v2 interface — I gave a talk at the last KubeCon if you're interested in the details — but this is the basis of where these workload metrics come from. So I'm going to hand it off to Peter to talk a little more about what a metric's journey looks like and where it comes from.

So I'd like you all to imagine that I am a humble metric on your Linux node — CPU usage, say — and I'm going to walk through how I get from the kernel to you, the cluster admin, being able to see my value. I start as a cgroup value. The cgroupfs is exposed by the kernel, and it's basically just a file, like everything else in Unix; the kernel keeps it up to date for all the processes within that cgroup. runc's libcontainer is a library used both by the runc binary and by cAdvisor to read those values from the cgroupfs — it's just opening a file and reading it. cAdvisor calls into libcontainer to read the cgroup value. cAdvisor is watching all the cgroups being created on the node, and for the ones it knows how to read — the three container managers we talked about, Docker, containerd, and CRI-O — it has special handlers to talk to them and figure out exactly how to interface with those containers, and it also has a raw cgroup driver to read cgroups on the host. cAdvisor then puts those metrics in a few different places, which I'll talk about in a moment. So that's the overarching picture of how I, a cgroup value, get transmitted to you. I can also show a more concrete example: say I'm CPU usage, which we're going to talk about a little more in a bit.
So that value comes from a cgroup path under /sys/fs/cgroup — with quite a bit in between for the specific container, for example going through a burstable QoS slice — all the way down to the cpu.stat file. Within cpu.stat there's a field called usage_usec, which is the total usage in microseconds, and that's what the kernel exposes (there's a short sketch of reading this file a little further below). libcontainer reads that and stores it in its stats structure, roughly CpuStats.CpuUsage.TotalUsage, and it does that when cAdvisor calls into it to gather the value. cAdvisor then translates that into its own structure, stores it as the total usage, and writes it out as the metric container_cpu_usage_seconds_total, as well as transmitting it to the kubelet, which stores it and emits it too.

So how are you actually able to see the value of that cgroup? The most common way a cluster admin does this is through the /metrics/cadvisor endpoint, usually via Prometheus: Prometheus scrapes /metrics/cadvisor, and that endpoint is proxied through the kubelet — even though cAdvisor is producing the metrics, the kubelet serves the endpoint. You can also hit it directly, for example through kubectl proxy. If you're running a node called kind-worker, as you would with kind, you'd hit something like /api/v1/nodes/kind-worker/proxy/metrics/cadvisor and get back the whole list of metrics cAdvisor is emitting.

When the kubelet requests those metrics from cAdvisor, it also translates some of them and writes them to a JSON endpoint called the stats summary. That's the endpoint that eventually gets translated into another endpoint, /metrics/resource, which is what the metrics server uses, and the same data feeds the eviction manager — that's how the eviction manager knows which pod should be evicted on a node when it's using too much CPU or memory. The kubelet also depends on cAdvisor for node-level metrics, which we call machine metrics: cAdvisor reads the raw cgroup values of the host-level cgroups, and that's how we know the capacity of the node and how much of it is being used, which also informs the eviction manager.

So next up, we're going to walk through a case study on actually reading CPU usage and how you might adjust your cluster's behavior based on it. Going back to the start of the presentation, we got that page — how would you debug it and understand what's going on? The first thing that helps here, and in general when investigating an issue like this, is to have a strategy: if we just blindly look at different things, we'll run out of ideas or keep trying things without a plan. One strategy we might use is called the USE method, introduced by Brendan Gregg, who's something of a performance guru. The main idea is that you enumerate the different resources in your system and analyze their utilization, saturation, and errors — and I'll explain what those are.
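To make that first hop concrete — the kernel exposing usage_usec and something reading it the way libcontainer conceptually does — here's a minimal Go sketch (not from the talk). The cgroup path is a hypothetical example for a systemd-managed cgroup v2 node; the real path for a specific pod and container will be deeper and will differ:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Hypothetical cgroup v2 path (burstable QoS slice); adjust for your node and pod.
	statPath := "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/cpu.stat"

	data, err := os.ReadFile(statPath)
	if err != nil {
		log.Fatalf("reading %s: %v", statPath, err)
	}

	// cpu.stat is a flat "key value" file; usage_usec is total CPU time in microseconds.
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 || fields[0] != "usage_usec" {
			continue
		}
		usec, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			log.Fatalf("parsing usage_usec: %v", err)
		}
		// Downstream, cAdvisor exposes this as the counter container_cpu_usage_seconds_total.
		fmt.Printf("usage_usec=%d (~%.2f CPU-seconds)\n", usec, float64(usec)/1e6)
	}
}
```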
So when we talk about containerized workloads, what types of resources do we care about? There may be quite a few, with more and more devices these days, but the core ones that are always there are CPU, memory, IO, and storage. Basically, we enumerate each resource and then look at three characteristics. The first is utilization: the average time the resource was being used to do useful work — how much CPU time was actually used by my application? The next is saturation: how much extra work we wanted to do with that resource but couldn't, because the resource was busy. For CPU, for example, if you set a CPU limit, that will create saturation, and you might hit a wall there that you want to investigate. The last one is errors. Not all resources have errors, but some do — with storage, for instance, you might have an issue with the actual disk and want to look at errors there. For CPU and memory, errors are less likely.

So here's our Kube chatbot that we got alerted on, and here's its manifest. I wanted to make sure it's a guaranteed pod, so I set requests equal to limits, and I gave it two CPUs because that's how much I thought it needed. What does setting two CPUs as a request and a limit actually do underneath? It's helpful to understand that, so we know which metrics to investigate.

So how do CPU requests and limits work? Let's do a quick deep dive. A CPU request is a floor — the minimum amount of CPU. When you specify a CPU request, Kubernetes doesn't overcommit on CPU: the scheduler will always ensure that much CPU is available on the node, so even if the system is contended, you're guaranteed to get your CPU request. A CPU limit, on the other hand, is a ceiling. If you set a CPU limit, then even if there's spare capacity, you'll be throttled as soon as you hit it and won't be able to use more CPU. That's the view from the Kubernetes pod perspective, but how is it actually implemented underneath? For CPU requests, Linux uses a feature called CPU shares. Shares are a proportional system: one cgroup — one pod or container — might have a certain number of shares and another might have double, and the kernel computes the ratio across all the cgroups; the ones with more shares are prioritized and get more CPU time. That's CPU requests. For CPU limits, there's something called CFS bandwidth control. There are two key pieces of information there: the CPU quota, which is how much CPU time you can use within a given time slice, and the CPU period, which defines how long that time slice is — it usually defaults to 100 milliseconds. I think this is better explained with a picture, so let's take a quick look. With CPU limits, I have a little example here: an app that needs one CPU's worth of work — it has one CPU's worth of compute to perform — and I know that up front.
I'm going to set my CPU request and limit to one CPU. My CPU period — the time slice — is 100 milliseconds, it's going to run, and everything's happy, everything's great. But now let's imagine we introduce a lower CPU limit. Just for the sake of example, we set the limit to 400 millicores — 40% of one CPU. What's going to happen, and how long will the work take? The first 100-millisecond period starts, and our app runs for 40 milliseconds — that's all the quota we have, because we set the limit to 400 millicores. After those 40 milliseconds, our quota is used up: we can't run anymore, and we're throttled for the remaining 60 milliseconds of the period. But we're not done — we've only used 40 milliseconds of CPU time. So another 100-millisecond period starts, we get our quota again, run for another 40 milliseconds, and then we're blocked again because we've used up the quota for that period. Then another period starts, and finally we run for the last 20 milliseconds. At that point we've run for 40 + 40 + 20 = 100 milliseconds and done our one CPU's worth of work — but it took three periods to do it: 100 + 100 + 20 = 220 milliseconds of wall-clock time, versus just 100 milliseconds with no throttling. This is why, when you set CPU limits, it's important to really understand whether they're working against you and whether you're getting throttled.

So how might we investigate this, going back to our alert? This is where all the workload metrics come into the picture. First, let's start with CPU utilization, to understand how much CPU our application is using. The metric coming from cAdvisor is container_cpu_usage_seconds_total — it's a Prometheus metric. I also installed an open source project called the kube-prometheus stack, which is really nice: it sets up all the Prometheus rules and the Grafana dashboards, and everything pretty much works out of the box using these metrics. In the dashboard that comes baked into Grafana, I can see my CPU request and limit is two, and the app is definitely CPU-bound — it's running at around 1.5 to 2 CPUs. But I don't yet know whether that's actually a problem. It's definitely CPU-busy, but is it saturated? Is it hitting a wall? Because I set a CPU limit, I need to investigate throttling. How might I do that? There are two other metrics to consider: container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total. Just like we were talking about before, you want to understand how many periods your application ran in, and in how many of those periods it was throttled. From those two metrics you can build an aggregated throttle percentage: throttled periods divided by total periods. That's what this built-in dashboard shows — the CPU throttling percentage — and you can see it's quite high, over 50%.
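As an illustration (not from the talk), here's roughly how you could compute that throttle percentage yourself against the Prometheus HTTP API. The Prometheus address, namespace, and pod name are assumptions you'd replace with your own; the metric names are the cAdvisor counters just mentioned:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Throttled periods over total periods for a hypothetical pod, as a ratio.
	query := `sum(rate(container_cpu_cfs_throttled_periods_total{namespace="default",pod="kube-chatbot"}[5m])) / sum(rate(container_cpu_cfs_periods_total{namespace="default",pod="kube-chatbot"}[5m]))`

	// Assumes a Prometheus reachable locally, e.g. via a kubectl port-forward to port 9090.
	resp, err := http.Get("http://localhost:9090/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		log.Fatalf("querying Prometheus: %v", err)
	}
	defer resp.Body.Close()

	// Minimal decoding of the instant-query response: result[].value = [timestamp, "value"].
	var out struct {
		Status string `json:"status"`
		Data   struct {
			Result []struct {
				Value [2]interface{} `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatalf("decoding response: %v", err)
	}
	for _, r := range out.Data.Result {
		fmt.Printf("throttle ratio: %v\n", r.Value[1])
	}
}
```

A value above 0.5 means the workload spent more than half of its active periods throttled, which is what the Grafana panel in the talk is showing.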
So 50 to 75% of the time I'm getting throttled. That tells me very clearly that I'm hitting my CPU limit, it's preventing my application from using more CPU, and CPU is definitely saturated — that's the issue. So how do we solve it? The simple answer: increase the CPU limit. Once you do, the application won't be throttled anymore. The main takeaway is that workload metrics really helped us understand how our application was performing and helped us debug and fix the issue. So that's workload metrics today — and we're doing some work in this space, so I want to hand it back to Peter to talk about the future. Thanks.

So David just described the state of metrics as it has existed for largely the entire life of Kubernetes: cAdvisor gathers the metrics, and a cluster admin reads them through various methods and acts on them. This works really well, but we're constantly trying to improve Kubernetes and improve performance, and cAdvisor has a couple of limitations we want to address. One is its monolithic design. It's a very large piece of code: it can talk to a bunch of different container managers — Docker, containerd, CRI-O — but because of that it ends up being a little unwieldy to work with. It's also barely CRI-aware: even though it can talk to multiple CRI implementations, it uses the cgroup hierarchy to guess what the container actually is, and then makes a direct request to the CRI implementation to get additional information about it. So it's kind of an odd design, working backwards to figure out what the container is actually up to. It also doesn't work at all for kernel-separated containers — things like VM-based pods with Kata Containers — and it doesn't work on Windows. Because of that, there are whole topologies of Kubernetes nodes that aren't even accessible to cAdvisor, where you can't get all of the metrics we're looking for. There's also work being duplicated between the CRI and cAdvisor. The kubelet has this notion of stats providers, and there's a CRI stats provider and a cAdvisor stats provider. When you use the CRI stats provider, then in addition to the CRI gathering the stats, cAdvisor is still continuously gathering them too, because it's just watching the new cgroups being created, chugging along and reading all the values. That duplicated work can have real performance costs, which is why CRI-O doesn't even use the CRI stats provider today, even though it's a CRI implementation — we were worried about exactly these performance limitations. So, thinking about the future, we've been asking who should really own metrics collection. cAdvisor has been in this position for a long time and has done a really good job of it, but as we try to add features and tune performance, we think the CRI implementations may be the best place to collect these metrics, because they're the components closest to the containers and pods. And that's exactly what KEP 2371 does: it takes the stats gathering out of cAdvisor and pushes it into the CRI implementations.
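To give a feel for what that looks like from the kubelet's side, here's a rough sketch (not from the talk) of asking a CRI runtime for pod-level stats over its socket, using the ListPodSandboxStats call that KEP 2371 builds on. The socket path is an assumption (containerd's default here; CRI-O uses its own socket), and the exact CRI field and method names can vary by version, so treat this as illustrative:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Connect to the CRI runtime socket (containerd default; adjust for CRI-O etc.).
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dialing CRI socket: %v", err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// ListPodSandboxStats is the kind of CRI call that feeds the kubelet's stats
	// summary when the CRI, rather than cAdvisor, is the source of pod stats.
	resp, err := client.ListPodSandboxStats(ctx, &runtimeapi.ListPodSandboxStatsRequest{})
	if err != nil {
		log.Fatalf("listing pod sandbox stats: %v", err)
	}

	for _, s := range resp.GetStats() {
		meta := s.GetAttributes().GetMetadata()
		cpu := s.GetLinux().GetCpu().GetUsageCoreNanoSeconds().GetValue()
		fmt.Printf("%s/%s: cumulative CPU usage %d ns\n",
			meta.GetNamespace(), meta.GetName(), cpu)
	}
}
```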
So here we have the KEP — "CRI: full container and pod stats," KEP 2371 — and in this KEP we describe a world in which cAdvisor is taken out of the business of gathering container and pod stats. It still has some responsibilities on the node, so it's not going away forever, but its scope is being reduced so that the entities that know the most about the containers and pods are the ones responsible for gathering their stats.

Here's the alpha state of the KEP, which it reached in the last couple of releases. These are the CRI messages the kubelet will request from the CRI to fulfill its needs for stats and metrics. First there's the pod sandbox stats message, which feeds into the stats summary API. The stats summary API has structured fields it needs to fill for the components that rely on it, like the eviction manager, so the request asks for exactly the things it needs. We also have another structure, pod sandbox metrics, and those will eventually feed into the /metrics/cadvisor endpoint, because they're basically just key/value pairs for the metrics themselves. The CRI will gather the metrics in a similar way to how cAdvisor does today, report them up to the kubelet, and the kubelet will report them through /metrics/cadvisor.

Looking forward and imagining how CRI stats will be exposed in the future: the kubelet will continue to expose those two endpoints. The /metrics/cadvisor endpoint will be based on the metrics object of the CRI — the kubelet makes the request down to the CRI, the CRI gathers all the information and reports it back up through the CRI API, and the kubelet converts it into Prometheus metrics and emits them, again over /metrics/cadvisor, proxying it in a similar way as it did before for cAdvisor. So that endpoint won't change. For the stats summary API, the kubelet will request the stats object from the CRI implementation, which will in similar fashion gather all the stats and package them up, and the kubelet will translate that into its stats summary object. From the stats summary API, the /metrics/resource endpoint will also be fulfilled, and the same data feeds the eviction manager. The kubelet will still depend on cAdvisor for gathering node-level metrics, because even though the CRI knows the most about the individual containers, it may not know best how to gather metrics for the full node — so we're still going to get machine stats from cAdvisor, and, as we saw before, those will feed into the eviction manager.

So what are we looking at going forward? We've completed alpha, and on the way to beta the main priority is testing. We want to test a couple of different things. First, the accuracy of all those metrics: believe it or not, while the stats summary API is somewhat tested and we do validate those stats, we actually do very little testing of the /metrics/cadvisor endpoint. So in the process of moving the gathering to a place where it could be done better, we also want to make sure the new place we're gathering it is accurate — which means adding testing for the new implementation as well as the old one.
We also want validation that all of the expected metrics even exist on the node. Right now cAdvisor could technically change any of those metrics, and any alerts built on them would quietly break. So part of this is adding coverage to make sure we keep emitting all the metrics people have come to expect — it has functionally become a stable API of Kubernetes even though we never made any promises about that, so we're going to start making that promise. And then there's testing the performance. Metrics gathering is a really expensive thing the kubelet does, and the prospect of moving it into an entirely new component that hasn't been in this business for the entire life of Kubernetes understandably concerns some people — you might worry about the performance impact. It's an explicit goal of the KEP that there is little or no performance hit from making this change, and that work will also let us measure the performance of the existing implementation.

So you as an end user might hear all the things we're talking about changing and worry: what am I going to see? Is my stats collection going to get messed up? Ideally, the answer is nothing — there should be no impact to stats collection. The stats summary is a stable API of the kubelet; it's already being tested a little, and we're going to add extra testing to make sure those fields can be relied on and eviction still works as we expect. For the /metrics/cadvisor endpoint, even though those Prometheus metrics haven't really been tested before, we're going to make sure they get tested so you can all rely on them and not worry about them changing. In general, all of this extra testing should prevent regressions, so the metrics and stats you've been relying on won't change out from under you.

In summary: observability helps you gain insight into your application platform, and you can use it to debug an outage or a misbehaving app, as we saw earlier. cAdvisor currently powers workload and container monitoring in Kubernetes, but with KEP 2371 we're moving that from cAdvisor into the CRI, for better performance and an architecture that makes a little more sense. As always, contributions are welcome if you have any interest in this work — come chat with us in SIG Node; we're happy to help as we usher in this new world. I'd like to thank everyone for joining us, as well as all of these communities — SIG Node, cAdvisor, the container runtime maintainers, and the runc maintainers — all of which the stats collection relies on; they are vital components of the system. And here are some more resources you can look into if you have any questions. I believe we might also still have time for questions — so thank you everyone for joining. Do we have any questions?