Hello everyone, and welcome to the Kubernetes SIG Instrumentation introduction and deep dive session. I'm David Ashpole from Google.

Hey, I'm Frederic. I'm the CEO and founder of Polar Signals.

I'm Elana Hashman. I currently work as a principal site reliability engineer at Red Hat, and I'm one of the co-chairs of SIG Instrumentation.

Hey, I'm Han. I'm also a co-chair of SIG Instrumentation, and I also work at Google with David.

Great, so I'm going to kick us off, and this is what we're going to cover in today's maintainer track session. I'm going to introduce what SIG Instrumentation does and what we're responsible for, and then we'll get into some of our current activities: status updates on what we're doing for the various pillars of the SIG (metrics, logs and events, and traces) and what's going on with all of our SIG sub-projects. In each of these sections, one of my co-leads will introduce the topic and what's going on. Then I'll close out our talk with some resources on how to get involved in the SIG and how to contribute, where to find us in the various Kubernetes online spaces, and some links to related talks.

So what does SIG Instrumentation do? You can find these slides on the schedule, so you can click all of these links if you want a chance to drill down. Our charter in the Kubernetes community repo says that we cover best practices for cluster observability across all Kubernetes components, and that we also develop all of the relevant components. In summary, I like to think of this as working on metrics, logs and events, and traces, which are our various pillars of observability within the project. We're also responsible for a number of sub-projects, again covered in our charter, and some of those you might be most familiar with include kube-state-metrics, klog, metrics-server, and many more.

So how do we do it? Our SIG activities usually involve triaging and fixing relevant instrumentation issues, reviewing all code changes to metrics Go files, developing new features and enhancements, and maintaining all of our various sub-projects.

Hey, Elana, what do you call a chart without any underlying metric data? I don't know, what do you call it, Han? Pointless. You've got to have the drum beat. So in SIG Instrumentation, you'll know me either as the metrics guy or the guy who likes terrible puns, but you're not going to hear me tell a pun about insects, because they bug me. All right, that's the last one.

Just as an aside, in order to understand how metrics work in Kubernetes, we have to know a little bit about Kubernetes itself. So, very briefly: Kubernetes is a pretty complicated stack, it has a lot of disparate components, and it gets even more complicated because these disparate components have quite a large number of interactions. This can make it difficult to tell if and when something is going wrong, and that is basically where metrics come in.

In Kubernetes, we instrument our binaries using Prometheus clients, which can look something like this. In general, you have a component that you want to instrument, some bit of software, and the software exposes a simple HTTP endpoint, conventionally /metrics, which is then scraped by some monitoring agent and inserted into a time-series database or backend. For those of you who are already familiar with Prometheus metrics, this may seem simple and a bit boring. And yeah, it is simple enough: we have some software and we want to measure stuff.
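As a rough illustration of that pattern, a minimal sketch using the prometheus/client_golang library might look like this; the metric name, label, and port here are made up for the example, not actual Kubernetes code:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter tracking how many requests the component has handled.
// The name and label are illustrative only.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_requests_total",
		Help: "Total number of requests handled, by HTTP status code.",
	},
	[]string{"code"},
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Instrument the handler: count each request we serve.
		requestsTotal.WithLabelValues("200").Inc()
		w.Write([]byte("ok"))
	})

	// Expose the conventional /metrics endpoint for a monitoring agent to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A Prometheus server (or any other compatible agent) would then be configured to scrape localhost:8080/metrics on some interval and store the samples in its time-series database.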
I mean, what could possibly go wrong? Well, it turns out quite a lot. Metrics can become memory leaks, typically through unbounded cardinality in label values. Even worse, we can have unbounded, interpolated metric names, and we can get any of these leaks in any of the Kubernetes binaries. These issues can be latent, and they can manifest through innocuous underlying changes. Sometimes our metrics just don't do the job properly: we have inadequate bucket sizes that don't make any sense. If you only have two buckets for a latency metric, obviously you're not going to get super helpful latency data. Or the metric name says it's emitting seconds, but the unit it's actually emitting is microseconds.

So that brings us to the stuff we do involving metrics in SIG Instrumentation. First and foremost, due to all of these things we have encountered in Kubernetes, we overhauled our metrics. This landed in GA recently; if you want to see the KEP, it's linked up here. This fixed a bunch of inconsistent and broken metrics across Kubernetes, but in doing so we changed the API, and that caused issues for people who were ingesting our older, broken metrics. To offset this, we implemented a stability framework, so that people who are ingesting Kubernetes metrics can rely on them with a proper deprecation policy. This landed in beta, and we have work slated to take it to GA in 1.21.

We don't only focus on broken metrics; we also focus on improving existing metrics. On the road to alpha we have pod resource metrics, which is slated to land in 1.20, and dynamic cardinality enforcement, slated to land in 1.21. But our work doesn't just involve fixing metrics and improving and iterating over existing metrics in the kube binaries. We also want to make it easier for people to debug Kubernetes clusters, and to do that we wrote a tool called promq, which is basically an in-memory Prometheus client running in your CLI to help you debug native Prometheus endpoints. We have a link if you want to check that out.

Cool, thanks, Han. That was really interesting. So now let's talk about logs and events. I'll start with events. Events in Kubernetes are the way that users tend to interact with Kubernetes objects when they're first getting started, as well as when they're just trying to figure out what the heck's been going on. To think about it in terms of telemetry, it's essentially like writing a structured log message to the API server that users are then allowed to query, and it shows up in things like kubectl describe. Almost three years ago, we found that we were having some scalability issues with events, because they can be quite spammy: if something is crash looping, for example, that will emit a lot of events. So way, way back then we came up with a plan for how to change the event object to make it scale better and add a little bit more structure to it, and now, finally, three years later, we've graduated this new events API to GA, and it's currently in use in many components in the Kubernetes ecosystem. So great job, we moved something to GA.

For logging: this is probably the simplest form of telemetry, right? We're just writing things to files, so what the heck could we improve with logging? Well, it turns out that oftentimes you want to know which things are being referenced in a log message. For example, if I have a log message and it's about a pod, it would be nice to represent fields from the pod, such as the pod name, in a standardized way, so that if I'm a log ingester I'm able to take those fields and potentially search over those attributes in my backend. So SIG Instrumentation worked on structured logging, and the method we chose to introduce this is to add new methods to the klog library. You'll notice that the Info and Error methods here are appended with an S instead of the usual F for format. What this does is allow you to set a message for your log line and then add in key/value pairs, one after the other; so, for example, the key might be "pod" and the value might be the name of the pod. This is alpha in 1.19, and it can be enabled on any component in Kubernetes using the logging format flag. Because there are certain log messages that people really care about, potentially some stuff in the API server or the kubelet, we're starting with those log messages that people find most impactful. If you're curious about the details, feel free to check out the structured logging blog post.
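As a rough sketch of the difference, assuming the klog/v2 API described above (the pod name, namespace, and message are made up for illustration):

```go
package main

import "k8s.io/klog/v2"

func main() {
	// Traditional format-string logging: fine for humans, but harder
	// for a log ingester to query on specific fields.
	klog.Infof("Updated pod %s/%s status to %s", "default", "example-pod", "Running")

	// Structured logging: a constant message plus key/value pairs, so a
	// logging backend can index and search on fields like "pod" directly.
	klog.InfoS("Updated pod status", "pod", klog.KRef("default", "example-pod"), "status", "Running")

	klog.Flush()
}
```

With the logging format flag switched to the JSON output, a call like the InfoS one above should come out as a single machine-parsable record with the keys as queryable fields.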
The next set of improvements we've been working on for logging is related to logging security. As a general rule, it's a very bad thing to log credentials or other secrets into Kubernetes logs, and sadly this has happened more than once in the recent past; it also came up in a recent security review by a third party. So we knew that we had to do something about it, and we are taking two approaches in the 1.20 release.

The first is dynamic sanitization of logs, meaning that when your Kubernetes component is running in your cluster and tries to log something we think is bad, that log message will be blocked or otherwise modified so it doesn't contain the secret information. So that's one method. The other method that we're going to be applying in 1.20 is static checking, and this is mostly during development. You can think of this as adding presubmit checks that statically analyze all of our controller binaries and look for points at which secret information could be logged through the klog libraries; we try to programmatically figure out where that could be happening. This can be enabled with the logging sanitization flag (I believe that's for the dynamic sanitization), and you can see more information at the KEP linked here.

Now, let's talk about traces. Traces are exciting and new; in fact, this is the first time that Kubernetes is doing anything with tracing at all. So, what are we tackling first? Well, we decided to start with a simple and straightforward use case: tracing API server requests. The API server is a big HTTP server that runs at the heart of a Kubernetes cluster, and it would be really useful to know how long different requests take, especially for those that don't behave as we expect. Maybe they're too slow, and we'd like to be able to see detailed information about how that request passed through the API server and on to other components such as etcd. So in 1.20 we're going to be adding distributed tracing to the API server using OpenTelemetry, and you can enable it by specifying a configuration file with the OpenTelemetry config file flag. If you'd like to read a bit more about it, you can look at the KEP, which is linked below.
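This is not the actual API server code, just a minimal sketch of the kind of span instrumentation OpenTelemetry provides, assuming the go.opentelemetry.io/otel packages; the tracer, span, and attribute names are made up:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleRequest sketches what tracing a single request might look like:
// a span starts when the request arrives and ends when it completes, and
// downstream calls made with the same ctx show up as child spans.
func handleRequest(ctx context.Context, verb, resource string) {
	tracer := otel.Tracer("apiserver-example")
	ctx, span := tracer.Start(ctx, "handle-request")
	defer span.End()

	// Attach request details so they show up on the span in the tracing backend.
	span.SetAttributes(
		attribute.String("verb", verb),
		attribute.String("resource", resource),
	)

	// ... do the actual work, passing ctx along so calls to backends
	// (for example etcd) can record child spans under this one ...
	_ = ctx
}

func main() {
	handleRequest(context.Background(), "GET", "pods")
}
```

Without an exporter configured this is a no-op, which is part of the appeal: the instrumentation can stay in place and only produce data when tracing is enabled.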
All right. Thank you, David, for sharing all this awesome work; I'm definitely super excited about the tracing work that's happening. Unfortunately, I don't have a great joke to start out with, so let's just take it away and talk about some of our sub-projects. Next slide.

So we have three primary sub-projects. Actually, we have a couple more, but this is the selection that we want to talk about this time around. The first one is kube-state-metrics, which I believe may be the oldest sub-project of SIG Instrumentation. It's actually under the Kubernetes org, that's how old it is; I'm saying that because things generally don't get admitted to the Kubernetes org anymore. The other one we're talking about today is metrics-server, and we'll see what that is a little bit later, and then the Kubernetes Prometheus adapter, which we'll also get to.

Okay, so first off, kube-state-metrics. I think this is a really exciting component, and it really originated from a need we saw when talking to a bunch of people in the Kubernetes ecosystem who were also using Prometheus at the time. The gap was that people were saying: well, Prometheus is great and Kubernetes is great, but when I actually troubleshoot my applications, I still drop down into kubectl and query things. It would be really handy if I had a lot of this information queryable in Prometheus: I could alert on it, I could do all these automated workflows with it. And that's kind of where kube-state-metrics was born.

The philosophy we have with kube-state-metrics is that anything that can be a metric in a Kubernetes API object (pods, deployments, stateful sets, and so on) should be one: you can pretty much find metrics about any API object that is available in Kubernetes as metrics in kube-state-metrics. We take everything that can possibly be a metric and convert it into the Prometheus exposition format, and then whenever Prometheus comes around, it just scrapes this output and ingests it. Then we can do really exciting things like what we have on the slides here, where we can say: well, I have some expected number of replicas and I have an actual number of replicas of my deployment, and if these are not the same, then obviously something's not going the way it should be going. We can write pretty sophisticated and really incredible alerting rules that have definitely helped me run applications on top of Kubernetes numerous times, so this is extremely helpful.
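As a rough illustration, that expected-versus-actual replica comparison can be written as a PromQL expression over kube-state-metrics series, something like the following; real alert rules usually add durations, labels, and severities on top of this:

```promql
# Fires for any deployment whose desired replica count does not match
# the number of replicas that are actually available.
kube_deployment_spec_replicas
  != kube_deployment_status_replicas_available
```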
One exciting thing about kube-state-metrics is that we've actually spent, wow, almost over a year cleaning up the entire code base, and we've started doing a couple of pre-releases of a new major version of the project. So please go ahead and check it out, try it out, run it on your clusters, and give us feedback, both in terms of performance and, obviously, in terms of whether things that used to work still work, if you're already running kube-state-metrics.

The last thing I want to mention about kube-state-metrics, which is kind of unique about it, is that we've done a number of really incredible performance improvements. kube-state-metrics doesn't actually use the normal Prometheus client, because, by the nature of what it does (converting every API object into a metric), it tends to have really huge metrics output. I'm talking megabytes on the /metrics endpoint, and so that is a different dimension of even just writing bytes out to an HTTP response. So if you're interested in performance work, there's lots still possible in kube-state-metrics; get involved in this project if these are things that are appealing to you. But we've already done a really incredible job, I think: we used to have tens of seconds of latency with really huge clusters, and we've brought that down to a handful of seconds in really, really huge clusters. So I think we've done a pretty good job, but there's always room for improvement. Yeah, that's what I have to say about kube-state-metrics.

Moving on to metrics-server: some people may not be aware of this component, but you've almost certainly used it, or another implementation of the resource metrics API, because the resource metrics API is essentially the generic description of an API that can be used to request CPU and memory usage of pods, containers, and nodes. If you've used kubectl top pod, you've actually indirectly used this API, because kubectl top pod essentially queries a resource metrics API implementation, and metrics-server happens to be, as we say, the default implementation of the resource metrics API. This API can also be used to autoscale your deployments on Kubernetes with resource metrics, so, as I said, CPU or memory, and the various kinds of usage targets that you can configure.

The thing to keep in mind here is that this component essentially works very similarly to Prometheus, but with a very narrow scope. It goes around to each individual kubelet and collects all of these metrics by pulling them, I believe every minute, holds that state in memory, and then whenever there's a request, say from kubectl top for example, it presents the information that has been requested. This component is intentionally very, very narrow so that we can have very crisp expectations on the scalability requirements and things like that. So this is one possible implementation of the resource metrics API.

Then we have another one (next slide), which is actually a project that we're just starting to adopt. As of this recording, the adoption of this project hasn't entirely gone through, but I expect that over the next couple of weeks it will. It has already been accepted by SIG Instrumentation as a whole; we just need to figure out some people signing the CLA. Essentially, much like metrics-server is the default implementation of the resource metrics API, the Kubernetes Prometheus adapter is an implementation of the resource metrics API as well as the custom and external metrics APIs, which are also generic descriptions of metrics APIs. This one, as the name already says, is backed by Prometheus. Why this is useful: if you already have Prometheus collecting these metrics anyway, you might as well use Prometheus to present these metrics to your users or use them for autoscaling purposes, right? That way you don't have to have an extra process running in your cluster that uses memory and CPU to essentially collect the same things.
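Whichever implementation serves it, consuming the resource metrics API from Go looks roughly the same. Here's a minimal sketch assuming the k8s.io/metrics client libraries (this is essentially the API that kubectl top talks to; exact client signatures vary a bit across Kubernetes versions):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Load kubeconfig the same way kubectl does.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	client, err := metricsclient.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List current CPU and memory usage for pods in the default namespace,
	// served by whichever resource metrics API implementation is installed
	// (metrics-server or the Prometheus adapter).
	podMetrics, err := client.MetricsV1beta1().PodMetricses("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pm := range podMetrics.Items {
		for _, c := range pm.Containers {
			fmt.Printf("%s/%s: cpu=%s memory=%s\n",
				pm.Name, c.Name, c.Usage.Cpu().String(), c.Usage.Memory().String())
		}
	}
}
```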
To be clear, the adapter route really only makes sense if you're already running Prometheus anyway. I would say that if you're not using Prometheus, this is probably too complicated a setup just to get the resource metrics API; in that case, metrics-server is probably the better choice. But if Prometheus is already your monitoring system of choice, I would highly recommend giving this a try.

So that's all the sub-project selections that we wanted to share today, and now back to Elana.

Thanks so much, Frederic, that was awesome. So you've heard a bunch about our SIG activities and what we're working on; how can you get involved in that? The first thing is, if you're interested in getting involved, attend our SIG meetings. That's the best way to get an idea of what's happening with the SIG, what sorts of things you can work on, what sorts of projects are looking for contributors, and to get to know the various people working on the various different components. You can also start participating in reviews, issues, and documentation. For all of those things we're happy to accept new contributors, and you don't need to ask for permission; you can just jump in and we will take a look.

In terms of specific projects: kube-state-metrics is explicitly seeking new contributors, and you can reach out to Lili if you're interested in working on that. Both metrics-server and the structured logging implementation are seeking contributors, and if you're interested in working on those, you can contact Marek. And promq is also seeking new contributors, so if you want to work on promq, which Han introduced earlier, you can reach out to him or Sally.

So how can you find us? We have regular SIG meetings, effectively once a week; we have two alternating bi-weekly meetings. Our regular SIG meeting is on Thursdays at 9:30 a.m. Pacific time, and that alternates every other week with a triage meeting, where we go over our PR and issue backlog, on Wednesdays at 9 a.m. Pacific time. If you want to find more information or reach out to various folks in the SIG, you can visit our Slack channel, #sig-instrumentation, and you can join our mailing list, which is a Google group that will also give you write access to the meeting agendas linked above. And if you need to know who's in charge of what, just repeating again: the chairs of the SIG are myself and Han, and the tech leads are Frederic and David.

The last thing I wanted to mention before we close out the talk is some other relevant talks from this KubeCon that we want to give brief shout-outs to, which go into a little more depth on some of the things we covered today. There is an entire talk discussing the structured logging implementation in Kubernetes 1.19, so I highly recommend you give that recording a look if you're interested in what's going on with structured logging. That's been at least a two-year-long effort, so I'm very excited to see it land and hopefully soon become GA. As well, the CNCF SIG Observability intro and deep dive talk is scheduled, I think, at the same time as this talk, so if you're watching this one, I recommend you take a look at that if you want to look at observability in the wider cloud native ecosystem. There are of course many other related talks, and I don't want to talk about every single talk at KubeCon, but you can check out both the observability and maintainer tracks for more talks on Kubernetes observability and instrumentation. And thanks so much for joining us. I hope you had a great KubeCon!