All right, welcome everyone. This is the SIG Instrumentation intro and deep dive. I'm Frederic, I'm the founder of Polar Signals, and I'm one of the tech leads for the special interest group for instrumentation in Kubernetes. Today with me I have David Ashpole, who's also a tech lead within SIG Instrumentation; he works at Google. We've also got Elana, who is one of the chairs of SIG Instrumentation; she works at Red Hat. And we've got Han, who also works at Google and is also a chair.

If you don't already know, we have several special interest groups in Kubernetes, and essentially they are focus groups for certain areas. We focus on instrumentation, and we'll talk a little bit more in depth about what we actually mean by that, but essentially we take care of all things observability.

Today I want to talk a little bit about that definition and about how we work as a special interest group. Then I'm going to walk over a couple of our sub-projects. Then Elana is going to talk about logging and some of our initiatives within logging, then David is going to talk about tracing, and Han is going to finish up with metrics. And then we'll let you know a little bit about how you can contribute, where you can find us, and some related talks on these topics.

Essentially, the way all special interest groups in Kubernetes get created is that they have a charter; they have a specific purpose. This is a literal excerpt from our SIG Instrumentation charter: our purpose is to cover the best practices for cluster observability across all components and to develop relevant components. Effectively, what this means is that we care about more than just the kubernetes/kubernetes repository, although we care a lot about that, obviously. We also create additional components that may be additionally helpful for understanding what's going on in your Kubernetes cluster. Of course, we also care a great deal about the instrumentation-related things within the Kubernetes project, and we maintain several libraries within the Kubernetes repository, but also for external use. And then we create additional components that can be really useful — maybe they don't necessarily cover every Kubernetes user, but hopefully a lot of them. Just some examples of sub-projects that we have are kube-state-metrics, klog — this is kind of our canonical logging library for Kubernetes — metrics-server, and more, and I'll talk more about these.

We often split most of our topics into metrics, logs, and tracing. More generally — and this is how all special interest groups within Kubernetes work — we triage all instrumentation-related issues and pull requests through labeling. This labeling either happens automatically, because we're a code owner of a particular piece of code, or someone tags us in it because they have identified that it's an instrumentation-related thing after they've done a first pass on the issue or pull request. We also try to review all changes to metrics. We're not necessarily a blocking review for all metrics, but for some we may be, especially because of our stability guidelines: the more stable a metric becomes in terms of stability guarantees, the stricter the reviews become as well.

And — this is again a more general process within Kubernetes — whenever we develop larger-scale things that are more involved,
we create these things called Kubernetes Enhancement Proposals — in the Kubernetes community they're typically just referred to as KEPs — and of course we write these for SIG Instrumentation as well. And then, as I said, we maintain sub-projects.

So let's talk about some of these sub-projects, and potentially something that you could get involved in. I picked these three because they're the ones that I'm most familiar with and that I think are kind of the most important ones that we take care of.

kube-state-metrics is the one that I personally have been involved with the longest, other than the Kubernetes repository, and probably the most, even. Essentially, if you're familiar with the Prometheus ecosystem, you can think of kube-state-metrics as a Prometheus exporter for a Kubernetes cluster. What we do is we look at the Kubernetes API, and anything that could possibly be useful as a metric in Prometheus we convert to a Prometheus-style metric. This way we get information about pods, about deployments, about StatefulSets, about all of these wonderful things, so that we can then create alerts or dashboards out of it — visualize it and make it useful.

Just a really quick example that I wanted to bring up to highlight the usefulness of this component: this is an actual example where we have the expected replicas of a deployment and we have the actual replicas of a deployment. Why this is useful is that now we can compare these two numbers and understand whether a deployment has been rolled out successfully. And this is just one of many useful examples. kube-state-metrics exposes a lot of metrics and it's highly optimized, so it's a really awesome component, and if you're interested, it's also a really great way to get involved.

The second component that I want to talk about is metrics-server. In Kubernetes we have an abstract metrics API, and essentially the reason why this was created is so that we could have a common language for autoscaling needs — so that when we use a lot of CPU or a lot of memory, we can automatically scale our deployments. Kind of as a side effect of that, we got kubectl top for free, which is essentially similar to the Linux or Unix command top that you may be familiar with, where you can see the memory and CPU usage of your processes. The way this works is that metrics-server uses what are called Kubernetes aggregated APIs: whenever there's a call to the Kubernetes API for the metrics API, the Kubernetes API server just forwards it to metrics-server, and metrics-server essentially collects these metrics asynchronously from the Kubernetes nodes and then returns them on request.
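To make that a bit more concrete, here is a minimal sketch — not from the talk, and not metrics-server code — of reading the same aggregated resource metrics API that kubectl top consumes, using the k8s.io/metrics client library. It assumes a cluster with metrics-server running and in-cluster credentials.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Assumes the program runs inside a pod with a service account.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := metricsclient.NewForConfigOrDie(cfg)

	// The aggregation layer routes this request to metrics-server, which
	// serves the metrics API with per-node CPU and memory usage.
	nodes, err := client.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s cpu=%s memory=%s\n", n.Name, n.Usage.Cpu(), n.Usage.Memory())
	}
}
```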
And in a way, the Prometheus adapter is actually the exact same thing as metrics-server, except that it doesn't reimplement all of this gathering functionality. It lets Prometheus do all of that and just acts as a translation layer, so that when the Kubernetes API asks for resource metrics, it forwards that to Prometheus and does the API conversion — because obviously they don't necessarily speak the same APIs — and then returns the result. The really cool thing about the Prometheus adapter is that it doesn't only speak the resource metrics API; it also speaks the custom and external metrics APIs. This is really cool because now we can not only autoscale based on CPU and memory usage, but we can autoscale on any metric that we have in our Prometheus server. This is a really, really powerful tool to have. And we actually just recently migrated this project — previously it was maintained by Solly; thank you, Solly, for having maintained this project for such a long time — and now it's finally under the umbrella of SIG Instrumentation. I believe we finished this migration just a couple of months ago.

So yeah, these are the three sub-projects that I wanted to highlight to you today, and now Elana will go on with logs. Thank you.

Thanks, Frederic. So now I'm going to spend a little bit of time talking about what SIG Instrumentation is working on in the world of logging. Much of our effort has been dedicated towards transitioning Kubernetes to structured logs. Now, you may ask, what is a structured log? I've demonstrated it on this slide: as an introduction, here is sort of what the before and after looks like. The before is the first log line, and that is what a log line in the kubelet looked like prior to transitioning to structured logging, followed by the after views. I've included both text and JSON versions of the log, because by default the kubelet will continue to log in text mode, but you can also turn on JSON mode, which can then be ingested by various log aggregation tools and makes indexing of these log entries much easier.

Which brings us to the benefits of structured logs — why would we want to do this? It makes it much easier to aggregate and correlate logs by presenting them in a fashion where we don't have to parse them all after the fact; they're already in a serializable format, and various tools can deal with them, rather than dealing with just the raw syslog, where we might have to use regexes or other things to parse those log entries in order to determine what happened where. So structured logs win.

We started out by migrating the kubelet in the 1.21 release, and you can even track our progress in the attached issue. Part of that work included some static analysis to prevent regressions: every file that was migrated in the kubelet was marked such that CI will ensure that if you go and add a non-structured log, it won't pass — you have to go and ensure the log entry is structured. In terms of what we're going to be migrating in the future, we didn't target any particular component in 1.22, although migrations continued, and for the upcoming 1.23 release — which by the time you watch this recording will be in progress — we're looking at selecting components for migration; that's tracked in the linked issue there.
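As a rough illustration of what this migration looks like in code — a hypothetical example, not an actual kubelet log line — here is the difference between the old printf-style klog calls and the structured klog API, with approximate output shown in comments.

```go
package main

import (
	"errors"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	err := errors.New("image pull backoff")

	// Before: free-form, printf-style message that downstream tools have to
	// parse with regexes.
	klog.Infof("Failed to start pod %s/%s: %v", "default", "my-app-1234", err)

	// After: a constant message plus key/value pairs. In the default text
	// format this renders roughly as:
	//   "Failed to start pod" err="image pull backoff" namespace="default" pod="my-app-1234"
	// and with JSON output enabled it becomes one JSON object per line.
	klog.ErrorS(err, "Failed to start pod", "namespace", "default", "pod", "my-app-1234")
}
```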
We're also looking at deprecating some klog-specific flags in Kubernetes components. These flags have been supported since we initially used glog and later transitioned to klog, but they kind of came along as we were implementing klog with glog support; they weren't necessarily intended to be part of Kubernetes — we just happened to get this feature set. And now that we're trying to implement flag parity between text logging and JSON logging, we're finding that it is relatively difficult to support this quite large and featureful set of flags, which includes all sorts of options for log rotation and the like. So we currently have an enhancement in progress, targeted for the 1.23 release, that will deprecate a number of these flags using the standard Kubernetes deprecation process and remove them from Kubernetes components. Not in scope here is removing them from klog itself — they'll stay in the library; we just won't support them in Kubernetes anymore.

So who's working on structured logging? One of the more exciting things that I get to talk about is the creation of our new working group, Structured Logging, which was formed to manage the structured log migration. The organizers of the new working group are Marek from Google and Shubheksha from Apple, and I'm very excited that they've stepped up to lead this effort. They have a Slack channel, so you can check it out there; they also have a charter in the community repo, and they meet bi-weekly on Thursdays at 15:00 UTC. They need your help — I know that as part of the kubelet migration in 1.21 we had a lot of new contributors who helped out with the effort, so I'm sure that there will be lots to do in the 1.23 release and onward. It's definitely a great place to get involved if you're looking for new things to do in Kubernetes.

Log security is the last thing that I wanted to talk about in the world of logging. We have a couple of KEPs on our backlog that we've been working on since the 1.20 release, where we introduced some features to hide credentials and secrets from logs, so that an attacker who is able to access your logs doesn't get access to the rest of your system. We had two of these: we included static checks at build time to ensure that people weren't writing secrets out to logs, as well as some support for dynamic sanitization. Those static checks and dynamic sanitization both debuted as alpha features in the 1.20 release, which we discussed at KubeCon last year. The static checks are targeted for graduation in the 1.23 release, and dynamic sanitization has stayed in alpha for some time, but I believe we will be targeting beta for the 1.23 release — you can enable that feature with the logging sanitization flag.

And I think that's all for me, and I'm going to hand it over to David to talk about tracing.

Thanks, Elana.
A lot has happened in the area of distributed tracing since the last KubeCon. My name is David Ashpole, I've had the privilege of working in this area, and I'm excited to share the progress we've made.

Control plane tracing, which includes both the API server and etcd, is now in alpha. This is something that the SIG has been working towards for a very long time, and it's really exciting to finally see it happen — a special shout-out to Lili for her work on the etcd integration. Control plane tracing allows cluster operators to collect distributed traces for requests sent to the Kubernetes control plane. This will make it easier to debug slow requests and to figure out what path the requests took through the system.

Just getting off the ground is an effort to start collecting distributed traces for what the kubelet and container runtime are doing for pods. There's a proposal in the works for how this will work in the kubelet, and interest in the integration from some of the container runtimes as well. Finally, there's been a long-term effort to figure out how to propagate context between controllers, so that we can glue together all of this work that we've been doing. But the context propagation proposal is more general than that and would apply both to traces and to logs.

I want to dive a little bit deeper into control plane tracing, since it just reached alpha. What I did for this demo is I enabled the API server tracing feature gate, and on the API server I set the tracing config file flag to a file containing a configuration with a 1% sampling rate for the API server. Then on etcd I enabled the experimental distributed tracing feature. For each of those I ran an OpenTelemetry Collector as a sidecar to collect spans from them and to send them to my backend — in this example I ended up sending them to Jaeger, but it's super easy to send them to hosted trace backends or to other popular open source backends as well.

So here's an example trace. You can see the top bar is the incoming request: the request first hit the API server at the very beginning, and the API server responded at the very end. The teal lines are the spans that are emitted by the API server — that includes this top one and both of these child spans here. Then I'm also running a custom mutating admission controller here, and etcd has its tracing enabled as well, and that span is here. So from this example we can see all the way from when the API server first got the request, to when it asked for a response from a mutating admission controller, to when it actually committed the transaction to etcd and responded to the user.

This hopefully will be really useful to operators, as I said, and this is just the beginning. We know that tracing becomes more useful the more things integrate with it, so we're excited to see where this all leads.

Up next to talk about metrics is Han — take it away.

Thanks, David. Hey everyone.
I'm Han, and today I'm going to be talking about metrics in Kubernetes. First I'm going to talk about some more foundational stuff, and then I'm going to get into the history of Kubernetes instrumentation and how our SIG comes into play.

First, you all should know that we use Prometheus client libraries for instrumentation. This means that the primary Kubernetes components — the kube-apiserver, the controller manager, the scheduler, the kubelet — all expose text-based metrics endpoints, which can then be scraped and ingested by a variety of different time-series backends, though commonly this tends to be Prometheus. Prometheus has a number of different metric types, which I won't get into since this isn't a Prometheus talk, but needless to say we use basically all of them in Kubernetes.

Now let's get into some history. It's a bit of an understatement to say that we've had some problems with metrics in Kubernetes. There is usually a common theme underlying these, though, which is that metrics can grow in size and quickly become memory leaks, and this can cause instability in the component. Why is this the case? Let's take a look.

Take this example metric — it's a request metric. Like all metrics, it has a width and a height. By width, what I mean is the number of labels: for this example metric we have verb, code, and path, which is a pretty common pattern, so this has a width of three. By height, what I mean is the total number of values that a label can have: as you can see, verb has a height of four, code has a height of four, and path has a height of one. The relationship between labels and label values is that each of these combinations forms a single time series — PUT 200 pods is one time series, PUT 201 pods is another time series, POST 200 pods yet another. So if we add a new label with multiple values, it has a multiplicative effect on the total number of time series: here we have 4 × 4 × 1 = 16 series, and adding one more label with, say, ten values would turn that into 160. This can be problematic because it has a multiplicative effect on the size of the metric.

So this is where our SIG comes into play. We had exploding metrics and we needed to fix this, and we did: we created a framework which makes the structure of a metric an immutable API. We call this the metric stability framework. With static analysis and hooks in the Kubernetes commit pipelines, we can validate that stable metrics do not structurally change — they will always have the same labels; you cannot add or remove labels of a stable metric. This gives powerful guarantees to Kubernetes users, because you know that you can alert off of these stable metrics from release to release without breakage, and you can also create SLOs based off of these metrics.

But what happens if there is a code change and all of a sudden a thousand new label values are added to a stable metric? All of a sudden you're back in the same place: you have an exploding metric, tons of cardinality, and huge memory leaks. If you're a cluster operator, then your fix is basically to roll back your cluster to a safe Kubernetes version, wait for upstream Kubernetes to fix the issue, backport it, then cut a release — at which point you could easily have been waiting at least a month. Instead, we built this tool, metrics cardinality enforcement. What this allows you to do is, at runtime, specify valid values for a label through a command line flag. So let's say upstream Kubernetes makes a mistake and releases a bad metric, one that explodes in cardinality because a thousand new label values have been added: you don't have to roll back your cluster. All you have to do is restart the component with a flag that bounds the allowed values for that label, then wait for upstream to fix it, and you can roll forward.
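To make the stability framework a bit more concrete, here is a minimal, hypothetical sketch — not an actual Kubernetes metric — of how a component declares a stable metric with the instrumentation wrappers in k8s.io/component-base/metrics. The static analysis mentioned above keys off the StabilityLevel field, so changes to a stable metric's name or label set get flagged.

```go
package main

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// A stable metric: its name and label set are treated as an API, so the
// stability tooling flags any structural change for review.
var requestTotal = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Name:           "example_request_total",
		Help:           "Counter of example requests, broken out by verb, code and path.",
		StabilityLevel: metrics.STABLE,
	},
	[]string{"verb", "code", "path"},
)

func main() {
	legacyregistry.MustRegister(requestTotal)

	// Each distinct combination of label values is its own time series.
	requestTotal.WithLabelValues("PUT", "200", "pods").Inc()
	requestTotal.WithLabelValues("POST", "201", "pods").Inc()
}
```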
And in the future, we want your help. We want to keep improving these things; we want to make the usability of instrumentation better for users. One of the ways we want to do this is to extend metric stability: we want to bring it into parity with feature stages. Currently we actually only have two stability classes, alpha and stable — the former has absolutely no guarantees and stable metrics have very firm guarantees — and we want additional expressiveness to denote metrics which are kind of in the middle, ones that you might still be able to build charts on. We also want to hook into the static analysis pipeline to auto-generate metric documentation. Currently, the only way to find all of the metrics you have is either, one, reading the Kubernetes source code — that sucks — or, two, curling the metrics endpoint — that also sucks.

So please come to our meetings and get involved. We have a ton of sub-projects that could use your help, like kube-state-metrics, metrics-server, prometheus-adapter, structured logging, and tracing.

Where can you find us? Well, we have a meeting every Thursday at 9:30 Pacific Standard Time, and we alternate each week between a SIG meeting and a triage meeting; in the triage meeting we basically go over issues and PRs relevant to the SIG. You can find us pretty easily in the #sig-instrumentation Slack channel, and you can ping us directly in Slack — we should be easy to find because our Slack usernames are our GitHub handles.

Anyway, thanks for coming and listening to our talk.