All right, welcome everyone, and I hope you've had a great KubeCon so far. I guess this will be one of the last talks you attend, and I hope it will be fine and not too boring. What I will go through today is SIG Instrumentation, a quick introduction and a deep dive, as well as an update about our group.

And who am I to present SIG Instrumentation? I'm Damien Grisonnet, a software engineer at Red Hat, and I'm also the Kubernetes SIG Instrumentation tech lead, or at least one of the two tech leads of the group. I also maintain projects such as kube-state-metrics, metrics-server, and prometheus-adapter that you may know, and you can reach out to me on either GitHub or LinkedIn; here are my handles.

I don't know how much you already know about SIGs in general, or SIG Instrumentation in particular, so I will go through what a SIG is, what we do in SIG Instrumentation, the sub-projects that we own, the different signals that we cover as a group, how you could help us and contribute, and where you can find us to chat about anything, really.

So what do we do? A SIG in Kubernetes is a group that has a defined task and a defined area of expertise. In our case, we have a charter that says we have to cover the best practices for cluster observability across all the Kubernetes components and develop the relevant components needed to spread those practices. So we don't own the signals as such, we don't own the metrics of every component, but what we do own is the libraries that we provide to the different developers to help them improve their signals and build better observability as a whole.

To do that, we first have our sub-projects that we support, maintain, and oversee. Here are the three main ones, I would say: kube-state-metrics, klog, and metrics-server. kube-state-metrics is how you get metrics about your Kubernetes resources, klog is the logging library used to produce logs throughout Kubernetes, and metrics-server is what you would use for autoscaling. And we have many more; I will go through the details of a couple of them later.

That's what we do as part of our sub-projects. Then there are the various signals that you see on a daily basis in observability. We try to cover metrics in Kubernetes as well as logging and events. Events are a kind of log, a bit more structured we could say. In logging, we nowadays focus more on structured logging as well as contextual logging, because we've noticed that producing a lot of logs that cannot be aggregated isn't that useful to our end users. So we've invested in that, and we are also seeing more and more tracing around these days, which helps a lot when we are developing a project as big as Kubernetes. That's what we invest in.

And how do we do it? We have multiple bi-weekly sessions where we triage and try to fix the relevant instrumentation issues. Triaging means that every two weeks we go through all the issues and all the PRs and make sure that each one has at least an assignee or someone to look after it, so that at some point it will be addressed. We also review most of the changes made to metrics. I will go into how we manage metrics in Kubernetes later, but we tend to review most of the changes made to them in order to make sure that best practices are enforced and that the quality of the metrics is there as well.
We also develop new features and enhancements. There is a process in Kubernetes called KEPs, the Kubernetes Enhancement Proposals, and we go through that same process in order to make any kind of improvement. As I mentioned before, we maintain and support the sub-projects. And as of a couple of months ago, well, we used to do mentoring on a per-mentor basis, where each mentor was trying to help and introduce new people to the community, but now the SIG is trying to create a mentoring program. For now it's only internal, or at least we do it with the contributors that we've seen joining our meetings, but we plan to extend it and allow more external contributors to also join and learn more about Kubernetes and the different sub-projects later on. We are really investing in that because we really need new contributors these days.

So now that I've tried to present as best as I could what we do in Kubernetes, I will go a little bit deeper into the sub-projects, to give you an idea of how much we have to support in Kubernetes and what we provide to our users. The four I've chosen to present today are kube-state-metrics, metrics-server, prometheus-adapter, and a brand new sub-project that we've just recently added to our portfolio, usage-metrics-collector.

So, kube-state-metrics. I believe most Kubernetes users have kube-state-metrics installed in their cluster, because it's the main entry point if you really want to look into what's happening in your Kubernetes cluster. It's a Prometheus exporter, which means it's a piece of software that converts some third-party data into Prometheus metrics. In kube-state-metrics' case, that data is the API objects, such as pods. Any cluster administrator wants to know what's happening in their cluster: how many pods they have, how many deployments, what the state of these resources is. And it's not convenient to go through the command line and search for these resources, so we have this awesome project that allows you to expose metrics about them. Two examples of those metrics are kube_deployment_spec_replicas and kube_deployment_status_replicas_updated, which let you know at any point in time how many replicas a deployment is supposed to have and how many replicas were actually updated. That means that if a rolling update fails or something goes wrong, you can see it directly in the metrics, you can alert on those metrics, and you really have a better overview of what's happening in your cluster. And how does kube-state-metrics do that? By constantly watching the changes made to these resources through the kube-apiserver, converting this data into Prometheus metrics, and then hopefully Prometheus will scrape these metrics and expose them to the users.
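To make the exporter pattern a bit more concrete, here is a rough sketch of what it looks like in Go with the Prometheus client library. This is not kube-state-metrics' actual code: the deployment values are hard-coded for illustration instead of coming from a watch on the API server.

```go
// Minimal sketch of the exporter pattern kube-state-metrics follows:
// take state from Kubernetes API objects and expose it as Prometheus gauges.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	specReplicas = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "kube_deployment_spec_replicas",
		Help: "Number of desired replicas for a deployment.",
	}, []string{"namespace", "deployment"})

	updatedReplicas = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "kube_deployment_status_replicas_updated",
		Help: "Number of updated replicas for a deployment.",
	}, []string{"namespace", "deployment"})
)

func main() {
	prometheus.MustRegister(specReplicas, updatedReplicas)

	// In the real exporter these values come from watching deployments via
	// the kube-apiserver; here they are set once for illustration.
	specReplicas.WithLabelValues("default", "nginx").Set(3)
	updatedReplicas.WithLabelValues("default", "nginx").Set(2)

	// Prometheus scrapes this endpoint and exposes the metrics to users.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Comparing the two gauges in an alert or a dashboard is how you would spot a rolling update that is stuck.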
Another important project that we maintain, and that is also most likely running in most Kubernetes clusters these days, is metrics-server. It's the main project today for exposing resource metrics, and it's the source of kubectl top, for example. Without this piece of software, or a similar one implementing the resource metrics API that we support, you wouldn't be able to see anything with kubectl top and the command would fail. And as such, it's also the main source for any kind of autoscaling based on resources.

So whenever you create an HPA and set, let's say, the CPU utilization target on your HPA to 50%, the HPA controller will reach out to metrics-server and ask, what's the resource usage of this pod? metrics-server will answer, and then the HPA controller can take a decision on that and autoscale your application. And how does metrics-server do that? It essentially acts as an extension of the API server. Whenever some software asks for any of these resource metrics, the API server just forwards the request to metrics-server. In turn, metrics-server will have scraped the metrics from the kubelets and exposes them through the kube-apiserver as if it were the kube-apiserver itself replying to the request.

Another piece of software that is pretty similar, or has similar but somewhat broader goals, is prometheus-adapter. I mentioned the resource metrics API for autoscaling; there are also two other metrics APIs that we support as a SIG, the custom and external ones. These are meant for any kind of metric, not only the CPU and memory that metrics-server handles, to autoscale your application on. So for example, if you want to autoscale your web server based on its traffic, you can do that. If you want to autoscale it based on the number of objects in your RabbitMQ queue, you can also do that. prometheus-adapter allows for this kind of bigger reach with your autoscaling setup. And it does that by grabbing the data from Prometheus instead of from the kubelet, because the kubelet only knows about CPU and memory. So let's imagine that in your Prometheus backend you have metrics about your RabbitMQ; prometheus-adapter will be able to get those metrics and expose them to your autoscaling solution in Kubernetes, and then you can autoscale on anything from there. It's kind of similar to what KEDA is doing today, I don't know if any of you know KEDA, although KEDA supports a way bigger variety of scalers, whereas prometheus-adapter only supports Prometheus.

And another newcomer in the line of projects that we support is usage-metrics-collector. It's brand new because it was only donated to the SIG in January this year by Apple. They had been working on this project internally for quite a while and then handed it over to us. The goal is to have a more optimized way to collect the resource usage and capacity metrics inside a Kubernetes cluster, because there are limitations today that prevent you from scraping this data at high frequency, whereas usage-metrics-collector allows you to do it even every second. So if you really want capacity planning that is up to date, with fresh data, it's a good tool for that. It also allows for aggregation at collection time, so instead of sending a ton of data to your Prometheus instance, you can reduce the scope. And one good selling point of the project is that it doesn't require anyone to know PromQL, which is one of the big blockers we have whenever we expose software that works with Prometheus; it's always hard for new users to learn PromQL, and this project doesn't require it. So for example, I showed one configuration of how you can get the P95 of the CPU and memory utilization with a one-second scraping and sampling interval, and as you can see, there is no PromQL inside this configuration, although at the end of the day you get the same results. It will compute the P95 for you without you having to write the big, complex Prometheus query, which is pretty convenient and pretty easy to use. And you can expose the P99 directly: instead of having to expose the entire memory or CPU utilization, you can expose just the subset of data you care about and not the entire thing.
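Before moving on to the signals, here is a rough sketch of the HPA described above, targeting 50% average CPU utilization, written with the autoscaling/v2 Go types. In practice you would normally express this as a YAML manifest; the deployment name and replica bounds are made up for illustration.

```go
// Sketch of an HPA that relies on the resource metrics API (served by
// metrics-server) and targets 50% average CPU utilization. Custom or
// external metrics served by prometheus-adapter would be extra entries
// in the same Metrics list with a different metric source type.
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	minReplicas := int32(2)
	targetCPU := int32(50)

	hpa := autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "default"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "web", // hypothetical deployment
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricTargetType,
						AverageUtilization: &targetCPU,
					},
				},
			}},
		},
	}

	fmt.Printf("%+v\n", hpa.Spec)
}
```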
So then I guess I will go through the different signals, how we handle them, and what we provide to the Kubernetes developers to better work with them and improve the quality overall.

For us, the metrics are pretty simple. We have an integration with Prometheus directly, so we expose the metrics in the Prometheus format, and then they can be used by any software that understands this format to build alerts, collection, dashboards, whatever. So the entry point is really the Prometheus format. But the problem that we faced over the years is that, well, it works to have a lot of metrics, but in a project as big as Kubernetes, people tend to change the metrics on their own agenda. And we've learned that you cannot simply rename metrics, because if you change the name of a metric, the old metric ceases to exist; if you have an alert based on it, the alert won't do anything at all once the name changes. We went through a pretty big metrics overhaul a couple of years ago in order to improve the quality of the metrics, and that's how we figured out that, well, we cannot just change the names of all the metrics because it would break a lot of users. So in order to be able to do that kind of thing and improve the quality, we needed to put something in place to make sure that we don't break anyone. That's when we decided to create the metrics stability framework.

The idea was to create a framework on which we could express a certain level of stability for each metric and provide guarantees based on that. We also wanted some automation that would prevent users from making breaking changes without SIG Instrumentation being aware of it, and even without other users being aware of it. So that's what we did, and that's what we provide to most of the developers: a brand new library on top of the Prometheus one to make that easy for them to use.

The stability levels changed a bit over the years. We started with only alpha and stable. Alpha means that there is no guarantee at all on your metric: we could change the name, or the label names, of your metric at any time. We don't give any guarantee, so we can still improve the metric without breaking anyone. And stable meant that, well, you cannot change the metric at all. Once it's stable, it's set, and if you try to modify it, it will break someone, so you don't want to do that. But we figured that this wasn't enough and we needed more stability levels, because going from no support, no guarantees, straight to "you cannot change it anymore" is too big of a jump. We needed some levels in between. The one that was the most needed is beta, and that's what we are currently introducing. A beta metric already comes with some guarantees: once a metric is beta, it is on its way to becoming stable and should be treated as such. You can maybe add some new fields to it, but you cannot remove them. And internal is more for cases where we're introducing metrics that are more relevant to the developers than to the end users. We don't want any guarantees on them, because they could be removed at any time, they might only exist to test a certain feature, and they are really developer-facing. So you shouldn't build alerts on top of them either.

So this is what we have today for stability. We are currently working on establishing new guarantees for this, especially for the two new levels, beta and internal. Internal won't have any guarantees, but beta will have certain guarantees, and we need to make sure that we have the static analysis tooling to prevent users from making changes that we don't want. One thing that we definitely want is that SIG Instrumentation reviews all the metrics that go to beta, because we want to be in control of the quality of these metrics.
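As a rough illustration of what this framework looks like from a component developer's point of view, here is a minimal sketch using the k8s.io/component-base/metrics wrapper that Kubernetes components use instead of the raw Prometheus client. The metric name and label are made up; the StabilityLevel field is the part the stability framework and its static analysis key off.

```go
// Sketch of registering a metric through the Kubernetes metrics stability
// framework. The metric itself is hypothetical.
package main

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var exampleRequests = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Name: "example_requests_total", // made-up metric name
		Help: "Number of example requests, partitioned by result.",
		// ALPHA means no compatibility guarantees: the name or labels may
		// still change. Promoting it to BETA or STABLE is what triggers the
		// stricter review and tooling described above.
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"result"},
)

func init() {
	// Registering through the shared registry keeps the metric visible to
	// the stability tooling and to the component's /metrics endpoint.
	legacyregistry.MustRegister(exampleRequests)
}

func main() {
	exampleRequests.WithLabelValues("success").Inc()
}
```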
Another thing that we were able to do from that effort was to generate documentation for the metrics. That's a big problem that we've had, and I guess a lot of people have, which is discoverability of the metrics. It's super hard to figure out: what should I do in this case, is there a metric that can help me? There is such a variety of metrics that it's hard to know what to do. So we now have documentation on the official Kubernetes website. It's not perfect yet, because we are hitting limitations with the website format, but at least you can get an idea of what's exposed today by Kubernetes. It should include all the metrics that Kubernetes exposes, and it gives you a short description of what each metric does. So if you have, let's say, a problem with scheduling, you can search for scheduling in these metrics and maybe find something that could help you. It's pretty interesting and I'm pretty sure it will help a lot of people. Another aspect, later, will be to auto-generate the changelog. Today it's pretty hard for users to understand what has changed in the metrics, because there is no single place where we expose all of that, so if we were able to generate the changelog, that would be easier.

We also want, I don't know if any of you attended Observability Day, but there was a discussion about native histograms in Prometheus, the more recent support for them, and how they are now compatible with OpenTelemetry. That used to not be the case, so we were kind of reluctant to go that way, but now that there is some compatibility, we are thinking about integrating native histograms as well, and we will be looking into that in the upcoming months.

So that's it for the metrics, and we still have two signals left; I guess it won't be too boring for you. Now, the logs, which is usually what developers tend to be the most familiar with, because metrics are often an unknown area for them. Before structured logging, it was pretty hard to do anything with the logs because there was no format. It was impossible to aggregate them together or do any kind of operation across logs; you had to grep and do some magic in order to gather information, and it wasn't great. Then logs were converted to a text-based format, which was a bit more formatted, but still, in the backend that ingests your logs, it's hard to query something that is text-based, it's hard to find patterns. So we also decided to support JSON, which is way easier to query, and we added support for structured logging in JSON. Now most of the log lines tend to look like the last log line shown on this slide, which is way easier to understand and way easier to query.
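To make that concrete, here is a minimal sketch of what a structured call site looks like with klog. The pod name and message are made up; when the JSON log format is enabled, the same call comes out as a JSON object whose fields a log backend can index.

```go
// Sketch of the difference between unstructured and structured klog calls.
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	// Legacy, unstructured style: everything is baked into one string,
	// so a log backend can only grep for patterns.
	klog.Infof("Pod %s/%s is now %s", "default", "nginx-7cdb", "Running")

	// Structured style: a constant message plus key/value pairs that can
	// be aggregated and queried, especially with the JSON output format.
	klog.InfoS("Pod status updated", "pod", "default/nginx-7cdb", "status", "Running")
}
```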
And when did we do that? We started initially with only one component of Kubernetes, the kubelet, in 1.21. Later on, we also fully migrated kube-scheduler to structured logging, and the feature finally became stable in 1.26. That doesn't mean that all the components are fully migrated yet; there is still some non-structured logging lingering around the code base. But it does mean that we won't backtrack to a time where the logs weren't formatted. It's now set in stone: we will continue forward with formatted logs, which is great from a consumer perspective, I guess. As part of that, we also added some static analysis to prevent any kind of regression back to a world where logging wasn't structured. And while doing that, we noticed that klog, the library that we use for producing logs, has some flags that weren't meant to be client-side; those flags would be better suited to the platform side, your logging platform, nowadays. So we decided to deprecate them in 1.23, and they will be removed in 1.26. So yeah, if you are using any of these flags, be aware, because they are going away.

And now that we had structured logging almost everywhere, we still noticed that logging wasn't complete. There were still improvements that could be made, and one of them is what we call contextual logging, which is a way to have structured logs but with a context. The idea is that when you produce a log, you can attach a certain context to the logger, let's say a value or any kind of data that you want, and propagate it through your calls in order to make sure that the value is set across your call sites. This landed in Kubernetes as alpha in 1.24, and it's currently planned to become beta in 1.28. As I mentioned, what we used to have was a global logger; you couldn't do anything with it, because you cannot attach any data to it. So if you were, let's say, in the scheduler and were trying to make sure that all your log lines contain the name of the pod that you are trying to schedule, it was impossible. You had to make sure that every log call already contained this information, and there was no way to enforce that. That's the goal of contextual logging. And one more futuristic goal would be to also attach, let's say, a span ID or trace ID for tracing, in order to make correlation between logging and tracing super easy.
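Here is a minimal sketch of that idea with klog's contextual logging helpers: the caller attaches a key/value pair to the logger, stores it in the context, and every callee that pulls the logger back out of the context automatically logs the same field. The pod name and function names are made up.

```go
// Sketch of contextual logging with klog: attach values once, and every
// log line further down the call chain carries them automatically.
package main

import (
	"context"

	"k8s.io/klog/v2"
)

func schedulePod(ctx context.Context) {
	// Retrieve the logger from the context; it already carries the "pod"
	// key/value attached by the caller.
	logger := klog.FromContext(ctx)
	logger.Info("Attempting to schedule pod")
}

func main() {
	logger := klog.Background()
	// Attach the pod name once, instead of repeating it at every call site.
	logger = klog.LoggerWithValues(logger, "pod", "default/nginx-7cdb")
	ctx := klog.NewContext(context.Background(), logger)

	schedulePod(ctx)
	klog.Flush()
}
```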
And now the goal is to really migrate the code base to both structured logging and contextual logging, and we are actively looking for contributors to help us do that. There is a dedicated group within SIG Instrumentation working on this, what we call a working group in Kubernetes: the Structured Logging working group. They are part of SIG Instrumentation, so we kind of oversee and support them, but their focus area is strictly logging. There are two organizers for it, and they meet every two weeks on Thursday. If you want to contribute, that's one of the best places to start, because it's pretty easy to get into, there's not much context or history to know about, and it lets you get experience with a lot of different parts of the code base.

So, the last signal, for which we added support recently and which is pretty exciting, is the traces. Before going deeper into what we support and what Kubernetes offers today, I would like to tell you about distributed tracing, in case you don't know it. Today, when you are a client and you send a request to a piece of software, it can go through many layers of software and services, and you don't have any way to really know where the request went and how much time it spent in the various areas. That's what distributed tracing is trying to solve, by adding trace IDs and span IDs to your request. The goal is to have all your components produce traces and spans with these IDs, and then your tracing platform aggregates all this data, matches it up by ID, and you are able to see the spans shown below, which cover the different processing steps that your request went through: the server receiving the request, the call out to the backend, the backend doing some processing, and then the response going back to your client. And you know exactly how much time was spent in each of these steps.

In Kubernetes, where we started is doing that for the API server. What we really wanted to know was, when a client makes a request, let's say with kubectl, how does it go through the API server and then etcd and then back to the client? We really wanted to measure that, so that's what we did. We introduced API server tracing as alpha in 1.23, I think, if I remember correctly, and it went beta recently in 1.27, and we fully trace the requests between the API server and etcd. Kubelet tracing also recently graduated to beta, in 1.27 as well. Kubelet tracing started a bit later, but it has caught up to the API server now. Those are the two components that are traced today, and in the kubelet's case, we are tracing the requests between the kubelet and the container runtime.

All right, so here you can see a trace in Jaeger from the spans we initially added in Kubernetes 1.22. It was pretty bare; you didn't get that much information. The big span at the top is the request from the client to the API server, the second one is the API server to etcd, and then in yellow it's etcd processing the request. So you can kind of see what's happening, but you don't have much information about what's really going on inside each of these components. At least you know who is responsible for how much of the latency. One thing I wanted to mention before going further is that there was already some kind of tracing in the API server before, but it was a log-based tracing mechanism, where you had these trace logs that appeared whenever the API server was slow, for example, to give you some idea of where the slowness really occurred in the API server. But it's pretty hard to use as is, because it can grow pretty large in your logs, so it's not the best. What we did in 1.26 is that we took this log-based tracing and converted it to spans, to OpenTelemetry spans.
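To give a feel for what producing such spans looks like, here is a minimal OpenTelemetry Go sketch, not the API server's actual instrumentation: a parent span for an incoming request and a child span for a downstream call, exported to stdout instead of to a collector. The span names are made up.

```go
// Sketch of creating nested OpenTelemetry spans that share one trace ID,
// which is what lets a tracing backend stitch a request back together.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans to stdout just for demonstration; a real component would
	// send them to an OTLP endpoint such as an OpenTelemetry collector.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	tracer := otel.Tracer("example")

	// Parent span for the incoming request.
	ctx, reqSpan := tracer.Start(ctx, "handle-request")

	// Child span for a downstream call; it shares the parent's trace ID.
	_, backendSpan := tracer.Start(ctx, "call-backend")
	backendSpan.End()

	reqSpan.End()
}
```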
So now you are able to know exactly what's happening inside the API server: how much time you spend decoding, let's say, data from etcd, how much time you spend processing a certain request, and so on. With all this information you are able to better introspect what's going on. For the kubelet, when tracing was added as alpha, it was also pretty bare. We had two things instrumented: one was the request from the kubelet to the container runtime to create a container, and the second was the container runtime actually starting the container. So we didn't really know when the pod creation was started by the API server and then when the container was started by the kubelet. That's what we did at KubeCon as a proof of concept: we linked the pod traces to the kubelet traces. So we now know much better what's happening when the kube-apiserver sends a create pod request, how it's propagated to the kubelet, and when the containers are created as well.

The future plans are to add a complete kubelet span that tracks the whole pod creation instead of just the container creation. We also want to improve signal correlation: we want to link metrics and traces via Prometheus exemplars, and we also want to, as I mentioned before, link the logs to the traces, most likely via contextual logging, in the future.
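Exemplars aren't wired up in Kubernetes yet, but as a rough sketch of what that kind of correlation looks like with the Prometheus client library, a trace ID can be attached to an individual observation like this. The metric name, value, and trace ID are all made up for illustration.

```go
// Sketch of linking a metric sample to a trace via a Prometheus exemplar.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "example_request_duration_seconds", // hypothetical metric
	Help: "Duration of example requests.",
}, []string{"verb"})

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(requestDuration)

	// Attach the trace ID of the request as an exemplar to the observation,
	// so a UI can jump from this data point straight to the matching trace.
	obs := requestDuration.WithLabelValues("GET")
	if eo, ok := obs.(prometheus.ExemplarObserver); ok {
		eo.ObserveWithExemplar(0.042, prometheus.Labels{"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
	}

	// Exemplars are only exposed when the OpenMetrics format is enabled.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{EnableOpenMetrics: true}))
	http.ListenAndServe(":8080", nil)
}
```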
And to do all that, we need more contributors, I guess, and you can help us improve all these areas with your ideas as well. One way to do that is to attend our meetings and participate in reviews. There are many projects that are looking for contributors, so feel free to reach out to anyone responsible for them. And you can meet us every week: we have a SIG session where one week is triage and the other week is the regular meeting. So feel free to join any of these meetings and we'll make sure to help you contribute. And yeah, that's it for the presentation. If you have any feedback on this session, feel free to scan this QR code and send it to me, I would really appreciate that. And that's it, I guess we can go on with the questions. Does anyone have any questions?

We export via OTLP today, we just set up a collector and a central system. Is there any chance, potentially, in the future, to get logs and metrics this way as well?

Via OpenTelemetry, you mean, or via the collector? So the question was: we are producing traces with OpenTelemetry today, is there a plan to have the metrics and logs produced with OpenTelemetry as well? I think that was considered. The blocker currently is that most of what I discussed for metrics, where we have this stability framework and so on, is not something that is supported by Prometheus either, so we had to create our own libraries to support all that, as well as all our static analysis tooling. So the challenge would be to understand how much of a rewrite that would require. I don't think we have anything against OpenTelemetry at all; it's more that the rewrite might not necessarily make sense, the amount of effort it would require might not be worth it. But one thing that is worth mentioning is that all the signals can today be collected via the OpenTelemetry collector, and that's something that we want to keep going forward; we always want to keep support for that. It was just not enabled by default in alpha.

So the question is what happens to the traces today in the kube-apiserver if you don't have anything grabbing them. Tracing wasn't enabled by default in alpha. It's now enabled by default, but it requires some configuration in order to produce the traces. So if you haven't set anything up, you won't see any traces today; you have to activate them and set a certain ratio for the number of traces you want to retain. So yeah, today if you don't do anything, no traces are produced, but if you want to use it, then for sure configure it, and then you can collect them via OpenTelemetry, for example. Yeah, there is a tracing configuration that you can pass to the API server, or even to the kubelet now, to tell it the ratio, I don't remember the exact name, the sampling rate that you want for the traces. You just have to configure that and then it's all set. Yeah, I think you cannot configure the endpoint; it's set by default.

I won't lie, I'm not that familiar with OpenTelemetry as a whole; I know a bit about the metrics side but not much about the logging side. So I'm not sure, the best would be to go to one of the working group meetings and ask them directly if they've considered that. At the moment they have their own library, klog, to support contextual and structured logging, and I don't know if they have an intent to actually rewrite it. Yeah, it might be. All good. Thank you everyone for attending, I hope you enjoyed the end of KubeCon, and safe travels back.