So, hello everyone. If you would like to follow the tutorial and do the coding with us, you can start getting ready. If you go to my GitHub account, there is the KubeCon EU OpenTelemetry Kubernetes tutorial repository, or you can just scan these QR codes. There is a README with the setup, so I would advise you to start looking at it and maybe create a Kubernetes cluster and deploy the observability backend that we'll be using. The QR code is also on the slides that are uploaded to the chat.

Okay. So, hello everyone and welcome to the tutorial for exploring OpenTelemetry metrics on Kubernetes. This is essentially a continuation of the last KubeCon in Amsterdam, where we did a tutorial exploring all of the OpenTelemetry signals and how to collect them on Kubernetes. Today we're gonna focus just on the metrics. So, my name is Pavel. I'm a principal engineer at Red Hat. I maintain the OpenTelemetry operator and also contribute to the OpenTelemetry project. I'm as well a maintainer of the Jaeger project and the Grafana Tempo operator. And with me today is an amazing group of people. Would you like to start the introductions?

I can do. Yeah, hi. My name is Bene. I contribute to the OpenTelemetry project and I also work at Red Hat together with Pavel. Hi, I'm Anthony Mirabella. I'm a senior engineer at AWS and I work on OpenTelemetry, mostly the collector and the Go client libraries. Hey, hi, I'm Anusha. I'm a software engineer at Apple. I'm also an open source contributor; I work on OpenTelemetry metrics and traces. Hello, everyone. My name is Maciej. I work as a software engineer at Coralogix and I'm also an open source contributor. I'm coming mostly from the Prometheus ecosystem. Recently I've been working more with OpenTelemetry, with the collector and operator, so that's my area of focus.

Okay, thank you very much. So as you can see, we are all contributing to the ecosystem. It's a tutorial and all the content is hosted on GitHub, so if you would like to follow what we do live on stage, please scan this QR code, go to this URL, or go to my GitHub account; the repository is pinned on the index page. If you have any issues during the tutorial, you can just raise your hand and we will come and help you resolve them, if we know how, obviously. And with that, I will jump to GitHub.

So what we prepared today is essentially a couple of sections we want to cover related to metrics. We'll start with an introduction to how OpenTelemetry metrics are designed, and Tony will talk about that. Then we will have a live demo: we will instrument an application with the OpenTelemetry API and SDK for metrics, manually. We will initialize the API and SDK, and then we will compare it to the automatic instrumentation and see how that makes the whole instrumentation easier. After that, we will deploy it on Kubernetes with the OpenTelemetry operator; we'll deploy the collector and use the Instrumentation CR. And after that, we will focus on different topics: we will explore how we can use OpenTelemetry with Prometheus, and how we can replace parts of Prometheus with the OpenTelemetry collector to collect Prometheus metrics. After that, we will look at how we can use OpenTelemetry to collect Kubernetes infrastructure metrics. And the last topic is correlation, where we will explore how we can correlate metrics with traces and logs.
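Concretely, the cluster and backend setup described in the next part boils down to commands along these lines. This is only a sketch: the exact manifests, versions, namespaces, and service names live in the tutorial repository's README, and the cert-manager version, the `backend/` path, and the `observability-backend`/`grafana` names below are illustrative assumptions.

```shell
# Any Kubernetes cluster works; kind is one easy option.
kind create cluster

# cert-manager is required by the OpenTelemetry operator; give it ~30s to come up.
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml
kubectl wait --for=condition=Available deployments --all -n cert-manager --timeout=120s

# Install the latest OpenTelemetry operator release.
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

# Deploy the observability backend (Prometheus with OTLP ingestion and exemplars,
# Jaeger, Grafana) from the repository's manifests, then port-forward Grafana.
# kubectl apply -f backend/          # illustrative path, see the README
kubectl port-forward -n observability-backend svc/grafana 3000:3000
```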
So the first thing we need to do for the tutorial is the setup, and we will need a Kubernetes cluster. If you have one, you can just use that one; otherwise there are instructions to use a kind cluster, or you can use minikube as well, that will work just fine. So you should end up with a running Kubernetes cluster and kubectl installed on your machine. After that, we will deploy cert-manager, which is required by the OpenTelemetry operator. When you run these commands to deploy cert-manager, please wait a couple of seconds, maybe even half a minute, because it takes a while for cert-manager to start and initialize properly. After that, you can deploy the OpenTelemetry operator, which will essentially install the latest release. And after that, we will deploy our backend for observability data, which is using Prometheus with OTLP ingestion and exemplars enabled. For tracing, we will use Jaeger. The backend as well contains a Grafana deployment which, as a last step, we will port-forward to our host.

Just to get an idea who is gonna do the demo with us, please raise your hand. Okay, cool, that's good. And just to note, we already have this setup on our machine, so we will not be doing this first part, setting up the cluster. So it is already there, and I would suggest you port-forward to Grafana and keep this running in the background, or just keep it open, because we have some links for the dashboards that will expect Grafana to be on localhost:3000. So if you want all of this to work, yeah, leave it open or leave it port-forwarded. I still see people typing. Maybe you can give us some indication if you are done with the setup; maybe you can raise your hand so we know that some people are done. I think we can continue, because in the next section Anthony will introduce the data model, so we will still have some time to install and go through the steps. So no stress.

Thanks Pavel. The fun of learning to use somebody else's computer and how they have it set up differently from you. I think I need to click that, there we go. So I've got a brief presentation I'm gonna do to talk about what metrics are, what the OpenTelemetry metric model looks like, and how you can utilize the OpenTelemetry API and SDK to produce metrics. So first off, what are metrics? Metrics are, at some level, pre-aggregated time series data that represent observations of numeric quantities and associated attributes. Now that's a big mouthful, it's a lot of words there, so I'll try to break that down a little bit and talk about each of those constituent parts.

First, observations, what are those? There are two types of observations that we can make, synchronous and asynchronous. Synchronous observations are when an event of interest happens and we can observe a value related to that event directly. So this may be something like how long did your HTTP request take, or how many bytes were in that request body. Maybe you made a query to a data store and you've got a counter of how many queries you've made to that data store. So these are things that happen; you observe them directly when they happen. Conversely, asynchronous observations are things you observe not necessarily when the value changes, but when you want to know what the current state of that value is. So this is more useful for something like how many items are in a queue for processing. Maybe you don't track every item coming and going, but on a periodic basis you ask how big the queue is, or how much memory is allocated to a process.
You're not tracking every allocation, but I've got five megabytes now, I've got ten megabytes, and you can kind of see that changing over time. What's the temperature in the room? All of you people in the room are probably raising the temperature here, and we can periodically check what that is, but we don't know every time it changes because it's a continuous value.

The next part of that is the numeric quantities. Metrics deal with numbers. They can be integers, they can be floating points, but numbers are really the thing that is of interest to us here. It doesn't really make sense to observe "red", right? But we can observe that we've got 99 red balloons, and so that's your metric: the name is balloons, the value is 99, and there are some attributes, like its color is red. And that takes us to the last part, which is attributes. Attributes can help provide context for understanding the observation that you've made. If you are counting HTTP requests, maybe you wanna know what the URL was, or what the status code was; was it successful or was it an error? You might wanna have different values for those. Which data store did you query, or what color was your balloon? Mine are red, maybe yours are a different color.

And then aggregation is the final important part of metrics. Especially when we're doing synchronous observations, there can be a lot of observations. You don't necessarily want to report every single one of them to a data store. So for every HTTP request, you're not necessarily going to want to report that to Prometheus or to some other data store and say, hey, I had another request. You will instead, on a per-minute basis, ask how many requests there were in the last minute, or how many requests there have been in total. And there are various different types of aggregations that you can use. A sum is just adding up all of the values that you've observed. You could take a last value, which is: what was the last thing you saw? Maybe the value is constantly changing but you only wanna know what it currently is. And then there are histograms, which are representations of distributions of data, so you can have some more information about what types of values you've observed without necessarily needing to keep every value.

In OpenTelemetry there are three data models that are not necessarily always represented in the API or the interfaces you use, but these are conceptual models and states that the data will flow through as it goes from your application to your data store. So what I just described with observations is the event model, where each observation is an event that happens and you record the attributes and what thing you're observing. That then can get translated into a streaming model in OTLP, which is a representation of a set of aggregated events that is easy to send over a network. And then finally your backend, whether that's Prometheus or some other data store, will turn that into its time series model that it will store for later query and analysis.

The OpenTelemetry metrics API and SDK are the ways that you interact with OpenTelemetry, record these observations, and get them down that pipeline of data models to your eventual data store. It starts at the top with a meter provider. This is the entry point to the API. Your application will probably have one of these. You may have more for some reasons, but typically you will have one.
And in the API, the meter provider just gives you the ability to create a meter, which is the next level of interaction. But in the SDK, this is where your application will configure things like what resource is attached to all of the metrics that are produced by this meter provider, how you want the data processed, which aggregations you want to use for specific metrics, and where you want that data to go, that is, which exporters will be used. The meter itself is responsible for creating instruments, and instruments are the things that actually record observations. You will typically have one meter per library. So if you've got an HTTP library, it will have its own meter, which might associate some information relevant to that library with all of the instruments that it creates. You might have a separate one for your database library that associates different attributes with all of the observations for your database.

So I'll talk a little bit about the various types of instruments that we have. I mentioned earlier that there are synchronous and asynchronous observations, and that carries forward into instruments as well. Synchronous instruments, as I said, allow for direct observations of a value related to work that's happening. Asynchronous instruments allow you to observe something that might be changing outside of the context of the work but influences it or is related to the whole thing that you're observing. They are very similar to synchronous instruments except they add a callback parameter, so you can provide a function that the SDK will invoke when it wants to know the current value of that instrument, or of the thing that that instrument is measuring. So this callback is invoked when it wants to record. For instance, if you've got a Prometheus exporter, when Prometheus comes and hits the /metrics endpoint to scrape that data, it will then invoke this callback and say, what is the current memory usage, or something like that.

So the types of instruments that we have: we have a counter, which is probably the simplest kind of instrument. These track monotonically increasing values. That means the value only ever stays the same or goes up, but can never go down. This is important for some metric systems that expect that property of the counters they're looking at, so that they can effectively make assumptions about that data. It defaults to a sum aggregation. So if you observe a bunch of different values, it will add them all up and report the sum of those values. So if we observe here one, four, two, and three, we'll get the aggregated value 10 out at the other end. And this also has an asynchronous variant if there's a situation where you need that property. If you don't or can't use a monotonic counter, there's an up-down counter, which, as the name implies, can go up and it can go down. And this works just the same. But here we can see that if we observe a negative two instead of a two, we'll end up getting the value six out. This also has an asynchronous variant.

Then we've got a gauge. This is asynchronous only. And these are useful for tracking things that, as with most asynchronous instruments, don't really change in the context of some specific identifiable activity, but maybe change continually, and you just wanna know what the current value is, or what the last value seen was. So this is useful for something like memory usage or items in a queue.
And here if we observe those same values, one, four, two, and three, the result is three, because that was the last value we saw. There's some experimental work happening for a synchronous gauge, but that's not yet available for general use. And then finally we've got the histograms. There are two types of those: explicit bucket histograms, where you say here are the boundaries of the buckets that I want you to put all of the data and the distribution into, and exponential histograms, which will try to automatically fit the bucket boundaries to the data that you're providing. These can include a lot of additional statistical summary type information. So you might get the minimum, maximum, sum, and count of all of the observations that you have. So we can see here that if we observe the same four values, one, four, two, and three, then we get a much larger set of data where it tells us the minimum value we saw was one, the maximum was four, the sum was 10, and the count was four. So we can get a lot of the same information that we would get from a counter, but also a lot more information if we need to know whether the average was really high or really low, whereas a counter just tells you what the total was. Maybe you've got a bunch of outliers that are bringing up your total, and you won't necessarily know that from a counter. Okay, and I think that takes us through all of the instruments. Pavel, am I handing off to you again?

Yeah, thank you, Anthony, and we'll continue with the third part of the tutorial. Are you ready with the setup? It seems like, yeah. Actually, in this step we'll not use Kubernetes, so if you're not ready yet, it's okay. So what we're gonna do right now is instrument the application that we are hosting in the repository. Let me explain what the application is actually doing before we jump into the instrumentation part. It's a simple polyglot microservices application that contains three microservices: frontend, backend1, and backend2; the frontend is written in Node.js, backend1 in Python, and backend2 in Java. Essentially the logic is to play a game, roll dice. The frontend will ask backend1 to roll the dice for one player and backend2 for the second player. Then each backend replies with a number, in this case three and six, and the frontend decides which one is higher and prints the result to the console.

So we need to instrument this application to get metrics, right? And there are two approaches to instrumentation. The simplest one is to use the API and SDK manually, which means you will pull those libraries into your application and use the APIs. The second approach is to use the automatic instrumentation. Each of these has benefits and drawbacks. The manual instrumentation is great when you need to have good control over what you do. On the other hand, you might forget to instrument important parts of the application. The automatic instrumentation is something that you can set up very easily. But on the other hand, it may consume more resources, and you just don't control it, right? You will just get what is prepared for you. So we'll start with the manual instrumentation on the frontend application. So I'll change directory to frontend. Yeah, could we ask the room technicians to dim the lights, please? Cool, thank you. Is it better now? Could you please make it bigger as well? I don't know, can you please make it? Okay, cool. We are Linux users and this is a Mac machine, so we are confused.
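Before the live demo continues, here is a minimal JavaScript sketch of the instrument kinds Anthony just walked through, using the OpenTelemetry JS API. It assumes a meter provider/SDK has been configured elsewhere (as in the upcoming demo); instrument names and attributes are illustrative.

```js
// Sketch of the four instrument kinds, using the OpenTelemetry JS API.
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('instrument-kinds-demo');

// Counter: monotonic, sum aggregation. Observing 1, 4, 2, 3 reports 10.
const counter = meter.createCounter('balloons.sold');
[1, 4, 2, 3].forEach((v) => counter.add(v, { color: 'red' }));

// UpDownCounter: may also decrease. Observing 1, 4, -2, 3 reports 6.
const inFlight = meter.createUpDownCounter('requests.in_flight');
[1, 4, -2, 3].forEach((v) => inFlight.add(v));

// Histogram: keeps a distribution. Observing 1, 4, 2, 3 reports
// min=1, max=4, sum=10, count=4 plus bucket counts.
const latency = meter.createHistogram('request.duration');
[1, 4, 2, 3].forEach((v) => latency.record(v));

// Asynchronous (observable) gauge: the callback runs at collection time,
// so only the current/last value is reported.
meter.createObservableGauge('process.heap.bytes').addCallback((result) => {
  result.observe(process.memoryUsage().heapUsed);
});
```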
But anyways, so I'll change directory to frontend and I will run npm install to install the packages. If you don't have npm or a Node.js environment, you can as well use a Docker container; the steps are just below. And then I will use npx to run the index with nodemon, which will watch for changes on the file, so I don't have to restart the application manually. So the app is running, so now I can access the port, it should be, let me see, it's actually 4000. And I'm getting an internal server error, which is correct, because the frontend needs the two backends, which are not deployed, so that's fine.

So what we're gonna do right now is instrument it, and you can find good guidelines in the official OpenTelemetry documentation; the link is in the README. You can simply paste the code from there into the index file and initialize the OpenTelemetry API and SDK, and our goal is to use the console exporter, which will print the metrics in the application console. So I switch to the app, and this is the index; as you can see, it's initializing the HTTP framework, then configuring the index endpoint, and then there is the business logic. So what we need to do now is open the OpenTelemetry documentation, which takes some time. Are you able to load the page? Okay, it's there. And here you can choose between TypeScript and JavaScript, and there you have the instructions on how to load the packages for the instrumentation. In this case we need to load the OpenTelemetry API, the SDK, and the instrumentation for Node.js, right? If you just use the API and SDK, they will not instrument Node.js itself, meaning that we would need to write more code to get metrics, and traces as well, for any Node.js invocations. And here we see the code. You can simply copy and paste it into our index. What I prepared as well is a second file called instrument.js, which essentially contains the content from the webpage, and you can load it by uncommenting this line; it just configures the API and SDK and configures the console exporter.

So I'm gonna save it and then create requests, and as you can see, the index is already using the OpenTelemetry API, right? But if we don't initialize the SDK, we will not get any telemetry data, because the API will just be a no-operation. And what I have done here is I use the meter, which Anthony was talking about, as the main entry point to the metrics API, and from the meter I created a couple of counters. The first one is the RequestCounter, which counts the requests on the root API, and if I go back to the console, I should see the metric printed in the application console. It takes some time because the SDK is batching those metrics and reporting them periodically. Yeah, I think I did that. I can try again, it was reloaded. The index was not saved, you're right. Cool. I need to create more requests. Thank you very much. Yeah, it takes a couple of seconds for the metrics to get reported, and we should see something like the example output in the README. And okay, so this is the output. So this is kind of the simplest way to debug whether your instrumentation is working, because it doesn't require any other service to be running on the host. And as a next step, we're gonna run Prometheus in Docker with the OTLP write receiver enabled. It's kind of a new feature that was added to Prometheus.
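For reference, the instrument.js initialization and the RequestCounter that Pavel describes above look roughly like this. Package names follow the opentelemetry.io getting-started docs; the exact tutorial code may differ, and the commented lines only preview the exporter swap done in the next steps (the Prometheus OTLP path and collector endpoint below are assumptions for a local setup).

```js
// instrument.js (sketch): initialize the SDK before the application code runs.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader, ConsoleMetricExporter } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

// Start with the console exporter, the simplest way to verify instrumentation works.
const exporter = new ConsoleMetricExporter();

// Later steps swap the exporter without touching the rest of the setup, e.g.:
// const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
// const exporter = new OTLPMetricExporter({ url: 'http://localhost:9090/api/v1/otlp/v1/metrics' }); // Prometheus OTLP receiver
// const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
// const exporter = new OTLPMetricExporter({ url: 'http://localhost:4317' });                        // OpenTelemetry collector

const sdk = new NodeSDK({
  metricReader: new PeriodicExportingMetricReader({ exporter, exportIntervalMillis: 10000 }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

```js
// index.js (sketch): the API side, counting requests on the root route.
const { metrics } = require('@opentelemetry/api');
const requestCounter = metrics.getMeter('frontend').createCounter('request_counter');
app.get('/', (req, res) => {
  requestCounter.add(1);
  // ... ask backend1 and backend2 to roll the dice, compare the results ...
});
```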
We're gonna run Prometheus in a separate console, and I need to now change the code in instrument.js to use the OTLP HTTP exporter. And I need to specify the URL. Prometheus exposes OTLP under the /api/v1/otlp prefix; /api/v1 is the prefix for all Prometheus APIs, and they added the otlp suffix for the OTLP HTTP API. So I'm gonna save the file and again refresh the application to create some requests that will create metrics, and then I should see metrics in my Prometheus. So Prometheus is on port 9090. Are you able to run Prometheus? Seems like the network is very slow for us. Try to switch to mobile data, which may be faster. Yeah, sorry about that.

Maybe in the meantime, we can check: who is still with us? Who is still following? Do you guys have your clusters installed? Show of hands, who is still following and has their cluster ready with the operator and the observability backend? Okay, a couple of hands. We know it's not the best format, there are a lot of people here, but if you have an issue and you wanna discuss it with us, feel free to raise your hand; someone can come and take a look in the meantime. So yeah, in this part we're working with Docker on our local machine, but soon we'll move on to the actual part where we'll use the cluster. So you still have a bit of time to prepare for the main portion of the tutorial.

So it seems like Prometheus for some reason couldn't be pulled, but as the next step, optionally, we can use the OpenTelemetry collector, which has the debug exporter and will print the metrics to the standard output. So we're gonna use that. We need to make the changes in the code, in instrument.js: we need to change the exporter from HTTP to gRPC. We could as well use the HTTP one, but for the tutorial we decided to use the gRPC one for the collector and the HTTP one for Prometheus. So I'm gonna save the changes and refresh the application to create metrics. And again, we need to wait, because the SDK reports metrics periodically, but we should see them coming to the collector soon. The debug exporter that we are using on the collector is another useful feature that we can use to see whether the telemetry data is coming from our process. So when you start with instrumentation, I would recommend you use the console exporter in the SDK first, and then deploy a collector and see if the data is coming to your collector. So in this case, we see the output. It's different from what we saw in the application: in this case, we also see the resource attributes that describe the process reporting the data, and then the individual measurements, which is the RequestCounter in our case.

Okay, so this was the manual instrumentation. As you can see, you need to pull the API and SDK packages into your application, you need to make code changes, and then recompile and deploy to get telemetry working. And now we're gonna take a look at the auto-instrumentation. Bene, could you please come here? So Bene will continue with the next part. Do a quick microphone change, or you hold it? Okay, perfect. Quickly, where are we? So, yeah, what we have seen, what Pavel showed us, was more or less how to instrument our application manually. So for that we need to know what we want to know about our application. There are other ways.
In the opentelemetry-js-contrib repository, or in the contrib repositories for other languages as well, you will find predefined packages where you can instrument things automatically, like an entire framework, for example gorilla/mux, which automatically gives you trace support or metric support for Go. And there is another option, which is the auto-instrumentation. We will have a look here with the backend service. This auto-instrumentation is available for multiple languages, so for example for Java, Python, and .NET; it's always a bit different what they support and what they do not support. And since Java is supported as well, we go with it here. So what we do now is, or what we have seen previously, we needed to configure the SDK and we needed to instrument stuff. With Java we can instead download the Java agent itself, so we can quickly copy this. Well, let me try to copy.

So with the support that Bene mentioned, some of the agents, like most of them or all of them, support tracing, but not all of them report metrics as well. Fortunately, the Java agent supports metrics. It reports some metrics for some frameworks, not all of them, but for some. Okay, this, or is it? Can I just try to type? So Bene will show you how to use the agent locally. But in production, you would usually embed the Java agent directly into your Docker container or inject it into your environment dynamically, as we will see on Kubernetes. So I'm running here a container which has Java installed, because the machine doesn't; that may be an option for you as well. And what we see next is, more or less, we'll quickly go over this configure point. You will find all the configuration details on the OpenTelemetry page. We just load this Java agent and then we specify some environment variables. In the first one, we go with the logging OTLP exporter so that we see our metrics reported to the console. We just disable logs and we also don't care about traces in this step. And then we finally execute the jar, which is the backend. I pre-compiled this, so it's part of the repository. So in theory, we should be able to just execute it as soon as it's downloaded. I hope this goes a bit better or quicker on your machine. There's actually a smarter option: we already have a prepared container that we use afterwards in kind, which contains the jar, so we would just need to set the environment variables, but I don't want to change this now.

The console output is kind of small, if you could make it bigger. We stopped the Prometheus container. This should run now. Command-T, yes. Quickly go back to... Why is it downloading it again? The second time. Okay, so this should be up and running and then we try this here again. Ctrl-C versus Command-V. I'm able to access it. Maybe we need to change into the application backend directory. Seems the repository is not up to date. Then we download a bit more stuff. There's also the option to build it directly. So Bene, how do you tell the Java runtime to load the agent? Yeah, so basically we set JAVA_TOOL_OPTIONS and then the Java runtime will do the rest for us. It will load this first and it's still... Ah, I see what the problem is. We downloaded it, oh, where is the OpenTelemetry agent jar? Let's see if we did a typo, and we did not. So now we can call the backend a few times to roll the dice, in case I'm able to copy. There's a one. I think there should be more numbers, but yeah.
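The local agent run Bene walks through amounts to roughly the following. This is a sketch: the download URL is the upstream "latest" link, while the jar name, paths, and the exact exporter values (for example `logging-otlp`) are assumptions based on what is described above; the tutorial repository has the authoritative steps.

```shell
# Download the OpenTelemetry Java agent (latest upstream release).
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Attach the agent via JAVA_TOOL_OPTIONS so the JVM loads it automatically.
export JAVA_TOOL_OPTIONS="-javaagent:$(pwd)/opentelemetry-javaagent.jar"

# First run: print metrics to the console, ignore logs and traces.
export OTEL_METRICS_EXPORTER=logging-otlp
export OTEL_LOGS_EXPORTER=none
export OTEL_TRACES_EXPORTER=none

# Later run (next step): ship metrics over OTLP/HTTP to Prometheus instead.
# export OTEL_METRICS_EXPORTER=otlp
# export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://localhost:9090/api/v1/otlp/v1/metrics
# export OTEL_EXPORTER_OTLP_METRICS_PROTOCOL=http/protobuf

# Run the pre-compiled backend jar (name is illustrative).
java -jar backend2.jar
```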
So in here, we see that we rolled the dice a few times, and after a while, after the application flushes, we should see it on the console itself. And this will then also include some Java runtime metrics, for example about the garbage collector and so on. I think we need to skip the Prometheus part because of time issues; we spent a lot of time on downloading. Maybe we can quickly, if jq is installed... So here, now we see it a bit better. We see that we get here, for example, CPU utilization and so on. You can browse through it and it contains quite a lot, and you can also configure what you wanna see and what you don't want to see.

So in theory, we can also just switch the environment variables a bit. This time we remove the logging OTLP exporter, so it will by default use the OTLP exporter. We specify an OTLP metrics endpoint; this is the Prometheus instance where we enabled OTLP. And then we need to specify the protocol type; in our case, we want to transmit this using HTTP. Here the exporter is set explicitly and we just disabled the other ones. And then it's the same process: this ships into the Prometheus instance, and then, yeah, we should see it there. So you see it here on the screenshot. Technically, this also works for Node.js, so you can also play around with this; I added it here as optional.

So in the next section, we would like to give you a short overview of the OpenTelemetry operator, what it's used for, and how it makes life easier on Kubernetes. The OpenTelemetry operator is basically a component which currently delivers two CRDs: one is the Instrumentation CRD and the other is the OpenTelemetryCollector CRD. The Instrumentation CRD is used to configure the SDKs and also inject the SDKs. So previously, we downloaded this JAR file manually, and this is something the operator can do for us. And now we want to do exactly the same on Kubernetes. Therefore, we deploy the entire application. Yeah. If you hold it. So now we should see. Yeah, so this is the part where you can already follow us and basically just copy and paste the commands, and it should apply the correct manifests on your cluster. So in this case, we're deploying the same application we've seen previously in Docker; we're now doing the same in Kubernetes. Yes.

And next we go with the Instrumentation CR. So let's first have a quick look at how this looks; let me open this a bit bigger. It's basically the Instrumentation CRD, which has the OpenTelemetry apiVersion here, and we configure the exporter. This is the default endpoint where the OpenTelemetry metrics, or all the signals, get exported; in our case, we export traces to the Jaeger collector. Next we configure the same environment variables that we did before, and we use a different endpoint, in our case the Prometheus instance which is running on Kubernetes, to ship the metrics there. So let's quickly deploy this, and then nothing should happen, because if we want to make use of it, we need to modify the pod spec. So let's first have a look at backend2. What we see here is the container somewhere in this spec: we have the container, we have the image specified, and almost no environment variables; I don't see one, not a single one. And also, in the annotations, we have here the Prometheus scrape annotations, but this is different.
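For reference, the Instrumentation CR just applied, together with the annotation added in the next step, looks roughly like this. Names, namespaces, and endpoints are illustrative assumptions; the real manifests are in the tutorial repository.

```yaml
# Instrumentation CR (sketch): configures and injects the SDKs.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: instrumentation
  namespace: tutorial-application
spec:
  exporter:
    # Default OTLP endpoint for all signals; traces end up in Jaeger here.
    endpoint: http://jaeger-collector.observability-backend.svc:4317
  env:
    # Override just the metrics endpoint to point at Prometheus' OTLP receiver.
    - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
      value: http://prometheus.observability-backend.svc:9090/api/v1/otlp/v1/metrics
    - name: OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
      value: http/protobuf
---
# Pod template fragment (sketch): the only workload change, added in the next step.
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"   # or inject-sdk: "true" for manually instrumented apps
```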
So what we need to do now is more or less just add an annotation and tell the OpenTelemetry operator that it should inject the Java SDK and that it should also configure the variables that we configured on top. So this is what we do next. Once the deployment is patched, the pod should restart. We just verify this; let me just get the details again. So it restarted, ah, there, it's 20 seconds old. And so, yeah, now we see the difference. Now there is an init container which is using this image here, with the Java auto-instrumentation in this specific version, and what it basically does is use the cp command to copy the Java agent into... can we see this here somewhere? I thought it terminated. I don't know how to search a terminal here on the MacBook, it's difficult. Ah, here, we see the command from the init container. We see that it's copying the Java agent to the specific path, which is the default path where the Java agent then gets picked up. But it's also specified here in the environment variables: as we see here, it set JAVA_TOOL_OPTIONS, which is then using this path. We also set the other parameters, that we want to transmit metrics using HTTP, and we also target the Jaeger collector to transmit the traces. It will automatically also set some resource attributes so that we know where it's coming from. In case we have things manually instrumented, which you can try, there's a step in the tutorial from before, we only need to inject the SDK configuration. In that case, there will be no init container and nothing gets injected, because then the application is expected to deliver all of this.

So what we can do now is port-forward Prometheus here directly, because it's the easiest way to see the collected metrics, and go to the Prometheus dashboard. So usually also here it takes a second until some data arrives, and in the meantime, we check the logs to see whether everything works as expected, and then we should see no errors at the top. Okay, we are running a bit out of time, so trust me, it worked on my machine. Does it work for someone else, or does someone have trouble seeing the metrics there? So I expect it works for everyone except me. Yeah, so basically we should then see the metrics that we have seen on the console previously transmitted there, and we can also port-forward Jaeger, or maybe we can check this quickly. So here traces are arriving, it's super strange, and here we can see how things worked out. One last try, still nothing. Okay, so as we can see, we see traces, and in theory, we also should see the metrics.

So what we do next is we will have a short look at the OpenTelemetry collector. This is more or less a short recap so that we are all on the same page, because the OpenTelemetry collector will now become a bit more important. You will see how we collect metrics, also from the Prometheus ecosystem. So what is the collector? The collector can be divided into, let's say, three main parts (there are a few more, but these are the main ones): it consists of receivers, processors, and exporters, and there are also connectors. The receivers can be, for example, a Prometheus receiver, which scrapes metrics from our applications, or the OTLP receiver, which just listens on a port and waits for data to arrive. Then processing can take place to enrich our data, filter our data, or just batch the data to save some overhead, and finally we can export it to different data stores.
So there are, again, options for different protocols on the wire, like remote write or OTLP directly. To quickly show how such a configuration looks: the OpenTelemetry collector is configured with a simple YAML file. We see the parts that we discussed previously. The receiving part, in our case an OTLP receiver that accepts gRPC on a specific port; we also specify the batch processor; and as an exporter we use the debug exporter, which directly prints things to the console. Then we have the service section, and there we can compose our service pipelines. Currently we have one for metrics, but we can also have multiple, for example for multiple targets or different processing for different metrics. And we can also specify pipelines there for traces and logs. Yeah, I think that's almost it, and Maciej will now show you how to use this on Kubernetes.

Let's do it, let's finally deploy the collector on the cluster, what we've all been waiting for. So as you've seen, Bene introduced the custom resource called Instrumentation, which we use on the application side to instrument it. On the other side we have the OpenTelemetryCollector custom resource, which represents how the collector should be configured and deployed, and so how the operator will manage the collector instance. Many of the parameters in the spec are quite self-explanatory, so you can see some of the typical fields that you will see in other CRs, such as image and number of replicas; we have auto-scaling there; you can define ports, environment variables, et cetera. These are quite common, I think. So this is how our CR looks; it's quite simple. We just pin the image to, I think, the latest collector image, 0.88, which is now not showing, and here we have the configuration, which we took over from what Bene was showing. It's a very simple configuration where we have a receiver with the OTLP protocol, we batch the data coming in, and then export it with the so-called debug exporter, which is just a simple exporter to dump the metrics, or the telemetry data, to your standard output. And here we put it all into a pipeline for metrics, so once data comes through the OTLP receiver, we should see it on our standard output.

There are a couple more interesting parameters that you can configure in that custom resource that you should get familiar with. One that's specific to the OpenTelemetry collector is the mode: this defines the deployment mode of the collector, that is, how the collector should be deployed. Should it be a deployment, a daemonset, a statefulset, a sidecar? All of these are valid options. So for example, if you're using the collector as an agent and want to run it on every node, you would set this to a daemonset; if you want to use it as a kind of gateway, you would deploy it as a deployment, and so on. We will see some examples of the use of this parameter later. Upgrade strategy: the operator can also upgrade the collector for you automatically; this is done by default. The target allocator is another specialty that you might want to get familiar with if you're using the collector to scrape Prometheus targets. It's an optional component that makes it easier for you to integrate with Prometheus and to scale, but we'll see more of this, hopefully, in section five.
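The OpenTelemetryCollector CR described above, embedding the simple OTLP-to-debug configuration, looks roughly like this. The name, namespace, and image tag are illustrative; the actual manifest is in the tutorial repository.

```yaml
# OpenTelemetryCollector CR (sketch): the operator deploys and manages a
# collector instance based on this resource.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: observability-backend
spec:
  mode: deployment            # daemonset, statefulset and sidecar are also valid modes
  image: otel/opentelemetry-collector-contrib:0.88.0
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch: {}
    exporters:
      debug: {}               # dumps received telemetry to standard output
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]
```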
Lastly, the service account, which you can also specify. This is also important when we are talking about Kubernetes infrastructure metrics: as we will see, the collector needs appropriate permissions to be able to collect certain data, so it's important to have this set up correctly as well. So finally, let's apply our CR here and check that it was... first let's check that the CR was created. As you can see, we have some information here, as we wanted: deployment mode, version 0.88, image, et cetera. And also let's check that the collector is actually running. Yes, it's there. So we have the expected output, and now let's look at some metrics in the collector. But there are no metrics. So we're still missing one more step. As we've seen previously in the Instrumentation that was introduced by Bene, we were sending the metrics, or we wanted to send the metrics, to the Prometheus instance; in this case we will finally send them to the collector, which will output them with the debug exporter. So this is similar to what Pavel demonstrated on Docker; now we're finally doing the same in Kubernetes.

So let's apply, maybe let's show it quickly. We have adjusted the Instrumentation, so now, instead of sending to Prometheus directly, we just send to the collector itself, and we also specify here that we do not wish to export traces and logs, so we should see only metrics in this case. If we go back and try to apply this Instrumentation, there's one caveat: we need to restart the applications in order for the instrumentation to take effect, so I will simply delete the pods in the tutorial application namespace and they should be restarted with the desired... let's watch them for a bit. They should restart with the desired instrumentation. Maybe let's check what backend2 is doing. It might take a minute for it to show us some metrics. No metrics so far, but there should be also console export, right? It looks like the application is working. Let's see if the demo gods will be in our favor now. So we want the observability backend, and we want to look at the basic collector. Yes, it worked. So finally we see some metrics. As you can see, previously we only came up to this line, so we had some initialization messages; now we're finally seeing some metrics coming in that are being printed to the standard output, as I suggested, and we can see resource attributes, so we're seeing the name of the container and so on, and finally some metrics like the server request duration and so on. So this is a very, very basic example. We should now have an understanding of how the Instrumentation and OpenTelemetryCollector custom resources work, and now we'll move on to a more realistic and more complex example of how we can use this for collecting Prometheus metrics, which Anusha will show us.

All right, so now let's talk about the collecting Prometheus metrics section. We only have 20 minutes left and three more sections, so I'm going to skip ahead and only cover the important parts. At a high level, in this section we'll be talking through the steps of migrating from Prometheus to OpenTelemetry for collecting metrics. And after that we'll dive deep into how we scale the collector instances with a component called the target allocator.
And after that, we'll talk about the interoperability between the Prometheus and OTLP formats, where we'll explain how we ingest Prometheus and output it in OTLP. Finally, we'll also see some of the format compatibility challenges and limitations that we need to be aware of during the migration. So, prerequisites: in the previous step we had the backend application, which was manually instrumented to generate Prometheus metrics, and those metrics are being exposed on an HTTP endpoint which we'll be scraping in this section. And our Prometheus backend has been set up and is up and running on the Kubernetes kind cluster, and we have remote write enabled. I'll be skipping through the overview.

So the first thing that we need to do is analyze our existing Prometheus configuration and identify the scraping mechanism that has been set up. There are two ways target discovery can be done in Prometheus. The first one is the more traditional way, using scrape configs. When using scrape configs, we have all the targets and the target endpoints specified in the Prometheus configuration file itself. Here is a sample Prometheus configuration file where we can see we have the alerting, recording rules, scrape configs, and remote write specified. So the next thing that we need to do, before we port this configuration over to the OTel collector, is just pick up the scrape configs from this section. The other advanced features, like alerting and recording rules, are not supported by the collector; they can still be set up on your Prometheus backend. So we pick up the scrape configs from the Prometheus section, and the next thing that we need to do is escape the dollar characters in the scrape config sections. The reason is that the collector today supports environment variable substitution, which means it interprets the dollar characters in your scrape config as the start of an environment variable substitution. So in order to keep those dollar signs from being replaced as if they were environment variables, we need to escape them in the configuration. So if you have metric relabel configs or relabel configs in your scrape config, make sure to escape the dollar characters. Here is how our configuration looks after the exclusions and escaping; as you can see, we have the dollar characters escaped.

The second method is using the service monitor, which is a more dynamic way of discovering targets in Prometheus. When creating service monitors, typically in a Kubernetes cluster, we specify the desired configuration, or desired scraping behavior, using service labels. The service monitor uses these service labels to dynamically discover the services and the targets to scrape. This is a more flexible and scalable way to do target discovery in a Kubernetes setting. In order for us to create the service monitors, we need the Prometheus operator CRDs installed, so we'll go ahead and install the CRDs. There you go, the CRDs are installed. So the next step is setting up the OpenTelemetry collector itself. Just a quick recap: what we looked at so far are two approaches to scraping, or discovering, the targets in Prometheus, and now we'll see how we set up our OpenTelemetry collector for these two discovery mechanisms. And we also have the Prometheus scrape configuration ready to be ported.
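The two discovery approaches above look roughly like this; the job names, labels, and port are illustrative assumptions, and the `$$` escaping is what is needed when the scrape config is moved into the collector's configuration.

```yaml
# 1) Static scrape config ported into the collector: "$1" becomes "$$1" so the
#    collector's environment variable expansion leaves it alone.
scrape_configs:
  - job_name: 'backend1'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        regex: '(.+)'
        replacement: '$$1'      # was "$1" in the original Prometheus config
        target_label: pod
---
# 2) Dynamic discovery via a ServiceMonitor (requires the Prometheus operator CRDs).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend1
spec:
  selector:
    matchLabels:
      app: backend1
  endpoints:
    - port: metrics
      interval: 30s
```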
So the way we set up the collector: OpenTelemetry offers a Prometheus receiver and a couple of exporters for us to port over to the OpenTelemetry collector. Let's start by looking at the Prometheus receiver. The Prometheus receiver is essentially a minimal drop-in replacement for the Prometheus server for scraping metrics. Under the hood it is just a wrapper over the same Prometheus scrape configuration code, so it supports the full set of Prometheus scraping, service discovery, and relabeling configurations. We have a couple of exporters: one is a pull-based exporter and the other is a push-based exporter. The pull-based exporter essentially exposes all the metrics on a metrics endpoint for your observability backend to scrape, and the push-based exporter will write all your metrics to your remote-write-compatible backend.

Now let's start with configuring the OpenTelemetry collector using the scrape configs, which is the native service discovery in Prometheus. Here is how our OpenTelemetry CR, or custom resource, looks; here is the spec. What we have here is: we'll be deploying a collector in statefulset mode, with just one replica for now. Then under the config we have the Prometheus receiver, and whatever scrape configuration we prepared in the earlier section, we just put over into this section. For simplicity, the job that we are using here is the OpenTelemetry collector monitoring itself, so it'll just monitor the health of the collector instance. And then under the exporter section we have the logging exporter, so all the metrics will be logged in the console, and we also have the Prometheus and Prometheus remote write exporters. For the purpose of this demo we'll be using the Prometheus receiver and the Prometheus remote write exporter. If you want to experiment with the pull-based exporter, all you have to do is replace prometheusremotewrite under the metrics exporter section with prometheus.

Now let's go ahead and apply this CR. This will essentially start a new collector instance as a statefulset with the native Prometheus scrape configs. So let me just copy-paste this. All right, let's now see if the pods are up and running. We see the collector instance up and running here. We should also be seeing the logs, because we have the logging exporter configured; we should be seeing all the metrics in the collector logs. So yeah, as you can see, the collector logs now have the OpenTelemetry metrics. We have a collector dashboard set up in Grafana, so we should be seeing all those metrics on the Grafana dashboard. All right, so here is the collector dashboard. We can see the scrape job here, the otel-collector scrape job that we just configured in the scrape config. Most of the other metrics are empty; we'll see more metrics coming in as we go through the rest of the sections.

All right, just going back. So what we saw so far is that we set up the collector for the native Prometheus service discovery, right? In the next section, we'll look into how we scale the metrics pipeline using a component called the target allocator. In the previous section we deployed the collector as a statefulset and we only have one instance of the collector running. Now, if we have to scale the collector instances, we cannot have the same scrape config in all the collectors, because that would end up double-scraping your metrics.
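A sketch of the single-replica statefulset collector with the Prometheus receiver and the remote write exporter described above; names, endpoints, and the self-scrape job are illustrative assumptions, and the commented `prometheus` exporter shows the pull-based alternative.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-prom
  namespace: observability-backend
spec:
  mode: statefulset
  replicas: 1
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'        # the collector scrapes its own metrics
              scrape_interval: 10s
              static_configs:
                - targets: ['0.0.0.0:8888']
    exporters:
      logging: {}                               # print scraped metrics to the collector logs
      prometheusremotewrite:
        endpoint: http://prometheus.observability-backend.svc:9090/api/v1/write
      # prometheus:                             # pull-based alternative
      #   endpoint: 0.0.0.0:8889
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [logging, prometheusremotewrite]
```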
So one option is that you manually shard the targets and configure the replicas with distinct scrape configs. Obviously that is a very tedious process. To simplify it, the OpenTelemetry operator introduces an optional component called the target allocator. The two essential functions of the target allocator are: even distribution of targets across the collector instances, and, second, it facilitates the discovery of the Prometheus CRs that we saw earlier, which are the service and pod monitors. Here is a high-level overview of what the target allocator does. The first job of the target allocator is to discover the metrics targets, and then it will also discover all the available collectors. It will then use a consistent hashing algorithm to shard and distribute all those targets across the collectors. The OpenTelemetry collector in turn queries the target allocator to get the metrics targets to scrape, and finally the Prometheus receiver scrapes those metrics endpoints. The second job, like I mentioned, is that it also facilitates the discovery of service and pod monitors and then adds the job to the target allocator's scrape configuration, which in turn adds the job to the collector's scrape configuration.

So here is the CR, and we'll go through the notable changes that we made to the CR from our previous deployment (a sketch of these changes follows below). Here you now see the replicas are three, so we scaled the collector instances. We also have a target allocator block added here: we enable the target allocator, and the sharding strategy being used is consistent hashing. We have two replicas of the target allocator for HA and resiliency of the target allocator deployment; if one goes down, you always have the other one to take the requests. And we have the image, and this setting here, Prometheus CR enabled true, will essentially facilitate the service discovery for the target allocator: if this is enabled, the target allocator will be able to discover the targets using service monitors. And the other change is under the Prometheus receiver section: here we have the target allocator section, where we actually configure the target allocator endpoint that the collector uses to go fetch the scrape configs from.

Now let's go ahead and apply the collector statefulset with the target allocator enabled. It will also create a cluster role that grants the target allocator access to go fetch all these metrics targets. Now we applied the CR; now we should also apply the service monitors, and then let's see if the collector pods are up and running. Oh yeah, so here are the new collector pods; here's the statefulset, three statefulset pods. And we also have two instances of the target allocator running; we had two replicas configured in our manifest. So now we should see all the Prometheus metrics for our backend1 service in our apps dashboard. Under the apps dashboard, we have a Prometheus section here. Let's see, maybe it takes a while for it to... yeah, there you go. We see all the metrics flowing now: we see the dice numbers and we also see the dice number count and other metrics. Now let's go back. So where we are is: we have seen the apps dashboard, and we also have a dashboard for the collector and target allocator metrics. Yeah. And then Anthony will take over and talk about the other two sections. Unfortunately, we have only five minutes; maybe we should just do an overview or just skim through those parts. Is that fine?
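The target allocator changes to the collector CR amount to roughly the following sketch; replica counts, names, and the target allocator service name (the operator exposes it as `<collector-name>-targetallocator`) are illustrative assumptions.

```yaml
spec:
  mode: statefulset
  replicas: 3
  targetAllocator:
    enabled: true
    replicas: 2
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true          # also discover ServiceMonitors / PodMonitors
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []               # jobs are now served by the target allocator
        target_allocator:
          endpoint: http://otel-prom-targetallocator
          interval: 30s
          collector_id: ${POD_NAME}        # identifies this replica to the allocator
```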
So just so people know where to find it, but it's all documented here; it's described, so you can apply it yourself and see, but maybe just give a short overview. Yeah, so I think, since we've got about four and a half minutes left, we'll open it up to any questions. If anybody has any questions about what we've covered or anything else, raise your hand and we'll get a microphone out to you.

Is this OpenTelemetry different from Azure Application Insights? It is also collecting. Is it different from that SDK? So that SDK, I'm not sure, I don't know the Azure SDK, but it's probably similar, right? It provides you a way to create and send metrics, and so does OpenTelemetry. Why would I use one or the other? I would choose OpenTelemetry because it allows me to switch the backend, right? I'm not tied to Azure. I can send metrics to Azure as well, but also to Prometheus or other systems. It's not vendor-locking me to any specific observability provider. There is an SDK actually there; we can use the instrumentation key and put it in your code. It collects all custom logs and events, everything, and then sends it to App Insights. That SDK works like that. So this is also almost working like that, right? But that is a cloud-specific SDK, right? And if you want to switch clouds and move to GCP, you would have to re-instrument your applications with a different SDK again; with this you can avoid all that. Yeah, but Azure is the biggest contributor to OpenTelemetry. I don't know why they are spending on it; again, they have the same SDK. Is there any difference between OpenTelemetry and that SDK? I'm not sure. I don't think we know. Maybe the Azure SDK is even using OpenTelemetry behind the scenes, or they are extending it. But their primary SDK supports .NET, Python, and Node.js. They are not supporting other languages, recent languages like Golang or different ones, and it's not on Kubernetes, I think. Maybe that's why they are the biggest contributor to OpenTelemetry. Is that the reason they are doing it? Well, none of us work for Microsoft, so I don't know that we can really speak for them. As an AWS employee, I know we provide the AWS distribution of OpenTelemetry, which packages the subset of components that we offer enterprise support for. I believe Microsoft is doing something similar, and Google is as well.

Okay. I think that's all the time we have, right? Should we? There is one more question, if it's quick, maybe. This is regarding the Prometheus receiver that was explained a little bit earlier. It is used to scrape the metrics and send them to the OTel collector as a backend. My question is, can we leverage this Prometheus receiver as an independent component, even within the Prometheus ecosystem? Can we use the Prometheus receiver to scrape and remote write, with the Prometheus server as a store? So the collector doesn't have any storage or querying capabilities, but you can use it as a Prometheus scraper and then use the Prometheus remote write exporter to direct the metrics that have been scraped to a Prometheus-compatible target. Okay, cool.
The other question I have is in terms of the OTel collector: once it receives data from the Prometheus receiver, does it do any replication, or are these independent statefulsets that don't coordinate with each other? Yeah, so when you use the target allocator component of the OpenTelemetry operator, what that will do is effectively separate the service discovery mechanism from the collection. The target allocator discovers all of the targets and knows all of the collectors that are running, so it can allocate the targets between those collectors so that each target is only given to one collector. That way you won't be double-scraping any of your targets, or triple-scraping, depending on how many replicas you have. I understand that, but that is more around collecting and sending data into the collector. What I'm asking about is the OTel collector itself: can replication be set up so that if one of the OTel collector instances goes down, we don't lose the data, or something like that? Oh, so if you wanted to scrape each target twice and have multiple collectors scraping each target? No, no, so let's say the Prometheus receiver or OTLP receiver collects the data and sends it to the backend. The OTel collector that collects this data, does it have any replication enabled, or are these independent statefulsets? Like a write-ahead log? Yeah, so the Prometheus remote write exporter has a write-ahead log, and I believe there's a storage component that some other exporters use for storing data before it's been sent out. Okay, so the OTel collector is primarily based on the same WAL and other sorts of mechanisms that Prometheus uses. All right, but that's only for persistence just before the data gets exported; it doesn't do storage and querying. Okay, so there are some tools, some receivers like Thanos Receive, that do replication among multiple nodes. Does the OTel collector have a roadmap to get to that state? I don't think so. The collector tries to be as stateless as possible so that you can scale it horizontally. Thank you.

Yeah, I think we're already over time, so thank you, everyone. If you want to look at the more advanced topics, the fifth and sixth sections, I invite you to look at them by yourself. If you're interested in joining, contributing, or have more questions, you can also find us on the CNCF Slack; there is a dedicated channel for the operator, so feel free to ask there. And yeah, thank you for your attention. Thank you.