All right, it's 4:30. Thank you all for being here. This is "Prometheus and OpenTelemetry: Better Together," and we're really excited. Before we start, a couple of brief introductions.

Hello, my name is Adriana Villela, and I am a CNCF ambassador, a HashiCorp ambassador, a blogger, and a podcaster. My day job is as a senior staff developer advocate at ServiceNow Cloud Observability, the artist formerly known as Lightstep. By night I like to climb walls, and fun fact: I really love capybaras, as you can see from my t-shirt.

And I am a senior developer relations engineer at New Relic. I work with the inimitable Adriana on the OpenTelemetry End-User Working Group, where we are focused on connecting end users to each other through events and enablement content. We are also focused on creating a feedback loop between end users and maintainers to help improve the project and drive adoption. My fun fact is that I love anything spooky and paranormal.

So, OpenTelemetry and Prometheus. They both help us monitor the health and performance of our distributed systems, and they're both CNCF open source projects. But what role do they each play in observability? OpenTelemetry, or OTel for short, is a vendor-neutral observability framework and standard for generating, processing, and exporting telemetry data. Prometheus has been a fixture of the observability landscape for years and is widely relied upon by many organizations for monitoring and alerting. Both Prometheus and OpenTelemetry generate metrics, but the similarities and differences between OpenTelemetry metrics and Prometheus metrics are a vast topic that deserves its own session. What we're going to talk about is how these two projects support each other, and we're going to show you the interoperability between them.

While you can use Prometheus to monitor a wide variety of application and infrastructure metrics, the piece we're going to focus on is Kubernetes monitoring, because it's arguably one of Prometheus's widest use cases. First, we'll start by learning about a few OpenTelemetry Collector components you can use to collect Prometheus metrics. Next, we'll talk about the Target Allocator and how it can be used for sharding and Prometheus service discovery, followed by a demo. Then we'll talk about some additional OpenTelemetry components you can use to collect Kubernetes data. And finally, we'll do a wrap-up where we'll talk about some of the pros and cons of the setup that we demoed, and also about some of the work Prometheus is doing on their end.

So let's learn about collecting Prometheus metrics with OpenTelemetry. As a brief refresher, the OpenTelemetry Collector is a vendor-neutral standalone service. It ingests data through receivers; it can transform that data with a number of processors, doing things like filtering, redacting, sampling, and batching; and it exports your data to one or more backends of your choice through exporters. There's additional functionality as well: connectors, plus extensions such as health checks. So you can, for example, use Prometheus SDKs to generate metrics, ingest them with an OpenTelemetry Collector, do processing as applicable, and then forward them to your backend.
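To make that receiver–processor–exporter pipeline concrete, here's a minimal sketch of a Collector config. The OTLP receiver, the filter rule, and the backend URL are illustrative placeholders rather than anything from the talk:

```yaml
receivers:
  otlp:                        # ingest OTLP data pushed by instrumented apps
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  filter/drop-debug:           # one example of the filtering mentioned above
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - debug_.*
  batch: {}                    # batch telemetry before export

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com:4318   # placeholder backend

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop-debug, batch]
      exporters: [otlphttp]
```

Swap the receiver and exporter for whatever fits your environment; the shape of the pipeline stays the same.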
Before we get too far, let's also do a brief refresher on Prometheus. For those who are less familiar with it, Prometheus encompasses many things, including a server and a data format. The Prometheus server collects metrics from targets defined in a configuration file, with a target being an endpoint that exposes metrics for the Prometheus server to store. Prometheus data is stored as dimensional time series, meaning that it has attributes (labels) and a timestamp.

Let's also talk a little bit about how Prometheus and OpenTelemetry are different. OpenTelemetry is primarily focused on the instrumentation piece, so it does not come with a backend; you still have to forward that data on to an observability backend for storage, querying, alerting, and so on. Prometheus, on the other hand, provides a time series data store you can use for your Prometheus metrics, in addition to instrumentation clients. You can also view graphs and charts, query, and set up alerts using its web UI. And it also encompasses a data format known as the Prometheus text-based exposition format. Also note that OpenTelemetry generates traces and logs as well, whereas Prometheus is just for metrics. Those are the big, high-level differences.

Getting back to the OpenTelemetry Collector components, we're going to start with the Prometheus receiver. This component allows you to collect metrics from any software that exposes Prometheus metrics. It serves as a drop-in replacement for Prometheus to scrape your services, and it supports the full set of scrape_config options. And if you're interested in exemplars — a recorded value that associates OpenTelemetry context with a metric event — you can use this receiver to ingest them in the Prometheus format and convert them to OTLP, which allows you to correlate your traces with your metrics.

For exporting your metrics from the Collector to Prometheus, you have two options. One is the Prometheus exporter, which allows you to ship data in the Prometheus format. It reports metrics via a Prometheus scrape HTTP endpoint. However, because all the metrics are sent in a single scrape, the scraping won't really scale — and we can take a look at why. This is the architecture when you use this exporter, taken from a very helpful Grafana blog post. As you can see, the metrics exposed by multiple apps are all exposed on a single endpoint. That means it's exposing a huge amount of data, which makes scraping inefficient because the load is not evenly distributed over time, and there's a huge ingest spike at every scrape interval. Also, if you try to load balance the OTLP requests among a pool of collectors, it's likely the metrics are going to be available in every single collector, so that makes scraping hard as well.

You can use the Prometheus remote write exporter instead. This will help you get around the scaling issue, and we'll look at what the architecture looks like in a second. It allows you to push data to Prometheus from multiple collector instances without those issues. Additionally, since Prometheus accepts remote write ingestion, you can also use this exporter if you're generating OpenTelemetry metrics and you want to ship them to a backend that is compatible with Prometheus remote write. So here's what the architecture looks like with the Prometheus remote write exporter.
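Here's roughly how those pieces look in a Collector config — the Prometheus receiver plus the two export options just described. The job name, target address, and remote write URL are placeholders, and in practice you'd wire only one of the two exporters into your pipeline:

```yaml
receivers:
  prometheus:
    config:                             # standard Prometheus scrape_config syntax
      scrape_configs:
        - job_name: my-app              # placeholder job and target
          scrape_interval: 15s
          static_configs:
            - targets: ["my-app:8080"]

exporters:
  # Option 1: expose everything on one scrape endpoint for a Prometheus
  # server to pull -- simple, but all metrics land on a single endpoint.
  prometheus:
    endpoint: 0.0.0.0:8889
  # Option 2: push to anything that accepts Prometheus remote write.
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write   # placeholder URL

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]   # or [prometheus] for option 1
```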
And now I'm going to turn it over to Adriana, who's going to teach us about the Target Allocator.

Awesome, thank you. Okay, the Target Allocator. Well, here's the deal: Prometheus, we love it, but it's not perfect. It has its fair share of challenges — for example, around performance and resource usage, especially as the number of metrics being consumed starts to increase. One way to get around this is through sharding, where you have a number of Prometheus instances, and each instance ingests a subset of the metrics based on a set of rules. Awesome — but that has its own set of challenges.

For one, it can be resource intensive. If you want to do the sharding thing, say you have three Prometheus workers; in order to manage them, you also need a management instance of Prometheus, so now you've got four instances. On top of that, your management instance requires as much memory as your worker instances combined. So if you have three workers using a combined total of 300 gigs of RAM, then your management instance requires 300 gigs of RAM as well — all of a sudden you've doubled your memory requirements, which can be a little problematic. Now, if you avoid that and pare down to one single Prometheus instance, your memory requirements are halved, but you no longer have the resiliency that you'd get from having multiple Prometheus instances.

Another area where it can be challenging is the even distribution of targets. By default, Prometheus assigns targets to shards regardless of whether or not those targets end up being dropped. So say we have three Prometheus instances, and each one is assigned 500 targets — but one might drop all targets except 10, one might drop half of its targets, and one might drop none. Now you have an imbalance in terms of what's being ingested by each instance. Sad panda. What do we do in that case?

Fortunately, we have the OTel Target Allocator to the rescue. You might be wondering: okay, what is the OTel Target Allocator? The Target Allocator is part of the OTel Operator — so, what is the OTel Operator? The OTel Operator does a few things. One is collector management: it manages the deployment of collectors, and it also manages the configuration of a fleet of collectors through OpAMP integration. Another component of the OTel Operator is auto-instrumentation management. For the purposes of this talk, we're going to focus on the part that manages the deployment of collectors, which is supported by a custom resource in the Operator called the OpenTelemetryCollector custom resource — and the Target Allocator is part of this. That basically means the Target Allocator is only available via the OTel Operator: even though it works hand in hand with the OTel Collector, you get it through the Operator.

So what does the Target Allocator do? It decouples the service discovery and metric collection functions of Prometheus, so that the Collector can manage Prometheus metrics without requiring us to install Prometheus, which we alluded to earlier.
The Target Allocator manages the configuration of the Collector's Prometheus receiver, and it serves two main functions: the even distribution of Prometheus targets among a pool of Collectors, and the discovery of Prometheus Operator custom resources. Let's dig into each of these to see how they work.

Now, I have to admit I was kind of scared of the Target Allocator when I first heard of it. But once you see this, you'll be like, oh, cool, okay. So how does it work? First, the Target Allocator goes out to see what targets are available for scraping. Then it checks which Collectors are available to scrape them, and it decides which Collectors scrape which targets. The Collectors then ask the Target Allocator, "Can you tell me what I'm supposed to scrape?" And finally, each Collector goes and scrapes its assigned targets. Once I figured that out, I was like, oh my God, it's not so scary.

I also want to do a little bit of level setting, because if you're new to Prometheus like I am, "targets" and "scrapes" — what? A target is basically an endpoint that supplies metrics for the Prometheus server to store. A scrape is the action of collecting metrics through an HTTP request from a target instance, parsing the response, and ingesting the collected samples to storage.

Now let's look at the other piece of Target Allocator functionality, which is the discovery of Prometheus Operator custom resources. In particular, we care about two custom resources: the PodMonitor and the ServiceMonitor. These are part of the Prometheus Operator, and essentially what they say is: if a pod or a service matches this set of criteria, we're going to scrape metrics from it. The Target Allocator discovers these Prometheus Operator custom resources — it goes into your Kubernetes cluster and asks, are there any PodMonitors or ServiceMonitors around? Cool. Then it adds the corresponding jobs to the Target Allocator's scrape configuration, converting that information into Prometheus scrape configurations. And finally, the Target Allocator says, okay, my Collector buddies, these are the scrape configurations I'm going to distribute to you so you can scrape these metrics. (There's a sketch of one of these Prometheus Operator resources coming up in a moment.)

All right, now that we've got the theory, let's talk about this in practice with a little demo. It is not a live demo, because I don't believe live demos ever work out for me — so it's prerecorded, but live narrated. The application is a simple Python application. It's made up of a couple of services, but we're going to focus on one in particular: a Python app that emits Prometheus metrics to be ingested by an OTel Collector. With the help of the Target Allocator, the Collector will then emit these metrics as OTLP metrics, just to the Collector's standard output, so we'll be using the logging exporter. And because we're using the OTel Operator for this, we'll be running this lovely setup in Kubernetes. I'll be deploying the OpenTelemetryCollector custom resource; when you deploy that to Kubernetes, it spins up an OTel Collector and a Target Allocator. I've said we're going to run this in a namespace called opentelemetry — you can call it Bob, it doesn't really matter. Our little Python app is going to be running in that same namespace, and the OpenTelemetry Operator runs in its own namespace, which is opentelemetry-operator-system.
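Speaking of those Prometheus Operator resources: the demo coming up uses a ServiceMonitor, so here's a rough sketch of what the PodMonitor counterpart looks like. This isn't from the talk's repo — the name, labels, port, and interval are illustrative placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app-podmonitor          # placeholder name
  namespace: opentelemetry
spec:
  selector:
    matchLabels:
      app: my-app                  # scrape pods carrying this label
  podMetricsEndpoints:
    - port: prom                   # named container port exposing /metrics
      interval: 15s
```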
So let's dig into some of the code that's going to make this happen. Over here you have a sample OpenTelemetryCollector custom resource, and if you scan this QR code, you can see what all of these different attributes mean. I'm going to focus on some very specific ones.

First, we have our namespace — as I said, this thing is running in the opentelemetry namespace. Next, we have our mode. The OpenTelemetry Collector can run in four modes: deployment, sidecar, daemonset, and statefulset. If you want to use the Target Allocator, it works with all the modes except for sidecar. In addition to that, we have our target allocator configuration section, and this bottom part, if it looks familiar, is because it's the OTel Collector config YAML.

Zooming in a little more: this is the target allocator config section, and it's not just a matter of popping it into your OpenTelemetryCollector CR and away you go — you actually have to enable it. On top of that, because the Target Allocator is responsible for creating those Prometheus scrape configurations for the Prometheus receiver, the Prometheus receiver needs to be made aware that the Target Allocator exists. So we have to specify the endpoint for the Target Allocator, and that endpoint is basically the name of our OpenTelemetryCollector instance plus a "-targetallocator" suffix, which gives us this. Next, if we want to be able to use Prometheus custom resource discovery, we have to enable that explicitly with this little bit of code.

Now, if we enable Prometheus custom resource service discovery, we need to define either a PodMonitor or a ServiceMonitor, or a combination of both. In this case we're defining a ServiceMonitor, and it works as follows. I'm saying here that I'm looking for services that match this label — and if you look at the service definition over here, you can see that yes, my service has this label. On top of that — and this part is optional — you can say the service must also reside in this namespace. So I'll be scraping metrics from any service that matches that label and namespace and also has a port definition called prom, and we're going to be scraping it every 15 seconds.

On top of that, if you want to use the Target Allocator at all, you can't only enable it — you also have to set up some permissions for it. One of the things you need is a service account. The service account actually gets created for you automatically, so you don't have to specify one; if you leave it out, its name is basically the name of our OpenTelemetryCollector CR instance plus a "-collector" suffix. So if you leave it out, it still gets created — however, you still have to assign the permissions that you want, which means you still need a cluster role and a cluster role binding, and we'll look at those shortly. This is an example of the service account that we're creating; if you scan this QR code, you can see an example of the service account and cluster role binding definitions in the OTel Target Allocator README. So this is our service account.
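Putting the pieces of that walkthrough together, the OpenTelemetryCollector CR looks roughly like this. It's a hedged sketch rather than the talk's exact repo config — the collector name is a placeholder, the debug exporter stands in for the logging exporter mentioned earlier, and the exact field layout can vary between operator versions:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: my-collector                   # placeholder name
  namespace: opentelemetry
spec:
  mode: statefulset                    # any mode except sidecar works with the TA
  targetAllocator:
    enabled: true                      # turn the Target Allocator on
    prometheusCR:
      enabled: true                    # discover PodMonitor / ServiceMonitor CRs
      scrapeInterval: 30s
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []           # the Target Allocator fills these in
        target_allocator:              # tell the receiver where the TA lives
          endpoint: http://my-collector-targetallocator
          interval: 30s
    exporters:
      debug: {}                        # stand-in for the talk's logging exporter
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```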
Over here we have our cluster role definition, and there are two notable things here. First: these are the permissions that you need in order for the Target Allocator to work, period. You don't have them, the Target Allocator won't work. In addition to that, if you want to do the Prometheus CR discovery, you also need these permissions. You can make these part of the same cluster role or separate cluster roles — it's all good, as long as they're both bound to the same service account via a cluster role binding. And speaking of the cluster role binding, we bring it all together by associating our service account with our cluster role, and voilà, we are good to go.

So now we're ready to actually see the real demo — live narrated by me, prerecorded. Before I start, I do want to mention that if you want to play around with this, the repo is publicly available and we'll provide a QR code for you to scan afterwards. You can run it in GitHub Codespaces, so you don't have to pull your hair out trying to get this to run locally, because that's always a nightmare.

Okay, so here we go. We start by spinning up our GitHub Codespace, which is doing its thing. Now we're installing kind — Kubernetes in Docker — a pretty lightweight Kubernetes distro that runs in Codespaces, which is an extra bonus. I've got a little script that runs the install, and once it's done — little hamster wheel going — we just make sure that our Kubernetes node was created successfully (always kind of a shock if that stuff doesn't get created properly) and that the pods Kubernetes needs are actually running. We're all good.

Now we're ready to start installing things on our cluster. The first thing we want to do is install the Prometheus custom resources. You don't have to install the entire Prometheus Operator to get this to work — you can just pull the ServiceMonitor and PodMonitor custom resources directly from the Prometheus Operator's Helm chart. I believe we have a blog post where we talk about this, which we link to later, so we'll give you all you need to know to do that. Once that's installed, we have to install cert-manager, which we're doing right now; cert-manager is a prerequisite for installing the OTel Operator — if you don't have cert-manager, the OTel Operator will get mad at you and won't install. After installing cert-manager, we just make sure the cert-manager pods are up and running, and shortly they are all good. There we go. Finally, we're ready to install the OTel Operator, and again we just check that the OTel Operator pods are up and running. Here we go. Okay, we are ready to go.

Perfect. Now we build our services with a docker compose build, and once we do that, we load them into kind. I'm not doing anything fancy like running a local Kubernetes registry or anything; there's a command called kind load that loads the images into kind so they're available for use, which is awesome and saves a lot of work. Once these are loaded, we're going to deploy in a minute, but I do want to quickly show our ServiceMonitor definition, which should look familiar, because it's similar to what we showed earlier: we're matching on this label, app: my-app, and we're looking for endpoints with these names.
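For reference, a ServiceMonitor along those lines looks roughly like this. It's a sketch based on the walkthrough (the app: my-app label, a port named prom, a 15-second interval); the resource name is a placeholder and the actual demo repo may differ:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-servicemonitor        # placeholder name
  namespace: opentelemetry
spec:
  selector:
    matchLabels:
      app: my-app                    # scrape services carrying this label
  namespaceSelector:                 # optional: restrict to a namespace
    matchNames:
      - opentelemetry
  endpoints:
    - port: prom                     # named service port exposing /metrics
      interval: 15s
```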
Now we should be ready to deploy our application. Don't mind those warnings — that was a little buggy thing happening with the OTel Operator at the time; it's basically checking to make sure you've defined your OpenTelemetryCollector CR properly. So we've deployed all of our resources. You'll notice the collector is showing CrashLoopBackOff — don't panic, I totally did the first time. It does eventually sort itself out. If it doesn't within a couple of minutes, then totally panic, but it does sort itself out.

Now we're tailing the collector logs to make sure things are getting processed, and as you can see, lovely things are happening. In this view where we're pulling up the logs, we're just filtering for anything that starts with "Name:", because there's something in particular we're looking for, which I'll point out in a sec. I'm looking for something called some_counter, because that's the Prometheus metric we created in our Python code, which I'll open up in a minute. Okay, so this is our Python app, and you can see we define this thing called some_counter. It got scraped by the collector and showed up in standard out, so yay — I knew it was going to end favorably.

Also, I know we're still in the middle of our session, but I am so impressed with all the work Adriana did on the Target Allocator. Some of the stuff she discovered is now part of the OpenTelemetry docs, so be sure to check out the docs for more information, and definitely come to Adriana if you have any questions about the Target Allocator.

Meanwhile, I'm just going to chat a little bit more about some additional OpenTelemetry Collector components — plural — that you can use for monitoring Kubernetes. I won't spend too much time on this section, but we have the Kubernetes Cluster receiver, and the Kubelet Stats receiver for collecting node-, pod-, and container-level metrics; there are a few examples here of metrics collected by these receivers. There's also the Kubernetes Objects receiver, which collects objects from the Kubernetes API server. And there are some other components that aren't Kubernetes-specific that you might find useful as well, such as the host metrics receiver and the filelog receiver.

For processing data, the Kubernetes attributes processor is considered one of the most important components for monitoring Kubernetes with OpenTelemetry, because it adds Kubernetes context, which allows you to correlate your application telemetry with your Kubernetes telemetry. You can also use this processor to set custom resource attributes for your traces, logs, and metrics using the Kubernetes labels and annotations that you've added to your pods and namespaces. There are a few more Collector components that we didn't cover here specifically but that you might also find useful and that aren't necessarily Kubernetes-specific — the batch processor, the memory limiter, and the resource processor — and there are many more beyond those for your specific use case.
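To give a rough idea of how those Kubernetes components fit together in a Collector config, here's a sketch. The exporter, the backend URL, and the assumption that K8S_NODE_NAME is injected via the downward API are mine, not from the talk:

```yaml
receivers:
  k8s_cluster:                 # cluster-level metrics from the Kubernetes API
    auth_type: serviceAccount
  kubeletstats:                # node/pod/container metrics from the kubelet
    auth_type: serviceAccount
    endpoint: https://${env:K8S_NODE_NAME}:10250   # assumes downward-API env var
    insecure_skip_verify: true

processors:
  k8sattributes:               # enrich telemetry with pod/namespace/node metadata
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
  batch: {}

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com:4318   # placeholder backend

service:
  pipelines:
    metrics:
      receivers: [k8s_cluster, kubeletstats]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]
```

Note that these receivers and the k8sattributes processor need their own Kubernetes RBAC, similar in spirit to what we set up for the Target Allocator.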
So, to wrap up, let's look at some of the pros and cons of the setup we covered for you today, specifically in Adriana's awesome demo. Also, being able to narrate something prerecorded while keeping to time is not easy, so I just wanted to point that out.

Here are some of the pros. For one, not having to maintain Prometheus as your data store means less infrastructure overall to maintain, particularly if you go with an all-in-one observability backend to ingest your OpenTelemetry data. Not having to maintain the Prometheus Operator is another benefit — you still have to maintain the ServiceMonitor and the PodMonitor, but that's a lot less work than keeping the Operator up to date. You also get a full OpenTelemetry solution while still obtaining your Prometheus metrics. And finally, since OpenTelemetry is an observability framework, you also get traces, logs, and, I think very soon, profiling as well. Oh, and it also supports correlation of signals, so you can correlate, for example, logs to traces and metrics to traces. And as we just learned, OpenTelemetry provides multiple tools you can use — such as the Target Allocator, the various Collector components, and the Collector itself — to give you more flexibility in your deployment and configuration options.

As for the cons: of course, as with any new tool, there's going to be a steep learning curve, especially if you're newer to observability in general or you're just not familiar with OpenTelemetry concepts, workflows, and components. Additionally, if you're used to using PromQL, which is Prometheus's query language, you may have to learn a new query language if your backend does not support PromQL specifically. OpenTelemetry itself contains many moving parts and has its own challenges with scalability and adoption, so that's also something to consider. The various parts of OpenTelemetry are still in various stages of maturity, from language to language and component to component, whereas Prometheus has been around for a long time and has a pretty mature ecosystem. Of course, there's likely going to be a need for additional computational and human resources to manage these components — as there is with just about anything — but that is something to consider depending on the complexity of your OpenTelemetry infrastructure. And finally, managing and maintaining both Prometheus and OpenTelemetry components is obviously going to introduce operational complexity and everything that goes along with that.

So far we've mainly focused on how OpenTelemetry supports Prometheus, but there's also been a lot of work from the Prometheus folks to support OpenTelemetry, and that's what we're going to talk about here. Prometheus maintainers have been working to strengthen the interoperability between the two projects to make it easier for Prometheus to become the backend for OTLP metrics. So Prometheus accepts OTLP, and soon you'll be able to use Prometheus exporters to export OTLP as well. They are also working on adding delta temporality support — something that's available in OpenTelemetry right now and has its own use cases — so they're working on a component that can do this. You can learn more about what the lovely Prometheus folks are doing by scanning this QR code.

And that is it. Not all images are created by humans: in addition to being the Target Allocator expert, Adriana is also an expert prompt engineer — she put together all these lovely sloth images for you. Thank you so much, I had fun with that. We hope you enjoy them as much as we did. And just a final slide for y'all before we go.
I encourage you to check out the podcast that I do with my daughter — it's called Geeking Out; scan the QR code. I've had past guests such as Kelsey Hightower, Charity Majors, and Reese. Also, come find us at the OTel Observatory booth — we're near the GitHub booth; it's not labeled properly on the map. So yeah, come find us, come ask questions, come OTel with us. And if you've signed up for the New Relic party that we're hosting with Pulumi and Tailscale, please come early to make sure you can get in — I think it's full, so please show up early and come hang with us, or go to the OTel Observatory. Thank you. Thank you.