Hello, everybody, and welcome to PrometheusDay Europe 2022. My name is Ramon, and I work for Timescale. Today I'm going to be talking about correlating data from different sources, in particular Prometheus and OpenTelemetry, for faster troubleshooting.

I've been working on building observability products for the last few years, and this is a challenge I've always encountered as I talk to users of those products: they don't typically use just one tool, they use a lot of tools. If you look at the observability part of the cloud native landscape, there are a ton of different tools, and those aren't even the only ones that exist; they're just the ones in the community. You're probably using more than one. And quite often the challenge is that you're collecting data into different systems, and you have to correlate it somehow.

So the first point is that interoperability is key, and in particular this is about data: how you can get the telemetry, the metrics, logs, and traces, flowing through different systems so you can more easily correlate the data, and with that, hopefully also troubleshoot problems faster. Luckily, the CNCF is sponsoring and supporting two standards that have a lot of adoption and a lot of momentum. On one side, for metrics, we have Prometheus, with its exposition format and the OpenMetrics standard. And then there's OpenTelemetry, for metrics, logs, and traces. Prometheus is obviously very widely adopted. OpenTelemetry is a standard with a lot of momentum; there's still a lot of building happening, but it's the second most active project in the CNCF, and it's also the second with the most contributors. So there's definitely a lot of momentum on both sides. The question is: as time goes by, most of you will probably end up with data generated using both standards, so how do you correlate the data together? That's what I'll try to cover here.

Before I start, I just want to paint a picture of what the high-level architecture of this system would look like. You have your services and infrastructure, and you're generating Prometheus metrics from them. Those go into Prometheus, and in this case you also store them in Promscale, which is a long-term store for Prometheus, so you can do long-term analysis and things like that. But the key thing here is that as you start adopting OpenTelemetry as well, you'll have metrics and traces coming from OpenTelemetry, and OpenTelemetry doesn't have a backend; you have to store the data somewhere. For the metrics, you already have Prometheus, you know how to use it, and you probably want that data in there. So the first thing you have to figure out is how to convert your metrics from OpenTelemetry into Prometheus. Luckily, there is a component called the OpenTelemetry Collector that does a lot of wonderful things. One of those things is that it can receive data in a lot of different formats, via something called receivers; then process that data, to do things like sampling or batching; and then export it to a lot of different destinations, one of them being Prometheus, via exporters.
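To make that concrete, here is a minimal sketch of what such a Collector configuration could look like. The endpoints are placeholders for your own setup (I'm assuming Promscale's default remote-write and OTLP ports here), and the prometheusremotewrite exporter ships with the Collector's contrib distribution:

```yaml
receivers:
  otlp:                  # receive OTLP metrics and traces over gRPC
    protocols:
      grpc:

processors:
  batch:                 # batch telemetry before exporting

exporters:
  prometheusremotewrite:                       # convert OTLP metrics to Prometheus remote write
    endpoint: "http://promscale:9201/write"    # placeholder remote-write endpoint
  otlp:                                        # forward traces unchanged, in OTLP
    endpoint: "promscale:9202"                 # placeholder OTLP gRPC endpoint
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```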
So what this architecture, and this configuration of the OpenTelemetry Collector, is showing is that the Collector takes OpenTelemetry metrics, transforms them into Prometheus metrics, and sends them to Prometheus via the Prometheus remote write exporter. For traces, the only thing it does is some processing, and they're still exported using the OpenTelemetry format; OTLP stands for the OpenTelemetry protocol. In this case, the traces are stored in Promscale, and because Promscale supports both Prometheus metrics and OpenTelemetry traces, we're storing all the data there. Then we connect Grafana to it so we can query it. You can query all metrics using PromQL, and because Promscale is built on top of Postgres and TimescaleDB, you can also use SQL to query both metrics and traces and do some interesting correlation.

So let's talk about metric and trace correlation first. One very common way, or at least the one we typically talk about, is correlation via exemplars. Imagine some Python code that is instrumented with both Prometheus and OpenTelemetry: we create a histogram metric to measure the duration of API requests to our service, and we record a new OpenTelemetry span every time the random weight method in that API gets called. To correlate using exemplars, what we do is add additional metadata when we record an observation in that API-duration histogram. The exemplar is that extra piece: a small set of attributes, in this case just one, that references data outside the metric itself; here, it's the trace ID. When you then fetch the metrics from the Prometheus endpoint of that service, on the left side you see the typical exposed metrics in the Prometheus exposition format, and next to them some additional information: the trace ID, plus things like the duration of that trace and the timestamp. The trace ID is the thing that lets you correlate; it's an example of a trace that fell within that bucket of the histogram.

Then in Grafana, which does have support for exemplars, you run a PromQL query to compute the 90th percentile on that histogram using the histogram_quantile function. If you enable exemplars, which I believe is enabled by default, and exemplars were sent to Prometheus or Promscale (Promscale supports them too), Grafana will show you individual data points: those dots are the exemplars, individual traces and how long they took. If you put your mouse over one of those dots, or click on it, you get a popup, and a button on it lets you jump straight to the trace. The hope is that you get an example of a trace that took a certain amount of time within the percentile, or the bucket, that you're looking at, and then you can see where the time is spent, as long, obviously, as that trace is representative of all the traces that fall within that bucket or percentile. That's the whole idea here.
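Here's a rough sketch of what that instrumentation could look like in Python with the prometheus_client library and the OpenTelemetry SDK. The metric, span, and label names are made up for illustration, and note that exemplars are only exposed when the client serves metrics in the OpenMetrics format:

```python
import time

from prometheus_client import Histogram
from opentelemetry import trace
from opentelemetry.trace import format_trace_id

tracer = trace.get_tracer(__name__)

# Histogram measuring API request duration (names are illustrative).
REQUEST_DURATION = Histogram(
    "api_request_duration_seconds",
    "Duration of API requests",
    ["endpoint", "instance"],
)

def random_weight():
    # Record an OpenTelemetry span every time this method is called,
    # carrying the same attributes that we use as metric labels.
    with tracer.start_as_current_span(
        "random_weight",
        attributes={"endpoint": "/random_weight", "instance": "api-1"},
    ) as span:
        start = time.time()
        ...  # handle the request
        ctx = span.get_span_context()
        # Attach the trace ID as an exemplar when recording the observation.
        REQUEST_DURATION.labels(endpoint="/random_weight", instance="api-1").observe(
            time.time() - start,
            exemplar={"trace_id": format_trace_id(ctx.trace_id)},
        )
```

With a metric named like this, the percentile query would be something like `histogram_quantile(0.9, rate(api_request_duration_seconds_bucket[5m]))`, and in the OpenMetrics exposition each bucket can carry its exemplar, along these lines (the trace ID here is made up):

```
api_request_duration_seconds_bucket{endpoint="/random_weight",instance="api-1",le="0.5"} 42 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.23 1654012345.678
```

Notice that the span also carries the same endpoint and instance attributes that we use as metric labels; that's the second correlation technique, which comes up next.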
This helps because instead of trying to figure out which traces were generated when this metric had these values, you can jump straight from the metrics to the traces.

The other way, which is probably simpler but still really important, is correlation via labels and attributes. OpenTelemetry has the concept of an attribute, which is basically the same as a label in Prometheus. So if your service was already instrumented with Prometheus metrics, when you add traces, don't forget to also add the attributes that you're using in your Prometheus metrics; in this case, endpoint and instance, just like in the sketch above. The syntax for doing this is very similar; again, it's a Python example.

When you do that, you can build things like this: a dashboard with a filter at the top where you're filtering by service. At the top you have metrics, and that part is easy: you're using PromQL and showing those metrics in charts. But the panels at the bottom, especially the two on the bottom right, are built by querying traces. So you can see the performance of your service with the three golden metrics, but also the slowest traces, which you can jump straight into, and even errors; there is error information in trace data, so you can see which ones are the most common. In short, you can correlate the data visually in a dashboard.

There are other things you could do. In the case of Promscale, because you have SQL, you could run a query that returns the hosts whose traces or spans had the most errors, and then do a subsequent query, or a join, to retrieve and plot in a chart the memory consumption on those hosts, so you can see whether memory is growing or peaking at some point. Going even further, you could do another join, all in the same query, to retrieve the exact processes that were consuming the most memory at that point. That takes you very quickly from spotting a problem to a much deeper understanding of what could be its source. So using labels and attributes is very powerful, especially if you can do joins on the data; I'll show a rough sketch of such a query in a moment.

Another one is metric correlation. One of the first things to take into account here is that OpenTelemetry metrics and Prometheus metrics have different metric types, so they need to be mapped; roughly, a monotonic OpenTelemetry sum maps to a Prometheus counter, a non-monotonic sum or a gauge maps to a gauge, a histogram to a histogram, and a summary to a summary. I won't get into detail, but another thing to keep in mind is that there may be some types that you cannot map. An example is OpenTelemetry's exponential histogram, which doesn't have a way to map into Prometheus metrics. This mapping is defined in the OpenTelemetry spec, by the way, and there were a lot of discussions between the OpenTelemetry and Prometheus projects to arrive at it. So make sure you're using metric types that you'll be able to convert and map together, especially if you're going to store them in Prometheus.

With metrics, most likely the only handle available to you is labels, so again you correlate via labels and attributes. Here is the same idea in code: the same service instrumented with the Prometheus client library on the left and with the OpenTelemetry SDK on the right. Except for the setup at the beginning, the rest is fairly similar: you define the metric, there is a name, there is a description (or documentation, in the case of the Prometheus client library), and then you just increment the counter, adding some labels to it; in this case the name of the API endpoint, add_product, as an example.
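A minimal sketch of the two side by side, with illustrative names (and assuming an OpenTelemetry MeterProvider is configured elsewhere):

```python
from prometheus_client import Counter
from opentelemetry import metrics

# Prometheus client library: name, documentation, label names.
PROM_REQUESTS = Counter(
    "api_requests_total",
    "Total number of API requests",
    ["endpoint"],
)
PROM_REQUESTS.labels(endpoint="add_product").inc()

# OpenTelemetry SDK: name, description, attributes given at record time.
meter = metrics.get_meter(__name__)
otel_requests = meter.create_counter(
    "api_requests_total",
    description="Total number of API requests",
)
otel_requests.add(1, {"endpoint": "add_product"})
```

Because both counters carry the same endpoint label or attribute, you can line them up later in queries and dashboards.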
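And here is the rough sketch of the SQL idea from a moment ago. I'm assuming Promscale's ps_trace.span view for trace data; the tag operators and the metric view name are simplified placeholders, not the exact schema:

```sql
-- Sketch: memory consumption on the hosts whose spans had the most errors
-- in the last hour. ps_trace.span is Promscale's span view; the tag
-- operators and the node_memory_usage metric view are illustrative.
WITH worst_hosts AS (
    SELECT resource_tags ->> 'host.name' AS host,
           count(*) AS errors
    FROM ps_trace.span
    WHERE status_code = 'error'
      AND start_time > now() - interval '1 hour'
    GROUP BY 1
    ORDER BY errors DESC
    LIMIT 5
)
SELECT m.time,
       m.labels ->> 'instance' AS host,
       m.value AS memory_bytes
FROM node_memory_usage m
JOIN worst_hosts w
  ON m.labels ->> 'instance' = w.host
ORDER BY m.time;
```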
And if you do that, then you can start correlating metrics. Again, in a dashboard you could be filtering data from both Prometheus and OpenTelemetry. In this example, most of the charts, the Grafana panels, come from Prometheus metrics, but the one on the top right is from a service instrumented with OpenTelemetry that is reporting metrics. So in the same dashboard, for a specific service, you can show, filter, and see the telemetry coming from OpenTelemetry alongside the telemetry coming from Prometheus.

So, just to wrap up: tool interoperability is key. I'm really happy that there is so much momentum with Prometheus on the metrics side, and now with OpenTelemetry as well, especially for traces, because that gives us the tooling and the foundation that we need. And you have to think about planning when you're doing instrumentation. Plan it carefully to make sure you'll be able to correlate the data in the future: use consistent tagging across signals, think about using exemplars, and choose your metric types carefully so you can do the mapping correctly. Thank you very much. And by the way, we have a booth just outside, so if you want to talk more about this, I'll be sitting around and happy to discuss. Thank you.