Awesome. Well, this is very good timing. All right, hello, everyone. It's so nice to see you all. I guess the first thing to say is that I assume most of you work remotely most of the time, and it definitely gets quite lonely. So coming to events like this, being in person with my lovely team from Red Hat, old friends from previous events, and the new people you're going to meet today, it's really nice to be back in person. So, sorry to take this off now. My name is Ian Billett. I'm an engineer at Red Hat, where I'm the team lead for Red Hat's internal observability service. Today we're going to talk about Prometheus exemplars, and specifically how you can use them to accelerate your debugging to warp speed. And I'm being serious when I say warp speed; I'm not using the term lightly. It's a very appropriate use of the term, and not just a great excuse to include lots of Star Trek memes. Which it is, by the way. Exemplars are one of those features that, if you're earlier on in your Prometheus journey or not paying super close attention to the project, can slip under the radar. But when you see it in action, it looks like magic, and today I want to show you the magician's tricks.

The goal today is threefold. First, I'm going to convince you why exemplars are such a powerful addition to your observability stack. As with any change you make to production systems, it absolutely has to be worth the investment in time, effort, and energy, and I'm going to construct that argument with you. Second, we'll talk about the actual mechanics of exemplars: how they work, how they're ingested, how they're stored, the real nuts and bolts. And finally, we're going to see some exemplars in action. I sacrificed a pain au chocolat to the demo gods, so fingers crossed for that one.

All right, I want to kick off by painting a picture that gives us all the shared context with which to understand the value of Prometheus exemplars. You are a human, or close enough. You are responsible for a system or a set of systems. That system starts behaving in a way that is unexpected, and you receive an alert or observe some symptoms of the system misbehaving. As the responsible human, it is your job to figure out what is going on and return the system to normal working order. This is exactly why we have observability tools: they collect, emit, and store streams of data, your metrics, your logs, your traces and, more recently, your profiles. With these streams of data, it is then the human's job to ask questions of the system, and it's these streams that provide the answers to the questions you're trying to answer. Every question you ask and every answer you get narrows down the space in which the problem afflicting your system lies. This is a very roundabout way of describing your normal debugging workflow, but in the context of your observability tools and your observability data. And yes, that is a picture of Mr. Zuckerberg at his congressional hearing, photoshopped to look like Data.
It was too funny not to include. The internet's amazing, isn't it? All right. I would say that it is the speed with which you can repeatedly go through this cycle of asking questions and getting answers that really makes or breaks your observability tooling. And when I say speed here, I don't just mean mashing through the questions and answers as quickly as you can; I'm talking about the speed with which you can home in on relevant contextual information. We can use the mean-time-to-detection and mean-time-to-recovery metrics to capture just how effective we are being here. These metrics are most commonly used in incident response, in crisis-like situations, but I would argue they are absolutely applicable any time your software is not behaving the way you expect.

Another point I want to make is that the time you spend debugging is not productive time. That feels like a slightly controversial thing to say, and I'd add a caveat: from my own experience, time spent debugging has some very useful side benefits. Those are the times when you pick up your service, look at it from every angle, and start to see it in ways you didn't before. So there's definitely a big learning process that comes from debugging. The point I'm making is that the opportunity cost of debugging is always quite high: there's almost always something better, something more productive, you could be doing with your time, but instead you're debugging this problem. And the final point is that your time is incredibly precious. In engineering organizations, and I'll assume this is true for most of you, the human cost is the biggest cost those organizations pay. I've seen an article claiming that the average engineer gets only about 30% of their working week as focused engineering time. Disclaimer: I don't know if that was backed up by any data, but it feels intuitively correct. So the overarching point is that finding and eliminating inefficiencies in your debugging process is not only good for you, it's good for your organization; it's good for everyone.

OK, so the next question is: how do Prometheus exemplars help us debug more efficiently, and what is the magnitude of the efficiency gain we can get by using them? When you're in this debugging loop of asking questions and getting answers, very often the piece of information you need to answer your question lies in a different stream of data than the one you are currently focused on. For example, you see a metric showing increased request latencies, so you need to find some tracing data that breaks those latencies down and helps you answer the question of where the latency increase is coming from. I've been there many times: you're looking at Prometheus and thinking, OK, where is the bookmark for the production tracing URL? There it is. OK, how do I log into Jaeger? What was the Jaeger query syntax again? And so on, all these steps you have to go through to get the answer to your question.
And it is this process, this switching of context, that takes a small amount of time but is actually incredibly costly. The human brain can only hold around four chunks of information in working memory at any one time. So when you go through this costly context switch, you have to take your current debugging context and unload it into longer-term memory, load in the Jaeger context, find the trace, then unload the Jaeger context and reload your debugging context. It's this context switching that ruins your debugging flow state, and it's a very costly switch that Prometheus exemplars can help eliminate. The documentation says that exemplars are references to data outside of the metric set. I would go further and use more vivid language: exemplars give you the necessary pieces of data to hop directly from your metrics to the relevant contextual locations in different streams of your observability data. And that is exactly why this talk is called warp speed debugging. You don't have to faff around context switching and remembering things; you can warp directly from your metrics to the relevant location in a different stream of data.

The final point I want to make here is: OK, that sounds cool and helpful, but just how helpful is it? This is going to sound silly, but I think of it almost in terms of swimming pools. Bear with me here. Take the average time it takes you to perform one context switch, then add, say, 20% for loading and reloading that context in your head. Multiply that by the number of times you do it in a given time period, then by the number of engineers in your organization, and you've defined a volume of inefficiency that your organization incurs. It's a net negative cost you're paying, a volume that sits underground, hence swimming pools. That's a way to think about just how useful Prometheus exemplars can be to you and your organization. All right. And yes, there's always a relevant XKCD. OK, captain's log: convince everyone to use exemplars. I'm going to mark that one as done.

So now let's talk about the actual mechanics of exemplars. The first question is: how do you create an exemplar? This happens when you are instrumenting your application code, and the client libraries make it very, very simple. When you make an observation in the instrumentation code in your application, the provided functions, such as ObserveWithExemplar in the Go client, let you make your normal observation while also supplying a Prometheus label set, which you can think of almost like a very simple JSON dictionary. The instance here is taken directly from the Prometheus Golang library: we observe a value with a label of dummyID and a random integer. So that's how you create an exemplar in your application code.
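To make that step concrete, here is a minimal sketch of what the instrumentation looks like with the Go client library (client_golang). The metric name is made up for this example, and the dummyID label mirrors the client_golang example mentioned above; in real code you would more likely attach something like the current trace ID:

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/prometheus/client_golang/prometheus"
)

// A hypothetical histogram tracking request durations for the demo.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "demo_request_duration_seconds",
	Help:    "Duration of demo requests.",
	Buckets: prometheus.DefBuckets,
})

func init() {
	prometheus.MustRegister(requestDuration)
}

func observe(seconds float64) {
	// Histograms in client_golang also implement the ExemplarObserver
	// interface, so alongside the normal observation we can attach a small
	// label set that Prometheus will expose as an exemplar. As in the
	// client_golang example, this is a dummy ID with a random integer.
	requestDuration.(prometheus.ExemplarObserver).ObserveWithExemplar(
		seconds,
		prometheus.Labels{"dummyID": fmt.Sprint(rand.Intn(100000))},
	)
}

func main() {
	// Simulate one observation; in a real server this happens per request.
	observe(0.042)
}
```

One constraint worth knowing: the OpenMetrics spec limits the combined length of an exemplar's label names and values to 128 UTF-8 characters, so keep the label set small.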
Once we've defined an exemplar in our application code, how do we get it into Prometheus? Well, if you think back to your Prometheus 101, you'll remember that Prometheus operates on a pull-based model of data ingestion. This means that Prometheus goes out to each of the targets it is configured to scrape and requests the metrics page, a text page that, in the normal case, includes the metric's label set, the data that uniquely identifies the metric, and then the value. That's standard Prometheus without exemplars. After you've instrumented your code, as in the previous example, the format the application presents to Prometheus for scraping appends some extra metadata: the exemplar label set we defined in the instrumentation code, then the value that was observed, and the timestamp at which it was observed. So that's what an exemplar looks like in the exposition format.

This model has a couple of nuances I want to call out here. The first: let's assume your server is serving, say, 10 requests per second, and every request is observed with an exemplar, but your Prometheus server only comes around to scrape the metrics page every 30 seconds. What happens? Only the most recent exemplar is captured. Of all the requests served in the time between scrapes, only the very last exemplar per series is presented to Prometheus to ingest into its storage. So bear that in mind. The second nuance is that exemplars are not supported in the standard, bog-standard Prometheus exposition format, but they are supported in the OpenMetrics format. These formats are very, very similar; this isn't something to worry about, just something to be aware of. There are further peculiarities in their differences that I don't feel qualified to talk about, but if you configure Prometheus to ingest exemplars, this all happens automatically without you thinking about it. Where it will catch you out is if you curl a metrics page, don't see your exemplars, and think, huh, this must be broken. What you actually need to do is explicitly tell the server that's serving the metrics page that you want the OpenMetrics format, which will include your exemplars.
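As an illustration with made-up values, here is roughly what that looks like if you curl a metrics page and explicitly ask for the OpenMetrics format (assuming the server has OpenMetrics negotiation enabled); the exemplar is everything after the # on each bucket line:

```
$ curl -H 'Accept: application/openmetrics-text' http://localhost:8080/metrics
# TYPE demo_request_duration_seconds histogram
demo_request_duration_seconds_bucket{le="0.1"} 7 # {dummyID="42"} 0.043 1655978400.123
demo_request_duration_seconds_bucket{le="1"} 11 # {dummyID="90"} 0.87 1655978391.541
demo_request_duration_seconds_count 11
demo_request_duration_seconds_sum 3.2
# EOF
```

Without that Accept header you get the classic Prometheus text format back, which is why the exemplars seem to vanish when you curl the page naively.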
All right, so we've instrumented our application and Prometheus has gone and grabbed the exemplars from it. The next question is: how does Prometheus store exemplars? What's going on on the backend side? Within the TSDB, exemplar data is stored in a fixed-size, in-memory circular buffer; that's the terminology used in the documentation. The size of this buffer is controlled by the max_exemplars configuration option. What this means is that Prometheus holds on to a set number of exemplars at any one time, and when a new one is added, the oldest one is dropped, so you have only the most recent exemplars in memory at any given moment. Things being in memory also means possible trouble when your Prometheus restarts: exemplars are not persisted in a long-term fashion the way Prometheus blocks are. You do get the persistence of the WAL, the write-ahead log, which effectively means that the most recent two-to-three-ish hours of data are on disk, so if your Prometheus server restarts and starts back up, it reads that data back into memory. You'll get the most recent two or three hours back, but the older exemplars were only in memory, and they're lost.

And don't forget: exemplars are not enabled by default in the Prometheus server. They've been in mainline Prometheus since version 2.26, which was around March last year, but they're still a feature marked as experimental, so you need to enable it explicitly (via the --enable-feature=exemplar-storage flag). That will definitely catch you out when you first come to use exemplars.

OK. Once your exemplar data is in Prometheus, the next thing we want to do is actually get hold of that data. So how are exemplars queried? Very simply: Prometheus exposes the query_exemplars endpoint on both GET and POST methods. To this endpoint you provide three things: a standard PromQL query (any query, it doesn't matter), a start timestamp, and an end timestamp. Behind the scenes, Prometheus takes the query, parses out the selectors, the label sets, and then, within that time range, goes through its circular buffer, pulls out all the exemplars matching those criteria, and returns them to the user. It's very simple.
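To make that concrete, here is a sketch of calling the endpoint against a local Prometheus; the metric name and exemplar values are invented, but the response shape matches what the HTTP API documents:

```
$ curl -G 'http://localhost:9090/api/v1/query_exemplars' \
    --data-urlencode 'query=demo_request_duration_seconds_bucket' \
    --data-urlencode 'start=2022-06-23T10:00:00Z' \
    --data-urlencode 'end=2022-06-23T11:00:00Z'
{
  "status": "success",
  "data": [
    {
      "seriesLabels": {
        "__name__": "demo_request_duration_seconds_bucket",
        "instance": "localhost:8080",
        "job": "demo",
        "le": "0.1"
      },
      "exemplars": [
        {
          "labels": { "dummyID": "42" },
          "value": "0.043",
          "timestamp": 1655978400.123
        }
      ]
    }
  ]
}
```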
But the final question I want to address when we're talking about the specific mechanics of Prometheus exemplars is: how can you warp between the data streams? It's all very well having an HTTP endpoint that gives you JSON back, but in reality that's not going to give you a real first-class debugging experience. And the answer to this question, at this point in time, is simply Grafana. Now, this is Prometheus Day, not GrafanaCon, so I'm not going to go into loads and loads of detail, but the point in Grafana, specifically, is your Prometheus data source settings. When I show you the demo you'll see what I mean, but that is where you tell Grafana that you're interested in exemplars and in linking them to other data sources. And Grafana (I said I wasn't going to talk about Grafana, but) does two neat things with the data sources: it lets you define an internal link and an external link. An internal link lets you hop from the exemplar data to another point within Grafana, while an external link lets you template your exemplar data into URLs that link to a place outside. We'll see that in action, hopefully. One caveat: I was bumping into a bit of trouble with the URL templating in Grafana. It wouldn't let me use all of the global variables, like the timestamps; it seemed to support just the value variable, __value.raw. I'd be happy to be corrected if other folks know better. And I guess the final point is that all of the code I wrote for this demo is available in that repo.

It uses a really neat framework that lets you run Docker containers from your Go tests, which my esteemed colleagues Mate and Jessica are giving a talk on, so definitely check that out. All right, demo time, the part where it all goes wrong.

OK, so for this demo we've written a very simple server that calculates Fibonacci numbers: it picks a random number, calculates that Fibonacci number, and exposes metrics, traces, and logs, though we're not going to go into the logs. These are then ingested: Prometheus for the metrics, obviously; Grafana Tempo for the traces; and Grafana to visualize it all. So here is my Prometheus data source in Grafana. I've set this up in configuration, but it shows up here, and this is the crucial bit: this is the point in your configuration where you tell Grafana how you want the exemplar data to link to other streams of your observability data. We have our internal link that goes to Tempo using the trace ID, and we have our external link that points at Parca, which is a continuous profiling tool.

So let's pretend we're going to query for a, oh wow, this is not good ergonomics up here. OK, let's do the very simple example of looking at the 95th percentile of the demo metric I'm exposing, and you'll see here what this data looks like. All of this is the normal line that you'd draw, but what's really cool is that after I've enabled exemplars, ingested them, and configured everything at the right level, these little green triangles are the exemplar data. What's really neat is that you chart this quite high-level, aggregate metric, but it gives you the ability to take individual, specific points that relate to relevant contextual locations in other streams of data. Take this data point, for example: it gives you the option to query with Tempo. Let's see what that looks like, and straight away you see the trace. You can see here this is us calculating Fibonacci numbers, which is almost as pointless as Bitcoin mining, but perhaps not quite. That was probably the most controversial thing I'll say in this talk. Looking at these traces, the steps I would have had to go through to get to this point would have been such a pain in the neck, but here it's just one click, and that one-click hop, that one-click warp, is the prime debugging experience. That was an internal link to Tempo, but the other cool thing I'll show you is that, just by templating, this external link takes me to a totally different data source and lets me dig into, where's the icicle graph? There we go. It's a contextual warp to another relevant place in a different stream of data. For those of you who are hawk-eyed, you'll realize this is not actually the thing that was running, because this is just the public Parca demo, but the templating works and you can link to an external source, so that is how you would do it.
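For those who want to reproduce that linking setup, here is a sketch of the same data source configuration expressed as Grafana provisioning YAML rather than clicks in the UI; the Tempo data source UID and the Parca URL are placeholders:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        # Internal link: jump from the exemplar's trace_id label straight
        # to the matching trace in the Tempo data source within Grafana.
        - name: trace_id
          datasourceUid: my-tempo-uid
        # External link: template the exemplar value into a URL that points
        # at a tool outside Grafana (here, hypothetically, a Parca instance).
        - name: trace_id
          url: 'http://parca.example.com/?query=$${__value.raw}'
```

The doubled dollar sign escapes variable expansion in provisioning files, so that Grafana sees the literal ${__value.raw} template at link time.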
It looked a lot like that. It's really cool. It is warp speed debugging. Cool. And with that, that's all I have to talk about today. So that's all, folks.

All right, thank you, Ian. Are there any questions in the room? I saw someone typing on Slack; oh yeah, there's a question from Slack for you. All right: what magic was that? I was about to see the demo and it went to the... oh, that was a streaming question, sorry about that. If anyone has any trouble viewing the stream or wants to watch the sessions again, everything is recorded and will be up in a short period of time. Thank you.

Is it possible today to send exemplars via remote write? Good question. I don't know off the top of my head; I assume so. Does anyone know off the top of their heads? I'm looking at the maintainers. It is. Three thumbs up. It is. There you go.
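For the record on that question: the opt-in lives in the remote_write section of the Prometheus configuration, and exemplars are only forwarded if you enable it per endpoint. A minimal sketch, with a placeholder receiver URL:

```yaml
remote_write:
  - url: http://receiver.example.com/api/v1/write
    # Exemplars are only forwarded if you opt in explicitly; the receiving
    # end must also support them, and exemplar storage must be enabled on
    # this Prometheus.
    send_exemplars: true
```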
All right, any more questions on exemplars? Thanks for a nice talk. Is it always like this, that the client libraries expose only the last exemplar? Should it always work like this, or could there be some magic for detecting the very interesting ones? That's a really good question. I think the short answer is that's the only way it works. The clue is in the name, right? Exemplar. It took me a while to figure out why that name was chosen, but it's an example of something. It's not everything; it's an exemplar of the data. And to follow that logic through: if you have a system like tracing, where you can't hold everything, there's no guarantee that the exemplar Prometheus scraped is going to link to anything your tracing system can serve. So yes, there's nuance to this. As for surfacing the most interesting ones: I don't think you'd know a priori, before Prometheus comes to scrape, whether an exemplar is going to be interesting or not, and that would get complicated quickly. I think it's easiest to just know the limitations and work with them. All right, thank you. Good question. Embrace the randomness.

There was another question. Thank you for the talk. You mentioned that only the latest exemplar is stored, right? Is that the latest per set of labels, so that for a particular label set the latest exemplar is stored? And if so, say a metric was observed with an exemplar and then for the next five minutes there was no new exemplar: will that exemplar be sent in every scrape to Prometheus? OK, I think I get what you're saying. In the instrumentation code in the application, it is on a per-label-set basis. So if you have a really broad histogram, with really long and really short observations, and only very rarely do you get a really long one, that exemplar will persist between scrapes until a new one comes along to knock it off its perch, at least until the process restarts or similar. And on the Prometheus backend side, I believe exemplars are likewise identified by the label set they're ingested with, so it's similar to the application side. Cool, really good question.

Another question, wow. Very interesting topic: are there any plans for Prometheus to persist those exemplars? Another excellent question that I do not know the answer to; I look towards the table of maintainers. I would think that the overhead required to store and serve those would be too high, in a similar way that tracing data can get very big very quickly; I'd imagine similar constraints would be felt on the Prometheus side. And what's the use case? The feature seems geared towards the most recent two or three hours, and if that's good enough, and you don't have to go down the long, winding path of storing exemplars in and serving them from TSDB blocks, then that feels like getting 80% of the goodness for almost none of the work. I would raise an issue on the Prometheus repo and see if any smarter folks have smarter things to say. Cool, great question. Thank you.