Test, test... ah, yeah, perfect. OK, so we are restarting. Please find your seat. You know we don't need to defrag yet, but later in the day we might actually start defragging you. And if you could quiet down a little bit, because up here it's really, really loud. Same for the people who are still talking? OK. Anyway, they're going to stop talking in a bit. So, I teased it just a few minutes ago, and now you're going to see the real thing: we have Björn Rabenstein talking about native histograms. Thank you.

Thank you. Yeah, my name is Björn. I work for Grafana Labs, and I also do a thing or two in the Prometheus ecosystem. My pet project for years has been native histograms. This is not a talk about native histograms in general; there have been too many of those already. If you have never heard about native histograms and you're watching the recording of this, you should stop now and watch those other talks first. If you're here in the room, that's not an option, but you can still watch them afterwards, right?

So this is a viewing list. It starts with the historical context. The middle one is PromCon 2019 in Munich, which is more than three years ago, which proves how long we have been working on this. It tells you where we started from and what the plan was. And the final one is actually two talks, from the recent PromCon in 2022, where we showed what we have. This talk now is about how native histograms actually work in production. The abstract promised that I would also talk about compatibility with OpenTelemetry's new exponential histograms. Luckily, that topic is so important that it gets a talk of its own, so I don't have to cover it here. Just watch that session later today.

OK, so let's start. Native histograms are still experimental, so this is the first production pro tip: you need a feature flag to enable them. Everything could still break, but we are more and more confident that nothing will break, and this is actually getting more and more stable. There's no guarantee yet, though, so be prepared.

Let's start with instrumentation, because that's where you start using native histograms. When I submitted this talk, I was hoping I could tell you it's all done, it's everywhere. Unfortunately, it's not. client_golang, the official Go instrumentation library, has released the support: as of v1.14, it's all there. That's cool. client_java has it in a branch, and it's not quite complete. And no other official client library supports it yet.

The "problem", quote unquote, is protobuf. Protobuf is of course not a problem; it's awesome, and it's very well suited for native histograms. That's why the one library that still supported the protobuf exposition format, even while Prometheus didn't, now has the support fully. If you enable the feature flag I showed on the previous slide, Prometheus actually goes back to using protobuf. Now, we are not planning to add protobuf to all the instrumentation libraries; we are doing the opposite. We will eventually create a text representation of native histograms, which might not be as efficient, but it will work. We are not quite there yet, and that's why you don't see support in all the other client libraries yet. So you have to wait a bit. You can also go down the OpenTelemetry path, but that's the other talk.
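For reference, the feature flag in question is the real one that shipped with Prometheus v2.40. Enabling it turns on both ingestion of native histograms and, as just described, protobuf negotiation with the scrape targets:

```
prometheus --enable-feature=native-histograms
```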
OK, this is a slide from yet another talk that wasn't even on that viewing list, KubeCon 2022 EU, which was online, so not really EU. In that talk, I formulated wishes for what we want from native histograms. The yellow wishes, that was the conclusion, are all fulfilled by design, and now that native histograms are here, they are indeed all fulfilled. But the open question back then was the orange wish: I want all of that at a lower cost than current histograms, so that I can finally partition histograms at will. That was the wish we had to see come true, and essentially this talk is about checking whether it has.

Partitioning the old classic histograms was something nobody wanted to do, because the histograms were so expensive already that you didn't want to add labels on top and multiply the problem. Now we want to see if this works with native histograms. Direct comparisons are almost impossible, because it misses the point to reconstruct the exact use case of a classic histogram with native histograms: then all the yellow wishes are trivially fulfilled anyway. And if you use native histograms to their full extent, fulfilling all the yellow wishes, then doing the same with classic histograms blows up immediately, because classic histograms are too expensive.

So I went down a more pragmatic path. I took this more or less popular framework from Weaveworks, weaveworks/common, which has a really expensive classic histogram in it, and modeled the same use case with native histograms. I don't know if you know it; at Grafana Labs, we use it as the plumbing for some of our microservices. And our pet here will be the so-called cloud backend gateway, which we use to route requests to the actual backends. The histogram in there is configured this way. This is the classic histogram. It has a humongous number of buckets, 14 buckets, and it has a humongous number of labels, which you usually don't do. It instruments an HTTP server, and it partitions by method, route, and status code. And "ws" probably means WebSocket. I don't know, Bryan knows. It's either true or false. So this is a lot of labels and a lot of buckets, and this is as far as you can take classic histograms; it's already a very expensive histogram vector. (The terminology in client_golang is a "Vec".)

With all this effort, you still only cap your relative error, if you want to estimate quantiles from the histogram, at 42.9%, and that's already a very favorable calculation. You can ask me after the talk for the math behind it. I mean, you grow every bucket by a factor of 2 or 2.5, so it's kind of obvious that the error of quantiles estimated from it is high. Also, this only holds if you're within the bucket range, which goes from 5 milliseconds to 100 seconds. That's a pretty big range, but if you happen to have even slower or faster requests, you don't get this guarantee at all. So for all the effort you have paid here, it's not a very good result.

With native histograms, this is what you do to add them: you set this NativeHistogramBucketFactor, and that's the one thing you have to configure. Here we essentially say we want a native histogram with 10% bucket-to-bucket growth. In reality, the factor is about 1.09, or 2 to the power of 2 to the power of minus 3 if you want it really precisely. That's the math behind native histograms, which is explained in the other talks.
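To write that factor out (this is the general native-histogram schema math; the schema-selection detail is my understanding of the client behavior, not something stated on the slide): bucket boundaries grow by

$$\mathrm{factor}(s) = 2^{2^{-s}}, \qquad \mathrm{factor}(3) = 2^{2^{-3}} = 2^{1/8} \approx 1.0905$$

for schema $s$. Asking for a bucket factor of 1.1 lands on schema 3, because schema 2 would give $2^{1/4} \approx 1.19$, which grows faster than the requested 10%.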
The other two configuration settings here limit the number of populated buckets. One of the main ideas of native histograms is that only populated buckets inflict a cost, but you still might get a lot of them, not an infinite number, but a lot. So in production, you might want to limit the number of buckets actually in use. In this case, I have configured this cloud backend gateway to have at most 100 buckets per histogram, that is, per label combination. If you ever hit those 100 buckets, you just reset the histogram and start over with only empty buckets. That's fine, as you can see in all the other talks, as long as it doesn't happen too often, and that's what the other limit here is for: if you have to reset, don't reset more than once per hour. The last resort is that if you would have to reset more often than once per hour, you start to degrade the resolution instead, which is something we don't want, but we'll see how this works out in practice. A sketch of the whole configuration follows below.
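As a concrete sketch, here is roughly what the setup described so far looks like in client_golang (v1.14 or later). The three NativeHistogram* options are the real HistogramOpts fields; the metric name, label names, and bucket list are my reconstruction of the weaveworks/common histogram described above, so treat them as assumptions:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestDuration mirrors the expensive classic histogram described in the
// talk and additionally enables native histograms.
var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "request_duration_seconds", // assumed name
		Help: "Time (in seconds) spent serving HTTP requests.",
		// The 14 classic buckets, 5ms to 100s, growing by 2x or 2.5x.
		Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 25, 50, 100},
		// Native histogram with at most 10% bucket-to-bucket growth
		// (actual factor 2^(1/8), about 1.0905).
		NativeHistogramBucketFactor: 1.1,
		// At most 100 populated buckets per label combination; on
		// overflow, reset the histogram and start with empty buckets.
		NativeHistogramMaxBucketNumber: 100,
		// If resets would happen more often than once per hour,
		// degrade the resolution instead (the last resort).
		NativeHistogramMinResetDuration: time.Hour,
	},
	[]string{"method", "route", "status_code", "ws"}, // assumed label names
)

func main() {
	// Example observation: a request that took 42ms.
	requestDuration.WithLabelValues("GET", "/api/v1/query", "200", "false").Observe(0.042)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Because both the classic Buckets and the NativeHistogram* options are set, the histogram is exposed in both representations, and the scraper decides which one it wants, which is exactly the dual exposition discussed next.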
OK, so I deployed this to a real cloud backend gateway in our production environment, and now I can ask actual questions about how this works in production.

The first question is how many resources we have to pay additionally to expose those native histograms. The way I configured it here, the histogram is still exposed both ways, and the scraper decides whether it wants to see the classic histogram or the native one. So we have additional cost, of course, because we are exposing the native buckets as well. Here's a graph of RSS and Go heap size. You would think that this step is obviously the resource we have to pay in addition, right? Luckily, that's not when I deployed this. The time I deployed it was here. So essentially, there is no visible resource increase. That's because this cloud backend gateway is doing real heavy lifting, so the telemetry is not taking enough resources for you to see any dent, which is good. In a binary that does nothing but serve metrics, you would see a dent, but the good news is that compared to your actual production payload, this is negligible.

OK, question two: how often do those strategies kick in to limit the buckets? If we reset the histogram all the time, that would be not so good, but it would be even worse if we had to reduce the resolution all the time and never got the high resolution that gives us an order of magnitude better quantile estimations. This boils down to question two and a half: how many buckets are actually populated in the native histograms?

To test this, I set up two Prometheus servers, because on a production Prometheus that just scrapes everything, only a tiny fraction is actually histograms. To really see the effect, one test server scrapes only the classic histograms, and the other only the native histograms. I also drop all the other metrics; I just keep this one super expensive histogram vector, which in this case is called cortex_request_duration_seconds, because we used Cortex in the old days and the metric is still named that way. In reality, this is routing Mimir queries. Mimir is a distributed thing that implements the Prometheus APIs, so this is very meta: we are using Prometheus to monitor something that implements the Prometheus API. But that's a pretty cool test case, because the Prometheus API mostly means users running queries, and those have a broad distribution of latencies. So we are stress testing the native histograms here.

OK, so: the classic-histogram server sees 964 histograms across the 15 backend gateways it monitors, so about 60 histograms per instance. These are the different label combinations. Every histogram has 14 buckets, as we have seen, plus the +Inf bucket, plus count and sum, so every classic histogram creates 17 series. That gives us 16,000-something series, which, if it's the only thing your Prometheus server is doing, is not bad, but for a single metric, that's a lot of series. If you just look at the bucket series, it's times 15.

For the native case, we let the server scrape the same 964 histograms, and they result in 964 series, because that's the other idea behind native histograms: every histogram is just one series. But of course the samples still contain a lot of buckets, and this is now the core question: how many buckets will we see? We are at 10 times the resolution, so as the worst case we would expect 10 times as many buckets, something like 150,000. But how many buckets are really populated? You can all make a guess in your head. In reality, we see 20,000 or 21,000 populated buckets, and that's cool. That's the thing I wanted to see: we increase the resolution by a factor of 10 and still only get about as many buckets as with the classic histograms, because the classic histograms have to account for every bucket, even the unpopulated ones.
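Spelling out that arithmetic with the numbers just given:

$$
\begin{aligned}
\text{classic series} &= 964 \times (14\ \text{buckets} + 1\ \text{+Inf} + \text{count} + \text{sum}) = 964 \times 17 = 16{,}388\\
\text{classic bucket series only} &= 964 \times 15 = 14{,}460\\
\text{native series} &= 964\\
\text{worst-case populated buckets} &\approx 10 \times 14{,}460 \approx 150{,}000\\
\text{actually populated buckets} &\approx 21{,}000
\end{aligned}
$$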
And you can see this in the UI. This is the wall-of-text representation of native histograms that we currently have in Prometheus; there will be nicer graphical representations in the future. This is a typical busy histogram. If you read the fine print, it's a 200 HTTP status code, a very typical endpoint, so it has seen a lot of observations and a lot of different latencies. It has 96 populated buckets, which is very close to the limit of 100. This one here, on the other hand, is an exotic endpoint that is almost never hit; it has seen only one observation. With classic histograms, it would still create all 17 series. Here, it creates just one series, which is also really cheap to handle, and that is where we get all the leverage from.

And this one is interesting: it's a 404 status code. For 404s, I expect a narrower distribution of latencies, because you just serve the 404, nothing found; there are no different payloads and all that. And indeed, although it has a comparable number of observations, it has way fewer buckets, because the distribution is narrower. So we also save on buckets by only representing populated ones. This works out really nicely.

The next question is how often we have to reset, because that one busy histogram was reaching the 100-bucket limit. In fact, we don't reset that often. And what I like: the top 10 most-reset histograms all serve the 200 status code on the Prometheus API query endpoint, where I expect a broad distribution of latencies. So it all makes sense; that is where we see resets at all, and even in those cases it's a handful of resets a day. The worst case I found during my two-week experiment saw eight resets per day. And this is the graph of its count: you see it going up, and then you see the resets. Apparently, there was a lot of traffic at this moment, with a lot of different latencies observed, so the histogram had to reset quite often.

But rarely would you hit the one-hour limit where you have to reduce the resolution. And that's the key question: how often did that happen? Occasionally; it happened just a couple of times during the experiment. The problem is that there's a bug in the Go instrumentation library where it essentially never switches back to the higher resolution, although it could. I discovered this while analyzing the data for this talk. Conference-driven development at its best. You can look up the issue; it's a bit complicated to explain why it happens, but once you see it, it totally makes sense, and I just have to fix it. Bartek, pull request incoming. (He maintains that library.)

OK, so this all works out the way I wanted. Next question: when I scrape this, how many resources do I need on the Prometheus side? This is again the list from before, but now I have added the RSS of both Prometheus servers, the classic one and the native one. RSS is almost the same, because I only scrape this one metric; most of it is just the baseline of Prometheus doing its thing, so we can't see much here. So I went down to the basement of this Prometheus and looked at the storage it uses on disk. Prometheus uses the same storage layout on disk and in RAM, so this gives me an idea of how much RAM and disk storage the actual histogram data takes. And here we see: on the classic side, we have fewer buckets to represent; on the native side, many more, but only the populated ones that actually have data, and still we need only half the storage. If you look only at the blocks, which excludes the WAL and whatever else Prometheus might store, this factor of two is even more pronounced. So the bottom line here is that you get 10 times the resolution for half the price. Good.

OK, but maybe the queries are now way more expensive, because Prometheus has to dig into all those advanced data structures and churn through all the data. This is the most popular query you run on a histogram: you calculate a quantile. The graphs on the left and the right look different because the right one is with native histograms, and it's so much more precise. The right one is the truth, or at least closer to the truth. There's a bucket boundary in the classic histogram at 100 milliseconds, and you can kind of see it: there's a step where you hit that boundary. So that's cool. Also, this query aggregates everything into one result, so it doesn't really make sense as a query, but it does churn through all the data, and in that case, the native histograms are even a bit faster. So: 10 times the resolution, faster query speed, cool. This is a more typical query, where you look at one specific endpoint with one specific status code. It's the endpoint serving those broad latency distributions, the PromQL query endpoint, so it's stress testing the native histograms again, and here they are a bit slower in execution time, but it's more or less in the noise. So essentially, we get all of this at a reasonable query performance.
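For reference, here is a sketch of what those two quantile queries look like, one against each test server, run via the client_golang API client. The PromQL shapes are the standard ones for classic and native histograms; the exact metric name and the 0.95 quantile are my assumptions, not taken from the slides:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

const (
	// Classic histograms: quantiles are estimated from the _bucket
	// series, which must be aggregated by the le label.
	classicQuantile = `histogram_quantile(0.95, sum by (le) (rate(cortex_request_duration_seconds_bucket[5m])))`
	// Native histograms: one series per histogram, so no _bucket suffix
	// and no le label are needed.
	nativeQuantile = `histogram_quantile(0.95, sum(rate(cortex_request_duration_seconds[5m])))`
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	for _, q := range []string{classicQuantile, nativeQuantile} {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			panic(err)
		}
		if len(warnings) > 0 {
			fmt.Println("warnings:", warnings)
		}
		fmt.Printf("%s\n=> %v\n", q, result)
	}
}
```

Note the difference: the classic query needs the _bucket series and the aggregation by le, while the native query operates on the one histogram series directly.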
Now, the other important thing you want to do with histograms is heat maps. And heat maps with classic histograms are kind of a joke, right? Because you have those 10-ish buckets. I like to say the left one is Hubble and the right one is James Webb, or something like that. This is just amazing, right? This increase in resolution, where you can see it graphically, as a human; this is what we want. This band, for example: the smallest bucket boundary in the classic histogram is 5 milliseconds, and there are actually a lot of requests faster than 5 milliseconds. They all end up in this band. On the right, the 5-millisecond line is approximately here, and you can see those exponential buckets nicely. And you see there is a whole story going on inside this 5-millisecond band, and it's all resolved. It's beautiful. To be fair, this is a very long query: it's a heat map over the whole experiment duration of 15 days, and it aggregates all the 2xx queries, so it takes a long time. In practice, you would probably want a recording rule. It takes about twice as long for the native histograms, but this is really heavy lifting, and you get a lot of reward.

OK, I think we are done here. This is everything in production. If you want to know the background, as I said, watch the other talks. But the good news is: it works, it's beautiful, and it performs nicely. Thank you.