Hello, everyone. I'm Ganesh Vernekar. I'm one of the Prometheus team members, and I'm also a maintainer of the Prometheus TSDB. And I'm Dieter Plaetinck. I've been working on monitoring systems for about 10 years, but I've never actually contributed to Prometheus before. So this is kind of my first time working with the TSDB, and I'm really enjoying it so far. And I really enjoy histograms, so when I had the chance to work on this project, I was very excited.

Our colleague Björn Rabenstein has already given many presentations about histograms: how they're currently implemented in Prometheus, what some of the shortcomings are, and his ideas for better histograms. But first, we should probably do a quick recap: what is a histogram anyway? Basically, a histogram is a way to categorize your numeric observations into ranges. This is very useful for looking at distributions of data, for example for latency metrics. In this example here on the left, you can see that there were three observations that were smaller than or equal to 0.25, then there were six observations between 0.25 and 0.5, and so forth. So this is a really useful way to get a good understanding of the distribution of your data, to calculate percentiles, and so on.

The way this is currently implemented in Prometheus is that you have a separate series for every single bucket. You can see here that you have your different buckets; they all have a label declaring one of the bounds, and of course they hold the number of observations in each bucket. Then you also have some additional series, for example a sum series with the sum of all the observations, and a count series which counts all the observations.

There are some shortcomings here that we intend to improve on. But first of all, whatever currently works should keep working; that's kind of obvious. The second point, and this is one of the bigger problems with the current implementation of histograms, I would say, is that you have to manually define your buckets. You basically have to guess what the values will look like and then hope that your data won't go out of bounds or that you won't lose precision. It's a clunky method. So in the new version, we would rather do away with that completely and just automatically come up with all the right bucket sizes. Third, we want correct aggregation, both across time and across labels. Especially when you have histograms with different bucket layouts, those layouts should be chosen so that they are compatible and mergeable, so that you can aggregate them correctly without losing any data quality. And then, of course, you need very accurate estimations: a low error rate, so that you can compute correct quantiles and so on. Finally, we believe that if we can lower the cost of histograms, we can make partitioning much more feasible, because I suspect that right now many histogram users don't partition as much as they could. For example, partitioning by an HTTP status code label or by a route or path label is currently not so common, but if we make histograms much cheaper, then you can partition as much as you want.

There is a big design doc that Björn wrote. It covers the design in depth and is a very interesting read, so if you want to understand it more, you should look at the doc, because in this presentation we can only really cover a small part of the design.
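To make the recap concrete, here is a minimal sketch of the conventional instrumentation described above, using prometheus/client_golang; the metric name, boundaries, and handler are invented for the example:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A conventional histogram: the bucket boundaries are chosen up front, and
// each bucket is exposed as its own series with an "le" label, alongside the
// additional _sum and _count series.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "HTTP request latency.",
	Buckets: []float64{0.25, 0.5, 1, 2.5, 5}, // manually guessed boundaries
})

func main() {
	prometheus.MustRegister(requestDuration)

	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// ... handle the request ...
		requestDuration.Observe(time.Since(start).Seconds())
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

If observations later fall outside the guessed boundaries, the only fix is to change the bucket list and redeploy, which is exactly the clunkiness the new design removes.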
So we recently had a hackathon at Grafana Labs, and Björn, Dieter, and I thought it would be cool to turn this design doc into a working prototype. Before we jump into the prototype, let's see why high-resolution histograms are so useful. By the way, the heat map that you're seeing right now is reading its data from our prototype itself. So let's see what we can figure out from this heat map. Initially, the requests sit in some kind of narrow band of latencies. Then, after some time, the latency stabilizes and you can see there are a lot of requests in some buckets, which show up as red. After some time a new canary was tested, and you can see the latency dropped, but it became less predictable because it was spread across a big range. You can clearly see when this new canary was deployed into prod and all the latencies just dropped and spread across a big range. After some time, a part of the system started behaving weirdly and the latency shot up; you can see two bands of slowness there. It could be some heavy operation or some cold cache. And you can see exactly when the issues started and when the issues stopped. After the issue had stopped, the latency just kept on increasing and increasing until it became stable within a band. So this is the kind of visualization that we would like to get from high-resolution histograms, which is not feasible right now because buckets are expensive.

In this talk we are going to cover three things: how we expose these high-resolution histograms and how we scrape them, how we can encode these histograms efficiently in the TSDB, and a benchmark of the space taken by these new histograms in the TSDB.

As far as the instrumentation goes, this is based on the current implementation, and in fact, if you look at the orange line here, this exposes the conventional histogram using the current method. It gets a little bit more interesting when we look at these three items here. The sparse buckets factor basically defines the precision of your histogram. It is a growth factor: whenever you go from one bucket to the next, it describes how much the bucket grows, so that as you keep adding more buckets, they also keep growing and growing. And because we automatically allocate buckets, however many are needed to accommodate your data, you probably want to set some kind of limit, because otherwise, in theory, the number of buckets can just grow infinitely. In this case, we set a maximum of 150, and we implement that limit in two ways. The first way is that we can reset: here we declare that we may reset up to once per hour, and what a reset means is that you start a new chunk, get rid of all the buckets that were at some point used by some data, and keep only the buckets that are needed for the current data. Hopefully, and commonly, that will be enough to get you under the limit. If that's not sufficient to reach your defined limit, then we have a second solution, which is to start decreasing the precision. So that's something to keep in mind: in certain cases, when you hit your limit, the buckets will automatically start growing and your precision will go down a little bit.
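Just to make those three knobs concrete, here is a small, purely illustrative sketch. The struct and field names are invented for this example and are not the actual API of the experimental client_golang branch:

```go
package main

import (
	"fmt"
	"time"
)

// sparseHistogramOpts is a hypothetical stand-in for the three instrumentation
// knobs described above; real field names on the experimental branch differ.
type sparseHistogramOpts struct {
	// Growth factor from one bucket boundary to the next, and the only
	// precision knob: 1.1 means each bucket's upper bound is 10% larger
	// than the previous one.
	BucketGrowthFactor float64

	// Upper limit on how many buckets may be in use at once; without it,
	// the number of buckets could in theory grow without bound.
	MaxBuckets int

	// How often a reset is allowed once the limit is hit: a new chunk is
	// cut and buckets that no longer hold current data are dropped. If
	// that still isn't enough, precision is reduced (buckets widen).
	MinResetInterval time.Duration
}

func main() {
	opts := sparseHistogramOpts{
		BucketGrowthFactor: 1.1,
		MaxBuckets:         150,
		MinResetInterval:   time.Hour,
	}
	fmt.Printf("%+v\n", opts)
}
```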
So, the encoding: how all of this data, after it gets scraped, actually gets stored in histogram chunks. By the way, this is a little bit simplified; I left out a bunch of other details, but basically there are two main parts. At the beginning of the chunk you have your metadata, which describes the exact shape of your histogram, and after that come the individual histogram samples.

In the metadata there are two key entries. One is the schema, which is simply a number that describes the growth factor; the growth factor of 1.1 that we saw earlier basically gets encoded as a simple integer. The other declares which buckets are actually being used, because in theory you could have an infinite number of buckets: some of them hold data, and some of them need to be skipped because they don't hold any data. We implement this quite simply as a list that says, for example, skip 10 buckets that are not used, then there are 20 buckets that are used, then this many that are not used, then that many that are used again, and so forth. This describes the entire range of buckets in use, and it is also why we call them sparse histograms: in this infinite space of buckets, if you only use a subset of them, this is a very efficient way to avoid the overhead of all the buckets that are not used. I also just want to point out that with just these two items in the metadata, you get an exact description of what the histogram looks like.

As far as the actual data goes, this is inspired by the current XOR encoding of simple time series; there are just some more fields, and the buckets are a variable-length field. For the first histogram sample that comes in, we simply store all the fields raw, as integers, floats, and a sequence of integers. For the second histogram, we store deltas. From the third histogram onward, we store delta-of-delta everywhere. The one exception is the sum field, the sum of all the observations: because that is a floating-point number, we use the XOR encoding from the Gorilla paper, just like the standard XOR encoding of time series. All of this data gets written into a bit stream, which of course gets serialized as a chunk of bytes. So this contains the full metadata describing the histogram format and then all the histogram samples.
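To make the "skip this many, use that many" idea a bit more tangible, here is a toy expansion of such a list into absolute bucket indices. This is my own illustration; the real chunk format is a bit-level encoding with more fields than shown here:

```go
package main

import "fmt"

// span is a toy version of the metadata entry described above: skip a number
// of unused bucket indices, then the next `length` consecutive indices hold
// data. A list of spans pins down exactly which buckets of the conceptually
// infinite bucket space are in use.
type span struct {
	skip   int // unused buckets to jump over before this run
	length int // number of consecutive used buckets in this run
}

// usedBuckets expands a span list into absolute bucket indices.
func usedBuckets(spans []span) []int {
	var idx []int
	cur := 0
	for _, s := range spans {
		cur += s.skip
		for i := 0; i < s.length; i++ {
			idx = append(idx, cur)
			cur++
		}
	}
	return idx
}

func main() {
	// "skip 10 unused buckets, then 3 in use, skip 5 more, then 2 in use"
	fmt.Println(usedBuckets([]span{{skip: 10, length: 3}, {skip: 5, length: 2}}))
	// -> [10 11 12 18 19]
}
```

The delta and delta-of-delta encoding of the samples then only has to store values for these used indices, which is what keeps the chunks small even with a huge space of potential buckets.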
Talking about our test setup: we instrumented the Cortex gateway on our Grafana Cloud dev clusters with both conventional and high-resolution histograms. The Cortex gateway is the component that sits in front of our Cortex clusters, and all the read and write traffic goes through it. We set up two Prometheus servers, one of them scraping the conventional histograms and one of them scraping only the high-resolution histograms, and we compared the storage at saturation, which means all the buckets that had to be filled were filled and there was less sparseness. This is what the data looked like. For the conventional histograms we had 14 fixed buckets, which means 14 time series in the TSDB, one per bucket. And as you just saw, we have some additional series for each histogram: one of them is the +Inf bucket, one is the sum series, and one is the count series. Add those three, and you have 17 TSDB series per histogram. For the sparse high-resolution histograms, the number of buckets varies; in our data, we saw it vary between 1 bucket and 128 buckets, and it is dynamic.

And because all of these buckets are stored in the same chunk, we only need one time series in the TSDB to represent a histogram and all of its buckets. So we have two data blocks under observation here. One of them scraped data for 18 hours and had 249 histograms, and you can see the number of series: it is a multiple of 17 for the conventional histograms and a multiple of 1 for the sparse ones. Similarly, block B spans two hours and has roughly half as many histograms; though both of them scrape the same thing, block A has nearly double the number of histograms because there was a rollout, so a few labels changed and you get more histograms.

In the next slide, we will see what these dynamic buckets look like. This is a snapshot of a few of the series from the sparse-histogram block, and you can see that the number of buckets varies hugely. For example, for querying, a query can be expensive or cheap, so query latency can span a huge range and you have to cover many buckets. But if you take something like a gateway timeout, it is mostly set at a fixed time, so you will always observe roughly that fixed time for that particular histogram; it will always fall into a single bucket, so you won't really need more than one bucket for that histogram. This shows that partitioning histograms into multiple histograms with different labels is not going to be expensive, because the buckets are dynamic here.

Now let's look at our benchmark results. Yeah, this is pretty interesting; it blew our minds as well. In the first column, you see the reduction in index size: the reduction is 94% and 93% for the two blocks. If you look at it carefully, it is a 17x reduction, or a little more than 17x, and 17 was the ratio between the number of series required for conventional and sparse histograms. It is a little more than 17x because the index stores the series labels, and for the sparse histograms you do not have to store the le label, so that is where the additional reduction beyond 17x comes from. And if you look at the chunks: even though we have more buckets in these sparse histograms, you still get an efficient encoding, so there is a 43% and 48% reduction in the size taken by the chunks themselves. If you combine both, the index and the chunks, the overall reduction is around 48% to 60%, and it can vary based on how long your data is. This also means the memory taken by the head block will be a little lower, because it uses the same encoding as the blocks, and the number of series usually takes a lot of space in memory, which is going to shrink with a single series per histogram. So that is where the overall savings in memory and disk space come from.
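As a quick back-of-the-envelope check on that index number:

```go
package main

import "fmt"

func main() {
	// Conventional histogram in our setup: 14 explicit buckets, plus the
	// +Inf bucket, plus the _sum and _count series, each its own series.
	conventional := 14 + 1 + 1 + 1 // 17 series per histogram

	// Sparse histogram: a single series, no matter how many buckets it uses.
	sparse := 1

	reduction := 100 * (1 - float64(sparse)/float64(conventional))
	fmt.Printf("%d vs %d series per histogram, about %.0f%% fewer\n",
		conventional, sparse, reduction)
	// About 94%, close to the measured 93-94%; in practice the index shrinks
	// slightly more than 17x because the sparse series also drop the "le"
	// label from the index.
}
```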
So let's recap our conclusions so far. There is very little configuration: as we saw in the instrumentation section, you basically have to define a precision, but even that we can just default to something sane, like 1.1. And you probably want to specify a limit, but hopefully you will never reach it, just like in our experiments, where we reached up to 128 buckets but never hit our limit of 150. So even if you don't specify anything, you basically have zero config and it should just work out of the box. And with that growth factor of 1.1, we also observe significantly more precise bucket boundaries: we see a difference in precision of about an order of magnitude compared to manual bucket assignment.

I also want to point out that if your observations go well beyond the range you originally thought your data might lie in, you still maintain that same precision. With the current, conventional implementation of histograms, hopefully you accounted for the range of your data and configured your buckets properly, because if your data goes beyond that range, your relative error starts going up significantly; not so with the new histograms.

They are also fully mergeable and aggregatable. We didn't go into much detail on that, so you will have to check the design doc if you want to understand more about how it works, but basically it is about how histograms with different bucket layouts are kept compatible so that you can merge them.

And we saw that there was nearly a halving of the storage requirement and more than a 90% reduction of the index in our tests; depending on how many buckets you have chosen for your existing conventional histograms, you will see a similar reduction in the index. And this sparseness makes partitioning very efficient. In the histograms we saw, the bucket counts varied from 1 to 128, so when you partition more, the cost does not grow linearly, as it does for conventional histograms where the buckets are fixed; you basically get sub-linear growth because the buckets are dynamic.

There are two limitations that I think are worth pointing out. The first one is the limit that we already mentioned earlier: if you have a lot of variability in your data, you probably want to set some kind of limit so that you control the number of buckets you use, and as I pointed out earlier, we first try to cut a new chunk, which usually works, but in certain cases that might not be sufficient, and then the precision will automatically get lowered. We currently don't have a way to really expose that, but this is something to think about in the future, to maybe make it a little more visible in the UI. And finally, we currently don't have custom bucket boundaries. So if you want to answer questions such as what percentage of requests took up to exactly 250 milliseconds, where you want to set very specific, human-friendly boundaries like that, that is currently not supported, but I think Björn has some ideas on how to make that work.

The work that we have done so far only covers scraping, ingesting into the TSDB, and retrieving the raw histograms at the TSDB layer, and there is a lot of future work to be done. The main one is PromQL support, so that PromQL can natively work with these sparse histograms and so that we can also create sparse histograms in recording rules. The heat map that you saw needed some hacks to get working; it did not use any native PromQL support. Next, because one of our goals was that whatever is working right now should keep working, we need a compatibility layer to bridge the conventional and the sparse histograms so that they can work together. And we want to play with more data and determine the query cost: because there is a big reduction in the index, we think there will be a reduction in lookup cost for these histograms, but that is just a hypothesis, and we need to play with a lot more data to determine the query cost.
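Coming back to the mergeability point for a moment, here is a purely illustrative toy. It assumes, as the design doc describes, that a coarser bucket layout's boundaries are a subset of a finer layout's, in which case merging down to the coarser layout is just a matter of summing adjacent bucket counts:

```go
package main

import (
	"fmt"
	"math"
)

// mergeDown halves the resolution of a sparse set of bucket counts, assuming
// each coarse bucket spans exactly two consecutive fine buckets (the nested
// layouts hinted at above). Keys are bucket indices in the conceptually
// infinite bucket space, so the map stays sparse.
func mergeDown(fine map[int]uint64) map[int]uint64 {
	coarse := make(map[int]uint64)
	for i, c := range fine {
		coarse[int(math.Ceil(float64(i)/2))] += c
	}
	return coarse
}

func main() {
	fine := map[int]uint64{3: 5, 4: 7, 9: 1}
	fmt.Println(mergeDown(fine)) // map[2:12 5:1]
}
```

Because the layouts nest, no observation can land in the wrong coarse bucket, which is what makes aggregating differently configured histograms safe.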
So that is it. If you are interested in the code, you can check the sparse histogram branches of the Prometheus and client_golang repositories; that is all of our experimental code so far, and I emphasize experimental. We would love to hear what you think, so both Ganesh's and my Twitter handles are listed here. Please let us know what you think, and I hope everyone has a great rest of the conference. Thank you very much. Thank you.