Next we have Ruslan and Ganesh. It's another one of those talks which I alluded to in the opening and in the Prometheus updates, about how, in particular, between OpenTelemetry and Prometheus, we tried really hard and worked very hard to make certain that you actually have a fully compatible native histogram implementation. And this is about how to actually use OpenTelemetry's native histograms with Prometheus. Round of applause.

Hello, everyone. Welcome to our talk on using OpenTelemetry's exponential histograms in Prometheus. Before we begin, let me introduce myself and my co-presenter Ganesh. Ganesh is a senior software engineer at Grafana Labs. He's also a Prometheus team member and a maintainer of the Prometheus TSDB. My name is Ruslan. I am a member of the OpenTelemetry community, and at Grafana Labs I'm working on data ingestion into Grafana Cloud. So, Ganesh.

Yeah. So before we talk about exponential histograms, let's go over a little bit of the basics. A histogram is a distribution of your observations, where we put the observations into something called buckets. In this particular example, there are four buckets. The first one says there are 15 observations which took less than 0.1 seconds. The x-axis shows you the bucket boundaries, and the y-axis is the count in each of those buckets. And this is how it is stored in Prometheus right now. As Björn mentioned, classic histograms are quite expensive: every bucket takes one time series in Prometheus. On the right side, each line is a time series in Prometheus. And in addition to a time series for every bucket, you have a time series for the count and a time series for the sum. So to represent a single histogram, you need a lot of time series. There are a few problems with this design. One thing is that the buckets are cumulative.
For example, if one bucket is filled, all the buckets which come after it need to account for that count. So for the same histogram that we saw here, if we take a different representation of the buckets, a lot of buckets repeat the same values, because they're cumulative. So it can get expensive. And let's say the bucket layout was inefficient, and you thought of changing it. Now you have to re-instrument your code and redeploy it everywhere. And that is only one part of the problem. Let's say you deployed a different bucket layout later: you cannot correlate the two easily. You had one bucket layout, now you have a different one, and you cannot mix and match them once you have changed the buckets. Also, it takes a while to propagate these changes across your system, so you have to wait a long time until everything has the same bucket layout. And again, coming back to the same thing, it takes a lot of time series.

Now come the exponential buckets. This is the basis for the exponential histograms that we are going to talk about, and I'm describing the exponential buckets in a neutral fashion that applies to both OpenTelemetry and Prometheus. This is how the exponential bucket boundaries look. The bucket boundaries are fixed, but what you can control is the resolution of the buckets. I'll explain with an example, using the formula on the right side, which is base = 2^(2^−scale). For example, you saw Björn using a factor of 2^(2^−3); we get that from this formula. We'll start with the simple case of 2^(2^0), which gives you a factor of 2. We start all the bases from the bucket boundary of 1. To get the next bucket boundary, you just multiply by the factor, and you get 2. Multiply by the same factor again, and you get 4. So the ratio between consecutive bucket boundaries is constant.
And to get the bucket boundaries on the left side of 1, you just divide by 2, and divide by 2 again, and so on. Now let's say this resolution was too high for you, and you want to reduce it. You just change the scale to, let's say, minus 1, so you get 2^(2^1), which is 4. The bucket boundaries are now 1, 4, 16, and so on, and similarly on the other side. Reduce the scale further and the factor becomes 2^4, then 2^8, and so on; this is how you reduce the resolution in that direction. And on the other side, if you want to increase the resolution, I'll again start from the factor of 2. You have bucket boundaries 1 and 2. Now you go to the immediately next resolution by setting the scale to 1, so the factor is 2^(2^−1), which is the square root of 2. One good thing here is that the bucket boundaries of factor 2 remain the same in the higher resolution: you still have 1 and 2, and you get a new bucket boundary between the existing ones. And if you increase the resolution again, to a factor of 2^(1/4), the bucket boundaries of the previous resolution again stay constant. I have color-coded them: green is the same across all the resolutions, similarly the yellow and orange. So with every increase in resolution, you are adding a new bucket boundary between existing bucket boundaries.

The property that bucket boundaries carry over from the previous resolution comes in handy here. Let's say you had a resolution with a factor of 2, which is the first bar chart here, so the boundaries are 1, 2, 4, 8, and 16. You can at any time convert it to a lower resolution. The one in the middle has a factor of 4, so you can just add up the buckets of the higher resolution to get the histogram at the lower resolution. And if you want to reduce the resolution again, you add up the buckets again.
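The bucket math above can be sketched in a few lines of Python (this is an editor's illustration, not code from the talk; it shows the boundary formula and the merging of adjacent buckets when halving the resolution):

```python
def bucket_boundary(index: int, scale: int) -> float:
    """Bucket boundary number `index` for a given scale.

    The growth factor between consecutive boundaries is
    base = 2 ** (2 ** -scale), and boundary(index) = base ** index,
    so index 0 always yields the boundary at 1.
    """
    return 2.0 ** (index * 2.0 ** -scale)

def halve_resolution(counts):
    """Merge adjacent bucket pairs, as when adding up higher-resolution
    buckets to get the lower-resolution histogram (assumes the first
    count is aligned with a boundary shared by both resolutions)."""
    return [counts[i] + (counts[i + 1] if i + 1 < len(counts) else 0)
            for i in range(0, len(counts), 2)]

# scale 0 -> factor 2: boundaries 1, 2, 4, 8, ...
assert [bucket_boundary(i, 0) for i in range(4)] == [1.0, 2.0, 4.0, 8.0]
# scale -1 -> factor 4: boundaries 1, 4, 16, ...
assert [bucket_boundary(i, -1) for i in range(3)] == [1.0, 4.0, 16.0]
# scale 1 -> factor sqrt(2): every second boundary matches scale 0
assert bucket_boundary(2, 1) == bucket_boundary(1, 0)
# merging pairs of scale-0 buckets gives the scale -1 counts
assert halve_resolution([15, 5, 8, 2]) == [20, 10]
```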
So you can always convert a histogram from a higher resolution to a lower resolution. This is one of the ways the changing bucket resolution doesn't really harm correlating different histograms: if you had a histogram at a higher resolution and another at a lower resolution, you just convert the higher-resolution one down, and then you can mix and match and do stuff with it. So that's the basics of the bucket boundaries. Now Ruslan is going to talk about how this is stored in OpenTelemetry.

So let's talk about how those concepts are represented in OpenTelemetry exponential histograms. Here is a simplified overview of an exponential histogram data point, with some fields omitted for the sake of simplicity. Let's briefly discuss each field, starting with the scale. Ganesh already mentioned that there is a parameter called scale that determines the resolution of exponential histograms, and he also mentioned the bucket boundaries. But let's add some additional information and talk about bucket indexing. Exponential histogram buckets can also be accessed by index. It is really important to note that there is a zero-index bucket, and it corresponds to the bucket with a lower boundary of 1. We will need this information later. The following fields are sum and count. The count is the total population of observations in the histogram, and the sum is the sum of all observation values. The optional min and max fields record the minimum and maximum observation values in the histogram. The start time denotes the start of observation collection. Zero threshold and zero count define the zero bucket: the zero bucket contains the count of observations whose absolute values are less than or equal to the zero threshold. The zero threshold is an arbitrary value, and it is not related to the scale. The last field is the positive buckets.
Please note that there are also negative buckets, stored separately, but for the sake of simplicity we will focus only on the positive buckets. The exponential histogram uses a dense representation of buckets: a bucket range is represented by a single offset value plus the count values for those buckets. Let's take a look at an example and see how the buckets of this histogram are encoded. Here is a histogram that has a scale of zero. It has the following bucket ranges, and as you can see, there are some counts in those buckets. The buckets are encoded as follows: we store all the counts in a bucket counts array, including the empty buckets. The offset is set to minus 1, because the zero-index bucket points to the bucket with the range 1 to 2, and there is a non-empty bucket to the left of it, so we have to shift the offset by 1 to the left. Yes, that's it. And Ganesh is going to talk about Prometheus native histograms now.

So the Prometheus native histogram looks very similar to OpenTelemetry's. We have a schema, which is the same as the scale; only the name is different, but it's actually the same thing as the scale in OpenTelemetry. And the other fields, like count, sum, zero threshold, and zero count, are exactly the same as in OpenTelemetry, so there is nothing special here. There is one fundamental difference. As Ruslan mentioned, when you specify a particular scale, you know what the bucket boundaries are, so you no longer need to spell out a bucket's boundaries; you can just give an index and derive the boundaries of the bucket from it. In OpenTelemetry, the zero index is the bucket with the lower boundary of 1. But in Prometheus, it is shifted by 1: the bucket with the upper boundary of 1 is the bucket with index zero. That's the one fundamental difference in indexing the buckets.
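As a small sketch (the names here are mine for illustration, not the OTel API), the dense offset-plus-counts encoding and the one-bucket indexing shift can be made concrete like this:

```python
def otel_bucket_range(offset: int, i: int, scale: int = 0):
    """(lower, upper] boundaries of the i-th stored bucket count.

    In OTel, bucket index 0 has a lower boundary of 1; the stored
    counts start at `offset`, so the i-th count belongs to index
    offset + i.
    """
    base = 2.0 ** (2.0 ** -scale)
    index = offset + i
    return (base ** index, base ** (index + 1))

# The scale-0 example from the talk: offset -1 shifts the first stored
# count into the (0.5, 1] bucket, one to the left of the (1, 2] bucket.
assert otel_bucket_range(-1, 0) == (0.5, 1.0)
assert otel_bucket_range(-1, 1) == (1.0, 2.0)

# Prometheus indexing is shifted by one: index 0 is the bucket with an
# UPPER boundary of 1, so the same bucket gets index otel_index + 1.
def otel_to_prom_index(otel_index: int) -> int:
    return otel_index + 1

assert otel_to_prom_index(-1) == 0  # the (0.5, 1] bucket
```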
And here comes the main difference in the representation of histograms between OpenTelemetry and Prometheus: we have something called spans and deltas. And, similar to what we just did for OpenTelemetry, I'm only going to talk about the positive buckets for positive observations. There are also negative spans and negative deltas, a mirror image of the positive buckets, stored separately; but for simplicity we will only talk about the positive side. I'll take the same example and explain how the spans and deltas are stored. Prometheus uses a sparse representation of histograms, as opposed to the dense representation of OpenTelemetry. In this particular case, the spans field is a list of tuples. The (0, 4) tells you that the first bucket starts at index zero, which is the (0.5, 1] bucket, and that there are four buckets starting at that point. Then there is a gap of two buckets, so the second tuple says there is an offset, a gap of two buckets, and one more bucket stored there: (2, 1). So the spans tell you the layout of the buckets. And then there are the deltas. The deltas store the delta-encoded values of the buckets that are filled. Because there are five filled buckets here, there are only five numbers in the deltas, and each delta is taken against the previous bucket's count. So let's say you had 10 empty buckets between the fourth and the fifth filled bucket; in that case the deltas won't change, only the spans, which encode the layout. For example, it would be (0, 4) and (10, 1), saying that there are 10 empty buckets between the first four and the last bucket. So that is how it is stored in Prometheus; that's the fundamental difference. Now that we have this context set, Ruslan is going to talk about the translation.

So at this point, you should have a good overview of the OpenTelemetry exponential and Prometheus native histogram structures.
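To make the spans-and-deltas layout concrete, here is a small decoder (an editor's sketch, not Prometheus's actual code; it assumes, as described above, that the first span's offset is the absolute starting index and later offsets are gaps after the previous span):

```python
def decode_native_buckets(spans, deltas):
    """Expand spans + deltas into a list of (bucket index, absolute count)."""
    buckets = []
    idx = 0
    count = 0
    deltas_iter = iter(deltas)
    for n, (offset, length) in enumerate(spans):
        # first span: absolute start index; later spans: gap after the
        # previous span's last bucket
        idx = offset if n == 0 else idx + offset
        for _ in range(length):
            count += next(deltas_iter)  # each delta is vs. the previous bucket
            buckets.append((idx, count))
            idx += 1
    return buckets

# The example from the talk: spans (0, 4) and (2, 1), five filled buckets.
assert decode_native_buckets([(0, 4), (2, 1)], [2, -1, 2, -1, 1]) == \
    [(0, 2), (1, 1), (2, 3), (3, 2), (6, 3)]
# A wider gap changes only the spans, never the deltas.
assert decode_native_buckets([(0, 4), (10, 1)], [2, -1, 2, -1, 1])[-1] == (14, 3)
```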
So let's now explore how we can translate an exponential histogram to a native histogram. We'll start with the exponential histogram's scale and the native histogram's schema. The native histogram schema allows values in the range from minus 4 to 8, inclusive. So if the exponential scale has a value greater than 8, we can downscale the histogram and merge the buckets, as Ganesh explained before. If the scale is less than minus 4, the histogram data points are dropped. It's worth mentioning that a scale of less than minus 4 results in really wide bucket ranges, and the practical use of such wide ranges is questionable. The next fields are count and sum: the count and sum of exponential and native histograms are translated directly. Prometheus doesn't have min and max fields, so how do we handle the translation of those? We don't. You can use the PromQL histogram_quantile function to approximate the min and max values from the observations. Start time: the native histogram doesn't have a relevant field, so we don't use the start time in the translation. The zero bucket fields, zero threshold and zero count, are translated directly. And we are left with the positive buckets. What we are essentially trying to do here is convert the dense bucket layout of the exponential histogram into the sparse bucket layout of the native histogram. So let's walk through the translation steps, looking at the example histograms we are already familiar with; we already know how these buckets are represented by the two types of histograms. First, the contiguous non-empty buckets of the exponential histogram are encoded by a span of length 4, and the absolute count values of the exponential histogram are encoded using delta encoding.
So we see that the absolute values are 2, 1, 3, 2, and the delta encoding turns them into 2, −1, 2, −1. We also know that there is a difference in the zero-index buckets, and that the native histogram's index is shifted by 1, so we have to adjust the offset values of the native histogram spans. We only have to do this for the first span, because all the subsequent spans are created relative to the preceding spans. We saw that the exponential histogram's offset was minus 1, and for the first span we get an offset of zero. The gap of empty buckets is then encoded as the offset of the next span. We are left with one non-empty bucket that has an absolute count of 3; it is encoded with a span that has an offset of 2 and a length of 1, and the delta value for this bucket is 1. So at this point, we have translated the exponential histogram to a native histogram.

That's the theoretical part of the talk. Now let's take a look at a hypothetical system setup that uses exponential histograms. In this setup, we have an OTel-instrumented application that produces exponential histograms, and an OTel Collector receives the OTLP payload carrying them. Currently, the translation from exponential to native histograms is implemented only in the OTel Collector, in the Prometheus remote write exporter. Thus, we have to enable the remote write receiver on the Prometheus server, and we additionally have to enable the native histogram feature flag, since it is still a beta feature in Prometheus. Besides that, we don't have to add any other configuration to the OTel Collector, because the Prometheus remote write exporter translates the histograms automatically. If you want to learn more about exponential and native histograms, you can take a look at the OTEP on adding exponential bucketing to the histogram protobuf, or the design doc on sparse high-resolution histograms for Prometheus from Björn. Thank you.
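The positive-bucket translation walked through above can be sketched in Python (an editor's illustration, not the collector's actual exporter code; it assumes the scale already fits the schema range and ignores the negative buckets):

```python
def otel_buckets_to_native(offset, bucket_counts):
    """Turn OTel dense positive buckets (offset + counts, zeros included)
    into Prometheus-style spans and deltas."""
    spans, deltas = [], []
    prev_count = 0
    prev_index = None  # Prometheus index of the previous non-empty bucket
    for i, count in enumerate(bucket_counts):
        if count == 0:
            continue  # empty buckets become gaps between spans
        # shift by one: OTel index 0 is the bucket with lower boundary 1,
        # which is index 1 in Prometheus
        prom_index = offset + i + 1
        if prev_index is None:
            spans.append([prom_index, 1])      # first span: absolute start
        elif prom_index == prev_index + 1:
            spans[-1][1] += 1                  # contiguous: extend the span
        else:
            spans.append([prom_index - prev_index - 1, 1])  # gap: new span
        deltas.append(count - prev_count)      # delta vs. previous bucket
        prev_count, prev_index = count, prom_index
    return spans, deltas

# The talk's example: offset -1, with two empty buckets before the last one.
spans, deltas = otel_buckets_to_native(-1, [2, 1, 3, 2, 0, 0, 3])
assert spans == [[0, 4], [2, 1]]
assert deltas == [2, -1, 2, -1, 1]
```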
Thank you for your attention. If you want the slides, you can scan this code, and if you want to talk to us, you can find us at booth number 36, I think starting from tomorrow. Thank you.

As I am the one running around for questions, I get to ask a question. Why isn't it called the same thing? Why is one called exponential histograms and the other native histograms? I can speak for Prometheus; Björn can correct me if I'm wrong. In Prometheus, like I showed before, the old way was a hack to represent histograms as time series: we hacked together multiple time series to represent a particular histogram. But now, with native histograms, Prometheus is getting native support for storing the histogram structure in the TSDB. The TSDB could previously only store a float64 as a value, but now Prometheus can store the complex data structure of a histogram in the TSDB natively. So in the context of Prometheus, we call it a native histogram. The exponential histogram name describes the actual thing, the fundamentals of the histogram, so that's why OpenTelemetry calls it exponential, I guess.

That makes sense, yeah. Any other questions? Oh, yeah, we have one here. Hi, how are you all using the native histograms at Grafana? Can you give an example? Yeah, so as Björn's talk covered, we don't have it in production yet. We are getting there slowly; we are still in the phase of testing out the histograms in an experimental fashion. But as soon as we have something ready, in Grafana Mimir for example, we will start to scrape it with Prometheus and remote write it to Mimir once it has stable support. That's our plan for using native histograms across our environment.

Björn has a question. So you said the schemas, or the scale, have different limits, right? In OTel, it's essentially unlimited.
Did it ever happen in practice that OTel was sending a scale that Prometheus couldn't handle, or was that a more theoretical problem? I guess the question was: was there any case where OTel was sending a scale that Prometheus could not handle? Yeah, that's the case when you have a scale lower than minus 4. For that case, the current implementation just drops the histogram data points. I think there were some ideas from the community to employ infinite buckets for that, or maybe to set a minimum scale in the OTel SDK to handle this issue. And if the scale is higher than 8, then we can just downscale. OK, but the question is, did it actually happen in practice? In theory, you explained it, right? I mean, if it ever happens in practice, Prometheus will just extend the range, right? I think the Prometheus stance is that it will never happen in practice, but if it does, Prometheus will change. Promise.

Yeah, any other questions? OK, we have one more question; I think this should be the final one. Getting my workout. You talked about OpenTelemetry using a dense representation and Prometheus using a sparse representation of the positive and negative buckets. My understanding was that the positive buckets in OTLP are a repeated value, so that the offset could skip a gap and be used to create a sparse representation. Is that correct, or do I have a misunderstanding there? Can you repeat the last part of your question? I could not hear it. Sure: if the positive buckets are a repeated value, and you have an offset starting at negative 1 like you had before, but then a gap of two buckets, so you've got four values, could you then have another positive buckets value in that repeated set with an offset of four, to represent that last bucket? Is that correct?
So I still did not understand the question completely, but was the question something like: what if there was a gap of four buckets? I'm sorry. I guess the question was: does OpenTelemetry really use a dense representation? Does it have to have zero values for all buckets that are empty, or is it possible to create a sparse representation within OTLP and get those savings? So the question is: can OpenTelemetry use a sparse representation for the empty buckets? Ruslan says no, you can't have a sparse representation in OpenTelemetry. Cool, all right. That's it, but the speakers will be around if you have more questions. Yeah, thank you.