Hello. Hello, am I audible? Yeah, I guess no one else is coming, so let's start. Welcome, everyone, to my talk. I am going to talk about the shiny new histograms that are coming to Prometheus soon. I am Ganesh Vernekar, a software engineer at Grafana Labs, a Prometheus team member, and a maintainer of the TSDB in Prometheus.

Before we talk about the shiny new histograms, let's see what a histogram is. A histogram lets you distribute your observations into multiple buckets. In all the examples in this talk, I am going to observe the latency of requests. On the y-axis is the number of requests; on the x-axis is the request duration. From this particular histogram we can read that 15 requests had a latency of less than 0.1 seconds, 25 requests fall between 0.1 and 1 second, and likewise between 1 and 2 seconds. The last bucket is special: it collects all the requests that took longer than 2 seconds into a single bucket.

So how do we store this in Prometheus? Prometheus creates one time series per bucket. Each bucket series gets a label called "le", which means "less than or equal to", and Prometheus recognizes this special label as the bucket boundary for that series. For this particular histogram we have four bucket time series, each carrying a bucket boundary, and note that it is less than or equal to. The first series, le="0.1", holds the count of all requests with latency up to 0.1 seconds, which is 15. The next one, le="1", holds all requests up to 1 second, so it includes the first and the second bar. Similarly, the third series is the sum of the first, second, and third bars, and there is a le="+Inf" bucket, which is everything up to infinity, i.e., the total count. On top of that, we have two additional time series for the total count and the sum of all observations.

This comes with problems. The first problem is that you have to define these bucket boundaries up front, when you write the instrumentation code, and it can get tricky and take some experimentation to get them right. The buckets are also cumulative. In this example only four buckets are filled (I have changed the bucket boundaries a little), but we have defined a whole lot of boundaries for this histogram, so a lot of buckets are going to be empty. Yet each empty bucket still takes memory, disk space, and all the other resources that come with a time series, and it slows down queries a bit, because the empty buckets still exist.

And if you got the bucket boundaries wrong and want to change them, you have to re-instrument all the applications with the new boundaries and redeploy everywhere to get the new buckets. That can itself be a problem. If you changed your bucket boundaries in an incompatible way, so that the old buckets and the new buckets do not line up, then, since PromQL requires labels to match for any kind of comparison, you may be able to compare the 1.0 and 2.0 buckets that exist on both sides, but all other combinations are incompatible. So you will have to wait until you have accumulated enough data with the new buckets.
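For reference, this classic scheme looks roughly like the following client_golang sketch; the metric name is made up, and the boundaries are the ones from the example above.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A classic histogram: the bucket boundaries must be chosen up front.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "Latency of HTTP requests.",
	Buckets: []float64{0.1, 1, 2}, // le="0.1", le="1", le="2"; le="+Inf" is added automatically
})

func main() {
	prometheus.MustRegister(requestDuration)
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// ... handle the request ...
		requestDuration.Observe(time.Since(start).Seconds()) // falls into one of the cumulative buckets
	})
	http.ListenAndServe(":8080", nil)
}
```

This one metric already costs six series: four le buckets including +Inf, plus the _sum and _count series, which is the buckets-plus-3 rule mentioned next.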
And for every histogram that you define, the total number of time series that Prometheus has to handle is the number of buckets plus 3. Why is this a problem? Take an example: you are instrumenting requests and you have sharded the histograms, say one histogram per status code per route. Now say you have 1000 pods instrumenting these histograms; a single bucket already accounts for a few thousand series across your whole deployment, one per pod per route per status code. So even adding, say, five more buckets costs a huge number of additional time series.

So here come the new histograms that we are working on right now. It is at the proof-of-concept stage; we have a big design, and I will build it up step by step in a simple way, leaving a lot of details out so that it is easier to understand. What I am going to explain in the next 5 to 10 minutes is the result of multi-year study and research by Björn, who is here with us right now. Björn, myself, and Dieter Michaelik from Grafana Labs worked on the code for this PoC. We are calling it a proof of concept because a few things here and there still need to be defined and standardized, but most of it is ready and open source at the moment.

The first property of the new histograms: you do not have to pre-define your buckets. The buckets are already defined for you, but you can set the precision, the resolution factor, of the histogram. For example, take the factor 2^1. It means you multiply a bucket boundary by 2 to get the next bucket boundary, and you always start from the number 1. So if one boundary is 1, the next boundaries are 2, 4, 8, 16, and so on. That is one resolution of histograms, and the factor is always some power of 2. If you want a lower resolution, where the gap between two consecutive boundaries is bigger, you take the factor 2^2 and get the boundaries 1, 4, 16, 64, and so on, and you can take a factor of 2^4 or 2^8 and so on. You cannot have factors like 2^3 or 2^5; the exponents themselves are powers of 2. That is the direction of lower resolution, which means bigger buckets. If you want to go in the other direction, the factors look like 2^(1/2), which is the square root of 2, then 2^(1/4), the square root of that, and so on. You multiply 1 by the square root of 2 to get the next boundary; multiply again and you get 2. You do not have to worry about this math, you can just assume that it works, and we will soon see how it solves all the remaining problems.

And if you look at the colors: once you increase the resolution, for example from factor 2^(1/2) to 2^(1/4), a new bucket boundary appears between every pair of boundaries that was there before, and if you take the third and fourth example, all the boundaries from above remain the same; you just get new boundaries in between. I have only talked about the boundaries above 1; for the boundaries below 1 you just divide by the factor. So above 1 you keep multiplying by the factor to get new boundaries, and below 1 you keep dividing, and the boundaries keep getting smaller. (A small sketch of this boundary rule follows.)
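To make the scheme concrete, here is a small Go sketch of the boundary rule just described. The function name and the parameterization (factor = 2^(2^n)) are my own, for illustration; the PoC's internal representation may differ.

```go
package main

import (
	"fmt"
	"math"
)

// upperBound returns the upper boundary of bucket `index` for a growth
// factor of 2^(2^n): index 0 always has upper bound 1, positive indices
// multiply by the factor, negative indices divide by it.
func upperBound(n, index int) float64 {
	factor := math.Pow(2, math.Pow(2, float64(n)))
	return math.Pow(factor, float64(index))
}

func main() {
	for _, n := range []int{1, 0, -1} { // factors 4, 2 and sqrt(2)
		fmt.Printf("factor 2^(2^%d):", n)
		for i := -2; i <= 3; i++ {
			fmt.Printf(" %.4g", upperBound(n, i))
		}
		fmt.Println()
	}
}
```

Note how the sqrt(2) boundaries interleave the factor-2 ones, and the factor-2 boundaries interleave the factor-4 ones: that is exactly the interleaving property described above.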
So, why is it like this? If we glance at this for a moment, we see that between resolutions there are common boundaries, and that helps us move from a higher-resolution histogram to a lower-resolution one. In this example, the first histogram uses a factor of 2, so the buckets are 1 to 2, 2 to 4, 4 to 8, and so on. If you want to decrease the resolution of this histogram, you choose the next factor, which is 2^2, add up the old buckets that fall into each new bucket, and you get the new histogram; lower the resolution once more and you get the next one.

Why do we want this? We saw earlier that if you change the bucket layout, you cannot match the buckets between two histograms, because they can be incompatible. But here the first histogram has resolution factor 2^1 and the second 2^2. If you want to add these two histograms, you just convert the higher-resolution one down to the lower resolution, and now you can do any kind of arithmetic that you want. That is the power of pre-defining the buckets as powers of a common base: histograms of different resolutions can be combined. I have skipped the step of converting the high resolution to the low resolution on the slide, but you can match the colors: the yellow bucket matches directly, and the blue buckets are added together to get the new blue bucket. Let me move on to the next slide.

Because we have predefined the bucket boundaries, you now only have to specify the factor to multiply by for every new boundary. And since the boundaries are fixed, you do not need to store the boundaries themselves, which is good, because encoding floating-point numbers is expensive and not very efficient. Instead we can use integer IDs, which start from 0 and go up and down the number line; these are very efficient to encode and take less space and less CPU to encode and decode. We give ID 0 to the bucket whose upper boundary is 1. Here are three different resolutions: the first is 2^1, the second is 2^(1/4), which is a higher resolution, and the third is 2^2. We start at bucket 0, and every next bucket gets ID 1, 2, 3, 4 as the boundaries increase; when they decrease, we go in the negative direction.

Now we have built up all the information that we need about the histograms, and this is how we encode it. Take this example histogram. The first part is the metadata, which encodes the resolution, the sum, and the count; the resolution is just the factor we talked about, 2 to the power of a power of 2. This is enough to decode the rest of the histogram. The next part, which we call a span, tells you the bucket layout of this particular histogram. If you look at the histogram, the first four buckets are consecutive, one after another; then there is a gap of two buckets, and then there is one more bucket. We are using the factor 2^1, and let me use that to decode what is written in these spans. We have (0, 4) and then (2, 1). The first span means the buckets start at index 0, because the first bucket has an upper boundary of 1, and from index 0 there are four consecutive buckets, hence (0, 4). The next span, (2, 1), says that after the previous run of buckets there is a gap of two buckets, so you skip two indices, and the next run has length 1, so there is one more bucket.
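Here is a small sketch of how such spans could be expanded back into absolute bucket indices. The type and field names are mine for illustration, and the counts are made up; the actual PoC encoding differs in detail (it also delta-encodes the counts, for example).

```go
package main

import "fmt"

// span describes one run of consecutive filled buckets: skip `offset`
// bucket indices after the end of the previous run, then `length`
// filled buckets follow.
type span struct {
	offset, length int
}

// expand pairs each stored count with its absolute bucket index.
func expand(spans []span, counts []int) map[int]int {
	buckets := make(map[int]int)
	idx, ci := 0, 0
	for _, s := range spans {
		idx += s.offset
		for j := 0; j < s.length; j++ {
			buckets[idx] = counts[ci]
			idx++
			ci++
		}
	}
	return buckets
}

func main() {
	// The layout from the talk: four buckets starting at index 0,
	// a gap of two, then one more bucket (which lands at index 6).
	spans := []span{{0, 4}, {2, 1}}
	counts := []int{3, 5, 2, 4, 1} // made-up counts, one per filled bucket
	fmt.Println(expand(spans, counts)) // map[0:3 1:5 2:2 3:4 6:1]
}
```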
So this compressed format tells you the bucket layout: you know which buckets are filled and which are not, and then we just store the count of each filled bucket, consecutively. We do not need to map buckets to counts explicitly; we just have the bucket layout and the counts, and we can use some efficient encoding to store them. And now we have only one time series per histogram, because the bucket layout and the values are all encoded into a single piece: one time series maps to the whole histogram. Currently in Prometheus, a sample has a timestamp as an int64 and a value as a float64; we just replace the float64 with the new encoding I described. So you have just one time series, and increasing or decreasing the number of buckets does not change the number of series, and it is efficient. There was a talk at PromCon last year where we saw that this new encoding, without one series per bucket, gives more than 90 percent index size savings if you previously had many buckets, and roughly 50 percent disk size savings.

So how do you instrument this? It is as simple as this: everything remains the same as the previous instrumentation, except you do not define your buckets; you just define what factor you want to use (a sketch follows below). Here I have examples of 2 and 4, 4 being 2^2. And you do not need to get the precision exactly right; the instrumentation library automatically chooses the closest supported precision to what you define. As you observe values, the buckets are filled automatically; a new bucket is created when it gets a value, and if a bucket has no value, it simply does not exist.
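For reference, a minimal sketch of what this instrumentation looks like. I am writing the factor field as SparseBucketsFactor, which is my recollection of the PoC branch's field name; treat it as illustrative, since the API may change before release.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch: instead of a Buckets slice, you only pick a growth factor.
// SparseBucketsFactor is the assumed PoC-branch field name.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:                "http_request_duration_seconds",
	Help:                "Latency of HTTP requests.",
	SparseBucketsFactor: 2, // or 4, i.e. 2^2, for a lower resolution
})

func main() {
	prometheus.MustRegister(requestDuration)
	start := time.Now()
	// ... do some work ...
	// Buckets spring into existence as values are observed; empty
	// buckets are never created or stored.
	requestDuration.Observe(time.Since(start).Seconds())
}
```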
Hence the name "sparse high-resolution histograms". "Sparse" comes from the fact that we do not care about empty buckets; we do not store them anywhere. "High-resolution" because, with this efficient encoding, you can now afford hundreds of buckets in a histogram without making a real dent in resource consumption.

In this proof of concept, scraping looks like this: currently we have one HTTP request that asks for the text-format metrics, which gives the time series in the format we saw earlier, and we do another HTTP request to get the new sparse histograms. We encode the new sparse histograms in protocol buffer format, because it is more efficient, and the new histograms can be described much more easily in protobuf than in the text format. So Prometheus makes two requests to a target to get all the data it needs.

Now it is finally time for the demo. I cannot mirror my screen, so I will try my best, because I have to look over here too. I am running Grafana here, and a Prometheus that supports sparse histograms, and some synthetic load: one is something called storyteller, plus another synthetic load that I wrote, so we have two synthetic loads, and I am going to show a live load soon. That is the context of what is running right now; let me get another browser on the screen, just a second.

So this is the same instrumentation example I showed earlier. I have two histograms: one called "medium-low resolution", whose factor is 2^2, which is 4, and a "medium resolution" one whose factor is 2^1, which is 2. Do not worry about that number; I am going to use different timestamps to show you different values and arithmetic. We have implemented a bunch of PromQL functions that work on these new histograms. I am going to fetch this histogram at a particular time: at this timestamp we have three buckets filled, 0.5 to 1, 2 to 4, and 16 to 32. This is the same histogram again, but different buckets are filled, which either overlap with the buckets above or do not, and there are empty buckets above. Okay, this setup is not really comfortable, but I am trying; are you able to see what I am showing? Okay. And if you add the buckets, they are merged together, since it is the same resolution.

Now let's take an example of merging histograms that have different bucket layouts. We have two histograms; the first histogram's resolution is 2^4, the second's is 2^2. If we sum these histograms, the resultant histogram has a factor of 2^4, which follows the concept we discussed earlier: the higher resolution (2^2) is converted into the lower resolution (2^4) and then they are added together (there is a sketch of this conversion right after this section). We can also do things like histogram quantiles. At this point I am just giving examples of what works and what is present; I am adding data at a constant rate here. Now ends the boring part; I will move on to the interesting part that we wanted to show.

Okay, so remember I mentioned storyteller; before that, I am going to show a live cluster with heatmaps. My friends at Grafana Labs, Leon and Ryan, worked on new, efficient heatmaps. The current heatmaps crash when you have too many buckets; we tried to render some high-resolution histograms a few months ago and the laptop just crashed, because there are simply too many buckets when every bucket needs its own time series.
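Before the live demo, here is the promised sketch of the resolution conversion: because every lower-resolution boundary is also a higher-resolution boundary, dropping from factor f to f^2 is just a matter of pairing up neighbouring buckets. The indexing follows the integer-ID scheme from earlier; the code is my own illustration, assuming absolute per-bucket counts.

```go
package main

import "fmt"

// halveResolution converts a histogram with factor f (buckets keyed by
// integer index, upper bound f^i) into one with factor f^2 by merging
// neighbouring buckets.
func halveResolution(buckets map[int]int) map[int]int {
	out := make(map[int]int)
	for i, count := range buckets {
		out[mergeIndex(i)] += count
	}
	return out
}

// mergeIndex computes ceil(i/2): the index of the factor-f^2 bucket
// that fully contains the factor-f bucket i.
func mergeIndex(i int) int {
	if i > 0 {
		return (i + 1) / 2
	}
	return i / 2 // Go truncates toward zero, which is ceil for negatives
}

func main() {
	// Factor-2 buckets: (0.5,1]=3, (1,2]=5, (2,4]=2, (4,8]=4.
	hi := map[int]int{0: 3, 1: 5, 2: 2, 3: 4}
	// Factor-4 result: (0.25,1]=3, (1,4]=7, (4,16]=4.
	fmt.Println(halveResolution(hi)) // map[0:3 1:7 2:4]
}
```

Once both histograms are at the same factor, summing them is a plain index-by-index addition, which is exactly what the PromQL sum in the demo does conceptually.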
Now, this is actually scraping a live cluster. We have a cluster running in our dev environment, we have instrumented it with the new histograms at a very high resolution, and there is a Prometheus running in dev scraping these histograms; I am forwarding that port and connecting to it locally. These are just requests, and here you can clearly identify that there are different bands of latency: one here, one here, and one here. If you look at the y-axis, they are very close together, the buckets are very close together, and the browser did not crash. But this is still not the full capability of the histograms; I have a bunch of other histograms, so let me jump to another one. This one is again a synthetic load, but there are so many buckets that it looks like a continuous gradient. I just wanted to show that we also have a heatmap that is compatible with the new histograms, works super efficiently, and you can just play with it. There is a story behind why this heatmap looks the way it does, but I am going to skip that story for now. I had a bunch of backup slides in case the demo did not work, and to show how fast the histograms load, I can just try running a six-hour query and it should just load, I guess, fingers crossed. Yeah, did it load? I guess it loaded: six hours of query.

So how can you use this? Everything that we saw here is open source. The instrumentation is available in the client_golang repo, in a branch called sparsehistogram. Similarly, the Prometheus server that we ran is open source, in the sparsehistogram branch. The Grafana instance is actually running the main branch, but the new heatmaps are hidden behind a feature flag. And with this, thank you. Do you have any questions? You can go to the mic at the center, or you can shout a question from there and I will repeat it.

So the question is: scraping happens via protobuf for the new histograms; is there no text format? The answer is no, there is no text format implemented for this, and I do not think we will have a text format in the end, unless we find some super-efficient way to do it.

Hi, thanks for the talk, it was very interesting. I get the mathematical properties of this; I think it is really pretty nice. Let me know if I am looking at this correctly: let's say we are measuring latency, and our latency for a specific endpoint has a normal distribution centered around 200 milliseconds, with a standard deviation that makes it go from about 150 to 250. So, depending on the factor that we use, we are losing a lot of precision there, right? Most of the observations are going to go into the same few buckets. Are you thinking about something like playing with offsets and scales, so that you can get the best resolution for your metric?

Can you repeat the last part?

Yeah, so what would you recommend? Let's say that is our situation, and we want to take advantage of most of the precision right in the area where our metric has the most variability.

Okay. When you set a precision, the bucket boundaries are fixed, so you cannot ask it to focus on a particular range of values. But what could be done is this: there is another option in the client library
which I did not mention: you can limit the number of buckets. So you can set a very high precision and a limit on buckets, like 150 or 200, which is still very practical with these new histograms, and once the histogram hits that limit, say 200 buckets, it will automatically go to the next lower precision. For example, if you are using a precision of 2^(1/8) and hit the bucket limit, it will automatically switch to 2^(1/4). So that is the best thing possible at the moment: you can start with the highest resolution. Also, there is a compressed unit of histograms called a chunk, which contains 120 histograms at a time, and once those 120 histograms are done you start... okay, that part is wrong, sorry. The point is: you can configure the histogram to reset, for example every 10 minutes or half an hour or one hour, so that you again start with the highest precision and a low number of filled buckets. That is another way to move back to high resolution again.

Okay, got you, thanks.

Hi, thanks for this, it seems very elegant. It might be a misunderstanding of mine, but it seems like the distribution of the buckets across the number line assumes the data is going to follow something like a long-tail distribution: more samples at the lower end of the number line, getting more and more spread out as you go up. Is that correct? And what if the data does not fit that?

Yeah, that is correct: if you set a higher precision, the buckets stay small up to some extent, and then they grow exponentially. The idea behind this is that the gap between two consecutive buckets, the percentage difference, is fixed. Whenever you do a quantile estimation, the error of your estimate is bounded by the difference between the buckets, the percentage difference. So the goal was to bound the percentage error of histogram-quantile estimations. Compare that with the +Inf bucket in current Prometheus: if a value falls into that bucket, you cannot predict the quantile within any certain percentage at all. So you are right that the spread gets big, but the percentage error of your estimations is bounded wherever you go (see the sketch below for rough numbers). Thank you.
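To put rough numbers on that percentage-error argument (my own illustration, not part of the PoC): with a growth factor f, any observation inside a bucket is pinned down to within a factor of f, so the relative error of a quantile estimate is bounded on the order of f - 1, independent of where on the number line the bucket sits.

```go
package main

import (
	"fmt"
	"math"
)

// Worst-case relative quantile error for a few supported growth factors:
// consecutive bucket boundaries differ by the factor f, so an estimate
// can be off by at most roughly f-1 in relative terms (about half that
// if you estimate at the bucket midpoint).
func main() {
	for _, exp := range []float64{1, 0.5, 0.25, 0.125} {
		f := math.Pow(2, exp)
		fmt.Printf("factor 2^%g = %.4f -> relative error < ~%.1f%%\n",
			exp, f, (f-1)*100)
	}
}
```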
Hello, thank you for the presentation, it was very clear. I have seen that you added the instrumentation to the Golang Prometheus client; are you planning to add this feature to other languages, for example the Python library?

Yeah, so this was just a proof of concept, and we are still playing with it. Once this goes into Prometheus, it will for sure spread to the other client libraries.

Okay, thank you.

We still have five minutes, I guess. Do you have one more question? Okay, you can ask.

I was just going to ask: if you go ahead with this, it sort of sets a precedent for non-floating-point series values. Do you think there are other use cases for that, once that precedent has been set?

Okay, you mean how we will have both histograms together, the old histograms and the new histograms?

No, more like: once you have allowed non-floating-point series values for this case, are there other kinds of series types that you might add?

Okay, yeah. If I understand your question right, you mean we replace float64 with a different data structure; is there scope to add more data structures? It is possible, but the big problem that comes with it is that any change to the data type requires changing the TSDB, the PromQL engine, and everywhere else the sample is accessed. So to make any such change there needs to be a very solid use case. I would say it is possible, but it will require a really big use case for that to happen.

And your question was how the old histograms will work with the new histograms... okay, so your question was how we can access individual buckets. We have a huge design doc explaining all the PromQL things that we want to do with this, and that is included there. We will have the ability to ask for the value at a single bucket; it is just not implemented at the moment. We are just running this in dev, we are still playing with it; we have just one Prometheus, which is scraping just one dev cluster instrumented with sparse histograms.

Two more minutes.

I think a lot of the previous questions were around the bucket distribution. It seems like the center of mass of the distribution will be around zero... around one, yes, exactly, it will be around one, and then it will spread out exponentially. So is it possible to add a single offset? For example, if you are measuring latencies, you could say plus 200, and then the center will be around 200, which is where the normal latency would be, let's say.

So when you say the center of mass: it is just the calculation of the bucket boundaries that starts at one; the observations that you make can be centered anywhere. But the problem with centering the calculation somewhere else is that it creates the same problem again: you cannot mix and match with other histograms.

No, I mean the buckets will be the smallest, the most fine-grained, near one, right, because of the multiplication factor. But if people could move those fine-grained buckets to 300 ms, where the latency actually is, the quantile function results would be more accurate, right?

Yeah, that is possible, but again, it skews the bucket layout and makes histograms incompatible with each other for merging. I guess we are out of time. Thank you for joining the talk.