For the introduction: it's great to be here. FOSDEM was actually the very first conference I spoke at, in 2013, back then in the graph devroom, and it was high time I came back. It's always an event I really enjoy going to, and it's very nice to be here. So: latency SLOs done right.

To get you in the mood, I have a question for you all. You all have APIs that you manage or care about in some way. So if I ask you the following question, or your manager asks you the following question: of all the requests you got in January, how many were served within 100 milliseconds? It seems like a fairly basic question, right? You're monitoring it. So how would you actually do it? Can you answer this question already? What about 150 milliseconds? What about 180 milliseconds? What if you had a problem on June 16th, 2018 between 9:12 and 9:35, how many requests were served within 100 milliseconds then? Who here can actually answer that question for any of their APIs? Okay, that's great. Can someone tell me how they do it? Splunk, yeah, excellent. So you keep Splunk data for more than half a year. Okay, I admire this.

These kinds of questions are examples of latency SLOs. At Circonus we have been able to answer such questions very convincingly, and at the latency SLO workshop at SREcon last year I realized that for many people it is next to impossible to answer them. That was really shocking to me, how hard it actually is. So I thought, well, this might be a good topic for a talk. What I want to do here is give you a few ways to do this with the tools that everybody uses for monitoring and log analytics, point out some pitfalls and misunderstandings that often arise and clear them up, and also show some of our tooling, which is largely open source, and which might give you more insight into your APIs and let you answer questions like that very convincingly.

Something more about me: my name is Heinrich, I'm a data scientist at Circonus. I originally come from mathematics, and recently I have talked a lot about statistics for engineers: a mathematician entering the IT operations and monitoring domain, talking about what percentiles are, how to apply them correctly, this kind of thing. I moved to the countryside in Germany so that I can chop more firewood myself. And I'm @HeinrichHartman on Twitter if you want to follow me around.

Okay, here's the little plan. First, I want to talk a little bit about why you want to monitor latency. Second, if you look at the title, I should probably explain what an SLO actually is. Then I will give you three methods to calculate latency SLOs effectively, and we'll talk about each of those methods. And in the end we will have a conclusion. Isn't that wonderful?

So, without further ado: latency is important. I will just say this much. I brought some props: this is the book that everybody who does SRE or DevOps stuff reads, the Site Reliability Engineering book from Google authors, Niall Murphy and others. It has the four golden signals in it, on page 60. Here they are, the four golden signals: latency, traffic, errors, and saturation. Latency is actually the first one. Those four metrics were later rearranged into the RED monitoring methodology.
So if you want to monitor APIs, the recommended best practice is to do RED, which stands for Rate, Errors, and Duration. And duration is latency. So they took the four golden signals, got rid of one of them, rearranged the rest, and made an acronym out of it. But it's still the best thing; I recommend everyone do it, and many people agree with me. And you probably all care about latency if you are in this talk, so I will not actually talk much more about that.

The second part I should explain is: what is an SLO? Well, if you're monitoring something, you want an idea of how that value should actually look. You have some metrics, and what is the expectation? SLOs are a quantification of the service quality that you expect the service to have. There is a methodology proposed in this book, which we also saw in Richard Hartmann's talk before mine: SLI, SLO, and SLA, three acronyms used in this context. You start your service quality measurement by specifying service level indicators, which very concisely measure the reliability or quality of your service in a specific way. Then you have service level objectives, which are basically a target value, or a corridor of values, that you expect those SLIs to stay in over time. The SLO is something you might communicate to your users or internally; "I have a 99.9% uptime goal" would be an example of an SLO. It's important to realize that service level objectives play out over longer time spans. They are something you do management on: decisions like "should I push out more features, or should I be more conservative and spend more time investing in the reliability of my service?" Those trade-offs are managed with SLOs, and it's a real art to specify good ones. And then SLAs are basically what happens if we don't meet our SLO, which is more of a legal question, so I'm not going to talk much about them.

Here's a first example, an availability SLO. The SLI is: SSH into the target host, emit a one if it works and a zero if it doesn't, do that every minute, and put it into a metric. That's a very clear service level indicator. The SLO is 99.9% uptime over the last month. So I take a month of data, look at how many ones and how many zeros I measured, and if I have a one more than 99.9% of the time, the SLO is met. And if I don't meet the SLO, you will get exactly one cake. That might be an SLA you follow; you can put other incentives there, it's up to you.

And here's an example of a latency SLO, used in precisely this form in the SREcon workshop I attended last year. The SLI was the proportion of valid requests that were served within a second. This is a metric, computed every minute. And the SLO was: 90% of the valid requests in the past 28 days, so over a long period of time, were served within one second. You will recognize this kind of question from the very start of the talk; I asked you how many requests, and now we are asking for a percentage. An SLA was skipped here.
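Before we move on, here is a minimal sketch of how evaluating the availability SLO from a minute ago might look, assuming the per-minute 0/1 probe results have already been collected into a list (the function name and the inline data are made up for illustration):

```python
# Hypothetical per-minute SLI samples: 1 if the SSH probe succeeded, 0 if not.
# A real month of minutely data would have about 43,200 entries.
probe_results = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

def availability_slo_met(samples, target=0.999):
    """Check an uptime SLO over a window of 0/1 SLI samples."""
    if not samples:
        return False  # no data: we cannot claim the SLO was met
    availability = sum(samples) / len(samples)
    return availability >= target

print(availability_slo_met(probe_results))  # False: 9/10 = 90% < 99.9%
```

The latency SLO works the same way once you can count requests instead of probe minutes, and that is exactly where the trouble with percentile metrics begins.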
Interestingly, afterwards they showed you data about this API, and they showed you percentiles. And I guess most of you actually do percentile-based monitoring; it's very common, and it's what is recommended.

Let me back up a little. The problem with latency monitoring is that the thing you are monitoring, the API, is inherently event-based. You have tons of events coming in, maybe 10,000 in a period, and you're trying to store them in a metric, which holds just a single value for that period. So you have to compress a whole lot of information, 10,000 events and their latencies, into a single number. The first thing everyone did was averages: let's just store the average latency. Then someone at Optimizely wrote a very nice blog article that nailed the problem with this: measuring the average latency is like measuring the average temperature of the patients in a hospital. You don't really care about that; you care about the sick patients most, so that's where you want to focus. And then there was the Amazon Dynamo paper, which said: well, everyone knows that averages and standard deviations are shit for latency, so we do the 90th percentile, the 99th percentile, the 99.9th percentile. And that is basically the current state of the art; everybody is doing this.

So they had percentiles on the slide, and the question was: was the SLO met? You have 28 days of service here; 90% of the valid requests in the past 28 days should have been served within one second. And they asked the audience: was the SLO met? It looks right: there's the 1,000-millisecond mark, here. (Sorry, this label should actually say 99%, but never mind.) So you look at the 99th percentile; you know that 99% of all requests were below the 99th percentile, and the question is whether 90% of all requests were served within one second. And the temptation is to look at the one-second mark and the 99th-percentile line and say: well, we were below the one-second mark, so we should be fine. But is that really true? What if I told you that 99.9% of all the requests actually occurred right where the line spikes above one second? We don't have the request counts here; we simply don't know.

And by the way, that is not how a real percentile metric looks. This is how a percentile metric looks, at least how our percentile metrics look; I don't know if you have stuff like this. This is not a very well-behaved, constantly loaded service: I have periods at night where I don't have any requests, nobody is using my service. It's a poor service. And if I take the 99th percentile of no requests, that's missing data. Down here I have maybe five requests, and the 90th percentile of that is just shit, it doesn't tell you anything. Looking at this, I cannot at all plausibly tell you whether 90% of my requests were below that threshold.

So if you're only storing percentiles, and maybe averages, then you have no way to determine these kinds of SLOs. That's the first realization, and it's quite important. It is also not so easy to communicate; I keep having this discussion. At heart it's a percentile aggregation problem: the SLO asks you to compute a percentile over a month, or a week, of data, and also across the whole service. It doesn't ask whether www1 or www5 individually served so many requests within a second; it asks about all the web servers that serve the API together.
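You can see this failure mode in a small simulation. This is not data from the talk, just a made-up traffic pattern, one overloaded minute followed by 59 nearly idle ones; the per-minute percentile chart looks healthy while the request-weighted answer is terrible:

```python
import numpy as np

rng = np.random.default_rng(42)

# Minute 0: an overloaded minute, ~100k slow requests (mean latency 2s).
busy = rng.exponential(2.0, size=100_000)
# Minutes 1..59: nearly idle, five fast requests each (mean latency 50ms).
idle = [rng.exponential(0.05, size=5) for _ in range(59)]

minutes = [busy] + idle
p90_per_minute = [np.percentile(m, 90) for m in minutes]

# The per-minute p90 chart looks fine: 59 of 60 points are far below 1s.
print(sum(p > 1.0 for p in p90_per_minute), "of 60 p90 points exceed 1s")

# But the SLO is a statement about requests, not about minutes:
all_requests = np.concatenate(minutes)
print(f"{np.mean(all_requests <= 1.0):.1%} of requests were served within 1s")
# roughly 40%, nowhere near what the percentile chart suggests
```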
So you need to aggregate across multiple nodes, you need to aggregate across multiple weeks, and you just cannot do that with percentiles. And then people say: well, kind of, you can, maybe. This was a reaction to my 2016 Monitorama talk; John Rauser is a data scientist at Snapchat. And he said: if people like Hartmann say that you can't really meaningfully aggregate sampled percentiles, then, yeah, I'm annoyed. Sometimes you actually can. And it went back and forth. I wrote him a letter, "John, blah, blah, blah", and he said: well, actually, I wrote a ten-page blog post in R showing that you can aggregate percentiles. I have the link here, you can look all of this up, and it's actually a very beautiful post; he is really great at explaining this. And okay: if you have multiple nodes that sample from the same distribution, and you just average their percentiles, then you will get the true percentiles. It's a statistical phenomenon: if you're sampling data from the same distribution, your statistics will converge if you average them. However, in practice you are not sampling data from the same distribution.

I won't go into the details of what's on this chart, I will do that later, but I took some production data and compared a one-hour averaged percentile to a true 90th percentile over one hour. So one time I took all the data and computed the 90th percentile over the full hour, and the other time I took one-minute percentiles and averaged them. The worst-case error was 300%. And the nasty thing is that usually it's fine: if you have five nodes that all do roughly the same work and you average their p90s, you won't be much off from the true p90. But here you have two nodes, a blue one and a red one. The blue one is doing a lot of work, so you have a really nice distribution, a very typical latency distribution. The red one isn't doing much, and its p95 is way down here. The true p95 of the total will be pretty close to the blue one's, because the red node isn't actually contributing much. That red node might be a failed service, something that just started up, or a problem with a load balancer; your service is not in a good state. And if you average the two percentiles, you have a 30% error or so, which is substantial. The real problem is that it works most of the time, but in exactly the situations you really care about, when something goes wonky, when a disk starts to fail, your aggregated percentiles will be terribly off. So yeah, I hope I convinced you that percentiles are not really it.
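Here is a small simulation of that two-node scenario, with synthetic numbers rather than the production data from the slide, showing how far averaging per-node percentiles can drift from the true percentile of the combined traffic:

```python
import numpy as np

rng = np.random.default_rng(7)

# Blue node: busy, a typical latency distribution around 100ms (in seconds).
blue = rng.lognormal(mean=np.log(0.100), sigma=0.5, size=10_000)
# Red node: barely any traffic (failed service, draining LB, ...), very fast.
red = rng.lognormal(mean=np.log(0.005), sigma=0.3, size=20)

avg_p95 = (np.percentile(blue, 95) + np.percentile(red, 95)) / 2
true_p95 = np.percentile(np.concatenate([blue, red]), 95)

print(f"averaged p95: {avg_p95 * 1000:6.1f} ms")
print(f"true p95:     {true_p95 * 1000:6.1f} ms")
# The true p95 stays close to the blue node's p95, since blue serves almost
# all the traffic; the naive average is dragged down and is about 50% off.
```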
And to be completely fair, the Google folks didn't actually take a percentile metric as their SLI; they took something else. They took the proportion of valid requests that were served within one second, and that is precisely the first method of doing it right. So I will talk about three ways to do it right: the first one is log data, the second one is counter metrics, and the third one is histogram metrics.

If you have log data, if you have Splunk for half a year, then you can answer these kinds of questions very easily. You say: select everything from logs where the time is in some time window and the latency is below a threshold. Now, you may not have an SQL query interface for your logs; you might have another query language, but this is essentially what it is. You put all your logs in a data store, you have a field for the latency, and you query it. You can do that with log tools. And in this way you can answer the original question, how many requests were served in January within 180 milliseconds: just go and count them, you have everything. So it's great if you can do that. It's correct, clean, and easy. The problem is that you need to keep all your log data for a month, which can be very expensive, and every Splunk customer knows this. The problem is not really that Splunk is ripping you off; that might also be the case, but it's not the main reason. It's that it is just a terrible lot of data: for every request you have a log line of maybe 80 or 100 bytes, and you have to store all of it. If you have meaningful volume, like 1,000 or 10,000 requests per minute, this is gigabytes and gigabytes a day. So it's by design very expensive, and very few people can actually afford to keep log data for very long.

The second method is called counter metrics, and it's also very simple. Say you are interested in the one-second threshold: I want to know how many requests were served within one second. So I make a new metric, and a common name for it is lt_1second, "lt" for "less than", and I just count how many requests were served faster than one second. Pretty easy. I add a new metric and do that for each node. And the beautiful thing is that it has now become mergeable; it has become aggregatable. With the percentiles I couldn't do the aggregation; with the counters I can just sum them, and I can integrate them over time. This is how it looks: here in black I have the total request count, and in red the slow requests. I can select a time frame and integrate the graph, so I sum up every line I have and arrive at some numbers. I don't have a laser pointer with me, but it's the two line graphs that end up high at the very end: 8.9% slow requests. So for this API, my SLO was met: 90% should be fast, and only 8.9% were slow.

So, latency SLOs by counter metric: they're easy, they're correct, and they're cost-effective because you're not storing a lot of log data. They give you full flexibility in choosing aggregation intervals; note that you can freely select time ranges, and you can freely select the number of nodes you're aggregating over. But you need to choose your latency thresholds upfront; you have hard-coded the one second in there. And people who do this seriously actually do precisely that: they hard-code a bunch of latency thresholds and create metrics for them. Cloudflare has examples where they keep 1,000 thresholds per API, not for all their APIs, but for those monitored like this, so that afterwards they can select just about any latency threshold they might be interested in.
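As a sketch of what that instrumentation might look like (my own toy layout, not Cloudflare's or anybody's actual schema), here are per-node, per-minute counters that stay mergeable across both dimensions:

```python
from collections import defaultdict

# counters[(node, minute)] = {"total": n, "lt_1s": m}  -- a toy in-memory
# layout; in reality these would be counter metrics exported by each node.
counters = defaultdict(lambda: {"total": 0, "lt_1s": 0})

def record(node, minute, latency_s, threshold_s=1.0):
    """Instrumentation hook: bump the counters for one served request."""
    c = counters[(node, minute)]
    c["total"] += 1
    if latency_s < threshold_s:
        c["lt_1s"] += 1

def fast_fraction(nodes, minutes):
    """Counters are mergeable: sum freely across nodes and time windows."""
    total = sum(counters[(n, m)]["total"] for n in nodes for m in minutes)
    fast = sum(counters[(n, m)]["lt_1s"] for n in nodes for m in minutes)
    return fast / total if total else None

record("www1", 0, 0.21)
record("www1", 0, 1.73)  # a slow request
record("www5", 1, 0.48)
print(fast_fraction(["www1", "www5"], [0, 1]))  # 2/3 served within 1s
```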
We have a technology called HDR histograms that allows you to do better. The basic idea is that instead of storing the individual durations, like in a log or something, we store a histogram representation of them: we put the durations into bins and just count how many samples fell into each bin. And then we apply one more trick. So far this would be very similar to what Cloudflare did with their 1,000 metrics, but we just don't store bins that have no samples in them. Say there's a bin from 2,100 to 2,200 with nothing in it: we don't store it. We use a sparse encoding. This way we get away with very low storage requirements and a very broad range of data coverage: we can basically cover the whole float range, from 10 to the minus 128 up to 10 to the plus 128, with about 46,000 possible bins, and still use only around 300 bytes per histogram. So it's a metric, but it stores much more than a single value; it stores the full distribution.

And this is how it looks. For each point in time we have not just one percentile or one average, we have the full histogram. And then we can go and aggregate it freely. Here I selected a time frame, and I can view the complete distribution over it. And at the very end I can do my latency SLO very easily: I just select something here, and I have 40 million total requests in the time span of a month; I hover over this and I see that 89.3% of my requests were faster. This is pretty eye-opening once you have that kind of technology available. Here's a demo: I select a date range, say two weeks, I hit set, and here's the updated view. And I can just check: okay, my latency below 27 milliseconds was 69%. And I can do that on high-volume data. These here are actually disk I/O latencies over a month: three different disks, each doing block I/O requests, and I can see the latency distribution of all of them. It's not a highly loaded system, but it shows that with this kind of technology you can monitor not only web APIs; you can do low-level system interfaces like syscalls or block I/O latencies and still compute aggregated SLOs on them. So it's not hard.

We actually have two commercial products, Circonus and IRONdb. IRONdb is a time-series database that is fully histogram-capable, and Circonus is a SaaS monitoring tool. Both are, as full systems, not open source, but they have a very generous free tier, so you're welcome to try them. And the core technologies behind the histograms are open-sourced. There are two open-source histogram libraries. One was developed by Gil Tene from Azul Systems, who is an early proponent of measuring latency correctly; he has an excellent talk on this, so if you're more into the problem of doing proper latency benchmarking, you should definitely watch it. He came up with the HDR Histogram name. And we have a very, very similar thing called libcircllhist, which we use in our product.
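To make the sparse-bin idea concrete, here is a toy sketch in the spirit of these libraries: two significant decimal digits per power of ten, with only non-empty bins stored. It is a simplification for illustration, not the actual libcircllhist or HdrHistogram implementation:

```python
import math
from collections import Counter

class SparseHistogram:
    """Toy log-linear histogram: a bin is (two-digit mantissa, power of ten);
    only bins that receive samples are stored. Positive values only."""

    def __init__(self):
        self.bins = Counter()  # (mantissa, exponent) -> sample count

    @staticmethod
    def _key(value):
        exp = math.floor(math.log10(value))
        mantissa = int(value / 10 ** (exp - 1))  # 10..99
        return (mantissa, exp)

    def insert(self, value):
        self.bins[self._key(value)] += 1

    def fraction_below(self, threshold):
        """Share of samples in bins that lie entirely below the threshold."""
        total = sum(self.bins.values())
        below = sum(count for (m, e), count in self.bins.items()
                    if (m + 1) * 10 ** (e - 1) <= threshold)
        return below / total if total else None

h = SparseHistogram()
for latency_ms in (3.2, 3.4, 47.0, 48.1, 1200.0):
    h.insert(latency_ms)
print(len(h.bins), "bins stored")  # 5: only the bins that were actually hit
print(h.fraction_below(100))       # 0.8: 4 of 5 samples were below 100 ms
```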
So here is what I want you to take away: percentile metrics are not suitable for SLOs. If you just have percentiles, it's not good enough to do your SLOs. HDR histograms are the best answer that I know of, and you should try them out if you want. If you don't have HDR histograms, the following is effective and can be done; it's not bad practice. You usually store log data for three to five days anyway, so you have that available, and with it you can do the first method: you can freely choose your thresholds, experiment, and determine what sensible service levels are for you. Is it one second? Is it 100 milliseconds? What is my typical performance? This you can do on the first three to five days of data. And then you add the instrumentation for the counter metrics, less than 100 milliseconds, less than one second, less than two seconds, and aggregate those metrics over weeks and months as needed with the tools you have.

It's a little bit of a pity that you have to make these choices upfront, but, as I said, this is how it works. And this is actually how Prometheus histograms work: many vendors have started to use the term histogram, but what they really do is just let you specify a bunch of thresholds. With the HDR histogram technology you don't need this upfront choice: you have 46,000 bins, and you are basically covering the whole range you could ever want, without any upfront choices. So, in my opinion, this is the best way to do arbitrary latency SLOs for high-volume APIs or whatever you have. And here are some other correct ways to do it. But please be careful with percentile metrics. They give you a good impression of how your API is doing, but actually doing math on them is very, very hard; they are not very kind statistics, in the sense that it's not easy to aggregate them. With this, I'm going to close. Thank you very much for your attention. I think we have two minutes for questions, right? Please remain seated during the questions.

[Audience] Do we have sound? Hello? Yeah. If you don't have full historical logs or full histograms, can you instead use the usual volume counter and a regular sampling of the latency? Can you approximate something correct by multiplying the latency samples by the volume of requests during the period?

I don't know if I understand the question correctly. Is it about subsampling, taking just a portion of the log data?

[Audience] You have the usual metrics system that measures the number of requests, and then some kind of probe that samples the latency every once in a while.

Okay, an external probe. So the question was: can I cook something up from the total request count plus externally probed latency? The first thing I'll say is that it's good to have external probe latency, because internally measured and externally measured latency can be quite different; do also have external probes. But the main problem with external probes is that they are usually not representative. You typically have a multi-modal latency distribution, and you're picking out just one point, and usually not a random one, which is what this kind of up-sampling would require. For example, usually you just hit the homepage of your service, which will always be in cache, so it will be a very fast request. You have to be very careful to probe your API in a properly randomized way for these up-sampling methods to become effective, and I have not seen that done. Also, external probing usually drastically undersamples: you take one sample per minute, and in the examples I showed we are talking at least a hundred or a thousand samples a minute. So the errors will be pretty drastic in these cases. I hope that answers the question.

[Audience] I feel like I haven't quite understood the difference between number two and number three, because in both cases you have a histogram with counters. If your threshold falls in the middle of a bucket, you still can't tell precisely how many requests fall below it. And HDR histograms predefine those buckets for you, but conceptually it seems similar to how histograms are implemented in Prometheus.

From a theoretical standpoint, they are very similar: we count how many things fall into certain buckets, or below certain values.
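For concreteness, this is roughly what the Prometheus-style approach looks like, as a sketch using the Python prometheus_client; the metric name and bucket boundaries here are just placeholders:

```python
from prometheus_client import Histogram

# A Prometheus "histogram" is really a set of cumulative threshold counters;
# the bucket boundaries have to be chosen upfront.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5),
)

REQUEST_LATENCY.observe(0.083)  # bumps every bucket with le >= 0.083

# The SLO fraction is then a ratio of counter sums, e.g. in PromQL:
#   sum(rate(http_request_duration_seconds_bucket{le="1.0"}[5m]))
#     / sum(rate(http_request_duration_seconds_count[5m]))
```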
The difference is the cardinality and the cost economics behind it. For example, 1,000 bins is not a lot; it only covers a limited range, and the cost you are paying for it is already very high: that's 1,000 metrics per API you care about, probably multiplied by the number of nodes or endpoints you're monitoring. So it gets expensive very fast. With the HDR histograms you do sparse encoding: if a bin is not hit, we don't store it. And usually your latencies are bounded within a certain range; they don't take a microsecond one time and five minutes the next. Usually you span one or two orders of magnitude, which effectively lets the encoding record only the buckets you really need. So it's a cost and performance differential: with the HDR histograms you don't have to make any choices, it's pretty cheap, and it's as powerful as doing 46,000 Prometheus-style metrics. But yes, in theory they are exactly the same. You can view it as a clever compression that gives you a factor of 10,000 more efficiency.

Any other questions?

[Audience] Theo's my boss, so I'll take that with a grain of salt. For people who are using arbitrary bins: in Prometheus you have to choose your bins, right, and everyone chooses wrong. Every time I choose, I choose wrong later. How do you choose better? If you don't have HDR histograms and you are using Prometheus, which most people are, how do you go about choosing a better set of bins to protect yourself going forward?

I think that's a complicated question, but there are two hints. One, you have to think about where you want to aggregate your metrics. An error Google is known to have made with their choices is that different services chose different bins, and in the end they wanted to aggregate across them and could no longer do so: you can only aggregate the counts if they have the same thresholds. So you first have to get an idea, across the organization, of which teams and which service instrumentation you will actually want to aggregate across in the end. Once you have that, the other question is: where should we all agree to put our bins? And there, the only method I can come up with is to actually use log data. If you have log data, you can just select the latencies from the logs for the last three days, or whatever, and draw the histogram yourself. You can see all the modes with any Python tooling; it's very easy to build. You just do a select and draw a histogram of all the values you get (a small sketch of this follows below). Then you have a picture like this, produced from log data, and you can say: okay, where is my 90th percentile here, where is my 95th percentile? And you can take those latency thresholds and start working with them. If your API evolves, you might want to change them later on; but then you have to change them and wait a certain amount of time until you can compute those SLOs again. From a practical standpoint, this is the only methodology I can think of that makes sense.

More questions? I think that's not the case. Then I thank you again for your attention, and have a great conference.
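As referenced above, here is a small sketch of picking bucket thresholds from log data. The log format and the latency_ms field name are assumptions for illustration: one JSON object per line of an access log.

```python
import json
import numpy as np

# Hypothetical: a few days of access logs, one JSON object per line.
with open("access.log") as f:
    latencies = np.array([json.loads(line)["latency_ms"] for line in f])

# Look at the shape of the distribution before committing to thresholds.
for q in (50, 75, 90, 95, 99, 99.9):
    print(f"p{q}: {np.percentile(latencies, q):.1f} ms")

# Then pick round bucket boundaries that bracket the percentiles you care
# about; e.g. if p90 is ~870ms and p99 is ~2300ms, buckets like
# (100, 250, 500, 1000, 2500, 5000) ms leave room for drift either way.
```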