Here are a couple of light-hearted jokes about observability. Why did the metric go to therapy? Because it had too many issues with its counterparts. You're not booing me, you're booing the AI, and it will remember this. Why was the log file so popular at parties? Because it always had the latest gossip, down to the last detail. Remember, humor can be quite subjective, so what's funny for one person may not be funny for another, especially in a specialized field like observability. How are we doing on that mic? Still working on it? Let's do some more. The AI will remember that, too. You know what, it's not answering, so I think you hurt ChatGPT's feelings. Okay, we've got one more. All right, this one's pretty good. How do you comfort a JavaScript bug? You console it. Why did the time series database break up with the alert manager? It was tired of its constant overreactions to a little bit of jitter. Finally, why did the developer go broke? Because they used up all their cache. Again, you're hurting GPT-4's feelings here, everyone. All right, I think we are ready. Testing, testing. Can you hear me? All right, take it away.

Sorry, I'm going to rescue you from that. Okay, so this is going to be a quick run through sampling 101, and then I want to talk about some advanced sampling techniques. To start, we need to level set a little bit about what sampling means. Imagine you have a big bag of marbles and you want to know how many marbles are in it. You pick a few of them out, you weigh the bag, you weigh and count the handful you have, and from that ratio you can estimate the total. Now say you want to know how many black ones are in that bag. You pick a random sample, count the fraction of your sample that's black, multiply by the number you believe are in the whole bag, and you get a reasonable approximation. How good that approximation is depends on the size of your sample. If you have a bag of 10,000 marbles and you pull 10 out, you're only going to get to within the nearest thousand or so. The more marbles you take in your sample, the closer you get to the right answer.

One thing I want to address when I talk about sampling, and I'll be slightly pedantic about it: to sample means to choose a sample, so you keep the sample. People often say "I'm sampling out these things" or "I'm sampling heavily," meaning they're keeping many fewer of them, and I try to avoid those phrases because they're confusing. Once you start using those words, people can't tell which side you mean, and it always ends up being confusing. So I really prefer to use the words keep and drop when you need to be clear. If I mess up in this talk, it's because I'm human; I do it the same as everybody else.

The other piece of this is that sampling rate, sampling probability, and to some extent sampling threshold are all different ways of talking about the same thing: you're taking a subset of your items and estimating from it. If you're choosing one in 10, your sampling rate is 10, your sampling probability is 10%, or 0.1, and your sampling threshold is either 90 or 10, depending on how you define it. Those kinds of numbers will all come up when we talk about this stuff.
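To make the terminology concrete, here is a tiny Go sketch, purely illustrative and not code from Refinery or any SDK, showing how rate, probability, and threshold all describe the same one-in-ten decision.

```go
package main

import "fmt"

// Illustrative only: the same "keep one in ten" decision expressed as a rate,
// a probability, and a threshold in a 0-100 value space. You keep events whose
// pseudo-random value falls below the cutoff, or above it if you define the
// threshold from the other end.
func main() {
	rate := 10.0              // keep 1 in 10
	probability := 1.0 / rate // 0.1, i.e. 10%

	keepBelow := 100 * probability       // keep values below 10
	keepAbove := 100 * (1 - probability) // or keep values at or above 90

	fmt.Printf("rate %.0f, probability %.2f, threshold %.0f (keep below) or %.0f (keep above)\n",
		rate, probability, keepBelow, keepAbove)
}
```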
I suppose I should tell you who I am. I'm a staff engineer at Honeycomb, and I'm the lead on the project we call Refinery, which is Honeycomb's sampling proxy. What I'm going to talk about today is a bunch of the things we can do in that sampling proxy for really advanced kinds of sampling. But we're going to start with some basics.

So you're managing a network of microservices, and you can imagine there are a lot of them. A request comes in from a customer, and that request talks to your web service, which talks to your login service, which talks to your database, and all these calls go flying around the world. All of these services are sending, let's call them spans, or log messages, however you want to think of it, but basically they're recording what's happening and passing that on. So you can assemble all of those records together into a trace and say: this happened, and then all these other things happened. That's actually what Honeycomb does for a living. Let me just show you this live. This is a trace in Honeycomb. You can see the request came in to the checkout service, and the width of a bar is how much time it took, so you get this stacked trace: here are some calls, here's some processing that happens, then a longer processing call, and when it's done some more processing happens with shorter spans. So this is one user action that has caused, what's the number, 53 spans.

Now you're sending all those spans to somebody like Honeycomb, and that's fine, it's cool, you get a lot of neat information. But now multiply that by thousands of users per day, internet scale, and you can quickly be talking millions of spans. How do you deal with that? Because if you're paying by the span, at some point you need to control that. The way to handle it is by sampling.

So how do we sample? The first and most obvious mechanism, and it's available in all the libraries including all the OTel libraries, is to head sample. In other words, at the time you create a span, you roll a die. Let's say it's a six-sided die: if it's a one, we keep that span; if it's two through six, we drop it. You do that, and only one-sixth of the data escapes your system and gets into your back end. But now, in the back end, in order to get your data back you need to multiply by six, so you need to know that you were sampling at one in six. If you do that at the span level, randomly for every span, it works if your data really is randomly distributed and the spans are all independent of each other. But that's actually not true, right? We caused those spans, and they're all dependent on each other. So when you have a trace like the one I was showing a minute ago, with all those services, and you sample the spans randomly, the probability that you will have sampled an entire trace is effectively zero. The more spans you have, the worse it gets: with 10 spans at a sampling rate of one in six, roughly 1 in 60 million traces will arrive intact. By the time you get up to 50 spans, you're talking numbers like the number of atoms in the universe.

So you need to figure out a way to sample whole traces. You can still do that with head sampling. You can use the trace ID, which is passed around and shared by every span in the trace, or some other element the whole trace shares, to make a pseudo-random decision and make the same decision for everybody. So the one in six is based on, say, whether the trace ID starts with a five, or one in 16, or whatever it might be. That way you keep or drop entire traces, and the traces that arrive are intact: each span is kept or dropped based on whether its trace was kept. So yay, now we have intact traces. We can do our telemetry, we can multiply everything by six, and we're great. Right?
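Here's a minimal Go sketch of that idea, illustrative rather than how any particular SDK implements it: hash the trace ID into a pseudo-random value and make the keep/drop decision from it, so every service that sees the same trace ID makes the same decision. Real SDKs derive this decision from the trace ID bits according to the spec; the FNV hash here is just for demonstration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keepTrace makes a deterministic, pseudo-random decision from the trace ID,
// so every span in the trace gets the same answer. Independent per-span coin
// flips would instead leave almost no trace intact.
func keepTrace(traceID string, sampleRate uint64) bool {
	h := fnv.New64a()
	h.Write([]byte(traceID))
	// Keep roughly 1 in sampleRate of all trace IDs.
	return h.Sum64()%sampleRate == 0
}

func main() {
	kept := 0
	for i := 0; i < 6000; i++ {
		if keepTrace(fmt.Sprintf("trace-%d", i), 6) {
			kept++
		}
	}
	fmt.Printf("kept %d of 6000 traces (expect roughly 1000, all of them intact)\n", kept)
}
```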
Not right. When you have low-probability events, particularly when those events are less probable than your sampling rate, it means you mostly don't get those events. Say you have a thousand users generating a million spans, and your error rate is five nines, so only about 10 of those spans are errors. If you're sampling at one in 100, then 99 out of 100 errors aren't getting kept, and on any given day you may not see any of them at all.

I just want to show you what that looks like. Let's go over here for a second. Here I sent a thousand traces, and I sent 1% of them as errors. I ran this through Refinery in dry-run mode, which sends all the traces but marks which ones would have been kept and which would have been dropped. So we kept this collection of traces and dropped this collection of traces. There were nine errors in that thousand; we kept two of them and dropped the rest. So, good, we sampled some errors, but the problem is that the errors are the part we actually want to see. Tolstoy put it pretty well: all happy families are alike, but every unhappy family is unhappy in its own interesting way. And the book was about the unhappy family, because that's the interesting part. We should be telling the interesting stories, and that's what we need to be able to do: sample in a way that keeps the stuff that matters. Because all of those 200s, who cares? All your health checks, as long as they came back healthy, you're happy.

So now we want to do tail sampling. And for anybody who doesn't know, Honeycomb is big on dogs, we name all our services after dogs, so you have to have a dog in a Honeycomb talk. All right, so we're going to send all our telemetry to one proxy, or to a proxy system, that is doing tail sampling. What we do is aggregate all of the spans in the trace in one place, look at them in context, and then we can figure out what's interesting and make intelligent decisions about what to keep and what to drop, based on how interesting they are, based on how we've decided what interesting means.

One way we do this is a Refinery feature called rule-based sampling, where you literally write down a set of rules. This is a fairly common kind of setup if you're running an internet service: if any span has an error, keep it; if the status code is 500 or above, keep it. If a status code is in the 400 range, well, that's a user error, but maybe those are interesting, so sample them more frequently. Health checks: once in a while you probably want a couple of them in your data set just so you know they're happening, but mostly you can drop them, and a lot of people actually drop them completely. We don't care if health checks ever leave our system; we don't want to pay for them. And then everything else, maybe you pick a rate of 100, or whatever rate matches the volume you can manage. The bigger you are, the higher that number is. And it works pretty well.
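As a sketch of what a rule set like that boils down to, here is a small, hypothetical Go example, not Refinery's actual rule engine or configuration format: rules are checked in order, the first match decides the sample rate for the whole trace, and a rate of 1 means keep everything that matches. The field names and rates are only illustrations.

```go
package main

import "fmt"

// A hypothetical, much-simplified rule evaluator: first matching rule wins.
type trace struct {
	statusCode int
	path       string
}

type rule struct {
	name    string
	matches func(trace) bool
	rate    int // keep 1 in rate
}

var rules = []rule{
	{"keep all errors", func(t trace) bool { return t.statusCode >= 500 }, 1},
	{"user errors, sample more often", func(t trace) bool { return t.statusCode >= 400 }, 10},
	{"health checks, mostly dropped", func(t trace) bool { return t.path == "/healthz" }, 1000},
	{"everything else", func(t trace) bool { return true }, 100},
}

func sampleRateFor(t trace) (string, int) {
	for _, r := range rules {
		if r.matches(t) {
			return r.name, r.rate
		}
	}
	return "default", 100
}

func main() {
	for _, t := range []trace{{502, "/checkout"}, {404, "/cart"}, {200, "/healthz"}, {200, "/home"}} {
		name, rate := sampleRateFor(t)
		fmt.Printf("%d %-10s -> %q, keep 1 in %d\n", t.statusCode, t.path, name, rate)
	}
}
```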
But it doesn't work if things are changing rapidly. It doesn't deal well with surprises. It doesn't deal well with that customer who suddenly decides to send you a million spans a minute, and then you need to do something. One thing I should mention here: you have to attach the sample rate to the trace, because if you're sampling at different rates, you need to know how to compensate for that in the back end so you get reasonable numbers when you look at the combination of things. Otherwise, if you're keeping every error but only one in 100 of everything else, it's going to look like errors are half your data, or far more frequent than they really are.

So now we get to the real meat. What if your data isn't predictable? What if it's too complex to write general rules, or it's changing too often, or your traffic is bursty? Or, in a lot of cases, the people running the sampling system aren't the people generating the data in the first place, so you don't know what some engineer on the other side of the world is about to turn on that may overwhelm your telemetry. This is where dynamic sampling comes in. This is where we let the sampler decide what we're going to keep. We take a set of key fields, the things that matter to us. A classic example, again for HTTP: take the URL, maybe the verb, the status code of the HTTP response, and maybe any other error flags or fields we're trying to track. You attach those to your traces, use that combination of fields to build a unique key for each trace, and then evaluate sample rates based on that key. We do this mathematically: we take these keys, we watch how many of each occur, we calculate which are most common and which are least common, and we make sure we keep all of the least common ones, while the most common ones we can reduce fairly heavily. So you end up with, okay, 200s to my login service, one in 100 of those, but 500s from some other service over there that's crashing, we want to see everything from those.
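Here is a deliberately simplified Go sketch of that idea. It is not the actual algorithm in Honeycomb's dynsampler-go library, just one way to get the behavior described: count traces per key over a window, give every key a roughly equal share of a keep budget, and the rarer a key is, the closer its sample rate gets to 1.

```go
package main

import "fmt"

// ratesForWindow assigns each key a sample rate from the counts observed in
// the previous window: common keys get large rates (keep few), rare keys get
// a rate of 1 (keep everything). A much-simplified stand-in for a real
// dynamic sampler.
func ratesForWindow(counts map[string]int, keepBudget int) map[string]int {
	if len(counts) == 0 || keepBudget <= 0 {
		return nil
	}
	perKeyShare := keepBudget / len(counts)
	if perKeyShare < 1 {
		perKeyShare = 1
	}
	rates := make(map[string]int, len(counts))
	for key, n := range counts {
		rate := n / perKeyShare
		if rate < 1 {
			rate = 1 // rarer than its share: keep every one
		}
		rates[key] = rate
	}
	return rates
}

func main() {
	// Counts from the last 30-second window, keyed by "path,status"
	// (the keys and numbers here are illustrative).
	counts := map[string]int{
		"/login,200":    800,
		"/checkout,200": 150,
		"/checkout,500": 9,
		"/cart,404":     41,
	}
	for key, rate := range ratesForWindow(counts, 100) {
		fmt.Printf("%-15s count=%-4d keep 1 in %d\n", key, counts[key], rate)
	}
}
```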
So let me go back to my demo and show you an example. What I did was send batches of a thousand traces and simulate things. In this batch, the first part of the URL path has a cardinality of only two and the second part a cardinality of only four, and then I have some errors in the status and other fields, and I'm using that combination as my key. So now I'm showing you a graph of how the various field values were distributed across the thousand traces I sent. You can see that in the compensated view it says, okay, I got 192 of the "drawer clear" ones with a 200 status code, and down here is the 201 status code, there were 22 of those. Now I can say: remove the compensation, don't do the math. In that case, only 28 of the "drawer clear" traces actually made it to my back end, but because we knew the sample rate on those, around seven, the math brings the count back to 192. So the sample rate for this particular value is different from the sample rate for some other, less common value. If I go back here for a second, you'll see we have these different counts, and notice it goes all the way down to the ones: every unique combination of keys was kept, but for the popular ones we kept many fewer than we kept of the rare ones.

The way this works, basically, is that we track the keys for 30 seconds, or some number you have control over. You record how many you saw in the various buckets, then use those buckets to set rates for the next 30 seconds, while recording new counts, which you then use for the 30 seconds after that, so it adapts over time. We actually have a bunch of different samplers with different adaptation models, depending on your particular needs, and I'll show you that in a minute.

One thing I want to note is that this key strategy fails if you have too many keys. Here I sent data where the URL cardinality was 100, and then you have the additional combinations of error codes and status codes, so there are probably 500 different key values in this batch of a thousand traces. That's far too many combinations to get good data out of. So here's what I want to show you. For the first data set, this graph shows the average sample rate in the bottom panel. In the first 30 seconds the sampler just goes, okay, you told me six, I'll do six, and then it figures out the average sample rate it can actually achieve, which is about four, because of the relatively low cardinality of the key field we chose. In this other one, the cardinality of the key field was much higher, and notice that we can't get the sample rate up, because we're sampling every key at somewhere between one and two; there are just too many keys to do this sanely. The sampler is trying as hard as it can, but it can't get the sample rate up at this volume. Now, this might be a fine strategy if we had millions of keys and sampling at these rates were acceptable, but because the data is spread across so many different keys, none of them can be pushed up into a range that actually reduces your sample rate in a meaningful way.
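Tying back to the compensation the demo showed a moment ago, 28 kept "drawer clear" traces at a sample rate of about 7 reconstructing roughly 192 originals, here is a minimal Go sketch of what the back end does with the sample rate attached to each kept event. The keys and rates below are illustrative, not taken from the demo data.

```go
package main

import "fmt"

// Each kept event carries the sample rate that was in effect when it was
// kept, meaning "this event stands in for that many originals". The estimated
// true count for a key is the sum of those rates over the kept events.
type keptEvent struct {
	key        string
	sampleRate float64
}

func estimatedCounts(events []keptEvent) map[string]float64 {
	est := make(map[string]float64)
	for _, e := range events {
		est[e.key] += e.sampleRate
	}
	return est
}

func main() {
	var events []keptEvent
	for i := 0; i < 28; i++ { // 28 kept at a rate of ~6.86 reconstructs ~192
		events = append(events, keptEvent{"drawer/clear,200", 6.86})
	}
	for i := 0; i < 9; i++ { // rare errors kept at rate 1 count as themselves
		events = append(events, keptEvent{"checkout,500", 1})
	}
	for key, n := range estimatedCounts(events) {
		fmt.Printf("%-20s estimated count: %.0f\n", key, n)
	}
}
```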
So let's go back to the slides. Generally, what we're doing with dynamic tail sampling is catching the most interesting data: the more common events are dropped more often, the less common events are kept more often, and anything new or uncommon is guaranteed to show up. But if your cardinality is too high, it's a problem. As I mentioned, we have a bunch of different sampler types in Refinery that do this dynamic sampling, and they basically combine current data with past data, or don't, in different ways. We have a couple of samplers that use an exponential moving average. We have a couple of samplers designed to limit throughput, where you just say, I'd like this to work out to 10,000 spans per minute, or whatever number you want, and they optimize for throughput rather than for a target sample rate, while some of the others lean more toward a target sample rate. And then we have a windowed throughput sampler, which was actually donated by a third party, that does a different form of moving average: it keeps a record of the last several adjustment rounds, so it adapts better in certain circumstances. It was contributed by one large customer.

Since we're here talking a lot about OTel today: the OTel Collector has a tail sampling processor, but today it can't add a sample rate to the output. You can use the thing Tyler was just talking about, a transform processor, to add a sample rate, but then it has to be a constant sample rate, because the two don't talk to each other. So we can't do this dynamic sampling concept in the OTel Collector today. The good news is that I'm on the sampling SIG, and there's a spec revision, an OTEP, out right now that will let us attach a sampling threshold, which as I said is equivalent to a sample rate, to traces, such that it can propagate through your system. That solves half the problem. The other half is that the Collector's tail sampling processor composes samplers by making a series of individual binary decisions, and I haven't yet found a good way to attach the correct sample rate when the decision comes from a sequence of binary decisions. So I want to keep thinking about how we can get this into the tail sampling processor as we go forward, but it's going to need some serious rework.

If you're going to give feedback on the talk, I'd love it if you'd pass that along. And if you get a chance, come by the Honeycomb booth and check out the stuff we're doing; we're up at 22. Questions?

The question is: how are the keys selected? The keys are controlled by configuration. You decide, as the user, which fields are meaningful to you as keys, and then for any given trace, the sorted combination of all of the values of those fields across all the spans in the trace is assembled to make the key. Yes, there is: the keys are per trace, and there is a semantic assumption that a given field name means the same thing on all the elements of the trace. If you have a status field, for example, and it isn't actually a status code in some cases, you might want to do something in your telemetry to copy it to a uniquely named field.
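Here is a small Go sketch of that key-assembly idea, a hypothetical helper rather than Refinery's actual implementation: for each configured key field, collect the distinct values seen on any span in the trace, sort them, and join everything into a single key string. The field names are just examples.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// traceKey builds a per-trace sampling key from the configured key fields:
// the sorted, de-duplicated values of each field across all spans in the trace.
func traceKey(spans []map[string]string, keyFields []string) string {
	var parts []string
	for _, field := range keyFields {
		seen := map[string]bool{}
		for _, span := range spans {
			if v, ok := span[field]; ok {
				seen[v] = true
			}
		}
		values := make([]string, 0, len(seen))
		for v := range seen {
			values = append(values, v)
		}
		sort.Strings(values)
		parts = append(parts, field+"="+strings.Join(values, "|"))
	}
	return strings.Join(parts, ",")
}

func main() {
	spans := []map[string]string{
		{"http.route": "/checkout", "http.status_code": "200"},
		{"http.route": "/payment", "http.status_code": "500"},
		{"http.route": "/payment", "http.status_code": "200"},
	}
	fmt.Println(traceKey(spans, []string{"http.route", "http.status_code"}))
	// Output: http.route=/checkout|/payment,http.status_code=200|500
}
```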
I'm sorry, could you repeat that? Anomaly detection? Okay, the question is about integrating sampling with anomaly detection. Generally, in that circumstance, what we'd probably say is: use the rule-based sampler and write rules that define the particular anomalies you're looking for. We're not doing automatic detection of anomalies based on the overall shape of your data or anything like that. You can specify a rule that says these are the things I care about, and then basically set the sampling on those to zero so they're all kept.

Somebody here at the mic? Hey, great talk, thanks. I want to ask: you described head-based sampling, tail-based sampling, dynamic sampling. Does this all happen in an agent on my infrastructure? Typically, head-based sampling happens either at the point where you're producing telemetry, in the library you're using within your application, where you can set a sample rate, or in an egress agent within a pod or something like that, a local collector, where you can also set a sample rate. I say head and tail, but really sampling can happen in a variety of places along the pipeline. So my question is: say I have functions as a service, maybe in a cloud, maybe a couple of pods, maybe a local environment. I don't want to set up infrastructure for a collector or an agent to sample. Is it possible that I just send everything to Honeycomb and configure the sampling in your cloud? Not today. It's been discussed from time to time, but it's not something we're doing today. So I'd have to set up infrastructure where I run your agent or the OTel Collector and configure it. Right. You're either running a collector and doing it there, or you're running Refinery as an endpoint within your services, and Refinery passes that data on to Honeycomb. Okay, thanks very much. You're welcome.

Great talk, by the way. Quick question. You've mentioned rule-based sampling and dynamic sampling. In some cases, don't you need rule-based sampling to augment some of the dynamic stuff? You don't want to collect all those successful samples because it isn't going to give you anything useful. Right, the question is: do you sometimes need rule-based sampling anyway, even with dynamic sampling? And yes. In fact, what we often do is write a rule to capture the particular circumstances you want to make sure you never miss, or a rule that says, look, this service over here is too chatty and I really need to cut it down. But then, within those rules, you can also say: if the rule doesn't match these conditions, fall back to a dynamic sampler. So the dynamic sampler becomes your fallback for everything else. Together? Yes, that works today. Yeah, we don't have an adaptive model for adjusting the rates within rules; a rule is basically a fixed-rate kind of thing. You can do that sort of detection in the back end, and once it happens you can send alerts and things like that, but you would then have to modify the configuration to deal with it.
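Here is a tiny Go sketch of that composition, a hypothetical shape rather than Refinery's actual API: explicit rules are checked first, and when none match, the decision falls through to a dynamic sampler's per-key rate.

```go
package main

import "fmt"

// dynamicSampler is a stand-in interface for something like the per-key
// dynamic sampler sketched earlier.
type dynamicSampler interface {
	RateFor(key string) int
}

// fixedRate is a trivial placeholder implementation for the example.
type fixedRate struct{ rate int }

func (f fixedRate) RateFor(string) int { return f.rate }

// sampleRate applies explicit rules first and falls back to the dynamic
// sampler when no rule matches.
func sampleRate(statusCode int, key string, fallback dynamicSampler) int {
	switch {
	case statusCode >= 500:
		return 1 // rule: always keep errors
	case statusCode >= 400:
		return 10 // rule: keep user errors fairly often
	default:
		return fallback.RateFor(key) // no rule matched: ask the dynamic sampler
	}
}

func main() {
	dyn := fixedRate{rate: 50}
	fmt.Println(sampleRate(500, "/checkout,500", dyn)) // 1
	fmt.Println(sampleRate(200, "/login,200", dyn))    // 50
}
```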
I think I'm getting the cue to wrap up. Oh, wait, one more? Okay. Hi, thanks for the talk. Can you clarify which parts of what you talked about are open source and self-hostable, and which are Honeycomb-specific? All of it: Refinery is fully open source, along with our dynamic sampling library, dynsampler-go, which is also open source. And as I said, Tyler, who works with me, and I, and our whole team work on the Collector, and we're chewing through how to move this stuff forward so that we can get it into the Collector entirely. That'd be great. It's great stuff. Thanks. Thank you very much.