Testing, it's on! All right, this is going to be my clicker, so that's why I'm holding this mouse. Just a second. Welcome, everyone. Thank you all so much for coming out today. I know it's been a long week, probably for all of you. And it's been a long day. And I'm so flattered that you chose to come to my session for your last session of the day. So just to make sure we're all in the same room, my talk is the why, how-to, and issues of tail-based sampling in the OpenTelemetry Collector. And yes, I'm pretty sure I maxed out the number of letters allowed for a KubeCon title. I'm going to have to stand here.

All right, to start, I'm going to take you through our agenda for this session. I'm going to start with a brief refresher on OpenTelemetry, the Collector, and distributed tracing. Then we'll get into a sampling overview, where I'll cover not just what sampling is, but also why you'd sample. Then we'll see sampling in action with a live demo by yours truly. And we'll wrap up with concerns and limitations so that you're aware of the challenges.

The first question I'm going to answer for you today is: who am I? My name is Reese Lee. I'm a developer relations engineer on the OpenTelemetry community team at New Relic. I'm based in Vancouver, Washington. I'm passionate about helping observability end users get and understand useful data from their systems, so I'm very pleased to share this presentation with you today.

OK. First, a quick refresher on these core concepts, as I'm sure most of you are already familiar with them at this point. If you do want more information, I'll have some resources at the end that you can check out. What is OpenTelemetry? In 2019, two competing open source instrumentation projects, one called OpenCensus and one called OpenTracing, were merged, forming OpenTelemetry. It is now the second most active CNCF project after Kubernetes. It's a unified standard for instrumenting, generating, collecting, and exporting telemetry (metrics, logs, and traces) to help you analyze your software's performance and behavior. And it does so by providing a set of APIs, SDKs, and tools, including a component called the Collector.

What is the Collector? It's essentially an extremely configurable system for processing telemetry data. It's made up of three main components that access that telemetry data: receivers, processors, and exporters. Some of the things a Collector can be configured to do include sampling, collecting host metrics, scrubbing data, and normalizing data.
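To make that a little more concrete, here's a minimal sketch of a Collector config.yaml wiring those three component types into a traces pipeline. The component names here (an otlp receiver, a batch processor, and a logging exporter) are common illustrative choices, not the configuration from this talk's demo:

```yaml
receivers:
  otlp:                  # accept telemetry over the OTLP protocol
    protocols:
      grpc:

processors:
  batch:                 # batch telemetry before exporting

exporters:
  logging:               # print telemetry to the console

service:
  pipelines:
    traces:              # a pipeline: receivers -> processors -> exporters
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
```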
What is distributed tracing? Distributed tracing is a method of observing requests as they move from one service to another in a distributed system. It's important for helping us understand our systems, such as how our services are interconnected, and it can be useful for diagnosing problems, such as where latency is occurring.

Also, what is a trace? A trace is made up of spans. Spans represent logical units of work within a request during a specific period of time. An example of a span would be an HTTP call or a database call. And in this very intricate diagram that I created here, you can see a trace, a request as it moves through three different services. The first span of any given request, or trace, is referred to as the root span. On the next slide, I'm going to show you an example of what a request will look like in a tracing backend that we send our trace data to, such as Jaeger.

So Jaeger, I'm sure most of you are already aware of what it is. It's an open source distributed tracing platform that visualizes your service requests as traces. There are also other open source tools and multiple backend vendors that you can send your trace data to as well. Let me get my mouse over. So each of these lines here represents a span, and all of these spans make up a trace. You can also see the duration of time that each span has taken, along with the total duration of the entire trace.

So far, distributed tracing sounds great, right? However, there are some things you need to take into consideration. If your system is producing thousands of traces per minute and you haven't set up any kind of sampling strategy, which is to say you are capturing, storing, and indexing every span of every request, well, one, the cost of tracing could become higher than the cost of running the service itself. And two, it can make it difficult to see if any issues are occurring within your system. So what can we do? Not a trick question. We can sample.

OK, before I get into the sampling overview, can I get a quick show of hands? Is anyone currently using trace sampling in your organization, or do you know whether your organization is? Excellent. OK. So it looks like some of you are probably really familiar with sampling. But in any case, I will cover what sampling is and why we might want to sample, and then we'll talk about head-based and tail-based sampling. Excuse me.

What is sampling? To keep or not to keep a span or a trace, that is the question that sampling answers for us. The idea behind sampling is to reduce the number of created or sampled spans. Sampling can be implemented at different stages of span processing. The earliest is before a span is even created, otherwise known as head-based sampling; the latest is after all the spans have ended, also known as tail-based sampling. Next, I'll talk about why we might want to sample.

So earlier, I showed you a meme that I created of a human with a fire hose of spans to illustrate how not having a sampling strategy in place could negatively impact your organization. But why exactly would I want to sample? Different organizations will have different reasons for not just what they want to sample, but also why they want to sample. Some teams want to see only interesting traces, or they might want to filter out noise, such as health checks. And I'm going to pause here for a second to explain what I mean by interesting traces. You'll also hear me use the term traces of interest. I'm just referring to the specific trace data that you or your teams might be interested in. For example, as an app developer, I might only be interested in error traces for debugging purposes. My front-end team might be interested in traces with specific attributes. Another way to think about sampling: if 99% of the traces your system is producing are 200s and finish without errors or latency, do you really need all that data? The thing is, you don't always need a ton of data to find the right insights. You need the right sampling of data.

So now we're going to talk about a couple of sampling strategies, starting with head-based sampling, and then we'll get into tail-based sampling. Head-based sampling is simply where the sampling decision is made before a span is even created, at the start of a trace, which makes it simple. It's efficient because the sampling decision gets propagated down to all child spans, so it never has to wait until all the spans in a request have finished.
And it's unbiased because it never looks at the trace contents to make a sampling decision. Today, OpenTelemetry SDKs ship with a number of built-in head-based samplers: we've got parent-based, always-on, trace ID ratio-based, and there's also always-off.

I'm going to talk about the built-in samplers a little bit now. The OpenTelemetry default sampler is actually a composite of two of the samplers. We have the parent-based sampler, which takes a required parameter for what you want to use for your root spans. That's why it says root is always-on, because we are using the always-on sampler. As the name suggests, it will always sample the root span. The way the sampler works is by asking a few questions of every span. First, are you a root span? If yes, well, guess what, I'm always going to sample you. If you're not a root span, that means you have a parent span, and now it's going to ask, OK, was your parent sampled? If yes, hop on the train, you're getting sampled. And if not, you're not getting sampled. In other words, the sampling decision gets propagated down to all the child spans. So this is how the default sampler works to collect every span of every request in your system, assuming all your services are instrumented with OpenTelemetry and using the default.

OK, this diagram, which I also painstakingly, lovingly created for you all, shows an example of what we might see in the tracing backend we're using. The blue dots represent the root spans, and the blue rectangle represents where the sampling decision is made. The green dots represent sampled spans, and you can see the entire trace is sampled. And for some flavor, we have the red dots, which represent spans with errors. You can see we get a full view of all the traces our system produces.

Next, I want to talk about the trace ID ratio-based sampler, but hold on a second. OK, the name of this sampler is a mouthful for me. This sampler uses the trace ID to make a sampling decision with respect to the sampling rate that you configure. In OpenTelemetry, when we combine this sampler with the standard random trace ID generator, we get a mechanism by which to do probabilistic, or random, sampling. This diagram gives a general high-level overview of how the sampler works. On this side, it's randomly decided, OK, I'm going to sample everything with trace ID 1, so all spans with trace ID 1, you're getting sampled. On the other side, you see it's decided, OK, I'm not going to sample trace ID 2, and therefore all those spans will get discarded. I do want to note here that if you have multiple connected services instrumented with a variety of language SDKs, because they're in different languages, the OpenTelemetry specification recommends that you use this sampler only for root spans. So where earlier we saw the root parameter take in the always-on sampler, we'll simply switch it out for the trace ID ratio-based sampler. We'll see that example in the demo coming up.

Here's a diagram showing an example of what we might see in our tracing backend if we're using the trace ID ratio-based sampler: essentially, a random sampling of traces. You may or may not always see your traces of interest, which in this case is the error trace. OK, so that's all I wanted to cover for head-based sampling.
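To make the samplers we just covered concrete, here's a rough sketch of how they can be wired up in the Python SDK, one of the languages we'll see in the demo. The 25% ratio is just an illustrative value:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_ON,
    ParentBased,
    TraceIdRatioBased,
)

# The OpenTelemetry default: always sample root spans, and have child
# spans follow their parent's decision.
default_sampler = ParentBased(root=ALWAYS_ON)

# Probabilistic sampling: keep roughly 25% of traces, deciding at the
# root span and propagating that decision to all child spans.
ratio_sampler = ParentBased(root=TraceIdRatioBased(0.25))

trace.set_tracer_provider(TracerProvider(sampler=ratio_sampler))
```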
Let's talk about tail-based sampling. The biggest difference between head-based sampling and tail-based sampling is that with tail-based sampling, as the name suggests, the sampling decision is made only after all the spans in a request have finished. This means we're able to filter traces based on specific criteria. It's useful for efficiently seeing those traces of interest, and it's optimal because we get to keep all our spans in context and don't get broken traces when we need them most.

Today, to do tail sampling using OpenTelemetry, you have to stand up a Collector, which I mentioned at the beginning, and implement the component called the tail sampling processor. Multiple policies exist today, and you also have the flexibility to add more if you like. This is just a short list of the policies that you could use today.

I'm going to take you through the screenshot here. This shows an example configuration for the tail sampling processor in our Collector config.yaml file. So here we've got our tail sampling processor, that's the name of it. The first three lines here are all optional settings you can configure or not. decision_wait is the time to wait from the beginning of the trace before making a sampling decision. num_traces is the number of traces kept in memory. And expected_new_traces_per_sec is, well, the expected number of new traces per second. Finally, we have the policies section here. There is no default, so you will have to define at least one policy to use the tail sampling processor. In this example, I have two policies set up: a status code policy and a probabilistic policy. This will essentially get me all my error traces, as well as a random sampling of the remaining traces (there's a sketch of this configuration below).

This is an example of what we might see in our tracing backend using the exact configuration I just showed you: our error traces, as well as a random sampling of the remaining traces. If we decide to only use the status code policy, this is what we would see, just the error traces.
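Here's a sketch of that configuration, with illustrative values for the optional settings; the policy names are arbitrary labels:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s                 # wait this long from the trace's first span before deciding
    num_traces: 100                    # traces kept in memory
    expected_new_traces_per_sec: 10    # helps size internal structures
    policies:
      - name: errors-only              # keep every trace containing an error
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: random-sample            # plus a random sample of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 25
```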
OK, so how is everyone feeling about sampling? Good, great, awesome. Thank you. If you have any lingering questions, please hold on to them, and I'll try to get to them during the Q&A, or hopefully some of them will be cleared up with this live demo I'm about to show you.

OK, so first I'll give you the context of the demo, then I'll walk you through the scenarios we're going to see, and then we'll do a quick demo reflection. For the demo, I have three services that always call each other, and what I mean by that is just that service one always calls service two, and service two always calls service three. I have a load generator that makes 20 calls to the first service, and each time it is run, it produces exactly one error. For the purposes of this demo, I have decided that my traces of interest are traces with errors, and we're going to be exporting all our traces to Jaeger, so we'll be looking at our traces in Jaeger. My goal here is to find out which sampling strategy is optimal for getting me what I want.

First, we're going to start with a couple of head-based sampling scenarios. We're going to take a look at using the default, and then the trace ID ratio-based sampler, and then we'll wrap up with a couple of tail sampling scenarios where we use the status code policy and then add in the probabilistic policy.

All right, let me see if this is going to work. I think that worked. Oh, oh boy, what's happening? OK. So I'm going to be switching between three separate windows. We're going to have our Jaeger UI over here. I'll have my text editor here, where I'll show you the SDK configuration. And finally, I will have my terminal, where we are going to restart our services after making changes and also run our load generator.

OK, we don't need that. You can stay there. All right, so we're set up to use the OpenTelemetry default, so I'm going to go ahead and run my load generator. And I'm going to come into the Jaeger UI, and let's see what we got. Oh boy, please hold. You know, they warned me about doing live demos, and I said, I'm going to do it anyway. Are you not showing up? OK, please hold. So if I'm using the OpenTelemetry default, and my load generator is making 20 calls to the first service, how many traces do you think I'm going to see when I run my load generator? Someone just shouted out 20. You are absolutely 100% correct. Now I would love to show you, if I can figure out why this is. OK. So for those of you who are not as familiar with Jaeger, on the left-hand side is a filter nav menu. We are primarily going to be focused on the right side, where we'll see our traces. So here we have 20 traces. You are so correct. As I'm scrolling through here, each of these lines is a trace. If I click on one, this might look familiar; it was in the screenshot that I showed you earlier. These are all spans, and you can see the different services. And if I scroll down a little bit more, hey, look, there's my error trace that I wanted, along with a lot of not-so-interesting traces.

Let's see what happens if I run my load generator a couple more times. Now I want to come in here. If you guessed that we will now see 60 traces, you are correct. I'm so glad that worked just now. OK, so 60 traces. And if I scroll down, I will see, hey, there is one of my error traces. And I should see one more. There you are. And then our third one, which was from the first time we ran the load generator. Perfect, there we go. So I've got all my error traces, which is what I wanted, but I'm having to sift through a lot of uninteresting data to get to them.

So now I want to try probabilistic sampling, because I'm getting too many traces. I'm going to come into my first service, which is a JavaScript service, so we're going to go into the tracing.js file. And here, I already have it written out, so I'm simply going to uncomment it. So now I'm passing in the sampler. As you can see, I'm only using it on root spans, per the OpenTelemetry specification recommendation. That was a lot of multi-syllable words at once. Now I'm going to do it for my second service, which is a .NET service. And finally, I'm going to do the same thing for my third service, which is a Python service. You can also see here that the sampling rate I've configured is 25%. So now I'm going to go into my terminal and restart my first two services for the changes to take effect. The Python one auto-reloads, so I don't have to restart that one.

So now let's run my load generator. OK, so pay attention to this top number here, 60. What do you think we're going to get? Maybe. So close, seven. So we have seven new traces. And hey, look, I got an error trace. That's cool. And a random sampling of all the other traces. And you can see here, too, right here it says a few seconds ago, and this one was two minutes ago, so all the ones above this are my new traces. So let me go ahead and run this a couple more times, and let's see what we see. OK, so the number is 67. And it updates to 74. Seven new traces. Yes, I think my math is correct.
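As a quick sanity check on those numbers: with a 25% ratio applied at the root, each of the 20 calls per run is kept with probability 0.25, so we'd expect about 20 × 0.25 = 5 new traces per run, with a standard deviation of roughly √(20 × 0.25 × 0.75) ≈ 1.9. Seeing seven in a single run is well within normal variation.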
Actually, that seems a little low. 74 traces. OK. Oh, and that's so interesting. Using the trace ID ratio-based sampler this time, it looks like I got two of my error traces, which is interesting. I've run this demo and had zero error traces show up; I've had one error trace show up. So it really is randomized, as you can see.

However, I want to see my error traces more efficiently. So now I'm going to go ahead and try the tail sampling processor. In order to do that, I'm going to come into my services and comment the trace ID ratio-based sampler back out. And I'm actually going to use the always-on sampler, because I want to be doubly sure that all my spans are getting sampled. You could use the default as well; I want to be extra sure, so that's why I'm using the always-on sampler. So I'm going to go ahead and comment out the old one. And go ahead.

Now I want to show you what that looks like in the OTel Collector config.yaml file. You'll remember the screenshot that I showed you of the example configuration. Here I'm using the decision_wait optional setting, and I've already got a policy here that will get me all error traces. And the cool thing is I can have this set up, and if I don't want to use it, I simply don't have to include it in the pipelines section for my traces. As soon as I'm ready, all I have to do is add it into my pipeline (there's a sketch of the pipeline wiring after this demo section), save it, not forget to restart the Collector, and then also restart two of my services. All right, it looks like she's ready.

OK, so now I want to run my load generator. What do you think we're going to see in Jaeger? Yes, I heard one. And you'll see, since I restarted the Collector, this number here is going to reset. So: one error trace. That's awesome. Let's run it a couple more times. And we will now see three traces, and they're all my error traces. So this is great. I'm seeing the traces I wanted, but now I'm thinking, maybe this isn't quite enough information, because it can be useful to have a random sampling of all your other traces to see whether anything else might be occurring, or if you want to improve typical operations within a given service.

So now I'm going to go back in here, and I already have a probabilistic policy defined, so I will uncomment that. Let's restart my Collector, and let's see what this new configuration gives us. All right, so this is going to reset. And I have four traces now. So I've got my error trace and a random sampling of all the other traces. If I run this a couple more times, it's going to go from four to 15. So I've got eleven new traces. Hey, there's my error trace. There's another one. And there's my third one, from the first time we ran the load generator. And this is great. Now I've got my error traces and a random sampling of all the other traces.

Okay, now I have to, well, it's going to be a little bit anti-climactic, hold on a second. Where is the, how do I stop the mirroring? No. Oh, did I lose it? Okay, well, I appear to have lost the display window. Never mind, that's okay. That's what you were supposed to see right after. We've seen how to implement tail-based sampling using OpenTelemetry, and also how tail sampling can be optimal for getting us what we want efficiently. And I apologize, I keep hitting my mic. I'll stop doing that.
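For reference, the pipeline wiring I mentioned a moment ago might look roughly like this; the otlp receiver and jaeger exporter names are assumptions based on this demo's setup, not taken from the actual file:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]   # remove this entry to disable tail sampling
      exporters: [jaeger]
```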
Just to do a quick demo reflection. Using the OpenTelemetry default, we always saw our traces of interest, but we also saw everything else. Then we wanted fewer traces, so we used the trace ID ratio-based sampler for our root spans. Now we got a random sampling, but we didn't always see our traces of interest. So then we tried the tail sampling processor. When we used the status code policy, we only saw our errors. We decided, hey, it is useful to get a random sampling of all the other traces, so I added in the probabilistic policy, and we saw that we got all our error traces plus our random sampling, which was awesome.

And now that I've talked up tail sampling so much and hyped it up, I'm gonna bring the mood down a little bit by talking about the challenges. We'll start with some general concerns around tail sampling, and then we'll get into some OpenTelemetry-specific limitations.

The first one is probably the one we see most consistently, and that is performance. Since with tail-based sampling we have to wait until all the spans in a trace have finished, we have to hold those spans in memory somewhere until the last span has finished. And this, of course, can eat up application resources if the spans are stored locally, or additional network bandwidth if your spans aren't stored locally.

Additionally, determining the interesting traces. Earlier I said you don't need a ton of data to find the right insights, you just need the right sampling. Well, there's gonna be some work involved with figuring out what exactly that means for your team. You're gonna have to ask some questions to know what to sample.

Data ingest and storage costs. Sampling is supposed to help us manage these costs, right? However, let's say you are using the tail sampling processor to get you only latency traces. Well, if you're suddenly experiencing severe network congestion and your tracing solution is now exporting a ton of latency traces, you're gonna see a spike during that period in data egress, storage costs, and potentially data ingest, depending on your backend vendor and their pricing model.

And with OpenTelemetry specifically: to do tail sampling using OpenTelemetry today, you have to stand up a Collector. There's no way to get around it at this time. And while the Collector can be really useful and practical in terms of centralizing configuration and performing a wide range of data processing duties, it is still one more piece in your environment to implement, maintain, and consider.

Additionally, all the traces need to be complete. All spans of a particular trace need to end up in the same Collector for tail sampling to work properly, which brings me to the scalability issue. For a simple setup like what I showed you with the demo, with three demo services and not a lot of traffic, one Collector is sufficient. However, the more traces you have being kept in memory, the more memory you're gonna need, and the more computing and processing power you're gonna need to look at each span to see if any of them fit your bill for an interesting trace. After a certain load, one Collector is not gonna cut it. So now you have to look at Collector deployment patterns and think about load balancing. There's no smooth transition here, so hold on while I get some water.

So since one Collector is not gonna be enough, you're gonna want to implement a two-layer system in which the Collector is deployed in an agent-and-collector configuration. And since all spans of a particular trace, like I said, need to end up in the same Collector, each Collector needs a full view of all the traces it receives. This is because if you have spans going to different Collectors and they're each getting sampled, you're gonna end up with fragmented traces, which means you're gonna get traces with lots of gaps, and you may not necessarily know where exactly a problem is occurring or which service has an issue. There does exist today an exporter called the load-balancing exporter that you can use if you're running multiple instances of the Collector with the tail sampling processor. So there is hope.
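Here's a sketch of what that might look like on the first layer of Collectors, following the load-balancing exporter's documented shape; the hostnames are placeholders:

```yaml
exporters:
  loadbalancing:
    routing_key: traceID      # route all spans of a trace to the same backend collector
    protocol:
      otlp:                   # the load-balancing exporter speaks OTLP to the second layer
        tls:
          insecure: true
    resolver:
      static:                 # a fixed list of second-layer collectors
        hostnames:
          - collector-1:4317
          - collector-2:4317
```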
And finally, the future of the tail sampling processor. It sounds a little ominous, but it's simply referring to an open issue right now in the Collector contrib repository around replacing the tail sampling processor. The conversation is more around whether breaking up some of the policies and building them out into their own processors would be more performant than, or as performant as, using the tail sampling processor. If you're interested in this discussion, I encourage you to reach out to Juraci Paixão Kröhling. He is from Grafana, and he is kind of the mastermind behind this idea. In summary, tail sampling is great for efficiently getting you what you want, but there are gonna be a lot of challenges with figuring out what exactly that means for your team, as well as with implementing it.

A couple of quick future-ish things I wanted to touch on. There's work being done right now to add probabilistic log sampling to the OpenTelemetry specification, which is pretty cool. I encourage you to head to the Collector contrib repository to find out more. I also wanted to talk about contributing. If you're interested in getting involved with OpenTelemetry, now is a very exciting time. Whether you want to make code contributions or help improve the docs, I know the community team is very excited to have all of your great minds involved. So definitely head to... oh, actually, I have a slide at the end that will have some links for you to check out.

And a big shout-out to everyone on this list, and to all of you who came today, and to anyone watching the recording. I appreciate you all so much for being here. All right, if anyone has questions, please head to the mic. It looks like I am right on time, but I see a gentleman back there, so let me leave this slide here for you. This QR code here, if you have feedback that you would like to share with the OpenTelemetry community team about your experience, whether you've used it or what's stopping you from using it, we would love to know. And yes, what is your question that I can hopefully answer?

Thank you for your presentation, first of all. I have a question: how does the Collector know when all the spans have been collected?

That is a great question. And it might... so I think that's one of the challenges with tail-based sampling in general. There is that configuration setting, decision_wait, so you can have it set to a longer period of time. I think the default is 30 seconds, which is a pretty long period of time. The problem, of course, will be if you have spans that are finishing really late or taking really long.
So I don't have a great answer for you about that at the moment. I do encourage you to bring this discussion to the OpenTelemetry Slack channel, or feel free to come to one of the special interest group meetings. But I'm also actually curious about that, so if you want to hang around, I'm gonna grab your contact info, and then if I find out the answer, or if you find out the answer first, let's share it. So it's time-based, basically, right? What is it? So it's timeout-based; we are waiting for a specific amount of time. Okay, thank you. Yeah, you're so welcome.

Okay, we have one more. So there have been a lot of talks today, and there will be tomorrow, about eBPF: eBPF this, and eBPF that. And it seems that eBPF can do all this magic, like how much time did your call take and how much time each function took and all of that. And it seems to be replacing distributed tracing in that way. So do you see eBPF as a distributed tracing killer? Because you said that distributed tracing, and its OpenTelemetry implementation, has overheads and performance implications, and eBPF is advertising itself as something that comes practically for free, because it's only in kernel space and it's very lightweight and all of that. So how do you see these competing?

That is a great question. And I'm wondering if anyone from the Pixie team would like to answer that question. Do we have the Pixie team here? Yes. You can join as well.

I was just going to chime in and say, I mean, there are a lot of things that eBPF can do. eBPF is great at capturing spans, at least in the context of Pixie. But just to be completely fair on both sides, there are pros and cons, right? One thing that eBPF doesn't do, at least when you're trying to observe, is that you can't really throw tags into requests. So trying to trace something across multiple hops, like distributed tracing, like what Reese is showing, is actually a little bit more difficult to do with eBPF, right? So you're good at capturing spans with eBPF, but getting distributed traces, that's something that OpenTelemetry is very good at, right? So there are pros and cons with these things. It's not just one tool fits all, right? So just a quick caveat on that. Thank you.

Oh, I'm so excited that you're here. And that is all I have for you today. Thank you all so much for being here. I hope you enjoy your stay in Valencia if you're visiting.