Hello everyone, my name is Juraci Paixão Kröhling. I'm a software engineer at Grafana Labs, and with me here today I have Alexandre Magno.

Hi everyone, thank you for joining us in this session, where we will share some experience with the OpenTelemetry Collector, especially the tail sampling feature.

Wonderful. So we're covering here today a few use cases of the tail sampling processor for the OpenTelemetry Collector, and how Pismo used it to reduce their observability costs.

Well, our company is under exponential growth, and the cost of observability was growing together with it. It was a problem we needed to handle, and we had to take some actions to solve it. That's the reason we needed to consider implementing some kind of sampling: our cost for observability traces was growing very fast. With that in mind it was necessary to take some actions, which we will share in more detail in the next slides.

All right. So in the next 20 minutes or so we're covering: what is sampling? What sampling strategies are available to us? Why did Pismo need sampling, and how did they implement it? What outcomes did they get, and what does the future hold for Pismo when it comes to sampling?

Now, when we talk about sampling, we can apply the same ideas or concepts to all of the signals, right? We can apply sampling strategies to logging, and we can apply data reduction to metrics and to traces. So I came up with a very simple definition: sampling is a technique we can use to reduce the amount of telemetry data that is generated or stored. So we either generate less data or we store less data.

Now, we typically think about tracing when we talk about sampling, and because we only have 20 minutes here, I'm not covering what sampling means for metrics and logs; we're focusing on traces. Sampling for tracing is basically the idea of limiting the number of traces we keep by making a decision and then applying that decision to all of the spans belonging to the same trace. We can make the decision at the very beginning of the telemetry data creation, at the root span as we call it; we can apply it at the collector; and we can apply it after the trace has finished. There is a small difference between the second and the third points that we'll get to later.

Now, trace-based sampling at the start, head-based sampling, or just head sampling, is applied at the very beginning of the trace, and we typically use a probabilistic strategy for it. We can use other strategies as well, but typically when we talk about head sampling we are talking about making a decision based on some probability. So we say that for a service X we are sampling, or selecting, only 50% of the data we are creating, and then we include that decision in the context that we propagate down to the other spans of that same trace.

Now, the advantage of doing sampling at the head, at the very beginning of the trace, is that we don't have any transmission costs for non-sampled data. We might still generate the data, depending on the sampler we are using, but we are not sending it out. It's also relatively easy to configure.
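As an illustration of how simple the head-sampling setup can be, here is a minimal sketch using the standard OpenTelemetry SDK environment variables on a hypothetical Kubernetes Deployment; the service name and image are placeholders, and the 50% ratio matches the example above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service            # hypothetical service, not from the talk
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout-service
          image: example/checkout:latest   # placeholder image
          env:
            # Sample roughly 50% of new traces at the root span; child spans
            # follow the decision propagated in the trace context.
            - name: OTEL_TRACES_SAMPLER
              value: parentbased_traceidratio
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.5"
```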
So when you are starting with tracing, you are typically taught how to do sampling for your services. Some libraries even come with a sampler already on by default, so they reduce the amount of data out of the box.

It's harder to apply consistently, though. While you can configure a specific number or a specific probability for one service, you have to make sure, or perhaps you want to make sure, that the same strategy is applied to all of your services, and that's harder when you have hundreds of services. And perhaps the most surprising downside is that a probability of 10% does not necessarily mean that we are getting 10% out. When I flip a coin ten times, the probability of getting heads or tails is 50-50, right? But in reality we get 6 and 4, or 7 and 3, and so on. So a probability of 10% is not necessarily 10% of the output.

We have a second strategy, and that is doing sampling at the collector. What we can do there is apply a decision consistently to all of the spans getting there, based on the trace ID: I can hash the trace ID, do a hash mod n, and then make a decision that is consistent for all of the spans within the same trace. We don't have to keep the whole trace in memory, which is expensive, to apply a decision, and the decision is consistent; but then again, we have the same problem with probability. It is easy to configure for all of the services at once, because if all of the telemetry data goes to the same collector, we can centralize the configuration at that collector and change and tune it based on our current needs. It's relatively easy to configure, but it does come with the operational cost of a collector, and the collector is not very cheap to run: you have to install it, deploy it, maintain it, take care of it. And again, we have the same problem with statistics. (A configuration sketch of this approach follows at the end of this comparison.)

Now, we have a second way of doing sampling at the collector, which is tail sampling, and we typically use that for complex strategies, so we can apply different strategies depending on what the data looks like. One of the characteristics of this strategy is that we have to keep data in memory for some time. We keep all of the data in memory for a specific amount of time, and once we deem a trace to be ready, we make a decision based on how the trace looks. What this means is that it requires all of the spans to be sent to the same collector. If we have, say, ten collectors because we are scaling out our infrastructure, and I do round-robin load balancing of my data, then parts of my traces get to one collector and parts to another, and when I run the decision, I only decide based on partial traces, and that decision might be wrong. So it requires all of the spans from the same trace to go to the same collector.

The advantages are that I can really pick the exact traces that I want. I can say I want 100% of the errors, for instance. I can apply multiple and complex conditions, so not only probabilistic strategies, but probabilistic combined with errors, and so on. It does come with even higher operational cost, as you're going to see in a couple of minutes, and by now you have a stateful collector, and scaling stateful workloads is somewhat harder than scaling stateless ones.
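Here is the sketch promised above for the simpler, collector-side probabilistic approach: a minimal, illustrative configuration (not Pismo's) using the probabilistic_sampler processor, which makes a consistent per-trace-ID decision without buffering whole traces; the backend endpoint is a placeholder:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  # Hashes the trace ID, so every span of a given trace gets the same decision.
  probabilistic_sampler:
    sampling_percentage: 25

exporters:
  otlp:
    endpoint: tracing-backend.example.internal:4317   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp]
```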
And sampling in general has downsides as well, right? You don't use sampling by default; you only do sampling when you actually need to, because it's harder to compute statistics after the fact. If you are dropping data, you don't get that data back, so if you haven't collected statistics before dropping it, it's very hard to understand what you actually missed, what you actually threw out.

It also potentially misses important telemetry data. Say you are waiting five seconds for data to come in so that you can make a decision, and you then decide: okay, that trace looks fine, the latency was very low, it was just more of the same, so I'll drop it. And then suddenly, because you receive a new span for it, that trace becomes interesting; but because you threw the trace out, you don't have that data anymore. So you potentially miss important telemetry. And it is indeed more complex to operate: now you have to think about another layer of collectors doing load balancing, for instance, how you do that load balancing, and other problems like that. And with that, I'll pass it back to Alexandre to say why they needed sampling at Pismo.

Hi again. We have three pillars for why we needed to implement sampling in our case: the exponential growth of the company, hopefully without exponential observability costs, while we keep the visibility of our services. I will cover each in more detail.

As I mentioned, I work at Pismo, and Pismo is a banking-as-a-service provider. We have large customers around the world, which makes our operations very critical, so we need good observability of our services to cover them and perform troubleshooting when problems happen. Here we have some numbers from Pismo on the volume of money and transactions that pass through our platform, so you can get an idea of how critical this operation is.

Controlling observability costs: the growth of the company can drive up observability costs, and it sometimes becomes a big challenge for us to keep that under control. If the cost grows very fast, it can impact the margin of the products we sell, so it is necessary to take action to solve this kind of situation.

Keeping services observable: when we implement some kind of sampling, we cannot lose visibility of our workloads. When some kind of error, crash, or critical incident happens, you need to have good data to analyze, perform troubleshooting, and recover the platform as a whole.

And how did we do it? At the moment we decided to implement sampling in our services, we had two options on the table: head sampling or tail sampling, as we learned from Juraci earlier. With head sampling, the decision to store the traces or not is taken at the beginning of the transaction. For this reason we discarded this option, because it could happen that we discarded critical information that could help us solve some kind of problem. So we chose tail sampling. Why? Tail sampling provides us good visibility of our services in case of a crash.
Fine-grained control with policies: you have a lot of possibilities with the policies you can use in tail sampling. And tuning: the OpenTelemetry Collector itself exposes good metrics that you can analyze to improve the collector configuration and perform some tuning based on the metrics available in the OpenTelemetry Collector.

Here we have a simple diagram of the setup, where 100% of the traces from the applications are sent to our collector, but the policies are applied there, and only the sampled traces are sent to our backend.

Here are the policies we have implemented (a configuration sketch along these lines appears a bit further below). The first policy is about errors: if a trace has some kind of error, that trace will be stored in our tool. Another possibility is that maybe you don't want to apply sampling to a specific application, so you can use a policy to not sample that specific application. Another possibility we use is about latency: if there is some latency happening in your environment, you can store this data to analyze at another moment. And if everything is running well, if you don't have errors, latency, or any exception, we use probabilistic sampling that stores up to 25% of the traces generated on our platform.

Another set of settings is about the decision wait, how much time we wait before deciding whether we will store the trace or not: five seconds. And we keep 120,000 traces in memory; that is how many traces can be held in memory while the decision is being made. We don't have a magic number for the number of spans in each trace: in some situations we have traces with ten spans, and at other moments we can have more than 100 spans in the same trace.

Here we have the chart showing the volume of spans arriving at the collector; as you can see, it is somewhere near 250,000 spans per second. But when you look at the other chart, you can see that the volume of spans we send to our tool is less than 40,000, so it represents good money savings for us.

The trade-off is that the instance for the OpenTelemetry Collector is big, but there are reasons for that. For example, in our case we have some batch processes that run during the night, and the collector needs to have memory available to handle that volume. Another situation is that in case of an incident you will store more traces, so it is necessary to have memory available for the OpenTelemetry Collector to handle that volume of traces.

So, lessons learned. The first is about the volume of spans that the collector can handle: we realized that the OpenTelemetry Collector can handle a big volume of spans, and it works well.

Yeah, but a collector can only hold so much load, right? They had 250,000 spans per second, which is quite something, but there is a limit, and after doing vertical scaling of their collectors they recently found out that they hit that limit. Looking at the previous chart you probably saw a spike and then a constant load during the day, so that's also not a very optimal use of the resources. And now they've hit a point where it's harder to scale vertically, so they have to scale horizontally as well. So that's the second lesson learned: even though the collector can take a very high load, there is a point where you have to plan for horizontal scaling.
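To make that policy set more concrete, here is a hedged sketch of a tail_sampling processor configuration along the lines described above. The decision wait, trace count, and 25% baseline come from the talk; the service name and the latency threshold are illustrative assumptions, not Pismo's actual values:

```yaml
processors:
  tail_sampling:
    decision_wait: 5s       # buffer spans this long before deciding
    num_traces: 120000      # traces kept in memory while waiting
    policies:
      # Keep every trace that contains an error.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep traces from services that should never be sampled
      # (hypothetical service name).
      - name: never-sample-these-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [critical-service]
      # Keep traces slower than a threshold (illustrative value).
      - name: high-latency
        type: latency
        latency:
          threshold_ms: 5000
      # Everything else: keep roughly 25%.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 25
```

A policy list like this is evaluated per trace once the decision wait expires, and the tail_sampling processor would take the place of the probabilistic_sampler in a traces pipeline like the one sketched earlier.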
Another lesson we learned is that it can sometimes happen that the OpenTelemetry Collector crashes, but it's not a big problem, because when the OpenTelemetry Collector needs to be restarted for any reason, out of memory or something else, it takes just about four seconds to be ready again. So the spans can arrive at the collector again, and they will be processed and sent to our observability tool.

So, was anyone here yesterday at our talk? Yeah, we had a talk yesterday on the resiliency of collector pipelines. The previous slide showed that it's okay for the collector to crash, for their case, but of course the collector also provides ways for people to build resilient pipelines.

Now, the fourth lesson learned is that a probability of 25% is not 25%. I kind of alluded to that at the very beginning, and if you were paying attention to the graphs, you saw that they are ingesting 244,000 spans per second, or that was their peak in that specific time window, and if we look at the amount of sampled data, we got about 39.6K. If we do the math, 39.6K is not 25%, which is what they had as the probabilistic sampling; it's 16%. And if we go back to their configuration, they had 100% of errors, 100% of those specific services, 100% of high latency, and 25% of everything else. So I would intuitively think that we'd get more than 25%, but that's not really the case; it's only 16%, right?

And I'm going to be honest with you: when I saw those numbers, I went to Magno and said, that's wrong, I would expect it to be roughly 25. And he said, no, we looked into that, and it's really around 16; we increase it and decrease it and we can see the effects. But of course, there's another factor here: we specify a percentage of traces, but different traces have a different number of spans, and the numbers we have here are spans, right? So what it means is that we get 16% of the spans being sampled, but that doesn't tell us how many traces were actually sampled.
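As a quick sanity check on those peak numbers from the charts (a rough, span-based calculation, not a trace count):

```latex
\[
\frac{39.6\,\mathrm{K\ spans/s\ sampled}}{244\,\mathrm{K\ spans/s\ received}}
\approx 0.162 \approx 16\%,
\qquad \text{versus the configured } 25\% \text{ probabilistic baseline.}
\]
```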
And the last lesson we learned is about the policy we currently have implemented at Pismo. It is simple to detect when some kind of incident is happening, because the volume of traces stored in the tool rises very fast. Imagine a critical incident: more and more traces will be stored, and you can simply analyze the chart, for example, detect the time the problem started, and base your analysis on good information, because most of the information that gets stored is about errors, and as we know, when we are troubleshooting, we look for errors.

And here are the results after the full implementation of tail sampling in our company. This is a chart from Pismo's finance team, and we can see that we are saving a lot of money here. The size of Pismo today is much bigger than when we implemented this, so it's hard to imagine what the cost would be today, because the volume of transactions we have today is much higher than before, in 2022 for example.

And the next steps: as you saw, our OpenTelemetry Collector instance is big, but it doesn't necessarily need to stay big for the whole day. So we will implement a load balancer, because today we don't use one, and we will implement HPA to scale automatically in case we need to scale. That will bring more savings for us in terms of the instances that host the OpenTelemetry Collector. So that's all; I would like to thank everyone, and if you have questions, please let us know. We have microphones here on the sides.

Yeah, I have a quick question. Given that you're in banking, did most of the sampling you did not involve money movement, as far as needing to keep transactional data? Did you mainly sample non-critical pathways?

Well, most of that data is not so relevant for analyzing problems, because we have logs at a good level of detail, so the logs complement the traces and we don't lose critical data; we can search the logs and we will find it. Just to share, we have a good strategy in terms of logs: we take the CID, the correlation ID of the transaction, and it is set on all log lines of all the applications. So if we need to troubleshoot or detect something, we can just take the CID and get the details based on the logs. But for traces, we have a tool where we can analyze the specific requests to databases, caches, and any other endpoints. Okay, thank you. Thank you.

Yes, thanks a lot, thanks a lot for this great talk. Can you go back to lesson four? I was wondering about the probabilistic anomaly, because the interesting part here is that the max indeed has this weird behavior, but the mean is actually quite nicely 25%, and that makes sense, right? Maximums are very sensitive to probabilistic anomalies, while the mean isn't. So I'm not sure about the conclusion, because in the mean it really is 25%, over the overall period you're looking at.

Yeah, so I guess the point is: you are configuring 25%, but it's not necessarily going to be 25%. It might be most of the time, but then you have to account for those movements, right? So it is 16% if we compare just the peaks, but yeah, ideally it would be around 25.

Yeah, and in your case, I don't know what the time frame was where you took this; that was one day, right?
One day.

So at one day, you're at exactly 25%?

Yeah, I think that's one day. Yeah. I mean, we should aim for 25, and the data should be close to that range, but it might not be. When I was looking at that data, what I thought was: we are filtering traces, so it might be 25% of the traces, and I would hope that the number of traces and the number of spans within traces would be almost the same throughout the day, and that might not be the case, right? Perhaps the traces generated from batches have more or fewer spans than the ones that are transactional throughout the day, so the sampling we apply there might affect the number at the end of the day. But yeah, the longer we look at the data, the closer to 25% it should be, yes. Thanks.

Quick question. One claim I remember hearing a few years ago at a similar presentation was that, at least when using an open source backend where the customer has to pay for the processing of traces, Jaeger for example, for compute and memory costs, tail-based sampling would be roughly equivalent to just sending everything all the way to the backend and processing it there. Can you speak to the amount of compute you had to throw at the collector to achieve these gains? It looks like it was quite low, which is good.

I'll let him answer most of the question, but they had to have this big of a collector, so 25 CPUs and 56 gigs of RAM. That's quite beefy, but then they were paying, where is it, yeah, they were paying almost 80k a month, and they went down to 15k a month with that collector, right? And I would expect a collector of that size not to cost anywhere near that much.

All right, good. Thank you, guys. How did you get to the numbers of five seconds and 120k traces?
It was based on our tests. For example, five seconds for us is high latency, because every transaction in banking must be done in milliseconds, so five seconds is almost an incident for us, and that makes it a good number to use. And the volume of traces that we keep in memory, 120,000 traces, is based on the critical moments: for example the batch process that runs during the night, and incidents, when customers perform retries and the volume of requests rises very fast. So it was necessary to use a big number to be able to handle that volume of data and store it in our tool.

Got it, got it. But you mentioned some batch operations running during the night. If you have an operation that takes more than five seconds, maybe a report generation for example, and you want to trace it, are we going to lose those traces because they will be fragmented? How does this work?

No, it's not fragmented. The volume of memory that we allocate for the collector is necessary because of the batch process, but there are also transactions happening at the same moment, so to guarantee the transactions running at that moment, it is necessary to allocate memory for these situations.

So, to address part of your question: we have had a feature request open since, well, forever, and the idea is to record in a cache the decision for a specific trace. So if we have a trace with a decision to sample, we look at the cache and send that late data in, and if the decision was not to sample, we drop the late data, based on decisions that were made more than five seconds ago. This is only a cache, so after some time the data will disappear, but it is something we have on the roadmap for the tail sampling processor.

That's cool. So this means my trace might take longer than the decision time?

The data you lost, you lost, but if a decision was made, that decision is going to be applied consistently to the new, or the late, spans.

Thank you, guys. All right.

Yeah, we are out of time, so one last question. You're talking about scaling the stateful sets, right? So the question is, how often do you plan to scale out those stateful collectors, and how do you plan to handle the statefulness of those collectors?
I mean, when a new collector gets added or removed, you need to ensure that the spans of a particular trace go to the right set of collectors.

The way it works is that you apply the load balancing exporter in front, as a layer of load-balancing collectors, and in the second layer you have the tail sampling processor (see the sketch below). So the tail sampling processor would not know about a change in the topology; the load balancing exporter would. But then, once you apply a new topology, you may get data sent to a new location. There are some algorithms in the load balancing exporter that make a topology change affect only around 30% of the trace ID range, so only about 30% would be affected if the topology changed by one node. But we are also trying to come up with a solution similar to the cache of decisions for tail sampling; we are thinking about the same for the load balancing exporter as well, so that when a routing decision is made for a specific trace ID, that decision is consistently applied in the future, no matter the topology.

And how often do you expect it to scale out or scale in?

The scale-out we will implement based on memory utilization. So we will set the number of collectors that can scale, and based on memory utilization we will scale or not.

All right. Thank you very much, and enjoy the conference.
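For reference, the two-layer topology described in that last answer could be sketched roughly as follows. This is an illustrative outline, not Pismo's actual deployment; the Service name and namespace are hypothetical, and the second layer would run the tail_sampling configuration sketched earlier:

```yaml
# Layer 1 (stateless): receives all spans and routes each trace ID
# consistently to one of the layer-2 collectors.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        # Hypothetical headless Service in front of the layer-2 StatefulSet.
        service: otel-tail-sampling.observability

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]

# Layer 2 (stateful): collectors running the tail_sampling processor as in
# the earlier sketch, each seeing complete traces for its share of trace IDs.
```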