Good evening, everyone. My name is Vijay Samuel. I am a principal architect at eBay, and I help build the observability platform. You might notice that my co-speaker is missing. Aishwarya just had her baby boy, and she's back in the US right now. Congratulations, Aishwarya. We did manage to get a recording of her portion of the talk, which I'll play when those slides come up.

We've been using the OpenTelemetry Collector, and at previous KubeCons we have talked about metrics. This time we're going to talk about tracing: how we have been experimenting with various configurations of the OpenTelemetry Collector, where we are right now, and the conclusions we were able to draw while fine-tuning the OpenTelemetry Collector for tracing. That said, we'll do an introduction of what tracing is, what it looks like inside of eBay, the problem we faced, how we solved it, and some of the lessons that we learned along the way. If time permits, we'll do some questions.

So what is tracing? The OpenTelemetry website defines it like this: traces give us the big picture of what happens when a request is made to an application. Whether your application is a monolith with a single database or a sophisticated mesh of services, traces are essential to understanding the full path a request takes in your application. Why is it important inside of eBay? We have call chains that can involve tens of databases and several microservices, to the point where tracing is critical for knowing exactly what the customer experienced while a request was being served.

What is our scale? We have 14-day retention, with 190 billion spans processed every day, which roughly translates to 2.2 million spans per second, served by 3.6K provisioned compute instances and 150 terabytes of storage.

From an architecture perspective, every Kubernetes cluster runs a few kinds of applications. Some of them are what we like to call generic applications: typically "I wrote a program in an arbitrary language and deployed it on Kubernetes", basically free-form. On the other hand, we have managed applications. Managed means they have full support in terms of developer lifecycle and so on, and they have SDKs into which we bundle the OpenTelemetry SDK. Generic applications need to bundle it themselves and hand-roll instrumentation, whereas with managed applications the instrumentation comes for free. Both kinds write their spans into OpenTelemetry Collectors. From there, we have an ingest layer which receives OTLP, writes the spans into our trace store, and sends whatever metrics we generate out of the spans into our metrics platform. And we have the Sherlock.io console; Sherlock.io is what we call our observability platform inside of eBay. There is a trace query path here, which uses the Jaeger protocol, and a metrics query path here, which uses PromQL semantics.

What is the problem that we faced? Let's be honest, tracing is not easy. It's quite hard. Tracing can become very expensive depending on the number of HTTP requests you're serving on a day-to-day basis, and organic growth means you're going to spend a lot more. Every span carries an arbitrary number of metadata attributes. There are span events, which people can abuse like a logger. And people want 100% sampling.
To give an analogy, we have an internal logging platform, which is very similar to tracing, and it generates 15 petabytes a day at 100% sampling. That is without any of the span metadata overhead that comes with the more modern form of tracing we are using. There's also a high barrier to entry for end users. Every app in the call chain needs to be instrumented, and context propagation needs to be done right. If not, you're not going to see the complete structure, which means that if one collector dropped one span, it can lead you on a wild goose chase if you land on that exact span or that exact trace.

OpenTelemetry is not that easy either, especially the Collector. There's the famous Lego analogy: there are many building blocks to the Collector in terms of receivers, processors, and exporters, and you can put them together in any order. It doesn't really limit you. You can either create something that's really beautiful or you can create a monstrosity. You have the ability to run one tier of collectors, two tiers, N tiers. In which stage should the Kubernetes enrichment happen? There are so many things left to the implementer. There are challenges adopting it as well, in the sense that tracing is relatively new. There aren't many people who have talked about their reference implementations yet, unlike metrics, where there are many reference implementations out there. The community is also extremely vibrant, in the sense that what is knowledge today can be completely obsolete tomorrow, because this is one of the faster-moving communities out there.

What are the things that we tried but that never stuck? First, we tried a single-tier approach where all our applications write to the OpenTelemetry Collector. We first put the spans through Kubernetes enrichment, then we generate span metrics, then we run the tail sampling processor, and from there we write the traces into the trace platform and the span metrics into the metrics platform.

The next approach we tried is a two-tier approach where we moved the tail sampler out on its own, because what we saw with the tail sampler is that all the spans for a given trace ID always need to land on the same OpenTelemetry Collector for it to be able to hold them in memory, make the decision to sample or not sample, and then flush them out. So applications write to the first tier of OpenTelemetry Collectors, where we do span metrics and Kubernetes enrichment. Span metrics go into the metric store and the raw traces go directly into the trace store, but we also write a copy into the second tier, which does the tail sampling and then ships the result into the trace store.

Finally, we also tried a configuration where we moved the span metrics processor into its own tier as well. On the first tier, we just have the Kubernetes enrichment. On the second tier, there's a path dedicated to the metrics coming from the span metrics processor and a dedicated path for tail sampling of the traces, and the raw traces get written into the trace store as well.
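To make the Lego analogy concrete, here is a hedged sketch of what that first, single-tier assembly looks like as a Collector config. The component names are the standard upstream ones; the endpoints, limits, and sampling policy are illustrative assumptions, not our production settings:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000            # illustrative limit
  k8sattributes:               # Kubernetes enrichment
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
  batch:

connectors:
  forward:                     # hands enriched spans to the sampling pipeline
  spanmetrics:                 # generates metrics from enriched spans

exporters:
  otlp/tracestore:
    endpoint: tracestore.internal:4317                    # placeholder
  prometheusremotewrite:
    endpoint: https://metricstore.internal/api/v1/write   # placeholder

service:
  pipelines:
    traces/enrich:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [spanmetrics, forward]
    traces/sample:
      receivers: [forward]
      processors: [tail_sampling]
      exporters: [otlp/tracestore]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```

The forward connector is used here so that span metrics are generated before the tail sampling decision, matching the ordering described above.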
So how did we go about solving this? There are multiple things that we did. The first one was solving for mass adoption. We did pipeline performance tuning, we did some scaling-related tweaks, we optimized storage, and then we went after sampling. The next few slides will talk about each one of them.

Mass adoption. The first thing we did was go for the highest common denominator. In the architecture overview, I described our managed framework. 90% of all the applications deployed inside of eBay use the managed framework, which means that if we ship instrumentation through the managed framework, we immediately get 90% adoption of tracing inside the company. We have a concept of a monthly mandatory site-wide upgrade, where every application on the managed framework automatically gets upgraded to the most recent versions of its dependencies, which means that for any instrumentation updates we want to make, there is a one-month turnaround time to get them into all the applications. And we're talking about around 8,000 applications. We are also working on getting tracing support onto a service mesh. We deploy Istio internally, which means that if the mesh also emits spans, then for any application that does not have tracing instrumentation, at a bare minimum you would at least get the client-side and server-side spans through the Envoy sidecar.

And finally, we wanted to make tracing instrumentation optional. At the end of the day, even if no developer in the company knew how to instrument spans, I would consider that a win. All developers need to do, in the case of Java, is use SLF4J and log as usual. Under the hood, the OpenTelemetry SDK makes sure that trace IDs and span IDs are tagged onto each of those log lines, and developers can just use those to get to the logs. They don't need to instrument spans themselves, because the framework is already taking care of that. So within the managed framework, we shipped the OpenTelemetry SDK and made sure there is instrumentation for all the client calls and all the server APIs. For database calls, depending on the database, some already have instrumentation, and we are working with the database providers internally to make sure the remainder is covered. We have instrumentation for iOS as well, with more coming in the future.

We sample at 2% by default. What this enables is what we call the three-click RCA (asterisk on "three-click", because it's a little more than that). The idea is for a person to go from an alert to a metric to a trace, and then decide whether they want to go to logs or profiles from there. Exemplars play a big role in all of this. And for all of this, we make sure that RED metrics are computed for every kind of span.

The next part is pipeline performance tuning. With performance, numbers are everything. We broke the pipeline down into independent chunks, evaluated what the numbers looked like, tried different combinations, rinse and repeat, to the point where we could figure out how to place these Legos in a way that works for us. The first thing we did was OTLP in, OTLP out, with just the memory limiter and batch processors, which we require anyway, to identify how much throughput we could get out of that. The memory limiter and batch processors have very low overhead: they require some memory, but not too much, and they more or less don't impact your throughput. Next we tried adding the span metrics processor to see what the throughput looked like. It's fine for a single instance; for multiple instances, we'll get to that in a couple of slides. The next one is the Kubernetes metadata enrichment. This is an interesting one.
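For reference, the enrichment we're talking about is the standard k8sattributes processor. A minimal sketch, where the extracted fields and the association rule are assumptions rather than our exact settings:

```yaml
processors:
  k8sattributes:
    # Watches the Kubernetes API, caches pod and namespace metadata,
    # and stamps matching resource attributes onto incoming spans.
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
    pod_association:
      - sources:
          - from: connection   # associate spans by the sender's IP
```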
On a single instance, we saw that pulling the entire cluster's worth of pod metadata, along with some namespace metadata, had a 5 GB overhead. But when we added multiple instances, even though we were sharding the spans evenly across the instances, the overhead of each Kubernetes enrichment processor remained the same, because every instance has to anticipate spans coming in from every pod. So it has to hold all the pod metadata in memory, and across each instance you see the same fixed memory overhead.

So in the final configuration we ended up with, the first stage only does the Kubernetes enrichment and uses an OTLP exporter to send to the second tier, where we only do span metrics processing. We completely dropped the idea of doing tail sampling, because given the way we see traffic inside the company, it doesn't really make sense to do tail sampling on the OpenTelemetry Collector. The simple reason is that there is always going to be a case where a request flows from one region to another, which would mean we would have to load balance all span traffic across regions to make sure the entire request's spans are processed by a single OpenTelemetry Collector. That was simply not feasible for us. And depending on the kind of application, you would have to hold the spans for a longer window, which means you need more memory. So we went for an entirely different tail sampling methodology, which Aishwarya will talk about.

On the load balancing exporter, Aishwarya also helped add a feature to load balance based on service name instead of trace ID. If you do trace-ID-based load balancing, every pod in the second tier will need to see every kind of span metadata combination. On the other hand, if you load balance by service name, all the metadata combinations for a single service are always colocated on a single collector, which means the memory requirements are substantially reduced. It's everyone-needs-to-know-everything versus a collector needing to know about only a given service. That change to the OpenTelemetry Collector also made things quite efficient. So if you are doing a two-tier approach where the second tier does span metrics, it's better to do service-name-based load balancing rather than trace-ID-based load balancing.
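That feature is the routing_key option on the Collector's loadbalancing exporter. Roughly, tier one fans out to tier two like this; the resolver target is a placeholder, not our actual service name:

```yaml
exporters:
  loadbalancing:
    routing_key: service       # route by service name instead of traceID
    protocol:
      otlp:                    # settings here mirror the regular otlp exporter
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-collector-tier2.observability   # hypothetical headless service
```

With routing_key set to service, all spans for a given service land on the same tier-two collector, so each collector only needs to hold the metric state for the services routed to it.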
On scaling, we did two things. The first is that we started leveraging KEDA, because, like I said, logging and tracing have very seasonal throughput. At least for us, during the holiday period or during weekends, when more shopping is happening, you see more volume, which means autoscaling can greatly improve our ability to handle higher traffic without running sustained large deployments all the time. The second thing we did was make sure we are not using the Kubernetes Service object to do the load balancing, because the round-robin approach it takes for gRPC requests is not very efficient and ends up driving the OpenTelemetry Collector pods out of memory. So we use the mesh, which can handle gRPC load balancing a lot better.

With regards to storage, on the consumption side we implement the Jaeger APIs, because that's more or less a standard right now, and we use ClickHouse for storage. ClickHouse is cheap, it's fast, and it works really well for us. But the one improvisation we make is that we don't have a single large ClickHouse cluster; we have smaller ClickHouse clusters backed by ring-based discovery. There's a routing table we maintain that says this service name needs to be routed to this particular cluster, and our ingest APIs can use that to decide where to write the data. We also make sure you don't need to do a global scatter-gather on every query, by implementing a global index: for every trace ID, we know exactly which shards on which clusters to query to get the data back, rather than querying all of them.

This is what our storage architecture looks like. We have a traces table, which has the standard attributes a given span requires. We supplement that with a bunch of materialized views: one for service plus operation, which has the service name and operation name; another with the trace ID and service name; and an index saying this is the index ID for this service name plus resource attribute combination. A combination of all three is what we use to make routing decisions. We also have a sampled table where the sampled traces, which are retained longer, are stored. These tables and materialized views have different retentions. We retain the raw traces for two hours, because we believe a usual site triaging event should not require more than two hours' worth of spans. The sampled table is what we retain for 14 days, after the two hours have expired.

Let me move to Aishwarya's portion. Hello everyone. This is Aishwarya. I'm working as a senior software engineer for the Observability Platform at eBay, and I've been working on the tracing project. Coming to the sampling part of tracing: we have a lot of applications at eBay that send traces, and there is a huge amount of data to process and store on the tracing platform. That raises two fundamental questions. Is every trace really useful? And if not, do we have room to be more efficient while processing and storing data on the tracing platform?

Moving on to the next slide. This diagram shows a high-level representation of what tracing data looks like. We have a lot of traces that ended without any issues, a small set of traces that ended with high latency, and a small set of traces that ended with errors. We don't really need to store all of this data on the platform. To be more efficient and effective, the ideal representative sample would be a small percentage of the traces that ended without any issues, while we can always sample 100% of the traces that ended with high latency or with errors. If we are really interested, we can also take samples of traces with specific attributes that we care about. With that, we can be much more efficient on the platform side.

Moving on to the next slide. We have adopted two types of sampling mechanisms. The first is SDK-based, parent-based head sampling. When applications generate traces, even before sending those traces to the platform, they are sampled at 2% by default by the framework. The type of sampling we use here is parent-based head sampling, which means that when a root service generates a trace ID, that root service decides whether to sample the trace or not.
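In standard OpenTelemetry SDK terms, that default corresponds to the parentbased_traceidratio sampler, configurable through the spec-defined environment variables. A hedged sketch as a pod spec; the pod name and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                        # hypothetical
spec:
  containers:
    - name: app
      image: example/app:latest            # placeholder image
      env:
        - name: OTEL_TRACES_SAMPLER
          value: parentbased_traceidratio  # the root decides; children obey
        - name: OTEL_TRACES_SAMPLER_ARG
          value: "0.02"                    # sample 2% of root traces
```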
When the decision is made, it is passed on to all the services in the call chain, and all those services obey the decision set by the parent on whether to sample the trace or not. Along with that, we also generate exemplars, and exemplars carry only trace IDs that were sampled. We attach these exemplars to a latency metric that is present on all the applications. This latency metric has latency buckets, and we add exemplars to each bucket. So when there is any issue, it's easy to debug, because we have at least one exemplar present and we can easily jump from metrics to traces to see the end-to-end call chain and see where the issue really lies.

Going on to the next slide. The next type of sampling technique we adopted is exemplar-based tail sampling. When we receive the traces sent by applications, we store all the raw traces for two hours, and then we apply exemplar-based tail sampling. As I already mentioned, we record exemplars for metrics in the framework. When a trace ID is recorded as an exemplar, that means it's really important and has to be preserved for a longer duration. So all the trace IDs recorded as exemplars are received, sampled, and stored for a longer duration of time. And it's not only the exemplars exposed by the framework: we also have the span metrics processor running, and all the traces received on the platform go through the span metrics connector to compute RED metrics, meaning requests, errors, and duration. We have exemplars attached to every RED metric as well. Recently, we also added a new metric called events_total, which records metrics for the error labels emitted as part of spans. Since we have exemplars recorded for all of these metrics, we collect all the trace IDs from those exemplars, and they are also stored for a longer duration of time. Apart from these two sampling techniques, we also apply 1% random sampling across all the traces received on the platform.

Moving on to the next slide, this is the tail sampling architecture that we follow. We have tracing OTel Collectors, and the framework sends all the spans to them. We have the span metrics processor running, which generates RED metrics, and these metrics are sent to our metrics ingest pipeline. The metrics ingest pipeline also sends the metrics to the metric aggregator, and the aggregator preserves exemplars during rollup, so they are not dropped in the pipeline. Once the aggregator has received all the metrics and the aggregations are done, the metrics are sent back to our tail sampler ingest. The tail sampler ingest collects all the exemplars, records all the trace IDs, and stores them in our ClickHouse table. We also have a tail sampler job running. The tail sampler job picks up the trace IDs that need to be sampled, pulls the tracing data for them from the raw tables, and samples all the traces that were exposed as exemplars.
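The RED metrics and exemplars described here map onto the upstream spanmetrics connector, which can attach exemplars to the histograms it emits. A hedged sketch; the buckets and endpoint are illustrative, not our production values:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100ms, 250ms, 500ms, 1s, 2s]   # illustrative latency buckets
    exemplars:
      enabled: true    # attach sampled trace IDs as exemplars to each series

exporters:
  prometheusremotewrite:
    endpoint: https://metricstore.internal/api/v1/write   # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```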
A couple more things on the section that Aishwarya talked about. Exemplars are really important to us. Think of the latency metrics: if you're not using the new native histograms, you decide which buckets you define for your latency, and each one of them is going to sample exemplars and emit them into your backend. That means if you take those exemplars and keep sampling all of those trace IDs to retain them longer, then for every bucket you defined, you get sampled traces from every host emitting those metrics, so what you end up with is almost a representative sample of your entire ecosystem. And what Aishwarya also did is work with the OpenTelemetry community to make sure we emit events_total as a metric, where you record every event that has a unique exception type. That means for every error scenario, for every host and every error type, you have at least one sample that you retain much longer. This greatly improves our efficiency while not impairing the customer experience too much.

So what did we get out of all of this? One, we got a cheaper and easier platform to operate: we don't have to retain all the spans for a long window, and we don't have to run complex sampling algorithms on the OpenTelemetry Collector, or even after the fact once the data has landed in storage. Two, now that we have distributed tracing without any developer having to lift a finger, we can create dependency graphs at both the per-request level and the service-to-service level. And we have an industry standard in OpenTelemetry, while at the same time gearing it heavily towards what eBay actually needs.

What are the lessons that we learned? The first is that observability is a team sport. We should use all the pillars effectively, and not say "use this one pillar and derive everything out of it." In our case, you saw how closely the metrics platform and the trace platform are knit together, because the metrics now provide the signal for what needs to be sampled and what does not. At the same time, all our P1 detection functions are done strictly with metrics; we don't try to use the other pillars for that. Exemplars can offer a representative sample; we just talked about that. Numbers really matter: fine-tuning took a lot of benchmarking and profiling, and we finally landed on a pattern that was good. Sometimes the community also offers prescribed patterns, and it's better to stick to them rather than trying to improvise. And finally, no one should need 100% sampling to do effective observability. At our scale, we do 2% sampling, which is plenty for us to be able to look at all kinds of issues, because at the end of the day, one in 50 requests will have the issue the customer is facing, and with the kind of call volumes we have in a day, one in 50 is fairly easy to hit. With that, I'll take questions.

Thank you. Great talk, and very familiar themes, a very familiar journey, maybe for other people too. I've got some questions. First, your deployment model for the collector: you didn't specify, but it sounds like you have a central pool of replicas for your gateway, and you haven't got sidecars or agents per host or per Kubernetes node. Correct. And the reason we do that is that there is a severe problem of resource fragmentation when we go down the route of either a DaemonSet or a sidecar.
With a sidecar, you also have the problem that if we need to do upgrades, we'll have to restart the parent container, or have the managed platform do a rollout across the site, which, like I mentioned, is more than 8,000 applications. The cluster-local gateway gives us the flexibility to do rollouts independently; we can have bigger pods for handling more throughput, and we can scale it at will. That's why we do that.

Okay. Also, you didn't mention application metrics. You mentioned span metrics, but you're not collecting metrics with your connectors, or logs either? For logs, we do the file-based approach; it still uses the DaemonSet pattern. Now that the logs API is stable, we are slowly but surely trying to get people to adopt the SLF4J-based bridge for OpenTelemetry; sometime this year that should happen. For metrics, there is some free instrumentation, similar to tracing, where the four golden signals are instrumented on a dedicated Prometheus endpoint, and we scrape it for free for all the applications that use the managed stack. There too we emit exemplars for the latency metric. And since the managed framework has a lot of framework providers as well, like the database team or the messaging team, each of them has their own independent Prometheus endpoint, which we scrape and ship into our metrics platform. A lot of free observability is provided to them in terms of security dashboards, curated alerts, being able to do health-aware rollouts, and things like that.

Okay, just one more very quick one, on your ingest: do you have any buffering, or asynchronous read/write buffering, before storage? We only do the batch processor, nothing more than that. So everything is pushed straight through the pipeline to your storage? Yeah. If the question is whether we are using something like Kafka: no, we do not. We receive it on the gateway, we pick a ClickHouse shard to write to, and that shard also has a Buffer table; we push into that, and it periodically flushes into storage. Thank you very much.

Thanks, it was a great talk. I think my question is on the application-level sampling that you did. Can you talk a bit about that? How did you decide what to filter out? I'm coming from the point of view that you might be filtering out something which is a high-latency trace. On the application side, as in the SDK-based sampling, we just do head sampling, probabilistic sampling at 2%, so it's basically as good as flipping a coin to decide whether something gets sampled or not. And we are okay with that, because when we look at all our latency patterns, or the number of errors seen on a given application, it's well over 50 per host, which generally means that with a 2% sample, even the lowest-volume cases should still see at least one matching request coming into the platform. So we just go down that route.

One last follow-up from my side: what made you decide to have the trace store in ClickHouse, versus directly using a load balancer and doing the tail sampling in the collector? The logic was the one I mentioned earlier: we have cross-region traffic, which necessarily means that if you want to do proper tail sampling on the collector, all the traffic seen across regions needs to come into the same trace collector.
And we try to make sure we are not passing log or trace information across regions unless we really have to. The other reason is that with true tail sampling, you can allow a five-minute or ten-minute offset, or even a one-hour offset, for something that for whatever reason is really long-running to come in before making the tail sampling decision. With the tail sampling processor, at best you can do maybe 20 or 40 seconds, and if your throughput is really high, you need a lot more memory to make sure you're really doing a good job of the tail sampling. Thank you.

Great talk. Two questions about tail sampling. First, is it based on OpenTelemetry, or is it something that you implemented yourself? Second, you mentioned that it's a tail sampling job. Do you mean it's a cron job, and how frequently does it run? Could you touch a little bit on how it is implemented? The exemplar-based tail sampling is something we built in house, but it is reasonably easy to implement on your own. All you need is a processor that looks at all the metric signals and writes them into something like a database table, so that whenever there is an exemplar, it takes note of it. And the tail sampler, like you mentioned, is just a cron job that selects from the table that has the trace IDs and uses that as a subquery to pull all the raw spans into a different table. The window can be configured: we have two parameters, the offset, meaning after how long it needs to kick off, and the interval. Typically we do five plus five, so at the fifth minute we make sure that the tenth minute is being sampled. The same query I talked about just keeps getting applied for that time window.
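As a closing illustration, the job described above could be run as a Kubernetes CronJob along these lines. This is a hypothetical sketch: the image, arguments, and table names are assumptions for illustration, not eBay's actual implementation.

```yaml
# Hypothetical exemplar tail sampler: every five minutes, copy spans whose
# trace IDs were recorded as exemplars from the short-retention raw table
# into the long-retention sampled table.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: exemplar-tail-sampler
spec:
  schedule: "*/5 * * * *"         # interval: run every five minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: sampler
              image: example/tail-sampler:latest      # placeholder image
              args:
                - --offset=5m     # sample the window ending five minutes ago
                - --interval=5m   # width of each sampled window
                - --exemplar-table=exemplar_trace_ids # table of exemplar trace IDs
                - --raw-table=traces_raw              # two-hour retention
                - --sampled-table=traces_sampled      # 14-day retention
```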