Hi, welcome to the Jaeger project deep dive. We have three speakers in this talk, and we'll go over a great many aspects of the Jaeger project, but not all of them, because we have a short session and a lot of content to pack in. My name is Yuri Shkuro, I'm an engineer at Facebook. I'm a maintainer of the Jaeger, OpenTracing, and OpenTelemetry projects; all three of them are members of the Cloud Native Computing Foundation. I also published a book last year, Mastering Distributed Tracing, where you can find more information about the history of Jaeger, the history of OpenTracing, and an introduction to distributed tracing as a discipline. First, Annanay and I will give an introduction, and then we will talk about Jaeger features and Jaeger architecture. I will talk about sampling, because this is one area where people often have questions, and Pavol at the end will talk about Jaeger's integration with OpenTelemetry and deploying Jaeger on Kubernetes. And so first, Annanay will take it away with the intro to tracing.

Thank you, Yuri. Hi, everyone. My name is Annanay Agarwal. I'm a software developer at Grafana Labs and a contributor to the Jaeger and OpenTelemetry projects. Today we're going to take a look at what distributed tracing is and how it fits into our debugging workflow. We're also going to learn some concepts and terminology.

This is a rendering of Uber's internal architecture, generated by Jaeger, and we can see that in a mature environment the number of microservices can easily run into the hundreds or even thousands. Each of the green nodes in this graph represents a microservice, and the gray lines represent the communications between these microservices. When we interact with the Uber app, a single request to the Uber infrastructure may look something like this, and this typically happens billions of times a day.

So what are some of the tools that we use to monitor such a complex architecture? Typically, we use a combination of metrics and logs. Metrics are great because they're aggregatable, they can be alerted on, and they're a great way to get an overall picture of the performance of the system. This is a sample metric that I'm exposing from my application. It's a standard Prometheus metric which gives me the duration, or the latency, of a request to the system. And here I can see that this gives me a very high-level picture saying that my ice cream shop app took 10 seconds of latency for a given request. However, if I want more fine-grained information about the system, I can add more and more tags to it, but very quickly I run into the problem of cardinality explosion. Cardinality refers to the number of items present in a set, so in the metrics world this means the total number of values that a given label set can take. As I add more and more labels in order to get more information about my application, I increase cardinality, and this can lead to degraded performance and is also not cost effective.
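To make that concrete, here is an invented example in Prometheus's exposition format; the metric and label names are made up for illustration. The number of time series is the product of the distinct values of every label, so one unbounded label, like a customer ID, can blow cardinality up:

```text
# One label with few values: a handful of time series.
icecream_request_duration_seconds_sum{route="/order"} 10.0

# More labels multiply the series count: routes x status codes x
# flavors x customers. An unbounded label like customer explodes it.
icecream_request_duration_seconds_sum{route="/order",status="500",flavor="pistachio",customer="c-42913"} 10.0
```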
Logs are also a great way to check the health of a given service, but under concurrent requests it's really difficult to get the stack trace of one particular request that passed through the service. And so, really, we need tracing, because traces are like stack-trace debugging for distributed systems. They also tell us a story about the system, the story of the life cycle of a request passing through it.

Distributed tracing works on the concept of context propagation. On the left, we can see a very simple microservice architecture where an edge service A creates a unique ID for every inbound request, and every time it makes a downstream request, it passes along this unique ID as part of the context. As these services do some quantum of work and generate spans, they attach this unique ID to the spans. Once the spans are emitted and stitched together in the backend, we can see the trace as shown on the right; it is assembled with the help of the unique ID.

And so now let's look at some traces. These traces are generated from the sample HotROD application that ships as part of the Jaeger repository. When we click on the system architecture diagram, or the dependencies diagram, we see that as a developer this already gives me a very intuitive picture of the architecture of the system and the requests flowing through it. It shows me the different microservices involved and also how many requests were made between these microservices. Next, in Jaeger, we also have deep dependency graphs, which may look similar to the system diagram at first, but the difference is that they're filtered by the service which is highlighted. So the system architecture diagram could include calls from the driver service to Redis which did not originate from the frontend, but in the transitive service graph, only the request paths that pass through the highlighted service are shown. We can also switch the layout mode over here to an operation view, which is what is shown here.

Next, when we select a trace, this is the typical Gantt chart view of the trace that we see. On the left, the services are arranged in a hierarchical manner, which shows the transitive nature of the requests as generated. At the top, we have a minimap, which is really useful if the trace has a couple of thousand spans, because we can highlight a given section and it will only show the spans for that duration. Next, just with a simple look at this view, we can see that some of these operations were blocking. For instance, the customer service, which was called by the frontend, called the MySQL service, probably trying to retrieve customer information, and all other operations were blocked on this request. We can also see that the waterfall diagram over here clearly represents the sequential order of operations. This is really useful for an application developer, who can look at the operations and say, hey, this should have been parallelized, why is this running in sequence? We can also see that the parent-child relationships are encoded into this view: the parent always encompasses its descendants.

Next, when I click on a given span, it reveals some extra information in the form of tags and logs. Logs can be optionally indexed; we at Grafana Labs do not index the logs that are ingested into Jaeger. But here you can see that as part of the tags, I can attach higher-cardinality data, for instance the SQL query itself. It would be very difficult to expose the query as part of the service and operation names themselves, but as a tag I can add this higher-cardinality data.
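As a minimal sketch of how that looks in code, using the OpenTracing Go API that Jaeger clients implement (the span name, tag keys, and query here are invented for illustration):

```go
package tracingdemo

import (
	"github.com/opentracing/opentracing-go"
)

func getCustomer(parent opentracing.SpanContext, customerID string) {
	// Child span for the database call; the operation name stays
	// low-cardinality ("SQL SELECT"), suitable for grouping in the UI.
	span := opentracing.GlobalTracer().StartSpan(
		"SQL SELECT", opentracing.ChildOf(parent))
	defer span.Finish()

	// High-cardinality details go into tags, not the operation name.
	query := "SELECT * FROM customer WHERE customer_id = ?"
	span.SetTag("sql.query", query)
	span.SetTag("customer_id", customerID)

	// ... execute the query ...
}
```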
This is a new view that has arrived in Jaeger: the trace diff. We can see the two traces being diffed at the top. One of them took 2.7 seconds, while the other took 1.4 seconds. So clearly there must have been something very different about them, because even though they're hitting the same endpoint, which is shown at the top, the difference is all of these nodes. We can see that they had a common parent, the common gateway endpoint that they both hit. This is sort of like the visual diff we see in systems like Git: the nodes highlighted in red were present in trace A but are absent in trace B, and the green nodes show what is present in trace B but not in trace A. The operations right at the top over here were more or less common. And we can see that there's a substantial diff, probably because one of these requests was a success and the other was a failure.

The other view that we have is to actually compare span durations. The previous diagram was more of a node-wise view, which showed which nodes were part of one trace compared to the other, whereas this view shows the latency differences between the two traces. The darker the node, the starker the contrast between the latencies in trace A and trace B. So, for instance, if I were a developer looking at these trace views for comparison, I would see that this particular node might have been a problem in trace B, because it seems to have elevated latencies compared to trace A, and transitively all the other upstream services have had higher latency as well. If I hover on top of these nodes, it also shows me the difference in latency, so this is really useful for debugging purposes.

Next, we're going to quickly browse through the Jaeger architecture. Jaeger is not just a single binary; it's a collection of services which help in trace data collection, storage, querying, and visualization. On the left of this broad spectrum, we have client libraries, which are used to instrument the application and are typically written in the same language as the application. The officially supported libraries are in Go, Java, Python, Node.js, C++, and C#, while PHP and Ruby are community-maintained libraries. And on the right side, we have the visualization frontend, which is written in React.js; it's beautiful, and something we already discussed. One important point to note is that Jaeger does not provide instrumentation: Jaeger provides an SDK, but not the instrumentation API. For this, we can use something like OpenTracing or OpenTelemetry.

A little history about Jaeger. Jaeger was inspired by Google's Dapper and OpenZipkin. It was created at Uber in August of 2015 and finally open sourced in April of 2017. The same year Jaeger joined the CNCF as an incubating project, and it graduated to a top-level CNCF project in 2019.

This workflow shows what information is propagated as part of a regular HTTP call between services and how the trace data reaches the Jaeger backend. When service A, which is the upstream service here, calls a downstream service B, its instrumentation adds the unique ID into the HTTP headers and passes it along to service B. At service B's end, it receives the HTTP request, parses out the context information, and uses the same trace ID in the spans that it creates.
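Concretely, with Jaeger's default propagation format, that unique ID travels in a single HTTP header encoded as {trace-id}:{span-id}:{parent-span-id}:{flags}; the values below are invented, and the trailing 1 in the flags means the trace was sampled:

```text
GET /customer?customer=123 HTTP/1.1
Host: customer-service
uber-trace-id: 6f2c6b9d4e8a3c21:41b0ee2adcb41c71:0:1
```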
Finally, the span data emitted from service A and service B reaches the Jaeger backend out of band, where it is stitched together to form a common trace.

In 2017, when Jaeger was open sourced, the architecture looked something like this. On the left, we have the host or container that has been instrumented. It has the application and the Jaeger client, which is used for instrumentation. The Jaeger client sends spans to the Jaeger agent, which may be running locally, either on the same machine or as a Kubernetes sidecar. From here, the Jaeger agent sends spans to a Jaeger collector. The Jaeger collector is more of a central component, whereas the Jaeger agent may be deployed in multiple clusters. This is done for several reasons: the link between the Jaeger agent and the Jaeger collector is cross-DC and might break, so the Jaeger agent can do things like buffering and so on. The Jaeger collector receives the spans, normalizes them, can perform some additional cleansing of the data, for example removal of sensitive information, and then ingests them into a database. From here, we have Spark jobs that run on the ingested span data and compute the dependency graph that we discussed earlier. And finally there is the Jaeger query service, which queries the database to help visualize the spans in the UI.

Another important thing to note is the red lines in this diagram, which show the flow of sampling information. The Jaeger collector is a central store for all sampling configuration, and this can be used to define per-service sampling, per-operation sampling, and so on. The Jaeger client can poll the Jaeger agent, which in turn polls the Jaeger collector, and receive sampling information without ever having to rotate config maps or anything like that. So this is really useful.

Another important change we made to the architecture was the introduction of Kafka. What this means is that we've been able to decouple the ingestion of spans from the client from the ingestion of spans into the database. The Jaeger collector can enqueue spans into Kafka, and the Jaeger ingester can asynchronously consume them and insert them into the database. So if we ever receive a high volume of spans because of increased traffic, they can be buffered in Kafka without overwhelming the database. The Flink streaming jobs can also now consume from Kafka and write dependency information into the database. The rest of the architecture remains the same.
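As a rough sketch of how this pipeline is wired together (flag names follow recent Jaeger releases, but the broker addresses, topic, and storage URL here are placeholders; check --help on your version):

```sh
# Collector writes spans to Kafka instead of directly to a database.
SPAN_STORAGE_TYPE=kafka jaeger-collector \
  --kafka.producer.brokers=kafka-1:9092,kafka-2:9092 \
  --kafka.producer.topic=jaeger-spans

# Ingester consumes the same topic and writes to the real storage.
SPAN_STORAGE_TYPE=elasticsearch jaeger-ingester \
  --kafka.consumer.brokers=kafka-1:9092,kafka-2:9092 \
  --kafka.consumer.topic=jaeger-spans \
  --es.server-urls=http://elasticsearch:9200
```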
Speaking of the technology stack, Jaeger is written in Go; it's a Go backend for tracing data. It has pluggable storage with support for Cassandra, Elasticsearch, and Badger, and it also has an in-memory store. The pluggable storage uses HashiCorp's go-plugin, so if you are an expert in another database, you can write a plugin which can be plugged into the Jaeger collector. Jaeger uses a React.js frontend, which is really feature-rich and beautiful. Jaeger is compatible with all OpenTracing instrumentation libraries, and it has strong integration with Kafka and Apache Flink for trace analytics. And with that, I pass over to Yuri to talk about sampling in Jaeger.

Thank you. Okay, let's talk about sampling. In distributed tracing, we use the term sampling in the classical statistical sense, meaning that we try to select a subset of all individuals, or traces, from the population of all possible traces in order to estimate certain characteristics of that population, or more specifically, to reason about application performance based on the samples we've selected.

The question is, why do we need to sample? There are several reasons. The first reason is that tracing generates a lot of information, and storing all of it incurs significant storage costs. Here's some napkin math. Assume that a tracing span is about two kilobytes on average and we have a server doing 10,000 queries per second. That means we already generate 20 megabytes per second of data. Now assume that we have 100 instances of that service: that's two gigabytes per second, or roughly 170 terabytes per day (2 GB/s times 86,400 seconds). And that's just for one service. If your architecture is complex, it may have hundreds or even thousands of services; you can imagine how much data we would generate if we actually captured every single request.

The other reason is that if we're not sampling, the instrumentation that collects the data from the application introduces performance overhead by itself. Here's an example again. If we have a service doing 10,000 QPS, then we have roughly 100 microseconds per request to work with, right? So if the instrumentation takes, say, five microseconds, that's already 5% overhead on your compute costs, and if you run a very large fleet, that's a significant amount.

And finally, the third reason is that when we collect traces, the data is actually very repetitive. The majority of traces look much the same: they have the same shape and roughly similar latency measurements. So storing all of them is kind of useless; we don't get any more insight by storing all of them. And so that's why we sample.

However, in distributed tracing specifically, sampling has an interesting aspect: we need to do so-called consistent sampling. What we mean by that is that if we collect spans for a trace, then we should either collect all of them across the whole architecture for that given request, or we should collect none of them. The alternative is shown in this diagram. Let's assume we had the system on the left, and we started randomly making sampling decisions as part of the request: three nodes happened to sample it and the other two didn't. So we got a trace that looks like the one on the right, but that trace is kind of broken. Some of the nodes arrive without their parents, and it's hard to reason about such traces. They're not as useful as when we sample consistently and get the whole trace every time, or no trace at all, which is also acceptable.

As far as specific sampling techniques go, there are two primary consistent sampling techniques used in the industry. Head-based sampling is the most popular; it has traditionally been used since the days of Google's Dapper paper, and all the modern tracing systems support it. And in the last few years, tail-based sampling started appearing as another popular technique. I'll talk about both of them.
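One way to picture how such consistency is even possible: derive the decision deterministically from the trace ID, so every service that sees the same ID independently reaches the same verdict. The sketch below is an illustrative simplification, not Jaeger's exact code; as we'll see next, head-based sampling in practice also propagates the decision as a flag.

```go
package sampling

// sampled makes a consistent probabilistic decision. Because it is a
// pure function of the trace ID, every service that sees the same
// trace ID makes the same decision, so we keep whole traces or nothing.
func sampled(traceID uint64, probability float64) bool {
	const scale = 1 << 20 // fixed-point precision for the comparison
	return traceID%scale < uint64(probability*scale)
}
```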
So let's talk about head-based sampling, also called upfront sampling. The approach is very simple. When we start a new trace, say when we generate a brand-new trace ID because the incoming request didn't have one, we make the sampling decision at that same time, and we capture that decision in the trace context, which is propagated throughout the request. This way, we guarantee that the trace is consistently sampled, as long as all the SDKs on the path of the call graph respect that sampling decision and capture the data accordingly.

This implementation has minimal overhead when the trace is not sampled, because, again, we propagate a flag saying don't collect anything, so all the calls into the tracing SDK become no-ops. They're very cheap, we don't affect performance, and we don't collect any data. It's also fairly easy to implement, because if you think about it, you just make a probabilistic decision, or use some other algorithm to decide when to sample, and then you pass the decision around, and all the downstream SDKs just respect it.

However, this approach also has a couple of drawbacks. One is that it's not very good at capturing anomalies. For example, let's say you're looking at your metrics and you see your P99 latency suddenly spiking, so you want to find some traces that represent that spike. Well, P99 already means one in 100, and if we're sampling at a rate of 1%, our total probability of actually catching your P99 latency trace is one in 10,000. If the traffic to your service is high enough that you still capture a sufficient number of traces even with that probability, then you can probably find an example; but the rarer the outliers, the smaller the chance you will actually capture them with a uniform probabilistic sampling approach. The second big drawback of upfront sampling is that it cannot react to how the request behaves in the architecture, because the sampling decision must be made at the very beginning, when we know nothing about the request; maybe we know which endpoint was hit, at best, but nothing else. We definitely don't know what's going to happen to the request over its life cycle, yet we've already made the sampling decision and every downstream service has to respect it. So it's very hard to, say, react to errors with upfront sampling.

What about head-based sampling in Jaeger? Jaeger supports it out of the box and comes with an assortment of samplers: always on, always off, probabilistic (the most commonly used), or rate limiting, meaning sample, say, 10 traces per second, and things like that. The benefit is that these samplers are very easy to implement. However, the downside of configuring sampling this way is that when you have a service and you instantiate a tracer, you have to give it a sampler with a specific configuration, which means that if you have hundreds or thousands of services in the organization, all those decisions are made by individual developers. They're also kind of sticky, because once you deploy the service, it stays in production with whatever probability or rate was assigned. And developers usually don't know what effect an individual sampling rate may have on your tracing backend: can the backend actually support that level of sampling? So there's a disconnect between the interests and capacity of the backend and how the sampling is configured in the SDKs.
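To make that concrete, here is a minimal sketch of such a hard-coded sampler using the jaeger-client-go config package (the service name and rate are invented): once this ships, the 1% rate is baked in until someone changes the code and redeploys.

```go
package tracing

import (
	"io"
	"log"

	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func initTracer() io.Closer {
	cfg := jaegercfg.Configuration{
		ServiceName: "checkout", // example service name
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "probabilistic", // chosen by this service's developer
			Param: 0.01,            // 1% of traces, fixed at deploy time
		},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatal(err)
	}
	_ = tracer // typically: opentracing.SetGlobalTracer(tracer)
	return closer
}
```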
And so, for that reason, Jaeger SDKs actually default to a different type of sampler, called the remote sampler. What that means is that the SDK reads its configuration from a central tier, the collectors in the Jaeger backend, so that the configuration can be controlled in a central place. The team that runs the tracing backend can then determine how much sampling should happen for which service and which endpoints, and you can change it on the fly if you want to. The point is that you centralize the configuration rather than having it all done at the edges of your traffic.

Here's an example of a configuration for that type of sampling. At the top left, we have a default sampling strategy, which applies to any service unless it is configured with something else. Here it says everyone should use probabilistic sampling with 50% probability. But you can also provide overrides for specific operations: say we don't want to sample anything on the health or metrics endpoints, so we give them a probability of zero. On the right side, there is a specific override we can do per service. If we know that, say, our foo service here is very low-QPS, we give it a higher probability of sampling; and, on the other hand, we can also override individual operations on that service.
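Rendered as the sampling strategies file that Jaeger collectors can serve to the SDKs, the slide's example would look roughly like this (service and operation names are illustrative, and the exact schema may vary by version):

```json
{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.5,
    "operation_strategies": [
      { "operation": "/health", "type": "probabilistic", "param": 0.0 },
      { "operation": "/metrics", "type": "probabilistic", "param": 0.0 }
    ]
  },
  "service_strategies": [
    {
      "service": "foo",
      "type": "probabilistic",
      "param": 0.8,
      "operation_strategies": [
        { "operation": "op1", "type": "probabilistic", "param": 0.2 }
      ]
    }
  ]
}
```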
Now let's talk about tail-based sampling. In tail-based sampling, as the name implies, the decision is made at the end of the trace rather than at the beginning. That means we make the decision when we have already observed the whole trace, and that decision can be a lot more intelligent: we can look at the latencies we've seen in the trace, we can see whether there were any errors, maybe there's an unusual call-graph shape, et cetera. So instead of capturing samples uniformly, we can steer them toward anomalies: if we see a long latency, or an error, we always capture that example. That gives us control over what kind of data gets into the backend, and it means we can catch anomalies much more easily than with upfront, head-based sampling. Another benefit is that because all this data reaches the collectors before we make a sampling decision, we're dealing with a lot more data on which we can run various aggregations. Say we want to compute statistical aggregations of the latencies we're observing, or histograms across the services: if we compute them before sampling, we simply have more data to work with, and those aggregates will be more accurate than if we computed them after sampling, especially after sampling that is not uniform but skewed toward anomalies.

There are two drawbacks to this approach, though. Because we need to collect the whole trace before we can make a sampling decision, we need to store it somewhere. Traces are distributed and produced in individual pieces by multiple services, so you have to collect them all in one place and hold on to them until you've received all the data for that trace, which can take a somewhat indeterminate amount of time. Typically, in modern systems that support tail-based sampling, this is done in memory: you store traces in memory, you make the sampling decision, and if the decision is no, you just throw them away and let them expire from memory. Because most requests are very short-lived, you can actually scale such a system fairly well to very large volumes of inbound traces. The other downside of tail-based sampling is that if the goal is to collect all the data upfront and then sample down to only what our storage can support, then all that upfront collection introduces additional overhead on the application itself. As I mentioned in the napkin math before, you can have 5 to 8% overhead on very high-QPS services if you essentially collect data on every single request and export it to a collector tier before making the sampling decision. So it's a bit expensive.

In terms of Jaeger support: as we'll discuss later, Jaeger is moving toward building most of the Jaeger backend on top of the OpenTelemetry Collector, and the OpenTelemetry Collector does have logic for tail-based sampling. You can already configure various sampling rules based on latency or on certain tags, like the error flag. Unfortunately, at this time it only works in single-node mode: you can run a single instance and send all the traces there, and it will work. But if your traffic is so large that one instance cannot scale to it and you need to run multiple collectors, then there's a problem. Normally collectors are stateless, so that is not an issue, but in the case of tail-based sampling they become stateful, and there needs to be some sort of sharding solution, which is not available right now but will be in the future; there are already prototypes in OpenTelemetry.
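For reference, a single-node configuration along these lines might look like this in the OpenTelemetry Collector (the tail sampling processor's policy types are real, but the thresholds are examples and the schema may differ between collector versions):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # how long to buffer a trace in memory
    num_traces: 100000        # max traces held while waiting
    policies:
      - name: keep-errors     # always sample traces containing an error
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow       # always sample traces slower than 500 ms
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline        # plus a 1% uniform baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
```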
So that is all about sampling, and now Pavol will talk about the Jaeger and OpenTelemetry integration.

Hello everybody. My name is Pavol Loffay. I'm a software engineer at Traceable AI, a core maintainer of the Jaeger project, and a contributor to the OpenTelemetry and OpenTracing projects. In this section, I will talk about the Jaeger and OpenTelemetry integration. Before we deep dive into OpenTelemetry, let me quickly talk about OpenTracing, so we can better understand OpenTelemetry as the next evolution of OpenTracing and of data collection libraries in general.

On this slide, we see basically two parts. On the bottom, there is our tracing infrastructure, collecting distributed traces from the user application. And on the top, there is the user application process, instrumented with OpenTracing. OpenTracing is a specification that says what kind of data should be collected from RPC frameworks, databases, and so on, but it is also an instrumentation API that sits between the user application code and the tracing library, which is the implementation of the OpenTracing API. This architecture allows us to change the tracing system without changing all the instrumentation points that are embedded in RPC frameworks and in our application code. However, there is one downside: if we want to do that, we still have to recompile and redeploy our application, and this might be a problem if we have dozens, maybe hundreds or even thousands of microservices. It can be a very costly thing to do. The problem is that OpenTracing doesn't define any data format, so all the tracing implementations use different data formats. So let's have a look at OpenTelemetry.

Here we see basically the same architecture as on the previous slide, but OpenTelemetry now substitutes for the instrumentation API as well as the implementation of that API, and we also see OpenTelemetry in the agent and the collector. The difference between OpenTracing and OpenTelemetry is that OpenTelemetry defines the API but also the SDK, the implementation of the API, and it also defines the data format that is exported from the SDK. This allows us to have an OpenTelemetry Collector which accepts this data format and can then translate it into different data formats for different tracing systems. And this pattern allows us to change the tracing system without recompiling and redeploying our applications.

What may be a little confusing for Jaeger users is that we see the OpenTelemetry logo in the agent and the collector. This is purely Jaeger's decision, because in the Jaeger project we have decided to base our agent, collector, and ingester, basically all our backend components, on top of the OpenTelemetry Collector. This way, all Jaeger backend components will provide the same functionality that is available in the OpenTelemetry Collector, and we just add Jaeger-specific functionality to it, for example the storage implementations.

So let's talk more about the OpenTelemetry Collector. The collector itself is written in Golang, as are the Jaeger backend components. In terms of the Jaeger integration, we basically rebased our backend components on top of the OpenTelemetry Collector and added Jaeger-specific functionality to them. So now Jaeger users will benefit from all the functionality available in the collector, while still being able to use all the current functionality of the Jaeger components. We also want to make it very easy for users to migrate to these components, so we will keep the current architecture with the agent, collector, ingester, and all-in-one, and probably the same configuration options as well. If you are interested, there is already a section for OpenTelemetry on our website where you can read about the available configuration options; there are also some guidelines, and you can start using these new components right now.

Now let's talk about the relationship between Jaeger and the OpenTelemetry SDKs. The OpenTelemetry SDKs usually support the Jaeger gRPC exporter and the Jaeger propagation format, which basically allows you to deploy services instrumented with OpenTelemetry into an ecosystem where you are using Jaeger clients. Then there is the OpenTracing shim, which is basically an OpenTracing implementation that uses the OpenTelemetry SDKs; this allows you to use all existing OpenTracing instrumentation libraries with an OpenTelemetry SDK. And last but not least, Jaeger clients support W3C Trace Context, which is the default propagation format of the OpenTelemetry SDKs, so you will be able to use Jaeger clients in a new ecosystem where OpenTelemetry SDKs are used.

Okay, let's move to a different topic: Jaeger and Kubernetes. Jaeger provides excellent integration with Kubernetes, and you can deploy Jaeger into Kubernetes using Helm charts, plain Kubernetes manifest files, or the Jaeger Operator. The Operator is probably the most advanced method of deploying Jaeger into Kubernetes. It follows the standard operator pattern: first you deploy the Jaeger Operator and create the custom resource definition for it, and then you create a custom resource in which you define how the Jaeger deployment should look. For example, in the custom resource you can specify that you want just an all-in-one deployment, or a production deployment with a storage backend.
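For example, a minimal custom resource for a production deployment might look like this (the apiVersion and kind follow the Jaeger Operator; the storage options are illustrative):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simple-prod
spec:
  strategy: production        # separate collector/query; default is allInOne
  storage:
    type: elasticsearch       # the production strategy needs a real backend
    options:
      es:
        server-urls: http://elasticsearch:9200
```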
The Jaeger Operator can also provision storage backends under certain conditions. For example, if you have deployed the Strimzi Kafka operator in the cluster, the Jaeger Operator will be able to auto-provision a Kafka cluster for you. Okay, that is everything from my side, and thank you very much for your attention.

Thank you, Pavol. This is the end of our talk. These are the different ways you can get in touch with the Jaeger maintainers and community. We have bi-weekly meetings where you can dial in and participate in the discussions, and make sure to star the projects on GitHub; developers like those stars. Thank you very much for joining.