So hello everyone, and welcome to the Jaeger project. I'm excited to be here and continue talking about Jaeger after KubeCon in Amsterdam and show you the project updates. My name is Pavel. I am a software engineer at Red Hat. I'm a Jaeger maintainer, as well as a maintainer of the OpenTelemetry Operator and the Grafana Tempo operator. When I'm not contributing, I like to spend my time in the mountains doing some freeride skiing and mountain biking. If you'd like to reach out to me, you can do that on the CNCF Slack or on Twitter, and we can talk about observability or one of these fun sports as well.

So today I will start with an introduction to distributed tracing. We'll talk about why we should use tracing in the first place. Then we will do a live Jaeger demo with a microservice-based application, where you will see how Jaeger works and what kind of data you can see in the console. We'll also focus on the service performance monitoring tab that we have in Jaeger, and I will explain how it works. Then we have a bunch of topics related to OpenTelemetry. We'll talk about how you can use the OpenTelemetry Collector with the performance monitoring tab, but as well how you can mix the OpenTelemetry Collector with the Jaeger collector. And then we'll talk about Jaeger v2. It's a new project that we have in Jaeger: we want to rebase the Jaeger components on top of the OpenTelemetry Collector. Towards the end, I will briefly talk about OpenTelemetry auto-instrumentation and how you can use it with Jaeger. And last but not least, we'll talk about new features and the roadmap for 2024.

So why distributed tracing? The TL;DR is: because we write complicated code and we use complicated architectures, right? We as an industry spend a lot of time thinking about how we can decouple our applications, how we can split them into separate pieces that can be created, compiled, deployed, and shipped separately, which is great. It enables us to innovate independently. However, when things go slow or break, we should have a proper tool to identify what is causing the issue. And the problem is that these separate components or services, or even third-party APIs, are managed and operated by different teams, right? If we don't have a tool that can pinpoint the issue, then we don't even know who to contact if something goes wrong. So tracing, besides root cause analysis, can help us understand the relationships and dependencies that we have in our system, and as well define very effective SLAs and SLOs.

Before we jump into the demo, I would like to talk about how we can conceptually split a distributed tracing deployment in our environment. There are usually three components. First, there is the instrumentation, which is all about how we capture data from our applications. It's very important to keep instrumentation separate from the data collection and the backend, because it usually requires code changes, right? And if code changes are required, then we need to recompile and redeploy, which can be very time-consuming if you have maybe dozens or hundreds of services. For instrumentation, we recommend users to use the OpenTelemetry project, which gives us stable APIs that allow us to use any vendor of our choice and don't lock us into a specific backend, which is very important. Then there is data collection, and data collection is about how we gather this data from the instrumentation into a collector.
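To make the data-collection stage concrete, here is a minimal sketch of an OpenTelemetry Collector pipeline. The receiver, processors, and exporter are real collector components; the Jaeger endpoint and the attribute key are just examples:

```yaml
receivers:
  otlp:                       # receive OTLP from instrumented applications
    protocols:
      grpc:
      http:

processors:
  attributes:                 # example: drop a sensitive attribute
    actions:
      - key: http.request.header.authorization   # example key
        action: delete
  k8sattributes:              # enrich spans with Kubernetes metadata

exporters:
  otlp:
    endpoint: jaeger-collector:4317   # example Jaeger OTLP endpoint
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, k8sattributes]
      exporters: [otlp]
```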
We process this data, and by processing I mean we might, for instance, remove some sensitive data, or we might do additional data capture, like the two processors in the sketch above. For instance, we might collect Kubernetes resource attributes, or we can even extract new telemetry data from the data that has been collected, as we will see in the demo. And then finally, there is the storage, with analytics functionality and visualization.

So Jaeger falls into data collection and storage with visualization, and OpenTelemetry into instrumentation and data collection as well. So we see there is an overlap, and we'll talk about how we can use both projects simultaneously.

Before I jump into the demo, let's first unpack two concepts that we use in tracing. The first one is a trace. A trace essentially means an end-to-end execution in our system, right? It models how a request went through all the services. And then there is a span, which models a single unit of work, essentially. You can think about it as a method invocation, or an HTTP call, or a database call. A span usually has a start and an end, which implies a duration, and it contains contextual information that we call tags or attributes, right? These describe what the operation was actually doing. So when we put multiple spans together, we get a trace, which is the tree-like structure that you can see on the left, or you can visualize it as the timeline view that you see on the right. We call it as well a Gantt chart. This is the visualization that most tracing tools use.

So with that, I will jump into the Jaeger console, and what I have here is the HotROD application that comes from Jaeger upstream. You can run it as well, and it's very simple. You can click on one of these buttons, and when you do that, you essentially order a car ride, like Uber or a taxi. So let's try that, and we get a response from the backend saying a driver with a given license plate will arrive in two minutes, and then we get the latency measured from the browser.

So how can we use a tracing tool to understand this application? In the Jaeger console, the first thing we can do is show the system architecture diagram. It shows us the relationships and dependencies between services. In this case, the frontend is calling customer, then driver and route, and then there is a bunch of databases. We see as well there is some sort of test executor service that is calling the UI and then the frontend, so behind the scenes I'm also running a script that is calling the service. This is great for the relationships, but on this screen I cannot perform root cause analysis, right? I don't know what the end-to-end execution is, whether the frontend calls customer first, or driver, or route. If I need to do root cause analysis, I need to go to search and search for traces.

What I get here as search results are all the user actions or transactions, right? I can see which one is the slowest, which one is the fastest, what the overall latency of the request is, how many operations there are, and what services were executed in this transaction. The latency we see here will differ from what we saw in HotROD, because that one is measured from the browser and these ones are measured from the services, from the first service in my environment, which will obviously be less than what the browser sees.
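By the way, if you want to follow along at home, here is a sketch of running HotROD together with an all-in-one Jaeger via Docker Compose. The image names are the upstream ones; the environment variables and ports reflect my assumptions about a recent release:

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true   # OTLP ingestion (default in recent releases)
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
  hotrod:
    image: jaegertracing/example-hotrod:latest
    environment:
      # standard OpenTelemetry SDK variable; recent HotROD builds export OTLP
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
    ports:
      - "8080:8080"     # HotROD UI
    depends_on:
      - jaeger
```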
So what I'm going to do next is choose the slowest transaction, and I get the timeline view. There's a lot of information here, but it's very simple to understand. On the left, I see the service name and the operation name, and on the right, I see the line that denotes the duration of the operation. So for instance, I see the MySQL call took maybe 357 milliseconds, which is roughly half of the overall latency. By looking at this, I understand how long each operation takes, but I also see the structure, right? I see the driver service is calling Redis, probably to get some data, and I see that it's calling Redis in sequence. So maybe there is a for loop that is executed against the Redis API, which might be by design, which is okay, but maybe it's a mistake and it's something I could optimize in my code. I could maybe use a batch API or execute all these requests in parallel.

What I see as well is this exclamation mark, which denotes there is some sort of issue. When I click on it, I see that this operation failed because it's marked with an error, and I even see the exception message, which is a Redis timeout. So usually in a tracing system, if there is an exception, it's going to be captured in the spans.

Now, when I click on the first one, which is the span from the dispatch API, I see the tags that show me this is an HTTP operation. I see the endpoint, the HTTP method, versions, status code, all the important information to understand the HTTP call. When I click on a similar HTTP span from the route service, I again see the same data. And this is very important, because I get a consistent set of information for the same event across different languages and frameworks. When I compare this to logging: in logging, we have no standardization, right? Different frameworks, different developers, and different languages have different conventions in logging, which is not great in distributed systems that are usually polyglot and use different technologies. With distributed tracing, you get this nice standardization where you always get the same data across your environment.

What I see here as well is the process, which shows me where the data was exported from. In this case it's kind of boring, it's just my Fedora Linux machine. But if this was running on Kubernetes, I could also see the pod name, the deployment name, and all the important information to find the source of the data.

And then I see logs. These are actually logs from the standard output. You can configure the instrumentation that runs in your process to send the log messages to the current span, which is super cool, right? Because then the logs are not mixed between multiple concurrent requests that your process is handling. If I switch to the plain logs... yeah, here it is. These logs are mixed together, right? It's hard to understand them, but in a tracing system I get them nicely attached to the span and nicely parsed, with the time and the message.

One of the new features that we have in Jaeger is this black solid line, which shows the critical path. The critical path shows us the operations that are the most important ones, the ones that contribute to the overall latency of this transaction.
So if I need to optimize latency in this user action, I should optimize only operations that are on the critical path, because if I optimize something else, it will not translate into a latency improvement. I can as well collapse the operations, and it's going to be properly reflected in the timeline. And maybe one last thing: for the database call, for instance, we are able to see the query statement. So to summarize, the timeline view is great for performing root cause analysis and understanding what the system is actually doing.

Now I will switch to the Monitor tab. In the Monitor tab, we see metrics: the latency, error rate, and request rate. And what is cool is that we see the same set of metrics across all services that export trace data. On top of that, these metrics are split by operation name, which is usually the URL pattern for an HTTP request. So if I have, I don't know, five REST APIs, I will see the same set of metrics for each of the REST APIs in my process. This works through the OpenTelemetry Collector, and it moves the Jaeger project towards a more traditional APM solution: it gives us monitoring and additional alerting capabilities. Alerting is not part of Jaeger, but you can use Prometheus or another alerting system to alert on those metrics.

So how does it work? The trace data is exported from the instrumentation to the OpenTelemetry Collector. The collector then looks at this trace data, aggregates metrics from it, and exports those metrics to Prometheus, and the Jaeger UI just queries Prometheus. You can as well set up the OpenTelemetry Collector to export those metrics to a different metrics system, but only Prometheus is supported by the Jaeger UI. In the collector, you need to enable the span metrics connector, and it's very simple: you just need to put it in the connectors section, and then in the pipelines the connector needs to be put as an exporter for traces and as a receiver for metrics, because it consumes traces and emits metrics. (There is a minimal configuration sketch at the end of this section.) You can as well visualize those metrics in different tools, for instance Grafana, and you can send them to any metrics system supported by the OpenTelemetry Collector.

This then brings us to: how can we use the OpenTelemetry Collector with Jaeger? The collector integrates with Jaeger in different aspects. It can receive Jaeger data in the agent protocols as well as the collector protocol, it has the Jaeger remote sampling extension, and it can send and receive Kafka messages in Jaeger format. So why would you use the OpenTelemetry Collector with Jaeger? The Monitor tab might be one use case, and in addition to that, the collector has a great ecosystem of additional capabilities that are not in Jaeger. For instance, it allows you to filter data, redact PII, and drop data that you don't need. You can as well do tail-based sampling or even smart routing: for instance, you can keep the majority of traces in your local cluster and send just a subset to your more expensive third-party tracing system. And it has a very popular Kubernetes attributes processor that can automatically attach Kubernetes metadata to your telemetry data: the pod names, pod UIDs, the node name, and things like that.
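Here is the promised sketch of the span metrics wiring. The component names are the real ones; the endpoints are examples:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"          # example port, scraped by Prometheus
  otlp/jaeger:
    endpoint: jaeger-collector:4317   # example Jaeger OTLP endpoint
    tls:
      insecure: true

connectors:
  spanmetrics:                        # derives request/error/latency metrics from spans

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, spanmetrics]   # the connector acts as a trace exporter here
    metrics:
      receivers: [spanmetrics]                # and as a metrics receiver here
      exporters: [prometheus]
```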
So how do you combine these two? Simply: you can put the OTel Collector in front of the Jaeger collector and use OTLP to send data to the Jaeger collector. If you're using Kafka, the OTel Collector has the Kafka exporter; you can configure it with a Jaeger encoding (for instance, jaeger_proto) that will put the spans in Jaeger format into Kafka, which can be read by the Jaeger ingester, and the ingester will then store them in the database.

Okay, that brings us to Jaeger v2, which is a project that we started a long time ago; then it was kind of shelved, and now it's back. What we want to do, essentially, is rebase all Jaeger components on top of the OpenTelemetry Collector and provide a kind of opinionated build of the OpenTelemetry Collector. We want to expose the UI and query service as an extension, and implement the storage layer as exporters. So there will be an Elasticsearch exporter, a Cassandra exporter, and an in-memory exporter. The code is already in place in the main repository. We haven't released it yet, but it's there, you can run it, and this is the sample configuration. The interesting bit here is the extensions, and there is the Jaeger storage extension, which will encapsulate all the storage backends that we have. In this case there is only in-memory, but there can as well be Elasticsearch or Cassandra. You will configure the storage in the extension and then reference it in the Jaeger query extension and in the exporter. This way, your storage configuration will be in a single place, and those two components will just reference it. At the moment, only the in-memory storage supports this. But yeah, we are looking for contributions: in the main repository, you can find issues labeled v2, and we would love your help building it. We are happy to accept any contributions. This will be the future of Jaeger going forward. At some point, we will deprecate and remove the existing Jaeger collector, Jaeger query, and all-in-one, and there will be just this single build based on the OpenTelemetry Collector.

Before we wrap up, I want to talk about instrumentation. As you know, the Jaeger clients have been deprecated for a long time, and we deprecated the Jaeger agent this October as well. We recommend you migrate to the OpenTelemetry Collector, which has the Jaeger receiver that opens the same ports, so there is a clear migration path. If you are using the Jaeger SDKs with OpenTracing, you can use the OpenTelemetry OpenTracing shim and keep using your services instrumented with the OpenTracing API. OpenTelemetry has great language support, covering even more languages than we supported in Jaeger. On the configuration side, the migration is as well very simple and straightforward, because OpenTracing and OpenTelemetry use the same concepts. One thing to keep in mind is trace propagation: Jaeger uses the Jaeger propagation format, which is supported by all OpenTelemetry SDKs, but it's something you need to enable explicitly (for instance, via the standard OTEL_PROPAGATORS environment variable).

And OpenTelemetry goes beyond just the API and SDK, right? It has as well the agents, or auto-instrumentation libraries, which are software packages that you can just download and put into your hosts or Docker containers, and they will automatically instrument your applications without you doing any code changes. It's a very powerful and very simple way to get telemetry data. The agents are available for some languages and not for others; however, the community is still evolving, and for instance, in Go there is already eBPF-based auto-instrumentation.
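For the languages that do have agents, the OpenTelemetry Operator can inject them for you on Kubernetes with a pod annotation. A minimal sketch: the Instrumentation resource and the inject annotation are real, while all the names, the image, and the endpoint are placeholders:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation            # placeholder name
spec:
  exporter:
    endpoint: http://otel-collector:4317   # placeholder OTLP endpoint
---
apiVersion: v1
kind: Pod
metadata:
  name: my-java-app                   # placeholder workload
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"   # opt in to agent injection
spec:
  containers:
    - name: app
      image: example.com/my-java-app:latest   # placeholder image
```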
So I guess over time, more and more languages will have some automatic instrumentation in the OpenTelemetry ecosystem.

So what are the new features that we have? We enabled OTLP ingestion by default on the Jaeger collector, and keep in mind as well that the OpenTelemetry Collector removed support for the Jaeger exporter, so the way you get data from the OTel Collector to Jaeger is just OTLP. We also improved support for the span metrics connector that I showed you before. Actually, the span metrics connector used to be a span metrics processor, and there were some breaking changes that we needed to resolve. Then we have Jaeger v2, which is, I think, an exciting project that can really bring some innovation into Jaeger. Then we did some improvements on the query side. Jaeger query has multiple APIs; one of them is the v3 API that exposes OTLP, and we bumped it to OTLP v1. You can find the definitions in the jaeger-idl repository, and it is exposed over gRPC and HTTP. And on the UI side, we have the critical path visualization, text search within the timeline view, and batched loading of traces.

As for v2, it's something we really want to focus on in the next year. We will add support for the missing storage layers that we have, as well as for Kafka. We want to reuse as much as possible from the OpenTelemetry Collector; for instance, the Kafka exporter and receiver we will just import from the OpenTelemetry Collector contrib repository. At the moment, our build doesn't support the collector builder, which is something we want to support: we want to let users build their own distribution of the collector, with just the specific storage backend they want to use. There is as well somebody working on a native ClickHouse storage exporter that will only be in v2, and we want to officially support Elasticsearch 8 as well. And once we have more capabilities in v2, once we have feature parity, we will do an official release and deprecate the existing components.

Okay, this is everything that I have prepared for today. Do you have any questions?

Are y'all going to keep support for OpenSearch too in v2?

So, what is the support for OpenSearch?

Yeah. I think I was talking to Yuri, the main maintainer, and I think it should be well supported. I'm not sure what your experience is; I just want to make sure.

Yeah, it seems like it will be well supported, but there's going to be some feature deviation, from what I understand, between Elasticsearch and OpenSearch. So those are two different things, and we want to treat them differently. Even right now, when you run OpenSearch, you need to enable it differently than Elasticsearch.

Okay, so it's still on the table then. Yeah.

Hi, great stuff, thanks very much. I love the way you're using the OpenTelemetry component architecture to provide the transient storage and then, separately, the queries. I wonder: you have currently defined the Jaeger storage extension. It seems like it could maybe be generalized into a general telemetry storage, to store logs and metrics as well, and then people could build queries and UIs for those.

Possibly, yes. But it's something that we don't do in Jaeger.

Would it be easy to maybe just rename the Jaeger storage into telemetry storage and allow a storage element? It's all OTLP, right?

Yeah, so at the moment, this extension is using the storage layer from existing Jaeger, which has its own format for every single exporter, right? So Elasticsearch has a different model than Cassandra, and so on.
I see. Okay, thanks. Yeah, it's great, thanks a lot.

Thank you for your presentation. I have a quick question. We have a bunch of legacy applications that run a really old version of Java, like Java 6 or Java 7, and these applications give us a really hard time when something breaks. Is there any way we can use the Java auto-instrumentation agent for these services?

Yeah, so it depends on whether such an old Java version is supported by the Java agent. I'm not sure it is; I think the lowest supported version is Java 8.

Yes. Is there any way, or are there alternative ways, to implement tracing for this system?

You could implement it yourself by reusing parts of the Java agent. In the Java agent repository, there are instrumentations which are packaged as separate Maven modules that you could include in your application, initialize, and get tracing that way. So let's say there is an instrumentation for servlets: you could just pull it in as a Maven dependency, initialize it in your code, and have tracing.

Oh, so I have to modify the Java agent?

You would have to modify your application.

Yeah. Not really an option.

You could maybe build your own agent distribution with a subset of instrumentations that work on that specific Java version, but it's probably going to be a lot of work.

Oh, okay. Thanks. Another question: when I implement tracing and some of the traffic is TCP-based, that traffic does not show up on the service map. So what kind of traffic will show up on the service map?

That's a great question. So this diagram shows you connections...

I know HTTP works.

Yes.

But what about a TCP-based protocol?

TCP maybe, but the span data needs to contain the server and client attributes, or span kinds, to make the correct connection: that this is a server and this is a client, right? To figure out the direction of the arrow. It's looking at the HTTP attributes, but as well at the messaging attributes, like consumer and producer. So your TCP traffic will need to have some sort of metadata in the spans for this diagram to work.

So if it does not show on the service map, what can I do to investigate?

The best way is maybe to open an issue on Jaeger with a trace sample. You can get the trace, download it as JSON, and open an issue saying, hey, I have this trace and it's not showing on the system architecture diagram. We can take a look at the data and see what should be added, but you will probably need to add more attributes to your spans.

Thank you very much.

Can you talk a little bit about the multi-tenancy support that you guys added?

Okay, so the question is about multi-tenancy support in Jaeger, and it's a single-tenant system; there is not much. For instance, if you're using Elasticsearch, you could define tenancy by using index prefixes for different tenants, but it's something you need to set up and maintain yourself.

Okay, I thought there was an announcement that there were some updates around that.

There were different discussions, but it's not yet supported.

I'm just wondering if there are any plans for supporting Loki in v2.

Loki in v2?

Yeah, sorry, with version two of Jaeger, are there any plans to support using Loki for log storage?

So Loki is log storage, right, from Grafana. And it depends what you mean, right?
Do you mean how to correlate logs with traces, or to store traces in the Loki system?

So, I'm already using Loki now. If I were to use what you've got up there, that means I'm going to be storing logs in two places. Can I consolidate down to one log storage, having logs only in Loki, but still get the UI experience you've shown us?

I don't think so, but Grafana has the Grafana Tempo project, which also supports the Jaeger UI. So you could deploy Grafana Tempo, keep traces there, use the Jaeger UI with Tempo, and use Loki for logs with, I guess, the Grafana UI.

Great, thank you.

All right, any more questions? All right, thank you very much.