So hello everyone and welcome to the Jaeger session. It's amazing to see so many people interested in the project; I think this might be the biggest session we have ever had at KubeCon, so thanks for coming. My name is Pavel. I am a software engineer at Red Hat, a Jaeger maintainer, and also a maintainer of the OpenTelemetry Operator and the Grafana Tempo Operator, and in general I contribute to observability projects in the CNCF. I'm based in Switzerland, and when I'm not working I'm usually in the Alps doing some freeride skiing or mountain biking. And with me is Jonah.

Thank you, Pavel. I spend my time in the opposite element. My name is Jonah Kowall. I'm the VP of product management at Aiven, and for fun, as my extracurricular open source work, I am a maintainer of the Jaeger project. I'm also heavily involved in OpenSearch, and observability is near and dear to my heart. I live in Miami, Florida, and I spend a lot of time underwater, the opposite of Pavel, exploring what happens beneath the surface. These are some photos I took, too.

So today we have a couple of demos for you, but we're also going to go over some of the basics. This morning at the booth, for the first couple of hours, we had people coming up to us asking, what is tracing? So we're going to spend a few minutes on the basics, and then we're going to get deeper and deeper into Jaeger for those of you that are well along your journey in distributed tracing and observability. We'll show you a couple of demos, and we'll talk about what's coming next in the project. We've got a very ambitious plan that we're going to outline, and we obviously love contributors, so we'll talk about some of the things we're doing where you all can get involved in the project and then be up on stage at KubeCon talking to all of these people. We should have time for Q&A as well.

And with that, I will jump into distributed tracing 101 for those of you that are not familiar. Most folks, and this number may be a little bit out of date, are running microservices architectures where you've got a lot of distributed components, potentially owned by different teams. The challenge is that when things break, we do a lot of finger pointing, or we dig through logs trying to figure out where it broke, who broke it, how to fix it, and how to get our customers going again as quickly as possible. The purpose of distributed tracing is to enable root cause analysis: looking at the relationships between different services, and then being able to drill in and really understand what's happening, where it is slow, and where the error is occurring. This lets you collaborate across all of your distributed teams with data, not just finger pointing. It can also be used for monitoring, and we'll talk about how, so you get visibility into your SLAs, your performance, and how your services are doing.

There are a few basics with tracing. The first is instrumentation; you'll hear us talk about that a lot. That is putting something in or near your software to emit data that can then be collected. The second part is collecting that data and storing it, which is part of what Jaeger does. And finally, with that storage, we can analyze the data that we collect, which is also part of Jaeger. We'll talk about all of these pieces in the next few minutes. So let me hand you this mic.

Yeah, so I'll continue with the introduction to Jaeger.
We'll deploy it on my laptop along with a very simple microservice application, and we will use Jaeger to monitor it and try to reason about its performance. Before we do that, I would like to talk about the data model used in tracing in general. There are mainly two concepts that are important. The first one is a span. A span is essentially a data structure that models an invocation in the system, and an invocation can be an HTTP call, a database call, or even an internal method call in your business logic. Since it's a call, it has a start and an end, so it implies a duration, but most importantly it contains metadata, which we call tags or attributes, as you will see in the demo, that help us understand what the operation actually did in our system.

So that's a single operation, but in tracing the value is in the distributed environment, in the distributed execution. When we stitch spans together, we form a trace. A trace is essentially a list of spans that are correlated: they contain the same ID, which we call the trace ID, and each operation contains a span ID and a parent ID. Based on this we are able to build a tree, which is one way to visualize a trace. The most common visualization is what you see on the right side, the timeline view or Gantt chart. It's the visualization that, I think, every tracing tool uses. And with that, we'll jump into the demo and I will show you how we can use the timeline view to understand tracing data.

So deployed locally is the Jaeger all-in-one, which is a single container that runs all the Jaeger components, and in the second terminal I'm running the HotROD app from the Jaeger project. It's a Go application that contains several microservices. As we can see from the logs, there is the customer service, the route, the driver, and the frontend. Let's jump to the browser to see the UI. This is the UI, and essentially when I click on one of these buttons, I order a car. So let's try to do that, and then I get a response from the backend saying that a driver with this license plate will arrive in two minutes. This is the request ID, and this is the latency measured from the browser.

When I jump back into the console, I see the logs, and by reading the logs I'm able to roughly understand what is happening. I see that the frontend received the request, the customer service was processing the request by getting the customer data, then the driver service was probably trying to find nearby drivers, then there was some route calculation, and finally the response was dispatched back to the browser. So that's fine: I can use logging to sort of understand what's happening. The problem with logging is that if there are multiple concurrent requests, as I'm simulating right now, and I jump back to the console, I'm no longer able to understand what's happening here, because all those logs are mixed together across multiple concurrent requests. This is where tracing can help you correlate these logs into a nice visualization, which we will see in the Jaeger console.

So let me jump back to Jaeger. Jaeger is not only able to visualize the trace in the timeline view, as we will see, but it also has the capability to show you the dependencies in your environment. What we see here is very similar to what we saw in the logs: we see the service names, the frontend, the customer, and the driver.
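To make the span and trace concepts from a moment ago concrete, here is a purely illustrative sketch of two spans that belong to one trace: they share a trace ID, the child points at its parent through the parent span ID, and each span carries a duration and a set of attributes. The field names are simplified for readability and are not the actual OTLP or Jaeger storage schema.

```yaml
# Illustrative only: simplified field names, not the real OTLP/Jaeger wire format.
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
spans:
  - span_id: 00f067aa0ba902b7
    parent_span_id: null              # no parent, so this is the root span
    service: frontend
    operation: HTTP GET /dispatch
    start: "2024-03-20T10:15:00.000Z"
    duration_ms: 1800
    attributes:
      http.method: GET
      http.status_code: 200
  - span_id: 53995c3f42cd8ad8
    parent_span_id: 00f067aa0ba902b7  # child of the frontend span above
    service: customer
    operation: SQL SELECT
    start: "2024-03-20T10:15:00.050Z"
    duration_ms: 1400
    attributes:
      db.system: mysql
      db.statement: SELECT * FROM customer WHERE customer_id = ?
```

The timeline view described above is essentially this parent/child tree laid out horizontally, with each span drawn as a bar whose length is its duration.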
So what is happening in this dependencies view is that Jaeger is looking at the span data and trying to understand which service is calling which, and how many times. That is one capability. The second capability is the search, where we can find the collected traces. Let me find the traces for the frontend service, which is the first one, as we saw in the system architecture, and here we see the traces stored in Jaeger. We see the overall invocation time for a request, in this case 1.8 seconds; the next ones took only 40 or 50 microseconds, and so on. We also see which services were involved in these requests, how many operations or spans were reported, and how many errors.

So we'll just choose this one, and now we get the timeline view. It might be overwhelming to see this screen for the first time because there is a lot of information, but it's actually very easy to understand this diagram. On the left side we see the service name with the operation name, and on the right side we see these bars; the longer the bar, the longer the operation took. For instance, this MySQL call took 1.4 seconds out of 1.8, so almost 80% of the whole invocation was spent in the MySQL database. When I look further, what I see here are the calls from the driver service to Redis. They are very short, but they are done in sequence, so maybe there is a for loop executing calls to Redis. Maybe this was intentional, but maybe this is something the developer forgot to optimize and should use a batch API, or execute all these requests in parallel in separate goroutines. So by looking at this diagram I'm able to understand where the time is spent and what the structure of the calls is, which helps me understand the application and make optimizations.

What we have as well is the highlighting of the critical path, denoted by this solid black line. The critical path is very important, because if I want to optimize the overall latency of this request, which was 1.8 seconds, I should only optimize operations that are on the critical path. What I see next are the exclamation marks that show me there is an error. When I click on it, it comes from Redis. I see there was an exception, and I see the exception message: it was a Redis timeout, so probably nothing really serious. So in tracing, when the instrumentation sees an error, it will attach it to the spans, which is very helpful.

Then each span has tags and a process, and the tags are the attributes that describe the operation. I would like to show you the tags from the HTTP call in the frontend service. We get the tags that describe the HTTP method, the route, the status code, and all the important information we care about for an HTTP request. If this were a database call, we would get different tags, but again the ones that are important for understanding a database call. What is very useful in tracing compared to logging is the correlation, where we get this nice view with correlated spans and logs, but also the consistency. In logging there is no standardization on what data should be in the logs. Different languages, different frameworks, and different developers put different data into logs, which makes it very hard to understand logging at scale in a microservice environment with multiple languages. With tracing we get a consistent set of attributes for the same events.
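To illustrate that consistency, an HTTP server span typically carries a small, predictable set of attributes no matter which language or framework emitted it. The keys below follow common OpenTelemetry and Jaeger conventions, though the exact names vary between semantic-convention versions.

```yaml
# Illustrative HTTP server span tags; exact attribute names depend on the SDK
# and semantic-convention version in use.
tags:
  span.kind: server
  http.method: GET
  http.route: /customer
  http.status_code: 200
  error: false          # set to true when the instrumentation records a failure
```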
So for instance, if the frontend is written in Go, I get the same attributes as in the route service, which might be written in Node.js or an entirely different language. So you can think of tracing as logging with strong correlation and with consistency in what the data contains, which helps you understand the application. Okay, that's all from me; Jonah will continue with the monitoring. Thank you.

So one of the things that we added to Jaeger in the last couple of years is the integration with Prometheus, and the use case here, instead of the diagnostics that Pavel just went through, is monitoring. One of the things we wanted to do is make Jaeger more useful operationally, and the purpose is to move Jaeger from distributed tracing only towards APM, application performance monitoring. It's about bringing the tracing and the metrics together. It also allows you, with Prometheus Alertmanager or whatever you're using for alerting, to enable those kinds of use cases, so your team can get alerted when certain things are occurring. I'm going to give you a few examples in the demo of how this is useful in addition to what Pavel shared, which is very useful to developers but is more of an after-the-fact view, whereas operationally you want something that works before the fact.

The way this works today is that when you set up your OpenTelemetry Collector to collect that data, there is a component called the span metrics connector (it started out as a processor). What it does is derive metrics from the trace data, so you don't have to do any additional work: the collector takes the trace data, builds metrics, and sends those off to any supported metrics backend. In Jaeger we have built this for Prometheus, but if you're using some other metrics backend, commercial or open source, that's supported in OpenTelemetry, you can use the same method; and of course, feel free to contribute to Jaeger to support it in the UI, which is what I'm going to show you in a minute. So in your OpenTelemetry pipeline you basically declare the span metrics connector, and it will build these data sets. There are a lot of other options you can put into this configuration, so I suggest you look at that connector in the OpenTelemetry Collector contrib repository, where there are further examples of how to configure histogram buckets, dimensions, and other settings that help with percentiles and may be useful for monitoring. But the basic configuration, sketched below, is enough to generate that data.
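As a reference, here is a minimal sketch of the kind of collector pipeline being described: spans come in over OTLP, the spanmetrics connector derives request/error/duration metrics from them, the spans continue on to Jaeger, and the metrics go to Prometheus, which the Jaeger query service can then read to populate the Monitor tab. The endpoints and bucket boundaries are placeholders; the spanmetrics connector in the opentelemetry-collector-contrib repository documents the full set of options.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

connectors:
  # Derives RED metrics from incoming spans.
  spanmetrics:
    histogram:
      explicit:
        buckets: [2ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s]

exporters:
  # Spans continue on to Jaeger over OTLP (adjust the endpoint to your setup).
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Derived metrics are exposed for Prometheus to scrape.
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, otlp/jaeger]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```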
I'm going to show you a demo so we don't need to talk through this further; give me a moment and I will pull that up. So we've been running another Docker container besides the one Pavel was showing you, and it's been generating spans automatically in the background. I can show you the same view Pavel was just showing you; the data is going to be less interesting because they're simulated spans, so they all look roughly the same and there are no real errors. The interesting part, aside from the tabs in the UI that Pavel was showing you, is this new Monitor tab. We've been running this since before you all sat down, and what you see here is that we're collecting a few different metrics at a high level. These are often known as RED metrics. The first is the request rate of my service, which shows how many times it's being called. The second is the error rate of my service, and this one is very nice; I wish my production looked like this, with no errors. And then of course there's the latency, or the response time, of the service itself. In this case it's very even, but the idea is that I can look at this operationally. I can also set up alerting in Prometheus, so that when the error rate starts creeping up to a level I'm uncomfortable with, say your service averages 5% but it jumps to 30%, you get notified and can understand what's going on. These metrics are given to you automatically through that integration, so it's really helpful operationally to have them without having to configure additional Prometheus metrics or write any application code. The list of endpoints is shown below, broken down by the type of call, and we can also look at the other microservices, which are all calculated the same way. Interestingly, the Redis one looks a little different, which is odd considering they're simulated, but in your production environment you'll obviously see a lot more variation in the data coming in. So that's the advantage of the Monitor tab, and it's definitely something to check out. And of course you can visualize these metrics in your own Grafana or other tools you're using on top of Prometheus, so they're helpful for other purposes as well. I'm going to jump back to the presentation here, and Pavel is going to take us through Jaeger ingestion pipelines.

Thank you, Jonah. Yeah, so I will talk about how we can deploy Jaeger, and how we can deploy it alongside the OpenTelemetry Collector, because as we saw, the new capabilities in Jaeger require collector components such as the span metrics connector. And it's not only about the span metrics connector. As you know, the Jaeger project deprecated its SDKs in favor of the OpenTelemetry SDKs, and we have also deprecated the Jaeger agent. If you have workloads that emit spans in the legacy Jaeger formats, you can consume that data with the OpenTelemetry Collector using the Jaeger receiver; the collector supports the currently supported Jaeger protocols, gRPC and Thrift over HTTP. There is also support for the Jaeger remote sampler in the collector, and the Kafka receiver and exporter can handle data in the Jaeger payload format.

And that's not all. The OpenTelemetry Collector also gives you access to a large ecosystem of additional capabilities. For instance, there is a lot of functionality for massaging the data, such as removing attributes that you don't want to send to Jaeger or to any observability vendor, or a processor that can automatically attach Kubernetes resource attributes to your data, which helps you identify the pod name, the deployment name, and so on. Just take a look at the processors directory in the OpenTelemetry Collector contrib repository and you will see a large set of capabilities that you can use alongside the Jaeger project if you deploy the collector.

So how do we deploy these two systems together? It's very simple. We have an application instrumented with the OpenTelemetry SDK, and then we can send data directly to Jaeger, or to the collector and from the collector to Jaeger. Jaeger can also receive data in the OpenTelemetry format, OTLP, so you can combine these two systems based on your requirements and needs. That's the simplest deployment. If you need more scale, you would typically deploy Kafka in front of Jaeger.
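For concreteness, here is a minimal sketch of the collector-in-front-of-Jaeger deployment just described: it accepts spans from both OTLP and legacy Jaeger clients, enriches them with Kubernetes metadata, drops an attribute you don't want to ship, and forwards everything to Jaeger over OTLP. The endpoint and the dropped attribute name are placeholders for illustration.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  # Accepts spans emitted by workloads still using the legacy Jaeger formats.
  jaeger:
    protocols:
      grpc:
      thrift_http:

processors:
  # Attaches Kubernetes resource attributes (pod name, deployment name, ...).
  k8sattributes:
  # Example of removing an attribute before it reaches any backend;
  # "internal.debug_blob" is a made-up key for illustration.
  attributes:
    actions:
      - key: internal.debug_blob
        action: delete

exporters:
  otlp:
    endpoint: jaeger-collector:4317   # adjust to your Jaeger deployment
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [k8sattributes, attributes]
      exporters: [otlp]
```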
And as I mentioned before, the OpenTelemetry Collector can send data to Kafka in the Jaeger format, which can then be received by the Jaeger ingester and stored in your database.

Right, so Jaeger has been stable for quite some time; we've been on version one basically since the project started, or since it became part of the CNCF. But we're doing a pretty major rearchitecture where there are going to be some breaking changes. We're trying to minimize those, but ultimately we're rebuilding Jaeger around the OpenTelemetry Collector, and the idea is to embrace the ecosystem and reduce redundancy in the code base and in the things that folks have to learn about and maintain. This is a very big project that has been in the works for quite some time. The goal, besides adopting the collector, is also to support new types of storage and APIs. Today the officially supported backends for Jaeger are Cassandra, Elasticsearch, and OpenSearch; we're very close on ClickHouse, and I'm going to talk about that later, but the idea is to build new storage APIs in OpenTelemetry to support that. We also want to make sure it's compatible for those of you using Jaeger version one, so that the UI continues to function either way, and we're trying to make it as compatible as possible to avoid issues with your migrations while you're in the middle of changing over. Part of it is also supporting OTLP natively as much as possible, not just on the edges of Jaeger as Pavel just described.

Similarly, today there are a few different binaries for Jaeger that make up its microservices. The goal is to consolidate those into a single binary that can be used in many different ways within your Jaeger environment, simplifying the complexity of the services and letting you configure it in whatever way you need for your scale and use cases. And the configuration is moving too: today you can pass crazy amounts of arguments to Jaeger on the CLI and configure everything on the command line only, with no config file, whereas OpenTelemetry uses config files specifically. So we're moving in that direction, where config files will be used instead of huge lists of arguments to the binaries being executed. These are some of the goals we had in mind for the project, and Pavel will take you through a little more of the architecture.

Yeah, so essentially what we want to do is implement Jaeger functionality as OpenTelemetry Collector components. For instance, the query will be an extension and the storage will be an exporter, and then we produce a single build of this Jaeger collector where you will be able to pick the right exporter for your storage, as well as enable the span metrics connector or any other collector capability available in the ecosystem. So let's take a look at the config; hopefully it's big enough. The most important parts here are the extensions. There are two of them: the Jaeger storage extension and the Jaeger query extension. In the storage extension, you define which storage backend you want to use, in this case in-memory. Then you wire the storage to the query extension: in the query extension, I reference the memstore, whose configuration lives in the storage extension. And in the exporter, I again reference the Jaeger storage extension. This is how we wire the exporter together with the query.
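As a rough illustration of that wiring, the configuration being described looks something like the sketch below. Jaeger v2 was still in beta at the time of this talk and the exact keys have changed between releases, so treat names like `memstore` as placeholders and check the current Jaeger v2 documentation for the real schema.

```yaml
# Sketch only: key names approximate the beta-era Jaeger v2 config.
extensions:
  jaeger_storage:
    memory:
      memstore:                 # an in-memory backend, referenced by name below
        max_traces: 100000
  jaeger_query:
    trace_storage: memstore     # the query UI/API reads from this backend

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  jaeger_storage_exporter:
    trace_storage: memstore     # incoming spans are written to the same backend

service:
  extensions: [jaeger_storage, jaeger_query]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger_storage_exporter]
```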
This wiring is a new kind of pattern in the OpenTelemetry Collector; at the moment there is no established way to do this. What is also cool about this setup is that you can easily combine multiple exporters in your config. For instance, I showed you the dependency diagram, the system architecture. In my setup it's computed from the in-memory store, but if you were using Elasticsearch you would need to enable that for Elasticsearch as well, and it's much easier to calculate this data just in memory. With the new setup, you will be able to combine a span store on persistent Elasticsearch storage with a dependency store backed by the in-memory solution.

On progress: we already have the builds, the base is in place, and we also publish Docker images so you can run it. It's in beta, and we would like to release a GA probably this year. Most of the work is happening through mentorships, and we'd like to invite you to help us with the project as well. At KubeCon on Friday there is the Jaeger ContribFest, and I invite you all to join me and work on Jaeger v2 support. And now let's talk about the roadmap. Thank you very much.

Yeah, so we're super excited about v2. We're really thankful to the mentees who are helping, both through the Linux Foundation and Google Summer of Code, so we have two mentorships running. We do this to get more maintainers and more participants in the project. A few new things, and I've had folks come to the booth this morning and ask: we now officially support Elasticsearch 8, which was something that was missing for some time. We've already talked about Jaeger v2, but there are some additional capabilities. We also support OpenSearch and Badger; these are new and updated versions of the backends. We've added a few new capabilities to the UI, mostly enhancements. You saw Pavel talk about the critical path, and there are a lot of other capabilities we've added to the UI to make it easier; I'm not going to go through them one at a time.

In terms of the roadmap, Jaeger v2 is soon to be beta, hopefully in the next couple of months. We've explained some of the pieces here, but we also have to work on our pipeline, documentation, and various other things. So even if you don't write code, you can help with docs, or if you're interested in the project, please reach out. We are on the CNCF Slack, which I'll mention, and we're always looking for help. You don't have to be a great coder; you can do all kinds of things to help the project. The other thing we're doing is supporting ClickHouse natively in Jaeger v2. A lot of people are interested in ClickHouse, and it's a great database that's highly efficient for trace and log data. So there's much more coming.

Before we go to Q&A, definitely check out the docs, join our monthly community call and the CNCF Slack, and do leave feedback. And with that, I guess we will open it up to Q&A. I think we've got a few minutes; there are a couple of mics on long booms on either side, so feel free to ask any questions. We've got one question on the left here.

Thanks. Hello, that was a very nice presentation, and we are eager to see Jaeger v2. I wanted to ask: in very high-volume environments where you have lots of transactions or requests, we found it useful to use sampling.
However, what would you advise, or what would be the best practice, when using sampling to combine data from multiple instrumentations and make sure they are all sampled consistently?

I'll give you my opinion and then Pavel's. What we've seen is that tail-based sampling is generally better, because you can look at the end of a transaction and decide whether to keep it or discard it. So for example, for requests that return a 200, where everything is okay and the latency is relatively low, you probably only need to keep 1% of those, just so you have a baseline. If they are 500 errors, or they exhibit high latency, you want to keep those. So using different types of tail-based sampling is the best approach, but it also requires a lot of collector resources: you need more memory in the collector to do tail-based sampling, and it uses more processing power. So it's always a trade-off, whatever you decide to do. But let me let Pavel give his opinion too. I think that was good enough, yeah. Okay, so we're in agreement. But yeah, that's always a good question. We've got two minutes for one more question over there on the left.

Thank you for the talk. I noticed there was Kafka support in v2; I just wanted to ask if there has been any discussion about Pulsar support. Can you repeat that, support for what? Pulsar support. Yeah, I don't know. We support Kafka today; it's part of the architecture and it will continue to be supported, because a lot of folks use it. Pulsar does support the Kafka protocol, so it should work. Have I used it or seen a customer using Pulsar? Not personally, but in theory it should work if the protocol is compatible; I just haven't seen it firsthand. All right, thank you. Thanks.

So if there are no other questions, thank you all for coming. Have a great week at KubeCon, and see you at the ContribFest.
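For anyone who wants to try the approach described in that answer, here is a minimal sketch of a tail-based sampling setup using the OpenTelemetry Collector's tail_sampling processor. A trace is kept if any policy matches, so errors and slow requests are always retained while everything else is sampled at a low baseline rate; the thresholds, percentage, and endpoint are placeholders to adapt to your own traffic.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  tail_sampling:
    decision_wait: 10s              # buffer each trace before deciding its fate
    policies:
      - name: keep-errors           # keep every trace that contains an error
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow             # keep traces slower than the threshold
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline              # keep ~1% of everything else as a baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

exporters:
  otlp:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

Note that the buffering implied by decision_wait is what drives the extra memory and CPU cost on the collector mentioned in the answer.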