Well, hello and welcome again to another OpenShift Commons briefing. Today we have a really interesting new open source project that you probably haven't heard about, Jaeger. You probably have heard about Uber, and we're really pleased to have Yuri with us, who's one of the core contributors to Jaeger, and he's going to tell us about how all this works with OpenTracing, Prometheus, and lots of other aspects of distributed tracing today. We also have with us from Red Hat, Gary Brown, who's another one of the contributors to the project. We're hoping to build some awareness around Jaeger and get more of you involved in the community. The format we're going to use for today is that we'll let Yuri and Gary do their presentation; you can ask questions in the chat, and we'll have an open live Q&A at the end. All of this is being recorded, so don't try to scribble notes fast — there will be links at the end to the references for all of the stuff we're talking about. And with that, I'm going to let Yuri take it away and introduce himself, and I'm looking forward to the discussion afterwards.

Thank you, Dan. So just a few words about myself. I'm an engineer at Uber. We have an observability team in New York City, which does things like metrics, logging, tracing, and other observability-related applications, and I have been working on the Jaeger project at Uber for about two years; we open sourced that project back in April this year. I was also involved in the OpenTracing project from the beginning, as one of the co-authors, and I'm a member of the Specification Council for OpenTracing. So today, what I'm going to talk about is really to demonstrate why OpenTracing, and tracing in general, is a big deal in the microservices world. I will do a quick intro into what distributed tracing is, assuming some people may not know exactly what it is, and I will show you a demo that really demonstrates why it's useful on an example application, and that's pretty much it.

So basically, what is distributed tracing? The way I tend to think about it is as a new way of monitoring for microservices. And we can ask: why do we need a new way, why don't the old ways work? To answer that question, I want to show you an artist's rendering of microservices versus a monolithic application. The biggest difference with microservices, obviously, is that the pieces of the previous big application are now individual pieces that work independently of each other. When we were monitoring a monolithic application, we would put some probe on it — it could be metrics, it could be a stream of logs to standard out — and we could see what's going on in that application, and things were pretty simple there. But in microservices, how you do the same thing is not that obvious. You can definitely put a thermometer on every one of those services, but that doesn't give you the whole picture of what's going on with the system as a whole. More importantly, think about another aspect, specifically concurrency. The old applications started out single threaded, where you would process one request at a time; then it became more complex with multithreading, where multiple requests are processed in parallel, but still one request per thread; and then we went into asynchronous programming, where a single request can actually jump between different threads during its life cycle.
And finally, in the microservices world, we took that asynchronous picture and split it across many different process boundaries, and the picture got broken up. So what we really want when we monitor that system is to be able to track a single request as it goes not only between multiple threads, but between multiple process boundaries. And that's what distributed tracing really provides: the ability to trace a single transaction throughout your architecture and across process boundaries, threads, continuations, asynchronous calls, and all these things. Conceptually, the way it works is fairly straightforward. There is a concept of context propagation, where we say: if we have a microservices architecture with, say, five microservices, and the first service receives a request, we create a unique ID for that request and stick it in a so-called context, which is like a virtual container associated with that request. And that context is propagated, by whatever means, through every single call downstream as part of processing that request. When we do that, it allows us to stitch together all those independent pieces of execution across the call graph and build a timeline of that same request, where we can see that the whole request took this much time in service A, then service A called service B, B called C and D, et cetera. And that view is a typical view that tracing systems provide, based on the tracking of the requests that they do.

So, but why should you care? Why is it a good idea to actually do these things? Now I want to jump into the demo. I will base the demo on Jaeger as an open source tracing system, plus, inside that repository, there's an application called HotROD, which is a sample microservices application that I will be using here. So first I want to start the Jaeger backend. Where am I? In github.com/uber/jaeger, which is our main repository. I can start the Jaeger backend with a single go command, and I'll just give it a second. One thing it shows here is that it started the Jaeger query service at this port; that's the Jaeger UI that we'll be using later. Then — again, this is the same repository, but the subdirectory examples/hotrod — I can start this application as well. I want to pay attention to the logs here, because it says it's starting a whole bunch of services: a route service, a front-end, a customer, and a driver service. So just by looking at these logs, we can get a sense that this is apparently a microservices-based application, because it's starting a whole bunch of things. But the front-end is obviously the entry point, so let's go to the UI of that application. I can make it a bit bigger, like this.

Just as a quick intro, this sample application is a mock rides-on-demand thing where you have these customers and you click a button, and the backend finds the car which is closest to that customer and says, OK, the car will arrive in two minutes, and it gives you the license plate number — these are like New York license plate numbers. It also gives a few things that will be useful later in the demo. One is that when I loaded this application, there was this client request ID, which is just a stable session ID for my page; if I reload the page, I'll get a new ID.
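Before going further, here is a minimal sketch of the context propagation idea described above, using the Go OpenTracing API (github.com/opentracing/opentracing-go). The function names are illustrative, not the exact HotROD code: the caller injects its span context into the outgoing HTTP headers, and the callee extracts it and continues the same trace.

```go
package propagation

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// Client side: start a span and inject its context (trace ID, baggage)
// into the outgoing request headers.
func callDownstream(tracer opentracing.Tracer, url string) error {
	span := tracer.StartSpan("call-downstream")
	defer span.Finish()
	ext.SpanKindRPCClient.Set(span)

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return err
	}
	if err := tracer.Inject(span.Context(), opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header)); err != nil {
		return err
	}
	_, err = http.DefaultClient.Do(req)
	return err
}

// Server side: extract the incoming context and start a child span,
// so this unit of work is stitched into the same trace.
func handle(tracer opentracing.Tracer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		parentCtx, _ := tracer.Extract(opentracing.HTTPHeaders,
			opentracing.HTTPHeadersCarrier(r.Header))
		span := tracer.StartSpan("handle-request", ext.RPCServerOption(parentCtx))
		defer span.Finish()
		// ... do the work, passing the span/context to further downstream calls ...
	}
}
```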
Another thing: every time a request is made from this application, the JavaScript front-end assigns it a unique ID, like request number one. And it also prints the latency, which we'll see is useful later on — how long it took from the point of view of the front-end. So now we see that this application apparently dispatched a car to us. But how do we actually see what the architecture of the application is? We saw from the logs that apparently there are certain microservices involved, but maybe the logs lie; we don't really know. So what we're going to do is go to the Jaeger UI, and we'll take a look at what the Jaeger UI provides out of the box. Remember that we executed one request so far with that application, for this car, and that already sent certain data to the Jaeger backend. We can go and see that data, for example, in this format. By observing traces — really it's a single trace — by observing that interaction of microservices, we can actually see what happened within that system. There's a front-end service which called three downstream services, and two of those called what are apparently some storage backends, like Redis and MySQL. We also see the counts — how many times each was called. So just for this single web request, there are apparently over 25, 27 RPC calls that happened within that microservices-based application. That gives us an architecture overview of the application, but it doesn't tell us what the actual workflow and the data flow were: which service was called first, and how long it took.

For that we can go back to the main page of Jaeger. Because the services emitted tracing data to the backend, we already have this information; for example, all the services are presented and known to the Jaeger UI. If we search for a trace, we see that this is the one trace that was executed by the system, and it says that there are 51 spans — I will go into that a bit later — and it shows all the services that are involved, and how long it took: 743 milliseconds. Notice that this is a bit shorter than the 750 milliseconds reported by the UI, which isn't surprising, because the UI is measuring it from its point of view, whereas Jaeger measures from the backend point of view; there's some network delay between the UI and the backend, which is responsible for the discrepancy.

When I go into that trace, I now see the picture that I showed in a slide before: a timeline view of the trace. This axis is time, and every horizontal bar represents a unit of work performed by a certain service. In particular, we can see at the top that the very first request was to the front-end service, at an endpoint called dispatch. Then, in order — going down the parent-child relationships — we can see that the front-end service called the customer service at a customer endpoint, then the customer service did some MySQL operation, then the front-end called the driver service, and the driver service did a whole bunch of other calls, apparently to Redis: first a find-driver-IDs call, and then a whole bunch of get-driver requests, apparently to retrieve driver information. Some of them, we can see, failed — they are marked by the exclamation point, and they took longer — and most of them succeeded.
And finally, at the end, after the driver call, the front-end did a whole bunch of requests to the route service. Again, we don't really know what the business logic is here, but at least we see the data flow of this application. Once all these route requests were executed, the front-end produced the result, and the UI displayed it. So this is a very simple walkthrough of the application's workflow just by looking at a single trace; it gives us a lot of context about what happened in these, what, six or seven microservices in this application.

Now, a bit more detail about this trace. Distributed tracing allows you not only to see that information, but also to drill down into the individual pieces, into every span. Again, a span is just a unit of work within the application which was instrumented with a particular annotation. So, for example, we can expand the MySQL span and see the actual SQL statement that was executed. We also see the request ID — remember that request ID in the UI, this guy — and we also see some logs associated with that span. This information helps if there's an error. In particular, let's look at the error cases where we see Redis calls fail: if we drill down into one of those, we can see, especially in the logs, that apparently it was basically a timeout on Redis which caused that request to fail, and then the driver service retried it with another request. So this is, again, just a quick walkthrough of the capabilities of the tracing system; this is very common functionality. One sec.

However, we still don't quite know what the actual business logic within the application is. For example, why did the front-end call the customer service? To understand that, we can turn again to logging and try to understand the behavior of the application based on the logs. But before we do that in a trace, let's take a look at the logs here. Look — I'm scrolling — this was one single request, and there are several pages of logs that were written to standard out by this application. We could probably figure out what happened in this request by reading very carefully, especially if you're like the guy in The Matrix, but I find this very difficult to actually follow. And remember that we only did one single request so far. If this were a real production service serving 100 requests per second, these logs would be a complete mess; everything would be interleaved, and there would be no way to tell what is actually happening, what the logic of the application is.

So instead of looking at the raw logs, we can look at the logs in the tracing system. Specifically, if we look at the front-end service, the very top span, we can see that it has 17 logs. If we expand that, all those log statements that we saw in standard out are the same logs, but now they're contextualized: I only see the logs from this particular span. Other spans, like the MySQL one, have their own logs; the Redis calls have their own logs. In the standard-out output they would all be mixed up; here, I'm only seeing what's relevant to this span.
So that's what we call the contextualized logging that tracing provides. It allows you to narrow down the behavior of a particular execution very precisely. And by looking at the logs, we can now actually understand the business logic of what the application is doing. Once it received the request, it says: I'm going to load the customer information by customer ID, which was sent by the UI; then I'm going to find the nearest drivers to that customer; I'm loading all the information for those drivers; then I'm going to find routes for each of those drivers; and finally, pick the shortest and dispatch the result back to the front-end. So again, the main point here is that the logs are contextualized to every individual span; they're not mixed up with anything else.

Also, you can see that we're showing both logs and tags. This is a standard feature in OpenTracing. Tags are the things that you want to assign to the whole span, kind of a description of the span — for example, I'm calling the MySQL service, and the span kind is that I'm a client of the MySQL service. Whereas logs are things with a timestamp: if you're emitting something at a point in time, then it's a log; otherwise, if it describes an attribute of the whole span, then it's a tag. That's the standard terminology in OpenTracing.

And finally, the last and not least important thing about tracing is that we can see the overall latency of this request, what was on the critical path, and what took up the roughly 750 milliseconds to execute this request. We can see that the MySQL query took over 300 milliseconds — so, something to look into. Another thing we can see is that the loading of the drivers took another 200 milliseconds, and by looking at this staircase pattern in the trace, we can tell that all these drivers were requested from Redis sequentially. So that's another potential optimization for this application: maybe you could just call them all in parallel and reduce it to just a few milliseconds instead of 200. And finally, the requests to the route service — this is interesting. We see that they are actually concurrent, a whole bunch of concurrent requests, but they're not all concurrent: in fact, there are at most three concurrent requests going to the route service at any time, and as soon as one of them stops, another one starts. So it looks like there is some executor pool which is bounded to three threads, and the parallelism of this whole segment of the trace is limited to three. Again, that's potentially another optimization point to improve the application latency.

So now let's see how this application actually performs if we start doing a lot more requests. If I start clicking many times here, we can see that the latency starts to climb; essentially, the more requests, the longer it takes. And notice that the request IDs keep incrementing, as I mentioned before. So how can we use tracing to investigate this? I'm going to pick this driver ID — the license plate — and try to search for a trace with this ID. Tracing allows you to — I think it's driver ID, no space... let's see what the syntax is. Just driver, OK. So I'm looking at this thing; it says driver equals license plate, so I can search by that tag. Not driver ID, but just driver. OK, so now I get this trace.
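To make the tags-versus-logs distinction above concrete, here is a rough sketch in Go of how a span like that MySQL one might be annotated; the operation name, tag keys, and SQL statement are illustrative, not the actual HotROD source.

```go
package example

import (
	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
	otlog "github.com/opentracing/opentracing-go/log"
)

func queryCustomer(tracer opentracing.Tracer, requestID string) {
	span := tracer.StartSpan("SQL SELECT")
	defer span.Finish()

	// Tags describe the span as a whole.
	ext.SpanKindRPCClient.Set(span) // this span is a client of another service
	ext.DBType.Set(span, "sql")
	span.SetTag("request", requestID) // e.g. the request ID assigned by the UI

	// Logs are timestamped events that happen inside the span.
	span.LogFields(
		otlog.String("event", "query start"),
		otlog.String("statement", "SELECT * FROM customer WHERE customer_id=?"),
	)
	// ... execute the query here ...
}
```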
So we find the trace, and we see it's the one that was actually very long — almost two seconds; this one is saying 1.82, close enough, right? When we look at this trace, immediately we see that MySQL is taking an enormous amount of time here: 1.4 seconds. Clearly there is something wrong with that application; there is some bottleneck. Let's actually use the logging feature of tracing. If we jump into the logs, we can see that this request was actually blocked by four other transactions, and it was waiting for almost a second until it acquired that lock and was allowed to proceed to query MySQL. What that means in practice — this is obviously a mock application, but it simulates a real environment where you only have one connection to the database instead of using a connection pool.

Another interesting thing here is that not only do we see how many transactions are blocking us, we also see the actual request IDs of those transactions. So imagine you're in a scenario where there is some resource with a queue in front of it, and every time you execute a request, you actually have to wait in the queue for something else to get processed. Suddenly you see this pattern where you got stuck in the queue for a long time, and when you look into it, you can say: these are all the requests that are blocking me. What if I go and look for those requests and see which one was actually the longest and caused all this blocking in the queue? Tracing allows you to do that.

But what's interesting is that if we look at the customer service, there is this HTTP request that was executed, and it says nothing about a request ID. It only says: give me the customer information. So the request ID came all the way from the front-end, from the JavaScript UI, but it wasn't passed as a request parameter to the service. How did this service know about all these transaction IDs? The answer is that this is another feature of the OpenTracing API called baggage. Remember I talked about context propagation, where tracing uses context propagation to pass around the trace ID; context propagation itself is a more general concept — you can essentially pass anything. Baggage is this "anything": a key-value store which is passed around the whole architecture as part of the request. So this request ID that the UI creates for every request is actually injected into the baggage, and then it becomes available at every level of the call graph. Every microservice can get access to that ID without having to change any of the APIs of that service, which is very important. If you have multiple levels of microservices and you want to pass something from the top down to the storage layer, for example, having to go and change all the APIs of the services in between is usually very difficult work, whereas baggage allows you to do that almost for free.

So we figured out that MySQL is the culprit. What we can do is go and fix that locking contention. I don't know if I have time to actually do that... well, let me try. — You've got time. Go for it. — Yeah. So I'll go to the code for the application. OK, yes. This was in the customer service, in the database code. And funnily enough, we see that there's this lock that we saw the log message about.
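A minimal sketch of the baggage API in Go (the key name "request" is just illustrative): the top-level service sets a baggage item on its span, and any service further down the call graph can read it back from its own span, with nothing added to the APIs of the services in between.

```go
package example

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// At the edge of the system (e.g. the front-end): attach the UI's request ID
// as baggage, so it travels with the trace context to every downstream call.
func attachRequestID(ctx context.Context, requestID string) {
	if span := opentracing.SpanFromContext(ctx); span != nil {
		span.SetBaggageItem("request", requestID)
	}
}

// Several layers down (e.g. the customer service): read the same item back
// without it ever appearing as a parameter of the service's own API.
func requestIDFromBaggage(ctx context.Context) string {
	if span := opentracing.SpanFromContext(ctx); span != nil {
		return span.BaggageItem("request")
	}
	return ""
}
```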
So, back in the customer service code: the reason this lock is here — like I said, this is a mock application, so it simulates a single-connection pool. The simplest fix is to just comment it out and not block on this one transaction. The actual transaction to the database is simulated by this sleep statement, which has a certain delay, and just for demonstration purposes I also want to reduce that delay to make it a bit shorter, and see how this small change really affects the behavior of the application. So we'll start it again, reload this page — note my session ID changed now — and again I do a whole bunch of requests. What we see now is that the latency still climbs above the first request, but it's not as dramatic as it used to be; it doesn't go to two seconds. If I pick the longest trace again and search for it, we'll see how the change in the code that we just made affects the trace shape. Here I changed it to 100 milliseconds, and that's roughly what we get now. And the whole shape of the trace changed significantly. It's still long — over a second — but this segment became shorter. The call to the driver service is still the same 200 milliseconds, because I really haven't optimized that, but notice how this segment changed. Remember, we used to see three requests at a time, but now we're actually seeing sometimes one, sometimes even less than one request being executed. My whole request is being blocked — we can see it in the minimap: there are gaps in the execution where my request is actually not doing anything; it's just waiting on resources. And again, as I mentioned, the route service has some sort of a thread pool inside it, and that pool is bounded to three executors. So when I execute a lot of requests, there's obviously contention on that resource, and we can easily see the impact of that contention in the trace.

So what if we go and fix that as well? It happens to be very close by, in the same configuration. This is a goroutine pool, so the cheap fix is: I'm just going to bump it to 200. And again, let's see the impact of that change. I swear my laptop is usually faster — it's the video that's slowing it down. OK, so we got this application started and reloaded again. And now, because I've optimized a whole bunch of stuff in this Go code, I have to click really, really fast to actually get any sort of latency. You see how requests are coming back immediately, and they're all way shorter than before. So if I pick the longest one, just to see what is actually happening with it now — OK, notice that there are a lot more errors for some reason; I don't know why, that is interesting. The errors are actually random, so it's kind of surprising that there are more of them. But we can see the impact of that last change: we have 10 drivers being requested — or rather, 10 routes being requested from the route service — and now we can see that they all executed in parallel, because we have essentially removed the contention on the resource pool, on the thread pool.
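The pattern visible in the trace — at most N route calculations in flight at once — is just a bounded worker pool. Here is a toy sketch of the idea in Go; the types and names are made up for illustration, and raising the limit, as in the demo, is what lets all the route calls run in parallel.

```go
package example

import "sync"

// Request and Result are placeholder types standing in for a route query.
type Request struct{ Pickup, Dropoff string }
type Result struct{ ETASeconds int }

func fetchRoute(r Request) Result { return Result{} } // stand-in for the real call

// getRoutes fans out the route calculations but allows at most `limit`
// of them to run concurrently, which is the behavior seen in the trace.
func getRoutes(requests []Request, limit int) []Result {
	results := make([]Result, len(requests))
	sem := make(chan struct{}, limit) // semaphore bounding concurrency
	var wg sync.WaitGroup
	for i, req := range requests {
		wg.Add(1)
		go func(i int, req Request) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			results[i] = fetchRoute(req)
		}(i, req)
	}
	wg.Wait()
	return results
}
```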
So this, I hope, is a demonstration of the tracing functionality and of how tracing can help you quickly narrow down the problems in individual components of your architecture, in individual services, and how you can try to optimize things by looking at the relationships between the calls and the critical path. Here, if we look at the whole trace now, the critical path obviously goes through this segment — it's the longest segment — and the most obvious optimization is to try to parallelize it instead of doing it sequentially. But I'm not going to do that in this demo; it's an exercise if you want to do it yourself.

OK, the final thing that I want to show here: I mentioned baggage, and I want to show another use case for it. This application actually emits a whole bunch of metrics. If I go to this other port that the application exposes, we can see a whole bunch of metrics emitted. By the way, if I search for... oh, I actually don't have metrics from the tracer itself, so that's probably not configured — normally the tracer itself emits metrics about how many spans it starts or finishes. What is configured here are the RPC metrics, so we can see that all the services and all their endpoints are being measured by Jaeger and emitted as metrics. Tracing in general does heavy sampling of the requests — we don't capture every single trace in storage — but metrics work for all requests, so you can get a pretty accurate picture of how your application is behaving by looking at the metrics. And this is something that Gary will talk about in the second segment of this presentation.

But what I really wanted to show is this part. Notice that this is a metric which says how much time the route service — the calculation in the route service — spent, in seconds, on behalf of an individual customer, or on behalf of an individual web session. And remember, my web session ID is this one, right? Now let's look at the route service — let me collapse this — pick any request to the route service and look at its HTTP request itself. Again, there is no mention of either the customer ID or the web session ID in the request, because they don't belong to the API of the route service. It really only cares about where we start and where we drop off — just two coordinates, that's all it needs. And yet it is able to produce these metrics by customer and by session ID, which are identifiers only available at the very top of the application: essentially, the front-end service knows them, but it doesn't pass them explicitly to the route service. I'm kind of repeating myself, but I just want to make sure this lands: this is a very important and powerful feature of OpenTracing, where baggage propagation allows you to do a lot of smart stuff through this implicit propagation of data that you can use lower down in the stack of your application — say, for resource attribution. I can maybe do a chargeback to my customer, saying: you used this much compute resource in my application, across all the requests attributed to your customers. So this is something that distributed context propagation allows you to do.
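The attribution trick can be sketched in a few lines of Go. This is not the HotROD implementation (which goes through a metrics library), just the shape of the idea, with "session" as an assumed baggage key: the route calculation reads the caller's session ID from baggage and charges its elapsed time to it.

```go
package example

import (
	"context"
	"sync"
	"time"

	opentracing "github.com/opentracing/opentracing-go"
)

var (
	mu            sync.Mutex
	timeBySession = map[string]time.Duration{} // time spent per web session
)

// computeRoute does the actual work; the session ID never appears in its API,
// it arrives implicitly via baggage on the current span.
func computeRoute(ctx context.Context, pickup, dropoff string) {
	start := time.Now()
	defer func() {
		if span := opentracing.SpanFromContext(ctx); span != nil {
			if session := span.BaggageItem("session"); session != "" {
				mu.Lock()
				timeBySession[session] += time.Since(start)
				mu.Unlock()
			}
		}
	}()
	// ... the route calculation itself ...
}
```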
And finally, one other thing that I want to go over in this presentation: I hope that you like this functionality and you think tracing is great. So, how difficult is it actually to instrument an application to get all this data in all these places? The answer is that it's actually not that hard. In fact, if you look at the source code for this application, there is surprisingly little explicit instrumentation for tracing, and the reason is that the OpenTracing API is an open source API that any framework can use to instrument itself — in particular, any RPC framework. As a result, if we look at the source code for any of the services — say, the front-end service — when we create a server, we see just one mention of tracing for instrumentation, which really just creates a wrapper around the server. Once that's done, all requests through that server are automatically traced; you don't need to do anything special.

Similarly, there is another service here — I forgot which one; I think it's the route service... or actually, maybe it's the driver. Yes, the driver server. It's not based on HTTP; it uses TChannel, which is another open source RPC framework, and that framework is instrumented with OpenTracing itself. So what we can see in the code is that when I create this new channel, the only thing I pass it is the tracer — and that's it. There's no more instrumentation anywhere in this service to enable tracing. In fact, if we look at the handler — this is the handler function which is called by the server — there is no mention of tracing here anywhere. It just gets a context object, which is the common way for tracing to propagate data inside the application, and then tracing happens behind the scenes automatically.

Again, because OpenTracing is an open API that anyone can use, if you are writing your own RPC framework, or your own, I don't know, Redis driver in a particular language, you can write OpenTracing instrumentation either into your driver directly or provide a wrapper, which is what happens with HTTP: there are standard libraries in the opentracing-contrib space which allow you to wrap HTTP clients and servers and not really worry about tracing. However, if you do want to trace explicitly, OpenTracing obviously allows you to do that, and there are examples in this application. Redis, for instance — this is not a real Redis, it's a simulation of Redis — and to simulate that we're making some sort of RPC request, there is explicit OpenTracing instrumentation: we say, start a new span here representing the call to Redis, and tag it as an RPC-client kind of span. Those are the tags that we've seen in the tracing example. And this is really the only place in this code where OpenTracing instrumentation is done explicitly, simply because there is no real Redis server; if there were, you probably could get away without explicit instrumentation for it.

And finally, another thing I want to mention: we've seen how logs go both to standard out and the same logs appear in tracing. In fact, I'm going to go back to this server.
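For comparison with the framework-level wrappers, this is roughly what the explicit instrumentation for a simulated Redis call looks like in Go — a sketch with illustrative names, not the verbatim HotROD code: start a child span from the incoming context and tag it as an RPC client.

```go
package example

import (
	"context"
	"time"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// getDriver simulates a Redis round trip; since there is no instrumented
// client library underneath, the span is created and tagged by hand.
func getDriver(ctx context.Context, driverID string) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "GetDriver")
	defer span.Finish()

	ext.SpanKindRPCClient.Set(span)    // tag: we act as a client here
	ext.PeerService.Set(span, "redis") // tag: the peer we are talking to
	span.SetTag("param.driverID", driverID)

	time.Sleep(10 * time.Millisecond) // stand-in for the real Redis call
	_ = ctx                           // ctx now carries the new span for further calls
}
```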
So, back to the logs: we can see an example of a log statement here, and it looks pretty normal, right? Info... this is a key-value logging framework — zap, a logging framework which gives you structured logging. Rather than formatting a string with a formatter, you provide key-value pairs explicitly; it's a lot more efficient in Go, with no memory allocations, and it's easier to process. However, the real difference here from normal logging is this part. Instead of just calling logger.Info — if we did that, we wouldn't be able to associate logs with the actual context, because they would just go to standard out. There's a little trick in this application where the logger isn't really the normal logger but a wrapper around it which gives you two methods. Either you get a background logger, which doesn't require a context and can log your standard application lifecycle messages, or, if you have something that is request specific — in this case it's obviously scoped to this particular request, find nearest car — you get a different type of logger for that context. And as soon as you do that, the magic is (you can look in the source code at how it's actually done) that the same log is forked both into standard out and into the tracing span. That's why I'm able to show it in the UI, and when it's associated with a span you get contextualized logging instead of the standard-out mess.

So I'm just checking... oh yes, that's near the end. Just the very final point: OpenTracing doesn't bind you to any particular tracing implementation. Here we used Jaeger, but if we look at how tracing is actually initialized, this is the single place in this whole application which is specific to Jaeger. We're saying config — this is the package from Jaeger, this one, I guess, yeah — so we can see that it's the Jaeger client; that's the only place where the code is actually specific to Jaeger. We instantiate the Jaeger tracer, and from that point on the rest of the application is not aware that there is anything to do with Jaeger. If you want to swap it for Zipkin, or for LightStep, or for any other OpenTracing-compliant tracer, this is the place to do it, and it will work just as well. The UI will be different, obviously, but the actual instrumentation doesn't need to change.

So that, I think, is the end of my demo. Let me see. Yeah, as a recap: I showed that the instrumentation itself is pretty much off the shelf — I didn't have to change a lot of stuff in my application, and I can swap in another tracer, so there's vendor neutrality in the OpenTracing API. Tracing allows the monitoring of transactions across multiple microservices, process boundaries, and different threads as well. We can do things like measuring the latency of operations, finding the critical path, and analyzing the root cause of errors or delays in the execution. We get highly contextualized logging with tracing. We talked about baggage propagation and what a powerful technique it is — in fact, at Uber we have a number of projects which are built strictly on top of baggage propagation; they don't really have anything to do with tracing, but they rely on Jaeger instrumentation because they need baggage propagation. And I showed the RPC metrics quickly, but that's something Gary will talk about more in the next session. And just a few words about Jaeger: Jaeger is a distributed tracing system.
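Picking up the initialization point from a moment ago, here is a rough sketch of that single Jaeger-specific spot in Go; the exact config fields vary by jaeger-client-go version, so treat this as illustrative. Swapping in Zipkin or LightStep would mean replacing only this function.

```go
package example

import (
	"io"

	opentracing "github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// initTracer is the only vendor-specific code: it builds a Jaeger tracer and
// installs it as the global OpenTracing tracer. Everything else in the app
// talks only to the vendor-neutral OpenTracing API.
func initTracer(serviceName string) (opentracing.Tracer, io.Closer, error) {
	cfg := jaegercfg.Configuration{
		Sampler:  &jaegercfg.SamplerConfig{Type: "const", Param: 1}, // sample everything (demo only)
		Reporter: &jaegercfg.ReporterConfig{LogSpans: true},
	}
	tracer, closer, err := cfg.New(serviceName)
	if err != nil {
		return nil, nil, err
	}
	opentracing.SetGlobalTracer(tracer)
	return tracer, closer, nil
}
```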
We open sourced Jaeger in April this year. It's OpenTracing-native — built on OpenTracing from the beginning. It can be used as a drop-in replacement for Zipkin if you just want to replace the backend. The backend is all in Go. We support several backend storages, and this is the main URL for Jaeger. And we'll come back to that slide after Gary's presentation. I'll stop sharing now.

All right, we'll get Gary to share his screen and pick up his bit. — Hi, can you see my screen? — It's coming. You've got your sharing screen, so click into your Jaeger browser. There you go. — OK. Right, I'll try to get through this demo fairly quickly. Thanks for the demo, Yuri. What I'm going to do is show how we can use an OpenTracing system like Jaeger but also capture application metrics, integrate with something like Prometheus, and have that all running on OpenShift. This example also runs on Kubernetes, and there's a GitHub repository located here where you can find the example and the instructions for running on both. The main aim of this short demo is to show how we can capture the metrics along with the tracing just by instrumenting the application with the OpenTracing standard. And as Yuri pointed out, there are ways in which we can make the instrumentation of the applications non-intrusive by instrumenting a number of the popular frameworks. The benefit of capturing the tracing and application metrics information separately is that we can report them to our preferred backend systems, and the metrics we capture aren't constrained by the particular tracing sampling policy we want to use: application metrics are useful to capture for all application invocations, with alerting mechanisms to detect situations, whereas the tracing information is useful when you want to dive into more detail about a particular invocation of the application. There are also, on the cards, some adaptive sampling mechanisms that we're looking to put into Jaeger which, for example, if you get an alert on a particular area of your system, could potentially be used to automatically increase the tracing information that's captured, in advance of somebody being able to investigate the problem.

In terms of the example I'm showing, it's a very simple Spring Boot application consisting of two services: an order manager and an account manager. The order manager has a couple of REST endpoints for buying and selling, and one to generate an exception, basically, as it tries to invoke a missing endpoint. The account manager just has a simple account endpoint. Both of the services are using the OpenTracing API — in terms of the tracer we're using Jaeger — but we're also decorating the tracer with a new component in the OpenTracing contrib GitHub organization that basically intercepts the tracing information and extracts relevant metrics, and these are being reported, in this case, to Prometheus. OK, there's also a blog on the Red Hat developer program that explains how to run this on Kubernetes. There's a GitHub organization called jaegertracing where you can find the templates for deploying Jaeger onto Kubernetes and OpenShift. And, as I mentioned, the Java metrics component that decorates the tracer can be found in this organization, in this repo here. OK, so what I've done is I've already deployed the example — there's the account manager and the order manager.
I'm using the Prometheus Operator, which is an extension project to Prometheus that is able to identify services that have been deployed — and whether there are multiple instances of those services — and update the configuration in Prometheus to scrape the metrics from them. And of course we've got Jaeger deployed as well. Before viewing the demo, I'll just quickly go through the application. As I said, this is a Spring Boot application. The main application itself, as you can see, has no tracing-specific code added here for the account manager, and the same goes for the controller of the REST endpoint itself. This is basically just introducing a bit of a delay to make the metrics more interesting, and then randomly creating an exception saying it failed to find an account. The configuration of Prometheus is pretty straightforward: the metrics are being reported using a servlet, and the bridge is exposed at this endpoint here. The tracing configuration basically uses a component of the OpenTracing project called the TracerResolver. In this case, what we're doing is obtaining the tracer based on configuration information — so, in the same way that Yuri was pointing out, you could just change the code in one place; if you're using the TracerResolver and the tracing implementation supports it, it can be done without any code change at all. But in this case, we're decorating the tracer before it gets returned, using this component here, with a Prometheus metrics reporter.

With Prometheus, the metrics are reported with a set of labels. In a standard way, we're using labels to represent things like the service name, the operation, and various other fields that can then be used to categorize the metrics. But through this mechanism you can also customize and add your own labels as well. In this case, what I'm doing is adding a baggage label. This uses the mechanism that Yuri talked about, where application-specific information can be propagated with the tracing context through the chain of services being invoked. What this one is doing is adding a transaction name — so this could be a business transaction — and the second parameter is just a default value: if a baggage item with that name hasn't been provided, we just use this value. The other thing in the tracer configuration is that we need to tell it to ignore the REST endpoint /metrics, which is what's used to scrape the Prometheus metrics.

I'll just show you the order manager, because this is slightly different. The application itself, again, has no tracing-specific code, but the controller does inject the tracer, and this is purely to be able to set the baggage item. What we do is: if the buy endpoint is called, we set the baggage item transaction to buy, and the same for sell, and that information then gets propagated through to the account manager service. This is the Jaeger UI for this particular application, and you can see there are some transactions: we've got the order manager, which has the buy endpoint invoked, and that's invoking the account manager — a simple invocation — and this one's showing an example of an error. If I look at the account manager and have a look at the logs, you can see the "failed to find account" being reported. But because Yuri's done an in-depth demo of Jaeger, what I'm going to do is focus more on Prometheus.
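The demo does this decoration in Java with the OpenTracing contrib metrics library. Purely to make the idea concrete — and sketched in Go for consistency with the earlier snippets, with "transaction" as an assumed baggage key — here is a rough analogue using the Prometheus client: span durations are observed into a histogram labeled by operation and by a baggage-derived transaction name, falling back to a default when no baggage item was provided.

```go
package example

import (
	"context"
	"time"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/prometheus/client_golang/prometheus"
)

// One histogram series per (operation, transaction) pair, where "transaction"
// comes from a baggage item set at the edge of the system ("buy" or "sell").
var spanDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "span_duration_seconds",
		Help: "Duration of instrumented operations.",
	},
	[]string{"operation", "transaction"},
)

func init() { prometheus.MustRegister(spanDuration) }

// observe records how long an operation took, labeled with the transaction
// name carried in baggage.
func observe(ctx context.Context, operation string, start time.Time) {
	transaction := "n/a" // default when no baggage item was provided
	if span := opentracing.SpanFromContext(ctx); span != nil {
		if t := span.BaggageItem("transaction"); t != "" {
			transaction = t
		}
	}
	spanDuration.WithLabelValues(operation, transaction).
		Observe(time.Since(start).Seconds())
}
```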
So this is the Prometheus user interface, and I've set up some queries already. This first one is focusing on a metric called span count — that's just the number of spans that have been created at a particular point in the business process. If we have a look down here, you can see that there's a metric created for the operation sell, in the service order manager, with the span kind of server — so that's the server endpoint for that operation and that service. There are a number of labels that we're ignoring to simplify the information; for example, you can view information based on pod, instance, job, namespace, the transaction field that we added ourselves — the transaction label — and also errors. For the moment, we're just aggregating over those particular fields. I've also got a graph here representing the duration associated with each of those spans, again looking at an aggregated view over those fields.

And if we're interested in a particular transaction, we can add a filter — say we're just interested in the sell transaction. What this does is cut across all of the services and focus only on the metrics being reported for that particular transaction type. So, for example, if you wanted to find out what the bottlenecks were in a particular business transaction, this would be a good way to focus in on that. Similarly, if you're interested in what's executing in a particular part of your infrastructure, you can focus on the pods — and because the pod name also includes the service name, that's quite useful, as you can see which services are running on that particular pod; again, it helps you locate particular problems in your infrastructure. And then finally, I've got a graph that's looking at the error ratio for the different services, so you can see whether a particular service is starting to generate more than the usual number of errors, and you can set up alerts that would be triggered if certain thresholds were exceeded.

OK, so that's just a quick demo. Just to recap, this demo is primarily to demonstrate the integration of the OpenTracing technologies with something like Prometheus for capturing application metrics, but within the context of a Kubernetes or OpenShift environment, where you can also implicitly capture information about where those services are running. OK. — All right. Let's put Yuri's last slide back up, so we have the resources slide up there. And I just want to say, really, thank you for this. It's wonderful to see the interplay of all these different open source projects and how they all interrelate — and there are a lot of them in here. This has been a very good way to showcase lots of different things: the OpenTracing project, Jaeger, Prometheus, even Spring Boot — it's pretty cool. I think you've done a pretty awesome job with this presentation, because I'm not seeing any questions yet from any of the many folks that are following along. Is there any feedback that you want to add, Yuri, now that Gary's finished his bit? — Yeah, people are probably lost; I had to go pretty fast. I just want to mention a few links here. The OpenTracing project itself is at opentracing.io, and there's a Gitter chat room if you have questions or want to discuss things — this is the link. And then this is the link to the main repository for Jaeger.
We also have a chat room for questions and so on. And these demos that we've given actually have blog posts that essentially describe what's happening; in particular, HotROD has a very detailed walkthrough blog post that covers the same things I talked about, but with more examples and at a slower pace, obviously. And Gary's blog post that he showed is also here. So if people want to check them out later and actually go to the repositories and look at the code, these are the links.

— All right. Well, there are a couple of questions now that I've put people on the spot. Jethro is asking: the sampling in Jaeger tracing — is it span level or trace level? Is it head-based or tail-based sampling? — Yeah, I can answer that; that's definitely an expert asking. Sampling is trace-based: once a trace is sampled, it essentially remains sampled throughout the whole architecture. And it is head-based, so the sampling decision is made at the very beginning, when the trace ID is first generated. That's the only way for us to actually ensure consistent sampling across all microservices. Having said that, we have various work in progress trying to add other ways of sampling things. — And that was Jethro, who's from the Mass Open Cloud, which is actually using Jaeger today, so hopefully we can get some feedback from you guys soon. There's another question from Vikas: did you measure any performance impact after implementation? — This is a very interesting and very detailed question if we really want to go into it. The short answer is yes and no, because the actual performance impact cannot be measured in isolation, just based on the tracing itself; it really has to be measured within a particular service, with a particular traffic pattern, because it's highly dependent on those. At Uber we usually run tracing with a fairly low sampling rate, because we have very high volume, very high traffic, and because of the very low sampling rate our performance impact from tracing is completely negligible — there's nothing to talk about, even. But if you crank the sampling rate up much higher, then you will definitely start seeing some performance impact. The reason that question is actually very difficult to answer is that the performance impact is itself very hard to measure, because it's not just how much CPU time or CPU load you add to the service; there are all kinds of other implications, like how much memory pressure you create and how much throughput is affected, because trace collection happens in the critical path of the requests themselves, while trace reporting happens in the background — and that background work is somewhat expensive if you sample a lot of data, so it starts affecting your application throughput and latency. That's why you really have to try it out: with a low sampling rate you're not going to have any performance impact, but if you want to go to a very high sampling rate, you definitely need to try it out and see what happens. — So Narayanan is asking, actually, a good question: can we enable on-demand tracing with this, if performance is a concern? Gary's come back with a little bit of an answer. Gary, do you want to try? — Yeah, it's probably better if you answer, but I believe we're working on an adaptive sampling mechanism that would address that, so you would be able to switch off the tracing and enable it for certain scenarios when a problem occurs. — Yeah, I can add to that.
So there are two parts here. First of all, it solves the problem of having very low-throughput endpoints, which would be affected if you have a very low sampling rate in the tracer: some of your endpoints may be sampled, and others may never be sampled, just because they have very low QPS. Adaptive sampling takes care of that and guarantees a certain throughput of traces for any endpoint. The second feature of adaptive sampling is what Gary mentioned: it actually adjusts dynamically to the traffic on your endpoint, and it can either increase or decrease the sampling rate based on the desired throughput into the storage. But I think what this question was really about is on-demand sampling, and that's possible in two ways. One way is to do it programmatically: OpenTracing has a standard tag called sampling priority. If you set it on a span with a non-zero value, Jaeger interprets it as a signal that you want to turn that trace into a debug trace, and it's guaranteed to be sampled across the stack; it also bypasses any downsampling that may be happening in the collection layer. That's one way. Or, if you don't want to do it programmatically, the Jaeger clients support a special header — I think it's called jaeger-debug — where you can pass a sort of correlation ID as part of that header, and it will also trigger the debug functionality for a trace; you can then go and find that trace by the correlation ID you provided in the header. That's useful if you want to send a curl request into your application from outside and say, I want to trace this curl request, where obviously you cannot set anything programmatically because curl is not instrumented with OpenTracing — but the header allows you to do that.

— Jethro is asking: does the adaptive sampling idea use the circuit breaker pattern? — I'm not quite sure... yeah, I'm not sure what that means, but basically adaptive sampling works at the central collection tier. It measures all the traffic coming from a particular endpoint of a particular service, and it has a target: if we say we want 100 traces per second started by this endpoint and we see a thousand, then we're going to reduce the sampling probability by a factor of ten. That's how it works — so I guess it is kind of like circuit breaking. — And we have one last question, I think, that we can sneak in here. Jethro's asking again: forking and merging are not captured, right — is it by design, not in the Dapper model? It might be just a yes or no. — I think there's potential for the OpenTracing model to support it, because it can handle multiple parent references, whereas the Dapper model is a single-parent approach. But I think more work may be required in the standard, just to maybe define additional reference types. — Yes, yeah, exactly. The references mechanism in OpenTracing does allow you to have multiple parents, but there hasn't been a lot of work put into defining a reference type for that use case currently. There are open issues, so if you want to provide an opinion, there's definitely an issue about that, and about other similar situations — like when you want to link two different traces, you can also use the reference mechanism to link them. — So that is really all we have time for, and I really appreciate Yuri and Gary taking the time out today to do this.
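For the programmatic route mentioned above, the standard tag is set through the OpenTracing ext package; a one-line sketch in Go:

```go
package example

import (
	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// forceSample marks the current trace as a debug trace: Jaeger treats a
// non-zero sampling.priority as "keep this trace", bypassing normal sampling.
// (The jaeger-debug header mentioned above achieves the same from outside.)
func forceSample(span opentracing.Span) {
	ext.SamplingPriority.Set(span, 1)
}
```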
If you're interested in Jaeger or OpenTracing — as Yuri just mentioned, there are a lot of issues on the Uber Jaeger GitHub repo that you can weigh in on and give feedback on, and we'd love to hear from you all. Stay tuned for lots of new things coming with Jaeger. We'll probably get these guys back on again sometime soon with a roadmap, where-we're-going-from-here kind of talk, and hopefully some of you who have mentioned that you're using it in production or in your POCs can also talk about some of the work that you're doing to implement this, and the benefits, at your facility. So if you'd like to be part of this, just give us a shout. And again, thank you all for joining us today. This will be up on the OpenShift blog and a few other places within the next day or so. Thank you again for joining us. Thank you.