So in this talk, I will try to explain why, as cloud developers, we should care about distributed tracing. I will show a demo of a CNCF project, Jaeger, which is a distributed tracing system. And I will speak about the challenges of actually rolling out distributed tracing in a large organization like Uber, and some lessons we've learned from that.

Just as an introduction, why you should care about anything I say: I've been doing tracing for just about two years. I work at Uber in New York on the observability team. I was the founder of the Jaeger project at Uber, and also a co-conspirator in the OpenTracing project when we started it two years ago. And you can find me on GitHub and Twitter.

So this is a diagram which Jaeger can present, a snapshot of Uber's architecture. Obviously it's not everything, but some of it. When you use the Uber app, or any ride-hailing app really, the app communicates with the back end every second. And this communication might look like this: instead of talking to just one single server, every request spreads out across a number of microservices, potentially hitting hundreds of thousands of individual service instances. And at least at our scale, that happens billions of times a day.

So, a complex system. How do we monitor such a complex system to make sure that everything is working? Well, how do we traditionally do this? We use metrics and logging as the classic monitoring tools. With metrics, the counters, gauges, et cetera, there are various techniques people will tell you about for what you should be monitoring, to know whether your components are healthy or not. And there are various products, like StatsD, Prometheus, Grafana, which allow you to collect those metrics and visualize them. It's a similar story with logging; I'm talking about application event logging, and also errors and stack traces. There as well, there's an ecosystem of tools that let you grab all the data, aggregate it, and present it.

So do these tools actually help? Of course they help with certain cases. But monitoring tools ultimately must tell us a story about what's going on in the system. I was running a demo someone gave me yesterday, and when I reloaded the web page, I got this: the process just crashed. So how do we debug this? Well, I don't know. The code in that demo didn't even use this Scanner class at all, and yet it crashed with this message. And it's very unusual for a program to crash like this, because typically you get a stack trace; this one crashed without a stack trace, which was surprising to me. Anyway, I didn't want to investigate it. But the point here is that when we're talking about a microservices-based application, metrics and logs are essentially giving you this one line, about one instance of one service somewhere. And how do you debug a problem without stack traces?

So distributed tracing essentially gives you what you can consider distributed stack traces: what happens across all your services in the architecture. And really, when we're trying to monitor an architecture as complex as Uber's, we want to monitor distributed transactions, not just individual instances.

Just before I do the demo, I want to give a basic idea of how distributed tracing works. Imagine we have a service that requests come into. When a request comes in, what we do is assign a unique ID to that request; a simple thing. And we also introduce a notion of a context.
And the context is something that we want to keep with this request, with this transaction, as it travels through the rest of the architecture. So if the service makes a call to another service, we pass that context on. And as it makes more calls, we keep passing that same context. As we're doing this, we now know the path that the transaction took through the architecture, as long as we can identify all the instrumentation within those services by this unique ID that we assigned at the top. You can also extend the context to record some causality information, like the fact that B actually called C, not just that someone called C in this transaction. And if we capture all that data in addition — it doesn't have to be in the context, it can be captured in the background somewhere — then we can build a trace, the timeline that you see on the right here.

And then, again just before the demo, a quick introduction to what Jaeger is: a distributed tracing system. We started it at Uber and open sourced it a few months ago. It's now an official CNCF project, as of September.

And now I will show you a demo of the tracing. I will start by showing you the application I want to use for the demo. This is a mock application, HotROD, like "rides on demand". What you can do with it is pick a customer, click a button, and the backend goes and dispatches a car — a pretend car — to you. It gives you the license plate number and says when the car is arriving. And there is, for our purposes, some debugging information saying: this is the unique request ID for that request to the backend, and this is how long it took to execute on the backend, from the point of view of the front end. So that's all I'm going to tell you about this application.

And so, back to the point that monitoring tools should tell you stories about the application. What can I learn about this application by actually using tracing as a monitoring tool? I'll go to the Jaeger front end. Let me reload it. The instrumentation in that application already sends some data to the tracing backend. And one thing I see here is that, oh, I suddenly got this list of services that are apparently included in this application. But let's not go there first; let's go to this tab called the dependency diagram, which is a separate view. So purely by monitoring the interactions between the services — with the instrumentation, of course — we got this diagram, which actually tells us a lot about the application right away. We can figure out its architecture. We can see how many requests go to which services. We can see that there are apparently two database backends within this application. And we didn't need to do anything to get that, just run the application.

The second thing is, OK, this gives us the architecture, but it doesn't give us the actual workflow. Who calls whom? What's the business logic within this application? For that, I can go and search for traces. So there is one trace here at the top. Notice it says 704 milliseconds. This is a bit lower than the number we saw in the app, because that one was measured from the client's point of view and this one from the server's point of view, so obviously it's shorter. When I click on the trace — there are a lot of things on this screen, a lot of information — the most important one is that the time sequence diagram I showed you on the previous slide is now shown for the real service.
And these are the individual operations executed by the individual services listed on the left; these are the operation names and how long they took. One thing we can immediately say about this service is that this MySQL SELECT operation takes almost 50% of the total time. So if you were trying to understand why your service in production is slow, just one look at the trace tells you that this is at least a good place to investigate. What is it doing for that long? After all, it's all running on my local machine; how much time can you really spend retrieving data from MySQL? What you can also do is drill down into this and see the actual SQL statement that was executed. This is something that's automatically captured by the tracing instrumentation, and you can see it within this individual span — a span is an operation within a trace — so, further rich data. It also shows that there is a log statement here, "acquiring lock", blah blah blah; we'll see why that is interesting later.

Another thing we can see here: if I click at the top, this is the entry point of the whole application, the so-called root span, which spans the whole request. There are lots of logs here, and if you read all of them, you can get an overall idea, not just of the architecture now, but of the actual business flow within this application. We got a customer. Then we found the closest drivers. Then we calculated the route to each driver. And finally, we picked the best driver, the one closest or quickest to reach the pickup point we wanted, and returned that data.

So why am I showing this? Clearly, you could get the same thing from logs. However, note that all these services could potentially have been running on different instances. And also, in production, there would be hundreds of requests going through one single instance. So if you tried to look at the logs, these 18 entries would be mixed up with a gazillion other entries from the same application, just from different threads — and how do you make sense of any of that? Tracing gives you just this transaction. It ignores everything else, because it can correlate all the logs with this transaction ID, the trace ID. And that's a very important feature of tracing: it gives you highly contextualized information about all the things your application is doing. Also note that these logs are attached to this span because it's really this operation that was making all those requests from the top level; the logs for the MySQL operations were different, because it was doing something else. So not only do you get logs for the whole transaction, you get logs partitioned by the individual operations within that transaction.

There's another thing we can figure out quite easily from this trace, if you're curious about what else could be a performance issue in this application. First of all, there are a couple of errors — the red marker here says there is an error. We can actually look at it; it was probably a Redis timeout. So, fine, the service retries, and that increases the latency. But even where there are no errors, there's a very clear staircase pattern here, which indicates that all these operations are done sequentially.
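Just to make that concrete: a staircase like this usually corresponds to a plain sequential loop in the code, and fanning the calls out concurrently collapses the total latency to roughly the slowest single call. Here's a rough Go sketch of the two shapes — this is not the actual demo code; the fetchRoute helper and the Route type are made up for illustration:

```go
package demosketch

import "sync"

// Illustrative types and stubs, not from the demo application.
type Route struct{ ETA int }

func fetchRoute(driverID string) Route { return Route{} } // pretend remote call

// The "staircase": each call waits for the previous one to finish,
// so total latency is the sum of all calls.
func routesSequential(driverIDs []string) []Route {
	routes := make([]Route, len(driverIDs))
	for i, id := range driverIDs {
		routes[i] = fetchRoute(id)
	}
	return routes
}

// The fan-out: all calls run concurrently, so total latency is roughly
// the slowest single call instead of the sum.
func routesParallel(driverIDs []string) []Route {
	routes := make([]Route, len(driverIDs))
	var wg sync.WaitGroup
	for i, id := range driverIDs {
		wg.Add(1)
		go func(i int, id string) {
			defer wg.Done()
			routes[i] = fetchRoute(id)
		}(i, id)
	}
	wg.Wait()
	return routes
}
```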
Now, since we didn't write the application, we can't really say whether running these calls sequentially is the right thing or not for this application. Well, I wrote it, so I do know that it's the wrong thing. It could have fanned out all these requests in parallel, because it basically loads all the drivers first and then gets some additional information for each driver. So you could just parallelize it, and you could reduce these 180 milliseconds to roughly the longest single call, maybe 10 milliseconds. So that's another source of latency, right away, just by looking at the trace; I didn't have to do any additional measurements anywhere.

And the final one — this one is interesting. By the way, I can actually zoom in into this section. Here we have a whole bunch of requests to the route service; there are actually 10 of them, because there are 10 drivers loaded. But the execution pattern is a bit strange. We see that there is some parallelism, that three requests start executing in parallel, but then — if you look at the vertical lines — it's always three at a time. That's because there is simply a thread pool behind it with a size of three, and it limits how much work you can do in this particular operation. Now, this is not a problem in this specific trace. But if we were doing a lot of requests — and I can easily show you: if I go here and, let's say, do this — notice that the latency keeps climbing. So if I pick this one... and actually I can search for the trace by individual tags; this was a driver ID, so I think I can say "driver". I should have tried it at home, but it should work. Here it is: 1.6 seconds. So now this MySQL call got even worse with the load. Obviously a real MySQL server could have scaled easily; it's not the problem here because it's really a simulated service. But we can look at the span again and see: oh, you know what, I was actually waiting on a lock behind other transactions. And notice the interesting thing: it gives you the transaction IDs. Remember these request IDs from here? So this service somehow knows about those individual requests. We're looking at one trace, one request, but it knows about the other requests which were blocking this particular request on a particular resource contention. That information is actually hard to come by without certain features that I'll get into in a second.

In the end, speaking of optimizing the performance of this application, this is clearly one of the bad things. I wasn't actually able to show you the impact of the thread pool, because we would need to solve the MySQL problem first. But if we solve it and make it fast, then this thread pool of three becomes the bottleneck, because it's now blocking a bunch of requests; you can't even do three operations in parallel for one transaction, you're going to be waiting on this contention.

So the whole point of this is: I'm just looking at one single trace of the application, and suddenly I know so much about the performance profile of this whole application. And I'm not talking about one single service — that you could potentially have done with some profiling tool — I'm talking about the application as a distributed application, with multiple microservices, which could run as multiple instances.

One last thing I want to demonstrate here: this application also emits metrics, and I have this metric here which is kind of interesting.
This metric says: this is how much time we spend calculating routes, per customer. What's interesting about it? Well, it's kind of easy to calculate, but look at the diagram here: this is the route service, and the route service doesn't actually know anything about the customer. It doesn't need to. All it does is say: from point A to point B, what's the shortest route? And yet that service is able to provide a metric saying, this is how much time I spent per customer.

That's another feature of distributed tracing, known as distributed context propagation. The front-end service does know the customer, and that front-end service can store the customer in the context — remember the context I talked about at the beginning? If you store the customer in the context, then that context is available to every single node within your application, and they can gather additional statistics based on that information, even though they don't receive it in their direct API calls, because they really don't care about it.

That provides very powerful capabilities if you want to do cost attribution. At Google, for example, the stack is also very deep: a Gmail request might eventually reach some storage system like Bigtable. So the request actually carries the fact that it's coming from Gmail all the way down to the storage, because then they can attribute that cost to something that makes sense for the business. In this case, sure, I can attribute cost to, say, the service that's calling me — that's easy. But really, we want to attribute the cost of doing the work to some business concept that makes sense for our business. For example, at Uber we could say: this is a ride request for ride sharing, or this is an Uber Eats delivery. That's the high-level business dimension, and then we can say, OK, we are spending this many dollars per year on this kind of business, by doing this attribution. So it's a very powerful technique, and I will speak about it a bit more. But this demo application shows how we can get it.

And just to prove to you that I'm not lying about the route service not knowing anything about the customer, we can go to the route service and look at the URL of the request that it gets. This is all it gets: just the pickup and drop-off points. That's all it has. So how does it get the customer information? Well, the customer information comes from the context, and the context is propagated automatically by the tracing instrumentation, transparently to the service. You don't need to change the API of the service or anything else. Imagine if you wanted to pass this data explicitly so that the services actually knew about it: you would have to go and change a lot of services and change their APIs. That's very expensive.

So let me stop here and go back to my slides. By the way, there's a walkthrough of this demo at this link if you're interested; the slides will be shared afterwards. It goes into a lot more detail about how you can actually troubleshoot these things using tracing, and it covers two aspects: OpenTracing and Jaeger specifically, and what you can really do with them.

So to summarize: one thing that tracing systems provide is the ability to monitor distributed transactions. The other is root cause analysis.
So if you can find a trace which looks suspicious, you can easily drill down into the various components that participated in that transaction and figure out what was going on. You can also do performance and latency optimization using the same tool, and not just with individual traces: if you build aggregations of the traces, you can see patterns within the applications — I'm going to talk more about this. And finally, service dependency analysis; so far we've only seen a fairly simple diagram, and I'll show you a more advanced version of it later. All of that functionality is fundamentally built on distributed context propagation. And this is something I want to stress: context propagation in microservices is an extremely important concept. Many people are not doing it, and you will regret it once you reach a certain maturity in your organization. It's much easier to have it from the beginning.

So, who likes tracing now, after I've given this presentation? A quick poll: how many people here actually have a distributed tracing system deployed and used in your company or organization? All right, that's a pretty good percentage, more than I expected. So if tracing is as much fun as we saw before, how come not everyone is raising their hands? Well, the answer is a bit embarrassing for the industry: the instrumentation has so far been too hard. With logging and metrics it's easier; it's also work, but with tracing it's a bit more work, and I want to explain why.

Imagine you have a service with a server endpoint, and there's a downstream call that it makes. Presumably we add some instrumentation around these things, at the entry and exit points, to actually do the tracing. And presumably some upstream server is already instrumented as well and sends us the trace ID in the request headers. So the very first thing the instrumentation in this application does is: take the headers, extract the trace context, and create the context object in memory. You get that almost for free with the standard frameworks today. The second thing you need to do: as your application is doing its work, you need to keep that context around, so that if you do happen to make a downstream call, you can pass that context downstream. Because if you lose it, the trace is broken; you can't follow the transaction anymore. And the third one: when you actually make the call, the instrumentation takes that context, encodes it again into the trace headers, and passes it on with the next request. Meanwhile, there's another library which gets callbacks from all this instrumentation and says, I'm collecting tracing data, I'm going to submit it to the tracing backend somewhere for the actual trace aggregation.

That number two step is what's known as in-process context propagation, as opposed to distributed context propagation. And this is actually the thing that has been preventing tracing from becoming mainstream, because it's not straightforward to do. It also depends a lot on the language.
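To make those three steps a bit more concrete, here is roughly what they look like with the OpenTracing Go API (github.com/opentracing/opentracing-go). This is just a sketch, not the instrumentation from any particular framework, and the operation name and the downstream URL are made up:

```go
package tracingsketch

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

func handleDispatch(w http.ResponseWriter, r *http.Request) {
	tracer := opentracing.GlobalTracer()

	// 1. Extract the trace context from the incoming request headers
	//    and create an in-memory context object for it.
	wireCtx, _ := tracer.Extract(
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(r.Header))

	// Start a server-side span as a child of the upstream caller's span.
	span := tracer.StartSpan("dispatch", ext.RPCServerOption(wireCtx))
	defer span.Finish()

	// 2. Keep the context around while the request is being processed;
	//    in Go this means storing the span in a context.Context that is
	//    passed to everything this handler calls.
	ctx := opentracing.ContextWithSpan(r.Context(), span)

	// 3. When making a downstream call, inject the same trace context
	//    back into the outgoing request headers.
	req, _ := http.NewRequest("GET", "http://route-service/route", nil)
	req = req.WithContext(ctx)
	_ = tracer.Inject(
		span.Context(),
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header))
	if resp, err := http.DefaultClient.Do(req); err == nil {
		resp.Body.Close()
	}
	w.WriteHeader(http.StatusOK)
}
```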
So if you take languages which support thread locals, you can store the context in a thread local and retrieve it in the next phase, the next layer of your application — say, in the client from the previous example. And that seems like the happy case: you can do the instrumentation almost without changing your application at all. It's almost the happy case, except that we don't write software like that anymore, right? We don't write Java applications where a request is processed by a single thread. Most applications now are asynchronous; they use queues internally, thread pools, et cetera. And that complicates this whole process a lot: what seemed simple with a thread local isn't so simple anymore.

Then there's another set of languages where a thread local isn't even a thing, like Go — you can't even identify which goroutine you're on. So what do you do in that case? Well, you have to pass the context explicitly. And that's the sad story here, because in many cases, if you didn't write your application from the start to pass the context around, it's kind of too late, or you have to rewrite a lot of internal APIs. So it's not so happy. Fortunately, in Go the context object is part of the standard library, and the language encourages you to use it all over the application, so hopefully as a community we'll start doing that. And if you do do that, then tracing becomes almost easy. At Uber, tracing Go applications was actually not that hard, because most people were already passing the context around — the context exists for other purposes as well, like timeouts and cancellation — so it was already a mechanism we could piggyback on.

So is there such a thing as zero-effort tracing instrumentation? Does it even exist as a concept? Like I said, it's fundamentally not even possible in some languages, like Go. But if you do pass the context, it becomes almost easy, almost free. And with thread locals, as I said, it's a double-edged sword: you get some benefits, sometimes it's easy, but sometimes it's really hard, especially if you're working with a custom framework which does custom asynchronous processing.

Now, here at KubeCon there are going to be, and have already been, a lot of talks about service meshes. So do they actually solve this problem? Service meshes like Envoy and Linkerd run as a sidecar; they basically take the business of making RPC calls away from the application. You can write your applications in all kinds of languages, and the sidecar does all the heavy logic: it knows how to route requests, how to do rate limiting, load balancing, and all these things. A very nice concept. And they do monitoring as well: they clearly can emit metrics, but they can also do tracing for you. As an example, at Lyft 95% of services don't do tracing themselves; it all comes from Envoy. The fine print, though, is that to enable tracing you just need to pass the headers through the application. Well, ironically, passing the header is the exact same problem of in-process context propagation. If you run into the same situations with thread locals and multi-threading, it becomes just as challenging as before. So OpenTracing, as an API, now provides primitives for doing context propagation.
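In Go, for example, explicit context propagation looks roughly like this — a minimal sketch with opentracing-go, where dispatch and getDriver are made-up function names. The point is simply that every function takes ctx as its first argument, so the tracing instrumentation (and anything else in the context, like deadlines) rides along for free:

```go
package tracingsketch

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

func dispatch(ctx context.Context, customerID string) error {
	// Starts a child of whatever span is already stored in ctx (or a new
	// root span if there is none) and returns a new ctx carrying it.
	span, ctx := opentracing.StartSpanFromContext(ctx, "dispatch")
	defer span.Finish()

	// Downstream helpers receive the same ctx, so their spans stay
	// attached to the same trace without any extra plumbing.
	return getDriver(ctx, customerID)
}

func getDriver(ctx context.Context, customerID string) error {
	span, _ := opentracing.StartSpanFromContext(ctx, "getDriver")
	defer span.Finish()
	span.SetTag("customer_id", customerID)
	// ... actual work would go here ...
	return nil
}
```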
And because the OpenTracing API is being integrated into a lot of frameworks, like Akka — actor frameworks with a lot of asynchronicity — it can do these things for you without you changing your application. But if you write your own custom thread pools, then you do have to do a bit of work.

So what have we learned at Uber from doing all this? It actually felt like this guy. We have by now close to 3,000 microservices, and about half of them are instrumented for tracing — and that half has stayed at roughly the same percentage for about a year and a half, as far as I remember, even though the number of microservices keeps growing. So it's tough, and it doesn't help that we have four languages at Uber, which makes it even harder for my team to provide instrumentation for all the languages and frameworks and to write the client libraries.

But if you are going to do this in your organization, here's what you should do. One thing: I would strongly advise using OpenTracing, because it decouples you from the actual tracing system. If you don't like Jaeger, or if you want to switch to a commercial vendor which potentially does more than Jaeger, you don't need to change your applications: the OpenTracing instrumentation stays the same, you just flip which tracer you use. And, obviously, it's just good common programming practice to use infrastructure libraries that are shared across teams in your organization, so that you don't reinvent the wheel in every service. If you do that, it becomes a bit easier to enable tracing, because those libraries are the only place you have to go and instrument. And the good news is that many of them are already instrumented with OpenTracing, because it's an open-source API and still vendor-independent.

One important thing: don't put tracing behind configuration. It should come enabled by default. You can have a configuration option to disable it, but we originally made the mistake with our Python clients of requiring people to go and set a boolean flag in the config to enable it, and that was completely unnecessary friction in rolling this out. Education is also very important. Distributed context propagation — I mean, Dapper came out, what, in 2008, almost ten years ago — yet the concept is still new to many developers. So giving talks internally in the company, explaining why it's important, and showing some use cases definitely helps.

I mentioned the feature where the customer ID was passed around. In OpenTracing this is called baggage, because it's something that you carry along with the request: it's not the actual trace ID, but additional key-value pairs. And we use it for several things at Uber. One is that we have various sources of synthetic traffic. Say there's a black-box testing system which keeps pinging the APIs, asking: is my service working correctly? And there can also be performance testing or capacity testing that creates a lot of load on the services. If you cannot distinguish the traffic by these sources, then when you monitor your metrics, suddenly your metrics go wild and your alerts start firing, all because someone ran a performance test somewhere else — potentially not even on your service, but upstream. So that's bad.
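Just to sketch what that looks like in code with the OpenTracing Go API: a baggage item set near the top of the transaction is readable in every downstream service, without any API changes in between. The "traffic-source" key and the helper functions here are made up for illustration; this is not Uber's actual implementation:

```go
package tracingsketch

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// In the synthetic load generator: tag the whole transaction once.
func runSyntheticProbe(ctx context.Context) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "synthetic-probe")
	defer span.Finish()
	// Baggage travels with the trace context to every downstream service.
	span.SetBaggageItem("traffic-source", "synthetic")
	callBackend(ctx) // illustrative downstream call
}

// Many hops later, in any downstream service: read the value back and use
// it as a dimension on the metrics, even though no API in between carries it.
func emitLatencyMetric(ctx context.Context, millis float64) {
	source := "production" // default when the baggage item is absent
	if span := opentracing.SpanFromContext(ctx); span != nil {
		if v := span.BaggageItem("traffic-source"); v != "" {
			source = v
		}
	}
	recordLatency(source, millis)
}

// Illustrative stubs, not real APIs.
func callBackend(ctx context.Context)             {}
func recordLatency(source string, millis float64) {}
```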
So by using baggage to propagate the type of traffic that's coming into your service, you can separate your metrics using it as a dimension. Then you put alerts on the real production metrics and say, OK, for the test metrics I'm not going to fire alerts in the middle of the night. A similar thing is tenancy — this is essentially what the customer information is. You pass that around, and you can do a lot of cost attribution using it. And chaos engineering is another aspect. If we have 3,000 microservices and you do the chaos monkey approach, you're going to keep killing things forever and not find anything, because there are too many permutations of things you can kill to figure out what actually affects the reliability of the service. With tracing, you can do targeted chaos injection into the architecture: you know where the request is going, so you can encode certain parameters saying, when the request gets to this point, kill that point, or black-hole it. You pass that information in the baggage with the request, and — this is the important piece here — it's transaction-specific.

One other thing that's useful: we measure adoption and trace quality. We wrote a process which looks at all the traces coming from the various applications and asks: does this trace actually look correct, does this instrumentation make sense? And if not, we flag certain things. I don't know if you can see this, but this is a dashboard you can get per service. It shows all the metrics specifically for tracing quality: this one is good, but this one is not so good, you can improve it, and this is how. We surface this information as part of a standard quality dashboard that every service gets.

Integration with other tools is supremely helpful for rolling out tracing, because those tools become additional people, so to speak, who can go and harass other teams to implement tracing instead of you doing it. And I'm right on time. The black-box testing I mentioned is an external testing tool, but because it's low traffic, you can actually force tracing and make sure that every request from the black-box tool is traced. Then you have the trace ID, and when something fails, you can raise an alert and say, oh, and by the way, this is what failed. As opposed to, if you don't have that information, all you know is that this endpoint at your top-level API layer failed; with the trace, you can pinpoint the downstream service that's responsible. Developer Studio is an internal tool for developers where you can drag your position on the map, put a driver here, simulate the trip and all that fun; it captures all kinds of requests going between the APIs, and it also does tracing. So again, it's very useful — people can actually get used to using tracing. And, I'm not a showman — sorry, I'm not a salesperson — so this is a kind of obvious point, but tracing is a product. You have to sell it; you have to show value to your customers. And I don't know what this graphic is about, really.

And finally — unfortunately I can't really go deep into the service dependency diagram here. I mentioned that it is a very powerful tool for actually understanding what your application is doing,
or how the system is organized. And it can answer a lot of questions, like: is my service critical for the overall request flow? What business workflows does my service participate in? Will my service survive Halloween? That's a big thing for Uber — Halloween is super high traffic for us, and we always do capacity planning to make sure we have enough capacity for the services. But you don't really know without tracing, because just because our business, the number of trips, increases two times, does that mean your service needs two times the capacity? It could be ten times; the factor is actually not clear. With tracing, you can get that factor.

And another thing here is an example dependency diagram. The previous one, as you remember, was a simple version, because it just measured pairwise connections between services. This one actually looks at the paths. So when we see this service "dingo" at the top left calling the core service "shrimp" — these are not real names — and shrimp calling the service "dog", my question is: does the dingo service actually depend on dog or not? There's no way to tell here. We have a tool, which I will demo if you come to the Jaeger deep dive session tomorrow (because I'm running out of time now), where you can actually tell which services depend on which at any depth of the dependency graph — so it's not just pairwise anymore.

And finally, my closing thought is that monitoring has traditionally been a lot about firefighting: I measure something, the alerts fire, I do something. Tracing can certainly do that as well; it can help you root-cause and troubleshoot problems. But tracing also provides a very vast amount of data to do better than that — to do fire prevention: to figure out what your capacity constraints are, how you should optimize performance, and which components in the architecture need to be optimized because they are the actual bottlenecks, et cetera. So, basically, improving the reliability of your service.

A quick call-out: as I said, there are going to be two more, actually three more, sessions on Jaeger specifically, and we'll show some of the demos again there. There's also the OpenTracing salon, which I highly recommend attending; there's going to be a general discussion about tracing. I also highly recommend not missing Ben Sigelman's keynote about tracing in service meshes — I don't know exactly what it's about, but it should be super interesting. And finally, if you need more information, this is our website, and some ways to get in touch with us. It's an open source project, so all contributions are welcome. Thank you.