Okay, hello everyone. I'm Sachin Ashok, and I'm here with my colleague Vipul Harsh. Today we'll be presenting our recent work on distributed tracing, and specifically on how to enable distributed tracing without instrumenting the application. We're a couple of PhD students from the University of Illinois at Urbana-Champaign, and this is work in collaboration with our advisors at the University of Illinois as well as a few industry researchers.

To start with, we make a very simple observation: debugging microservices is quite hard. I'm sure many of you will agree with me. To elaborate, consider this sample microservice system shown here, where requests flow from left to right, and suppose you want to service incoming requests that arrive at the API gateway. In such a system, requests typically take a very convoluted path through the microservices in order to produce a single response, and this can be a nightmare for visibility. This is because many components must be traversed to produce even a single end-to-end request-response, and this is exactly where end-to-end distributed tracing comes in.

So for example, let's say you have incoming requests arriving at the API gateway; there could be two of them, as shown here in red and yellow. All the red and yellow rectangles here correspond to the requests that propagate within the system, or "spans" in tracing terminology. The job of distributed tracing is to put together these paths in such a way that you can obtain the complete journey of a single request as it flows through the microservice system.

These end-to-end traces provide visibility that can be very useful for answering queries like: which service contributes most to the delay experienced by the slowest one to two percent of requests? Or: how much time did high-priority requests spend at a particular service? Note that such queries cannot be answered by the out-of-the-box span-level tracing that proxies provide today; let me explain what that is. Proxies like Envoy today give the user the ability to turn on a flag and enable span-level tracing, where individual services log the spans that arrive at them, and the developer can look at each span individually to see if it's fine. End-to-end tracing, on the other hand, collects and stitches together these individual spans to form the end-to-end trace, and this gives much more visibility into the flow of a request through the microservice system.

So we agree that end-to-end tracing is a good thing. What's the main challenge in achieving end-to-end traces? Let's say you have a scenario where two incoming requests arrive at the gateway, and the gateway generates different backend requests in order to service these incoming requests. The main challenge lies in the ambiguity: there are many options available for this match. For example, is this the right mapping between the incoming requests and the backend requests, or is this the right mapping?
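To make the ambiguity concrete, here is a toy sketch in Python. The timestamps, the names, and the containment rule are our illustrative assumptions, not something the talk specifies:

```python
# Toy illustration of the mapping ambiguity. All timestamps and names
# here are hypothetical.
from itertools import permutations

# (start, end) timestamps in milliseconds, observed at the gateway.
incoming = {"red": (0, 100), "yellow": (10, 120)}
backend = {"b1": (15, 60), "b2": (20, 90)}

def feasible(parent, child):
    # A backend call can only belong to an incoming request whose
    # span fully contains it in time.
    return parent[0] <= child[0] and child[1] <= parent[1]

# Enumerate every one-to-one mapping consistent with the timings alone.
candidates = []
for order in permutations(backend):
    pairs = list(zip(incoming, order))
    if all(feasible(incoming[i], backend[b]) for i, b in pairs):
        candidates.append(pairs)

print(candidates)
# Both [('red', 'b1'), ('yellow', 'b2')] and [('red', 'b2'), ('yellow', 'b1')]
# survive: timing containment alone cannot pick the right one.
```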
Without peeking into the application code, there's no way to tell which is the right mapping. And this problem is present in every single service that the request interacts with, so any solution that tries to provide end-to-end tracing needs to solve this challenge of mapping incoming requests to backend requests.

The way this is done today is via header propagation. In this approach, when a request arrives at the API gateway, the gateway generates a globally unique ID and attaches it onto the request. Whenever a backend request is generated in order to service it, this unique ID is copied and propagated with that request using instrumentation, and every single service along the way performs the same job. This is how header propagation works, and when the traces are collected, these spans get stitched together using the IDs.

But header propagation comes with drawbacks. For example, in this scenario, you need buy-in from hundreds of loosely coupled services in order to enable distributed tracing, and that can take quite a lot of effort. And even if you do have cooperating teams, you can have scenarios where you're dealing with legacy applications or proprietary applications where modifying the code is quite difficult. All this is to say that header propagation can be quite hard in practice. So this leads us to the question: can we enable end-to-end distributed tracing without application instrumentation? Now I'll hand it over to my teammate, who will talk a little bit about how it works.

Thanks, Sachin. Now that we know we can't use header propagation, let's see what information we can get from Envoy, without instrumenting the app, that can help us reconstruct traces. Let's say there are two incoming requests to the API gateway, along with their corresponding responses. We know the mapping between the incoming requests and the outgoing responses, and this can be obtained via the spans that are collected by the Envoy sidecars. Now, the API gateway makes several backend requests to answer the queries that are sent to it, but the mapping between the incoming requests and the backend requests is not known, because these mappings are buried inside the application logic. Without peeking inside the application, it's not possible to decide which backend request was sent in response to which incoming request. And this is really the crux of the problem: how do we map incoming requests to backend requests? Solving this problem would enable us to stitch together these mappings and form an end-to-end trace.

So how can we reconstruct end-to-end traces without header propagation? Let's say there are three incoming requests to the API gateway, along with the responses that were sent out for each of these requests, and as before, there were some backend requests made by the API gateway. For each request, the span-level information tells us the time at which that request was received and the time at which the response for that request was sent out. And these... sorry, there's something on the screen. Okay, sorry about that. Let's do this again: how can we reconstruct end-to-end traces without header propagation?
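Throughout the walkthrough that follows, the raw material is the span record that the sidecars already produce. Here is a minimal, hypothetical sketch of such a record; the field names are ours, not Envoy's actual access-log schema:

```python
from dataclasses import dataclass

@dataclass
class Span:
    """One request/response pair observed by an Envoy sidecar,
    obtainable without touching application code."""
    service: str   # which service's sidecar observed the call
    peer: str      # the other endpoint of the call
    start: float   # when the request was received (or sent)
    end: float     # when the response was sent (or received)
    # Note what is missing: there is no trace ID or parent-span ID.
    # The link between an incoming span and the backend spans it
    # caused is exactly the unknown we have to reconstruct.
```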
So let's say there are incoming requests to the front end, along with the responses that were sent out for those requests, and as before, the gateway sends backend requests to answer those queries. From the span information, for each request we have the time at which that request was received and the time at which the response for that request was sent out. These timings constitute one part of the input that we feed to the trace reconstruction process.

The second part of the input is a set of constraints, which tell us which mappings are possible and which are not. In this example, if you look carefully, you can see that this particular mapping is not possible: the backend's response could not have been received after the response had already been sent out to the user, so this is clearly an infeasible mapping. We can also derive constraints using request parameters, HTTP headers, and so on. These constraints form the second part of the input to the trace reconstruction algorithm. The final part of the input is the microservice call topology, which describes how services talk to each other in order to generate a response.

We feed these three inputs to a trace reconstruction algorithm, which outputs the most likely mapping given a historical distribution of delays at each microservice. Specifically, for our previous example with three incoming requests and three backend requests, it outputs a specific mapping between the backend requests and the incoming requests at this node. Similarly, it outputs the mapping for every service node, and then we can stitch them up to get the end-to-end trace.

We tried this approach on a benchmark application, varying the request load, and we implemented two baseline schemes for comparison, which are based on the ordering of requests. You can see here that as load increases, the accuracy of the baseline schemes decreases, because as load increases, requests have a higher chance of getting reordered at any service. In contrast, our algorithm generates these end-to-end traces with high accuracy even at high loads.

So we asked: how can this be useful to a developer? Let's say the developer is interested in answering the following question: which service contributes most to the delay experienced by the slowest 2% of requests? Here is a latency profile of all the services when you only have span-level information, which you can get today in Envoy via a simple flag. In the absence of end-to-end traces, the developer will look at the 98th percentile latency of each service and might conclude that some of these services are contributing to the delay. And here is the latency profile with our reconstructed end-to-end traces; as you can see, this shows a completely different picture. If we compare this with the ground truth, the reconstructed end-to-end traces actually provide a much more accurate picture of the troubleshooting scenario, whereas span-only information can mislead the developer.

Alright, so to summarize: our goal is to enable distributed tracing without instrumenting the application, and to do so we use information from Envoy such as span logs, request parameters, HTTP headers, and so on. This approach gave us 96% end-to-end trace reconstruction accuracy for a benchmark application.
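To make the reconstruction step concrete, here is a heavily simplified sketch of the matching at a single service node. The timestamps, the Gaussian-shaped delay model, and the brute-force search over assignments are all our own illustrative assumptions; the real algorithm and its historical delay distributions are more sophisticated:

```python
# Simplified sketch of timing-based matching at one service node.
# Assumption (ours): the gap between receiving an incoming request and
# issuing its backend call follows a known historical distribution.
from itertools import permutations
from math import exp

incoming = {"r1": (0, 100), "r2": (10, 120), "r3": (30, 150)}
backend = {"b1": (15, 60), "b2": (22, 90), "b3": (45, 130)}

def feasible(parent, child):
    # Constraint: a backend span must lie inside its parent's span.
    return parent[0] <= child[0] and child[1] <= parent[1]

def delay_likelihood(d, mean=10.0, sigma=5.0):
    # Toy stand-in for the historical delay distribution at this service.
    return exp(-((d - mean) ** 2) / (2 * sigma ** 2))

def score(pairs):
    s = 1.0
    for i, b in pairs:
        s *= delay_likelihood(backend[b][0] - incoming[i][0])
    return s

# Enumerate the assignments that satisfy the timing constraints...
assignments = []
for order in permutations(backend):
    pairs = list(zip(incoming, order))
    if all(feasible(incoming[i], backend[b]) for i, b in pairs):
        assignments.append(pairs)

# ...and pick the most likely one. The normalized score doubles as a
# per-trace confidence value, which comes up again in the Q&A below.
best = max(assignments, key=score)
confidence = score(best) / sum(score(a) for a in assignments)
print(best, round(confidence, 2))
```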
Now, the benchmark we evaluated does not cover the entire breadth of the different kinds of microservice applications that exist out there. But this is an ongoing project, and we are still looking into how we can incorporate the various idiosyncrasies of complex microservice applications, which include things like asynchronous execution, caching of responses, or batching of requests, etc. You can check out our prototype at this GitHub repo, and we would love to talk to you about your use case. So please come meet us; we would be very excited to work with you towards enabling distributed tracing for your application. Thank you.

Any questions? We can take questions.

Hi, do you know what the relative cost is versus traditional distributed tracing using headers or something? Because in our case our distributed tracing is very expensive, so would this be cheaper or more expensive?

So we'd be piggybacking on the existing logging that Envoy does for requests and request-response mappings. What we would not be doing is actually going inside the code and adding instrumentation there, which can be really time-consuming when you're dealing with hundreds or thousands of microservices. In terms of cost: end-to-end tracing does require you to store the logs so that you can do the mapping. So this is not exactly online; there would be a staggered effect, in that you wait a while, say a couple of seconds, do the mapping, and then discard the traces that you're not interested in. But it does require you to save the spans for that brief period while you do the mapping.
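As a rough picture of that staggered pipeline, here is a toy sketch; the window length, the reconstruct_traces stub, and the keep-policy are hypothetical placeholders, not our prototype's actual interface:

```python
from collections import deque

WINDOW_SECS = 2.0   # how long raw spans are retained before matching
_buffer = deque()   # Span records streamed in from the Envoy access logs

def reconstruct_traces(spans):
    # Placeholder for the matching step sketched earlier.
    return [[s] for s in spans]

def interesting(trace):
    # Placeholder keep-policy, e.g. retain only slow traces.
    return any(s.end - s.start > 0.5 for s in trace)

def on_span(span):
    _buffer.append(span)

def flush(now):
    # Match spans older than the window, keep what matters, and drop
    # the rest, so raw spans only live for a couple of seconds.
    ready = [s for s in _buffer if s.end <= now - WINDOW_SECS]
    for s in ready:
        _buffer.remove(s)
    return [t for t in reconstruct_traces(ready) if interesting(t)]
```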
So it's quite cool to not have to instrument the application. I think my question is: if it's not 100% accurate, I'm skeptical of how useful this actually is, because if you can't trust it, you're going to mislead people. So I guess, what are your thoughts on that? If people don't know whether it's accurate or not, what can they do with it? Is there a way that you could tell people how confident you are, or what are your future plans there?

That's a great question; it is a tough problem to solve. Given that we cannot actually instrument the application, what we can do is try to provide the best accuracy possible, and then, for the end-to-end traces that we do make available to the application developer, we are able to assign a confidence score, which can convey to the developer that we are, say, 99% sure that this mapping is correct. In cases of low confidence, the developer can choose not to look at those traces; that's one option. The other is that there is a chance to augment the timing with other information. If you have a partially instrumented application, where some of the services are easy to modify but the legacy or proprietary applications just are not amenable to changing their code, then we can apply our timing-based techniques to just those parts, and you can still get the entire end-to-end trace by leveraging the existing instrumentation.

Just to add to that: if you have aggregate queries, for example if you want to know the latency distribution for some subset of queries, then we believe that good approximate accuracy should be fine. So those are the kinds of queries that would definitely be the use case, and if you provide the confidence score, then the individual request traces can be useful too.

Yeah, I think I have the same question, a little bit of what Matt was asking. In these large-scale production systems, having something that's not 100%, that we cannot trust, is very hard, because you're always optimizing the bottom or top 0.1% anyway. So it's very hard. My question is: does your prototype include the ability to only report on spans where you have 100% confidence and just disregard the rest?

I'm sorry, could you repeat that last part?

Does your prototype have the ability to only report spans that have 100% confidence?

No, we report all of them and then attach a confidence score to each, so that the developer can decide whether to look at them.

But that's your overall score, right?

No, you can assign a score per matching, per trace.

So then they could essentially filter out all the ones that are not 100% and look at those.
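In other words, one could imagine the output as a list of reconstructed traces, each tagged with a per-trace confidence, so filtering becomes trivial. This shape is hypothetical, not the prototype's actual API:

```python
# Hypothetical output: reconstructed traces, each carrying the
# per-trace confidence score discussed above.
traces = [
    {"id": "t1", "mapping": [("r1", "b1")], "confidence": 1.00},
    {"id": "t2", "mapping": [("r2", "b2")], "confidence": 0.94},
    {"id": "t3", "mapping": [("r3", "b3")], "confidence": 0.61},
]

provable = [t for t in traces if t["confidence"] >= 1.0]  # uniquely determined
usable = [t for t in traces if t["confidence"] >= 0.9]    # fine for aggregates
```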
To answer your earlier question: if you do have a large microservice application where you do care about that one request that's seemingly failing, that's a really hard problem to solve. But what I want to highlight is that you could have scenarios where, in that large application, you have systems that just do not cooperate. Let's say you have a legacy application that does not propagate the header; in that scenario you just do not have the end-to-end trace, right? So if header propagation were completely possible for every single service, then yes, that would be the preferred option.

The question for me is then: either I don't have any tracing, or I have tracing that I may or may not be able to trust, and if that's not clear to the developer, then it might be misleading and might actually cause more issues.

That's a good point, thank you.

I'm wondering how exceptional cases like timeouts look in your system, whether they completely blow up your ability to solve the trace, or whether they're absorbed by the rest of the constraints that you have.

Right, that's a good question. We do consider those cases, and that's ongoing work. For example, you could have response caching, where a request doesn't go out to a downstream microservice and is just served from the cache, or you could have request retries or batching of requests. What we can do right now is use a development environment to try out all these cases, to figure out the space of possibilities, and then, based on the real-time measurements we get, use that combination of information to figure out whether, oh, this looks like caching has happened, or this scenario looks like batching has occurred and that's why the timings, or the spans collected, look the way they do, and then combine that information to give the right answer. But this space is quite open, because there could be a myriad of patterns. You could say, every seventh response I'm going to generate a request to a different service, and those kinds of things are harder. But we think this solution can apply to a broad class of applications that do not have very crazy patterns, so it's still usable by a wide variety of applications.

Yes?

Right, so the problem is that Envoy can pass the request on to the application, but it's the application that decides which other service to talk to, what the semantics of the application flow are, and it's the application that generates a new HTTP connection and sends the request. So even from Envoy, there's no way to know which outgoing request corresponds to which incoming request. Envoy can hand over the request to the service, but it cannot do the matching itself. That matching is buried inside the application code, so there's no way of getting Envoy to give us that mapping; only the application knows. Yeah. So if the application chooses to propagate headers, then we are in good shape, but otherwise we don't have anything else.

What was the question? Oh, sure. Alright, thank you.