Welcome to the 11:20 talk by Herschel and Andrew. They're graduates of Westford High School and they'll be heading off to college soon. So take it away.

All right. Hi. Our talk is Skua: Extending Distributed Tracing Vertically into the Linux Kernel. I'm Andrew. And I'm Herschel. We did this as part of the MIT PRIMES research program for high school students.

So first let's talk about distributed systems. Nowadays applications are getting more and more complex, and they're no longer monolithic; they're written as distributed systems. The advantages are modular development, continuous deployment, and better scaling, and you can see them increasingly at large companies, like Twitter around 2013. The problem with them is that they're hard to debug, since there are many small services that depend on each other. So let's look at an example distributed system for something like web search. The user first makes a request to the front-end service, and the front-end makes a request to the web results service, which then contacts the page rank service. That returns results to web results, and the web results are returned back to the front-end. Then the front-end gets the images, which uses visual rank, and so forth. Now suppose a user is unhappy about search results being returned too slowly. How do you tell where the problem is?

This is where distributed tracing comes in. Distributed tracing lets you monitor and troubleshoot these distributed systems. It helps you discover latency issues, find out which services depend on each other, and figure out where the issues are occurring. It traces a specific request as it propagates through the entire distributed system. However, distributed tracing tools today miss a lot of things, because there's more to performance than meets the distributed tracing tool. There might be other services running on the same server, there might be kernel bugs that cause performance issues, and even security patches, like the ones for Spectre and Meltdown, can have pretty significant performance impacts. So the question is: can we gain visibility into these issues via the kernel? That's our goal.

So how did we approach this? We started with the Jaeger distributed tracing framework. It's a tracing framework built by Uber and released to the open source community, I believe fairly recently, and it's a fairly mature system; it's actually used in production by Uber as their own tracing infrastructure. To this we added LTTng, the Linux Trace Toolkit: next generation, which is built to gain visibility into the syscalls and kernel events that are generated while an application is running. When we combine these two, we get Skua.

So now let's back up and talk a little more in depth about how tracing actually works. As Andrew said, the goal is to follow an individual request as it propagates through the entire distributed system. You start with a front-end request, and the first thing the front-end does is generate a context for this request. That includes a trace ID, a parent ID, a span ID, and whether or not this request is sampled. The trace ID identifies this individual request throughout the entire system; it's a unique identifier for the request. The parent and span IDs together are used to construct the causal relationships between the different services.
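As a reference point, here is a minimal sketch of what such a context carries. The type and field names are illustrative, not Jaeger's actual API.

```cpp
// Illustrative only: a minimal stand-in for the trace context described above.
// Jaeger's client has its own SpanContext type; these names are assumptions.
#include <cstdint>
#include <random>

struct TraceContext {
    uint64_t trace_id;   // identifies the whole request across every service
    uint64_t parent_id;  // span ID of the caller (0 for the front-end, which has no parent)
    uint64_t span_id;    // randomly generated ID for this service's unit of work
    bool     sampled;    // whether spans for this request should be reported
};

// A downstream service inherits the trace ID, takes the caller's span ID as its
// parent ID, and generates a fresh random span ID of its own.
inline TraceContext makeChildContext(const TraceContext& caller, std::mt19937_64& rng) {
    return TraceContext{caller.trace_id, caller.span_id, rng(), caller.sampled};
}
```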
So in this example, you can see the front-end request has no parent ID, and it generates a random span ID. Then when it contacts the web results service, the span ID of the front-end becomes the parent ID of the web results span, and so forth with page rank. And when the front-end makes additional requests to the images service and the visual rank service, the parent IDs and span IDs are set accordingly. The span IDs are randomly generated, and the parent IDs identify the parent span of each individual span.

So what exactly does a span entail? A span usually identifies a specific amount of work being done by an individual service, marked by a start and an end time. Additionally, the user space application can attach logs and different events to the span, so that you can gain visibility into what that service was working on while the span was active.

Now, in addition, when the front-end gets this request, it usually generates a random number and decides whether or not to sample the request based on that number. The reason is that if you tried to collect all of the trace data for every single request that came in, the volume of data would simply be too large. So while it's possible to do that in development for testing purposes, in production a sampling rate of, say, one in a thousand is typically used. If the trace is sampled, then all of the different services report the spans they generate to a central span aggregation service. In Jaeger this is the collector service, which stores them in a database that can then be queried through a web front-end.

Speaking of that web front-end, this is what it looks like. You can see the top-level request, and the length of each bar is the duration that that individual span took. You can see the causal relationships between spans from the indentation, and you can also see how different services operate concurrently, when they start and end, and how long each one takes. So it's fairly easy to identify a bottleneck: if something is taking too long, you simply look at the length of the bar, and if it's longer than you expect, that's your bottleneck and that's what caused your top-level request to take too long.
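As an aside, the head-based sampling decision described above is conceptually tiny; here is a minimal sketch of a probabilistic sampler. Jaeger ships its own samplers, so treat this purely as an illustration of the idea.

```cpp
// Minimal sketch of a probabilistic (head-based) sampler, illustration only.
#include <random>

class ProbabilisticSampler {
public:
    explicit ProbabilisticSampler(double rate) : rate_(rate) {}

    // Called once at the front-end; the decision then rides along in the trace
    // context so every downstream service reports (or drops) spans consistently.
    bool shouldSample() { return dist_(rng_) < rate_; }

private:
    double rate_;  // e.g. 0.001 for one request in a thousand
    std::mt19937_64 rng_{std::random_device{}()};
    std::uniform_real_distribution<double> dist_{0.0, 1.0};
};
```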
Now, a little more detail on how Skua works. A user space application is already running, and a Jaeger client is attached to it, which collects these spans and reports them to the Jaeger framework. This is already built as part of Jaeger, so we didn't actually have to change much in the Jaeger framework itself to make it work with our system. The Jaeger client we did have to modify, which I'll talk about in a bit. Now, while the user application is running, it's going to be making a bunch of different syscalls into the Linux kernel, and the kernel is also going to be generating a bunch of different events. For example, a kernel event could be a scheduler event switching that process off the CPU and putting another one on, which usually happens immediately after a blocking syscall is made. LTTng already lets us collect these syscalls and kernel events using a set of kernel modules that LTTng provides.

So we needed a way to propagate the Jaeger context for the trace into the kernel, and we did that through procfs, with our own custom kernel module. We store that data in the task_struct, which is the per-thread data structure the kernel uses to identify each individual thread; you can think of it as thread-local storage in the kernel. Then we take the context from the task_struct and attach it to all of the syscalls and kernel events that LTTng is generating in its kernel modules. We propagate that information back into user space using an LTTng adapter that we built, and then report it into the Jaeger framework using the conventional methods of reporting spans.

To recap: first the Jaeger client propagates its context into the kernel, which stores that information in the task_struct. When the user application generates syscalls or kernel events, the LTTng modules read the data out of the task_struct and pair it with the events that are generated. Eventually those events are propagated back up into user space through the LTTng adapter, and both the Jaeger client and our adapter report to the Jaeger framework.

Modifying the Jaeger C++ client so that it sends its context into the kernel took around 25 lines of changes to the existing client library. We treat the Linux kernel data as the next level in the span hierarchy. So say there was a single span running for the visual rank service: each syscall that the visual rank service makes appears as another span beneath the visual rank span, and all of the kernel events that are generated, like RCU events, scheduler switches, and allocations and frees of kernel memory, each become an event in the logs of the kernel span they fall under. Our modifications to the LTTng kernel modules, to tag each event with the associated context information, took around 80 lines of code, and our LTTng adapter took around 250 lines of code.

Now, we wrote several programs to help evaluate how well we did. First we wrote a correctness tester: a small C++ program that just creates a bunch of threads and makes about ten different syscalls. This is just to make sure Skua is actually recording the syscalls as expected. We found that it was, although it appears LTTng does not instrument a few syscalls that are called very often, like gettimeofday. And the kernel events that happened during those syscalls were properly recorded as logs.
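To make the client-side half of that concrete, here is a rough sketch of the kind of roughly 25-line change described above. The procfs path and the text format are invented for illustration; the real kernel module's interface may differ.

```cpp
// Rough sketch of the client-side change: when a sampled span starts, write its
// context where the custom kernel module can pick it up and stash it in the
// current thread's task_struct. The path and format here are hypothetical.
#include <cstdint>
#include <fstream>

void publishContextToKernel(uint64_t trace_id, uint64_t span_id, uint64_t parent_id) {
    std::ofstream proc("/proc/skua_context");  // hypothetical procfs file
    if (!proc) return;  // if the module isn't loaded, tracing proceeds without kernel data
    proc << trace_id << ' ' << span_id << ' ' << parent_id << '\n';
    // From here on, the modified LTTng modules copy these IDs onto every syscall
    // and kernel event they record for this thread.
}
```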
We also evaluated the performance using two different benchmarks, both running on the same machine with a sampling rate of 0.1%. For each benchmark we ran several scenarios: no tracing, where we just turn everything off; unmodified Jaeger, where we include the normal Jaeger client with normal span creation; our modified Jaeger client, which is the client modified to report the context into the kernel whenever spans are created; LTTng without Jaeger, just to see the overhead that LTTng alone imposes; and finally Skua, where we combine the modified Jaeger client with LTTng recording the syscalls and events.

The first program was a small HTTP server written in C++. We used autocannon as our benchmarking tool, sending a million requests over 10 connections as quickly as possible, and then we measured the latency and throughput under each of those scenarios. Here are our two graphs: the benchmarking scenarios are on the x-axis, and throughput or latency is on the y-axis. First is our baseline performance with no tracing. Unmodified Jaeger introduces a little bit of latency but has essentially no effect on throughput. Then we have Jaeger plus procfs; in this benchmark, that decreased throughput and increased latency by quite a bit. LTTng alone also has a moderate amount of overhead, on throughput especially. And with Skua we combine all of those, so in the end we have a 12% throughput decrease and about 200 microseconds of extra latency per request.

The next benchmark we used was Fortunes. We borrowed the code from the TechEmpower web framework benchmarks. What it basically does is query for fortunes from a database, in this case Postgres, and it's intended to simulate a real-world workload rather than a hello-world example. We used a similar benchmarking process, but we ran autocannon twice, because Java needs a warm-up run, and we used 100 connections to exercise Spring's threading model. This is what it looks like after we run that benchmark. You can see each syscall appears below its span; for example, under the Postgres query you can see it's doing a sendto and a recvfrom, which is expected. And you can see the logs that happened during the recvfrom syscall: for example, it did a context switch, some RCU work, some datagram copying, and so on. Again, here are the graphs. That's our baseline performance; adding Jaeger, and even our modified Jaeger, doesn't do much to performance; LTTng decreases throughput a tiny bit; and Skua ends up with about a 6% throughput decrease and a little bit of extra latency on average.

So, to discuss our performance overhead: unmodified Jaeger is negligible. LTTng decreases throughput and increases latency, since it's recording all of that data. We could improve this by only enabling some of the instrumentation points, because for these benchmarks we enabled everything, all syscalls and all events. Our modifications to Jaeger cause additional latency that varies by benchmark, since each Jaeger client library is written differently, and performing extra syscalls is quite expensive even when you only do it for one in every thousand requests. LTTng and our adapter ingesting the kernel events also perform more work, so that performance degradation is expected. One note about the tiny HTTP benchmark: each transaction takes less than a millisecond, so the latency impact appeared extremely large in relative terms even though in absolute terms it wasn't that bad.

All right, looking forward, one of the things we hope to do is improve our performance even more. Currently, with the Fortunes benchmark, we had around a 6% decrease in throughput and a very small increase in latency, but we think we can get that down further, and Andrew outlined a couple of the ways we plan to do that on the previous slide. Another thing is simplifying the installation process of the entire system. Currently we require modifications to the Linux kernel, so you have to recompile a custom kernel. We also have a modified version of the LTTng modules, the kernel modules that are part of LTTng, so those are custom as well and need to be recompiled from source and installed properly. And then there are other modifications we made to the Jaeger clients as well. So basically, putting all of these together to make the installation and usage process a little simpler would be nice.
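As a quick aside on the Fortunes trace we walked through above, here is a rough, purely illustrative sketch of the shape of what our adapter reports for each syscall: the syscall itself becomes a child span, and the kernel events that fired while it ran become its logs. These types are our own stand-ins, not the actual adapter code.

```cpp
// Illustrative data shapes only: a syscall as a child span, kernel events as its logs.
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct KernelEvent {
    uint64_t timestamp_ns;
    std::string name;  // e.g. "sched_switch"
};

struct KernelSpan {
    uint64_t trace_id, parent_id, span_id;  // context taken from the task_struct
    std::string syscall;                    // e.g. "recvfrom"
    uint64_t start_ns, end_ns;              // syscall entry and exit timestamps
    std::vector<KernelEvent> logs;          // events that fired while the syscall ran
};

// The adapter reads LTTng's output and reports spans like this through the usual
// Jaeger reporting path; this helper only shows the structure.
KernelSpan buildKernelSpan(uint64_t trace_id, uint64_t parent_id, uint64_t span_id,
                           std::string syscall, uint64_t entry_ns, uint64_t exit_ns,
                           std::vector<KernelEvent> events) {
    return KernelSpan{trace_id, parent_id, span_id, std::move(syscall),
                      entry_ns, exit_ns, std::move(events)};
}
```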
One of the other things we're looking into is adaptive sampling reconfiguration. A talk that Lily gave earlier at DevConf discussed how you can modify tracing parameters in real time, adaptively adding instrumentation points and logging as you detect that you need them. So if there's a specific service that seems to be taking too long, you can add additional tracing to that specific service in order to get better insight into it. We were thinking you could do a similar thing to decide whether or not to enable Skua: you would only enable Skua if you detect that a specific service is taking too long, and then you gain insight into the kernel-level goings-on of that service in addition to the user space application itself.

Another thing we're looking into is attempting to trace Ceph with Skua. Mania has been doing some great work on replacing Ceph's existing tracing framework with Jaeger, and we were thinking about whether we could integrate Skua into Ceph, basically to show another example that Skua can trace real-world applications, and additionally to measure how much of a performance impact it has when doing so.

So, in conclusion, we can use distributed tracing to monitor and debug these complex distributed systems; however, current distributed tracing frameworks miss kernel information. So we developed Skua to integrate the kernel-level data from LTTng with the information that Jaeger is already collecting. Skua has some impact on throughput and latency; while it's not that bad and could be acceptable for some applications, it may be too large for others, so for now it depends on the application whether it's suitable for production systems. Our code is open source, and you can access it there. Really quickly, we'd like to acknowledge Raja, who mentored us through this process, and the MIT PRIMES program.

All right, with that, we'd like to take some questions. The microphone is over there.

Thanks for presenting that, that was really cool. Do you have any plans for contributing any of that work back upstream to Jaeger?

Right, so our modifications to Jaeger are fairly minimal. The thing is that they do have a noticeable performance impact as of now, because they force the Jaeger client itself to propagate its entire context into the kernel for every sampled request. So I don't think we should contribute it directly back to the Jaeger client, but I think it would be worthwhile to maintain, say, a set of patch files that can be applied to the Jaeger client to enable the Skua tracing integration. So while I don't think we should directly contribute it back, we can still build off of whatever they're working on.

I would encourage you to reach out to them and discuss it, because I think they'll be very interested.

Okay. Any more questions?

Yeah, that was a great talk, thanks so much.
So you're tracing syscalls into the kernel, and a service writer typically wouldn't necessarily know what the meaning of those system calls is. Do you have any tools to help break out whether it's, say, the networking stack that's been entered, or scheduling, or accessing the file system, et cetera, to help those kinds of users?

So we could do some sort of grouping where we categorize syscalls as, say, networking or file system, a set of categories, such that you can see all of your networking syscalls together, and then all of your file system syscalls together, and so forth. We could do that. As of now we're just pulling the syscall name, or the kernel event name, directly from LTTng; we're not doing any analysis on it, we're just sending it up to the Jaeger framework. So right now we're not doing any introspection, but we absolutely could.

Okay, thank you very much.

As others said, thanks for the cool talk. I had one question about LTTng versus eBPF. Did you look at what modifications would be needed to work with eBPF instead of LTTng, and are there pros and cons, or shortcomings, or reasons that you picked one versus the other?

So we looked at eBPF briefly. The thing with eBPF is that, as I understand it, it's mostly used for networking-related tracing, so you could absolutely capture syscalls and different networking-related things using eBPF. However, in order to capture a grand picture of all of the different events that are going on in the kernel, we opted for LTTng. Now, we were actually considering using eBPF to propagate the span context from kernel to kernel, so the user space application wouldn't have to send its trace context with its request; we could do that kernel to kernel, using an extension to TCP or something like that. However, it just seemed like it would be a little too complicated, and it would be difficult to represent the causal relationship between those two kernels properly using the current Jaeger framework. So we looked into it, but I'm not sure it's exactly right for what we were trying to do.

Cool, got it, thanks.

Do we have any time left? All right, thank you.
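As a small illustrative aside on the first question above: grouping raw syscall names into coarser categories could look roughly like this. Nothing like it exists in Skua today, and the bucketing here is just an example.

```cpp
// Hypothetical post-processing step: bucket syscall names into coarse categories
// (networking, file system, and so on) before showing them to a service developer.
#include <string>
#include <unordered_map>

enum class SyscallCategory { Networking, FileSystem, Scheduling, Memory, Other };

SyscallCategory categorize(const std::string& syscall) {
    static const std::unordered_map<std::string, SyscallCategory> table = {
        {"sendto",  SyscallCategory::Networking}, {"recvfrom",  SyscallCategory::Networking},
        {"connect", SyscallCategory::Networking}, {"accept4",   SyscallCategory::Networking},
        {"openat",  SyscallCategory::FileSystem}, {"fsync",     SyscallCategory::FileSystem},
        {"futex",   SyscallCategory::Scheduling}, {"nanosleep", SyscallCategory::Scheduling},
        {"mmap",    SyscallCategory::Memory},     {"brk",       SyscallCategory::Memory},
    };
    auto it = table.find(syscall);
    return it == table.end() ? SyscallCategory::Other : it->second;
}
```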