Hi, thanks for the kind intro. This is high-performance tracing in production. Let's get started.

What will we cover today? I'm going to start by covering what tracing is and why you might do it on a production system, then some example performance problems and how you might analyze them using tracing tools, then more common performance problems one might encounter. I will map performance problems to the appropriate observability tools and then take a quick tour of the subsystems that power these amazing pieces of software. Let's dig in.

So to start off, what is tracing? I like to think of it as the observation and reporting of information about a system's behavior and execution. And this is really critical to understanding the behavior of the system. One can make an assumption about how a system might behave by testing it pre-production, or simulating it, or even through its design. But really the best way to get to know a system is to observe it.

So can we safely trace and profile in production? That does sound like a dangerous premise. It turns out, yes. And in fact, you often have to. Even the most robustly constructed test environments are going to differ quite distinctly from real workloads. Not only that, real workloads change. You can have traffic spikes. You can have new forms of data that weren't previously experienced. And applications can misbehave in ways that are unforeseen.

Environments often include a diverse set of software, even in organizations that are standardized on a common stack. Enterprises are getting bigger, software is powering more parts of the world, and more and more businesses are defining themselves by their software and server strategy. For those of you who are familiar with the Cloud Native Computing Foundation, you might notice their interactive landscape map here in the background. And this is just cloud native software. It's very easy to get lost in this map. And a lot of enterprise environments end up having a similar architecture, with thousands of services where no one even knows the names of all of them. So it's important to be able to understand a system even if you aren't responsible for the details of it.

So, tracing tools. Languages have their own tracing tools. Each ecosystem has its own tools. That's great when you're an expert on them. But they're often inconsistent, sometimes incomplete, and may be unsafe to run on systems serving live traffic. The ones you might be familiar with would be VisualVM for Java, gprof for C++, pprof for Go. There's a long list of these. And they're often pretty fantastic, and they definitely have the best integration with each language family. The Java tools in particular are world renowned. So they're great if you have a deep, intimate knowledge of the application. But if you're responsible for maintaining a lot of software, much of which uses a plurality of different tools or languages or ecosystems, it's kind of hard to learn them all. And it's often inappropriate: if you don't know the tracing tool well, running it in production is like the last thing you want to do. That just adds risk when you're already in a stressful situation. Of course, if the service is down anyway, maybe there's not really that much danger to it.

But there actually is a better path, and that's to use the standard set of perf-based tracing tools that support every workload on Linux. So the key points here are that these tools work for tracing all aspects of the system, not just the application layer.
So if something is misbehaving on disk, or in the kernel, or at the network layer, there's probably a tool for it. And often you get better visibility into the application layer than the language ecosystem's own tools give you. These tools are focused on providing visibility, and that's something they do well. Most of them are safe to run on live systems and use only a little CPU and memory. This is not the case for many language-specific tools, which have deep hooks into the language runtimes that may slow them down by a factor of two or four, and that's just something you can't really afford to run in production. And importantly, from an operational learning perspective, it's one tool set that you can use on all the services that an organization might run. So I like to think of perf events as the one ring buffer to rule them all.

So let's dive into a sample performance problem. I'm going to demo a simple web application that accepts requests, logs the request details, and then returns a response. And this application is misbehaving. It's not behaving as it could. It has hit a limit on the number of requests it processes, but it's not using all the resources assigned to it. It's not using all the CPU, network, or disk resources. And this is a typical performance problem for which there's no obvious cause. It's pretty obvious that if traffic grows and you end up saturating your network link, you might need another replica of the service. But in this case, there's no obvious problem with it. It's just not using all its resources and it's hit a wall.

So what I'm going to do is profile a running application. I'm going to start the application. I'm going to run Apache's ApacheBench command to simulate incoming traffic. I'm doing this because I don't actually have this environment running in production, and it would be a little reckless to demo a real production environment on a webinar. I'm going to run the profile tool against it to see what the application is actively working on. I will run the offwaketime profiler against it to see what the application is waiting on. I will convert the outputs to flame graphs using the handy little flame graph script and examine the contents. And then we'll discover that it's actually the logging system that's syncing each write to disk, and that requires the disk controller to respond with "yes, I have definitely let it hit the platters, or the NVMe flash, as the case may be." So importantly, this is not a real environment, but it's adequate to simulate the problem and explore the use of the tools. I have prerecorded the demo just so that the demo gods will look favorably on me, and we will get started with the video here.

So you can see here we're going to start the application, and it's now listening on port 8090. And we're going to run the ApacheBench command against it, and it's now running. And you'll note here that it is not consuming all of the CPU. This is all running locally, so there shouldn't be a real network bottleneck, but we've got plenty of spare CPU, and it's not using it, and that's not great. So we're going to run the profiler against it, and this is the profile script out of the BCC collection of tools from IOVisor, and it has gone and captured and generated this file from 10 seconds of activity. So you'll note that the entire time the app is running, CPU usage barely budged for that profiler tool.
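As a rough sketch of what that capture step can look like (these are not the exact commands from the recording; the binary name, request counts, and BCC tool path are assumptions and vary by distro):

```bash
# Start the demo app (hypothetical binary name); in the demo it listens on :8090.
./webapp &

# Simulate incoming traffic with ApacheBench: 100k requests, 50 concurrent.
ab -n 100000 -c 50 http://127.0.0.1:8090/ &

# On-CPU profile of the whole system for 10 seconds, with folded stacks (-f)
# and a delimiter between user and kernel frames (-d). On many distros the
# BCC tools live in /usr/share/bcc/tools/.
sudo /usr/share/bcc/tools/profile -d -f 10 > out.profile.folded
```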
And we're going to run it through our little generation script that produces a flame graph. So this is great. We can run this on any software and get an assessment of what it's actively working on, including the behavior inside the kernel, and that's the bits above the dashes. So anything above the gray dash boxes is actually code running in the kernel. So you can trace all of the usage of CPU right through to the kernel.

Next up, we're going to run the offwaketime profiler, and what this is going to do is determine, for each of the things that the program is waiting on, why they are waiting and what wakes them up. So we're going to do the same thing, run it for 10 seconds while it's processing traffic and processing our benchmark. You can see here it's actually lost a few stack traces, and that's fine. That's part of the lossy nature of data collection that lets these subsystems run safely in production. If they ever run out of resources, we'd rather the application go forward. And we can see here we actually have a better view. We can see that most of the time the application is waiting, and we're going to zoom in on our main function that handles data. It's actually waiting on the sync call, right? This is not good. You don't want your application spending all of its time waiting for disk writes to complete. And we can actually see even what has woken it up, which is the completion notification interrupt from the NVMe disk. So that's great. We've now determined what the source of this problem is. It's an unnecessary sync call. It's only processing 1337 requests per second, which is not very elite.

So we're going to go in here and fix it. We're going to find that sync call, and we're going to get rid of the sync. Let's go rebuild the software, start the service again, and we'll run the benchmark again. And let's see how much of its resources this application is using. Excellent, that's much better. It's now using close to 100% CPU, with some CPU for the benchmark, right? So that's great, much better CPU usage. We're going to now run our profile again, and we're going to load it up and see that now the application is spending most of its time actually responding to requests. This is the sort of healthy behavior you want to see from at least this application. A little bit of time spent writing to the file, a lot of time spent responding to the socket and processing the HTTP request.

And we can go back to the offwaketime profiler as well and see what it ends up waiting on. We're going to wait for this benchmark to finish first, and that benchmark number is much better. It's roughly 10 times as many requests per second, which is fantastic. So let's capture that offwaketime profile, open it up here, generate our flame graph, and excellent: we can see now there's hardly any waiting. It's only internal coordination by the language runtime, and that is a much healthier profile. And you can see we're making much better use of the resources on this machine.

Great. Now let's dive in, because actually what the developer meant to do was take a log mutex instead of syncing to disk, because we want log messages to not be interleaved with each other. So we're going to go fix the bug properly, run the service again, and double check to make sure that it appropriately uses resources, and see that it does.
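For reference, the off-CPU side of that capture and the flame graph rendering might look roughly like the following sketch (tool paths and file names are assumptions; flamegraph.pl is the script from the FlameGraph repository linked from Brendan Gregg's site):

```bash
# Off-CPU time plus wakeup stacks for 10 seconds, folded output (-f)
# so it can be fed straight into the flame graph script.
sudo /usr/share/bcc/tools/offwaketime -f 10 > out.offwake.folded

# Render both captures as interactive SVG flame graphs you can zoom into.
./FlameGraph/flamegraph.pl out.profile.folded > profile.svg
./FlameGraph/flamegraph.pl out.offwake.folded > offwake.svg
```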
And this is a quick example of how one might use some of the profiling tools to trace through the activity of real, live, running software. Excellent. I'm going to jump back now to the slides.

So, a quick recap of what we did during the demo. Profilers can surface relevant information on the application, the system, and its current behavior. And this is really important because it gives a holistic view of what was going on in the system, and that is what we saw in the demo. The application didn't need to be run in a special mode or be restarted. We could just jump onto a live system as it's misbehaving, run the profiling tools, and get information on what is going on. The performance overhead was very low. If the tools are already installed, you can run them even on a heavily loaded system; sometimes a system is so overloaded that it can't even write to disk to install anything, and that can be a problem. So I highly recommend installing the tools by default as part of a production deploy, right? You should know what type of observability, what type of visibility, you're going to want to debug a system before you've deployed it. And one important thing to note is the data is lossy, and that is to avoid introducing bottlenecks on the running system. We did lose a little bit of data collection, but that didn't impede us from tracing and understanding the behavior of the system. And often incomplete information is enough to make really good decisions anyways. And lastly, flame graphs: they're really powerful. They're a summarization of, in the example we looked at, hundreds of thousands of data points organized into a graph that is very easy for humans to understand. It's a really good interchange format between the machine and the human, and I've found that organizing data into this format is very useful, at least when it comes to tracing and performance analysis.

So, some other common performance problems. Many of you may have seen these before: high application CPU usage, kernel CPU usage, saturated network bandwidth, high memory usage or swapping (that's often the end of your production uptime, when your application starts swapping), poor disk IO or file cache behavior, long tail response latency (this one is often very difficult to trace down), and then of course good old system or memory contention.

And if we take a look at the high-level view of a system, the application is actually just one small part of it. There's the hardware, the software, all the different layers in between. The kernel actually plays a huge part in the operation of software. And Linux has tracing tools for pretty much all parts of the system. For any part of the system, there's probably some basic tool to analyze data, collect data, give visibility into what is going on, and maybe even summarize it in a way that is accessible to humans. Even things like fans and power supplies have probes in them, and you can access that data via Linux tracing tools, which, when I first saw this chart, really amazed me. There are tools for basically everything, and you should get to know them.

So let's map those performance problems we had before to some of the tools we have here. For CPU usage, we showed this during the demo: we've got the main profile tool, with -U for user land and -K for kernel. For network bandwidth, we can look at tcptop. For high memory usage and swapping, we've got memleak and swapin. For disk IO or file cache behavior, we've got vfsstat.
That looks at the virtual file system layer in Linux. There's also cachestat, which looks at the caching behavior, and of course filetop. For long tail response latency, we've got tcplife, tcpaccept, and latencytop. There's a good number of other tools here because the network system is very complicated, so I definitely recommend more exploration there. And then for system or memory contention, we've got offwaketime and cpudist. And this is just a little sneak preview of all the tracing and analysis tools that are available on Linux.

So how do they work? There are a few different approaches to how they gather data and how it all fits together. The mechanisms they use are sample-based profiling, dynamic instrumentation, in-kernel virtual machines, which many of you may know as eBPF, and then of course asynchronous data communication with user land.

So sample-based profiling is when you interrupt the system on an interval and just take a snapshot of what it's doing. And this has pretty low overhead. It turns out that the system has to be interrupted anyways to do scheduling; this is what the Linux scheduler does: it time-slices the CPU and distributes that CPU to all of the different processes that are running on the system. So during that interruption, you can actually take a snapshot of what the system is doing and count samples of that behavior. Importantly, this is great because it's even, right? It doesn't skew the measurement results. But it is fundamentally a sample-based mechanism, so if there are long-tail events that you want to look at, or rare events, or if you want to capture every activity, it's just not possible to do that because you're only sampling the signal. In perf, this is done via the kernel scheduler: it wakes up, does the scheduling behavior, annotates what was running, and drops the samples into what is known as a perf buffer. This is a buffer that's shared between the tracing tool and the kernel, and the kernel will record information that the tracing tool can pick up.

Dynamic instrumentation is the other approach, and this is to hot patch the code with calls into a handler that records when an operation occurs. So this introduces a small overhead, but only on the parts of the system actually being instrumented. And this is very useful to be able to go and get either long-tail access or to record every activity and get a full view of the system. This does create some bias in the data, since the measurement is unevenly distributed, but often that bias can be corrected for, or it doesn't even matter for the type of insight that you're trying to get.

So the Linux kernel and perf have standardized, safe facilities that all tracing tools use and that are also used in other subsystems of the kernel as well. The first is tracepoints. Tracepoints are statically defined instrumentation points placed at predefined operations in the kernel. So things like forking or exec'ing or opening a file might have a tracepoint annotated in the kernel that probes can be attached to. Kprobes are the safe dynamic patching of arbitrary kernel functions. So almost every function in the kernel can have a kprobe attached to it that can then trap and record data that can be picked up by the tracing tool. And then uprobes, which are the same sort of dynamic patching behavior, but occurring in user land. So you can trace the inner workings of user land programs as they are executed.
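To make those mechanisms a bit more concrete, here is a hedged sketch using plain perf (kernel function names like do_sys_open vary by kernel version, and bash's readline is just the classic uprobe example, not something from the demo):

```bash
# Sample-based profiling: sample all CPUs at 99 Hz with stack traces for 10 s.
sudo perf record -F 99 -a -g -- sleep 10

# Tracepoints: list the predefined instrumentation points in the kernel.
sudo perf list tracepoint | head

# Kprobe: dynamically instrument an arbitrary kernel function.
sudo perf probe --add do_sys_open
sudo perf record -e probe:do_sys_open -a -- sleep 5
sudo perf probe --del probe:do_sys_open

# Uprobe: dynamically instrument a user-land function, e.g. bash's readline.
sudo perf probe -x /bin/bash --add readline
sudo perf record -e probe_bash:readline -a -- sleep 5
sudo perf probe --del probe_bash:readline
```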
And just like with sample-based profiling, data on these operations is written into perf buffers that are shared with and read by the tracing tool.

Layered on top of this, and this is actually what makes the difference between some of the traditional tools and the more modern tools, is the idea of an in-kernel virtual machine. On Linux, this is realized through eBPF, and this technology allows, instead of writing directly into the buffer, a tiny little program to be run inside the kernel as the data is about to be collected, one that makes decisions on what to do with it. So it replaces that copy into the buffer with a run of this program. And this can do really anything. It has to have a fixed amount of execution, because it's very important that the normal operation of the kernel not be interrupted, but you can do a lot in a fixed amount of time, including custom summarization or filtering or histograms, and a lot of these tools just go and perform all of their analysis incrementally in-kernel as the data is collected and have a tiny little drop of data coming out at the end. For those of you writing eBPF programs: they must terminate in a provably bounded number of cycles and cannot block.

And then lastly is this asynchronous data communication. This is how these tiny little programs, or this copying of ring data in the kernel, actually gets back to your hands as an operator, right? So all of this tracing activity happens in the kernel. It gets copied into either ring buffers or eBPF maps, and then the tracing program will, either periodically or at the end of its activity, read these ring buffers or maps and convert them into a user-visible form, something that might be useful for a human, either a text output or a graph or something like that. And this is really critical, because this asynchronous communication model means that you have a fixed amount of activity that happens in the kernel incrementally, and the tracing tool never stops the workload from running, so it never interrupts it. That means it has a little bit of overhead, but you can rely on these tools never to disrupt a production system.

And so, a recap: use the tools. They're fantastic. They work great. They're low overhead. You can use them on pretty much any software. I didn't have to do anything special to get my demo running. I just used the standard tool chain, built the misbehaving demo app, ran it, and the tools just worked. And they pointed me directly at the problem that the application was experiencing. They're fast, they're safe, and they're consistent between all the software you might run. If you're running hundreds of different applications and services in production, learning a hundred different sets of tools to maintain them is just not possible. So get to know a consistent set of tools and use them on everything. And you can rely on them to provide that visibility into all the software you might run on a Linux system. They're there: get them deployed in your pre-prod and dev environments, get to know them, run them against your software, and discover how it behaves. Discover the inner workings, discover what your teams are actually doing. And I'd like to point out a big thanks to all the contributors to the Linux performance subsystem.
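(As one concrete illustration of that in-kernel summarization, not something shown in the demo: BCC's biolatency attaches probes on the block I/O path, builds a latency histogram in an eBPF map inside the kernel, and only the finished histogram is read out asynchronously by the user-land tool.)

```bash
# Histogram of block device I/O latency, aggregated entirely in-kernel.
# -m reports in milliseconds; print one summary per second, five times.
sudo /usr/share/bcc/tools/biolatency -m 1 5
```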
Seeing the growth of the Linux performance subsystem over the last decade or so, it's really a phenomenal system that exposes a lot of data and makes working with production services a lot easier, even when they misbehave. And then, of course, thanks to the IOVisor project and Brendan Gregg for building some of these tools. They're really great, BCC is just a joy to work with, so put them to good use. And that is the content I've got to cover today. Thanks for attending. I'm Ryan, I work at Capsule8, where we use some of these systems to build security tools. And I will now take questions.

So we have a question here from Ryan Perry, who says: since the overhead is so low, what are your thoughts on doing continuous profiling in production and being able to go back and look at particular time periods to revisit profiles? Yes, this is actually a really interesting approach. I've seen a lot of the modern observability tools do this, right? They just plug into the tracing subsystems and they observe the state of the machine, and they're very safe to use. Some of them do have a little overhead, it's not zero, but it's very manageable and the risk is really, really low. So yeah, I highly recommend this. And we see vendors taking and doing this from multiple angles. We're certainly doing it from the security angle, but definitely the ops folks are ahead there. They want to understand how systems behave, they want to continuously know if the system is healthy and what's going on, and they've been doing it for a while now.

We've got a question from Soham: does running the profiler itself add any performance overhead to the application? I mean, does capturing the stack traces reduce the performance of the application? There is a little bit of measurement cost to doing this, but it is distributed through the scheduling that the Linux kernel does. So it makes the scheduling a little bit more expensive, since it has to capture this data. Generally, if you're right at like 99%, just about to tip over or just about to overflow your performance budget, maybe that's something to worry about, but if you're there anyways, you're probably already in trouble, so understanding how the system is behaving so that maybe you can fix it might be better. Uprobes in particular do have a little bit of cost, because they have to switch over to the kernel whenever they get hit, so that is something to be aware of as well. But generally the cost is pretty low. And I would say get to know them in dev and staging before you actually leap into prod with them, right? Know your tools before you start using them on problems for which there is pressure and risk.

Carl asks: where can we get the chart of tools? This was taken from Brendan Gregg's website, so that's at brendangregg.com. His last name has two Gs. It's a really fantastic chart; there's just so much in there. And like the Cloud Native Computing Foundation's chart, it's just getting more complete. There's so much to it, which is good, because there are so many great tools you can just use on everything. And that's kind of what we like to see. Part of the benefit of the Linux community is that the software is so composable, and you can have components interact and be used with each other. So that's great.

Jayakar asks: may I know what performance indicators you recommend that differ from development to live environments, such as lightweight, latency, dependency? I'm not quite sure how to answer this.
I think each application has different metrics that it wants to be measured by. For some applications latency is really important, for some throughput is important. I do know that if it's not a batch system, you kind of want to make sure that you have some additional headroom on each of the different resource types you have, or have the ability to auto-scale them in time, which is a common, modern approach. So I think the indicators are actually service dependent, right? And a lot of people just run their applications with, hey, let's give it so much headroom that there's no chance of hitting a limit, because human peace of mind about keeping the service up is worth it. I will say that often performance tracing is about finding the bottleneck, right? Some resource has hit its limit, and you have to determine what it is and why. And sometimes the solution is to give it more of that resource, and sometimes the solution is to figure out why it's using so much of that resource, because it really shouldn't be doing that. And that was the case here, right? We were syncing to disk way too often and just shouldn't have been.

Chris Modrak asks: when was this functionality added to the kernel? Tracepoints, kprobes, uprobes, which version? Ah, that is a really good question. Each of those was added at a different time. Tracepoints and kprobes have been around for a good amount of time. Uprobes are relatively new, but they were introduced for other parts of the system first and attached to the perf tracing systems later. I suspect around the 4.8 series is where you start to see them get usable in production. The 4.14 series is where they're really solid and feature complete, and there have been some really cool new features added to them in the 5.x series. So as long as you're using a modern kernel, you have pretty good support for a lot of these features. Red Hat has also backported them to the 3.10 series, so you can use some tracing tools on that, but the support is missing some dimensions. I'm not sure exactly which versions they've backported, but they're a good vendor and they make sure that everything is enterprise ready.

We have, from Tim Sander: what is the best way to trace RT workloads? I think that might mean real-time workloads. I guess it depends on whether your workload is hard real-time or soft real-time. For soft real-time, this is definitely appropriate, and it makes sense to know where you can and can't probe, but you definitely could make use of some of these tracing tools. If you're, like, FinTech and you're pinning an application to a core and talking to the network interface directly and have hard real-time requirements, probably not. You're probably in a completely different world of performance, but you probably already know that and have tools to deal with it. But for most everyone else, it's probably appropriate to use the tools; obviously test them in pre-prod and dev and know what the impact is, but they're probably appropriate.

We have Pablo asking how to create flame graph files. So that is actually just a little script accessible in the FlameGraph repository, which is linked from Brendan Gregg's site. You basically pipe in a list of stack traces with counts, and it will summarize them into that flame graph form for you and give you an SVG that you can interact with and zoom in on different parts of. I didn't realize SVGs could be interactive, but it turns out they can, and so you can click through and see the details and zoom in.
So I highly recommend that Perl script; it takes the textual information, which is quite verbose, and gives you a visual representation.

Ricardo asks: is there a similar tool to ab for Nginx that I have used or would recommend? Actually, the ApacheBench tool is pretty great. It is limited in that it only does one kind of request: you tell it what request to do and then it does that repeatedly. So it's not great if you want to test, you know, a whole distributed pattern of user activity or replay production traffic; it's not good for that. But for a simple demo like this, you can run ApacheBench against any web server that runs. And indeed I didn't run it against Apache, I ran it against my little toy web application.

Chris asks: are there any books that you can recommend on this topic? There are. I don't want to give a specific endorsement here because I'm not sure if it's appropriate to do that. There are plenty of books on this topic, and if you send me a message on Twitter or at ryan@capsule8.com, I'd be happy to follow up with that.

Promise asks: just a follow-up question, with monitoring tools like New Relic and Stackify having their agents installed on a production server, do we need to install these Linux tracing tools on a production server as well? In many cases, vendors are coming into the space and offering commercial tools that let you sort of do this from an agent in the sky. If those tools work for you, great, use them. They might not offer the same level of visibility as the open source stack does, but it's really great to have a vendor that you can talk to who can help you through it. And in many cases vendors are advertising, hey, we use eBPF and we use perf, which is really great, because then you know they're going to be safe, right? Because they have the same properties as these open source tools. So the same things apply to those vendor tools as to these perf event tools.

Another question, adding to Ryan's question: would any data cleansing be required to build a useful dashboard? I think that depends on how sensitive your data is. Everything we've seen here that I looked at doesn't have sensitive data; it's only, like, stack traces, and I'm not sure how they could be considered sensitive. Maybe someone could make an argument for them to be. There's definitely the case where you can get sensitive data, right? So things like execsnoop or iosnoop or that sort of thing, where you can get file names or you can get process names. And the probes are very powerful: you can sniff network traffic, you can do anything like that. So you definitely can get access to sensitive data. But all the tools I demoed here, you could put them in a dashboard and not worry about it.

Namika asks: please, I would like to know how I can go about using these tracing tools across server fleets. I think there are some vendors that offer this sort of thing; I think a few of them were mentioned in earlier questions, and some of those are great. I think there are some homegrown automation and open source projects as well that will do this. Yeah, there are a lot of different options here, and I think some of this space is still emerging. I'd definitely like to see some more tools that let you not just do this when something goes bad, but run these and collect the data historically, and then when something is going bad, compare today's view to yesterday's view when the system was healthy.
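(As a sketch of what that homegrown-automation approach could look like, with the host list, SSH access, remote tool path, and profiling duration all being assumptions rather than a specific recommendation:)

```bash
# Collect a 30-second folded CPU profile from each host in a fleet over SSH
# and render one flame graph per host, so profiles can be compared over time.
for host in $(cat hosts.txt); do
  ssh "$host" 'sudo /usr/share/bcc/tools/profile -f 30' > "$host.folded"
  ./FlameGraph/flamegraph.pl "$host.folded" > "$host.svg"
done
```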
Alexander asks: is it still the in-kernel eBPF mechanism when we use uprobes for user space tracing? Yes. The way uprobes work is they trap to the kernel, which incurs a user space to kernel transition; the kernel eBPF program runs, and then it transitions back. So there actually is more of a cost to uprobes than to kprobes. That may or may not be relevant for your application; most of the tools covered here aren't uprobe-based, but that is something to know.

Excellent. And that is it for Q and A today. Thank you all for coming. I will hand it back over to the Linux Foundation.

Thank you so much to Ryan for his time today, and thank you to all the participants who joined us. As a reminder, this recording will be on the Linux Foundation YouTube page soon. We hope you're able to join us for future webinars. Have a wonderful day.