Okay, so, hi everyone, my name is Jérémie Galarneau. I'll be presenting about continuous application monitoring using LTTng. I work at EfficiOS. If you've never heard the name, we're a small consulting firm based in Montreal. We develop tracers, LTTng being the best known of them. I personally maintain LTTng-tools and Babeltrace.

I'll start the presentation with a few words about tracing in general and, more specifically, system tracing and LTTng; how we struggled to meet the needs of people who wanted to do continuous monitoring with LTTng; our solution, tracing session rotation, what it is and its shortcomings; and the challenges involved in making continuous trace analysis work at scale.

So, LTTng started in 2005. At this point it is more of an umbrella project. We have a user space tracer, a kernel tracer, the session daemons that I maintain to control both tracers and the data extraction paths, the LTTng analyses project, which I'll touch on a bit later in the talk — it's a set of scripts to work with traces — and LTTng Scope, a graphical viewer.

I got a really good question in the interview the week before the talk, which is now on the FOSDEM website: what's the difference between logging and tracing? It's a hard question to answer because there is no clear definition for either. They both have the same goal: I want to extract information from my application or from the kernel and be able to understand the state of the application at that moment. That's the gist of it. Different logging frameworks and tracers will target either developers, sysadmins, or, in the worst case, end users. And it's always a matter of trade-offs between usefulness, ease of use, the performance impact on the system, and verbosity, which ties into ease of use. Then there are some concepts that come back in every tracer and every logging framework: enabling only a certain subset of events, and log levels.

The big difference, I find, is that when people think of system tracers like LTTng, we're more in the realm of low-level events that are very, very frequent. If we think of the kernel tracer, our goal is to capture system calls and their payloads, so we can have thousands of them every second. But we also trace scheduling information, so we track every task running on every CPU at any given time, and we track file system activity, networking activity, power management events, and so on and so forth. The user space tracer, LTTng-UST, tries to be the equivalent for user space. Since we can capture a lot of events with a very low overhead, people tend to funnel very, very verbose logging through LTTng-UST, and our job is to cope with that.

When I say a lot of events, I want to give you a rough figure of what that represents. I did a quick measurement on this laptop while it was completely idle: if you enable all kernel events, that represents around 54,000 events per second, and you're not dropping events at that point. On a busy eight-core server, you're more in the waters of 3 million events per second, so that starts to be a lot to cope with. This is where tracers and loggers diverge in the trade-offs that they have to make.

First trade-off: I'd say the central part of LTTng is the ring buffer; that's at the center of the whole thing. We have lockless ring buffer implementations in user space and in kernel space that are very, very similar.
There are only a few adaptations needed to make it work in both scenarios. And it is consumed asynchronously, so there is no waiting or blocking of the application — or the kernel, for that matter — when you're producing events. Worse comes to worst, you're dropping events, but you're not slowing down the application. There are also a number of memory footprint and security parameters that you can set on that ring buffer, which I'm not going to go into for this talk. So that's one big part: the quality of the implementation of the ring buffer.

The other thing is that the best way to not influence the system is to not trace — so trace as little as possible. This is why LTTng has a fairly complete rule-based system where you can enable only the events that you want, and you can perform filtering at runtime on the events using our filter interpreter, which runs both in the kernel and in user space. I gave two examples here: in the first one, you can see that you're enabling the read syscall. You could enable it only for a given process, and then you can filter on the size of the read itself and not capture any four-byte reads, things like that.

To give you an idea of the performance: in kernel space it's a complicated question I won't go into, but it's pretty much the same performance as ftrace. In user space, we're in the waters of 130 nanoseconds per event, so that's very fast. As you can understand, we can't go through string formatting and achieve that performance, so everything is serialized to a binary format. If you follow the development of the Linux kernel, we've been working for a while on restartable sequences; the new rseq system call is in the mainline kernel from 4.18 and up. The goal of us putting that much energy behind it was to improve the performance of the user space tracer. Right now, the code paths that use rseq are not in LTTng-UST, but our prototype branch lowers that number to 90 nanoseconds per event, so it's quite a significant improvement for us.

The users of LTTng are kernel and application developers and embedded system developers, and for a very long time in the project's history, that was really the core of our users — I'd say all of our users. But now there's a lot of interest from people running infrastructure, system administrators, DevOps. Those users come to us and say, hey, I've used LTTng to debug that tiny race in a driver. They've found firmware bugs. They've figured out why there was memory pressure at a given time. So, problems that are hard to reproduce, but that can be reproduced. Basically what they would do is set up tracing, start collecting everything, reproduce the problem, and then investigate manually through the trace. And that works — I mean, it's the way the other two classes of users tend to work — but it's not really continuous monitoring per se. It's fine if you're hunting for a bug, but if you want to know whether everything is running smoothly, it's not really the solution.

Something clever that I've seen reproduced at a number of our customers' sites is that people who run a large infrastructure use a sampling approach. Basically, they set up tracing at random times on random machines, collect a trace for a given period of time, run a number of scripts on it, and look for deviations from their known-good response times and resource usage.
They'll look at error rates, error types, those kinds of things. But again, it's not continuous.

So we have the snapshot mode, which is mostly used by embedded developers. This mode is basically flight recorder mode: you're tracing in memory, you're never touching the disk, and if you figure out that something wrong is going on, you can capture your buffers and work with them. For all intents and purposes, this gives you a regular CTF trace, so you can use every tool out there that reads that format. But there are some drawbacks, especially for monitoring. First, it might seem obvious, but you have to react very quickly: if you're only keeping 8 megabytes of trace data around, by the time you realize there's a problem, there's a good chance you'll have overwritten everything. No luck. And there's also the problem that, even if you do manage to react fast enough, you only have a few milliseconds of tracing, so you'd better hope you can understand what was going on from that information. I'm making it sound very negative — it is useful in some cases — but those are major drawbacks.

Then we have the live mode. Live mode was something we thought was good enough for monitoring for a long time, but really, it was not made for that. It was a way to build — I don't want to say a better strace, but an strace with kernel events that are not syscalls, and to add user space tracing to that. So you could see what your application was doing and what the kernel was doing, interleaved, and try to understand what was going on. But the goal was to have enough filtering that you could actually follow it with your eyes, not work with automated tools. We also didn't want to slow down the applications — or at least not as much as ptrace does. But when we discussed this with users who wanted to use it in production: first of all, our protocol — and we knew this — is not designed to handle a high throughput, so it falls down in those cases. It's also not trivial to consume. A couple of people have implemented it, but they're not making their implementations open source, so for all intents and purposes they don't really exist; the only client we have is the text viewer, which was built for the original use case. Also, maintaining connections to every target is not really realistic in production, at least not for the people we talk to. And there's the problem that you only get the most recent information: if you lose your connection, the problem occurs, you reconnect — you've lost your data, basically.

So it became evident that there were missing pieces for continuous monitoring. Not that we had our heads in the sand, but we thought, hey, if it does the job for all of those users, why are you special, right?

I want to make a point here, because I think there's an eBPF talk after this one: aggregating system traces to monitor trends is different from what we're trying to do. What we're trying to do is have all the data available when a problem occurs, not establish trends or build a histogram of latencies and those kinds of things — which are very, very useful, I'm not knocking that — but that's the big difference in approach. So our modes are good for debugging, but for monitoring, they don't do the job. We talked with users, and we were kind of on the back foot about implementing all of this, because everyone had a very strong idea of what we should support, and they were all different.
And we're not a project with a ton of resources to throw around. For instance, everybody wanted us to natively support their own data store, their own log ingestion pipeline, or their own message queue. Those are all legitimate ideas, but if we tried to support everything, that's the only thing we would be doing forever, right? So we needed to come up with a solution that is adaptable and that people can build on themselves.

Another problem is that traces are huge. There are people who trace in production and capture hours and hours, or days, of traces, and they don't mind if the traces take hundreds of gigabytes. Those people are few and far between, so that's not a general solution. We need a way to keep tracing while discarding old trace data, and to make it possible to compress traces as they age and archive them or throw them away. The third point is that trace processing can be very slow and very resource-heavy. When you have hundreds of gigabytes of traces, no tool is going to be snappy on that without a bit of preparation. So I think it's a given that the analysis must not happen on the target. Typically, your target is busy doing something; you don't want it to start crunching a 100-gigabyte trace out of nowhere.

The solution, in concept, is very simple. We thought, okay, why don't we do log rotation, but for tracing, right? It's not that we hadn't thought about it; it's just that there are a lot of challenges in implementing it with the constraints that we have, and I'll go through those constraints and show you why it was complicated. Basically, the idea is that we trace, then we switch over to new files, and then you can do whatever you want with those files. If you want to ingest them into an existing logging pipeline, you can do that. If you want to shuffle them around, you can do that. If you want to compress them, go ahead. And the resulting trace should work with any viewer. That's also a big point: our trace format, CTF, is standard and open — it's under the Linux Foundation DiaMon workgroup — and we don't want to break the format just because we're implementing a new feature.

There's one problem: our format doesn't make this easy, and the reason is two-fold. First of all, an LTTng trace is a set of files, not one file. As I said before, we keep a lockless ring buffer per CPU, and we also have per-CPU stream files. Those per-CPU stream files exist for user space and for the kernel, and, depending on your configuration, they could exist for each and every user or each and every application. So it's not like classical log rotation, where you close one file, open another, start writing there, and you're done.

We also have a metadata file, and that's one of the big challenges. Because LTTng is fully dynamic, an application can open a shared library, and that shared library can provide new events whose layout is completely different from anything known up to that point, so that layout has to be added to the metadata. The simple event in the example here has a UID and a cookie field as its payload — and if you don't know that layout, you can't decode the trace. That's basically the problem. So rotating also involves a step where we have to know, up to that point, all the event layouts that could possibly have been created.
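Since an LTTng trace — or trace chunk — is a set of files rather than a single file, here is a small, purely illustrative Python sketch that walks one trace directory and reports the per-CPU stream files sitting next to each metadata file. The `kernel/` and `ust/uid/.../` paths in the comments reflect the usual on-disk layout, but treat the exact layout as an assumption to check against your own sessions.

```python
# Make "a trace is a set of files" concrete: walk a trace (or chunk) directory
# and report, per domain, how many per-CPU stream files sit next to each
# `metadata` file. Typical directories look like kernel/ or ust/uid/1000/64-bit/,
# but verify this against your own setup.
from pathlib import Path
import sys

def describe_trace(root):
    root = Path(root)
    for metadata in sorted(root.rglob("metadata")):
        domain_dir = metadata.parent
        streams = [p for p in domain_dir.iterdir()
                   if p.is_file() and p.name != "metadata"]
        total = sum(p.stat().st_size for p in streams)
        print(f"{domain_dir.relative_to(root)}: "
              f"{len(streams)} stream file(s), {total} bytes of event data")

if __name__ == "__main__":
    # e.g. describe_trace("/lttng-traces/my-session-20190203-104500")
    describe_trace(sys.argv[1])
```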
Those event layouts have to be serialized to the metadata file and handed over to the user, so that they don't end up with a trace that is not decodable. So the steps become: as we are tracing, we sample the production position for every CPU across domains, establish a switchover point, determine all the event layouts that are needed to decode the trace up to that point, flush that, and then switch over to a new set of files and mark the previous set as ready for consumption. You can either monitor the file system to see the new files being created, or use our Unix-socket-based notification API to find out.

I'm very happy to say that this works now. It is not in a stable release yet, although we will be releasing this month; it is at the release candidate stage, so if you want to try it out and you find bugs, hit me up on GitHub or on the mailing list. I'd be very happy to hear about it before we release — although if it happens after, we'll fix it too.

For a while, we were super happy. We thought, all right, this is done — because we look at the world from a tracer's perspective, so we thought adapting the analysis tools would be other people's job. It doesn't really work like that. People started playing with it, and there were two groups of users. I'd say the larger first group was very happy with the feature: people who do stateless analysis. Stateless analysis is what you do when you just want to count events. You have a simple rule — maybe with grep, maybe with something else — and you just want to know how many of a given event there were in the trace, how many errors, how many of whatever, and that's all you need. Then you can throw that into Grafana or whatever dashboard you use.

The other class of users, who were less happy, said: this is great from a theoretical point of view, but I want to perform stateful analysis. So what is stateful analysis? Basically, any time you read a trace and construct a model from it. Anything interesting you do with traces relies on this; graphical viewers do it all the time. They basically keep a simplified model of the kernel, for instance, and follow every event that comes in. For example, if they see an open syscall and the open succeeds, they note: okay, fd number 3 is now associated with that path name, fair enough. This makes it easy to read a trace and understand what's going on without seeing a read on fd number 3 and having to go back, perhaps hours, to find out what that file was — if you even have that information. And that's the thing with trace rotation: maybe that information is in another chunk, so you'll never have it.

The simplest use case is presenting data in a more familiar way. Even a command line viewer can do that: instead of showing the raw event, it replaces the PID with the process name and the fd with the file's name. This is what we do in the LTTng analyses — I'll present the project a little later — but basically, if you want to see the longest read for a process in a given time range, it's going to give you that, and it's not going to tell you there was a long read on fd 13 at that moment; it's going to give you the file name along with the fd. The way those tools work is that they track state changes of resources. In concept, it's very, very simple.
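To make the stateless/stateful distinction concrete, here is a minimal sketch using the Babeltrace 1 Python bindings (the same bindings the LTTng analyses scripts build on): it counts events (stateless) and remembers which file name each successful openat() returned an fd for (stateful). The event and field names, and the assumption that the tid context was added to the trace, should be verified against your own traces.

```python
# Stateless counting vs. stateful fd-to-path tracking on one LTTng kernel trace.
# Assumes the trace was recorded with the tid context added
# (e.g. `lttng add-context --kernel --type tid`).
import collections
import babeltrace

def analyze(trace_path):
    col = babeltrace.TraceCollection()
    col.add_trace(trace_path, 'ctf')

    counts = collections.Counter()   # stateless: event name -> occurrence count
    pending_open = {}                # stateful: tid -> filename seen at syscall entry
    fd_to_path = {}                  # stateful: (tid, fd) -> filename

    for event in col.events:
        counts[event.name] += 1
        if event.name == 'syscall_entry_openat':
            pending_open[event['tid']] = event['filename']
        elif event.name == 'syscall_exit_openat':
            tid = event['tid']
            filename = pending_open.pop(tid, None)
            fd = event['ret']
            # A negative return value means the open failed and no fd exists.
            if filename is not None and fd >= 0:
                fd_to_path[(tid, fd)] = filename

    print(counts.most_common(10))
    print(fd_to_path)

analyze('/lttng-traces/my-session/kernel')
```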
A bit of background: when you start tracing the kernel, LTTng dumps the state of the kernel. That means all of the open file descriptors, all the active processes — well, I should say all the tasks — with their PIDs, group IDs, user namespaces, and all that. Everything gets dumped, and the tools can use that as an initial model that they then update by reading the trace. What most graphical tools do is track the changes and populate a state history database, and they make the assumption — reasonable up to this point — that all the information is available all the time.

The problem with independent chunks is that you don't have a complete model. Well, you have all the information; it's just split into a lot of tiny, tiny traces. If you look at a classical LTTng trace, you can follow the whole lifetime of a file, whereas if you look at chunks, maybe you'll see the open in one, a read in another, and the close in the last one. So if you point existing tools at any of those independent traces, they're not going to tell you much.

So the challenge became: people are going to have to adapt their tools, so maybe I should try the exercise myself and try to help them. I thought it would be very easy; it turns out it's not. But there's a demo at the end, and you'll see that it works — spoiler alert.

The LTTng analyses are a set of scripts, and the most popular ones are there: syscall latencies; interrupt handler durations; I/O usage; the highest latencies when accessing a file; scheduling latencies, so you can see, for instance, if you took a long time to answer a request, whether you were scheduled on a CPU at all during that time — things like that. I wanted to reimplement that kind of analysis, but with chunks, and provide basically the same information as the LTTng analyses. That analysis is very simple: you just have to track which syscall is active in any given thread at any given moment. When it completes, you can compute the duration, print it, and use the kernel model to pretty-print the information — so replacing fds with file names, PIDs with process names, and all that.

If we don't have all the information, we need what I call partial modeling. The central idea of partial modeling is that any kernel object — or operation, for that matter — can be represented as a span. The lifetime of a syscall is bounded by the lifetime of the thread in which it happens. The thread is bounded by the process. For file descriptors, let's assume they're bounded by the lifetime of the process and not get into weird clone flags and shared file descriptor tables. CPUs, the same way: most tools assume that CPUs are there all the time, but with CPU hotplug they can come and go, so their lifetime is also a span, and the kernel instance itself is a span in time. And the same applies to applications: when you're serving a request, that's a span. That's not a new concept — Zipkin represents information that way — it's just not as usual for system traces. This is where most analyses need the most adjustments — I'd say mostly a rewrite, typically, at least of the back end.

Intuitively, I thought the analysis could happen in two phases. In the first one, I just go over the chunk, find every complete span, and keep it aside.
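Before getting to the second phase, here is a minimal, illustrative sketch of the span representation just described. The names and structure are mine, not the prototype's; the point is only that every object has a begin and an end, and that either one may fall outside the chunk being analyzed.

```python
# "Partial modeling": every kernel object or operation is a span whose begin
# and/or end may be missing when only one chunk is visible.
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Span:
    kind: str                       # e.g. "process", "thread", "fd", "syscall"
    name: str                       # e.g. "openat", "/var/log/app.log"
    begin_ts: Optional[int] = None  # None: began before this chunk
    end_ts: Optional[int] = None    # None: still open when the chunk ended
    children: List["Span"] = field(default_factory=list)

    @property
    def complete(self) -> bool:
        # Only complete spans can be reported from a single chunk; the rest
        # must wait for the merge phase across chunks.
        return self.begin_ts is not None and self.end_ts is not None

# A thread span bounding a syscall span, as described above:
thread = Span(kind="thread", name="worker/1234")                   # existed before the chunk
read = Span(kind="syscall", name="read", begin_ts=1_000_200, end_ts=1_000_950)
thread.children.append(read)

print(read.complete)    # True: can be reported right away
print(thread.complete)  # False: a partial span to be stitched later
```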
Now, if there are spans that straddle the boundary between two chunks, I keep those aside as well, and when we do the merge phase, we can stitch them together. This sounds a lot like a classical map-reduce pipeline. Okay, it's a bit more complicated than that, and I realized that very quickly; it makes a lot of sense if you think about it. Syscalls are not atomic, and system calls can fail. System calls can be interrupted and restarted. So you have to be careful: if you see an open, it doesn't mean that that file descriptor is going to exist at any point in the future. Maybe the system call is going to be interrupted. Maybe it's going to end in the next chunk, so you don't know a lot about it, and your model has to account for that. You have the example here. The other point is that the entry and exit of a syscall can happen on different CPUs, so you have to perform your analysis not on a per-CPU stream basis, but on a chunk basis. Fair enough — the Babeltrace library makes it easy to collate all the streams together and do that merging.

So, during the mapping, we find every span. I gave an example here: maybe we'll see, in a chunk, a read on fd 22. Keep in mind that this could be a socket; this could be a number of things. The only thing we know is that at this point, if the read succeeds, there was a file descriptor 22, we succeeded in reading 15 bytes, and we saw the end of the lifetime of that file descriptor. Fair enough — so now we have the end of that file descriptor's lifetime.

As for reducing — to talk about my implementation — at the end of every mapping phase, I produce a JSON file that holds the whole kernel model along with the partial spans, and during the reduce we tie it all back together. At some point, a chunk will tell us: I don't know which fd referred to that file here, but I know that there's a file descriptor pointing to this file; it's a regular file on a block device, not a socket — a normal file descriptor. Then we have all the information, and any request that was waiting on that information can complete — I should say, the report for that request can complete — and it gives you the breakdown.

Okay, so in fact it's a bit more complicated than that still: the rotation is not atomic. If you look at the model of the traces that I showed before, it's easy to think, okay, I have a switchover point, and from that point on I have a new set of traces. That works until you try to run an analysis on it. And really, it's simple: if you try to perform the switchover without blocking everything, you end up with this staircase pattern, where, over time, the rotations happen and the files are switched CPU by CPU. I thought this would be a lot quicker than it actually is, and for a time I thought, ah, it's not the end of the world if we drop events for that little interval. But actually, rotating on a machine with 12 logical cores takes around 28 milliseconds, and a lot can happen in that time. There can be complete lifetimes of processes during that interval — well, processes, maybe not, but it could happen — but at least files, that can happen, and complete requests can happen during that time. So what ended up happening is that in some cases you could see an open on the first CPU and then another operation on what looks like the same fd on CPU number three, and you think you know everything in that time span, but actually you're blind to what happened during the transition.
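Going back to the map and reduce phases for a moment, here is a rough, self-contained sketch of that structure. It is not the actual prototype, just the shape of it, with made-up span keys: the map step reports spans completed within one chunk and keeps dangling begins and ends aside, and the reduce step stitches them across chunk boundaries.

```python
# Sketch of the per-chunk "map" and cross-chunk "reduce" described above.
import json

def map_chunk(events):
    """events: iterable of (timestamp, kind, key) where kind is 'begin' or
    'end' and key identifies the resource, e.g. ('fd', tid, 22)."""
    open_spans = {}      # key -> begin timestamp seen in this chunk
    complete = []        # spans fully contained in this chunk
    dangling_ends = []   # ends whose begin happened in an earlier chunk
    for ts, kind, key in events:
        if kind == "begin":
            open_spans[key] = ts
        elif key in open_spans:
            complete.append({"key": key, "begin": open_spans.pop(key), "end": ts})
        else:
            dangling_ends.append({"key": key, "end": ts})
    dangling_begins = [{"key": k, "begin": ts} for k, ts in open_spans.items()]
    # In the real pipeline this report would be written out as a JSON file.
    return {"complete": complete,
            "dangling_begins": dangling_begins,
            "dangling_ends": dangling_ends}

def reduce(chunk_reports):
    spans = []
    pending = {}  # key -> begin timestamp carried over from previous chunks
    for report in chunk_reports:          # chunks in chronological order
        spans.extend(report["complete"])
        for end in report["dangling_ends"]:
            begin = pending.pop(tuple(end["key"]), None)
            if begin is not None:         # stitched across the chunk boundary
                spans.append({"key": end["key"], "begin": begin, "end": end["end"]})
        for begin in report["dangling_begins"]:
            pending[tuple(begin["key"])] = begin["begin"]
    return spans

chunk1 = map_chunk([(100, "begin", ("fd", 42, 22)), (150, "begin", ("fd", 42, 23)),
                    (180, "end", ("fd", 42, 23))])
chunk2 = map_chunk([(300, "end", ("fd", 42, 22))])
print(json.dumps(reduce([chunk1, chunk2])))
```

In the talk's prototype, the per-chunk JSON also carries the partial kernel model, so the reduce step can resolve fds to file names as described above.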
Coming back to that blind window: you then assume, okay, I know what that fd is, but actually it was something else all that time — you don't really know what that fd refers to. There are a number of solutions to this problem. The one that I used is a bit lazy: if you work on two chunks at a time, you can always find a point up to which you have complete data from every stream, so within that region the assumption that the switchover is atomic holds. I think it's a good moment to mention that we use the same clock source, CLOCK_MONOTONIC, between kernel space and user space. This is what makes it possible to make the analysis reliable and to really order events the way you would expect.

I'm happy to say that the prototype works. The goal was to provide the same data as the LTTng analyses. What I wanted was a model of an application that performs a non-trivial query, showing the time spent in each phase of the query, with the ability to drill down into the time spent in every syscall that happened while serving that phase. And I tried to address the shortcomings of the LTTng analyses. First, it is a lot faster: the LTTng analyses are written in Python, and working with millions of events is not what Python is made for, so it's normal that they were slow, but that made them pretty hard to use in a continuous way. So that part is taken care of — but really, the challenging part was the partial modeling.

The example I'm going to show you is a server that receives requests to generate thumbnails. Basically, it is given an asset ID, seeks to the requested frame, decodes just enough of the video to decode that frame, scales it down to thumbnail size, and sends the response back. And then we can generate a breakdown of the time spent in every syscall. I have this running in the background, feeding a Grafana instance, and what you can see here is the time for each and every request. Now, I don't recommend you do this in production — you don't want to be processing each and every trace chunk and individually graphing each and every request — but it gives you an idea of what can be done. What you can do is, if you have outliers like the ones you see here, those points, you can generate a report and drill down into them. And this is what we're going to do.

If you look at the report generated for one of the fast requests, we can see that most of the time is spent opening the video with FFmpeg and decoding a full frame. That's fine; that's what most of the work is about. Scaling, as you would expect, is mostly CPU-bound. For opening the file, there's a lot of stuff that goes on, and you can see here that the combined duration of all the gettid calls is actually what dominates that operation. Then we have a number of calls — I'm not going to go through all of them — but you can see the value of augmenting the model here: normally you would have a read on an fd that you don't know about, you would have mremap — well, not mremap, mmap — on fds that you don't know about, and so on and so forth. Seek, same thing: you can see that not a lot happens during seeking, as you would expect; this is an indexed video file, so it's fine. Then decoding: nothing unusual. Scaling: nothing happens, it's purely compute. And then, writing the result to the socket. Now, if we take one example that took a long time, we can see that the outlier is in the write. Actually, this one is not sending over the network, because I had some problems here, so it's writing to a plain old file.
Basically, you can see that the write itself doesn't take a long time, but during that time there was a sync call that took around 600 milliseconds. So that's the kind of insight you can get very rapidly. Hopefully your applications don't perform a sync during request handling, but it has been seen.

In conclusion, I think that from the tracer's perspective, we now have the infrastructure to distribute trace processing. Our users are very happy that they now have files and can do whatever they want with them. But the open questions I have for you are: what form could this type of processing take so that it integrates with existing viewers? Or do you think we need new viewers, new tools for that? Are there tools out there that you use to monitor very low-level metrics continuously? I'd be very happy to have your feedback. Otherwise, do you have any questions? All right. Well, thank you.