Hi everyone, my name is Jérémie Galarneau. I'm one of the maintainers of the LTTng project, along with Mathieu Desnoyers. We're both at EfficiOS. I'll start with a quick word about EfficiOS if you don't know the company. We help customers build development, monitoring, and debugging tools. Everything we do is open source. We are behind the LTTng project, Babeltrace, barectf, and we contribute regularly to GDB and the Linux kernel. A large part of what we do at EfficiOS is bring ideas that originate in the kernel down to user space, and LTTng is one of those things: we bring tracing as it is imagined in the kernel down to user space. We're also behind the membarrier syscall and the restartable sequences (rseq) infrastructure that allows you to implement proper per-CPU data structures in user space.

So this talk is about new features in LTTng and how we're starting to move beyond ring buffer tracing. I'm going to start with a brief introduction of LTTng, the project itself. Then I'll talk about ring buffer tracing and its limitations. Then I'll go into new features in the last release and the next one, which are triggers and aggregation maps. And then I'll touch on some future work and take your questions.

So what is LTTng? Well, first things first, I think it's good to talk about loggers when we talk about tracers. I assume you know what loggers are: the idea is that you add instrumentation points to your code and take the opportunity to extract some information about the state of your application, or of the kernel in the case of a kernel tracer. With tracing, the goal is the same: to understand the state of your program at a given point in time. In both cases, users face the same trade-off between verbosity and performance, both at runtime and at analysis time. I'd say the real difference is the nature of the events we target. Loggers really target high-level events, and the low frequency of those events makes it possible to log to text. Tracers target very low-level events: things like syscalls and IRQs in the kernel, and in user space very low-level application events like memory allocations and job dispatches. So, things that can happen thousands or millions of times per second. There isn't really a clear cut line between the two, but I would say that tracers, by the nature of what they try to record, have to go to a binary format and have to make certain trade-offs in terms of usability. So I think it's a useful distinction to have.

There are a ton of kernel tracers out there: perf, ftrace (the talk just before this one), and eBPF is starting to be active on that front. For LTTng, the goal is to understand the interaction between components on the whole system. We're not just focusing on the kernel itself; we want to understand the interaction between applications and the kernel, and between containers. So that's a bit different. To do that, having a kernel tracer is certainly useful, and that's why we have lttng-modules. We reuse the upstream kernel's facilities, things like tracepoints and the hooks to perform syscall tracing so we can extract the syscall arguments, and we reuse uprobes and kprobes to dynamically instrument existing code; a sketch of what that can look like is shown below.
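A minimal sketch of dynamic instrumentation through the LTTng CLI, assuming recent lttng-tools; the kernel symbol and the binary path are placeholders, not taken from the talk:

    # hook a kprobe on a kernel function and give the resulting event a name
    $ lttng enable-event --kernel --probe=do_sys_openat2 my_kprobe_event

    # hook a uprobe on a symbol of an application binary (hypothetical path)
    $ lttng enable-event --kernel \
          --userspace-probe=elf:/usr/bin/myapp:my_function my_uprobe_event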
But we have our own modules, basically to have our own ring buffer implementation and our own filtering facilities, and to produce traces in the Common Trace Format (CTF) rather than having a bespoke format for every tracer. I'd say the big difference with other projects is that we have a user space tracer. That's really the big difference. It allows you to instrument C, C++, Java, and Python applications. You can certainly use it as a very efficient logger, but it has a number of advantages beyond that. First, it's very fast; it's basically the same code as the kernel tracer. But it's also type safe: we don't cook strings, we keep the type information all the way down to the analysis, which makes the traces a lot simpler and more compact, and therefore more efficient. As for lttng-tools, it's really the control plane of the tracer: the control of the sessions and all that stuff.

A key point of the LTTng design is that basically everything is influenced by the nature and frequency of the information we want to save. Our users are mostly system developers, I would say. The events they are interested in are very high frequency: for applications, memory allocations and things like that; for kernel events, again, very high frequency, as Steven showed in the previous talk. So even though it's a user space tracer, it's not really targeting tracing as Jaeger or OpenTracing-based tracers would define it; with Jaeger, for instance, you would be interested in tracing requests in the order of milliseconds. For LTTng, we are interested in events at the microsecond or nanosecond granularity. So the focus is on low intrusiveness. As you can imagine, tracing that many events, we have to be quite efficient, and for that the ring buffer is really the cornerstone of LTTng. It's critical to performance. And there, as I may have mentioned, both the user and kernel space tracers share basically the same ring buffer code. They are lockless ring buffers that are allocated per CPU. They are lockless because we don't want to push back on the applications: if you are producing more events than we can consume, we're just going to drop events or overwrite the older ones; you can choose either policy. And those buffers are endlessly tunable: you can choose the memory footprint, you can choose per-UID or per-process buffer ownership, and we have a number of settings to accommodate real-time constraints. Roughly, that tuning looks like the sketch below.
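A minimal sketch of that kind of channel tuning through the CLI; the session, channel, and provider names are placeholders:

    $ lttng create my_session
    # per-UID buffers, 8 sub-buffers of 1 MiB each, overwrite the
    # oldest data when full (flight-recorder-style policy)
    $ lttng enable-channel --userspace --buffers-uid \
          --subbuf-size=1M --num-subbuf=8 --overwrite my_channel
    $ lttng enable-event --userspace --channel=my_channel 'my_app:*'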
So, ring buffer tracing is typically what people think about when they talk about tracing: saving an event to a ring buffer. Saying that, okay, this event occurred at this precise time on this CPU, and collecting some context associated with it, things like the UID or the user space call stack at that moment. You write that to the ring buffer, and then either you trace to memory without expecting to collect it, you just maintain it in memory, and when something interesting happens you capture a snapshot, like a flight recorder; or you stream to disk or over the network. We support all of those use cases. And that difference, the fact that you can lose events, is again a key difference between tracers and loggers.

All of this means that tracing is super cheap. The instrumentation in the code, I won't say it's free, but it's almost free when it's not in use: it basically amounts to a load and a correctly predicted branch. So not free, but very cheap. When it is enabled, I did some tests on a fairly old Xeon, just to give you a rough idea: if you trace the time, the ID of the event, and an integer payload, it's going to take roughly 150 nanoseconds. It's the same ballpark figure for both the user space tracer and the kernel tracer, and I have more benchmarks later in the slides, but it's something to keep in mind.

So it's good that it's efficient. It means you can add a lot of instrumentation points in your code; you don't have to think so hard about it. But the flip side is that it's very easy to enable too many events and end up with a lot of noise. I'd say most of the features we integrate in LTTng are there to mitigate that, to make it easier to filter your events. The first of those facilities is the event rule system, where basically we want to cut at the source. An event rule is a rule you can define with a pattern that the event name has to match. You can add filter expressions, which allow you to filter on the CPU ID or on the payload of the events, whether they are strings or integers or whatnot. You can add exclusions, you have log level filtering as you would find in most loggers, and more. Filter expressions, and I'm going to come back to them later because they're important, are converted to a custom bytecode that is executed at runtime against the events when they happen. So it's a very flexible filtering mechanism. And one thing that separates it from most loggers is that all of this is entirely dynamic: you never have to restart applications or the kernel. You can define new event rules, enable them, disable them, without restarting anything. From that standpoint it's very interesting.

Now, just to show you what an event rule actually looks like in the wild: the first one here, an enable-event command, will trace to a ring buffer all of the sync_file_range syscalls whose number of bytes is larger than a page. That's one use of the filtering facilities. And if you have instrumented your own applications, you can use basically the same event rules with the user space tracer, but you target the user space domain, and you can add exclusions. So here you have a pattern that's going to match all of the my_app_worker events, but maybe you have one spammy event that would introduce a lot of noise, so you can exclude it from the get-go. And then you can add a filter: if your events have a job_name field, you can filter on it so that you only get those that start with "oss_submit". And you can have a log level associated with that. As commands, those rules look roughly like the sketch below.
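A minimal sketch of those two event rules as CLI commands; the application event names, the job_name field, and the exact log level are assumptions reconstructed from the slide description:

    # kernel domain: only sync_file_range syscalls larger than a page
    $ lttng enable-event --kernel --syscall sync_file_range \
          --filter='nbytes > 4096'

    # user space domain: match a family of events, exclude a spammy one,
    # filter on a payload field, and apply a log level
    $ lttng enable-event --userspace 'my_app_worker_*' \
          --exclude='my_app_worker_spammy' \
          --filter='job_name == "oss_submit*"' \
          --loglevel=TRACE_INFO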
Maybe to introduce the new features, I want to talk about how people use tracing in the wild, or at least the people we interact with at EfficiOS. I'd say there are two camps: there's the debugging side, and there's monitoring. When you're debugging, it's pretty simple. You can hopefully reproduce your problem, and you can typically afford to have storage on the target; you can afford to degrade performance a bit for that single machine if you're in the cloud. So that's fairly simple, and that's something that's out there for all tracers: you select events, trace to files, and, as Steven mentioned in the previous talk, use things like trace-cmd to look at the raw events themselves.

But that's not something that's going to be very useful if you want to monitor, unfortunately. The users we have that run LTTng in production all the time will carefully choose the events and event rules they care about, trace to in-memory ring buffers, and rarely write to disk. And if they do write to disk, it's going to be very, very high level events, basically something you would find in a log, so that they get some context if a problem occurs, and then they capture a snapshot. That gives you maybe weeks of trace data at a very high level, and then the last four or five seconds at a very detailed level. It's easy to deploy, because collecting a snapshot is not harder than collecting core dumps in production, for instance: you can have a job that runs every once in a while and does that.

And what we're seeing more and more is people that mix and match. Like I mentioned, they keep a high-level trace over a long period and take snapshots, but there are also some more involved cases, I would say. People use trace rotation, which is like log rotation, and ship those trace archives off-site for analysis, or process them on the target, depending on their use case. Another thing we've seen is snapshot-based profiling, where people take periodic snapshots, just the in-memory ring buffers, and analyze those; over thousands of machines, that gives you a very good idea of what's going on everywhere, and you can start to extract patterns from it. Those workflows look roughly like the sketch below.
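A minimal sketch of the flight-recorder and rotation workflows; the session name and the thresholds are placeholders:

    # flight-recorder mode: trace to in-memory ring buffers only
    $ lttng create prod --snapshot
    $ lttng enable-event --kernel --syscall --all
    $ lttng start
    # later, when a problem is detected, dump the ring buffers to disk
    $ lttng snapshot record

    # or the rotation-based workflow: produce a self-contained trace
    # archive every 16 MiB or every 60 seconds, whichever comes first
    $ lttng enable-rotation --session=prod --size=16M --timer=60s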
So you can imagine that one of the key limitations is that this setup is complex. When you start to log-rotate huge trace files, you need to account for storage space, and you need to detect the problems yourself, which is often a big part of the battle. And if you go for the other deployment use cases, you have to write analyses and run them on the target, and then you're slowing down the target. So these are all things you have to manage that can get complicated.

The big feedback we got over the last few years is that it would be nice to use the same filtering facilities that LTTng has to control the tracing itself. Basically, something that would let you listen for a very rare event, and when it occurs, because it's indicative of a problem, start tracing to the ring buffers and capture information for a certain amount of time, or perform some other custom actions. A fair number of users had devised ways to do this: they would basically re-implement LTTng's filtering next to the instrumentation, and when they detected some condition, they would use the LTTng control API to take a snapshot or start tracing or whatnot. It existed with different levels of maturity, but it's something people were doing. So we started to work on triggers.

Triggers aren't really a new concept in LTTng at this point; the first release with triggers was 2.10, released in 2017. What you have to know is that a trigger is just a way to associate a condition with an action. Initially, the scope of the feature was very small: the goal was just to do traffic shaping on trace data. Basically, you would see, okay, my buffers are getting full more quickly than I can consume them, so I have too much stuff enabled, so you start to disable less important event rules; and when the situation dies down a bit, you re-enable them selectively, all automated by a monitoring application. That's what this enabled. We already had what is in the 2.13 release in mind when we did that, but it was a very good way to do the groundwork for the feature and get it out there in production.

In 2.11, we introduced new conditions, again to satisfy a very concrete use case. We introduced trace rotations at that point, and people wanted to schedule rotations every couple of megabytes or every couple of seconds, and then they wanted to know: okay, that trace archive is ready on that server or in this folder, and I want to run an analysis, or archive it, or do live analysis, or simply keep five of the rotations on the system and discard the others.

In 2.13, this is where the trigger feature got to where we wanted it to be. We added the event-rule-matches condition, so now triggers can fire when an event rule matches an event, much like what you can do with ring buffer tracing; you have all the same filtering capabilities. We had customers that wanted to start, stop, rotate, record a snapshot, or be notified when an event occurred, but some others wanted to know: okay, this event occurred, but I want to get the payload in the notification. So if you imagine two instrumented applications talking over the network and you get an error, you can actually get the peer ID involved in the transaction, and that allows you to report to a monitoring app: okay, this peer is sending me malformed data, I want to take a snapshot, I want to understand what's going on right now.

So I'm going to show you a demo, and I'm running the next release, so I hope everything's going to be all right. In this demo, let's say we want to monitor for openat syscalls, and we want to take a snapshot whenever a process is denied the open, so it gets EACCES. I create a session, like you always could, in snapshot (flight recorder) mode. I add a channel named my_channel; hope this is big enough, I can bump the font a bit. I add context that will be collected with every syscall: in this instance, just for the demo, I'm taking procname, the process name, but you could take the PID, the UID, the GID, the namespaces involved with that process. Let's keep things simple. And I'm enabling all syscalls to be traced to my ring buffer, which is in memory; as you see here, I'm enabling all syscalls in the kernel domain. I start my tracing.

Then the new part is that you can define a trigger. The trigger here, I'm going to use the mouse to highlight it, but it's very simple: I can give it a name, here it's going to be openat-eacces. Then I define a condition. The condition is event-rule-matches: we are interested in kernel syscall exit events, we want to select just the openat syscall, and then I add a filter, because I'm only interested in those that return -13 (EACCES). And I want to capture a part of the context, which is procname. When that occurs, I want to notify external applications that are interested, and I want to record a snapshot. We can use the lttng listen command to wait for that to happen. And I'm going to be naughty and try to touch /etc/passwd. Okay, so I get "permission denied", as you would expect, and my listener application got the notification with the procname "touch". So that's already good. I can stop lttng listen, look in the folder, and I have a snapshot that was captured; that's just the content of the in-memory ring buffers. And I can read the snapshot with Babeltrace. It's a bit compact on the screen, or dense rather, but you can see that we indeed had an openat syscall for /etc/passwd, and we see that it was denied with -13. But we can also see the other syscalls that occurred before it and a bit after. So that's it for that part of the demo; the commands are sketched below.
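A consolidated sketch of the demo's commands, reconstructed from the narration; the exact add-trigger flags follow the 2.13 syntax and should be checked against lttng-add-trigger(1):

    $ lttng create demo --snapshot
    $ lttng enable-channel --kernel my_channel
    $ lttng add-context --kernel --channel=my_channel --type=procname
    $ lttng enable-event --kernel --channel=my_channel --syscall --all
    $ lttng start

    # fire when an openat syscall exits with -13 (EACCES); capture the
    # process name, notify listeners, and record a snapshot
    $ lttng add-trigger --name=openat-eacces \
          --condition=event-rule-matches --type=kernel:syscall:exit \
          --name=openat --filter='ret == -13' \
          --capture='$ctx.procname' \
          --action=notify \
          --action=snapshot-session demo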
Capturing values was really, I would say, the most challenging part of that feature. In terms of implementation, we already had the filtering virtual machine in place for both the user space and kernel tracers, because event filtering compiles down to a custom bytecode. And this bytecode, this program rather, is linked against the tracepoint's fields when it is enabled, so that only happens once, when the tracepoint is turned on. At runtime, it's just populating the interpreter stack and running the bytecode, which is typically going to be a handful of instructions, and the result of the program is that we either accept or reject the event.

Thankfully, the VM already had to deal with dynamic typing, because we support variants. As you can see here, when I define an event rule, I'm not implying a type. Here I'm filtering for foo being larger than 42, but we know nothing yet about foo, because the rule can match multiple events: it can match event A, where foo happens to be an integer, or event B, where it happens to be a float. When we actually attach the bytecode, we can resolve that, specialize the bytecode, and be clever about it. But there are situations where we can't do that, like when you use variants, or when you use application contexts, which are also dynamically typed.

So what we did is define a new type of bytecode program, a capture, which basically only changes the return value of the program to be the location of a payload. What we do with that payload is serialize it as a msgpack message, after doing the appropriate permission checks and, for the kernel tracer, the copies from user space and all that. That in turn is sent to the session daemon, and from there it's basically a pub/sub system: you can have a number of listeners for a trigger, we get the notification, and we dispatch it down to the various listeners. A capture is expressed in the event rule itself, as in the sketch below.
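A minimal sketch of a trigger with a payload capture, loosely based on the networked-applications example from the talk; the provider and event names, and the field names (peer_id, foo), are placeholders:

    # notify listeners with the peer_id payload field whenever an error
    # event matches; 'foo > 42' is dynamically typed and resolved per
    # matched event
    $ lttng add-trigger --name=peer-error \
          --condition=event-rule-matches --type=user \
          --name='net_app:*_error' --filter='foo > 42' \
          --capture=peer_id --capture='$ctx.procname' \
          --action=notify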
The way the messages are sent varies between the kernel and user space tracers. It's just an implementation detail, but in terms of performance it has a pretty big impact. For the kernel tracer, at the instrumentation site we're basically just writing to a ring buffer, but then we signal to the session daemon that there's something to consume right away, rather than batching notifications together; we're optimizing for latency rather than throughput. So at the instrumentation site it's not so heavy. At the instrumentation site of the user space tracer, it's a non-blocking write to a pipe, so it's actually a syscall. You can imagine that this is not a replacement for ring buffer tracing: I would not suggest that you just add triggers and wait for events to come in through the notification system. In the use cases we have right now, people are being so aggressive with the filtering that they get an event every few minutes, so throughput is really not a problem. I would expect that we're going to get use cases where people want more throughput, and then we have a lot of gains left on the table.

Aggregation maps, now: that's an upcoming feature for 2.14, which we hope to release by the end of the year. It basically addresses the other limitations of ring buffer tracing, the main one being memory overhead, both in terms of bandwidth and space. Tracing to a ring buffer is cheap, but it's not free, and there are harsh situations where you just can't afford it; even at 150 nanoseconds per event, it's going to be too much. And there are architectures out there where tracing is not as efficient. For instance, on ARM64, until recently, getting the current CPU number required a full system call, and a full system call on the fast path was killing our performance. With rseq in glibc 2.35, that now amounts basically to a read from memory, so that got faster. But still, there are architectures where that's not going to be the case. So there are costs in terms of runtime, but there are also other undesirable effects: if you're tracing to disk, the I/O is going to show up and become a problem very rapidly, and the same goes for the network.

So there are certainly trade-offs to explore, and aggregation is one of them. When you trace, and really I should say record, you are interested in the precise order of events: you want to know the payload, on which CPU it occurred, the order and timing relative to other machines and other applications, things like that. But maybe you don't need all of that precision, or maybe you can do away with it. With aggregation, we're basically counting event rule matches, and we want to reuse our existing event rule filtering to drive the aggregation. You can, of course, do something very coarse, like counting all syscalls, but that's probably not going to be very useful. But you can count, like in this instance, the kmem frees or the entries into recvmsg, and you could again filter on the payloads; you could count open syscalls from UID 100, and so on and so forth. We also had people who were, let's say, unknowingly doing aggregation, or who could have done what they do with aggregation: they were capturing long traces, but at the end of the day they were just counting events. They want to know, how many times did I get this event during a run on the CI? What is the frequency of those errors over time? In those cases, you're better off doing the aggregation in place.

So we have aggregation maps, which are basically per-CPU arrays of counters. They let you use named keys to address the various slots in the array. They have a configurable width, 32 or 64 bits, so that's a trade-off you can explore, and a configurable size to bound the size of the map, which is just going to be a number of counters. You can track overflows, so a counter wrapping around doesn't go unnoticed. This is not a new concept; it's something that has existed in the kernel in the form of the BPF per-CPU array map type. It's very useful there, but we wanted to make it available to user space applications and keep it cheap. From a usability standpoint, it's integrated in LTTng the same way the ring buffers are: if you remember my previous example, I had a ring buffer that I added to a session; now I'm just adding a map, it's basically the same concept. And then you can add a trigger: you specify, again, a condition and an action, increment value, and you target a specific key, roughly as in the hypothetical sketch below.
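A hypothetical sketch of the map workflow. Since 2.14 was unreleased at the time of the talk, every map-related flag here (add-map, --bitness, --max-key-count, incr-value, --key) is an assumption reconstructed from the demo, not confirmed syntax:

    $ lttng create count_session
    # hypothetical: a bounded, 64-bit-wide per-CPU map of counters
    $ lttng add-map --session=count_session --kernel \
          --bitness=64 --max-key-count=1024 my_map
    $ lttng start

    # hypothetical: bump a named counter on each matching connect syscall
    $ lttng add-trigger --name=connect-by-user \
          --condition=event-rule-matches --type=kernel:syscall:entry \
          --name=connect --filter='$ctx.uid == 100' \
          --action=incr-value --session=count_session \
          --map=my_map --key=new-connections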
So in this case, I would want to count all of the new connections made by a particular user. As expected, it's a lot faster than tracing to a ring buffer, because obviously you're not saving the payload of events. We were happy to see that. For those benchmarks, which are on GitHub because we were interested in all the details of the runs, tracing in user space goes from 116 nanoseconds to a ring buffer down to 43 nanoseconds to a map, so roughly three times faster. That's the kind of difference that makes tracing possible where it was not for some of our customers. In terms of performance for the kernel version of those data structures, it's the same order of magnitude. As you can see, it's a bit faster than the eBPF equivalent, and that's not surprising: there's just less going on on the fast path. So we were happy to see that we're in the right ballpark. So yeah, just to give you an idea: it's not just a bit faster, it's a lot faster, even though we're talking about nanoseconds.

So, another demo of that feature; I can make some space. Again, I'm creating a session, just a regular, normal session. I'm adding a map, as I said in the previous example, and I start tracing. Then I can do whatever I want: if I add new event rules from this point on that target that map, we're going to get counts of events. Counting all syscalls on my machine right now is as simple as this; maybe the colors are not so great for this projector, but basically I'm defining a condition, event-rule-matches, that matches every event of type kernel syscall entry, and the action is increment value, targeting one key, which is going to be "syscall". Then I can watch my map, and you see that we have roughly 10,000 syscalls per second accumulating there on an idle machine, which was surprising to me.

But we can also count the errors of the syscalls. So I'm interested in all the syscall exit events that return less than zero, and I get another key with a static name, "syscall error count". I clear the contents of the map, so we start again from zero and see both rules together, and I can watch the map. You see here that roughly 7 to 10% of the syscalls on my machine return an error. Okay, fair enough. Now I can also do pattern matching: I add another trigger that matches all the syscall exits of the msg family of syscalls, so recvmsg, sendmsg, and for each of those, I create a new key using the event's name. So every syscall gets its own error count with this rule. And there's no hashing that takes place on the fast path: the allocation of the key is done one time by the session daemon, and at the actual tracepoint site it's just an offset into the map. So it's very cheap, even though you have all that flexibility. I clear again, and then you can see all of the values of the map on my system, and you can see that most of the errors in that family of syscalls come from recvmsg; well, all of the errors. So that's it for that demo.

The other demo I have is histograms. We don't support histograms natively yet; that's something we want to do. But you can define the bins of your histogram using the filter mechanism. It's going to be a bit less efficient than it could be, but still not that bad. You add a map and you start your session, and the trick is sketched below.
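A hypothetical sketch of the bins-via-filters trick the script uses; again, the map-related flags are assumptions, and the event and field names (my_app:request_done, latency) are placeholders. Zero-padding the keys would also sidestep the lexicographic-sorting issue mentioned below:

    # one trigger per histogram bin, with a range filter per bucket
    $ for lo in 0 1024 2048 3072; do
          hi=$((lo + 1024))
          lttng add-trigger --name="bin-$lo" \
              --condition=event-rule-matches --type=user \
              --name='my_app:request_done' \
              --filter="latency >= $lo && latency < $hi" \
              --action=incr-value --session=hist_session \
              --map=hist_map --key="$(printf '%08d' "$lo")-$hi"
      done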
Then my script creates basically a trigger per bucket of the histogram, with the various ranges that you see here. Then I can look at the map. That's not ideal, because I'm sorting the buckets lexicographically; is that the right term in English? But we can do something nicer and send the values over a WebSocket to a Chart.js plot, and basically you can see it live. That's the kind of thing we want to make easy to build. We can consume those values using the LTTng control API, and we hope to provide Python bindings for that in the near future.

For the future, because I'm short on time: we want to get this released, well, with those features for the moment. But we're looking at native histogram support; that's really the big thing that's missing right now. We want to allow decrementing the values, but we also want to let users use the payload of the events to affect the counters. If you can do that, you can track the memory that you're using, things like that; it opens up a lot of new ideas. We also want to make available the size the event would have had if it had been recorded to a ring buffer. That would allow you to do dry runs before you enable tracing to ring buffers: you would know, okay, those event rules would consume two megabytes per second of bandwidth, before you ever enable tracing. Especially for production use cases, that's very useful, and something people keep asking about. The other thing is that we want to make lttng-ust rseq-aware. We know that we can eliminate some lock-prefixed operations from the fast path, so we expect some very good gains, although I have obviously not benchmarked that yet.

So yeah, that's all I have. You can visit our website, and I've made all the benchmarks available on my GitHub, because I know people have very precise questions. I'll take your questions if you have any. All right, well, I'll talk to you after the talk. Thank you.