Hi, my name is Tom Zanussi. I'm a software engineer at Intel, working in a new group called Safety Critical Systems. Today I want to talk about an enhancement I've created for the Linux trace event subsystem that adds inter-event capabilities, such as calculating latencies.

Just a quick overview of what I'll go over today. I'll start with a quick overview of trace events themselves. Then I'll jump right into a real-world latency example, which is actually the motivation for this enhancement. Then I'll dive a little bit into the design and interface, and conclude with some future plans and, I hope, time for questions.

So Linux has a bunch of places in the kernel where developers and administrators can gather data about what's going on in the kernel behind the scenes, under the covers. Those things are called trace events, and they're normally disabled. They don't add any overhead at all to the kernel when they're disabled. But you can enable them, and when you do, the data goes to a trace buffer, and even in that case the overhead should be pretty minimal.

There are hundreds of these things scattered throughout the kernel, and to make things manageable they're grouped into subsystems. This is a small subset of them. If you go and look in debugfs, in the tracing/events subdirectory of tracefs, you'll see the whole set. Each of those subsystems is actually a subdirectory in itself, and it in turn contains a whole bunch of other subdirectories that correspond to the trace events themselves. Within each of those trace event subdirectories is a set of files used to describe and control the individual trace event.

Here's an example. For this talk, the most important file is the format file, which basically describes the event data that gets logged whenever you enable an event. This example is the kmalloc event, which lives in the kmem subsystem, and I've highlighted the individual fields here. Basically, this is exactly what will get logged to the buffer when you enable the event and the kernel hits that tracepoint.

So, like I said, they can be enabled. Here's what that looks like from the command line — essentially, it's a command-line interface. To enable an event, you echo 1 to its enable file. In this case, we're actually enabling a whole subsystem, the scheduler subsystem, not just one event. Once you've done that, you look in the trace file, which is basically how you view the ftrace buffer containing the events, and you'll see a whole bunch of records corresponding to the events that were logged while they were enabled. Here we have three events of two different event types, sched_switch and sched_wakeup. Again, I've highlighted the fields in red just to show they directly correspond to the fields in the format file.

And this is just a slightly more graphic depiction of that. On the left we have four events, all enabled and logging one or more records into the trace buffer — and this is just a subset of the trace buffer. So that's the basic operation of the trace event subsystem: you enable an event, and when it triggers, it sends a record to the trace buffer.

But that's not the only thing that can happen when an event is triggered. It can actually trigger other things — for instance, dumping a stack trace, so it'll show you the call chain leading up to whatever event you're interested in. Another action is taking a snapshot.
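In command-line terms, and assuming tracefs is mounted in the usual place under debugfs, those steps look roughly like this:

    # assuming tracefs is mounted at /sys/kernel/debug/tracing
    cd /sys/kernel/debug/tracing

    # enable the whole scheduler subsystem, then view the logged records
    echo 1 > events/sched/enable
    cat trace

    # non-logging trigger actions on a single event: dump a stack trace,
    # or take a snapshot of the trace buffer, whenever the event hits
    echo 'stacktrace' > events/sched/sched_switch/trigger
    echo 'snapshot' > events/sched/sched_switch/trigger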
Essentially, what you can do is enable every single event in the system, and when the event of interest hits, it takes a snapshot of the trace buffer. At that point, you can look at the snapshot and get a really detailed idea of what led up to that event.

For the purposes of this talk, we're mainly interested in the hist trigger action — "hist" stands for histogram here. What a hist trigger does is this: the event still goes to the trace buffer, but it also goes to a user-defined hash table, and the user defines that hash table using, again, the fields in the format file. So essentially, it's a way of consolidating or aggregating all the data from the event into a table.

To give you an idea, using an example: suppose you wanted to see, for every process on the system, how many bytes each process was asking for using the read syscall, and how many read syscalls it was making. You could do that by setting up a hist trigger on a particular event — in this case, sys_enter_read, which is basically the entry of the read syscall. We'll set it up using the process ID as the key, and the count field as the value; making it a value just means you keep a running sum of that field — the count field, in this case. So I set that up, let it run for a while, and then at any time I can read the hist file in the same event directory. You should see output like this: we have a line for each process on the system, and on the right here we have the total number of bytes the read syscalls asked for while we were running this histogram. In the middle, the hit count just means how many times this event was hit, and so that tells us how many read syscalls produced that number of bytes.

Graphically, it looks something like this. We have our single event enabled, and it's logging a whole bunch of records into the trace buffer at the same time — or maybe it's not even doing that; maybe it's going directly to the histogram. Either way, it's creating a histogram using the fields it was set up with — in this case the process ID — and keeping a running total of the number of bytes read. So basically, it's condensing this whole huge trace file into a more manageable table.

OK, so that shows what you can do with a histogram on a single event. But one thing you can't do with a single event is latency — you need two events for that. And typically, you also need some code like this associated with each event in order to calculate the latency. So we have our two events here, sched_wakeup and sched_switch. All the pseudocode is doing is taking the current timestamp and saving it into an associative array, or hash table — in this case, the hash table is named timestamp, and it's keyed on process ID, the wakeup pid. Then at a later time, the sched_switch event comes in, and it looks for the corresponding timestamp in the hash table. If it finds it — and only if it finds it — it does the calculation: it computes the latency by taking the current timestamp and subtracting the one it found. And now you have a latency. Typically, you'd also want to save that for later, again probably using a hash table — in this case, a wakeup_latency hash table. And then it invalidates the timestamp, so it doesn't get reused.
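Spelled out, that per-event pseudocode looks something like this — a minimal sketch as a shell fragment, with bash associative arrays standing in for the kernel-side hash tables (the function names and units are mine, just for illustration):

    #!/bin/bash
    declare -A timestamp        # saved start times, keyed on pid
    declare -A wakeup_latency   # computed latencies, keyed on pid

    now_us() { echo $(( $(date +%s%N) / 1000 )); }

    on_sched_wakeup() {         # $1 = pid of the woken process
        timestamp[$1]=$(now_us)
    }

    on_sched_switch() {         # $1 = pid of the process switched in
        [ -n "${timestamp[$1]:-}" ] || return   # only if we saw the wakeup
        wakeup_latency[$1]=$(( $(now_us) - timestamp[$1] ))
        unset "timestamp[$1]"   # invalidate, so it doesn't get reused
    }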
And I chose these particular events on purpose, sched_wakeup and sched_switch, because they directly correspond to a real-world latency that I actually want to be able to use trace events for, from the RT patch set. There's something in the RT patch set called latency-hist, which creates histograms of various latencies, and the one I'm talking about here — one of the main latencies in that patch set — is called wakeup latency.

Associated with the RT patch set is a test program that's used to produce and test this particular latency, among others, called cyclictest. It normally creates a whole bunch of threads doing the same thing; what I have depicted here is just one of those threads. What it does is make a nanosleep system call, asking the kernel to put it to sleep and to please wake it up some number of microseconds later. The entry to that system call is this event here. Then it sleeps for a little while, and then the timer fires and tries to wake it up. That wakeup corresponds to this sched_wakeup event. But it doesn't immediately start executing again — it's been put to sleep, and it needs to get the CPU back in order to actually start executing; basically, the scheduler needs to schedule it back in. That event is denoted by the sched_switch event. So the particular latency we're interested in measuring is the latency between those two events — the wakeup latency, between the time the process was awakened and the time it was actually put back on the CPU.

Just to finish off the example: now it's back on the CPU, but it's not really doing any useful work yet, because it still needs to exit the system call before it can get back over here and do its useful work. That's the exit of the nanosleep system call, and the latency between sched_switch and the nanosleep exit is called switch-time latency. The RT patch set also does combined latencies, which are combinations of these latencies, and there's another one that's the combination of the two, called wakeup-switch latency. We should be able to do that as well with the enhancement we're talking about.

So that's what we want to do: actually calculate that latency and do a histogram on it, using only the trace event subsystem. The problem is that the trace event subsystem doesn't do latencies. Up to now, you could only do strictly intra-event histograms and so on, while latencies are inherently inter-event. There are some external tools that can generically calculate latencies using the trace event subsystem, but they require extra languages and runtimes, and that's not feasible for a lot of embedded systems. My claim — and what I've actually implemented — is that we shouldn't have to write code to do latencies; we should be able to do them using just the trace event subsystem. Basically, what it boils down to is encapsulating that pseudocode I showed you before and making it parameterizable by the user.

And like I said, that's what I did: I wrote a patch set to do this, and I posted it a couple of weeks ago to LKML and to the linux-rt-users list. It got some good reviews, and I'm working on another version, which should be out soon. To create that patch set, I broke the problem down into three sub-problems: variables, synthetic events, and actions.
And I'll go over those in more detail in the following slides, but just quickly, here's what each one is about.

Variables: with a latency, you have a quantity here in one event, and then a quantity at a later time in another event, and you need to keep track of those somehow. That basically means you need some sort of variable support. For the patch set, I basically did the minimum I had to do to solve the problem.

Synthetic events: the trace event subsystem is really geared toward single events — it's very single-event-centric. When you have something like a latency that's derived from multiple events, there's really no natural place to put it in that scheme. So what I did to address that is create a synthetic event, which does fit into the scheme; I'll explain that a little later.

And then actions: the variables and synthetic events are basically the data-structure side of things. Actions are the thing that actually injects those quantities into the event subsystem so they can be used — logged into the buffer, participating in histograms, and so on.

So taken together, that's what I implemented. The way I look at it, it's sort of a glue layer for connecting events together using what you can think of as logical wires. No coding involved — that's the goal, anyway.

Going a little deeper into those three sub-problems, starting with variables. Latency is really a pretty simple thing: save a start value at event one, then retrieve the start value at event two and subtract it from the end value at event two. One of the ways that's commonly done is using a hash table, and in the context of the trace event subsystem, we have ready-made hash tables in the form of hist triggers, so we should make use of those.

This is basically a diagram of the mechanism used to calculate a latency using hist triggers. Here we have our two events, sched_wakeup and sched_switch, and associated with each of those events is a hash table — basically, a hist trigger has been put on both of them. So the first event, sched_wakeup, comes in and finds its slot in the hist trigger's hash table. It takes a timestamp and saves it into a variable associated with that entry, ts0. Similarly, later on, sched_switch comes along and does the same thing with the ts1 variable. At that point, it has the data it needs to calculate the latency. So it goes and looks up the value from the first event, finds the ts0 variable, grabs it, does the calculation, and then saves the result into yet another variable on sched_switch for use later. That's pretty much the whole mechanism for calculating the latency itself.

And this is what it looks like from the command-line interface perspective. You can create a variable by picking a name for it and just assigning a field to it — a value or a key field. Taking our first event, sched_wakeup, we just assign the current timestamp to this variable, ts0. Then later on, in the sched_switch event, we grab that variable — using this dollar-sign notation to denote that it's a variable reference — and do the calculation. And again, we save the result into yet another variable here, wakeup_lat. That's pretty much it.
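As commands — a sketch, assuming the syntax from the posted patch set, with tracefs mounted at /sys/kernel/debug/tracing — those two trigger writes look something like this:

    cd /sys/kernel/debug/tracing

    # first event: save the timestamp (and, for illustration, a key)
    # into the per-entry variables ts0 and woken_pid
    echo 'hist:keys=pid:ts0=common_timestamp:woken_pid=pid' >> events/sched/sched_wakeup/trigger

    # second event: reference $ts0 to compute the latency, and save it
    # into another variable, wakeup_lat
    echo 'hist:keys=next_pid:wakeup_lat=common_timestamp-$ts0' >> events/sched/sched_switch/trigger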
For illustration purposes, I also did the same thing for a key, just to show that you can also save a key in a variable — woken_pid is that variable right here. Just one note on common_timestamp: if you go into the current trace event subsystem and look in the format files, you won't see common_timestamp anywhere. That's part of the patch set I submitted. Basically, it's just the underlying ring-buffer timestamp, exposed as a field available to every event, called common_timestamp.

OK, so that's variables. Next, synthetic events. As I mentioned before, the trace event subsystem is very event-centric. You enable an event by writing to a file called enable in the individual event's subdirectory. You set up a histogram by writing the histogram specification to a per-event file called trigger. And you read the output using a per-event file called hist. So if you have a quantity like a latency that you want to display in a histogram, it doesn't really make sense to hang it off either of the contributing events — neither is a natural home for it.

What I implemented is something called synthetic events. They're completely user-defined. So the user defines a synthetic event, and essentially it's a container for quantities from other events — but it exposes those through the normal event interface. The user creates a synthetic event, with a user-defined name, and it ends up in a special subsystem called synthetic. You can go in there and see all the same event files — hist, trigger, enable, et cetera. It's actually a full-fledged event, just like any other event; it just happens to be derived from other events. Pardon? Right, yeah — it's stored in the trace buffer. Right, if you enable it, it appears in the trace buffer.

This is what it looks like a little more graphically. We have our two real events, sched_wakeup and sched_switch, and we have our third, synthetic event here, called wakeup_latency, that the user created. You can see that it references variables in the real events: wakeup_lat here is wakeup_lat there, and woken_pid here is woken_pid there. And because it is a full-fledged event, you can create a histogram on it just like you would on the other events — and now you have a histogram of that latency, the derived quantity you created the synthetic event for. This slide is basically just here to show that this histogram really is derived from those other two event histograms — in this case, it's the difference between them that produced the latencies.

And this is what it looks like from the interface perspective. To create a synthetic event, you give it a name and echo that name, along with the fields and their types, to the synthetic_events file — a special file I created for this purpose. If it's successful, you can read the synthetic_events file back, and you should see your event, along with any other events that were added before. And just to point out again: because it is a full-fledged, real event, you can create a histogram for it just like you would for any other event, or enable it and see it in the trace buffer, just like any other event.

OK, so at this point we have the variables, and we have a container — the synthetic event — that we've created. We can see it in the debugfs filesystem; we can even create a histogram on it, enable it, whatever.
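In command form — again a sketch, assuming the posted patch set's syntax — that looks something like:

    cd /sys/kernel/debug/tracing

    # define the synthetic event: a name, plus typed fields
    echo 'wakeup_latency u64 lat; pid_t pid' >> synthetic_events

    # read it back to verify, and note it's a full-fledged event
    cat synthetic_events
    ls events/synthetic/wakeup_latency

    # you can put a histogram on it like any other event
    echo 'hist:keys=pid,lat:sort=pid,lat' >> events/synthetic/wakeup_latency/trigger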
But even after doing all that, you will never actually see an event in either of those places, because up to this point, nobody has told anybody to actually send an event. So we need to give the user a way to generate this synthetic event we just created. The way I came up with to do that is something called an action, and it's just a phrase like this: basically onmatch, then the matching event, then the name of the synthetic event, and then the parameters you want to use at that point to send that synthetic event. It looks like a function call, but it's really just saying "trace this event."

So just in context: here's the sched_switch event, here's the wakeup latency we just calculated from before, and then we just add this phrase here. At this point, we're live, and all the great things you see with the trace buffer and the histogram triggers are working.

The idea of an action is a more generic thing, though — I wanted to be able to do other things besides just trace a synthetic event. One of those other things comes specifically from the RT patch set's latency-hist code: when it sees a latency spike, it saves the previous pid, the next pid, et cetera, so you can go back and look — maybe something about the previous task contributed to the latency, who knows. This onmax action, which you can feed any variable to, basically implements that. Yeah, right — in fact, on one of my last slides I mention that's one of the things I want to do: a snapshot. And a stack trace.

I'm not going to go over this slide — it's really dense, and it's basically just concatenating the steps we just covered. I just wanted to keep it all on one slide, so I could say I was able to implement the RT patch set's wakeup latency histogram on one slide. Just to recap what goes on: you create the synthetic event; at the first event, you save the timestamp; at the second event, you compute the latency using that saved timestamp and the new one; and then you generate your synthetic event. That's all there is to it — and of course, you create a histogram on it to actually see the output.

And this is actual output from a cyclictest run. You can see there's a block for each thread — I think I did two threads. We'll just look at the 2521 thread here. We can see we have latencies from 1 to 26 microseconds, and over here on this side, exactly how many of each latency we had. So we can see a sort of peak right here at two and three. Pardon? Did you have a load? No. Is that idle? Yeah, that's idle.

So that's pretty much it as far as the enhancement goes. I hope I've been able to convey that I think it's a useful, sort of natural addition. It's probably one of the only remaining things that I would really want to do with hist triggers. But basically, I think with this addition, you should be able to build any straightforward latency application. Obviously, latency-hist is the original reason and the target for it, but you should also be able to do any sort of latency application, without any extra code or any extra logic.
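For reference, here's that whole one-slide recipe as a single command sequence — again a sketch, assuming the posted patch set's syntax:

    cd /sys/kernel/debug/tracing

    # 1. create the synthetic event
    echo 'wakeup_latency u64 lat; pid_t pid' >> synthetic_events

    # 2. first event: save the timestamp into ts0
    echo 'hist:keys=pid:ts0=common_timestamp' >> events/sched/sched_wakeup/trigger

    # 3. second event: compute the latency, then use onmatch() to
    #    generate the synthetic event with it
    echo 'hist:keys=next_pid:wakeup_lat=common_timestamp-$ts0:onmatch(sched.sched_wakeup).wakeup_latency($wakeup_lat,next_pid)' >> events/sched/sched_switch/trigger

    # 4. histogram on the synthetic event, then read the output
    echo 'hist:keys=pid,lat:sort=pid,lat' >> events/synthetic/wakeup_latency/trigger
    cat events/synthetic/wakeup_latency/hist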
And the other point is that we just happen to be using common_timestamp here. You can use any field in the format files — any field in the trace event subsystem — so you could create a histogram using any of those fields, not just the timestamp, if you can frame your problem that way.

As far as future plans, I don't really have many. I really don't think I'll do much more with hist triggers themselves — I think this almost reaches the breaking point of what you can do with a single-line command-line interface, and I really wouldn't want to add any more to it along those lines. But, like I mentioned before — or Steve mentioned before — I think it would be really useful to be able to create an onmax snapshot action. Basically, you could have the whole trace event subsystem enabled, and whenever you hit a latency spike, it creates a snapshot, and that gives you a huge amount of detail about what led up to the latency, regardless of how much overhead the tracing itself might cause. I think that would be a really useful thing — it's basically a superset of the onmax I showed you earlier, which just saved the previous and next pid. Similarly, an onmax stack trace: it's always useful to be able to see the call chain that led to something, so seeing the call chain that led to the latency spike might be useful.

And then, beyond that, just applications on top of this — again, implementing the rest of the latency-hist patch set using it. And then also integration with trace-cmd. During the review of the initial RFC patch set, I mentioned that it's pretty indispensable to have some way of listing all the triggers and all the variables — it's kind of impossible to deal with it otherwise — and I was going to add an enhancement to do that, some kind of dedicated pseudo-filesystem or something. And Steve mentioned that later versions of trace-cmd already do this. So that means I don't have to implement it — if I integrate with trace-cmd, then my work is already done.

That's all I had. Thanks for coming to the talk, and if there are any questions, I'll be happy to take them.

Right, yeah, that would be great. That's actually supported — I actually did that, and I ran into a problem with recursive events; I'm still trying to figure that one out. But yeah, that's the way it should really work. Right, yeah, that's a great idea — basically some arithmetic between onmax values. Well, now it's about twice as long after your comments. Yeah, that's good; that makes sense — it would be more consistent with the shell and things like that, too. Karen? Right — because you could combine the wakeup latency with a read syscall, and it would be meaningless. But, right — then you start getting into how complex you want it, because then you're kind of going into the area of the other tools I was talking about, which actually let you do that. So yeah, that sounds pretty reasonable. Right — but then once you have that, you're going to add something on top of that, and whatever. Yeah, right.
Yeah, I think now I'm starting to hear a common theme — I was actually talking to Steve about this yesterday. What I do now is I have a shell script that basically implements a function like what you're talking about, or these other higher-level latencies and so on. So I think what's needed is a higher-level grouping, so you can deal with a higher-level latency as a single unit, instead of putting it together piece by piece each time. Yeah, I'm sure it'll be a lot easier when we can do that. Any other questions? Okay, well, then I think that's it. Thanks for coming.