Okay, I guess we can start. Hi, my name is Tom Zanussi. I'm a software engineer at Intel, and today we're going to be talking about measuring and summarizing latencies using the trace event subsystem. I'll start off with a quick overview of trace events for people who may not be that familiar with them. Then I'll jump right into the functionality implemented by a patch set, the inter-event patch set, that I posted upstream (I think we're on version nine now), which implements inter-event quantities, latency being the most important of those. I'll start with basic latency calculations and what you can do with them, then move on to latency histograms and another new feature this patch set implements, synthetic events, and how those enable the latency histograms. Then I'll talk about tracking objects and their latencies, using a couple of examples, and finally about latency tracing in reference to another new patch set that was posted recently but isn't merged yet: Steve Rostedt's function events patch set. So, a quick overview of trace events itself. Linux has a large set of trace events. These are just places in the kernel where it may be useful to log relevant information when execution hits those points. They're called trace events, and there are hundreds of them in the kernel at this point. [Audience: A thousand maybe? Thousands?] A couple thousand, okay. To make those manageable, they're grouped into subsystems. (Sorry, I couldn't get my Linux laptop hooked up, so I'm using my Windows laptop, which decided to run an update in the middle of my presentation.) Anyway, they're grouped into subsystems such as networking, block, etc., and those are contained in subdirectories under tracing/events. Within each of those subsystem directories are directories for the events themselves, one for each event.
And within those directories is a set of files that describe and control that particular event. One of those files is called the format file, and it's an exhaustive description of everything that gets logged when that event is hit, in the form of a set of fields. By default, an event's data doesn't go into the trace buffer; it doesn't get logged anywhere unless the event is explicitly enabled. To do that, you use one of the control files associated with the event or the subsystem, called enable, and you basically just echo 1 to that file. After that point you're logging the events, and you can go look at the output at any time by reading the trace buffer through the tracing/trace file. Here are a few examples of what actually gets logged: basically one record for each event, and in each record you see the fields filled in according to what was in the format file. In addition to, or instead of, logging those events to the trace buffer, you can also send them to a histogram, so the data gets aggregated in some way. The way to do that is through yet another file associated with the event, called the trigger file; to enable a histogram, you send it a hist specification like this line here. You can think of the histogram as a hash table with keys and associated values, and in the case of the ftrace histograms, the keys and values are just one or more of the fields described in the format file. So in this example we have a histogram where the key is the process ID, common_pid (a field that every event has), and the values are the bytes requested and the hit count. You send that to the trigger file, and at any point after that you can see the current state of the histogram. It's constantly running, and you can look at it anytime you want.
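As a concrete sketch of the steps just described: the following assumes tracefs is mounted at the usual /sys/kernel/debug/tracing path and a kernel built with hist trigger support. The commands are only printed here, not executed, so the sequence can be inspected without root.

```shell
# Build the commands described above: enable the kmem/kmalloc event and
# attach a histogram keyed on the process ID.
T=/sys/kernel/debug/tracing
EV=$T/events/kmem/kmalloc

# hist spec: key on common_pid, sum the bytes requested per process
HIST='hist:keys=common_pid:values=bytes_req'

echo "echo 1 > $EV/enable"          # start logging the event
echo "echo '$HIST' > $EV/trigger"   # attach the histogram
echo "cat $EV/hist"                 # read the running totals
```

Running the printed commands as root gives you a live, per-process view of kmalloc allocations, which is exactly the histogram output shown next.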
You do that by reading yet another file, called hist, and this is a typical output. Here we have a line for each unique PID, and associated with each of those processes a running total of the number of bytes allocated by kmalloc on behalf of that process, in this example. This is just a more graphical way of looking at the same thing, and I'm going to use this kind of diagram throughout the talk, so I'll explain it a little bit now. On the left-hand side is your standard kernel/user space layering diagram: user space calls into the kernel, the kernel calls the driver and talks to the hardware, etc. On the right-hand side there's a piece of the kernel broken out; in this case it's ftrace, the subsystem we're talking about. On the right we see the trace buffer associated with ftrace; that's where all the events go. To the left of that in the diagram are the histograms. These are just two representations of the same thing; they're the same histogram. One is a more tabular form, and the other is the same thing flipped on its side, with weights for the values it contains. Going back to the left-hand side, we have two processes, each probably making a system call, going into the kernel and coming back, and the dots in the middle of their execution represent the points where they hit a trace point. Each one of them is hitting a trace point, and you can see that the trace event data goes into the ftrace buffer and is also sent to a histogram. Each process gets its own slot in the histogram, and that's where the data is aggregated for that process. Okay, so everything I've described so far is already in the kernel. You can go play with the histograms, you can go play with the trace events; it's all very well documented in Documentation/trace.
What I'm going to talk about from here forward is the functionality implemented by the inter-event patch set, of which latency is probably the most important quantity. By latency, I just mean a quantity derived from two events, one later in time than the other, and the difference between their timestamps. To calculate a latency, we can make use of the histograms we already have; one of the things the inter-event patch set does is enable you to put variables on a histogram, so that you can save and retrieve values between events. Here's an example. I have two events, sched_waking and sched_switch, both keyed on the process ID. In the first one, when that event gets hit, we're saving the timestamp into a variable, ts0. Later, when the sched_switch event comes along, we're retrieving that timestamp, subtracting it from the current timestamp on that event, and saving the result into another variable on that trigger, in this case wakeup_lat. And that's just the latency we want. Graphically, it looks like this: you have a user space process making a system call. At some point it hits the trace point and logs its data into the trace buffer, and it also saves this timestamp variable in the slot associated with that process. Then some time later, the sched_switch event happens. It has a timestamp associated with the same slot, and in addition it sends its data to the trace buffer. At this point we have enough to create a latency out of those two things: we just subtract them, and we end up with our latency, still associated with that process. And that's pretty much all there is to it. So now we have a latency. What can we do with it at this point?
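The pair of triggers just described might look like the following sketch. The field and variable names follow the upstream histogram documentation for this wakeup-latency example; as before, only the command strings are printed.

```shell
# sched_waking: save a per-PID timestamp into variable ts0
# sched_switch: compute wakeup_lat = now - ts0, keyed on the woken PID
T=/sys/kernel/debug/tracing
WAKING='hist:keys=pid:ts0=common_timestamp.usecs'
SWITCH='hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0'

echo "echo '$WAKING' >> $T/events/sched/sched_waking/trigger"
echo "echo '$SWITCH' >> $T/events/sched/sched_switch/trigger"
```

The `$ts0` reference in the second trigger retrieves the value saved by the first, matched up via the shared PID key.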
Well, one of the most useful things you can do with a latency is figure out whether it's the maximum latency you've seen so far. That's what these icons are meant to represent: we've got a latency, and we're feeding it into this thing I call a handler. In this case, the handler represents a function that compares the latency you're passing in to the maximum seen so far. If it's not larger, do nothing; if it is, do the action on the right. Here the action, this piggy bank, is meant to represent save. So what it's saying is: if we have a new maximum, save some state associated with that maximum. In terms of the triggers, this is what it looks like; all we're doing is adding this line here. To the left of the dot is the handler, and to the right is the action. So here we're calculating the latency and passing it to this onmax function, the handler, and if it's larger, we're invoking the action on the right, which in this case saves a bunch of fields from the sched_switch event in variables associated with that entry. If we do that and then look at the output of the sched_switch event, where we attached the onmax handler, we see it telling us the maximum so far is 67, and, for the moment that maximum was hit, the values of the variables we asked it to save. Presumably we're saving variables that will help us determine the source of a high latency: you might want to look at the previous process and its priority, say, since maybe there's something going on with that particular process that you can then go take a look at. So that's useful, but you can actually do much more than that.
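In trigger form, the onmax handler with a save action might look like this sketch (the saved field list is illustrative; the handler and action syntax follows the inter-event patch set as later documented upstream):

```shell
T=/sys/kernel/debug/tracing
# On a new maximum wakeup_lat, save context fields from sched_switch.
SWITCH='hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)'

echo "echo '$SWITCH' >> $T/events/sched/sched_switch/trigger"
echo "cat $T/events/sched/sched_switch/hist"   # shows the max and the saved fields
```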
You can chain handlers, adding more handlers of the same type on top, and get even more information. For example, we could add another instance of the same onmax handler, and whenever we get a new maximum, do this action, which in this case represents a snapshot. The idea is that you're running your workload and you've enabled a bunch of events pertinent to your application, or all events in the system, and when you hit that maximum, in addition to saving the values, it also takes a snapshot. Then you can go look at the snapshot, and it'll give you a wealth of information. Saving a few fields is great, but having basically the whole state of the system in the snapshot buffer is infinitely more useful for digging into the problem. Here's what it looks like in terms of the trigger: all we did was add another onmax handler, and in addition to the save, we're doing the snapshot. Looking at the output, we see the same thing: we have a maximum latency, we have the variables we saved, and we also have another max that corresponds to the snapshot, which isn't shown here. But if you look at the snapshot using the tracing/snapshot file, you'll see your snapshot data. In this case, here we have our sched_switch, which is what actually triggered the snapshot, and the corresponding sched_wakeup is here. If you subtract those two timestamps, you'll see exactly the max that was reported by the latency, which is 99 in this case. Okay, so that's a lot of useful information about one particular latency: we have the data we saved and we have a snapshot, for the maximum latency. But really, what we want in addition is to be able to do something with every single latency that we see. We can add a handler to do that as well.
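The chained version might look like the following sketch. Note that at the time of the talk the snapshot action was part of the speaker's follow-on patches rather than the posted inter-event set, so the exact syntax here is an assumption modeled on the save action:

```shell
T=/sys/kernel/debug/tracing
# Chain a second onmax handler that also snapshots the trace buffer
# whenever a new maximum latency is seen.
SWITCH='hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio):onmax($wakeup_lat).snapshot()'

echo "echo '$SWITCH' >> $T/events/sched/sched_switch/trigger"
echo "cat $T/snapshot"   # inspect the buffer captured at the max
```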
In this case the handler is called onmatch, so there's a match involved. The idea is that every time you have a match between your events, and in this case we're generating a latency from that match, you do the action on the right. Here, the action represents another new piece of functionality added by the patch set: user-defined synthetic events. We're going to have this create an event of our own design, and we've designed it to contain the latency. So whenever we have a match, we have it generate our custom event, and you can see here that it goes into the trace buffer just like any other event. What's important about that is that because it's a bona fide event, we can apply all the same machinery to it as to any other event, such as creating a histogram, or taking snapshots, or saving, or whatever. In terms of the triggers, it looks like this. The first line is how we actually create the synthetic event: there's a file called synthetic_events, and all you need to do is echo in a description of the event, which is just the name of the event and a set of fields and their types. Once you've done that, you end up with a synthetic event here under the synthetic subsystem, with the name you gave it. Now we can use it in our trigger: all we do is add this onmatch, which says, whenever we have a match with the sched_waking event, do the action on the right, which is to generate the synthetic event. And this is the syntax for generating it. Think of a synthetic event as a function call; it's modeled after a function call. You just use the name of the event and pass it the parameters.
Those parameters fill in the fields of the event. Here we have our wakeup_lat variable; we're passing it in first, and it gets assigned to the lat field, and the same with the PID here, which gets assigned to the pid field. Finally, now that we have a new event, we can do whatever we would do with any other event, such as putting a histogram on it, and then we'll be able to see the distribution of latencies through the histogram. So that's just a simple histogram definition on our new event. When we look at the output, we get the same kind of thing: we still have the onmax output, 97 in this case, and the things we saved. We also still have the snapshot, and we can see that the very last thing in it is our custom event, wakeup_latency, containing our maximum latency, just as we would expect; and of course these match up, and they do. And finally, here's the output of the histogram we put on our new custom event. Here we have a distribution of latencies, and down here, as we would expect, the biggest latency, which has already been flagged by the snapshot and the save. Okay, so everything I've talked about so far is part of the inter-event patch set, with the exception of the snapshot; that's something I added in a follow-on patch set which I haven't posted yet, but it is available. On the last slide I have a pointer to the repository with all the code I used to generate these examples; it has the snapshot work, and the stuff I'll be talking about for the rest of the talk is in there as well, so you can look at that if you're interested in playing around with it. So, object latencies: in some cases it makes sense, it's more natural, to use the address of an object as the key for the latency calculations.
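Putting the whole synthetic-event pipeline together, the three steps just walked through might look like this sketch (event and field names follow the wakeup-latency example from the upstream histogram documentation):

```shell
T=/sys/kernel/debug/tracing
# 1) Define the synthetic event: a name, then fields with their types
SYNTH='wakeup_latency u64 lat; pid_t pid; int prio'
# 2) On each waking->switch match, generate it like a function call,
#    passing the computed latency and fields from sched_switch
MATCH='hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_prio)'
# 3) Histogram the synthetic event itself to see the distribution
LATHIST='hist:keys=pid,prio:values=lat:sort=lat'

echo "echo '$SYNTH' >> $T/synthetic_events"
echo "echo '$MATCH' >> $T/events/sched/sched_switch/trigger"
echo "echo '$LATHIST' >> $T/events/synthetic/wakeup_latency/trigger"
```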
So for the histogram, you use an object address for the key. An example is in the networking subsystem: if you look at the events there, a lot of them have an actual skb address as one of the fields in the event, and that makes a nice histogram key; you can essentially think of it as latency with reference to that particular object. And if you set up a series of latencies using that address, it's sort of like tracking that particular object through the kernel and seeing its latencies. [Audience: Is this hardware-dependent?] No, it's all software. This is the trace event subsystem; it's available just by turning some kernel options on, and you'll have trace events anywhere the kernel defines a trace point. Obviously some of them aren't going to trigger unless you have the corresponding hardware, in the driver or something. [Audience question about chaining latencies on the same object.] I don't think so; that sounds kind of advanced, chaining them that way. We can talk about it. [Audience: For the latency, you use that skb address, then it's discarded and goes to the next guy, right? So you could get the first part of one latency, miss the second part, and then see the first part again.] Yeah, I'm not quite sure. This is just an example; I'm not a networking guy. I'm assuming that you have an skb address and two events on that address, and that you'll get the first event and then, some time later, the second one, before the address gets recycled. Okay.
So in this case, I'm using net_dev_start_xmit and net_dev_xmit, keyed on the skb address. This is what it looks like: similar to what we've seen before, creating the two triggers and then generating the synthetic latency event from the result. You'll end up with something like this, which works initially and then stops working, because we're using an address as a key and nothing frees that address once we're done with the latency. What ends up happening is that we fill up the hash table, the histogram, with object addresses, and since nothing removes them once the latency is finished, we end up overflowing the table and start dropping events. Once we've started dropping events we're pretty much hosed, because we're not doing latencies anymore, and there's no way to get the space back. The solution is to add a delete operation. The histograms are based on the tracing map, and the tracing map didn't have a delete until a colleague of mine at Intel added one. Once we had that, we added a delete action to the histograms to take advantage of it. The delete requires a new facility, a really simple thing called tags, which just allows you to tag a histogram with a name, say tag=xxx, whatever your name is. The idea is that you tag every trigger that participates in the latency with this tag, and then once the latency is calculated, it will go ahead and delete its entry in the table, along with all the matching entries in the other tagged tables.
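The skb-keyed version of the triggers might look like the sketch below. The net_dev_start_xmit and net_dev_xmit events and their skbaddr field are real, but the synthetic event name here is hypothetical, and the sketch deliberately shows the broken-by-design state described above, before any delete action:

```shell
T=/sys/kernel/debug/tracing
# Key both triggers on the skb address carried by the net_dev events.
SYNTH='xmit_latency u64 lat'   # hypothetical synthetic event name
START='hist:keys=skbaddr:ts0=common_timestamp.usecs'
DONE='hist:keys=skbaddr:lat=common_timestamp.usecs-$ts0:onmatch(net.net_dev_start_xmit).xmit_latency($lat)'

echo "echo '$SYNTH' >> $T/synthetic_events"
echo "echo '$START' >> $T/events/net/net_dev_start_xmit/trigger"
echo "echo '$DONE' >> $T/events/net/net_dev_xmit/trigger"
# Without a delete, entries keyed on recycled skb addresses pile up
# until the tracing map overflows and events start being dropped.
```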
So you're now freeing up those slots for the next object that comes along, skb or whatever. A nice side effect of having these tags is that you don't have to use them just for delete; you can give any set of triggers a tag and then remove them all at once using a new remove-all command that I added, instead of one by one like you pretty much have to now. Here's an example of using that: we're tagging these two triggers with an xmit tag, and then on the last one we invoke the delete action, which causes the entry to be freed when we're done with the latency calculation. And here's an example of the remove-all: again, we're referencing the tag and executing that on one of the triggers. It doesn't matter which trigger you use; it'll go through and find all the other events with the same tag and remove them. And here's the result: looking at one of the histograms, we see it's empty, clearly empty, even though we've had five hits. The reason it's empty is that we also had five deletes. So it works. [Audience comment about the empty histogram.] Yeah, but in this case the histogram isn't being used as something to look at; it's being used to track. We're interested in the output of the latency, and the other histograms, the ones we're using to calculate the latency, we don't care about, so we don't usually look at them. I just looked at this one to verify that it was actually empty and stuff was getting deleted. So that's one example, and one more example is a very similar thing with the block layer instead.
I just kind of wanted to give some idea of how you might go about using this stuff to track down the cause of a latency, one process you might use to do that, in this case with the block layer. So assume you have some problem that you suspect is caused by a latency in the block layer; maybe you're reading files, and every once in a while you just stall for a really long time for no apparent reason, which I've seen plenty of times before, and you'd like to look into it. If we run across something like that and want to track it down, one way is to write a simple script like this. It just executes the workload, turning tracing on before it does that and off afterward, and enables the events which we think are a good starting point for tracing the block layer. It's just called trace-me-block, and it's based on one of Steve Rostedt's examples in one of his articles, so it's really just a simple script. Looking at the output, the idea is to try to find a couple of events that might be good candidates for hanging latency calculations off of, and looking through them, these two look like good examples: block_rq_issue and block_rq_complete. But if we go and look at those events, block_rq_issue for example, there doesn't really seem to be anything directly applicable to use as a key for a histogram. It's not like the networking events, which actually gave us an skb address, the address of an object, as part of the event.
But that doesn't mean all is lost, because we also know that given any function in the kernel, we can create a trace event that exports pretty much any data we want at that point, using kprobe events, for instance. And the fact is that we did find a couple of events that were at good points in time to hang a latency off of; it just happens that, for our purposes, they didn't give us the exact data we really want. So the problem turns into finding some functions near the trace points we like that we can use for a similar purpose. To do that, we can just add this line to the previous script, which turns on the function_graph tracer. When we look at the output, we'll see our original trace points of interest, but we'll also see a function call graph surrounding them, from which we can try to pick and choose some things that might work. This is the typical output you see: here we have the block_rq_issue we're interested in, and block_rq_complete, and looking around those two events in the output, we see a couple of candidates, blk_mq_start_request and blk_account_io_completion. So let's say we want to use those for our trace events. If we look at those two functions in the kernel source, both of them have this struct request object that looks perfect for our purposes. So now we have a function and the data that we want to export as a trace event. Like I mentioned, you can use kprobe events to do that; the syntax is a little off-putting, I guess, and it's arch-specific.
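A kprobe-events version of this might look like the following sketch. The probe group and names are hypothetical; the `$arg1` fetch argument (grabbing the struct request pointer) is supported on newer kernels, while older ones need an arch-specific register name, which is the off-putting part mentioned above:

```shell
T=/sys/kernel/debug/tracing
# Hypothetical probes on the two candidate functions, each exporting
# the first argument (the struct request pointer) as a field named rq.
P1='p:blkprobe/rq_start blk_mq_start_request rq=$arg1'
P2='p:blkprobe/rq_done blk_account_io_completion rq=$arg1'

echo "echo '$P1' >> $T/kprobe_events"
echo "echo '$P2' >> $T/kprobe_events"
# The probes then show up under events/blkprobe/ like any other event,
# so hist triggers can be attached to them as before.
```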
A similar, equivalent way of creating trace points for those functions would be perf probe; you could do the same thing, giving it those functions. But for the purposes of this talk, I'm going to be using Steve Rostedt's new function events patch set, which he posted a few weeks ago. The idea behind it is that you can very easily take any function in the kernel and directly create a trace point for it, with all the parameters of the event. It can do a whole lot more in addition: you can go through structures, you can do global events, and so on, but here we're just using it to export one of the parameters. To do that, you just echo the name of the function and the parameter you want into this function_events file, and you end up with an event like this. You can see it has the name of the function, it's under this new function subsystem, and here's the request object that we're interested in. So this is just along the same theme: we have our triggers, and now we have our request object to use as the key, and we're just creating a synthetic event and then generating a histogram from that. Here we have our output, the request latency between those two functions we picked out. Basically, having this ability to create a trace event at will for any function, very easily exporting any data we need, takes us from a situation where we have just a handful of available trace points that may or may not have the data we need, to one where we can very easily calculate latencies and show a histogram between any two functions in the kernel, just by doing something similar to what I did.
And in fact, here's an initial stab at that, kind of a toy script. It's called func-event-latency; you just hand it two functions and a key, and it will generate all the machinery needed to create a histogram between those two functions. It takes the function names and makes function events out of them, then it creates the two triggers that save the timestamp and calculate the latency, keyed on the key. It also creates a synthetic event named func_lat, and uses onmatch to generate that synthetic event with the latency. And now that we have an event with the latency, we can just put the histogram on it, like here, on the synthetic event we created. [Audience comment.] Exactly. Like I said, it's a toy program at this point; you'd have to be much smarter about matching things up. Maybe the key is the first parameter on one function and the second parameter on the other, or maybe it's some other variable in a structure somewhere. Someone who's better at shell programming could do it justice, but it illustrates the basic potential of the thing, I think. [Audience question about what you need to run this.] To do this, basically what I did was take the inter-event patches I posted, put them on top of Steve's function event patch set, and then added on top of that the other changes I'm talking about. Eventually they'll all get sorted out as things go upstream, but for now they're all available in that git repository; I put everything in there.
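In the same toy spirit, a generator like the one described might look like this sketch. It takes two existing events and a shared key field and prints the tracefs commands that wire up a latency histogram between them; all names and the overall shape are illustrative, not the actual script from the repository:

```shell
T=/sys/kernel/debug/tracing
# Given two events sharing a key field, print the commands that build a
# func_lat synthetic event and a latency histogram between them.
gen_latency() {
    sub1=$1; ev1=$2; sub2=$3; ev2=$4; key=$5
    echo "echo 'func_lat u64 lat' >> $T/synthetic_events"
    echo "echo 'hist:keys=$key:ts0=common_timestamp.usecs' >> $T/events/$sub1/$ev1/trigger"
    echo "echo 'hist:keys=$key:lat=common_timestamp.usecs-\$ts0:onmatch($sub1.$ev1).func_lat(\$lat)' >> $T/events/$sub2/$ev2/trigger"
    echo "echo 'hist:keys=lat' >> $T/events/synthetic/func_lat/trigger"
}

# Example: request latency between the two block events from earlier
gen_latency block block_rq_issue block block_rq_complete dev
```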
I'll show you on the last slide, a couple of slides from now actually, a git repo that has all this stuff; you can just go get the slides from the ELC website and go directly to it. So this is the end result: you have this func-event-latency script, and in one line you've created this latency. And because the latencies are actually full-fledged events, you can do all the other things with them as well, not just create this histogram. So here are the two functions I gave it, keyed on the request, and after you run this one line, you end up with the histogram, and that's the output, just like all the previous examples. So that's all I had. Here's a link to the code, and that's pretty much it. Thanks for attending, and if there are any questions, I'll be happy to answer them. Pardon? Sure, the slides? If you go to my slot on the schedule for ELC, you should be able to just get the slides there. And I didn't put my email here, so in case that doesn't work, you can just look for the patches. Okay, any other questions? [Audience question comparing this to other tools.] Well, this is just all right there: you're at the command line, and you have it available. It's not meant to be in competition with those tools. In fact, one of the things I was going to do for this talk that I didn't have time to do involves BCC, which is a tool that uses BPF and has some examples similar to these.
What I was going to do was try to plug this into BCC, so that you could actually make use of this as a component, instead of creating the BPF program and all that machinery. So it's not mutually exclusive. And as one example, the block layer has a lot of stuff where you have to write a little code snippet to get, say, the major and minor numbers, which we don't have any way of doing here, and I don't think we ever will, because that's basically writing some code as part of the probe. This is just what you can wire up with wires instead of logic, so to speak. Any other questions? I guess not. Well, thanks for attending.