development. So the focus of this talk is not really how to use ftrace in general (I've given several talks on that); it's aimed at people who are starting out as kernel developers, who want to modify kernel code and want a way to see what's actually happening under the hood.

So first, what is ftrace? Ftrace is, quote unquote, the official tracer of the Linux kernel. It entered the kernel in 2008. Technically, ftrace is the infrastructure that lets you hook into any function within the kernel: every function in the kernel is something you can attach to and do something with. That hooking infrastructure is what ftrace strictly is, and one of its major features, the function tracer, lets you trace every function called inside the kernel. That is how the rest of the tracing infrastructure came to live under the name ftrace, and the name has expanded to cover everything in the /sys/kernel/tracing directory; you'll see all these files in there.

Originally this lived in /sys/kernel/debug/tracing, but that required debugfs to be compiled in, and there were a lot of requests saying: we want the tracing infrastructure on our production machines, but we don't want to bring in debugfs. Debugfs has a lot of debugging files that make it harder to lock down a system; it may not be so secure and could have vulnerabilities, since it's more or less just for debugging. So we created tracefs, which lets you install the tracing facility without all the stuff that comes with debugfs. If you have it compiled in, you'll see a tracing directory under /sys/kernel that you can mount.

So what's in this tracefs directory? There's a command below that shows you how to mount it, and feel free to follow along on your laptop, though ideally in a VM rather than your laptop, because if there's a bug in here somewhere you could lock up your machine and be kicked out of this talk. The commands I use are run as root, not with sudo. You can use sudo, but you'll be typing a lot of sudo commands, so ideally just do a `sudo su` and run everything as the root user.

After you mount tracefs, you cd into that directory (most of the commands here assume you've already done that), and you'll see there are a lot of files, which can be overwhelming when you don't know what they all do. But there are really only a few files you need to care about. The rest are for the more advanced user; there's a much smaller subset that you'll deal with on a daily basis, and we're only going to talk about a few of them. And yes, if you look right in the middle of the listing, there's a README file within the tracefs directory. It doesn't explain everything, but it does exist.
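A minimal sketch of that setup, assuming the standard mount point (the listing comment is abbreviated):

```sh
# Mount tracefs (recent kernels usually auto-mount it here)
mount -t tracefs nodev /sys/kernel/tracing

cd /sys/kernel/tracing
ls          # many files: current_tracer, trace, events/, options/, README, ...
cat README  # a built-in quick reference; it doesn't explain everything
```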
So what exactly can be traced? First, there are events. Events are static points within the kernel that extract specific data the developer wanted. They can also be dynamic: you can say, "I want something specific here," attach a probe, and pull data from that spot. I'll talk about that a little later.

On the other side, there are tracers. Tracers are another way of tracing things within the kernel, and as opposed to events, they add functionality. The function tracer, for example, traces all functions: enable it, and every function that gets called will be traced. You can filter it, and there are ways to modify tracers. There's also the function graph tracer, and the latency tracers add functionality to answer questions like: how long does it take for a process to wake up? How long are interrupts off? How long is preemption off? That, ideally, is what tracers are.

Going back to events: events are broken up into groups. For example, there are scheduler events, interrupt events, and block events, and there are now probably close to 100 different groups. Each event was created by a developer who wanted to share some information they thought was important for debugging their subsystem. Out in the real world, when something isn't working quite right, you can enable these events and they'll extract data into a ring buffer that can be read from user space. So if you're having a problem with, say, a block device driver that has a few events in it, you can enable those events, see them, and modify them.

How do you enable these events? It's actually quite simple. If you know basic shell commands, echo and cat, that's essentially all you need to work with pretty much the entire tracefs file system. There are no utilities you have to install on top of it (although we do provide utilities on top of it). If you're in the embedded space and all you have is a BusyBox shell, you still have almost the full functionality of ftrace on your kernel.

So go to the /sys/kernel/tracing directory. Here I'm interested in the scheduling events, specifically sched_switch, which fires when one task schedules out and another schedules in (only one task can run on a CPU at a time). To see all these events, I just echo 1 into the enable file of the sched_switch event. You can see that sched_switch lives within the sched group: the way events are structured, there's an events directory; within it, each group has its own directory; and within that, each event has its own directory containing the files to enable, disable, modify, and filter that event. The commands look like the sketch below.
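A minimal sketch of that sequence, run as root from the tracefs directory:

```sh
cd /sys/kernel/tracing

# events/<group>/<event>/enable: here, sched_switch in the sched group
echo 1 > events/sched/sched_switch/enable

cat trace                                   # read the recorded events
echo 0 > events/sched/sched_switch/enable   # turn it back off
```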
If I zoom in on one line of that output, you can see what the event records. The previous process, the one being switched out, is the ssh process; I can see its process ID and its priority. Note that the priority shows as 120, and you might say, "hey, that's a really high priority." Inside the kernel, priorities are in reverse order: the lower the number, the higher the priority. There's a reason for that which I'm not going to discuss right now, but 120 just means it's a normal process. If you see a priority of, say, five or one, those are really, really high priority processes. The previous state is S, which means the process has gone to sleep: it said, "okay, I'm going to yield the CPU and let something else run." And the next process is the swapper task, which is simply the idle task. That's the kind of information you can extract from events.

If I want to see the format of the event, what actually gets recorded, I look at the format file in the sched_switch directory. Every single event has a format file that lets user space read the raw binary data: events are written very fast, copied straight into the ring buffer (I'll describe that later), and the format file tells you how to parse what you extract. Here you see the sched_switch format. Everything starts with common_type, common_flags, common_preempt_count and the other common fields that exist for all events. Then, after a blank line, come the fields specific to this event, the data that, as I said, the developer decided was important. Below that, the print fmt line shows how to render that information; I'll describe that a little later as well.

So looking at the sched_switch event: on top is what was displayed when I cat the trace file. You'll see prev_comm, the field as it was recorded into the trace buffer, and below it how it gets printed, and the same for prev_pid, prev_prio, and the state. The state looks pretty complex in the raw form; I'll get to that in a bit. An abridged version of the format file is sketched below.
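Roughly what that format file looks like (heavily abridged; the exact offsets and sizes depend on the kernel build):

```
# cat events/sched/sched_switch/format
name: sched_switch
format:
	field:unsigned short common_type;         offset:0;  size:2;  signed:0;
	field:unsigned char common_flags;         offset:2;  size:1;  signed:0;
	field:unsigned char common_preempt_count; offset:3;  size:1;  signed:0;
	field:int common_pid;                     offset:4;  size:4;  signed:1;

	field:char prev_comm[16];                 offset:8;  size:16; signed:1;
	field:pid_t prev_pid;                     offset:24; size:4;  signed:1;
	field:int prev_prio;                      offset:28; size:4;  signed:1;
	field:long prev_state;                    offset:32; size:8;  signed:1;
	field:char next_comm[16];                 offset:40; size:16; signed:1;
	field:pid_t next_pid;                     offset:56; size:4;  signed:1;
	field:int next_prio;                      offset:60; size:4;  signed:1;

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==> next_comm=%s ...", ...
```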
So what does a trace event look like in the kernel source? If you go into kernel/sched/core.c and look inside the scheduler code, you'll see a call to trace_sched_switch(). All trace events are built on top of tracepoints; they're very tightly coupled. A tracepoint looks like a function named trace_ followed by the event name, so trace_sched_switch() looks like a normal function call taking preempt, prev, and next. What's special is that it is not a real, normal function: it's created by a macro that does a lot of work so that when the event is not enabled, it's really a no-op. Your kernel can have a thousand events in it, yet they add very little overhead, because every one of those call sites is actually a no-op instruction. There's magic code (I could give a whole other talk on how it works) that converts that no-op, when you enable the event, into essentially that sched_switch function call.

So how do you create one? If you want to add events to your own driver, or anything you're working on, this is how you go about it: it's all done with the big TRACE_EVENT() macro. In include/trace/events/sched.h you'll see that the sched_switch event is created by this macro. It starts with the name of the event, sched_switch, and once you create that, the function you call will be trace_sched_switch(); that's where the trace_ prefix comes from. Next come the parameters: TP_PROTO() defines the prototype of the tracepoint function you're going to place in the kernel, and TP_ARGS() lists the same names again. The names have to match, because that's how the macro knows how to call the function with those arguments as opposed to merely declaring it; it's required by the implementation of the TRACE_EVENT() macro.

Now, remember the format file and all those fields? Those fields are defined by TP_STRUCT__entry(). You'll see prev_comm and everything below it defined here, including an array; that's where the [16] for prev_comm comes from, because it's declared as an array of TASK_COMM_LEN, which the kernel defines as 16. Pretty much every macro used inside TRACE_EVENT() is expanded to its raw number when viewed through the format file, because user space doesn't know what TASK_COMM_LEN is; it needs the 16 to parse properly. That macro turns into a structure accessed through a special name, __entry.

The next part is TP_fast_assign(). You can think of this as how the data gets written into the ring buffer. The __entry structure is mapped right on top of the ring buffer; there is no copy. You allocate just enough space on the ring buffer for the entry structure, then fill it in through the __entry variable. And trace_sched_switch() passes in preempt, prev, and next; that is the information used to fill in the entry structure, as you can see below. In actuality, if you let the preprocessor finish handling the TRACE_EVENT() macro, you'd end up with a function that, when the event is enabled, is called as if you had written it by hand, and TP_PROTO() is what defines the prototype you'll be using. The last part is TP_printk(), which is how to print this event after you have all the data. Putting the pieces together, the definition looks roughly like the sketch below.
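Here is a condensed sketch of that definition, modeled on include/trace/events/sched.h (abridged: the real one derives prev_state through a helper macro, and field names shift slightly between kernel versions):

```c
TRACE_EVENT(sched_switch,

	TP_PROTO(bool preempt,
		 struct task_struct *prev,
		 struct task_struct *next),

	TP_ARGS(preempt, prev, next),

	/* Becomes the struct laid out in the ring buffer (the format file) */
	TP_STRUCT__entry(
		__array(char,	prev_comm,	TASK_COMM_LEN)
		__field(pid_t,	prev_pid)
		__field(int,	prev_prio)
		__field(long,	prev_state)
		__array(char,	next_comm,	TASK_COMM_LEN)
		__field(pid_t,	next_pid)
		__field(int,	next_prio)
	),

	/* Runs at the call site: fills the ring-buffer entry in place, no copy */
	TP_fast_assign(
		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
		__entry->prev_pid	= prev->pid;
		__entry->prev_prio	= prev->prio;
		__entry->prev_state	= prev->__state; /* "state" on older kernels */
		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
		__entry->next_pid	= next->pid;
		__entry->next_prio	= next->prio;
	),

	/* How the entry is rendered when the trace file is read */
	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d ==> next_comm=%s next_pid=%d next_prio=%d",
		  __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
		  __entry->next_comm, __entry->next_pid, __entry->next_prio)
);
```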
That TP_printk() is used in two cases. One: when you cat the trace file, the kernel uses the TP_printk() part of the macro to know how to format the event and print it to user space; it reads the raw data and exports it. Two: we also show it in the print fmt line of the format file, so that user space tools can read the raw data and know how to print it the same way. The goal is that user space output is identical to what happens inside the kernel.

If you look at the print fmt in the format file, remember how scary and ugly that was? It doesn't look like it matches the TP_printk() you wrote, but believe it or not, that TP_printk() is what created it. And see that TASK_REPORT_MAX in there? Remember I said earlier that all macros are expanded to what they're really defined as. If you look in the include/linux/sched.h header, TASK_REPORT_MAX is defined in terms of a bunch of other macros that finally expand into what you see: 0x00, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, then plus one, shifted to the left. You'll see that three more times, because it's used in three other places. We also have helper macros like __print_flags(), which converts a bit mask into text. So if the TASK_INTERRUPTIBLE bit is set, an "S" shows up; the user doesn't have to memorize which bit TASK_INTERRUPTIBLE is. You can turn bits into text that's human readable.

Now, this can all feel overwhelming, and you might say: how am I going to remember all this? Am I going to keep watching this video over and over until it sinks in? No, you do not need to memorize it. There are samples of all the functionality of trace events in the Linux kernel source: look in the samples/trace_events directory. Make sure you look at the Makefile in that directory too, because it has information on how to make all this work. It's pretty easy. The fact that we now have over a thousand, probably a couple thousand, trace events in the Linux kernel shows it wasn't very difficult for people to implement; I helped a few people, but most just went off, read the sample files, and cut and pasted. There's a template you can use to figure this out. The TRACE_EVENT() template, plus descriptions of __print_flags() and everything else you can do (dynamic strings, bit masks, CPU masks, and so on) are in the trace-events-sample.h file, and trace-events-sample.c shows where they're executed and how to incorporate it all. Read those three files and you'll have your own trace events in your own code in no time.

Now let's look at the other side and talk a little more about tracers.
Believe it or not, tracers were in the Linux kernel before events were, because originally everything was done by tracers: you'd say, "I want to trace something," and you wrote the functionality to trace it. The first one was the function tracer, which is why ftrace became the name; ftrace basically stood for "function tracer." Since that was the first real tracer added to the kernel, the name stuck, and all the infrastructure that came after it we just call ftrace, as a general name for everything attached to it. There are also latency tracers, which define their own functionality. Where events just enable and disable a single location to collect some data, tracers actually do work: they change how the kernel runs in order to do something.

What's nice is that events can be enabled within tracers: enable events, and those events are intermingled with the tracer's data. So if there's some specific data you want to see alongside the tracer's output and there's an event for it, enable it and it fits right in. Some tracers also have their own options, so you can modify how the tracer works; more on that in a bit.

Here's a basic example of what a tracer is. If I go into the tracing directory and echo "function" into current_tracer (current_tracer is how you enable a tracer), I get the function tracer. By default the tracer is nop: there's always a tracer enabled, and the nop tracer is the one enabled when you boot the kernel. If you want to disable tracing, you echo "nop" into current_tracer, which turns the tracer off because the nop tracer does nothing, as the name suggests. With the function tracer enabled, it traces all functions and shows you the parent: you'll see a trace of, say, the _raw_spin_lock function and who called it; here the scheduler function called _raw_spin_lock. You can filter all of this, but that's out of scope for this talk; I have other talks, and there's documentation that explains how to modify tracers. Right now I'm just showing you that they exist. And, as I mentioned, you can add events: here I echoed 1 into events/enable while the function tracer was running, and you see the events appear intermingled within the function trace. The commands are sketched below.
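As a sketch of that sequence:

```sh
cd /sys/kernel/tracing

echo function > current_tracer   # enable the function tracer
cat trace | head                 # functions, each shown with its calling parent

echo 1 > events/enable           # events get intermingled with the functions
cat trace | head

echo nop > current_tracer        # the nop tracer: turns tracing back off
echo 0 > events/enable
```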
Now for tracer options. The irqsoff tracer records the time from when interrupts are disabled to when they're enabled again, and when it hits a new maximum (the longest time between disabling and enabling), it records that into a snapshot ring buffer. When you cat trace, you see a snapshot of the longest period. By default it enables the function tracer as well, so you can see all the functions that ran in that interrupts-off window. In the example here, if you look at the top, there's a 914 microsecond latency and 85 trace entries, which is why I don't show them all. But function tracing has overhead, so maybe it wasn't really 914 microseconds, and I want to turn that part off.

By the way, if you're checking whether you have irqsoff in your kernel: you probably don't. The irqsoff tracer requires hooks in the kernel that have to be there all the time; some functions have to be called whether the tracer is on or off, so it adds a slight overhead, and most distributions will not enable the irqsoff and preemptoff tracers. You probably do have the wakeup latency tracer, though: the wakeup tracers don't add those hooks, they hook into events to do all their tracing, so it's usually fine to have them compiled in; they cause no overhead just by being there.

So let's say I want to turn off the function tracing part. There's an options directory, and through it you modify how tracers work, how the trace output looks, and a bunch of other things. Every one of these options is documented: look in Documentation/trace/ftrace.rst and read all about it. Here I echoed 0 into options/function-trace, which tells the latency tracers not to enable the function tracer, then enabled the irqsoff tracer, and you get a full snapshot. Of course the latency is still really big; a softirq did something that took a long time, and I get to see where it happened, because the tracer gives you a stack dump at the end of the trace. A sketch of that is below.
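A sketch of that, using the option and file names from Documentation/trace/ftrace.rst:

```sh
cd /sys/kernel/tracing

echo 0 > options/function-trace   # latency tracers: don't also trace every function
echo irqsoff > current_tracer     # record the longest interrupts-off window

cat tracing_max_latency           # the max latency seen so far, in microseconds
cat trace                         # snapshot of the longest window, with a stack dump

echo nop > current_tracer
```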
Like I said, some tracers create their own functionality. The hwlat tracer, the hardware latency tracer, creates a kernel thread when you enable it that basically hogs a CPU for a certain amount of time and checks for latencies, measured in microseconds. It's looking for SMIs or anything the computer does behind the operating system's back: it spins with interrupts disabled, reading timestamps, and if there's a gap between two timestamps it says, "ah, the hardware is doing something here." The times in my example are pretty big because I ran this on a virtual machine, so any time the host scheduled the guest out, that showed up as a hardware latency within the guest. Right now we're working on another tracer called the osnoise tracer; Daniel Bristot is working on that. It does the same kind of thing, but looks not just for hardware latencies but at what softirqs, irqs, and NMIs happen to do, and it will also tell the process, "hey, you have other threads in the kernel that are going to preempt you from running." That'll be coming soon too; tracers are still being created as we speak.

Instances. Let's say you want to run more than one tracer, or you want more than one buffer for various reasons: some events in one buffer and some in another, or function tracing in one buffer and a latency tracer, or events, or the function graph tracer in another. Or say you have one event you want to trace that happens rarely, and another event you want to see that happens a lot, like an IRQ or scheduling event, which will overflow the buffer and make you lose the events you care about. You can create multiple instances: enable the rare event in one instance, and the busy one (the "greedy" one, or whatever you want to call it) in another. That way, if the greedy one overflows its buffer, it won't affect the rare one.

To do that, inside the tracing directory there's an instances directory, and all you have to do to create a new instance is mkdir. Here I did mkdir instances/foo, creating an instance called foo. If you look inside it, you'll see a subset of the same files that are in the top-level tracing directory, and it lets you enable and disable functions, events, and tracers inside that instance without affecting the top-level instance or other instances. You can make as many as you want, provided you have the memory for it. In this example, I enabled the function tracer at the top level, created a foo instance, enabled the events there, and when I cat foo's trace, it has all the events but no function tracing intermingled, as sketched below.
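A sketch of the instance example:

```sh
cd /sys/kernel/tracing

echo function > current_tracer        # function tracer in the top-level buffer

mkdir instances/foo                   # a new instance: its own buffer and control files
echo 1 > instances/foo/events/enable  # all events, but only in foo

cat instances/foo/trace               # events only, no function tracing mixed in
rmdir instances/foo                   # instances are removed with rmdir
```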
So let's go back to tracers versus events. Events, again, are static points within the kernel that give specific data the developer wanted. Tracers add functionality and change things; if you write your own tracer, you can do whatever you want in it, like the latency tracing I described. So they are distinct. But there's something else on the events side: dynamic events. They look the same as normal events, but they're not created with the TRACE_EVENT() macro; they're created at runtime and injected into the code, and like static events they extract specific data. A kprobe uses a breakpoint, but if you attach it at the start of a function, it can use ftrace instead. Remember, ftrace is the infrastructure that can hook every function, so a kprobe attached to a function ftrace can hook will use that, which is much quicker than a breakpoint. There are also synthetic events, which let you connect two events together to create a third event that gives you information between those two.

Like I said, you don't have to memorize everything in this presentation: everything is documented, and we try to keep the documentation up to date. Kprobe events are described in Documentation/trace/kprobetrace.rst. The interface to create them is the kprobe_events file within the tracing directory. Again, if you attach a kprobe to the start of a function, it uses ftrace, which is fast; you can also hook a kprobe into the body of a function using an offset. Take a function from the kernel, go into GDB, disassemble it (say, schedule), and pick the offset within schedule that you want to attach to; write "schedule" plus that offset and it puts a breakpoint in. Kprobes even have an optimization: they analyze the function, and if it's safe to replace the breakpointed instruction with an actual jump, the code jumps to a trampoline that simulates what was overwritten and then does your tracing. That can be much faster than a breakpoint.

So why would I want to use kprobe events? Inside the scheduling code there's a function called activate_task. Activating a task means the task wants to run; it's waking up, so at the low level the kernel says, "this task is about to run, let's put it on the run queue, let's activate it." Let's say there's no tracepoint there, and I want some information from that code. The function takes a struct task_struct *p, and I want information about that task. Looking at struct task_struct in include/linux/sched.h, I can see the pid field, the process ID, and comm, the task name. But I need the offsets of those fields within the structure.

Here's a little trick for getting those offsets. If your kernel build has debug information enabled and you run gdb on vmlinux, you can find structure offsets like this: typecast zero to a pointer of the structure you want, (struct task_struct *)0 with parentheses around it, then take the address of the field and print it. Printing &((struct task_struct *)0)->pid gives the offset of pid relative to a base of zero. Task_struct is a very large structure, and here pid lands at hex 0xaa8. Do the same for comm and it prints that offset too. A sketch of that gdb session is below.
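The gdb session looks roughly like this; the 0xaa8 pid offset is from my build and yours will differ, and the comm offset is elided here:

```sh
$ gdb vmlinux
(gdb) print &((struct task_struct *)0)->pid
$1 = (pid_t *) 0xaa8
(gdb) print &((struct task_struct *)0)->comm
$2 = (char (*)[16]) 0x...
(gdb) quit
```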
Then I quit out of gdb, switch user to root, cd into the tracing directory, and write the probe string into kprobe_events; this is all described in that kprobetrace.rst file in the documentation directory. I echo "p", which stands for probe, then a colon and the name I want to give this event, so I call it "activate". Then comes the function to hook. I could put a function name plus an offset, but since I'm going to use the function's arguments, I have to use the function name itself without an offset. So I put activate_task. Then I say pid comes from argument two: $arg2 is the second argument. Going back to the prototype, argument one is rq and argument two is p. So I use $arg2 and index off it with +0xaa8, which, if you've done assembly code, works like assembly addressing: take the value and index off of it. I ask for the 32-bit value with ":u32" because I know pid is an int. Then I do the same for comm with its offset, also off $arg2, because both fields hang off the same task_struct pointer p, but typed as ":string". And I redirect that line into kprobe_events. When I enable this event, I see my activate events: the process ID being activated and the name of the process. The whole thing is sketched below.
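A sketch of the probe line; the 0xaa8 offset is the one from this build, and COMM_OFFSET is a placeholder for whatever gdb printed for comm on your kernel:

```sh
cd /sys/kernel/tracing

# p:<name> <function> <field>=<where>:<type>
echo 'p:activate activate_task pid=+0xaa8($arg2):u32 comm=+COMM_OFFSET($arg2):string' \
	> kprobe_events

echo 1 > events/kprobes/activate/enable
cat trace    # "activate" events with pid= and comm= of the task being woken
```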
The next thing to look at is synthetic events and histograms. These are described in Documentation/trace/histogram.rst, so again, you don't have to remember everything I do here. The interface is a few files: there's synthetic_events in the top-level tracing directory, and after you create a synthetic event, you attach two events together through the events' trigger files (the hist files come into play too; I'll get to that). It creates events based off of two other events, which is exactly the functionality you want for tracking latencies, for example.

So in this code I add two kprobes. I create the "activate" kprobe exactly as before, with basically the same parameters, and a "deactivate" kprobe on deactivate_task, the function called when a task is deactivated. I want to measure from the time a task is deactivated to the time that task is activated again. Notice that on the second line, instead of a plain ">", I used ">>", which appends to kprobe_events; if I had used ">", I would have deleted the activate event I'd just created.

Then I create a histogram on the deactivate event using keys. I have to pick a field that can match the two events, because deactivate and activate need to be matched up somehow, and here I want them matched by their process IDs, so pid is my key. And common_timestamp is a special field for these triggers that records the timestamp of the event: I create a variable called ts and map it to common_timestamp, with ".usecs" appended, because otherwise it's in nanoseconds, and nanoseconds is too much information; I just care about microseconds. I cat that into the deactivate event's trigger file, and it gives me an error. I wrote it that way deliberately, to show what happens when you make a mistake (this is in recent kernels, by the way; this whole talk is about the most recent kernels anyway): you can cat error_log, and it tells me there's an invalid field modifier. I look at what I wrote and say, oh, I used ".usec"; that's not right, it's plural, ".usecs".

So I put common_timestamp.usecs in there, and now it works; if you make a mistake, look at the error log and it will remind you. Next, I create a synthetic event called activate_lat, and I give it a pid and a lat field. The pid is a u32, an unsigned 32-bit integer, and the latency is a u64, because timestamps are 64-bit. I echo that into synthetic_events, and that creates the event for me.

Now comes the most complex part, but it's described well in the documentation, and we also have user space tools coming that use SQL syntax to make this a lot easier; that's in the future. For now: I echo "hist:", the keyword, then "keys=pid". This is the pid of the activate event, not the deactivate event; these two fields are what map between the two events, and they just happen to have the same name on both. (For, say, sched_wakeup and sched_switch, the wakeup side would be pid and the switch side would probably be next_pid.) I create a new variable called lat, set to common_timestamp.usecs of the activate event minus $ts. The dollar sign means it's the variable ts, which maps to the first event it matches: when the two events match, a variable defined on the first event can be referenced in the second event with the dollar sign. So it takes the timestamp of the activation and subtracts the timestamp of the deactivation, and that gives me the latency I want to trace.

Then there's the onmatch() keyword: onmatch(kprobes.deactivate). "kprobes" is the system; remember the groups: all kprobe events go under the kprobes group unless you specify one. So activate is matched against deactivate here. And when the two match, the ".trace(...)" part (by the way, I didn't have room on one slide, so the line continues after a quote; the parenthesis and the dot belong together) calls a synthetic event: it triggers activate_lat and passes in the pid and the latency, $lat, using the variable names. All of that for this.

After that, since I created a synthetic event, there's an actual event under the synthetic group named activate_lat. I echo 1 to enable it, look at the trace, and down there you see the activate_lat events with the latencies, in microseconds, of how long each task was deactivated. If I want a histogram out of this, I use the histogram trigger on the synthetic event itself: "hist:" into activate_lat's trigger. Like I said, we used triggers for the attachments, but you also do histograms with them. Here my keys are the pid and the latency, and I only want to record an entry if the latency is less than 10,000; see the sketch after this paragraph.
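Putting the whole flow together as a sketch (offsets are placeholders as before; the trigger syntax follows Documentation/trace/histogram.rst):

```sh
cd /sys/kernel/tracing

# Two kprobes; note ">>" on the second so it appends instead of overwriting
echo 'p:activate activate_task pid=+0xaa8($arg2):u32' > kprobe_events
echo 'p:deactivate deactivate_task pid=+0xaa8($arg2):u32' >> kprobe_events

# The synthetic event that will carry the result
echo 'activate_lat u32 pid; u64 lat' > synthetic_events

# On deactivate: remember the timestamp, keyed by pid
echo 'hist:keys=pid:ts=common_timestamp.usecs' \
	> events/kprobes/deactivate/trigger

# On activate: compute the latency and fire the synthetic event
echo 'hist:keys=pid:lat=common_timestamp.usecs-$ts:onmatch(kprobes.deactivate).trace(activate_lat,pid,$lat)' \
	> events/kprobes/activate/trigger

echo 1 > events/synthetic/activate_lat/enable
cat trace

# And a histogram of the results, filtered to short deactivations
echo 'hist:keys=pid,lat if lat < 10000' > events/synthetic/activate_lat/trigger
cat events/synthetic/activate_lat/hist
```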
The reason for the filter is that there were a lot of large latencies; some things stay deactivated for a long time, and I only wanted to see things deactivated for a short time, so it would fit on a slide. But it also shows that you can filter histograms, and pretty much everything takes a filter: you can filter events too, in a similar way; read the documentation, I'm not going to cover it now, but there's a lot the tracing infrastructure gives you. I pass that into the trigger, then cat the hist file, and it shows me a nice little histogram of all the times. Oops, I didn't sort it here; there's a way to sort it so it comes out nicer than this. But that's just to show you it exists. Go read the documentation to find out how to do it properly for your own use case.

Now, debugging the kernel. Say you're modifying your device driver, and you're in an interrupt handler that's hit thousands of times a second. If you put a printk() in there, it fills up the console and you can't debug anything; you've live-locked the machine, because by the time the printk finishes, the next interrupt has happened and you printk again. It's really hard to debug. trace_printk() was added so you can trace within your kernel while you're debugging it. You can put them all over the place, thousands of them, and it's so fast you'll hardly notice the overhead. It records into the ring buffer, and you can read it back out, and you can even have it triggered: there's an ftrace_dump_on_oops option you can set on the kernel command line or through the sysctl interface in /proc, so that if an oops happens, it dumps the entire ftrace ring buffer and you can see all your trace_printk output. Just be careful: if you're going to use that, shrink the size of the ring buffer first (there's a way to do that; see the documentation). Otherwise you might be waiting a few days for the output to finish on a slow serial console.

trace_printk() works just like printk() but can be used anywhere: in interrupts, in the scheduler, in NMIs. It's lockless, it's fast, and you don't know how much work we did to optimize it. If you look at trace_printk(), it's actually a macro that uses compiler built-in tricks: if you pass just a plain string with no format parameters (not a "%s" with an argument, just a string), it converts the call into trace_puts(). And it doesn't even do the full sprintf-style processing: it uses binary format processing, where it records just the format and the arguments you passed, and does the actual formatting later, on the read side. It really tries to be extremely, extremely fast.

But no, it should not be left in production code.
The reason is that trace_printk() is like printk(): if it's there, it always prints. You don't have the option of turning them on and off individually; you can only turn them all on or all off. If I allowed trace_printk() to be something used in production, everyone would do it, and that would put so much noise into the buffer when you're trying to trace something. You'd have all the trace_printks from everyone else's drivers, and so much noise that you wouldn't be able to see anything. If you find that the information from a trace_printk() is important enough to have in production, turn it into a trace event with the TRACE_EVENT() macro, so it can be enabled and disabled and won't clog up the ring buffer for everyone else. If you do leave one in, there's a trick in the build that says, "ah, you used trace_printk()," and prints a really nasty notice on your kernel console. Some of you might have seen it. It scares people, and it was legitimately made to scare people; that was the only way I could make sure trace_printk() doesn't end up in production. But if you see that notice in your own debug code while you're debugging: good.

Remember that activate_task function where I put the kprobe? Say I'm building the kernel anyway and doing my debugging there. This is the equivalent of that kprobe: I just put a trace_printk() at the beginning of the function, saying give me the pid and give me the comm, and boom, I get the same information I had with the kprobe. A sketch of that is below.
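As a sketch of that debugging hack, a hypothetical edit to activate_task() in kernel/sched/core.c (debug only, never to ship):

```c
/* kernel/sched/core.c -- debugging only, never for production */
static void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
	/* same data the kprobe extracted, written to the ftrace ring buffer */
	trace_printk("activate: pid=%d comm=%s\n", p->pid, p->comm);

	/* ... original body of activate_task() ... */
}
```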
So you don't have to do everything with the tracefs file system directly; there are tools built around it. The main one is trace-cmd. It's an interface to the tracefs files, and if you go to trace-cmd.org it has full man pages for everything: you can enable tracers, enable individual events, set up filtering and triggers, and all that lovely stuff can be done through the trace-cmd command. It also has bash completion, so if you don't know what to type next, hit tab and hopefully you'll find what you expect. On top of that is KernelShark, a GUI. We just released KernelShark 2.0 (I still have to finish up and make the announcements better), which supports plugins and lets you visualize tracing between hosts and guests, latencies, and a bunch of other things. It's a GUI for visualizing trace information, because sometimes text is just too much and you can't see what you're looking for with that much data coming at you.

Then there's libtracecmd, a library built around trace-cmd that lets other applications read and write trace.dat files. trace-cmd record reads the raw ring buffer and writes it into the file system using splice, which is pretty much zero-copy, so it's really fast; you can record a lot more data than fits in the ring buffer. trace-cmd report then reads that data back and parses it, and that's the file KernelShark reads too. libtracecmd is how you can read that file yourself. We've just started on it; it has the bare minimum for KernelShark to work, but we're going to keep expanding it until all the functionality of trace-cmd is in the library.

What this presentation wants to focus on a little more is libtracefs, because libtracefs is the way to interact with the tracefs directory without having to know all the files. It encapsulates the policy: "I want to enable this tracer" becomes an easy call, without knowing that the file is called current_tracer; it does the work for you. If you want to write a C application that interacts with tracing, libtracefs is for you. And libtraceevent is used by perf and PowerTOP and others, because it knows how to read the format files. Remember that nasty format file, where the macros were converted to hard numbers and it's really hard to read? libtraceevent parses that and makes it easy for you to get events out of the raw data.

One more file I want to talk about before we leave the tracefs directory: trace_marker. This is a way for user space to write into the ring buffer and inject something. If you start an application, open this file as a file descriptor, keep it open, write to that descriptor throughout your application, and close it at the end, your data is injected into the trace at the time of each write. The reason I say open it first, at the beginning of the application, is that you don't want the overhead of the open on the trace path; you just want the write. And the write is fast too: it's streamlined to do the bare minimum to get the data into the ring buffer and get right back out. You do pay for the system call in and out, but other than that it's quick, and I think we made it lockless as well.

To give you an idea of how this works with instances: I go into the tracing directory and echo "hello top level" to trace_marker, then create a foo instance and echo "hello foo" to instances/foo/trace_marker. Cat the top-level trace and you see "hello top level" right there from the trace marker; look at foo's trace and you see "hello foo", as sketched below.
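A sketch of that trace_marker demo:

```sh
cd /sys/kernel/tracing

echo hello top level > trace_marker          # injected into the top-level buffer
mkdir instances/foo
echo hello foo > instances/foo/trace_marker  # injected only into foo's buffer

cat trace | tail                 # shows "hello top level" as a tracing_mark_write
cat instances/foo/trace | tail   # shows "hello foo"
```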
libtracefs gives you helpers for using this. tracefs_print_init() is the way of opening the marker up front; like I said, you want to open the descriptor early. If you don't call it and just call tracefs_printf(), it checks whether the marker was opened; if not, it opens it then, with the overhead, but keeps it open afterward. That's why you call tracefs_print_close() when you're finally done. These functions let you pass an instance structure in, and libtracefs gives you the ability to create your own instance, so each application can have its own instance it's dealing with, without worrying about trace events from other instances or the top level; tracefs_print_init() and tracefs_printf() will write to that specific instance.

So here's an example of how to use libtracefs; I'm hoping libtracefs will soon be in all the distributions. You may have to use pkg-config to find where it's installed, because you have to link with -ltracefs when you build your C file. This simple trace-print C file is shown in its entirety; I've uploaded the slides, so hopefully you'll get them from the Linux Foundation. If you cut and paste it and compile it (like I said, you need libtracefs and libtraceevent installed, and you may have to pass -l and -I flags to find the paths for the tracefs header and library) and execute it, it actually works.

Let me walk through quickly what it's doing. First it creates my own instance, which I called my_buffer. See how easy that is? Declare a struct tracefs_instance pointer, call tracefs_instance_create(), and it's done. Then I do my print inits. I'm doing it this way because I'm going to write to the top level as well as to my instance: whenever you want to reference the top level in libtracefs, you pass NULL as your instance, and it assumes the top level. So tracefs_print_init(NULL) initializes the trace marker for the top level, and then I do the init for the instance, which opens the file descriptor for the instance's trace_marker file.

Then I do tracefs_printf(NULL, ...) because I'm writing "enable events" to the top level, and then I enable events within my instance. This shows how to enable a group: tracefs_event_enable() takes an instance, the event group ("sched" here), and an event name. Passing NULL for the event means enable all events within this group; NULL for both means enable all events, period. A string for the group means all events in that group. If I put "sched" and "sched_switch", it enables just the sched_switch event. I could put NULL and "sched_switch" if I forgot which group it's in, and it enables every event named sched_switch in all groups. Currently pretty much every event name is unique, but if you create your own probe with the same name as an event, passing that name would enable both the probe and the other event. That's just so you understand what NULL on the group side means.

Then I print into the instance "going to sleep", sleep one second, print "wake up", disable all the sched events, and write "disabled events" to the top level. I'm not destroying the instance, because I want to be able to look at it afterward; but if I wanted to delete it, I'd call tracefs_instance_destroy(). I did forget a couple of things here: I need a tracefs_instance_free() at the end, which I just noticed I didn't have, and tracefs_print_close(). And another bug I just realized: the destroy needs to come after the close, because destroy just deletes the directory, and with the marker file still open it may not actually go away. So the destroy should have been after the close. Bugs in my code; send me a patch, whatever. A cleaned-up sketch of the whole file is below.
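Pulling the walkthrough together, here is a sketch of the program as described; the slide's original file isn't reproduced here, but the calls are the libtracefs API, with the free/close ordering fixed as noted:

```c
// simple-trace-print.c -- sketch; build with something like:
//   gcc simple-trace-print.c $(pkg-config --cflags --libs libtracefs)
#include <stdlib.h>
#include <unistd.h>
#include <tracefs.h>

int main(void)
{
	struct tracefs_instance *inst;

	/* create our own instance (buffer) called "my_buffer" */
	inst = tracefs_instance_create("my_buffer");
	if (!inst)
		exit(EXIT_FAILURE);

	/* open the trace_marker descriptors up front;
	   NULL means the top-level tracing directory */
	tracefs_print_init(NULL);
	tracefs_print_init(inst);

	tracefs_printf(NULL, "enable events\n");

	/* group "sched", event NULL: enable every event in the sched group,
	   but only within our instance */
	tracefs_event_enable(inst, "sched", NULL);

	tracefs_printf(inst, "going to sleep\n");
	sleep(1);
	tracefs_printf(inst, "wake up\n");

	tracefs_event_disable(inst, "sched", NULL);
	tracefs_printf(NULL, "disabled events\n");

	/* close the markers before freeing; keep the instance directory
	   around so we can look at its trace afterward */
	tracefs_print_close(NULL);
	tracefs_print_close(inst);
	tracefs_instance_free(inst);  /* tracefs_instance_destroy() would remove it */

	return 0;
}
```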
Now if I run this simple trace-print and then do trace-cmd show, which is basically a way of doing a cat of the trace file (but notice I'm not in the tracing directory; trace-cmd finds the tracefs directory and does everything for you), at the top level you'll see my "enable events" and "disabled events" writes. And if I do trace-cmd show -B my_buffer (capital B means an instance; B stands for buffer), that's the equivalent of catting instances/my_buffer/trace, and there you see all the sched events that were enabled, as well as my "going to sleep" and "wake up" writes. So with that, I don't know how much time I've used, probably a little over, almost an hour, but that's all I have. Thank you.

We have a couple of questions in now. Okay, let me stop sharing so I can see what's going on. I think it's the same question twice, but yes: what is the best strategy to trace RCU stalls? For RCU stalls, what I usually do is trace the scheduling events, interrupts, and things like that. Something I should have mentioned in this talk: inside the kernel, alongside trace_printk(), you also have tracing_on() and tracing_off(). Nobody really uses tracing_on(), but a tracing_off() call will disable tracing of the top-level instance (trace_printk() also writes only to the top-level instance). There's also a traceoff_on_warning option: if a warning triggers, it disables tracing by flipping the tracing_on file in the tracing directory from one to zero. That way, when an anomaly like a warning happens, tracing stops and you have all the events up to the point of the warning. You don't lose them to the buffer wrapping, because the ring buffer is finite and it's a ring, so it deletes the old data; stop the tracing and you can look at everything that happened up to then.

I have debugged RCU stalls before, and usually the process is: enable all function tracing (sometimes that works, to see what functions are being called), put in trace_printks, or just enable every event, and when the RCU stall happens, turn tracing off right there. And one more thing you can do, without modifying the kernel: if there's a traceable function being called within the RCU stall path, the function tracer has a way to attach a probe that disables tracing when that function is hit. You can also do it with a trigger on an event. So either use a kprobe, or use the function tracer's mechanism for "when this function is hit, turn off tracing." I have done that; a sketch is below.
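That function trigger looks roughly like this; "some_function" is a placeholder for whatever traceable function gets called in your stall path:

```sh
cd /sys/kernel/tracing

# When some_function is hit, switch tracing_on from 1 to 0,
# preserving everything recorded up to that point
echo 'some_function:traceoff' > set_ftrace_filter

# the event-trigger equivalent:
#   echo traceoff > events/<group>/<event>/trigger
```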
In one case I attached exactly that and said: turn off tracing when this function is hit, when the stall happens. Then I have a bunch of data and have to go through and figure out, okay, why did this stall? Hopefully that answers the question.

There's another question in the chat. I think it's a question: can you measure latency between two events that don't have a matching parameter? Well, if you have two random events and you just want the latency between them, my question is: what about those events? I'd need to see a use case for it. If there isn't one, I'm not going to worry about it. If there is, I'm sure we could find something, because usually there's something you want to attach to (a process, say) to see the latency between things. Maybe you could make up a variable and match on the same variable name, I don't know; there might be something. Show me the use case and I'm sure I could find a way of doing it.

Any other questions? Or did I just talk way over everyone's head? Andy, you said there's another question in the Q&A; the two questions looked about the same. Would you like to turn on your microphone and ask? Looks like we have another question, and I think it's the use case, or the same one. Yes: input from an SPI device, processing in user space, then sending the result out through Ethernet. And there's nothing in there that would match the two events, like a packet ID, a pointer, something you could use to match them. If it's totally unique like that, one thing I'd say is: you don't need to use the static event matching. Just enable both events, write your own application, and compute the latency yourself using libtracefs. By the way, somebody mentioned this thing coming out called trace-cruncher; I should have had a link to it. Yordan, one of my team, is working on it, and it basically exposes libtracefs as a Python module. So you can write straight Python with full access to libtracefs: a Python script that says enable this event, enable that event, then read the buffer and, for every event, do something, so all your processing happens within a Python script.

A few more questions in the Q&A. Other than irqsoff, preemptoff, and hardware latency, is it okay to have ftrace enabled in production? I mean, ftrace is enabled in most production configs; in fact, I basically require it for a lot of things. Unless it's a really locked-down system where you're paranoid and can't trust root (say you've disabled module loading and all modification of the kernel, and you want the machine to do one thing and never be modified in any way), then you might have ftrace disabled, since the ftrace interface is root-only. Other than that, ftrace really doesn't cause much overhead at all.
It's in the noise in all the tests and benchmarks we've done. But maybe you're asking: is there anything else you don't want enabled? irqsoff, preemptoff, maybe the hardware latency detector; I'm actually working on ways to get those enabled too. The hardware latency detector is something you can have enabled, and I've actually seen it enabled in some production configs; it only costs you something if you actually run that tracer, so it's up to you. Yeah, on my system it looks like all of them are enabled; well, not irqsoff, but the others. irqsoff, I guess, because that one actually does have a measurable impact, maybe 1%, just from having it configured in.

So, Allison: any general advice about tracing with RT kernels, beyond the observation that event tracing is less likely to disrupt the behavior we want to observe, and that with enough granularity, filters and function tracing can be useful? By the way, Allison, you do understand that ftrace came from the real-time work. Most of this functionality, not in the same form, was in the real-time kernel to debug and find latencies. That's why everything is very focused on real-time workloads: that's actually where ftrace came from. And that's also why we try to make it really low overhead, because we don't want the Heisenbug where you see a bug, you enable tracing, and the bug goes away. That happens sometimes.

So ideally you do this incrementally. If you enable all functions to be traced, yes, that's going to cause overhead. So usually what I do is enable just, say, the lock functions; that's still quite a bit, but the function tracer is really fast per event, so a lot of times it works pretty well. You start off slow: if enabling everything makes your latencies go through the roof, don't do it. In fact, cyclictest, one of the more common utilities used to analyze how real-time a system is, actually has hooks into the tracefs file system to enable tracing markers and such; it has since been changed to have trace-cmd do all the work and then run cyclictest. But usually we tell people to do a minimum of things; like I said, use the filters. I could probably give a whole talk on that alone.

Let's see, did I lose... there's a few more. Yep, I see: "Can you speak briefly on the differences between ftrace and other Linux performance tracing tools, why you'd choose one versus another, and whether some tools are naturally more suited to particular debugging use cases?" Okay, good question, because there are other things: there's ftrace, you have perf, the BPF tracing utilities, and LTTng; and then there's strace and SystemTap. Honestly, I don't really use SystemTap much, and LTTng has its own way of doing things, so I'm not going to compare those. We're all very similar, but we have different objectives. For debugging, I think ftrace is probably the best, because it's right there: it needs no utilities sitting on top of it. The tracefs directory is probably what makes ftrace unique: you can go onto a machine with only BusyBox on it, without any of the utilities you'd need to install for any other tool, and as long as we have a shell, we can use ftrace. That's the key to keeping ftrace around.
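To make the "start with filters" advice above concrete, the raw tracefs version of tracing just the lock functions might look like this; the *lock* glob is only an example of narrowing the set:

  # cd /sys/kernel/tracing
  # echo '*lock*' > set_ftrace_filter
  # echo function > current_tracer
  # cat trace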
They also share a lot of functionality. The trace events are used by perf, BPF, and LTTng; the tracing infrastructure is used by everything too. A lot of the infrastructure is shared; it's just that the use cases are a little different. perf I find really great for profiling: if I just want to profile something, I tend to use perf. With ftrace and BPF, some people say, well, you could do this type of filtering with BPF, why don't you just use BPF? To me, ftrace is like bash and BPF is like C. If you just want to knock out something quick and very common, you write a bash script; if it starts getting more involved and complex, you might switch from bash to C code. I feel the same way about tracing: if it's something quick, a lot of times ftrace is just good enough; if you're doing something more complex where you need decisions made within the kernel, you may need to switch over to BPF. But I'm also trying to make ftrace and BPF work together as well.

Does perf use tracefs underneath? Actually, some of it does. That's what libtraceevent, which does the parsing, is for: perf reads the trace event format files within the tracefs events directory. There is also an ftrace portion of perf where you can enable function tracing and such. But perf mostly has its own interface; it uses the perf system call to do most of its enabling and disabling of functionality. For the format files, though, it does use tracefs, and that's why the libraries are shared.

Can perf be used to configure and use all the features of ftrace without getting your hands dirty in the tracefs directory directly? There was some work on that; trace-cmd and perf were kind of working together at one point, and I do work with the perf folks too. But trace-cmd is the equivalent there: it's similar, just focused on ftrace only. trace-cmd is basically what we use to avoid getting your hands dirty. I actually seldom go into the tracefs directory these days; I use trace-cmd almost exclusively.

And what would you advise for debugging early kernel startup, say after ftrace is initialized? Well, if you look at the kernel parameters documentation, there are a lot of ftrace options: you can enable events, enable function tracing, enable filtering, and I plan on adding a bunch more. Also, Masami Hiramatsu has introduced bootconfig, which allows you to attach complex command-line options to the init RAM disk. You'll find the tool under tools/bootconfig inside the source code. If you compile that, you create a file, kind of like a JSON-style file, that says: enable this function, enable these events, with a lot of complex features. Then you run bootconfig to attach it to the initrd you boot with, and you add "bootconfig" to the kernel command line. That tells the kernel on boot-up: there's a bootconfig attached to the RAM disk, go read it; and it then enables everything you asked for. And we try to make ftrace start really, really early in boot-up; it's up before the initcalls are even done.
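A rough sketch of that flow, with made-up file names; the key names below follow the kernel's boottime-trace documentation, so double-check them against your kernel version:

  # cat tracing.bconf
  ftrace.tracer = function
  ftrace.event.sched.sched_switch.enable

  # bootconfig -a tracing.bconf /boot/initrd.img
  (then add "bootconfig" to the kernel command line and reboot)

For simpler cases, plain kernel command-line parameters work too, for example:

  ftrace=function trace_event=sched:sched_switch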
So basically, once memory has been initialized, ftrace is available pretty much immediately after that, before the other CPUs even start. So you can trace a lot of early boot-up. Okay, I think I answered that one, I hope. Any other questions, or anything else? I do hope everything was useful.

Awesome, okay. Thank you so much, Steve, for your time today, and thank you to all the participants who joined us. As a reminder, this recording will be on the Linux Foundation's YouTube page later today, and a copy of the presentation slides will be added to the Linux Foundation's website. We hope you'll join us for future mentorship sessions. Have a wonderful day. Thank you.