My name is Steven Rostedt. I'm the creator and maintainer of Ftrace, the tracing utility in the Linux kernel, and that's what I'm here to talk about. But before I start, I have to do the thing most of you know is coming. This is a camera. You take pictures with it; you can't make phone calls with it. It's going to take my selfie, which I'll look at later and maybe upload, hopefully without forgetting. I'll put it up here so I don't forget it, powered off so the battery doesn't die. Okay, next. What is Ftrace? I'm assuming most people here already know, but here's the quick lowdown. It's the official tracer of the Linux kernel, added in 2008 in version 2.6.27. I think its birthday is actually coming up; I have it in my calendar. Yeah, I'm that kind of person. Now, "Ftrace" is really the function tracer, and it's a bit of a misnomer to use the name for the whole thing. What people associate with Ftrace is really the tracing infrastructure, which I prefer to call tracefs. Ftrace proper is the function-hooking machinery: it's what live kernel patching uses, BPF builds its trampolines on it, and lots of other things use the function hooks. That's what Ftrace really is. But since it was created along with the rest of the tracing code, the names got lumped together, and "Ftrace" is nicer to say than "tracefs" or "the tracing infrastructure." The infrastructure was designed with embedded in mind. I come from the embedded world, where I like to be able to do things with just BusyBox and not have to install other applications on top of it. That's why in the tracefs directory, which is /sys/kernel/tracing, most of the functionality of Ftrace is driven by simply echo and cat. If you have echo and cat, you have most of the functionality of Ftrace, the tracing utility. But I've given so many talks about that already; go find them elsewhere.
Kernel Recipes (kernel-recipes.org) has lots of great videos; several people here have given talks there, and I recommend going. So, tracefs: the file system for interacting with the tracing system. You can enable and disable tracers from it, and you can create dynamic events. You can say, hey, hook a kprobe in this place: if a trace event doesn't exist somewhere, you can echo in a command to create one. See my previous talks at Kernel Recipes and Open Source Summit for details on dynamic events. Eprobes are event probes, which means you can attach an event to an existing event, for when you want to see more information than an event exposes. Take the timer events, which carry a pointer to the timer structure. If you know the index of a field within that structure, say from opening it up in GDB, and that field isn't exposed by the normal timer event, you can add an eprobe that says: take this timer pointer, dereference it, and give me the value here. Again, I cover this in other talks; it's not for today. Uprobes are a way to put a probe into user space: it triggers, drops into the kernel (via a page fault or a breakpoint), and comes right back to user space. Synthetic events are something I actually am going to talk about today. And triggers: on any of these events, kprobes, eprobes, static events, synthetic events, you can trigger actions like taking a stack trace, making a snapshot of the buffer (there are two buffers, and you can snapshot into the other one), or disabling tracing.
So if there's a path where you know something can go wrong, you can set a filter that matches that path and disables tracing, so you don't lose the buffer. When the anomaly triggers, the tracer stops. Then later, say the next day after it's run all night, you wake up, see that it triggered, and go, "Oh," and you've got a trace that stopped right at the point where the bug happened. Triggers are also how you create synthetic events, which I'll get to later. So, this talk. Yes, question? Ah, that's what happened, thank you. Better? Okay, I thought something looked weird. So, libtracefs. I wrote a tool called trace-cmd, which I'll talk about a little here, that interacts with the tracing infrastructure so you don't have to use the BusyBox-style cat and echo commands; it definitely makes things easier. There are man pages, lots and lots of man pages. And there's even tab completion: do trace-cmd start -e and hit tab-tab, and it gives you a list of all the events you can add, so you only have to type a few letters, hit tab-tab, and it finishes the event name. It does the same for tracers and such. Useful. But what I really wanted was this functionality in other tools. I don't want everything to live in trace-cmd. If you have a tool and you want to access tracing, libtracefs is the way to do it: it's a library for interacting with tracefs. Anything that deals with tracefs goes into libtracefs. Your applications shouldn't need to know the details underneath; I try to abstract it out as much as possible.
So when you want to enable an event, you just call the tracefs event-enable function, pass in the event, and it's enabled. Like I said, it abstracts away much of the syntax. You can start and stop tracers, enable and disable events, start tracing. It can also read the raw trace data, which I'm going to show, and you can iterate over all the events, which I'll talk about later. And here's the GitHub link. By the way, I got the slides done an hour ahead of time, which is a record for me, so you don't really have to take pictures: go to the schedule page for this presentation, the slides are uploaded already. Wow, I'm really ahead of myself on this one. You can download them and click on the links; don't worry about copying all this down, the information is there. Now, trace-cmd. This application uses libtracefs; in actuality, libtracefs came from trace-cmd. trace-cmd was written first; I started gutting it, moving pieces into a library, and having trace-cmd use the library. So trace-cmd is a command-line utility that can start and stop tracing from the command line and read the trace ring buffer that's in the kernel. One thing it does is record to a file. trace-cmd start just enables tracing, but trace-cmd record actually uses splice to copy the data straight into the trace.dat file. How many people know what splice is? Okay, for those who don't, I'll tell you real quick. It's a way to connect a pipe to a file descriptor, and it moves data from the file descriptor to the pipe, or from the pipe to the file descriptor, without going through user space.
In other words, say you want to copy a file from one file system to another. The naive approach is to read a portion of the file and write a portion of the file: the data gets copied from the kernel up into user space, then copied back down into the kernel, over and over for the whole thing. With splice, you say: open this descriptor, connect it to a pipe (it has to use pipes), connect the pipe to the other file descriptor, and then just move the data. The data moves right through the kernel, zero copy. The Ftrace ring buffers were actually written to be optimized for splice, so when trace-cmd writes to the file, it connects to the tracing infrastructure through pipes and all it does is move the data from the ring buffer into the file. User space never sees the data that goes through trace-cmd; obviously trace-cmd report reads trace.dat so you can get the output. And there's trace-cmd.org; by the way, if you go to trace-cmd.org and click on the pony, it takes you through to all the other libraries, and if you click there, you get the man pages. Now, libtracecmd. The goal is basically to give all the features of trace-cmd to our other applications. I'm gutting trace-cmd into a library, taking as much implementation as possible and putting it into libtracecmd. Ideally, my dream is that the trace-cmd application becomes a thin shell around the library, so that all of its functionality becomes possible for any other tool. But there's still a lot more to do; currently it can only read trace.dat files.
I haven't gotten libtracecmd to write trace.dat files yet; that's on my to-do list. Like I said, it doesn't do it yet. And you can use it for analyzing traces, which is the point of this talk. This is where trace-cmd lives: if you just do make and make install, you get trace-cmd, but if you do make libs and sudo make install_libs, you get the libraries installed. Now, here's the application I traced; I'll go through it real quick. When I wrote the push/pull logic of the kernel's real-time scheduler, I noticed that inside the scheduler there was no good migration for real-time tasks. If a real-time task woke up on the same CPU as a real-time task with a higher priority, and other CPUs were available, it wouldn't run; it would just wait for the higher-priority task to finish. So I wrote the push/pull logic that spreads out the tasks, making sure that if a CPU is available for a real-time task, the task can get to it and not just sit waiting on the queue. To test this logic, I wrote a tool called migrate. It kicks off a bunch of tasks, usually double the number of CPUs; I think I had six CPUs when I ran this, which is why there are twelve tasks, at various priorities. The highest-priority one, down at the bottom, should have the least number of migrations, and the tool runs and reports that. Great. Why is this part of the talk? I needed an example of something to trace, but I had to give you an idea of what this tool is for. Now I'm going to trace it.
So right here I did trace-cmd record. I wanted to give the output a different name: by default it records to trace.dat, but if you want to record somewhere else you use the -o option. You don't need it if you don't care about the name, but I had other trace.dat files I didn't want to overwrite, so: trace-cmd record -o trace-wake.dat -e sched_waking -e sched_switch, and then the migrate command. So let's take the little example I just recorded. I took that trace.dat file and now I want to do some analysis on it. I wrote this program that uses libtracecmd. You create a handle; I always name the variable "handle", that's my personal convention. If you look at trace-cmd I always use the term handle, even though the structure is called tracecmd_input. So you'll see struct tracecmd_input *handle, and then handle = tracecmd_open() with the file, trace-wake.dat, and some flags, because the flags tell it whether to do some extra external work for you. By the way, this all has man pages; for every function I'm talking about, going back to that website, you can find the man page. Once you have the handle, you say: okay, I want to see what some of these events do. I want to follow the sched_waking event and I want to follow the sched_switch event. By the way, see this "sched" here? That's the system, the group. If you go into the /sys/kernel/tracing/events directory, you'll see a bunch of groups; you'll see a sched directory, and within sched you'll see sched_waking and sched_switch. That's what it corresponds to.
The thing is, you can actually pass NULL there, and libtracecmd will find the events itself: if it's NULL, it searches everything. Specifying the system is more of an optimization so it only looks in that one directory; leave it NULL and it searches until it finds sched_switch or sched_waking. Notice I have these callbacks, waking and switch: whenever the sched_waking event is hit, it triggers the waking callback, and when sched_switch is hit, it calls the switch callback. Which brings me to: you can also pass data into these. I set up a data structure, initialize it to zero, and pass it in, so I'm handing a data structure to it. Then I say: okay, go, let it rip, tracecmd_iterate_events(). What this does is read the trace.dat file, walk through all the events, and every time it hits one of the followed events, it triggers your callback. This is what the callbacks look like. I made my little data really simple: I just count every time waking is hit and every time switch is hit. By the way, when I first wrote this it didn't compile, because switch is a keyword in C; you can't name the callback switch, so I just dropped a letter. The callback gets the handle, the event, so you can get information from the event; the record, which is a representation of the raw data that was in the ring buffer, or inside the file, for that event; plus the CPU it was on and the data you passed. By the way, this isn't in my slides, but I'm going to make a quick point because I just realized it: be careful about that int cpu parameter.
I would not use that cpu parameter unless you know what you're doing, because the tep_record actually has the CPU for the event, and I should have used that in the example. The reason I say this: you can process multiple trace.dat files at the same time. There's another call, an iterate-events-multi variant; if you use that, it takes several handles, and it also handles virtual machines. If you use trace-cmd agent (another talk), where you connect the host and a virtual machine, it synchronizes the timestamps, and this handles the interleaving between the two sets of events, so you can see them together. Going back to that cpu parameter: when you have multiple trace.dat files, the CPU numbers keep increasing. If you have eight CPUs in one trace.dat file and two CPUs in the second one, the second file's CPUs won't be zero and one; you'll get eight for the first CPU of the next file. That's how you can differentiate which file an event came from. I probably shouldn't have done it that way, but I did, and it's actually used. So be careful about int cpu: it will normally work, but with multiple trace.dat files it won't; use the record's cpu field instead. Okay. Anyway, I just increment the counters, and at the end of my program I print out the data. I ran it, boom: the program I just showed you produced this. Not really that useful, but it demonstrates how easy this code is. The next thing is something that's on GitHub but has no man pages yet, because I'm still working on the API; I plan on it hopefully coming in the future.
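Putting the pieces from the last few paragraphs together, the skeleton of the counting program looks roughly like this. This is a sketch, not the speaker's actual code: it assumes libtracecmd is installed (trace-cmd.h), the signatures follow the libtracecmd man pages, and error checking is stripped for brevity.

```c
/* Sketch only -- requires libtracecmd; see the tracecmd_open(3),
 * tracecmd_follow_event(3) and tracecmd_iterate_events(3) man pages. */
#include <stdio.h>
#include <trace-cmd.h>

struct wake_data {
	int wakings;
	int switches;	/* "switch" is a C keyword, hence the name */
};

static int waking_cb(struct tracecmd_input *handle, struct tep_event *event,
		     struct tep_record *record, int cpu, void *data)
{
	struct wake_data *d = data;
	d->wakings++;		/* prefer record->cpu over the cpu argument */
	return 0;
}

static int switch_cb(struct tracecmd_input *handle, struct tep_event *event,
		     struct tep_record *record, int cpu, void *data)
{
	struct wake_data *d = data;
	d->switches++;
	return 0;
}

int main(void)
{
	struct wake_data data = {};
	struct tracecmd_input *handle;

	handle = tracecmd_open("trace-wake.dat", 0);

	/* NULL system: let the library search for the events */
	tracecmd_follow_event(handle, NULL, "sched_waking", waking_cb, &data);
	tracecmd_follow_event(handle, NULL, "sched_switch", switch_cb, &data);

	/* walk every event in the file, firing the follow callbacks */
	tracecmd_iterate_events(handle, NULL, 0, NULL, NULL);

	printf("wakings: %d  switches: %d\n", data.wakings, data.switches);
	tracecmd_close(handle);
	return 0;
}
```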
I'm going to show it anyway, just because it's coming in the future and it's interesting. It's traceeval, which gives you ways of processing the data: it does the counting for you so you don't have to. Right here I'm saying I want some information keyed on a process ID, so I define the key I'm going to keep track of: the pid info has type number, and the name is "pid". Then I allocate it; the "1" alloc means it has only one key (there are one-, two-, and N-key variants, which are done slightly differently). So I do the traceeval one-key alloc, pass in the pid info, and it gives me the handle for this traceeval structure. Then, back in my waking callback: first, a side note, this part is from libtraceevent, which is stable; it's all documented, it's in the man pages. I want information out of the event that came into the waking callback; specifically, I want the pid from the sched_waking event. So: a struct tep_format_field, and the value is a long long, because long long is the biggest of all the field sizes, so values always come back as long long. Notice, by the way, that the pid field pointer is a static variable, so the first time through it's NULL and I have to initialize it. So: if it's NULL, I call tep_find_field(), passing the event that was passed in and the field name "pid". That finds the field and gives you information about the offset of that field within the event's data, from which you extract the pid from the record. Also note: to fit the slides, I stripped out all the error checking.
The examples I have, which I'll link at the end, do have some error checking, but I didn't fully put it in because I was rushing; I wrote all this yesterday. I still have to clean them up, so hopefully the published code will have the error checking you should actually add, because all of these calls can fail and you should do something about it when they do. For example, if the pid field doesn't exist, the lookup fails. Anyway, you get the field, and then you read the value out of the record. So every time the event triggers, it tells me: here's the process ID that was woken up, here's the next one, and so on. I use that all the time. Back to the development stuff, the traceeval part. So traceeval here: I have my key, of type number, and now I call the start function. Again, you pass in the traceeval handle and what you're recording, which is the pid info. And actually, oh, I missed a highlight: the pid info has a type and a value, and I put the actual pid in as the value. I should have highlighted that, sorry; I just realized it myself, like, where's the value? So basically I'm injecting a key and a value into traceeval's little database, which has a hash table. It hooks that up and records a timestamp; that's the "start" on the waking side. On the sched_switch side I do everything again, but this time I ask for the next_pid field from sched_switch. A sched_switch is when one process leaves the CPU and another comes onto it, so next_pid is the process going onto the CPU.
So I get the field for that pid and call the traceeval stop function, passing the info in. What this does is match when a pid starts and when it stops, record the differences, and keep, well, kind of a history of it. At the end I ask for the results, and this part is kind of ugly; it's why I haven't published it, I don't like the interface, but I'm showing it anyway. This code does exist, so you can use it, but don't depend on it; it's not stable yet. So: give me the number of results, which is basically the number of pids. Now I iterate through. First you have to get the array of key indexes, because if you have more than one key you need the list of keys; then you ask for the key, which gives you the pid, and you print it. I don't really like this interface; I still have to work on it. Then it gives you the count of hits, the max, the min, and the total, which you can divide by the count. I also plan on incorporating standard deviation into this. So at the end, you run this code and it gives you, for every single process ID: what it woke up, the max wakeup latency, the min latency, and the average latency. Those libraries give you some real information very easily. So, enough of that; hopefully more on traceeval later. Anyway, back to synthetic events.
Synthetic events connect two different events: you have event A and event B, and it connects them to create a third event. It can also measure the timestamp delta between the two, and you can add fields from either event into the third one. Unfortunately, the raw interface for creating one is extremely complex, but there's tooling; in fact, my last talk at Kernel Recipes covered sqlhist, which creates synthetic events from an SQL-like line, but that's outside the scope here, watch the other talk. This [on the slide] is the syntax we have for it. Very easy, right? That was my reaction when I found out this has been in the kernel for something like ten years and no one was using it, because every time I used it I had to look up the documentation. I'm not the actual creator, by the way; I'm the maintainer, Tom Zanussi created it. But if the syntax is so bad that I have to look it up every single time, who else is going to use it? That's why I created sqlhist, and I also created a library interface for it. Once it's created, it gives you a nice little event in the synthetic group, and that's what the event looks like. So let's do it from libtracefs. This is how you create the thing. First I say: give me all the local events, the tep handle. Then I call this function make_event; make_event is not part of the library, it's part of my code, and it creates the synthetic event. You do the synthetic alloc, which gives you a handle. The first thing to pass is the tep handle, which has all the event information. Next is the name of the event I want: I want the synthetic event to be called wakeup_lat. Then you attach the first event and the second event; remember I said the system could be NULL? I threw in NULL in this case.
So really that would have been "sched", "sched_waking" and "sched", "sched_switch", but I didn't feel like typing sched, so I put NULL and let the library do the work for me. The first event is sched_waking, the second is sched_switch. Then I have to match the two: when sched_waking wakes up a process, its pid, I want that matched against sched_switch's next_pid. When those two match, you get the trigger. Now, the last argument: since it's doing the match anyway, you can make the matched value a field of the synthetic event, or just ignore it. If you want it as one of the fields, you give it a name, so here pid is a field from the match; if I passed NULL, it would just do the match and throw the value away. Then I also want the timestamp. So I add a compare field: the first argument is what to compare from the first event; you'd normally put a field of the first event, but there's a macro you can use, TRACEFS_TIMESTAMP_USECS, which means the timestamp of the first event, in microseconds. Likewise for the second event. And it's a compare field, which means it actually does a subtraction between the two events. To say which direction, I pass TRACEFS_SYNTH_DELTA_END, which means the end timestamp minus the start timestamp; if you put the start variant it does start minus end, but you want the timestamp of the first event subtracted from the timestamp of the second event, so you say delta-end. By the way, this is all in the man pages. Then I give it the name the field will have in the synthetic event, and return the handle. The synthetic event is not created yet; all of this just builds the description in the handle.
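The builder described above looks roughly like this in code. A sketch, not the speaker's exact slide: it assumes libtracefs is installed, and the call shapes follow the tracefs_synth_alloc(3) man pages.

```c
/* Sketch only -- requires libtracefs (tracefs.h); mirrors the
 * wakeup_lat example, error checking stripped. */
#include <tracefs.h>

static struct tracefs_synth *make_event(struct tep_handle *tep)
{
	struct tracefs_synth *synth;

	/* connect sched_waking (start) to sched_switch (end),
	 * matching waking's "pid" to switch's "next_pid";
	 * the final "pid" makes the match a field of the new event
	 * (NULL systems: let the library find the events) */
	synth = tracefs_synth_alloc(tep, "wakeup_lat",
				    NULL, "sched_waking",
				    NULL, "sched_switch",
				    "pid", "next_pid", "pid");

	/* delta = end timestamp - start timestamp, in microseconds */
	tracefs_synth_add_compare_field(synth,
					TRACEFS_TIMESTAMP_USECS,
					TRACEFS_TIMESTAMP_USECS,
					TRACEFS_SYNTH_DELTA_END,
					"delta");

	/* nothing exists in the kernel yet: tracefs_synth_create()
	 * is what instantiates the event */
	return synth;
}
```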
The next thing is the synthetic create call; that's what actually creates the event in the operating system. After this call, you can cat synthetic_events and see it. Now, since I used the local-events call to create the tep handle, which read all the events before this, the handle doesn't know about the new event I just created. So I have to call fill-local-events, which basically tells tracefs to update the tep handle with the new event it just found. If I leave that out, I can't do the next line, which is following the event I just created: the follow uses the tep handle, the tep handle won't have the synthetic event, and you won't get a result; it will error out. So you have to do the fill so it gets the information. Which, by the way, is kind of stupid, because the tep handle was used to create the synthetic event; it should have been updated at that point. One reason I thought about doing that and didn't: at alloc time, the event hasn't been created yet. Remember, the synthetic create is when the event actually exists in the system. If I updated the handle at the time the description was built, it might give you false security: you'd go, oh, I can enable this event, it looks like it's there, and suddenly, no, it's not there yet. It's like you create the descriptor first, and then you create the event. So I could probably add a function, a synthetic create-and-update, where you pass the tep handle in and it does the create and updates the handle with the event.
So I might add another API to make those two steps one thing. Then I follow the event, same as before, with the callback, and then enable it. This is how you enable the event: the event-enable call with NULL, because I'm not dealing with instances right now. Instances are sub-ring-buffers: if your application doesn't want to interfere with other applications, you can create a unique sub-buffer, and this call would enable the event within that sub-buffer; that's what the NULL argument is for, and right now I'm doing everything at the top level. This is a semi-tutorial, not a full tutorial, so I'm skipping over a lot. Enable the wakeup_lat synthetic event, sleep, disable it. Then I call the iterate-raw-events function, which is similar to tracecmd_iterate_events(), remember, which iterated over all the events in the trace.dat file; this one iterates over all the events in the live system. There's no recorded file; it reads the ring buffer directly. Then at the end, by the way, the destroy removes the event from the system, and the free frees the descriptor; in my later slides I deleted the frees just to save space on the slide. Now, the callback handler. Here I used a trace_seq, which is defined in user space by libtraceevent, but trace_seq is also defined in the kernel; they both exist, and it's a way of creating and passing strings. The trace_seq here is static, so it starts out null; the buffer is the space it holds. I check whether the buffer is allocated, and I probably should have put a wrapper in there that says "empty" or "not allocated" or something, but I didn't. So right now this works.
If the buffer is null, I initialize it, which allocates the buffer. And every time I come in after that, I reset it so it's clear again, because I don't want to keep appending to the same buffer; it's temporary, just for the callback. Then I call tep_print_event(), a libtraceevent call: you pass in the tep handle, the trace_seq, the record, and a format. The format is kind of like a printf format but with some special keys; it's in the man pages. In this case I just used the print-info key. If you've ever cat'ed the trace file, or run trace-cmd report, everything at the end of the line is the info; that's this. Then I print it out, and done. When I ran it, it gave me this. The funny part is, I found out that the creation of the synthetic event is incompatible with libtraceevent: libtraceevent complained about parsing the fields inside the synthetic event. I left the error in; it's a bug I have to fix. But the funny part is that it still gave me the data I wanted: it failed the parse and then still printed the process ID, the next comm, and the delta. Oh, that's right, I removed the comm. Anyway. Moving forward: histograms. These are preallocated buckets in the kernel. The number of buckets is fixed, but it's modifiable when you create it; I think the default is around 2000-and-something buckets, I can't remember exactly, I'm tired, it's been a long day, but it's a power of two, I believe. And they're sparse buckets: if they're not used, it's not like using one bucket way over here wastes all the buckets in between. No, it's actually sparse.
When it says those buckets are allocated, that's actually only if they have content in them. It's processed from within the kernel, which means it does not drop events; it does the work at the time the event happens. But like I said, if you have too many entries, over 2000 different buckets used, it'll start dropping events, and it will let you know. Histograms are actually what's used to create the synthetic events. The way synthetic events actually work is you write some magic into the first event: you write a histogram trigger on the first event, a histogram trigger on the second event, and you have the keys match, with some extra magic to trigger another event, the third one. So that's how it does it. It puts the first event into one of its buckets, puts stuff into the other event's buckets, then searches the first event's buckets and says, oh, I found a match between these two with the keys, now I'll trigger a command. There are several commands; it's interesting, read more about it in the Documentation/trace directory, but it's outside the scope of this talk. So let's create a histogram with libtracefs. The way I did it was: tracefs_local_events(), my make_event() function, create the synthetic event; you already know that. Here's a little bit more information. Now I need to create axes, or keys, for the histogram. This histogram has only one key, so why are there two entries? Because it's dynamic, the array must end with a NULL. That's why there's a null key: a type of zero and a key of NULL. So I only have one key, and my key is delta, and I'm saying the type is buckets (I feel like that should actually work). I pass that in, then I say sort by delta, and then tracefs_hist_start(). Once you do that, it actually executes the histogram. At the end, it puts the result into a file called "hist" within the event's directory.
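For comparison, the trigger string the kernel ends up seeing is human-writable too. A hedged sketch of an equivalent histogram done with echo (event and field names are taken from the talk; the .buckets size is my guess):

```
# echoed into /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger
hist:keys=delta.buckets=10:sort=delta

# read the result back:
#   cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
```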
So if I were to look at /sys/kernel/tracing/events/synthetic/wakeup_lat/, where you see the format file, the enable file, all that stuff, there's a "hist" file there. To read it, I just call tracefs_event_file_read(): you pass in the instance, which is NULL since it's the top level; the trace event group it's in, which is "synthetic"; the trace event name, which is "wakeup_lat"; and the file you want to read within that directory, which is "hist". I get that into a buffer, and I print it. Again, the destroy means it removes it from the kernel, so I destroy the synthetic event, and then you have to free it. You should free the synthetic event too, but like I said, if I did that, I'd drop off the end of the slide. And this is what you get. This is actually the code that ran, what I actually executed. It shows you the histogram line it created for wakeup_lat, and right here it's sorted by what's called delta. Remember, it was microseconds: a latency of 160 microseconds had one hit; something slept once. I don't know what woke it up, but you've got the histogram. So finally, the title of this talk as it was listed is about finding out what you're blocked on, so we're finally getting to what most of you probably came here for. So: what I/O is taking a long time, or lock contention in the kernel, where something's blocked. Maybe lock contention in user space; well, that's a little bit more difficult, because that's sleeping in an interruptible state, not an uninterruptible block, but you could modify the code I'm going to show you to detect that as well. By the way, I think you'll like this line, right? "To come: SFrames." I'm going to be using this for that too, because I want to see where things are blocked. SFrames is a new feature by Indu over there.
Thank you very much. It's something she created; it's like the ORC unwinder, but for user space, so we can actually have the kernel easily do stack unwinding of user space. We're working on that, so this will include that soon too, and I'm really excited about it. So inside that make_event function, I said, okay, let's do something a little different. Instead of creating that wakeup latency event, I'm going to create a new synthetic event based off of sched_switch as my first event and sched_switch as my second event. The same event. Why is that? Well, I'm going to base it off of fields. For the first event the field is sched_switch's prev_pid, which is what's scheduling off the CPU. For the second event it's next_pid, which is what's scheduling onto the CPU. So what I'm doing is making a synthetic event that tells you how long the task was off the CPU. Now I'm also going to do a little trick. Okay, here's the note: when I wrote this code, this feature came into Linux kernel 6.3. We're on the release candidates of 6.4 now; 6.3 is the latest release. I found a bug someplace else and I fixed it, and by fixing that bug, I disabled this feature, I just noticed. So I actually had to fix the fix to re-enable the feature for this. So this actually only works on my laptop right now. Give it a couple of weeks for this to actually start working, because I want to get it upstream and everything else. So yeah, I found a bug both in the libtracefs code and in the kernel. This is broken, but it will work, at least in a week or two. Anyway, I'm going to say: on the start event, add a field based off of the start event; that is, on the sched_switch when it switches out, do the stack trace.
It's the kernel stack trace, not the user space stack trace, but we'll be able to do the user space stack trace as well in the future; it might be a little more difficult. Anyway, it does the stack trace when the task schedules out. So let me wake my machine up again here. Yeah, the thing about doing a stack trace on the sched_switch event is that it only takes the stack trace of the task that's scheduling out. And that's what I want; I don't want the stack trace of the task that's scheduling in at that moment. I can't do a stack trace off the synthetic event, because the synthetic event fires when the second event happens: if I were to take the stack trace off of the synthetic event, it would give me the stack trace of whatever task just scheduled out right beforehand, not the one I want. I want the stack trace showing where the task that's now scheduling in had blocked, and that stack only exists at the moment it scheduled out. So I must do the stack trace when it's scheduling out. That's why I attach the stack trace to the start event, and I also save the previous state, because I want to know that field as well. And down here I add a filter, because I only care about things that are scheduling out in the uninterruptible state, which I happen to know is 2. If I cared about tasks scheduling out in the interruptible state, it would be 1. If I wanted both, I would put 3, because if you notice, the compare operation is the libtracefs AND compare, so it's going to take the previous state and AND it with the 2. Again, this is all in the man pages. Going back to the main file, I've changed the axes from two to four, because I now have three different keys I want to keep track of.
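Underneath the libtracefs calls, this start-event/end-event construction is again histogram-trigger magic. A rough sketch of the raw form, modeled on the kernel's histogram documentation (untested here; field names follow the talk, and kernel 6.3 or later is needed for the stacktrace variable):

```
# declare the synthetic event with a stack field
echo 's:block_lat pid_t pid; u64 delta; unsigned long[] stack;' \
        > dynamic_events

# start event: task schedules out uninterruptible; save timestamp and stack
echo 'hist:keys=prev_pid:ts=common_timestamp.usecs,st=stacktrace if prev_state & 2' \
        >> events/sched/sched_switch/trigger

# end event: same task schedules back in; compute delta and fire block_lat
echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts,s=$st:onmatch(sched.sched_switch).trace(block_lat,next_pid,$delta,$s)' \
        >> events/sched/sched_switch/trigger
```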
I want the first key to be the state. When I first wrote this I did it with 3, both interruptible and uninterruptible, but then I found out that the interruptible hits I was getting were not very interesting; only the uninterruptible ones were. Although every so often you do see it going through the futex code, so you know those are user space locks, which is why I want SFrames. The second key is the stack: I want the stack trace to be part of the histogram. So then I allocate it again with those axes. I'm going to sort by hit count this time, because I want to see the highest number of hits, not the lowest. I could have done delta again too. Actually, if I did delta here... shoot, I think I forgot to add delta. Then I executed this new run() function I just added, which is just an exec; okay, I want to actually record something. And then I print the histogram, which gives me this. And actually I could add delta in there with the timestamp... no wait, I did have delta in here; I just didn't have it in there for the log. Yeah, okay, so I missed it: the very first one. Oh, I had a typo in there, no wonder, there's a missing key. I deleted it: I was trying to delete things to fit the slide, and I actually deleted one of the keys right here. That's why it says the zero key is missing, which would have been the delta. And it's in log2: a binary logarithm scale. So here, the time it was blocked in the uninterruptible state was two to the 17th nanoseconds. So it's not as big as you might think; this is nanoseconds, so it's 2^17 nanoseconds, about 131 microseconds.
And here's the stack trace of where it was blocked, which was, what is it, a down_write; it was on the anon_vma root, so it was the freeing of the page tables, and it was in exit_mm here. So you can actually see it was in do_exit that something was blocked for a long time. But it was exiting, so we really didn't care. And then the other one was another do_exit. So basically when processes exit, I guess they get blocked for a very, very long time. But I guess we really don't care. Still, you can see it. And by the way, I cut this off; there's a full histogram. And believe it or not, I've yet to really overflow the 2000 buckets, because you find out that a lot of the time the histograms fit in the buckets that you want. Anyway, that's the end of the talk. The links to all the code, all the samples I used, are here in the slides. So if you download the slides, you've got the links. Any questions? Or does everyone just want to go drink? Okay. Oh, go grab the mic there.

Since, as you mentioned, you want everything to work with just cat and busybox and all that: for the SFrame stuff, what are we doing about getting the symbols into the kernel for the stacks? Because just the numbers will not be that useful, right?

You could get the symbols; they're going to be in the ELF file. So when the busybox binary runs, all the symbols go with it. What's the question?

So, for example, if you use the stacktrace trigger, what you get right now, from what I could see, is a list of hex values. How are we going to get...

Yeah, the stack trace is going to be hex for now. This is more for later talks, when we talk more about SFrames. Weren't you at LSF/MM when I talked about SFrames?

I was, but I think you mentioned symbols on one of the slides.
So basically the idea for SFrames is that we also have access to the proc maps. So we can take the address and get the file it was from and the offset. What would go out in the trace output would not be a bunch of hex; it's going to be file plus offset, file plus offset, file plus offset. All that code has to be written in the kernel to do the conversion, and it's extremely simple; we've got proc maps available in the kernel. It's just going to be using the same data. The data is there; it's just a printf, or printk, format thing: passing in "here's the file name, here's the offset", with a little subtraction. Okay. All right. Yeah, makes sense. Anything else? I'll pass over to Jose.

I've been wondering, with this library that provides access to the trace file system: can you use uprobes with it as well?

Uprobes? I actually haven't used anything to create uprobes yet. I mean, you could do it raw, because there's a tracefs file-write API: just write into the dynamic_events file and put the string in, if you know the uprobe syntax. I don't have a uprobe interface yet; we could add one. So yes, you could use uprobes as well.

Yeah, because I was thinking, if this becomes available as a library...

It is in library form. This is already in Debian and Fedora. And SUSE, I believe.

I was thinking maybe this could be used as the base to implement tracing in virtual machines. Within the virtual machine itself, by letting the kernel do it, so you don't slow down.

We could. I mean, the tracing is done in the kernel, though. It is, right? Right.

With uprobes, for example...

Well, for the uprobe side of things: a uprobe inserts, I guess, a breakpoint.
Yeah, it's an instruction. But the thing is, the user space will slow down because of the breakpoint. A breakpoint is kind of expensive, because it takes a full trap, and it records more data than needed. (I implemented uprobes for SPARC, by the way.) Okay. So what we're also working on: user events have finally made it into the kernel. The way user events work is you register events with the kernel, you describe what the layout of the data will be, and you say, "to enable this event, here's the enable bit in my address space", and that bit will be set to zero there. So in the code (and they're working on macros for this, kind of like the trace event macros) you have an "if this variable is set, do your trace; if the variable is zero, no big deal". And then this trace event, once it's registered with the kernel, shows up in the tracefs events directory with an enable file, and you can echo 1 into that enable file. Since the kernel registered where the enable bit was, it's going to set the bit that was registered to one. So now inside the user space, instead of "if zero" it'll be "if one", it will keep running, and then it does the write of the tracing data into the kernel.

So how do you expect this library to be used?

For which one?

For user-land tracing.

For user-land tracing, well, there's libside. This library is mostly for normal tracefs, or kernel, tracing. But with libside, which Mathieu Desnoyers is working on, you can actually hook it up with LTTng and other tracers that are user space. Perfetto is looking into it as well, which is the Android and Chrome OS tracer.
So the idea is, we'll have this library, but once the event is registered with the kernel, then you can do the same thing with tracefs: tracefs_event_enable() with the "user_events" group and the event name, and that enables the event. So you can use both libraries. The one library, libside, creates the user space events and is linked into the application with the trace events. This library is what whatever engine that wants to enable or disable the event would use. Anything else?

One of the use cases you mentioned: you can trace a particular thread when it's scheduled out, and then the next time it's scheduled in you have the delta. So usually, let's say you're debugging a scheduler issue, and you notice a thread has not been able to run for quite some time, and you're suspecting something. Once you see that, I guess the next step is you're going to have to read into the timeline and look at what's actually running on that CPU.

So there's a lot of other things. Like I said, this is short; I mean, we're probably over time by now, right? Hello? My watch died. No, it's okay, I think we're about over time. Anyway, what I'm saying is you can also add triggers to the trace event, where you can say: filter, if the delta is bigger than X, stop the trace; one other thing is to enable other events. So when this delta happens, you'll see all the other events that happened before it. That's one of the things with synthetic events that I actually abuse: I'll say, okay, when this delta is greater than X, stop the trace and, what's it called, enable all these other events, block events and everything else. And then the trace stops.
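Those "stop on a big delta" triggers are, once more, just strings written into the event's trigger file. A hedged sketch (the wakeup_lat event and its delta field are from the talk; the threshold and the extra event are made-up examples):

```
cd /sys/kernel/tracing

# freeze the trace when a large delta is seen, so the ring buffer
# keeps the history leading up to it
echo 'traceoff if delta > 100000' \
        > events/synthetic/wakeup_lat/trigger

# triggers can also turn other events on when a condition fires
echo 'enable_event:irq:irq_handler_entry if delta > 100000' \
        >> events/synthetic/wakeup_lat/trigger
```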
I'm like, Oh, this interrupt went off. That caused this huge delay and it missed a wake up or something. So I've actually, I've actually used this to debug scheduling problems. So should there be, maybe it's already exists, but should there be like an application like doing the actual analyzing work, reading the entire fire and just getting up all the, all the events showing there and in a timeline, visualize it. Well, there's Colonel shark, by the way, that does, that could do this. But also starting Monday, I have an intern whose job is specifically to write simple tools to do analysis of traces. So I'm hoping to get like, like the BCC tool set that they like, where they have all the tooling, we kind of want the same thing, but for F trace. All right. Thank you. And probably the last question over here. Hi. So I was wondering if I remember correctly, the ring buffers are per CPU, because you have to be fast and sometimes you drop them, drop events on just one CPU. But these events, since the day that you shown usually mean that something happens on one CPU, the first part and the other on the other CPU, so somewhere in there, in there you have to synchronize on the bucket or something. Oh, wait, wait. So how, how large issue is that? Okay. The histograms are not part of the ring buffer per se. The histograms are a separate unit. But the processing of the synthetic event has to work with the histogram buff. You mean for the analysis, you mean the analysis tool, the trace.def file analysis? No, no, the, the things that happen inside the kernel and. Yeah, it happens at the moment that it actually happens. So like I said, when to connect to two events, it doesn't use the ring buffer. It actually has these histogram buckets. It's just a histogram bucket. So it's the one event has its own histogram buckets. So it puts it in there. It's not per CPU. It's per histogram bucket. And then when the other vent takes off, it matches the things, then it triggers the event. 
So yeah, my question was whether this causes some contention that becomes a lot more overhead.

It shouldn't. It's actually pretty fast, and I believe it's lockless, because it can even be done in NMI context.

Okay, thanks.

I think that's about it. So thank you very much, everyone.