Hi, I'm Steven Rostedt. I'm an open source engineer at VMware. I'm also a Linux kernel developer, the maintainer of the tracing subsystem within Linux, namely ftrace, and the maintainer of the stable real-time tree. Today I'm going to be talking about finding sources of latency on your Linux system. Let's get started. Once again, I'm a VMware employee; if you want to find out more about us, check us out on Twitter at @vmwopensource, as well as the Open Source blogs at vmware.com.

So, the first thing I want to ask is: what is latency? If I'm going to talk about finding sources of latency, let's give it a definition. If you look it up on Wikipedia, it will tell you latency is "the time interval between the stimulation and response, or, from a more general point of view, a time delay between the cause and effect of some physical change in the system being observed." OK, a lot of people are probably still scratching their heads wondering exactly what the definition of latency is. So in simple terms, I call it simply the time from when an event is supposed to happen to the time it actually does happen. A simple case: every morning your alarm clock goes off, and it's very annoying, but that's when you're supposed to wake up. The time you hit the alarm clock is the time you actually do wake up. I like to call this wake-up latency.

So, where does latency come from? If you look at your system, you have your application on the top layer. Below that, the application will most likely be talking to a library. That library communicates with the kernel, and the kernel is what communicates with the actual hardware: your processor, your resources. There are ways the library can circumvent the kernel, and the application can even circumvent both the library and the kernel to get to the hardware, but we're not going to talk about those details today. What we're going to do is focus on the latency that happens in the kernel and the hardware. You'll see a little black box there, the BIOS, the basic input/output system, which is basically firmware on your computer; I consider that part of the hardware.

So, let's look at hardware latency. Where does that come from? A lot of times you'll have something called the system management interrupt (SMI). This is where the firmware is part of the hardware. Say you have a laptop, and your laptop requires thermal control. Everything inside is very, very small, and you need to keep the components cool, but you don't want the fan running all the time, so there's a lot of complex logic to make sure everything is done correctly. You can't rely on the operating system to get this right, because each laptop is unique. So the BIOS, the firmware, has a way of managing how to keep your laptop cool no matter what operating system is on top of it. It will trigger a system management interrupt and put the system into system management mode, which is basically hidden from the operating system; the operating system has no control over it. This is where it might check and say: OK, we're starting to heat up, we've got to turn the fans on.
Or: everything's cool, we can slow the fans down, or maybe we'll change the frequency of the CPU. Well, that work is a latency within your system, and if you're interested in real-time performance, you have to be aware of these kinds of changes. There are other things the hardware can do. Every instruction or data cache miss can cause a latency, as can branch prediction; basically, anything with non-deterministic behavior. Branch prediction, by the way, concerns the if statements in your code: to stay very fast, the CPU looks way ahead, and when it sees a branch, it has to make a choice about which way it will go. Based on the previous times that branch was taken, it will say: OK, most of the time this is the true case, so I'm going to follow the true case. That way, when the CPU gets to that point, all the memory will be in cache and ready, which makes things much faster. But if the branch prediction is incorrect, the pipeline is flushed, and the CPU has to stop at that point and re-pull in the memory to continue. Then there's hyperthreading, where you have multiple threads on one core: a single CPU core simulates two logical CPUs. When one logical CPU is doing some task and the other logical CPU is doing other tasks, what's really happening is the core is shifting the work: while one is fetching memory, the other gets to do some processing, back and forth. But these are shared resources, so depending on what's on the other thread, it could affect your latency. And finally, page faults and translation lookaside buffer (TLB) misses. Unless you did an mlockall(), which pulls your whole executable into memory, this is just how your system works: when you start your browser, Firefox or Chrome, it doesn't pull all of that into memory to run. It will just execute some part of the code, and if that part of the executable's file is not in memory, it takes a page fault; the kernel pulls that part of the file into memory and then continues the execution. That, of course, causes latency.

So what do we have to detect latency from the hardware? If you look inside your kernel configuration, you'll see something called CONFIG_HWLAT_TRACER. It's actually available on Fedora 31: I checked one of my systems, which was Fedora 31, and it had the hwlat tracer enabled, because when it's off, it doesn't cause any overhead at all; it only adds overhead to do the tracing when you enable it. But I also noticed that my Debian systems don't have it enabled. I don't know why, because it's a useful feature for seeing where hardware latency is, and when it's off, it costs nothing.
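As an aside, a quick way to check whether your own kernel was built with it; this sketch assumes your distribution ships the kernel config in /boot, which most do:

    # Check the running kernel's configuration for the hwlat tracer
    grep HWLAT /boot/config-$(uname -r)
    # CONFIG_HWLAT_TRACER=y means it's available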
So what the hwlat tracer does is run in a really, really tight loop with interrupts disabled, taking timestamps and checking whether there are any gaps between them. It runs for a "width" of microseconds within a "window." By default, the width is 500,000 microseconds, half a second, and the window is 1,000,000 microseconds, one second. So basically it goes into this tight loop for half a second out of every second on your system. You can change the width and the window through two files, width and window, in the /sys/kernel/tracing/hwlat_detector directory. Once a window is done, it bounces from one CPU to the next: it does each window on a single CPU, then moves to the next CPU and does the same thing, and so on. You can determine which CPUs it runs on with the tracing_cpumask file. Also, it will only record something if it detects a gap greater than some threshold. There's a file called tracing_thresh that you can set; by default it's 10 microseconds, and any gap greater than 10 microseconds gets recorded. Otherwise it could just be recording constantly: at one microsecond, a cache miss or a branch misprediction would show up, and if you're looking for those, you'll see a lot more hits, but 10 microseconds is usually the way to go.

So how do you get to this? If you have this enabled in your kernel, the first thing you want to do is mount the /sys/kernel/tracing directory. If your kernel has tracing configured, the tracing directory inside /sys/kernel will always be there; you don't have to create it, you just mount tracefs on it. For most of these commands, to get rid of all the paths, I usually cd into that tracing directory. Then, just as an example, I cat the hwlat detector's width file, and you'll see it's 500,000. I'm going to be bold and change that number from 500,000 to 900,000: nine tenths of every second, I want this thing to run. But remember, it only runs on one CPU at a time, so it's really not that bad. You'll also notice I cat tracing_thresh and it's 10, so that's the mark it has to cross. I did this, let it run with a sleep of 100, and then cat the trace file, which shows you the recorded information. The whole sequence looks roughly like the sketch below.
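A minimal sketch of that sequence. One step the narration glosses over is selecting the tracer itself, which I'm assuming here is done the usual way, by writing hwlat to current_tracer:

    # Mount tracefs (skip if it's already mounted)
    mount -t tracefs nodev /sys/kernel/tracing
    cd /sys/kernel/tracing

    # Enable the hardware latency tracer
    echo hwlat > current_tracer

    # Run 900,000 us out of every 1,000,000 us window, one CPU at a time
    echo 900000 > hwlat_detector/width
    cat hwlat_detector/window     # 1000000 (one second) by default
    cat tracing_thresh            # 10 (microseconds) by default

    sleep 100
    cat trace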
The first thing I want to stress about the trace output is what we call the latency fields, which tell you, among other things, whether interrupts were enabled or disabled; in this case you'll notice they're all disabled. And by the way, you'll notice the columns don't map correctly to the header arrows: if you follow the "irqs-off" arrow down, it's one character off from the d; that d should be directly underneath "irqs-off." What happened is that this code was originally written when process IDs had a much smaller maximum (16K or 64K, I can't remember which), and as you can see, there's a process ID over 211,000 here. Now that we allow a much larger number of process IDs, the PID column got wider than what the code was written for, everything shifted over, and the title no longer matches. So ignore that little offset and shift. The important thing is that the d is the irqs-off field: when d is set, it means interrupts were disabled when that trace happened; if it's a dot, they were enabled. And since we're doing all of this with interrupts disabled, they should all be d's. The next field is whether or not the process needs to schedule, and you'll see an N if it does; then whether it's running in a hard or soft interrupt, which it's not here; and the preemption depth, which I'll talk about at a later time.

Next is the "delay" timestamp. This is the timestamp taken by the tracer on every record, and you can change which clock it uses via the trace_clock file in that tracefs directory, /sys/kernel/tracing. The next column is a little bit of a misnomer. It says "function," because the tracing was originally written for the function tracer, and a lot of other tracers have been written since, but we haven't updated the title. So it says "function" when it just means whatever the tracer is outputting. Here, for the hardware latency tracer, you'll see a 1, 2, 3 under the function field, one for each iteration, so you can see how many samples were taken and, especially if it overflowed, which position each was in.

Next is the inner loop; I'll explain this in more detail in a second, on another slide. The inner and outer values are the gaps it measured. You'll notice each number is either zero or greater than 10, because of the threshold: it won't record anything unless it sees a gap greater than 10. Here you see a 14 and a 15: the 14 means the gap was discovered in the inner loop, and the 15 means it was discovered in the outer loop. Again, I'll discuss this a little later. The timestamp here, you'll notice, is a little different from the timestamp the tracer uses. This is an internal kernel timestamp that basically lets you know when it happened, recorded at the first instance it finds. So it's a little more accurate, but unfortunately it doesn't match the tracing timestamp. And finally, there's the count: a count of 42 means it found 42 instances greater than 10 microseconds within that loop. Remember, this output is produced after the window completes; a trace only happens after the window is done, so entries should be no more than one second apart, because I ran for 900,000 microseconds within each 1,000,000-microsecond window. At the end, it gives this report showing, in "inner/outer," the maximum it found in the inner loop and the maximum it found in the outer loop, and the count says that between those two it found 42 instances with latency greater than 10 microseconds.

And this diagram shows how this is done and why there's an inner and an outer loop. The first thing we do is reset the saved "last t2" value to zero, and then we take two timestamps, one right after the other: we call get_time() twice and put the results into variables t1 and t2. If the saved t2 from the previous iteration isn't zero, we compute the outer delta, the time from the previous iteration's t2 to this iteration's t1. Then we do the inner compare: the difference between the two timestamps, t2 minus t1, gives the inner delta. If a delta is greater than the largest one found so far, we save it, and at the very end of the whole loop, if the maximum is greater than the tracing threshold, it gets recorded to the trace buffer. The outer loop records this whole calculation because when we first did this, we only compared t1 and t2, and we missed a lot of latencies.
We had this one machine with an SMI that went off every 11 minutes for 200 microseconds, but it was never found until we added this outer loop, because when it went off, it would go off while we were calculating the differences. During this whole window, we want to make sure we catch everything within that window, so we look at the calculations from both the inner and the outer loop. And since the outer loop has a little more work in it, we record them differently. The inner measurement is just two timestamps taken right after each other, with nothing in between; the outer one can be affected by branch prediction, cache misses, and such, so it has a little more leeway. That's why we want to differentiate between whether a gap was found in the inner loop or the outer loop.

So that wraps up the latencies you can find in the hardware. What about the kernel? Well, one thing we have in the kernel is interrupt latency, and there are several aspects to interrupt latencies. One is how much latency an interrupt inflicts on your process: your process wants to run, and it gets preempted by an interrupt; that's a latency for your process. There's also the case where interrupts are disabled but we want an interrupt to come in real quick, so we have a latency there. When an interrupt comes in, it usually means a device wanted to do something. If we're typing on the keyboard, or doing networking traffic, we want that network packet processed right away; if the interrupt handler can't process that networking packet yet, that delay is a latency in itself. So they're both interrupt latencies, but with two different meanings. We also have I/O latency: when you read and write to a device, how fast that write works is something you'd like to measure. And the kernel has a bunch of maintenance tasks it must complete; if your process isn't aware of what the kernel is doing to keep your system running nicely and smoothly, the kernel can preempt your process to go do something that keeps the system healthy, but perhaps could have waited a little longer, or could have run on a different CPU. So you need to be aware of all of these.

But first, let's talk about interrupt latency. What are interrupts? Simply put, and here's another little clip art I drew a long time ago: your CPU is basically running your processing over and over again, but if you're typing on the keyboard, moving your mouse, or a networking packet comes in, there has to be a way to tell the CPU: stop what you're doing, there's external stimuli you must handle now. The CPU has no idea about it unless it gets interrupted. So an interrupt is basically a way of letting these devices stop the CPU from what it's doing and say: hey, you need to work on this. But there are cases where the CPU is doing something that cannot be stopped; it must continue and cannot be preempted by an external device. In that case, we have what we call interrupts disabled; just think of it as the CPU saying: I'm going to ignore everyone and just go on.
So, looking at the latency from interrupts: your program is running along, a device triggers an interrupt, your interrupt handler runs, and then control comes back. That gap you see there is your latency, and the latency from the interrupt is what's measurable in your program, because while the handler is running, your program is not. Well, what about the delay we see between the device triggering and the interrupt handler starting? Why doesn't that line up perfectly? This is what I call the interrupt latency proper, because interrupts could be disabled. The entire span is what I call the full interrupt latency, and this is what matters when you're worried about response times to the device. Sometimes a program is waiting on a device: the device's interrupt comes in, the interrupt handler runs, processes the device, and then hands that information from the device off to your program. You may want to measure that whole time as well, because that's the full interrupt latency.

Before I go further: as I said, I'm the maintainer of ftrace, which is the official tracer of the Linux kernel, and also of what's called trace-cmd, or what I like to call trace command. If you look at the slide, it says https://trace-cmd.org; remember that link. trace-cmd happens to be installed on several distributions, Debian and Fedora both seem to have it, but if you want the latest and greatest, because it's under active development and I haven't made a new official version recently, you'll want to build from source to get the most features. For the rest of this talk, I'm going to be using trace-cmd to interact with the latency tracers.

So, measuring latency from interrupts. If you want to measure, say, the interrupt handler latency, what the handlers are costing you, you can run this nifty command: trace-cmd record, which records to a file, with -p for the tracer you want. Yes, -p is a misnomer; you'd say, why not -t? But -t was already taken. When I first wrote ftrace, we called them plugins, not tracers, and plugins now have a different meaning, so just remember: -p selects your tracer. -l limits the function_graph tracer to do_IRQ and anything that ends in "_interrupt," because those are your interrupt handler entry points. And finally, I also enable the irq_handler_entry event with -e. An event, unlike a tracer, is a static tracepoint within the kernel; the Linux kernel has over a thousand of these all over the place, and you can list them with trace-cmd list -e to find the ones that look interesting. The functions ending in "_interrupt" are usually kernel interrupts, like the timer, while do_IRQ is the device interrupt path, and when a device interrupt comes in, the irq_handler_entry event shows you which device it was. The command looks roughly like the sketch below.
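Reconstructing from that description, the command is something like this; do_IRQ is the x86 device-interrupt entry point, so on other architectures the function names will differ:

    # function_graph tracer, limited to the interrupt entry points,
    # plus the irq_handler_entry event to identify which device it was
    trace-cmd record -p function_graph -l do_IRQ -l '*_interrupt' \
                     -e irq_handler_entry
    # ... let it run for a while, then stop with Ctrl-C ...

    # Report in latency format (-l), showing only CPU 3
    trace-cmd report -l --cpu 3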
So, after the record, I ran for a while and then did a report. The -l flag gives me the latency format; remember what I told you about the "d..1" fields. In "3d..1," the first number is which CPU it's on, and since I don't want to see all the CPUs interleaved (if I just do trace-cmd report, it shows every single CPU, which gets messy and doesn't make a good screenshot), I passed --cpu 3, which tells trace-cmd report to only show me CPU 3 from the data file. This is what I get. You'll notice this little section right here: this is where the interrupt handlers run, and where you can see how long the interrupts took. Now, some of these interrupt latencies are pretty big. First of all, you see the trace event irq_handler_entry, and that event was for my networking device: em1 was my networking device, on interrupt 26, as you can see where I highlighted the name in blue. So I know that do_IRQ was from my networking device, and at the bottom you'll notice it took 98 microseconds to handle that device. That's pretty long; 98 microseconds is a substantial delay. The 24 is kind of long as well, and that was another networking device. Please note this is not a real-time kernel. This was just the latest tree, I think Linus's tree, that I had downloaded and was compiling; it had a bunch of things enabled, other things that delayed things, and I ran this test on that kernel.

I also did this with a real real-time kernel: I took the 5.4.14-rt7 kernel and tried the same thing on that, and this is the output I got. You'll notice there was still a kind of large gap there; that 19 microseconds is still kind of long. I'm not going to tell you why the real-time kernel is better at handling interrupt handlers than the normal kernel is; I've given several talks on that before, look them up, and I'll probably give more later, but that's out of scope for this. This just lets you know that the real-time kernel behaves much better than the normal vanilla kernel. By the way, the rest of this talk I ran on the RT kernel.

So, latency from interrupts. Remember, I talked about the delay from when the device trigger happens to when the interrupt is able to go off, and I said this happens when we disable interrupts. Let's take a little look at how we can measure that. Another interesting thing to look at, beyond just disabled interrupts: a lot of times when interrupts are disabled, we're also worried about being able to schedule. And there's another thing the kernel can do: it can keep interrupts enabled, so interrupts come in and handlers run, but disable preemption, so the kernel will not let the CPU schedule another process. So if a low-priority process is running and it disables preemption, and a high-priority process comes in, the interrupt goes off and the kernel says: hey, this guy needs to run. Well, the scheduler comes back and says: no, preemption is disabled, you have to wait. And that's another form of latency from the kernel. Preemption is also effectively disabled whenever interrupts are disabled, because there's no way to notify the task to stop.
So that one's implicit: when you disable interrupts, you've basically disabled preemption, because the CPU never gets told to switch. Then there are spin locks. When you take a spin lock, the kernel has to disable preemption, because a spin lock means that if another task comes in and tries to take the same lock, it's going to spin. It's not going to go to sleep, it's not going to wait; it just spins, hoping the lock gets released quickly. If you let a task holding a spin lock get scheduled out, you could cause a deadlock, because the task that's spinning and waiting might get scheduled back on in its place. So you can't let any task holding a spin lock be preempted. There's also an optimization the kernel can do when accessing per-CPU data. When you access data that belongs to your CPU, a process on another CPU is not going to touch it, so you don't care about other CPUs; but you do worry about scheduling out and scheduling in, because if you schedule out and another process comes in, it can access that same per-CPU data. So, as long as this data isn't also touched by interrupts, you can just disable preemption, modify your per-CPU data, and re-enable preemption, and that's basically a really fast lock. That's another reason we have preemption disabling.

So now, the latency tracers. You already saw the hwlat tracer; now I'm going to talk about a couple of other latency tracers we have. The irqsoff latency tracer basically traces the longest time interrupts are disabled. The preemptoff latency tracer just looks at preemption being disabled. But the one I like most is the preemptirqsoff latency tracer, and this is the one I use. The other two are more academic to me: irqsoff is useful if you're worried just about devices getting in, and preemptoff doesn't measure interrupts off, so it's not as useful. The preemptirqsoff tracer is nice because it looks at the whole time that interrupts and preemption are disabled together, as a whole. That's really the true scheduling latency you'd have if a high-priority task wants to do something. And no, none of these are configured into a production kernel, because they actually have overhead even when they're turned off; you don't see them enabled, but in a debug kernel they're nice to have. There are also trace events for this. Remember those static tracepoints within the kernel? There are preempt and IRQ disable/enable events out there too, but because they require wrappers, they also cause a little overhead just by being built in, so you won't see them enabled on a production kernel either. A quick check of what your running kernel does have is sketched below.
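Before rebuilding a kernel, you can simply ask the one you're running which tracers it was built with; the output line here is just illustrative:

    cat /sys/kernel/tracing/available_tracers
    # e.g.: hwlat wakeup_dl wakeup_rt wakeup preemptirqsoff preemptoff irqsoff function_graph function nop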
To turn these on, you go into the Linux source code and run make menuconfig, the ncurses way of modifying the kernel configuration, then go to the "Kernel hacking" menu and then the "Tracers" menu, and you'll see this menu where these tracers are enabled and disabled. In that red box, it starts off with the preempt/IRQ enable and disable tracepoints, and you'll see they're off, because this config is closer to a production kernel. Then you have the interrupts-off latency tracer and the preemption-off latency tracer. Well, you may ask: where's the preempt-irqs-off latency tracer? That's automatically enabled when you have both the interrupts-off tracer and the preemption-off tracer enabled; it comes along implicitly. But notice there's another tracer down there, the scheduling latency tracer, which is actually something you should have enabled even on production systems; I'll come back to it.

So, tracing latency from interrupts and preemption together. Now what I'm going to do is use trace-cmd start. Unlike trace-cmd record, it doesn't record to a file; it just enables tracing, and since the latency tracers only keep the maximum, you don't really need a recording. So I enable -p preemptirqsoff, and you'll notice I also enabled an option, sym-offset. Why did I do that? Because, if you look at the blue text down below, the output has the offset within the function. These tracers enable the function tracer to show where things happen, and I also like to know where in the function it happened, not just the function name. If you take away that option, you won't see those offset suffixes showing where in the function the trace happened. And finally, I want to limit which functions get traced: when you enable the latency tracers, they enable full function tracing, and tracing every single function in the kernel causes quite a bit of overhead. So I like to limit it to, say, the interrupt and spin lock functions. This was the output, and at the very end you'll see there's a stack trace of everything. There's some garbage in there, I'm not exactly sure why, so ignore it, but it kind of lets you know where the maximum happened. The invocation looks roughly like the sketch below.
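Roughly, the invocation was the following; the filter patterns are my guesses at "interrupts and spin locks," so adjust them to the functions you care about:

    # Trace the longest section with interrupts and preemption disabled,
    # showing offsets within functions, with function tracing limited
    trace-cmd start -p preemptirqsoff -O sym-offset \
                    -l '*spin_lock*' -l '*irq*'

    # The latency tracers only keep the max, so just read the buffer
    cat /sys/kernel/tracing/trace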
The next thing I want to talk about is the scheduling latency tracer. This one does not have overhead when you don't enable it; it's written completely on top of the trace events. So if you have trace events enabled in your kernel, you might as well have this enabled, because it just works. I'm kind of upset that I don't always see this on production kernels or production systems. Unless you're worried about size, because it does have a memory footprint, there's no reason not to; speed-wise, if you're worried about performance, enable it, it's a nice thing to have. When you enable it, you get three tracers: the wakeup tracer, wakeup_rt, and wakeup_dl, the deadline one. The difference between them: wakeup looks at any process, whether it's real-time, deadline, whatever; it looks for the highest-priority process. But non-real-time tasks aren't scheduled like a real-time task, which runs until it gets preempted; an ordinary task may simply run out of its time slice before the next task starts, so you'll see huge gaps. Those non-real-time wakeups always have a larger latency than the other two classes, and since the tracer only records the max latency, they'll hide the real-time and deadline tasks: you'll never see the other two if that's what you're looking for. That's why we have wakeup_rt: it ignores anything that's not a real-time task and only traces the highest-priority real-time tasks, which gives you a much better idea of the scheduling latencies for real-time tasks. The deadline tracer does the same thing for deadline tasks, for the same reason.

So, if I run with wakeup_rt on the PREEMPT_RT kernel, this is what you get. You'll notice I picked the wakeup_rt tracer and kept everything the same, except I also added the sched_switch and sched_waking trace events. sched_switch is the trace event that fires whenever one process gets scheduled out and another process gets scheduled in, and sched_waking fires when a process actually wakes up: if your process is sleeping, waiting for an event, and that event comes in, something has to wake it up, and sched_waking traces that action. It's very similar to the irqsoff tracing we did earlier; a rough reconstruction of the command is sketched below.
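A reconstruction of that run, under the same assumptions as the previous sketch:

    # Worst-case wakeup latency of real-time tasks, with the two
    # scheduling events added for context
    trace-cmd start -p wakeup_rt -O sym-offset \
                    -e sched_switch -e sched_waking
    cat /sys/kernel/tracing/trace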
Well, what's the problem with the latency tracers? First of all, they're rigid, not very flexible. There's not much you can do: you enable and disable them, they give you the max, and they always record the highest-priority task, or all tasks. You can't say: I'm just curious about this one task, I want to see the latency of this one single task; it's really hard to tell these latency tracers to record that. And they only record the max latency. What if I want to know about all latencies, not just the max? The average? A histogram? And they're specific to irqs-off, preemption-off, or wakeup latency; that's all you have. You can't do any other type of latency, like I/O latencies or anything else. So I want to talk about histogram triggers and synthetic events. This is where you get to choose your own events, and where those preempt/IRQ enable and disable events come in handy. With histograms you can add filters: you can say, I only want to trace this specific process, I don't care about any other processes, show me the latencies of a single process. And it gives you a nice histogram of everything. So how do we create it? First of all, you mount the tracefs filesystem, which I talked about earlier, if it's not already mounted; and by the way, if you run trace-cmd, it automatically mounts it for you.

Then I create a synthetic event. Right here, I echo the name of the event I want to create into the synthetic_events file. It's going to have two fields: one of type pid_t, the process ID I'm going to trace, and one of type u64, the latency, because it's going to hold a timestamp delta, and timestamps are always u64. Then, to make the histogram, I do trace-cmd start with -e, which enables the irq_disable trace event; I mentioned before that you have to configure these in, they're not on production systems. The capital -R means a trigger: I'm adding a trigger to this event, my histogram with its keys and so on. Let me walk through it. The first thing I do is map two events: the disable event, and the enable event that's going to happen right after it; I want to make sure these two are paired up. And how do I map them together? Since I know they always happen on the same CPU (a section with interrupts disabled can't be preempted, so the enable has to be on the same CPU), I make the key that matches the two the CPU. Then I record the timestamp into a variable called ts0, and I only want to trace if common_pid is greater than zero. All events have common_pid, that's why it's called "common": the process ID exists in all events. But the idle task has a process ID of zero, and the reason I don't want to trace the idle task is that it can be scheduled out with interrupts disabled and then scheduled back in, and that would screw up this tracing. So I ignore idle; I just care about processes. I could actually put a filter here for a specific task instead: if common_pid == some process ID that I care about. Then, remember that synthetic event I created? I'm going to fill it in. On the irq_enable event, I take the latency (the variable on the slide is missing its prefix; I should have named it irq_lat, so let's just call it lat): lat equals common_timestamp.usecs minus $ts0. That ts0 was the timestamp saved by the irq_disable event, so on the irq_enable event I take the current timestamp, subtract the timestamp from irq_disable, and that gives me the delta, the latency, which I store in the lat variable. Notice that everything starting with a dollar sign is a variable you can use within the tracing. Then onmatch is the way of telling the irq_enable trace event: when you match the irq_disable trace event, execute this. Note that I wrote preemptirq.irq_disable: for onmatch, you have to know the system name of the event. All events are grouped in systems, and irq_enable and irq_disable are both under the preemptirq system, so you have to give the full system.event name to make this work. And the trace() part is where I say: kick off, basically execute, that synthetic event I created, passing the two parameters, the PID and the IRQ latency, to the event. Finally, in the same trace-cmd start, I enable the synthetic event I just created, irq_lat, and give it a trigger to create another histogram, a histogram of the latencies. So not only will it be in the ring buffer, where you can see each trace as it happens; it will also build a histogram. The whole sequence is reconstructed in the sketch below.
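Putting it together, reconstructed from the description above; this assumes a kernel with the preemptirq events configured in, and uses the newer trace() action syntax. The first trigger keys on the CPU and saves the timestamp (skipping the idle task, pid 0); the second computes the delta and fires the synthetic event; the third builds a histogram of the synthetic event itself:

    # Create the synthetic event: a PID and a latency in microseconds
    echo 'irq_lat pid_t pid; u64 lat' > /sys/kernel/tracing/synthetic_events

    trace-cmd start \
      -e irq_disable -R 'hist:keys=cpu:ts0=common_timestamp.usecs if common_pid>0' \
      -e irq_enable  -R 'hist:keys=cpu:lat=common_timestamp.usecs-$ts0:onmatch(preemptirq.irq_disable).trace(irq_lat,common_pid,$lat)' \
      -e irq_lat     -R 'hist:keys=pid,lat'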
Now, if I look at the histogram, this is what it looks like, which is kind of nice, because I can see that the max latency of interrupts being disabled during this run was 21 microseconds. Because I put that .usecs in there (by default, common_timestamp is in nanoseconds; adding .usecs makes it microseconds), the max latency was 21 microseconds, and it happened once. The majority of the time it's two, three, four microseconds, which is pretty good; IRQs aren't off that long.

Now let's do the same thing for wakeup latency. Like I said, this is much more flexible than the fixed tracers, so let's do the exact same thing, but for wakeup latency. You'll notice that this time I echo wakeup_lat instead, and I added a prio field: I still have the process ID and the latency, but I'm also interested in what priority the tasks with these wakeup latencies were at, because I want to record everything. So I do the same thing here that I did before, basically the exact same format, and I run the tracing. Now, about the priorities you'll see: inside the kernel, priorities are actually reversed. The higher the priority number in the kernel, the lower the actual priority. So when you see a 99 or 98 priority in this output, that's actually a real-time priority of around one or two; not a very high-priority task. A zero is the highest priority a task can possibly have inside the kernel. There's a history behind that which I don't have time to go over. So here you'll see the max wakeup latency was seven microseconds, and it happened a couple of times. These are pretty good wakeups; remember, this is the real-time kernel. A reconstruction of the commands is sketched below.
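The wakeup version closely mirrors the wakeup-latency example in the kernel's Documentation/trace/histogram.rst, so that's a good place to check the exact field names:

    # Synthetic event with pid, priority, and latency
    echo 'wakeup_lat pid_t pid; int prio; u64 lat' > /sys/kernel/tracing/synthetic_events

    # sched_waking starts the clock (keyed on the woken pid);
    # sched_switch, keyed on next_pid, computes the delta and fires the event
    trace-cmd start \
      -e sched_waking -R 'hist:keys=pid:ts0=common_timestamp.usecs' \
      -e sched_switch -R 'hist:keys=next_pid:lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).trace(wakeup_lat,next_pid,next_prio,$lat)' \
      -e wakeup_lat   -R 'hist:keys=pid,prio,lat'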
Histogram triggers and synthetic events are powerful, they can be enabled on production systems, and they're great to have. Obviously it matters which events you have: you may not have the preempt/IRQ enable-disable events, but anything else, disk I/O, the wakeup latency, all of this works on any system you have. And it's easy to use, right? Everything I just talked about in the last few minutes, you understood completely, and you could go off and do it yourself? Well, no: it's a rather strange format, and it takes a while to get used to. And the problem is, if it's hard to use, people won't use it. I know, because I found bugs while using it that should have been caught if anyone was using it. So we need something a little better, so that it's easier and people can actually do this work, maybe even from user space.

So, welcome to the vaporware. Some people say welcome to the cloud; this is welcome to the vaporware. Well, it's almost vaporware. If we need a language, why make people learn a new one? What about SQL? Think about it: events are like tables, the fields within an event are columns, and every instance of an event is like a row. We can join tables, and these synthetic events are basically joining events. So why can't we do that?

So, instead of that archaic output, this looks much more familiar, and it's something we're working on; it's almost there. We have our select statement, which everyone's familiar with (I know SQL, reluctantly, but most people do), and we have variables: you can label something "as pid" and it will find it for you. For example, the start is defined as irq_disable, and since trace-cmd has access to the events directory, it can find that event for you, including which system the event belongs to. So you don't have to worry about variables, you don't have to worry about what system the event is on, you don't have to worry about synthetic events; this does everything for you. Then irq_enable is the end, and there's your combination; that's one way to measure. And then you can say "as irq_lat": that becomes your synthetic event, which you can reference later. By naming the table (the table is the event itself), you've named the synthetic event; everything's done behind the scenes, and you just have to know SQL.

So let's look at the two together, and how they map. Here they are: at the top is what I had for the disabled-IRQs case, and below is the SQL version of the same thing; likewise for the wakeup latency tracing. To map the two together, if you look above, you'll see that all the variables that were there before are created for you here. Let me break it up from the beginning. Before, you had to create your synthetic event: that's now the naming of the table. You had to declare the parameters of that synthetic event: those are the selection fields. You don't have to write "as pid," but if you leave it out, all you'll see is arg1, arg2, which won't be meaningful; so it's good to name the fields, and then you'll see those names in the trace output, or if you want to do something with the histograms. The "from sched_waking" is the first event you start with, and the second event you're going to map is the joined event; notice I put "as start" and "as end," so I only have to name them once and can use start and end everywhere else. Then the timestamps: remember I had that variable recording the timestamp and mapping it across events? When I put end.common_timestamp.usecs - start.common_timestamp.usecs in the query, trace-cmd will know: oh, we need a variable to carry this value between the two events; and it creates it for you, so you don't need to worry about that. And for how we pair the two events: remember, I said you have to have a way to map the two together, and sched_waking maps its process ID to sched_switch's next_pid. Here you just write a join: the "on" clause tells it how to do that pairing, just like it does in SQL. And then you have filtering with "where": if I put where start.prio < 100, that's the filter. Everything's very simple.

So it's coming soon. I have everything pretty much working except the where clause; that's still a little bit broken, but we're almost there. If you want to look at it yourself, just go to my GitHub page, where I have the SQL histogram code. You can echo a query into the tool (you'll notice I removed the where clause here), and it will print the output it would run. A sketch of what such a query looks like follows.
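For flavor, here's the kind of query described above, reconstructed from the slides; treat the exact syntax, especially how the synthetic event gets its name, as a sketch rather than the tool's final grammar:

    # Wakeup latency as an SQL-like query: sched_waking joined to
    # sched_switch on the woken/next pid, filtered to RT priorities
    select start.pid as pid, end.next_prio as prio,
           (end.common_timestamp.usecs - start.common_timestamp.usecs) as lat
      from sched_waking as start
      join sched_switch as end on start.pid = end.next_pid
     where start.prio < 100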
We're already working on putting this into trace-cmd, so it will be there for you. And, what's it called, you have to run this as root, because it needs access to the tracefs directory, which is only accessible by root; you could copy things over, there are ways to work around it, but feel free to play, and we'll be there soon. So with that, thank you very much. Again, my name is Steven Rostedt, I work for VMware; check us out on Twitter at @vmwopensource, as well as our Open Source blogs at vmware.com. And now, hopefully, we have a few minutes for probably one question. Thank you.

OK, so, a quick change of scenery. Hopefully you can hear me and see me; the technicians say they can. Questions are coming in now, and I've already written some responses in the chat, so let me just answer the questions I've seen.

The first was: how does this relate to other kernel systems like eBPF, or monitoring tools like Netdata? I know eBPF actually uses perf for its tracing, but it also hooks into a lot of the ftrace infrastructure, and if you use bpf_trace_printk() to debug your BPF module, it actually writes into the ftrace buffer, and you can read there what you printed from your module.

Another question: does the hwlat detector support the arm64 architecture? Yes, because it's completely agnostic: it just disables interrupts, spins, and does all this work without needing anything specific to the hardware. The one thing that is a little hardware-specific is NMIs: there are some hooks in there to let you know when an NMI went off, because disabling IRQs doesn't disable non-maskable interrupts. IRQ disable stops all normal interrupts, but non-maskable interrupts, by their name and nature, cannot be stopped. So hwlat can detect NMIs, and it's good that when they go off, you know it was an NMI and not the hardware.

One question was about how ftrace differs from strace. I use strace all the time, because it shows you the system calls that your utilities or your application are calling, but that's it; it just stops at the system call. With ftrace, you actually trace what the system call does. If you want to learn what, say, sys_read does; say you're learning a new filesystem, you're looking at Btrfs and you want to know how Btrfs implements a read; you could run ftrace on a read system call and have it trace all the functions being called within Btrfs, then look at the source code alongside it, and you can learn how it works, or at least what it's doing.

I guess there was one question: can we use renice or rtprio to affect which highest-priority task gets profiled? If you change a task's priority and it's still the highest priority, it will still be profiled by the normal tracing; but if you change the highest-priority task so it's no longer the highest priority, it won't be profiled. The tracer just picks whatever is the highest-priority task running at the time and profiles that.

Oh, here's one: how do you compare this with DTrace, bpftrace, and perf? Well, they're all a little bit different, and they overlap slightly. Brendan Gregg has a really good talk, and I guess he has books about
this coming out, and I recommend reading them. Each one has its pluses and minuses, and I like to have a lot of tools. I know some people say there are too many tools, but if you only had one tool, you wouldn't have anything that does exactly what you need done. I find that I'll use perf sometimes, BPF sometimes, and ftrace sometimes, depending on which I think is the best tool for the use case.

Then someone asked what the minimum kernel version for synthetic events and histograms is. Unfortunately, I don't have that off hand; I usually do remember, and I'm surprised I didn't put it in my slides. But they've been available for some time, and there's a lot more work coming out.

Let's see: do the tracers have an impact on the latency values? Are the displayed latencies larger than the reality without the tracers enabled? OK, yes and no. There's obviously the Heisenberg principle: no matter what you try to measure, the fact that you measure it will affect what you're trying to measure. That will always be the case, but you can limit it, up to a point. When I do latency tracing, I sometimes disable function tracing and only enable the trace events, because tracing every single function while you're trying to figure out how long interrupts are disabled will increase that time significantly. So I will sometimes use the function filtering to pick just a few functions to trace: instead of tracing all functions, and I think I even mentioned this in my talk, maybe I just want to trace the spin locks being taken and a few other functions that are going to be triggered, and then run the latency tracer, because that keeps the overhead down quite a bit, and I prefer that for getting good values. So yes, the more tracing you have enabled, the more it impacts the latency times. But if you remove all the function and event tracing and just run, say, the irqsoff tracer, it will tell you where the latency happened but not what happened in between, since the detailed tracing is off; the number it reports, though, is usually very accurate.

"To evaluate the ftrace output, one needs to have good knowledge of the kernel internals?" Generally, yes. This is why I think ftrace is really good for learning the kernel: you look at the output, you might say to yourself, hmm, what does this mean, and then you go look at the kernel source, follow the ftrace output against it, and you kind of learn what it's doing.

Anyway, I think that's about the last question I can answer now. I'll be on Slack; just find me. I've been hanging out in the embedded Linux track, and I'm also on the Linux systems track. Feel free; and like I said, come check out our booth; I guess VMware has a booth, so come check us out there. Thank you very much.