Okay, so my name is Joel Fernandes. I spent the last couple of months working on some interesting real-time issues, and as I was debugging them I found different ways to debug them, and I thought it would be really useful to share some of those experiences. So this talk is really a recipe-driven type of talk: I show you problems and I show you the tools that I used, and these are real problems, not fake examples; these are real issues, so I thought that would be really useful. I used to work for Amazon until a few weeks ago; now I work for Google, and my responsibilities are Android performance and that kind of stuff. This is how I felt a lot of the time: you fix issues, and then they show up 24 or 48 hours into a test run. And sometimes you think you fixed something, but you didn't really do anything. It's all about all the ducks lining up and bad things happening, and those occurrences are really rare. For the purposes of my talk, I'd like to make some definitions. The period is the time interval at which RT tasks are released at a fixed rate, and the deadline is when an RT task is expected to finish responding. For periodic real-time tasks like audio, period and deadline are the same thing. I'm going to start off with some basic concepts, and then we'll go into the real issues, the categories of issues, and the tools I used to find and fix them. So this is what a periodic real-time task looks like: you have a task waking up every period, and it's expected to finish execution by the end of that period. And you have issues where, as you can see here, the task has some scheduling delay, jitter between the time it's woken up and the time it actually gets the CPU. And you also have the task itself taking a long time. 
Once it does get the CPU, it takes a long time because it's not well written or something, and it misses its deadline that way. And then you can have a combination of both: jitter and long execution time together causing issues. So I wanted to briefly go over all the possible delays that you can face. In this example, I have a task 2 that is responding to an interrupt 2, and it has to finish responding. First you have some hardware delays: as soon as the interrupt signal is raised there is some propagation delay, the interrupt controller has to register it, all of that. Then in this example I have another interrupt that was already running on the CPU. Interrupts cannot be nested; you cannot have one top half interrupting another. So even though the signal was received, interrupt 2 has to wait for the previous interrupt to finish, and that's what I call IRQs-off delay. Then finally interrupt 2 gets a chance to run, but the interrupt handler itself takes time to run, so I call that ISR execution delay, interrupt service routine execution delay. The interrupt then wakes up the task and puts it on the run queue, but for the task to actually get the CPU, it has to contend with other tasks. In this case I have a task 1 that's higher priority, so task 2 has to wait for task 1 to finish, and I call that scheduler delay. Then finally task 2 gets a chance to run, after all these delays, and it takes time of its own to run, which I call task response delay. All these delays add up and introduce latency. Usually the scheduler delays are the biggest. In this next picture I try to show preemption-off delays. Here you have task 1 responding to interrupt 1, and task 2 was already running and had turned off preemption. 
Then interrupt 1 comes in, and you have the regular hardware delays and the interrupt service routine delay. But even after the task is woken up and assigned a CPU, it still cannot run, because task 2 has turned off preemption: that's the preemption-off delay. Finally, when task 2 re-enables preemption, task 1 gets the CPU because it's higher priority, but it had to wait for preemption to be re-enabled, so you have those preemption delays as well. In this next picture I try to show the delay where a CPU takes a long time to wake up from idle. To service an interrupt, a CPU might be in a deep sleep state and has to come out of it, and that has delays too. You can see here all this time was taken for the CPU to actually come back into a running state. Usually that's not high, but it can introduce latency, especially if you have hardware bugs in the CPU wakeup path; maybe a power rail has to be toggled or something. I've seen issues where that adds latency too, and we'll go over that. A quick mention that if you don't have threaded interrupts in Linux, code executing in interrupt context cannot be preempted by anyone: another interrupt cannot come in and run; the current one has to finish, and then the next one gets a chance. The reason for that is that you used to have interrupts interrupting interrupts over and over again, and then you had stack overflows and things like that, which are a pain, so the decision was just not to do that. And another quick mention: if you do something careless in your driver, like disabling preemption for no reason, you can really annoy a lot of people. As I showed you earlier in the pictures, disabling preemption for a long time introduces latency, and the RT patch set goes to great lengths to make sure that doesn't happen. 
So with the RT patch set, spin locks and all those things don't disable preemption the way they do in the non-RT world. And if you do have to disable preemption, a lot of paths in the kernel actually check whether something higher priority needs the CPU, and if so, they re-enable preemption. There are also APIs you can use to check whether something needs the CPU, and if it does, you release the lock. There are all kinds of tricks like that a driver can implement to make sure that turning off preemption for a long time doesn't stall the system. So the RT patch set (Steven walked in at the right time) converts spin locks to RT spin locks, which don't turn off preemption: spin-lock critical sections are preemptible if you have the RT patch set and the PREEMPT_RT_FULL config option. Mutexes are converted to rt_mutex, which has priority inheritance support and so on. And all IRQ top halves are force-threaded: if you had an interrupt handler and you use CONFIG_PREEMPT_RT_FULL, all the top halves that previously executed in interrupt context are forced to run in threads. The idea is that you don't want to run things with interrupts turned off, and you want to run your handler such that it can be preempted. This talk is not really about the RT patch set or its features, though; it's about debugging and some of the tools I used. So there are three categories of real-time issues I want to discuss: kernel, application, and hardware. Kernel issues, and I'll show you some examples, are things like preemption turned off, IRQs turned off, spin locks used where they're not necessary, and the RT patch really fixes a lot of those. Then you have application and hardware issues. Application issues are things like the application taking too much time to run once it's given the CPU. 
You have compiler issues where the code is not optimal, code running in user space with a lot of cache misses, page faults, and so on. CPU frequency not being correct; I'll show you an issue where the CPU frequency was wrong and a task was missing its deadline because of it. Using the wrong scheduling priority or policy, things like that. Then I'll talk about hardware issues, where an interrupt handler was trying to access the bus and the bus was taking too much time; I'll show you some real issues I saw. And, as I was saying, CPU wakeup from idle: an interrupt has been received, but the CPU takes a long time to come back into a running state, so the interrupt handler cannot run soon enough. So we start off with the IRQs-off and preemption-off issues I found in the kernel. The idea is that interrupts disabled on the local CPU for too long has the effect of locking the CPU away from other tasks and other interrupts, because if an interrupt is received and it has to wake up a task, it cannot do that if it never gets a chance to run. And with preemption disabled in the kernel, even if something is woken up, it still cannot run, because preemption is turned off. So the first example I want to show you is an IRQs-off case. This is without the RT patch set, just a regular kernel. I wanted to talk about the irqsoff tracer here; it's really cool. As soon as interrupts are turned off in the kernel, it records a start timestamp, and when interrupts are turned back on, it takes another timestamp, so it knows how much time elapsed. It keeps track of the maximum time interrupts were ever turned off, and that helps you find these kinds of issues. By default, it also traces functions between those two points. 
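As a concrete configuration fragment: driving the irqsoff tracer is just a matter of poking tracefs files. This is a minimal sketch, assuming tracefs is mounted at /sys/kernel/debug/tracing and you're root:

```shell
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo irqsoff > current_tracer       # also available: preemptoff, preemptirqsoff
echo 0 > tracing_max_latency        # reset the recorded maximum
echo 1 > tracing_on
# ... run the workload that reproduces the latency ...
cat tracing_max_latency             # worst irqs-off duration seen, in microseconds
cat trace                           # what ran during that worst-case section
```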
So you can actually see what the kernel was doing and which functions were executed, which gives you a lot of insight into what might be taking time. In this case I'm not showing you the function trace, just the stack trace taken when interrupts were actually turned back on; I'll show you actual function traces a little later. In this particular example, again on a non-RT kernel, spin_lock_irqsave actually flips a bit in the CPU that masks all interrupts. So if, say, device 1's interrupt handler did a spin_lock_irqsave, and device 2 generated an interrupt, then device 2's interrupt handler will not run because of device 1. What you may want to do instead of spin_lock_irqsave is just spin_lock plus disable_irq, which turns off only that one interrupt line and doesn't disturb the other device. That might serve the purpose, and it's what we did in one of our products. And with the PREEMPT_RT_FULL config option, the spin_lock_irqsave API does not disable interrupts, which is really cool. The API keeps the same name because you want to use the same drivers with PREEMPT_RT_FULL; the driver shouldn't have to use a different API for RT versus non-RT. So spin_lock_irqsave gets converted into a sleeping lock, and it does not disable interrupts. But if you use raw_spin_lock_irqsave, that still gives you the old behavior, because there are instances where you really need it. The second example is top halves taking a long time. Again, these are non-threaded handlers. If you use threaded IRQs, they run as threads; but in the case of non-threaded handlers, as I said, there's no nesting support, so if the handler takes a long time to run, it keeps interrupts off for that whole duration. Linux doesn't support nesting of those; it's not available. 
One way to find this is to use the function_graph tracer. I'll show you some tricks later that you can use to narrow things down even further, but here I just ran the function_graph tracer on the whole system, and I happened to see that handle_irq_event was taking three milliseconds there. The function_graph tracer nicely shows you the total time. [Audience: can you filter on just that IRQ?] Yeah, I'll be showing that next. So it turns out, after a lot of debugging, that the handler that was actually taking a long time was the audio one; it was trying to access the bus, and that was taking a long time. And as Steven was saying, there are some tricks you can use with filters and so on. One nice trick I found: you can run the function_graph tracer on handle_irq_event, set the depth to three, and ask for everything that took longer than one millisecond, and it'll show you something like this. Just by looking at that, you can easily see, across the whole system, which interrupt handlers took a long time. This actually wasn't working before; I fixed it, I think, two months back: if you used the threshold, say one millisecond, it would just trace everything and ignore the filtering. Does that answer it? [Steven:] I would still disagree with this, because the function_graph tracer still hooks every function. Even though it's not recording, the overhead of jumping to the trace call and saying, oh, don't trace me, continue, is still kind of high. What I'm suggesting is to go set the filter after you've seen the culprits. [Joel:] OK, but the thing is, at first you don't know the names of these interrupt handlers, so you can only filter on them once you've seen them, and then narrow down. OK, cool. 
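The depth-three, one-millisecond trick just described is, again, a tracefs configuration fragment; this sketch assumes the usual tracefs mount point, and note that tracing_thresh is in microseconds:

```shell
cd /sys/kernel/debug/tracing
echo function_graph > current_tracer
echo handle_irq_event > set_graph_function   # only trace down from this entry point
echo 3 > max_graph_depth                     # don't descend more than three calls deep
echo 1000 > tracing_thresh                   # only record calls that took > 1000 us
cat trace                                    # shows the slow interrupt handlers
```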
So, following what Steven is saying: after you see something like that, take the individual functions and set set_ftrace_filter on those, so that nothing else is traced but just that. I think there's a nice trick in an LWN article Steven wrote, where you trace do_IRQ and also enable the IRQ entry event, which is also really cool. The other trick you can use is a kretprobe. You do the same thing as with the function_graph tracer on handle_irq_event. The idea is that you can dynamically insert probe points into the kernel, and with kretprobes you get both an entry point and a return point. In the entry handler, your kprobe is called with the same arguments handle_irq_event is called with, so you can take those arguments, the IRQ descriptor, and find out which handler is about to be called. Then at exit you take another timestamp, take the difference between the two timestamps, and if it's too high, you warn. Here's some code: this is the entry handler. Basically I take a timestamp and store the symbol name of the handler, and in my return handler I take another timestamp, and if it took a long time, I just scream that it took too long. And this is some boilerplate to actually set up the kretprobe on handle_irq_event. You do that, and you see something like this, which nicely shows you, in the kernel logs, all the interrupt handlers that took a long time. So my recommendation: use threaded IRQs as much as possible. With the RT patch set and PREEMPT_RT_FULL, all interrupts are threaded, so you hopefully won't run into the problems I just showed. And you can use the techniques I showed you to time your IRQs, using the function_graph tracer or kretprobes. 
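If you'd rather not write a custom module, recent kernels also let you place entry and return probes on handle_irq_event straight from tracefs via kprobe_events. This is a sketch under that assumption: the probe names irq_entry/irq_exit are made up, and instead of the warning my module printed, you'd subtract the per-invocation timestamps in the trace output yourself:

```shell
cd /sys/kernel/debug/tracing
echo 'p:irq_entry handle_irq_event' >> kprobe_events   # probe at function entry
echo 'r:irq_exit handle_irq_event' >> kprobe_events    # probe at function return
echo 1 > events/kprobes/enable
# ... run the workload ...
cat trace                         # pair up irq_entry/irq_exit timestamps per hit
echo 0 > events/kprobes/enable    # probes must be disabled before removal
echo > kprobe_events
```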
And, as Steven said, with the function_graph tracer you probably want to set the filters. The third example is the 8250 driver on one of the products I worked on. In an older kernel, the serial console print path actually disabled interrupts and then sent all the characters down the serial port with interrupts still off. I'm thinking it did that because you don't want overlapping messages, but that's what it was doing. The possible solutions are, of course, to try not to print anything unnecessary. In our final product we disabled the serial console completely, for security reasons as well, so we fixed it that way. But the real fix was made by Ingo Molnar, who converted the local_irq_save I just showed you into a spin_lock_irqsave. The idea is that if you use CONFIG_PREEMPT_RT_FULL, spin_lock_irqsave is fine, because it won't disable interrupts, so that fixed the issue. But you have to have the RT patch set and enable PREEMPT_RT_FULL to actually fix those kinds of issues. I also want to briefly talk about the preemptirqsoff tracer. It's like the irqsoff tracer, but it starts tracing when you disable either preemption or interrupts, and it stops when both are enabled again, so it covers both cases, and you can see the function trace of the maximum latency that happened. So this is a real issue, another one I'm working on right now in the upstream kernel. You have this vmap code, used to create virtual mappings in the vmalloc space, and it has a mechanism for lazily freeing vmap areas. When you map something, it creates a virtual memory mapping, but when you free it, it doesn't actually destroy the mapping. 
Only after a threshold is crossed, I think it's 32 megabytes times the log of the number of CPUs, does it actually start destroying the mappings. And it does the whole thing under spin_lock and spin_unlock, which on non-RT, again, turns off preemption. The way we worked around this issue was to reduce the threshold from 32 to 8 megabytes, so that the purge triggers much sooner instead of waiting for 32 megabytes times the log of the number of CPUs. And the preemptirqsoff tracer nicely caught the 14 milliseconds there. There was actually a bug in the preemptirqsoff tracer which, with Steven's help, we fixed: it wasn't doing function tracing when only preemption was off, only when interrupts were off, so you would miss all of this. But once that was fixed, we could see it, and we could easily see that it was busy in this loop here, because __free_vmap_area was repeatedly being called. So that's the preemptirqsoff tracer; I'd encourage people to use it. And a little more about the RT patch set: as I was saying, spin_lock_irqsave doesn't turn off interrupts there. It gets converted to a plain spin_lock, so it's kind of a trick; it doesn't really disable interrupts. And spin_lock in turn gets defined to rt_spin_lock in the RT spinlock header file which, if I understand correctly, is a spinning-style implementation that sleeps instead of spinning. So that's it for kernel issues; we're going on to hardware issues now. The hardware issues I saw are these, though there may be many others. First, some terminology for bus accesses. There are two types of accesses you can make on a bus: posted and non-posted. Posted transactions don't wait for the transaction to complete: you make the transaction and then you continue. With non-posted, you have to wait for the transaction to complete. 
This is the architecture of one of the devices I saw this bus-related issue on. It's an Intel Atom, Silvermont-based. I don't want to go over everything here, but basically it has the system agent, the high-speed fabric connected to your cores, your memory controller, all the high-speed stuff, and then something called the IOSF, the fabric for all the I/O stuff like PCIe, USB, audio, and so on. So it looks something like this. Again we used the function_graph tracer with a depth of one; sure, you'd probably set the ftrace filter too, per what Steven said. Anyway, we were seeing these huge times in the Wi-Fi driver, and we were also seeing the earlier trace that I showed you. That one was actually a hardware issue, not a kernel issue: it looked like the interrupt handler in the kernel was taking a long time, but it was really a hardware access issue. Basically, audio was trying to access the bus, and it was taking forever. It turned out there was a bug in the bus: it would only support a single outstanding non-posted access, not concurrent ones. And what was happening was this: the PCIe bus has these low-power states, and it was going into a low-power state. The Wi-Fi system was using a GPIO for interrupts, and the Wi-Fi interrupt handler would try to access the PCIe bus. Because PCIe was in a low-power state, it had to recover first, and while it was recovering, the access the Wi-Fi driver made was still pending, because you can only have one outstanding access in this architecture. So everyone else who accessed the IOSF just had to wait. It was horrible. So, anyone want to guess what the fix, or the workaround, was? Yeah: we just had to turn off power management on PCIe so it never goes into its low-power state, so the question of recovery never comes up. 
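For what it's worth, on a mainline kernel one generic way to apply that kind of workaround is the ASPM policy knob. This is a sketch assuming the standard pcie_aspm module parameter is available; the actual mechanism used on that product may well have been different:

```shell
# Show the current PCIe Active State Power Management policy,
# e.g. "[default] performance powersave"
cat /sys/module/pcie_aspm/parameters/policy
# Keep PCIe links out of their low-power states
echo performance > /sys/module/pcie_aspm/parameters/policy
# Or set it at boot: pcie_aspm.policy=performance on the kernel command line
```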
Because the issue was that it was going into its low-power state, and then accesses to the bus were taking forever. The thing about these hardware issues is that you can't really measure them from the kernel, because they're in the hardware; you need some deeper tracing. But using the function_graph tracer and tools like that, we were able to easily see that the bus was taking a long time to access. Exactly why it was taking that long, for that you'd need deeper hardware-level debugging, but honestly we didn't really need it: we saw the pattern that the bus accesses were taking a long time, and from that we were able to determine it was the power management. Then the other issue we saw, and let me just check how we're doing on time, involves this thing in the kernel called the PM QoS framework. Different drivers can basically tell the kernel: I need this much latency when you recover and come out of idle. So when there's a latency-requirement change, a new set of allowed sleep states has to be calculated based on the requirement. To do that, all the cores have to be woken up, because if they're not woken up, it doesn't make sense: they're already sleeping, so how can you change the amount of time they take to come back from a low-power state? So you have to wake all of them up, and then the kernel programs the new limits on which low-power states they can enter. The interesting thing is that it disables preemption while doing that; I'll show you that in a moment. And to wake the cores up, it sends IPIs, inter-processor interrupts, to all the other cores; that's the mechanism it uses. It does something like this: the CPU idle latency notifier calls smp_call_function_many, where the idea is that you want to execute a function on every other CPU, but the function really doesn't do anything other than return. 
So the CPU that wants to program new CPU-idle wakeup requirements has to call smp_call_function_many like that, and all of this happens while preemption is turned off; it does the IPI with preemption off. What it does is grab a lock per target core and send an IPI, and on the receiving core, the lock that was grabbed by the sending core is released. That's how the core sending the IPI knows that the function it asked to run on the other cores is done. And again, with preemptirqsoff tracing we could see a pattern of which cores were taking a long time to release the lock. About this first set of lock waits: csd_lock_wait is basically the function that waits for the lock to be released, and the first set of csd_lock_wait calls are just the preliminary part. The idea is that if the lock is still taken from a previous call, you wait for it to get unlocked first. Those are fine: core 0 just checks the locks of cores 1, 2, and 3, they're all unlocked, and everyone is happy; the IPI part comes after, and you can see it send the IPIs there. And preemptirqsoff tracing again beautifully shows the problem: the first lock wait returns immediately, but the second one takes a long time. We enter csd_lock_wait, and then we're just spinning there, waiting for the core the IPI was sent to to release the lock, and we wait all that time; you can see the next event happens 14 milliseconds later. So it looks something like that. And we found this pattern; it was so obvious that core 3 was always taking a long time. Of course, that didn't tell us how to fix the problem, but at least we knew there might be something going on with that core: the IPI the core receives isn't able to run its function in time. It might be something related to being idle, because we were trying to wake everything up. 
That's the kind of insight the tracing gives you. And the issue turned out to be something to do with the PMIC being on the same I2C bus as another I2C device; it's just a mess. The I2C access was keeping the bus busy, and because of that, the PMIC couldn't be contacted in time to bring the CPU out of idle. Just a mess. So with that, we go to application issues. I worked on this product, a very cool product, the Amazon Echo. It does a lot of things when it receives audio: it has to do beamforming to figure out where the sound is coming from, it does noise cancellation and that kind of stuff, and it's always listening for "Alexa" and has to respond to that. For user space, since Android was being used, I used Android systrace, which is actually based on ftrace. It uses the ftrace trace_marker facility to mark the start and end of things, so the application can make marks, "I start this here" and "I end this here," by writing to trace_marker. It then nicely shows you how much time each of the different stages in the application took, and it looks something like this visually when you run it through systrace. This showed us the task here that wakes up every now and then to read audio frames, process them, and do the beamforming and all the other algorithms. And systrace nicely shows you a lot of things at the same time, so instead of going through raw traces, you can see visually what's going on. We found a pattern where these slices would merge into each other. They shouldn't merge; they should execute quickly within the time they have, otherwise the task won't be able to process the next period. And the CPU frequency governor was to blame. We were using, I think, the interactive governor, and it's really poor in this respect: it has a very short memory, so it doesn't look at overall utilization, just the last window, and it decides there's not much going on and there's room to drop the frequency. 
So it drops the frequency, then bad things happen, and then it bumps it back up again; you can see that here. We actually had to hack the product to set a minimum frequency and never drop below it. But I think the new governors are really good; the scheduler-driven frequency work (schedfreq/schedutil) does take utilization into account. Then I used perf quite a lot to find issues like cache misses, and to report them to the algorithms team: hey, you have cache misses 30% of the time. I had to blur out this stuff because I didn't want to get into trouble; this part is open source, and this part is closed source. Anyway, this is the generic Android audio pipeline, and then you have some algorithms processing here, so I had to tell them to optimize it and so on. You can have other application issues too, of course. I did find that they were not using enough parallelization to run the algorithms, so I reported that. And you can have issues like page faults, memory locality issues, and so on. Then the scheduling part: I wanted to go over some things in the scheduler statistics that can tell you about problems. With scheduling, you should design your system so that all the priorities are correct, and you set up your policies correctly, SCHED_OTHER or round-robin or whatever. And schedstats is a cool feature in the kernel that can show you how much time something was waiting on the run queue to get the CPU, so I used that quite a bit. This one is actually for the CFS policy. What I really wanted was to use it for RT, for round-robin, but it was zero for round-robin. So it was like, why is it zero? And then I looked at the code, and it was actually not being calculated. 
But then we found some other code in scheduler land that calculates, from the beginning of time, the total time something was waiting on the run queue. That doesn't tell you the worst-case wait, only the total. So we messed around with that code to also calculate the maximum time something waited before it got the CPU, not just the total, and that showed us some things here. It was definitely more than this, but just to show you. Then I did some more tests where I wanted to see whether we need to accumulate the delays across all the run queues a task visits between being woken up and getting a CPU. I took Steven's rt-migrate-test and messed around with it a little, and I did see that when there are too many RT tasks and the low-priority ones are being migrated a lot, there is a difference between calculating the delay on one run queue and adding up the delays across all of them. But that was a very rare occurrence, so I didn't bother much with it. Then cyclictest is a great tool to see how much time the system takes to service something. The idea is that you arm a timer and take a timestamp, go to sleep, the timer wakes you up, and you take another timestamp; you compare when you were supposed to be woken up against when you actually were, and the difference is the latency. I'll talk more about cyclictest in a moment. You also have something in the RT patch set called latency histograms, which is very cool. It shows you a histogram of all the times preemption was turned off or interrupts were turned off; those are called potential latencies, because preemption might be turned off without anything actually suffering from it. But you should still fix them, obviously, because you don't want to leave a problem there. 
Then it shows you effective latencies, which are, from when something is woken up until it got the CPU, how much time elapsed; that's probably the most useful part. So I did a little demo: I was running cyclictest, and I had this troublemaker kernel module that was turning off preemption, doing some busy work, and then re-enabling it, and I wanted to see whether cyclictest caught that latency. The kernel module looks like that; this is something I'd ask you never to do, unless you don't want to live very long. I run that kernel module in a loop like that, I have cyclictest running there like that, and sooner or later cyclictest catches it. But the real reason I did this test was that I wanted to try the latency_hist stuff. I mentioned effective latencies, which is a histogram of the wakeup latencies that ever happened on a CPU; it looks something like that. I skipped a lot, but it does give you an idea of the worst case, and you can see it's pretty close to what cyclictest found, roughly close. That's just the histogram, though; it doesn't tell you what actually suffered. So you have this other file, max_latency, which shows you the task that suffered the maximum latency. You can see here that it shows cyclictest suffered that two-millisecond latency. It also shows you what was running before that task got the CPU, which is useful because it might give you an idea of what was going on, which task was running just before, and that might help you dig in further. Then among the other tools I learned about is Mathieu's latency tracker, which I think I'll try; Mathieu's sitting right there. Do you want to say a few words about the latency tracker? [Mathieu:] Sure. It's an in-kernel tracker: no buffering or anything, it does aggregation directly within the kernel. 
It's based on a lock-free hash table that we implemented, as well as a lock-free memory allocator, so it works from dedicated, pre-allocated memory chunks. The goal is to be fast, with no dependency on locking, and it can be used from NMI context. And the final part is that user space can talk to a file, with work-begin and work-end operations, and write a cookie to match the beginning and end of a piece of work, so you can basically follow the response time from the IRQ all the way down to the end of the actual task. [Joel:] OK, cool. Yeah, I really like the per-stage breakdown it gives you of the whole path. And then rt-app is something that's used to generate a realistic workload. I thought cyclictest supported putting some load on the CPU as well, but I don't think it does, and rt-app does, so that's definitely something I'll check out as well. So, any questions? [Audience: when you start tracing, how does the graph tracer know where to begin?] Yeah, so you specify a graph function, which basically tells the graph tracer when it needs to start writing into the ring buffer. Until then, it just ignores all the functions, which are still being traced; as Steven was saying, there is some overhead, so you need to be more specific. But that's how it works: as soon as you hit the graph function, it stops ignoring the functions being called. And if the depth has been exceeded, if you're too deep inside the graph function, it still ignores them. So there's nothing here specific to a particular ISR or IRQ; that's the beauty of it. You can just run this, and it'll show you everything here, and it filters too, based on the one-millisecond threshold. 
So it won't show you stuff you don't want to see. [Audience:] So this is not a one-size-fits-all kind of thing? Yeah, you definitely have to try running these tracers, see what's going on, and rule things out: OK, it's not this, and so on. [Audience:] Tracing costs something too. If you're at the limit, tracing itself must cost performance, so maybe you get another drop because it's taking too long. Yeah, you can, so you need to be smart about filtering and making sure the overhead is low. Just enabling the function tracer, and I was doing some measurements, drops performance by about 50%, even if you're not writing anything into the ring buffer: if you just register an ftrace callback and it gets called for every function in the kernel, that alone drops performance by around 50%. But if you use set_ftrace_filter, it actually improves things a lot, because everything else gets no-opped, so there's no overhead on the functions that are filtered out. If you don't do filtering, though, the overhead is quite high. So yeah, if you have any questions, just send me an email. All right, thank you.