Hi, hello, humans. This is Daniel. As I said, I work for Red Hat, in the real-time team. We have the Red Hat Enterprise Linux for Real Time product, and most of my time is spent doing research and development on the real-time kernel. I also help maintain parts of the kernel, mainly RTLA, the parts of tracing where we do the trace analysis we'll see today, and a bit of scheduling, mainly deadline scheduling. This presentation is about the main metric we have on real-time Linux, the one people look at first when they search for real-time Linux: scheduling latency. Even though Linux was not designed to be a real-time operating system at the beginning, over the last decades it developed into one. There are various levels of real-time, and people can argue about what is or is not a real-time operating system; we could stay here for days, if not years, discussing whether Linux is a real-time OS or not. But there are many use cases where Linux is indeed used as a real-time operating system, so it's a fact. And there are many reasons for people to use Linux as a real-time OS, or, maybe better, to move from an embedded RTOS to Linux as an RTOS. One of the main reasons is manpower: there are many people who understand Linux and can work on tuning it and making things work. But one of the main drivers is that this new, cool stack of software that is driving the future was built, developed and thought out to run on Linux. With the cloud and the edge, things get mixed, and Linux is the native platform for many of these applications, many of which will interact with the real world, with self-driving cars and so on, so embedded systems. So it is not only "I'm a Linux developer and I want to make Linux a real-time system because I think it's cool" anymore. It's a real demand from companies that would like to use, say, the AI stack on the edge for their self-driving cars. It's way more than "I want it"; it's "we need it". And because we need it, more people are landing on Linux saying, OK, I need to use Linux as a real-time operating system. This is also possible because that need is met by the fact that Linux can achieve very good timing behavior nowadays, not only for the latency that we will discuss today, but also for scheduling. We have advanced real-time scheduling features on Linux that keep improving: we have deadline scheduling with EDF, we have the fully preemptive kernel that can give us very short latencies, and so on. So we have this win-win situation where Linux already provides good numbers and the demand is coming: people say we need a real-time Linux because it will unlock this new set of embedded software. I'm missing the word that was a very hot topic in research in the last years... it's not coming... cyber-physical systems, that's it. So these people with cyber-physical systems would both like to run this new stack with Linux and need it.
One of the problems, however, is that because Linux evolved to become a real-time system in a very developer-driven way, not, let's say, an academic way, over the years the way Linux tried to show its timing behavior was not rooted in a formal analysis or in the composition of variables, the way people in academia like. It was mostly test cases, black-box tools that mimic the behavior of a real-time application and then complain if things don't show the desired result in time. For example, for testing scheduling on Linux, the most famous tool is cyclictest, and I will show in the next slide how it works. These tools generally report a latency number. They say, okay, this is the latency my system can achieve, but they don't show you why, or how bad it can get, in the sense of trying to compose things from here and there. So this is a limitation, and don't get me wrong, cyclictest is golden for me and I work with people doing cyclictest; this is not a fight, it's the evolution of the tooling we use every day. So, getting really basic: how do we measure scheduling latency in this black-box approach on Linux today? There is a thread, generally a per-CPU thread, and this thread sets a timer for the future using an external clock reference, the wall clock. It says, okay, wake me up at 2 p.m. The granularity is shorter, it's in microseconds, but let's say 2 p.m. Then the timer fires in the hardware, the timer wakes up the thread, the thread starts executing, and the first thing it does is look at the clock. It looks at the clock and says, ooh, it's 2 p.m. plus 40 microseconds, so my latency is 40 microseconds, because I was expecting to run at 2 p.m. That's it, not even in a nutshell: that's how it works. So, the black-box approach: one can say it doesn't work. No, it actually works, because Linux evolved this way. But it has drawbacks that make our lives harder, because it doesn't give us a root-cause analysis saying, okay, this latency was bad because of this, so you have evidence of where to look to debug the system, or to tune the system, or, in the case of internal Red Hat workflows, to say that this report from this user is the same as that report from those users, so we can concentrate on a single problem instead of two. This root-cause analysis was generally done, with that previous approach, using tracing. Someone who understands tracing would go into the Linux kernel and say, I would like to enable this kernel event and that kernel event, then wait for the tracing to stop, go back and say, from this event to that event it took this amount of time, and try to speculate. These independent trace events are glued together by a human: a human looks at the trace and tries to interpret it. I did this for more than 10 years; I've been at Red Hat for 10 years and I was doing this way before Red Hat. And after 10 years of gluing traces and trying to connect things, I mean, I started getting annoyed.
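To make that black-box measurement concrete, here is the kind of cyclictest run people typically use; a minimal sketch, with option names from the cyclictest man page and values chosen only for illustration:

  # One measurement thread per CPU, SCHED_FIFO priority 95, 1000 us period,
  # memory locked to avoid page faults; prints min/avg/max wakeup latency.
  cyclictest --smp -p 95 -i 1000 -m

Each thread asks to be woken after the interval, reads the clock when it actually runs, and reports the difference as the latency; that single number is all the tool gives you.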
And this manual trace analysis doesn't scale, because there are not many people ready to get into all the details, even for a repeated bug. There are not enough people to dig into every repeated bug; it's better to concentrate people on a few instances of what is really a bug. That is one of the motivations for RTLA, as we'll see later. But who cares, right? Who cares, other than the person on duty debugging it, like Daniel trying to solve the case? The main point is, as I said before, there is this large demand for running Linux as a real-time operating system, and there will be more people needing to understand this. There are the people who would like to run real-time, and there is another set of people who don't care about real-time that much. Say I'm a kernel developer working on a file system and I send a new patch to the kernel, and it starts breaking the latency. That will become more of a problem once we finally get PREEMPT_RT merged, because on regressions in other subsystems, the real-time people will say, okay, this patch created a regression for me. How can you show the person working on the file system that their change is breaking things? How can you give a test case that shows: up to here it was your problem, you can see it's your problem, it's in your stack trace, and now it's not anymore? It's about giving them tools that make their life easier when debugging things that have nothing to do with RT; they just don't want to break someone else's properties or someone else's metrics. That happens very often in kernel development, and it will happen even more when PREEMPT_RT gets merged. And there is the case I mentioned before: automotive is coming on strong, trying to use Linux, and they also need more than just numbers. They would like more knowledge of the system, more documentation, things explained better. There's also factory automation, and it keeps growing; the synergy around Linux for cyber-physical systems keeps bringing us more and more of these needs. So this is what motivated timerlat; there are even more motivations than I can list here, and I think I pasted a link about it earlier, but this is the story that brings us to timerlat. Timerlat is a tool that aims to address this, giving evidence of where the latency comes from. And it is not only a tool, it's a combination of in-kernel tracing and user-space tooling. On the kernel side, timerlat is a tracer inside ftrace, and it shares a lot of code with the osnoise tracer. The main idea is to do in-kernel processing for the things that are faster to do in the kernel than in user space, without creating overhead that changes the picture. So we try to do the smallest amount of work in the kernel, and that is the tracers: we have the timerlat tracer and the osnoise tracer. So it starts from a tracer, not only a tool. The tracer can also dispatch a workload that mimics the behavior we saw before: a per-CPU thread that sets a timer in the future. We'll get more into this later. So we have these components in the kernel, and they are very well documented. You see this paper here, "Operating System Noise in the Linux Kernel": it explains the osnoise tracer, but most of it fits the timerlat tracer as well.
It's a nice source of information for another metric that's also important. You can see it's in IEEE Transactions on Computers, and it's open access, granted via the university here in Italy that I collaborate with. So this is one side of the coin. The other side of the coin is a tool named RTLA, Real-Time Linux Analysis. RTLA is a suite of tools, and inside it there is a tool that is timerlat. Timerlat is a benchmark-like interface for the tracer: you can use timerlat purely as a tracer, from tracefs or using trace-cmd, but rtla timerlat gives us, let's say, a shinier interface for it; it's easier to use. The idea is that it sets up, collects and parses the trace data and gives us some outputs quantifying the latency: a top-like interface and a histogram interface that I will show later with videos. So inside RTLA we have timerlat top, which is one tool, and timerlat hist, which is another. And we have an auto-analysis, which does what Daniel at the terminal would do: analyze the trace and give a starting point for the analysis. There is a kernel-space workload and also a user-space workload, if people prefer that. Later I will show how the thing works, how we get the results and how the workload works; we'll get there. So, a practical intro. As I said before, it also has a workload that sets a timer in the future, and when the timer fires we can start measuring the latencies; that's the same starting point. But here we already have a difference: because we are doing things in the kernel, the tool also has control over the IRQ that sits in the middle of the wakeup, so it has one more metric. Getting back to the earlier example: there's a thread, it says, okay, wake me up at 2 p.m., and at 2 p.m. it wakes up. What we didn't see in the previous example is that between the external event and the thread starting to run there is an IRQ handler that runs for the timer IRQ. The tracer has its own timer handler, so it can give us a bit more information. The hardware fires the timer at 2 p.m., then internally in the kernel an IRQ handler processes this event coming from the hardware and says, okay, it's 2 p.m. plus 10 microseconds, and then wakes up the thread. The thread starts running and says, okay, it's 2 p.m. plus 40 microseconds, like before. But at this point it's already split into two metrics: the IRQ latency and the thread latency. And this starts breaking down the problem. So here's one example of the interface as a benchmark tool. Hello humans, this is Daniel from the past writing, and it's a video. It is my development workstation, it has 24 CPUs I think, and there is a workload in the background running a kernel compilation. The command I, Daniel from the past, wrote is rtla timerlat top, which is the top-like interface for the tool. Here we can see the CPUs and the number of events that happened: how many times the per-CPU thread, the workload, was awakened. And here is the current latency observed for the thread. Okay, I will pause here. This is the minimal IRQ latency observed on that CPU.
You see here it's zero, that's very, very short. The average latency for the IRQ handler is between zero and one, but there are some maximum values that are higher: there are IRQ latencies hitting 17 microseconds, 20 microseconds. Then we jump to the thread side. The thread latency is similar to the number we get from cyclictest: how much time did it take for the thread to start running? Here's the current value, the minimum value, and here's an important point. The minimum value we are seeing here is one microsecond, two microseconds. It's almost at the limit of what we can expect on Linux, a one-microsecond latency. Take this number as evidence that timerlat, even doing the things I will show you later with tracing, even though it's not only running but running and tracing, has very low overhead: the minimum is not hitting something like 10 microseconds, so the overhead stays within the one-microsecond granularity. That's a question people always ask, the overhead. So the minimum on the thread is very low, the overhead is within the granularity. Then we have the average number and the maximum number. This is one interface that gives us an overview of the system. There is another interface that gives us a histogram, so we can see the whole execution, not only the minimum, the average and the maximum: it shows, over the whole execution, how many latencies were within one microsecond, within two microseconds, within three microseconds, so we can have a broader view of the system. Here the index is zero, meaning the zero-microsecond bucket, within one microsecond. Here's the IRQ column: on CPU 0 there were 883 IRQs that happened within one microsecond. For the thread, there were some in the one-microsecond bucket, more in the two-microsecond bucket. And here we can see the curve: it's very short for the IRQ, and the thread curve is a little longer, with a longer tail. So we can start having a more complete way of visualizing it. And here we have some summaries: how many wakeups we saw, what is the minimum, the average, the maximum. Yeah, I think that's it. So, slide show, okay. Oh, Daniel, your microphone turned off. Sorry, these interfaces, I'm too old for user interfaces. No, I'm kidding. Yeah, I had to move this widget. So that is the simplest form of the RTLA tooling, using all the infrastructure as a benchmark tool; it turns the tracer into a benchmark tool. But when we are doing these measurements, generally we are not only doing them to see how good my system is, but also to say: okay, I'm measuring the latency, and in my case the maximum latency I accept is 100 microseconds, for example. And for anything that happens that is longer than 100 microseconds, I would like to understand what happened, to see how I can adapt my system to reach this target latency. Adapting my system could mean tuning it, isolating a CPU, moving this thing from here to there, or: okay, this is the part of the code.
I need to go into that part of the code and try to break it down a little more, or make a report for the file system developer who doesn't understand much about RT; if I give them a good starting point showing that, okay, maybe it's a problem in your subsystem, it will be easier for them to start working, because it's hard to get people to start working on something unless you give evidence that the problem is there. Maintainers are busy, they have tons of problems to solve; we need to make their lives easier if they are not real-time maintainers. So timerlat can be set to do the measurements and stop the tracing if a threshold is crossed. This threshold can be the thread latency or the IRQ latency. And because Daniel learned that he needs to make his life easier by automating things, RTLA has an option, -a, which means auto, and it takes a value that is the threshold. It's the option that enables the default set of things that I, or the community, the people who give me feedback, generally enable when we want to stop the trace and have an analysis, because it was getting hard: this option, and this option, and that option, and that option. There's this one option that will evolve over time to carry the best setup of tracer options. So -a with a threshold enables this common set of options for RTLA, runs the trace and produces a report if the threshold is crossed. And here comes the auto-analysis. So, rtla timerlat top; I will move the widget here, you're not seeing it because I'm moving one thing from Zoom, okay? Here it says: run rtla timerlat top and stop the trace if the latency is higher than 30 microseconds. If you look at the options it enables: it stops if the thread latency is higher, it stops if the IRQ latency is higher, it prints the stack trace; it's condensed here, there are more options and I'll talk about them later. Okay, here we go. It started running and then it hit a latency higher than 30 microseconds, a thread latency of 35 microseconds, and it printed a report. Here we see that there is this spinlock in the middle. The IRQ handler delay was 12 microseconds; the IRQ latency, the point where the IRQ looks at the clock, was 14 microseconds; the IRQ handler took this amount of time, and the blocking thread took that amount of time. I will get to these terms later, this is just an example. So it was this as command holding a spinlock, a spinlock inside cgroup code, called from inside btrfs, and as is the assembler from the compilation running here. So we hit the latency and we know this was the command: it was a write system call inside btrfs doing a cgroup operation that took a spinlock. So we have here, what can I say, the fingerprint of the latency that I hit, the chain of events that shaped it. The rest of my presentation will explain the things I mentioned before, the blocking and the interference; this is just to introduce the problem. Do you have any questions so far? If not, I'll continue. I can keep speaking. Okay, let's continue. Oops, stop. I always need to do this because I don't know how to use this interface. Well, it's a lecture. So, now we'll get into the auto-analysis.
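As an aside, my reading of the rtla man pages is that -a is essentially a shortcut for the individual stop/trace options; roughly, a sketch:

  # What "-a 30" sets up, approximately: stop on a thread latency above
  # 30 us, save the IRQ stack trace, and save the trace to a file on stop,
  # plus the auto-analysis report on top of it.
  rtla timerlat top -T 30 -s 30 -t
  # The shortcut itself:
  rtla timerlat top -a 30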
The auto-analysis is what composes the fingerprint I showed before. It decomposes the latency into a set of variables, and these variables can be analyzed independently. Here I'll make a comment: there is a research work that explains these variables and why they are independent. It's research I did before timerlat that proposes a tool giving the theoretical worst-case scheduling latency. If you follow the link where this presentation was posted, the last slide has a link to that research. It uses formal methods to show that there is a bound for the scheduling latency, in a heavier academic notation. That will go into the kernel in the future; meanwhile, I took part of it and integrated it into timerlat in a form closer to what people are used to seeing: it's sampling, all the samples, not the worst case. But that's a topic for another presentation. Still, there is background that explains why these variables are independent, for those who like to go into the hard part of real-time theory. So each variable can be analyzed independently. IRQ and thread: why are they separated? Mainly because IRQ and thread latencies need different analyses; the things that delay an IRQ and a thread are slightly different, hence the importance of having the two metrics separate. And it's important to note that when people think of real-time Linux, they mainly think of PREEMPT_RT, the fully preemptive kernel. Yes, that's the main use case, but these tools are not made only for PREEMPT_RT; they work on any kernel: with PREEMPT_RT, without PREEMPT_RT in full preemption mode, without PREEMPT_RT in voluntary preemption mode. It works for all of them, it's independent, and that's a good thing. Also because the non-PREEMPT_RT kernel can give acceptable numbers too, depending on the latency you target. So it's all generic, not only for the RT kernel. As I said before, timerlat borrows abstractions from the theory, and there is that research backing it up; the abstractions just make it easier for us to communicate. When I say execution time, it's the amount of time for the task I am interested in to execute. For the wakeup of a thread there is an IRQ involved: the execution of that IRQ is execution time needed to make progress on my system. So that's execution time: the time spent accomplishing the tasks required for the scheduling latency. Then, besides the execution time I'm interested in, there is also the execution time of things that are blocking my scheduling. By blocking we mean anything that has a lower priority than the task I am analyzing: things causing delay with lower priority are blocking. And there is also interference: interference comes from things that have a higher priority than the workload I am analyzing. And here I open a parenthesis: by default timerlat uses SCHED_FIFO priority 95, so anything with a priority lower than FIFO:95 causes blocking, and anything with a higher priority causes interference.
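To illustrate that default: the measurement threads behave as if -P f:95 had been passed; if your real workload runs at a different priority, you can move the timerlat threads to match it, and the blocking/interference split follows. The flag syntax is from the rtla man page, the value is only an example:

  # Run the timerlat threads as SCHED_FIFO priority 80 instead of the
  # default FIFO:95; anything below 80 now counts as blocking, anything
  # above it as interference.
  rtla timerlat top -P f:80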
But we need to break it down into more than just threads, because Linux is not made only of threads. When we get to these finer-grained metrics, we have more task abstractions, more contexts where things run, than just threads. So in the research we split them up, with non-maskable interrupts (NMIs) as a sort of task. The NMI has the highest priority because it preempts IRQs, preempts softirqs and preempts threads; it's the highest priority because it preempts the others and the others do not preempt it. One level below we have IRQs: IRQs preempt softirqs and threads, they cause interference on softirqs and threads, but they do not cause interference on NMIs, while NMIs can cause interference on IRQs. But if we think of the kernel locking mechanisms, we know that a softirq or a thread can block IRQ execution: they can cause blocking on an IRQ if they disable IRQs, like we saw in the example before. So an IRQ can be blocked by softirqs and threads if they disable IRQs. Softirqs have lower priority than IRQs and higher priority than threads, except when they are moved to threads: in PREEMPT_RT, softirqs run as threads and become equivalent to threads. And then we have threads, which can only preempt other threads, but can still disable softirqs or disable IRQs and cause blocking for them. So: blocking and interference. Let's take the case of an IRQ latency example, analyze it, and start thinking in terms of these variables by reading the auto-analysis. I run rtla timerlat top and stop the trace if the latency is higher than 50 microseconds, say. Here is one example, and to make it easier to understand I've created a timeline of the events that are happening. The timerlat thread sets a timer for the future, for 2 p.m., and when it was 2 p.m. the hardware dispatched the event, and the IRQ latency was 32 microseconds: when the IRQ handler started working and read the time, 32 microseconds had already passed. And here's the report. Because I know, by analyzing the trace in the background, when the handler started working, we can see that the IRQ handler started running 31 microseconds after the expected time. So we have this IRQ handler delay of 31 microseconds: something delayed the start of the execution of the IRQ handler. What can cause delay? There was no interference here, so it was something causing blocking, blocking for the IRQ. What can cause blocking for an IRQ? Threads. And what was the blocking thread? It was objtool, from the compilation running in the background. objtool caused the blocking. This number here is not for the IRQ; this number is for the thread, which we'll see after this. So there was this objtool as the blocking thread, and then we can see the stack trace. Going back through it we see, okay, there was this write system call inside btrfs, and then we get into the cgroup code and I see a raw spin_unlock_irqrestore. If you know this function, you understand it means IRQs were disabled. If you don't: what does this function do? It unlocks a spinlock, but the locking of that spinlock disabled IRQs, and this function is unlocking the spinlock and re-enabling IRQs.
So the thing that ran from the expected time until the handler started working, the thing blocking it, was whatever ran inside that critical section. Something inside that critical section delayed us for at least 31 microseconds. And why is this useful? Because the synchronization is not the root cause: the kernel synchronization is not the problem, it's just the mechanism used by the thing that caused the problem. So where is the problem? The problem would be in cgroup: cgroup is calling a function that disables IRQs for too long, so it's doing something for too long inside there. And then... There are a couple of questions, Daniel, one in the chat and one in Q&A, that might be relevant to the current discussion. What happened during this one microsecond here? It is handling the timer: starting to handle the timer, switching to the handler for the timerlat IRQ, doing the tracing, and so on; it's the execution time of the handler. The handler starts, it selects, inside the timer subsystem, which timer fired, then for the timer that fired there is some tracing overhead, and then it prints. That's the 1.170 microseconds, and that's the answer for the question in the chat. What is the other question? There is a question in Q&A, I can read it out to you; I don't know if you can see the Q&A box. Let me see, okay, Q&A. Ah, okay. Can multi-core CPUs process multiple IRQs in parallel, so that they do not block each other? Does this depend on the CPU architecture, on the operating system, or so? Let me reread it. Multi-core CPUs can do multiple IRQ handling in parallel, right? Now, whether at the hardware level they block each other, I don't know; I'm the operating-system guy, I'm one level above, I would just point at the hardware people. But assuming it's possible in hardware, and given the granularity we work at, microseconds, which is a lot of time for hardware, they are parallel at our level of granularity. Does this depend on the CPU architecture? Yes, it depends on the CPU architecture, but that's a problem for the CPU people. At the operating system level, yes, they can happen in parallel, and they will not block each other unless you use a synchronization mechanism. If all your handlers run in parallel but they all fight for the same global lock, then they will not run completely in parallel. But if they don't share any resource, the kernel will just run them in parallel. Even the timers can run in parallel without interfering with each other; there is also a kernel feature that, for scheduling timers at least, tries to put a shift between one CPU and another. But yes, they can run in parallel. So I guess it depends on whether there are dependencies between those IRQs? I hope there are no dependencies between them; dependencies meaning one IRQ and another needing the same resource. Yeah, correct. But in the best case they can run in parallel, that's not a problem; they are not serialized by default. They are not serialized by default. Good. Let me close this window here because I already have too many windows. So here is one example.
We see that there is one patch in the kernel that added this spinlock that disables IRQs. In the middle... do you guys see the pointer when I'm moving the mouse? We can see it. Oh, okay. It's small, but we can see it as you're moving it. Okay, so keep moving. So here, in the middle of this code, there's something that requires IRQs to be disabled. If this delay is not acceptable for your system, you have two options. The first option is: don't compile things on the CPU where you care about latency, don't run this task there; do CPU isolation, for example, use task affinity to keep this piece of code off the CPU where you care about latency. Another approach would be: okay, I need to run this code there, so I need to go into that part of the kernel and break it down into smaller pieces, or remove the dependency on disabling IRQs. That gets into more advanced topics. But if you come with a report like this to a person who works on cgroup, you will get less resistance to the idea that, okay, the problem is in cgroup. That's a good starting point; we can go further with the tooling, but it's at least a good starting point. Then there is another thing we also saw here, another kind of delay, caused by something that is outside of our context. For example here, I can even take this one: you see here, max timerlat IRQ latency, exit from idle, 90 microseconds, and here we have a better example. We set timerlat to stop at 50 microseconds, and then we see, okay, it took 90 microseconds for the IRQ handler to start working, so that's suspicious. If you look at what was running there, it says swapper, which is the idle thread. And what was the idle thread doing? This part here is the IRQ handler, so it's not the cause, it's my execution time. The CPU was idle, and here timerlat is already giving evidence: exit from idle. So in this case the latency was not caused by the operating system postponing the IRQ; it was caused by the exit-from-idle machinery of the hardware that tries to save power. The system goes into a deeper idle state to save power, but the drawback is the time it takes to wake the processor up, not your operating system, the processor, before the operating system can start running. And that's this case: it was not the software causing the delay, it was the hardware waking up from a deep idle state, and the tool is already pointing it out: this is an exit from idle. There are a lot of refinements we can still add to the tool, and that is the idea; the tool is pretty young and it's already giving us good hints of what is going on. It can only get better. And here we go, that's where we were talking about the thread... the IRQ latency, sorry. But what happens next? Here is one example of a thread delay. In this example, I think I don't need a timeline. In this case the IRQ handler delay was near zero, so close to zero that it's within the error margin, some nanoseconds, because the IRQ latency itself was about 1.64 microseconds.
So things are really running fast here. And then we have the timerlat IRQ duration, which is the execution time of the IRQ handler; it was 90 microseconds, okay, not that bad. And then we see a blocking thread. Ah, the blocking thread: the vast majority of the number is in the blocking thread. Below it we see the amount of interference: while waiting for the thread to be scheduled after the IRQ handler, the system had some interference from other things with higher priority. There was one timer IRQ, another IRQ, not the one that woke us up, that took 3.68 microseconds, that's nothing. There was softirq interference, about 4.21 microseconds in total, not that relevant. And there was even thread interference, a thread that ran before the timerlat thread I was analyzing: the migration thread, the thread with the highest priority on the system, even higher than SCHED_DEADLINE; it runs in the stop-machine class. So we have interference from an IRQ, a softirq and a thread, but if you look at the percentages, it's a very small amount, so they're not that relevant. By the way, this output is clearly inspired by a receipt; I tried to make it look like a receipt because it doesn't get easier to read than a receipt. And here you see that the main thing causing the delay was a blocking thread, a thread running. It was a kworker thread; the u means it is unbound: it was running here, but it might run on other CPUs as well, that's what the u means. It was doing btrfs work, compressing pages. So it was deferred work in a kworker, from btrfs, compressing pages, and it took 500 microseconds. Why did it take 500 microseconds, why so big? Because this is not the preemptive kernel, this is a non-preemptive kernel, so we see higher latencies, and this also shows that the tool works for the non-preemptive kernel as well. There was a long run between these functions: between this zstd compress-block-fast function and the next check for need_resched, it was about 500 microseconds. So if one says, even on a non-real-time kernel, I'm okay with a 100-microsecond latency or a 200-microsecond latency, but I'm not okay with 500, what can I do? I can change the affinity of this unbound kworker to avoid this work being done on that CPU, or I can get into this code and try to find a point where I can add a need_resched check and move on from there. But yes, on the thread side, we will generally see this blocking-thread value high, on the RT kernel because preemption was disabled, or on the non-RT kernel because one point of the code is very far from the next need_resched check. Am I going too deep here? Maybe it's harder to follow from now on. But if we open a bugzilla with this starting point, it's very easy, even for someone who is not a super kernel expert, to see in which subsystem they need to start working, or whom they should contact, or whether this problem is similar to another problem they saw: am I still reproducing that known bug?
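On the first option, keeping that unbound kworker away from the latency-sensitive CPU: one knob that exists for unbound workqueues is the sysfs workqueue cpumask; a hedged sketch (whether a given kworker honors it depends on how its workqueue was created, and the mask value here is just an example):

  # Restrict unbound workqueues (such as btrfs compression kworkers) to
  # CPUs 0-3, keeping them off the CPUs where latency matters.
  echo 0f > /sys/devices/virtual/workqueue/cpumask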
Anyway, it's a good starting point. We went from saying "oh, the latency was that many microseconds, who knows why" to saying "it was a blocking thread, and it was a kworker from the btrfs file system". That is much more precise than just a latency number. And then we can go deeper and say, okay, that report was a good starting point, but I need to go further. Daniel, there is one question, probably on a previous slide: u40 means worker pool 40, not running on CPU 40, I think. Oh yeah, I don't know; the 40 might not be CPU 40, it might just be an ID for the thread. The most important part is the u: it's not bound to a CPU, it can move. So it also makes sense that it's not the CPU number; it can just be the ID of the worker pool. In any case, the trace shows on which CPU it filled the trace. So, rtla timerlat trace: we can go deeper into tracing. Because rtla timerlat is a front end for the tracer, we can see tracing events from it. By default it activates these osnoise trace points; they are the minimum trace points needed to compose the auto-analysis. They do some processing in the kernel to avoid overhead, because the overhead of writing to the trace buffer would be higher than the processing itself; there is research about that and we can talk about it later. So we have one trace point for each of the causes: blocking, interference, execution time. It's also important that these values are already free of nested interference: the thread noise value, even if a softirq, an IRQ and an NMI ran inside the thread window, already discounts the time consumed by the softirq, the IRQ and the NMI. So you don't need to subtract the nested noise from the duration or the execution time yourself; it's already handled in the kernel, you can just read it directly. So here's an example of the auto-analysis together with tracing. It's timerlat with -a 30, stop the trace if the latency is higher than 30 microseconds, and it's running only on CPU 4 here to make the trace easier to read. It starts running, then it stops and shows the auto-analysis, and it says it's saving the trace to timerlat_trace.txt; presumably Daniel from the past will read this file. Okay, it showed this clear_page_erms, it has this function there; you can see that the IRQ latency was one microsecond and the timerlat IRQ duration was 17 microseconds. And here's the trace, pure ftrace output. You can see the timerlat events from the kernel, and this osnoise trace point saying, okay, the thread noise, which is what translates into blocking in the auto-analysis, was the cc command, this was the PID, it started at this point and it took 9.2 microseconds. These are the things RTLA uses to compose the auto-analysis. The auto-analysis is based on this small set of events; it tries to minimize the number of events, and by minimizing the events we minimize the overhead, and these events are partly processed in the kernel to minimize overhead too. So it tries to have the minimum possible overhead. Do we have overhead? Yes, but the overhead is within the margin, within the granularity at which we are analyzing the system, so it's okay.
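Since rtla is only a front end, the same session can be driven straight from tracefs; a minimal sketch, with knob names taken from the kernel's timerlat documentation (worth double-checking on your kernel version):

  # The tracers need CONFIG_OSNOISE_TRACER=y and CONFIG_TIMERLAT_TRACER=y.
  cd /sys/kernel/tracing
  echo timerlat > current_tracer            # enable the timerlat tracer
  echo 250 > osnoise/timerlat_period_us     # period of the per-CPU timer threads
  echo 30  > osnoise/stop_tracing_total_us  # stop on a thread latency above 30 us
  cat trace                                 # raw ftrace output, as in the video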
Okay: is this tool available only on Red Hat, or on other versions as well? Everything I do is fully open; it's available for all distros. On Fedora we already have the package. I know that in Debian there is a package, and on Ubuntu the tracers are enabled but we don't have the package yet. Andrea Righi, Andrea Righi in Italian, is enabling this on Ubuntu, and if it's not there right in the next release, it will be shortly after as an update; he's working on it now. So it will be available on Ubuntu, and it works on all distros. RTLA is part of the kernel; it's a tool inside the tools directory in the kernel tree, so it's generic kernel stuff. Okay, the auto-analysis, getting back to the slides. But then, the auto-analysis is based only on the... okay, there's a Q&A here, sorry, I didn't see it. Does RTLA depend on a particular Kconfig? The only Kconfig options it depends on are the ones that enable the tracers: you depend on ftrace and on enabling the osnoise and timerlat tracers, that's basically it. They don't depend, for example, on PREEMPT_RT or not, or on being SMP or not; they only depend on the options that make the tracers available. And as I said before, yes, it works on the non-real-time kernel as well, so it doesn't depend on that kind of Kconfig. Good question. I'll mark it answered live, done. Okay, so rtla timerlat can also be used as a front end for other tracing events, because why not? It's possible to enable any other trace point in the kernel, appending it to the timerlat trace output, and not only to enable events but also to filter them and to use the tracing trigger features, and you can do some cool stuff with that. There is still a lot to develop; the tool is not even packaged on Ubuntu yet, so it's very new. So here, okay, I dispatched timerlat again with the same command line, but it's also enabling some kernel events: the sched events, the workqueue events, the irq_vectors events and the irq events. These are classes of trace points, so it's enabling all the trace points in these classes, and then it hits the stop-tracing condition and we look at the trace output. Go ahead, Daniel from the past who made these videos; come on, Daniel from the past. As you can see, I'm not a YouTuber, I'm not good at making videos. So it's saying the timerlat IRQ duration was 30 microseconds, and there was an irq_work IRQ. And here we can see that we have more events than before: if you follow the mouse, we see sched_waking and sched_wakeup events, irq_work events, together with the timerlat events. See, the timerlat IRQ, but also the wakeup events for the timerlat thread; the execution time of the IRQ handler, but also the events that say, okay, it was the local timer, this is the vector, and so on. So we can have all these events enabled, helping us go deeper into the analysis. Let me move on. But going beyond just enabling events, and this is where we start going deeper, it's also possible to leverage the osnoise trace points that show the blocking and the interference, to collect statistics or histograms for all the sources of blocking and interference from the kernel.
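The command behind that demo is along these lines; a sketch reconstructed from what the video shows (-e accepts a whole event class or a single sys:event, per the rtla man page):

  # Auto mode at 30 us on CPU 4, with extra event classes appended to the
  # timerlat trace for context around the latency.
  rtla timerlat top -a 30 -c 4 -e sched -e workqueue -e irq_vectors -e irq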
There is a page linked from my blog, on the page where we have the tutorial for this presentation; there is probably a link to it there. This is where I copied the command line from, and you'll see it in this video, because it's a pretty cool command line. So I go there, I copy it, and here it is, this monstrosity of a command line. Let me see... it's rtla timerlat top, and it's enabling the osnoise events. This bar is in the middle, you see, I'm not a YouTuber, but hopefully you can see it. So rtla timerlat top is enabling the osnoise NMI event, and it says: I would like a trigger when this NMI trace point fires, and the trigger creates a histogram of the duration of these events, so I can collect a histogram of how long the NMIs take to run. Same thing for the IRQs: irq_noise, create a histogram of the duration of each IRQ. Same for softirqs, same for threads. So this is the command line it runs. Timerlat starts running, it's doing the measurements. You can still see there is a little bit of overhead: the minimum went from two to three microseconds. It's within those acceptable values, very low overhead, from two to three, but look at the amount of information it gives us; it's worth the overhead. So it's running, doing the measurements, and at the same time it's creating histograms of the execution times of the things that create blocking and interference. At the end of the execution it saves those histograms into files. So let's see. There were NMIs on the system: on CPU 0 there were two NMIs that took nine microseconds. Again, this is pure tracing output; yes, making this look pretty is on my to-do list, I just didn't have time. But it's already usable. On CPU 8, in the six-microsecond bucket, there is one NMI. The highest number you see here: on CPU 22 there was one NMI that took 10 microseconds. So take note of that: the NMIs can create an interference of up to 10 microseconds. Okay, let's get back here: that was NMIs, now IRQs. The IRQs are sorted by CPU, first all of CPU 0, then CPU 1, then CPU 2, and within that sorted by duration, so at the end of each CPU we have the highest one. Going back here: on CPU 22, where I'm pointing the mouse, there was a timer IRQ that took 32 microseconds. So I have a timer IRQ that can take up to 32 microseconds, and I have an NMI that can take up to 10 microseconds. So it's easy to say that my latency can already reach 42 microseconds, because the IRQ is independent of the NMI: I can have a bad case where I hit both together, one after the other, an IRQ and an NMI, and then my latency is 42 microseconds, even though I might not see it in the timerlat output, because they didn't actually happen back to back while sampling. But if they get synchronized at the same time, like I said, it's possible.
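For reference, the "monstrosity" is along the lines of the example in the rtla timerlat documentation, one -e/--trigger pair per noise source; thresholds, CPUs and duration here are only illustrative:

  # Measure with timerlat while in-kernel hist triggers build histograms of
  # the duration (in us) of each source of noise.
  rtla timerlat top -c 0-23 -d 10m -a 80 \
    -e osnoise:nmi_noise     --trigger="hist:key=duration/1000:sort=duration/1000:vals=hitcount" \
    -e osnoise:irq_noise     --trigger="hist:key=desc,duration/1000:sort=desc,duration/1000:vals=hitcount" \
    -e osnoise:softirq_noise --trigger="hist:key=duration/1000:sort=duration/1000:vals=hitcount" \
    -e osnoise:thread_noise  --trigger="hist:key=comm,duration/1000:sort=comm,duration/1000:vals=hitcount"
  # While the triggers are active, the histograms can also be read from
  # /sys/kernel/tracing/events/osnoise/*/hist.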
And when we start going to this level of numbers, we are starting to compose the worst-case latency. That's the other work I mentioned at the beginning, which tries to compose what the worst-case latency could be if the worst things happened in the worst possible order. And here is the blocking part: how much blocking do lower-priority threads add to my system? You see, it's 14 microseconds... I'll try to lock the screen here... okay, let's say 10 microseconds; it seems there were more, but let's say plus 10. So the 42 microseconds now become 52 microseconds, because if I am unlucky enough to have this RCU thread running when my IRQ is set to fire, and the NMI, which is independent of it, also happens, and an IRQ also happens, we can sum these independent things and say, okay, my latency could be up to 52 microseconds. So we start playing this game. In my example there was no single sample that took 52; the maximum observed here was around 35. Still, there is evidence that it could be more than the observed number: up to 52 if we compose those metrics. And this kind of analysis can only get better as the tool evolves; that's the main goal. But to get there we need to experiment. So, as I showed in the first example, there is an in-kernel workload and a user-space workload. I started with the in-kernel workload because it was easier to show as a proof of concept, and then I extended the tracer to also enable user-space threads: there is the -u option, where the workload is a user-space thread instead of a kernel-space thread. With the user-space thread we have one more field. As the demo from Daniel from the past runs here, it's a user-space thread; here it's an rtla thread, but it can be any program. It could be ROS 2, for example: you could measure the latency of ROS 2 if it adopted this, something inside the middleware that uses the timerlat interface to go to sleep, and when it wakes up, timerlat can produce a scheduling latency report for it. I have an example with ROS 2, or it can be a Python script. So here is the IRQ latency we know, the thread latency, and here is the extra metric for the user-space tool after it finishes its processing: the overhead of going to user space and returning to the kernel. It's available both in top and in hist. I recently submitted a patch to the kernel, it's in 6.9, where you can use a Python program as the workload, and starting from that Python program you can extend it to any program, and see how to use the timerlat interface to monitor your own program. So you can point to the root causes of the latency for any program, not only for the timerlat workload; it got into 6.9. By the way, we're heading towards the end. There are a lot of options in timerlat: you can set the period of the thread, limit the CPUs it runs on, choose how long the experiment runs, and turn on debug output to see it enabling the tracer, setting priorities and so on. The workload starts as a SCHED_FIFO priority-95 thread, but you can set any priority.
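A rough sketch combining those knobs, the user-space measurement threads and a non-default priority; flag syntax from the rtla man pages, values purely illustrative:

  # User-space measurement threads (-u) instead of the in-kernel workload,
  # running as SCHED_FIFO priority 80 on CPUs 1-7, with auto analysis at 100 us.
  rtla timerlat top -u -P f:80 -c 1-7 -a 100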
You can set it to SCHED_OTHER, the non-real-time scheduler, with a nice value (this changed recently: it takes a nice value now, it was a priority before). You can say it's round robin, or FIFO with a given FIFO priority, or you can say it's a SCHED_DEADLINE task; for it to be a SCHED_DEADLINE task you need to disable real-time throttling, to allow the tasks to be pinned to cores. You can tell timerlat to run the part that collects the data on a set of housekeeping CPUs, so it doesn't disturb the workload: say, run timerlat's collection on CPU 0 and monitor CPU 1 up to the last CPU. You can put the timerlat threads into a different cgroup, and you can do some tuning for the exit from idle, to avoid the exit-from-idle problems. And then we get to "by the way" number two: up to here, the trace analysis was done assuming that your hardware runs without interfering with the threads. But it can be the case that the hardware itself stalls the execution of the CPU: the hardware blocks everything, for example, to run an SMI, and it blocks the entire operating system. Or, in a virtual machine, the vCPU runs as a thread in the host and can be preempted by another thread in the host. Those things can create scheduling latency too. And there's a tool for this, hwnoise, that measures it. Just a second... it's allergy season in Italy already, it's spring. This hwnoise tool is part of RTLA and is based on the osnoise tool; the paper I pointed to at the beginning of the slides explains osnoise and hardware noise. So here I'm saying: measure the noise of the hardware on CPUs 0 to 7. It dispatches a per-CPU thread that disables IRQs and tries to see whether the hardware is stealing execution time from the thread. I'm not going into the details, but it is able to say that, okay, on CPU 0, after this amount of runtime in microseconds, there were 33 microseconds that were stolen from the operating system by something that is not my running thread. And here we can see that there was one stall from the hardware that took eight microseconds, and there were 10 times during this run that the hardware stopped running the current thread and did something else, I don't know what, because I'm not a hardware guy, and they create these latencies here. If we take this eight, or let's say 10 microseconds, that the hardware can add to my latency, and sum it with the 52 we composed before, we can say that my latency can be up to 62 microseconds, and we can keep composing. But as a rule of thumb... are you guys seeing the screen or not? Hello, humans? I can see the screen. Okay. Maybe it lagged a bit before, but now it's working; it's fine, and I can even see the cursor. Good. So, as good advice: before going into scheduling analysis with timerlat, be sure that your hardware is tuned not to have latencies, or be aware that if your hardware has latencies, any number lower than this max single noise here might be caused not by the operating system but by the hardware injecting noise into the scheduling latency. This hardware noise makes things very random and complicates the analysis.
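The runs behind those two points look roughly like this; option names are from the rtla man pages, the CPU split and duration are just examples:

  # timerlat with rtla's own collection pinned to housekeeping CPU 0,
  # measurement on CPUs 1-7, and /dev/cpu_dma_latency held at 0 to avoid
  # deep idle states.
  rtla timerlat top -c 1-7 -H 0 --dma-latency 0 -a 100
  # hwnoise: per-CPU threads with IRQs disabled, looking for time stolen by
  # the hardware (SMIs, hypervisor preemption, ...); quiet mode prints a summary.
  rtla hwnoise -c 0-7 -d 10m -q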
So be aware of this number first, and try to get it as low as possible with BIOS tuning or CPU tuning, like disabling idle states, setting stable frequencies, disabling hyper-threading. Try to get this number low, because it influences the granularity at which you are able to analyze the system; these kinds of bugs are tricky. And yeah, final remarks. Timerlat integrates the workload, the tracing and the auto-analysis into a single tool, so it produces a summary of the root cause of a latency spike, and that's a good starting point for the analysis even for a non-expert: you don't need to be a tracing expert to know, okay, it was that; maybe I can just set affinity and we're good to go. So this is a user-level debugging experience, not a kernel-level one. And this helps the people who are coming to enable Linux on new platforms and don't want to become kernel developers; they are developing the cool stuff that runs in user space, that humans can understand, because humans generally cannot understand the kernel at that depth. They are developing tools that my mother could look at and say, good, Daniel, I'm proud of you, and she doesn't understand the kernel, so she's proud anyway. But yes, these people doing the new stuff in user space, enabling cyber-physical systems with Linux, don't need to understand the kernel to build their products; they just want a report that makes things easier for them. Having this level of information helps enable those people and that use case. And it also allows more advanced tracing for the tracing nerds. RTLA is home to other analysis tools besides timerlat: there is hwnoise, the timerlat tool we saw, and also the operating system noise tool, osnoise. And it can only get better: it's already possible to measure the execution time of tasks by running timerlat and the osnoise trace points, but I need to build a better interface for those things inside RTLA, to make them easier to run and to understand. All this work was motivated by research that gives a formal derivation of a scheduling latency bound; that is the RTSL work, which I talked about at Plumbers in 2020. The goal is to make that part of RTLA; I just haven't done it yet, because timerlat is closer to what people are used to from working with cyclictest, and many people are not interested in the worst-case latency but in the average case. I have also been working with Paolo Bonzini on integrating these things with KVM: for example, for the hardware latency, when you have KVM in the middle it currently all points to the hardware; we will be able to differentiate the noise added by KVM from the noise added by the hardware. And whatever the community needs, the thing can only get better. Before making this presentation I wrote a tutorial and then built the presentation from it, so you can see a more verbose version of this presentation at the link; it also has a recording of an older version of the presentation, and there is text explaining all of these things in more detail. And yeah, that's it. Do we have any more questions? No questions in the chat at the moment. I think you already answered this question, Daniel: is this tool available only on Red Hat or on other versions as well? I assume that is about RTLA.
So I put in a link for RTLA in the kernel and also the documentation. Yeah, it is available for Linux in general; you can just compile the tool and run it. And it compiles very nicely now, because I had to rework all the makefiles. I just compiled it as we were talking; it only has a couple of dependencies that happened not to be installed on this test system. And that was the thing I fixed, because Linus pinged us saying, come on, the way you guys do the dependency check is not good, rewrite the makefiles. And now it is ready. Oh, I see. Yeah, it just told me to install libtraceevent and libtracefs, I installed both, and then done. Good, good. That was the work I did last month, trying to make the makefiles better. Right, and Linus said "lovely". So yeah, I will print that mail. You'd have to print it and frame it, Daniel, he said "lovely". Yeah, Stephen said the same thing, print it and frame it. I will do it. I'll print it, I'll make a T-shirt with it nicely printed. So yeah, that's good.

Do you use the latency tool in the same tracing directory? There is a latency collector there. Do you use that, or is it something you don't have a need for? That is more of a sample code, because the tracing subsystem can notify you: RTLA timerlat, and the preemption- and IRQ-disabled tracers, have a notification interface, and if you enable those things and wait for the notification events, those events will wake up a thread. That latency collector is sample code for that interface, but it is more of a sample. Sample code, okay. Good, thank you all. It's pizza time in Italy.

Looks like there is one question: can it be used to identify noisy neighbors? Can it identify noisy neighbors? Okay, yes, there is one option that helps you identify that. In timerlat there is an option, --dump-tasks, that, when the trace stops, prints all the threads that were running on the other CPUs. For example, this is based on a real case. We were seeing high IRQ latencies, but these IRQ latencies did not make sense, because my CPU was idle. It looked like hardware-induced latency: something running on the system was causing the hardware to do chunks of stalls, and I was seeing on my CPU that the hardware was stalling it, but nothing was running on my CPU. And it was in between the start of the thread handler and the print from timerlat; that number is very short, generally in the one-microsecond range, and it was getting up to 10 or 50 microseconds. So it did not make sense; it could only be the hardware. What was happening was that on another CPU there was a video driver writing to the video card, and this operation, the hardware moving the memory, was stalling all the CPUs. So what we were seeing in the tracing was that on some CPU there was a kworker running for the video driver. It was always there: always some CPU running a kworker for that driver. So what I did to automate that was --dump-tasks: when you hit a latency that is higher than the threshold, it prints the auto analysis, but it also prints what was running on the other CPUs. And if we start to see always the same guys on the other CPUs, we can start looking: okay, what is this other code running here?
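As a rough sketch of what I just described, again from memory, so the exact flag spellings may differ on older kernels:

    # Stop tracing and run the auto analysis if the thread latency goes above
    # 50 microseconds, and dump the tasks running on all CPUs at that moment,
    # which is how the video-driver kworker showed up in the real case above:
    rtla timerlat top -a 50 --dump-tasks

    # Building rtla from the kernel source needs the libtraceevent and libtracefs
    # development packages (package names vary by distribution):
    make -C tools/tracing/rtla && sudo make -C tools/tracing/rtla install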
But that is one example of noisy neighbors, and --dump-tasks helped with that. So in this case, was that kthread or process tied to that CPU? No, it was running on another CPU. Okay, so not necessarily scheduled to run there; you could just tie it to that CPU. No, no, it was not tied to that CPU, it was random. Yeah, it was random, and it was not on the CPU where I observed the latency; it was another CPU. Okay. And in that case, we brought it to the hardware person and said, we are seeing this, but we are always seeing your driver there on the other CPU, and if we disable your driver, the problem stops showing. Hey, thank you, Joe Mario, I know Joe Mario, just a compliment in the chat, I like Joe Mario. Thank you, Daniel, this is a really great presentation. People can take advantage of this, for sure. People have a lot of questions on how do we configure, how do we trace, like the noisy-neighbors questions. So thank you for doing this; these are very advanced questions. Thank you all. Ciao, ciao. Bye-bye. So thank you, Daniel and Shuah, for your time today, and thank you, everyone, for joining us. As a reminder, this recording will be on the Linux Foundation's YouTube page later today, and a copy of the presentation slides will be added to the Linux Foundation website. We hope you're able to join us for future mentorship sessions. Have a wonderful day.