Should I start? Mark, wait a minute. Okay. Got it. So, hello, I'm Daniel, and I'm here today to talk about a tool that we integrated in the kernel two releases ago, in 5.17, which is the Real-Time Linux Analysis tool, RTLA. I work for Red Hat in the real-time team, where I basically do analysis of the real-time kernel, and I'm now working on runtime verification of the logical aspects of Linux; basically, trying to do analysis of the kernel through tracing.

Right. Let me give some background on the theory and where I come from. Linux has been used as a real-time operating system. It's a fact; people might not like it, but it's a fact. And there are multiple reasons for people to use it. One of them is the software stack: like we just saw in the presentation about ROS, things are being built to run with Linux as the default operating system. We also have the manpower, people who know how to use Linux. But mainly, it's because Linux nowadays achieves the desired timing behavior: the granularity of the scheduling decisions that we have on Linux is good enough for the vast majority of applications, right? And there are some key features that help with that. One of them is the fully preemptive mode, PREEMPT_RT, which hopefully will not be "the PREEMPT_RT patch" anymore; in the next releases it will be just Linux, in this mode. Another is real-time scheduling with SCHED_DEADLINE, among other things.

However, one of the problems that we have nowadays is that the timing properties of Linux are analyzed using a black-box approach, which is sometimes not that convincing when you need to run things like safety-critical systems. For example, we have cyclictest, sysjitter, and oslat, which try to mimic typical workloads and report a latency, and then we can try to decide whether it's good or not. The black-box approach works; there's nothing to say against it, PREEMPT_RT was developed using it, so it's not that it doesn't work. It works, but it has some drawbacks. For example, it doesn't give a root-cause analysis of the problem. There are ways to do root-cause analysis, mostly involving tracing, but tracing is not that accessible for non-experts, right? And also, selecting the things you would like to trace and refining the selection, going back and forth over all the possibilities, takes time. Moreover, another argument: with the merge of PREEMPT_RT, real time goes to the masses. All kernel developers will have to start testing their algorithms with PREEMPT_RT, because a regression will no longer be a regression on the PREEMPT_RT patch set, maintained by those guys in the garage; it will be a regression on Linux. But not everybody is interested in learning all the details of RT, just as I'm not interested in all the details of file systems, because there's only so much knowledge we can handle, right? So we, the PREEMPT_RT developers, should not impose on all the other developers having to be aware of all the details of PREEMPT_RT to be able to debug it; that's unreasonable.

And that is the story that brings us to the new approach that RTLA takes. RTLA tries to turn that black-box approach into a white-box approach, and it integrates both the workload that we use to mimic applications and get the latency measurement, and the tracing that allows us to figure out what happened when a bad value, for example, occurred.
And RTLA has, in the kernel, a set of tracers, and in user space a tool that makes it easy for us to use those tracers in a more automatic way and also helps with the data analysis. So, kernel tracers. RTLA uses two tracers: the osnoise tracer, which tries to measure the operating system noise metric, common in HPC but nowadays also common in RT with the usage of network function virtualization, where people isolate most of the CPUs and dedicate a CPU to a busy task that loops on the network; and the timerlat tracer, which is basically a cyclictest on steroids. To introduce it simply: timerlat is like cyclictest.

So, starting with the osnoise tracer. Operating system noise is a well-defined metric in HPC, and it is basically the amount of interference experienced by an application due to operating system activities; but not only those, any kind of activity. And it's generally a very fine-grained metric. To give an idea of the context: HPC workloads are generally composed of parallel jobs. You have one dispatcher that dispatches the jobs across CPUs; they run in parallel, and when they're done, it collects the results. The problem with OS noise is that by delaying the execution of one of these threads, you create a vacuum in the execution, delaying the entire computation. And there are some HPC and real-time workloads, mostly on the networking side, where glitches like operating system noise as high as 20 microseconds aren't tolerable.

So, the osnoise tracer is a kernel tracer. You enable it using the tracing infrastructure, and it also dispatches a workload that tries to mimic that HPC workload, dispatching, for example, one thread per CPU. The tracer also hooks into existing trace points and parses this data, trying to extract the meaning of the trace. And it's good that it's in the kernel, because I can synchronize the workload and the tracer atomically, so I do not have false positives in the output of my tracer. There is a peer-reviewed paper under review where I explain all of this in detail; hopefully it will be published, it's on the way. But the kernel documentation also explains everything in detail.

So, this workload that the tracer dispatches per CPU busy-loops reading the time, and when it detects a gap in the time, it reports: okay, there was OS noise, and this many tasks interfered. It will be clearer with an example. But it can also detect hardware- or VM-induced latency. Getting a little bit more practical: using the osnoise tracer, I can go to the tracing directory, echo osnoise into the current tracer, and read the trace. This is the periodic output that the tool gives. At the end of a given period, the thread that was running trying to capture OS noise prints a summary. This summary shows how much runtime it got during this period; the amount of noise that the thread detected while running; how much CPU was available for my workload thread; and, from all those noises, the highest noise from a single occurrence, the largest one, is printed here. Moreover, as I'm also tracing the system, I am able to identify, right here, a hint of what is causing those latencies. For example, I report how many IRQs happened, how many softirqs, threads, but also NMIs and hardware noise, and I'll explain in a bit how I detect this.
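As a rough sketch of those steps (following the kernel's osnoise documentation; the path assumes tracefs is mounted at its usual location, and you need to be root):

    cd /sys/kernel/tracing          # the tracefs mount point
    echo osnoise > current_tracer   # starts the tracer and its per-CPU workload
    cat trace                       # read the periodic summaries
    # Each summary line reports, per CPU: the runtime in the period, the total
    # noise, the percentage of CPU available to the workload, the maximum
    # single noise, and counters of hardware, NMI, IRQ, softirq, and thread
    # interferences.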
So, before I go there, there's a set of configurations that I can do on the tracer. I can say which CPUs I want it to run on, and a runtime and a period with which the task will be dispatched. And I can tell it to stop the trace if the total amount of noise was higher than a given value, or if a single noise was higher than a given value. And there's also this fine-tuning knob here: because the osnoise tracer works by reading the time twice and detecting a gap between the reads, I can say how short a gap I would like to detect, for example one microsecond or five microseconds, whatever is tolerable. The default value is five microseconds. I've been running it with one microsecond and it works. If there was too much overhead in the tracer, running with one microsecond would be bad, because the numbers would not converge; being able to run with one microsecond is a good property of the tracer. I'll show a sketch of these knobs at the end of this part.

That part of the tracer just gives an overview, but I also said that it was possible to detect the root cause of the problem, right? So, going further down the road, we can ask what kind of interference can create noise in my workload. Abstracting it, we can use the four levels of tasks that we have on Linux. We have non-maskable interrupts (NMIs); we have interrupts (IRQs) that can be preempted by NMIs; we have softirqs that can be preempted by IRQs and NMIs; and then we have threads that can be preempted by softirqs, IRQs, NMIs, blah, blah, blah. I did a formal model explaining this, and the blah, blah, blah can be very long, but I will stop here. But we can also suffer interference from things that run at a higher-than-OS priority: for example, SMIs, which just preempt the operating system, and bye-bye; or, if you are running a virtual machine, the host can preempt the VCPU. And these things are not detectable by tracing, because there are no trace events for their execution. But because I read the time, when I see a gap larger than the threshold with no traced cause, I can detect them. But there is more; with the example, it will be clearer.

So, for me to detect the execution of those four levels of tasks, the tracer added four new trace points, and these trace points report the duration of each of those interferences. Because, for example, while my measurement thread is running, another thread can preempt it; while that thread runs, an interrupt can arrive inside it; and then an NMI on top of that. So, we can have this nesting hierarchy of tasks. The trace points are aware of that, and they discount the values: they subtract the interference of one from the one it runs on top of. So, all the values here are net values; you can just read the value, and that was the overhead added by that task. You don't need to compute and figure out: okay, my thread was running for one microsecond, but there was a 20-microsecond interrupt on top of it. The tracer will say the interrupt was the culprit for 19, my thread was just 1, and that's why the overall is 20. The trace points that do all this magic are in the kernel, and that's why the workload is in the kernel too: so I can synchronize everything atomically. And there is a trace point that, when the workload detects noise, prints two things: the amount of noise that was detected by the tool, and how many interferences from tasks were detected.
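Before the example, here is the sketch of the configuration knobs mentioned above. The file names follow the osnoise tracer documentation, so double-check them against Documentation/trace/osnoise-tracer.rst on your kernel; the values below are only illustrative:

    cd /sys/kernel/tracing
    echo 0-11 > osnoise/cpus                 # CPUs on which to dispatch the workload
    echo 1000000 > osnoise/runtime_us        # look for noise for 1 s...
    echo 1000000 > osnoise/period_us         # ...out of each 1 s period
    echo 8 > osnoise/stop_tracing_us         # stop if a single noise > 8 us
                                             # (as in the example that follows)
    echo 0 > osnoise/stop_tracing_total_us   # 0 disables the total-noise stop
    echo 1 > tracing_threshold               # minimum gap (us) treated as noise;
                                             # 0 selects the 5 us default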
That will be useful for understanding this example. So, here is the system running. I set it to stop if there was a noise higher than eight microseconds. So, it was running, and I see here the IRQ noise: it was this timer, and it started at this time, and this was the duration, blah, blah, blah. And then you can see how this traces. Here, it detected one noise of 8.8 microseconds, and it's already saying that this noise was caused by the two previous task executions. So, I just need to look back at the two previous events: it was caused by this migration thread and by the timer that ran before it. If this interference counter is zero, it was something below the operating system, so I can detect hardware noise and preemption of virtual machines. And because this data is synchronized (there's a nice comment in the code on how I do the synchronization), there are no false positives. So, you can trust the data.

So, the next tracer is the timerlat tracer. The timer latency has been used as the metric for PREEMPT_RT, right? Because, at the very end of the day, cyclictest is a timer-testing tool; if you read the code, it's a timer-testing tool. So, it empirically measures the observed scheduling latency of the highest-priority thread, or of whatever priority you set for the thread, right? And the timerlat tracer measures the same metric; it tries to mimic that behavior, but it's integrated with tracing and it gives more data, as you'll see now. So, just to explain how, for example, cyclictest works: you have your thread, you set a timer in the future, go to sleep, and wait to be activated. So, the timer will fire an IRQ, that will activate the task, that will read the time and compute the delta, right? That's how cyclictest works. Timerlat goes a little bit further. Because I'm running in the kernel, I have my own IRQ handler for the timer that wakes up the thread. So, I can capture not only the latency suffered by the thread, but also the latency suffered by the IRQ, right when it starts to execute. Not at the point where the IRQ wakes up the thread, which would mean having to figure out the amount of overhead of the wakeup and of taking the locks; no, as soon as the IRQ handler starts to run, it records the latency. And that's useful to detect the case in which the hardware is causing the latency.

So, an example of running this tracer: go to the tracing directory, echo timerlat into the current tracer, and read the trace. So, there are these per-CPU threads; this one is running on CPU 0. There was the first activation: in the context of the IRQ, the timer latency was 900 nanoseconds; then, in the context of the thread that was awakened afterwards, it was 11 microseconds. And this is the summary output of the two. It has the same set of configurations that I had for osnoise, but I can also tell the tracer to capture a stack trace of the thread that was running when the IRQ happened. And so, when there is a latency added by a thread, I can use this information to figure out where the thread that was running here was executing. So, I can look at the back trace, check the code there, and see where the section with, for example, a long IRQ-disabled region is.
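A minimal sketch of that run; the sample lines mirror the example in the kernel's timerlat documentation, so take them as illustrative:

    cd /sys/kernel/tracing
    echo timerlat > current_tracer
    cat trace
    # Each activation prints an IRQ-context line and a thread-context line, e.g.:
    #  <idle>-0   [000] d.h1  54.029328: #1  context    irq timer_latency    932 ns
    #   <...>-867 [000] ....  54.029339: #1  context thread timer_latency  11700 ns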
So, timerlat analysis. Again, the latency can be caused by those four kinds of tasks, but also by a previous task running with preemption or IRQs disabled. A parenthesis here: in the future, RTLA will integrate the RTSL tool that I developed as part of my PhD, which can determine the worst-case latency, with a theorem proving that bound. But that tool depends on having the IRQ-disable and preempt-disable trace points, which are a little bit more costly and are not enabled by default. So, that's why we still have the timerlat tracer, sitting halfway between these two tools. But that's for next year.

So, again, I can use those same trace points, with a slight difference. The NMI and IRQ trace points report the same values. The softirq and thread trace points don't report the entire execution time of those tasks, but only the execution time starting from the timerlat IRQ, because the time from that IRQ on is the amount that I care about. So, I don't need to compute a discounted value, the total execution time minus the part that happened before the IRQ; the tracer is already doing this cleanup in the trace points. That makes the trace easy to understand, so I don't need to go back to text files and do the thing that I've been doing for 10 years, and that I'm tired of doing.

So, here's one example. I go to the tracing directory and enable the tracer. I say: stop the trace if I have a latency of more than 500 microseconds, and get the stack trace of the IRQ if the latency was higher than 500 microseconds. And here is the trace, right? So, I have my timer IRQ latency here: it was 1.6 microseconds, so the problem is likely not here; it's very, very smooth. Then I have here an accounting of a timer IRQ noise of 11 microseconds; it was the timer that woke up the thread. Then I have yet another timer IRQ, of 7 microseconds, that ran something; we could later go here and enable other tracing, but it's not the main factor. And then I have here a thread noise of 838 microseconds. So, this is the dominant factor, right? The problem is here. So, now let's look at the stack trace. Next page, but it's the continuation. Here I see: okay, this is my timerlat IRQ, which should be running; that's where I capture the stack. Here's all the IRQ. From here on is the IRQ; from here down is what was running before. This was just an example that I placed in the documentation: I created a dummy module that was just doing a one-millisecond busy wait with preemption disabled, and the tracer was able to capture that, right? It's "dummy load, 1 ms, preemption disabled". All the things that I'm showing here are exactly what is in the kernel documentation; so, this is basically an interactive read of the kernel documentation.
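The setup used in that example would look roughly like this; the tunables are shared with osnoise, as described in the timerlat documentation:

    cd /sys/kernel/tracing
    echo timerlat > current_tracer
    echo 500 > osnoise/stop_tracing_total_us  # stop if the thread-context latency > 500 us
    echo 500 > osnoise/print_stack            # save the stack at the timer IRQ, and print it
                                              # when a thread latency > 500 us is hit
    cat trace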
So, and now we get to RTLA. RTLA is a user-space tool, and it serves as a front end to set up the tracers and do some data analysis on them, right? So, the idea is that it transforms those tracers into a benchmark tool, because I'm interested in the numbers, not only in what executed here and there, but in the collection of the numbers, right? And it's in C. It's hosted inside the kernel tree, like perf and the BPF tools, inside the tools directory. And in this initial implementation, RTLA has two commands, right? One is rtla osnoise, which measures the OS noise; and rtla timerlat, which measures the timer latency.

So, rtla osnoise is an interface for the tracer; it adds more options on top of the tracer. From rtla, I can change, for example, the priority of my workload, and it allows me to enable other tracing features. For example, I might want to enable not only the default trace points of the workload, but also all the timer trace points or all the scheduling trace points, straight from the tool. I can also enable things like tracing histograms from it; I'll show in the demo what is possible. So, rtla osnoise has two modes. One is the top mode, which shows just the summary from the tracer. The other is osnoise hist, which computes histograms; it just collects histograms, so they can be parsed into data that people can use to make a nice presentation like the one we had yesterday. Same thing for rtla timerlat: it's for the timerlat tracer, with some more options beyond what I can do by hand there, like setting the priority of the threads, or setting them to run with SCHED_DEADLINE. And it has two modes: the top mode, and the hist mode for presentation.

So, say I'm a user testing my real-time kernel setup. I'm not that guy who wants to be, oh, I am the real-time dude; I'm just using it, right? I have other things to care about. And I want to measure the latency and generate a report if my latency is higher than 50 microseconds, right? Before rtla, I had to set up cyclictest; I had to try to understand what I should trace, go to the Linux RT IRC channel and ask yet again, "what should I enable in tracing?" (whoever is on the channel knows that this happens), and figure things out by myself, by hand, by scripting. And actually, that was the thing I was doing too. I had these tracers done, and I did this many times. Many times. Before working in development at Red Hat, I was working in support at Red Hat, and that was the thing I did every day: computing deltas by hand. bc was my friend. It was good; it's a good thing to have work that you can justify.

But how much easier does my life get using rtla? This easy. I can just use this option, the automatic tracing, which starts the tracer, sets the priorities, the FIFO priorities, and measures the latency. It sets up the tracing session and enables the minimal set of events that an expert would need to be able to start going forward in the analysis, right? It's not the solution, but it's a starting point, a precise starting point. And it stops the trace if the latency is higher than 50 microseconds, saving the results to a file. That's it; it's just one command. So, at the end of the day, RTLA is the automation of an expert's analysis. It is the automation of the things that we were doing at Red Hat for 10 years. And I did this by grabbing people and asking: what do you think about this? What do you think about that? Like Clark Williams and Juri Lelli. So, we tried to automate that.
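For illustration, that one-command workflow could look like the line below. The -a/--auto option is how I recall it from the rtla timerlat man page, so treat both the option name and its exact behavior as assumptions and check rtla timerlat top --help on your version:

    rtla timerlat top -a 50   # 'auto' mode: give the measurement threads a FIFO
                              # priority, enable a minimal set of trace events,
                              # stop the session when a latency > 50 us is hit,
                              # and save the resulting trace to a file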
So, just the demo here. Disclaimer: I am not a YouTuber. I'm good at taking photos, not at editing videos; photography is my hobby, videos are not. So... let's see if it starts. Here's the demo. It says it's pre-5.18 because it contains features that were added to RTLA for the 5.18 merge window; it was recorded before the release. But this is now the thing that you get on Fedora: if you do dnf install rtla, it's basically this code, and it's already available on Fedora. The tracers are enabled, and the tool is being packaged now; and if it's not there yet, it's on the way on SUSE, Daniel Wagner is working on that. And I saw in that report about Ubuntu enabling PREEMPT_RT that they also mentioned the tool will be there. So, it should be available in the main distros. And I still haven't written an article for LWN, because I was waiting for the packaging; otherwise you need to say, oh, you need to install blah, blah, blah, blah, blah, and people say, okay, this is not making my life easier, it's just telling me how to compile things. Everything in its own time. So, let's go back here.

So, the environment: this is just my system, my working desktop, my workstation. It's a simple system: 24 CPUs in a single NUMA node, 12 cores plus 12 hyper-threads. And I'm using PREEMPT_RT, with half of the CPUs... I said I'm not good with videos. Yeah, that's because I'm playing it inside the... okay, go to YouTube. So, half of my CPUs are isolated; that's why you see the difference in the results from one CPU to another. It's the base PREEMPT_RT patch, with the things I had to patch; so, less noise.

rtla osnoise top --help: here is the set of all the configurations, the ones I mentioned, plus all the other things that I can do, like setting priorities and adding other events to the trace. I'm saying: run osnoise with that threshold tunable set to one microsecond. And here it is. This is the top overview; it just keeps collecting that summary from the tracer and displaying it. So, half of my CPUs have more noise than the others, because the system has half of its CPUs isolated. I'm seeing the IRQs: thousands of IRQs per second here, because I'm getting the tick here, and here I have no tick. I have a nice analysis of it in the paper, but I need it to be accepted before I can make it public.

So, here I'm starting a histogram, and Daniel from the past forgot to say that he wanted to run this for only 10 seconds. And here it goes; it keeps running until I hit Ctrl-C. There is an option, which is -d, and Daniel from the past will redo this with a delay: there is a duration option, and I can say I want to run this for 10 seconds, 10 minutes, or 10 hours. Here are the histograms of each noise occurrence, and some values: the minimum, the maximum, the count of noise. And you can follow it.

So, the timerlat tracer. Again, the output is the full set of options plus some others, like the more advanced event tracing and setting priorities. I can have timerlat measuring the latency of a SCHED_DEADLINE task; it doesn't make much sense, because it's not the purpose of the deadline scheduler to have good latencies. So, I set a workload in the background, and now you see the non-isolated CPUs having a higher value, because the workload is running there; the isolated CPUs keep the value low. And here you see this, which is the cyclictest-like output: the thread latency. But here I also have the IRQ latency; I can see here 12, here 18. There is something there that is linked with the IRQ a lot. From just this output, I already have more evidence of where to look. It wasn't enough, Daniel. Here, Daniel from the past is trying to say: see the IRQ values, that's why they are good on the isolated CPUs. Yeah, we got it, Daniel. Next time, Daniel, just don't do it; let Daniel from the future handle it. I will get better. I will get better. And here is the histogram, and again it probably ran until I hit Ctrl-C; I can now use -d, duration, to set it. There is a typo there. Too much data: that's because it reports data for the thread but also for the IRQ, so I have two columns per CPU, and I have 24 CPUs, so I have 48 columns.
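As a sketch, the duration-bounded runs mentioned above would look roughly like this; the option letters are as I recall them from the rtla man pages (-T the osnoise threshold in microseconds, -d the duration, -c the CPU list), so verify them with --help:

    rtla osnoise hist -T 1 -d 10s     # 1 us threshold, run for 10 seconds
    rtla timerlat hist -d 10s -c 0-4  # fewer CPUs, to keep the per-CPU
                                      # IRQ+thread column pairs readable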
I just... yeah, lots of data. So, I just ran it with fewer CPUs, just 5 CPUs, and here is an example of the output, right? IRQ and thread columns per CPU. Use with care. Yeah, that's what I recall.

Tracing. How much time do I still have? So, let me see what Daniel from the past did. So: stop the trace if there is a latency higher than 20 microseconds; there is a workload running in the background. Okay, it stopped; save the trace to the file. The stop condition hit on CPU 4, so I can just look at CPU 4 and then try to go back in time and understand. So: thread noise, 6 microseconds, probably not the problem. An IRQ noise of 13 microseconds: so, in this case, the IRQ added more latency than the thread, which is somewhat counterintuitive, right? We generally think that IRQs are faster than threads.

Advanced tracing: so, here I'm saying that I would like to create histograms of the thread noise that my timerlat is facing, and I would like to create histograms of the IRQ noise that my timerlat is facing, and then it runs. Fast forward; I think it was 30 seconds. Okay, I said to stop the trace... yes, it stopped the trace at 25. So, the trace stopped, and the histograms of the noise values were automatically saved into files. And so, I can have a histogram of the duration of all the noise that happened during that execution, and I can see what kinds of threads are adding more to the value (okay, here between 9 and 9 microseconds). I will add an option to parse this data and show it in a more beautiful way; this is still under development. And I will also add an option to save the trace into a trace.dat file, to be read with trace-cmd and KernelShark. So, here are the IRQs: the timer is taking up to 20 microseconds, and I think my network was adding 8 microseconds here. So, all this information is now easy to collect.
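By hand, that kind of per-event histogram can be sketched with tracefs hist triggers on the tracers' events, which is roughly what the tool sets up for you. The syntax follows the kernel's histogram documentation, but the field names here are my reading of the osnoise events, so verify them in events/osnoise/*/format:

    cd /sys/kernel/tracing
    echo timerlat > current_tracer
    echo 'hist:keys=duration.buckets=1000:sort=duration' > events/osnoise/irq_noise/trigger
    echo 'hist:keys=comm,duration.buckets=1000:sort=duration' > events/osnoise/thread_noise/trigger
    # ... let it run, then read the histograms:
    cat events/osnoise/irq_noise/hist
    cat events/osnoise/thread_noise/hist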
And here you see that I like drugs and computers. So, that's it: RTLA is upstream. Finally, Daniel is not talking about the future and about his research project; he's talking about something that is a product, and it's upstream. The tracers are there since 5.14, the user-space tool since 5.17. The tracers are enabled on Fedora and on Red Hat products; they are also enabled on SUSE on the recent kernels; and the people at Ubuntu are also enabling them on the real-time kernel. The package is ready on Fedora, and it will be ready, if it isn't already, on SUSE and Ubuntu. And there are more tools on the way: the academic work that I mentioned, which demonstrates the latency of the kernel, RTSL, will soon be part of those tracers and of RTLA in user space. There will be more tools to develop, and I want to motivate people doing research to help with that. And yeah, that's it.

Now the questions. So, where will the paper be published? I better wait for it to be published; it's a nice journal. I'm sorry... I'm using just one timer for now, which is the same timer that we use for SCHED_DEADLINE, but the idea is not to test that clock source, right? It's to test the latency of handling one kind of timer. But the tracer can also be extended in the kernel to add other kinds of timers. The question is whether, by adding output in the trace.dat format used by the tracing tools, it will be usable with KernelShark. Yes, that's why I will export the data into this trace.dat file: because it can be used by the other tools that use that format. From the tool, it should just work out of the box. And if you enable the tracers by hand via trace-cmd, it also works; you can use these tracers with trace-cmd as well. So, the idea is for this to be integrated; that's why I'm using the tracing infrastructure and developing on top of it, because there is an ecosystem around it. The question is if I will upload the slides: yes, they will be uploaded; I have a slot, so the system will probably ask me to upload them, but otherwise they will be on my personal page; I put all my presentations and all my papers there. Jake? Oh no, okay. Good, thank you. And you can just read the kernel documentation; the man pages are in the kernel documentation. All the documentation is in the kernel documentation; all the man pages for RTLA are in the kernel documentation; everything is integrated with the kernel documentation.
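As a footnote to the trace-cmd question, using the tracers by hand through trace-cmd would look roughly like this (a sketch assuming trace-cmd's standard record/report interface):

    trace-cmd record -p timerlat sleep 10   # run the timerlat tracer for 10 s,
                                            # recording into trace.dat
    trace-cmd report                        # read it on the console...
    kernelshark trace.dat                   # ...or open it in KernelShark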