Right, good afternoon everyone, and welcome to this room for the afternoon. Considering the topic of the first talk we have this afternoon, I think it's a good idea to start on time. So without further ado, please welcome Daniel for RTLA timerlat.

Thank you. Thank you. So, I'm Daniel. I've worked for Red Hat for quite a while, and I mostly work upstream on the kernel, on things related to real-time, runtime verification, scheduling, and so on. This talk is about timerlat, which is a tool inside RTLA, the set of real-time analysis tools that ships with the kernel source.

Here we have been talking a lot about real-time Linux, even in the presentation this morning. Linux has been used as a real-time operating system, right? It's a fact. Many people might say it's not a real-time operating system, because it's a large operating system with a general-purpose nature, but Linux is able to deliver reasonably good timing behavior, and many people are looking at Linux as a real-time operating system option. The software stack is growing — people are using AI stacks, video processing, all these accelerators — and these technologies have been developed using Linux as the default operating system, so there's a huge advantage in continuing to use Linux on these setups. People also look at Linux as a viable RTOS option because there are many people who understand it, hopefully, and because Linux lets them choose the desired timing behavior. Some of the key features for this are the fully preemptive mode — which is the PREEMPT_RT kernel — real-time scheduling, locking, and so on.

One of the problems, however, when we try to say that Linux is a real-time operating system, is the way we demonstrate these properties. Linux has been tested using a black-box-like method, and when you start talking with people that run real-time systems on critical systems, they might say: okay,
the black box gives me that number, but it doesn't tell me why that number happened, how I can debug it, or how I can explain whether that's a bad scenario or not.

The way these black-box tools work — like cyclictest — is that they arm a timer in the future using an external clock reference, and when this timer fires, the highest-priority thread is awakened; when this thread starts running, it reads the time again and measures the latency. That's the basic idea. We cannot say this approach doesn't work, right? We have been developing PREEMPT_RT doing things this way, and the results are good. So yeah, it works. It's just that it gives no root cause analysis. It doesn't take that next step.

Generally, for us, when we are working in the kernel and we see a latency, what we do is use the tracing features. We go there — I've been working on this for quite a while, and I already know off the top of my head which events I need to enable. I just say, okay, enable this event and this event, then give me the trace and I will try to analyze it. But tracing is not always accessible for, let's say, the non-experts, like my project manager, right? Like my manager — when they see a new Bugzilla at Red Hat saying, oh, there is this latency, with this number, you always need to get some expert involved to try to debug the problem. And also, after ten years doing this kind of repetitive ritual, one gets annoyed, right? I was annoyed: do this, do that, and do that — oh yeah, again, the same ritual.

So who cares about this anyway? Other than the poor dude doing the tracing, like me, who cares?
So one of the points is that there is — I would say it's even incredible if you look at it from ten years ago — a large number of use cases for real time nowadays, and the requests to Red Hat and to upstream are impressive. Many people are using real-time Linux, for example in telecommunications, and now in the ELISA project for safety-critical systems. So there is a large demand for the real-time kernel.

And as PREEMPT_RT gets merged, there's also the fact that all kernel developers will have to start testing for regressions on the real-time kernel, because the real-time kernel will be part of the regular kernel. But not all of them are interested in learning the details of the real-time kernel and its synchronization, right? If I am the Video4Linux maintainer and there is a regression, I will talk to someone who is an expert on real-time, rather than become an expert in another field of the kernel — the kernel is huge. And also, as I said, there are a lot of projects that require real-time — automotive and automation, to name a few, telecommunications, and so on.

So what is timerlat? RTLA timerlat is a new approach to this. RTLA is actually a test suite inside the kernel source that has a number of tools inside it; timerlat is one tool inside RTLA. It's a new approach that tries to integrate all the bits and pieces that we use when debugging. On one side, the kernel side,
we have a tracer — the timerlat tracer — that does the internal processing for the timer and adds a set of osnoise trace points, which we'll see later. It does most of the processing in kernel — or some of the processing in kernel — to reduce the tracing overhead, and it tries to write only processed information to the trace buffer. It uses lockless synchronization to avoid things like false positives. It also has an in-kernel workload that simulates the black-box tooling. If you are interested in the details of the lockless synchronization and the noise reduction, there is a paper published in IEEE Transactions on Computers about a sibling tracer, the osnoise tracer; the trace points and the idea are the same, so you can find a lot of information there.

That's the kernel part, and we can use it via the tracefs interface. Here I put libtracefs — I use libtracefs to glue this kernel part to a user-space side, which gives us the timerlat tool. It tries to give that tracer a benchmark-like interface, to make it easy to use: it sets up the tracer, collects the data, processes the data, and gives us some output that I'll show later. It also allows us to run an auto-analysis for bad values — I'll discuss this later — and it opens the possibility of running the workload in user space, something that will hopefully show up in the kernel in the near future.

Before starting, there is one thing: because I'm in the kernel,
I'm controlling the timer, so timerlat already breaks that latency metric into two metrics. It has a special IRQ handler that is activated, and when it's activated I measure the IRQ latency. This IRQ then wakes up the thread, and the thread latency is reported later. So instead of one metric, it already shows two metrics.

Here is one example of the tool's output. This is my system from like five years ago — a regular workstation; it has 24 CPUs. I'm running a kernel compilation on all the CPUs, that's this terminal here, and here I'm running the tool. `rtla timerlat top` gives a top-like interface that shows me the latency my system is facing now. So here I have the CPUs, how many cycles, my current IRQ latency, the minimum, the average, and the maximum latency for the IRQ that wakes up the thread, and here my current, minimum, average, and maximum thread latency. So it's already giving two-stage information. That is the top interface.

There is also the histogram interface, which gives a histogram of the system. It does the same thing, but instead of continuously displaying, it runs silently and collects the histograms, and when I either hit Ctrl-C or say "run this for one hour", at the end it prints a summary of the output. And here it is — let me just pause here, maybe it's better. So here's the latency, and how many occurrences I had at that latency. As you can see, this is the IRQ latency: it's shorter, and it clusters around this time. This is the thread latency: it goes longer, and it spreads more. This is useful for analysis or for regression analysis, and there are a lot of options to format it so it's easier to parse with scripts. This is just to give an idea of what the test is.
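For reference, a rough sketch of the invocations behind these screens, plus the stop options discussed next. The option spellings follow the rtla-timerlat manual pages; the CPU list and duration are just illustrative values, not taken from the slides:

```shell
# Top-like live view of the IRQ and thread latencies, all CPUs:
sudo rtla timerlat top

# Histogram mode on CPUs 0-23 for one hour, with a summary at the end:
sudo rtla timerlat hist -c 0-23 -d 1h

# Stop when the thread latency crosses 100 us (-T), or let -a enable
# the whole set of "expert defaults", auto-analysis included:
sudo rtla timerlat top -T 100
sudo rtla timerlat top -a 100
```

These need root and a kernel with the timerlat tracer enabled.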
When we are testing the system, we generally have a maximum acceptable latency, or a threshold, and it's generally in microseconds, right? When a latency occurrence goes beyond the threshold, it's interesting for us to try to understand why it happened — why is my system not within my limits? So timerlat allows us to set a value on the command line that says: okay, if a large latency happens, please stop everything. I can say, stop if the thread latency was higher than 100, or stop if the IRQ latency was higher than 100. Or there is a magic option, -a, that enables all the default things that an expert would enable, in one easy shot. So `-a 100` will stop on an IRQ or thread latency above 100, get the stack trace, enable tracing, and so on. I use it a lot — I just put my defaults there, and hopefully people will like it; suggestions are welcome.

When the system stops, timerlat not only stops tracing and collects the trace, it also prints an auto-analysis of the root cause for that spike. Daniel from the past was faster than Daniel from now. So I'm saying: stop if there's a latency higher than 30, bro — and here the tracing has stopped. And higher up we can see — okay, an expert would say, okay, it's this "IRQs disabled" that caused the problem. But you can see here that there is now a set of values: the IRQ delay, the IRQ latency, the duration of my IRQ, which task was running — and the task running was this one — a suspicious call, and so on. Don't worry, I will explain — that's the subject of the talk, explaining how to read this report. And my latency was this one. It gives this hopefully easy-to-read output — as easy as we can get, for now.

So, next: the auto-analysis. How do I read the auto-analysis output?
The auto-analysis tries to decompose the latency into a set of values — not one number, not two numbers, but a series of somewhat independent numbers that compose the latency — and each of these values can be analyzed mostly independently. When we look at the IRQ latency or the thread latency, they have different ways to be analyzed, because an IRQ has a different priority and follows different rules than threads, right?

And it's important to say: even though probably most people using it are on PREEMPT_RT, it works for the non-preemptive kernel as well — there is even an example later. It's generic; I tried to abstract the concepts to make it as generic as possible.

Timerlat uses some abstractions that we have from real-time theory — I like RT theory, many people don't, but I do. So we have the execution time, which is the time to accomplish a task: for example, if my IRQ handler executes for 10 microseconds to accomplish its work, that is its execution time. Blocking happens when a lower-priority task postpones the execution of a higher-priority task — that's blocking. Interference is the contrary: it's when a higher-priority thread preempts a lower-priority thread. So: blocking, low priority; interference, high priority.

When we think of tasks on Linux, we generally think of threads, right?
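Before going on to the task classes, here's a toy illustration of the decomposition idea above. The variable names and numbers are mine, not timerlat's exact report fields — the point is only that the thread latency breaks into components that can each be investigated on their own:

```shell
irq_latency_us=5     # delay until the timer IRQ handler starts running
irq_exec_us=10       # execution time of the IRQ handler itself
blocking_us=20       # lower-priority code postponing us (e.g. IRQs disabled)
interference_us=7    # higher-priority activity (NMIs, IRQs, softirqs) preempting

# The thread latency is roughly the sum of the components:
thread_latency_us=$(( irq_latency_us + irq_exec_us + blocking_us + interference_us ))
echo "$thread_latency_us"   # prints 42
```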
But if we look at the internals of the kernel, it's easy to see that we have four classes of tasks — if you want to see how many kinds of preemption we can have inside the trace buffer, we have these four abstractions, one able to preempt the other. The rule is roughly this: we have NMIs, which can preempt every kind of task and cannot be preempted; we have IRQs, which preempt all but NMIs; we have softirqs, which can be preempted by IRQs and NMIs but cannot be preempted by threads; and threads, which can preempt each other but cannot preempt the others. If we read it backwards: threads can cause blocking to softirqs; softirqs and threads can cause blocking to IRQs; and nothing can block NMIs — sorry, other than other NMIs. Or bugs, yeah, bugs. And I don't like to hear about bugs in NMI code — an NMI preempting an NMI is never good.

So let's get one example here — an IRQ latency example. Let's focus on the first part first. I set my stop threshold to 30; it stopped at 50 on the thread side. I also drew a timeline here to make it easier to visualize. The IRQ handler here had a 31-microsecond delay from the start: this was the time when the hardware interrupt was set to run, and it took 31 microseconds until the handler started running. The IRQ latency was 32.17 microseconds — boom — and that was the earliest point at which I could read the timer and report it. The blocking thread — the thread that could have postponed it — was this objtool process. And here is the IRQ handler itself: from here on it's the IRQ handler, and from here on it's the task. We can see here this function, irq_restore, that re-enables the interrupts. So it makes sense:
there was a thread that disabled interrupts, and that's why my interrupt was postponed. Then we can see here that it was inside btrfs and the VFS, right? And then I see here: why is this cgroup rstat flush lock being called — the raw spinlock? I can just do a git blame on the kernel and try to figure out the patch. In this case it's a legitimate use case, because the person who sent it is the PREEMPT_RT maintainer, Sebastian, right? So there is a reason for it to be there. What one can do is try to optimize the code inside there to reduce it, or try to make this workload not run on the CPU that I care about. But this is a legit case, and I could connect the code with the analysis. We also learned that this use case has a signature: if there is another bug with the same signature, my PM can say, this looks like that other bug — duplicate. Hey, less work for them. Well, that's not the point — it's interesting to have less work, but that's another discussion.

Another thing that was mentioned in the earlier presentation is the exit-from-idle latency, and here I have one example of exit-from-idle latency. In the theory, there's this "release jitter" term —
that's basically this. The IRQ can also be delayed by the hardware, because of factors in the hardware — I'll talk more about that later. But there is this special case when the system doesn't have exit from idle tuned. There are also the tuned profiles that Red Hat has that can help with that. But here's a system where the C-states aren't configured: the latency of the IRQ handler took like 39 microseconds, the system is running the swapper thread — which is idle — and the analysis says: okay, this is exit from idle, this is the problem here, you need to check this. It also mentions: over the analysis, my largest exit-from-idle latency was this amount — even if the cause that made my system stop wasn't this, it prints it at the end, because that's a common limitation. Timerlat has an option to work around it, setting the DMA latency, but the real thing one needs to do is set up the C-states.

Going further, one example of thread latency. It's harder to get thread-delay examples on PREEMPT_RT because it's running well, so I went to the regular kernel — and that's also good, because I wanted to show that it works on the non-preemptive kernel too. Here we have an example where the IRQ handler delay was zero — it was some nanoseconds, so okay, that's good, it's not a problem of delay in the IRQ. Let me see if I had an animation or not — yes, I did. So the timerlat IRQ latency was one microsecond; that's okay, it's just the time to set up. And the execution time of the timerlat IRQ was like 9.5 microseconds. There is also some interference: while the blocking thread was running, it also suffered some sort of interference —
that is, other IRQs happened. Another timer IRQ happened, softirqs happened, and we can see here also that the migration thread was dispatched on that CPU. We can see the amounts, but they are not that relevant, right? It's 0.75 percent to one percent; it's not a big thing. The big thing was a kworker thread blocking for 500 microseconds, and here we can see that it was btrfs again. I'm not against btrfs, right? It's just that I use btrfs, and so the examples come from there — and I do like btrfs, so, btrfs people, I like you, I use it on all my systems; that's why it's here.

It's just that here we are running on the non-real-time kernel, so we have the preemption points inside the kernel, and it's normal to have preemption disabled — preemption is disabled by default, and there are points at which you check if there's something to be rescheduled. So here, what one could do is: okay, inside this code I could go and try to put more preemption points — more cond_resched(), cond_resched(), cond_resched() — to try to reduce the latency on the non-preemptive kernel. And why would one want to run that case? Because they do not want the overheads of the real-time locking, for example — Ubuntu has the low-latency kernel, it's just like that. So one could go there and optimize this use case. It's not black and white, in the sense that you need PREEMPT_RT for everything; you can tune the non-real-time kernel, and the tool also works for that.

Okay. RTLA has its roots in tracing, right?
It has a tracer, and it's done by people who like tracing, so it's natural for it to also have an interface for the tracing things, as RTLA timerlat is the front end for the timerlat tracer. By the way, all the documentation for the tracer and for RTLA is in the kernel documentation — you can go there and read the man pages and everything.

Timerlat can enable the osnoise trace points, and they report the amount of time that each kind of task ran in the kernel. It gives you a number, and I'll explain the number — it's easy to read. We have one trace point for each kind of task: there is the nmi_noise, the irq_noise, the softirq_noise, and the thread_noise. They are connected with those notions of blocking and interference, and the values they report are amounts of time in nanoseconds. Each value is already free from interference: for example, if I have a thread running, causing blocking, and an interrupt happens, the time that this interrupt ran is already discounted from the execution time of the blocking. So when I see the blocking time, I don't need to reinterpret it — "oh, I need to discount this value because there's an IRQ", and blah, blah. The value is already readable: it's 10, it's 10, everything is discounted. It avoids us having to go back and forth, back and forth, through the trace.

So here's one example of timerlat with the auto-analysis and tracing output together. Run the auto-analysis, stop at 30, only on CPU 4 — and boom, it stopped. Here we can have a look at the analysis; the Daniel from the past will probably click on something. Yes.
So this is the point where the interrupts — or the preemption — were disabled. Go, video. Okay, you see, there's this function there. So I'm doing tail on the trace file that's automatically saved, and here we have the ftrace output, the tracing output. By default, only the timerlat events are enabled — let's see: this is the IRQ latency, and this is the amount of noise that the blocking thread caused; it has the starting time and the duration, how long the IRQ ran. These are my latencies, so I can read them here as well, if I wanted to go deeper and try to read the auto-analysis from the trace.

Because — I think we will see now... okay, no, this is not public yet. That's the problem of using chat in a presentation. It's Luis Claudio — a colleague.

So RTLA timerlat can also be used to enable other tracing features: I can use all the tracing events in the kernel, starting from the timerlat interface, and here are the options that can enable events, filters, and triggers. Here is a video. I wanted to say: run, stop at 30, and enable the sched, workqueue, and irq_vectors events — and run. So the thread will stop, and if I look at the trace file there — go ahead, Daniel of the past... Daniel from the past trying to explain: oh, we need to run faster. It's here. Okay, here's the trace output. I just read that same file, and here I can see the trace points: sched_waking, sched_wakeup, local_timer. These are the events that I enabled via the command line.
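A sketch of that demo's command line. The -e/--event and -t options are per the rtla-timerlat manual; the exact event selections here are my reconstruction of what the video shows:

```shell
# Stop on a 30 us thread latency, save the trace (-t), and also record
# scheduling, workqueue, and IRQ-vector events in the trace instance:
sudo rtla timerlat top -T 30 -t \
    -e sched -e workqueue -e irq_vectors
```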
So I can use them. Timerlat tries to reduce the amount of time that you spend going in that direction, going deeper into tracing, but it also doesn't block you from going there — one can always go deeper and deeper, as much as one can.

And there are even more things. We can go further into tracing, because there are a lot of options in tracing. For example, we can use the tracing histograms to get all the interference that IRQs caused on my system, and all the interference that softirqs caused on my system, and then check. Because, as timerlat works by sampling, it might be the case that there was an interrupt that was very long, but it just wasn't at the same time that the timer fired — so there could have been a longer latency that I didn't catch. But with the osnoise trace points and the histograms, we can have a look at that.

I will paste one histogram line — they are here, right? So now you see why I posted the link to this page: I have this command line, and that's why I placed the reference there, because I cannot memorize it — I pasted it on my site just because of this, because I need to check it. So I'm enabling timerlat and saying: create my histograms for the IRQs, and display the IRQ name, on which CPU, how many of them took 10 microseconds — and then it shows a histogram. Same for the softirqs. And now the timerlat is running.
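Something along these lines — the trigger string is the standard tracefs histogram syntax attached to an osnoise event, but the exact key and value field names below are assumptions, so check the event format files and the timerlat documentation before copying:

```shell
# Attach a histogram to the osnoise IRQ-noise event while timerlat runs;
# timerlat saves any enabled histogram files when the trace stops:
sudo rtla timerlat top -T 60 \
    -e osnoise:irq_noise \
    --trigger="hist:keys=desc,duration:sort=desc"
```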
It has these histograms running in the background, and when the trace stops, it will automatically save those files. And this works for any histogram: if there is a histogram that was enabled, timerlat will save the histogram file.

So here, let's see — oh, my system has NMIs, and I wasn't expecting that, actually. My system has NMIs, and I can see how long they took: they are generally in the 9-to-10-microsecond range. So my latency can be at least ten microseconds, because an NMI can happen. Then the IRQ noise: I can see all the IRQs that happened on each given CPU — CPU 0, CPU 1, 2. Here you can see the pattern going to 12, 17, and going down. But look at this — I wasn't expecting this either when I ran it: there is a timer IRQ that took 32 microseconds. So my latency is bound to be at least 32 microseconds on my system, always, because there is an IRQ that takes at least 32 microseconds — and I know that there is an NMI that can take at least 10 microseconds. So I can have a 40-plus-microsecond latency on my system, even if I didn't catch an example here, even if it didn't stop.

So I can start doing this reasoning, and if we keep doing this reasoning, we will arrive at another tool — the one I presented in 2020 — that computes the worst-case latency by composing these things. It will be part of timerlat; it's just not yet, because it depends on some optimization that we have to do on the preemption- and IRQ-disable trace points, and those trace points aren't enabled by default on distros. Also, maybe we can get a way to have those trace points enabled by default: if we reduce their overhead, we can have that other tracer enabled, and I can try to ask for it to be merged.

There is one feature that will hopefully be part of the kernel in the future: the workload is now in the kernel, but people said they would like to see it in user space, and there is a feature hopefully on the way for this.
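Back to the histogram reasoning above for a second — it can be written down directly. The values are the observations from the talk; this is a lower bound composed by hand, not the formal worst-case analysis the 2020 tool does:

```shell
max_nmi_us=10        # longest NMI seen in the nmi_noise histogram
max_timer_irq_us=32  # longest timer IRQ seen in the irq_noise histogram

# A thread can be hit by at least one such IRQ plus one NMI back to back,
# so the observed components alone already bound the latency from below:
echo $(( max_nmi_us + max_timer_irq_us ))   # prints 42
```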
So instead of having the in-kernel workload, we have a user-space workload that uses an interface provided by the timerlat tracer. And okay, I think I'm running out of time, I just need to run.

So here's one example of this option — it will be enabled with just a flag on timerlat. And instead of having two metrics, now I have three metrics: the IRQ, the thread, and the return to user space. It goes to user space and back to the kernel, and it shows the latency — the execution time — of going to user space and returning. These values will be larger, also because, as we increase the window, we have more chance of having interference — and that's why thinking in these abstractions makes it easier for us to communicate. We can also have histograms, and this interface is exposed to user space: I can use my own tool with that interface, and then I can read this number as the response time of my task — to be awakened, start running, run my computation, and go back. And the auto-analysis will hopefully also work, so I can see: the response time of my task had this amount of interference and this amount of blocking, and so on.

There are more things on the way. Some asides: timerlat has these options — I can set a different period, a different set of CPUs, how long I want my session to run. I can set the priority, saying I want my thread to be SCHED_DEADLINE, not SCHED_FIFO — or, say, I would like to test my non-real-time tasks, to see how much interference a non-real-time task would get. And there are more options under development.
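On the user-space workload mentioned above: in the kernels where it landed, it is enabled with the -u flag — treat the spelling as an assumption, since the feature was still on its way at the time of the talk:

```shell
# Run the measurement threads in user space instead of the in-kernel
# workload, reporting IRQ, thread, and user-space return latencies:
sudo rtla timerlat top -u
```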
Say I want to run, on these CPUs, the latency collection that I mentioned, but I don't want to collect the trace or the numbers — I just want a report if it stops. Like, I have a huge system and I would like to have a very fine-grained period, which would cause rtla itself to run for too long, but I'm not interested in all that; I just want to see if it stops, and print me a report.

This dump-tasks option is important when you have latency induced by another thread. For example, you have things running on another CPU causing pollution of the cache, and my thread is running here, and it's stopping, and my system was idle — so what could it have been? It displays what was running on the other CPUs, for me to get a hint of what is causing the indirect delay.

And by the way — RTLA is a set of tools, and there's another tool, which is hwnoise, that measures the case when the hardware itself has some kind of latency, for example SMIs or bus locking, that ends up creating latency for my thread — because if my hardware has latencies, it will also postpone my thread. So here's an example of the output.
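A sketch of the kind of run that produces the output shown next; the option spellings here follow the rtla manual pages (-s stops on a single noise sample over the threshold), and the threshold and event are the ones used in the talk:

```shell
# Measure hardware-induced noise; stop tracing if a single noise sample
# exceeds 15 us, saving the trace with the NMI trace points included:
sudo rtla hwnoise -s 15 -e osnoise:nmi_noise -t
```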
I will pause here. So, in each period — each second — it tried to run for 75 percent of the time, and in each of these periods, the total noise that hwnoise received was this amount. So I'm getting this amount of CPU time, net of the hardware noise, in each of these cycles. In one cycle I had 18 — that was the maximum — and these 18 were composed of at least two events, because my max single noise was 13 microseconds. I can also see here counts of hardware noise and NMI noise. By the way, the thing in the back here is the osnoise tracer with IRQs disabled — that's what hwnoise is.

So now I'm saying: okay, I would like to stop the trace if I see a hardware latency larger than 15, and I would like to see the NMI trace points — I would like to check the trace. It stopped, and I see the trace, and here I can see: oh, there is a sample above the threshold of 15 microseconds, and while it was running, I'm sure that there was an NMI, because the interference counter is one. I'm sure it is one because of the lockless synchronization that I have in the kernel, which ensures me that between the time I read here and here, just one thing happened — and it was this NMI. So the NMI is the noise, and if I wanted to remove it, I would need to check why this perf event NMI handler is running here, before going deeper in the analysis.

So, final remarks. Timerlat integrates the workload, tracing, and auto-analysis, all in a single tool, and it tries to produce a summary with root causes that can hopefully be read by my project manager without having to call me. The idea is also to speed things up — for example, for a person that is not from RT, testing
They saw that they have a regression Now, how do I report this to the real-time people for them to give me a hand boom just one one output and And the tool also allows the users of more advanced in tracing and we will more and more integrate those things with Steven's library like trace command And The RT is the RT is the home of photo tools for analysis We have the timer lot the US noise that measures interference only the hardware noise that I showed but it can only get better, right? I'm working on a tracer that shows the execution time of the tasks is using those trace points There is this tool that I mentioned that that shows the worst case analysis with a formal proof and It it it will be placed inside of RT LA I'm working with Apollo Bonzini and And the Francesco and student a Milan To integrate the timer lot and less noise with KVM so I can see if there was an a noise On my system instead of saying it was harder I would be able to say if it was either hardware it could be hardware, right? Or it could have been the host that prems to the vCPU and we will be able to show those things and So it can only get better Because the tool was think on that way and whatever the community needs it the more tools we can we are able to put there the better I've I've made a script Before writing the presentation. I wrote like a script of this will be my presentation so we have a how-to like of this presentation here and I Just placed online today and that's it questions We have time for a few questions So from what I understood you have the ability to identify the difference between the hardware latency and an NMI latency is this taken into account for the auto analysis or Will unexplained latencies be attributed always to NMI's No when when there is those three points Right that I see when an NMI happened or when an IRQ happened or when a thread happened If none of these trace points happened between two reds of the time It was nothing inside the operating system. 
so we can say that it was the hardware. When you see that the trace point is there and it says one, something happened inside the operating system; when it says zero, nothing happened at the operating system level, so it was something in the hardware. And that's where we could extend it: okay, if there is nothing here, it's the hardware — can I check with KVM if there was something? Then KVM can tell us, and we can extend it for other things as well. Any more questions?

Currently not — so thank you very much. Good, so I will go home thinking that the output is good enough for people not to have questions. Thank you, Daniel.