Hi everyone. My name is Krzysztof Kozlowski. I work at Linaro, in the Qualcomm Landing Team, and I also maintain a few Linux kernel subsystems — I am mostly known for the Devicetree bindings. I already had one talk about this. So, what this talk will not be about: I will not explain what a real-time Linux operating system is — I hope you know it, there was a session just a moment ago — and I will not tell you what is inside the PREEMPT_RT patch set. I would rather focus on how to build such a Linux kernel, what to expect during debugging, how to test it, how to evaluate it, and how to be sure that it has the characteristics of a real-time operating system. The entire work happened because a customer of Linaro approached us, and everything was done for the 6.1 kernel. However, all of it was later re-tested on 6.3, and it applies nicely to 6.3 as well — all the numbers are basically the same. So even though the work was done a bit earlier, it doesn't actually matter. The board I was testing, the one important for the customer, was the Qualcomm RB5 board, which comes with a nice eight-core Qualcomm SoC.

The first step is usually to get the PREEMPT_RT patch set. Luckily, most of it is already in the mainline: at the point of 6.1, around 50 patches were already merged; with 6.3, around 80 patches remain outside the mainline. Applying it is quite straightforward: just grab the patches, check the signature, and apply them. Then you need to configure the kernel to be a real-time kernel, and the main option is CONFIG_PREEMPT_RT. If you build such a kernel — the fully preemptible, real-time kernel — you will have a sysfs knob telling you that the kernel actually is real-time, so you can check it. Another interesting option is CONFIG_NO_HZ_FULL, which should not have an impact on the system until you enable it through the command line.
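The build configuration just described can be sketched as a Kconfig fragment (a sketch from my side — the exact option set depends on your kernel version, and on some trees you may need CONFIG_EXPERT=y for CONFIG_PREEMPT_RT to appear):

```
# Kernel configuration fragment for a fully preemptible (real-time) kernel
CONFIG_EXPERT=y
CONFIG_PREEMPT_RT=y
# Optional: allows disabling the timer tick on selected CPUs via nohz_full=
CONFIG_NO_HZ_FULL=y
```

On a booted PREEMPT_RT kernel, the sysfs knob mentioned above is `/sys/kernel/realtime`, which reads `1`.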
I will say a few words about this a bit later, at the end of the presentation. A few other interesting options when configuring the kernel concern the timer: you want to tick at the highest possible frequency to respond to events. For sure, you want to isolate the CPUs — partition the CPUs of the system; we do it now with cpusets. And a few other options, for example for control groups and I/O latency, also might be nice.

During the evaluation of the system, during testing and debugging, there are several config options which will give you debugging tools. These are, for example, the tracers: latencytop, the scheduler tracer, the timerlat tracer, and the hardware latency (hwlat) tracer. Certain tools use these tracers — if I have enough time, I will mention them.

So we have the kernel, we patched it, we built it with PREEMPT_RT, and our kernel is PREEMPT_RT, right? So the job is done and my presentation could finish. Luckily, I have a few more slides — it doesn't finish yet. The main thing, and this is also why the customer approached us, is that a kernel built with PREEMPT_RT will exercise slightly different code paths. And actually, if no one ever tried your drivers with a PREEMPT_RT kernel, you might be the first person to experience the issues there. These are typically the results of race conditions coming from missing synchronization, a different order of callbacks being called, more probe deferrals, things like that. So most likely the problems are already in the Linux kernel; the mainline kernel just does not trigger them — PREEMPT_RT brings them to light. So how to debug it? We already have quite a lot of tools: the kernel sanitizers, debugging of the IRQ flags, debugging of shared IRQs, the lockup and hung-task detectors. All these options will help you to automatically figure out whether the kernel behaves correctly.
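A sketch of the tracer-related debug options mentioned above (option names are my reading of the kernel's Kconfig; verify against your tree):

```
# Tracers useful while evaluating latencies
CONFIG_LATENCYTOP=y          # latencytop: where tasks block
CONFIG_SCHED_TRACER=y        # wakeup/wakeup_rt scheduling-latency tracers
CONFIG_TIMERLAT_TRACER=y     # timerlat tracer, also used by rtla
CONFIG_HWLAT_TRACER=y        # hardware latency detector tracer
```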
The important thing about PREEMPT_RT is that it changes a few things about locking: a few kernel locks have different semantics. And we can automatically debug this as well — figure out whether the locking is correct or not — and there are config options for that. We want to prove that the locking is correct, we want to check that raw spinlocks are properly nested, and finally we want to be really sure that there is no sleeping inside an atomic section. These options impact kernel performance, but you want to evaluate the kernel before you actually put it into production. If you build a kernel like this, you might see some interesting bugs like the one mentioned at the bottom — and this is what I actually saw. The point was, I was the first person trying these particular drivers with PREEMPT_RT.

Let's say a few more words about why these things happen. The main feature here comes from the spinlock. A spinlock is a busy lock: in the normal kernel, it just spins. However, PREEMPT_RT changes its semantics, and now the spinlock is a sleeping lock — a few other locks as well, for example the local lock, although in most cases no one will ever use the local lock; it barely exists in drivers. If a spinlock can sleep, it means it must not be used in an atomic section. An atomic section is one where sleeping is not allowed — so where you have disabled IRQs, or where you have disabled preemption. The simplest fix usually could be to convert the spinlock to something called a raw spinlock: a raw spinlock does not sleep. But of course the art of engineering is not that simple, so there are other solutions. So what can go wrong? Here is the real bug which I actually experienced.
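The three lock-debugging checks just listed can be sketched as a Kconfig fragment (names are my reading of the kernel's lock-debugging Kconfig; verify against your version):

```
# Lock-correctness debugging (costs performance: evaluation only,
# do not ship in production)
CONFIG_PROVE_LOCKING=y           # lockdep: prove locking correctness
CONFIG_PROVE_RAW_LOCK_NESTING=y  # check raw_spinlock vs spinlock nesting
CONFIG_DEBUG_ATOMIC_SLEEP=y      # warn on sleeping inside atomic sections
```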
So you boot the kernel with lock proving enabled and you try to exercise some path in the driver. On the left there is code which is correct for the regular, mainline kernel, but which is not correct for PREEMPT_RT. It starts with disabling IRQs and then somewhere calls spin_lock(). Of course, you will not see exactly this shape of code — more likely some piece of code disables IRQs, calls another function, which calls another function, and that one takes a spinlock. It's never so simple that you have exactly this code; this is just an example. The simplest solution could be to convert the spinlock to a raw spinlock. Again, whether this is applicable to the particular case in your driver — it depends; I don't have the answers for everything.

The symmetric situation is when you disable preemption. The warning, the splat, is quite similar: it will tell you that a sleeping function was called from an invalid context. The IRQs are not disabled, but the preemption count is higher than zero. So again, the code on the left is not correct for RT: preemption is disabled, so you should not call spin_lock(). And the simplest solution could be to convert it to a raw spinlock. But these are very simple cases. Much trickier is runtime PM. Runtime PM uses spinlocks quite heavily, which means that certain functions which regular kernels were using — for example pm_runtime_get_sync(), which is a synchronous call — might now actually sleep, and many drivers are not prepared for it. I don't cover this case here, because it is more of a topic for Linux Plumbers; it is basically not solved. An example of such unsolved code is CPU idle, which uses exactly this combination: the CPU idle loop disables preemption, and then certain CPU idle drivers might be using runtime PM — and then you have a problem.
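The pattern described above can be sketched in kernel-style C (illustrative only, not a real driver: `local_irq_save()`, `spin_lock()` and the raw spinlock API are the real kernel interfaces, while the lock names and functions are invented):

```c
/* Incorrect on PREEMPT_RT: spinlock_t is a sleeping lock there, so
 * taking it with IRQs disabled triggers
 * "BUG: sleeping function called from invalid context". */
static DEFINE_SPINLOCK(dev_lock);

static void bad_path(void)
{
	unsigned long flags;

	local_irq_save(flags);	/* atomic section starts here   */
	spin_lock(&dev_lock);	/* may sleep on RT -> splat     */
	/* ... */
	spin_unlock(&dev_lock);
	local_irq_restore(flags);
}

/* Simplest (but not always correct!) fix: raw spinlocks never sleep. */
static DEFINE_RAW_SPINLOCK(dev_raw_lock);

static void fixed_path(void)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&dev_raw_lock, flags);
	/* keep this section short: it really disables preemption on RT */
	raw_spin_unlock_irqrestore(&dev_raw_lock, flags);
}
```

Whether the raw spinlock conversion is acceptable depends on the driver — as stressed later in the Q&A, it should be a last resort, not a reflex.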
The solution in my case was to partially disable CPU idle — not entirely, but the driver was no longer putting individual CPUs to sleep. What else can go wrong? PREEMPT_RT also changes things about memory allocation. With PREEMPT_RT, the memory allocator is fully preemptible, fully sleepable. So even if you use the GFP_ATOMIC flag, a memory allocation inside the kernel can sleep. The message, if you run such a kernel, would be similar: sleeping function called from invalid context. The code on the left could be such an example: it uses a raw spinlock and then performs atomic allocations inside. So the code on the left is correct for mainline, but not correct for PREEMPT_RT. The trivial solution could be to convert the raw spinlock back to a spinlock, but probably the much better solution is to move the allocation out of the critical section.

So let's say that I built my kernel, booted it up, enabled all the debug options, and fixed all my drivers. Now I hope that my whole system behaves correctly, and let's say that my job is done. Let's see whether it's actually correct. The point is that we want to be sure the system is real-time, because we are talking not only about the kernel but about the entire operating system. If you heard the previous talk, you probably know what latency is. In most cases we are interested in the maximum latency. Latency is the response time to a specific event happening — when the kernel answers. And we are usually not interested in the minimum or average latencies. For example, if you have something controlling a car, or an industrial application, you usually don't care that on average the system responds within some time: you want to be sure that the car, or the pressure monitoring device, will always, always respond within some deadline. And this deadline, this maximum latency, comes from the system requirements.
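The allocation pattern can be sketched similarly (illustrative kernel-style C; `kmalloc()` and the GFP flags are the real API, the lock and `struct item` are invented):

```c
static DEFINE_RAW_SPINLOCK(list_lock);

/* Incorrect on PREEMPT_RT: even kmalloc(GFP_ATOMIC) may sleep there,
 * and a raw spinlock section must not sleep. */
static int add_item_bad(void)
{
	struct item *it;

	raw_spin_lock(&list_lock);
	it = kmalloc(sizeof(*it), GFP_ATOMIC);	/* may sleep on RT */
	/* ... link it into the list ... */
	raw_spin_unlock(&list_lock);
	return it ? 0 : -ENOMEM;
}

/* Better: allocate outside the critical section. */
static int add_item_good(void)
{
	struct item *it = kmalloc(sizeof(*it), GFP_KERNEL);

	if (!it)
		return -ENOMEM;
	raw_spin_lock(&list_lock);
	/* ... link it into the list ... */
	raw_spin_unlock(&list_lock);
	return 0;
}
```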
So it doesn't come from me, but from someone else. How to measure it? We have a few tools which will help. The best known is cyclictest, which measures the entire latency, including the hardware and firmware latencies. Pretty often it is coupled with stress-ng. stress-ng is a really awesome tool made by a friend of mine, Colin, and it also, by the way, includes cyclictest-like functionality — but I was not using that here; I used stress-ng purely for stressing the system. You can stress it in very different ways: you can stress the CPUs, which I was doing, but you could also stress the scheduler or I/O and simulate many different workloads. Finally, there is a set of new tools, rtla timerlat. I will mention it a bit later, and there is another talk from Daniel which explains rtla in much more depth — if you want to dive deep, I think the talk is tomorrow, I'm not sure. The code at the bottom shows an example: we stress the system, make it busy at 60%, simulate some kind of workload, and then use cyclictest to measure the latencies.

All the measurements were done on the RB5 board, which is quite interesting because it has three clusters: a cluster of four slow Cortex-A55 CPUs, then three much faster Cortex-A77 cores, and finally one CPU which runs at the highest frequency, the prime core. The numbers I will show are from the 6.1 kernel; I later redid most of the measurements on 6.3 and the numbers were the same, so I didn't update them in the presentation. The kernels I'm testing are both the vanilla one, without PREEMPT_RT, and my kernel with PREEMPT_RT plus certain fixes for the drivers, which are being upstreamed. All of the fixes are already in the mainline; some hardware enablement driver patches are either in the mainline or on the mailing list. So, the first try: we have such a kernel, PREEMPT_RT, fixed bugs, and we run it.
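The measurement described was along these lines (a sketch on my side — the exact flags and durations are assumptions, check the tools' man pages):

```sh
# Load all CPUs to ~60% with a CPU stressor, in the background
stress-ng --cpu 0 --cpu-load 60 --timeout 3600 &

# Measure wakeup latency: one thread per CPU, SCHED_FIFO priority 99,
# 1 ms interval; --mlockall avoids page-fault-induced outliers
cyclictest --smp --priority=99 --interval=1000 --distance=0 --mlockall
```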
So the system is idle, and the table shows on the left side the minimum and average latencies, and on the right side the maximum latencies. The distribution is per core, starting from the slowest on the left; on the right are the A77 cores. The A77 cores are actually the most interesting, because most likely your real-time application will be running there. The minimum latencies are quite good, right? Five microseconds, two microseconds. The average latencies are still not bad — the A77 has a five-microsecond average latency — but the maximum latencies go up to almost one millisecond. And both kernels — the one called "vanilla", the mainline kernel, and the "RT" kernel, the one I'm building — have quite similar maximum latencies, and they are way above what I would expect. So I guess we are not yet done; something is missing. Just to check, I also made the system busy: with stress-ng I loaded the CPUs at around 60% and redid all the measurements. The minimum latencies are again the same, the average latencies are again similar — nice. Now look at the right column: the maximum latencies are still quite high; however, they are better than when the system was idle. Does anyone have the answer why? Why are my numbers better when the system is busy? Anyone? Yeah — the answer was power management, and the precise answer is cpufreq. When the system was idle, it was running at a low frequency; when the system is busy, it runs at a much higher frequency, and therefore it can react to events much faster. And this is already one hint at what is missing in my setup. But it's not everything. Anyway, the latencies are way above the requirements, and this is proof that having a kernel with the PREEMPT_RT patch set is not enough: your system will not be real-time just by a magical patch set and a build option. We need to do something more.
And this is called tuning the system. There are several activities in the kernel — and, by the way, in the operating system — which are kind of housekeeping tasks: RCU callbacks, periodic timer ticks, interrupts coming in all the time, workqueues. We don't want the real-time application to compete with such housekeeping tasks; we want some CPUs to be devoted to the real-time work to reduce the latencies. Of course, we already have a lot of real-time features in the mainline kernel, so the application could compete — but as the numbers showed, it's not good enough. So usually we partition the system into one part doing the housekeeping tasks and another part doing the real-time tasks. In my case, I assigned the slowest CPUs — CPU 0 and CPU 1 — to the housekeeping tasks, and the rest of the cores to the real-time tasks.

We do this through kernel command line options. First of all, we can move all the RCU callbacks away from the real-time CPUs to the housekeeping ones — this is the first kernel command line option. We can also affine all the interrupts to the housekeeping CPUs by default, and there is a mitigation for the timekeeping (xtime) lock contention. We can also disable the lockup detection, because hopefully you have already detected all the lockups. And the final interesting option is nohz_full. I mentioned in the beginning that one of the kernel build options is CONFIG_NO_HZ_FULL; you activate the feature with this command line option, which tells which CPUs in your system should have no timer tick. The timer tick is this periodic — for example 1000 Hz — waking up of the system to do some work, to check for periodic timer work. With nohz_full, these CPUs will not have the tick, which means they can devote 100% of their time to the real-time task. This is quite useful only if you have exactly one real-time thread per such CPU, because it comes with a significant penalty.
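Put together, the command line options just described look roughly like this for my CPU split (housekeeping on 0-1, real-time on 2-7; a sketch — verify each parameter against Documentation/admin-guide/kernel-parameters.txt):

```
# Kernel command line fragment
rcu_nocbs=2-7        # offload RCU callbacks away from the RT CPUs
irqaffinity=0-1      # route IRQs to the housekeeping CPUs by default
skew_tick=1          # skew ticks to reduce timekeeping-lock contention
nosoftlockup         # disable the soft-lockup detector
nohz_full=2-7        # no timer tick on the RT CPUs (one RT thread per CPU!)
```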
The context switch for such a CPU becomes really expensive. So you enable this command line option only when you have a workload matching this case — one thread per such CPU. For all my measurements here I didn't enable it, because stress-ng mixed with cyclictest is no longer such a case: I have, on the same CPU, stress-ng at 60% and cyclictest doing something.

Okay, what else? Those were the command line options, but we also have a few runtime, sysfs and other knobs. We moved the IRQs to the housekeeping CPUs, but irqbalance can rebalance them. Hopefully the irqbalance daemon respects the CPU isolation — I will mention a few bits about this — but you can also disable irqbalance entirely. What else? You can also move the workqueues through their cpumask — tell them which CPUs are allowed — so I moved several workqueues to the housekeeping CPUs. And finally, there are the two power management options which interest me. I actually didn't care about energy saving — I expected that, yeah, the CPU can get hot, more energy is used, I don't care — because I wanted to focus on the latency. So the CPU frequency is set to the highest possible with the performance governor, and the deep sleep idle states are disabled. The deep idle states also cost you certain latencies; these latencies are actually known, pretty often defined in the device tree or in other system specs. So if your system can tolerate those latencies, you don't have to disable CPU idle. In my case, I wanted to show the lowest possible latency, so I disabled it as well. Finally, the last procfs option allows the real-time tasks to take 100% of the scheduler time.

So now let's see how all these options impact my results. The row at the bottom is the new measurement — this is my second try. The system is still idle, and this is the real-time kernel with the tuning options. The minimum latencies are almost the same.
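Stepping back for a moment, the runtime knobs just described can be sketched as shell commands (run as root; the sysfs/procfs paths are the standard ones, but the available cpuidle states are machine-specific):

```sh
# Stop irqbalance so it cannot move IRQs back onto the RT CPUs
systemctl stop irqbalance

# Restrict unbound workqueues to the housekeeping CPUs (mask 0x3 = CPUs 0-1)
echo 3 > /sys/devices/virtual/workqueue/cpumask

# Highest frequency: performance governor on every cpufreq policy
for g in /sys/devices/system/cpu/cpufreq/policy*/scaling_governor; do
    echo performance > "$g"
done

# Disable the deeper idle states (state0 is usually the shallow WFI state)
for s in /sys/devices/system/cpu/cpu*/cpuidle/state[1-9]*/disable; do
    echo 1 > "$s"
done

# Let real-time tasks consume 100% of CPU time (disable RT throttling)
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
```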
The average latencies are better, by a factor of about 5, plus or minus. And the maximum latency — the right column, which is the most interesting because this is what we are optimizing here — is significantly better. The slowest cores are quite good, less than 100 microseconds. However, the fastest CPUs, the A77s, have absolutely the same maximum latency as before — as the vanilla kernel, as PREEMPT_RT in my first try. So I am still missing something; obviously I haven't finished the work. And this missing piece is the cpusets. We need to partition the entire system, the tasks. Previously, in older kernels, there was a kernel command line argument called isolcpus, where you defined which CPUs were isolated for the real-time tasks. Now we do it through the cpuset control groups. The example code shows how to create what I call the bulk cpuset cgroup: I assign it CPUs 0-1, and I also move all processes already running in the system into this control group, into my bulk group. This will hold the housekeeping tasks for the entire operating system. Then I create another group — I partition the system for the real-time workload — and this will be a root partition: I echo "root" into the partition file, which means it will not be connected with the previous one. I assign there my real-time CPUs, so 2 to 7 go there. And eventually I check whether this is really a root partition: you can cat cpuset.cpus.partition to check whether you partitioned the system correctly; if the output says invalid, you did not. Now, it's important that if you want to run applications within this control group, you have to move their PIDs into it, or use the cgexec command line tool. So let's see — this is the full tuning, and the third try, and the numbers at the bottom show the results.
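The partitioning steps just described can be sketched with cgroup v2 (assuming the unified hierarchy is mounted at /sys/fs/cgroup; the group names "bulk" and "rt" are mine):

```sh
cd /sys/fs/cgroup
echo +cpuset > cgroup.subtree_control

# Housekeeping ("bulk") group on CPUs 0-1; move every existing task there
# (kernel threads will refuse to move and stay in the root group)
mkdir bulk
echo 0-1 > bulk/cpuset.cpus
for pid in $(cat cgroup.procs); do echo "$pid" > bulk/cgroup.procs; done

# Real-time group on CPUs 2-7, as an isolated root partition
mkdir rt
echo 2-7 > rt/cpuset.cpus
echo root > rt/cpuset.cpus.partition
cat rt/cpuset.cpus.partition   # should say "root", not "root invalid"

# Run the RT workload inside the partition, e.g. by moving this shell
echo $$ > rt/cgroup.procs
```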
The minimum latencies are quite similar to the previous case, the second try. Average latencies are also similar. And now, if we focus on the maximum latency, it's pretty good, right? I show here two rows: the vanilla tuned system and the PREEMPT_RT tuned system. The numbers for both cases are almost the same — except that in a few cases vanilla is actually better than real-time. This may be an interesting point: maybe there is no point in the real-time kernel, because the vanilla kernel, if you tune the system, already behaves nicely. Let's see — but this is the idle system. If we make the system 60% busy, here we start seeing some differences. The numbers are pretty close; again, let's focus on the maximum latencies, which are pretty similar to the vanilla case — but the maximum latency on the vanilla tuned system is 38 microseconds, while on the real-time tuned system the maximum is 4 microseconds. So this is a pretty good result. You might notice that CPUs 0 and 1 are missing here, so my number of columns got shorter. It's because now I run in the partitioned system: all my cyclictest and stress-ng instances run inside the control group, so I don't have CPUs 0 and 1 anymore. One more exercise is to see how the system behaves under 100% load. This actually tests whether the scheduler is really scheduling my real-time task as it should — and it behaves nicely. The numbers between the vanilla and real-time kernels are still pretty similar; real-time is a bit better. And this still raises the same question: maybe we don't need the PREEMPT_RT patch, maybe the vanilla kernel is good enough.

So, a few notes from all these results. First of all, you can see that different cores on heterogeneous systems will behave differently, will have different latencies. This is actually quite obvious, because they run at different frequencies.
But this also might give you the answer as to where you want to run your real-time workload, especially on such small boards. The question I raised twice — whether we need PREEMPT_RT — is kind of valid, and part of the answer is that we already have a lot of PREEMPT_RT in the mainline: we already have the scheduler there, with the real-time scheduling policies, SCHED_FIFO and the round-robin one. So if you would like to ditch PREEMPT_RT, you could try — but really, you cannot. I ran my workload for half an hour, one hour; this is no proof that the kernel and the entire operating system will give you such maximum latencies in your real product, your real application. The mainline kernel does not give you the guarantees, because it does not solve certain real-time problems — for example, the priority inversion problem, which is addressed with priority inheritance. Only PREEMPT_RT tries to solve priority inversion. Therefore, even if the numbers in my simulated workload are the same between vanilla and real-time, the answer is still that you need the real-time kernel if you want the guarantees.

So this is the end of the evaluation of the numbers, and I'm pretty sure that my system behaves not bad at all. I still have a few maximum latencies which might be above the acceptance range, or not. And now it depends how much time you want to spend figuring them out. Maybe these 22 microseconds and 8 microseconds on the A77 are something which makes you happy, or maybe this is still too much and you should start digging more — because certain drivers might be causing such latencies, and we have tools to detect them. So I will shortly mention a few useful tools. I seriously recommend going to Daniel's talk later, because he will talk about rtla. The first tool detects the latencies coming from the hardware, and it uses one of the tracers which we enabled at the beginning.
This is hwlatdetect, the hardware latency detector. Luckily, it showed that the hardware and the firmware do not introduce any latencies. If you run cyclictest, it has a useful option, --tracemark: you can run tracing together with cyclictest, and when cyclictest figures out that the latency is above some threshold, it stops the tracing and you get a nice trace of what caused that latency. This might help you figure out why a driver causes such a latency. This might be, for example, workqueues which are pinned to certain CPUs — there are still drivers in the kernel which have pinned workqueues and do not allow the workqueue to be moved to the housekeeping CPUs. This might be your case. But much more interesting results can be obtained from rtla. rtla osnoise and, even more interesting, rtla timerlat are kind of cyclictest on steroids: they will give you a very detailed answer about what is happening in the system, backed by tracing, and they will also report much shorter, much lower latencies than cyclictest. So if you want to debug why your drivers might cause latency, these are the tools to try. And with that, I have actually finished my slides. There are a few references, you can go to Daniel's talk later if you want to find out more — and maybe you have questions.

[Audience] Could you show the table with the latencies again? Why is there a CPU missing — maybe I missed something — and there are two sevens? [Speaker] You mean that we miss CPU zero and one? [Audience] No, no — in the first line, the CPU numbering. [Speaker] Ah, it's just the table — yeah, indeed, a typo, no problem. Thank you very much, I will correct the slides. [Audience] I actually have a little comment.
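The tracing workflows just mentioned look roughly like this (a sketch — the exact flags are my reading of the tools' manuals, so verify them):

```sh
# Check for hardware/firmware-induced latencies (SMIs and the like)
hwlatdetect --duration=120

# Stop tracing once cyclictest sees a latency above 100 us, leaving
# the trace of the offending path in the ftrace ring buffer
cyclictest --smp --priority=99 --mlockall --tracemark --breaktrace=100

# "cyclictest on steroids": per-CPU IRQ/thread latency breakdown,
# with automatic tracing when the 100 us threshold is crossed
rtla timerlat top --auto 100
```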
[Audience] In the beginning, you were talking about one of the solutions — when you have that spinlock, you convert it to a raw spinlock, maybe because you have preemption disabled there. If you have preemption disabled, really look at why you have it disabled. There are cases where you should reconsider, because remember, in the real-time kernel interrupts run as threads. And there are other things you could do — there's the local lock, for instance. So really look at why, instead of just blindly switching something from a spinlock to a raw spinlock — please understand why you're switching it, or whether you have to. One of the reasons why the PREEMPT_RT kernel gives you such good response times is that we got rid of all the raw spinlocks — we made them all sleeping spinlocks. So when you convert things from spinlock back to raw spinlock, you're actually hurting the real-time kernel. So please look at that — I just wanted to stress this point. [Speaker] Yeah, that's a good point. [Host] Any other questions? Nothing online yet, so... good, that's it. Thank you. [Speaker] Yeah, thanks very much.