So, welcome. I think we can start now. My name is Maxime Chevalier. I work at Smile, and I'm based in the south of France. I work in embedded Linux development, and in my work I've been on multiple projects that use the PREEMPT_RT patch. Today I'm going to present what it's like to use the RT patch from a user perspective — I'm not a real-time patch developer. So, do any of you use the RT patch in a project? All right, quite a few. Let me present the kinds of projects I've worked on. The first one was a simulation platform in the aerospace industry. It was one of those big workstations with two Intel Xeons inside and loads of RAM, and we needed to run software that would simulate real code that was supposed to run on a real-time operating system. The main constraint was to be able to respond to an event coming in on the network interface within a given time. We also ran cyclictest on this machine, and our target latency was 200 microseconds. That's the first project. The second one was pretty similar: a test bench that communicated with real-time software and performed acquisitions on various input ports. You have this kind of rack with a lot of input I/O cards; on each card you have maybe 128 inputs, so that's a lot of things to acquire. When receiving a packet on the network interface, we had one second to perform all the acquisitions, digital or analog, and respond. One second might seem like a long time, but when the system was under load, we experienced spikes that went over one second. The third one was a project I worked on for two years, a more embedded product. This time we were using an ARM-based platform, an i.MX6 from Freescale.
It was a product embedded into vehicles — tractors, buses, trucks — that performed acquisition on the vehicle bus, the CAN bus on this kind of vehicle. In order to gather all of this data and never lose an incoming packet, we needed to use the real-time patch, because we also needed to run some customer load on the i.MX6. We had no idea of the nature of these loads, but the system could be under heavy CPU load and still needed to acquire each and every packet and never lose any. Here it was more about deterministic behavior than real time, although it's much the same thing, as I'm going to show you. Finally, I worked on an image processing solution for the medical industry. We were doing real-time processing — real-time not as in RTOS, but real-time as in things happening in real life. We had a video stream, we needed to do some processing, and this processing was fed into a robotic laser, I think. It's for medical imaging: you are filming something and you point the laser at what you are filming in real life. If you have too big a jitter or delay between the time you acquire the frame and the time you point the laser, you're not pointing at the right thing. That's pretty critical in the medical industry. Now, just some background so we're all on the same ground about what a real-time operating system is. Pretty much all of you know what that is, given the quick poll, but an RTOS is all about determinism — having deterministic behavior, especially regarding timing constraints. When an event occurs, you want the response to that event to happen within a known timeframe. That's what you want primarily from an RTOS. You also want to run multiple tasks, and to do so you need a real-time scheduler. This scheduler decides which task gets to run based on the priority assigned to each task.
A critical task is assigned a high priority so that it always runs before lower-priority tasks. Finally, on an RTOS you have to handle some specific corner cases. Because you are running multiple tasks that share resources, you have locking mechanisms. With strict priorities in your scheduler and locking happening on the side, you can get situations where, due to this interlocking, a lower-priority task prevents a higher-priority task from running. That's called a priority inversion. We need to solve all of these problems to have real-time behavior. What do we have in Linux? In Linux, we have pretty much everything we need for a real-time operating system. Thanks to the effort behind the real-time patch, the PREEMPT_RT patch, a lot of the things that were missing are now mainlined. This is based on 4.9, but yesterday there was a pretty good talk about where things stand nowadays. You have a real-time scheduler — three of them, in fact. SCHED_FIFO is a pretty simple one: the first task that needs to run will run, according to its priority. SCHED_RR, which is for round-robin, is the same as SCHED_FIFO but with a notion of time-sharing and time-slicing. And you have SCHED_DEADLINE, the new one, which basically allows you to say: I need this work done by this deadline — no matter exactly when it runs, I need it done within the next 200 milliseconds, for example. We have support for priority inheritance, which addresses the tricky case I mentioned: we use priority inheritance to solve the priority inversion problem. We have a preemptible kernel — almost — which means that a task can stop the kernel from running to do its job, but the kernel still has a lot of critical sections in it. That's what the RT patch is all about nowadays: removing these critical sections. And we have had high-resolution timers for quite a while.
So, we lack full kernel preemption and also some worst-case scenario optimizations. What you have to understand is that Linux and a real-time operating system do not take the same approach in their design. When you design a real-time operating system, you want to be 100% sure that you will get deterministic behavior. To achieve that, you have to check every execution path that your code can take, measure how long it takes, and prove that you will never have unbounded latencies. This is not feasible with Linux: Linux is too complex and moving far too fast for this kind of verification and formal proof. So what the RT patch does is a kind of best effort to make everything in the kernel nice to everything else. Every time a lock is taken, instead of having non-preemptible sections with interrupts disabled, they arrange for the lock to be sleepable and preemptible — interrupts can still occur. You make sure that every piece of code is nice to the rest of the kernel. To do so, they introduced threaded interrupts. Threaded interrupts are also a mainline feature; they are just enforced in the PREEMPT_RT patch. Basically, when an interrupt occurs, instead of running the interrupt handler in a specific context that preempts everything that was previously going on — even a real-time, high-priority task — the interrupt handlers now run in dedicated threads. You still have a tiny interrupt handler, and all it does is wake up the thread where the handling is pending. The advantage is that you can assign a priority to that thread and arrange for it to be scheduled after your critical task, if you don't really care about this interrupt. The locks are sleepable inside the kernel — that's where all the magic happens.
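The threaded-interrupt model can be sketched in user space (purely illustrative, with made-up names — the real mechanism lives in the kernel): the "hard" part does almost nothing except mark work pending and wake a dedicated thread, which then does the real handling in a context that can be scheduled — and prioritized — like any other task.

```c
#include <pthread.h>

static pthread_mutex_t irq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  irq_cond = PTHREAD_COND_INITIALIZER;
static int irq_pending = 0;   /* "interrupts" not yet handled */
static int irq_handled = 0;   /* work completed by the IRQ thread */
static int irq_stop = 0;

/* Hard-IRQ top half: minimal work — mark pending and wake the thread. */
void hardirq_top_half(void)
{
    pthread_mutex_lock(&irq_lock);
    irq_pending++;
    pthread_cond_signal(&irq_cond);
    pthread_mutex_unlock(&irq_lock);
}

/* The IRQ thread: runs the device-specific handling in task context. */
void *irq_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&irq_lock);
    for (;;) {
        while (irq_pending == 0 && !irq_stop)
            pthread_cond_wait(&irq_cond, &irq_lock);
        if (irq_pending == 0)
            break;            /* stop requested and queue drained */
        irq_pending--;
        irq_handled++;        /* stand-in for the real handler work */
    }
    pthread_mutex_unlock(&irq_lock);
    return NULL;
}

/* Ask the IRQ thread to exit once all pending work is drained. */
void irq_thread_stop(void)
{
    pthread_mutex_lock(&irq_lock);
    irq_stop = 1;
    pthread_cond_signal(&irq_cond);
    pthread_mutex_unlock(&irq_lock);
}
```

On a real PREEMPT_RT system the kernel creates these threads itself (the irq/&lt;n&gt;-&lt;name&gt; tasks you see in htop), and you can change their scheduling priority with chrt.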
You can take a look at some talks I list at the end to really understand what is going on. But what you have to keep in mind when designing a user application is that the kernel itself, in the way it handles resource sharing, behaves differently, and this has an impact on your performance, since it removes the critical sections, as I said. Just a quick look at how we can observe this from user space. I don't know if you know the tool perf. There has been a lot of talk about ftrace — you can do this with ftrace too — but with perf we can really see what is happening to a task running in user space. For example, here I am running a ping flood from my machine, and I monitor what is happening using perf. Every now and then, perf samples which function is currently running inside the kernel and gives you the percentage of time spent in each function while you run your specific load. This is the result on vanilla Linux without the PREEMPT_RT patch. You can see that you acquire and release a raw spinlock with IRQ saving, which means you disable interrupts while you are holding that lock. That's not good for real time, but it's rather good for performance. On PREEMPT_RT, by contrast, this lock is replaced with an RT spinlock, which is sleepable: when you cannot acquire the lock, you go to sleep and let something else run. You get a context switch here, but you are nicer to the rest of the system. This is really just to show you that the real-time patch changes things inside the kernel. So what about everything that is not real-time related? The kernel is changed, but the ABI between the kernel and user space is the same, so you can still run everything that runs on vanilla Linux. If you have a JVM with your custom load on it, you can still run it on PREEMPT_RT and have your critical task running on the side.
What you have to be aware of is that your non-real-time tasks get what is left of the resources: what is left of the CPU time, what is left of the memory that you've locked for your critical task — I'll get into that later. So, once you've applied the real-time patch, what do you see exactly? The first thing you do is run uname -a — sorry, I didn't show the result of the command — and you see preempt and RT in your uname string. You also have a new file, /sys/kernel/realtime. If you look at what's inside, it says 1. You cannot write 0 to it to disable the RT patch; it's just there to say "I have the RT patch applied." Another thing you can notice, and that can be confusing, is what happens when you run htop. I'm sorry, I wanted to do a demo, but I had to reinstall the machine and I don't have the RT patch running on it, so I'll explain instead. What happens is that you have more tasks running on your system. This is because you now have threaded interrupts: every interrupt now shows up in htop as if it were a process or task running. This can be confusing, because if something from outside your system is triggering interrupts — the ping flood example: if you are pinging your system from the outside — the response to the ping, which used to happen in interrupt context, now shows up in htop. Your load can appear higher in htop even though your system is doing the same thing. So actually, when you apply the RT patch, you can visualize better what your system is doing. This also impacts your load average. I don't know if you're familiar with the load average; it's not really an indicator you should rely on, but it basically tells you how many tasks are waiting to run on your system. If it's below your number of CPUs, you're fine.
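Checking for the patch programmatically can be done the same way: a small sketch (hypothetical helper name) that reads the /sys/kernel/realtime marker just described, and simply reports "not RT" on a kernel where the file does not exist.

```c
#include <stdio.h>

/* Return 1 if /sys/kernel/realtime exists and contains "1"
 * (i.e. the PREEMPT_RT patch is applied), 0 otherwise. */
int is_preempt_rt(void)
{
    FILE *f = fopen("/sys/kernel/realtime", "r");
    int value = 0;

    if (!f)
        return 0;          /* file absent: not an RT kernel */
    if (fscanf(f, "%d", &value) != 1)
        value = 0;
    fclose(f);
    return value == 1;
}
```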
And if it's above the number of CPUs, you're getting preemption on your system and more than 100% of a CPU is in use. That will be impacted too by the PREEMPT_RT patch, simply in the way things are computed. A good set of tools that I used to debug and analyze this kind of thing — because I had to prove to my customer that the PREEMPT_RT patch had an impact, but not as big as we thought — are the *stat tools: pidstat, vmstat, and mpstat, which I found very useful. They let you monitor the events going on in your system: the context switching that happens, the interrupts, the cache misses, the page faults, and at different levels. With vmstat, you get a global view of what is happening on the system — I'll show you an example later. With mpstat, you can monitor, for example, interrupts on a per-core basis: how many interrupts are handled by core 0, core 1, how many context switches, and so on. And pidstat lets you do that on a per-task basis. Here is an example of the output you can get from vmstat, the first one. With vmstat 1, the 1 after the command means that I take an acquisition every second and compare what happened with the previous second. r means how many tasks are running, in means how many interrupts happened in the last second, and cs how many context switches happened in the last second. While running vmstat, in parallel I was running stress-ng. stress-ng is a tool that lets you stress specific parts of your system: your CPU, your memory, a bunch of system calls. And while I am stressing my system with stress-ng, I can see that I have about 700,000 context switches per second — that is pretty huge. This is the same behavior on real-time and vanilla Linux; it's just to show you the tools. pidstat gives you the same information, but on a per-task basis.
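vmstat's in and cs columns are derived from cumulative counters. As an illustration (hypothetical helper name), this sketch reads the raw totals from /proc/stat, which is where such tools get them on Linux — sampling the counters twice, one second apart, gives the per-second rates vmstat prints.

```c
#include <stdio.h>
#include <string.h>

/* Read the cumulative interrupt and context-switch counters from
 * /proc/stat (Linux-specific). Returns 0 on success, -1 otherwise. */
int read_system_counters(long long *interrupts, long long *ctxt_switches)
{
    char line[4096];
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return -1;
    *interrupts = *ctxt_switches = -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "intr ", 5) == 0)
            sscanf(line + 5, "%lld", interrupts);     /* total, first field */
        else if (strncmp(line, "ctxt ", 5) == 0)
            sscanf(line + 5, "%lld", ctxt_switches);
    }
    fclose(f);
    return (*interrupts >= 0 && *ctxt_switches >= 0) ? 0 : -1;
}
```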
So this is one line per thread; you have the PID — I didn't show it in the slide. Each of these processes triggered 70,000 context switches, and NV stands for non-voluntary context switches: the times you get preempted by the kernel. The kernel says "you've run long enough, time to let the other processes run." Obviously this is using SCHED_OTHER — this is not any kind of real-time task, just doing what you would do on your system regardless of vanilla or PREEMPT_RT. An interesting thing to show — what I mentioned about htop — is using these tools to monitor what happens during a ping flood. That was the problem I had to solve: why, with the real-time patch, does a ping flood make my load average explode and my htop go crazy? What we realized is that part of the difference between RT and vanilla Linux is only in the way things are monitored. But the real-time patch also triggered more context switches, so you do have a tiny performance loss in that regard. You can confirm that with pidstat, which shows roughly as many context switches on this particular IRQ thread as there are incoming interrupts. So this is really the culprit for this performance loss. This is an effect of the threaded interrupts; but if that bothers you when using the real-time patch, you can still assign a different priority to this particular task. You can also use iperf to compare the available network bandwidth. iperf showed no difference at all between vanilla Linux and PREEMPT_RT, although you might think there would be an impact here. But this should be tested further with small packets, because the more interrupts you get, the more context switches you get, and so the more time you spend switching from one task to another. I also used stress-ng a lot. stress-ng should not be used as a benchmarking tool, but I did it anyway — it's a rough benchmarking tool.
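The per-task counters that pidstat -w reports can also be read directly. A sketch (hypothetical helper name) that parses them for the calling process from /proc/self/status:

```c
#include <stdio.h>

/* Read the per-task voluntary / non-voluntary context-switch counters
 * for the calling process from /proc/self/status (Linux-specific). */
int read_task_ctxt_switches(long *voluntary, long *nonvoluntary)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return -1;
    *voluntary = *nonvoluntary = -1;
    while (fgets(line, sizeof line, f)) {
        /* Each sscanf only matches its own line; others leave the value alone. */
        sscanf(line, "voluntary_ctxt_switches: %ld", voluntary);
        sscanf(line, "nonvoluntary_ctxt_switches: %ld", nonvoluntary);
    }
    fclose(f);
    return (*voluntary >= 0 && *nonvoluntary >= 0) ? 0 : -1;
}
```

A high non-voluntary count is exactly the preemption-by-the-kernel effect described above.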
So if you are running something on vanilla Linux and you want to know how it is going to behave on real-time Linux, and you know exactly what kind of load it is — is it doing CPU-intensive work like math calculations, is it using a FIFO, is it accessing the disk or the network very quickly — you can try to use stress-ng to get an idea of how this will behave on real time. I used five stressors that I compared; a stressor is the thing stress-ng runs in a loop as fast as possible. I think you have like 70 available — 90? 180? Yeah. But I limited myself to the stressors where you can set the number of operations to run, because you have to compare execution times. That's what I did, anyway. For the CPU, for example, I ran a load that took 11 seconds, and on RT it also took exactly 11 seconds. That's logical, because you do not go through the kernel. The fault stressor triggers page faults, and there you can see a significant difference. The biggest difference is the FIFO stressor, which takes eight seconds to run on my system with vanilla Linux, and 70 seconds with PREEMPT_RT. So you can see that there are areas of your kernel that change their behavior and might be slower — or faster, like the futexes, the fast user-space mutexes, which actually get faster with PREEMPT_RT applied. So you can use stress-ng; that's what I did, and it gives you a pretty good idea. To recap the performance issues: you can expect an overhead whenever you go into the kernel. When you are doing a syscall, or you rely on a path that goes through the kernel, you will get an overhead. And it is very dependent on your platform, so you should test it very thoroughly, as soon as possible. My point is that applying PREEMPT_RT is only the first step when you want a real-time operating system.
You must first make sure that everything you want to run on your system will still run as you expect, but you also have to take into account the fact that you are building a real-time operating system, and PREEMPT_RT is only the first step in building one. Say you apply the patch, you set your critical task to SCHED_FIFO with a high priority, you run cyclictest — and you see that you still have latency spikes. Why is that? It's because tons of things other than the kernel can impact your latencies. For example, the CPU idle states. If you don't know what that is: a CPU idle state is basically what your CPU does when it has nothing to do. My laptop, for example, is probably asleep 99% of the time, with the rest spent showing my slides. So what is my CPU doing during that time? It's not spinning in a loop waiting for something to do — my CPU would catch fire. Instead, it goes to sleep, and the available sleep states depend on your architecture, on the SoC or the CPU you are using. On my system, I have 10 different sleep states I can go into, each with a different latency — a different time it takes to wake up. So when you build a real-time operating system, you may want to tweak your CPU idle states and disable the deep-sleep ones in order to guarantee your latencies. The same is true of DVFS. CPU idle states — you don't have many on ARM; you basically have one or two available, which is either polling in a loop or waiting for an interrupt. On x86, you have tens of possible states. But DVFS is also something you are starting to encounter on embedded systems: it's basically adjusting the frequency at which you run depending on the load on your system. And this can affect your real-time behavior, because if you're not always running at the same speed, how can you guarantee that you will respond to an event in a bounded time?
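Setting the critical task up as described typically looks like this sketch (hypothetical helper name, not from the talk). Making a task SCHED_FIFO needs root or CAP_SYS_NICE, so the helper reports the errno gracefully instead of assuming privileges; locking memory guards against page-fault latency on the critical path.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/mman.h>
#include <errno.h>

/* Make the calling task "real-time": SCHED_FIFO at the given priority,
 * with all current and future memory locked to avoid page-fault latency.
 * Returns 0 on success, or the errno value that caused the failure. */
int make_realtime(int priority)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof sp);
    sp.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return errno;   /* typically EPERM without CAP_SYS_NICE */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;
    return 0;
}
```

With this in place, cyclictest is the natural next step to check whether latency spikes remain.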
Well, in a known time. So what you also want to do is fix the frequency at which you run. You can also have hyperthreading on your system, and my recommendation is to disable it if you want real-time behavior. That was one of the problems on the big Xeon system I mentioned: we had a lot of CPUs and hyperthreading, and hyperthreading was really the thing that blew up our latencies. So we disabled it — and obviously, when you do that, you have much less processing power, so you should take that into account as early as possible when designing your system. The key to all of this is really knowing your system — on the software side, your user-space applications and what's going on in the kernel. You don't necessarily have to know everything about the real-time patch; you can just use benchmarking or measurement tools to see what's happening. You also have to know your hardware, because there are things you cannot disable. For example, DMA — well, you might be able to deal with it, but DMA can cause latencies on the buses of your SoC. System management mode on Intel, or x86 processors in general, can be really harmful to your latencies. If you don't know what SMM is: it was originally used on laptops to do thermal management. Every once in a while, an interrupt fires that is not maskable, and the code that runs in that interrupt is not under the control of the Linux kernel — it lives in the firmware, the BIOS I think. You cannot mask it, and you don't know how much time it will take to process that particular interrupt. Tools like the hardware latency detector can help you see how much time you spend in SMM. But since it's doing thermal management, if you disable it — which you sometimes can in the BIOS — your CPU may catch fire, again. So that's not recommended.
And you should also know whether you have resources that are shared in your hardware. I have a pretty good example — not mine, but from a former talk at ELC. Someone gave a talk about running real-time Linux on a multicore system, and one of his problems was that the single-instruction-multiple-data (SIMD) vector unit in his system was shared among several cores. I think he had four of them for eight cores, so when he tried to run five critical tasks concurrently, he would see really significant jitter in his latencies, because it was happening at the hardware level. Things like that — you should be aware of them when designing a system. Just a quick refresher on CPU idle: there is actually a way to know the latency it takes to wake your CPU up from each sleep state. In sysfs, under /sys/devices/system/cpu/cpuN/cpuidle/, you can see how long each state takes. For example, here is a quick view of the ones I have on this laptop — I have tens of them. For the first one, the poll state, you don't sleep at all: you just spin in a loop waiting for something to do, and you have zero wake-up latency. The deeper you go, the longer it takes to wake up — this is in microseconds. And the residency time is the time the governor uses: it tries to estimate how long you will stay asleep, and it uses this target residency to choose which sleep state to enter.
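As an illustration of that trade-off (with made-up numbers, not real hardware values), this sketch picks the deepest C-state whose exit latency still fits a latency budget — essentially what you do by hand when you cap the allowed C-state for a real-time system:

```c
/* Illustrative only: choose the deepest C-state whose exit latency fits
 * the latency budget. States are assumed ordered from shallowest to
 * deepest, with state 0 the polling state (zero wake-up latency). */
struct cstate {
    const char *name;
    int exit_latency_us;   /* worst-case wake-up latency, microseconds */
};

int deepest_state_within(const struct cstate *states, int n, int budget_us)
{
    int best = 0;          /* polling always fits the budget */

    for (int i = 1; i < n; i++)
        if (states[i].exit_latency_us <= budget_us)
            best = i;
    return best;
}
```

With a 50 µs budget and example states {POLL: 0, C1: 2, C3: 33, C6: 133}, this selects C3: deep enough to save power, shallow enough to keep latency bounded.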
So you can disable CPU idle from the kernel command line, or choose the max C-state that you will use. Sometimes you can disable it in the BIOS: on some platforms — the test bench, for example — we could not disable it from the kernel, because the sleep-state decision was made at the BIOS level, so we had to disable it in the BIOS.

So that's pretty much it. My hope was that we could get some feedback from you, users of the real-time patch, if you have anything to share. If you are interested in going deeper into the real-time patch, there are a lot of good talks about it — a lot of them given by Steven Rostedt. I think he also gave one on SCHED_DEADLINE, but I cannot quote each and every one. "Real-time Linux on embedded multicore processors" was the one I mentioned about resource sharing. There is also a good talk about IRQs, if you want to understand how they are handled in vanilla and RT Linux, and the talk from yesterday about the real-time patch had a lot of good information. So, thank you. Now, if you have feedback or questions... no feedback? Yes — let me give you the mic.

— Thanks for using stress-ng; it's a tool I've been writing to measure systems like this.

— It helped me a lot.

— The latest version now has a cyclictest built in, so you can actually get a full histogram of latencies and all sorts of things related to real-time performance measurement. Just to let you know.

— That's right, I didn't know about that. It's very useful anyway, to benchmark everything in your system, and also to stress it. Any other questions or feedback? Well, thank you for coming.