I'm Daniel, I'm part of the real-time kernel team at Red Hat, and I'm also a post-doctoral researcher at the Scuola Superiore Sant'Anna in Pisa, where I try to live between these two worlds of development and research, right? So, relax, this is not a heavy talk. I'm not going into much detail; it's more an open-ended talk, presenting some topics and pointing to directions, and you will see links and QR codes during the presentation, right? So, what is real-time? Real-time systems are computing systems whose correct behavior depends not only on the functional behavior, but also on the timing behavior. That is, the logical result is only correct if it is produced before a given deadline. So the response time is also important. In real-time theory, the work goes somehow like this: we start by creating a precise definition of the system, trying to capture all the behaviors; we imagine some algorithm; we try to define the worst-case scenarios; and we come up with some formulas to show that results are delivered before the deadline. Some of these theories are straightforward to understand, like EDF for a single core, but others are more complex, right? And when I say complex, it is complex in the sense that it generally requires some years of understanding of the mathematics behind real-time. So real-time systems theory is generally considered a complex subject, because the mathematical reasoning is complex, and in order to facilitate the development of new theories, some assumptions are made to simplify the reasoning, right? Like assuming that the system is fully preemptive, because it's easier to think that way; assuming that the processes of a system are independent; that overheads are negligible. And yes, things are this way because it would be too complex to reason about and establish results otherwise. And on the other corner we have Linux, which has a more practical approach, right?
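As an aside, the single-core EDF result mentioned above is one of the few that is genuinely simple to state: a set of independent periodic tasks with deadlines equal to their periods is schedulable under preemptive EDF on one core if and only if total utilization does not exceed one. A minimal sketch, with made-up task parameters (nothing here comes from the kernel):

```python
# Toy single-core EDF schedulability check (implicit deadlines).
# Liu & Layland: independent periodic tasks with deadline == period
# are schedulable under preemptive EDF on one core iff sum(C_i/T_i) <= 1.
# Task parameters below are invented, purely for illustration.

def edf_schedulable(tasks):
    """tasks: list of (wcet, period) pairs; returns (utilization, ok)."""
    u = sum(c / t for c, t in tasks)
    return u, u <= 1.0

tasks = [(1, 4), (2, 6), (1, 8)]   # (WCET, period) in ms
u, ok = edf_schedulable(tasks)
print(f"U = {u:.3f}, schedulable: {ok}")
```

Note how clean this is precisely because of the assumptions the theory makes: full preemptivity, independent tasks, zero overheads.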
Real-time Linux is not a single thing; it's a set of features that tries to provide more deterministic behavior for Linux. And the approach works more like: okay, we have a metric to be optimized, for example the scheduling latency. Some background from the theory is taken into account, like: in the theory, the fully preemptive model is a good thing for this metric, so we try to mimic that. And then some testing tools are developed, like the user-space tool cyclictest, and the same goes for other metrics. So nowadays the main features of real-time Linux are the fully preemptive mode from PREEMPT_RT, SCHED_DEADLINE, locking with priority inheritance, and so on. But in the same way that the mathematics is complex, Linux is complex, and many times some assumptions are made to facilitate the development. For example, the fully preemptive mode of Linux is not the fully preemptive model of the theory, because we can disable preemption temporarily. In the same way, SCHED_DEADLINE accepts some things that are not yet formalized, for example per-CPU threads mixed with a global scheduler. But these things are accepted because otherwise it would be too complex and we would go nowhere, right? So on one hand these simplifications are good, because they enable progress and they enable a vast set of environments, but they created this gap between theory and practice. So, but, well, who cares about real-time Linux, and why is it important to try to put these two things together? In the current use cases, we are seeing high-frequency trading, embedded electronics, and low-latency virtualization cases that enable Linux and push real-time Linux forward. But in the future, we are starting to see more complex scenarios, right? We are starting to see systems with lots of real-time tasks, instead of having one real-time task per CPU. And we are seeing users of Linux on safety-critical systems.
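To make the cyclictest idea concrete: the tool essentially sleeps until an absolute wake-up time and records how late the wake-up actually happened, over and over, tracking the worst observed value. Here is a toy user-space sketch of that loop; it is not cyclictest itself (the real tool uses clock_nanosleep with TIMER_ABSTIME under a SCHED_FIFO priority, which this plain-Python version does not):

```python
import time

def measure_wakeup_latency(iterations=200, interval_ns=1_000_000):
    """Sleep until an absolute deadline, record how late we actually woke up."""
    worst = 0
    t = time.monotonic_ns() + interval_ns
    for _ in range(iterations):
        # Sleep until the intended absolute wake-up time.
        delta = t - time.monotonic_ns()
        if delta > 0:
            time.sleep(delta / 1e9)
        late = time.monotonic_ns() - t   # wake-up latency in ns
        worst = max(worst, late)
        t += interval_ns                  # next periodic activation
    return worst

print(f"worst observed wake-up latency: {measure_wakeup_latency() / 1000:.0f} us")
```

Note that a tool like this only observes samples: it reports the worst case it happened to hit, which is exactly the limitation raised later in the talk.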
And here is where things start to get more complex, right? Because safety-critical systems require a higher level of assurance of the system behavior, with some evidence that it works correctly. To the point that tests are good, but sometimes other things, like formal methods, could make this evidence stronger and enable Linux for more complex safety-critical environments. So here I'm back again to theory and practice, and trying to bring them together. Luckily, some things have changed in the recent past, right? Here's an example with preemption, theory versus practice. With PREEMPT_RT, the kernel becomes more preemptive, as preemptive as possible: everything becomes a task, and preemption is enabled by default. And we have some good results, measured by cyclictest from user space, saying: okay, my scheduling latency is within this many microseconds. This is good because it enabled Linux to be used in real-time systems, and it enabled the development of theory that relies on it, like SCHED_DEADLINE. But yet there is no clear description of the factors that cause the latency, and no evidence that the worst-case scenario was ever hit by cyclictest. So it's hard to convince a skeptical person, right? So, how can we try to make Linux more scientifically sound, let's say? Well, we can try to follow the algorithm that people in academia use, which is: create a precise definition of the system, create an algorithm, try to find the worst case, and define a set of equations. Luckily, and that's something we also see in other work for safety-critical systems, most of the kernel work is already there, the code is there, and many times we don't actually need to create a new algorithm, but just to analyze Linux with a new perspective. Right. So, yes, it's possible to try such an approach.
For example, take the scheduling latency. We said the scheduling latency is the longest time between the arrival of the highest-priority job and when it starts executing, and we grounded that definition in the code. Then we created a formal model that represents all the kernel events that could cause this delay, or that influence this metric. From all these events, using a formal method, we derived what the worst-case scenario for this latency could be, defined a set of variables, and came up with a bound for the scheduling latency. And this was accepted both by the real-time kernel community and on the theory side, right? This was good evidence for the real-time community, because for a long time Linux was criticized for not being explored academically, or not being explored with the same rigor that they use there. Things have changed now: we created some sort of theory that backs Linux up. But still, theory without practice is not that useful for me as a developer, right? So we also created a practical scheduling latency measurement tool that uses that formula: it captures only the kernel events that influence the variables of that formula, while trying to minimize the overhead it causes, because otherwise we would need to deal with too many events. I don't have time here to explain it, but you can check the presentation I gave at Linux Plumbers last year, which explains in detail how these two pieces work; the tool is available on my GitHub and you can check more about it on my web page. It has an output that tells me how much latency the system has, considering some kinds of interrupts, and where the latency came from: is it this variable, or is it these interrupts? And with that, we could bind together the theory and the practice, right? And at the end, there was one thing we feared when trying to bring these more pessimistic, worst-case ideas from theory.
We feared that the values would be too pessimistic, no longer enabling Linux for the workloads we were looking at. But instead, the results we got ended up confirming what we thought. Obviously, the latency reported by this more pessimistic analysis, which covers all the cases and shows the root cause, is larger than cyclictest's, because cyclictest cannot hit all the possible cases, or it would take too long to hit one of these pessimistic scenarios. Still, even with all the pessimism, we are below one millisecond, as we showed on some architectures. On more complex architectures we showed that, even being super pessimistic, almost unrealistically pessimistic, the latency is in the two to four millisecond range, which might be too large for some use cases we are seeing, but it is not for the automotive case, for example, and we have strong evidence for that. So, with that work, can we say that Linux was formally proven as a real-time operating system? No, there's still a long way to go, right? And the main point is that the major focus of PREEMPT_RT, of the real-time kernel community, was on the scheduling latency. But the scheduling latency, even though it's a big part of the problem, is not the main goal. The main goal is the response time of a task: not when a task starts to run, but when the task finishes its computation, and that involves several other parts of the system, like scheduling and locking, and then things get more complex. But, well, we have an example of this integration, and what else can we do, right? What is the lesson learned from joining this theory with the practice? Should we start from practice, or start from theory? The point is that we may find that it's not a chicken-and-egg problem; it's an evolution problem. Right.
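To give a flavor of what such a pessimistic bound looks like, here is a toy fixed-point computation. This is not the exact formula from the analysis above, just its general shape: the latency window must absorb the longest non-preemptible section plus every interrupt arrival that can fit inside the window, so the window length appears on both sides of the equation and is solved iteratively. All numbers are invented:

```python
# Toy fixed-point latency bound: not the real analysis, just its shape.
# The latency window L must cover the longest non-preemptible section
# plus all IRQ interference that can land within L, so L appears on
# both sides of the equation and is solved by iteration.

def latency_bound(max_disabled_ns, irqs):
    """irqs: list of (min_inter_arrival_ns, wcet_ns) per interrupt source."""
    L = max_disabled_ns
    while True:
        # At most ceil(L / period) arrivals of each IRQ fit in a window L.
        interference = sum(-(-L // p) * c for p, c in irqs)
        new_L = max_disabled_ns + interference
        if new_L == L:
            return L            # fixed point reached: L is a safe bound
        L = new_L

bound = latency_bound(
    max_disabled_ns=50_000,                        # longest non-preemptible section
    irqs=[(100_000, 5_000), (1_000_000, 20_000)],  # (period, WCET) per IRQ, in ns
)
print(f"latency bound: {bound} ns")
```

The pessimism is visible in the construction: every worst case is assumed to happen at once, which is why such a bound is always above what cyclictest observes.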
We need to balance and bring new things from theory, mix them with new code, and make things work together. Is there any evidence that this is possible, given that the Linux kernel evolves fast, and do the kernel developers care that much about it? Right. And here we have another example, with migrate_disable. migrate_disable is a synchronization mechanism from PREEMPT_RT that temporarily prevents a task from being migrated from one processor to another. It is good because it avoids disabling preemption, which causes scheduling latency; that's why it's good on one hand. On the flip side, it breaks one assumption that we use in the schedulers, which is being work-conserving: SCHED_DEADLINE assumes the system is work-conserving, that is, that the m highest-priority threads will be able to run on the m processors of the system. There was a great discussion on LKML, because some people were trying to use migrate_disable on the non-PREEMPT_RT kernel, because it would make their lives easier. On the other hand, it would create a problem for us in the scheduling subsystem. And during this discussion, Peter came up with an algorithm: okay, this is the algorithm I suggest to allow us to mitigate this problem; we will create tools to find the worst case; and we will also try to update this with the theory that we are creating. So that's a good example that we can go forward and continue trying to mix these two worlds, with the acceptance of the community as well, right, because the results are good at the end. So where else do we need to put effort on this, right? One point is SCHED_DEADLINE, and you can check my presentation from last year at DevConf. For example, currently the design of SCHED_DEADLINE doesn't allow tasks with different CPU masks to share a CPU, right, because of the global scheduler.
But some per-CPU kernel threads need to run with SCHED_DEADLINE; for example, we might need to boost some per-CPU threads, like RCU's, for a given time, right? So we have this need of having arbitrary affinities on SCHED_DEADLINE. We have some results in theory, like arbitrary-affinity scheduling or some level of dynamic partitioning, that could be useful for that, but more research is required. Another thing is that currently we ignore the overheads inside the system, right? And also, the methodology we use to test SCHED_DEADLINE is basically informal: we put some user-space tests to run and see if they reply before the deadline, but we don't try to understand what is going on inside the operating system. And this has led us to problems in the past, like with tasks with a deadline shorter than the period, for example. Another point: with priority inheritance, which is currently in PREEMPT_RT, we have a priority-inversion avoidance protocol that works like this. If I have a highest-priority thread waiting on a lower-priority one, this lower-priority thread will inherit the priority of the highest one, to avoid a mid-priority task postponing the lower-priority task and, through it, everybody depending on the lock. This priority inheritance works for single processors with fixed-priority schedulers, like the FIFO scheduler, but it was not designed for multiple cores, and neither for deadline-based schedulers, right? The protocol was later extended with deadline inheritance, but that has some known issues, and you can check them in the presentation from last year. There is this initiative implementing proxy execution, which is promising. It's a good mechanism, but you still need some theoretical background to handle some corner cases. And so on. Well, this presentation was faster than I thought. So my final remarks are: okay, well, we had a lot of progress in real-time Linux as a community in the last years, right, or the last decade even; we are getting old.
So we have PREEMPT_RT, we have SCHED_DEADLINE, we have lots of tools. And most importantly, in the last years we've been seeing this need for Linux on safety-critical systems, and now not only is automotive at the door: there is even more, there is Industry 4.0, there is edge computing, there are also the low-latency use cases that we already support, right? And these are demanding more sophisticated analysis, right? For us to be able to give better responses to the requests, for example from our customers, but also to create the evidence that will enable real-time Linux to go further, to reach new markets that require these more sophisticated kinds of evidence and documentation showing that Linux works fine, right? And that's the difference between us and the older real-time operating systems, which sometimes were designed to be real-time operating systems from the beginning, designed with certification in mind, to target and get certification for this kind of application, right? Linux was not designed that way, but still we can find a way to produce this evidence based on the things that we already have. And the scheduling and real-time community is very receptive to this more theoretical kind of work, because it actually makes our lives easier at the end, right? It avoids us facing problems in the future because we didn't take into consideration this crazy algorithmic analysis that the academics are doing, right? And we have learned how to deal with it now. So, yeah, that's it. Thank you all. I have a lot of support from research people, so please check the Red Hat Research booth; they are nice people, and feel free to talk to me about the research as well. And that's it. Actually, we have two questions. The first is from Caroline: what about real-time on ARM? Okay. Each ARM board has its own set of problems, derived from the board support package, right? Some have their drivers upstream, some don't.
So, assuming that the hardware support is upstream, there's no problem, right? The primitives of PREEMPT_RT are all architecture-independent. There might be some glue in architecture-dependent code to make it work, but that is not really part of the core of the real-time features; they're mostly all hardware-independent. So, yes, PREEMPT_RT on ARM works. The people that work on ARM support the development of PREEMPT_RT and these kinds of features. So, yeah, it works, assuming that the board support package is upstream, or that it works fine. Okay. Thank you for the reply. We have one more question for you, so get ready. Oleksandr is going more deeply, more technical; I don't know if you are able to read it, or I can show it to you. Okay. He's pointing out that some parts of the kernel, for instance transparent huge pages, are currently not compatible with preemptive real-time. Yes. And he's asking if this is a blocker for upstreaming PREEMPT_RT. So, the last question first: no, it's not a blocker for the PREEMPT_RT upstream. And it's not disabled for all the cases; it's disabled on our kernel because it might hold the virtual memory semaphore for too long. And there are some use cases where, because of the way the applications were designed, this caused non-determinism. It's not because of PREEMPT_RT: the virtual memory semaphore is a huge kernel lock, and it can cause this kind of issue. Again, it's not PREEMPT_RT; it's just a big kernel lock, and for quite a while people have been trying to find a way to improve it. It's not easy. So, yes, transparent huge pages are disabled. Also, this kind of heuristic is generally not required that much by many users of PREEMPT_RT, because they actually like to reserve memory by themselves, manually, instead of relying on heuristics like transparent huge pages. Thank you. Actually, there are a lot of questions.
I'm trying to go one by one; the most voted one will be next. And that is from William: has there been interest from the new automotive engineering effort in using the real-time kernel? Yes, yes, mainly for the safety-critical requirements. Looking forward. And next is from Daniel: I see a need for real-time workloads inside KVM virtual machines, what is your take on that? It works. There's even a team at Red Hat, Luiz Capitulino is the head of it, that has been working for three or four years already on making KVM-RT and the kernel-RT work well together. We have good results, but still it's being applied to a single use case, which is telco, and the results are quite impressive. But I advise you to talk to Luiz about these other needs beyond telco with KVM-RT; Luiz can give you a more business-like answer for this. Yeah, thank you, he's going to hear about this. We also have one question from Varan: what about SMIs on x86, or any other hypervisor, does the PREEMPT_RT determinism still hold? Yes, and it's a good point, because an SMI looks like a hypervisor, as if the bare metal becomes a hypervisor, right? We have something on a lower level that stalls the operating system that runs on top of it, right? It does not change the way that PREEMPT_RT works; PREEMPT_RT is still deterministic. It's just that you need to account for this interference from the lower-level layer, right? So if your SMIs run for, like, 10 microseconds, and these 10 microseconds on top of the PREEMPT_RT latency don't cause you a failure, if it's still good for you, it's okay, right? It's just another source of interference to account for. So it does not affect the PREEMPT_RT model; it just affects the results that you achieve on that platform, whether the lower level is hardware or a virtual machine. Okay, Caroline asks: what is missing to make Linux safety-critical for automotive?
Lots of things, lots of things: standardization, creating more evidence, trying to understand what evidence we need to create. And there is an effort now at Red Hat to try to figure this out. Will that happen? Yes, for sure, for sure. As to whether Linux will be the full OS, okay, the full OS is another thing, right? Linux will probably not be used for the most safety-critical parts, right? But some parts have a lower level of criticality, right? It will not control the brakes, it will not control the steering wheel, but it can control things like dashboards or other components that require timing and safety. But that's still the ASIL B level; ASIL D is something more complex. But anyway, there's no need for Linux to be the full OS. There are a lot of problems to be solved. I hope in the future we can achieve such a level of confidence in the results, right? It will take some time, I think. Personal take.