So, we planned a panel discussion, which is a little bit hard because there's no panel, so the panelists have to be standing, which makes sure that it's short. What I wanted people to talk about, and it's not only about the panelists, the panelists are there to get the discussion going, I want you to be involved. So the topic is real-time, safety and mainline Linux: where are we heading? Is this really a viable, feasible approach? Are we on the right track doing this? And where are the limitations we are currently running into? I'll let Nicholas start with that.

Okay. So, the approach is actually almost trivial, as usual. Safety standards assume that we have a process and the process emits code. The process also emits documentation and bugs and a few other things. But the assumption is that if we have a well-structured process, then we have a tolerable level of residual bugs. So the question in the context of functional safety standards is: can we assess the Linux kernel development process so that we have reasonable confidence that residual bugs are in a tolerable range? And then, what is a tolerable range? Well, for random faults that's trivial; we can enumerate it and say SIL 2 is 10 to the minus 7 critical failures per hour. It's a range, not an exact number, but that's not so important. For systematic faults like bugs, we can't really say that. But we can come up with arguments like: if Thomas writes code and Peter reviews the code, the probability that it's better after the review is greater than zero. And if we do enough such cycles, then we can achieve code that is equivalent to the state of the art for managed development.

I don't know if there's a chalkboard here, because then I can put that in one simple diagram. Do you have some chalk? Yeah. Okay. Here? Okay. Goodie. So it's very simple. Normally, we have a process like SPICE or CMMI, and that produces some code. This code gets process-based reviews, audits, testing, analysis, whatnot. You get some initial increase and then ideally a decrease of bugs, and at some point you say the residual bugs in there are in a tolerable range. And we trust our lives to these systems in airplanes, trains, ships and whatnot. In the open source case, we have some process here, but we don't really know what this process is and how good it is; what we can measure is this part. And of course it's much better. That's basically the approach we're taking.

The bad news is that for the Linux kernel we can do this, but for some of the out-of-tree patches it's a little bit painful. One of the painful patch series is currently preempt-RT, because all our nice little statistical exercises don't work with that patch set due to traceability limitations: the inability to trace back in history when a change was introduced and, in all cases, why it was introduced. So root cause analysis and the severity assessment of bugs are currently a little bit hard to do with preempt-RT. And also the static...

We don't want to make it too easy for you.

I was assuming that's by design. That was always my assumption. And the second side of the coin is, of course, static code checking. Now I understand that there's a new CI framework in the works that will help improve that. Static code checking, of course, will be an important thing.
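As background for the tolerable-range numbers quoted earlier: the figures come from IEC 61508, whose target failure measures for safety functions operating in high-demand or continuous mode are, approximately:

```latex
% IEC 61508 target failure measures, high-demand / continuous mode:
% average frequency of a dangerous failure per hour (PFH).
\begin{align*}
\text{SIL 1}:&\quad 10^{-6} \le \text{PFH} < 10^{-5}\\
\text{SIL 2}:&\quad 10^{-7} \le \text{PFH} < 10^{-6}\\
\text{SIL 3}:&\quad 10^{-8} \le \text{PFH} < 10^{-7}\\
\text{SIL 4}:&\quad 10^{-9} \le \text{PFH} < 10^{-8}
\end{align*}
```

So "10 to the minus 7" is the lower bound of the SIL 2 band, which is why it is stressed above that this is a range rather than an exact number, and why the same kind of numeric target cannot simply be transferred to systematic faults such as bugs.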
You might now think: well, why is testing not being mentioned? I would estimate that if we all sat down and tested a lot of this preempt-RT patch set, we could probably cover 10 to the minus 6 or 7 percent of the state space that a Linux kernel can reach. And that's why it's simply not relevant. So the argument for safety is the process, not testing. Testing and prototyping are nice; they show you that what you want to have is there, but they give you absolutely no clue that what you don't want is not there. The only way to do that is to assess the process. So we need to find ways to improve the preempt-RT process, and of course mainlining is one of the key issues, so that we can actually come up with reasonable arguments for why it's in a tolerable range.

Can I get to one? Yeah. Okay. So given this, the open question in this regard: we as Siemens currently do not really bet on one solution, we bet on many. We are looking with great interest at the approach of certifying the kernel to make it usable for safety-critical applications. But we're also looking into having the option of a partitioned system, which is a much smaller kernel, so to say, a hypervisor approach. And we are also looking into different approaches regarding real time, so we are running both preempt-RT and Xenomai in products. There are simply many open questions. If you have to build a system out of this and have to come up with a solution eventually, it's a bit like the question of what the drive system for the car of the future will be: if you bet on one and it turns out to be the wrong one, it's painful for 10 years. So it's pretty hard to bet on one solution or one approach in this regard.

So what is the limiting factor right now in terms of going with the simple partitioning stuff? I mean, this should be accessible; this should be easy to put into a process.

Yeah. So we actually went with half of the documentation, so to say, or not even half, a fraction of the approach, to the TÜV and asked whether it's a feasible approach. And the TÜV said: well, the software looks good, and it's probably doable, we can do a white-box assessment on this, but we need more regarding the hardware. And that's actually the currently open aspect, because a hypervisor relies heavily on properties of the hardware to achieve the isolation. That means we need information about how the hardware works: what we can rely on, where the potential errors are, how we can address them in software, or, if we can't address them, whether other approaches are needed. And that's currently the open question.

You're not covered by the camera.

So, to maybe describe it another way: the code Jan is talking about, from the Jailhouse project, is extremely small. And basically the strategy is quite brilliant for safety: the strategy is to try to do nothing in software. At runtime the hypervisor ideally is doing almost nothing, because it's only spatial partitioning. The problem is that it's actually not doing much other than configuring the hardware, so the problem is sort of delegated to assessing the hardware. From what we understand from hardware vendors, for anything that's as complex as an IOMMU or so, there is currently not even a reasonable strategy for how to certify these things. So that's the showstopper at this point.

I've heard that a number of times. And if you can't certify an IOMMU, I don't know how you certify a CPU with an MMU. It's just as complex, and more. We've built both.
And our hardware guys don't think one is fundamentally more complex than the other.

OK, do you have a complete specification for an IOMMU that we can go to the TÜV with?

There are multiple ways to do IOMMUs.

Yeah, but for the one that you did, do you have a formal specification, or a reasonably complete specification?

Where's the formal specification for the MMU and the CPU?

Well, there are two answers to that. One is that we have a different historical background for MMUs. Put simply, if I take a single core, say a Pentium M, and claim that the MMU is not total rubbish, they will accept that. If I go to them with a new chip and say the IOMMU is perfect, it might be hard to argue.

OK, so you're saying that because of the lack of historical record, you can't count on the IOMMU. That's the most reasonable argument for the difference I've heard so far.

Also just non-standardization. I mean, IOMMUs for different CPUs have fundamentally different behavior. MMUs for different CPUs, except for set associativity and maybe some replacement algorithms, are well understood, well known. There's not really a complex configuration behind them either. How many registers do you need to configure the MMU? How many do you need for the IOMMU?

The SMMU v3 spec. Yeah, there's a lot of registers there.

There is. But still, it's a different level of complexity. Yeah. I think the historical record argument is probably the strongest for why nobody's ready to trust an IOMMU today. But if this is the case, if you can't trust the hardware with a little tiny bit of software on top of it, how are they going to trust the same hardware with a whole bunch of software on top of it?

That's actually easier. It's not our argument, but this is a rough representation of the problem. If I take a highly complex operating system, how many instructions is a hello world, in assembler, from hitting enter until you have hello world on the screen? Just roughly. 50k, 100k.

You're arguing that a residual-bug percentage target is easier to hit in a large code base.

The probability that it survives and produces a false positive, saying everything's fine, producing valid-looking output when actually it's total garbage, is much lower in a complex system than in a simple system. In Germany we call that the thousand-kummiwutl hypothesis. If a stray pointer were to occur in the kernel, the probability of the thing actually surviving and producing reasonable output from then on is reasonably low. So in a certain way, highly complex systems can be simpler to certify than simple systems.

Yeah, it's just really hard for me to correlate this. I completely agree with you that the standards say it's all about the process. But then, when we get practical advice from people that have gone through the process before, it's about: well, you can't have any dynamic memory allocation, you have to justify every use of dynamic memory allocation, you have to...

Wrong standard, that's the problem. They're all using the wrong standards. They're using the standard for low-complexity systems because that's what they were used to.

Yeah. I understand that there's a big disconnect today between the complexity of the system that we need to certify and what they're used to certifying. But right now, on the partitioned system, we're trying to do a small code base, 10 to 20K lines of code for the partitioning, and then again a very small, 10 to 20K, application for the safety-critical part.
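To make the partitioned-system idea a bit more concrete, here is a minimal sketch of what a static partition description might contain, assuming a Jailhouse-like design; this is not Jailhouse's real configuration API, and every name in it is hypothetical:

```c
/* Purely illustrative sketch, not Jailhouse's actual configuration format:
 * all struct and field names here are hypothetical. The point is that in a
 * statically partitioned system everything is decided at configuration time
 * (memory ranges, CPUs, interrupts), so the hypervisor has almost nothing
 * left to do at runtime. */
#include <stdint.h>
#include <stddef.h>

struct mem_region {
    uint64_t phys_start;  /* physical base address given to the partition */
    uint64_t size;        /* size of the region in bytes */
    uint32_t flags;       /* read/write/execute/DMA permission bits */
};

struct partition_desc {
    const char              *name;      /* human-readable partition name */
    uint64_t                 cpu_mask;  /* CPUs exclusively owned by it */
    const struct mem_region *mem;       /* statically assigned memory */
    size_t                   num_mem;
    const uint32_t          *irqs;      /* interrupt lines routed to it */
    size_t                   num_irqs;
};

/* One small safety-critical partition beside the Linux partition. */
static const struct mem_region safety_mem[] = {
    { .phys_start = 0x80000000ULL, .size = 0x00100000ULL, .flags = 0x7 },
};

static const uint32_t safety_irqs[] = { 42 };

static const struct partition_desc safety_cell = {
    .name     = "safety-app",
    .cpu_mask = 0x8,            /* CPU 3 reserved for the safety application */
    .mem      = safety_mem,
    .num_mem  = 1,
    .irqs     = safety_irqs,
    .num_irqs = 1,
};
```

Because nothing in such a description changes at runtime, the software argument reduces largely to "did the hypervisor program the MMU/IOMMU and interrupt routing according to this table", which is exactly why the open questions shift to the hardware.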
Yes, but the problem is that we rely on the isolation features of the partitioning. And that's the hard-to-prove point: are these partitioning features really reliable or not? We can't assess that at all; we need help from the chip manufacturer for that. And what we have seen so far...

Virtualization, just use the kernel.

Yeah, what we have seen so far in terms of safety manuals for that kind of chip is close to a fairy-tale book. So that's the problem we are running into right now. And of course, there is no huge prior use of, or precedent for, SMP systems in safety-critical applications. But now, with various use cases like autonomous driving, robotics and whatever, you hit the barrier where you actually need the computing power of a PC or something equivalent, and on the other hand you have the safety requirements, because you don't want to run over somebody or crash the car or whatever. So this is where we run into that gap. We need ideas about how to tackle that on the application level; how to assess and certify applications at that complexity level is not yet a solved problem. But then we also have this extra piece of hardware which needs to be highly complex, more complex than what's used in safety-critical systems right now. So we need that information: can we rely on that piece of hardware or not? Otherwise the authorities will say: yeah, if you do crystal-ball assessment, good luck with that.

So why can we do complex software but can't do complex hardware? That was one of your questions.

Well, and I think I got the answer when Thomas was talking. You're now relying on the kernel to own more of the hardware data flow. So you're not relying on the IOMMU; you're relying on user/kernel separation.

And the key difference here is that if we rely on the hardware, we have sort of a single level of protection. In the kernel, I can use cgroups, seccomp, diversity, MMU, whatnot; I can use multiple layers of isolation. And this allows us to build up arguments where we know cgroups are not bug-free, we know seccomp is not bug-free, we know the MMU code is not bug-free and the process environment is not bug-free, but what's the probability that a pointer in application A violates all of these protection layers and actually impacts application B? So we can use multiple layers of protection in the analysis to mitigate some of the uncertainties. And the key difference is that it's managed uncertainty as soon as we're at the software level, where we have a design flexibility that we don't have in the hardware.

Yeah, but without the hardware-level protection you also have to concern yourselves with an active attacker, with the safety aspects as well.

But we're not looking at security at this point. And the standards don't either.

The standards don't either, but if you're talking this far out, right? I think where we're disagreeing is what time horizon we're looking at. I'm looking at a year or two years out; I think you're looking five years out.

No, not five; 10, 15 years.

Yes, if you absolutely need the complexity of a Linux system to get your job done, then you're gonna have to find a solution to this. Right, and if you don't need that complexity, don't use it.

Actually, if you ask a certain part of the industry, they want to do that tomorrow. They have been advertising for five years that they can do it tomorrow, which is hilarious, but that's a different story.
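Going back to the layered-protection point above: a minimal sketch of that argument in formula form, with purely illustrative per-layer numbers and the optimistic assumption that the layers fail independently, would be:

```latex
% Illustrative only: the per-layer escape probabilities are made-up numbers,
% and real isolation layers are not fully independent.
\[
  P(\text{fault in A reaches B}) \;\approx\; \prod_{i=1}^{n} p_i ,
  \qquad
  p_1 = p_2 = p_3 = 10^{-3}
  \;\Rightarrow\;
  P \approx 10^{-9}.
\]
```

The point is not the specific numbers but that stacking several imperfect software mechanisms (cgroups, seccomp, MMU-backed process isolation) gives you an argument you can decompose and manage, whereas relying on a single opaque hardware mechanism does not.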
But actually the need for this complexity is there today, because they have to build today the systems they want to run when, maybe, the job is done.

I mean, that's even less predictable than preempt-RT. We'll wait until the standards catch up with us? We're gonna ignore the standard and keep going, right? They have to finally sit down and write the standard. That's their job. What are they waiting for? That's their problem. We're taking standards that were written in the 1990s and saying we're going to use them for autonomous driving, which is just wrong. The consequence must be that industry sits down and says: we need a complex OS, we need AI, we need C++ and other horrible languages, so we have to find ways to certify them. But they're not doing that.

Well, it was written in the 90s for a 1970s coding style. That's what it was. But yeah, I mean, if you're really trying to solve the autonomous automotive problem, then you have to look at this level, right? I'm looking at problems that are partitionable today, and how we can solve those problems today. And most of the application itself that needs to be safety-certified is not as complex as the machine learning and all that sort of stuff, but does require more than a Cortex-M to run.

Okay, so you're going to buy yourself one generation of products by doing that, but you're not going to get much further?

Yeah, but still, if you look at it, say in the context of Jailhouse, we still have to have reasonable certainty that the partitioning we set up in software is actually provided by the hardware. And that's something we can't assess without having the information from the chip vendors. I don't know if you have more information than other chip vendors, but I've seen nothing so far which holds up to the job. Or, Jan?

We are waiting eagerly for whoever first comes out with such a thing. I mean, the design of Jailhouse is to cater to all kinds of architectures, well, not all kinds, but the major ones on the market. So basically it leaves the ground open for any vendor to jump in, provide a solution which is certifiable, and set the pace for this. It would be very interesting to see. And if none of them does it, then we can still try to go for diversity and argue away the hardware by having an ARM64 in parallel with an x86-64 in parallel with a MIPS64, the last one just for fun. That's what we have done before. Machine learning is not solved, we know that. I'm just trying to get the software stack up to glibc certified on complex architectures; how the funny bunnies from the AI department are going to do it, nobody knows.

Any other questions on that?

So, I've always been told 10 to 20K lines of code can get certified, that 20K lines of code is going to take a year to get certified. What is your experience?

My personal experience is not in this area, actually. I've only participated a little bit in certifying a Linux kernel once, a long time ago, for a specific purpose, with a very specific approach, probably not really repeatable in this form. But yeah, I guess the dimension you mentioned is actually what we are heading for, and maybe we can even get smaller, at least for the really critical part that has to go through the full process. If this can be done in software with reasonable effort, I guess that is the software approach for this.
Otherwise, you really have to go for the complexity, as Nicholas mentioned, and go down completely different paths.

Yeah, I mean, the diversity solution is well known and well established, but of course, if you think about mass production, then the bean counters will hate you for even mentioning it. Because they are clever enough to have figured out by now that there's this single chip which can do all the things from a computing-power perspective. So they want to have this single chip doing all the things, including the safety stuff. That's why they are looking into this: they want to save themselves the extra x86/MIPS diversity setup with all the extra cost, extra room, extra heat. There we go.

One thing I might add about the Jailhouse solution: I personally looked at the code a little bit, and I do think that it is in a complexity class where you could actually argue it as a low-complexity system from the code. Low complexity, in 61508 terms, is defined as a piece of software where all failure modes are known and the behavior under failure is understood. At least for the core part of Jailhouse, that's probably doable. And if it can be brought down to low complexity, then the software problem for Jailhouse is almost trivial. The hardware, of course, is an open issue.

Any more questions related to real-time, safety and whether we are heading in the right direction?

So, when you're talking about adding layers and layers for security or safety reasons, you would assume that Linux, or that hypervisor, is the only system that has full control over, well, the ECU, the hardware. What about when there are, let's say, two other CPUs that are not controlled by the hypervisor but are on the same bus, have access to the whole IOMMU, to everything else, and run proprietary or just some custom code? How does certification work for this whole system?

Actually, this is not possible. This is well known and understood by the hardware vendor, at least by the particular one; whether it's going to be addressed is a good question. But yes, this is part of the system, and it has an impact, an impact which may not be completely understood yet by all of us, or maybe by some people. For a safety-critical system, I absolutely agree that it's not possible as it stands, and it has to be worked on. But there are multiple architectures on the market, and others may have different properties, maybe more favorable ones, maybe not. So that's the advantage: you can move around and see which is the best chip for this problem.

Wrong crowd.

But proprietary isn't the problem. If it's proprietary but written by the manufacturer for that second device, then that's okay for them. Well, there are many people that go to certification with closed-source solutions. They have to share them with the auditor, but they don't have to share them with the world. And today, in automotive, you know, anybody that can get to the CAN bus is on a multi-processor bus where you can do things you probably shouldn't be able to do.

Yeah, of course, this is not about proprietary versus free software; from a safety point of view it has to be assessed as well. The question is whether this code has been written according to the standards, according to the processes we need to apply. And there is code involved in the setup and startup of a system which has not been written according to this, on typical machines, already today.
So that means that everything is involved, and all the complexity you have in these additional chips has to be compliant and compliantly developed. And that may be a huge amount of software, maybe even larger than what we're currently talking about here.

I mean, maybe the question implicitly behind this is: why are we looking at Linux now? And the key issue here is that the traditional operating systems were, by process, developed for single-core systems, by all the major vendors. And as the Linux kernel proved in the 2.2 series of kernels, it is not quite that trivial to switch from single-core to multi-core. The assumption that you can now take these traditional safe operating systems and dump them on multi-core is, I would say, a little bit naive. They're going to try to do it; it's probably not going to work. And of course, the security properties also will not be able to be brought into that. I think these two changes are what sort of reopened this can of worms for industry.

Any more questions? Everybody tired by now? So, I want to thank you for participating in this discussion. I hope we can come up with better news next year. Maybe the wake-up call to the hardware manufacturers works, and you go home to your company and come back with something which is not a fairy-tale book. That would be appreciated. And so, I hand over to Michael for the final instructions for the rest of the evening.