So welcome to this little introduction to building safety-related systems with Linux. The intent of this session is to disappoint most of you, so don't be too surprised. OK, go to the first one.

So one of the basic questions we want to answer is what standard to use, because there seems to be quite a bit of confusion, or I hear a lot of confusion, in the industry about that. Of course, this is my view. It's based on working in safety now for a little bit more than 15 years, and I started thinking about how to certify Linux in 2007, when we started the Safety Critical Linux Working Group at OSADL.

If we look at the target systems that people are trying to build and what components they are trying to put into these target systems, then we're talking about relatively powerful multi-core systems, connected devices, which means security comes in, new concepts like using machine learning and AI, and updates. So: new concepts, new hardware, new tools, new measures and techniques needed for assessing the correctness of the system or the safety properties of the system, and new design methods. My favorite question is to ask if anybody knows what Real-Time Yourdon is. That's one of the specified design methods in the IEC standards. Don't worry if you don't know it, nobody knows it. Because a lot of the measures and techniques in these standards are, to put it carefully, a little bit out of date. They were written in the late 1990s based on the technical standards of the late 1980s, and basically they were never really updated. Now, in the new 2010 edition of 61508 you do have a few new methods mentioned. That's good, but the old methods were not removed, and it's still mostly out of date.

Then we come to the domain standards. These domain standards use 61508 edition 1 as their basis, not edition 2. So they are really out of date. Next one, please.

So we end up in the situation that people are trying to use a domain standard because they say, oh, I'm building a rail system, so I'll use EN 50128 for software, or, I'm building a car, so I'll use ISO 26262 because it says functional safety for vehicles in the title. Unfortunately, these are based on what a standard is: the consolidation of the state of the art of an industry. And the state of the art of that industry was microcontrollers running an OSEK-type operating system, bare metal, or maybe AUTOSAR, on a single core. So not anything that we're trying to do with Linux now. Essentially, that just means that you can't use domain standards if you're trying to implement something this new.

If we look at the history of functional safety, we have never done that: switched all paradigms at once. That's a perfect recipe for disaster. What you normally do is switch one of these issues after the other, maybe moving to multi-core and keeping everything else stable, or moving into a connected environment where you introduce security. But we're trying to do all of this at once, and that's just not going to work. Aside from the simple fact that we don't have any certified hardware, but we'll get to that a little bit later.

So the solution that you have to take is to go up the standards stack and say, OK, if you don't have a domain standard because you're trying to do new stuff, then you have to use one of the generic safety standards, like 61508. And that scares a lot of people. They say, oh, but I want to use the automotive or the rail or the chemical industry standards.
But actually, it's much, much harder to use one of the domain standards, which is, as I said, a consolidation of a previous state of the art, for new technologies. It's actually much simpler to use 61508 than to use ISO 26262 for autonomous cars, to stay with the hot topic, simply because the generic or basic safety standards, as IEC calls them, were designed to be much more flexible, make fewer assumptions on how they will be used, and therefore also have the appropriate structure to be flexible. And just to make it very clear, next one, please: if you are one of the people using ISO 26262 for autonomous cars, you're using the wrong standard. It would be a really good thing to at least look at 61508 before you go on. I have been asking, especially in the automotive industry, who has read 61508, and found out that almost nobody has. Actually, I know of a single engineer who said he had read it, even though it's normatively referenced, which actually means that everybody should have read it. OK, so next one, please.

So that brings us to the question: what is compliance? Because one of the reasons why we're trying to use standards is that people want to be compliant with something. The problem is that there are some misconceptions about compliance as well. Now, if you do not comply with functional safety standards, then the probability of you just not building a safe system is actually quite high. Technically, that's not always true, because we have a number of domains that don't have appropriate safety standards and have still been able to build safe systems. The automotive industry itself is a good example, where we had no functional safety standards, at least not at a formal level, for many years. And still, cars weren't just randomly turning left or so. There was appropriate internal awareness of safety, and there were precautions taken to build safe systems. So there was maybe no formal standard, but there were de facto standards. Unfortunately, ISO 26262 sort of came into this industry and scared a lot of people, and they said, we have to be conformant to this. And the decision was to blindly follow a standard, which equally does not result in building a safe system. You might have a compliant system, but you're not safe.

Compliance with safety standards is really just a first line of defense, to make sure that your internal safety processes, your internal safety culture, are reasonable and not going totally off into the wild. That's what I think safety standards are intended for, not to be followed blindly, and especially not when you're following the wrong one. And this is something that has to be established in industry to a certain extent: the standards that we have now are really not guidelines that you can just hold on to. This is not like the previous generation of standards that we had in the 1980s and early 1990s. Some of you might know them. Mü 8004 was an extremely prescriptive standard: it would say you have to have a replicated power supply, you have to have independence of some unit, you have to have duplication. And it didn't care about the use case. The reason why that was possible is that the use of electronic devices in safety-related systems was actually limited to relatively simple devices, so they had a lot of commonality. With growing complexity, this commonality more or less disappears.
And that means that, at that point, you have to use them as a sort of guidance, but not try to stick strictly to each statement in the standards. Essentially, you have to adjust each of these standards to your specific use case, and the higher the complexity, the more of that you have to do. This idea that you can achieve safety of a system by being compliant with a standard is really one of the fallacies that we have in a lot of industries. Not only does compliance not produce a safe system; as a side note, it also does not give you any legal guarantees, which is, I think, one of the reasons why a lot of industries are trying to do this. They assume that if we're compliant with some safety standard, then we'll have a reduced problem during any court cases. I don't think that's going to be true. Generally, if you did not follow standards, then it's you who has to prove that your system is safe. If you did follow a standard, then it will be the other side's duty to prove that you did not follow it correctly or didn't do it properly, which is generally considered to be harder. But if you followed the wrong standard, it's going to be trivial. And that's exactly what's happening at the moment. Next one, please.

So basically, we have two types of systems, and that's also sort of the next confusion that I see in industry. When we're talking about compliance with a standard, none of the standards is actually a specification for a particular system. It's a specification for a class of systems, for a very, very broad class of systems. And in some industries, these classes have been quite constant over the years. If you look at the rail industry, an interlocking system in 1925 actually had basically the same high-level requirements as an interlocking system has in 2018. It was manually operated in the 1920s, mechanically operated in the 1950s, electrically after that, and now it's done by computers. But the basic requirements did not really change: you want to prevent head-on collisions, side collisions, follow-up collisions, and so on. There are not that many rules. So in that sense, we can make a domain standard, because we know the potential high-level hazards of the system at the technology-agnostic level: we know what is hazardous. Similarly in cars: we know that if the brakes fail, it's going to be hazardous. We don't really care which technology you're using for the brakes, and we don't care if it's a manually or electronically operated brake; we can estimate which impact a failure of a braking system would have. But when you get to a highly diversified area, or the introduction of new technologies, then the domain standards fail.

And conceptually, what the domain standards are doing is a translation from type B systems to type A systems. A type A system, or low complexity system, is a system where we know all faults or failure modes that are possible, or at least a reasonably high percentage of the failure modes. More importantly, we understand how the system will behave under failure conditions. That opens the mitigation potential of monitoring and reactive response in the system by an independent second system: think of a watchdog timer, or an intelligent monitoring system that can detect some anomaly in the system and then respond. And finally, there is the assessment of criticality, because basically the risk that we're trying to mitigate is probability times severity. Now, severity can be judged in context, but probability generally cannot be judged that way.
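Spelled out, this is the usual risk notion from the basic safety standards (the formulation is standard, not from the slides):

```latex
\text{risk} \;=\; \text{probability of occurrence of the hazardous event}
\;\times\; \text{severity of the resulting harm}
```

The severity factor can be judged from the context of the specific system; the probability factor is exactly the part that needs data.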
So we need reliable data, hopefully reliable data, from previous similar systems. If we have that, and we can judge the severity in the context of the specific system, then we basically have the necessary data to do a risk estimation of the system. Now, if any one of these requirements or assumptions is not satisfied, we're in a type B system: we don't know all failure modes, or we don't know the behavior of the system under one or more of the failure modes, or we don't have reliable field data. And this applies to absolutely every system that's trying to use Linux; they are type B systems. All the domain standards are for type A systems, because the assumption was that we have a consolidated knowledge base for the domain, and because we have this consolidated knowledge base, we can actually treat these systems as type A systems: we have enough experience that we have the reliability data, and we have an understanding of the potential failure modes. This is not true for any of the new systems that we're trying to build, be it autonomous robots, cars, or some of the upcoming medical devices. And this is going to be a critical point, because this type A or type B question, of course, is connected to the idea of conformance and to which standard you are using.

So essentially, as soon as you're at the point where you have a type B system, most of the domain standards explicitly refer you back to 61508. IEC 62061 for machine tools will actually say: if you have a type B system, go use 61508. And when you're done with 61508, then you can incorporate this analyzed ('pre-certified' is a dangerous word to use, so let's say analyzed) system into the machine tool and treat it as a low complexity system again, a type A system, because you qualified it to a generic standard. This is how 61511 for the chemical industry treats it, or 50128 for the rail industry. Now, ISO 26262 is the one standard that somehow forgot to put this little line into the standard saying that if you have a type B system, go back to 61508. But implicitly, it does the same thing as all the other domain standards.

So this difference of which system type you're building is really essential. And if we go back to the principle of functional safety, what functional safety is really trying to do is: you take a system and you try to analyze it, and especially analyze, not test, to the point where you are, in a certain way, doing a transition from a type B to a type A system, because we know that we can't actually build safe type B systems. If you don't know the behavior under fault conditions, it can't be safe. So essentially that's our target: to get every system to the point where we have a type A system. As I said, sometimes we can't do it completely, and then some portions of the system might stay type B, but then we use basic safety standards like 61508 to mitigate this elevated level of risk. And that's really where 61508 gets its flexibility from: they did not make any assumptions on the domain or on the use case, and so they have to give you the full spectrum of what you must be able to do. So keeping that in mind, that's where we are: we're trying to build type A systems.
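As a small aside before going on to hardware: the simplest concrete form of the 'independent second system' mentioned earlier is a watchdog, and it is exactly the monitoring-and-reactive-response pattern that a fail-stop type A argument leans on. Here is a minimal sketch, assuming the standard Linux watchdog character device; the device path, the one-second pet interval, and the health check are illustrative assumptions, not from the talk:

```python
import time

def monitored_loop(health_ok, device="/dev/watchdog"):
    # Opening the device arms the hardware watchdog. From this point
    # on, the hardware resets the system unless we keep writing to
    # it: stopping is the default, continued operation is earned.
    with open(device, "wb", buffering=0) as wd:
        while health_ok():
            wd.write(b"\0")  # keep-alive; avoid b"V", which requests
                             # "magic close" and would disarm on close
            time.sleep(1)    # must stay well below the device timeout
    # Returning without the magic-close byte leaves the watchdog
    # armed, so a failed health check ends in a hardware reset rather
    # than in undefined behavior.

# health_ok is a hypothetical application-level check, for example
# verifying that all monitored tasks have checked in recently.
```

The safety logic lives in the independence: the application can only keep the system alive, it can never prevent the stop.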
Next one. That should be hardware. Yeah, okay. So how is 61508 going to treat this? Basically, we have a structure where 61508 and all its derivative standards say: first you have to understand your system. You have to build up the context of your system. You have to know what your use case is and which hazards you have in that context, and then you can go into hardware and software for the implementation. But again, just looking at the hardware and the software will not allow you to do that.

We do have pre-certified hardware for low complexity systems. You can buy gravity relays that are certified to SIL 3. That's simply because the use case of gravity relays is really quite limited, and we can exhaustively enumerate all of the input states and output states of these things in all possible error modes, and we're done. So you can do it totally out of context.

Similarly, if we look at hardware, with Part 2 of 61508 covering that, we have compliance routes defined there that are, again, trying to get close to this situation where we understand the system fully. Basically, we have two compliance routes. Route 1H looks at the fault tolerance capabilities of the hardware and the safe failure fraction. The safe failure fraction is, somewhat simplified, this: you look at all the failure modes that you know, and you say which of these failure modes is going to be safety critical and which not. Very often, if you have a fail-stop system, then all crashes are not safety critical; if you have a fail-operational system, then crashes of a single system would of course be critical. So typically, systems that can stop will have an elevated safe failure fraction. But we have to do this analysis, and this analysis is in context, and the higher the complexity of the hardware gets, the worse this gets, and the more assessment has to be done for elements that will never be used. If we look at a complex system, then it is quite common that only a fraction of the hardware is actually being used; you have a number of hardware units on the platform that might just never be used. So certifying hardware out of context will be a painful thing. And with very high complexity, trying to analyze the safe failure fraction out of context is going to be impossible.
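For reference (this is the definition from IEC 61508-2, not from the slides), the safe failure fraction is computed over the element's failure rates, roughly:

```latex
\mathrm{SFF} \;=\; \frac{\sum \lambda_{\mathrm{S}} \;+\; \sum \lambda_{\mathrm{DD}}}
                        {\sum \lambda_{\mathrm{S}} \;+\; \sum \lambda_{\mathrm{DD}} \;+\; \sum \lambda_{\mathrm{DU}}}
```

where the λ_S are the rates of safe failures, λ_DD of dangerous but detected failures (caught by diagnostics, for instance a watchdog turning a hang into a stop), and λ_DU of dangerous undetected failures. You can see directly why a fail-stop system scores better, and also why the whole computation presupposes that you actually know the failure modes, in context.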
The second path that we have for hardware, Route 2H, is to build up adequate dependability data. Unfortunately, it's quite well specified what dependability data is; I'll just point you to IEC 60300-3-2 as sort of the entry point into a set of standards that is about 60 IEC standards long, describing the different properties, statistical properties and so on, that you can extract from field data. Field data itself has a quite rigorous specification in all of these standards, but there seems to be this strong idea that, oh, we'll just collect enough data and then we'll be good. Unfortunately, that doesn't work, and if you look at the 60300 series, which is a very large series of standards, and some of the derived standards, then you will see why this is not so simple. I'm not mentioning ISO 14224, because that's just for low complexity systems; it comes from the chemical industry and is probably not too interesting for anybody looking at a Linux-based platform, because you're not going to have low complexity components in there. So essentially, in this second compliance route, you're collecting data, not by just using raw field data, but by collecting it according to the appropriate standards and specifications for it to actually be valid, and when you have that, then you can do an assessment. What you're really trying to get at is, again, the transformation back to a type A system: we have to know what the system's failure modes are and how the system behaves under the failure conditions, and we have to have reasonably justified assurance that the numbers we're assuming for the failure rates are plausible. That's why you need such rigorous data. Okay, next one.

So, if we ever get a certified multicore system on this planet (currently we don't have a single one; I'm still waiting for it, and I think we actually have to sit down and write a new standard for this to happen, but that's a different story), then we have to look at the software stack. And with the software stack, when we're talking about Linux for safety-related systems, we're really doing a paradigm shift, because we're talking about systems that are one or many orders of magnitude larger than what we were previously using. And the prime reason why looking at Linux becomes interesting for safety is actually not safety but versatility and security, because all the Linux systems that I know to be under development for safety are connected systems, or will be connected systems, and that means security becomes an important issue. 61508 does consider security; unfortunately, the derived standards generally do not. If you have credible threats, it just points you to IEC 62443, which unfortunately is not completed yet. But at least it registered that if you want to do over-the-air updates, it might be a good idea to consider security.

And this is really one of the big changes that we have at this level of complexity. We're not talking about certifying a system once; we're talking about maintaining a safe state of the system even though the thing will have updates on a weekly basis. Even for a small kernel configuration with limited glibc support, you have to expect that you will have one or two patches in the mainline stable kernel per week. Not all of them will be safety related; my current estimate is that roughly one out of 30 will be safety related. That means you're going to have quite frequent updates. Sometimes you can get away with derating the system or temporarily disabling a feature, but essentially the dynamics of these systems are dramatically higher.
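A quick back-of-the-envelope with those two figures, taking them at face value (both are the rough estimates just given, not measurements):

```latex
1.5~\tfrac{\text{patches}}{\text{week}} \times \tfrac{1}{30}
\;\approx\; 0.05~\tfrac{\text{patches}}{\text{week}}
\;\approx\; 2.6~\text{safety-related patches per year}
```

Two or three safety-relevant changes per year, on top of weekly maintenance patches, is a completely different regime from the traditional certify-once-and-freeze model.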
So, getting back to the compliance route, or the compliance issue: IEC 61508 basically allows three compliance routes. Route 1S is bespoke development: just follow the standard. Not an option if you want to build a Linux system. Route 2S is proven in use; that's something you should immediately delete from your memory again. Don't even try using proven in use for anything that's more than a while(1) loop running on an 8-bit microcontroller. So we're left with Route 3S, which is assessment of non-compliant development. I'm not going to go into the details of assessment of non-compliant development, because that would actually mean explaining all of Part 3 of 61508; even though it's a single clause, through its iteration it sort of weaves through the entire standard. The essence of Route 3S is saying: if we have complex software components that are doing anything reasonable, there must be some basis, some reasonable procedural basis, that created the software; otherwise it would be exploding all the time. So essentially, bespoke development is: you take some rigorous process, be it ASPICE, ISO 9001, or CMMI, you apply these methods, you generate software, and then you monitor it, and you will have some bug rate or fix rate or whatever in the software.

And assessment of non-compliant development is just saying: take the process as is, assess it, find out how good it is, identify the gaps in the process, and mitigate these gaps, either by doing architectural protection, by moving protection into the application level (which will never be pre-existing), or by putting constraints on the element's use, but not by modifying it. Even though the standard does have one bug in it, where it says you can modify the element to eliminate dead code, I would not recommend even trying that path with the Linux kernel, because it would be an extremely invasive thing to do. So essentially, what you try to do is look at the bug development, or the incident development, and by doing that, assess whether the documented process is as good as a bespoke process. And here I have to be careful how I formulate this, because some people have taken this as proven in use through the back door. It's not that you're just doing bug tracking and saying, oh, we have five residual bugs in the Linux kernel for this configuration, so we're good. That's not what you're doing. You're saying: if the process is as good as a bespoke process, then we should not have too high a bug rate. That's what you're trying to assess. Essentially, the only way to remove systematic faults in a complex system is by process. So it's documenting the process and assessing the process, and that's why this route is called assessment of non-compliant development. Next one, please.

So assessment of non-compliant development is the compliance route that we're intending to use for the pre-existing elements: the Linux kernel, glibc, a minimal BusyBox set, yeah, some of the tools and so on, and a number of technologies that we're using for protection. But safety, just like security, is a system property and not an element property. So when we're talking about building a system with Linux, we're always talking about a system development lifecycle, and the top part of the system development lifecycle, developing the requirements for the specific system as well as the overall software and system design, is never going to be pre-existing. At least not at the beginning of the use of Linux for safety. And really, that's again the difference between the domain standards and 61508, because in the domain standards, the assumption is that these top parts of the development lifecycle, requirements and design, would also more or less follow a common pattern, and actually could be covered, at least largely, by the safety standard. In 61508, they don't make any such assumptions, or they are very flexible, but they assume that you're doing a bespoke development for this top part. So for requirements and design, basically, we have to follow Route 1S. Just flip in the bottom part, please.

So the system itself is a 1S development. And once you're down to the design, the overall software and system design, then rather than emitting requirements on the specific software or hardware elements, you would be emitting, I'll call it a wish list, of what these elements should provide, and then you would go and select adequate elements that can cover these functional and non-functional requirements. Now, of course, that's not going to be a precise mapping. The Linux kernel will not exactly satisfy your specification; it will have a whole bunch of dead code in it, it might have some functionality that is close but not exact, and then you might have to do adjustments at the architectural level or at the application level.
Then, for the actual elements that you selected, that's where you want to use Route 3S. And we do expect that, mid-term, patterns will emerge that allow us to get very close to a pre-certified element, in the sense that the certification of the next Linux kernel version, or the next glibc version, will be more or less a routine task. But essentially, there will never be a pre-certified Linux kernel that you just download and you're safe. And finally, we have the whole top of the stack, the application stack, and maybe some generic functionality like system bring-up, which we call the launcher in the SIL2LinuxMP project; these generic components will also mostly be 1S components and have to follow the appropriate standards.

So the effective development lifecycle is not that much different from the traditional development lifecycle, at least for the first few layers. The bottom layers, that's where we want to be able to reuse pre-existing and open-source software. And I might add that there's of course a lot of pre-existing non-open-source software as well, but I personally think it will be very hard to qualify that if it was not developed according to a safety standard, because you simply don't have the necessary data to do assessment of non-compliant development; you need process data. And that's the big advantage of the open-source projects for safety: this process data is actually available. As Jonathan Corbet was describing in his keynote this morning, the kernel project of course underwent a significant transformation, from a maybe not so well-structured project to now a very well-structured project. And that's exactly what assessment of non-compliant development is about: showing that this process is not only stable but has self-corrective capabilities, and that therefore the results of this process are trustworthy, if additional testing and verification efforts confirm it. Okay, next one.

Okay, that's interpretation mapping. Yeah, okay, sorry, I sometimes get lost in my own slides. So, the basis for doing this assessment of non-compliant development is that you have to take the standard and adjust it to your actual use case. In our case, we assume that requirements and design are going to be 1S, and then we can go into the 3S route. How do you get to that point? How do you know what you have to do from scratch, which methods recommended by the standard you can reuse and which you cannot? The way we do that is interpretation mapping. What it means is that you take the standard clause by clause. There are a lot of vague words in these clauses, like 'adequate', 'fit for use', 'sufficient', whatever, and you have to interpret them. So what you're doing is mapping the entire standard onto your particular use case. Then there are clauses in the standard that simply don't apply to you, because they only apply to a one-out-of-one system, or only to a fail-operational system; if you're a fail-safe system, you can skip them, just de-scope them. So you have to do this clause-by-clause interpretation, removing all the vague wording, putting it into context. And as I said, safety is a system property, and certification is a system property as well. So you are creating the certification roadmap for your particular system.
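Just to give the shape of such a mapping, here is a hypothetical fragment; the clause identifier, the quoted wording, and the dispositions are invented for illustration and are not quotes from the standard:

```python
# One entry per clause: what the vague wording means in *this*
# system's context, whether it applies, and where the evidence lives.
interpretation_map = [
    {
        "clause": "IEC 61508-3, 7.4.x (hypothetical)",
        "wording": "a 'suitable' programming language shall be used",
        "interpretation": "C, restricted by the project coding "
                          "guidelines, for the 1S-developed parts",
        "disposition": "applies",
        "evidence": ["coding-guidelines", "review-records"],
    },
    {
        "clause": "fail-operational clauses (hypothetical)",
        "wording": "requirements on continued operation after failure",
        "interpretation": "system is fail-safe with a defined stop "
                          "state, so these requirements do not apply",
        "disposition": "de-scoped",
        "evidence": ["safety-concept, safe-state section"],
    },
]
```

Done clause by clause over the entire standard, entries like these are what turn the generic text into the certification roadmap for one specific system.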
Again, it might well be that if we do this ten times in a specific domain like rail, we see patterns emerging, and then later we can simplify, or extract an open-source domain standard for the rail industry, where we say 90% of this is repetition, and therefore we can reuse the argument structure or extract generic arguments for this particular domain and say: okay, if you replicate your system and run it as a two-out-of-two, then certain clauses are simply covered by this architectural pattern. But this is not something that we can do initially, especially when we're trying to use things like machine learning or artificial intelligence, where we have no clue how these things actually work. Needless to say, we have no safety concepts for them. So this interpretation mapping is really the basis, because what you get out of it is not only a complete mapping of the standard in your specific context; you also get a certain feeling for the uncertainties, for the parts of it that might need to be refined in case you have field findings or incident reports. So it's not something static, and really, the creative work in safety is interpreting and working with the standards. And this is again where the domain standards are harder to work with, because they're much more rigid; they were written for a specific use case, not for the generic case. So 61508 and the other basic safety standards are just much more flexible, and that gives you back the engineering flexibility that you actually need. And just to respond to one of the statements during the keynote this morning: safety has nothing to do with producing paper. Okay, next one.

Okay, now that's a list of changes and extensions. Okay, good. So now, getting a little bit closer to SIL2LinuxMP: what do we have in SIL2LinuxMP? The project has been running now for a little bit more than three years, trying to come up with a qualification route for new Linux-based systems. And to do that, we actually had to extend 61508 and get these extensions reviewed. So the extensions cover selection, because we have to be able to eliminate hazards, not only mitigate them, and selection is a place where we can eliminate them. We're at time. Oops, okay. Okay, then we'll leave this list for you to review later. Sorry for that, I'm too slow again. Just jump to the conclusions then.

Okay, the good news for certifying Linux is that we think we know most of the problems, and we have a large number of the solution options outlined and sometimes prototyped. The bad news is that there's no certified hardware on this planet, and we don't see any concept of how we're going to get any certified hardware unless industry sits down and writes appropriate standards, like developing an assessment of non-compliant development for hardware. There never will be a pre-certified, shrink-wrapped safe Linux, so don't ask for it. There will be no safety element out of context that includes Linux and glibc and other complex components. What will be possible is that we have a consolidated procedure that eliminates the business risk of using Linux in safety-related systems, because you have a reasonable assurance that if you follow this pattern, you will be certified. There's a kickoff meeting for a follow-up project to SIL2LinuxMP on Wednesday, the Safety Linux initiative from the Linux Foundation. Bad news for that: it's not going to happen that fast.
My personal view is that we're talking about two to three years minimum if we get to work now to actually come up with a reasonably consolidated process for certification of Linux. Okay, thank you.