Hi, I'm John McGregor. I work for Robert Bosch GmbH in Germany in software research. This edition of ELC seems to be morphing into a Cook's tour of the home and work offices of Linux enthusiasts across North America and around the world, so I'd like to welcome you to my home office. Unfortunately circumstances have not permitted me to show it to you, but welcome nonetheless.

Eons ago, I worked as a Unix developer programming in C. More recently, about 10 years ago, I worked at Bosch on Android applications in IoT. During that activity, I started attending Embedded Linux Conferences. About five years ago, I got the opportunity to work with the SIL2LinuxMP project, a privately funded public project that tried to certify Linux-based safety critical applications at the SIL 2 level under IEC 61508. When that project ended, I started working with the ELISA project, a Linux Foundation project that is working on enabling the use of Linux in safety applications. ELISA isn't focusing on any particular standard, but a considerable portion of the participants come from the auto industry, and the interest in the ISO 26262 aspects of safety certification is considerable.

Last year at the Embedded Linux Conference Europe, at the Safety Summit, I made a presentation that basically said that we should learn to walk before we try to run with safety certification of Linux, and I presented a number of measures that we could take in that direction. One of them was ensuring that the safety uncritical parts of the system still meet the expectations of the accreditors and the safety certification agencies. This presentation is a result of the investigation that we did into that. My co-author, Nicole Pappler, who has worked as an accreditor, and I decided to start with ISO 26262. That turned out to be a good start, because 26262 is tailored to a more representative cross-section of industry, one that produces consumer products for the mass market, and its authors have given some thought to how to use components in the development process.

The topic is obscure and the corresponding terminology is obscure. I've tried to make the presentation more accessible, but it's hard to know exactly who's going to be watching it. I assume that it's either Embedded Linux enthusiasts who are interested in safety aspects, or people from the automotive industry who are interested in certifying open source software. It may not be admirable of me that I decided to prerecord this presentation, but it has the advantage that I will be attending the presentation, and as it unfolds I'll be able to answer your questions in the chat; I invite you to provide questions.

So the next question is obviously: how is this going to unfold? Well, we're going to take a look at the challenges involved and the situation in which they occur. Then we're going to look at the quality management areas that ISO 26262 touches, then the role that quality assurance plays in ensuring functional safety for safety critical products. After that, we will look at the strategies for using safety uncritical software in a safety critical application and take a look at how those strategies would be applied to Linux. And then we're done.

So the situation is that system integrators in automotive companies are looking at complex safety critical automotive systems. You can have software-controlled power window control units if you want, but I don't think you'd really consider using Linux for those products.
You're probably looking at products like in-vehicle infotainment applications, which are not exactly safety critical but have safety critical aspects, or at instrument cluster applications as well. In some parts, people are even considering using Linux in autonomous driving applications, and there the safety criticality is a considerable factor in the overall system.

At any rate, the systems consist of components, and components can themselves be safety critical, or safety uncritical, or a mixture of safety critical and safety uncritical functionality. The point, however, is that components are off the shelf, which means that rather than being developed from scratch, they're integrated into the system. If you take a look at open source components, obviously the source code is available, but that source code could be provided by commercial agencies, not necessarily taken directly from the open source project's repository; it might come from a distributor or from a consultant. Because the development process is critical to the safety considerations, hopefully there's some information about the development process that was used in producing the component. Because it's a selection process, the source code should not be modified, but it could be trimmed of extraneous functionality before you try to certify it; that simplifies the challenge quite a bit.

This presentation is about safety, but when you're selecting a component, there are other considerations. The component may have security vulnerabilities, and you'll have to investigate that. Actually, security aspects are also safety significant, because when the system is hacked, its behavior is no longer predictable. There are also embargo constraints: certain governments may not allow products coming from another country, or products including components from other countries, so you have to understand the target market and where the components are coming from. There are the typical Linux product problems of licensing, and the component could be vulnerable to patent trolls and license trolls. These are all selection issues and not necessarily safety relevant, and this presentation is only about the safety relevant aspect.

In a safety critical system, there is safety critical functionality, which is functionality that maintains the system in a safe state or brings it back to a safe state after a hazard event. These are the parts of the software or the system that are actually certified against a particular standard and accredited according to their integrity level. Depending on the severity of the impact of a hazard event, the standards may require higher levels of integrity in the component, which usually means that a higher degree of rigor is required in the development process. Safety uncritical functionality is usually just developed with the normal bog-standard quality process. In contrast to safety critical functionality, safety uncritical functionality, and off-the-shelf components in general, are qualified rather than certified for use in safety critical systems. That means that the system integrator first has to choose a particular component; in the case of operating systems, you could have the choice between an RTOS and Linux, perhaps. And after the system integrator has made its decision, it has to justify that decision to the accreditor. A basic problem is that open source projects don't manage quality like commercial organizations.
In the early 2000s, the British Ministry of Health commissioned a study into the applicability of Linux to medical devices, and the consultants who wrote the report came up with the term "software of unknown pedigree", where a pedigree is basically a certification that an animal like a dog or a horse is a purebred. In typically black British humor, the implication was that open source software doesn't have this pedigree, because the projects don't use the standard development processes, so their software is more like a mongrel.

So the goal of this presentation is to explore the actions possible to get QM accreditation for safety uncritical functionality. The SIL2LinuxMP, ELISA, Zephyr and Xen projects are all addressing safety certification of open source components, and this presentation, as I said, reflects a walk-before-you-run approach: the vast majority of the code in the system is going to be safety uncritical, and it's probably a good idea to look at that first.

So what is QM? What does ISO 26262 think QM is? When I heard conversations at the office, I didn't quite understand what people were talking about; it turns out to be quite an enigmatic term. QM obviously designates quality management, as in a quality management system, but it's also an integrity level. What does it mean as an integrity level? It basically means that the quality management system has produced a component with integrity equivalent to a commercial component. Over and above that, the QM qualification process involves demonstrating that the open source quality management process is essentially equivalent to the quality management process required by ISO 26262.

So what does 26262 actually say about QM? First of all, somewhat obviously, the system integrator has to have a quality management system that's compliant with IATF 16949 in conjunction with ISO 9001. ISO 9001 is probably reasonably well known as a quality management system certification standard, and IATF 16949 is the automotive industry adaptation of it in the international context. There used to be different standards for automotive quality systems in different geographical areas, say in Japan, North America or Europe, and the IATF standard amalgamates them all for use across all companies in the world.

The next requirement took me a bit by surprise. 26262 has a way of classifying hazards based on how likely it is that you're going to encounter the hazard, how severe the resulting harm to people's lives or well-being is, and the ability of the driver to control the situation and avoid the damage. And the standard says: if the hazard is not that dangerous, if it doesn't look like it will cause much harm, then you can handle the functionality that deals with it under the quality management system in force in the company, the standard quality management system. Regardless of that, all functions in the system that have no safety requirements must still be specified according to the quality management system in force; that means there must be a specification for all of this functionality. There can also be complex components that you can break down into a part that's safety critical and a part that's safety uncritical, and then use different measures to qualify or certify those parts.
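To make that hazard classification concrete, here is a minimal sketch in C of how severity (S1 to S3), exposure (E1 to E4) and controllability (C1 to C3) combine into an integrity level. It follows the commonly cited pattern of the standard's classification table, where the worst combination yields ASIL D and each one-step reduction in any parameter lowers the level by one, bottoming out at QM; treat it as an illustration of the idea, not a substitute for ISO 26262-3 itself.

```c
#include <stdio.h>

/* Illustrative sketch of ISO 26262-3 hazard classification:
 * severity s in 1..3, exposure e in 1..4, controllability c in 1..3.
 * The standard's table follows the pattern that S3/E4/C3 maps to
 * ASIL D and each one-step reduction in any parameter lowers the
 * level by one, down to QM. Consult the standard for the
 * authoritative values. */
typedef enum { ASIL_QM, ASIL_A, ASIL_B, ASIL_C, ASIL_D } asil_t;

static asil_t classify(int s, int e, int c)
{
    int sum = s + e + c;
    if (sum <= 6)
        return ASIL_QM;            /* low enough: normal QM process */
    return (asil_t)(sum - 6);      /* 7 -> A, 8 -> B, 9 -> C, 10 -> D */
}

int main(void)
{
    static const char *names[] = { "QM", "ASIL A", "ASIL B", "ASIL C", "ASIL D" };

    /* A severe (S3), frequently encountered (E4), hard to control (C3) hazard */
    printf("S3/E4/C3 -> %s\n", names[classify(3, 4, 3)]);
    /* A mild, rarely encountered, easily controlled hazard stays at QM */
    printf("S1/E2/C1 -> %s\n", names[classify(1, 2, 1)]);
    return 0;
}
```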
So what does 26262 expect for the qualification of a software component? First of all, the standard gives a definition of component, which is nice, because it also uses a lot of synonyms: there's everything from items to elements to systems and systems of systems, and then there are components. What it defines for a component is that it's a unit of functionality, either hardware or software, that can be subjected to stand-alone testing and that is defined at the architectural level. The definition was posed for both hardware and software, so a software component is simply that same definition restricted to software.

So what do you have to do to qualify a software component in general, meaning in safety critical applications as well? You have to define the maximum safety integrity level that the component will be expected to meet. You have to define the intended use of the component in the system. You have to investigate and describe the known anomalies of the component. You have to specify its functional behavior, and its behavior in different types of exceptional situations: when the component fails, or when the system gets into an overload situation, what does the component do? What are the response time and resource usage requirements for the component? You also have to document the safety requirements in the development process, and the numerical accuracy of the component. That seems like a slightly strange requirement in this context, but the standard was thinking of a software component more in terms of a math library, where you ask for something like the sine of 60 degrees and get an answer of 0.866, rather than anything more complex. At any rate, after you've selected the component, you have to do an acceptance test and then integrate it into the system, and that's done with the usual procedures specified in the standard.

What is the relevance of quality assurance to safety critical systems? We have the situation where QM is now considered to be an integrity level, the system integrator has a quality management system, and the open source project only does quality assurance, meaning testing and reviews and coding standards and things like that. So the question is: how do we achieve QM integrity? The answer is by addressing and eliminating failures in the product.

So, a little digression into testing 101. There are error chains. I learned this as a mechanical engineer, but I'll take a computer science type of example. A software component can have a fault, which is something like a coding error; it simply exists, and until the code is compiled and linked and installed in a system and the system runs, it has no effect on the system itself. At the point when the fault is encountered, it can produce an error, but depending on the type of the error and the type of the system, the error may not be critical or even interesting to the system. If a function returns a wrong value, or doesn't return one in time, then perhaps the last value can be used and the deviation is still tolerable. But then there are errors that lead to some sort of catastrophic event, and that is a failure. It's the failures that we have to worry about in the safety context.
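As a toy illustration of that chain (my own example, not from the standard): the fault below sits latent in the code, becomes an error only when the faulty path actually executes, and becomes a failure only when the erroneous value reaches an external interface.

```c
#include <stdio.h>

/* Toy illustration of the error chain: fault -> error -> failure. */

/* FAULT: an off-by-one in the loop bound. It sits harmlessly in the
 * binary until this function is actually called. */
static int sum_readings(const int *r, int n)
{
    int sum = 0;
    for (int i = 0; i <= n; i++)   /* should be i < n */
        sum += r[i];               /* reads one element past the end */
    return sum;
}

int main(void)
{
    int readings[4] = { 10, 11, 9, 10 };

    /* ERROR: executing the faulty path produces a wrong internal value.
     * Depending on what happens to sit past the array, the value may
     * even look plausible and stay tolerable for a while. */
    int total = sum_readings(readings, 4);

    /* FAILURE: only when the erroneous value reaches an external
     * interface does the deviation become observable system behavior. */
    printf("average reading: %d\n", total / 4);
    return 0;
}
```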
So what are the possible causes of errors? First of all, there's what I just described: erroneous functionality. There are also hardware errors that arise from random events, like radiation or cold, that affect the hardware. There are systematic errors where the failure doesn't originate in the component itself: the component is correct, but another component is incorrect, and its failure cascades to the component in question and causes it to fail. Then there are dependent failures; a good example is that when the hardware fails, the software will fail too. And there are overload situations, where the system is not performing because of a lack of processing or memory resources, or of capacity on various communication links.

There are two basic strategies for addressing failures: one is to avoid them, and the other is to tolerate them. How do you avoid them? You try to eliminate them in the development process: by examining the context and developing appropriate requirements, and by developing measures, such as review processes and testing, to avoid particular kinds of errors being introduced into the software. On the other hand, you can introduce mitigation mechanisms to tolerate the failures that occur while the system is running. First of all, you can use encapsulation: put particular functionality in a partition so that it cannot access the safety critical functionality and cannot influence it in any way. You can have redundancy strategies, where you have more than one instance of a component, so that if one instance fails, another can replace it. You can have diversity to detect systematic errors; a good example would be to take a component and compile it as both 32-bit and 64-bit, giving two different binaries. Because there are different memory mappings and possibly different instructions being used, the probability of the two components showing the same incorrect behavior at the same time is lowered. You can monitor the performance of the system: you can introduce checkpoints and validate that the system is producing correct results, or you can take a look at the resource consumption in order to detect the causes of these problems. You can have diagnostics; in the worst case, you can have a watchdog that resets the system when it doesn't respond. And in certain cases you can substitute less capable functionality for more capable functionality and have a limp-home mode for the system.
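To make the watchdog mechanism concrete, here is a minimal sketch of a user-space keepalive loop against the standard Linux watchdog device. The device path and timeout value are platform-dependent assumptions, and a real safety monitor would gate the keepalive on actual health checks rather than kicking unconditionally.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Minimal sketch of the Linux watchdog interface: if this process
 * stops kicking the watchdog (for example because the system hangs),
 * the hardware resets the system. */
int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);   /* path is platform dependent */
    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }

    int timeout = 10;                           /* seconds, assumed value */
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

    for (;;) {
        /* ...run health checks on the supervised components here,
         * and only kick if they pass... */
        ioctl(fd, WDIOC_KEEPALIVE, 0);          /* kick the watchdog */
        sleep(timeout / 2);
    }
    /* Not reached; on drivers with "magic close", writing 'V' before
     * close() would disarm the watchdog cleanly. */
}
```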
Next, the compliant development process for safety applications. The basic idea is that the system integrator defines the system and its context, looks at the hazards that occur, and then defines safety mechanisms and safety functions; that produces a safety concept. These safety functions and safety mechanisms are then defined in the architecture, and at the architectural level you allocate the safety criticality to particular architectural elements, along with technical safety requirements, like response time or resource consumption, for those elements. Then you develop the safety mechanisms, and you do that with measures and techniques that have the rigor required for the particular integrity level. After that, you have to verify through testing that the requirements have been met. The idea is that you follow the requirements from the architectural level through the design and the implementation, and devise unit tests and integration tests that verify those requirements. So there's traceability down through the development process for the requirements, and across to the corresponding verification steps.

So now we're going to look at how to qualify a software component at the QM level. What does that mean? It means that the system integrator will want to demonstrate, to itself and after that to the accreditation agency, with sufficient assurance, that all interfaces of the open source component have been developed and are being maintained to an industrial level of quality: that it's an industrial-strength component, which would make it suitable for integration into a safety critical system. The special aspects of QM are that you still have to consider all of the error modes that would normally occur in a product that's not safety critical, but you can neglect the hazard mitigating functionality. And all of that is in the context of reuse of an off-the-shelf component, not one being developed from scratch. A consideration that we really haven't talked about so far is that you have to consider the hardware on which the software component has previously been used. The safety community doesn't really believe in "write once, run anywhere": it's understood that particular hardware components have particular vulnerabilities and particular faults, so each hardware platform has to be documented, and the experience with the software component on that platform has to be documented as well.

At the overall level, there are a number of considerations that don't necessarily influence the strategy, and one of them is simply the source of the source code. The system integrator has the choice of taking it directly from the project repository, or from a distributor or some sort of amalgamation; examples of the latter would be OpenELEC or Debian. Those are the two usual sources of source code for safety critical applications. In the first case, the relevant QMS is only the open source project's QMS; in the second, it's the amalgam of the distributor's QMS and the open source project's QMS.

The next consideration is the ancestry of 26262. 26262 was derived from IEC 61508, which is a generic safety standard; it's actually a safety standard for defining safety standards. The idea is that, based on the philosophy in 61508, particular industry branches or domains can define a domain specific standard. 61508 talks about two types of product that can be certified. A Type A product is a well-known product: the failure rates are well defined, and the behavior under fault conditions is well known and can be completely determined. An example might be an airbag control unit. We've had airbags for, I don't know, 30 years now; they're in most cars, they come from a variety of suppliers, and they're used in a variety of automobiles. So the failure modes of an airbag control unit are well known, how often it fails is well known, and the effects and impacts of a failure on the system are also well known. Under IEC 61508, in the case of Type A products, the industry is welcome to make a domain specific standard; otherwise, for Type B, you should use 61508 itself. And that's the fly in the ointment, because if you take a look at the complex applications we're talking about, such as autonomous driving, you can't say that there are well defined failure modes, or that failure rates are well known at the moment, or that a failure of an autonomous driving system has a well-determined effect on the behavior of the car or its environment. On the other hand, 26262 says that it's applicable to all road vehicles, and then what exactly that means becomes a question between the accreditation agency and the system integrator.

Another consideration is what I briefly talked about before.
The clause in 26262 that addresses component qualification talks about simple components, where the response that you get from an input is not dependent on the state of the component. In the case of a math library, it's pretty simple: like I said, in goes 60 degrees, a bit of processing, and out comes a sine of 0.866. That's not the case for an operating system. The response at a particular interface may depend on the state of scheduling, whether a process has been preempted, whether there's interrupt processing going on, what's happening in memory management, and so forth. In those cases, you have to take a much closer look at what the exact requirements are for timing and resource consumption, and what the responses must be. That usually requires knowledge of the components within the subsystem, and that's called white box testing. And, as I said previously, you have to consider the hardware platform on which the system will run.

So when you're looking at QM, you have to account for all the failure modes, even at the quality management level. There, avoidance requires something like an industrial development process; in the context of the automotive industry, that would be stipulated by ISO 26262. Fault tolerance means that you have to define and investigate the faults that can occur while the system is running, and two typical safety measures for identifying the relevant faults are failure mode and effects analysis and fault tree analysis; I won't go into the details of them. After you've done those analyses, you take a look at the capability of the fault tolerance mechanisms that the component has, then look at what the organization's normal bog-standard quality management system would require to account for interference and cascading failures, assess the gap between the two, and define ways to fill those gaps.

So basically, the approach depends on the complexity of the component you're looking at. ISO 26262 doesn't specifically address whether you should use black box or white box testing at a particular integrity level. What it does say is that at the lowest integrity level, structure-based testing is recommended, and "recommended" means optional: you don't have to do it, so it's a question for the system integrator whether or not they consider it relevant. If it's not relevant, then you can say that the component is not complex, and it can be qualified under the clause we've already seen. If the system integrator's QMS requires structure-based testing at ASIL A, then you have to qualify the component as if it were ASIL A. And, being a bit of a wag here, I've said that in that case QM is ASIL A minus A, or ASIL 0. IEC 61508 has safety integrity levels from 1 to 4, whereas 26262 goes from A to D, and the arithmetic makes more sense in 61508: SIL 1 minus 1 gives you SIL 0. There again, in the end, you have to verify the component selection with an acceptance test and integration tests, as you would normally do according to the standard.
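Since structure-based testing keeps coming up: it means deriving test cases from the code's internal structure, for example so that every branch is exercised, rather than from the specification alone. A minimal sketch, my own illustration:

```c
#include <assert.h>

/* A small decision with two branches: requests above a threshold are
 * clamped. Specification-based (black box) tests might only exercise
 * the normal path; structure-based testing asks for test cases that
 * cover every branch, which tools like gcov can measure. */
static int clamp_torque(int requested, int limit)
{
    if (requested > limit)
        return limit;       /* branch 1: request clamped */
    return requested;       /* branch 2: request passed through */
}

/* Compile with "gcc --coverage" and run gcov afterwards to confirm
 * that both branches of clamp_torque were executed. */
int main(void)
{
    assert(clamp_torque(50, 100) == 50);    /* covers the pass-through branch */
    assert(clamp_torque(150, 100) == 100);  /* covers the clamping branch */
    return 0;
}
```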
So, we discussed the Part 8, Clause 12 qualification requirements on slide 11; a quick review. You prepare a specification of the component, which then serves as the requirements for the component, and then you have to demonstrate that the component complies with those requirements. Over and above that, you have to ensure that the component is suitable for its intended use in the specific product, and that would again mean an analysis of the hardware failures. The component's development process must also be compliant in some way with a recognized standard. What ISO 26262 says is that you could also look at ISO/IEC 12207, which is a software development process standard. That's interesting, because things like ISO 9001 are pretty binary: either you're compliant or you're not. But under 12207 you can say, well, in this process area, say configuration management, we're compliant, but in this other one we haven't quite made it; the review processes, say, may not be thorough enough. You can document that and say, okay, we've partially achieved compliance, and then it's up to the system integrator to take a look at that and decide whether or not additional measures need to be taken to ensure that the overall compliance of the system is met. And then, basically, you have to prepare and execute a plan to accept the component and integrate it into the system.

On the other hand, if you take the hypothesis that we have an element in a safety critical component that isn't itself safety critical, then you would follow the ASIL A minus A approach. What the standard does say is that when you have a safety critical element, the integrity levels apply only to particular control flows through the component, where they have to meet particular timing or resource requirements; the rest of the component can be QM. That means that even if an element doesn't have any integrity requirements, you still have to produce it with the rigor required at the QM level, and you still have to account for all of the error modes listed previously. If you were doing this by the book, then the component's QMS should be equivalent to IATF 16949 plus ISO 9001. Now, the first thing to note is that it's typical auto industry practice to substitute an Automotive SPICE (ASPICE) assessment for the 9001 certification; ASPICE is an automotive standard for assessing the maturity of a development process, so that a customer is assured of getting the product that the supplier commits to. In the case of a QM open source component, perhaps you could drop the IATF 16949 part and simply try to certify under ISO 9001, or even under ISO/IEC 12207 as would be done under the qualification clause. Or you could try to assess it at ASPICE level 2, which is a relatively low level of maturity.

So what does this mean for Linux? Well, everybody I talk to who wants to put Linux in safety critical applications believes in Linux. Linux is in all kinds of high availability applications, and also in all kinds of embedded systems, like routers, that have to be reliable. Nobody really doubts that Linux has the requisite quality and reliability to be put in a safety critical system. But unfortunately, safety standards require that one demonstrate the integrity of the system. Integrity is a matter of technical properties and of reliable implementation: the error handling within the software has to be reliable, the system has to be stable, and the development process must avoid bugs as much as possible. Regardless of what you believe, you still have to demonstrate or prove it. So where could you get this evidence? Well, when you're looking at whether the kernel development process is suitable, there is a definition of it at kernel.org.
You could compare that to the requirements in 26262 and see how far they agree. Other than that, there are parts of the development process, like requirements and design, that are done within the community but not necessarily reflected in the process documented at kernel.org, and it may be possible, for particular parts of Linux, to ask the community for additional information to support the certification effort. For the technical properties themselves, you're basically going to have to do it yourself and define the requirements that could be put on the kernel by safety critical functionality. You would do that by defining a use case: look at typical applications of safety critical software in embedded systems and see what the functional and control flow requirements on the Linux kernel are, that is, which interfaces of the Linux kernel are being used. And in order to understand those interfaces, you're going to have to look at the architectural requirements and understand how the kernel would be integrated into the system in general.

If you look at the requirements management requirements in the standard, you could mine discussions in user groups and mailing lists, but unfortunately there is no defined set of valid requirements. And these requirements are needed: they have to be traced through the development process to validate the testing and the functionality that the development process produces. Similarly, there's no central architecture model. Basically, Linux was developed ad hoc, based on examples from different Unix systems and other academic systems like Minix at the time; it wasn't designed top down from an architecture model. Probably the biggest failing of Linux at the moment, for me, is that while there's a good focus on quality assurance, meaning testing, coding guidelines, continuous integration and configuration management, when you ask what the quality of the system produced by the development process actually is, how good Linux is at meeting the requirements, it's not really known. And the next question is: if the quality is not sufficient, what do you have to do to make it sufficient? What things can you do? There again, that's not the way the open source community works. If the quality is unsatisfactory, then you have to introduce your own measures to improve the process, and perhaps submit a patch with the appropriate functionality.

Over and above that, what can I say? Linux is more agile than the waterfall models that the safety standards seem to prefer, and when you take a look at it, you can often argue that it produces the same results. The Linux development process has strengths in functional development, testing and configuration management; Git is probably the best configuration management system in the world. There are smaller weaknesses in requirements, traceability and change management: exactly how a change gets proposed, approved and implemented is not formally defined, and there's no tracking that it's consistently done in a particular way. And then there are larger weaknesses: requirements definition is not done at all, and the architecture is not defined, which makes requirements traceability much more difficult.

You can also take a look at the facilities that are already in the kernel. There are facilities for freedom from interference: seccomp, for example, limits the system calls that can be made by an application after a certain point, and there are container technologies, where different processes, depending on their group, have access to different resources, regulated by control groups, namespaces and access controls.
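To illustrate the seccomp facility: in its simplest (strict) mode, a process can irrevocably restrict itself to a handful of system calls. Real deployments usually use seccomp-BPF filters for a tailored allowlist instead; this is just a minimal sketch of the idea.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Minimal sketch of seccomp strict mode: after the prctl() call, the
 * process may only use read(), write(), _exit() and sigreturn(). Any
 * other system call kills the process with SIGKILL, so a compromised
 * or faulty application can no longer interfere with the rest of the
 * system through other kernel interfaces. */
int main(void)
{
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
        perror("prctl(PR_SET_SECCOMP)");
        return 1;
    }

    const char msg[] = "still allowed: write()\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);

    /* Even glibc's exit() would be killed here, because it uses the
     * exit_group() syscall, which is not on the strict-mode allowlist.
     * So we leave via the raw exit syscall. */
    syscall(SYS_exit, 0);
    return 0; /* not reached */
}
```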
There are also robustness facilities: Linux has watchdog functionality, logging and diagnosis. So it's not like you're starting from square one; there are things there that just have to be identified, defined and then brought into the context of safety qualification or certification.

A little bit about the ELISA project. The ELISA project is divided into a number of working groups at the moment. There are domain specific working groups, for automotive and for medical devices so far, and these should provide use cases, where a use case is basically a safety app running on user-space libraries. Somehow in this diagram they've forgotten the obvious: the user-space libraries sit on top of Linux, and Linux runs on particular hardware. The use cases contribute a safety function definition for that domain, and some idea of the context in which these safety applications run, so that one can validate and understand the technical safety requirements that the domain has for a particular part of Linux, and the architectural assumptions that go with them. The architecture working group takes the technical safety requirements and the architectural assumptions and looks at how Linux can be positioned to facilitate the different safety mechanisms. They also look at the freedom from interference and failure mode analyses that could be done for Linux, and in the future they will define safety requirements that can be allocated to the kernel to ensure integrity features, and provide conditions of use for Linux safety mechanisms, that is, descriptions of the things you have to do in order to use the Linux safety mechanisms safely in a particular product. The process working group is looking at how to demonstrate that the Linux development process is equivalent to the development processes that the standards expect, and they're also working on Linux qualification.

So what have they achieved so far? ELISA has been running for a year. The process group has defined a reference process that combines aspects of a number of safety standards, has surveyed the Linux development activities with respect to that reference process, and is defining an initial gap analysis. They're also doing work on patch impact testing and on mining the kernel, as I explained before. Patch impact testing, quickly: a patch is accepted for the current release and then backported to various long-term Linux versions, and the system integrator has to constantly assess whether or not a particular patch in the current release affects the long-term release it is using. Over and above that, the process group is currently doing a survey of static analysis activities in the Linux development process. The architecture group has identified and is investigating a number of architecture variants with hypervisors, co-processors or container technologies, and they're now working on a particular memory management safety app as a demonstrator.
The automotive working group has just formed, and they're working with the Automotive Grade Linux project, taking a pilot application that AGL has defined and working with the architecture group to look at the safety aspects of that pilot application. The medical devices working group has started examining the artificial pancreas system, which takes a blood sugar measurement device and an insulin pump and combines them with a Raspberry Pi so that the blood sugar level remains acceptable automatically; it saves a diabetic from having to measure it himself and inject insulin. All very appropriate to the coronavirus problem at the moment, the group has also started to examine open source ventilators.

So that's pretty well it. What have we learned? Well, Linux is SOUP, and it's not QM out of the box, but a mongrel is not necessarily worse than a purebred: in this sense, there is nothing to say that Linux is not as good at the QM level as a commercially developed product. Similarly, you would think from the term that QM is an issue for the development process, but that's not true; it's an integrity measure, and there are also the aspects of fault tolerance. You've probably seen that safety accreditation is not simple; it's exacting and demanding, and QM qualification itself is a massive amount of work. But because it's not safety critical, it doesn't really depend on any particular safety use case, and it's an area where different companies should be able to cooperate.

So the outlook is that the ELISA project is working on the topics that we've talked about today. The Zephyr project is an open source RTOS project, also under the Linux Foundation; they have a more conformant development process, and they're closer to safety certification at a higher integrity level. Regardless of that, we will need new certification approaches for open source software, because the old ones are not necessarily appropriate.

Thank you for your attention. That ends my presentation, and now I'd be glad to talk to you and address your questions.