My name's Paul Albertella and I'm here to talk to you about using open source as part of a system safety mechanism. I've been working with complex software systems since 1990, starting with banking system mainframes, then PCs and mobile phones, and most recently Linux-based systems in automotive. Most of my development experience has been at the system or platform level, but over the years I've become more and more focused on software engineering processes and how we use them.

I first started working with open source software in a mobile phone context, on platforms like MeeGo and Android which have Linux under the hood, but the thing I really enjoyed about these projects was the way we used open source tools to improve and extend what we could do, and I've been passionate about that ever since. This passion is what led me to join Codethink in 2019, where my focus has been on safety and open source. At first this involved a lot of time learning about safety from first principles and trying to decipher the safety standards, and then thinking about how we could apply these to open source software and development processes. I've been able to do a lot of that thinking in public, as part of open source projects such as Trustable and ELISA and in talks like this one. Last year I was finally able to apply some of that thinking in practice, working with my colleagues to qualify an open source tools integration using ISO 26262, which is the automotive safety standard. This talk is about extending the principles from that work to safety systems that include open source components.

So what do we actually mean by safety? This term can mean a number of different things in the software world, so I've included some definitions here. To be clear, I'm talking about functional safety, which is a discipline concerned with those parts of a system that are intended to protect us from harm, that is injury or death, when that system or a related system malfunctions.

And what's a safety mechanism? You can see the ISO 26262 definition here, but it really boils down to two roles: either transitioning to or maintaining a safe state, or detecting a fault and alerting a user or another safety mechanism when it occurs. An example you might be familiar with is a watchdog, which you can implement either in software or in an external hardware device. Its job is quite simple: detecting when a software process that we care about is hanging, or, in more complex implementations, using a challenge-response interaction with that process to verify that it's actually doing what it's supposed to be doing.
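To make that idea a little more concrete, here's a minimal sketch of the simplest form of a software watchdog client, assuming the standard Linux watchdog device interface at /dev/watchdog is available; the timeout value is illustrative only, and a real monitored process would only "pet" the watchdog while it can demonstrate that it is making progress:

```c
/* Minimal sketch of a process petting the standard Linux watchdog
 * device. If this loop ever stops running (i.e. the process hangs),
 * the watchdog times out and the fault is detected. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }

    int timeout = 5; /* illustrative: seconds before a hang is declared */
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

    for (;;) {
        /* In a real system this would only happen after the monitored
         * work has demonstrably progressed, e.g. after a successful
         * challenge-response exchange with the process we care about. */
        ioctl(fd, WDIOC_KEEPALIVE, 0);
        sleep(1);
    }
}
```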
But when we have one of these safety mechanisms, or a safety-related system, how do we decide what is safe? There are two different parts to this. The first is the technical solution itself: what are the hazards or risks that we're trying to avoid, how does the solution work to prevent or mitigate them, and how is it integrated into the wider system context? But we also care about how we can be confident in that solution. How can we be confident that it actually addresses the safety goals that it's supposed to, and that it doesn't introduce additional faults that could compromise them? How can we be confident in its implementation: has it been specified correctly, has it been verified correctly, does it actually do what it's supposed to do? And how can we be confident in the organizations responsible for doing all this, for implementing it and for verifying it? For that we look to evidence of their safety culture and their quality and safety management processes.

Now the established answers to a lot of these questions, which you'll see enshrined in the safety standards, have been around for 30 or 40 years, and in some cases they're starting to show signs of their age. This is something that Nancy Leveson talks about in her book Engineering a Safer World, in which she describes how the fast pace of technological change, a reduced ability to learn from experience with accidents, and the changing nature of those accidents are presenting a challenge to those traditional safety approaches. In particular she focuses on new types of hazards which arise from the increasing complexity of, and coupling between, systems and safety systems, and how this defies the ability of some of the traditional fault analysis techniques to identify those hazards and how to prevent them.

But there are some software-specific challenges too, and not just because we have increasingly complex software functions and systems involved in our safety-related systems. There's also a proliferation of components and dependencies involved in those systems, and they use pre-existing open source software instead of purpose-built software for that particular safety system. And when it comes to open source there are some additional perceived challenges. There's a lack of an entity to take legal responsibility, and licenses explicitly disavow liability. Components are often not developed for a specific purpose, because their communities value immediacy of use: being able to apply something to a variety of different cases, adapt it and rapidly iterate on it. That means that formal requirement specifications are almost always lacking. FOSS communities don't operate like commercial organizations: they don't have the management processes that we see described in the safety standards, and they can't command or direct the people developing the software to do particular tasks, so they can't ensure that particular types of work are done. This means that they may have informal or inconsistent development models; they may apply good engineering practices, but not as part of a formal process, so you can't be certain that those practices are being applied consistently.

So how can we meet these challenges, and how can we achieve safety certification for a system that involves open source software? With my colleagues at Codethink, as part of the ELISA project, I've been working on a new approach to safety which we've been calling RAFIA, which stands for risk analysis, fault injection and automation. This involves using top-down hazard analysis to identify the risks involved in using open source software as part of our system, and the constraints that we need to put in place to deal with those risks.
We then use automated construction and testing techniques to verify requirements that are based on those constraints, and software fault injection to validate the tests, as well as any system-level safety mechanisms that we've introduced to deal with risks that we can't manage in the open source software itself. To do this we've been using a methodology called System Theoretic Process Analysis, or STPA, to drive our software engineering processes; this methodology was developed by Nancy Leveson at MIT.

Our first application of these principles was to an open source tools integration called Deterministic Construction Service, which is designed to create a change-controlled and reproducible software construction and verification environment that we can use to verify our software. We used RAFIA to achieve an ISO 26262 tool qualification for this reference integration, but our next step is to apply RAFIA to the systems that we're building that have open source components in them.

What I'm going to talk about today is the kinds of roles and responsibilities that open source can have in a safety-related system, and some of the challenges involved in integrating it with other components. To do that we're going to look at integrating with a proprietary software safety mechanism, the ARM Software Test Library (STL), which is used to detect hardware faults in a CPU. We worked with ARM to investigate our approach to integrating this with a Linux-based system, and the purpose of this exploration was to understand the role of FOSS as part of system integration, the challenges of integrating the software test library with Linux using other open source software as part of the solution, and also how to manage the safety integrity claims we want to make about such components using open source tools.

So what is a software test library? It's an example of using software to mitigate hardware-level risks. When you have a hardware component of a safety system, such as a CPU, you want to understand the kinds of faults that might occur in it and what you can do to mitigate them, and not all of those can be mitigated in hardware. So what we have here is a suite of tests that verify the correct operation of the processor. It was first introduced by ARM for the Cortex-A53, and the idea is that you run it on boot and then periodically during operation of your system, so that you can detect permanent hardware faults and also latent faults that are waiting to manifest while the system is running. From a safety standard perspective this is about increasing what is called the diagnostic coverage for hardware faults, and it's a key requirement for safety certification. Now this doesn't achieve safety by itself, but as we were saying earlier about safety mechanisms, you can combine it with other mechanisms for active protection. For example, you could trigger another mechanism to activate a safe state when you detect a hardware fault. This kind of principle is an alternative to having hardware redundancy: instead of having multiple pieces of hardware running in parallel to make sure that you can always rely on them, you can understand the reliability of a single piece of hardware by running some software tests.
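As a rough illustration of that detect-and-react pattern, here's a minimal sketch of a supervisory loop that runs a periodic diagnostic on each core and triggers a safe-state transition when a fault is reported. The functions run_cpu_self_test() and enter_safe_state() are hypothetical stand-ins for the real STL invocation and the system's safe-state mechanism, and the core count and period are illustrative only:

```c
/* Sketch of the detect-and-react pattern: periodically run a
 * diagnostic on each core and react to a reported fault by moving
 * the system to a safe state. Placeholders only, not the real STL. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_CORES     4
#define TEST_PERIOD_S 1   /* illustrative test period */

static bool run_cpu_self_test(int core)
{
    (void)core;
    return true;   /* placeholder: a real system would invoke the STL here */
}

static void enter_safe_state(const char *reason)
{
    fprintf(stderr, "entering safe state: %s\n", reason);
    exit(EXIT_FAILURE);
}

int main(void)
{
    for (;;) {
        for (int core = 0; core < NUM_CORES; core++) {
            if (!run_cpu_self_test(core))
                enter_safe_state("CPU self-test reported a fault");
        }
        sleep(TEST_PERIOD_S);
    }
}
```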
Now there are a number of particular technical challenges involved in using and integrating the STL. The first is that we need to run these tests during system operation, we want to run them on all of the cores in the CPU, and we need to do that within a specific time frame, which is called the fault tolerant time interval in the standards. This is really to give us time to activate a mitigation: if we want another safety mechanism to put the system into a safe state, we need time to activate that when a fault is detected. But it also means that the tests have the potential to interfere with other running software, which would of course include any safety-related software that we're running. The other problem with these tests is that they require the highest system privilege level to execute, because they test some processor functions that are only available at that level. In the ARM architecture there are a number of different privilege levels, described in the TrustZone architecture: programs running in Linux user space, for example, operate at what's called EL0, and the kernel itself operates at EL1. Above that you have EL2, which is used if you have a hypervisor as part of your device, and the highest level, EL3, is reserved for trusted services only.

When we're thinking about this it's not just the technical challenges, it's also the safety challenges. How do these technical solutions fit into the wider safety argument? What do we do when we detect a problem, and how might this safety mechanism interfere with other software and with other safety mechanisms?

But we also have some challenges associated with using open source components, and the first of these is software license restrictions. Linux is licensed under the GPL, which places certain requirements and responsibilities on the end user. The STL has a more restrictive commercial license, which means that combining the two can be tricky. There are some strategies for integrating non-free components with the kernel, but the Linux kernel community generally regards these as harmful and undesirable. And in any event, in this case we know that we can't run all of the STL tests within the kernel itself, because some of them need to run at a higher privilege level, so we need to look at a different way to manage this. We might also have some challenges associated with building with open source tools. Safety-certified components often require the use of qualified tools. These tools might be based on an open source tool such as GCC, but they might be based on a specific older version of the compiler, or they might assume that we're using specific proprietary tools, and using different tools to compile different components and then integrating those components can result in unexpected behavior.

So how did we approach integrating the STL with Linux, and how did we address some of these challenges? Well, we turned to another piece of open source software, the Trusted Firmware-A (TF-A) project, which is an open source reference implementation of firmware for ARM TrustZone. This is primarily concerned with security and the secure world concept, but from a safety perspective it also has some very useful functions concerned with system initialization and system management, and the particular function we were interested in was a method for invoking a secure monitor running within that firmware from an operating system.
TF-A is also concerned with the initialization of the system, the operations that happen during the boot process before the operating system is even invoked, so we are able to run the STL tests on boot to make sure that the hardware is all functional before we even start the operating system. Then, when the operating system, in this case Linux, is up and running, it can call back into TF-A to run the STL tests.

This means that we have an indirect integration with Linux: rather than integrating the library into Linux, we're integrating it with Trusted Firmware. To do that we added a new service to TF-A that will run the STL tests, with the tests themselves provided as a library which we can link in with that service. We then need a way for processes running on the operating system to invoke those tests. For that we implemented a simple character device driver which lets us send a message to TF-A and invoke that service via the SMC instruction, a low-level assembly language instruction for which there is already support in the kernel. This means that we're running those tests at the correct privilege level, EL3, because they're running in a secure monitor application within the Trusted Firmware.

To control when those tests run, we have a number of test processes running in user space on Linux, and we can give each one an affinity for a particular core, so that we're sure we're running our tests on each of the cores. How we actually manage the running of those test processes, and when the tests are executed, is a slightly bigger challenge, because we need to run them within our fault tolerant time interval. The important thing is that because we're splitting them up into individual processes, we can manage those processes individually, and more importantly the operating system can manage the scheduling of those processes around the other tasks that it needs to do, which could include safety-related or non-safety-related functions that the operating system needs to run in order to support whatever our system is intended to do.

It's important to note that this is just a proof-of-concept implementation. It's about understanding the technical challenges in integrating the STL library in this way, and establishing the feasibility of this as a way of invoking those tests.
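To give a flavour of that, here's a minimal sketch of what one of those per-core user-space test processes might look like. The device node name (/dev/stl_test) and the message protocol are hypothetical illustrations, not the real driver interface:

```c
/* Sketch of a per-core user-space test process. It pins itself to
 * one core and asks a hypothetical character device to trigger the
 * STL tests for that core, via an SMC into the TF-A service. The
 * device node and request/response protocol are illustrative only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;

    /* Pin this process to the core whose tests it is responsible for. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Hypothetical device node exposed by the character device driver. */
    int fd = open("/dev/stl_test", O_RDWR);
    if (fd < 0) {
        perror("open /dev/stl_test");
        return 1;
    }

    /* Send a request; the driver issues the SMC and returns a result. */
    char result = 0;
    if (write(fd, "run", 3) < 0 || read(fd, &result, 1) < 0) {
        perror("STL test request");
        close(fd);
        return 1;
    }
    printf("STL test result on core %d: %d\n", core, (int)result);

    close(fd);
    return 0;
}
```

In practice you would launch one of these per core, and the operating system's scheduler would then decide when each one actually runs, which is exactly where the fault tolerant time interval question comes in.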
If we wanted to use this as part of an actual safety mechanism, then we would need to do an analysis of this integration method as well. We would need to look at the different software components involved in invoking the STL, the different circumstances in which that's going to happen, and what could possibly go wrong. To give an example, we don't want those test processes running in user space to interfere with the other things that are running on the system, but we also need them to run within our fault tolerant time interval. So how can we ensure that they do that? How can we ensure that they're a high enough priority that they run frequently, but not so high a priority that they interfere with the actual function of the system? Similarly, there's a whole sequence of steps involved in invoking the library, not just a simple function call from the test process, so there are a number of things that could go wrong on that pathway, and we need to look at each of them, understand how they might be compromised, and establish how we can have confidence in that pathway.

We then did a prototype implementation of this approach for the Raspberry Pi 4, which we were able to do because there was existing support in Trusted Firmware for that hardware platform, and because it uses the correct processor architecture for the STL version that we were integrating. That meant that we could confirm the viability of our approach by running the tests on boot and running the tests from user space via our invocation method, start to understand the technical challenges in more detail, and confirm that the tests were able to run as we expected.

To go further with this we want to integrate it into a reference system, not just a prototype, and we made a start on that by using a demo platform that we developed for a presentation at OSS Japan. This is an illustration of a rear-facing camera application running in parallel with an IVI, for which we were using the AGL reference IVI. It was built using the Deterministic Construction Service tool that I was talking about earlier, which meant that we could confirm that we have a controlled and reproducible way to construct all of our components and an environment in which to verify them. This gave us a basis for investigating the application integration strategies that we need to explore to understand exactly how this safety mechanism would operate as part of a wider system.

So, having developed our prototype safety mechanism involving open source, how would we then go on to use something like that as part of a safety-related system, and how would we then certify that system? To do this we would first need to use our Deterministic Construction Service integration to help us specify and control the way that we're using our open source components. This means having a specific integration and configuration of those software components, stored in a Git repository alongside the source code for those components, within a CI environment in which we can perform verification of those components and of the integrated system itself, and in which we can control changes to the integration, to the source code, and to the tests that we're implementing to do that verification. We're also going to use this environment to coordinate our collection of evidence for these various engineering processes, which is what we need to satisfy the work product requirements in standards like ISO 26262; these are all about building up a body of evidence to support why we have confidence in the processes that we've followed in constructing our system.
But in order to do that we actually need to understand what those requirements are, and this is where we're going to use RAFIA to analyze the role of the open source software in the system, and use that to specify exactly what we need it to do, so that we can then verify it. We first need a system architecture which lets us understand what role our software has within that system. We're going to use STPA to document the safety goals associated with the system and analyze the associated risks, and from that we want to identify the component-level and system-level constraints that we need in place to manage that risk. In some cases those constraints are going to be implemented by our open source software, so we're essentially going to need to show that its design and its behavior meet those constraints. In other cases it's going to be part of the wider system's responsibilities to manage risks that we might not be able to address in the open source software itself. We're going to derive tests from these constraints to verify that software, but also to verify its behavior as part of the system, so we really want to derive system-level tests here.

To give us confidence in those tests, we also want to use software fault injection to validate them, so that when something fails, the test actually detects that the failure has happened. But we also want to validate our external safety measures, to make sure that they're working: when the system is up and running and we have a detection mechanism such as the STL as part of our system, how can we be certain that it actually detects a fault and invokes the appropriate safety measure to deal with it? We can use software fault injection to explore what happens when our software components fail; you'd need to use hardware-level fault injection to do the same for the hardware components and those hardware failures. Ultimately we also want to automate this testing and the fault injection where possible, using our DCS instance, so that we can manage all of this as part of our coordinated engineering process and collect all the evidence we need, which we can then use to certify our system.
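As a toy illustration of that validation step, here's a minimal sketch in which a fault is injected deliberately so that we can check the detection mechanism actually reports it; the simple checksum here is an assumed stand-in for a real safety mechanism, used purely to show the shape of such a test:

```c
/* Minimal illustration of validating a detection mechanism with
 * software fault injection: corrupt a value on purpose and confirm
 * that the checker (a simple XOR checksum standing in for a real
 * safety mechanism) actually flags the injected fault. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t checksum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= buf[i];
    return sum;
}

int main(void)
{
    uint8_t data[16];
    memset(data, 0xA5, sizeof(data));
    uint8_t expected = checksum(data, sizeof(data));

    /* Fault injection: flip one bit, standing in for a real fault. */
    data[7] ^= 0x01;

    /* The test is only valid if the mechanism detects the injected fault. */
    assert(checksum(data, sizeof(data)) != expected);
    printf("injected fault detected as expected\n");
    return 0;
}
```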
So what do I want you to take away from this talk? The first thing is that safety is a system property. That might sound like an obvious thing to say, but it's easy to lose sight of it when we're talking about certifying individual components and then putting them together to make a safe system. Safe components alone don't make a safe system; for that, we need to understand their role in relation to each other and in relation to the system's safety goals. Only then can we be sure that we're achieving those goals. What certification of components can give us is confidence in the safety integrity of those components, which means that we understand what they're intended for, and why we can have confidence in the development processes followed to achieve and verify that purpose. Historically this has been a problem for FOSS components, because the materials typically provided by open source development communities don't always give us that clarity of purpose. However, that doesn't mean that we can't use open source components as part of our safety-related systems, provided that we as system integrators are prepared to specify how we're using those components in our system, analyze the risks involved in doing so, and show how we are managing or mitigating those risks and why we have confidence in our specific integration and safety measures. Open source can also contribute to this in the form of tools to support the software engineering processes that we use to achieve that safety integrity.

But open source can do something else as well: it can help to promote understanding of safety topics and safety concepts. We can use reference implementations, like the Trusted Firmware one that we looked at in this example, to illustrate how components are intended to be used and integrated, what risks engineers need to consider when integrating them in their own systems, and how these can be mitigated. And open source projects like ELISA can help to identify these common risks and mitigations, so that engineers using open source components can learn from previous safety analysis and build on it instead of having to start from scratch every time. To my mind, that desire to share our work and our hard-won knowledge, so that others don't have to reinvent the wheel, is what open source is all about.

Thanks for listening, and if you have any questions I'd be happy to answer them now.