My name is Paul Sherwood. I work for Codethink, and I'm here today to talk about the progress we're making on our own journey towards safety certification of open source software. This is quite a challenge, as I'm sure many of you are aware, not least because traditional approaches to safety engineering require a commitment to deterministic behavior in software, which I think is achievable at certain scales, but not so achievable these days with the advent of multi-core microprocessors and Linux-scale systems.

We started on this particular phase of our journey in 2016. Codethink is pretty well established in the automotive industry. We tend to work primarily with OEMs directly, and with tier ones, and we have a well-established capability in addressing software integration issues and improving the performance and stability of today's complex in-vehicle systems. A lot of those systems are running Linux-based operating systems anyway, and a lot of that work, to be fair, is not safety critical. But as the years have gone by, we've seen an increasing number of organizations interested in the possibility of achieving safety goals and making safety promises on systems which are also occupied with the kind of workloads that Linux is well suited for.

I guess the first real trigger for that, for us, was when it became clear that everybody is extremely interested in the possibility of autonomous vehicles. Autonomy tends to lead people towards Linux-based systems anyway, because that's where a lot of the tooling has originated, and a lot of the research is based on Linux-based machines. So there's a strong pull from the AI community, the machine learning community, for Linux-based solutions. As a result of that kind of pressure, and various comments that customers were making, we started wondering whether our work in infotainment, for example, might ultimately be safety related at some point. People in the industry were suggesting that infotainment itself needed to be considered as safety critical, and I'm not sure whether that was a genuine concern, or whether it was more a kind of scaremongering by organizations that had a vested interest in selling a hypervisor or some other safety-certified product, where maybe just creating that impression would lead to more work for them.

Our first direct look at the safety question as Codethink came when one of the top tier ones asked us to do some research, really, to look at whether there was anything in the open source world that could improve their ability to do safety-related engineering of these complex systems. In effect, they were saying that their traditional methods for safety analysis and safety mitigations were not scaling very well to the modern complexity of automotive systems, and they were wondering whether there were any methods or tools, or even existing solutions, in the open source world that could help them. Unfortunately, when we did that research, we quickly arrived at the conclusion that there wasn't very much at that time. There was no magic.
In fact, one of the key people in the client that had engaged us on this, one of their safety experts, went out of his way to point out that in his view, although open source was good and they used it in many environments, the simple fact that open source does not generally have requirements defined, or an architecture, meant that it was never going to be possible to really use open source in safety-critical environments, simply because the process that leads to open source was not going to be fit for the kind of analysis that's required in safety.

Our thinking was then compounded by one of our clients literally asking me whether I believed that Linux could be safety certified, and based on what I knew at the time, in 2016, I said that I didn't believe so. Unfortunately, that was the answer he was looking for, because he promptly said: OK, we've decided that we need to be running safety-critical stuff on our infotainment, and therefore we're going for QNX for the whole thing. That struck me at the time, and still strikes me, as a bit of an odd decision: to change your whole software architecture on the basis of a claim about one specific small piece of software, which in that case would have been the QNX microkernel. The idea that a safety commitment around one tiny piece of software is somehow more important than the performance, reliability and integrity of the whole system struck me then as odd, and as we've continued with our analysis, we've concluded that it really is just odd.

So that's where we started. Initially we tried to understand what the standards say and what the safety approach in general has been; bear in mind that from Codethink's perspective, we're coming at this as software experts, not safety experts. As we tried to digest the standards, broadly it seemed that it was going to be a very difficult job to somehow shoehorn Linux and open source into the kind of approaches that the standards were highlighting, and ISO 26262 specifically. Broadly, it offers the possibility that if you can show you've run your software in situ for ten years, and with reliability, then maybe that's a way to get to a proven-in-use argument that it's safe. Alternatively, you would need to follow the traditional approach of showing that you've got requirements defined and you have an architecture, showing how that maps to the code and back, and demonstrating that your safety goals are satisfied, by applying a significant amount of work and analysis to that whole process.

And then the final idea, which struck me then as odd and still does, is this idea of Safety Element out of Context, which is the approach that some of the microkernel vendors have taken to establish a safety certificate for their products. They in effect say that their component is generically safe. They place constraints on what can and can't be done, so they provide a safety manual saying this is how you must use our microkernel, and they get to a piece of paper on the basis that their component is known to behave in a certain set of ways. Unfortunately, that approach doesn't itself seem to get us very far when we think about the reality of the kind of systems we're talking about. If you look at the middle block in this diagram, what we're basically saying is that on an SoC, which we might get a safety certificate for, for the hardware, we're going to have a certificate for this microkernel.
But all of the other stuff that runs on it, the drivers, the firmware, and all of the applications on top, clearly isn't part of that certificate. So most of the software that's going to be in the system is not related to the safety claim that's been made by the microkernel, and this strikes me as a really huge hole in the argument. There's very little merit, to me, in making a claim about the kernel when you can't make claims about the behavior of the whole system.

A general principle that I've established over some decades now in software engineering is that it doesn't matter what claim someone makes about their software: if I can access the source code, I can break that claim just by changing one line of code. I can change any statement; I can put an exit statement in. It's trivially easy to change the behavior of software, and for a microkernel, for example, it would be trivially easy to change the behavior just by making drivers that misbehave, which frankly happens in quite a lot of production projects: the drivers are custom for the work, and safety will not be the primary concern when trying to get them working.

So against this backdrop of what seemed like obviously impossible approaches, I started a discussion in public in a group called Trustable Software, and I made this assertion in 2016. I basically said that if we're going to trust software at all, we need to be able to make certain promises about it: we need to know who provides it; we need to know that we can build it; we need to know that we can rebuild it and be sure that it is the same software that we started with; we need to know what it does; we need to know that it does what it's supposed to do; and ultimately, in the kind of systems we care about, we need to be able to update the software and still have all of those promises hold true. There was quite a lot of great debate in public around this on the mailing list of the Trustable group, and I don't want to dismiss the work, because it was significant and challenging and we did a lot of thinking, but broadly the simple output came down to this: we need to prove why we have any confidence in the software; we need evidence to show that we have some basis for our confidence. It's not good enough just to have a certificate; that is one kind of evidence, but you would also need evidence of tests, evidence of design, evidence that people had actually thought about the problem and had demonstrated how the problem was being addressed and how mitigations were in place for their solutions.
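To make the "rebuild it and be sure it is the same software" promise concrete, here is a minimal sketch of the kind of check involved, assuming a hypothetical layout where two independent builds of the same source land in two output directories. The paths and the per-file hashing approach are illustrative assumptions, not Codethink's actual tooling.

```python
import hashlib
from pathlib import Path

def digest_tree(root: str) -> dict:
    """Hash every file under a build output directory (hypothetical layout)."""
    hashes = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            hashes[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes

# Two independent builds of the same source must produce identical artifacts
# for the "rebuild and get the same software" promise to hold.
first = digest_tree("build-run-1/output")
second = digest_tree("build-run-2/output")
mismatches = sorted(f for f in first.keys() | second.keys()
                    if first.get(f) != second.get(f))
assert not mismatches, f"build is not reproducible: {mismatches}"
print("rebuild produced bit-identical artifacts")
```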
Moving forward from the Trustable work, Codethink has undertaken a series of projects in this area, in effect figuring out the method: figuring out how we can approach the problem of making promises around software. In the early part of 2021, working with one of our OEM customers, we settled on a safety concept that broadly requires a systematic method to be applied. The systematic method broadly says that we need to do risk analysis, to be sure that we know what the problems are that we're trying to address, and we need to provide tests to show that we address those problems. In the context of safety specifically, this means that we need to establish our safety goals, establish our safety constraints, and provide tests to show that the constraints are satisfied.

But then, because as I've said software is trivially breakable, that's not enough: we won't be able to guarantee that, just because our tests pass, the target software will behave the way we're expecting. So we go to the other side of the argument and ask what would happen if the constraints were not satisfied. We can easily achieve that with software: we have the opportunity to do fault injection by changing the code. What this gives us is the ability to explore what would go wrong if a constraint is defeated, and this leads to the ability to say, systematically, for the enclosing design: yes, here we have some software which we're confident behaves in a certain way, but here also is evidence that we can break it, and that can be used to test that the mitigations around the safety design actually work when the software misbehaves. We then take that approach and wrap it into a continuous integration framework, so we automate the collection of evidence for all of this: the design process which leads to the risk analysis outputs, the fault injection, and all of the tests and all of the test results. We in effect end up with a framework which allows us to apply that method systematically, on an ongoing basis, as a project is developed.

We then applied that method to a real scenario. We looked at the actual process of constructing this kind of software: we imagined a payload of an autonomous vehicle trajectory calculation program, and we used our methods to verify that we could achieve a deterministic construction approach that would give us sufficient confidence in the way of working to be sure that the construction process itself was not going to pollute our output, in effect decoupling the environment factors and the software construction from the target payload, even in the case of a safety-critical payload such as an autonomous trajectory. We were delighted to achieve an ASIL D tool certificate for this, and this, I think, is our first real proof point that the method works, and that the method applies for open source, because pretty much everything we were using in the assessed work was open source: our reference implementation is based on GitLab, and the tools that we're using are open source compilers and so on. There's nothing pre-certified in anything that we did, and we were able to achieve this certificate without the fact that anything was open source having any impact on the result. If you were to go to exida.com, you could look through their certified products list and find the full report that explains how this reference deterministic construction service implementation works, and the kind of analysis that led to this ASIL D certification.
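As a rough illustration of the fault injection side of this argument, here is a minimal sketch: a hypothetical payload, a one-line injected fault, and a hypothetical plausibility monitor standing in for the enclosing safety design. None of these names come from Codethink's actual framework; the point is only that both the passing test and the fault-triggered mitigation become evidence collected in CI.

```python
# Minimal sketch of the fault injection argument: the same constraint test
# must pass on the real payload, and the mitigation must fire on a broken one.

def trajectory(speed: float) -> float:
    """Stand-in for the real payload: output must stay within plausible bounds."""
    return speed * 1.5

def trajectory_fault_injected(speed: float) -> float:
    """One line changed, as per the talk: any claim about the payload is broken."""
    return -1.0  # injected fault

def mitigation_rejects(payload) -> bool:
    """Stand-in for the enclosing safety design, e.g. a plausibility monitor."""
    result = payload(50.0)
    return not (0.0 < result < 200.0)  # True means the monitor rejects the output

# Evidence for the safety argument, gathered automatically on every change:
assert not mitigation_rejects(trajectory)             # constraint satisfied normally
assert mitigation_rejects(trajectory_fault_injected)  # mitigation fires under fault
print("constraint test passes, and the mitigation catches the injected fault")
```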
So the key point, from Codethink's perspective, is that the certificate is a proof point on the journey. It gives us a basis for constructing the kind of software we care about: Linux-scale software that could be used in a safety-critical environment. And that's where we now step on to the current work, which in effect is building on that basis, that framework and method, to assess an actual system with actual software.

We've chosen a specific application, and this again is based on real feedback from customers. The suggestion that we hear, in general, is that organizations are not particularly considering the idea of putting Linux in the braking system. What they're considering is that they're already going to have Linux on significant machines in the vehicle, and the question is: could they safely provide some safety functionality on those machines, in a way which is not compromised by the rest of the workloads? So if the Linux system were doing infotainment, or some other user-facing functionality, would it be possible for the same system running Linux to also contribute to the safety architecture and safety protections of the vehicle?

For our analysis we've settled on the architecture you can see here, which is basically a Linux-based operating system acting as host directly on the hardware; we chose the Raspberry Pi 4 for our demonstrations. We are running AGL as an example infotainment system in a container, so that it is isolated from the host via Linux containers. And we have a safety application, which is this rear-facing camera application: the idea being that when the vehicle is put into reverse, the camera needs to get its image onto the screen pretty much immediately, so that the user is clear if there's anything behind the vehicle; the camera image should reflect that. In our demonstration architecture we have an actual rear-facing camera, we have an actual display, and then we have some warning lights, which in reality would become some kind of warning indicator on the console of the vehicle itself.

The way we apply our process, this is the RAFIA method that I've mentioned, is that we begin with an STPA procedure: Systems-Theoretic Process Analysis. This is a method established by Nancy Leveson at MIT some years ago; it's now, I think, increasingly established, and maybe the de facto standard for system-level thinking around safety these days. It gives us a top-down approach to consider what could go wrong in these kinds of complex systems, and it's systematic: it can be applied by software engineers, and we are increasingly finding ways to improve the method so that software engineers feel comfortable that they understand why they're doing it, and so that it contributes to the software engineering result.

In STPA, everything is treated as a set of controllers, in effect, and in these diagrams the control signal tends to go downwards: in this case the user, for example, sends a control signal to the reverse gearbox, which in our demo means pressing a button. Feedback about what's actually happening is sent upwards in these diagrams, and the diagrams form the basis for analysis of what could happen when the control structures that we believe are in place break down, when things go wrong. So, again systematically, we analyze for each controller: what could go wrong when the control signal is sent; what could go wrong when the control signal is not sent for some reason; and what could go wrong when the control signal is sent too early, or too late, or for too long a duration.
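As a sketch of how systematic that enumeration is, the snippet below generates the standard unsafe-control-action variants for each control action. The control actions listed are just the ones named in this talk, and the wording of the variants is my own shorthand for the STPA categories, not Codethink's actual analysis artifacts.

```python
# Sketch: enumerate the standard STPA unsafe control action (UCA) variants
# for each control action in the control structure of the demo.

CONTROL_ACTIONS = [
    # (controller, controlled process, control action)
    ("user", "reverse gearbox", "select reverse gear"),
    ("safety application", "compositor", "overlay camera feed"),
]

UCA_VARIANTS = [
    "causes a hazard when provided",
    "causes a hazard when not provided",
    "causes a hazard when provided too early or too late",
    "causes a hazard when applied too long or stopped too soon",
]

for controller, process, action in CONTROL_ACTIONS:
    for variant in UCA_VARIANTS:
        print(f"UCA: '{action}' ({controller} -> {process}) {variant}")
```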
Applying STPA, and the method that we've evolved around it, leads us to a set of constraints for the system, which in effect are safety requirements. From the constraints we're able to identify responsibilities for the controllers, as to what they must do in order to satisfy the constraints, and we end up with loss scenarios and unsafe control actions that state, in effect, what can go wrong. We use those as the basis for our tests.

Here is a more detailed breakdown. This goes more into the understanding of what the software parts are doing, and it's the same kind of topology: control signals going down, feedback coming up. As part of this analysis we quickly conclude that the critical component, the one that's going to make the most difference to whether the safety behaviors are followed or not, is how the compositing happens: how the interaction between AGL driving the screen and our safety application trying to drive the screen directly with the camera feed is going to work.

So let me show you our actual demo of the system in place. We're quite pleased with this: we've managed to get AGL containerized on top of a Linux-based host, with full hardware graphics support, and with demonstrable control over the screen, so that we can overlay the rear-facing camera without affecting the AGL behavior. Here we have AGL, and on the bottom right you can see the dev board. AGL is running; the breadboard has a couple of warning lights on it, and a button, which we're now going to press. The button simulates the gear stick, and as you can see, the camera feed immediately overlays and is live. Now we simulate the camera going offline for some reason: you can see that the warning lights come on, and once the camera is reconnected, the warning lights go off and the camera feed returns, and then we can switch back. So that's the basic proof of the architecture.

Now, interestingly, as we did this design using STPA, it caused us to evolve to an improved solution. This is the original design, where we started assessing how the container is going to interact with the compositor, and one of the example hazards is this question: what could go wrong if some application that we're not in control of on AGL attempts to work with the compositor directly? Could that interfere with our safety application? What happens if, in effect, something in the container just grabs focus? Because conceivably, in this architecture, that would be feasible. Following through our analysis, we identify this as a use case. Our way of tracking requirements is based on YAML, and this gives us the ability to run scripts to check that everything ties up, that we have requirements mapped to tests and so on, and it gives us the basis for the formal documentation which is ultimately required by the standards: to show that you understand the requirements, you understand how the requirements are satisfied, and you can provide evidence of that. In this case, we trace from the unsafe control action to the loss scenarios, and then from the loss scenarios to the actual constraints that need to be applied. The key constraint that arises out of the use case I've just described is that we really can't afford to get into a situation where a rogue application is taking control of the compositor. And this leads to a better design, where what we now have is a nested compositor: AGL talks to the nested compositor, this Wayland proxy, and only from there does anything that is in that container get access to the real compositor.
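To give a flavour of what a YAML-based traceability check might look like, here is a minimal sketch. The schema, constraints carrying a list of test references, is an assumption for illustration, not Codethink's actual format, and the second constraint is deliberately left untraced so the check fails visibly.

```python
import yaml  # PyYAML

# Hypothetical requirements file: every safety constraint must trace to at
# least one test, so the formal documentation can show evidence of coverage.
REQUIREMENTS = yaml.safe_load("""
constraints:
  - id: SC-001
    text: A rogue application must not take control of the compositor.
    tests: [test_focus_grab_rejected, test_overlay_priority]
  - id: SC-002
    text: Camera loss while reversing must switch the warning lights on.
    tests: []
""")

# The kind of script the talk describes: check that everything ties up.
untraced = [c["id"] for c in REQUIREMENTS["constraints"] if not c["tests"]]
if untraced:
    raise SystemExit(f"constraints with no tests mapped: {untraced}")
print("every constraint traces to at least one test")
```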
We control the priority of that access, so now the camera application clearly has a very different ability to influence the outcome versus the AGL container. This is just a demonstrably better design, and it results from the STPA work that we did. So that's approximately where we've got to in terms of our current ongoing work around certifying Linux.

I'll just finish with a summary slide. We've been working actively on this topic now for something like five years. I think we've shown that the idea of fault injection is critical, and that it does give us an actual path to certification; we've proven that, and the further discussions with Exida and with customers give us confidence that it is a widely applicable approach. It gives us the extra rigour that's required to compensate for the uncertainties in this kind of complex software. We've established that STPA does provide us with a good framework for reasoning about the safety concerns, and we've been productive in figuring out how we can make the general safety concepts, as described in STPA, relevant for software, so that software engineers don't drown in a sea of uncertainty: they're able to think about the specifics of software behaviour and software tests, which is a major benefit. To be able to get the safety reasoning done by people who are expert in software is where we need to be, really.

There's still a lot of work to do. We're actively looking at integrating into a wider range of test environments, because ultimately this still does come down to how confident you are that your tests are representative, that you've covered all of the corner cases, and that leads ultimately to a need for a significant amount of testing. We have a new initiative starting with Arm, which is quite interesting: they've been working towards using a software library approach to provide extra assurance on hardware where perhaps the traditional lockstep method isn't achievable, and we're interested to see what that brings into the kind of architecture we've described. And over the coming year we expect to make significant further progress on this basis of continuous compliance, which is to say that, whatever promise we make for software, as projects continue the software keeps changing, so we need to revalidate our analysis and revalidate our evidence on an ongoing basis. It's simply not cost effective for that to require a huge amount of manual effort each time, so we have a strong interest in ensuring that the whole process of gathering the evidence, and verifying that the evidence supports the assumptions and the claims we're making, is as automatic as possible; that's a clear win in both time and money for a customer.

So thank you very much. I hope the talk has made sense, and I look forward to answering your questions.