Hi, welcome. My name is Chris Temple, and this is the session on deploying Linux in safety-critical applications: three key challenges. I am a Lead Safety and Reliability System Architect at Arm, working out of the Arm office in Germany. The work I'm presenting was conducted in the context of the open source project ELISA, which is about enabling Linux in safety applications. It is not about providing a safety-conformant kernel, and I'll explain on the next slide why I'm emphasizing this. This presentation will lead you through three key challenges, and solution ideas, for enabling Linux in safety applications. So what is the difference between safety and conformance? Safety is about the absence of unreasonable risk; it's a property of the system that can be experienced. Conformance is defined as to be or act in accord with a set of standards, expectations or specifications. You often hear safety and conformance used in the same sentence. The reason is that there are a lot of safety standards, which are actually safety integrity standards, and there is a strong desire to conform to those safety standards. But as I've illustrated in the figure, conformance does not automatically imply safety. You can have systems that are not conformant to a safety standard and are not safe. You can have systems that conform to a safety standard but still aren't safe. In the ideal case you have systems that are safe and conform to safety standards; that's clearly the target area one wants to be in. And there are also cases where systems are safe but in some areas don't quite conform to the standards, and the safety standards actually take this into account by using the language of "highly recommended" and "recommended", rather than "mandatory", for the requirements they set forward.
So that allows some areas of grey, because in some cases there is good reason not to conform to a specific clause of the safety standard while still ensuring safety. Safety is the objective, safety engineering is the approach, and conformance is an additional constraint that one wants to achieve in the quest for safety.

Now, I already used the term safety integrity with respect to the standards, so how do safety and safety integrity go hand in hand? Sufficient control of dangerous failure modes is achieved through safety claims. A safety claim states a specific safety capability of the system, or of one of the system elements, and is expressed in the form of safety requirements, which clearly depend on the context. The typical example one frequently reads or hears about is an airbag system. The safety claim for an airbag system is two-fold. On the one hand it should deploy in the event of an accident; that's the purpose of the airbag. But there is another safety claim, which almost takes more engineering effort to fulfil, and that is that the airbag should not deploy inadvertently when you are not in an accident. The reason is quite straightforward: if the airbag deploys and flies into your face, that will most likely cause you to have an accident, because it might knock you unconscious or blind you. It's a pretty massive experience having an airbag deploy, and not something you want to experience while driving on the motorway. Yet you sit the whole time behind the airbag, the squib is there, so the potential for the airbag deploying is always present, and you don't want it to happen outside of a crash.

So the key question, when you start engineering solutions for your safety claims, is how much effort you put in, and this is where safety integrity comes into play. The idea is that if the probability of a hazardous event, the controllability of the hazardous event by a human, or the severity of the consequences is very low, then clearly it makes sense to put less effort into substantiating the safety claim than if those are very high. If you are very exposed to the risk, if the severity of the injuries you might experience is fatal, and if the controllability is very low, so there is not much you can do as a driver or as a human in the surroundings of the car, then clearly you have to apply a much higher level of rigour. This is expressed through safety integrity, the degree of rigour that is taken to substantiate a specific safety claim, and it is stated in terms of safety integrity levels. These are the famous ASILs: ISO 26262 uses A, B, C and D, with A being the lowest and D denoting the highest level of safety integrity. The safety standards, which really could be seen as safety integrity standards, describe all the things you need to do to substantiate safety integrity for specific safety claims, and in the system you build you have to ensure that the safety capability is met with the stated degree of rigour. Lastly, the idea of safety engineering is to provide a safety case: a structured argument, supported by evidence, to justify that a system or a safety element fulfils the safety claim with the stated safety integrity. So this is a very high-level picture explaining how these different aspects fit together.

Apart from expressing the desire for a safety integrity level, a key question clearly is what safety claims one needs to fulfil, and in the context of the ELISA project that has formed one of the challenges. We pulled together a simple model to explain how one can obtain the necessary safety claims without having to dig into all possible applications, because an operating system is something quite generic: almost any digital system you build will have some kind of operating system in it, and having to analyse every such system context to derive requirements for the operating system is tedious. So the idea is to use a simple model to start identifying some of the key requirements that need to be substantiated, so that the ELISA project can look at how to enable the use of Linux in safety applications.

Here is the simple model. There are two applications, A1 and A2, and each application has an application context. In very general, hand-waving terms, the applications produce output based on input and their internal context, within some deadline. I think that's a fairly reasonable model for a control system; it's not outrageously complex, and that's the whole idea of starting with something simple. Then we have an operating system, which also has a context. In this very simple model, the operating system provides services to facilitate timely progress (everything to do with scheduling, maintaining system timers, and so on), it provides services for maintaining the context of the applications, and it manages the underlying hardware. The third level is the hardware, with a hardware context. It also provides services to facilitate timely progress, which is clearly the hardware implementation of, for example, timers, interrupts and other hardware facilities, and it provides services for maintaining context, which is the instruction set the underlying processor provides, among other services.

The first thing one can look at, because I picked the simple stuff first to get it off the table, is timing. We can look at the list we just put together and say: the operating system provides services to facilitate timely progress, the hardware does the same thing; what happens if those go bad? Basically there are two established techniques for dealing with that. In software, deadline monitoring is typically applied: if you can see that the system misses its scheduled deadlines, you try to take some corrective action. This is a gigantic field, because it depends on the scheduling requirements and on your application requirements, but there are established solutions, established puzzle pieces, through which one can create a mitigation story for these timing problems. On the hardware side it starts with clock and timing monitoring circuits: very simply, either you have a second oscillator and you cross-check the clock, or you have an RC-based frequency monitoring circuit. This is not a complete list, just an example, but there is a ton of prior art to read, and a lot of solutions of this kind are already available and implemented. So this is challenging, but not super novel.

On the hardware context side, again, what happens, and how can you deal with those problems? One area being used, again as an example, is privilege-based security architectures. More and more hardware processors support privilege-based systems, and this is across the industry: Arm has very sophisticated support for security, as do x86 and other processor architectures. The security problem already brings a lot of mechanisms and techniques to isolate the applications from critical hardware accesses, but also to isolate the applications from one another, because you don't want one application snooping around in the data space of another application from a security perspective, and in the same way you don't want one application inadvertently modifying the context of another application from a safety perspective.
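As a concrete illustration of the software deadline-monitoring idea mentioned in the timing discussion above, here is a minimal sketch in C. The names (`deadline_monitor`, `dm_start`, `dm_complete`) and the caller-supplied millisecond timestamps are my own assumptions for illustration; a real monitor would hook into the scheduler or a watchdog and be considerably more elaborate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical deadline monitor for one periodic task. Timestamps are
 * supplied by the caller (e.g. a tick counter), which keeps the sketch
 * portable and testable. */
typedef struct {
    uint32_t deadline_ms;   /* allowed time from start to completion */
    uint32_t started_at;    /* timestamp of the current activation   */
    uint32_t miss_count;    /* how often the deadline was violated   */
} deadline_monitor;

/* Record the activation time of the supervised task. */
void dm_start(deadline_monitor *m, uint32_t now_ms) {
    m->started_at = now_ms;
}

/* Returns true if the task met its deadline. On a miss the event is
 * recorded so the caller can take corrective action (degraded mode,
 * reset, transition to a safe state). */
bool dm_complete(deadline_monitor *m, uint32_t now_ms) {
    if (now_ms - m->started_at > m->deadline_ms) {
        m->miss_count++;
        return false;
    }
    return true;
}
```

A supervisor would call `dm_start` at each activation and `dm_complete` at each completion, and trigger the corrective action whenever `dm_complete` returns false.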
In addition, which is quite nice, there is a bigger and bigger number of products emerging that have safety built into the hardware architecture. That includes things like split-lock cores to detect random hardware faults, safety islands to perform platform safety management, internal SoC data paths covered with parity bits and error-correcting codes, additional bespoke error detection mechanisms for specific dedicated peripherals, software test capabilities and hardware scan diagnostics, and memory built-in self-test. So there are extensive feature sets available, and more and more semiconductor manufacturers are actually certifying their products, and the mechanisms they provide to ensure safety at the hardware level, to certain safety integrity levels.

So now we cross off those problems and say: that's again challenging, but there are solutions out there, puzzle pieces are available. Now we look at the application context, and there are really two challenges. The first is that we need measures ensuring an application cannot corrupt another application's context, the context of the operating system, or the context of the hardware in an undetected dangerous way. The second challenge is that the OS itself shall not corrupt the context of the applications, the context of the OS, or the context of the hardware, again in an undetected dangerous way. You can already see that the language is chosen intentionally, and a little carefully, because these are starting to be quite challenging problems and you need to be certain you're not setting the bar too high. Saying "in an undetected dangerous way" means that if corruption can occur but is detectable, or is not dangerous in the sense that it does not violate the safety claim, then the corruption might be okay to live with. Otherwise you start running the risk of intensely over-engineering the system, which is to nobody's benefit.

So what solutions do we have at hand for these problems? For the measures that ensure the application itself cannot be the source of corruption of other contexts in the system, there are techniques available to address that. As I said, the privilege-based security architectures, if you add sufficient safety integrity to those mechanisms, give you very powerful techniques for this problem. There are very sophisticated virtualization techniques available these days, supported by the underlying hardware architecture, that give additional ways of establishing containers and other means to mitigate this cross-interference. And it's possible to exploit specific application properties; I'll get to that point further on in the presentation to explain how that works.

So we're left with this question: the OS itself shall not corrupt the context of the application, of itself, or of the hardware in an undetected dangerous way. How can that be achieved? On the next slide I present some selected solution knobs. This is by no means complete, there are tons of ideas emerging in the ELISA project, but these are, I think, some of the main knobs available to turn to home in on this problem. On the left side I simply repeat the problem we're looking at: what can we do to ensure that the OS itself doesn't become the source of context corruption?

A very simple solution would be the choice of operating system. There are operating systems built on microkernels and separation kernels; microkernels go back all the way to the 1970s, and separation kernels go back to 1981 and the work that John Rushby did. The benefit is that the kernel is built around the idea of having hardware, firmware and software mechanisms to establish, isolate and separate multiple execution partitions, and to control the information flow between the subjects (that's the term used in the NSA definition) and the exported resources allocated to those partitions. The kernel is really built around this whole idea of separation, and there is literature out there to read through on the mechanisms. The con is that a performance impact emerges; under certain constraints and assumptions that impact can be reduced, but it is still there. And it is not off-the-shelf Linux, which is the target we have set ourselves for the ELISA project: we want to see how we can enable Linux in safety-critical applications. Examples, if anyone is interested, are QNX or VxWorks for microkernels, and INTEGRITY or LynxSecure for separation kernels, to name a few.

So we say, okay, we want to use Linux; an obvious approach is to harden the operating system. What does that take? One key technique in hardening is scope reduction and feature stripping: everything that isn't there and isn't used is something that can't go wrong and can't be a cause of interference and corruption. Except that a lot of Linux people get quite stressed when you ask them what can be done away with, because all the features in Linux are there for a good reason, and the more you strip away, the less it still resembles the original intention of Linux with all the good features that have been built in. If you strip it down radically, take out the scheduler, take out tons of features, reduce the device drivers, eventually you end up with something that starts to look close to a microkernel, but then it's not Linux anymore. Then, as we just said on the application side, using hardware security and virtualization features is a good technique; there are limits in Linux to how far you can take those features into the kernel, with the exception of features that the hardware provides transparently to the operating system. And there are techniques of hardening the OS using inline checking and context signing; that's also an established technique to secure a context. The problem with inline checking and context signing as pure software approaches is that they impose a significant overhead, because really every piece of context information, every time you use it, would need to be checked for corruption, or at least you need a good argument that if you use it at a certain point and it is corrupted, you will eventually uncover the problem downstream, so it won't be an undetected dangerous violation of your safety claim. And those arguments are very sophisticated to lead. So the con of hardening is the overhead and performance impact of monitoring and context signing; it can be taken to the point where the performance of the operating system is wiped away by all the additional checks added to the code. So, and this is important, hardening cannot address the whole problem, and we can't work with black-and-white arguments; we're going to have to put a puzzle together. Hardening contains important puzzle pieces, and the key is to find the right spots in the kernel where these arguments help, without overdoing it and creating too many disadvantages.

The next knob one can apply is what I call correct-by-construction testing. The argument is that you say: we have tested a certain piece of code so extensively that we have huge faith in the integrity of that code. Software engineering has advanced to the point that in some cases these arguments are used, successfully, in safety systems that are in the field today; it's always a matter of the complexity you're looking at. The big benefit is that there is low or no runtime performance impact, because you're saying: I can test the code to this level of perfection before I deploy it.
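Stepping back for a moment to the inline checking and context signing mentioned above, the basic mechanism might look as follows in its simplest form. The `task_context` layout and the toy mixing function are purely illustrative assumptions; a real kernel would use a proper CRC or cryptographic hash and would protect far more state, which is exactly where the overhead discussed above comes from.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "context signing": a context record carries a signature
 * that is refreshed on every legitimate update and verified before each
 * use, so silent corruption becomes detectable. */
typedef struct {
    uint32_t stack_ptr;
    uint32_t entry_point;
    uint32_t privilege;
    uint32_t signature;     /* derived from the fields above */
} task_context;

static uint32_t ctx_sign(const task_context *c) {
    uint32_t s = 0x9E3779B9u;                   /* arbitrary seed      */
    s ^= c->stack_ptr;   s = (s << 7) | (s >> 25);
    s ^= c->entry_point; s = (s << 7) | (s >> 25);
    s ^= c->privilege;                           /* toy mixing function */
    return s;
}

/* Every legitimate write goes through here, so the record is re-signed. */
void ctx_update(task_context *c, uint32_t sp, uint32_t ep, uint32_t priv) {
    c->stack_ptr   = sp;
    c->entry_point = ep;
    c->privilege   = priv;
    c->signature   = ctx_sign(c);
}

/* Called before every use of the context; detects silent corruption. */
bool ctx_valid(const task_context *c) {
    return ctx_sign(c) == c->signature;
}
```

The cost is visible even in this sketch: every read path has to pay for `ctx_valid`, and every write path for re-signing.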
There are lots of new methods for testing software, and the good thing is that the open source community is really very active in adopting these methods for Linux, and also in developing and driving methods for testing, so this is a really interesting aspect. On the con side, correct-by-construction testing requires some prerequisites. The ISO standard itself highly recommends, for these kinds of arguments, that you have one entry and one exit point, restricted use of pointers, no unconditional jumps, and so on, because otherwise you're just facing a state-space explosion: when everything can continuously be interrupted by something else and all the cogs are spinning and turning simultaneously, the test space gets immensely big. I think this has a lot of value and forms a really important puzzle piece. Personally, I'm always a little doubtful whether it can be the sole argumentation, and it reminds me of the shortest joke among computer programmers. In a room I'd ask if everyone knows it; here I'll just share it with you. It's two words: "last bug".

So we've looked at testing and at using runtime features, and here is a really interesting knob, another way of achieving the goal of enabling Linux in safety applications, and that is exploiting application properties. This is largely how it is done today in many systems that use Linux. A lot of people say: why are you even struggling in the ELISA project, I already heard about a system that uses Linux. But the key question is how much of the safety argumentation of that application actually pivots on safety claims made by the operating system, and how much is done by putting the safety mitigation around the operating system, or into external hardware, without relying on the operating system. What follows is a crude example, simply to show that there is a complete rainbow of options for how one can conceive applications.
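The structural prerequisites mentioned earlier for correct-by-construction arguments (one entry and one exit point, no unconditional jumps) can be illustrated with a deliberately trivial sketch; the function and its temperature limits are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustration of the "one entry, one exit" guideline with a made-up
 * range check: every path funnels through a single return statement,
 * there are no early returns and no gotos, which keeps the set of
 * paths small and enumerable for exhaustive testing. */
bool in_operating_range(int32_t temp_c) {
    bool ok = false;                 /* single result variable      */
    if (temp_c >= -40 && temp_c <= 125) {
        ok = true;
    }
    return ok;                       /* the one and only exit point */
}
```

With only two paths and one exit, the state space stays small enough that a testing argument can plausibly claim completeness, which is the whole point of the guideline.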
Purely for reasons of simplicity I've separated them into a class L and a class H. Class L I've crudely labelled low safety complexity; this is not scientific, so take it with a grain of salt, it's meant to put two stakes in the ground rather than to be an exact science. Typically, in the low-complexity safety systems, transient faults are not that critical, and I'll give an example of such a system in a second: those systems have a strong low-pass characteristic, so a transient fault just gets wiped out by the low-pass characteristic of the application. Permanent faults are critical. The fault tolerance time, compared to the execution speed of the system, is long, so you have quite a long time in which a fault can be present in the system before something dangerous happens. Very often you will see a human in the loop, which is very convenient, because the human can do some additional plausibility checking and intervene, and since the fault tolerance time is long, there is still time for the human to actually do something. In many cases, end-to-end plausibility checking at the application level is feasible. You don't so often see mixed-criticality arguments, where the system has to achieve multiple safety claims that are independent from one another and where applications of different criticality have to be integrated. And usually, again just as a stake in the ground, you're looking at safety integrity levels up to ASIL A and B.

A very typical use case, one that's also being considered in the ELISA project, is the IVI telltale. You can imagine the digital dashboard; the critical failure mode is if a warning telltale, which is trying to inform the driver that something critical has occurred and that they need to take action, doesn't appear on the TFT display. Those systems usually use some kind of feedback loop, for example by taking the bitmap from the TFT, feeding it through an image detection algorithm, and actually checking whether the telltale is visible. If it isn't visible, the system tries to alert the driver in a different way: by using audio, by turning off the dashboard, by switching to an emergency dashboard; there is a whole bunch of techniques established in the industry. That's the low end of the complexity spectrum, and as you can see it fulfils the criteria of class L.

On the other side of the spectrum you have class H, high safety complexity. In those systems transient faults are critical, so you don't really have a low-pass characteristic to use in your safety argumentation. Permanent faults are critical. The fault tolerance time is short; there are only milliseconds left to react, so it's not really possible to engage the driver or a human in the loop. End-to-end plausibility checking at the application level is not that feasible, and very often you're looking at mixed-criticality systems going all the way up to ASIL D. That is what I depicted on the right, which would be autonomous driving systems. And then there is a whole bunch of systems in between. Gateway systems, for example, are usually still considered more towards class H, because there's no human and no low-pass characteristic in the loop. And then there is E-Gas; I'm not sure to what extent this is known, I presented it last year. It is basically the safety architecture that has been developed and is in use across most cars for electronic throttle control. The accelerator pedal these days just goes to a sensor, and the signal is sent to the engine control unit to control the speed of the car. That is done using a system called E-Gas, which I would believe sits higher than the telltale use case, but still more towards the lower end of the safety complexity ballpark. The good thing is that this is a very powerful knob; the challenge is that it introduces this application property dependency.
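The telltale feedback loop described above can be sketched roughly as follows. The pixel format, the colour constant and the threshold are assumptions made up for illustration; a real system would run an image-detection algorithm on the bitmap read back from the TFT rather than a raw pixel count.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical telltale feedback check: after rendering, read back the
 * framebuffer region where the warning telltale must appear and verify
 * that enough pixels carry the telltale colour. */
#define TELLTALE_COLOUR 0xFFBF0000u   /* assumed ARGB warning red */

bool telltale_visible(const uint32_t *region, size_t n_pixels,
                      size_t min_lit_pixels) {
    size_t lit = 0;
    for (size_t i = 0; i < n_pixels; i++) {
        if (region[i] == TELLTALE_COLOUR) {
            lit++;
        }
    }
    return lit >= min_lit_pixels;
}
```

If the check fails, the system escalates through an independent channel, as described above: an audio warning, turning off the dashboard, or switching to the emergency dashboard.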
So what we're trying to do in the ELISA project is engage with OEMs and get them to provide us with the most interesting use cases they have. If we do want to resort to application properties, to this technique, we focus on the use cases that are of the biggest interest to the OEMs in the ELISA project, rather than trying to solve a problem that has no relevance. I've now spoken all about automotive, but the industrial team in the ELISA project is doing the same thing: they've chosen specific use cases, and that allows them to reach for this knob and twist it, and say: okay, we're limiting our enablement of Linux to this particular use case. The big risk is just that someone enables Linux for use in one class, and someone else isn't aware of this class dependency and says: hooray, the problem has been solved, and because Linux is running in this kind of system, I can take it, put it into a different kind of system, and reuse the arguments. I hope this slide shows that that is not necessarily the case: if you do transfer arguments, you have to be really careful that the constraints under which those arguments were made still remain valid.

The last challenge to discuss is the integration challenge. Suppose now we have Linux, we have put all these arguments in place, we've hardened it, we've used application-dependent aspects to mitigate and address some of the critical faults, and now we need to put it all together. The way this is typically done is that the different constituents for integration (the operating system that has to be integrated with hardware, with drivers, with all kinds of bits and pieces, libraries, you name it) are developed and provided as a safety element out of context. A safety element out of context is a safety element, so it fulfils some safety claims with a stated safety integrity, but it has been developed using a context assumed by the supplier: the supplier says, I'm assuming this is, for example, going to be used in an airbag, and uses these application constraint techniques to argue the safety capability of that element. So those elements come with integration requirements, and they are really critical. Typically the integration requirements are stated in the safety manual, which contains the assumptions the supplier has made about the context, and the integration requirements, usually stated as assumptions of use, which are safety requirements allocated to the integrator.

In practice this has turned out to be quite challenging, because the person integrating all these different elements faces a demanding task. The integrator needs to ensure that the safety claims are sufficient, that the assumptions made by the supplier are valid, and that the assumptions of use the supplier has expressed for the safety element are addressed with sufficient integrity. And, something I think software programmers and kernel programmers will understand immediately, when you start integrating things, sometimes properties disappear; so you have to make sure that the safety properties the element had before integration are still there, and you have to ensure that the safety claims that don't emerge until integration is achieved actually do emerge as desired. And lastly, which is really ugly, you have to ensure that no new critical failure modes have emerged. This integration challenge also puts a limit on the level of creativity you can apply: if your enablement of Linux in a safety application hinges on 100 assumptions and 2000 integration requirements, it's going to be almost impossible for anyone to take that operating system, integrate it in an intelligent way, and still have confidence at the end of the day that safety is maintained. So that automatically puts a bound on the extent and creativity of the arguments; it forces everyone back into a world of reason and sanity, saying it has to be a reasonably simple solution to be able to demonstrate and argue safety at the end. And typically, in an ECU, when we did this for a particular autonomous driving system, within no time, within a couple of hours, we suddenly found a hundred safety elements: different software drivers, different hardware parts, the power management unit from the hardware that needs some periodic engagement with software, and so on and so forth. It was quite amazing how complex this problem became. I presented a paper at the European Dependable Computing Conference about a month ago where I went into detail on this problem and discussed what options exist to solve it.

So the conclusion of this presentation: the safety argumentation that we pull together in the ELISA project will require making trade-offs, and it requires collecting multiple puzzle pieces and combining them in a smart way. This, I think, is the challenge for the project. It would be much easier if we had make-or-break arguments, because then you could throw away all the weak ones and be left with the one killer argument; but that is very unlikely to emerge. It's more that there are a lot of arguments, each one carrying a small weight of the overall argument, and it's the combination of those arguments that actually enables the use of Linux in a safety application. So all the pieces are needed, and someone at the end of the day will have to understand all the pieces to gain confidence that their combination delivers a sufficient argumentation. The ELISA project is now working on solving this puzzle, and looking for new pieces that could be used to argue and justify safety. That concludes my presentation, and thank you for listening.