So welcome, everyone, to my talk on safety-critical Linux, or, to put it a bit more catchy: is the Linux kernel development good enough to make your life depend on it? This is a progress report on the procedures and methods to qualify the Linux kernel development process.

Before we go into the topic, I just want to know what the background in this audience is, so just a few questions. Raise your hand if you have some background in kernel development. Kind of expected: everyone does. Now raise your hand if you have some background in functional safety. Okay, a number of people as well. So who has a background in the intersection, functional safety and kernel development? Yes, so you are the people that we're going to hire to get this done, right? Darren in the first row, he knows that it's going to be a difficult task.

So let's get started on this question. The motivation for it comes from the company I'm working for, BMW Car IT, which is the software company affiliated with the car manufacturer BMW. As you know, the automotive manufacturers are currently moving towards autonomous driving, and you immediately come up with the question: how are you going to build a system for autonomous driving in software? You'll figure out, I need an operating system for that. Can we choose Linux to build this for autonomous driving? We asked this question a couple of years ago, and this led to the OSADL SIL2LinuxMP project. Its mission is to provide the procedures and methods to qualify Linux on a multi-core embedded platform for safety integrity level 2 according to IEC 61508 Edition 2. If you don't know what safety integrity level 2 is, or IEC 61508, don't worry.
We're going to go into that later on. We want to show that these procedures and methods are not just some kind of toy things that don't work. We want to show that it is feasible to apply them in a real-world system, and we want to show that there is some potential for reusing the Linux kernel analysis that we're doing in later products as well.

This is a collaborative project of industrial research partners. Its full members are BMW Car IT, Intel, A&R Tech, KUKA and Sensor-Technik Wiedemann, and we have a number of reviewing members as well. We're supported by people from academia, and we have experts from certification bodies that help us interpret the safety standards. Last and most importantly, we have the core working team: Nicholas Mc Guire, Andreas Platschek, Lucas Böhme and Markus Kreidel, who are working full-time on this question.

To guide you through this talk: we're going to have three sections. First, I'm going to set the scene: what are we talking about, and what do you have to know about functional safety and the system that we're assuming? Then we're going to go into a part where we talk about how you start this safety qualification considering Linux: what are the questions that you have to consider, and what are potential methods and answers that you can give? The third part is going to be about assessing the Linux kernel development quality, so finding out: is Linux kernel development good, what are the gaps, and how can we improve that? The part on Linux safety qualification is interesting for people who are building products, who are going to build a system using Linux. The part on development quality is interesting for kernel developers, to know what kind of things we observe and what kind of things we measure to claim that you're doing a good job.
So let's start with functional safety. As I've seen, probably more than half of the audience actually has no background in functional safety. If you don't know anything about this topic, you just start and look at Wikipedia, and then you can start reading, and you see: functional safety is the part of the overall safety of a system that depends on the system operating correctly in response to its inputs, including the safe management of likely operator errors, hardware failures and environmental changes. And the objective of functional safety is freedom from unacceptable risk of physical injury or of damage to the health of people, either directly or indirectly.

So what does that mean? Put very simply: the system should work correctly, and it should work correctly even if the person using it does something wrong, if some hardware failure occurs, or if you're using it in a slightly different environment than you intended to use it in. And what you have to do is make sure that the system works correctly and doesn't hurt or kill anyone.

So the work on functional safety is not traditional software development. It's actually a risk management activity. This means that you are going to have a software development process, and your task is to find out how to set up the quality assurance in the right aspects and in the right parts of your development.

The definition said we have to show freedom from unacceptable risk. So who's going to decide which risk is acceptable and which one is not? Is it okay if a train crashes every hundred hours? Is it okay if a flight crashes every thousand hours? I say, well, yeah, I can take that risk, and you're going to say, well, I disagree, this is unacceptable. You'll find out:
We have to come to a consensus among a larger group. Actually, you have to come to a consensus among the whole society. To do that, the process in the past was as follows: there was an agreement on a global safety standard. This is IEC 61508, the functional safety standard that applies to all different kinds of industries and tells you what kind of things you should do to build a safe product.

What you will do is print out this IEC 61508. You have 10,000 pages of nicely written prose that looks like legal text, and you're going to go over it, you're going to sleep with it, and you're going to find out that, if you summarize it in half a slide, it says the following: if I want to design a safe system, I have to do two things. First of all, you do a system design and analyze your system to find out which parts must be of high quality, and which parts must be of higher quality than others in that system. For that you have safety integrity levels, SIL 1 to SIL 4, so SIL 2, as I mentioned before, is kind of a medium risk, medium safety level. And then, to actually achieve that, with a rigorous development process you can build high-quality software. So the safety standard tells you which objectives you want to meet in each development phase to get a high-quality product.

Now, considering the system architecture: now that we have some kind of understanding of functional safety, we have to build our system, and you actually have different system architectures that you can consider. You want to have some high-performance software running,
doing some kind of computation and then invoking something that can possibly kill or harm a person, and you have different kinds of architectures available for that. You can run the high-performance software on commercial off-the-shelf hardware and afterwards have a safety check that just checks that the output is within reasonable ranges and cannot hurt anyone. That safety check is simple, so you implement it and run it on high-integrity, low-performance hardware.

A second possibility is that you use commercial off-the-shelf hardware, employ a hypervisor, and use this hypervisor to isolate a consumer-market OS with some non-safety-critical software from the high-performance safety software. What you're going to find out when you do this design is that the safety software is going to run without an underlying OS. So if you need scheduling, multi-threading or a file system, you're going to implement all this in your safety software.

The third approach that you could take is that you use the Linux kernel to isolate the non-safety-critical software and the high-performance safety software. So we have to show that the Linux kernel provides sufficient isolation, and the safety software can use the scheduling from Linux, the multi-threading, the file system and so on. The main challenge there is of course that parts of the Linux kernel have to be qualified, and that's the challenge that we take on in the SIL2LinuxMP project.

So let's get started on that. If you look at the following notable facts on Linux kernel development, I'm just going to mention them here, right? We have 23 million lines of code. We have 14,000 commits in every release. There are over 17,000 developers in the total history.
You're going to find 1,700 contributors in each release, and of course they work for different kinds of companies or act as individuals. The development process is highly transparent. The process is actually defined by a social contract, not by any kind of legal working contracts. And the stabilizing phase is impressive as well: you're going to see that you have 90 bugs corrected each week, and they're detected not only by running the kernel on devices, but also by someone continuously looking at the code, by certain verification activities that are going on in the community.

So this is nothing surprising to you. You're all kernel developers, you know that. If I show the same slide in front of an audience of functional safety experts, you're going to have half of them standing up, screaming and running out. And they have good reasons for that, right? They never have to deal with a code base that large. They never have to deal with so many commits in such a short time. They have actually seen problems when they had two or three companies working together on one artifact, so they're not going to trust that you can do anything if you have more than 10 companies working together. And of course, they're relying on the fact that there are some kind of working contracts in place, so that you can say, well, someone takes responsibility.

But you don't have to worry about that half, those people that have been running around screaming and have left the building by now. The people you have to worry about are the other half that are still sitting there, and they're going to start asking you really uncomfortable questions, and you have to provide reliable answers to those questions.

So let's do this. How can the Linux kernel cause physical injury or damage to the health of people? That's the starting question, and the answer is easy:
It depends. It depends on so many things: it depends on the environment, the system, the hardware, and the safety application software that you're running. If you now try to answer that question in a generic way, you're going to make a large number of system assumptions that are either completely unrealistic or not even implementable for the system design that you actually want to build. So what you come up with is that you have to understand and use the system context.

This is what we did in our project. We chose a simple example system to understand the activity that we have to do, and this activity then has to be repeated for each and every system. So it's always done for each specific system, and you can't claim, well, Linux was used in this safety-critical system, and hence if I just employ it here, it's going to be safe as well.

One of the problems that you encounter in the assessment of the Linux kernel is that it is pre-existing software. This means the software development is already done, with a fixed process, before you actually build your system. And of course the kernel developers have no understanding of the specific system context that you're considering when you want to build your system.

There's a solution to that, and that is that you split the activity into two steps. One is a system-specific activity, where you determine which functionality of the Linux kernel you actually use and which has to be assessed. That's specific to your system. The other part is that you look at the development process in Linux kernel development: was the Linux kernel development done with sufficient rigor? And if you find any kind of gaps, you close those gaps with further measures. In the case of an operating system, we have to consider that it has a significant amount of context-specific functionality.
It has a large hardware-software interface, and all these functionalities impact safety. So your main goal is to make the system depend on very few selected OS functionalities, and you can select those OS functionalities based on the artifacts, on the evidence that you can gather from the development. For that we have two methods: hazard-driven decomposition, design and development, and assurance-driven selection. So let's go into that.

Hazard-driven decomposition, design and development, in short HD3, is a systematic approach to a problem that we had in our system analysis. What we want is to have precise technical safety requirements on the lower levels, on an operating system, that clearly indicate which impact it has on the system safety. When we did the first iteration of our safety engineering, we came up with requirements of the kind: the syscall open is used in a safety-critical application and must work correctly. The problem with that is, if you look into the specification of open, it has a large number of different options, and this requirement is way too imprecise to tell you what further testing, verification and validation you have to do. So we redid our safety engineering and used this new approach, a dedicated method that should achieve more precision. By doing that, we came to the conclusion that we could actually have 12 constraints on the syscall open, and now, taking these 12 constraints into account, we can do specific testing and verification activities that are feasible within our project.

To explain this new method to you at a very high level, in one slide: what we did in our naive system safety approach is that we started with the top-level system design, we did a hazard analysis, and we came to functional safety requirements. We took those and did a functional decomposition until we were at technical safety requirements on the level of an operating system.
But at that point it was way too unclear how the requirements related to the top-level hazards. So instead, we started the activity by repeating the top-level system design and hazard analysis just as before, but then we unroll the system design by one level and restart the hazard analysis, so that we get refined safety requirements on the next level. And we repeat this process, unrolling the design level by level, until we are at the level of having precise requirements on the operating system. If you want to know more about this approach: we had a three-day workshop last year, and we're probably going to have another three-day workshop next year going into the details of that.

Another method that we found was assurance-driven selection, and I'm going to explain that with a simple example. You get the task to implement an init system that sets up the partitions for the applications, with isolation and controlled access to the shared system resources, and it should start up the safety applications in those partitions. So it's just some kind of start-up manager. If you would do this with a function-driven selection, let's say a traditional system or software engineering approach, you would look: okay, what are the pre-existing technical solutions out there? You'll find init, you'll find systemd, you'll find other technologies, and you'll find that by using systemd and writing a few systemd service files, you solve the problem. Unfortunately, you now have the task to qualify systemd, and when you look into that, you'll find out: well, actually, I can't provide evidence that this was developed with high quality. The artifacts that you can gather are not sufficient to claim that.

So the solution to this problem is that you take the assurance data that you already find into the selection process. You consider the technical solutions out there, you can look at init and systemd and find out what the potential is and what the effort is to qualify them,
and you can also consider writing a simple, special-purpose, dedicated program and qualifying that. If you do that, you'll find out that qualifying the special-purpose program takes much less effort, because we don't have the evidence for the other artifacts, and for the small program that we wrote ourselves, we can actually set up that process ourselves.

So the message here is not: reinvent the wheel every time you have to do such a task. The answer is really that you always have to consider the effort of the qualification when you do this kind of selection, and sometimes this means that you use a solution that, from a functional point of view, is maybe not the optimal one, but from a qualification point of view is much simpler to handle than the other solution.

So we did this, and this resulted in the following software architecture that we're assuming. We have safety-critical and non-safety-critical applications running on the same kernel. The isolation between them is achieved with CPU shielding; we use dedicated cores and memory regions for that. We try to identify unintended behavior of the safety-critical applications with seccomp. The safety-critical applications use glibc, because that's the library that we had the most assurance data for.

So now we're going to go into the next part and show what we do to find out if the Linux kernel development is done with sufficient quality. There we developed a number of new methods and tools to do this kind of analysis. There's one part where we do analysis of the kernel git data. So we have statistical prediction models that shall predict how many remaining bugs are in the kernel, and we have other tools and methods where we try to answer different questions that are asked in the safety standard. Can you tell me what the competence of the persons involved in your development is?
Do you know what the dependencies between the developers are when they are doing this review? Can you find critical patches that didn't go through enough review for the level of trust that you actually want to put into them? And we have tools that try to do that.

The second set of methods and tools is all around analysis of the kernel source code. We have a database of all the execution traces from syscalls, to show that there's a certain independence of different protection layers, that there's a certain path coverage of some syscalls and not of others, independence of consecutive calls, and some inherent diversity of system call executions.

I think an interesting tool, where I've now already heard from multiple people that they have some kind of prototype of it as well, is the patch impact tester. You want to determine if a patch actually has impact on the specific kernel configuration that you have, and it seems that this is interesting for various people in this community. The third tool is a code minimization tool, which selectively preprocesses the #ifs and #ifdefs to minimize the source code when you want to do review inspections or pass it to a source code analysis tool that isn't aware of the kernel build system.

I think actually all of these different tools deserve a talk by themselves.
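
To make the idea of a patch impact tester a bit more concrete, here is a minimal sketch in Python. It is not the project's tool. It assumes, purely for illustration, that you already have a list of the source files that are compiled for your kernel configuration (for example collected from a build log), and it only works at file level: a real tester would also have to look inside the files, since a change can sit in a branch that your configuration preprocesses away, and it would have to track headers. The names `files_touched` and `patch_impacts_config` and the sample paths are hypothetical.

```python
from typing import Iterable, Set


def files_touched(diff_text: str) -> Set[str]:
    """Collect the file paths named in the headers of a unified diff.

    Simplification: every line starting with '--- a/' or '+++ b/' is treated
    as a file header, which is what git-style patches normally produce.
    """
    touched = set()
    for line in diff_text.splitlines():
        if line.startswith(("--- a/", "+++ b/")):
            # Drop the 6-character prefix and any trailing tab-separated data.
            touched.add(line[6:].split("\t")[0].strip())
    return touched


def patch_impacts_config(diff_text: str, compiled_files: Iterable[str]) -> Set[str]:
    """Return the files that the patch touches AND that the configuration builds.

    An empty result means: at file level, this patch cannot affect this build.
    """
    return files_touched(diff_text) & set(compiled_files)


# Hypothetical example: one patch touching an audio file and a core file,
# checked against a configuration that does not build audio at all.
SAMPLE_DIFF = """\
--- a/sound/core/pcm.c
+++ b/sound/core/pcm.c
@@ -1,3 +1,3 @@
-old
+new
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -10,2 +10,2 @@
-old
+new
"""
COMPILED = ["kernel/fork.c", "kernel/sched/core.c", "mm/mmap.c"]

impacted = patch_impacts_config(SAMPLE_DIFF, COMPILED)
print(sorted(impacted))  # ['kernel/fork.c']
```

The design choice in this sketch is to err on the side of reporting impact: anything that touches a compiled file counts, and only patches that touch nothing in the build are filtered out.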
They're very interesting, but we don't have time for that. So I'm just going to look into one method that we developed, and I'm going to choose the one that is the most difficult to argue and that raises the most discussions within our group.

What you're going to see here is the statistical prediction models that predict the number of remaining kernel bugs. On the left side, we just plotted the bug age histogram that we extracted from the Fixes tags for a 4.4 kernel. What you see here is that most bugs have actually been detected after roughly 500 days. And then, if you look at a point around 1,000 or 1,500 days, there's a really long tail of remaining bugs that are in there.

If this is true and bugs actually behave in this way, you would see in a stabilization phase a decreasing number of bugs being detected and fixed. So what we plot here on the right-hand side is the number of bug-fix commits that were contributed to the 4.4 kernels, from 4.4.1 to 4.4.77. You can see that it's kind of a random cloud of points, but if you apply statistical models to that, you will see that there is actually a decline in the number of bugs in there. So at some point you can actually ask: tell me, with which confidence do you believe how many remaining bugs are in there? And that gives you a risk estimate of how many bugs could potentially impact your system safety.

I guess if you've seen the last slide, you've been wondering: okay, they just plotted some things, and that seems fine, but I have hundreds of counter-arguments that could explain that picture as well. And I completely agree. So if you have any kind of discussion on this topic,
I'm happy to discuss that after the talk or within the next few days. I really just selectively picked out one part of an argument, and probably for all your counter-arguments we have to come up with a reasoning that explains that your counter-argument might explain part of that picture, but not the complete picture.

Now that you've seen the methods and tools that we developed and apply, I want to tell you a bit about how we actually want to improve the Linux kernel. It starts with a very simple core observation: if you try to build up an internal team that should improve the Linux kernel, you're probably going to get to the point where you find out that the competence of your internal team is not going to be as good as that of the overall kernel developers taken together. This means that if you modify the Linux kernel with this internal team, you're not following the development process, and that's going to reduce quality and increase the risk of safety-critical bugs.

So if you really want to reduce the risk, and that was our ultimate goal, you take a stable mainline kernel. So there's no Linux kernel for safety; it's just a well-matured LTS kernel. And you try your best to improve, within this process, the artifact that you're constructing, right? So if you make a change for an improvement, it has to follow the Linux development process, and this means of course that the improvement must be reviewed, accepted and appreciated by the kernel developers. And trying to do that in engagement and interaction with the ongoing development is of course much more effective than trying to do that in some kind of deferred, post-development mode.
Nobody's going to react if you say, oh, by the way, you didn't follow the coding style ten years ago in this patch. Nobody's going to bother about that. If you say, oh, by the way, this was yesterday, and we see some risk that, let's say, this different syntactic structure is going to confuse someone later on, you actually have a good argument that he's going to rewrite the code. So this kind of involvement has to happen collaboratively with the Linux kernel development.

So what kind of activities do we see for the use in safety-critical systems? On the one hand, there's an existing coding style, and what we actually did is we looked at the existing coding style and gathered evidence for its quality. It's more or less gathering the conclusions that have been discussed on the mailing list and that have led to the decision that the coding style is this way and not another way. So it has been well argued, and we just gathered the evidence, a documentation of why it is this way. And we try to monitor and motivate compliance with it, so that we can actually claim people are following that coding style rigorously. On the testing side, we want to extend the tests of the Linux Test Project for the system calls that we determined. We want to apply static analysis methods.
That means using the kernel testing tools that are already out there and trying to find more bug patterns, more bug classes. And last, we want to address the point of change management. We want to monitor and assess whether bug fixes from mainline have actually been consistently backported to the LTS versions, and we want to analyze which kernel bugs and which bug fixes could actually impact the system safety. And of course all these activities will focus on the parts of the Linux kernel that are relevant for the system safety. So if we don't use audio, you're probably not going to see any changes of that kind in the audio driver.

I'm just going to pick out one of those activities, and that's the maintenance task. That's going to be haunting us for the next few years when we use Linux in a safety-critical system. And it's actually quite easy: the safety standard says you're required to do continuous monitoring and analysis of identified issues. What they mean by that is, of course, that you're going to identify issues in your system, and if you find out that some part of the software that you developed is wrong, you have to act accordingly. But in the case of Linux kernel development, this also means that if someone else finds a bug in the Linux kernel, and I've shown you that's about 90 bugs every week in the current 4.9 kernel, you actually have to react to that as well. So how are we going to implement that with Linux kernel development?
The bugs are continuously found in Linux, and the bug-fix commits are backported to the affected LTS branches. Then we have to consider that every product developer that uses Linux in a safety-critical system must determine whether that bug and that bug fix impact the system that they are running or shipping to their customers. We do this with a two-step process.

First, for each bug fix, you have a kernel analysis team that describes the impact of that kernel bug on user space, and its bug fix, in the detail that you need. This analysis is independent of the specific system, so you can actually have a collaborative team of all the product developers working together to find out the impact of a bug fix that came up in the kernel development. The second step is then taking this output and passing it on to a system analysis team for each system, and they have to judge whether the described bug fix has some impact on the system and the system safety. If it does, you probably have to go through the difficult question of how likely it is that this bug will actually be triggered in my system. Do I switch the functionality off?
Do I do a new analysis, find out if I want to apply a software update, apply the software update and then reactivate the functionality, so that nobody gets hurt in the meantime?

So I'm already at the end of my talk, and I'm just going to conclude with the following words. If I've got you interested in a number of these topics that we are discussing, you should consider joining the safety-critical Linux group that we're working in. If you want to join that group, you can contact Nicholas Mc Guire, or you can come to me after the talk or write me an email. We have a number of upcoming events in November and December where you're all invited to come if you're interested. We have a project management meeting in November in Munich. There's a course on IEC 61508, so an introduction to functional safety, in mid-November in Graz. And we have a hands-on workshop, where we're going to actually try out using the different Linux quality assurance methods to improve the Linux kernel in a three-day workshop, and hopefully the people that are interested continue doing that activity later on.

What you've seen in this talk is that if you have multi-threaded, high-performance, complex safety applications, you're probably going to need the qualification of a full-fledged operating system, and our project shows that this is feasible with the Linux kernel. One of the main insights of this project is that the difference between safety-critical Linux and mainline Linux is actually only the way you use it. But that's the important part: you have to know how you use it, and you have to design your system so that you use it in a proper way, so that you can actually get this task done. If you do this simplistically, you're going to have a task in front of you that you just cannot get done.

Part of what we've seen is that certain activities can be done collaboratively, although for specific systems you always have a system-specific
activity. The safety-critical Linux group is interested in quality assurance activities that are in line with the kernel community, and we're trying to bring together two groups: on the one side, product developers that are trying to use Linux in a safety-critical system, and on the other, kernel developers that are interested in improving the quality of Linux kernel development. So if you're one of those and you have an interest in that effort, please contact me or Nicholas Mc Guire. Thanks for your attention, and I'm happy to answer one or two questions. Yeah, let's just start there.

Yes, so the question was how we achieve the partitioning of memory regions on the physical memory. I think it comes down to two things. Sure, you have to make sure that the hardware doesn't mess this plan up, right, and just randomly rearranges things. That's on the one hand. The other thing is that we use the PALLOC patch. That's a patch that makes sure that the Linux kernel puts the memory in a certain physical memory location, and that's the patch that we apply to make sure that we actually have them in separate physical memory regions.

Yes, right, so for DMA, the main question then is probably: how do you make sure that this part of the system, the DMA master, doesn't mess this up? Yeah, that we have to address as well.

Yeah, so the question was how you deal with hardware that hasn't been put in mainline. There are of course two answers to that. First of all, there's a selection, right: you're going to use the hardware that is mainline. And you're going to have to have hardware that's going through a qualification process anyway, so there is going to be a lot of effort.
So the actual effort of getting that mainline is probably not the big step. If you now say, I still want to use that board and it hasn't been developed mainline, you have to show that this has sufficient quality, and I guess that's actually much more difficult than just getting it mainline anyway.

Yeah, so the question was when we're going to see the first product with Linux being used in a safety-critical system, and the answer is actually quite simple: there's already Linux used in safety-critical products out there. So it's already done. The question is of course when we see results from this project, right? And I would expect that, as you said, in around five to ten years you're going to see products coming out, hopefully a number of them working collaboratively and not trying to do that on their own.

Okay, thank you. And if you have any further questions, just come to me after the talk.
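
As a footnote to the bug age histogram discussed in the talk, here is a minimal sketch in Python of the raw measurement behind such a plot. It assumes you have already extracted, for example from git, the commit date of every commit and the pairs of (fixing commit, commit named in its Fixes tag). The function names, commit identifiers and dates are made up for illustration. This is not the project's tooling and it contains no statistical model; it only shows how a bug age and a histogram bin could be computed.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, List, Tuple


def bug_ages_in_days(commit_dates: Dict[str, date],
                     fixes: Iterable[Tuple[str, str]]) -> List[int]:
    """Age of a bug = date of the fixing commit minus date of the commit
    that its Fixes tag points to. Pairs with unknown commits are skipped."""
    ages = []
    for fix_id, introduced_id in fixes:
        if fix_id in commit_dates and introduced_id in commit_dates:
            ages.append((commit_dates[fix_id] - commit_dates[introduced_id]).days)
    return ages


def histogram(ages: Iterable[int], bin_width: int = 100) -> Dict[int, int]:
    """Count ages per bin; each key is the lower edge of a bin in days."""
    counts = Counter((age // bin_width) * bin_width for age in ages)
    return dict(sorted(counts.items()))


# Hypothetical data: two buggy commits (c1, c2) and the commits fixing them.
commit_dates = {
    "c1": date(2015, 1, 1),
    "c2": date(2015, 3, 1),
    "f1": date(2015, 2, 10),
    "f2": date(2016, 1, 1),
    "f3": date(2015, 6, 9),
}
fixes = [("f1", "c1"), ("f2", "c1"), ("f3", "c2"), ("f4", "c9")]  # last pair is unknown

ages = bug_ages_in_days(commit_dates, fixes)
print(ages)             # [40, 365, 100]
print(histogram(ages))  # {0: 1, 100: 1, 300: 1}
```

One caveat that follows directly from the sketch: only fixes that carry a Fixes tag pointing at a known commit are counted, so the histogram says nothing about bugs that were fixed without such a tag.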