Hi everybody, and thank you for joining this session at the Open Source Summit. Today we talk about a maintainable and scalable kernel qualification approach for automotive. I am Gabriele Paoloni, an open source technical leader at Red Hat, and today I will co-host this session with Daniel Bristot. He is also working at Red Hat, as a principal engineer and Linux kernel maintainer. What we are going to present today is a session that was already presented at the last ELISA workshop; we would like to propose it again to get feedback and raise discussion. Everything we present today is still work in progress: no results are binding on behalf of ELISA or the Linux Foundation, and we make no safety claims based on these preliminary results. In other words, this is an investigation activity, and no claims will be made on top of it. Now, the agenda. First we will clarify what is in scope and out of scope for this presentation. We will talk about possible functional safety qualification approaches for Linux, and then we will introduce the hybrid qualification approach, which is the key topic of this presentation. We will present an example of this hybrid approach that we investigated within the ELISA working groups, and then we will talk specifically about the runtime verification monitors, the technical activity that Daniel has been working on in the last few months. Then we will cover the next steps, and finally we will have the question and answer session. So, what is in scope? A proposal and high-level description of a functional safety qualification flow for kernel code allocated with safety requirements to meet a certain ASIL, according to the ISO 26262 safety standard for automotive. What is out of scope? We will not talk about the functional safety qualification of the hardware.
We will not talk about any safety standard beyond ISO 26262, and we will not talk about freedom-from-interference claims between coexisting kernel drivers or subsystems allocated with different ASILs. So, let's look at the possible approaches defined today in ISO 26262. If we consider a pre-existing software component, we have part 8.12, which talks specifically about the qualification of pre-existing software components. This is a black box approach, based on the verification of the top-level requirements allocated to the software component. Basically, we have a software component allocated with functional safety and functional requirements, and we need a test campaign comprehensive enough to verify those requirements. This approach is commonly accepted for simple software components like libraries, but it is not a valid approach for the Linux kernel itself, which is way too complex. Then we have part 6. The part 6 approach is a white box approach: for a complex software component you need to define the software architecture, the different units and the interactions between the units, and accordingly you will have unit tests, integration tests and so on. It is a very structured approach, and it is recognized as suitable to assess complex software components. Then we have part 8.14, which is proven in use. Proven in use is a valid approach; however, we need statistical data showing a target failure rate according to the ASIL that we want to claim. And if we want to claim proven in use on top of valid statistical data, we also need to make sure that the boundary conditions of the software element are maintained.
Hypothetically, even if we had the statistical data, we would have to make sure that the hardware, the hardware configuration and the software configuration are unchanged, and for a component as complex as the kernel this is quite difficult to meet. Finally, we have part 10.9, the safety element out of context (SEooC), which just describes a way to formally define the requirements. It does not tell you how to practically qualify the element against these requirements; instead, it redirects to the other routes we just talked about in order to qualify the software element. Today we will focus on part 8.12 and part 6 specifically. Let's look at part 8.12 first. This approach, as we said, is a black box approach. The number of artifacts and collaterals we need to maintain is pretty small: we have the pre-existing software element, we need top-level requirements that must be well specified, and we need a comprehensive test campaign to verify those requirements. If we have all of these, the qualification is essentially complete, so the amount of collaterals is not too big. On the flip side, if we look at the part 6 approach, the number of boxes is much bigger. This is a white box approach and it is much more structured: it starts from the technical safety concept, then we have safety and nominal requirements, architectural design, unit design, implementation, unit tests, integration tests, platform tests and validation tests. The amount of activities, and the associated artifacts and documentation to be produced, is much bigger. And the key point here is that when we talk about unit design in functional safety, we usually talk about the design of the single functions in the code.
The activities from the unit design down are the most expensive ones in terms of the amount of collaterals to maintain. So what can we do in Linux? Linux, we said, is too complex to be qualified according to part 8.12. The safety element out of context approach, part 10.9, only covers the requirement definition, so it does not provide a practical solution. We could try to assess Linux according to part 6; however, if we do that, the effort associated with the generation of artifacts and collaterals just explodes. And it is in principle doable to use part 8.14, assuming the statistical data is valid and that the hardware, the software configuration and the stimuli are unchanged; but this is a very big constraint, so in practice it is very difficult to use part 8.14. So what are the pain points of ISO 26262 part 6? As we introduced previously, most of the problems are related to the unit design and the code implementation. First, let's look at the unit design: for each unit, so for each function, we need an informal notation up to ASIL B, and for ASIL C and beyond we need a semi-formal or formal notation. When it comes to the implementation, we need one entry and one exit point in each function, no dynamic objects or variables, no multiple use of variable names, no implicit type conversions. This is a bit hard to meet, right? And when it comes to unit test verification, we also need 100% code coverage and requirements coverage of the software units, that is, of the single functions. And we know that today in Linux we have more than 80,000 functions and tens of millions of lines of code.
So the effort to write and maintain all these artifacts and collaterals is practically not viable. So what do we do? Linux is too complex for part 8.12, and part 6 is too complex for Linux. So we came up with an idea, the hybrid approach, and the key here is divide and conquer. What does divide and conquer mean here? We know that Linux is already partitioned into subsystems and drivers; just look at the MAINTAINERS file. So the idea is: instead of considering a software unit to be a function, let's consider a software unit to be a block of code, which could be, for instance, a single driver or a single subsystem to start with. Now, if we take the single driver or subsystem, so the single software unit, we can try to qualify it according to part 8.12. And when it comes to the interactions between the different subsystems and drivers, and when it comes to the architectural verification, that is, the integration tests, we will consider the integration of these software units, these software blocks and subsystems, working together. Practically speaking, let's go back to the diagrams. This diagram resembles the one shown for part 6; however, there is a key difference. Part 6 is followed from the technical safety concept down to the software architectural design, and then from the integration testing up to the validation test. For the single block, we use part 8.12. So here, instead of having a function as the software unit, we have drivers and subsystems: a bigger granularity, bigger pieces of code. Now the key question: this approach may sound like a sort of shortcut compared to part 6, so why is it valid from a safety point of view?
From a safety point of view, we know that part 8.12 is already used to qualify pre-existing software components of limited complexity, and we can indeed consider our driver or subsystem as a pre-existing software component of limited complexity. According to ISO 26262, any pre-existing software component qualified according to part 8.12 can be integrated into a more complex software framework, as long as the assumed safety requirements and the conditions of use allocated to the pre-existing software block are defined and met. So from a safety point of view we can envision Linux as the integration of multiple pre-existing software components, all working together. Now there are two key points: how can we decide whether a single subsystem or driver is simple enough, and how can we describe the interactions between the different subsystems? We will answer these in the following slides. This slide shows, as a state diagram, the different steps of the hybrid approach. The first step is to define and allocate assumed safety requirements for a critical unit. Then the specifications of the software unit are written in kernel-doc headers. When it comes to the interactions between the software unit, the rest of Linux and the other units, we use a semi-formal or formal specification. Based on the unit specification and on the specification of the architecture of the software unit interacting with the others, we write the safety analysis. And then, based on the safety analysis, we write kernel selftests based on the kernel-doc headers, and also kernel selftests based on the architectural behavior of integrating the target software unit with the others.
Then, to verify the dynamic architecture of the single software unit interacting with the others, we will use the runtime verification framework that will be discussed later on. And of course we have continuous integration to monitor the changes. Now, one of the dilemmas we discussed is how to decide whether a unit is simple enough. How can I decide whether a driver or subsystem is simple enough to be described exclusively by kernel-doc headers? If we look at part 8.12, it requires describing the known safety requirements, the functional requirements, the behavior in case of failures, the resource usage, a description of the required and provided interfaces and shared resources, and also the configuration description. So, practically speaking, if you are able to comprehensively specify all of the above in natural language, the level of granularity for the single unit is the right one. If you are not able to comprehensively provide all this information for a single unit, then something is wrong: it means the unit is too complex, we cannot use kernel-doc headers to describe its architecture or behavior, and it must be broken down into simpler units. As we said, Linux is already partitioned: subsystems and drivers are already specified in the MAINTAINERS file, and that is a good starting point. I am not saying MAINTAINERS is good enough; however, using MAINTAINERS it is easy to map the code to the people responsible for it. And if we realize that a driver or subsystem is too complex, nothing prevents a further subdivision: we would probably create another file with a different partitioning. But anyway, MAINTAINERS is a starting point.
Now let's look at a specific example. In the past few months we have looked at a specific use case, the telltale use case, where we rely on the expiration of an external watchdog as one of the key safety mechanisms that can take the system into its safe state. To summarize: there is a display system that displays a telltale in case of a car failure, and there is a safety application that receives a message from an external monitor. If the message says the telltale is wrong, or if the message is delayed or corrupted, the safety application stops petting an external watchdog; and once we stop petting the watchdog, the safe state is automatically triggered. So one of the critical safety requirements is to make sure that the watchdog timeout is properly set, and this is the safety requirement we take here as an example: the watchdog subsystem shall ensure the watchdog timeout is set according to the ioctl input parameter. On the right we have the entry points associated with this safety requirement, and one of the key entry points is the SYSCALL_DEFINE for the ioctl syscall: this is the kernel entry point to program the watchdog timeout using an ioctl. Now, let's look at MAINTAINERS. The first subsystem we are going to analyze is the virtual file system (VFS), because the ioctl is part of VFS. And we can see that MAINTAINERS defines the scope for this subsystem: you can see a directory, fs, and a list of files that are part of the subsystem. So we have a clear scoping for our target subsystem: our software unit is VFS.
In the context of the ioctl, what are the other subsystems and blocks that this software unit communicates with? What you see here is, effectively, a communication diagram. We have the incoming function, the ioctl from the safety app, and then the different outgoing functions. We have the workqueue subsystem, the security subsystem, the architecture-specific subsystem (x86 in our example), and then the function pointer associated with the unlocked_ioctl, which in this case is the watchdog device ioctl. This is basically a static view of the architecture of the different blocks interacting with the target one, the VFS, and it is effectively a semi-formal notation: a UML communication diagram. On top of that we also create a dynamic view: the subsystems are the boxes at the top, and we have a sequence of events showing, as a flow diagram, what can happen following an ioctl, that is, the sequence of events between the different subsystems supporting the ioctl syscall. The communication diagram together with the flow diagram constitutes our architectural description of the different blocks talking to each other. Now, moving to the block specification: here we proposed (this was not pushed upstream, it is just an internal discussion) an improved kernel-doc header for the ioctl. You can see a quite descriptive specification of the possible behavior, and at the end we also have the possible return values. There is also a to-do section, and this is important: if there is something not yet covered by the kernel-doc header, it should be explicitly mentioned.
This is very important for functional safety. So our specification is the software architecture plus the block specification. Based on this specification, we do the safety analysis in the context of the ioctl, to see what the possible failure modes are and to either derive additional requirements or refine the architecture. After the safety analysis we have the code, which is the standard code, and then the block testing: the kernel-doc headers will be used to define kernel selftests according to the specifications in the kernel-doc headers themselves. When it comes to the integration testing between the different modules, we will use runtime verification monitors that can formally verify that the behavior of the running element is the same as specified in the model. Daniel will talk about this in a bit. So now I will leave the scene to Daniel. Daniel, please go ahead and talk about runtime verification. — Hi, I'm Daniel, and now we move to the more technical side of this presentation. I will talk about how to integrate these ideas using runtime verification to connect the specification and the system. Runtime verification is a lightweight yet rigorous formal method, and it is used to complement other kinds of formal methods, like model checking and theorem proving. The difference is that instead of trying to compose a theoretical model of the system, runtime verification works by analyzing the system running live, trying to verify the actual execution of the system. To better understand runtime verification, we need to see its two inputs. On one side, we have the formal realm, where we have a formal specification of the desired behavior of the system.
On the other side, we have the Linux realm, where we have the set of events generated by the system as it runs. The runtime verification stands right in the middle of these two things: it reads the trace on one side, it reads the specification on the other side, and it tries to combine both of them. As the system runs and everything is fine, the runtime monitor should just say: okay, things are running as expected. But if an event is generated that the formal specification does not recognize, we can take some reaction to it. These reactions can be either going to a failsafe mode, or using the information to improve our understanding of the system. So how do we see runtime verification inside this hybrid approach? A huge part of the specification work is actually understanding the system and specifying it in a format that can then be translated into the requirements. And one of the challenges of writing the specification of a system is being sure that the specification matches the actual implementation, which is even more complex for software that is constantly changing, like Linux, which changes every day. So how can I be confident that my documentation still matches the system? The good thing about runtime verification is that, with the correct format of the documentation, we can run the documentation, in its formal format, in the kernel as the system runs, and actually compare at runtime whether the documentation still matches the kernel. This closes the loop between the documentation and the implementation. The idea is to use it in two phases. The first phase is at design time, where you can use runtime verification to optimize the documentation.
So, say you start writing the documentation of the system and explaining it. Then you use runtime verification to see whether you covered all the aspects of the system, by exercising the system, reading the trace and checking that the documentation is stable, that is, that it still holds as the system runs. There will come a point at which the specification starts matching the kernel, and we can say we have covered it. After we reach a good enough level of confidence in the specification, we can also use the specification for the runtime monitoring of the system. Now, one objection to formal methods is: formal methods are complex; how can I specify the system using a formal language? It is not always that easy. The good thing is that, in the hybrid approach, the group is finding the UML sequence diagram to be a useful language to express the runtime behavior of the system, and this informal language can be easily converted into a formal language, which is automata. But why automata? The good thing is that, prior to this work, I was already using automata to describe the behavior of a part of Linux: the synchronization model of the PREEMPT_RT kernel. Using automata, I was able to create a description of the thread synchronization, and I was successful in explaining it even though the system is really complex: the thread synchronization model has more than 9,000 states and 20,000 transitions. So automata were flexible: I could explain this very complex system, building up the complex model from a set of small specifications, each with fewer than 10 states. So it is practical to use on Linux. Moreover, in my research I also found a way to take the formal specification in automata, convert it into kernel code, and run the documentation, that is, the formal specification, in parallel with the kernel at runtime, synchronously, at a very low overhead.
Indeed, the overhead of running the automata was lower than saving the trace for post-processing at a later time. This gives us good evidence that this can scale for the purposes of the hybrid approach. Okay, that part was the research; the good thing is that I have been working on transforming the research into actual kernel code, and here you can see that I submitted the first version of the runtime verification (RV) interface for the Linux kernel. It is basically composed of two things: a tool that automatically generates the runtime monitor code from the specification, and an intuitive interface where we can control the monitors available in the system, enabling and disabling them and configuring them to have different kinds of reactions. So, say we have a sequence diagram and we have converted it into the automata format, which should be straightforward. Here I am showing how to translate the automaton of the wip monitor: this is the automaton specification in an open format, the Graphviz .dot format. I can convert it into code that runs in parallel with the kernel with a single command line. The work that is left for developers is pretty straightforward: it is just connecting each kernel event, like a tracepoint or a function, to the corresponding event in the automaton, or in the sequence diagram. We do not have time to explain all of this here, but if you look at the documentation in the slide you can see how easy it is; this is actually part of the patch set that I sent to the kernel. And how about the interface? Tracing users are used to the ftrace interface, and here I am showing an example of how to run the wip monitor that I previously converted into code. Once the monitor is loaded into the kernel, I can simply go into the rv folder inside the tracing directory. There I can enable a reactor: a reactor is an action taken when a mismatch between the kernel and the model is found.
Here, in the third line, I am saying: if an unexpected event happens in the wip monitor, I want you to panic the system. Boom. And in the next line I am enabling the monitor. As the system runs, the monitor can do nothing but watch the system, or we can even enable its tracepoints and watch how the monitor itself is running, hopefully without hitting an exception; and if we do hit one, we can take the right action. This was just an introduction on how we see the runtime monitors fitting into the hybrid approach. There is plenty of documentation: we have an article in the Red Hat Research Quarterly, a magazine where we give a summary of the research we are doing at Red Hat, but you can also find the academic papers, with all the details on connecting these things, at these links. There is also a presentation from 2019 where I explained all this machinery for runtime verification in more detail. And that's it; I give the word back to Gabriele. — Thank you, Daniel, thank you very much. With the runtime verification monitors, we conclude the integration test phase. Platform tests and validation tests are out of the scope of this presentation; they would be covered by the standard platform and validation testing done against the top-level safety requirements allocated to the whole kernel in the safety concept. So we are now at the end of the presentation; let's do a bit of a wrap-up. What are the pain points and the next steps? We talked about communication diagrams: they present the static view of the interactions between drivers and subsystems, and they can be supported by static analysis of the code. Here you can see a link to the call tree tool that has been developed by Mobileye.
You can look in the GitHub repository of the safety architecture working group: this tool can be used to support the generation of these diagrams automatically. Then, a baseline of the dynamic flow diagram can be generated by using tracepoints: once we identify the interfaces between the blocks with the communication diagram, we can attach tracepoints to these interfaces and generate a baseline of the dynamic flow. Of course this baseline is not comprehensive and cannot be used as a model on its own: what must be done is to review the baseline and integrate it with the missing events, based on a review of the code itself. Then, as Daniel said, we need a way to translate these architecture diagrams into an automata-based formal model. And finally, we also need to comprehensively specify the behavior of the single units, so we need to write the kernel-doc headers for all the functions where they are missing. As next steps, we need to develop and refine tools to support the generation of architectural models, we need to continue the development of the runtime verification interface, and finally we need to scale up, by pushing these tools and engaging with the maintainers, who should indeed be the ones maintaining both the models and the kernel-doc headers. And with that, I'm done. So, question and answer: please go ahead and ask your questions. Thank you very much for watching.