Alright, hello everyone. Thanks for joining this session. I'm Junjie from Intel. I have been working on functional safety certification of the open source hypervisor ACRN for three years, and in this session I'm glad to share our experience with the certification as a case study of how open source software can be certified for functional safety. This is the agenda for today's session: we'll first briefly introduce what functional safety is and what ACRN is, and then we'll share some key challenges, and the daily-working-model changes and practices we applied, when we certified ACRN.

But first, about functional safety itself. According to the Wikipedia definition of functional safety, the definition of safety in general is the freedom from unacceptable risk of physical injury or damage to the health of people, either directly or indirectly. Here "freedom from unacceptable risk" means that we accept the fact that residual risk will always be there; even when we use a piece of hardware or software in safety-critical uses, we need to define the extent of risk that we can tolerate, and this naturally derives the different levels of safety criticality, the safety integrity levels, defined by the functional safety standards. What is special to functional safety is that safety is achieved not by hardening the function itself, but by the proper implementation of one or more automatic protection functions. In this presentation we will refer to these automatic protection functions as safety functions, following the terminology of IEC 61508, which is a functional safety standard. These safety functions of course need to be properly implemented to make sure that they function as expected and that their own failures can be handled in a safe way. So, talking about software: what does functional safety require of a piece of software?
Of course, we already know that the ultimate goal of functional safety is to mitigate risk, and in the context of software, which only has systematic failures, the goal becomes the mitigation of risk from systematic failures. In this presentation we will scope ourselves to IEC 61508, which is the standard we certified against in our previous certification, and we will only talk about out-of-context certification, meaning that the software is certified with assumptions on the uses of the software itself. In our case, we don't have a system-level understanding or a concrete customer which defines the system where ACRN is to be used, and as a result we cannot do an in-context certification. So in this presentation we will share our knowledge based on out-of-context certification.

And of course we are talking about software here, so the basic strategies to achieve this risk-mitigation goal include the fail-safe principle and fault avoidance. First, fail-safe means that we accept the fact that there will always be failures, and we need to make sure that software that is safety critical or safety related keeps the fail-safe principle in mind.
That is, we have a clear idea of what kinds of faults can be detected by the safety-related software and, once detected, how these faults can be handled in a safe way. Second, once we have this piece of software designed to handle failures, the software itself needs to be developed with a certain level of reliability: we need to systematically manage and systematically develop the piece of software we define to implement the safety function. Here "systematic" means the management and development activities are planned before they are conducted, and evidence that the activities were conducted following the plans is collected throughout the life cycle of the software.

So that's functional safety, but what is functional safety certification? The output of a functional safety certification is a certificate from a third-party certification body, stating that the piece of software under certification has been developed in a way that is compliant with the functional safety standard and thus it is proper to use this piece of software in safety-critical uses. According to the functional safety standards, there are three routes to achieve a functional safety certificate. The first, and most straightforward, is that the piece of software is developed in a way that is compliant with the standard: a functional safety standard itself defines how the management shall be conducted, and it also defines requirements and recommended measures that need to be adopted when developing the software, so a compliant development means all these requirements and recommended measures are properly adopted and fulfilled. The second route is proven-in-use: if the piece of software has already been used in safety-critical situations for many hours, then these statistics and
its failure rates can be used as an argument in the proven-in-use route to achieve a certificate. This will definitely not be the focus of our presentation today. Our focus will be the third route, which is also the route ACRN uses: the assessment of non-compliant development. A piece of open source software is typically developed for general purposes from day one, and later on we decide to reuse this piece of software in safety-critical situations; this is where the assessment-of-non-compliant-development route applies.

To do the certification, or to achieve a certificate, we interact with the third-party certification body through multiple interactions. As we have mentioned, functional safety requires software to be developed and managed in a systematic way, which means we need to have plans of the activities beforehand. So the first set of work products we deliver to the certification body is not about the architecture design, or even the implementation of the hypervisor, but a set of plans. Based on a basic concept of what the hypervisor is, we first deliver a set of plans of the activities and processes we adopt during the development of the hypervisor. After the delivery, the certification body will assess the plans, ask questions, and raise open points, making sure that the planned activities are compliant with the standard; of course there can be multiple interactions afterwards to clarify the questions or close the gaps. Once the plans are settled, we start delivering our development work products. The first batch is the requirement and architecture design specifications, with the second batch being the detailed design, derived from the architecture design, and the concrete implementation. Following that come the test activities: first the test plans, i.e., the objectives and strategies of the test activities, and the test specifications, which define the test cases
we use during testing, and finally the test reports, showing that our implementation fulfills or passes the tests, and, in case of a failure, why we accept it or why we deviate from the defined requirements and whether that has been recorded in manuals, like safety manuals or user manuals, to let system integrators know the known issues of the hypervisor. All these manuals are subject to assessment as well. The certification body will assess all these work products according to the requirements from the standard and raise questions during the reviews, and there can again be multiple rounds of interactions to close the gaps identified in the previous work products. These interactions can actually overlap: you may have two batches of work products delivered, followed by some responses to previous questions, and then further batches of your work products. Once all the work products have been delivered and reviewed, with the questions and open points clarified and closed, we get a certificate from the certification body which states the properness of using the piece of software in safety-critical usages. So this is, from a technical perspective, how we interact with the certification body to get the certificate, and this is how we worked with the certification body in our previous experience.

All right, so much for the introduction to functional safety and functional safety assessment. In the next few pages I will briefly introduce what ACRN is. ACRN is a Linux Foundation project which implements a flexible and lightweight reference hypervisor. It is built mainly for real-time and mixed-criticality scenarios. The architecture of ACRN is like this: ACRN itself has a hypervisor, which is a type-1 hypervisor.
That is, it runs on bare metal directly, and one of its key capabilities is to partition the hardware into multiple parts. In one part you can have a separate VM owning its own hardware resources, like processors, memory, and devices, while in another part you may have multiple VMs, with one of them being the so-called Service VM, which manages the rest of the VMs and where the device model runs for device sharing. In that same part you may have multiple other VMs running less critical software, like machine learning, HMI logic, or other computation logic. In the safety-critical scenario, we assume that the safety functions from the users, or system integrators, run in a partition with no Service VM involved; that is, the partition is launched by the hypervisor directly, with resource isolation guaranteed by the hypervisor.

Following this brief introduction to functional safety and ACRN, the rest of this session will cover three major challenges we met during our previous experience and how we tackled them. We will briefly introduce our approaches, which have been proven feasible for certifying a piece of open source software. The first challenge we met is to define the safety function implemented by the hypervisor. This is required because we need to first clarify why the hypervisor needs to be certified at all. From the definition of functional safety, we know that a piece of software is subject to certification if it is part of a safety function or if it may impact existing safety functions. As a hypervisor, we hardly know the business logic; that is, we have no idea what kind of sensor inputs users or system integrators may have and what kind of outputs they may want to generate to the actual actuators.
Well, as a result, precisely speaking, the hypervisor alone does not implement any safety function; rather, its main objective is to consolidate multiple software stacks with mixed criticality, some of which implement safety functions. A failure inside the hypervisor may result in the breakage of a consolidated safety function, so ACRN actually falls into the second category; this is also known as interference in the standards. Talking about impact to safety functions, or interference on the safety function, there are two sources. One is that the hypervisor itself breaks the safety functions. This can be due to mainly two reasons: first, the hypervisor virtualizes the platform in an incorrect way, causing the safety function to respond in an unexpected manner; second, there is no safety without real time, and the additional overhead introduced by a hypervisor may break the real-time properties of the safety function, leading it to miss its real-time requirements, which is also a kind of failure for a safety function. Let's first talk about incorrect virtualization.
There can be multiple reasons. One is that the hypervisor corrupts the states of the VM. A second is that the hypervisor provides responses in an unexpected way, that is, in a way that deviates from the existing hardware specifications. And in the worst case, the hypervisor may even block the execution of the safety function. All of these are typically due to faults or errors inside the hypervisor. Talking about the additional delay, which is typical for all kinds of virtualization solutions: it may delay the execution of the safety function, or it may delay the delivery of events, the asynchronous events, typically interrupts, to the safety function, both of which will delay the overall execution of the safety function and cause missed deadlines.

The second category of interference is that a failure inside the hypervisor may allow other partitions to impact the safety function. Again, this can be further categorized into subcategories. One is that the other partitions have a possibility to corrupt the memory or storage used by the partition that runs the safety function, which is also known as spatial interference. The second is delay effects due to shared resources, like last-level caches or peripheral bandwidth: very heavy workloads inside one partition can typically lead to delays in the execution of other partitions. This is also known as the noisy-neighbor effect, and it definitely also needs to be considered.

Our approaches to mitigate the interference mentioned earlier, in order to protect the safety functions, are as follows. For the errors due to the hypervisor itself, on the incorrect-virtualization part, we apply systematic development of ACRN, which means we define the requirements, i.e., the expected external behavior of the hypervisor, and we have extensive tests to verify that the implementation fulfills the requirements. In addition,
during the requirement definition and the architecture design, we also apply the principle of being defensive to hardware errors; that is, whenever the hypervisor is capable of detecting a hardware error, we design a defensive approach to make sure that we fail in a safe way. With regard to the additional delay: as a piece of out-of-context software, we have no criteria to decide whether the delay introduced by the hypervisor is tolerable in the overall system. So what we set out to do in our certification is a performance evaluation to showcase the worst-case delays of different kinds, and these numbers are provided in the safety manual for system integrators to reference, so that they know what kind of delays they may expect and whether their overall system can tolerate these additions. In addition to that, as a bottom-line approach, we also assume that there will be a hardware watchdog which monitors the execution of the safety function, making sure that the safety function completes in the designed time: it does not miss its deadline and it does not complete too early. This hardware mechanism serves as the bottom line to catch and handle such timing misses.

For the interference from other VMs, or other partitions: for the memory or storage corruption, or in general spatial interference, the hypervisor is required to leverage hardware capabilities to implement mechanisms to avoid it. The execution delay, which we will cover in more detail on the next page, is worth mentioning a bit more. To fully understand what kinds of temporal interference can be caused by the non-safety partitions, it is worth applying a systematic interference analysis. Again, here "systematic" means we have a defined approach: it is not ad hoc.
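As a rough illustration of that bottom-line mechanism, the "windowed watchdog" idea (the safety function must check in neither too early nor too late) could be sketched like this. This is only a toy model; the class, method names, and timing values are invented for illustration:

```python
# Sketch of a windowed watchdog: a kick is accepted only inside a
# time window, so a safety function that runs too fast, too slow,
# or not at all is detected as a timing failure.

class WindowViolation(Exception):
    """Raised when the safety function misses its timing window."""

class WindowedWatchdog:
    def __init__(self, open_ms: float, close_ms: float):
        self.open_ms = open_ms      # earliest acceptable kick after the last one
        self.close_ms = close_ms    # latest acceptable kick (the deadline)
        self.last_kick_ms = 0.0

    def kick(self, now_ms: float) -> None:
        elapsed = now_ms - self.last_kick_ms
        if elapsed < self.open_ms:
            raise WindowViolation(f"kick too early: {elapsed} ms")
        if elapsed > self.close_ms:
            raise WindowViolation(f"kick too late: {elapsed} ms")
        self.last_kick_ms = now_ms  # the window restarts after a valid kick

# Usage: a 10 ms cyclic safety function with a [8, 12] ms window.
wd = WindowedWatchdog(open_ms=8.0, close_ms=12.0)
wd.kick(10.0)          # inside the window -> accepted
triggered = False
try:
    wd.kick(14.0)      # only 4 ms after the last kick -> too early
except WindowViolation:
    triggered = True   # the external mechanism would force a safe state here
```

A real implementation would of course live in hardware, external to the hypervisor, which is exactly why it works as a bottom line even when the hypervisor itself misbehaves.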
It is not purely based on experience, but on a systematic, structured approach, for example a defined or widely adopted checklist, where we understand, for each item in the checklist, how it applies to the use cases of the hypervisor and how we react to it. The way we do this is to leverage a checklist from the literature on temporal interference; in our case we use a checklist presented in a publicly available paper. For each item in it, we perform a detailed analysis of whether it applies to us or not, and if it applies, how we mitigate it. The result can be additional requirements on the hypervisor: for example, it may require the hypervisor to partition the physical processors, or to leverage hardware mechanisms to isolate the RAM that each VM can access. Or the analysis may derive assumptions of use, for example, whether we assume there is a hardware watchdog externally monitoring the execution of the safety function; there can be other assumptions of use at the system level as well. These assumptions are recorded in the safety manual, so the system integrators know, when they integrate the hypervisor into their overall system, what they need to check and guarantee at the system level.

With all these activities, we define the basic concept of the hypervisor, the safety-related functionalities it needs to achieve, and the assumptions derived from these activities. The next step is to implement that. But before we do that, we need to set up supporting processes, which turned out to be another challenge during our certification. We are talking about functional-safety-compliant development, so from a developer's perspective, the first impression may be that we need to comply with the V-model. In addition to developing the actual implementation, the V-model requires a separate phase
at the very beginning to define and specify the requirements of the software. These requirements need to be organized systematically and stated in a semi-formal way, to make sure they are concise, complete, and fulfill the other properties of good requirements. From there, the V-model requires the derivation of the architecture design from the requirements and the derivation of the detailed design from the architecture design. For each derivation, a verification is required: we need to verify the derived work product with regard to completeness, conciseness, correctness, and consistency against the upper-level work product, and the lower-level work product needs to trace back to the higher-level work products to aid the verification activity. In addition to these development activities, on the right side of the V-model there are multiple levels of testing: the software needs to be exercised at the module level, at the integration level, and at the requirement level. All these activities are essential for the systematic development of a piece of software for functional safety. But in addition, before we start the development, or start following this V-model, we need a clearly defined set of supporting processes. These are also important because these processes are meant to be conducted in, again, a systematic way.
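Going back to the V-model for a moment: the bidirectional traceability described above can be thought of as a simple consistency check between levels. A minimal sketch, with hypothetical requirement and design IDs (the real tooling is of course far richer):

```python
# Minimal V-model traceability check: every lower-level item must trace
# to existing upper-level items, and every upper-level item must be
# covered by at least one lower-level item.

def check_traceability(upper_ids, lower_items):
    """lower_items maps a lower-level ID to the set of upper-level IDs it traces to."""
    dangling = {lid for lid, refs in lower_items.items()
                if not refs or not refs <= set(upper_ids)}
    covered = set().union(*lower_items.values()) if lower_items else set()
    uncovered = set(upper_ids) - covered
    return dangling, uncovered

requirements = ["REQ-1", "REQ-2"]
designs = {
    "DES-1": {"REQ-1"},
    "DES-2": {"REQ-9"},   # traces to a requirement that does not exist
}
dangling, uncovered = check_traceability(requirements, designs)
# DES-2 is dangling (bad upward trace); REQ-2 is uncovered (no design item).
```

The same check applies at each rung of the V: detailed design against architecture, test cases against requirements, and so on.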
So they need to be planned beforehand, and we need to follow these processes along with the development of the software. Fortunately, most of these supporting processes are already best practices for software engineering in general, and even for a piece of open source software like ours, most of them map naturally to our daily practices. For these activities, what we did was to write up what we were already doing as plans and provide evidence to the certification body to justify that the plans are followed precisely. Still, there are two supporting processes that did change our daily working model: configuration management and change management.

The objective of configuration management is to achieve a consistent set of work products, to make sure that the work products are delivered in a consistent way and under good management. Our typical open source practice was a single git repository containing all our code as well as some documentation, with releases and versions managed by that git repository alone. But under functional safety practice, in addition to the code and some ad hoc documentation, which of course we still have, we also have other activities generating work products: the specifications, the test plans, the test specifications, the test results, and all the review evidence and other evidence showing that you are working according to your plans.
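One way to picture the meta-repository idea that follows is a versioned manifest that pins every work product to an exact revision. This is only an illustrative sketch; the repository names, URL, fields, and values are all invented:

```python
# Sketch of a configuration-management baseline: one manifest that pins
# every work product (code submodules, externally hosted specs) to an
# exact revision, with a fingerprint so any drift is detectable.
import json, hashlib

baseline = {
    "release": "v2.0-fusa",
    "code": {"repo": "hypervisor.git", "commit": "a1b2c3d"},      # submodule pin
    "test_code": {"repo": "tests.git", "commit": "d4e5f6a"},      # submodule pin
    "requirement_spec": {                                         # web-based tool
        "url": "https://req-tool.example/spec", "version": "17"},
}

def baseline_id(b: dict) -> str:
    """Stable fingerprint of a baseline."""
    blob = json.dumps(b, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

frozen = baseline_id(baseline)
assert baseline_id(baseline) == frozen     # unchanged baseline -> same ID
baseline["code"]["commit"] = "deadbee"
assert baseline_id(baseline) != frozen     # any change to any pin is visible
```

In practice a git tag on the meta repository plays the role of `frozen` here: checking out the tag reproduces the exact set of work products of a release.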
So all these are subject to being managed by the configuration management system, and for all these work products, a single git repository is typically not enough. What we ended up doing was introducing a meta git repository, which has the original git repository managing the code as a submodule. In addition, the meta repository also refers to other systems. For example, the external requirement management system we used in our previous round of certification to manage our requirement specification is a web-based third-party application, so we ended up recording the URL of the specific version of the requirements in the git meta repository; the same applies to the architecture design specification. At the same time, for the test code, we also have separate git repositories, and we add these repositories as submodules of the meta repository, so that in the end we have a meta repository that can be tagged and that provides a consistent version of all the work products we generate throughout the life cycle. Following that, a natural question is about changes, and this is why the change management process also changed a bit how we work on the hypervisor. The typical open source practice is that, where a change is required,
you cook up a patch and submit it to a mailing list or to GitHub issues, and the maintainers review the changes there. But from a functional safety perspective, in addition to the code, we now have a collection of work products, including the requirements, the architecture design, and so on, and any change, before being applied to any of them, needs to be fully understood. This means we need to apply a round of impact analysis, resulting in a summary of changes needed. Of course, this summary not only collects the impact on existing work products, but also needs to consider the impact from the safety perspective: whether the change impacts any safety concerns, raises new safety concerns, or mitigates or resolves some of them. This summary of changes is itself subject to review and approval, and once approved by the architects, we apply the proposed changes to all the work products and merge them together, making sure the work products after the changes are applied are still consistent. These two are the major changes to our daily working model due to the requirements of the supporting processes.

With the supporting processes set up, we are now ready to develop the hypervisor in a systematic way, and the very beginning of this development life cycle is to draft a requirement specification, that is, to specify the requirements of the hypervisor. As the V-model shows, the requirements are set as the roots of the whole traceability and of all the verification activities, so a concise requirement specification is essential for the follow-up phases of the development. But talking about the hypervisor: what are the requirements of a hypervisor?
Well, the most straightforward understanding may be that the hypervisor is to partition the hardware for multiple VMs and to run multiple sets of software stacks simultaneously and separately, with some separation guarantees. But wait, this first attempt needs to be revised, because it is too ambiguous; too much information is missing from this informal requirement. For example, partition to what extent? When we talk about partitioning the hardware, the hardware itself is very complex, especially the device part. Which devices need to be assigned to partitions, and which devices can be hidden from the VMs? This needs to be clarified in the requirements, to make sure that the architecture design can derive the precise mechanisms for this partitioning and the validation activities can validate the hypervisor precisely. Also, what is the capacity of the hypervisor? We all know that a hypervisor cannot support an unlimited number of VMs; at least it will be restricted by the hardware resources, but in some cases it may also be restricted by other aspects. This capacity information needs to be in the requirements as well, so that system integrators know, when they do their system design, what constraints they need to consider, and can make sure that their system design fits within the capacity of the hypervisor. Also, from the VM perspective: the physical processors provide a variety of features, but from the virtual platform perspective, what are the features that are available?
So this information needs to be presented, again, to system integrators, because when they have a safety function which is to be executed in a partition provided by the hypervisor, they need to verify that all the features required by the safety function are supported by the hypervisor. There are consistency checks of this kind when integrating the system, and this hardware-capability-exposure information is crucial in that activity. Also, what is the boot protocol for these VMs? What are the initial states when the hypervisor hands control to these VMs? That is also part of the information that needs to be specified precisely. And there is a bunch of other details that need to be clarified as requirements but are not included in that brief, simple requirement.

As a result, we ended up using a more systematic, structured way to not only analyze but also organize the requirements into one requirement specification work product. Conceptually, the idea we use to analyze the requirements is to model the VM, or more precisely the virtual processors, as a state transition system: a kind of state machine with labels on the transitions. Here the states include register values, memory, and device states, and the state transitions are synchronous instructions or asynchronous events handled by the virtual processor. Of course, this is a very, very large state transition system, and it is barely feasible, if at all possible, to define it in a complete and formal way. So our next strategy is that, since the virtual machine mimics the physical platform, we decided to leverage existing documents which specify the hardware behavior: whenever the feature or functionality provided to the virtual platform, or the virtual VM, is exactly the same as on the hardware,
we refer to the related documents for a detailed specification of the behavior. Those of course serve as part of the additional information provided for the validation, but during our requirement analysis, and in our requirement specification, we save a lot of effort by not duplicating existing information into our work products. Last but not least, the results of these analyses are organized in a specification in a systematic way, covering different aspects: what the virtual platform looks like, what its initial states are, what the state transitions are with regard to different instructions or features, and with regard to the asynchronous events, what the defensive actions are, and other aspects, for example security considerations of the hypervisor. All these are listed and organized in the requirement specification, and this specification is subject not only to review and to the assessment of the certification body, but also acts as the foundation from which to derive the architecture specification and the validation tests of the hypervisor.
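To illustrate the modeling idea described above, here is a deliberately tiny sketch of a virtual processor as a labeled state-transition system. The states, labels, and semantics are simplified far beyond the real specification and are invented for illustration:

```python
# Toy model of a vCPU as a labeled state-transition system: states are
# (register, interrupt-flag) snapshots; transitions are labeled either
# with synchronous instructions or asynchronous events.
from dataclasses import dataclass

@dataclass(frozen=True)
class VcpuState:
    regs: tuple          # e.g. (("RIP", 0x1000),)
    interrupts_on: bool

def step(state: VcpuState, label: str) -> VcpuState:
    """One labeled transition; unspecified labels are defensively rejected."""
    if label == "cli":                       # synchronous: mask interrupts
        return VcpuState(state.regs, False)
    if label == "sti":                       # synchronous: unmask interrupts
        return VcpuState(state.regs, True)
    if label == "ext_int":                   # asynchronous external interrupt
        if not state.interrupts_on:
            return state                     # event held pending, state unchanged
        return VcpuState(state.regs, False)  # delivered, interrupts masked
    raise ValueError(f"unspecified transition: {label}")

s0 = VcpuState(regs=(("RIP", 0x1000),), interrupts_on=True)
s1 = step(s0, "cli")
s2 = step(s1, "ext_int")   # masked -> the event causes no state change
assert s2 == s1
```

The value of the model is not this toy code but the discipline it imposes: every requirement answers "in which state, on which label, what is the next state", which is exactly what the architecture design and the tests then verify.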
With the requirements in place, we ended up executing the later development phases, following our supporting processes, quite smoothly, and that is how we achieved the certificate in our previous round of certification.

On the last page, I will recap some key learnings from our previous experience. The first is that the piece of software, the hypervisor in our case, needs to be fail-safe, and not only fail-safe by applying coding guidelines at the code level or at the detailed implementation level, but also fail-safe by design. We need to clarify and specify precisely what kinds of failures the piece of software may have or may need to tackle and, when the failures are detected, how the software needs to react; the detailed design needs to refer to this design and implement it, and the test cases need to verify that, in the presence of these failures, the piece of software under development really reacts in the specified way. The second is that the management and development need to be conducted in a systematic way, and "systematic" here means, of course, that you have plans beforehand and you collect evidence throughout the execution. The third is that, as a very important foundation of the development life cycle, the definition of the software, the requirements of the software, should be analyzed and specified in a systematic way, so that we can reduce the additional effort of requirement changes in later phases.

That's what I have to share for today's session. Thanks for your time, and if you have any questions, feel free to put them in the chat window; I'm always there to answer your questions. Thanks for your time again.