Hello, my name is Stefano Stabellini, I work for AMD and I'm one of the Xen maintainers. My name is Bertrand Marquis, I'm a principal software engineer at Arm and also one of the Xen maintainers. What is safety? The definition is on the screen, but safety comes into play for software every time human lives are at risk. Think of automotive, avionics, or even industrial situations where a software malfunction could cause harm. In all of those cases there need to be very strict guidelines to minimize the risk of the software putting human lives at risk. These guidelines fall under the umbrella of software safety certifications. One of the most popular, used in automotive and other environments, is ISO 26262, but there are others as well. They typically involve very strict coding guidelines, like MISRA C, meant to reduce undefined behavior and defects in the code, and also strict guidelines on testing, documentation, and how to write requirements for the code; it is a pretty significant endeavor. Why does Xen matter for safety? In the vast majority of situations, software that has safety requirements is actually a mix of critical and non-critical components. That is because typically in any of these environments you are going to have a smaller function that is critical for the actual functionality of the device, of the vehicle if it's a car, and then a number of components that are usually larger and more prone to crashing, but are not strictly required for the functionality of the device. In a car, think of the infotainment system. These environments are called mixed criticality because there is a mix of critical and non-critical components. In these mixed-criticality systems, which are extremely common, it is always better to separate out the critical components into a different environment.
So that the non-critical software, which is larger, harder to check, and easier to crash, cannot affect the critical function. Basically, you don't want to put all your eggs in one basket. A technology like Xen allows you to have many baskets: specifically, it allows you to run the critical function in its own separate domain, its own virtual machine, separate from the others that are running non-critical functions, so that if the non-critical functions crash, they don't affect the critical function, which is completely isolated. Xen has been doing that for years in other environments, for instance highly secure environments: Qubes OS, OpenXT, and other projects are famous for setting up highly secure desktop and laptop environments. The way they work is they use Xen to isolate your critical environment, which is maybe your work environment, from your non-critical environment, which is your personal environment. Already today, we are also using Xen for mixed criticality. We are separating out the critical environment, which is typically a real-time application, an RTOS that is controlling, say, a robotic arm, or the critical function of an industrial device, just to make an example, separating that out from Linux, which usually has the user interface or talks to the cloud. Safety critical systems are mixed-criticality systems as well. So that's why Xen is relevant for safety, but is it a good match? Yes, Xen is a good match. A lot has changed since the early days of the cloud, when Xen was small, but not as small as today, and a Linux environment was always required as dom0. Nowadays things are very different. Xen is actually very small, less than 50,000 lines of code and decreasing; thanks to the Kconfig infrastructure it is going to be even smaller. It doesn't need dom0 any longer: dom0 has become optional, thanks to dom0less. Now Xen can start domains at boot time directly from the hypervisor.
Xen also always had a microkernel architecture, but now we are exploiting it to the fullest to decrease privilege: large amounts of code run in unprivileged domains, with devices assigned to them, but with full protection from the SMMU. The result is that large amounts of your code don't need to be safety certified any longer, because they are not privileged over the system, not privileged over any of the critical functions. We can support real time in terms of interrupt latency and also cache isolation, as we'll see later, and Xen has a very thorough code review process, as well as a security process, which are very important for safety. Here is an example of how Xen can be used in a safety environment, and it's already used in similar environments that don't have safety requirements today. You have the critical application running on Zephyr, which is, for instance, a real-time remote controller of a robot application. You have Linux alongside, with a larger codebase, running larger applications with network access, for instance talking to cloud APIs. In this example, as you can see, there is no dom0. So the safety components are Zephyr and Xen; Linux doesn't have any safety requirements, and there is no dom0. Another, more complex example is automotive. There is a larger CPU cluster and more VMs, but there is one main difference from the example before: now we also have a dom0. But let me go through it step by step. We still have the larger Linux environment, which doesn't have any safety requirements, on the right, for instance running the infotainment system. We have the instrument cluster and a Zephyr real-time VM, maybe a sensor-data-parsing application, which do have safety requirements and are smaller and more tightly written. We also have, on the left, a Zephyr mini dom0. What do I mean by that? I mean you can still have a dom0 for monitoring, for checking the health state of your overall system.
But it doesn't have to be Linux. It doesn't have to be a full, all-powerful dom0. It could be a very small, tiny dom0 environment with just a couple of monitoring functionalities, and it could be based on Zephyr. That is already a work in progress. Before getting into the subject, it is important to make sure everybody has the same understanding of real time. So what is real time in software? On the right side, I have some statistics on the response time of an application. The application is fast, as most of the time it is answering in less than five milliseconds. But being fast is not being real time, and saying that it can do the job on average in five milliseconds is not real time. Here, the application is never answering in more than 100 milliseconds. If I can say I will give a response to an event in no more than 100 milliseconds, then I am real time. So real time is responding in a guaranteed amount of time. Why is this concept important in safety? Let's take an example: my car must stop before a wall. If I take all the right parameters, so the maximum speed of the car, how long I need to stop, when I will detect the wall, I can come up with a maximum time that my software can take to answer a wall detection and actuate the brake. Respecting this is definitely important for safety: if my software does not respond in time, I meet the wall. Turning those kinds of constraints into software is an analysis named worst-case execution time, or WCET. This is an analysis done by demonstration and not by tests; tests are usually used to confirm the demonstration. The path taken for the WCET usually involves a combination of cases that are impossible to trigger by test. In the Xen project, there are several subjects we are investigating around real time. The first investigated subject is interrupts, and in particular the interrupt latency. This is the maximum time until a guest receives an interrupt.
When an interrupt is raised in hardware, Xen catches it first. If the interrupt is for a guest, it will forward the interrupt to this guest. The complete time needed to handle the interrupt depends on the guest and the processing done for the interrupt. The point of the analysis is to check whether Xen can forward the interrupt in a definite amount of time. We did the analysis on Arm64 with a guest alone on its own CPU core. Zephyr was used as the real-time guest and we used the timer interrupt. A real use case would be something like a periodic task running on a real-time OS on top of Xen. We did the analysis by code analysis and inspection, and we confirmed the findings using hardware tracing on a real target. Overall, Xen can take up to 1090 instructions to forward an interrupt, with several big steps. First, saving the guest context: when the interrupt occurs, the guest is running, so we need to save its state first. Then running the interrupt handler, in this case the handler specific to the virtual timer. Finally, restoring the guest context so that it can handle the interrupt. Overall, the number is quite good: on a modern CPU, 1000 instructions is not much. In general, there are lots of conditions which could impact this number of instructions, so to come down to this number we made some assumptions and came up with some limitations. The real-time guest must not use any hypercalls: if the interrupt occurs while we are running a hypercall in Xen, there would be some extra time needed to finish this hypercall before handling the interrupt. No interaction with guests on other cores: if during this process a guest on another core is sending a cross-core interrupt, this would make Xen do more operations. The Xen init phase is also not considered in the analysis: during the initialization phase, Xen is doing a lot of stuff and we cannot consider that the system is set up, so we left the initialization of Xen out of the analysis.
Finally, the configuration is fixed, to remove some possible operations that would require lots of processing in Xen: for example, creating a guest, creating communication channels, or adding memory. This work was a first step, and we discovered several issues or limitations that would need some extra work in the future. There are several cases inside Xen where a cross-core interrupt could be generated that could impact this number: for example RCUs (read-copy-update), which are delayed tasks. We have isolated the guest on its own core; what if we want to have several guests? We turned off WFI/WFE handling, which is not good for power consumption. And PV drivers could not be used, as they could also generate interrupts. The full analysis will be published soon and will contain a lot more details. Another area we are working on is MPU support. Before explaining what an MPU is, let's refresh our knowledge of what an MMU is. The MMU is used to translate virtual addresses to physical addresses and also to limit the accessible addresses of an application or a guest. To translate, we use page tables, which give, for a virtual address, the corresponding physical address. Those are stored in memory. The CPU has TLBs, which are a kind of cache to avoid going through the page tables on every access. This system is hard to use for real time, as the worst-case time required can be high. When something is not in the TLB, we need to go through the page tables in memory, which can have some cache effects. If other cores are doing operations with the MMU, those could end up in TLB flushes triggered by other cores. Also, if there are several guests running on the same core, the TLBs might be flushed by another guest because it has other needs for mappings; as a consequence, our application will behave differently. The MPU is a much simpler system: there is no translation, so a virtual address is the same as the physical address.
The MPU is only used to restrict accessible addresses and set attributes like cacheable, executable, etc., and the MPU is configured using only co-processor registers. So there are no page tables, and no cache effects or issues related to TLBs. Support for this on some Arm architectures, Cortex-R and the Cortex-R82 in particular, is being worked on. The Cortex-R82 has support for both MMU and MPU. At EL2, the execution level where Xen is running, there is only MPU support. This MPU is used for Xen itself and also to control what is accessible by guests. At EL1, the execution level for guests, there can be an MPU, to be used by a real-time OS, for example Zephyr, or an MMU, for non-real-time applications, Linux for example. The architecture and Xen will allow the cohabitation of real-time guests using the MPU and non-real-time ones using the MMU. Xen will support both at the same time, to allow real time to cohabit with non-real-time, and Linux in particular. A proof of concept of this work is already available, and the final support is being upstreamed into Xen. Another thing that is very relevant to real time and interrupt latency is the cache. The reason is that in many instances today there is a single shared L2 cache across the entire CPU cluster. Being shared, what it means is that an application running on core 4, which otherwise would fetch data directly from the L2 cache, which is very fast, instead has to go all the way to DDR, because another application on core 1 accessing other data ends up evicting the information in the L2 cache relevant to core 4. It's very hard to predict whether core 4 is going to be able to fetch data from L2 or from DDR, and the difference in performance is really large. This is particularly damaging for small bare-metal applications or real-time OSes that would fit entirely in the L2 cache, and would otherwise have a far, far smaller and more deterministic interrupt latency.
So the solution to this problem is to split the cache in software, fully dedicating cache lines to each VM, so that you are guaranteed that if the code is small it is going to fit entirely in the L2, and therefore the performance is going to be great and there are not going to be variations. This is what we call cache colouring. Cache colouring means identifying small subsets of the cache, which we call colours, and assigning colours to each VM. The way it works is by exploiting the correlation between physical addresses and cache lines, and then allocating memory to VMs in a smart way, very carefully, so that they always end up hitting the same cache lines, so the same colours. The trick is to allocate one page every 16: so page 0, page 16, and so on is colour 0; then page 1, page 17, and so on is colour 1; and so on. Tying memory pages to colours lets you fully dedicate cache lines to each VM, and therefore you have no more cache interference effects, a much lower interrupt latency and, more importantly, as Bertrand was explaining earlier, a more deterministic interrupt latency, which on our board measures very close to three microseconds. Let's discuss static configuration: what is it? It's defining the system completely, statically, in a configuration file: how many guests and their characteristics, but also the communication channels. Why does it matter for safety? In a safety scenario, we want to avoid all possible random behaviors and make sure our system is exactly the same upon reboot. We want this not only for the guests and the communication channels, but also for the behavior of the hardware and the guests themselves: we want to use the same addresses in memory, the same cores, etc. We also want to reduce the amount of testing: defining everything statically limits the possibilities and usually allows reducing the code size.
In Xen's case this could mean disabling some hypercalls or some sections of Xen code. We also want non-dynamic behavior, to limit complexity: we allocate and create everything at boot and there is no free. All in all, the goal is to reduce the cost of certification by reducing the code base, or making it simpler by reducing the cases. The first example of static configuration is Xen dom0less. This subject was already mentioned by Stefano. Dom0less is a system to define guests in a configuration file: how many guests you want, how much memory for each, which hardware devices they have access to, or how many CPUs they can use. Those dom0less guests are created directly at boot and defined directly in the device tree. You have a simple example here. In a safety world, this has several advantages. The first is to remove the need for a complex dom0: dom0 is usually a Linux, and depending on something that complex for safety is not possible. Second, it allows guests to boot quickly, as they are started directly by Xen without the need to wait for dom0 to boot and then create them. Finally, it reduces the system and Xen complexity, as Xen does not need to support creating guests dynamically. Most hypercalls in Xen are dedicated to dom0, so removing them reduces the Xen code base a lot. This feature is already available in Xen. The second static configuration example is static memory. This is the ability to define the physical address and size to use for all the needs of the system: the guest memory, the Xen heap, used for Xen internal allocations, and the Xen guest heap, which is used for all allocations in Xen related to a guest, for example its stage-2 page tables. All of those can be defined in the device tree configuration. You have a small example here. For safety, this has several advantages. First, the system will be the same upon reboots: there is no allocator involved that could change where a guest's actual RAM physically is.
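A dom0less guest with statically assigned memory is described under the `/chosen` node of the device tree. The fragment below is a rough sketch based on the upstream dom0less bindings; exact property names, cells, and addresses should be double-checked against the documentation of your Xen version:

```dts
/ {
    chosen {
        domU1 {
            compatible = "xen,domain";
            #address-cells = <0x1>;
            #size-cells = <0x1>;
            cpus = <1>;               /* one vCPU */
            memory = <0x0 0x20000>;   /* 128 MB, expressed in KB */

            /* Static memory: the guest RAM is fixed at this physical
             * address and stays the same across reboots. */
            #xen,static-mem-address-cells = <0x1>;
            #xen,static-mem-size-cells = <0x1>;
            xen,static-mem = <0x30000000 0x8000000>;

            module@0 {
                compatible = "multiboot,kernel", "multiboot,module";
                reg = <0x50000000 0x100000>;
            };
        };
    };
};
```

Xen parses this at boot and constructs the guest directly, with no dom0 and no toolstack involved.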
It also reduces possible interferences, mainly thanks to the guest heap, as one guest cannot starve Xen's memory anymore; in a standard system, all allocations are done from the same basket. Finally, if a guest needs to be added to the system in a future version, existing guests can stay where they are without any impact, and the new guest can be assigned memory that is not used. This concept is very important for incremental certification, that is, only certifying what was changed or added to the system. Upstreaming of this feature is in progress, and it will be available in the next Xen version. The third static configuration example is communication. Standard systems use Xenbus-based drivers to communicate, for example the PV network or PV block drivers. Xenbus requires dom0, or at least a Linux system, to be used. The drivers have overall very good performance but are quite complex and require accessing one guest's memory from another guest dynamically. So a simpler, more static system is required. For this, two new features are introduced. The first one is static shared memory. This means defining areas of memory in the system which are accessible by several guests. This is defined in the device tree; you have an example here, where you define the physical memory area and which guests have access to it. The second part is static event channels. Event channels are used for signaling between guests, and we allow here creating event channels statically. Those are also defined in the device tree. Using a combination of those two systems, any communication protocol can easily be built on top of Xen. This is currently being upstreamed in Xen and will be available in the next Xen release. We will also provide examples and support for Linux and other guests. The final static configuration example is static CPU pools. CPU pools are the ability to define which cores are usable by which guests.
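The two communication mechanisms just described are also expressed in the device tree. The sketch below is based on the bindings as proposed at the time (names such as `xen,domain-shared-memory-v1` and `xen,evtchn-v1` are assumptions here and should be verified against the final upstream documentation):

```dts
/ {
    chosen {
        domU1 {
            compatible = "xen,domain";
            /* ... memory, cpus, kernel module as usual ... */

            /* Static shared memory: a fixed region both guests can map. */
            domU1-shared-mem@50000000 {
                compatible = "xen,domain-shared-memory-v1";
                role = "owner";
                xen,shm-id = "shm0";
                /* host physical address, guest address, size */
                xen,shared-mem = <0x50000000 0x50000000 0x10000000>;
            };

            /* Static event channel: local port 1, wired to domU2's end. */
            ec1: evtchn@1 {
                compatible = "xen,evtchn-v1";
                xen,evtchn = <0x1 &ec2>;
            };
        };

        domU2 {
            compatible = "xen,domain";

            domU2-shared-mem@50000000 {
                compatible = "xen,domain-shared-memory-v1";
                role = "borrower";
                xen,shm-id = "shm0";
                xen,shared-mem = <0x50000000 0x50000000 0x10000000>;
            };

            ec2: evtchn@1 {
                compatible = "xen,evtchn-v1";
                xen,evtchn = <0x1 &ec1>;
            };
        };
    };
};
```

With a shared ring in the memory region and the event channel for signaling, a guest-to-guest protocol needs neither dom0 nor Xenbus.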
This concept already exists in Xen; what is new is the ability to statically define the CPU pools. A CPU pool is a pool containing cores: it can have one or several cores and can be assigned a specific scheduler. A core can only be assigned to one pool. A guest can then be assigned to a CPU pool, and several guests can run in the same CPU pool. The schedulers are independent between the CPU pools. CPU pools can now be defined in the device tree, and dom0less guests can be assigned to a specific CPU pool. You have an example here. This feature has been upstreamed and will be available in the next Xen release. Thanks Bertrand. As part of the Xen FuSa (functional safety) special interest group, we are following a series of activities to make Xen easier to safety certify. What does that mean? In practice, today Xen has already been safety certified, together with other software and hardware components. We have at least a couple of such situations in recent years, and some of them were even discussed at Xen Summit in public presentations. So it has been used in safety certified systems already. However, all the work to make Xen safety certified was done downstream: the many required code changes, docs, and tests were all done downstream. So there is significant work that needs to happen, once you take a vanilla upstream Xen release, before you can use it in a safety certified system. The purpose of the SIG is to make Xen easier to safety certify: Xen is already safety certifiable, but we want to make it more safety certifiable, closer to safety certifications, aligning it with the requirements. There are going to be gaps; gaps are expected, and users will have to fill these gaps, but the gaps are going to be fewer going forward and better documented. One of the most important aspects of this is clarity.
Today we actually already follow several of the guidelines required by safety certifications; it's just that we don't talk about it, so it is not clear which ones we follow and which ones we don't. The most important thing going forward is going to be to clarify which rules we already follow, so that a user can easily, or more easily, estimate the work to bring Xen up to standard, filling the gaps for the things we don't yet do. We're starting from the code. Why start from the code? Safety certifications have a number of requirements that go beyond the code itself, such as docs, testing, and requirements. But the code is for sure, as the main output of the project, the main focus of the Xen community, and also the thing we are most expert on, while other things such as docs, requirements, and tests are easier for somebody who is not necessarily very familiar with Xen, but knows about safety certification, to do and to write. For these reasons we are focusing on the code first, and specifically on three aspects. One is coding style, coding guidelines, and MISRA C, and I'm going to talk more about that in a second. We are also focusing on determinism, and you have heard from Bertrand all that we have already done and plan to do in the coming months about interrupt latency determinism, as well as static memory allocations. And the Kconfigs: being able to go smaller and smaller is a lot better in terms of safety certification, because it means fewer bugs and fewer lines of code to safety certify. So we want to enhance the Kconfig infrastructure so that we can remove even more parts of the code from the build. MISRA C is the de facto standard across industry sectors for safe C code, coding style, and coding guidelines; it is maintained and backed by an authoritative organization, the MISRA consortium.
Its pragmatic approach is a good match for Xen; by that I mean that MISRA always states never to sacrifice code quality for compliance, and good quality is definitely of the utmost importance to everyone in the Xen community, to maintainers and contributors. So what is the status? Already last year, we went through a process usually referred to as tailoring, where you go through MISRA and define a subset of the rules that are relevant to the project. For Xen, this subset is a little more than one hundred rules. What we are doing now is going through this list and adopting the rules officially in the coding style. Just last week we agreed with the other maintainers in the community to accept the first 15 rules out of the 100 into the coding style: the easier rules, of course, which are basically rules that we were already following in practice, if not officially. Then we are going to go, slowly, through all of them, carefully also looking at the deviations and whether we want to just document deviations from a rule or instead fix the violations; some of these rules we might decide not to follow at all. In the end, we will automatically scan with MISRA C checkers. Speaking of MISRA C checkers, one of the biggest advantages of following MISRA C is that we can use very powerful static code analyzers to check for violations, both in the existing code base and, even better, in new patches coming in. Automatically scanning with static code analyzers for all of these 100-and-something rules on each incoming patch will significantly ease the code review burden on the maintainers and improve code quality. It's also good to follow MISRA C because it improves not only the safety of the code but of course also its security, because safety and security have a very large overlap.
It also helps with compiler compatibility, making sure we don't have any undefined behavior and don't violate the standard. In terms of tooling, we are focusing on two tools. One is cppcheck, an open source tool: it doesn't have full coverage of the MISRA C rules we care about, but it is open source, easily accessible, and anybody can use it in a few steps. We are also working with Roberto Bagnara and BUGSENG on ECLAIR. ECLAIR is a fantastic tool: it has 100% coverage of the rules we are working with, and it is the tool we are currently using for evaluating the tailored subset of rules. It can also automatically scan changes, like incoming patches, not just the full code base. If you want to see it in action, click on the button and you'll see a few projects, including Xen, and you can see the results publicly, for example for the latest staging and master branches of our tree. Future work: we have a few things in the pipeline. In terms of determinism, we want to make the interrupt handling code path fully deterministic, following up on the work that Bertrand's team has already done: completing and publishing the work, and then making any changes required, for instance to the RCU subsystem, to make the code path fully deterministic. On memory allocations we are actually pretty far ahead: even in just the next Xen release you might get full static memory allocations, so that dynamic allocations are not required any longer and the handling of memory is a lot more aligned with safety certification requirements. As I mentioned earlier, with the Kconfig infrastructure that we have, we can go down to 50,000 lines of code; I think we can go down further, to 30,000 or probably even less. That's what we're looking at now. On the MISRA rules, we've just started, and like I said, the first 15 were accepted last week.
And in a couple of weeks, 15 more will be evaluated and accepted. Documentation and testing: there are strong requirements on both for safety certifications. For documentation specifically, we are working with Doxygen to improve the infrastructure so that it's easier to add code documentation in the future. For testing we are working on two projects. One is GitLab CI: we are improving the infrastructure there so that it's easy to add tests together with patch series, and it runs these tests automatically on new patches coming in. The other is XTF, which stands for Xen Test Framework, and is very useful to test individual Xen interfaces. That's the end of the presentation, so feel free to ask us questions in the chat and we'll be here to answer.