So, first of all, thank you for taking the time, and yeah, now it's working, thank you for taking the time and the interest to attend our talk. My name is Sandro Pinto, and with me today I have José Martins. We are from academia, but we are the founders and main maintainers of the Bao open-source hypervisor. The title of our talk is porting a static partitioning hypervisor to Arm's Cortex-R52, and this static partitioning hypervisor is Bao.

The agenda for today is this one. I will start with a brief introduction, just to provide some background about virtualization, mixed-criticality systems and IoT, the rise of these static partitioning hypervisors, and the motivation. Then I will briefly present the Bao hypervisor, an overview and a few highlights. Then I will hand over to José, who will explain the motivation, implementation and challenges of Bao on this particular novel architecture from Arm, the Cortex-R52. Then we will perform an on-stage live demo (we hope we don't run into the same troubles as the previous presenter), and then we will touch on takeaways and future work.

Well, virtualization is a quite broad technology that has been around for several decades. It basically allows the execution of multiple operating systems on the same platform. A useful analogy is that a hypervisor, or virtual machine monitor, is to an operating system what an operating system is to a process or a task. The main functions are resource management, abstraction, and protection/isolation. Virtualization has been well established for some decades in the cloud and server space, and there are typically two types of hypervisors: type 1, or bare metal, represented below, where the hypervisor runs directly atop the hardware platform; and, on the upper side, the hosted, type 2 hypervisor, where the hypervisor or virtual machine monitor typically runs atop an operating system.

Well, virtualization started to shift a few years ago from the cloud and server space to embedded, mixed-criticality and IoT applications. We have here four market segments that have been using these virtualization concepts. For example, in automotive, the consolidation of the infotainment system with safety-critical control systems, like the brake pedal; also the medical industry and drone applications; and industrial control systems, where you want to run the control logic that moves a robotic arm, but at the same time have a very nice UI with connectivity support, so that you can, for example, control from China a robotic arm in Europe. But embedded virtualization has different requirements from the virtualization that you typically see in the cloud and server space. On one side, it provides this sort of consolidation, which alleviates what is called SWaP-C: size, weight, power and cost. At the same time, in embedded virtualization we don't want much performance overhead, which typically ends up in leveraging as much as possible the hardware virtualization support available on the platforms.
But at the same time, you typically use this technology for what are called mixed-criticality systems, so you need to provide safety guarantees, that is, spatial and temporal isolation, or freedom from interference, typically to comply with safety standards like, for example, ISO 26262 in the automotive industry. At the same time, as I explained before, you have these control loops, and these control loops have strict deadlines and timing requirements, which translate into real-time guarantees. And nowadays, with the ongoing connectivity trend within the IoT, you don't want hackers or attackers from the outside to be able to break your system. So in these sorts of applications there is no safety without security, and vice versa; that's why you need the security aspect as well.

Well, to provide a brief overview of the hypervisor spectrum: on the far left of the spectrum, you have the traditional server-oriented, cloud-oriented hypervisors, which typically have a large code base. Most of them depend on Linux as part of the trusted computing base. They are fully featured, targeting a very broad spectrum of applications, but they typically impose some I/O overhead due to the use of techniques like emulation or paravirtualization. The prime examples in open source are KVM and Xen, in the vanilla form of Xen. Then, a bit further right on the spectrum, you have the embedded hypervisors, with a more reasonable code base. They provide some sort of soft real-time guarantees, they can schedule multiple virtual CPUs atop one physical CPU, and they target generic embedded applications without strict real-time guarantees. We are talking about, for example, Xvisor or ACRN. Then we have the microkernel-style hypervisors. They typically resort to this concept of software capabilities, because they implement just a minimal set of services at the highest privilege level and defer the remaining services, as well as device drivers, to user space, with user-level VMMs. Some of them are formally verified; seL4 is the most well-known example. But they have a trade-off in terms of performance and interrupt latency, because they run most of the components in user space, so they need a lot of communication going through the highest-privileged layer. They typically target mobile and embedded mixed-criticality systems. On the far right of the spectrum, we have this trend towards static partitioning hypervisors. The goal is to have a teeny-tiny code base, typically a few thousand software lines of code, and the idea is to provide strong isolation and real-time guarantees. They mostly target mixed-criticality and IoT applications with strict real-time requirements. The only problem is that this kind of architecture typically makes inefficient use of resources, because you partition your whole system at design time. The examples, as we will discuss, are Bao, Jailhouse, and this new variant of Xen called Xen Dom0-less.
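To make the design-time partitioning idea concrete, here is a minimal sketch of what such a static configuration could look like. This is purely illustrative: the struct and field names are hypothetical, not Bao's actual configuration format.

```c
/* Hypothetical static partitioning configuration, fixed at build time.
 * All names and addresses are illustrative, not Bao's config API. */
#include <stddef.h>
#include <stdint.h>

struct vm_mem_region {
    uintptr_t base;   /* physical base address assigned to the guest */
    size_t    size;   /* region size, carved out at design time      */
};

struct vm_config {
    uintptr_t entry;               /* guest entry point                     */
    unsigned  cpu_num;             /* CPUs pinned 1:1, never rescheduled    */
    struct vm_mem_region mem[2];   /* static memory assignment              */
    uintptr_t passthrough_dev[4];  /* MMIO bases handed to the guest        */
};

/* Two partitions, e.g. an RTOS and a GPOS, defined once at design time. */
static const struct vm_config partitions[] = {
    { .entry = 0x80000000, .cpu_num = 1,
      .mem = { { 0x80000000, 0x08000000 } },
      .passthrough_dev = { 0x09000000 /* e.g. UART0 */ } },
    { .entry = 0x90000000, .cpu_num = 3,
      .mem = { { 0x90000000, 0x40000000 } },
      .passthrough_dev = { 0x09010000 /* e.g. UART1 */ } },
};
```

Everything a VM gets, CPUs, memory, devices, is spelled out before the system boots, which is exactly what makes runtime resource use inflexible.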
Well, as a kind of motivation for this static partitioning virtualization: it was mostly pioneered by Quest-V from Richard West, but that focused first on x86, and the architecture only started getting traction when Jailhouse, from Siemens, started to implement it. The problem, when we started Bao, and just to jump a bit ahead, was that the Xen Dom0-less architecture was still not there, and we didn't much appreciate the fact that Jailhouse, for its own reasons, relies on Linux to boot and stage the system, because this can be a problem when you take security and safety into perspective. This architecture has static resource assignment with one-to-one virtual-to-physical CPU mapping, no memory management at runtime, devices are typically passed through, and it relies as much as possible on the hardware virtualization support available in modern platforms.

And with that, we get to Bao. The name Bao was inspired by the Chinese word "bǎohù", which means to protect. Back then we had some Chinese students at our lab, so we got some inspiration from them. Bao started basically in 2019, when José started his PhD under my supervision, and in 2020 we released the first version of the code as open source. It's a type 1 static partitioning hypervisor, that is, it implements this static partitioning architecture, and the focus is on mixed-criticality IoT systems with strong real-time and security requirements; there you have the link and the logo.

Well, as I said, it's type 1, it runs directly atop the hardware, with this one-to-one virtual-to-physical CPU mapping, completely static memory assignment, device pass-through, and direct hardware interrupt assignment. It provides a minimal layer of inter-VM communication, based on shared memory and notifications, but it doesn't have any dependency on an operating system, external libraries, or any privileged virtual machines or operating systems. It relies as much as possible on the hardware virtualization support. Back then we started by targeting the Armv8-A architecture, because the support for virtualization was quite good, in its different flavors: on the CPU, with the second-stage translation in the memory management unit; the interrupt virtualization support available in the most modern Generic Interrupt Controllers (GIC) from Arm; and the IOMMU, which in Arm is the SMMU (system MMU). It provides some internal isolation, just for security. It also implements superpage support, which is quite important if you want to achieve a very low performance overhead; I will explain in a second. And we have implemented from scratch some techniques to address this freedom from interference, that is, to provide real temporal and spatial isolation at the micro-architectural level, not only at the architectural level. I'm sure you have heard about Spectre, for example. Spectre leverages these micro-architectural timing channels for security attacks, but you can see the same problem from the opposite side, for safety or real time, where you can basically tamper with cache accesses to force, for example, a real-time operating system to take more time to perform a specific task.
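One of these techniques is cache coloring, mentioned throughout the talk. As a minimal sketch of the idea, assuming a physically indexed set-associative cache with illustrative geometry (64-byte lines, 4 KiB pages, 2048 sets), a page's "color" is given by the physical address bits that index cache sets above the page offset:

```c
/* Page coloring sketch with illustrative cache geometry. */
#include <stdbool.h>
#include <stdint.h>

#define LINE_BITS   6                                  /* 64-byte lines  */
#define PAGE_BITS   12                                 /* 4 KiB pages    */
#define SET_BITS    11                                 /* 2048 sets      */
#define COLOR_BITS  (SET_BITS + LINE_BITS - PAGE_BITS) /* = 5            */
#define NUM_COLORS  (1u << COLOR_BITS)                 /* = 32 colors    */

static inline unsigned page_color(uintptr_t pa)
{
    return (pa >> PAGE_BITS) & (NUM_COLORS - 1);
}

/* Each VM gets a bitmap of allowed colors; the hypervisor only hands it
 * pages whose color is in that bitmap, so two VMs with disjoint bitmaps
 * can never evict each other's cache lines. */
static inline bool page_allowed(uintptr_t pa, uint32_t color_bitmap)
{
    return (color_bitmap >> page_color(pa)) & 1u;
}
```

Because coloring selects individual 4 KiB pages by their address bits, it is incompatible with 2 MiB superpage mappings, which is exactly the trade-off visible in the benchmark results that follow.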
Well, here we have two important takeaways about Bao. The first one is the reduced performance overhead. In the graph on top there are two sets of bars; the benchmark is MiBench, an open-source benchmark, its automotive suite. The orange bars are when we configure Bao with superpage support, 2 MiB pages, and as you can see, the performance overhead in this case is always below 0.5%. That is the cost we need to pay for virtualization when you have superpages. You may wonder what the green bars mean: green is when you don't use superpages but use the finest page granularity, which is 4 KiB. Sometimes you need that, for example when you implement techniques like cache coloring to reduce interference, but because you have to walk all the levels of the page tables, you incur a much higher performance overhead; in this case, the maximum is 5%. The takeaway is that when you use superpage support, the penalty is negligible. On the bottom, we have the trusted computing base, or code base, of Bao, which as you can see is below 10K. It depends; as you will see, we now have Armv8, Armv7, but also RISC-V, so it is architecture-dependent, but on average it's 8K software lines of code. This was one of our goals, to keep it as simple and as small as possible, because of potential safety-related certifications. And, as I said, we have implemented from scratch this cache partitioning technique, which is very good for safety, in terms of guaranteeing this freedom from interference, but also for security. As I said, there is always this dual perspective, safety versus security. For security, you can see this heat map that we measured. On top we have the channel, the sort of timing channel that attacks like Spectre leverage to recover secrets, and when we apply this cache coloring technique, which is effective for safety, we also completely eliminate the channel, which is the image below. So these are real experiments with this cache partitioning technique.

Well, over the last three years we have added a lot of support to Bao, both on Armv8 and Armv7, but we have also bet strongly on RISC-V. We have a lot of different platforms; I will explain when we get to the demo part, but we have major platforms from major silicon manufacturers across the different architectures. We don't have support for RISC-V silicon at the moment, because it doesn't exist, but we have extended RISC-V cores with hardware virtualization support, so we support all of those platforms as well. We also support QEMU, both for Arm and RISC-V, and the Arm FVP model, which means you have no excuse if you want to try: without spending a single cent on a hardware platform, you can grab the code and try it by yourself. We also, of course, support the Raspberry Pi 4 platform; these days it's quite difficult to get one, but they were quite cheap, and I hope we get more platforms. We also support the different firmware, TF-A from Arm and OpenSBI for RISC-V, and different guest operating systems running on top: Linux, Android, RTOSes like Zephyr (we will actually demonstrate the demo with Zephyr), but also FreeRTOS and ERIKA; ERIKA is an automotive-centric, AUTOSAR-style operating system. Well, and just to set the stage for José: we started first on application-class processors, but the whole idea since the beginning was to create a very small code base that would also make it possible to have Bao running on these widely available microcontrollers and real-time processors.
And a quite interesting trend over the last few years in industry is the inclusion of hardware virtualization support in these real-time microcontrollers as well. Arm has the Cortex-R52, and also the R82; Renesas has the RH850 with virtualization support; and Infineon, quite an important player in the automotive industry in Europe, also decided to include hardware virtualization support in the new AURIX TriCore, the TC4. So we felt this was a very interesting niche to explore with Bao, and that's why we started the implementation with the Cortex-R52. And now I hand it over to José.

Good, can you hear me? Yep. So now for the more technical part of the talk: I'll talk about the challenges we had when porting Bao to this Cortex-R architecture, and I'll start with the main differences between the Cortex-A and the Cortex-R architectures. The first difference is that, if you are familiar with Armv8-A, you have all these privilege modes: on top of user, supervisor and hypervisor you have the secure monitor for TrustZone, and on the secure world you also have the hypervisor, the trusted OS and the trusted applications. On these Cortex-R cores, specifically on the Cortex-R52, the 32-bit version, which we will be addressing today, you only have the hypervisor, supervisor and user modes; there is no TrustZone here, no secure world. The second main difference is that while on Cortex-A processors you have the traditional MMU, which brings translation and protection with a lot of flexibility in the way you allocate and lay out memory, on the Cortex-R CPUs you only have the MPU, which you'll also find on the lower-end Cortex-M line. You still have the protection, but because you don't have the translation through page tables, with its implicit memory accesses by the CPU, you get more predictability, which is the whole point of this architecture. On the hardware side, I believe this mainly brings a lot more simplicity; on the software side, that is kind of debatable.
So the Cortex-R52, or the Armv8-R architecture, has this MPU, which implements the PMSA, the Protected Memory System Architecture, instead of the Virtual Memory System Architecture. Essentially, all logical addresses are physical addresses, so we always have a flat identity mapping. The innovation in this new line of processors from Arm is that you have virtualization support: a dual-stage MPU, with a second-stage MPU managed by the hypervisor and a first-stage MPU managed by the OS, the guest, whatever. The Cortex-R52 acts on the whole 4 GiB physical memory space, and, I believe in the next slide I have more details, you have a number of protected memory regions where you establish the base and the limit of the addresses that the software can access, so it's a whitelist, and you have access permissions and memory attributes. A few important points: these entries have a granularity of 64 bytes, must be aligned to 64 bytes, and cannot overlap; two overlapping entries in the MPU will generate an exception. Also, although architecturally there is a maximum of 255 of these MPU entries, which I believe would be more than enough, in the actual Cortex-R52 implementations you have a lot less: up to 24 in the physical cores, and on the real platforms we saw, one had about 20, the other 24. And you have a number of hypervisor-specific registers to manage these MPU entries.

Another particularity of this MPU relates to what Arm calls the translation regimes. On your leftmost side you have the translation regime for the hypervisor itself: this second-stage, EL2 MPU controls the accesses of the hypervisor itself, and then the same MPU controls the accesses of the guest. So the entries are shared between the hypervisor and what the guest can see. On Cortex-A you have completely separate page tables for the hypervisor and for the guest; that is not the case here: you have a single MPU for both privilege levels. In the MPU entries you can also control what Arm calls the shareability of the region, which mainly tells the processor on which set of cores the cache coherency should act, and it also has a few implications on the memory model, things like barriers; we'll see in a few slides that this was a problem for us. Then there are the permissions, and here the important point is that when you give access to a guest, you also give access to the hypervisor, so you can't have the guest see something that the hypervisor cannot see. The same is true for execution: if you give execution permission to a guest, you also give execution permission to the hypervisor, which might be a problem from a security perspective. On top of that, you have the memory attributes, which define whether the memory is cacheable or non-cacheable and so on.
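As a small illustration of the region constraints just described (64-byte granularity and alignment, no overlap between enabled entries), here is a hedged sketch of the validation a hypervisor might perform before programming an entry. This is illustrative code, not Bao's implementation.

```c
/* Sketch of the PMSA region constraints: 64-byte granularity and no
 * overlap between any two enabled MPU entries (overlap faults). */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MPU_GRANULE 64u

struct mpu_region {
    uintptr_t base;   /* inclusive, must be 64-byte aligned */
    uintptr_t limit;  /* exclusive, must be 64-byte aligned */
};

static bool mpu_region_valid(const struct mpu_region *r,
                             const struct mpu_region *active, size_t n)
{
    if (r->base % MPU_GRANULE || r->limit % MPU_GRANULE) return false;
    if (r->base >= r->limit) return false;
    /* Overlapping entries are architecturally forbidden. */
    for (size_t i = 0; i < n; i++) {
        if (r->base < active[i].limit && active[i].base < r->limit)
            return false;
    }
    return true;
}
```

With only ~20 to 24 entries per core, and those entries shared between hypervisor and guest, every region you burn on a permissions split is an entry the guest no longer has.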
So now, for the port itself: the first issue we ran into when porting Bao, which was specifically designed to use an MMU, to this MPU architecture was the memory layout. Bao relied a lot on the MMU. For example, it is statically linked, so Bao links at the same addresses for a given architecture, and we used the MMU extensively to limit what the hypervisor can access, following the least-privilege principle, so that it only accesses what it really needs to, as much as we could. We also did a lot of aliasing with the MMU on the memory space, and this created a quite fragmented address space. Another point is that we also, for example, include the guest images in the hypervisor binary itself, so that we can have a compressed binary. In the physical memory layout, which you see on your left, we had to separate the BSS regions from the data regions and from the loadable regions of the hypervisor. We also used a slab allocator that at runtime allocated hypervisor objects, mainly for the CPUs to communicate, which created an even more sparse address space. The main issue here is that we have a limited number of MPU entries, which are shared between the hypervisor and the guest, as I said before. This is an issue because we can quickly run out of them if you want a kind of arbitrary address space when configuring your guest. There is also the issue I mentioned regarding the permissions: you have to share the permissions, which is not as great. For example, with the MMU we can have parts of the image be execute-only or read-only, and only parts be read-write; here, for that, we would have to use a lot of MPU entries, so this was a problem for us. We also used different MPU entries to restrict what each CPU can see: per-CPU structures that the other CPUs use, we limited that, and this was not scalable at all as you increase the number of cores. Here is an example: on the rightmost side you can see that the hypervisor is already using five or six MPU entries, and this is just a toy example, so in reality, if we did a naive translation of our memory layout, we would be using more than half of the MPU entries available on the Cortex-R cores.

So what we did for this: first, since it's an identity mapping, and we didn't want to use position-independent code, we provide a way to compile the hypervisor to wherever you want, a configurable base address for the image, and we restructured the layout so that all these structures I talked about in the previous slide are really contiguous. We leveraged the fact that the hypervisor is static and did a lot more during compile time, so that we have only one region for the image. Then we have one region for the CPU, one region for the VM that that CPU is running, and we also have to use a couple of entries for the console, the UART, and the interrupt controller. So right now we use five of these MPU entries, independently of which platform you run on. It's more scalable; we would like to use less, but that's what we came up with.

So next, how we manage memory here. The original Bao memory management model was based on page granularity, fixed at 4 KiB pages, and we had this interface where you allocate physical pages, allocate virtual address space, and map one onto the other. We wanted to maintain essentially the same model, so that we don't have to change the core of the hypervisor a lot. What we did was to allow the page size to be configurable: we configure it to the MPU granularity, but you can configure it to be more. And we added an extra wrapper on this memory management interface that merges the physical and the virtual allocation, so you do it in one step: the Cortex-A port separates them, but the Cortex-R port does just one step and verifies that what you are trying to map is an identity mapping, so it's valid.
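A minimal sketch of that merged allocate-and-map step might look like the following, assuming a physical frame allocator and an identity-map primitive; the names are illustrative, not Bao's internal API.

```c
/* Sketch of the merged allocate+map wrapper: on the MPU port, "virtual"
 * and physical allocation collapse into one identity-mapped step. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

extern uintptr_t pframe_alloc(size_t size, size_t align);        /* assumed */
extern bool mpu_map_identity(uintptr_t base, size_t size,
                             uint32_t flags);                    /* assumed */

/* The page size is configurable; here it is the MPU granule. */
#define PAGE_SIZE 64u

static uintptr_t mem_alloc_map(size_t size, uint32_t flags)
{
    size = (size + PAGE_SIZE - 1) & ~(uintptr_t)(PAGE_SIZE - 1);
    uintptr_t pa = pframe_alloc(size, PAGE_SIZE);
    /* On an MPU there is no translation: the "virtual" address must equal
     * the physical one, so the map step only validates and inserts an
     * identity region. */
    if (pa == 0 || !mpu_map_identity(pa, size, flags)) return 0;
    return pa;  /* va == pa */
}
```

The point of the wrapper is that the architecture-independent core of the hypervisor keeps calling one allocation interface, and only the backend decides whether that means page tables or MPU regions.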
And then we separated the MPU management into two layers. First, what we call a virtual MPU, which is architecture-agnostic, and where we have one virtual MPU per address space, the same way you would have one page table per address space; and then a physical layer, where each core locally maps multiple virtual MPUs onto the physical MPU it has. There are a couple of problems with this. First, if you change one mapping locally, you have to tell the other cores that you changed it, so you have to do a kind of broadcast. Second, the number of virtual MPU entries might be larger than the physically available ones, and we have mechanisms to manage that. And then there is the fact that we have these non-overlapping regions that must share permissions between the guest and the hypervisor; it really complicates things a lot if we want to guarantee that we are giving the minimum permissions possible to each one.

So next, and I'll try to be quick because we're running out of time, the next step was interrupt injection. If you're familiar with how interrupts work on the Arm architecture: you have this distributor, which receives the interrupts from the peripherals and, according to some configuration, forwards them to the CPU interfaces, and the interface decides whether or not to forward the interrupt to the CPU. You have virtualization support that works out of the box on the Cortex-R: the hypervisor has to fully emulate the distributor part, and then the hardware has support so that the hypervisor can inject interrupts through a virtual CPU interface. Now, the Arm architecture has two interrupt lines going to the processor: one is called IRQ, and that's the one used by all hypervisors right now; the other is called FIQ, and it's typically reserved for TrustZone, for the secure world. Our observation here was: there is no TrustZone here, but looking at the hardware, we still have the same two interrupt lines, so we can assign one to the hypervisor and one to the guest. There is a small point on the previous slide: the fact that the hypervisor has to receive all interrupts and inject them into the guest through the virtual interface really increases the interrupt latency, and one of the goals of these hypervisors is exactly to reduce this kind of overhead; for real time, we want low, deterministic latencies. So what we did here, basically, was: let's give the IRQ line to the guest, and the hypervisor takes the FIQ line. You still emulate the distributor, you still control which interrupt lines the guest has access to, but you get completely direct injection from the hardware to the guest, without the hypervisor losing anything: the hypervisor still has its own interrupts, it still has full control of the interrupt controller, and you get native latency when injecting interrupts.
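At the architectural level, this IRQ/FIQ split boils down to how the hypervisor configures interrupt routing in the HCR register. The following is a hedged sketch for AArch32 EL2, using the documented HCR bit positions; the setup function is an illustration of the technique, not Bao's code.

```c
/* Sketch of the IRQ/FIQ split for AArch32 EL2 (e.g. Cortex-R52).
 * HCR.IMO = 0 leaves physical IRQs routed directly to the guest;
 * HCR.FMO = 1 routes FIQs to EL2, so the hypervisor claims the FIQ line. */
#include <stdint.h>

#define HCR_FMO (1u << 3)   /* route FIQ to EL2 (hypervisor)       */
#define HCR_IMO (1u << 4)   /* route IRQ to EL2 (deliberately off) */

static inline uint32_t hcr_read(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 4, %0, c1, c1, 0" : "=r"(v));
    return v;
}

static inline void hcr_write(uint32_t v)
{
    __asm__ volatile("mcr p15, 4, %0, c1, c1, 0" :: "r"(v));
    __asm__ volatile("isb");
}

static void irq_direct_injection_setup(void)
{
    uint32_t hcr = hcr_read();
    hcr |= HCR_FMO;    /* hypervisor takes the FIQ line            */
    hcr &= ~HCR_IMO;   /* guest receives IRQs with native latency  */
    hcr_write(hcr);
}
```

The distributor emulation still decides which interrupt IDs the guest may enable; the routing bits only decide which physical line each world listens on.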
So there are also a few other issues. Since we don't have an EL3, there is no standard trusted firmware for these cores, so we had to do a lot of work managing the platform itself: low-level setup like clock setup, power management, things like that. For Bao this is a problem, because Bao relies on the trusted firmware for these low-level operations on Cortex-A, and now we have to include platform-specific code, which is something we didn't have until now.

A last point regarding the cores. We did most of the port using the Arm FVP models, which give you a platform model essentially equal to the Cortex-A ones, but when you go to the real platforms, with the real cores, you have a slightly different thing; it's more akin to what you find on MCU platforms than on application platforms. So the first thing we did was some initialization code to move data from flash to the SRAMs, and we changed the way the hypervisor is linked to allow this. We also had some issues on the Cortex-R52 related to the shareability domains. The main issue is that on the Cortex-R52, although the vendors provide clusters of cores, each core is its own shareability domain, which means that sharing things through the caches, and things like spinlocks, which rely on the cores being in the same shareability domain, don't work correctly. The fix, although we are still looking for more issues, because our spinlocks started acting really erratically, was to configure all hypervisor regions as part of the outer shareability domain, so that instructions like those with store-release semantics act on the outer caches and not only on the inner caches of the core. There are also some core-specific registers in the Cortex-R52 implementation that we didn't have to manage on other cores and that we have to manage here, to give the guest access to some private core peripherals.

So that's pretty much it; I will try to do a brief demo of this running on the FVP platform. We have this repo on GitHub, bao-demos, that has a lot of demos that can run on almost any of the platforms we support: a demo running Linux, a demo running FreeRTOS, Zephyr; of course, on these cores we don't run Linux (I don't know if it is even possible, but we don't). Right now we'll show you how to run Zephyr plus a bare-metal application on the 32-bit FVP-R. Essentially, what you have to do when you clone this repo, I don't know if you can see it, is just to set up some environment variables: tell it which cross compiler you want to use, which platform you want to use, which demo you want to run; so we want the FVP-R, and we want Zephyr plus bare metal. Then you run this; we have a few makefiles that automate everything: fetch the hypervisor, the guests, apply a few patches and the configuration, and build it. I already did this beforehand, as otherwise you would have to wait a few minutes here. Then, for the models that can run on a computer, you can just do make run and the model will start.

And here we have it: we have assigned one UART console to one of the guests, and you have the bare-metal application running right here, on two cores, receiving some interrupts; you can trigger more interrupts by typing characters. And you have a simple console application on Zephyr that allows you to communicate with the other guest: there is this IPC command that writes a message to the bare-metal guest, so the bare-metal guest receives an interrupt, injected via the hypervisor, telling it the other guest sent something, and it prints what was written. Also, each time the bare-metal application receives one of these character interrupts, it writes to a shared memory region, and you can read this shared memory region from the Zephyr side, and it tells you, okay, I received this many interrupts.
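The demo's communication pattern is the classic shared-memory-plus-notification scheme. Here is a heavily hedged guest-side sketch: the shared region address, the notification primitive and the IRQ number are all assumptions for illustration, not Bao's inter-VM communication ABI.

```c
/* Sketch of shared-memory + notification IPC between two guests.
 * SHMEM_BASE, IPC_IRQ and notify_peer() are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define SHMEM_BASE ((volatile char *)0x70000000) /* assumed shared region */
#define IPC_IRQ    42                            /* assumed notif. IRQ    */

extern void notify_peer(unsigned ipc_id);        /* e.g. hypercall wrapper */

/* Sender side: write the message, then poke the other guest. */
static void ipc_send(const char *msg)
{
    strcpy((char *)SHMEM_BASE, msg);
    /* Publish the data before the notification; note the talk's caveat
     * that on the R52 each core is its own shareability domain, so the
     * shared region's attributes must make this barrier effective. */
    __asm__ volatile("dmb" ::: "memory");
    notify_peer(0);
}

/* Receiver side: runs in the handler registered for IPC_IRQ. */
static void ipc_irq_handler(void)
{
    char buf[64];
    strncpy(buf, (const char *)SHMEM_BASE, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    /* ... print or process buf ... */
}
```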
So it's a really simple demo, but it demonstrates the basic concept of having two segregated guests, each running on separate cores, that can communicate with each other, and you can try it as of today; you can just download it and run it.

I don't think this mic is working, so I will continue with this one. So, that was the demo on the model, and this is one of the few Cortex-R52-compliant platforms available, not yet public on the market, still pre-production, but thank you NXP, and thank you Brad, for the loan of this platform; we have it in our lab. This is the S32Z270, and we have the very same demo, which means we have the port of Bao for this real hardware platform as of today. We also have this Zephyr plus bare-metal system running on this platform as of today, with the Lauterbach tools, but soon out of the box from flash or SD card.

Well, just to conclude: we have demonstrated that we have completed the architectural port of the Bao hypervisor for this 32-bit Armv8-R architecture, the Cortex-R52. (I think this one is working, yeah.) In the meantime we have also implemented the 64-bit Armv8-R support, which means variants like the Cortex-R82, where you can have, for example, on the bottom layer an MPU, but on top an MMU, which means that, from the hypervisor point of view, you have the determinism and predictability guaranteed by the MPU, but on top you can run general-purpose operating systems like Linux out of the box. We also have a demo with Linux and Zephyr for this platform on GitHub: you can go to the bao-demos and try it by yourself. As I said before, we also have work in progress on real hardware platforms; the only one you can buy publicly is the Renesas one, and we also have the port for the NXP platform I talked about before. There are a few open issues on real hardware platforms at the system level: they have custom system MPUs. If you are familiar with application processors, you have the standardized system MMU, which behaves quite the same across platforms once the implemented IP is compliant with the specification; here what we have seen is a lot of heterogeneity, which means you need to implement different support, new drivers, each time you get a new hardware platform. There is also the coherence problem across the clusters, and the cache partitioning. We are also exploring, and that's why, as José was explaining, we implemented this virtual MPU on top and the physical MPU on the bottom, the same technology for other architectures: in particular, we are already exploring, with a company that is building a RISC-V SoC, a similar architecture for RISC-V, similar to the Cortex-R52 but in RISC-V, and we are also starting with the AURIX TC4. So that's all for today; a few pointers if you want to follow Bao on different social media channels, and we are open to address all your questions. I think we need to repeat the question, otherwise you can go ahead. Thank you. Is it working?

[Question] Is there a way today, on the R52 version, to protect against DMA accesses from a guest that would allow it to circumvent the isolation provided by the MPU?

[Answer] The question is about prevention of DMA-triggered attacks, because they happen at the system or platform level, not at the core level, and the MPU is at the core level. For this case, as I said, on real platforms, different silicon
manufacturers have different implementations with different capabilities. I can comment on the NXP one, because that is the one we have worked with; we have seen that ST also has something similar, but we don't want to comment on that for now. On NXP you have what is called the resource domain controller. It's a kind of minimalistic implementation for this sort of core, and it basically allows you to enforce isolation at the platform level, and this is the case for preventing any DMA-capable device from triggering any sort of security attack or whatever it is. Thank you. So, if there are no more questions, thank you all for coming, and feel free to try Bao and provide your feedback anytime. Okay, thank you.