Yeah, welcome everybody. This talk is going to be about zero-overhead virtualization, with a special focus on virtualization on embedded systems. These are the preliminary results of joint work of our university, the University of Applied Sciences in Regensburg, in a joint project with Siemens Technology. My name is Ralf Ramsauer. I'm here together with my colleague Stefan Huber and the head of our lab, Wolfgang Mauerer.

We are talking about virtualization on embedded systems, and embedded systems are sometimes safety-critical systems, and safety-critical systems are always real-time systems. So we are centering on virtualization of real-time systems. Now, if we virtualize such systems, we would like to have no overhead caused by the hypervisor or by the virtualization at all, because no activity of the hypervisor, so no traps, means no impact on the real-time capabilities of the system. In this talk, we will present the current boundaries of embedded virtualization and show whether the concept of zero-overhead virtualization is already possible on different architectures or not.

But before I start, I would like to set the frame of this talk and explain what we do not mean by embedded virtualization. And there I would like to start with data center or cloud virtualization. You probably know products like QEMU, KVM, Xen, VMware, or even Hyper-V. This is the classic standard stuff that you will find out there in the cloud. The key aspect of those virtualization technologies is over-provisioning of hardware resources. That means your guests can have more resources available than are actually available on the platform: CPUs, memory, devices, and so on. This is made possible by sharing physical resources through sophisticated techniques like para-virtualization. In the end, you hope that your guests never exceed the actual physical resources of the system, and if they do, you have to rotate the resources or somehow throttle the guests down. The overall optimization criterion of those systems is throughput, throughput, throughput, and performance.

In contrast to data center virtualization, there is embedded virtualization, with products like the Bao hypervisor, Intel ACRN, PikeOS, or Jailhouse. Jailhouse is the system that we are going to talk about in this talk. As we target safety-critical systems, the optimization criterion here is always real-time, real-time, real-time, and determinism, respectively low latencies. So instead of resource sharing, what we do here is pure resource allocation: you dedicatedly assign resources of the platform, like CPUs, to your guests. And instead of over-provisioning hardware, we only have a fixed set of guests in the system, which means that there is no need for hardware sharing at all. We exclusively assign memory to the guests, we exclusively assign CPUs and devices to the guests. So instead of sharing resources, we dedicatedly allocate those devices. By that, we get rid of tons of code that is needed for over-provisioning, like the para-virtualization stuff; we do not need that at all. So embedded hypervisors usually have a very, very tiny and small code base, which is especially beneficial for certification of those systems, because in safety-critical environments you often need to certify your stuff.

So here we are, and the question is: why do we actually want to have virtualization on embedded systems at all?
The principal argument for this is the consolidation of formerly separate devices. Think of the classic example, like it is shown here on the right: a car. This could also be an industrial production line or something else, basically everything where you have a lot of distributed devices. There are dozens of control units in a car that somehow need to communicate with each other, and each of those devices comes with its own CPU and a huge software stack. This is what we call the traditional approach, where you have separate devices for separate concerns. So you have two or more appliances or control units, say the infotainment system on the left and some ECU, which can be the battery management or the instrument cluster, you name it, plus some kind of communication bus, for instance a CAN bus, to allow those devices to communicate with each other. And nowadays those systems already have SMP CPUs, so there are powerful CPUs inside those machines, and many of them are just there for idling, because most of the time they basically do nothing. So even on those systems you already have SMP CPUs; SMP is everywhere.

But from an economic and technical perspective, it would be much easier and more maintainable if you were able to have all those systems on a single platform. So instead of having dozens of devices, you only want to have one single device that is responsible for everything. The problem is that those formerly separated systems serve tasks of different criticality: the ECU has a higher criticality than the infotainment system. And if you consolidate them on a single platform, you get a mixed-criticality system, like it is shown here. In this architecture, you have a shared software layer, which is your operating system, and that will probably be Linux, because the infotainment system, for example, needs the feature-rich ecosystem that comes with Linux, with dedicated software stacks like the Bluetooth stack, or maybe you will even have to run Android on top of that system. So what we have here is a system of systems on a chip, with the downside that a failure in the infotainment system might tear down the whole platform, including the ECU.

So what you want is to safely isolate those two worlds, so that one cannot interfere in an unintended way with the other. What we need is safe isolation, and this is exactly the root of embedded virtualization: safe isolation of different computing domains on one single platform. And we can do this by leveraging the virtualization extensions that we already have on those platforms; we can rely on the strong isolation guarantees that they give us. This is the reason why we would like to have zero traps, so zero hypervisor exits, on those platforms: if we manage to have zero hypervisor exits, then we maintain the real-time capabilities that are given by the design of the system and the architecture. And we not only get less overhead; think of the certification case again. If there is no hypervisor software running on the system during runtime, then you also do not need to certify that code. Fewer traps do not make certification effort disappear, but they at least make it a bit easier.

And of course, we have a solution for this. I'd now like to present the basics of the hypervisor Jailhouse. It's a Linux-based static partitioning hypervisor. We did not implement Jailhouse on our own.
It's an open source project that was initiated by Jan Kiszka, maybe he's even here. We develop it together with Siemens, and we joined the development of Jailhouse several years ago.

Just to give a brief sketch, the basic idea is to offload all uncritical work to Linux. So we put Linux on the bare-metal platform, just as we're used to putting Linux on any platform. This has the strong advantage that Linux does all the heavy lifting for us: booting the platform, device initialization, platform setup, and so on. We don't need to do that in the hypervisor. Later, we load our Jailhouse kernel module, shift the operating system into the state of a virtual machine, and insert the hypervisor underneath Linux. The hypervisor needs to know what it has to do, so we tell it the topology of the system and how it has to partition it by providing a system configuration. In the end, we have Linux running in a virtual machine.

The next step is to partition the whole system. We do this by calling the hypervisor and providing it with a configuration of the partition that we would like to create. Linux will shut down the CPUs that will be used in the new partition, the hypervisor claims them, and we have created a new cell. Of course, the cell now needs to run some software, some inmate code. In Jailhouse language we call the partitions cells, and a cell runs an inmate, the software of the cell. This can be any secondary operating system, even Linux, or a small, tiny real-time operating system (a minimal sketch of such an inmate follows below). We kick off those CPUs, and then we have the exact state of the machine that we would like to have: the system is partitioned, and the hypervisor takes care that those partitions won't interfere with each other.

Just a few further comments. We have no device drivers, no resource sharing, no vCPUs, no nothing, which means we can get by with only about 7,000 lines of code for a 32-bit ARM platform, for instance. Our goal is to eliminate traps and hypervisor overhead: during the operational phase, when the system is running, we would like to have no overhead from the hypervisor at all. And a tiny code base, again, is beneficial for certification.

At the moment, we support three different architectures: ARM, 32-bit and 64-bit; Intel and AMD x86; and during the last few months, our research group has been porting the Jailhouse hypervisor to the RISC-V platform. The hypervisor extension of RISC-V just got ratified, and the port is running. But during the port of Jailhouse to RISC-V, we realized that problems we already know from other platforms appear on RISC-V again, sometimes in similar ways, sometimes in even worse ways. And this is why we decided to give this talk: to give an overview of when, why, and on which platforms we still have the need to trap, whether we can avoid those traps, and what techniques are available to avoid trapping. And we will finally try to answer the question of whether zero-overhead virtualization is actually possible or not.
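To make the notion of an inmate mentioned above a bit more concrete, here is a minimal bare-metal inmate sketch in C, loosely modeled on the demo inmates that ship with Jailhouse. The `inmate_main()` entry point and the `printk()` helper come from Jailhouse's small inmate support library; everything beyond that is illustrative, not production code.

```c
#include <inmate.h> /* Jailhouse's bare-metal inmate support library */

/* Entry point of a bare-metal inmate: once the cell's CPUs are
 * kicked off, execution starts here, and at runtime the hypervisor
 * ideally never gets involved again. */
void inmate_main(void)
{
	printk("Hello from a Jailhouse cell!\n");

	/* A real inmate would now program its directly assigned
	 * devices and timers and enter its real-time control loop. */
	for (;;)
		;
}
```

From the Linux root cell, such an image is then created, loaded, and started with the `jailhouse` command-line tool (cell create, cell load, cell start).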
When we have a virtualized system, we need to differentiate between avoidable and unavoidable overhead. Unavoidable overhead is systemic overhead, and once we accept that, we can systematically categorize the different types of overhead that appear on those systems. This is what Stefan is now going to explain.

As we have already said, we can distinguish between unavoidable overhead and avoidable overhead. One example of unavoidable overhead is the additional stage of address translation. In this picture we can see a non-virtualized setup: as you know, we have to translate between virtual memory addresses and physical memory addresses. In a virtualized setup, we basically have to do the same twice: we translate from guest-virtual memory to guest-physical memory, and in another stage from guest-physical to host-physical memory. This is of course some overhead and costs some time. Additionally, it puts extra pressure on the translation lookaside buffer (TLB). We can cope with this a bit if we use huge pages for the guest-physical to host-physical mapping, because we need fewer entries then. Other examples of unavoidable overhead would be partitioning, which only happens during system initialization and setup, so it's not really a problem, and maintenance, which we usually don't need at all for our use cases.

Now an example of avoidable overhead. Here we can see a non-virtualized setup again; it's a toy example. We see an interrupt arriving at the operating system, and everything is fine. In a virtualized setup, on a system that's not well suited for virtualization, we may have the problem that interrupts cannot arrive directly at the guest. The hypervisor has to take the interrupt and then re-inject it into the guest. So execution is intercepted by higher-privilege-level code; this is what we call trapping. And traps mean additional code to execute: you have overhead, you have increased latencies. Other examples of unavoidable traps are hypercalls for cell management, which are not on hot paths and usually no problem, and interception in case of access violations, but this would only happen if we had malicious cells that tried to break something.

So, talking about trapping: as we already said, higher-privilege-level code has to intercept. Let's look at the right side first, because we use similar illustrations on the rest of our slides. Here we can see the privilege levels: at the bottom the machine mode, where the firmware resides; then the hypervisor mode, where the hypervisor resides; the supervisor mode, where the kernel resides; and the user mode, where applications run. On x86 we have pretty much the same thing, but it's called rings: ring 3 for applications, rings 1 and 2 for device drivers, which apparently nobody ever really decided to use, ring 0 for the kernel, and ring -1 for the hypervisor. On ARM it's pretty much the same concept; it's called exception levels or privilege levels, depending on the platform. Details might differ, but it's the same concept.

So what are the consequences of overhead? Here we see a non-virtualized setup again, a bare-metal application running without a hypervisor. You can see these arrows; they mark the arrival and response time for an interrupt. And you see this red line; it's a deadline, and we always meet the deadline here, so we have no problem. In a virtualized setup, running the same application with a hypervisor, we can see that the time between arrival and response is a bit higher, but we still meet all our deadlines, so it's kind of fine. However, we want to avoid trapping, to avoid any hypervisor code running at all: if we have no code to run, we don't have any bugs.
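To come back to the huge pages mentioned above, here is a small back-of-the-envelope calculation of stage-2 mapping counts; the 512 MiB guest size is just an assumed example, and the page sizes are the generic 4 KiB and 2 MiB ones.

```c
#include <stdio.h>

/* Back-of-the-envelope: how many stage-2 mappings does a guest with
 * 512 MiB of RAM need, depending on the page size used for the
 * guest-physical -> host-physical stage? Fewer mappings mean fewer
 * TLB entries competing with the guest's own stage-1 translations. */
int main(void)
{
	const unsigned long long guest_ram = 512ULL << 20; /* 512 MiB */
	const unsigned long long page_4k = 4ULL << 10;
	const unsigned long long page_2m = 2ULL << 20;

	printf("4 KiB pages: %llu stage-2 mappings\n", guest_ram / page_4k);
	printf("2 MiB pages: %llu stage-2 mappings\n", guest_ram / page_2m);
	return 0;
}
```

This prints 131072 mappings versus 256, which is why huge pages noticeably relieve the second translation stage.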
Now we define four types of requirements for trapless static partitioning, and I'm going to talk a bit about them.

The first requirement we call logical device partitioning. In this picture we can see a statically partitioned setup, in a slightly idealized version. At the top we can see three operating systems running on the same chip: this green Linux, which owns CPU 0 and the Ethernet controller; this blue Linux, which owns the UART and CPU 1; and this red RTOS, which has the rest of the devices. The system bus is shared, however. Now, what can go wrong? If we look at the right, we have an example memory map of an NVIDIA Jetson. In red you can see that different devices are mapped onto the same memory page. So we can't rely on the hardware to isolate the resources we want to isolate; we can't use the MMU for access control. We have to do a workaround: sub-paging, which involves trapping and moderation by the hypervisor. The solution is quite simple: use dedicated pages for each device, and then use the MMU for automatic access control in hardware.

The second requirement is the hierarchical autonomy of devices. Here we can see the same picture again, but now we have clocks, and the clocks are somehow shared between our different cells. The clocks are responsible for setting up the operation and speed of devices, and obviously we need parts of this functionality in multiple cells. Diving a bit deeper into this example, without going too much into detail, we can see a schematic of the clock sources of the NVIDIA Jetson TX1, and we can see that they somehow hierarchically depend on each other. What does that mean? If one cell changes some clocks, other cells may be affected, and we don't want that. So what would be the solution? The solution: stick the configuration of such meta functionality to each device's own memory region.

Another requirement is platform resource partitioning. On the right we can see the memory map of the PLIC, which is the current RISC-V interrupt controller. Again, without going too much into detail, we can see that the configuration registers for many IRQs and many contexts are mapped onto the same page again, which requires us to do sub-paging again. What makes things worse is that on RISC-V, guests cannot even claim or complete any interrupts themselves, which means they cannot mark an interrupt as being handled or finished. So the hypervisor has to do a full virtualization of the interrupt controller, which is a lot of overhead. The solution is quite simple: provide interfaces with a reasonable memory layout, which means placing all registers for a single IRQ on one dedicated memory page, and, for RISC-V, allowing the claiming and completion of IRQs in guest context. On the right we can see a little bit of history about the platforms we looked at and how well they solved this. A green dot says everything is fine, but apparently only AMD managed it. As already said, the PLIC is the worst of all, but it will get a little bit better with the APLIC, an advanced PLIC that will come out somewhat soon: we won't need full virtualization anymore then, but we would still have to do sub-paging, as on all those platforms that are marked in yellow here.
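To illustrate the PLIC problem at the code level, here is a sketch of the register layout and of a claim/complete cycle; the offsets follow the PLIC specification, while the base address and the readl/writel helpers are assumptions for this example. Note how priorities and enable bits pack many IRQs and contexts into shared pages, and that claim/complete is exactly the part a guest cannot do itself on today's hardware, so every one of these accesses traps into the hypervisor.

```c
#include <stdint.h>

/* Register layout per the RISC-V PLIC spec; the base address is
 * board-specific and assumed here. Priority and enable registers for
 * many IRQs/contexts share 4 KiB pages, so the MMU cannot hand a
 * cell just "its" registers. */
#define PLIC_BASE              0x0c000000UL /* assumed */
#define PLIC_PRIORITY(irq)     (PLIC_BASE + 4 * (irq))
#define PLIC_ENABLE(ctx, irq)  (PLIC_BASE + 0x2000 + 0x80 * (ctx) + 4 * ((irq) / 32))
#define PLIC_CLAIM(ctx)        (PLIC_BASE + 0x200004 + 0x1000 * (ctx))

static inline uint32_t readl(unsigned long addr)
{
	return *(volatile uint32_t *)addr;
}

static inline void writel(uint32_t val, unsigned long addr)
{
	*(volatile uint32_t *)addr = val;
}

/* One interrupt-handling cycle on a given context: reading the claim
 * register returns the pending IRQ number, writing it back completes
 * it. On the current PLIC a guest cannot do this itself, so the
 * hypervisor must emulate each of these accesses. */
void plic_handle_one_irq(unsigned int ctx)
{
	uint32_t irq = readl(PLIC_CLAIM(ctx)); /* claim */
	if (irq)
		writel(irq, PLIC_CLAIM(ctx)); /* complete */
}
```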
Another example of failing platform resource partitioning is the lacking support for interrupt pass-through. What do we mean by that? If an interrupt arrives, it has to be caught by the hypervisor, and the hypervisor has to inject it into the target cell. This requires mode switches and lots of additional code, and we get a severe latency increase for interrupts, which is especially bad for real-time systems. The solution would be quite simple: in a partitioned setup, we know that any arriving interrupt always belongs to exactly one guest, so we should just allow it to arrive directly at that guest, which is called interrupt remapping. Here is a little bit of history again: all platforms had this problem at some point, but most of them fixed it, and RISC-V will also fix it with the AIA, the Advanced Interrupt Architecture, which is a combination of this advanced PLIC and the IMSIC, an MSI controller, and will come out soon.

Another example that violates platform resource partitioning is that critical functionality is only available via system firmware. On ARM, for example, we have to use the PSCI, which is part of the system firmware, for things like CPU onlining and offlining or system power control. That's not a big problem for our use case, since it's not on hot paths. On RISC-V, however, we have to use the SBI, the firmware interface of RISC-V, for things like timers, IPIs, or remote fences, things that happen all the time; we have to use it frequently. So we have unnecessary detours through the SBI, which cost time. What makes things worse: we even have to intercept those accesses via the hypervisor, which costs additional time again. The solution is quite simple: provide frequently used functionality directly via hardware mechanisms. And now back to Ralf.

Last but not least, the last requirement is cross-core independence. As the word independence already implies, there is some kind of dependence between cores, and we would like not to have these dependencies. Let me give you an example. There are shared hardware resources on these systems, like buses or caches. Imagine a system like this: you have eight CPUs, each of which has its own dedicated level-one and level-two cache, but there is a level-three cache shared across, let's say, four CPUs. Now let's partition that system into three partitions: the first three CPUs go to the first partition, the next three CPUs go to the second partition, and the last two CPUs go to the third partition. Now an application running in the first domain might be able to evict cache lines in the level-three cache of another domain, which means that if the other domain needs that memory, it has to fetch it into the cache again: we get latencies due to cache misses. So there should be mechanisms that allow us to allocate, or set quotas on, access to such shared resources. The shared resource can, for example, also be a shared system bus, like the PCI bus. And there are possibilities given by the architectures that allow this type of cache partitioning or cache allocation. On Intel x86, this technology is called CAT, the Cache Allocation Technology, for allocating the level-three cache to our guests. For apportioning access to shared buses, we have the Memory Bandwidth Allocation technology, MBA; both are part of Intel's RDT, the Resource Director Technology. We have a similar thing on AMD, likewise called QoS, quality of service. And on ARM, we have DynamIQ and, again, QoS mechanisms for the shared buses.
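As a sketch of what programming such a mechanism looks like, here is how Intel CAT can be configured via MSRs; the MSR numbers are from Intel's RDT documentation, but the wrmsr helper, the 16-way cache assumption, and the chosen masks are illustrative, and real code would first enumerate CAT support via CPUID.

```c
#include <stdint.h>

/* MSRs from Intel's Resource Director Technology (RDT):
 * IA32_PQR_ASSOC selects the class of service (CLOS) a core runs in,
 * IA32_L3_QOS_MASK_n holds the L3 way bitmask for CLOS n. */
#define MSR_IA32_PQR_ASSOC   0xc8f
#define MSR_IA32_L3_MASK(n)  (0xc90 + (n))

/* Illustrative wrmsr helper; a real kernel would use its own MSR
 * accessors. Must run in ring 0. */
static inline void wrmsr(uint32_t msr, uint64_t val)
{
	asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val),
		     "d"((uint32_t)(val >> 32)));
}

/* Partition an assumed 16-way L3: ways 0-11 for the real-time CLOS 1,
 * ways 12-15 for the noisy-neighbor CLOS 2 (masks must be contiguous
 * per the architecture; the split here is an example). */
void setup_l3_partitions(void)
{
	wrmsr(MSR_IA32_L3_MASK(1), 0x0fff); /* CLOS 1: ways 0..11 */
	wrmsr(MSR_IA32_L3_MASK(2), 0xf000); /* CLOS 2: ways 12..15 */
}

/* Bind the current core to a CLOS: the CLOS ID lives in bits 63:32
 * of IA32_PQR_ASSOC (the low bits carry the RMID for monitoring). */
void bind_current_core_to_clos(uint64_t clos)
{
	wrmsr(MSR_IA32_PQR_ASSOC, clos << 32);
}
```

The nice property for us is that this is configured once at partitioning time, per core; no hypervisor trapping is involved at runtime.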
I don't know whether any technology for cache allocation or bus quotas exists for the RISC-V platform; I would not be aware of one.

Okay, so these are the issues that we have on those platforms, but what are the consequences and limitations of those platforms if you consider real-time virtualization? The point is: the systemic overhead, like the one from the additional translation stage, cannot be avoided, but we can measure it. And if we have to trap on those systems, we can also measure the impact of the trap. So this is what we basically do. We want to know which delays we have to accept if the hypervisor needs to get active under certain conditions. For that, we need micro-benchmarks that trigger exactly those paths in the system, which then allow us to measure the latencies.

And this is how we design those micro-benchmarks. First, we have a micro-benchmark that runs on the bare-metal hardware. Let's say the benchmark measures an IPI round-trip time: the time that is required for sending an interrupt from one CPU to another CPU; the other CPU responds and sends the interrupt back. The time that we measure is the bare minimum for the IPI round trip on that platform; we cannot go below this time, it is given by the platform. This is the baseline measurement. Then we repeat the same measurement in a virtualized environment: the same micro-benchmark runs in a guest, we trigger the exact path in the hypervisor that will get active, and we learn what additional latency is to be expected. The third measurement is the noisy-neighbor measurement: we put additional load on the Linux that is running in a different domain, to put additional stress on the shared platform resources like caches or buses. The last step is to visualize those measurements and draw histograms. We repeat those measurements millions of times, running for several hours, to catch all the corner cases that may appear on the platform, and we get latency histograms for each trap into the hypervisor that is possible on this platform.

Here is the example for a RISC-V platform, for instance. The black line is the time that we require on bare metal. The yellow histogram is what we require in the second benchmark, the virtualized environment, and the grey histogram is what we get when we put additional load on the Linux running as a neighbor of our real-time guest. We repeat the measurements for several scenarios: for inter-processor interrupts, for the emulation of the interrupt controller, for remote fences, or for setting timers; these are the points where the hypervisor needs to get active on RISC-V, for example. And given these charts, you can decide on your own whether these additional latencies fit your needs or not; they can serve as a basis for your decision. We have similar measurements for basically all supported platforms; of course, the exact numbers depend on your board.
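To make the methodology a bit more tangible, here is a sketch of the measurement loop of such an IPI round-trip benchmark; read_cycles(), send_ipi(), and wait_for_ipi() stand in for the platform-specific parts (the cycle counter, the doorbell mechanism, e.g. an SBI call on stock RISC-V), and the fixed-size histogram is a simplification of what a real harness would do.

```c
#include <stdint.h>

#define SAMPLES    1000000
#define MAX_CYCLES 4096

/* Platform-specific hooks, assumed to exist in the benchmark harness:
 * read_cycles() reads the CPU cycle counter (e.g. rdtsc on x86,
 * rdcycle on RISC-V), send_ipi() sends an IPI to the partner CPU
 * (on stock RISC-V this detours through the SBI firmware), and
 * wait_for_ipi() spins until the partner's answer has arrived. */
extern uint64_t read_cycles(void);
extern void send_ipi(unsigned int target_cpu);
extern void wait_for_ipi(void);

static uint64_t histogram[MAX_CYCLES];

/* Measure SAMPLES round trips: CPU A sends an IPI to CPU B, whose
 * handler immediately answers with an IPI back to A. Repeating this
 * for hours is what populates the tail of the latency histogram. */
void benchmark_ipi_roundtrip(unsigned int partner_cpu)
{
	for (long i = 0; i < SAMPLES; i++) {
		uint64_t start = read_cycles();

		send_ipi(partner_cpu);
		wait_for_ipi(); /* partner answered */

		uint64_t delta = read_cycles() - start;
		if (delta >= MAX_CYCLES)
			delta = MAX_CYCLES - 1; /* clamp outliers into last bin */
		histogram[delta]++;
	}
}
```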
So this is almost it. To sum it up: you now know the boundaries of embedded virtualization, and of course the requirements that we have for hardware vendors, because somehow, someone needs to implement all of that in hardware: logical device partitioning, hierarchical autonomy of devices, platform resource partitioning, and cross-core independence. The ultimate goal is still to eliminate any hypervisor activity, but this requires the involvement of the hardware people. x86 is currently probably the most suitable platform for hardware partitioning, and the ARM platforms are still on their way to getting better, but we have major issues on the RISC-V platform. The good thing is that the RISC-V folks follow a hardware/software co-design strategy, which means that any new feature or extension they add will first be available in an emulator and can be reviewed from a software engineer's perspective before it is eventually pressed into silicon or ratified. I think this is a very good approach, as the direct feedback of developers is definitely required to overcome the presented issues. All right, so thank you very much for your attention, and I'm now happy to answer some of your questions.

Sorry, can you please repeat the question? Okay, so the question is what happens if we have a crash inside the Linux cell. Then, of course, Linux will stand still; nothing will happen anymore in the Linux cell, but the real-time cell is unaffected by that crash. And you can have a small communication interface between those two domains, with some kind of monitoring between the two cells, and if the real-time cell recognizes that the Linux cell has crashed, it can of course take action. So it is possible for those cells to communicate with each other. Does this answer your question?

Okay, the next question is: if the Linux cell is evil, can it harm the real-time cell? So you might think that Linux just decides to shut down the cell. We have a mechanism for this: we can say, okay, from now on the platform is locked, and there will be no more repartitioning from that moment on. So it cannot simply shut down the cell any longer. What Linux of course can do is put pressure on shared hardware, shared caches or something like that. So it is possible to slow down the other cells if there are no extensions like cache allocation available. But if we have such hardware extensions, we can also overcome these issues. And after partitioning, you can also take the clocks away from Linux.

So the question was: if Linux takes care of the clocks, couldn't it just power off devices of the critical cell? The idea is that once you have set up your clocks, one possibility is that no one on the platform is allowed to make any modifications to those clocks anymore. Some years ago, we had a demonstrator: a microcopter flying and running this exact hypervisor, Jailhouse. And we had this exact issue: we had to reconfigure the SPI bus all the time, we had to change clocks. What we did in the end is that we provided a para-virtualized interface through the hypervisor, which took care that a cell was only allowed to set the clocks it should be allowed to set. So this is where we used a small bit of para-virtualization. But what we would actually want is to have the clock configuration stick to the devices, so that you're only able to change the clocks, or even the reset lines, of that single device. So you need a partitionable or virtualizable clock controller; that is basically what you would need. Not yet. Not yet. I have to admit I haven't even heard of it, how is it called? SCMI? I haven't heard of it yet. Yeah, that would be the exact solution, something virtualizable or partitionable.

So the question is how we get Linux to EL2: we are piggybacking on the KVM stubs.
Yes, with a normal kernel, you have to know the location of the hypervisor stubs. We take them over, replace them with our code, and then hook ourselves in underneath; I can show you the code afterwards, how we do it. Yes.

On the question of guarantees for the real-time cell: what we do is statistical evaluation. We put the whole system under load and see how it reacts. And yeah, hard guarantees are of course difficult, so what we try is to eliminate any corner cases that may occur. This is why we are aiming at zero-overhead virtualization: if there is no overhead at all, then we get rid of that problem entirely. Yeah, this is exactly what we are trying to do with the third type of our benchmark: we put all kinds of load on the Linux side, we try to do everything that can go wrong, and then we see what happens to the latencies that are caused by the load. Yes.

So the question is: is there a solution to do IPIs right on RISC-V? Let me get to that slide. On regular RISC-V, you use the OpenSBI interface to send IPIs between cells. And with the new APLIC, the advanced interrupt controller of the RISC-V platform, it is possible to use the interrupt controller directly to send IPIs between CPUs, so you get around using the system firmware. We haven't implemented it yet; our next task will be to implement the APLIC together with the Advanced Interrupt Architecture on RISC-V. This is on our roadmap.

Any other questions? Okay, then thank you very much again.