Hi, everyone. I'm Anup. I work for Western Digital. My colleagues at Western Digital and I contribute a lot to open source projects to help the RISC-V software ecosystem. This talk is all about Xvisor, an embedded hypervisor for the RISC-V world. RISC-V as such is quite suited to the embedded world, and having an embedded hypervisor for the RISC-V world would be great. This is a broad outline of the talk. We'll have a quick overview of the RISC-V H extension, that is, what is there in RISC-V for supporting hypervisors. Then we'll look at Xvisor in general and have a quick overview of that as well, and then look at the RISC-V port of Xvisor. Lastly, we'll look at our current status and what we plan to do in the future. And if time permits, we'll also have a quick demo of Xvisor RISC-V running on QEMU.

So let's begin with the RISC-V H extension. Even before going to the H extension, a quick word on RISC-V itself. I'm sure most of you are already aware of what it is and what it stands for, but nevertheless: it's a free and open instruction set architecture. It's a clean-slate, flexible ISA. It is designed to support 32-bit, 64-bit, and in future even 128-bit machines. Just like any other RISC, it has 32 general purpose registers. It also has variable instruction length, which can actually lead to quite compressed binary sizes. And then we have three privilege modes: machine mode, supervisor mode, and user mode. Machine mode will typically run the platform-specific runtime firmware, whereas Linux or any general operating system will run in S mode, and the apps or user space will run in U mode. Each mode has its own CSRs. We won't go into the details of each of them, but to highlight, the box on the right shows that to run Linux we just need nine CSRs of S mode.

So that was a quick overview of RISC-V for people who don't know it. Like I mentioned, it's an extensible ISA, so there are a lot of optional extensions being defined for RISC-V. One of them is the hypervisor extension, or the H extension as we call it. It's an optional extension; CPU implementers can choose to implement it or not based on their requirements. It's designed to suit both type-1 (bare metal) as well as type-2 hypervisors. The current state of the spec is at the 0.4 draft stage. We'll soon have a 0.5 draft as well with some of the pending changes, but it's pretty much not changing much as of now. When Western Digital initially started the work on the hypervisor extension, we added support to QEMU and Xvisor first, and we also have a KVM port for RISC-V. That was all based on the 0.3 draft. Recently we have moved all of these to the 0.4 draft, and once 0.5 is out, we'll move to that as well. We don't expect to see many more changes in the spec, and we hope that it will be frozen soon. So at Western Digital we are almost doing a co-development of the RISC-V spec and software: we are actually validating the spec by implementing it in QEMU for both type-1 as well as type-2 hypervisors, and this helps us provide a lot of feedback to the spec community. It also helps us ensure that the spec is in the right shape and is suited for both type-1 as well as type-2.

So the RISC-V H extension actually extends the privilege levels I described: M mode, S mode, and U mode. It adds a new HS mode, which is nothing but S mode with additional hypervisor capabilities. And then we also have VS mode and VU mode.
So VS mode runs the guest kernel, the guest Linux, and VU mode runs the guest user space. It's designed such that Linux can run unmodified as a hypervisor (host) in HS mode with KVM, or it can also run unmodified as a guest in VS mode. That's why it suits both type-1 as well as type-2. And the color coding over here will be consistent across the slides: we have a separate color for each privilege mode, and that will be consistent across the slides.

There are new CSRs, of course, which are only accessible to HS mode. Two types of new CSRs are added for the hypervisor extension. One is the H-prefixed CSRs, which provide the hypervisor capabilities to HS mode. And then we have the VS-prefixed CSRs, which allow HS mode to access the state of VS mode. When we are running in VS mode, all accesses to the S-mode CSRs map to the VS CSRs. And these are pretty much all the CSRs that are newly added for the H extension.

Just like any other architecture with virtualization support, we also have a two-stage MMU in hardware. When we are running in VS mode, we have the VS-mode page table, or stage one as we call it in the ARM world, which is programmed by the guest itself and translates guest virtual addresses to guest physical addresses. And then we have the HS-mode guest page table, which is nothing but stage two: it translates guest physical addresses to host physical addresses, and yes, it is programmed by the hypervisor. So in general, in HS mode the hypervisor can program two different types of page tables. The HS-mode page table is for the hypervisor itself and translates hypervisor virtual addresses to host physical addresses. The guest page table is for the VMs only; we have one guest page table per guest, which translates guest physical addresses to host physical addresses. The interesting part is that across all these page table types, the page table format remains the same, so we can reuse a lot of the page table handling code.

And injecting virtual interrupts into guest vCPUs is very straightforward: we just directly update the vsip CSR from HS mode to inject per-vCPU virtual interrupts. For software and timer interrupts, that is, the IPIs within the guest and the timer interrupts for the guest, we have SBI calls, which will be explained later. And of course we will have a lot of software emulated devices as well, which are trapped and emulated using the HS-mode guest page table. Right now the biggest, most critical IP that we are emulating in software is the PLIC, the interrupt controller for the VM, and we also have emulated VirtIO devices for the guest or VM.
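To make the virtual interrupt injection a little more concrete, here is a minimal sketch of setting a pending bit in the vsip CSR from HS mode. This is not Xvisor's actual code: the csr_set() macro and the bit position are illustrative assumptions, and the exact vsip bit layout depends on the H-extension draft version in use.

```c
/* Minimal sketch, not Xvisor code: inject a virtual interrupt into
 * the guest vCPU whose VS-mode state is currently loaded on this
 * hart, by setting a pending bit in the vsip CSR from HS mode.
 * Requires an assembler that knows the H-extension CSR names. */

/* Atomically set bits in a CSR (csrs instruction). */
#define csr_set(csr, val)                                     \
    __asm__ __volatile__("csrs " #csr ", %0"                  \
                         : : "rK"((unsigned long)(val)) : "memory")

/* Placeholder bit index: the VS-level software interrupt pending
 * bit as seen by the guest; the exact layout is draft dependent. */
#define GUEST_SOFT_IRQ_BIT  1

static inline void inject_guest_soft_irq(void)
{
    csr_set(vsip, 1UL << GUEST_SOFT_IRQ_BIT);
}
```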
So moving on to Xvisor in general: what is this project about? An overview of Xvisor. Xvisor stands for eXtensible Versatile hypervISOR. It's an open source, GPLv2, type-1 monolithic hypervisor. The word monolithic is quite critical here; that's how Xvisor differentiates itself from other open source hypervisors like Xen. It's a community driven project, not driven by any single company, and it has been in development for more than nine years now. It supports a variety of widely used architectures, like ARMv5, v6, v7 (with and without virtualization extensions) and ARMv8. We also have x86_64 support, and the RISC-V port is a work in progress. It's primarily designed and developed for embedded virtualization. So it's not targeted at servers, or at systems where there are a lot of resources, too many CPUs, huge amounts of RAM, and a lot of devices. The primary goal of the project is to give good virtualization performance on resource constrained systems, typically embedded systems. Our first academic, peer reviewed paper was in 2015, where we did an apples-to-apples comparison between KVM, Xen, and Xvisor on the same board, the same machine, and showed how Xvisor performs better on micro-benchmarks.

This is the traditional classification which most of us are aware of: a hypervisor is either type 1 or type 2. As I mentioned, Xvisor is type 1, just like Xen, whereas KVM is type 2. But there are a lot of differences between Xvisor and Xen as well. Xen is more like a micro-kernelized hypervisor; it runs most of the drivers inside a guest. In Xvisor we have a provision to run drivers directly in the hypervisor, and you can choose to use pass-through and run drivers in the guest or VM as well.

Like I mentioned, we have been in development for quite some time now, and of course we have lots of features by now. Some of them are listed here. On the virtualization infrastructure side: we use device trees for both hosts as well as guests. We have a soft real-time pluggable scheduler, so you can always add your own scheduling policy. Then we have huge page support for both guests as well as hosts. We have tickless and high-resolution timekeeping; we don't have the timer wheel baggage or anything like Linux. We have a host device driver framework, with a right-sized set of device drivers that is just enough for the devices we care about; we don't have drivers, like Linux does, for every device on earth. We also have Linux portability headers, which help you port drivers easily from Linux to Xvisor. Then we have a threading framework to create background threads, like kernel threads. We also have runtime loadable modules, which work on ARM as well as x86; it's still not there for RISC-V, but we'll add it. So just like .ko objects, you can have .xo objects which you can load from the file system. We also have a lightweight virtual file system; that is just for logging and loading guest images, not for sharing with the host, because it's not scalable that way. And we have a white-box testing framework as well, to stress our own internal modules.

Then we have domain isolation related features: we can assign affinity to each vCPU and also to the host interrupts. We also have spatial and temporal memory isolation. By this I mean that most SoCs have a cache hierarchy, L1, L2, up to L3. So when we are running a virtualized system, we can ensure that all VMs get a fair share of the cache hierarchy using spatial and temporal memory isolation.

Then we have standard device virtualization related features, like pass-through support. Pass-through support is not there on RISC-V because we don't have a defined IOMMU in the RISC-V world as of now; maybe in future we will have that. But it's there for the ARM world; we have it in Xvisor ARM. Then we have block device virtualization and all the standard network, input, and display virtualization. For para-virtualization we follow the VirtIO spec; as of now we have 0.9.5, the older VirtIO spec, and we should be moving to the VirtIO 1.0 spec as well. The other interesting feature we have is domain messaging, where we have a unique way of passing messages between the VMs, so we don't have any Xvisor-specific para-virtual calls.
We use VirtIO RPMSG for message passing between the VMs. Using VirtIO RPMSG we can actually bypass the networking stack. You can always use VirtIO net, but if you use VirtIO RPMSG you can bypass it and have faster communication between the VMs as well. And we also have shared memory, so we can actually achieve zero-copy inter-guest communication using shared memory plus VirtIO RPMSG.

Some of the key aspects that differentiate Xvisor: everything, in terms of the Xvisor scheduler, is a vCPU. That's quite different from Linux, where everything is a task. We have two types of vCPUs. One is a normal vCPU, which is a vCPU belonging to a VM or guest. And then we have orphan vCPUs, which do not belong to any VM; they belong to the hypervisor itself and are used for background processing. Basically, an orphan vCPU is a kernel thread for Xvisor. Orphan vCPUs are very lightweight compared to normal vCPUs. And like I mentioned, we have a pluggable scheduler. Right now we have two scheduling policies, fixed priority round-robin and fixed priority rate monotonic, and we plan to add deadline scheduling as well. In general the scheduling policies are soft real-time; we are not going after hard real-time.

The other key aspect: in Linux we have two possible contexts, process context and interrupt context. Over here we have three possible contexts: normal context, orphan context, and interrupt context. Normal context is when we are handling a trap for a normal vCPU, which means we are handling an MMIO trap, a stage-2 trap, or some other kind of trap. Orphan context is when we are running a background thread. And interrupt context is when we are handling a host interrupt. The unique thing about Xvisor here is that we only allow orphan vCPUs to sleep; that is, we can only sleep in orphan context. Normal context and interrupt context are not allowed to sleep, so we cannot have any sleepable locks in those code paths. This ensures that the interrupt handlers can never block and the trap handling for a guest or VM can never block, which in turn helps us provide very predictable delays in MMIO emulation and trap handling for the guest.

So moving on, that was about Xvisor in general; let's now talk more about what is there in the RISC-V port of Xvisor. These are all the software layers when Xvisor is deployed on a system. The color coding is the same as mentioned previously. As you can see, it's a monolithic hypervisor, just like Linux is a monolithic kernel: it's a single software layer providing the complete virtualization service, and we don't depend on any other OS for running drivers or emulators or anything. The black boxes shown over here, the VirtIO frontends, are actually the existing VirtIO drivers which are there in the upstream Linux kernel. As such, we don't have any code changes in the Linux kernel for Xvisor; we just run an unmodified Linux kernel out of the box.

A typical vCPU context on RISC-V looks like this. We have two parts. One is the arch registers (arch_regs), which are common for both orphan as well as normal vCPUs; most of the state in arch_regs is saved and restored at the time of every trap entry and exit. And then we have the RISC-V private structure, which is only for normal vCPUs, where we have all the additional vCPU CSRs that are there for the guest, and all the floating point registers.
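As a rough picture of this split, here is an illustrative sketch; the type and field names below are made up for this example and are not the actual Xvisor definitions.

```c
/* Illustrative sketch only, not the actual Xvisor structures. */

/* Saved/restored on every trap entry and exit, for both normal
 * and orphan vCPUs. */
struct arch_regs {
    unsigned long gpr[31];   /* x1..x31 (x0 is hardwired to zero) */
    unsigned long sepc;      /* trap return program counter        */
    unsigned long sstatus;   /* supervisor status at trap time     */
    unsigned long hstatus;   /* hypervisor status                  */
};

/* Extra state kept only for normal (guest) vCPUs, switched only
 * when the scheduler actually switches vCPUs. */
struct riscv_priv {
    /* VS-level CSRs of the guest */
    unsigned long vsstatus;
    unsigned long vsie;
    unsigned long vstvec;
    unsigned long vsscratch;
    unsigned long vsepc;
    unsigned long vscause;
    unsigned long vstval;
    unsigned long vsatp;
    /* Guest floating point state */
    unsigned long fp_regs[32];
    unsigned long fcsr;
};
```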
And this will keep growing once we virtualize other extensions of RISC-V, like the vector extension, but arch_regs will pretty much remain as it is. This is a timeline of the sequence of events when we get a host interrupt while running a VM. We just enter HS mode, do the trap entry, handle it in interrupt context, and go back. The trap entry will typically just save 31 GPRs and 3 CSRs, that's all, so it's pretty simple. If you look at other hypervisors like KVM, you will have a big world switch where a lot of state is switched between host and guest.

And how does the context switch happen? Most of the time, a context switch in Xvisor happens in interrupt context, when the time slice allotted to a VM expires. But orphan vCPUs and normal vCPUs can both voluntarily give up the CPU as well: an orphan vCPU can give it up by just sleeping, and a normal vCPU can execute a wait-for-interrupt instruction and voluntarily give up the CPU. When a vCPU is preempted, it will mostly be due to time slice expiry: it gets a timer interrupt, does a trap entry, this goes to the host IRQ subsystem, which in turn goes to the timer subsystem, and that calls the scheduler callback. The scheduler then decides to switch out the vCPU and does the context switch, which is RISC-V specific, of course, and will be different for other architectures. The usual trap entry/exit save and restore happens as is; it is only the context switch which saves and restores the private state shown in the right-hand side box. So the full context switch only happens when a vCPU is actually scheduled out.

Real-world VMs tend to have a mix of pass-through, para-virtualized, and software emulated devices. We cannot really ignore the software emulated devices, even in the embedded world, because on any SoC the number of devices is really limited, so we will end up having some software emulated devices for the guests or VMs. And when we run an RTOS inside the VM, the delay in emulating the MMIO registers is also very important. The point behind this slide is that we really have a predictable delay in MMIO emulation, because we don't allow Xvisor code to sleep when we are in normal context, that is, when we are emulating or handling a trap for a guest or VM.

And the way we handle guest RAM is quite different. Unlike a lot of other hypervisors, we really don't pre-map the entire guest RAM into the hypervisor virtual address space. This gives us two things: we save some memory in the HS-mode page table, and we also avoid the cache aliasing problem where we would have two separate virtual addresses for the guest RAM pages, one in the hypervisor and one in the guest. Of course, this comes at a price: we need an alternative way to access guest RAM. The way we do it right now is to access 4 KB at a time: we map the 4 KB page, access it, and unmap it. It is slower, but it can be improved, at least on RISC-V. On RISC-V we have a special hardware feature, unprivileged load/store accesses, which go through the guest's translation. Using these, we can avoid the map and unmap dance shown over here and get very good performance in accessing guest RAM without even pre-mapping it in the HS-mode page tables.
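For illustration, here is a rough sketch of the current 4 KB at a time approach; the guest_page_map() and guest_page_unmap() helpers are hypothetical names for this example, not the actual Xvisor API.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE   4096UL
#define PAGE_MASK   (~(PAGE_SIZE - 1))

/* Hypothetical helpers: temporarily map one 4 KB page of guest
 * physical memory into the hypervisor address space, and unmap
 * it again afterwards. */
void *guest_page_map(unsigned long guest_pa);
void  guest_page_unmap(void *va);

/* Copy a buffer into guest RAM one 4 KB page at a time, so guest
 * RAM never needs to be pre-mapped in the HS-mode page table. */
static void guest_ram_write(unsigned long guest_pa,
                            const void *src, size_t len)
{
    const char *buf = src;

    while (len) {
        unsigned long off = guest_pa & ~PAGE_MASK;   /* offset in page */
        size_t chunk = PAGE_SIZE - off;
        if (chunk > len)
            chunk = len;

        char *va = guest_page_map(guest_pa & PAGE_MASK);
        memcpy(va + off, buf, chunk);
        guest_page_unmap(va);

        guest_pa += chunk;
        buf      += chunk;
        len      -= chunk;
    }
}
```

With the unprivileged load/store feature mentioned above, the map/unmap pair in the loop can be dropped entirely.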
And just like other hypervisors, we also have huge pages, for the hypervisor as well as for the guest, which of course again helps performance. We have 2 MB and 1 GB huge page sizes on 64-bit RISC-V, and 4 MB on 32-bit systems.

Like I mentioned before, when we are running a guest or VM, the guest can actually request software interrupts (IPIs) and timer interrupts using SBI calls. SBI is nothing but the Supervisor Binary Interface in the RISC-V world. It's a syscall-style interface, here between the guest and the host; you can easily compare it with the PSCI spec in the ARM world. Currently SBI 0.1 is what is in use by the Linux kernel, and that's what we emulate for the guest or VM. SBI 0.2 is still in draft stage, and we have pretty much defined it; it should be out soon. The table over here shows all the existing SBI 0.1 calls; this is going to change a lot with 0.2.

And like mentioned before in the features, the other unique thing about Xvisor is that we use device trees for both the host as well as the guest. In fact, we have three types of DTs. One is the host DT, which, as the name suggests, describes the host hardware to Xvisor; Xvisor uses that DT to probe the drivers and all those things. Then we have the guest Xvisor DT, which describes the guest virtual hardware to Xvisor. We don't really have any fixed guest types in Xvisor; all the emulators are probed when the guest is actually created, and we describe the guest memory layout and everything in the form of a DT. And then we have the guest OS DT, which describes the guest virtual hardware to the OS that is going to run inside the guest; typically it's a Linux DT.

And like I mentioned in the features, we can actually do zero-copy inter-guest transfers using a combination of shared memory and VirtIO RPMSG. Shared memory alone is not enough for zero copy, because we need asynchronous notifications between the VMs so that we can avoid unnecessary polling; RPMSG helps with that part. RPMSG also helps us with the name service notifications which it has. Essentially, in this setup, if you are running Linux inside each guest, the apps will see a separate character device for each other VM, and writing to that character device will just send out a message to the other VM. If the other VM goes down, the character device will be removed by udev. So it's pretty cool in that way. And by using VirtIO RPMSG we are also avoiding the Linux network stack. Using VirtIO RPMSG is not mandatory as such to do this kind of thing; you can also use UDP packets over VirtIO net.

So moving on, these are some interesting stats of the code base and the binary sizes, gathered recently on 24th October, mainly in the context of RV64. As you can see, the amount of code which gets compiled for RV64 is just 159K lines of code, and the bulk of it is just drivers and emulators; we can actually strip it down further and reduce it a lot. The overall binary size for 64-bit RISC-V is almost 850 KB, which is pretty small, actually, compared to a lot of other solutions. Think about KVM: there you have to consider the entire kernel binary size, the kvmtool or QEMU binary size, and the libraries they depend on. So the text memory we need to achieve the complete solution is quite small. And at runtime we have this concept called VA pools: we restrict the complete memory usage of Xvisor by having a compile-time limit called the VA pool limit. By default it's 32 MB, and you can increase or decrease that.
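As a small aside, going back to the SBI calls mentioned above: from the guest kernel's point of view, an SBI call is just an ecall instruction with the legacy SBI 0.1 call number in register a7 and arguments starting in a0. The sketch below is the generic legacy calling convention for RV64, not Xvisor-specific code.

```c
/* Minimal sketch of a legacy SBI 0.1 call as issued by a guest
 * kernel on RV64. Extension number 0x00 is sbi_set_timer in the
 * legacy SBI spec. */
#define SBI_EXT_SET_TIMER 0x00

static inline long sbi_set_timer(unsigned long long stime_value)
{
    register unsigned long a0 __asm__("a0") = (unsigned long)stime_value;
    register unsigned long a7 __asm__("a7") = SBI_EXT_SET_TIMER;

    __asm__ __volatile__("ecall"
                         : "+r"(a0)
                         : "r"(a7)
                         : "memory");
    return a0;
}
```

When Linux runs as a guest, this ecall traps from VS mode into HS mode, where Xvisor emulates the timer or IPI request for that guest.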
Typical usage out of this VA pool is about 22 MB right now. And when we boot, we actually free up a further 136 KB of memory, just like Linux frees a lot of init memory.

Yeah, so the most important question: why is it ideal for embedded systems? We think it is because there is no dependency on any guest OS for running management tools or drivers; it's a single piece of software, a single binary, providing you the complete virtualization solution. The guest types are not fixed: users can design their own VMs and choose not to even contribute back. You can even come up with micro VMs, small VMs with just a few peripherals. It has a soft real-time pluggable scheduler. The MMIO emulation overhead is quite predictable. We follow open standards for para-virtualization, like VirtIO. We have zero-copy inter-guest communication. Our memory footprint at runtime is quite low, we believe, and we have a reasonable code size for a complete virtualization solution. And yeah, it's a perfect playground for academia as well; we have seen a lot of publications and interesting research coming from it in the past. In fact, the spatial and temporal memory isolation was done by one such collaboration, things like that.

So what is our state as far as RISC-V goes? We had the initial RISC-V port working last year itself, but at that time we didn't have QEMU with H-extension emulation, so we just booted to prompt and that's all. So we had pretty much the groundwork already ready for the H extension. Then, once the draft was in good shape, in June we just implemented it in both QEMU and Xvisor. Currently it's in very good shape, of course: we are able to boot multiple guests with multiple vCPUs on an SMP host, and we are also able to boot OpenEmbedded and Fedora guest OSes as well. So it's quite comparable to the Xvisor ARM port right now. We were supposed to release the next version of Xvisor by this month; it's already a year now since the previous release. But we want the spec additions to settle: we wanted the 0.5 draft with all the pending changes to be out, and there are some QEMU fixes that we want to go in with the QEMU 4.2 release. If you want to play with Xvisor, there is already documentation in the Xvisor docs on running it on RISC-V QEMU, and the work-in-progress QEMU is found at this location. You can also write to us by joining the Xvisor Google group.

What we are looking at next: we want to get our 32-bit RISC-V port working as well; it's the same port, just running in 32-bit mode. Then we also want to add SBI 0.2 support. We want more advanced stuff, like para-virtual time accounting, to be defined in SBI 0.2 and also implemented over here. And like mentioned before, we want to access guest RAM using unprivileged load/stores, and then add loadable module support. The vector extension is going to be there in RISC-V soon as well, so we want to virtualize that too, but of course this depends on the state of QEMU. Then we want to add libvirt support eventually, but yeah, we didn't find time so far to do that. Other things, like running a 32-bit guest on a 64-bit host, should be possible now because it's already defined in the spec. We should also be able to do a big-endian guest on a little-endian host, or the reverse; that is also defined in the spec right now.

So yeah, I think we have time for the demo, or have we run out of time? We have five minutes, okay. Yeah, meanwhile, if you have questions, just shoot. I mean, yeah, please. [Inaudible question.] No. Yeah, it's totally in C; we didn't put that in there.
Sure, sure, I'd love to do that. Yeah. No, we don't have that support, but we generally use the QEMU facilities. We don't have a debug server running in Xvisor. Basically, even if we do it eventually, we'll need to run a GDB server in Xvisor, and then that will help us debug the VMs. But we don't have that right now. Yeah, we can use QEMU for debugging. Yeah, yeah. Yes, it can debug both. The trick with that is that we have to figure out, from the QEMU register state, whether we are currently in guest mode or hypervisor mode, because you can break at any point in time, so that becomes tricky sometimes.

Yeah, so I'll actually just show this by running two VMs with a simple BusyBox root FS. Yesterday Alistair also gave a demo in his QEMU talk; it's pretty much a similar demo. We cannot go for a complex demo in the interest of time, because booting Fedora takes quite a bit of time right now. So what we have done is we just started the Xvisor build for 64-bit RISC-V, and we are also building the basic firmware. The basic firmware is just like a boot loader which boots up first in the VM and then launches the guest Linux. For the other images, we have an already pre-compiled Linux kernel and a root FS compiled for the guest. Generally we provide the guest images to Xvisor in the form of an initrd or a disk; Xvisor can load guest images from some block device or from an initrd as well. So after the build completes, we'll just create the Xvisor disk, the disk that we are going to provide to Xvisor to load the images into the VMs. Any other questions?

So the rationale was that emulating actual devices requires more register traps, and we wanted some para-virtual mechanism which requires minimal register trapping. Xen has its own Xen PV, Hyper-V has its own Hyper-V para-virtual channels; VirtIO was kind of neutral to any hypervisor, so we went with VirtIO. And it's actually very few traps, only for the notifications. We also didn't want to make anything Xvisor specific, and we wanted to avoid changes in the kernel as well, so we went for VirtIO.

Yeah, so it has compiled. We'll skip the steps over here in the interest of time, and because we already have a pre-compiled guest Linux and guest root FS, we'll now just create the Xvisor disk which will have the guest images. What we have done over here is create a small ext2 disk image which we will pass as an initrd to Xvisor. It will have the guest Linux image, it will have the DT which describes the virtual hardware to Xvisor, and it will also have a DT for the guest Linux. So let's launch it. Right now we compile our own QEMU; we don't use the released QEMU, because the hypervisor H-extension emulation patches are still in flight. They might be released with QEMU 4.2 or later. So we have booted Xvisor with two VMs. Okay, so you can... there are a lot of commands if you do help. What it has done at boot time is that, after booting, it has mounted its own initrd and created two VMs for us. You can also create these VMs through commands. And if you do host info, it will show whatever DT information QEMU has passed to us. We can also list the vCPUs. Everything is a vCPU, right? So even the background threads you will see as threads, but the normal vCPUs are named like guest0/vcpu0. And the guest list is here. So, if you want to start... okay, so this will first start the basic firmware, which is like our boot loader, and from there we just boot Linux.
So the host has four CPUs, and the VM has two vCPUs. This is a very basic BusyBox-based root FS, and if you do cat /proc/cpuinfo, you see two vCPUs. And the same thing here. Any questions? Yeah, there is... what I want to add is that we can even ping back and forth between the VMs, so let's just assign some random IP addresses. Any questions? So the message you just saw, like I mentioned, VirtIO RPMSG has these name service notifications: when the other guest comes up, you get a notification, the character device gets created automatically, and we can just ping. Questions? I think it's time. Yeah, we're already three minutes over. Okay, thank you. Thank you. Thank you for attending.