Hey, so it's gotten quiet, so I think we'll start. My name's Alastair, and I work for Western Digital as part of the research and development team. My colleagues and I focus on RISC-V and the Linux ecosystem, and today I'm here to talk about developing the RISC-V hypervisor extension in QEMU.

So I'll start with what a hypervisor is, quickly, and why we use them and why they're exciting. Then I'll talk about the RISC-V hypervisor spec, its current status and details about it. Then I'll go over the current RISC-V hypervisors that exist, the QEMU implementation, the QEMU future work, improvements to the draft spec that came out of all this, at least from the QEMU side, the overall current status, and then, if we have time, a demo and questions.

So, what a hypervisor is. A hypervisor allows you to run virtual machines. Compared to containers, where you share the host kernel, with a hypervisor you run a full OS, a kernel and a full user space, inside your virtual machines. They're extremely common in the enterprise world. Cloud companies use them: AWS, Azure, all of that stuff runs on hypervisors. It allows you to share machines and do fancy save, restore, migration, all that type of stuff. Operating systems use them too. In the Microsoft world they have Windows Defender Application Guard, which they use to protect the Edge browser. And Qubes is the famous open-source one; it's based on Xen and allows you to partition applications and protect against different types of attacks. They're also used in consumer products: the Xbox One is based on a hypervisor and uses it to handle game switching and things.

So they're big in the enterprise and the big x86 world, but they're also big in embedded. They're used for process isolation in safety- and security-critical applications, and they're also used to run multiple legacy OSes. The famous example of this is, say you have a car, and you have five chips that used to do brakes and power steering and infotainment and all these things. You can consolidate down onto one big SoC, split it all up, pin your brakes to one core so it's always running, pin your steering to another, and then run everything else between all your other cores to do infotainment and communications or whatever else you want, to cut costs and simplify things. So hypervisors are used everywhere, all over the place.

Xen and KVM are the two well-known open-source ones; they're both GPL. There are lots of talks here about KVM, especially because there's a KVM Forum. And Xvisor is another, relatively new hypervisor. It's also GPL and open source, and I'll go into more details about that later.

So, the RISC-V hypervisor extension status. I'm not gonna talk too much about the RISC-V spec in general, just that the way it works is you have a base spec and you add extensions on top based on what you're interested in. The RISC-V hypervisor extension, the virtualization extension, is an extra extension on top of that, and it's designed to support both bare-metal and hosted hypervisors. The 0.4 draft was released, what, about four months ago. And it's a draft, and a draft spec means it's gonna change. So if you implement this today, don't expect it to keep working when the next version comes out. Eventually it'll be frozen and ratified, and then there are guarantees about backwards compatibility and all that stuff. But draft means there are not.
Additions have happened since then, but generally today what I'm talking about is the 0.4 version, just because that's what all the software is using; it's too hard to keep updating every time one commit changes in the spec. So Western Digital ported QEMU, Xvisor and KVM to the 0.3 version of the spec, which was the previous one. We've since updated everything, we're now using the latest 0.4 version, and we'll update from there too.

So, on to the actual details. In the RISC-V world we have these privilege modes. Normally we have just M, S and U: your firmware, your operating system and your user space. It's kind of like EL3, EL1 and EL0 in the Arm world. With the RISC-V hypervisor extension it changes to this: we kind of partition the S and the U modes. You can see here the firmware, where you run OpenSBI in machine mode, doesn't change; you don't virtualize your firmware. The next step is HS, where you run your hypervisor. The key thing here, though, is that if you had the hypervisor extension in your hardware and you didn't care about hypervisors, didn't want to use it, and your kernel didn't support it, you could run an unmodified kernel in HS. You don't have to update your kernel to handle the hypervisor extension if you don't want to use it; you just keep running it there, with your user space in U mode, and in the future you could update your operating system to add support for the hypervisor spec. So you don't have to have weird hooks to hook back. But if you do want to use the hypervisor spec and run hypervisors, your hypervisor runs in HS and your hypervisor user space runs in U. In RISC-V terms we say virtualization is off there, or V equals zero. And then your guests run in the V modes, virtualization on: guest Linux and guest user space.

On top of that we have these new CSRs. CSRs, control and status registers, are the RISC-V way of controlling everything, controlling what your CPU is doing. They're normally prefixed by the letter of the privilege mode. So sstatus is the status CSR for S mode, for your operating system; mstatus is for machine mode. When you're in your hypervisor and virtualization is off, you have these new ones. All your S CSRs are still there, all unchanged; you can use them as you always did, and you can run an unmodified operating system. But now you have these HS CSRs. The HS CSRs are for the hypervisor: things to control the hypervisor, status for the hypervisor, second-stage translation for your two-stage MMU (I'll talk about that next), delegation settings for your guests, things like that. Things that a hypervisor needs.

And then you have these VS CSRs, and this is the weird one. The VS CSRs are the virtual copies of the guest's S CSRs. So you have all these copies, you can see them on the slide: vsstatus is the guest's sstatus. When you transition from non-virtualized to virtualized, the hardware will alias the S registers to the VS registers. So when you're running in your guest, your guest will access the sstatus register, but in hardware that will transparently access the vsstatus register, and the guest then can't touch the hypervisor's status or the hypervisor's HS registers. And when you swap back, the hardware will automatically swap them back. Or in this case QEMU will swap them for you, but real hardware handles swapping vsstatus and sstatus accordingly. And this is actually one of the hardest things to implement in QEMU; it gets kind of tricky.
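To make that aliasing concrete, here's a conceptual sketch of what the hardware effectively does on a guest sstatus read. The struct, field names and mask value are illustrative, not QEMU's actual implementation (which, as described later, swaps values at transitions rather than redirecting every read):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative mask: sstatus exposes only some mstatus fields. */
#define SSTATUS_MASK 0x00000000000DE133ULL

typedef struct {
    uint64_t mstatus;   /* full machine-mode status */
    uint64_t vsstatus;  /* the guest's (VS) shadow of sstatus */
    bool virt;          /* V: are we executing inside a guest? */
} CsrFileSketch;

/* What an sstatus read effectively resolves to. */
static uint64_t read_sstatus(CsrFileSketch *csr)
{
    if (csr->virt) {
        /* V=1: the access is transparently redirected to vsstatus,
         * so the guest never sees hypervisor (HS) state. */
        return csr->vsstatus & SSTATUS_MASK;
    }
    /* V=0: sstatus is just a masked view of mstatus. */
    return csr->mstatus & SSTATUS_MASK;
}
```

So, the two-stage MMU.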
So one of the general things that hypervisor extensions do in hardware is give you two stages of MMU. The first stage is for your guest; when virtualization is on, it only applies when you're running inside a guest. It translates your guest virtual address to your guest physical address. This is controlled by your guest kernel, Linux or whatever it is, and it sets it up as it always would; it doesn't have to know that it's virtualized, it just does whatever it would normally do. The second stage is run by the hypervisor, which sets it up, and that translates your guest physical address to your host physical address. The hypervisor will set this up differently for different guests and manage it to map them onto physical memory. If anyone knows the RISC-V MMU stuff, this basically follows the same scheme: it's the standard format, Sv32 or Sv39 or Sv48 depending on the bit width, and it just kind of loops the walk through twice. So it's not that much re-implementation; you don't have a whole new page-table walker and all this stuff. There are some differences, which I'll talk about in the implementation, and they make it not entirely trivial, but it's not that hard.
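As a rough sketch of how the two stages compose (the helper names here are hypothetical, and a real walker is more involved):

```c
#include <stdint.h>

typedef enum { TRANSLATE_SUCCESS, TRANSLATE_FAIL } WalkResult;

/* Hypothetical single-stage walker, reused for both stages. */
WalkResult walk_page_table(uint64_t root, uint64_t va, uint64_t *pa);

static WalkResult two_stage_translate(uint64_t vsatp, uint64_t hgatp,
                                      uint64_t gva, uint64_t *hpa)
{
    uint64_t gpa;

    /* Stage 1: guest virtual -> guest physical, walked with the
     * guest's own page-table root. A fault here is an ordinary
     * fault that can be delegated to the guest. */
    if (walk_page_table(vsatp, gva, &gpa) != TRANSLATE_SUCCESS) {
        return TRANSLATE_FAIL;
    }

    /* Stage 2: guest physical -> host physical, walked with the
     * hypervisor's table. A fault here must trap to HS mode. */
    return walk_page_table(hgatp, gpa, hpa);
}
```

One of the differences alluded to above is that in a real walk the stage-one page-table entries are themselves guest-physical addresses, so each of those loads also has to go through stage two.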
And IO and interrupts. In case anyone asks later: the RISC-V hypervisor extension does not have virtualized interrupts or an IOMMU. That's just not specced out yet, and there are no implementations, at least publicly. So interrupts are injected from the hypervisor into the guest using the VSIP CSR. The SIP CSR is the interrupt-pending register for the OS, and VSIP is your guest's interrupt pending. The hypervisor writes into the guest's interrupt pending, then transitions to the guest, and the guest then gets an interrupt from the pending bits. The hypervisor is in charge of emulating SBI calls; SBI is kind of like the RISC-V equivalent of PSCI, it does lots of different things, and the hypervisor is in charge of handling that. The hypervisor can also use the two-stage MMU to trap and emulate MMIO accesses. This allows you to tell the guest there's a UART when there's not really anything there, and when the guest accesses it, you trap into the hypervisor, and the hypervisor sees what's going on and pretends to be a UART or something.

And quickly, just comparing it to AArch64, which has a similar concept, if anyone knows that: there's no separate privilege mode in the RISC-V hypervisor extension, unlike AArch64, which has EL2. This means the RISC-V extension suits both type 1 and type 2 hypervisors really well, and you don't have to trap back down through an extra level; it just splits the S and U modes. RISC-V can also inject virtual interrupts through CSRs, like I just talked about with the VSIP CSR, so you don't have to have dedicated virtualized interrupt support. The RISC-V hypervisor extension also has support for nested virtualization: you can use the VTVM and VTSR bits to trap into the hypervisor on certain operations, and this allows you to do nesting, although that hasn't been done yet. And lastly, you can use SBI to do timers and interrupts.

So there are two ported hypervisors that support the RISC-V hypervisor extension: Xvisor and KVM. Xvisor is a monolithic bare-metal hypervisor. The nice thing about Xvisor is it has no requirement on Linux. KVM obviously requires Linux, and Xen, at least until fairly recently, required Linux as well. So Xvisor can boot up straight to a prompt, and then you can start your guests from there. This made it simpler to port the extension: as one of my colleagues was telling me, we could co-develop QEMU and Xvisor at the same time, since we were both working on them, testing them against each other. One of my colleagues has a talk on Xvisor tomorrow if anyone wants to hear more details about how it works. And if you want to use the RISC-V work, you have to go to the Xvisor next branch; it hasn't made it into the master branch yet.

And KVM. KVM is the kernel module for the Linux kernel that turns the Linux kernel into a hypervisor. There's lots of stuff about KVM this week; there's KVM Forum at the end of the week, and there's a talk by a colleague and myself on Wednesday about this same thing with QEMU and KVM. And if anyone's interested, there's a GitHub link for KVM; I'll talk more at the end about the status of the patches.

So just to give you an idea, this is what a bare-metal hypervisor looks like on RISC-V. Your OpenSBI firmware runs in M mode, then your hypervisor runs in HS mode, and on top of that your guest kernels run in VS mode and guest user space in VU mode. I think I've explained a lot of that already. And KVM looks a little different: again, firmware in M mode; your host Linux kernel with the KVM module runs in HS mode; then your QEMU or kvmtool or whatever you're using runs in U mode, and that's all with virtualization off; and then your guest kernel and guest user space run in VS and VU mode. So this is different to other architectures.

So now let's talk about implementing it in QEMU. What is QEMU? It's a very quick open-source emulator. It uses TCG to translate guest instructions into host instructions, and this allows you to run RISC-V operating systems, binaries, everything, on whatever architecture you want. QEMU is not cycle-accurate. It's not a simulation platform, it's an emulation platform: functionally accurate but not cycle-accurate. It doesn't pretend to have a pipeline and caches and all these things, and that allows it to be really quick. It's not written and maintained by a single company; it's a big open-source project. There's a whole lot of QEMU developers here this week, especially for KVM Forum at the end of the week. It's GPL, has a great mailing list, all that stuff. So it's a great platform for developing both the extension and software on top of the extension.

So, the current QEMU implementation: the patches are on the list. Everything we do is open source; we're not gonna keep it to ourselves. It supports both the 32-bit and 64-bit hypervisor extension, and all the CSRs, including the CSR swapping that I talked about earlier. Interrupts are generated correctly to the hypervisor, which can then delegate them to the guest OSes. Floating point is correctly controlled: the hypervisor can disable guest floating point and we handle that. And the two-stage MMU is fully supported.

So the patches on the list are not merged, but once they are, the hypervisor extension is still expected to be disabled by default. It's a draft spec; we don't wanna turn it on for everyone and have it change versions and all these things. So you have to enable it with the -cpu option: rv64 with x-h=true. The x- prefix is just the standard QEMU way of saying a feature is experimental: don't rely on it still being there, exactly the same, next release; it's gonna change. And the h is just the hypervisor extension. So you can turn it on and off.
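As a sketch, gating the extension behind an experimental property follows QEMU's usual qdev pattern, roughly like this fragment of the CPU definition (the exact field names in the real patches may differ):

```c
#include "hw/qdev-properties.h"

static Property riscv_cpu_properties[] = {
    /* Off by default: the hypervisor spec is still a draft. */
    DEFINE_PROP_BOOL("x-h", RISCVCPU, cfg.ext_h, false),
    DEFINE_PROP_END_OF_LIST(),
};
```

So a run with the extension enabled looks like -cpu rv64,x-h=true, and dropping the option gets you a plain RV64 CPU again.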
So the patches are on the list, and you can apply them from there. Or if you want a branch, there's a link there for the branch; I think if you just search "riscv qemu", it'll probably come up.

Some things have already gone in; they had to be made in preparation in QEMU. So the MIP CSR, which is the overall global pending-interrupt CSR, used to be updated atomically. This is because you have your CPU running, but other CPUs could generate IPIs into your CPU, and you didn't want it just changing, being overwritten and information being lost. Updating that atomically is a real pain when you're swapping your registers all the time; it just becomes a headache and prone to bugs. So instead we updated it to lock the IO thread. If anyone knows the QEMU implementation, that's a standard thing QEMU does: you lock the IO thread so you know no one else is gonna come and change your state. We also landed setting ISA extensions via the command line, like I just talked about. That's important so we could disable it by default, and it's also useful for other extensions as they come in; the vector extension in QEMU, for example, will find it useful too, because it's also a draft. And some floating-point consolidation, to try and simplify that.

So in QEMU we have to maintain the hypervisor state. We have to know if we're virtualized right now or not. This isn't that hard to do: state only changes on traps or returns, so there's only a handful of places it can change, and we know where they're gonna be. The problem is that the RISC-V hypervisor extension allows M and HS mode to pretend to be virtualized, so you can do loads and stores through the two-stage MMU; I'll talk about this on the next slide when I talk about the MMU. So QEMU needs to know if this is happening, to then decide if we're gonna run those accesses through the two-stage MMU. And the other weird thing is that certain faults can never be delegated. No matter what your delegation settings say, like whether a trap should go to the guest or the hypervisor, for certain events they're just ignored and you have to trap to the hypervisor. A failure in the second stage of the two-stage MMU, for example, has to trap to the hypervisor. So in the QEMU world we keep track of this as part of the virtualization state, and if anyone ever looks at the code and is interested, we use this force-HS-exception flag, and there are helpers to set and read it. That allows you to forcefully trap into the hypervisor.
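A minimal sketch of that state tracking, with an assumed flag layout rather than the exact QEMU encoding:

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRT_ONOFF      (1u << 0) /* currently executing with V=1 */
#define FORCE_HS_EXCEP  (1u << 1) /* next trap may not be delegated */

static bool virt_enabled(uint32_t virt_state)
{
    return virt_state & VIRT_ONOFF;
}

/* Set for events that can never be delegated, like a fault in the
 * second translation stage, so the trap-handling code that runs
 * much later knows to ignore the delegation CSRs and go to HS. */
static uint32_t set_force_hs_excep(uint32_t virt_state, bool on)
{
    return on ? (virt_state | FORCE_HS_EXCEP)
              : (virt_state & ~FORCE_HS_EXCEP);
}
```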
So, the two-stage MMU. When you're virtualized, the two-stage MMU is always on. You can do a one-to-one mapping if you don't wanna use it, but it's always gonna be two stages. But the two stages can also be turned on even when virtualization is off; that's what I just talked about. The idea for this is that if you trap into the hypervisor, and your firmware or hypervisor wants to know what caused the trap, it can turn the two-stage MMU on, which only applies to loads and stores, not to instruction fetches. It can then load normally from the address of the faulting instruction, so it has the instruction that faulted, and then it can decode it and see why it faulted and what to do. Maybe it should return something or do something else, right? So you have to know both whether you're virtualized and whether you wanna pretend to do a virtualized access, and the really fun part is that this then changes the registers you read or write.

So if you're virtualized and you're doing a stage-one translation, you wanna look at the satp register, which is the root of the guest's page table. But if you're in M mode and you wanna do a two-stage MMU lookup, you have to read the vsatp register, because you wanna use the guest's register, not your hypervisor register. It's just these weird things where you always have to keep track of where you are and what state you're in. And this is, I think, definitely the biggest source of bugs, because it's very easy to accidentally read the wrong register or check the wrong conditional, and suddenly you're using the hypervisor page table instead of the guest page table to do a two-stage lookup. And like I said, second-stage failures have to trap to the hypervisor, and the way QEMU works, these are kind of two different places: we figure out we're gonna have an exception, and then we handle the exception much later on in a different place, so you have to keep track of that to make sure we don't delegate it later and don't accidentally lose that information.

So, register swapping. Register swapping is my least favorite part of this whole thing. It gets really complicated because of the way RISC-V CSRs work. For a lot of the CSRs there's the M-mode one and the S-mode one, like mstatus and sstatus, but the sstatus register is actually a masked version of mstatus. mstatus is the register that contains all the information, and sstatus is just some of that information, the part the OS can see; the OS doesn't see all of it. QEMU uses mstatus internally to track the status. We don't have a second, independent way to know what we're doing; we just read mstatus. So mstatus always has the latest updates of what privilege mode we're in or whatever it is, and we read from that. So swapping that back and forward while needing to keep it updated all the time is quite difficult.

S-mode-only registers, things that QEMU doesn't really care about, are easy: we just value-swap them on virtualization changes, so on traps and returns. For mstatus and MIE we use pointers; it's easier to just swap the pointer than to copy the values around, so you don't have to clobber things and copy them back and forward. And the MIP CSR, like I talked about earlier, now that it's no longer atomically accessed, we can value-swap that too.

So this is what the register swapping looks like. On hardware, on the left: if you access the sstatus CSR and virtualization is off, you get the sstatus CSR, which is actually just a mask on top of the mstatus CSR. And if virtualization is on, you get the vsstatus CSR, which is its own mask on top of mstatus. Then in QEMU, if you access mstatus, which is a pointer, you either get mstatus-no-virt, which it points to, or mstatus-virt, depending on whether you're virtualized or not. And if you access sstatus, you get mstatus-no-virt masked, or mstatus-virt masked. The guest doesn't have to do any of this; the guest just accesses sstatus, and QEMU will figure out what it needs to do and translate it.
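Here's a sketch of what a transition then does, following the scheme just described; the struct and helper names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    /* S-mode-only CSRs and their stashed other-world copies. */
    uint64_t stvec,    stvec_hs;
    uint64_t sscratch, sscratch_hs;
    uint64_t sepc,     sepc_hs;
    uint64_t satp,     satp_hs;
    /* mstatus is accessed through a pointer so that reads are
     * always current, without copying fields back and forth. */
    uint64_t *mstatus;
    uint64_t mstatus_novirt, mstatus_virt;
} CpuStateSketch;

static void swap_world(CpuStateSketch *env, bool to_virt)
{
    uint64_t t;

    /* Value-swap the S-mode-only CSRs on the virt transition. */
    t = env->stvec;    env->stvec    = env->stvec_hs;    env->stvec_hs    = t;
    t = env->sscratch; env->sscratch = env->sscratch_hs; env->sscratch_hs = t;
    t = env->sepc;     env->sepc     = env->sepc_hs;     env->sepc_hs     = t;
    t = env->satp;     env->satp     = env->satp_hs;     env->satp_hs     = t;

    /* Re-point mstatus at the right backing copy. */
    env->mstatus = to_virt ? &env->mstatus_virt : &env->mstatus_novirt;
}
```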
The reason it ends up being so complex, where some other architectures have arrays and can keep all this information, like if virtualized, read the second index of the array, if not, read the first, the zeroth index, things like that, is because the RISC-V world is so modular. There are probably a lot of people who will use RISC-V or do RISC-V work and never care about the hypervisor extension. To force them to figure out, every time they access mstatus, whether they're virtualized or not and which one they should access, just seems like a pain. So the idea here is that at all times, all the registers are correct. If you access mstatus, no matter where you are in the code, you are getting the correct mstatus, and the hypervisor extension code, kind of off by itself, will make sure to update it correctly. The theory is that when people come later and add whatever new extension, they don't have to deal with figuring out which state they're in, virtualized or not virtualized. That's the idea there.

So, future work. The current plan is to upstream what we currently have. If it's not upstream, it's just not worth having, so we want it in mainline; I'll talk about that a little bit later too.

The other thing is to update QEMU's TLB caching. I don't know how many people in here know QEMU's TLB stuff, but it caches TLB entries, and we can flush them. Right now we just flush on any sign of anything: if someone said to flush, just flush everything. The idea is that hopefully we can save some of the virtualization state when we cache entries, so we can flush just the virtualized or just the non-virtualized state, depending on whether it's an sfence or an hfence instruction. So it should be a little quicker, but it should also help catch bugs in the software. If your software is incorrectly flushing things, QEMU will bring it up and you'll see it. Otherwise, right now, you'd never really notice, because QEMU just flushes so much that it doesn't matter, and the same code might not actually work on hardware. So that's kind of the point there.
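tlb_flush_by_mmuidx() is QEMU's existing API for this kind of targeted flush; assuming the guest and host translations were given separate MMU indexes, the idea would be something like:

```c
#include "qemu/osdep.h"
#include "exec/exec-all.h"

/* Hypothetical split of QEMU MMU indexes between the two worlds. */
#define HS_MMU_IDX_MASK 0x07 /* V=0 (host/hypervisor) translations */
#define VS_MMU_IDX_MASK 0x38 /* V=1 (guest) translations */

static void fence_flush(CPUState *cs, bool is_hfence)
{
    /* Only flush the world the fence actually targets, instead of
     * flushing everything on any fence as the code does today. */
    tlb_flush_by_mmuidx(cs, is_hfence ? VS_MMU_IDX_MASK
                                      : HS_MMU_IDX_MASK);
}
```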
Other future work: keep updating to the spec while it's a draft, and even after that keep updating it. Add support for nested virtualization. I talked about nested virtualization at the start, but those bits don't connect through to anything at the moment in QEMU. It's kind of hard to test, because no one is really writing nested virtualization support yet, so we'd have to get a hypervisor updated to support it and develop it in QEMU at the same time. And 32-bit Linux guests don't work yet: we can run bare-metal guests, but Linux fails to boot.

On improvements to the spec: the original 0.3 version that we did all the work on had this background register swapping stuff. It was a little different to the way we do it now with the virtual, the VS, registers. And it was unclear where interrupts should go and who should handle them, things like that. So we caught that; well, mostly QEMU caught a lot of those bugs. That's cleared up now as part of the update to 0.4, and it's now much clearer how the shadow registers work.

So, the overall current status. Like I said, the second version of the hypervisor patches is on the mailing list, and about half of them have been reviewed. If anyone's interested, the more the better, you can go look at them. We might make it into 4.2, which is the next release of QEMU, I think by the end of the year; otherwise it'll be 5.0. That kind of depends on soft freezes and hard freezes and all that stuff. There are KVM patches on the list; my colleague and I are talking about KVM on Wednesday if you're interested in that, and the KVM patches are mostly reviewed. The Xvisor patches are on the Xvisor next branch, waiting to be released, and my colleague has a talk about Xvisor tomorrow. It's really interesting; you should go look at that. OpenSBI, which is the firmware for RISC-V, the go-to firmware, needs changes to handle the hypervisor extension as well, and those went into the 0.5 release, which is the latest release. So that's all good to go. And 0.5 of the hypervisor spec, I heard, was supposed to come out this week, so sometime soon we should be getting that, and then we can look at updating.

So, I have a demo and questions. If anyone has any questions, we might start with those first. Okay, I'm gonna do the demo then.

Okay, so here's just QEMU with -nographic, which is a standard way of starting QEMU. The virt machine; the CPU option, this is the thing I was saying, turning the hypervisor extension on; -kernel, which here is the Xvisor kernel, running on OpenSBI; an initrd that Xvisor, oh, can you see that? Should I zoom in more? Okay, an initrd, and then we have some append options. And -smp 4, I'll get to that in a second.

So OpenSBI boots, and this is Xvisor. It brings up four CPUs, and it copies in the kernel and the rootfs. This takes a while; I think it's a really big kernel. Oh yes, I think it says, yeah. So here's the RISC-V ISA string, so we're RV64, and there's an H in there for the hypervisor extension. So this is Xvisor. We can kick the first guest, and we bind to the UART of the guest. The way Xvisor works is these guests run this little bare-metal thing, so we can say hello, and it says hello back. And we can use it to autoexec; it's kind of already copied in the kernel, so now we'll boot into it. Oh, I have to type. This is over SSH, which is why it's all kind of jumping around. Okay, so we can log in. So here we are: this is inside Xvisor, this is our Linux. We have two CPUs passed through to the guest, and we can assign it an IP address. Now we jump back to Xvisor; I just exited out. We can kick the second guest, bind to it, autoexec it. So now we have two guests running; it takes a second. Now we can ping the first guest from the second guest. Okay, so they're still running; I can list the two guests and list the CPUs. And I can bind back and show: this is the second guest, still pinging away. Go back to the first guest, and we can ping the second guest. So yeah, that's a basic RISC-V hypervisor running on top of QEMU.

All right, so now, any more questions, or any questions? Otherwise I'm done. Or we can just keep watching it ping.

Yeah: is the ping time worse with the hypervisor on than with it off?

You mean if I ran, like, two QEMU instances and made them talk to each other? I don't know. I mean, it looks a little slow here because the SSH connection is kind of sucky, but yeah, I don't know, good question. That's it. Okay, I'm done.