Good morning, and welcome to the first virtual KVM Forum. I'm Marc Zyngier, I work for Google, and together with Christoffer Dall we're presenting the work we have done to try and revisit the way KVM/arm64 presents interrupts to its guests. Without giving away too much of the content, the title is quite explicit: we basically want to remove the interrupt controller from the picture, or at least try to. Why would we do that? Well, hypervisors are complex. Very complex. And yet they need to be reliable; we're in the business of building a very reliable hypervisor. But there is this thing called confidential computing, or trusted computing, that is becoming extremely visible. Security in the hypervisor has always been a concern, but it's now even more of one. There are things we can do to make that better, or at least to make us feel better about it. We can review and audit our code, and yes, for those of you who are reviewers and developers, this is hard: it takes a lot of energy and dedication to properly review code. We can harden things, making it more difficult for bugs or exploits to appear. Again, that's very useful, but it tends to be reactive. Ideally, we would prove that our code is correct. That's not practical at scale, especially when you look at the size of something like Linux. Our idea is to reduce the part of the hypervisor that we trust by deprivileging some of it, and only making the ultimate core of the hypervisor trusted. That doesn't mean throwing away KVM and deciding that we need to reinvent the hypervisor from scratch. A good example of this approach is the Android protected KVM project, which I'm contributing to; if you're interested in that, please watch Will Deacon's presentation. So, users do care about their data, especially their secrets, their private data. And they can store that data in what we would call confidential VMs.
And it's interesting, because the expectation is that these VMs will provide stronger isolation than normal applications. So what defines a confidential VM? Private state, in general. That's the CPU state: you don't disclose your registers to third parties. Your memory is strictly private, either unmapped or encrypted. You rely on things like time passing monotonically, never going backwards; that's also very important. And then there are interrupts. So what about them? Well, VMs need to have some guarantees about their interrupts, and you can think of these interrupts as being split into two buckets. On one side you have what we will call the trusted interrupts. These are the interrupts that the guest needs internally to be able to function on its own, things like IPIs: having a vCPU able to interrupt another vCPU. That's crucial. Linux totally relies on that for its scheduling, for RCU, all kinds of things. And it's really important that the guest, especially a confidential guest, be able to trust that when it receives an IPI, that IPI has actually been sent from within the VM itself and doesn't come from outer space. The same goes for CPU-private devices such as the timers. You want time to be monotonically increasing, and at the same time you don't want your timer interrupts to go crazy or to disappear altogether. These are the kinds of things you would like to rely on. On the other side, you have what we will call untrusted interrupts. These are interrupts that come from outside the trusted computing base: things like your network interface or your mass storage. Ideally, you would distrust these interrupts. Having said that, and having thought about how to separate these two buckets, can we make that fit in the arm64 architecture? We wish. Let's have a look at the GIC architecture to find out. It's a five-minute overview; you may want some paracetamol.
The GIC architecture, which is at the heart of the arm64 architecture, presents not one type of interrupt, but actually four. We have SGIs, which are used to implement IPIs; PPIs, which are interrupts private to a given CPU; SPIs, which are global; and LPIs, which are global as well, but only used for things like MSIs. These four classes have different interrupt life cycles, different configuration parameters, and can or cannot be directly injected depending on their class and the hardware. They come with all kinds of tunables: their group, their priority, their affinity, and a complicated set of states. You have the notion of an interrupt being pending, being active, being enabled. But you also have active priorities, which are both per-priority and per-CPU, and the concept of running priority, which is per-CPU. So a lot of jargon, a lot of state, and that's only the beginning. These interrupt classes are supported by three different blocks in hardware. The distributor and the redistributor are extremely similar. The redistributor is at the per-CPU level and deals with SGIs and PPIs, so your IPIs and private interrupts. The distributor deals with SPIs, the global interrupts, and implements the routing of those interrupts to the various CPUs. Both of these blocks are entirely MMIO based. And then, on the side, you have the ITS, the Interrupt Translation Service, which deals with LPIs, this thing we use to implement MSIs. That block is partially MMIO based, but also uses in-memory command queues and tables. The inside joke, for those familiar with the GIC, is that the ITS implements page tables for interrupts. And that's actually the case. The ITS is what is required for the architecture to provide direct injection. And that was only the physical side. Now, on the virtual side, when we implement a hypervisor, what facilities do we get? Well, the only thing we have is that the architecture virtualizes the delivery of an interrupt. That's the only thing.
So you have two ways of ensuring that an interrupt gets delivered to a guest. Either you put that interrupt in what we call a list register, one of a set of registers that are private to a CPU but actually represent global state; if you think that's a good idea, think again. Or you use direct injection, which we would love to have everywhere. Unfortunately, the GIC only does that for LPIs if you have GICv4.0, and for some form of SGIs if you have GICv4.1. And frankly, setting that up is terribly complicated. But it exists. Everything else, which is the whole infrastructure, the hypervisor has to emulate. Distributor, redistributor, ITS: there is zero support coming from the hardware, it has to be done entirely in software. Fun. So what does KVM/arm64 do about the GIC architecture? Well, it's quite painful. We support the whole gamut of the architecture. But if you look at the way it works, there is a lot of state duplication between emulation and injection, so a lot of things happen behind the hypervisor's back. Remember those list registers: if you put an interrupt in a list register and get the vCPU running, and another vCPU needs, for whatever reason, to introspect that state, it's hard, because the state is private to a physical CPU and another CPU cannot observe it. So you need some synchronization between the two physical CPUs, to get the vCPU out of the VM so that you can synchronize the state. That means physical IPIs. It's hard to get right, and it's even harder to make it performant. There is also a lot to emulate: there are literally hundreds of registers with complex semantics at both the distributor and redistributor level. There are very complex ITS commands that require global synchronization, and the opportunities for bugs are everywhere.
An interesting thing to try would be to de-privilege that part of the GIC emulation and, like on x86, move it to userspace. In theory it's possible; if you look at the way the GIC is split into blocks, you'd say, oh yes, we can move all these blocks to userspace, they could be independent. In practice, that would result in massive state duplication which would, again, be hard to synchronize, and it would also result in bad performance, because to achieve that synchronization you would have to go all the way back to userspace. We also have to deal with enormous hardware variability. Everything is optional in the architecture. Really, absolutely everything. There are multiple ways of doing the same thing, especially around direct injection. And we have tons of architectural legacy dating from 15 years ago at the very least. So does the virtualization architecture for the GIC succeed in simplifying the hypervisor? No, not really. You just have to look at the amount of code this represents: about 70% of the whole KVM/arm64 code base. It's just enormous. And we did say "minimizing the trusted computing base" earlier; this doesn't quite cut it. That's because the GIC has been designed to support extremely demanding workloads: millions of interrupts per second, device assignment, direct injection, compatibility. The key word in the GIC architecture is compatibility: we want to stay compatible, and that's great. We can run guests, even 32-bit guests dating from 15 years ago, and they will just run. But that puts the complexity in the hypervisor instead of putting it in the guest. So what if we could introduce an interrupt controller architecture for certain workloads? Not any workload, but only those that are, let's say, mostly compute bound, not IO bound, and that do not do device assignment. Literally compiling your kernel in the VM, that kind of thing. Only half joking, but yes, for example.
Can we make that hypothetical architecture as simple as possible and move the complexity to the guest, instead of having it at the hypervisor level? What would we gain by doing that? Well, that's the question we asked ourselves. Let's hear about it from Christoffer, with the rVIC architecture.

Hi, my name is Christoffer Dall. I work for Arm, and I'd like to talk to you about the reduced virtual interrupt controller, or rVIC, architecture. The rVIC architecture is an experimental hypervisor ABI design. It's publicly available in what we call an alpha state, which means that it's subject to change or withdrawal. The rVIC architecture is a minimal set of ABI calls used to implement a virtual interrupt controller on ARMv8. It seeks to have minimal impact on the hypervisor TCB and minimal impact on the VM kernel's complexity. The rVIC is designed to support split-mode hypervisors that have a small trusted TCB and a larger non-trusted codebase. The rVIC provides per-CPU interrupts, and all interrupts have edge-triggered semantics; we do provide a resample operation to support level-triggered signals. The rVIC provides routing of virtual interrupts between virtual CPUs, and it provides support for threaded interrupt handling. The rVIC architecture actually describes two separate components: the rVIC, which is the per-vCPU interrupt controller implemented in the trusted hypervisor, and the rVID, which is the per-VM virtual interrupt distributor implemented in the non-trusted hypervisor. There is one rVIC instance per virtual CPU, and the rVIC instance signals interrupts to a virtual CPU using standard ARMv8 virtualization mechanisms, such as the virtual IRQ bit in the hypervisor control register. The rVIC exposes a hypercall interface to the VM, which allows the virtual CPU to receive and handle interrupts, enable and disable the entire rVIC instance, mask and unmask individual interrupts, and send IPIs. The rVID has a single instance per VM.
It's used to route interrupts to different vCPUs and their rVIC instances, and it exposes a hypercall interface that allows the VM to map an interrupt source to a specific rVIC instance with a given interrupt ID. The rVIC architecture introduces the concept of trusted interrupts, which are managed by the trusted hypervisor. An example of a trusted interrupt is an IPI, where the VM can trust that the IPI was actually issued by another virtual CPU within the VM. Another example is vCPU-local interrupts, for example from the generic timer, where the VM can trust that if it sees an interrupt signal from the timer, the timer has actually asserted its signal. On the other hand, you have non-trusted interrupts, which are signals that can be generated from outside the trusted hypervisor, in the non-trusted hypervisor. An example of non-trusted interrupts would be virtual peripherals implemented in the non-trusted hypervisor, such as complicated devices like virtio. All non-trusted interrupts can be spoofed by non-trusted software, meaning that the VM software has to be resilient against spoofed interrupts of this kind. However, all interrupts, trusted or not, can trust their mask status on the rVIC instances.

The rVIC introduces a fairly limited set of commands. You can request the API version of the rVIC implementation. You can use the rVIC info command to query how many trusted and non-trusted interrupts are implemented across all the rVIC instances in the system. You can enable and disable entire rVIC instances. You can set and clear the mask status of individual interrupts. You can query the pending status of an individual interrupt. You can acknowledge a pending interrupt and obtain whichever interrupt ID is currently pending. You can clear the pending state of an interrupt. And you can signal an interrupt on another rVIC instance. Finally, you can also resample the state of an interrupt, which is useful for level-triggered sources.
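To make that command set concrete, here is a toy Python model of the per-interrupt state a single rVIC instance carries. The class and method names are my own descriptive reconstructions for illustration; they are not the actual hypercall names or ABI from the rVIC specification, and the start-masked and lowest-ID-first choices are assumptions made purely to keep the model simple.

```python
class RvicInstance:
    """Toy per-vCPU interrupt state: one enable bit for the whole
    instance, plus per-interrupt masked/pending bits."""

    def __init__(self, nr_irqs):
        self.enabled = False
        # Assumption for the model: interrupts start masked.
        self.masked = [True] * nr_irqs
        self.pending = [False] * nr_irqs

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def set_masked(self, intid):
        self.masked[intid] = True

    def clear_masked(self, intid):
        self.masked[intid] = False

    def signal(self, intid):
        # Edge-triggered semantics: signalling latches the pending bit.
        self.pending[intid] = True

    def is_pending(self, intid):
        return self.pending[intid]

    def clear_pending(self, intid):
        self.pending[intid] = False

    def acknowledge(self):
        # Deliver one pending-and-unmasked interrupt ID, or None.
        # (Lowest ID first here; a real implementation defines its
        # own selection order.)
        if not self.enabled:
            return None
        for intid, p in enumerate(self.pending):
            if p and not self.masked[intid]:
                self.pending[intid] = False
                return intid
        return None
```

The edge-triggered semantics show up in `signal` and `acknowledge`: signalling latches a pending bit and acknowledging consumes it, which is exactly why a separate resample operation is needed to re-latch level-triggered sources.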
A number of these rVIC commands work across rVIC instances, or across virtual CPUs, and that is really useful for re-routing interrupts and for quiescing the system. The rVID only exposes two commands, map and unmap. Map maps an input signal to a target rVIC instance with a particular interrupt ID, and unmap tells the rVID not to signal any interrupt at all for that device.

In terms of future directions, we would like to evaluate the usefulness of the rVIC architecture for things like Google's protected KVM, and also for other use cases. We plan to add a priority scheme to the rVIC, which could be used to support pseudo-NMIs in Linux, and also to support other operating systems which require priority support. Currently, our level-triggered resample operation is limited to trusted interrupts. We would like to expand that to non-trusted interrupts, which can be useful if you want to emulate legacy components like UARTs using level-triggered interrupt signals in the non-trusted hypervisor. And finally, we would like to add virtual MSI support to the rVID, which would be useful in the context of emulated PCI, for example. So we would welcome any feedback on use cases, or on changes to the rVIC or rVID concepts in the rVIC specification, and you can email me directly with any suggestions. Thank you.

Thanks, Christoffer. So, what about the rVIC and KVM? We have a full rVIC and rVID implementation for KVM/arm64; you can see the pointer to the patches at the bottom of this slide. It's tiny, absolutely tiny: 1,100 lines of code, which is less than a tenth of the GICv3 equivalent. Most of the changes are about moving the VGIC, which had its fingers in every pie of the arm64 tree, into its own box and keeping it there. So it's about defining a clean interface, always a good thing to do. We also have some very, very sketchy code for kvmtool. It's basically a huge hack, which is DT-only and doesn't deal with live migration, like the rest of kvmtool.
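Coming back to the rVID's two commands for a moment: the map/unmap routing described above can be modeled with a similar toy sketch. Again, these names and the in-memory representation are illustrative reconstructions, not the real ABI.

```python
class Rvid:
    """Toy per-VM router: maps a device input line to a
    (target vCPU, interrupt ID) pair."""

    def __init__(self, vcpus):
        self.vcpus = vcpus   # one set of latched pending intids per vCPU
        self.routes = {}     # input line -> (vcpu index, intid)

    def map(self, line, vcpu, intid):
        self.routes[line] = (vcpu, intid)

    def unmap(self, line):
        # After unmap, this input raises no interrupt at all.
        self.routes.pop(line, None)

    def input_asserted(self, line):
        # A device asserted its (edge-triggered) output line.
        route = self.routes.get(line)
        if route is not None:
            vcpu, intid = route
            # Latch the pending bit on the target vCPU's rVIC.
            self.vcpus[vcpu].add(intid)

vcpus = [set(), set()]
vid = Rvid(vcpus)
vid.map(line=7, vcpu=1, intid=40)
vid.input_asserted(7)   # intid 40 latched as pending on vCPU 1
vid.unmap(7)
vid.input_asserted(7)   # dropped: no route, so no interrupt
```

The point of `unmap` is exactly what the talk states: once an input is unmapped, the rVID signals nothing for it, so stale device assertions are simply discarded.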
The kvmtool code is really just a way to get things started; frankly, it should be rewritten from scratch. And we have a complete guest driver for Linux. It's about the same size, around 900 lines of code, plus a bit more for the MSI support, which we'll talk about later. And frankly, it was a good opportunity to clean up the arm64 SMP code, and especially the IPI code. So, not a bad thing to do.

How fast is it? Well, it's acceptable, in a way. I don't have any scientific measurements, but early benchmarks show that hackbench, which is extremely IPI-heavy, is around 5% slower. Not too bad. I mean, 5% is huge, but in the grand scheme of things, given how this PV'ed controller is implemented, it's not horrible. A kernel compilation, which is the benchmark I actually care about, shows no significant difference, which is quite amazing. The rVIC is very trap-heavy, but it turns out that interacting with the real hardware isn't free either; it has a huge cost. All these system registers, every access we have to perform, are also very expensive, and not having any cross-CPU interaction really helps here. So definitely more analysis is required. We want to find out whether we can improve things, or whether maybe things are worse than we think. But honestly, for now, it's all right. And to be honest, performance wasn't the key point of the design; it was simplicity, and on that front we seem to have achieved our goal.

Talking about simplicity: I said earlier that we wanted to move the complexity from the hypervisor into the guest. So how bad is that? Let's look at what is probably one of the most complicated operations we have to implement, which is to move an interrupt from targeting one CPU to targeting another; that is, changing its affinity. If you do that at the GIC level, it's quite complicated, because you need to synchronize your CPUs, making sure that the state cannot change from under your feet, because it could be changed by another CPU while you're trying to move the interrupt.
It's fraught with danger. If you implement it entirely in the guest, like we do with the rVIC, and you can look at the code here, it's extremely simple, although it's logically what you would do at the hypervisor level as well. You mask the interrupt at the old CPU. You sample the pending state. You map it onto the new CPU. If it was pending, you inject the interrupt. You clear the pending state on the old CPU, and you clear the mask state on the new CPU. And you're done. The difference here? Well, there's no locking. There's no locking whatsoever, because we rely on the guest doing the locking, which it needs to do anyway; it's a given. If we had to do that in the hypervisor, it would be much harder. Here, we just reuse something that the guest has to provide. And that's key.

Now, when it comes to MSI support: as we said, being PV'ed, the rVIC doesn't define what an MSI is, because it only has abstract inputs. No MMIO, no doorbell whatsoever. It's just not there. And that's a bit of a problem, because we want to be able to reuse things like PCI device models: virtio-PCI. Because we don't have untrusted level-triggered interrupts, we can't implement INTx support so far. So we need to grow some form of MSI support. Okay, so we've invented it. We just reserve a bunch of rVID inputs as MSIs, and we add a fake doorbell, which is yet another terrible kvmtool hack, based on the GICv2m emulation. That definitely needs reworking so as to be independent of the rVIC: basically, what we need to do is create a separate virtual block that sits in front of the rVID and only provides inputs to it. Or we need the rVID to grow some MSI support. Either way, that needs to change. As a proof of concept, though, not too bad.

So where do we take this? We managed to demonstrate a few things. It's possible to PV an interrupt controller with actually very little: all we need to be able to do is inject an exception into the guest and take a hypercall.
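The guest-side affinity change walked through a moment ago can be sketched as follows. The `rvic_*` and `rvid_*` helpers are hypothetical stand-ins for the real hypercalls, not the actual guest driver code; here they operate on an in-memory fake so the six-step sequence can be exercised end to end.

```python
# Fake per-VM state standing in for the hypervisor side: a routing
# table plus per-CPU masked/pending sets.
state = {
    "route": {},  # input line -> (cpu, intid)
    "cpus": [{"masked": set(), "pending": set()} for _ in range(2)],
}

# Hypothetical hypercall wrappers (illustrative names only).
def rvic_set_masked(cpu, intid):    state["cpus"][cpu]["masked"].add(intid)
def rvic_clear_masked(cpu, intid):  state["cpus"][cpu]["masked"].discard(intid)
def rvic_is_pending(cpu, intid):    return intid in state["cpus"][cpu]["pending"]
def rvic_clear_pending(cpu, intid): state["cpus"][cpu]["pending"].discard(intid)
def rvic_signal(cpu, intid):        state["cpus"][cpu]["pending"].add(intid)
def rvid_map(line, cpu, intid):     state["route"][line] = (cpu, intid)

def set_affinity(line, intid, old_cpu, new_cpu):
    """Move an interrupt from old_cpu to new_cpu, guest-side."""
    rvic_set_masked(old_cpu, intid)                 # 1. mask on the old CPU
    was_pending = rvic_is_pending(old_cpu, intid)   # 2. sample pending state
    rvid_map(line, new_cpu, intid)                  # 3. route to the new CPU
    if was_pending:
        rvic_signal(new_cpu, intid)                 # 4. re-inject if pending
    rvic_clear_pending(old_cpu, intid)              # 5. clear pending on old
    rvic_clear_masked(new_cpu, intid)               # 6. unmask on the new CPU
```

Note that there is no locking in `set_affinity` itself: as the talk says, the guest's interrupt core already serializes affinity changes, so nothing extra is needed at this level.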
And most architectures are capable of injecting an exception and taking a hypercall. Guests don't always require a feature-rich architecture, which is great, because the rVIC provides extremely little. And we managed to keep it small. It's easy to review, it's easy to audit, and, more importantly, it's easy to prove properties about it. Frankly, the way I see it, it's a very interesting tool for prototyping new interrupt architectures and trying new semantics. We're freed from the hardware; we have both the host and the guest; we can play with that, we can invent new things, and we can see where that takes us. Also, as I said in the first point, because it requires so little, could it have a life on other architectures? Then we would have a cross-architecture interrupt controller architecture. I guess we'll find out. And with that, we thank you.