this early. My name is Mark. I work for Google. This is a short talk about trying to find ways to isolate data and execution inside pKVM. So, a quick foreword. pKVM, for those who don't know about it, aims at providing very strong isolation between the host and the guests, in the sense that the host cannot spy on what the guest's data is. There's a talk tomorrow by Quentin Perret which will give you a deep dive into what pKVM is and how it actually works. It's unfortunate that the talks are in that order, because a lot of things would make more sense the other way around, but please have a look. This is very heavy on the arm64 architecture side, because that's what pKVM runs on at the moment, although there are rumors. If anything seems obscure, and I'm sure it will be, please shout. And of course, all of this is a work in progress. Nothing really works. It does, but there's still a lot of work to do.

So first, let's start with a few silly considerations. The first thing is that software is buggy. Newsflash. We have read gadgets in hypervisors; they're not that hard to find, you just have to look for them. And that's really unfortunate in the case of confidential computing, because the hypervisor has access to state that would be better kept private. This is certainly the case with pKVM, both in terms of gadgets and in terms of having access to that state. The more code you add to the privileged part of the hypervisor, the more susceptible you are to these sorts of attacks, and the more fragile the whole thing becomes.

So, what can you actually leak? Well, the hypervisor maps a bunch of guest-specific data structures. Probably the most interesting one is the vCPU structure. The pKVM structure is also interesting, but the vCPU one is really the juicy one: it has all the registers and stuff. There's also the fact that some of the guest state, or some state in general, gets pushed onto the hypervisor stack at runtime, because we write things in a high-level language. In limited cases, the hypervisor itself could map some guest memory, but that's pretty rare; certainly on the arm64 side that doesn't happen much.

An interesting example is the ARM TRNG hypercall, which basically offers a service to the guest to get entropy, presumably high-quality entropy. One thing that can happen is that this entropy gets written onto the hypervisor stack, because register spills and things like that can happen. It could then be retrieved with a gadget and a reasonable timing attack. And this is worrying. Worried people send me emails. But it's all fine, because the architecture has a solution for us. That's a literal quote from the ARM spec: it says that you need to discard all the entropy bits once you've communicated them to the requester, and you overwrite them with zeros. Okay, that's interesting. But allow me to be a bit hyperbolic here. The hypervisor keeps a record of the vCPU state, and that includes copies of the registers. So discarding that entropy is really counterproductive. Are we going to zero the guest registers? Because, oh no, we can't possibly keep that entropy around. There's also the fact that, as I said before, the hypervisor is written in a high-level language. Yes, people see C as a high-level language. But it doesn't track what gets spilled on the stack. And more importantly, the stack is not visible to us, so we can't even control what's there; it all depends on the compiler.
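To make the spill problem concrete, here is a rough sketch of what a TRNG-style service handler at EL2 could look like. This is not the actual pKVM code: the names (struct hyp_cpu_regs, hyp_get_entropy, handle_trng_rnd64) are made up for illustration. The point is that even if the handler dutifully zeroes its local copy as the spec asks, the same bits sit in the saved guest register file, and whatever the compiler spilled on the stack is out of our hands.

```c
/*
 * Hypothetical sketch of an SMCCC TRNG-style service handler at EL2.
 * All names here are illustrative, not the actual pKVM code.
 */
struct hyp_cpu_regs {
	unsigned long x[31];		/* in-memory copy of the guest GPRs */
};

/* Assumed entropy source, provided elsewhere in the hypervisor. */
void hyp_get_entropy(unsigned long *buf, int nwords);

static void handle_trng_rnd64(struct hyp_cpu_regs *regs)
{
	unsigned long ent[3];		/* locals: the compiler may spill these on the EL2 stack */

	hyp_get_entropy(ent, 3);

	/* Return the entropy to the guest in X1-X3. */
	regs->x[1] = ent[0];
	regs->x[2] = ent[1];
	regs->x[3] = ent[2];

	/*
	 * Discard our copy, as the spec asks. This does nothing about
	 * whatever was spilled on the stack, and the very same bits
	 * still live in the saved register file written just above.
	 */
	ent[0] = ent[1] = ent[2] = 0;
}
```

A compiler is also free to keep ent[] entirely in registers, or to spill it several times; nothing in C lets the handler know or control that.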
And then there are some ARM-specific considerations. Even if you overwrite the data, without cache maintenance operations you can actually still retrieve it; all you need is a non-cacheable mapping of the same memory. And if you do perform cache maintenance operations, to what level? We have multiple cache levels in the ARM architecture; we're not going to go into the details, but you can flush to any of these levels, and if you snoop the data at the level below, you can still retrieve it there. So that doesn't work as a security feature. We could map everything as non-cacheable; that's not an actual suggestion. So we need something else. Ideally, we need a way to isolate the memory that is used by the hypervisor while dealing with a vCPU. And if you haven't had enough of MMU talks yesterday, well, yes, this is about page tables.

So let's step back a bit. You've probably seen these slides tons of times: the way the exception levels work on arm64. We have the ARMv8.0 exception model, which has, on the non-secure side, three exception levels. We have EL0 for user space. We have EL1 for the kernel, which has two page table base registers, we call them translation table base registers; basically you get two roots, one for the kernel, one for user space. And we have EL2, our hypervisor level, which has a single one, so a single address space. And here is how KVM uses that mode, which we call nVHE for obscure reasons that you'll see in a couple of slides. On the left side of this slide, we have basically the host, it's not a VM, but really the host user space and kernel, which use their own isolation, their own page tables. We have the KVM world switch, which runs at EL2. And the guest runs in the middle of this diagram, wrapped by what we call stage two page tables, the equivalent of EPT on x86. That provides the isolation.

An interesting thing to notice is that, from a translation perspective, EL2, having only one TTBR, has all of its address space down at zero. So any data that needs to be shared between the Linux kernel and EL2, the hypervisor, needs to be mapped at an offset: anything that is mapped in that TTBR1 region in blue can get mapped into the EL2 address space at an offset from zero, which means that when we follow pointers at EL2, we need to offset them at runtime. We have ways to do that relatively efficiently by live-patching the code at boot time (there's a small sketch of this at the end of this passage), but it's still there; it's not a significant overhead, but you have to think about it.

So how does pKVM use that same mode? Well, it's basically the same, except that we now have a stage two that wraps the host as well. And that provides the isolation guarantee that we need, so that when we give the guest some memory, we remove it from the host stage two and, lo and behold, the host can't access it. That's quite neat; not necessarily massively efficient, but it's quite neat. What does it mean in terms of translation? Well, it's actually exactly the same, except that we can play with permissions, so that a page given by the host kernel to EL2 can either be handed over exclusively to EL2, or keep some access at EL1 if we want to share data between the host EL1 and EL2. So that gives us fine-grained control. In the previous mode, we could write everywhere from EL1; there was no security boundary between the two.

So that's v8.0, also known as old stuff. But fear not: since 8.1, eight years ago, we've improved on that, and we have this 8.1+ exception model, which is a bit different in the sense that EL1 has temporarily disappeared, but EL2 has gained an extra TTBR. So it's back to having two sets of page tables.
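As an aside on the nVHE pointer offsetting mentioned above, here is an illustrative sketch of what converting a kernel address into its EL2 alias involves. The real kernel has a kern_hyp_va() helper whose mask and offset are patched in at boot; the values below are made-up placeholders, not the actual layout.

```c
/*
 * Illustrative only: turn a kernel (TTBR1) virtual address into the
 * offset EL2 address at which the same object is mapped in nVHE mode.
 * The mask and offset below are made-up placeholders; the real ones
 * are live-patched into the code at boot.
 */
#define HYP_VA_MASK	0x0000ffffffffffffUL	/* assumed: strip the kernel VA top bits */
#define HYP_VA_OFFSET	0x0000400000000000UL	/* assumed: chosen (and randomized) at boot */

static inline void *kern_to_hyp_va(void *kern_va)
{
	unsigned long va = (unsigned long)kern_va;

	va &= HYP_VA_MASK;	/* drop the TTBR1 sign-extension bits */
	va |= HYP_VA_OFFSET;	/* rebase into the single EL2 (TTBR0) range */

	return (void *)va;
}
```

Every pointer the nVHE hypervisor follows has to go through something like this at runtime; with the hVHE scheme described next, the hypervisor mappings sit in TTBR1 and pointers can be used at face value, so this step disappears.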
And here is how KVM uses this: it's a lot simpler in a way, because Linux and KVM basically share the same address space. Actually, KVM doesn't exist on its own, it's part of the kernel, and conceptually it's very similar to what other architectures do. Of course, in that mode, we can't really have something like pKVM as it stands: we can't wrap EL2 with its own stage two. That doesn't really make sense; that's not how things are built. So we need to play a bit of a game, and it's the letter soup game. We have nVHE, we have VHE, we invent hVHE. None of that is architectural; I made this up. But the idea is: why do we force KVM to use the nVHE model on hardware that actually supports VHE? One thing we could do is enable VHE for EL2 only and still run the kernel at EL1. It's basically the nVHE logical model, but using the VHE infrastructure. We just have to pretend that we're running a guest when we're actually running the host. And conceptually, that's exactly what pKVM does already. We just need to clear a bit in a system register, this TGE bit, which means Trap General Exceptions. We set that to zero, which tells the architecture: I'm running a guest, so I'm not going to bother taking exceptions that should be routed, for example, from EL0 to EL1; I don't want to see them at EL2. (There's a small sketch of this at the end of this passage.)

So how does that translate into address spaces, or rather memory layout for now? Well, we get these two VA ranges. Going from the mode at the top, the nVHE model, to the one at the bottom, we can actually move the hypervisor mappings to TTBR1, which means we don't need to play this translation game at runtime. We can just use pointers at face value. We'll probably have to sanitize them, but at runtime we don't need to offset them. That's quite neat; it saves us a few instructions, but also a few headaches. Interestingly, TTBR0 is unused for now; we'll get back to that later.

So what does it mean for pKVM with this new fancy hVHE mode? Well, it looks a lot like the previous incarnation of pKVM, except that we now have this blue box on the side, which serves absolutely no purpose. So what is it good for? Absolutely nothing. We still use a single address space for EL2, but we had some fun moving things around, which is always a nice thing to have. And now we can run on the Apple CPUs, which can only do VHE.

So where does that take us? Before we can move forward, we need to have a look at a couple of extra things; we'll see how it all comes together. The ARM architecture has this concept of ASID, the Address Space Identifier, which is used to tag TLBs. We can make TLB entries non-global with this bit called nG, non-global. It's conceptually similar to PCID on x86: it basically allows you to have address spaces, and Linux uses that to isolate user space contexts. Each address space gets its own set of page tables and its own ASID, and any translation that gets cached in the TLBs will carry that tag, so you can't, from one address space, hit someone else's translation. And that can be used as long as you have two TTBRs; that's described in the architecture as the EL1&0 and EL2&0 translation regimes. That's exactly what we've shown earlier.
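Going back to the TGE trick for a second, here is a minimal sketch of the idea, with illustrative register accessors rather than the actual pKVM helpers: HCR_EL2.TGE decides whether exceptions from EL0 are routed to EL1 (normal host behaviour) or taken at EL2 (which the EL0 module sandbox described later relies on).

```c
/*
 * Minimal sketch of the TGE trick: with HCR_EL2.TGE clear, the host
 * kernel at EL1 looks like "a guest" from EL2's point of view, even
 * though EL2 itself uses the VHE (EL2&0) translation regime.
 * The accessors are illustrative, not the actual pKVM helpers.
 */
#define HCR_TGE		(1UL << 27)	/* HCR_EL2.TGE: Trap General Exceptions */

static inline unsigned long read_hcr_el2(void)
{
	unsigned long v;

	asm volatile("mrs %0, hcr_el2" : "=r" (v));
	return v;
}

static inline void write_hcr_el2(unsigned long v)
{
	asm volatile("msr hcr_el2, %0" : : "r" (v));
	asm volatile("isb");
}

/* Before dropping back to the EL1 host: route EL0 exceptions to EL1. */
static void hyp_enter_host(void)
{
	write_hcr_el2(read_hcr_el2() & ~HCR_TGE);
}

/* Before running an EL0 "module": take its exceptions at EL2 instead. */
static void hyp_enter_module(void)
{
	write_hcr_el2(read_hcr_el2() | HCR_TGE);
}
```

Clearing TGE while the host runs at EL1 is what lets hVHE keep the nVHE logical model on VHE-only hardware.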
So, another concept, which has nothing to do with this, is the concept of loading a vCPU in KVM. That has the effect of making a vCPU notionally resident: you do a vcpu_load, and KVM knows that, okay, we're dealing at the moment with this vCPU. You reverse that with vcpu_put, and then no vCPU is resident on this physical CPU. And that's exactly what we use this for on arm64: any state that is compatible with the current execution of the host can be directly loaded on the physical CPU, and anything that would disrupt the execution of the host is only switched in immediately before we actually jump into the vCPU itself. So far so good. With pKVM, we add a few additional restrictions: if you've loaded a vCPU on a physical CPU, you can't run another one; you need to do a put first. So you can't play games; you're not allowed to play games with that. Another game you can't play is loading a vCPU on a physical CPU and then trying to load it again on another physical CPU. pKVM has code that ensures this is not possible. And we're going to make use of this.

So, with apologies to Virginia Woolf: what if we could make it so that, as I said, a vCPU can only be resident on one physical CPU at a time? We already have that guarantee with pKVM. But also that it has its own address space in the hypervisor, and has its state only mapped in that address space. And if we could add to that a stack dedicated to execution at EL2, also only mapped in that address space. If we could do that... well, we can, actually. We have an extra address space, an extra VA range, actually. Because we're in the EL2&0 translation regime, we can make use of that extra blue box we had earlier, and map there the vCPU state and the hypervisor stack used in the context of dealing with that vCPU. And if it feels a bit like some kind of twisted user space, it's about that; we'll talk about that later. It's really a strong isolation primitive. It gives you per-vCPU TLBs. And there's a property in the ARM architecture that guarantees that, unless you say otherwise, TLBs are CPU-private; you can't share those between CPUs, even on an SMT system, which are thankfully pretty rare for us. So you get that notion of per-vCPU, per-physical-CPU isolation.

And that comes with another interesting bonus: you get fixed maps for most of these things. Since each vCPU has its own address space, the vCPU can live at a fixed address in that address space, which means you do not need to follow a pointer to reach that vCPU. If you map it at 64K, it will always be at 64K, and you can make sure that all your vCPUs are mapped at this address, so all the pointer chasing becomes just loading a constant. It's a mild improvement, but it's one. It's the same thing for the vCPU stack: you know exactly where it is, you don't need to store the address in a data structure. If you want, you can randomize those addresses, and they will still be constant. So we end up with a bunch of per-vCPU, per-CPU fixed maps (there's a small sketch of this just below). And a consequence of that is that any register spill on the stack is now only visible to this vCPU, to this physical CPU, sorry, because we can only map this vCPU once, on this physical CPU; we guarantee that.

Yes? The question is: does it mean that you can only see it on one CPU at a time, or is it mapped at a different address on another physical CPU? No, it's only mapped once, on a single CPU, and nowhere else. It's the same address on all CPUs, but for different contexts. So everybody uses the same address, and since you only have one register to map something with, you can't map two things at the same address; they are mutually exclusive.
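Here is a small sketch of that fixed-map idea; the addresses and names are illustrative assumptions, not the actual pKVM layout.

```c
/*
 * Illustrative only: because each vCPU has its own EL2&0 address space,
 * the currently loaded vCPU and its EL2 stack can sit at well-known
 * constant addresses in that space. The values below are made up.
 */
#define HYP_FIXMAP_VCPU		0x0000000000010000UL	/* "if you map it at 64K..." */
#define HYP_FIXMAP_STACK_TOP	0x0000000000030000UL	/* assumed: one stack page plus a guard */

struct kvm_vcpu;			/* opaque for the purpose of this sketch */

/* No pointer chasing: the loaded vCPU is always at the same constant VA. */
static inline struct kvm_vcpu *hyp_loaded_vcpu(void)
{
	return (struct kvm_vcpu *)HYP_FIXMAP_VCPU;
}
```

The constant is the same on every physical CPU, but because the translations behind it are private to each CPU and tagged with the current ASID, it resolves to a different vCPU on each of them.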
So if we go back to this diagram, we see that in that blue box we now have the vCPU state and its stack. And that's our isolation context now for the execution of this vCPU at EL2. In terms of code, what does it look like? Well, that's not the real thing; I've condensed it a bit and removed a few things (there's a rough sketch at the end of this passage). It's basically about fetching the vCPU structure, building the TTBR0 value that contains the ASID and the root of the vCPU's page tables, performing some synchronization, and switching stacks. This write to TTBR0_EL2 really sets both the ASID and the root of the page tables atomically. That's an important construct. I'm going to have to go fast now.

Of course, this comes at a cost. It's extra memory allocation: we need four page tables, and we need an extra page for the stack at EL2, per vCPU. We need a zero page for when no vCPU is resident, and we can point all the TTBRs at that one, just to make sure we don't fetch any extra TLB entries when no vCPU is resident. We need an ASID allocator, which limits the number of vCPUs; on most hardware that's two to the sixteenth, which should be enough, plus a reserved one for when no vCPU is resident. What's bad? Well, we've killed pKVM on v8.0. Oh, well. I don't think anyone will shed a tear, but okay, it's still an important consideration.

Maybe we can do more in two minutes and thirty seconds. We can do sandboxes. And why would we like to sandbox things? pKVM on mobile devices doesn't only want to isolate VMs and the host from each other on the CPU; we also need to do the same thing for DMA, and we need IOMMUs to perform that isolation. As it turns out, on mobile SoCs these are not necessarily standard IOMMUs, and they come with all sorts of really weird and wonderful power management requirements, which means we need some kind of small drivers in the hypervisor. So how do we make that scale, in terms of enabling pKVM on a large set of systems, while still maintaining some level of sanity? As a maintainer, I tend to fear these kinds of things. So how about hypervisor modules? Well, it's not exactly new; we've had that in the past, not necessarily for the hypervisor, but we have kernel modules. But kernel modules are not necessarily ideal for what we're trying to do. First, we don't really have any EL2 API, and at the same time we're also trying to reduce the amount of privileged code. So what if we could sandbox those? Well, we have address spaces now, but so far only to map data. We could also use that as a primitive to actually run code: we just need to make mappings executable there and treat it as, oh well, user space again, which means setting TGE to 1, because in that case we do want to trap exceptions from there. We ERET to EL0, i.e. user space, and start executing our module. Easy peasy. A system call takes us back to EL2, and we reset TGE. By design, this module doesn't have access to any vCPU state, because they compete for the same TTBR0 address space. So that's, again, a good isolation primitive. So that's what it is in the end: we have extra hypervisor code running at EL0, deprivileged.
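And going back to the condensed vCPU-load sequence mentioned earlier, here is a rough sketch of its shape; the structure and field names are assumptions, not the actual pKVM code, and in practice the stack switch is done from assembly rather than from C.

```c
/*
 * Illustrative sketch of installing a vCPU's private EL2&0 address
 * space. The single write to TTBR0_EL2 sets both the page table root
 * and the ASID atomically; the names below are made up.
 */
#define TTBR_ASID_SHIFT	48		/* ASID lives in TTBR bits [63:48] */

struct pkvm_hyp_vcpu {
	unsigned long pgd_phys;		/* root of this vCPU's EL2&0 page tables */
	unsigned long asid;		/* allocated per vCPU */
	unsigned long stack_top;	/* EL2 stack, mapped only in this space */
};

static void hyp_vcpu_load_space(struct pkvm_hyp_vcpu *vcpu)
{
	unsigned long ttbr = vcpu->pgd_phys | (vcpu->asid << TTBR_ASID_SHIFT);

	asm volatile("msr ttbr0_el2, %0" : : "r" (ttbr));
	asm volatile("isb");		/* make the new translations visible */

	/*
	 * ...then switch SP to vcpu->stack_top. That part is only shown as
	 * a comment here: changing the stack pointer under a running C
	 * function is not something you can sanely do from C.
	 */
}
```

Pointing TTBR0_EL2 at the shared zero page, with the reserved ASID, is the reverse operation when no vCPU is resident.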
So, as a conclusion, which is not quite a conclusion really: we have the basic blocks to provide data and code isolation for the privileged part of pKVM, and we can introduce some form of driver modularity to deal with a really complex ecosystem. But that was the easy part. The hard part is to define how we make use of this sandboxing: in terms of API, in terms of loading these modules, guaranteeing that they do the right thing, defining an API. That's the next challenge. And yes, it's a huge work in progress. And I'm out of time, so thank you. We have ten seconds for questions; otherwise I'm around all day, so feel free to grab me.

Okay. Is there a cost to changing the TTBR? No, that's the whole point: changing the TTBR doesn't have any influence on the TLBs. It just changes the way you will populate the next lot of TLB entries. Architecturally, writing the TTBR doesn't invalidate anything. So you're just adding some pressure on the TLBs? Yeah, but you had that pressure already, by virtue of already having a mapping; you've moved an EL2 mapping to an EL0 mapping. Could you repeat that? So, could you do the same thing with an application that would be on the host? Well, we could do that, but we'd need a context to ERET to. Why not? The problem is that you'd ERET into something you have hardly any control over. We can look into it, but I find it slightly jarring. But yeah, we're happy to entertain the idea. Christopher? Yeah, probably. But then for your module, you would be jumping to EL0, but you wouldn't have access to the privileged part of the hypervisor. Right. So that's just about data, and it's only being accessed from EL2. Right. Anyone else? We are out of time anyway. All right. Thank you very much.