Good morning, good afternoon, good evening, everyone. I'm Will Deacon. We're talking about virtualization for the masses, which is all about exposing KVM on Android. Super exciting. I've left this till the last minute in the recording, so we're going to do it in one take. So hopefully no deliveries halfway through. Let's see how we do. Introduction. So who am I and what am I hacking on? So I've been hacking on the Linux kernel for a while, but not so much with KVM, so this is all a little bit new for me. But yeah, active upstream kernel developer. The big thing I co-maintain is the arm64 architecture, along with Catalin Marinas from Arm. And as well as that, I'm also working on concurrency aspects, so locking, atomics, memory model tools, TLB invalidation, good stuff there. And I also maintain the SMMU drivers, which are Arm's IOMMU IP, essentially. So that's kind of the closest contact I've had with KVM: the device pass-through and the VFIO stuff. So last year I joined the Android systems team at Google, and I found myself leading the protected KVM project there, which is a project to enable KVM on Android. It's a relatively new team, and lots of people on the team haven't even worked on the Linux kernel before, so it's been quite exciting ramping everybody up, especially remotely. Consequently, though, we've managed to come out as the top contributors to KVM for arm64 for both 5.9 and, I think, 5.10 as well, looking at what's queued for the current merge window. And there's tons more to come, because what we're working on seems to be a hot topic, not just on arm64 but for other architectures as well. So I'm very keen to hear from other people, whether it's in the questions or in emails, about other efforts to do similar stuff to what we're doing. And a disclaimer before we jump in: this is all very much a work in progress. We are upstreaming as we go.
We don't want to fall into the old trap of doing a whole lot of stuff out of tree, particularly for Android, and then throwing it over the wall to the community and saying, there you go, have at it, because that doesn't get anywhere upstream, and it's not a very good way of building up a decent solution involving the community. So we're upstreaming as we go. Some parts of the code we're pretty confident with; some parts we don't really have much of a good idea on yet. We're open to change, so please get in touch if you've got ideas. Mostly it works all right. For things like user ABI, obviously we're not upstreaming that as we go, because you kind of need to get that right first. But for implementation stuff, it's a good approach. So before I really talk about the KVM side of things and the virtualization side of things, I need to talk a bit about the state of modern Android, to motivate why we're doing what we're doing and give you an idea of some of the problems we're facing. So the latest thing in the Android systems space for modern Android is this thing called the generic kernel image, or GKI. You might have heard of it. There's a link here at the bottom where you can read loads of stuff about it on source.android.com. There are also LWN articles, and it's been mentioned at Linux Plumbers at least once, I think. What this generic kernel image is about is attacking the problem that traditionally there's a separate kernel for each Android device. So you have different Android handsets, and they're all running their own kernel. Clearly this doesn't scale. It leads to horrible fragmentation. And the problems associated with that fragmentation include the inability to update those systems, because it's difficult and expensive if you have to deploy a separate kernel update for every single device when they're all running different kernels. It just doesn't scale at all.
It can also make it impossible to upgrade from one Android release to another for a given device, because that new Android release might require a kernel feature that's not present on that device, and there's no way of doing an upgrade to a kernel which introduces the feature. And I think also, and it's a point that doesn't get made quite enough, it's bad for upstream, because the job of the upstream kernel is to have the right subsystems and the right abstractions in place so that we can support all these different devices, to get the generality right. And you can't do that unless you have visibility into all of the different problems out there, the different devices and different pieces of hardware and the different solutions in them. And because it's all squirreled away in these different kernels, it's very hard to see the wood for the trees. You can't quite come up with an abstraction that will work for everybody, because it's difficult. So GKI aims to solve this fragmentation problem, basically, and the way it does that is by rallying around a single kernel image for a given Android release and kernel version. So here I've said Android 11, 5.4. This would be a given GKI kernel. And for that kernel, which is the only 5.4 kernel for Android 11, a subset of the module ABI remains stable. What this means is you can build a binary module as a vendor. You can have a driver, which is a binary module, built against this branch. And as the branch is updated with LTS and security updates, we, as the Android systems team, will guarantee that your modules will continue to load into that kernel. So this means that the vendor portion, i.e. device drivers, can be deployed independently of the core kernel image. And it also means that you can have a core kernel image that's shared across multiple devices, and then we can take LTS updates as long as we don't break the ABI.
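To make the allow-list idea concrete, here's a minimal userspace C sketch of the check. The symbol names and the `module_loadable` helper are entirely hypothetical; the real GKI tooling works on the kernel's exported symbol lists and vmlinux metadata, not on anything like this toy.

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allow list of exported symbols whose ABI the GKI kernel
 * promises to keep stable across LTS updates of one release/version. */
static const char *const abi_allowlist[] = {
    "printk", "kmalloc", "kfree", "device_register",
};

/* A vendor binary module keeps loading across kernel updates only if
 * every symbol it imports is on the allow list. */
static bool module_loadable(const char *const *imports, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        bool found = false;
        for (size_t j = 0;
             j < sizeof(abi_allowlist) / sizeof(abi_allowlist[0]); j++)
            if (strcmp(imports[i], abi_allowlist[j]) == 0)
                found = true;
        if (!found)
            return false; /* imports an unstable symbol: no guarantee */
    }
    return true;
}
```

The point of the sketch is just that the stability promise is opt-in, per symbol, rather than a promise about the whole kernel ABI.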
And the ABI, I mean, I can probably hear you screaming. You know, kernel ABI: it's the worst thing, you don't want to do it, don't do it. Well, like I say, it's only for a given Android release and kernel version. So it's kept over LTS merges, but also it's not for the whole of the kernel. It's for an allow list of symbols, for which we explicitly say we will maintain this. So that's the idea of GKI: solving the fragmentation problem for the kernel. So what about the hypervisor? Well, let's look at virtualization on Android today. And if you think fragmentation on the kernel side is bad, this is much, much worse; at least on the kernel side everybody's using some version of Linux. With the hypervisor, it's totally crazy. I'd say it's the wild west of fragmentation. Some devices don't even have a hypervisor. So, you know, that's one possible configuration. You don't have to have one; you just kind of ignore that exception level. Many devices do, but it's a hypervisor, Jim, but not as we know it. Looking at the sort of three things that people do with the hypervisor, I've listed them here in these bullet points. So the first one is security enhancements, protecting the kernel. Some people put code in the hypervisor which allows the kernel to make hypercalls to, you know, change permissions on pages and things like that. That's all well and good. But the thing to remember here, obviously, is that the hypervisor itself is running with elevated privileges. And mitigations are attack surface, too. If you stick that string into your favorite search engine, there's a great article from Jann Horn of Project Zero showing how you can attack and exploit vulnerabilities in some of these security enhancements. The code that's supposed to be protecting you is itself vulnerable, so it's not protecting you at all. It's introducing more issues than it solves. Two other things here: coarse-grained memory partitioning.
So many of these devices feature something that looks a little bit like an IOMMU, but if you look closely, it's not. Often you don't have translation, or you might not have page-granular mappings. But what you can do with this hardware is carve up the physical address space. So early during boot, you just do this thing once, and you say, well, this piece of memory, this guy can DMA to it; this piece of memory, the CPU can have; this piece of memory goes to the radio. And that's okay, fine, I get why that needs to be done, but it all happens at boot, and then that kind of code doesn't really do much after that. And it feels like a big waste of the exception level. There's a lot more you can do than just carve up your physical memory once. And this third point is my least favorite: running code outside of Android. Put simply, the hypervisor exception level is a place where you can put code and it's not firmware, so you don't have to worry about breaking the device when updating it. And it's not the Android side of things, so you don't have to worry about having to integrate with anybody else. So it's a little bit of a playground: if you've got some code and you don't know where to put it, well, I'll just stick it in the hypervisor layer. And that's really bad, because it's then running with privileges that it really does not need. And the takeaway as well: most of the time, down here, there aren't even any virtual machines. What kind of hypervisor is this? It doesn't really offer much in the way of the hypervisor-like services that you might hope for. And my conclusion on all of this is that I think both security and functionality lose out: security because you've gone and increased the TCB, and updating this thing is difficult because of the fragmentation, and certainly functionality loses out, again because of the fragmentation.
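As a rough illustration of that coarse-grained partitioning, here's a toy C model. The master names and addresses are invented, and real hardware does this with registers in a memory controller or TrustZone address-space controller rather than a table walk; the point is only that the policy is fixed once at boot and then never changes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bus masters on the SoC. */
enum master { MASTER_CPU, MASTER_GPU_DMA, MASTER_RADIO };

/* One static partition of the physical address space, set up once at boot. */
struct region {
    uint64_t base, size;
    enum master owner;
};

static const struct region partitions[] = {
    { 0x80000000, 0x40000000, MASTER_CPU },     /* 1 GiB for the kernel  */
    { 0xc0000000, 0x10000000, MASTER_GPU_DMA }, /* 256 MiB DMA window    */
    { 0xd0000000, 0x08000000, MASTER_RADIO },   /* 128 MiB for the radio */
};

/* Check whether a master may touch a physical address. After boot, this
 * style of "hypervisor" does nothing beyond consulting the fixed table. */
static bool access_allowed(enum master m, uint64_t pa)
{
    for (size_t i = 0; i < sizeof(partitions) / sizeof(partitions[0]); i++) {
        const struct region *r = &partitions[i];
        if (pa >= r->base && pa < r->base + r->size)
            return r->owner == m;
    }
    return false; /* unassigned memory: nobody gets it */
}
```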
It's not possible to expose a portable hypervisor or virtualization API to Android applications, or even to the Android host kernel, because it's all so different. You can't leverage any of this stuff. And often it's just locked down anyway. So to give you a more concrete flavor of exactly how this stuff fits together, we need to go through the Arm exception model here. This is the traditional view: if you asked somebody familiar with the Arm system architecture about the Arm exception model, they would give you this picture, and there are lots of versions of this picture on the internet. So on the left-hand side we've got EL0 to EL3, where EL3 is more privileged than EL0. EL0 is user space: apps. EL1 is kernel; as you can see, there are a couple of kernels there. EL2, hypervisor. EL3, firmware. And the interesting thing here is probably this dividing line down the middle, which partitions EL0, 1 and 2; it kind of mirrors them. On the left side we've got the non-secure side, sometimes called the normal world, and on the right-hand side we have the secure world. And the difference between these two is a bit on the bus which tells you whether you're secure or not. But this is a way of running the so-called trusted operating systems or trusted applications in some physical memory which is otherwise inaccessible to the normal world. So you've got some sort of isolation. This has been in the Arm architecture for a long time, since before the support for virtualization. And I'm going to talk a bit more about this because it's important as a motivator for what we're doing. Because if you take this diagram and reorganize it in terms of what I'm going to call privilege, which is your ability to access or map physical memory, it looks a bit like this. And you'll see that I've put the secure world basically underneath the non-secure world, where the lower down in the stack you are, the more privilege you have.
And that is because the way the architecture works is that the secure world can access all of non-secure memory. You just have to set a bit in the page table to say, hey, I want to map this non-secure. But you can't do it the other way around. The non-secure world cannot map secure memory: if it tries to set the bit to say secure, it's ignored by the MMU. So this means that if this trusted OS wants to, it can map hypervisor memory, and worse, if it offers some sort of wacky mmap call down to a trusted app, maybe the trusted app can also map hypervisor memory too. And this is bad for a few reasons. But if you map it to how it looks on an Android system, we can first of all get rid of a few of these boxes. We don't have any virtual machines, as I said, and we don't have this because the hardware doesn't yet exist for any of these bits. When you look at the boxes that are left, it's bad, because Android is running up here. So Android is now sort of the least privileged part of the system. And this includes the GKI we talked about, the modules, all of the system libraries, the apps. This is what people would normally think of as Android: hey, I'm running Android, even though you've got all this other software. So Android, which is the part that we're trying to keep updated, the apps being updated over the Play Store because they've got a security problem, okay, that's fine, but we've got all this other stuff running alongside it. And what kind of stuff does run there? Well, over in the secure world you might have DRM in there. You might have crypto in there. Now, why DRM would need this sort of elevated privilege, with access to all of hypervisor memory, I don't know, but it's an unfortunate side effect of the way this is all put together. The third-party OS is an opaque blob, and it's integrated per device, which means you've got the fragmentation problem I've been talking about for most of this talk so far.
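The asymmetry described above, where the secure world can reach non-secure memory but not the other way around, can be modelled in a few lines of C. This is only a sketch of the NS-bit behaviour, with invented names, not a faithful model of the architecture.

```c
#include <stdbool.h>

/* Toy model of the TrustZone NS bit; not real hardware in detail. */
struct mapping_request {
    bool requester_secure; /* which world is issuing the access       */
    bool wants_ns;         /* the NS bit the requester sets in its PTE */
};

/* Which physical world does the access actually hit? The architectural
 * asymmetry: a secure-world PTE may set NS=1 to reach normal memory,
 * but for a non-secure requester the NS bit in the PTE is ignored and
 * the access is always treated as non-secure. */
static bool effective_access_is_secure(struct mapping_request req)
{
    if (!req.requester_secure)
        return false;     /* NS bit ignored: can never reach secure memory */
    return !req.wants_ns; /* secure world chooses, per mapping, via NS bit */
}
```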
You'll notice that all of the software that runs in that secure world is typically prefixed with "trusted". If I go back one slide, you can see it's a trusted OS, trusted app, trusted partition manager. And I think the reason for this is just marketing. This is a technology that Arm call TrustZone, and it's to give you the feeling that something is safe and reliable. But another definition of trust is "to expect, hope or suppose". And the unfortunate reality is that Android hopes that this software isn't malicious or compromised, because if it is, there's nothing we can do about it. We're running down in that tiny little corner, you know, deprivileged. So what we want is a way to actually deprivilege this third-party code. We don't want it running over in the secure world, and we want to provide a portable environment so that this trusted software can be isolated from each other, but also from the rest of Android. So, putting a red arrow on that diagram: put simply, why don't we move all of this into a virtual machine? Because we haven't got any, right? There's a nice space here, a gap that it fits right into, and now it can't really have access to normal memory anymore, because it's going to be managed by this hypervisor. But what about that hypervisor? What's going to live here? Well, the idea is that we can leverage the GKI effort, right? GKI is about shipping this single kernel binary which should work everywhere. Well, that's based on Linux, and Linux has KVM as a well-supported hypervisor. So maybe we can leverage the GKI effort, and as well as putting the GKI kernel here, we can also put KVM in as the hypervisor. That's the idea. So now let's move on and talk a bit about what virtualization looks like on arm64, and in particular how it's used by KVM, because there are a couple of modes and it's important to see why they don't both work for us. So let's just quickly run through the checklist to make sure this is even remotely usable.
The good news is that all arm64 Android devices have support for hardware virtualization. It's hugely underused technology, because of the fragmentation, but the CPUs support it. They have a two-stage MMU. There's nothing particularly bizarre here: you've got a stage 1 translation, which is a set of page tables owned by the guest, and the output of that, which the guest reckons is a physical address, is not; it's an intermediate physical address, and that goes as input to a second-stage page table, owned by the hypervisor, to actually do the translation. And so these stage 2 page tables allow you to provide memory isolation, because you can carve up your physical memory and donate different bits to different guests. So yeah, our idea is to move out this third-party code to improve security, and in the end we want a common hypervisor in Android so that we can have confidentiality of data and computation for applications, and for new use cases as well. So how does KVM fit in? KVM has been supported for a while on arm64, since 3.11. I mean, arm64 itself was only supported, I think, in 3.7, so it's been there almost since the beginning for us. And it's curious, because it can run in one of two modes, broadly speaking. So, back to these exception level diagrams: on the left here we've got the traditional one that I showed earlier on, so EL2 is our hypervisor.
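The two-stage translation described above can be sketched as a composition of two lookups. This is a toy model, assuming single-level tables and made-up page counts, where real arm64 tables are multi-level with per-entry attributes.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define NPAGES 8

/* Toy single-level "page table": index = page number, value = output page,
 * with -1 meaning not mapped. */
typedef int64_t pt_t[NPAGES];

static int64_t translate(const pt_t pt, int64_t page)
{
    if (page < 0 || page >= NPAGES)
        return -1;
    return pt[page];
}

/* A guest virtual address goes through the guest-owned stage 1 to get an
 * intermediate physical address (IPA), then through the hypervisor-owned
 * stage 2 to get the real physical address. */
static int64_t guest_va_to_pa(const pt_t stage1, const pt_t stage2, uint64_t va)
{
    int64_t ipa_page = translate(stage1, (int64_t)(va >> PAGE_SHIFT));
    if (ipa_page < 0)
        return -1; /* stage 1 fault, handled by the guest */
    int64_t pa_page = translate(stage2, ipa_page);
    if (pa_page < 0)
        return -1; /* stage 2 fault, handled by the hypervisor */
    return (pa_page << PAGE_SHIFT) | (int64_t)(va & ((1 << PAGE_SHIFT) - 1));
}

/* Example: guest page 2 -> IPA page 5 -> physical page 7. */
static const pt_t demo_stage1 = { -1, -1, 5, -1, -1, -1, -1, -1 };
static const pt_t demo_stage2 = { -1, -1, -1, -1, -1, 7, -1, -1 };
```

Carving up physical memory between guests then amounts to giving each guest a disjoint stage 2.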
So this is also known as nVHE; I'll come on to what that means in a minute. This is v8.0, so all arm64 CPUs can run in this mode if they want to; the hardware supports it. So you have your VMM down in host user space, which could be QEMU, and then you have your host kernel. So far, so good. And flipping over here, you've got a guest kernel, also running at EL1, with its applications. Okay, great. But because of this hypervisor layer, you can't actually jump directly from one to the other; you can't do an exception return directly from this host kernel to this guest, because you don't have the privileges to access the system registers, and things like TLB invalidation instructions, all that kind of stuff that you need to do the context switch. So what happens is you have to hypercall up to EL2. So if the VMM does vcpu run, that's going to be an ioctl to the host kernel, which will do some mucking about, and then it will do a hypercall up to say, right, I want to run this vcpu. We then have what we call the world switch, which just switches all of our register state, and then an exception return back to the guest kernel. So that's how it works on a v8.0 system, and obviously the problem here is that you end up jumping around quite a lot. You have to keep going up and down through this world switch code, which can add latency to things like an MMIO exit, where you've got to go up from the guest, back down to the host, and if it has to be handled in user space by the VMM, you go up and up and then down again. So, v8.1, which is kind of the iterative release of the architecture, added this thing called the Virtualization Host Extensions, or VHE; so this, the v8.0 mode, is non-VHE. They added some tweaks which actually mean that you can run the host kernel directly at EL2. The reason we can't do that over here is that this EL2 environment is quite constrained, more about that later, so you can't run the host kernel there without VHE. But with VHE you can do that, and you don't have anything here
you can just exception return down, so this is blazingly fast. But the problem with this is that, although it's fine for KVM, it really shows an issue with the threat model of KVM when compared to the threat model for Android. The threat model of KVM places the entire host kernel, and the VMM via these ioctls, into the TCB, because the host has full access to guest memory, and the VMM has full access to guest memory; all the memory over here is accessible over here. So it's a bit like an inverse TrustZone, because before, when this guest kernel used to run in the secure world, it was the one with access to everything, and now it's not, but the host kernel has access to everything. So I think we've just inverted the problem. So here's the big problem, which I accidentally switched to earlier on: the security model of Android is not aligned with this. The current design of KVM, which says that the host and the VMM have access to all of guest memory, is a big no-no for Android. The Android security model requires that guest data remains private even if the host kernel has been compromised, which is practically impossible, I think, to achieve in the VHE case, because your host kernel is running at the hypervisor level; you've given it that privilege. But with non-VHE, maybe it's not so bad, because we've got that piece of world switch code. So what if we extend that world switch code? We trust that, but we don't have to trust the whole of the kernel that's running in the host. So we basically improve this world switch code so that it can manage all of the guest memory and doesn't let the host kernel have access to it anymore. We can install a stage 2 translation now even for the host kernel, before we load any of these third-party vendor modules, and we can use message passing for the host and the VMs to communicate. And we need to make sure that the host doesn't tamper with the VM images, so we'll have to have a special bootloader that does some
signature checking. And while we're at it, we'll try and apply formal verification techniques, because the EL2 code is drastically simpler than Linux. It's much, much smaller and cut down; it's not doing an awful lot of stuff, even after we start extending it like this. It can't schedule, for example; it doesn't have preemption. So we can try and reason about its correctness, and perhaps even prove some isolation properties. Open question here. Another alternative would be to run Android in a VM. I think that's plausible; it has a different set of challenges. I don't think either way is obviously better. Running in a VM, there's a latency overhead with things like interrupt delivery, and there are also additional hardware requirements: now you really need to be able to pass through devices, and the IOMMUs that we're seeing aren't quite up to that sort of job. But more generally, roughly, here's what the flow might look like. You've got the GKI kernel, but now it ships with the hypervisor. During boot, it deprivileges itself, which leaves this protected KVM, as we're calling it. So that's your EL2, well, just the world switch code doing a bit more, and then we can start loading modules. For these VMs, KVM here is going to be setting up stage 2 and things like that. So what do we need at EL2? Because this world switch code doesn't do very much at the moment; we need to tame it. It's a pretty horrible place. It has its own limited virtual address space, so it doesn't have quite the addressing capability that EL1 does. It only has one page table, which makes some things challenging, and you cannot run general code there, not least because it's not preemptible or interruptible, so you can't block or schedule, but also, you know, some of the registers haven't quite got the same names, and stuff like that. EL2 can access all of normal memory if it's mapped, so again, we don't want to run lots and lots of code there, because if we end up running lots of complicated code there, we kind of
have defeated ourselves, in the sense that we're trying to reduce the amount of code that has access to guest memory, whereas EL2 does have access to that if it decides to map it. There's very limited device access at EL2; again, the host kernel normally deals with devices, so typically you don't have a console. We've got some hacks for the early console, but typically you don't even have that; the level of device access you have is normally none. And it's just doing this context switching that I was talking about earlier on. And it's quite interesting: prior to 5.9, there was this macro in Linux called kvm_call_hyp, where you pass a function pointer in from your host kernel, and basically, if that function is in the .hyp.text section of the kernel image, then it will just run at EL2. So this allows the host kernel just to run code at EL2, which is obviously a big security problem. I'll show you a bit about how that was implemented. So here's a call site, kvm_call_hyp, and in this case we're trying to flush the TLB. So that calls this macro, kvm_ksym_ref, which basically converts this function pointer into a linear map address, because we have an alias of the linear map at EL2. Then it makes a hypercall. So everything above this dotted line is hypervisor; everything below this dotted line is the host kernel. Then we come into the entry dispatcher. This converts that linear map address into a hypervisor address, it's just an offset, and then, and this is the crazy part, this do_el2_call is literally an indirect call. It's an indirect branch, so it just branches to whatever address came back out of here, which, hopefully, if the host was being nice, gets us here, and if not, we run off into the weeds and who knows what will happen. So you can see from all that code that we need the EL2 code to be self-contained and safe against a compromised kernel, and very clearly it's not. You saw on the previous slide, you can just essentially pass function pointers up
and have them run. So we've been trying to improve bits of this; I'll show you some of the changes we've been making. One of the changes we've made is that, instead of having a "run arbitrary function" hypercall, we're changing to a fixed set of hypercalls, one for each individual service we need to offer. We've also worked at embedding the EL2 payload into its own partially linked object, so that we can reason about the symbol references that we have. We're trying to reduce the number of symbol references we have to the kernel text and to the kernel image; we want to have this whole thing self-contained. So, an example here. There's a good quote I saw the other day, I don't know where it's from: who needs namespaces when you have underscores? So we've used something similar to what we do with the EFI stub: every symbol inside this kvm_nvhe object, which is our partially linked EL2 code, gets prefixed with __kvm_nvhe_, and that's what stops us accidentally referring to other symbols. So you can see that the thing we ran earlier on, the __kvm_flush_vm_context, has now got that prefix there. And then we can have an allow list, because in some cases we do have to access kernel symbols; we can't dereference them unless they're mapped, but we sometimes need to perform some arithmetic based off them. At least now that's an opt-in, rather than an implicit "you can refer to stuff" where you can't really reason about what's going to work and what's not. So, anyway, with this object embedded in the kernel, during boot we install it and then we deprivilege. Before we deprivilege, we flip all the static keys and do the runtime patching, which does mean that once we've deprivileged, you can't flip any static keys in the EL2 code, because otherwise we'd be allowing the host to modify the EL2 text. And then, after deprivilege, the EL2 object is actually just completely unmapped, well, it's not at the moment, but it will be, unmapped from EL1, so you won't be able to read it anymore. So in 5.9, 5.10, this is stuff that's gone in or going in.
That call now looks like this. The kvm_flush_vm_context actually gets converted by this macro here into a constant, 2, so instead of passing a function pointer, we pass the number 2. And the EL2 entry dispatcher, so above the line, we're in the hypervisor, well, it's now in C, which is easier to read, and we can just switch on the function ID. So it's much more fixed-function for operations, rather than arbitrary function pointers. And this code has all moved into an nvhe directory, so it's a bit easier to reason about; that's partly because we're building that partially linked object in there. So that's all good, but there are still some major problems, and pretty much all of these problems come down to the way the virtual memory is managed. The host kernel is in complete control of the hypervisor's virtual memory: your hypervisor stage 1 mappings here are created by the host. That's obviously bad, because you can make sure the code is all self-contained and doesn't have arbitrary function pointers, but if the host has the page tables for the hypervisor code, it can just mess around with the page tables and have it do whatever it wants. What's worse is that the hypervisor pages are also mapped by the host linear mapping, so you can even just directly write to parts of it if you want to. There's also some funny behavior, I find this quite odd, but with the current KVM approach, the KVM data structures get mapped at EL2, which you need, you need it for the world switch, but there's no unmap. So as time goes on, EL2 gradually just gets mappings for more and more of kernel memory. If you spawn a VM, EL2 will map the VM data structures; the VM exits, that memory will be freed and used for something else, but EL2 still just has this kind of dangling mapping to it, which isn't great from a security point of view. There's a homebrew per-CPU implementation that just directly reuses the host regions; we need to separate that out. And it's not just
the hypervisor stage 1 mappings; the guest stage 2 page tables are also managed by the host kernel, and EL2 just blindly installs them. So the host kernel sets up the stage 2 page tables, and during the world switch we just plumb in the root pointer and hope that the host has set them up correctly. And as a side effect of the host managing the page tables, they just happen to be constrained by the host page table configuration. The page table construction uses the usual Linux abstractions, PGD, PUD, PMD, but it means that things like the stage 2 of a guest are artificially constrained by what the host can address, or the way that it's been configured. That's just undesirable; it's not a security problem, it's just undesirable. But the rest of this stuff I've said here makes it trivial for the host kernel to bypass any hypervisor restrictions. Hopefully you can see that there's basically no separation between the two, and this is without pKVM. So, work that's ongoing: some of this has landed in 5.10, or will have landed in 5.10, the pull request is out to Paolo at the moment; other bits we're targeting at 5.11, preventing the host kernel from accessing these page tables directly. So for 5.10 we've got a complete rewrite, a standalone page table walker, and this is good for a few reasons. One, because it means that the host configuration is independent of the stage 2 configuration now. Also, it's actually more efficient and less code, so it's generally a good cleanup. But the main thing is it allows us to instantiate the page table walker at EL2 directly and stop the host having access to those page tables. The problem with that, of course, is that in order to allocate page tables we need a memory allocator, and we don't have a memory allocator at EL2. So that's work that's ongoing for 5.11. The idea is that there's a hypervisor carveout that's donated during boot, but we still need an actual proper allocator on top of that. The way it works with the bootstrap is that the host, before
it deprivileges, allocates some temporary page tables and calls up to EL2 to get it going, and then EL2 bootstraps itself out of that, and the host can free the old page tables after deprivilege. I'll show you some code for that in a second. We've also now got a new, completely standalone per-CPU implementation; essentially it's the kernel's per-CPU implementation, but instantiated again with a separate region, so we don't have to borrow any of the host's per-CPU region. And eventually we'll need to keep the IOMMUs in sync as well, to make sure you can't do DMA attacks to access guest memory. So here's some more code. If we look down at the bottom, again, below the line is the host, in cpu_init_hyp_mode. So the very first thing we want to do is install some vectors at EL2 so that we can do the bootstrap, because initially there are just some stubs in there, which were installed when we first booted from the bootloader. So we get this vector and pass that up, and we also pass the temporary page table, which is this pgd pointer, and some temporary stack, and things like that. Then what we do is instruct the hypervisor, with this hyp setup call, to actually bootstrap itself: you're running on this temporary page table that I've given you; please go and create your own, set up your allocator, and deprivilege me. So here we've passed in a whole bunch of other things, but the main idea is that after this has returned, we can free, this is actually this pgd here, the old page table, because now EL2 is running completely self-contained. So over at EL2, this leaves us in a state where we have an allocator. This is a basic buddy allocator we have for page allocation; we're trying to follow as closely as we can the mm design of the host kernel, just massively simplified, because we don't need a lot of the complexity. So we have a buddy allocator; we can allocate pages, only pages, we don't have slab or anything like that. There's a hyp pool and
a maximum order, and the pool is just protected with a lock; we're using very coarse locking, which works fine for what we need. And we have a struct hyp_page, which basically contains the pointers into this free list, or the set of free lists, and that lives in the hyp vmemmap, where we can track each page, each physical page, that is owned by the hypervisor. Now, the interesting thing is that with all this page table code running at EL2, when you spawn a guest, the guest memory is unmapped from the host, and Linux can't really deal with that, because it's not like memory hotplug, where you lose a whole bank; you just lose whatever pages were assigned to that guest. So memory will disappear from the host as it's assigned to a guest. And there's a patch series, which isn't merged but has been on the list for a while, from Kirill Shutemov, who's one of the main mm maintainers, so if it's from him it's probably not a bad idea, and it's got some chance of going in if we can show that it's useful to us. We will definitely need something like this on the host, so that pages can disappear as they go to the guest and then reappear as the guest is torn down, and there are some questions here about what we do if the host accidentally accesses them. IOMMU support, again, I mentioned that earlier: we need it to prevent DMA attacks. There's an unfortunate reliance on SoC design here; the SoCs out there aren't quite ready for this kind of stuff, but we'll see what we can do. Ideally we would have an IOMMU which could simply reuse the page table that was installed for the CPU, so there's very little in the way of device management at EL2, because what we don't want is lots and lots of IOMMU drivers at EL2, a big pile of code that both duplicates logic that's in the host kernel and also makes things like the formal verification efforts, or even just compliance with the security requirements, very difficult to verify. What else? So yeah, I mentioned right at the
beginning that the hypervisor can be used to provide security enhancements for the host kernel. We could in theory do something like that: if the host has a read-write entry at stage 2, it could ask the hypervisor to change it, maybe to make it read-only, and it appears to the host as though the permissions have changed for the physical memory. So that's quite an interesting defence mechanism. Another thing is that the VMM will now not be able to access any of the guest state, and that's clearly going to cause problems. The main problem it causes is virtio, which I'll talk about in a minute, and there are interesting questions about how we initialize the guest to begin with. For initializing the guest, we're going to be using this template bootloader: a very, very small bootloader, which we're planning to try and write in bare-metal Rust. The idea is that when you tell KVM, hey, I've got a protected VM, then instead of entering the VM directly (you know, the VMM setting the PC and saying please enter the vCPU here), the hypervisor will actually run this template bootloader first, which will live in the carveout, provided by the host prior to deprivilege. This template bootloader basically has two jobs: one is to signature-check the payload that's coming up, just some sort of hash or public key or something like that, and then, if it passes, to jump to it. That's roughly how it's going to work. We've got some ideas for exactly what we might do here: the template bootloader will probably run in a restricted environment where you can't have things like MMIO exits and stuff like that, because we really want it to be a very small piece of code. The payload itself will have a proper second-stage bootloader that we chain-load. None of this really needs to be arm64-specific, so if anyone else thinks this would be useful for them, or they'd like to talk about it, or they've got ideas, or they see problems with it,
please let us know, because we're just about to start really implementing this in anger, and I'd like to hear thoughts before we have something that works and someone says, oh, by the way, can you make it work on RISC-V? It would be nice to know that earlier. So, to finish off, the last section of the talk is about the virtual platform that we offer. We're adopting crosvm as the VMM. There are loads of talks about crosvm at KVM Forum, you can seek them out. It's part of the Chrome OS distribution, and it's now included in Android; we've got a copy of it in there. It's a modern code base written in Rust, and there are many reasons we're using it, but one big reason is that there's a focus on security and sandboxing, which really fits our needs. And also because we're lazy: crosvm has a decent selection of devices implemented, and I think it can run Android now, so that's a big deal for us. One thing that is surprisingly important here is that we need this whole solution to be cross-architecture. You might think, especially in a talk from me, the arm64 co-maintainer, that this is all a big arm64 play. It's not; it really needs to be cross-architecture. You can learn about this virtual platform we have called Cuttlefish, which is used for lots of our Android pre-submit testing, and that's an x86 platform, so we need this to work on both x86 and arm64. So cross-architecture is very important to us, and crosvm is cross-architecture. On the arm64 side, the virtual platform we provide looks very straightforward: it has a fixed memory map, we're using the standard PSCI firmware calls to bring CPUs online, we've got the architected timer, and we need to offer an entropy service (there are patches on the list, I think from Ard, to do that) to make sure that we can get crypto working early. One thing we're doing differently from the architecture: normally you'd have the
GIC, the Generic Interrupt Controller, which is the ARM architected interrupt controller. The GIC is a huge, huge beast; it's very, very complicated, and it requires lots of complicated code at EL2, which is exactly what we don't want. So we're investigating a para-virtualized interrupt controller, and you can see Marc Zyngier's talk about it; I think it's early in the morning tomorrow, or maybe Wednesday, I can't remember; have a look on the schedule and see what that's all about. So what about I/O? Well, we should just use virtio, stupid; it's the best thing since sliced bread, use it for everything. Virtio is great, I love it. So, job done? Well, not quite, because virtio assumes that you have access to all of guest memory from the host; it assumes that guest memory is shared with the host. That also means that even if you get it working, the host can intercept data, so you have to use crypto, and quite clever crypto at that; for example, things like fs-verity, which you can read up on, in the filesystem. We'll need some modifications there, because I think verification is currently just done at open time and we probably need it done at read time. So we're making some changes, but we really don't want to have to change the virtio spec; we might need to add things to the spec, maybe new devices, but we don't want to make radical changes. And the big problem we've got with virtio is that there is no shared memory device, which forces us to use bounce buffering. We have this working via these shared windows: in virtio 1.1 you can set a flag on your device called F_ACCESS_PLATFORM, which indicates that the device's access to memory is limited, and that maps onto what we're doing, right? It does have limited access to memory; it can't access any of it. And when Linux sees this is set, it uses the DMA API for all virtio allocations, so for the rings, and also for mapping and unmapping. So what we can do is set that flag on the
devices in crosvm, and we can tell the guest kernel, if it's Linux, please use bounce buffering for everything, and then we can hook these set_memory_decrypted/set_memory_encrypted APIs in Linux. As a hack at the moment, you basically make some hypercalls to share and unshare the pages back with the host. It's a bit weird, because it's not really decryption and encryption, but it does map quite nicely: if you decrypt a page, you share it with the host, and if you encrypt a page, the host doesn't get access any more. And that works; we have this working. It just means that you bounce-buffer everything, and if you want zero copy for better performance, that doesn't work at all with this approach. Some cases, for example binder shared memory, require zero copy, and zero copy is really quite difficult, because you have to have a handshake between the host and the guest; it's like a three-way handshake involving the hypervisor, to make sure that the guest can't just decide to access host memory, but also so that the host can't just randomly screw with the stage 2 for the guest. So you have to have the host say to the hypervisor, hey, I'd like to do zero copy for this page, and then the guest also says to the hypervisor, yes, I agree to set this page up for zero copy, and then they have to talk to each other to work out exactly what they mean, and then it gets set up. So it's a bit fiddly. There's a specification from ARM called FF-A; it's changed name at least three times, but at the time of writing it was called FF-A. They've put a lot of thought into that memory handshake, and it's worth having a look at the spec. The problem is that the spec itself is fairly heavyweight: it covers a lot of other things, like scheduling, and things like message passing, and it's not cross-architecture. We really need cross-architecture, and we don't need any of the scheduling here. So an open question is: what should we be using?
I'd really like to know how other people have managed to set up this zero-copy shared memory between a host and a guest in a limited addressing environment. I've looked at things like ivshmem, and, I mean, I don't think that's supported any more; but we do need a solution here. So, what's next? Loads of other stuff to do: complete the bootstrap; stabilize the user ABIs; figure out what we're doing for zero copy; move more of the guest state up to EL2, things like the vCPU state; memory poisoning on reclaim; proxying firmware calls; attestation, which is really important, because the guest needs to be able to figure out that it really is running under the protected KVM hypervisor; and ballooning for reclaiming guest memory. Integration with the rest of Android is quite interesting, because this whole talk is a kernel-and-hypervisor talk, and I'm a sort of kernel-y, hypervisor-y kind of person; I kind of underestimated how much effort this is, but getting all of this to work nicely with the rest of the Android user space is a big task. And we'll just keep upstreaming. Any questions? We've got a list which you can CC on patches if you like; it's not archived, but you can reach the whole team through it, or just use the KVM lists, we're on there too. So thank you very much.