Hello everyone. We're here to talk about virtualization on ARMv8 and how it can be used for security. I assume every one of you knows Linux KVM and its history. KVM for ARMv8 came about in 2013 or so and finally stabilized around 2015. We have been working with mobile chipsets for quite a long while, and around that time the virtualization features slowly started to emerge in mobile chipsets. Around 2018 or so it started to look like the support might be complete enough to actually try to make KVM work on a mobile phone. Prior to that, the hardware was primarily used for controlling all sorts of device and memory accesses of the underlying Linux system; there was some support for virtualization features, but it was limited. Around 2018 we noticed on the first chipsets that the support was starting to look complete, so we started a quick study on how complete it really was. It seemed good, and after working through a couple of quirks we got basic KVM going on a mobile phone later that year. For the first time we had a proof of concept for all the work we're about to present here. Last year Google delivered an excellent talk about this matter, bringing virtualization to the masses. Google had noticed the same thing: the virtualization support is starting to work and will eventually be highly usable in the mobile realm as well. But by that time we had already been focusing for quite some time on KVM, its security, and how to make it a real product on a mobile phone.

Ultimately, everyone wants secure virtualization. Linux KVM itself is highly functional for the embedded world and the mobile space, but it adds pretty little to security. If you compromise a host running KVM guests, compromising the host is enough: you get access to all the memory in the system and you can modify the guests whatever way you want. You can read and write their memory in its entirety; all hope is lost in the security sense. Then, for mobile use, KVM has the limitation that it is designed around owning the entire CPU hypervisor model. Unfortunately, as I said, there were already hypervisors in existence on mobile chipsets by 2018 or so, and they were controlling the majority of the device accesses of the host OS. A lot of that stuff is really unmovable: it cannot be taken out, it has to stay there. This clashes with KVM's design, and KVM as such doesn't really work there. These external hypervisors ship with the chip, so they are designed to own the entire CPU in the same way. Moreover, they bring more to the table: they are usually full-blown operating systems that require extensive porting efforts between the SoCs in question. But the average lifespan of a mobile phone is really short, two to four years, and if you have to do a major porting effort between iterations of mobile phones, your virtualization solution is doomed. So what we wanted to create here is something that is minimally portable between environments. Ultimately the goal was that we wouldn't need any board support package (BSP) for our virtualization solution apart from Linux itself. Whether or not the vendors provide virtualization, they are more or less mandated to bring KVM in, at least in source form. It may not function, and in the code you get from, for example, Qualcomm it probably doesn't, but at least it is there and it usually at least builds.
So there is something to start from. Another problem with these external hypervisors is that they end up making swapping dysfunctional: they waste a lot of memory space by pinning the guests permanently in memory. We don't really have that much memory to spare on these embedded devices, so we have to allow some form of swapping, and we really prefer to do it without having to write an entire operating system. Linux is portable, so just let Linux do that job.

Our goal for this whole work was, first, to create a kernel memory protection framework. This part was in existence already before 2018: we can lock pieces of kernel memory and make sure that certain things in the kernel cannot be modified by an attacker. So if somebody gets access to the kernel, at least some things can be expected to stay the same. Then, the ARMv8 architecture has nicely layered security, where each layer has considerable security value, and KVM's design has the flaw that it blends two of those security layers, EL2 and EL1, which Jani will shortly explain, into more or less the same thing in the security sense. We wanted to make them actually separate, just as the ARM architecture intended. Then we wanted to support KVM guests that are properly isolated from the host, so that the host and the guest cannot read or write each other's memory outside of the explicitly shared areas. This is a highly useful feature for things like secure browsing and chatting, even on mobile. And finally, we wanted to create a tiny kernel intrusion detection framework, so we can take measurements of the kernel memory at certain points in time. So those are the rough goals, and this whole talk is a kind of high-level project plan: it explains step by step what we had to modify to get there. Send questions through the chat if something is unclear; this is going to be a fairly fast talk. But first we have to start by explaining the ARM terms the way they are used here, since I think most of you are not familiar with the ARM architecture in detail. Jani will start by explaining the terminology and how virtualization works on this architecture.

Thanks, Janne. Hi everyone. In the ARM architecture, execution is divided into different privilege levels called exception levels. There are four exception levels in the ARM architecture, going from zero to three, three being the most privileged, and a higher privilege level always has access to all the features of the lower levels, but not vice versa. Next slide. Another important feature of the ARM architecture, in addition to the exception levels, is two-stage memory translation, where the virtual addresses seen by the operating system are translated to physical addresses through a second stage of address translation, controlled by the hypervisor mode. This makes it possible to have several isolated operating systems on the same system, like we see on the next slide. On this slide we see two operating systems running on the same system, isolated from each other, and the hypervisor controls the intermediate physical address space for both of them. Whenever the context is switched from one operating system to another, the hypervisor does the context switch by setting the base address of the second stage translation table to correspond to the running guest. Next slide.
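To make that stage 2 switch concrete, here is a minimal sketch of what it might look like at EL2. VTTBR_EL2 and its layout come from the ARMv8 architecture; the struct and function names are illustrative, not from the actual code base.

```c
/*
 * Minimal sketch of the stage 2 context switch described above.
 * VTTBR_EL2 holds the stage 2 translation table base together with the
 * VMID that tags the guest's TLB entries. The names below are
 * illustrative; real code must also deal with TLB and cache maintenance.
 */
#include <stdint.h>

struct guest_s2_ctx {
	uint64_t s2_pgd_phys;	/* stage 2 translation table base (PA) */
	uint16_t vmid;		/* VM identifier for TLB tagging */
};

static inline void write_vttbr_el2(uint64_t v)
{
	__asm__ volatile("msr vttbr_el2, %0" : : "r"(v));
	__asm__ volatile("isb");	/* make the new translation visible */
}

/* Point stage 2 translation at the guest that is about to run. */
static void switch_to_guest(const struct guest_s2_ctx *g)
{
	/* BADDR in the low bits, VMID in the top bits of VTTBR_EL2. */
	uint64_t vttbr = g->s2_pgd_phys | ((uint64_t)g->vmid << 48);

	write_vttbr_el2(vttbr);
}
```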
I think I'm handing back to Janne at this point. Yeah, so in order to get towards this kind of secure, proper host/guest separation for virtualization, the first thing we had to do was create a place to hide things: a tiny hypervisor executable that can take ownership of some of the kernel functions. The first requirement for this thing is that it has to be compatible with almost any form of translation in existence. As you know, you can have table walks of different depths, three or four table levels, and all sorts of different block size mappings. Our hypervisor, since it has to be compatible with KVM itself and with third-party hypervisors, has to be able to understand it all, so it requires pretty extensive mapping code. Then, of course, it has to be able to record all the guest information: the run state, the memory layout, everything about the guest. It has to have all the plumbing to trap and handle all the relevant exceptions. And since this thing receives stuff like the guest information, guest kernels, guest images, and guest pages through an insecure transit path, namely the host kernel, it has to be able to perform signature checks and integrity measurements of the data coming in. The table walk code is another element, one I'll explain a bit later, but it has to be able to make sense of the entire device memory on anyone's behalf: it has to be able to translate process X's page into the actual physical page, whatever that happens to be, even though the first stage of the translation is controlled by the insecure Linux at EL1.

So the hideout needs to exist first. After the hideout is there, the next thing we have to do is take the host kernel, through which we want to allow the execution of these new KVM-based guests, and lift it to be a VM of its own. On the majority of these mobile chipsets this is already the case to some extent: the host is already a VM controlled by a hypervisor. So our hypervisor has to sit in the boot flow before these other hypervisors, and it has to be able to read and write the same page table information generated by the other one. And as mentioned already, it has to have the architectural plumbing for dispatching exceptions based on where they happen to be going: some of the exceptions have to be passed to the SoC vendor hypervisor code, some of them we handle ourselves, and some of them we pass to KVM. The KVM part is of course not required in this first step, but anyway, that's the whole construction.
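To give a feel for the mapping code mentioned above, here is a simplified sketch of a four-level, 4KB-granule stage 1 table walk of the kind the hypervisor has to perform on the host's behalf. The names are illustrative; real code must also handle block mappings, other granule sizes and walk depths, and the stage 2 walk on top.

```c
/*
 * Illustrative sketch of the table walk code described above: resolving
 * a virtual address through a 4-level, 4KB-granule stage 1 table whose
 * entries the untrusted host has written. Block mappings and error
 * handling are omitted for brevity.
 */
#include <stdint.h>

#define ENTRIES		512
#define VALID_BIT	(1UL << 0)
#define TABLE_BIT	(1UL << 1)		/* levels 0..2: table vs. block */
#define OA_MASK		0x0000fffffffff000UL	/* output address, bits 47:12 */

/* Read one descriptor through the hypervisor's own mapping of the table. */
extern uint64_t read_desc(uint64_t table_pa, unsigned int index);

static uint64_t walk_stage1(uint64_t ttbr_pa, uint64_t va)
{
	uint64_t desc = ttbr_pa;
	int level;

	for (level = 0; level <= 3; level++) {
		/* index bits per level: 47:39, 38:30, 29:21, 20:12 */
		unsigned int shift = 39 - 9 * level;
		unsigned int idx = (va >> shift) & (ENTRIES - 1);

		desc = read_desc(desc & OA_MASK, idx);
		if (!(desc & VALID_BIT))
			return 0;	/* not mapped */
		if (level < 3 && !(desc & TABLE_BIT))
			return 0;	/* block mapping: omitted here */
	}
	/* Level 3 page descriptor: combine with the page offset. */
	return (desc & OA_MASK) | (va & 0xfff);
}
```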
After we have the host Linux itself running as a VM, we can start building a system to protect pieces of that kernel's memory. Since the host is now a virtual machine, we have full control over its view of memory, so we can provide the plumbing for this kernel protection. This was the first thing we had in place, around 2018, and it was already highly useful for all sorts of security use cases. The way we do it currently is that our hypervisor is built after the kernel itself, and the kernel is extended a tiny bit to reveal some of its memory information. The asm-offsets.h header that is generated during the kernel build gets a few added elements, and that generated header is inherited by the hypervisor build, so the hypervisor knows and understands the kernel structures. We also have an out-of-tree kernel driver that can be loaded on the host kernel to send the memory layout information, because a lot of the memory addresses are randomized at runtime; when the driver loads, it knows how the kernel randomized things, and it uploads that information to our hypervisor during its early init call. In addition, we patch the kernel a tiny bit to align some of the data that needs protecting: kmalloc is used for the majority of the relevant structs in the kernel and returns pretty poorly aligned allocations, so we had to modify those kmalloc requests a tiny bit to get properly page-aligned entities. That lets us protect things like, for example, the vCPU register state. It's an entirely valid question how this kind of approach can keep working when the kernel evolves so fast. Well, it appears to work pretty well: we support a pretty wide range of LTS kernels. Development releases are out of our scope, but we have a base patch for every LTS release from 4.9 to 5.10, and nothing beyond the elements mentioned here needs to change in our code.

Then, in order to work towards the secure KVM, we have to slice KVM in two pieces. First we need to identify all the elements that a malicious host kernel shouldn't be allowed to read or write, or mainly write; in most cases it's fine if the host reads some of these things. There's a list here. It's not the complete list, just a few of the most important elements that we protect with the hypervisor while guest execution is ongoing. Most of them are probably pretty obvious, but the last one may not be. What our hypervisor can do is lock even the kernel's own page tables. The kernel generates its own page tables as it boots, and an attacker could modify those page tables and thereby break the kernel's own view of memory. Through that they could, for example, disable SELinux or take all sorts of other nasty measures that we really want to protect against. So our hypervisor can locate and lock the kernel page tables in memory and protect them against modification by the malicious host. This is the set of things that ultimately require protection; a more realistic list is on our web page.

The way our construction works is that we're not altering KVM's internal execution flows; they're pretty much all intact. What we are doing is overriding a bunch of the KVM symbols that deal with the EL2 execution level and replacing them with our own symbols, which are basically jumps to our hypervisor: requests for the hypervisor to do something. Rather than handling a task in the kernel, we jump to the hypervisor and ask it to perform the same task for us.
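Here is a minimal sketch of what such a symbol override might look like on the kernel side, assuming an SMCCC-style hypervisor call interface. The call number and function names are hypothetical, not the project's actual ABI; only arm_smccc_hvc() is a real kernel interface.

```c
/*
 * Sketch of the symbol override idea described above: the kernel-side
 * function keeps a KVM-style signature, but instead of touching EL2
 * state itself it issues a hypervisor call (HVC) and lets the
 * hypervisor do the work.
 */
#include <linux/arm-smccc.h>
#include <linux/types.h>

#define HYP_MAP_GUEST_PAGE	0xc6000001	/* hypothetical call number */

/* Replacement for an EL2 helper: ask the hypervisor to map a guest page. */
static int hyp_map_guest_page(u64 guest_ipa, u64 page_pa, u64 prot)
{
	struct arm_smccc_res res;

	arm_smccc_hvc(HYP_MAP_GUEST_PAGE, guest_ipa, page_pa, prot,
		      0, 0, 0, 0, &res);
	return (int)res.a0;	/* 0 on success, negative error otherwise */
}
```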
Then the last element in that list is a pretty important one. One of the things we trap in the hypervisor is the MMU notifier call, which tells us that there is now memory pressure on the device and the MM code wants to swap out pages of the VM. Those cases also end up in the hypervisor, and the hypervisor knows whether the memory has been touched or not. If the page is intact, meaning the guest hasn't written any potentially leaking information on it and it is still just as it came from the host, then we allow it to be swapped freely: we just take a SHA-256 measurement of the page (so it can be verified when it comes back), unmap it, and allow Linux to either reload it later from the backing mapping or, technically, from swap. Linux doesn't swap out clean pages, but anyway. For dirty guest pages we currently don't allow unmapping, so we don't leak guest secrets, simply by being aware of the page state. Later on we could improve this a bit and allow Linux to swap the dirty pages as well; we would just have to add encryption support for it, which should be pretty straightforward. It's one work item among many. Next, Jani will say a few words about this construction as a whole before we go a bit further, and also about how KVM itself runs here.

Thanks, Janne. This is a high-level picture of all the blocks we are talking about in this presentation and how those blocks are divided across the different exception levels. In this figure we see the host operating system and two guest operating systems running on the same hardware, and this tiny hypervisor implementation of ours sits there at exception level 2, in the hypervisor mode. What has changed relative to the existing KVM implementation is that we took ownership of the exception vector and the stage 2 translation tables, along with some other smaller things. We took that ownership because, as mentioned before, we needed to make it possible for two different systems to coexist on the same system: KVM wanted to have EL2 ownership, and the SoC vendor implementation wanted to have EL2 ownership. To make that possible we needed an implementation like this, where we are the ones who own those EL2 features but are able to hand execution over to the proper instance whenever needed. The next slide tells a little more about that.

Here we see how KVM originally runs in the system, and how our hypervisor implementation is fitted into that picture by handling all the EL2 mode exceptions and all the stage 2 memory translation. For example, whenever an exception comes from the host operating system, we are able to identify it and hand execution over to the SoC vendor EL2 code. In a similar way, whenever an exception comes from the guest operating system, we identify it and hand execution over to the KVM implementation. One of the most painful parts of the implementation was making the system work in such a way that this SoC vendor EL2 code keeps working properly, because there are things like caching and the SMMU that need to be taken into consideration. Handing back to Janne again.
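An illustrative sketch of the exception dispatch Jani just described follows. All names here are hypothetical; the real routing lives in the EL2 exception vector and must also deal with the caching and SMMU state that is omitted here.

```c
/*
 * Sketch of EL2 exception dispatch: traps we own are handled in place,
 * host exceptions go to the SoC vendor hypervisor code, and guest
 * exceptions go to KVM.
 */
#include <stdint.h>

enum origin { FROM_HOST, FROM_GUEST };

extern enum origin current_origin(void);	/* e.g. from the active VMID */
extern int  is_our_trap(uint64_t esr);		/* stage 2 faults, own HVCs */
extern void handle_our_trap(uint64_t esr);
extern void vendor_el2_handle(uint64_t esr);	/* SoC vendor hypervisor */
extern void kvm_el2_handle(uint64_t esr);	/* stock KVM handling */

/* Called from the EL2 synchronous exception vector with ESR_EL2. */
void el2_sync_dispatch(uint64_t esr)
{
	if (is_our_trap(esr)) {
		handle_our_trap(esr);
		return;
	}
	if (current_origin() == FROM_HOST)
		vendor_el2_handle(esr);	/* host exceptions: vendor code */
	else
		kvm_el2_handle(esr);	/* guest exceptions: KVM */
}
```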
Yeah. The overall picture Jani gave here doesn't really convey how notoriously difficult some of these issues are to make work. But anyway, going forward. I should have mentioned on slide 12 that as our guests boot, pages migrate slowly from the host to the guest. Whenever the guest touches a page whose contents were loaded by the host, that page is unmapped from the host entirely, so the host no longer has access to it. The mechanism is pretty straightforward: a clean page migration from the host to the guest. The host allocates the page, fills it up, and eventually, when the guest touches it, the page migrates to the guest.

This is all nice, but then comes the difficult part. Virtio is more or less mandatory on every virtualization system, because it's the key to any reasonable performance: it is all about making sure the hypervisor doesn't need to trap on every access. Virtio is difficult in the sense that the virtio memory allocation is handled by the guest: the guest allocates random pieces of its own memory and then just decides to share them with the host. That's nice, but there's no way for the host or the hypervisor to know which pages those are; they're entirely random. So ultimately the responsibility for telling the hypervisor which pages are to be shared for host-guest communication lands on the guest. Luckily for us, AMD hit this problem before us and we got to copy their approach. The first model for how virtio works here is that we hook into the same set_memory_decrypted() call in the kernel that is used by AMD SEV, and by Intel TDX as well. This is the function through which the guest kernel can say that a page should be open for the host to read. When it is called, it generates a hypervisor call, the hypervisor knows to map that page back to the host, and everything works fine. Then we have a second model that works for unpatched guests, since the first one always requires support in the guest kernel for opening those pages. The second model works with unmodified guests too, but it's more or less time-bound: we lock the guest mappings only after the guest has initialized, and before that point we allow the shared communication to happen. Because it's a tiny bit hairy in terms of when that window closes, the preferred mechanism is number one, but number two is there in case you want to run an unmodified guest: take a Red Hat kernel and run it, and it should be fine. We call it blinded.

And that was basically all we had to do to get KVM working in a secure manner: the guests are separated from the host, and no matter what you do from the host, you cannot write into the guests.
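To show what model number one looks like from the guest side, here is a minimal sketch of a guest driver allocating a buffer and opening it to the host. set_memory_decrypted() is the real kernel interface mentioned above; the function around it is illustrative, not from the actual code base.

```c
/*
 * Guest-side sketch of sharing model number one. set_memory_decrypted()
 * is the existing kernel interface also used by SEV and TDX guests; on
 * this platform it ends up as a hypervisor call that maps the pages
 * back to the host.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/set_memory.h>

static void *alloc_shared_queue(unsigned int npages)
{
	unsigned int order = get_order(npages * PAGE_SIZE);
	struct page *pg = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
	void *va;

	if (!pg)
		return NULL;
	va = page_address(pg);

	/* Tell the hypervisor these pages are open for the host to read. */
	if (set_memory_decrypted((unsigned long)va, npages)) {
		__free_pages(pg, order);
		return NULL;
	}
	return va;
}
```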
In addition to that, we have a bunch of other useful memory protections. Some of these were in use already before this KVM code came into existence, and some of them are demonstrated by the hypervisor mode driver in our GitHub repository, which we'll link later. The driver shows examples like protecting the kernel C runtime, locking the rodata and text sections of the kernel, and locking kernel page table modifications so the kernel cannot alter its own mappings. One big use case we have is immutable kernel keys: you can keep things like filesystem encryption keys and measurement keys in the kernel, they stay immutable, and there's no way for an attacker to ever modify them once set. Another big use case, currently under study, is protecting the entire SELinux rule set by pulling the SELinux rule memory from a memory pool that is already protected by the hypervisor. We can allow rule creation and modification during device boot, but later on we lock the whole construction, and beyond that we can also lock the mappings so they can never change. That makes for a pretty hard construction.

So here is the repository itself. Basically everything mentioned here has been implemented as a tiny KVM extension, overriding a number of KVM symbols; it doesn't really require many code modifications beyond the symbols being replaced. It controls all of the EL2-sensitive data and protects it. The interfaces, meaning the QEMU and KVM APIs and the userspace APIs, all stay the same; the changes are all within the KVM kernel functions. Everything is pretty well happy-path tested and it's very solid on the devices; we have given it a pretty heavy shake and bake, and it works. If you want to give it a spin, let us know: we support basically every LTS kernel since 4.9, and we can provide a base patch against each of them. This whole construction has an almost unlimited number of use cases, and a lot of them are under study. I think we have to set up a web page where we track this and show the overall progress, because the code that is out there is just the base plumbing that makes many of these things possible, like the secure KVM. Our primary chipsets currently come from Qualcomm, and we support multiple chipsets there. We also have full support for a QEMU-based system emulation model, so you can run this whole construction and see what it's doing in an emulated environment, and there is an SDK so you can play with it and develop the whole construction further.

I think we managed the goal of keeping the porting effort minimal. Remember, one of our primary goals was to be able to move this thing to a new mobile phone, every time we get one, with minimal effort and without setting up an entire operating system environment: we just utilize the existing Linux BSP for the particular chipset. To move our code base between two different systems, the only thing you pretty much need to do is to write one C file and place the construction in your boot flow. That's basically all there is to it; there's really nothing more. It's very minimal, and even the driver mentioned earlier is more or less optional: you only need it if you ever want to see output on your serial console, since there is a RAM log that the driver reads. You don't even have to do that; the kernel security framework works regardless. That said, I think one of our colleagues is about to update the kernel driver to be a bit more complete in terms of all the things that it shows.

So, what to do with this stuff? The virtualization is functional overall: we are running Linux and Android guests with it, and we are now in the process of making a virtual GPU work directly on ARM64 Android. I think this is something that even Google doesn't fully have; the Android graphics pipeline is a bit of a mystery to many of us.
So it's been a long reverse engineering exercise to make all of that work, but it is actually starting to slowly draw things on the screen, so we're happy. Beyond that, the entire construction is waiting for full penetration testing, and we'd be very happy if you could browse the code, have a look at it, and tell us where it sucks and whether there are any considerable issues with the solution's security. There's a README on the main page that lists the currently recognized issues, and if anyone is willing to contribute, we are very happy to listen. Currently we unfortunately don't have a proper mailing list for this matter; we only have our private emails. But if this generates interest, if any of you wants to start running your own secure KVM and hardening your kernel using the EL2 mode, we are here to help. Let us know, and let's set up a mailing list or something where we can talk about this stuff and how to get you using this code. And now it's time for the Q&A, so I guess we're going to stop here.