So, hi everyone. Thanks for coming to my talk about KVM as a sandboxing mechanism for the Linux kernel. Before diving into the technical details, I will quickly introduce myself. My name is Florent Revest. I am a computer engineering student during the day and an embedded software consultant during the night, which gives me plenty of time to maintain an operating system named AsteroidOS. During the summer, when I have time, I do internships for various companies. My latest internship was at ARM, in Cambridge. I was part of the kernel team there, in charge of exploring various hypervisor-based security solutions. I was under the supervision of Marc Zyngier, who is the maintainer of KVM/ARM, and he had this idea: what if we could put the EFI runtime services inside a KVM virtual machine? Don't worry, I will explain what EFI runtime services are, what the problem with them is, and what it means to run them inside a KVM virtual machine.

I will begin with a quick introduction to KVM, what it is and what it does. Then I will tell you how I managed to turn it into an API usable inside the kernel. Finally, I will present the EFI runtime services and how my sandboxing solution ended up working.

KVM. Just so that everything is clear for everyone: KVM is a hypervisor. You may have heard that there are two kinds of hypervisors, type 1 and type 2, those that run between the hardware and the kernel and those that run between the kernel and the user-space tools, and KVM is something else. It is part of the host kernel; it's a module, so you can load KVM with a simple insmod kvm.ko. Essentially, its role is to provide an API to user-space programs so that they can access the virtualization capabilities of the hardware. You may know that some CPUs have such extensions: Intel has VT-x, AMD has AMD-V, and ARM has the ARM virtualization extensions. Essentially, those are extensions of the CPU's instruction set that allow the creation of virtual machines, and the separation between the different virtual machines is enforced by the hardware. So obviously you can fully trust the hardware; well, there have been some troubles recently, but you can still have a decent level of trust in your hardware.

The important part of this slide is that KVM has been designed to provide a bridge from the hardware to user-space programs, for instance QEMU but also kvmtool, and so on. It is also a multi-platform API that works across different architectures. How does it work? KVM exposes virtual devices under /dev. For instance, you have the /dev/kvm file, on which you can run ioctls to create your virtual machine, and this is standardized across all the architectures. For instance, there is a KVM_CREATE_VM ioctl that just returns a file descriptor to a new virtual device representing your virtual machine, and then you can run KVM_RUN. Well, obviously you first need to attach memory and a virtual CPU, but then the VM runs.

One other thing I would like to highlight here is that, since KVM has been designed for user-space programs, when you attach a memory slot to your virtual machine, the physical memory of the virtual machine is actually mapped from the address space of the process. Let's take an example: if you run QEMU, QEMU can allocate a range of memory, and that memory will be used as the physical memory of the virtual machine. That's important for the rest of the talk. That's it for the general introduction to KVM.
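To make that /dev/kvm flow a bit more concrete, here is a minimal sketch of what a user-space program such as QEMU conceptually does. Error handling, guest code loading and register setup are stripped out; the ioctls shown are the standard ones from the KVM user-space API.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    int main(void)
    {
        int kvm = open("/dev/kvm", O_RDWR);
        int vm  = ioctl(kvm, KVM_CREATE_VM, 0);      /* fd representing the new VM */

        /* The guest's "physical" memory is just a chunk of this process's
         * own address space, handed to KVM as a memory slot. */
        void *mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
            .slot            = 0,
            .guest_phys_addr = 0,
            .memory_size     = 0x10000,
            .userspace_addr  = (unsigned long)mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);    /* one virtual CPU */
        ioctl(vcpu, KVM_RUN, 0);                     /* enter the guest */
        return 0;
    }

A real virtual machine monitor would also mmap the vCPU's kvm_run structure and loop on KVM_RUN, handling the exit reasons, but the point here is just the file-descriptor-and-ioctl shape of the API.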
Now let's come to internal KVM. The idea was: what if we could, from the kernel, spawn small sandboxes, small virtual machines in which we could run code that we don't trust, maybe because we haven't written it ourselves, or maybe because it's a highly critical piece of code and we want to make sure that if there is an exploit in it, it won't affect the entire kernel. It would be great if we had something to spawn virtual machines from the kernel, something we could call a kernel virtual machine, something like KVM, and we have it. So everything is almost there, but it was never conceived to be used by the kernel itself, by the various subsystems, so we had to make some adaptations.

Just to clarify the objectives of internal KVM: it is meant to be cross-platform, standardized across architectures, so that the same subsystem can use virtual machines on ARM or x86, for instance. It can be used for security, but also for stability, because a program that can potentially touch the system registers can cause problems. The API is meant to be generic, so that every kernel maintainer can use it to sandbox something.

So what were the problems? First of all, KVM as it is today doesn't offer enough functions: you would expect to find the in-kernel equivalent of the KVM_CREATE_VM ioctl I showed you before, but you don't. So we had to extend the KVM API. Then there are plenty of assumptions in the KVM code that the virtual machine is created from a user-space program. For instance, every parameter passed to KVM's functions is retrieved by the kernel using copy_from_user, and obviously, if you don't call those functions from a process, you get tons of problems. KVM also interacts with the scheduler, because it assumes a user-space process is driving the virtual machine. That's one more assumption I had to refactor.

Then a couple of other things. When it comes to memory mapping, I told you before that QEMU uses part of its process address space as the physical address space of the virtual machine. But when you are in the kernel, you need to map something else, and KVM is not built for that. So, with some more refactoring, I chose to make internal KVM map host physical addresses, physical as seen by the host kernel, to intermediate physical addresses, which is the physical layout as seen by the virtual machine. Is that okay for everyone? That's a lot of different address spaces.

One last thing: on ARM, virtual machines have an instruction named HVC, for hypervisor call, and the guest operating system can use it to call functions provided by the hypervisor, to do whatever the hypervisor offers. For now, it's only really used for PSCI, so it's not widely used. In our context of sandboxes, we need a way for the host kernel to call code inside the sandbox, and a way for the sandbox to send messages back to the host kernel, so we also need a way to route hypercalls from the virtual machine to the host kernel.

Just to give you a global idea of how this new API can be used, it looks like this. It's very straightforward: you create an internal virtual machine, you add a virtual CPU to it, and you configure it. I cut the code that configures the memory, because we also use virtual memory inside the virtual machine, so you need to configure several sets of page tables, and it's a bit of a mess sometimes. But the overall idea is quite simple, and it can be used from any driver or subsystem.
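As a rough illustration of what such a driver-side call sequence could look like, here is a sketch; the kvm_internal_* helpers and the KVM_INTERNAL_EXEC flag are hypothetical names standing in for the real API, not the exact symbols from the RFC or the slide.

    #include <linux/kvm_host.h>
    #include <linux/err.h>

    static int example_spawn_sandbox(phys_addr_t code_pa, size_t code_size)
    {
        struct kvm *vm;
        struct kvm_vcpu *vcpu;

        vm = kvm_internal_create_vm();           /* in-kernel counterpart of KVM_CREATE_VM */
        if (IS_ERR(vm))
            return PTR_ERR(vm);

        vcpu = kvm_internal_create_vcpu(vm, 0);  /* add one virtual CPU */
        if (IS_ERR(vcpu))
            return PTR_ERR(vcpu);

        /* Map a host-physical range into the guest-physical address space;
         * the guest-side page table setup is elided, as it was on the slide. */
        kvm_internal_map(vm, code_pa, code_pa, code_size, KVM_INTERNAL_EXEC);

        return kvm_internal_vcpu_run(vcpu);      /* enter the sandbox */
    }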
So that's it for KVM and internal KVM. Now let's come to the EFI runtime services bit; that's where it gets exciting. When I first heard about EFI runtime services, I had no idea what they were, so I just googled them, and here are the results I found: "more ways for firmware to screw you" and "UEFI is not your enemy". You can probably imagine that if there is a talk named "UEFI is not your enemy", UEFI is maybe your enemy.

Essentially, the EFI runtime services are a bunch of functions. Actually, it's a table of function pointers that the firmware gives to your kernel, and those functions provide access to various hardware capabilities. For instance, you get access to the EFI variables, but also to the real-time clock and various other things. Now let me repeat that one more time: you get function pointers that you call from the kernel, and that weren't written by kernel developers. So you run code that you have no control over, with the privileges of the kernel. There are some mitigations in the kernel, which tries to run these services in a special address space, but the code inside the EFI runtime services still has access to the system registers of your CPU, and it can do nasty things behind your back.

There are several ways to look at it. If you are an old man like this guy, you may be worried that you run proprietary code from a kernel you believe to be free. If you are a technical person, you may be wondering whether the code inside the EFI runtime services really works well, whether it has been implemented correctly, and experience shows that the OEMs developing those services sometimes make mistakes. For instance, some EFI runtime services disable interrupts: you call a function with interrupts enabled, it comes back with interrupts disabled, your kernel doesn't work anymore, and you don't know why. So there are plenty of hacks around the EFI drivers inside the Linux kernel to try to limit the impact of those services. But you can feel that it's starting to get a bit hacky, and you can't really trust the device vendor that provides you those functions. I like to call it the "jump and pray for the best" strategy, because that's exactly what you do: you jump to an address and hope you will come back with a running kernel. Did I miss something? No.

Now for the EFI sandboxing. You may be starting to see a pattern here: couldn't we use the internal KVM framework we created before to run the EFI runtime services? Obviously, since I am here today, you can guess that we can. So we use internal KVM from the EFI drivers to spawn a lightweight virtual machine at boot time, and in this virtual machine we map the code of the EFI runtime services and also the addresses of the devices they need to access. We map only what the EFI runtime services need inside the virtual machine and nothing else, which means that, at least, they can't touch anything in memory outside of what they need. And since they only have access to a virtual CPU, they can't touch the system registers of our kernel.

That caused several problems. First of all, EFI is meant to be set up very early in the boot process of the Linux kernel, and KVM is meant to be initialized very late, so it took a bit more refactoring to initialize KVM earlier in boot and EFI later in the boot process.
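Just to picture the function pointer table I was talking about, here is roughly its shape, heavily simplified and abridged from the UEFI specification (the real thing on the Linux side is efi_runtime_services_t in include/linux/efi.h):

    #include <linux/efi.h>

    typedef struct {
        /* ...table header and most services elided... */
        efi_status_t (*get_time)(efi_time_t *tm, efi_time_cap_t *tc);
        efi_status_t (*get_variable)(efi_char16_t *name, efi_guid_t *vendor,
                                     u32 *attr, unsigned long *size, void *data);
        efi_status_t (*set_variable)(efi_char16_t *name, efi_guid_t *vendor,
                                     u32 attr, unsigned long size, void *data);
        /* ... */
    } firmware_runtime_services_t;

    /* Given a pointer rt to the firmware-provided table, the net effect of a
     * runtime service call from the kernel is a plain indirect call, at full
     * kernel privilege, into code the kernel never compiled and cannot inspect. */
    static efi_status_t read_variable(firmware_runtime_services_t *rt,
                                      efi_char16_t *name, efi_guid_t *vendor,
                                      u32 *attr, unsigned long *size, void *buf)
    {
        return rt->get_variable(name, vendor, attr, size, buf);
    }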
And then, as you started to see earlier in my talk, there are many address spaces you need to deal with: the one of the hypervisor, the one of the host kernel, the physical address space of your virtual machine, and the virtual address space of the virtual machine. I tried to keep things simple here: there is a direct mapping between the host physical address space and the physical address space of the virtual machine, and the same goes for virtual memory; the virtual machine uses the same virtual address layout as the host kernel, which really simplifies things.

There are other problems you need to address. How do you call a function inside a VM? When you call an EFI runtime service from the context of the kernel, it's just like a C function call: you jump to the address with the arguments in registers, and then it jumps back to your code. But with a virtual CPU, you need to put the arguments into the registers of the virtual CPU manually, and when it comes to returning, the virtual machine can't just jump back into the code of the kernel. So there is a wrapper inside the virtual machine that issues a hypercall to tell the kernel: that's it, I'm done with the EFI runtime service. One more problem: what happens if there is a fault inside the virtual machine? We clearly need to recover from that. So there is a fault handler, and if a problem occurs inside the virtual machine, we have a special hypercall for that, and the EFI drivers in the Linux kernel can back out properly.

So this is starting to look quite nice: no more system register mess, and clean exception handling, as I was explaining. I was able to make it run on real devices, like the SoftIron Overdrive 1000, and also on models, for instance the ARMv8.3 model. I had enough time to clean up the codebase and send a first RFC to the Linux kernel mailing list, and it has been reviewed by the maintainers of KVM and EFI for ARM. It's not merged yet, because I haven't had time to work on this since my internship, but the idea of this talk is that if anyone in this room is interested in doing the same thing, or doing it on a different architecture, or for a different need (I was telling you about sandboxing the EFI runtime services, but what if you wanted to sandbox the IP stack of Linux, for instance, so that a problem in the IP stack only affects the virtual machine), there are lots of possibilities with this. If you are excited by this, please take a look at my RFC; it's not that long, not that complex, and I would love to help you if you are interested.

Now for the limitations, because obviously the world isn't that great when it comes to EFI. The idea was to have two separate kernel spaces, but on ARM you have different privilege levels; on ARMv8 at least, you have EL0 for user space, EL1 for kernel space, EL2 for the hypervisor, and then EL3 for the secure monitor. And if, from the Linux kernel, you create a virtual machine, you get a second EL1 level. You can put code inside it, but this code, even though it's controlled by your Linux kernel, can call functions in the secure monitor, and you have no control over the secure monitor from your Linux kernel. So that's one more way for the firmware to screw you. That's what happened on the Overdrive 1000.
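To illustrate the calling convention I described a moment ago, here is a hypothetical sketch of how the host side could invoke one service inside the sandbox on arm64. The vcpu_set_pc helper, the kvm_internal_vcpu_run call and the SANDBOX_HVC_* exit codes are illustrative assumptions, not the actual symbols from the patch set.

    /* Arguments are placed by hand into the vCPU's registers, exactly where
     * the AArch64 procedure call standard would have put them (x0, x1, ...). */
    static efi_status_t sandbox_call_service(struct kvm_vcpu *vcpu,
                                             unsigned long wrapper_entry,
                                             unsigned long arg0,
                                             unsigned long arg1)
    {
        vcpu_set_reg(vcpu, 0, arg0);            /* x0: first argument  */
        vcpu_set_reg(vcpu, 1, arg1);            /* x1: second argument */
        vcpu_set_pc(vcpu, wrapper_entry);       /* in-VM wrapper: calls the
                                                   service, then issues an HVC */

        switch (kvm_internal_vcpu_run(vcpu)) {
        case SANDBOX_HVC_DONE:                  /* the wrapper's "I'm done" hypercall */
            return vcpu_get_reg(vcpu, 0);       /* service status comes back in x0 */
        case SANDBOX_HVC_FAULT:                 /* the fault handler's hypercall */
        default:
            return EFI_ABORTED;                 /* let the EFI driver back out cleanly */
        }
    }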
So, about those secure monitor calls: the solution I used so far is that I just allow them from internal virtual machines, but a more proper way would probably be to have a whitelist of secure monitor calls that we want to emulate or to forward; it starts to get quite complex, though.

And then, one more way for the firmware to screw you, because I found it really amusing, at least for me. A user reported on the Internet that his machine would randomly crash. What actually happened is that when he called the EFI runtime services, they would enable DMA transfers, and then the device would write Ethernet frames at random locations in memory and overwrite the kernel. Obviously that's quite complex to debug, and once they found out about it, well, there was nothing they could do. So sandboxing the EFI runtime services is cool, but if, in the address space you map for them, you leave them access to devices, they can configure those devices wrongly and still do things behind your back.

So that's it for me. Just one last thing: I am looking for my next, similar internship, so if anyone is interested, get in touch with me. And if you have questions, feel free to raise your hand. Yes?

Oh yeah, probably. Something I should say is that I only worked at ARM for two months, and this is all I worked on, so I'm not super familiar with everything, but an IOMMU would probably fix that; I didn't work on the DMA issue. Yeah, but the EFI runtime services do need access to devices to some extent, and you get the information about which addresses they need from EFI itself, so maybe you don't know what you should and shouldn't allow. It's a complex problem. Yeah? What?

So the first question was: does an IOMMU solve the problem of DMA transfers that I highlighted in my last slide? And this question is: what do the EFI runtime services actually do? The main use cases I found were the EFI variables, which are used by ACPI, if I'm not wrong; again, I'm not an expert on ACPI, but I believe they are used, and there's someone from ARM over there who can probably answer that better than me. And the nice feature I found is the real-time clock: you get easy access to a real-time clock. There are other functions; in the end, there aren't a lot of them, but they are there so that you can easily adapt to different machines. Yes, the EFI runtime services are also used on x86. I haven't worked on those machines, but I assume the problems are probably the same. Yes?

Okay, I didn't benchmark the results of my work, so I assume it's a bit slower. These services are not called super often, so it shouldn't be that bad, I assume. One thing I didn't mention is that the virtual machine doesn't run constantly: it only runs when you call an EFI runtime service, and the rest of the time it's just idle, so it shouldn't have a big impact, I assume.

From the kernel, we map the address space that the virtual machine needs, and when we call a function, we put the arguments in the registers. Also, if you need to pass a buffer, you copy the buffer into the memory of the virtual machine and then put, in the registers, pointers relative to the virtual address space of the virtual machine. But that's pretty much all the configuration of the virtual machine. I just assign one virtual CPU; that should be more than enough to run a couple of services. Yeah?

So wouldn't it be easier to just forward all the...
I mean, usually runtime service calls are on the totally slow path, and they are usually triggered from user space anyway. So wouldn't it be easier to just bounce them back to a user-space application, and then spawn the virtual machine there, with special privileges maybe, to do all the calls eventually?

You mean creating the virtual machine from user space...

You basically only call runtime services when you read the RTC, or when you read or write a variable, which all come from user space, so you can sleep, which is important. They also get called on reboot, which happens from user space because it's a syscall, so you can sleep. So all of these things basically only ever get called from user space, which means they are sleepable. Where else do they get called? In the variable driver, of course.

During initialization.

Yeah, correct. Well, during init, you just got initialized from EFI. EFI just ran, so you can just trust it during that short time frame, right? It's like two seconds after you were running exactly the same code, just with more privileges. So there's nothing...

Actually, during the boot process, some drivers and some subsystems need access to the EFI variables, and that's not from a user-space context at all; it's very early in boot, and if you don't have access to those variables, you lack features later. But also, it's not just about trust, about whether someone has messed with my UEFI; logically, it's the same as having a binary-only kernel module. Does that answer the question? Yeah, yeah, yeah. It makes it more tricky. Yeah?

So KVM is traditionally, I would say, a feature used in the enterprise market, where scalability is a concern. It looks to me as if you've created an internal kernel ABI that would make it easy to create these sorts of internal VMs beyond the EFI services. That would mean that for each driver using this, the number of VMs that user space can create decreases, right?

Okay, so the question is: if we spawn virtual machines from the kernel, does it reduce the number of virtual machines we can run from user space? Yes, but as far as I remember, the limit is something like 256 virtual machines with KVM, and I believe, I mean, I'm not in an enterprise context, but I believe that if you can only run 255 virtual machines, you should be okay. I don't know.

You can install quite a lot of, you know, terabytes of RAM and stuff.

Probably. I've just never worked with that kind of machine.

So, just a suggestion to discourage people from overusing this feature.

Oh yeah, correct. We shouldn't use it for everything, obviously, just for critical parts. I don't think there are that many subsystems that would use something like that. So, one more question.

Do you spawn a new virtual machine every time?

No, just one virtual machine at boot time, and then I always call functions in this virtual machine.

How do you know which hardware resources to map for the EFI services to run correctly?

Yeah, this is a very good question. So the question is: how do we know what we should map into the virtual machine so that the EFI runtime services have access to the hardware? Thankfully, the EFI standard gives you a memory map. It tells you: I need that range of memory, as a device, with write access, with execute access. And I just go through that structure and adapt the page tables of the virtual machine according to the information EFI gives me. Okay. Anything else? So, I think I'm done.
Thank you very much.