Yesterday I talked about script execution control, and I ended that talk with some things that could improve kernel security. Today's talk is about hypervisor-enforced kernel integrity for Linux with KVM.

The kernel is not perfect: there are still a lot of issues, a lot of bugs, and potentially a lot of exploits against the kernel. These can be leveraged by attackers to gain read or write access to the kernel and then bypass everything. So what I'd like to do is enforce integrity on the kernel. Here I define integrity as being able to guarantee that critical parts of the kernel stay as expected. There are a lot of potential issues, known and unknown, and we'd like to protect against that.

So let's dig into the threat model. Let's say the attacker is able to use an exploit to gain read and write access to the kernel — and here we're talking about a guest kernel. The host is the machine in charge of managing other virtual machines; in the case of KVM, guest virtual machines are what we commonly call virtual machines: the thing you spawn and play with. So here we're trying to protect guest virtual machines. The attacker could, through userspace processes, network packets, or even malicious block devices, attack the kernel and exploit a bug.

What we'd like to use here is virtualization, and especially hardware-assisted virtualization. There are currently a lot of different protections in the kernel to defend itself against this kind of attack, but that is self-protection: once you have enough power to change or disable such protections, you can bypass them. These are good things to have, but we can improve on them. For that, we need to rely on something with more privileges than the kernel itself, able to protect the guest kernel from the outside: the hypervisor.

There are a lot of different hardening mechanisms in use today. There are self-protection mechanisms like grsecurity/PaX, and OpenBSD uses a lot of them too. Linux has improved its state with different mechanisms as well. Windows uses what is called virtualization-based security, VBS, a set of mechanisms able to protect the Windows operating system from an integrity point of view and even to protect some secrets. For Android, there are Samsung RKP and the Huawei hypervisor, which can do much the same thing to protect an Android smartphone. iPhones use similar techniques: Apple developed a software version, WatchTower, and then improved it with hardware mechanisms. For Linux, there is also virtualization-based tooling: mainly Bitdefender's Hypervisor Memory Introspection, which can be used to protect a guest kernel but also to introspect it — that can be useful for debugging, developing, or even analyzing attacks. Intel also released a proof of concept with some mechanisms leveraging virtualization. In this talk, I will explain what we did, how far we got, and what we can do to protect a guest kernel.
For this, we took a look at different patches which were submitted some time ago: mainly the KVM protected-memory patches, which were something like 30 or 40 patches, but also the paravirtualized control-register pinning patches, and a hypervisor-based integrity approach which was presented some years ago. For that last one, though, there's no public code, or at least only parts of it. So there's nothing really usable for Linux yet, and that was our goal.

For this, we wanted to use KVM. KVM is the de facto Linux hypervisor. There are mostly two parts: one is the hypervisor itself, and the other is the guest drivers, which can be used for improved, paravirtualized operation. KVM leverages Linux kernel code for things like scheduling and resource limitation, and you can create, spawn, and manage virtual machines thanks to what we call a virtual machine monitor, which can be, for instance, QEMU or Cloud Hypervisor. That is a userspace program running on the host.

A word about the chain of trust. We'd like to enforce integrity, but we need to rely on some basis, something to trust. For this, we can rely on, for instance, secure boot, which can be used to give some trust to the base: mainly the virtual machine monitor and the hypervisor. Once this host is ready, it can launch a new virtual machine — in this case, a guest kernel and guest userspace, the guest applications. The role of a kernel is to manage userspace, to create processes and so on, and the goal of this kernel is to protect such processes. There is a lot of machinery in Linux to enforce access control, and even the basic stuff like virtual memory management is useful to isolate and protect applications. So the idea is to rely on the hypervisor to enforce protections on the guest kernel, and then the guest kernel, as usual, enforces restrictions on applications.

What we'd like is to let users manage their own kernels. We don't want to require anything users might not have, like external code and so on, and we want them to have the same control they currently have with kernel self-protection mechanisms. And if we want this feature to be usable, we want it to be simple — ideally, no configuration at all. From a security point of view, we need to rely on something, as I just explained: mainly a secure and trusted hypervisor and virtual machine monitor.

The properties we'd like to enforce are mainly two or three. The simplest one is control-register pinning. It's a simple hardening mechanism which can be used to restrict the use of hardware capabilities that should not be usable once a kernel has booted. These could otherwise be leveraged by an attacker, for instance to disable the write-protection bit, which would then allow writing to any part of kernel memory; the same goes for similar mechanisms like SMEP and SMAP. (A concrete sketch of this kind of attack follows below.) The second part is to enforce restrictions on kernel memory. We want the kernel's static data to be read-only — which is already the case — but we want to enforce it from the hypervisor's point of view. This includes the syscall table, security configuration, and a lot of other things. We'd also like to enforce a global execute-XOR-write kernel memory policy, commonly called write XOR execute.
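To make the control-register pinning point concrete, here is a minimal sketch of the classic write-protection bypass it is meant to stop. This is illustrative code, not part of the patch series; recent kernels already resist this with pinning checks inside native_write_cr0() itself, but those checks live in the kernel and can be undone by a sufficiently powerful attacker — exactly the gap that hypervisor-side pinning closes.

```c
/*
 * Illustrative sketch only (not from the patch series): with kernel
 * code execution, an attacker clears CR0.WP so that supervisor-mode
 * writes to read-only pages (e.g. the syscall table) succeed.
 * Hypervisor-side pinning makes this CR update trap outside the
 * guest, where it can be rejected.
 */
#include <asm/special_insns.h>
#include <asm/processor-flags.h>

static void attacker_disable_write_protection(void)
{
	unsigned long cr0 = native_read_cr0();

	/* Clear CR0.WP: ring-0 writes now ignore page-level W bits. */
	native_write_cr0(cr0 & ~X86_CR0_WP);
}
```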
So this means that we'd like only a small set of kernel memory pages to be executable, and to deny everything else — from the kernel's side, I mean. But we also need to run userspace, so that needs to be handled as well.

This patch series contains two parts: the first one is a common guest kernel API, and the second one is a KVM implementation. What we'd like here is to let the guest kernel configure itself, like it already does with the different kernel self-protection mechanisms, and not rely on a specific hypervisor to enforce that. The first implementation is for KVM, but we'd like to have more implementations. Of course, we need to reach the hypervisor in some way, and it might be useful to also get attack signals and make them available to the virtual machine monitor, so they can be logged and then used to inform the user that there's an ongoing attack, things like that. So the idea is to have a common, normalized layer, and this API should be simple to use: it ties kernel memory page ranges to memory protections. We don't want to rely on hypervisor implementation details. And last but not least, we have some tests, and we want them to be usable by any hypervisor implementation.

The guest API looks like this — it's pretty simple. There are mainly two functions: one to protect memory ranges and one to lock control registers. The range-protection helper takes essentially one argument split in two: a pointer and a size, the number of elements. The pointer refers to an array identifying a set of kernel virtual address ranges; these are first defined as virtual address ranges and then translated into physical address ranges. We can then tie these memory page ranges to a set of attributes — here, protections — and the two protections we define are no-write and execute. The second helper pins control registers. So it's pretty simple to use; a sketch of this API shape follows below.

A word about the boot process. Here I want to highlight some parts of the guest kernel boot — it's really a summary, pretty simple. The important points are these: while the kernel is initializing itself, there's a first call to the Heki init code, which is in charge of identifying kernel sections and mapping them to memory permissions. Then — not part of this patch series, but relevant — there's the mark_readonly() call, which enforces the kernel self-protection mechanisms, especially the memory management unit protections. And then, just before launching the initial process, there's a Heki late-init call, which enforces the memory restrictions on the side managed by the hypervisor. I will explain the EPT part later. Once everything is set up and locked, we can launch userspace, which could, in some way, maybe compromise the kernel.

Let's talk about the KVM implementation. As I explained, there are mainly two parts, and for this I implemented two hypercalls: the first one to enforce kernel memory restrictions, and the second one to pin control registers. The underlying idea is that you can only ever enforce more and more restrictions, never unlock them.
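Here is a minimal sketch of the guest-facing API shape described above. The structure layout and the identifiers (heki_range, HEKI_ATTR_MEM_NOWRITE, HEKI_ATTR_MEM_EXEC, heki_protect_ranges, heki_lock_crs) are my paraphrase of the talk, not necessarily the exact names in the RFC:

```c
/* Sketch of the guest API described in the talk; names are illustrative. */
#include <linux/types.h>

/* Protections the guest asks the hypervisor to enforce on itself. */
#define HEKI_ATTR_MEM_NOWRITE	(1UL << 0)	/* deny writes */
#define HEKI_ATTR_MEM_EXEC	(1UL << 1)	/* allow kernel execution */

struct heki_range {
	unsigned long va_start;		/* kernel virtual address range */
	unsigned long va_end;
	unsigned long attributes;	/* HEKI_ATTR_MEM_* flags */
};

/*
 * Translate each virtual range to guest-physical ranges, then ask the
 * hypervisor backend (e.g. the KVM hypercalls) to pin the protections.
 * Takes a pointer and a number of elements, as described in the talk.
 */
int heki_protect_ranges(struct heki_range *ranges, size_t count);

/* Pin the hardening bits currently set in the control registers. */
int heki_lock_crs(void);
```

For example, the init code could register the kernel text as { _stext.._etext, EXEC } and the read-only data as { __start_rodata..__end_rodata, NOWRITE } before userspace starts.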
The control-register pinning hypercall is pretty simple; it's quite close to the guest API. There's a new hypercall ID — lock CR update — and two arguments: the first identifies the control register, in this case CR0, and the second is the flag mask we want to pin, meaning it should no longer be updatable by the guest kernel after that. (A sketch of the guest side follows below.)

The second hypercall enforces protections on memory page ranges. This relies on EPT, which I'll explain just after. For this, we rely on the VMCS, the virtual machine control structure used to configure hardware virtualization. The idea, after that, is for the hypervisor to log such attempts and potentially forward them to secure logs and the like. And as is the case for common memory management permissions, once a fault is detected we need to create a page fault; in this case it is created by the hypervisor itself and forwarded to the guest kernel.

So, what is EPT? In fact, we're talking here about what is called second-level address translation, SLAT, also called two-dimensional paging, TDP. For Intel it's called EPT, and for AMD it's called RVI or NPT, depending. What is really interesting here is that it is a second layer of permissions. The first layer is controlled by the guest kernel itself with the MMU; this second layer makes it possible to enforce the same kind of restrictions, but by the hypervisor itself. And this kind of restriction is not directly available to the guest kernel — otherwise it would be useless. So we rely on that to enforce the restrictions the guest kernel wants enforced on itself.

But there is a limitation. Common memory protections are mainly three: the read, write, and execute permissions. The issue is that with a common kernel there's kernel space and userspace — two kinds of memory — and if we can only enforce one set of execute permissions on userspace and kernel space at the same time, that limits us. It means we would need to enforce restrictions on all running code: the kernel and all userspace processes at once. And that is very difficult, because there's a lot of different userspace code, applications, processes, and so on.

But here comes MBEC: Mode-Based Execution Control. It is an Intel mechanism — there is something similar for AMD, but here we only leveraged this one. The idea is to split the execute permission into two: kernel-mode execution and user-mode execution. That's really interesting because it lets us enforce a deny-by-default security policy on all kernel-executable pages while not touching, and not caring about, userspace, which should be restricted by the guest kernel itself.

So how does it look? Here is a kind of summary of the kernel memory — well, of the whole available guest-physical memory, but we're only talking about the kernel memory permissions. There are mainly two sections: the text section of the kernel and the read-only data section.
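As a sketch of the guest side of that CR-pinning hypercall: kvm_hypercall2() is the standard KVM guest helper, but the hypercall number and the CR index encoding below are assumptions for the example, not the RFC's actual values:

```c
/*
 * Illustrative guest-side sketch of the lock-CR-update hypercall.
 * KVM_HC_LOCK_CR_UPDATE and its number are hypothetical here.
 */
#include <asm/kvm_para.h>
#include <asm/processor-flags.h>

#define KVM_HC_LOCK_CR_UPDATE	14	/* hypothetical hypercall number */

static long heki_pin_cr0_wp(void)
{
	/*
	 * Arg 1: which control register (0 for CR0, as an assumption).
	 * Arg 2: the flag mask to pin; the hypervisor will refuse any
	 * later guest update that clears these bits.
	 */
	return kvm_hypercall2(KVM_HC_LOCK_CR_UPDATE, 0, X86_CR0_WP);
}
```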
But the thing is, if we don't have MBEC, some parts of the read-only data section must stay executable, because the read-only data section actually contains userspace code — the vDSO — which is mapped into userspace processes and used by them. If we deny execution on that part, userspace processes will not work at all. And there's another thing: every new memory region that can be used by userspace must be executable in some way, otherwise you would not be able to run userspace code. This means we cannot enforce restrictions on memory which is not allocated yet; we would need to be able to differentiate between kernel code and userspace code.

With MBEC, it's much easier and cleaner. We can make the read-only data section truly read-only — and this time, again, only for the kernel. The vDSO here will stay executable for userspace, because we don't configure that; we don't care about it. We can enforce read-execute permissions on the kernel's text section. And most importantly, we can enforce a deny-by-default policy for everything else: nothing else should be executable by the kernel.

Okay, let's see a demo. It's mainly the code which is part of the RFC we sent last week — I'll give you the links at the end of the talk. On the right, I will launch the guest virtual machine. On the left is the host: the host log at the top, and at the bottom the test code which will be executed on the right.

Let's go back to the beginning. What you see here on the left is assembly code that will be executed by the kernel. The important point is that it is stored in the read-only data section, which is not common — it should be in the text section — but that's to check that the read-only data section is not executable. The code to be executed is really simple: it just takes an integer and increments it. The test code first disables the kernel self-protection at runtime — which is, by the way, not an easy thing to do, but it is doable — and after that executes the code. (A sketch of this test logic follows below.)

On the right, I'm launching a virtual machine — without any userspace processes, because we don't care about that — and it's a normal kernel, without the Heki protection. You can see here that the test failed: we can see that three plus one equals four. The arithmetic is good, but what is not good is that this code successfully disabled the kernel self-protection and executed code that should not be executable.

Now we do the same, but this time with Heki enabled. What we see first is the initialization in the guest kernel: it identifies five sections and enforces protections on them, either the no-write restriction or the execute permission. On the left, you can see the KVM logs: the request is received, and once it is received, all memory mapped to the guest kernel gets its permissions set — execution is denied for everything except what is explicitly allowed. As you can see on the right, the no-write permission works as an allow list, while the execute permission works as a deny list.
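For reference, here is a minimal sketch of what that first test amounts to. The byte sequence, the section placement, and the function names are mine, for illustration; the real demo code lives in the RFC:

```c
/*
 * Illustrative sketch of the first demo (not the actual RFC test):
 * a tiny "increment" routine stored in the kernel's read-only data
 * section, then called once the self-protection on that mapping has
 * been forcibly stripped. Without Heki the call returns 4; with Heki
 * the instruction fetch from .rodata is denied by the EPT execute
 * policy, whatever the guest page tables say.
 */
#include <linux/kernel.h>

/* x86-64 bytes for "int incr(int x) { return x + 1; }" */
static const unsigned char rodata_stub[] = {
	0x8d, 0x47, 0x01,	/* lea 0x1(%rdi),%eax */
	0xc3,			/* ret */
};

static int run_rodata_stub(void)
{
	int (*incr)(int) = (int (*)(int))(unsigned long)rodata_stub;

	/* Expected to fault under Heki: .rodata must not be executable. */
	return incr(3);		/* the demo checks that 3 + 1 == 4 */
}
```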
We can do a deny list for execution because execution is not difficult to enforce that way: it's not common to have new executable kernel code, but it is quite common to have new data that you want to update. And once the permissions are enforced, the hypervisor locks the guest configuration. At the bottom, you can see the test. The test, well, kind of hangs: we don't see the end of it because the guest kernel keeps retrying in a loop — there's no kernel page-fault handling implementation yet for this kind of exception, which is quite uncommon for a guest kernel — but the point is that the kernel cannot execute that code. You can see the kernel virtual address here; the access is received and blocked by the hypervisor, and the hypervisor creates a fetch page fault which is then sent to the guest. The guest receives it, but it just ignores it for now. That's really part of future work.

Okay, now let's see a second demo. It's much simpler this time: it's about control-register pinning. You see the test code on the left again; it's quite straightforward. The kernel test code — let's say the exploit — reads the CR4 register and then tries to clear the SMAP bit. On the right, I launch a new virtual machine, a guest which is not protected with Heki at first, and we can see that the test failed, because the guest kernel was able to remove this protection. If we do the same test, but this time with the Heki protection, you can see that — well, the test succeeded, because the attack failed. And you can see on the left that KVM detected the CR update and blocked it. (A sketch of this test appears below.) It's not really fancy as a demo, but it's always difficult to show this kind of thing working.

There are some limitations. For now, this patch series only enforces static permissions, so dynamic kernel module loading is not allowed. That might be an issue, of course, but the feature can still be useful if you don't need kernel modules. And some attack techniques are out of scope, for instance ROP — return-oriented programming. Being able to jump to legitimate kernel code is not something Heki can restrict, because that is not its job; but there are other mechanisms that can be leveraged against that, even hardware features.

So, what we'd like to do next: we'd like to handle dynamic code, of course. For that, we need to rely on something — not the hypervisor itself, but something less privileged, to avoid that piece of code being exploitable. There are two ways to do it: either implement it in the virtual machine monitor, so QEMU or Cloud Hypervisor, or use a dedicated security virtual machine, which is what Windows VBS uses, for instance. We also need to handle kernel memory being freed, to remove such permissions. That will be much more complex, but still doable.

The second important thing is to improve the kernel self-protection mechanisms, because those are the ones being mirrored into the hypervisor. We can improve the Heki protections by restricting more MSR updates, and by using hypervisor-managed linear address translation (HLAT), which is a way to prevent, mainly, MMU reconfiguration. We could also use execute-only memory for the kernel — there are some issues there, but it should be possible.
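Going back to that second demo, here is a minimal sketch of what the test does, with illustrative names (the real test is in the RFC):

```c
/*
 * Illustrative sketch of the second demo: read CR4 and try to clear
 * the SMAP bit. On an unprotected guest the write goes through; with
 * the pinning hypercall armed, KVM intercepts the CR update and
 * refuses it, so the bit stays set.
 */
#include <linux/types.h>
#include <asm/special_insns.h>
#include <asm/processor-flags.h>

static bool attack_clear_smap(void)
{
	unsigned long cr4 = native_read_cr4();

	native_write_cr4(cr4 & ~X86_CR4_SMAP);

	/* True if the attack worked, i.e. SMAP really got cleared. */
	return !(native_read_cr4() & X86_CR4_SMAP);
}
```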
We'd also like to enforce this kind of restriction on the KVM host itself, but KVM is not designed for that, so that might be a big issue. Anyway, we can protect the guest, and that's good. Other ideas: we could react to attacks, by creating a new interface to log such attempts and potentially respond to them. What is not supported yet is nested virtualization, because of the MBEC feature, but that might be possible. And we'd like to support more architectures: for now it's Intel only, but AMD should be usable the same way.

Okay, let's wrap up. Heki is a defense-in-depth mechanism that leverages hardware virtualization, and especially the MBEC feature — which is, by the way, also used by Windows for the same purpose. The RFC we sent last week defines a common API layer and a KVM implementation. It's freely available; you can use it and test it. You don't need to update or change QEMU, which is nice. It's a new project and we're looking for contributors, so if you're interested, please reach out. There's a lot to do: for instance, supporting other hypervisors, supporting other architectures, improving the guest kernel support, and also enhancing the virtual machine monitors — QEMU, Cloud Hypervisor, and others. You can take a look at this link; there are some useful links in there, and I will post the demo and the slides there too. Thank you.

Q: I'm curious if you're handling huge page support in the second-level address translation. Do you fracture memory — demote it to 4K pages or something like that — if you get a request for smaller regions?

A: I'm not sure. I would need to add that, but it should be doable if it's not the case. Right now the configuration is pretty simple: the kernel just identifies a set of sections and sends the related physical addresses to the hypervisor, and the hypervisor just protects them from beginning to end, without caring about which kind of memory it is.

Q: And does all this map cleanly onto existing Hyper-V semantics?

A: Sorry?

Q: Does all of this that's proposed for Linux as a guest already map cleanly onto what's offered by Hyper-V and its enlightenments? You're working on new hypervisor support, right? So there will be some changes to guest code in Linux in order to support other hypervisors like Hyper-V. Does Hyper-V already support all the necessary semantics, from the guest-to-hypervisor interface perspective, to implement what's proposed for the Linux guest?

A: This is a KVM implementation, but there is some work going on for Hyper-V. Hyper-V uses similar mechanisms, but it definitely doesn't work the same way: it uses a kind of sidecar virtual machine to enforce such restrictions, so it's not the same interface. And that's one reason why we want to have a common API layer, so guest kernels don't have to care about hypervisor implementation details. But yeah, for now it's only a KVM implementation.

Q: So you're going to need some kind of VM-API interface abstraction that will let you do this on Xen or KVM or Hyper-V or VMware, right?

A: Yes, there are backend drivers that can be plugged into this common API.

Comment: I'll just put my Microsoft hat on for a second. There is a common layer in there that is meant to be hypervisor-agnostic, and as far as possible, we want to make sure that existing hypervisors are supported.
And so this talk is about the KVM implementation using that.

A: Yeah, there should not be any limitation preventing a hypervisor from implementing this; it's just that some hypervisor backends are not implemented yet. In the case of Hyper-V, the hypervisor implements some parts, but other parts are implemented elsewhere. For Xen or others, it would be new code to add. Okay, any more questions? Thank you.