Hello everyone. Thanks for attending this session. My name is Nim, and I work as a software engineer for Google Cloud. Our team is mainly responsible for maintaining the hypervisor and virtual machine monitor inside Google and Google Cloud. Today, the topic of the talk is hypervisor-based integrity. As you can probably tell from the name, we are trying to use the superpowers given by the hypervisor and the hardware to protect the guest kernels running inside the cloud environment. First, a high-level disclaimer: this is not a statement of direction or a planned investment by Google yet. We are exploring this as a possible security mechanism, and we are seeking feedback from the broader community. Some background on why we are heading this way. Google Cloud offers a product called Shielded VM; in fact, that's the default VM type if you create a virtual machine inside Google Cloud. As the name suggests, it comes with some enhanced security features, one of them being Secure Boot. At a high level, in case you're not familiar with it, Secure Boot provides integrity checks on the firmware, bootloader, and kernel during boot. It uses certificates to verify the signatures of all the modules, drivers, and binaries it tries to load, and confirms they have not been tampered with. So it gives you a guarantee that by the end of boot, everything is in good shape: nothing has been modified or hijacked in any way. After that, if you load additional dynamic kernel modules, they are protected by module signing — code in the kernel that performs signature checks against the module you are trying to load. That's all good. But in real life, attackers can still find ways to gain the highest privileges inside a running kernel.
If an attacker can do that and modify code pages or callbacks in some special way, then all the security gates we put in place can no longer be trusted, and user applications are put at risk. The goal is to protect the kernel even at runtime. Since we already have all these boxes checked during boot, we are trying to extend what we can to make sure the kernel at runtime is also protected and can be trusted. If the kernel has not been modified in any way and is running as expected, user applications can feel much safer. One more thing: the protection we put in place must not be something that can be turned off from inside the guest. A rootkit can gain the highest privilege, and if it could change a setting in the guest to disable the protection, the protection would not be very strong. So let's talk about the threat model here. We assume the attacker has the following capabilities: it can gain ring 0 access, and it has arbitrary read and write permissions — it can read any kernel memory address and write any value to any kernel memory address. So how do we protect the guest kernel under this threat model? First of all, we want to make sure there are no unintended modifications to the kernel code segment and read-only data segment. This is very obvious: we have well-written code, and we don't want it replaced by malicious code. Second, we want to make sure there is no code execution from other parts of kernel space. The data segment should always be readable and writable, but not executable — this is where buffer overflow attacks happen, and we want to block that. Third, we want to make sure there are no unintended modifications of key kernel data structures. One example is the system call table, which is a table of function pointers; if a hacker manages to replace an entry with something else, the user may get monitored or hacked without realizing it.
Another example is control registers like CR0 or CR4. There is a bit in CR0 called write protect (WP); if you turn that bit off, the processor will allow ring 0 to write to read-only memory, which is not ideal. There are also other important MSRs — for example, people have talked about the SYSENTER MSRs, and by abusing those they can hijack some system calls. Similar to the system call table, there are the IDT (interrupt descriptor table) and GDT (global descriptor table); both contain function pointers, and we want to make sure they never get overwritten. Last but not least, we have the page tables, which map virtual addresses to physical addresses. If someone hacks the mappings, they can redirect code execution somewhere else. So why do we want to use the hypervisor as another security layer? Our understanding is that there is always a risk in protecting the guest from within the guest. We already do a lot of kernel hardening work in kernel space to make attacks harder and harder, but it is not guaranteed to be impossible. A rootkit or attacker can still gain the highest privilege and read and write any guest memory if they want. It's not easy, but it's still possible, and anything we put in kernel space could be at risk. On the other hand, if we guarantee the kernel code never changes, then in return all the security checks we put in place become more robust. So this is not a replacement for all the kernel hardening work that exists; we hope this is just an enhancement layer that makes sure whatever you put there is correct and remains correct for the entire lifecycle of the VM. Furthermore, the hypervisor has an almost one-to-one mapping to all the protections we want to achieve. The hypervisor can control the read/write/execute bits of any page-aligned guest memory region: using second-level address translation — EPT on Intel, NPT on AMD — we can set up permissions that the guest can never change.
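To make the CR0.WP concern concrete, here is a minimal userspace sketch (the function name and shape are ours, not an existing API) of the check a hypervisor could run when it intercepts a CR0 write: a transition that clears the write-protect bit is exactly the kind of modification we would flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR0 bit, as defined by the x86 architecture. */
#define X86_CR0_WP (1ULL << 16) /* Write Protect */

/* Sketch of a hypervisor-side check on a CR0-write VM exit:
 * clearing CR0.WP lets ring 0 write to read-only pages, so a
 * WP 1 -> 0 transition is treated as suspicious. */
bool cr0_write_suspicious(uint64_t old_cr0, uint64_t new_cr0)
{
    return (old_cr0 & X86_CR0_WP) && !(new_cr0 & X86_CR0_WP);
}
```

A typical Linux guest runs with CR0 = 0x80050033 (WP set); a write of 0x80040033, which differs only in the WP bit, would trip this check.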
The hypervisor can also protect against unsafe modifications of control registers and MSRs: we can configure a VM exit on writes to those MSRs or control registers, and then double-check whether the write is legitimate or not. And the hypervisor can protect the key kernel page tables. There is a new feature from Intel called Hypervisor-managed Linear Address Translation (HLAT), which allows page table mappings to be controlled by the hypervisor so the guest can never change them. Most importantly for us at Google Cloud, all of our customer machines are already running as virtual machines. We already have a layer underneath all those VMs; we are not introducing another layer of complexity here, just using the existing layer to make everything more robust and more secure. So how does everything work end to end? Here we introduce a new guest security kernel module. This module should be loaded at boot time, ideally toward the end of boot, so the kernel has already set up the page tables, IDT, GDT, system call table, and so on. It should be signed and protected by Secure Boot, so it can be trusted at boot time; if anyone tries to modify it, Secure Boot will fail. After it is loaded, it checks whether the hypervisor supports these special integrity protections. If it does, the module identifies all the kernel code and data segments, plus the addresses of all the key data structures, and sends a "please protect me" request to the hypervisor. Then we wait for the acknowledgement from the hypervisor, because we don't want to jump into runtime before the protection is in place — during boot, everything is still under control and in good shape. On the hypervisor side, during boot it needs to decode the memory segment information sent from the guest and modify the second-level translation table to set the correct read/write/execute bits for each memory region.
It will also configure VM exits on the MSRs and control registers. After that, it acknowledges the request and lets the guest continue to boot. This is just a few extra calls, so it should not have a significant impact on guest boot time. During runtime, the hypervisor and QEMU (or another VMM) will just intercept EPT and NPT access violations. If a violation happens, you can configure one of the following actions: either kill the running VM and dump all the memory for analysis — that's for people who care a lot about security and want to make sure nothing wrong can happen — or, as we expect for most users, just log a critical event, notify the user, and let them decide what they want to do. We do the same thing for control register and MSR modifications. Now, you may be wondering: at runtime we can load dynamic kernel modules, and those modules are executable, so how do we handle that? Our recommendation is that it's much safer not to do that. But if you have to, there are APIs exposed by the guest module, and we modify the kernel code a little bit. First, the module loader allocates a readable and writable memory region, then copies the module into it. Once verification is done, it calls the API to change that memory region to readable and executable. This makes sure the code segment never changes after the first time you load the kernel module. You may also wonder about the performance impact, since we are doing a lot of work here. For memory access, the impact should be minimal, mainly because we have hardware support through Intel EPT and AMD NPT. If no violation happens, there is no impact at all; and if something does happen, we think it's acceptable to pay some extra cost on the VM exit to determine whether the operation is legitimate. From our prototyping and testing, those violations do not happen very often on a typical kernel.
For control registers and MSRs (model-specific registers), the impact should also be minimal, because we don't expect a lot of modifications during a typical kernel's runtime. If you are running nested virtualization, it's a completely different story, so we don't recommend running these scenarios. For the page table translation, it needs hardware support, which is currently limited to newer Intel processors (Alder Lake), so it's not widely available yet. This all sounds straightforward, but there are a bunch of technical challenges. Two major ones stand out. First, how do we distinguish legitimate kernel modifications from attacks? There are a bunch of legitimate reasons for the kernel to modify itself. We either need to work with our customers to disable those special scenarios in their images, or we need to decode the instruction pointer, walk the call stack, and allowlist those operations. The other major problem is that when we set memory permissions, not all segments are nicely aligned. Permissions are set at the page level, so if one page is shared between a code segment and a data segment, it is very hard to set the correct permission on it, and we have a potential problem. To mitigate this issue, some changes in the kernel are needed to make sure all the different kernel segments are nicely page-aligned. Here is a quick example of what it looks like when we turn on the protection. We have a guest running Debian 10. After the kernel boots, we map the entire kernel code segment as read and execute. We get a bunch of EPT violations right away, but if we walk the call stack for each of them, they all come back to the same function: text_poke. The description of this function clearly says it updates instructions on a live kernel, so this is a valid case of kernel self-modification.
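Since EPT/NPT permissions are granted per 4 KiB page, a quick alignment check like the following (our own sketch) is enough to detect segments that would force two different permissions onto one page:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Second-level permissions apply to whole pages, so a segment can only
 * get an exact R/W/X setting if both its start address and its length
 * fall on page boundaries; otherwise the first or last page is shared
 * with a neighboring segment. */
bool segment_page_aligned(uint64_t start, uint64_t len)
{
    return (start % PAGE_SIZE) == 0 && (len % PAGE_SIZE) == 0;
}
```

A segment that fails this check is exactly the problematic case described above: its boundary page would need to be both writable (for the data side) and executable (for the code side).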
After we removed that particular module, the guest ran smoothly without any further violations. This is a very good example showing that if we configure the image correctly, any access violation becomes a strong hint of a possible attack. Now let's talk about what changes we need in KVM and QEMU to support the protections we just mentioned. First of all, we want a common interface to turn on the protection. As you saw, during boot the guest agent sends messages to the hypervisor and VMM. Currently we are using a selected MSR for these configurations, which is not ideal; it would be better to allocate a dedicated hypercall just for this purpose. That would make it clear this interface is for VMs only and avoid a lot of confusion. Once people agree on the interface — what the hypercall API looks like — any hypervisor or VMM can implement support for it and provide the same level of protection no matter where the VM lands. Azure, AWS, Google Cloud, or plain QEMU could all implement the same API, so a guest can move freely between providers and get the same level of protection. We sent out an RFC a while ago for adding a hypercall that exits to userspace — a hypercall that tells the hypervisor to hand control over to the VMM. There is a link you can review. Why do we need this change? Today, all hypercalls are handled inside KVM; they never reach the VMM. We want a hypercall interface that passes control back to the VMM. Why do we want the VMM to handle this request instead of KVM? Because the VMM is responsible for setting up the guest memory mappings and should own, or at least control, the permissions associated with them. What's more, it simplifies live migration: during live migration, the VMM can just recreate the same guest memory with the appropriate permissions, because it has knowledge of everything.
It also keeps KVM simple: KVM just needs to expose the APIs required to turn on the protection, and all the state is maintained in the VMM instead of KVM. Once we have the hypercall defined, we want to define the message sent from the guest to the VMM. Here is a simple proposal we call the HBI request — a hypervisor-based integrity request. It has a version and an operation code; the operation can be "set memory protection", et cetera, and we can extend it for future use. The message body only needs to tell us the PFN (page frame number), the number of pages to protect, and the memory permission to set on those pages. The permission can be read+execute or read+write — never execute and write at the same time. Once the request completes, the guest can read the result from the return code. That's the guest-to-VMM communication protocol. Beyond that, to actually set the memory permissions, we need to call the KVM_SET_USER_MEMORY_REGION ioctl. Unfortunately, today this call only supports marking memory read-only (read and execute); we also want read and write but not executable. This is something that is lacking today, and we will try to extend it. We also want to expose the ability to take a VM exit on control register modifications, because there are a lot of bits there we want to protect. On Intel and AMD hardware, if you configure the VMCS correctly, you can get an exit on writes to CR0 or CR4, and we want the VMM to make the call on whether to let the modification go through or not. Lastly, there is supporting the new Intel Hypervisor-managed Linear Address Translation. This is new hardware, and I don't think the support is there yet; I hope Intel, community members, or we ourselves will work on this support together.
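A minimal sketch of what the HBI request might look like in C, under the constraints just described; the field layout, the names, and the validation helper are all our assumptions, since the actual ABI has not been defined yet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation codes; extendable for future use. */
enum hbi_op {
    HBI_OP_SET_MEMORY_PROTECTION = 1,
};

/* Hypothetical permission bits for a protected region. */
#define HBI_PERM_R (1u << 0)
#define HBI_PERM_W (1u << 1)
#define HBI_PERM_X (1u << 2)

/* Hypervisor-based integrity request, guest -> VMM. */
struct hbi_request {
    uint32_t version;   /* protocol version */
    uint32_t op;        /* enum hbi_op */
    uint64_t pfn;       /* first guest page frame number */
    uint64_t nr_pages;  /* number of pages to protect */
    uint32_t perm;      /* HBI_PERM_* bits */
    uint32_t result;    /* filled in by the VMM on completion */
};

/* VMM-side sanity check: only read+write or read+execute are
 * acceptable; write and execute are never granted together. */
bool hbi_request_valid(const struct hbi_request *req)
{
    if (req->op != HBI_OP_SET_MEMORY_PROTECTION || req->nr_pages == 0)
        return false;
    return req->perm == (HBI_PERM_R | HBI_PERM_W) ||
           req->perm == (HBI_PERM_R | HBI_PERM_X);
}
```

The key design point is the validator's W^X rule: the interface itself makes it impossible to request a region that is simultaneously writable and executable.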
For the future, once we have the protections we talked about in place, we can extend this guest security module to expose APIs for any kernel code to consume. It can be a driver; it can be security hardening code inside the kernel. They can call the API during boot, allocate memory, write to it, and then mark it as read-only. We then send that memory segment to the hypervisor at the end of boot and make sure the memory never changes during the entire lifetime of the VM. So all kernel security and driver modules can take advantage of this to protect themselves. Now let's talk about some other security considerations. First of all, all the protection we talked about today is built on top of Secure Boot. If Secure Boot is broken, our guest agent could be changed, and we cannot trust the messages it sends us. So why can we trust Secure Boot in Google Cloud? In Google Cloud, all the guest firmware is effectively immutable: we inject the guest firmware at boot time and copy it into guest memory. Even if you manage to change the firmware in some special way at runtime, after a reboot you get a fresh copy of the firmware and are restored to a good state. A lot of the firmware integrity checking actually happens outside the guest. We know the firmware is a critical piece of code and want to make sure it does what we expect, so a lot of the security and signature checks are offloaded to the VMM process; they don't happen inside the guest. And the important Secure Boot variables — the primary keys and signatures — are stored on a remote server instead of your local disk, so there is no way a guest can bypass UEFI and our firmware to modify those variables directly. One popular attack against Secure Boot goes through System Management Mode — the SMM attack. To reduce the attack surface, SMM is completely dropped from Google Cloud; we don't support it at all. But the protection is not perfect.
Even if we guarantee the kernel code stays the same through the entire runtime, and nothing gets executed outside the kernel code space, it is still possible for people to use something called return-oriented programming — a ROP attack. What it does is find instruction sequences in your executable region and chain them together to achieve some special goal. For example, if people use a ROP attack to modify the kernel page tables, it still breaks the protection, because they can modify the page table translations and point the kernel code to a different memory location. If that happens, that memory location is no longer protected by EPT or NPT, so the attacker can do whatever they want there. How do we mitigate this problem? Address space layout randomization (KASLR) is very popular today and makes ROP attacks a little harder. There is also ongoing work on kernel control-flow integrity to make these attacks harder, or ideally impossible. And there is hardware support, like the Hypervisor-managed Linear Address Translation we talked about earlier, which puts the address translation under hypervisor control and prevents people from hacking the page tables in these special ways. To summarize this presentation: we are proposing hypervisor-enforced protection for guest memory and system registers, along with a paravirtualization interface to enable this protection, and we want to make this security on by default. To make this happen, we are proposing to extend KVM to provide runtime protection for kernel integrity. Thanks. That's it.