Welcome to my talk about nesting secure hosts. My name is Janosch Frank and I'm one of the maintainers of s390 KVM and the s390 KVM unit tests. Unfortunately, I couldn't make it to Dublin, but I really hope to attend next year's KVM Forum in person so I can see all of you again. A quick reminder up front: the architectures' secure VM implementations differ quite a bit. If you have questions specific to architectures other than s390, you might have more luck finding an answer from the respective maintainers or developers. With that out of the way, let's begin.

Today we'll start with a short recap of secure VMs and then jump right into the what, the why and the how of nesting secure hosts. At the end we'll look at the problems we need to solve to achieve nesting, a short summary of s390 secure host nesting, and a possible alternative to nesting.

Anything inside a VM where confidential information can be stored, for example registers and memory, is considered sensitive state. The same is true for control structures through which VCPU behavior can be influenced directly. For secure VMs, that sensitive state is protected by firmware and hardware against read access and often also against manipulation. But since the sensitive state has to be managed in order to run a VCPU, a so-called trusted entity, a combination of firmware and hardware, takes over the sensitive portion of secure VM management. The hypervisor uses an ABI to request management actions from the trusted entity in order to run secure VMs. Many hardware vendors have already implemented secure VM technology. The implementations differ quite a bit between architectures, but generally all of them have the concept of a trusted entity and protect sensitive state in one way or another.

Let's define what we are talking about today. Secure host nesting means that a KVM VM is a host to a secure VM, and that secure VM is run as a nested guest. This talk is not about starting a VM from inside a secure VM, no matter whether that level 2 VM would be secure or non-secure. Nesting works by reading and manipulating guest memory in order to emulate the virtualization instructions. But a secure VM's memory can only be accessed by itself and the trusted entity, which means nesting is only possible with considerable hardware and firmware support.

The reasons for nesting are no different from the non-secure use cases. It's great for hypervisor and test development. Also, lately there has been a trend of moving complete computing environments into the cloud: real servers are converted to VMs, and if they run VMs themselves, those become nested VMs. And last, we simply had to try to achieve nesting, since it looked like a sizable challenge.

Three things are needed to nest secure VMs. We need the ability to run secure VCPUs as nested VCPUs, and we need to let the level 1 hypervisor manage secure VMs the same way a level 0 hypervisor would, which means providing the trusted entity API to the level 1 operating system via emulation. We also need to take care of side effects to achieve architecture compliance.

Running a nested VCPU is a solved problem. The level 1 VCPU exits to level 0 KVM when level 1 tries to run a nested VCPU. KVM shadows the nested VCPU's control structures, and the shadow control structure is then used to run the nested level 2 VCPU like a normal level 1 VCPU. As long as starting a secure nested VCPU also exits to level 0, KVM can reuse a lot of the existing nesting code. This is easy on architectures which extended their virtualization instructions and VM control structures for secure VMs rather than introducing new ones. For example, s390 needed only about 300 lines of code to support nested secure VCPUs.
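To make the shadowing idea a bit more concrete, here is a minimal, self-contained C sketch. All structure and function names are invented for this example and are not the real KVM or s390 interfaces; it only illustrates the typical pattern that level 0 copies the level 1 control block and rewrites its addresses into level 0 terms before running the level 2 VCPU.

/*
 * Minimal sketch of the "shadow and run" idea for nested VCPUs.
 * All names are made up for illustration; they do not correspond to
 * the real KVM code.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Simplified VCPU control block as a level 1 hypervisor would set it up. */
struct vcpu_ctrl_block {
    uint64_t guest_psw;            /* program status word / entry point */
    uint64_t guest_regs[16];       /* general purpose registers */
    uint64_t guest_mem_origin;     /* level 1 address of level 2 memory */
    int      secure;               /* set for a secure level 2 VCPU */
};

/* Pretend translation of a level 1 address to a level 0 address. */
static uint64_t l1_to_l0(uint64_t l1_addr)
{
    return l1_addr + 0x100000;     /* placeholder offset */
}

/*
 * Build a shadow control block that level 0 can hand to the hardware (or,
 * for a secure VCPU, to the trusted entity): copy the level 1 contents and
 * rewrite the addresses from level 1 to level 0 terms.
 */
static void shadow_ctrl_block(const struct vcpu_ctrl_block *l1_cb,
                              struct vcpu_ctrl_block *shadow_cb)
{
    memcpy(shadow_cb, l1_cb, sizeof(*shadow_cb));
    shadow_cb->guest_mem_origin = l1_to_l0(l1_cb->guest_mem_origin);
}

/* Stand-in for actually entering the level 2 VCPU. */
static void run_vcpu(const struct vcpu_ctrl_block *cb)
{
    printf("running %s level 2 VCPU, memory origin 0x%llx\n",
           cb->secure ? "secure" : "non-secure",
           (unsigned long long)cb->guest_mem_origin);
}

int main(void)
{
    struct vcpu_ctrl_block l1_cb = {
        .guest_psw = 0x10000,
        .guest_mem_origin = 0x200000,
        .secure = 1,
    };
    struct vcpu_ctrl_block shadow_cb;

    /* Level 1 tried to start this VCPU and exited to level 0. */
    shadow_ctrl_block(&l1_cb, &shadow_cb);
    run_vcpu(&shadow_cb);
    return 0;
}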
But the biggest change stems from the emulation of the trusted entity API. The largest part of the emulation is achieved by level 0 modifying level 1's API requests and issuing them to the real level 0 trusted entity API. A smaller number of API requests are passed through unmodified, and some requests are fully emulated. We can sort the API calls into the listed categories. The interface is hierarchical: initialization has to occur before the creation of a secure VM, which has to occur before the creation of a VCPU, and so on. Generally, we need less emulation the higher we go in the hierarchy.

Let's have a closer look at the emulation. As stated before, the trusted entity has to be initialized in order to manage secure VMs, so initialization is one of the first API calls to be executed. Only the KVM with direct API access, i.e. level 0, initializes the real trusted entity, which means that any initialization by level 1 has to be fully emulated.

Secure VMs and VCPUs are also created via API calls. As a response, the requester receives some form of handle, which acts as a secure resource ID. As level 0 KVM manages all secure resources, the create request is heavily modified to suit level 0's point of view before being reissued to the real trusted entity. If the returned secure entity handle is a simple ID, it can be passed through to level 1. This allows ABI calls that consist of an action plus a handle to be passed through, and it means level 0 doesn't need to map real handles to emulated handles.

The lifecycle calls are generally the simplest ABI calls to emulate. To mark a VCPU as being in the stopped state, for example, we only need its handle and the expected state. Such a call can be passed through to the real trusted entity without manipulation. Data requests like dumping are a bit harder: they require address translation and data bouncing in order to work properly.

Architecture compliance largely revolves around memory protection, because nesting allows a level 1 VM to try to access secure memory. If it tries, we need to report the correct address and exception type to level 1 for the OS's memory management. On some architectures, level 0 KVM donates memory to the trusted entity when creating VMs and VCPUs and on initialization. The donated memory can only be accessed by the trusted entity and is used to store management data. In order to emulate the donation, level 0 needs to take access rights away from level 1 for those pages and inject access exceptions on access. If the amount of memory to donate for certain ABI calls is specified by the trusted entity via an interface, donation could be disabled entirely, but then the level 0 ABI emulation would have no way of regaining the memory that it itself donates to the real trusted entity. The donation also serves as a limiting factor when creating secure resources, as memory is effectively removed from the caller.
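Here is a small sketch of how the donation emulation could look: level 0 remembers which level 1 pages were donated as part of an emulated ABI call, and any later level 1 access to those pages results in an injected access exception. Again, all names, the flat list and the injection stub are invented for illustration; real code would operate on page tables and per-VM state.

/*
 * Sketch of emulating memory donation: remember which level 1 pages were
 * donated, refuse level 1 access to them, and inject an exception instead.
 * Names and types are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

#define MAX_DONATED 64
#define PAGE_SHIFT  12

static uint64_t donated_pfns[MAX_DONATED];
static int      nr_donated;

/* Level 1 donates a page as part of an ABI call it believes it is making. */
static void emulate_donation(uint64_t l1_addr)
{
    if (nr_donated < MAX_DONATED)
        donated_pfns[nr_donated++] = l1_addr >> PAGE_SHIFT;
    /* Real code would also drop the page from level 1's page tables. */
}

static bool is_donated(uint64_t l1_addr)
{
    for (int i = 0; i < nr_donated; i++)
        if (donated_pfns[i] == (l1_addr >> PAGE_SHIFT))
            return true;
    return false;
}

/* Called when level 1 touches memory; donated pages must fault. */
static void handle_l1_access(uint64_t l1_addr)
{
    if (is_donated(l1_addr))
        printf("inject access exception for 0x%llx\n",
               (unsigned long long)l1_addr);
    else
        printf("access to 0x%llx allowed\n",
               (unsigned long long)l1_addr);
}

int main(void)
{
    emulate_donation(0x5000);      /* part of an emulated create call */
    handle_l1_access(0x5000);      /* must be rejected */
    handle_l1_access(0x9000);      /* ordinary level 1 memory */
    return 0;
}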
Let's dive into some of the problems. Secure VMs generally have a pre-determined feature set. That feature set was chosen to minimize hypervisor interaction with a secure VM, and hence it often handles complex instructions without hypervisor assistance or takes care of automated IRQ injection. But not all of those features are compatible with nesting, especially the ones that require level 0 addresses in order to work. Those features will need to be disabled, or a firmware-based solution has to be found.

Page management performance is another problem. Due to the page integrity and mapping protection of secure VMs, all mapping changes require ABI calls. For nesting, each ABI call causes an exit into level 0, which costs performance. That cost is even worse for fault-driven secure VM memory management, where the hypervisor receives two faults: one indicating that a page isn't mapped into the guest, and one indicating that a mapped page is non-secure and therefore can't yet be accessed by the secure VM. There are two potential fixes to this problem, which could also provide a performance boost for secure VM memory handling in general: we can either pre-fault and pre-secure guest memory in one go at the start of the VM, or we can potentially merge the mapping and securing of a single page into one action. Both solutions, of course, have their drawbacks.

The next issue is migration. Having a non-secure VM that hosts secure VMs makes migration very challenging, as secure VMs need to be migrated via a special process and in cooperation with the trusted entity. It might be easier to migrate the secure VMs into a new host VM on the target rather than migrating both the old host VM and the secure VMs. For now, migration will likely be disabled once the trusted entity is initialized.

And last, we need to emulate a trusted entity teardown on a level 1 reboot. As we multiplex the real trusted entity, we can't use its teardown mechanisms, since they would also destroy the secure VMs created by all of the other level 1 guests. Therefore, every level 2 secure VM that was created by a rebooting level 1 hypervisor will need to be destroyed by level 0 via the trusted entity before the reboot finishes. This will likely affect the reboot time significantly. A lazy destroy approach might improve reboot times, but it would also add more complexity to the API emulation. If an architecture uses memory donation, there's an additional step: the access protection for every page of emulated donated storage will also need to be removed before rebooting.
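To illustrate the teardown, here is a small sketch of what level 0 could do on a level 1 reboot: walk the list of level 2 secure VMs that this level 1 guest created, destroy each one via the real trusted entity, and give back level 1 access to the emulated donated pages. The names are again made up for illustration; a real implementation would live in the level 0 ABI emulation and would also have to handle errors and a possible lazy destroy.

/*
 * Sketch of the reboot teardown: before a level 1 reboot completes, level 0
 * destroys every level 2 secure VM that this level 1 guest created and
 * returns the emulated donated pages.  All names are invented.
 */
#include <stdio.h>
#include <stdint.h>

struct l2_secure_vm {
    uint64_t handle;        /* handle from the real trusted entity */
    uint64_t donated_pfn;   /* emulated donation backing this VM */
};

/* Stand-ins for the real trusted entity call and the page table update. */
static void trusted_entity_destroy(uint64_t handle)
{
    printf("destroy secure VM with handle 0x%llx\n",
           (unsigned long long)handle);
}

static void restore_l1_access(uint64_t pfn)
{
    printf("restore level 1 access to pfn 0x%llx\n",
           (unsigned long long)pfn);
}

static void teardown_on_l1_reboot(struct l2_secure_vm *vms, int nr_vms)
{
    /*
     * We cannot use the trusted entity's own teardown, as that would also
     * destroy secure VMs belonging to other level 1 guests, so we destroy
     * this guest's secure VMs one by one.
     */
    for (int i = 0; i < nr_vms; i++) {
        trusted_entity_destroy(vms[i].handle);
        restore_l1_access(vms[i].donated_pfn);
    }
}

int main(void)
{
    struct l2_secure_vm vms[] = {
        { .handle = 0x1, .donated_pfn = 0x500 },
        { .handle = 0x2, .donated_pfn = 0x600 },
    };

    teardown_on_l1_reboot(vms, 2);
    return 0;
}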
The s390 team has created a proof of concept and has done some experiments. Linux and the KVM unit tests are running as nested secure guests, although we don't achieve full architecture compliance yet. The trusted entity emulation largely consists of error and access checks, so checking the emulated API for mistakes will be a lot of work. Luckily, the KVM unit tests already contain tests for the s390 trusted entity ABI. It's about 2000 lines of KVM code, 80% of which is ABI emulation, 10% memory management and 10% secure VCPU nesting. For QEMU, only limited changes are required.

There's an alternative that was introduced by our colleagues from research right before this talk. A VM-to-hypervisor interface is used to allow a primary VM to manage secondary VMs whose resources are grouped under the primary. It's faster because there's no nested paging, and it's largely user-space-based, which means successful attacks won't result in kernel privileges. But it's also less flexible and might require more code overall, since VM managers have to be adapted to support the new interface.

As we only have limited time for questions, and since I'm not in Dublin, I've tried to come up with potential questions and answers. First off, I haven't measured the performance impact yet; the state of the code simply doesn't allow it right now. Second, I have neither tested nor thought about more nesting levels. Performance generally doesn't get better with each new nesting level, so I don't see a use case. Next, I have to declare that my statement about migration being complicated is a theoretical one: s390 doesn't support migration for secure VMs right now, so I didn't have the opportunity to figure out a way to support it. And last, my current guess is that AMD SEV is the next best candidate for nesting, since the API isn't too big. I only had time to glance over x86, though, so take it with a grain of salt.

Thanks for listening. If you have further questions, you'll find me in the secure VM meeting rooms on the conference website or on IRC. See you next year.