Welcome to the common challenges of secure VMs. My name is Janosch Frank, I'm the KVM and kvm-unit-tests s390 co-maintainer, and I've implemented and tested large parts of the secure VM support on s390. A short disclaimer upfront: I've tried reading as many of the architecture documents as possible, but there's a limited amount I can take in and put into a presentation, so this is merely an overview and not a deep dive.

With that out of the way, let's start with a look at what secure VMs are, why they are needed, and who introduced secure VM support to their platforms. Then we'll continue with the challenges that we need to clear to protect and run a secure VM. Next, we'll take a short look at the current collaborative development efforts, and lastly we'll look at the future to determine what features might come up in the next few years. In the following slides you'll see badges like these. They indicate which vendor chose to implement the technique or feature we're currently discussing. For AMD, there might be an ES or SNP in there, which means that it was introduced in one of the SEV extensions AMD has presented in the last few years.

A secure VM is a VM whose sensitive state is not accessible to the operating system or hypervisor. Instead, a trusted entity manages the sensitive VM data, and the hypervisor cooperates with the trusted entity to run the secure VM. Such a VM is protected against host-to-VM attacks as well as VM-to-VM attacks, and that protection allows users to confidently deploy sensitive workloads into a public cloud.

What's the sensitive state? The biggest part is certainly the memory contents, as they can contain disk encryption keys and large amounts of data the VM is currently working on. Registers are part of the sensitive state as they can contain sensitive data and key fragments. Also sensitive are the VM controls, as they determine which instructions are executed and how.
As interrupt controls can change the execution flow, they also count as security-sensitive. By protecting these things, we want to combat data leakage and manipulation, as well as manipulation of the execution flow.

There are three basic building blocks we can use to protect the sensitive state. Encrypting or hiding data, which makes an attacker read encrypted state that is useless without the key. Access control, which restricts access to the sensitive VM state. And lastly, integrity verification, which ensures that the VM only ever accesses state that hasn't been altered by an untrusted entity. Often, only a combination will give you full protection.

The last question that remains is: who is in on it? Well, basically almost everyone. AMD were the first to introduce secure VM technology with SEV, Secure Encrypted Virtualization, which went into the kernel in 4.15. Last year they presented the SEV extensions: Encrypted State (ES), which provides register confidentiality, and Secure Nested Paging (SNP), which adds integrity protection. IBM introduced Secure Execution on Power and on Z, in kernels 5.4 and 5.7 respectively. Both use the same name, but the technology differs, as the platforms are quite different. Intel recently announced TDX, and Sean will give a presentation about it on Friday. And I think some architectures might still be added to that list in the future.

So, as a summary of the recap: secure VMs are protected against accesses by the VMM and other VMs, and they are managed by a trusted entity. Most major architectures have secure VM technology, and there are three basic building blocks that provide the security. That means a lot of the challenges are shared between the architectures.

Let's begin by having a look at how the runtime protection can be achieved and which challenges the chosen techniques cause. First of all, runtime protection can be divided into memory protection and state protection, which covers the CPU and interrupt state.
To provide basic memory confidentiality, the VM's memory needs to be unreadable for everyone but the VM itself and the trusted entity. With write protection, we can additionally ensure that the integrity of the VM's data is not compromised, as long as the memory is never swapped, for which the VMM will need read and write access of some kind. Once swapping is involved, we need extended integrity verification which handles swap-out and swap-in. These challenges correspond one-to-one to our basic building blocks: we can make memory unreadable via encryption, unwritable via access protection, and add a layer of additional integrity verification for swap and page table management. For best results, all techniques should be combined, so we have read, write, and integrity protection when running the VM, as well as encryption against cold boot and other hardware attacks.

For encryption, the VM's memory is encrypted by the CPU's memory controller. Each VM has its own key; the key is kept in hardware and never leaves it. Reads and writes with the wrong key will just produce random data. The advantage is that the memory is always accessible to the VMM and to IO devices: when untrusted entities read or write the VM's memory, there are no memory protection exceptions that can interrupt IO transfers, the transferred data will simply be unreadable. Additionally, encryption protects against cold boot attacks if the key is kept in hardware and not reused on boot. But handling keys in hardware also means that only a limited number of keys are supported due to space limitations, which limits the overall number of secure VMs. And although writes to the VM's memory are encrypted, the memory is still technically writable, so the VMM can corrupt guest memory. Moreover, the VMM still manages the host-to-guest translation page tables and can remap pages.

When using access protection, reads and writes from outside a secure VM will result in an exception.
This provides integrity protection if the secure memory never leaves the protected state. Also, rogue accesses can be traced and blocked via their exceptions. The main problem is that IO can be cancelled by access exceptions, and fixing up the access and re-rolling the IO is very, very hard. As the CPU needs to know which page belongs to which secure guest in order to deny access to everyone else, some storage is needed for that tracking, and it will be unavailable to the OS. Sometimes the tracking is even more granular, for example on a cache-line basis, which needs even more storage. Additionally, we don't get cold boot attack protection: the firmware can clear memory on reboot, but it doesn't know when memory will be physically removed, so it can't clear it beforehand.

Integrity protection is needed in three scenarios: when the VMM can manipulate a secure VM page, when the VMM can manipulate the VM's memory mappings, and when memory can be swapped. We can solve the first scenario with access protection. The other two can be handled by making the trusted entity safeguard swap and page table management; the VMM will then need to cooperate with the trusted entity for those actions.

On the VM state protection side of things, we need to make sure that registers are at least unreadable. At best, the VMM can only read or write a specific register if that is necessary for emulating an instruction, and the trusted entity hides or encrypts all other registers. This is mostly done by providing a dummy VM control block to the VMM, while the trusted entity manages the real VM control block instead. If interrupts are injected at the wrong time, attackers could significantly alter the instruction flow to their liking. Therefore, the trusted entity needs to safeguard interrupt injection and only allow it if the vCPU is eligible for the interrupt. Once again, a dummy control block can be used: the VMM's interrupt requests are staged there and copied over only if they are valid.
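To make the dummy-control-block idea concrete, here is a minimal sketch in Python. Everything here is invented for illustration (`DummyControlBlock`, `trusted_entity_inject`, the string interrupt types): real implementations gate architecture-specific control block fields in firmware or hardware, not Python objects.

```python
from dataclasses import dataclass, field

@dataclass
class DummyControlBlock:
    """The control block the untrusted VMM is allowed to write to."""
    pending_irqs: list = field(default_factory=list)

@dataclass
class RealControlBlock:
    """The control block only the trusted entity can touch."""
    interrupts_enabled: bool = True
    enabled_irq_types: set = field(default_factory=set)
    pending_irqs: list = field(default_factory=list)

def trusted_entity_inject(dummy: DummyControlBlock, real: RealControlBlock) -> list:
    """Copy the VMM's interrupt requests over, but only those the vCPU
    is currently eligible for; everything else is silently dropped."""
    injected = []
    for irq in dummy.pending_irqs:
        if real.interrupts_enabled and irq in real.enabled_irq_types:
            real.pending_irqs.append(irq)
            injected.append(irq)
    dummy.pending_irqs.clear()  # requests are consumed either way
    return injected

dummy = DummyControlBlock(pending_irqs=["timer", "restart"])
real = RealControlBlock(enabled_irq_types={"timer"})
assert trusted_entity_inject(dummy, real) == ["timer"]  # "restart" was filtered out
```

The key point of the design is that the VMM never gets a chance to write the real control block directly; it can only express a wish, which the trusted entity validates against the vCPU's actual state.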
Booting secure VMs is another interesting scenario, as we need to make sure that only customer-approved executables are booted. There are two quite different approaches to that problem: remote attestation and encrypted boot data.

When using attestation, a trusted entity authenticates its hardware and the VM software to a remote host. The advantage is that customers can roll their own attestation environment and quickly change the authorization values for VMs, firmware configurations and machines. On the other hand, attestation is complex to implement on both sides of the equation and is therefore a great attack vector into the secure VM or the trusted entity. Also, attestation requires a connection to the customer's attestation service, which needs to be reachable from the VM host.

With boot data encryption, the executable is either fully encrypted, or only the sensitive parts are encrypted and the others are measured. On Secure Execution, an executable header holds key slots for the machines the VM is allowed to run on. The key in the slot is used by the trusted entity to decrypt various things, and ultimately the trusted entity decrypts the executable into protected memory. With encryption, we have the advantage of not needing to connect to a remote machine, which saves us some complexity in the trusted entity. Also, if the executable is fully encrypted, the VMM doesn't even know what's being run inside the VM, which makes attacking the VM a bit harder. The drawback of not having an attestation environment is that updating the executable is painful and needs to be done either by the customer or with the help of the trusted entity. Additionally, boot loaders like GRUB are sometimes run inside the secure VM and can create complications. And lastly, we have the classic problem of distributing and verifying public/private key pairs.

All of the solutions I spoke about need extensive tooling.
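Before we get to the tooling, the key-slot scheme described above can be sketched with a toy model. This is only an illustration with invented names and a symmetric XOR "wrapping"; the real Secure Execution header wraps the image key with each target host's public key using proper asymmetric cryptography, and its format is defined by the architecture.

```python
import hashlib

def wrap(host_key: bytes, vm_key: bytes) -> bytes:
    """Toy key wrapping: XOR the image key with a pad derived from the
    host key. Real headers wrap with each host's *public* key instead."""
    pad = hashlib.sha256(host_key).digest()[: len(vm_key)]
    return bytes(a ^ b for a, b in zip(vm_key, pad))

unwrap = wrap  # XOR wrapping is its own inverse

# One key slot per machine the encrypted image is allowed to run on.
vm_key = b"image-master-key"
slots = {
    "host-a": wrap(b"host-a-hw-secret", vm_key),
    "host-b": wrap(b"host-b-hw-secret", vm_key),
}

# The trusted entity on host-a recovers the image key from its slot ...
assert unwrap(b"host-a-hw-secret", slots["host-a"]) == vm_key
# ... while a machine without the matching hardware key gets only garbage.
assert unwrap(b"host-c-hw-secret", slots["host-a"]) != vm_key
```

The point of the slot list is that the customer, not the cloud provider, decides at image-build time exactly which machines are ever able to decrypt and boot the image.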
AMD has the SEV tool, IBM Z has genprotimg, IBM Power will have a conversion and preparation tool, and I expect Intel to also release some tooling. It would be really great if we could combine our efforts on such tools, if the differences between the vendors can be kept small.

Now that we have explored how we can securely run a VM, let's look at how we can do IO and swap while maintaining that protection. IO accesses to an encrypted or protected page will result in garbled data or access exceptions. Therefore, we need a way to unprotect some special IO pages. To be secure, this has to be done via a guest request. The guest will then bounce-buffer IO data through those pages. Of course, special handling for IO means lower IO performance, but I'm sure we can work on reducing the performance impact significantly in the future.

On swap-out, the memory needs to be made readable in an encrypted form so it can be written to a device. On swap-in, it gets protected, integrity-checked and decrypted before the guest can access it again. The VMM requests swap-out and swap-in from the trusted entity, which safeguards the whole process. As with the IO before, we take a performance hit when swapping, so we should consider not using secure VMs in a memory-overcommitted hosting environment.

Let's continue by having a look at the current development efforts. The QEMU host trust limitation patches from David Gibson aim to unify the configuration of secure VM options to make configuring secure VMs easier. Unfortunately, it proved quite difficult to find a common solution, as the already existing options as well as the architectures of the secure VM implementations differ quite a bit. Also, I'd guess that the developers and maintainers are currently occupied with secure VM extensions, or, in Intel's case, with implementing secure VM support at all.
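The swap-out/swap-in flow described above can be sketched as follows. This is a toy model with invented names: the "cipher" is a SHA-256 keystream stand-in, and real implementations use a proper cipher with per-page tweaks and also bind the integrity tag to the guest address to prevent replay and relocation attacks.

```python
import hashlib
import hmac

def _pad(key: bytes, n: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode) standing in for a real cipher."""
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(4, "little")).digest()
        ctr += 1
    return out[:n]

def swap_out(vm_key: bytes, page: bytes) -> tuple:
    """Export a guest page: encrypt it, then MAC the ciphertext.
    The VMM may now write (ciphertext, tag) to the swap device."""
    ct = bytes(a ^ b for a, b in zip(page, _pad(vm_key, len(page))))
    tag = hmac.new(vm_key, ct, hashlib.sha256).digest()
    return ct, tag

def swap_in(vm_key: bytes, ct: bytes, tag: bytes) -> bytes:
    """Re-import a page: verify integrity first, only then decrypt."""
    if not hmac.compare_digest(tag, hmac.new(vm_key, ct, hashlib.sha256).digest()):
        raise ValueError("integrity check failed, refusing to import page")
    return bytes(a ^ b for a, b in zip(ct, _pad(vm_key, len(ct))))

key, page = b"per-vm-key", b"guest page contents, e.g. a disk encryption key"
ct, tag = swap_out(key, page)
assert swap_in(key, ct, tag) == page  # round trip works; tampering raises
```

Note the ordering: the tag is checked before anything is decrypted and handed back to the guest, so a VMM that flips bits on the swap device can at worst cause a refused import, never a silently corrupted guest page.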
The second common focus for development is the virtio/IOMMU bounce buffer encryption and protection hooks, which allow the guest to do IO via unprotected guest pages. AMD SEV was the first to hook into that API, so the naming of the hooks has been rather SEV-specific. I expect that we will see some changes to that code, especially regarding performance, in the near future. Lastly, s390 already hooks into the memory management code of the kernel to pin IO pages, and I expect other platforms to need similar hooks for swap or other memory management related actions. But as common code might need changes, the resulting discussions could be lengthy and inclusion into the kernel will take some time.

Recently, Vipin Sharma posted an RFC to add SEV ASID (address space ID) limiting to cgroups. The problem addressed by this limit is the hardware resource limit described before: as the trusted entity can only run a limited number of secure VMs due to hardware restrictions, a cgroup limit is needed to manage the number of secure VMs. The other architectures quickly expressed interest in the patches, asking for a common approach to the problem.

One of my main concerns lately has been testing. On platforms with boot data encryption, we need the public key of the host the secure tests should run on, and then we can simply encrypt the tests. On platforms with attestation, we need an attestation environment to run secure tests, which complicates testing immensely. In general, testing is a very important topic for secure VMs, as there are lots of changes to the kernel, KVM, QEMU and libvirt. There are new kernel compile options and command line arguments, ioctls, new instructions and interfaces to the trusted entity, all of which need testing. And upcoming features will only increase the amount and complexity of new code.
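As an aside, the cgroup-style ASID limit mentioned above boils down to a charge/uncharge counter over a fixed hardware pool. The sketch below is illustrative only; the class name and interface are invented, and the actual RFC implements this as a kernel cgroup controller, not userspace code.

```python
class AsidController:
    """Toy cgroup-style controller for a fixed hardware pool of SEV ASIDs."""

    def __init__(self, limit: int):
        self.limit = limit  # how many ASIDs this group may consume
        self.used = 0

    def try_charge(self) -> bool:
        """Charge one ASID when a secure VM starts; fail if the pool is exhausted."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

    def uncharge(self) -> None:
        """Return an ASID when a secure VM shuts down."""
        assert self.used > 0, "uncharge without matching charge"
        self.used -= 1

group = AsidController(limit=2)
assert group.try_charge() and group.try_charge()  # two secure VMs start
assert not group.try_charge()                     # a third is refused
group.uncharge()                                  # one VM shuts down
assert group.try_charge()                         # now there is room again
```

The value of putting this behind cgroups rather than a global counter is that an administrator can partition the scarce hardware pool between tenants instead of letting one tenant exhaust it for everyone.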
As a development community, we need to make sure to cover as much of the new code as possible to provide a stable base for secure VMs to stand on.

Lastly, let's try having a look at the future. Migration is one of the most interesting and complex things that will be added in the next few months and years. To maintain security while migrating, the migration data needs to be encrypted and integrity-checked. If the VM state is hidden, the trusted entity will need to make it accessible to the VMM in an encrypted form that can be read by the trusted entity of the destination. I expect backwards compatibility management to be one of the most challenging problems to solve. A lot of the compatibility management needs to be handled by the trusted entity, which moves large parts of the migration logic complexity from the VMM into the trusted entity. Additionally, migration policies will be needed to determine which hosts a secure VM can safely be migrated to. That decision has to be based on the CPU hardware and firmware revisions as well as other criteria. Both remote attestation and the header used with executable encryption will need new data fields to model those dependencies, which will increase the complexity of the trusted entity and of the support tools even more. AMD already presented their solution last year, and I expect the other vendors to follow suit, as you can't have VMs without also supporting migration. All in all, supporting migration for secure VMs will bring us even more new ioctls and interfaces to the trusted entity, which will mean a lot of work for developers and maintainers in the years to come.

Also, I expect to see secure IO devices, which will be bound to a specific secure VM and only answer that VM's IO requests. Most likely, the devices will also need to authenticate themselves to the VM, so the secure VM can be sure it's talking to a legitimate secure IO device that supports the kind of IO the VM needs.
When providing such devices, we will lose a lot of the mix-and-match flexibility we have with our virtual IO options. Maybe such devices can be divided into a number of separate virtual devices which are then bound to a VM, to decrease the number of hardware devices needed. Modeling the VM-to-device relation for secure VMs in libvirt and other VM managers could get increasingly interesting in the future. Let's see what the hardware vendors come up with and how many standards for the VM binding and authentication we will get.

After migration, the next big feature is certainly the ability to dump a VM from the outside. With kdump, we can already save a dump to an encrypted disk, but kdump only works if code can still be executed inside the VM and we can boot into kdump. If we can't run dumping code inside the VM, we need to ask the trusted entity for help. It will need to provide the VMM with encrypted access to the VM's memory and CPU state, so the VMM can write the dump to disk. The dump needs to be encrypted in a way that only the creator of the VM is able to decrypt it. Most likely, we will need new ioctls over which QEMU can request the dump data from KVM, which in turn will ask the trusted entity.

In recent years, side-channel attacks have led to significant problems, and we went through a lot of effort to introduce side-channel protections in our kernels. Of course, leaving such protections to the untrusted VMM is not an option for secure VMs. Things like enforced SMT disabling, disabling or scrubbing of debug and performance counters and registers, as well as cache flushing, will be enforced by the trusted entity in the future. The VM's requirements for such security features will likely need to be modeled in VM management software, and the protections will be enforced via remote attestation or options in the executable headers.
In the last few slides, we have learned that the basic building blocks of secure VMs are similar and that the differences lie in the specific implementations. Secure VM technology is getting more important each day, and its complexity will increase a lot with each extension that is released for an architecture. Fortunately, we still have some time to come together and discuss collaboration possibilities. Now, as most of the architectures have introduced their ideas and implementations of a secure VM, is the perfect moment for that.

Thank you for listening. If you still have questions after the upcoming Q&A session, feel free to reach out to me via my email addresses or on the #kvm IRC channel.