My name is Tobin and today I'll be talking about secure fast live migration for encrypted VMs, mainly from an AMD SEV perspective. The work presented here was largely carried out at IBM Research with my colleague Dov Murick, with guidance from Hubertus Franke and James Bottomley, and in coordination with people at AMD such as Ashish, Brijesh, David and Tom.

First of all, a crash course on live migration. The basic idea is to move a VM from one machine to another without stopping the VM. This is mainly the hypervisor's job: converge the memory, making sure the memory on the source machine ends up the same as the memory on the target machine; coordinate the CPU state; and control the actual execution, starting the target at the right time and stopping the source at the right time. The hypervisor is mainly the one in charge.

Confidential computing is the other major theme of this presentation, and the idea here is to protect data while it is in use. We already know a fair amount about protecting data at rest: you can put it on some sort of encrypted disk, or you can put it under your pillow. We also know a fair amount about protecting data while it is in transit; there are protocols and procedures that can help you out with this. But how do you protect data while it is actually being operated on? There are a number of different ideas about this, and AMD SEV is one of them. Confidential computing is generally backed by some sort of hardware root of trust, and this is true of AMD SEV as well. With SEV, the VM is the enclave, meaning everything inside the VM is within the trust boundary, and crucially the hypervisor is outside that boundary and is not trusted. This should bring up some obvious questions about migrating when you can't trust the hypervisor.

To be specific, each of the three generations of SEV brings new challenges for migration. With plain SEV, what we get is guest memory encryption. The memory of a VM is encrypted and the key used to decrypt it is managed by the AMD secure processor. So if the guest memory is encrypted and the hypervisor cannot access the key to decrypt that memory, how is the hypervisor going to carry out one of its traditional roles in migration, which is to copy pages from the source to the target? Now, you might think that the hypervisor could just copy the ciphertext from one machine to another, but this doesn't work. For one thing, the key used to encrypt and decrypt the pages lives on only one machine; it is tracked by the AMD secure processor on that machine. So if you take ciphertext from one machine and put it on another, the key to decrypt it won't be there. Beyond that, the encryption and decryption algorithms are constructed so that they depend on the guest physical address the memory is located at. Even on a single machine, if you were to move a page from one GPA to another, you would not be able to decrypt it. So clearly we're going to need some other strategy to handle copying pages from source to target with AMD SEV.

SEV-ES brings new challenges. With SEV-ES we get protection of guest CPU state, but the hypervisor is supposed to take the CPU state from the source machine and set up the target with the same CPU state so the guest can resume running seamlessly. It's going to be hard to do that with SEV-ES; we'll talk about what you need to do to make that work. SEV-SNP brings new features which can help, but it also brings some new challenges.
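To make the address dependence concrete, here is a toy model in C. This is purely illustrative and is in no way AMD's actual memory-encryption algorithm (the toy_block_cipher below is not real cryptography at all); the point is only that the ciphertext is a function of both a per-VM key held by the secure processor and the address the block lives at, so ciphertext copied to another machine, or to another address on the same machine, no longer decrypts.

    #include <stdint.h>
    #include <string.h>

    /* Toy stand-in for a block cipher keyed with the per-VM key.
     * NOT real cryptography -- just enough to show the data flow. */
    static void toy_block_cipher(const uint8_t key[16], uint8_t block[16])
    {
        for (int i = 0; i < 16; i++)
            block[i] = (uint8_t)(block[i] ^ key[i] ^ (uint8_t)(i * 37));
    }

    /* Derive a 16-byte tweak from the address the block is stored at. */
    static void addr_tweak(uint64_t addr, uint8_t tweak[16])
    {
        memset(tweak, 0, 16);
        memcpy(tweak, &addr, sizeof(addr));
    }

    /* Ciphertext depends on the VM key *and* the address, so copying the
     * ciphertext to a different address or host produces garbage on read. */
    void encrypt_block(const uint8_t vm_key[16], uint64_t addr,
                       const uint8_t pt[16], uint8_t ct[16])
    {
        uint8_t t[16];
        addr_tweak(addr, t);
        for (int i = 0; i < 16; i++)
            ct[i] = pt[i] ^ t[i];       /* whiten with the address tweak */
        toy_block_cipher(vm_key, ct);   /* encrypt with the per-VM key   */
    }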
In particular, SNP is the first version of SEV that guarantees integrity for memory. This doesn't exactly break migration, but it puts new constraints on it: if you have integrity protection for memory, perhaps you should also have integrity protection for pages while they're in flight, as they're being copied over. So there are new challenges surrounding integrity with SNP.

Returning to the first iteration of SEV, the problem we have is moving guest memory pages from the source to the target. The AMD secure processor manages the keys needed to read and write memory. The memory controller does those reads and writes inline, and the secure processor makes sure the key is accessible to the memory controller at the right time. The hypervisor never gets that access; it won't be able to read the memory directly. So it needs help from something else to be able to export pages.

The AMD secure processor actually has functionality to do this: it can wrap pages with a transport key so that the hypervisor can send them off to a target. This was talked about at KVM Forum in 2017 and 2019 in presentations from AMD. The issue is that the throughput of the AMD secure processor when it comes to wrapping memory pages is low. It's not enough to push the entire memory of a guest through this mechanism; that would not be an efficient live migration. So we need some kind of additional support. It basically needs to do the same thing, export pages wrapped with a transport key and import pages wrapped with a transport key, but it needs to do this from inside the guest context. So where exactly is the best place to put this thing?

The approach we've been focusing on is to put migration support in the firmware: a migration handler in firmware. There are a number of reasons for this. First, putting a migration handler in firmware means it can easily be measured at boot. Rather than the migration handler being some opaque binary provided by the cloud service provider, which is difficult to measure and difficult to evaluate, the migration handler can be part of the firmware, which can be open source, easily distributed, and easily inspected by a guest owner. It also means there's somewhat less of a dependency on the operating system: if the migration handler is in firmware, you may be able to migrate earlier in boot, or when the guest is not responsive.

There are a couple of different approaches we've looked at for putting this migration handler in the firmware. The question really is, where do you get the VCPU to run the migration handler? Our first approach was to simply add an extra VCPU to the guest but hide it from the operating system by manipulating the ACPI tables. OVMF, our firmware, would start up normally and would start the migration handler on that extra VCPU using the MP services, but the OS wouldn't know anything about it, so the migration handler could run along in the background while the OS booted and the guest was running. There were some issues with this proposal, and we've retooled it to use a mirror VM. The idea here is that there's a secondary VM that shares the memory and encryption context, that is, the ASID, of the primary VM, so it has the same view of memory essentially, and then we warm boot one VCPU in this mirror VM directly into the migration handler.
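To make the division of labor concrete, here is a minimal sketch, in C, of the export side of such an in-guest migration handler. The request/response layout, the shared-page handshake, and the transport_wrap helper are hypothetical, not the actual OVMF patches; the point is the shape of the loop: the handler reads the guest physical address the hypervisor asks for through its encrypted mapping, wraps the page with the transport key, and hands the ciphertext back through shared memory.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Hypothetical request/response layout living in shared (unencrypted)
     * pages that both the hypervisor and the migration handler can see. */
    struct mh_request {
        volatile uint32_t pending;       /* set by hypervisor, cleared by handler */
        uint64_t gpa;                    /* guest physical address to export      */
    };

    struct mh_response {
        uint8_t wrapped_page[PAGE_SIZE]; /* page encrypted under the transport key */
        uint8_t tag[16];                 /* authentication tag                     */
        volatile uint32_t ready;
    };

    /* Assumed helper (hypothetical signature): AEAD-wrap a page with the
     * transport key provisioned by the guest owner, binding the GPA so the
     * target imports it to the right place. */
    extern void transport_wrap(const uint8_t in[PAGE_SIZE], uint64_t gpa,
                               uint8_t out[PAGE_SIZE], uint8_t tag[16]);

    /* Thanks to the C-bit identity map described later, a GPA can be
     * dereferenced directly and the memory controller decrypts it on read. */
    static void read_encrypted_page(uint64_t gpa, uint8_t out[PAGE_SIZE])
    {
        memcpy(out, (const void *)(uintptr_t)gpa, PAGE_SIZE);
    }

    void migration_handler_export_loop(struct mh_request *req,
                                       struct mh_response *resp)
    {
        uint8_t plaintext[PAGE_SIZE];

        for (;;) {
            while (!req->pending)
                ;                                    /* wait for the hypervisor */

            read_encrypted_page(req->gpa, plaintext);
            transport_wrap(plaintext, req->gpa,
                           resp->wrapped_page, resp->tag);

            resp->ready = 1;                         /* hypervisor copies this off */
            req->pending = 0;
        }
    }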
So like I said, the primary VM and the mirror VM have the exact same view of memory, and from the perspective of the AMD secure processor they actually are the same thing. The tweak is from the perspective of the hypervisor, where these are separate VMs and where you start them from separate places. Where the main VM starts from the normal reset vector, we essentially have our own reset vector, the MH entry point, for the migration handler. The address of this migration handler entry point, which is where the hypervisor will set the instruction pointer of the VCPU in the mirror, is discoverable by parsing the firmware. So prior to starting a guest, the hypervisor can look at the firmware, figure out where the migration handler entry point is, and set one of the VCPUs in the mirror to boot from that entry point. The entry point, like I said, is a lot like a reset vector, so it will trampoline up to the migration handler. Once you get to the migration handler, things look a fair amount like a normal DXE runtime driver. Since the mirror VM and the main VM have the same view of memory, the migration handler in the mirror can use any OVMF services or libraries that a normal DXE runtime driver running in the main VM would be able to use. This is an added benefit of this firmware-based approach: you don't have to have an entirely self-contained migration handler, you can use functionality already provided by the firmware.

Now, we do need a special memory mapping in the migration handler and the mirror VM. The idea is to have an identity map, mapping addresses to themselves, but with the C-bit set. The C-bit is a construct in SEV that tells the memory controller whether or not a page is encrypted: when you go to read a page, the memory controller will decrypt it if the C-bit is set, and normally this is controlled by the page tables. The migration handler will get guest physical addresses from the hypervisor; the hypervisor will ask it to read a guest physical address, and it will read that and export the page. Every page the migration handler is asked to read will be encrypted, so every time it does such a read, it wants the C-bit to be set. Hence we have this identity map where the C-bit is always set. That said, the migration handler also occasionally needs a few shared pages, so we also map some shared pages at an offset.

Turning from the technical to the theoretical, let's take a minute to make sure we haven't broken the trust model. There are a couple of ways people might be a bit nervous about what's going on here. First, we have the hypervisor triggering execution inside the enclave. Fortunately, the migration handler, being part of the firmware, will be part of the launch measurement provided to the guest owner at boot, so the guest owner can be pretty sure about what code is inside their guest. And the API for the migration handler is well-defined and small: all the migration handler does is take a guest physical address and return an encrypted copy of the page located at that address. So the execution inside the enclave that's triggered by the hypervisor is, we think, sufficiently limited. Now, some people might have a sort of inverse concern, which is that QEMU is depending on guest execution: the hypervisor, in its normal migration path, is diverting to untrusted code, essentially, inside of the guest.
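As a rough sketch of what that mapping might look like, here is some illustrative C that builds a 1 GiB-page identity map with the encryption bit set, plus an alias at an offset without it for shared pages. The C-bit position is really discovered via CPUID 0x8000001F; the use of 1 GiB pages and the particular shared-page slot are assumptions for brevity, and this is not the actual OVMF code.

    #include <stdint.h>

    #define PTE_PRESENT   (1ULL << 0)
    #define PTE_WRITABLE  (1ULL << 1)
    #define PTE_HUGE      (1ULL << 7)     /* 1 GiB page at this level */

    /* Fill one page-directory-pointer table (512 entries) so that
     *   VA x                 -> GPA x  with the C-bit set (encrypted reads)
     *   VA x + shared offset -> GPA x  without it         (shared pages)
     * c_bit is the encryption bit position reported by CPUID 0x8000001F;
     * shared_slot is a hypothetical slot chosen by the firmware. */
    void build_mh_map(uint64_t pdpt[512], unsigned c_bit,
                      unsigned gib_to_map, unsigned shared_slot)
    {
        uint64_t flags = PTE_HUGE | PTE_WRITABLE | PTE_PRESENT;

        for (unsigned i = 0; i < gib_to_map; i++) {
            uint64_t pa = (uint64_t)i << 30;

            /* Encrypted identity mapping: every page the hypervisor asks
             * the migration handler to export is read through these. */
            pdpt[i] = pa | (1ULL << c_bit) | flags;

            /* Unencrypted alias at an offset, for the few shared pages used
             * to talk to the hypervisor. */
            pdpt[shared_slot + i] = pa | flags;
        }
    }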
And QEMU here can't really verify the execution of the migration handler, because the launch measurement is provided to the guest owner, not to the cloud service provider. What it comes down to, again, is that the API is small and well-defined. QEMU will be getting a page that's encrypted to the transport key; it takes that page and sends it off to the target, where the migration handler on the target will load it back into memory. QEMU should not be operating on this memory directly; it shouldn't have its own execution be conditional on this page or anything like that. At most, we think, manipulating this page can result in a crash on the target, in other words, a way for the guest to screw up the migration of its own VM, but probably not a way to significantly manipulate control flow in QEMU or any other part of the hypervisor.

One place where our approach diverges slightly from AMD's original migration proposal is that we have the guest owner verify both the source and the target machines directly. In AMD's original proposal, the source reaches out to the target so that it can verify the target of the migration. Here, we have the guest owner check the launch measurements of both the source and the target and only provision the transport key, which will be used for encrypting the pages while they're in flight, if both measurements check out. We don't have time to go into all of the details of the key management, but this is a crucial part of getting the trust model to work.

The final thing to note is that the mirror boot process I've described is completely compatible with SEV-ES. The VMSA of the mirror can simply start with the instruction pointer at the migration handler entry point, and the mirror can start like that. It can be measured. That all works perfectly with SEV-ES.

While the mirror boot is compatible with SEV-ES, ES does introduce some challenges to migration more generally. With SEV-ES, the AMD secure processor saves the CPU state of the guest into encrypted memory at each VM exit. If some of that CPU state is needed for the hypervisor to process one of the VM exits, then that state has to be placed into a special buffer by a handler inside the guest. This means that when the hypervisor is operating, it won't have direct access to the CPU state of the guest. That's a bit of a problem for migration, because the hypervisor is supposed to take the CPU state from the source and put it into the target. Now, you do get one chance to set the initial state of an SEV-ES guest: you get a launch measurement that includes the initial state of the VCPUs. There are a couple of tricky things here, however. For one, we still don't know how to get the CPU state from the source. But perhaps a bigger problem is that we can't use this initial CPU state on the target either, because the target is already executing. What is it executing? Well, it's executing the migration handler. We have to start the target before we know what the CPU state of the source is, so that the migration handler on the target can start importing pages coming from the source. Since the source is still running, there's no way to know what we should set the CPU state of the target to be. And even if we did know, it would conflict with the CPU state we need to set to get the migration handler to run. Essentially, we can't use our first pass at setting the CPU state.
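As a rough illustration of that provisioning step, here is a sketch in C of the guest owner's decision logic. The helper functions and types are hypothetical stand-ins for the real launch-measurement and secret-injection machinery, not an actual SEV API; the point is simply that the transport key is only released if both the source and the target launch measurements check out.

    #include <stdbool.h>
    #include <stdint.h>

    struct measurement { uint8_t digest[32]; };   /* size illustrative */

    /* Assumed helpers (hypothetical names) wrapping the real attestation
     * and secret-injection flows. */
    extern bool fetch_launch_measurement(const char *host, struct measurement *m);
    extern bool measurement_is_expected(const struct measurement *m);
    extern bool inject_transport_key(const char *host, const uint8_t key[32]);

    /* Guest owner: provision the transport key only if BOTH ends check out. */
    bool provision_transport_key(const char *source_host, const char *target_host,
                                 const uint8_t transport_key[32])
    {
        struct measurement src, dst;

        if (!fetch_launch_measurement(source_host, &src) ||
            !fetch_launch_measurement(target_host, &dst))
            return false;

        if (!measurement_is_expected(&src) || !measurement_is_expected(&dst))
            return false;                        /* refuse to release the key */

        /* Both measurements verified: give each migration handler the key
         * used to wrap pages in flight. */
        return inject_transport_key(source_host, transport_key) &&
               inject_transport_key(target_host, transport_key);
    }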
So how can we set the CPU state of this guest which is already running? Well, one option is a trampoline-based approach. We have investigated this trampoline approach fairly extensively. The basics are that you map the VMSA into the guest. This means that the encrypted area where the PSP saves the guest's CPU state will actually be inside of the guest, and it will be migrated along with the rest of the guest's memory. So the target, once memory has converged, will have a snapshot of the source CPU state in memory. The question then is how you get from this snapshot of CPU state in memory to actually resuming the guest. Essentially, you need some way to set the CPU state of that running guest to the CPU state snapshot that is in memory. Unfortunately, you can't really do this atomically, so you need to set each register individually via a very delicate trampoline. We've shown that this is possible, but it's tricky; it has some challenges. One thing to think about is that the CPU state of every VCPU is different, so every VCPU will probably need its own small migration handler that reads from this VMSA inside the guest and sets that VCPU to the appropriate CPU state.

In light of some of the complications with the trampoline, some people have suggested alternatives. For instance, some have wondered whether you might use suspend and resume: suspend the guest before migrating and resume it after you've migrated, thus essentially storing the CPU state in memory. In fact, our trampoline approach is based on the suspend and resume trampoline in the kernel. One of the questions, though, is whether this would really count as a live migration if you have to suspend the guest before you can migrate it. One thing to note is that for SEV-SNP there will be a new interface allowing for RMP adjustment. This will let you designate certain pages as VMSAs and then essentially atomically resume a VCPU from such a VMSA. This will greatly simplify this step.

One thing people have pointed out is that there's no integrity protection for the pages where the VMSA will be stored. Prior to SNP there's no integrity protection for any page, but it could be particularly sensitive for the page that stores CPU state. There's potentially an attack where an older version of that page, an older version of the VMSA, is replayed just before or during migration, and the target then resumes to an older state of execution.

SEV-SNP brings with it some new features but also some changes to the trust model. Like I said, SNP gives us this RMP adjust feature that could greatly simplify the trampoline. SNP also has integrity protection that would eliminate the replay attack mentioned on the previous slide. Now, when you have integrity protection via SNP, that also changes our goals for migration somewhat, because now we also need to be aware of replay attacks that happen during migration. We need to make sure that a person in the middle can't drop or replay pages as they go from the source to the target, and we include the hypervisor as a person in the middle here. Let me give you a concrete example. Let's say the source machine is running. We're doing a live migration, but the source machine is still running, so pages are being copied from source memory over to the target memory. The pages are encrypted, the hypervisor can't read them; it's just going to send the pages along.
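To give a flavor of why the trampoline is delicate, here is a heavily simplified sketch in C of the per-VCPU restore step. The structure below is a stand-in for a few of the fields the PSP saves in the VMSA (the real layout is fixed by the hardware and much larger), restore_gprs_and_jump is a hypothetical assembly stub, and the handling of control registers other than CR3, MSRs, segments, FPU state and interrupts is omitted entirely; the point is only the ordering constraint that makes this hard to do without SNP's RMP adjust.

    #include <stdint.h>

    /* Illustrative stand-in for a few fields of the saved VMSA snapshot
     * that was migrated along with guest memory (not the real layout). */
    struct saved_vcpu_state {
        uint64_t cr0, cr3, cr4, efer;
        uint64_t rip, rsp, rflags;
        uint64_t gprs[16];              /* rax..r15 */
    };

    /* Switch to the migrated guest page tables. The trampoline code and the
     * snapshot itself must be mapped at the same addresses in both the old
     * and the new page tables, or execution falls off a cliff right here. */
    static inline void write_cr3(uint64_t val)
    {
        __asm__ volatile("mov %0, %%cr3" :: "r"(val) : "memory");
    }

    /* Hypothetical per-VCPU assembly stub: loads RSP, RFLAGS and the general
     * purpose registers from the snapshot and jumps to the saved RIP. Once
     * a register has been restored it can no longer be used as scratch,
     * which is why this final step cannot stay in C. */
    extern void restore_gprs_and_jump(const struct saved_vcpu_state *s)
        __attribute__((noreturn));

    /* Per-VCPU trampoline: restore state that still allows C code to run
     * first, then hand off to the stub for the final, non-atomic step. */
    void __attribute__((noreturn))
    trampoline_resume(const struct saved_vcpu_state *s)
    {
        write_cr3(s->cr3);              /* now running on migrated mappings  */
        restore_gprs_and_jump(s);       /* no C code can run after this line */
    }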
Now it copies a page, and the workload may or may not touch that page after it has been copied. If it does, then the page will have to be recopied. Let's say that it will be recopied: the page has already been migrated once, it's dirtied, and it needs to be sent over again. Well, the hypervisor is the one who checks the bitmap of dirty pages and says, okay, I need to resend this page. But if we don't trust the hypervisor, how can we depend on the hypervisor to do that? Here we get a little bit of a window where an older page could essentially be replayed, because the newer page would just never be copied over, and we would have what is probably a valid arrangement of memory on the target, but not the correct one, not the one that represents the most recent changes on the source. Now, with SEV and SEV-ES this was an issue, but we weren't concerned about it for migration, because this type of replay attack was possible at any point: the hypervisor could simply replay an older version of a page on top of a newer one whenever it wanted. With SNP, the hypervisor cannot do that, and so during migration we need to make sure it can't do that either. So we're going to need some method of extending migration with verification of exactly which pages are sent, to make sure the hypervisor can't make any malicious choices like this. This is still a bit of an open question, but we have been talking about it.

There are a couple of other migration-specific features in SNP. For instance, SNP has support for a migration agent, which is essentially another VM that can make more complex policy decisions about when, and to where, it's okay to migrate a VM. There's also a feature called the initial migration image. This allows you to designate part of guest memory as an initial migration image: essentially what the guest would boot with. This would be measured, and then the rest of memory would be imported via migration later.

Clearly we already have some open questions at every level of migration of encrypted VMs, but there are a couple more I thought I'd single out here. For one thing, we haven't talked about post-copy at all. Mainly we've been leaving that aside as a developmental simplification. At a high level, post-copy should be compatible with the migration handler approach, because the migration handlers on the source and the target can simply keep running and respond to post-copy requests. Implementing it might be a little more tricky. Another thing to think about is parallelism for the migration handler. Hypothetically, the migration handler could run on more than one VCPU in the guest; if we have a huge VM with a lot of memory, we might want multiple migration handler VCPUs so we can migrate with greater bandwidth. Another question is whether there's any way to generalize confidential migration support and provide one API, one interface, that works across platforms. Some of these are engineering questions, and some are even higher-level questions than that, but certainly there's a lot to think about going forward with doing fast migration of encrypted VMs.

If you have any questions now or later, please reach out. If you'd like to see code, we have patches available on the EDK2 and QEMU lists for various versions of live migration. Thank you and enjoy the rest of the KVM Forum.