Hello, my name is Wei Wang, and I'm from the Intel Virtualization Technology team. In this presentation, I'm going to introduce the live migration support for TDX, which facilitates the deployment of TD guests in the cloud environment. The people listed here are also contributors to this project.

This presentation has four parts. In the first part, I'm going to do a background introduction. In the second part, I'm going to dive into the details of TDX live migration, which is followed by some initial POC results. And in the last part, I'm going to introduce the current status and our future plan.

Okay, let's get into the first part and have a look at some basics of TDX. With TDX support in Intel Sapphire Rapids and later CPUs, there is a new execution mode called SEAM mode, which is further split into SEAM root mode and SEAM non-root mode, much like VMX root mode and non-root mode. A piece of special software called the TDX module runs in SEAM root mode to manage the guest's private memory and states. With this TDX module, we are able to run a special kind of guest called a TD guest. A TD guest is different from a traditional VM guest in that it's isolated from KVM and QEMU; KVM and QEMU are removed from the TCB. The TD private memory is not accessible to KVM and QEMU, and the TD's vCPU states are also not accessible. There is usually some memory shared between the TD and the hypervisor, and this shared memory is accessible. For example, a guest driver may use bounce buffers, and those bounce buffers may be shared with QEMU for device emulation.

The TDX module needs to be kept as small as possible because it's part of the TCB. So KVM still manages the physical resources, and it assists the TDX module in virtualizing the TD via SEAMCALLs. For example, for the secure EPT, which is owned by the TDX module, KVM allocates pages and offers those pages to the TDX module to build the secure EPT for the guest TD.

Here are some special things I want to call out; these are usually the common questions people ask in the first place. The first one is about dirty page logging. We don't support PML in the first release, so we do write protection instead. As the secure EPT is owned by the TDX module, KVM cannot directly change the secure EPT entries to write-protect a page; instead, it needs to invoke a SEAMCALL to the TDX module to do the write protection. For the guest memory copy, as you know, QEMU doesn't have direct access to TD private pages, so QEMU needs to ask KVM to issue SEAMCALLs to the TDX module to export and import private pages with encryption and decryption. Another thing is that the secure EPT on the destination side needs to be set up in advance, before importing a TD private page, because to import a private page the TDX module needs to walk the secure EPT to find the final entry for that guest page. Then it's about huge page splitting. We don't consider it for our first release, based on the assumption that the TD works with 4KB pages in the first place, but we will support it in a later release. We also need to build a common framework that abstracts the TDX migration implementation into a vendor-specific layer, so that it can interoperate with technologies similar to TDX from other vendors.
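To make the memory-copy path described above a bit more concrete, here is a minimal sketch of how a migration thread in QEMU might ask KVM to export one private page. It is only an illustration: the ioctl name, request number, and struct layout are hypothetical placeholders, not the real KVM ABI; the actual interface used in this design (a per-stream migration device) is described later in the talk.

```c
/*
 * Illustration only: hypothetical ioctl for exporting one TD private page.
 */
#include <stdint.h>
#include <sys/ioctl.h>

struct td_export_page_req {
    uint64_t gfn;        /* guest frame number of the private page          */
    uint64_t buf_addr;   /* userspace buffer receiving the ciphertext + MAC */
    uint64_t buf_size;
};

#define KVM_TDX_EXPORT_PAGE _IOWR('k', 0xe0, struct td_export_page_req)  /* hypothetical */

static int export_private_page(int kvm_fd, uint64_t gfn, void *buf, uint64_t size)
{
    struct td_export_page_req req = {
        .gfn      = gfn,
        .buf_addr = (uint64_t)(uintptr_t)buf,
        .buf_size = size,
    };

    /*
     * KVM would serve this request by issuing a SEAMCALL to the TDX module,
     * which encrypts the page with the migration key and hands back the
     * ciphertext plus a MAC.  QEMU never sees the plaintext of the page.
     */
    return ioctl(kvm_fd, KVM_TDX_EXPORT_PAGE, &req);
}
```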
Okay, let's have a look at more details about TDX migration, starting with the whole picture from an elevated view.

In this picture, the source TD, the guest TD that needs to be migrated, has a MigTD associated with it. The MigTD assists the migration process; I will introduce more about it in the next slide. Before the migration starts, the MigTD needs to do the migration policy evaluation, which includes a compatibility check and security attestation. The goal of this evaluation is to ensure that the destination physical environment the guest TD is going to be migrated to is compatible and secure to migrate to. Another thing is the migration key setup. There is a key used throughout the migration: the migration key is generated by the source MigTD and set into the TDX module using a TDCALL, and the key is also securely transferred to the destination-side MigTD. For that, a secure communication channel is established between the MigTDs on the source and destination sides. The migration key is used by the TDX module to encrypt and decrypt the migration data.

Here is a list of the states that need to be migrated. The encrypted states include the TD private memory, the vCPU states, and the TD-scope states. The states transferred in clear text, meaning they are not encrypted, are the TD shared memory states.

The MigTD is a service TD that assists the migration of a guest TD. Here are some points about it. Its basic responsibility is to perform the migration policy evaluation and the migration key setup, as mentioned before. It talks to the TDX module to set the migration key, for example using a TDCALL, and it doesn't need to interact with the guest TD during migration. The MigTD is bound to the guest TD by the VMM using a SEAMCALL, and one MigTD can assist the migration of multiple guest TDs at the same time. For example, if the host has 10 guest TDs that need to be migrated at the same time, we can launch just one MigTD to support the migration of all 10 guest TDs. The last thing about the MigTD is that it's part of the platform TCB, so it's included in the TD attestation.

As you know, the MigTDs on the source and destination sides need to exchange sensitive information like the migration key, and the key must be transferred securely, so we establish a TLS connection between the source and destination MigTDs. The dotted lines here show the path used to exchange the messages. The green ones are for the MigTD-to-host communication: here we have a vsock transport, and we can choose to use virtio-vsock or a new transport based on TDVMCALL. On the host side there is also a relaying entity, socat. socat relays the messages coming from this vsock transport to the destination-side socat, and the messages go through the host network stack, like the TCP/IP stack. Once a message is received by the destination-side socat, it relays the message further to the destination MigTD.

The MigTD is vendor specific: Intel will provide a reference design and a Rust-based implementation, but cloud vendors can design a MigTD on their own. For the reference MigTD implementation, we don't run any operating system inside it; it's just a bare-metal implementation.
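To make the host-side relaying concrete, here is a minimal sketch of the kind of forwarding the relay does; the vsock port and peer address are purely illustrative, it forwards in one direction only, and error handling is omitted. The important point is that the relay just moves opaque bytes: the payload is already protected end to end by the TLS session between the two MigTDs, so the host relay does not need to be trusted.

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int main(void)
{
    /* Accept the MigTD's vsock connection on the host (port is illustrative). */
    int vs = socket(AF_VSOCK, SOCK_STREAM, 0);
    struct sockaddr_vm vaddr = {
        .svm_family = AF_VSOCK,
        .svm_cid    = VMADDR_CID_ANY,
        .svm_port   = 1234,
    };
    bind(vs, (struct sockaddr *)&vaddr, sizeof(vaddr));
    listen(vs, 1);
    int conn = accept(vs, NULL, NULL);

    /* Connect to the relay on the destination host over TCP (address is illustrative). */
    int tcp = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in taddr = {
        .sin_family = AF_INET,
        .sin_port   = htons(4321),
    };
    inet_pton(AF_INET, "192.0.2.10", &taddr.sin_addr);
    connect(tcp, (struct sockaddr *)&taddr, sizeof(taddr));

    /* Forward the MigTD's messages byte for byte; the relay never parses them. */
    char buf[4096];
    ssize_t n;
    while ((n = read(conn, buf, sizeof(buf))) > 0)
        write(tcp, buf, n);
    return 0;
}
```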
For the migration flow, the first thing to know is that the migration thread needs to distinguish whether a page is private or shared, so that it can decide whether to use the legacy page transfer path or the export/import transfer path. For this, KVM maintains a per-memory-slot bitmap indicating whether each page is private or shared. The bitmap is set and cleared upon EPT violations: the faulting GPA has a special bit that tells KVM whether that GPA is shared or private. If it is a private page, it goes through the TDX migration path; if it's a shared page, it is transferred as normal RAM, like the regular VM migration path.

Okay, let's check the flow here. The yellow box is the pre-migration stage. At this stage, we launch a MigTD and bind it to the guest TD using the SEAMCALL here (I've marked the SEAMCALLs on the slide). The MigTD then generates a migration key and sets it into the TDX module using the TDCALL here, and then it initiates the migration.

At the migration setup stage, the migration thread creates one or more migration streams using the SEAMCALL here; the SEAMCALL is invoked through an ioctl from QEMU to KVM. Then it starts dirty page logging and does the huge page split. For the write protection, it uses another two SEAMCALLs: BLOCKW blocks writes, and UNBLOCKW unblocks writes. So once a guest page has been dirty-logged in the bitmap, we can unblock that page for the guest TD to write again.

Now we move to the migration start stage. In this stage, we send the TD-scope immutable states; this is the first migration data that needs to be sent to the destination side. After sending the immutable states, we move to the memory save iteration step. At this step, QEMU asks KVM to export memory pages using this SEAMCALL, and at the end of each round it also asks the TDX module to generate a token; there will be more about this token in the next slide. When the iteration step ends, we move to the completion step. The migration thread pauses the guest TD, with KVM making this SEAMCALL. Then it exports the remaining memory pages and sends them to the destination, sends the mutable TD-scope states and the vCPU states, and lastly generates a start token, which gets imported on the destination side.

For the memory transfer, there is a picture here showing an in-order phase and an out-of-order phase. In the in-order phase, the source TD is still running, so the order matters, and the token is needed to mark the order: a newer version of a page must be imported after the older version of that page has been imported in each round. This is basically not an issue for the current KVM implementation, because in the KVM migration process a page only gets transferred once per round. In the out-of-order phase, the source TD is paused; this is usually used by post-copy. I'm not going to talk about the out-of-order phase in detail, because we will support it in a later step.
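Tying the flow back to the private/shared distinction at the top: below is a minimal sketch, with made-up helper and structure names, of the per-page decision the migration thread makes based on that per-memslot bitmap. The two helpers are stand-ins for the export/import path and the regular RAM path, not real KVM or QEMU functions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical view of the per-memslot bitmap KVM keeps up to date on EPT
 * violations: a set bit means the page at that GFN is currently TD private. */
struct memslot_private_bitmap {
    uint64_t       base_gfn;
    unsigned long *bits;
};

static bool gfn_is_private(const struct memslot_private_bitmap *slot, uint64_t gfn)
{
    uint64_t idx = gfn - slot->base_gfn;
    return slot->bits[idx / (8 * sizeof(unsigned long))] &
           (1UL << (idx % (8 * sizeof(unsigned long))));
}

/* Stand-ins for the two transfer paths. */
static void tdx_export_and_send_page(uint64_t gfn)
{
    printf("GFN 0x%llx: private, exported (encrypted) by the TDX module\n",
           (unsigned long long)gfn);
}

static void send_ram_page_plaintext(uint64_t gfn)
{
    printf("GFN 0x%llx: shared, sent through the regular RAM migration path\n",
           (unsigned long long)gfn);
}

static void send_one_dirty_page(const struct memslot_private_bitmap *slot, uint64_t gfn)
{
    if (gfn_is_private(slot, gfn))
        tdx_export_and_send_page(gfn);
    else
        send_ram_page_plaintext(gfn);
}

int main(void)
{
    unsigned long bits[1] = { 0x2 };  /* only GFN base+1 is private in this toy example */
    struct memslot_private_bitmap slot = { .base_gfn = 0x1000, .bits = bits };

    send_one_dirty_page(&slot, 0x1000);  /* shared  */
    send_one_dirty_page(&slot, 0x1001);  /* private */
    return 0;
}
```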
As you know, there are many states that need to be transferred, so in this slide I'm going to introduce the migration data transport. Each migration stream creates a migration device, emulated as a KVM device, and the migration thread does ioctls on the device fd to send requests to the migration device in KVM; for example, it can ask the migration device in KVM to export memory states. The KVM device allocates a piece of memory that is mapped by the migration thread. Using shared memory here is just for performance, so we don't need to copy the data between user space and KVM.

The shared memory consists of several sections. The first one is the MBMD buffer, which stores the migration bundle metadata; it's like the header. The green one here is the migration buffer, which stores the guest's encrypted states. The MAC list buffer stores MACs: for example, when we export a list of pages, say 16 pages, there will be 16 MACs stored here, and the 16 GPAs corresponding to those MACs are stored in the GPA list buffer. Data like the MACs and the encrypted states are all filled in by the TDX module. With multifd support, we have multiple I/O threads to send the data, and each I/O thread creates one of these KVM devices, so the encrypted states can be exported from the TDX module in parallel.

Okay, in this slide I'm going to introduce the confidential guest migration framework. The plain boxes here are the existing migration logic, and the green ones are the new layer we added to it. At the setup stage, we need to do the TDX setup, and this is invoked through the framework. At the start step, we need to send the immutable states, and this is also invoked through the green box. At the memory transfer step, the TDX-specific function is invoked to export memory from the TDX module, and at the end of each migration round the TDX function is also called to generate a token and send it to the destination side. At the end of pre-copy, the TDX function is invoked to send the TD state, the vCPU states, and a start token. On the destination side, the setup is similar to the source side, which creates the stream, and then the load function calls the TDX load function. The load function basically loads whatever is received from the source side, like the memory state, the TD state, and the vCPU state. From this function's point of view, the state is just a piece of data; it doesn't care what's stored inside. QEMU just delivers the packets to KVM, and KVM issues the SEAMCALL to the TDX module; the TDX module parses those packets via the MBMD header and stores the data into the guest.
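As a rough illustration of what this vendor-specific layer could look like, here is a sketch of a hook table the framework might dispatch to. The struct and function names are made up for this example and are not QEMU's actual API, but they map one-to-one onto the steps just described.

```c
#include <stdint.h>

/*
 * Hypothetical shape of the vendor-specific hooks behind the confidential
 * guest migration framework.  A TDX backend would implement these via
 * ioctls to the KVM migration device; another vendor's technology could
 * plug in its own implementation without touching the generic migration code.
 */
struct cg_migration_ops {
    int (*setup)(void *opaque);                       /* create migration stream(s)                       */
    int (*save_immutable_state)(void *f);             /* sent once at the start step                       */
    int (*save_private_page)(void *f, uint64_t gfn);  /* export one encrypted private page                 */
    int (*save_epoch_token)(void *f);                 /* at the end of each pre-copy round                 */
    int (*save_final_state)(void *f);                 /* mutable TD state, vCPU states, start token        */
    int (*load_state)(void *f);                       /* destination: hand opaque data to KVM / TDX module */
};
```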
Okay, in this section I'm going to show you some initial results. One thing I need to point out is that these are emulated results: they are tests of legacy VM live migration with an estimated TDX overhead added to the memory copy, so we treat them as emulated TDX numbers.

For the test environment, we used an E5 CPU running at 2.2 GHz with DDR4 RAM. For the link, we used a 10Gb link, with a direct cable connecting the source and destination machines. For the live migration, we used the default downtime limit, which is 300 milliseconds, and we didn't set a network bandwidth limit. We have three types of guests: the first is a legacy guest, the second is a legacy guest without the zero-page optimization, and the third is a TD guest, labeled TDX here. The trailing label is the overhead in cycles that we added to the memory copy. The TDX case uses no compression and no zero-page optimization. For the modeled overhead, we have two values, depending on the encryption algorithm that could be used by the TDX module. One is around 2,300 cycles; this is calculated as a per-page factor, the encryption overhead per page plus the SEAMCALL latencies (at 2.2 GHz, 2,300 cycles is roughly 1 microsecond per 4KB page). The other is around 4,000 cycles. We also tested with multifd; in the multifd configuration we used four I/O threads to send the data. Inside the guest, we ran a workload that dirties memory at a rate of 600 MB per second.

Let's look at the migration time first. In the legacy case it's around 13 seconds. If we remove the zero-page optimization, the migration time gets longer, because the zero-page optimization skips most of the pages; that's why the time is shorter in the legacy case. Since the first TDX release doesn't have the zero-page optimization, the comparison should be between the second column and the remaining TDX emulation columns, because none of them has the zero-page optimization. With a bit more TDX overhead, the migration time gets another 10 seconds longer, and if we enable multifd, the migration time improves, closer to the legacy case. For the downtime, I don't see much difference between them. For the migration throughput, when we introduce the TDX overhead the throughput is lower, but if we enable multifd support the throughput gets closer to the legacy case; the network throughput behaves similarly. For the CPU utilization, they are pretty much similar, except for the multifd case: since we use multiple I/O threads there, the migration threads' CPU utilization is higher.

I also measured the maximum migratable dirty rate, that is, the maximum dirty rate at which the guest can still be migrated. For example, in the legacy case it's 1,100 MB per second; if the guest dirties memory at 1,200 MB per second, the guest can't be migrated. For the TDX case, the maximum migratable dirty rate is a bit lower than the legacy case, and if we enable multifd support, the maximum dirty rate improves and gets closer to the legacy case. But for this test I used the full bandwidth of the link, that is 10Gb, while in a real cloud environment most cloud vendors allocate only a slice of the bandwidth to the migration thread. So I also did another test with the network bandwidth limited to about 3Gb per second, and in that case I don't see any difference between the legacy case and the pseudo-TDX case, because the migration is network-bound.

Okay, in this section I'm going to introduce our current status and the future plan. For the pre-copy enabling, the draft code is currently ready and pending testing, and we plan to post the patches to the QEMU and KVM mailing lists in Q1 next year. For the multifd enabling, we plan to support it in Q1 next year, after we post the basic pre-copy enabling patches. And for the post-copy enabling, we plan to support it roughly by Q2 next year.

Okay, that's all for this presentation. Thank you. Questions and comments are welcome.