Hello everyone, thanks for attending my session. This is Kai Huang. Today I will give you a talk about Intel Trust Domain Extensions (TDX) host kernel support. This is a pre-recorded video, but I will be online while the video is being played, so if you have any questions please ask them online and I will answer them. So let's get started.

This is today's agenda. I will give you an introduction to TDX, then I will talk about the design and implementation of the TDX host kernel support, and at last I will give you some current status updates and some future work.

This is an overview of Intel TDX. Basically, TDX protects a virtual machine from a malicious host and from certain physical attacks. Today, for a normal VM, the hypervisor can access its memory and its vCPU state, but for a TDX guest its memory and its vCPU state are protected from the host kernel. So basically the host kernel is out of the TCB in terms of TDX. To achieve this, Intel introduced a new CPU mode called Secure Arbitration Mode (SEAM) into Intel's latest CPUs. Similar to the legacy VMX case, SEAM also has two modes, which are SEAM VMX root mode and SEAM VMX non-root mode respectively. SEAM VMX root mode is designed to run a CPU-attested software module called the TDX module, which basically acts as a trusted hypervisor, and SEAM VMX non-root mode is used to run those protected VMs, called TDs. SEAM also comes with a new SEAM range register, which specifies an isolated memory range, and the TDX module runs in the memory range specified by this new SEAM range register. Underneath, TDX leverages MKTME to provide crypto protection for the virtual machines. Basically, TDX reserves part of the MKTME KeyIDs as TDX private KeyIDs, and one TDX private KeyID is associated with each TD, so its memory is crypto-protected. A TDX private KeyID can only be used by software running in SEAM mode; if the host kernel tries to use a TDX private KeyID to access a guest's memory, that is basically treated as illegal behavior by the hardware.

The host kernel communicates with the TDX module using a new SEAMCALL instruction, and after some handling the TDX module returns to the hypervisor using a new SEAMRET instruction. Basically, the TDX module implements a set of SEAMCALL leaf functions to allow the host kernel to initialize the module and to create and run TD guests. As you can see, in TDX the TDX module sits in the central position. It is basically a CPU-attested, trusted module, effectively a trusted VMM, and it is eventually loaded by the ACM, so it can be trusted. It implements a set of SEAMCALL leaf functions to allow the host to initialize the TDX module and to create and run TDX guests. In practice, it is loaded by the BIOS, and the kernel needs to initialize it before KVM can use it to create and run virtual machines. It can also be runtime-updated by the kernel. This talk will focus on initializing the TDX module.
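To make the SEAMCALL interface a bit more concrete before moving on, here is a minimal, purely illustrative sketch of what a host-side SEAMCALL wrapper could look like. This is not the actual kernel code: a real wrapper marshals more input and output registers, copes with SEAMCALL faulting when the CPU is not in VMX operation, and takes the leaf function numbers from the TDX module specification.

    #include <linux/types.h>

    /*
     * Illustrative sketch only: issue one SEAMCALL to the TDX module.
     * The leaf function number goes in RAX and the completion status
     * comes back in RAX; RCX/RDX carry leaf-specific arguments and may
     * carry return values.
     */
    static inline u64 seamcall(u64 leaf, u64 rcx, u64 rdx)
    {
        u64 ret = leaf;

        asm volatile(
            /* SEAMCALL opcode: 66 0F 01 CF */
            ".byte 0x66, 0x0f, 0x01, 0xcf"
            : "+a"(ret), "+c"(rcx), "+d"(rdx)
            :
            : "r8", "r9", "r10", "r11", "memory", "cc");
        return ret;
    }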
Before talking about how to initialize the TDX module, let me first talk about how the TDX module does memory management. TDX basically imposes additional security and functionality requirements on memory. As a result, TDX introduces the concept of convertible memory regions (CMRs). During machine boot, the BIOS generates a list of convertible memory regions, and MCHECK verifies the list provided by the BIOS, that is, that the memory is actually physically present and can meet TDX's security and functionality requirements. And this list is static during the machine's runtime.

The second concept is the physical address metadata table (PAMT), which is metadata used by the TDX module to track the status of each TDX memory page, for example the page's ownership and state, which TD the page is owned by, those things. So it is basically similar to the kernel's struct page.

The third concept is the TD memory region structure, TDMR for short. TDX introduces the concept of convertible memory regions, but those memory regions are not automatically usable by the TDX module. As one step of initializing the TDX module, the kernel needs to select which memory regions TDX will use and pass them to the TDX module, and the TDMR is the data structure used to pass those memory regions to the TDX module. TDX only supports a limited number of TDMRs, each TDMR must be 1GB aligned, and its size must be in 1GB granularity. For each TDMR there are three PAMTs to track the status of each page in that TDMR: TDX supports three page sizes, 4KB, 2MB and 1GB, so each TDMR has three PAMTs, one per page size. Because a TDMR is 1GB aligned and 1GB-granular in size, but the TDX memory regions are normally not 1GB aligned, there may be some memory holes within one TDMR, and those memory holes must be put into the TDMR's reserved areas. If a PAMT overlaps with one TDMR, the overlapping part must be put into the TDMR's reserved areas too. And TDX only supports a limited number of reserved areas per TDMR.

In terms of how to initialize the TDX module, the TDX module defines a sequence of steps to do that. The first step is a global-scope module initialization, which requires calling one SEAMCALL on any CPU. The second step is a logical-CPU-scope module initialization, which requires calling one SEAMCALL on all BIOS-enabled logical CPUs; if any CPU is offline and that SEAMCALL cannot be done, a later step of the initialization will fail. In the next step, the kernel is responsible for choosing all the memory regions that will be used by the TDX module and constructing an array of TDMRs to cover all those regions. After the array of TDMRs is generated, the kernel needs to configure the TDX module with that array, along with a global TDX KeyID. The next step is to flush the cache for the PAMTs: in a later step the global TDX KeyID will be used by the TDX module to initialize all the PAMTs, so before that the kernel needs to flush all dirty cache lines of the PAMTs, otherwise they may silently corrupt the PAMTs after the TDX module has initialized them. The next step is to configure the global TDX KeyID on all packages, which requires calling one SEAMCALL on one CPU of each package. The last step is to initialize all the TDMRs that were passed to the TDX module. After all the TDMRs are initialized, KVM can use the TDX module to create and run TDX guests.
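To make that sequence concrete, here is a rough outline in code of the steps just described. Every helper name below is hypothetical and stands for one step of the sequence; only the ordering is meant to be meaningful.

    /* Hypothetical helpers, one per step of the sequence described above. */
    int tdx_module_init_global(void);
    int tdx_module_init_each_cpu(void);
    int tdx_build_tdmrs(void);
    int tdx_config_module(void);
    void tdx_flush_pamt_cache(void);
    int tdx_config_global_keyid(void);
    int tdx_init_tdmrs(void);

    /* Rough outline of the TDX module initialization sequence. */
    int init_tdx_module(void)
    {
        int ret;

        /* 1. Global-scope module initialization (one SEAMCALL, any CPU). */
        ret = tdx_module_init_global();
        if (ret)
            return ret;

        /* 2. Logical-CPU-scope initialization: one SEAMCALL on each
         *    BIOS-enabled logical CPU; all of them must be online. */
        ret = tdx_module_init_each_cpu();
        if (ret)
            return ret;

        /* 3. Build the array of TDMRs covering all chosen TDX memory. */
        ret = tdx_build_tdmrs();
        if (ret)
            return ret;

        /* 4. Hand the TDMR array and the global TDX KeyID to the module. */
        ret = tdx_config_module();
        if (ret)
            return ret;

        /* 5. Flush dirty cache lines of the PAMTs before the module
         *    starts writing them with the global TDX KeyID. */
        tdx_flush_pamt_cache();

        /* 6. Program the global TDX KeyID on one CPU of each package. */
        ret = tdx_config_global_keyid();
        if (ret)
            return ret;

        /* 7. Initialize every TDMR; after this KVM can create TDs. */
        return tdx_init_tdmrs();
    }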
From here, I will talk about the design and implementation of the TDX host kernel support. First, from a high-level perspective, the goal is to use minimal code to enable TDX in the first submission to upstream; any additional functionality and optimization can be done in the future. This is because initializing the TDX module is not trivial in terms of lines of code, so at the first stage we target using minimal code to enable TDX.

In terms of design, one major design decision is to initialize the TDX module at runtime, rather than always initializing it during kernel boot. There are three reasons for that. The first is to avoid the non-trivial memory and CPU time consumption when TDX is enabled by the BIOS but the kernel has no intention to use TDX. The second is to avoid doing VMXON in the non-KVM core kernel: initializing the TDX module requires SEAMCALLs, but SEAMCALL requires the CPU to already be in VMX operation, so if we wanted to initialize the module during kernel boot, we would have to add VMXON support to the core kernel. From a long-term point of view, a reference-based VMXON/VMXOFF approach is likely needed, because in the future more kernel components will likely need to be modified to support TDX too, but so far KVM is the only user of TDX, and KVM already handles VMXON and VMXOFF. So initializing the module when KVM wants to use it allows us to avoid a temporary VMXON solution in the core kernel for now. The third reason is that it is also more flexible for supporting TDX module runtime updates, because after the TDX module is updated, the initialization sequence needs to be done again for the new module. So in our current design, the core kernel will just provide one function to allow the caller to enable TDX, and this function will be protected with a state machine and a mutex, because theoretically multiple callers can call that function to enable TDX.

As already mentioned, as one step of initializing the TDX module, the kernel is responsible for choosing all the memory regions that will be used by the TDX module as TDX memory and passing all those regions to the TDX module. After configuring the TDX module with those memory regions, no more memory can be added to the TDX module at runtime, so basically all the memory the TDX module can use is fixed after initializing the TDX module. In order to avoid having to modify the page allocator to distinguish TDX and non-TDX memory allocations (for example, we don't want a new GFP_TDX flag for memory allocation), we currently choose to just guarantee that all pages in the page allocator are TDX pages. This is done by treating all boot-time system memory as TDX memory. Since the TDX module is initialized at runtime and the core kernel does not handle VMXON now, during kernel boot we actually cannot get the list of CMRs, so this approach requires that all boot-time system memory actually is convertible memory. This is true in practice, and all boot-time system memory can be obtained from the memblock data structure during kernel boot. To prevent any new memory from being added to the page allocator at runtime, we reject hot-added memory from going online if that memory is not CMR memory. But we still allow the memory hotplug itself to happen: although non-CMR memory cannot go online at runtime, allowing it to be hot-added means it can still potentially be used by a driver; for example, it could theoretically be put into something like ZONE_DEVICE, so that it can be used by some driver but never enters the page allocator. Two small sketches, one of the enabling function and one of the onlining check, follow below.
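The first sketch shows the idea of a single enabling function protected by a mutex and a simple state machine, so that concurrent or repeated callers are handled. The status values and the overall shape are made up for illustration and are not the actual patches.

    #include <linux/mutex.h>
    #include <linux/errno.h>

    /* Hypothetical module states for illustration only. */
    enum tdx_module_status {
        TDX_MODULE_UNKNOWN,
        TDX_MODULE_INITIALIZED,
        TDX_MODULE_ERROR,
    };

    static enum tdx_module_status tdx_module_status;
    static DEFINE_MUTEX(tdx_module_lock);

    int init_tdx_module(void);      /* the sequence sketched earlier */

    /* Single entry point the core kernel would expose to callers (KVM). */
    int tdx_enable(void)
    {
        int ret = 0;

        mutex_lock(&tdx_module_lock);

        switch (tdx_module_status) {
        case TDX_MODULE_UNKNOWN:
            ret = init_tdx_module();
            tdx_module_status = ret ? TDX_MODULE_ERROR :
                                      TDX_MODULE_INITIALIZED;
            break;
        case TDX_MODULE_INITIALIZED:
            break;                  /* already done, nothing to do */
        default:
            ret = -EINVAL;          /* a previous attempt failed */
        }

        mutex_unlock(&tdx_module_lock);
        return ret;
    }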
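The second sketch shows how rejecting the onlining of non-convertible memory could hook into the kernel's memory hotplug notifier chain. The range_is_tdx_memory() helper is hypothetical and stands for the check against the boot-time memory list; the notifier registration and the MEM_GOING_ONLINE event are the standard kernel mechanisms.

    #include <linux/init.h>
    #include <linux/memory.h>
    #include <linux/notifier.h>

    /* Hypothetical: does [start_pfn, start_pfn + nr_pages) fall entirely
     * within the memory the kernel will hand to the TDX module? */
    bool range_is_tdx_memory(unsigned long start_pfn, unsigned long nr_pages);

    /*
     * Sketch only: refuse to online any memory that is not TDX memory,
     * so every page in the page allocator remains usable for TDX.  The
     * memory can still be hot-added; it just never goes online.
     */
    static int tdx_memory_notifier(struct notifier_block *nb,
                                   unsigned long action, void *data)
    {
        struct memory_notify *mn = data;

        if (action != MEM_GOING_ONLINE)
            return NOTIFY_OK;

        return range_is_tdx_memory(mn->start_pfn, mn->nr_pages) ?
               NOTIFY_OK : NOTIFY_BAD;
    }

    static struct notifier_block tdx_memory_nb = {
        .notifier_call = tdx_memory_notifier,
    };

    /* Registered once during TDX host setup (illustrative). */
    static int __init tdx_register_memory_notifier(void)
    {
        return register_memory_notifier(&tdx_memory_nb);
    }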
So after the kernel has selected all the memory regions that will be used by the TDX module, it needs to construct an array of TDMRs to cover all of them. To keep the code simple, we use a simple approach: we always try to create a new TDMR to cover one TDX memory block. For the PAMT allocation, we use alloc_contig_pages(), because the PAMT must be physically contiguous, and at runtime alloc_contig_pages() can do that. As mentioned before, if a PAMT overlaps with a particular TDMR, the overlapping part must be put into that TDMR's reserved areas too, so in order to reduce the reserved areas consumed by the PAMTs, we allocate the three PAMTs together for one TDMR. After all the TDMRs are constructed, the last step is to initialize them, and for simplicity we just initialize the TDMRs one by one. TDX actually supports initializing different TDMRs on different CPUs simultaneously, but that is not in the first submission.

As mentioned before, TDX imposes additional security and functionality requirements on memory, so TDX has an interaction with ACPI memory hotplug. In short, TDX does not support hotplug of any CMR memory, because the CMRs must be physically present during machine boot, and MCHECK verifies that this memory is physically present and can meet TDX's security requirements. The list of CMRs is static after machine boot, so TDX does not support hot-adding any additional CMR memory at runtime. TDX also does not support hot removal of any CMR memory, because that could result in physical replacement of CMR memory, which opens up potential physical attacks. So a properly functioning BIOS should never send an ACPI memory hotplug event for CMR memory to the kernel. However, architecturally TDX does not forbid the hotplug of non-CMR memory, so for non-CMR memory, memory hotplug can happen normally.

TDX also has an interaction with ACPI CPU hotplug. In short, TDX does not support ACPI CPU hotplug either. MCHECK verifies that all boot-time present CPUs are TDX compatible before TDX can be enabled, and it keeps that information, such as the total number of CPU packages and the total number of logical CPUs, for the TDX module to use later. A non-buggy BIOS should never send an ACPI CPU hotplug event to the kernel.

In terms of how to handle ACPI CPU and memory hotplug, at the core ACPI level we leverage the ACPI device's ejectable flag to prevent hot removal, because the kernel needs to treat CMR memory and non-CMR memory hotplug separately. Currently, the ejectable flag is fixed during the device's lifetime, based on whether the device supports the ACPI _EJ0 or _EJD method. We change that: we want to allow the ACPI scan handler's attach callback to set the ejectable flag to false, so that the core ACPI code will simply refuse the ejection event when it happens. For ACPI CPU hotplug, we want to reject hot-added new CPUs, because TDX does not support hot-adding any physical CPU, but at the same time, for all the boot-time present CPUs, we still want all the normal things to be done normally; for example, we still want the ACPI CPU devices to be created as usual. So we just set the ejectable flag to false at the end of the attach callback. For ACPI memory hotplug, the handling is similar: we still want the normal things to happen for boot-time present memory devices, so for the CMR memory we just set the ejectable flag to false at the end of attach. That way, any CMR memory will not be physically removed; the ejection will be rejected by the core ACPI code. For non-CMR memory, we do nothing, so it behaves normally, and we still allow that memory to be hot-added as mentioned before; we only prevent the non-CMR memory from going online, so that no new pages are added to the page allocator, and in that way we can guarantee that all the pages in the page allocator are TDX pages.

The last topic is kexec support. When the TDX module has been successfully initialized, we need to flush the caches for all TDX private memory before booting into the new kernel. This is because the hardware does not guarantee cache coherency between different KeyIDs, and the dirty cache lines of the TDX private memory are associated with TDX private KeyIDs, so they must be flushed before booting into the new kernel. In terms of how to do that, we just do WBINVD in stop_this_cpu() when TDX is enabled by the BIOS. This is similar to AMD's solution. Sketches of the PAMT allocation, the ejectable-flag handling, and the kexec cache flush follow below.
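Going back to the PAMT allocation above, here is a small sketch of allocating the three physically contiguous PAMTs of one TDMR in a single alloc_contig_pages() call. The tdmr_pamt_pages() helper, which would compute the combined PAMT size from the TDMR size and the per-page-size PAMT entry sizes reported by the TDX module, is hypothetical.

    #include <linux/gfp.h>
    #include <linux/nodemask.h>

    /* Hypothetical: total number of pages needed for the three PAMTs
     * (one per supported page size) covering one 1GB-granular TDMR. */
    unsigned long tdmr_pamt_pages(unsigned long tdmr_size);

    /*
     * Sketch only: allocate one physically contiguous chunk to hold all
     * three PAMTs of a TDMR, preferably on the node the TDMR belongs to.
     * Allocating them together keeps the number of TDMR reserved areas
     * consumed by PAMT overlaps small.
     */
    static struct page *alloc_tdmr_pamt(unsigned long tdmr_size, int nid)
    {
        return alloc_contig_pages(tdmr_pamt_pages(tdmr_size),
                                  GFP_KERNEL, nid, &node_online_map);
    }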
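Next, a very small sketch of the ejectable-flag idea for ACPI memory devices, reduced to the one line that matters. Whether a scan handler's attach callback may modify the flag this way is exactly the behavior change being proposed, so treat this purely as an illustration.

    #include <linux/acpi.h>

    /*
     * Sketch only: at the end of the ACPI memory device scan handler's
     * attach callback, clear the ejectable flag so that the core ACPI
     * code refuses a later ejection request for (potential) CMR memory.
     */
    static int tdx_acpi_memory_attach(struct acpi_device *adev,
                                      const struct acpi_device_id *id)
    {
        /* ... the normal attach work for the memory device ... */

        adev->flags.ejectable = 0;

        return 1;   /* device claimed by this scan handler */
    }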
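And finally, the kexec cache flushing is essentially a couple of lines. A sketch, assuming a hypothetical platform_tdx_enabled() helper that reports whether the BIOS has enabled TDX:

    #include <linux/types.h>
    #include <asm/special_insns.h>

    /* Hypothetical helper: did the BIOS enable TDX on this platform? */
    bool platform_tdx_enabled(void);

    /*
     * Sketch only: on the kexec/shutdown path each CPU flushes its caches
     * (WBINVD) when TDX is enabled by the BIOS, so that dirty cache lines
     * associated with TDX private KeyIDs cannot silently corrupt memory
     * once the new kernel starts reusing it with different KeyIDs.
     */
    static void tdx_flush_cache_before_kexec(void)
    {
        if (platform_tdx_enabled())
            native_wbinvd();
    }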
So that is all for the design and implementation. The last part is the current status and the future work. For the current status, I have already sent out version 5, and any comments are highly appreciated. I am still working on version 6 and will send it out very shortly.

As mentioned, the first submission will only use minimal code to enable TDX, so new functionality will be added in the future. For example, we will add support for exposing TDX private KeyID information and TDX module information via sysfs, so that user space software can use that information, for example, to check how many TDX guests the machine can support. In terms of optimizations, the first is the initialization of the TDMRs: in the current implementation, all the TDMRs are initialized one by one, but TDX actually allows initializing different TDMRs on different CPUs simultaneously. The second is some corner case handling when constructing the TDMRs. Currently, we use a simple algorithm that creates one TDMR for each memory block, so if there are many memory blocks, we may run out of TDMRs; and one TDMR only supports a limited number of reserved areas, so if one TDMR has many memory holes, we may run out of reserved areas too. Those corner cases can be handled with a better algorithm, and that will be done in the future. The last one is TDX module initialization error handling. Currently, any error that happens during TDX module initialization is treated as a fatal error and results in shutting down the TDX module, because there is no reason to leave the module in some intermediate state. But for some errors we actually don't have to shut down the module; the caller could try again. For example, if the TDX module initialization fails due to running out of memory, we could record the internal state of the initialization, let the caller free some memory, and call the function again to finish the initialization.

So here are some references. The first one is a link to the version 5 patches, and the second one is the TDX specifications. That's all. Thank you very much.