Hi, I'm Sean Christopherson from Intel. Welcome to the 2020 KVM Forum presentation on Intel Trust Domain Extensions. Trust Domain Extensions, or TDX, is a set of hardware and software features in upcoming Intel CPUs that allows for the creation of hardware-isolated virtual machines called Trust Domains. Trust Domains provide memory and CPU state confidentiality and integrity while allowing the CSP or platform owner to maintain control of system resources and maintain platform integrity. On the hardware side, TDX adds extensions to Intel's ISA and to VMX, Intel's virtualization technology, as well as memory encryption with integrity. On the software side, there's a new CPU-attested software module that implements the bulk of the TDX functionality. Due to time constraints, I'll mostly be focusing on the functional aspects of TDX as well as the impact on Linux and KVM. For an in-depth view of the security properties of TDX and things like remote attestation, please see my colleagues' presentation at the Linux Security Summit.

On the hardware side, there's a new architectural CPU mode called Secure Arbitration Mode, or SEAM. SEAM is effectively a sub-mode of VMX root mode. It is slightly more privileged in that it has access to some new assets in the system, but beyond those, it is the same privilege level as VMX root mode. SEAM mode can only be entered from VMX root mode via a new instruction, SEAMCALL, and conversely can only be exited from SEAM root via a new instruction, SEAMRET. Being a sub-mode of VMX, SEAM mode has full access to the VMX instructions, so it can create and manage VMCSs and ultimately run VMs through those VMCSs via VMLAUNCH and VMRESUME, transitioning between SEAM root and SEAM non-root.

On the memory side, TDX builds on MKTME to add integrity, where integrity in this context means that if software accesses integrity-protected memory with the incorrect encryption key, the hardware will poison memory and ultimately result in a recoverable machine check when that memory is next accessed. TDX also partitions the MKTME key ID space into shared and private keys, where private keys can only be used and programmed in SEAM mode. So, for example, the untrusted VMM cannot take the private key ID for a trust domain, shove it into its own page tables, and read out the guest memory; that operation will fail.

TDX also adds a shared bit in the guest physical address space. This allows the guest, the trust domain, to select between shared and private memory, and ultimately allows the guest to decide which memory is shared with the outside world and which memory remains private within the trust domain. Also related to shared memory, a second EPT pointer is added to the VMCS. This allows the EPT tables to be split into shared EPT tables that are managed directly by the VMM and private or secure EPT tables that are managed through the TDX module. Again, the CPU switches on the shared bit in the GPA to select between shared and private: if the bit is set, the CPU will start its EPT walk from the shared EPT pointer, and if the shared bit is clear, it will start its EPT walk from the secure EPT pointer in the VMCS.

On the software side, there's a new Intel-developed module called the TDX module that runs in SEAM. The TDX module is digitally signed and verified, but not encrypted. The TDX module is responsible for managing guest private state, including context switching register state, XSAVE state and select MSRs.
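To recap the shared-bit and dual-EPT-pointer mechanics described above, here is a minimal C sketch of the selection logic. The bit position and the helper names are purely illustrative assumptions, not the actual hardware or TDX module definitions:

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed for illustration: treat a high GPA bit (here, bit 51) as the
     * TDX shared bit.  The real position depends on the platform's guest
     * physical address width. */
    #define TDX_SHARED_BIT   51
    #define TDX_SHARED_MASK  (1ULL << TDX_SHARED_BIT)

    /* True if the guest marked this GPA as shared with the untrusted VMM. */
    static bool gpa_is_shared(uint64_t gpa)
    {
            return gpa & TDX_SHARED_MASK;
    }

    /* Conceptually, the CPU picks the EPT root based on the shared bit:
     * shared GPAs walk the VMM-managed shared EPT, private GPAs walk the
     * TDX-module-managed secure EPT. */
    static uint64_t select_ept_root(uint64_t gpa, uint64_t shared_eptp,
                                    uint64_t secure_eptp)
    {
            return gpa_is_shared(gpa) ? shared_eptp : secure_eptp;
    }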
And the TDX module also directly controls the secure EPT tables, the VMCSs for the trust domains, and so on. Because the untrusted VMM does not have access to guest register state and guest memory, traditional VMX-style emulation of instructions cannot occur without explicit requests from the guest. So the guest needs to be enlightened to explicitly request emulation of an instruction. To aid in that, the TDX module will reflect any instruction that would otherwise VM exit into the guest as a virtualization exception. This isn't all instructions that VM exit, only those that are not otherwise emulated or passed through by the TDX module; for example, many CPUID leaves are emulated directly by the TDX module. Reflecting these exits as #VEs allows the guest to run legacy code that has not been enlightened, for example a legacy driver that is executing instructions instead of directly making hypercalls to request emulation.

The TDX module also exposes an ABI to the VMM to create and manage trust domains. This is analogous to how KVM exposes an ABI to user space to create and run VMs: where user space makes syscalls to create VMs, add vCPUs to the VMs, and run those vCPUs, the TDX module exposes an ABI to the VMM to create trust domains, add vCPUs to those trust domains, and run those vCPUs. The VMM, KVM in our case, is still responsible for managing resources, including memory and scheduling. This allows the VMM to maintain control over what memory gets assigned to a TD and to reclaim that memory, but everything that would affect the security properties of a trust domain gets routed through the TDX module.

Another notable software component in TDX is the SEAM loader, which is an authenticated code module that is responsible for verifying and loading the TDX module. Once loaded, the TDX module is protected via a new set of range registers, the SEAMRR, which prevents code outside of the SEAM range from poking into the SEAM range and reading or writing its code and data. Another responsibility of the SEAM loader is to configure the SEAM VMCS for SEAMCALL. Underneath the hood, SEAMCALL is just a fancy variant of VM exit, meaning that when SEAMCALL is executed in VMX root, it is effectively a VM exit from VMX root into SEAM root, where SEAM root is the host and VMX root is the guest. So, for example, when the SEAM loader loads the TDX module, it programs the entry point of the TDX module into the host RIP field of the SEAM VMCS. At boot time, the kernel invokes the SEAM loader ACM, which loads the TDX module into memory, again protected by the SEAMRR, and also does some configuration of the TDX module during boot. Then at runtime, KVM executes SEAMCALL to invoke APIs provided by the TDX module, and the TDX module executes SEAMRET to return control back to KVM.

As mentioned earlier, programming of the MKTME engine for trust domains is done by the TDX module, which programs the private keys. These private keys are represented by the different color bands over on the left. All data structures and private memory of a trust domain are encrypted with the private key associated with that TD; this includes the secure EPT tables, the VMCS, XSAVE state, et cetera. All the memory is conceptually owned by KVM in the sense that KVM manages it even though it can't directly access it: KVM controls the lifecycle of the memory, but can't directly access or program it.
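To make the SEAMCALL-based ABI a bit more concrete, here is a rough C sketch of how a VMM might wrap it. The leaf names, numbers, and argument layout below are placeholders made up for illustration; the real ABI is defined by the TDX module specification:

    #include <stdint.h>

    /* Hypothetical SEAMCALL leaf numbers, illustrative only. */
    enum tdx_seamcall_leaf {
            TD_CREATE   = 0,   /* create a trust domain */
            TD_ADD_VCPU = 1,   /* add a vCPU to a trust domain */
            TD_ADD_PAGE = 2,   /* add a page of private memory */
            TD_ENTER    = 3,   /* run a vCPU */
    };

    struct tdx_seamcall_args {
            uint64_t rcx, rdx, r8, r9;   /* input operands */
    };

    /* Stand-in for the real SEAMCALL wrapper: in KVM this would be an asm
     * stub that loads the leaf into RAX, issues SEAMCALL (effectively a VM
     * exit from VMX root into the TDX module running in SEAM root), and
     * returns the completion status after the module's SEAMRET. */
    static uint64_t seamcall(enum tdx_seamcall_leaf leaf,
                             struct tdx_seamcall_args *args)
    {
            (void)leaf;
            (void)args;
            return 0;   /* placeholder: pretend the call succeeded */
    }

    /* Driving a TD mirrors how userspace drives KVM: create the TD, add
     * vCPUs and memory, then enter the vCPU in a loop. */
    static uint64_t td_run_vcpu(uint64_t td_vcpu_pa)
    {
            struct tdx_seamcall_args args = { .rcx = td_vcpu_pa };

            return seamcall(TD_ENTER, &args);
    }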
There are 40-some APIs provided by the TDX module that can be invoked by the VMM via SEAMCALL. These are all the usual suspects: creating a VM, adding memory to the VM, configuring the key used to encrypt the VM's private memory, adding vCPUs, adding entries into the secure EPT tables, and running the VM. The TDX module also provides APIs that can be invoked by the guest. These are done via TDCALL, a new SEAM-only instruction. TDCALL is for all intents and purposes VMCALL, just with a different exit code so that the TDX module can differentiate between enlightened code executing TDCALL and legacy code executing VMCALL. This API allows the guest kernel to do some introspection on the platform capabilities, to accept memory into its private memory space, and to tunnel hypercalls out to the untrusted VMM to request emulation, for example.

In the kernel and KVM: during boot, the kernel will invoke the SEAM loader ACM on the BSP and then configure the TDX module on all CPUs, and I'll talk about this a little bit later in a few slides. On the KVM side of things, the biggest change to existing code in terms of lines of code is to wrap the x86 callbacks in VMX to achieve VMX and TDX coexistence, meaning a single instance of KVM can run VMs and TDs side by side. By wrapping those callbacks, as opposed to introducing a new set of callbacks, we can do so without any meaningful performance impact to VMX or SVM. For TDX, we are able to reuse select portions of VMX, for example the IRQ and NMI trampolines for handling hardware interrupts, the posted interrupt support in VMX, the EPT entry points into KVM's MMU page fault handling, and so on. Outside of VMX, there's moderate refactoring to x86 common KVM code. We piggyback and repurpose the ioctls added by SEV, and also do some refactoring of the lifecycle of VMs, as the API ordering dictated by the TDX module doesn't perfectly align with the existing code in KVM. The most impactful changes for TDX are to support secure EPT in KVM's MMU, and we also likely need to modify the kernel MMU to support unmapping guest private memory, which I'll talk about in a couple of slides.

In KVM's MMU, to handle shared and private memory, KVM aliases the shared GPAs to the private GPAs in the memslots, which means we effectively treat the shared bit as an attribute bit, as opposed to a real physical address bit. By doing so, KVM can hide the shared bit from host user space. For example, when exiting to user space to handle emulated MMIO, KVM strips the shared bit so that host user space only ever sees the real, private GPAs. This means that host user space doesn't have to be enlightened to understand the difference and doesn't have to manage separate memory pools for private versus shared memory.

For secure EPT, TDX adds several new hooks into KVM's MMU to insert, zap and remove secure EPT entries. This is necessary because the secure EPT is managed through the TDX module. So whereas traditional EPT management would involve reading and writing memory directly, with TDX the secure EPT can only be managed by invoking SEAMCALLs to, again, insert, zap and remove secure EPT entries. KVM still maintains a shadow copy of the secure EPT tables. This is because SEAMCALL is quite expensive: if KVM were to invoke a SEAMCALL to read an EPT entry for every level of an EPT walk, handling page faults would be extremely costly in terms of latency. These shadows are used any time we're walking the page tables.
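As a minimal sketch of the shared-bit aliasing described above, assuming (as in the earlier sketch) that the shared bit is a high GPA bit, the memslot lookup and the GPA reported to userspace might conceptually look like this; the names are illustrative, not the actual KVM changes:

    #include <stdint.h>

    /* Assumed shared-bit position, for illustration only. */
    #define TD_SHARED_BIT  (1ULL << 51)

    /* KVM treats the shared bit as an attribute rather than an address bit:
     * the shared and private aliases of a GPA resolve to the same memslot GFN. */
    static uint64_t gpa_to_gfn(uint64_t gpa)
    {
            return (gpa & ~TD_SHARED_BIT) >> 12;
    }

    /* Before exiting to host userspace, e.g. for emulated MMIO, the shared
     * bit is stripped so userspace only ever sees the "real" GPA and never
     * has to manage separate shared vs. private memory pools. */
    static uint64_t gpa_for_userspace(uint64_t gpa)
    {
            return gpa & ~TD_SHARED_BIT;
    }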
And so it's only when we actually need to modify an entry that we invoke a SEAMCALL. Secure EPT also needs an additional API from the MMU to allow adding translations without a page fault. This is necessary because loading non-zero memory into a guest's private memory space, i.e. loading the initial memory image, can only be done before vCPUs are runnable, and to load non-zero memory, the secure EPT translations for the associated guest private memory must be in place. So, long story short, we have to have the secure EPT translations before we can run vCPUs, which conflicts with KVM's existing model of populating the EPT tables on demand in response to page faults from the guest.

On the private memory side of things, all private memory, meaning all memory that can be handed over to the TDX module to create private memory for a guest, must reside in a trust domain memory region, or TDMR. A TDMR is just a software construct defined by the TDX module that it uses to track metadata for system memory. It does this so that it can detect attempts by the VMM to do remapping attacks or to hand the same physical page twice to a TD, for example mapping multiple GPAs to a single HPA to try and attack the guest that way. During boot, the kernel adds all RAM to the TDMR array, and this allows KVM to allocate private memory through the normal memory allocator APIs, which it then gifts to the TDX module. So, for example, hugetlbfs, transparent huge pages, memfd, anonymous memory, et cetera, are all naturally supported simply because we're routing through the normal memory allocator.

As for unmapping guest private memory: because integrity failures result in machine checks, leaving the guest private memory mapped in host user space would allow host user space to essentially induce machine checks at will. While it's theoretically possible that we could harden the Linux kernel to gracefully handle all those machine checks, there's definitely a certain amount of risk associated with that approach. So we're exploring unmapping guest private memory so that accesses from host user space to guest private memory would result in a page fault as opposed to a machine check.

For shared memory, the shared EPT in TDX allows the untrusted VMM to configure select EPT entries to generate EPT-violation virtualization exceptions. This means that instead of causing an EPT violation VM exit, faults on these addresses get reflected back to the guest as virtualization exceptions. The primary use case for this is to reflect MMIO addresses as #VEs. So again, if the guest kernel is not fully enlightened and has code that accesses memory-mapped IO instead of doing direct TDCALLs, the hardware can reflect those accesses as #VEs, and then the #VE handler in the guest can request emulation after decoding the instruction. The catch with EPT-violation #VE is that suppression is opt-out, meaning that not-present EPT entries, if zeroed, would get reflected to the guest as #VEs. So from a KVM MMU perspective, this means we have to support a nonzero init value for EPT entries. This isn't a significant change in terms of lines of code, but it's obviously touching a core part of the KVM MMU.

As for more advanced features, the TDX architecture supports large pages, both 2MB and 1GB pages. 2MB pages we haven't implemented yet simply because we haven't gotten to them.
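To make the opt-out suppression concrete: architecturally, each EPT entry has a "suppress #VE" bit (bit 63), and with EPT-violation #VE enabled, a fault on a not-present entry with that bit clear is reflected to the guest as a #VE rather than causing a VM exit. Below is a small sketch of the resulting nonzero init value; the function name is assumed for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit 63 of an EPT entry: "suppress #VE".  A fully zeroed not-present
     * entry leaves this bit clear, so the fault would be reflected to the
     * guest as a #VE instead of causing an EPT-violation VM exit. */
    #define EPT_SUPPRESS_VE  (1ULL << 63)

    /* So for TDs, empty shared EPT entries must be initialized to a nonzero
     * value that keeps #VE suppressed, and KVM clears the bit only for the
     * entries, e.g. MMIO ranges, it wants reflected into the guest. */
    static uint64_t ept_init_value(bool is_td_shared_ept)
    {
            return is_td_shared_ept ? EPT_SUPPRESS_VE : 0;
    }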
1GB pages, on the other hand, will require a little bit of complexity, as there's some extra bookkeeping that needs to be done to first build the memory region using 2MB pages and then promote to a 1GB page. TDX also supports host page migration, for example to support NUMA balancing, as well as page promotion and demotion, for example page migration for transparent huge pages to collect fractured pages and promote them into a 2MB page.

Status for Linux, KVM, and QEMU: the basic functionality in KVM is code complete, where basic here means we can build a TD with an enlightened guest virtual BIOS and enlightened kernel, boot that guest kernel, which in this case is a Linux guest, and have full access to all the virtio synthetic devices, so we can access the network and disk, et cetera, all within the guest. QEMU is functional but not as code complete as KVM. On the KVM side, there are 40-plus files changed, a little over 6,000 new lines of code, and 700 changed lines of code, give or take, most of those changes being the wrapping of VMX's x86 ops. The code for KVM is publicly available on Intel's GitHub, and QEMU will follow shortly.

In the near future, so effectively prior to upstreaming, on our to-do list are large page support, especially 2MB pages, host page migration, and the unmapping of guest private memory. Longer term, our top priority is live migration, and we also have line of sight to nested virtualization. Nested in this case means nesting legacy VMs inside trust domains; we are not planning on supporting trust domains nested within other trust domains.

Last but not least, if you want a bit of light reading, the specs for TDX are available on Intel's website. Thank you, and I hope you have a great virtual KVM Forum in 2020.