All right, thanks. Yeah, so my name's Josh, and I work in kind of a different direction than everyone else here, I think. I work on a team called Virtualization Security at Google, and we try to break out of VMs. Typically we're looking at VM isolation from the escape point of view. We've had a couple of these joint security reviews at Google for confidential compute, and I helped out with the latest one, with Intel TDX, where we tried to verify and look for bugs in these systems before we deploy them. So I'll be talking about how we went about these different reviews. I'll focus more on the TDX one, since that's what I actually worked on, but we published public reports on both, so if you want more details there's plenty to read. OK, so for today: I'll briefly go over confidential computing, since it sounds like everyone here is an expert, but I will touch on the threat model we used. This is defined by the vendors and also by our understanding of that threat model. Then I'll give a quick summary of the AMD work. Like I said, I didn't work on that directly, so I won't go into the details. Then we'll spend most of the time going through TDX: some background on how TDX works (this was specifically TDX 1.0, the current implementation), some of the tools and techniques we used for finding vulnerabilities, and what we found. We'll finish with a summary of common risks we've seen between AMD's and Intel's implementations, a forward-looking view of what's coming, and our thoughts on the risks with confidential compute implementations generally. Like I said, this has been a long-term project for Google. We've worked with AMD and Intel; it started back in 2021, I believe.
And we took the role of the auditor here, going through and reviewing these systems to help build trust. The whole idea was we would publish a report and also push for open sourcing of these code bases, so that others can go and do the same kind of review that we did. So this was a white-box security audit. We worked with the vendors, with the architects for the systems and also the engineers implementing the firmware. And it was impactful: with AMD we ended up getting CVEs, because that review was done after the hardware had already launched, and with Intel we did it as the hardware was launching. Over 30 vulnerabilities were found between the two, all of which are fixed as of now. Like I said, there were two reports that we published, and as of a few weeks ago, I think the AMD and Intel firmware for both of these is now open sourced. So you can basically go and reproduce what we did. And the code has been updated since we reviewed it, so maybe there are more bugs. Yeah, so like I said, I'm just one piece of a larger team. We partnered with Google Project Zero, and the first line here is the four people that worked on the two reviews. Then there's Google's internal confidential compute team, the folks actually developing confidential compute solutions at Google, and also tons of great engineers at AMD and Intel. This slide is taken, I think, from the AMD white paper, but it applies to both. The idea here is that your TCB is now much smaller: you only really need to trust the hardware, the firmware, and your own VM that you're running in. But by shrinking the TCB, you've now expanded the capabilities of the attacker.
The attacker can run in BIOS, in SMM, in all these different super-privileged modes that can do all kinds of crazy things with the hardware. Those are all within scope, and that really opens up a lot of opportunity for different vulnerabilities than we're used to looking for. Yeah, so obviously confidential compute is a paradigm shift where you hand control back to the customer, and you have an opportunity now for more innovation beyond just lift-and-shift of VMs to trusted VMs. People are also building new kinds of tools and models on top of confidential compute. So, the threat model in general for the two technologies we looked at: we have the deprivileged host OS. The host no longer controls the VM itself the way it used to; the lifecycle and everything is managed either by the TDX module or the SEV-SNP firmware image. It does still control scheduling of the VMs and things like that, so you have to worry about what it can do there. But it has tons of capabilities: it can change system configuration, like how DRAM works, and there's a fairly large API in each of these, right? Behind each of these APIs is an implementation that could have bugs. And you have device control as well. This is all within scope. Basically, the only thing that's not in scope is physical access: actually interposing on the machine, doing voltage glitching, things like that. At least currently that's outside of scope. On the right-hand side here, you have this large hardware and software state space on the host side, and a very large state space on the guest side. These are all the different configurations you can put the system into, basically.
And you have a fairly complicated implementation and design for how these confidential compute solutions actually work, which leads to a small but rich attack surface. OK, so for AMD, just to give some overview of how this works: they have the secure processor, which sits off to the side of the x86 cores but still within the SoC, and it runs the firmware that actually manages confidential compute there. Critically, they have the RMP, the page ownership table, and the nested page tables, which prevent the x86 host from tampering with the guests. And there are hardware components as well, like the microcode and the IOMMU, that orchestrate together to actually secure everything. This next part applies to both AMD and Intel. Our strategy for going at this and looking for bugs had a few different pieces that seemed to work really well. One was invariant analysis: the implicit or explicit security invariants that the designers of these systems believe to be true. You figure out what those are, either by talking to the architects or by reading the specs, and then ensure they actually hold under all the different conditions I was talking about before, all the complex configurations you can put the system into. I'll talk in the next slide about how we went about doing a layered crypto review. Also, and this is general advice for finding security bugs, identifying places where people made performance-versus-security trade-offs will almost always expose some kind of security bug they made in the process. And we saw a lot of security checks in this code, so we looked at what they're checking and then tried to deduce what they're not checking, and made sure those checks are actually done at runtime, not just during testing.
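The invariant-analysis idea can be sketched in a few lines. This is a toy model I've written for illustration: the ownership table, the operations, and the invariant ("a page never moves directly from one guest to another; it must pass back through the hypervisor") are simplified stand-ins, not AMD's RMP or Intel's design. The point is the shape of the analysis: state the invariant explicitly, then drive the model through every reachable configuration and check it after each transition.

```python
# Toy "invariant analysis": state a security invariant explicitly, then
# exhaustively drive the model through host-chosen operation sequences
# and check the invariant after every transition. The table layout and
# rules below are illustrative, not any vendor's actual design.
from itertools import product

HV, GUEST_A, GUEST_B = "hv", "guest_a", "guest_b"
NUM_PAGES = 3

def assign(table, page, owner):
    # Ownership rule under test: a page may only change hands when one
    # side of the transition is the hypervisor.
    if table[page] == HV or owner == HV:
        table = list(table)
        table[page] = owner
    return table

def legal_transition(before, after):
    # Invariant: no page ever moves guest-to-guest directly.
    return all(b == a or HV in (a, b) for a, b in zip(before, after))

ops = list(product(range(NUM_PAGES), (HV, GUEST_A, GUEST_B)))
checked = 0
for seq in product(ops, repeat=3):          # all 3-step host sequences
    table = [HV] * NUM_PAGES
    for page, owner in seq:
        new = assign(table, page, owner)
        assert legal_transition(table, new), f"invariant broken by {seq}"
        table = new
        checked += 1
print("checked", checked, "transitions")     # checked 2187 transitions
```

On a real target the "model" is the firmware itself and the state space is far too large to exhaust, which is why you lean on the architects and the specs to find the invariants worth checking.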
And as you'll see with TDX in particular, there's a lot of complex interaction between a lot of different components. For tooling, there's a tool called Wycheproof that Google's crypto team developed a while back. This is really nice; I hadn't used it before. You can plug a crypto library into this tool and it will run through a series of inputs that try to exercise known weaknesses and known flaws that happen in crypto libraries. And I believe on the AMD implementation this ended up exposing some bugs. On AMD we also had access to hardware, so we used the PCIe Screamer, a hacker-oriented PCIe card that you can plug in and use to send arbitrary DMA requests, which is really useful for actually testing how the IOMMU works. So for the crypto review: generally we would say you want to review the whole stack, not just the implementation and not just the high-level design. At the protocol level, they're trying to build some kind of system to form a secure channel, and you want to make sure it gives you all the different properties you're looking for: secrecy, authenticity, and the rest. Below that, and I think for AMD in particular this is where we found some crypto bugs, is the algorithm selection and the configuration of those algorithms: making sure everything actually fits vetted industry best practices. And then the implementation, right? You can pick the best protocol and the perfect algorithm, but if you don't implement it correctly, or you have some kind of side channel in your implementation, that defeats the whole set of building blocks you were building on. And this is definitely a bigger concern with confidential compute, where secrecy is such an important facet. Looking for side channels is even more important than it already is in the rest of security.
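To show the flavor of a Wycheproof-style run: the real project ships JSON files of labeled test vectors ("valid"/"invalid") and checks that the library under test agrees with each label. The sketch below mimics that shape in plain Python, using the stdlib hmac module as a stand-in for the library under test; the vectors are computed inline for this sketch rather than taken from the real suite, and the truncated-tag and bit-flipped-tag cases are two of the classic weaknesses such suites probe for.

```python
# Wycheproof-style harness sketch: run labeled test vectors against a
# verification routine and flag any disagreement. hmac here stands in
# for the library under test; vectors are computed inline, not drawn
# from the real Wycheproof corpus.
import hmac
import hashlib

def verify_mac(key: bytes, msg: bytes, tag: bytes) -> bool:
    # Routine under test: recompute and compare in constant time.
    expected = hmac.new(key, msg, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

good_tag = hmac.new(b"key", b"msg", hashlib.sha256).digest()
vectors = [
    {"key": b"key", "msg": b"msg", "tag": good_tag, "result": "valid"},
    # Truncated tag: libraries that compare only a prefix accept this.
    {"key": b"key", "msg": b"msg", "tag": good_tag[:8], "result": "invalid"},
    # Single flipped bit in the tag: must be rejected.
    {"key": b"key", "msg": b"msg",
     "tag": bytes([good_tag[0] ^ 1]) + good_tag[1:], "result": "invalid"},
]

failures = []
for i, v in enumerate(vectors):
    accepted = verify_mac(v["key"], v["msg"], v["tag"])
    if accepted != (v["result"] == "valid"):
        failures.append(i)
print("failures:", failures)    # empty list means the library agrees
```

The value of the real tool is the corpus: years of accumulated edge cases (malformed signatures, weak parameters, off-curve points) encoded as vectors you can replay against any implementation.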
Yeah, so the summary of the AMD review: I think there were a dozen or so vulnerabilities identified. We found different things in the actual cryptography, and some legacy weaknesses. SEV-SNP is built on two previous generations, SEV and SEV-ES, and some legacy weaknesses were identified there as well. There were implementation bugs, and, like I mentioned, we used the PCIe Screamer to find some implementation flaws in how they were handling the IOMMU. But for the rest of this talk I'll focus on TDX. TDX is Intel's confidential compute solution, the VM-based confidential compute solution we've heard about today. It was released with Sapphire Rapids, the current Xeon. In their solution, the trusted VMs are called Trust Domains, or TDs. What else is interesting here: I guess the biggest piece, if you compare this to the previous slide on AMD, where AMD does most of the processing in the secure processor, is that Intel's design is composed of a bunch of legacy technologies plus a few new ones, combined into one new solution. We'll see how this works in some of the other slides, but the big new technology is this thing called SEAM execution mode. This is a new execution mode that you run in when you're inside a TD or the TDX module. So if you take a look here, this is Intel virtualization in a nutshell, the old legacy picture: you have two VMs, isolated from each other in hardware, then a VMM below them that's managing them, and the OS and hypervisor or whatever below that. It doesn't matter here. Then if you add in TDX (this is taken from their white paper), you now have these things called TDs next to them, also still isolated in hardware from each other, but now there's this extra isolation between the TDs and the rest of the system.
And there's this thing called the TDX module, which is basically interposing between your legacy VMM and these TDs. It handles the whole lifecycle of the TDs: adding new memory, doing attestation, all these things pass through the TDX module. To get to the point where the TDX module is actually loaded in memory and running, you also need all this hardware at the bottom to support it. On the right-hand side, we have the boot chain for establishing the root of trust, starting with MCHECK and the NP-SEAMLDR and working its way up to the TDX module. I'll talk about those more in the next slide. So at the end here, you have your trust boundary, the green line I've circled around everything. Essentially, your TCB is everything in here; you're trusting everything inside. And if you're running one of these TDs, then you trust that as well. Everything outside of this is untrusted. And so this leaves us with the attack vectors. This is probably not complete, but these are the big ones, all these red lines and red boxes. Your legacy VMs, in theory, have inputs upward. The VMM has inputs into the TDX module. The TDs have inputs up into the TDX module. And the most complicated ones: the BIOS and OS and SMM, all these other pieces, can touch basically all the other components, put them into weird states, and exercise any of the code that's in them. OK, so we found vulnerabilities in all these highlighted areas, and today we'll talk about these three: the NP-SEAMLDR, the MSRs, and the Uncore. The rest are in the report if you're curious. Yeah, so going over the attacker capabilities: I've probably touched on a lot of this, but they compound, so as you go down you're getting more and more capable. At the beginning you have just a malicious TD, which can go up into the TDX module.
I have a slide on this in a bit, but say there are about 10 APIs on that vector, while the host has around 50. The host has way more opportunity to exercise bugs than the TDs, but the TDs do still have some interface. The host can mess with the TDX module, but it can also configure MSRs and other things to change the system configuration as well. And finally, the BIOS has even more privileges: it can touch MSRs and Uncore registers that your operating system can't. In theory, if you had a physical device plugged in, or had compromised, say, a GPU, then you might be able to send arbitrary PCIe packets on the bus and potentially try to access private memory that way. So those are the main capabilities. The initialization for TDX, just to make clear what these components do: we have trust rooted in silicon by fuse keys, and that's what authenticates the MCHECK blob here and the NP-SEAMLDR blob here. MCHECK is the one component that we did not review. And I should state the scope of the review (I might have had it on another slide): it was basically just these components, due to how we scoped the project. We didn't really look at attestation on the SGX side; we looked at how attestation functioned within these components only. MCHECK and the ACM are bundled as part of the microcode update. From my understanding, they look at the system as configured by the BIOS, make sure memory regions aren't aliased on top of each other, and do some other checks, to make sure the system is in a somewhat sane state before proceeding. The BIOS is in charge of loading that, and then later the VMM loads the NP-SEAMLDR, the non-persistent SEAM loader.
Its job is really just to bootstrap the persistent SEAM loader, the P-SEAMLDR, whose job in turn is to load the TDX module. I think these are split out so that you can upgrade over time and swap some of the later modules out to upgrade them. OK, so to start with the NP-SEAMLDR. Like I said, this is the first stage loaded, and the way it works is: we're at the point now where we've booted into the OS, but the OS is not trusted, right? It's outside the TCB, so we need to establish some kind of trusted code. The way they do that is by leveraging a legacy technology called ACMs, authenticated code modules. These are modules signed by Intel; there's a fused key that's used to authenticate the code. You run an x86 instruction that jumps into the module, and as part of that instruction it does the authentication. These exist already; there are a few different usages for ACMs. One design concern we had here is that all the ACMs have the same privileges, so finding a bug in any ACM gets you the same kind of privilege over the system. So instead of having to review just the NP-SEAMLDR, ideally we would have to review all the different ACMs. And the very core of the attestation measurement chain starts with the NP-SEAMLDR: there's a register it writes to record a measurement of the persistent SEAM loader. It also has access to the private memory that everything else runs in, so breaking the NP-SEAMLDR breaks the rest of the TDX chain. But we started looking at this, and it's actually a really small attack surface: you have almost no input into this blob. You're just supposed to run it, and it copies in the next payload. That's basically all it does. There's no init; it's one-shot. You just run it, and there's no runtime API you can talk to.
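The measurement chain anchored by the loader follows the classic extend-only register pattern, which can be sketched briefly. To be clear, this is the general pattern, not the actual TDX register layout: the hash choice (SHA-384), the stage names, and the extend formula here are illustrative.

```python
# Sketch of an extend-only measurement register: each boot stage hashes
# the next stage into the register before handing over control. The
# register can only be extended, never set, so a later (possibly
# compromised) stage cannot erase what was recorded about it.
# SHA-384 and the stage names are illustrative choices only.
import hashlib

def extend(register: bytes, payload: bytes) -> bytes:
    # new_value = H(old_value || H(payload))
    digest = hashlib.sha384(payload).digest()
    return hashlib.sha384(register + digest).digest()

register = b"\x00" * 48                       # reset value at boot
for stage in (b"p_seamldr_image", b"tdx_module_image"):
    register = extend(register, stage)

# A remote verifier replays the expected images and compares.
expected = b"\x00" * 48
for stage in (b"p_seamldr_image", b"tdx_module_image"):
    expected = extend(expected, stage)
assert register == expected

# Substituting any stage changes the final value, so it shows up
# in attestation -- as long as the measuring stage itself is honest.
tampered = extend(extend(b"\x00" * 48, b"evil_loader"), b"tdx_module_image")
assert tampered != register
print("measurement chain verified")
```

The caveat in the last comment is exactly the runtime-compromise problem that comes up later in the talk: if the stage doing the measuring is itself compromised at runtime, it can record whatever it likes.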
So this just shows how we found things here: there's weird state you can put the system into as the host, and that's how we ended up finding some interesting bugs. To get into the details of ACMs, there's this instruction, GETSEC[ENTERACCS], that transitions from your host OS down into the ACM. And like I said, this is a legacy technology. It was really meant to be called from the BIOS, I believe, where you're already in 32-bit mode, but now we're calling it from an OS, and all modern OSes are 64-bit. So Intel had to add new transition code into this flow to take you from 64-bit mode down to 32-bit mode and then back up at the end so that everything works nicely. And in that path there's new code, and new opportunities for bugs. One other thing that makes ACMs extremely hard to exploit is that they run in an extremely constrained environment. They run out of cache instead of memory, and they disable all the other cores while they're running, so you don't have the opportunity to mess with the memory or the CPU while they run. They basically run in isolation, which also means disabling interrupts and disabling exceptions. And so we were curious about how they actually disable these interrupts and exceptions, and looking into this is where we found the first vulnerability in the NP-SEAMLDR. This is called the exit-path interrupt hijacking. The way they disable interrupts is to clear out the interrupt descriptor table register, the x86 register that points to what's essentially a vector of function pointers. So on entry they disable interrupts, and on exit they re-enable them. And you, as the OS, get to specify where the interrupt table is, because you want to resume back to your old interrupt table. You also get to specify a few other registers.
You get to specify the GDTR, the RIP you're going back to, and some other things. The issue here was that they re-enable interrupts first. In the example source code in the bottom right, they enable interrupts, and then the rest of these instructions basically have no opportunity to cause exceptions, except for LGDT: if you pass in an invalid address, you can cause LGDT to raise an exception. So now you trigger an exception with your interrupt table loaded, and you can basically get code execution while still running inside the ACM context. From there you can just tamper with and compromise the rest of the TDX flow. Let's see, I'll skip past this. So, the TDX module is actually the thing we spent the most time on, and we actually didn't find too many bugs there. Like I mentioned earlier, there are about 10 APIs coming in from the guest and, say, 45 or so coming from the host. We found four bugs, mostly around the TLB tracking: they have this very complicated system for managing memory and managing cache coherency across the different TDs. So we spent a lot of time looking through that and found a couple of minor bugs there. Yeah, the rest of this I'll quickly go over. MSRs and Uncore registers are control registers on x86: MSRs control the CPU, and Uncore registers control everything in the SoC other than the CPU. Here we had more of a design concern: there's a privilege inversion compared to normal virtualization. Normally, if a VM tries to access sensitive data or sensitive control registers, the hardware has a way to intercept and filter those accesses. But with TDX it's kind of the opposite: the host OS can change all these registers, and the TDX module has no way to intercept the changes.
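The exit-path ordering bug can be captured in a tiny state machine. This is a toy model I've put together to show the logic, not real x86 semantics: the essential point is that once interrupts are re-enabled, any fault is delivered through whatever IDT the untrusted OS supplied, so a fault raised by LGDT on an attacker-chosen bad descriptor hands control to attacker code while the CPU is still in the privileged ACM context.

```python
# Toy model of the exit-path interrupt hijack: STI (re-enable
# interrupts) runs before LGDT, so a fault from LGDT on a bad
# descriptor is delivered via the attacker-controlled IDT while still
# in ACM context. Abstract state machine, not real x86 semantics.

class ToyCPU:
    def __init__(self, os_idt_handler):
        self.interrupts_enabled = False
        self.idt_handler = os_idt_handler   # supplied by the untrusted OS
        self.context = "acm"                # currently inside the ACM
        self.hijacked_in = None             # context the handler ran in

    def fault(self):
        if self.interrupts_enabled:
            # The OS-supplied handler runs in the *current* context.
            self.hijacked_in = self.context
        else:
            # With exceptions still masked, a fault cannot reach
            # attacker code; the machine just goes down.
            self.context = "machine_shutdown"

    def exit_path(self, gdt_base_valid: bool, sti_first: bool):
        for step in (["sti", "lgdt"] if sti_first else ["lgdt", "sti"]):
            if step == "sti":
                self.interrupts_enabled = True
            elif step == "lgdt" and not gdt_base_valid:
                self.fault()
                return
        self.context = "os"                 # clean handoff back to the OS

# Vulnerable ordering: attacker handler executes in ACM context.
cpu = ToyCPU(os_idt_handler="attacker_code")
cpu.exit_path(gdt_base_valid=False, sti_first=True)
print("vulnerable order ->", cpu.hijacked_in)   # runs in "acm"

# Safer ordering (LGDT before STI): the fault never reaches the handler.
cpu2 = ToyCPU(os_idt_handler="attacker_code")
cpu2.exit_path(gdt_base_valid=False, sti_first=False)
print("reordered ->", cpu2.hijacked_in)         # None
```

I'm not claiming this reordering is exactly how Intel's fix was implemented; the model just shows why the window between STI and the last faultable instruction is the whole bug.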
So we went through and reviewed that, and we basically categorized these MSRs in three different ways. Some only affect the same hardware thread that you're running on. The diagram on the right is a timeline: if you change a thread-specific MSR, the TDX module, in theory, the next time you call into it, could check whether that has changed. There's still kind of a window there, but it's not that bad. The bigger concern is the core-specific or platform-specific MSRs: while the TDX module or a TD is running, these can just change out from under it. Some of these are things like speculative execution controls; some of those mitigations are done through MSRs. And if one of those were core-specific or platform-specific, it could be changed out from underneath the TDX module. Then there are also things like ECC, which I won't get into. We have slides here, but basically the concern is that if you can disable ECC, that might be the only mitigation you really have against Rowhammer. Feel free to talk to me afterwards if you want to know more about that. Let's see. Yeah, I'll quickly go over side channels. This was definitely a larger concern. Side channels are already a large concern, but with confidential compute any leakage of information is going to be much more impactful than with regular VMs in general. We had a couple of different side channels that we looked for. The most interesting is this access oracle. There are a couple of different ways we found to get information about what the TD is doing based on its memory access patterns, and we have three different primitives, the best of which is the MONITOR/MWAIT instruction pair, which lets you get cache-line-level information about which data the TD is accessing.
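To make the access-oracle idea concrete, here's a toy simulation. It's pure Python with no real MONITOR/MWAIT underneath; the oracle class just models the property described above: the attacker arms a watch on an address range and learns which 64-byte line the victim touched, never the data itself. Against a secret-indexed table lookup, that's already enough to recover the index.

```python
# Toy cache-line-granularity access oracle in the spirit of
# MONITOR/MWAIT: the attacker learns *which* 64-byte line the victim
# touched, but not its contents. Pure Python model, no real hardware.

LINE = 64  # cache-line size in bytes

class AccessOracle:
    def __init__(self):
        self.armed_line = None
        self.triggered = False

    def monitor(self, addr: int):
        # Arm a watch on the cache line containing addr.
        self.armed_line = addr // LINE
        self.triggered = False

    def observe(self, victim_addr: int):
        # A victim access "wakes" the waiter iff it hits the armed line.
        if victim_addr // LINE == self.armed_line:
            self.triggered = True

def victim_lookup(table_base: int, secret_index: int) -> int:
    # Secret-dependent table lookup: the address touched encodes the
    # secret, one table entry per cache line (the classic leaky shape).
    return table_base + secret_index * LINE

table_base, secret = 0x1000, 5
oracle = AccessOracle()

# Attacker sweeps the table one line at a time and watches which probe
# fires, recovering the secret index without ever reading the data.
recovered = None
for guess in range(16):
    oracle.monitor(table_base + guess * LINE)
    oracle.observe(victim_lookup(table_base, secret))
    if oracle.triggered:
        recovered = guess
print("recovered index:", recovered)   # 5
```

In practice noise, prefetching, and scheduling make the real attack harder than this loop, but the granularity of the leak is the same.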
Now, you don't know the contents of the data, but you do know the address of that data, and depending on the workload you might be able to derive more information from there. So this is another reminder that you should use constant-access operations, especially in your crypto: you want not only constant time but constant access as well. OK, so I'll wrap up. For the review, I guess the main point here is that all of this is open source now. The spec and the source are at this link, and Intel has a bug bounty if you're also interested. As well as that: we reviewed TDX 1.0, which was the initial platform, and now they have specs for 1.5 and 2.0, I believe. These cover a lot of things, including live migration, which is going to involve a lot more complicated machinery than what we already looked at. Live migration is already complicated without confidential compute. And IO, which I think the previous talk mentioned, is also going to add more complexity: now we have other devices that potentially have confidential compute solutions of their own interacting with the CPU's confidential compute solution. There's just a lot of complexity here. There's also a lot I didn't cover; feel free to look at the report. To highlight some common risks I've noticed between the AMD review and the Intel one: unfortunately, all of this code is still written in C or C++, right? So your entire TCB is either hardware that you can't look at, or memory-unsafe languages where a single bug is potentially game over. I should mention that in the TDX module they do actually use quite a few software security mitigations, and that helps a little bit, but you're still standing on top of a memory-unsafe language. I believe in both cases there's also a dependence on the cache and the registers as a security boundary.
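The constant-access advice above can be illustrated in a few lines (illustrative Python; real constant-access code does this with masked fixed-stride loads in assembly or C, but the access pattern is the point):

```python
# Constant access illustrated: the leaky lookup touches one
# secret-dependent entry; the constant-access version touches every
# entry every time and selects the wanted one arithmetically, so the
# trace of touched lines is identical regardless of the secret.

def leaky_lookup(table, secret, trace):
    trace.append(secret)            # touched line depends on the secret
    return table[secret]

def constant_access_lookup(table, secret, trace):
    result = 0
    for i, entry in enumerate(table):
        trace.append(i)             # every line touched, every time
        mask = -(i == secret)       # all-ones when i == secret, else 0
        result |= entry & mask
    return result

table = [0x11, 0x22, 0x33, 0x44]

t_a, t_b = [], []
leaky_lookup(table, 1, t_a)
leaky_lookup(table, 3, t_b)
print("leaky traces differ:", t_a != t_b)        # True -> observable leak

t_a, t_b = [], []
va = constant_access_lookup(table, 1, t_a)
vb = constant_access_lookup(table, 3, t_b)
print("constant traces equal:", t_a == t_b)      # True -> nothing to see
print(hex(va), hex(vb))                          # 0x22 0x44
```

An access oracle like the one in the previous section sees the first function's secret immediately and learns nothing from the second.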
Your data stays in plaintext in the cache and in the registers, and only gets written out as ciphertext when it goes through the memory controller to external memory. And so, I don't know, this is an area where historically, yes, these have been security boundaries, but there's more and more research into cache-based side channels, and there have been recent bugs that leaked register contents between threads as well. So I'm not super confident that this is actually a robust security boundary; there's probably something that could be done better there. We mentioned the heavy crypto usage: there's just a lot of opportunity to make mistakes throughout that design. And like I mentioned, the hardware configuration is the biggest one. It's easy for me to go through and review all the code and think about whether there are memory corruption bugs. It's much harder to go through and think about all the thousands of hardware control registers, many of which are undocumented, what they actually do to the system, and whether they would break any security invariants. So this is definitely a challenge. And I guess the last point is that a runtime compromise, like that NP-SEAMLDR compromise, is not attestable: you can compromise it and then just load the next stage and attest whatever you want at that point. A lot of these runtime compromises, I'm not sure we could currently detect. OK, yeah, but I've talked a lot about bugs. I think in general, compared to normal traditional VMs, this is a significant improvement, right? It deprivileges the host by design; the cost of a host-to-guest attack is extremely high compared to what it was without confidential compute.
Obviously the collaboration that we had with Intel and AMD was very fruitful, so we would definitely encourage more of this with other vendors and other users of these hardware devices. And the open sourcing of this has been really great. We pushed hard for both of these to be open sourced, and I think at this point at least the AMD and Intel firmware is open sourced and you can look at it. And yeah, I just keep saying complexity, complexity. We're going to have a long tail of growing pains here as we get used to this and work out all the low-hanging fruit in the bugs. So getting to a point where we can roll out new TCB versions, quickly patch and update, and update our attestation expectations is definitely going to be useful. So I'll stop there if there are any questions. Thank you very much indeed. I'm a big fan of not believing something works until it's been tested and you've tried to break it, so thank you, and breaking it is even better. Any questions that have come out of that? Want to raise your hands? Lots and lots. The slides presumably will be available for everyone to look at. So here we are, from our Intel friend. I'd just like to endorse the work that you guys did. I think the relationship that we had through this was phenomenal, very, very open. And as you said, I think this is the way forward. This sort of openness, as we're starting to build infrastructure in the hyperscalers, is absolutely critical. As I said earlier on, this is the infrastructure that's going to enable the platform services to monetize the cloud, and we have to be open from the start. So I think it was a great relationship, and hopefully we keep doing it with these new levels of complexity, as you described it.
Yeah, I should say too: I think the slides don't link it, but if you go to the Google page where we published this, we also worked with Intel's internal security team, and they did a parallel version of this review and have their own report that talks about their view of the security there. So yeah, likewise, it was great. Another question. You mentioned TCB patching. Is that something that's already supported in these systems at AMD and Intel, or is some of this stuff locked down and hard to change because it needs to be secure? I'm not as familiar with AMD, but I know for Intel this is supported. I don't know how quickly, though. It's not just Intel, it's Google, it's the whole infrastructure of rolling this out and everyone updating their whole infrastructure around it. I don't think it's as agile as it could be. Especially if you think about when we ourselves report vulnerabilities: we typically try to do a 90-day disclosure policy or something like that, and I'm not 100% sure the whole process could happen in 90 days today. But they do support updating the TCB. But that update mechanism itself needs to be super secure, right, however you patch the TCB? Yes, and that was part of what we reviewed: how the SVNs update, and the whole TCB recovery process. Rolling out changes for stuff on die is a bit more difficult, at least. Yeah, hardware is not really patchable. Excellent, any other questions? Oh, please. Thanks for the great talk and the great work. I'm really curious: did you find any issues that didn't make it into the report, where it would be like, well, it's an issue.
However, it's not part of the threat model? I can imagine it was a dialogue, and in some cases it could be argued, well, it's not part of the threat model even though it is an issue. There have been such discussions with SGX earlier on, I believe, where there were some issues and somebody said, well, it's not part of the threat model, but it's still a really bad thing. Yeah, nothing comes to mind. I'm trying to think. There are just things on the edge of the threat model, more physical attacks, and some of those we didn't really look into because, I don't know if I mentioned, for Intel we didn't have access to hardware; we had to use a simulator because it was a little earlier in the process. So for things like SGX, there was Plundervolt, where they were using the voltage controller on the motherboard to glitch the CPU. This is something we were also curious about with TDX, and it's probably technically out of the threat model but still something that's interesting. But we didn't have hardware. So yeah, that would be interesting for future work. All right, thank you very much. Thank you.