I'm Brandon Weeks, and this is Matthew Garrett. We are both on Google's internal platform security team. So we're responsible for securing all of the devices that are used at Google, which includes all of our Linux workstations. Today we're going to be talking about how we're using remote attestation at Google, specifically how we're using TPM-based remote attestation, and the tools we've developed to actually make this useful for us. What are our goals? Our goals are to remotely attest to the identity and state of every client and server we have at Google. We have over 200,000 machines across many different platforms. The goal of this is to strengthen BeyondCorp. Our BeyondCorp model relies on user and device identity. Device identity is currently not something that is always cryptographically backed or attested to, and that's something we're trying to change. This also enables new workflows for our IT team. For example, we would like to be able to drop-ship clients or servers across the world with no image on them whatsoever, have them remotely attest, and deliver a secure image without ever having them go through an inventory process or have an inventory technician touch them. So across our various platforms, we're implementing this in different ways. Android introduced key attestation in Oreo, and we are using that to attest all of our Pixel phones that support key attestation. Chrome OS has the Verified Access API, an API that allows you to access essentially the TPM primitives through a much simpler method, and if you are managing your Chrome OS devices with G Suite, you can actually attest to their identity. For Mac and iOS, Apple unfortunately doesn't expose the attestation primitives that exist in the Secure Enclave to us, so we can't actually use this. If anyone works at Apple and is in the audience, please help. And for Linux and Windows, we have trusted platform modules.
And that's what we're here to talk to you about today. So what are the requirements for the solution that we are building? We want identity attestation: we want to know the unique cryptographic identity of every machine we own. We also want state attestation. Right now, we are just attesting to the static core root of trust, and we haven't yet thought about how we're going to begin doing dynamic root of trust measurements with IMA or any other functionality there. Our use case is every Linux and Windows machine used at Google. We have over 200,000 of these, so this is a large problem with many, many edge cases. We have to support TPM 1.2 and 2.0, because we have a non-trivial number of TPM 1.2 devices that would cost a lot of money to replace. We are building this in Go because that is one of the better supported languages at Google. It's memory safe and just the preferred development environment for our team. And we want the foundations of this to be open source. Some of the implementation-specific details are very tied into Google infrastructure, and we're not able to release them, but we want the basis of this to be useful externally, and we'd love to see it consumed by other companies and users. So what have we done so far? Also, just to be clear, when we're talking about the number of machines at Google that we're doing this for, we are doing this for the systems on the internal corporate network; we're not talking about the production environment. In terms of what we've done so far, one of the core things that we needed to do was add support for obtaining the crypto-agile log format for systems running TPM 2. This turns out to be a surprisingly awkward problem for moderately legitimate reasons. In the TPM 1.2 days, the event log was contained within a static area of RAM that was allocated by the firmware, but which then remained available throughout the lifetime of the system.
The downside of that is that that memory was allocated and was never available to the operating system. So if you didn't care about the event log, you were still losing that RAM forever. The implementation in the TCG2 UEFI interface to TPM services is to allocate the event log in what's called boot services data. That's memory that is freed at the point where the operating system calls ExitBootServices to mark the transition from the boot environment to the runtime environment. So we had to copy the event log from the boot environment up to the kernel environment, which involved some slightly awkward code. Because when you make the firmware call that tells you where the event log is, it gives you a pointer to the beginning of the event log, and it gives you a pointer to the beginning of the final entry of the event log. It doesn't give you a pointer to the end of the event log, which means you then have to compute the size of the final entry, which is variable, and then copy that up. Things were then made slightly more awkward again because, when I said the event log goes away after you call ExitBootServices, ExitBootServices itself generates an event, which is clearly incompatible with this. So when you make the call to get the event log, an additional log is instantiated, all further events go into there, and that's available at runtime. The problem there is that if the get-log call is made before the kernel makes it, for instance, if a bootloader does this, then you end up with events that are both in the event log that will be discarded and the event log that will be persisted, so we also had to add code to remove duplicate events. That's all landed in 5.3, so if you use kernel 5.3, you should be able to get the crypto-agile event log.
In terms of the Go code that we've written for attestation purposes, we now have a functioning client and server, and as Brandon mentioned, we're not releasing our internal versions of those because they're fundamentally tied to Google infrastructure and nobody would be able to run them anyway, but we have demonstrated internally that we can do this at scale. We now have over 25,000 systems that are successfully generating attestation events, and we are verifying those and parsing out the event log. We have code for verifying the quotes, code for parsing the event log, and then replaying the event log and verifying that the event log matches the quote, and we heard yesterday about other implementations that are also doing this. So this is not itself particularly exciting, but I think this is one of the first implementations that has been demonstrated to be working at large scale. The stuff we want to do next, though, is, well, an event log is itself not particularly useful. You have a huge amount of information, especially on Windows platforms, which I'll get to in a second, but that's not the information you want directly. You want to be able to look at the event log and make determinations like: is this system running firmware I recognize? Is this system booting a bootloader that I recognize? Does this system have various other security characteristics that I require for one of these systems to have access to internal resources? So what we're still working on is a meaningful way to take the event log and turn it into platform attributes, like "good firmware" or "had disk encryption enabled", that sort of thing. And doing that requires us to have a solid idea of what things like valid firmware are. And this needs to be an industry-wide effort.
To an extent, we can get away with just looking at, okay, we have this many machines, we have this many copies of firmware, and we can maybe go, well, okay, if we have 10,000 machines running this specific firmware, then either this firmware is legitimate or we're going to have a really, really bad time. So we're hoping the former. But realistically, we need this information to come from vendors, and we need to have a reliable way to get it from the vendors to us and to the rest of the world. And also, ideally, we want to move from our current situation, where the device identity is frequently stored on disk in our systems, to the device identity being backed by the hardware, and for us to have full key certification, so we can verify that this key was generated on a TPM that we trust. Some of that is implementing key certification properly, but the other part of it is ideally making use of platform certificates, so we're able to verify that the machine that came from this vendor with this serial number is expected to have this TPM. And then by looking at that, we can say, okay, this machine was shipped to us, it has this TPM, the manufacturer of the machine says so, and now we can verify that the device identity that we have created on that machine's TPM is associated with the appropriate device, and so should have the amount of access that this specific computer was supposed to have. For Linux, the information that we have is better than it was this time last year. GRUB 2 now has support for performing measurements of boot components. So obviously we have whatever measurements the firmware generates itself, and so for PCR 0 and PCR 2, that's information about the firmware and information about the firmware in option ROMs on plug-in cards. We also have the hash of the bootloader. So for systems that are using shim and Secure Boot, shim will also measure the next-stage bootloader, so you don't have a gap between shim and GRUB.
We have the Secure Boot key state that's measured into PCR 7 by the firmware, and shim also measures its Secure Boot state and the key certificates that it's using to verify second-stage stuff. So, if you are using Secure Boot, you have a log of the certificates that were used at every stage in the Secure Boot chain. GRUB is measuring the kernel and the initramfs. GRUB is also measuring the kernel command line, because the kernel command line, it turns out, is kind of relevant from a security perspective. You have the ability to do things like override the IOMMU configuration, and if you've got a physically present attacker and DMA-capable ports, and someone's able to turn off the IOMMU when you're expecting the IOMMU to provide protection for your system, that protection's gone out the window. So some amount of security-sensitive information is contained within the kernel command line. And then, obviously, we have anything that IMA is measuring, which is going to depend on local IMA policy. So here's the code released thus far. Matthew Garrett's patches to the kernel have landed in kernel 5.3, so if you're running that version or later, you can actually get a TCG2 event log. We also have released three different GitHub repositories that contain the foundations for our implementation. The first, go-tpm, is actually primarily developed by a different team at Google working on cloud security. It contains the low-level methods for communicating with the TPM, and doesn't contain any of the higher-level ways of doing things like quoting or key certification. go-tpm-tools is where we're storing our test data and some test clients; it's useful for just messing around with your TPM. And go-attestation is our primary project, which builds upon go-tpm to actually do the higher-level operations, such as key certification, quoting, grabbing the measurement log, credential activation, and everything you need to actually use TPMs effectively.
But, obviously, this wouldn't be a particularly convincing discussion session if we didn't have anything to discuss. So, right now, there's something of a functionality gap between Linux and Windows. Windows, when it boots, logs a lot of additional metadata in the event log, and we now have code for parsing that out. So, for instance, on Windows, we're able to verify whether a system booted with an encrypted disk, and if it did have an encrypted file system, whether the key for that file system was backed by the TPM or whether it was something that the user typed in. So, that allows us to verify that a system meets certain aspects of our internal security policy before we grant access. You can't just take a system, throw a fresh copy of Windows onto it with a different security posture, and still retain access to internal resources. We also get information about which components were loaded during the boot process. So, for Windows, that includes stuff like: was Windows Defender started in the early launch phase, as in before any non-Microsoft components were loaded? And that's something that we want to be able to guarantee in order to say, okay, if someone was able to get malware onto the system, then Windows Defender was launched before that malware had an ability to do anything. The equivalent for Linux would largely be: did we load our security policies? So, if we're using IMA, did we load the IMA policy before we started launching any untrusted code? Because otherwise, something could have had the opportunity to tamper with the system before the IMA policy was loaded. And right now, we don't have any of that functionality. So, something we'd really like to hear from people is what further stuff could people make use of, and what is the best way for us to log this? Should this be something that's up to the kernel, or should we trust the initramfs as being part of the TCB and just have userland generate logging events?
Also, for people who are looking to build on this, we are very interested, both here and afterwards, in getting information about what sort of functionality you would like to see in the tooling that we're building, what would make it more useful to you, and anything else. What else can we do to make this more useful in terms of being able to verify system state and device identity? So, with that, I think we'll open this up for questions and discussion. Also, quick note, we'll be giving a presentation on the more holistic aspects of this at the Open Source Summit on Friday at 3:15. So, questions, comments, issues that you'd like to discuss? One challenge with attestation is, how do we know that the information that we are getting from the endpoint is from the actual endpoint that we are interested in? What about if I spoof some values to you, or an intruder pretends to be the machine that you're talking to, and you grant access? How do you handle that kind of situation? That's a little bit awkward. So, in this case, if we have the ability to tie the identity of the endorsement key on the TPM to the device we care about, then when we generate an attestation key, we're able to verify that that AK was generated by a TPM with that EK. So, we have two ways right now, basically, of binding that identity to the machine identity. The first is, we trust on initial enrollment that the machine is the machine it claims to be. The second is, if we have a platform certificate from the platform vendor, then we're able to make that association in a stronger way. So, we know in advance that the machine with this serial number that was shipped to us on this date from this vendor is expected to have this EK. And therefore, if an AK is presented to us that was generated by this EK, we know that the AK came from that device. The quote itself has a nonce in it in order to prevent replay attacks.
So, that gives us a way to verify that the quote is fresh, and we're able to compare the chain of trust back to the original device. So, it shouldn't be possible for someone to give us a fake quote. Ideally, when machines are initially provisioned, the initial provisioning will be associated with the enrollment of the EK into our trust database, if we don't have a platform certificate. Otherwise, if we have a platform certificate, then we'll also validate that the EK is associated with, that we're getting something from the EK that we expected to be associated with, that device. And then, once we have an AK that we know is associated with this machine, we can verify that the quote came from the machine with that EK. Regarding the additional measurements: one additional thing that I see as useful is measuring the signer of the kernel during kexec. Not sure if this is already covered in your lockdown patches or not. Currently, I think only the IMA signature is covered, but most of the distros do not have IMA signatures set on their kernels by default; they are currently just PE-signed. And when you're kexec-ing into a new kernel, measuring the signature of that new kernel that you are kexec-ing into would complete the chain of measurements. Right, so at the moment, I believe that if you're using IMA and kexec-ing, then we will end up with a log of the hash of the kernel, but we won't have a way to just, like with the PCR 7 measurements, say that we trust the signer of the kernel, as opposed to having to care about the exact kernel itself. So, yeah, I think that's a hole at the moment. No, I'm not talking about the IMA signature; in most cases, distros do not have IMA signatures on the kernel. Even if you don't have the IMA signature, you'll still end up with a measurement of the hash of the kernel, which is sufficient for some cases, but not for others.
Yes, when kexec_file_load is verifying the signature, we're not measuring that. So that seems like a legitimate thing to measure. Next question. I have two questions. One, I'm glad to hear you're using the platform cert. Second, have you started looking at the IDevID and LDevID work that's being done? Sorry, which work? IDevID and LDevID, for device identification tied to the TPM. Okay. And then second of all, you were mentioning LUKS doing, like BitLocker, an unlock of something. What were you thinking there? Because, for instance, on OpenXT, we cap the PCR after we've done the unlock from the unseal. Right, so that's a great question. So one of the issues that we have here is that, if we want this to work the way that most Linux distributions currently work, then the unlock occurs in the initramfs rather than in the bootloader. And that means that something at that point needs to push some information to the TPM. So I think it's a reasonably open question whether the thing that should be pushing this is the userland tooling, or whether certain events where we push crypto material into the kernel should result in the kernel itself producing a measurement. Now, in cases where it's going through userland and the kernel has no idea where this material came from, then inevitably we need to do that in the userland tooling. And nominally, since the userland tooling is embedded within the initramfs, the initramfs is trustworthy, because we've already verified it earlier in the boot process. But then obviously, if we're going to argue that the initramfs is trustworthy, then we probably need the initramfs to itself cap the PCRs again at the point where the initramfs is transitioning to the live system. And the right point to do that at is not something that's really settled yet. But that's an excellent point.
So yesterday, Paul said, and I don't want to put words in his mouth, that the kernel command line was not being hashed and measured, and you just said that you were doing that. GRUB does so in the current code base. Okay. So one of the problems with that is that the kernel command line generally includes the root file system, and if that's a UUID, then if you're going to build any sort of policy around that, it's very difficult to say this is a static good PCR value, because that will be completely machine-specific. So you really need event log parsing, and even then you can't look at just the values in the event log. You need to look at the event log data, verify whether the event log data hashes to the event log entry, and then parse the command line. So we're not just measuring the command line, we're also logging the command line in the event log. So you have a cryptographically verified copy of the command line, which you can then parse remotely. The infrastructure to do this is there, but right now we don't have policy written around consuming that and turning it into a "this is good" or "this is bad". So if you want predictable values for the kernel command line, then yeah, you need to avoid UUIDs. This is sort of a policy question on how you use BeyondCorp. Are you targeting employee-built devices, employee-owned devices, or are these all corporate-owned devices? Like, how do you usefully do measurements of "I brought my laptop in from home and I would like to check my email"? Or is that out of scope for your use case? That's out of scope for our use case. Okay, so when you say BeyondCorp, these are all devices that corporate IT has purchased and provisioned. Yes. But they are for taking home. Yes. Okay. And ideally this is something we could use to prevent employees from actually bringing their own devices and using them at work. Okay. In a cryptographically enforced way, at least.
It's like, for example, you lose your laptop and buy a new one and try to pretend you never lost your laptop and say this is now my new Google-managed laptop. We're going to catch you and say no, sorry. Okay, makes sense. Thanks. Any more questions? Okay, thanks guys.