My name is Anand Kanagala. I'm a software engineer at Google, not a cryptographer. I'm going to talk about a system that's been built over multiple years by a team of cryptographers, software engineers, and site reliability engineers, and that underlies a lot of the privacy and security infrastructure at Google. Some of my colleagues, I think Bodo and Harness, are in the audience somewhere; they can probably answer the questions that I can't.

Just to head off any confusion, this talk is not about Google Cloud KMS. Despite the name, the two are separate beasts. Google Cloud KMS is a system used by clients of Google Cloud, whereas the system I'm going to talk about is Google's internal KMS, which is used by systems at Google, including the Cloud KMS itself.

First, a quick introduction to the context of this KMS. It's the system in the green box here. There's a trust hierarchy at Google, and a parallel hierarchy of systems implementing that trust hierarchy. At the top, the boxes in blue are our storage systems; the numbers in parentheses are the number of tasks at each layer, so there are millions of processes implementing the storage systems. A layer down is the KMS I'm going to talk about; there are tens of thousands of those in production. A layer below that is the root KMS, and below that is a distribution system for the root KMS master key; there are a few hundred of those.

Going back to the top: when the storage systems encrypt data, they chunk it up, generate a random data encryption key, encrypt the data, write the ciphertext to disk, and then call out to the KMS with the data encryption key. The KMS wraps the key and returns the wrapped key to the storage system. The wrapping is performed using what's called the key encryption key, which never leaves the KMS. The same pattern repeats all the way down, until at the very bottom we're down to a single key, which is the root of trust. That key is stored in a few physical safes in case we ever need it, say if all of Google were to restart; beyond that, it's held only in RAM on a few hundred machines around the world.

This begs the question, though: why would you want to use a KMS at all? The core motivation is that code needs secrets, and there aren't very many good options. One thing you see in a lot of open-source services is storing secrets in the code repository, but that's terrible from a security point of view; you can find plenty of secrets on GitHub. Storing them on production hard drives is not much better: it doesn't help security, it's an operational nightmare, and you still have to manage all the secrets anyway. The best alternative we've settled on is a centralized key management system, which uses our service identity management system, along with our Borg scheduling system, to manage who gets access to these keys. This lets us solve the problem once, for everybody, for all our systems. You can decide who gets access to each key: is it humans, or is it services? Many of our keys are restricted to services that are built verifiably; humans will never see those keys. And how do you enforce all of this? The KMS is a single choke point, which allows us to do the auditing, logging, and control.
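To make that wrap/unwrap flow concrete, here is a minimal sketch of the envelope-encryption pattern described above. Everything in it is an illustrative assumption of mine, not Google's actual interface: the `FakeKms` class stands in for the real KMS (which is reached over RPC, not a local object), and the AES-GCM choice and all names are mine. It uses the pyca/cryptography package.

```python
# Illustrative sketch of envelope encryption as described in the talk.
# FakeKms is a hypothetical stand-in for the internal KMS, not Google's API.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

class FakeKms:
    """Holds the key encryption key (KEK), which never leaves this class,
    and wraps/unwraps the clients' data encryption keys (DEKs)."""
    def __init__(self):
        self._kek = AESGCM(AESGCM.generate_key(bit_length=256))

    def wrap(self, dek: bytes) -> bytes:
        nonce = os.urandom(12)
        return nonce + self._kek.encrypt(nonce, dek, b"dek")

    def unwrap(self, wrapped: bytes) -> bytes:
        nonce, ct = wrapped[:12], wrapped[12:]
        return self._kek.decrypt(nonce, ct, b"dek")

def store_chunk(kms: FakeKms, plaintext: bytes) -> tuple[bytes, bytes]:
    """Storage-system side: a fresh DEK per chunk. Only the ciphertext and
    the wrapped DEK are persisted; the plaintext DEK is discarded."""
    dek = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = nonce + AESGCM(dek).encrypt(nonce, plaintext, None)
    return ciphertext, kms.wrap(dek)

def read_chunk(kms: FakeKms, ciphertext: bytes, wrapped_dek: bytes) -> bytes:
    dek = kms.unwrap(wrapped_dek)
    nonce, ct = ciphertext[:12], ciphertext[12:]
    return AESGCM(dek).decrypt(nonce, ct, None)

kms = FakeKms()
ct, wdek = store_chunk(kms, b"user data")
assert read_chunk(kms, ct, wdek) == b"user data"
```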
The KMS approach also minimizes the amount of code that needs to deal with keys, which makes that code much easier to secure. And it gives us separation of trust, because we're able to exclude the lower-level storage systems from ever having access to the key encryption keys themselves.

But what could go wrong? Gmail has something like over a billion monthly active users. In January 2014, there was an outage in which people weren't able to access Gmail for anywhere from 15 minutes up to two hours. The root cause was that a configuration file for the key management service I'm talking about was truncated in error. Suddenly, the KMS did not know about most of the keys it normally handles, and since Gmail encrypts all of its data using keys held by the KMS, nobody could read their email. The hiccup in our system itself lasted only about 15 minutes, and it recovered, but it led to cascading failures, and by the time the entire stack was brought back up, a little over two hours had passed. We always knew the KMS was important, and that availability was important, but this really got a lot of attention at Google. We learned a bunch of things, and the rest of the talk is about what we learned and how we fixed some of these issues.

In normal operation, clients access the key management server through an RPC mechanism, and the KMS serves up keys based on local config. The way that config is assembled is that every team that uses the KMS, and every service those teams run, maintains its own configuration, stored in our proprietary version control system. These configurations are merged together, packaged up, and sent out to all of our serving replicas worldwide. And since teams want velocity for their changes, we updated these multiple times an hour. I think you can see where this is going. Some latent timing issue tickled a bug in the merging code that had lain dormant for a couple of years, and the bad, truncated config got pushed globally. Nobody had their keys; a lot of systems went down.

The lesson was that, over time, the KMS had become a single point of failure. It had become a startup dependency for many services: they would not start unless their keys were available. And even if they did start, they had a runtime dependency: without access to their keys, they could not serve traffic. This led to the realization that the KMS could not, or at least should not, fail globally. To achieve that, we ended up making a bunch of changes; this slide summarizes the important ones. We mirrored the global control plane. We changed how we roll out binaries and configurations: we no longer push changes globally in under 15 minutes; a rollout now takes at least a week or so and proceeds gradually, using the usual 1, 5, 10, 50 percent stages. We minimized our dependencies, and we had to defend aggressively against the remaining dependencies being unavailable. And to deal with traffic shifts at the scale this operates at, we implemented regional isolation, so that one region being overloaded or in trouble does not cause a cascading failure into any of the other regions. We had to do this for the KMS, and we had to do it for all of our dependencies as well.
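As a rough illustration of that staged rollout, here is a small sketch. The 1/5/10/50 percent progression and the week-long minimum are from the talk; the per-stage day boundaries and the replica-selection scheme are assumptions of mine.

```python
# Illustrative sketch of a staged, week-long config rollout.
# Stage timings below are assumed; the talk only gives the 1/5/10/50
# percent progression and the "at least a week" duration.
ROLLOUT_STAGES = [  # (day the stage starts, fraction of replicas updated)
    (0, 0.01), (1, 0.05), (2, 0.10), (4, 0.50), (7, 1.00),
]

def rollout_fraction(days_elapsed: float) -> float:
    """Fraction of serving replicas that should run the new config."""
    fraction = 0.0
    for start_day, stage_fraction in ROLLOUT_STAGES:
        if days_elapsed >= start_day:
            fraction = stage_fraction
    return fraction

def should_update(replica_index: int, total_replicas: int,
                  days_elapsed: float) -> bool:
    """Deterministically pick which replicas are in the current stage, so a
    bad config reaches only a small, fixed subset before it can be caught."""
    return replica_index < total_replicas * rollout_fraction(days_elapsed)
```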
So we've covered availability; here are some of the other requirements that shaped our design. On availability, the number we were asked to meet was five and a half nines, defined over the percentage of requests served successfully rather than directly over time, and measured in an hourly interval. So if we serve a million requests in an hour, we are allowed to serve five errors in that interval; we serve far more than a million. On the latency side, the 99th percentile of request latency needs to be under 10 milliseconds. For comparison, the reason we arrived at this number is that human perception is somewhere on the order of 100 milliseconds, and a single user-level operation may translate into multiple operations at the KMS level, so you need to budget for that. On scalability, we need to meet all of Google's needs; that's the requirement. On security, I'll touch on only one of our security requirements, the one that impacts availability: being able to do key rotation at scale, in as foolproof a manner as we can. The last requirement is efficiency, meaning the throughput you can get out of your cores. This isn't as critical, because if you can meet the scalability requirement you can always deploy more; it's a cost, and minimizing it reduces the footprint and cost of our service.

We've already talked about most of the availability story, so I'll touch on the latency and scalability parts. We want to be able to encrypt all the data, and that requires a highly available service; the other constraints are scale and latency. There are a few design choices to make. The first is the granularity of encryption. If a system uses a few keys to encrypt a lot of data, it is no longer as dependent on the availability of the KMS; on the other hand, the logging and auditing you can do is much coarser, and you have to trust the clients to do the right thing. The second choice is the rate of change, which can differ across the levels of the trust hierarchy. If clients manage their own data encryption keys, they can change them much more frequently, and that's OK, because if that's messed up, potentially only a single client goes down. At the KMS level, we manage keys at a much slower rate, with the one-week change management period I mentioned.

Combine that with a couple of insights. One: if we take at least a week to roll changes out at the KMS layer, the key material at that layer is effectively immutable over that period. Two: combine that immutability with key wrapping at the KMS layer, and you end up with a stateless server for the duration of that week. That lets us scale trivially; it's very easy to bring up 10,000 instances of these, because no coordination is required between them. And because of wrapping at the KMS layer, we manage only on the order of tens of thousands of keys, which means we can hold them all in RAM, which in turn means we can meet our latency budget.
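As a quick back-of-the-envelope check on the numbers above (the 99.9995% target, the hourly window, and the latency reasoning are from the talk; the code itself is purely illustrative):

```python
# Back-of-the-envelope check of the availability SLO from the talk:
# 99.9995% of requests per hourly window must succeed ("five and a half nines").
AVAILABILITY_TARGET = 0.999995

def allowed_errors(requests_in_window: int) -> int:
    """Errors we may serve in one measurement window and still meet the SLO."""
    return round(requests_in_window * (1 - AVAILABILITY_TARGET))

print(allowed_errors(1_000_000))  # -> 5, the "five errors per million" figure

# Latency budget reasoning, also from the talk: human perception is around
# 100 ms, and one user-level operation can fan out into several KMS calls,
# which is why the per-request p99 target is as low as 10 ms.
```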
So what we ended up with is infrastructure for managing secrets: SSL certs, crypto keys, passwords. But primarily it's a wrapping and unwrapping service: it takes the data encryption keys that the storage services generate and wraps them with key encryption keys held by the KMS, and those key encryption keys never leave the service. That meets many of our security requirements. It's not a traditional database or storage system, because updates happen via the version control system, in a completely separate path from the serving system; you'll see why this matters further down, when I talk about rotation. And finally, it's not a data encryption service: the KMS only deals with keys, and clients handle the bulk data operations, which is how we keep our tail latencies under control.

So how do we do against these requirements? I'm happy to say that we've had no downtime since that outage in early 2014. Our measured availability is much better than six nines: measured over an hourly interval, we serve on the order of trillions of requests with only dozens of errors, and many of those are not even due to server-side issues. On the latency side, the 99.9th percentile of request latency is under 200 microseconds. I already hinted at the reasons: everything is held in RAM, which helps, and it's all symmetric crypto, because we own all the keys. On scalability, we serve on the order of tens of millions of requests per second, using on the order of tens of thousands of processes and cores. On efficiency, the throughput per core, depending on the request mix and the processor architecture we're running on, we get somewhere between 4,000 and 12,000 requests per second per core.

That covers the performance requirements. Now I'd like to talk about the key rotation requirement we set and how it impacts availability. First: why do people rotate keys? The two common reasons are key compromise and a broken cipher, both of which obviously require access to the ciphertext, which is typically access-restricted by the storage systems, because you have the storage ACLs at that layer. In any case, if you are able to rotate your keys, you limit the window of vulnerability, whether you detect the compromise or not. So we rotate quite often. But it happens that rotating keys is fairly error-prone, and if you mess it up, it leads to data loss. That leads to our goals. First, we want clients of the KMS to design with rotation in mind when they're designing their systems. Second, we want to make it trivially easy to use multiple key versions; it ought to be no harder than using a single key version. And third, we want to make it foolproof: it should be very, very hard for clients to lose data. The way we get there is that the first goal forces our clients to think about rotation right up front: they need to specify the frequency of rotation. Is it every 30 days, every 90 days, whatever meets their requirements? They also need to specify the TTL of the ciphertext that they generate; there's a whole range of numbers there, from 30 days up to a year, whatever they can guarantee.
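As a hypothetical example of what those two declared parameters might look like, here is a sketch of a per-key config. The field names and format are my invention; the talk only says that clients specify a rotation frequency and a ciphertext TTL in version control.

```python
# Hypothetical per-key configuration, checked into version control.
# Only the two rotation parameters are from the talk; everything else,
# including the names and values, is illustrative.
KEY_CONFIG = {
    "key_name": "example/message-store",  # illustrative name
    "rotation_period_days": 30,           # how often a new key version is cut
    "ciphertext_ttl_days": 90,            # how long old ciphertext stays readable
    "readers": ["example-frontend"],      # e.g. verified-build services only
    "writers": ["example-storage"],
}
```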
Given that clients choose those parameters, the KMS guarantees a safety condition: as long as clients do not try to decrypt ciphertext that was produced outside its TTL window, the KMS will have a key version available that can decrypt it. In other words, from the moment a client generates ciphertext until the TTL window expires, the KMS guarantees a key version exists that can decipher it. The final piece is that this is tightly integrated with Google's standard crypto library, which supports multiple key versions, where each key version can even be a different cipher, giving us the ability to rotate away from a cipher in case some weakness in it is detected.

This meets the three goals. The first, making clients specify the parameters, forces them to design with rotation in mind. Easy use of multiple key versions is met by the crypto library, the third bullet here. And we make it very hard for clients to lose their data through a bunch of safeguards that I'll go into.

Here's how we implement it. This is a pretty busy slide, so I'll walk you through it. Given the parameters, the rotation frequency and the TTL of the ciphertext, the KMS derives the number of key versions that need to be retained. It then adds, promotes, demotes, and deletes key versions over time. The generation and deletion of key versions is completely separate from the serving system. Remember, I mentioned that clients of the KMS define the parameters they need in our version control system; a separate system that manages rotation interfaces with the serving system in the identical way, and its output is rolled out slowly, as always.

Let me walk through how the diagram works. In the table, K1 through K4 are key versions of a single key. Every time I've said "key" in the last ten minutes, I was really referring to a key set, which consists of multiple key versions. Any key version can be in one of three states: active, primary, or SFR, which is scheduled for revocation. Any key version, independent of its state, may be used for decryption, but only the primary key version is used for encryption. In this key set, time goes from left to right. At T0 we introduce K1, marked active. It gets rolled out over a week to all of our tens of thousands of servers; at the beginning of the rollout period, one server might have access to this key version while the rest do not, and by the end of the week, all of them have it. In the next generation period, that key version gets promoted to primary. Again, on the first day only some of the servers see the promotion and start encrypting data with it; but because of the previous period, we've ensured that the key version is already available at all the servers. Even though it could not yet be used for encryption, it was available for decryption, so any reader can read data written by any server that has started encrypting with it. At T2, at the end of that week, everybody is encrypting with it, since everybody already had the key material itself. Also at T2, we introduce another key version, K2, which follows the same pattern.
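Here is a small sketch of that rotation schedule. The state names (active, primary, scheduled for revocation) and the staging idea, introduce a version as active for one generation before promoting it to primary, are from the talk; the exact cadence (a new version every generation) and the retention formula are simplifying assumptions of mine, since the slide's precise schedule isn't fully reproduced here.

```python
# Simplified sketch of the key-set rotation state machine from the talk.
# Cadence and retention formula are assumptions, not the actual schedule.
import math

def versions_to_retain(rotation_days: int, ttl_days: int) -> int:
    """Number of key versions to keep so that any ciphertext younger than
    ttl_days still has its key version available: one per rotation period
    covering the TTL, plus the staged version and the one held in SFR."""
    return math.ceil(ttl_days / rotation_days) + 2

def keyset_at(generation: int, retained: int) -> dict[int, str]:
    """States of the key versions at a given generation (one generation per
    rotation period): introduced ACTIVE, promoted to PRIMARY a generation
    later, demoted back to ACTIVE, then SCHEDULED_FOR_REVOCATION, then gone."""
    states = {}
    for v in range(max(1, generation - retained + 1), generation + 1):
        age = generation - v
        if age == 0:
            states[v] = "ACTIVE"    # staged: decrypt-only until fully rolled out
        elif age == 1:
            states[v] = "PRIMARY"   # used for all new encryption
        elif age < retained - 1:
            states[v] = "ACTIVE"    # retained for decryption within the TTL
        else:
            states[v] = "SCHEDULED_FOR_REVOCATION"  # last stop before deletion
    return states

# 30-day rotation, 90-day ciphertext TTL -> 5 versions retained at any time.
print(keyset_at(generation=6, retained=versions_to_retain(30, 90)))
```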
As you can see, each version just follows the same progression. This is just one of our schedules; there are trade-offs in terms of vulnerability windows and how fast you want to move, and there are other schemes you could come up with, but this is a simple way of doing it. A couple of things to note. We do not require transactional semantics for key generation, because the rollout ensures that every reader has access to a key version before any writer has used it. So, for example, servers holding the key set as of T2, T3, or T4 can interoperate, independent of which version of the key set they have access to. The cycle goes on: K2 is introduced at T2 and then progresses through the states. And going back to key version K1: based on the rotation frequency and the ciphertext TTL, we derive the number of key versions to retain, as I said, and we roll through them. A version scheduled for revocation is held for one more generation, so that if clients were still using it, they would still be able to access their data and we would alert them; we wouldn't lose the data. For many of Google's systems, such as Bigtable, the ciphertext TTL is enforced naturally, because the data keeps getting rewritten every time a compaction happens.

One other thing we need to do is mitigate hardware faults. Crypto provides leverage: it can amplify errors. A single bit error in the wrapping of a data encryption key could end up losing a whole lot of data. So we need to defend against broken CPUs and NICs twiddling bits, cosmic rays, and so on. Apart from all the hardware-level protections like ECC, we add checks at the software layer: we verify the crypto operations at startup; after wrapping a data encryption key, we unwrap it before we respond, to ensure the operation is actually reversible; and the storage services end up reading back their data in plaintext after writing it encrypted. Keys are also replicated one level up, so even if one region were to go down, you would still have them.

So, to summarize: we need to do this at scale, and it's not just the crypto; there's a whole lot of other work that has to go into it before it's usable. For us, that has meant multiple nines of availability. The first few best practices listed here are pretty standard SRE workflow; the last few are specific to key management. There are a few links at the back. Thank you.
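Before the questions, here is the wrap-then-unwrap self-check described above as a minimal sketch. It assumes the hypothetical `FakeKms` stand-in from the first sketch; the error handling is illustrative, not how the real service responds.

```python
# Minimal sketch of the wrap-then-unwrap self-check from the talk: verify the
# wrap is reversible before returning it, so a hardware bit flip cannot
# silently produce an undecryptable wrapped key.
def wrap_with_verification(kms, dek: bytes) -> bytes:
    wrapped = kms.wrap(dek)
    # Re-derive the plaintext key from exactly what we are about to hand back.
    if kms.unwrap(wrapped) != dek:
        # Illustrative handling; a real service would fail the RPC instead.
        raise RuntimeError("wrap/unwrap round-trip mismatch; refusing to serve")
    return wrapped
```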
Do we have time? Yeah. Great, any questions for Anand?

Q: When you talk about key rotation, are you talking about the data encryption keys being rotated, the key encryption keys, or the level above that?
A: In this case, the key encryption keys. The data encryption keys generally don't change as long as the ciphertext doesn't change. Although, actually, data encryption keys do get changed too: remember, every time a block gets rewritten, say because it's being copied or replicated, you end up generating a brand new key. So those get rotated as well.

Q: Is there any use of threshold crypto for the key encryption key?
A: No. Right now it's just symmetric, if I've understood what threshold crypto means. We don't do it; it's pure symmetric.

Q: Why aren't you using threshold encryption for protecting the key encryption keys?
A: I should ask one of my cryptographer colleagues to answer that; I have no clue. I'm just being facetious, though: for our security model, we don't need it. The number of people who could have access to that key is on the order of tens, the binaries are verified builds, and you cannot get access to the keys. We do use PKI for service identity management, but not for this. Bodo might have a better answer if he's around.

Let's take one more question.

Q: For the key rollover operations, in your example you had four different key versions in use at one time. Is there something tagging the data to say "this was version two of the key," or does the application have to manage which version of the key is used for decryption?
A: So the question is: given that a key set has multiple key versions, does the application need to manage which key version was used to encrypt which data block? No, because this is integrated into the crypto libraries I mentioned earlier, which have been released in open source. When the ciphertext is generated, we tag it with a little tag that tells you which key version was used; it's a truncated hash of the key, so you can recover the right version without having to try decrypting with every key version you have access to.

Perfect, thank you. Cool, great, thanks so much.
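As a footnote to that last answer, here is a sketch of how such a version tag can work. The truncated-key-hash idea is from the answer above; the tag length, hash choice, and format are assumptions of mine, not the actual wire format of Google's library.

```python
# Illustrative sketch of version-tagged ciphertext: a truncated hash of the
# key is prepended so the reader can pick the right key version directly.
import hashlib

TAG_LEN = 4  # assumed: bytes of truncated key hash prepended to the ciphertext

def key_tag(key: bytes) -> bytes:
    return hashlib.sha256(key).digest()[:TAG_LEN]

def tag_ciphertext(key: bytes, raw_ciphertext: bytes) -> bytes:
    return key_tag(key) + raw_ciphertext

def select_key_version(keyset: list[bytes], tagged: bytes) -> bytes:
    """Find the key version whose truncated hash matches the tag, instead of
    trial-decrypting with every version in the key set."""
    tag = tagged[:TAG_LEN]
    for key in keyset:
        if key_tag(key) == tag:
            return key
    raise KeyError("no key version in the set matches the ciphertext tag")
```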