So, let's start with some news you probably heard recently about Uber. This happened some time back, but it only came to light recently. When I was reading the article, I thought: exactly what crypto did someone have to break to do this? And as I read further into the article, this is what I actually found. It wasn't a breaking of crypto, or even a guessing of a password. It was just stealing of a password. So regardless of how complex your password was, if somebody is simply allowed to steal it, that's it. Those are the keys to the front door; nobody needs to find a back door. The funny thing here is the GitHub part. That means somebody actually checked in code with secrets in it. And, I mean, we have all done that in the past, so nothing to laugh about here. But the second story, which happened a bit earlier, maybe two years back, is even more striking. The funny part is that the company itself was a code repository company for other people, and their credentials got stolen. In fact, they had a much more severe problem than Uber. Uber basically got a black mark, got into PR problems, and obviously also paid the bad guys. These people had to shut their doors. They just folded as a company. The problem was that they had secrets that somehow got into the wrong hands; the attackers tried extortion, asked for ransom, whatnot, and at the end of the day everything got deleted and the company folded. So it's a serious problem if you do all this crypto and don't manage your keys properly. Because all this code that we write, somewhere in its execution, requires a secret to be present in order to be usable: keys, passwords, whatnot. So how do you manage this, whatever type of secret you have, so that when the code runs, the secret is made available?
Over the last two days, we have seen a lot of talks about key management, key management systems, TLS, and so on. So I'll give you a very brief, very high-level overview of what the Netflix backend looks like. We have our control plane. We have our cloud provider, whose services we use. We have partners that we work with. We have our content delivery network. And, of course, our customers. Here we are mainly focused on the control plane. We have 1,000-plus applications and a bunch of things deployed inside, what the fancy word for it now is a service mesh, and these applications talk to each other. And then, obviously, in this ecosystem we have thousands of people who check in the code that makes these applications build, deploy, and run. So let's build a story. Let's talk about some piece of code. We'll take one example, but it's just an example; you can extrapolate it to any kind of secret. Say you have this piece of code. Most of you probably already spotted the line that says password. Even if it's a crazy long password that nobody can guess, that's not the point. The point is: can somebody steal this password? Right now it's sitting in the code. So, sure, can we do better? Maybe we just do this: we don't put the actual password in. We put in an encrypted version and call a magic function called decrypt, and somehow expect that when this code runs, wherever it runs, the password will be decrypted and made available, so the code looks like this and I can make my database connection, no problem. It's likely more complex than this; I'm keeping things simple right now. But let's say this is the piece of code we are going to write, deploy, and test. If I put this code in Netflix, we have a Git repository that these unfortunate developers are using for the application that holds this code.
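To make the two versions of the code concrete, here is an entirely hypothetical Python sketch; the slide's actual code is not shown in the transcript, and the function names, ciphertext, and password below are made up for illustration:

```python
# Version 1: the password sits in plain text in the source tree.
# Anyone who can read the repository can steal it.
DB_PASSWORD = "s3cr3t-but-still-stealable"

# Version 2: only a ciphertext is checked in; a "magic" decrypt()
# call is expected to resolve it at runtime, wherever the code runs.
DB_PASSWORD_ENC = "AQICAHj...made-up-base64-ciphertext..."

def decrypt(ciphertext: str) -> str:
    """Placeholder for the magic decryption function; in reality this
    would call out to a key-management service on behalf of an
    authenticated caller."""
    raise NotImplementedError("resolved by the key-management system")

# At runtime, the application would do something like:
#   db = connect(user="svc", password=decrypt(DB_PASSWORD_ENC))
```

The rest of the talk is about what has to happen behind that `decrypt()` call so that only the right people and applications can make it succeed.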
Their job is to check out this code and run it locally before they check in, testing everything. Part of that testing could be actually making that database connection. That means they need the password locally on the laptop while the test is running. So remember that magic decrypt function? It needs to run on the laptop, and only on these two developers' laptops, nobody else's. All right, now they're happy, they've checked in their code. There's a continuous integration system, Jenkins in our case, that runs the build. As part of the build, they have a bunch of unit tests and a bunch of integration tests, and those tests also talk to the database to make sure things are right. That means the same secret, the same decrypt function, needs to also work on the Jenkins box. But a lot of other people's code also runs on the Jenkins box, so you want to make sure that only that team's Jenkins job is allowed to decrypt the secret, not everybody else's. All right, the build is happy, you're happy with the build, now you send it to the deployment system. In our case, we have an open-source system called Spinnaker, which is pretty awesome. The deployment system says, all right, I'm happy, I'm going to push everything and spin up a cluster in AWS. By the way, Netflix is 100% AWS, so we don't have any data centers. But this particular problem becomes even more complex if you have a hybrid setup, with your own data centers plus AWS or some other cloud provider. So now, how does that magic decrypt function actually give you the password you need to run your code in all these environments? Let's say you have a key management system, a central system, and all three of these places where the code runs can go to this key management system and ask for decryption of the password. There are a couple of problems here.
First of all, and this is funny because the timing gods are with me here: Spectre and Meltdown happened just a couple of weeks before I was preparing this talk, and the reason we designed the system this way, three years back, was that we expected something like this would actually happen someday. And it happened right before the presentation, which kind of helps. So, something like this happens. What's going on here is that even if that key management system, the key server, is pretty well protected, remember that because we are 100% AWS, it also runs on a VM, which is running next to some random person's VM. So all the secrets and passwords coming into this key server are being decrypted, and for however many milliseconds they are in the clear in the memory of the key server. All right, that's a problem. Now let's see if we have alternatives. Can I put the key in an HSM instead? There are two separate issues. Obviously there's key-in-RAM versus key-in-HSM, but even with an HSM you cannot get away from the fact that your system just decrypted a database password, and that password is sitting in the RAM of the key server. Okay, so let's see what we can do. Let's summarize. We now have multiple parties, people, CI/CD systems, and the applications themselves, all coming to this key server wanting their secrets. And this is just one example. When I say secrets at scale, we have a thousand-plus applications, and they all need their secrets, coming from all sorts of sources. Now, in order to do the decryption, we have to do two steps. Remember, as I said, only those two developers, nobody else, should be able to decrypt that secret, because they are the ones managing that code and the only ones supposed to develop on it.
First thing first, you want to authenticate the request: who's asking for this decryption? The key server has to know who's asking for it. The second, and more important, step is to use the right decryption key to do the decryption. Now, I'm not going to talk much about that first step, authentication. My team has talked about it in multiple places, so I'll go through it very quickly, and at the end I have some resources you can look at. It's not very interesting, crypto-wise, in some ways. So, the requester's identity: who's requesting? We have two types of requesters: users and applications. For users, we use Google's identity service for our people, our employees. Google does it right, so we don't worry too much about it; internally we use OAuth or mutual TLS when users authenticate. The more interesting part is the applications, running on AWS VMs and sometimes containers on top, where we also use mutual TLS, but the bootstrapping has to happen using something called the AWS metadata service. This is the part that actually took us a while to get going, because I come from an embedded-systems background, where I had complete control over my hardware: I could just put my key into a TPM or something like that and bootstrap my system that way. With VMs, I have no control over the hardware. I actually have nothing. So where do I even start giving a name to my application so that it can authenticate to someone else? Here's how it works. It's hard to see because of the color combination, but trust me, there is something called the metadata service. If you look at the AWS documentation, you can hit this endpoint, but only from that VM, and it spits out something like this, which is basically just a base64 encoding of this document. The most interesting part there is the instance ID.
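To make the bootstrapping step concrete: the real identity document is fetched from the EC2 instance metadata service, which is only reachable from the VM itself, and it comes with a signature so it can be verified off-box. The sample document and every field value below are made up for illustration; only the decoding step is real:

```python
import base64
import json

# A made-up instance-identity document of the kind the EC2 metadata
# service returns as base64. The real one is fetched on-box from the
# metadata endpoint; these values are fictional.
sample_doc = {
    "instanceId": "i-0abc123def4567890",
    "region": "us-east-1",
    "accountId": "123456789012",
}
encoded = base64.b64encode(json.dumps(sample_doc).encode())

# The bootstrapping code decodes the document and pulls out the
# instance ID, which uniquely names this VM at launch time.
doc = json.loads(base64.b64decode(encoded))
instance_id = doc["instanceId"]
print(instance_id)
```

As the next part of the talk explains, this ID on its own is ephemeral and useless in a policy; it first has to be translated into an application name.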
It's a unique ID given to your virtual machine at launch, and nobody else has that ID at that point. But it's not very useful when you want to write a policy saying "this application is allowed to decrypt this secret." That instance ID doesn't really help, because first, it's ephemeral, and second, it's not something you can put in a policy. What you want to put in a policy is the name of the application. So you need to translate this big random-looking identifier into something you can actually write policy against. What we do is make an AWS EC2 API call, which tells me something about the instance that just got launched, and one of those things is the tags, where we put the application name, the version of the application, and such. So now I have the application's name, and then we do some background magic to issue short-lived certificates to every application that we launch. But that's not the point; the point is that now the application can identify itself. I have referenced some other talks on this, so feel free to dig into those. The more interesting part here, given the audience, is the second step. Now I have one group: in this case, two developers, one Jenkins job, and one application. For every group with which a secret is shared, I need at least one unique decryption key, and that key cannot be shared with anybody else. So you can have groups like {Alice, Bob, application one, Jenkins one} and {Eve, application two, application three}; Eve can have some secret that is shared across two applications, for example. So now you have an individual key for each individual group. All right, now let's talk scale, because at Netflix everything is about scale. Say you have M users and N applications, or the other way around.
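The instance-ID-to-application-name translation could be sketched as below. In production this would be an EC2 DescribeInstances call (for example via boto3); the lookup table, tag names, and application name here are hypothetical stand-ins:

```python
# Stand-in for the tags that the deployment system attaches to an
# instance at launch; in reality these come back from an EC2
# DescribeInstances API call. All values are fictional.
LAUNCH_TAGS = {
    "i-0abc123def4567890": {"app": "billing-api", "version": "1.42.0"},
}

def app_identity(instance_id: str) -> str:
    """Translate an ephemeral instance ID into a policy-friendly
    application name, the thing you can actually write rules against."""
    tags = LAUNCH_TAGS[instance_id]
    return f'{tags["app"]}-v{tags["version"]}'

print(app_identity("i-0abc123def4567890"))
```

The short-lived certificate issued afterwards is what carries this name forward, so the key server never has to reason about raw instance IDs.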
The maximum number of groups that you can have is two to the (M plus N), minus one. All right, let's put some numbers around it. Say a startup has 10 users and 10 applications: that number is over a million. Now say you hire two more people, who end up writing two extra applications: that number suddenly goes to over 16 million. At Netflix, we have 1,000-plus applications and hundreds of people who write code, so you can imagine what number of group combinations you can expect. Not gonna fly. Now, some people will come back and say: why are you complicating this problem so much? You can just have this key server where, instead of asking for decryption under a given key, you use a handle, and put all the secrets in a database. If you do that, then you only need one key, because now you have this database: you pull out whatever you want, decrypt it, and send it. You just need one key. Sure. But look at what you just did. Why is it worse? Because in the previous case, when I mentioned Meltdown and Spectre, somebody actually had to sit there and monitor those passwords being decrypted and sent out, and siphon off those secrets as they went by. In this case, if the secrets are in the database, it can be a completely passive attack. The user isn't even asking for any decryption; I sit there, I somehow have the master key, and I go to the database and pull out everything. It becomes a gold mine. The second problem with this approach is that if I have to share a secret among people, I first have to put it in this database. That means the generation of the secret cannot be offline, and there are use cases, which we can talk about, where it is desirable to have the generation and protection of the secret happen offline.
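The counting argument is easy to check: every non-empty subset of the M + N principals is a potential group, and each group needs its own key, giving 2^(M+N) − 1 keys in the worst case:

```python
def max_groups(users: int, apps: int) -> int:
    """Number of non-empty subsets of users + apps: each is a
    potential sharing group needing its own decryption key."""
    return 2 ** (users + apps) - 1

# The startup from the talk: 10 users, 10 applications.
print(max_groups(10, 10))   # over a million

# Hire two more people who write two more applications.
print(max_groups(12, 12))   # over 16 million
```

Of course no real organization materializes every subset, but the point stands: the key space must be able to grow combinatorially, which rules out pre-provisioning one physical key per group.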
So let's define our goals, because we've been talking a lot about what is good and what is bad. What we wanted to achieve when we started is this: if a secret is generated by one party and is supposed to be consumed by some set of parties, that secret should never, ever be in the clear anywhere except at the creator and at the consumers, not even on the decryption server. Okay, then some stretch goals. This is how we started: maybe if we can get the offline-encryption part going as well, that would be really nice. Say we have a payment partner, some bank that we are working with, and they need to send us some TLS keys. We would really prefer that they not encrypt them to a PGP key belonging to one of our employees, if possible. Right? That person can put the PGP private key on a laptop, the laptop gets compromised or stolen, and a bank's TLS key gets stolen. That's just crazy. We would much rather give them a small piece of code that says: just encrypt it using this tool. Then we immediately take that encrypted version, throw it into source code, and it magically works for the group of people it was intended for. Another desirable property: limit the decryption service's ability to observe usage patterns. I put this in the stretch goals because, given where Netflix is, people would really like to hack that key server, which is unfortunately managed by my team, and I want to sleep at night. I don't want somebody to hack into that system and siphon off usage patterns that give them some advantage over Netflix's business. So at least try to limit, as much as possible, what you really get to see even if you get into that box. And then, of course, scale: you should have a number of keys that scales. As I mentioned, if you do one key per group and you have that many groups, can you really support that many keys? And also the requests.
You have thousands of applications, sometimes with hundreds of instances each, plus the users. Can you really scale from a request perspective? Can you handle that many requests? If you architect it in a way that lets you scale horizontally, that would be nice. I think yesterday somebody mentioned being stateless: if your systems are stateless, it's very easy to scale horizontally; you just throw more boxes at it. All right, so visually, what we want is this. You have a secret creator who somehow generated the secret, and we want them to protect it right there, maybe on their laptop if it's a person generating the secret. Then you have a consumer; it could be a person or an app or whatever. What you want is for the ciphertext to be available in the code, and then at runtime it makes a query to a decryption server and exchanges some messages, and at the end the secret shows up on that person's laptop or in the application. But at the same time, you don't ever want that secret to be visible in the clear on the decryption server. And if possible, you want the first phase to be offline and the second phase to be online. Then there's the scale part: you don't want the number of keys to grow as you're adding people, adding more code, adding more secrets. That's just not feasible, especially in the cloud. We talked about a million keys, 16 million keys, numbers like that. The reality is that no HSM will support that many keys; the maximum is still in the thousands. So, not possible. So what we did is a literature survey, because we wanted to meet these goals and weren't sure how. We literally just started reading through papers, seriously. And as we were going through this, we knew what we wanted, but we wanted to make sure that some math existed that we could make use of.
We bumped into this paper from Asiacrypt '96 on RSA-based blind signatures. Now, we don't really need signatures here; we need the encryption part. But because it's RSA-based, we were hoping we could tweak it in a way that lets us use it for encryption. And this is what we came up with. This is the setup, and it's literally taken from that paper, because that's how the setup is; we didn't change any of the math. You have a group ID, which is the group of people and applications together: put them together and give them some integer. Then there is a formatting function called tau. Then you choose two primes, p and q, but not randomly like ordinary RSA parameters; there are some restrictions you have to follow, co-primality conditions and such. Then you choose the public exponent e, but again not like the usual e, 65537 or whatnot. Once you do that, the encryption looks like this. It's not the regular M to the e; it's M to the e, but with that tau of the group ID mixed into the exponent. Now, this may feel like tag-based encryption, where the group ID is basically the tag. All it does is, with the same modulus, give you a different exponent per group. So once you encrypt, maybe offline, you have the ciphertext. Before you send this ciphertext for decryption, you want to blind it. But because this is not regular RSA, your blinding also has to take the tau part, the formatting function, into account. Once you do that, it spits out this Z. Just for visualization, I say that C is somehow inside Z: think of it as a box in a box. You had one box, which was C, and you put that whole encryption box inside another box, which becomes Z. Now you send it to the server for decryption, and the server performs this operation, which gives you this intermediate value. The server only has the key to the inside box, not the outside box.
So think of the server inserting its key through a hole left open in the outside box and opening the inside box. Okay? Just for visualization, I drew M as open on the server side, but it's actually still covered by the outside box, which is fine. Then you send it back, and the sender knows the blinding factor R, so it can perform the reverse operation and get M out. In this process, you never let the server look at M. At the same time, you had everything driven by the group ID, so you can have as many group IDs, as many combinations of people and applications, as you want, and you just run the same math. Of course, with RSA encryption the next question is always about padding. You can use any padding, like OAEP. We have an internal padding scheme that looks kind of like PKCS #1 v1.5, and yes, before you go "what?": there are attacks on PKCS #1 v1.5 that people have published in the past, but those attacks apply when the channel is not authenticated. Remember, our decryption happens only after authentication. And regardless, you can use any padding scheme you want with this; that's not a problem. So, how did we do against our goals? We put blind decryption behind authentication. For the stretch goals: because it's asymmetric, you can do the encryption part offline. And it's a stateless system with only one key. Remember, the whole decryption required just one d, so the only thing that was secret was that one d. You can now scale horizontally as much as you want, because you can just spin up more instances that have access to d, and that's it. So this is the most important part. Now, taking it one step further.
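The box-in-a-box trick is easiest to see with plain textbook RSA. This toy sketch deliberately omits the per-group exponent derived from tau and all padding, so it demonstrates only the blinding math, not the actual production scheme, and the parameters are tiny and insecure:

```python
# Tiny, insecure demo parameters (real keys are 2048+ bits).
p, q = 61, 53
N = p * q                    # public modulus, 3233
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # public exponent
d = pow(e, -1, phi)          # server's private exponent (Python 3.8+)

M = 42                       # the secret, e.g. a database password
C = pow(M, e, N)             # ciphertext: safe to check into source control

# Client: blind the ciphertext with a random factor r before sending.
# r must be coprime to N; fixed here so the demo is deterministic.
r = 99
Z = (C * pow(r, e, N)) % N   # the "outer box": server never sees C or M

# Server: applies its private key to the blinded value only.
Y = pow(Z, d, N)             # equals M * r mod N, still blinded

# Client: strips the blinding factor to recover the plaintext.
recovered = (Y * pow(r, -1, N)) % N
print(recovered)             # 42
```

The identity being used is Z^d = C^d * r^(ed) = M * r (mod N), so the server computes a correct partial decryption without ever seeing M in the clear, which is exactly the property the talk is after.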
Now, we talked about the group ID being a group of applications and people, but it doesn't really have to be. Because it's just a number, you can convert this whole system into policies identified by numbers. Policy number one: I want to tie this secret to policy number one. And that policy doesn't just have to be a whitelist of people and applications; it can have any other rule inside, like "don't allow encryption or decryption after 2 p.m." or whatever. So now you can write an arbitrary policy and give it a number, and as long as the number is unique, you can tie your secret to it. So now you're bleeding slightly into authorization as well, which is something we also work on. Matthew Green actually pointed out this other paper to me. I have not had a chance to look at it yet, but it seems it may have a similar construction that can achieve similar goals. Again, I just put it out there because Matthew pointed it out; I have not looked at it. If you have other suggestions, they are welcome; we only want to do better. Next steps: we want to keep the same goals but see if we can achieve them with provable security guarantees. Probably do multi-party as well. Nick mentioned something about where the master key is; in this case, the master key is d. We want to see if we can do the same thing multi-party, maybe something as dumb as doing this twice and having the key basically be an XOR of shares. And then, of course, the next step would be doing something like this with a post-quantum-resistant scheme. All right, these are some of the resources we have put up for you. And thank you. Thank you for a fantastic talk. We have just a minute or so for questions while other folks are getting mic'd up. Well, I have one: you've published so much else. Are you going to make some of this key management software available for others to use? So, we get that a lot, and I have a personal interest in it.
Actually, we have patented this, but I also want to open-source it, because it's very useful; a lot of people I've talked to internally have told me that. I think it's just the engineering effort that we need to put in, and at this point... That's not free, I get it. Yeah, we are hiring, so the more people we get, hopefully we will. Thank you so much. Thank you.