Hello everybody, my name is Nathaniel McCallum, I'm a Principal Software Engineer at Red Hat, and I'm going to be talking about secure automated decryption. I would ask you to bear with me a little bit. My talk and slides are actually hosted on a hosting provider, and apparently their whole data center went down. They had massive catastrophic power failures, and unfortunately they're actually piping their recovery process live right on their blog, so you can see this is what they're getting, and wait, there's more. Okay, now obviously my example is a little bit facetious, but it is a real-world problem, right? As we start to deploy decryption at scale, the question is: how do we automate this process? We can't do it in the manual way we've been used to doing it in the past, but with automation also come new security challenges. So the title of this talk, of course, is secure automated decryption, and we are going to be talking about the various different methods that are used. The first question we come to is: how do we automate this? First we have this secret here, the data we want to protect, and usually the way this is guarded is we first encrypt it with an encryption key. Now, this is not the key that you use directly, so this is not, for instance, the password that you type; this is a cryptographically strong key that protects the data. Then what we actually do is encrypt the encryption key with another key, called the key encryption key. This wrapping process, or sealing process (all of these terminologies refer to the same thing), basically allows you to change the outer key without having to change the inner key. Because if you have terabytes worth of data, you don't want to have to change that inner key and then re-encrypt all your data. If you have a vulnerability in the outer key, you can simply change the outer key, re-encrypt the inner key, and your data is protected.
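That layering of keys can be sketched in a few lines. This is a toy illustration only: the made-up `keystream_xor` function (XOR with a SHA-256 keystream) stands in for a real cipher such as AES-GCM, but the key rotation property is exactly the one described above.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256-derived keystream.
    Illustrative only; a real deployment would use AES-GCM or similar."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

secret_data = b"terabytes of data (pretend)"

dek = secrets.token_bytes(32)  # inner data-encryption key
kek = secrets.token_bytes(32)  # outer key-encryption key

ciphertext = keystream_xor(dek, secret_data)  # encrypt the data once
wrapped_dek = keystream_xor(kek, dek)         # wrap the DEK under the KEK

# Rotating the outer key touches only the 32-byte wrapped DEK,
# never the (potentially huge) ciphertext.
new_kek = secrets.token_bytes(32)
rewrapped_dek = keystream_xor(new_kek, keystream_xor(kek, wrapped_dek))

# Recovery: unwrap with the current KEK, then decrypt the data.
recovered = keystream_xor(keystream_xor(new_kek, rewrapped_dek), ciphertext)
assert recovered == secret_data
```

Note that rotating from `kek` to `new_kek` never re-touches `ciphertext`; that is the whole point of the wrapping scheme.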
So this is where your typical password comes in, right? You have a password, and that's actually the key encryption key, usually derived in some sense. This is the standard password model. Of course, we distribute that password to all these different people, and now all of these people have access to the data, and then one of them leaves the company and you can't manage that, right? So we have a whole automation problem there. An even better option, of course, would be to generate something that has a lot more entropy for the outer key. And then we have the standard escrow model, and this is what you typically see deployed in a lot of places. This is actually showing up in standards; it's showing up in various different tech proposals; and there are even some regulations that talk about this sort of escrow model. So we're done, right? We've described secure automated decryption. We can all go home, right? Well, not quite, because the problem is we also have to protect the transport layer. So we have TLS or GSSAPI, which we're using to encrypt the transfer of the key from the escrow back to the server or the client that's going to perform the decryption. And of course this is secured by a key on each end, because you have to have authentication. Now that the key is being stored in the escrow, you can't just let anybody get it, right? The server has to prove who it is, or the client has to prove who it is, in order to get the key from the escrow. So now you have mutual authentication on both ends. So at last we're done, right? No, there's actually more, right? Because in order to generate the keys for authentication, we have to have a third-party authority, which usually comes in the form of, if you're using GSSAPI, a KDC, for instance Active Directory; or if you're using certificates for TLS, a certification authority. So now we're done?
Unfortunately, no, because remember, we're now storing all of this data in the escrow, so we still have to back it all up. And key data is sensitive, so you don't want to just back it up in clear text. And then you also have the Heartbleed problem: we transfer all of these keys over the wire, and all of a sudden our TLS is completely useless to help us. Okay? So this is the model that everybody's doing. We recognize that there are some security challenges with it, so we're going to gradually unfold some different methodologies and see what we can do to solve these problems. Let's quickly review some of the lessons we've learned. The first is that complexity increases the attack surface, right? Every time we added a little bit of complexity to that escrow model, we added another place to fail, another place to lose keys. This is also difficult to deploy. And we also learned that speed matters. Think of the data-center example we gave at the beginning of this talk: if you've got a data center full of thousands of machines all coming up at the same time, all wanting keys, speed matters. So, can asymmetric crypto help? The answer is actually yes. We started a project called Deo; this was actually last year. Essentially what we did was decide that we would take the key encryption key, encrypt that again using public key encryption, and then just store it locally. So the key encryption key is encrypted again using asymmetric crypto and stored locally. Then at recovery time, when we want to get the key back, we would send the encrypted blob out to the server and we would get back the plain text key. Now, this had some really nice advantages in terms of state, because the server was stateless. We could store the data locally; the server didn't need to know any of it. But we still did have to authenticate the connections.
And we had to protect the channel, of course, because we were transferring the key back in plain text. We still have the certification authority. We still have the backups issue. And we did learn some lessons from this project. We learned that asymmetric crypto makes the server stateless, which is probably the most important thing we learned. Asymmetric crypto also allows for offline provisioning, which has some really interesting use cases. Say you're trying to speed up deployment and you want to bring up thousands of servers quickly to handle some load: you can do the entire provisioning offline, because you don't need to contact the server for anything, as long as the client knows the server's public key. We also learned that sending the keys over the wire is still a risk. And we also learned that X.509 takes a lot of effort. This last item is, in fact, the reason we killed the project. We designed it, we implemented it, it was fairly simple and it worked. But a lot of people had a very difficult time deploying it, just around the management of X.509 certificates. So we wanted to see if we could do better. Let's ask some more questions. First: can we avoid TLS altogether? TLS is a pretty complex stack. We're still going to need encryption, of course, because we do need to keep things secret. But if we can limit all of the options TLS gives us, in which we could fail, down to just a few manageable options, we have a much smaller attack surface. The second question is: can we hide the key from the server, so that the server itself never sees the key? And we can. So we're going to use a mode I like to call wrapping mode. If you go back, remember that donut diagram: the outer ring, the key encryption key, is now in the middle of this one. What we do is create another symmetric key, called the wrap key, encrypt the key encryption key with it, and then encrypt that blob using the server's key.
Then we keep the resulting encrypted blob, and we keep the wrap key. But notice that because the client does not have the server's private key, it cannot decrypt this material. Once you've encrypted the volume using the key encryption key, you throw away the plaintext key encryption key, so you don't have it anymore and you need to get it back. So at recovery time, what you do is generate a second key, called the ephemeral key, and you encrypt everything again using the server's key. So we have multiple layers of encryption: the key encryption key in the middle, the wrap key is used to encrypt that, the server's asymmetric key is used to encrypt that, and then we have that blob plus the ephemeral key in another layer of encryption. And then we can send this to the server over the wire without TLS. Now, the server gets this big packet, and it has its own private key, so it can use that to decrypt both layers and get the encrypted key encryption key, which it still can't see; it can only see the encrypted form, right? And it has the ephemeral key. So it takes the ephemeral key, encrypts the resulting blob, and sends it back to the client. The client, of course, has both keys. So now the client can simply unwrap the ephemeral key, unwrap the wrap key, get back the key encryption key, and perform the decryption. What we've just done accomplishes two things. First, we have completely obviated the need for TLS. We don't need TLS at all; we can just send this over a UDP packet if we want. Second, the server only ever saw an encrypted blob. It never saw the actual key encryption key, which is a really nice feature, because it means somebody can't just sit there and collect keys. They can collect the encrypted blobs, but then they would also have to compromise the client in order to get the wrap key. So this is a pretty good model. We like this overall.
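The round trip above can be sketched end to end. Two caveats on this toy: `keystream_xor` and the randomized `seal`/`unseal` helpers are made up (a SHA-256 XOR keystream), and `server_key` is symmetric here purely for brevity. In the real scheme the server key is an asymmetric pair, so the client genuinely cannot open `stored`; the control flow below is the part being illustrated.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy stream cipher (XOR with a SHA-256 keystream); illustrative only.
    out = bytearray()
    i = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + i.to_bytes(8, "big")).digest())
        i += 1
    return bytes(a ^ b for a, b in zip(data, out))

def seal(key: bytes, data: bytes) -> bytes:
    # Randomized "encrypt to the server". In the real scheme the client can
    # only encrypt toward the server's public key, never decrypt.
    nonce = secrets.token_bytes(16)
    return nonce + keystream_xor(hashlib.sha256(key + nonce).digest(), data)

def unseal(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    return keystream_xor(hashlib.sha256(key + nonce).digest(), body)

server_key = secrets.token_bytes(32)   # held only by the server

# --- provisioning (client side) ---
kek = secrets.token_bytes(32)          # key encryption key for the volume
wrap_key = secrets.token_bytes(32)     # the client keeps this
blob = keystream_xor(wrap_key, kek)    # E_wrap(KEK)
stored = seal(server_key, blob)        # E_server(E_wrap(KEK)): kept locally, unreadable
expected = kek                         # kept here only to check the round trip

# --- recovery: add an ephemeral key and one more server-keyed layer ---
eph = secrets.token_bytes(32)
packet = seal(server_key, stored + eph)   # this is all that goes on the wire

# --- server: peel both layers; it sees only E_wrap(KEK), never the KEK ---
outer = unseal(server_key, packet)
inner, eph_rx = outer[:-32], outer[-32:]
reply = keystream_xor(eph_rx, unseal(server_key, inner))  # E_eph(E_wrap(KEK))

# --- client: remove the ephemeral layer, then the wrap layer ---
recovered_kek = keystream_xor(wrap_key, keystream_xor(eph, reply))
assert recovered_kek == expected
```

The server-side section is the entire server: it never holds `wrap_key`, so the innermost value it ever touches is still ciphertext.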
It has all of these features: the server never sees the key encryption key; we avoid X.509; we avoid TLS; it's stateless; and it's fast. These are all really great things. One thing I did put on here is that this method is very easy to migrate to post-quantum crypto, because one of the major concerns coming up is how we move to a post-quantum world, and this is very straightforward to move. Symmetric keys, AES for instance, are not part of the post-quantum problem: we believe them to be secure, and quantum computers don't really hurt them. It's only the public keys. So as long as we get some kind of public key encryption in post-quantum crypto, we're fine to use this model, and it will continue to work in the future. But we believed we could accomplish some other features as well. Number one: must the key actually go on the wire at all? And number two: can clients be anonymous? The reason we're asking is that the server in the last example never saw the key encryption key, but it still saw a blob, an encrypted blob, that didn't change. So it could use that encrypted blob for tracking clients, for instance. One of the questions we wanted to ask is: can we do secure automated decryption that is more ad hoc, in which you could just be walking around, come in range of an access point, for instance, and be able to transiently decrypt some data without that server even knowing who you are, without it having any idea? We actually can do this as well. This is standard ephemeral encryption, and if anyone's dying of fear of the math, I don't blame you. Standard ephemeral encryption, there's nothing fancy about this. But a guy at Red Hat, Bob Relyea, and myself came up with a variant of it. Notice that on the left-hand side, if I go back, nothing actually changes; we're only changing the decryption step here.
And using this algorithm, we get a couple of interesting properties. The first is that the server only ever sees random data, all the time. The server never gets anything static that it can identify the client with; the client is completely anonymous. And because this is a key exchange, not encryption, we essentially never send a key over the wire at all. Okay? So this has even better security properties. We have a server, called Tang, over here on the right, and essentially we use the key exchange to generate the key encryption key, which we can then always regenerate automatically. We still kind of need backups here, but we can avoid backups by using a TPM and actually burning the private keys into the chip. In that case, you burn the server's private key into the chip, and the server becomes completely stateless: there's literally nothing stored on the hard drive, nothing to back up, and you can't recover the key from the server. So you can do this in a very, very small footprint. We've even been toying with ideas like little Bluetooth beacons, right? You could have data that's encrypted on your phone, and when you're in range of the Bluetooth beacon you can decrypt it, and when you walk away, you can't. Very, very lightweight, very, very small, very, very fast. So we have a project that implements this, and the project is called Tang. You can see this organization, Latchset; you're going to see it pretty frequently. This is an organization that several of the crypto guys at Red Hat created, and it has various cryptography-related projects. Even besides the stuff we're going to talk about here, just go browse; it's got some interesting stuff. So this is the server-side daemon. It's simple HTTP and JOSE. Anybody know what JOSE is? Anybody know JSON Web Encryption? Ever heard of that? JSON Web Encryption, JSON Web Signing.
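The blinded exchange described above can be sketched with ordinary modular arithmetic. This is a toy in a multiplicative group mod p (the real Tang exchange uses elliptic-curve Diffie-Hellman, but the algebra is the same shape), and it assumes Python 3.8+ for modular inverses via `pow(x, -1, p)`:

```python
import secrets

p = 2**255 - 19          # a well-known prime; toy group, not a recommendation
g = 2

# --- server (Tang) key pair; only S is ever published ---
s = secrets.randbelow(p - 2) + 1
S = pow(g, s, p)

# --- provisioning (client; works fully offline given S) ---
c = secrets.randbelow(p - 2) + 1
C = pow(g, c, p)
K_provision = pow(S, c, p)    # derive the key encryption key from this
# The client stores C, and discards c and K_provision after encrypting.

# --- recovery: blind C with a fresh ephemeral secret ---
e = secrets.randbelow(p - 2) + 1
X = (C * pow(g, e, p)) % p    # random-looking; a brand-new X on every request

# --- server: one exponentiation, stateless, learns nothing about the client ---
Y = pow(X, s, p)              # = g^(s*(c+e))

# --- client: strip the blinding factor: Y / S^e = g^(s*c) ---
K_recovered = (Y * pow(pow(S, e, p), -1, p)) % p
assert K_recovered == K_provision
```

Because `e` is fresh every time, the server sees a different uniformly distributed `X` on every recovery, which is why it cannot track clients, and no key material ever crosses the wire.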
JOSE is JSON Object Signing and Encryption. The nice thing about this is that we're using completely standard formats: this is a standard HTTP request, and this is a standard JSON encryption object. Nothing proprietary here, no funky structures. This is really, really fast. When I benchmarked it with a single TCP connection, just cramming requests into the server, on a standard desktop laptop you can get about 100,000 requests a second, which is sufficient for low-end hardware and sufficient for large-scale deployments. So we're extremely fast, we're extremely small, and we have minimal dependencies. Our only server-side dependencies are http-parser, which is just a small HTTP parsing library (really tiny, it's two files), and the library I'm going to talk about at the end of this talk for doing the JOSE encryption. So let's continue asking some more questions. What other things can we bind our data to? We've already covered the standard client-and-server case, but really we've developed some new technology here, so let's think more broadly about what other kinds of endpoints we could bind our data to. Obviously, the TPM is one: you could bind data to your local TPM. You could bind it to a Bluetooth beacon, which I mentioned. You could do something like generating a random key and printing a QR code, and then when you need access, just hold the QR code up to the camera. You could do facial recognition, fingerprint scans, mobile phones, smart cards, RFIDs: all these different kinds of endpoints that you could bind your data to. This quote is from Josh Bressers, from his security talk at DevConf, and I love this quote; I keep coming back to it over and over again: security is not a binary, it's a sliding scale of risk management. But very often in cryptography, we just deal with the binary, right? Is it secure or is it not?
And in fact, allowing multiple things to bind to gives us a sliding scale of risk management. So the question is: how do we make our unlock policy non-binary? There's this great little algorithm called Shamir Secret Sharing. Anybody know who Shamir is? He's the S in RSA, okay? He made this algorithm, Shamir Secret Sharing, and it's a threshold algorithm. It allows you to take a key and split it up, into five different parts as I've shown here, though the number of parts can be almost anything. And then you have a threshold, and the threshold determines how many of these key fragments are required to regenerate the resulting key. You can also nest this, right? You have a key here that's fragmented, and then you can take that middle fragment and fragment it again with a separate threshold. So we're going to use this technique to design non-binary security policies. Take a simple laptop, right? And I'm not talking about your home laptop here; we're talking about a corporate-deployed laptop. Typically you're going to have the user's decryption password, but you're also going to want an admin password, so that if for whatever reason the user can't get into the system, the admin can, right? In this case, with LUKS, you can just use multiple slots. But if we translate this into Shamir's language, we have a threshold of one and we divide the key into two fragments: the admin password and the user password. The user can type in his or her password, and because the threshold is one, that is sufficient to unlock the system. Same for the admin. Now we can automate this process, right? Because we can add the Tang server we talked about as one of our endpoints, and we still have a threshold of one.
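Shamir's scheme itself fits in a few lines: a random polynomial of degree k-1 whose constant term is the secret, evaluated at n points, with Lagrange interpolation at zero to recombine. A minimal sketch over a prime field (toy field choice; real implementations often work in GF(256) per byte):

```python
import secrets

P = 2**127 - 1   # a Mersenne prime; shares live in the field GF(P)

def split(secret: int, n: int, k: int):
    """Split `secret` into n shares, any k of which recover it."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(k - 1)]
    return [(x, sum(a * pow(x, i, P) for i, a in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def combine(shares):
    """Lagrange interpolation at x=0 recovers the constant term."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * -xj % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

key = secrets.randbelow(P)
shares = split(key, n=5, k=3)           # five fragments, threshold three

assert combine(shares[:3]) == key       # any three fragments suffice
assert combine(shares[2:]) == key
assert combine([shares[0], shares[2], shares[4]]) == key
assert combine(shares[:2]) != key       # two reveal essentially nothing
```

Nesting, as described above, is just putting one share's value through `split` again with its own threshold.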
So now, when the user brings their laptop to work, they turn it on at work, they get on the network, and the disk decrypts automatically because they're on the corporate network: they have access to the Tang server, and everything just continues automatically. But if they're at home or at a coffee shop, they can use the user password, or the admin can use the admin password. (Audience question: for this to be secure, you've got to have a network stack loaded before you...) Sure, this doesn't all just magically work because I put it on a slide; I wish it were that easy, unfortunately. But yes, we really do have a network stack in early boot, and in fact I'm going to show you a demo of exactly this at the end. It comes with some complications: we don't currently have wireless in the early boot stack, so we're bringing this up in stages, and we'll see how far we can take it. (Another audience question, about combining this with secure boot.) Yeah, so we have options here, but we're still trying to bring up the infrastructure, we're still trying to bring up the protocols. Once we get these protocols up and established, we can really start to look at some of those other avenues. So this is the standard automated laptop case, right? But let's get a little more esoteric. Imagine now that we have a system that really needs to be secure. We've got nuclear launch codes on it. We don't want to just have one password that we share out with all the people, right? Instead we want three distinct passwords, of which two are required: the "you've both got to turn the key to blow up the world" kind of scenario. Here, two different users have to enter a password in order to get in. Now let's look at a complex theoretical laptop policy. In this case we have our first level, which has a threshold of one, and it has a QR code.
And this is the master recovery key, right? IT provisions this laptop, they generate a cryptographically strong random key, they print it as a QR code, and they lock it in the safe. It's only ever used for disaster recovery. That one has a threshold of one, because if they have the QR code, you want them in, no matter what. But now we go to another level of Shamir's. Here we have a threshold of two, okay? In this case we're bound to the TPM, as well as to another subtree, and this means that both are required. So in the first case, with the QR code, they can take the hard drive out of the laptop and use the QR code to get back into the disk. But in all of the remaining cases, the disk has to be in the chassis, because it's bound to the TPM and there's a threshold of two, meaning the TPM is required. Then we drop down to another threshold, and we have four options here: password, fingerprint, Tang, and Bluetooth. Of these four, two are required. So you can now imagine that we have a sliding scale of security. If I'm at my desk, with a Bluetooth beacon right within range of my desk, and I'm on the corporate network with access to Tang, I've fulfilled my two requirements completely automatically. On the other hand, say I take my laptop and walk into a conference room. I still have access to the corporate network, so I can get to Tang, but I'm out of range of the Bluetooth beacon now, right? In this case, I just scan my fingerprint and I get in. But now I take my laptop and walk out of the building, to an even less secure environment: I go down to my local coffee shop and try to get at my data. In this case I need two factors again, so I use my password and my fingerprint. So you can see that we've used this complex structure to synthesize, essentially, a real-world environment.
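The whole laptop policy above is just a tree of thresholds, and checking whether a set of available factors satisfies it is a tiny recursion. This sketch only models the policy logic (the real system would reconstruct actual Shamir key fragments at each node); the factor names are made up for illustration:

```python
def satisfied(node, available):
    """Evaluate a nested threshold policy.
    Leaves are factor names; inner nodes need `threshold` of their children."""
    if isinstance(node, str):
        return node in available
    met = sum(satisfied(child, available) for child in node["children"])
    return met >= node["threshold"]

# The hypothetical laptop policy from the talk:
policy = {
    "threshold": 1,
    "children": [
        "qr-recovery-code",
        {
            "threshold": 2,
            "children": [
                "tpm",
                {
                    "threshold": 2,
                    "children": ["password", "fingerprint", "tang", "bluetooth"],
                },
            ],
        },
    ],
}

assert satisfied(policy, {"qr-recovery-code"})                # disaster recovery
assert satisfied(policy, {"tpm", "tang", "bluetooth"})        # at my desk
assert satisfied(policy, {"tpm", "tang", "fingerprint"})      # conference room
assert satisfied(policy, {"tpm", "password", "fingerprint"})  # coffee shop
assert not satisfied(policy, {"tpm", "password"})             # one factor short
assert not satisfied(policy, {"tang", "bluetooth"})           # TPM missing
```

Each scenario from the talk maps to one of the assertions: the QR code alone always wins, and every other path requires the TPM plus any two of the four leaf factors.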
We've identified that near my desk is a really secure environment. When I walk away from my desk, it's a little bit less secure, but we can still have some automation and some easy access. And when I'm in a completely untrusted environment, like a coffee shop, I have to use the strongest level of authentication to get access. Does this make sense to everybody? Have I explained it well? Yes. Now just think about your normal life; this is precisely the way you work. In fact, I like to call this essentially a neural network for encryption, because that's sort of what it is. In our own brains, when we walk into an environment we don't know, we piece together the clues that are around us, and we decide, subconsciously, whether I'm in danger or I'm not. We have feelings of comfort or relaxation in a safe environment, and a feeling of elevated stress in an environment that is not secure. Essentially what we're trying to do here is synthesize the way humans actually behave, and create a security policy around that. We're emulating human behavior. And here's the key point I would like to leave you with (we still have a couple more things to talk about, but we're coming into the home stretch): let business policy drive the crypto policy, and not vice versa. We're trying to create a flexible system, one in which the person who's actually deploying it can create the policy. It's very easy for us to create one or two policies that we think are very good, but that doesn't let the business policy drive the crypto policy; it means that a crypto engineer has designed the crypto policy. This way, you can be much more flexible. So the project that implements this, the client side, is called Clevis, and it's client-side pluggable key management. Currently we have three plugins.
We have HTTP, with support for the traditional escrow model. We have a lot of people that use the escrow model and want to migrate away slowly, and there are perhaps some cases where regulation might require an escrow model. Basically this just does an HTTP PUT against the URL to store the key, and an HTTP GET to get the key back. This supports Custodia, by the way, which implements this API; it's another one of those projects in Latchset, and it's interesting in its own right, so go see it. Clevis has minimal dependencies, and then each plugin can add its own dependencies; we want this to be very, very flexible. We have early-boot integration, which I'm going to show you a demo of, and we have GNOME integration, which I'm not going to show you a demo of. Both of these are in progress: they actually work, but I made some core changes and broke some stuff, so they currently don't work, but they will work again very shortly. So let's look at a demo. And unfortunately this always wants to help me by picking the lowest resolution. Okay. On the right-hand side we have what is essentially the server, and on the left-hand side we have a client in a virtual machine. You'll notice that we're booting up on the left, and it's going to require our password. On the right we're going to set up a Tang server. This is actually some older code, so it doesn't look exactly like this anymore, but you'll get the idea. So we just generate some keys and start the server over here. We've typed in the password over here to boot. So here we're generating some keys, and now the server side is ready. This is from scratch, by the way, from a yum install. Then on the left-hand side we're going to install the client integration, including the early-boot integration. And then we're going to bind the data, the LUKS partition, to the Tang server. I just grabbed the IP address from over here.
It shows you the key, asks if you want to accept it, and you type in the existing LUKS password. And now our disk is bound. On the right side you'll see that the server is actually running; it started automatically with systemd. I think we turn it off to show that it will actually turn back on when we reboot. So now when we reboot on the left, you'll notice that the password prompt comes up, but I don't type anything. The server is contacted, the client processes the key, and it boots automatically. So now we're booted, and now you'll notice I completely stop the service and the socket, reboot again, and this time I won't type anything, and it will just hang and wait there, because it can't contact the server. In this case you can of course fall back to the password: when I type in the password, it continues booting. So that's just a really, really simple demo of the technology we're making. Let's talk about two more things: some of the dependencies we're using, because we had to build some infrastructure for this project. The first dependency is José, our JOSE library, and it in turn depends on just OpenSSL, because we don't implement our own crypto, right? That's the number one rule: don't implement your own crypto. So we depend on OpenSSL in José, and we depend on zlib for compression, which is part of the standard. Basically this is just a library that provides JSON Object Signing and Encryption. It's a C-language library, so it's very low-level, with next to no dependencies. And we also have this really cool command-line utility, which I'm really proud of, and which is simply user-friendly crypto. So here's an example of using the command-line utility. We echo this message, "hi", and pipe it into jose. We're going to encrypt; we take our input on standard input; we use this RSA public key; and we output the message into this file. That's it. We've just encrypted the message.
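The object that lands in that output file is nothing exotic: plain JSON whose binary fields are base64url-encoded. A rough sketch of the shape, with field names following RFC 7516's flattened JSON serialization and placeholder bytes standing in for real encryption output:

```python
import base64
import json

def b64url(data: bytes) -> str:
    # JOSE uses unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

header = {"alg": "RSA-OAEP", "enc": "A128GCM"}
jwe = {
    "protected": b64url(json.dumps(header).encode()),
    "encrypted_key": b64url(b"\x01\x02\x03"),   # placeholder bytes
    "iv": b64url(b"\x04\x05\x06"),
    "ciphertext": b64url(b"\x07\x08\x09"),
    "tag": b64url(b"\x0a\x0b\x0c"),
}

text = json.dumps(jwe)            # plain text on the wire and on disk...
parsed = json.loads(text)
assert json.loads(b64url_decode(parsed["protected"])) == header
# ...with binary only inside the base64url fields.
```

So the file is grep-able, diff-able text; only the values of the individual fields are binary underneath.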
Now we're going to decrypt it. So jose again: decrypt, with the input being that same output file, and this time using the RSA private key in order to decrypt it, and we get the message back. If we try the same thing with a different key, it fails. Using this utility, you can generate keys, you can perform encryption, you can perform decryption, and you can perform signing and verification, all using the standard JSON objects. It's really simple to use, but you can also control all of the parameters if you want, so it's both simple and powerful. (Audience question: what is that message.jwe?) It's a JSON object. (So it's a binary file?) No, it's text: a text file that contains JSON, and some of the attributes of the JSON object are base64-encoded because they're binary. Basically it's just metadata, and for the binary stuff it's base64. This is all standard; I'm not inventing this myself. There is a set of, I think, seven RFCs that specifies all of these objects, exactly what they're supposed to look like, and how the algorithms are supposed to work. This is one of the best standards to come out of Microsoft in recent years. It's really fantastic. So anyway, if you want easy crypto: right there. Another dependency is a project called luksmeta. The problem we had is that we need to store metadata about keys (where we get them, things like that), and we need it stored on the disk before we decrypt the disk, because we have to be able to access it. The problem is that the vast majority of people have already deployed their disks using LUKS taking up the entirety of the raw disk, right? So we needed somewhere to put this metadata. Fortunately for us, there's a gap in the LUKS1 header: you have the LUKS1 metadata, then you have the key slots, then there's a gap, and then you have the start of the encrypted data.
So we created this library called luksmeta, and basically what it does is allow you to put stuff in that gap. It's also very easy to use; it's a C library with a command-line utility, and you can use it completely independently of all the stuff we've talked about. Just a quick example. We echo some metadata out, again our message "hi", and pipe it into luksmeta to save it on this device, in slot 2, with this UUID. The UUID is just randomly generated, and it identifies, hopefully in a collision-proof way, the type of the contents that are in slot 2 now. Okay? Then, in the second command, we load that data back out: same device, same slot, same UUID, and we get the data back. However, if we try to read the same slot with a different UUID, it tells us the slot contains a different UUID, and it won't give us the data back. It's just there to protect it, right? I'm expecting this kind of data in this slot, and if everything matches, you get the data back; if not, you don't. So: fairly simple, fairly idiot-proof. Okay. (Audience question: and so that UUID would be something that would be gotten back from the Tang server, or...?) No, it's just randomly generated. It's essentially a type identifier for whatever metadata you're storing. Okay? If I define my format, I randomly generate a UUID that indicates that the metadata is in that format, and then I always use that UUID when reading and writing the metadata. Then I'm always guaranteed, essentially (unless there's a collision, which is extremely unlikely given a UUID; it's basically impossible), that it uniquely identifies the type of data. As long as the data in that slot is under that UUID, I know it's my data and not somebody else's. This, of course, is useful apart from the other projects: you can store your own data there if you want to implement a different unlocking scheme. So this is where our terminology comes from. This is your standard sort of binding mechanism; people use this, and the same mechanism is used in ancient handcuffs: the U-shaped part is the clevis.
The thing it binds to is the tang, and the plugins for Clevis, by the way, are called pins. So that's where the terminology comes from. Any questions? I think I've killed all of you with math. Yes? (Audience question about whether this works with distributed file systems.) It should, because all of those distributed file systems sit on top of a block device, so you would just do LUKS encryption underneath the block device and the distributed file system on top of it. That's how Gluster works, for instance. (Audience question about how big the gap in the LUKS header is.) It depends. It depends on the profile you used when you created the LUKS header. For instance, if you use 256-bit keys, the gap is fairly large; with 512-bit keys, it's fairly small. But you can also adjust the size of the gap by specifying a parameter when you're provisioning the system. So if you know ahead of time that you want a pretty complex policy, you can just shift the encrypted data right when you're provisioning. The idea is: yes, we are somewhat constrained on some default deployments, including RHEL 7 and probably recent Fedoras, because they're all using 512-bit keys now. But users who want to plan ahead can get more space if they want. LUKS version 2 is also in planning right now, and there's some preliminary code, and they actually plan to have a dedicated JSON storage area. You'll notice that our use of JSON matches very tightly with their use of JSON, and that's because when we see LUKS version 2 in the near future, we can hopefully port to that, just use that area, and not use luksmeta at all. But luksmeta does provide us compatibility with LUKS v1. There are also some other interesting uses for this. One of the ideas I like, which is a little bit controversial depending on whether you like ext4 encryption or not: Google implemented encryption at the file-system level in ext4, which means you can actually encrypt directories inside of ext4, and I would like to store the metadata in an extended attribute, which again would be a fairly limited size.
But if we can store that in the extended attribute, then we can automatically unlock, say, a directory. Interestingly, we can unlock it in a namespace and then start a process in that namespace. So say you have a database: you can encrypt the directory, automatically unlock it when the database starts, and only the database process sees the unlocked directory. So, various other ideas like that. Obviously full-disk encryption is what we're targeting first, because that's what most people are using, but this is flexible key management, so it can be deployed in a lot of other areas that aren't just disk encryption as well. (Audience question: so, for example, an application could bind application-specific keys, so the actual application, instead of an admin, manages the secret for it?) Correct. Yes, you don't have to commit your passwords to your configuration files in git; you can just use this to automatically recover them. Thank you everybody for coming. Appreciate it. Thank you.