Hi, I'm Daniel Wichs, and I'm going to tell you about incompressible encodings. This is joint work with Tal Moran. Let me start with a question: how much space does it take to store Wikipedia? It turns out that you can actually download the entire Wikipedia contents for offline use, and depending on which version you get, it can be something like 50 gigabytes; this is for the text-only English version. But there's an alternate way of storing the Wikipedia data, which is just to store the link, www.wikipedia.org. That only takes about 17 bytes, and with this link you can access the Wikipedia data whenever you want. The point I want to make here is that when it comes to public data like Wikipedia, even though the data can be very large, just by storing the link you essentially have this data for free, without storing it. And the question for this talk is: can we take such public data, say the entire Wikipedia contents, and make it incompressible? In other words, can we come up with a representation of the Wikipedia data that would require, say, 50 gigabytes of storage, even for people who have the link to the underlying Wikipedia data and can access it for free without storing it? So let me make this precise and get straight to defining this notion of incompressible encodings. An incompressible encoding consists of two algorithms, an encoding algorithm and a decoding algorithm. The encoding algorithm takes some underlying data M, say the Wikipedia data, and encodes it in a probabilistic, randomized way to derive some codeword C. Here I made the image of the codeword C look noisy to indicate that the encoding uses randomness to derive the codeword. But this codeword C is still a good representation of the Wikipedia data.
Anybody can come and decode the codeword and recover the original data itself, so storing the codeword is as good as storing the underlying data; you can think of it as just an alternate, randomized representation of the data. Again, I want to emphasize that the encoding and decoding procedures are public, efficient procedures. Anybody can run them; there are no secret keys or public keys or anything like that. And we will require correctness: whatever message you start with, if you encode it and then decode it, you should get back the original message with probability one, where the probability is over the randomness of the encoding procedure. But we want this encoding to be incompressible, and that means it should be hard to compress the codeword C even for an adversary that knows the underlying data. So we're going to consider an adversary for whom the data M is public, say one that has access to Wikipedia on the internet, and the goal of the adversary is to take this codeword and compress it down. Remember that the codeword was created using some randomized encoding procedure, so it contains some additional information beyond the Wikipedia data. The requirement is that it should be hard to compress the codeword itself, this new representation of the Wikipedia data, even for someone who knows the underlying data for free. The adversary consists of two algorithms, compress and decompress. The compression algorithm takes the codeword C and outputs some compressed value W, which will be of size beta bits. The decompression algorithm takes this compressed value W, and its goal is to recover the original codeword. Security says that this should be impossible for an efficient adversary to do: it should hold for any message that you start with and any adversary consisting of these compress and decompress procedures.
And notice I'm quantifying over the worst-case choice of the message, which indicates that the adversary could actually have the message hard-coded inside it. In other words, this message is completely known, and we don't pay anything for storing it. And we're going to say that if we take the message, encode it with this probabilistic encoding algorithm, give the result to the compress procedure, which outputs some compressed value of beta bits, and then apply the decompression procedure, then the probability of getting back the original codeword should be negligible. In other words, the probability of the adversary succeeding in compressing this codeword, and later being able to use the compressed value to recover it, should be at most negligible. The goal is to come up with a scheme where the size of the codeword, alpha, is as close as possible to the original data size, so we don't have much overhead, and where the incompressibility bound beta is close to the entire codeword or data size, so the adversary cannot compress the codeword much below the size of the underlying data. Okay, so let me tell you a little bit about this problem. This problem was actually introduced in prior work, so I'll tell you a little bit about the prior work and our results. There's a prior work of Damgård, Ganesh, and Orlandi, and they defined a variant of this notion of incompressible encodings, essentially as a building block to build something called proofs of replicated storage; I'm not going to get into what those are in this talk. In this context, they constructed these incompressible encodings, albeit with some major caveats, which I'll discuss later; one of them is that their construction only works in an idealized model like the random oracle model. In our work, the main point is actually to take this notion of incompressible encodings and define it as a standalone object that we believe is interesting to study in its own right.
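To make the security game concrete, here is a small Python sketch of the definition just described; it is my own illustration, not code from the paper. It runs the compress/decompress experiment against a deliberately insecure toy "encoding" (the message plus 4 random bytes), showing exactly why the adversary's hard-coded knowledge of M matters:

```python
import secrets

def compression_experiment(message, encode, decode, compress, decompress, beta_bits):
    """One run of the incompressibility game for a fixed, public message M.
    The adversary pair (compress, decompress) may have M hard-coded; it wins
    if, from at most beta_bits of stored state, it reproduces the codeword C
    exactly."""
    codeword = encode(message)             # randomized encoding of public M
    assert decode(codeword) == message     # correctness must hold
    state = compress(codeword)
    assert len(state) * 8 <= beta_bits     # the adversary's storage bound
    return decompress(state) == codeword

# A deliberately *insecure* toy encoding: append 4 random bytes to M.
def toy_encode(m): return m + secrets.token_bytes(4)
def toy_decode(c): return c[:-4]

m = b"public data"
adversary_wins = compression_experiment(
    m, toy_encode, toy_decode,
    compress=lambda c: c[len(m):],   # store only the 4 random bytes...
    decompress=lambda w: m + w,      # ...since M itself is known for free
    beta_bits=32)
assert adversary_wins  # 32 bits suffice: this toy encoding is NOT incompressible
```

A secure incompressible encoding must make this experiment fail for every efficient adversary whose beta is noticeably smaller than the codeword size.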
And we give new and improved constructions of this primitive; in particular, as we'll see later, we get constructions in the standard model, without a random oracle. We also give some lower bounds and negative results, and we give a new application of this object to something called big-key cryptography, so a completely different application than the original one, proofs of replicated storage. Let me actually start by giving you this application. I think it will also illuminate a little more about what the notion is and why it's useful, and then we'll talk about the constructions. So the application is to big-key cryptography. This is an area that's been studied in the past, and it's motivated by the problem that computers can often become compromised by various malware, Trojan horses, et cetera. Essentially, this means that a remote attacker can get access to the compromised computer, and if you have any kind of secret key on this computer, the attacker can just exfiltrate it: download it and get the secret key. That means that whatever security the secret key was providing is completely lost; if this was a decryption key, the attacker can now decrypt any messages that were encrypted with respect to it. A really nice idea to prevent these types of attacks is to make secret keys intentionally big. So instead of having a secret key of 128 or 256 bits, let's think of a secret key which is 50 gigabytes in size. The idea is that a 50-gigabyte secret key is difficult to exfiltrate from the compromised system; it's a lot harder to get out than a small key. Maybe there are firewalls, et cetera, that can detect huge amounts of exfiltration but wouldn't be able to detect just a 128-bit key being exfiltrated. Here I want to use an image that's due to Adi Shamir. Adi had this great analogy.
He said: if you want to protect a small diamond, you really need a lot of security; you need to watch it very carefully and have a security team. But the Statue of Liberty is also a pretty valuable object, and you're not really worried about people stealing the Statue of Liberty, because it's very difficult to steal. So that's the idea: we're going to make the key intentionally huge to make it difficult to steal. This idea actually goes back a while; it's been studied in quite a few cryptographic works, often referred to as the bounded retrieval model or big-key cryptography. These cryptosystems have two goals. The first goal is to design a cryptosystem that has a huge key, a 50-gigabyte key, and ensure that security holds even if, let's say, 99% of the key data leaks. So even if 49 out of 50 gigabytes are leaked out of the system, the adversary cannot break the security of the cryptosystem. The second goal is to make sure these cryptosystems are efficient even though their secret key is huge. For example, even just reading an entire 50-gigabyte key in each cryptographic operation would be really inefficient, so we want to make sure these cryptosystems only read a small portion of the key in each operation, and their overall efficiency should not be much worse than that of standard small-key cryptosystems with a 128- or 256-bit key. There was a lot of work devoted to this idea, but one problem with all of the prior works is that they require the users to store a huge 50-gigabyte key and waste their storage on it. The key is completely useless for any purpose other than cryptography, so you're essentially just wasting storage. And so the new idea in our work is: let's make the key useful.
Instead of making the key random data, let's make the key store something useful, like, say, a movie database that the user would store anyway, or an offline version of Wikipedia that the user wants to keep so they can still read Wikipedia when they go offline. So the idea is: let's make the secret key consist of Wikipedia data. Now, is that a good idea? Is Wikipedia a good secret key? Well, no, it's a really terrible idea. Wikipedia is public; anyone can access it on the internet, so it's a really terrible secret key. But the idea is: don't use Wikipedia itself as a secret key. Instead, use an incompressible encoding of Wikipedia. And here you can already see that this starts making a lot more sense. First of all, there was some randomness that went into making this encoding, and therefore the encoding is not predictable, so at least it has a chance of being a secret key. Moreover, even if the system is compromised and an adversary tries to exfiltrate some small amount of data, this encoding is incompressible, so it ensures that if you use it as a key, the adversary can't steal the entire key by exfiltrating some small amount of data from the compromised system. So this tells us it at least has a chance of working. Of course, we're not done yet: in order to actually make use of this key, we have to design new cryptosystems that ensure security holds as long as the attacker doesn't exfiltrate essentially the entire key. And we managed to do this. In our work, we show how to construct public-key encryption in this setting.
So essentially, we construct a public-key encryption scheme where the secret key can be a huge incompressible encoding of some public data, say an incompressible encoding of Wikipedia, and security will hold even if the attacker exfiltrates a large fraction of the encoding, as long as he doesn't exfiltrate enough to be able to recover the codeword. Okay, so that's the new application in our work. Now I want to tell you a little bit about the constructions of incompressible encodings, both from our work and from the previous work. As I mentioned, this problem was originally studied in a work from Crypto last year by Damgård, Ganesh, and Orlandi, who gave a construction in the ideal permutation model, additionally using trapdoor permutations or RSA. So this was not in the standard model: they assume an ideal permutation, which is something a little bit stronger than a random oracle. That's one caveat of their work, that they needed ideal permutations. The other caveat is that the complexity of the encoding process was quadratic in the message size. This is actually quite a big caveat, because we want to apply this to large messages, gigabytes or terabytes in length, and quadratic complexity on such large data is really, really inefficient. Lastly, there was one more caveat, which is that the proof of security in that work was flawed. We noticed that there were serious issues with the security proof, and it seemed to be flawed beyond some simple patch or fix. And actually there was a concurrent work, concurrent to ours, by Garg, Lu, and Waters; they also noticed that there was a problem with the proof, and they actually managed to fix it. I want to give them a shout-out: this was not a simple fix. It really required an entirely new proof and quite a bit of difficult work.
So this was actually a major new result, showing that the original construction from the work of Damgård, Ganesh, and Orlandi can be fixed and proven secure. And in doing that, they also managed to improve it in one additional way: they replaced the ideal permutation with just a random oracle, which is a somewhat weaker idealized model. In our work, we give a brand-new construction of incompressible encodings, a different construction, and we manage to give a construction that is provably secure in the common random string (CRS) model. So we avoid the use of random oracles, and we prove it secure under either the Learning with Errors (LWE) or the Decisional Composite Residuosity (DCR) assumption. That's one improvement: we get a standard-model construction. The other improvement is that we improve the complexity from quadratic to linear, and as I said earlier, because we're thinking of applying this to big data, this is significant. But there are some caveats with this construction. It only achieves selective security, where we assume the message has to be chosen by the adversary before seeing the CRS, and the CRS is very long: it's as large as the data we're trying to encode. So if we're trying to encode Wikipedia, we'll have a 50-gigabyte-long CRS. We show that we can remove both of these caveats in the random oracle model: there, we can get adaptive security, and there's no CRS at all, just the random oracle, so we don't need a long CRS either. So there we essentially get the best of all worlds, and this still improves on the previous work by bringing the complexity from quadratic down to linear, even in the random oracle model. We also give some black-box separations to show that these constructions are essentially optimal.
For example, we show that we cannot have provably secure, non-trivial constructions of incompressible encodings in the plain model, and in the CRS model we actually need to suffer all of the caveats that our construction has: the CRS needs to be long, and we can only achieve selective security. This holds for constructions that are provably secure using game-based or falsifiable assumptions. Okay, so let me tell you a little bit about our construction; I'll give you a slightly simplified version of it. One additional benefit of our construction is that it's conceptually very simple, and I'll actually be able to show you the proof of security for a slightly simplified version. For the simplified version, let's pretend that we have a really nice object called a lossy trapdoor permutation. A lossy trapdoor permutation is like a lossy trapdoor function, a notion due to Peikert and Waters, which is also a permutation. So we assume we have a family of functions f_pk, indexed by a public key pk, mapping n bits to n bits: both the domain and the range are {0,1}^n. And we can sample the public key in one of two indistinguishable modes. In the first mode, the function is injective; this is the standard notion of trapdoor permutations, where we can sample the public key together with a trapdoor that allows us to efficiently invert the function at any point. But in the second mode, we can sample the public key in a way that the image size of the function f_pk is much, much smaller than the domain {0,1}^n, say 2^{o(n)}, which means that if I give you f_pk(x), this loses almost all the information about the input x. So it's very lossy. Okay, so using this lossy trapdoor permutation, let me show you how to construct incompressible encodings in the CRS model.
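Since this two-mode object does all the heavy lifting, here's a tiny Python illustration of its syntax, using explicit lookup tables over an 8-bit toy domain. This is purely my illustration: in particular, the two modes here are trivially distinguishable, unlike a real lossy trapdoor function built from LWE or DCR, where indistinguishability is exactly the computational assumption.

```python
import secrets

N = 256  # toy domain {0,1}^8; a real scheme uses cryptographically large n

def sample_injective():
    """Injective mode: a uniformly random permutation of the domain; the
    inverse lookup table plays the role of the trapdoor."""
    perm = list(range(N))
    secrets.SystemRandom().shuffle(perm)
    trapdoor = {y: x for x, y in enumerate(perm)}
    return perm, trapdoor                # (pk, td)

def sample_lossy(image_size=4):
    """Lossy mode: the whole domain maps into a tiny image, so f_pk(x)
    retains almost no information about x. No trapdoor exists."""
    image = [secrets.randbelow(N) for _ in range(image_size)]
    return [image[x % image_size] for x in range(N)]   # pk only

def f(pk, x): return pk[x]           # evaluate in the forward direction
def invert(td, y): return td[y]      # invert using the trapdoor
```

In injective mode, `invert(td, f(pk, x))` recovers any `x`; in lossy mode, the image has at most `image_size` elements, so almost all input entropy is destroyed.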
In this construction, the CRS will consist of some number of random n-bit values y_1, ..., y_L; you can think of these as outputs of the trapdoor permutation, but it's just a common random string. Now, to encode a message M, which we think of as consisting of L blocks m_1, ..., m_L of n bits each, we're going to sample a random public key pk, along with a trapdoor, for the lossy trapdoor permutation in injective mode. Then we're going to invert the trapdoor permutation on the values m_i XOR y_i: we take each message block, XOR it with the corresponding value in the CRS, invert, and output the resulting x_i's. The encoding is going to consist of the public key and all of these preimages, and we're going to forget the trapdoor. The trapdoor was part of the randomness of the encoding process, but the encoder forgets it afterwards. And this is efficiently decodable: if I give you this codeword C, you can just apply the function in the forward direction on the x_i's. You compute f_pk(x_i), XOR it with the corresponding y_i in the CRS, and recover the message. So anybody can decode the codeword and recover the original data. Okay, now I want to give you the proof of security, which is intuitively very simple. The adversary in this game sees the common random string and some codeword C, which is an encoding of some message that the adversary knows. In the original game, the distribution of these two values is that the public key is sampled in injective mode, the y_i's in the CRS are uniformly random, and the x_i's are computed by inverting the function on the message XORed with the CRS. But we can actually sample this distribution in an alternate way, which is completely identical. Before, we were sampling the y_i's in the CRS at random and computing the x_i's by inverting the permutation.
Now let's switch it up: sample the x_i's uniformly at random and compute the y_i's by applying the function in the forward direction. This is exactly the same distribution: notice that the x_i's and y_i's are individually random in each of the two cases, subject to the relation y_i = f_pk(x_i) XOR m_i holding. So it's the exact same distribution, but syntactically we're now sampling the common random string in a way that depends on the message. This is something we couldn't have done in the original construction, where the CRS was chosen independently of the message, but in this hybrid distribution we can do it, even though nothing has changed from the distribution's perspective. And now we're just going to make the public key lossy, which is indistinguishable by the security of the lossy trapdoor permutation. So what's happening now? The adversary gets the CRS and the codeword under this lossy public key, and notice that the codeword has a lot of entropy even given the CRS. Why is that? Because the x_i values in the codeword are chosen uniformly at random, and the CRS reveals only a lossy function of them, which loses almost all information about the x_i's. So even if I give you the CRS, the codeword has a lot of information-theoretic entropy, even for a completely fixed message that is known and public. And this says that in this hybrid, the codeword is actually information-theoretically incompressible: even an unbounded adversary in this last hybrid cannot compress the codeword down to something much smaller than the codeword size, because it has a lot of real information-theoretic entropy. And because this hybrid is indistinguishable from the original distribution, it means that no efficient adversary can succeed in compressing the codeword in the original distribution either. So that's the entire proof of security, a very intuitively simple idea. Lastly, I want to talk about the instantiation.
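To pin down the encode and decode equations of this simplified construction, here is a self-contained toy sketch in Python. The lossy trapdoor permutation is faked by a freshly shuffled permutation table over an 8-bit domain (so there is no actual security here, just the syntax of the scheme and the relation y_i = f_pk(x_i) XOR m_i from the proof):

```python
import secrets

N = 256  # toy block domain {0,1}^8; real blocks are n bits for large n

def setup_crs(num_blocks):
    """The CRS: uniformly random n-bit values y_1, ..., y_L."""
    return [secrets.randbelow(N) for _ in range(num_blocks)]

def encode(crs, msg_blocks):
    """Sample a fresh (pk, trapdoor) in 'injective mode' -- here faked by a
    random permutation table -- set x_i = f_pk^{-1}(m_i XOR y_i), and then
    forget the trapdoor."""
    perm = list(range(N))
    secrets.SystemRandom().shuffle(perm)
    trapdoor = {y: x for x, y in enumerate(perm)}     # used once, then dropped
    xs = [trapdoor[m ^ y] for m, y in zip(msg_blocks, crs)]
    return perm, xs                    # codeword C = (pk, x_1, ..., x_L)

def decode(crs, codeword):
    """Public decoding: m_i = f_pk(x_i) XOR y_i, forward direction only."""
    pk, xs = codeword
    return [pk[x] ^ y for x, y in zip(xs, crs)]

msg = [7, 42, 199]
crs = setup_crs(len(msg))
assert decode(crs, encode(crs, msg)) == msg   # correctness
```

The hybrid argument from the talk corresponds to sampling `xs` first and deriving the CRS as `y_i = pk[x_i] ^ m_i`, which yields an identical joint distribution.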
Unfortunately, we don't have lossy trapdoor permutations that are as beautiful as what I assumed, where the domain and range are just {0,1}^n. If you look at known constructions of lossy trapdoor functions, the vast majority of them are not surjective: the output size is bigger than the input size. And in our construction, we crucially use that we have a surjective function, because the encoding scheme actually inverts the function at random outputs. So we cannot use the vast majority of constructions of lossy trapdoor functions, because they're not surjective; they also don't have nice domains like {0,1}^n, their domains being much more structured, like group elements. In our work, we show that we actually don't need these beautiful lossy trapdoor permutations; we can make do with something a little more relaxed, called surjective lossy functions. The domain doesn't have to be as nice as {0,1}^n, and while the functions do have to be surjective, they don't have to be fully injective: we relax the injectivity requirement. And we manage to construct these from the Decisional Composite Residuosity and Learning with Errors assumptions. I want to mention that the LWE construction in particular has some interesting new ideas. For example, we don't even know how to construct trapdoor permutations from LWE, so you really need something different from, say, even the random oracle construction of Damgård, Ganesh, and Orlandi, which relied on trapdoor permutations. Here we managed to show something under LWE which is not quite a trapdoor permutation, but is as close as possible to one: a surjective function where the domain is not much bigger than the range. So it's surjective and a little compressing, but not by much. This involves some interesting new ideas, and I encourage you to look in the paper.
And that's all I want to tell you. Thank you very much for your attention. If you have any questions, I'll be giving this talk live during Crypto, and I'm also happy to answer questions by email, so please email me. Thank you.