Hi, everyone. I'm Kevin, a researcher from Google, and today I'll talk to you about recent work on lower bounds for encrypted multi-maps and searchable encryption in a new model that we call the leakage cell probe model. This is joint work with my fabulous collaborators, Sarvar Patel and Pino Persiano, from Google and the University of Salerno, respectively. All right, so let's get into it. The scenario we're considering involves a user, or data owner, who wishes to outsource their data to a potentially untrusted third-party server, shown on the right, and wants to do so in a privacy-preserving way. Let's consider an example where the user has already uploaded a map, or key-value store, where each key is associated with values. If the user wants to query for a key k_i, it might send the key in plaintext and get back the associated tuple v_i. The point I want to get to quickly is that even if all the values were encrypted under a client-side key, during these queries the server learns which key is being queried. For a single query, that might not seem so bad, but over a large sequence of queries and inserts, the server accumulates more and more detailed statistics. For example, it might learn that the second key was never queried while the 15th key was the most frequently queried. It's been shown in the past that this kind of information can be used to figure out what data is being outsourced, even when the values are encrypted under a client-side key, and even what algorithm or intent the user has, just by looking at these keys.
In an ideal world, what we would like is something like the following: the user, given a key k_i to query, perhaps performs a sequence of complicated retrievals from the server and obtains the tuple of values v_i associated with k_i, but does so in such a way that the server doesn't learn which key was requested. To understand this problem further, we can look at this map data structure and the privacy spectrum achieved by various solutions. The most basic solution is the plaintext map. This is a very classic problem, known as the dictionary problem many years ago, that has been studied for a long time and solved very well: there are many solutions such as perfect hashing, the FKS hashing scheme, cuckoo hashing, and many more. What you get with these plaintext maps is constant overhead: to retrieve the values v_i associated with a key k_i, you pay cost proportional to the answer. Similarly, storing a table of n keys and their values requires only O(n) storage. On the other hand, in terms of privacy, a plaintext map leaks all keys and values, because privacy was never a requirement in this setting. So we can then consider a slightly different primitive called structured encryption, which has stronger privacy but slightly worse efficiency. The idea of structured encryption is to encrypt a data structure while maintaining its operations. The classic, heavily studied example is searchable encryption, where you essentially encrypt a search index while preserving the search index's operations.
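For reference, the plaintext-map baseline at the left end of the spectrum can be sketched in a few lines. This is just an illustrative toy (the class name is mine, not from the talk) showing the constant-overhead, O(n)-storage, zero-privacy point:

```python
# Minimal sketch of the plaintext-map baseline: a key/value store with
# expected O(1) operations and O(n) storage, but no privacy at all --
# the server sees every key and every value in the clear.
class PlaintextMap:
    def __init__(self):
        self.table = {}               # hash table: expected O(1) per op

    def insert(self, key, values):
        self.table[key] = values      # server learns the key and values

    def get(self, key):
        return self.table.get(key)    # server learns which key is queried

m = PlaintextMap()
m.insert("cat", ["01"])
assert m.get("cat") == ["01"]
assert m.get("dog") is None
```

Everything else on the spectrum trades away some of this efficiency to hide what `insert` and `get` reveal here.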
There have been many works over the past two decades, which I've listed here, that consider static, dynamic, and various privacy settings. At a high level, what structured encryption for maps typically gives you is O(1) efficiency, though it can be higher depending on the leakage function. As for privacy, it doesn't leak everything about the keys and values, but it leaks a well-defined leakage function with several components: things like the number of values associated with each key; the key equality between operations, which we'll get to a little later, meaning whether two operations are performed on the same key; and the number of operations the user performs. So on the spectrum, this is a slightly more private but less efficient primitive. Finally, to complete the spectrum, we can go even further to the right and consider something much more private but much less efficient, which can be built from oblivious RAM. What is oblivious RAM? It was introduced by Goldreich and Ostrovsky in the '90s, and many works have since led to constructions with the optimal O(log n) overhead. An oblivious RAM can implement a map with O(log n) overhead, and this turns out to be tight based on lower bounds proved in various models over the past couple of decades. For privacy, what you get is something very strong: the adversary cannot distinguish any two operational sequences of the same length performed by the user.
Translating this into the world of leakage functions, it essentially says that the leakage is only the length of the operational sequence, or an upper bound on it. So in essence, you have this spectrum where the left end is the most efficient but the least private, and as you move right, you trade efficiency for more privacy. In this work, what we're really going to focus on is the area between structured encryption and oblivious RAM. As you can see, there is a gap in efficiency, from constant to O(log n), and similarly a gap in privacy: oblivious RAM is in some sense the ideal or optimal privacy, whereas structured encryption leaks some very non-trivial, very informative leakage that an adversary could use to try to compromise the data. What we're essentially trying to figure out is how this ramp behaves. Is it a sharp jump? Are there constructions with less than O(log n) efficiency but better privacy than the structured encryption schemes? Of course, to study this, I first have to tell you what leakage structured encryption actually attains, because so far I've only said it's a non-trivial leakage function. It turns out that many structured encryption schemes can be described using a very simple hash-and-encrypt compiler. The idea of the hash-and-encrypt compiler is to take any plaintext map with, say, three operations, insert(key, values), get(key), and delete(key); this plaintext map only tries to implement these operations efficiently, without any privacy.
For example, suppose we've implemented this using keys and values in plaintext as before. The hash-and-encrypt compiler simply stores a client-side private key, which actually consists of two keys, a hash key and an encryption key, and replaces each key with a hash of the key and each value with an IND-CPA encryption of the value. This is not a very complicated transformation, but it still enables all the operations. For queries, if the user wants to query for a key k_i, it simply hashes the key locally and sends the hash to the server; the server looks up the hash in the plaintext map and returns the associated encrypted values, or null if the key doesn't exist. Similarly, to insert a key with some value v_i, the user sends a hash of the key and an IND-CPA encryption of the value, and the server simply performs the plaintext operation of associating the hashed key with the encrypted value. So let's take a look at what the leakage of the hash-and-encrypt compiler is. When you do, say, an insert of "cat" with the string 01, what ends up being revealed is that the server sees a hash of "cat", an encryption of the string 01, and the fact that you're doing an insertion, because the server has to know whether to perform a query or an insertion on the underlying plaintext map. Maybe you then do another insertion, "dog" with the value 00, so the server sees a hash of "dog", an encryption of 00, and the fact that you're inserting, and so on and so forth. So let's analyze what the adversary can learn by viewing this information. First, of course, it learns the type of each operation performed.
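The compiler just described can be sketched concretely. This is a hedged toy, not the paper's construction: HMAC stands in for the keyed hash, and the XOR-with-keystream cipher below is a stand-in for any IND-CPA scheme (e.g. AES-CTR in practice); it only handles values up to 32 bytes.

```python
import hmac, hashlib, os

# Sketch of the hash-and-encrypt compiler. The client keeps two secret
# keys: a hash key (HMAC gives the deterministic "hash of the key") and
# an encryption key. Encryption uses fresh randomness, so ciphertexts of
# equal values look unrelated -- but tokens for equal keys are identical.
class HashEncryptClient:
    def __init__(self):
        self.hash_key = os.urandom(32)
        self.enc_key = os.urandom(32)

    def token(self, key):
        # Deterministic: same plaintext key -> same token.
        return hmac.new(self.hash_key, key.encode(), hashlib.sha256).digest()

    def encrypt(self, value):            # toy cipher, values <= 32 bytes
        nonce = os.urandom(16)
        stream = hashlib.sha256(self.enc_key + nonce).digest()
        ct = bytes(a ^ b for a, b in zip(value.encode(), stream))
        return nonce + ct

    def decrypt(self, blob):
        nonce, ct = blob[:16], blob[16:]
        stream = hashlib.sha256(self.enc_key + nonce).digest()
        return bytes(a ^ b for a, b in zip(ct, stream)).decode()

# The server is just a plaintext map keyed by hashed tokens.
server = {}
client = HashEncryptClient()
server[client.token("cat")] = client.encrypt("01")   # insert
blob = server.get(client.token("cat"))               # query
assert client.decrypt(blob) == "01"
```

Note that `token("cat")` is the same bytestring every time it is sent, which is exactly why the deterministic hash leaks key equality.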
As I said earlier, the server has to know whether it's doing a get or an insert on the underlying plaintext map, so it obviously learns the type of operation the user is performing. It also learns the length of each query response. Here, for example, in the first query it learns that there is one encryption associated with the queried keyword, whereas for the fifth operation, the query for "cat", it learns that there are actually two encryptions associated with the query. Finally, the last but very important leakage is the key equality pattern. What the key equality pattern says is essentially that the server can identify which operations are performed on the same plaintext key. Take the first, fourth, and fifth operations as an example. In all of these operations, the server is given a hash of "cat", which is deterministic, so the server can immediately infer that all three operations are performed on the same plaintext keyword. What I want to reiterate, though, is that the server doesn't learn which plaintext keyword it is: it wouldn't know it's "cat", but it knows that all three of these operations are performed on the same key. Similarly, the other two operations are both performed on "dog", and the server learns that they're both on the same keyword. Long story short, the key equality pattern says that for any two operations, the adversary learns whether they're performed on the same key or not. It turns out, perhaps surprisingly, that this matches the leakage of almost all of the best structured encryption schemes with constant overhead; on the privacy spectrum, that's this red box. So a very good question is: can we do better?
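The inference the server performs here is mechanical enough to simulate. In this sketch the tokens `"h_cat"` and `"h_dog"` stand in for the deterministic hashes, and the operation numbering matches the example above (operations 1, 4, 5 on "cat"; 2, 3 on "dog"):

```python
from collections import defaultdict

# The server sees only (operation type, deterministic token) pairs, yet
# can group together all operations on the same (unknown) plaintext key.
transcript = [
    ("insert", "h_cat"),   # op 1
    ("insert", "h_dog"),   # op 2
    ("query",  "h_dog"),   # op 3
    ("insert", "h_cat"),   # op 4
    ("query",  "h_cat"),   # op 5
]
groups = defaultdict(list)
for i, (op, token) in enumerate(transcript, start=1):
    groups[token].append(i)

# Ops 1, 4, 5 share one key and ops 2, 3 share another, but the server
# never learns the words "cat" or "dog" themselves.
assert groups["h_cat"] == [1, 4, 5]
assert groups["h_dog"] == [2, 3]
```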
So let's take a look at the three leakage components and see what we can mitigate. The type of operation performed is actually quite easy to mitigate, because the user can always perform all possible operation types, replacing everything that isn't a real operation with a mock one. For example, if you're doing a query, you can also do a mock insert; if you're doing an insert, you can also do a mock query. The length of the query response is a little trickier; in fact, it's very hard to mitigate, because you essentially have to pad all responses to the maximum length. There have been several recent works, such as Kamara and Moataz at Eurocrypt '19 as well as work from our group at CCS '19, that present various volume-hiding structured encryption schemes. But the point is that hiding the lengths of query responses gets very messy because of the padding. So what we can do instead is focus on the third leakage component, the key equality pattern, and ask what happens if we consider a slightly weaker notion of key equality, that is, slightly smaller leakage, or stronger privacy. Let's look at the key equality pattern again. I'm going to define something we call the decoupled key equality pattern. As we said before, among the first, fourth, and fifth operations, where the first and fourth are inserts and the fifth is a query, the adversarial server learns that all three operations are performed on the same key, "cat", even though it doesn't know what "cat" is. The decoupled key equality pattern essentially decouples the key equalities between insert and query operations.
What happens now is that the adversarial server still learns whether two insert operations are performed on the same key or not, and whether two query operations are performed on the same key or not. However, we decouple the information the adversary sees between inserts and queries: for an insert operation and a query operation, the server never learns whether the two are performed on the same key or not. This is denoted here by the fact that we changed the color of the "cat" in the query to a different color. Similarly, we can do the same thing for the "dog" operations: even though they concern the same keyword, we can try to construct a scheme where the adversarial server would not learn that this insert and this query are for the same key. All right, this leads us to our main result. What we prove is that any encrypted multi-map with leakage at most the decoupled key equality pattern must have Ω(log n) overhead. This lower bound turns out to be tight, because there exist O(log n) ORAM-based encrypted multi-maps that leak much less than the decoupled key equality pattern. Going back to the privacy spectrum, what we end up showing concerns essentially everything just slightly to the right of the structured encryption schemes: as I showed earlier via the compiler, any scheme that is allowed to leak the key equality pattern can have O(1) efficiency. But as soon as you aim for something even slightly stronger in terms of privacy, say the decoupled key equality pattern, which doesn't really have many real-world implications and doesn't actually harden any real system.
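One way to make the decoupled pattern concrete is to write it as a leakage function. This encoding is my own illustration, not the paper's formal definition: each operation leaks its type plus the index of the earliest *same-type* operation on the same key, so equalities never cross the insert/query boundary.

```python
# Hedged sketch of the decoupled key equality pattern as a leakage
# function over a plaintext operation sequence.
def decoupled_leakage(ops):
    """ops: list of (op_type, plaintext_key). For each operation, leak
    (op_type, index of earliest same-type operation on the same key)."""
    leakage = []
    for i, (t, k) in enumerate(ops):
        first = next((j for j, (t2, k2) in enumerate(ops[:i])
                      if t2 == t and k2 == k), i)
        leakage.append((t, first))
    return leakage

ops = [("insert", "cat"), ("insert", "dog"),
       ("query", "cat"), ("query", "cat")]
# The two queries are linked to each other (both point to index 2), but
# neither is linked to the insert of "cat" at index 0.
assert decoupled_leakage(ops) == [
    ("insert", 0), ("insert", 1), ("query", 2), ("query", 2)]
```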
Theoretically, it ends up requiring these schemes to use Ω(log n) overhead, which matches oblivious RAM. In other words, if you want anything even slightly stronger than structured encryption in terms of theoretical asymptotics, you might as well go all the way to oblivious RAM. All right, so that's our result; now let's go about proving it. To prove this lower bound, we use something called the cell probe model, or rather the leakage cell probe model, where we incorporate leakage functions. The idea of the cell probe model is that, again, you have the user on the left and the server on the right, and the server's memory is split up into cells, each of the same length. The only cost in the cell probe model is accessing server memory: one unit of cost means either reading or writing, which we also call probing, a cell of server memory. That turns out to be the only cost we consider; everything else is free. Computation, random oracles, and accessing client storage are all free; you could solve your favorite NP-hard problem for free if you wanted to. So why consider such a weak cost model? Because it yields very strong lower bounds. If you were to consider a more realistic model where computation, randomness generation, or access to client storage has some cost, our lower bounds would still hold. In other words, the cell probe model is in some ways the holy grail of lower bounds, as it's the weakest cost model. The technique we use to prove our lower bound is the information transfer technique, introduced by Pătraşcu and Demaine in 2006.
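As a toy illustration of this cost accounting (the class name and cell width are my own, not from the paper), a cell probe server charges one unit per read or write and nothing else:

```python
# Minimal sketch of cell-probe cost accounting: only probes to server
# memory cells are counted; all client-side computation is free.
class CellProbeServer:
    def __init__(self, num_cells):
        self.cells = [0] * num_cells
        self.probes = 0              # the only cost the model tracks

    def read(self, addr):
        self.probes += 1
        return self.cells[addr]

    def write(self, addr, word):
        self.probes += 1
        self.cells[addr] = word

srv = CellProbeServer(256)
srv.write(15, 42)                    # 1 probe
assert srv.read(15) == 42            # 2 probes total
assert srv.probes == 2
# Any amount of client-side computation between probes would cost 0.
```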
The high-level idea of the information transfer technique is to track how information is transferred between the various operations that are performed. Suppose you perform n operations, which I've listed top to bottom here, operation 1 through operation n. What we do is build a virtual binary tree over these operations. What does that mean? We build a binary tree with n leaves such that each operation is assigned to a unique leaf in chronological order: the first operation is assigned to the topmost leaf, the second operation to the second topmost leaf, and so on. As I said, in the cell probe model each of these operations, say a read or a write to the encrypted multi-map, is actually implemented using cell reads and cell writes. For example, whatever operation 1 was, it might be implemented by reading the cell at address 15, writing something to cell addresses 17 and 72, writing something to cell address 220, and so on. What we're now going to do is iterate through every single cell read that occurs in this operational sequence and assign the read cell's address to a unique node in this information transfer tree. For example, suppose we take an arbitrary cell read, say a read of cell 15 performed in operation 3. We then go back and find the most recent operation that wrote to cell address 15; that might be operation 1. We then assign this read of cell 15 to the lowest common ancestor of the leaves associated with operation 1 and operation 3, which in this case happens to be the root of the tree.
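This assignment step can be simulated directly. The sketch below is illustrative bookkeeping, not part of any scheme: operations occupy leaves n..2n-1 of a heap-numbered binary tree, and each read of a previously written cell is charged to the lowest common ancestor of the reading and writing operations' leaves.

```python
from collections import defaultdict

def lca(i, j):
    # Heap numbering: root = 1, children of node v are 2v and 2v + 1.
    while i != j:
        if i > j:
            i //= 2
        else:
            j //= 2
    return i

def assign_probes(n, cell_ops):
    """cell_ops[i] = list of ('r'|'w', addr) probes made by operation i
    (n operations, n a power of two). Returns {tree node: number of
    cell reads assigned to it}."""
    last_writer = {}                 # addr -> index of latest write
    counts = defaultdict(int)
    for i, probes in enumerate(cell_ops):
        for kind, addr in probes:
            if kind == 'r' and addr in last_writer:
                # Charge the read to the LCA of writer's and reader's leaves.
                counts[lca(n + last_writer[addr], n + i)] += 1
            elif kind == 'w':
                last_writer[addr] = i
    return counts

# The example from the talk: operation 1 writes cell 15, operation 3
# reads it; with n = 4 operations, the read is charged to the root (1).
counts = assign_probes(4, [[('w', 15)], [], [('r', 15)], []])
assert counts == {1: 1}
```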
We do this for every single address that's read across all n operations. The key point of the information transfer technique is that the way we define and assign these cell addresses captures the total amount of information being transferred. To explain this carefully, let's consider a concrete example: this specific red node in the tree. This red node will have several cell addresses assigned to it, and what these cell addresses signify is the total amount of information transferred from operations performed in the red node's top subtree to operations performed in its bottom subtree. For example, suppose some operation in the top subtree overwrote the value for a specific key, and that key is later read in the bottom subtree. For the data structure to be correct, that information must be transferred from the top subtree to the bottom subtree, and what the information transfer technique shows is that all of that information must pass through the cells assigned to this red node. To see this, we can try some examples. Take this blue node and ask: could any cell address assigned here somehow transfer information from the top subtree of the red node to the bottom subtree of the red node? If you think about it for a moment, the answer is that it can't, because any cell address assigned to this blue node is read by an operation that is itself in the top subtree of the red node.
So in fact, no cell address assigned there can carry information from the top subtree of the red node to its bottom subtree. Similarly, we can do another analysis and pick a node that's an ancestor of the red node. Suppose some cell address is assigned there; that would mean it was written somewhere in the top subtree of this blue node, which contains the whole tree rooted at the red node, and read somewhere in the blue node's bottom subtree, which comes after all the operations in the red node's subtree, so again it cannot carry information from the red node's top subtree to its bottom subtree. In other words, what you end up seeing is that all the information transferred from the top subtree of the red node to its bottom subtree must exist in the cells assigned to the red node. Why is this important? Because we're now going to find operations that maximize the number of cell addresses assigned to each internal node. Again, look at this red node: we want to find a sequence of operations that maximizes the number of cell addresses that have to be assigned to it. In fact, it's very simple to do this: in the top subtree, put a bunch of inserts; in the bottom subtree, put a bunch of queries. We insert to unique indices in the top subtree with completely random values, and then in the bottom subtree we simply query them. Using this idea, we can now construct our lower bound. For the lower bound, we have to find some hard sequence, and the hard sequence we choose is very simple.
It's: insert a random value v at index 1 and then subsequently read it, insert at index 2 and subsequently read it, and so on. We assume these values are generated with a large amount of entropy. OK, you might ask right off the bat: isn't this sequence easy to handle? The key is that the sequence is indistinguishable from many other sequences with identical leakage. Looking at this hard distribution, we can quickly see that it maximizes the number of cell addresses assigned to all of these red nodes. For example, look at this top red node: clearly, all the operations in its top subtree are inserts that are subsequently queried in its bottom subtree. So this hard sequence at least maximizes the cell addresses assigned to this whole level of nodes. But we can then slightly modify the sequence. For example, take the queries and change them: we change these two queries to be dummies, and instead actually query key 1 and key 2 in the bottom subtree. And to make sure the leakage remains the same between the two sequences, you might have to add some sort of header where you insert a bunch of dummies that can be queried later. If you look at this new sequence, which has the same leakage as the original hard distribution I just described, it turns out that it maximizes, up to a constant factor of two, the number of cell addresses that have to be assigned to this red node. And you can do this repeatedly for each internal node in the tree.
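The base hard distribution can be sketched as a generator. This is a simplified illustration; the header of dummy inserts used to equalize leakage across the modified variants is deliberately elided here, since the talk only gestures at its structure:

```python
import os

# Sketch of the hard distribution: insert a fresh high-entropy value at
# each index, then immediately query it, so each insert's information
# must be transferred to the matching query.
def hard_sequence(n, value_bytes=16):
    ops = []
    for i in range(n):
        ops.append(("insert", i, os.urandom(value_bytes)))  # random value
        ops.append(("query", i, None))                      # read it back
    return ops

seq = hard_sequence(4)
assert [op[:2] for op in seq] == [
    ("insert", 0), ("query", 0), ("insert", 1), ("query", 1),
    ("insert", 2), ("query", 2), ("insert", 3), ("query", 3)]
```

The modified sequences in the argument keep this exact leakage while moving which real keys the queries touch, which is what forces the scheme to budget probes for every internal node.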
So what ends up happening, and how we prove this lower bound, is that we use these ideas to show that many probes must be assigned to half the internal nodes under this hard distribution. We also observe that each cell read, or probe, is assigned to at most one internal node in the tree, so simply summing the probes assigned over all nodes yields a lower bound. What I'm trying to say is that if, for some reason, not enough cells were assigned to some red node, we would know that a specific sequence with the same leakage couldn't be answered correctly. So for privacy, the scheme has to assign the maximum number of cell addresses to each red node according to this distribution, and summing these maximums over all internal nodes gives the lower bound. That actually completes our lower bound, at a very high level, via the information transfer technique. But it turns out you can modify the proof to get stronger lower bounds: in our paper, we prove lower bounds even when either the insert operations or the query operations are performed in plaintext. Furthermore, we can use this lower bound for encrypted multi-maps to prove lower bounds for dynamic searchable encryption. In particular, the decoupled key equality pattern corresponds to what's called response-hiding dynamic searchable encryption, where you hide which documents are associated with any queried key, and we prove that such response-hiding dynamic searchable encryption schemes require Ω(log n) overhead.
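The probe-summation step can be written out explicitly. This is a back-of-the-envelope version under the simplifying assumptions that n is a power of two and every internal node at depth d receives its maximum charge of Ω(n/2^{d+1}) probes:

```latex
% A node v at depth d(v) roots top and bottom subtrees of n/2^{d(v)+1}
% operations each, and the hard distribution forces that many word-sized
% transfers through v. Summing over all internal nodes:
\sum_{\text{internal } v} \Omega\!\left(\frac{n}{2^{d(v)+1}}\right)
  \;=\; \sum_{d=0}^{\log n - 1} 2^{d} \cdot
        \Omega\!\left(\frac{n}{2^{d+1}}\right)
  \;=\; \sum_{d=0}^{\log n - 1} \Omega\!\left(\frac{n}{2}\right)
  \;=\; \Omega(n \log n).
% Since each probe is charged to at most one node, the n operations cost
% \Omega(n \log n) probes in total, i.e. \Omega(\log n) amortized overhead.
```

Each level of the tree contributes Ω(n) probes on its own, and there are log n levels, which is where the logarithmic overhead comes from.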
And again, this is tight, because there are ORAM-based solutions for response-hiding dynamic searchable encryption. OK, so that completes what we did in this paper. Recently there have been a large number of other works proving cryptographic lower bounds in the cell probe model, including the seminal oblivious RAM lower bound paper by Larsen and Nielsen, as well as several follow-ups that I list here; if you're interested, you should go explore them. All right, thanks for listening to my talk, and I hope you enjoyed it and learned something.