Thanks, Kenny. So this is joint work with Betul Durak, my PhD student at Rutgers, and also Thomas DuBuisson of Galois. So this talk is about something called order-revealing encryption, which is a fairly simple primitive. It's most easily understood through its special case, order-preserving encryption, which is just a symmetric encryption scheme that's deterministic and order-preserving, that is, a strictly increasing function. What that means is that if you think of the domain as an ordered set of numbers, and the range, the ciphertext space, as an ordered set of numbers, then if you encrypt some x that's less than y, you'll get a ciphertext for x that is less than the ciphertext for y. That's order-preserving encryption. Order-revealing encryption (and this distinction isn't too important for the talk) is a more general version of this, where we say maybe the ciphertexts aren't actually numbers that are in order, but if somebody were to look at two ciphertexts, they could figure out which was for the smaller plaintext and which was for the bigger one. So order is revealed.

So why would you want to build such a thing? It's for encrypted database protection. Namely, you would take a database table with a bunch of columns, and you would pick, say, a key for the first-name column and encrypt it: you would take the plaintexts there and replace them with the order-preserving encryption ciphertexts, and you could do that with all of the columns in your database. This has the nice feature that it enables range queries on the encrypted database, meaning that if you wanted to query for the range between x and y, you could rewrite this as a query for the range from the encryption of x to the encryption of y. Say, if you wanted the zip codes in some range in New York, instead of querying for the actual zip codes, you would just query for the ciphertexts (there's a minimal sketch of this rewrite at the end of this overview). This is particularly nice and easy to deploy, because you don't have to modify the server. Whoever's running and holding this table and processing the query for you doesn't need to know that encryption's been applied; they don't have to provide any support or make any change at all. And because of that, several companies, startups and larger companies, have deployed this. They're encrypting customer data with it now. Other companies have prototyped it. And if you're familiar with the CryptDB project, it crucially used order-preserving encryption, along with other types of property-revealing encryption.

So this work today is about looking at the security of order-revealing encryption, which I haven't commented on yet, and identifying some new conceptual, qualitative security issues that haven't been stressed in the literature, and also not by the people who are deploying it. I'm going to divide our contributions into two parts. The first looks at attacks against order-revealing encryption when you have correlated columns in a table. Prior work would say: OK, I got the encryptions for a column of zip codes; what can I learn from that, and what can I extract? Which is a good place to start. But in reality, you're going to store multiple columns of data, all encrypted with order-revealing encryption, and these rows are going to be correlated; the entries within a row will be correlated with each other. And so we point out some conceptual problems that arise when you use order-revealing encryption in a more realistic use case like this.
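Here's that range-query rewrite as a minimal sketch. `ope_encrypt` and `key` are hypothetical stand-ins for any deterministic, order-preserving cipher and its key, not a specific deployed API.

```python
# A minimal sketch of rewriting a plaintext range query for an OPE-encrypted
# column. Because encryption is strictly increasing, a plaintext x lies in
# [lo, hi] exactly when its ciphertext lies in [Enc(lo), Enc(hi)], so an
# unmodified server can answer the rewritten query over ciphertexts alone.

def rewrite_range_query(ope_encrypt, key, lo, hi):
    """Translate plaintext bounds into ciphertext bounds for the server."""
    return ope_encrypt(key, lo), ope_encrypt(key, hi)
```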
Right. And so in particular, we're going to show that even when one column is essentially impossible to attack in some sense, having multiple columns there might enable an attack anyway. The second part of our work looks at attacks on order-revealing encryption on uniform data. Order-revealing encryption has been studied mostly from a theoretical perspective, with provable security and theorems saying what it achieves and so on. And several of those theorems assume that the input data are uniformly random, mostly for theoretical convenience, in order to prove something clean. We're going to actually experiment with it a little bit and see how that mismatches with the behavior of ORE on non-random data, which is typically what you would want to encrypt in any setting I can think of. And we'll see along the way, through some pretty simple experiments, that some practical ORE constructions reveal a lot more information on a real data set, just some that we found, than on random data sets. So the intuition from those theorems was maybe misleading.

Throughout the talk, we're going to experiment with two data sets. I was personally interested in using order-revealing encryption for geolocation data for two projects. We were thinking of encrypting personal mobile phone histories with order-revealing encryption, to have a more private kind of personal history search. And then for another project, we would encrypt facility locations. So we used another data set that you'll see more about in a second.

So our conclusion (and, to be fair, the inventors of order-revealing encryption have been saying this since they first published their papers) is that we don't have a really good understanding of what security you get when you use order-revealing encryption in an encrypted database application. And I think this work is articulating what the original inventors were saying: we need to cryptanalyze not just the actual constructions, but the goal itself, the models and definitions that we're trying to achieve. We don't even fully understand what those say about practice.

For the rest of the talk, I'm going to spend a little while giving you background on ORE, talking about prior attacks and the different constructions. Then I'll move on to our results on correlated columns in order-revealing encryption. And then I'll look at two different constructions and, for each, what happens when you use non-uniformly distributed data.

OK, so for the background: the first thing to understand about order-revealing encryption is that it's inherently less secure than regular encryption. Because somebody can look at two ciphertexts and see which has the smaller plaintext, they're learning information that regular encryption just doesn't reveal. ORE is actually a bit more brittle than that. It's an exercise to show that if you have chosen-plaintext access to the encryption scheme, then you can do full plaintext recovery on any ciphertext you want by doing binary search for the unknown value inside a ciphertext: you use your chosen-plaintext oracle to search for the unknown plaintext (a sketch follows below). Given how brittle this is, the research paradigm, if I can rephrase what prior work was doing, is to construct ORE schemes with the best possible security even in the face of these problems, and to focus on passive attackers who capture a table, dump it, take it home, and try to analyze it.
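Here's a minimal sketch of that binary-search attack, assuming a chosen-plaintext oracle `encrypt` and the pairwise comparison `compare` that any ORE scheme must expose; both names are illustrative, not from a particular implementation.

```python
# A minimal sketch of the folklore chosen-plaintext attack on any ORE/OPE
# scheme. `compare(c1, c2)` returns negative, zero, or positive according to
# the plaintext order, which the correctness of ORE forces it to reveal.

def recover_plaintext(encrypt, compare, target_ct, domain_max):
    lo, hi = 0, domain_max
    while lo <= hi:
        mid = (lo + hi) // 2
        order = compare(encrypt(mid), target_ct)  # one oracle query per step
        if order == 0:
            return mid                   # recovered the hidden plaintext
        elif order < 0:
            lo = mid + 1                 # hidden plaintext is larger than mid
        else:
            hi = mid - 1                 # hidden plaintext is smaller than mid
    return None                          # target outside [0, domain_max]
```

Recovery takes about log2 of the domain size many queries, which is why chosen-plaintext access is so devastating here.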
We'll hopefully depend on other mechanisms to help us against chosen-plaintext attacks, although those will still be a problem in most settings I can think of. And I'll mention there have been several works, including many since we did this research, and I'm not going to get to comment on the myriad schemes that have come since. But the thing to understand in this talk is that there are really two flavors of order-revealing encryption; I'll call them ideal and leaky. Ideal is strictly more secure than the leaky version. One way to understand it is to think about what would happen if somebody captured an ORE-encrypted column of, say, zip codes. Ideal ORE is designed and proven to reveal only the order of the plaintexts. We can imagine that an attacker who gets this column and analyzes it just learns what order the plaintexts are in: they learn that the one at the bottom was the smallest zip code, the next one was the second smallest, and so on. The bad news is that if you actually wanted to build this, you would need indistinguishability obfuscation or multilinear maps. If you don't know what those things are, they are theoretical tools that are not clearly going to be efficient anytime soon; it's an open problem to make them fast, and the deployments I'm talking about don't use this. On the other hand, if you relax the notion of ORE as a cipher and instead say, I just want to do range queries with some leakage, and I allow an interactive protocol for doing that, then you can actually get fast implementations that work OK, but they're interactive, and you have to change the server and so on. So even though we don't have implementations of ideal ORE strictly speaking, I'll still consider it a valid thing to analyze, because it could be used in practice with these interactive protocols.

The next type after ideal is leaky ORE, which I'm not going to define too concretely right now, but it is ORE that leaks more. In addition to telling what order the plaintexts are in, which is always required by the correctness of the primitive, it will, depending on the construction, have some defined extra leakage. On this slide I've highlighted these numbers in red to say: maybe your leaky ORE is leaking some of the initial digits of the zip codes, something like that. The good news is that when you relax ORE this way, we have fast block-cipher-based constructions, and these are what people are using now. But this extra information includes plaintext bits, some statistics on the zip codes, and possibly other hairy-looking things. And this leakage (this bullet should really say leaky ORE) is an issue, and prior work has studied it, mostly by proving what we would call one-wayness theorems, in theoretical crypto at least, which say roughly that when you encrypt random input data, recovering the plaintext is hard. They actually proved more than hardness of plaintext recovery, namely hardness of even approximating the plaintext, but still, it was always for random input data.

Okay, so my work is not the first to wonder how secure ORE is in practice. A work by Naveed, Kamara, and Wright in 2015 identified two serious issues in ORE that do come up in practice. The first is that if you encrypt the entire plaintext domain (imagine you have a relatively small plaintext domain, like days of the year or something like that), it's a problem, right?
Because then you get the ciphertexts, you know the plaintext domain is, say, one through N, you have N ciphertexts, you just sort them, and now you know what the plaintexts are, right? And this is a problem that would come up: they experimented with some medical data and showed that admission month and another column could conceivably allow this, and it's not clear how to avoid it. The next problem they identified is that when plaintexts repeat, since this is a deterministic cipher, you can do frequency analysis. The same thing you do to break a substitution cipher with letters, you just adapt it to whatever setting you're working with. This is in a setting where you capture ciphertexts, and you also assume you have some auxiliary information telling you about the distribution of the plaintexts contained in the ciphertexts; then you set up an optimization problem to do your frequency analysis and generate guesses for the plaintexts. So you're guessing the unknown plaintexts in the ciphertexts based on your training data. And they looked at other medical data columns where they had effective attacks in that setting.

This work, however, had some limitations. In particular, what they left open is the case where you have a column of data that's not dense, so you don't encrypt the whole plaintext domain, maybe just a relatively small number of points from it, and where you're encrypting unique values that never repeat. Then neither of these attacks applies at all: you can't do the first attack because you haven't encrypted the whole domain, and you can't do the second attack because your frequency information is essentially trivial; these are just unique ciphertexts. And these are settings to which we'll return in this talk.

Okay, so the first part of our work is on correlated columns and order-revealing encryption. Let's think about what would happen if you encrypted a database table using order-revealing encryption with different keys for each column. And let's just generously say we're going to use ideal ORE; things would be even worse if you used leaky, but let's think about ideal ORE for a second. Consider the leakage you would get on this table. Maybe you encrypt the names of the columns too, so you don't actually know what they are, but you can imagine, as an adversary, looking at this and learning: OK, in this column, this is the order of the first names. Say for this top row here, you learn whoever that was has the second smallest first name, the second smallest last name, the fourth largest zip code, and so on. And maybe you can extract some information there.

So here's a very highly artificial experiment to illustrate what goes wrong. Let's take an image, pick 300 random black pixels from it, and encrypt these in a cartoonish way. We're going to think of these points as correlated columns of X coordinates and Y coordinates, so there are 300 of these rows. I'm sure you can't read the labels on the image, but it's a 0-to-2000 by 0-to-2000 grid, so these are all points with coordinates between 0 and 2000. And we're going to think about what would happen if you encrypted these and then got the ideal leakage on each of the columns. So instead of learning exactly where this first point was, you're going to learn that it had the 51st smallest X and the 64th greatest Y. And so you can really just write this down as a new column.
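Here's a minimal sketch of this rank-leakage view of ideal ORE, with random distinct coordinates standing in for the 300 pixel positions from the slide; plotting rank against rank is exactly the attack.

```python
# A minimal sketch of the "penguin" attack on two ideal-ORE columns. Ideal
# ORE reveals only each value's rank within its column, but plotting the
# rank pairs reproduces the shape of the correlated data.
import numpy as np
import matplotlib.pyplot as plt

def ideal_ore_leakage(column):
    """All that ideal ORE reveals about a column of distinct values: ranks."""
    return np.argsort(np.argsort(column))

xs = np.random.choice(2000, size=300, replace=False)  # distinct X's
ys = np.random.choice(2000, size=300, replace=False)  # distinct Y's

plt.scatter(ideal_ore_leakage(xs), ideal_ore_leakage(ys), s=5)
plt.title("Rank vs. rank: correlation survives ideal per-column leakage")
plt.show()
```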
And I want to make the point here that the prior work doesn't apply to this setting (that's the prior work that only looked at individual columns), because we only encrypted 300 points out of our domain; it's 300 out of 2000 in each of the columns. And we arranged, I didn't say this before, that there are no repeated X's or Y's, so the frequency information is trivial. If you were to look at the leakage in these individual columns alone, this is all you would learn from the X's: I have 300 X's, and they're in order one to 300. Also, I have 300 Y's, and they're in order one to 300. But if you put them together, because you have that table, you can essentially just recover the penguin, with a little bit of distortion. And this is due to the correlation between the X's and the Y's. So this is the qualitative point we wanted to make: this disconnect between single-column and multi-column leakage could be an issue, and it hadn't been noticed.

For a less cartoonish set of experiments, we used two data sets. First, we looked at California road intersections. This is a data set of 21,000 latitude-longitude pairs of road intersections in California that was used to test ORE in prior work, so we thought it was a nice set to start with. The second, interesting for the research I mentioned, is a personal mobile phone location history that was released by Malte Spitz, a politician in Germany who's a privacy activist and sued his phone company to get this data and then posted it online. So this is also what I would consider sensitive information. In the paper we also used the timestamps and encrypted those with ORE, but I'm not going to talk about that today.

So here's the same penguin attack done on real data, with these road intersections. On the left, I've plotted the latitudes and longitudes just on a map. And on the right side, this is the ideal leakage; I think this is 2,000 out of the 21,000 points. You would learn the X coordinates are in order zero to 2,000, and the Y coordinates are in order zero to 2,000. You would learn essentially nothing from those individual ORE-encrypted columns. But when you put them together and just plot them on a grid, this is what you get; here's how my leakage looks. This convinced me not to use ORE in this application. I don't want to get too specific about how bad it is as a qualitative problem, but we noticed that you can still see Los Angeles and the Bay Area, and this little outlier point out here. You're really preserving a lot of fine detail due to this correlation. One quantitative way to try to understand this: if, instead of just plotting this, I told you the bounding box for California and told you these points are in there, then just by scaling this distorted thing, you get about a third of the points to within 50 kilometers. Maybe that's not that scary, but still, the shape of the image is preserved, and maybe that's a threat.

Next, for this mobile phone location history, we did exactly the same attack. There are a lot more points in this data set, so we picked out subsets to see what was going on: we looked at the leakage on the entire data set, and then looked back at how this day, this weekend, this month looked from the point of view of an attacker who had this ideal leakage. And this is what the leakage looks like on those days.
Again, I'm not measuring this quantitatively to say how similar it is, but when I was thinking about using ORE to encrypt latitude and longitude for a mobile phone user, this was enough to convince me that it might not be providing the best protection. The issue is that I've just recovered pictures that look like the actual movement of somebody through ideal leakage alone, and I haven't even used any side information at all: the fact that I might know this mobile phone user was German, maybe where he lived, and so on, which could be used to generate possibly accurate guesses for the plaintexts.

So the lesson from part one is that even if we go through all the trouble of developing this theoretical primitive for ideal ORE, or the trouble of modifying the server to support interactive ideal ORE, on location data at least, it appears to be just too leaky for any of the applications I've thought of. And it's because correlated columns are combining to leak information in an unexpected way. The open question (and this is not a technically deep part of the work) is how we should quantify this. We tried various approaches, like measuring correlation, and I learned a little bit of statistics, but nothing seemed to give us a test that would crisply help a practitioner. So I consider it an open problem: other than running these attacks and seeing how your data looks, I'm not sure what recommendations I would make to a practitioner. And something I've been working on since then is whether we can attack other types of correlation in real data. We picked location data because it was in front of us and nicely correlated, but I'm sure other problems could come up.

Okay, so seven minutes? Okay, this will go fast. In the second part, we looked at what happens when you run ORE on non-uniform data. And I just want to quickly say that there are multiple constructions of ORE that we're going to look at now. The first, by Boldyreva, Chenette, Lee, and O'Neill, gave the first practical construction based on block ciphers. You can keep in your head that it roughly leaks half of its input bits when you encrypt a random thing. The next, by Chenette, Lewi, Weis, and Wu, gave another practical construction that targets a different security notion, a different type of leakage, and we'll see that in a second. I'll talk about the first construction now and then, assuming I don't run out of time, I'll comment on what's going on with the second construction.

So this is the Boldyreva-Chenette-Lee-O'Neill definition, which they achieved via a fast construction. They said that a cipher E is random-order-preserving-function secure, ROPF secure, if the input-output behavior of the cipher with a random key is indistinguishable from a random strictly increasing function with the same domain and range. This is a natural thing to do if you're a cryptographer: instead of a PRP, which is indistinguishable from a random permutation, this is going to be indistinguishable from a random function that has the property we want, namely that it's increasing. The definition is parameterized by the domain and range. We'll think of the domain as m-bit points, so zero up to 2^m, and the range as n-bit points, and n has to be bigger than m. If n were equal to m, then the only order-preserving function would be the identity, so you need the output to be a little bit bigger.
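To make the ideal object concrete, here's a minimal sketch of sampling such a random strictly increasing function, under the assumption that 2^m is small enough to enumerate; practical constructions sample the function lazily instead.

```python
# A minimal sketch of the ROPF ideal object: a uniformly random strictly
# increasing function from m-bit inputs to n-bit outputs. Sampling a random
# size-2^m subset of the range and sorting it gives exactly this
# distribution. Feasible only for small m; real schemes sample it lazily.
import random

def sample_random_opf(m, n):
    assert n > m, "the range must be strictly larger than the domain"
    outputs = sorted(random.sample(range(2**n), 2**m))
    return lambda x: outputs[x]  # f(0) < f(1) < ... < f(2**m - 1)

f = sample_random_opf(8, 16)
assert all(f(x) < f(x + 1) for x in range(2**8 - 1))  # strictly increasing
```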
Here, so this is a theorem I'm just throwing up there quickly; I'll focus on the intuition below. To analyze this definition, the question is: once we have something that's indistinguishable from a random order-preserving function, what does that mean? It's a very good question, and it was stressed in the original work that it's not clear what this means, but at least something was proved. What was later proved is a statement about approximate one-wayness: if I pick a random domain point x, encrypt it, and show somebody the ciphertext, they can extract about the left half, the most significant bits, of x. Numerically, it means they can estimate x to within accuracy about the square root of the domain size. But they also proved a positive result, which is that no efficient attacker can do much better. So if I write the bits of x from most significant to least significant, the green bits on the right are in some sense strongly protected, for random x. And it's an involved proof that I'm not going to get into.

So it occurred to us that what was going on with random data might not happen with non-random data. To see this, we did a very simple experiment where we encrypted fixed values, fixed plaintexts; we took powers of two just to space things out, so 2^0, 2^1, 2^2, and so on, and then we ran that prior attack. The prior attack is essentially reading off the bits from the ciphertext; it's maybe a tiny bit more sophisticated than that, but not much more. Here's what we observed; let me explain this chart. I set the message length to 64 and the ciphertext length to 128, which are generous parameters. We've plotted the size of the plaintext value on a log scale, so this runs from 2^0 and 2^1 up to 2^64, and on the y-axis is the log of the accuracy of the guess, so higher means you're guessing less accurately. And what we noticed is a strong linear relationship between these. If I encrypt, say, 2^40, then the attack estimates it to within about 2^20 accuracy; if I encrypt 2^60, it estimates it to within about 2^30 accuracy. That's in contrast to what would be predicted if you just took the lesson from random inputs and applied it to fixed inputs, which would say you should be able to estimate everything to within about 2^30, and not much better. But because of the distribution of a random order-preserving function, when you encrypt small values, you can actually estimate them much more accurately.

So our conjecture (if I had more time I could explain, but I'll keep moving) is that, in terms of the bits that are leaked, when you encrypt a small value that has a lot of leading zeros, or a big value that has a lot of leading ones, those leading bits will be leaked to an attacker, and then half of the remaining bits will be leaked as well; at least experimentally, this is what we saw. And my conjecture is that the right half of the remaining bits are still hidden there. It's an open question to formalize and prove this; we've made a little bit of progress, but it's still a conjecture.
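For intuition about both the attack and the conjecture, here's a hedged sketch, under the simplifying assumption that a random order-preserving function stays close to the straight line through the corners of its domain-range rectangle; the actual attack from the literature is a bit more refined than this.

```python
# A hedged sketch of the one-wayness attack intuition on an ROPF-secure
# scheme with m-bit plaintexts and n-bit ciphertexts: a random increasing
# function hugs the diagonal, so linearly rescaling the ciphertext estimates
# the plaintext, which amounts to reading off its top bits.

def estimate_plaintext(ciphertext, m=64, n=128):
    """Linear-scaling estimate of the plaintext behind `ciphertext`."""
    return round(ciphertext * 2**m / 2**n)
```

On this view, the chart above is unsurprising: an encryption of a small plaintext must land proportionally low in the range, so the rescaled estimate gets the leading zeros for free, and the error scales with the square root of the plaintext itself rather than the square root of the whole domain.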
Okay, five minutes. Okay. So finally, the last section: looking at another construction of ORE that targets a different type of security, not ROPF, and experimenting with it on non-random data. The definition targeted by the construction of Chenette, Lewi, Weis, and Wu is what we're going to call most-significant-differing-bit leakage. And what this leakage is, is the following: suppose somebody has ciphertexts encrypting, say, these plaintexts under their construction, and you take the first two of them, x_i and x_j. If you take any two strings and start from the left, there's going to be a most significant bit where they differ; here it's the third bit from the left. In their construction, that bit is leaked in plaintext: basically, by running the comparison algorithm, you learn that this x had a one and this x had a zero, and in this spot. So what the construction achieves is essentially leaking all of the pairwise bits that can be computed in this way. The construction is actually elegant and clever, I liked it, and it's very fast, and I think they had a lot of insight to target this different security notion over prior work. So we're going to call that most-significant-differing-bit leakage: you can tell where the most significant differing bit is and what it is.

And they proved another theorem (this should probably be labeled an informal theorem), but intuitively what they proved is that if you encrypt random x's, this is great, right? Because what's going to happen on random x's? Random x's are very unlikely to have a lot of matching bits at the top; it's like flipping a coin over and over, and eventually you'll get different outcomes. So they proved an asymptotic theorem that says it leaks an arbitrarily small constant fraction of the bits at the front of x, which is a big improvement over the first construction I described, which leaked half the bits of x, okay? But of course, this is on random data. We wanted to understand what happens on real data, the real data sets we had at hand.

So what we did is we took these constructions, the ROPF-secure ORE and the MSDB ORE I just defined on the previous slide, and we also experimented with double encryption: you can compose these and encrypt twice with two different keys, just to understand what's happening. This is a plot of how many bits we could compute from ciphertexts by just running the known attacks; we did nothing special here. It's a CDF plot: the x-axis is the percentage of revealed bits, and the y-axis is the fraction of plaintexts revealed at that level. And we have three lines, for the two ciphers and then the double encryption. The lesson I want to draw from this, to keep moving, is that they're not that different, despite the asymptotic security theorem. This light blue line (closer to the origin is better) is better in a measurable way on our data set, but not as dramatically as the asymptotic theorem on random plaintexts might suggest.

And things are a little bit worse than a naive look at the leakage might suggest. This is an example of the actual leakage we saw when we encrypted, I think it was the latitudes in the California data set. If you actually go through and do the comparisons, these are the bits you can write down; you can record all of them, and they just sit there. And there's a little bit more information, because you sometimes know when some of these x's should be the same or not, but that's not plotted here.
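To make the leakage concrete, here's a minimal sketch of the pairwise leakage function, plus the "replace unknown bits with one half" numeric estimate used in the visualization coming up next. The function names and the fixed bit width are illustrative assumptions, not the authors' API.

```python
# A minimal sketch of most-significant-differing-bit (MSDB) leakage: for any
# two plaintexts, the comparison algorithm reveals the position of the
# leftmost bit where they differ and the value each plaintext has there.

def msdb_leakage(x, y, bits=32):
    """Return (position from the left, x's bit, y's bit), or None if x == y."""
    diff = x ^ y
    if diff == 0:
        return None
    pos = bits - diff.bit_length()       # 0-based index of leftmost difference
    x_bit = (x >> (diff.bit_length() - 1)) & 1
    return pos, x_bit, 1 - x_bit

# Visualization helper: keep the bits the attacker has learned and set every
# still-unknown bit to 1/2, giving a plottable numeric estimate in [0, 1).
def half_bit_estimate(known_bits):
    """known_bits: sequence, MSB first, with entries 0, 1, or None (unknown)."""
    return sum((b if b is not None else 0.5) * 2.0 ** -(i + 1)
               for i, b in enumerate(known_bits))
```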
So what is this leakage? How bad is it? One way to understand how bad it is is to try to visualize it. One simple way to visualize it: all these x's are unknown bits, either zeros or ones, so let's just replace each of them with one half. Think of a binary number, but with a one half in place of every unknown bit. This is what you would be able to plot if you did that. This is the effect of hiding the bits the cipher is designed to hide, especially the higher-order bits: a missing higher-order x there corresponds to a big gap here. I guess we did longitudes in this one. Of course, if I gave this to you by hand, you could reassemble California; those bits aren't really hidden, right? It's clear where you need to shift each piece over. We automated this with a not-very-complicated algorithm that optimized the total pairwise distances between the points, guessing the x's one at a time, and put California back together. When we did that, we got a very effective plaintext recovery, or not plaintext recovery, plaintext approximation.

Okay, one minute, I'm almost done. We ran the attack on various data set sizes; here we've plotted just some results. I'll highlight that in all the experiments we ran, our attack guessed at least half the points to within half a kilometer. These are encrypted points, right? And we're estimating them to within that accuracy. And this is despite the fact that the raw leakage you might read off didn't leak any of them to within 400 kilometers. So it is the effect of us guessing those high-order bits that improves the accuracy.

Okay, to wrap up. Correlation should be considered if you're going to deploy something like order-revealing encryption, perhaps very carefully, and for problems we haven't even identified yet. Leaky ORE might be even leakier on your data than on random data. And finally, for researchers, moving forward I think it's very worthwhile to consider other primitives, other approaches to enabling efficient database access, and once we consider those, to also cryptanalyze them instead of just proving theorems. Cryptanalyze your definition to understand what it's saying. Thanks.

Okay, we do have time for a couple of questions.

I was wondering, as you're considering correlation for order-revealing encryption, have you also considered the impacts of correlation on other types of property-preserving encryption, or even things that don't rely on property-preserving encryption, such as Booleans in these databases?

In this work, no. But since then, yeah, I've been trying to generalize this to more general data sets, but I don't have anything really insightful to say now.

Okay, I'll maybe talk to you after.

Yeah, it's definitely an issue, though, you're right. I think you're referring to deterministic encryption used on columns and that type of stuff. Yeah, there are definitely going to be repetitions across the columns, and it's easy to conceive of attacks that are not possible if you looked at the columns individually.

Hi, David, nice job. Can you go back to the slide with the recovery rates? There's a jump from like 78% to like nothing? Yeah. Do you have any intuition why? And then everything is just zero the rest of the way.

It's because we threw that one in late and very, very naively composed the attack. If you want to attack the double encryption, you first attack the outer one, which produces some guesses for the inner ciphertexts, and then you run the other attack on those artificial ciphertexts.
And because of the way the outer attack worked, it was producing wonky ciphertexts that had this behavior, which caused the composed attack to go completely wrong after a while. But I don't think that's an accurate representation of the true leakage, if you worked harder.

Okay, let's thank David again for a great talk. Thank you. Thank you.