Hi. Can you hear me? Hi, everybody. So, Kenny just introduced my talk. I should mention my fabulous co-authors here, some of whom are in the audience, and you might be able to find them afterwards: Richard McPherson, Muhammad Naveed, Tom Ristenpart, and Vitaly Shmatikov. And I have to shamelessly plug my Twitter, because social media is the future, so you can tweet at me.

I want to talk to you today about web applications. This is a kind of application that's ubiquitous; people don't really think much about it, but that's what I want to talk to you about today. Web applications have roughly two parts: a client that runs in a user's browser or on a mobile phone, and a back-end server that stores the data generated by that client, rather than storing it on the client. This model works great, it's ubiquitous, and everybody uses it, until there's a data breach. These happen all the time, and when they do, the confidentiality of all the users' data is compromised. That's a bad thing we generally want to avoid.

So, one solution people have proposed is to encrypt the data on the server: the data generated by the clients is encrypted with a key that's stored only on the client and never seen by the server. Then, when there's a breach of the server storing all this data, nothing is actually leaked, because everything is encrypted. This neatly solves the problem of confidentiality being violated by a data breach. But the problem is that basic functionality like search, all the things you want to do with your data, doesn't work anymore, because this encryption strips all the functionality from the data.
So, one idea people have been playing with in the last ten years or so is to strike a middle ground between these two extremes, using a primitive called property-revealing encryption (PRE), which cryptographers have studied in the last few years. The idea is to reveal a property of the data that's not everything, but is enough to enable a specific functionality that's useful for our application, while still maintaining some kind of provable confidentiality guarantee about the data. Systems builders have started to notice this idea and its utility in building secure systems. So, in this work we identified a class of systems that all roughly use property-revealing encryption to build a system that preserves the confidentiality of data, and we came up with the goofy acronym BoPETs: systems Building on Property-revealing EncrypTion. This table lists some common PRE schemes along with the systems and companies that use them. I won't name every name in this table, but searchable encryption is something I'll explain on the next slide. Just to give you a sense of the interest in BoPETs: if you look at all the startups in this table, they have cumulatively received about $200 million of VC funding from firms like Andreessen Horowitz, Sequoia, and Greylock. So there's a palpable interest; a lot of people are really interested in these systems.

Now that we understand what a BoPET is, let's look at an example: a hypothetical file-sharing application called CloudDrive. The orange user has a secret diary that they want to store in the cloud, but encrypted so that the cloud can't see it. So the client uses a key stored only on the client and encrypts the document. When the orange user wants to search the document to see whether it contains a certain keyword, it can use this key to generate a search token that delegates the search to the cloud.
Now, if the orange user wants to collaborate with the blue user on a different document, perhaps something work-related, they both need the key for the blue document. The blue document is encrypted with a different key than the orange document, because the orange document is private to the orange user, who doesn't want any other user to be able to see it. But now, when the orange user wants to search, it's not clear how it should do this. The naive way is to issue a search token for every single access group the user has access to. That scales badly and requires the client to do a lot of work.

So, in 2013, Popa and Zeldovich introduced a primitive called multi-key searchable encryption, which really neatly solves this problem using the following idea. Rather than sharing the key and requiring the orange user to generate different search trapdoors for every document shared with it, the blue user gets the ability to tell the server to add the orange user to the document. The orange user then generates a piece of cryptographic information, which it gives to the server, that essentially delegates to the server the ability to convert the orange user's search tokens into searches under the blue key. The advantage is that the user can issue just one search token, and the server does the token conversions. So this solves the problem I just mentioned. This is a really, really cool PRE scheme, but it's not clear how to build a system out of it. Since we're here at Real World Crypto, we should try to think about how to build real systems, and we know we can't just use a crypto scheme in practice: we need to build a system around it.
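To make the token-conversion idea concrete, here is a toy sketch of the algebra. The real Popa-Zeldovich scheme works in a bilinear pairing group; this sketch replays the same bookkeeping with plain modular arithmetic, so it is illustrative only and has no security at all. Every parameter and function name here is an assumption, not the actual construction.

```python
import hashlib

# Toy algebra of multi-key searchable encryption (NOT the real
# pairing-based Popa-Zeldovich construction; everything is illustrative).
Q = 2**127 - 1  # a prime modulus (toy-sized)

def h(word: str) -> int:
    """Deterministic keyword hash, computable from public parameters."""
    return int.from_bytes(hashlib.sha256(word.encode()).digest(), "big") % Q

def search_token(word: str, user_key: int) -> int:
    """Client: one token for `word`, under the user's own key."""
    return (h(word) * user_key) % Q

def delta(from_key: int, to_key: int) -> int:
    """Uploaded once when a document is shared: lets the server re-key
    tokens from `from_key` to `to_key` without learning the keyword."""
    return (to_key * pow(from_key, -1, Q)) % Q

def convert(token: int, d: int) -> int:
    """Server: re-key a search token."""
    return (token * d) % Q

def keyword_tag(word: str, doc_key: int) -> int:
    """Per-document keyword tag the server matches converted tokens against."""
    return (h(word) * doc_key) % Q

# Orange (key k_a) searches a document encrypted under blue's key k_b:
k_a, k_b = 271828182845, 314159265358
d = delta(k_a, k_b)                 # stored server-side at share time
tok = search_token("party", k_a)    # a single token under orange's own key
assert convert(tok, d) == keyword_tag("party", k_b)   # match
assert convert(tok, d) != keyword_tag("budget", k_b)  # no false match
```

The point of the design is visible in the last three lines: the client produces one token under its own key, and the server's stored delta values do the per-document conversions.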
And there's been some work, David's talk covered some of it, that looks at the security of PRE in isolation, but there hasn't really been any holistic work on how to build a whole system that preserves the confidentiality of the user's data. So that's the basic question we asked in this work: how do we build a system that uses PRE to provide confidentiality to users? To assess this question, we did a case study on a BoPET called Mylar, a paper published at NSDI 2014 by Popa et al. Mylar basically takes the Meteor JavaScript web application framework and builds cryptographic functionality on top of it: the multi-key searchable encryption functionality I discussed earlier, plus key management, access control, and things like that. The Mylar server handles requests from the different Mylar clients and handles access control, but it also has access to a principal graph. The details here aren't too important; you should basically just understand that this is metadata about access-control relationships between documents and clients.

The security goal of Mylar is simple: Mylar wants to protect the confidentiality of the data against attackers with full access to the servers. In this work we identified three major threat models that any BoPET, including Mylar, should be able to defend user confidentiality against. The weakest is the snapshot passive threat, which is basically a compromise yielding a one-time snapshot of the database. A stronger threat model is the persistent passive one, where the adversary persistently monitors the system. And the strongest threat model, the one claimed by Mylar, is the active threat model, which allows the server to issue arbitrary responses and even to collude with some users. This slide shows the wording from the Mylar paper describing their active threat model.
Don't worry about reading all of this right now; we'll come back to it in a bit. To investigate these threat models, we looked at four applications: one released with the Mylar paper, and three that we ported to use Mylar according to the guidelines in the Mylar paper. In the threat model claimed by Mylar, we give an attack that, in a simulated workload, recovers 100% of the users' keyword queries and about 70% of all the keywords in the encrypted documents; I'll explain this attack a little later. We also give two weaker attacks, one of which I'll discuss; for the other, please refer to our paper. The Mylar paper explicitly does not claim to protect access patterns, so in this talk I'm going to focus on the snapshot passive threat model and the active threat model, because our attacks in these threat models do not use data access patterns, communication patterns, or timing patterns.

Our attack in the weakest threat model, the snapshot passive one, is simple. It basically uses the metadata I mentioned before. This metadata is crucial for the system to function correctly, but it's not encrypted; existing systems don't bother to encrypt it. And of course, we all know how important metadata is. So this metadata carries some inherent risks that haven't really been explored before. It's not protected in existing systems, but it is really necessary for the system to work. The metadata itself, though, depends on the data, and if the designers of the system aren't careful, the metadata can by itself leak the very information that the encryption scheme is intended to protect. This observation, this attack, is fully in scope and doesn't use any access patterns or timing patterns. To see how metadata can leak information, we need look no further than this diagram from the Mylar paper.
This is a scenario where two users of a chat application are talking about a party they want to go to, and they have a boss whom they don't want to know about the party. In fact, the text says, "I hope my boss won't know about the party." So the confidentiality goal here is fairly clear: the boss shouldn't find out about the party. The chat data itself is encrypted by Mylar, but the metadata is not. In particular, there's an access-control principal that corresponds roughly to a searchable encryption key, and the name of this principal is stored in plaintext. In this scenario, the name is "party." The clear conclusion is that the boss will know about the party. And because the principals of the users with access to this principal aren't encrypted either, their names included, the boss will know exactly who's going to the party as well.

This attack is a little subtle, so we should pause here and say a few more words about metadata. Metadata is the glue that builds a system out of a PRE scheme. It's necessary for all systems, but especially in multi-user settings like the one targeted by Mylar, where there are complex access-control relationships between different users. Metadata is crucial for things like key management. But existing BoPETs, not just Mylar, don't really encrypt the metadata, presumably because their designers don't expect any confidentiality loss from it. What we want to motivate in this talk is that there are ways in which metadata by itself can compromise the confidentiality of user data.

Our attack on the active threat model is pretty simple. Just to review, the active threat model is one in which the server can perform arbitrary actions and, in fact, collude with some of the users of the system.
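To make the snapshot attack concrete, here is a hypothetical sketch of what such a database record could look like. The field names and record layout are invented for illustration; this is not Mylar's actual schema.

```python
# What a snapshot attacker might see in a Mylar-style database: the chat
# body is encrypted, but the access-control principal's name and its
# membership are stored in plaintext. (Hypothetical layout, for
# illustration only.)
snapshot = {
    "principals": [
        {"id": "p1", "name": "party", "members": ["alice", "bob"]},  # plaintext
    ],
    "messages": [
        {"principal": "p1", "body": b"\x8f\x02..."},  # ciphertext blob
    ],
}

# The attacker learns the topic and the attendee list without breaking
# any cryptography, just by reading the unencrypted metadata.
leaked = [(p["name"], p["members"]) for p in snapshot["principals"]]
assert leaked == [("party", ["alice", "bob"])]
```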
So in the Mylar paper, the way they phrase it is that the application and database servers can be fully controlled by an adversary. Mylar also allows some user machines to be controlled by the adversary and to collude with the server. In this threat model, our attack is pretty simple. Recall the key-conversion token used in the multi-key searchable encryption scheme. Suppose the adversary gets the conversion token for a key that it knows. Then, when the user does a search, the adversary can use this token, together with the key it knows, to obtain an unkeyed hash of the keyword. It can then use a brute-force dictionary, computed using only public parameters, to recover the keyword. And through the leakage of the searchable encryption scheme, which is basically which documents match a query, this compromises the contents of the documents as well; I'll explain that in a few slides.

Now that we understand the attack, we can ask how the adversary gets a token for a key it knows. There are basically two ways this could happen: the adversary tricks the user into accepting a share of an adversarial document, or the user shares a document with the adversary. Either way, the way the Mylar system works, it will produce exactly the information the adversary needs. Mylar has a hypothetical defense against this attack. It's not supported in the current Mylar code base, but it is basically a way for the application developer to force the user to explicitly allow searches. This countermeasure is not used, but for the sake of argument let's imagine it works perfectly and every user only shares documents with people they trust.
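Here is a sketch of the attack's core, over a toy algebraic analogue of the multi-key scheme. The real scheme lives in a pairing group; only the bookkeeping carries over, and all keys, moduli, and the sample dictionary here are illustrative assumptions.

```python
import hashlib

# Toy analogue of the active attack: a malicious server holding a
# conversion token for a key it knows recovers the victim's query
# keyword with an offline dictionary. All parameters are illustrative.
Q = 2**127 - 1  # a prime modulus (toy-sized)

def h(word: str) -> int:
    """Unkeyed keyword hash, computable from public parameters alone."""
    return int.from_bytes(hashlib.sha256(word.encode()).digest(), "big") % Q

# --- honest protocol pieces ---
k_user = 271828182845      # victim's search key (unknown to the adversary)
k_adv = 314159265358       # a key the adversary knows (e.g. its own doc key)
delta = (k_adv * pow(k_user, -1, Q)) % Q  # conversion token held by the server
token = (h("party") * k_user) % Q         # the victim's search token

# --- adversary (malicious server) ---
# Step 1: convert the victim's token to one under the known key.
converted = (token * delta) % Q           # equals h(keyword) * k_adv
# Step 2: strip the known key, leaving the unkeyed hash of the keyword.
unkeyed = (converted * pow(k_adv, -1, Q)) % Q
# Step 3: look the hash up in a dictionary precomputed offline from
# public parameters; no document on the server is needed.
dictionary = {h(w): w for w in ["budget", "party", "meeting", "standup"]}
recovered = dictionary.get(unkeyed)
assert recovered == "party"
```

Note that step 3 is entirely offline, which is why the attack recovers even query keywords that appear in no document on the server.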
There's still yet another way for the adversary to get the token, which is to corrupt a trusted user of the system. In the Mylar threat model, they say that an adversary can collude with some user machines, and one reason this might happen is that the adversary broke into a user's machine. So here's how this would work. Imagine the orange user trusts the blue user, and the allow-search countermeasure works correctly in this setting. The orange user accepts a share of the shared document from the blue user. If, after this happens, the blue user is attacked, the adversary gets the blue key, which allows it to perform the same brute-force attack on the orange user's search queries. And because the adversary has the blue key, this pretty clearly compromises the contents of the blue document, right? But what happens to the orange document? Mylar makes the guarantee that the confidentiality of a data item is protected as long as none of the users with access to that data item use a compromised machine. So that's the claimed security. But through the leakage of whether a search query matches a document, if the orange user makes a search that matches the orange document, the contents of that document are going to be leaked as well. So this shows how the confidentiality guarantee of Mylar is violated by this attack.

This has all been a little abstract, so we should try to be concrete about a setting in which this attack could occur. The medical setting is used all over the place in applied crypto, and in the Mylar paper it's used as an explicit example of an access-control graph in a search system. In a hospital, the result of this attack is simple: if doctors and nurses are collaborating to treat patients and one nurse loses their laptop, the private files of the doctors can also be compromised, and the allow-search countermeasure is not a defense against this attack.
So now that we understand the impact of this attack, we should come back and explain how the contents of the documents can be compromised. When the orange user makes a search for a keyword the adversary has recovered, the functionality of the search scheme allows the adversary to tell whether that keyword exists in a document. So as the user makes more queries, more of the contents of the documents are revealed. An important question, then, is: in typical application settings, how much of the document contents would be revealed? To test this, we ran experiments using the Ubuntu chat log corpus as a stand-in for chat log data. To sample user queries, we sampled according to the distribution of keywords. This followed prior work on user query distributions, and we did it because there really isn't good research yet on realistic user query distributions. But this is an artificial, synthetic query distribution, so an open question is characterizing the real user query distributions that occur in practical applications. As the brute-force dictionary, we took the largest dictionary of English words we could find, roughly 350,000 words. Using the Mylar searchable encryption scheme, pre-computing this attack dictionary takes only about 15 minutes of wall-clock time. It's really, really fast, and that's not even on a single thread; this is a parallelizable operation. With the simulated query workload, the attack recovers about 28 or 29% of the keywords in the documents after 100 queries, and as you get up to about 2,000 queries you reach 68 or 69% recovery of all the keywords across all the simulated user documents. So this active attack is clearly very powerful, and it's powerful for a few reasons. It can recover the search keywords using just an offline brute-force dictionary.
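As a rough illustration of how such a recovery rate could be measured, here is a minimal simulation. The corpus, the query model, and every name here are simplifying assumptions, not the paper's actual experimental code.

```python
import random
from collections import Counter

# Minimal sketch of measuring keyword-recovery rate against a simulated
# query workload: queries are sampled from the corpus keyword-frequency
# distribution, mirroring the synthetic workload described in the talk.

def recovery_rate(documents, num_queries, seed=0):
    """Fraction of unique document keywords revealed after sampling
    `num_queries` queries from the keyword-frequency distribution."""
    rng = random.Random(seed)
    freq = Counter(w for doc in documents for w in doc)
    words = list(freq)
    weights = [freq[w] for w in words]
    # Each recovered query keyword reveals its presence in the documents
    # it matches, so distinct queried keywords count as recovered.
    queried = set(rng.choices(words, weights=weights, k=num_queries))
    return len(queried) / len(freq)

# Tiny made-up corpus: two "documents" as keyword sets.
docs = [{"party", "friday", "boss"}, {"budget", "meeting", "friday"}]
rate = recovery_rate(docs, num_queries=50)
assert 0.0 < rate <= 1.0  # more queries push the rate toward 1
```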
So even if you search for a word that doesn't actually appear in any document on the server, the attack can still recover it. The recovery of query keywords doesn't rely on the leakage of whether a document matches a particular query. Even if, hypothetically, all of Mylar were run inside some oblivious storage or PIR, the query keywords would still be recoverable with this attack. This is a subtle point, so I'll say it again. The basic compromise of confidentiality here is simple: if I share a single document with a friend I trust, and that friend is later compromised, then all of my private documents are compromised as well.

There is an active attack discussed in the Mylar paper, and this is the text of that attack. It basically relies on inserting a pre-computed brute-force dictionary document into the system, and on the leakage of whether a query matches that document, to recover the keyword. Our attack, in contrast, doesn't require a dictionary document on the server: we can recover search query keywords that don't occur in any document on the server. And even if the victim never shares a document with the adversary, the collusion allowed by the Mylar active threat model will still let you recover the contents of the documents.

So in conclusion: the security of BoPETs is still pretty poorly understood. Metadata in BoPETs can leak information and violate the confidentiality of the encrypted data. Access patterns on the data do as well, but see our paper for that. The active adversary in BoPETs is very powerful, because any operation the server can perform can be performed maliciously. So you need some way to verify that the server is only doing what you want it to do, and nothing more.
So integrating property-revealing encryption into systems that actually preserve user confidentiality is still pretty tricky, and active attacks on BoPETs are still an unsolved challenge in the literature. We hope to motivate more research on that as well. I think that's it. Thank you, everybody.

[Chair] Okay, questions for Paul? We have time.

[Audience] Quick clarification. In Mylar, you don't just see whether a search term matches; you get to see where in the document it matches. So you're recovering words, or...?

[Paul] Well, Mylar doesn't actually randomize the order of the encrypted keywords. So you recover more than whether it matches: you recover the order of the words in the document, modulo some deduplication of the keywords, because they don't store words with multiplicity; they only store the unique keywords in the document. But you can still see the order of the keywords, and that could potentially enable more devastating frequency-analysis attacks, based just on the search leakage and not on this query-recovery leakage.

[Audience] And your number for the 70%, is that the number of lines in the chat log where all the terms get recovered?

[Paul] It's the number of keywords. We take the number of words in all the chats and the number of words we recover, and we just divide those two.

[Audience] So I have a comment and a question. The comment is just that there are searchable encryption schemes that do not rely on property-preserving encryption; in some cases they can be used in the real world, though they still need to be analyzed. There's more work to do, but I just wanted to comment on that.

[Paul] Can you be more specific?
[Audience] Yes. There is work by the IBM team called OXT, a scheme for private search, and there's work by Columbia and Bell Labs called Blind Seer.

[Paul] Oh, okay. Maybe this is just a definitional thing, but that work would be included under property-revealing encryption.

[Audience] No, it doesn't use deterministic encryption.

[Paul] Well, neither does Mylar, though.

[Audience] Okay, we can talk about it.

[Paul] Yeah, we can talk offline. This is just a kind of terminological thing.

[Audience] I do have a question about your talk, which is: in your passive attack, is it easy to solve by just encrypting the metadata in some way, or does that ruin everything?

[Paul] It does, but there's a kind of usability question there, because when you do a share, you want the user to be able to see what's being shared with it before the user accepts the share. So it almost creates a chicken-and-egg problem: you want to encrypt the metadata, but when you share, you need the user to be able to read the metadata before it accepts the share. Do you see what I'm saying? I think this is probably a solvable problem, but it's not obvious to me how you would do it.

[Chair] Okay, let's thank Paul again. Thank you, Paul.