Okay, ladies and gentlemen, our next talk is coming up: DP5, PIR for Privacy Preserving Presence, a framework that allows you to have better control of your privacy. If you have a free seat next to you, could you please raise your hand so anyone can find seats really quickly, even the people coming in late now? Now there will be a notice about the translation. [In German:] There is a German live translation of this English talk. Please take the DECT phone, dial channel 801 1, and our live interpreters will translate from English into German. Okay, so we are ready to start. Here we have Ian and George, so let's give them a warm round of applause.

Okay, is this on? Yes. All right, hi everyone. I'm Ian Goldberg. I'm from the University of Waterloo in Canada, and this is George Danezis from University College London in London. Our colleague Nikita Borisov unfortunately could not be with us today, so it's just the two of us. So we're going to be talking about DP5, PIR for Privacy Preserving Presence. No. I will be talking about PIR, or Private Information Retrieval, for the first part of the talk, and then we'll turn it over to George. Something popped up there. Doesn't matter. And then we'll turn it over to George, who will talk about DP5. So first, PIR. What is PIR? So imagine an online database, not too hard to imagine. There are all kinds of online databases. Let's say this is a database of patents, right? So we have an online database of all the, say, U.S. patents, and again, you don't have to imagine too hard because there really are these online databases you can query. Okay, and here's Alice. Alice is a researcher, and Alice wants to look up patent 6,368,227, "Method of swinging on a swing." This is a real patent. It covers pulling the chains of a swing sideways so you swing left to right instead of forward and backward. Again, this is a real patent. It has since been revoked on the grounds of stupidity, but it was issued.
So Alice wants to look up this patent, but Alice does not want to let the database server know that swings are a hot new research topic, okay? Now swings, of course, this is a silly example, but if you replace "swing" by, like, 1,3-dimethyl-mumble-mumble-amine, some pharmacological molecule or something like this, then you can see it might be very important to be able to look up things in this database without revealing to the database what you were looking for, right? Otherwise, the database operator could themselves then go and say, oh, someone is interested in this drug, we may as well go research that ourselves, right? Or we saw examples long ago called domain front running: you would do a whois lookup to see if your favorite domain name was available, and the act of doing the whois would cause some unscrupulous network operators to then go register it before you could register it and then sell it to you at an inflated price, right? So you want to be able to look up things in a database without letting the database itself know what you asked for, okay? This is private information retrieval, okay? Note that we're not talking about anonymity. We're not hiding Alice's identity here. The problem isn't that we're hiding the fact that Alice is the one looking up swings. What we're trying to hide is that swings are interesting, right, that anybody is looking up swings at all. Of course, if you want to add anonymity on top of this, no problem, you just add Tor to it or something like that and it works just fine, but you don't have to, and in fact, you can have business models based on PIR that allow Alice to pay for these private lookups. So Alice logs in, authenticates herself, then does a private lookup and may pay for the privilege of doing a private lookup, okay? So Alice does this PIR query and the idea is that the server has no idea what it was Alice looked up. Okay, now you may be thinking to yourself, this is clearly impossible, right?
Who's thinking this is clearly impossible? Right, right? It is clearly impossible, yet the magic of cryptography is such that things that are clearly impossible are often straightforward and vice versa. So here's a simple example called the trivial PIR protocol to show you that it's not impossible. Alice connects to the server, says, I would like a patent, please. And the server says, okay, here's all the patents. Look it up yourself, okay? Clearly, this is completely private. The server has learned no information about what patent Alice was looking for. But at the same time, this is a ridiculous protocol. Why? It sends way too much data, right? If the database is of any reasonable size, you're not going to be able to send all that information to Alice in a reasonable amount of time, okay? And there are other objections to the trivial PIR protocol as well. So what we aim for in better PIR protocols is to achieve the same level of privacy while sending much less data, okay? And we're gonna see a simple example of a PIR protocol in a few minutes just to let you know that this is a real possible thing, okay? There are a couple of main categories of PIR protocols. The first is called computational PIR. And this is where the security of the PIR system relies on the fact that the server who is your adversary, the server is trying to learn what you're asking for and you're trying to not let it. So you're relying on the fact in computational PIR that the server is computationally limited. It doesn't have infinite computing power. And in particular, it can't break certain types of public key cryptography, okay? So a brief taste of how it works, I won't go into the details on this one. Basically, Alice takes her query and encrypts it. 
But unlike the case where Alice is, say, connecting to the server over, for example, TLS, where what Alice would do is take the query, encrypt it to the server's public key and send it so that third parties can't read it, here Alice is trying to protect the query from the server itself. And so Alice doesn't encrypt the query to the server's public key, Alice encrypts the query to her own public key. This is weird, how's the server going to be able to read it? Well, that's the point, the server isn't going to be able to read it. And then Alice sends the encrypted query over to the server. So what can the server do with this encrypted query? Well, it turns out there are some kinds of public key encryption that allow for an operation that's technically called homomorphism. And you might have heard in the news about recent results in so-called fully homomorphic cryptography, which promises to allow all kinds of crazy computation at only exorbitant costs. But this isn't that, this is the simpler kind of what's called partially homomorphic cryptography that's actually totally reasonable. And what it allows is for the server to take the encrypted query encrypted with Alice's public key and the plain text database and combine them to form an encrypted response still encrypted with Alice's public key, even though the database can't decrypt the query or the response. And then the database sends that encrypted response back to Alice who decrypts it. So that's the general structure of computational PIR. Now these homomorphic operations are somewhat expensive. And so it kind of sometimes costs a lot to do these kind of computational PIR queries. So there's another kind of PIR called information theoretic PIR or IT PIR. And here what you have is that the servers are no longer computationally limited. Now the servers are allowed to have unlimited computational power, quantum computer, a magic computer, whatever. They can compute anything they want in no time. 
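To make that structure concrete, here is a hedged Python sketch of a computational-PIR round trip. It uses a toy Paillier cryptosystem, which is additively homomorphic, with deliberately tiny and completely insecure parameters; the database contents, record encoding, and key sizes are all made up for illustration and are not any particular production scheme.

```python
# Toy computational PIR using Paillier encryption (additively homomorphic).
# INSECURE toy parameters -- for illustration only.
import math, random

p, q = 293, 433                  # toy primes; real Paillier uses huge primes
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1)

def L(u):                        # the "L function" from Paillier's scheme
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def enc(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def dec(c):
    return L(pow(c, lam, n2)) * mu % n

# Alice encrypts the selection vector e_i under HER OWN key
database = [11, 22, 33, 44]      # the server's plaintext records (each < n)
i = 2                            # Alice wants record index 2
query = [enc(1 if j == i else 0) for j in range(len(database))]

# The server homomorphically computes Enc(sum_j e_j * D_j) = Enc(D_i),
# without being able to decrypt either the query or the response
response = 1
for c, record in zip(query, database):
    response = response * pow(c, record, n2) % n2

print(dec(response))             # Alice decrypts: 33
```

The homomorphic property is that multiplying ciphertexts adds plaintexts, and raising a ciphertext to a power multiplies the plaintext by that power; that is all the server needs to combine Alice's encrypted query with the plaintext database.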
And Alice's query is still protected even against unlimited computational power. How can that possibly be? Well, we use a different security assumption, not one based on cryptography, but one based on information theory, hence the name information-theoretic PIR. And here the idea is you have to have multiple servers, and Alice asks a question of each of the servers, gets back the responses, and then combines those responses in order to get her answer. Now the security property we have is a non-collusion assumption. So we have to assume that at least some threshold of these various servers are honest and not colluding with each other to try to break Alice's privacy. And this kind of non-collusion assumption is common in privacy enhancing technologies; things like Tor, of course, use it. If all the servers in your path are colluding against you, you're kind of screwed. How did that come up on that? Yeah, if all the servers are colluding against you, that's bad. And some kinds of electronic voting have this kind of non-collusion assumption, so this is a pretty typical assumption. And then there are ways that combine these two, combine a little bit of computational protection with a little bit of information-theoretic protection, if you want a bit of each, okay? So that's information-theoretic PIR, and the advantage of IT PIR is it's much faster, like 70 to 100 times faster than computational PIR, right? So that's a bit of an advantage there, okay? But on the downside, you need these multiple servers. So that's the trade-off you're making. Okay, so let's look at a simple example of an information-theoretic PIR, just so you can have a little taste, and I'm gonna do a little bit of math. Who's scared? Who likes math? Yay, math! Woo! Okay, it's gonna be pretty easy math, like, in some countries high school math, in other countries maybe first-year college math, okay? So it depends, yeah, it depends where you are. It's around then. So it's vectors and matrices.
Who's happy with vectors and matrices? Woo! Okay, so vectors and matrices. So this will be, if you're happy with that, it'll be easy; if not, just hum along and pretend you know the words. So we're gonna say our database, we're gonna represent it as a matrix, okay? A matrix is just a rectangular block of, in this case, bits, zeros and ones, where each row is one record of the database. So the first row is record one, the next row is record two, the next row is record three, and so on. And let's say there are r rows in this database and each record is s bits long. So this is an r by s matrix, okay? And what the querier wants to do is retrieve one of the records, right? Say the third one, this one here. So she wants to retrieve one of the rows of the database, okay? So in order for this PIR system to work, we just need to recall two facts from maybe high school, maybe first-year college math. One is, if you take the elementary vector e_i, which is all zeros except for a one in the i-th place, right? So this would be e_3 here, right? If you take e_i, this vector of length r, and multiply it by this matrix, okay, think back to the last time you did matrix math. How does this go? You turn it sideways and you multiply. Anyway, you can trust me: what comes out is the i-th row of the matrix, i.e. block i of the database, okay? So of course you could say, well, that's our protocol then. You construct this vector containing the index of the record you're interested in, you send it to the server, the server multiplies it by D and sends you your answer. But of course that's not private, right? The server can just look at this vector and say, oh, the one is in the third place, you're looking up record three. No privacy. So the second piece of high school math we need is the distributive law. Okay, you may recognize the distributive law as working on numbers, but it also works on vectors and matrices too. It works just fine. So what's the distributive law?
v_1 times D, plus v_2 times D, plus dot dot dot, plus v_L times D, is the sum of the v_j's, times D. So how are we now going to make a PIR protocol? You may be able to see it just from these lines. So here's how it's gonna work. Alice, let's say Alice wants row three of the database. Alice constructs e_i, but doesn't send it to anybody. Alice then picks, let's say there are L servers, Alice picks L completely random vectors, zero-one vectors, under the condition that they add up to e_i mod two. Okay, it's all binary, so we're gonna do everything mod two. Okay, so how does she do that? The easiest way is to just pick L minus one completely random vectors, and then figure out what the last one has to be so they all add up: e_i minus the sum of the ones you have already. Now you have L completely random vectors, and you send v_1 to server one, v_2 to server two, v_3 to server three, and so on. And each server is getting just a random bit string. And that, if you work it out, contains, very formally, absolutely no information about i, this number three here. Okay, so each server gets absolutely no information. In fact, any coalition of as many servers as you want, as long as it's not all of them, has no information about i. And then what is the protocol? Each server just multiplies the v it gets by the database, gets the answer, sends the answer back to Alice, like this one. Alice just adds up the answers: by the distributive law, that's the sum of the v_j's times D, but the sum of the v_j's we chose to be e_i, which is this, and e_i times D is block i. Poof. Okay, so is this actually more efficient, does it transfer less information, than the naive just-download-the-entire-thing trivial protocol? Well, how much data does Alice send each server? It's a vector of length r. So Alice sends r bits to each server, and how much does she get back? Well, this times this is a vector of length s. So Alice gets s bits back, okay?
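The protocol just described fits in a few lines of code. Here is an illustrative Python sketch, not the real Percy++ implementation; the database contents, record size, and server count are made up, and "multiplying by the database mod two" becomes XORing together the records selected by the 1-bits of each query vector.

```python
# Minimal information-theoretic PIR over GF(2): split e_i into L random
# vectors that XOR to e_i, and send one vector to each server.
import secrets

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

# The database D: r records of s bits each (here r = 4, s = 8)
D = [[1, 0, 1, 1, 0, 0, 1, 0],
     [0, 1, 1, 0, 1, 0, 0, 1],
     [1, 1, 0, 0, 0, 1, 1, 1],   # <- the record Alice wants (i = 2)
     [0, 0, 0, 1, 1, 1, 0, 1]]
r, s, L_servers = len(D), len(D[0]), 3
i = 2

# Alice: build e_i, pick L-1 random vectors, and set the last one so
# that all L vectors XOR to e_i
e_i = [1 if j == i else 0 for j in range(r)]
queries = [[secrets.randbelow(2) for _ in range(r)]
           for _ in range(L_servers - 1)]
last = e_i
for v in queries:
    last = xor(last, v)
queries.append(last)             # XOR of all queries == e_i

# Each server: multiply its (individually random-looking) vector by D
# over GF(2), i.e. XOR together the records its 1-bits select
def server_answer(v):
    ans = [0] * s
    for bit, record in zip(v, D):
        if bit:
            ans = xor(ans, record)
    return ans

answers = [server_answer(v) for v in queries]

# Alice: XOR the answers; by distributivity this equals e_i * D = row i
result = [0] * s
for a in answers:
    result = xor(result, a)
print(result == D[i])            # True
```

Note that any single server, and indeed any coalition short of all L of them, sees only uniformly random bit strings, which is exactly the information-theoretic privacy claim from the talk.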
So R plus S bits are transferred between Alice and each server, and how many bits are there in the database? It's R times S bits, which is much, much bigger for any reasonable values of R and S, right? And in fact, we get the best protocol if R and S are about equal, if this database is squarish, then you're sending about the square root of the size of the database to each server and receiving the square root back, okay? So two times the square root of N if N is the number of bits of the database, times L, the number of servers, as opposed to N, the size of the whole database. And the square root of N is much smaller than N if N is of reasonable size. Okay, so this was one of the very first PIR protocols proposed by Chor and others in 1995, almost 20 years ago now, but it has some shortcomings that many of which have been addressed over the subsequent 20 years. So let's look at a couple of them. One of them is that the shape of this database is rather simplistic, right? This database consists of some number of equally sized records, right? Not all databases have equally sized records, so other work has shown how to extend this very same protocol or related protocols to handle variable size records. And you might think that's a little tricky because the answer Alice gets has to not reveal the size of the record Alice was looking for, so it's a little tricky to do this securely, but that's been done in previous work. Other previous work has looked at the query, right? The query in this protocol is I want record three, right? Typically, a client of a database does not know which physical record number they want to look up. They know some other information. They have some more expressive query, for example, keywords or even SQL and previous work has shown how to put that into PIR so that you can do an SQL query on a server without telling the server what your query is. That's pretty cool. That's previous work, that already exists. We have code and everything. 
So these are previous work, and one other shortcoming of this simplistic scheme is that it's not robust. And what do we mean by robust? So if you look here, what happens? Click, I'll go over here. Okay, next slide. Okay, there we go. Laser's working. Okay, so what do we mean by robust? Well, imagine one of the servers is down and doesn't give you a response. Or worse, one of the servers is malicious and gives you the wrong answer. When Alice adds up the answers, she'll just get some garbage. And worse, she can't tell which server gave her the wrong answer. Okay, so a robust protocol would allow Alice to retrieve her correct block even if some of the servers were down or malicious. And at the same time, she'll be able to identify which servers were malicious, and of course that's a disincentive for the server to be malicious in the first place. So let's look briefly at how these robust protocols work. It uses something called Shamir secret sharing, which is pretty fun and cool, so I'm gonna talk about that for a little bit. Who's heard of Shamir secret sharing? Not as many people as have heard of matrices. Who's heard of polynomials? Okay, good, we're gonna use polynomials. Okay, so Shamir secret sharing, what is this? You have some secret; in the case of the PIR scheme, the secret is this vector e_i, right, which represents which block number you're interested in. So you have some secret and you want to share pieces of the secret among a number of parties such that there's some threshold T, such that if at most T of those parties come together, they have no idea what the secret is. It's not that they have some few bits of the secret; they have none of the secret, none at all, okay? But if more than T come together, they get the secret. Okay, so here's how that's gonna work. You have your secret and you have your privacy threshold level that you care about, which we'll call T.
And what you do is you draw your x, y axes and you draw a green dot at (0, your secret), okay? And then you pick a random polynomial of degree T that goes through that dot, okay? So you're picking a random polynomial of degree T whose y-intercept is your secret, okay? That's what you're doing. So when T is one, this is a random line that goes through the dot. When T is two, it's a random parabola that goes through the dot, and when T is three, it's a random cubic that goes through the dot, okay? So what do you do next? Then you just hand out other points on this polynomial. So these blue points here, you hand them out to the parties. Then what happens if you, oh, someone just got it, okay? So what's about to happen? So let's just look at the simple linear case as an example. If you only have one dot, you have no idea what the y-intercept of the line going through this dot is, because it could literally be anything, right? And similarly, if you have two dots, you have no idea what the y-intercept of the parabola is, because it could literally be anything, and so on. But if you have more than T, let's say you have two dots: of course, two points make a line, so you draw the line and there's your y-intercept. You can just read off the secret right from there. So this is Shamir's secret sharing. And this is what we use: instead of splitting the e_i into just v_1 plus v_2 plus v_3 and so on, we split it in this slightly more complex way, but this has the nice feature that you don't need all the servers to come back and give you the answers. You only need T plus one of them. And then what if some of the servers give you wrong answers? Well, that's where we use something called error-correcting codes. So with error-correcting codes, we can tell, for example, when we have some wrong points, which points are wrong. So here's a little quiz. T is three, so we're looking for a cubic that passes through all but one of these points.
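As a small taste of how sharing and error detection fit together, here is a toy Python sketch. It shares a secret with a random degree-T polynomial over a small prime field, reconstructs it from T+1 shares by Lagrange interpolation, and then identifies a single corrupted share by brute-force voting over subsets. That voting step is only a stand-in for the real Reed-Solomon-style error-correcting decoding used in robust PIR, and all the parameters here are made up and far too small for real use.

```python
# Toy Shamir secret sharing over a small prime field, with a brute-force
# check that spots one corrupted share. Illustration only.
import random
from itertools import combinations

P = 2**13 - 1                    # small prime field (8191)

def make_shares(secret, t, n):
    """Random degree-t polynomial with f(0) = secret; shares are (x, f(x))."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    def f(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def interpolate_at_zero(shares):
    """Lagrange interpolation of the shares, evaluated at x = 0."""
    total = 0
    for j, (xj, yj) in enumerate(shares):
        num = den = 1
        for k, (xk, _) in enumerate(shares):
            if k != j:
                num = num * (-xk) % P
                den = den * (xj - xk) % P
        total = (total + yj * num * pow(den, -1, P)) % P
    return total

t, n = 2, 5
shares = make_shares(secret=1234, t=t, n=n)
assert interpolate_at_zero(shares[:t + 1]) == 1234   # any t+1 shares suffice

# Corrupt one share (a "malicious server"), then find the secret anyway:
# every size-(t+1) subset that excludes the bad share agrees on the same
# f(0); subsets that include it scatter across random values.
shares[3] = (shares[3][0], (shares[3][1] + 99) % P)
votes = {}
for subset in combinations(shares, t + 1):
    val = interpolate_at_zero(subset)
    votes[val] = votes.get(val, 0) + 1
recovered = max(votes, key=votes.get)
print(recovered)                 # 1234, with overwhelming probability
```

Real robust PIR decodes much more efficiently than this exhaustive voting, but the underlying idea is the same: consistent shares all lie on one low-degree polynomial, and the outliers expose which servers misbehaved.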
So there are six points up there, one of them is wrong. Which one is wrong? Who says the one at this end is wrong? Who says the second one? Third one? Fourth one? Fifth one? Sixth one? Okay, the answer is in fact the first one; that's the cubic. And this kind of error correction is actually used quite commonly. It's used in CDs and DVDs. It's used in QR codes, right? You're probably familiar with this. Quick, what does this QR code resolve to? No. But the reason that it's used in QR codes is so you can do this, just blot some stuff out obscuring part of the QR code, and this code will still scan to the same thing the other codes scan to, which is in fact the homepage. Okay, so these error correction codes are used all over in ordinary life, because you wanna make sure, usually, that scratches on your DVD or tears in your QR code don't mess things up. And they are also used in PIR to handle malicious servers. Okay, so all this is implemented in our Percy++ open source library. That's a pun on PIR and C++. PIR-C++. No? Okay, fine. For those of you who like Git, there's a Git repository. For those of you who don't like Git, there's a SourceForge. For those of you who don't like SourceForge, find someone you know who likes Git. Okay, so now I'm gonna turn it over to George, who's going to talk about using PIR to make privacy-preserving protocols. Thank you. Okay. Good afternoon, everybody. I hope this works. So the cool thing about crypto is that it's about math, and math is cool. But what is even better is that we can use crypto in order to protect our privacy against adversaries that, without crypto, we would never have imagined being able to protect ourselves against. So the example that I will use to illustrate why PIR can be used for that relates to achieving privacy in kind of social interactions online. So you may have noticed that in the last 15 years we started using computers to talk to each other, okay?
And you have things like Twitter, you have things like Facebook, no reaction, okay? And you have different chat protocols. Ooh, this thing works sometimes and not other times. All right, tricky. Okay, so all of these have something in common, namely, they have a sense that some users are friends with other users. Again, this is kind of crucial to all of them. Even Twitter, which doesn't usually handle secrets all that much, has a sense that you have people following you and you follow people, and that kind of routes information around, and it's kind of important. Okay, now one further thing that is important, particularly when it comes to chat, and we're particularly keen on chat because we would like to eventually protect the privacy of people chatting, is presence. Now, what is presence? Presence is just the little green dot that you see next to someone's name when they're online. This is really presence. I mean, it's kind of a simple thing. However, we kind of grew used to it, in the sense that in the old days, you would just call a number and hope that someone picks it up, particularly when it was a home number, and most of the time no one would pick it up, and it was very frustrating. Now, when was the last time you had this frustration? Right, I mean, 10 years ago or something. Nowadays you can see when people are online, you can just call them, and you know that more often than not they will answer, unless they actually don't want to talk to you, at which point it's a bit awkward. And in order to avoid such awkwardness, in fact, we have improved our presence protocols with some additional information. Okay, you don't just see green or nothing, you actually have these kinds of intermediate states: orange, which is kind of "I'm going away," or red, "I'm quite busy."
So we don't just want to know if people are online or offline, but we also want to associate some further information with what they're doing, what kind of status they have, some little message that says "I'm away, I'll be back at five" or something like this. So our presence protocols will try to replicate this facility, both of finding out if people are online and also conveying some small amount of associated information that will help any further communications. Now, how is presence achieved today in most services, including very big commercial services or smaller services run by good people, like the Jabber services of the CCC that many of us are using? The way this is done is actually quite similar across all of them. What happens is that you usually have some kind of server, and there is some kind of a notion of a social network, of users having friends and contacts, et cetera, et cetera, that they'd like to hear from, or would like information routed from, or would like to find out if they're online. And the service simply has to know the full social graph. Okay, that's how it's done. And then what happens is that Bob comes online, and then Alice comes online and Alice authenticates, says, hi, I'm Alice. The service knows that Bob is a friend of Alice, and then the service sends back to Alice who's online, who's not online, what their status is. Okay, now, maybe I should, I'm clicking furiously here, I will actually move over there so that I stop doing that. Okay, now, and then some presence information is routed back. So there is a fundamental privacy problem with relying on services that need this social graph as necessary information to provide presence. Now, what is the problem associated with that? The problem is that who your friends are basically leaks a lot of information about who you are. If all my friends are hackers, the probability is that I'm a hacker. If all my friends are journalists, the probability is that I'm a journalist.
If all my friends are from Sudan, I have some association with Sudan. So suddenly who you know actually gives away a lot of information about you. In particular, who you contact, or who you include as a friend, as a new friend, et cetera. Suddenly I start including a lot of doctors or a lot of lawyers in my circle of friends; well, that actually leaks information about me. Okay, and you may say, well, what's the big deal about this? This is not a big deal. However, what we have learned from the Snowden leaks is that if we have services that hold this information, what may happen, such as, for example, a friendly provider like Lavabit, what may happen is that the government, who wants to discover this information about you, will go to the service provider, who never intended to betray you, and will say, thank you very much, can we have this whole database of who is friends with whom? Okay, and we know that there are specific programs that do target those databases for collection, either by coercing providers or by actually just stealing the data off the wire or from their databases. Now, what's the big deal with that? Well, we actually had a very interesting confession last year by General Michael Hayden. He said, "We kill people based on metadata." So you being in the social graph of particular other people may in fact go as far as putting you on a kill list. And even when that doesn't happen, it might go so far as putting you on a particular targeted surveillance list, or make you a target for harassment, et cetera, et cetera, et cetera. So we would like to hide this metadata. We do not want to rely on third-party services holding that graph in order to provide presence or any associated data that comes with presence. Okay. Now, what do we expect from a privacy-friendly presence system? First of all, there are some functional features that we would not want to lose.
We want users to be able to somehow associate friends with each other, and register that they're online, and extract people's status, and then suspend friends if they do not wish to share any more information with them. Now, we would like our privacy-friendly system to be resistant against some pretty serious adversaries out there, because we know they exist and they're interested in this information. We assume our adversaries, for example, can observe most communications on the network, or all communications. We assume our adversaries sometimes are our friends, okay, and we would like to hide information even from other users that we have tagged as friends. However, we will assume that we have some aces up our sleeve. In particular, the computers we're using are secure. And if that is not the case for your computer, go and talk to the people who maintain Tails or any such distribution; they may be able to help you. And we will assume both that our cryptography is secure, cannot be broken by the adversary, and also that we have some infrastructure servers that will not all collude with each other to violate our privacy. This is very much in line with the assumption that Ian mentioned earlier today. And with those tools, what we will try to do is, first of all, protect the privacy of our social network. We do not want any third party to be able to see the full social network. We will try to protect the privacy and integrity of the associated status information, and also provide some additional properties, most notably for the purpose of this talk, unlinkability, namely the inability to use information learned in one time period to violate any privacy properties of the network in another time period, as well as forward and backward secrecy, et cetera, which are very interesting for resisting very powerful adversaries. Now, some of you may already be thinking, ha, I can think of a trivial way of doing this, right?
And this does not require a lot of very advanced technologies. What we can do is we simply have Alice, and she basically connects through an anonymity channel, like Tor, for example, to a chat server that is either a hidden service or on the other side of the anonymous channel, and registers using a pseudonym. And then we will have Bob down the line, who also registers with the same server and also says, hey, by the way, is Alice, or the pseudonym of Alice, online? Okay, and that, you may think, solves the problem. However, this is not quite the case, because what happens on this service is that the pseudonym of Alice and the pseudonym of Bob are somehow associated with each other. And so is the pseudonym of Alice with all the pseudonyms of her other friends, and the pseudonym of Bob with the pseudonyms of all his other friends. And what happens, in effect, is that a graph isomorphic to the real social graph starts being constructed in this service. Okay, we have pseudonyms being friends with pseudonyms. However, the shape of that graph, the links, are one-to-one mappings with the real graph. And therefore, despite the fact that maybe these are just pseudonyms, instead of Alice we have A, instead of Bob we have B, if the adversary has any side information, such as, for example, contributions of people and their friends on an online movie review site or anything like that, they may be able to use that information to reconstruct the graph and de-anonymize it. And we have seen particular attacks against Netflix that do that. Therefore, it is not actually safe to just do this, okay, because the adversary would be able to just go to this server, or even operate such a server, and recover a graph that is pretty damn close to the graph we're trying to hide. We don't want to do that. We want to leak no information, not just hide a little bit of information. Okay, so here is a high-level overview of how we actually achieve this.
And you will recognize a few of the ideas that Ian mentioned. So the basic problem is that we have Alice and we have Bob and their friends, and they would like to use a service, and this service runs our specific protocol called DP5, which stands for the Dagstuhl Privacy Preserving Presence Protocol, and the extra P is for extra privacy. So they would like to use a service that runs the DP5 protocol to gain some presence information. However, we would like all the servers that run DP5 not to learn anything. So let's see how we could achieve this. First of all, because the servers should not actually be able to extract any information about what's going on, clearly Alice and Bob need to share some secret that the servers don't know. So we assume that Alice and Bob share a key that they can either exchange when they meet each other or derive using public key cryptography. Then Bob somehow has to send some information about his presence and his status to the infrastructure, the DP5 server or servers, okay? And then naively, you may think, well, you know, Alice will just have to go and ask, you know, is Bob there, is Charlie there, and recover that information. However, this would be insecure, of course, because that would leak to the infrastructure that Alice and Bob and Charlie are associated with each other, okay? So wouldn't it be nice if at this stage we had some kind of mechanism for Alice to ask the server for a particular record associated with Bob without, however, leaking which record that is? Because that, in fact, would allow us to have private presence. And in fact, this is exactly the protocol that Ian talked about before. And the key insight behind the DP5 protocols is that we use private information retrieval so that Alice can query the records that correspond to her friends without letting the infrastructure find out who those friends are. Okay, so here is how DP5 works cryptographically.
First of all, we divide time into epochs. These can be long or short, and I will discuss that in a second. The idea is that in every epoch, users register their presence with the service, and other users (or the same users) can recover who was online in the previous epoch.

When Alice needs to register, she takes the key she shares with Bob, the key K_AB, and applies a pseudorandom function, which you can think of as a keyed hash function, to the current epoch number i. She gets two things: an ID for Alice's presence in this time period, and a symmetric key K. Alice uses the symmetric key K to encrypt her status using a strong cipher in a particular mode of operation called AEAD, Authenticated Encryption with Associated Data. That is basically just secure encryption; you can think of it that way. Then she makes a record consisting of just the ID for this period and the encrypted status, and stores it in the database.

Now Bob comes along in the next period and wants to find out whether Alice is online. Bob, of course, shares the key K_AB with Alice. So Bob applies the same pseudorandom function with the same key to the previous period's identifier and extracts, hopefully, the same ID and the same key. Then, using the ID and a private information retrieval protocol, Bob tries to retrieve the record that Alice left in the database. Hopefully he gets back the record C, and he tries to decrypt it with the key. If the decryption succeeds, it means Alice was online in the previous epoch, and furthermore he can decrypt her status. Okay, now I hope that everybody is at least half convinced that this is secure.
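The register-then-lookup flow just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: HMAC-SHA256 stands in for the keyed PRF, a toy encrypt-then-MAC construction stands in for a real AEAD such as AES-GCM (it only handles statuses up to 32 bytes), and the PIR step is elided to a plain dictionary lookup.

```python
import hmac, hashlib, os

def prf(key, msg):
    # HMAC-SHA256 stands in for the keyed pseudorandom function
    return hmac.new(key, msg, hashlib.sha256).digest()

def derive(k_ab, epoch):
    out = prf(k_ab, b"epoch:%d" % epoch)
    return out[:16], out[16:]        # (presence ID, symmetric key K)

# Toy encrypt-then-MAC standing in for a real AEAD; statuses are
# limited to 32 bytes by the single SHA-256 keystream block.
def toy_seal(key, plaintext):
    nonce = os.urandom(12)
    stream = hashlib.sha256(key + nonce).digest()
    ct = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()[:16]
    return nonce + ct + tag

def toy_open(key, blob):
    nonce, ct, tag = blob[:12], blob[12:-16], blob[-16:]
    expect = hmac.new(key, nonce + ct, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(tag, expect):
        return None                  # wrong key: not a record for us
    stream = hashlib.sha256(key + nonce).digest()
    return bytes(c ^ s for c, s in zip(ct, stream))

database = {}                        # the presence database for epoch i
k_ab = os.urandom(32)                # key Alice shares with Bob
epoch = 1234

# Registration: Alice stores (ID_i, Enc_K(status)).
id_i, k_i = derive(k_ab, epoch)
database[id_i] = toy_seal(k_i, b"online")

# Lookup: Bob derives the same (ID, K). In real DP5 the fetch goes
# through PIR, so the server never learns which ID Bob asked for.
id_q, k_q = derive(k_ab, epoch)
status = toy_open(k_q, database[id_q])   # b"online": Alice is present
```

Note that a successful authenticated decryption is what tells Bob the record really is Alice's; anyone without K_AB can neither find her ID nor decrypt her status.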
This would just work, and, subject to all our assumptions, it would prevent very powerful adversaries from finding out who is friends with whom, who is online, and who is querying whose presence at any point.

However, there is a problem, which is that this database tends to get very large. As I said, for every relationship in the world we have to include a record in the database. Furthermore, we would like to include some extra records to hide how many friends each person has, so that each person always submits, say, a hundred relationships and queries for a hundred relationships as well. So these databases get quite big, and that's a problem, because big databases are less efficient to update and query, and we would like these epochs to be as short as possible: you register in one epoch and then query in the next epoch who was there before. If the epochs are a day long, you only find out the next day that someone was online the previous day, and that's not very useful, let's face it. So we need to do something to improve efficiency, so that these registrations and queries are very fast.

As my academic great-grandfather David Wheeler said, any problem in computer science can be solved with another layer of indirection. He is worth a round of applause, by the way, because he invented the subroutine, along with many other cool things, which is indeed a layer of indirection. He was also wise enough to add that this usually creates another problem. So what we do is run two versions of the protocol I just described back to back, one feeding into the other. We define two kinds of time period, two different epochs: the long epoch and the short epoch.
First, in the long epoch, Alice and Bob store one type of record in the database, and that record has just a public key as its status. So in a long epoch, which we envisage to be about a day, Alice just stores a public key for Bob into this database. Then, in the short epoch, Bob can both retrieve the public key associated with Alice from the long-term database and use that key to query for Alice in the short-term one.

The trick here is that this public key is shared by all friends of Alice. So we go from a situation where we have one record per relationship, which is still the case in the long-term database, to just one record per active user. That makes for a much smaller database, and it becomes tractable enough for this period to be as short as a minute, or even a few seconds, depending on how many machines you throw at the problem. Alice can then store her actual status in the short-term database using this public key, and Bob will retrieve Alice's status every minute, say, to make sure both that she is online and that he has an up-to-date status for her.

By the way, this status field is very important to us, because we don't just think of it as "busy", "not there" or "away for five minutes"; we think of this protocol as being able to support other cryptographic and anonymity protocols. So you can think of the status information as saying, "by the way, this is my anonymous pseudonym for you to contact me on this service", or "by the way, this is the address of a Tor hidden service on which you can talk to me". We were extremely keen to have this associated information so that we can build more secure chat protocols on top of this presence protocol and on the back of its properties.

Now, "cypherpunks write code", I believe; is that how it goes? That's how it used to go, and I hope that's how it's going to go from now on as well. So all of this has been implemented.
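The two-epoch bookkeeping above can be sketched as follows. This is a deliberately simplified illustration: the real protocol encrypts the long-term record and authenticates short-term uploads with the private half of the epoch key, whereas this sketch only shows how the record counts change between the two databases (HMAC-SHA256 again stands in for the PRF, and PIR fetches are shown as plain lookups).

```python
import hmac, hashlib, os

def prf(key, msg):
    # HMAC-SHA256 stands in for the protocol's pseudorandom function
    return hmac.new(key, msg, hashlib.sha256).digest()

long_db, short_db = {}, {}

# Long epoch (about a day): Alice leaves her per-epoch public key for
# EACH friend, so this database still has one record per relationship.
k_ab = os.urandom(32)            # pairwise key Alice shares with Bob
epoch_key = os.urandom(32)       # stand-in for Alice's per-epoch public key
long_epoch = 42
long_db[prf(k_ab, b"long:%d" % long_epoch)[:16]] = epoch_key

# Short epoch (about a minute): ALL of Alice's friends know epoch_key,
# so one record per active user suffices -- a much smaller database.
short_epoch = 42_000
short_db[prf(epoch_key, b"short:%d" % short_epoch)[:16]] = b"online"

# Bob: one (PIR) fetch per day from the big long-term database, then
# cheap per-minute (PIR) lookups in the small short-term one.
key_for_alice = long_db[prf(k_ab, b"long:%d" % long_epoch)[:16]]
status = short_db.get(prf(key_for_alice, b"short:%d" % short_epoch)[:16])
```

The point of the indirection is visible in the sizes: however many friends Alice has, she contributes exactly one record per short epoch, and the expensive frequent queries run against that small database.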
We use the Percy++ library as the core of our PIR implementation; the rest of the DP5 protocol is implemented at a low level in C++, and we have Python bindings that we use with the CherryPy framework and Twisted to implement all of the networking, so that we can time it and make sure it all works. However, this is still work in progress, because it has not been integrated with any client. The protocol is there and we can do a lot of timing; sadly, though, we can't yet find out who's online and who's offline among real users, just the robots that run our tests. That is a missing part I hope we will be working on next year.

The cost of running this is surprisingly low for the very strong properties it offers. Sadly, the cost per user rises the more users you have, because that's how PIR is. If you have a service with, say, 10,000 users, it will cost you a fraction of a penny per user per month to run this protocol. If, however, you have millions and millions of users, it will cost on the order of a dollar or more per user, given realistic loads.

So the takeaways are these. First, metadata in chat protocols and social protocols is an active target for national security agencies and everybody else, so we need to protect it if we're serious about protecting communications secrecy; we know how to protect content, but metadata is something we're not yet very good at protecting. Second, private information retrieval as a general primitive is pretty cool.
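A toy version of the classic two-server XOR scheme from the first half of the talk shows the primitive in action. The record count and size here are illustrative, and real deployments use Percy++'s more general robust schemes, but the core trick is the same: each server's query vector is individually uniformly random, so neither non-colluding server learns the queried index.

```python
import os, secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

RECLEN = 8
db = [os.urandom(RECLEN) for _ in range(16)]   # both servers hold copies

def server_answer(database, bits):
    # Each server XORs the records its bit-vector selects, so every
    # query touches every record: computation is O(database size).
    acc = bytes(RECLEN)
    for rec, bit in zip(database, bits):
        if bit:
            acc = xor_bytes(acc, rec)
    return acc

want = 5
mask1 = [secrets.randbelow(2) for _ in db]   # uniformly random vector
mask2 = mask1.copy()
mask2[want] ^= 1                             # differs only at index `want`

a1 = server_answer(db, mask1)   # sent to (non-colluding) server 1
a2 = server_answer(db, mask2)   # sent to server 2
recovered = xor_bytes(a1, a2)   # XOR of answers is exactly db[want]
```

All the records selected by both masks cancel in the final XOR, leaving only the one record where the masks differ, which is precisely the record Bob wanted.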
You can do things that many of you at the beginning of this talk were not convinced were even possible. And DP5 leverages private information retrieval to implement a specific protocol that provides private presence, and given the characteristics of this protocol, it is actually quite practical. I hope that many services will start considering this kind of technology pretty soon. Thank you very much; we're open for questions. Okay. Oh, the Git repository, yes.

Okay, could you line up at the microphones for the questions? We also have one question from the internet. Please go ahead, Signal Angel.

Hi, so the first, and currently only, question is: how does DP5 prevent Eve from getting information about Alice or Bob that they don't want to share with her?

Great. So, is this on? Yes. If you remember, at the beginning George showed that Alice and Bob share a key, and Alice shares a separate key with each of her friends. That key is what enables Bob to look up information about Alice; none of Alice's non-friends can look it up. Eve doesn't share a key with Alice, so Eve won't be able to look up that information in the online database.

Okay, microphone number four up there, please.

Hello, thank you. I wanted to ask about the price graph, which looks pretty terrifying: if you have one million users at one dollar per user, that's a million dollars per month, which is quite a lot of money. Have you looked at what is actually causing this? Is it the PIR requests?

Yeah, that's exactly right. I mean, it's a lot of money, but it's not a ridiculous amount if you really have a million users: running a privacy service for a subscription rate of a buck a month is not a totally ridiculous thing to do. But it is unfortunate that the price goes up per user, and the reason is exactly PIR, because the size of the database scales with the number of users.
Every query does a PIR lookup on a database whose size increases with the number of users, and the cost of a PIR query is proportional to the size of the database. Why is that? Well, you can easily reason that if the server is to learn nothing about which record you were interested in, it clearly has to process every record in the database somehow: if it didn't touch some record on disk, then clearly you didn't ask for that record. So the server has to process every record in some fashion, and its computational cost is proportional to the size of the database. The communication cost is much smaller.

Okay, microphone number one up here, please.

Hi, thanks for the talk, it was really interesting. I have a question about the non-collusion assumption. In the scenario with the patent server, Alice doesn't trust the one server not to use the information in her request for its own competitive advantage. It seems like in the setting where you've got several different servers, she still has to trust at least some of them not to collude. Why is the first assumption not reasonable but the second one is? If you trust a server not to collude, why wouldn't you also trust it not to use your query?

Right. So it is unsettling, to some extent, that many protocols cannot be achieved on your own, or just with your communication partner. One classic example is anonymity protocols: we usually have to use relays, and the traditional wisdom is that using a single relay is very fragile, because at any time that one operator, if they are corrupt to some extent, or coerced even if they are not corrupt, might be able to leak who you're talking to. However, if you have more than one, they would all have to be corrupted, or collude, to do that.
So all the logic behind why we trust things like mix networks, or onion routing to some extent, really applies here. Why would you trust more than one server in preference to just one? Because they might be in different jurisdictions; because they might have different operators who will do different things to protect the data; and because it is logistically quite difficult to simultaneously go to n people and ask for the secret information. And remember, something I didn't mention here is that the protocol is perfectly forward secure, so going after the servers some time later doesn't work; you would have to go to all of them simultaneously. But it is a social assumption, and social assumptions are fragile, because they rely on people like us and our community actually providing those infrastructures. As people from other free software projects that rely on this know, that is not automatic: if people like us do not run these things, it is very unlikely they will ever be run by anyone.

Okay, microphone number six back there, please.

How would sharding the user database affect the performance and privacy characteristics of the system?

So if you shard the user database into K pieces, the performance gets better by a factor of K, but then only users in the same shard can be friends with each other. If your user population naturally falls into these shards, okay, but in some sense that is exactly the information you're trying to protect: you don't want someone to observe which shard you're in, which would be visible. That's why we try not to do that; if you shard the database, your performance goes up but your privacy goes quite a bit down.

Okay, then we have a question from microphone number two, and two more from the internet after that.
Hi, quick question. Doesn't the public key that's used to query the short-epoch database allow a third party to again reconstruct a graph that mirrors the real one? Or is the public key also behind PIR, so that it's not really public but just shared between the users who know each other?

Only Alice's friends learn that key, right? Because you have to go through the first instance of the DP5 protocol to get that key, only your friends have it. So it's a public key in the sense that Alice has the private half and other people have the public half, but not everybody has the public half; only Alice's friends do.

So the public key is not really the query that goes to the database?

It's never revealed, yeah. It's never revealed to anyone outside that circle, and the second round of the protocol derives keys from that key, so it's never revealed in a way that could link different queries from different people.

Thank you. Okay, our Signal Angel has a question from the internet.

So there are some similar questions along the same line. The first one: what do I have to do to use this with Jabber? The second one: can I use this to build a distributed DNS system? Which is, I think, yeah, "can I use this for X?"

Well, the first question is easier, so I'll take it and leave the hard question for Ian. What you need to do to use this with Jabber, if I understand the question correctly, is to spend some time integrating this protocol with the Jabber presence mechanism, which is not something we have done yet.
We have actually spent quite some time thinking about how this could be done, and we foresee a solution based on implementing a kind of Jabber protocol proxy: your client talks to a local proxy, and the local proxy translates all the presence-related events into DP5 and translates the responses back into Jabber events, so that we can reuse a lot of existing code. However, this work is still to be done, and this kind of integration is actually one of the trickiest parts of computer security to get right, to make sure the result is actually usable. We haven't done that yet.

The second question was how we can use this for DNS. It's not a very good match for DNS, because in DNS the whole point is that you want everybody to know the DNS information. It's probably a better fit for, say, a short messaging service: Alice wants to tell Bob in particular some small amount of information, which would be the auxiliary data here. So you can imagine DP5, or a very similar protocol, being used for a short messaging service where Alice sends a particular message to one person or a small subset of people, her friends in this example. But with DNS you want to send to everybody, so I don't think it's a good fit for this protocol.

Okay, two minutes left, two questions. Microphone number one, please.

In the protocol, Alice stores her encrypted information under an ID, say X, and later Bob queries the database for that same ID X. What prevents the server from learning the relationship between Alice and Bob?

Because the lookup is done using private information retrieval. Bob looks up ID X, but in a way that doesn't reveal to the server what X is.

All right. So you're using the distributed model?

That's right, that's the magic.

Okay. Microphone number one again. What happens if the databases are not synchronized?
For example, if they run out of sync for one epoch and hold different information about the same user?

Right, so that would be bad, but not so bad, because as I said right at the end of the first half, these PIR protocols support robustness. If some of the servers are out of sync, as long as not too many are, there is no problem: the robustness properties of the PIR protocol will just conclude, "oh, that server is giving me the wrong data". As long as enough of the servers agree, the answer that most of the servers agree on is the answer returned.

Okay, so there's no value in between; it's the majority value?

It's the value most servers agree on. Thank you.

Okay, we have been asked to finish smack on time, so unfortunately I have to close the Q&A now. Again, thank you very much for coming here. Thank you very much. Thank you.