Hello, and thank you for watching this talk. Today I'll be talking about PIR with default and its applications. This is joint work with my co-authors at Google and with Ni Trieu, who was our intern and is currently at Arizona State University.

First, let me discuss the problem we're trying to solve, which is called inner-join private join and compute (inner-join PJC). In this problem, consider a user who has several IDs x_1 through x_m, with associated weights v_1 through v_m, and a server who has a much larger database of the same kind, with many more IDs y and associated weights w. The goal is for the user to compute the inner product of its data set with the server's data set, restricted to those IDs x_i that are in the set Y held by the server. That is, the user should learn the sum of the products v_x · w_x over the IDs x in the intersection, where v_x is the user's value associated with x and w_x is the server's weight associated with that same ID, together with some differentially private noise calibrated to a privacy parameter epsilon. In particular, nothing more should be learned by the user, and nothing extra should be learned by the server.

So, as noted, the functionality we want is that the user should learn the sum of products of weights, perhaps with noise added, for IDs in the intersection of X and Y. Furthermore, we want the user's communication and computation costs to be nearly linear in the size of its own data set; in particular, they should grow very slowly with the server's data set size, and that's what the tilde means. The assumption is that the user's set is much smaller than the server's data set.

In terms of privacy, we want each party's inputs to remain hidden. We also want the elements of X ∩ Y, that is, which IDs were in common, to remain hidden, and we further want the size of the intersection, the number of IDs in common, to remain hidden. We do assume that the input sizes |X| and |Y| are acceptable to reveal to the two parties; this leakage can be mitigated by padding the inputs on either side with random IDs.

So why do we care about this problem? Let's look at a couple of hypothetical applications. The first is exposure notification, where the data held by the user could be the Bluetooth IDs of devices the user has been in proximity with, and the value associated with each ID could capture how close the user was to that device and how long they spent in proximity. On the server side, you could think of the server holding a data set of all the users who reported being infected with a particular disease, say COVID-19, where the weights represent the virulence of that user on a particular day; for example, there could be one row per user per day. The output of inner-join PJC in this setting would be, over the IDs the user was in close contact with, the sum of the products of the proximity weight and the virulence weight, potentially with some noise added, which gives some sense of how likely the user is to have been infected. So you could use this kind of inner-join PJC to do an exposure notification computation, and obviously privacy is important here.
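Just to make the target concrete, here is a minimal plaintext sketch (no cryptography) of the functionality the two parties jointly compute. The choice of Laplace noise for the differentially private mechanism, and all names in this snippet, are illustrative assumptions rather than details taken from the protocol; in the exposure-notification example, user_data would hold proximity weights and server_data virulence weights.

```python
import random

def inner_join_pjc_ideal(user_data, server_data, epsilon, sensitivity=1.0):
    """Ideal (plaintext) functionality: noisy inner product over the ID intersection.

    user_data:   dict mapping IDs x_i -> values v_i (the small set)
    server_data: dict mapping IDs y_j -> weights w_j (the much larger set)
    Returns sum of v_x * w_x over x in X intersect Y, plus Laplace noise.
    """
    true_sum = sum(v * server_data[x] for x, v in user_data.items() if x in server_data)
    # Laplace noise as the difference of two exponentials (illustrative DP mechanism).
    scale = sensitivity / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_sum + noise

# Toy example: "id7" is the only match, so the true sum is 1.5 * 0.8 = 1.2.
user = {"id3": 2.0, "id7": 1.5}
server = {"id7": 0.8, "id9": 0.4}
print(inner_join_pjc_ideal(user, server, epsilon=1.0))
```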
For example, you could show an alert if this value is above some threshold.

Another potential application is measuring ad effectiveness. On one hand you could have the user being a merchant with a set of IDs corresponding to users who bought a particular type of item, with the values being the amounts spent. On the server side you could have an ad-tech company holding a set of user IDs corresponding to users who have seen an ad for a particular campaign, say a shoe ad, where the weights correspond to the time-decayed effect of that ad. This assumes that the merchant and the ad-tech company have fixed a particular date on which they want to do the measurement, and that the merchant only uses transactions from that date. What you would want to compute is the weighted conversion credit that should be given to the ad-tech company for showing ads to these users, where a conversion is somebody seeing an ad and then going and buying the item. In this case, the weighted conversion credit could be computed, for example, by taking the spend value and multiplying it by the time-decayed ad effect for each user who both saw an ad and bought something. So inner-join PJC could compute the total value spent by users who had also seen ads, proportionally decayed based on how long it had been since each user saw an ad. Clearly privacy is important here too: neither party would want to reveal its raw data to the other, but both might be happy revealing the aggregates.

Our approach to solving this problem is to build a secure multiparty computation protocol tailored for computing inner-join PJC, focusing on asymmetric input sizes. What properties do we want from it? As mentioned before, we want to hide which particular items are in common between the user and the server, we want to hide the size of the intersection, we want to be able to compute the inner product over the associated values in the intersection, and we want the user's costs to be nearly linear in only its own input size and largely independent of the server's input size.

The private join and compute protocol designed by Google, and described in this blog post, achieves the first and third properties, but it reveals the intersection size, and furthermore the user's cost is linear in the size of both its own and the server's data set. There's another technology, private information retrieval (PIR), which is very well suited to asymmetric data set sizes, and in fact you can build inner-join PJC from PIR by using it as a private set intersection protocol. If you do so, the basic way of using PIR to build private set intersection reveals the particular items in the intersection and the intersection size, but it allows computation on the intersection and has user costs that are nearly linear in only the smaller data set. On the other hand, there's a very nice line of work called circuit PSI, which uses garbled circuits to do private set intersection. There you can get the first three properties: you can hide the intersection, hide the size of the intersection, and compute arbitrary functions on the intersection, including dot products.
But again, it incurs costs linear, or slightly more than linear, in both parties' data sets. In our work, we get all four of these properties, and our approach is to build on private information retrieval. I'd also like to note that our setting is also addressed by the work of Chen, Laine, and Rindal, which does a kind of asymmetric private set intersection and allows computing over the intersection. Our work differs from theirs in several important ways; for a more detailed comparison, please refer to the full version of the paper.

With that, let's jump into how our construction works. First, I'll give an overview of the pieces that go into building inner-join PJC. We're going to start from a private information retrieval protocol. In PIR, the user has a particular index i, the server has a database of many values, and the protocol allows the user to retrieve the value at that index from the server. The first thing we do is use a variant of PIR called keyword PIR, where the user, instead of having an index, now has a keyword x, and the server has key-value pairs (y, w). Keyword PIR allows the user to query this keyword and retrieve w_x if x is in Y, and otherwise retrieve garbage.

So the first step is to start with a keyword PIR scheme. The next step is to modify it so that, instead of getting garbage when x is not in Y, the user retrieves a prescribed default value from the server, without revealing to the server that a default value was received. I didn't say this before, but normal PIR schemes don't reveal to the server what the client queried, and that's also the case in what we call PIR with default. So the client queries on x, and the difference from keyword PIR is that, instead of receiving garbage when x is not in Y, the client instead receives a server-chosen default value. We're going to build PIR with default from keyword PIR.

In fact, we're going to do something more, which is to allow the client to also have a value v. The client will either retrieve v times w_x, in case x is in Y, or otherwise retrieve the default value. And we introduce a further modification: the client doesn't retrieve these values in the clear, but masked with a random mask chosen by the server. Once we introduce this random mask, we gain the property that the client cannot tell whether it received v times w_x or the default value. This is exactly what we call extended PIR with default, and what our work does is give a construction for extended PIR with default. Once we have it, we can use it to build inner-join PJC in a straightforward way. Just to pause here: what extended PIR with default does is that the client has a key-value pair, the server has many key-value pairs, and the client retrieves either the masked product of the associated values, if x is in the server's data set, or a masked default value, if x is not.
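Before going further, here is a minimal plaintext sketch of the extended PIR-with-default functionality just described, ignoring how the masking is realized cryptographically; the function and variable names, and the choice of modulus, are mine rather than the paper's.

```python
import secrets

MODULUS = 2**64  # illustrative group in which values and masks live

def extended_pir_with_default(x, v, server_db, default):
    """Ideal extended PIR with default for a single client query (x, v).

    server_db: dict mapping server keys y -> weights w
    Returns (client_output, server_mask) where
      client_output = v * w_x + mask  (mod MODULUS)  if x is in server_db,
      client_output = default + mask  (mod MODULUS)  otherwise.
    Because both cases are masked, the client cannot tell which one occurred.
    """
    mask = secrets.randbelow(MODULUS)  # server-chosen random mask
    if x in server_db:
        client_output = (v * server_db[x] + mask) % MODULUS
    else:
        client_output = (default + mask) % MODULUS
    return client_output, mask
```

Only the server knows the mask, so subtracting it would reveal either v · w_x or the default; the next step uses exactly this, with default value zero, and only lets the masks cancel in aggregate.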
To build inner-join PJC from this, recall that the user has many different key-value pairs and so does the server. The user executes extended PIR with default on each of its inputs (x_i, v_i) with the server, using default value zero and a fresh random mask each time. Note that in each execution, the user either receives the product v_i times w_i masked with a fresh mask, or it receives zero masked with that same mask, and this happens for each of its pairs (x_i, v_i). The user then simply sums together all the outputs it received from these executions to get a value T. The server sums together all the masks it used in these executions and sends this over to the user, together with noise added for differential privacy, calibrated to epsilon. The user then subtracts the sum of the masks provided by the server from the sum of the values it retrieved in step one, and it turns out this is exactly the noisy inner product we wanted. So this is how we build inner-join PJC from extended PIR with default, and this is our solution strategy.

Given this, the question is how to build extended PIR with default, and that's what we'll go into next. Again, our starting point is a private information retrieval protocol. Most efficient single-server PIR protocols leverage homomorphic encryption. The client encrypts the index i for which it wants to retrieve something from the server; technically it's not exactly encrypting the index i but some special encoding of it, but we skip the details here. The server homomorphically expands this into a one-hot vector of n ciphertexts, which are all encryptions of zero except for an encryption of one in the i-th position, and because this expansion is done homomorphically, the server doesn't learn which position holds the one. It then executes a homomorphic dot product of these ciphertexts with its database and sums the results, which gives it an encryption of exactly y_i, because all the other values are homomorphically multiplied by zero. This gets sent back to the client, who can decrypt to get y_i.

So what do we do if we have an ID or keyword instead of an index? There are a couple of different approaches for dealing with keywords, but ours is to use a Bloom filter. To give some background, a Bloom filter is a data structure for testing set membership. Suppose a server has a data set consisting of y_1 through y_n. It can create a Bloom filter consisting of bits b_1 through b_N, where N is larger than n. The client can take its value x and turn it into k indices h_1(x) through h_k(x), computed by simply hashing x with k different hash functions specified by the filter. The client can then test membership by looking up the Bloom filter entries at each of the positions h_i(x): if all of those entries are one, the client can conclude that its item x is in the set Y, except with some negligible failure probability. There are well-known constructions of Bloom filters, and they're a widely used primitive.
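As a quick refresher, here is a minimal Bloom filter sketch; the salted SHA-256 hashing and the toy parameters below are illustrative assumptions, not the parameters used in the paper.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over string keys (illustrative construction)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, key):
        # The k indices h_1(key), ..., h_k(key), derived from salted SHA-256.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}|{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def maybe_contains(self, key):
        # No false negatives; false positives occur only with small probability.
        return all(self.bits[pos] == 1 for pos in self._positions(key))

# Toy usage: the server inserts its IDs, the client checks one of its own.
bf = BloomFilter(num_bits=256, num_hashes=8)
for y in ["alice", "bob", "carol"]:
    bf.insert(y)
print(bf.maybe_contains("bob"), bf.maybe_contains("dave"))  # True, (almost surely) False
```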
Just for concreteness, we'll be using 31 hash functions, which gives a failure probability of 2^-40 and means the Bloom filter has an expansion factor of about 58.

The way we're going to do keyword PIR is to use regular index PIR, as discussed earlier, together with Bloom filters. Recall that we have a user and a server with key-value pairs; let's forget about the values for now and have the user hold just a single keyword. We have the server create a Bloom filter from its keywords, and have the user send k PIR queries, one for each entry of the Bloom filter that it would need to look up to do its membership check, encrypted under a homomorphic encryption scheme. The server processes each of these PIR queries to get encryptions of those particular bits of the Bloom filter corresponding to the indices the client sent. It then homomorphically sums the responses to get the sum of those filter bits, adds a random mask r_2, and sends the result to the client, who decrypts and subtracts k, the number of hash functions of the Bloom filter, to obtain r_1. We then know that r_1 equals r_2 if and only if x is in Y, because r_1 equals r_2 precisely when all k bits retrieved by the PIR queries were one. So this is a way to take index PIR and turn it into keyword PIR; of course, recall that there's a 2^-40 failure probability here.

So suppose we have this step, where r_1 and r_2 are equal if and only if x is in Y. Now let's think about what to do with the associated values. To deal with associated values, we use something called a garbled Bloom filter. A garbled Bloom filter is another data structure, very similar to a Bloom filter, where instead of just keywords the server can have key-value pairs. The garbled Bloom filter encodes these key-value pairs such that a client can take its input and query k locations in the garbled Bloom filter, and adding together the entries at those k locations yields exactly the associated value if x is in Y. If x is not in Y, however, the garbled Bloom filter entries sum to some unknown value, undetermined by the filter; it could be anything.

The next piece of our construction is to combine PIR with garbled Bloom filters. The server makes a garbled Bloom filter of its key-value pairs, and now let's think of the user as having a single key-value pair (x, v). The user again sends encryptions of the locations it wants to look up in the filter, corresponding to its input x. The server processes these encrypted indices as PIR queries, thereby getting encryptions of the locations of the garbled Bloom filter that the client wanted, and homomorphically sums the responses. The user also sends along an encryption of its value v, and the server homomorphically multiplies this value v into the sum it computed in the previous step. Furthermore, the server masks this value with a random mask s_2 and sends it back to the user, who decrypts to get s_1.
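Here is a minimal plaintext sketch of a garbled Bloom filter, assuming entries and values live in the integers modulo a fixed modulus and that a key's value is split additively across its k positions (classic constructions use XOR shares instead); the hashing and parameters are again illustrative.

```python
import hashlib
import secrets

MODULUS = 2**32   # illustrative value group
NUM_HASHES = 5    # illustrative; the talk's concrete parameter is 31

def positions(key, num_bits, num_hashes=NUM_HASHES):
    """Derive num_hashes distinct filter positions for `key` (illustrative hashing)."""
    out, salt = [], 0
    while len(out) < num_hashes:
        digest = hashlib.sha256(f"{salt}|{key}".encode()).digest()
        p = int.from_bytes(digest, "big") % num_bits
        if p not in out:
            out.append(p)
        salt += 1
    return out

def build_gbf(pairs, num_bits):
    """Garbled Bloom filter: for every inserted (key, value), the entries at
    positions(key) sum to value mod MODULUS; for any other key, the sum is garbage."""
    entries = [None] * num_bits                       # None marks a still-free slot
    for key, value in pairs:
        pos = positions(key, num_bits)
        free = [p for p in pos if entries[p] is None]
        if not free:
            raise ValueError("insertion failed; enlarge the filter")
        last = free.pop()
        for p in free:                                # fill all but one free slot at random
            entries[p] = secrets.randbelow(MODULUS)
        partial = sum(entries[p] for p in pos if p != last) % MODULUS
        entries[last] = (value - partial) % MODULUS   # force the k-position sum to `value`
    return [e if e is not None else secrets.randbelow(MODULUS) for e in entries]

def query_gbf(gbf, key):
    return sum(gbf[p] for p in positions(key, len(gbf))) % MODULUS

# Toy usage: a matched key recovers its value, an unmatched key gives garbage.
gbf = build_gbf([("id7", 8), ("id9", 4)], num_bits=200)
print(query_gbf(gbf, "id7"))   # 8
print(query_gbf(gbf, "id2"))   # some undetermined value
```

In the protocol, the server holds such a filter, and the client's k PIR queries retrieve exactly the entries at positions(x), so their homomorphic sum, multiplied by v and masked with s_2, is what the client decrypts as s_1.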
Note that s_1 and s_2 are additive secret shares of v times w_x if x is in Y, and secret shares of some garbage, undetermined value otherwise. So from these two pieces, if we do the PIR query on the Bloom filter and the PIR query on the garbled Bloom filter, the user will have retrieved values r_1 and s_1, and the server will have created values r_2 and s_2, with the properties that r_1 equals r_2 if and only if x is in Y, and that s_1 and s_2 are secret shares of v times w_x if x is in Y and secret shares of some garbage value otherwise.

Now, to get PIR with default, we have the user and the server execute a generic MPC protocol that outputs secret shares t_1 and t_2 such that t_1 + t_2 equals v times w_x if x is in Y and zero otherwise, which is exactly what we wanted for PIR with default. Note that this generic MPC protocol only needs to take as input the values r_1, s_1, r_2, and s_2, and does not depend on the size of the server's data set. In fact, it's easy to modify this so that the server can specify a default value: instead of t_1 + t_2 adding up to zero, they add up to the server-specified default. This is exactly the crux of our construction. The generic MPC protocol can be any generic MPC protocol, but we specifically use a garbled-circuit-based protocol.

Our construction also has several optimizations. We described everything for a single key-value pair, but you can get huge benefits by doing multiple key-value queries in parallel; this is a well-known technique in PIR and keyword PIR which uses the slotting and batching features of homomorphic encryption schemes. Another well-known optimization is to cuckoo-hash the inputs on the client side, a standard technique that groups the inputs into smaller bins so that the PIR queries are executed over smaller sets. This induces a huge computational saving on the server, with some minor increase in the client's costs.

I'll now discuss some experimental results for our implementation of this PIR with default. In these graphs, I'd like to highlight the communication costs. In particular, our construction is the red line, marked as Construction 2. On the x-axis you have the database size held by the server, and on the y-axis you have the log of the communication between the client and the server; t here is the number of queries the client makes to the server's data set. What this graph shows is that the communication cost grows very slowly as the server's data set size increases, which is exactly what we wanted, in particular compared to existing works. I also want to highlight that the construction shown here is actually the second construction in the paper; there is also a warm-up construction we didn't discuss, which essentially uses a naive PIR instead of the compressing PIR we've discussed so far. It's also very interesting, so please take a look at the full version of the paper. Next, we also measured the end-to-end runtime and total communication.
Here I'd particularly like to highlight the setting where there's a large gap between the server's data set size and the client's data set size. We can see that our communication costs in particular are quite a bit smaller than existing works, which is exactly what we were aiming for, but note that our computation costs are quite a bit higher. So a natural question is how to justify this larger computation cost. Because of that, we looked at total monetary costs, that is, the cost that would be incurred if you ran this protocol on GCP. Here we are again looking at the setting where there's a large gap between the client's and server's data set sizes. We can see that, specifically for the client, the costs are a lot lower than in other works; other works incur roughly equal costs for client and server, whereas our protocol offloads a lot of the monetary cost to the server. In fact, if you look at the total costs, ours are moderately higher than existing works, so they're somewhat competitive, but we have the huge benefit that the client does not incur as much cost as the server.

Let me now briefly discuss ways our protocol can be extended. First, so far we've discussed the inner-join dot-product functionality between the two data sets, but in fact we can easily support any other function f that is supported by the homomorphic encryption scheme underlying the PIR-with-default construction we described. Furthermore, instead of just doing sums (here we had the sum of f applied to the two values associated with each x), we could do any computation G over these values, where G is anything supported by the secret sharing scheme. In particular, over secret shares you can compute any function using a generic MPC protocol, so in fact you can support any G.

That's all I wanted to present. Thank you, and please feel free to reach out to me or any of the other co-authors if you have questions.