Thank you. This is joint work with Shai Halevi, Yehuda Lindell, and Tal Rabin. I'm going to speak about privacy-preserving search of similar patients in genomic data.

Secure computation enables several parties to compute a joint function of their inputs without revealing those inputs to one another, and nothing is learned beyond the output. There are tons of potential applications for secure computation. We can run learning on distributed databases — think of two hospitals that want to do joint research on their patients but cannot reveal the patients' data to one another. We saw today applications of secure computation in blockchains. Secure computation is also useful for protecting credentials and cryptographic keys, and there are many other nice potential applications. There has been a lot of interest in secure computation lately; we see startup companies that actually implement it, and it has become much more practical.

In secure computation we generally distinguish between generic protocols and protocols for specific tasks. Generic protocols are techniques that, given any function, provide a protocol that computes that function securely. Unfortunately, these generic techniques do not scale to very large inputs. On the other end, we have protocols for specific tasks: we look at a specific problem, study it, exploit properties of the application domain, and can come up with very fast solutions. In this talk we are going to design a secure protocol for a specific task in genomics, and we are going to demonstrate several design principles along the way. Most importantly, we are going to see that most of the computation can sometimes be pushed to preprocessing.
This is a very specific task, but in our case we are going to push most of the computation to a preprocessing phase that happens in the clear, and this will enable us to run a very large-scale computation that initially seems very expensive. With generic protocols it would have taken hours to run; now it is going to take only several seconds.

The problem we consider is the following. Suppose a doctor holds the genome sequence of his patient, and there is a remote database — a hospital that holds the sequenced data of many patients. The doctor wants to identify the very few sequences in the remote database that are closest to his own patient's. Why do we want such a computation? Because it can sometimes help us find the right treatment for that patient. The challenge, of course, is to do this while also protecting privacy. We care about the privacy of the patient — we don't want to reveal his genomic sequence — and we care about the database and the other patients. Genomes are very sensitive information: my genome sequence affects not only my own privacy but also that of my parents and my kids.

Why do we care about such a task? Today, in cancer, we unfortunately don't always know what the right treatment is for a patient; sometimes patients receive a treatment that isn't suitable for them. Using genomic sequences we can look at similar patients, see what treatment they got and whether it was successful, and this helps us decide which treatment to select or avoid. This already happens today — some 50,000 genomes have been sequenced for this task — and according to this organization the numbers are going to be much larger in the future. We got to this problem through the iDASH Privacy and Security Workshop, which runs a competition.
iDASH is a biomedical organization; they are the domain experts in genomics, and they run an annual competition where they give cryptographers a problem and databases and ask them to solve it. This was last year's challenge. There were several tasks; this was the secure-computation track. The problem is to find the k closest sequences in the database, where k is a parameter — think of k as something like three or five — meaning the sequences in the database whose edit distance to the sequence of the doctor's patient is smallest.

Just to recap what edit distance is: edit distance is a way to measure how similar two strings are. We count the minimum number of basic operations required to transform one string into the other, where the basic operations are insertions, deletions, and substitutions. Computing it is a very heavy computation — it is quadratic in the length of the sequences — and we can reduce the amount of computation to something like n times d if we know a bound d on the maximum edit distance we expect to see.

Just a few words about the challenge database that iDASH gave in the competition. The database consists of 500 sequences, each of length around 3,500 characters, so we are looking at some specific region of the genome. These sequences were taken from a high-diversity region: the distance between two individuals is around 5%. So we have a lot of mutations and a lot of insertions and deletions, which makes the problem genuinely harder — we cannot just compute Hamming distance or something like that; we actually have to compute edit distance. In genomics, 5% is very high diversity: when you take two individuals, their genome sequences are usually 99% similar. Here we have a lot of differences in that specific region.
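To make the cost concrete, here is a minimal sketch (my own illustration, not the paper's implementation) of the standard dynamic-programming edit distance, with an optional bound `max_dist` that restricts the computation to a diagonal band, reducing the work from quadratic to roughly n times d:

```python
def edit_distance(s, t, max_dist=None):
    """Levenshtein edit distance via dynamic programming.

    If max_dist is given, only a diagonal band of width ~2*max_dist+1
    is filled in, so the cost drops from O(n*m) to about O(n*max_dist).
    With a band, the result is only reliable when the true distance is
    at most max_dist; otherwise it is some value larger than max_dist.
    """
    n, m = len(s), len(t)
    if max_dist is not None and abs(n - m) > max_dist:
        return max_dist + 1  # the distance certainly exceeds the bound
    INF = float("inf")
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [INF] * m
        lo = 1 if max_dist is None else max(1, i - max_dist)
        hi = m if max_dist is None else min(m, i + max_dist)
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution / match
        prev = cur
    return prev[m]
```

For two 3,500-character sequences with a realistic distance bound, this is roughly the 700,000 cell updates mentioned below.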
So computing even one edit distance — between the query and a single sequence in the database — requires something like 700,000 comparisons, and that is with the bound on the edit distance already factored into the complexity. If we look at the Boolean circuit that computes this task, it contains about 50 million gates — and that is just to learn one edit distance between the query and one patient in the database. To compare against all 500 sequences in the database, we get about 25 billion gates, which means that with generic protocols for secure computation it would take several hours to compute.

What we did in our work is come up with a domain-specific edit-distance approximation. We exploited the fact that we have genomic sequences coming from a particular distribution, which allows a much faster solution, and we designed a secure protocol for computing it in the semi-honest model. It is an approximation algorithm — not exact edit distance — but it is very accurate. It was tested on several different regions of the genome, and it returns the exact answer set — when we ask for the k = 5 closest sequences in the database, it returns exactly those 5 — 98% of the time. In the remaining 2% it returns something very close to the right solution: the 4 closest, plus one sequence whose distance is one more than that of the sequence that should have been returned. It is also very fast: we managed to push almost everything to preprocessing, done in the clear, and then we answer a query in less than a second and a half, after something like 11 seconds of preprocessing. So we pushed almost everything to the clear, and then we can answer any query in about a second. This solution also won the iDASH competition.
There were eight submitted solutions, all evaluated on the same databases in the same environment, and ours was the fastest.

Some related work. There are other works on genomics and secure computation. There is also work that solves the same problem but focuses on other regions with much lower diversity, which means the edit distance is almost the Hamming distance — most of the mutations are substitutions rather than insertions and deletions — and that becomes a much easier problem. Those works can therefore handle much larger regions, so the results are somewhat incomparable. There are surveys on genomics and crypto. There are also security implications of returning an approximation rather than the exact result, which I'm not going to talk about in this talk. And there are several concurrent works by the other competitors in the iDASH competition.

Now, our protocol. The observation is the following. If we take the two strings, break them into small blocks, compute the block-wise edit distances, and sum up the results, everything becomes much faster: with blocks of size b we have n/b blocks, each costing about b² work, so the total cost is roughly n·b — if b is something like 5, it is almost linear. However, this has very bad approximation. We tried it with small blocks and it doesn't work well; increasing the block size doesn't help either. Why is that? It's not hard to see: sometimes there is a small deletion early on, and because of it everything afterwards is shifted, so that single deletion gets re-counted again and again, in every subsequent block.
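The failure mode described above is easy to reproduce. Here is a toy sketch (sequences and function names are my own illustration): a single deletion at the front of one string makes every later fixed-size block misaligned, so the block-wise sum pays for that one deletion over and over:

```python
def edit_distance(s, t):
    """Plain Levenshtein distance (quadratic DP; blocks here are tiny)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (a != b)))  # substitution / match
        prev = cur
    return prev[-1]

def blockwise_approx(s, t, b):
    """Naive approximation: cut both strings into fixed-size blocks of
    length b and sum the per-block edit distances (cost ~ n*b)."""
    return sum(edit_distance(s[i:i + b], t[i:i + b])
               for i in range(0, max(len(s), len(t)), b))

s = "ACGTACGTACGT"
t = s[1:]  # the same sequence with one deletion at the very front
# The true distance is 1, but every shifted block re-pays for it,
# so blockwise_approx(s, t, 4) grossly overestimates.
```

This overcounting is exactly why the blocks must be cut at positions that respect the insertions and deletions, which is what the reference-genome alignment below achieves.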
If we somehow knew that there was a deletion here, we could choose this block to be of size 3 instead of 4, and the approximation would have been much better — in this case it even gives the exact result. So we need to know where to break our sequence into blocks. In particular, this means the blocks are not all going to be the same size: some blocks are going to be smaller, some a little larger. The question is how to break into blocks, and this is exactly the hard part, because this is edit distance — we don't know where the insertions and deletions are.

The way we break into blocks is by utilizing a reference genome. There is a reference genome out there, publicly available online, maintained by dedicated organizations. The reference genome was assembled from several donors, with the aim of using a single preferred tiling path to produce a single consensus representation of the genome. It's out there, and we're just going to use it. What we do is run a full edit-distance computation between each of the sequences in the database and the reference genome. Now we have the actual alignment between each sequence and the reference. We break the reference genome into blocks of exactly the same size, and then, using the alignment against the reference, we break each sequence into the corresponding blocks. The hope is that if we align one sequence according to the reference genome and another sequence according to the same reference, then when we come to do the block-wise approximation, it is going to be much more accurate.

We tested it, and it was amazingly accurate — very exciting. At that point I thought, OK, problem solved: we take our database, break all the sequences into blocks, and then we have a lot of small edit distances.
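The reference-guided blocking can be sketched as follows (a toy illustration under my own naming; the real pipeline runs on genome-scale sequences against the published human reference): run the edit-distance DP against the reference, trace back one optimal alignment, and cut the sequence wherever the alignment crosses a reference block boundary.

```python
def align_blocks(seq, ref, b):
    """Cut `seq` into blocks matching fixed-size blocks of `ref`:
    compute the edit-distance DP table, trace back one optimal
    alignment, and cut `seq` at every reference boundary (multiples
    of b). Resulting blocks of `seq` may be shorter or longer than b."""
    n, m = len(seq), len(ref)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = seq[i - 1] != ref[j - 1]
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + cost)
    # Trace back one optimal alignment path from (n, m) to (0, 0),
    # preferring diagonal (match/substitution) moves.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i, j))
        if (i > 0 and j > 0 and
                D[i][j] == D[i - 1][j - 1] + (seq[i - 1] != ref[j - 1])):
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    path.reverse()
    # For each reference boundary j = b, 2b, ..., take the first seq
    # position i at which the alignment reaches that boundary.
    first_i = {}
    for i, j in path:
        first_i.setdefault(j, i)
    cuts = [0] + [first_i[j] for j in range(b, m, b)] + [n]
    return [seq[cuts[k]:cuts[k + 1]] for k in range(len(cuts) - 1)]
```

With the earlier example — a sequence equal to the reference minus its first character — the first block simply comes out one character short, and the later blocks line up again, which is exactly the behavior described above.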
We would parallelize everything, and that's it. However, before we started to implement that, we took another look at the actual values we were getting, and it turns out the genomic distribution is very special — there are very interesting things happening here. After we break the sequences into blocks, if we look at a specific block position and ask what values actually occur in that block, it turns out there are very few distinct values in each one. Even though the database has 500 sequences, the number of distinct values in each block position is something like 10 — most of the time it's just one, and it varies between 1 and 10. This is because genome sequences don't have that many mutations, so in a database of size 500 you end up with very few distinct block values. Moreover, the corresponding block of the query is almost always one of these values as well.

What this means is that at query time we don't have to compute any edit distance at all. All we have to do is simple comparisons: we compare the blocks of the query to the few values in the database. We can do much more in the preprocessing — the preprocessing becomes heavier — but at the end of the day we reduce edit distance to just comparisons, which are very efficient in secure computation.

Here is how the server's preprocessing works. It breaks each of the sequences into blocks, then looks at the values that occur in each block position, and for each possible value it computes a large vector, the size of the database: if the query's value in this block equals this value, what is the contribution of this block to the edit distance from each database sequence?
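In the clear, the server-side preprocessing just described can be sketched like this (a toy illustration; `db_blocks` and the function names are my own, not the paper's code):

```python
def edit_distance(s, t):
    """Plain Levenshtein DP (blocks are tiny, so quadratic is fine)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def server_preprocess(db_blocks):
    """db_blocks[u][j] = block u of database sequence j, already cut
    against the reference genome. For each block position u, collect
    the few distinct values that actually occur there (typically 1-10,
    not 500), and for each such value v precompute the vector
        delta[v][j] = edit_distance(v, db_blocks[u][j]),
    i.e. the contribution of block u to the distance from sequence j
    if the query's block u equals v. All of this runs in the clear,
    before any query arrives."""
    tables = []
    for column in db_blocks:
        values = sorted(set(column))          # the few distinct values
        tables.append({v: [edit_distance(v, x) for x in column]
                       for v in values})
    return tables
```

The heavy edit-distance work all lands here, in the preprocessing, where it can be done without any cryptography.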
It computes this for each of the possible values, and for each of the block positions; in each block position the set of possible values is different. Then all that is left to do at query time is: take the query, break it into blocks in the same way — by running edit distance against the reference genome — and compare each block of the query to the corresponding set of values in the database. That tells us which column to select, i.e., the contribution of that block to the edit distance, and we just sum over all the blocks. This gives the approximate distance between the query and every patient in the database. More formally, we compute bits that tell us whether the query's value in a block equals each of the possible values in the database; the approximation is then obtained by multiplying each bit with the corresponding vector and summing over all the possible values within each block and over all blocks.

The secure protocol for this is almost nothing. We break the query into blocks, and then we compute the equality bits using Yao's garbled circuit. We have another trick, based on oblivious transfer, to compute the products of the bits x_{i,u} with the vectors δ_{i,u} — this can be done very fast using OT extensions. Then, using only local computation, the parties obtain a sharing of the approximate-distance vector. And in order to find the five closest or the three closest, we run another Yao garbled circuit that selects the minimal values. Note that all the cryptography we have here is Yao's garbled circuit on very small circuits, plus some oblivious transfers. So, I want to say just a few words about accuracy and performance.
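Stripped of the cryptography, the online phase above reduces to equality bits and a weighted sum; in the actual protocol the bits are computed under Yao's garbled circuit, the bit-vector products via OT extension, and the k minima by a second small garbled circuit. A sketch in the clear (variable names are my own):

```python
def answer_query(query_blocks, tables, k):
    """Approximate distance to every database sequence:
        approx[j] = sum over blocks u and candidate values v of
                    x[u][v] * delta[u][v][j],
    where x[u][v] = 1 iff the query's block u equals candidate v,
    and delta comes from the server's preprocessing. A query block
    matching no candidate contributes 0 (it is simply ignored, as
    discussed in the Q&A). Returns the indices of the k smallest
    approximate distances."""
    n_db = len(next(iter(tables[0].values())))
    approx = [0] * n_db
    for q, delta in zip(query_blocks, tables):
        for v, vec in delta.items():
            if q == v:  # the equality bit x[u][v]
                approx = [a + c for a, c in zip(approx, vec)]
    return sorted(range(n_db), key=lambda j: approx[j])[:k]
```

Every step here is either an equality test on a short string or an addition, which is why the secure version runs in around a second rather than hours.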
So we tested this on various databases of different sizes and different genes. It was very accurate: it returns the exact k-set 98% of the time. Here are a few numbers for different genes — the size of the database, the length of the samples, and the query time. I think what is most impressive is that we took a computation that was supposed to be 25 billion gates and managed to come up with something so small. It is an approximation, not exact, but in most cases hopefully it is good enough for the application.

Some conclusions. We demonstrated in this talk that MPC — secure computation — can achieve very high performance, and it may be that these tricks are also possible in other problems you face. My takeaway message is to encourage you to consider using MPC in places where it initially looks too expensive. A few acknowledgments to the people who helped with the implementations. Thank you very much.

[Session chair] Okay, questions for Gilad? Yes.

I agree — the question is exactly what we are looking for. In many cases what you want is really a Boolean function: even though I retrieve the k sequences in the database closest to my genome, what I want to see is whether the treatment those patients got was successful or not. So even if we asked for the five closest and returned the four closest plus the sixth instead of the fifth, that sixth one most of the time happens to have the same condition anyway. But we don't have real data to actually test that further. Other questions?

[Audience] What happens if the query block isn't one of the blocks you saw?

That's a good question. In the protocol as described, all the bits are zero, so we just ignore that block.
It just happens that the accuracy is not degraded by doing that. If you wanted to account for such a block, you would need a much heavier circuit, because you would also need to hide in which block it happened. I also want to say that this can be proven theoretically: if you have a distribution supported on, say, 10 values, and you draw another sample from the same distribution, you can prove that with very high probability it is also one of those values.

[Audience] I'm just worried about a weird outlier — that's probably exactly the relevant case, and ignoring it is probably the opposite of what you want to do.

Yeah, the probability is probably not that small for 500. You would need to increase the database size, but it's not supposed to be a problem.

[Audience] And what is the size of each block, and how many blocks do you have?

We choose small blocks — like three, five, or eight. We didn't see a difference in accuracy between these choices. If you have larger blocks, it's faster, because you have fewer blocks overall.

[Session chair] Right. Okay, any other questions? Good luck. Okay, let's thank the speaker again.