Hello, everyone. My name is Eylon Yogev. This is joint work with Moni Naor, and the title is "Bloom Filters in Adversarial Environments." There is a long-standing connection between cryptography and data structures, and I want to show you one example through a data structure called a Bloom filter.

So what is a Bloom filter? There is some large universe U of elements; think of all possible URLs on the Internet. We're interested in some subset S of size n, and the goal is, for any element, to answer: is this element inside the set? It's a very practical data structure, so we want it to be very fast: the query time should be really constant time. Even more importantly, we want the representation, the memory used by this data structure, to be very small. In particular, we want it to be much, much smaller than what you actually need to represent the set S. So how can we do that? Of course we allow errors: this is going to be a probabilistic data structure, and we're going to have some probability of error.

This notion was introduced by Bloom in 1970, and he gave a specific, nice construction, but I'm going to refer to any solution to this problem as a Bloom filter. For simplicity, in this talk let's just think of the whole set as given in advance, and then we're given the queries.

So how do I define a Bloom filter? An (n, ε)-Bloom filter is a pair of algorithms P and Q: P is the preprocessing algorithm and Q is the query algorithm. We have the following definition. We run the preprocessing algorithm on the set S and we get a memory representation M, so M is going to be this representation of the set S. Then we have two requirements. The first one is that for any element x of the set, the query algorithm with this representation M, on input x, just answers yes.
Okay, so we have no mistakes here: this happens with probability one. On the other hand, for any element not in the set, we require that it makes a mistake, so it answers yes, with probability at most ε. The probabilities here are taken over the randomness of the preprocessing algorithm P.

There are several constructions. The memory size I denote by little m. To achieve an (n, ε)-Bloom filter you can use about n log(1/ε) bits, and this is tight: it's both an upper bound and a lower bound.

This data structure has various applications in many areas. It's used in practice in Google Chrome, at Facebook, in networking, in databases, many, many places. I want to describe one common place where it's used: representing the content of a cache. Suppose we have a web proxy that fetches pages from the Internet, and it has some cache. We have a user; the user requests some page, and the web proxy first wants to check if it has this web page in the cache, and if not, it goes to the Internet. Going to the disk is very slow; it's a disk query. So what do we do? We represent the elements in the cache in a Bloom filter that takes a small amount of space in main memory. The web proxy first checks the Bloom filter to see if it has the page in the cache. If the Bloom filter says no, it just goes straight to the Internet, and only if it says yes do we go to the disk. A mistake by the Bloom filter, what's called a false positive, means it will say yes, the web proxy will go to the disk, and then see that it actually doesn't have this page. But this will happen only once in a while, not on every query.

Okay, so that was a Bloom filter, the classic definition you'll find in the literature. Now let's go back to the definition and try to look at it from a cryptographic point of view. So this was, in effect, our security definition.
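
For readers who have not seen one, here is a minimal sketch of the classic k-hash construction, written as the pair (P, Q) from the definition. It is only an illustration: the names `BloomFilter` and `build`, the use of salted SHA-256 as the hash family, and the textbook sizing formulas are assumptions of this sketch and are not taken from the talk. Note that this classic construction uses roughly 1.44 · n · log2(1/ε) bits, a constant factor above the n log(1/ε) bound mentioned above.

```python
import hashlib
import math


class BloomFilter:
    """Classic k-hash Bloom filter over byte strings (illustrative sketch)."""

    def __init__(self, n: int, eps: float):
        n = max(1, n)
        # Textbook sizing: m = n * ln(1/eps) / (ln 2)^2 bits, k = (m/n) * ln 2 hashes.
        self.m = max(1, math.ceil(n * math.log(1 / eps) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by hashing the item with a per-index salt.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # Q: "yes" only if every one of the k positions is set.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


def build(items, eps: float) -> BloomFilter:
    """P: build the representation M from the whole set S, given in advance."""
    items = list(items)
    bf = BloomFilter(len(items), eps)
    for item in items:
        bf.add(item)
    return bf
```

In the proxy example, the check would be `url in bf`: on "no" the proxy goes straight to the Internet, and only on "yes" does it pay for the disk lookup.
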
Okay, the error probability is at most ε. The question is what happens when the user can make many queries and see the responses to his queries. He sees the responses and then he can make more queries, so he makes them adaptively. This could happen in many cases. For example, here, let's say he measures the response time from the server. If he measures the response time, he can tell whether the Bloom filter said yes or no, that is, whether the proxy went to the disk or went straight to the Internet. This is just one example. The question that arises is: what can the adversary do? Can he increase the false positive rate by making a sequence of adaptive queries?

So let's try to formalize this a bit. What I want to do is take the original definition of the Bloom filter and draw it as a game, as we do in cryptography in many cases. How do I draw the original definition of a Bloom filter as a game? I have on one side an adversary and on the other side a challenger. The original definition says that it must hold for any S, so we think of S as adversarially chosen. The adversary chooses S and sends it to the challenger. The challenger preprocesses S and computes M. Then the adversary picks any element x, and the challenger computes y, which is the query with this representation M on x. We say that the adversary wins if y equals 1, so he found a false positive, and x is not in the set S.
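
The one-shot game just described can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the toy one-hash filter with a random salt, the names `preprocess`, `query`, `one_shot_game` and `estimate_win_rate`, and the fixed adversary that always picks S = {0, ..., 49} and x = 1,000,000. The point is only the shape of the game: the adversary commits to S and x, sees nothing back, and wins on a false positive.

```python
import hashlib
import random


def _pos(salt: bytes, x: int, m: int) -> int:
    # Keyed hash of x into [0, m); the salt is the filter's private randomness.
    digest = hashlib.blake2b(x.to_bytes(8, "big"), key=salt, digest_size=8).digest()
    return int.from_bytes(digest, "big") % m


def preprocess(S, rng, m=1024):
    """P: toy one-hash filter with a fresh random salt (hypothetical stand-in)."""
    salt = rng.getrandbits(64).to_bytes(8, "big")
    bits = [0] * m
    for x in S:
        bits[_pos(salt, x, m)] = 1
    return salt, bits


def query(M, x) -> int:
    """Q: answer 1 ("yes") or 0 ("no") from the representation M alone."""
    salt, bits = M
    return bits[_pos(salt, x, len(bits))]


def one_shot_game(choose_set, choose_query, rng) -> bool:
    """Play one round; return True iff the adversary wins."""
    S = set(choose_set())          # adversary picks S
    M = preprocess(S, rng)         # challenger computes M = P(S)
    x = choose_query(S)            # adversary picks x without seeing anything
    y = query(M, x)                # challenger computes y = Q(M, x)
    return y == 1 and x not in S   # win: a false positive


def estimate_win_rate(trials=2000, seed=0) -> float:
    # The probability is over the randomness of P, so P is rerun every trial.
    rng = random.Random(seed)
    wins = sum(one_shot_game(lambda: range(50), lambda S: 10**6, rng)
               for _ in range(trials))
    return wins / trials
```

With 50 elements hashed into 1024 positions, the win probability of this fixed adversary is about 50/1024, so the estimate should land near 0.05.
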
Okay, of course if x is in the set S then y would be 1. So this is the definition of when the adversary wins, and security says that the probability that the adversary wins is at most ε. This is just rewriting the original definition of the Bloom filter as a game.

Now maybe it's easier to see what the problem with this definition is. For starters, all the arrows go one way. It's a bit of a boring game: the adversary doesn't get any information back. So let's try to give him more information, and try to imitate the definitions we know from encryption, where the adversary can play with the system first and only at the end provides a challenge. So let's send y back to the adversary. Now we're sending the response back to the adversary, and we're not going to do it just once; we're going to repeat this whole process many times. So now each x becomes x_i and each y becomes y_i. At the end, when the adversary is done, he sends x*. This is the challenge that he sends, and then we compute y*, which is the query with M on x*. Now we say that the adversary wins if y* equals 1, x* is not in the set S, and of course x* is none of the previous queries that he made, so none of the x_i's. The security requirement stays the same: we say that the probability that he wins is at most ε.

Okay, so now we have a new definition of security for a Bloom filter. Of course, now we can talk about the computational power of the adversary. For a start, if I don't say anything, I assume that the computational power of the adversary is polynomially bounded. So we get a new definition, which we call an adversarial resilient Bloom filter. If we have a data structure that satisfies this definition, then we say that it's an (n, ε)-strong adversarial resilient Bloom filter; strong because we didn't limit the number of queries that he can make. And if we limit the number of
queries to t, then we say that we have an (n, ε, t)-adversarial resilient Bloom filter.

Okay, so what are our contributions? The first one is defining adversarial resilient Bloom filters, which we just did. The second contribution is a general transformation: we take any Bloom filter and make it adversarial resilient. We use PRPs, pseudorandom permutations, and we also give a concrete implementation, which I might talk about later. Then we show that one-way functions are actually necessary; we'll see that later, even for what we call the unsteady representation case. And at the end I'll talk about unbounded adversaries, where we also give a construction.

So let's start with the first theorem: a transformation from any standard Bloom filter to an adversarial resilient Bloom filter. What about the parameters? We start with an (n, ε)-Bloom filter with memory m, and we end up with an (n, ε)-adversarial resilient Bloom filter that uses memory m + λ. What is this λ? Our tool is a pseudorandom permutation, and λ is just the security parameter; actually it's just the additional key for the pseudorandom permutation.

How does the transformation work? It's very simple. Just start with any standard Bloom filter and apply a PRP to the queries before you send them to the Bloom filter. Why is this okay? Well, what you're actually doing is randomizing the adversary's queries. The queries he makes, after the PRP, are indistinguishable from random, so he really doesn't gain any advantage by playing this game and making the queries adaptively. He might as well just choose them at random to begin with, and then the standard Bloom filter security holds. Okay, great. So this is a simple construction.

Now some of you might be a bit concerned. I took a data structure called a Bloom filter that we use every day, and it exists, okay?
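
As a rough sketch of the transformation described above: wrap any standard filter and push every element through a keyed permutation first. Several things here are assumptions of the sketch rather than the speaker's code. The permutation is a 4-round Feistel network with HMAC-SHA256 round functions, used only as a dependency-free stand-in for a real PRP (the talk's implementation uses AES). The names `FeistelPRP`, `TinyBloom` and `ResilientBloom` are hypothetical. The sketch also assumes the set elements go through the same permutation at preprocessing time, which is what keeps members answering yes.

```python
import hashlib
import hmac
import os


class FeistelPRP:
    """Keyed permutation on 2*half_bits-bit integers (stand-in for a real PRP)."""

    def __init__(self, key: bytes, half_bits: int = 32, rounds: int = 4):
        self.key, self.half_bits, self.rounds = key, half_bits, rounds
        self.mask = (1 << half_bits) - 1

    def _round(self, i: int, value: int) -> int:
        # Round function: HMAC-SHA256 of (round index, half block), truncated.
        msg = bytes([i]) + value.to_bytes((self.half_bits + 7) // 8, "big")
        digest = hmac.new(self.key, msg, hashlib.sha256).digest()
        return int.from_bytes(digest, "big") & self.mask

    def permute(self, x: int) -> int:
        left, right = x >> self.half_bits, x & self.mask
        for i in range(self.rounds):
            # Each Feistel round is invertible, so the whole map is a permutation.
            left, right = right, left ^ self._round(i, right)
        return (left << self.half_bits) | right


class TinyBloom:
    """Minimal unkeyed Bloom filter over 64-bit integers (the standard filter)."""

    def __init__(self, m_bits: int, k: int):
        # One byte per bit, purely for readability of the sketch.
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits)

    def _positions(self, x: int):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + x.to_bytes(8, "big")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, x: int) -> None:
        for pos in self._positions(x):
            self.bits[pos] = 1

    def __contains__(self, x: int) -> bool:
        return all(self.bits[pos] for pos in self._positions(x))


class ResilientBloom:
    """Standard filter plus a PRP key: memory m plus the key length."""

    def __init__(self, items, m_bits: int = 4096, k: int = 5, key: bytes = None):
        self.prp = FeistelPRP(key or os.urandom(16))  # the extra lambda bits
        self.inner = TinyBloom(m_bits, k)
        for x in items:
            self.inner.add(self.prp.permute(x))       # store permuted elements

    def __contains__(self, x: int) -> bool:
        return self.prp.permute(x) in self.inner      # permute, then query
```

Without the key, the adversary cannot predict which inner element a query maps to, which is the intuition behind "he might as well choose them at random."
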
We construct it from simple pairwise-independent hash functions and stuff like that. I changed the definition a bit, and we're all okay with that. But now I have a new construction that uses PRPs, this heavy machinery that we don't really know exists. So what's the deal?

Our next theorem shows that actually any non-trivial adversarial resilient Bloom filter must use one-way functions, which are equivalent to PRPs. What do I mean by non-trivial? Well, one way of constructing an adversarial resilient Bloom filter is by just storing the set S precisely. If you store the set S precisely, then you have no errors at all and the adversary really has no advantage. So by non-trivial I mean any Bloom filter that uses even slightly less memory than what you need to hold the set, so that some errors occur somewhere. And the existence of one-way functions is equivalent to the existence of PRFs, and in turn PRPs.

Okay, so let's see for a second how the recipe of this proof works. What do we want to prove? We want to prove the contrapositive: if we don't have one-way functions, we can attack any Bloom filter, so no Bloom filter is adversarial resilient. So we need to construct an attacking algorithm. How do we do it? You give me some Bloom filter, and we do the following. We sample some random x_1 to x_t and query them, and we get results y_1 to y_t. So now I have these x's and their labels y_1 to y_t; these are like samples, and I have the labels.
So now I'm going to try to somehow learn some representation M'. M' is not going to be exactly M, but it's going to be, in some sense, very similar to M. Then I'm going to find some x* that according to M' is a false positive, and this I can do because I have M', so I can simulate; I don't really have to perform these queries. Then we have to prove that since x* is a false positive relative to M', it's also going to be a false positive relative to M.

The first two steps are done using machinery from learning theory. We actually model this problem as a PAC learning problem, where x_1 to x_t are our samples, and really the task is to efficiently find a consistent hypothesis. Once M is fixed, Q_M is just a Boolean function, so I'm really trying to learn this Boolean function, which is unknown. The algorithm needs to run in polynomial time, so I need to find a consistent hypothesis in polynomial time. But remember that one-way functions do not exist, and here I use the fact that I can invert any function to quickly find a consistent hypothesis. The number of queries I need here is going to be O(m/ε). What you can see from this is that, more or less, I need to learn the bits of M: there are m bits of memory, and it takes me about 1/ε queries to learn each bit of the memory. Then we find x* that's a false positive relative to M', and since Q_{M'} and Q_M are going to be very similar as Boolean functions, with very high probability x* is going to be a false positive for the real representation. Okay, so this concludes the overview of the proof of this theorem.

Okay, so you have to use one-way functions in order to construct adversarial resilient Bloom filters. Now we ask what happens if we change the setting a bit, and we want to give Q more power. So what power can we give Q?
First, we'll let Q be randomized. So now the query algorithm can be randomized, and this means that it can, like, add noise to the answers: even though it knows that the answer is no, maybe with some small probability it's going to say yes. Now the task from before, of learning, becomes the task of learning something with errors, which is in general a much harder task.

Let's give it even more power. Why not have Q change the underlying representation after each query? It gets a query, maybe adds noise, and then maybe it rehashes some stuff, maybe chooses new random variables; it can change the underlying representation. Now we're trying to learn something that even changes, so it's even harder. One could hope to incorporate some differentially private algorithm, or stuff like that, to get rid of the need for one-way functions.

I want to give one example of such a Bloom filter that has an unsteady representation. Think of one that has two Bloom filters, Bloom filter one and two. They're both initialized with the same set S, and for the first hundred queries I only use the first one. Then after a hundred queries I throw the first one away and only use the second one. So really, for the first hundred queries,
I learn nothing about the second Bloom filter. So the algorithm from before, if t is known in advance, can never work. This eliminates any attack algorithm that uses a number of queries that is predefined in advance, because I'll just make this Bloom filter, exactly after t queries, throw the first one away and use a totally fresh one.

Nevertheless, our third result shows that the previous theorem, that you have to use one-way functions to construct adversarial resilient Bloom filters, holds even for the unsteady case. So this additional power doesn't really give you more power.

Let's see what happens to this recipe in general. The first two steps before were learning a function, an unknown function, and we modeled this as a PAC learning problem. Here we're not learning a function anymore; we're learning a distribution, because Q is randomized, and moreover the distribution might change after each query that we make, so this is an adaptively changing distribution. We use a result by Naor and Rothblum; they define this framework of adaptively changing distributions and have some results there. We use not exactly their algorithm, we modify it a bit, but essentially their algorithm for learning such distributions under the assumption that one-way functions do not exist. For technical reasons, the number of queries grows from m/ε to m/ε². Then the other part also becomes much harder, since now our guarantee of similarity is only that these two distributions are close in statistical distance. So these steps also become much harder, but essentially this captures the proof for the unsteady case.

What about unbounded adversaries? So now I'm talking about constructions.
Okay. Since any unbounded adversary can of course invert any one-way function, we cannot expect to construct a strong adversarial resilient Bloom filter, but we can still expect to construct something that's resilient to t queries. Here we have a result saying that for any ε and t there exists an (n, ε, t)-resilient Bloom filter. So this is a Bloom filter that is resilient for t queries, and the question is how much memory you need to pay for each query. The memory is going to be n log(1/ε), which is basically what is required for any Bloom filter even with the classical definition, plus t, which is the additional memory we need in order to be resilient for t queries.

Notice that here there's, like, an assumption that the adversary learns one bit, or a constant number of bits, per query. Whereas if you take our theorem from before as a lower bound, which applies in this case since the adversary can invert any function, what we had there was that we learn one bit of memory only every 1/ε queries. So we still have a gap, and this is an open problem.
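
Written out, the gap looks roughly as follows. This is a paraphrase of the spoken argument with constants suppressed, not a statement copied from the paper: the lower bound is read off from the earlier attack, which breaks any filter of memory m using O(m/ε) queries, so surviving t queries requires m to be at least on the order of εt.

```latex
\[
  \underbrace{m = O\!\left(n \log \tfrac{1}{\varepsilon} + t\right)}_{\text{construction: about one bit per query}}
  \qquad \text{versus} \qquad
  \underbrace{m = \Omega\!\left(\varepsilon\, t\right)}_{\text{attack: about } \varepsilon \text{ bits per query}}
\]
```
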
I think it's interesting to close, because when you get into it you really get to the specifics of the problem. The construction for this is very non-trivial, using a lot of interesting hash functions.

The last thing I want to say is about implementation. We gave an excuse for why we use pseudorandom permutations, a theoretical one, since we showed that you have to use one-way functions, or pseudorandom permutations. Nevertheless, we also gave an implementation. This implementation uses AES instructions, which are built into almost all modern CPUs, and the result is that you can have an implementation of a Bloom filter that on the one hand is secure, up to the security of AES, and on the other hand runs really as fast as any other implementation that I could find on the Internet. Okay, thank you.