Hello, everybody. The next talk is from Claude Warren. His talk is about indexing encrypted data using Bloom filters. Give him a warm welcome. Thank you. OK, first up, there's going to be a lot of information on these slides. Don't try to read it all at once. The slides are available for download on the FOSDEM website and on ResearchGate, so you can grab them there. So we're going to talk about searching encrypted data with Bloom filters, and we'll go through this process: the goals, a short introduction to Bloom filters, an example of doing this, and then a check on whether we met the goals. There's also some additional information and some resources at the end of the slides. So, first off, the goals. I want to point out that indexing encrypted data using Bloom filters is not a new idea. It has been around for a while, and there are a number of papers written about it. What has changed is that we now have multi-dimensional Bloom filters, which are a mechanism for indexing Bloom filters, so we can now search large numbers of Bloom filters very quickly. What we're going to do here is identify encrypted documents or records that contain specific data, and we're going to do this without decrypting the documents, and we're going to try not to leak data through the index or through the queries. What we're not going to talk about is key distribution: how you make sure that when you get the encrypted document back you can decrypt it. That's a separate problem that we won't cover today. OK, so the process for doing this. First, the write process. Obviously, we need to start with our document and its properties. We extract those properties and create a Bloom filter, and then we encrypt the document. So we have to start with a non-encrypted document.
We take the properties, encrypt the document, and store the encrypted document in the database. Then we get the record reference back from the database, and we store the Bloom filter and that reference in the multi-dimensional Bloom filter, the Bloom filter index. If you're storing 1,000 filters or fewer, the fastest way to search them is a linear search; there just isn't a solution that goes faster than that. Once you get above 1,000 records, it starts to make sense to use one of the other types of multi-dimensional Bloom filter. Our read process, then, is fairly simple. We take the properties we're looking for, hash them, and create a Bloom filter. We ask the multi-dimensional Bloom filter for all the matches, and that tells us the records that potentially match. We get those back as encrypted records, decrypt them, and then filter out any false positives. And you might ask: why false positives? What's going on there? Well, the answer is in the Bloom filter itself. The Bloom filter is a probabilistic data structure described by Burton Bloom in 1970; there's a reference to the paper there. You create the Bloom filter by hashing your data multiple times. We've got a one-way hash, we run it against our properties multiple times, and each hash turns on a bit in a bit vector. We can merge filters by taking the union of the bits that are turned on, and we can determine membership by taking the intersection and seeing whether we get back the filter we're looking for. Because of this process, it can yield false positives: it will tell you things exist in the filter when they don't. But it will never tell you that something doesn't exist in the filter when it is there. So in our process, when we get to the final step, you have to filter out the false positives. Oops, let's go the other direction. There you go.
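The mechanics just described — hash each property multiple times, turn on bits, merge by union, test membership by intersection — can be sketched in a few lines. This is a minimal Python illustration, not the Apache Commons API mentioned later in the talk; the SHA-256-based double hashing is an assumption made purely for the sketch:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash functions over an m-bit vector."""

    def __init__(self, m: int, k: int):
        self.m, self.k, self.bits = m, k, 0

    def _indexes(self, item: str):
        # Derive k bit indexes from one strong hash via double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        # Turn on a bit in the bit vector for each hash of the item.
        for i in self._indexes(item):
            self.bits |= 1 << i

    def merge(self, other: "BloomFilter"):
        # Merging filters is a union of the turned-on bits.
        self.bits |= other.bits

    def contains(self, other: "BloomFilter") -> bool:
        # Membership: the intersection must give back the filter we seek,
        # i.e. every bit of the query filter is set here.
        return self.bits & other.bits == other.bits
```

Because distinct items can happen to set overlapping bits, `contains` can report false positives, but it can never report a false negative — which is why the read process ends with a filtering step after decryption.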
All right, so how is a Bloom filter defined? It is constrained by four properties, basically: the probability of a false positive, the number of elements that the set represents, the number of bits in the bit vector, and the number of hash functions that you use, that is, how many bits you turn on for each element you put in. Mitzenmacher and Upfal showed that those are all related by the equation on the chart there, and Thomas Hurst has a calculator online, so you can go in and play with the numbers and see what happens as you change them and how they interact. So we can now construct one. Once we've figured out what our properties are going to be and how many objects we're going to put in, we have a number of buffers, and we know how big the bit vector is that we're going to construct; that was the m on the previous page. We go through each of the buffers, and for each of the hashes that we need, we hash the buffer, take the mod, and turn the bit on in the bit vector, and we do that until we're done with the buffers. What you have left is a bit vector that is effectively your Bloom filter. Now, Apache Commons, in version 4.5, which is still a snapshot edition, has Bloom filters included, so you can use that. There are four lines of code here that show how to do this using Apache Commons. You get the hash function; in this case it's a Murmur 128 hash. We have a shape, where we're saying we're going to use that hash, we're going to put 10 items into the filter, and we'll accept a one-in-2-million probability of collision. Then we create a dynamic hasher; in this case I only put one buffer in, but if we had 10 buffers we would have 10 with statements, and then we build the hasher. The final step is to build the Bloom filter itself using the hasher and the shape.
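The relationship among those four properties can be checked directly. A small Python sketch of the standard formulas, m = ⌈-n·ln p / (ln 2)²⌉ and k = round((m/n)·ln 2), which reproduces the shape used later in the demo:

```python
import math

def shape(n: int, p: float) -> tuple[int, int]:
    """Bits m and hash functions k for n items at
    false-positive probability p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

# 10 items at a 1-in-2-million false-positive rate:
print(shape(10, 1 / 2_000_000))  # → (302, 21)
```

Those are exactly the numbers the demo arrives at: a 302-bit filter with 21 hashes per property.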
So in the end here, we end up with this Bloom filter that represents the one object. OK, now there are some issues with this. Interval data doesn't work really well: you cannot do less-than or greater-than comparisons. But you can get around this by taking values like small, medium, and large and converting them to ordinal values; you transform them to integer values. You could put the decimal value in, but then you'd be looking for an exact match on that decimal value. If that works for you, that's great, but otherwise you might want to think about changing that data a bit. The other problem you get is when you have properties that have similar values. The example I have here is automobiles: if you have an interior and an exterior color, and you have a white car with a red interior, well, you might also have a red car with a white interior, and you would get a collision in your data. You can get around that by adding the property name to the front of the value, as shown here with exterior white and interior red, to make the values different. So those are some of the things you have to think about when you're doing the encoding. All right, so I'm going to do a demo of how this actually works, using the GeoNames database. For those of you who don't know what it is, it is a database of basically every place that's got a name on the planet. There are over 11 million unique features in it, and each of those objects has about 20 properties. For the purposes of this demo, I'm simply going to select the feature code, the country code, and the first 10 names for that feature. Those features can have multiple names; some of them have only one, but there's at least one in there that has over 300 names for its location. There are 680 unique feature codes in the database, and there are 252 country codes. And I'm going to take the first 2 million of those records and index them.
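Both workarounds — ordinals for interval data, and property-name prefixes for similar values — amount to a small encoding step applied before hashing. A hypothetical sketch (the property names and the small/medium/large scale are invented for illustration):

```python
# Assumed ordinal scale for an interval-like property.
SIZE_ORDINAL = {"small": 0, "medium": 1, "large": 2}

def encode(prop: str, value: str) -> str:
    """Prefix the property name so identical values on different
    properties hash to different Bloom filter bits."""
    return f"{prop}:{value}"

# A white exterior and a white interior no longer collide:
print(encode("exterior", "white"))  # → exterior:white
print(encode("interior", "white"))  # → interior:white

# Interval data becomes an exact-match ordinal:
print(encode("size", str(SIZE_ORDINAL["medium"])))  # → size:1
```

The encoded strings, rather than the raw values, are what get hashed into the filter, so a query for a white exterior can no longer match a record with a white interior.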
So the demo code is available; it's in the references. We're going to take the first 2 million records. We're going to use the Murmur hash, as noted on the earlier slide. We're expecting to put 10 objects in, because I said we would take those 10 names. Plus the 2 codes, it's actually 12, but most things don't have 10 names, so we'll get by with saying 10. We want a probability of 1 in 2 million. You run that through the calculation, and you get an m of 302. So our Bloom filter is 302 bits wide, each property we put in has to be hashed 21 times, and we get a probability of 1 in just over 2 million. The multi-dimensional Bloom filter library has a hasher that doesn't retain the byte buffers, so once we do the hashing, you no longer have the string that you hashed. Some of the other hashers do retain it, but this one doesn't, so all we're going to have is the hashed values. The demo loads, and it tells us, hopefully, that we've got 2 million records. Oh, good, it worked. So we've loaded 2 million records, and there are 704,899 unique filters, so we do have some collisions. And I can enter, let's see if we can do this, Las Vegas, because that's where I grew up, and PPL, because it is a populated place; that's the code that GeoNames uses for populated place. And it gave me a whole list of them. And if I scroll back, hopefully, if I can do this, come on, come on, attention. OK, well, we've gone up too far. Yeah, well, there are several of them anyway. There we go. So there's one that's in the Bogota time zone, so it's in Colombia. There's another one in Colombia; you see it's got a couple of names there. So this is the GeoNames data. That one's in Cuba. So it found a number of Las Vegases around the world very quickly. And let's see, I was going to try something like 'what' and see if it comes up with anything, and it didn't.
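For 1,000 filters or fewer, the talk recommends a plain linear search over the stored filters, which is easy to sketch end to end. This is a toy Python version, not the demo's actual Java code; the record names and properties are invented, and in the real system the stored references would point to encrypted records:

```python
import hashlib

M, K = 302, 21  # the shape from the talk: 302 bits, 21 hashes

def to_filter(values) -> int:
    """Hash each property value K times into an M-bit vector."""
    bits = 0
    for v in values:
        d = hashlib.sha256(v.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        for i in range(K):
            bits |= 1 << ((h1 + i * h2) % M)
    return bits

# Linear-search index: (filter, reference) pairs.
index = []

def store(ref: str, properties):
    index.append((to_filter(properties), ref))

def search(properties):
    q = to_filter(properties)
    # A record is a candidate when every query bit is set in its filter.
    return [ref for bits, ref in index if bits & q == q]

# Hypothetical GeoNames-like records: feature code, country code, names.
store("rec1", ["PPL", "US", "Las Vegas"])
store("rec2", ["PPL", "CO", "Las Vegas"])
store("rec3", ["MT", "CH", "Matterhorn"])

print(search(["Las Vegas", "PPL"]))  # → ['rec1', 'rec2']
```

Any survivors of the search that don't truly contain the queried properties are the false positives, which get filtered out after the matching records are decrypted.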
For the next piece, I actually have the demo do a query for each of the 2,000 most common words in English, and it does find a number of collisions. So actually, there was one there: apparently, if you were to send in just the word 'saw' as your key, it would tell you that this place, Zinjan, I suppose, matches. I can't get this to scroll down. Well, there we go, there's a couple of them. So somehow 'saw' hashes to one of the values for this location, actually two locations. And obviously you would filter them out at the end, when you've decrypted the data and you're looking for 'saw': you'd go, well, it's not here, obviously this is not what I want. So that's the demo. Now, did we meet the goals? We were able to identify encrypted documents. Those documents, to be honest, were encrypted; I did the search and then decrypted them to show them on the screen. So we did search the encrypted documents for the data we were looking for, and we did not require decryption of the documents to do the search. Did we leak data through the index? Well, what we've got in the index is a bunch of Bloom filters, which are multiple hashes of the data using one-way hashes. It's very unlikely you would be able to reverse engineer that. You might be able to do some brute-force testing if you knew the structure of the Bloom filters and had some idea what data was in the system; with some a priori knowledge, you could probably come up with some collisions that you could use to figure out what data was in the index. Did we leak data through the query? I think we mostly avoided that. When you do the query, you can tell how many fields are being queried for, but you have the same problem in that you've only got the hashes, and you would have to figure out what data is actually being hashed. And again, you could use brute force to try to get around that.
But it's unlikely that you would be able to get much out of it. And like I said, I have some additional information about where Bloom filters are used and whatnot. I'm not going to go through all of that, because I'm going to run out of time. There's a page on what a multi-dimensional Bloom filter is, and then there are about three pages of references, so you can look up all the things that were cited in the document. So with that, let me ask if there are any questions. Questions? OK. Hey, thanks a lot for the talk. Just one thing: what is the advantage of having it like this, rather than, for example, creating a Bloom filter out of the clear-text data, encrypting it, storing it at the server, and then doing the searching on the client, where you can also use more complex indexes rather than just a Bloom filter? So you're talking about building the Bloom filter on the client, and the question is what's the advantage of doing this over building the filter on the client and storing the filter there? Well, the Bloom filter is just one example. You can have much more complex indexes, have them encrypted and stored on the server, and then on each client just download that very small index and search based on that. You would have to download the index, though. You'd have an encrypted index on the server, right? And then you would download it, decrypt it, and use that. The advantage here is you don't have to download and decrypt; you don't have that overhead. You could basically publish this with a very simple front end on it to do the queries. Any others? Anyone over here? You'll have to hurry now. Hopefully a small question: how does the cardinality of the raw value that you use to create the Bloom filter affect the leakage of data? So if you used three-letter n-grams, would that change how much data you can leak?
Or, with the a priori knowledge you were talking about, if you knew that the Bloom filter data was basically three-letter n-grams, how would that affect leaking data through the query or the actual Bloom filter? Well, if you were just indexing three-letter n-grams, your filters would look different; first off, they'd be a different size than the ones I had, right? But can you actually prevent leakage of data if you use very small data? If you had a very small data set, it would be much more probable that you could figure out what was in it, obviously. So yes, if you know that there are only three letters, there are only so many combinations, and that would definitely help in being able to crack it. But it's a probability problem. OK, thank you very much for your talk. Thanks, Steve.