So, the next talk is Property-Preserving Hash Functions for Hamming Distance from Standard Assumptions by Nils Fleischhacker, Kasper Green Larsen, and Mark Simkin, and Nils is here to give the talk.

Yeah, thank you. So, the first question to ask is obviously: what are property-preserving hash functions anyway? The concept was first introduced by Boyle, LaVigne, and Vaikuntanathan in 2019, and it's most easily understood as a generalization of collision-resistant hashing. So, if we have a hash function H, it takes as input a key and a message M, and yes, theoreticians think that hash functions get keys, and it outputs a hash value. It's collision-resistant if for any polynomial-time attacker that outputs two messages M and M prime, the probability that the two messages are different but their hashes are the same is negligible. So we're saying that for an efficient attacker, finding a collision is not possible.

Now, another way to think about a collision-resistant hash function is as a way of checking equality of two strings given only their hash values. If I have two very long strings that I might want to compare at some later point in time, I do not have to store the strings themselves; I can store only their short hashes, and at a later point in time I can check whether they're the same or not, because only with negligible probability will this evaluation of the equality predicate be incorrect.

Property-preserving hashing is basically the generalization of that, where we have some predicate P, not necessarily equality, and an evaluation algorithm that allows us, given only the hash values, to evaluate this predicate P. An attacker should not be able to output two messages such that this evaluation is incorrect.

So, what kind of predicates might be interesting? Well, the most natural extension of equality might be some distance metric, and what we are looking at is Hamming distance. The Hamming distance predicate has some threshold T, and given two inputs, it essentially checks: is the Hamming distance greater than or equal to this threshold T, or is it smaller than T? We want to construct property-preserving hashing for this kind of predicate.

Now, we, and by that I mean the authors of this paper, didn't really know how to construct this directly, so we used an observation from a previous paper, also by Mark and me, from Eurocrypt 2021, which gives a connection between Hamming distance and the symmetric set difference of sets. Basically, if we have two strings, X and X prime, then for each bit position there are two possible elements that might go into a set, one for each bit value, and we put in the one matching the string's bit at that position. If we define the sets like this, then the Hamming distance predicate evaluating to one on the two strings is equivalent to the size of the symmetric set difference of the two sets being greater than or equal to two T. If you don't know what the symmetric set difference is, it's the union of the two sets minus their intersection, so basically the set of all elements that are in exactly one of the two sets.
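To make this reduction concrete, here is a minimal Python sketch of the string-to-set encoding just described; the names are illustrative and not the paper's notation:

```python
# A minimal sketch of the string-to-set encoding from the talk: encode each
# bit position i of a string x as the pair (i, x[i]). Two strings at Hamming
# distance d then yield sets whose symmetric set difference has size exactly 2d.

def to_set(x: str) -> set:
    """Encode a bit string as the set {(i, x[i]) for every position i}."""
    return {(i, b) for i, b in enumerate(x)}

def hamming(x: str, y: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

x, y = "10110", "10011"
d = hamming(x, y)                 # 2: the strings differ in positions 2 and 4
sym_diff = to_set(x) ^ to_set(y)  # symmetric set difference of the encodings
assert len(sym_diff) == 2 * d     # the equivalence the construction relies on
```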
And with that we can also define the predicate for symmetric set difference, where we again have a threshold, we now get two sets as inputs, and the predicate outputs one if and only if the symmetric set difference is greater than or equal to T.

Okay, so how do we construct something for symmetric set difference? To do that, we introduce the notion of robust set encodings. So what is a robust set encoding? Basically, we have a description of a large set S1, and we have an encoding function that encodes it in a compressing way, so the encoding of this set is smaller than the plain description of S1. And if we have two of these, what a robust set encoding gives us is that there exists some magic functionality that, given two of these compressed representations, produces a compressed representation of the symmetric set difference of the two sets. The trick is that we might be able to actually decode this symmetric set difference.

Now, in general, this should not be possible, because as I said, the encoding function is actually compressing, so the encoding of S1 should be smaller than S1 itself, because otherwise it's not a useful hash function, right? If you have a hash function that's not compressing, it's not very useful. So this is actually compressing, which means that in general it should not be decodable, especially because the symmetric set difference of two sets can actually be larger than each individual set. So if it's already compressing for the individual sets S1 and S2, we should not, in general, be able to decode the symmetric set difference.

But the good thing is that we don't necessarily have to be able to do that. We only need to be able to decode in the case where the symmetric set difference is small. If the symmetric set difference is large, this thing can output an error, and we're fine with that. So what we want from this thing is the following: if the symmetric set difference is large and it doesn't manage to decode, then it should output an error. If it outputs a set, then this should always be the correct symmetric set difference. And if the difference is small, then it should always output a set. This means that if the symmetric set difference is small, then with overwhelming probability it will give us the correct symmetric set difference.

And then we can easily build a property-preserving hash function from that. We use the encoding function simply as our hash function, and to evaluate, we apply the magic functionality, then we decode. If decoding fails, we say, okay, the symmetric set difference was large. If decoding does not fail, we get a set, we can simply count how many elements are in it, and we can check whether that is smaller than two T or not.
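As a runnable illustration of this wiring (not the paper's construction), here is a sketch where a trivial, non-compressing identity encoding stands in for a real robust set encoding; only the encode / diff / decode interface and the evaluation logic are the point:

```python
# A sketch of how any robust set encoding plugs into a property-preserving
# hash for symmetric set difference. The identity "encoding" below is a
# stand-in: it does not compress, so it is useless as a hash, but it lets
# the evaluation logic run end to end.

from typing import Optional

def encode(s: set) -> frozenset:
    return frozenset(s)                    # a real scheme compresses here

def diff(e1: frozenset, e2: frozenset) -> frozenset:
    return e1 ^ e2                         # the "magic functionality"

def decode(e: frozenset, t: int) -> Optional[set]:
    return set(e) if len(e) < t else None  # error when the difference is large

def eval_predicate(t: int, h1: frozenset, h2: frozenset) -> bool:
    """Evaluate |S1 symmetric-difference S2| >= t from the hashes alone."""
    d = decode(diff(h1, h2), t)
    return True if d is None else len(d) >= t

h1, h2 = encode({1, 2, 3}), encode({1, 2, 4})
print(eval_predicate(2, h1, h2))           # True: the difference has size 2
```

Combined with the string-to-set encoding above, hashing a bit string x would be encode(to_set(x)), and the Hamming predicate with threshold T becomes eval_predicate with threshold two T, matching the reduction from the talk.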
So the big question now is, of course, how do we construct such a robust set encoding? For this, we use invertible Bloom lookup tables. Now, what are invertible Bloom lookup tables? They were first introduced by Goodrich and Mitzenmacher in 2011, and they are basically a way to encode sets such that, if the set is small, you can recover it. How they work is that we have a set, S1 here, and this set is encoded in, for now, basically two matrices.

So we initialize two matrices with all zeros. The size of these matrices depends on statistical parameters, essentially on the threshold that we need for our predicate; if your threshold is larger, you need larger matrices, but the exact number of rows and columns is irrelevant for understanding what's going on here. For each row of these matrices, we choose a 4T-wise independent hash function, and we map each element of the set into the matrices: in each row, the row's hash function gives us one index where we should put this element. In the upper matrix, which we call the count matrix, we simply count how many elements were mapped to this specific cell. In the lower matrix, the value matrix, we simply add up all the elements that get mapped there. So we first map the first element, then the second one to some different positions, and then the third one, and we end up with these two matrices.

Now, if we have some one entries in the count matrix, then the corresponding entry in the value matrix actually tells us an element of the set. This means that as long as we have one entries, we can actually decode, because we can remove this element from the two matrices: now that we know what the value is, we can put it through the hash functions, we know where it belongs, so we can subtract one in the count matrix and subtract the value in the value matrix. Of course, decoding the thing we just encoded is not really helpful, and in general, for a large set, this will not be decodable, because there will probably not be any one entries in the count matrix.

But if we now have two of these, so we encoded two sets, and the two sets differ in, say, one element, which means they correspond to bit strings that differ in one position, then we can actually subtract the encodings. What we get is something with some negative entries and some positive entries, but if we squint a bit and ignore the signs, what we have there is an encoding of the symmetric set difference. We can again decode this, because we can look for minus one and one entries in the count matrix and then look at the corresponding values in the value matrix, and if we correct for the sign, decoding might work, at least in the case where the symmetric set difference is smaller than the original sets. In general, of course, if the symmetric set difference is large, you will not have any one entries and decoding will fail, but as I said, we're fine with that.

There is one issue here, and that is that not every one or minus one entry in this difference matrix will actually correspond to a real value. The reason is that the two count matrices we subtract simply count how many elements were mapped to a particular cell; they are not necessarily the same elements. For example, two elements of the first set might land in the same cell while the element four of the second set lands there as well; then two minus one leaves a one entry, but the corresponding value cell holds a sum of different elements rather than a single real element. So we need a way to verify that a position we identified actually corresponds to a real value.
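The following toy Python sketch (my notation, not the paper's) implements this count/value table with subtraction and peeling. The row hashes are plain SHA-256 with a row salt rather than the 4T-wise independent functions of the construction, and there is no verification yet, so peeling can in principle misfire in exactly the way just described:

```python
# A toy invertible Bloom lookup table in the spirit of Goodrich-Mitzenmacher:
# one count matrix and one value matrix, one hash function per row. The sizes
# are arbitrary here; in the construction they depend on the threshold T.

from typing import Optional
import hashlib

ROWS, COLS = 3, 16

def cell(row: int, elem: int) -> int:
    """Column that elem maps to in this row (stand-in for the row's hash)."""
    digest = hashlib.sha256(f"{row}:{elem}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % COLS

def encode(s: set) -> tuple:
    count = [[0] * COLS for _ in range(ROWS)]
    value = [[0] * COLS for _ in range(ROWS)]
    for e in s:
        for r in range(ROWS):
            count[r][cell(r, e)] += 1
            value[r][cell(r, e)] += e
    return count, value

def subtract(enc1: tuple, enc2: tuple) -> tuple:
    """Entry-wise difference of two encodings."""
    return tuple([[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
                 for m1, m2 in zip(enc1, enc2))

def peel(count: list, value: list) -> Optional[set]:
    """Recover the symmetric difference by repeatedly removing +-1 cells."""
    diff, progress = set(), True
    while progress:
        progress = False
        for r in range(ROWS):
            for c in range(COLS):
                if count[r][c] not in (1, -1):
                    continue
                sign = count[r][c]
                e = sign * value[r][c]        # correct for the sign
                diff.add(e)
                for rr in range(ROWS):        # peel e out of every row
                    count[rr][cell(rr, e)] -= sign
                    value[rr][cell(rr, e)] -= sign * e
                progress = True
    if any(x for row in count for x in row):
        return None                           # decoding failed: difference too large
    return diff

d = subtract(encode({1, 2, 3}), encode({1, 2, 4}))
print(peel(*d))                               # {3, 4}, with high probability
```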
And how we can do that is by adding a third matrix. So we now add a homomorphic collision-resistant hash function h, and in this third matrix, what we do is simply add up the hash values of the elements. We can again subtract two of these hash matrices, and in the third matrix we then end up with the sums of the hash values of what we have in the value matrix above. This means that if we have some one entry in the count matrix, we can identify a potential value, and we can check in the hash matrix that the hash value of this value is actually there. And because it's a homomorphic collision-resistant hash function, this is secure: you will not incorrectly identify a value except with negligible probability.

One interesting thing here is the following. The universe that our sets are defined over is relatively small: it has only twice as many elements as our bit string has positions, so for a fixed input length there are relatively few potential values. This means that when we look at the hash values, we can actually reconstruct the one entries of the count matrix, because we can check which cells contain valid hash values, and then we know that there should have been a one entry in the count matrix. Which means we don't actually need the count matrix at all. And also, because the universe is so small, we can actually invert the hash function simply by exhaustive search: we can search over all possible values and identify the correct one. Which means we also don't actually need the value matrix. So all we need is the hash matrix.

So this is now our encoding: we take a set and compute this hash matrix. Then we can do the subtraction and hopefully reconstruct the symmetric set difference from that, and that works as long as the symmetric set difference is small enough. From that, again, we can easily construct a property-preserving hash function, simply by using the encoding function as our hash function. During the evaluation, you do the subtraction and then the decoding, and you check whether it fails. Decoding fails exactly when no entry corresponds to an actual hash value, and that only happens if the symmetric set difference is too large, or otherwise with negligible probability.

Okay, so how do we instantiate this homomorphic collision-resistant hash function? We can instantiate it from Ajtai's hash function, and then we get a robust set encoding from the standard SIS assumption. This has an output length of basically T times the security parameter squared times the logarithm of the input length, which for large input lengths is good compression; for short inputs it will not be compressing, because we square the security parameter here, so the input needs to be relatively long for this to actually compress. And from that, as I said, we get the property-preserving hash function simply by setting the hash function to be the encoding and using the decoding for the evaluation, and it of course has the same output length as the robust set encoding.
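Here is a toy end-to-end sketch of this final, hash-matrix-only encoding. It substitutes a plain SHA-256 value per element for the homomorphic, SIS-based hash of the paper (which suffices in this toy, because the matrix cells are ordinary integer sums and encodings are subtracted entrywise), and the exhaustive search over the small universe replaces both the count and the value matrix, as just explained:

```python
# A toy version of the hash-matrix-only robust set encoding. Each cell stores
# the sum of hash values of the elements mapped there; decoding peels cells
# whose content equals +-h(e) for some e in the (small) universe. h below is
# an insecure stand-in for a homomorphic CRHF such as Ajtai's.

from typing import Optional
import hashlib

ROWS, COLS = 3, 16
P = 2**31 - 1                             # toy modulus (not a secure choice)

def cell(row: int, elem: int) -> int:
    digest = hashlib.sha256(f"{row}:{elem}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % COLS

def h(e: int) -> int:
    """Toy random-looking per-element hash value."""
    return int.from_bytes(hashlib.sha256(f"h:{e}".encode()).digest()[:8], "big") % P

def encode(s: set) -> list:
    mat = [[0] * COLS for _ in range(ROWS)]
    for e in s:
        for r in range(ROWS):
            c = cell(r, e)
            mat[r][c] = (mat[r][c] + h(e)) % P
    return mat

def match(v: int, universe) -> Optional[tuple]:
    """Exhaustive search: is v equal to +-h(e) for a single universe element?"""
    for e in universe:
        if v == h(e):
            return e, 1
        if v == (P - h(e)) % P:
            return e, -1
    return None

def peel(mat: list, universe) -> Optional[set]:
    """Decode a small symmetric difference from a subtracted hash matrix."""
    mat = [row[:] for row in mat]
    diff, progress = set(), True
    while progress:
        progress = False
        for r in range(ROWS):
            for c in range(COLS):
                hit = mat[r][c] and match(mat[r][c], universe)
                if not hit:
                    continue
                e, sign = hit
                diff.add(e)
                for rr in range(ROWS):    # peel e out of every row
                    cc = cell(rr, e)
                    mat[rr][cc] = (mat[rr][cc] - sign * h(e)) % P
                progress = True
    if any(x for row in mat for x in row):
        return None                       # symmetric difference too large
    return diff

m1, m2 = encode({1, 2, 3}), encode({1, 2, 4})
d = [[(a - b) % P for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
print(peel(d, universe=range(1, 100)))    # expected: {3, 4}
```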
Now, how does this compare to other constructions? As I said, the original paper that introduced property-preserving hash functions was by Boyle, LaVigne, and Vaikuntanathan in 2019, and they did not actually give a construction for exact Hamming distance. What they gave is a construction for the gap Hamming distance predicate, which is defined very similarly to the Hamming distance predicate, but it only checks whether the Hamming distance is very small or very large, and in between you have a gap where any evaluation is acceptable. So they cannot do the exact predicate. For that, they have an output length of basically c times the input length for some constant c smaller than one, so they have a constant compression rate. And it's based on the sparse short vector problem, which is a problem they invented for proving this construction secure; it is related to syndrome decoding. So nobody had previously looked at this problem, and we don't know exactly whether it is a good assumption or not. It's a plausible assumption, but we would like to have a better assumption, maybe.

The first construction for exact Hamming distance was actually also by me and Mark Simkin. It had an output length of roughly T times the security parameter, but it was also based on a new assumption, the q-strong bilinear discrete logarithm assumption, which had likewise not been analyzed before. So assumption-wise, whether that is better is a matter of opinion, but we managed to get exact Hamming distance there.

The new work that I'm presenting right now is again for exact Hamming distance, and it actually has worse compression than before, because we now have T times the security parameter squared times the logarithm of the input length, which is still compressing if you have very long inputs, but not for short ones. But on the upside, this is now based on a standard assumption, the standard short integer solution problem, which is a standard lattice assumption.

Since we published this work, there has been follow-up work by Minematsu, and I'm sorry for butchering that name, which was recently published, and they also built something for exact Hamming distance. The compression is again worse than what we do: it's basically T squared times the security parameter squared times the logarithm of the input length, so because T is now squared, this is worse compression than before. However, on the upside, they were able to do this based only on collision-resistant hash functions, which is a pretty good upside.

Okay, and with that I'm finished, and if you have questions, please do ask.

Any questions? Yes. Maybe an obvious question: what are the applications you have in mind for hash functions of this type?

So in general, a property-preserving hash function for some distance metric might, for example, have applications if you do some kind of fuzzy comparison. You could, for example, consider something like gene sequences: if you want to compare gene sequences but are fine with some errors, you could hash them, store something much smaller than the whole gene sequence, and compare the hashes. However, Hamming distance is probably not a good distance metric for that. I have to admit that for Hamming distance specifically, I don't actually know of a good practical application.

Any other questions? No more questions? Okay, so let's thank the speaker again.