Hello and welcome to this talk. My name is Mark Simkin, and I'm going to present the work "Robust Property-Preserving Hash Functions for Hamming Distance and More," which is joint work with Nils Fleischhacker.

A collision-resistant hash function allows us to take a large input and compress it into a very small digest. The nice thing about hash functions is that we can use these digests to compare large data items by just looking at the short digests: given two digests, we assume the data is equal if the digests are equal, and if the digests differ, we conclude that the original data was also different. In cryptography, this guarantee holds in a very strong sense: we sample a random hash function, and then no polynomial-time adversary can find a bad pair of long inputs, that is, two different inputs that produce the same hash output.

But the idea of compressing data has many applications beyond cryptography. Take password hashing, where we have some long password that we would like to hash into a shorter digest. Here, a regular hash function will simply say that the password the user typed differs from the password stored in the database, even if only a single character is mistyped. In certain settings, one mistyped character may be acceptable, and we would still like to draw some conclusion about the password, namely that it is not identical but close to the stored one in some distance measure. Or take machine learning, where we usually have large data inputs represented as feature vectors, and we would like to see whether two images represent the same thing, say, an object in a class such as cats, planes, or cars. Each image corresponds to a long feature vector, and the vectors may not be identical, but we would like to know whether they are similar. Another example is DNA sequences: again, we have very long sequences and would like to check whether two of them are similar, with similarity defined with respect to some norm, metric, or other notion.

All kinds of algorithms for compressing large data items into short digests, and then making statements about the original data by looking only at the digests, have been studied in the algorithms and data structures literature. But they are commonly studied in a benign environment, where the data items are fixed and the hash function is sampled with uniformly random coins; if the hash function is sampled independently of those inputs, then certain correctness guarantees hold. In cryptography, we would like something stronger, namely that the inputs may be chosen adaptively: the hash function is sampled first, then the inputs are chosen, and we would still like correctness guarantees for the statements we make based on the short digests.

This idea of generalizing hash functions to check for more than just equality was recently introduced by Boyle, LaVigne, and Vaikuntanathan, who defined the primitive of robust property-preserving hash functions. A robust property-preserving hash function takes a long input and compresses it into a short digest.
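As a quick illustration of the equality use case, here is a minimal Python sketch (my own addition, not part of the talk) using SHA-256: equal digests let us treat the data as equal, while a single changed byte yields a completely different digest, telling us nothing about how close the inputs are.

```python
# Minimal illustration (my own addition) of comparing data via digests.
import hashlib

def digest(data: bytes) -> bytes:
    """Compress an arbitrarily long input into a 32-byte digest."""
    return hashlib.sha256(data).digest()

a = b"the quick brown fox " * 1000
b = b"the quick brown fox " * 1000
c = a[:-1] + b"X"   # same length, a single byte changed

assert digest(a) == digest(b)   # equal inputs -> equal digests
assert digest(a) != digest(c)   # one changed byte -> unrelated digest,
                                # so we learn nothing about closeness
```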
But now, when we look at the digests, we don't just want to know whether the original data was the same; we want to evaluate an arbitrary predicate with a binary output. Indicated here by the blue box, we would like to evaluate some predicate on the original data and a corresponding predicate on the hash values, and we want a guarantee that evaluating the predicate on the hash values gives the same answer as evaluating it on the original data. As a special case, if this predicate is equality, then this is just the definition of a collision-resistant hash function. But the predicate can also be something else, like Euclidean distance or edit distance, in which case we get something that we cannot trivially obtain from collision-resistant hash functions alone.

There are several metrics we care about when talking about robust property-preserving hash functions. The first is the compression rate: given a long input, we would like to compress it as much as possible, and the compression rate is simply the output length divided by the input length. The second is the class of predicates: what kinds of predicates can we evaluate on the digests, and what statements can we make about pairs of inputs based on them? The last is the cryptographic assumptions: ideally, we would like to construct robust property-preserving hash functions from the most minimal and simple assumptions we know.

So what was already done in the previous work? They showed some interesting impossibilities for certain parameter ranges by relating the compression rate and efficiency of property-preserving hash functions for certain predicates to one-way communication complexity protocols and their corresponding lower bounds. The paper also provided two constructions for the so-called gap Hamming distance predicate, which I will explain in a moment. Both constructions achieve a constant compression rate, meaning the digest is a constant factor smaller than the input being hashed, and both are based on bipartite expander graphs, roughly speaking. The first construction is based on just collision-resistant hash functions and has a big gap, the gap being a parameter in the gap Hamming distance problem; the second has a smaller gap but is based on a new syndrome decoding assumption.

So what is this gap Hamming distance predicate? Consider two bit strings, shown here on the left and on the right. The Hamming distance is defined as the number of positions in which the two bit strings differ. The gap Hamming predicate asks: is the Hamming distance between the two bit strings smaller than some parameter t, or is it larger than n minus t, where n is the bit length of the hash function's input? This predicate can tell us whether the inputs are very similar or very different. But for anything inside the gap, the hash functions of the previous work could not make any statement and did not provide any guarantee; the evaluation on the digests could output an arbitrary value. A stronger predicate to consider is the exact Hamming distance predicate, where we only have a single parameter t and would like to know whether the Hamming distance is smaller or larger than t.
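To make the two predicates concrete, here is a small Python sketch (my own illustration); it evaluates them on the plaintext bit strings, whereas the whole point of a property-preserving hash is to evaluate them from the short digests alone.

```python
# Toy sketch (my own illustration) of the two Hamming distance predicates.
def hamming(x: str, y: str) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def gap_hamming(x: str, y: str, t: int):
    """True if d(x, y) < t, False if d(x, y) > n - t, None inside the gap,
    where the previous constructions give no guarantee at all."""
    n, d = len(x), hamming(x, y)
    if d < t:
        return True
    if d > n - t:
        return False
    return None

def exact_hamming(x: str, y: str, t: int) -> bool:
    """The stronger predicate targeted in this work: is d(x, y) < t?"""
    return hamming(x, y) < t

print(gap_hamming("10010110", "10110110", t=2))  # d = 1 < 2    -> True
print(gap_hamming("10010110", "01101001", t=2))  # d = 8 > 6    -> False
print(gap_hamming("10010110", "10101010", t=2))  # d = 4: gap   -> None
```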
In some sense, the exact Hamming distance predicate is just gap Hamming with a gap of zero. So what do we do in this work? We construct a new robust property-preserving hash function for the exact Hamming distance predicate, based on a new q-type assumption over pairing-friendly groups, and we show that the compression rate, that is, the size of the hash output of our construction, is close to optimal.

How does our construction work? For the first step, assume we have a bit string of length four, here x = 1001, and a fixed, publicly known matrix, where the number of columns equals the length of the bit string and the matrix has two rows. It turns out that constructing something directly for the Hamming distance is quite difficult, so rather than working with bit strings and the Hamming distance, we transform this into a problem about sets. We encode the bit string x into a set X as follows: if the i-th bit is zero, we look at the i-th column and pick the entry from the first row, and if the i-th bit is one, we pick the entry of the i-th column from the second row. In this example, the first bit is one, so we pick the element six; then we have two zero bits, so we pick the elements two and three; and then another one bit, so we pick the element nine. We have now encoded the bit string x into a set X.

Why is this interesting? Assume we have another bit string, y = 1110, and we perform the same encoding with the same matrix to obtain a set Y. We observe that if the two bit strings have, as in this example, a Hamming distance of three, because the last three bit positions differ, then the sets have a symmetric set difference of six. Just to recall, the symmetric set difference consists of the elements that are in the union but not in the intersection of the two sets; it is the elements of X minus Y together with the elements of Y minus X. Whenever the bit strings x and y agree in a position, they include the same element in their set encodings. Whenever they differ in a position, each of them includes a unique element in its set encoding, which adds two elements to the symmetric set difference. More generally, if the two bit strings have Hamming distance t, then the set encodings have a symmetric set difference of size 2t. This is nice because, rather than constructing a property-preserving hash function for the Hamming distance of bit strings directly, we can now focus on constructing a property-preserving hash function that takes set encodings as inputs and evaluates the symmetric set difference predicate, which checks whether the symmetric set difference is larger than 2t or not.

In the next step of our construction, we take those sets and encode them again, but this time into polynomials. More precisely, we encode each set into the roots of a polynomial.
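Before turning to polynomials, here is a toy Python sketch of the set encoding step just described (my own illustration; the matrix entries are arbitrary distinct values, chosen here so that the example matches the elements named in the talk).

```python
# Toy sketch (my own illustration) of the bit-string-to-set encoding.
# The matrix values are hypothetical; the construction only requires them
# to be fixed, public, and pairwise distinct.
M = [[1, 2, 3, 4],   # row used when the bit is 0
     [6, 7, 8, 9]]   # row used when the bit is 1

def encode(bits: str) -> set:
    """For each position i, pick the entry M[bit_i][i]."""
    return {M[int(b)][i] for i, b in enumerate(bits)}

X = encode("1001")   # {6, 2, 3, 9}
Y = encode("1110")   # {6, 7, 8, 4}

t = sum(a != b for a, b in zip("1001", "1110"))  # Hamming distance: 3
assert len(X ^ Y) == 2 * t   # symmetric set difference has size 2t = 6
```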
Here we take the set X and encode it into a polynomial p(z) such that p(z) evaluates to zero if and only if the input is an element of X, and we do the same for the set Y to obtain a polynomial q(z). Now observe that if we divide the two polynomials, the common set elements cancel out, so to speak, and we obtain a rational function whose numerator is a polynomial whose roots are exactly the elements that are only in X, and whose denominator is a polynomial whose roots are exactly the elements that are in Y but not in X. Together, the roots of the numerator and the denominator are precisely the symmetric set difference of X and Y. And we see that if the symmetric set difference has size 2t, then the total degree, i.e., the degree of the numerator polynomial plus the degree of the denominator polynomial, is 2t.

Why is this helpful? Such a rational function can be interpolated from a number of points that is linear in t, and the high-level intuition behind our construction is as follows. Say you are given some fixed number of points of this rational function, and you try to interpolate from them. If the Hamming distance between x and y is small, then the sets X and Y have many elements in common, so p and q have many roots in common, meaning the resulting rational function has a very small degree; given enough points, the interpolation succeeds. If, however, the Hamming distance between the original bit strings is large, then not much cancels out, the true rational function has a high degree, and if we are not given enough points, the interpolation will likely be incorrect.

Armed with these insights, let us make a first attempt at a construction. We are now given two somewhat longer bit strings, and we define the output of our robust property-preserving hash function as simply the evaluations of the polynomial encoding of the bit string at some publicly known points. So there are fixed points z_1 through z_O(t), and to hash a bit string, we encode it into a set, encode the set into a polynomial, evaluate the polynomial at z_1, z_2, and so on, and output the concatenation of those evaluations as the hash value. Already at this point we can see that if the number of evaluation points is sufficiently small, the output of the hash function is much, much smaller than the original input, since the input could be a very long bit string corresponding to a very high-degree polynomial.

Given the hash values, we would like to figure out whether the original inputs were close or far in Hamming distance. So, given the evaluations p(z_i) and q(z_i), we divide them point-wise and use the resulting points to interpolate a rational function P(z)/Q(z). By the argument above, if the bit strings are similar, this interpolation will succeed, and if the inputs are very different, the interpolation will fail. What remains is to figure out whether the interpolation was successful or not.
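Here is a small Python sketch (my own illustration, over a deliberately tiny prime field; a real instantiation would use a large field) of the polynomial encoding and the cancellation property. It verifies the identity p(z)/q(z) = P(z)/Q(z) at the public points rather than implementing the actual rational-function interpolation.

```python
# Toy sketch (my own illustration) of the polynomial encoding.
# Requires Python 3.8+ for pow(x, -1, P) (modular inverse).
P = 101  # small prime field modulus, for illustration only

def mul_linear(coeffs, r):
    """Multiply a polynomial (coefficients low-degree first) by (z - r)."""
    out = [0] * (len(coeffs) + 1)
    for i, c in enumerate(coeffs):
        out[i] = (out[i] - r * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out

def encode_poly(S):
    """The monic polynomial whose roots are exactly the elements of S."""
    coeffs = [1]
    for a in sorted(S):
        coeffs = mul_linear(coeffs, a)
    return coeffs

def ev(coeffs, z):
    """Evaluate a polynomial at z via Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * z + c) % P
    return acc

X, Y = {6, 2, 3, 9}, {6, 7, 8, 4}                  # set encodings from before
p_, q_ = encode_poly(X), encode_poly(Y)            # p(z), q(z)
num, den = encode_poly(X - Y), encode_poly(Y - X)  # P(z), Q(z)

# Shared roots cancel: p(z)/q(z) == P(z)/Q(z) at every public point.
for z in [11, 12, 13, 14, 15]:
    lhs = ev(p_, z) * pow(ev(q_, z), -1, P) % P
    rhs = ev(num, z) * pow(ev(den, z), -1, P) % P
    assert lhs == rhs
```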
We usually do this with a standard trick: we pick a random point R, evaluate the two original polynomials p and q at R, evaluate the interpolated P and Q at R, and check whether the equation p(R) * Q(R) = q(R) * P(R) holds. If the interpolation was correct, this equation always holds; if it was not, that is, if the two rational functions are not equal, then the equation fails with overwhelming probability over the random choice of R. The problem is that this check only works if R is chosen independently of the polynomials for which we want to execute it. And here we run into a problem: where does R come from? At hashing time we would already like to provide, for example, p(R) or q(R), but at the same time R should somehow remain hidden, because if R is known, the adversary can potentially pick a maliciously chosen p or q such that the check passes even though the two rational functions are not the same.

The idea is to encrypt the point R in a clever way such that we can still perform the check, but the adversary learns nothing about R itself. Concretely, part of the description of the hash function will now be g^R, g^(R^2), and so on, for some random point R. This is useful because, given a polynomial p and those powers of R, we can evaluate g^(p(R)) in the exponent without knowing the value of R (a toy numerical sketch of this trick is given below).

How does this help? The hash value will now consist of the evaluations of the polynomial corresponding to the input bit string, along with an evaluation of that polynomial in the exponent at the secret point R. As before, we divide the evaluations point-wise and interpolate a candidate rational function P(z)/Q(z). We can then obtain the four values g^(p(R)), g^(q(R)), g^(P(R)), and g^(Q(R)): the first two are given as part of the hash values, and the last two can be computed from the interpolated rational function and the powers g^(R^i). Now, instead of checking the equation at the bottom directly, we check the pairing equation e(g^(p(R)), g^(Q(R))) = e(g^(q(R)), g^(P(R))): by the guarantees of the pairing, this pulls p(R), Q(R), q(R), and P(R) into a common exponent, so the pairing equation holds if and only if the original equation p(R) * Q(R) = q(R) * P(R) holds.

Why is this construction secure? Intuitively, the reason is as follows. If the inputs have a small Hamming distance, the interpolation always succeeds: we choose the number of evaluation points that are part of the hash output high enough that we always have enough points whenever the Hamming distance is small, and thus every point R passes the equality check, so nothing bad can happen, so to speak. And we then show that if the inputs have a large Hamming distance and the check still passes, then we can actually compute R given the powers of R, and thus break our corresponding security assumption.

Thank you for your attention. You can find all the details in the ePrint version of this paper, which is available under the following link.
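As promised above, here is a toy Python sketch (my own illustration, with tiny, insecure parameters, and omitting the pairing step, which would need a pairing-friendly group library) of the evaluate-in-the-exponent trick: given the published powers g^(R^i), anyone can compute g^(f(R)) for a known polynomial f without learning R.

```python
# Toy sketch (my own illustration) of evaluating a polynomial at a hidden
# point "in the exponent", in the multiplicative group mod a prime.
import random

p = 2**61 - 1   # prime modulus; exponents are reduced mod p - 1 (Fermat)
g = 3           # fixed base element
R = random.randrange(1, p - 1)   # the secret evaluation point

# Published as part of the hash function's description: g^(R^0), ..., g^(R^d).
d = 4
powers = [pow(g, pow(R, i, p - 1), p) for i in range(d + 1)]

def eval_in_exponent(coeffs, powers):
    """Compute g^(f(R)) = prod_i (g^(R^i))^(c_i) without knowing R."""
    acc = 1
    for c, g_Ri in zip(coeffs, powers):
        acc = acc * pow(g_Ri, c, p) % p
    return acc

f = [5, 0, 7, 1, 2]   # f(z) = 5 + 7z^2 + z^3 + 2z^4, coefficients low-first
lhs = eval_in_exponent(f, powers)
f_at_R = sum(c * pow(R, i, p - 1) for i, c in enumerate(f)) % (p - 1)
assert lhs == pow(g, f_at_R, p)   # matches direct evaluation g^(f(R))
```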