Hi, this is joint work with Ilan Komargodski. Very broadly, this work is about proving limitations on the best possible time-space trade-off attacks for finding short collisions in Merkle-Damgård hashing.

Hash functions are one of the fundamental primitives of cryptography and have many different applications. Certain applications, like password hashing, require a hash function to handle different input lengths. However, it is infeasible to design a different hash function for every length. Hence, iterative hashing is used to construct a variable-input-length hash function from a fixed-input-length primitive. Some popular iterative hashing mechanisms are the Merkle-Damgård construction and the sponge construction. Our focus for this talk is the Merkle-Damgård construction. It is based on a compression function H that takes as input two values in [N] and outputs a value in [N]. The hash of a message M is defined with respect to a hash key A, also known as the salt, as follows. M is padded up to an appropriate length and then broken up into blocks such that each block lies in [N]. The compression function is first evaluated on the salt and the first message block. Next, it is evaluated again on the first output along with the second message block, and so on. The output produced by the final evaluation of the compression function is the hash of M with respect to the salt A. This construction underlies MD5, SHA-1, and SHA-2.

One very fundamental property that we want any hash function to satisfy is collision resistance: given a random salt, it must be hard to find two distinct messages that hash to the same value. We are interested in quantifying the collision resistance of the Merkle-Damgård construction. The most common approach is to model the compression function H as a random oracle. When doing so, one can find collisions using roughly √N queries to the random oracle. This is essentially the birthday attack, and this attack is optimal. But typically H is a public function, and the adversary might be able to do a lot of pre-processing on it. It turns out that this birthday-style attack is no longer the best one when considering adversaries that can pre-process the random oracle.

Pre-processing adversaries were studied in many earlier works, for example in the context of function inversion, collision resistance, et cetera. The auxiliary-input random oracle model, introduced by Unruh, captures the power of pre-processing adversaries against a random oracle. The collision resistance game in this model is formalized as follows. The adversary is modeled as a two-phase algorithm. Its pre-processing phase gets full access to the random oracle and outputs S bits, which are passed on to the online phase, which additionally gets as input a randomly sampled salt A. The online phase can make at most T queries to the random oracle, and the adversary wins if the online phase outputs two distinct messages which have the same hash with respect to the salt A. We refer to such an adversary as an (S, T)-adversary. We parameterize the advantage in terms of S and T and define it to be the maximum probability of any (S, T)-adversary winning this game. Note that by allowing the adversary to compute arbitrary S bits of pre-processing, we make it very powerful; hence, any limitations proved in this model imply a very strong guarantee. Coretti et al. gave a tight characterization of this advantage in terms of S and T.
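Before turning to that characterization, here is a toy Python sketch of the salted Merkle-Damgård construction described above. This is a minimal sketch under simplifying assumptions: the compression function, block size, and padding are placeholders (SHA-256 truncated to a few bytes stands in for the random oracle), not the parameters of any real standard.

```python
import hashlib

N_BITS = 32   # toy output size; real constructions use far larger N
BLOCK = 4     # toy block size in bytes

def h(chaining: int, block: int) -> int:
    """Toy compression function H: [N] x [N] -> [N], modeled here by
    SHA-256 truncated to N_BITS (a stand-in for a random oracle)."""
    data = chaining.to_bytes(8, "big") + block.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:N_BITS // 8], "big")

def merkle_damgard(salt: int, message: bytes) -> int:
    """Salted Merkle-Damgard: pad, split into blocks, iterate H.
    (Length padding is simplified to zero-padding in this sketch.)"""
    if len(message) % BLOCK:
        message += b"\x00" * (BLOCK - len(message) % BLOCK)
    state = salt                                  # chaining value starts at the salt
    for i in range(0, len(message), BLOCK):
        block = int.from_bytes(message[i:i + BLOCK], "big")
        state = h(state, block)                   # H(previous output, next block)
    return state                                  # hash of M with respect to the salt
```

In this language, a collision for a salt `a` is a pair of distinct messages `m1 != m2` with `merkle_damgard(a, m1) == merkle_damgard(a, m2)`, and a B-block collision additionally requires both messages to span at most B blocks.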
Coretti et al. proved an asymptotic upper bound of ST²/N on the advantage of any (S, T)-adversary, and also gave an attack that achieves this advantage. However, their attack finds collisions of length nearly T. Now, say that an attack takes time 2^60; that means the collision it finds is petabytes long. Collisions that long are not really useful for any practical purpose. In addition, short collisions are harder to find than longer ones. This was proven by Akshima et al. Writing the subscript 2 for the advantage of an (S, T)-adversary in finding two-block collisions, that is, collisions where both messages are at most two blocks long, Akshima et al. showed that the advantage for finding two-block collisions is upper bounded asymptotically by ST/N + T²/N. This result implies that two-block collisions are harder to find than arbitrary-length ones. Further, in the same work, they gave an attack for B-block collisions that has advantage roughly STB/N + T²/N, ignoring polylog factors. We refer to this attack as the STB attack. They also put forth a conjecture, called the STB conjecture, which says that the STB attack is asymptotically optimal for every value of B. Before our work, this conjecture was unresolved for any value of B that is at least three but much less than T.

In this work, we show that this conjecture is true for all constant values of B, and also for a certain other range of parameters. Our two results are incomparable. I will cover the proof of the conjecture for all constant values of B in the rest of this talk; I refer you to our paper for the details of the other result. Also, I should note that this second result was recently improved upon in a follow-up work by Akshima, Guo, and Liu.

This is the main theorem that we prove. We show that the maximum advantage of an (S, T)-adversary in finding a B-block collision is asymptotically upper bounded by STB²(log S)^B / N + T²/N. For constant values of B, this is ST/N + T²/N ignoring polylog factors, which proves the STB conjecture for all constant values of B.

The proof of this theorem is based on the multi-instance framework recently introduced in the works of Chung et al. and Akshima et al. This framework was inspired by the beautiful techniques used to prove constructive Chernoff bounds by Impagliazzo and Kabanets. Here is a very brief description of the framework. First, U different salts A1 through AU are sampled uniformly at random. The online phase of the adversary gets some fixed pre-processing independent of the random oracle, say the all-zero string, and the random salt A1. It can make at most T queries to the random oracle, and it needs to output a B-block collision. Next, the online phase of the adversary is successively run on the other sampled salts. The adversary wins the multi-instance game if it successfully finds a B-block collision for every salt. The multi-instance lemma relates the maximum advantage of an (S, T)-adversary in finding a B-block collision to the maximum probability of an adversary winning the multi-instance game, which we refer to as ε. In more detail, the lemma shows that when U is S + log N, the maximum advantage of any (S, T)-adversary in finding a B-block collision is upper bounded by ε^(1/U).
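In symbols, and suppressing constant factors, the multi-instance lemma as used here reads:

```latex
\varepsilon \;:=\; \Pr\big[\text{adversary wins all } U \text{ instances}\big],
\qquad
\mathrm{Adv}_B(S,T) \;\le\; \varepsilon^{1/U}
\quad\text{for } U = S + \log N .
```

So any bound of the form ε ≤ x^U immediately yields Adv_B(S, T) ≤ x, which is exactly how the lemma will be applied next.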
We will prove that ε is in turn upper bounded by a quantity which, for constant values of B and U = S + log N, is of the order of (ST/N + T²/N)^U, ignoring polylog factors. Applying the multi-instance lemma then gives the upper bound that we set out to prove on the maximum advantage of an (S, T)-adversary in finding a B-block collision.

So we need to upper bound the maximum advantage of the adversary against the multi-instance game. We shall do this using a compression argument. The main idea behind a compression argument, formalized in the lemma here, is that it is impossible to compress a random element of a set X to a string shorter than log |X| bits, even relative to a random string. Recall that our goal is to upper bound the advantage of an adversary in winning this multi-instance game. Our strategy will be to come up with encoding and decoding procedures for the random oracle H and the random salts A1 through AU which use this adversary, such that the decoding procedure is correct whenever the adversary wins. Using the compression lemma then leads to an upper bound on the maximum probability of the adversary winning the multi-instance game.

Before starting with the encoding procedure, let us make the following simplifying assumption: when the multi-instance adversary runs on a particular salt, it only queries the random oracle on values prefixed by that salt. Of course, this assumption is false, but we make it initially for ease of exposition and remove it later in the talk. Note that this assumption in particular implies that the queries made by the multi-instance adversary when running on different salts are completely distinct.

The encoding procedure works as follows. The adversary is first run on the salt A1, and the salt is included as part of the encoding. When the adversary makes a query to the random oracle, the answer to that query is added to a list in the encoding. If the adversary makes a query that had not been made before but which has the same answer as some previous query, then the answer to this query is removed from the list of answers in the encoding, and instead the indices of the two queries with the same answer are added to a separate list of tuples in the encoding. This is done only the first time this happens while the adversary is running on the salt A1. Similarly, the adversary is run on the other salts one by one, and the encoding is built up likewise. After running on all the salts, the values of the random oracle at points the adversary did not query are appended to the list of answers, in lexicographic order of inputs. That completes the encoding procedure.

Now let me show you how decoding works. The adversary is first run on the salt A1, which is present in the encoding, and its queries are answered from the list of answers in the encoding. When a query is made whose answer was removed from the list, this is detected by checking whether the query's index appears as the second element of a tuple in the list of tuples in the encoding. If so, it is answered with the answer of the query whose index appears as the first element of the tuple. Similarly, the adversary is run on all the other salts, and finally the unqueried values of H are deduced from the remaining entries of the list of answers, which are in lexicographic order of inputs.
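The bookkeeping in the encoder is easy to mistype, so here is a minimal Python sketch of the per-salt step under the simplifying assumption above. The adversary is abstracted as a callable that issues its queries through the provided function; all names and interfaces here are illustrative, not the paper's formal procedure.

```python
def encode_one_salt(adversary, oracle, salt):
    """Run `adversary` on `salt`, answering its queries from `oracle`
    (a dict mapping (chaining_value, block) pairs to answers) and
    recording the answers in the encoding.  The first time a fresh
    query collides with an earlier one, the later answer is dropped
    and the pair of query indices is recorded instead: replacing
    log N bits with 2 log T bits is where the saving comes from."""
    answers = []        # oracle answers kept verbatim
    tuples = []         # (i, j): queries i and j had the same answer
    first_seen = {}     # answer -> index of the first query producing it
    compressed = False  # compress only the first collision per salt
    num_queries = 0

    def answer_query(q):
        nonlocal compressed, num_queries
        z = oracle[q]
        idx = num_queries
        num_queries += 1
        if not compressed and z in first_seen:
            tuples.append((first_seen[z], idx))
        else:
            answers.append(z)
        first_seen.setdefault(z, idx)
        return z

    adversary(salt, answer_query)
    # decoding re-runs the adversary, answering queries from these lists
    return salt, answers, tuples
```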
Whenever the adversary wins the multi-instance game, we are guaranteed a collision for every salt, and, by our assumption, the adversary makes completely distinct queries for different salts. It follows that we save U(log N - 2 log T) bits in total, since for every salt, instead of remembering a query answer, we include in the encoding a tuple of query indices, which take log T bits each. Using the compression lemma, this gives that the advantage of the adversary against the multi-instance game is at most (T²/N)^U, which is what we want. But remember that we made a false assumption, so we have to work much harder to get rid of it.

To this end, as a first step, we introduce the notion of a query graph. This graph is initially empty. When we start running the adversary on the salt A1, if it makes a random oracle query on (X1, Y1) and the answer is Z1, we add two nodes for X1 and Z1 and a directed edge from X1 to Z1 with label Y1. Next, when it makes a random oracle query on (Z1, Y2) and the answer is Z2, we add a node Z2 and a directed edge from Z1 to Z2 with label Y2, and so on. The graph grows as we run the adversary on all the salts. In particular, note that the adversary, when running on a salt, may make queries that it made earlier while running on a different salt, contrary to our earlier assumption. We can assume without loss of generality that whenever the adversary finds a collision, it must have queried the random oracle at all points required to compute the collision.

So let us first see what a B-block collision looks like in a query graph. A general B-block collision is a subgraph of the query graph that looks something like this. It reminded us of the shape of a mouse, hence we named it the mouse structure, and we refer to the different parts of the subgraph using different body parts. Note that there can be slight variations to the structure: the entire body of the mouse might be a cycle or even a self-loop, the tail might be missing, et cetera. For every salt that the multi-instance adversary is run on, even if it finds multiple B-block collisions, we arbitrarily choose one of the collisions and refer to its subgraph as the mouse structure for that particular salt.

Next, we categorize the types of queries made when the adversary is run on different salts. We say that a query is new if it is being made for the first time. We assume without loss of generality that when running on a particular salt the adversary does not repeat queries, since it can just store the answers; so a query is new if it had not been made when the adversary was run on a prior salt. We mark the new queries in red. Queries that are not new are repeated queries, and we categorize them into two types. Repeated mouse queries are those that are present in the mouse structure of some earlier salt; we mark these in blue. Any other repeated query is called a repeated non-mouse query, and we mark these in green.

Further, we make an assumption here: we assume that before the adversary is run on a salt, no query with that particular salt as prefix has been made. We will soon show why making this assumption is justified. Note that this implies that the first query of every mouse structure is a new query. Based on these definitions, we classify the mouse structures into different categories. These categories are not mutually exclusive, but they are exhaustive.
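Before moving on to the case analysis, here is a minimal Python sketch of the query-graph bookkeeping just described; node and label names are illustrative.

```python
from collections import defaultdict

class QueryGraph:
    """Directed graph built from the adversary's oracle queries:
    a query H(x, y) = z adds nodes x and z and an edge x -> z
    with label y.  The graph keeps growing across all U salts."""
    def __init__(self):
        self.out_edges = defaultdict(list)   # x -> [(y, z), ...]
        self.in_edges = defaultdict(list)    # z -> [(y, x), ...]

    def add_query(self, x, y, z):
        self.out_edges[x].append((y, z))
        self.in_edges[z].append((y, x))

    def in_degree(self, z):
        # nodes of large in-degree are exactly the multi-collisions
        # that the last part of the argument saves from
        return len(self.in_edges[z])

g = QueryGraph()
g.add_query("A1", "Y1", "Z1")   # H(A1, Y1) = Z1
g.add_query("Z1", "Y2", "Z2")   # H(Z1, Y2) = Z2
```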
A mouse structure is classified into the earliest category that it falls in. The first category is when there are colliding new queries in the mouse structure; the queries shown in black may be of any type: new, repeated mouse, or repeated non-mouse. The second category is of mouse structures whose body is a self-loop, and the third is of mouse structures that have a new query whose answer is the input of a repeated mouse query. The two other categories are the mouse structures that have at least one repeated mouse query and those that have no repeated mouse queries at all. Our goal will be to save at least a certain number of bits, which we refer to as Δ, for each of these mouse structure categories. This leads to a saving of at least UΔ bits in total, which suffices, since applying the compression lemma would then give the bound that we want.

Before we describe how we save, let me address the assumption that we made: that when running on a salt, the adversary had not made any query prefixed with that salt while running on the earlier salts. This is reasonable, since otherwise we can save enough on the salt itself. We save at least Δ bits as follows: we omit the salt from the encoding, saving log N bits, and instead write down, as part of the encoding, the index of the query in which the salt appears. Since there are at most UT queries, this costs us at most log(UT) bits, which suffices for our needs. Thus, we can make this assumption, as we have shown that otherwise we already save enough.

We first show how to deal with some of the easier cases. Consider the case when the mouse structure has colliding new queries. Suppose the new queries are q1 and q2, with q2 made after q1. Here we save log N bits by omitting the answer of q2 from the encoding and instead remembering the indices of q1 and q2 among the T queries. Another easy case is when the answer of a new query is the input of a repeated mouse query. Here we save by omitting the answer of the new query and putting the indices of the new query and the repeated mouse query in the encoding. The index of the new query can be encoded in log T bits, while the index of the repeated mouse query needs roughly log(UB) bits, because there are at most roughly UB repeated mouse queries in total. This gives a sufficient amount of savings.

We now see an example of a case that is much harder to deal with. Consider the case when the mouse structure has some repeated mouse query, but none of the inputs of repeated mouse queries is the answer to a new query. In this case, our strategy is to omit the answer of a new query and instead remember the indices of the new query and a repeated mouse query, together with the path back from the input of the repeated mouse query to the answer of the new query. Note that one can find a new query and a repeated mouse query such that this path consists entirely of repeated non-mouse queries which had already been made before running on the current salt.

But how do we encode this path? Let us zoom in. We can remember, for every node in the path, the edge leading back from it; but there might be a large number of incoming edges at every node in the path. We say that the query graph has no large multi-collisions if no node in the query graph has in-degree exceeding log U. In case the query graph has no large multi-collisions, we can encode the path back as follows: the path length takes log U bits, and each edge requires log log U bits.
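As a sanity check on that accounting, here is a sketch, reusing the `QueryGraph` above, of how such a path can be written down: at each node we store only the rank of the incoming edge used, so if every in-degree is at most log U, each step costs about log log U bits on top of the path length.

```python
def encode_path_back(graph, path):
    """Encode a path [v0, v1, ..., vk] by recording, for each step,
    the index of the edge (v_i -> v_{i+1}) among the incoming edges
    of v_{i+1}.  With all in-degrees at most d, each index fits in
    about log2(d) bits; the decoder walks backwards from vk using
    these indices to recover the whole path."""
    indices = []
    for src, dst in zip(path, path[1:]):
        sources = [s for (_label, s) in graph.in_edges[dst]]
        indices.append(sources.index(src))   # < in-degree(dst) <= d
    return indices
```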
This strategy turns out to give us enough savings when there are no large multi-collisions. But what if there are large multi-collisions? In this case, our key idea is to save from the large multi-collision itself. If a node in the query graph has in-degree m, we say that it is an m-multi-collision. Our strategy to save from an m-multi-collision is to remember only the answer of the first of the m queries and to encode the indices of the remaining queries as a set. We save bits by omitting the answers of m - 1 queries, but incur a loss to encode the set of indices. It turns out that when m is at least log U, we save more than Δ, which suffices for us. Of course, the full formal proof needs to handle several subtleties, and I refer you to our paper for all the details.

In conclusion, we prove the STB conjecture for all constant values of B, and for some other parameter ranges, by characterizing the structure of collisions in the Merkle-Damgård construction. In a follow-up work by Akshima et al., one of our results was improved, resulting in a proof of the STB conjecture when ST² is at most N. Also, in a different follow-up work by us along with Cody Freitag, the similar question of characterizing the hardness of short collisions in sponge hashing was studied. The main open problem that stems from this work is proving the STB conjecture, or coming up with better attacks, for the regime ST² greater than N. The full version of our paper is on ePrint. Thank you.