Hi, everyone. This is a presentation of the paper Time-Memory Analysis of Parallel Collision Search Algorithms, which is joint work with Sorina Ionica and Gilles Dequen. My name is Monika Trimoska, and we are all from the University of Picardie in France.

Let us first introduce what collision search algorithms are. Say we have a random map F from a finite set S to the same finite set S, and let us denote the cardinality of S by N. A collision on this random map is any pair (r, r') of elements in S such that F(r) = F(r').

The classical presentation of collision search is, for instance, Pollard's rho method, depicted in this picture. Starting from an element x0, you perform a random walk using the function F, and since the set S is finite, you are sure that at some point the walk reaches the same element twice. This is what the cycle in the picture depicts, and the element that is reached twice gives a collision. Ideally, F should behave like a random map, and we have a formula for the expected number of steps until a collision is found: by the birthday paradox, it is roughly the square root of the cardinality of the set, more precisely sqrt(pi*N/2).

There are many applications of collision search algorithms. Some applications require only one collision, like the elliptic curve discrete logarithm problem (ECDLP), and some require multiple collisions: for instance, the classical meet-in-the-middle attack on 3DES, the ECDLP in the multi-user setting, and newer applications in isogeny-based cryptography, such as the computational supersingular isogeny problem. In this presentation, we will use the elliptic curve discrete logarithm problem to explain how a collision search algorithm works, and our implementation was also done with this application in mind.

Let us define the problem and set some notation. Let E be an elliptic curve over a finite field K. The ECDLP is: given two points P and Q on this curve, find an integer x such that xP = Q. We recall here the definition of a collision, and we show an example of what the walk function looks like for this application. Starting from a point R on the curve, f maps R to R + P, R + Q, or 2R, according to which of three subsets R belongs to; the set S is divided arbitrarily into these three subsets. This is just an example, and it can be any other function, but you need a function that adds P and Q to the existing point, because its most important property is that, given an input that is a linear combination of P and Q, it must output another linear combination of P and Q, with different coefficients.

How are we going to use this to find the discrete logarithm? We can now define what a collision is in our case: a collision is when we have two different linear combinations, aP + bQ = cP + dQ, of the same point R on the curve. Using this, and since we know that xP = Q, we can easily compute x as x = (a - c)(d - b)^(-1) modulo the order of P.

That was the classical collision search idea. Now we turn to the parallel collision search algorithm, which was proposed by van Oorschot and Wiener. With a single random walk, one way to detect the collision would be to simply store all of the points: when we reach the same point twice, we detect it immediately. But it is not possible to store all of the points, because that is simply too much memory.
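To make the walk and the recovery of x concrete, here is a minimal, self-contained toy sketch in C. It is not the paper's code: it replaces the elliptic curve by the additive group Z_n (with P = 1 and Q = x*P), and all parameter choices here are ours, but it tracks the coefficients (a, b) of the linear combination exactly as described above and recovers x from a collision between two different linear combinations.

```c
/* Toy Pollard-rho-style collision search over Z_n instead of a curve:
   the state R = a*P + b*Q is carried along with its coefficients (a, b). */
#include <stdio.h>
#include <stdint.h>

#define N 1000003ULL                 /* a small prime group order */

typedef struct { uint64_t r, a, b; } state;   /* r = a*P + b*Q mod N */

static uint64_t P, Q;

/* One step of the walk; the partition into three subsets is arbitrary,
   here simply r mod 3, mirroring the slide's example. */
static state f(state s) {
    switch (s.r % 3) {
    case 0:  s.r = (s.r + P) % N; s.a = (s.a + 1) % N; break;
    case 1:  s.r = (s.r + Q) % N; s.b = (s.b + 1) % N; break;
    default: s.r = 2 * s.r % N;   s.a = 2 * s.a % N;   s.b = 2 * s.b % N;
    }
    return s;
}

static uint64_t powmod(uint64_t b, uint64_t e) {  /* inverse via Fermat */
    uint64_t r = 1; b %= N;
    for (; e; e >>= 1) { if (e & 1) r = r * b % N; b = b * b % N; }
    return r;
}

int main(void) {
    uint64_t x = 123456;             /* the secret discrete logarithm */
    P = 1; Q = x * P % N;
    state t = { P, 1, 0 }, h = t;
    /* Floyd's cycle finding: stop when both walks reach the same r. */
    do { t = f(t); h = f(f(h)); } while (t.r != h.r);
    if ((h.b + N - t.b) % N == 0) { puts("degenerate collision, retry"); return 1; }
    /* a*P + b*Q = c*P + d*Q  =>  x = (a - c) * (d - b)^(-1) mod N */
    uint64_t rec = (t.a + N - h.a) % N * powmod((h.b + N - t.b) % N, N - 2) % N;
    printf("recovered x = %llu\n", (unsigned long long)rec);
    return 0;
}
```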
So the idea that van Oorschot and Wiener had was to store only a proportion of the points in the set, which we call distinguished points. These are just points that have some easily testable distinguishing property: for instance, in the ECDLP case, we can say that the x-coordinate has three trailing zero bits, and we only store points whose x-coordinate has three trailing zero bits. We also have a very important parameter, theta, which is the proportion of distinguished points in the set S.

Another difference between this algorithm and everything discussed until now is that this is a parallel collision search. Instead of having only one random walk, there is a random walk starting from a different point for each thread involved in the algorithm, as depicted in this picture: here, for instance, two threads start from different points and collide at a non-distinguished point. A very important thing to note is that since the walk is random but also deterministic, even if two threads collide at a non-distinguished point, they will continue along the same trail to the same distinguished point, and the collision is then discovered.

Now, let us see the work that we did on this algorithm. First, we did a time complexity analysis, both in the one-collision case and in the multi-collision case, and for both cases we derived the optimal value for theta. The analysis is more interesting in the multi-collision case, so that is what we are going to present here. In the one-collision case, we did not improve much on the original analysis; we just give a more rigorous argument that the algorithm scales perfectly.

In the multi-collision case, we treat both the case where the memory is constrained and the case where it is not. Let us first see, as an intermediate result, the case where the memory is not constrained. We have a formula for the expected total number of computed distinguished points needed to find m collisions, under the assumption that memory is unlimited, and it turns out to be quite accurate compared to our experimental results, even for as few as 100 collisions.

We then use this to derive a formula for a case that is more relevant in the real world: the case where the memory is limited. Here is the theorem that gives the expected running time to find m collisions with a memory constraint of w words. We have some other parameters: L is the number of processors used, and note how L divides the whole time complexity, which indicates that the algorithm should scale perfectly; not quite so in practice, of course, but close to it. We still have theta, the proportion of distinguished points, and N is still the cardinality of our set S.

Before we decompose this formula a bit, let us see how the algorithm works in practice. While there is still memory left, we keep adding points while looking for collisions. Once we have added w points, we have to stop; we can no longer add any. The new distinguished points that are computed afterwards are still compared against the stored ones, but if they do not produce a collision, they are thrown away. This is unfortunate and has some implications for the algorithm, but it is the strategy we have to take if our memory is limited.
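Here is a toy, single-threaded C simulation of exactly this strategy, under parameters we chose for illustration (it is not the paper's implementation): trails end at distinguished points defined by trailing zero bits, distinguished points are stored until the cap of w points is reached, and after that new distinguished points are only compared against the frozen store. The cutoff that abandons overly long trails is a common safeguard against walks trapped in small cycles.

```c
/* Toy simulation of collision search with limited memory. */
#include <stdio.h>
#include <stdint.h>

#define NBITS 24                  /* |S| = 2^24 */
#define D 6                       /* theta = 2^-6: 6 trailing zero bits */
#define W 1024                    /* memory cap: at most W stored DPs */

static uint64_t f(uint64_t x) {   /* stand-in random map on S */
    x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    x ^= x >> 31;
    return x & ((1ULL << NBITS) - 1);
}

static int is_dp(uint64_t x) {    /* easily testable property */
    return (x & ((1ULL << D) - 1)) == 0;
}

int main(void) {
    static uint64_t store[W];
    unsigned stored = 0, collisions = 0, target = 100;
    uint64_t seed = 0, iters = 0;
    while (collisions < target) {
        uint64_t x = ++seed, len = 0;          /* fresh trail start */
        while (!is_dp(x)) {                    /* walk to the next DP */
            x = f(x); iters++;
            if (++len > (20ULL << D)) break;   /* abandon stuck trails */
        }
        if (!is_dp(x)) continue;
        int hit = 0;
        for (unsigned i = 0; i < stored; i++)
            if (store[i] == x) { collisions++; hit = 1; break; }
        if (!hit && stored < W)
            store[stored++] = x;  /* phase 1: memory not yet full */
        /* phase 2: once stored == W, non-matching DPs are thrown away */
    }
    printf("%u collisions after %llu steps, %u/%u DPs stored\n",
           collisions, (unsigned long long)iters, stored, (unsigned)W);
    return 0;
}
```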
This limited-memory strategy means that the algorithm has two parts: the part where we still look for collisions and store points, and the part where we only look for collisions. First, we have the expected number of iterations needed to find and store w points: since 1/theta is the expected length of a walk to a distinguished point, we multiply that by w. Then, for the rest, we use our intermediate formula. Remember that we had a formula telling us how many distinguished points we need to compute in order to find m collisions; inversely, we can calculate how many collisions we have already found by the time w points are stored. That is what we have here: the number of collisions found after storing w points, and m minus that number is the number of collisions we still need to find. We multiply this by the expected number of iterations needed to find one collision when w points are already stored, and note that w is fixed from this moment on, because we no longer store any points.

A corollary of this theorem gives the optimal proportion of distinguished points, that is, the value of theta that minimizes this time complexity. Another thing we can derive is the running time of the PCS algorithm for finding N/2 collisions when theta is chosen this way; N/2 is the number of collisions that you need, for instance, for a classical meet-in-the-middle attack. We will not go into the details of how this formula is derived, but we see that w appears as a denominator, which shows that the more points we can store, the lower the running time becomes. So memory is an important factor in the running time complexity. This is something we already knew intuitively, since the more points you can store, the better your chance of finding a collision sooner, but now it is also visible in the formula.
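Written out, the decomposition described above has the following shape. This is only a sketch in the slide's notation: m_w (the collisions already found while filling the memory) and t_w (the expected iterations per collision once w points are stored) stand in for the paper's closed-form expressions, which we do not reproduce here.

```c
/* Shape of the expected running time with memory limited to w points;
   divide by the number of processors L for the parallel running time. */
double expected_iterations(double w, double theta, double m,
                           double m_w, double t_w) {
    double fill_phase   = w / theta;        /* find and store w DPs */
    double search_phase = (m - m_w) * t_w;  /* remaining collisions,
                                               store now frozen */
    return fill_phase + search_phase;
}
```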
This motivates our further work on the implementation side, where we look for the data structure that should be used for storing the distinguished points. First, let us set the requirements that we have for this data structure; keep in mind that this is for our model, because the requirements can change immensely depending on how you decide to run the algorithm. We want the structure to be space efficient: that holds in every case, because we already saw that the more we can store, the better the running time. We also need the algorithm to be thread safe, and we want to be able to lock only a portion of the structure to achieve this; we do not want to lock the entire structure every time, because that creates a huge overhead when threads are waiting to store points. This is why I say it depends on the architecture you choose: if you choose to have a single server that stores all of the points it receives from many, many clients, then you do not need the structure to be thread safe, since only one entity is actually storing points. But we set it as a requirement: we want the structure to be easily made thread safe. And, of course, we want fast lookup and insertion. Classically, as far as we know from the literature, a hash table is used when you have these requirements. In this work, we propose to replace the hash table with an alternative structure, which we call a packed radix-tree-list (PRTL). Let us decompose the name a bit.

The structure is only inspired by radix trees. On the right, we have a brief example of what a radix tree is: a tree where each node is one letter of the word that we are going to store. I am talking about letters and words here because that is what is in the picture; otherwise, think of bits and bit vectors. For instance, we store 12345 and create the nodes 1, 2, 3, 4, 5. Then we store 12544: the prefix 12 already exists, so we just add a child to the node 2 to continue our word. We profit from the common prefixes and do not allocate any new nodes for them. Despite profiting from common prefixes, however, this is a very inefficient structure at the implementation level, because you really need to create all of the machinery: you need to create each node, a pointer to the next node, and so on. And it is not efficient if you have many pending leaves; they are just created for nothing.

This is why we opted for a hybrid between a radix tree and linked lists. We will show you what this means now. Say we construct a radix tree up to a certain level, like in the picture, and then we add the points to linked lists, each list starting from a leaf of the tree. Here we have a full tree; imagine it is in base 4, because base 4 fits on the slide. Then imagine we add this point: 0011 starts with the prefix 00, so it is added in this leaf, and 0031 as well; each point is added under the leaf corresponding to its prefix. Now, notice that at the implementation level, since we know that all of the leaves are going to be filled, they will all exist. If we can arrive at a level like that, and we will show later that this is always possible, then we do not actually need to construct the tree at all: it is fixed, always the same for every run of the algorithm. So this is what our structure looks like when implemented: we only create an array, and the index of each slot corresponds to the prefix of the points that will be stored in that slot. For instance, here we store the points 0011 and 0031 in the first slot. And, of course, these are linked lists here.

Now, this structure looks a lot like a hash table; at the implementation level, it is exactly the same structure. So let me tell you the subtle differences between this and a hash table, and what makes this structure more appropriate for this type of application. Usually, when you use a hash table, you preferably want each element stored in a different slot: you do not want hash table collisions. But to achieve this, you have to allocate more slots than the elements you will store, approximately three times more. That is just not possible for this algorithm: we store a lot of points, we know that we cannot allocate huge hash tables, and we will have to use linked lists anyway. So there is no use in hashing the value to try to avoid hash table collisions. And when we do not hash the value, we can, on the other hand, profit from common prefixes as we would in a radix tree. Notice that I did not copy the 00 prefix; I stored only the suffix of the point, because there is no need: the index of the slot tells you the prefix of the point. This saves a lot of space, as you will see when you scale it up.
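As an illustration, here is a minimal sketch of that implemented layout, with assumed types and parameters (the artifact's real structure is more careful, in particular about packing): an array of 2^L list heads indexed by the L-bit prefix, where each list node stores only the suffix.

```c
/* Minimal PRTL-style table: slot index = prefix, nodes hold suffixes. */
#include <stdint.h>
#include <stdlib.h>

#define LVL 8                        /* branching level: 2^8 slots */
#define SLOTS (1u << LVL)

typedef struct node { uint64_t suffix; struct node *next; } node;
static node *slot[SLOTS];

/* Insert an nbits-wide point; returns 1 if it was already stored,
   i.e. a collision was found. The prefix itself is never stored:
   the slot index encodes it, which is where the space saving lies. */
static int prtl_insert(uint64_t point, unsigned nbits) {
    uint32_t prefix = (uint32_t)(point >> (nbits - LVL));
    uint64_t suffix = point & ((1ULL << (nbits - LVL)) - 1);
    for (node *p = slot[prefix]; p != NULL; p = p->next)
        if (p->suffix == suffix) return 1;       /* collision */
    node *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->suffix = suffix;
    n->next = slot[prefix];                      /* prepend to the list */
    slot[prefix] = n;
    return 0;
}
```

A lock per slot, or per group of slots, is enough to make this thread safe, which is the "lock only a portion of the structure" requirement mentioned earlier.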
Another thing that we did is that the stored data is packed in a single vector. For instance, for the ECDLP, you need to store the x-coordinate and some other coefficient; we pack everything into a single vector so that we do not waste space on alignment, because space is very important in this case. We will also show on the next slides how we can estimate the optimal branching level, and what this means. The optimal branching level is the level at which there are no pending leaves in the tree: in the implementation, every slot in our array is filled, and at the same time the linked lists are as short as possible. If you ask for just one of these requirements, for instance no pending leaves, you can take a very short prefix and be sure that every slot is filled, but you get very long lists; if you choose a longer prefix, it is the other way around. So we want to find the right middle ground between the two requirements.

We equated this problem with a well-known problem called the coupon collector's problem. Let us briefly recall what it is. Say we have a given number of coupons; here we have five coupons, each with a different color, and we want to collect all of them, but each time we buy a coupon we can get any one of the colors. The goal is to collect all five colors, and the question is: how many coupons do I need to buy to be assured, with high probability, that I will have collected all five? For instance, here we continued buying coupons until we found the last color, the yellow one. Why is this the same problem as ours? Inversely, we know how many points we are going to store in our structure, and we want to know how many slots we need: the maximal number of slots such that they can all be filled using these points. That is how we determine our optimal branching level.

So, as per the coupon collector's problem, we say that the optimal level is the largest L such that K, the number of points stored, satisfies K >= B^L (L ln B + c), where B^L is the number of leaves in the tree, which is also the number of slots in our array, and c is a small constant. In our case we store in binary, so this is K >= 2^L (L ln 2 + c). Knowing K, we calculate L. This estimate turns out to be very accurate: we did experiments trying to fill the structure using the level L, and it filled completely, while just one level deeper you start to see empty slots. It did not work as well with the hash table that we tried.
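Concretely, this rule can be evaluated in a few lines; a sketch under the assumptions above, where the small constant c is a value we picked for illustration.

```c
/* Largest branching level L such that K stored points are expected to
   fill all 2^L slots, coupon-collector style: K >= 2^L * (L*ln2 + c). */
#include <stdio.h>
#include <math.h>

#ifndef M_LN2
#define M_LN2 0.69314718055994530942
#endif

static unsigned optimal_level(double K, double c) {
    unsigned L = 0;
    while (exp2((double)(L + 1)) * ((L + 1) * M_LN2 + c) <= K)
        L++;
    return L;
}

int main(void) {
    printf("K = 1e7 points -> level L = %u\n", optimal_level(1e7, 3.0));
    return 0;
}
```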
Finally, let us see some experimental results. All of our experiments concern the elliptic curve discrete logarithm implementation, using a curve E over a finite field F_p, with p prime, but we also adapted the code for one-collision and multi-collision search so that we could run all of these experiments. In the multi-collision case, think of the ECDLP in the multi-user setting. Theta is always 1/2^(B/4) for a B-bit curve; for instance, I will show you experiments on a 55-bit curve, which means that the number of trailing zero bits is 13, so theta is 1/2^13.

What we wanted to show with these experiments is that memory really is an important factor in the running time. So we limited the memory: for our smaller parameters we limited it to one gigabyte, and we asked the algorithm to find four million collisions, both using the PRTL structure and using a hash table. Here are our findings. The running time of the PRTL is better than the running time of the hash table, and this is immediately explained when you look at the number of stored points: with the PRTL we managed to store about 46 million points, while the hash table stored only 12 million. This is actually the only reason the PRTL is better than the hash table: otherwise, in insertion time and in being thread safe and everything else, they are pretty much equivalent. The only thing we gained is being able to store more, and this has a huge consequence for the running time. Here are some bigger parameters as well: with two gigabytes we asked for 16 million collisions, and with four gigabytes we asked for 15 million collisions, with the same outcome each time. The first experiment is an average of 100 runs; the other two are averages of only 10 runs, because they take a lot of time. Our implementation is in C, and we use external libraries for big-number arithmetic and for parallelization.

To conclude: we revisited the one-collision and multi-collision time complexity analysis, and we showed more precisely that memory is an important factor in the running time complexity. This was the motivation for our further work, in which we proposed an alternative memory structure that allows us to store more than a hash table ("a lot more" would be subjective) while keeping all the properties that a hash table has, such as fast insertion and being easily made thread safe. If you are interested in this work, check out our artifact: we made it so that it is seamless to add other structures and compare them, both in terms of running time and memory. Thanks a lot for checking out our work, and we look forward to your questions at the conference.