So now we have the follow-up: 11 years later we still don't have collisions for SHA-1, but we're getting closer every year. The first talk is entitled "Practical Free-Start Collision Attacks on 76-step SHA-1", by Pierre Karpman, Thomas Peyrin and Marc Stevens, and Pierre Karpman will give the talk.

Thank you Bart for the nice introduction. So this is going to be the outline of my talk, nothing really unusual, and because it's early in the morning I'll start with a few recaps on hash functions. A hash function is a mapping from binary strings of arbitrary length to strings of one fixed length, say n. It's a really useful primitive in crypto: you can implement many higher-level things with hash functions, typically hash-and-sign; you can do MACs; you can also do stream ciphers, which is the other topic of this session. But the thing that's a bit unusual with hash functions is that they are keyless, so it makes things a bit more complex to evaluate the security, to define what a good hash function is. But we know how to do that; they've been around for a long time. Usually we use three security notions, and informally this is what they are. The first two are quite related. The first is preimage resistance: I'm given a target t and I want to find a message that hashes to this target. If I know nothing about the structure of the hash function, the best I can hope to do is a sort of exhaustive search, which has a complexity of about 2^n, with n the length of the hash. Second preimage resistance is quite similar, but this time I'm given a message M and I want to find a distinct one that hashes to the same value, with the same complexity. And then we have collision attacks, which are sometimes a bit ill-defined, but we just want to find two messages that hash to the same value, without any condition on the messages beforehand. And thanks to the birthday paradox, we can do this in about 2^(n/2).
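The birthday bound mentioned above can be illustrated on a toy hash. This is a hedged sketch, not part of the attack in the talk: it truncates SHA-1 to a few bytes and finds a collision by storing hashes until two distinct messages collide, which takes roughly 2^(n/2) trials for an n-bit output.

```python
# Toy illustration of the birthday bound: a collision on an n-bit hash
# costs about 2^(n/2) trials. Here we truncate SHA-1 to get a small n.
import hashlib

def truncated_sha1(msg: bytes, nbytes: int = 2) -> bytes:
    """First `nbytes` bytes of SHA-1 (a 16-bit toy hash for nbytes=2)."""
    return hashlib.sha1(msg).digest()[:nbytes]

def birthday_collision(nbytes: int = 2):
    """Hash counter messages until two distinct ones collide.
    Expected number of trials is about 2^(8*nbytes/2) here 2^8."""
    seen = {}
    i = 0
    while True:
        m = i.to_bytes(8, "big")
        h = truncated_sha1(m, nbytes)
        if h in seen:
            return seen[h], m   # two distinct messages, same hash
        seen[h] = m
        i += 1

m1, m2 = birthday_collision(2)
assert m1 != m2
assert truncated_sha1(m1, 2) == truncated_sha1(m2, 2)
```

With a 16-bit truncation the loop finishes after a few hundred trials on average, matching the 2^(n/2) estimate.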
Things become more specific if we consider practical constructions of hash functions. For instance, let's consider Merkle-Damgård hash functions, because that's the kind of hash function that SHA-1 is. In this case we construct the hash from a compression function, which takes a message block of fixed length and a chaining value of fixed length, and outputs a chaining value again. Then we can hash a message by chaining the compression function like this, starting from an initial value that is fixed in the definition of the hash function. The nice thing about this is that we have a security reduction from the hash function to the compression function: if I have an attack on the hash function, I can attack the compression function. So if I don't have an attack on the compression function, I know I don't have an attack on the hash function. But if I have an attack on the compression function, it's not so clear what we get; it's not obvious that we can convert it to an attack on the whole hash, but at least it breaks the security reduction. So if we consider functions that are Merkle-Damgård, we can have additional attacks by considering the IV as an input to the attack. For instance — I won't formally define free-start, but let's say I want to do a free-start attack: it just means that as an attacker, I can decide what the IV is going to be. It's additional freedom for the attacker, so we can also expect to do things that are a bit more powerful. And a variant of this: let's say I want to attack exactly the compression function. Then I won't have any chaining — I have to do everything in one block — but I can control the additional input that is the chaining value. So in this work, what we did is collisions on 76 steps out of 80 of the compression function of SHA-1, which means we attack 95% of SHA-1.
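The Merkle-Damgård chaining just described can be sketched in a few lines. This is a toy model under stated assumptions: the compression function is a stand-in (SHA-1's real one comes later in the talk), the IV is a placeholder value, and the padding is generic MD strengthening rather than SHA-1's exact rule. The `iv` parameter is exactly the extra input a free-start attacker gets to choose.

```python
# Toy Merkle-Damgård construction: iterate a compression function over
# fixed-size blocks, starting from a fixed IV (MD strengthening pads
# with 0x80, zeros, and the 64-bit message bit-length).
import hashlib
import struct

def compress(cv: bytes, block: bytes) -> bytes:
    # stand-in compression function h: 160-bit CV x 512-bit block -> 160-bit CV
    return hashlib.sha1(cv + block).digest()

IV = bytes(20)  # placeholder; the real IV is fixed in the spec

def pad(msg: bytes) -> bytes:
    length = struct.pack(">Q", 8 * len(msg))
    msg = msg + b"\x80"
    msg = msg + b"\x00" * ((-len(msg) - 8) % 64)
    return msg + length

def md_hash(msg: bytes, iv: bytes = IV) -> bytes:
    # A normal user must use the fixed IV; a free-start attacker may
    # choose `iv` freely (that is the extra freedom discussed above).
    cv = iv
    padded = pad(msg)
    for i in range(0, len(padded), 64):
        cv = compress(cv, padded[i:i+64])
    return cv

assert md_hash(b"abc") == md_hash(b"abc")
assert md_hash(b"abc") != md_hash(b"abd")
assert md_hash(b"abc", iv=bytes([1]) * 20) != md_hash(b"abc")
```

The last assertion shows why controlling the IV matters: changing it changes the whole chain, which is exactly the degree of freedom a free-start attack exploits.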
And the nice thing about this attack in particular is that it's practical: we computed such a collision explicitly, and we did it with a GPU because it was fast. So, a few details about SHA-1. It was designed by the NSA 20 years ago as a quick fix to SHA-0. The hash size is 160 bits, so we should expect a collision resistance of about 80 bits; anything below that is an attack. The blocks are 512 bits long. As for the structure, it uses a block cipher in Davies-Meyer mode, and the block cipher inside is a five-branch Feistel-like structure, so we get this nice round function. The message expansion is linear — W_i = (W_{i-3} xor W_{i-8} xor W_{i-14} xor W_{i-16}) rotated left by one — and if you attended my rump session talk, you should know that this rotation by one is the only difference between SHA-0 and SHA-1; I mean, the only significant one. There are 80 steps of this five-branch structure. So in a picture, that's how it is: we have five 32-bit words of state, and then we just apply this operation with the Boolean function here. Okay, so as Bart said, SHA-1 has been attacked before, and attacks have been presented in this room 10 or 11 years ago. The main breakthrough was the attack by Wang et al. 10 years ago, which showed collisions for the whole hash function. The way this attack works is that we linearize the step function: we want to find a good differential path for a linear version of the step function. Then we compute many message pairs and hope that one follows the linear path, and we know that if it does, we get a collision. But this cannot work for the whole function, so we also need a non-linear differential path at the beginning of the function, especially to connect the linear part and the IV, because otherwise it cannot work.
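The structure just described — linear message expansion with the rotation by one, 80 steps over five 32-bit state words, and the Davies-Meyer feed-forward — can be written out directly. This is a plain Python sketch of the standard SHA-1 compression function (as specified in FIPS 180, not the attack code), checked against the well-known test vector for the message "abc".

```python
# The SHA-1 compression function: linear message expansion (the <<<1 is
# the only significant difference from SHA-0), 80 steps on a five-word
# state, and a Davies-Meyer feed-forward at the end.
import struct

MASK = 0xFFFFFFFF

def rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK

def sha1_compress(cv, block):
    """cv: tuple of five 32-bit words; block: 64 bytes."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):  # linear message expansion
        w.append(rotl(w[t-3] ^ w[t-8] ^ w[t-14] ^ w[t-16], 1))
    a, b, c, d, e = cv
    for t in range(80):      # 80 steps, Boolean function varies by round
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        a, b, c, d, e = (rotl(a, 5) + f + e + k + w[t]) & MASK, a, rotl(b, 30), c, d
    # Davies-Meyer feed-forward: add the input chaining value back in
    return tuple((x + y) & MASK for x, y in zip(cv, (a, b, c, d, e)))

IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)

# One-block check against the standard test vector for "abc"
block = b"abc" + b"\x80" + b"\x00" * 52 + struct.pack(">Q", 24)
digest = struct.pack(">5I", *sha1_compress(IV, block))
assert digest.hex() == "a9993e364706816aba3e25717850c26c9cd0d89d"
```

A free-start collision on this function means two pairs (cv, block) != (cv', block') with `sha1_compress(cv, block) == sha1_compress(cv', block')` — here restricted to 76 of the 80 steps.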
Also, a very important feature of Wang et al.'s attack is the technique of message modification, which speeds up the attack by making the best use of the freedom you have in the message, so that the probabilistic part of the attack starts as late as possible. The initial attack had a complexity evaluated at 2^69 equivalents of the SHA-1 compression function, and it was eventually improved to 2^61 by Stevens at Eurocrypt 2013. This is the best we have so far for the hash function. As a side note, if we consider preimage resistance, SHA-1 is much more resistant, because we don't know of any attacks on the full function yet. Maybe there aren't any — well, I'm not going to make any hypothesis, but we don't know of any public attacks. There are practical attacks up to about 30 steps, by De Cannière and Rechberger at CRYPTO 2008, and if you're allowed a non-practical computation, you can do about double that — that was a CRYPTO 2012 talk in this room, if you're interested. So much for the overview; what did we do? First, let me justify why we did free-start, because as I just told you, there are attacks on the whole hash function, so what's the point of an attack that is somehow less applicable, since it needs conditions on the IV? Essentially, we wanted to do something practical up to the furthest point we could. If we do free-start, then we can start computing the attack from a middle state, because we don't care so much about what IV we're going to get eventually. And we can hope that starting from the middle gives us an advantage, a faster attack: if I start from the middle and shift the message, I can use freedom up to a later step, so the probabilistic part of the attack starts later. But then, as I said, if you do this, you don't control the IV anymore.
And potentially you also need to introduce differences in the IV. Because you don't control what happens at the beginning of the computation, you need to do something backward, and you have to be careful: when you compute backward, some of the conditions you set may get invalidated, so you have to be careful about how you do things. This is why we did free-start, and in the picture, the rationale is exactly as I said. This is what you would do for a usual attack: the light green part of the state is entirely determined by the freedom in the message, so you can set it to whatever you want. Then for this light blue part, you can use message modification to control things for a few steps, and eventually you have the purely probabilistic part: you just compute and check whether you get a collision. If you do free-start, what you can hope for is to initialize the state somewhere in the middle of the function, and then you also get full control over this part. But then you have this backward computation I mentioned, which you don't really control. You have to be careful: if you keep invalidating things and have to re-check every time, it doesn't make the attack more efficient, so you have to be confident that the backward part will behave properly. But then you can shift the freedom you get, so the probabilistic part of the attack starts a bit later, which can make it more efficient. So if you want to do this, what's the process? The first step, always, is to find a good linear part; you have to start with this, otherwise you cannot really do anything. Then you have to construct the non-linear parts, as in Wang et al.'s attack, but this time shifted a bit, because we have an offset.
And the final step, when you want to implement the attack, is to find ways to speed things up — what I call accelerating techniques, like the message modification I mentioned before. We did this for 76 steps, for two reasons. First, the best practical collision result on reduced SHA-1 so far was 75 steps, so we wanted to say we did more. Second — a nice consequence of SHA-1 having a five-branch structure with five state words — if you have a collision on 76 steps, one of these five words is still used unchanged in the final output of SHA-1, so you get a partial collision on 32 bits of the full hash. Of course, at a cost much higher than obtaining those 32 bits randomly, but you can still say it, which is nice. The first thing I said we have to do is to find a good linear part. The usual criterion is just high probability, because that's what defines the complexity of the attack in the end. But here, because of our free-start setting, we have two extra conditions. If we get differences in the last five state words at the end of the linear part, we need to introduce differences in the IV so that we really get a collision, and if there are a lot of such differences, the backward propagation behaves badly, so we want to avoid that. For the same reason, we also want to avoid too many differences in the early words of the message. In the end, there were not so many candidates, and we picked the disturbance vector II(55,0), following Manuel's notation. Is there anyone in the room who knows what that means? Okay, never mind. So this is what it looks like. This is the state, and we have 80 such steps. A dash means no difference in that position, and a cross means one difference. These are the differences in the message that I'm going to fix.
This I can choose as an attacker, and if the message pair I chose follows the differential path, I expect the state differences to follow this pattern. You can see that there are actually very few differences — a bit more in the message, but that's expected. This is the same picture, continuing from step 57 to 76. These are the five last state words at the end of the computation of the 76 steps, and you can see that we have only two differences here, so we need to introduce those two differences in the IV to cancel them. Okay, so the next step is the non-linear part. We did it in two parts, and the first one is actually easy: you just want to find a good prefix with very few differences for the early part — this is what I said would be the backward propagation. For the beginning of the computation, we want something of really high probability, and especially high backward probability. We find this first, and then we use a fairly standard way — even though we improved it — of constructing the non-linear part to bridge it with the linear part. To do this, we mostly used the improved joint local-collision analysis, which comes from Stevens's Eurocrypt paper. At the end of this process, we got something we were quite happy with: a path with 236 conditions up to step 36. Of course, we don't have freedom all the way to step 36, but you can see that 236 is much less than the 512 bits of freedom we get in the message, and we get additional freedom in the IV. So this is quite good. That's how it looks. You can see — or I'm telling you — that this is the part, up to step five, with high backward probability; there are few differences. The differences have changed name because now they are signed: I won't go into the details, but 'u' and 'n' are differences, and ones and zeros are equalities.
Anyway, you can see that this part is quite sparse, but then at steps 6, 7 and 8 you have three big carry chains with many, many differences. This is the dense part of the non-linear path that we need to compute efficiently. And you can see that even after step 10, it's almost linear again, so it's a very narrow, dense non-linear part. Now, the last item on my list was finding good accelerating techniques to implement the attack. There are two families, even though they're quite related. You have message modifications: if you're computing something that is supposed to follow the differential path and at some point it doesn't, you try to modify the message slightly so that it now fulfills a condition it didn't before. And you have neutral bits, which are a bit different: you generate good instances up to a certain step, and then you know that if I have something valid up to this step, I can flip one bit, and it gives me an entirely different value there, but one that is also very likely to still follow the differential path. So I can multiply good instances, which makes things much faster than straightforward search. We chose neutral bits, basically because they're easy to deal with, and especially because they're quite easy to implement on GPUs. And I said I would do things with an offset, but I didn't say what the offset was exactly. The most important one is for the neutral bits, and it's an offset of six: instead of using freedom in message words W0 to W15, I use W6 to W21. So I got six extra steps from this starting-in-the-middle, free-start thing. So, to sum up, the attack does this.
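The neutral-bit idea above can be sketched abstractly. This is a hedged toy model, not the attack's actual code: the predicate `valid_up_to`, the helper `find_neutral_bits`, and the toy condition are all illustrative assumptions. A bit is neutral up to step k if flipping it in a message satisfying the path conditions up to step k leaves those conditions satisfied, while still changing the later computation.

```python
# Abstract sketch of neutral-bit search (toy model, illustrative names).
def find_neutral_bits(msg_words, valid_up_to, k):
    """Return (word, bit) positions whose flip preserves validity up to
    step k. `valid_up_to(words, k)` is an assumed predicate modelling
    'follows the differential path conditions up to step k'."""
    assert valid_up_to(msg_words, k)
    neutral = []
    for word in range(len(msg_words)):
        for bit in range(32):
            flipped = list(msg_words)
            flipped[word] ^= 1 << bit
            if valid_up_to(flipped, k):
                neutral.append((word, bit))
    return neutral

# Toy predicate: 'valid up to step k' means the low byte of the sum of
# the first k words is fixed. Bits 8..31 never disturb it (carries only
# propagate upward), so they come out neutral, as do all bits of words
# not yet consumed by step k.
TARGET = (3 + 5) % 256
def toy_valid(words, k):
    return sum(words[:k]) % 256 == TARGET

nb = find_neutral_bits([3, 5, 7, 11], toy_valid, k=2)
assert (0, 8) in nb and (3, 0) in nb   # high bits / late words: neutral
assert (0, 0) not in nb                # low bit of an early word: not
```

In the real attack the conditions are the signed bit conditions of the differential path, and the useful neutral bits are found empirically; the principle of multiplying valid instances up to a given step is the same.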
First we initialize a state with some offset, then the same for the message words; then we use neutral bits with an offset, and this gives us a lot of neutral bits up to quite a late step compared to regular SHA-1 attacks. The drawback, of course, is that we don't know the IV in advance, so it's going to be a free-start result. And since we have differences in the IV, it's really a free-start, not a semi-free-start. We do everything in one block, so in the end it's really a collision on the compression function. This is what happens, in a picture: this is where we initialize the state in practice, this is what we compute with the message freedom, and this is the offset we use for the message. This whole window is the message freedom we get, but this darker shade of orange marks the words where we actually have neutral bits, and those neutral bits act from step 18 to 26. So this is the whole part where we can speed things up. Okay, now I'll describe how we implemented this to make it efficient. We did it on a GPU because we wanted to have fun, and we wanted to see whether it would be efficient — and it turned out that it is. We used a nice, cheap gamer GPU, the NVIDIA GTX 970. We got free video games with it, actually, which was nice. It's quite recent and has 1664 cores at about one gigahertz. The nice thing with NVIDIA GPUs is that they allow a sort of high-level programming with the CUDA framework. NVIDIA GPUs are usually supposed to be less efficient for crypto computations, but this one is actually quite nice, because for all of the instructions we need — 32-bit arithmetic, basically — you get a throughput of one instruction per cycle per core. So optimally, you can really use all of these 1664 cores at their best.
The only exception is the rotation, which is a bit unfortunate because we have quite a few rotations in the computation, but it's still quite nice in our opinion. And it's quite cheap: 500 Singapore dollars — you can do the conversion to whatever currency you're using, but we bought this one in Singapore. So then, okay, I have nearly 2000 cores, but it's not like I can program them the way I would a regular CPU, especially because the threads that run on these cores are packed in warps of 32 threads. For these warps, the execution model is single instruction, multiple threads: all of the threads of a warp have to execute exactly the same instruction. So if I have divergence in the control flow, because I hit some conditional somewhere, basically everything gets serialized: the threads that follow one side of the conditional execute it while the others just shut down for a bit, and then they reactivate. Because of this, you really want to avoid branching, because in the worst case you lose a factor of 32 over what you could have. Also, you tend to group threads in very big blocks to hide latency. The approach we used to counter this branching problem is to use shared buffers for partial solutions: I have partial solutions up to some step, and all solutions up to the same step are stored in the same shared buffer. The process is that a block of threads all load their own partial solution and all try every neutral bit available for that step, and every time they get a solution valid up to the next step, they store it in the next shared buffer. And that's how we decompose the computation between the CPU and the GPU.
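The shared-buffer scheme just described can be modelled on a CPU in a few lines. This is a sketch under stated assumptions: `extend` stands in for "apply the neutral bit and check the path conditions", and the step numbers are illustrative. The point of the design is that every thread in a warp runs the same loop over the same step's neutral bits, so there is no divergent branching.

```python
# CPU model of the GPU buffer scheme: solutions valid up to the same
# step live in the same buffer; one pass drains a buffer and pushes the
# surviving extended candidates into the buffer for the next step.
from collections import deque

def process_step(buffers, step, neutral_bits, extend):
    """Drain buffers[step]; push survivors to buffers[step + 1]."""
    produced = 0
    while buffers[step]:
        sol = buffers[step].popleft()
        for nb in neutral_bits[step]:
            cand = extend(sol, nb, step)   # None if a condition breaks
            if cand is not None:
                buffers[step + 1].append(cand)
                produced += 1
    return produced

# Toy run: odd 'neutral bits' survive, even ones violate a condition.
buffers = {25: deque([100]), 26: deque()}
nbs = {25: [1, 2, 3, 4]}
extend = lambda sol, nb, step: sol + nb if nb % 2 else None
assert process_step(buffers, 25, nbs, extend) == 2
assert list(buffers[26]) == [101, 103]
```

On the GPU, each buffer is shared between the threads of a block and the pop/push operations are done with atomic counters; the Python queue above only models the data flow.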
Up to step 17, we generate what we call base solutions on the CPU, because it's not too expensive, and it's something you can do offline — even online, actually. Then we use all of the neutral bits on the GPU, because that's what it's there for. We also do further checks on the GPU, and eventually the collision check is done on the CPU, because so few candidates survive up to step 56 that it would be a waste to use the GPU for that. The process, in a small picture, is like this: say I start the computation here. I ask, do I have enough solutions up to step 25? If yes, I load them from my buffer, I try to extend them up to step 26, and the valid ones I get up to step 26, I store in the buffer of solutions up to 26. Then I start again, and again; and if a buffer is empty, I just load base solutions produced by the CPU and extend those. I do this until I have a collision. So, the results. As I said, we had one single, quite cheap GPU, and with one GPU you can generate partial solutions up to step 56 at one per minute on average, which is quite nice. Because we know the probability of following the path from step 56 to the end, we can get a very good estimate of the complexity of the full attack, and this is about five days — slightly less, like four and a half. The equivalent complexity is about 2^50.25 SHA-1 compression function computations. If we compare to a CPU — we implemented the same process on a rather recent, high-clocked CPU — this would take about 600 days. So in terms of the speedup we get, one GPU is worth about 140 such CPU cores, which is quite nice, especially compared to previous attempts at implementing SHA-1 attacks on GPUs, which got a speedup of about 40. So we were quite happy with this.
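The extrapolation behind the "about five days" figure can be reproduced as back-of-the-envelope arithmetic. This is a hedged sketch: the survival probability `p` below is a hypothetical value, chosen only so the numbers land near the talk's estimate — the talk states the rate (one step-56 solution per minute) and the total time, not `p` itself.

```python
# If partial solutions up to some step appear at rate r per minute, and
# each survives the remaining probabilistic steps with probability p,
# the expected time to a collision is 1/(r * p) minutes.
def expected_days(rate_per_min: float, p: float) -> float:
    minutes = 1.0 / (rate_per_min * p)
    return minutes / (60 * 24)

p = 2.0 ** -12.8                 # hypothetical survival probability
days = expected_days(1.0, p)     # one step-56 solution per minute
assert 4.0 < days < 6.0          # consistent with 'about five days'
```

This is how such attack costs are typically estimated in practice: measure the rate of partial solutions empirically, multiply by the known probability of the purely probabilistic tail.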
But if we compare to the speedup we get for raw SHA-1 computation, which is 320, we still lose a factor of about two from the branching. In our opinion, this factor of 2.3 is still not bad, and we were quite happy with the results. So that's the end of my talk; I hope you enjoyed it. Thank you very much.