Yeah, this is joint work between CWI and Google: myself, Elie Bursztein, Pierre Karpman, Ange Albertini, and Yarik Markov. And it builds upon research going back two decades, but this particular piece of work lasted two years; we worked for two years on this. So SHA-1 is a cryptographic hash function: it takes an arbitrary-length input and computes a bit string of 160 bits. SHA-1 was designed to be collision resistant. Although we don't have a formal definition of collision resistance, there is this informal definition that it should be infeasible to find different x and y that map to the same output. And of course you have the birthday search, so there is a generic attack with complexity about 2 to the power 80. Now, the two most widely deployed cryptographic hash functions of the past two decades were MD5 and SHA-1, and both were initially broken by a team led by Professor Xiaoyun Wang. MD5 was broken practically: there were real collisions right away. SHA-1 was broken theoretically, and it really took 12 years to finally find the first collision. Luckily, we still have at least two secure cryptographic hash function standards, SHA-2 and SHA-3, that we can all migrate to now. That collisions are a real-world problem has already been shown quite strongly in two instances. The first was in 2009, when we created a rogue certification authority using an MD5 so-called chosen-prefix collision attack, where we created a collision between a website certificate, a perfectly normal one, and a rogue CA certificate. We got an actual CA signature on our website certificate just by going through the normal, automated request process. And because of the collision property, this signature was also valid for the rogue CA certificate, thereby suddenly making it a valid CA certificate.
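For concreteness, the basic parameters mentioned above (160-bit output, generic birthday bound of 2 to the 80) can be checked with Python's standard `hashlib`:

```python
import hashlib

# SHA-1 maps an arbitrary-length input to a 160-bit (20-byte) digest.
digest = hashlib.sha1(b"an arbitrary length input").digest()
assert len(digest) == 20  # 160 bits

# Generic birthday search: roughly 2**(160/2) = 2**80 hash evaluations
# are expected before some pair of inputs collides.
birthday_bound = 2 ** (160 // 2)
assert birthday_bound == 2**80
```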
And then, of course, you can generate on-the-fly certificates for any secure website, so you can do a live, virtually undetectable man-in-the-middle attack. The other instance was in 2012, with the super-malware Flame targeting the Middle East. It turned out that Flame used a forged signature for Windows Update, and again this was done using an MD5 chosen-prefix collision attack. Instead of targeting the PKI branch of Microsoft that actually signs Windows updates, it attacked an almost unknown branch that just automatically handed out certificates on the internet if you knew how to ask. There they created a collision between two certificates, where one was effectively a code-signing certificate, and any Windows update signed by it would be accepted by any version of Windows at that time. This, of course, has been fixed now. Once you have created these fake updates, you can spread them on a local network using a special Windows protocol for pushing updates, and then you basically have an infection you can do nothing against. So this is the state of the art in MD5 and SHA-1 cryptanalysis. We have identical-prefix collisions and chosen-prefix collisions, which are a stronger type of attack. The current state is that for MD5 we can generate identical-prefix collisions very quickly, in a fraction of a second; chosen-prefix collisions are a bit harder and take about a day. We finally managed to get a SHA-1 collision at a cost of about 2 to the power 63 on GPU, which is, relatively speaking, more costly than on CPU, so the exponent is slightly higher there. And chosen-prefix collisions for SHA-1 are estimated at about 2 to the power 77. To view this in light of what's practical, you should also think about the Bitcoin network, which computes SHA-2: on a yearly basis, the Bitcoin network does 2 to the power 86 SHA-2 operations.
That already says that, basically, 80-bit security is not enough. Now, all these attacks on SHA-1 really attack the SHA-1 compression function: the underlying function that processes a 512-bit message block and updates the internal 160-bit chaining value. It does this until the entire message, plus some padding, is processed, and then the final chaining value is, of course, the hash. The compression function pretty much operates like a block cipher. It linearly expands the input message block, 512 bits partitioned as 16 words of 32 bits, into 80 words of 32 bits. Then there is a nonlinear mixing of a five-word state over 80 rounds, where every round uses one word of the expanded message. Finally, there is a Davies-Meyer feed-forward to prevent the function from being efficiently invertible. Now, for collision attacks, we apply differential cryptanalysis. We consider two instances of the compression function side by side, one for each of the two messages, and we analyze the differences. In particular, we use a differential path, which is a precise description of all the differences and how they propagate through the compression function. The most important part of the differential path is basically the last 60 steps, because they determine most of the attack's complexity; we want that part to have the highest success probability possible. Once we have this differential path, we can translate it into a system of equations, and then we can try to solve this system to find the actual M and M-prime that form our collision. Now, how do we design these differential paths? Well, Chabaud and Joux already showed in '98 that the way to go for SHA-0 and SHA-1 is a disturbance vector.
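The structure just described, the 16-to-80-word expansion, 80 rounds of nonlinear mixing, and the Davies-Meyer feed-forward, can be sketched as a plain Python compression function. This is the textbook algorithm, not the attack code:

```python
def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def sha1_compress(h, block):
    """One SHA-1 compression: a 512-bit block updates the 160-bit chaining value."""
    # Linear message expansion: 16 input words -> 80 words.
    w = [int.from_bytes(block[4*i:4*i+4], "big") for i in range(16)]
    for i in range(16, 80):
        w.append(rol(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16], 1))
    a, b, c, d, e = h
    # 80 rounds of nonlinear mixing, one expanded message word per round.
    for i in range(80):
        if i < 20:   f, k = (b & c) | (~b & d), 0x5A827999
        elif i < 40: f, k = b ^ c ^ d, 0x6ED9EBA1
        elif i < 60: f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:        f, k = b ^ c ^ d, 0xCA62C1D6
        a, b, c, d, e = (rol(a, 5) + f + e + k + w[i]) & 0xFFFFFFFF, a, rol(b, 30), c, d
    # Davies-Meyer feed-forward prevents the function from being invertible.
    return [(x + y) & 0xFFFFFFFF for x, y in zip(h, (a, b, c, d, e))]

IV = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
# SHA-1 of the empty message is a single padded block: 0x80, zeros, length 0.
digest = b"".join(x.to_bytes(4, "big") for x in sha1_compress(IV, b"\x80" + b"\x00"*63))
```

Running this on the padded empty message reproduces the well-known SHA-1("") digest, da39a3ee5e6b4b0d3255bfef95601890afd80709.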
And this disturbance vector is a vector where every one bit marks the start of a local collision: at that step, at that bit position, a single bit difference is introduced, and it is directly canceled over the next five steps. So through its one bits, the disturbance vector shows which combination of local collisions you have to use. And the disturbance vector is an expanded message itself: it conforms to the linear message expansion of SHA-1, which is why this is actually the only known feasible approach that is compatible with the message expansion. A lot of people have analyzed which disturbance vectors are the best ones for attacks, and Stéphane Manuel showed that all the disturbance vectors people have looked at fall into two classes that are just shifts and rotations of each other, so we can focus on analyzing these two classes. Each disturbance vector determines the XOR differences in the message, and, as I said, it determines the positions of the single-bit state differences; but there is still freedom in signs and carries. So the disturbance vector pretty much lays the groundwork for the differential path and already dictates many of the differences; you can still pick signs and how many carries you want. We're going to use the same approach as introduced by Wang: two near-collision attacks, where the first near-collision attack introduces a difference into the chaining value, which is then present at the start of the second block. For the main part we use the opposing differences, so we end up with a negative difference there, and then the Davies-Meyer feed-forward cancels them both, and we have a collision.
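The "expanded message" property of a disturbance vector can be illustrated with a toy check that a candidate vector conforms to SHA-1's linear message expansion. The vector used here is a hypothetical example, not one of the actual attack DVs:

```python
def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def expand(words16):
    """SHA-1's linear message expansion: 16 words -> 80 words."""
    w = list(words16)
    for i in range(16, 80):
        w.append(rol(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16], 1))
    return w

def is_valid_dv(dv):
    """A disturbance vector must itself be an expanded message: its first
    16 words must generate the remaining 64 under the expansion."""
    return len(dv) == 80 and expand(dv[:16]) == list(dv)

# Hypothetical toy vector: expanding any 16 words yields a valid DV; each
# one bit would mark the start of a local collision at that step/bit.
dv = expand([1] + [0]*15)
assert is_valid_dv(dv)
assert not is_valid_dv(dv[:16] + [0]*64)  # breaking the recurrence fails
```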
Well, the benefit of this approach is that we don't have the additional requirement on the disturbance vector that there should be a zero difference in the chaining value, so we can get better differential paths with higher success probability. But that does mean that in the first part, which doesn't follow the disturbance vector, we have to manually connect the paths. Wang et al. did this completely by hand, but now, of course, we have actual algorithms that can compute these so-called nonlinear differential paths to connect the chaining value input to the main differential path. So OK, we have our differential path and can translate it into a system of equations. And this is a system of equations over only one compression function, which is very convenient: we don't have to compute things over two compression functions at the same time. When the system of equations is fulfilled, it guarantees that when we apply the input differences, the differential path will be followed exactly. The system of equations consists of very simple equations on message bits, involving at most two message bits, and very simple equations on state bits, again involving at most two bits. Then, of course, we need to solve them. The first 16 steps are very easily solved: in the first 16 steps, we still have the entire message block as freedom, and we can choose the value of the message block so that it immediately satisfies all state equations. We can also fulfill all message bit equations, because these are linear equations over bits that are linearly derived from the input message block, so we can map all these equations onto the first 16 steps. But after the first 16 steps, we don't have any degrees of freedom anymore, so the remaining 64 steps are completely determined.
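A minimal sketch of why the first 16 steps are easy: each of them consumes one still-free message word, so the step can be inverted to hit any desired state word. The real attack targets specific bit conditions rather than whole target words, but the mechanism is the same; everything below is an illustrative toy:

```python
import random

MASK = 0xFFFFFFFF
def rol(x, n): return ((x << n) | (x >> (32 - n))) & MASK

def f_k(i, b, c, d):
    # Round function and constant for SHA-1 steps 0..19 (the IF rounds).
    return ((b & c) | (~b & d)) & MASK, 0x5A827999

def solve_first_16(targets, iv):
    """Pick each message word w[i] so that the new state word equals ANY
    chosen target value: all state equations in steps 0..15 come for free."""
    a, b, c, d, e = iv
    w = []
    for i, t in enumerate(targets):
        f, k = f_k(i, b, c, d)
        w.append((t - (rol(a, 5) + f + e + k)) & MASK)   # invert the step
        a, b, c, d, e = t, a, rol(b, 30), c, d
    return w

iv = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
targets = [random.getrandbits(32) for _ in range(16)]     # arbitrary demands
w = solve_first_16(targets, iv)

# Verify: running the real first 16 SHA-1 steps with w hits every target.
a, b, c, d, e = iv
hits = []
for i in range(16):
    f, k = f_k(i, b, c, d)
    a, b, c, d, e = (rol(a, 5) + f + e + k + w[i]) & MASK, a, rol(b, 30), c, d
    hits.append(a == targets[i])
ok = all(hits)
assert ok
```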
But of course, we can still make smart changes that don't break any of the equations up to some step, and thereby cheaply generate partial solutions up to that step, amortizing the cost of the earlier steps. This still offers quite significant control over about 30% of the steps of SHA-1. Then, for the remaining part, we just have to generate many, many solutions, in this case up to step 24, to fulfill the remaining steps probabilistically. That's why the success probability of this part is crucial to the complexity. So this is the overview of the whole procedure for creating such a near-collision attack. First, of course, we have to build the system of equations: we analyze the disturbance vector, look for an optimal differential path for the main part, and construct the connecting paths between the chaining value and the main differential path. Then we translate it all into attack conditions. To speed up the attack, we also want to find additional conditions, because we actually allow multiple differential paths, so we can find additional conditions that use both message and state bits; this gives a slight speed-up using the early-abort strategy. There is an additional step that we needed for the second near-collision attack, because the first steps of its differential path were so heavily overdefined. Just by trying different variants of the nonlinear path, we couldn't get around the solvability problem, and in the end we sadly resorted to finding a drop-in replacement for the first few steps of the differential path. And of course, the next step is: we have a system of equations that apparently is now solvable, and now we have to find a solution as fast as possible. For that we have to analyze the smart changes we can make; these are called boomerangs and neutral bits.
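A toy illustration of neutral bits, under the simplifying assumption that "conditions up to a step" means the state words through that step stay unchanged; the real attack uses carefully chosen condition sets, and also boomerangs, small groups of bits flipped together:

```python
MASK = 0xFFFFFFFF
def rol(x, n): return ((x << n) | (x >> (32 - n))) & MASK

def states_16(w, iv):
    """State words produced by the first 16 SHA-1 steps (IF rounds)."""
    a, b, c, d, e = iv
    out = []
    for i in range(16):
        f = ((b & c) | (~b & d)) & MASK
        a, b, c, d, e = (rol(a, 5) + f + e + 0x5A827999 + w[i]) & MASK, a, rol(b, 30), c, d
        out.append(a)
    return out

def is_neutral(w, iv, word, bit, step):
    """A message bit is 'neutral' up to `step` if flipping it leaves the
    states (standing in for the attack's conditions) through that step
    untouched; each neutral bit cheaply doubles the partial solutions."""
    w2 = list(w)
    w2[word] ^= 1 << bit
    return states_16(w2, iv)[:step+1] == states_16(w, iv)[:step+1]

iv = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
w = list(range(16))
# Bits in message words used only after `step` are trivially neutral; the
# attack hunts for non-trivial ones that stay neutral many steps further.
assert is_neutral(w, iv, word=15, bit=7, step=10)
assert not is_neutral(w, iv, word=0, bit=7, step=10)
```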
And now, of course, we have to write the attack algorithm. In this case, the attack algorithm for the first 16 steps, and all the tables and constants, are completely automatically generated, which is very convenient. But the part where most of the attack complexity lies has to be hand-optimized and hand-implemented, and especially for the GPU that has to be done completely by hand. And then just running the attack, since it's such a large-scale operation, is also not trivial. Hereafter, I want to highlight the three main techniques that we used, the techniques that allowed us to find this collision. Joint local collision analysis, JLCA, really maximized the success probability of the main differential path, so it really minimized the complexity. I want to talk about the GPU search, because instead of a very expensive search on CPU, it became much more cost-efficient, with a very significant gap. And I'll also talk briefly about the new part, the SAT step. So JLCA we introduced in 2013 at Eurocrypt. Instead of analyzing just one optimal differential path, we compute optimal differentials over the last 60 steps. A differential is an input-output difference tuple, where the probability of that tuple is computed as a sum over all possible differential paths that have this input and output difference. So we really use the benefit of all possible variations instead of just one single differential path. This technique can compute this efficiently for SHA-1, and it does so iteratively: it computes a set of possible differential paths over zero steps, then uses that to compute it over one step, then over two steps, and so on, until we have found the set we want, and we can immediately determine the differentials.
And to do so, it wants to start from a trivial set with zero difference, which is why we split the analysis into independent step intervals. There is a step interval from 33 up to 53 and one from 53 up to 61 where there is a completely zero difference in the state, so we start with a zero difference in the state and end with a zero difference in the state. Then it's very convenient: we can look at these tuples of input and output differences; they have to have this zero difference, so the only thing left is the message differences used to get there. And of course we just select all the message differences that have the maximum probability, which immediately guarantees the highest success probability. The next step is that we can also derive the minimum number of equations needed to ensure that we get this highest probability. For the last few steps, we know the starting difference is zero, but at the end we do have a non-zero difference, of course; that is what we relaxed with the two-block approach. So there we look at all highest-probability output differences and message differences, and then we select the message differences that allow the largest number of output differences. This gives us a factor-six speed-up for the first near-collision block, because there we don't care which output difference we get initially. For the second block, we are of course forced to use the negative of the first block's difference. And again, we can translate this into a minimum number of linear equations, so we get both the highest success probability and the fewest conditions that allow it. The first few steps are slightly different, though, because that's where the speed-ups happen.
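The iterative flavor of this analysis can be sketched with a toy dynamic program. The differences and transition probabilities below are entirely made up, but the key operation, summing the probability of all paths that meet in the same difference and then keeping the best differentials, is the one described above:

```python
from collections import defaultdict

def step(paths, transitions):
    """Extend each path distribution by one step, MERGING paths that reach
    the same difference; this merging over all variant paths is the point."""
    out = defaultdict(float)
    for diff, p in paths.items():
        for new_diff, q in transitions(diff):
            out[new_diff] += p * q
    return dict(out)

def transitions(diff):
    # Hypothetical toy model: a difference either stays (prob 1/2),
    # doubles modulo 8 (prob 1/4), or is absorbed to zero (prob 1/4).
    return [(diff, 0.5), (diff * 2 % 8, 0.25), (0, 0.25)]

paths = {1: 1.0}                   # start: a single-bit difference
for _ in range(3):                 # iterate over three steps
    paths = step(paths, transitions)
best = max(paths, key=paths.get)   # highest-probability output difference
assert best == 0
assert abs(sum(paths.values()) - 1.0) < 1e-12   # probabilities conserved
```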
And actually there, we want a fixed differential path up to step 23, so that we get fixed state conditions that we can use for the early-abort strategy and also to check the boomerangs. So we slightly tweak the analysis: instead of taking the probability over all possible differential paths, we set a fixed part, and we take the maximum over these fixed differential paths within the differential. This allows us, even with fixed differential paths, to maximize the success probability and, again, obtain all these conditions that we want. So the next part: after we found this system of equations, at least for the second near-collision attack, we ran into the problem that the differential path was enormously overdefined. The first five state words are directly determined by the chaining value, but in the words output in the next five steps we only have 15 free state bits, so 15 degrees of freedom. Besides these state equations, we also have message equations that aren't listed here, and we have 23 of them. So it is directly overdefined, and we don't really expect there to be any solution at all. We tried various techniques, trying many variant differential paths with different signs and different conditions, to find something solvable; that really took too long, and we didn't manage to get around it. So finally, we decided to just encode it as a SAT problem. We created, in SAT, equations over the two compression functions, we forced the input chaining values, we forced the conditions of the differential path from step eight onward, and we forced the linear message equations over those first eight steps, and we just asked the solver: give me a variant differential path that fits in this place, that I can put in directly as a drop-in replacement path.
And this just solved the problem in one hour, which shows that this can now be solved efficiently. So we don't need any degrees of freedom from the first near-collision block, and we don't need any conditions, however complicated, on the output chaining values of the first near-collision block, which is very convenient. The reason we finally really managed to make this work is, of course, that we ran it on GPUs, which are much more cost-efficient. A GPU that we looked at, for instance, the GTX 970, has about 1,600 cores, versus regular CPUs; there are now CPUs coming with 20 cores, but still, this is many times more. There is one catch, though: you have to avoid branching to make efficient use of them, because 32 cores in the GPU are linked together. If they want to execute the same instruction, they do it at the same time; if they want to execute different instructions, these get serialized, so if there are 32 different instructions, it takes 32 cycles to execute them. So you really have to avoid branching to make efficient use of a GPU. Well, the problem is, our collision search isn't just a raw SHA-1 computation; it's a depth-first search. At a certain step, we explore all degrees of freedom, we check if the conditions are met, and if so, we go forward; if we have exhausted our freedoms, we have to backtrack. This doesn't seem naturally compatible with a GPU, but we still built a very efficient framework for the GPU that did allow efficient computation; we just changed the model of how we work on the problem. What we do is store partial solutions up to a given step in shared buffers.
So one buffer stores all partial solutions up to step 26, and then every thread of a warp, so 32 threads, takes a solution from that buffer, and they all execute the same code to extend it one step further. Every core tries all its degrees of freedom and then stores the resulting partial solutions. Now the only branching that happens is whether or not to store an extended solution into the next buffer, instead of jumping to entirely different parts of the program. And to really enforce a depth-first search, we enforce that the threads always first look at the last buffer that has enough work. If we compare this GPU approach with CPU: the original work that our attack is based on, from Eurocrypt 2013, has a theoretical estimate of 2 to the power 61. If you consider that one CPU core does about 2 to the power 34 SHA-1 operations per hour, you end up with an estimate of 15,000 CPU core years. Whereas if you look at the GTX cards, in raw SHA-1 operations they do about 2 to the power 42.3 per hour; that means it would only take 50 GPU years, significantly less at about the same cost, since one CPU core and one GPU are roughly of the same cost. But unfortunately, our collision attack, as I already showed, is more complex than raw SHA-1. So how big a factor do we lose relative to the CPU? This was already analyzed in previous work implementing the freestart collision attacks on SHA-1, and there it was shown that in the collision search we get a comparable performance of 2 to the power 41.1 SHA-1-equivalent operations per hour instead of that raw figure. With this figure it takes about 112 GPU years. So the theoretical estimate on CPU translates to this theoretical estimate on GPU: we lose a factor of two in efficiency, but we gain a lot in practical cost efficiency for the overall attack.
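A Python stand-in (the real implementation is CUDA) for this buffer-based work model. Buffer count, batch sizes, and the extension rule are all hypothetical; the point is that all "threads" of a warp run the same extend code, and the only divergence left is whether an extended solution gets stored:

```python
from collections import deque

WARP = 32
buffers = [deque() for _ in range(4)]   # one buffer per attack step (toy)
buffers[0].extend(range(1000))          # hypothetical base solutions

def extend(sol):
    """Stand-in for trying free bits and checking the step's conditions:
    here, odd values simply fail the 'conditions' and are dropped."""
    return sol * 2 if sol % 2 == 0 else None

def pick_step():
    # Depth-first flavor: prefer the deepest buffer that still has work
    # (the real framework prefers the deepest buffer with a full warp).
    for k in range(len(buffers) - 2, -1, -1):
        if buffers[k]:
            return k
    return None

while (k := pick_step()) is not None:
    # One lock-step "warp": up to 32 threads each take one partial solution
    # and execute the same extension code for step k.
    batch = [buffers[k].popleft() for _ in range(min(WARP, len(buffers[k])))]
    for sol in batch:
        ext = extend(sol)
        if ext is not None:             # the only branch: store or skip
            buffers[k + 1].append(ext)

# All 500 even base solutions survive every toy step; odd ones are culled.
assert len(buffers[-1]) == 500
```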
So this whole attack was run on Google infrastructure, which is a very large heterogeneous cluster, distributed over the world, with different CPUs and GPUs. One of the problems is that it is completely proprietary, both the compiler and the job system, which we had no knowledge of and could not get knowledge of. So basically, we wrote the source code and handed it to Google, and then there was this blind adaptation phase: blind for us, because we didn't know what they were actually using and what the constraints were, and blind from the other direction, because we wrote the code for ourselves, with very few comments and little documentation. But with just some email contact we really managed to overcome several problems. One problem, for instance, was that in CUDA, the GPU development framework we used, we used managed variables, which conveniently move variables between CPU memory and GPU memory automatically. Apparently, in Google's compile system this compiles completely fine, but when you actually execute it, managed variables aren't supported at all, and you don't see any error; you just have this feature that seems to work but doesn't. Trying to figure that out was also time-consuming. So our attack consisted of two sub-attacks. The first near-collision attack we actually ran on CPU, because we already had the source code lying around, so we avoided the development time of building the GPU attack. It was run on about 100,000 CPU cores over the course of several weeks, and we actually executed it twice, because at the time we didn't have the SAT step that efficiently solves the overdefined problem, and we wanted to keep some degrees of freedom; so this second run is now completely unnecessary.
And then the second near-collision attack, which theoretically is six times as hard, was the bulk of the computation and was run on various NVIDIA Tesla cards; this translates to 114 K20 GPU years, or 71 K80 GPU years (the K80 has more GPU cores). That is quite a big number, but Google really has a lot of GPUs: we ran on at least 3,000 GPUs, and the majority of the computation ran in just eight calendar days. So while it was two years of work, most of that was development time; that the bulk of the computation took just eight calendar days really shows how practical SHA-1 collisions now are. So this is our collision: two messages of 128 bytes. And of course, we didn't want to create just a random collision; we wanted to create something meaningful and actually reusable, because a SHA-1 collision is very expensive, so we want something that can be reused. So we have one collision which is just a prefix of a PDF file, and from it you can create complete pairs of distinct PDF files with different embedded JPEG images. The way it works is basically this: we have the PDF header and a JPEG header, and we start a JPEG comment; here in this orange part we have the SHA-1 collision, and there is a difference there, so the length of the comment field differs between the two files. In one file the JPEG parser will skip over the comment and start processing image one. In the other file it will jump to a second JPEG comment of a certain length, skipping over image one, and the parser will then process image two. And of course the collision is in the prefix, so you can plug in whatever image one and image two you want. You can actually do this yourself: there is a link to a script somebody created that does this for any two PDF files you want.
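The comment-length trick can be mimicked at the byte level. This is a toy layout, not the real shattered.io file structure, and it ignores the collision itself: in the real files, the differing length bytes sit inside the colliding SHA-1 blocks, so both variants share one hash. Here we only mimic the parsing side:

```python
IMG1 = b"<IMG1>"
IMG2 = b"<IMG2>"
# Inner comment header: its length field covers IMG1, so a parser that
# follows it skips straight over image one.
INNER = b"\xff\xfe" + (2 + len(IMG1)).to_bytes(2, "big")
TAIL = INNER + IMG1 + IMG2

def variant(show_img1: bool) -> bytes:
    # Outer comment: its length either swallows INNER (parser then lands
    # on IMG1) or stops just before it (parser follows INNER to IMG2).
    # In the real attack this length byte is the colliding difference.
    length = 2 + (len(INNER) if show_img1 else 0)
    return b"\xff\xfe" + length.to_bytes(2, "big") + TAIL

def first_image(data: bytes) -> bytes:
    """Minimal comment-skipping parser: JPEG comment marker FF FE is
    followed by a 2-byte length counting itself plus the payload."""
    i = 0
    while data[i:i+2] == b"\xff\xfe":
        i += 2 + int.from_bytes(data[i+2:i+4], "big")
    return data[i:]

# Identical tails, one differing length field, two different images shown.
assert first_image(variant(True)).startswith(IMG1)
assert first_image(variant(False)).startswith(IMG2)
```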
I mean, we didn't really expect it to break something directly, but we did break something, namely Subversion repositories, because Subversion was using SHA-1 for file deduplication but MD5 to check that everything was transferred correctly. Putting a SHA-1 collision into a repository simply broke it, because it could never get anything sane out of it anymore that actually validated under MD5. Of course, we had some more impact: Git started moving away from SHA-1, and Google Drive and Gmail now actively check for SHA-1 collisions, as do Git and GitHub. So at the end of my talk I briefly want to mention SHA-1 collision detection: real-time detection of SHA-1 collisions from just a single message, so you don't need both messages. This is now used by default in Git and GitHub, in Gmail, Google Drive, and Microsoft OneDrive. So that's very nice: real-time protection against these SHA-1 collision attacks. I basically want to end my talk here. Are there any questions?