We've got two talks for you. The first one is a freestart collision for full SHA-1. It's joint work with Marc Stevens, Pierre Karpman and Thomas Peyrin, and Marc Stevens will give the talk.

Okay, thank you Bert. First some background. As we all know, SHA-1 is a cryptographic hash function standardized by NIST in 1995, and it maps arbitrary-length bit strings to a 160-bit output that is essentially random looking. Although we cannot formally define collision resistance for these dedicated hash functions, an informal definition of collision resistance for SHA-1 is that it is practically infeasible to find two different messages X and Y that have the same hash. Now, there is a generic birthday search attack that allows you to find such collisions in about 2^80 SHA-1 calls, and we can already see that 80-bit security is not enough nowadays: if you look at Bitcoin, with the network's current computing power this would take only about two months of hashing, so this amount of computation is attainable. Widely used digital signature standards like SHA-1-RSA, SHA-2-RSA and, before them, MD5-RSA are all built on the hash-and-sign principle: to sign a message you first hash it, then this hash is signed with public-key cryptography (RSA with a private key), and to verify, anybody can compare a recomputed hash of the message against the hash recovered from the signature using the public key.
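To make the 2^80 figure concrete, here is a minimal sketch of the generic birthday search mentioned above, run on a toy 24-bit truncation of SHA-1 so it finishes in a fraction of a second (the function names and the counter-message scheme are my own illustration, not from the talk):

```python
import hashlib

def truncated_sha1(data: bytes, nbytes: int = 3) -> bytes:
    """First `nbytes` bytes of the SHA-1 digest (a 24-bit toy hash)."""
    return hashlib.sha1(data).digest()[:nbytes]

def birthday_collision(nbytes: int = 3):
    """Generic birthday search: hash distinct messages until two collide.

    For an n-bit hash this takes about 2^(n/2) calls on average; for the
    full 160-bit SHA-1 output that average is the 2^80 from the talk.
    """
    seen = {}  # truncated digest -> message that produced it
    i = 0
    while True:
        msg = i.to_bytes(8, "big")
        h = truncated_sha1(msg, nbytes)
        if h in seen:
            return seen[h], msg
        seen[h] = msg
        i += 1

x, y = birthday_collision()
assert x != y and truncated_sha1(x) == truncated_sha1(y)
```

For a 24-bit hash this stores on the order of 2^12 entries before colliding; at 160 bits, both the 2^80 hash calls and the memory make the same approach infeasible, which is why dedicated cryptanalysis matters.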
Now, because these signature standards use the hash directly, their security depends on the collision resistance of the hash function: if an attacker can build a collision for SHA-1 and gets one of the two messages signed by someone, then that signature is also a valid signature for the other message, because the hash computed on the message side is the same as the hash reconstructed from the signature. This would not be possible if the signer used randomness, for instance a MAC with a randomly generated key supplied together with the signature, instead of a plain hash function. It has been well known for more than ten years that SHA-1 is not collision resistant: in 2005 Wang et al. presented a collision attack with complexity 2^69, and this was later improved to about 2^61 in 2013. That is still not really practical, so no collisions have been found yet. Schneier has a blog post about the projected costs of a SHA-1 collision, showing the declining cost of computing such a collision over the years due to Moore's law: he predicted in 2012 that it would cost almost 3 million dollars, dropping to about 700,000 dollars today and to about 43,000 dollars by 2021, just by renting computing power on Amazon EC2. Now, SHA-1 accepts arbitrary-length bit strings and processes them using the Merkle-Damgård construction: it splits the padded message into 512-bit blocks and processes them iteratively using a compression function that only takes fixed-size inputs.
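The Merkle-Damgård iteration just described can be sketched in a few lines. This is a generic sketch with SHA-1's parameters (64-byte blocks, length-padding with a 1 bit and a 64-bit bit count); the `compress` argument is a placeholder for any fixed-input-size compression function, not the real SHA-1 one:

```python
import struct

def md_hash(message: bytes, compress, iv: bytes, block_size: int = 64) -> bytes:
    """Merkle-Damgard iteration as used by SHA-1 (sketch).

    `compress` maps (chaining_value, one block) -> new chaining value.
    Padding: a 1 bit, zeros, then the 64-bit message bit length, as in SHA-1.
    """
    length_bits = 8 * len(message)
    padded = message + b"\x80"
    padded += b"\x00" * ((block_size - 8 - len(padded)) % block_size)
    padded += struct.pack(">Q", length_bits)
    cv = iv
    for i in range(0, len(padded), block_size):
        cv = compress(cv, padded[i:i + block_size])
    return cv
```

As a toy usage you can plug in `compress = lambda cv, block: hashlib.sha1(cv + block).digest()` with a 20-byte `iv`; the point is only that the iteration reduces hashing an arbitrary-length message to repeated calls of a fixed-size compression function.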
Now, there is a security reduction: from a SHA-1 collision you can derive a collision for the compression function. This means that if the compression function is collision resistant, that is, if it is practically infeasible to find collisions for the compression function, then it is also practically infeasible to find collisions for SHA-1. But it also implies that once we have found a collision for the compression function, this security reduction no longer holds. So, SHA-1 freestart collisions. Example SHA-1 collisions have been thought to be imminent ever since that first attack, yet if we look at what has happened over the years, it really shows that the analysis is more complicated than was thought, and even the 2^61 attack remains too expensive to carry out quickly. So in this work we wanted to focus on freestart collisions, meaning a collision attack on the compression function rather than on SHA-1 itself; it is a weaker attack on SHA-1 than a full collision attack, but it is more practical and easier. To make it more practical we also looked at using massively parallel architectures, specifically NVIDIA graphics cards, because they have a significantly higher performance-per-cost ratio compared to regular CPUs. And to really make sure we get everything out of the cryptanalysis, we use joint local collision analysis, a very precise analysis over the later steps of SHA-1, to ensure we get optimal complexity there and a maximized number of degrees of freedom.
In previous work we already created a freestart attack on reduced-round SHA-1, only 76 steps out of the 80, and there we only applied neutral bits, a very simple form of message modification, as a speed-up technique. Still, the attack was very quick: it took only about five days to generate such freestart collisions for 76-step SHA-1. In this work we have really built on top of that and extended our GPU framework so that we can also do boomerangs, a much more advanced message modification speed-up technique, to cover the full 80 steps. It is well known that towards the end, covering more steps of SHA-1 makes the attack complexity increase very quickly, but nevertheless, with the additional boomerang tool, finding freestart collisions for full SHA-1 takes only about 640 GPU-days. We actually built a cluster of 64 GPUs, 16 machines, just regular desktops with four graphics cards each, and it took only about ten days to do this computation. This is the first practical attack on full SHA-1, and we have an example collision on our website, the SHAppening, where we will also make the source code available for other people to run. This is not a collision for full SHA-1 yet, but based on this work we can present new, updated cost estimates for finding a full SHA-1 collision: using GPUs we estimate it would take about 40,000 GPU-days, and renting this on EC2 would cost about 100,000 dollars, which is significantly lower than Schneier's earlier estimates.
As an overview of how our attack works: if we look at the SHA-1 compression function, we have the message of 16 words, which is expanded to 80 words using a linear recurrence relation, and we have a chaining value consisting of five words, which is then expanded into 85 state words using a nonlinear step function, where each step uses one expanded message word. Finally there is the Davies-Meyer feed-forward, where the input chaining value of five words is added to the last five state words computed in this fashion. The main tool for collision attacks is differential paths: we consider two strongly related computations of the compression function and look at the differences, and a differential path is a precise description of the propagation of those differences through the compression function. Here it is very important to note that the last 60 steps determine most of the attack's complexity, as we will see; the first few steps are easily dealt with, and it is really the last 60 steps that we have to optimize. There we use joint local collision analysis, which was developed in 2013 and improved in 2015 and 2016, where we consider the set of all differential paths adhering to a so-called disturbance vector, and can then determine the maximum success probability and also maximize the number of degrees of freedom. Once we have this differential path, we have to translate it into a system of equations to solve, which is very easy: we get linear equations on the first 16 words of the message, which is very simple, and for the state bits we may have very simple equations as well. The first 16 steps can easily be solved just by choosing the correct values, but this already determines the remaining 64 steps, which are definitely harder to cope with.
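The three stages just described, the linear message expansion, the 80 steps over 85 state words, and the Davies-Meyer feed-forward, are all visible in the standard SHA-1 compression function, which can be written out directly (this follows the public FIPS 180 specification, not the authors' attack code):

```python
import struct

MASK = 0xFFFFFFFF

def rotl(x: int, n: int) -> int:
    """32-bit left rotation."""
    return ((x << n) | (x >> (32 - n))) & MASK

def sha1_compress(cv, block):
    """One SHA-1 compression: 5-word chaining value + 64-byte block -> 5 words."""
    # Stage 1: linear recurrence expands 16 message words to 80.
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    # Stage 2: 80 nonlinear steps, each consuming one expanded message word
    # (the 5 input words plus 80 new ones are the 85 state words of the talk).
    a, b, c, d, e = cv
    for t in range(80):
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        a, b, c, d, e = (rotl(a, 5) + f + e + k + w[t]) & MASK, a, rotl(b, 30), c, d
    # Stage 3: Davies-Meyer feed-forward adds the input chaining value back in.
    return tuple((x + y) & MASK for x, y in zip(cv, (a, b, c, d, e)))
```

The feed-forward in the last line is exactly what a freestart attack exploits: if you are free to choose the input chaining value, you can arrange an end-of-computation difference that cancels against a chaining-value difference in this addition.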
We have some speed-up techniques, neutral bits and boomerangs, and these make very predictable changes up to step 24, where we can generate new pairs satisfying the conditions very cheaply. This basically means we only have real control over SHA-1 up to about step 24; the remaining conditions after that have to be fulfilled probabilistically, so we have to generate many solutions up to step 24 and check whether the later conditions are satisfied. Now, in a freestart collision attack we are not going to start completely from the beginning; we are actually going to start from the middle. The advantage is that the hardest part becomes a little cheaper, which gives us more freedom and a lower attack complexity. The disadvantage is that we no longer have full control over the input, so we cannot build a collision for SHA-1, only a collision for the compression function: we have differences at the beginning and at the end, but we use the feed-forward to ensure that these are chosen such that they cancel out, so we really get a collision. For the GPUs we used the NVIDIA GTX 970; a new generation was already upcoming, promising an even higher price-performance ratio. These cards have 1664 cores running at about 1.2 GHz with very good throughput, one instruction per cycle per core except for bitwise cyclic rotations, and they are very cheap, just 350 euros. Now, this is very different from a regular CPU, because the GPU has a single-instruction-multiple-threads model, where execution is bundled in warps of 32 threads that all have to execute the same instruction; if they do not, everything is serialized, with each distinct instruction executed in a separate cycle, so it is really important to minimize branching.
On the GPU you can actually run more threads than you have physical cores, because it transparently schedules runnable warps onto cores. This is another advantage of the GPU over the CPU, because you can hide the latency of computations and of memory accesses this way. But you also have to be very careful about memory reads: the memory operations of the threads in a warp have to be very close to each other, basically in the same banks, and then everything is very fast; if they are too far apart, every memory access is serialized again. Normally a collision search is a depth-first tree search, so it is very highly branching. To make this work efficiently on a GPU, we split the entire computation into steps and use shared buffers between steps that store partial solutions. To have a warp compute something, it looks at the last queue where there is work, and every thread of the warp loads one partial solution and goes over all the remaining degrees of freedom for that particular step; it verifies the conditions, throws the candidate away if they are not satisfied, and otherwise, together with the other threads in the warp, stores the partial solutions for the next step in the next shared buffer. Instead of a depth-first search we always process the last queue with enough work. We have thereby reduced essentially all branching to the decision of which step to process, and there the entire warp does the same thing, so there is no real branching; there is only a minor branch on whether or not to store something, and even for the stores everything is loaded and written very closely together. So we have really optimized the parallelism in instructions but also in memory access. And so we have an actual freestart collision for full SHA-1,
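The buffered step-by-step search just described can be modelled on a CPU in a few lines. This is a toy sketch of the scheduling idea only, not the authors' CUDA code; `extend` and `check` are hypothetical stand-ins for "enumerate the remaining degrees of freedom at this step" and "verify this step's conditions":

```python
from collections import deque

def pipelined_search(n_steps, extend, check, want=1):
    """Sketch of the talk's GPU tree search, replacing deep recursion
    with one queue of partial solutions per step.

    Repeatedly pick the deepest non-empty queue (the GPU version picks
    the last queue with *enough* work for a full warp), pop a partial
    solution, enumerate its extensions, and push the ones that pass the
    step's conditions into the next queue. On the GPU a whole warp does
    this in lockstep; here a plain loop plays that role.
    """
    queues = [deque() for _ in range(n_steps)]
    queues[0].append(())  # the empty partial solution
    results = []
    while len(results) < want:
        live = [s for s in range(n_steps) if queues[s]]
        if not live:
            break  # search space exhausted
        step = max(live)  # deepest queue with work
        sol = queues[step].popleft()
        for cand in extend(sol, step):
            if check(cand, step):
                if step + 1 == n_steps:
                    results.append(cand)  # full solution found
                else:
                    queues[step + 1].append(cand)
    return results

# Toy usage: build 4-tuples of nibbles whose running sum stays even.
sols = pipelined_search(
    4,
    lambda sol, step: [sol + (v,) for v in range(16)],
    lambda cand, step: sum(cand) % 2 == 0,
    want=3,
)
```

The payoff of this shape on a GPU is that the only data-dependent branch left is "store or discard", and all loads and stores within one step hit one contiguous buffer, which keeps warps coherent and memory accesses coalesced.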
which is the first practical attack on full SHA-1. Schneier predicted in 2012 that around 2015 or 2016 a collision would cost about 700K dollars, based on Amazon EC2 rates, Moore's law, and the state of the art in 2012. With our analysis we come up with new predictions for the cost of a collision for full SHA-1: you take the best attack, with complexity 2^61, and we estimate this costs about 40,000 GPU-days on the older GPUs that Amazon EC2 uses; on the GTX 970 that we used it would actually take fewer GPU-days. If we look at Amazon EC2 spot prices, this would cost about 100K dollars, which is really a factor 7 lower than the earlier estimate for 2015, and it puts a lot more pressure on how fast we have to deprecate SHA-1. So, to conclude, and I am actually finishing very fast: we have a first practical attack on full SHA-1, and this really invalidates SHA-1's security reduction, but there is no SHA-1 collision yet. This work, and also the previous freestart collision work, has served to build up towards the real SHA-1 collision attack; now we have the GPU framework and all the tools built, so we can actually start undertaking this. And the industry's deprecation of SHA-1 has been really slow: among secure websites there are still plenty of SHA-1 certificates out there, and they are still accepted until the end of this year, while NIST already said in 2011 that SHA-1 should be deprecated for digital signatures by the end of 2013. So the industry's deprecation is really slow, and that is why we really want to build a practical example SHA-1 collision, to speed up this deprecation. But that is future work. Okay, thank you. Any questions?

Very nice work. Can you just clarify something: this estimate of seven hundred thousand dollars, is that for 2^80 computations, I assume with low memory accesses? Sorry, the 700K from Schneier, or
ours? No, for Schneier. Right, yeah, I assume it's 2^80 computations. It's about 2^61 actually; Schneier's estimate in 2012 was also based on an attack of roughly that complexity. Okay, I was assuming it was brute force, sorry; I wasn't sure about the relation between those numbers. Thank you very much. Other questions? So I guess you're all looking for an actual collision now. Yes. Are you making any progress? For the SHA-1 collision we are building an identical-prefix collision, not a chosen-prefix collision like has been done for MD5 with the rogue CA; identical-prefix is not as powerful, but I think we still have a very powerful example that we are building for it, though it will not be based on certificates. We have built the application example, and for identical-prefix you basically need two near-collision attacks, so two separate attacks on the compression function. We have finished the first one and are working on the second, but we still need to adapt our framework a bit more: the system of equations over the first few steps is more over-defined, so those steps are harder to cope with, and we still have to adapt our framework for that, but we are working on it. Thanks. Any other questions? First of all, this is great work, thank you for continuing to work on this. One question I have for you, based on what you have seen: do you feel it is absolutely certain that you can get a collision, that it is just a matter of more work? Or are there still likely to be places where you have to untangle things, find unresolved conflicts and find a new way around? What do you think, is it just a matter of throwing more force at it? Yeah, I mean, given all the tools, and we really have a lot of degrees of freedom left in the overall attack, and we really understand everything about all the parts, I don't see any big hurdles. The more
immediate hurdle here is that now the first few steps are over-defined, so it is a little harder to find a solution for the first few steps. But if you look at the complete attack cost, the complexity spent on the first few steps is completely negligible; with all the speed-up techniques, that is where the time is really spent later on, so maybe just one solution over the first few steps is sufficient to do the entire attack. We did not have to cope with that before, so it is now mostly a man-hour issue: we have to adapt the tools for it. Then basically I see no real issue, except of course actually running the attack on really a lot of GPUs, which, like any very big distributed computing project, also takes some effort. So I think that with the computing power we have available, the available man-hours are now more of a hurdle than the computing power. Thanks. Any other questions? Okay, if there are no other questions, thanks Marc again.