Hello, my name is Mike Stay. I'm the CTO of Pyrofex Corporation, and today I'm going to tell you about how we recovered several hundred thousand dollars worth of Bitcoin from an encrypted ZIP file.

Around 20 years ago, I was working as a reverse engineer and cryptanalyst for AccessData while getting my physics degree at BYU. It was the late 90s, early 2000s. The US government had been gradually easing off export restrictions on software containing cryptography, but most software still contained pretty worthless password protection. With most desktop office software, I'd reverse engineer it to figure out what algorithm it was using for access control and then break the crypto. It was a never-ending stream of interesting but not impossible math puzzles. I wrote about 40 password crackers during my time there. We would sell them to home users, system administrators, and local and federal law enforcement agencies. I got to go down to the Federal Law Enforcement Training Center in Glynco, Georgia a few times to teach Secret Service, FBI, and ATF agents about cryptography and the use of our products.

Two of the projects really stand out in my memory. The first was Microsoft Word 97. Before Word 97, the files were encrypted by XORing the bytes with a repeating 16-byte string derived from the password. The most common bytes in a Word file were either 0, 255, or 32, which is space. So we'd just look at the most common character in each of the 16 columns, try the 3^16 variations on those, and recovering the key was usually instantaneous. But to help people feel like they'd gotten their money's worth, we'd put on a little show like the scene in WarGames, with an animation of random characters that would gradually reveal the right password.

Word 97 changed that. It might have been possible to find out the encryption format through the Microsoft Developer Network, but we were a small company and couldn't afford the subscription.
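The column-frequency attack on the pre-Word-97 XOR scheme can be sketched in a few lines of Python. This is a hypothetical reconstruction, not AccessData's actual cracker: for each of the 16 key positions, it recovers the (at most three) key bytes implied by assuming the column's most common ciphertext byte decrypts to 0, 255, or 32.

```python
from collections import Counter

LIKELY = (0x00, 0xFF, 0x20)  # most common plaintext bytes in old Word files

def candidate_key_bytes(ciphertext, keylen=16):
    """For each key position, return the (<= 3) key bytes implied by
    assuming the most common ciphertext byte in that column decrypts
    to 0x00, 0xFF, or 0x20 (space)."""
    candidates = []
    for col in range(keylen):
        column = ciphertext[col::keylen]
        most_common = Counter(column).most_common(1)[0][0]
        candidates.append({most_common ^ p for p in LIKELY})
    return candidates
```

Trying every combination from the 16 candidate sets is at most 3^16, about 43 million keys, and in practice the right key usually falls out of the first combination tried.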
It also wasn't clear we'd be allowed to write the cracker if we did get the info from them. So to figure out how it worked, I used SoftICE. There was a button on a cable that would go to a card in the computer. I'd type the password, then hit Enter and the button on the cable as fast as I could, and hope that it would break somewhere in the crypto code. I'd walk up the stack trying to figure out where the algorithm was. This was in the days before IDA Pro, so I printed out a few dozen pages of assembly code, taped it to the wall, and drew all over it with a red crayon, updating the function names and so on as I figured stuff out. I was very pleased when I finally worked it all out.

At the time Microsoft was only allowed to export 40-bit cryptography, so they did as much as they were legally permitted to do. They'd repeatedly MD5 the password with some salt, which is randomly chosen bytes stored in the file, and from that they'd get a 40-bit key. Then they'd add salt to that and repeatedly hash it again, so it took about half a second to test a password on the computers at the time. So despite being a 40-bit key space, it was a fairly hard problem to attack. We had to resort to a dictionary attack because breaking it outright was pretty much impossible. We did eventually write a cracker to brute-force the 40-bit key space using the fancy MMX instructions on the Pentium, for larger companies that had their own computer labs and resources, and other large agencies. This was long before graphics processors were available. We had one place that ran the software for nine months before finally getting in.

The other really fun one I did was ZIP archives. The developer of PKZIP, Phil Katz, made the decision, which was unusual at the time, to document his file format and include it with his software. That made it a favorite of developers. Roger Schlafly designed the stream cipher used for encrypting the archives.
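The shape of that Word 97 derivation, hash the password with salt, truncate to 40 bits, then keep hashing to slow guessing down, can be sketched like this. This is a schematic illustration only; the real Word 97 format's exact inputs, structure, and iteration counts differ.

```python
import hashlib

def derive_40bit_key(password: str, salt: bytes, rounds: int = 50_000) -> bytes:
    """Schematic key stretching: repeated MD5 over password and salt,
    truncated to 5 bytes (40 bits). Not the exact Word 97 algorithm."""
    key = hashlib.md5(password.encode() + salt).digest()[:5]  # 40-bit key
    # Iterating the hash means each password guess costs `rounds` MD5
    # calls instead of one, which is what made the dictionary attack
    # the only practical option at the time.
    for _ in range(rounds):
        key = hashlib.md5(key + salt).digest()[:5]
    return key
```

The key space is only 2^40, but the per-guess cost is what matters for a dictionary attack: at half a second per candidate, even a modest wordlist takes hours.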
The ZIP standard quickly became by far the most popular compression format on Windows, and many other formats, like Java's JAR format and OpenOffice's document formats, are really just ZIP files with a particular directory structure inside. InfoZip is an open-source implementation of the software; it was used as the basis for nearly all other branded ZIP software, like WinZip. At the time I was trying to crack it, WinZip had 95% of the market according to CNET Downloads. Eli Biham and Paul Kocher had published a known-plaintext attack on the cipher, but the known plaintext was compressed plaintext. To get the Huffman codes at the start of a deflated file, you basically need 32 kilobytes of the file, so the attack was practically useless to law enforcement agencies.

This diagram is an illustration of the internal workings of the ZIP cipher. Each of these boxes represents a byte, eight bits. The cipher has 96 bits of internal state split into three 32-bit chunks, called key zero, key one, and key two. The subscript indicates the state of that chunk after having processed that many bytes of text. So here this is the initial state. We feed in the first byte of text and we get the subsequent state of key zero. Then we take part of that and feed it in here, and we get the subsequent state of key one. Then we take one of these bytes and feed it in here and get the subsequent state of key two. Then we do one more operation on these two low bytes to get the stream byte.

Going into a little more detail, this first section is CRC-32. CRC stands for cyclic redundancy check; it was designed for detecting errors on the circular tracks that you would find on floppy disks. So here we can see the shift: it's being shifted right eight bits. We take the eight bits that got shifted off the bottom and use them to look up a 32-bit word in this table. Then we take the byte that's coming in and we also look up a 32-bit word for that one in the table.
Then we XOR the high 24 bits that were left after shifting down with these two words, and we get the new state of key zero. We take the low byte of the new value of key zero, add it to the current state of key one, multiply by this constant, add one, and get the new state of key one. This is called a truncated linear congruential generator: truncated because we're only taking the high byte as output, linear because it's addition, congruential because it's modulo 2^32, and generator because it's spitting out pseudorandom bytes. So it takes this byte that it's spitting out and feeds it into CRC-32 again. Now this is a linear feedback shift register, but this one is linear with respect to XOR, while that one is linear with respect to addition with carries. So they don't work well together, and that's what really gives the cipher whatever strength it does have. So: CRC-32, truncated linear congruential generator, another CRC-32, and then this pseudo-squaring operation. If you look in the source code, the pseudo-squaring operation is like this: you take the low 16 bits of key two and OR it with two. So you set bit one, and then you multiply that by itself XOR one, so one of the factors is even and the other is odd. You shift it down by eight and take eight bits, so it's the middle of this pseudo-squaring operation. The result is a stream byte.

Now, the way this stream cipher works is that you initialize it first with a password, and then you encrypt the plaintext. To make that harder to attack, they encrypt some salt first. PKZIP used bytes that happened to be in memory: it would allocate 10 bytes and use whatever bytes were there. That's how it got its entropy. InfoZip, because it was built for use on Linux as well as Windows, couldn't do that, because when the memory was allocated, it was also initialized. So they got their entropy from the process ID and the timestamp.
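The cipher just described, CRC-32 into a truncated LCG into another CRC-32 into the pseudo-squaring step, is short enough to write out in full. Here is a Python rendering following the update rules documented in PKWARE's APPNOTE; the initial key values 0x12345678, 0x23456789, 0x34567890 and the constant 134775813 come from that spec.

```python
# Standard CRC-32 table, polynomial 0xEDB88320 (same as ZIP uses).
CRC_TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
    CRC_TABLE.append(c)

def crc32(crc, b):
    # one step: shift right 8 bits, XOR with the table entry selected
    # by the low byte XORed with the incoming byte
    return CRC_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)

class ZipCipher:
    def __init__(self, password: bytes):
        self.k0, self.k1, self.k2 = 0x12345678, 0x23456789, 0x34567890
        for b in password:          # password bytes are fed in like plaintext
            self.update(b)

    def update(self, b):
        self.k0 = crc32(self.k0, b)
        self.k1 = (self.k1 + (self.k0 & 0xFF)) & 0xFFFFFFFF
        self.k1 = (self.k1 * 134775813 + 1) & 0xFFFFFFFF   # truncated LCG
        self.k2 = crc32(self.k2, self.k1 >> 24)

    def stream_byte(self):
        t = (self.k2 | 2) & 0xFFFF          # set bit 1: one factor even, one odd
        return ((t * (t ^ 1)) >> 8) & 0xFF  # middle byte of the pseudo-square

    def encrypt(self, plaintext: bytes) -> bytes:
        out = bytearray()
        for p in plaintext:
            out.append(p ^ self.stream_byte())
            self.update(p)                  # state is fed the *plaintext* byte
        return bytes(out)

    def decrypt(self, ciphertext: bytes) -> bytes:
        out = bytearray()
        for c in ciphertext:
            p = c ^ self.stream_byte()
            out.append(p)
            self.update(p)
        return bytes(out)
```

Note that the state update always consumes the plaintext byte, never the ciphertext byte, which is why encryption and decryption stay in sync.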
You'd XOR those two together and call srand on it to seed the random number generator. Then it would use rand to generate 10 bytes as salt in the file, followed by the CRC-32 and the plaintext. Now, Biham and Kocher's attack had already been published when InfoZip was implemented. So the InfoZip developers knew that if they just used rand, then somebody who had the timestamp and the process ID could compute these bytes. The CRC was stored in the file to check that the password was entered correctly, so an attacker would know those bytes too, and that would give them 12 bytes, which was sufficient to mount a known-plaintext attack. So InfoZip tried to avoid that. And they made it harder, but not quite hard enough.

Naming these bytes individually: R0 through R9 are the bytes generated by rand, C0 and C1 are the CRC-32 bytes, and P0 through P3 and so on are the plaintext bytes. I'll call this array X. Now, when you feed the password into the cipher, it treats it as though it's plaintext. You feed in the first password byte; it updates key zero, key one, key two, spits out a stream byte, and then throws the stream byte away, until it gets to the last character of the password. That final stream byte is used to encrypt the first salt byte, R0. So I get this set of 10 bytes that I call Y, which is just the first 10 bytes generated by rand, exclusive-ORed with the 10 bytes generated by the stream cipher. Now, you'll notice here I call this S0, but this one is S1X. The ones that end in X depend on the values in X; S0 depends only on the password itself. That turned out to be their tragic flaw: when they encrypted a second time, using Y as the salt bytes in the file, they would generate S0 again, because it was the same password. So they would XOR with S0 once to get the salt byte, then XOR with S0 again to get the byte that was stored in the encryption header.
Now, all of these bytes are hard to figure out, because they depend both on the randomly generated bytes from rand and on the stream cipher bytes, twice over. So this is doubly encrypted stuff, but this first byte leaked. And given five files in the archive, all encrypted with the same password, this gave me eight bits of entropy per file to find the 32-bit internal state of the rand pseudorandom number generator. So given five files, I could simply take the first byte of each archive, run through all 2^32 possible internal states, and verify that every tenth byte gave me the byte I was expecting. So I could uniquely identify the internal state, and I could uniquely find these 10 bytes. That gave me some place to stand to mount the attack.

The next thing I did was notice that I didn't need all 96 bits to produce the next stream byte; I only needed 40 bits. Here I didn't need to know the second and the first byte of key zero separately; I only needed to know the XOR of this byte with that one. So that's an 8-bit guess. I know the value from rand that's going in, so I can predict this thing. Here I'm multiplying by this constant. I don't need to know the entire value of key one; I just need to know the high byte of key one times the constant. So I've distributed the multiplication across the addition we had earlier. Here is the high byte of that. And when I add one it might cause a carry bit, so I have to guess a carry bit. Then it comes in here; I know what the high byte of this is; I feed it in. I just need to know these values, and given the stream byte, I can then figure out what the low six bits are here that are needed. So this is how, given a plaintext and ciphertext pair, you figure out which bits you have to guess in order to get this stream byte. So what I would do is guess 40 bits, which would allow me to process this byte here.
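The rand-state search can be illustrated with a toy stand-in for rand(). The LCG below is not the libc generator InfoZip actually used, so treat this purely as a sketch of the search structure: one leaked byte per file, at a known stride in the output stream, filters candidate seeds until only the true state survives.

```python
def lcg_bytes(state, n):
    """Toy rand(): a 32-bit LCG returning the high byte each step.
    A stand-in for the real libc rand(), which works differently."""
    out = []
    for _ in range(n):
        state = (state * 1664525 + 1013904223) & 0xFFFFFFFF
        out.append(state >> 24)
    return out, state

def recover_state(leaked, stride=10, seeds=range(1 << 32)):
    """Return every seed whose output matches leaked[i] at position
    i*stride (the leaked first salt byte of each file). Five leaked
    bytes give 40 bits of filter, enough to pin down a 32-bit state;
    the real attack scans all 2**32 seeds."""
    n = stride * (len(leaked) - 1) + 1
    hits = []
    for seed in seeds:
        stream, _ = lcg_bytes(seed, n)
        if all(stream[i * stride] == b for i, b in enumerate(leaked)):
            hits.append(seed)
    return hits
```

Each leaked byte rejects roughly 255 of every 256 candidates, so with five bytes the expected number of false positives across the whole 2^32 space is well under one.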
And so then I could check, for each of the five files, does this value match what I expect? And since I had 40 bits that I was guessing, and 40 bits of entropy from the files to compare against, I could filter out all but one guess. Then for the next byte I would guess 26 more bits, and then just a few more bits to get the rest of the state after that point. Usually once I found the 40-bit one, the next one I could get within a few minutes. The whole attack ran in about two hours on a Pentium at the time, a ciphertext-only attack. I was able to break into these ZIP files; we caught several child pornographers that way. And I got a paper out of it, published at Fast Software Encryption in 2001. And that paper led, 20 years later, to a surprising message.

In October of last year, a guy contacts me out of the blue and says, "I read your paper on known-plaintext attacks, and I've got this password that I've forgotten. Is there anything you can do to help? Can you tell me how things are looking these days?" So I looked it over. It turns out he only had two files in the archive. With only two files, I didn't have enough information to derive the internal state of rand, and I didn't have enough information to filter the guesses I was making, so far more guesses would pass each filtering stage. By my back-of-the-envelope calculations, it worked out to be something like 2^70 work. So I told him: compare the work done this year to find hash collisions in SHA-1. That collision attack took on the order of 2^63 hash computations and cost around $100,000. This is not something you can use off-the-shelf software for. And then he blew my mind and said, "I could spend that much on this archive." At that point, I knew that he probably had several hundred thousand dollars of Bitcoin in this thing, and I started doing a more careful analysis.
There are 96 bits of key material that need to be guessed in various stages. But I can't guess the bits of key material directly; instead, I have to guess bits of a function of the key material and use those constraints to limit what the key material could be. This first part is CRC-32 of the initial key zero value with the zero byte. Since CRC-32 itself is invertible, once I guess these four bytes, there's a very simple linear operation I can do to recover the initial key zero value itself. With key one, it's a little more complicated. We take the initial key one value and multiply it by the constant from the truncated linear congruential generator raised to the nth power, and I'm guessing the high byte of that value. Given four of these, that puts enough constraints on key one that I can brute-force what key one is from those four bytes. The initial value of key two I can guess directly.

So in the first stage, I guess these five bytes, that's 40 bits, and I guess four carry bits. These carry bits come from the possibility that when I add one in the truncated linear congruential generator, it carries all the way up to the 24th bit, which would change the value of the most significant byte. So I have to guess these carry bits: two for each file, and there are two files, because each byte goes through two passes of encryption. Two passes times two files makes four carry bits that need to be guessed. So the total work for this stage is 2^44, and after filtering on the 16 bits from the file, I get 2^28 key guesses passing. In stage two, I guess these three bytes, that's 24 bits; four more carry bits makes 28; combined with the previous 2^28 surviving guesses, I have 2^56 keys to filter, and after filtering, 2^40 remain.
In stage three, I guess 16 bits plus four carry bits, makes 20; the previous 2^40 carried forward gives 2^60 keys to test, and after filtering, 2^44 pass the third stage. In the fourth stage, I guess 16 bits as before; four carry bits gives us 20; 44 from the previous stage plus 20 makes 64. This was the most expensive stage, 2^64. After filtering, 2^48 remain.

Now, in my original attack, this is where I would stop. I have all the key material. With these four bytes, I can brute-force key one: run through the 2^32 possibilities and get the value itself. I can invert this, and that would work. But at this point, it struck me: if I have to do 2^32 work for each of these 2^48 keys, that gives a total cost of 2^80. There was no way we could do a 2^80 attack. So I had to come up with a special approach, a new idea, to make this step much less expensive.

I remembered having read somewhere about an attack on truncated linear congruential generators that uses lattice reduction. Lattice reduction starts from the idea that if you have a lattice, which is like a vector space but with only integer coordinates, and a basis for that lattice, then you can find a nicer basis, where nicer means that the vectors are shorter and closer to orthogonal. So here I have a two-dimensional lattice. Here's a basis for that lattice, where one of these goes here and another goes there. Every point on this grid has a coordinate in terms of how many steps of this kind and how many steps of that kind it takes to get there. But lattice reduction gives you a much nicer basis: these are both shorter and closer to orthogonal. Lattice reduction is in general a hard problem; about half of the candidates for the post-quantum ciphers that NIST is considering right now are based on the hardness of lattice problems, for example NTRU. But in small cases it's tractable, and I had a very small case, only four or five dimensions.
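In two dimensions the idea is easy to show concretely. Lagrange-Gauss reduction, the two-dimensional ancestor of the LLL reduction Sage provides, just keeps subtracting the best integer multiple of the shorter basis vector from the longer one, much like the Euclidean algorithm:

```python
def gauss_reduce(v1, v2):
    """Lagrange-Gauss reduction of a 2-D integer lattice basis:
    returns a basis of the same lattice whose vectors are short
    and close to orthogonal."""
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1]
    if dot(v1, v1) > dot(v2, v2):
        v1, v2 = v2, v1
    while True:
        # subtract the closest integer multiple of v1 from v2
        m = round(dot(v1, v2) / dot(v1, v1))
        v2 = (v2[0] - m * v1[0], v2[1] - m * v1[1])
        if dot(v2, v2) >= dot(v1, v1):
            return v1, v2          # v2 can't shrink further: done
        v1, v2 = v2, v1            # otherwise swap and repeat
```

The reduced basis spans exactly the same set of integer combinations as the original, so the determinant (the area of the lattice cell) is unchanged; only the shape of the basis improves.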
So I went and poked around on the Cryptography Stack Exchange and found exactly what I needed. The basis vectors in this case were powers of the constant from the truncated linear congruential generator, which in our case was 0x08088405. If we call that C, then C, C^2, C^3, C^4, and so on become basis vectors for this lattice, and then you add in one more basis vector for the modulus, which in our case was 2^32, here called M. The idea was that given the four most significant bytes of the key one value times consecutive powers of C, I would do a linear transformation, get a value that was close to a lattice point in this nicer basis, round off to the exact lattice point, and then transform back, and that would give me the exact value of K1,0.

So I wrote up a Sage program to figure out how exactly that would work. Sage is a Python-based computer algebra system that lets you do linear algebra, among many other things. Here we have the modulus 2^32; we have the constant from the truncated linear congruential generator; we have the matrix that defines the basis for our four-dimensional lattice. The vectors themselves are the modulus and the powers of the constant, with minus ones on the diagonal. This line does the lattice reduction step. Now, to test how B worked, I would generate a random initial K1 value, then compute the actual powers of C times K1, modulo M, that I was going to try to recover, and print those out. The most significant bytes were the things that I had guessed, and I wanted to recover the Ks from the MSBs. So the secret stuff that I needed to know was the low 24 bits, the part that I had masked off up here. Given the vector of most significant bytes, I multiplied it by B, which did the basis change from one lattice to the other.
Then I rounded to the nearest vector in this nicer basis, which meant that I would get an actual value of K1,0 that would produce those same most significant bytes. Then I reversed that, transformed back, and got the guess, and printed it out. And when I started doing this, the guess was never right, or rarely right. That was because of the limited number of guesses I had here: if I could have done one more stage, I would have gotten the right answer every time, but here I had too few powers of C to pin it down exactly. But then I started checking the difference between the guess and the actual value. So I'd take the keys, subtract off the MSBs, and the difference between the guess and the actual value turned out to fall into one of 36 possibilities every time. As I thought about it, I realized that in this four-dimensional space, by truncating and using just the most significant bytes, I was adding a bunch of noise. The point in the four-dimensional space I wanted was at the center of a cell, and what I was actually getting was a point near the hull of this cell that tiled the four-dimensional space. And because it was always within this cell, there were only a finite number of other possible lattice points it could be. So I computed, for all of the 2^32 possible K1,0 values, what the set of guesses was, and found that they always fell into one of these 36 classes. So instead of having to try out all 2^32 possible K1s each time, I would only have to try 36. Instead of 4 billion values, 36. And that made the attack feasible again.

Once we'd guessed all the key material, we could filter the 2^48 keys at stage 4 by using the remaining bytes in the encryption header. My business partner, Nash Foster, then started working on adapting my CPU-based attack to run on GPUs. He wrote the code harness for getting code and data onto the GPUs, did all the CUDA stuff, and advised me on how to structure the attack.
We discovered very quickly, though, that getting the petabytes of possible keys onto the GPUs would take too long; the GPUs would just sit idle for most of the time we were paying for them, waiting for data to arrive. So I went back to the drawing board. In each stage I was guessing a whole bunch of bits and then filtering the results using the two bytes from the two archives in the file, keeping only about 1 out of 65,000. If I had some way of using that information to derive bits, rather than just guessing and checking, it would save a lot of work and, more importantly, a lot of network traffic. The problem with that idea was that the math is too complicated: it involves mixing finite fields with integer rings, and those simply don't play well together.

So I thought about some other cryptanalytic attacks I knew, and one that seemed promising was a meet-in-the-middle attack. A meet-in-the-middle attack usually applies to block ciphers, where the cipher uses one part of the key material to do the first half of the encryption and the second part to do the second half. The reason we have triple DES instead of double DES is precisely because of this attack. Triple DES gives only about 112 bits of strength, even though there can be three keys involved: you encrypt, then decrypt, then encrypt. And if it's only 112 bits, why not just do DES with one key and then DES with another key, encrypting it twice? Well, the answer is right here. You take your plaintext-ciphertext pair and encrypt the plaintext under all 2^56 possible first DES keys. Then you take the ciphertext and decrypt it under all 2^56 possible DES keys. And right here in the middle, chances are that one of them will match.
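Here is the attack in miniature, on a toy cipher with 16-bit keys rather than DES. Building a table of all first-stage encryptions and then decrypting backward costs about 2 * 2^16 cipher operations instead of the 2^32 a naive search over both keys would need:

```python
M = 1 << 16

def enc(x, k):
    # toy invertible block "cipher": XOR with the key, multiply by an
    # odd constant mod 2^16, then add the key
    return (((x ^ k) * 0x9E37) + k) % M

INV = pow(0x9E37, -1, M)   # multiplicative inverse exists: 0x9E37 is odd

def dec(c, k):
    return (((c - k) * INV) % M) ^ k

def meet_in_the_middle(pairs):
    """Recover (k1, k2) for double encryption c = enc(enc(p, k1), k2).
    pairs: known (plaintext, ciphertext) pairs; the first pair builds
    the table, the rest filter out false matches."""
    p0, c0 = pairs[0]
    middle = {}
    for k1 in range(M):                # forward half: encrypt under every k1
        middle.setdefault(enc(p0, k1), []).append(k1)
    found = []
    for k2 in range(M):                # backward half: decrypt under every k2
        for k1 in middle.get(dec(c0, k2), []):
            if all(enc(enc(p, k1), k2) == c for p, c in pairs[1:]):
                found.append((k1, k2))
    return found
```

The table trades memory for time, which is exactly why double DES buys almost nothing over single DES and why triple DES became the standard instead.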
So you sort all of these intermediate texts, and whenever you find two that are the same, one came from the plaintext and one came from the ciphertext, and that identifies the pair of keys that goes between them. A meet-in-the-middle attack works when you've got key material that is not all used at every stage of the cipher. And in this stream cipher, that seemed to be the case, right? I had only key zero at the beginning and only key two at the end, and it seemed like we could do something in the middle. But it wasn't quite right. Then I realized I could use the XOR of certain bytes in the middle of the cipher to mount the attack.

Here's a diagram of the cipher again. The cipher gets called four times in each stage for each byte: twice for each file, for the double encryption of the byte from rand. So in the center of the cipher, there are four different most significant bytes of key one, one after each of these encryptions. What I did was take the first of those four and XOR it with each of the other three to get a 24-bit key into a table, and under that key I would store my guess for these bits. Then from the other side, given a stream byte, there are 64 possible 14-bit values here. So I would guess some intermediate stream bytes, and six bits of this, boxed here, to figure out what the other eight are. That would allow me to derive linearly, from this stuff, what the most significant byte of key one was, again. So given the bits I was guessing on this side, I would derive a key and store those bits under that key. And then any time I had a collision, I would know that the set of bits I guessed here was consistent with the set of bits I guessed there, to give this plaintext going into that stream byte. Now that is far smaller than running through all of the possibilities and then filtering. The amount of memory it required was on the order of a few megabytes. So here is the code itself.
This is stage 1a, moving forward through the cipher. We guess the initial stream byte; because it was XORed with itself twice, we have no information about it, so we have to guess it. Then we guess chunk 2, which came from CRC-32 of key zero. Chunk 3 is the most significant byte of key one. Here are the carry bits: two bits for each of the two files, so 2^4 is 16 possible combinations of carry bits. This is debugging code for checking against files that we created ourselves. This is the first half of the cipher, and it updates an upper and a lower bound on the possible low 24 bits of key one, so that if it ever goes out of bounds we can throw the guess away. We compute various bytes and get through here. If a candidate has passed all of these upper and lower bound checks, then this converts those four most significant bytes to a map key and stores the candidate under that map key.

In the second half, we're guessing the stream 1x byte for file 0. We're guessing the 6-bit prefix, because each stream byte has 64 possible pre-images under the pseudo-squaring operation. So these are all of the indexed pre-images, and this gets the specific pre-image that we're caring about. There's more debugging information. Again, we do the second-half step, and if there are no consistent results, we just go on to the next guess. Here we're guessing the stream 1x byte for file 1. We do that second-half step, and again, if there are no consistent ones, we go on to the next guess. Finally, we compute the stream 1y bytes, and if there are none consistent with that prefix, we continue. So now, given each of the possibilities for the first and second and third bytes, we put them together into a map key and meet in the middle. If you'd like to look at the code, you can see how these get XORed together and how we can derive certain things. Starting at the bottom, we derive bits 15 to 2 of K and L from bits 15 to 2 of S and T and these bytes O and P that we've computed.
So in the center, we find those guesses that work, and we push back a candidate whenever the two match up with each other. And here's some more debugging information. So that's the idea of the differential meet-in-the-middle attack: computing forwards and backwards, exclusive-ORing one of the bytes with the other three, and then whenever the two guesses match, we have a set of candidates that's the product of the set from the first half of the attack with the set from the second half.

Using the differential meet-in-the-middle attack on stages one and two allowed us to reduce the complexity of stage one from 2^40 down to 2^22, that is, from trillions down to mere millions, and stage two from 2^56 down to 2^40. That let us run stages one and two on each of the machines that had GPUs attached, completely eliminating the need to ship batches of keys over the network. We could generate the keys in place and then run what we called the GPU stage three kernel, which did all the rest of the attack.

We ran it for 10 days. We had tried it on all of our test archives that we'd created, and it worked fine. Ten days passed and it didn't find a key. We were distraught, pulling our hair out: what have we done wrong? We went back and discovered that there were a few other process IDs that could have worked, and we were wondering, oh no, are we going to have to do this four more times? But then our client, who was a programmer himself, discovered that there was a bug between the GPU stage and the CPU stage. When I ran the tests, I had done it locally on my computer using the CPU version, and in the tests we knew exactly where the key had to be, so it was a very small piece of key space that we had to check.
But when my client went back and tried it again using the GPU stage three, he discovered that when the key candidate was the first one in a list, it succeeded, but when it was the second one, it failed. And that led me to this line right here: we had swapped the thread index with the block index. So instead of incrementing by one within the current block, it would increment the block number and go off into uninitialized memory, and so it would never work. We switched the block index with the thread index and started the search over again. Within a day and a half, we had found the three keys that decrypted the archive, and our client was able to get his Bitcoin keys out. In the end, the improvements I made to my old attack took it from something we estimated at approximately $100,000 and a year of processing down to something that took about $10,000 of GPU time and a little under two weeks. Our client was very pleased and gave us a big bonus, and that's how we recovered his Bitcoin for him. Thank you very much.