Good morning, everyone. Thanks for the opportunity to speak here today. Let me tell you a story. A year ago, at last year's EAA conference, I gave a talk about the Atari 2600, a game console that came out 41 years ago and had a surprisingly long lifespan. In particular, I spoke about a game for it called Entombed. This was a game that involved archaeologists and zombies running through a maze. Now, the interesting thing here wasn't the portrayal of archaeologists in the game, it wasn't the gameplay, it wasn't the physical artifact. The interesting thing was what was inside: the digital artifact. One of the things I found was that there was a piece of code in Entombed that could be found, identically, in a game called Towering Inferno, even though there was no clear chain of authorship between them. In other words, there was code reuse.

After I gave this talk last year, I was talking with Andrew Reinhard and Megan von Ackerman. I apologize, I don't have a picture of Megan; all I remember of her is that she studied medieval locks and keys. But this is appropriate, because their question is actually the key to this research. What they asked me was this: can you find all occurrences of reused code? On the surface it seems like a very similar problem, because what I had done last year was look for one short, very specific piece of code that I had manually identified. But the problem with digital artifacts is that there are so many of them, and the luxury of manually analyzing these things just doesn't work at scale. At scale, what we need to be able to do is automatically identify repeated pieces of code. But first we have to identify what in a game image constitutes code, and it turns out that this is a really hard problem. So hard that you can mathematically prove it is unsolvable in general.
And it gets even worse for the Atari 2600, because programmers would do all sorts of little optimization tricks to save space. This is an actual excerpt from Carol Shaw's River Raid source code. What she did to save one byte of memory was overlap a data table with an instruction. In other words, for her game there is no way to cleanly separate the two. And there are other games with legitimate cases of real code being overlapped with other real code, again to save space. In other words, there might be multiple solutions to this problem. And I know what you're thinking: did you go and solve this anyway? Maybe.

The system I built got its name for no particularly deep reason other than that's what I called the directory, and it kind of stuck. What I did was build an Atari-specific disassembler to heuristically separate code and data. I take all these game images that I have — sold game cartridge images — and I preprocess them once. From that I get what I call a dump file for each game. There's a bunch of information in there, but the thing I want to focus on in that dump file is what I call BAD code, which stands for Binary Abstracted Disassembly. When you reuse code from one game in another, there are some things which are likely to change naturally — memory addresses, for instance — and there are other things that are likely to remain the same. What I wanted to do was abstract away the stuff that was likely to change and keep the stuff that was likely to remain constant. In other words, I wanted to find a sweet spot between false negatives, where I would miss instances of code reuse, and false positives, where I'd be overwhelmed with things that weren't really code reuse. So now, with all these dump files, what I do is winnow them down into a representative set.
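The abstraction idea can be illustrated with a small sketch. To be clear, the instruction representation and the choice of what to abstract here are my assumptions for illustration, not the talk's actual implementation; a real tool would get these tuples from a 6502 disassembler.

```python
# Sketch of "Binary Abstracted Disassembly" (BAD): keep the parts of each
# instruction likely to survive reuse (opcode, addressing mode) and abstract
# away the parts likely to change (memory addresses).
# The instruction tuples here are hypothetical.

def abstract_instruction(opcode, mode, operand):
    """Map one disassembled instruction to an abstract token."""
    if mode in ("absolute", "zeropage", "absolute,x", "absolute,y"):
        # Memory addresses jostle around between games: abstract them away.
        return (opcode, mode, "ADDR")
    # Immediates, implied operations, etc. are kept as-is.
    return (opcode, mode, operand)

def abstract_listing(instructions):
    return [abstract_instruction(*ins) for ins in instructions]

# Two games touching different memory locations still match after abstraction:
a = abstract_listing([("LDA", "zeropage", 0x80), ("STA", "absolute", 0x2C00)])
b = abstract_listing([("LDA", "zeropage", 0x9A), ("STA", "absolute", 0x2C06)])
assert a == b
```

The point of the sketch is the trade-off described above: abstract too little and you get false negatives; abstract too much and everything matches everything.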
The corpuses you find of these games will have things like eight prototypes of a My Little Pony game, and those aren't going to give you much more information than simply looking at one of them. So I narrow these down to a smaller set of dump files, and then I find matching code of a certain length — I'll come back to that on the next slide — so, length greater than or equal to n. I take the search results, and I have a number of analysis programs so that I can look at them.

Now, if I were to choose a length that was too short, it would tell me nothing, because ultimately there is only a small, finite number of ways to do certain things on the 2600. On the other hand, if I choose a value that's too long, I'm going to find either no matches at all, or I'm going to miss some legitimate cases of code reuse. Because these programmers back then were actually really good at their craft, I assumed that they would not knowingly be reusing code within the same game. So I took two games — Combat there on the left, Pitfall on the right — and I ran these games against themselves to find out at what point the code sequences became unique. For Combat it was 9, and for Pitfall it was 15. That gives me a lower bound. At the other end, my system should be able to find the excerpt I showed you from Entombed, and that was 21 instructions. Taking all these constraints into account, I use in the end a value of n = 15.

I'm keenly aware that I'm speaking to a room mostly filled with archaeologists, and you might be interested in knowing why I'm telling you this. Remember, the point here was to find all instances of code reuse and look at the programming practice of human programmers reusing code in their games. So this is one of the analyses that my system can spit out: it takes one of these instances of code reuse that it finds, pops the two up side by side, and highlights the differences.
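The matching and calibration steps described above might look something like this sketch. The n-gram indexing approach and function names are my assumptions; only the value n = 15 and the Combat/Pitfall calibration come from the talk.

```python
# Sketch of the matching step: find runs of >= n identical abstract
# instructions shared between two games, via an n-gram index.
# Token streams are whatever the abstraction step produces.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]): i for i in range(len(tokens) - n + 1)}

def shared_runs(tokens_a, tokens_b, n=15):
    """Positions in A and B where an n-instruction sequence is shared."""
    index = ngrams(tokens_b, n)
    return [(i, index[tuple(tokens_a[i:i + n])])
            for i in range(len(tokens_a) - n + 1)
            if tuple(tokens_a[i:i + n]) in index]

def uniqueness_threshold(tokens):
    """Smallest n at which no sequence repeats within a single game --
    the calibration the talk runs on Combat (9) and Pitfall (15)."""
    n = 1
    while True:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if len(grams) == len(set(grams)):
            return n
        n += 1
```

Running a game against itself with `uniqueness_threshold` gives the lower bound on n described above; `shared_runs` with n = 15 is then the cross-game search.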
So I, as a manual analyst, can then go in and say: yes, this looks legitimate, or no, this is a garbage result. This is in fact the Entombed and Towering Inferno excerpt, which my system quite easily finds. What you see highlighted there are memory locations, and this is one of the things that can jostle around naturally when you reuse code. But it is in fact the same code. What was more interesting was the fact that I still had the calibration game images sitting around in my test directory — remember, I calibrated using Combat and Pitfall. What my system picked up was a previously unknown sequence that had been reused between Entombed and Pitfall, which I had no idea was there. And yes, apart from one memory address changing, it's the same code.

You can also look at a single game and compare it across the corpus. The 2600 was effectively designed for the game Combat: it was the first game, the first pack-in game. What this graph is showing us is, on the vertical axis, all the memory locations in Combat, and on the horizontal axis, where in a corpus of 600-plus games instances of Combat code are seen reused. There are two takeaways from this. One is that there's not a lot of it. The other is that there are really only two bands where it is particularly reused.

This is the final analysis: a heat map of what I call coverage. This is really a first-order approximation to code reuse. What you're seeing here — or actually, more what you're not seeing here — is a lot of yellow, because yellow indicates that there is some amount of a game cartridge that is reused. It's not saying definitively that there's code reuse, but again, this is a first-order approximation. So if you were to see this and it was all yellow, you would be able to say: oh my god, there's a lot of code reuse. But it's sparse and it's faint.
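A plausible definition of pairwise "coverage" for this kind of heat map is sketched below. The exact definition is my assumption; the talk only describes coverage as a first-order approximation of how much of one cartridge is reused in another.

```python
# Sketch of first-order "coverage": the fraction of game A's abstract
# instructions that fall inside some sequence of >= n instructions also
# present in game B. High coverage would correspond to a bright yellow
# cell in the heat map; this definition is an assumption.

def coverage(tokens_a, tokens_b, n=15):
    index = {tuple(tokens_b[j:j + n]) for j in range(len(tokens_b) - n + 1)}
    covered = set()
    for i in range(len(tokens_a) - n + 1):
        if tuple(tokens_a[i:i + n]) in index:
            covered.update(range(i, i + n))  # all n positions count as covered
    return len(covered) / len(tokens_a) if tokens_a else 0.0
```

Computing this for every pair of games in the corpus (and skipping the diagonal, a game against itself) would yield the kind of matrix shown on the slide.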
The red cells are cases where the comparison has been skipped for one reason or another. If you look closely, there's a red line across the diagonal, and that would be comparing a game against itself — Pitfall compared to Pitfall, okay, they're going to share a lot of code. The other cases are ones where there is a cascade of matches. Invariably, when I found my system spitting these out, it was a heuristic error in detecting code versus data: it was actually matching data against data, as opposed to code against code. So those I just filtered out.

But while there's not a lot to see here, we can ask the same question with this system at different levels of granularity. David Crane was the programmer of Pitfall, and this is the same heat map, but looking at just his games. You'll notice there's a lot more yellow here, and it's a lot more intense. So it's very likely that the practice of code reuse, at least in his case, was a lot more prevalent. And speaking as a programmer: yeah, we rip off our own code all the time, because we're too lazy to rewrite it. You can also look at practices within a single game company. This is Activision, and again: a lot more yellow, and a lot more intense yellow. So it seems that code reuse within the company was likely a lot more prevalent.

There's something Steve Jobs used to say famously at the end of his talks. Oh yeah, right: one more thing. Since I had all this data anyway — in a perfect world, what I wanted was for the system to tell me where I should invest my time when analyzing game code, because manual analysis of game code is very laborious, very time-intensive. Ideally, I would just like to be told: okay, here are the interesting parts, go there and look; ignore all this samey-looking code.
What I did is use a metric from data mining, inverse document frequency, to keep track of how frequently certain "words" were seen over this entire corpus of Atari games, and I used that information to colorize a game's assembly listing. As a result, I can now ask questions like: is Entombed interesting, looking at the code? Well, this is the code that I pointed out earlier that's shared with Pitfall, and you'll notice that in heat-map terms it's actually dark — it's really cool. What that's saying is that, while I didn't know about this reuse previously, the fact that it's cool means it shows up quite frequently overall in the Atari corpus. And when I looked into some of the results: yes, this code sequence is seen an awful lot. It was a favorite to reuse. On the other hand, remember that Entombed has this maze, and the code to generate it was very obfuscated and difficult to analyze. Here my system lights it up like a Christmas tree. What I'm not showing you is that these results are a little cherry-picked, because there were other spots in the code that were also lit up and were less interesting. So at least this part I would characterize as a work in progress at this point.

And to conclude: yes. Thank you very much.
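The IDF colorization described in the talk might be sketched as follows. The bucketing thresholds and word granularity are my assumptions for illustration; the talk only says that standard inverse document frequency over the corpus drives the coloring.

```python
# Sketch of IDF colorizing: score each instruction "word" by how rarely it
# appears across the corpus, so rare (hot) code stands out and common (cool)
# boilerplate fades. Standard inverse document frequency; the corpus and
# bucket thresholds here are made up for illustration.
import math

def idf_scores(corpus):
    """corpus: list of games, each a list of hashable instruction words."""
    n_games = len(corpus)
    doc_freq = {}
    for game in corpus:
        for word in set(game):  # count each word once per game
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_games / df) for w, df in doc_freq.items()}

def colorize(game, scores):
    """Bucket each word of one game into cool/warm/hot by its IDF."""
    hi = max(scores.values()) or 1.0  # guard: all-zero IDFs stay "cool"
    out = []
    for w in game:
        s = scores[w]
        bucket = "hot" if s > 0.66 * hi else "warm" if s > 0.33 * hi else "cool"
        out.append((w, bucket))
    return out
```

Under this scheme, the Entombed/Pitfall sequence shared across much of the corpus would score low (cool), while the obfuscated maze-generation code, rare in the corpus, would score high (hot).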