So our next talk is by Robert Brandon on Jumpgate: Accelerating Reverse Engineering into Hyperspace Using AI. We'd like to thank our sponsors, Endgame, Cylance, Sophos, and Tinder. We'd also ask that you raise your hand if you have an open seat next to you, so that people in the back know there's a seat available. And finally, please silence your cell phones. Here's Rob.

All right, how are you doing, folks? I'm going to talk a little bit about something that's been kind of an obsession of mine for the last couple of years. First, who am I: I've been working in tech for a while, I finished my PhD in computer science at the University of Maryland, Baltimore County last year, largely on the research I'm going to be talking about today, and I'm currently a threat hunter with Booz Allen Hamilton's Dark Labs.

When you're doing research, it's usually a good idea to figure out what big question you're trying to answer. Otherwise it's really easy to find rabbit holes to go down and waste a lot of time without solving the questions you set out to solve. The big question I've been trying to answer is: is there a good way to represent machine code so that computers can understand it? You could sit back and say, well, of course computers understand machine code, they execute it. But that's like saying a line cook at McDonald's understands all of cuisine because they can follow a set of instructions. What I mean by understand is: can the computer take a particular bit of code and place it in the context of all the other code that exists? So the question is really: is there a representation of machine code that captures its semantic meaning, in a way that lets both computers and humans make easy comparisons between different pieces of code?

This has a lot of applications. One of the big tasks in reverse engineering is the problem of binary similarity. Given a program, you don't just want to know, is this thing malware? You want to know what kind of malware it is. Does it have some kind of encryption component that could mean it's ransomware? Is it similar to other RATs we've seen? This also has a lot of applications in vulnerability discovery. If you're asked, hey, are there any vulnerabilities in this program, you're going to be asking questions like: does it contain a library that we already know has vulnerabilities? In a lot of cases you can answer that with signatures, but those signatures tend to break if the library gets recompiled.

There are definitely ways to approach the problem right now. BinDiff is awesome, but it doesn't really scale: it works great if you've got three or four binaries to compare, but if you've got three or four thousand, it very quickly becomes computationally infeasible. Then we also have similarity hashes like ssdeep and sdhash. Those are great, once again, if your only question is similarity, but those hashes don't help you encode any of the semantic meaning of a function or a program. And that leads to the other problem I'm trying to tackle: how do you model binaries for machine learning?
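As a quick illustration of that similarity-hash limitation, here's a minimal sketch assuming the ssdeep Python bindings and two hypothetical sample files. All you get back is a match score; there's nothing in it about what the code means.

```python
# Fuzzy-hash comparison with the ssdeep bindings (pip install ssdeep).
# The output is a 0-100 similarity score, and that's all you get:
# nothing about what the code actually does.
import ssdeep

h1 = ssdeep.hash_from_file("sample_a.exe")  # hypothetical file paths
h2 = ssdeep.hash_from_file("sample_b.exe")
print(ssdeep.compare(h1, h2))               # 0 (no match) .. 100
```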
So most machine learning algorithms need some kind of fixed-length feature vector. In a lot of cases, when you're working with a program, those fixed-length feature vectors are constructed by domain experts. They'll look at things and say, OK, here's what's important: how many sections are in the PE header? How long is the text section? How many bytes is the binary? How much data is there? What's the entropy? But those features aren't always comprehensive. One domain expert might pick a different set of features than another, and there's no real good way to decide which is the right set. And some of those features, like n-gram counts, can be pretty computationally intensive once you get to larger values of n.

Machine code also doesn't easily fit into a fixed-length feature vector, because machine code is incredibly variable in length. If you're looking at functions, the length of a function largely depends on the verbosity of the programmer: some people like writing really long functions and putting everything in one function, while other programmers write very short ones. And of course, labeled data is really hard to obtain. Nobody can sit down, look at all the programs out there, and categorize them: OK, this one has these capabilities, this one has those.

The other really significant challenge in this field is that, compared to fields like vision and language, there aren't a lot of machine learning researchers working in security. And even among those who are, the ones working on binary analysis and reverse engineering are a very small subset. Because of that, I like to find approaches from other domains, where other people have already been successful, that I can apply to the domain I'm working in. For binary analysis, I've found the field of natural language processing to be extremely useful, because there really are a lot of structural similarities between computer languages and human languages. They're both created by the same wetware. Both consist of arbitrary-length sequences. Both carry very rich semantic and conceptual information on top of what the symbols literally say: functions have meaning to humans at a level that's higher than the actual code, the actual sequence of instructions. And fortunately, there's been a lot of research on how you process and represent language. It's been going on since before computers were even a thing, and I always like to avoid reinventing the wheel.

So a lot of natural language models rely on converting text into some kind of high-dimensional space, a hyperspace. A hyperspace is basically any Euclidean space with more than three dimensions. It's a space where you can do things like add vectors and, more importantly for data science and machine learning, compute distances between vectors. In data science it's very common to model similarity as the distance between two vectors in some high-dimensional feature space.
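To make the hand-built feature idea concrete, here's a minimal sketch of the kind of fixed-length vector a domain expert might construct, assuming the pefile library; the specific features chosen here are arbitrary, which is exactly the problem.

```python
# A hand-built, fixed-length feature vector from a PE file, sketched
# with the pefile library (pip install pefile). Which features to pick
# is a judgment call; another analyst would pick different ones.
import numpy as np
import pefile

def pe_features(path):
    pe = pefile.PE(path)
    text = next((s for s in pe.sections
                 if s.Name.rstrip(b"\x00") == b".text"), None)
    return np.array([
        pe.FILE_HEADER.NumberOfSections,    # how many sections
        text.SizeOfRawData if text else 0,  # length of the text section
        pe.OPTIONAL_HEADER.SizeOfImage,     # overall image size
        text.get_entropy() if text else 0,  # entropy of the code
    ], dtype=float)

# Similarity modeled as distance between vectors in feature space:
a, b = pe_features("a.exe"), pe_features("b.exe")  # hypothetical paths
print(np.linalg.norm(a - b))
```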
And in machine learning, fundamentally, the vast majority of algorithms are just trying to figure out how to take data points and draw lines between them. As an example, say you have a one-dimensional data set with two classes, X's and Y's, and you want to figure out how to draw a line between the X's and the Y's. If you're working in just that one dimension, there's no way you can do it. But if you do something like take the square of each value and move the data up into a higher dimension, then in that higher dimension you can draw a line between the two classes.

So how do you take language and move it into a higher-dimensional space? There have been a lot of techniques for this over the years. One of the most common is the bag-of-words model, where you take a straight count of each of the words present in the document and use those counts to construct the vector. That translates fairly well to machine code: you can do a bag of opcodes, where you just count the opcodes. You can also do n-grams, where you take, say, the sentence 'the cat ran past' and produce the 2-grams 'the cat', 'cat ran', and 'ran past'. That works reasonably well, but the problem with n-grams on machine code is that once you get above about 5-grams, you start looking at enormous amounts of processing to compute all of the 5-grams present in a set of binaries.

Most of the really good natural language approaches, though, have moved away from basic word counts to a concept called embeddings. What an embedding tries to do is take a document and convert it into a dense vector in some higher-dimensional space. By dense I mean there aren't a lot of zeros in it; it's made up of real numbers. Count vectors, by contrast, are sparse: your typical document doesn't contain every word in the English language, so a count vector for that document is going to have a whole lot of zeros in it.

The really cool thing about word embeddings is that the vectors naturally cluster into regions of the high-dimensional space where high-level, human meaning resides. For example, if you train a vector space model on a whole lot of English-language documents, you'll end up with a region of that space where the concept of capital cities sits, and another region where the concept of countries sits. So you can do things like this: the vector for London, minus the vector for Britain, plus the vector for France, will land you somewhere in the region of the vector for Paris.

One problem with applying this to machine code, though, is that most natural language vector models construct their vectors by examining the co-location of words, that is, which words are used together. You can infer the meaning of a word from its neighbors. So when you try to apply these concepts to machine code, the first problem you run into is: what is the equivalent of words in machine code?
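Before moving on, here's the vector arithmetic from that London and Paris example as a toy sketch. The three-dimensional vectors are made up purely to show the mechanics; real models learn hundreds of dimensions from large corpora.

```python
# Toy embedding arithmetic with made-up 3-d vectors, purely to show
# the mechanics of London - Britain + France landing near Paris.
import numpy as np

emb = {
    "London":  np.array([0.9, 0.8, 0.1]),  # capital-ish, Britain-ish
    "Britain": np.array([0.1, 0.8, 0.1]),
    "France":  np.array([0.1, 0.1, 0.9]),
    "Paris":   np.array([0.9, 0.1, 0.9]),  # capital-ish, France-ish
}

query = emb["London"] - emb["Britain"] + emb["France"]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Nearest vocabulary word to the query vector:
print(max(emb, key=lambda w: cosine(emb[w], query)))  # -> Paris
```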
You're looking for something that has a fairly high-level semantic meaning to a human, something that encodes a lot of dense information but isn't too general, not so general that it has no meaning. You could use opcodes, but an opcode like push really doesn't say much about what the program does on its own. Something like a basic block is another intuitive structure you could use, and those are easy to extract: a basic block is basically just a contiguous run of code with no branches, a sequence of instructions until you hit a jump or a call or some other point where the code has to make a decision about what to do next. But at least to my mind, when you're trying to figure out what the basic unit of a program is that you want to attach semantic meaning to, functions seem like the natural choice. Programmers, when they're coding, commonly break things up into functions. Reverse engineers, when they're looking at a binary, usually break it up into functions and then figure out what each function does.

The problem, now that we've decided to use functions, is that functions don't have the same kind of locality properties that language does. When you're looking at an English-language document, you can say that words right next to each other have something to do with each other. If you're looking at a static binary, you really can't make that assumption. You'll have function definitions sitting contiguously that have nothing to do with each other: you might have printf defined right next to open-socket, which is right next to encrypt-all-the-things in a piece of ransomware. The sequential location of those functions in the binary says nothing about what they do. So with machine code you can't take a co-occurrence type of approach. You really have to look at the composition of the function, what instructions make it up, not so much what's next to it.

From there, if you're going to work with a compositional approach, you have to figure out how to represent the composition of a function. Just looking at the average length of x86 instructions across a wide variety of machine code, the common length of most instructions is probably around seven bytes, including the opcode as well as the operands. That makes n-grams over whole instructions computationally infeasible. You can do 2- or 3-grams over the bytes, but at that point each n-gram is just a subsection of an instruction, and half an instruction really doesn't tell you much. You could sit down with a human who knows assembly really well and ask, what are the patterns that are significant when I'm looking at assembly? You might find things like: if I see a bunch of pushes followed by a loop full of xors, that's probably some type of encoding or encryption routine.
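To ground the terminology, here's a rough sketch of splitting raw bytes into basic blocks, assuming the Capstone disassembler; the control-flow test is deliberately crude, and the input bytes are just an example snippet.

```python
# Rough basic-block splitter using Capstone (pip install capstone).
# The control-flow check is deliberately crude: mnemonics only.
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

CODE = bytes.fromhex("5589e583ec1085c07405e8123400005dc3")  # example bytes

def basic_blocks(code, base=0x1000):
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    blocks, cur = [], []
    for insn in md.disasm(code, base):
        cur.append(f"{insn.mnemonic} {insn.op_str}".strip())
        # A block ends at anything that transfers control.
        if insn.mnemonic.startswith("j") or insn.mnemonic in ("call", "ret"):
            blocks.append(cur)
            cur = []
    if cur:
        blocks.append(cur)
    return blocks

for block in basic_blocks(CODE):
    print(block)
```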
That's a significant pattern, but there are a lot of patterns like that, they're extremely variable in length, and figuring out which patterns are significant and how to encode them is not really a tractable problem. If you step back and think about it intuitively: if you want to know which features are significant for representing something, then knowing how to build that thing, which features you need to construct it, is a pretty useful guide. But since that's a really hard problem for humans, why not let a neural network figure out how to compose functions, and then just take what the neural network learns? Save ourselves all the work.

Fortunately, there's a type of neural network that does exactly that: a character RNN. A character RNN is a generative neural network that generates text sequences. The great thing about a generative neural network is that you don't need labeled data to train it; the data is its own label. The other really nice thing about this particular architecture is that all of the popular deep learning frameworks ship example code for it right in their reference examples. It's a very well-trodden path.

To train a generative RNN, you show it a sequence one byte at a time, and it tries to predict the next byte. After you let it predict, you feed the errors back and let the training process correct the weights in the network. Like in this example: you have a neural network that at each time step is trying to predict the next output based on the sequence of letters it's seen so far. You show it the letter C, and it predicts the letter A. Good job, neural network, that's right. Then it says, OK, I've seen 'CA', I'm going to predict the next letter is T. Great. Then, OK, now I've seen 'CAT', maybe the next letter is Q. At that point the training goes: 'CATQ', that's not a valid word, let's go back and correct the weights in a way that hopefully gets it right next time.

Here's an example: this is just a bit of assembly produced by a generative RNN trained on lots of assembly. You can see it's reasonable-looking assembly. It uses all the registers correctly, there are no registers you wouldn't expect to see, and it even learned to clean up the stack when it's done.

What you end up with after you train one of these networks is a method for embedding your functions into a high-dimensional space. You can treat the final set of activations in your generative neural network as a high-dimensional vector: if you have 100 neurons in the network, each of them has an activation, which is a number, so you've got a 100-dimensional vector in a vector space. And just given the way the training process works, similar sequences of code are going to cause similar activations within the neural network.
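Here's a minimal sketch of that setup. The talk doesn't specify a framework, so Keras stands in, and the hyperparameters and corpus file name are illustrative assumptions: an LSTM trained to predict the next byte, with its final activations reused as the embedding.

```python
# Char-RNN over raw bytes, sketched in Keras. Framework choice,
# hyperparameters, and the corpus file name are all assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, HIDDEN = 64, 100    # 100 hidden units -> 100-dim embeddings

model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(256, 32),                # one symbol per byte value
    layers.LSTM(HIDDEN),                      # final activations = the vector
    layers.Dense(256, activation="softmax"),  # probability of the next byte
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

def windows(buf: bytes):
    """Slice a byte stream into (64-byte window, next byte) training pairs."""
    X = [list(buf[i:i + SEQ_LEN]) for i in range(len(buf) - SEQ_LEN)]
    y = [buf[i + SEQ_LEN] for i in range(len(buf) - SEQ_LEN)]
    return np.array(X), np.array(y)

X, y = windows(open("functions.bin", "rb").read())  # hypothetical corpus
model.fit(X, y, epochs=1)

# After training, reuse the LSTM's activations as the function embedding:
embedder = keras.Model(model.inputs, model.layers[-2].output)
vec = embedder.predict(X[:1])[0]                    # a 100-dim vector
```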
So if you've got two pieces of code, you can say their vectors are similar based on the fact that they produce similar activations in a network that has looked at a whole lot of code.

Of course, being a data scientist: there's no science without testing. You can say, hey, this is great, but how do you know all these numbers you're generating actually do what you want them to do, and that you're not just generating a bunch of neural network garbage? To test this, I trained three generative LSTMs on a data set built from the functions of ReactOS, Arch Linux, and a few other sources, compiled with both GCC and Visual Studio. Overall, the data set was about 23 million functions. The LSTMs were all single-layer; I trained one with 100 nodes, one with 500 nodes, and one with a thousand nodes, just to get a broad sense of how much representational capacity you get at a given network size. And given how long LSTMs take to train, each of these was trained with truncated backpropagation through time, cut off at 500 time steps. All the other statistics anybody cares about, we can talk about afterward.

The next big decision is whether to work with assembly or with the raw binary. There's been some prior research here, and in a lot of cases people disassemble the code first. There are some downsides to that. The first problem is: what is the correct disassembly? Every disassembler disassembles things slightly differently. You might have one disassembler that wants to use AT&T syntax and another that wants to use Intel syntax, and figuring out which one is correct is a hard problem. And really, I prefer to keep things as close to the metal as possible, as close to the original data as possible, without introducing more assumptions.

The one problem I did try to work around is that raw binary code is missing some valuable semantic information. In x86, for example, function calls are relative to the current address. If you have printf sitting at address 20 in memory and it's called from a function at address 5, the instruction will actually say call +15; if you call it from address 10, it'll say call +10. That introduces a problem: there's no way to tell that the same function is being called both times. To work around this, I do some basic normalization of the data. If a function is being imported into the binary, then before I send the code over to the vectorizer, I compute a 32-bit hash of the function's name and substitute that in for the call address. That way every call to printf looks identical, and the model can see that it's a call to the same function.

So how do we evaluate these embeddings after we make them? In a lot of cases, the criterion for deciding whether embeddings are any good is, well, do they work for the problem I'm trying to solve? That's great from an engineering perspective, not so great from a scientific rigor perspective. One way you can evaluate embeddings is to plot them out and look for some kind of interesting structure: you can do random sampling and have somebody eyeball whether they look good or not.
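Backing up to that normalization step for a second, here's a sketch of it. The talk says a 32-bit hash of the function name but doesn't name the hash, so CRC32 stands in here, and only the simple E8 rel32 call encoding is handled.

```python
# Sketch of call-target normalization. CRC32 is a stand-in for the
# unspecified 32-bit hash; only the E8 rel32 call encoding is handled.
import struct
import zlib

def name_hash(import_name: str) -> bytes:
    """Stable 4-byte token derived from an imported function's name."""
    return struct.pack("<I", zlib.crc32(import_name.encode()) & 0xFFFFFFFF)

def normalize_calls(code: bytes, call_sites: dict) -> bytes:
    """call_sites maps the offset of each E8 call opcode to the name of
    the import it resolves to (recovered from import/relocation info)."""
    out = bytearray(code)
    for off, name in call_sites.items():
        assert out[off] == 0xE8                 # call rel32
        out[off + 1:off + 5] = name_hash(name)  # overwrite the displacement
    return bytes(out)

# Every call to printf now carries the same four bytes, wherever it is:
print(normalize_calls(b"\xe8\x10\x00\x00\x00", {0: "printf"}).hex())
```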
That's OK, but having a human eyeball things doesn't scale, and you're also prone to bias: if you really want your algorithm to work, you might be tempted to ignore the stuff you can't quite explain and focus on the stuff you can. So in order to evaluate these embeddings, I'm going to borrow something similar to what researchers in the natural language space have been doing and come up with some standard tests for an embedding model. In the natural language realm you have things like standard lists of synonyms, so you can check: do the embeddings agree that all of these word pairs are synonyms? We don't really have that for machine code. The criteria I'm proposing are, first, consistency: are these embeddings consistent with embeddings generated by other models? And second, since the ultimate job of an embedding model is to extract the semantic content of something into a fixed-dimensional space, can we come up with standardized tests to measure how much semantic meaning is being extracted?

This is a scatter plot of some of the embeddings, colored by operating system and compiler. It's not quite as obvious with these colors, but you can see, even just eyeballing it, that the stuff compiled with GCC, which is the Arch Linux material, sits in a very different area of the space from the stuff compiled with Visual Studio. As a human analyst I'd say, OK, sure: Visual Studio and GCC produce completely different function prologues, so when I'm looking at the code, the difference is totally obvious. But we didn't optimize the embeddings for that. We never told the model, hey, these things are separate, keep them apart. That's just something it picked up on its own.

For evaluating consistency, I'll define two types. Hard consistency: using model one, find the nearest neighbor of function A; doing the same nearest-neighbor measurement with model two has to return the exact same nearest neighbor. That's a pretty rigorous criterion. It's not something that happens by random chance, and it might not even hold between two perfectly reasonable models. For example, is the word 'fluffy' closer to 'soft' than some other synonym is? They all have roughly the same meaning, but which one ends up the single closest neighbor to 'soft' can differ. So to relax that, I also have a measure of soft consistency: take function A's nearest neighbor under model one, and check whether it falls within the ten nearest neighbors under model two. That way you're saying it may not be exactly the closest one, but it's still in the same neighborhood.

Evaluating the models, I actually got some pretty good results for consistency. For the consistency measurement I took a random sample of 10,000 functions, because doing a full n-by-n measurement over 23 million functions takes forever. Out of those 10,000 functions, around a quarter met the criterion for hard consistency between models. That is, the 100-node network and the 500-node network agreed, about 25% of the time, that two functions had the exact same nearest neighbor.
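Here's how you might compute those two measures with scikit-learn. The k of ten follows the talk; the random demo data and everything else is an assumption.

```python
# Hard/soft consistency between two embedding models, sketched with
# scikit-learn. emb_a and emb_b embed the SAME functions in the same
# order; k=10 matches the talk's ten-nearest-neighbor relaxation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def consistency(emb_a, emb_b, k=10):
    idx_a = NearestNeighbors(n_neighbors=2).fit(emb_a).kneighbors(emb_a)[1]
    idx_b = NearestNeighbors(n_neighbors=k + 1).fit(emb_b).kneighbors(emb_b)[1]
    # Column 0 of each row is the point itself, so neighbors start at 1.
    hard = np.mean(idx_a[:, 1] == idx_b[:, 1])
    soft = np.mean([idx_a[i, 1] in idx_b[i, 1:] for i in range(len(emb_a))])
    return hard, soft

# Demo on random data: expect near-zero consistency by chance.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(1000, 100)), rng.normal(size=(1000, 500))
print(consistency(a, b))
```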
And when I relaxed that to the soft consistency of ten nearest neighbors, we were still getting around 50-plus percent, which is a really good result. If you look at what you'd expect from random chance, you'd expect essentially no consistency at all. That gives you some real confidence that the neural networks are learning consistent, useful things about the data they're being shown.

As for the problem of standardized tests (standardized tests: not just for your kids anymore), some of the tests I came up with are: given just the embedding, can we train a model to tell which compiler was used to compile that function? Going a little further, can we tell which optimization settings were used with that compiler? And on top of that, can we determine from the embedding whether functions from a particular library, for example ws2_32.dll, were used?

To do that, given that I had labels because I compiled all these things myself, I trained several classifiers. I trained a logistic regression classifier to tell which compiler was used, and, no surprise, that got 100% accuracy; looking at the scatter plot from earlier, you can tell the two classes are basically linearly separable in this space. What impressed me more was the softmax classifier I built for detecting the compiler optimization level, which got between 72% and 85% accuracy depending on which embedding it used, a little better with the higher-dimensional embeddings. That's really impressive when you consider that in a lot of cases, especially for small functions, a compiler may not generate different code at different optimization levels at all. If you have a function that just takes two numbers, adds them, and returns the result, it doesn't really matter whether you compile with -O1 or -O3; there's not much the compiler can do to improve it. And even in the cases where the softmax classifier didn't get the correct optimization setting, it still picked the correct compiler. It would say, well, I know this is GCC; maybe it's 51% that it's compiled with -O1 and 49.5% that it's compiled with -O3, but it's definitely not Visual Studio.

In addition to that, I trained a random forest classifier on functions that had imports from ws2_32.dll and functions that didn't. That got about 78 to 91% accuracy, with the accuracy increasing with the dimensionality of the embedding, which I thought was really cool. The embeddings genuinely appear to be encoding things like: this has an import from the Windows networking library. So it looks like these things work.
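For reference, here are those three classifiers sketched with scikit-learn. The embeddings and labels below are random placeholders standing in for data you'd get by compiling the corpus yourself, and a multinomial logistic regression plays the role of the softmax classifier.

```python
# The three "standardized test" classifiers, sketched with scikit-learn.
# X and the y_* arrays are random placeholders for real embeddings and
# the labels that come from compiling the corpus yourself.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))        # stand-in 100-dim embeddings
y_compiler = rng.integers(0, 2, 1000)   # gcc vs msvc
y_opt = rng.integers(0, 4, 1000)        # -O0 .. -O3
y_ws2_32 = rng.integers(0, 2, 1000)     # imports from ws2_32.dll?

def fit_and_score(clf, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
    return clf.fit(X_tr, y_tr).score(X_te, y_te)

# Compiler ID: the classes looked linearly separable, so a linear model.
print(fit_and_score(LogisticRegression(max_iter=1000), y_compiler))
# Optimization level: multinomial logistic regression is a softmax classifier.
print(fit_and_score(LogisticRegression(max_iter=1000), y_opt))
# Library usage: random forest over the same embeddings.
print(fit_and_score(RandomForestClassifier(n_estimators=200), y_ws2_32))
```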
So how can we actually do something with these embeddings? This is where the framework I'm working on, which I'm calling Jumpgate, comes in. One of the challenges in the reverse engineering space is that there's definitely no shortage of tools. You've got IDA Pro, you've got Binary Ninja, you've got radare2. All these frameworks do really awesome stuff, but none of them interoperate, and none of them really work the same way. So in a lot of cases analysts just pick whichever one they like better. You'll have some people that say, I'm an IDA Pro user, I don't want to touch anything else, and others that say, oh, IDA Pro is too expensive, I like radare2. My goal here is to be able to take embeddings and use them with whatever kind of front end you want. If you're an IDA user, you should be able to use embedding models that a radare2 user trained, and vice versa.

The architecture with Jumpgate is: you have your client, which is going to be IDA Pro, Binary Ninja, radare2, whatever front end you like to use. The key piece is the vectorizer, which is basically just a Python class that implements a simple interface (there's a rough sketch of one below). You can write vectorizers for whatever framework you want: if you're a PyTorch person, you can write your vectorizer in PyTorch; if you're a Keras person, you can write it in Keras; if you really like coding everything from the ground up, you can do that too. As long as you can send it a flat string of bytes, you're good to go. From there, your vectorizer converts the bytes up into the high-dimensional space, and you can send the vectors on to models that do whatever task you want. You want compiler identification? Go for it. You want it to find the ten nearest neighbors in your collection of functions and send back whether you have something within a certain distance? You can do that. It's intentionally left very open-ended, so you can do whatever you want.

Like I said, some of the example applications I've either tested or am working on right now: Function similarity: I've got this binary I just found somewhere, and I want to know whether I have similar functions in any of the other binaries I've seen. Compiler identification: is this binary GCC, except for a handful of functions compiled with Visual Studio? That might be interesting. Crypto detection: that's a little more challenging, because it requires building a data set of known crypto functions to train a classifier, but this kind of framework enables it. Really, the only limitation is what you can think of to do with vector space models of code.

Some of the ongoing work: I'm building out a 64-bit data set and model. That should be a lot more relevant, since 64-bit is where everything's moving right now, and you don't have the same function-calling-convention complexity that you have with 32-bit x86, so whether compiler identification works the same way on 64-bit code is a fairly interesting question. I'm also in the middle of transitioning the entire project from Python 2 to Python 3. It's not up on GitHub right now while I finish that transition; hopefully I'll get it up there within the next week or so, but that's the URL where it'll eventually be. And other than that, that's my talk. You can message me on Twitter if you want.
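As a footnote on the architecture described above: here's a minimal sketch of what a Jumpgate-style vectorizer could look like. The class and method names are hypothetical, not the project's actual API; the point is just that the front end only ever ships flat bytes, and downstream models only ever see vectors.

```python
# Hypothetical vectorizer interface: names are illustrative, not the
# project's actual API. Front ends (IDA, Binary Ninja, radare2) send
# raw function bytes; downstream models receive vectors.
import numpy as np

class Vectorizer:
    """Turn a flat string of function bytes into an embedding vector."""
    def vectorize(self, function_bytes: bytes) -> np.ndarray:
        raise NotImplementedError

class KerasLSTMVectorizer(Vectorizer):
    """Backend-specific implementation; could just as well be PyTorch."""
    def __init__(self, model_path: str):
        from tensorflow import keras
        self.model = keras.models.load_model(model_path)

    def vectorize(self, function_bytes: bytes) -> np.ndarray:
        x = np.array([list(function_bytes)])  # bytes -> integer sequence
        return self.model.predict(x)[0]       # the embedding vector

# vec = KerasLSTMVectorizer("embedder.keras").vectorize(raw_bytes)
```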