My name is Scott Miller, I'm here with Offensive Computing, and I'm going to be presenting a new bioinformatics-inspired binary analysis approach toward coding style and motif identification. There are a lot of people who helped me get up here. I was an NSF Scholarship for Service recipient, and there are a number of people, some of whom are in the crowd heckling me even as we speak. The big thing that motivated this particular talk is a member who couldn't be with us this year at DEF CON; he was here last year. So this is dedicated to him.

I'm going to be talking about what I did for my thesis work and then how it was extended. It's a technique, or a suite, I'm calling BinBLAST after the Basic Local Alignment Search Tool. I'll talk a bit about that, and then I'm going to look at two separate applications. One is version analysis, which is very important if you're trying to protect your own networks and systems, and I'll follow that up with a discussion of the application to automatic signature generation for viruses. Everything you're going to see in this presentation is preliminary work. The actual development of the system was highly theoretical and required a lot of testing, proving, and validation, but even at the end of all that, the code is unfortunately still at a prototype stage. I'm in the process of stepping it up and going in a number of different directions. So at the end, if I don't get to field your question directly, I'll be down in the corner over there, and I'd be more than happy to hear your input.

As I started off the talk, our current approach to defense, at least on a larger scale, is that we have various policies, software, and hardware that try to keep our systems clean and secure. The bad thing about this is that when a new exploit or new threat comes out, if it's done well, it's going to require some patch to the system. It's going to require a change in policy, changes in firewalls, changes in hardware. The problem is that any time you make a change in a very complicated system, you have to verify that the patch has actually done two things: one, fixed the problem it was supposed to fix, and two, not created any additional problems. And as I'm sure all of you know, that is a very, very difficult proposition. Worse than that, not only do you have to do all of this on your own, but the time you have to verify and validate these patches is decreasing very quickly, as motivated threats, driven predominantly by monetary concerns, are finding profit in your inability to fix and repair your systems. Long and short of it, the very tools we're using right now to defend ourselves are contributing to the problem.

So what can we do about this? There are a number of things. One approach, I suppose, is to stick our heads in the ground and ignore it. Another is to start taking a more proactive approach to the defense of our networks. To do that, we need to start considering, when we bring code onto our systems, what sort of function it has. And fortunately for us, this sort of functional analysis of various codes is nothing new. In the biology world, they've been dealing with it for many, many years now.
A lot of our new cancer drugs and research are the result of some very good work that was done in the early 90s. So it's fun to draw the analogy here; this is the analog I was using for all of the work you're seeing today. The genetic code in every one of your cells is transcribed from DNA to RNA, and then translated from RNA into the various proteins that allow you to stand, form bones, and all sorts of stuff like that. What's very interesting is that this is not in the least bit a conservative process. At every step, when you go from DNA to RNA, a lot of information gets cut out, and when you move from RNA to proteins, again, a lot of information gets cut out. So with respect to functional analysis in the biology domain, you're given a protein and asked: what section of my genetic code actually coded for this protein? That's a very difficult question.

It's interesting because we've got the same sort of situation when compiling source code. When you go from source to object code, you're losing something, and when you go from object code through linking into an executable binary, that process, too, loses a lot of information. The linking step may be compounded by packers and encryptors and all of that, which basically serve only to remove more and more information. So it becomes harder and harder to recover the original function given the final product.

One of the ways the biology domain has come up with to address this particular problem is, as I said earlier, a tool by the name of the Basic Local Alignment Search Tool, developed by Dr. Karlin and Dr. Altschul back in the early 90s when the paper was done. Superficially, all it is is a scored string-matching algorithm. You've got a quick example at the bottom here: we're going to do an alignment of two words. If the characters match, we add one; if the characters differ, we subtract one. So it goes along adding things up until we reach what appears to be a local maximum of four, because THES and THES match, and after that it starts to taper down. This brings up a couple of important points. The BLAST algorithm is an approximation: it's trying to find a local maximum, basically the substring with the highest-scoring match. But it's also important that it doesn't try to match everything in the entire string, because then you get back trivial answers. You'd ask, does this file match this file? And it would come back and say yes, which is not in itself terribly useful.

Under the covers, the entire process is founded on a first-order Markov chain model, which makes an assumption that isn't up on this slide but is important to know: because it's a first-order Markov model, the choice of the next instruction is assumed not to depend on the choice of the previous instruction. I have very good reason to believe that this is not the case.
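To make that scored string matching concrete, here is a minimal Python sketch of an ungapped local alignment using the +1/-1 character scoring from the word example above. It is only an illustration under those assumptions: the function name is hypothetical, and real BLAST does not scan exhaustively like this; it seeds on short exact words and then extends them.

```python
# Minimal sketch of scored local alignment (ungapped), in the spirit of the
# word example above: +1 for a match, -1 for a mismatch.
MATCH, MISMATCH = 1, -1

def best_local_alignment(a, b):
    """Return (score, offset_a, offset_b, length) of the best-scoring
    ungapped local alignment between sequences a and b."""
    best = (0, 0, 0, 0)
    # Consider every diagonal, i.e. every relative shift of b against a.
    for shift in range(-(len(b) - 1), len(a)):
        score, start = 0, 0
        for j in range(len(b)):
            k = j + shift
            if not (0 <= k < len(a)):
                continue
            step = MATCH if a[k] == b[j] else MISMATCH
            if score <= 0:              # restart the local segment (Kadane-style)
                score, start = 0, j
            score += step
            if score > best[0]:
                best = (score, start + shift, start, j - start + 1)
    return best

print(best_local_alignment("THESE", "THESIS"))   # local maximum of 4 on "THES"
```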
There are certain sections where compilers emit automatically generated code, for instance to handle structure accesses, and every compiler has its own little way of saying: all right, I'm going to take this particular structure access, use this set of opcodes, and fill in the memory offsets and registers. It's a template. So superficially it seems as though the assumption underlying this technique is flawed: I've got this problem because the choice of instructions is not strictly independent. However, there is a similar problem in the biology domain as well. What I was trying to do here was just get the start of this concept working, because once I get this first BLAST approach working, there are a number of variations on the BLAST algorithm in the biology domain that can handle this sort of repetition, and I'm only going to talk about those briefly.

So how did I come up with all this? Well, I looked at what DistroWatch listed at the time as the top five Linux distributions, stripped out all of the binary packages they had, and looked at all of their instructions, which came to about half a billion instructions. I then came up with a little storage format which reduced each instruction to a common four-byte format. The first two letters of the mnemonic were used for the operator class, and that choice, two letters instead of three, actually maximizes a particular namespace, which is unfortunately beyond the scope of this presentation. Then I truncated the operator to one byte and reduced the operands to two bytes. This makes the assumption that, within a system, the operands are the parts most likely to change, and I think that's a fairly safe assumption to make on x86.

So what can we do with all this? I'm working on putting together a suite of all of these programs in more of a production-code setting, one that does not have, perhaps, the tedium of my thesis, where I was literally sitting with a piece of paper next to my computer writing out offsets. Where it stands right now, I've developed a couple of major analysis tools. I've got a quick little program to help build your libraries and indexes, basically reducing everything into that four-byte format I was talking about earlier. Then there's the actual comparison engine. And then, as a kind of nasty side effect, it's necessary to filter the output; this relates to the statement I made earlier, because the input is not random. The input choices are not random, so there's a lot of extra interesting stuff that comes out of it, which is something that needs work in the future; I'll get to that in a bit. There's also a program that helps with visualizations: it does some side-by-side disassembly, helps generate tables needed by other tools, things along those lines. That's the overview of what I have so far. The general process I've thought of is: you use makelib to build up your library, you perform a comparison, you pipe the output through the filter, and after that you run it through the match-output tool to get the result you want.
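As a rough illustration of that four-byte storage format, here is a minimal Python sketch. The exact class hashing and operand encoding from the thesis are not given in the talk, so the CRC32 stand-ins and the precise byte layout below are assumptions made only for illustration.

```python
import struct
import zlib

def encode_instruction(mnemonic, operands):
    """Reduce one disassembled instruction to a common 4-byte record:
    1 byte operator class + 1 byte operator + 2 bytes of operand summary.

    The class byte comes from the first two letters of the mnemonic, so e.g.
    'mov', 'movzx' and 'movsd' share a class.  The hashing used in the thesis
    is not reproduced here; CRC32 is just a deterministic stand-in.
    """
    m = mnemonic.lower()
    two = (m + "  ")[:2]                                           # first two letters
    op_class = zlib.crc32(two.encode()) & 0xFF                     # 1-byte operator class
    operator = zlib.crc32(m.encode()) & 0xFF                       # 1-byte full operator
    operand16 = zlib.crc32(",".join(operands).encode()) & 0xFFFF   # 2-byte operand summary
    return struct.pack(">BBH", op_class, operator, operand16)

# Two mov variants share an operator-class byte but differ elsewhere:
print(encode_instruction("mov",   ["eax", "[ebp-0x8]"]).hex())
print(encode_instruction("movzx", ["eax", "bl"]).hex())
```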
The unfortunate thing is that any time you start to go from prototype to production code, your definition of what you need to do functionally changes. I spent a lot of time working on a web interface, which unfortunately does not have the utility I'd hoped it would, so instead I'm now going to focus a bit more on the command-line tools. But this is a sample of what I was able to come up with: the raw results are in the lower-right window, and some of the similarity and coverage tools are in the upper-left corner. It's kind of hard to see; I apologize for that.

With that said, here's one particular application I see for this technique: version analysis. We're going to do a version analysis of sendmail. We've got four versions of sendmail and two controls. The negative control is as far away from a socket-based mail transfer agent server as I could get: a graphical side-scroller by the name of Chromium. And the positive control is Postfix, a competing mail transfer agent to sendmail. It's worth pointing out that the last entry, sendmail 8.9.1, is the only one that was compiled with a pre-GCC 3 compiler, which shows up a little bit later.

What you actually get out of this technique are these fairly confusing raw results, where you get a score (that integer score I was talking about), the length of the match, and two offsets, where it matched in both files. So you get what's called an alignment, a scored alignment of two sections of code. The major alignments you see right off the top are these huge sections of code at the beginning, which are responsible for resolving all of the external library dependencies. This is something that's very repetitive; it goes on and on about the same thing. I'm sure most of you in the back can't read this, unfortunately, but it starts off at the top with a typical entry point, and after that you have all of your symbol resolutions, which is this continuous repetition of a three-opcode sequence, or with GCC 3.0 anywhere from a three- to five-opcode sequence, repeated over and over and over.

So what can you do with this? Instead of considering the individual matches, let's group them into coverage. OK, I have file A, and I have some other file I'm going to compare it against. Instead of trying to figure out all of the possible matches, just ask: how much of file A is included in matches to file B? From that you can compute a similarity measure, and from the similarity you can compute a distance measure, and then use, again, a very well-established biology tool by the name of PHYLIP, the phylogenetic inference package. All of these graphs are directly generated using biology-domain tools, which was very convenient. You'll note from this graph right off the top that sendmail 8.9.1 is very distant from everything else. In this particular graph, which uses the unweighted pair group method with arithmetic mean (UPGMA), the horizontal distance is significant and the vertical distance is just to show association. So what this says is that sendmail 8.9.1 is very distant from a common ancestor, while the other five programs are very close to some intermediate ancestor.
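Here is a minimal Python sketch of the coverage-to-distance step just described, plus writing a PHYLIP-style distance matrix. The similarity definition (fraction of file A's instructions covered by matches to B) and the distance as one minus similarity are assumptions; the talk does not give the exact formulas, and the example numbers are hypothetical.

```python
def coverage(matches, length_a):
    """Fraction of file A's instruction stream covered by any match to file B.
    `matches` is a list of (offset_a, match_length) pairs from the comparison."""
    covered = [False] * length_a
    for off, n in matches:
        for i in range(off, min(off + n, length_a)):
            covered[i] = True
    return sum(covered) / float(length_a)

def write_phylip_matrix(names, dist, path):
    """Write a square distance matrix in the plain PHYLIP layout that the
    package's tree-building programs read: taxon count, then one padded
    name and one row of distances per taxon."""
    with open(path, "w") as f:
        f.write(f"{len(names)}\n")
        for i, name in enumerate(names):
            row = " ".join(f"{dist[i][j]:.4f}" for j in range(len(names)))
            f.write(f"{name[:10]:<10} {row}\n")

# Hypothetical numbers: distance = 1 - similarity, similarity from coverage.
sim = coverage([(0x100, 200), (0x800, 50)], length_a=4096)
print("similarity:", sim, "distance:", 1.0 - sim)
```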
This is kind of an interesting graph, from the standpoint that in the biology domain there is very hot debate over the graph on the left versus the graph on the right. Superficially, they might seem to convey the same information. However, the graph on the left has been hotly contested in the biology domain because it is associated with what's called a molecular clock, where an assumption is made that mutations happen at a constant rate, and there has been increasing research saying this is not an assumption you can make. I think it would be safe not to make the same assumption in the computer science domain either, because given the way malware is adapting and evolving, I don't think there's any really deterministic mutation rate. The graph on the right is what's called the nearest-neighbor method. It's a graph, as opposed to the one on the left, which is a tree, and it basically shows the current relative distance between everything. What that graph says is that sendmail 8.13.4 and 8.13.5 are really close together, followed closely by sendmail 8.12.9, and then the positive control, Postfix, is slightly closer than the negative control, Chromium. Again, this reflects, kind of intuitively, well, duh, exactly what we were expecting to see. But that's a good thing; I think it indicates this technique has real strength. This is something I've applied against some other malware samples. Have I gotten to the point where I've considered 20, 30, 40, 50 pieces of malware and tried to put them in the same graph? No, not yet. And again, that comes down to the ongoing development of production code to automate and produce these results with little interaction.

So that was the first application, version analysis, which is again very important if you're trying to receive updates for your system and keep up with them: OK, I've got this new version of code that came in, it's this new DLL, and it looks pretty much like the last one, but it seems to have something that looks like a rootkit. Well, that shouldn't be in there. That's the sort of thing I'm hoping to address with version analysis.

Now, as far as signature generation, this is going toward proactive defense of your own network, where you as a network administrator, or you as a person with your own computer, can verify and validate your own systems. How does this work? Step one, again talking about coverage between programs, is to compare the file, some piece of malware, against other malware; that's pretty much your candidate generation set. You're asking the question: what makes this particular program unique within a set of other programs? After this point, you get a collection of fragments that are included in the partial or fuzzy matches. Then you take the negation of that set and get the fragments that weren't included in any match. At this point, the third line down, you're working with the sections of code that make file A unique. And ideally, to make the system robust, if you're using this for antivirus signature generation, and to avoid matching things in your host operating system or your own applications, you need to do a negative selection, where you take the whole candidate set and run it against some universal self set. In this way, you remove the candidates that match your own systems and end up with a selection of signatures that identify this particular nasty, or good thing, or whatever, as unique.
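Here is a minimal Python sketch of the candidate-generation and negative-selection steps just described, treating match results as (offset, length) fragments. The interval handling is simplified, and rejecting a whole candidate whenever it overlaps the self set is an assumption about how strict the screening should be.

```python
def covered_intervals(matches):
    """Merge (offset, length) match fragments into sorted, non-overlapping intervals."""
    ivals = sorted((off, off + n) for off, n in matches)
    merged = []
    for start, end in ivals:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def uncovered_intervals(matches, length):
    """Negation: the sections of file A not included in any match (the candidates)."""
    gaps, pos = [], 0
    for start, end in covered_intervals(matches):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < length:
        gaps.append((pos, length))
    return gaps

def negative_selection(candidates, self_matches):
    """Drop any candidate fragment that also matches the 'self' (host) set."""
    self_cov = covered_intervals(self_matches)
    return [(s, e) for s, e in candidates
            if not any(s < ce and e > cs for cs, ce in self_cov)]

# Hypothetical numbers: matches of sample A against the rest of the malware set,
# then screening the leftover fragments against a universal self set.
candidates = uncovered_intervals([(0, 0x400), (0x900, 0x200)], length=0x1000)
unique = negative_selection(candidates, self_matches=[(0x400, 0x100)])
print(candidates, unique)
```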
So here's a quick example, and I'm going to do a really brief demo. Let's pose the question: what makes Borland compiler code unique? The way we're going to go about this is, first, I use a very simple program that exercises a number of compiler constructs, and then we compile it with a number of compilers, Borland, Visual C++, GCC, to get a set to compare against. What I'm not going to do is run that final candidate generation and negative selection; that's something that hasn't quite been reached yet.

So we go over here, and we're cheating already, we have the answers, so we'll start again. We have the output of the BinBLAST comparison after it's been run through the compare step. Unfortunately, this is very tiny. We'll go ahead this way, and that does not make it any less tiny. Unfortunate. But basically what this is showing is that four-tuple of results and how things match up. We're going to run this through a brief program that does the candidate generation I was talking about, because that previous result set basically says: these are the sections of code that were included in matches to other files. So we run it through the candidate generation program and get out a bunch of possible candidates. Of the three compilers I compared, the three executables that were generated, these are all of the differences that were reported according to this technique, and you can see this list is not very long at all. By the way, if you're curious what the code was, this was some example code that Chris Eagle had come up with; it's just a short series of comparisons and boolean operations to exercise various if-then-else constructs, switch constructs, very simple code.

So let's go ahead and pick one. What's here? Offset DEC9DB seems to be a unique piece of code. So we pop over here and look that up: DEC9, that's just the start of it. What does this actually look like? Well, it looks like we're matching on a floating-point multiply. Geez, that's kind of odd; I wonder why Borland is spitting out floating-point code when we're working with just some simple integer operations. Something interesting there. Unfortunately, as I said, the code is moving toward production, so this hasn't gone through a very thorough analysis, and since I was borrowing this example from some of Chris Eagle's examples, I don't have access to a Borland compiler right now. So this is just a quick overview of the sorts of things you can do with this. And back to the beginning of the presentation, which is not quite where I wanted to be; so we were here, and we were able to very quickly generate a signature, as it were, for what makes, or may make, Borland code unique.
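As a concrete illustration of the offset-lookup step in that demo, here is a small Python sketch that parses objdump -d output into an address-to-instruction map and prints the instructions at a candidate offset. It assumes the comparison offsets line up with the disassembly addresses, and the file name in the usage line is hypothetical.

```python
import re
import subprocess

def disassembly_map(binary_path):
    """Parse `objdump -d` output into an {address: instruction text} dict."""
    out = subprocess.run(["objdump", "-d", binary_path],
                         capture_output=True, text=True, check=True).stdout
    insns = {}
    for line in out.splitlines():
        m = re.match(r"\s*([0-9a-fA-F]+):\s+(?:[0-9a-fA-F]{2} )+\s*(.*)", line)
        if m and m.group(2).strip():
            insns[int(m.group(1), 16)] = m.group(2).strip()
    return insns

def show_around(insns, target, window=5):
    """Print a few instructions at and after a candidate offset."""
    for addr in sorted(a for a in insns if a >= target)[:window]:
        print(f"{addr:08x}: {insns[addr]}")

# Hypothetical usage, mirroring the DEC9DB lookup in the demo:
# show_around(disassembly_map("hello_borland.exe"), 0xDEC9DB)
```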
At this point I have just flown through my presentation, so let's look at some additional information. I mentioned earlier that I'd done some phylogenetic analysis against a couple of malware families. I was looking at three versions here, MyDoom.A, .L, and .Q, and for comparison I had both the packed and unpacked versions. Again, running them through this technique produces the same sort of relationships you would intuitively expect. We know that MyDoom.A, .L, and .Q are very closely related, and indeed, this particular technique comes out with the same answer. What's kind of interesting on the nearest-neighbor graph on the right is that MyDoom.A and the packed version of MyDoom.A are very close to each other. I have yet to sit down and look at exactly why that is; it's a bit of an enigma. It may be that the packer was used more judiciously later on, or it could be that the MyDoom.A code just didn't go through the UPX packer very well. These are all open questions.

So, in summary, what I've talked about so far is that by making a simple analogy, I was able to pull a technique over from the biology domain, and it seems to be working very well, at least in the test cases I've evaluated so far; it seems to work well on binaries. I'm developing an open-source suite, and the technique seems to be sensitive to compiler code generation. There are some other benefits as well: because it is a very simple technique, it has some excellent runtime characteristics, which is something I'm hoping to exploit in future work. As I said, I'm working right now on moving this to production code, and the people at Offensive Computing are helping quite a bit with that. Helping out with their automatic malware classification is the big project I'm pursuing right now. I'm also hoping, in cooperation with their collaborative reverse engineering environment, to start using this to identify particular motifs. As I said earlier, when you run this code, because there isn't randomness in the data, you end up with a lot of repetition in the output. One of the ways the biology domain has solved this is by identifying particular motifs, which is a fun concept: you can go through and say, all right, I've done some reverse engineering and I've come up with this; this is a ROT-13 decoder. I've looked at it, so I add it back into the reverse engineering database and say this particular section of code is a ROT-13 decoder. Then any time that particular code sequence shows up again, you have a database entry: this is the motif. So let's say you get something brand new; assume for a moment you've never seen MyDoom.D before. You run it through the algorithm, and all of a sudden you get matches against this library of existing code motifs, stuff you've spent a lot of time developing and considering. Lo and behold, you're getting a large number of matches to this ROT-13 decoder. What might that indicate? Perhaps this particular virus is using some sort of encoding scheme like that. So from that standpoint, I'm approaching this not only as an automated tool but as an interactive tool that will allow you to database your previous reverse engineering experience and extend it to your new work.
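Here is a minimal Python sketch of that motif database idea. The motif contents, the operator-class sequences, and the toy scoring are all placeholders; in BinBLAST the comparison against each motif would be the scored local alignment itself.

```python
# Previously reverse-engineered fragments are stored as normalized instruction
# sequences with a label, and a new sample is scored against each one.
MOTIFS = {
    "rot13_decoder":   ["mo", "ad", "cm", "jl", "su", "mo"],   # operator classes, illustrative
    "xor_decode_loop": ["mo", "xo", "in", "cm", "jn"],
}

def to_classes(instructions):
    """Normalize a disassembled instruction stream to operator classes
    (first two letters of each mnemonic, as in the four-byte format)."""
    return [insn.split()[0][:2].lower() for insn in instructions]

def best_run(query, motif):
    """Toy similarity: best run of consecutive class matches at any offset
    (a stand-in for the real scored local alignment)."""
    best = 0
    for shift in range(len(query)):
        run = 0
        for q, m in zip(query[shift:], motif):
            run = run + 1 if q == m else 0
            best = max(best, run)
    return best

def identify_motifs(sample_instructions, threshold=4):
    classes = to_classes(sample_instructions)
    hits = {name: best_run(classes, motif) for name, motif in MOTIFS.items()}
    return {name: score for name, score in hits.items() if score >= threshold}

# A brand-new sample lighting up on the ROT-13 motif would hint that it uses
# a similar encoding scheme, before any manual analysis has been done.
```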
As I keep saying over and over again, the time you have for applying patches to your systems is going to go down to almost zero, so you need to have some way to defend yourself. Soon this will all be available at the offensivecomputing.net site. Unfortunately, as I said, it's a somewhat tedious process of finding new functionality, refining the program analysis, and moving forward. At this point I'm running well ahead of schedule, so I'd like to open it up to any questions from the audience. There is a microphone in the center, so if you have a question, go ahead and use that. Are there any questions right off the bat?

I was just curious how your analysis would handle polymorphic code.

That is an excellent question. Very early on, there was great consideration of how this technique might be robust against that problem. One of the ways it does this is by making the assumption that every architecture has some sort of move operation, some sort of memory management, and you have your adds and your subtracts, your multiplies, your divides, your pushes and your pops, depending on your architecture. So instead of strictly considering "this is the instruction, I'm going to do an exact string match," I do a fuzzy string match. I've grouped all the operators into classes, and if I match just the operator class, I add four to the score. If I match the entire operator, not just the class, I add five. If I match the entire operator and all of the operands, I add six. And if there's no match at all, I subtract four. Those weights were the result of the Karlin-Altschul analysis mentioned here. In this way, it has some resistance to polymorphic code, in that it's looking not only at the specific operators and operands being issued but at the general gist of things. Additionally, because this is doing sub-alignment matches, it's resistant to code reordering: if you take a block of code, split it in two, and shift the pieces around in the file, it will match the first part and the second part separately. So in that way it was designed with polymorphic code in mind. It's one of those exciting things about having something that's just at the start: part of the design was that it would be resistant against that, but as far as how it actually does, well, that's a good question. I haven't quite gotten to that yet. Does that answer your question? Excellent. Do we have any other questions?
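To make that concrete before the next question, here is a minimal Python sketch of the fuzzy per-instruction scoring just described, assuming instructions normalized to (operator class, operator, operands) tuples as in the earlier encoding sketch; the example instruction tuples are hypothetical.

```python
def pair_score(a, b):
    """Score one aligned pair of normalized instructions.
    Each instruction is an (operator_class, operator, operands) tuple.
    Weights follow the scheme described above."""
    if a[0] != b[0]:
        return -4          # not even the same operator class
    if a[1] != b[1]:
        return +4          # same class (e.g. both are moves), different operator
    if a[2] != b[2]:
        return +5          # same operator, different operands
    return +6              # operator and operands all match

def alignment_score(seq_a, seq_b):
    """Ungapped score of two instruction sequences at the same shift."""
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))

# A register swap or constant change (simple polymorphism) still scores +5
# per instruction rather than dropping to a miss:
a = [("mo", "mov", "eax,0x1"), ("pu", "push", "eax"), ("ca", "call", "0x401000")]
b = [("mo", "mov", "ebx,0x7"), ("pu", "push", "ebx"), ("ca", "call", "0x402000")]
print(alignment_score(a, b))   # 15
```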
Have you ever used or heard of a tool called Bindiff?

Yes. This differs quite a bit from Bindiff, in that Bindiff is looking at a sort of function-block level: it goes through and catalogs the various function calls that are made and does its analysis at more of a basic-block level. This is actually going one level lower and doing instruction-sequence-level analysis. So yes, Halvar's work is something I looked at very closely in the development process, and it may actually have superior performance to this in terms of runtime; this is a bit more of a heavyweight approach. On the other hand, as I said, this has the potential to be more resistant to polymorphic code. Who knows? First of all, does that answer your question reasonably? The long and short of it was mentioned in Danny and Valsmith's talk earlier: everyone needs to collaborate on reverse engineering and malware analysis. Is there any one singular solution to all of the problems we're facing now? I think not. There may be some general solution along these lines, who knows? But I think it's important that everybody collaborate and say, OK, we've got this tool, I've got it to a point where it's working; well, how well does it work? How well does it work against Bindiff? That's something I've not really had a chance to get my hands on yet. I have an evaluation copy now, which I'm going to look at and play with and see how it does. This whole thing is just the result of a quick paper-napkin idea, well, what can I do with this, and it's gone from there. Incidentally, he made reference to a program called Bindiff, which is the third-from-the-bottom reference I have here. Halvar had a paper by the name of Structural Comparison of Executable Objects, which was one of his earlier papers; he's done a lot of work since then, and I believe there's even a new version now. So yes, I'm very interested in looking at that. And again, does that answer your question? Excellent. Do we have any more questions?

How does it handle code that is designed to generate instruction offsets at runtime and then use indirect jumps or calls to obfuscate the actual code locations?

That's a good question. Let's see, do I even have an example of that here? No, that's probably a bad idea. Unfortunately, all of the work you've seen on the screen today has been based on assembly generated by the objdump utility, which is a linear-sweep disassembler: it starts at the beginning of the code section and produces opcodes straight through. And incidentally, this particular approach does a very poor job with what you're describing, where code jumps to some offset fragment within the existing code. As it stands now, using objdump, it does an incredibly poor job of that. However, I'm looking at making use of some better disassemblers, looking at linking into IDA Pro, which has an interactive disassembly approach where you can walk through and say: OK, if I do a static disassembly straight up, this is what it's going to look like, but there's a lot of stuff in here that's intentionally fooling the disassembler, so let me step through and say, no, this is not quite the alignment you were expecting, go back and start again. So that's one approach. Another approach I've considered relates to another interesting direction this might go, which is looking at straight dumps off the network or straight dumps from memory. In much the same way that the decoding process with DNA works, where you have to define a structure or a reading frame before you can actually begin, you consider all reading frames.
So OK, I start a static disassembly at this byte and generate some disassembly. How did that turn out? OK. Then I move one byte forward and disassemble again, as opposed to moving two or three bytes forward as one might expect given the particular instruction being executed. So that's another way to do it. So yes, unfortunately what you see up here does not do well with that, but I'm looking at ways to address it in the future. Does that answer your question? Excellent. Do we have any other questions?

Do you see this applying equally well to assembly languages from all different architectures, or is this x86-specific?

That's an excellent question, and I was a little quiet, so to repeat it: is this cross-platform, is this applicable to other architectures? Yes. The way the hashing algorithm was developed to build the operator classes, it's an automatic system: regardless of what your disassembler spits out, it will be able to do that sort of analysis, and it will be able to match the moves, adds, and whatnot. Unfortunately, different architectures have some major differences in the way their code is actually generated. For example, whereas x86 is a stack-oriented architecture with memory-operand instructions, the code that's going to be generated for something like a PowerPC-type architecture is probably going to look quite a bit different. The corollary to that, however, is that there are a number of systems out there now which include virtual machines and try to do cross-platform development, and by the very nature of trying to develop code that works on many different architectures, I'm hoping this technique will actually improve its detection of that sort of thing. Does that answer your question? Excellent.

Have you seen much striation in your output based on the breadth of opcodes used for different instruction sets, or special-purpose opcodes for specific processors? Does it separate out based on the instruction set something was compiled for, and whether it's using anything that, say, the C language won't emit as an opcode?

That's a good question. No, I haven't looked at that yet; I haven't considered a broad spectrum of variation. The dominant compiler used in this work was GCC, because that's just what the distributions I was using had. How well does it hold up against other platforms? No clue yet. That's most certainly on my to-do list, and it relates to not having the full production code yet. So that's a direction I'd like to go. Does that answer your question? Good, good.

For those of you who couldn't hear, the question was, how is this helping to clear the tubes? Well, you must understand, it's not a truck; you can't just put anything on it. So this is going to take a racehorse and put it through the tubes. Does that answer your question? OK, all right. Any, oh, question back there.

So the kind of graphs you've shown, the phylogenetic trees: you would get those, especially the one on the left, with any kind of hierarchical clustering scheme, correct? Yes. Say with Huffman clustering, you would get that. Provided that you have the distance.
Same thing applies to the nearest neighbor, k-nearest neighbor. Again, provided that you have a distance, you will get that sort of picture for any distance that makes sense. So now, I understand that you're using BLAST as a means of obtaining that distance between two stretches of code, between two passages of code, right? Yes. Transformed according to your bitwise, sorry, your opcode compression classes, right? Yeah. So those are your features. But have you tried simpler distances, perhaps? Something like a naive first-order Markov model, this opcode predicting the next opcode? That's a lot less sophisticated than BLAST, but how well would that do?

Well, that's a good question. Would that produce the same kind of thing, or would it be all over the place? Honestly, I don't know. In the number of presentations I've given leading up to this one, that has come up several times, and I have no answer. I've not actually pursued that particular, not sidetrack, but parallel to this research: does a simpler model actually produce the same results?

Well, what I'm proposing is that we step back and take a wider view of this formalism. What this formalism really means is that you are defining a distance between two stretches of code. BLAST does not have to be that distance; it can be something based on naive Markov models, hidden Markov models, and so on. So it would be really interesting to see whether code is like the genome, or maybe code is somewhat different from the genome and a different distance would give you a better measure, right?

That is an excellent question. Actually, while you were presenting your question, I remembered some related research I'd gone over. I'm not the only one who has been considering this sort of nearest-neighbor analysis. There are two people, I think out of Italy, Ero Carrera and a colleague whose name I unfortunately can't pronounce; my language skills are not sufficient. Ero Carrera and others had a paper back in 2004, Digital Genome Mapping, where they used a slightly different measure for determining the distance, and it showed, again, it was not the same, but they were able to get the same sort of graphs.

So you could use mutual information between bytes at different distances, right? Yes, the information-theoretic measure; I think that's the sort of thing they used. Well, we can take this offline. Oh, OK, excellent. But yeah, it was a great idea to use BLAST on it.

Well, the underlying approach was: all right, we've got a similar problem; what tools work for that similar problem? Hey, here's sort of the Swiss Army knife. Does it work here? Yeah, it does. Well, how far can we extend that analogy? But you do bring up a good point: this sort of research would probably benefit from a cross-analysis against other sorts of distance measures, because is this the only way to develop a distance measure? No, it's just the one I've been looking at for now. So thank you for your question. Thank you. Any other questions? All right, again, thank you. You've been a great audience. I appreciate your time, and thanks very much.