I'm Dan Ragoza. I work for General Dynamics Advanced Information Systems, and I'm going to be talking about automated malware similarity analysis. I'll be the first to admit that the title sounds a bit complicated, but it's really not; the approach is simple and fairly straightforward once you're familiar with it. Some quick disclaimers: all the work I'm presenting today is my own, not General Dynamics' or its customers', even though my bio mentions one of them, and all of it was done on my own time, et cetera.

The problem I'm trying to address is the time and cost of reverse engineering malware. I'm talking more along the lines of in-depth reverse engineering, not so much surface analysis, dynamic analysis, or quick facts; more along the lines of gathering implementation details, characteristics that might identify the author, things like that. Stuff that takes time. If you're reverse engineering to produce compatible clients or software, the problem becomes reducing duplicated effort, especially in a large group. It's really easy to fall into redoing work simply because you're not familiar with all the reports being generated by your group. I also don't like anything that requires more work; if it isn't automated, I usually don't bother with it.

I'm also targeting groups that routinely unpack samples. I'll show you why later, but if you're not unpacking your own samples routinely, this process isn't going to help you very much. Again, when I talk about similarity links, I mean them for the purpose of reducing duplicated work, but there's also the side benefit of gathering intelligence: in some cases you can draw links between two pieces of malware and gain intel value you weren't aware of before.
The initial idea I came up with, after talking to a couple of people, is basically the spoiler slide; it tells you everything I want to do. Use IDA's auto-analysis, which I'll agree is not the greatest thing on earth, though there are ways to address that. Break the submitted sample into individual functions, calculate a fuzzy hash for each function (I'll get into why I use fuzzy hashes, and into some other algorithms that might be better), and then calculate the similarity between all the functions across all the samples in your database. Once we've calculated the similarity between all the functions of all the samples, you can get a feel for which samples are similar to one another.

For those of you who aren't familiar with fuzzy hashing: it's essentially like hashing, except the hash value that's produced can be compared against another fuzzy hash to generate a similarity score. Unlike a traditional cryptographically secure hash, which should never be indicative of the content, a fuzzy hash is indicative of the content. It's tolerant of small changes like truncation, insertions, and shifts.

Fuzzy hashing for malware isn't new at all. If you've read some of the papers about it, plenty of people have gathered statistics on it, tried it, and had some success. The difficulty is that people tend to apply fuzzy hashing to the entire sample. If you're fuzzy hashing full malware samples and you're dealing with small, tiny files, you tend to have files that are heavy on structure: heavy on PE structure, heavy on null values.
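The function-level comparison described above can be sketched in a few lines of Python. The real system uses a fuzzy-hashing library; here `difflib.SequenceMatcher` stands in for the hash/compare pair so the sketch runs with the standard library alone, and the two sample dictionaries are hypothetical:

```python
import difflib

def fuzzy_similarity(blob_a: bytes, blob_b: bytes) -> int:
    """Return a 0-100 similarity score for two function byte blobs.

    Stand-in for a real fuzzy-hash comparison (e.g. ssdeep's
    hash/compare pair): difflib compares the raw bytes directly
    instead of comparing compact hash strings, but the idea
    (tolerance of small insertions and shifts) is the same.
    """
    return int(difflib.SequenceMatcher(None, blob_a, blob_b).ratio() * 100)

def cross_compare(sample_a: dict, sample_b: dict, threshold: int = 80):
    """Compare every function in one sample against every function in
    another and return the pairs that score above the threshold."""
    hits = []
    for name_a, blob_a in sample_a.items():
        for name_b, blob_b in sample_b.items():
            score = fuzzy_similarity(blob_a, blob_b)
            if score >= threshold:
                hits.append((name_a, name_b, score))
    return hits

# Two hypothetical unpacked samples, keyed by function name:
# the second function is the first plus a couple of trailing NOPs.
sample_a = {"sub_401000": b"\x55\x8b\xec\x83\xec\x20" * 8}
sample_b = {"sub_402500": b"\x55\x8b\xec\x83\xec\x20" * 8 + b"\x90\x90"}
print(cross_compare(sample_a, sample_b))
```

With the `ssdeep` Python bindings installed, `fuzzy_similarity` could instead compare `ssdeep.hash()` values via `ssdeep.compare()` without changing the rest of the sketch.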
There's lots of junk in there that doesn't really tell you much, but fuzzy hashing takes it into account and it goes into the similarity score. The same is true, in a sense, for large files: there you end up with embedded payloads, whether a decoy document, a second stage, or something like that, and you don't care about those as much because you're looking for similarity in the code. Many of those issues go away just by applying fuzzy hashing at the function level.

Automated malware similarity analysis is not new either. There are definitely commercial products, and for certain some private ones, for this same purpose. But I don't think there's anything free that works as well as I'd like for this purpose. Some things worth mentioning: zynamics BinDiff and VxClass. I had tried BinDiff before, and actually got a demo of VxClass yesterday afternoon, and I think it's really awesome. Their goal is slightly different, though: they're trying to determine similarity between overall samples. What I want to do is find tiny bits of nearly identical code across multiple samples. That code may only comprise a small portion of the samples, but it's important to me because I'm looking for code reuse for attribution, intel, and reducing duplicated effort, so overall sample similarity isn't as important to me. I think that's what VxClass focuses on right now; having talked to them, they may make a shift toward small-scale similarity detection, but it's not really easy right now.

I also mention HBGary's Digital DNA because it's a classification scheme, and I want to bring it up because classification is not what I'm trying to do. The system won't tell you that something is a trojan or a keylogger or anything like that. I don't really care what it is; I'm just looking for duplicated code.
So classification is definitely not what I'm going after. And your agency may have private tools that do something similar to this as well.

The refined idea, after talking to people about how to approach this, is to create an open-source project. I don't want it to be private or reserved for my agency and the people we work with, which is why I did it on my own time: I wanted contributions from other people out in the community. Similarity algorithms can become incredibly complex if you let them, so my idea is to keep it simple and stupid: use fuzzy hashing. Fuzzy hashing isn't the best algorithm for determining similarity between two functions. It has no idea what assembly mnemonics mean or what the code is actually doing, and it can't do all the fancy comparisons BinDiff can. But it's really simple to understand: from my perspective, if you have two binary blobs that are similar according to fuzzy hashing, it's because the bytes are really similar. There's nothing to understand beyond that. And of course, I wanted to create an interface so that integrating it with other tools wouldn't be hard; it's really straightforward.

Some limitations. I know my title says automated malware similarity analysis, but it's not fully automatic, because I don't provide you an unpacker; I'm not good enough to write one that would be worth releasing, I guess, is the gist of it. But you may have unpackers available to you, and the target niche I'm going after would be unpacking these samples anyway. If you're not unpacking and you're not doing full reverse engineering, this isn't going to help you that much anyway, so I'm hoping that's not that big of a deal.
But you could definitely chain that into the process if you have some sort of generic unpacker available.

Fuzzy hashing itself, as I mentioned, has some limitations when calculating similarity, for example on small functions: small functions don't give it enough information to calculate a fuzzy hash that can be compared effectively. It's also thrown off by simple things like offset changes, though not to a great extent. Overall it produces a very conservative similarity score, which is actually really helpful for identifying really similar code. There are definitely other options worth exploring, and I'll try to get to those a little later; please remind me if I don't. I'm also relying on IDA. I realize that's not free, but I assume, again, that if you're doing this sort of work you have access to IDA anyway.

Going back to what I was saying before: this is not a classification system, but it's also not an identification system. Just because you can throw a sample in and get some similarity to existing malware doesn't mean it is malware. It could be reused Base64 code or something simple like that, which doesn't mean anything on its own. But similarity is important to look for anyway, even if it isn't malicious.

Some quick implementation details: it's all Python, MySQL, and PHP. I've tested it on Windows and Linux currently, but it should work on anything IDA runs on. This is a long-term idea; not all of it is implemented, but where I'd like the project to aim is to break samples down into function blobs, gather lots of other information that could be used in a generic way for similarity analysis, pack it all up, send it to a database that can store it, and pre-compute all the similarity information if we have to.
One of the things I didn't get into is another limitation of fuzzy hashing: you can't effectively sort or index fuzzy hash values. When you're searching for something similar based on a fuzzy hash value, you have to compare it against everything else in the database. It's a really exhaustive process, and you can imagine how quickly that grows if you have lots of functions. So all that data is pre-computed, because if it weren't, the client would be really slow. We pre-compute all those values, store everything in the database, and make it available via a really simple web front end right now, though other interfaces are definitely possible. Another thing worth mentioning: as currently implemented, all of the really important logic lives in SQL stored procedures, so writing another interface should be really simple.

One thing that isn't implemented yet, which I'd like to do, is letting the user submit an IDB, something BinDiff and VxClass do right now. That would let you fix up the IDB after auto-analysis into something worthwhile, something worth chunking out, throw it into the system, and have it just go forward from there.

Something that came up recently is Google's Courgette. I think it would be an interesting project to see what it could do for malware similarity analysis, because Courgette is aware of assembly mnemonics, so it can find the smallest difference between two pieces of code and try to generate a patch. My thought is that if you have a really small delta, that might indicate similarity, so you could pre-calculate that as well and throw it into a similarity scoring system.
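The exhaustive pre-computation described above could look something like this sketch. Here sqlite stands in for the project's MySQL back end, and the table layout and function names are my own illustration, not the project's actual schema:

```python
import itertools
import sqlite3

def precompute_similarity(conn, functions, compare, floor=70):
    """Compare every function against every other once, at ingest time.

    `functions` is a list of (func_id, fuzzy_hash) pairs and `compare`
    is any hash-comparison callable returning 0-100. Because fuzzy
    hashes can't be sorted or indexed, the comparison is inherently
    exhaustive (O(n^2) pairs); doing it up front, and storing only
    pairs above a floor, is what keeps the front end responsive.
    """
    conn.execute("""CREATE TABLE IF NOT EXISTS similarity (
                        func_a INTEGER, func_b INTEGER, score INTEGER)""")
    for (id_a, hash_a), (id_b, hash_b) in itertools.combinations(functions, 2):
        score = compare(hash_a, hash_b)
        if score >= floor:
            conn.execute("INSERT INTO similarity VALUES (?, ?, ?)",
                         (id_a, id_b, score))
    conn.commit()

# Toy comparison: identical hashes score 100, everything else 0.
conn = sqlite3.connect(":memory:")
precompute_similarity(conn, [(1, "AAA"), (2, "AAA"), (3, "BBB")],
                      lambda a, b: 100 if a == b else 0)
print(conn.execute("SELECT * FROM similarity").fetchall())
```

Only functions 1 and 2 share a hash, so only that pair survives the floor; the web front end then needs nothing but indexed table lookups.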
I'd also like to work on IDA and Immunity Debugger integration, so that as you're looking through code, it might automatically comment code that's similar to something you've seen before in other samples, et cetera. As I said, right now it's fuzzy hashing only, which I think is actually working really well. There are a couple of small bugs that aren't too big of a deal, but there's definitely work to be done, which is why I want this to be a community project rather than something only I work on.

So I have a demo of the web front end to show you, as well as some of the back-end stuff. I'll caveat this by saying that I had some really good malware samples, but looking back on them, I realized they reflected the work I was doing, and my employer wasn't very happy with that. So these are just random samples I pulled off the web, but they highlight the ideas I want to point out anyway.

There's really nothing interactive about this at all. There's simply an inbox, which in a real environment I'd foresee being a network share. You throw your unpacked samples in, and on a cron job you run the init script, which really just automates launching IDA. It extracts all the pertinent information and packages it up into a zip file that can then be processed by the back end that actually computes similarity. You'll see zip files for all the samples in the system; they include all the information you need, all chunked out, all really simple to parse, which makes it easy to go forward from here. On this laptop, for example, I can get through the full process at about five samples a minute, but it's a laptop running a VM, so I'm sure you could do better by throwing more hardware at it if you're going through larger volumes.
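The packaging step the init script performs might look like the following sketch. The zip layout and manifest format here are my own guess at a simple convention, not the project's actual format, and the extraction itself (done by scripting IDA in batch mode) is assumed to have already produced the per-function byte blobs:

```python
import io
import json
import zipfile

def package_sample(sample_hash: str, functions: dict) -> bytes:
    """Bundle one unpacked sample's extracted function blobs into a
    zip that an ingest script could consume. `functions` maps each
    function's start address to its raw bytes; the layout (a
    manifest.json plus one file per function) is illustrative.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        manifest = {"sample": sample_hash,
                    "functions": sorted(functions)}
        zf.writestr("manifest.json", json.dumps(manifest))
        for addr, blob in functions.items():
            zf.writestr(f"functions/{addr:#x}.bin", blob)
    return buf.getvalue()

# Hypothetical sample with a single extracted function at 0x401000.
data = package_sample("f" * 64, {0x401000: b"\x90" * 16})
print(zipfile.ZipFile(io.BytesIO(data)).namelist())
```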
So after the zip files are generated, there's another script that goes through them and ingests them all into the database. This is a really simple front end I put together quickly, primarily because all the logic I cared about already existed in stored procedures. So most of this interface really just... holy crap. And that's the danger of live demos; one second. Alright, great.

So this is the front end, and right now it's just a list of all the samples that have been ingested into the system. There's a lot more information stored in the database, but this is just a simple listing of the stuff I've recently imported. If you click on any one of the samples, you'll get a list of binaries similar to it, but one of the views I think is more interesting is looking at all the similar binaries against one another across the entire system, cross-correlating everything to figure out what's similar to what. I've also set a floor on minimum function size, to address the fuzzy-hashing problem with small functions and keep the similarity meaningful.

I haven't really looked too closely at this data, but I'll try to pull out something that seems interesting. Let's see. So I clicked on the relationship between these two samples, and it says these two functions are very similar. The similarity score is not that great, but you can see that, for the most part, the flow of the code looks the same, with the exception of the offsets. And the similarity it's giving you is only about two functions between these two binaries. If you ran an overall similarity score between the two full binaries, you'd get a really low score no matter what algorithm you used. It really doesn't matter what you're doing.
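The ingest script described above could look like this sketch, again with sqlite standing in for the project's MySQL back end and assuming an illustrative zip layout of a `manifest.json` plus one `functions/<addr>.bin` entry per function blob (all names are my own, not the project's):

```python
import json
import pathlib
import sqlite3
import zipfile

def ingest_inbox(inbox: str, conn) -> int:
    """Walk an inbox directory for packaged .zip files and load each
    sample's function blobs into the database, recording each blob's
    length so a minimum-function-size floor can be applied later.
    Returns the number of zips ingested.
    """
    conn.execute("""CREATE TABLE IF NOT EXISTS functions (
                        sample TEXT, addr TEXT, length INTEGER, blob BLOB)""")
    count = 0
    for path in sorted(pathlib.Path(inbox).glob("*.zip")):
        with zipfile.ZipFile(path) as zf:
            meta = json.loads(zf.read("manifest.json"))
            for name in zf.namelist():
                if not name.startswith("functions/"):
                    continue
                blob = zf.read(name)
                addr = name.split("/", 1)[1].removesuffix(".bin")
                conn.execute("INSERT INTO functions VALUES (?, ?, ?, ?)",
                             (meta["sample"], addr, len(blob), blob))
        count += 1
    conn.commit()
    return count
```

Run from cron against the network-share inbox, this is the point where fuzzy hashes would be computed and the pairwise similarity table refreshed.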
But what if that one function is a custom crypto algorithm the author uses across multiple tools he's written? It might be important to look for that, and I think that if you looked at the full sample alone, you wouldn't get that picture. One thing I can say, though I can't really show it with these samples since I pulled random stuff, is that this happens a lot: you'll see functions that are very, very similar, with only trivial changes between them, while the entire samples show very little similarity. I think we're missing that because we're looking at the wrong scale.

So ultimately, fuzzy hashing isn't the greatest thing for this. There are other options people have experimented with, like fast Fourier transforms and bioinformatics-related similarity approaches along the lines of applying BLAST to binaries, and I think those are worth trying out. It's worth throwing those similarity systems in here, but at the function level, so we can start looking for those traces of code common between samples and try to find something meaningful in all of it.

With that said, I don't have a lot more to show you unless you have questions. I'd also be willing to let you come up and just throw stuff in here, if you can give it to me temporarily at least, and try it out.

Sure. So you're asking: if I throw a sample in and look for similarity, is it showing me a comparison against everything else in the database? Yeah. Currently, it takes all the functions in the sample. For example, if I click on one of these samples here, there's nothing similar to it. But if I click on any of these others, it looks up in the database all the functions that belong to that sample and all the functions from any other sample that are similar to them, and it shows you all that information aggregated.
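That per-sample lookup, with the minimum-function-size floor mentioned earlier, could be expressed as a single query. The schema here is illustrative and mine, not the project's real stored-procedure logic:

```python
import sqlite3

def similar_to(conn, sample_id, min_len=32):
    """Aggregate every stored similarity link from one sample's
    functions to functions belonging to any *other* sample.

    Assumes illustrative tables functions(id, sample_id, length)
    and similarity(func_a, func_b, score). Functions shorter than
    min_len are floored out, since fuzzy hashes of tiny functions
    compare poorly.
    """
    return conn.execute("""
        SELECT fa.sample_id, fb.sample_id, s.score
          FROM similarity s
          JOIN functions fa ON fa.id = s.func_a
          JOIN functions fb ON fb.id = s.func_b
         WHERE (fa.sample_id = ? OR fb.sample_id = ?)
           AND fa.sample_id <> fb.sample_id
           AND fa.length >= ? AND fb.length >= ?
         ORDER BY s.score DESC""",
        (sample_id, sample_id, min_len, min_len)).fetchall()
```

Even a single surviving row here, like one shared crypto routine, is the kind of cross-sample link an overall-sample score would bury.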
If you have malware that actually writes itself into other executables, I'd expect you could get a lot of matches that way if you ingested all those executables, yeah. Sure. Any other questions? Sure.

Yeah, that's one of the things that came up pretty early on: a whitelist of some sort, right? There are definitely functions you don't care about and don't want in there. That's not really reflected in the UI, but yeah, there's capability to whitelist functions. I'm a little hesitant to do that, because I think I care about any sort of similarity, but if the noise really bothers you, it'd be worth filtering out. For example, take a standard library you see frequently, throw the entire thing in, and say this binary is whitelisted, all these functions are fine, don't worry about them. Anything else? Oh, sure.

Right now, well, I can't talk about anything other than this demo database, which has, I'd say, about 80 to 100 samples; it's not very big at all. The database itself isn't very large, but the large chunk of it is actually storing the IDB, the sample itself, and all the function blobs. There's lots of data in the database, but overall the similarity links themselves are really tiny.

If there are no more questions... oh, go ahead. A public database? Possibly, if you want to do the hosting. All right, thanks, guys. I appreciate you coming. Hope you enjoyed it.