So hopefully everybody will be entertained. I know everybody thinks, you know what I really want to do at 5 p.m. is go to a talk that involves math. So hopefully this will be excellent. Came for the math, stayed for the mustache. So that's me. The talk is "I Am Packer And So Can You." I'm going to attempt to keep this to the 45-ish minute mark so I can do some Q and A. Hopefully I'll get some hard questions, hopefully some good questions. All right, now we're on to the agenda. Do a little bit of an intro. Talk about the product, or not the product, the project. A little bit about me, because why wouldn't I, I am up here. Give everybody a little bit of a refresher on techniques. Talk a little bit about the PE format, since that's mostly what we're going to be focused on today. Then we're going to look at the data, pull out our magnifying glass and look at ones and zeros. Do a little bit of math. We're going to then look at the solution, and finally we're going to look at the results. So the most important part: me. What do I do? Currently threat research at Bit9 + Carbon Black. Those are my hobbies: static analysis, machine learning. Anybody else from Texas? There we go. If you're in Austin, I will totally buy you a beer. I run a little project on a website called secrepo.com. If you're looking for various security data, I try to keep a somewhat updated and curated list there, everything from Bro logs and Snort logs to other projects that have way more information than I could possibly host. Follow me on Twitter at @sooshie. And then finally, I'm an occasionally contributing member of the MLSec Project. Thanks, Alex. And feel free to tweet about this and use the hashtag, because math, because we are going to be talking about math. All right, so what's the main problem here?
So I'm sure a lot of people are familiar with the idea of detecting compilers and packers and crypters and all sorts of other stuff, right? There are some good tools. Some of the tools are really old. So I'm going to pick on PEiD here. PEiD was written in 2005, right? So in essence, it's 10-year-old technology. Maybe there's a more interesting or better way to manage this problem. So really the goal I set out with was: can we do something new and different? So we've got some goals. We've got some great projects out there, right? Like PEiD and some of the other ones. Yeah, they might be a little old, but there's probably still some validity there. However, for this, we're going to try and adopt kind of a zero trust towards them. In other words, if somebody as an analyst says, oh yeah, this PEiD signature is verifiably correct, then great: we, being myself or anybody else in this room, can create a signature and kind of directly translate it into this new language. The other goal is easy-to-create signatures. So looking at PEiD and some of the other associated tools, you've got to live in a hex editor, right? You've got to maybe open up IDA and find the exact pattern that you're looking for. It requires a certain bar to entry. So the idea here is: can this really be distilled down to something anybody can get value out of, right? So let's make it easy. And we're going to talk a little bit about the signatures as well. Cross platform. Running PEiD on a Mac itself, right? It's a Windows program. That's not going to happen. There are a couple solutions that attempt to let you run PEiD signatures on Linux or on a Mac. They're pretty good, but they're not as full-featured as actually using PEiD on Windows. So that's kind of a negative there. The other thing, once again, right: simple to extend and understand. So in my opinion, what I'm going to start with here is kind of this base notion, this idea, present some data and say, look, I'm pretty sure this mostly works.
And then hopefully somebody, multiple somebodies in this room or elsewhere, will go, wow, that guy wasn't really dumb, he was only mildly dumb, and instead, here's a couple of enhancements, right? And the other thing that I really wanted to get out of this was this idea of fuzzy matching. So if you've got something like PEiD or another signature-based language, generally either the signature hit or it didn't. So instead I want to introduce a notion of, well, part of the signature hit, and this is about how much of the signature hit. So in other words, when I use this, or when anybody else uses this for signature management, you can kind of figure out where your overlapping signatures lie, and you can maybe be a little bit more effective out of the gate. So with this we're going to jump in, just an easy refresher, talk a little bit about the terms. When I say certain words, what I mean might be different from what other people mean, so I want to make sure to do basic level setting, and talk a little bit about the PE file structure. I'm sure most of you in this room go home and dream about the PE headers. Probably not everybody does. All right. So this is a very simplified look at the PE file structure, right? You've got kind of this DOS stub at the beginning. You've got these other various headers, some of which are optional, some of which are only generated by certain compilers. You have this notion of sections, right? Some sections contain code and some contain data, and so forth and so on. This idea of resources: if you ever look at, you know, a program executable's icon, it's generally stored in the resource section. So there are many, many different parts. This is one of my favorite graphs, and I apologize if you can't see it all that well. These are all the header values that you can have in a PE file. Now, keep in mind not all of them are required to exist.
Not all of them are required to be filled out in an entirely accurate way, but this is what you can deal with. So there are a lot of things to mess with. They're color coded. So really, as far as the PE format itself and the header structure, this is what we're going to care about today. These are the three basic things that I decided on; whether I'm correct or not, that's fine. But three basic features out of the PE header that I said could be kind of interesting, and should generally vary enough from compiler to compiler, packer to packer, cryptor to cryptor, that they should be useful features for doing this type of analysis. One of them is number of sections. So things like UPX and a lot of other packers, right, maybe they jam the entire executable, and we'll get a little bit more into this in a second, into one section and then just have their little tiny data section. So when I use the phrase tool chain, what I'm talking about is the set of tools used to develop software. So you have things like IDEs and linkers and compilers and all that kind of stuff. And each one of these actually leaves a somewhat unique fingerprint upon the binary that it creates. Now once again, you can manually go in and change these. Not a lot of people do. So for this, when I talk about tool chain, we're actually going to talk about kind of the build environment: so GCC versus Visual C++. So packers, what are they? Packers are generally this program within a program. When I want to pack a binary, what I'll do is take the original executable, kind of smoosh it down and ram it somewhere inside this new packed executable. I generally want to do that to evade AV, right? Make analysts' lives harder, because who doesn't love stepping through OllyDbg, trying to figure out, how do I get the unpacked version of this in memory, because this is just ridiculous.
So at least if you can identify that the packer is similar to anything you've seen before, right, you know what steps you have to go through, or maybe you know what tool to pull out of your toolbox in order to do the unpacking. So there are really two parts to a packer. You've got the packer executable that you run on the original file; this is the thing that actually does the compression or the obfuscation and creates this new executable. And then you've got the unpacker. And the unpacker is generally this little stub that goes out in the new program, so that when this new executable is run, the stub is generally the first thing that is executed, and it goes through and it, you know, unpacks the original binary and goes, okay, now I'm going to run this. So really, when I talk about packer detection in this context, I'm actually going to be referring to the unpacker, or the stub, right? So unpackers, how do they work? What you really want to do is take control of the address of entry point, right? So when a Windows PE file is loaded, where should I go and begin executing code? You want that to now point to your stub. And then once you unpack it, right, so maybe you decrypt it or maybe you de-obfuscate it or whatever it is, you find the packed data, you kind of restore it, you get this little in-memory image. You've got to do a couple relocation fixes, because it's not the Windows loader doing the actual loading for execution, so you have to mimic some of that. And then you jump into the original program and keep going. All right. So now on to the popular kids. These are, in my opinion, kind of the three tools, and there are probably several more, that when people do compiler detection or packer or cryptor detection, this is what they're talking about. So PEiD first: it's nice, the signature language is pretty good, it's been around forever. In my opinion, it's kind of the de facto standard.
YARA has its own signature language, and there are several projects that will let you take PEiD rule sets and convert them to YARA rules so you can kind of update your analyst tools, but you're still kind of using this limited idea of what it is you're looking at, or this harder way to describe data. And then this last one, RDG Packer Detector. I actually really like their slogan. All right. So now we're going to dig into data, and who doesn't love data? And honestly, if you're going to talk about math, and if you're going to talk about doing any type of analysis, if you don't use data and you don't understand your data, right, it's really, really hard to get good results. And a lot of times, data is really ugly, right? It's not this beautiful end result, it's this nasty thing you have to slog through and dissect and understand. So this is the data that I used in my testing setup. I went and I found and I Googled and I threw together 3,977 unique PEiD signatures. That's a lot of PEiD signatures, right? So that alone kind of got me thinking, maybe we can address the signature management problem. We've got some file sets of various sizes, right? Got smaller ones that I understood, that I could pull apart and go, oh, okay, I get it, yeah, these two are right and this technique seems to be working. And then we have this giant random sample at the bottom, right? So 411,000 files. Because everybody loves big data, and this wouldn't be a math talk unless I used the phrase big data. So there you go. So that was kind of the end-all: after I felt comfortable with the technique and comfortable with the tool, that's what I ran it over to kind of verify, and then did some spot checking with that giant data set. We'll talk about that as well. So let's get into some of the data analysis, right? For this, there's a handful of slides. We'll go through them. We're going to talk about the basic exploration of the Zeus data set, right?
So if I go back a slide, I think, yeah, there we go, right? So roughly 6,700 samples is what these slides are based off of. There we go. Okay. So the first thing I did was, all right, what happens if I run PEiD on these 6,700 files? Well, it turns out PEiD signatures don't match 4,600 of them. Really disappointing. Then you get some other ones, right? So this different UPX and another UPX version and, you know, Microsoft Visual Basic and Armadillo Packer. And I'm sure just by looking at the numbers you could probably make a relatively educated guess that maybe Microsoft Visual Basic 5.0 and 6.0 and Armadillo Packer are really, really closely related. So what do those numbers look like in visual format? It's a bar chart. You don't have to worry about the numbers. That really tall line is the 4,600. So it's kind of another way to visualize it, right? Just to drive home the idea that creating signatures is hard, right? It's not trivial. So having an easier way to do it would be great, because then that really big giant, and I apologize for not using grayscale, blue box or bluish-purplish box, gets smaller, right? You get more things that you can actually label and understand. Okay, cool. So this graph, in my opinion, is what science looks like, right? You show this to somebody and they're going to go, that dude up there totally did science. So this is simply a correlation matrix. And the idea is you take all of these PEiD signatures, and for files that had multiple PEiD signatures flag, you want to see which signatures flagged together, right? With a high correlation, when one flagged, the other one was very, very likely to flag. The diagonal is basically each signature correlating with itself, which makes sense, right? Because every time a signature fires, it's going to be observed. So with this you kind of want to pull out the little black dots.
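To make that concrete, here's a minimal sketch (not the actual tooling from the talk) of computing a signature co-occurrence correlation matrix with pandas. The signature names and hit data are hypothetical, chosen to mimic two signatures that always fire together.

```python
import pandas as pd

def signature_correlation(hits):
    """Given one dict of signature -> 0/1 hits per file, return the
    Pearson correlation matrix of which signatures fire together."""
    df = pd.DataFrame(hits).fillna(0)
    return df.corr()

# Hypothetical hits: two ASPack signatures that always fire together,
# and a UPX signature that fires on a different file.
hits = [
    {"ASPack_a": 1, "ASPack_b": 1, "UPX": 0},
    {"ASPack_a": 1, "ASPack_b": 1, "UPX": 0},
    {"ASPack_a": 0, "ASPack_b": 0, "UPX": 1},
    {"ASPack_a": 0, "ASPack_b": 0, "UPX": 0},
]
corr = signature_correlation(hits)
# The diagonal is 1.0 (a signature always correlates with itself),
# and the two ASPack signatures correlate perfectly.
```

On a plot of this matrix, the "little black dots" off the diagonal are exactly the near-1.0 entries, i.e. overlapping signatures.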
And while this one's kind of hard to view, we can zoom in on one little snippet of the graph. So this is kind of that upper left-hand corner. And you can see that there are a couple signatures that are highly, highly correlated, right? So there's a lot of signature overlap. There could be signature overlap in your environments; there's obviously signature overlap in the stuff I downloaded from the internet, right? So every time one of these ASPack signatures flagged, the other one did. And so with that, you kind of get a feel for, oh, this is where I'm lacking, or this is maybe where I have some duplication. So that's where we've got, we understand what PEiD looks like on a sample data set. So now let's look at maybe some of the other features that we can use, in addition to the header features, that may allow us to definitively say, or say with a very high probability, that we're looking at, you know, a specific packer or a specific compiler. So we can use PDB strings. I love it when any type of malware author, or any author in general, includes a PDB string, because sometimes it's like hitting gold. Sometimes they're awesome, right? And they're like, oh yeah, by the way, we're using this cryptor called Crypto Revolution, right? You know, it's our Visual C++ project. Sometimes, you know, it's kind of random garbage; it doesn't really give you anything. It's important to keep in mind that these are just text, right? So there's no reason why you can't create your own, for misinformation purposes, right? So now, I kind of mentioned this linker version. You've got these major and minor linker versions. What do they look like in the sample set? So this is just kind of breaking it down. So if you look at the first one, right? Linker 2.5, 2,000 of them.
So while you can group, you know, this Zeus sample set, or many other sample sets, just by looking at the linker versions and their counts, it still really doesn't tell you the whole story. So we look at the number of sections, and you can see a relatively similar distribution, right? You've got big groups of files, right, that might indicate a specific campaign or something like that within the Zeus data set, and you have this longer tail. So the thing we really wanted to look at: assembly mnemonics. I think these are kind of cool. So the idea here is, right, when an executable runs, there's code. And that code, those bytes, can be translated into mnemonics. And all a mnemonic is, is simply, instead of the byte representation for add, it just prints out the word add. It's easier for me and a lot of other people to understand. So the idea is maybe we can use assembly mnemonics to help understand exactly what it is we're looking at. Right? And Johnny 5 is alive, but in order to get assembly mnemonics, you must disassemble. So, sorry, Johnny. For this, Capstone Engine was used. I don't know if anybody's played with Capstone Engine, but if you're looking for a free and really awesome disassembler, it's great. I love it. It supports multiple architectures, there are bindings for multiple languages, and it's super easy to use. The reason I call this out specifically is that I'm sure a lot of you have noticed that every single time you run a different disassembler on an executable or some code, you will get different results. Right? So really you only get consistency within a disassembly engine. So if you were to write your own or use one of the other disassembly libraries, the technique itself would still work, and that's totally cool, right? I'm not just pimping Capstone Engine, I like it a lot. The point is just to be consistent with this type of stuff.
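As a minimal sketch of that step, assuming the Capstone Python bindings are installed, here's one way to pull the first N mnemonics out of raw code bytes. The function name and the tiny byte sequence are made up for illustration; the import is guarded since Capstone may not be present everywhere.

```python
# Guarded import: Capstone is a third-party package and may not be installed.
try:
    from capstone import Cs, CS_ARCH_X86, CS_MODE_32
    HAVE_CAPSTONE = True
except ImportError:
    HAVE_CAPSTONE = False

def first_mnemonics(code, n=30, base=0x1000):
    """Disassemble raw 32-bit x86 bytes starting at `base` and return
    up to the first n mnemonics, in execution order."""
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    mnems = []
    for insn in md.disasm(code, base):
        mnems.append(insn.mnemonic)
        if len(mnems) == n:
            break
    return mnems

if HAVE_CAPSTONE:
    # A tiny hypothetical stub: push ebp; mov ebp, esp; xor eax, eax; ret
    stub = b"\x55\x89\xe5\x31\xc0\xc3"
    print(first_mnemonics(stub))  # ['push', 'mov', 'xor', 'ret']
```

In the real pipeline you would feed in the bytes at the PE's address of entry point rather than a hardcoded stub.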
So then I had what I thought was a really bright idea: I was going to look at the correlation between assembly mnemonics, right? So every time an add appears, how likely is it that a mov, or maybe a call or a jump, also appears? Yeah, that was an awful idea. So we moved on. So let's get into some of the math, right? Because how do you not love math? Math is so fantastic. So going back to the assembly mnemonics, right? These mnemonics describe the program behavior, and that's what we're looking to capture: what exactly is this unpacker doing, or how exactly does the executable get set up, right? Because it's generally compiler specific, or in the case of a packer or a cryptor, right, they have to know what to undo so they can then run whatever code they want. So we want to capture this program behavior, and that's what we're doing with these assembly mnemonics. So how can we compare these various assembly mnemonics? We looked at correlation. Correlation doesn't really take order into account. You saw the correlation matrix, it looked ridiculous, right? So imagine looking at that for 400,000 samples. It's going to be some massive gray blob, and you're going to go blind and be sad. So there's this notion of distance or similarity. The fuzzy idea is: if I have a signature, I want to know how close what I'm looking at is to the signature, right? How similar are two things, this idea of similarity. So we'll talk a little bit about Jaccard distance. Jaccard is awesome, it's cool; however, it doesn't take order into account. The thing is that assembly executes in order. It doesn't jump around. I mean, there's flow control and all that kind of stuff, but generally if you see an add, a mov and an xor, add is executed in that order, and not mov, xor, add, or vice versa. So while Jaccard is great and it might be useful, the order, I thought, was pretty important to take into account.
So there's this idea of Levenshtein distance, another cool distance metric, where the number of edits determines the distance, and position is important. So let's look at an example of Jaccard distance. Here we have two seemingly random sets of assembly mnemonics. We can say the leftmost is the one at the address of entry point, so this is where the executable will start, and then it moves from left to right. And you can see there are various ones. So the easy way to view computing Jaccard distance is to take the total number of shared elements divided by the total number of unique elements, and that's your distance. So in this case it's mov and push, which is two, divided by the size of the combined set, which is eight, so as far as set membership is concerned, these two things have a distance of 0.25. And while that's okay, it just didn't quite feel right. So with Levenshtein, once again, you have this idea of order. How many things have to change to make one into the other? So this fit the domain a little bit better. So once again, just doing a quick compare, you know, looking at whether they're different; so right here there's one difference, and then they're not different, and so forth and so on. So basically seven changes are necessary to make one set into the other set, therefore we get a distance of seven. So kind of what we were talking about before: code is executed in order. There may be branches; I really didn't want to build any type of flow graph or any of that kind of stuff. I wanted to keep it simple and understandable and efficient. So in theory, the assumption I worked with was that what's on the left should matter more than the assembly mnemonics on the right, because it will execute starting on the left and finish somewhere off to the right. And if there's a jump in there, maybe I want to care about it, but maybe I don't really want to care about the stuff after it as much as the fact that there was the jump.
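These comparisons can be sketched in a few lines. The mnemonic sequences below are hypothetical stand-ins (chosen so the Jaccard numbers work out to the same two shared out of eight unique as the slide), and the position-weighted tapering, with an edit at position i costing 1 - i/n, is included as the simple left-to-right compare described here rather than a full dynamic-programming edit distance.

```python
def jaccard_similarity(a, b):
    """Shared elements over total unique elements; order is ignored.
    The talk's slide quotes this 2/8 value as the 'distance'."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def edit_count(a, b):
    """Quick position-by-position difference count (same-length sequences)."""
    return sum(1 for x, y in zip(a, b) if x != y)

def tapered_distance(a, b):
    """Position-weighted differences: an edit at position i costs 1 - i/n,
    so mnemonics nearer the entry point (the left) matter more."""
    n = max(len(a), len(b))
    get = lambda s, i: s[i] if i < len(s) else None
    return sum((1 - i / n) for i in range(n) if get(a, i) != get(b, i))

def tapered_similarity(a, b):
    """Turn the tapered distance into a 0..1 similarity."""
    n = max(len(a), len(b))
    return 1 - tapered_distance(a, b) / n

# Hypothetical sequences, entry point on the left.
a = ["mov", "push", "call", "add", "xor"]
b = ["mov", "push", "sub", "jmp", "cmp"]
print(jaccard_similarity(a, b))  # 2 shared / 8 unique = 0.25
print(edit_count(a, b))          # 3 positions differ
print(tapered_similarity(a, b))  # stays high because the early positions match
```

Because the two sequences agree at the start and only diverge later, the tapered similarity is higher than a flat edit count would suggest, which is exactly the behavior wanted for unpacker stubs.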
So there were a bunch of tests and metrics where I tried to figure out where the cutoff was, how many assembly mnemonics were required, so forth and so on, and we'll get into that. We also had to take into account how big the stub is, and if you don't know what you're looking at, some of these questions are really hard to answer. So we turned to tapered Levenshtein, and this, I think, is a really, really cool algorithm. Basically the idea is that it's position dependent, like regular Levenshtein, except any edit to the left will have a higher weight than an edit to the right, which kind of makes sense. So this is a way to capture that we care more about the things that are executed first, something like a branch or a jump, right? And now we have a language, the assembly mnemonics, to capture program behavior, so we can put those two together. The way you basically calculate this is, for every single position, it's one minus the position of the thing you're looking at, divided by the length of the set. So in this case there are ten things in the set, so the first thing requires one full edit, the second thing requires zero edits, and the third thing requires 0.8 of an edit. So you kind of go on, and now you have a distance of 3.5. To me this was great, because it said, yeah, these things are separate and different, but there might be some sort of similarities. The nice thing you can also do with Levenshtein is use it as a similarity calculation. So if you want to use it as a similarity, it says basically those two sets are 65% similar. So this is how the idea of fuzzy hashing, this kind of idea of similarity, gets mixed into the algorithm. All right, so now we've made it through this great refresher. Everybody loves PE files and their headers and all the various values, and we have an idea of the features we're going to look at, right? We're going to use the
major linker version, these various assembly mnemonics, and number of sections. We have some really fancy-sounding algorithms that are actually really simple to understand, which is great, and we have a way to do fuzzy matching. Awesome. So what do we do? Well, first step, gather samples. We already talked about the data sets, so you know there are well over 411,000 samples that we dealt with. The second thing was, let's get PEiD, kind of this industry standard, this thing that I'm very comfortable with and have used a lot in the past, and let's see what it looks like for everything. Then from there, for every single one of the executables, we're going to disassemble them, because we need the assembly mnemonics, and in this case we end up using the first 30 assembly mnemonics. We'll get the header features. We'll talk a little bit about clustering, so you can kind of understand which PE files are similar based on these features. Then, when I ran this across all the data sets, my threshold was 90% similar. I felt that if an executable's signature and a signature that I was matching against were not at least 90% similar, that wasn't good enough to call it an actual match. One of the things that I started off using was banded MinHash. It's a similarity comparison optimization, because I didn't feel like doing a lot of comparisons, especially on 400,000 things. However, the implementation of banded MinHash that I was using was broken, so I wound up doing a lot of comparisons, but luckily not by hand. Then we created signatures so we could test and verify. So one of the things that I want to talk about briefly is: why signatures? We live in this great age, we're like, oh my god, security data science, we have to do supervised machine learning, and if we're not using random forests, or unsupervised, if we're not using DBSCAN or k-means with some k-means optimization, you're just like, no. Sometimes it's overkill, right? So one of the nice things about signatures in this case
is we can use them to capture kind of this domain-specific language, but me, or anybody else, we don't have to worry about model drift. So after you create this awesome machine learning model that might have great accuracy, what happens when you get new data and you go to retrain it? That accuracy begins to drift, the model kind of gets out of whack, so to speak. You've got to keep going through this large process; this is one of the issues with operationalizing machine learning. But also, the model will vary based on the training source. So if I trained it against only my APT1 set, well, then it would be really good at probably finding things labeled APT1, but it would be worse at trying to determine which packer or which cryptor is what, right? And likely everybody else will have different data than me. So it really wasn't a good fit. Kind of that last bullet is really where I was going: it's simple, right? You want to play, you want to do things, you want to tinker. Sometimes machine learning is fun to tinker with; sometimes you really just want to get something done. So here's kind of what the signature language itself looks like. Really, really simple. It's highlighted to show you the signature, and I'm going to go into a demo in a second, but the signature for Microsoft Visual Basic is the top line, and then the parts where it matched on the file, you can see those blue highlighted regions. There's quite a few; the ones on the left, there's a really long run, right? So you get this similarity of 0.902, right, because it required 2.9 edits due to the repeating edx, and it accurately captured, yeah, the signature is relatively similar to the file, and I feel pretty confident that this file matches my signature. All right, now let's move into a demo. Oh god. I'm going to minimize that real quick. I think we should be good. I think I broke everything. That's phenomenal. Yeah, seriously. All right, this is what I get for trying to do an honest-to-god demo, huh? Maybe if I don't do it in full screen we can
figure it out. There we go. It just hates full screen. So I'm going to, oh, the Asian guy showed up, awesome; it went from friendly math talk to Klan rally kind of quickly, I apologize. So I scripted this out, because I was kind of a chicken as well; I didn't want to type commands, and really, watching me type commands is boring. So I'll direct your attention to the top small box and walk through the demo. Oh god. All right, this really is, I apologize. You know what, just sit there for like two minutes. Gotta want it. That's all right, if it doesn't work I have slides, but I thought a demo would be way more entertaining for everybody. Third time's a charm. Third time was not a charm. When in doubt, try a different port. Oh, maybe, that would be awesome if that was the case. All right, I really didn't want to try and lean over. You know what, screw it, I'm going to unplug one more time, and then we're just going to go back to the presentation. If anybody wants to actually see a demo, I promise, I literally promise it works, I swear. No, no, no, I was trying to make sure that, I think we're good. All right, screw you, demo gods, now you get completely unreadable slides. So I'll try to describe what's going on. There are two phases to this. One, there's the signature generation phase, and that simply says, run this one script on a binary, which I can't even show on a computer, that's what I get for trying to do a demo, and generate the signature. And all the signature is going to be is a simple list of assembly mnemonics, plus this major and minor linker version, as well as the number of sections. And then all you have to do, if you're not giving a demo, is run this other script, that mmpes.py, if you can see it, on the signature, and you can do all sorts of things. You can give it a threshold, so if your idea of similarity is different than mine, right, if you say, I want to know everything that's 50% similar, you can do that. You can give it this crazy verbose flag, so it says, all right, here's the
signature that I have, and here's what I'm matching against. You can do that in case you really want to interrogate everything. It also tells you when the major and minor linker versions match, or when they don't, or when the number of sections doesn't match, right? It tells you how many edits you have, and then the actual similarity. So this is actually between two APT1 samples, and you can see the signature generated on the two files in this directory. The first one really didn't match all that well, right? It had this 0.844, required roughly four and a half edits. But then this other file matched exactly: all 30 assembly mnemonics were perfectly in order, both the major and minor linker numbers matched, as well as the number of sections. And here's kind of a better description of the rule that you guys might be able to see. All right, apologies for the demo. So let's look at some of the data sets. We'll actually look at some of the bigger ones, because, again, big data. So we'll start with APT1. For here, this is kind of describing the clusters, in other words, the like things grouped with other like things, and it's two bar charts superimposed on one another with the color variations; once again, apologies for not doing it in grayscale. So that very far one on the right: the idea is PEiD found, in that yellow bar, this many things that are similar, and then that green bar is the assembly mnemonic comparison saying this many things were similar. So kind of the cool thing is, even with having zero trust in the labels of something like PEiD, you get this anticipated view: you expect a lot of things to fall into a few buckets, and then you get this really long tail that, as an analyst, is always a pain in the ass to deal with. So one of the other ways you can represent this is these neat-looking bubble graphs. It's not really science unless you have sweet graphs. So this is just clustered on assembly mnemonics, so once again, kind of representing what you could
see: this one really large cluster and these other ones. The signature language and this work revolved around a couple other features, so what do they look like? All right, so the darker blue is the actual group; in this case, that big orange one is now the big dark blue one. And then within that one cluster, right, based only on assembly mnemonic similarity, you now have these three sub-clusters based on number of sections. And this is kind of interesting, right? There's maybe a little bit of variation, maybe somebody used a slightly different version of something, so forth and so on. Likewise with linker versions; I thought this was kind of neat. There's very little deviation in this sample set for linker versions when used as a sub-clustering. And then this is kind of a two-dimensional view of a three-dimensional set of features. So once again, the dark blue is the assembly mnemonic circle, and within it are the various sub-circles. In the one in the lower right-hand corner, you can see the cluster, and then you can see one cluster that was actually based off of number of sections, and you can see two sub-clusters in that, and then everything else only had that one cluster. So it's kind of cool. So let's look at Zeus: much bigger data set, much bigger graphs, much more science. So this is what Zeus looks like. Here, with the little teaser, you get this massive, massive, massive PEiD unknown label, but the clustering actually breaks it up. So in this one, the stacked one, you can see the assembly mnemonic clustering, that yellow bar, kind of in that blown-up window, is a little bit more manageable, and you get this slightly more gentle sloping curve, but you get a lot of bubbles. So either the end result is I shouldn't do anything in D3, or you should never D3 while you're high, because both scenarios end badly. So once again, what does it look like if we sub-cluster on number of sections versus
the initial cluster on assembly mnemonics? You get more circles. What if we do it on linker version? You get these crazy sub-spirals; these things look so bizarre. For me this was kind of enjoyable, because it was a really neat exploration of Zeus and kind of a way to visualize this entire data set. And then when you sub-cluster on both, you just kind of want to go home and cry; it's never very good. All right, so I mentioned that I did something on 411,000 files, which was awesome, so let's talk about them. All right, this is just the assembly mnemonic graph. So you can see there are tons of clusters based on similarity of assembly mnemonics. This is awful to read. So one of the fun facts about this is that roughly 5,800 out of these 411,000 files are not 90% similar to any other file in this entire corpus. I thought that was really cool and really surprising. So this might be some polymorphic stuff, it might be various crypters, who knows, but it was cool. 5,800 things is way too many for me to actually dig through, right? So we'll kind of skip through some of these. Everybody loves spirals, and I really wanted to leave 15 minutes for questions. So don't D3, or I shouldn't D3. I actually broke D3 on one of these; my nested JSON was too big and it just wouldn't work. So once again, some clustering on number of sections, and this is the one that I broke. This is where D3 just simply said, I give up, or, you're doing it really wrong. And it might very well be that I was doing it really wrong, but it cried. So there were a couple of really, really cool things that popped out of this relatively large data set. Like Google Chrome: there were 97 Google Chrome instances, right, hashes, in this 411,000, and they all match the same signature. Right, they all had this kind of same assembly mnemonic string, so they're very consistent with their builds at Google. So if anybody's in the room from Google, thanks, I appreciate that. They're very consistent with what linkers they have, what linkers they use. So out of
the 97, kind of the take-home is that 94 of those 97 have matching linker versions, matching number of sections, and assembly mnemonics within 90%, right, this .9 distance. So this is kind of cool. And then it really wouldn't be a talk about packers if we didn't talk about UPX, because somebody was going to ask about it. So this was kind of cool, this was kind of telling. I'd dug into UPX some in the past, but this actually forced me to do a little bit more digging. So I kind of cheated, and I said, all right, what if I do this really, really naively and just look for the string, right, UPX0, UPX1, or UPX bang in the file, and say that's probably UPX? Because once again, I didn't want to trust any prior solution, and I wanted to really see how this kind of stuff stacked up. So with the assembly mnemonics, right, out of just doing that simple thing, it got 65 different groups, and I thought, shit, now I'm going to be laughed off stage. However, there's some pretty cool results in here. So you can kind of see in the table there's this group label and there's this count. That group label, the cluster label, is just the arbitrary number that I assigned to this group. So you can kind of see, once again, you get this neat little slope, and I was like, all right, so maybe there's some variations of UPX. Maybe I'm much smarter than I thought I was, and I can do UPX version detection with this, right? Maybe my head's going to explode, or maybe I failed miserably. And the answer is kind of somewhere in between. So looking at how it stacked up against PEID, while I didn't trust the PEID results fully, it was neat to say that either, you know, me and/or every random person that I pulled signatures from on the internet were making the same mistakes, or maybe we're totally onto something. So kind of the cool thing was, here's the numbers; it looks like maybe I was onto something after all. There's also kind of that "none" group. I dug through that a little bit to see what was going on and if this algorithm was
completely failing. It turns out there's a bunch of packers that basically wrap UPX, which I really hadn't had much exposure to, so I thought that was awesome. I learned a whole bunch there about these kinds of variations. So, right, let's go through the recap. The idea was easy-to-generate signatures. Had I had a working demo, you would have literally seen me type one command and the signature would have appeared out of nowhere. It would have been awesome, but I can show you later; I'm happy to. It involves math, right? Who doesn't love math? It's cross-platform: it's all written in Python, because Python's the new old Ruby, and it uses the Capstone engine, which is cross-platform, so it met that requirement. It's mostly easy to understand; it involves a little bit of math, but hopefully not too bad, even for 5 o'clock on a Friday. And probably most important for me: even though the paper promised the demo and it didn't work, I'm going to release it online. The guys at work were more than happy to say, yeah, you can totally release this tool and some sample signatures for people to play with and use. That's the URL where it, and these slides, will live. The updated slides, because the old ones are on the CD. So feel free to take a picture of it, or you can ping me on Twitter. However, it's not up there yet, because I'm a slacker, so it'll probably get done next week. And last but not least, if anybody has any questions, I'm more than happy to answer them. So the question is, once you have all of this data, what's the action? And that's actually a really good question. So aside from "why did I do it? because I love messing with things," it's important, in my opinion, for any analysis to drive an action. And the action is to understand what you're looking at as a malware analyst, or somebody looking for extra context, right? So if I can kind of solve part of the signature management problem, and you can get this idea of fuzzy matching out of signatures and whatnot, and have fairly accurate signatures with very low lift, right, when you're at your home
organization and you've got, man, I've got this piece of malware that I've never seen, and you go grab 3,900 signatures off the internet, right? You can go, oh, right, here's a technique that uses these types of signatures, that works, that tells me how similar it is to some of these other things that other people have seen. So it kind of helps give you a starting point for analysis. And I'm going to go ahead and do some more. Okay, there's your bullet. What? Honestly, I haven't looked at it much, so I don't really know if I have a good opinion on it, sorry. Any more? I mean, it'd be awesome. Would you believe it? Oh, if anyone's using it? I haven't run into it. So the question was, have I run into anybody actually putting the packer information into the packed files? My answer is no, because I didn't run into it in any of my sample sets. However, even at 411,000 binaries, given the number of executables that everybody talks about, right, that's still a relatively small sample set. So it's nowhere near everything. Any more questions? Yep. So, does this apply to protectors as well, or am I using "packers" broadly? When I say packer, I mean protectors, cryptors, the whole gamut: the whole idea that you want to obfuscate some intellectual property or something in a binary, or someone wants to get at the juicy bits. Any more? Man, was my math that much on point that not everybody fell asleep and nobody has questions on math? All right, cool. So I'll be around if anybody has questions. Oh, one more: how do I make this mustache happen? I think it is genetics. It is math; this is what happens when you do too much math. You wear super classy... this is... I did, yeah. So I actually had a really long beard at one point in time, and my wife hated my long beard, because I told her I was going for wizard length. So I said, you know, if I can't have a long beard, I'm going to have a long mustache. Now I sleep on the couch. Too much D3? Exactly. All right, any more questions? Nobody? All right, cool. Thanks for coming. I appreciate it.
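(A few sketches of the techniques mentioned in the talk, in Python since that's what the tool is written in. First, the core idea: clustering files by assembly mnemonic similarity with a 90%, ".9 distance" cutoff. The talk doesn't specify the exact metric, so this sketch assumes Jaccard similarity over mnemonic trigrams; in the real tool the mnemonic sequences would come out of the Capstone disassembler rather than being hand-written lists.)

```python
# Assumption: "90% similar by assembly mnemonics" is modeled here as
# Jaccard similarity over mnemonic trigrams. The real tool's metric may
# differ; mnemonic lists would normally come from Capstone's disassembly.
def trigram_jaccard(a, b):
    """Jaccard similarity of the trigram sets of two mnemonic sequences."""
    ta = {tuple(a[i:i + 3]) for i in range(len(a) - 2)}
    tb = {tuple(b[i:i + 3]) for i in range(len(b) - 2)}
    if not ta and not tb:
        return 1.0  # both sequences too short to compare: call them identical
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def same_cluster(a, b, threshold=0.9):
    """Two files land in the same mnemonic cluster at the .9 cutoff."""
    return trigram_jaccard(a, b) >= threshold
```

With a pairwise similarity like this, the "5,800 files not 90% similar to anything else" result is just the set of files with no neighbor above the threshold.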
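(Second, the two sub-clustering features, number of sections and linker version, both come straight out of the PE headers. A real tool would likely use the pefile library; this stdlib-only sketch just follows the published PE/COFF header layout.)

```python
import struct

def pe_features(data: bytes) -> dict:
    """Extract number of sections and linker version from raw PE bytes.
    Stdlib-only sketch; pefile exposes the same fields more robustly."""
    # e_lfanew at offset 0x3C points at the "PE\0\0" signature
    pe_off = struct.unpack_from("<I", data, 0x3C)[0]
    if data[pe_off:pe_off + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file")
    # COFF header follows the signature: Machine(2), NumberOfSections(2), ...
    num_sections = struct.unpack_from("<H", data, pe_off + 6)[0]
    # Optional header starts 24 bytes past "PE\0\0"; Major/MinorLinkerVersion
    # are the two bytes right after its 2-byte Magic field
    opt = pe_off + 24
    return {
        "num_sections": num_sections,
        "linker_version": (data[opt + 2], data[opt + 3]),
    }
```

The Chrome result above is exactly these features being stable: 94 of 97 Chrome hashes agreed on `num_sections`, `linker_version`, and mnemonics within the .9 cutoff.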
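(Finally, the deliberately naive UPX check from the talk: grep the raw bytes for UPX0, UPX1, or UPX!.)

```python
# "Just look for the string": flag a file as probably UPX if any of the
# classic UPX section-name / marker strings appear anywhere in its bytes.
UPX_MARKERS = (b"UPX0", b"UPX1", b"UPX!")

def looks_like_upx(data: bytes) -> bool:
    return any(marker in data for marker in UPX_MARKERS)
```

This both over-matches (the strings can occur by chance, or in packers that wrap UPX) and under-matches (the section names are trivially renamed), which is exactly why splitting the naive matches into 65 mnemonic-based groups turned out to be informative.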