All right, I think we're going to start a little bit early because it looks like there's a lot of people here. Thanks for coming. I'm Val Smith, and basically I do malware analysis, penetration testing, exploit development, reverse engineering, all that kind of stuff. One of the founders of Offensive Computing, Metasploit contributor, and CDC NSF. And I'm Dal Chai. Pretty much a lot of the same. What I do is analysis of logs, analysis of traffic, and I also do development of IDS signatures, also CDC NSF, and the spiritual advisor for Offensive Computing. Yeah, definitely. And that's no joke, you can go to the page. OK, so just a real quick background of what Offensive Computing is in case you don't know. We're a big online malware research database. We do all kinds of analysis, and we basically have a malware blog, so when interesting viruses come out, we talk about them on there. We get our samples from members of the community, donations, honeypots, all that kind of stuff. Typically we get samples of interesting viruses and malware pretty quickly when they come out. A couple of examples: Rustock, which is a pretty intense rootkit, the Dolphin Stadium Trojan, the Symantec worm. So we've got like 160,000 samples, which I think is pretty much the biggest publicly known archive on the net. And we do auto-analysis on these samples, so if you go to our site, you'll get a bunch of information about them. So a little bit about what the auto-analysis is and what is in our database. There's a searchable web interface where we get file types, multiple checksums, packer detection, which I'll go into a little bit. And we do a bunch of antivirus scans. If you've seen something like VirusTotal or one of those, it's similar, except we store more information. And you can actually search for specific viruses and get copies if you want them. And that's a nice visualization of the herpes virus. OK, so we have all this malware, thousands and thousands of samples.
What do we do with it? What's the point? Well, besides having an archive with analysis that we can check (wow, they're loud next door), we can mine the data for interesting information, especially the strings in the malware. This is a little more useful than you might think, even though a lot of malware is packed or encoded. You can still find a lot of good information. We'll show some good examples of that. This talk's going to be kind of light, kind of entertaining and funny, hopefully. So relax, and we'll get to it. Some of the other good information inside the malware that we'll find is like URLs for command and control, callbacks, droppers, email addresses, IP addresses, that kind of thing. So I want to go into a little bit about how this information has been gathered. We call this talk Malware Secrets, but it could also be called Malware Statistics, because we're going to go into some statistics that we gathered from this huge collection. So like I said earlier, the way we get our malware is through honeypots, submissions, spam attachments, that kind of thing. So it's sort of a dirty collection in the sense that there could be all kinds of files in there. We do some things to verify that they're actually valid PE files, but it could be anything. Each of these files was not manually verified by hand, because if it takes like four to eight hours per sample and we have 100,000 samples, we'd never get done. So we're relying on a combination of things to know that it's pretty much valid. We're also going to talk about some antivirus company statistics, and I want to stress this isn't to be taken as gospel. It's sort of fuzzy information to give you some ideas about how things work, but don't necessarily take it as set in stone. We currently use Linux-based scanners to scan our collection, and we'll go into which ones in a bit. Now I'm going to turn it over here to this guy in a second.
Oh, real quick, the results of our auto-analysis are saved into a database. So we have this huge database with all this information that we can look up. When you go to our website, there's an interface to this database where you can get this information. We did a lot of data mining with Perl scripts, shell scripts, that sort of thing. These results are somewhat fuzzy, not necessarily 100% accurate, but they're pretty good and they give you some interesting trends. Another thing about our collection is there's a lot of genetically similar samples, meaning variants: they have different checksums but the same basic functionality. Okay, my part of the deal is that once things have been reverse engineered, unpacked, and stored in the database, I go through and do what I call shredding: I go through the strings results of the malware we get, searching for specific things. At first I was doing this by hand, and I discovered that that takes a lot of hand. So I started to write a tool. The tool's name is PISDA, and what PISDA does is automatically go through the database systematically and run a set of 26 unique regex and PCRE checks against the strings results in the database, extracting information such as IP addresses, email addresses, or custom strings. There's actually a command line function that'll allow you to put in whatever string you want, if you're looking for something specific. And what it does then is it takes the results from that and puts them into a database with a unique marker as to which piece of malware each came from. From there we can correlate things such as how many people are using the same pieces of malware, where they're coming from, what the results are. And from there, at this point, it automatically generates Snort signatures and pushes them out.
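As a rough illustration of the shredding step described above, here is a minimal Python sketch. It is not PISDA itself (which the speaker says is mostly Perl and shell); the three patterns, the function names `shred` and `correlate`, and the row layout are all assumptions made for illustration, since the talk does not list the 26 checks.

```python
import re
from collections import defaultdict

# Illustrative patterns only (assumed); the talk mentions 26 checks but does not list them.
PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "url": re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE),
}


def shred(sample_id, strings_output, extra_pattern=None):
    """Return (sample_id, kind, value) rows for every indicator found.

    `extra_pattern` stands in for the custom string given on the command line.
    """
    patterns = dict(PATTERNS)
    if extra_pattern is not None:
        patterns["custom"] = re.compile(extra_pattern)
    rows = []
    for kind, pattern in patterns.items():
        for match in pattern.finditer(strings_output):
            rows.append((sample_id, kind, match.group(0)))
    return rows


def correlate(rows):
    """Map each indicator value to the set of samples that contain it."""
    seen = defaultdict(set)
    for sample_id, _kind, value in rows:
        seen[value].add(sample_id)
    return seen
```

Tagging every row with the sample ID is what makes the later correlation possible: an indicator that maps to many sample IDs is being shared across many pieces of malware.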
So my goal, and I have not achieved it quite yet, is that from the time a new piece of malware is entered into the database until an IDS signature is generated is roughly five minutes. And we have some examples of what we've found. This is one of the more interesting ones: 777.gif came from approximately 36 unique sources from around the world. This is a call that was found inside of different pieces of malware to download this particular file. And I've just got a small sampling of the places that it was found. But if you actually download 777.gif you'll find that it's not quite a GIF file. It's actually a little bit of HTML, a little CSS. And it calls its malware buddies and says, hey, we're having a party at this computer, come on down. So that's one example of the things that we find correlating the data from PISDA. And also we found some really unique things, and this is a good list of some of the oddball bits that we found. The NeroStartSmart was one that caught me by the short and curlies because I wasn't expecting to find that ever. Current theory right now is that it actually is making ISOs of your hard drive and then taking them down. But that again is strictly theory. Some of the other ones you can see where it's executing pieces of malware that it downloaded. And there's some nice email addresses here. These email addresses were used as phone homes. And part of the vetting process is to make sure that the email addresses we're getting are actually being used as phone homes, because a lot of times malware will create false pages that say, hi, my name's Bill from Yahoo, if you need help, go to help@yahoo.com, which obviously isn't going to be any kind of repository. So part of having the PISDA database is to filter those out but to keep the unique ones. And the last two you see there, although they seem to look random, operating theory is that they're encoded in such a way that they may be passing data in the first part of the email address.
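One simple way to probe the theory that data is being passed in the first part of an address is to run the local part through a few cheap transforms and look at what comes out. The sketch below is only an illustration: the function names are hypothetical, the sample address in the usage note is made up, and nothing here reflects the actual addresses on the slide or any result the speakers obtained.

```python
import codecs


def rot13(text):
    """Apply ROT13 to the alphabetic characters of text."""
    return codecs.encode(text, "rot_13")


def xor_single_byte(data, key):
    """XOR every byte of data with the same one-byte key."""
    return bytes(b ^ key for b in data)


def printable(data):
    """True if every byte is a printable ASCII character."""
    return all(32 <= b < 127 for b in data)


def candidate_decodings(address):
    """Return candidate decodings of the part of the address before the '@'.

    Only ROT13 and single-byte XOR are tried; XOR results are kept only
    when they are fully printable, to cut down on noise.
    """
    local_part = address.split("@", 1)[0]
    candidates = {"rot13": rot13(local_part)}
    raw = local_part.encode("ascii", errors="ignore")
    for key in range(1, 256):
        decoded = xor_single_byte(raw, key)
        if printable(decoded):
            candidates[f"xor_{key:#04x}"] = decoded.decode("ascii")
    return candidates
```

For a made-up address such as `uryyb@example.test`, `candidate_decodings(...)["rot13"]` gives `hello`. A human still has to read the candidates; the code only narrows down what is worth reading.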
Hey, real quick, actually, we were looking at this. We tried a bunch of different types of encoders like XORs and ROT13. Does anybody out there recognize this type of encoding? Have an idea of what it might be? Okay, I was just hoping somebody would. We'll keep working on it, but there's lots and lots of these in there. This is what I do late nights when I'm sitting there home alone listening to Pink Floyd, you know, the black lights on and mm, mm, mm. So the thing is, automated IDS signatures, is that a good idea? Well, these are the questions I expected you to ask. It's not the best idea in the world, because sometimes we could have false positives, and we could automatically create 100,000 signatures and suddenly the IDSs would be clogged. So right now it's done manually. Automatic is just one switch away; I mean, dash A will make it automatic. And I've tested that at home and I've watched my machine go up in smoke. So I'm working on a way of throttling it, but the idea is again to get accurate signatures out. So the problem is still being crunched on. I have the database broken up into three sections: a black list, a white list, and a gray list. A black list would be something we know is bad. We've confirmed it's bad. There's no question whatsoever in this world, like your 777.gif. A gray list is something we're not quite sure of. And the reason I do that is because I've discovered a lot of websites lately that are supposedly innocent. And they are innocent. A good example was a Ma and Pa Kettle grocery store webpage somewhere in Kansas. It was a one-page website. They just said, you know, all the dry goods you need, come on down, we'll be happy to see you. The site had been compromised, but instead of putting something silly on there like, you know, we pwned you or LOLcats or whatever, what they did was they made a subdirectory and filled it with malware.
And then subsequent attacks through email or other methods would point back there. And the owners of the webpage would never know that they're actually a malware repository. So we would discover things like that and we'd try to report them, but sometimes it's not possible. Also, delayed deployment, because sometimes there are things that hit the gray list. With the gray list, when we're not really sure, we can delay the deployment of it and manually check it. I can't stress enough that the majority of this has been manual labor. Actually, one person sitting down and vetting these things out to make sure they're right or wrong and then going to automation, because without that, you really open the doors to making some big mistakes, and I prevented a lot of big mistakes by doing that. So the PISDA database is the result of quite a bit of manual hands-on work. It's not just I ran a script and went and got a bottle of Jack Daniel's and watched The Hunt for Red October. I actually sat there and vetted these things out by going to the websites and verifying that these things are actually malicious. Are we going to catch everything? We're not going to catch everything. And the answer to that is, what does? The idea behind this is to be an aid to analysts, because as an analyst your time is valuable, and with this system you can automate enough of it that you spend more time doing actual, I don't want to say legitimate, but resultful work than spending your time manually going through things and manually vetting things. So it's not meant to be an end-all, be-all, but it's meant to be an assistant tool. It's something you can run over here while you're doing something over there, to help the analysts do their job better. And the name PISDA, well, I was in a mood one night. And I think there's only one or two people out there, I can see you smiling already, who know exactly what that means. I'm a moody coder. So what's going to be done with PISDA? It's under improvement.
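The black/white/gray split and the delayed deployment for gray entries could be sketched roughly as follows. This is an assumed design, not PISDA's code: the function names, the list contents, the starting `sid`, and the exact rule text are illustrative. The rule uses standard Snort 2.x options (`content`, `http_uri`, `nocase`, `sid`, `rev`), and the sample ID is written into the `msg` field so a rule can be traced back to the malware it came from.

```python
import re
from enum import Enum


class Verdict(Enum):
    BLACK = "black"  # confirmed bad: safe to turn into a signature
    WHITE = "white"  # known benign: never turn into a signature
    GRAY = "gray"    # unsure: hold for manual vetting


def classify(indicator, blacklist, whitelist):
    """Place an indicator on one of the three lists."""
    if indicator in blacklist:
        return Verdict.BLACK
    if indicator in whitelist:
        return Verdict.WHITE
    return Verdict.GRAY


def snort_rule(uri_fragment, sample_id, sid):
    """Build one Snort 2.x rule matching uri_fragment in an HTTP request URI."""
    # Refuse anything that would need escaping inside a Snort content string.
    if not re.fullmatch(r"[A-Za-z0-9._/-]+", uri_fragment):
        raise ValueError(f"unsafe characters in {uri_fragment!r}")
    return (
        "alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS "
        f'(msg:"OC sample {sample_id} requests {uri_fragment}"; '
        "flow:to_server,established; "
        f'content:"{uri_fragment}"; http_uri; nocase; '
        f"sid:{sid}; rev:1;)"
    )


def triage(indicators, blacklist, whitelist, first_sid=1_000_001):
    """Return (rules to deploy now, gray-list items held for manual review)."""
    rules, held = [], []
    sid = first_sid
    for sample_id, fragment in indicators:
        verdict = classify(fragment, blacklist, whitelist)
        if verdict is Verdict.BLACK:
            rules.append(snort_rule(fragment, sample_id, sid))
            sid += 1
        elif verdict is Verdict.GRAY:
            held.append((sample_id, fragment))
    return rules, held
```

Holding gray items back rather than dropping them mirrors the manual vetting described above: nothing uncertain reaches the IDS until a person has looked at it.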
I'm still writing the code. It's Perl mostly, some shell, and people keep asking me to convert it to Python, which is something I'm working on, but I've got to learn Python first. And PISDA will be available on OC. What I'm hoping to do is push the PISDA signatures to the OC website so they'll be available in a manner similar to the actual pieces of malware. Also, when the signatures are created they're tagged with a unique ID number which correlates to the ID number of the malware in the database. So they're easy to put two and two together in case of a false positive or, in the more likely case, where someone says, I can write a better rule than that. And some people can, and I really respect that, because there's people who have a lot better skills at writing IDS rules than I do. It's not meant to be a catch-all, again, it's not meant to be the end of all things, but with a little bit of work, my goal is to improve the response time. Like I said, the idea is to have a window from when a new piece of malware hits the OC database until the time the signature's released of about five minutes. That's the goal. And I'm very, very close to that goal. Okay, so I'm going to kind of go pretty quick because I think we've got probably another 40 slides in 10 minutes, so okay. One of the things we did was gather statistics about packers; this is what people use to obfuscate their malware and make it hard to analyze. A couple of things we thought were interesting are which packers are used the most and also which ones are used the least. Why do we only have a couple of samples with a specific packer? Are those maybe targeted attacks of some kind? So one of the interesting things we found is we could actually detect in a lot of cases what compilers are being used. And this is on a sample set of about 35,000 to 40,000 samples. So there's the results.
Basically, Microsoft Visual C++ is the most commonly used compiler, then Visual Basic, Delphi, et cetera. So this is a slide you might have seen if you came to my talk yesterday. These are the most commonly used packers, UPX by far being the most commonly used, and then all the rest. And these are the least used packers. So out of our entire collection, we've only got a few samples of each. Let me explain what these numbers mean here real quick. So basically right here is the name of the packer. This is how many samples we had, and this is the percentage it represents of this particular graph, not of the entire total. So like for example, Morphine: we only detected Morphine in one sample out of 30,000 to 40,000, which I think is kind of interesting because Morphine's a pretty good packer. I don't know why more people aren't using it. Okay, so now I'm going to talk a little bit about AV testing. Hopefully there's no AV companies in the audience. So there's lots of sort of misinformation about AV testing and how it should be done, who performs the best. It turns out that testing AV correctly is actually pretty hard. As far as I know, nobody out there really does it the right way. There's a whole set of things you need to do. You need to verify your samples by some method so that you know, yes, these are real malware, these actually are valid, and then there's all kinds of stuff to test in the antivirus. Like you need to test their static signature capabilities, their heuristic capabilities. Different products actually do quite a bit of different things. So there's kind of a lot of variables. Just straight up testing how well each antivirus company detects a sample is not really that useful, but it's somewhat interesting, and that's basically what we did, so we'll talk about it a bit. So we tested around 30-something thousand samples with this, and about 446 of them weren't detected by any antivirus company.
So, taken collectively, they don't do too badly. Individually some of them are better than others. Out of these 446 files, we didn't manually test all of them, but we took a quick sampling and the ones that we did test were indeed malicious. So here's some examples of the detection rates of the antivirus products that we tested. You can see there that Bitdefender has a high detection rate. Now this could mean a lot of things. Either it has a very large signature set, which isn't necessarily very good because it may have a lot of false positives, or maybe it's good. So this is just an informational slide for you to take into account. I wouldn't necessarily use this to decide, oh, I'm going to buy Bitdefender because it's the best one. That's not really what this means. It's just that on our sample set, this is how they did. So this is sort of the converse graph of the previous slide. This shows you which ones miss the most. So obviously Bitdefender misses the least. That's why its bar is a little bit smaller. Again, take this with a grain of salt. So what else can we find by analyzing all this malware? One of the most interesting things that I found to do is to figure out: is malware using your company for some reason? You have a name brand, you're out there doing business, is malware leveraging that? If you have a huge name, maybe they're going to try to trick people into coming to their site by modifying your name slightly or whatever. Are they using you for a drop point? Like he was talking about before, putting malware on your website without you knowing it. This kind of stuff we can actually detect. Okay, so these are going to be a bunch of examples of funny stuff that we found in malware. Parsing the strings yields a whole bunch of different results. Like here, there's a lot of just Russian websites. We parsed our malware for the word hack, looking for specific URLs, and we found a whole bunch.
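Both kinds of search mentioned above, URLs containing a keyword such as "hack" and hostnames that are a slight modification of a brand name, can be sketched in a few lines of Python. This is a guess at one workable approach, not the speakers' method: the function names, the similarity threshold of 0.8, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def urls_mentioning(strings_output, keyword):
    """Return every URL in the strings output that contains keyword."""
    keyword = keyword.lower()
    return [u for u in URL_RE.findall(strings_output) if keyword in u.lower()]


def lookalike_hosts(strings_output, brand, threshold=0.8):
    """Return (host, label, similarity) for host labels that nearly match brand.

    Exact matches are skipped: the point is to catch slightly modified names.
    """
    brand = brand.lower()
    hits = []
    for url in URL_RE.findall(strings_output):
        host = (urlsplit(url).hostname or "").lower()
        for label in host.split("."):
            ratio = SequenceMatcher(None, label, brand).ratio()
            if label != brand and ratio >= threshold:
                hits.append((host, label, round(ratio, 2)))
    return hits
```

A company could run `lookalike_hosts` with its own name over the strings of each new sample; the threshold trades false alarms against missed variants and would need tuning on real data.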
One interesting thing to do might be to go to each one of these pages and see what's on them. Looking for government, we find a lot of Brazilian sites. Brazilian bank Trojans are pretty prevalent out there on the internet right now, and for some reason they reference government sites all the time. Maybe it's because these are the people that are trying to detect them and they're trying to avoid being caught. All kinds of international connections in the malware. There's a website called vx.netlux.org, which is similar to my site. It's a virus collection, except it's a lot more open: basically you don't have to have an account, you don't have to do anything, you just go and download whatever. So I think a lot of the malware samples are just droppers, and they download something more nefarious from the netlux site. So a couple of other interesting things, some email addresses that show up all the time. I think these Xfocus guys, they used to write a lot of exploits, and probably the malware authors were just ripping exploits out of the code, dropping them into their malware, and that's why they're showing up. Hopefully they're not authoring massive amounts of malware. Obviously, these are the kind of things that show up a whole bunch: bank, social security numbers, credit card. So one of the trends we've noticed is the financial motivation behind malware is growing quite a bit. So just interesting URLs. I thought this was kind of funny. I don't know if you can patent that, but it might be worth a shot. So yeah, you can find stuff like botnets. This is pretty much some botnet commands. I don't know why someone would put this in a malware sample, but whatever. Maybe after they own you, they want you to know what they think of you. This came up thousands of times. I don't know why anyone would use this thing, but there's a credit card generator. Yeah, if you see this, don't do it. I thought this one was a little bit funny.
Some malware out there is using TFTP to hack your box. This was kind of interesting. It's a little bit hard to read, but what this shows me is that whoever wrote this piece of malware had access to the RBOT source code, and it sort of gives you some hints as to who they might be, the path on their computer where they're coding this up. So this guy is kind of a friendly hacker. Someone who doesn't speak English very well, I guess. I'm not sure what a web station is, but he hacked it. These guys apparently own a lot of stuff, and say goodbye. More source code. This is interesting. Hotmail hack tool, visual basic script. Cancer and Jesus owned you. So I thought this was really interesting. He's asking you not to send the police to him, but he's telling you where he is, and who his crew is, and who he is. So why would you encode this in your virus? I don't know. This poor guy. Yeah, I don't know. I feel sorry for him. I don't know if you guys recognize that picture. That's the Barbie bank robbers. But yeah, banks show up all the time in our malware collection, just tons and tons of banks. So I think a lot of the malware is actually trying to steal your account. If you've ever seen those keyboards that you use your mouse instead of typing, that really doesn't do any good. So don't do online banking. All right, so basically in conclusion, we've got this huge collection of malware and it provides a lot of interesting data mining opportunities. You can also do AV testing to get some ideas. It's not the best way to test AV, but you get some good results. So basically, thanks for listening. My time's up, and I appreciate it. You got anything to add? Okay. Oh, thanks to everyone who came to the party last night at Crave. That was a great party last night and a big shout out to the NSF for helping flyer and help run the thing. CDC, Metasploit, PoundVax, three of three, thanks.