Hello there. Good afternoon. Thank you for coming. I understand there is a long line at the airport, so you're probably safer here than there if you're hoping to get on a plane. We are the last talk on the last day, or at least we're in the last slot. As in the other talks I've given yesterday and the day before, we'll go through our material relatively quickly and try to get to questions, and then we'll be trying to get out of town, since I'm driving to San Francisco after I finish here. We are driving. I'm Paul Vixie. This is David Dagon. I am from ISC; he is from Georgia Tech. As with most of us, there are lots of other affiliations and other ways in which you might know me or you might know him, but those are the primary ones, and those are the organizations being represented by this work. I'm going to give some introductory remarks, and then David is going to get up here and explain the hard stuff, which is how we usually divide the labor. So this is my part: the political details and the economic details. We're going to explain how malware gets to the people who are interested in it. We all know how malware gets to the places where nobody's interested in it, but it also gets to places where it's welcome, or at least where it is interesting content. We'll explain how it gets there and then what those people do with it. I'm going to talk about what I would like them to do instead, and David is going to talk about ways that you can take a piece of malware and torture it to make it reveal its secrets, some of which are very interesting. As much as I wish it had the author's ICBM coordinates, or at least the Gestapo coordinates, that stuff's not in there, but there is other interesting information it will reveal once you pull it apart and pin it down. So there are a lot of people who run honeypots, and they collect malware.
Sometimes they'll let it go so far as to penetrate some VMware instance and execute some number of instructions or system calls until it reaches the point where they pull the plug and say, well, no, that's about far enough: I think you have unpacked yourself, and I think I can now take copies of you and figure out what you're trying to do to me. Generally these will be antivirus companies whose principal interest is to figure out what they can make of it: is there a pattern file for this? Is there something they can publish to their customers that will cause this piece of malware to be recognizable without having to do all that work? Obviously the authors of the malware know that this is going to be attempted, and so there's a struggle to obfuscate and to make that sort of pattern difficult. Now, there have been some high-profile mistakes made by various antivirus companies, where they will put something into a pattern file that mistakenly also matches all the Excel files on somebody's computer, and then they have to do an emergency release. So that tends to mean that even if they're not going to share the samples they collect with their competitors, they'll still have to spend a lot of time doing quality assurance on whatever patterns they think they're pulling out, because a mistake like that is kind of a front-page event in meatspace. The ideal position for an antivirus company to be in is that no one ever got fired for buying their stuff, and the only way to hold on to that is to not make mistakes. You can, however, hold on to that without actually doing your customers much good, so there's a bit of tension there.
Now, to the extent that they share at all, they do not share openly, because if you have a malware sample that came into one of your honeypots, then maybe you can build a pattern for it and get it to your customers faster than your competitors can get it to theirs. So to the extent that there's any sharing at all, it's only after I've had a head start that maybe I'll give it to my competitors. That's unfortunate, because the ultimate losers are the victims: the people who would like to have their laptops, or whatever it is they're running that's vulnerable to malware, protected. Having all of this brinksmanship going on amongst the people you're paying to protect you does you no good. So a fair number of people feel that it is in their best interest to run more than one antivirus package on their Windows machine, sometimes as many as three or four. My experience with the factory-installed Windows that came on my laptop, before I installed SUSE Linux, was that the antivirus stuff slowed it down, and I can only imagine that running three at once would slow it down three times as much. Nevertheless, that's what people feel they have to do, because they know that if they're only buying their pattern files from one source, they're not getting them all, or they're not getting them on a timely basis. There's also some incompleteness built into this methodology: not every piece of malware is going to attack every honeypot. So we started with an assumption: what if it were economically unfeasible for antivirus companies to hoard? What if we caused it to be in their best interest to share, in a community where sharing was the norm? I believe we would get much better coverage and less sensor bias. I just want to say that every talk I've given at DEF CON, I've been next to a room where they're having more fun than us, and so I apologize for that. I seem to be a bad luck charm.
Anyway, there's a philosophy of sharing for mutual benefit that could be applied here, and that's when I got to talking to David, and he said, well, there are some details about how the scheduling works, how the economy works, and how the technology works; let me tell you how much harder this is than you think it's going to be. Many pints of beer later came this presentation, these slides, and the work that David will now demonstrate. Thanks, Paul. I wanted to start off by asking people in the audience: how many people here run honeypots, or something of that sort? How many of those running honeypots also go the extra step of doing reverse engineering, maybe some unpacking work? You may do it professionally, if you're in that category, or maybe you do it for a warm sense of entertainment. So, as it turns out, the data that you collect from honeypots is, as you know, an imperfect collection. Running honeypots as you'll often see them run is a waiting game. You set up an intentionally vulnerable machine, you sit around and wait, a couple of months go by, then something happens: you get an alert, you look at what's in your traps, and you say, oh gosh, another SDBot, I'm sick and tired of getting these. What's happening here is that there is, as many of you running honeypots have intuited, a need for a very wide collection field. I'd be interested to know what the largest network block is that anyone here is monitoring, but it works out to be the case that malware authors are aware of the different policy frameworks that people are collecting malware under, and of the inefficiencies that come from sharing and putting these together into a collective sample set, and they've adapted many of their practices to exploit just the time it takes for us as researchers to communicate ideas and even share samples.
Another particular problem is that of illuminating sensors. Those of you who had your hands up for honeypots, if you could raise your hands again: how many of you take the trouble to determine whether those honeypots have received samples that other honeypots have also found? In other words, do you just run with what you get, or do you first check with your friends to see if they also got something like what you found? Many malware authors are suspicious, and of course there's a very large IP reputation scheme being run by black hats, where they essentially try to map out the internet space and figure out where the honeypot sensors are. It's actually very simple to do, and there's been some recent work on it in academic circles: Bethencourt and colleagues had a great paper at the recent USENIX Security conference on mapping internet sensors, where they basically figured out who was working with DShield by sending low rates of packets, pings, and seeing what happened on DShield's reporting service; then they knew which IPs were affiliated with that reporting system. So it's very possible that the malware collection system you have is known by the bad guys, and what you're getting is lower-tier, lower-grade random stuff that might be run from kits. Additionally, there's the problem of automated victim updates, which I'll get to. In order to illustrate this problem I need to tell a little bit of a story, and here it gets controversial, because people often want to put precise numbers on this graph, so bear with me as I describe the conceptual lifetime of a piece of malware. There are four important phases, and they're all on a critical path with each other, so they have to happen in this order. There's the authorship of the malware; the release of it, which we can call zero day in that sense; and then there is D-Day, detection day, which I describe as the first opportunity for detection.
It's the first moment when that propagating malware shows up in somebody's sensor log, in an IDS alert, or in a honeypot somewhere: the first chance that there could have been a discovery of this malware, followed of course by a response. Now, we'll disagree about how far apart these different phases of the cycle are, and there's probably no correct answer in all cases, but certainly they have to happen in this order. Let's look back at the history of what antivirus companies have been doing. If you read old copies of the virus journals, you'll see some early studies on how long it takes antivirus companies to update their signature lists for newly known viruses; in 2004 that was measured on the order of weeks and days, and nowadays it can often be measured in minutes. That's not to say updates and patches and new signatures arrive immediately, within minutes, but rather that they're available; then a large corporate infrastructure takes over, and it may take another week after that to actually deploy the signatures. But at least we can think of the response time as having moved a little bit closer to the detection time, and we now often speak of it in terms of minutes. There are of course instances where it takes days and even weeks for antivirus signatures to be updated, but they should be measured in minutes, so we can conceptually think of response as a very short time frame. So how can we as a group move detection a little bit closer to the release day, to zero day? Well, there are two countervailing factors that prevent this from happening. The first is that malware authors actively avoid known sensors. If you've just been running a honeypot on the internet, not taking care in how you've structured it and not coordinating with others, there's a very good chance that what you're doing is already known and mapped by somebody.
Maybe not by all the bad guys, but somebody out there is certainly aware that what you're running is a honeypot, and they've recorded at least the network block that you're in. An additional factor that prevents D-Day from being moved closer to the release day is that repositories don't share. Think of the entire IPv4 space as a flat line, where each of us owns some piece: someone has a slash 8, someone has a slash 16, and a hobbyist might have a slash 32, maybe a few more IPs. You can think of these as buckets collecting raindrops that fall in at random; perhaps malware is analogous to a storm in that respect. If we don't correlate all the information we have together, your detection potential is really just limited by the largest collection block that you have. So if you're running a slash 32 and you're not working with other honeypot researchers, what you know about what's happening on the internet is a one-in-two-to-the-32 chance that it will show up in a timely manner. That's not to say you won't see things; there's a lot of activity out there. But there isn't any ability to reason about moving that detection day further over by aggregation, simply because repositories don't currently share. Now, sensor illumination is at once very simple and very sophisticated. What malware authors will often do is compile a single unique virus, a one-off version. They may take an existing source kit, modify it, throw in some source code from other programs so the signature is completely different, pack it with something unusual, and send it out. And they'll send it not to just anybody, but to one sensor in particular, just one IP. They don't spray it into their botnet; they don't inject it into an existing infected cloud. They send it to one IP that they're testing, or perhaps a slash 24 in a network that they suspect is collaborating with antivirus companies.
Then they look at their watches and wait to see how long it takes for the antivirus companies to update their signatures. If it turns out that that IP later caused an antivirus signature update particular to that virus, then they know this must be Symantec's IP block, or this must be some researcher's, and of course they'll try to figure out who does this. If you were around for the early days, there was a site called Mazafaka, M-A-Z-A-F-A-K-A dot C-C, and I guess the joke is that whenever you have to talk about this site you spell out the name. They had a rather extensive bulletin board, some of it in Russian, some of it in English, listing an astounding range of IPs on the internet, basically noting who they work for, who they collaborate with, and, if you send your samples to them, which vendor's tool gets updated most quickly and what other tools follow in tow. So malware authors who are actually trying to make something really valuable, a very smart worm or virus, will simply build this list into the propagation logic of the virus and not send it to those honeypots. There's no value in it; it would only help the researchers. This has been going on so long that academics noticed the problem and wrote about it at USENIX, which is an indication of how pervasive it is. Now let's reason about what it means to have detection measured in days and response measured in minutes. And by the way, I think we owe the AV industry some gratitude for having changed response from something on the order of days down to minutes. But what does it mean now that we have this gap between detection in days and response in minutes? Well, what will happen is that there will be queen bot programs that are aware of the fact that it takes days for detection to occur. Think about it this way.
What if you had a program whose purpose was to compile other programs, viruses, and send them out? And it did this on an automated basis, and every version it spat out was somehow different: it would take random functions, insert dead code, obfuscate, change the key for the packer, change the order of packing, do recursive packing, all very simple things to script up programmatically. This queen bot program is then spitting out unique child programs that don't match any existing signature. An existing botnet can be told: download this new binary, delete the old one, and start running the new one you've downloaded. Many of you have probably noticed that a lot of bots have this update feature built into them. Rick Wesson, who talked earlier today, went over some of these same bot update lists that are out there; there are actually very nice web interfaces for controlling large botnets, and you can tell them: download this new binary, delete the old one, and keep running the new one. And they may connect to a new domain, or probably the same domain. This is done on such a rapid basis, six-hour windows, that the chance of an AV company actually catching up with that storm of new unique viruses is almost nothing. So here's what happens. Looking at the graph we had before, we have the malware running in the wild for maybe under a day. It will be days before detection occurs. And at this point the virus will be told: delete yourself, download this new version, and run it instead. The new version is completely different and doesn't match the signature of the old one, should a signature ever arrive. The bot runs for about half a day and repeats the cycle over again. And so what ends up happening is that you have this perpetual window of zero-day binaries.
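To make the repacking cycle concrete, here is a minimal sketch, assuming a toy XOR "packer": the same payload packed under two different keys yields two completely different binary images, so any hash or byte signature built from one variant misses the next one. The keys and payload here are purely illustrative, and real packers are vastly more elaborate, but the effect on signature matching is the same.

```python
import hashlib

def xor_pack(payload: bytes, key: bytes) -> bytes:
    """Prepend the key, then XOR the payload with it (repeating)."""
    body = bytes(b ^ key[i % len(key)] for i, b in enumerate(payload))
    return key + body

def xor_unpack(packed: bytes, key_len: int = 4) -> bytes:
    """Invert xor_pack: split off the key and XOR the body back."""
    key, body = packed[:key_len], packed[key_len:]
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(body))

payload = b"EXAMPLE BOT: connect to rally domain, await update command"
variant_a = xor_pack(payload, b"\x11\x22\x33\x44")  # two "releases" of
variant_b = xor_pack(payload, b"\xaa\xbb\xcc\xdd")  # the same payload

# The variants share no bytes or hash, so a signature for one misses the other...
print(hashlib.md5(variant_a).hexdigest() == hashlib.md5(variant_b).hexdigest())  # False
# ...yet both unpack to the identical payload.
print(xor_unpack(variant_a) == payload == xor_unpack(variant_b))  # True
```

A queen bot only has to vary the key (or the packer, or the packing order) on each cycle to keep every child binary outside the signature set.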
The botnet, the infected cloud of victims, is perpetually being told: delete all the evidence of any prior infection whose signature may later show up in an AV tool, and get this new stuff in here right away. The queen bot program will just spit these out all day and keep updating them. I've seen things on the order of six hours in some cases as the lower threshold, but obviously this can scale quite nicely: all you have to do is compile programs, and that doesn't take much time. Now, what impact does this have? Well, some of you who follow trends in the AV industry have probably heard that we're experiencing the death of signature systems, that signature-based detection no longer has the lift it used to have. And if you talk to people in the AV industry, they're all, of course, working on new, better behavioral-based approaches and other types of analytical tools that will give them the same lift they enjoyed before. But if you just look at some of the things that, for example, VirusTotal has to offer, there's, I think, less than a 1% detection rate for samples being submitted to VirusTotal. So what we're seeing is an era where viruses are propagating, there is a limited ability of AV companies to keep up with this flood of automatically produced binaries, and the end result is that signature-based schemes have very low detection. Now, I also have to qualify that VirusTotal is based on user-submitted samples, and the base rate of infection in that red part of the graph may actually be polluted with false positives; for example, there may be a lot of benign programs in there that make the average look quite low. But based on experimentation in the lab, and hearing other academic researchers report on this, the trend we've noticed is that you get anywhere from a 10 to 25% detection rate for a new virus that you've found. And it's unknown, of course, how old that virus is.
We've seen what we think is a very early time of propagation, based on different collection heuristics, but it may be that even the detection we're getting reflects the fact that the sample was probably a day or two old before our sensors saw it. Now, one of the primary problems here is that if the queen bot's behavior is simply to pack and repack binaries, maybe one way to reason about solving this is to come up with a way to automatically unpack the binaries that come in. For example, if the queen bot merely changes the packing key, automated unpacking might keep up with this flood of new binaries being pushed down. We have to think about what the malware author's motivations were for packing in the first place. First, there's reduced malware size. We can pretty much discount that: they're not too concerned about Agobot having megs and megs of size and capabilities, and Phatbot as well is an extremely large binary, though packing certainly does help in that regard. Primarily, though, an obfuscation transformation occurs. This is where binaries show up in your mail traps and your spam traps completely opaque: you know there's something wrong with them, but you can't really tell much beyond that. And malware authors are keenly aware of the information differential that results. Somewhere in your organization you probably have an MX wizard, a mail deliverability wizard, who could stop a flood of malware that's being sent, or a firewall wizard who could stop victims from contacting a rallying domain, if they could only find out where in that binary the domain name is hidden. But since it's packed, it's obfuscated, and it's difficult for them to look at it. They have to hand it off to a colleague who does RE, often somebody who works at an antivirus company, so there are often different companies involved. This takes time.
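The information differential just described can be illustrated with a toy sketch: once a sample is unpacked, even a naive string scan can surface the kind of rallying domain a firewall or mail admin needs, with no reverse engineering skill required. The domain names and the regex here are illustrative assumptions, not anything a real sample or the repository actually uses; the point is only that the packed form hides strings that the unpacked form gives up trivially.

```python
import re

# Crude hostname pattern: dot-separated labels of letters, digits, hyphens.
DOMAIN_RE = re.compile(rb"[a-z0-9-]+(?:\.[a-z0-9-]+)+", re.IGNORECASE)

def likely_domains(unpacked: bytes):
    """Return candidate hostnames found in an unpacked binary image."""
    return sorted({m.decode("ascii", "replace") for m in DOMAIN_RE.findall(unpacked)})

# Stand-in for an unpacked binary image with embedded strings.
unpacked = b"\x00\x00connect: evil-rally.example\x00mail drop: drop.example.net\x00"
print(likely_domains(unpacked))  # ['drop.example.net', 'evil-rally.example']
```

Against the packed form of the same binary this scan would find nothing, which is exactly why the firewall wizard has to wait on the RE colleague.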
Additionally, there's more opacity that comes from having invalid PE32 headers to complicate reverse engineering. But the general idea is that unpacking and reverse engineering is the skill set of those people who raised their hands, and you know how painful that can be, how difficult it can be at times. It's certainly not something that your firewall engineer is particularly familiar with. Now, it would be a great practice if all of us could share the various samples we get from our honeypots. The current practice in the AV industry is often analogized to a hostage exchange: they'll share samples only if other people share samples back. I've seen a lot of private arrangements between groups, and the deal is always struck where somebody says, I'll give you something provided I see value back; I'll send you a few days' worth of samples and you do the same, and if I feel like I'm getting value back, I'll continue to share. Well, obviously there are some competitive reasons for doing this, and there are IP issues at stake here. But we really have to question whether this sort of competitive advantage translates into advantage for the AV companies' shareholders, because by not being able to see the full panoply of viruses that are out there, antivirus researchers are prevented from creating timely signatures. Now, in this particular set of slides, Paul, I had a note saying that this would be something you would talk about, because it gets more into the policy aspects; this would be a good point to interject. Sure. So, I mentioned this in my initial remarks. In the free and open source world, we have learned that sometimes it is better even for commercial, competing organizations to share stuff. It doesn't mean they can't compete, or that they've become communists.
What it means is that, gee, if we all start from the same kernel, then we can maybe compete on how good our distro is on top of that. And if I fix something in the kernel, I certainly don't want to be in the position of maintaining it forever after, so I really need to send that back so somebody else can be stuck with maintaining whatever bug fix I've come up with. So we have seen a development, and I'm looking through the room here: some of you don't know that it was ever any other way unless you've read a history book, but there was a time when, once the people at CSRG would cut a BSD kernel, Sun and DEC would never talk to each other about fixing bugs in it. Everything that goes on now with the different BSD distros and the different Linux kernel developers is new in the last 20 years. And it's become somewhat obvious even in the commercial field: Fortune 500 executives have pretty much stopped arguing about the benefit of sharing for certain types of things, this not yet being one of them. We want to find the part of malware handling that is beneficial for competitors to share with each other, give them a way to do that, and then teach them that that way exists. In some cases that'll mean a certain amount of arguing and debate with companies that just really don't want to let go of the old ways. That's where I come in. It'll be fine. Now, obviously, whenever you set up one of these cooperation arrangements between different corporations, there's a fairly good chance that they will try to game it: they will share enough that you think they're sharing everything, but they're actually keeping the goodies to themselves. I haven't worked out yet how we're going to do the verification of compliance, but I think that problem will solve itself, because the value of hoarding will go to zero. Go ahead.
So that's a very high-level perspective on policy, and I wanted to translate it down into specific requirements for what a malware repository would do. First, it should not help illuminate sensors. If people are sharing information, you don't want to accidentally turn in your friend who gave you a sample that was hot. You might want to hold on to that sample, use it for your own internal purposes, and do static analysis; don't do interactive network-based experiments with it until you're sure there's enough of it out there in the wild that you're not being set up to see if your sensors can be illuminated. Another fundamental point is that any malware repository you set up should not become a virus amplification service. Viruses should not be able to simply be told: download these viruses from this malware repository and run them, because they're useful. I also think that a repository absolutely should help automate the analysis of the malware flood. I have been stunned by the number of people who manage mail servers who say: I could stop a large number of spam problems, I could stop a large amount of exfiltration of victim information, credit card information, leaving by email, if I only knew where this stuff was being emailed to, if I only knew the Gmail address where all this private information is being dropped off. And because they don't hang around Win32 engineers and don't really want to spend their lives studying structured exception handling, they would much rather have an automated list. I think the way to do this is by having automation on the back end of the repository, which I'll get to in a second; I'll show you some examples of that.
Additionally, this would help coordinate different layers: reverse engineering gurus, mail gurus, Snort rule writers, people who work at different layers of the protocol stack and would like some help from the reverse engineering community to speed up their work. So the idea we proposed is a service-oriented repository, where users (and we'll get to what "users" means in a second) upload samples, and downloads are restricted to classes of users, because we of course don't want arbitrary people, or even scripts, downloading viruses into the wild and using this as an amplification service. The repository should also provide binaries and analysis. There are enough binaries out there, really, for everyone to do research on; we're not sure we have enough of them to do coordinated signature creation. But a more important aspect is that we can get a lot of other extremely talented people into the game if we simply provide them a way of transparently looking at binaries after they've been unpacked and analyzed, so they can say: this binary is attempting to make users on my network contact this evil domain; I will make an appropriate firewall rule entry to stop anyone who is incidentally infected, until the AV companies can come up with the patch set we need to prevent further infection. There's also some value in PE32 header analysis and in longitudinal detection data: what did the AV tool know, and when did it know it? There may be some interest in understanding which AV tools have different capabilities compared to others. And a feature to be done later is malware similarity analysis, a family tree. This will be based on some multiple-class clustering analysis, and quite frankly, a lot of the work you saw in some of the other track talks are excellent candidates for that. The structure and layout of the repository is quite simple.
There is a web server that takes in samples for analysis. A sample immediately goes through PE32 header analysis, unpacking, and AV analysis; the results are placed into a database, and the people who contributed the samples can get information back. There are probably a lot of you out there who have said: I wish I knew what was inside this binary; if only someone could unpack it for me, I might want to play with Perl and try to lock down my users a little better during this storm. Having this round trip, where if you submit something you're able to learn a little more about what you submitted, is something we want to facilitate. The workflow is that we have this fire hose, and we would hope to get it down to something the size of a garden hose. We have a very large number of samples coming in, and there are a couple of trivial reduction steps we do. For example, an MD5 checksum can be computed over them so that duplicates can be thrown out. Of course, noting that they came from different sensors in different areas and were reported by different people is important, because this goes back to the sensor illumination problem; but we can certainly reduce the workload in the next stage by throwing the duplicates out. Additionally, we put the samples through a battery of AV tests: if the AV companies already know about a sample or have a signature for it, and aren't just saying it looks fishy but really know something particular about it, it is probably an old sample being recreated or recirculated by somebody who downloaded a kit and is trying it out on the internet. So we can further reduce the workload through those trivial steps. The harder part is the payload recovery service, and you can think of this as a lossy process: not every sample, of course, can be put through this automated process.
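The trivial reduction steps just described can be sketched in a few lines. This is a minimal model under assumed names (`triage`, `known_md5s`, and the sample bytes are all illustrative, not the repository's real interface): deduplicate by MD5 while keeping provenance, since who reported what still matters for the sensor illumination problem, and drop anything the AV battery already identifies outright.

```python
import hashlib

def triage(submissions, known_md5s):
    """submissions: iterable of (sensor_id, raw_bytes) pairs.
    known_md5s: digests the AV battery already has real signatures for.
    Returns (novel, provenance): unique unknown samples that still need
    payload recovery, and every sensor that reported each digest."""
    provenance = {}  # digest -> list of reporting sensors (kept for illumination analysis)
    novel = {}       # digest -> raw bytes, one copy per unique unknown sample
    for sensor_id, raw in submissions:
        digest = hashlib.md5(raw).hexdigest()
        provenance.setdefault(digest, []).append(sensor_id)
        if digest not in known_md5s:
            novel[digest] = raw
    return novel, provenance

subs = [("honeypot-1", b"MZ...botA"), ("honeypot-2", b"MZ...botA"),
        ("honeypot-3", b"MZ...kitB")]
known = {hashlib.md5(b"MZ...kitB").hexdigest()}  # old kit, already signed
novel, prov = triage(subs, known)
print(len(novel))  # 1: the duplicate collapsed, the known kit sample dropped
print(len(prov[hashlib.md5(b"MZ...botA").hexdigest()]))  # 2 sensors saw botA
```

Everything that survives this filter goes on to the harder payload recovery stage.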
Unpacking is directly analogous to the halting problem: there will always be questions as to when you have unpacked sufficiently to say that you're done. So there will be some things that cannot work in a virtualized context; there will be evasive binaries, operating system conflicts, and others that fall out for those reasons. The PolyUnpack layer takes the rest, crafts a custom-built PE32 header for the binary, reconstructs the tables, and hands back both the original binary and the unpacked version. And this is where something really fascinating begins. We take these unpacked binaries and push them back into the cycle again: we give them to the AV tools, and the tools actually start detecting viruses. These are packed binaries that the AV tools had said were absolutely fine, or that they knew nothing about, and when you unpack them they become trivially identifiable. We've seen a 10 to 40% lift in detection simply from unpacking. This also gives you some intuition into the extent of these queen bot programs that are simply changing packing keys to generate new and different binaries: trivially identified when unpacked, but evasive when they're recursively packed, for example. Those samples that still fall out after that deserve hand analysis. Obviously this is one of those 80-20 splits: if you have tens of thousands of samples, getting it down to a smaller number means that those of you who raised your hands indicating you do reverse engineering will have a job, but not an impossible job. The unpacking itself is done by a series of dynamic analysis steps. How many people here saw Joe Stewart's talk on OllyBone?
So we started out with an approach similar to his, and I think at the end of his talk he said there's a private tool out there that someone was building with Olly. That was us. The heuristic is actually very different from what Joe was doing with memory page reads and writes. Essentially, we can statically analyze a binary and create basic blocks for the portion that is still transparent, the unpacking portion, as well as taking note of any DLLs required by the binary. Entries and jumps into memory addresses that are known, either as part of the static flow analysis or as part of a known DLL, we consider whitelisted: those jumps are part of some loop infrastructure that's continuing to run. If a jump is not into a whitelisted address, we can simply infer that at this point the binary has finished unpacking itself. So it's actually a very simple heuristic. In addition to this general approach there are about 60 other heuristics: handling structured exception handling, checking for Red Pill VM-awareness tests, and making sure all the other gotchas people are now building into malware out there are handled appropriately. With that, we can take 70 to 80% of the viruses we have and turn them into fully unpacked versions, and those become a direct feed to people who want to use them.

As I mentioned before, the static analysis model works like this. We have some malware. We perform static analysis to get a flow graph and reduce it to a basic-block structure. We then perform dynamic single-step analysis, and at each point we look for jumps and branches. If it's a branch to an area we've previously mapped in memory as part of a known whitelisted basic block, because it's from the static view or from a DLL, we say we're in a loop; we can actually even prove, using basic compiler optimization techniques, that we are in a loop, and continue on our merry way.
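Reduced to its essence, the heuristic is a membership test over a single-step trace. The sketch below is a deliberate simplification, assuming the whitelist of statically-known and DLL addresses has already been built; the real tool tracks address ranges and dozens of extra heuristics, none of which are modeled here.

```python
def detect_unpack(trace, whitelist):
    """Single-step trace monitor for the end-of-unpacking heuristic.

    `trace` is an iterable of instruction pointers from single-stepping.
    `whitelist` holds addresses known from static analysis of the visible
    (unpacker) code plus loaded DLL ranges. The first executed address
    outside the whitelist must be code created at runtime, i.e. the
    entry point of the unpacked payload.
    """
    for addr in trace:
        if addr not in whitelist:
            return addr    # control left known code: unpacking is finished
    return None            # never left known code: nothing was unpacked
```

Jumps back into whitelisted addresses (the unpacker's own loops) are simply ignored; only the first transfer into dynamically created memory triggers the "done unpacking" verdict.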
Eventually, though, we see an execution of code that has been unpacked and created, and at that point we say: we haven't solved the halting problem, but we're done fiddling with the halting problem, and we're going to declare this unpacked. It works surprisingly well. There's an example of this rolling along in OllyDbg, and I'm actually a little embarrassed to show you this now because Joe had a far superior demo, but the general idea is that some honeypot somewhere starts up. This is a video of a startup phase, but it can also start from a previously saved cache state in a VMware instance. As the unpacker executes and walks through, using the Olly engine, unlike Joe's approach, which would occasionally prompt you for input because his tool is geared toward reverse-engineering experts, this will simply bluster through, looking for the mapping of the heuristic onto the binary. As you can see here, we're getting some disassembly, and the binary itself is unpacking and being created. This continues for some period of time, and eventually the virus, even though it attempts to delete itself, to remove itself, is unpacked and handed back to the user. So we have batteries and batteries of these woodchippers taking in the large numbers of binaries that flow down every day, and that reduces quite a bit the number of samples you have to analyze. In terms of performance, it's not so bad: a lot of binaries give up the goods real quickly, and what we're seeing here is a one-class classifier for each virus type.
There are some very simple packers that work, and you'll see them show individual spikes and peaks in run times. I've truncated this, but the VMware instance can be made to run continuously: suspend, come back, and perform work-queue management so that new viruses are continuously prioritized into the analysis stream.

We have some analysis and results. Taking a small 6K sample set, about a day's worth of malware for some of the people out there doing collection, we compared with PEiD. The repository found that about 63% were packed, whereas PEiD, which I think is a fantastic user tool, found about 43% were packed; so we had a higher capability to detect packing. I wasn't, of course, using all the latest PEiD plugins, which do things like entropy analysis and might give some more lift, but just using the out-of-the-box settings we said, yeah, we're actually fairly competitive at identifying these pieces of malware. In terms of the steps we took on some 6,000 samples: the AV scan reduced a large number of them; then, focusing on the portion the AV scanner said was okay, we performed unpacking and did a re-scan, and one result we've validated with further testing is a 10 to 24% improvement over the old numbers. Of course it depends on the AV tool you're talking about. Clamscan tends to get a really big lift; other AV companies are a little more judicious about what they claim to be a virus and a little more aggressive up front, so they don't get as big a lift, but across the board they still see a great improvement in value. In terms of users who get access to this: I obviously don't want malware authors to use this as a drop-off point, but I would certainly like people who run honeypots and who do research like me, in academia or in industry, to be able to make use of it.
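The scan/unpack/re-scan measurement behind the 10 to 24% figure can be expressed as a small experiment. This is an illustrative sketch only; `av_scan` and `unpack` are assumed interfaces, and "lift" here is defined as the fraction of the whole sample set recovered by unpacking, which is one plausible reading of the talk's numbers.

```python
def detection_lift(samples, av_scan, unpack):
    """Measure the re-scan improvement from unpacking.

    Scan everything, unpack only the misses, scan again, and report how
    many of the originally-missed samples became detectable.
    """
    missed = [s for s in samples if not av_scan(s)]
    recovered = [s for s in missed if av_scan(unpack(s))]
    return {
        "detected_before": len(samples) - len(missed),
        "recovered_by_unpacking": len(recovered),
        "lift": len(recovered) / len(samples) if samples else 0.0,
    }
```

A lenient scanner like clamscan, which misses many packed samples up front, has more room to recover on the re-scan and therefore shows a bigger lift than a scanner that was already aggressive about packed binaries.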
So there will be, for example, classes of users: random unknown users, humans, and authenticated users; we'll define these as different groupings, and we may alter this in the final deployment. Unknown users will be allowed to upload, look at aggregate statistics, and basically get your simple VirusTotal-type experience out of the deal. Humans of course can upload, and if we know you're a real human being and you've found something interesting that you want to show us, we'll let you download an unpacked version that may help you in what you're doing on your network. If you're an authenticated user, you'll be able to upload, download, access further analysis, and coordinate with other reverse-engineering researchers.

So this basically creates a hub-and-spoke architecture. At the hub is the collection of malware that people submit; various feeds are coming in now, and we are scaling this up in size. The various spokes are things like unpacking services and analysis partners. If you're a researcher and want to contribute rows to a SQL table that you design and have visible to other users, you can manage a spoke: you can say, I would like to have your stream, and I will selectively share back, for certain classes of users but not all, the types of findings I've made. The theory is that you might find something other people will find useful, and those users just don't know who they are yet. So this becomes much like this conference: it serves as a clearinghouse for information. In terms of economic goals, here again is another point where I said, Paul has really got to pick this one up and bring it home. Okay, so there's a lot of shit and we are here to disturb it.
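The three proposed user classes reduce to a capability map. The sketch below is hypothetical, since the talk says the groupings may change before final deployment; the capability names are invented labels for the permissions described above.

```python
# Hypothetical capability map for the three proposed user classes.
PERMISSIONS = {
    "unknown":       {"upload", "aggregate_stats"},
    "human":         {"upload", "aggregate_stats", "download_unpacked"},
    "authenticated": {"upload", "aggregate_stats", "download_unpacked",
                      "further_analysis", "coordinate"},
}

def allowed(user_class, action):
    """True if the given class of user may perform the action."""
    return action in PERMISSIONS.get(user_class, set())
```

Each class strictly extends the one below it, which keeps the policy simple: anyone can feed the hub, but the richer spoke services are reserved for identified participants.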
So the AV companies selling the pattern files and the various engines for them are making a healthy living doing what they're doing, but if we really do have a 20% infection rate, with hundreds of thousands of newly infected nodes every day, then I think I can say that the AV industry is failing. I don't want to say that they have failed, but they are certainly failing to address the problem, and I think this hoarding is one of the reasons why. So once we have this up and running, I will start knocking on the doors of various CEOs in the industry and running a shorter, pointy-haired-boss version of these slides by them, to get them to understand that if they hoard, they're going to be alone soon. Early indications are that the old ways will part pretty easily this time. We can take this up in Q and A if we can get done.

With about 10 minutes left, I just got the warning, if people want to queue up for questions, I'm going to go into the phase of the talk where I repeat myself in a conclusion. The general idea is that we'll have a service-oriented repository, not merely a collection of malware on the net that's private, that you can't join, that you can't have unless we like, love, or trust you in some way. It will be oriented toward putting information into the hands of people who can take operational actions to stop active infections, doing blocking and things of that sort. Further details will be available at TISF.net. I believe malfease.org is now the domain that's been set up. Wrong on both counts; it will be malfease.oarci.net, but it's not there yet, so don't look for it. There will be an announcement on some mailing list that you frequent. So we should take questions; I really want to hear from people about this. Yes? I'm curious as to how well you think the Offensive Computing site matches your design already.
So there are actually a large number of people who have tried this, and are trying this, in different contexts, and what I have to do is sit down with those authors and discuss what aspects they have and what aspects we have. I actually don't see this as a comparison but rather as things that should be combined. I can certainly provide a feed to them and they to us, and if header analysis is something people particularly need or find useful for a particular problem, they'll have strength there while we'll have strength in automated unpacking. There is also a cat-and-mouse game to this: there will inevitably be a phase where, for example, the heuristic we are using becomes less and less sensitive over time as more evasion occurs, and Joe Stewart's approach becomes a little bit cruftier, but then people will reinvigorate them with more analysis, and it may shift back and forth between different sites. I'm looking forward to working with the authors; I don't have a bullet-point comparison in mind at this point.

David, I was wondering if you could tell us two things. One, whether bogons had ever been identified as something they didn't want to deliver malware to, because there might be a honeypot there; and two, whether hijacking routes known to contain no legitimate address space was a viable option for keeping some of it in circulation. So Rick is referring to a particular problem of people announcing routes that they should not be announcing, and we have seen that for spam runs. I have not seen it in this particular context, for malware distribution, in this project; others have told me yes, it's done that way, not only for spam but also for malware distribution. And your second part was about containment? I was actually asking the reverse: are bogon sites excluded from distribution? I will have to check the old Mazafaka repository listing to see what particular ranges have been identified.
I would not be surprised if there were some notation in there. Probably a naive question, but what percent of executables that come packed are benign? And secondly, is there a triggering mechanism to say: this is an obfuscated packer, therefore do not download this product, known or unknown? So what you're getting at is something known in academic circles as the base-rate fallacy. There are a large number of benign packed programs out there, and I absolutely don't want to suggest that packing itself is a crime. Now, as for what percent, it depends on what you're sampling. If you take a pristine install of a machine, you may find some user-installed programs that come in packed form; there you're sampling over a very large denominator, so the percentage would be quite low. In terms of the samples we see, the ones people have identified as suspect, that they think are fishy, 70 to 80 percent arrive in packed form. And again, that merely reflects the sampling bias of the streams we're getting. The unknown omega, the overall population of infected machines on the internet, is still not something that has been worked out, so it's impossible to say what percentage of viruses out there are in this form. And could you restate your second question? Is there any sort of triggering mechanism for the extremely paranoid, to say: if this hits this trigger, I'm just not going to bother downloading it? That's something I would encourage you to investigate by looking at PE32 header dumps: invalid fields, fields that don't exist, fields with invalid offsets, zero datetimes, and things of that sort are things you would not want to trust. Whether that yields a sufficient detection system for you is something you'll have to determine, but I would venture to guess that you need this type of information, conveniently presented, so you can do the kind of clustering analysis needed to arrive at that.
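A few of the untrustworthy-header checks mentioned above (bad magic, an offset pointing outside the file, a zero datetime) can be sketched directly against the PE32 layout. This is a minimal illustration, not a detection system; it checks only the handful of fields named in the answer, and per the answer itself, a hit is a reason for suspicion, not proof of malice.

```python
import struct

def suspicious_pe_header(data):
    """Flag PE32 header fields you 'would not want to trust'.

    Checks a bad MZ/PE magic, an e_lfanew offset pointing outside the
    file, and a zero TimeDateStamp. Returns a list of reasons; an empty
    list means no red flag fired.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":
        return ["no DOS header"]
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)  # offset of PE signature
    if e_lfanew + 24 > len(data):
        return ["e_lfanew points outside the file"]
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        return ["bad PE signature"]
    reasons = []
    # COFF header follows the 4-byte signature; TimeDateStamp sits at +8.
    (timestamp,) = struct.unpack_from("<I", data, e_lfanew + 8)
    if timestamp == 0:
        reasons.append("zero TimeDateStamp")
    return reasons
```

Whether such flags are worth acting on runs straight into the base-rate point above: with a large benign-packed population, even a fairly accurate red flag will mostly fire on benign files unless it is combined with other evidence.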
Thank you. Let's take one last question. You raised an interesting point at the beginning of your talk, that some of these sensors out on the internet are known, discovered by the virus folks. Whether I run a single honeypot or work for one of the large antivirus companies that advertises 25,000 sensors on the internet, my question is, number one, how do I know if my sensor has been discovered? And number two, if it has been discovered, what can I do to mitigate that so I can get back to capturing good stuff? So you're asking about the spy-versus-spy game of counterintelligence: what can you do to know if you've been illuminated? Well, you'll have to get fortunate enough to visit some of the sites that list known sensors. If you have reverse-engineering talent, and I'm assuming you're from an antivirus company, look at the propagation logic of the malware you see, to check whether it avoids propagating into address space where you might run a sensor. In terms of recovering from being listed: a lot of people are now moving into the dynamic address space of ISPs, buying large blocks and making arrangements to have DHCP leases appear in the normal victim-population pool. Tor is another approach people have been using as well. Basically, anonymization, or looking like a real victim based on where you are on the internet, is what people are attempting. Thank you. Thank you very much.