Today I'm going to be talking about unpacking, semi-automatic unpacking. I'm not talking about guns here. I'm talking about unpacking executables, specifically Windows executables, because we're using OllyDbg here. OllyBonE is a plugin that I wrote for OllyDbg. The idea could be applied in other ways, but this is the proof of concept I've got for it.

So to get started, here's the problem. In my job as a reverse engineer, I have to look at malware all day long, and it's packed pretty much 95% of the time. You know how many packers there are out there: there's a ton of them, in all different versions, and they all work slightly differently. What ends up happening is I've got to learn the ins and outs of a dozen or so packers that might be used on any given day, and it's kind of boring. Once you've learned exactly where to set your breakpoints and how to go about unpacking a particular packer, it becomes just a tedious thing. Really, I don't want to spend my time messing around with packers. I want to get inside the malware and find out what it's doing, so the packer is just an annoyance.

So I thought I'd first sum up how people other than me are approaching this problem right now. Reverse engineering the packer manually is about as hard as it gets as far as approaches go, and that's what I end up doing, because I'd actually rather spend my time looking at a debugger than going out and trying to find something to download that will automatically unpack the malware I'm looking at. But if we're going to go through all the trouble of learning how to reverse engineer a particular packer, we might as well write an unpacking engine for it to save us some time later on. So that's what a lot of people have done.
They've written these unpacking engines that are usually tailored not only to the packer but to a specific version of the packer. So as somebody releases a new packer version, you've got to get the latest unpacker version. This takes up a lot of time if you're into this kind of thing. The antivirus companies, for example, have to do this if they want to be able to scan the malware and check whether they already have a signature that matches it. Anybody can take the same malware and pack it a dozen different ways, and if the antivirus company doesn't have a way to work around all of those packers, it's not going to detect the other ways it's packed. But if they go down this route and write an unpacking engine for each packer that's out there, the scanner, as you can imagine, is going to get pretty bloated after a while. You've got to put a lot of code in there to deal with all these specific variants of packers. So unfortunately, what I think a lot of the antivirus companies do is just not bother. It's kind of scary to me how often I see new malware come out and it's the same old thing. It might be an Rbot or Agobot, something that we've seen over and over again. But it's been newly compiled, newly packed, and most of the antivirus companies don't detect it until they get a copy of it and actually write a signature.

One of the other ways you can approach the problem is to emulate. That is, write your own virtual CPU instruction decoder and your own executable loader that will pick up the executable, parse it, and then run it through your emulation engine to unpack the code gradually. This way, if you're good enough at emulating the CPU, usually x86, you don't really have to deal with the specifics of that particular packer's algorithms. You don't care. You're just going to emulate it just like the CPU would. Writing these things is pretty hard.
I'm not smart enough to write one. So the people who do end up writing these are either the antivirus companies or somebody who's put a lot of time and effort into it, and so they generally don't give them away for free. There may be some exceptions. You also have to be very careful about accurately emulating the CPU, because if you mess up one instruction and load a register wrong, something's going to go terribly wrong with that program and it's not going to unpack. So you've got to get it right.

Another way is kind of a shortcut. If you have no other way to get around something and you just have no time, it's not a bad way of approaching malware anyway. Basically, you take a virtual machine, perhaps in VMware, and you just run the code. While it's in memory, you dump that virtual or physical memory, and then you've got an image of the code after the unpacking part ran, so you've basically got the unpacked code. Now, it's pretty messed up in terms of how the PE headers are aligned, and the import table is going to be all screwed up. But you've got the code. You could probably load it into IDA and look at it. And that's what a lot of people do.

The problem is that you don't have that code in the same condition it was in when it was first run, in a freshly unpacked state. You've got all these variables that are initialized or uninitialized when the code starts. Then as your program runs, these variables in memory get filled with other values, so now you've dumped memory with those values in it, and if you run it you're really running it in a completely unknown state. You've also got to figure out where the start of the code is, what we call the original entry point, the OEP. You've got to find that in the code. With a lot of compilers, it's pretty easy to spot where that OEP is, because they use a very specific prologue to start up the actual code in main.
But if somebody just wrote it in assembly, just a raw image that they put together, you may not be able to tell right away. You might have to do some sophisticated tracing to figure that out. So all in all, it's a good solution if you just want to unpack something and see what the strings are inside it. But in terms of running it and debugging it, it's not really a good option.

So I thought about what it is that most of the packers have in common, and pretty much all of the ones that I looked at work basically the same way. They go through each section of code and pack it, encrypt it, or compress it somehow, and then they append a stub section at the end which has their special unpacking code. Then they go to the PE header and edit the entry point to point to their stub code. When you start that executable, the stub code runs, does its unpacking, and when it's done it jumps to the original entry point, the OEP, and the program begins running. A lot of times that just happens to be in the first section.

Here's a diagram that shows you the difference between the code sections of an unpacked and a packed executable. You can see on the packed side that a stub section has been added at the end, and that's where the entry point is, and then after it runs we jump to the original entry point in the first section. So a lot of them work like this.

Working with OllyDbg, a lot of times I just sat there and said: if I could set a breakpoint right there on that first section of code, so that no matter where in that section it starts running, it would break when it hit that section, then all I'd have to do is run the code. It would break right there, I'd be at the original entry point, and I'd be good to go. It'd be unpacked.
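
The layout described above, with the entry point moved into a stub section appended at the end, can be sketched in a few lines of Python. This is only an illustration: the `Section` class, the section names, and every address below are made-up values, not taken from any real sample or from OllyBonE itself.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical stand-in for one PE section header entry."""
    name: str
    rva: int    # start address relative to the image base
    size: int   # size of the section in memory

    def contains(self, rva: int) -> bool:
        return self.rva <= rva < self.rva + self.size


def section_of(sections, rva):
    """Return the section that holds the given address, or None."""
    for section in sections:
        if section.contains(rva):
            return section
    return None


def looks_stub_packed(sections, entry_rva):
    """True when the entry point sits in the last section, as with an appended stub."""
    owner = section_of(sections, entry_rva)
    return owner is not None and len(sections) > 1 and owner is sections[-1]


# Made-up layouts: the packed image is the same file plus an appended stub.
UNPACKED = [Section(".text", 0x1000, 0x3000), Section(".data", 0x4000, 0x1000)]
PACKED = UNPACKED + [Section(".stub", 0x5000, 0x1000)]

print(looks_stub_packed(UNPACKED, 0x1200))  # entry point in the first section
print(looks_stub_packed(PACKED, 0x5010))    # entry point in the appended stub
```

A check like this only hints at which section is the stub. As the demo later in the talk shows, some packers start inside the code section, so the guess still has to be confirmed by hand.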
And in a way you can kind of do this already. OllyDbg lets you set a break-on-access for a particular section in memory. The problem with using break-on-access this way is that before your stub code runs the code section it's unpacking, it's got to read from and write to it, probably thousands of times, depending on how many instructions there are in that code. So a lot of times this break-on-access is just going to break because of a read access or a write access. What ends up happening is that you go really, really slowly, because every time it stops you have to figure out whether you're reading, writing, or executing. The way the x86 architecture is designed, there isn't really a built-in way to tell the difference between a read access and an execute access.

And if you're going to have to do this, you might as well just use tracing. Tracing has been around for a long time. With tracing you say, all right, I want this code to stop whenever I'm executing between this address and this address. So the tracer single-steps through the code and keeps checking: am I at this address yet? Am I at this address yet? That makes the unpacking really, really slow. Depending on how it was packed, it could be too slow to even be practical for what we're talking about, which is unpacking things fast for malware analysis. Now, if you're just trying to crack a program because you don't want to buy the serial number, then it may not matter to you to spend two or three days tracing it. But for what we want to do, tracing is not really an option most of the time. Another problem with tracing is that it can be detected pretty easily. Several packers do this already: they will check to see how long it's taking to run.
They can check how many CPU cycles have passed, and if it's taking too long, they can say, oh, I'm being traced, and change the path of execution or just quit.

So I said, I wish I had a way to break on execution only of a particular section of memory, without tracing and without hitting on read access or write access. What I came up with I call OllyBonE; break on execute is the BonE part. It's a proof of concept that implements this break-on-execute thing that I've been wanting. To get around this x86 architecture limitation, we borrow some ideas from the PaX project. PaX is designed to protect your stack or your heap from being executed, say in case of an overflow. Instead, we're going to tell it to protect an arbitrary page of memory, or the several pages that make up the section we're targeting.

So a little bit of review here. This is the PAGEEXEC feature of PaX; there's more to PaX than that one feature. It works because the CPU has something called translation lookaside buffers. When you tell the CPU that you need to read or execute a particular address in memory, it has to do a virtual-to-physical address translation to figure out where in physical memory to go to get what you're asking for. This is done using a page table walk, and it's kind of slow. So what the translation lookaside buffer does is save that translation, so the next time you go to look up that address, it's already cached for you and it's much, much faster. And the great thing about the way it was implemented is that they put in a separate lookaside buffer for read and write access than for execute access.
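
As a rough illustration of the caching just described, here is a toy model in Python. It is not how a real CPU or PaX is implemented: the `SplitTLB` class, the page table contents, and the addresses are all invented for the example, and real TLBs have limited capacity and eviction, which this sketch ignores.

```python
PAGE_SIZE = 0x1000  # 4 KB pages


class SplitTLB:
    """Toy model: one translation cache for data accesses, one for instruction fetches."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.dtlb = {}                # cached translations for reads and writes
        self.itlb = {}                # cached translations for instruction fetches
        self.walks = 0                # number of (slow) page table walks performed

    def _translate(self, cache, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in cache:
            self.walks += 1                     # miss: walk the page table
            cache[vpn] = self.page_table[vpn]   # and remember the result
        return cache[vpn] * PAGE_SIZE + offset

    def read(self, vaddr):
        return self._translate(self.dtlb, vaddr)

    def fetch(self, vaddr):
        return self._translate(self.itlb, vaddr)


tlb = SplitTLB({0x401: 0x7A})   # made-up mapping for one page
tlb.read(0x401000)              # first read: page table walk
tlb.read(0x401010)              # same page again: served from the data cache
tlb.fetch(0x401000)             # first execute: the instruction cache is still empty
print(tlb.walks)                # 2
```

The point of the sketch is the last call: a page that has been read many times still misses on its first instruction fetch, and that gap is what the rest of the technique relies on.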
So what we're able to do, and this is how the PAGEEXEC feature works, is cache the read access. The next time something goes to read the page, it just proceeds as normal. But when it goes to execute, it looks in the cache, that execute translation is not there, and it has to take a page fault. What happens with PaX is that it takes over the page fault handler and says, let's find out where this page is; somebody is trying to execute the stack, so we're going to kill this process. So basically it has a way of marking these pages. The page table entry for each page has different bits with different meanings. There's a user/supervisor bit in there, and PaX overloads that to mean whether a page is protected or not.

So we can do basically the same thing. Instead of protecting our stack and our heap, all we care about is the pages of physical memory that belong to the process we want to unpack. Those are what we're going to hand to our page protection and say, this is what we want you to prevent from executing. The other difference is that instead of killing the process like PaX does, as if it were something evil, we're going to immediately tell the page fault handler to bail out and jump to the int 1 handler for us. That lets us immediately return control back to OllyDbg. So in OllyDbg, it just throws up a single-step break and the program stops. OllyDbg is right there, and you're hopefully at the original entry point.

As for how this is implemented in OllyBonE, there is a DLL, which is a plugin to OllyDbg, and there is a kernel driver. The kernel driver's job is just to implement this PaX-like page protection for anything that we tell it to.
All we need to do in our plugin is send an I/O control that tells it which sections we're interested in, and it will apply that protection to them. Because it's implemented as a split architecture like this, you could conceivably write an IDA Pro plugin that does basically the same thing using the same kernel module. If you're into IDA Pro, you could definitely port this over, if anybody was wondering.

All right, so here's a walkthrough of how this might work in real life. Start with data access. The first thing that's going to happen when we're unpacking our program is that it's going to try to read from or write to our target section. The CPU looks and sees that it doesn't have anything in its cache for that virtual address yet, so it does a page table walk and generates a page fault. This is where we come in with our OllyBonE.sys kernel module. Our page fault handler takes over from the Windows page fault handler and says, okay, let's see whether this page fault is due to our protection or whether it's just a normal page fault. If it's a normal page fault for some other reason, it passes it down to the system page fault handler. Otherwise, it toggles the bit that we set to mark the page as protected, so now the page is unprotected. Then it reads from that page, and that read automatically caches the translation in the lookaside buffer. Then it resets the page table entry bit to say that the page is protected again. So now we've cached a read entry. The only thing that's not cached is an execute access.

What happens then is that our unpacking program keeps using that cached entry for these pages and proceeds on its way. When it's done reading and writing, it finds that it's time to execute. We're hopefully at the original entry point.
So once again, we get a page fault because the translation is not in the cache. The handler checks: is this due to OllyBonE? If so, it immediately pops one extra argument off the stack, because the int 1 handler doesn't take that extra argument, jumps right to it, and dumps us out where we want to be.

Some problems that we've run into using this: virtual machines and emulators don't always perfectly emulate the x86 translation lookaside buffers. VMware works great. It's pretty solid. But Bochs and QEMU are actually emulating the CPU, and I've looked at the source code: they've implemented a single translation lookaside buffer. So they've basically ruined our whole idea of having split buffers where we leave one cached and use the other as our protection mechanism. Theoretically they could work if they were to implement split TLBs, but right now, as far as I know, there's no way this would work on Bochs or QEMU. I imagine this means that PaX wouldn't work there either, with the PAGEEXEC protection at least. Microsoft Virtual PC I have not tried. If anybody has it and wants to throw OllyBonE in there and see how it works, I'd be interested to hear from you.

It's pretty easy to use. You load your executable in OllyDbg like you always would. Next, you have to do a little bit of guesswork and figure out which part of this executable is the final unpacked code section, the one that will be running once it's unpacked. Like I said, it's usually the first section after the PE header, but not always. Then you locate that in the memory map and toggle the break-on-execute flag that OllyBonE adds to the OllyDbg menu. Run the program, and when you hit an int 1, all you have to do is watch it come up and stop, and decide for yourself whether it's unpacked or not.
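
To make the page fault walkthrough and the emulator caveat above a little more concrete, here is a toy simulation in Python. It is a sketch under heavy assumptions, not the OllyBonE driver: the `ToyCPU` class, its method names, and the page number are invented, the protected-page bit is modeled as a plain set, and passing ordinary faults down to the system handler is reduced to letting unprotected pages translate normally.

```python
class ToyCPU:
    """Toy model of break-on-execute built on split (or unified) TLBs."""

    def __init__(self, protected_pages, split_tlb=True):
        self.protected = set(protected_pages)  # pages whose PTE bit we marked
        self.dtlb = set()                      # pages with a cached data translation
        # A unified TLB shares one cache between data and instruction lookups.
        self.itlb = set() if split_tlb else self.dtlb

    def access(self, page, kind):
        """kind is 'read', 'write' or 'execute'; returns 'ok' or 'break'."""
        tlb = self.itlb if kind == "execute" else self.dtlb
        if page in tlb:
            return "ok"            # cached translation: no fault is raised at all
        if page not in self.protected:
            tlb.add(page)          # ordinary page: the walk simply succeeds
            return "ok"
        return self._page_fault(page, kind)

    def _page_fault(self, page, kind):
        """Stand-in for the custom page fault handler."""
        if kind == "execute":
            return "break"                 # hand off to int 1, debugger regains control
        self.protected.discard(page)       # temporarily unprotect the page
        self.dtlb.add(page)                # touch it so the data translation is cached
        self.protected.add(page)           # mark it protected again
        return "ok"


CODE_PAGE = 0x401  # made-up page number for the target section

split = ToyCPU({CODE_PAGE})
print([split.access(CODE_PAGE, k) for k in ("read", "write", "execute")])

unified = ToyCPU({CODE_PAGE}, split_tlb=False)
print([unified.access(CODE_PAGE, k) for k in ("read", "write", "execute")])
```

With split TLBs the stub's reads and writes go through and the first execute breaks. With a unified TLB the execute finds the translation that the earlier read cached, so no fault is raised and the break never happens, which matches the behavior described for emulators with a single lookaside buffer.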
OllyBonE doesn't really have a way to know whether it's unpacked or not. You've got to be able to determine that from experience.

So I've got a video demo here of various packers being unpacked. Let's see if we can't kill this here. I'll try to blow this up a little bit so people can actually read it. I don't know how well you'll be able to see this in the back, but I hope you can see something.

All right, the first executable we're going to unpack is packed with FSG. If you've ever unpacked FSG, you probably know that it's ridiculously easy to unpack manually. But just to show you how it works, we'll go through it. First we look at the memory map in OllyDbg, and this LAR FSG, these are the three sections that belong to our executable process. We've got our PE header as the first section, the second section is code, and then we've got our SFX, imports section. We highlight that code section and set our break-on-execute toggle. Now all we do is hit play and run the program, and immediately we come to the break on execute. It's telling us that it's stopped. We run analyze code, and we can see that we can now read the unpacked binary. This is from Peter Benia's collection, which he loaned to me for the purpose of this demonstration, so I appreciate that. But basically this is unpacked now. It was that easy. So let's do another one.

All right, Upack. Also kind of similar to FSG, pretty easy to unpack in this particular version. We look at our memory map again. There's our code section, and since we've looked at Upack before, we know that's where we want to be when it's all over. Set our break on execute, hit play, and we've got a break on execute. Run our analysis and see what it looks like. And yes, it's unpacked. Let's do another one now. It's a little harder.
All right, this is our demonstration of unpacking ASProtect. Is it A-S-Protect or ASProtect? Does anybody know? Okay, whatever it is, it sounds a heck of a lot better than ASPack.

So we've loaded it in. Now, notice that we're actually stopped in the code section itself. Instead of starting execution in a stub section, we're in that code segment. So obviously we can't set a break on execute yet, because we need to run it far enough to unpack. If we set our break on execute now, we're not going to get anywhere. We're just going to spin our wheels. So what I did there was step through the first four instructions, and that jumps to another section. So that's good. We're out of the code section now, so we can go back to our memory map and set our break on execute there. We run it and, uh-oh, it detected that we're running a debugger, and it's giving us a warning message saying, you can't do this.

So this is one of the trade-offs: you're running the code pretty much unchecked, so any debugger checks it might do, you're going to fall prey to. You're going to have to learn a little bit about these tricks and how to get around them. For ASProtect here, it's not that hard. The way it detected that we were running a debugger is just a simple API call. So what we're going to do is have the IsDebuggerPresent plugin here set a variable in our PEB to tell the API that there is no debugger present. Once we do that, we go through the steps again: walk through until we're in a section that's not the code segment, set our break on execute, and hit play. So now we are somewhere, and it looks like we're in our code section.
So let's look and see what we're actually executing. This time we had to remove a bad analysis, because OllyDbg got a little bit confused about what was what. We look here, and we're actually executing a return. So what's happened is we've ended up in our code section, but we're not done unpacking yet. This was just a return back into the unpacking code. So now we have to manually go back, remove the break-on-execute flag, and press F7 to step just one step so that we return; we're actually on the heap now, executing some code. So now we're out of the code section. We can go back to our memory map and set our break on execute one more time. We run it, it takes a while, and up pops another break on execute. So this is actually unpacked here. Once again, OllyDbg is a little bit confused about how the analysis looks. I didn't get a good chance to show you that it was unpacked, because when we run analyze code here, OllyDbg produces a bad analysis: a lot of that is ASCII in there, because it's an assembly language program. It doesn't do such a good job sometimes. But that was unpacked.

All right, next one, PECompact. Once again, we are in the code section as we start out, not in the stub section. So we need to walk through until we get to another section. What happens here is that it uses an access violation exception to do some of its unpacking work. It does some of the work in the structured exception handler. So all we have to do is step into the structured exception handler, and now we're in another section. We're good to go. Go back in, set our break on execute, and run it. And now we're back in the code, and it's just going to jump here. We remove our break on execute so it can jump back to more of the unpacking code.
Now that we're out of there, we go back in, set our break on execute one more time, hit play, and now it's unpacked. Once again, OllyDbg is not doing such a great job with the analysis, and you don't see the text strings from Peter, but if you look down you do see the MessageBox and ExitProcess calls there towards the bottom.

One more here, tElock. This one starts outside of the code section, so it looks pretty good. We go to our memory map, set a break on execute, and we've landed already, but we're not in the code section. So what happened? Well, tElock is using single steps. It's using int 1s for its own code, and that's getting in the way of our use of int 1 to tell OllyDbg when something has hit a break on execute. So now we have to manually walk through all the int 1 breaks that tElock does. We just keep hitting Shift+F9 to keep running until we land at one that looks like it's in our code section. OllyDbg tells us it's a break on execute, we run our analysis, and we're unpacked. So that is it for the video, and there's a URL where you can download this, source and binaries.

All right, so let's talk a little bit about ways around this particular method of unpacking. Like I mentioned, anything the packer does to try to detect that you're running in a debugger is not solved by OllyBonE. OllyBonE is just a shortcut to get you to the OEP; you've got to work around these other tricks yourself. So if the packer has this kind of code, you're unfortunately going to have to learn what's up and figure it out. Other types of packers might not work anything like this. There may be no notion of a stub section, or of unpacking one section at a time. A packer might just unpack everything to the heap and run from there. Or it might unpack itself a little bit at a time.
So this isn't really going to help you a lot with packers like that. That kind of thing is a little bit harder to write, though, so there aren't a lot of packers that actually implement it that I've seen. Most of the packers that I encounter every day are like the ones in the demo and are pretty easy.

In terms of evasion, once all the packer authors out there find out that we're doing this, all they have to do is make their code work slightly differently, and now we've got a problem. For instance, instead of running from a stub section, they could run as part of the code section itself, so they're always in the code section with us, making it impossible for us to set that type of break on the section. Since we're setting our protection per page, we can break it down into 4K pieces, so if it came to that, we might be able to work it so that only the first part of the code section is protected, and the last couple of pages, where they inserted their unpacking code, are not. It's one of those things where you have to adapt as they adapt, and it becomes a cat-and-mouse game.

There's other bad stuff that could happen. We're letting this unpacking code do basically whatever it wants. We don't have any control over it while it's running, so it could do sneaky things. It could send its own I/O controls to OllyBonE, perhaps, and make it unprotect the code section. It could do other nasty things, like protect parts of our kernel that we might not want protected and cause us to crash. So I don't recommend you use this on production systems. This is really just a proof of concept at this point. There's very, very little error checking in terms of what the code is allowed to do. So if you feel like you want to add some error checking to it and work on the code, I'd be happy to accept patches. Like I said, the source code is out there.
You really don't have any excuse to complain if you don't like how it's protected against these types of attacks. You should just patch it and send it in.

There are other small tricks that could be used. Maybe the packed code is going to change its own memory permissions using the VirtualProtect API. So we might have to do something like hook that API, and at some point it becomes kind of a mess of trying to outdo each other's tricks. Right now, what's probably going to happen is that the more advanced packers will make sure they're still really hard to unpack, and the packers that have been out there for a while, that everybody uses and nobody maintains anymore, probably aren't going to change. So it'll probably be a while before it becomes pretty much impossible to use this trick.

I do want to say that after I wrote this program and started talking about it, somebody took me aside and said, hey, by the way, you're not the first person to do this. I found out that there is a private unpacker out there, not well known, that has this feature and has apparently had it for a while. So I don't want to claim to be the first to do this or anything like that. It's just my little proof of concept that I've been using, and I wanted to share it. So you can download it. Everything that I've written is GPL.

On my immediate to-do list, which I'm probably not going to get to myself and could probably use help with: I would like to be able to set break on execute for more than just the sections that are defined by the PE headers. For instance, if something's writing out to the heap and executing there, I'd like to be able to protect that. I would also like to be able to set break on execute on DLLs. The problem, and the reason I'm not doing this already, is that some of these DLLs are in shared memory.
So if you go and set a break on execute on a particular section in kernel32.dll, then every program on your system is going to hit an int 1 breakpoint immediately and blue screen your system, if you're lucky. There's a feature called copy-on-write that can come into play here. You can force your own private copy of each page of memory. So it might be possible to force a write, perhaps of the same byte that's already there, to a particular location, forcing the page to be copied to non-shared memory, and then set your break on execute. But like I said, past this initial proof of concept, I haven't done that. So if anybody wants to implement it, feel free.

That's all I've got for the presentation. If anybody wants to ask some questions, the microphone is right there. Just make sure you talk into the mic. Thank you.

Hey, Joe.

Hi.

I assume that if you've got a packer that's mixing anti-debug checks for soft breaks as well as hardware breakpoints, you're going to have to manually step through that first before you kick in the memory protection. For example, I've seen at least some packers that will make one of the kernel calls that will kill all of your hardware debug points inside the process you're debugging.

Right. So we're not really concerned with hardware breakpoints here at all. This is really kind of a hack on top of the kernel. So if they were using hardware breakpoints, or trying to change them, it wouldn't affect us.

The segment trap doesn't count as a hardware breakpoint?

No, it doesn't. It doesn't take up one of those hardware breakpoint registers at all. It's all implemented by our own custom page fault handler, which the malware really doesn't have any way to know about unless it gets wise to this trick and tests for it later on.

All right. Thank you.

Thanks.
So with 64-bit Intel, or AMD, where they now have a proper NX bit, does that mean there will be some inherent protection against packers, or could they just turn off the use of that bit for their code segments or whatever?

I haven't really played with the NX bit too much. I actually don't have a processor that has it. I haven't gotten around to it. I don't see any reason that you couldn't write the same protection using that bit instead of the whole page fault handler hack. It would probably be a much more elegant solution. It just wouldn't be compatible with all processors. But yeah, that's basically the same idea.

All right. Thanks, everybody.