Alright, my name is Luis. I'm going to be talking about bridging the gap between static and dynamic reversing. Right here you have the URL to my blog, my email address, and all that good stuff. I guess I should say something about how I'm so cool and why I can talk about this shit, but whatever. I'm just going to show the stuff I've been working on. You guys should check it out and then just use it.

Alright, here's the agenda. I'm going to be covering a bunch of different things. Mostly, right now we have these different sets of tools that we use for reversing, and they don't really work that well together. We have things that are band-aids that make them work better, but they're still not what we need. What I'm presenting here is still a band-aid. It's a better band-aid, but I'm going to talk about what we should do in the future and other avenues we can take to try to make a more cohesive reverse engineering framework.

So first I'm going to cover some of the current tools we're using. Here are some definitions, at least my definitions; yours might be different. Static reversing is using a disassembler, looking at a dead code listing. You can do all kinds of analysis just by looking at the code that you disassemble. Dynamic reversing is when you use a debugger. You open up something in OllyDbg or WinDbg. You're able to look at memory. That's dynamic. But these two techniques are rarely used together. A lot of times there's this gap between the information you have in your disassembler and the information you have at runtime, and rarely does it get put together. There are tool additions, enhancements, plugins, what have you, that help with the problem, but it's not a complete solution.

So here are some of the current tools that everyone's using. IDA Pro is pretty much the disassembler. If you want to do reversing, you use IDA Pro. There's other stuff. PVDasm is a small project. It's a Win32 disassembler.
It has a plugin architecture. It's made by this kid in Indonesia, I believe. It's free. You should check it out. W32Dasm used to be the preferred disassembler for cracking software protection, but it hasn't been maintained in a very long time.

For debuggers, most of us are doing reversing for Win32 because we have no source, so most of this talk is going to be geared towards that. You can reverse things for UNIX-type platforms, but generally, because of open source and whatnot, we have the source available. So WinDbg has gotten way better. It used to suck balls, but now it's actually usable. Yeah, thank you, Mr. Microsoft man in the front, for making it better. It has great support for symbols. There's the dt command, so you can dump complete structures. If you're working in the kernel, doing kernel-mode stuff, it's great. OllyDbg is probably the best user-mode debugger available for Win32. It has a great plugin architecture, lots of plugins have been written for it, and there's scripting available. For UNIX platforms, it's pretty much GDB. That's it. There's not really that much competition. Yeah, ooh, GDB. There are a couple of other tools. I think there's the TotalView debugger, but those cost a lot of money and I've never used them. And SoftICE is dead now. It got end-of-lifed. Compuware killed it, and it's gone. DriverStudio 3.2 is the last version they're making, and they're not going to provide any more support.

So what tool enhancements do we have? First, we have map files. A map file is a text file that shows the function name, or the public, the global, and then an offset into the file. So when you load a map file that got generated from a disassembler into your favorite debugger, you're going to have symbolic information, which is very important when you're debugging. I don't want to be looking at 0x4032F3. I want to see ReadStream. I want to see the names of my functions.
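A minimal sketch of the map-file idea just described, in Python. The two-column "hex address, name" format is a deliberately simplified, hypothetical stand-in (real map files from a linker or disassembler carry more columns), and `load_map` and `symbolize` are illustrative names, not part of any tool mentioned in the talk.

```python
import bisect

def load_map(text):
    """Parse lines of the assumed form '<hex address> <name>' into a sorted list."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unrecognized lines
        try:
            address = int(parts[0], 16)
        except ValueError:
            continue  # first column was not a hex address
        symbols.append((address, parts[1]))
    symbols.sort()
    return symbols

def symbolize(symbols, address):
    """Return 'name' or 'name+0xoffset' for the closest symbol at or below address."""
    addresses = [a for a, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return hex(address)  # below the first known symbol: leave it numeric
    base, name = symbols[index]
    offset = address - base
    return name if offset == 0 else f"{name}+{offset:#x}"

# Hypothetical map contents; ReadStream at 0x4032F3 echoes the example in the talk.
MAP_TEXT = """\
00401000 start
004032F3 ReadStream
00403400 CloseStream
"""

symbols = load_map(MAP_TEXT)
print(symbolize(symbols, 0x4032F3))
print(symbolize(symbols, 0x403310))
```

A debugger that loads such a table can show a name plus offset wherever it would otherwise show a bare address, which is the whole benefit the speaker is after.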
I don't want to waste my time writing little notes on a piece of paper and trying to remember all these numbers in my head. Another thing along these lines is I2S, a really cool plugin for IDA. It takes your IDA database and dumps it to NMS format files, which is what SoftICE uses for its symbols. You run that, then you load the resulting NMS into SoftICE, and you have all your symbols. Another tool enhancement is emulators. Emulators allow you to somewhat simulate the runtime environment, so you can see how the static disassembly behaves within the simulated environment. It allows you to see a lot of stuff, but it's still not a complete solution.

So how do we bridge this gap? How do we make these tools work better together, as a cohesive unit? The first step is symbolic information. You want all the annotations that you do in IDA in your debugger, and vice versa. But that right there is only one-way communication. You're taking data from the disassembler and bringing it to the debugger. Step two is two-way communication. You want to take information from the disassembler to the debugger and back again. We want to create this loop where we feed back information from runtime that we don't have access to during static analysis. This is basically taking the best of both worlds and creating a more ideal solution. But up to now, these are still band-aids.

The third possibility is a common API or interface. How nice would it be if we had one standard reversing API or interface? This reversing API hooks up to WinDbg, to OllyDbg, to IDA, to W32Dasm, whatever you want to use. Let's say I don't like IDA. Let's say there's another viewer that I want to use. I use this viewer. I write a plugin against this standard interface. And then I could use WinDbg. I could use GDB. I could even use different processors. I'm going to be going over this towards the end of the talk.
And it's what I commonly refer to now as a software JTAG. A JTAG interface in hardware provides a standard interface for debugging, and this is what we also need on the software front.

So, symbols. ELF symbols are easy. ELF is completely documented. The symbols go into the... Let me get a little drink of water. It's dry out here, huh? Yeah, I've got to stay hydrated. Yeah, someone slipped some vodka in here. GHB. This is going to become a really interesting talk, I think. So, okay, ELF is not that exciting because it's all documented. We can just look up the spec. A couple of tools have inserted symbols back into ELF. There was Fenris by Michal Zalewski. I'm sorry to him if I mispronounced his last name. He had this tool called dress, which is the opposite of strip. When you strip a binary, you take out all the symbol information. What he did is create something similar to FLIRT in IDA, which automatically detects library functions, and he would reinsert all the library functions for different libcs and glibcs back into the ELF.

But Microsoft uses PDB files. A PDB is a separate file from the binary that you're trying to reverse, debug, what have you, and the file format is not documented at all. So how does Microsoft create these files? There's a DLL that comes with your Visual Studio, whatever version you have, dating back as far as, I think, five. Five is when I think PDBs started coming into play. So there's this file called MSPDB, and here it says XX to represent the version. It could be 60, 70, 71, 80. This file generates PDBs. That's how the compiler generates them. They're read through the DIA SDK or, more commonly, dbghelp.dll. That's what, let's say, the Determina PDB plugin that reads PDBs into IDA uses. That's what PDB Plus uses. That's what WinDbg uses. And dbghelp.dll, if you need to get it, comes with the debugging tools.
Microsoft provides the free debugging tools that come with WinDbg, and it'll be in there.

There have been a couple of different versions of PDB files. Visual Studio 6.0 came with PDB 2.0. Then when Visual Studio .NET came out, they needed to add support for .NET and for all different kinds of types that weren't around before. So at that point they decided to create a new standard, which was PDB 7.0, so it would match the DLL version. Everything from Visual Studio .NET to 2003 and 2005 uses the PDB 7.0 file format.

As far as documentation, we've already said that there's none provided by Microsoft. Some people have done some reversing on the format. There's some information in the Schreiber book, Undocumented Windows 2000 Secrets. That covers some of the PDB 2.0 file format: how it's laid out, some of the streams, how it's split up into different blocks. We're going to look into how the PDB file is actually like a mini file system. It uses non-contiguous blocks and links them all together.

So this is why I'm here. I wrote a plugin called PDB Gen. This is a plugin for IDA. Basically, you run this plugin and it takes all the function names and what have you that you've annotated in IDA and makes a custom PDB. Then you take this PDB and load it into your debugger, and you have all your symbol information. This, like I'm saying, is the first step we need to move some of the reversing forward.

In order to create these PDBs, I had two choices. One was to reverse the API of the MSPDB DLL. This is the same way the compiler does it. So all you have to do is maybe hook this DLL, see how the compiler is using it, write small programs, run them through, and see what's going on. The second possibility is to reverse the reader. If we reverse dbghelp.dll, then we're going to know the internals of the file format.
The first one only gives us an API; the second one gives us the complete format. So there are some pros and cons, as I was saying. The pro for the MSPDB DLL is that it's the same way the compiler does it. If the compiler changes and we know the API, our PDBs will be the same. It's also a faster time to product, or to the plugin. The cons are license and redistribution issues. I can't redistribute this MSPDB file since it's part of Visual Studio, so everyone would have to have a copy of Visual Studio. And yeah, there's the free Express version, so people could get it that way. But this still isn't really reversing PDBs themselves. It's only reversing the API.

So the reader, dbghelp.dll: what are the pros? The pro is that we're going to have complete information about the file structure. And if we have this information, we can make a platform-independent reader. So let's say you want to do analysis on Microsoft binaries. You're looking for vulnerabilities and you want to use symbolic information. You want vtables, you want all this stuff. But you want to do it in Linux. Right now there's no way to read PDBs in Linux. So if I reverse this file, we could have the entire file format, and it would allow us not to be limited to just Win32. We could use it on other OSs and other platforms. There could be a Python interface to it, a Ruby interface, whatever the language du jour is, whatever you want to use.

So what are the cons? This takes a lot longer. You're debugging a pretty big DLL. It's pretty complex. And one downside is that the internals might change. This file comes out every time there's a new WinDbg release. So far I've been tracking the releases for about two years, and the only thing that's changed is that they add new APIs. The file format stays the same. It's still backwards compatible. So even if we go this route, I don't think it's going to be a problem.

So I said that PDBs are like file systems, with non-contiguous data.
There are different streams. And I'm not going to bore you by making you read hex dumps on some projector from way in the back. It's not going to make any sense. So all the internals, with pictures and whatnot, are going to be on my blog. That's going to be posted during this week, so check that out. It's dword with an e.blogspot.com. You can peruse that and look at all the details there.

So what is the plugin doing? Right now it's only doing global functions. Right now it's not gaining us anything more than what we have in a map file. But this format is extensible. Soon I'll be able to do all the local names. More types. We'll be able to read vtable structures, classes. There's tons of information there.

The second part of it is that I want to release a library, an LGPL library. Something interesting about the PDB file format: I don't know if any of you guys do driver development or read OSR Online. It's kind of a driver-centric community there. A few months ago it was reported that the WDK, which is what they're calling the new DDK for Vista, in releases I think greater than 5270, accidentally included private PDBs for the kernel. Holy shit. So there are all the internal structures of the kernel. That got yanked, I think, within a week, but it got distributed. It's everywhere. So if someone wants to know the internal structures of the new Vista kernel, it's all there. With a PDB reader library, you don't have to use WinDbg to dump each structure one by one. You can grab a PDB and dump entire structures, all kinds of information, for each PDB file. I believe it's the kernel and some other stuff, both for Win32 and for x64. So if you have that WDK and you want to let me take a look at it, I'll be around all weekend.

So yeah, the last thing is a symbol browser. We want to be able to look at all the information in a PDB without having to go through a debugger, because all we want is the data at this point. Okay, so that's it about the tools.
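The earlier description of a PDB as a mini file system, with streams stored in non-contiguous blocks that are linked together, can be illustrated with a toy model. This is only a sketch of the general idea: the block size, the block list, and the `read_stream` helper are assumptions for illustration, and nothing here reflects the actual on-disk PDB layout.

```python
def read_stream(image, block_size, block_indices, length):
    """Reassemble one logical stream from scattered fixed-size blocks.

    image         -- the whole container file as bytes
    block_size    -- size of every block in bytes (assumed constant)
    block_indices -- the blocks that make up the stream, in logical order
    length        -- stream length in bytes; the last block may be padded
    """
    parts = []
    for index in block_indices:
        start = index * block_size
        parts.append(image[start:start + block_size])
    # Drop the padding that fills out the final block.
    return b"".join(parts)[:length]

# A toy 4-block container. The 10-byte stream "ReadStream" lives in
# block 3 followed by block 1, so it is not contiguous in the file.
BLOCK_SIZE = 8
blocks = [
    b"HEADER__",                        # block 0: stand-in for a header
    b"am\x00\x00\x00\x00\x00\x00",      # block 1: tail of the stream + padding
    b"unused__",                        # block 2: belongs to something else
    b"ReadStre",                        # block 3: head of the stream
]
image = b"".join(blocks)

print(read_stream(image, BLOCK_SIZE, [3, 1], 10))
```

A reader library built this way only needs the container bytes and a per-stream block list, which is what makes a platform-independent reader plausible.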
Now I want to talk about where we go from here. A lot of this stuff is just ideas I've been kicking around in my head, but it's where I think we can get the most benefit. First of all, we need two-way communication. Even the symbol transfer, like I said, only goes from the disassembler to the debugger. We need to take that and make it a loop. I want to see what types of objects are in memory. I want to see how things are working and feed that back in. We could potentially run something in a debugger for a while, feed it different types of input, and have that annotate your disassembly. And like I said, the step past that is the standard API and a standard interface.

One example where this would be useful is C++ reversing. C++ is a fucking pain in the ass. You have vtables, you have indirect calls, you have all this stuff. If we use both the static and the runtime information, feed it back, and create this constant loop, we can know where these indirect calls are going. You just trace it back, and we create cross-references, as we'll see.

How do we do this? Like I said, most of us are probably reversing Microsoft stuff. Vtables are listed in the PDB. So we find the vtable for whatever class we're looking for and set a breakpoint on every method. Run it for a while. You're going to start hitting these breakpoints. See where each one was called from, backfill the function name, the method name, at the place it's coming from, and add a cross-reference. After you run it for a while, you're going to have all this information stored up about who's calling whom. You're going to see patterns. You're going to see locality: which functions are only called from within the same class, and which functions are called from outside. This is how you determine whether something is public or private. You also have protected. You see how the object is being used. There are other things that can be done as well.

Here's an example of an indirect call. On the top part, we see EDI has a pointer to a vtable.
That's loaded into EAX. On the next line you see EDI being copied into ECX. ECX is the this pointer. If you've done OOP, you know about the this pointer. The next line says call dword ptr [eax+14h]. I have no idea what that is. I'm looking at a dead listing. I have no idea. But if we find the vtable and set these breakpoints, this is what it's going to look like. It's going to be calling object.method, and that's what we want. Then we don't have unrelated methods being called. You find a method that you think has a bug. You right-click on it or hit X, and boom, you have the path: everyone that's calling it. You can backtrace to see where user input is coming from.

Here's something else that we need. We need debugger type info. OllyDbg is great. If you have a register, let's say EAX, with some value in it, and it happens to point to a string, it dereferences it for you. So you'll see "hello world" in OllyDbg. That's really helpful. It's a simple thing, but it makes great sense. So why don't we extend this to other, more complicated types of objects? If I see this in EAX, I want to know that that's my object type, let's say a stream object or some other kind of object. It just makes the reversing that much easier.

So how do we go about this? We can look for constructors and destructors. We set breakpoints at the exits of constructors. At the exit of a constructor, the address of the object is returned in EAX. So we set these breakpoints and boom, we have this memory map of where all the objects are. We hold on to this table. We also have a breakpoint at the entry of every destructor, so we can remove objects from our list. So now we have this big list that tells us which memory locations hold which types of objects. Now, this is the first approach I thought of. There are some problems with it, because there's constructor chaining.
If you have inheritance, you have a base class and another class that inherits from it, and this constructor is going to call the other constructor. You also have constructor overloading, where you have multiple constructors per object. So this could get a little confusing. There has to be another way, and there is. It goes back to the vtable. This is what we saw earlier. This is how we got the vtable, right? The first dword of any object with virtual methods points to the vtable. If it's pointing to the vtable, we know what type of object it is. So it's just like a string. With a string, you see the address in the register, and it dereferences it to a string. So now we'll see the address of an object, and we'll check whether it has a vtable. If it does, we pull that symbol information and display it right next to the register.

So, inheritance. Inheritance is where you have some kind of base class, and then you have other classes that are children of that parent. What can we do to look at inheritance? We can look at the constructors. We can look at constructor chaining, right? You have a constructor for a certain type of object, and it's going to call the constructor for a base class. We're going to see these patterns, and we're going to be able to tell that there's some kind of relation. We might not know exactly what's going on, but there's something different about these types of objects.

Another thing we can look at is scope. Certain objects, I mean certain methods, can only be called from certain methods. For example, privates. Privates can only be called from within methods of the same class, whether those are public, protected, or private. Calls could also come from a class below it. If a method is protected and you're inheriting from that base class, you can call it from the derived class, but it can't be called from the outside. So we have all these cross-references already. Why don't we just start looking at that and create some kind of graph where we see who's calling what?
From there, we'll be able to determine whether something's public or private. Not only that, we'll be able to look at some of the other things inside the object: public and private variables, accessor methods, all this type of stuff. We'll be able to see if something's public and then accesses something internally. We'll be able to see more about how the object works.

The next thing is pure virtual methods. You can see these in vtables. When you look at a vtable in any type of Microsoft DLL, you'll see something like pure virtual. I forget the exact name of it, but I'm sure you've seen it before. This means that it's probably a base class or something. You cannot call that, but you'll probably see other objects that inherit from it.

This is all going back to vtables. Basically what we're doing is looking at how C++ works. We're taking its behavior, creating some kind of metaphor for it, and using these metaphors to gather more and more information. It's all there. We just have to look at it a little bit differently. You have different vtables. Let's say something inherits from something else and doesn't override a virtual method. They're going to call the same method. All these different objects are going to call the same method. It's probably going to be at the same offset in each vtable, and it's going to be pointing to the same method. Generally, an object is not going to have the same method as some other object unless they're related. It's not going to be pointing to the same thing. So that's another way to determine that there's some kind of relationship there. A lot of this stuff can probably be graphed out, and some kind of analysis can be run. We might even be able to have this all automated.
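One way the shared-vtable-slot heuristic could be automated is sketched below. The class names, addresses, and the `related_by_vtable` helper are hypothetical. Skipping the pure-virtual stub is an added assumption, on the reasoning that every abstract class would point at the same stub and so sharing it says nothing about a relationship.

```python
from itertools import combinations

def related_by_vtable(vtables, ignore=frozenset()):
    """Find class pairs whose vtables hold the same method address in the same slot.

    vtables -- dict: class name -> list of method addresses, in slot order
    ignore  -- addresses to skip, e.g. a shared pure-virtual stub
    Returns a dict: (class_a, class_b) -> list of shared slot indices.
    """
    related = {}
    for (name_a, table_a), (name_b, table_b) in combinations(sorted(vtables.items()), 2):
        shared = [
            slot
            for slot, (addr_a, addr_b) in enumerate(zip(table_a, table_b))
            if addr_a == addr_b and addr_a not in ignore
        ]
        if shared:
            related[(name_a, name_b)] = shared
    return related

PURE_VIRTUAL_STUB = 0x401000  # hypothetical address of the pure-virtual handler

# Hypothetical vtables, as they might be pulled from symbols or memory.
vtables = {
    "Stream":     [PURE_VIRTUAL_STUB, 0x402000, 0x402040],
    "FileStream": [0x403000,          0x402000, 0x403080],
    "Widget":     [PURE_VIRTUAL_STUB, 0x405000],
}

print(related_by_vtable(vtables, ignore={PURE_VIRTUAL_STUB}))
```

Here `FileStream` and `Stream` share slot 1, which suggests inheritance without saying which one is the base; constructor chaining or the presence of the pure-virtual stub would be needed to settle the direction.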
So as far as implementation, I started looking at Python for IDA, IDAPython, whatever it's called. There are some limitations that don't allow me to do all of this. First of all, there are no debugger callbacks in IDAPython. I can't set breakpoints and have things work off those breakpoints using IDAPython. There are two choices. We either have to write a plugin, which is fine, or, since I'd rather do it in Python, IDAPython has to support callbacks. Most of this stuff can be automated. You could load up something, disassemble it in IDA, and have it exercise the binary using whatever input you choose within the debugger in IDA. It's all there. It starts annotating your disassembly automatically. You come back, I don't know, half an hour later, and all this work has been done for you.

So, the standard API interface. I think lots of people have been wanting this. I know I've heard it from Matt, my friend up here in front. He's going to be doing another talk with me tonight at 8. He's been wanting this. A lot of people have been wanting this. A standard interface would allow us not to have vendor monopolies. If there's a standard interface, we can use any disassembler with any debugger. We're not locked into a certain tool or a certain vendor. A lot of things can be abstracted out, so we could be using one tool while our debugger is over the network on some embedded device somewhere else. All you need is to write the plugin for the debugger. You could be using a remote console and debugging, let's say, a Cisco router running GDB, and have full access to that from your console.

So this is kind of like a software JTAG. JTAG was developed by hardware people because they needed a standard interface. You want to have one interface that everyone can comply with, so your debugger works on all these different platforms. This is what I'd really like to see: some way that disassemblers, debuggers, what have you, can be swapped out, so they're interchangeable.
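To make the interchangeable-tools idea concrete, here is a minimal sketch of what a tool written against a common interface might look like. No such standard exists in the talk: `DebuggerBackend`, `FakeBackend`, and `describe_object` are invented names, and the 32-bit little-endian pointer read is an x86 assumption. The analysis function reuses the first-dword vtable check described earlier and only ever talks to the abstract interface.

```python
import struct
from abc import ABC, abstractmethod

class DebuggerBackend(ABC):
    """Hypothetical common interface; each real debugger would need its own plugin."""

    @abstractmethod
    def read_memory(self, address, size):
        """Return `size` bytes of target memory starting at `address`."""

    @abstractmethod
    def symbol_at(self, address):
        """Return the symbol name at exactly `address`, or None."""

class FakeBackend(DebuggerBackend):
    """In-memory stand-in for a debugger, so the sketch runs without a target."""

    def __init__(self, memory, symbols):
        self.memory = memory      # dict: base address -> bytes
        self.symbols = symbols    # dict: address -> name

    def read_memory(self, address, size):
        for base, data in self.memory.items():
            offset = address - base
            if 0 <= offset and offset + size <= len(data):
                return data[offset:offset + size]
        raise ValueError(f"unmapped read at {address:#x}")

    def symbol_at(self, address):
        return self.symbols.get(address)

def describe_object(backend, address):
    """Read the first dword of an object and look it up as a vtable symbol.

    Assumes a 32-bit little-endian target; returns None if the first
    dword does not match a known symbol.
    """
    (vtable_pointer,) = struct.unpack("<I", backend.read_memory(address, 4))
    return backend.symbol_at(vtable_pointer)

# Hypothetical target state: an object at 0x500000 whose first dword
# points at a vtable that the symbol source knows by name.
backend = FakeBackend(
    memory={0x500000: struct.pack("<I", 0x40A000) + b"\x00" * 12},
    symbols={0x40A000: "Stream::vftable"},
)
print(describe_object(backend, 0x500000))
```

Swapping `FakeBackend` for a plugin that wraps a local or remote debugger would leave `describe_object` unchanged, which is the property the software JTAG idea is after.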
So what has to happen? A plugin has to be written for every tool so that it complies with this interface. Register and machine-specific information can be abstracted out. It doesn't matter if it's ARM, x86, SPARC, what have you; that would all be contained. There are some problems with doing something like this. It would not be very effective for, let's say, reversing malware. It would not be very effective against protections. Anytime you have some kind of set interface like this, anti-debugging code is going to catch it just like that. So this is more for reversing functionality, looking for vulnerabilities in different types of products, and stuff like that.

So I'm running a little fast, but here's the conclusion: static and dynamic reversing can be combined to yield better results. And this is going to evolve into a more coupled framework where you can use different tools together. One tool I didn't mention in my slides, because I haven't really had time to look at it, is PaiMei. I'm really excited to check out PaiMei and see how it can be integrated into this sort of thing. I know Pedram took off already, but I've been wanting to talk to him about something like this. A lot of this stuff is going to be on my blog. My blog is also mirrored on OpenRCE, so if you're on there, you'll see it as well. And I'll be taking any questions. If there are any questions, just walk up to the mic, raise your hand, or what have you. Alright, since someone's coming up right now.

Thank you. My question is about reverse execution, where you can actually go into a debugger and step backwards. Do you have any experience with, or a recommendation for, tools that can do that?

I haven't used any of them. I know that Virtutech, I think is the name of the company, makes Simics. Simics is a simulator, and they have a product called Hindsight. I've never used it, but I know it does that type of thing.

And I have a second question.
Do you have any experience debugging multi-threaded applications? And do you have any strategies for debugging those types of applications?

I don't have too much experience with that. I mostly do reversing to look for vulnerabilities and just to reverse other people's products that I want to take a look at.

Thank you.

So if anyone else has a question, go ahead.

I have a question. Couldn't you use this to figure out what inputs it would take to get to a certain point in the program?

Well, yes, you could. That's a very good point. What's your name?

My name's Jeff. I was just thinking about this when you were talking about it.

Yeah, what a coincidence. We'll actually be talking about something like this at 8 p.m. today. Go ahead, Ryan.

In one of your example slides, you've got a pointer to a vtable and you end up calling, you know, eax plus 14h, and you use a debugger to pick up the member function name. Do you have the PDB symbols, or where are you getting the name from in that case?

Oh, I'm getting the name from, well, it's a Microsoft PDB, so it's going to have the vtable in there. So I set breakpoints on all the methods. I hit a breakpoint, I look up the stack to see who called it, and then I backtrace and fill in that information.

Why do you need a debugger to do that, rather than doing it straight from the dead listing?

You could. It's just that, for me, it's faster to do it this way than to generate the cross-references like that. It's the way I'm doing it. There are other ways to do it as well.

Okay. And while people are thinking of questions or what have you, I'll take a little side note here to talk a little philosophy. So last night I was talking to Katie. Katie's right here up front. Katie used to do a lot of DNA stuff. She worked on the genome project and what have you. I've always been interested in this bioinformatics, you know, bio stuff.
And I was like, never really looked at it and I was like, so what's the deal with this DNA stuff? How does it work? And she's saying that there's, you know, these four different letters, right? These letters make combinations and these combinations make streams and these double helixes and they all come out and that's what represents you or me and the differences between us. And then I thought and I was like, hmm. So this sequence, this DNA sequence is basically you, possibly. I'm not saying it is, but possibly it's a fuzzing iteration. Me, you and everyone here is just a single iteration in this fucking huge fuzzer of life. So that's it. I'm number one, three, five, seven, two, four, whatever. And that's it. We're just malicious input into the machine of life. Thank you. So later this week I'll be posting everything on my blog and it'll be mirrored on OpenRCE. If you have any questions, feel free to email me. Feel free to come up to me and ask me any questions or what have you. I am more likely to answer questions if you put a drink in my hand and I'll see you later.