So, let's basically just get started. Binary obfuscation is something you've probably come across if you've done any sort of reverse engineering. It's something you have to be very aware of when you're doing any kind of malware analysis, or if you're trying to reverse engineer proprietary code. For example, some programs are protected with a piece of software called ASProtect, a combination packer and encryptor for binaries that does a whole lot of nasty things, all sorts of really ridiculous obfuscation. In general, it's just really cool when you're looking at a binary and it sends your EIP all the way to the moon. You have no idea where the hell it's going, and it's as beautiful as it is frustrating. It sends you to drinking, you pass out, and you have no idea what the hell is going on with your binary or what to do about it. The CTF Quals' Binary 300, for example, employed a whole lot of really nasty binary obfuscation. If you actually tried to take a crack at that, it was pretty difficult. They put in a lot of really evil things, not evil in the sense of destroying your computer, just evil in the sense of making your brain hurt. Now, for binary obfuscation, your best bet is usually to do it after compilation, because then you have ultimate control over the assembly and can do whatever you want. However, there are actually a whole bunch of things you can do to obfuscate your binary without any of that. You can write specifically crafted high-level code that pushes against your compiler, makes it freak the hell out, and spits out a bunch of really weird code. That's basically what we're going to talk about today. So why exactly are we approaching this from the top down? Well, assembly, you know, is very simple.
It's basically a bunch of commands for a Turing machine, just a math machine when it comes down to it. But because it's just a math machine, writing anything coherent, like a really big program, in assembly can be pretty tedious. I mean, there are a whole lot of really awesome 64K programs out there, but they take a lot of time. It's also a lot easier for some of us to write higher-level code. It's more coherent to think about a program from the top down rather than from the bottom up; it's just a more natural way of thinking about things. And why do it by hand when you can be lazy? If you can do something the simple way, that's usually the best way. So let's talk about the purpose of obfuscation. The entire gist of obfuscation is to be an asshole. You don't want anybody to read your code. You want to send them into alcoholism. You're trying to waste their time, intimidate them, and make them think they'll never be able to read your code at all. The tools I'm going to be using in this talk are C and C++. I apologize for the fact that the binaries you'll see mentioned throughout the talk were built with MSVC++; however, the techniques in this talk leverage against compiler design in general rather than against any specific compiler. This is about leveraging the science behind compilers and compiler design to make sure your code survives the passes it goes through inside a compiler. So what are we not going to cover? Anti-debug is one of the things we're not going to cover.
Anti-debug techniques usually intermix with obfuscation in one way or another, but the actual techniques involved in anti-debugging are out of scope for this talk. Source obfuscation is too, where it doesn't actually relate to binary obfuscation: if you've ever seen obfuscated JavaScript, for example, the Google APIs that use JavaScript, or Apple's video player on their video site, there's a whole bunch of really weird obfuscation there, but it doesn't really translate into binary transformations. The effectiveness of obfuscation as a technique for preventing or frustrating reverse engineering is outside the scope of this talk as well, and any sort of post-compilation obfuscation is also off topic here. So this next part is probably going to be really boring if you know this stuff already, but it's a really important prerequisite. I'm going to cover a bunch of different things, function pointers, method pointers, calling conventions, and assembly, in order to bring everybody up to speed. This is really necessary to cover, and I'm sorry if you already know it, but we will get to the fun part, I promise. So, I mean, where would we be without pointers? They're absolutely beautiful. I love these things so much. Once you actually figure out how pointers work, it's kind of like you've reached your zen moment of programming. It's like figuring out how to do functional programming, one of those things that just blows your mind. So let's talk about function pointers. These are some sort of ancient voodoo employed by the K&R book that basically allow you to treat variables as functions, and they're really fun to work with because with function pointers you can do whatever the hell you want. Let's look at an example of what exactly a function pointer is.
The underlined/bolded part is where the actual cast happens, casting to the type of function pointer you want. To the left, we have the type our function is going to return; then in parentheses with the asterisk is the foo pointer; and to the right, we have the types that will be passed as arguments to the function. Then we simply assign foo to the foo pointer without actually calling it, and what happens is that bar gets the number that foo returns. It's very simple. Contrariwise, method pointers are ugly as sin. Method pointers are basically function pointers for classes, but unfortunately their syntax is absolutely annoying; just look at it, it's a complete mess. So let's try to deconstruct this. The first underlined thing is the declaration: the foo pointer is hidden beneath a whole mess of things. In the parentheses on the left, we have the class name we need in order to call it, then the asterisk and the foo pointer, et cetera. At the second underlined and bolded part, we have the assignment to the method pointer: foo pointer equals ampersand, my class, double colon, foo, and that's really annoying. And then to actually call the method, we need to do that stupid parentheses thing that I don't even like looking at. So now we have calling conventions. Calling conventions are really important when you deal with function pointers because there are different calling conventions for different types of functions. For example, printf is a vararg (variadic) function, which means there are variable arguments at the end, because how else would you be able to do printf? Some functions simply won't work properly if you don't use the right calling convention.
So these are the four typical calling conventions: stdcall, cdecl, fastcall, and thiscall, and they're all pretty simple. What you really need to know is that for cdecl, the caller pushes the arguments onto the stack, the called function pops them off and uses them, and the caller cleans up the mess afterwards. Stdcall is kind of the inverse of this: the arguments are pushed onto the stack the same way, but the called function itself cleans up the mess before returning. Fastcall is interesting in that it tries to be fast: the first two arguments go into registers (ECX and EDX), meaning it doesn't have to touch the stack for those values. The rest of the arguments, if there are more, are pushed onto the stack, and the called function cleans up the mess. Thiscall is another interesting one. It's used when a method on a class object is called: the this pointer is moved into ECX, the function arguments are pushed onto the stack, and the called function cleans up its own mess. So now we're going to talk about compiler optimizations. Your compiler does a whole bunch of different things to your code. It figures out whether the flow of your program is repetitive, whether there are pieces of code that never execute, things like that, and it makes sure everything is as efficient as possible. There are a whole bunch of different ways compilers optimize your code: there's static variable analysis, and there are a bunch of proprietary algorithms. Here's a great example of a just absolutely weird optimization: you'll notice that in one of the binaries I mention in this talk, specifically the one doing control flow with an AVL tree, the output is not what it should be.
What it should look like is one piece of code that controls everything, calling through a dereferenced register to dispatch each function. But what happens on full optimization is that the MSVC++ compiler actually unravels a static in-order traversal of the AVL tree. And if you know anything about data structures, that's actually really good optimization; it's kind of scary to think just how good it is. So we're going to cover four different kinds of analysis that compilers do: control flow analysis, variable analysis, reaching definitions, and, something that's very interesting, the volatile keyword. At compile time, your code is separated into a bunch of logical blocks, and control flow analysis figures out how your code literally flows. The blocks are then optimized: within this block, is anything redundant happening? Within that block? What about across the entire graph, is this being canceled out by that, et cetera? So here's an example. For some reason or another, whatever code I've written winds up being compiled into this. Here we have EAX being set to 949. Then we XOR EAX with 310. Then we compare it with zero. If it's not zero, it jumps to the XOR block, which XORs EAX with 310 again and pushes EAX onto the stack. If it is zero, it XORs EAX with itself and leaves the program. But the compiler knows exactly what's going on. It's going to see this and say: that branch is never actually going to execute, I don't want this in my program. So it simply cleans up the mess and your code gets optimized; the conditional branch is removed because the compiler can tell it logically never executes. The compiler also winds up looking at your variables to make sure you're not doing anything redundant.
Like I said earlier, it's going to make sure your code isn't repetitive and that things don't cancel each other out. There are various algorithms that can be applied to point this out, like DAG-based analysis and the static variable analysis I mentioned earlier, and they basically make sure your variables aren't all weird and repetitive. So let's go back to what we had earlier. This is the optimized version, but there's still a problem with it, because the XOR operation doesn't actually do anything: it just inverts itself, and the value comes back to 949 again. So your compiler is obviously pretty smart, because it turned all those lines of code back into two lines of code. So how are you going to add a whole bunch of junk data, if that's the issue? Your compiler is also a total neat freak. It pretty much has OCD, at least for what it knows how to touch. Variables that aren't ever actually used get completely tossed aside, and you never have to worry about them being in your binary. Let's look at another example. EAX is 949, EBX is 310, ECX is 213. Then we do a bunch of math with EAX and EBX, and ECX just gets completely left behind. We don't ever actually use it, so your compiler is going to see that and get rid of it. But there are times when you don't want optimizations to happen, and for legitimate reasons: in a lot of hardware development, there are variables that you don't want optimized whatsoever. What the volatile keyword basically says is: hey, this is my variable, it's not your variable. Please, for the love of God, do not touch it; you do not know what you will wreak. Here is how you apply the volatile keyword. It's very simple, basically like the unsigned keyword in C and C++.
You can apply it to an int, a char, a uint32, a my_fancy_struct_t; it'll work for anything, and that data will not be optimized. So now we have an example. We're doing a whole bunch of math to this x variable, and we don't know what it's going to be. I literally just typed a bunch of random numbers and operations; I don't even know what this number is going to be. So x equals seven, then it's shifted left by two, multiplied by two, twelve is subtracted, then it's squared, then shifted left again. And what's our magic number going to be? I don't know. So maybe this is going to get translated into a whole bunch of really weird code. No, it really doesn't. This is what the compiler is going to spit out: it looks at all those numbers, does a little math, and there's your number. Your compiler says, hey, that's 0x1E6C, I'm not stupid, don't mess with me. But if we put in the volatile keyword, a whole bunch of weird stuff happens. This is the exact same code, and this is what happens when you actually compile it: seven gets moved into a slot on the stack, then that variable is shifted left by two, and so on; it's literally copying every single thing we did. This is obviously very easily circumvented, because you could just look at that and say, oh, that's a bunch of math, let me scroll down, put a breakpoint here, run, oh, it's 0x1E6C. So it's easily circumvented; it's just an example to show how the volatile keyword actually works. Now, Unix philosophy here: everything's a file, even your executables. File formats are really important, because you need to know where your data is being stored if you want to manipulate it correctly. If I just put the string "hello world" in my binary, someone else can see that the string says "hello world" in my binary, even if I encrypt it at runtime. It just says hello world.
So I can run strings on that program without ever actually executing it and see "hello world" right there. The most common file formats you're ever going to come across are PE and ELF; most reverse engineers already know this, it's very common knowledge. What's important here is that both of these formats have a table used for external library calls, printf, execv, et cetera. On Windows it's the IAT, and on Linux it's the PLT. Now, if you obfuscate function pointers, those calls are likely not going to show up in there, because you're not actually hard-coding the references; we'll cover circumventing this issue later, because it can cause a bunch of problems in the end. There are also various methods of analysis that people will employ, which matters if you want to maximize the psychological annoyance of whoever you're leveraging against. API analysis is very common. You look at a program and see, hmm, this is using InternetOpenUrlA, and it's also writing a file to the hard drive; this is probably a downloader of some kind. So just by looking at the API calls it makes, you can figure out the gist of what a program is doing. Obviously more analysis is necessary, but this is basically how you start. There are a few programs out there, and two of them specifically I want to focus on because of their two different methods of analysis: VirusTotal and Zero Wine. VirusTotal is this really cool online app that allows you to upload anything you suspect might be malware and send it off. What it does is run the binary through around 40 different virus scanners: Panda, AVG, McAfee, the five different versions of Norton, whatever you have.
At the end of the analysis, along with who picked it up and who didn't, you basically get a report card for who sucks at scanning and who doesn't, plus a list of recognized Windows API calls, the data sections, the size of the program, et cetera. Zero Wine is a malware analysis tool that is exactly what it sounds like: it runs your program in Wine, does a bunch of analysis on the binary as it executes, and gives you a whole report. It also shows you the API calls made by the program, what files it wrote to the hard drive, and things like that. So when you're analyzing a binary, there are two schools of thought we're going to talk about here: dead code analysis and live code analysis, and they are exactly what they sound like. Dead code analysis means you figure out what the program is doing without running it. Live code analysis means you run the program and watch what it does. These two tools use the two different types of analysis to accomplish similar goals. VirusTotal scans the binary, runs it against a bunch of signatures from the antivirus vendors, and gives you a report, which doesn't pick up on everything if you have a packed binary it doesn't know how to unpack. Zero Wine, like I said, employs live code analysis: it runs the binary in a controlled sandbox environment, does its analysis, and gives you all the report data. Now, if we're going to obfuscate our binary, there are a few things we can do to leverage against these two types of analysis. Dead code analysis is actually very easily frustrated.
You can use polymorphic techniques and do all sorts of really ridiculous things to your code, and unless someone actually executes the program or knows how to unravel it manually, you can basically evade anybody doing dead code analysis. Live code analysis is a lot harder, because people can actually watch what the program is doing. This is why you see a lot of anti-debug: people are essentially trying to keep you from seeing what is being done. The entire idea behind circumventing live code analysis is Jedi sleight of hand, a "you do not see me accessing the internet" kind of thing. So we're almost at the fun part. Now we're going to talk about the various types of obfuscation we can employ. There are basically three: layout obfuscation, control flow obfuscation, and data obfuscation. Layout obfuscation essentially means moving the program around this way and that, doing a bunch of pragmas and defines and things like that to make the code completely illegible. Like naming a function lowercase i, lowercase l, then a one, repeated a million times; that'll probably piss somebody off. The International Obfuscated C Code Contest is a great example of this. And by that I mean: just look at this. This is one of the entries from 2004, and it's beautiful in how ugly it is. Somebody kind of sort of carved out the letters haphazardly; it doesn't even look even. And I don't even know what this does. Just looking at it, you don't know what it does; you'd have to run it, unless you figured out a way to de-obfuscate it. It's very strange. The IOCCC is a very interesting contest; people have submitted very strange programs, like one that looked like a lily pad. I think it played asteroids or something like that.
They have a lot of really interesting things on that site. Control flow obfuscation means twisting the typical downward flow of your program into some sort of nest of spaghetti code. When you employ this in higher-level code, it has the really cool benefit of completely messing with the generated code, making it illegible and really hard to understand, because execution is literally going in a whole bunch of different directions. Data obfuscation is essentially mangling whatever data you have. Strings, numbers, even functions, and a whole array of other things that get stored in your program are all data, so you can do whatever the hell you want with them, if you can figure out how to tweak the data and then rebuild it at runtime. So now that we have all those foundations, the fun begins. Our goal is to obfuscate the binary without doing any binary transformations. We don't want to download some skeezy packer from the internet and use it on our program, only to find out that we got owned or something like that. No; we don't even know what that packer does, or whether there are a whole bunch of bugs in the packer we want to use on our binary anyway. So now we know how the compiler optimizes our code, what it does to our data, and how it stores the information important for programmatic logic, the information that determines what the program is supposed to do and where it's supposed to go. With all this in mind, now that we know about file formats and function pointers and things like that, we can actually leverage our code against the compiler. That IOCCC thing, by the way, is essentially useless here, because a lot of layout obfuscation does not translate into binary obfuscation at all.
If you do a define or pragma kind of thing, that's not going to translate well, because it's purely a compiler-level construct. Renaming variables and such won't do anything to your compiled C code either, because you don't have variable names in the binary unless debug symbols are on. You could do a whole bunch of layout obfuscation if you were transporting source code somewhere and wanted to keep other people from reading it, but that's essentially all it's useful for. Like I mentioned earlier, if you have an interpreted language, JavaScript or Python for example, layout obfuscation is essentially the only way you can go; even though there are Python compilers and JavaScript compilers out there, those languages are typically interpreted. So with that in mind, there's actually a lot of meat in control flow obfuscation. There's a whole bunch of stuff we can do. With function pointers, method pointers, the volatile keyword, and, yes, even a use for the goto keyword in this presentation, we can do a whole lot of really fun stuff. One of the things we're going to cover is opaque predicates. Opaque predicates are tautological if statements: if statements that are always true or always false. They can't be optimized away, because the compiler doesn't know what the statement is actually going to evaluate to; it looks at the condition and says, yeah, I'm going to let you go along. Like I mentioned earlier, you see this a lot in obfuscated JavaScript. It shows up in a lot of malicious JavaScript that does things like ActiveX injections, and you also see it in proprietary code distributed by Google and Apple. So let's get an example going. Here we have A, B, C, D.
We're going to do a bunch of math in our if statement, and the result is always going to be greater than zero. We know it's always going to be greater than zero, so it's always going to print "yes." We know that, but we don't want other people to know that. The compiler, though, is wise to our tricks. Because those variables are static at compile time, it knows what they are, it knows that branch will always execute, and it knows "no" will never execute. But if we add entropy to the mix, we don't even need the volatile keyword to mess with our compiler. We can just make A, B, C, and D random numbers, as long as each number is at least one. So this is the same thing we did earlier, just with entropy now. The compiler looks at this statement and says, hey, wait a minute, I have no idea what's going to happen here, I'd better not touch that. This is a great way to put a whole bunch of junk code into your program. It lets you plant red herrings, branches that will never actually execute, so that people reading your code with a dead code analysis method try to figure out what it's doing but risk going off down paths they don't actually want to go down, wasting a whole lot of time because they have to start over and figure out where the unpack routine is, et cetera. So: control flow flattening involves literally flattening the typical downward graphical flow of your program. Usually you have a top-down flow; it's very coherent.
It starts at main, then it branches; it's kind of like an inverted tree. If you have your DEF CON CD with you, and you happen to have your laptop here in the audience, you can go, oh wait, no, it's in a zip file, I'm sorry. Extract that zip file, and in the crackme section there are two different files: one of them is a crackme, one of them is kind of an unpackme. Take a look, and if you want to actually write a keygen for it, feel free, because it employs control flow flattening, which is really frustrating. Here's what a flattened graph looks like. On the left, we have the single piece of code that controls everything. A single variable says, okay, this is zero, so I'm going to go all the way to the left; then that variable changes to maybe one, and control goes back and says, okay, the variable is one now, so I go down here; and then down and back, down and back, until it says, oh, I'm done, and leaves. On the right, we have the normal flow of the program, you can see the labels: it goes downward, it goes left, and at some points it goes back up. To give you an idea of how much frustration this can cause, here's a side-by-side comparison of two graphs from IDA Pro's graph viewer. On the right-hand side is the typical downward flow of the graph. This is the unobfuscated version of leadkey.exe, and you can see there's a very obvious downward path: it goes left, it goes right, maybe it goes all the way right and quits; it's very coherent. But on the left-hand side we have the obfuscated version, and it looks like a complete nest of code.
If you look at this and actually try to read the code, it's very hard to figure out where it's going, and it's the same story when you look at the assembly itself; it's not just a graphical artifact, because this code is literally jumping in all sorts of different directions. So this is how you employ it, and it's actually very simple. On the left we have what we would normally want to do with our code: do this, do that, do a whole bunch more stuff. But with the goto keyword, if we have a single control variable, in this example x, and we just switch on it, we can completely mess with the control flow of the program. This is a very simplified version of what's going on in leadkey.exe. A whole bunch of jump instructions get generated when you do this, and it completely disrupts the control flow of the program. Say you have a nested loop inside one of these if statements: now you have another nested area of obfuscation, a bunch more jumps, and you don't know whether this section of jumps corresponds to that section of jumps. It gets more and more complicated the more nesting you do. And what's really cool is that you can do this with all sorts of structures. You can do it with a linked list, with an AVL tree, with a graph, with a whole bunch of different algorithms used for traversing data structures. It's just very fun. There are two examples on the CD: a control-flow-flattening linked list and a control-flow-flattening AVL tree. The AVL tree is the one I was talking about earlier, where the MSVC++ compiler was actually able to mess with our data. Most programs are reducible, which obviously means they can be optimized in one way or another.
If a program can't be reduced, then obviously it can't be optimized, and what happens is that a bunch of really weird assembly code gets pushed into the binary. It may be more verbose, but it's still very hectic to pore through. A great example of this, from one of the papers I used to research this talk, is making a loop irreducible, which is a very interesting technique. You can see it again in the irreducible C file. Raising bogus exceptions is one of those cases where anti-debug crosses over with obfuscation, because it lets you add a whole bunch of junk into your code that, again, serves as more red herrings. And it's very easily accomplished: you set up a try block, then you trigger a bogus exception, for example a divide-by-zero exception. You can do the same thing with signals on Linux, raising a bogus SIGHUP or any sort of general signal. I have another source file for this, the cflow exceptions one, and an example here. It's very simple. We make our trigger volatile so that it doesn't get optimized away and wipe out our try block. Then we do this, then we do that. Then we trigger our divide-by-zero exception, which is caught at the catch block; then it does more, and then a whole lot more, whatever the hell we want. And the "never executes" part, obviously, by its name, never actually executes. If we just keep going with that, it keeps adding more and more code into the program, which adds a whole bunch more red herrings. So, data obfuscation. There aren't a whole lot of things you can do with data obfuscation, but there are some. It takes a little more care than control flow obfuscation, because you have to be aware of what's going to happen when somebody tries to reverse your code.
Like I was saying earlier with the Hello World string example: if I do a whole bunch of really weird, cool things with it, move the H into this array, move the exclamation point over to this character pointer over here, turn it into an eight, and do a whole bunch of other stuff, that's great and all, but if I run strings on the program, I still see Hello World. And that completely defeats the purpose of whatever the hell you were trying to do. If the data isn't obfuscated before runtime, your obfuscation is completely useless. So one of the more obvious ideas is to encrypt your strings. People will try to divert you away from relying on simple string analysis to figure out what a program is doing, and rightly so, because an author can stuff a program full of junk strings that are nothing but red herrings; string analysis can sort of help, but it's not really that reliable. Either way, you don't want to give reverse engineers any sort of lead into what you're trying to do, so it's best to just hide that data. Let's go back to what we were talking about with the volatile keyword earlier. As I said, that example is very easily circumvented: you can just look at all the math that's going on, set a breakpoint where it ends, and watch the variable as it turns into 1e6c. But if you do it a lot, and if you also employ control flow obfuscation to keep people from figuring out where it's actually going to stop, then you can do a whole lot more with that data. And it's really interesting, because the assembly winds up going in all sorts of different directions. Data aggregation is another technique you can use. Let's say we have this string here; when we do a dead-code analysis on the program, it just shows up as "fooar", or however you'd pronounce that, basically a very strange string of nonsense.
But at runtime, we can just put it back together. If we run that loop at the bottom, foo becomes the string foo, and bar becomes the string bar. Unfortunately, they're not NUL-terminated, so if you try to print them, bad things will happen. So, functions in the PLT and the IAT are definitely data, and there are things we can do even to function pointers to prevent API call analysis from being done on our binary. This is basically accomplished through LoadLibrary and GetProcAddress on Windows, and dlopen and dlsym on Linux. I have three different examples of this: there's loadlib for Windows and dlopen for Linux, and then finally the piece of code that kind of got this whole talk started, called mdl. It's a very tame proof-of-concept piece of malware; in fact, I wouldn't even really consider it malware. It's basically a kind of Trojan-dropper sort of thing: all it does is take a URL when it's executed, download that URL to a hidden file called download, and then you can do whatever you want with it. Now, the interesting thing is that when I first wrote this, I compiled it and sent it up to VirusTotal. There are a lot of really good heuristic scanners out there, and about eight or nine of them, rightly so, picked it up as a downloader, because I was using functions like InternetOpenUrlA and WriteFile, and they could easily pick up on that. However, when I used LoadLibrary and GetProcAddress, all that showed up in my import table were LoadLibrary and GetProcAddress, because that's all I was actually, technically, importing. So that way I was able to circumvent heuristic analysis by essentially hiding my pointers. So now we're going to talk about something absolutely absurd in this binary obfuscation talk.
Basically, one of the things I came up with as I was doing this: I didn't, and still kind of don't, know how to write a packer, but I wanted to write one, so I wanted to find a way to fake it. It combines control flow and data obfuscation at the same time, and it causes a whole bunch of headaches, both for the author and for the reverse engineer. All it really revolves around is compiling, copying data, and applying function pointers to obfuscated or encrypted data. There's another great example of this called manifest.exe. It's actually really big, and it shouldn't be that big; I'll explain why in a few minutes. When you first run this binary, it is likely going to crash, and that's not necessarily the program's fault, that's by design. So if you have any problems actually trying to unpack it, just ask a DC949 member what the group motto is. So here are the ridiculous steps I take in order to actually accomplish the poor man's packer. First we compile the function, then we disassemble it, then we copy the bytes of that function and make them an array. Then we apply encryption or aggregation before compile time to make our data all messed up and weird. Then we recompile, the data gets decrypted or de-obfuscated at runtime, and we cast it to a function pointer and run it. The pmpconcept.c file on the CD-ROM gives a very simplified example of this. But this comes with a whole bunch of problems. Any functions you used in the now-obfuscated function are no longer in the PLT, or just generally no longer in the import tables. They just don't exist, because when the compiler compiles that byte array, it isn't going to figure out that the program is actually making references to functions that need to be in the table. So from there, you've got a broken program.
Data offsets are also completely messed up. If you use a string within a function, a static string is usually going to be pushed either onto the stack or, more likely, referenced in memory. Member functions in C++ also get completely messed up, because your thiscall is completely shattered, so you have to get around that too. Then there's also the case of messing up your calling conventions. If you don't use cdecl on a vararg function, your compiler is going to look at that and go: wait a second, no, this is wrong, let me fix this for you. Also, void pointers are scary, but useful here: if you pass a data structure, say a double void pointer holding the generic data you want to hand to every individual function, you can actually sidestep the issues caused by a broken memory section and by relative jumps and offsets. The same applies to method pointers and C++ objects in general. What's really interesting is that this gives you the opportunity to add and remove necessary program data as you see fit. Say I only want this data visible at this stage; if I allocate it on the heap, I can wipe it out arbitrarily afterward. We end up with more control over our data than is usually given to us. You also have to be really sure that your calling conventions match, because otherwise, when you start debugging, a lot of things are going to be broken if you're not careful. Like I mentioned earlier, cdecl is the calling convention for vararg functions like printf and basically the whole family of f-whatever-style functions in the system. Fastcall and stdcall should usually be fine. Obviously you have to pay attention to thiscall and deal with all the different method pointers going on in your C++ objects.
And mismatched calling conventions, like I mentioned, are going to cause a whole bunch of painful things. So why do we actually want to do this? It seems like a whole lot more work than is really necessary for any sort of packing. Why can't I just go download UPX and use that on my binary? Well, first of all, everybody uses UPX and everybody knows how to extract it, so that would be really dumb. What's actually really good about this technique is that you have absolute control over your data. Now that you have that function in your grasp, you can compress it, decompress it, Base64 it, apply any method of encryption or movement or whatever the hell you want to your binary data, without having to rely on some third-party tool that may not know how to use the Windows API properly. Your code is also still portable and executable, to a degree, because of how programs execute data. For example, a lot of systems now don't let you execute data in the data section or on the stack, so your data may or may not actually be executable; it's portable to a degree, but not entirely. This also adds a very interesting layer of obfuscation, because you can layer it heavily: this function can contain more packed functions, which call into still more functions, and it just keeps going, and you can dynamically pack and unpack your data as you go along. After I'm done calling out of a function, I can pack it, encrypt it, completely destroy it, and that function has completely disappeared. So if somebody needs to backtrack and they're not tracing my program, the data is going to be gone, and they're just going to have to start over.
When you do this enough, it also really obfuscates the source, because at that point all you see are data arrays and encryption strings and things like that. There's really nothing there for people to go on. So why is this terrible, then? Well, it makes your binaries huge if you're not careful about where you put the data. If you want those extra layers of obfuscation, your compiler is going to put a lot of those byte arrays onto the stack, which generates a ton of move operations, which essentially octuples your data, and it's really bad. That's why that binary is 90 kilobytes. It also takes a ton of work to accomplish; it just takes a lot of time. And it's really frustrating to debug. You have to execute the program and run it in a debugger to make sure it doesn't crash, and then you find out that, oh, it actually really does crash; let me figure out why. It's a lot more involved with pointer arithmetic than it really should be, and that can be a problem. There are also other tools you can use to accomplish this without doing it all by hand, so you don't have to write that very weird goto loop I've got going on in that one slide, or all the different math. There are a bunch of different tools you can use to transform the code without having to worry. So what are these things? There's TXL, from the University of Toronto I believe; you can find it at txl.ca. It's basically a regular-expression kind of thing, but there's more to the language: you can use it with a whole bunch of different languages, they have a bunch of different grammars, and it's really cool. I haven't gotten a chance to use it yet, but hopefully I will in the future.
SUIF is basically the same kind of thing; it uses a bunch of different techniques. I'm not entirely sure how to use it, but the papers I used to research this talk mention it specifically, in "On the Effectiveness of Source Code," et cetera, et cetera. If you're interested, you can read that paper; it's a really interesting one. About 70% of the material I'm presenting in this talk comes from it, so if you're interested at all, I suggest reading it. It provides a lot of insight into obfuscation in general, and there's a ton of sources cited. It's really great. There's also the binary rewriting defense against stack-based buffer overflows; if you're into exploitation, that's also an interesting paper to read. And then finally, a very briefly used paper, but one used nonetheless: binary obfuscation using signals. That's also interesting; it's kind of short, I think only four or five pages, but it was still pretty good. And that's pretty much all. I hope you enjoyed this presentation and learned something from it. Thank you very much.