are we doing? Good? Awesome. How many people have gone over to CTF today? Yeah? How many first-timers do we have here? Okay. At some point you need to go over to the CTF area. When I came to my first DEF CON and saw what CTF was, with all the lights and everybody hacking everybody else, I thought it was the coolest thing ever, and it's still the coolest thing ever. So take some time to go over there. Now, the DEF CON CTF is harder than just about any other contest in the entire world, and ATLAS has won it four times. Is that right? He's been part of a bigger team, but his teams have won it four different times. One of the things I like best about DEF CON is coming and finding people who are head and shoulders better at all kinds of stuff than you are. At home we're all the alpha nerds. Here we're all learning from everybody else. It's my pleasure to introduce ATLAS, who is going to melt our brains with some very technical stuff. Let's give him a big hand.
Hey, all. How many of you have participated in a Capture the Flag? How many of you are doing so right now and coming to see me? Thank you. That's very hot. How many of you are government? I saw that. Well, welcome. Today we're going to talk about symbolic analysis, specifically focused on a tool called Symboliks, the creator of which was threatening to show up today. I don't see him yet, so I dodged that bullet. How many of you have heard of symbolic execution? Very good. Most of us. How about symbolic analysis on a larger scale? Is symbolic execution the only thing you've heard of? Today we're going to talk about some of these in the context of binary analysis. We're going to talk about some of the... holy crap. I'm the last caption. I love it. Today we're going to talk about symbolic analysis and its use in determining very interesting things about binaries.
A little bit about me, very, very fast. I'm a Jesus follower. Walk out if you like. I'm a husband of one wife, father of three. 
I have horses and goats and chickens, and I ride a 2002 Honda Shadow. But that's not why you're here. I'm also a guy who likes to tear things apart and make them break in very interesting ways, and I have learned some from several of you in the crowd that I've already seen. So this is not all about, hey, I'm a fucking rock star. This is about, this is cool shit. Let's make good use of it, and whether you're on the side that wants to kill all the bugs or let them lead long, productive lives, hopefully this will help out. I love vulnerability... I'm tripping over my tongue as well. Vulnerability research, reverse engineering. I play with hardware, radio, firmware, software, and cars, medical devices, smart meters, the whole thing. I'm a bit ADD. If it weren't for my wife ratcheting me down, I'd probably be dead. I'm also a core developer for Vivisect, a binary analysis framework created by Invisigoth.
A little bit about Vivisect to start off. It's a binary analysis framework, like I just said. It is written in pure Python, and you can use interactive Python to poke at your binary and figure out how code flows through it. It provides several different scripting options. It was written from scratch to be collaborative. There's a client-server model; there's even a shared workspace model. It includes a platform debugger, emulators for multiple platforms, and a GUI for those of you who want one, although the focus is on programmatic analysis. We want to write programs that find zero-day and then exploit it. Yeah, I think we're in the right time. I'll give you the colorized version because the last one was so hard.
I'm going to throw some code at you very early on so that it's familiar to you. You can come back to the slides later and start tearing apart a binary in your own interactive Python session. To start Viv, use vivbin; that is the name of the binary. -B means bulk load this thing: don't bring up a GUI that slows it down, just analyze the crap out of it. 
And for today's specimen, we'll be talking about Stage 3 from the 2005 CTF quals from Kenshoto. It creates a .viv file which is, you guys don't need to know this, but it's basically a list of events that have happened since the very creation of the workspace. Which means that, although it's not fully implemented yet, there are many ways to back out changes. Those of you who use IDA a lot, you know why that's important.
Can you see a vulnerability in the slide? Yes, it's too small. Look again. If you can't read it, that is a push of 2047, then pushing an input buffer, pushing our arg zero for this function, which is the socket, and then a call to read. So we're reading 2047 bytes into a buffer that basically goes on forever. So this is not a buffer overflow. However, down below you see load effective address, blah, blah, blah, blah, ebp-1048. That's basically a 1K buffer. And then a call to sscanf. sscanf reads in, well, up to the end of the string, right? Causing a stack smash and a fairly easy-to-implement overflow exploit.
Here's how I like to view Vivisect. Most often I will have two ways of accessing Vivisect. I will have the vivbin binary, I'm sorry, the vivbin GUI, so that I can scroll through function analysis, blah, blah, blah, hooked into a server which actually houses all the changes and handles interaction. And then I have a command line tool that allows me direct access to it as well. It doesn't matter. I make changes in the GUI. I make changes on the command line. It all updates, and the GUI changes when I run my analysis stuff.
So I get in using IPython. If you're using Python and not IPython, well, I'm sorry. Please consider, you know, seeing the light. Import vivisect.cli as vivcli. That's just standard Python stuff. We create a Vivisect workspace. We then load from a file, and it will look at: is this an ELF binary, is this a PE, is this a Mach-O, is this an ihex or a binary, a blob? And it just puts the loadable modules into the workspace and does nothing else. 
I then call analyze. Analyze does all sorts of automatic analysis based on the architecture and the platform. So Linux on ARM, for example, things like that. And then I call saveWorkspace, and it saves out all the events that have happened during analysis, including the loading. There we go. So that's enough about Viv for now. It's enough to get you started.
Let's talk a little bit about Symboliks. One of the core foundations of Vivisect is Envi, the Envi disassembly framework. Envi is not just disassembly, though. In order to make an Envi module, you're supposed to create an emulator for it as well, because it's amazing what an emulator does for you when you're analyzing functions and the rest of the code. Particularly on ARM where, well, they have conditional everything, and there's this whole hopping back and forth between ARM and Thumb mode.
The idea of Symboliks is the dragging of system state from the beginning of a code path through to some arbitrary end state. So maybe that's from the entry of a function to a return. And many of you will know there are many ways through a function that has a beginning and an end. So we choose one. Graph theory helps us out here. But for this point, just think of a list of assembly instructions that would get executed in a row. Those assembly instructions are translated into the symbolic effects that they would have on the underlying processor. For example, push EBP, mov EBP, ESP. Shout out if you know what that is. Yes, sir. It's a function prologue. Exactly. So we translate these into symbolic effects, simple symbolic effects. And then later, we translate these into applied effects. We'll talk about the difference in a second. So it becomes: set ESP to ESP minus four. And then the memory location that ESP points to now holds the EBP value. And then we update EBP with the new ESP. I have a fly. And it's bugging me. So we have to talk a little bit about graph theory, though. 
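The function prologue example above can be sketched in a few lines of plain Python. This is a toy model, not Vivisect's API: the state dictionary and the push and mov helpers are hypothetical names made up for illustration, and expressions are kept as plain strings written in terms of the initial register values.

```python
# Toy model (not Vivisect's API): track register and memory state as
# strings written in terms of the *initial* register values.

def push(state, reg, width=4):
    """Effects of `push reg`: lower esp, then store reg at [esp]."""
    state["regs"]["esp"] = f"({state['regs']['esp']} - {width})"
    state["mem"][state["regs"]["esp"]] = state["regs"][reg]

def mov(state, dst, src):
    """Effect of `mov dst, src`: copy one register into another."""
    state["regs"][dst] = state["regs"][src]

def run_prologue():
    # Every register starts out as its own symbolic name.
    state = {"regs": {"esp": "esp", "ebp": "ebp"}, "mem": {}}
    push(state, "ebp")        # push ebp
    mov(state, "ebp", "esp")  # mov ebp, esp
    return state

if __name__ == "__main__":
    final = run_prologue()
    print(final["regs"])  # both esp and ebp are now "(esp - 4)"
    print(final["mem"])   # the slot at "(esp - 4)" holds the original ebp
```

The point of the sketch is only that every value stays expressed relative to where the code path started, which is what lets later analysis ask questions about the initial state.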
This single thread of execution through a program can find vulnerabilities in a very specially crafted case. But what we're really trying to do is exercise many code paths, as many as we can that are valuable. And to do so, we rely on graph theory. Ever heard of graph theory? Yeah, I hope so. Graph theory is amazing. It's not necessarily easy, but it can cover some very complex problems. There have been times where I found that Viv, a while back, didn't do a very good job of creating a graph for a particular function. And it caused me great headache. So that's where the first bullet came from.
You've probably interacted with certain visual aspects of graph theory. For example, if you've torn apart a function in IDA, or in Vivisect, and you've looked at graph view, it is kind of a visual representation of a graph. Of a code graph. It is a graph, obviously. So you can all hold your applause until later for my visual wonderfulness. So you see at the top, we have a code block that has a decision tree. It either goes left or goes right. They re-merge and end up exiting the function. Very simple view of a code graph. It is a directed, wait, did I skip that already? Yes, it is a directed graph, which means that edges flow in one direction. Very important, because if you could actually make your x86 code flow backwards, then we would have a whole different class of vulnerabilities.
So take this back to Stage 3, just briefly. What you can't really read here is the code graph from the child request handler in Stage 3. So I said that it's not quite the same thing to have a code graph and have an IDA graph. And the reason is IDA and Viv, well, they don't follow every call. And there are things that are conditional that don't necessarily show up as a different node, which they should. Because if they're conditional, they're either executed or they're not. And yet compare-exchange, for example, on x86 just shows up in the code flow. 
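The simple left/right/merge code graph described above can be written down as a directed graph and walked to list its code paths. This is a minimal sketch with made-up block names, not how Vivisect stores its graphs; it only follows edges in their forward direction and skips any block already on the current path so that loops cannot run forever.

```python
# Toy code graph: nodes are basic-block names, edges are directed.
CODE_GRAPH = {
    "entry": ["left", "right"],  # conditional branch: two successors
    "left": ["merge"],
    "right": ["merge"],
    "merge": ["exit"],
    "exit": [],
}

def code_paths(graph, node, target, seen=()):
    """Yield each loop-free path from node to target as a list of blocks."""
    if node in seen:          # block already on this path: do not loop
        return
    path = seen + (node,)
    if node == target:
        yield list(path)
        return
    for succ in graph[node]:  # follow edges in their one allowed direction
        yield from code_paths(graph, succ, target, path)

if __name__ == "__main__":
    for path in code_paths(CODE_GRAPH, "entry", "exit"):
        print(" -> ".join(path))
```

Each path that comes out is one "list of instructions executed in a row" that a symbolic pass could then be run over.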
So if we were to take this graph and take all the calls and link them to other parts of the graph and have more code flow from there and then back into this graph, it would be more of an exact code flow graph. If we were to take conditional instructions like a branch, for example JZ, where we jump if the zero flag is set, in reality that's kind of its own thing. It deserves its own node because it may or may not actually do what you think it should do. Now, we simplify it, because that would be kind of hard to follow.
So as we're analyzing a code path, we step through and we drag the initial state through symbolically, so that as we modify memory locations, as we modify register values, they are modified and represented and stored in terms of the initial state. So if EAX started off as zero, all of your state throughout each instruction would be aware that EAX started out as zero, and would reference EAX as offsets and whatever as you add to and subtract from it, because it needs to maintain that initial state for us to do the analysis that we need.
So when we're walking through code, we first translate a binary opcode into a simple list of effects. As we hit conditional flow, we add constraints. So as the graph branches based on a yes or a no, a constraint is added for a code path that goes left, and its opposite is added for a path that goes right. This allows us to determine a code path that we want and then figure out what the hell gets us there.
So I keep showing you things that are not really... what is that? It looks different every time. I don't quite understand. Well, Visi in his wonderfulness created all of Symboliks to supply a repr version and a string version to represent what they mean. This really helps while developing because it allows us to see at a second's notice exactly what the symbolic state is. So the top part here, we're looking at the repr version. 
And it was created such that you can copy and paste it into another interactive Python shell or into other code and it actually recreates the symbolic state, because that's the name of all of the effects and the symbolic ... Can I get some water, anybody? I'm dragging over my tongue. Thank you. So if we print the symbolic state, you notice these are constrain paths at the top. Well, in the pretty version down below, this is also constrain paths. It says: all that goodness that will recreate the Python symbolic state, this is what it really means. So if a call to this function returns nonzero... They didn't have any vodka in the speaker room. I was kind of depressed. This is very good. Thank you. So I think you'll agree the bottom one is a lot easier to read than the top one. Again, all leading back to interactively working with the system, creating code, thank you, that tears apart code very powerfully and very easily, and is easily debugged.
So a little bit more example. Set variable EAX to a constant one, okay, cool. Set variable ESP to the subtraction of const, the top of the stack. I'm using tools that actually turn the top of the stack into something very easy that most of us kind of jive with. So basically we are subtracting four from ESP. We're shoving then, setting EBP to EBP. Then we add to, oh man, I'm not even going to continue. Look at the bottom one. The bottom one is the exact same symbols. I'm just using print instead of repr. And it says, hey, I set EAX to one, ESP equals ESP minus four, EBP equals EBP, blah, blah, blah, blah, blah. Much easier to read, I think you'll agree.
So I have to call out, though: Symboliks has two different stages in symbolic analysis, both of which are actually very powerful and important to work together when you're writing tools. So I said before, we translate binary opcodes into simple effects. We then apply those simple effects to a symbolic emulator. 
And it extrudes the initial state through into every single effect that you've passed in. So you're left with simple effects, where I know this is a push EBP, so it subtracts four from EAX, or rather from ESP, but it doesn't actually keep the state. And then it pushes the thing into the memory location of ESP. The applied effects are the ones that keep the state all the way through.
So to give you a couple of things to type in when you go home and want to play around with this: once you've set up your workspace, you disassemble the opcode. You say op equals vw.parseOpcode, and you give it a virtual address. You then translate that opcode into simple effects by having a translator and executing translateOpcode with the opcode that you've just created. And then you run the simple effects through the emulator, giving a list of applied effects. emu.applyEffects, and you get the effects from your translator, and it spits out a list of applied effects. Basically, Symboliks is architecture independent. The only gotcha there is that the names of things change with the architecture. For example, R15 on ARM would be PC, or EIP or RIP on x86 or x64.
So how was Symboliks put together, and why do I care? So Invisigoth used many powerful features of the Python language and subclassed basically the arithmetic functions that every object has, like addition, subtraction, XOR, whatever. And I'm jumping ahead of myself. Please forgive me. I've been hacking on binaries for way too long. Sorry. I am in the middle of CTF. So Python, I'm sorry, Symboliks has the following primitives. We have a constant, which is just a constant value. We have a variable, which is a variable of whatever name. It could be a register. It could be some known symbol in the workspace. A memory object, which represents memory. And we have a call, and this allows us to keep track of where a call might fit into a Symboliks state but not necessarily go do all of the call before finishing our analysis pass. You may if you so choose. 
And Arg, meaning something that we were handed in a given function. A cnot effect, which is basically saying, hey, do the opposite of that. So if you have a Var eax, your register, and you say not eax, you end up with a cnot of the variable eax. And then an operator. So our operators are where hooking the Python object special methods comes into play. So basically we have an operator o_add, which is used to represent the addition of two Symboliks states, and o_sub, representing the subtraction of two Symboliks states. Obviously the order is important here. These are implemented using the Symboliks base, which is the Python class that all of the Symboliks components subclass, and simply overloading the __add__ function and __iadd__, because it doesn't matter if it's signed or unsigned in our case.
We have effects, and these are the things that actually happen. These are action verbs. We have set variable, read memory, write memory, call function, and constrain path. So the constrain path, obviously, is where you hit a decision case and you have to choose where to go from there. Your constraints are little objects, well, little names anyway, called eq, ne for not equals, greater than, less than, greater or equal, as you know. When you run into an UNK or a NOTUNK, this is where we really can't reason very well about what the constraint is arithmetically, because it's the product of an OR or some other bitwise, I'm sorry, yeah, OR, XOR, or some other bitwise effect.
So let's talk a little bit about how to be powerful with this. I like to use Symboliks just interactively. I get code that I don't really know what it does, and I throw it into a code path that's interesting to me. I symbolically emulate it, and I get to see what the effect is and what the Symboliks state is at the point that I'm interested in. Well, that can be overwhelming, and I'll tell you why in a moment. 
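The operator idea described above, where overloading Python's arithmetic special methods lets ordinary expressions build symbolic state, can be illustrated with a small stand-alone sketch. The class names here (SymBase, Const, Var, Add, Sub) are hypothetical and chosen for readability; they are not the names or the implementation used inside Symboliks. The sketch also shows the two print forms mentioned earlier: a repr that can be pasted back into Python, and a str that is easy to read.

```python
class SymBase:
    """Toy base class: arithmetic on symbolic objects builds new objects."""
    def __add__(self, other):
        return Add(self, wrap(other))
    def __sub__(self, other):
        return Sub(self, wrap(other))

class Const(SymBase):
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return f"Const({self.value!r})"
    def __str__(self):
        return str(self.value)

class Var(SymBase):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Var({self.name!r})"
    def __str__(self):
        return self.name

class Oper(SymBase):
    symbol = "?"
    def __init__(self, left, right):
        self.left, self.right = left, right   # operand order matters
    def __repr__(self):
        return f"{type(self).__name__}({self.left!r}, {self.right!r})"
    def __str__(self):
        return f"({self.left} {self.symbol} {self.right})"

class Add(Oper):
    symbol = "+"

class Sub(Oper):
    symbol = "-"

def wrap(value):
    """Let plain ints take part in symbolic arithmetic."""
    return value if isinstance(value, SymBase) else Const(value)

if __name__ == "__main__":
    expr = Var("esp") - 4
    print(repr(expr))  # machine-friendly form
    print(str(expr))   # human-friendly form
```

Because the repr is valid Python for these toy classes, evaluating it in a session where they are defined rebuilds an equivalent object, which mirrors the copy-and-paste workflow described in the talk.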
Our applied effects get run through the Symboliks emulator; we just talked about that. We then have the option to run reduce on the Symboliks effects. This takes things like xor eax, eax and says, oh, that's zero, so eax equals zero. And things of that nature, addition, subtraction, they all kind of coalesce. If mathematically you can combine them easily, then they can be reduced. Why? I mean, that kind of sounds nerdy, right? It's just a simple number or something. Well, because this effect right here is enough to blow your mind, and yet it reduces to something exceedingly simple.
We're also given the capability of solving. Now, as many of you know, a Symboliks state may be discrete or it may not be discrete. If it's not discrete, how do we solve it? Well, if it is discrete, very easily: we just run the numbers through and spit out the answer. But if it's not discrete, Symboliks gives us the ability at least to compare two Symboliks states even though they're not discrete. And that is using the hash of its basic repr. So for example, Var.solve, because a Var can't be discrete by definition, spits out a long integer from an MD5 sum of its repr.
We can also update the Symboliks state using an emulator that already has data. And as of, what, about a year ago, Visi wrapped in the ability to create substitutions. Now actually, in my opinion, this should have been called solve as well, because we put together a set of values that a Symboliks variable can have, and then we ratchet that into the Symboliks state solver, and we get back a generator which gives us all the different results that those values would have provided.
So here's an example of using substitution. I use this in switch case analysis in my branch of Vivisect. Basically, I put together a list of ranges given a constraint. For example, when you're looking at a switch case, you generally start at some base zero index and you roll through three, five, 50 different options that follow that. 
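Reduce, solve, and substitution as described above can be mimicked with a toy expression format. The following is a sketch under stated assumptions, not the Symboliks implementation: expressions are ints, variable names, or (op, left, right) tuples, and reduce_expr, solve, and substitutions are hypothetical helper names. The substitution helper is a generator, in the spirit of the one mentioned in the talk, and the example feeds it a range of switch-case indexes.

```python
import operator
from itertools import product

# Toy expressions: an int, a variable name (str), or (op, left, right).
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "^": operator.xor}

def reduce_expr(expr):
    """Fold constants and apply a few identities such as x ^ x == 0."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = reduce_expr(left), reduce_expr(right)
    if op in ("^", "-") and left == right:
        return 0                       # xor eax, eax -> 0; x - x -> 0
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)    # both sides known: just compute
    if op in ("+", "-") and right == 0:
        return left                    # x + 0 -> x, x - 0 -> x
    return (op, left, right)

def solve(expr, values):
    """Evaluate expr with concrete values bound to variable names."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return values[expr]
    op, left, right = expr
    return OPS[op](solve(left, values), solve(right, values))

def substitutions(expr, candidates):
    """Yield (binding, result) for every combination of candidate values."""
    names = sorted(candidates)
    for combo in product(*(candidates[n] for n in names)):
        binding = dict(zip(names, combo))
        yield binding, solve(expr, binding)

if __name__ == "__main__":
    print(reduce_expr(("^", "eax", "eax")))
    # Hypothetical jump-table slot address: base + idx * 4
    slot = ("+", 0x1000, ("*", "idx", 4))
    for binding, addr in substitutions(slot, {"idx": range(3)}):
        print(binding, hex(addr))
```

A real reducer would know many more identities; the sketch only shows why reduction makes a large expression readable and how a set of candidate values turns one symbolic expression into a list of concrete answers.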
I don't know how many of you have done the work to analyze switch cases as spit out by a compiler. But when they happen, they're often broken up into, like, groups. Because if you have a switch case that has a zero case and a 32,000 case, you don't want to have 31,999 entries in between if there are just those two, or if there's like a pocket of five or 20 around each one. So generally they represent different code paths, and they end up starting at a zero index with some relative base. So we come through, and you can see the debugging here. See how good my laser skills are. I'm not used to being this far behind the thing. So you can see my debugging here with a print of the variables of the given state. We create a range, and we roll through every index that we've identified that this switch case handles. And then by solving that, we're able to see what the outcome of the switch case is. So if it's switch case zero, then there is some place in an array that has a pointer to a code block that handles switch case zero, and so on and so forth. So we ratchet through that so that we can create cross references in the Vivisect workspace.
So I won't talk much about this right now, but I recommend that if you check this out, you look into archind. In Vivisect, in Symboliks, there's an archind module that allows you to do a lot towards architecture independence. I happen to know firsthand that it's been used to completely solve the function comparison problem.
So why do we care about this? I know I'm a nerd. Well, vulnerability research and reverse engineering are basically about solving problems or answering questions that are very difficult to answer. Reverse engineering is identifying behavior, and vulnerability research is about finding very specific behavior. Vulnerable behavior. So we're hunting fat, juicy behaviors? Absolutely. Got a couple of case studies here. ROP gadgets. Who here has dug through looking for ROP gadgets? Yes, we all have. 
Well, it turns out that by searching through a binary, an executable area of a binary, you can trace symbolic state up to some known terminator point and ask very specific questions about what that little code snippet does. So rather than starting at a ret and stepping back, byte by byte by byte, making sure that it still decodes into a ret after some things, and trying to figure out, oh, this is a really cool ROP gadget that does this thing and it kind of writes over there, but it really updates these other things that I'm really interested in, you can use Symboliks to do analysis on code snippets and spit out: hey, this moves ebx into eax, and it's three bytes long or three instructions long. You can do actual culling of ROP gadgets using a symbolic state engine.
For example, forgive me. So for example, we roll through a snippet of code and then we dig into the variables that are discovered. So we say, symbolic state, what things have been written to? We then look for, hey, is that thing this register? And is it writing to a register? Then we know that we have a register copy. And hey, if the value of the second register also ended up in the first register, we know we have an exchange. These are things that are programmatic and solvable by your own code.
And just to give you something else to think about, I won't dig into these. But we already talked a little bit about switch case analysis. So basically what we're doing is we're trying to tell the computer to do the things that we do in our own super magical portable brain. So in switch case analysis in my branch, we start off at every dynamic branch. Then we say, well, a dynamic branch can be a call to a register or some dereference, or a jump to a register or some dereference, or whatever your architecture's version of those two is. So in the Vivisect analysis pass, we already identify these things. It's just a virtual address set. So we pull what Vivisect has already given us. 
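The register-copy and exchange checks described for ROP gadgets can be sketched as a small classifier over a gadget's final register state. This is a deliberately simplified stand-in: real symbolic state holds full expressions, while the hypothetical classify_gadget helper below assumes each written register simply ends up holding the initial value of some other register.

```python
def classify_gadget(final_regs):
    """Label a gadget from its final register state.

    final_regs maps each written register to the name of the initial
    register whose value it now holds (a toy stand-in for real symbolic
    state, which would hold full expressions).
    """
    findings = []
    for dst, src in sorted(final_regs.items()):
        if dst == src:
            continue                          # register left unchanged
        if final_regs.get(src) == dst:
            if dst < src:                     # report each swap only once
                findings.append(f"xchg {dst}, {src}")
        else:
            findings.append(f"mov {dst}, {src}")
    return findings

if __name__ == "__main__":
    print(classify_gadget({"eax": "ebx"}))
    print(classify_gadget({"eax": "ebx", "ebx": "eax"}))
```

With labels like these attached to every candidate snippet, culling gadgets becomes a matter of filtering a list rather than reading disassembly by hand.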
We then start at that point and we back up until we are able to identify which register is used as an index pointer. We then roll through looking for anything that modifies that index pointer. And it gives us constraints that say, well, this is like from 50 to 75. So our switch case is 50 through 75 in this case. So let's now start at the beginning and ratchet through all of this one code path. We just ratchet through it over and over with different indexes, and it gives us the next code block that gets executed for every different index, and we're able to wire up the function, the code block edges. And that helps us a lot.
And that leads us back to Stage 3. So as you've been hearing about the Cyber Grand Challenge, this whole idea of automating the discovery of vulnerabilities is pretty high on the hit list. How do we do that? There are many different ways. There are a couple of different ways, actually. Some people have taken to just basically symbolically fuzzing, where you start at some place and you just keep going through different code paths until you reach some desired effect. Like for example, I don't know, EIP equals something from our input. We can do that. And it can be a very impactful way. It can also be highly cumbersome to the computer. Yes, I know computers do awesome things repetitively, over and over. But there's this whole halting problem. With graph theory and with code path tracing, we end up running into code paths that may never end, and we also can meander and take up all the cycles for all the time in the world and still not find what we're after.
So I prefer starting at where we're trying to go and backing up, and seeing if where we're trying to go with a particular code path can provide us with the behaviors that we're after. So we start with memcpy and all the references to memcpy, and we trace backwards. And with a good graph, then we are able to say, well, this memcpy is called with two fixed buffers and a fixed size. Crap. Okay, not vulnerable. Move on. 
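The "start at memcpy and back up" triage can be sketched as a filter over call sites once the arguments have been traced back. Everything below is hypothetical scaffolding for illustration: the call-site dictionaries, their field names, and the addresses are made up, and in practice the size and destination information would come out of the backward symbolic trace.

```python
def triage_memcpy_calls(call_sites):
    """Return the addresses of memcpy call sites worth a closer look.

    Each call site is a dict with:
      'va'       - hypothetical address of the call
      'dst_size' - known capacity of the destination buffer, in bytes
      'size'     - an int when the traced-back size argument is a fixed
                   constant, or a string naming a symbolic expression
    """
    interesting = []
    for site in call_sites:
        size = site["size"]
        if isinstance(size, int) and size <= site["dst_size"]:
            continue                      # fixed and it fits: move on
        interesting.append(site["va"])    # dynamic size, or fixed but too big
    return interesting

if __name__ == "__main__":
    sites = [
        {"va": 0x1000, "dst_size": 64, "size": 32},
        {"va": 0x2000, "dst_size": 64, "size": "arg1"},
        {"va": 0x3000, "dst_size": 64, "size": 128},
    ]
    print([hex(va) for va in triage_memcpy_calls(sites)])
```

The design choice here is to throw away the boring call sites as early as possible, so that the expensive path analysis is spent only on the ones where the attacker might control the size.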
Find something that allows us to compare, and look for a dynamically sized source, destination, or move size. Now, it can be a little complex. I mean, that's a fairly simple way, and that's definitely one of the analysis modules. But back at our Stage 3 case study, the vulnerability is the fact that we've allowed creating a string of up to 2047 bytes and then run sscanf on that string and put the output into a buffer that's too small. So we have to identify the size of our destination and our source and the constraints on them, because our source is actually unconstrained. It's huge. But we have to be able to copy it into a buffer that is too small and not have constraints applied that keep us from overflowing and overwriting ret.
So here's that example. We call read with 2047 bytes, then call sscanf. Oh, I forgot the bacon part. Yeah, we all love bacon, right? So since 0xbfbfebe4 is 1052 bytes from the top of the stack, which is ret, we overflow by 995 bytes. We can do this programmatically. I'm not going to show you code for it. Go write your own.
So as I said, starting where you want to go and backing up is my preferred method. Starting at a known entry point and going forward is also very powerful. It turns out a combination of the two is probably the best option.
So I leave you a couple of things for your play time. Import vivisect.cli as vivcli. Create a Vivisect workspace with vw = vivcli.VivCli(). vw.loadFromFile, some poor binary. I like to turn on verbose mode because it spits out a lot of messages that I would have otherwise not gotten. And then call vw.analyze. Import vivisect.symboliks.analysis, yes, as vs_anal. You read that correctly. Create a symbolic analysis context. Create a symbolic graph. And then get some symbolic paths going. As you interactively create symbolic paths and you review the symbolic effects, I think that you will see just exactly how powerful you can be. 
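The overflow arithmetic from the Stage 3 case study can be checked in a few lines. The numbers come from the talk (a 2047-byte read and a destination buffer at ebp-1048); the 4-byte saved ebp between the buffer and ret is the usual 32-bit x86 frame layout and is an assumption here, as are the helper names.

```python
READ_LIMIT = 2047   # bytes the read call accepts (from the talk)
BUF_OFFSET = 1048   # destination buffer sits at ebp-1048 (from the talk)
SAVED_EBP = 4       # assumed: saved ebp lies between the buffer and ret

def distance_to_ret(buf_offset, saved_frame_ptr=SAVED_EBP):
    """Bytes from the start of the buffer to the saved return address."""
    return buf_offset + saved_frame_ptr

def overflow_past_ret_start(input_len, buf_offset):
    """How many input bytes land at or beyond the start of ret (0 if none)."""
    return max(0, input_len - distance_to_ret(buf_offset))

if __name__ == "__main__":
    print(distance_to_ret(BUF_OFFSET))                      # 1052
    print(overflow_past_ret_start(READ_LIMIT, BUF_OFFSET))  # 995
```

Both results match the figures quoted in the talk: the buffer starts 1052 bytes below ret, and a full 2047-byte input runs 995 bytes past that point.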
And here are a couple of places to go look. Thank you very much.