I'm a reverse engineer. I work for a security company in Germany called Sabre Security. We build rather obscure reverse engineering tools for niche markets, and I'm going to give one of the oddest talks I've given in my life, mainly because this talk will not actually present any solutions to anything. Instead it lists a bunch of things that I think are not that hard to solve but haven't been publicly solved yet, and things that I would like to see solved. It will also include a few problems that are a bit more difficult.

All right. So this talk is quite different from the other talks I've given: no bugs, no new tools, no solutions — mainly a wish list of things that I would like other people to solve, or that I will work on at some point, I just don't know when. Secondly, for many of the problems that I'll be presenting I'll have a vague idea of how to approach them. These ideas are most likely going to be completely wrong and thoroughly misleading, so take them with a grain of salt.

All right, I'll just start now. Reverse engineering has grown quite a bit over the last few years, and perhaps we should start trying to understand what reverse engineering actually means. In order to understand reverse engineering we first have to understand engineering, because reverse engineering is obviously the inverse process of engineering. When you engineer, you define a problem, and then you design a system consisting of multiple components and their interactions. Then you construct these components — or you buy them, or you steal them, or whatever — and finally you integrate them, make them all work together, and hopefully your problem is solved.

In software, this means that the source code you actually compile is the last step of a concretization process: you start with an abstract design, an abstract understanding of your problem and of the solution you want to build, and then you build it on the source level. Once you have the source code, that's the last step of the engineering process — then it's just compilation and running, testing, selling.

I have to work in a few jokes to compete, I guess, but I'm very bad at this, being German. The engineering process does not only produce the final output. When I engineer a car, I don't only get the car in the end: I get a whole bunch of high-level design, I get the interactions between the components, I get specifications for the components, and so forth. The engineering process produces way more than just the source code or the final product — it produces an understanding of the problem and of the program that we're building.

Everybody who has been involved in engineering at some point knows that the actual process of building something is a lot messier than what I just described. That does not change the fact that even if I just puke a bunch of code into a file somewhere, I still have an implicit design, an implicit modularization, implicit trust relationships between components. I really can't help it: whenever I build something, there's a whole abstract design lurking, even though I haven't necessarily documented it well.

Okay, so what's a good definition of reverse engineering? Let me first tell you what reverse engineering is not. First of all, it is not producing a disassembly.
A disassembly is the first step, but reverse engineering goes way beyond that: given a large set of disassemblies, you don't really understand anything yet. A good disassembler is of course something you absolutely need in order to understand more, but a clean disassembly cannot be the end goal of reverse engineering. A good disassembly should recover all the functions in the executable, recover which function calls which other function, and properly — or mostly properly — separate data from code.

Secondly, what else is not reverse engineering? A lot of reverse engineers think that decompilation would be the holy grail of reverse engineering. But honestly, decompilation cannot be the end goal either, because as I just described, the engineering process starts at a very high level and is whittled down to source code, which is then compiled. The source code is already the last step of the engineering process, which means that recovering source code does not recover any of the high-level abstractions. If I give you the source code to Oracle, you still have no clue how the different components interact and fit together — I might as well just give you the assembly. This scales upwards: I don't think anybody at Microsoft really understands Vista, or that anybody really understands any larger piece of software. There's a big difference between decompiling something and really understanding it. If I completely disassemble a car and have all the pieces in front of me, I still have no understanding of the rationale behind the design of each individual component, and not necessarily any understanding of how an engine works.

So the really broad definition of reverse engineering would be: recovery of high-level abstractions and program understanding. Ideally, given an arbitrary executable, we should try to reconstruct the things that a good design process would have produced — re-modularize the one big pile of functions and their relations, separate them into modules, build things that help us understand the program.

Of course, for most of us, bug finding is the primary target — why would we need to understand the program just to find bugs? The thing is, realistically speaking — I don't know if it's going to be two years or five years or ten years or twenty years — the out-of-bounds memory access will be gone at some point. Just gone. The A340 avionics code was verified with a sound static checker and proven to be free of out-of-bounds memory accesses — and that's 110,000 lines of code, multi-threaded, so that's a fair achievement. Now scale this ten years into the future, and you can see that most things that are critical will be verified free of out-of-bounds memory accesses. Any bugs that we'll want in order to break systems will require us to understand a system and then find ways in which it was misdesigned, flaws in the logic, and so forth. So we will need these capabilities a few years down the line.

So what does software really consist of, and what would be a proper way of abstracting the bunch of functions that gets thrown in your face into something that makes more sense? One possible abstraction is that an application is a set of modules: you have the network interaction module, the packet parsing module, the authentication and crypto module, and so forth. These modules interact with each other and with the operating system, and they're strung together to form an application.

So what is a module in itself? A module is a set of functions, of which a few are exported. You don't call every function in a module; you call a number of predefined APIs, and these act as gateway nodes in the call graph, meaning that in order to reach most of the functions of the module you pass through a relatively small number of well-defined interfaces. So you have the external interface — the exported functions — and the external data structures, which the module uses to communicate with the outside world. Then you have a second set: the internal functions, which are never called from the outside, only through the gateway nodes, and which might use a set of internal data structures used only within that module.

The idea would then be to try to recover these structures. We assume that somebody was somewhat sane in the process of building the software, and we look at any book on design principles. Since I don't own any books on design principles, I had to use the book on source code auditing that Mark Dowd and John McDonald have written — excellent book, pre-order it on Amazon now, it's great. There are two principles that should apply to modules. One is loose coupling — no, not what you people are thinking — which means that modules should communicate using very few well-defined interfaces that do clear input sanitization. Secondly, strong cohesion: a module should provide functions that operate on closely related tasks. So it would be interesting if, given an executable, we could decompose it into modules that exhibit loose coupling and strong cohesion.

When you look at OOP code, things get a lot easier: in OOP we have objects, and we can reconstruct classes from the executable.
We can reconstruct class hierarchies from the executable, and so forth. Once we can do that, we have a lot of the abstract architecture already recovered. In fact, I think that reconstructing higher-level design patterns — getting a higher-level understanding of an executable — might actually be easier in the object-oriented case than in the standard C case, simply because more of the high-level design thinking is still present in the executable.

The thing is that reverse engineering is, in some respects, stuck in the 80s. What I mean by that is not very bright colors and shoulder pads in your jackets; what I mean is that reverse engineering is a really, really small community, and most of it has been done in secret, or by government agencies, and so forth. On one side we have software development, which is a huge industry that does all this crazy research into how to build better tools — round-trip engineering tools which generate code from your UML diagrams and UML diagrams from your code, and so forth. On the reverse engineering side, we have not so much in addition to that.

The reverse engineering scene has grown a lot in the last few years, though. It's just crazy these days: the fact that I'm standing in front of a room with this many people who have at least a cursory interest in reverse engineering already shows that the field is growing a lot, and we now have more reverse engineering tools coming in, like HBGary's Inspector, Pedram's PaiMei, or BinDiff and BinNavi, the two products we are selling. But all these products are really niche and really tiny in their reach. Because reverse engineering has been growing so much, it might be possible that we finally get some real research and development money flowing in from somewhere.

What I'll do in the following is name ten research problems. For most of them, I think they can be solved with moderate amounts of time and money; for some of them, I don't know. Many of the problems I'll be presenting are NP-hard in the general case — but luckily enough, I've never met the general case, so don't be turned off by that.

Challenge zero is a very, very simple first challenge — so simple it got the number zero, because it's really not that difficult. I find it surprisingly useful to have a program that, given a set of functions, generates for each function the set of its possible return values. If you see a subfunction call and you see that this function can return +1, -1, or 0, you can almost deduce that it's a comparator of some sort. If you see a function that returns either -1 or the return value of malloc, you know that -1 signals failure. Things like these — this is really trivial and can be done in a few hundred lines of Python.

So let's go on to something more interesting. Challenge one: something we really want is full data structure reconstruction. Given an executable, reconstruct all the data structures used in that executable. Just get me all of them.
Secondly, reconstruct all the points-to relations between the data structures, meaning: if possible, try to recover nested data structures, and whenever you have a structure member pointing to another data structure, recover that relation as well. Finally, once you're done with that, construct a graph of the data structures: every data structure is a node, and whenever a data structure contains a member pointing to another data structure, add an edge. This graph will immediately tell you a lot about the relations between the data structures in the program. You'll immediately see recursive data structures such as graphs and trees, and you'll be able to understand what is going on a lot quicker. This would even be useful on the source code level, where it's really trivial to generate.

I think this is relatively easily doable — I think it's even doable for venture-capital-backed companies, which means it's not difficult, and it's definitely doable for anybody who does not run after cheap money.

I even have a little bit of pseudocode, which may or may not work if you ever try to build this — a very rough algorithmic sketch. You iterate over all functions in the executable, and in each function you find all the separate memory cells that exist in that function. You retrieve the offsets at which these memory cells are accessed, and you create a data structure for each — so in each function, for each memory cell, you get a data structure. What you do then is iterate through the entire executable and build prototypes for function call arguments from the data structures you've just recovered. Finally, you iterate over all functions in the executable and merge data structures that are passed around: if a function's prototype says its local data structure comes in as the first argument, and a parent function says its own local data structure is passed as the first argument to that subfunction, you merge the two. You iterate this, and in theory it should terminate after roughly n iterations, where n is the maximum depth of the call graph.

Creating the graph of relations afterwards should be really easy, so I won't talk too much about that. Let's go to challenge two. Given an executable, don't only reconstruct the data structures, but reconstruct classes: associate functions to the classes — so recover all the methods — and reconstruct the inheritance hierarchy between the classes. Finally, create UML diagrams from the executable, merge those UML diagrams with the type information generated in challenge one, and you have something that will at least allow you to understand large object-oriented programs a lot more quickly than what we can do now.

I very much think this is doable. It might be compiler-specific to a certain extent, although I have yet to see a C++ compiler that doesn't build objects in the usual manner — putting a v-table pointer at the first data member, padding stuff at the end, and so forth. There is, in fact, a striking convergence of compilers towards a sane optimum for implementing C++ language features.
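To illustrate that layout convention, here is a toy sketch — the addresses, class names, and the vtable-to-class map are entirely invented, and a real tool would obtain that map from the static class reconstruction — which uses the v-table pointer at offset 0 of an object snapshot to identify its class, assuming a 32-bit little-endian target:

```python
import struct

# Invented mapping from v-table address to reconstructed class name.
vtable_to_class = {0x401000: 'Connection', 0x401080: 'SSLConnection'}

def classify(obj_bytes):
    """Read the v-table pointer at offset 0 and look up the class."""
    vptr = struct.unpack_from('<I', obj_bytes, 0)[0]
    return vtable_to_class.get(vptr, 'unknown')

# A fake object: v-table pointer followed by one data member.
obj = struct.pack('<II', 0x401080, 0xDEADBEEF)
# classify(obj) identifies the instance as an 'SSLConnection'
```

Because the v-table pointer sits at a fixed, known offset, this single lookup is all it takes to attach reconstructed type information to a live instance.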
So I think this is definitely doable. A few ideas on how to do it: you can glean the inheritance relationships between classes by looking at the constructors. If I have a constructor that passes my this-pointer on to another constructor, I can usually — if v-tables are present — infer that my current class is derived from the other class. Recovering the virtual methods is easy once you have the v-tables. Recovering the regular methods would imply doing a bunch of data flow analysis, so that might take time and might be a bit more difficult.

The amusing thing is that once you've done this, you can very cheaply build a runtime object editor. You scan through memory to find v-table pointers, and every time you find one, you know there's a class instance here. Then, because you've reconstructed the types, you can provide the user with an editor to edit the contents of the individual class instances in memory at runtime.

This is definitely doable. I think the hardest part is going to be working around silly obstructionist patents which have been put there by companies backed by venture capital. Sorry.

All right, challenge number three: decomposing executables into modules. Given an executable not written in an object-oriented manner, decompose it into modules — find all functions performing a similar task. The modules should reflect loose coupling and strong cohesion, you should try to minimize the number of exported functions, and you should separate the data structures within these modules into internal and external data structures. Now, I really have no idea at all how to do this. I have a bunch of rough ideas that might be starting points, but I don't know how to proceed from there.

Approach number one would be to calculate dominator trees on the call graph. Once the dominator tree is constructed, the nodes with a very high out-degree will reflect gateway nodes — nodes that dominate many other nodes, meaning nodes through which you have to pass in order to reach the other functions. These would be candidates for gateway nodes with which to separate a call graph into individual modules.

Approach number two might be to consider this an optimization problem: try to decompose the call graph into strongly connected components by removing a minimal number of nodes. This would be — not equivalent, but similar — to trying to find the gateway nodes.

Approach number three would be to group the functions not by the call hierarchy but by the data structures they operate on: if you know that this set of functions all operate on the same set of data structures, group them together. A rough idea for this: for every data structure in the executable, look at the set of functions operating on it, then try to choose function groups that minimize the number of groups while maximizing the number of data-structure sets each group covers. That might work.
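Approach number one is easy to prototype. Here is a minimal sketch — the call graph and function names are invented, and counting how many functions a node dominates stands in for the high out-degree in the dominator tree — that computes dominator sets iteratively and flags non-entry functions dominating several others as gateway candidates:

```python
def dominators(succ, entry):
    """Iterative dominator computation over a call graph given as
    {function: [callees]}; every function must appear as a key."""
    nodes = list(succ)
    preds = {n: [] for n in nodes}
    for n, callees in succ.items():
        for c in callees:
            preds[c].append(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for n in nodes:
            if n == entry:
                continue
            ps = [dom[p] for p in preds[n]]
            new = (set.intersection(*ps) if ps else set()) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def gateway_candidates(succ, entry, threshold=2):
    """Functions that dominate at least `threshold` other functions."""
    dom = dominators(succ, entry)
    count = {n: 0 for n in succ}
    for n, ds in dom.items():
        for d in ds:
            if d != n:
                count[d] += 1
    return [n for n, c in count.items() if n != entry and c >= threshold]

# Invented toy call graph: 'api1' gates all the f* functions.
cg = {'entry': ['api1', 'api2'], 'api1': ['f1', 'f2'],
      'f1': ['f3'], 'f2': [], 'f3': [], 'api2': ['g1'], 'g1': []}
```

On this toy graph, `api1` dominates `f1`, `f2`, and `f3`, so it comes out as the lone gateway candidate — exactly the kind of node you'd cut at to peel off a module.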
I don't know — you will probably need some sort of fuzziness and configurability for this, because it's not a problem with a very discrete solution space where you can say: okay, this is the right solution and that is the wrong one. We're talking about decomposing things into something that appears logical to a human observer.

All right, challenge number four: recovery of template-generated code from the executable. Templates — we've all seen them. They tend to generate a lot of duplicate code, because essentially every time you instantiate a template with a new type, all the functions for that template are duplicated in the executable. So you get fifty variants of the same function which differ only in offsets and call targets, which means you have a massive explosion of code size. We as reverse engineers should not be bothered with having to notice "oh, I've read this before" — we can programmatically tell that this function really is the same as that other function with just a few changed offsets, right? This should be very easy in most cases. The only case where I can imagine complications is when the compiler starts inlining some functions of the template and doesn't inline others — I don't know, this depends on the madness of the compiler. I would assume that in most cases it's going to be very easy, perhaps with BinDiff-style structural comparison and then a little bit of semantic checking afterwards.

Okay, challenge number five. This is my favorite one, and it's also probably one of the hardest problems I can imagine — in fact, I have a very hard time imagining a problem that is more difficult than this. The problem: we are given a location t in the executable, which is our target location, and a location s, which is our source location. We know how to get to s — say we've traced the application, we know we're here — and t is where our bug is. Now we would like to automatically construct input that reaches t.

That is difficult. And to make things a little more interesting, we don't only want one input that reaches t — we want a description of the set of all possible inputs that reach t. That is going to be a bit tricky: in the general case this is 3-SAT and thus NP-hard, and we can basically forget about it. But as I said, you never actually run into the general case, so we shall see.

The first obvious idea, which has been explored by Sherri Sparks and her co-researchers, is to use genetic algorithms, because you have a very easily measurable fitness function: you can measure the distance to your target node. For any input you can say: okay, this one came this close to my target node, that one came that close — so in that sense it's really ideal for genetic algorithms. I saw their talk at Black Hat. It was a good talk.
I personally am not totally sure whether genetic algorithms will be the way to go for all of it, mainly because you only get one solution that reaches a certain place, and you don't know which fields you can change without invalidating your sample. Consider that you want to overflow something: you have reached the location, but you still need to actually trigger the bug, so you will need to modify your sample afterwards. I have my doubts.

Then again, my own idea is completely crazy, so I don't think anybody is crazy enough to pick up on it. The idea would be to model a path through the executable as an equation system mixing modular arithmetic — modulo 2^32 — with the Boolean functions XOR, AND, OR, and NOT, plus ROL and ROR, which is just bitwise rotation. We would basically transform a path through the executable into a satisfiability problem, or alternatively into a set of possibly very large polynomials over GF(2), the field of 0 and 1, where the unknown bits are the input. Then we try to solve either the equation system or the satisfiability problem. If you find a general way to solve these equation systems, you've got MD5 and SHA-1 solved, so there won't be a really general way of solving arbitrary equation systems like this — again, the equivalence to 3-SAT.

For the satisfiability approach, we can use publicly available SAT solvers. These things have gotten relatively good: they can solve satisfiability problems with a few million variables by now.
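As a toy illustration of the encoding idea — the routine and its branch conditions are invented, and a real system would hand a bit-level encoding of these constraints to a SAT solver rather than brute-forcing — here the conditions along one path are collected and the full set of inputs reaching the target is enumerated:

```python
# Branch conditions along one (invented) path through a routine that
# mixes XOR and modular addition on a single input byte x.
def path_conditions(x):
    return [
        ((x ^ 0x5A) + 3) & 0xFF == 0x60,  # first branch must be taken
        x & 1 == 1,                       # second branch must be taken
    ]

def inputs_reaching_target():
    """All inputs that satisfy every condition on the path. Brute force
    is fine for one byte; at real input sizes this is the SAT problem."""
    return [x for x in range(256) if all(path_conditions(x))]
```

Note that this yields the *set* of all satisfying inputs, not just one witness — which is exactly the stronger form of the problem stated above.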
So perhaps they might be able to solve our path problem. What I've been experimenting with most was using OBDD-style algorithms. OBDDs are special data structures with which you can represent functions that map a bit string to 0 or 1, and they can thus be used to represent very large sets. What I have solved so far: equation systems that don't involve multiplication or rotation. If you only have addition and Boolean functions, I can solve the equation systems; I can't deal with multiplication or rotation yet.

There are going to be complications with all of this, and there's another very nasty edge to the problem: it's not enough to craft input for one path, because most paths depend not only on user input but on program state. In order to exercise a certain path, you first have to authenticate, or you first have to send a different packet. That means you basically have to determine: okay, what state is the program in here, and how can I get the program into that state? So if they test for "are we in state number two", I have to find out where they set the state to two, and then try to exercise the code path that goes to state two, and so forth. This can very easily get massively out of hand. Enough of number five — number five is really a bitch in every way; it's just hard.

Challenge number six. This is going to be a little easier: trying to automate the analysis of translation- and emulation-based obfuscators. Code obfuscation is getting a bit boring. '96 was the first time that Solar Designer published a crackme which was based on a virtual machine.
We basically created a CPU and paper wrote an emulator wrote the crack me in the the virtual instruction set executed it Very very very difficult at that time to take apart and these things have gotten boringly common by now and there's no no generic methods to to analyze these things and Let's not kid people into thinking that these things are actually hard to take apart. So a Good research topic would be try to build a framework which allows the the rapid analysis of well VM VM based obfuscations And usually all these these VM based obfuscations operate in the same manner You have like a context structure which describes the inner state of the the virtual virtual machine And you have a decoding loop that fetches the next instruction and that then branches off to the individual instruction handlers So my rough idea of this would be try to exploit the existing structure of these things For example, you could very very easily build a debugger for an arbitrary virtual machine You just have to identify the decoding loop if you identify the decoding loop set a breakpoint inside the Decoding loop and dump the context structure on every step. Well, we have a single stepping debugger for the virtual machine Very very cheaply done and should be able like should be possible to do this in python very quickly and Then comes the difficult part try to recover the semantics of the individual instructions because most of the people writing these Virtual machines are not very imaginative meaning they'll just have an addition instruction an XOR instruction And the instructions will mirror like a real CPU that we know very closely. So the question is can we somehow? 
Automatically either statically or dynamically recover the semantics of the individual instruction handlers so This could be done either by trying to statically analyze them or by just seeing okay now I Just exit like I set the context to the state I execute the handler I see the output I set the context to a different state execute the handler see the output and Well, if you see that always the like register C is the result of register a plus register B Obviously this function is going to be an addition I don't know how difficult all of this is going to be I think that the first part is going to be very very easy Think the second part is going to be arbitrarily hard depending on how clever the guy writing the virtual machine is But I think in practice for most existing virtual machines. It should not be hard Challenge number seven an algorithm to transform code into a canonical normal form Now this ties in a little bit with the lecture we had before I started talking Which is basically almost all the the obfuscators on many many obfuscators essentially Translate instructions by doing table lookups So I have one instruction and you have multiple tables which have equivalent entries equivalent by different entries to represent a certain instruction what they'll then do is they'll just do a look up into the table and replace one instruction with another one and iterate this and Well here in an example We can obfuscate add EX 20 in these two different ways and you can come up with an arbitrary other number of ways to to represent this and You just iterate this over a bunch of garbage code for over a bunch of code and your your code Like your signal to row noise ratio in your code can get arbitrarily small like you can have one instruction in the end being represented by 200 instructions of which you can't remove a single one because the output would not be the same so the question is Under the assumption that the obfuscation introduces no new memory access and no 
self-modifying code Can we create a reduction relation that is confluent meaning? That holds has the following following property if you have a code sequence sequence s and s s prime derived by replacing one instructions with a sequence of equivalent instructions as I described Without introducing memory access or self-modifying code. We want a Function that reduces s and s prime to the same normal form n This need not necessarily be the same the same instruction that was mutated I just needs to be the same representation no matter how I mutate my original one so I can compare I can later on test Is this the same as this? Yeah, so if you are able to to construct such a thing you should be able to trivially de obfuscate most metamorphic engines So that would be fun Concerning the difficulty of this. I'm not sure whether it's possible, but it looks like a really really fun exercise for anybody Who's a bit mathematically inclined? All right Challenge number eight. We're back to fuzzy visualization clicky stuff basically Call graphs are really nice and they're really useful to visualize and see which function calls which but they have one big drawback They lose the ordering of Subfunction calls which means if I have a function that calls malloc string length mem copy I just see in the graph. Okay. 
Now this node called malloc string length mem copy, but I can't Know in which order they're called But the thing is that the structure of the individual flow graphs in of every function Introduces a partial ordering on the sub function calls and I would like to have a mode of visualizing Program, which is similar to looking at the call graph But which doesn't lose the information of which call comes before which other call that somehow Represents this partial order on the sub function calls visually to me because then I mean it's very different whether a function first calls mem copy Then string length then malloc or a function that calls string length malloc mem copy like just by looking at the two orderings you can infer about what the functions actually doing and As such it should be a lot better if we can actually see that order All right challenge number nine. This is a fun one Given a network demon that Has some sort of internal state like an isa kmp parser or whatever and they have some state variable internally That they update depending on well on what's happening on the network Can we construct? 
Basically a state diagram from the executable first of all we'd need to to separate the functions into functions that can be reached in a certain state So you'd be separating it like these functions are reached in state one these functions are reached in state two These functions are reached in whichever state and then you try to reconstruct Okay, now they're writing the value four to the state variable in a function reachable from state two so obviously your state diagram should now have a transition from state two to state four and The question is whether you can do this generically enough So you can reconstruct the a diagram of possible straight transitions just from the executable This will be tremendously useful for the analysis of anything that does any sort of network protocol parsing All right challenge number ten Semantics based library signatures Semantics based library signatures means Well, if somebody hand rolls his string copy, I would like to see that I would like my my signature algorithm to somehow recognize somebody has hand rolled a string copy and I want this to work No matter how not necessarily no matter how but I want this to work without Having to have requiring him to have a very specific implementation of string copy Like I want this to work even if you call string length and mem copy or also if he hand rolls both loops I just want a semantic signature for these things I want to semantically identify or I want to identify this function is inserting into a linked list This function is removing from a linked list and so forth so I Think this is the difficulty of this can vary very very much depending on what the function is doing But I think for relatively simple library functions. 
It should not be that difficult. A possible idea for how to do this (and I'm not sure whether this will work) would be: you could represent memory as an abstract graph, try to represent a function as an operation on that abstract graph, and then see how the function manipulates this memory graph. That should yield you information; you should be able to immediately see a linked-list insert, for example. I don't know whether this will work. Good luck.

Challenge number 11 is not all that serious, in fact; that's why it's beyond number ten. If we have the things that we just discussed here, we should be able to string them together to build something that statically analyzes executables, finds bugs, crafts input to trigger these bugs, and then we would still have to take on the problem of automated exploitation. I'm not totally serious on this one, so let's just skip it here.

So, to reiterate my difficulty estimates: I think challenge number zero is trivial. Challenge number one should be doable; not totally trivial, but doable "to anybody skilled in the craft" (a very important phrase in patent law). Challenge number two: doable; not totally trivial, but doable to anybody skilled in the craft. Challenge number three was decomposition into modules; I have no clue how difficult that is. Number four was template recognition; that should be relatively easy. Number five, input crafting: as difficult as any problem you can think of, I think. Number six, VM disassembly or VM analysis, should be easy if we're given a VM with a structure as I described. Challenge number seven; what was challenge number seven? Oh yes, confluent normal form. Very, very cool; probably quite difficult. Then challenge number eight.
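One way to make the abstract-memory-graph idea concrete: model the heap as a node-and-pointer graph before and after the candidate function runs, and check whether the difference matches a known shape operation. A toy check for "spliced exactly one new node into a singly linked list" (node names are arbitrary labels; insertion at the list head is deliberately not handled):

```python
def is_list_insert(before, after):
    """before/after: dicts node -> next-node (None terminates), modelling
    the abstract memory graph around the candidate function.
    True iff `after` is `before` with one new node spliced into the chain."""
    new = set(after) - set(before)
    if len(new) != 1:
        return False
    n = new.pop()
    # Splicing n between p and its old successor means: before[p] was q,
    # after[p] is n, after[n] is q, and everything else is unchanged.
    for p in before:
        if after.get(p) == n and before[p] == after.get(n):
            rest_before = {k: v for k, v in before.items() if k != p}
            rest_after = {k: v for k, v in after.items() if k not in (p, n)}
            return rest_before == rest_after
    return False

before = {"a": "b", "b": None}
after = {"a": "x", "x": "b", "b": None}   # a hand-rolled insert of "x"
print(is_list_insert(before, after))       # -> True
```

Whether something like this scales past toy shapes, and how you extract the before/after graphs from real code, is exactly the open part of the challenge.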
I have no clue how difficult that is. Challenge number nine, reconstruction of the state diagrams: should be possible, but it depends of course on the structure of the executable. Challenge number ten, semantics-based library signatures: looks very, very difficult at first, but getting something that recognizes at least some things should be relatively easy.

All right, some other stuff. We as the reverse engineering community suffer from the fact that we have tools that are all closed and all fragmented and don't talk to each other. So what we've been doing at Sabre recently was, we sat down and we created an SQL schema to have an architecture-independent representation of disassemblies in a flat address space. What we're doing currently is remodeling all our reverse engineering tools to work on this database, so we can have interaction between different tools that all just operate on the same disassembly, operate on the same flow graph structure, the same call graph structure, and you can dump data into that database, take it out again, and so forth. We had originally planned to make that thing public at Black Hat, but we've fallen behind a bit with the documentation: the SQL schema itself is ready and in testing and seems to be quite good, it's just not very well documented yet, and some of the SQL queries are non-obvious and took us quite some time to build properly. So we will be releasing this in the next two to three months with full documentation, and we encourage everybody to dump stuff into, and read stuff from, the database format. Then we'll hopefully have an open database format with which to exchange disassemblies, with which to process disassemblies, and on which to work. So please wait for two months while we get the documentation sorted; we're not very good at these things.

What else? Anybody have any questions?
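To give a flavour of what such an architecture-independent store can look like (this is emphatically a toy illustration, not the actual, still-unreleased Sabre schema), a handful of tables over a flat address space is already enough to round-trip a disassembly:

```python
import sqlite3

# Toy schema: every table keys off flat addresses, nothing is
# architecture-specific, and operands are stored as serialized trees.
ddl = """
CREATE TABLE functions    (address INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE basic_blocks (id INTEGER PRIMARY KEY, func_address INTEGER,
                           address INTEGER);
CREATE TABLE instructions (address INTEGER PRIMARY KEY, block_id INTEGER,
                           mnemonic TEXT, operand_tree TEXT);
CREATE TABLE flow_edges   (src INTEGER, dst INTEGER, kind TEXT);
"""
db = sqlite3.connect(":memory:")
db.executescript(ddl)
db.execute("INSERT INTO functions VALUES (?, ?)", (0x401000, "recv_loop"))
db.execute("INSERT INTO basic_blocks VALUES (?, ?, ?)", (1, 0x401000, 0x401000))
db.execute("INSERT INTO instructions VALUES (?, ?, ?, ?)",
           (0x401000, 1, "call", "(operand (register ebx))"))
row = db.execute("SELECT mnemonic FROM instructions "
                 "WHERE block_id = 1").fetchone()
print(row[0])  # -> call
```

The point is less the exact tables than the property that any tool that speaks SQL can read and annotate the same disassembly.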
Is this on? Okay. For problem number seven: that's actually something I've been looking at before, and I'm wondering if you've ever looked at using trace caches, or looked into the research I believe was done by Motorola on real-time machine code translation: basically being able to analyze machine code in real time, translate it to an intermediate language, and then spit out machine code that did the equivalent thing, in real time. They did this like ten years ago.

Yeah, but there's lots of research on translation; they just usually don't have the requirement that after I translate something and get an output, I want to reduce this to a normal form that is the same for any translation from the same source. What's important here is that you really converge on something that is identical for everything that was generated from the same source.

Two questions for you. The first one is related to the first part we were talking about: we're going to move into a world where all the variables are managed, or at least are, you know, bounds-checked, etc.
So then, you know, the big issue will be application logic. So the question is: do you think that the current managed languages that we have today, namely .NET and Java, can fit that? And the second one is: what about doing these challenges and starting them in the .NET and Java world, so we can perfect techniques and then move into the unmanaged world?

I personally don't think that we'll all move to managed code. I think that static analyzers on unmanaged code will get strong enough to rule out out-of-bounds memory accesses, just because the world is moving to embedded everything, and the embedded devices are all written in C, but they're statically verified whenever they're really critical. And there's no use in having a managed language if I can verify C code, because if I can do the verification up front, then I can run at full CPU speed and don't have to worry about out-of-bounds memory access, because I've proven beforehand that there is none.

Sorry, but if you take the .NET approach, the compiled code is running at full speed; the managed environment is just a way to get there with better rules. So you still have C++-style assembly code being executed; it's just that we had some rules to get there, which is the whole managed world.

Yeah, well, then we can of course start on the managed level. The managed level will make a number of these problems (not all of them) trivial, because you can trivially extract classes and hierarchies and so forth from the managed code. So that's not really a challenge then. Well, it's still a challenge to do them, but at least you have a good base to start from.

Yeah, but you might as well start with the real challenge. Any other questions?

Automated exploitation... Can you stand up and speak into the microphone? I guess I was just curious about the bug recognition.
So you find... it was kind of related to Sherri's talk; I went to it at Black Hat, and she was talking about the starting point and the endpoint. But really you have multiple endpoints, and an endpoint is kind of defined as, say, one variable overwriting another without an out-of-bounds memory check. Trying to define those as bugs: what other classes of bugs might be found there? So the question is, what are the endpoints, or...

Okay, sorry; if you can rephrase it, I might be able to answer it. Let's try it after the talk. All right, any other questions? No questions. I still have quite a bit of free time, so should I just discuss a few of the things (not quite research) that we've been building at Sabre recently?

All right. So, one of the things that we've been doing, of course, is building... wait, no, I need the microphone holder, because I'll be monkeying around on the keyboard a lot more. Can we turn on the light again, please? Thanks. All right. One of the things we've been doing: we've been optimizing the diffing engine quite heavily, and we've rewritten the BinDiff engine to be very, very fast. And by very fast
I mean we can diff a 25,000-function World of Warcraft image against another one, on this laptop, in less than ten minutes. Which also allows us to do a very large number of diffs; it's not a big deal to do a hundred thousand diffs a day or so. So what we've started doing is we've built an infrastructure for automatic classification of new malware into family trees. I'll show you a few results. It works basically by you sending an executable in; that gets unpacked with a statistical unpacker, then disassembled and compared to everything in the library, and then we create family diagrams based on similarity matrices, using a little bit of bioinformatics stuff. One second. So, we've classified about a thousand samples here, and we can see that they basically belong to mostly one family, and then a bunch of smaller ones. So I'll try to zoom in. Well, here we have the entire Gobot family, which are just different variants: they all have a different SHA-1 sum, they have a different MD5 sum, and we have the output of the other antivirus scanners as to what they classify them as. What I'm looking for: there are a few samples in here that were not recognized by one or more of the AV scanners, but which are like 99% similar to things that we already have in the database. Yes, it's very amusing to see the different naming conventions for things, because this is all Gobot, and some of them are called Exploit-something, MyDoom, Win32.Delf, whatever. Where's something that's not... this one here, for example, is a MyDoom variant that, like, Trend Micro doesn't recognize, and so forth. There are quite a few of these things; very amusing, very cute, and it works quite well. It's one of the primary arguments for buying a big cluster, because when you diff n samples against n samples, you still have to populate a matrix of size n squared, so you probably want to parallelize that.
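The classification pipeline can be caricatured in a few lines: a percentage-similarity measure over matched functions (here cheated down to set intersection over per-function hashes; the real matching is structural, like BinDiff's), plus naive UPGMA clustering over the resulting matrix. All sample names and hashes below are invented:

```python
def similarity(funcs_a, funcs_b):
    """Fraction of functions matched between two samples; a crude
    stand-in for structural flow-graph matching."""
    return len(funcs_a & funcs_b) / max(len(funcs_a), len(funcs_b))

def upgma(samples, sim):
    """Naive UPGMA: repeatedly merge the two most similar clusters,
    averaging their similarity to everything else. Returns a nested
    name string showing the merge order."""
    clusters = set(samples)
    s = {(a, b): sim(samples[a], samples[b])
         for a in samples for b in samples if a < b}
    while len(clusters) > 1:
        a, b = max(s, key=s.get)          # most similar pair
        s.pop((a, b))
        clusters.discard(a); clusters.discard(b)
        name = "(%s+%s)" % (a, b)
        for c in list(clusters):          # average similarity to the rest
            s[tuple(sorted((name, c)))] = (
                s.pop(tuple(sorted((a, c)))) +
                s.pop(tuple(sorted((b, c))))) / 2
        clusters.add(name)
    return clusters.pop()

samples = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8, 9}}
print(upgma(samples, similarity))  # -> ((A+B)+C)
```

Populating that n-by-n matrix is the O(n^2) part, which is exactly what you parallelize on the cluster.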
As for other things we've been doing: we've been working hard on BinNavi, which is basically a graph visualization framework. One second, too many windows open. Okay, I'll take a minute to close a few windows in the meantime. This is an older version that is doing a lot of XML parsing, which we've since dropped; now that we've moved to the database, we don't have to wait for two minutes to open a project anymore. So apologies for this; now we'll just have to wait. Any questions in the meantime regarding the malware classification? No? Okay. Come on, piece of crap. Ah, here we go.

So basically what we have on the left side here is a list of all the functions in the executable, and we can look at them, and they're nice and shiny, and we can move things, add comments, all sorts of things. We have Mozilla-style searching. We can look at sub-functions by... oh well, let's take another one. No worries; you can disable those animations once they get on your nerves. All right, so you can open sub-functions, which is not all that interesting, but what is interesting is that you can inline sub-functions into the existing function, and that's fun. What you can also do is... well, this is coupled with a debugger; I'll quickly start my VMware to demonstrate. So what we can do here now is: well, let's open the call tree of all functions in the executable. That is going to take another few seconds, because it's about 8,000 functions; it's a BlackBerry messaging thing, which is very much fun. All right, so we can search for, for example, the receive function. And this is the receive function; these are all the functions that directly call receive; these are all the functions that call receive from two layers upwards; these are all the functions that receive calls one layer downwards. What we can do now is we can set breakpoints on all functions at once.
So what we do is we just hit record, and in the background it'll start talking to the VM and set 8,000 breakpoints. So while it's setting breakpoints... okay, it has set all the breakpoints. We'll now send some data, and we hit stop again. And what we can do now is we can see all the debug events that happened, and get the graph of all the functions that were just executed. We can lay them out hierarchically; they're not that bad, really; they make great wallpapers if you have a plotter. All right, so these are the functions. Other stuff that we can do: let me quickly try to find the relevant function. Here it is. So if we look at this function here, this is the main SRP message parser. Yes, here we are, and we can now of course collect the same debug events on this graph and see the path that executed. All right, there was a mistake here; hey, demo effect, I've been waiting for you. So we can now see the path that was just executed through this function, and we can zoom in, and we can inline a sub-function, one second, and then we have the whole function. We can set the breakpoints again, and then we can see the trace of the debugger through multiple functions.

So there's a whole bunch of fun stuff you can do with this. It can also calculate new paths through the executable, when you have two basic blocks and you want to get from A to B, and so forth and so forth. Now, we've been moving this away from operating on XML files to operating on the generic SQL format, and for that we've also ported a small Python interpreter into it. I'll quickly give you a glimpse of the development version.
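Conceptually, the record-then-filter step of that demo is tiny: keep only the blocks, and edges between them, whose breakpoints actually fired. A sketch with invented addresses (the real tool of course works on whole call graphs and live debug events):

```python
def executed_subgraph(cfg, hits):
    """cfg: block address -> successor addresses; hits: addresses whose
    breakpoints fired. Returns the flow graph restricted to what ran."""
    hit = set(hits)
    return {b: [s for s in succs if s in hit]
            for b, succs in cfg.items() if b in hit}

# Hypothetical message parser: entry 0x1000 branches to an error block
# 0x1010 or the parse block 0x1020, which loops or exits at 0x1030.
cfg = {0x1000: [0x1010, 0x1020], 0x1010: [0x1030],
       0x1020: [0x1020, 0x1030], 0x1030: []}
hits = [0x1000, 0x1020, 0x1030]   # what the debugger actually reported
print(executed_subgraph(cfg, hits))
```

Everything else in the demo (setting the breakpoints, collecting the events) is plumbing around this filter.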
So we connect to the database. We have multiple executables in the database, which is quite advantageous if you have a team of reverse engineers: you can all work on the same database at once, and you don't have to hunt for your disassemblies anymore. All right. Now, well, this is all something that you've seen previously in the last variant of it. The nice part is having the Python interpreter, so we can do stuff like manipulating disassemblies in a very, very easy way: like, x = basic_block(0x402080), print x. And we can have the thing show its operands: print x.instructions[-2].operands, print x.instructions[-2].operands[1]. Yes, thanks; x.instructions, operands, minus one, minus two, whatever; show the operands. The operands are actually stored in the database as trees, which is the most generic way of representing them, and then you can manipulate them easily, and you can do fun stuff like creating a new graph, adding a node to that graph which consists of the HTML code generated from a basic block, and then you can show the graph and continue manipulating it: adding edges, adding comments, whatever you want. So there's a whole bunch of fun in this one. In general, having a database format that is generic is really, really empowering, and I'm very much looking forward to everybody trying to use it and yelling at us for all the things that we've done wrong. Any other questions currently? Cool.

Is this small microphone... okay. I had a question on the first product that you were demonstrating, where you had all the... it was Agobots? It looked like you had parents, grandparents, children and so forth. How do you determine... what are the precedents and descendants of a particular... (could you speak by passing the microphone?) Sorry: how do you determine the precedents, you know, the parents and the children of particular flavors?
We have basically a percentage similarity metric. We do structural comparison, like the stuff that BinDiff does, and then we have basically a mapping between the functions of the two executables, and we just measure how many functions in the executables we were able to match, as a percentage of the total number of functions. So we have a percentage similarity measure, and then we use UPGMA, or whatever it's called, as the clustering algorithm to generate these parent-child hierarchies.

It's similar to a genetic analysis, where you measure mutations over time? Yeah, it's basically the simplest algorithm we could glean from a bioinformatics book without actually having to study it. Thank you.

All right, any other questions? Well, then I hope this was moderately entertaining, and have a good DEF CON.