All right, so let's get started. It's my pleasure to present Joan Calvet. Go for it, and please welcome him.

So good morning everyone. My name is Joan Calvet, and this talk is about how to build a multi-architecture disassembler. This is teamwork with my colleagues Nicolas and Cedric. First things first, who are we? We are software developers working for a small company called PNF Software, and our main activity is to develop JEB, which is a reverse engineering tool. To give you an idea: in 2012 we released version one, the first major version of JEB. At the time it was only a decompiler for Android applications, so it translated Android applications back to Java code. It comes with an interactive UI and a scripting engine, such that the user can analyze the application. Then a few years later we released version two, with the ability to decompile Windows and Linux executables back to C code, for a bunch of architectures including x86, ARM, MIPS and their 64-bit variants. And then quite recently, in 2018, we released version three, which is also able to decompile non-native platforms, for example Ethereum contracts or WebAssembly modules. So we really see JEB as a tool that should allow the user to analyze many different types of files, and in this presentation I want to focus on one specific part of JEB, namely its native disassembler.
So the word disassembler can mean a few different things. What I'm talking about here is a disassembler that takes as input an executable file produced by a compiler. Here is an x86 executable file, represented just by an extract of its raw dump. The disassembler takes this executable file as input, and it will tell you that these red bytes here are executable code, and that this code constitutes a routine — a routine being the equivalent of a function in high-level languages. The routine will be represented as a control flow graph, that is, a graph whose nodes are called basic blocks; each basic block contains a series of machine instructions translated to assembly language — hence the name disassembler — and there are edges between the nodes to represent the possible control flow. So the disassembler tells you that these red bytes correspond to this routine, while these green bytes here correspond to this other, one-node routine. That's the kind of disassembler we are talking about here: it produces a global disassembled view of a program, with all the routines present in the executable, and the purpose of this view is to represent the possible ways the program could execute at runtime. Note also that it is an assembly-based view. The translation of individual machine instructions to assembly language is just one of the features; there are many more things a disassembler needs in order to produce this global view. Speaking of that, why do we need disassemblers? First, they are useful because they provide the foundations for automated advanced analysis, like decompilation to a high-level language such as C or Java. Disassemblers can answer questions like: where is the code, and where is the data in an executable? In which order do instructions get executed — that's the control flow. How is the data manipulated — that's the data flow.
They also build abstractions useful for advanced analysis: they group instructions within basic blocks, basic blocks within control flow graphs, control flow graphs within routines, and they can also tell you that a certain series of bytes is a string, while another series of bytes is a variable. All this is needed for automated advanced analysis. But disassemblers are also useful for manual analysis, because their output can be directly understood by humans — and in particular, when the automated advanced analysis fails, disassemblers are usually seen as providing the ground truth, because they remain close to the machine. You might know disassemblers such as IDA Pro, Ghidra, Binary Ninja or radare2: all these tools have their own disassembler engine inside, and in JEB it's the same thing, we have our own disassembler engine. That's the foundation for the decompilation we do. To give you an idea, when we open a binary in JEB we get a routine list — that's the red box, where you see all the routines — and for each routine there is its control flow graph, in the orange box. That's the output of the disassembler: the control flow graph. The control flow graph is then given to the decompiler pipeline, which produces the C code that you can see at the bottom, in blue. But in this presentation we focus on the disassembler. Most of the logic of a disassembler is architecture-independent, that is, it is exactly the same code for x86 or for ARM, for example, except for the instruction parsing and a few tricks I will describe later. And as I said before, it can also parse non-native code. So in this presentation my intent is really to first talk about the problems — what makes disassembly hard on several architectures — and then describe the way we deal with those problems in JEB. Hopefully you don't need to be a reverse engineer to understand what I'm going to show you, and hopefully we will show you what it looks like to develop a disassembler, and why disassembling actually remains a quite complex problem.
And as a small disclaimer, this is intended to be a research talk, not a sales talk. I will show the current work in progress in JEB, and it's not intended to be the final, best solution to disassembly. So first things first, I'm going to introduce a toy example that we will try to disassemble step by step, in order to build some kind of intuition about what disassembly is and what makes it hard. We will start with this simple C code; it's called secret.c. There is a main routine: it checks whether one argument was provided, and if that's the case it calls another routine, called secret_algo, with the argument passed as input, as a string. secret_algo is here: it computes the XOR between its argument — the string argument translated to an integer thanks to the atoi library routine — and a constant named SECRET_KEY that is defined at the top. So overall, the logic of the program is just to return the XOR of its argument with a constant if one argument was provided, and zero otherwise. Now if we take this C code and compile it with the Microsoft Visual Studio x86 compiler, without any optimization, we end up with a Windows executable. We can execute it on a Windows machine, and as there were no optimizations, this executable will be a literal translation of the C code; in particular, its structure will be exactly the same as in the C code. So now, if we were to give this binary to a disassembler, we expect the output to be something like this: there should be at least two routines, one for main — let's say routine one is main. As in main there is a test on the number of arguments, there should be two possible execution paths, so the graph of routine one should have two paths, and on one of these paths there is a call to the second routine, which is secret_algo. Routine two will have a very simple graph, with a call to a third routine, atoi. We don't know where atoi is going to be — it's a library routine, it might be in an external file or in the same executable — and
then there will be the XOR with our magic constant. So that's a sketch of what we expect the output to look like, if the input is an unoptimized executable coming from this C code. Now the question is: how do we get there? How do we transform the Windows executable into this global disassembled view, with these two routines? How do we build the box in the middle? First, there are a few things we need to clarify — things we actually need in order to disassemble. The first one is that executables usually come within executable file formats: on Windows it's a PE file, on Linux it would be ELF, on macOS it's Mach-O. These executable file formats provide necessary information to the disassembler. For example, if we give our Windows executable to a PE parser, the PE parser will decompose the structure of the file and provide as output, first, the memory mapping — that is, it will tell you where the bytes of the file are located in memory; this mapping is usually divided into sections, or segments. It will also provide us the entry point, that is, the address of the very first instruction executed at runtime, and some information about the architecture this file was compiled for: in our case it's x86, a little-endian architecture. All this is going to be the input of the disassembler — not the raw file itself. Then there is one other thing we need: the ability to disassemble individual machine instructions. Let's call these instruction disassemblers. Instruction disassemblers take as input a binary blob, and they produce as output a parsed instruction. This parsed instruction usually contains the mnemonic, which is the assembly representation of the operation; the operands — the registers, memory addresses and so on used by the instruction; and then some other information, like for example: what are the next instructions to execute?
So for example, if we give 55 in hexadecimal to an x86 instruction disassembler, it will tell us that it's a push, that it uses the ebp register as an operand, and that the next instruction to execute is the fall-through, that is, the instruction just following this one in memory. Then, if we give these four bytes to the ARM instruction disassembler, it tells us that it's a subtract-if-not-equal, with two operands — a source register and the PC register — and the next instruction is also a fall-through. Then, if we take the same four bytes and give them to a MIPS instruction disassembler, it tells us it's another instruction entirely: a branch-if-equal-to-zero, with two operands, a register and an offset. And as it is a conditional branch, there are two possible next instructions: the fall-through if the condition is false, or the branch target if the condition is true. So we are going to need instruction disassemblers for all the architectures we want to disassemble, of course. And note that the instruction disassembler does not tell us anything about what the instruction is doing; it just provides a parsed representation that is — kind of — human-readable. So, for the sake of the argument, in this toy example we are going to assume that we have a PE parser, and that we have an x86 instruction disassembler. And now we come back to the question: how do we get from the Windows binary to the disassembled view?
So a first, intuitive strategy you might think of is that we could start from the entry point — because it is provided to us by the PE parser — and just try to follow the control flow, discovering the routines and building their graphs as we go. So let's try this. Here, on the left, is represented the input memory mapping: that's secret.exe mapped in memory. The first column is the address in memory, then there are the bytes located at this address. We have a pointer to the next instruction to disassemble, represented by the arrow, and it's initially set to the entry point of the program. So we start from there, and we give the first few bytes to the x86 instruction disassembler. It tells us that it's a push ebp, so we add this new instruction into a new block, in a new graph. The instruction disassembler also tells us that the next instruction to execute should be the fall-through, the one following this one, so we just increment the pointer and we go on. We do it again: another instruction, we add it to the current block, and the fall-through is the next instruction. We do it again and again, and we end up disassembling a conditional branch — jnz stands for jump-if-not-zero, so that's a branch. So we end the current block. As it's a conditional branch, there will be two possible execution paths at this point, so we have a choice to make: we have two possible addresses to analyze next, the fall-through if the condition is false, or the branch target. Here I decided to continue analyzing the fall-through, so we store the branch target address for later analysis, in a bucket at the bottom right — that's an address within this routine, which we store in order to come back to it later. Fast forward: we analyze the fall-through, which is a simple instruction, and at some point we end up disassembling an x86 call instruction. This instruction is usually used to implement routine calls, so we have the same situation here.
There are two possible addresses to analyze: there is the call target, which is another routine, and there is the fall-through, where we will eventually come back after the call. So we do the same thing: we continue analyzing the fall-through, and we store the call target for later analysis — but we store it in a different bucket, because it's not the same type of address; it's another routine, not an address within this routine. We continue disassembling, and then we end up disassembling a ret instruction. ret stands in x86 for return — return to the caller routine — so that's the end of the current block. We still have another address to analyze within this routine, so we go on: we continue disassembling at the next address, the one previously stored at the bottom right, and finally we have no more addresses to analyze — we have a complete CFG. The current control flow graph is finished, and we can go on analyzing the next routine, the one previously stored in the routine bucket. Fast forward, we do it all again, and we end up with these two routines: routine one, which is actually main, has two possible execution paths, and on one of them it calls routine two; routine two has a one-node graph, with the call and then the XOR. I've left a few details out, but you get the idea: what we have here is exactly what we expected. We produced the expected output with a simple recursive algorithm, just by following the code, and it seems that all the magic was in the instruction disassembler, which provided us, in particular, the control flow information for each instruction. So it seems that if you want to build a disassembler, all you need to have is an instruction disassembler for the particular architectures you are targeting. Actually, during this step-by-step disassembly we made a series of questionable assumptions, and I will now describe some of them. The first assumption we made — and you probably noticed it — is about the call instruction: we assumed that the call always returns to the caller. When we were analyzing routine one, we continued analyzing the fall-through, as if the call would eventually return to the caller. But in reality there are non-returning calls, and there is no need to go far to find an example of that. Just looking at the Visual Studio C runtime code, which is statically linked in our Windows executable, there are calls to APIs terminating the application, such as ExitProcess. Such a call never returns to the caller, and the compiler knows it, so what it does is put invalid code just after the call, in the fall-through: here you can see an int 3, a software interrupt, that should never get executed — and the compiler knows it. So for us, as disassembler writers, how do we know that ExitProcess never returns? It's actually in the prototype of the routine: if you look at the full prototype in the header file, there is an attribute before the classic prototype telling us that this API never returns. And note that returning void and being non-returning are two different things. Another example is infinitely looping routines, and once again there's an example coming from a classic compiler's code, GCC this time: this routine has no way to come back to the caller, it's just looping infinitely, so it's non-returning as well. Now, if we think a bit about this: we need to identify these non-returning calls, otherwise the CFG will be incorrect.
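Before going into those corner cases, the basic recursive traversal we just walked through can be made concrete with a minimal Python sketch. Everything here is a hypothetical simplification, not JEB's code: the instruction set is made up (one-byte opcodes, with branch and call opcodes carrying a one-byte target), and `decode` plays the role of the instruction disassembler.

```python
OP_SIMPLE, OP_JCC, OP_CALL, OP_RET = 0x01, 0x02, 0x03, 0x04

def decode(mem, addr):
    """Toy instruction disassembler: returns (size, kind, branch/call target)."""
    op = mem[addr]
    if op in (OP_JCC, OP_CALL):
        return 2, op, mem[addr + 1]               # opcode + one-byte target address
    return 1, op, None

def disassemble(mem, entry):
    routines = {}                                 # routine entry -> its instruction addresses
    routine_queue = [entry]                       # bucket 2: routine entry points
    while routine_queue:
        start = routine_queue.pop()
        if start in routines:
            continue
        seen = set()
        addr_queue = [start]                      # bucket 1: addresses inside this routine
        while addr_queue:
            addr = addr_queue.pop()
            while addr not in seen:
                seen.add(addr)
                size, kind, target = decode(mem, addr)
                if kind == OP_RET:
                    break                         # return: end of this path
                if kind == OP_JCC:
                    addr_queue.append(target)     # branch target: same routine, later
                elif kind == OP_CALL:
                    routine_queue.append(target)  # call target: a new routine
                addr += size                      # keep following the fall-through
        routines[start] = seen
    return routines

# A tiny 10-byte "program": a routine at 0 calling a routine at 8.
mem = bytes([OP_JCC, 5, OP_SIMPLE, OP_CALL, 8, OP_RET, 0, 0, OP_SIMPLE, OP_RET])
```

The key point is the two buckets: branch targets go back into the current routine's queue, while call targets seed new routines — exactly the choice made step by step above. Note that the call's fall-through is followed unconditionally, which is precisely the questionable assumption being discussed here.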
We need to cut the blocks just after the call. For the non-returning external APIs, like ExitProcess, we can identify them from their names, provided we have access to their full prototype somewhere, with the non-returning attribute. For the non-returning internal routines — for example, imagine a small wrapper routine just calling a non-returning API — we can identify them by analyzing their graph: if the graph has no returning blocks, then it is a non-returning routine. And this last bit brings an interesting situation: we can only know that an internal routine is non-returning after having analyzed it. But what if we are on the call instruction and we have not analyzed the target yet? At this point we do not know whether the target is returning or non-returning. We could stop analyzing the caller right here and go analyze the callee first, but that could be tricky, because we would have to maintain the caller's state, and that can be difficult — especially if there is a chain of calls in the callee, possibly coming back to the caller. And often we do not even know where the callee is: there are non-trivial calls where you don't know the target, and I will show you examples of that later. So the way we deal with that in JEB is the following. For the external APIs, we have our own C parser, and we build what we call type libraries, from compiler and SDK header files. For all the functions declared in the header files, these type libraries provide the full prototypes, with the non-returning attributes, so we can just check an API's name against the type library — and we have type libraries for many major compilers and SDKs. For the internal routines, we try to identify the simple cases at the time there is a call instruction. For example, we have a very simple binary check to see whether a routine is just a trampoline — a small routine jumping to an API. If it's a trampoline to a non-returning API, we stop analyzing at the call; we don't go analyzing the fall-through. Otherwise, if it's a more complex internal routine, we terminate the caller's analysis first, assuming the call will eventually return, exactly like we did before. But then we analyze the callee, and if we find out that the callee routine is actually non-returning, we go back to the entry point of the caller, and we re-analyze the caller once again. This can be tricky, because the first time we analyzed the caller we were missing some information — we analyzed a call as returning when actually it wasn't — and that can be hard to undo. So that's it for non-returning calls. Another assumption we made during the step-by-step disassembly is that routine control flow graphs are distinct: when we were done analyzing routine one, when there were no more addresses to analyze, we considered it done, as if the CFG of routine one was terminated. Actually, in reality there are examples of routines sharing code — once again, right in the Visual Studio C runtime. There are these two routines here: notice how the routine on the left directly branches within the routine on the right. So why is it a problem? Let's say we parse the routine on the right first, so we build its control flow graph, its basic blocks. Then we discover the second routine, we parse it, and we find out that it branches within an existing basic block — the instructions in red here are shared between the two routines. So now the question is: do we duplicate these instructions into a new block, and build a completely separate control flow graph for the second routine? Or do we split the block, and have a new basic block that is shared between the two routines?
So first, to think about this, we have to remember the usual definition of a basic block: it is a series of instructions executed successively, and as such it is a super useful abstraction for later analyses, because we can process basic blocks without dealing with control flow changes — there are no control flow changes within blocks. There is an exception: exceptions. When there are exceptions, they break the flow within a block, but let's forget about exceptions for now. So, if we were to duplicate the instructions into a separate graph, that would mean that at one address we could have several possible basic blocks, and that would make the writing of later analyses harder, because we would need to check all these blocks at the same time whenever we are at a specific address. So it's likely not a good idea to duplicate instructions, if you want to keep the power of basic blocks as an abstraction. What we do in JEB is that during the disassembly we build what we call skeleton basic blocks. These are just containers for instructions, and they can easily be modified and split when we need to; then, once the whole disassembly is finished, we build the final control flow graphs, with proper basic blocks and much more information inside. So that means that in JEB an address belongs to at most one basic block, and basic blocks can be shared between routines.
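As an illustration of the skeleton idea, here is a minimal Python sketch — hypothetical structures, not JEB's actual API — of blocks that keep a global address-to-block map and get split when a branch lands in their middle, so that the tail ends up shared rather than duplicated.

```python
class SkeletonBlock:
    """A toy 'skeleton' block: just an ordered container of instruction addresses."""
    def __init__(self, addresses):
        self.addresses = list(addresses)

class BlockManager:
    def __init__(self):
        self.block_at = {}                # instruction address -> its (unique) block

    def add_block(self, addresses):
        block = SkeletonBlock(addresses)
        for a in addresses:
            self.block_at[a] = block
        return block

    def block_starting_at(self, target):
        """Return a block starting at `target`, splitting an existing block if
        `target` falls in its middle; the tail block can then be shared."""
        block = self.block_at.get(target)
        if block is None:
            return None                   # nothing disassembled there yet
        i = block.addresses.index(target)
        if i == 0:
            return block                  # already a block boundary
        head, tail = block.addresses[:i], block.addresses[i:]
        block.addresses = head            # shrink the original block in place
        return self.add_block(tail)       # new, shareable tail block
```

After a split, both routines simply point to the shared tail block, and the invariant that an address belongs to at most one block is preserved.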
So if we come back to the previous example: in JEB, the control flow graphs for these two routines look like this — there is a block that is shared between the two of them. Now, another assumption we made is that branch instructions terminate basic blocks. When we were analyzing the conditional branch, we ended our block at this conditional branch, because that's the usual thing to do with basic blocks: when there is a branch, there are several possible control flows, so you cut the block at this particular location. There is one iconic counter-example of that, if we look at other architectures than x86 — and remember, we want to build a multi-architecture disassembler. So here is some MIPS code. You don't need to understand the snippet; I just want to focus on these two particular branches. These are conditional branches: they branch if a certain condition is true, otherwise they go to the fall-through. Now, the tricky part on MIPS is that these red instructions here will always be executed whenever the previous branch is executed: even if the conditional branch is taken and goes elsewhere, the fall-through instruction is executed. These are called branch delay slots. It's actually a feature of MIPS, but also of SPARC CPUs and some DSP CPUs. The story behind this is quite interesting: as you might know, modern CPUs execute instructions within a pipeline, which means that while they execute one instruction, they load at the same time another instruction from memory. So there is a problem with conditional branches, because the CPU doesn't know yet whether the condition is true or not: it has to make a guess, and load either the fall-through instruction or the branch target instruction. If the guess is wrong, there is a bubble — it basically has to empty the pipeline — and there is a loss of performance. To solve that, on MIPS there is this delay slot: the fall-through instruction just after a branch is always executed, so the CPU doesn't have any guess to make, and it's the job of the compiler to actually use this delay slot and put some useful instruction inside. Sometimes the compiler has no use for the delay slot, and it just puts a nop; sometimes it actually has a use for it, and it puts a meaningful instruction within the delay slot. So, from a disassembler's perspective, what does this mean? Let's focus on this particular block here. There is a conditional branch, so we have to cut a basic block somewhere, because there are two possible control flow paths. Remember that a basic block is a series of instructions executed successively: that means that the delay slot belongs to the same block as the branch, because it is executed with the branch. But if we cut just after the delay slot, that means we now have a branch instruction in the middle of a basic block — and that's basically breaking one of the most common assumptions about control flow graphs and basic blocks, namely that blocks end on branches. So a first idea would be to try to avoid that situation, and find a way to still have the branches in the last position of blocks. First, you might think that a CFG is just a representation, so we can play with it: what if we simply invert the instruction order?
If we do that, we actually break the order of expression evaluation, because the branch condition must be evaluated first. In this example here, according to the reordered graph, the v0 register — which is used as the condition of the branch — is now set to 1 by the li, which is the delay slot instruction, before the branch is evaluated; so it's no longer the same conditional branch if we invert the two instructions. So that's not working. Another idea would be: let's create some kind of artificial instruction that groups together the branch and the delay slot, so we still have a branch in the last position. But it's actually legal for control flow to arrive from elsewhere directly onto the delay slot instruction, and with this representation we cannot express that, so it's not working either. As far as I know there is no shortcut, and that's what we do in JEB: we allow branch instructions to be in the middle of basic blocks, and it's the job of JEB's instruction disassemblers to provide the number of delay slots for each branch instruction — because actually there could be more than one, depending on the architecture; it relates to the depth of the pipeline. To give you an idea, here is a snippet of MIPS assembly and the corresponding control flow graph: you can see that branches are now in the middle of blocks, not in the last position. Now, another very strong assumption we made during this step-by-step disassembly was that we can always follow the control flow. In particular, when we were analyzing the routine call from routine one to routine two, we stored the target for later analysis, assuming that the target was known to us. In reality, as I said before, it's not always the case, and to illustrate that, let's come back to the secret.c example, with a slight modification.
So I introduced a function pointer at the top: it's a pointer to a function with the same prototype as secret_algo. You can see that in main, the function pointer is set to the address of secret_algo, and then we call secret_algo through the function pointer. So it does exactly the same thing as before, except that it uses a function pointer for the routine call. Now, if we take this C code, compile it with Visual Studio without optimization like before, and disassemble it with the simple algorithm we showed at the beginning, we end up with only one graph: only one routine has been discovered, and here is its graph. That's the main routine — and notice how the call to secret_algo is now an indirect call: it is dereferencing a memory address, and calling whatever is stored at this address. What is stored at this address is actually written at the top here: that's the address of secret_algo. So the problem for our simple recursive disassembler is that it cannot follow the indirect call, because the instruction disassembler cannot find the target of the indirect call: it's not in the instruction itself, it's in the state of the machine — the state of the memory. We need the value stored at the dereferenced address at the time the call is executed. In this particular case we can find this value pretty easily, just by looking at the previous instructions: we will find out that there is a mov writing to this particular address, so if we assume that no other thread is going to modify the memory between the mov and the call, we have the indirect call target. Is it always that easy? Of course not. For a less artificial example of how to compute control flow, let's look at jump tables. Jump tables are used by compilers to implement switch statements from high-level languages, when the case values are close to each other — when there are few gaps between the case values.
So for example, here is a C switch statement whose case values immediately follow each other. If we compile it with Visual Studio, once again, and we run our disassembler, we end up with this graph. We are obviously missing a lot of code here — all the case code, actually. What happens is that there is an indirect branch, using a register in the computation of the target address: the ecx register. This address here is the base address of an array of four-byte addresses, and the ecx register is an index into this array; the array contains the addresses of the code implementing each case. So it means that if we want to compute the control flow for this particular routine, we have to find the possible values of the ecx register — in particular, what we need is its maximum value, such that we can then read in memory the addresses of the case handlers and make the connections in the graph. It's doable, because there is a check on ecx just before, in the preceding block, which actually enforces a maximum value. So that's just another example of how to compute control flow — there are many, many cases I could have shown, but you get the idea. And that brings us to a more general question: how can we find the possible values of indirect operands? That is, operands using a register or a memory address whose value is not in the instruction, but in the state of the machine. We need this in particular for indirect branches, to have a better control flow.
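The jump-table case can be made concrete with a small sketch. Assume — hypothetically — that the disassembler has already recovered the bound on the index register and the table base; reading the case handler addresses is then just a matter of decoding little-endian 32-bit entries. Here `table_base` is treated as an offset into the memory buffer for simplicity.

```python
import struct

def jump_table_targets(mem, table_base, max_index):
    """Toy jump-table reader: collect the 4-byte little-endian case handler
    addresses stored at table_base, for indices 0..max_index inclusive."""
    targets = []
    for i in range(max_index + 1):
        (addr,) = struct.unpack_from("<I", mem, table_base + 4 * i)
        targets.append(addr)
    return targets
```

The hard part, of course, is not this loop, but proving the value of `max_index` from the bound check that precedes the indirect branch — which is exactly the data flow problem discussed next.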
To have a better control flow, we need a better data flow. We could do some pattern matching to solve specific cases: for example, one given compiler will always use the same machine instructions to implement jump tables, so we could identify that particular situation and process it with some specific code to compute the control flow. Of course, this does not scale, because we would have to deal with all compilers, all compiler optimization levels, all architectures. So, in the search for a more generic solution to the problem: what if we could simulate the execution of routines, such that we would build the machine state — registers and memory — between each instruction? Then we could solve the indirect operands just by looking at the machine state we have built. For that to be possible — for the simulation of code to be possible — what we need is the semantics of the instructions: we need to know what each instruction is doing, so that we can update the machine state. As you might guess, it's not always doable, and I will come back to this later, but really the hard part for this to be even possible is to have the semantics of the instructions, what each machine instruction is doing. And luckily enough, we already have it in JEB, because we need it for decompilation. So let me introduce the JEB intermediate language.
It's basically a custom language; it can be seen as a low-level, imperative, assembly-like language. A program in JEB IL is a series of assignments made of expressions, and there are only 16 different elements in the language. We use this language mainly as a way to express the semantics of native instructions. To give you an idea, here is an x86 instruction — the XOR between a register and a memory slot — and here is its translation, its semantic representation in the JEB intermediate language. All the side effects of this instruction are explicit in this representation; that's what the instruction is doing. This IL is used during decompilation: it gets optimized, and most of the assignments will be removed because they are not used, but let's stick with this — we have the semantic representation here. As I said, it's the foundation of the JEB decompilation pipeline, because our optimizations work on the intermediate language, so they can be applied to all the architectures we want to decompile. The hard part is to implement the native-to-IL converters, and we have one of them for each architecture we decompile. It's really a similar idea to compiler intermediate representations: as you might know, compilers apply their optimizations on an intermediate representation, such that the same optimizations can be applied for all high-level languages. So that's a similar idea. And we can reuse this intermediate language, and the semantic representations we have, for our current disassembly control flow problem: we could simulate the JEB intermediate representation to enrich the routine control flow. And that's what we do.
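To give a flavor of what such a simulation looks like, here is a toy sketch over a made-up three-address IL — not JEB's actual intermediate language. Values are either concrete integers or UNKNOWN, and anything derived from an unknown input, such as a memory load, stays UNKNOWN, so the engine only ever reports values it can guarantee.

```python
# Sentinel for "we cannot know this value for sure".
UNKNOWN = object()

def simulate(statements):
    """Simulate a straight-line list of toy IL statements (dst, op, a, b).
    Operands that are strings are register names; others are immediates."""
    state = {}
    for dst, op, a, b in statements:
        va = state.get(a, UNKNOWN) if isinstance(a, str) else a
        vb = state.get(b, UNKNOWN) if isinstance(b, str) else b
        if op == "mov":
            state[dst] = va
        elif op == "load":
            state[dst] = UNKNOWN      # memory content is not reliably known
        elif op == "add":
            state[dst] = UNKNOWN if UNKNOWN in (va, vb) else va + vb
    return state
```

This mirrors the situation described above: a sum of two constants is resolved to a reliable value, while the same sum with one operand loaded from memory is conservatively left unknown.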
So we take each native routine and convert it into JEB IR; that gives us a CFG of IR statements. We do not optimize it at this point, because we want to be very fast. For example, here is the first basic block in the IR CFG of the main routine of secret.exe; you can see all the side effects here. Then we simulate this IR routine to build the machine state at each instruction. We start from a clean state, with pseudo-realistic values in the registers, and we allocate stack memory; the actual implementation of the simulator is not so hard, because we only have to handle the 16 different IR elements. Then we use these computed machine states to solve the indirect operands and enrich the disassembly. To give you an idea: if we disassemble this secret.exe with the function pointer, the previous example, JEB is able to tell us that the indirect call is going to a specific address. We write it as a comment with an arrow, and that target is another routine that will be disassembled by JEB. The reason we can solve this indirect call is that we find out, during the simulation, that there is a write to this particular address just before. It might seem magic, but let's not get our hopes too high: this kind of simulation cannot always work, because the simulation has to be safe. It can only provide reliable values to the disassembly engine, because otherwise they would be too risky to use. To give you an idea, take this three-instruction IR routine: two registers are set to constants, and then the ECX register is set to the sum of the two previous registers. The simulation works in this case; it can provide us the final value of the ECX register. But now, if we switch EAX to be a value coming from memory (that is the syntax to read from memory), the simulation cannot provide us the value of ECX for sure, because it cannot know what is in memory at the time EAX is set. So
it has to be a safe analysis. That means we provide values only for cases where there are no unknown inputs, so it can only solve simple cases, but in a generic way, in the sense that it works exactly the same for all the architectures for which we have native-to-IR converters. Now, we cannot always follow the control flow. So do we have another way to find this secret algorithm that is somewhere in memory, without any cross-reference to it? That brings us back to a very old question in program analysis: how to distinguish code from data in a program, just by looking at it. In theory, that is a well-known intractable problem, like any interesting problem, so theory is not really helping. In practice, what makes it a hard problem on most architectures is that code and data usually share the same memory space. Moreover, almost any series of bytes corresponds to a machine instruction, because instruction set encodings are usually very dense, very compact: they use bytes of all values. So just by looking at a few bytes, you cannot tell for sure if it is code or data. But in a specific context, we can devise specific solutions. To illustrate that, here is a raw dump of an x86 executable compiled by Visual Studio, and I ask you: is it code or data? If you are used to reversing on Windows, you might notice these three bytes at the beginning, and again just after: these three bytes stand in x86 for "push ebp; mov ebp, esp". That is the classic Visual Studio routine prologue, the first two instructions of many routines compiled by Visual Studio. Then there are two bytes here standing for "pop ebp; ret": that is the classic Visual Studio routine epilogue. And between the epilogue and the next prologue, there is a sled of CC bytes, standing for "int 3".
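These prologue, epilogue, and padding byte patterns can be sketched as a small scanner. This is a hedged toy example, not production heuristics: the exact byte sequences (55 8B EC, 5D C3, CC) are the ones named in the talk, and real detection needs far more context.

```python
# Toy scanner for classic Visual Studio x86 byte patterns. Illustrative
# only: real heuristics must handle many more prologue variants.
PROLOGUE = bytes([0x55, 0x8B, 0xEC])  # push ebp; mov ebp, esp
EPILOGUE = bytes([0x5D, 0xC3])        # pop ebp; ret
PADDING = 0xCC                        # int 3

def find_routine_starts(buf: bytes):
    """Return offsets where the classic VS prologue bytes appear."""
    return [i for i in range(len(buf) - 2) if buf[i:i + 3] == PROLOGUE]

def is_padding_run(buf: bytes, start: int, end: int) -> bool:
    """True if buf[start:end] is a non-empty run of int3 padding bytes."""
    return end > start and all(b == PADDING for b in buf[start:end])

# A tiny fake dump: prologue, one instruction, epilogue, padding, prologue.
dump = PROLOGUE + b"\x33\x45\xfc" + EPILOGUE + bytes([PADDING] * 3) + PROLOGUE
print(find_routine_starts(dump))  # two candidate routine starts
```

On this toy dump the scanner reports the two prologues, which is exactly the "two routines" reading a human reverser would give.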
So that's the classic Visual Studio padding within code. If you know the compiler, you can say it is very likely that this raw dump is code, with two routines, because it follows the patterns and structures of code compiled by Visual Studio. That is the basic idea. Another example, from a different perspective: here is the memory view of an x86 executable compiled by GCC. We have already identified some areas with code and some areas with data, and then we ask: are these gray, non-analyzed areas code or data? Once again, knowing the compiler can help answer the question, because GCC for x86 usually does not mix code and data. That means the top gray area is very likely code, and the bottom gray area is likely data. You get the idea. So that is what we do in JEB: we try to identify the compiler that was used to create the target, and then we apply heuristics specific to that compiler, in particular to solve the code-versus-data question. We have a bunch of compiler identification rules, and here is an example of a heuristic we apply. An unanalyzed address A will be considered likely code if all of the following is true: the compiler is GCC or Clang; the architecture is x86; there is no obfuscation or malformation (that is itself a bunch of checks to see if there is something wrong with the file); A is within the code area, where the code area is defined as the merge of all code sections or code segments, depending on the file format; and the bytes at A do not look like code padding. If all of this is true, it is likely that A is code. Now there is an interesting design question: if we use this kind of strategy, how do we integrate such compiler-specific logic into a generic disassembler? What we do in JEB is that for each compiler we load a different extension, and these extensions feed the disassembler with the heuristics' results. For example, here are a few of the heuristics.
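As a rough idea of how such per-compiler extensions could be shaped in code, here is a hypothetical Python sketch. The interface and method names are illustrative assumptions, not JEB's actual API.

```python
# Hypothetical compiler-extension interface: the generic disassembler
# queries whichever extension was loaded for the file, without knowing
# which compiler it targets. Names are illustrative, not JEB's API.
from abc import ABC, abstractmethod

class CompilerExtension(ABC):
    @abstractmethod
    def looks_like_padding(self, buf: bytes, addr: int) -> bool: ...

    @abstractmethod
    def looks_like_routine_prologue(self, buf: bytes, addr: int) -> bool: ...

    @abstractmethod
    def looks_like_switch_dispatcher(self, insn) -> bool: ...

class VisualStudioX86Extension(CompilerExtension):
    def looks_like_padding(self, buf, addr):
        return buf[addr] == 0xCC                    # int3 sled byte

    def looks_like_routine_prologue(self, buf, addr):
        return buf[addr:addr + 3] == b"\x55\x8b\xec"  # push ebp; mov ebp, esp

    def looks_like_switch_dispatcher(self, insn):
        return False                                # out of scope for this sketch

ext = VisualStudioX86Extension()
print(ext.looks_like_routine_prologue(b"\x55\x8b\xec\x5d\xc3", 0))  # True
```

The design benefit is that the core disassembler stays compiler-agnostic: swapping Visual Studio heuristics for GCC heuristics is just loading a different extension.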
These are methods in one JEB interface: there is one method to check whether a memory address looks like padding, another whether it looks like a routine prologue, and another whether a specific instruction looks like a switch dispatcher, that is, the branch instruction used by a switch statement. The disassembler does not know which extensions are loaded in memory; it is the job of another component to load the suitable extensions for a particular file. Of course, this will not always work; our heuristics are sometimes going to be wrong. It can happen because we misidentified the compiler, because it is a new or old version of a compiler for which the heuristics no longer apply, or because there is some obfuscation our checks do not catch. So what we have is some kind of feedback loop: we log the errors we make during disassembly. For example, if we try to disassemble a series of bytes and it is actually not a valid machine instruction, we log it; if there is a routine we try to build and its graph is incorrect or weird-looking, we log it. We count all these errors, and when a certain threshold is reached, the disassembler switches back to a safe mode in which we apply only very conservative heuristics. JEB is also an interactive tool, so as a last resort the user can tweak the disassembler's decisions. Another assumption we made during the step-by-step disassembly is that the instruction set always remains the same during the whole execution. The instruction set defines which instructions are available and what the encoding of these instructions is; I never said anything about that, it was taken for granted. There is one iconic counter-example, if we look at another architecture than x86 or MIPS: it is on ARM. Here is a snippet of ARM assembly; once again, you don't need to understand the snippet.
There is a branch here, going to another routine, a three-instruction routine, and notice that the bytes of the routine on the left follow a different pattern than the bytes of the routine on the right. On the left we have two-byte machine instructions mixed with four-byte machine instructions, while in the other routine we have only four-byte machine instructions. That is because these are actually two different instruction sets: the first one is called Thumb, and it was originally designed to be a compact version of the second one, which is the original ARM instruction set. So these are different instruction sets sharing the same encoding space, which means that the same bytes will be decoded into different instructions. Now, the tricky part is that both can be present at the same time in the same executable. So how does the CPU know which instruction set to use? In this case, it will switch from Thumb to ARM thanks to a BLX instruction, which stands for "branch with link and exchange instruction set". For us, what does that mean? It means that the instruction disassembler must handle all the possible instruction sets for a given architecture. It also means that knowing an address is code and not data is not enough: we need to know the actual instruction set to use to disassemble at this particular address. This information can come from various sources: the way the address is called, for example an ARM BLX with an offset; the way the address is referenced, for example in an ELF file, if a symbol has its least significant bit set to one, it is Thumb and not ARM; whether the address has a specific alignment; etc.
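One of these hints, the Thumb bit convention on ARM ELF symbol values, can be sketched in a few lines. This is a simplification made for illustration; real ARM ELF handling has more cases (mapping symbols, relocations, and so on).

```python
# Sketch of the ARM ELF convention mentioned in the talk: a symbol value
# with its least significant bit set denotes Thumb code, and the actual
# code lives at the address with that bit cleared. Simplified on purpose.
def decode_arm_symbol(value: int):
    """Return (address, instruction_set) for an ARM ELF symbol value."""
    if value & 1:
        return value & ~1, "thumb"
    return value, "arm"

print(decode_arm_symbol(0x8001))  # Thumb code located at 0x8000
print(decode_arm_symbol(0x8000))  # ARM code located at 0x8000
```

Two symbol values one byte apart can thus point at the same code bytes while demanding two different decoders, which is exactly why "this address is code" is not enough information.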
There are multiple hints to solve this question. So what we do in JEB is that we allow our instruction disassembler to handle different instruction sets, and the generic disassembler logic updates the instruction set to use in the instruction disassembler. When we have an unknown code address, that is, we know it is code but we do not know which instruction set to use, we try all of them: we disassemble with all the instruction sets in parallel, and we keep the best result. The best result is basically the one from the instruction set that gives us a correct-looking control flow graph; so, once again, a bunch of heuristics. A final assumption we made is that all code matters. This one is not really the same kind of assumption, but basically we are missing something in our simplistic disassembler. If you remember, in the secret algorithm there is this atoi library routine that is called: in the corresponding CFG, we have a call to this routine, going into the same executable, because this library routine has been statically linked into our executable, and there is another call at the end. Actually, atoi is pretty complex as a routine, but it is just the standard atoi. That brings us to a very old problem in reverse engineering: how do we identify library routines?
Such identification is useful, first, because it allows the user to read documentation rather than analyze the code; in the case of atoi, the documentation is quite straightforward. It also helps automatic analyses like decompilation, by providing precise information on a routine, in particular its prototype, which can be hard to guess sometimes. Compilers provide these library routines in already-compiled form, within object files, and these object files are statically linked by the linker into executables. These object files come with symbols, at least the routine names, because the linker needs the routine names to do the linking and to know which object file should be linked into an executable. So the usual strategy to solve this problem is to take the library object files shipped with a compiler, generate signatures from them, and then use the signatures during disassembly to identify library routines. That is what we do in JEB. Our signatures are composed of features on one side, the characteristics of the routines that we use to identify them, and attributes on the other side, the knowledge we have about the routines: the name, labels, sometimes comments. When we generate signatures for standard libraries, we select the features such that the signatures will be false-positive free: we only want to identify this particular routine, and we do not allow any variation, because we want to be able to trust the signatures very much. As features, we use a custom hash computed from the routine's assembly code; using the assembly code rather than the binary code makes us independent from the endianness, which can differ on bi-endian architectures. We also use the names of the called routines, which allows us to distinguish two wrappers, that is, two routines having the same code but calling different routines. So we use the name of this
called routine as a feature. It can also be a burden to use that feature, because it means you have to match the callee routine in order to match the caller. Then we add some additional features depending on the routine size. The basic idea is that the smaller the routine, the more features we need to avoid false positives; in other words, the bigger the routine, the more likely the hash alone is enough to identify it. As a final note, when I talk about false positives in this context, I mean that the name given to a routine does not represent its behavior. The dummy example is that we can identify unlink as remove: that is not a false positive, because these two routines have a similar behavior in similar circumstances, and the same prototype. So we have a bunch of signature libraries for all our architectures and for several compilers and compiler optimization levels. Enough with the broken assumptions; what is the point I am trying to make? First, to sum up what we did: we successfully disassembled the Windows executable secret.exe with a simplistic recursive algorithm, but we made a lot of assumptions along the way, and then we showed that these assumptions can be broken just by looking at standard compiler code. I never said anything about obfuscation, protections, or packers: all the examples I have shown came from classic compiler code. And if you are a reverser, you probably have in mind tons of other broken assumptions we made during the step-by-step disassembly, like "instructions do not overlap" or "code does not modify itself". That is actually the way many obfuscation techniques work.
They break the assumptions made by analysis tools. So a first, pessimistic conclusion would be that there is no such thing as a disassembler able to correctly disassemble all programs, for all architectures and compilers. The intuitive idea I was trying to show is that there are actually very few assumptions that hold true across so many diverse programs. If you are an academic, you could make the connection here with the halting problem and its generalization, Rice's theorem, which basically says that there is no interesting property of program behavior that you can decide. Now, we cannot disassemble all programs correctly, but we might still be able to do well on a subset of them; that is what I was trying to show with the compiler heuristics. What we can do is try to understand the universe of programs we will try to disassemble, and we can do that by dividing this universe into groups with the two following properties: there exist reliable ways to check whether a program belongs to a group, and an interesting group has non-trivial assumptions that are true for the whole group. That is exactly the idea of the compiler-specific heuristics, but you can go much further: you can refine that to a specific compiler, with a specific high-level language and a specific runtime, and then you have a group with assumptions you know are true for that group. When a program does not belong to any known group, we just apply very conservative assumptions. What we really want is to avoid the common disassembler mistakes. The worst mistake a disassembler can make is to disassemble data, because it has a domino effect: it can work, since as I said before the encodings are so dense that decoding data can actually succeed, and it will create wrong cross-references and wrong branches that are hard to undo. Then there is code considered as data; that is also a mistake, and it is misleading for the user. And finally, we want to avoid missing code and data. So this process of building
knowledge of the program universe is easier if the disassembler is coded in an informative way, that is, if it explicitly reports when an assumption is broken: no silent failures, no missing else branches. Then the disassembler can help the developer map the program universe by identifying new corner cases as we go: every time we open a new binary in JEB, we can get an exception, as developers, telling us that there is a corner case, and as we analyze new programs every day, it is kind of cool to have this ability. Then, of course, you have to test your disassembler aggressively, and on a diverse sample set. The diversity of the sample set is tricky to achieve, because for most executables you will find on the internet, you do not know the exact compiler version or the optimization level, so it is hard to classify which group they belong to. So you have to do a lot of compilation yourself, and then you can obtain the ground truth for the tests from the symbols, since you compiled the binaries. Or you can use already-compiled binaries and do some differential testing: you compare the output of your tool with other tools, and we do that a lot in JEB, comparing our results with other disassemblers. And finally, something that is needed given this context of not being able to disassemble all programs: we have to empower the users, giving them the ability to review and tweak the assumptions made by the disassembler, so they can adapt the assumptions to their particular cases. That means that, as a developer, you have to explicitly say: I made an assumption here, and here is what it is. And of course, if there is a UI, the users should have the ability to fix mistakes themselves during the analysis. So, hopefully this presentation convinced you, if that was needed, that disassembly remains a complex problem. And you might think that, with new compiler versions and new languages coming out, the
problem becomes worse and worse. Actually, there are many novel anti-exploitation techniques that tend to make disassembly easier, in particular because they provide hints to distinguish code from data. For example, in ELF executables with separated segments, all the code is put into one specific segment; Microsoft Control Flow Guard provides all the routine entry points in the metadata of the file; Intel Control-flow Enforcement Technology puts a specific instruction at routine entry points. All of these provide hints that help disassemblers, and that is why, in some ways, disassembly becomes easier and easier. So thank you very much for your attention. If you have any questions, of course, I will try to answer them. Thank you.