Welcome to the 6:30 talk of today. One of the many things I love Congress for is the bold programming. Normally, this is the arts and culture stage and the society stage, so I'm very excited that there's also a security talk happening here today. Our next talk, "Let's break modern binary code obfuscation", is going to be translated into German under this address, streaming.c3lingo.org, which should also appear on the screen now. Wonderful. And from Bochum, I'm now going to welcome Tim Blazytko and Moritz Contag, and they're going to explain to us how you can deobfuscate code without looking at the code, but only at its behavior. Please welcome them with a lot of applause.

Thank you. First, can we enable the TV down here? Technik? Good, thank you. Well, hi. So welcome to our talk. We are Moritz and Tim, two PhD students and reverse engineers at Ruhr University Bochum, Germany. In our daily work, we do reverse engineering and things like that, and as part of one of our academic projects, we started working on program synthesis for binary deobfuscation. This is basically the more technical version of the academic talk that we presented and published at USENIX Security this year. First, Moritz will talk about code obfuscation techniques and deobfuscation techniques, and I will later join in with program synthesis and how to apply it.

All right, thank you for the introduction. OK, first things first: why do we want to obfuscate code in the first place? To set the scene, it's important to note that we cannot really prevent reverse engineering attempts, but rather we seek to complicate them at least to some degree. There are many reasons why we would want to do so. One is intellectual property and the protection thereof.
If you've got some super secret algorithm which gives you a competitive advantage over some competitor, you can just protect it and keep your head start over that other corporation. Another reason to obfuscate code would be malicious payloads: we want to make them harder to detect, so that analysts are not easily able to create signatures for them and the malware lives longer without being picked up by an AV. The third and, I think, most interesting case is digital rights management, where software, especially AAA games, larger games, is obfuscated in order to prevent cracking attempts and illegal distribution of the game itself. And especially in the context of digital rights management, there's a very fitting quote which sets the scene as to what we can expect from code obfuscation. It's from Martin Slater of 2K Australia and was issued regarding the release of BioShock back in 2007. He told us that they achieved their goals because they were uncracked for 13 whole days. So apparently, we cannot really prevent cracking; we can just make the software withstand cracking attempts for at least 13 days, which is about the time in which the game distributor makes most of its revenue.

OK, how do we protect software? There are different approaches to protecting software, some better, some worse. For example, we can just take a look at the software that's being used to analyze the programs itself, like OllyDbg, disassemblers, and so on, and one idea that comes to mind is to abuse shortcomings of these tools. If we know OllyDbg crashes during its code analysis on a specific FPU instruction sequence, we can just emit that sequence and make OllyDbg crash. Or similarly, if we've got a process dumper which relies on fields of the PE header in memory, we can easily confuse it and just render these tools unusable.
Another approach that's very popular is to detect the environment the program runs in and check if we are running in an environment where something gives us information about the presence of a debugger. There is some capability provided by the operating system, for example the BeingDebugged bit in the PEB, which is set if a debugger is attached. Similarly, there are more low-level tricks or operating system internals we can abuse which allow us to somehow detect the presence of a debugger and then just divert execution to prevent any reverse engineering attempts. However, both techniques have the drawback that, once you know how they work, they're easily fixed: you can just patch your tool, or you can circumvent the anti-debugging tricks by supplying the correct values. Also, if you go to Google and search for "game doesn't start, debugger detected", you get over 6 million people complaining that they cannot run the latest AAA game, because an anti-debugging technique which might have worked reliably on Windows XP does not really hold up on Windows 7 and issues a false positive. So benign customers cannot really play the game, because the debugger detection was faulty.

So let's write down some requirements for code obfuscation. An important one is that we want the code to be semantics-preserving: the mere fact of protecting the application shouldn't change its observable behavior. We do not want the game to break only because we want to protect it. A second point: we want to avoid external dependencies, at least in the context of our discussion here. There are various attempts to outsource data, to an internet server, a USB dongle, or other separate media, but we're mostly interested in techniques that protect the code only.
So we've got a white-box attack scenario where the attacker has everything they need to attack the application at hand. And finally, the most important point probably is that we want to employ techniques that are easier, or way easier, to deploy for us than for the reverse engineer to attack. With the anti-debugging tricks we've seen, or the shortcomings of the tools which we can abuse, the effort is more or less one-to-one: once you know how it works, you can easily bypass it. But we want to employ techniques that are easy to deploy for us, yet very hard to attack by other parties.

Okay, coming to code obfuscation techniques. One technique that's been used in commercial protection engines is what is better known as opaque predicates. Consider this rather easy CFG on the left side: we've got some application with a linear control flow graph. If we now insert what is known as opaque predicates, the program looks vastly more complex. We've got more branches and control flow which we have to track, and we have no clue that the underlying program in fact is very simple. Let's zoom in on one of those predicates. Okay, so we've got a bit of code, we've got a true branch and a false branch, both branching to a different basic block. If we're just looking at this, we might come to the conclusion: okay, we don't know which branch is going to be taken, right? However, it turns out that this is what we call an opaque true predicate. In this case, the true branch is always taken, regardless of the behavior here in the predicate. What's happening here is that we've got an opaque predicate on top: this whole block is just constructing a predicate based on an opaque value, like the return value of the API call GetCurrentProcess. And if you've ever worked with the Windows API before, you might know that GetCurrentProcess always returns a constant value, namely minus one.
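To make the idea concrete in plain Python (not the Windows API trick itself, but the same principle), an opaque-true predicate is any condition that always evaluates the same way without that being syntactically obvious, for example because x*(x+1) is always even; the function name and the decoy constant below are made up for illustration:

```python
import random

def opaquely_obfuscated_abs(x: int) -> int:
    # Opaque true predicate: x * (x + 1) is the product of two
    # consecutive integers, hence always even. The condition below is
    # always true, but that is not obvious from the syntax alone.
    if (x * (x + 1)) % 2 == 0:
        return x if x >= 0 else -x      # real code path (always taken)
    else:
        return x ^ 0xDEADBEEF           # dead decoy path, never executed

# The observable behavior is just abs():
for _ in range(1000):
    v = random.randint(-10**6, 10**6)
    assert opaquely_obfuscated_abs(v) == abs(v)
```

A static tool that does not know this arithmetic identity has to treat both branches as reachable, which is exactly the extra knowledge the analyst is forced to encode.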
So by just issuing this predicate, we make sure that we always take the left branch. The right branch might point to dead code or other code to confuse the reverse engineer, but at runtime we always take the left branch. Similarly, there are opaque false predicates, which just invert this condition and follow the false branch. And there's another flavor called random opaque predicates, which branch on a random value. This means that at runtime we can potentially follow both branches. The challenge here is that we have to make sure that both paths that follow are semantically equivalent, because the program breaks otherwise. However, it is also a challenge to ensure that the attacker cannot easily detect that both branches are, in fact, semantically equivalent.

Okay, recapping on opaque predicates: obviously, they increase the complexity of the application. Also, they can be built on hard problems. But what is most important is that they force the analyst to encode additional knowledge: either the analyst himself has to know it, or he has to extend his tool to know about the Windows API, about FPU instructions, about arithmetic identities, or about other hard problems, all of which he has to encode in his analysis tool to produce reasonable results. So they are hard to solve statically. However, if you opt to only look at concrete execution traces, so we just let the application run and look at concrete values, we know that opaque predicates are essentially solved for free for us. This is good to keep in mind.

Another very interesting technique that has been deployed in, I guess, every major copy protection out there is what we call virtual machines. Take, for example, this native code on the left. We've got some loop here in x86 code, and let's just assume the loop is our precious intellectual property we want to protect.
Obviously, we can use our common tools, like disassemblers and debuggers such as IDA or OllyDbg, to look at this code, reason about it, and get to know what it does. So the idea here is to replace the code on the left with something our common tools cannot understand. What we're doing is just getting creative and making up an entire instruction set, as you see on the right: VLD, VPOP, instructions that are not known in this form in any other architecture, complete with new registers, new encodings, and so on. Both the native code and the new virtual instructions are, again, semantically equivalent. What we're now doing is replace the native code by a call to what we call a virtual machine, also known as an interpreter, which is basically a CPU in software that lets you run the imaginary architecture we've thought up. If we now try to take our common tools, like IDA or OllyDbg, to analyze this code, it's not really viable anymore, because they don't know about our made-up architecture. But we still have to make sure that the transition from native code to our virtual instructions goes seamlessly, and to this effect, we have to look at the components. There are three core components to a virtual machine. First, we've got VM entry and exit, and what they do is perform the context switch from native to virtual context and back: the entry copies the native context, let's say registers and flags of the native architecture, to the virtual context, and the exit copies them back to the native context. Usually the mapping of native to virtual registers is one to one, which makes it a bit easier. Then we've got a traditional fetch-decode-execute loop, like in a traditional CPU, which fetches and decodes one instruction and advances the virtual instruction pointer accordingly.
Then it looks up the handler which implements the virtual instruction in what we call a handler table, here on the right, and invokes this handler, and the handler goes on to actually execute the instruction. As we've seen, the handler table is just a table of function pointers indexed by opcode, and we have at least one handler per virtual instruction. Each handler then decodes the operands, operates on them, and updates the VM context accordingly.

Okay, so this is an example of an unobfuscated version of a popular VM obfuscator, and we might be able to make out these components here. On top we see the VM entry: we're coming from the native code, calling into the interpreter, and here we're switching to the virtual context, initializing it. Then we're at the VM dispatcher loop, which looks up the current virtual instruction in the handler table, which in turn points to the individual handlers you see below, so we get direct references to these handlers. The handlers then update the virtual context and branch back to the VM dispatcher. Eventually we end up at the VM exit handler, which performs the context switch from the virtual context back to the native one. Okay, taking a closer look: in RSI we have the virtual instruction pointer, and in RBP we've got the VM context. You see that all the VM dispatcher does is take one byte of memory, increase the virtual instruction pointer, look up the corresponding handler, and jump to it. Here we see a NOR handler, and as you can see, it just reads out of the virtual context, performs some semantics, writes values back into the context, and finally jumps back to the dispatcher to execute the next virtual instruction.

Okay, so this is rather simple and reasonably understandable. How do we harden this whole concept? For one, obviously, as we've seen, the handlers are simple.
There are only a few instructions, and they're easily understood. What we can do here is apply traditional code obfuscation transformations to make analysis of them harder, like substituting operations, inserting opaque predicates, inserting junk code, and so on. On the right you see a slightly more complex assembly listing than before. Another technique: imagine you've only got four handlers in your virtual instruction set. You want to make it more complicated, more work for the reverse engineer, so what we're doing is duplicating individual handlers to increase the workload for the attacker. Because the handler table is usually indexed using a single byte, we've got up to 256 entries we can populate. So we just duplicate existing handlers to populate the full table, and then again use traditional code obfuscation techniques to hinder the attacker from easily finding out that two handlers are in fact similar. And again, yeah, we increase the workload for the attacker, who now has to analyze up to 256 differently obfuscated versions of the handlers. Another technique: you might recognize that here we've got the single dispatcher and multiple handlers, which all branch back to the dispatcher, which executes the next instruction. What we can do here is remove this single point of failure and just inline the dispatcher at the end of each handler. What happens then is that we don't have a central dispatcher, but every handler branches to the next one directly. In summary, we inline the dispatcher into each handler, because a central VM dispatcher allows an attacker to more easily observe what happens inside the VM; it's like recording every handler that has been executed. And as a third technique, we can actually also get rid of the handler table, because the explicit handler table easily reveals all the VM handlers, as we've seen earlier: it points to the start of every single handler.
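As an aside, the baseline dispatch scheme with an explicit handler table can be sketched as a toy interpreter in Python; the bytecode format, handler names, and register layout below are entirely made up for illustration and do not correspond to any real protector:

```python
# Toy VM: one-byte opcodes index a handler table; the context holds
# eight virtual registers. Bytecode format: opcode, then operand bytes.
def h_load_imm(ctx, code):
    reg, imm = code[ctx["vip"]], code[ctx["vip"] + 1]
    ctx["regs"][reg] = imm
    ctx["vip"] += 2

def h_add(ctx, code):
    dst, src = code[ctx["vip"]], code[ctx["vip"] + 1]
    ctx["regs"][dst] = (ctx["regs"][dst] + ctx["regs"][src]) & 0xFF
    ctx["vip"] += 2

def h_exit(ctx, code):
    ctx["running"] = False

HANDLER_TABLE = {0x00: h_load_imm, 0x01: h_add, 0xFF: h_exit}

def run(code):
    ctx = {"vip": 0, "regs": [0] * 8, "running": True}
    while ctx["running"]:                 # fetch-decode-execute loop
        opcode = code[ctx["vip"]]         # fetch one opcode byte
        ctx["vip"] += 1
        HANDLER_TABLE[opcode](ctx, code)  # dispatch to the handler
    return ctx["regs"]

# r0 = 20; r1 = 22; r0 = r0 + r1
bytecode = bytes([0x00, 0, 20, 0x00, 1, 22, 0x01, 0, 1, 0xFF])
print(run(bytecode)[0])  # 42
```

Note how the `HANDLER_TABLE` dictionary hands an analyst the address of every handler at once, which is exactly the weakness the next technique removes.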
What you can then do, instead of querying an explicit handler table lying around in memory, is encode the next handler address in the virtual instruction encoding itself. This might look a bit like this: we've got the opcode, the operands, and then somewhere in the encoding, we encode the next handler address. This essentially hides the starting location of each handler that has not been executed yet. With a handler table, we would easily see where every handler lies and could deobfuscate and analyze it, but with this indirect encoding, we can only observe those handlers that occur in a concrete execution trace.

Okay, going on to mixed Boolean-arithmetic, which is another obfuscation technique, not so widely used in modern obfuscation yet. To give you an idea how it works, just look at this term. It looks a bit involved, and you might not be surprised when I tell you that there's a simpler version of this expression, which is a simple addition of the values x plus y. You might guess that even for this more involved term, there's again a simpler variant; here, it's the addition x plus y plus z. Okay, you might believe me that it's easy to throw the first term and the second term into your favorite SMT solver and prove their equivalence. However, what is interesting: how do we get to the second term if we only have the first? We can try to apply Boolean identities we know from school, maybe arithmetic identities; we might even go so far as to draw a nice Karnaugh map. However, it becomes evident that this doesn't really help us here. In fact, what is being used here is the concept of the mixed Boolean-arithmetic algebra, which gives us a ton of different operations within one algebra: Boolean operators, comparisons, and even arithmetic operators.
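The exact terms from the slide are not reproduced here, but a classic MBA identity of the same flavor is that (x ^ y) + 2*(x & y) equals x + y on fixed-width integers (XOR produces the sum bits, AND the carries). A quick random test in Python illustrates that the involved form and the simple form have the same IO behavior; random testing only suggests equivalence, an SMT solver would prove it:

```python
import random

MASK = 0xFFFFFFFF  # work in the 32-bit bit-vector space

def mba(x, y):
    # Mixed Boolean-arithmetic rewrite of plain addition:
    # the XOR contributes the sum bits, the AND the carry bits.
    return ((x ^ y) + 2 * (x & y)) & MASK

def simple(x, y):
    return (x + y) & MASK

for _ in range(10000):
    x, y = random.getrandbits(32), random.getrandbits(32)
    assert mba(x, y) == simple(x, y)
print("IO-equivalent on 10000 random samples")
```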
It's worth noting that mixed Boolean-arithmetic includes both the Boolean algebra and the integer modular ring, and it is worth noting that no techniques exist to simplify expressions that contain both. We know how to reduce expressions in the Boolean algebra and in the integer modular ring, but there's no underlying theory that helps us easily tackle both at the same time.

Okay, on to deobfuscation. In every deobfuscation scheme or tool we've seen, we get to a point where we employ a technique called symbolic execution. Consider the handler here on the left; this is the NOR handler we've seen earlier. We now want to reason about this handler, so what we're doing is execute it, but not with concrete values, with symbolic values. This looks like this: we're just executing this mov symbolically; we assign to RCX the value of the memory cell pointed to by the register RBP. We can continue to do so, and we've got the second assignment. Then, if we've got a not operation on RCX, we just assign to RCX the negation of RCX, which is the same as the negation of the memory cell pointed to by the register RBP. And we go on. What gets interesting is this case, where we've got a logical and on RCX and RBX, which is the same as the negation of one memory cell anded with the negation of the other memory cell. But because a symbolic execution engine is basically a computer algebra system, it encodes some well-known identities, for example from the Boolean algebra, and in this case, it knows that this expression is equivalent to a NOR of both memory cells. We can then just continue with execution, and again, the symbolic execution engine recognizes the next NOR. In the end, this works rather well, and we see that the core semantics of our handler here is just a NOR operation on two memory cells, storing the result in another memory cell.
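The simplification step the engine performs can be mimicked in a few lines of Python (our own toy sketch, not a real symbolic execution engine): symbolic values are strings, compound expressions are nested tuples, and the only rewrite rule is De Morgan's law, ~x & ~y = ~(x | y), i.e. a NOR:

```python
# Symbolic values are strings ("mem0"); compound expressions are tuples.
def sym_not(e):
    return ("not", e)

def sym_and(a, b):
    # The "computer algebra" part: recognize De Morgan's law,
    # ~x & ~y == ~(x | y), which is a NOR of x and y.
    if isinstance(a, tuple) and a[0] == "not" and \
       isinstance(b, tuple) and b[0] == "not":
        return ("nor", a[1], b[1])
    return ("and", a, b)

# Symbolically execute the core of the NOR handler:
#   rcx = mem0 ; rcx = ~rcx ; rbx = mem1 ; rbx = ~rbx ; rcx = rcx & rbx
rcx = "mem0"
rcx = sym_not(rcx)
rbx = "mem1"
rbx = sym_not(rbx)
rcx = sym_and(rcx, rbx)
print(rcx)  # ('nor', 'mem0', 'mem1')
```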
So apparently this works fine for this handler. But what can we do, or what is the result, if we try to throw this whole concept of symbolic execution at a more obfuscated handler? Here's a rather simple example of an obfuscated handler. You might be able to make out some parts, but let's just throw symbolic execution at it. We see that we get a ton of information here, but the problem is we don't really want a ton of information. What we want is just the underlying semantics, which is rather simple: it's the NOR operation on two memory cells and the assignment to another memory cell. Next, let's try to throw this at mixed Boolean-arithmetic. We have this rather complicated mixed Boolean-arithmetic expression, and we simply compile it into a binary program and then run symbolic execution on its trace. What we get is this semantics, and it didn't really fit on the slide, so it goes on and on. You might be able to make out the resulting values in RAX, and we've got this super complicated expression. Again, the underlying semantics is rather simple; we do not want this complicated expression, we just want the simpler semantics here. So symbolic execution is nice because it allows us to capture the full semantics of the code we execute. Also, it's a computer algebra system, and as we've seen in the NOR example, it allows some degree of simplification of the intermediate expressions. However, its usability decreases as the syntactic complexity of the underlying code increases. For example, we can introduce artificial complexity by substituting instructions or using opaque predicates or other schemes we've seen earlier, or we can increase the algebraic complexity by employing techniques like mixed Boolean-arithmetic expressions. Okay, so it's obvious that we have a problem with syntactic complexity, and we want to handle this somehow.
So the interesting question here is: what if we could reason about the semantics of the code only, instead of having to fight the syntax and keep improving our tools to cope with ever more complicated syntax? This leads us to the topic of program synthesis, which Tim will tell you more about.

Yeah, thank you. So we have seen some limitations caused by syntactic complexity. But keep in mind: obfuscation is semantics-preserving. That means it has the same IO behavior, input-output behavior. So why not just treat a function as a black box and observe what it's doing? For instance, we might generate some random inputs, one, one, one, and observe three, and we note that down. And we do this several times more, so we store this one and this one and twenty others. We don't look at the code at all; we just look at its IO behavior. And then what we do is learn a function that has the same behavior; we might learn x plus y plus z. The goal of program synthesis is to automatically learn these functions based on IO samples.

So how do we do that? We use a probabilistic approach; basically, we have an optimization problem. We have a surface, something like this, and this surface has some low points, some high points, and a global maximum at the top. The global maximum is a program that has exactly the same IO behavior as our black box, and we have a concrete value for each point on the surface: the closer we are to the global maximum, the higher our score. In a probabilistic optimization problem, we basically start with a low score and increase it until we find the global maximum. How we do that is with an algorithm based on Monte Carlo tree search. Monte Carlo tree search is one of the main reasons why AlphaGo last year was able to beat world-class human Go players in computer Go. So let's get practical, and let's just synthesize a plus b modulo 2 to the 8.
We first somehow have to define how we generate programs; well, we define a grammar. We have a non-terminal symbol U. Non-terminal means that we can replace it with other symbols: for instance, we replace U by U plus U, or by U times U, or by A, or by B. We say that A and B are our input variables; we cannot derive them any further. A terminal program, therefore, is something such as A, B, A times B, A plus B, B plus B, and so on. In contrast, an intermediate program is a program that contains at least one U, so we can derive it further. With that as a foundation, we do Monte Carlo tree search. We start with an empty tree that has just U as its root node, and then we randomly apply the rules of our grammar. So we derive A. A, in this case, is a terminal program; it cannot be derived further, so we can give it a score, because we are on a probabilistic surface and have to score things. How we calculate the score doesn't matter for now, we'll come to that later; we just give it a score. Then we derive the next thing, B, and there, too, we give it a score. This time we derive U times U, but we cannot give that a score, since it is an intermediate program. So what do we do? We do something called a random playout: we apply the rules of the grammar randomly and replace the Us until we get a terminal program that we can evaluate. So we get something like A plus A times B plus A, and give it a score. Then we do the same with U plus U: again we have an intermediate program, we derive something, score it, and go back. Now we don't have any further rules to apply. What we then do is choose the best child node in a probabilistic manner, in this case U times U, and do the same thing again: we derive some expression, do a playout, give it a score, go back. Now, each score basically represents the average score of all its children's scores.
So, in this case, we had a new score here, so we just update this one, and we go back. Again, we choose the best node, do a derivation step, and so on; go back, update, choose the best node, play out, score, update, and so on. Now here, for U plus A we have a really high score, and for U plus B we have a rather low score, and we want to synthesize A plus B. This means that U plus B had a bad playout and that U plus A had a good playout, because here the score is better. So we update this and go back. We again choose this node, go back, and go back. As you can see, we always explored this area more than that one; the reason is that U plus U seems to be way more promising than U times U. However, we have some probabilistic behavior, which means we sometimes also explore different paths, just to find out if we might be missing something. So in this case, we would normally choose U plus A again, but we just decide to choose this one instead. We do a playout, give it a score, and, well, it wasn't really good. Because of that, in the next step, we go back to U plus A. Then we derive B plus A, and so we have a terminal program. We cannot derive it any further, so we can score it directly, and we give it a score of one. Why one? Well, because B plus A has exactly the same IO behavior as A plus B. In other terms, our program synthesis is finished at this point.

Okay, how do we score a node? How do we calculate the score? Basically, during a playout, we generate random inputs, query our black box, and observe the output. So, two plus two is four. Then we use the same inputs, query our intermediate program, and observe its output. We then calculate the similarity of these two outputs and get a score; how we compute similarity, I will come back to on the next slide. For now: we somehow compare the similarity. And we don't do this for only one input pair, we do it for many. So we do it for this one, and for this one.
In this case, we have the same output; that means the similarity is one, because it's exactly the same. And we do this several more times. Finally, our score for this node is basically the average of all these similarities. How do we calculate the similarity? Well, we are operating in the bit-vector space, so we compare the similarity of bit vectors. For one, we can compare if they are in the same range: we have a look at the trailing zeros or ones and at the leading zeros or ones, and count how many are the same. Then we can compare how many bits are different; this is the Hamming distance. We use that, for instance, for things such as overflows in additions, or for XOR and AND operations and things like that. And then we still have a plus b or something like that without overflow, where we can look at how close two values are numerically, so basically the numeric distance. We use these metrics to tackle the different behaviors of the bit-vector space, such as overflows and things like that, and we again take the average of all these metrics.

Okay, how do we use that as a core to synthesize obfuscated binary code? Basically, we have an execution path, perhaps obtained from an instruction trace, perhaps from static analysis, whatever you want, and we somehow emulate it and observe its inputs and outputs indirectly. We say that everything we read before we write to it is an input; in this case, these are the two memory cells at the top there. And everything, marked in red, that we write to last is an output. For instance, RBX is not read before it is written, so it is an output, and the same holds for RCX and so on. Then we treat this again as a black box, generate random inputs, feed them into our black box, and observe the output, in this case RBX.
We do this many times, something like 20 or 30 times, and then we synthesize this output, and we learn that RBX is not m0, where m0 here is the first memory cell. We do the same for RCX and learn that this is basically the NOR of these inputs, and finally, we do it for this one, and there, too, we learn that it is a NOR. To conclude, we learn the semantics of the two NORs and of the negation at the top. What we don't learn in this handler is the pushf operation, which is flag-based, and the flags that are written back into the memory cell pointed to by RBP. Why don't we learn that? Because we don't care. Basically, we designed our grammar in such a way that it abstracts higher-level semantics and ignores bit-level semantics; in other terms, we don't consider flags as any sort of input.

Okay, we do random sampling over code. What if there's a conditional branch, and we feed in random inputs? Well, we might trigger some different behavior, such that the emulator decides to choose another path. What we do here, again, is just ignore the flags and force the execution to go down the path that we observed: if we want to go to A, we force it to go to A; if we want to go to B, we force it to go to B. So how do we implement something like this? Basically, you have lots of different options. What you need is some code base with which you can somehow execute code. For instance, you can emulate it, as in Bochs or the Unicorn engine, or you can do it with dynamic binary instrumentation, as in Pin or DynamoRIO. What you also can do is translate the code into an intermediate language, symbolically execute it, and just evaluate the symbolic expressions while feeding them with concrete inputs. Normally this is much, much slower, especially for large constraints, as you have seen before, but in general, it is possible.
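If you just want to experiment with the idea, the whole pipeline (IO sampling of a black box, deriving candidate programs from a grammar, scoring candidates by output similarity) can be compressed into a toy Python script. This is our own sketch under simplifying assumptions: plain random playouts stand in for full MCTS, and the score uses only the Hamming distance rather than the richer metric set described above:

```python
import random

MASK = 0xFF  # 8-bit bit-vector space

def oracle(a, b):
    # Black box whose semantics we pretend not to know.
    return (a + b) & MASK

SAMPLES = []
for _ in range(20):
    a, b = random.getrandbits(8), random.getrandbits(8)
    SAMPLES.append((a, b, oracle(a, b)))

# Toy grammar: U -> U + U | U * U | U ^ U | a | b
OPS = [lambda x, y: (x + y) & MASK,
       lambda x, y: (x * y) & MASK,
       lambda x, y: x ^ y]
OP_NAMES = ["+", "*", "^"]

def random_playout(depth=0):
    # Randomly derive a terminal program; returns (eval_fn, text form).
    if depth > 2 or random.random() < 0.5:
        name = random.choice(["a", "b"])
        return (lambda a, b, n=name: a if n == "a" else b), name
    i = random.randrange(len(OPS))
    lf, ls = random_playout(depth + 1)
    rf, rs = random_playout(depth + 1)
    fn = lambda a, b, i=i, lf=lf, rf=rf: OPS[i](lf(a, b), rf(a, b))
    return fn, "(%s %s %s)" % (ls, OP_NAMES[i], rs)

def score(fn):
    # Average similarity over all IO samples, based on the Hamming
    # distance between candidate and oracle output (1.0 = exact match).
    total = 0.0
    for a, b, out in SAMPLES:
        total += 1.0 - bin((fn(a, b) ^ out) & MASK).count("1") / 8
    return total / len(SAMPLES)

best_score, best_text = 0.0, None
for _ in range(5000):
    fn, text = random_playout()
    s = score(fn)
    if s > best_score:
        best_score, best_text = s, text
    if s == 1.0:  # same IO behavior on every sample: synthesis done
        break
print(best_score, best_text)
```

With this many playouts, a candidate with score 1.0, such as `(a + b)` or `(b + a)`, is found virtually every run; the MCTS in the real system replaces the blind loop with guided exploration of the derivation tree.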
What we did is implement everything in our framework called Syntia. Basically, Syntia is a program synthesis framework for code deobfuscation. It is written in Python, it performs random IO sampling for assembly code based on the Unicorn engine, and it implements MCTS, the Monte Carlo tree search based program synthesis, as its core. We published it under GPL version 2, and this is the link where you can get it.

So we will just do a quick demo. First, I will show something about the program synthesis itself. Basically, we define an oracle, a function that we treat as a black box and use for IO sampling, and what we want is to synthesize x plus x plus y plus y. So we just run it. What you can see here is that we start with different nodes, that we try out different things, and always assign a higher reward to the better ones. And you also see, in this case, it somehow learned that there must be a lot of additions. At some point, we get something like this: basically v1 plus v1, plus v2 plus v2. This isn't the simplest form, but it is simple enough that we can easily simplify it to something like two times the one input plus two times the other input. We can do this another time, and we might take a completely different path, because it's all probabilistic; perhaps one playout was better than another, then we choose another path. It might take between five seconds and one minute; it basically depends on the quality of our inputs, how we choose a path, and things like that. But in the end, we will usually reach our goal, and if not, we just start again. So the first run took nearly 20 seconds; now it took 24 seconds, and you see we now have a much less complex version in our non-simplified form. Okay. So how do we use this to synthesize obfuscated code? Let's have a look at the source. The unobfuscated version at the top, this is the plain code.
Basically, it takes five inputs, performs an addition and a multiplication, and returns the value. Note that this is a function, which means our final value, the return value, will be stored in EAX. Then it is obfuscated with Tigress using mixed Boolean-arithmetic, and we get something like this. This is a little bit longer. We then compile that and get an assembly listing of about 760 lines. We don't symbolically execute it here, because that would take a really long time. Instead, we just learn it. What we do is random input sampling. We give it a code file, which is our instruction trace, we define the architecture, we take, let's say, 20 random I/O samples, and we write them to an output file. You see, this took less than five seconds. If we look at the file, what we see is that we have different outputs. Here, this is one output; this is another output. We basically have 17 or 18 outputs, and we start counting at zero. We are only interested in EAX, so we will just synthesize that in a few minutes. And you see we have here our five memory inputs, the parameters of the function, and also some registers that were treated as inputs. Okay. What we then do is define our sample synthesis, which is a wrapper around MCTS. It reads the input file from before, and here I put index zero because we only want to synthesize the first output now. So we just start it. It basically takes the sampling file as input and an output file. And again, this might take between a few seconds and one minute; it depends. So we are ready, and we can have a look at the result. You see, this is the output. The best non-terminal is some integer that cannot be derived any further, plus a concrete variable. The best terminal, so the best program that cannot be derived any further, is a memory cell times a memory cell plus a memory cell. And this is our final expression.
Basically, we learned that the whole semantics is the first memory read times the third memory read plus the fifth memory read. So, okay. Coming back to our slides. How can we use this to break virtual machine-based obfuscation? A quick reminder: the goal of VM-based obfuscation is to introduce a new CPU architecture that you cannot analyze easily with your standard tools; you have to manually reverse engineer all the handlers and things like that. Well, one simple thing we can do is learn the semantics of arithmetic handlers, such as whether the semantics is an add or a NOR or something like that. Moritz discussed four different hardening techniques, and we will talk about how we can break all of these. The first one was obfuscating the individual VM handlers with more complex obfuscation schemes. The second: duplicate handlers and transform them somehow so that they don't look the same. The third: no central VM dispatcher, so we don't have a fetch-decode-execute loop; instead, each handler inlines the concrete calculation of the next handler address. And the fourth: no explicit handler table. Okay. What you see here is a handler of Themida. It looks quite complex, and you don't easily see, also not with symbolic execution, what it is doing. Well, if we ask Syntia, it says the semantics is an addition of the 13th and the 14th memory read, stored into a 64-bit value. Okay, that solves this problem. The next problem: duplicated VM handlers. Assume we have a handler table where we can see each handler, or we have an instruction trace where we know where each handler is located. Given that we know where the handlers are, how can we learn duplicated VM handlers? Well, at least we learn their simple semantics. We can learn, based on our grammar, whether it's an add, an XOR, a sub, a shift left, things like that. And of course, we can find duplicates for free, because we learn their semantics, and if the semantics are the same, we know they are duplicates.
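To illustrate the "duplicates for free" observation: two handlers that are syntactically different but semantically identical produce the same outputs on shared random probes, so grouping handlers by their I/O behavior exposes the duplicates. This is a simplified sketch with hypothetical names (`fingerprint`, `find_duplicates`), using plain callables in place of emulated handlers.

```python
import random
from collections import defaultdict

M = (1 << 64) - 1  # 64-bit mask

def fingerprint(handler, probes):
    """I/O fingerprint: the handler's outputs on a fixed list of random inputs."""
    return tuple(handler(a, b) & M for a, b in probes)

def find_duplicates(handlers, n_probes=30):
    """Group handler names whose I/O behavior coincides on all probes."""
    probes = [(random.getrandbits(64), random.getrandbits(64))
              for _ in range(n_probes)]
    groups = defaultdict(list)
    for name, h in handlers.items():
        groups[fingerprint(h, probes)].append(name)
    return [names for names in groups.values() if len(names) > 1]

handlers = {
    "h1": lambda a, b: (a + b) & M,         # plain addition
    "h2": lambda a, b: (a - (-b & M)) & M,  # syntactically different addition
    "h3": lambda a, b: (a ^ b) & M,         # xor
}
# h1 and h2 behave identically on every input, so they land in one group.
```

The same idea carries over to the synthesized expressions themselves: once two handlers synthesize to the same formula, they are duplicates by construction.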
Okay. Coming back to our Themida handler and the missing central VM dispatcher. Have a look at the end of the handler; let's zoom in. Basically, this, plus the code above it, is the code that calculates the next handler address. This calculation is a little bit too complicated for our synthesis part, but that doesn't hurt our semantic approach, because we still learn the semantics of the handler itself; since the semantics of the next-address calculation is too complex, we simply don't learn it. And one other thing: have a look at the jmp RBX. One advantage, or disadvantage, depending on how you see it, of inlining something like this is that we get a handy heuristic for finding the end of VM handlers: basically, they end with a return or with a jmp RBX. So if we split an instruction trace at indirect control-flow transfers, we split it into separate VM handlers. This helps us with the fourth hardening technique: we don't have an explicit handler table, so we don't know by static analysis where each handler is. What we can do is just execute the code and observe the instruction trace. Then we have something like this. We apply the rule from above: we split everywhere we observe an indirect control-flow transfer, and we get something like this, where, as it turns out, each color is a different handler. Then we can take these single components and synthesize each of them on its own. In this case, we automatically learned large portions of the instruction trace, and we get a built-in semantic disassembler for the new architecture for free, if you want to see it that way. But rather than us telling you a lot more, just try it out on your own. We released our code, and furthermore, we released all of our samples, so feel free to use it and play around with it. So, to conclude: we talked about obfuscation techniques, mainly opaque predicates, VM obfuscation schemes and their hardening, and mixed Boolean-arithmetic.
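The trace-splitting heuristic mentioned above can be sketched in a few lines: walk the instruction trace and cut after every return or register-indirect jump. The function names and the textual trace format here are illustrative assumptions, not Syntia's actual interface.

```python
def is_indirect_transfer(insn):
    """Heuristic handler boundary: a return or a register-indirect jump."""
    mnemonic, _, operand = insn.partition(" ")
    if mnemonic == "ret":
        return True
    registers = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"}
    return mnemonic == "jmp" and operand.strip() in registers

def split_trace(trace):
    """Split a linear instruction trace into VM handler candidates."""
    handlers, current = [], []
    for insn in trace:
        current.append(insn)
        if is_indirect_transfer(insn):
            handlers.append(current)
            current = []
    if current:  # trailing instructions without a final transfer
        handlers.append(current)
    return handlers

trace = [
    "mov rax, [rbp]", "add rax, rcx", "jmp rbx",  # first handler
    "xor rcx, rcx", "ret",                        # second handler
]
chunks = split_trace(trace)  # two handler candidates
```

Each resulting chunk can then be sampled and synthesized on its own, which is exactly how the trace from the demo was decomposed into colored handler regions.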
And of course, they can all be used together, and they all look mainly the same if we work on deobfuscation at the instruction-trace level. Then we discussed symbolic execution for syntactic deobfuscation. We saw that it's really great, but it has drawbacks if the code is syntactically too complex. So, on the other hand, we asked ourselves: what about not looking at the code level, but instead looking at the semantic level, using the code as a black box to obtain I/O samples and learning the semantics from those? This is program synthesis. And as a final reminder, you find everything on our GitHub. Thank you very much. Thank you very much indeed for this creative solution to an existing problem. We now have about 10 minutes for Q&A. You know the drill: there are two microphones in the middle and one to the left, please line up. Also, signal angel, do you have questions from the internet? Okay, none. Any other questions from the audience here? One, okay. In the back, please. All right. Yes, I was wondering... And please try to go out silently. All right. I was... Okay, do you still hear me? I was wondering, when you take a look at these obfuscation techniques, wouldn't it be possible for a closed-source operating system to just prevent any debugging of code? Say we could encrypt the code with some key that only the operating system has access to, and only shovel it into a protected memory area. Well, what you might do is execute it inside a virtual machine or something like that. Assume you can do that, and you can take memory snapshots and set memory breakpoints at different points of the execution. What you can then do, in combination with reverse engineering, is this: if you know that in this memory area there now lies decrypted code, because you know it was just decrypted since it is being executed now, you can take a memory snapshot, dump the code, and then feed it into something like our approach.
Yeah, also to add to this, there's also a kind of non-technical argument to make here, like the argument we had on the slides about moving your code elsewhere: to have hard guarantees on that, you need an HSM or some other hardware mechanism to enforce it, which might not always be feasible. Imagine a corporate environment where you need to supply USB dongles or something, which might be in violation of some corporate policy. So we were looking only at things we can do in the white-box attack scenario, where the attacker has everything they need to analyze the code, which is the most important point here. Hello. I would like to ask: these techniques also seem to be very useful for code optimization in compilers. Has any research been done in this direction? Yeah, lots of research. Basically, there is one big project called STOKE, from the Stanford people. They do stochastic program synthesis for superoptimization, and this was also an inspiration for us to use a stochastic synthesis approach. Thank you. Do we have a question on the left as well? Yes. Okay, go ahead. If I understood you correctly, you focus primarily on arithmetic functions, or arithmetic expressions. So couldn't you simply use techniques from numerical curve fitting? Well, this might be possible. We haven't looked into that. We mainly looked at arithmetic because we needed something to evaluate our approach against, so we designed our grammar to learn expressions like that, but we are free to choose a more complex grammar and learn much deeper expressions. Here we just wanted to deobfuscate rather easy things at the semantic level, but the approach is much more powerful. Okay, thank you. I heard that the internet is also still awake and we have a question from the signal angel. Please, signal angel. Yes, we do. Have you tried synthesizing intermediate representations?
No. It is possible in general, because Rolf Rolles and others did some work on that, and there's no reason why you cannot do this with our approach; it mainly depends on the grammar, on what you want to synthesize, and on your I/O pairs. So if you design the grammar such that you synthesize some IR, you can do it. Because we're moving so fast, I think we still have time for one question in the back. Does this work? You've been using kind of fancy words, but in the end, I think what you're doing is essentially an optimized search for an expression that looks like it gives the same semantics. Now I'm wondering: what is it that makes you believe that this is faster than just randomly trying expressions? Why do you think that the child nodes in this MCTS tree are somehow better if the metric of the parent node is better? Well, basically, we have a really large search space. For VM handlers, if you have just A plus B, the semantics is rather simple. But normally we have something like 20 to 50 inputs and several outputs, and we don't have only addition or multiplication; we have a much, much larger search space. We have 8-bit input variables, we have 64-bit input variables, we have downcasts, upcasts, zero extensions; the whole grammar has bit shifts, left shifts, arithmetic shifts, XOR, and also modulo and division and things like that. So we have lots of components that we need. For A plus B or something like that, you are probably faster with random guessing, but if you take A plus B plus 4 plus C, you are not that fast anymore. We really have a large search space, and if you use some kind of guided learning, it helps a lot. Final question, mic one. Hi.
This looks to me like very interesting work, but my only question is: what if the similarity function does not respect what the code is doing? What if, for instance, it's not the case that the similarity increases the closer you get, and so on? In shorter terms, how would you break your own system by violating the assumptions of your similarity function? Well, of course, if you just make it semantically harder, it won't work at some level anymore. For instance, if you apply some form of local encryption or something like that, this might break, or rather this will break for sure. But one of the observations we made is that this isn't done yet, which was a really interesting fact. Also, one of the main points is how you choose the boundaries of the window you analyze: where you start and where you end. In this case, we had a look at the handler level, but there's no reason not to combine several handlers, or to split handlers in half, or things like that. So there might also be ways to tackle this again by choosing the window boundaries. Thanks. Okay, those who didn't get a chance, I'm sure you can approach the speakers next to the stage once they leave. Thank you very much for the interesting talk, and let's conclude this with a round of applause.