Before I start, I would like to say that I'm really excited to present at the first edition of the NorthSec conference, and it's a big pleasure for me to be on this stage among the other speakers. Today I will be speaking about practical approaches to reversing object-oriented malware using the Hex-Rays Decompiler, which is part of the IDA Pro disassembler. This project has been done together with Alexander Matrosov, who is not here, so I'll be presenting on my own today.
Let's start with a brief overview of today's presentation. This is going to be a practical presentation, and just to define the scope of the talk, we will limit our attention to C++ as the object-oriented language, because this is the language which causes a lot of pain for reversers. We will also limit our attention to a specific compiler, Microsoft Visual C++, because some of the approaches I will be presenting today may not work, for instance, for GCC on Linux. And all the examples will be based on IDA Pro and the Hex-Rays Decompiler, because this is a very versatile tool and it is the de facto standard in the industry.
We will have three parts in this presentation. First, we'll speak about why reversing C++ malware is difficult and how it differs from, say, malware written in C. We'll take some simple examples to understand the difference. Then we'll move to malware seen in the wild, with some use cases: what the problems are with reversing this malware and how to approach them. And finally, in the third part of the presentation, I'll be speaking about how to automate some reversers' tasks using the Hex-Rays Decompiler.
Just to introduce the context, I want to bring your attention to this slide. It shows modern C++ malware used for targeted attacks.
The list is not exhaustive or complete; that's not the goal. It's just to highlight the importance of dealing with C++ code, because all this malware is developed using complex frameworks, and it is challenging for reversers and malware analysts to reverse it and understand its functionality. We can see a few notable examples: in 2010 Stuxnet, which challenged the industry, a malware targeting Iranian nuclear facilities. Then in 2012 we discovered Flamer, another malware, arguably one of the most complex to analyze. And there are recent examples such as Animal Farm and Sednit. Everything is in C++ with a complex framework.
So let's move on and speak about why reversing C++ code is a hard problem, and why it is different compared to C. Please take a look at this slide, and let's compare two snippets of code, one on the left-hand side and one on the right-hand side. On the left-hand side we have very simple code: a class Cat with one method, eat, and just one instance of this class, on which we call the eat method (I'll use the mouse to show you). So everything should be clear here. On the right-hand side the situation is a little different, because we have a base class Animal, and Cat is a child class inheriting from Animal, and we have a virtual method, eat. So the principal difference is that we don't have any virtual methods on the left-hand side, but we have a virtual method on the right.
Let's compare the code produced by the compiler to see the difference. I hope you can see this screenshot from the Hex-Rays Decompiler. We have a memory allocation for the object, we have the initialization of the object here, where the constructor is called, and we have an execution of the method eat.
So everything here is pretty straightforward, nothing special, just as if we were reversing a C program. But with a virtual method things are a little different, because now we have a call to our method eat, but it is not called directly. A pointer to this method is stored in a structure known as the virtual function table. This is the key difference: the method is not called directly, it is called through a pointer which is stored in a table. And where is this table set up? The pointer to it is initialized in the constructor, at the creation of the object. Why does this happen? Because the method is virtual, and at compile time the compiler generally doesn't know exactly which method is going to be executed: the one implemented in class Cat, or potentially the one from some other class derived from the base class Animal, with its own implementation of eat. So it encodes this information at the time of creating the object and then uses the table of virtual functions to execute the call.
Let's speak about the virtual function table we saw on the previous slide. We have an instance of class A and we have a virtual function table. This is a table which lists pointers to the virtual methods of this class; it is the rectangle in the middle. Additionally, we may have some extra information called the RTTI object locator. RTTI stands for run-time type information, and it is optional data which may be inserted by the compiler. The compiler inserts it when you're using dynamic_cast or typeid, because in this case the runtime environment needs to know information about the types during the execution of the program. So this information is stored in the binary as well.
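The mechanism described above can be modelled in a few lines. This is only a toy Python model of virtual dispatch, not real compiler output: the dictionary key `vfptr`, the table names and the helper `call_virtual` are made up for illustration, and the model assumes Animal has its own (non-pure) implementation of eat.

```python
# Toy model of virtual dispatch. Objects are plain dicts; "vfptr" stands in
# for the hidden pointer that the compiler stores at offset 0 of the object.

def animal_eat(this):
    return "animal eats"

def cat_eat(this):
    return "cat eats mice"

# One table per class: "function pointers" in declaration order.
ANIMAL_VFTABLE = [animal_eat]
CAT_VFTABLE = [cat_eat]
EAT_SLOT = 0  # index of eat() in the table

def animal_ctor(this):
    # The constructor is where the table pointer gets written.
    this["vfptr"] = ANIMAL_VFTABLE
    return this

def cat_ctor(this):
    animal_ctor(this)             # base constructor runs first
    this["vfptr"] = CAT_VFTABLE   # derived constructor overwrites the pointer
    return this

def call_virtual(this, slot):
    # What a call site does: load vfptr, index the slot, call indirectly.
    # Nothing at the call site says which class the object belongs to.
    return this["vfptr"][slot](this)

cat = cat_ctor({})
animal = animal_ctor({})
print(call_virtual(cat, EAT_SLOT))
print(call_virtual(animal, EAT_SLOT))
```

The point for the reverser is visible in `call_virtual`: the target of the call can only be resolved by finding the constructor that wrote `vfptr`.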
Just to show you an example, here we have a constructor. I hope you can see it well. This is where the virtual table pointer is set. This is the pointer to the virtual table, and here is the layout of the table: we can see the pointers to the methods over here, and this is the RTTI object locator information.
So what's the problem with virtual functions? First, they lead to indirect method calls, which are difficult to analyze in a static context. When you're looking at the program statically and at some point you encounter a call to a virtual method, it is not clear where exactly control is going to be transferred. To identify this, you need to track the object back to its creation, find the constructor at the particular place where the object is instantiated, and from the constructor grab the pointer to the virtual table. That is also generally a hard task, because you may, for instance, be reversing some code which reads objects from a vector and executes some routine from their virtual table, but where these objects came from is not clear.
Another problem with C++ reversing is C++ templates. This is another way to implement polymorphic types. For instance, as we can see here, we have vectors of integers, strings, characters or some custom type, and every time you use one of these types the template is expanded by the compiler and the code is inserted into the binary, so a lot of extra code is produced to analyze. This is the first problem. The second is that C++ templates create a problem with identifying standard library code. For instance, IDA Pro has a special technology, abbreviated FLIRT, which can highlight standard library code, as shown here, so you don't have to reverse it. You just know that this is, say, the memory allocation operator new or some other routine.
With templates, the code is not pre-generated. Library code is pre-generated and you just link against the library, so you can put signatures on these functions and then easily identify them. In the case of templates, you regenerate the code each time. For instance, if you're using a vector, the vector is defined in a header file, and as a result the code handling the methods of the vector is compiled together with your application, and you can play with different compiler optimization options to produce different code for the same class.
Just to summarize the problems with reversing C++: generally we have two problems. First, to reconstruct the types. We need to find the places where the classes are instantiated. We need to reconstruct their attributes. We need to find the virtual tables, and if we have RTTI information in the binary, it is always useful to reconstruct the class hierarchy: which interfaces and which classes the particular instance we are analyzing inherits from. So this is the first part.
Let's now take a look at some real malware which we've come across while working on malware analysis. First I'm going to speak about Flamer. Flamer is arguably one of the most difficult malware to analyze. One of the samples we received is about six megabytes; it is statically linked with the Lua interpreter, the SQLite database and a lot of additional code to analyze. First let's take a look at its framework, because it's very common for such complex malware to be implemented as a framework. Here, right in the center, we can see a vector of tasks. The tasks are routines which are executed in separate threads, and there are different kinds of tasks here.
For instance, there is a task, Munch, which implements a fake Windows Update server to distribute the malware within the organization. It emulates a Windows Update server, and when a Windows computer requests an update, the malware sends its binaries as the update, so the computer installs and runs them; it used a forged certificate to be able to bypass authentication. The file finder is responsible for finding files on the machine according to some pattern. There is a task called Idler which is responsible for executing some tasks periodically. We retrieved all these names from the configuration information in the binary. We didn't have any RTTI information in the binary, but the names were kept in the configuration, maybe for debugging purposes or for communication with the C&C server, so this is how we got them.
So Idler has some other tasks which are executed periodically. For instance, Beetlejuice uses the Bluetooth hardware installed on the computers to be aware of other infected instances in networks which are air-gapped, which are not in the same network segment. There is the Frog module, which is responsible for replicating the malware by creating user accounts. Euphoria is for distribution via removable media using link (LNK) files.
There is a vector of consumers. These are a kind of trigger. In Flamer they are called consumers, and they are triggered when some specific event happens. For instance, the mobile consumer is triggered when a mobile device is attached to the infected computer; the media consumer, when some media is inserted into the computer. And there is a vector of command executors; those modules are basically responsible for dispatching commands from the command and control center.
So let's speak about some of the types we were analyzing while we were working on this malware.
A lot of the work was with standard types like smart pointers, strings, vectors which hold objects, and some other custom data types. Maybe these types are taken from some standard libraries. We were not able to identify them among the different libraries we came across, and it is very likely that these are custom implementations. And since they are used very heavily, they obstructed the analysis significantly.
For instance, smart pointers. For those who are not aware of what smart pointers are, these are objects which help to manage pointers in C++ to avoid memory leaks. They contain two fields here: one is the pointer to the object and the second is the number of references. So the smart pointer keeps track of the references, and when the number of references reaches zero you can dispose of the object and release the resources. And this is just an example of what a smart pointer constructor looks like: we are allocating memory for the reference counter and initializing the smart pointer with the object over here.
Another type which is heavily used in this kind of malware is the vector. This implementation of vectors is definitely something different from, for instance, the Standard Template Library vector implementation. The vectors are used for handling objects like tasks and triggers. It is very common that, instead of just creating a thread on the machine, the malware creates a special thread object, inserts it into a vector, and then passes this vector to an executor, which goes through the elements of the vector and executes them. And those steps are implemented in different parts of the binary, so when you have a vector it's not clear exactly what kind of objects are in it. And since this is implemented using templates, you cannot automatically recognize it as standard code.
And of course strings. Strings are used very heavily, for paths for instance, and a lot of work was done with them.
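The two-field smart pointer described above (object pointer plus reference count) can be sketched as a toy Python model. This is not Flamer's implementation; the names `RefCount`, `SmartPtr`, `copy`, `release` and the `on_release` callback are hypothetical and only mirror the behaviour described in the talk.

```python
class RefCount:
    """Separately allocated counter shared by all copies of a pointer."""
    def __init__(self):
        self.count = 1


class SmartPtr:
    """Two fields, as in the talk: the object and the shared reference count."""
    def __init__(self, obj, on_release, counter=None):
        self.obj = obj
        # A fresh pointer allocates its counter; a copy shares the existing one.
        self.counter = counter if counter is not None else RefCount()
        self.on_release = on_release  # hypothetical stand-in for the destructor

    def copy(self):
        self.counter.count += 1
        return SmartPtr(self.obj, self.on_release, self.counter)

    def release(self):
        if self.counter is None:      # already released, nothing to do
            return
        self.counter.count -= 1
        if self.counter.count == 0:   # last reference: dispose of the object
            self.on_release(self.obj)
        self.obj = None
        self.counter = None


released = []
first = SmartPtr("task", released.append)
second = first.copy()
first.release()    # one reference still alive, object is kept
second.release()   # count drops to zero, object is disposed of
print(released)
```

In a decompiled binary the same logic shows up as an allocation for the counter in the constructor and a decrement-and-test before every destructor call, which is why recognizing the pattern once pays off across the whole sample.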
Maybe about 70% of all the code we reversed is vectors, strings and smart pointers.
Okay, so now let's speak about the approaches here. What is the methodology for reversing object-oriented malware, and specifically C++? Here I want to bring to your attention that when working with object-oriented malware, we are shifting the focus to type reconstruction. When you're working with C code, it is mostly all about reconstructing the control flow graph. You need to understand where a function is called from, where it transfers control to and which API routines it uses, just to grasp the context and describe its functionality. In C++ things are a little different, because a lot of the work happens between types. So instead of reconstructing the control flow graph, we need to concentrate on what types are implemented in the binary and how they are interconnected with each other. To achieve this, we first need to identify object constructors, to see where the objects are created and to grab information about virtual functions, et cetera. Then we need to reconstruct the objects and their methods to understand their functionality.
About constructors: sometimes it's tricky, but generally you can recognize constructors using a clear pattern. For instance, there is a memory allocation using operator new, and then the pointer is passed, as shown on this slide, to some routine over here. We are allocating memory and then passing it to this routine, which inside does the initialization and assigns the virtual table.
Okay, so when we have identified the constructors, we are going to reconstruct the object attributes. We can efficiently use a C structure representation to describe the objects. In IDA Pro we can use the local types to create the structure describing an object, like here.
For instance, this is the structure description of one of the objects in Flamer. When it comes to virtual functions, as we can see on this slide, we have a virtual function table for a CSocket object. We can create a C structure to describe it in IDA Pro, where the fields are the names of the methods. We can then modify the pointer at the very beginning of the object structure, struct CSocket, to point to this table. And when we decompile the code, we get this nice picture, which is very clear and straightforward. We have a structure, we have a vtable and we have an execution of this routine which we named, and this is readable code. Without this, we would just have some calls going to some arbitrary offsets. So this is about Flamer.
The other malware we will consider is one of the modules of Operation Sednit. We'll specifically restrict our attention to one of the modules, which is Xagent, and tomorrow my colleague Joan Calvet will be speaking about this operation in detail, covering all the modules. So let's take a look at the Xagent framework. It is a little less complicated than the framework of Flamer. However, we can see here that it has a lot of similarities. For instance, we have a vector consisting of objects which implement the agent module interface. These are the modules which are executed by the malware; they are kind of standalone and communicate with the C&C server. And by the way, one thing to mention here is that this modular framework is used in such a way that it is very easy for the malware developers to develop malware which operates on multiple platforms. For instance, we have modules with the same framework which are compiled for Linux and for the Mac operating system, but the interfaces and the modules are the same.
So we have here an agent kernel, which provides some basic services such as local storage, crypto and some reserved APIs, and a channel controller. The channel controller is a kind of multiplexer. It multiplexes messages between channels, and a channel is the object used to communicate with the C&C server. In this case we can see that there is one channel here, HTTP, which implements communication with the C&C server over the HTTP protocol. We have some modules here. The file system module is responsible for handling events related to the file system and scanning for files. The remote keylogger, as you can guess from the name, provides keylogging functionality. The process retranslator is a remote shell.
Here is the reconstructed architecture of the agent modules. All these modules implement a single interface, the agent module, consisting of five routines. The first two are used to exchange information with the C&C server: receive message, to receive commands from the C&C server, and send message, to report something to the C&C server. Then there are the routines get module ID and set module ID; each module has its own identifier. And the final method is execute module, which actually runs its functionality.
Okay, how did we get all these names on this slide? Thanks to RTTI, because the samples we analyzed contained RTTI information. This allows us to recover type names, reconstruct the class hierarchy and identify the objects' virtual function tables. Here is a screenshot from the ClassInformer plugin, which automatically goes through the binary and extracts all the RTTI information, and we can see here, for instance, that the agent kernel implements two interfaces, the agent module and IKernelProvider. So this information is very useful for reversers. This is the good thing. The bad thing is that RTTI is not always available in the binary and it is very compiler-specific.
For instance, RTTI information is not present for classes which don't have any virtual methods, because in this case the compiler knows the type of the class at compile time and there is no need to insert RTTI into the binary. And this, for instance, is a screenshot of the virtual table of the remote keylogger module, reconstructed using this RTTI plugin.
Let's take a look at some of the modules implemented here. The local data storage, as we can see here, implements two interfaces, ILocalParameterStorage and ILocalDataStorage. The parameter storage is actually a registry reader/writer, so it can read and write data in the registry, and the file reader/writer does the same thing, but with files on the hard drive.
The crypto interface is used to encrypt messages in communication with the C&C server. It has only three methods here: validate buffer, encrypt buffer and decrypt buffer. The crypto interface is based on an RC4 implementation, and one particularity which I want to bring to your attention is that to encrypt a message it uses a key which is hardcoded in the binary, but it also generates a random salt of four bytes. When it encrypts a message, it generates four random bytes, concatenates them with the key and uses the result to encrypt the message, and then it appends the salt to the end of the message. So every time the same message is encrypted with the same key, the ciphertext is going to be different.
And there are some other reserved APIs; we have only the APIs listed at the bottom of the slide: get volume serial number, create mutex and shell execute.
We also saw some types used heavily, strings and containers, but in this case it was fairly easy to identify that these are Standard Template Library classes. Over here, for instance, we can see the strings found in the binary, and on the right-hand side we can see the source code from the header files.
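The salted RC4 scheme described above can be sketched in Python as follows. Only the overall shape comes from the talk (RC4, a hardcoded key, a random four-byte salt combined with the key, the salt appended to the message). The function names, the use of `os.urandom`, plain textbook RC4 and, in particular, the salt-before-key concatenation order are assumptions made for illustration.

```python
import os

SALT_LEN = 4  # the talk mentions a four-byte random salt


def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: key scheduling followed by keystream XOR."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def encrypt_buffer(key: bytes, plaintext: bytes, salt: bytes = None) -> bytes:
    """Encrypt with RC4 under salt+key and append the salt to the result."""
    if salt is None:
        salt = os.urandom(SALT_LEN)
    # Assumption: the salt is placed before the key; the talk only says the
    # two are concatenated.
    return rc4(salt + key, plaintext) + salt


def decrypt_buffer(key: bytes, blob: bytes) -> bytes:
    """Split off the trailing salt, rebuild the session key and decrypt."""
    body, salt = blob[:-SALT_LEN], blob[-SALT_LEN:]
    return rc4(salt + key, body)


hardcoded_key = b"example-key"  # placeholder, not a key from any sample
blob = encrypt_buffer(hardcoded_key, b"report to C&C")
print(decrypt_buffer(hardcoded_key, blob))
```

Because the salt travels with the message, the receiver needs only the hardcoded key, while two encryptions of the same message still produce different output.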
So basically they are using strings and vectors from the standard library, which made the analysis significantly easier, because we know exactly what kinds of types are used and there is no need to reverse them from scratch.
Okay, so now let's move on to the third part of this presentation and speak about how we can automate some tasks of reversing C++ applications, C++ malware. We'll be speaking about a tool which is a plugin for the Hex-Rays Decompiler. This is an open source tool which was first presented at REcon 2013. It was inspired by our work on the Gapz bootkit, when we were dealing with position-independent code. That was when we first conceived this tool, and later we realized that some of the functionality which was suitable for working with position-independent code may actually be applied to object-oriented reversing.
Before we speak about the functionality, we need to introduce some basic things about the Hex-Rays SDK and how it works, because this is a plugin built on the Hex-Rays SDK. To represent a decompiled function, Hex-Rays uses a special tree structure which is called the ctree. Basically this is a syntax tree which consists of nodes called citem_t. There are nine maturity levels, because this structure is generated over several iterations, and you can actually hook at any level, but we are using the final maturity level, when the structure is ready to use. On this slide we can see how, for instance, the expression variable three equals variable two plus variable seven is represented. This is the assignment operator: on the left-hand side of the assignment we have variable three over there, and on the right-hand side we have an add node with variable two and variable seven. This is the underlying structure. And these nodes, the items, may be of two types: expression or statement.
A statement may be a block, which is here in the curly brackets, or an if, a for or a while. An expression is everything that has a type. You can do whatever you want with the ctree structure, but you have to make sure that the type information is consistent, as shown here. If you're assigning to an integer, you have to assign a similar type. Well, in this case it's a DWORD, but that's okay, because DWORD and integer have the same properties here. The DWORD is obtained as a result of the call to this routine, and this is the signature of the routine: it returns a DWORD and takes two parameters, a DWORD and a byte. We can see the DWORD here, and the other parameter is just not on the screenshot. So you can do any manipulations with these nodes, but you have to make sure that the type information is consistent.
Why have we chosen the Hex-Rays Decompiler? Because it is assembler-independent. Hex-Rays supports x86, x64 and ARM, so we don't have to care about a specific assembly language. You are just working with this ctree structure, and a lot of the work to construct it is handled automatically by Hex-Rays, so we are just using this ctree representation, and it also offers some classes to do traversal, like ctree_visitor_t and ctree_parentee_t. There should be a demonstration here, but I think I'm going to skip it, because we're running out of time and I have two presentations, so I'll spend a little more time on the second one because it is more important.
So how are we using the ctree, and how is it useful in reversing object-oriented malware? We can reconstruct types using this representation. For instance, as input we can take a pointer to the object instance and its constructor, and as output we can get a representation of this object, as shown here. To be able to reconstruct it, we need to pay attention to some specific nodes of the ctree.
These nodes are memptr, idx, memref, call, ptr and asg (assignment). For instance, memptr is a dereference of a pointer to a structure at a specific offset. Idx is the use of an index to address a specific element in an array. Memref is a reference to a member, and so on. For instance, we can see over here on the left-hand side that there is a reference to this field, the virtual function table, and we can reconstruct this reference using the ctree representation. We are monitoring memptr: we are tracking this keyword, and we see that at offset zero there is a dereference of this pointer, so there is a field within this object, and we can build the structure this way.
This is another example, where we're using ptr in an assignment. We can see that we have this code over here, which is this plus 212, and this is an assignment operator, so we can detect that there is some field at offset 212 from the beginning of the object. So we are creating a field there: since we know that the type of the assignment is integer, we suppose that there is a four-byte field at offset 212, and as a result we can reconstruct the structure automatically, as shown here, which saves a lot of time in reversing.
Here we have another example. We have this in the decompiler output: a1 plus 12. We are casting it to a pointer to DWORD, then dereferencing the pointer and assigning some value over here. This is represented in the ctree structure like this, and we can understand that at offset 12 there is some member which received this value, and this is how we get this structure definition.
Over the last few weeks we have been working on a special edition of HexRaysCodeXplorer for the NorthSec conference, and in this special edition we implemented automatic virtual table identification and type reconstruction.
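The reconstruction idea described above (watch for dereferences of the object pointer plus a constant, record offset and size, then emit a structure) can be sketched on a toy tree. This is not the Hex-Rays API and not the plugin's code: `Node`, `store`, `field_accesses` and `emit_struct` are hypothetical helpers, the op names only loosely mirror the ctree node kinds named in the talk, and field types are guessed from the access size alone.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str                                    # "block", "asg", "ptr", "cast", "add", "var", "num"
    kids: list = field(default_factory=list)
    value: object = None                       # variable name, constant, or cast size in bytes


def store(offset, size, value, base="a1"):
    """Build the tree for *(T *)(base + offset) = value, with sizeof(T) == size."""
    target = Node("var", value=base)
    if offset:
        target = Node("add", [target, Node("num", value=offset)])
    deref = Node("ptr", [Node("cast", [target], size)])
    return Node("asg", [deref, Node("num", value=value)])


def field_accesses(node, this_name, found):
    """Collect {offset: size} for every *(T *)(this + offset) in the tree."""
    if node.op == "ptr" and node.kids[0].op == "cast":
        cast = node.kids[0]
        target = cast.kids[0]
        if target.op == "var" and target.value == this_name:
            found[0] = max(found.get(0, 0), cast.value)
        elif target.op == "add":
            base, off = target.kids
            if base.op == "var" and base.value == this_name and off.op == "num":
                found[off.value] = max(found.get(off.value, 0), cast.value)
    for kid in node.kids:
        field_accesses(kid, this_name, found)
    return found


def emit_struct(name, fields):
    """Print a C structure, padding unknown regions with byte arrays."""
    ctype = {1: "char", 2: "short", 4: "int", 8: "long long"}
    lines, cursor = ["struct %s {" % name], 0
    for off in sorted(fields):
        if off > cursor:
            lines.append("  char gap_%X[%d];" % (cursor, off - cursor))
        lines.append("  %s field_%X;" % (ctype[fields[off]], off))
        cursor = off + fields[off]
    lines.append("};")
    return "\n".join(lines)


# Three stores like the ones on the slides: offset 0 (vtable), 12 and 212.
body = Node("block", [store(0, 4, 0x401000), store(12, 4, 0), store(212, 4, 1)])
fields = field_accesses(body, "a1", {})
print(emit_struct("recovered_type", fields))
```

A real implementation would walk the decompiler's own tree and also follow the object pointer through calls and assignments; the sketch only shows why one pass over the constructor already yields a usable structure skeleton.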
So basically we are scanning the binary for virtual tables, so we know where a virtual table starts and where it ends, and we can create a structure describing the virtual table, and once we are reconstructing objects, we can create an automatically generated type. For instance, this is the type for the local data storage, and it has two virtual tables over here, this one and this one. We automatically generate the structures for these tables and show them here, and these are the tables which are located within the binary. This version supports 64-bit IDA Pro as well. So I have a demonstration to show you how it works in action.
Okay, you can see it over here. This is a module of Xagent. Let's go to this routine. I'm just trying to adjust the resolution of the screen. So this is the one, and okay, let's try it. I think this is better. So we're just going here. This is a routine of the Xagent module; I just commented some fields here. For instance, we can go to the constructor of the agent kernel, which looks like this. This is the pointer to the instance of the object, passed as a parameter in the ECX register, and this is how the fields are initialized. So let's reconstruct the object here. We need to enter the name of the type. We are entering agent kernel, and we have automatically reconstructed fields over here, and we have created structures for the virtual tables. So we open the subviews and go into local types. We can go down and see this one is here, and this is it.
We can go on and reconstruct the other one, the channel controller, for instance. Let's go here; we can see the same kind of constructor. We are reconstructing the type and entering channel controller. Again, this type is reconstructed, so we are just exiting from here, and we can see that this variable is then assigned to this one. So I need to cast it to the structure agent controller which we created.
This is it, and then we go back to the channel controller and we can see that this is stored in this field here. So this is of type agent controller. We can rename it as the channel controller, and the most important thing is that we need to set it to our type as well, so we enter channel controller. Okay, and now, following this variable, we can go to this method, which is register channel. This is another object which is created. So we're just going to go here and see that this is the pointer to the agent controller which we just reconstructed. We need to cast the type to struct agent controller and see here if it is being used. Okay, I think we can see it here. Now we have this automatically reconstructed at this point, and we can see that this method is executed, so we can just navigate into it, and this is it. This is how it works, because otherwise you would have this call, which is rather meaningless. You don't know what is at this offset without reconstructing the type. So this is it for the demonstration.
Okay, let's go on. Just to wrap up the presentation, I would like to say a few words about the next plans for the project. We are going to switch to IDAPython, because right now the plugin is developed in C++; when we started to work on the plugin there was no Python interface, so we did everything in C++. But why Python? Because its interface is now more consistent. When a new version of IDA or of the IDA SDK is released, the developers sometimes change something in the interface, so sometimes your code no longer compiles because of changes in the SDK, and you need to see what the changes are and adjust the code. With Python this is okay, because the interface is pretty much consistent and it's not changing, so it is way easier to develop.
For instance, this small plugin, which just prints the types of the blocks in the ctree structure, looks very nice, very neat and very small; in C++ you would have got something bigger.
And some plans for further research and development, things that we would like to improve. First, our version works for the Visual C++ compiler. It would be interesting to see if it works on samples compiled with GCC, and to add support for the GCC compiler. Another interesting thing is to find cross-references to object attributes. For instance, when you have a type and there is some field in this type which is being referenced, you want to get the other methods which are modifying this field. And another interesting thing is to look at code similarity based on data flow analysis, for instance to find similar code based on the patterns of access to structure fields. Existing tools which allow you to assess how similar one piece of code is to another sometimes don't handle this very well, because the code is recompiled with different compiler options and optimizations, and sometimes these optimizations render the code completely different. Inlining functions and omitting frame pointers make it difficult, but generally the data access patterns should persist across these transformations, so this is an interesting field for research.
So for me that's it for today. Thank you for your attention. And if there are any questions, I can take them.