All right, thank you, Fred David, for the warm introduction. This is actually my second time at NorthSec, and I'm really pleased to give a presentation on this stage alongside the other speakers; it's a really great event. Today I will be speaking about distributed reverse engineering for large-scale malware processing. In a nutshell, we have a large collection of malware, millions of samples, and we wanted to distribute the reverse engineering of this collection across a small cluster. This presentation is co-authored by me and Alexander Matrosov; he is not here today, he is a security researcher with Intel in the United States. I would also like to give credit to Gabriel Barbosa and Rodrigo Branco, the researchers with whom we collaborated on this work, which was originally presented at Black Hat 2015; at the bottom of this slide there is a reference to it.

I will start with a small disclaimer: the opinions presented in this talk do not necessarily reflect the opinions of our employers, and we are responsible for all the results and mistakes, so blame us, not our employers.

Let's start with a picture of the hardware. This is the small cluster we were using in this research. You can see everything is well set up: we had about nine machines, and there is even an AC unit here to cool the system down when it works really hard. However, if you look from a different angle, there is still some room for improvement. Interestingly, we were working on this research in June and July, it was really hot in Hillsboro, Oregon, and at some point the electricity went down, so we are really thankful for the fire and smoke detection system which prevented a disaster. I hope Rodrigo's insurance company won't see this slide.

Here is the agenda of today's presentation. We will start with an introduction explaining our objectives and motivation: why we did it this way and why we actually started this research. Then we will go to a detailed overview of the process: how we processed the malware and which malware analysis algorithms we applied to it. After that we will move on to a discussion of the results we obtained. Finally, we will present some validation of the methodology and tool set, to confirm whether the approach we decided to take is correct or whether it contains flaws and needs improvement. Then I will wrap up the presentation with some conclusions and acknowledgments.

Now, about the motivation. I think yesterday's presentation by Olivier Bilodeau and Hugo about DevOps was a very nice illustration that malware is growing at an absurd pace and we need a scalable analysis environment. That is what we tried to build, because right now the process is focused on single-sample analysis, and this is mostly manual work. So we decided to try to scale our analysis algorithms to a large collection. We also tried to provide research material so that other researchers in this area can contribute as well, use our results, and build their own research on top of them. The objective was to demonstrate the possibility of in-depth, large-scale malware analysis.
And I want to highlight the words "in depth" here, because right now there are quite a few frameworks that allow you to do large-scale analysis, but in most cases it is quite shallow: they let you do some kind of filtering of whether a sample is malware or not. By in-depth we mean complete disassembly of the malware sample. We also tried to distribute IDA Pro with the Hex-Rays decompiler across the cluster. Why did we pick IDA Pro? Because it is an indispensable tool for malware analysis, widely used in the industry, and we were interested in whether we could run it automatically, in batch mode, on a cluster. Another objective was to share with the community all the information we generated: IDB files, scripts, dumped intermediate representations, and so on.

Let's talk about the scope of this research, the scope of the project. We limited our sample set to 32- and 64-bit PE executables, that is, executables for the Microsoft Windows platform, which were not packed. We were using only static analysis; I will explain why on the next slides, but this is the reason we need unpacked samples. We did not impose any restrictions on the size of the malware, so we had samples in our collection as large as several tens of megabytes. And we gave preference to the Microsoft Visual C++ compiler: we analyzed samples compiled with Microsoft Visual C++ because we were using HexRaysCodeXplorer, which currently works only with Microsoft Visual C++ object-oriented types. We were using the cluster that Rodrigo and Gabriel worked on; they presented it at Black Hat 2012, and that paper contains the complete description of the infrastructure. In brief, we had about nine machines sharing 72 cores and 100 gigabytes of RAM.

Here is a brief overview of the methodology, of the process. We started with pre-processing: we took our samples from open sources and filtered out everything that was not a 32- or 64-bit, non-packed sample. At the input we had almost eight million samples. After filtering out all the samples that did not meet these criteria, we ran the rest in the analysis environment and applied the different malware analysis algorithms. We obtained the results of decompilation and disassembly and stored them as files in the local file system. In the next step we structured and aggregated this information so that we could compute statistics and do data analysis, which was done in phase number four.

In this research we used only static analysis, for the following reasons. Generally, dynamic analysis takes longer than static analysis, and since we had a very large sample set, performance was really crucial for us. With static analysis you do not have to spend overhead time setting up an execution environment: you just load the file into the disassembler and it works pretty fast, unless the sample is obfuscated or packed, or it exploits a vulnerability in the analysis environment. Of course, there are some limitations: when you disassemble statically, you need to make sure there is good coverage of the disassembled code.
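To go back to the pre-processing step for a moment, a rough illustration of the phase-one filter is sketched below. It uses the pefile library and a simple section-entropy heuristic to spot packed files; this is an assumption for illustration purposes, not the exact filter used in the research.

    import pefile

    IMAGE_FILE_MACHINE_I386 = 0x014C
    IMAGE_FILE_MACHINE_AMD64 = 0x8664

    def keep_sample(path, entropy_threshold=7.0):
        """Pre-filter: keep 32/64-bit PE files that do not look packed.
        The entropy check is only an illustration; a real pipeline would
        also rely on packer signatures (UPX and friends)."""
        try:
            pe = pefile.PE(path, fast_load=True)
        except pefile.PEFormatError:
            return False                      # not a PE file at all
        if pe.FILE_HEADER.Machine not in (IMAGE_FILE_MACHINE_I386,
                                          IMAGE_FILE_MACHINE_AMD64):
            return False                      # not x86 or x64
        # very high section entropy is a common hint that a sample is packed
        if any(s.get_entropy() > entropy_threshold for s in pe.sections):
            return False
        return True

Anything that fails such a check would be dropped before ever reaching the cluster.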
One such coverage limitation is dead code that the disassembler is not able to identify because it is not reachable from the entry point. However, IDA Pro has very good heuristics and generally provides very good coverage of the disassembly, so that is why we decided to go with this tool.

In addition, we used the following software to get additional data. We used HexRaysCodeXplorer to dump the ctrees of the functions IDA recognized. I will explain what a ctree is later; for now you can consider a ctree to be an abstract intermediate representation of a decompiled function. We also extracted object-oriented types from the file using HexRaysCodeXplorer. The ctrees are dumped from the IDB file. An IDB file, as I am sure everyone knows, is the file where IDA Pro stores all the results of the disassembly, and I will frequently use the term IDB file to refer to the results of the disassembly. We also computed the ctree depth, that is, how far the function is from the entry point in the call hierarchy. We were also interested in seeing how much malware actually uses the newer instructions: AES-NI, the hardware implementation of the Advanced Encryption Standard, and the GETSEC instruction, which works with the Intel SMX technology, Safer Mode Extensions. Additionally, we studied this-pointer usage. The this pointer is an implicit pointer used in object-oriented languages to refer to the instance of an object, and we paid attention to its actual implementation within the code. For instance, Microsoft Visual C++ implements the this pointer by means of the ECX register, or RCX if we are speaking about the 64-bit platform; however, with some optimizations it can use other registers, such as ESI or EDI. So we were really interested in looking at how the this pointer is actually implemented by these compilers. And we also analyzed cryptographic functions, functions implementing cryptographic primitives, identified using the IDAscope plugin.

Let's take a look at how this process works for dumping ctrees. First, we load a file into IDA Pro and wait until auto-analysis is completed and the functions are identified. Then we enumerate those routines in the disassembled file and start dumping them. We processed only a certain number of the routines, because some IDB files contain up to a thousand routines or more; in that case, first, we do not have enough space to store all the data, and second, not all the data is really interesting, because for instance there are many small routines which just contaminate the data set since they are present in every sample, so we decided to filter them out. As a result, we processed the first 60 routines of size larger than 160 bytes, 30 crypto routines (for example AES implementations) spotted by the IDAscope plugin, and the first 60 other routines bigger than 60 bytes. The constants 60 and 160 were chosen empirically, based on our experience during manual analysis. Once we find a function, we decompile it using the Hex-Rays decompiler. As a result we obtain the intermediate representation, the ctree, and we serialize the ctree to a string; before serializing, we normalize the ctree.
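Put together, the per-sample dump pass might look roughly like the following minimal IDAPython sketch. It assumes a recent IDAPython (the module names differed slightly in the IDA 6.8/6.9 versions used for the research), serialize_ctree() is a hypothetical stand-in for the normalization and serialization step, and the size and count limits are only illustrative.

    import idautils, idc, ida_auto, ida_funcs, ida_hexrays, ida_pro

    MIN_SIZE = 0x60        # skip tiny routines that only add noise (illustrative)
    MAX_ROUTINES = 60      # cap the number of routines dumped per sample

    def dump_sample(out_path):
        ida_auto.auto_wait()                           # wait for IDA auto-analysis to finish
        if not ida_hexrays.init_hexrays_plugin():      # the decompiler must be available
            return
        dumped = 0
        with open(out_path, "w") as out:
            for ea in idautils.Functions():
                if dumped >= MAX_ROUTINES:
                    break
                if ida_funcs.get_func(ea).size() < MIN_SIZE:
                    continue
                try:
                    cfunc = ida_hexrays.decompile(ea)
                except ida_hexrays.DecompilationFailure:
                    continue
                if cfunc is None:
                    continue
                # serialize_ctree() is hypothetical: normalize and stringify cfunc.body
                out.write("%s\t%s\n" % (idc.get_func_name(ea), serialize_ctree(cfunc.body)))
                dumped += 1

    dump_sample("ctrees.txt")
    ida_pro.qexit(0)       # terminate IDA so a batch run can move to the next sample

The final ida_pro.qexit(0) call is what lets the script terminate IDA so that a batch run can move on to the next sample.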
Normalization is important to make sure that functions playing the same role end up with the same ctree. For instance, you can imagine two functions that are exactly the same, except that in one function a local variable is initialized with value one and in the second function the same local variable is initialized with value two. Those functions are very similar and we want them to have exactly the same ctrees, but since different initializers are used, the ctrees will differ. So during normalization we filter out everything in the ctree that is not really relevant to the general structure of the function; I will say a few more words about this later.

Next, we dumped information about object-oriented types. Object-oriented types have two interesting properties. The first is the virtual table, which contains pointers to the polymorphic methods, basically the methods that are overridden in a subclass. To recover them, we found all the references to these pointers within the IDB file, looking for virtual tables stored in the following sections: .rdata, .data, or a section with any other name but with the DATA attribute. Usually the Microsoft Visual C++ compiler puts them in .rdata or .data, but sometimes, due to obfuscation, the sections may be renamed, so we analyzed all the data sections and found all the cross-references to virtual tables within the IDB. Then we used some heuristics to determine the size of the virtual table and the total number of methods in it. The code for those heuristics is available in the HexRaysCodeXplorer GitHub repository, so you can go there and see how we do this. Once we recognize a virtual table, we create a structure representing the table and include it in the results.

In addition to finding virtual tables, we also reconstructed type attributes. To do this, we first identify an instance of the object within the IDB file and then find cross-references to its constructor. The constructor is the part from which we actually recover the attributes, because most of the attributes are initialized within the constructor. Once we are inside the constructor, we can see that this points to a particular buffer, and we track all the references to different offsets within this buffer, together with the size of each reference. Basically, this defines the layout of the type. We keep only types that contain three attributes or more: when we first ran this on a smaller sample set, we got many types containing zero, one, or two attributes, and those are not really interesting for the results. So we filter out the small data types and keep only the ones with three attributes or more.

When dumping a ctree, we also compute how far it is from the entry point and the total number of cross-references to it. For this we enumerate the cross-references to the routine and use a breadth-first search to find out whether we can reach an entry point from it; we limited the search to 100 nodes just to make sure it does not run forever. Once we find an entry point, we calculate the distance from the entry point and the total number of cross-references.
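That breadth-first walk can be sketched as follows; the callers mapping here is a hypothetical structure built from the IDA cross-references, and the 100-node cap matches the limit just mentioned.

    from collections import deque

    def depth_from_entry(routine, callers, entry_points, max_nodes=100):
        """Return the call depth from `routine` up to the nearest entry point,
        or None if no entry point is reached within `max_nodes` visited nodes.
        `callers` maps a function address to the addresses of its callers."""
        seen = {routine}
        queue = deque([(routine, 0)])
        visited = 0
        while queue and visited < max_nodes:
            func, depth = queue.popleft()
            visited += 1
            if func in entry_points:
                return depth
            for caller in callers.get(func, ()):
                if caller not in seen:
                    seen.add(caller)
                    queue.append((caller, depth + 1))
        return None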
When it comes to identifying the AES-NI and GETSEC instructions, we used a linear-sweep approach, because it is faster. We always start processing with the section the entry point points to and scan the first half megabyte of that section with a linear-sweep disassembler. Once we identify a GETSEC or AES instruction, we look at the 15 instructions before and after it to check that they are all correctly disassembled, because it can happen that we start disassembling not at the right offset but, say, in the middle of an instruction: we would then get an AES instruction surrounded by bad instructions, which means we are probably not decoding it the right way.

Regarding the this-usage study, we checked up to 5,000 calls within the IDB file and analyzed the five instructions preceding each call to see whether the ECX register is loaded. We use some heuristics to determine whether ECX is being loaded with some value, for instance via a mov instruction or a load effective address (lea). Once we have identified all the calls that load the ECX register, we compute their percentage and include it in the results.

A few words about distributing IDA Pro. We were really surprised by the performance we got with IDA Pro, because we were actually a little skeptical at first: in single-sample analysis, IDA Pro is an interactive disassembler, it shows you a lot of dialog boxes asking you to choose different analysis parameters, it runs auto-analysis, and once auto-analysis is completed you need to enumerate the routines within the IDB file, and you need to do all of this in an efficient way. But the performance was really nice and it met our expectations. However, there are some problems, the first related to developing IDA Pro plugins: the SDK. The SDK is complex. Are there any people in the audience who have developed an IDA Pro or Hex-Rays plugin? Raise your hands. Okay, not too many people. Another thing is that the signatures of the SDK functions can change from version to version, so sometimes you have compatibility problems between different versions, say IDA 6.8 and 6.9.

As I mentioned, we got really good performance on commodity hardware; we did not have high-end systems, just about eight cores per machine and 10 gigabytes of RAM. Another observation is that most of the plugins were written for the Microsoft Windows platform: they used the Windows API and Windows types, and since our environment was running on Linux, we had to put in some programming effort to make them work there. We also observed that most plugins are not made to scale: like IDA Pro itself, they target single-sample analysis. That was one of the observations we made. And interestingly, when we ran IDA Pro with the plugins on the large sample set, we identified many bugs, both in our own code and in other plugins, so it was a very nice test. If you are interested in testing your plugin this way, please send it to us and we will see how it works.
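For reference, driving IDA headlessly over a queue of samples can be done roughly as in the sketch below: each worker launches the text-mode IDA binary (idal64 in the 6.x era, idat64 in current versions) in autonomous mode and points it at a dump script like the one shown earlier. The paths, the pool size, and the timeout are assumptions, not the exact setup used on the cluster.

    import os
    import subprocess
    from multiprocessing import Pool

    IDA_BIN = "/opt/ida/idal64"        # assumed install location of the text-mode IDA binary
    DUMP_SCRIPT = "dump_ctrees.py"     # hypothetical IDAPython script run inside IDA

    def analyze_one(sample_path):
        env = dict(os.environ, TVHEADLESS="1")      # let the text UI run without a terminal
        cmd = [IDA_BIN,
               "-A",                                # autonomous mode: answer all dialogs automatically
               "-S" + DUMP_SCRIPT,                  # run the dump script once the database is ready
               sample_path]
        try:
            return sample_path, subprocess.call(cmd, env=env, timeout=1800)
        except subprocess.TimeoutExpired:
            return sample_path, None                # give up on samples that take too long

    if __name__ == "__main__":
        samples = [os.path.join("queue", f) for f in os.listdir("queue")]
        with Pool(processes=8) as pool:             # roughly one worker per core on a node
            for path, rc in pool.imap_unordered(analyze_one, samples):
                print(path, rc)

On the cluster, a similar worker pool would simply run on each of the nine machines.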
Now I want to say a few words about HexRaysCodeXplorer. Alex and I are the developers of HexRaysCodeXplorer; the first version was released at REcon 2013, and the latest version works with IDA 6.9. However, some people approached me and told me that there are some bugs in the code. Please, if you find a bug, go to the GitHub page and report it there; I think this is the most efficient way of reporting bugs.

There are numerous features in HexRaysCodeXplorer. Originally it was designed to facilitate object-oriented code analysis and position-independent code analysis, but in this research in particular we were using only one feature, which is extracting object-oriented type information from the binary. Right now I am going to give some low-level information, because HexRaysCodeXplorer works with the Hex-Rays decompiler, which uses the ctree, the intermediate representation of the decompiled routine. Here is an explanation of what a ctree actually is. A ctree is an abstract syntax tree which represents the decompilation. For instance, on the left-hand part of the slide you can see an expression, variable three equals variable two plus variable seven, and on the right-hand side you can see how this is actually represented as a ctree. There is an assignment operator which corresponds to the equals sign; variable three is located on the left-hand side of this operator, and on the right-hand side we have an addition which takes two operands, variable two and variable seven. So this is what a ctree looks like in the decompiler and how you work with it.

Each item within a ctree is of type citem_t, and citem_t is a base class for two subclasses, one representing expressions and the other representing statements. Statements correspond directly to statements in the C language, blocks such as if or for, while an expression is something that has a type. You need to make sure that the type information in the ctree is consistent: for instance, if you are calling a function and passing it an integer, you need to make sure that the ctree item representing that integer actually has integer type. This is essentially the only requirement, because the Hex-Rays SDK allows you to do whatever you want with the ctree, as long as you keep the type information consistent. Hex-Rays also offers a convenient way of traversing these structures if you need to enumerate the items or do some transformations on them.

Okay, let's take a look at how this is used in this research. We did type reconstruction: we reconstructed object-oriented types, and for type reconstruction there are two essential steps. First, we identify the virtual table. The virtual table is related to polymorphic types, because if there is no polymorphism, there is no reason for a virtual table, since all the functions are called directly. Virtual tables contain pointers to the functions that are actually overridden, and recovering the virtual tables gives us very interesting results about the type hierarchy. The other idea we use is type attribute identification, reconstructing the type's attributes. For instance, here we can see a virtual table identified in the IDB file, and we can see a cross-reference to this virtual table from a function. If we go to that function, it is a constructor; so we have a constructor, we have the this pointer, and we have the attributes of the object. The idea is to reconstruct those attributes, and that is exactly what we do: we obtain a structure representing the type, its attributes, and its virtual table.
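As a rough idea of the virtual-table scan, the IDAPython sketch below looks for referenced arrays of code pointers inside the data sections; it assumes a recent IDAPython and is far simpler than the actual HexRaysCodeXplorer heuristics, which also handle renamed data sections and estimate table boundaries more carefully.

    import idautils, idc, ida_bytes, ida_ida

    PTR_SIZE = 8 if ida_ida.inf_is_64bit() else 4
    read_ptr = ida_bytes.get_qword if PTR_SIZE == 8 else ida_bytes.get_dword

    def count_code_pointers(ea, limit=64):
        """Number of consecutive pointers at `ea` that land in disassembled code."""
        n = 0
        while n < limit and ida_bytes.is_code(
                ida_bytes.get_full_flags(read_ptr(ea + n * PTR_SIZE))):
            n += 1
        return n

    def find_vtables(min_methods=2):
        for seg in idautils.Segments():
            # .rdata/.data are the usual homes; renamed data sections would need
            # an extra check of the segment attributes, omitted here for brevity
            if idc.get_segm_name(seg) not in (".rdata", ".data"):
                continue
            ea, end = seg, idc.get_segm_end(seg)
            while ea < end:
                # a vtable candidate should be referenced from code, e.g. a constructor
                if next(idautils.DataRefsTo(ea), None) is not None:
                    n = count_code_pointers(ea)
                    if n >= min_methods:
                        yield ea, n
                ea += PTR_SIZE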
Regarding normalizing the ctree, here is the implementation of the filter_item routine, which filters out the items within the ctree that are not relevant: we skip cast operations, some helper functions, and some other minor items. For instance, here is the complete ctree of a function: everything marked in black or blue is kept in the ctree, and everything marked in red is filtered out.

Here are some general thoughts about the pros and cons of using the Hex-Rays intermediate representation. The good thing is that you are working with platform-independent code, because the Hex-Rays decompiler works with 32- and 64-bit x86 as well as ARM, and you do not have to care about the actual opcodes, because you are working at a higher abstraction level, basically doing analysis at the source-code level. However, there are some disadvantages as well. First, Hex-Rays is designed for single-function analysis. There is an option to just decompile the whole file, but what you get is not going to be a really accurate decompilation, because the results of a subsequent decompilation may invalidate or change the results of a previous one. You can imagine two functions, function one and function two, where function two is called from function one. You first decompile function one, then you decompile function two, and during the decompilation of function two Hex-Rays identifies some parameters that were not identified when function one was decompiled. As a result, the signature of function two changes, and you need to go back and decompile function one again so that it shows a proper call to function two. So this is one of the limitations and constraints we faced in this research.
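Coming back to normalization for a moment, this kind of filtering can be expressed with a Hex-Rays ctree visitor. The sketch below uses recent IDAPython names and simply collects the structural opcodes while skipping casts, literals, and helper calls; it is only an illustration of the idea, not the actual filter_item code from HexRaysCodeXplorer.

    import ida_hexrays

    SKIP = {ida_hexrays.cot_cast, ida_hexrays.cot_num, ida_hexrays.cot_fnum,
            ida_hexrays.cot_str, ida_hexrays.cot_helper}

    class normalizer_t(ida_hexrays.ctree_visitor_t):
        """Collects the opcodes of a ctree while skipping items (casts, numeric
        and string literals, helper calls) that do not affect its structure."""
        def __init__(self):
            ida_hexrays.ctree_visitor_t.__init__(self, ida_hexrays.CV_FAST)
            self.tokens = []

        def visit_insn(self, insn):
            self.tokens.append(insn.op)
            return 0                      # 0 = keep traversing

        def visit_expr(self, expr):
            if expr.op not in SKIP:
                self.tokens.append(expr.op)
            return 0

    def normalized_ctree(cfunc):
        v = normalizer_t()
        v.apply_to(cfunc.body, None)
        return tuple(v.tokens)            # equivalent functions should yield equal tuples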
Now let's move on to the results, which we obtained by analysing all the intermediate representations produced while running this on the cluster. I have included a subset of the results here; the complete version is available in the Black Hat talk and presentation. First, we started with pre-processing of 7.8 million samples, and it turned out that only 31% of them were not packed. That is the sample set we were using, and out of this 31%, only 13% used Microsoft Visual C++ as the compiler. I think the most popular packer was UPX and its modifications.

Here are the results of the this-usage study. On this slide we can see a table with the top 10 values of this usage, that is, what percentage of a sample's calls load the ECX register. Actually, I do not see any obvious relationship between loading ECX and the prevalence of the percentage: we have a maximum at 4% of calls, but we can also see high values of 64 to 66%. From this point of view, we tried to use the ECX-register identification to automatically decide whether code is using the object-oriented model or not. For this reason we chose a threshold of around 1 to 2%: if a sample has a percentage lower than this, we consider that it is not object-oriented code and we do not consider it further in the research.

This is the information on the top 10 repeated ctrees. In total we dumped eight million ctrees from the samples, and some ctrees are very frequent: the most repeated one occurred around 40,000 times, which is about 0.4%, and the next most frequent around 14,000 times. If we take a look at the unique ctrees, we find that 30% of the ctrees are shared between different samples, while 70% are pretty much unique to specific samples. This is an interesting result which requires further investigation, because I am not sure about it: I was really surprised by this ratio, since I was expecting to see a larger portion of the ctrees shared between samples. Maybe the normalization is not configured well enough and we need to filter out more. That is regarding unique ctrees; however, if we look at the same question from the point of view of samples with repeated and non-repeated ctrees, we can see that 9% of the samples share ctrees with each other, while 91% do not.

When it comes to ctrees reaching the entry point, and the average and standard deviation of their depth, we can see that roughly 50% of the ctrees we dumped were able to reach the entry point, and the average depth is 5. That means that most of the ctrees we dumped are about 5 calls away from the entry point. This result might be interesting for, say, those who are developing emulators, because sometimes when you emulate code you need to know exactly where to stop during the emulation, and an average depth of 5 means you need to emulate up to about 5 calls to reach the routines we were dumping. And here are the ctrees with the number of cross-references to them; we show at most the top 10. We have one ctree with 11,000 cross-references to it, which is a pretty high, outstanding number, and as you can see in this table, the numbers then decrease steadily. You can go to the original version of the presentation to get the complete results.

In the last part of the presentation I am going to speak about validating the methodology we were using: we ran these algorithms on a large set of malware, and then we wanted to apply the same algorithms to a smaller set which we know is related, in order to validate them and see how well they work in that case. Here is a timeline of modern C++ malware used in targeted attacks; different groups have been identified, such as Stuxnet, Duqu, Equation, and Animal Farm. For this we chose the Animal Farm case study, because we know this malware is written in C++ and the samples are all related. There is a very nice presentation on this topic from last year, "Totally Spies!", where you can get the complete information on the analysis of this malware family. As a brief introduction: it was identified by the Canadian agency CSEC under Operation Snowglobe, and it was written in Microsoft Visual C++, which is exactly what we wanted. So we applied our algorithms to this malware and tried to compare the samples and find similarities. For instance, here we can see a comparison of Casper's and Dino's virtual tables found in the output of our tools, and we can see that there is some intersection.
There are two objects here, RunKey and AutoDel, which are shared between those two samples, so that means our virtual table identification approach works here. Here is the definition of RunKey: RunKey defines how the malware communicates with the registry. It is a base class with subclasses that define particular implementations: the malware can access the registry directly through the API, through the command prompt using the reg command, or through Windows Management Instrumentation classes. AutoDel defines how the malware removes itself from the system after infection, and again there are different subclasses: using the API (MoveFile) to remove itself, using the command prompt, or using Windows Management Instrumentation classes. Which particular subclass is used depends on the configuration parameters of the malware.

So we did identify the similarities, and we then tried to reconstruct the object attributes. On the left-hand side you can see Casper's RunKey constructor and on the right-hand side Dino's RunKey constructor, and there is a significant difference here. In Casper's case, each object is created depending on a configuration value: for instance, if the AV strategy "run key API" is set, it creates only an object with RunKeyAPI; otherwise, if the AV strategy is "run key registry", it creates only an object with RunKeyRegistry. In Dino's case, it creates one large object which contains the other objects as sub-objects. If we try to reconstruct and compare the structures, we will see that they are not equal, and in this case our approach does not work.

Let's go further. We also compared Dino's virtual function tables, and again we can see that there is an overlap in the virtual tables, so here again our virtual table algorithm works well: we are able to identify the intersection of identical virtual tables, which means they share similar code. If we go to the constructors, they are pretty much similar, not 100%, but very similar, and if we try to reconstruct the attributes of the structure, we get the same structure. So here it works well, which is very nice.

Here is the table which summarizes the shared types between the different malware families. We can see that most of them share a lot of code with each other, and it is also interesting to see that there is a small difference between Casper and Dino: Casper contains implementations of RunKeyWMI and AutoDelWMI, and Casper appeared after Dino, which means it was an evolution of Dino; they reused the code from Dino and added functionality. And here is the same information from a different point of view, the number of shared types between these malware families; we can see that Dino and Casper are highly correlated, so they share a lot of reused code.

To conclude the presentation: what we did is actually distribute IDA Pro, demonstrating that IDA Pro can be run at large scale in batch mode, although there are some steps you need to take to make it run that way. And there is a call to plugin developers: it would be a nice option if you also added a special batch mode to your plugins, not only interactive work with the user. And of course, if you want to test your plugin on millions of samples, you can send it to us and we will see how it works. We will also be releasing a special version of HexRaysCodeXplorer for the NorthSec conference, with Control-Z support among other things. We were
actually going to release it earlier, but we needed to do some code cleanup, and as Olivier said, that can take quite a long time; we hope it will be done soon, at least within the 21st century. There is no Control-Z (undo) in HexRaysCodeXplorer at the moment, so this is something we are going to address in the NorthSec edition of the plugin; it will be one of the biggest improvements. We will also add a major feature, extracting type information from the ctrees in JSON format, because right now it is dumped into a custom text file, which is not really good for the future work we are planning.

Maybe you have other feature requests, so please let us know. One idea is pattern matching for ctrees: you could find functions in the IDB by ctree pattern, that is, you input a ctree pattern and it outputs all the functions which match that pattern. This might be useful for vulnerability research, and for it we might use not only the IDB and HexRaysCodeXplorer as a base, but also the OpenREIL project released by Cr4sh. If you find any bugs or have feature requests, please submit them, and if you are interested in the project you can discuss HexRaysCodeXplorer on the REhints Gitter channel; we are really looking for feedback.

Before I conclude the presentation, I would like to give some acknowledgments to our employers, and to Ilfak Guilfanov and the Hex-Rays team for their support of this research; without them this research would not have been done. So thank you, that's it, thank you for your attention, and if you have any questions...