So I present my research about machine learning and Python. This project runs over one year, on a big, big dataset of 2 million malware samples. Here I focus on a little dataset named theZoo, which is full of malware. OK — I present a way to use machine learning and Python to cluster a malware dataset and to generate Yara rules. Does everybody know Yara? OK. I'll define the Yara rules and how to work with them for hunting: how to define different rules to hunt a family on VirusTotal or elsewhere. My name is Sébastien Larinier; my handle on Twitter is @Sebdraven. If my presentation is unclear, you can write to my email address with remarks or questions. I'm an organiser of Botconf, an international conference about botnets held each December in France — it was in Paris two years ago, this year it's in Montpellier in the south of France, and next year it will be in Toulouse, also a city in the south of France. You're welcome to present and to submit, because the CFP is open. I also maintain different open-source projects, and everything open source is directly on my GitHub — you can use the code, share it, make PRs or commits; you're welcome. So in this talk we discuss the PE format, the format for Windows executables; I focus my work only on this kind of malware. We talk about clustering, and I'll define clustering versus classification in machine learning — because machine learning has its own vocabulary, and in cybersecurity we have ours, so the goal is to mix the two and get a good comprehension of clustering and its problems. And I talk about Yara and hunting. The hunting I mainly do on VirusTotal, directly with VirusTotal Intelligence — it's very, very useful to download new samples of a family or of a new campaign.
And I define many, many Yara rules to follow different campaigns — APT campaigns or cybercrime campaigns, it depends. So, the PE format: definition and generalities. Does everybody know the PE format, or is it new? OK, I'll make a quick overview. It's made of different layers. The first layer is the MS-DOS header. It starts with a magic number, "MZ", right at the start: if you look directly at the file, the first bytes are this MS-DOS magic number. After that you have the MS-DOS stub, the little program that prints "This program cannot be run in DOS mode". Then comes the PE header, which starts with the "PE" signature. And then you have a table of sections. The table of sections describes the different sections of the executable, and a section is a piece of code, or data, or the working location when the PE is loaded by the operating system. Like ELF, you can have many, many sections — but you don't have a standard like ELF: anyone can write a compiler that creates any section name, for example with non-UTF-8 characters, and a naive parser will crash on it. So be careful: it's very, very easy to craft strange section names in a PE; you don't have a standard like ELF or DEX. The PE header is a C structure with different properties, so you can retrieve the magic number and various other information. The operating system doesn't actually use all of this information any more — you can zero out much of the PE header and your executable still launches correctly. The PE header leads to two important headers: the file header and the optional header. In the file header you have the number of sections; whether the machine is 32-bit or 64-bit is described here; and other interesting fields, like the number of symbols.
There's also the size of the optional header, because the optional header follows the file header. In the optional header you retrieve another magic number, and one very important piece of information: the address of entry point of the executable — the address of the first assembly code executed by the operating system, the first address used at start when you launch the executable. You have other fields — the linker version, the size of headers, and so on. They're not very interesting here, but all of this is documented directly in Microsoft's MSDN; it's not magic. After that you have the data directories, which are more interesting. Four directory entries are particularly interesting. The first is the import directory: the imports are the operating-system functions used by the program or the malware, so it's very interesting when you do malware analysis. You have the export table: if you developed a DLL and you expose different functions to be used by another program, they are declared in this export table. The resource directory holds the icons of your software, its strings — all that is stored directly in the resources of the software. You can also store, for example, another executable in the resources and read the new malware or software directly from there — it's a classic packing technique. Another interesting entry is the security directory: if you sign your software, the certificates are stored in the executable, directly in this entry, so you can check whether the software's developer is legitimate or not. Another interesting one is the debug directory: it holds the debug symbols from when you debug your software. Many, many malware developers forget to erase the debug information, so you get things like a nickname or the kind of operating system.
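To make those first layers concrete, here is a minimal sketch of walking them by hand with the standard library: the "MZ" magic, the e_lfanew offset at 0x3C, the "PE\0\0" signature, and the file header's section count. The crafted byte blob is a toy stand-in, not a real executable.

```python
import struct

def parse_pe_layers(data: bytes):
    """Walk the MS-DOS header to reach the PE file header."""
    # The MS-DOS header starts with the 'MZ' magic number.
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew, a 4-byte little-endian offset at 0x3C, points to the PE header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    # The PE header itself starts with the 'PE\0\0' signature.
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("no PE signature at e_lfanew")
    # The file header follows: Machine, then NumberOfSections.
    machine, nb_sections = struct.unpack_from("<HH", data, e_lfanew + 4)
    return e_lfanew, machine, nb_sections

# Toy header: 'MZ' padded to 0x40, e_lfanew -> 0x40, then a fake PE file header.
blob = bytearray(b"MZ" + b"\0" * 0x3E)
struct.pack_into("<I", blob, 0x3C, 0x40)
blob += b"PE\0\0" + struct.pack("<HH", 0x014C, 5)  # i386, 5 sections

print(parse_pe_layers(bytes(blob)))  # (64, 332, 5)
```

In practice you would use a library such as pefile rather than parsing by hand; the point is just how thin these layers are.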
So you have many, many pieces of information to build a picture of the developer. Another interesting layer of the PE is the table of sections. A section is really a piece of code of the software, or its data, or its resources. You have different fields: a name, a virtual address and a virtual size for when the section is loaded in memory, because each section of a PE is loaded directly into memory pages — if you use a debugger, you'll find the same section names and sizes in the loaded executable. Another interesting field is the characteristics, which define whether your section has execute rights, read rights or write rights. A code section normally has execute and read rights: the operating system can read the section and can execute the code in it. A data section normally just needs to be read, so it just has read rights. When a malware uses a technique like process hollowing, it creates a memory page with all rights — readable, writable and executable. So statically or dynamically, if you see a section with all three rights, it's not classic behavior: it's a classic sign of process hollowing, because the malware allocates the memory page, pushes code into it and executes that code directly on the page, so it needs all rights on the memory page. You'll find a very good description of the PE format by Ange Albertini; from the Corkami project you can download the different posters and documents describing the PE format, the ELF format, etc. Now, a few words about machine learning algorithms — does anybody know machine learning, or is it new? First, the difference between clustering and classification. Clustering is the automatic grouping of similar objects into different sets.
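The write-and-execute heuristic described for process hollowing comes down to a flag check. The constants below are the real IMAGE_SCN characteristics values from winnt.h; the rest is a minimal sketch.

```python
# Characteristics bit flags from the PE section header (winnt.h values).
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ    = 0x40000000
IMAGE_SCN_MEM_WRITE   = 0x80000000

def is_suspicious(characteristics: int) -> bool:
    """A section that is both writable and executable is the classic
    process-hollowing smell described above."""
    wx = IMAGE_SCN_MEM_WRITE | IMAGE_SCN_MEM_EXECUTE
    return characteristics & wx == wx

code_section = IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_EXECUTE  # a normal .text
data_section = IMAGE_SCN_MEM_READ                          # a normal .data
rwx_section  = code_section | IMAGE_SCN_MEM_WRITE          # all three rights

print(is_suspicious(code_section), is_suspicious(rwx_section))  # False True
```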
So you don't know your dataset, and you do clustering: you try to regroup similar objects by their different characteristics. For classification, you know the dataset. You present a new object to your classifier and the classifier says: OK, for me this new object belongs to this family — and it classifies new objects into the different families like that. So clustering is applied on an unlabeled dataset: you don't know your dataset. Classification is applied on a labeled dataset: you know the dataset and you know the different families of objects in it. For classification we use a supervised algorithm, because you know the families and you can supervise the algorithm. For clustering it's an unsupervised algorithm, because at the beginning you don't know the different families in your dataset. What does machine learning need to do classification or clustering? It needs a vector of features. The goal is to describe an object with different features. For malware, it can be the size, the import table, the number of sections — in median, legitimate software has five or six sections, so if you have a software with 100 sections, it's strange. Like that, you can describe each file with a vector of features, and we build one vector of features per file to classify or cluster the files. Each file has its own vector of features. Then there's similarity and distance: a distance is the length between two objects. In everyday life we use the Euclidean distance — when you go, for example, from Montréal to Québec, the distance between the towns is a Euclidean distance. But in mathematics and in physics you can use many, many distances. And we have a key concept: if two objects have a small distance, the objects are similar, because their descriptions — their vectors of features — are very, very near. But here, we mix the previous concepts.
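As a tiny illustration of those two ideas — a feature vector per file, and the Euclidean distance between vectors — here is a sketch with hypothetical values; the feature names and numbers are invented for the example.

```python
import math

# Hypothetical feature vectors: (file size in KB, number of sections,
# median entropy, number of imports, number of exports).
sample_a = [412.0, 5, 6.1, 120, 0]
sample_b = [415.0, 5, 6.0, 118, 0]   # a near twin of sample_a
sample_c = [9000.0, 100, 7.9, 3, 0]  # a very different object

def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Small distance -> similar objects, exactly the concept above.
print(euclidean(sample_a, sample_b) < euclidean(sample_a, sample_c))  # True
```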
We use an unsupervised algorithm on a labeled dataset. Because we want to generate Yara rules per malware family, I need clustering; I don't need classification. But to check my results, I start on a labeled dataset — because if I don't do that, I can define a vector of features but my results could be completely wrong. So I must check, after each step, whether my clustering is correct or not, and the only way to do that is to use a labeled dataset. So, malware clustering. Why cluster malware? The first reason is to create a signature that catches a complete family. Why catch a complete family? Because if you take a malware like Emotet or TrickBot, you have ten new samples per day, so it's very, very difficult to store one signature per file. Many times only the compilation data has changed and the piece of code is the same, or the packer used by the botnet changed the binary structure but the payload is the same. So if you can create one signature to catch a complete family, you gain on storage and you have a scalable solution. Another reason is to minimize false positives. Many, many papers describe malware analyses, and the analysis is very, very good, but at the end the naming of the malware is wrong — because it varies between analysts, or between versions and updates of the malware. The analyst thinks it's a new sample; no, it's not a new sample, it's a new version of a known family. So if you have signatures that catch a complete family, it's much simpler to minimize false positives when you do naming. And it's very, very useful for hunting a campaign, because everybody pushes malware to VirusTotal, so I can just sit and wait for the new samples to hunt the campaign. And the goal is to catch malware with a bad detection score on VirusTotal and analyze why.
Why do we have a bad score on VirusTotal? It's a new packer, it's a new family, the signature is not good — so it's very interesting to investigate. The first attempt at clustering: ssdeep, with fuzzy hashing. The concept of fuzzy hashing is to create a signature for each file, and then you compare the different signatures with an edit distance — the Levenshtein distance. It's not a Euclidean distance; it's a distance to compare two byte streams. You can compare two words with this distance, for example against a dictionary — is this word English or German? — you get a distance between the two words. Two identical words have an edit distance equal to zero, which is normal because they have the same letters. In this example, you have two different PlugX samples — PlugX is a RAT used in APT campaigns. You can see their SHA-256 hashes are completely different. Now, if I compute a fuzzy hash — the format of this hash is a chunk size, a chunk and a double chunk — you can see the two signatures have the same chunk size, and the chunk and the double chunk are very near. And if you compute the edit distance between the signatures, you get a 75% match. So with ssdeep you can make a first attempt at clustering. The problem with this approach: if you have many signatures, you have many, many pairwise comparisons like that, because the complexity of the computation is quadratic. It's not scalable — scalable is linear, and here it's quadratic — so it takes a lot of computing power and memory to build clusters with ssdeep. If you have a little dataset, it's very useful and it works very well. But we have a limitation.
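The edit (Levenshtein) distance used to compare those signatures can be sketched as the classic dynamic program — a minimal version, not the optimized comparison ssdeep actually ships:

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

# Two identical words have distance zero, as said above.
print(edit_distance("malware", "malware"))  # 0
print(edit_distance("kitten", "sitting"))   # 3
```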
If you have two signatures with different chunk sizes, the computation of the edit distance is stopped and the match is equal to zero — because when you compare two objects with an edit distance, the objects must be of comparable length; if they're not, you can't compare them. So if you have two malware with the same opcodes, but one has a section full of garbage data, the result of your matching is zero. But it's the same code — it's the same code. Another attempt at clustering is peHash. peHash uses the characteristics of the PE format, and clustering is possible because when it processes the hash, it computes an approximation of the Kolmogorov complexity. Kolmogorov complexity, in mathematics, tells you whether two programs are similar or not. The problem with Kolmogorov complexity is that it's impossible to calculate — it's just a theoretical measure — but you can compute an approximation, using a compression algorithm such as LZMA, and so obtain an approximation of the Kolmogorov complexity. The problem with peHash is that it's impossible to compute a distance between two files: with two hashes, they are either equal or not. So you get different families grouped under identical hashes, but you can't compute a distance between two different hashes. It's a kind of clustering, but you don't have a distance between families. Another way is to use imphash and impfuzzy. The idea of this clustering is that if two malware have the same, or a very similar, import table, they are from the same family, because they use the same operating-system functions. That's the concept, and many times you get a good result with it. Other times — for example if the malware is packed — you classify the packer, or you retrieve the packer. For example, in a campaign targeting Japan with PlugX, the attacker used WinRAR to build self-extracting executables, and the import tables were WinRAR's, not the real malware's.
So you classify WinRAR and not the malware. And on VirusTotal, if you search by this imphash, you get many, many malware, but it's not PlugX — so it's not this campaign; it's another campaign, or another malware packed by another attacker, not the campaign we want to hunt. Imphash was created by Mandiant: the goal is to compute an MD5 of the import table of the PE. Impfuzzy is the same concept, but with ssdeep — because with ssdeep you can compute a distance between import tables, since you have an edit distance to compare two signatures. The problem with imphash and impfuzzy: if the compiler changes the order of the import table, you get a different imphash — it's just an MD5 — and the impfuzzy matches at around 40%. And that's bad, because with 40% you can't make a good assessment of whether it's the same family or not. Another technique is to disassemble the binary, generate the control-flow graph and compare the control-flow graphs. I made a mistake on the slide: it's Polichombr, the name of a French reverse-engineering project. The goal is to use a fuzzy hash on the control-flow graph of instructions to detect identical or very similar basic blocks. Machoke was developed by a French company named Conix. It's the same approach, but it doesn't use Metasm like Polichombr: it uses radare2 to disassemble and to compute the fuzzy hash directly on the control-flow graph. R2graphity just prints the control-flow graph, and it's up to the analyst to compare whether the graphs are similar or not. The major disadvantage is scalability, because as with ssdeep clustering you have a quadratic complexity of comparisons. As our strategy for scalability, we decided to use two unsupervised algorithms: DBSCAN and K-means. And the dataset, which you can download from its repository, is theZoo. The dataset is correctly labeled by analysts, so it's very useful for designing your vector of features for clustering.
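The order-sensitivity weakness just mentioned is easy to demonstrate with a toy digest in the spirit of imphash — this is a simplified sketch using hashlib, not Mandiant's exact normalization rules:

```python
import hashlib

def imphash_like(imports):
    """Toy imphash-style digest: MD5 over the ordered 'dll.function'
    list (the real imphash also normalizes DLL names and ordinals)."""
    joined = ",".join(f"{dll}.{func}".lower() for dll, func in imports)
    return hashlib.md5(joined.encode()).hexdigest()

table = [("kernel32", "CreateFileA"), ("ws2_32", "connect")]
reordered = list(reversed(table))

# Same imports, different order -> different hash: the weakness noted above.
print(imphash_like(table) == imphash_like(reordered))  # False
```

With pefile, the real value comes from `pe.get_imphash()`; the sketch only shows why reordering breaks it.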
Because in machine learning, the first problem is the design of the vector of features: you must have a good vector of features to have a good classification or a good clustering. Once you have the clustering, you can directly generate a Yara rule per cluster. And if your system proves good after some months — because you hunt and check that the new malware caught by each cluster's signature is right — you can generalize the system to a bigger dataset. But at the beginning, I think the best is to start with a little dataset. A bit of math: the K-means algorithm. Written out, it's a bit complex, but geometrically it's very simple. You choose a centroid for each cluster, and after that you compute the distance of each object to the different centroids; if an object is nearest to a centroid, the object belongs to that cluster. So geometrically it's very simple to picture. The big problem with K-means: you choose the number of clusters at the start. If you know the dataset, that's simple; but if you have a big dataset and no good comprehension of it, it's a bit complicated. So the first step: you choose the number of clusters. The algorithm chooses the initial centroids by computing an index called inertia, and with the best inertia the algorithm says: OK, I've found my centroids, I stop here, I use these centroids for my clusterization. After that, the algorithm calculates the distance between each centroid and all the object vectors, and constructs each cluster — here are different examples of K-means clustering. This algorithm works very well if your dataset is heterogeneous; on a homogeneous dataset it's bad — if you look at this example, the border between these two clusters is very strange. The second algorithm is DBSCAN. You don't need to know the number of clusters at the start: the algorithm does the clusterization, and at the end it says, OK, I found 100 clusters.
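The K-means loop just described — assign each object to its nearest centroid, recompute the centroids as the mean of their members, repeat — can be sketched in pure Python on toy 2-D points (the points and initial centroids are invented for the example):

```python
def assign(points, centroids):
    """Assign each point to the nearest centroid (squared Euclidean)."""
    labels = []
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(d.index(min(d)))
    return labels

def recompute(points, labels, k):
    """New centroid = mean of the points in each cluster
    (assumes no cluster goes empty, fine for this toy data)."""
    cents = []
    for i in range(k):
        members = [p for p, l in zip(points, labels) if l == i]
        cents.append(tuple(sum(x) / len(members) for x in zip(*members)))
    return cents

points = [(0, 0), (1, 0), (10, 10), (11, 10)]
centroids = [(0, 0), (10, 10)]        # initial centroids
for _ in range(5):                    # a few refinement iterations
    labels = assign(points, centroids)
    centroids = recompute(points, labels, 2)

print(labels)  # [0, 0, 1, 1]
```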
And when it can't cluster an object correctly — when it can't decide its family — that object is noise. So it's impossible for DBSCAN to classify this object: here, the noise is the black points, because the algorithm can't decide the family during the clusterization. Here you have a good clusterization; here not, because the border is not correctly linear or separable — at the extremes it can make a decision, but inside the cluster it's more difficult. Now, the problem with the PE format is that it's a binary file, and the vector of features must be numbers — different metrics. So we must transform the binary file into a vector of features. The first step is to extract all the metadata of the PE into a JSON file: the names of the sections, the number of sections, the size of the malware, the entropy per section, the resources, the kinds of resources. You put all that information in a JSON file for each malware, and after that you build the vector of features. What's interesting as features for a malware? The sections: their name, size and entropy — if you have a high entropy, the malware may be packed or encrypted. The number of modules, the number of symbols, the different functionalities — networking, encrypting files, opening files, opening processes — the exports, and the size of the file. So we build a first, naive vector of features: size of the file, number of sections, median of the entropy, number of imports, and number of exports. And doing that in Python is very simple. So, theZoo: it's not too small — there are many different files in theZoo; we have around 800 files, I think. In this piece of code, I retrieve the metadata and store it in a JSON file. After that, I check for the different collisions, and I do my featuring: I describe my vector of features — size of file, number of sections, median of entropy, number of imports, and so on — and I put all the information in a Redis database.
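The entropy-per-section feature mentioned above is just Shannon entropy over the section's bytes — a minimal sketch:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: near 0 for constant data,
    near 8 for random-looking (packed or encrypted) data."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(shannon_entropy(b"\x00" * 1024))                    # 0.0 (constant)
print(round(shannon_entropy(bytes(range(256)) * 4), 2))   # 8.0 (uniform)
```

A real section with entropy above roughly 7 is the usual hint that it is packed or encrypted.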
From these first vectors, I build a matrix with the different vectors of features, and you have the shape of the matrix: 5 is the number of features and 724 is the number of malware. So I have my matrix. I save my matrix in pickle format — be careful when you unpickle, because pickle is not secure; it's a bit dangerous. Then I reload my matrix and retrieve the same matrix with the same shape. This is the scikit-learn helper to use K-means. I use scikit-learn to do the clusterization; you could use TensorFlow by Google, you have different open-source options, but the advantage of scikit-learn is that it's used by many, many people, so if you have a problem or a question, it's much simpler to get support. So I define the number of clusters and the number of jobs, and I don't precompute the distance matrix, because that is n-squared in complexity. And I run my clustering. Each number is the label of a cluster: at the top left is the first malware — the algorithm classified it in cluster 41 — and the second malware is in cluster 8. I retrieve the cluster centers, so I have 90 centroids, because I chose 90 clusters. After that, I compute a distribution: you have the label on the left, and on the right the number of malware in that cluster. If you make a pie chart, you see the distribution: in cluster 8 I have the majority of my dataset, then you have different representative, interesting clusters, and many, many in "others". Why "others"? Those are the clusters with just one, two or three malware. So I have many, many malware spread across different clusters. Then I check the same thing with DBSCAN. It's the same: you load the matrix, you call DBSCAN, and you define min_samples for DBSCAN — the minimum number of samples the algorithm needs to make a family. My dataset is very heterogeneous.
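The scikit-learn calls the slides walk through can be condensed into a small sketch. This assumes scikit-learn and NumPy are installed, and the tiny matrix is a toy stand-in for the real 724×5 feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Toy feature matrix: rows are samples, columns are (normalized) features.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0]])

# K-means: the number of clusters is chosen up front, as explained above.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # one cluster label per sample
print(km.cluster_centers_)   # one centroid per cluster

# DBSCAN: no cluster count; min_samples=1 lets a single malware be a family.
db = DBSCAN(eps=0.3, min_samples=1, metric="euclidean").fit(X)
print(db.labels_)            # -1 would mark noise (impossible here)
```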
So I put the value at one directly, so I can have a family with a single malware. My metric is Euclidean, and I define the number of jobs for the computation. Then I compute the distribution. If you check cluster one, we have many, many Equation Group malware — with some false positives, but it's not bad. In cluster 28 we also have many Equation Group. If I make a pie chart, I have this partition. Why this partition? Because my dataset is very heterogeneous. But we retrieve the same family: cluster number one of DBSCAN and cluster 8 of K-means are the same — so it's not bad. But if you look carefully at the distribution of the different families, you have various false positives. Why? Because we have not normalized the values of the vector of features. And if we check the different features here, the size of the file is a feature with much larger values than the others; so another malware with the same size gets classified into this family, because we have not normalized the vector of features. We see the same with DBSCAN. Normalizing the vector of features is very simple: I take the max of each feature's values and divide by that value. Now all the values of my features are between zero and one: it's the same input, but with a normalized vector of features. So it's the same call in scikit-learn; only your input matrix changes. Look at the distribution now, cluster zero: we have only Equation Group malware — only one family, only one, only one, and so on. And if I look at label one, we have Equation Group, two more Equation Group and a Shamoon — so we have false positives in this first cluster. And if we do the same with DBSCAN, we don't have false positives: in the first cluster we now have only Equation Group malware.
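The max-normalization step described above is one line per column — divide each feature by its maximum so every value lands in [0, 1] and no single feature dominates the distance. The numbers are hypothetical:

```python
import numpy as np

# Raw feature matrix: the file-size column dwarfs the others.
X = np.array([
    [412_000.0, 5, 6.1, 120],
    [415_000.0, 5, 6.0, 118],
    [  9_000.0, 3, 7.9,   3],
])

# Divide each column by its maximum: all features now live in [0, 1],
# so the Euclidean distance weighs them comparably.
X_norm = X / X.max(axis=0)

print(X_norm.max(axis=0))  # [1. 1. 1. 1.]
```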
And for the other families: if you check the distribution of cluster three, it's only Volatile Cedar — just Volatile Cedar, no malware of any other family. So for this dataset, my vector of features is good, and it's enough, because I have a good clusterization for each family: PotaoExpress, Duqu, and so on. The difference between the Equation Group clusters is just different versions of the downloader, so it's normal to have two Equation Group families. On this dataset, DBSCAN has good results because our dataset is very, very heterogeneous, with families of one or two malware, and DBSCAN likes to classify this kind of malware. Now, rule generation, quickly. I use yaraGenerator, a tool developed by Xen0ph0n. It just builds the rule from the intersection of strings — nothing more: not opcodes, not disassembly, just the intersection of strings. On the Equation Group clusters, I generated a rule and put this rule directly on VT, and over the last six months I found 39 new Equation Group samples. And yesterday — I don't know why — a Korean uploader pushed a hundred-odd Equation Group samples, 12 of them different, to VT; I don't know why, but I have the results in my mailbox, so after the conference I'll check. But the very important point is that I have no false positives: all the malware I found with these Yara rules, generated automatically from my clustering, were uniquely Equation Group malware — this family of Equation Group malware. If the malware changes a bit, I won't find it; but that's not my goal, I'm focused on this family. So: machine learning is not magic, whatever many vendors say. A big work of feature engineering must be done, with knowledge of the dataset. Here I have very simple features because we have a simple dataset. At Botconf we gave a keynote where we explained how to build a vector of features for a dataset of 2 million malware — the video is on YouTube if you want to take a look.
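The string-intersection idea behind that rule generation can be sketched in a few lines — extract printable strings per sample, intersect them across the cluster, and emit a rule. This is a toy illustration, not yaraGenerator's actual code:

```python
def extract_strings(data: bytes, min_len: int = 4):
    """Printable ASCII runs, like the `strings` tool."""
    out, run = set(), bytearray()
    for b in data + b"\x00":          # trailing 0 flushes the last run
        if 0x20 <= b < 0x7F:
            run.append(b)
        else:
            if len(run) >= min_len:
                out.add(run.decode())
            run = bytearray()
    return out

def rule_from_cluster(name, samples):
    """Rule body = intersection of the strings of every sample in the cluster."""
    common = set.intersection(*(extract_strings(s) for s in samples))
    lines = [f'        $s{i} = "{s}"' for i, s in enumerate(sorted(common))]
    return (f"rule {name} {{\n    strings:\n" + "\n".join(lines) +
            "\n    condition:\n        all of them\n}")

# Two hypothetical samples of one cluster: only the shared C2 string survives.
cluster = [b"\x00MZ\x90payload_v1\x00c2.example.com\x00",
           b"\x00MZ\x90payload_v2\x00c2.example.com\x00"]
print(rule_from_cluster("toy_family", cluster))
```

Version-specific strings fall out of the intersection automatically, which is exactly why such a rule catches the family rather than one sample.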
Machine learning is very useful as a first filter, because it's very scalable, and for building the different clusters. And after you have this clustering, you can generate the different rules — or, if you prefer IDA or a debugger, reverse the differences between samples, or the centroid of each cluster, for example, to understand the family. Thank you. If you have a question about this presentation, please go ahead.