You know, it takes a lot of effort to talk in front of a big group, and this is Andrea's first time, so let's give him a round of applause.

Can I start? Okay. Well, good morning, and thanks for being here. This is my first time at DEF CON, and it's a big pleasure to be here. My name is Andrea Marcelli, and today's talk is "Looking for the perfect signature: an automatic rule generation algorithm in the AI era". I am a PhD student at Politecnico di Torino in Italy, where I study machine learning, semi-supervised modeling, and optimization problems. I'm also a security researcher at Hispasec Sistemas, where I work on new techniques for large-scale Android malware detection.

Today we will talk about the signature generation problem, and about the algorithm and the tool that I have developed. Then we will go through a quick demo and some results of the algorithm.

So let's start talking about the signature generation problem. What is a malware signature? A malware signature is a unique pattern that indicates the presence of malicious code. The problem is that, since malware evolves over time, new signatures have to be generated very frequently.

Historically, there are two types of signatures: syntactic signatures and semantic signatures. Syntactic signatures are those based on textual strings and on binary sequences extracted from the application. This is also the industry-standard type of signature, and it is where most of the existing research focuses. Today, instead, we will talk about semantic signatures, which according to some recent literature could lead to better detection. These provide an abstraction of the program behavior.

In today's talk we will talk about Android malware and automatic signature generation, but all the concepts are very generic and can be applied to other targets too. So what are the motivations behind the work?
Well, first of all, we want to reduce the malware exposure time. Second, writing a malware signature is a very repetitive task. And keep in mind that we receive from 20 to 50 thousand new applications every day, so manual analysis is not an option. Then we have high recall and very high precision requirements. In other words, we want to lower both false positives and false negatives. And lastly, in the end it's all about saving a lot of resources and time, because writing signatures manually is very time consuming and expensive.

We will generate signatures using the YARA language. YARA has been described by Victor Alvarez, its creator, like this: YARA is to files what Snort is to network traffic. The advantage of YARA is that it is fast, and it's also the standard language used in the antivirus industry, so it's very widespread. YARA naturally supports syntactic signatures, but it also supports semantic signatures through custom modules. So you can write your own modules.

This is an example of a YARA rule. There are three main sections, although the most important one is the condition, which is where the logic of the rule is placed and also where the semantic attributes of the signature can be added.

Now let's go through the algorithm. The algorithm for automatic malware signature generation is placed within a pipeline that is very common to many antivirus software houses. The idea is that there is a submission of new APKs every day. Those APKs are analyzed through machine learning techniques, mostly unsupervised techniques, in order to automatically infer new malware families, that is, the clusters. And in the end, for each malware family, you want to generate a signature. This is where my algorithm comes into play.
In order to generate a signature, you need to start from some attributes, some features extracted from the application through the application analysis, both static and dynamic. Just to have a graphical representation, each feature could be a small square within the gray grids that you can see in the slides. It could be, for example, a URL, a permission, an intent filter, or anything else. It's very, very important that the analysis is performed very carefully, because through the analysis we extract the attributes, and having good attributes is the key to having a good signature too.

So the idea behind the algorithm is very simple. We have two applications, and we have a set of attributes for each application. Probably some attributes will be of the same type and will actually be the same between the two applications. In this case, we have some orange and some green attributes which are the same, and we can generate signatures by combining those attributes.

The reality, though, is much more complex. As you can see, in most cases you cannot find a unique pattern which is common to all the samples of the same malware family. So the problem of generating signatures is the problem of finding the attribute subsets that cover all the samples of the entire malware family. If you think about it, it's a problem that is very similar to a very well-known problem in the literature, which is the set cover problem. Actually, this is a variant of the set cover problem, which unfortunately is very hard to solve because it's an NP-complete problem. But since we are not interested in a globally optimal solution and a locally optimal one is enough, I have developed a dynamic greedy algorithm to solve the problem and automatically generate a signature. So as you can see, generating a signature is not such a big issue.
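
To make the covering idea concrete, here is a minimal sketch of a generic greedy heuristic in Python. It is not the algorithm from the talk, only an illustration under assumptions: each sample is a plain set of attribute strings, and the function name `greedy_clauses`, the `min_attrs` parameter, and all sample and attribute names are hypothetical. Each clause is the set of attributes shared by a group of samples, and the clauses together cover the whole family.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def greedy_clauses(
    samples: Dict[str, Set[str]], min_attrs: int = 2
) -> List[Tuple[FrozenSet[str], Set[str]]]:
    """Return (clause, covered sample names) pairs that together cover every sample."""
    uncovered = dict(samples)
    result = []
    while uncovered:
        # Seed a new clause with all attributes of one uncovered sample
        # (sorted so that the output is deterministic).
        seed = sorted(uncovered)[0]
        clause = set(uncovered.pop(seed))
        covered = {seed}
        while True:
            # Greedy step: pick the uncovered sample sharing the most attributes.
            best = max(
                sorted(uncovered),
                key=lambda name: len(clause & uncovered[name]),
                default=None,
            )
            # Stop when nothing is left or the clause would become too generic.
            if best is None or len(clause & uncovered[best]) < min_attrs:
                break
            clause &= uncovered.pop(best)
            covered.add(best)
        result.append((frozenset(clause), covered))
    return result


# Hypothetical family of four samples that share no single common pattern.
family = {
    "app1": {"perm:SEND_SMS", "url:evil.example", "svc:Updater"},
    "app2": {"perm:SEND_SMS", "url:evil.example", "act:Main"},
    "app3": {"perm:READ_CONTACTS", "rcv:Boot", "act:Main"},
    "app4": {"perm:READ_CONTACTS", "rcv:Boot", "svc:Sync"},
}

rules = greedy_clauses(family)
# Print the result as a disjunction (OR) of conjunctions (AND).
print(" or ".join("(" + " and ".join(sorted(c)) + ")" for c, _ in rules))
```

In this toy family no attribute is common to all four samples, so the sketch ends up with two clauses, each covering two samples. A real generator would also have to decide which shared attributes are worth keeping.
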
But the main challenge is to evaluate signatures. In order to better understand this process, I will introduce a couple of formulas. First of all there is the DNF, the disjunctive normal form, where a set of clauses is joined by logical OR and each clause is made of a set of literals joined by logical AND. A literal is one of the attributes that we extracted before through the analysis.

If we reduce each signature to a DNF, we can weight each clause. In that way, we can say that the weight of a signature is the lowest among the weights of its clauses. This weighting process is the basis of the evaluation of a signature, because if a signature has a very high weight, it means it's too specific. It will not be able to generalize and will produce a lot of false negatives. If instead it's too generic, it will generate a lot of unwanted detections, that is, a lot of false positives. So we want to generate a signature whose weight stays between two thresholds, Tmin and Tmax. That's where the optimal signatures are.

As you can understand, the key to the weighting process is to assign a good value to the attributes, and the value of the attributes is tightly tied to the value of the two thresholds, Tmin and Tmax. I wanted the entire process to be automated, and I didn't want to rely on the knowledge of the expert analyst.

One of the possible solutions is to start from a repository of YARA rules. There are a lot of public repositories of YARA rules that you can parse and reduce to the disjunctive normal form. Then it becomes a linear problem that you can solve by means of a linear programming algorithm, like the simplex algorithm. In the end, you are able to satisfy 95% of your rule set, meaning that 95% of the rules have a weight which is between the two thresholds. Why 95%?
Well, because some rule sets are so specific to a type of malware family that it's not possible to really extract useful knowledge from them.

When I actually implemented this algorithm, I ran into a new problem: the signatures were too specific, with too many attributes. The weight was so high that it was actually impossible to use them. So in order to create a better signature, we have to remove some attributes. How do we remove them? I have developed two strategies. One is very simple, the basic optimizer, which simply throws away some attributes at random. The other is the evo optimizer.

Why do we need an evolutionary algorithm to optimize a signature? Well, there are some unwritten rules about how to generate a good signature. Those rules can be placed within a genetic algorithm, which finds the best combination of them to generate the best signature. The good thing about both optimizers is that they are very fast: one takes less than one minute, the other about five minutes. So it's scalable.

I implemented the entire procedure that I have described in a tool called YaYaGen. YaYaGen is an acronym that stands for Yet Another YARA Rule Generator. And if you know Spanish, you will probably know that "yaya" in Spanish means grandmother. The goal of YaYaGen is to start from a set of application reports, which means starting from the results of the analysis of several applications, and to generate a set of YARA rules for them.

YaYaGen includes a lot of functionality. For sure, the most important part is the algorithms that generate a rule. There are two variants: the clot and the greedy. The clot is an enhanced greedy, because the problem with the greedy is that you will probably generate rules which are not homogeneous in the number of samples that they cover. The clot instead tries to generate rules that all have the same coverage, in order to create better rules.
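
The weighting scheme and the basic optimizer described earlier can be sketched in a few lines of Python. This is only an illustration under assumptions, not YaYaGen's actual code: the weight of a clause is assumed to be the sum of its attribute weights (consistent with the problem being linear), the attribute names and their values are made up, and `T_MAX = 500` is hypothetical (only a minimum threshold of 400 is mentioned in the talk). The evolutionary optimizer is not sketched here.

```python
import random
from typing import Dict, Iterable, List, Set

# Hypothetical attribute weights; in the talk these are learned from public rule sets.
WEIGHTS: Dict[str, int] = {
    "url:evil.example": 150,
    "cert:abcd": 120,
    "svc:Updater": 100,
    "rcv:Boot": 90,
    "act:Main": 80,
    "perm:SEND_SMS": 60,
    "perm:READ_CONTACTS": 50,
    "perm:INTERNET": 50,
}
T_MIN, T_MAX = 400, 500  # T_MAX is an assumed value for illustration


def clause_weight(clause: Iterable[str], weights: Dict[str, int]) -> int:
    """Weight of one AND-clause: assumed to be the sum of its attribute weights."""
    return sum(weights[attr] for attr in clause)


def signature_weight(dnf: List[Set[str]], weights: Dict[str, int]) -> int:
    """Weight of a DNF signature: the lowest weight among its clauses."""
    return min(clause_weight(clause, weights) for clause in dnf)


def basic_optimize(
    clause: Set[str], weights: Dict[str, int], t_min: int, t_max: int, rng: random.Random
) -> Set[str]:
    """Randomly drop attributes while the clause is heavier than t_max."""
    kept = set(clause)
    while clause_weight(kept, weights) > t_max:
        current = clause_weight(kept, weights)
        # Only drop attributes whose removal keeps the weight at or above t_min.
        removable = [a for a in sorted(kept) if current - weights[a] >= t_min]
        if not removable:
            break  # cannot get lighter without becoming too generic
        kept.remove(rng.choice(removable))
    return kept


too_specific = set(WEIGHTS)  # weight 700, above T_MAX
optimized = basic_optimize(too_specific, WEIGHTS, T_MIN, T_MAX, random.Random(0))
print(clause_weight(too_specific, WEIGHTS), "->", clause_weight(optimized, WEIGHTS))
```

With these made-up values the starting clause weighs 700, and whatever attributes the random choices remove, the result ends up between the two thresholds. With other weights the loop can stop early while still above `T_MAX`, which is one reason a smarter optimizer is useful.
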
Then we have two optimizers, the basic and the evo one. And some heuristics, because signature generation is all about heuristics. For example, we are very interested in including a URL if it is malicious, so there are some heuristics to understand whether a URL is malicious. There are some heuristics to throw away some attributes, so they perform some kind of filtering at the beginning.

The software package also includes the YARA rule parser for attribute weight optimization. So if you have your own set of YARA rules, you can apply this program and find out which values satisfy the rules, starting from your set.

It also supports false positive exclusion. Although a false positive match is something very rare with our automatically generated signatures, we wanted to include this option in the rule generation. So in case of an unwanted detection, you just include that detection in the process of generating a signature, and the signature will not match that false positive anymore. Everything is written in Python 3, so it is very easy to use and customize.

And finally, as you may understand, this is just a plug-in within a much bigger infrastructure, because you need an infrastructure to analyze the applications and an infrastructure to test the rules that you have generated. That's the reason why YaYaGen has been written to work directly with Koodous. Koodous is the open antivirus from Hispasec. It's free and open to everyone, and it works with Android applications, Android malware. So the great news about YaYaGen is that it's free and it's on GitHub, so any improvement is really welcome.

Now I'll go through a short demo. Okay, in this case, I'm using YaYaGen to generate a rule using the clot algorithm for the Skygofree malware family. As you can see, I gave as input some hashes of known applications that belong to this family.
What the tool is doing now is just downloading the reports of the application analysis from the internet. I said it's generic, and it is, because if you change this part, if you feed another type of report into YaYaGen, it's all about changing which attributes are selected, and then everything can be optimized. It's not tied to this type of report only.

Everything is super fast, and I didn't speed anything up. The rule that is generated is a valid rule, so you can actually use it in your system. Just be aware that you will probably need some custom modules in YARA to use it, according to the type of attributes that you use. In our case, we are using androguard and cuckoo, which are modules that are available in Koodous, but if you use it on your own platform, you will probably have to add them.

Let's just go through the rule. This rule has 17 literals, which means 17 attributes, with a score of 484, which is quite a high score and is above the threshold: the minimum threshold is set to 400. It has a coverage of 8 over 8, which means that all the samples that I gave as input are covered, because in this case the family was very tight. If the family is loosely defined, it doesn't matter: the tool simply finds more rules. And that's all for the demo.

Now let's go back to the results. I wanted to compare the effectiveness of the automatically generated rules with the human-written rules. So I started from some of the available ones, and I tried to recreate an automatic version starting from some samples that were originally detected. I ran the tests on a data set of 1.5 million Android applications collected through 2017. What came out is that all the automatically generated signatures improved the number of detections, with improvements ranging from 8% to 131% and an average of 65%. So it's a huge improvement, and all of that without generating a single false positive.
So, to conclude, I developed a set of algorithms to automatically generate YARA rules. I did it in the context of Android and using the YARA language, but as I said, everything can be easily ported to other targets and other languages. What comes out is that automatically generated rules perform better than human-generated ones.

One interesting fact is that there are several ways in which expert knowledge can be included in the YARA rule generation. For example, there are the heuristics, there is the way in which the attribute values are set, and there is also the optimization phase. So a malware expert can really increase the quality of the automatically generated rules.

And finally, the whole approach is very scalable, because generating a rule for 100 apps takes less than five minutes, and if it grows to 1,000 apps, it takes around 30 minutes, less than one hour. So it's a lot, but it's orders of magnitude less compared to the manual work.

We are working on a new version of YaYaGen which will target Windows executables. It's called YaYaGenPE. We are still working on it and still testing it, but the preliminary results are very good. So I will keep you posted if I have any news.

And that's all. Thanks a lot for attending the talk. If you have any questions, I would really appreciate talking with you off the stage.