The next talk is by JunWei and YuShiang: So You Want to Build an Antivirus Engine.
Hey guys, welcome to our talk. Today we will be demonstrating our engine, Quark. It is an obfuscation-neglect Android malware scoring system. My name is JunWei Song. I'm a security researcher and the co-founder of Quark Engine. We also have another speaker, YuShiang. He is a core member of Quark Engine.
So, this is the outline. Number one, we will introduce our malware scoring system. Number two, we will show you how we designed a Dalvik bytecode loader. Number three, we will go through two cases of real malware analysis using Quark Engine. Number four, we will share our strategy for generating detection rules. And last, the future work. We still have lots of things to do.
All right, so let's introduce the malware scoring system. As we know, when developing a malware analysis engine, it is important to have a scoring system. However, existing systems are either business secrets or too complicated. Therefore, we decided to create a simple but solid one and take that as a challenge. And since we wanted to design a novel scoring system, we stopped reading and decoding what other people do in the field of cybersecurity, because we didn't want our ideas to be subject to the existing systems. So we started to look for ideas in fields other than cybersecurity. And luckily, we found one. Yes, the best practice we found is criminal law. When sentencing a criminal, the judge weighs the penalties based on the criminal law, and after decoding the law, we found the principles behind it and developed a scoring system for Android malware. There are only eight principles decoded from the criminal law, and I will go through them in the following slides.
Now, let's see principle number one. A malware crime consists of action and target. In criminal law, the definition of a crime consists of an action and a target. For example, steal money or kill people.
So with this principle in mind, we developed a definition of a crime for Android malware. The definition is: a malware crime consists of action and target. For example, steal photos or steal your banking account passwords.
Now, let's see principle number two. We consider that the loss of fame is greater than the loss of wealth. In criminal law, physical injury is more serious than psychological injury. So the principle we decode here is: when things are hard to recover, we consider it a felony. From this decoded principle, we developed our second principle. We consider the loss of fame greater than the loss of wealth, because it's easier to make money back than to rebuild your reputation.
Okay, now let's see principle number three. Arithmetic sequence. In criminal law, a murderer is sentenced to 20 years in prison and a robber is sentenced to seven years in prison for his crime. Why 20 and seven years? Why those numbers? We found no obvious principle in the criminal law. So we use an arithmetic sequence to weigh the penalty of each crime. For example, the penalty weight of y1 is 10, y2 is 20, y3 is 30, etc.
So now let's see the most important part of the scoring system. We created an order theory, which consists of three principles: principles four, five, and six. Let's first look at principle number four. The later the stage, the more sure we are that the crime is practiced. As mentioned in chapter four of Taiwan's criminal law, each crime consists of a sequence of behaviors. Those behaviors can be categorized in a specific order. Let's take murder, for example. Determined means somebody has decided to kill someone. Conspiracy means he or she starts making a plan for the murder. Preparation means buying stuff like weapons or making arrangements for the murder plan. Start means that when things are all set, the murderer takes action and is on the way to kill someone. Practice means the murderer does pull the trigger and shoot someone.
So as we can see here, the later the stage, the more sure we are that the crime is practiced. With this principle in mind, we developed the Android malware crime order theory. In this theory, we also have five stages for a crime. For example, if a malware tries to send out your location data by using SMS, in stage one, we check if the relevant permission is requested by the malware. In stage two, we check if the key native API is called. In stage three, we see if a certain combination of native APIs exists. In stage four, we check if the APIs are called in a specific order. Finally, in stage five, we check if the APIs are handling the same register.
Okay, so now you can see from this picture, this is a two-dimensional map for Android malware crime. We put the crimes on the y-axis. And for each crime, we use the x-axis to show the evidence we caught for the crime. So x5, y1 means that in crime number one, we have found native APIs that are called in the correct sequence, and they are handling the same register. And x3, y5 means that in crime number five, we have found a certain combination of native APIs that is used in this APK.
So now let's look at principle number five. The more evidence we catch, the more penalty weight we give. So we give stage two more weight than stage one, and stage three more weight than stage two, etc.
Okay, principle number six, proportional sequence. As we decoded from the criminal law, the later the stage, the more sure we are that the crime is practiced. So we use a proportional sequence, for example two to the power of n, to represent this principle in our scoring system.
All right, principle number seven, crimes are independent events. For simplicity, we assume crimes are independent events, and penalty weights can be added up directly. So this is an example of adding up two crimes. In a malware, we found two crimes: steal your photos and steal your banking account password. So the calculation of the total penalty weight is quite simple.
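The weighting described in principles three, six, and seven can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `penalty_weight`, `evidence_proportion`, and `Crime`, the step of 10, the example stage values, and the choice to normalize 2^n by the value of the final stage are all hypothetical and are not necessarily what Quark Engine does internally.

```python
from dataclasses import dataclass

STAGES = 5  # the five stages of the crime order theory


def penalty_weight(crime_index: int, step: int = 10) -> int:
    """Principle three: weights follow an arithmetic sequence (y1=10, y2=20, ...)."""
    return crime_index * step


def evidence_proportion(stage_reached: int) -> float:
    """Principle six: stage n counts as 2**n; normalizing by stage five is an assumption."""
    if stage_reached == 0:
        return 0.0  # no evidence caught at all
    return 2 ** stage_reached / 2 ** STAGES


@dataclass
class Crime:
    name: str
    index: int          # position on the y-axis (which crime)
    stage_reached: int  # position on the x-axis (latest stage with evidence, 0..5)


def total_penalty(crimes) -> float:
    """Principle seven: crimes are independent, so their weighted penalties add up."""
    return sum(penalty_weight(c.index) * evidence_proportion(c.stage_reached) for c in crimes)


# Two hypothetical crimes found in one sample.
crimes = [
    Crime("steal photos", index=1, stage_reached=5),
    Crime("steal banking account password", index=2, stage_reached=3),
]
total = total_penalty(crimes)
```

With these made-up inputs, the first crime contributes its full weight because evidence was caught in every stage, while the second contributes only a quarter of its weight.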
For each crime, we multiply the penalty weight of the crime by the proportion of caught evidence, and add up the results of the two.
The last principle, principle number eight, the threshold generation system. After calculating the total penalty weight for a malware, we need to have threat level thresholds, so that we can tell which threat level the malware falls in. Unfortunately, we could find none in the criminal law, but we knew we needed to design a threshold generation system for that, not just give any number by intuition. So we decided that the threshold for each level is the sum, over the crimes, of the same proportion of caught evidence multiplied by the penalty weight of each crime. We know it is not a perfect one, but we are sure that we have built a foundation for future optimization.
All right, so now let's talk about the design logic of the Dalvik bytecode loader. Our Dalvik bytecode loader is actually the implementation of the Android malware crime order theory. We implement every stage of the theory. There are five stages. The first three stages are easy. We simply use the API of another open source tool, Androguard, to implement them. But in stage four, we need to do a little bit more. So before the implementation, we need to know what stage four does. In stage four, we find the calling sequence of native APIs and check if they are called in a specific order. For example, if a malware sends out your location data by SMS, then first it will call the native API getCellLocation to get your location data, and then call the native API sendTextMessage to send your location data by SMS. Normally native APIs are wrapped in functions. So we trace back to see which functions the native APIs are cross-referenced from. We call those functions the parent functions. And we keep tracing back until we find a mutual parent function for both native APIs.
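The trace-back for a mutual parent function can be sketched as a search over a reversed call graph. This is a minimal sketch, not Quark's implementation: the call graph is a hand-written dictionary rather than cross-references extracted from an APK, and the wrapper names `sendMessage`, `getLocation`, and `sendSMS`, as well as the helper names, are hypothetical.

```python
from collections import deque

# Hypothetical call graph: each function maps to the functions it calls, in call order.
CALL_GRAPH = {
    "sendMessage": ["getLocation", "sendSMS"],
    "getLocation": ["getCellLocation"],
    "sendSMS": ["sendTextMessage"],
}


def build_callers(call_graph):
    """Invert caller -> callees into callee -> callers (the cross-references)."""
    callers = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)
    return callers


def ancestors(function, callers):
    """Trace back breadth-first; map every ancestor to its distance from `function`."""
    seen = {}
    queue = deque([(function, 0)])
    while queue:
        name, depth = queue.popleft()
        for parent in callers.get(name, ()):
            if parent not in seen:  # also protects against recursive call cycles
                seen[parent] = depth + 1
                queue.append((parent, depth + 1))
    return seen


def mutual_parents(api_1, api_2, call_graph):
    """Functions from which both APIs are reachable, nearest ones first."""
    callers = build_callers(call_graph)
    first = ancestors(api_1, callers)
    second = ancestors(api_2, callers)
    common = first.keys() & second.keys()
    return sorted(common, key=lambda name: (first[name] + second[name], name))


def called_in_order(parent, api_1, api_2, call_graph):
    """True if a call leading to api_1 appears in `parent` before a call leading to api_2."""
    callers = build_callers(call_graph)
    reach_1 = set(ancestors(api_1, callers)) | {api_1}
    reach_2 = set(ancestors(api_2, callers)) | {api_2}
    callees = call_graph.get(parent, [])
    first = [i for i, callee in enumerate(callees) if callee in reach_1]
    second = [i for i, callee in enumerate(callees) if callee in reach_2]
    return bool(first and second and min(first) < max(second))


parents = mutual_parents("getCellLocation", "sendTextMessage", CALL_GRAPH)
in_order = called_in_order(parents[0], "getCellLocation", "sendTextMessage", CALL_GRAPH)
```

Because the search only follows who calls whom, the names of the intermediate functions never matter, and neither does the number of wrapper layers between a native API and the mutual parent.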
Here is the example. sendTextMessage is called by sendSMS, which is the parent function of sendTextMessage. And getCellLocation is called by getLocation, which is the parent function of getCellLocation. And if we keep tracing back, we will see that both sendSMS and getLocation share the same parent function, sendMessage. After we find the mutual parent function, we scan through the smali code of the mutual parent function and check which function is called first. So this is the smali code of sendMessage. We can see that getLocation is called first to get the location data of the cell phone, and then sendSMS is called to send out the location data.
And in stage four, we found out that our design can also overcome the obfuscation techniques used by malware. When obfuscation techniques are applied, functions other than native APIs are renamed. This makes the decompiled source code hard for humans to read. The machine can still run the code because the logic of the code remains the same. Here is the example. When applying obfuscation techniques, the native API sendTextMessage is called by function k, and function k is called by function f. The other native API, getCellLocation, is called by function e. And both functions e and f share the same parent function, a. You see, if you start reading the decompiled source code of a, it will be hard to figure out what is going on there. By the way, since our goal is to find the mutual parent function, it doesn't matter how many layers of wrappers there are.
Now, let's see the implementation of stage five. Yes, the most important part. In stage five, we need to confirm that the native APIs are handling the same register. Let's use the same example: send out your location data by using SMS. When the native API getCellLocation is called, it returns the location data of the cell phone. What we do in stage five is check whether the other native API, sendTextMessage, sends out the location data returned from getCellLocation.
So, in stage five, we simulate the CPU operation. We read the smali-like source code line by line and operate like a CPU to get two things. First, the value of every register. Second, information such as which functions have operated on the same register. To make this happen, we created a self-defined data type. We call it the register object. In each register object, we store three kinds of information. Number one, the register name. Number two, the value of the register. Number three, the function that uses this register.
Let's see the example. The register name is v7 and the value of the register is a string. The string appends the value of string 1 and the result of function 1. And then we can see that the register is used as the input of function 2. By the way, when filling in the value and the used-by-which-function field of a register object, we expand every register by cross-referencing the register objects in a table. So, for example, by cross-referencing, we know that v8 is a string called userLocation and v3 is a function called getLocation. As you can see in the lower right corner, the result of getLocation is appended to the string userLocation, and the new string is sent out by using function sendSMS. In other words, the value of register v7 is generated by function getLocation, which has native API 1 in it, and the value is used as the input of function sendSMS, which has native API 2 in it. So now we have shown that by using register objects, we can check if the APIs are handling the same register.
After we scan through the source code, we produce lots of register objects. Those register objects are organized in a two-dimensional Python list. It is a similar idea to a hash table. We use it to speed up reads and writes of the list. So now, let's see the table. As you can see here, register v4 has three register objects. That means that in the source code we scanned, v4 was used three times.
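The register object and the register table can be illustrated with a toy simulator. This is a sketch under heavy assumptions: the three-opcode instruction format below is a hypothetical simplification of smali, the `append` call stands in for whatever string operation the sample uses, and the class and function names are made up for illustration rather than taken from Quark Engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisterObject:
    name: str                      # register name, e.g. "v7"
    value: str                     # value held when this object was recorded
    used_by: Optional[str] = None  # function that consumed the register, if any


def reg_index(register: str) -> int:
    return int(register[1:])  # "v7" -> 7


def simulate(instructions, register_count=16):
    """Walk the instructions like a CPU and record a history per register.

    The table is a list of lists: table[7] holds every RegisterObject for v7.
    """
    table = [[] for _ in range(register_count)]
    pending = None  # result of the last call, waiting for a move-result

    def current(register):
        history = table[reg_index(register)]
        return history[-1].value if history else register

    for instruction in instructions:
        opcode = instruction[0]
        if opcode == "const-string":
            _, register, text = instruction
            table[reg_index(register)].append(RegisterObject(register, repr(text)))
        elif opcode == "invoke":
            _, function, arguments = instruction
            values = [current(register) for register in arguments]
            for register, value in zip(arguments, values):
                table[reg_index(register)].append(RegisterObject(register, value, function))
            pending = f"{function}({', '.join(values)})"
        elif opcode == "move-result":
            _, register = instruction
            value = pending if pending is not None else "unknown"
            table[reg_index(register)].append(RegisterObject(register, value))
            pending = None
    return table


def handles_same_register(table, producer, consumer):
    """True if some register consumed by `consumer` holds a value built from `producer`."""
    return any(
        obj.used_by == consumer and f"{producer}(" in obj.value
        for history in table
        for obj in history
    )


# Hypothetical body of sendMessage, reduced to the toy instruction format.
SEND_MESSAGE = [
    ("invoke", "getLocation", []),
    ("move-result", "v3"),
    ("const-string", "v8", "userLocation"),
    ("invoke", "append", ["v8", "v3"]),
    ("move-result", "v7"),
    ("invoke", "sendSMS", ["v7"]),
]
table = simulate(SEND_MESSAGE)
same_register = handles_same_register(table, "getLocation", "sendSMS")
```

Indexing the outer list by register number gives direct access to a register's history, which is one way to read the "similar idea to a hash table" remark.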
And every time it was used, we store the present value of the register and the function that used it, if there is one. So basically, the whole table is the history of the registers. When we finish constructing the table, we scan through all register objects in the table to check if the native APIs are handling the same register.
So now, let's see how to use Quark Engine to analyze malware. And we'll take care of this part.
Okay, so in this section, we prepared two malware samples. One is non-obfuscated, and the other one is obfuscated. And for each malware, we'll show how we detect the behavior of the malware with a detection rule.
Now let's look at the first malware. This is the non-obfuscated one. We will use a rule in Quark Engine to detect whether the malware sends out the cell phone's location data by using SMS. So this is the detailed report of Quark Engine. In this report, the engine shows the detection result of one single malware behavior, or you can say one single malware crime. So for example, we try to find out if the malware sends out your location data by using SMS. In this report, we list the evidence we found in each stage of the Android malware crime order theory. And this report shows we found evidence in every stage, which means we have 100% confidence that the malware has this behavior. So let's see. In stage one, permissions like SEND_SMS, ACCESS_COARSE_LOCATION, and ACCESS_FINE_LOCATION are requested. In stage two, key native APIs like getCellLocation and sendTextMessage are used. In stage three, we found that a certain combination of native APIs exists. In stage four, we found that in functions like sendMessage and doByte, the APIs are called in the right sequence. And in stage five, in function sendMessage, we found that those APIs are handling the same register.
So now let's think. If you are analyzing this malware and you want to trace the decompiled source code to see the evidence, how do you do it?
Our suggestion is to start backwards. That means you start from stage five. For example, in stage five, we know that inside function sendMessage there are two functions that contain the two native APIs respectively, and they are handling the same register. So you start by locating function sendMessage in the decompiled source code. And in stage four, we know that those two functions are called in the right sequence. So we can start to find the functions that contain the native APIs and check if they are really called in the right sequence. The information about the two functions and the sequence will be shown in the next version of Quark Engine.
So now let's look at the real example. Let's locate the function sendMessage. We find the two functions that contain the two native APIs respectively, sendSMS and getLocation. If we dive into the function getLocation, we'll see that it contains the native API getCellLocation. And if we dive into the function sendSMS, we'll see that it contains the native API sendTextMessage. So the code here means it first collects your cell phone location data and then sends it out through SMS. So now let's dive into the source code of getLocation. As you can see in the source code, it calls the native API getCellLocation and returns this information at the end of the code. And now let's dive into the source code of sendSMS. The native API sendTextMessage is used to send out the location info. Quite simple, isn't it?
So now let's look at the second malware. This is the obfuscated one. We will use a rule in Quark Engine to find out whether the malware detects Wi-Fi hotspots by gathering information like active network info and cell phone location. Okay, so as malware analysts, we read the report backwards. As you can see in stage five, there are functions like p.a, at view.c, and a at the round. Each of them has two functions that contain the native APIs respectively, and they are handling the same register.
And in stage four, those two functions are also called in the right sequence in functions p.a, at view.c, and a at the round. So according to this report, we can say that the malware has the behavior of Wi-Fi hotspot detection in three parts of the source code. We can pick any part for further analysis. So we pick function p.a.
So now let's locate function p.a in the source code. We find the two functions that contain the two native APIs respectively. They are ap.a and f.f. If we dive into the function ap.a, we'll see that it contains the native API getActiveNetworkInfo. And if we dive into the function f.f, we'll see that it contains the native API getCellLocation. So the code here means that after collecting information from functions ap.a and f.f, they send the information as an input to function ap.a. So now let's dive into the source code of function ap.a. As you can see in the source code, it calls the native API getActiveNetworkInfo and returns the related information at some point. And now let's dive into the source code of f.f. The native API getCellLocation is used to get the cell phone location data, and this information is processed with some other strings. At the end of this function, it returns the string with the information. As mentioned earlier, after collecting information from functions ap.a and f.f, they use the information as an input for function ap.a. And we notice one thing: function ap.a uses ByteArrayOutputStream as one of its input parameters. And we know that when we see ByteArrayOutputStream, it means the function is probably trying to write the data into a file. This is amazing, isn't it? So with Quark Engine, malware analysts can really boost their productivity.
Okay, so now I will introduce our detection rule generation strategy. So why do we need to develop a detection rule generation strategy? Because to make our engine practical and easy to use, we need to have more detection rules.
However, rule generation by humans is quite slow. And a human-generated rule is subject to his or her experience in malware analysis. So we developed a rule generation strategy to boost the production of detection rules. Our goal is to find all kinds of behavior in the malware. So if we use permissions and native APIs to generate all possible rules, we will have an enormous number of rules. After generating the rules, we then use Quark Engine to find the intersection between that enormous number of rules and the malware we prepared. In other words, we find the rules that match the malware behavior. However, this is not a good way to generate detection rules, because it's time and resource consuming.
So we developed a seven-step rule generation strategy. First, step one, we crawl all native API information from the official Android API reference. For example, this is the native API information of sendTextMessage. You can see the input parameters, the return value, and the description of this API. Okay, next, step two, we made a small modification to our engine. We ignore the permission checks in stage one of the Android malware crime order theory. In step three, we find all kinds of API combinations and generate rules without permission information. In step four, we use the modified Quark Engine to find the intersection of the rules and the malware. We call the rules in the intersection the first-stage verified rules. In other words, these rules need to be verified again. And since we don't need to generate rules with permissions and verify the permissions in Quark Engine, the whole process of rule production speeds up. Next, in step five, we generate rules with permissions inside the intersection. We have first-stage verified rules matched with malware. We then use the first-stage rules and the permissions in the matched malware to generate rules with permissions, which are the second-stage rules.
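Steps three to five can be sketched roughly as follows. Everything here is a hypothetical stand-in: the samples are hand-written dictionaries, `matches` is a toy replacement for the modified engine, rules are limited to ordered pairs of APIs, and attaching the full permission set of each matched sample is only one possible reading of how the second-stage rules are built.

```python
from itertools import permutations

# Hypothetical samples: requested permissions and an observed sequence of API calls.
SAMPLES = {
    "sample_a": {
        "permissions": ["SEND_SMS", "ACCESS_FINE_LOCATION"],
        "calls": ["getCellLocation", "sendTextMessage"],
    },
    "sample_b": {
        "permissions": ["INTERNET"],
        "calls": ["getActiveNetworkInfo"],
    },
}
APIS = ["getCellLocation", "sendTextMessage", "getActiveNetworkInfo"]


def generate_api_rules(apis):
    """Step three: every ordered pair of distinct native APIs, with no permissions."""
    return list(permutations(apis, 2))


def matches(rule, sample):
    """Toy stand-in for the permission-free engine: api_1 is called before api_2."""
    api_1, api_2 = rule
    calls = sample["calls"]
    return api_1 in calls and api_2 in calls[calls.index(api_1) + 1:]


def first_stage_verified(rules, samples):
    """Step four: keep only rules that match at least one sample."""
    verified = {}
    for rule in rules:
        hits = [name for name, sample in samples.items() if matches(rule, sample)]
        if hits:
            verified[rule] = hits
    return verified


def second_stage_rules(verified, samples):
    """Step five: attach the permissions requested by each matched sample."""
    rules = set()
    for rule, hits in verified.items():
        for name in hits:
            rules.add((rule, tuple(sorted(samples[name]["permissions"]))))
    return sorted(rules)


candidates = generate_api_rules(APIS)
verified = first_stage_verified(candidates, SAMPLES)
with_permissions = second_stage_rules(verified, SAMPLES)
```

The point of the two-pass design is visible even in this toy: permissions are only combined with the handful of API pairs that survive the first pass, not with every candidate pair.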
In step six, we use Quark Engine, this time the full-function version, to find again the intersection of the second-stage rules, which are the ones with permissions, and the malware we prepared. After that, we label each rule with the number of matched malware. For example, the behavior rule number one can be found in 100 malware. So finally, step seven, after labeling the rules, we sort the rules by the number of matched malware. We review the rules starting from the one with the most matches.
All right, last part, future work. As I mentioned earlier, we still have a lot of things to do. For example, we need to have more detection rules, and we also need to deal with SO files and packed APKs. And we want to have more features in the Dalvik bytecode loader. For example, the downloader. Applying the scoring system to other binary formats is also on our to-do list. And we noticed that APIs change between different versions of Android. We will also take care of this problem in the next version of Quark. We will probably change the core library, since Androguard has been quite inactive recently. And one more thing, actually, we are trying to make Quark easier to integrate with other tools. For example, users can import Quark as a Python library and output the analysis result as a JSON file. And Quark is now included in BlackArch and IntelOwl, which is a threat intelligence analysis tool.
And there's one nugget that I want to share. We work at the limit of our tools. When new tools come along, new things are possible. Okay, that's all for today. And if you have any questions, please feel free to DM our Twitter accounts.