So, hey guys, welcome to our live-streaming talk. Today our topic is "So, You Want to Build an Anti-Virus Engine", and we'll be demonstrating our engine, Quark. It is an obfuscation-neglect Android malware scoring system. My name is Kun Yu Chen. I am a security researcher and the founder of Quark Engine. We also have another speaker, Jun Wei. He's the co-founder of Quark Engine, and he has given talks at security conferences like HITB and DEF CON.

So, this is the outline. Number one, we will introduce our malware scoring system. Number two, we will show you how we designed the Dalvik Bytecode Loader. Number three, we will go through two cases of real malware analysis using Quark Engine. And the last thing, yes, the future work. We still have a lot of things to do.

Alright, so let's introduce the malware scoring system. As we know, when developing a malware analysis engine, it is important to have a scoring system. However, those systems are either business secrets or too complicated. Therefore, we decided to create a simple but solid one and take that as a challenge. And since we wanted to design a novel scoring system, we stopped reading and decoding what other people do in the field of cybersecurity, because we didn't want our ideas to be limited by the existing systems. So we started to look for ideas in fields other than cybersecurity. And luckily, we found one. Yes, the best practice we found is criminal law. When sentencing a criminal, the judge weighs the penalty based on the criminal law. After decoding the law, we found the principles behind it and developed a scoring system for Android malware. There are only eight principles decoded from the criminal law, and I'll go through them in the following slides.

Now, let's see principle number one: a malware crime consists of an action and a target. In criminal law, the definition of a crime consists of an action and a target. For example, stealing money or killing people.
So with this principle in mind, we developed the definition of a crime for Android malware: a malware crime consists of an action and a target. For example, stealing photos or stealing your banking account passwords.

Now, let's see principle number two: we consider the loss of fame to be greater than the loss of wealth. In criminal law, physical injury is more serious than psychological injury. So the principle we decoded here is that when things are hard to recover, we consider it a felony. With this principle decoded, we developed our second principle: we consider the loss of fame greater than the loss of wealth, because it's easier to make your money back than to rebuild your reputation.

Okay, now let's see principle number three: arithmetic sequence. In criminal law, a murderer is sentenced to 20 years in prison and a robber is sentenced to 7 years in prison for his crime. Have you ever thought about why 20 and 7 years? Why those numbers? We found no obvious principle in the criminal law, so we use an arithmetic sequence to weight the penalty of each crime. For example, the penalty weight of W1, which is stealing banking account passwords, is 10; W2, stealing photos from your cell phone, is 20; the penalty weight for W3 is 30; and so on.

So now let's see the most important part of the scoring system. We created an order theory, which consists of three principles: principle number four, number five, and number six. Let's first look at principle number four: the later the stage, the more we're sure that the crime is practiced. As mentioned in chapter four of the Taiwan criminal law, a crime consists of a sequence of behaviors, and those behaviors can be categorized in a specific order.
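As a rough illustration of principles one and three, a crime can be modeled as an action and target pair whose penalty weight comes from an arithmetic sequence. This is only a sketch in Python, not Quark Engine's own code: the names Crime and assign_weights are hypothetical, and the first term and step of 10 are taken from the W1 = 10, W2 = 20, W3 = 30 example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Crime:
    """A malware crime is an action applied to a target (principle one)."""
    action: str
    target: str


def assign_weights(crimes: List[Crime], first: int = 10, step: int = 10) -> Dict[Crime, int]:
    """Give the i-th crime the i-th term of an arithmetic sequence (principle three).

    Assumption: `crimes` is already listed in the order W1, W2, W3, ...
    """
    return {crime: first + i * step for i, crime in enumerate(crimes)}


crimes = [
    Crime("steal", "banking account password"),  # W1 in the talk
    Crime("steal", "photos"),                    # W2 in the talk
]
weights = assign_weights(crimes)
for crime, weight in weights.items():
    print(f"{crime.action} {crime.target}: {weight}")
```

A third crime appended to the list would receive 30 under the same rule, matching the W3 example.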
So let's take murder, for example. Stage one, determined: somebody decides to kill someone. Stage two, conspiracy: he or she starts to make a plan for the murder. Stage three, preparation: buying stuff, for example weapons, or arranging services for the murder plan. Stage four, start: when things are all set, the murderer takes action and is on the way to kill someone. And practice, the last stage, stage five: the murderer has pulled the trigger and shot someone. So as we can see here, the later the stage, the more we're sure that the crime is practiced.

So with this principle in mind, we developed the Android malware crime order theory, and in this theory we also have five stages for a crime. For example, if a malware tries to send out your location data by using SMS, in stage one we check if the related permission is requested by the malware. In stage two, we check if the key native API is called. In stage three, we see if a certain combination of native APIs exists. In stage four, we check if the APIs are called in a specific order. Finally, in stage five, we check if the APIs are handling the same register.

Okay, so now you can see from this picture a two-dimensional map for Android malware crimes. We put the crimes on the Y axis, and for each crime we use the X axis to show the evidence we caught for this crime. So X5, Y1 means that in crime number one, we have found native APIs that are called in the correct sequence and are handling the same register. And X3, Y5 means that in crime number five, we have found a certain combination of native APIs used in this APK.

So now let's look at principle number five: the more evidence we catch, the more penalty weight we give. So we give stage two more weight than stage one, and we give stage three more weight than stage two, etc.

Okay, principle number six: proportional sequence. As we decoded from the criminal law, the later the stage, the more we're sure that the crime is practiced. So
we consider a proportional sequence, for example two to the power of N, to represent this principle in our scoring system.

Alright, principle number seven: crimes are independent events. For simplicity, we assume crimes are independent events, so penalty weights can be added up directly. This is an example of adding up two crimes. In the malware we find two crimes: stealing photos and stealing your banking account password. The calculation of the total penalty weight is actually quite simple. For each crime, we multiply the penalty weight of the crime by the proportion of caught evidence, and we add up the results of the two.

The last principle, principle number eight: threshold generation system. After calculating the total penalty weight for a malware, we need threat level thresholds so that we can tell which threat level the malware fits in. Unfortunately, we can't find them in the criminal law, but we know we need to design a threshold generation system for that, not just pick numbers by intuition. So we decided that the threshold for each threat level is the sum, over the crimes, of the same proportion of caught evidence multiplied by the penalty weight of each crime. Yeah, we know this is not a perfect one, but we are sure that we have built a foundation for future optimization.

Alright, now let's talk about the design logic of the Dalvik Bytecode Loader. My partner Jun Wei will take care of this part.

Hello everyone, my name is Jun Wei, and I will take care of this part. So now let's talk about the design logic of the Dalvik Bytecode Loader. Our Dalvik Bytecode Loader is actually the implementation of the Android malware crime order theory. We implement every stage of the theory, and there are five stages. The first three stages are easy: we simply use APIs from another open source tool, Androguard, to implement them. As I just mentioned, the implementation of the first three stages is easy, but in stage four we need to do a little bit more. So before the implementation, we need to know what does stage
four do. In stage four, we find the calling sequence of the native APIs and check if they are called in a specific order. For example, if a malware sends out your location data by SMS, then first it will call the native API getCellLocation to get your location data, and then it will call the native API sendTextMessage to send your location data by SMS. Normally, native APIs are wrapped in functions, so we trace back to see which function the native API is cross-referenced from, and we call those functions the parent functions. We keep tracing back until we find the mutual parent function of both native APIs.

Here is the example. sendTextMessage is called by sendSMS, which is the parent function of sendTextMessage, and getCellLocation is called by getLocation, which is the parent function of getCellLocation. If we keep tracing back, we see that both sendSMS and getLocation share the same parent function, which is sendMessage. After we find the mutual parent function, we scan through the smali-like code of the mutual parent function and check which function is called first. This is the smali-like code of sendMessage. We can see that getLocation is called first to get the location data of the cell phone, and then sendSMS is called to send out the location data.

In stage four, we found out that our design can also overcome the obfuscation techniques used by malware. When obfuscation techniques are applied, functions, except native APIs, are renamed. This makes the decompiled source code hard for humans to read, but machines can still run the code because the logic of the code remains the same. Here is the example. When obfuscation techniques are applied, the native API sendTextMessage is called by function k, and function k is called by function f. The other native API, getCellLocation, is called by function e, and both functions e and f share the same parent function, which is a. So you see, if you start reading the decompiled source code of a, it will be hard to figure out what is going on there. And by
the way, since our goal is to find a mutual parent function, it doesn't matter how many layers of wrappers there are.

Now let's see the implementation of stage five. Yes, this is the most important part. In stage five, we need to confirm whether the native APIs are handling the same register. Let's use the same example: sending out your location data by using SMS. When the native API getCellLocation is called, it returns the location data of the cell phone. What we do in stage five is check whether the other native API, sendTextMessage, sends out the location data returned from getCellLocation.

So in stage five, we simulate the CPU operation. We read the smali-like source code line by line and operate like a CPU to get two things: first, the value of every register; second, information such as which functions have operated on the same register. To make this happen, we created a self-defined data type, and we call it the register object. In each register object we store three types of information: number one, the register name; number two, the value of the register; and number three, the function that uses this register.

Let's use the example. The register name is V7, and the value of the register is a string, and the string appends the value of string one and the result of function one. Then we can see that the register is used as the input of function two. And by the way, when we fill in the value and the used-by-which-function fields of the register object, we expand every register by cross-referencing the register objects. For example, by cross-referencing, we know that V8 is a string called user location and V3 is a function called getLocation. As you can see in the lower right corner, the result of getLocation is appended to the string, which is user location, and the new string is sent out by using function sendSMS. In other words, the value of register V7 is generated by using function getLocation, which has native API one in it, and the value is used as input for function sendSMS, which has
native API two in it. So now we have shown that, by using the register objects, we can check if the APIs are handling the same register.

After we scan through the source code, we produce lots of register objects, which are organized in a two-dimensional Python list. It is a similar idea to a hash table; we use it to speed up reads and writes of the list. So now let's see the table. As you can see here, register V4 has three register objects. That means that in the source code we scanned, V4 was used three times, and every time it was used, we stored the present value of the register and the function that used it, if there is one. So basically the whole table is the history of the registers. When we finish constructing the table, we scan through all the register objects in the table to check if the native APIs are handling the same register.

So now let's see how to use Quark Engine to analyze malware. Now let's get back to Kun Yu.

Hi, it's me again. In this section, we prepared two malware samples: one is non-obfuscated and the other one is obfuscated. For each malware, we will show how we detect the behavior of the malware with a detection rule.

Now let's see the first malware. This is the non-obfuscated one. We use a rule in Quark Engine to detect whether the malware sends out the cell phone's location data by using SMS. This is the detailed report of Quark Engine. In this report, the engine shows the detection result of one single malware behavior, or you can say one single malware crime. For example, we try to find out if the malware sends out your location data by using SMS. In this report, we list the evidence we found in each stage of the Android malware crime order theory, and this report shows that we found evidence in every stage, which means we have 100% confidence that the malware has this behavior. So let's see. In stage one, permissions like SEND_SMS, ACCESS_COARSE_LOCATION, and ACCESS_FINE_LOCATION are requested. In the second stage, key native APIs like
getCellLocation and sendTextMessage are used. In stage three, we found that a certain combination of native APIs exists. In stage four, we found that in functions like sendMessage and doByte, the APIs are called in the right sequence. And in stage five, in function sendMessage, we found that those APIs are handling the same register.

So now let's think: if you are analyzing this malware and you want to trace the decompiled source code to see the evidence, how do you do it? Our suggestion is that if you are reading the detailed report generated by Quark Engine, you read the report backwards. That means you start reading from stage five. For example, in stage five, we know that function sendMessage has two functions inside it, that they contain the two native APIs respectively, and that they are handling the same register. So you start by locating function sendMessage in the decompiled source code. In stage four, we know that those two functions are called in the right sequence, so we can find the functions that contain the native APIs and check if they are really called in the right sequence. The information on the two functions and the sequence will be shown in the next version of Quark Engine.

So now let's look at the real malware example. As you may remember from the previous slide, we need to locate the function sendMessage in the source code, and we found two functions that contain the two native APIs: sendSMS and getLocation. If we dive into the source code of function getLocation, we see that it contains the native API getCellLocation, and if we dive into the source code of function sendSMS, we see that it contains the native API sendTextMessage. So the decompiled source code means that this malware will first collect your cell phone location data and then send it out through SMS.

So now let's dive into the source code of getLocation. As you can see in the source code, it calls the native API getCellLocation and returns this
information at the end of the code. And now let's dive into the source code of sendSMS. The native API sendTextMessage is used to send out the location information. So that's how we use Quark Engine to find the evidence in the malware. Quite simple, isn't it?

Now let's look at the second malware. This is the obfuscated one. We use a rule in Quark Engine to find out whether the malware detects Wi-Fi hotspots by gathering information like active network info and cell phone location. Okay, so as malware analysts, we read the report backwards. As you can see, in stage five there are functions like p.a, at view.c, and af.run, and each of those functions has two functions that contain the native APIs respectively, and they are handling the same register. In stage four, those two functions are also called in the right sequence in functions p.a, at view.c, and af.run. According to the report, we can say that the malware has the behavior of Wi-Fi hotspot detection in three parts of the source code. We can pick any part for further analysis, so we pick function p.a.

So now let's see the source code. Let's locate function p.a. We found two functions that contain the native APIs respectively: ap.a and af.f. If we dive into the source code of function ap.a, we see that it contains the native API getActiveNetworkInfo, and if we dive into function af.f, we see that it contains the native API getCellLocation. So the code here means that after collecting the information, they send it as an input to function am.a.

So now let's dive into the source code of function ap.a. As you can see in the source code, it calls the native API getActiveNetworkInfo and returns the related information. At some point, the native API getCellLocation is used to get the cell location, and this information is processed with some other strings, and at the end of this function it returns the string with the information. As I mentioned earlier, after collecting information
from function ap.a and af.f, they use the information as an input for function am.a. And we notice one thing: function am.a uses a ByteArrayOutputStream as one of its input parameters, and we know that when we see a ByteArrayOutputStream, the function is probably trying to write the data into a file. So we have shown again how we use Quark Engine to find evidence of malware activity in a binary. With Quark Engine, malware analysts can really boost their productivity.

Alright, the last part: future work. As I mentioned earlier, we still have a lot of things to do. For example, we need more detection rules, and we need to deal with .so files and packed APKs. We want to have more features in the Dalvik Bytecode Loader, for example the feature of a downloader, and we also want to apply our scoring system to other binary formats. And the last thing: we will probably change the core library, since we use Androguard and Androguard has been quite inactive recently. Okay, that's all for today. If you have any questions, please feel free to ask. Thank you.

Thank you, Jun Wei. Thank you, Kun Yu. There are no questions so far. I did forget to tell people: do use the Zoom Q&A function. In the meantime, I can ask you one thing. I hear a bit of echo. It's fine, I'll endure. What made you interested in doing this work with the antivirus? Because the speaker is quite... What made you interested in this subject? What is the most interesting part for you?

The process of constructing the malware scoring system. We experienced the unexpected. For example, with the function of this engine, we found that we can neglect the obfuscation. That was an unexpected part. We went through an adventure, and we did meet something unexpected. That's the most interesting part.

I see you have your hand raised. I'll ask you to use the Q&A function. We have a question: are you planning to keep this to Android only, or do you plan to look at other platforms?

Can you say that again? Sorry.

Are you planning to keep
this on Android only, or do you plan to look at any other platforms?

We will try to apply it to other platforms, for example binary formats like ELF or PE files. The order theory can apply to other binary formats. We will do that in the future.

Thank you. Alright, so there are no more questions. Thank you again for the talk, and we will be doing a break now. Thanks so much. A lot of people at home are clapping as well.
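For reference, the scoring arithmetic described in the talk (principles five to eight) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Quark Engine's actual code: the function names are hypothetical, and the exact mapping from a stage to its proportion of caught evidence (here 2^stage divided by 2^5, with stage 0 meaning no evidence) is an assumption based on the "two to the power of N" remark.

```python
from typing import Dict

MAX_STAGE = 5  # the crime order theory has five stages


def evidence_proportion(stage: int) -> float:
    """Proportional sequence: each later stage doubles the weight.

    Assumption: stage n maps to 2**n / 2**MAX_STAGE; stage 0 means no evidence.
    """
    if not 0 <= stage <= MAX_STAGE:
        raise ValueError(f"stage must be between 0 and {MAX_STAGE}")
    if stage == 0:
        return 0.0
    return 2 ** stage / 2 ** MAX_STAGE


def total_penalty(weights: Dict[str, int], stages: Dict[str, int]) -> float:
    """Crimes are treated as independent, so weighted scores are simply added."""
    return sum(
        weight * evidence_proportion(stages.get(crime, 0))
        for crime, weight in weights.items()
    )


def thresholds(weights: Dict[str, int]) -> Dict[int, float]:
    """Threshold per level: every crime caught at the same stage, then summed."""
    total_weight = sum(weights.values())
    return {
        stage: total_weight * evidence_proportion(stage)
        for stage in range(1, MAX_STAGE + 1)
    }


# Penalty weights from the talk's example: W1 = 10, W2 = 20.
weights = {"steal banking account password": 10, "steal photos": 20}
# Hypothetical findings: the highest stage of evidence caught for each crime.
stages = {"steal banking account password": 5, "steal photos": 3}

score = total_penalty(weights, stages)
levels = thresholds(weights)
print(score)
print(levels)
```

With these assumed inputs, the score is compared against the generated thresholds to pick a threat level; how the levels are named and where the cut-offs finally sit is left open here, as it is in the talk.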