Thanks, Emilia. Hi, everyone. I'm excited to see all of you face-to-face, and thanks to those watching the live stream for joining. My name is Yuri. I'm a data scientist in the threat research group at Imperva. Imperva is a leading cybersecurity company that has been around for about 20 years, focusing on data and application security. We provide our customers a wealth of solutions, like web application firewall, DDoS protection, client-side protection, account takeover prevention, and more.
Today I'm going to tell you about one of our recent projects that combines research in security and machine learning. This project is about classifying pieces of code into obfuscated and clear-text ones. So let's start.
The motivation is clearly to help our customers prevent attacks, and one of the ways to do so is to concentrate on the client-side perspective. A typical website has many, many resources. Most of them are totally harmless, but there could be several malicious ones if attackers managed to inject them into the website. The question is how we find them and how we distinguish between the two kinds. Since a website has so many resources, this clearly cannot be done manually, and moreover even a semi-automatic process is not good enough. We need a fully automated process.
So let's take a simple example of a resource and see how it looks. Here we have a piece of code in JavaScript which is actually a keylogger. I'm not sure you can see my laser, but here at the top you can see the command-and-control endpoint, which is example.com. You can see that we are listening to keydown events and sending the data back to the command and control every ten seconds. Now let's take a look at another piece of code. It looks really similar.
You can see that the command-and-control endpoint is missing here, and you can see that we are listening to the same event and just logging it locally once every hour. I highlighted the differences here for you, and as you can see, for this very simple, almost trivial example the differences are very subtle. It requires really tedious work and attention to the tiniest of details to even spot them.
As I said, at Imperva we are interested in this problem. An example is our client-side protection product, which enables our customers to look at the resources on their website and see which ones are potentially harmful. We provide several scores and additional data for those resources, so the customers can really understand what's going on. As you saw in this really trivial example, this problem is really difficult for humans, and the question is whether it is solvable at all for machines. By machines I mean, of course, in the machine learning sense.
Just one more word about this. Before this project, the way we tackled it was to use the best data we have in the company to calculate reputation for IPs and domains, and to combine all this data in rule-based techniques. That's how we provided these scores.
Now back to our story. When we see a difficult problem, what do we do?
Well, we search for an easier one. This is, of course, for dramatization purposes only, for this presentation. The easier problem, and I'll explain in a second why it is related at all, is obfuscation.
For those of you who have never heard of it before, obfuscation is an interesting thing. Its theoretical foundations go deep into computer science and computational complexity. But to be a bit more down to earth, it is just a family of algorithms and techniques that transform your code. The transformation preserves the functionality, and hopefully the runtime, but it hides the inner structure so that the code becomes almost unintelligible to humans.
So let's take an example. This is the most trivial piece of code in JavaScript you could think of: it just prints hello world to the console log. When I take this piece of code and obfuscate it using one of the tools readily available on the internet, we end up with all this, and actually with all this too. You may say, well, this is ridiculous, this is probably some contrived example. But you are wrong, and the reason is that I used the lowest obfuscation level possible in this specific tool, which is obfuscator.io. If I used this tool with the highest level of obfuscation, we would end up with many, many pages that I would have to show you here. So this is a real-world example, and this is really how obfuscation works. And that is just for this trivial piece of code; imagine what happens when your code really does something interesting.
Of course, the question is how all this is related to maliciousness, because remember, we started with maliciousness. The answer is that it depends on the language. On the client side, for example in JavaScript, obfuscation can sometimes be used for legitimate purposes, for example for protecting intellectual property: you want to hide your code because you invested a lot of time developing it, and there is nothing malicious there. There are a few additional reasons as well. But on the server side, these two problems are really closely related, because usually you don't have to protect your code from hackers on the server side if you don't move it anywhere and don't give it to anyone. Still, if we look at these two domains together, client side and server side, obfuscation is a very interesting signal that usually helps us determine the maliciousness of a piece of code.
Okay, so what we want to do from this point on is to classify every JavaScript document as clear text or obfuscated. The question is, is it easy for humans? Let's look at another example. I hope it's not too small; I tried to make it as large as possible. Take a few moments to try to understand what it does, if you will. And then, of course, we have another piece. It looks pretty similar. Okay, so without torturing you too much: this is a part of a malicious script that I took, and it is indeed obfuscated. The first one is actually clear text. It is part of a script from YouTube that is used to speed things up if you embed YouTube content on your website. This is just a small part of that code, and it probably calculates some hash function or something like that.
As you saw, it's not really easy for humans to distinguish between the two cases. It turns out that it might take a lot of time even for a seasoned security researcher. The reason is that in order to decide whether the code is obfuscated or clear text, what this researcher usually does is really try to understand what the code does. It's not written anywhere whether it's obfuscated or not. We need to understand what the code really does, and sometimes that takes time. So it's clearly something that doesn't scale in any way.
So the question is, is it easy for machines? Of course, we didn't jump into the machine learning solution right away. We tried working with heuristics, and it turns out that heuristics are not good enough, simply because they are too specific. We will talk about the various obfuscators in a few moments, and you might then understand why. So the mission from this point on is to build a machine learning classifier for this problem.
I'll tell you about several approaches from the literature that we could use. The first approach is a classical one. There is a paper from 2016 by Tellenbach and several additional researchers in which they propose classical feature engineering, meaning extracting things like the average line length and the frequency of specific words and specific characters, and then building a decision tree.
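As a minimal sketch of the kind of feature engineering described here, the snippet below computes an average line length plus the relative frequency of a few characters and words. The function name `extract_features`, the tracked characters and words, and the simple identifier regex are all assumptions chosen for illustration; they are not taken from the paper or from our model.

```python
import re
from collections import Counter


def extract_features(script, tracked_chars=("\\", "%"), tracked_words=("eval", "unescape")):
    """Compute a few hand-crafted features for one script.

    The tracked characters and words are hypothetical examples,
    not the feature set used in the paper or in the talk.
    """
    lines = script.splitlines() or [""]
    # Rough identifier/keyword tokenizer; a real system might use a JS lexer.
    words = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", script)
    char_counts = Counter(script)
    word_counts = Counter(words)
    n_chars = max(len(script), 1)  # avoid division by zero on empty input
    n_words = max(len(words), 1)

    features = {"avg_line_length": sum(len(line) for line in lines) / len(lines)}
    for ch in tracked_chars:
        features[f"char_freq:{ch}"] = char_counts[ch] / n_chars
    for word in tracked_words:
        features[f"word_freq:{word}"] = word_counts[word] / n_words
    return features


sample = "var a = 1;\nvar bb = 2;"
print(extract_features(sample, tracked_chars=(";",), tracked_words=("var",)))
```

Each script becomes one row of such numbers, and those rows are what a tree-based classifier would be trained on.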
I guess you've heard about this kind of tree, and here is a simple example. This tree checks whether the average line length is larger than some threshold. If it is, it then checks the frequency of a certain character, and if that exceeds its threshold, the tree says that the script is obfuscated. Of course, this is a really simple example, just to get you to understand the principle.
Another approach was suggested by Skolka and his colleagues in 2019. They suggested building a deep learning classifier based on convolutional neural networks and abstract syntax trees. A convolutional neural network, in a sentence, is a method used a lot in vision, but today it is also used in additional fields of deep learning. As for abstract syntax trees, you can easily download a tool that takes your script, JavaScript in our case, and builds the syntax tree for it. There you can see the structure of the script, and you can extract many interesting features out of this tree. These features are then fed into their network.
The last possible approach is natural language processing. In recent years we have seen really tremendous advances in this field. We see amazing models that understand human language and enable us to solve a multitude of tasks, for example question answering, text classification, and many more. In our case, we can take a model, BERT is just an example, use the weights that the model was pre-trained with, and then train it on a downstream task, namely our task. We just feed it documents, which are JavaScripts labeled as clear text or obfuscated ahead of time, and the model learns from them and adjusts the weights it had learned previously.
Our first approach was inspired by the first work.
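To make that first, tree-based idea concrete, the two-level decision tree from the simple example can be written as plain conditionals. This is only a sketch: the thresholds, the choice of the backslash as the tracked character, and the labels returned on the other branches are assumptions for illustration, since the example only spells out the path that ends in "obfuscated".

```python
def classify_with_toy_tree(avg_line_length, backslash_freq,
                           line_threshold=200.0, char_threshold=0.01):
    """Toy two-level decision tree; all thresholds are hypothetical."""
    if avg_line_length > line_threshold:
        # Long lines alone are not enough; also check a character frequency.
        if backslash_freq > char_threshold:
            return "obfuscated"
        return "clear text"  # assumed label for this branch
    return "clear text"      # assumed label for this branch


print(classify_with_toy_tree(avg_line_length=500.0, backslash_freq=0.05))
print(classify_with_toy_tree(avg_line_length=40.0, backslash_freq=0.05))
```

A learned tree has the same shape, except that the training procedure picks the features and thresholds, which is also why it is easy to read off which features drive a decision.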
We started simple. We wanted to build a simple model, and we wanted to benefit from decision trees, which give us explainability: when we use a decision tree, it's very easy to see which features are affecting the model and how. So we took about 40 features and trained on a single obfuscator, and it turned out that this model didn't generalize.
So what went wrong? In order to understand that, we have to dive a bit deeper into the various obfuscators for JavaScript. Here I prepared a list of the most used obfuscators for JavaScript, sorted by a GitHub popularity metric, which is the number of forks. As you can see, UglifyJS and obfuscator.io are the most used ones, and there are additional ones here.
What methods do these obfuscators employ? Let's see a few examples. The first thing people immediately think of when they think of obfuscation is renaming of variables and functions. Indeed, if you take your code and rename things a bit, sometimes that alone makes it unintelligible, even if it was not really easy to understand before. In this simple example, you can see we just renamed a few things and it already looks a bit more intimidating.
An additional technique is modification of functions: the function calls, the function arguments, and the return values. In this simple example, I took a function that just squares a number, and I show you a snippet of the code. It's not the whole code that was produced by obfuscating it, but you can clearly see that the function gets many parameters and returns many things, and that really complicates things a lot.
An additional key example is modification of strings by using encoding, encryption, and string generators. In this example, we have a simple variable, and if we want to find the strings that were used inside of it, like the Honda Accord car, you can see that the strings are split and then several arithmetic operations are used. If you look at the resulting code without having seen the clear-text one ahead of time, it really takes time to understand what happens there.
The last example in this context is the manipulation of constants. If we have a simple constant, we might end up with some expression that we need to calculate to understand what the constant is. There are additional methods, like changing the base of integers and injecting dead or redundant code, which, of course, also complicates things a lot.
So what are the differences between these obfuscators, and between obfuscators in general? As you might imagine, the differences are in the specific naming methods, the encoding and encryption functions that are used, and some more. Moreover, most of these obfuscators accept a lot of parameters to tweak them, so the resulting code looks very different if you run a single obfuscator in mode one versus mode two. Just as an example, JavaScript Obfuscator has 40 parameters and UglifyJS has 30 parameters.
As you might imagine, when we compare different obfuscators, the output distributions clearly look different. We get different values for everything you like: average line length, word length, function size, the proportion of encoded characters. Just as an example, if we measure the normalized backslash count, which is one of the features proposed in the paper I mentioned before, or the average line length, you can see that the differences between the obfuscators can be really big. And this is a log scale, by the way.
Before telling you about the approach that really worked for us, I'll just say a few words about the data we trained the model on. We used a public data set of about 150,000 JavaScripts, all of them clear text, and we applied four different obfuscators to them. Four, because we wanted to test the model on three additional ones that the model had not previously seen. We got a perfectly balanced data set of about 100,000 clear-text JavaScripts and 100,000 obfuscated JavaScripts, equally divided between the obfuscators.
Our approach combined approaches number one and three that I mentioned. The idea is as follows. We tokenize the input into words, and we calculate the most common words in the clear-text JavaScripts. So we take all the clear-text JavaScripts, extract the common words out of them, and look at several hundred of them. Then, for every input, we measure the difference between this input and the calculated distribution.
Let me give you a small example. Assume that the top three words in JavaScripts are function, document, and input. And assume that function appears about once in every 20 words, document once in every 33 words, and input once in every 100 words. Now take the following JavaScript. If it's too small, I'm sorry, but you don't have to see the exact details; I will just read them out for you. Here we have 71 words. If we look at the word function, it appears twice, meaning it makes up about 3% of the words. The word document appears once, and the word input never appears. So if we calculate the difference between the clear-text occurrence, which we calculated as I showed you on the previous slide, and the actual occurrence, we can see the differences in red.
What do we do with these differences? We feed them into a boosted decision tree. This is very similar to the example that I gave you before. The tree then looks at the difference for each specific word, that is, how many times it appears, and from that we can tell whether the text is obfuscated or clear text.
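The worked example above can be reproduced with a few lines of code. This is a sketch under stated assumptions, not our production pipeline: the tokenizer regex, the function names, the default of 300 top words (standing in for "several hundred"), and the sign convention (expected minus actual) are all illustrative choices. The reference frequencies are the ones from the example.

```python
import re
from collections import Counter

# Reference clear-text frequencies from the worked example.
CLEAR_TEXT_FREQ = {"function": 1 / 20, "document": 1 / 33, "input": 1 / 100}


def tokenize(script):
    """Split a script into word-like tokens (a simplifying assumption)."""
    return re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", script)


def top_word_frequencies(clear_text_scripts, k=300):
    """Relative frequency of the k most common words in a clear-text corpus."""
    counts = Counter()
    for script in clear_text_scripts:
        counts.update(tokenize(script))
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.most_common(k)}


def frequency_differences(script, reference=CLEAR_TEXT_FREQ):
    """One feature per top word: expected frequency minus observed frequency."""
    words = tokenize(script)
    counts = Counter(words)
    total = max(len(words), 1)
    return {word: expected - counts[word] / total
            for word, expected in reference.items()}


# A stand-in script with 71 words: "function" twice, "document" once, no "input".
sample = " ".join(["function"] * 2 + ["document"] + ["x"] * 68)
diffs = frequency_differences(sample)
print({word: round(value, 3) for word, value in diffs.items()})
# {'function': 0.022, 'document': 0.016, 'input': 0.01}
```

The resulting dictionary, one difference per top word, is the feature vector that would be passed to the boosted decision tree.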
So I hope this is clear. This is just an example of a decision tree with the different features; every feature corresponds to a top word we extracted from the clear-text JavaScripts.
It turned out that the performance of our model is pretty good. In this case we were mainly interested in false negatives and false positives. You can see that both of them are below about 1%, which was the product requirement. So what is a false negative in this case? A false negative means that we classify a document as clear text when in reality it is obfuscated, and a false positive means that we classify a document as obfuscated when in reality it is clear text. We were interested in the case where the false negative rate is smaller than the false positive rate. The meaning is, as you recall, that this is related to maliciousness, so we do not want to flag cases where people would look and say, oh, in reality it's not obfuscated. We don't want to increase the number of these cases.
The next question we asked was whether our approach generalizes to additional languages, and it turns out that the answer is yes. We looked at Python and PHP, for example. As I mentioned before, since these languages are server-side, obfuscation and maliciousness are closely related to one another, and this is why this case is even more interesting. We got really good results. Of course, we trained these models on specific data sets that were chosen for these languages.
Let me show you a very short demo. This is the QR code you can scan. We set up a website that you can use. It's publicly available, with some version of our model, and you can just play with it and see what's going on.
If we take this script as an example, you can see it's not obfuscated. We can just take it and feed it into our model. This is a live demo, by the way. You can see that the model says that this is clear text with a probability of 0.95, meaning the model is pretty certain. Here you can see SHAP values. I don't have time to explain what they really are; they are just the significance of the features that affected the decision. In this case, you can see an example of a one-gram model, which is a little bit different from what I described before. It's one of the models that we built during the development of this project.
Let's take another example, just a quick one. I took a piece of code and obfuscated it. I'm not sure you can see it, unfortunately the window is a bit small here, so you have to trust me on that. If we feed the code in here, the model says that it's obfuscated. Of course, the probability is rounded. You can also see the various features and how they affected the decision. I cannot promise that this website will remain in this exact form, but if people are interested and use it, we might enhance it and continue maintaining it.
So what have we learned in this project? The first thing is about the relationship between maliciousness and obfuscation in different languages. As I mentioned, it's not a one-to-one relation, but usually obfuscation is a strong signal for maliciousness. That's really cool, because I think that solving the obfuscation classification problem is much easier than solving the maliciousness one, in terms of machine learning, of course.
We saw many problems that are difficult for humans, and even the problem of classifying code as obfuscated or clear text, which might seem simple, is not so easy. And as we saw, building a machine learning model that solves it is not very sophisticated and not too hard.
Another thing is about the obfuscators. We had several choices here. Building a machine learning model per obfuscator is not scalable, and looking into the internals of each obfuscator is not scalable either. So we used this, you can call it a trick or an approach, of looking at the clear text and extracting the top words out of it, which is inspired by natural language processing, and this really enabled us to solve the problem.
Another nice thing is that this framework seems general. We didn't test it on all languages, but I think Python, PHP, and JavaScript are the most interesting ones in terms of obfuscation, and there is no reason to believe that this wouldn't work for additional languages, provided, of course, that you train the model on a large enough data set that is properly balanced.
So thank you so much. I am looking forward to your questions. I hope you enjoyed the talk.