All right, hi everyone. Today we're going to talk about research we've done on Chrome extensions. We researched the threat landscape and the anatomy of Chrome extensions, and we created models to detect malicious extensions. So we're going to walk through the whole process we went through, from looking at the threat landscape to training a model that can detect malicious extensions. We're actually threat researchers; we consulted with our data scientist Jonathan, who is sitting here, so thanks, Jonathan.

I'm Tal. I've been working at Deep Instinct for the past two and a half years, and before Deep Instinct I served in the IDF for seven years. Hi guys, I'm Rui. I've been working with Tal for the last two years; previously I worked at other security companies and in IDF intelligence as well.

So, the agenda for today. First we're going to give some background on browser extensions and why we decided to work on Chrome extensions specifically. After that we'll cover the threat landscape and show some examples of attacks that have been carried out using Chrome extensions. Then we'll talk about collecting the data and creating the datasets: the sourcing of the data and the ground truth we were able to establish in order to validate the data. And after that we'll talk about the models we created: we'll show the three models we tried, the results, and the process we went through.

All right, so let's quickly talk about browser extensions. First of all, what are browser extensions? If we look at the Wikipedia definition, we'll see it's a small software module for customizing a web browser. Basically, it's whatever we need to add functionality to our browsers, to add the advanced features we'd like to have.
Other than that, we see a trend today of more and more services and companies moving to web-based products, which means browser extensions are exposed to much more sensitive data than before. That obviously makes them an attractive attack surface for hackers.

So why Chrome specifically? Basically, Chrome is the most common browser of them all, controlling over 65% of the market, and other browsers also support Chrome extensions, such as the new Edge browser and Opera, which are all based on the Chromium project. Beyond that, there is a Cisco report from 2016 which mentioned that Chrome extensions are one of the most prevalent sources of data leakage in enterprises, mostly because they go undetected for long periods of time. This is just a small graph comparing Chrome's share against the other browsers.

There are also some built-in features in Chrome that make it attractive for attackers. First, extensions can be installed through developer mode: after switching the browser to developer mode, it's possible to install extensions from outside the Chrome Web Store, either manually, for example by social-engineering the user, or with the --load-extension flag that can be passed to chrome.exe. Another feature is ignoring HTTPS errors: if that flag is turned on and the user browses to a website with an invalid certificate, the usual invalid-certificate warning is not shown. The last one is not really a feature but a design choice: an extension can read sensitive fields, like password or credit-card fields, without any special permission having to be declared.

So now we'll talk a bit about the anatomy of a Chrome extension. Chrome extensions use the file type .crx, so from now on we'll refer to a Chrome extension as a CRX.
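As a small illustration of the developer-mode sideloading path just mentioned: --load-extension is a real Chrome command-line switch, but the helper name and the paths below are hypothetical, shown only to sketch what such an invocation looks like.

```python
def chrome_sideload_command(chrome_exe, extension_dir):
    """Build the command line that sideloads an unpacked extension
    via Chrome's --load-extension switch (a hypothetical helper)."""
    return [chrome_exe, f"--load-extension={extension_dir}"]

# Hypothetical paths, for illustration only:
cmd = chrome_sideload_command(
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",
    r"C:\temp\unpacked_extension",
)
# subprocess.run(cmd) would then launch Chrome with the extension loaded,
# bypassing the Chrome Web Store entirely.
```

Passing the same flag directly on the command line achieves the same thing; the point is that no store review is involved.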
So the CRX is actually a ZIP file, an archive, built as follows. First there is the manifest, a JSON file that declares the most important things about the extension, such as the permissions it requires, the resources it uses, and metadata such as its name and ID. Then there is the background script, which is the main JavaScript of the extension. This script holds the state of the extension; every event that occurs notifies the background script, which decides what to do. For example, it can trigger a content script. Content scripts are scripts written for particular web pages. They have full access to the page and its JavaScript APIs, in contrast to the background script, and they're used to manipulate those particular pages, as I said. The last component is the general resources that every extension uses, such as HTML and CSS files, images, etc.

All right, let's talk a bit about the threat landscape of Chrome extension malware. If we go all the way back to 2010, that's when the Chrome Web Store went live for the first time. A year later we're already talking about hundreds of millions of downloads and, according to Google, 200 million active users. In the years that followed, substantial measures were taken to prevent attackers from accessing users' data. For example, Google started scanning each and every extension before admitting it to the Web Store, and blocked the options to silently install extensions without using developer mode. They also cancelled inline installation, and they discontinued the Netscape Plugin API, which had basically allowed attackers to access the file system: it was deprecated in 2014 and removed in 2015.
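The CRX anatomy described above can be poked at with a few lines of Python. This is a minimal sketch, assuming only what the talk states: the CRX is a ZIP archive (prefixed with a "Cr24" magic header and signature data) whose manifest.json declares name, permissions, and so on. A production parser should read the declared header length rather than searching for the ZIP signature as done here.

```python
import io
import json
import zipfile

def read_manifest(crx_path):
    """Read manifest.json from a .crx file.

    A CRX is a ZIP archive prefixed with a binary header
    ("Cr24" magic, a version number, and signature data),
    so we skip ahead to the first ZIP local-file header.
    """
    data = open(crx_path, "rb").read()
    if data[:4] != b"Cr24":
        raise ValueError("not a CRX file")
    zip_start = data.find(b"PK\x03\x04")  # crude: real parsers use header lengths
    with zipfile.ZipFile(io.BytesIO(data[zip_start:])) as z:
        return json.loads(z.read("manifest.json"))
```

With a manifest in hand you can inspect the "permissions" and "background" keys that the talk highlights as the interesting parts of an extension.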
Obviously these measures caused a drop in Chrome malware distribution for a couple of years. However, in the last few years, and specifically in 2019, the biggest year, we've seen big growth in the propagation of malware in CRX files. This is a graph from VirusTotal: you can see that the first half of 2019 alone already passed 2016, which had been the biggest year for Chrome malware. Another interesting thing we've seen on VirusTotal is the lack of attention to Chrome malware from traditional AV vendors: most extensions on VirusTotal sit at zero detections.

So let's talk a bit about the attack surface. First, all the JavaScript code runs in Chrome's isolated sandbox, so, for example, it cannot touch the machine's memory or file system without an exploit. What it can still do is everything related to the browser itself: stealing information or credentials, forming a botnet by installing the extension on a lot of machines, or monetizing through advertisements and the like.

Now we'll see an example of an attack that was presented at DEF CON and Black Hat two years ago, called Game of Chromes. Basically, the attack created a botnet based on a Chrome extension. It worked like this: the extension injected an invisible iframe into the user's Facebook page, and inside that iframe it signed up to Wix and created a website there. Wix is a free platform for building a website very quickly. The Wix website was then published to all of the user's friends, and when other users followed the link to the Wix page, clicked, and installed the extension, the extension spread further. And lastly, the extension rated itself in the Chrome Web Store with five stars, which is just genius. So that's one attack. Another attack, from just last month, is DataSpii.
DataSpii was spread through eight extensions, for Firefox and for Chrome as well. The extensions collected URLs, hyperlinks, and web page titles, and all this data was sent to a C&C server. The data is actually still being sold on third-party websites; you can read the research paper and see the particular website where it's sold now. Why are URLs interesting? Because in the HTTP protocol a lot of data is passed through the URL: tokens for file sharing, Zoom meeting tokens, and so on. Okay.

All right. So now we're going to talk a bit about the data collection and the labeling process. Basically, when sourcing a dataset of CRX files we faced several challenges, such as finding reliable sources, since this is a fairly uncommon, small, and rather new threat landscape. Other than that, we wanted to make sure that all files were actually in the correct format, that they really are CRX files, and of course we needed to know how to distinguish the malicious ones from the non-malicious ones. For that we used three sources: threat intelligence feeds, the official Chrome Web Store, and unofficial Chrome stores, which are basically stores that crawl the official store and offer the same extensions. We'll talk a bit about each one.

The intelligence feeds had the lowest number of CRX files, but much higher-quality metadata. For example, we had the AV vendors' detections, the number of submissions, when a file was first uploaded, etc. But the metadata was also challenging. We faced a lot of missing details, and we couldn't find a way to cross-reference between the Chrome Web Store and VirusTotal, for example. A lot of files were also incorrectly tagged as CRX files: in fact, something like 30% were invalid even though they were tagged as CRX. The Chrome Web Store, on the other hand, is obviously the most reliable source of valid files.
But the Web Store also removes malicious files, at least the ones they're aware of, and it only stores the most recent version of each extension. So we decided to treat most files there as benign. We did find some useful metadata there, but as I said, it's hard to cross-reference between the sources. As for the unofficial stores, this is obviously the largest source of CRX files, because they store all previous versions, all deleted extensions, and basically everything that was ever uploaded to the official store. But in some cases we had no metadata to use, only an ID, or not even that, and obviously we don't know what happened to a CRX between the store and the crawl, so we're not sure what the trust factor is for these files.

To sum it up quickly: as we said, the threat intelligence feeds were almost the best in everything except quantity. So what we did was mostly use the official Chrome Web Store and the unofficial stores together with the feeds; we uploaded those files to get the scan results we wanted and gathered the metadata we thought would be necessary for our ground truth. Basically, we decided that five positives from AV vendors is enough to label a file malicious. As I said at the beginning, not a lot of vendors even scan these files, so five is not a lot of vendors, but it was enough for us, at least for this research model. For benign, we required that a file either be in the official store and have zero positives after being scanned on VirusTotal, or come from the unofficial stores and also receive zero positives.

All right, so now we'll dive into the three models we created. Something important to say is that these three models are research models. We just wanted to prove the point that it's possible to build a model that detects malicious Chrome extensions with good detection rates.
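The ground-truth rule just described (five or more AV positives means malicious; zero positives on a file sourced from the official or unofficial stores means benign) could be sketched as a small labeling function. The function name and field names are our own; this is just a restatement of the talk's rule in code.

```python
def label_extension(positives, from_store):
    """Ground-truth rule as described in the talk (names are ours).

    positives  -- number of AV vendors flagging the file on VirusTotal
    from_store -- True if the CRX came from the official store or an
                  unofficial store crawl of it
    """
    if positives >= 5:
        return "malicious"
    if positives == 0 and from_store:
        return "benign"
    # Files with 1-4 positives, or with no store provenance,
    # are too ambiguous to label and are left out of the dataset.
    return "unlabeled"
```

Leaving the 1-4 positives band unlabeled is what keeps the dataset clean despite the low AV coverage of CRX files.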
This slide shows the overall process of creating the models. First is the preprocessing step, where we validated the files that were introduced into the dataset. After that we extracted the relevant features from the files. We ended up with a very high-dimensional feature space, so we had to reduce the dimensionality. We first used chi-square, and then we tried another model applying two steps: first chi-square and then gradient boosting. We represented the features as numerical values using TF-IDF in order to train a model, and the models we trained were one logistic regression model and two DNNs. The last thing to say is that we didn't try to optimize the hyperparameters, since these are just research models, as I said. Sorry.

All right. So, the preprocessing. First we validated the files, as I said: every file in the dataset should be a CRX or a ZIP, which you can test by looking at the magic bytes of the file, for example. Another thing to test is that each archive contains a manifest.json file, which is a must for any extension. We ended up with a bigger dataset, but we used 20,000 files: 75% of the dataset went to training and 25% to test.

All right. The feature extraction method. We extracted features from the JavaScript files and from the manifest. The extraction method we used looks for sequences of four or more characters that are alphanumeric or otherwise relevant to JavaScript, such as underscore or dollar. Now we'll see an example. Say we're trying to extract features from this function: these are the features that would be extracted using this method. I hope it's clear.

All right, dimensionality reduction. Using the extraction method we just saw, we ended up with 1.1 million features, which is... a big amount. Of course we had to reduce that, so as I said, we tried chi-square to reduce it.
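The token-extraction rule described above (runs of four or more characters that are alphanumeric, underscore, or dollar) maps naturally to a regular expression. A minimal sketch, with a made-up snippet standing in for the slide's example function:

```python
import re

# Runs of 4+ characters from the alphanumeric set plus "_" and "$",
# the characters legal in JavaScript identifiers.
TOKEN_RE = re.compile(r"[A-Za-z0-9_$]{4,}")

def extract_tokens(source):
    """Return the unique candidate feature strings in a script or manifest."""
    return set(TOKEN_RE.findall(source))

snippet = "function grabPasswords($form) { send(pwd_val); }"
extract_tokens(snippet)
# yields the set {'function', 'grabPasswords', '$form', 'send', 'pwd_val'}
```

Short keywords like `var` or punctuation never survive the 4-character threshold, which is part of what keeps the vocabulary down (though, as the talk notes, it still reached 1.1 million features over the whole corpus).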
Actually we tried different numbers; for this model we decided to take the top 10K features. In the second step we used gradient boosting to reduce from 10K to 1K, and then we represented the features as numerical values using TF-IDF, as I said earlier.

So here are the results of the three models. First, the DNN model using chi-square only: its results were the best of the three. As you can see, the recall was 99% and the false positive rate was about 2 or 3%. The second DNN model, which added an XGBoost step, was worse: the recall was pretty similar, but the false positive rate was much higher. The last model, the logistic regression using chi-square and XGBoost, had the worst results of the three: its false positive rate was around 30%, which makes it an unsustainable model. Another thing to note is that in general, for Chrome extensions, recall matters more than the false positive rate: an average user has around 5 or 10 extensions, and you wouldn't want to miss the one malicious extension. A false positive on one of the 10 is not good either, but it's not as bad as missing the malicious one. The last thing to say about these models is that all three were able to detect the DataSpii campaign and the other one, Game of Chromes.

Future research. We would like to try to deobfuscate the JavaScript files before extracting the features. The extension threat landscape suffers from a lot of obfuscation of the JavaScript code, so we'd like to do some normalization and deobfuscation before we extract the features. Another thing to try is to train a model on a combined dataset of Chrome and Firefox extensions: the formats and the files should be similar, but we didn't try to train a combined model or test the results on a Firefox dataset. That's pretty much it. Thank you very much. Actually, we're out of time, so we'll stay here. All right.
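The selection-plus-representation steps the talk walks through (feature selection by chi-square, TF-IDF weighting, then a linear classifier) can be sketched with scikit-learn. Note one practical difference from the talk's ordering: chi-square needs numeric inputs, so the vectorizer has to run before the selector. The toy corpus, labels, and feature counts below are invented for illustration; this is not the authors' actual model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for tokenized extension scripts and their labels
# (1 = malicious, 0 = benign), repeated to give the model something to fit.
docs = [
    "chrome tabs query sendMessage",
    "document cookie XMLHttpRequest exfil",
    "getElementById addEventListener innerHTML",
    "eval unescape fromCharCode atob",
] * 10
labels = [0, 1, 0, 1] * 10

pipeline = Pipeline([
    # Same token rule as the talk's extraction method, applied by TF-IDF.
    ("tfidf", TfidfVectorizer(token_pattern=r"[A-Za-z0-9_$]{4,}")),
    # Chi-square selection; the talk used 10K features, we use 10 on toy data.
    ("select", SelectKBest(chi2, k=10)),
    ("clf", LogisticRegression()),
])
pipeline.fit(docs, labels)
print(pipeline.score(docs, labels))
```

Swapping the final estimator for a small DNN, or inserting a gradient-boosting selection step after the chi-square one, mirrors the two other variants the talk compares.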
One question, okay? Is there a question? All right. Thank you very much.