Awesome, thank you everybody. Next up, for our last talk actually at the AI Village, we have Shimon and Tal on noisy and distorted data sets. Let's all give them a round of applause.

Good morning everyone, and thank you all for being here for this Sunday talk. It's actually quite a nice attendance for a Sunday morning. As introduced, we're Shimon and Tal, and we'll be talking about how we take initially noisy and distorted data sets and data sources and turn them into excellent prediction models.

Just a bit about ourselves. As I said, I'm Shimon, VP of research in deep learning at Deep Instinct. I've been with Deep Instinct for the past three years, since mid-2016, but overall I have roughly 16 years of experience in cyber, in both offensive and defensive positions, deep in research but in recent years more in management positions.

Good morning everybody, I'm Tal. Maybe some of you remember me from the talk about malicious Chrome extensions two days ago, so it's a big pleasure to be here again. I've been working at Deep Instinct as well for the past two and a half years; prior to Deep Instinct I served in the IDF for seven years.

Okay, so now that you know us a bit better, let's talk about this talk's agenda. We'll start by introducing the problem we're about to discuss over the next 40 minutes or so. I guess a lot of the people sitting here know that putting together excellent data sets is hard; it's not an easy task, and it involves a lot of knowledge, a lot of experience, a lot of trial and error. We'll be talking about the main problems and common issues that come up when building those data sets: source quality, accuracy and diversity, and how we evaluate those; issues related to specific file formats, and how file formats in general might affect the way we build and engineer data sets; and then some noise, distribution and distortion issues in threat intelligence sources. We'll conclude by talking about how all of this drills down to the way we research false positives and false negatives in the models we create, and we'll share some tips and recommendations for building the best data sets possible. Let's get started.

All right, so the problem we're talking about in this presentation. The very first step in creating any model is building a data set, and the data that goes into those data sets is of course a crucial part of creating the model. As Hillary Sanders presented two years ago, bad data affects the results of the model you create. In this talk we'd like to dive into some specific use cases and issues that we've experienced in our everyday work, and how these issues affect the data sets you might build. We'll be talking about several kinds of sources: commercial feeds such as VirusTotal and ReversingLabs, free malware repositories, and software repositories such as NSRL.

Okay, so I briefly mentioned the three main topics before; what we'll showcase in each is as follows. When it comes to source quality and accuracy, we'll talk about free and open malware repositories. They're a good thing, but we also need to be careful about using them.
We'll show why and how. We'll talk about sample duplication and the way those duplications manifest in the most common threat intel feeds. As far as file format issues go, we'll showcase how the threat landscape in two specific file formats, XLS (Office Excel, you know what that is) and RTF, manifests in a very specific distribution of the files we see in the wild, and how that might affect the data sets, and ultimately the models, that we build. Then, moving on to noise and distortion in threat intelligence, we'll show how even time distribution can be very weird, and that in itself is not a desirable situation and can have pretty adverse outcomes. We'll also talk about two specific threat landscapes, the first being 64-bit PEs and the second being, yes, even Android makes it into the talk: how these two threat landscapes have very dense and specific distributions, which we might ultimately even call a distortion of a kind, and how that might affect the way we should build data sets around those kinds of threat landscapes. Let's get going.

All right, so the first category is sources. There are two different kinds of sources, free repositories and commercial feeds, and we'll use one use case from each. In general, starting with free repositories: every repository is welcome. If you can get more data, that's good. Every time a new repository appears, we're happy; we want to take the data, put it into our databases and connect it to our data pipelines. So it should be a good thing in general. But sometimes, when we start looking at the data, it's not as good as we might expect it to be, since a lot of repositories are biased and noisy, which creates imbalances in your data sets if you take them as is.

So the first use case is free, open repositories. There are a lot of them out there; some examples are shown in the presentation. We took one repository that we had come across many times and found heavily used by customers and researchers, and we wanted to see what lies inside this specific repository, since it's used a lot and, in general, the results we got on it were not as good as we were used to. We took a data set of 130,000 samples from this repository, which claims to contain malware only, and tried to answer two questions. First, what is the relevance of this repository; what can you do with it? And second, how good is the data inside; does it really contain only malicious files, as it claims? We've also published a blog post about it, so you can read more later.

About the data inside this repository: as you can see, only around 60% were PE files, and there was a huge number of files that were just textual, which of course cannot harm a computer, and it's a bit problematic to classify them as malicious. So the data from this repository cannot be taken as is for testing models, for example.
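As a side note, this kind of file-type triage is easy to script. Here's a minimal sketch, assuming the python-magic bindings and a local dump of the repository; the directory path is illustrative:

```python
# Sketch: triage a repository dump by file type before trusting its labels.
# Assumes the python-magic bindings (pip install python-magic); the sample
# directory path is illustrative.
from collections import Counter
from pathlib import Path

import magic  # libmagic bindings

def type_distribution(sample_dir: str) -> Counter:
    """Count libmagic-identified types across every file in the dump."""
    counts = Counter()
    for path in Path(sample_dir).rglob("*"):
        if path.is_file():
            counts[magic.from_file(str(path))] += 1
    return counts

if __name__ == "__main__":
    dist = type_distribution("./repo_samples")
    total = sum(dist.values())
    for ftype, n in dist.most_common(10):
        print(f"{n / total:6.1%}  {ftype}")
```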
So we saw the file-type distribution, and of course not every file type appears in this repository, but at least we hoped that everything inside would indeed be malicious, as it claims. We took the data set we created, as I mentioned, and tried to figure out how many genuinely malicious files lie in this repository. We found that only 56% of the PE section of the data set was malware. We were truly amazed to see that about 40% was PUA, and around 3 or 4% were just benign files. And when we say benign, we mean pure benign files, not something that could even be considered PUA. So as you can see, data from this repository cannot be taken as is for testing models or for training sets. Of course one can still use the data from this repository: it's open, it's free, and not everybody has API keys and commercial feeds. But you cannot take it as is; you need to do some engineering on the data first. That's the bottom line.

All right, the other use case we'll talk about is from commercial feeds: sample duplication. A lot of the time, a specific sample is changed by a few bytes and uploaded again. Of course the change gives the file a new hash, and then you find thousands, or even tens of thousands, of similar files inside the commercial repository, almost the same file with just a few bytes changed. Let's talk about a few causes of this. The first is polymorphic malware trying to evade the IOC-based detection used by a lot of vendors: malware sometimes changes something inside the file, first to change its hash, and second to change IOCs such as a file name or an IP address that vendors can sign on. The second cause is virus-infected files. What viruses in general do is infect files; later in the presentation we'll talk about a use case where a virus changed something in a Microsoft file, and how that affects the data. The last but not least cause is mass submissions: a lot of the time, researchers or even threat actors want to see whether their brand-new malware is caught by vendors, so they change a few bytes, or change something in the malware's behavior, and upload it again, to VirusTotal for example. You can end up with a lot of duplicated samples, again with different hashes but very, very similar files.
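To make the duplication point concrete, here's a minimal sketch of how one might collapse a feed into near-duplicate clusters using ssdeep. This is our own illustration, assuming the ssdeep Python bindings; the similarity cutoff is an arbitrary choice, and the naive pairwise comparison is fine for a sketch but not for feed-scale data:

```python
# Sketch: group feed samples into ssdeep near-duplicate clusters so one
# mutated family doesn't dominate a data set. Assumes the ssdeep Python
# bindings (pip install ssdeep); the cutoff is illustrative.
from pathlib import Path

import ssdeep

SIMILARITY_CUTOFF = 90  # ssdeep.compare() returns a 0-100 score

def cluster_by_ssdeep(sample_dir: str) -> list[list[str]]:
    """Greedy clustering: each file joins the first cluster it matches."""
    clusters: list[tuple[str, list[str]]] = []  # (representative hash, members)
    for path in Path(sample_dir).rglob("*"):
        if not path.is_file():
            continue
        h = ssdeep.hash_from_file(str(path))
        for rep, members in clusters:
            if ssdeep.compare(rep, h) >= SIMILARITY_CUTOFF:
                members.append(str(path))
                break
        else:
            clusters.append((h, [str(path)]))
    return [members for _, members in clusters]
```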
Now let's back up for one second. With mass submissions, are we actually talking about a poisoning attack? We're in the AI Village, and I think poisoning attacks, or attacking AI in general, is one of the most discussed topics, so one could say, and for the people sitting in this room this might be quite appealing, even attractive, to say: wow, we're actually seeing AI poisoning attacks. We think this mass submissions use case is not exactly that. First of all, because in many cases we see these mass submissions coming from users with API keys, and those tend to be consumers on the defense side of these feeds and sources. That's not to say that when it comes from threat actors, it isn't used not only to test their own malware but also to try to confuse researchers and send the security industry looking in places the attackers want: they'll upload masses of files to make someone think this is the hot stuff right now while they're actually doing something else. Still, there's a difference between a poisoning attack and doing something en masse in order to confuse your adversary. We think it's more the second case than the first, so we're still not actually seeing targeted poisoning attacks. That's not to say it might not happen in the very near future, but we don't think that's exactly the case yet. Still, ingesting those mass duplications blindly might inadvertently turn those mass submissions into a kind of poisoning attack, and that's why we chose to bring it up here.

All right, so let's talk about a specific example of a polymorphic malware that has been uploaded tens of thousands of times with different hashes: Ramnit. Ramnit is a worm that has been active for roughly seven or eight years already, and as you can see in these results, it has been uploaded a lot of times. We took its ssdeep hash and looked at some of the samples inside this ssdeep cluster, to figure out whether they are actually the same file or not. As you can see, all these files were about 3.5 kilobytes in size, and they were all DLL files. At least three out of four sections were identical by hash; the fourth, the data section, had a different hash. We wanted to take this example and showcase why the sample duplication actually happens, so we looked further into this data section to see what the difference was about. As you can see, the difference was only a few bytes, the red bytes in these screenshots, inside the data section. Those few bytes looked very interesting, because they're actually a string, so we decided to take these samples and figure out what those strings are used for. By digging into these samples, we found that this ssdeep cluster contains a lot of samples that serve as the loader of the actual malware, Ramnit. As you can see, the string was some name ending in .exe: this is the loader, and the actual file is loaded afterwards using the CreateProcess API call.
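For anyone who wants to reproduce that kind of section-level comparison, here's a minimal sketch, assuming the pefile library; the file names are placeholders, not real sample names:

```python
# Sketch: pinpoint which PE section differs between two near-identical
# samples, as in the Ramnit loader analysis above. Assumes the pefile
# library (pip install pefile); file names are placeholders.
import hashlib

import pefile

def section_hashes(path: str) -> dict[str, str]:
    """Map each section name to the SHA-256 of its raw data."""
    pe = pefile.PE(path)
    return {
        s.Name.rstrip(b"\x00").decode(errors="replace"):
            hashlib.sha256(s.get_data()).hexdigest()
        for s in pe.sections
    }

a = section_hashes("sample_a.dll")
b = section_hashes("sample_b.dll")
for name in sorted(set(a) | set(b)):
    status = "same" if a.get(name) == b.get(name) else "DIFFERS"
    print(f"{name:10s} {status}")
```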
Okay, let's move on to talk a bit about issues that relate to the format of the files themselves. Before we get into it, let's think for a second about what actually makes a file format a format, and how different formats differ from one another. Formats define the file's structure; its header; its syntax, meaning the actual binary sequences that separate different parts of the file; the type of data we can find in the file, whether it's a textual format or some kind of binary format, and what other data might be embedded within it: compressed data, pictures, encrypted data and so on. And of course, different file formats have different functionalities. A PDF has its own functionality, Office files have their own functionalities, and PEs have their own; they're meant to allow code to run.

The way this relates to what we're talking about here is that the threat landscape of each file format is based on that functionality and on the context in which it's used. In many cases, actually in most cases, attackers abuse the innate attributes and functionalities that lie in the different formats we all use on a day-to-day basis in order to achieve their malicious intent. By that we mean that a lot of file formats can be abused without actually leveraging vulnerabilities or exploits. Let me give a few examples. When someone embeds a malicious JavaScript, regardless of what the JavaScript does, into a PDF file: having JavaScript in a PDF is not a vulnerability, it's a feature, even if that JavaScript triggers some kind of exploit; the point is that JavaScript is supported in PDFs. The same goes for macros in Office files: having a macro is not a vulnerability, it's a feature. And you can look at every file type that can be abused this way. Even with PEs: a running PE maybe shouldn't have been on the machine in the first place, but running code on the machine is exactly what the format is for. So in many cases there isn't really any vulnerability or exploit necessary; attackers just abuse what the file format allows us to do.

The fact that this is how the majority of the threat landscape behaves makes us think we need to be very, very careful in examining what malicious files look like and what benign files look like, because we don't want these seemingly malicious, but not necessarily malicious, differences between the two to affect our models. Let's give a few examples to demonstrate. The next few slides all relate to recent, random data sets of 100K known benign files and 100K malicious files from each file type we'll show. We used a very naive ground-truth approach as far as the malicious label is concerned: anything with more than 20 detections, in either VirusTotal or other threat feeds we have.

So let's talk about Excel files. You don't need any kind of AI, machine learning, deep learning or what have you in order to have a pretty good detection mechanism for malicious Excel files. You can use just one if-condition, "if this is an Excel file and it has a macro, deem it malicious", and get 95% recall with only about 10% false positives, which for a single if-condition is not bad. If you look at macro content, only about 10% of benign Excel files have macros in them, whereas a whopping 95% of the malicious Excel files have macros. But do we want a model that makes that distinction? I think not, especially not with a 10% false positive rate, which again, for such a naive model, so to speak, is not bad, but is obviously not something we want to have.
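Just to make that naive "detector" concrete, this is roughly what the one if-condition looks like in code. It's a sketch assuming the oletools package, not anything our models actually do:

```python
# Sketch: the naive one-condition "detector" described above: flag any Excel
# file that contains a VBA macro. Assumes the oletools package
# (pip install oletools). Roughly 95% recall at ~10% false positives on the
# data sets above, and exactly the shortcut we do NOT want a model to learn.
from oletools.olevba import VBA_Parser

def naive_is_malicious(path: str) -> bool:
    parser = VBA_Parser(path)
    try:
        return parser.detect_vba_macros()  # True if any VBA macro is present
    finally:
        parser.close()
```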
The same goes if we look at the number of streams. XLS files are built out of streams, and 80% of benign Excel files have fewer than 10 streams, whereas 80% of the malicious ones have anywhere between 11 and 20. Now, someone might think, wow, this is an amazing feature, because statistically it really differentiates between the two categories of files. But again, this is not something that we want, in and of itself, to affect the decision, because it would create a model that's very, very easily evaded or bypassed.

A similar example is RTF. Finding a benign RTF file with an OLE object embedded in it was quite a task, because fewer than 0.5% of benign RTF files have one. Today, generally speaking, you might say no RTF is benign, because who uses RTF anymore; but of the benign RTFs that are out there, barely any has any kind of OLE object, whereas for malicious RTF files we're talking about 70%, more or less. Again, this is not something we want to translate immediately into the models we create; we need to engineer the data and think about how to generate a decent data set given these statistics.

Okay, now we're going to talk about some other noise and distortion phenomena in threat intelligence. They're not necessarily tied to a specific file format, but to a specific landscape, or just to what we see in threat intel sources. A lot of what we'll see lies in the fact that the threat sources we look at, even the biggest, most expensive ones with the largest number of files and the best coverage of the threat landscape, don't reflect reality; they are a function of reality. They're probably a much closer and better function of the malicious threat landscape; as far as benign files go, they obviously contain a fraction of what's really out there, because most benign files never find their way into these sources and feeds. That in itself causes some weird phenomena, which we'll now look at.

The first is time distribution. We looked at pretty much all of the PDF files found in VirusTotal over a period of six years, between 2012 and 2017. If you look at the distribution of those files per month, you see quite a weird phenomenon: at the beginning of the period, roughly two-thirds, maybe even three-quarters, of the PDF files you could find in VirusTotal were actually malicious, whereas later in the period that tendency changes. One could say VirusTotal simply became much more popular between 2012 and 2017. But it makes you think: is this something we want to leak into our data set? Because even if we just take 50% of the benign files of all time and 50% of the malicious files of all time, what we'll get is a data set where many more of the benign files come from the second half of that time period. Depending on the kind of model and the kind of features you use, that's something that might leak into your model, and I think you'd all agree that the creation timestamp in a PDF, which sits in clear text in pretty much every PDF out there, is not something that's very relevant for classification.
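One way to surface this kind of time distortion is to tabulate samples per first-seen month and per label before sampling anything. Here's a sketch assuming pandas and a hypothetical metadata table; the first_seen and label columns, and the "benign"/"malicious" label values, are assumptions of ours, not a real schema:

```python
# Sketch: tabulate a feed's samples per first-seen month and label to spot
# distorted months before sampling a training set. The CSV schema
# ("first_seen", "label" with values "benign"/"malicious") is hypothetical.
import pandas as pd

meta = pd.read_csv("pdf_feed_metadata.csv", parse_dates=["first_seen"])
per_month = (
    meta.groupby([meta["first_seen"].dt.to_period("M"), "label"])
        .size()
        .unstack(fill_value=0)
)
share = per_month["malicious"] / per_month.sum(axis=1)
# Months whose malicious share deviates wildly from the median deserve a
# closer look before any of their samples go into a data set.
print(per_month[(share - share.median()).abs() > 0.3])
```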
Here we're looking again at the distribution, but in absolute terms per landscape, seeing how the benign and the malicious landscapes each distribute in time. What we also see, and this comes back to those mass duplications we talked about before, are some very weird, clearly out-of-the-ordinary months where suddenly there's a huge surge in the number of malicious samples you can find, and sometimes even in the benign samples. So both the overall phenomenon and these distorted months need to be looked at, because if we just sample things randomly, they might really affect the data that we put into our data sets.

The next statistics concern a data set of 64-bit PEs. As you can see, in a random data set of 64-bit PEs, around 90% of the files are just .NET assemblies. That's a huge bias towards .NET assemblies, and probably not what you're trying to generalize the model on when you take a 64-bit data set. Also, 45% of the malicious data was virus-infected Microsoft files, so again a strong bias towards Microsoft files in the malicious data set, which is probably not the case in the benign data set. You end up with benign and malicious data sets where more files that look like Microsoft files sit in the malicious set, and that might be a problem.

Let's see an example of why there are so many malicious infected Microsoft files in the data set. There's a specific DLL, dnsapi.dll, which appeared tens of thousands of times in a random data set we created, and we tried to find out why there are so many infected dnsapi.dll files. One concrete example was discovered by Malwarebytes (see the link): a virus infecting dnsapi.dll in order to hijack the hosts file mechanism. The hosts file is a local file in which DNS entries can be configured; every time your machine needs to do a DNS query, it first checks the local hosts file, so by hijacking the hosts file the malware can control the DNS answers served to the machine. And as I said, when 45% of the malicious data set is Microsoft or Microsoft-like files, that can easily lead to false positives on Microsoft files.
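Before moving on, here's a minimal sketch of how one could measure that .NET bias, by checking whether the CLR (COM descriptor) data directory is populated. It assumes the pefile library, and the sample directory is illustrative:

```python
# Sketch: measure what share of a PE data set is .NET assemblies by checking
# the CLR (COM descriptor) data directory. Assumes the pefile library;
# the sample directory is illustrative.
from pathlib import Path

import pefile

def is_dotnet(path: Path) -> bool:
    try:
        pe = pefile.PE(str(path), fast_load=True)
    except pefile.PEFormatError:
        return False
    clr = pe.OPTIONAL_HEADER.DATA_DIRECTORY[
        pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR"]
    ]
    return clr.VirtualAddress != 0 and clr.Size != 0

samples = [p for p in Path("./pe64_samples").rglob("*") if p.is_file()]
dotnet = sum(1 for p in samples if is_dotnet(p))
print(f".NET assemblies: {dotnet}/{len(samples)} ({dotnet / len(samples):.0%})")
```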
Another use case lies in the Android landscape. An APK, as I guess you know, is an Android package, an archive that contains the resources and code of an application. There are a lot of malicious files in the Android threat landscape: ransomware for Android, spyware, a lot of different kinds of PUA. But there's also a specific type of malware, premium SMS: an application that simply sends SMS messages from the user's device in order to monetize, making the user pay money to the attacker. So we created, again, a random data set, and we looked at the malware families inside it. We looked at the three top families; together they make up around 37% of the data set, and all three were premium SMS. So here's another case of a strong bias towards something we probably didn't want. We probably wanted the APK data set to better represent the real variety of malicious APK types out there.

All right, so now we'll conclude, and we'll talk about some takeaways from this presentation. But first, I'll describe a process that we carry out post-training; it's a process, or an attitude, that you can take and apply to your own models after training. First, you identify a specific pattern. For example, as I said earlier, take a PE model: if the data set was biased as we've seen, you might get false positives on Microsoft files. Then you go back to your data set, meaning the training set the model was generalized from, in order to examine the pattern's appearance: how it's distributed inside the benign and the malicious parts of the training set. Sometimes you can go even further, down to the feature level, and try to work out what combination of features makes this file, or this collection of files, look malicious to the model and leads to the false positives. From that you need to understand the meaning; for the collection I described earlier, you might call it "Microsoft files" or "infected dnsapi.dll files". The last step is adjusting and engineering the feature distribution, or the pattern distribution, inside the training set.
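That "go back to your training set" step can start as simply as measuring a suspect pattern's prevalence on each side of the training set. In this sketch the predicate (say, "imports dnsapi.dll") is hypothetical; any callable that recognizes the pattern will do:

```python
# Sketch: compare a suspect pattern's prevalence across the benign and
# malicious training subsets. The predicate is hypothetical; plug in
# whatever recognizes the pattern (an import, a section name, a string).
from typing import Callable, Iterable

def pattern_prevalence(samples: Iterable[str],
                       matches: Callable[[str], bool]) -> float:
    paths = list(samples)
    if not paths:
        return 0.0
    return sum(matches(p) for p in paths) / len(paths)

# Assumed to exist: benign_paths, malicious_paths, imports_dnsapi().
# benign_rate = pattern_prevalence(benign_paths, imports_dnsapi)
# malicious_rate = pattern_prevalence(malicious_paths, imports_dnsapi)
# A large gap (say 1% vs 45%) suggests the model may have generalized on
# the pattern itself rather than on maliciousness.
```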
So Tal just talked about how we might use all the data we've shown here in the false positive and false negative research phase, which usually occurs post-training, whether you're still testing and evaluating a newly created model or, for the folks here coming from AI vendors, next-gen vendors, or the legacy vendors that now use machine learning as well, whether you're looking at production data, meaning false positives and false negatives that occur in production. But a lot of what we talked about comes back to what we do pre-training: how we build and engineer the best data sets possible, which ultimately enables us to create the best models possible. These are some of the dos, so to speak, the recommendations we have based on what we've seen.

First, really understand the threat landscape of whatever problem you're trying to solve, and think about how it might affect the data, because it will. You need to differentiate between the innate features, features in the general sense of the file format's attributes and functionality rather than model features, and how those are abused in the threat landscape, versus what is not really an abuse of the format but an actual vulnerability or exploit being leveraged. Those things will manifest differently in your benign data sets and your malicious data sets, and you need to think about that carefully and understand it before you go about building your data sets.

Another thing we want to mention: different file formats behave and distribute very, very differently from one another. You can see that even in formats that are very closely related; 32-bit and 64-bit PEs don't look the same and don't behave the same, at least as far as their threat landscape distribution is concerned. The same goes for different types of document files, even within more specific containers or formats like OLE2: DOC files, Excel files and PowerPoint files don't distribute the same, although their underlying format is the same.

One of the most important takeaways, we believe, is that we all need to very carefully examine the raw data we take, and examine its distribution. Of course, we all normally try to build the biggest data sets possible with respect to whatever we can find, or however big the available data is, and sometimes it's a question of the resources and computation we have at hand. Starting as big as possible is a good approach, but then really examine the distribution and make sure it seems reasonable. Noise and distortion can sometimes be found in the most unexpected of places, like time distribution, with really weird phenomena, like months where PDF was apparently the king of malware; it isn't actually the case, and usually it's some kind of distortion that you'd better clean out.

Now, for a lot of what we've mentioned here, there are no textbook recipes. But as we all know, well-thought-out trial and error is our best friend in many cases when applying AI to the field we're in, and this is the part where data science is a bit of an art as well. Yes, of course there will be more files with macros in a malicious Excel data set; but whether we want to keep that exact distribution, the way it looks in the raw data, our recommendation is no: engineer it somehow differently. In what way, whether you do it with resampling or by compromising on a smaller data set with a different distribution, depends on everybody sitting here and on your peers and colleagues.
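As one illustration of engineering the distribution differently, here's a sketch that downsamples one side of a single binary attribute, say "has a macro", to a chosen target prevalence. The target rate is a judgment call, not a recipe, and the sketch assumes you've already split your file list by that attribute:

```python
# Sketch: downsample the attribute-positive files (e.g. "has a macro") so
# they make up target_rate of the final set instead of the raw feed's share.
# The 40% target below is purely illustrative.
import random

def resample_to_rate(with_attr: list[str], without_attr: list[str],
                     target_rate: float, rng: random.Random) -> list[str]:
    """Keep all attribute-negative files; subsample positives so they form
    target_rate (0 < target_rate < 1) of the result."""
    keep_n = int(target_rate * len(without_attr) / (1 - target_rate))
    keep_n = min(keep_n, len(with_attr))
    return rng.sample(with_attr, keep_n) + without_attr

# e.g. cap macro-bearing files at 40% of a malicious Excel training set:
# training_files = resample_to_rate(macro_files, no_macro_files, 0.40,
#                                   random.Random(7))
```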
Again, we recommend that you look at the data, do your trial and error, and see what works best; different models, different feature spaces, different file formats and different threat landscapes call for quite different solutions. This pretty much concludes our talk. Thank you all very much for listening, and we'll be happy to take any questions.

[Audience question.] In a way, but the question is: what is your bias, and again, what's the problem you're trying to solve? Sometimes, at a very early stage, the problem is how to build the best model, one that generalizes on the right things and not on the distribution of the threat landscape as it is, because that might lead to adverse results. And sometimes it's about fixing something you've already found to be problematic, a specific pattern that consistently causes false positives or false negatives, and then our recommendation is: go back to your data. Sometimes the problem is not algorithmic; it's not in your feature space, not in your model architecture or the specific algorithm you chose. Sometimes the problem simply lies in the data itself, even if that might not be simple to solve.

If I can just add something: the issue is that when you create a data set, it's not possible to find all of these biases up front. In many cases, only after seeing a production environment, or after using the model for a period, do you find these patterns and biases. That's the point; it's not always easy to find them.

[Audience question.] That's a great question, and the answer is yes. I'll repeat the question: did we try to come up with some kind of metric for building data sets, or for balancing them? The short answer is yes. The longer answer is yes, but it's way out of scope for this talk and for 45 minutes, because it's so different between different file types, different formats and different threat landscapes; again, there's no textbook recipe. And what we've done is good for our approach, our models, the way we work, our infrastructure and framework; it might be very different for other researchers and other companies that use intrinsically different approaches, algorithms, models, data sources and so on. In that sense, as we said, there's no textbook recipe; if there was one, we'd share it. Thank you all.