Welcome to AI Village. The next talk is "It's a Beautiful Day in the Malware Neighborhood" by Matt. We'd like to thank our sponsors Endgame, Cylance, Sophos, and Tinder. And of course, silence your cell phones, and if you have an open seat next to you, please raise your hand so people coming in know there's a seat. Thanks.

Hey, good afternoon everyone. Even though Cylance is a sponsor, don't worry, this isn't a sponsor talk; this tool is completely open source. Again, my name is Matt Maisel, I'm the manager of security data science at Cylance, and today I'm going to be talking about nearest neighbor search techniques applied to malware similarity, specifically in a tool called Rogers that's open source on GitHub right now. I have a feature branch where I'm working on some updates from today's content, and this tool is designed for malware analysts and security data scientists to perform malware similarity research.

So, some motivation. Building databases of malware is interesting for analysts and for data scientists, because search and retrieval of similar samples can provide valuable context to analysts and systems. The objective is to build a database indexing our malware by some attribute or set of attributes, and when we have some unknown sample, query that database. If we've ever seen a similar sample, we hopefully get back valuable context: maybe other labeled samples, maybe samples that have been reversed and that we have a lot of details on. That's use case number one for these systems. Use case number two is a sample we've never seen before that also doesn't match anything in our corpus: we can prioritize it for manual analysis or maybe more advanced reverse engineering.
Finally, the last use case for search and retrieval systems for malware similarity is augmenting larger systems doing clustering or classification: we can use nearest neighbor search to process incoming alerts and use any hits we get back, with their context, to decide whether to route a sample to other workflows in our environment. Historically this has been done with big databases of hashes. Fuzzy hashing, notably ssdeep, is also still a standard approach, a de facto one I'd argue.

So how does this relate to nearest neighbor search? If we consider malware similarity as comparison of raw bytes, or of extracted static and dynamic features that distill a sample's semantic characteristics, we can represent those features in an n-dimensional feature space and feed them into many nearest neighbor algorithms, as well as other machine learning algorithms. Nearest neighbor search, simply put, is this task: given a set of samples X, our corpus, take a query sample XQ, query the index, and get back the K nearest neighbors according to some distance function. Many different distance functions could be used here; a gentleman earlier today mentioned Euclidean, and we can use cosine, or even string metrics like edit distance or Levenshtein distance. Ultimately, nearest neighbor search is hard at scale in a high-dimensional space, so we have to look at approximate variants that allow some error threshold epsilon and bound our true distance whenever we query the index. As a really simple example in this two-dimensional space, if we query this red dot, the K nearest neighbors for K equals three would be the three samples in the inner circle.
With an approximate variant, there might be a chance that we accidentally return some of the samples in the larger radius instead. Don't worry, I don't have any algorithms on the slides, and I actually cut a lot out, but there's a lot of interesting theory and literature around nearest neighbor search from the past several decades. I categorize the methods into three areas. First, there are tree-based methods, where we partition the data set into cells in the feature space and use tree data structures to exploit that, rapidly looking up and identifying the cell, shown here still in a two-dimensional space, and the nearest neighbors in that particular cell. There are also hashing-based nearest neighbor methods, where we typically apply a non-cryptographic hash with the property that a small change in the input space results in only a small change in the output space. The idea is that we're actually looking for collisions between similar objects, and we can come up with different hashing algorithms for that. Locality sensitive hashing is one popular family, where the whole idea is to find hash functions that take an input sample, represented here by our vector, and hash similar samples into the same bucket, so they end up with the same hash code. Ultimately this reduces the number of candidates we actually do distance comparisons on to the ones that land in the same bucket, because doing pairwise distance is super expensive and no one wants to do that.
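The bucketing idea can be sketched in a few lines. Here each hyperplane contributes one bit (which side of the plane the vector falls on), and vectors sharing every bit land in the same bucket; only colliding candidates get compared. Real LSH draws the hyperplanes at random, but they're fixed here so the example is reproducible. This is purely illustrative, not the Rogers or scikit-learn code.

```python
from collections import defaultdict

# Fixed hyperplane normals (a real implementation would sample these randomly)
PLANES = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0), (1.0, 1.0)]

def hash_code(v):
    # One bit per hyperplane: the sign of the dot product with its normal
    return tuple(1 if v[0] * p[0] + v[1] * p[1] >= 0 else 0 for p in PLANES)

buckets = defaultdict(list)
corpus = {
    "a": (1.0, 0.9),   # similar direction to the query
    "b": (1.0, 0.8),   # also similar
    "c": (-1.0, 0.5),  # very different direction
}
for name, vec in corpus.items():
    buckets[hash_code(vec)].append(name)

query = (0.9, 0.8)
candidates = buckets[hash_code(query)]
print(candidates)  # -> ['a', 'b']  ("c" hashes to a different bucket)
```

Only `a` and `b` would then get an exact distance comparison, which is the whole point: the hash prunes the candidate set before any expensive pairwise work.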
Finally, a more recent family of approaches for nearest neighbor search is graph-based methods. The general idea is to build proximity graphs, maybe several layers of graphs stacked together. These algorithms have an initial offline phase that builds the graph, connecting neighbors with edges based on their similarity; then at query time we land in one part of the graph and navigate around, traversing it, potentially across multiple layers, building up our candidate set for comparison. The downside is that a lot of the graph-building algorithms are extremely expensive: during the offline build phase we have to construct specific types of graphs that make them easy to search and traverse in a short amount of time.

There's a crap ton of methods out there, and I highly recommend checking out the ANN benchmarks page on GitHub (there's a paper associated with it as well). Every so often the developer reruns the latest and greatest implementations of these various nearest neighbor search methods across a wide variety of data sets. The typical benchmark is the trade-off between queries per second, how fast we can look up items, and recall, the fraction of the true nearest neighbors returned in our search. The general idea is that up and to the right is better, but you can usually see the trade-off: we can query our index for nearest neighbors very quickly, say in a large production system, but at the expense of really low recall; conversely, if we want high recall, meaning the approximate method brings back essentially the exact results, we typically pay for it in queries per second.
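The traversal at the heart of these graph methods can be sketched as a greedy best-first walk: move to whichever neighbor is closer to the query, and stop at a local minimum. This is a single-layer toy with a hand-built graph (the names and edges are made up for illustration), not the internals of any real library; HNSW repeats this walk across its stacked layers.

```python
import math

def greedy_search(graph, points, query, start):
    # Walk to whichever neighbor is closest to the query; stop when no
    # neighbor improves on the current node (a local minimum).
    current = start
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

points = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.5),
    "d": (3.0, 1.0), "e": (3.2, 1.1),
}
graph = {  # edges link nearby points, forming a navigable structure
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d"],
}
print(greedy_search(graph, points, (3.1, 1.0), "a"))  # -> d
```

The walk visits only a handful of nodes rather than the whole corpus, which is where the speed comes from; the quality of the answer depends entirely on how well the offline phase built the graph.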
Now, one algorithm to point out here, and I'll get into it in a bit, is HNSW, or Hierarchical Navigable Small World; that's this line right here. It does fairly well, and this is just one example on the New York Times data set for K equals 100; if you go to their site you'll see it actually does fairly well across a large variety of data sets and at varying levels of K. Hence it's one of the algorithms I specifically picked to look at, using an existing implementation in Rogers for malware similarity. The method was created in 2017, building on a lot of earlier work in graph-based nearest neighbor search, but the basic idea at a high level is to construct a multi-layer graph and use it to greedily identify candidates for comparison. As I alluded to in my overview slide, there's a phase where we construct the graph, then we query candidates through a traversal mechanism and iteratively search neighboring nodes until some stopping criterion is met; HNSW defines all of that, the stopping criteria and the way the graph is built. To sketch it out: we build this structure consisting of multiple layers of graphs, with parameters of the algorithm determining how shallow or deep a sample ends up in the layers. At query time we go top down: we start at some entry point in the top layer and navigate; in this case there's only one sample, so the search examines the neighborhood, goes right to that sample, eventually reaches a local minimum because there are no other neighbors to look at, and then drops down to the next layer. This process continues until we reach the final layer, and it can also be tuned at query time to determine how deep into the layers we'd like to go. Ultimately, and this is the basis of the paper's approach, this ensures that the samples we visit across all the layers are likely to be nearest neighbors, and that's what determines the set of candidates we end up doing a comparison against with our query.

So that's a graph-based method. There's also another really recent method that I caught at the NIPS workshops, the machine learning research conference, back last fall. It's really interesting: it's called Prioritized Dynamic Continuous Indexing, or PDCI, and there's an earlier iteration of it just called Dynamic Continuous Indexing. The authors design an exact randomized algorithm built around the idea of avoiding partitioning samples by vector space. Going back to the tree-based methods, where we split up the feature space along each feature, that approach has a lot of issues in high dimensions. The PDCI authors noticed that instead we can build simple indices by projecting our samples along random directions, and we control the number of indices we build, which largely determines how well the method works. So we construct multiple indices, and the main gist is that as you visit each index, your query is projected and compared against the samples whose projections are nearby, whether slightly larger or smaller; you pop those samples off, and if a sample keeps appearing across all the indices, the paper shows it's highly likely to be among the exact nearest neighbors, so you add it to the candidate set for comparison. This is particularly interesting because it's an exact nearest neighbor search method, and some of the guarantees in the paper are pretty compelling. Unfortunately, because it's an academic paper (the author is a very well respected PhD student, I think at Berkeley), there's no open source implementation yet, so I went ahead and tried a naive implementation in Rogers. So that's PDCI; these two algorithms are the ones I focus on for this talk.

As for other malware similarity systems, there are quite a few with different approaches to nearest neighbor search, or to similarity in general. VirusTotal, of course, has different ways to index data, including ssdeep, but also a clustering API that, from my understanding of the docs, is based on feature hashing of the structural data pulled out by static feature extraction. This is actually where I sourced some of my data sets for evaluating these methods, as you'll soon see, to my detriment unfortunately.
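The random-projection idea behind DCI can be sketched as follows: index points by their projections onto a few fixed directions, and at query time pull the points whose projections sit closest to the query's in every index, doing exact distances only on those. This is a simplified, non-prioritized illustration with hand-picked directions; the real PDCI algorithm prioritizes which index to visit next and carries formal guarantees this sketch lacks.

```python
import math

# Fixed projection directions (a real implementation samples these randomly)
DIRECTIONS = [(1.0, 0.0), (0.0, 1.0), (0.7071, 0.7071)]

def project(v, d):
    return v[0] * d[0] + v[1] * d[1]

def build_indices(points):
    # One sorted index of (projection, point id) per direction
    return [sorted((project(p, d), name) for name, p in points.items())
            for d in DIRECTIONS]

def query(points, indices, q, per_index=2):
    votes = {}
    for d, index in zip(DIRECTIONS, indices):
        qp = project(q, d)
        # Points whose projections are nearest the query's projection
        nearest = sorted(index, key=lambda e: abs(e[0] - qp))[:per_index]
        for _, name in nearest:
            votes[name] = votes.get(name, 0) + 1
    # Only points retrieved from every index get an exact distance check
    candidates = [n for n, v in votes.items() if v == len(DIRECTIONS)]
    return min(candidates, key=lambda n: math.dist(points[n], q))

points = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (5.0, 5.0), "d": (1.2, 0.9)}
idx = build_indices(points)
print(query(points, idx, (1.1, 1.0)))  # -> b
```

Each index is one-dimensional, so it stays cheap to maintain and search; the agreement requirement across indices is what filters out points that merely look close along a single direction.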
Our very own Brian Wallace, one of the AI Village core team members, wrote a blog post and a Virus Bulletin paper a few years back that exploit the way the ssdeep digest is built to reduce the number of comparisons needed; more recently that idea has been applied to Elasticsearch as well, so you can use an off-the-shelf database with the same method to index ssdeep. That's one of the similarity digest approaches in the larger group of hashing-based nearest neighbor methods. Then there are the popular academic malware similarity systems. BitShred, from 2011, is highly cited; it uses pairwise Jaccard similarity, and since a fully pairwise comparison is very expensive, it uses Hadoop to do it. There's also the malware provenance system, which is a bit more recent; it uses MinHash, a locality sensitive hashing family that approximates Jaccard similarity, applied across a sliding window of n-gram features on disassembled samples. The final two: there's Malheur, which focuses on behavioral feature similarity, specifically for Cuckoo reports; it has a lot of capabilities built in for clustering and classification, but underneath the hood it uses behavioral features for prototype identification, or prototype selection, so you can identify prototypes within a large cluster, pretty much like centroids around those points, and use them to do comparisons quickly. And there's SARVAM, which borrows computer vision ideas from image indexing: it takes a binary's raw bytes, converts them into a grayscale image, and indexes that.

So there's a handful of systems out there and a lot of algorithms to choose from, with different properties, going back to that performance trade-off of queries per second versus recall. When approaching the design of a system to do malware similarity, and specifically to use that system to evaluate different nearest neighbor search techniques, I defined four key design ideas. One, we have to extract and store sample metadata and raw features. Two, we have to transform that raw data into some feature representation in an n-dimensional feature space, and we might have a variety of vectorization pipelines we want to experiment with; talks earlier today mentioned a few, like TF-IDF, and I use feature hashing in my more recent approach, so we want to be able to swap out the vectorization pipeline depending on what we want to evaluate, which features we want to include, and how large the data set might be. Three, after we transform our features with one of these pipelines, we fit the different nearest neighbor methods and do some bookkeeping, saving whatever database structures are required; that's the fit stage. And finally, once we've fit all these indices, we want to query samples: pick the parameter K to determine how many nearest neighbors to pull back, and, if we have a database of sample metadata, possibly display contextual features as well. Maybe we have a case database of previous incidents in our environment and want to annotate the samples we pulled back with some of that, which might help with analyst context. That really gives us, or really what I came up with, is this design for Rogers.
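Those four stages can be sketched end to end: extract token features, vectorize them with feature hashing, fit a (here brute-force) index, and query for the K nearest samples along with their stored metadata. All class names, method names, and the toy API-name features are hypothetical, not the Rogers API.

```python
import math
import zlib

DIM = 16  # fixed output dimension for the hashed feature space

def hash_vectorize(tokens, dim=DIM):
    # Feature hashing: each token increments a fixed slot chosen by its hash
    v = [0.0] * dim
    for t in tokens:
        v[zlib.crc32(t.encode()) % dim] += 1.0
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

class Index:
    def __init__(self):
        self.vectors, self.meta = {}, {}

    def fit(self, samples):  # samples: {sha256: (tokens, metadata)}
        for sha, (tokens, metadata) in samples.items():
            self.vectors[sha] = hash_vectorize(tokens)
            self.meta[sha] = metadata

    def query(self, tokens, k=2):
        qv = hash_vectorize(tokens)
        ranked = sorted(self.vectors, key=lambda s: -cosine(qv, self.vectors[s]))
        # Return neighbors annotated with their stored metadata
        return [(s, self.meta[s]) for s in ranked[:k]]

index = Index()
index.fit({
    "s1": (["CreateFileA", "WriteFile", "RegSetValueA"], {"label": "trojan"}),
    "s2": (["CreateFileA", "WriteFile"], {"label": "trojan"}),
    "s3": (["socket", "connect", "send"], {"label": "backdoor"}),
})
hits = index.query(["CreateFileA", "WriteFile", "RegSetValueA"], k=1)
print(hits)  # the matching sample comes back first, with its metadata
```

Feature hashing keeps the vector dimension fixed no matter how many distinct tokens the corpus contains, which is exactly why it suits larger data sets than a fitted vocabulary does.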
Rogers is a Python 3 application. It has a sample class, really only a PE class right now, that focuses on basic static feature extraction using pefile, but it's built so you could expand the set of sample classes if you bring in things other than portable executables. For vectorizers, there are scikit-learn pipeline APIs that I use extensively; right now I have a latent semantic analysis pipeline, which, as an earlier talk mentioned, uses TF-IDF and then projects down, and more recently, because I started working with data sets that were a bit larger to handle, I started looking at feature hashing approaches. This can be extended with anything supported by scikit-learn, or with other vectorizers too. The final component is the index class, where I either wrap libraries, such as HNSW, or LSHForest from scikit-learn (which is going to be deprecated soon anyway), or provide my own implementations of indexed ssdeep and PDCI, which at this point use SQLite as the store. All the feature data, I forgot to mention, is stored in a protobuf message definition structured so you can add different modalities of features: static features, dynamic features, contextual features. If you want, you can also give each feature a variable type, so feature vectorizers could be built automatically later on, though that's not really supported yet in the vectorizer class.
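As a plain-Python sketch of that storage idea, here is one sample record with several feature modalities, each feature optionally typed so a vectorizer could be derived from it later. The real storage is a protobuf message; these class and field names are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    value: object
    var_type: str = "categorical"  # e.g. "categorical", "ordinal", "text"

@dataclass
class SampleFeatures:
    sha256: str
    # One dict per feature modality, as in the protobuf structure described
    static: dict = field(default_factory=dict)
    dynamic: dict = field(default_factory=dict)
    contextual: dict = field(default_factory=dict)

s = SampleFeatures(sha256="e3b0c442...")  # truncated placeholder digest
s.static["pe.num_sections"] = Feature(5, var_type="ordinal")
s.static["pe.imports"] = Feature(["CreateFileA", "WriteFile"])
s.contextual["first_seen_feed"] = Feature("vt")
print(s.static["pe.num_sections"].value)  # -> 5
```

Keeping modalities separate means a vectorizer can pick, say, only static features, or combine static and dynamic, without changing the stored records.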
Cool, so now, unfortunately, to the sad panda part of my talk. Doing these types of experiments and getting data sets for malware similarity is difficult, and I didn't get as much time for these experiments as I wanted. What we're looking at here are two charts. This chart is recall at K for exact nearest neighbors: on the x axis we have the values of K picked for each experiment, and on the y axis the recall, again the fraction of the true nearest neighbors returned for the query. On this side we have precision at K for neighborhood class, a similar kind of metric, looking at the relevant results over K. Here I'm just using the class labels of the samples: if I query a sample and every result that comes back is the same class, since I have labels for them, that indicates high precision at K. If you look at this, it does very well on these samples, and I'll illustrate why in a second, but the actual exact nearest neighbor results are pretty bad: around 0.3 for PDCI, and HNSW is pretty low too. And I did parameter selection, I did a grid search over parameters for each of these methods, and I really couldn't get anywhere. Just to highlight, this is the VT clustering data set, with 27,000 samples and 15 classes.

Cool, so a quick demo to illustrate at least the interface. Rogers is exposed as a command line application, so you can use it on the CLI, but I also have APIs exposed so you can import Rogers, build an index, and, in this case, use plotly to visualize some of the results. I'll blow this up a little bit; like I said, it's probably still hard to see in the back, but the idea is that I've previously fit an index, this one specifically HNSW, and I have some standard APIs for passing in samples, setting the number K, and getting back the neighbors. If we clear this, we can see we get back the sample. Here's the query sample, with some print statements just to make it easy to see what we're getting back, and here are the nearest neighbors. And look how similar these are; this is cosine similarity, and the graph here just displays them. But if we look at some of the features: here's the query sample, with the Lamer label, and when we look at some of these other samples, one is RegRun, there's another one too, and they're totally different; the labels themselves are different. What I realized after getting into this is that my feature space is limited to the basic static feature extraction methods, and given that a lot of these samples are, I think, just packed, everything really looks identical. I think that might explain why I had such bad results, and it illustrates the need for better data sets for evaluation.

As a little bonus: Zory was released at Black Hat, and because of the frustration I had with the basic static feature extraction, I went ahead and implemented a Zory feature extraction class for PE. In this case I only got as far as pulling out a bag of words over the mnemonics; here's an example of running Zory on this particular sample. I was able to run it on around 500 samples or so and build a vectorizer, and now we can rerun the same query (sorry, this is a different sample) and leverage the bag-of-words mnemonics in addition to the other static features that are pulled out. It gives us slightly different results, though I haven't done any formal comparison of these methods yet.

That wraps things up, I think, but the general idea is that Rogers is a tool for experimenting with different nearest neighbor search techniques, and also a tool for building out vectorizers for the different methods. Similarity in your environment might depend on your use cases: you might only want static comparison, only dynamic, or both, or you might want to apply an automated disassembler like Zory. The vectorizers I published with this tool are clearly limited to PE, and there's definitely opportunity for different modalities. There are also opportunities for feature selection and for learning representations, to come up with a better feature space for similarity comparison. For experiments, I did run some parameter optimization in my case, but I just need more data sets for benchmarks, and I think it would be interesting to evaluate different distance metrics beyond Euclidean and use them to determine similarity for some of these methods; HNSW by default uses cosine, for instance, while PDCI uses Euclidean. And finally, more use cases: here we're only indexing malware samples, but you could potentially index benign samples as well, and of course the key is being able to continuously update the index with new samples as they come in and are classified or analyzed, so a partial fit or insert operation would be pretty easy to extend too. Cool, so at this point, questions. Again, this tool
is up on GitHub. I do have a feature branch, and I apologize, I've got to get it out, it's just been crazy the past week; that branch will add feature hashing, add PDCI, and publish the experiment. Again, I feel the experiment results were pretty weak, but maybe that's explained by the data set I was using; it would certainly be interesting to experiment more. So yeah, pull requests are welcome. Any questions at all?

[Audience question] Sorry, show the vectorizers? Yeah, so with Zory I was experimenting with this one right here, a signature vectorizer. This one actually uses the YARA rules repo: I use the YARA detections themselves as features and apply TF-IDF to try to figure out which signatures are more useful than others. So in this case there really isn't any feature selection other than applying TF-IDF and then projecting down. Any final questions?

[Audience question] I'm sorry, I can't hear you at all, I apologize... I still can't hear you... Yeah, as I mentioned, HNSW uses cosine and PDCI uses Euclidean, and the metric is just recall at K: I do exact nearest neighbor search on my data set using brute force, computing it for Euclidean and for cosine, and those are the ground truth I use for recall at K on the exact neighbors. And as I mentioned, I don't do any feature learning or anything like that, and I also think going beyond the basic static features would obviously change the results significantly, so I hope to do that in the future.

[Audience question] Oh, sorry, didn't see you. Yeah, so the question is, do more experiments? Pretty much; experiments are fun, sometimes they're sad, but you've always got to push forward to the next one. The other question was what I learned doing this implementation. It was definitely a lot of fun to do the PDCI implementation; again, it's really naive, in Python 3, using SQLite to store those indices, but it was great to have the chance to go through a paper with no published source code, no reference implementation, and try to implement it, and I picked up a lot of Python 3 things during this project as well.

[Audience question] No, those algorithms don't do any feature engineering or feature selection; you basically pass your feature vector into the index, so that's a preprocessing step, which in this case I did at the vectorization phase. Cool, thanks everyone.
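The recall-at-K metric referenced in the evaluation and in this Q&A is simple enough to state in a few lines. This sketch (hypothetical names, toy data) computes it from a brute-force ground-truth ranking and an approximate result, exactly the comparison described above.

```python
# Recall@K: the fraction of the true (brute-force) nearest neighbors
# that an approximate method actually returned in its top K.
def recall_at_k(true_neighbors, returned, k):
    truth = set(true_neighbors[:k])
    return len(truth & set(returned[:k])) / k

# Ground truth from exact brute-force search vs. an approximate result
exact = ["s1", "s2", "s3", "s4"]
approx = ["s1", "s3", "s9", "s2"]
print(recall_at_k(exact, approx, k=4))  # 3 of the 4 true neighbors -> 0.75
```

Computing the ground truth once per distance metric (Euclidean and cosine here) and reusing it across methods is what makes the benchmark comparisons consistent.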