Our next speaker is Dr. Ethan Rudd, Senior Data Scientist at Sophos. Hello, everybody. Yes, my name is Dr. Ethan Rudd. I'm a Senior Data Scientist at Sophos, and the title of my talk is "Loss Is More: Improving Malware Detectors by Learning Additional Tasks." So, before I go on to the meat of the talk, I just wanted to let you guys know a little bit about who I am, so that you know that I'm not just some random guy that stumbled in here off the street. This is my first time at DEF CON. Very excited to be giving a first DEF CON talk. And thank you. I've been on the Sophos Data Science team for about two and a half years in a research capacity. Prior to that, I worked on several projects and in several areas of applied machine learning. My PhD research was funded by the IARPA Janus project, a face recognition project. I did a project at Google with their Advanced Technologies and Projects team. And then I've been involved in several other small business and university projects. I mention the face recognition stuff because we're also running a great facial recognition demo at the Unwind session, so please check that out. And you can check me out on Twitter or Google Scholar for various research if you like this talk. So, what is this talk about? Well, as we've seen, there have been several great talks on machine learning for information security prior to this one. But many machine learning malware detectors are trained on a single malicious-or-benign label, when there's actually lots of additional information available: lots of additional labels, lots of additional metadata, et cetera. And so, really, the question that we answer in this talk is: can we craft a bunch of auxiliary labels to train on, rather than just having a single malicious-or-benign label? And can we get better performance? Well, as it turns out, we can.
And interestingly enough, we also find that these performance gains can be attributed to a better-informed classifier. And I'll explain what I mean by that a little bit later on. So, in before-and-after photos, if you will, what we're talking about is adding additional loss functions during the optimization. On the before side, you'll see that we have only a single loss function. This is how many malicious/benign detectors are trained, and they work pretty well. There are a lot of them that are commercially deployed. You can get good performance. But if you add a bunch of auxiliary loss functions on a bunch of labels, hence the "after" part of this before-and-after, we get way, way better performance, as it turns out. So, before I dive into exactly how we formulate this, I'll just give a brief review of machine learning for malware detection. Up until about 2015, most malware detectors were largely signature-driven. There were a few machine learning approaches, but ML really took off around then. Now, actually, a lot of detectors consist of hybrid ML and signature engines; they use signatures largely for blacklisting. And one can really break detection down as in this diagram, where ML and signature detectors both work on static and dynamic features. For ML, static is a little bit more common, and we focus largely on static detection in this talk. The reason, by the way, that static features are more common is that to get ML to work well, it requires lots and lots and lots of data, and it's easier to collect a lot of static data. So, we find that we can actually do very, very well with that. A typical detection pipeline is built on some sort of a binary classifier, a deep neural network, or maybe a gradient boosted machine. I'll discuss the deep neural network use case for this talk. And we're talking about training on millions to hundreds of millions of labeled malicious and benign samples.
And we're also talking about classifiers that are periodically retrained to reflect current threat landscapes. These can be deployed in a lot of different contexts: on endpoints, in the cloud, in security operations centers. It really doesn't matter for the purposes of this talk. Now, as far as labeling sources, a lot of vendors rely on vendor aggregation services or threat intelligence feeds, which basically take a bunch of vendors or different labeling sources throughout the industry, submit malicious and benign samples to those, and ask: okay, how do these label the samples? And then some sort of an aggregate label is generally derived. Often there's also a little bit of time lag between the time at which the samples are submitted to the vendors and the time the label is taken, which is left to basically let vendors update their blacklists and let the scores settle down. But long story short, most approaches take an aggregate malicious-or-benign label that is obtained from these threat intelligence feeds. Now, because these detection engines use ML, they need to convert the malicious and benign samples to some sort of an ML-friendly numerical representation. There are a variety of ways to do this. Some try to do something that's closer to raw bytes; some use various types of feature vector representations. We presume a specific one, and we use portable executable malware and benignware in this work, but the approach is fairly broad and can be done in a lot of different ways. So the way that a typical neural network will look is: you'll have features that are extracted from malicious and benign samples during training. A forward pass is done through the network.
The output of the classifier is taken, and then some loss function, along with the associated label, is used to correct the representation so that we have a good representation that we can then later deploy. At deployment time, we take this learned representation, and here I just want to highlight that we don't have any labels on the malware samples; that's what our classifier serves to provide. We deploy that to wherever we're deploying, whether on endpoints, on SOCs, or in the cloud. And then we submit feature-vectorized forms of our files in real time to the classifier. In this case, our classifier is a neural network. It could be whatever, but we're dealing with neural nets in this work. And we use the predicted output basically as a score. This says: okay, how malicious is this file? A maliciousness score, if you will, that one can threshold in a variety of ways. So this is how things are commonly done, I should say. However, just a little bit more on these threat intelligence feeds: they have lots and lots more information than just whether a given file is malicious or benign. In fact, that's even a simplification of what they're providing. They also provide information on individual vendor detections. They provide, of course, the net number of vendor detections. And then they provide some string information on the detection names per vendor, at the very least. Some provide a lot more. So, really revisiting the original question that I posed: can we craft and learn from auxiliary labels and get a better detector? And the answer is yes, in fact, we can. The technique that we've derived to do this we refer to as ALOHA, or Auxiliary Loss Optimization for Hypothesis Augmentation, hence the nice Hawaiian art here. So, in short, we've got all this auxiliary information that we want to utilize.
And as we saw in the case of just a malicious-or-benign label, we already have this one loss function that we use, so why not just add more loss functions? That is really the crux of what ALOHA does: more labels, more loss functions. And this has a couple of nice advantages. First, although we have more labels and we use more network outputs during training, we do not actually have to use these at deployment time. So we can notionally get a much better network representation during training, but at deployment, we don't have to update our infrastructure at all. Alternatively, we can use the additional auxiliary outputs to do certain additional tasks. So if we want to do things that are, I'll say, specific to an EDR or MDR type application, where we're getting more fine-grained information about the particular malicious samples, say maybe in a SOC, but maybe not on our endpoints, we can sort of dual-purpose this training and use the learned models in a variety of ways. For our labeling sources in this work, we selected nine vendors from our aggregation feeds and used detection labels from each of the respective vendors. We also used the net number of vendor detections as an integer-valued auxiliary target; there were more than nine vendors in our feed, there were tens. We also used our main target, this aggregate malicious-or-benign label. And then we also used 11 semantic malware attribute tags. These describe the content of malicious and benign samples, and they are derived from the detection names within our feed. The derivation process I can speak a little bit more to at the end, but I would actually refer you guys to an, in my opinion, very good paper that we wrote on it, and I'll include a link for that at the end. But basically, these tags are not mutually exclusive, and they summarize the content of most of the samples in ways that a human can understand.
So for each of these additional labels and network outputs, we have additional loss functions. Our main aggregate loss function is a binary cross entropy loss taken between the output of the network and the aggregate malicious-or-benign label. This is a pretty common loss function for a lot of neural networks that are doing malware classification. With respect to our auxiliary losses, we have a loss function that is specific to the vendors, one for the semantic tags, and then one for the counts. For the vendor loss function, we take a sum of binary cross entropy losses over each individual vendor response. For the tag loss, we do a very analogous thing, but over each of the attribute tags. I would again point out that none of these tags are mutually exclusive, so we use binary cross entropy here, rather than, say, a softmax categorical cross entropy, and take a sum. Then for the count loss function, we use a Poisson loss. Prior to that, we apply an exponential activation to constrain our count prediction to be non-negative, as counts must be non-negative. Our total loss is written at the bottom here, and it consists of the malicious-or-benign loss plus all of our auxiliary losses, summed and multiplied by a constant. For the constant, in this case, we use 0.1. We didn't explore good values of this in depth, but other work has, and so we did this in a principled manner, consistent with some other work that I'll reference towards the end. So during training, and you've seen this at the beginning of the talk, basically we have all these aggregate loss functions, where we have a main malicious-or-benign loss; that's what we're trying to ultimately optimize for and detect.
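To make that loss formulation concrete, here is a minimal pure-Python sketch (no deep learning framework) of how the per-sample losses combine. All of the sample values, head names, and logits below are invented for illustration; in the real model the outputs would come from the network and the targets from the threat intelligence feed.

```python
import math

# Hypothetical per-sample targets (illustrative values only).
sample = {
    "malicious": 1.0,                  # aggregate malicious/benign label
    "vendors": [1.0, 1.0, 0.0, 1.0],   # per-vendor detections (subset of the 9)
    "tags": [1.0, 0.0, 1.0],           # semantic attribute tags (subset of the 11)
    "count": 7.0,                      # net number of vendor detections
}

# Hypothetical raw network outputs (pre-activation logits) for each head.
outputs = {
    "malicious": 2.0,
    "vendors": [1.5, 0.8, -1.2, 2.1],
    "tags": [0.9, -0.5, 1.1],
    "count": 1.9,                      # exp() of this gives a non-negative rate
}

def bce(logit, target):
    """Binary cross entropy on a sigmoid-activated logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def poisson_loss(logit, target):
    """Poisson loss; the exponential activation constrains the predicted
    count rate to be non-negative, as counts must be."""
    rate = math.exp(logit)
    return rate - target * math.log(rate)

main_loss = bce(outputs["malicious"], sample["malicious"])
vendor_loss = sum(bce(o, t) for o, t in zip(outputs["vendors"], sample["vendors"]))
tag_loss = sum(bce(o, t) for o, t in zip(outputs["tags"], sample["tags"]))
count_loss = poisson_loss(outputs["count"], sample["count"])

# Total loss: main loss plus the auxiliary losses scaled by a constant (0.1 here).
total_loss = main_loss + 0.1 * (vendor_loss + tag_loss + count_loss)
```

Only the sigmoid on the main head needs to survive to deployment; the auxiliary heads (and their losses) exist purely to shape the shared representation during training.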
But with respect to the auxiliary loss functions, we have our vendor counts, our individual vendor detections, and our attribute tags. Now, rather than aggregating all of these losses together, we could use just one or two of these auxiliary losses, or we could even potentially add more if we had more information in our feed. I would very much point out that this is, you know, sort of a proof-of-concept model, a very general sort of architecture that I'm describing. But the point is that adding these auxiliary losses theoretically helps. At inference time, however, absolutely nothing has to change whatsoever with respect to the network outputs. You'll see that in the prior slide we had all these different outputs that we added, but we prune those and the associated model parameters at inference time. And so our deployment infrastructure can remain entirely the same. We don't have to change that at all, which is nice from an engineering perspective. So I've made these claims that the ALOHA model works very well. Now I intend to actually provide some evidence of that. To do that, we collected a dataset of approximately nine million training samples, 100,000 validation samples, and 7.7 million test samples. The training and validation splits were taken temporally before the test split to ensure a fair evaluation; in order to ensure temporal consistency, we ordered our samples by time. For our aggregate malicious/benign label, we used what we call a 1-/5+ criterion, which basically means that for one or fewer vendor detections, we label as benign, and for five or more, we label as malicious. And then we ignore those with two to four detections. Now, I'd mention there are more sophisticated ways to do this; this is just the one we chose, largely for simplicity.
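The 1-/5+ criterion just described is simple enough to sketch directly. The function name is mine, not from the paper:

```python
def label_from_vendor_count(num_detections):
    """1-/5+ labeling criterion: one or fewer vendor detections -> benign (0),
    five or more -> malicious (1), two to four -> None (the sample is excluded
    from training rather than given a noisy label)."""
    if num_detections <= 1:
        return 0
    if num_detections >= 5:
        return 1
    return None
```

In practice, a training set would then be filtered to just the samples whose label is not `None`.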
But this works pretty well. It's a fairly standard practice. When we look at the vendor counts across our dataset, and looking at this was one of the reasons why we chose the 1-/5+ criterion, what you'll see is that there are a disproportionate number of samples with one or fewer detections, specifically zero, and then a lot that have many, many vendor detections. And bear in mind, these are plotted on a logarithmic scale. However, we still see, and this was one of our motivations for using a count loss initially, that just taking these basic thresholds washes out a lot of finer-grained information. So this was actually one of our motivations for the count loss: we might be able to say something about relative sample difficulty by adding that. We also looked at the respective vendors' agreement with one another, and these are plotted in this confusion matrix here for each of our nine selected so-called high-coverage vendors. As we can see, agreement occurs most of the time, but not all the time. I mean, vendors are consistent approximately 85 to 95 percent of the time, but they don't always agree. So perhaps there's some independent auxiliary information that we can glean from these. As features, we use the same features as Saxe and Berlin did in their work, "Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features." In full disclosure, Saxe is my boss, so that's one of the reasons why we chose to use these features; we used the features that he and others within our group derived. I won't go into these in depth, but I leave the paper there, and I just want to give a semblance of what these are. Basically, they fall into three different camps.
So 512 of the dimensions of our net 1024-dimensional feature vector are based on windowed byte statistic histograms, which are basically aggregate statistics over the entire file. We then have 256 dimensions devoted to a two-dimensional string length hash histogram, taken across a logarithmic scale of different string lengths, to which we apply the hashing trick. And then we also have specific PE metadata fields, like the exports, the imports, et cetera, that are hashed into another 256-dimensional vector. All of these get concatenated. So that's our representation of individual files, and that's how our dataset breaks down. When we compare performance, we tried using different combinations of our main malicious-or-benign loss with different auxiliary losses. We used just our malicious-or-benign loss as our baseline; that's tantamount to a lot of types of models that are currently deployed. Then we applied each individual auxiliary loss type. Then we applied everything combined, which I guess you might say is the full ALOHA model. And for each different loss combination, we actually fit five different classifiers, and we report our results in terms of mean and variance statistics over receiver operating characteristic curves, to be able to gauge statistical significance. Now, I know that there's a lot of talent in the room with a lot of different backgrounds, so for those of you that might need a refresher on receiver operating characteristic curves, or ROC curves: basically, we look at a false positive rate across the x-axis, and then a true positive rate, or a detection rate at that false positive rate, across the y-axis. And typically what's done in the industry is that at various false positive rates that are deemed sort of acceptable to the user, a threshold is chosen, and then you'll get the true positive rate at that threshold.
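As a concrete illustration of that thresholding procedure, here is a small pure-Python sketch that computes the detection rate at a target false positive rate from a list of maliciousness scores. The function names and the toy scores are invented for illustration:

```python
def roc_point(scores, labels, threshold):
    """FPR and TPR when flagging every score >= threshold as malicious."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return fp / neg, tp / pos

def tpr_at_fpr(scores, labels, target_fpr):
    """Detection rate at the best threshold whose false positive rate stays
    at or below target_fpr -- roughly how deployment thresholds are chosen."""
    best_tpr = 0.0
    for threshold in sorted(set(scores), reverse=True):
        fpr, tpr = roc_point(scores, labels, threshold)
        if fpr <= target_fpr:
            best_tpr = max(best_tpr, tpr)
    return best_tpr

# Toy maliciousness scores: the first four files are benign (label 0),
# the last four malicious (label 1).
scores = [0.10, 0.20, 0.30, 0.90, 0.60, 0.70, 0.80, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
```

Sweeping `target_fpr` downward traces out exactly the low-FPR region of the ROC curve that the results below focus on.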
So what we see when we add our count loss is that we do, in fact, get better performance in terms of the area under the receiver operating characteristic curve, which is basically a gauge of how good the curve is overall. But specifically, we also see at lower false positive rate regions a particular bump, and this becomes a bump in detection rate. This becomes relevant because as we get to lower and lower FPR regions, there are more deployment scenarios that we can address with our models. Similarly for the vendor loss: when we add that to our baseline, we see a boost in the ROC curve at the relevant region. We don't see quite as much of a boost in terms of the area under the curve; in fact, the AUC stays statistically pretty similar. But that's not to say that this isn't still a very significant result. Again, the AUC is a net statistic on the curve, but we don't really care so much about the higher false positive rate regions, because the detection rates there are very, very good already, and so we can deploy those very easily. So although the AUC is relatively similar, we still see this as basically a win. The tag loss gives us a similar result. And I'd mention that both of these loss functions, the tag and the vendor losses, not only assume a similar functional form, but they also give us an even better result than the Poisson, or count, loss did at the lower false positive rate areas, though they are slightly worse in terms of AUC performance. When we combine everything together, what we find is that we get even better results; not only are our results far better, but our variance between different model instantiations is reduced. And we see basically two modes of improvement here.
There's an improvement at the higher false positive rates, that is, above 10 to the negative third, and then there's an improvement below that. Basically, the higher-FPR improvements are really what are driving the area-under-the-curve improvements, but again, the lower-FPR ones are still quite relevant. So, in summary, we see that yes, adding additional losses does seem to improve performance, and notably, it also reduces variance across different instantiations of the model. We suspect that this variance reduction is occurring because, as you have more things to optimize for, you're inherently constraining your optimization process, so there aren't as many different ways that parameters can vary. We also see similar behavior for similar loss types. Both the vendor losses and tag losses consist of sums of binary cross entropy losses, and again, these seem to drive different things with respect to our ROC curve. We suspect that we see these higher-FPR gains in detection for the count loss because it actually does communicate something about the difficulty of samples. And then with respect to the vendor and tag losses, perhaps the network is able to correlate some sort of information between when, say, just one or two of these vendor tags trigger versus when all of them trigger, and so it drives things at lower FPRs. So, okay, I've presented some evidence, hopefully, that the ALOHA model is able to deliver better detection performance. But now I'll just really briefly discuss what's driving this performance gain. Is it some sort of a smoother optimization surface that's brought about due to a regularization effect of multi-objective optimization? Or is it perhaps due to a more informed representation from all of these different auxiliary label sources? Going into this, we sort of suspected the latter of the two, but we wanted to actually make sure and see what's going on here.
So in order to test this, we used auxiliary loss functions on so-called non-informative targets. We employed various mechanisms of duplicating labels, or providing labels that delivered no additional information about the sample. One way that we did this was with a pseudo-random label, where we took the hash of the file contents and just took the sign of that as an auxiliary target. So for a given file, you're going to be looking at the same label every time, but the labels are just pretty much randomly assigned. We also tried adding a duplicate target and optimizing for that, and then we also applied a duplicate target with a different type of loss function: we scaled and shifted a copy of the target label and then used a mean squared error loss on it, which is a common regression loss. So from these, what did we find? Well, we found that adding these non-informative losses did not improve our performance in any noticeable manner at all. In fact, performance was statistically identical, if not worse, when we added these auxiliary targets alongside our original target. So this suggests that yes, the ALOHA network's gains are actually coming from additional information from the additional labels. The network is doing what we want it to do, and it is learning a better, better-informed representation. So overall, what we find is that yes, our ALOHA technique works well, and it seems to be a result of the neural network's ability to actually correlate information from auxiliary labeling sources. It's not simply an artifact of regularization. We also have the advantage that the network can be trained and deployed with minimal changes to existing infrastructure that's out there, so no re-engineering of anything on the endpoint, anything in the SOC, or anything in the cloud has to take place.
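Going back to the pseudo-random control label in that experiment, it can be sketched like this. The paper takes the "sign" of a hash of the file contents; here I stand in for that with the parity of the first digest byte, which is an assumption on my part, not the paper's exact construction:

```python
import hashlib

def pseudo_random_label(file_bytes):
    """Non-informative auxiliary target: deterministic per file (the same
    file always maps to the same label), but the label carries no
    information about whether the file is malicious."""
    digest = hashlib.sha256(file_bytes).digest()
    # Parity of the first digest byte: effectively a coin flip keyed on content.
    return digest[0] & 1
```

Because the label is a fixed function of the bytes, training on it is consistent across epochs, yet it can only act as a regularizer, never as extra information, which is what makes it a useful control.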
And then there are also additional applications that these outputs can be used for, like EDR and MDR. One additional application, as an example: since we have outputs that describe the content of the malware, we can actually group malware by the predicted tags. And, you know, we might have an application where we want to deploy that sort of thing for internal use, or say as a service, yet still be able to deploy our main model. Well, we can do all of that under this one training regime, just by pruning our losses and pruning our outputs respectively. So, before I take some Q&A here, I just wanted to also mention some related research and some directions for future work, if you found this topic interesting. There's been a lot of research by our group and also by other groups that is related; it's interesting to look into, and it can perhaps be leveraged in some very similar ways. First, I'd mention that ALOHA is a USENIX paper now, so I'll be presenting this at USENIX next week, and it's available on arXiv as well, so feel free to check it out if you want more information. Interestingly, a gentleman by the name of Jason Trost actually did a nice blog post where he used ALOHA architectures for a much different problem: using Endgame's domain generation algorithm detection code, he tailored that to use basically these ALOHA losses, and he wrote up a nice blog post about his results.
There's also a paper out of Microsoft Research called MtNet. This paper is similar to ours in a variety of ways: it uses an impressively large dataset of dynamic features, and it largely substantiates a lot of our findings. However, they use multiple loss functions of only one type, a categorical cross entropy with softmax. They do something sort of similar to the tagging approach that we do, describing the content of each sample, except they use Microsoft malware family names, and so they actually do employ a mutual exclusivity assumption there. But anyway, it's another great paper, and it's very cool to see that they're also able to substantiate our findings with a much different data modality; it's also PE files, but with dynamic features. I'll also mention the paper on malware attribute tagging: SMART, or Semantic Malware Attribute Relevance Tagging, is a paper that we also put out there, which will pretty much tell you everything that you want to know about malware tagging. The tagging approach is the same as we employ in this work, so please see that for details on the tagging problem and the tag prediction problem; there are several other models that we employ in the SMART paper as well, so if you're interested, check it out. I will also mention another paper, MOON, a Mixed Objective Optimization Network. The reason why I bring this up is that while it is applied to facial attributes and has nothing to do with malware, it is actually the approach that I use in my face recognition demo, which, again, please stop by during the wind-down session if you want to see basically how this type of optimization can be employed very powerfully in action. The approach is fundamentally the same as ALOHA, but with a much, much different data modality. One final work that I'll mention, and then I promise I'm done, is a paper that we did called "Learning from Context." It uses
multi-view learning, or multi-input learning, in contrast to our approach of using multiple labels and multiple loss functions. But using this approach, we are able to include extra information in the representation, just in a different way; we're sort of turning the ALOHA approach on its head. And this type of approach could be trivially combined with ALOHA, I would mention, so that's maybe a nice direction for future work: having multiple PE file features, and then also other auxiliary information, like the embeddings of the path of the PE file on disk, which we concatenated together, but also potentially multiple labels. You know, just adding multiple other sources of information into the representation seems to work well, so it's definitely an avenue for future research. I'll finally close with an obligatory Sophos pitch. So, I'm with the Sophos data science team. We do really cutting-edge research, and we're always interested in transparency and collaboration and, hopefully I've communicated, publication. And while we're not the only one, we are one of the only research teams in the ML security industry that is getting papers accepted at some of the top-tier academic venues, like USENIX. Our group consists of about 10 to 15 people, split about half in research and about half in development. So check out our group if you're interested. You can talk with me, or you can talk with Rich Harang, who's also here, who's one of our directors of data science. Here's a picture of our team; lots of great, colorful characters. There are some more Sophos presentations going on this week. As I said, I have a facial recognition booth; Rich has a talk on hacking facial recognition on the 10th, and he also presented a talk on security data science at BSides, so if any of you saw that, there's just a name to correlate. And then, you know, I'd like to thank Sophos for funding and
for promoting this research, and I'd particularly like to thank my collaborators and my co-authors here for all the work that they did; this was definitely a team effort. So with that, I'll open it up for questions. Yes, thank you. Yeah, so that's a great question. The question is: this talk was about incorporating these auxiliary losses into neural networks, but have we tried other classifiers, like ensemble models, random forests, boosting, et cetera? While we don't have concrete results on those, there's nothing that would preclude a person from doing so. The representation that's learned by an ensemble model is a little bit different, so in, say, a random forest or a boosted model, I guess the question is how you would do shared splits in a way that works well across the data. So I'd say that the technique could very well be applied; I just don't know how well it would work. I can say that I've looked at some of the libraries that are out there for this, like LightGBM, like XGBoost, et cetera, and they generally assume that you're going to be using only one loss function, but there's nothing that would preclude somebody from implementing it. I just don't know how it would work. Thank you. Let's see, more questions? Yes, please. So, what are some of the next steps of the process, and some of the features that we want to develop? Features in terms of representations of the malware fed to the classifier, or features in terms of extra things to tack onto the classifier? Sure, sure, yeah, so extra additions to this overall technique. You know, I would certainly say that the approach that I'm most interested in is actually having a unified multi-input and multi-output model that's really able to learn from multiple labels, but also have multiple, just heterogeneous, inputs. Like, you
could have, as an example, the character embeddings of the file path. You could also apply this, I would say, to a lot of different malware types. I've been talking about PE files this entire time, but there are a lot of different types of malware that one could apply this to. So, I'd say that those are two different avenues that I'd certainly like to go down. And then I'd also say that there are other sources of data on some of these threat feeds, and I think that looking into those would be very interesting as well. Yes, please. Yeah, so, well, I'd say that not only does it not do worse, on an aggregate it does better. And I would say that, yes, if you have multiple inputs, we do in fact see, and we have seen, and I'd actually point you to that "Learning from Context" paper, that we do get a nice performance bump. But actually, yes, having missing data is a little bit more of a problem with that. For our loss functions here, if we have a missing label, we can just zero that entry out in the loss and back-propagate, but if we have a missing input, well, that becomes a lot more hairy, and that's an area of research that I'd really like to see addressed a little bit better. Let's see, I think we have time for one more question. Yes, please. Yeah, that's a really good question. The question is: which inputs are most prominent in terms of their respective output response? This goes back to a lot of the model interpretability literature. So, you know, I'd say LIME, SHAP values, layer-wise relevance propagation, a lot of the literature in that area would be very good to look at, or techniques like
activation maximization. Yeah, those are a few techniques, and this is definitely an area where I think that not only I, and not to speak for the entire industry, but I think that they're interested in that too. So anyway, I can chat more after on that, but thank you. Thank you for the question. Good question.