My name is Brian Wallace. Some of you might know me as Botnet Hunter from Twitter. Today we're going to be talking about machine learning and malware classification, its uses, and so on. Now I'm going to pass the mic along for everyone on the panel to introduce themselves.

Hi there, my name is Andrew Davis. I'm a staff data scientist at Cylance. My main day-to-day job is training our PE model, so I know a little bit about feature extraction and related methods for training PE models: disassembly, static feature extraction, and, increasingly, dynamic feature extraction. PE is kind of my bread and butter.

I'm Rich Harang. I'm a principal data scientist at Sophos. I also work a lot on PE, but also on other formats, including documents and HTML, particularly malicious JavaScript.

My name is Hyrum Anderson. I'm the technical director of data science at Endgame, and I work with a great team. PE was our first machine learning model as well, but we also do macros and Mach-O.

Hey everyone, I'm Matt Maisel. I'm the manager of security data science at Cylance, and I work closely with both Andrew and Brian. My interests are more in nearest neighbor search, if you attended the last talk, clustering, weak supervision, and active learning, and in applying those methods to perform malware classification, or to help the pipeline that does malware classification at scale.

And if I may, I'd like to briefly introduce Amanda Rousseau, who will be joining us momentarily. She's fleeing the paparazzi and will be here when she arrives. Unlike everybody else on this panel, she is not a data scientist but a malware reverse engineer, so she'll provide an interesting perspective there. Thank you.

So we're going to start out with kind of an easy question for all of our panelists: what do you believe is the current state of the art for machine learning and malware classification? Who wants to start? Ooh, opening with controversy. Oh boy.

I guess I think the current state of the art for malware classification is applying deep neural networks: getting a crap ton of benign and malicious samples and using them to train a very large-scale model. And obviously, once you have a great model, you actually have to put it in production somewhere, whether that's on an agent or in the cloud, so a lot of the challenges usually come after you make the model. So yeah, I'd say deep neural networks. Again, my expertise isn't necessarily in the classification algorithms, so I'll probably get overruled here.

Just a note: we're not talking purely about static detection. We're also talking about dynamic detection and any sort of automated, machine-learning-based malware detection method.

Yeah. So at Endgame, we primarily deal with static detection. There are, I think, at least two other types. There's dynamic, and there's also contextual machine learning: things like seeing a large spike in traffic to a vendor with a lot of hosts, which could indicate a new attack where you haven't actually analyzed the file; you only see the number of hashes flying by. In terms of static machine learning, I will respectfully disagree with my fellow panelists. I think that the way you define state of the art is a Pareto curve, and there's an optimum along that curve depending on what you want to achieve.
And that depends on things like, obviously, the false positive and false negative trade-off, but also model size. Does this need to live on the endpoint, or does it need to live in the cloud? And think about the time required for detection. In cases where milliseconds matter, like detecting ransomware in a dynamic setting, it's often really advantageous to have a speedy, lightweight model that can determine in sub-millisecond time whether that's ransomware. So for static detection, the distinction I'd like to make clear is that in my research there is still a difference between end-to-end deep learning performance and parse-the-file-first-then-apply-a-model performance.

That's actually highly related to the next question, so I'll move along. What are the different approaches for feature extraction, and how do we represent malware to a model? I imagine that's something new to much of the audience.

I guess I've got the mic. A lot of what's actually in production, as far as I'm aware, tends to work on static artifacts, and there are two different ways you can represent those. You can use some sort of parsing approach, where you pull out things like section names or the entropy of individual sections, but that relies on being able to dig in and analyze the file as a PE file. Then you have features that don't require any parsing: you can just take the entropy of every consecutive 256 bytes of the file, or something like that, and treat it as a sequence. There's also research on convolutional and recurrent neural networks where you feed the file in a byte at a time, or in chunks, and the model operates directly on the raw bytes. All of these have their pluses and minuses. There are trade-offs in how much domain knowledge you need to put in, whether your parser will choke on some files and not others, and how flexible the representation is. That's probably the key challenge in getting an effective malware model out: figuring out the right balance between the very domain-specific, expert-knowledge parsing features and the more flexible, less structured approach of ingesting the whole file, where it doesn't really matter that it's a PE file, you're just treating it as a bag of bytes.
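As an aside on the parse-free features just described, here is a minimal sketch of windowed byte entropy over a raw file. It is only an illustration of the idea, not any vendor's actual feature extractor; the 256-byte window and step are the illustrative values mentioned above.

```python
import numpy as np

def windowed_entropy(path, window=256, step=256):
    """Shannon entropy (in bits) of consecutive byte windows, as a parse-free feature sequence."""
    data = np.fromfile(path, dtype=np.uint8)
    if data.size == 0:
        return np.array([])
    entropies = []
    for start in range(0, max(len(data) - window + 1, 1), step):
        chunk = data[start:start + window]
        counts = np.bincount(chunk, minlength=256).astype(np.float64)
        probs = counts[counts > 0] / counts.sum()
        entropies.append(float(-(probs * np.log2(probs)).sum()))
    # The resulting sequence could be fed to a downstream sequence model or summarized further.
    return np.array(entropies)
```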
So I have a question for the panel real quick. Do you believe there is a single approach, a single model, that could be applied to detecting malware, or is it a requirement to have multiple different approaches used in conjunction?

Do you mean different file types, or what do you mean? File types, dynamic, contextual?

No. I mean, you're going to have different problems, like Hyrum said, right? You're going to have different places where you want to deploy it, and different requirements for latency, for speed, for the size of the model, and every one of those decisions is going to lead you to a different kind of model.

Yeah, I would definitely agree there's probably not a single all-encompassing model that would work well across all different types of malware. One thing I wanted to mention about the previous question is that feature extraction and feature engineering are vitally important to making sure your model isn't abusable. If you're extracting features that don't really make sense, your model might key in on things it shouldn't be keying in on. That's not going to be a very good model, it's not going to be a robust model, and an adversary could find those weak points and exploit them. Which brings me around to why there probably isn't an all-encompassing model that would work well on all kinds of malware: the things you would use to do malicious things with PE are going to be different from the things you would do to do malicious stuff with ELF, or Mach-O, or JavaScript, or any other file format. So, TL;DR: no single model.

I will just note there are corporations who do use a sort of single model for all file types, and who make claims about that. I should hope they're at least using transfer learning. Yeah, it wouldn't be great.

The only thing I have to add is that, obviously, each model would be specific to its modality, but consider multimodal models where we're looking at feature spaces from static, dynamic, and identity data. Depending on how you're sourcing that data and how you're doing the feature extraction, things could change, but each model could be trained independently, and then we can ensemble them, or build something like a majority vote, simple ways of combining the outputs of different models that are each built for a specific task. That would be my two cents.
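A toy sketch of the kind of simple combination described here, assuming each modality-specific model already produces a maliciousness probability; the weights and threshold are invented for illustration, not anyone's production configuration.

```python
import numpy as np

def ensemble_score(static_p, dynamic_p, context_p, weights=(0.5, 0.3, 0.2)):
    """Soft-vote combination of per-modality maliciousness probabilities."""
    probs = np.array([static_p, dynamic_p, context_p], dtype=np.float64)
    return float(np.dot(np.asarray(weights, dtype=np.float64), probs))

def majority_vote(scores, threshold=0.5):
    """Hard majority vote over per-modality scores."""
    votes = [s >= threshold for s in scores]
    return sum(votes) > len(votes) / 2

# Example: ensemble_score(0.92, 0.40, 0.10) returns a weighted probability of maliciousness.
```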
So, on to the next set of questions. What do you feel are the biggest challenges in the field of detecting malware with machine learning?

Well, I'll start. Again, you can break this down into the different types, but in static detection I think an awful lot of engines still have trouble with packed samples, because packers change and it's hard to track them. The entropy of sections can vary depending on the packing technique used. And at the end of the day, not all static engines actually attempt to unpack before inspecting the contents. Instead of looking inside the sample and trying to find a string or signature that would indicate maliciousness, they look at artifacts, the smoking gun, things like the imports and everything else that can't be packed, that say something in here is probably malicious. But that still represents, I think, a weak point for some types of static approaches.

Do you want to introduce yourself, please? Yeah, I was actually hanging out with MalwareTech, trying to get food. My name is Amanda Rousseau; people also know me as Malware Unicorn. Can you see me back there? I work at Endgame doing malware research. I help a lot of the data scientists tune the machine learning models that we use at Endgame, and we look at different types of malware to see where we need to focus our feature set, as well as doing manual verification of whether something is malicious or benign. So I think we're still on the challenges, right? Yes. Challenges? Please, continue with the challenges.

Yeah, so packing and encryption is a major headache. I think the other really big challenge is just getting a good labeled set for doing supervised learning. You've got a lot of repos of stuff that's allegedly malware, or mostly malware. You've got various threat intelligence feeds you can subscribe to. But the labels on a lot of that stuff are kind of garbage, and you've got a lot of disagreement between vendors about whether some samples are malware or not. So being able to get a good set of ground truth labels that you trust, that you can use to train these models in a fully supervised fashion, is a constant headache, and finding ways to refine those labels and make them more trustworthy is something we actually spend a lot of time on.

So that actually leads well... Oh, sorry. To elaborate on that a bit, I think one of the main challenges I see is dealing with file formats where you don't have a lot of labeled data. Take Mach-O, for example: you might scan VT for labels and come back with something like a thousand, or a couple tens of thousands, of malicious Mach-O executables. When it comes to training models that will generalize well, how do you learn anything from 10,000 samples? Compare Mach-O to PE, where you might be able to get a labeled data set of several hundred million malicious PE executables. For file formats where the ways to abuse them haven't been fully exercised the way they have for PE, where people have been writing malware for decades now, it's a really difficult problem to deal with.

So, like he said, file format is a major issue, because you don't have just one file format; you have Mach-O, ELF, whatever. You also have compiler type. There's a lot of bootstrapping code at the beginning of a binary that can change your feature set. Say you're focused on the beginning of the binary and trying to design features around the malicious part of the code, but you're actually looking at the compiler's bootstrapping code. So when you start to cluster PE executables, you can't cluster on the PE alone; you also need to cluster based on the compiler and the language they were written in, because the calling conventions, even within the malware binary itself, differ depending on the compiler and the processor it's meant for. Same with 32-bit versus 64-bit: you're going to have different training sets for those. So you have to make sure the samples in your training set are labeled properly.

I'll just add one last thing to this question, not necessarily from the perspective of modeling itself, but from engineering sustainable model pipelines that support rapid experiments: deploying models, getting feedback quickly, and being able to make use of feedback that doesn't come through normal channels. That's a really hard problem, especially when you consider some of the data set sizes, maybe not necessarily in the class of the imbalanced data sets, but for PE, moving all of that around can be challenging, and quickly building new models, experimenting with different instances of them, evaluating them, and then getting feedback from the field can be a hard engineering problem.
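A minimal sketch of one naive way to turn the disagreeing vendor verdicts discussed above into training labels. The thresholds are made up for illustration; real label refinement (per-vendor weighting, temporal smoothing, analyst review) is considerably more involved.

```python
def label_from_vendor_verdicts(verdicts, min_vendors=10, malicious_ratio=0.5, benign_ratio=0.05):
    """Map a dict of {vendor_name: detected_bool} to 'malicious', 'benign', or 'unknown'.

    Samples that fall between the two ratio thresholds stay unlabeled rather than
    polluting the supervised training set with noisy ground truth.
    """
    if len(verdicts) < min_vendors:
        return "unknown"
    hits = sum(1 for detected in verdicts.values() if detected)
    ratio = hits / len(verdicts)
    if ratio >= malicious_ratio:
        return "malicious"
    if ratio <= benign_ratio:
        return "benign"
    return "unknown"

# Example: label_from_vendor_verdicts({"vendorA": True, "vendorB": True, "vendorC": False, ...})
```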
So we can move on to the next question now, which I think is going to be kind of interesting. Well, more interesting. What should we do, as an industry of security folks, about things that are only malware in malicious hands, like PsExec and so forth? If anyone wants to jump in on that.

Yeah, I'll take that. I'm a proponent of the view that machine learning is not a panacea; it's not a cure-all. A model should be designed with a specific focus and a specific question, and it can do that well. If you fuzz up your objective, you'll get fuzzy results. With that in mind, I think these dual-use tool packages can sometimes be addressed better without machine learning. Looking at simple rules about parent-child relationships that should not happen, you can discover an exploit much more easily than a machine learning model perhaps could, for example. So I think there's a lot of room for traditional security concepts to wrap what machine learning provides in a more laser-focused setting.

It's not just the binary itself; it also depends on the context. One of the things we focus on, adding another feature to the set I guess, is determining how the malware is delivered. If a tool was delivered as part of a package, say packaged inside of a zip file, in that context it's more likely being used maliciously than if an admin is just using the same tool in the wrong way. So there are outside factors beyond just the binary itself that determine whether something is malicious or bad.

Okay, great answers, everyone. So, a lot of machine learning engines applied to malware classification generally just tell you whether a sample is malicious or benign. How do we feel as an industry about naming malware through machine learning, how could we approach it, and is it really a problem we need to address?

Well, you could build a classifier around all the VT detection labels and produce, or rather generate, labels from that. But I think there are a lot more interesting machine learning tasks here beyond just binary classification; multi-label classification or multi-class classification could be applied. Just to clarify, by binary classification he means one or zero, not... Benign or malicious. Yeah, benign or malicious, not raw bytes. Thank you for that clarification. I think there are other areas too. I grew up in a SOC working as an incident responder, so I loved having tools that let me do my hunting or my analysis better. Applying things like nearest neighbor search or clustering, for customers and even internally, has really helped us out. So beyond binary classification of things being malicious or benign, there's a lot that could definitely help. And maybe this gets into model explanations as well; I don't know if that's a question later, but I think model interpretability is also a very interesting area.
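A small sketch of the nearest-neighbor lookup mentioned here, assuming samples have already been turned into fixed-length feature vectors. It uses scikit-learn's exact search; the metric, k, and the idea of indexing by hash are illustrative, and at real scale an approximate index would likely be preferable.

```python
from sklearn.neighbors import NearestNeighbors

# feature_matrix: (n_samples, n_features) array of per-sample feature vectors
# sample_ids:     list of hashes or names, aligned with the rows of feature_matrix
def build_index(feature_matrix, k=5):
    index = NearestNeighbors(n_neighbors=k, metric="euclidean")
    index.fit(feature_matrix)
    return index

def similar_samples(index, sample_ids, query_vector, k=5):
    """Return the k most similar known samples to an unknown query, with distances."""
    distances, rows = index.kneighbors(query_vector.reshape(1, -1), n_neighbors=k)
    return [(sample_ids[r], float(d)) for r, d in zip(rows[0], distances[0])]
```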
Yeah, on that front, that's one of the reasons why at Sophos we're so deep in on deep neural networks, as opposed to some other more structured classifiers: you can put a lot more flexibility into them. You can build something that will not just say, is this malicious or benign, but, if it's malicious, is it malicious because it's PUA, or because it's a virus, or because it's ransomware, and you can even put things like family labels into it. So you can build a whole lot more flexibility into how you want to slice up the space of your malicious and benign samples and enrich your output in that sense. And then also, quick plug for our Black Hat talk, we've found that the internal representations from these deep learning models are actually a really useful analytic tool for looking at how different samples group together in space, and maybe suggesting other clusterings that relate more to functionality than to authorship similarity and things like that. So I think, and we're just seeing the beginning of this, there's definitely a large role for these machine learning models not just as classification tools but also as analytic tools that can be handed over to people who do more conventional malware analysis, in a sort of virtuous cycle.

As far as auto-classification goes, I think taking a machine learning model, having it take a binary and try to predict what malware family it belongs to, is a pretty good idea, and a lot of people do it really well. However, I think we will always need malware analysts in the loop. We're always going to need humans to validate these results; we can't just be automatically classifying things and be asleep at the wheel. So I'm not super sure it would be a great idea to have machine learning models do things like automatically come up with new family names. For one thing, the names may not be entirely meaningful, because malware family names, well, you know how vendor signatures can be all over the place? Never. Yeah. But having a human in the loop to give a specific malware family a meaningful name is always going to be important.

From the malware analyst perspective, it's the data science job to help us bubble up the important things for us to resolve, right? Traditionally, a lot of malware samples were identified by someone giving them a cool name, you know, WannaCry or something like that. But really, if we went back to the biology side of it and classified a sample as exhibiting a certain behavior or having a certain property, rather than just giving it a random name and calling it a class, that would help us a lot more, because to get rid of the malware we need to know what type of behavior it performs so we can remove it. And I think that's one of the major problems data science tries to solve: heuristics based on behavioral properties can be more helpful than a generic family name that someone threw a dart at the wall to choose.
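A hedged sketch of the kind of multi-output network described earlier in this exchange, written in Keras. The feature width, class counts, layer sizes, and loss choices are placeholders, not Sophos's actual architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(1024,), name="static_features")   # placeholder feature width
h = layers.Dense(512, activation="relu")(inputs)
h = layers.Dense(256, activation="relu")(h)                       # intermediate embedding

malicious = layers.Dense(1, activation="sigmoid", name="malicious")(h)   # benign vs malicious
category = layers.Dense(4, activation="softmax", name="category")(h)     # e.g. PUA / virus / ransomware / other
family = layers.Dense(128, activation="softmax", name="family")(h)       # illustrative family vocabulary size

model = tf.keras.Model(inputs, {"malicious": malicious, "category": category, "family": family})
model.compile(
    optimizer="adam",
    loss={
        "malicious": "binary_crossentropy",
        "category": "sparse_categorical_crossentropy",
        "family": "sparse_categorical_crossentropy",
    },
)
# The intermediate layer h can also serve as an embedding for grouping similar samples,
# along the lines of the analytic use of internal representations mentioned above.
```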
I'm just going to lead into the next part, because we're a little bit over since we started late. What do you feel is the future of malware classification, without burning intellectual property? Does no one want to touch that?

I think from the machine learning perspective, the amount of data coming in day over day is growing really fast, so there are a lot of challenges for us to overcome just on the engineering and pipeline end of things. But there's also being able to put analysts more tightly into the loop, engage that feedback loop, and maybe bubble up stuff that's suspicious but hasn't been assigned a family name or anything, hand that over, and get a final verdict: is it suspicious or is it benign? So focusing more on an active-learning-type problem, where you're putting humans into the loop and learning how to spend analyst hours most effectively to supplement and improve your machine learning, I think is a really promising direction for the future.

I'll just add that I think everybody represented on this panel relies, maybe to an embarrassing extent, on traditional AV to supplement our labels. I agree. It's really an important thing, but the AV vendors also rely on each other. And these labels don't come for free; there's human effort and automated effort in there. So one thing to think about for the future is, if everybody moves to machine learning, then from whence the free labels? There's a conundrum here: either reliance on the human is going to become more important, or, and I think this is a really hot research area, the focus shifts from supervised learning to unsupervised and semi-supervised machine learning models, where the reliance on labels diminishes. I'm strongly of the opinion that we should be using active learning; I'm a big fan of it.

Just to go on top of that, there's a really interesting project called Snorkel that's a framework for doing weak supervision. Specifically, in the case where there are labels of all these different types for malware, and maybe we're getting heuristics or signatures from analysts, the weak supervision task is basically to combine all of those labels in a way that improves another model later down the road. This project, Snorkel, I think it's from the Hazy Research group, I can't remember the author's name right now, is a great tool. It's built around weak supervision, combining all these different labels from different kinds of mechanical turks, specifically for image classification, and it could potentially be extended to help with some of these problems: taking really noisy labels, and also taking into account external databases of signatures or heuristics that might be useful for influencing the model.

So, I'm sorry, we have to stop now, we're over time. Thank you to all the panelists for great answers to all the questions; super happy to have you all. And thank you everyone for coming.
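As a closing illustration of the weak supervision idea described above: a toy combination of noisy labeling functions by a simple unweighted vote. Snorkel itself learns the accuracy of each labeling source with a label model rather than counting votes like this, and the heuristics and field names below are invented for the sake of the example.

```python
# Labeling functions emit +1 (malicious), -1 (benign), or 0 (abstain).
def lf_vendor_consensus(sample):
    return 1 if sample.get("vt_positives", 0) >= 20 else 0

def lf_signed_binary(sample):
    return -1 if sample.get("valid_signature") else 0

def lf_packed_and_rare(sample):
    return 1 if sample.get("is_packed") and sample.get("prevalence", 0) < 5 else 0

LABELING_FUNCTIONS = [lf_vendor_consensus, lf_signed_binary, lf_packed_and_rare]

def weak_label(sample):
    """Naive stand-in for a learned label model: unweighted vote over labeling functions."""
    total = sum(lf(sample) for lf in LABELING_FUNCTIONS)
    if total > 0:
        return "malicious"
    if total < 0:
        return "benign"
    return None  # abstain; leave the sample out of the weakly labeled training set
```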