Hello, everybody. My name is Angelo and I'm here to talk a little bit about fighting malware with deep learning. I will show you how you can use deep learning to build a binary classifier. I'd like to thank you for coming, and to thank the DEF CON AI Village. This is my first DEF CON, so I'm a little bit lost here, but it's been pretty amazing. I'd also like to thank my company, TOTVS, for supporting my trip here.

So, who am I? I work as an ethical hacker at TOTVS, the largest software company in Latin America; we build ERP software. I'm also a part-time, basically self-taught data scientist, and a part-time PhD student (I hope my supervisor isn't listening). I'm interested in deep learning and data science for malware detection and classification.

First I want to share some experience I had. I was not a researcher. I got a degree in mathematics, but then I went to work with computers, programming and so on, so I was not doing research. At some point I wanted to do research and get into this machine learning, deep learning, AI thing, and I didn't know how. So I started studying machine learning, deep learning, and malware analysis from online resources and books. I already knew something about malware analysis, but I learned more.

Since you want to do research, you need to study the state-of-the-art papers to see what's going on, what's at the edge. So I started reading these papers, and I got stuck when I tried to reproduce them, because they are very theoretical and sometimes you don't have enough information to reproduce them. This is a very big problem with machine learning papers, and with malware analysis in particular, because the authors often don't tell you their hyperparameters and don't share their data.
So it's quite hard to reproduce the papers, but sometimes it's also because you just don't know the basics: you haven't been through them, you haven't coded something simpler to see it working. You need to do that first. What's happening here is that there is a difference between knowing the path and walking the path. What does that mean in this context? It means I wanted to do research, but I hadn't done anything yet; I wanted to take a big step all at once. That's quite hard to do. So you need to step back and fill that gap. How do you do that?

First, and this is very important not only for malware research but for any kind of machine learning research, you need to build your own baseline models from scratch. By "from scratch" I don't mean you need to implement backpropagation yourself; you can, but you don't need to. You can use a high-level deep learning framework to do that for you. That will be your starting point for doing research. Second, you want to build simple, working models using the techniques you want to master and research, because you need to learn them very well if you want to contribute something new. And third, and this sounds strange, you need to not contribute first in order to be able to contribute later. First you learn the basics; at that point you are only contributing to yourself. Then you can try to contribute to the scientific community. So always be humble: you don't know everything.

Okay, we are data scientists, right? So in our case we need malware, a lot of malware, meaning we need instances, data to feed our models. This one is my favorite repository: it's called VirusShare.
There you can find, I checked yesterday, almost 34 million malware instances to download. You can download huge files, six gigabytes each with around 60,000 instances, and play with them. And one good thing is that they are already labeled using VirusTotal.

As I mentioned, deep networks have several layers: the first layers learn low-level features and the deeper layers learn high-level features. What we are interested in is taking the data from these datasets and trying to figure out whether each instance is malicious or not, because we are trying to detect malware. We also have recurrent neural networks, which are good for sequence modeling: if you have data that is sequential in time or space, you can use recurrent neural networks for that, and they are good for dynamic analysis.

So we need to build our machine learning / deep learning / data science pipeline. This is actual, real information: it took me about one month to download the malware and goodware, and then I set up a hypervisor environment with nine virtual machines running around the clock for four months to execute the malware and collect the dynamic information. Then we do the basics of data science: we preprocess and clean the data, and we do some feature engineering to get the best features for our problem. In this case the feature engineering was minimal; I did almost none. And then we are ready to play with the models, using a high-level framework to start building them.

I have three models to show you. I will share the code, the Jupyter notebooks, and the datasets (there are several), so don't worry if you miss something, because I need to be very quick. Now I just need some help here, because I need to put the code up there and I can't see it from here; my screen is extended.
Can someone from AI Village help me, please? I've opened the code here and I want to project it there, but the screen is extended. That way I will have time to explain the code step by step. Basically, what I'm showing here is an example of each of the models I showed before. This one is a multilayer perceptron for static analysis data.

First we import all the necessary Python libraries; they're amazing. Here's our dataset. I'm getting data from the PE sections; the PE sections basically describe the sections of the executable file. One very important piece of information here is the entropy. When the entropy is high, it usually means the code is either encrypted or packed. Malware authors usually like packing the content of the file to evade antivirus detection. But that alone is not enough: the entropy can be very high and the file still not be malware, so you can't rely on only one feature. And there is a column here saying whether the sample is malware or not. So this is what our data looks like; it's tabular data.

Here I do some correlation analysis to find correlations among the features. You can see that some features are very highly correlated, so we can drop them; we don't need duplicated information. Then I open another dataset, the PE imports. To create a program you need to import functions from DLLs and so on, and maybe the deep learning algorithm can spot patterns in which functions malware tends to import. What we have in this dataset are the 1,000 most frequently imported functions, ranked. So, for example, this malware imported the GetProcAddress function.
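The entropy heuristic I just mentioned is easy to compute yourself. A minimal sketch in plain Python (the example inputs are illustrative, not from the actual dataset): Shannon entropy of a byte string is between 0 and 8 bits per byte, and packed or encrypted sections sit near the top of that range.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # H = -sum(p * log2(p)) over the observed byte frequencies
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A constant run has zero entropy; uniformly distributed bytes reach 8 bits.
low = shannon_entropy(b"A" * 64)
high = shannon_entropy(bytes(range(256)))
print(low, high)  # 0.0 8.0
```

A real PE parser (e.g. the `pefile` library) reports this same statistic per section, which is where the feature in the dataset comes from.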
And GetProcAddress is the most frequently imported function here. So you see there are a lot of features, and they are categorical. Categorical features are a little harder to deal with, because you're not working with numbers, you're working with categories. At the end, again, there's a column saying whether the sample is malware or not, and this dataset has 47,000 entries.

So we remove duplicates and merge the files, and now the dataset is ready. I'll jump a little here. We convert the pandas DataFrames to NumPy arrays to feed the models. Then there is another problem: we need to check the class imbalance, and this is highly imbalanced, 24 to 1. We have 24 malware instances for each goodware instance. That's a problem. It also means the train/test split has to be stratified, so that after the split the class proportions are preserved. And here I'm standardizing the data, but only the numeric columns; I'm not touching the categorical data.

Now let's jump to the model. There is something interesting here: we need to deal with the imbalance, so we can try an oversampling algorithm. SMOTE, for example, won't work here because it can't deal with categorical features, and its variation called SMOTE-NC (Nominal and Continuous) was simply taking too much time to run, so I gave up on it for now. So we can use a random oversampler. Random oversampling is crude because it just duplicates data, but in our case it helped. After running it, you can see the dataset is now balanced.

Then I do some visualization using t-SNE. t-SNE is a pretty awesome algorithm for visualizing high-dimensional data.
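The random oversampling idea is simple enough to sketch in plain Python. The notebooks use imbalanced-learn's `RandomOverSampler`; this hypothetical stand-in just duplicates minority-class samples at random until every class matches the majority count:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples at random until classes are balanced."""
    rng = random.Random(seed)
    classes = set(y)
    counts = {c: sum(1 for label in y if label == c) for c in classes}
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for c in classes:
        idx = [i for i, label in enumerate(y) if label == c]
        for _ in range(target - counts[c]):
            pick = rng.choice(idx)        # resample an existing minority row
            X_out.append(X[pick])
            y_out.append(c)
    return X_out, y_out

# A 24:1 imbalance, like the malware/goodware split in the talk.
X = [[i] for i in range(25)]
y = [1] * 24 + [0]                        # 24 malware, 1 goodware
Xb, yb = random_oversample(X, y)
print(yb.count(1), yb.count(0))           # 24 24
```

Note this must be applied only to the training split, after the stratified train/test split, or the duplicated rows leak into the test set.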
Basically it projects high-dimensional data onto a plane, or into a low-dimensional space, so you can get an idea of what the data looks like in that high-dimensional space, and also an idea of the decision boundaries your model will need to learn. And as you can see, this one is complicated: red is malware and green is goodware, and we need to separate them.

So, deep learning. Here's the model, a deep learning model for the multilayer perceptron. As you can see, it's quite simple. This is not research; this is the homework. As I said in the beginning, I needed to understand how to build a simple model before trying to build something more complex or new. It's just a model with two dense layers. We create the model and get the summary: 146,000 parameters, which is not really too many. And we can get the architecture diagram of the model with a single command.

Here there is one step I would like to explain in more depth, but I can't because I don't have time: model selection. Basically, I'm automating model selection: I take those hyperparameters, test each combination, and perform three-fold cross-validation to see which combination is better. After that, we train the model. This one didn't take too long; I trained it at home on an NVIDIA RTX 2080 Ti, which has 4,352 CUDA cores, a decent card. Around 10,000 seconds, roughly three hours. And it's important to evaluate the model using the test set.

Here are the results of the model selection. The best combination was a dropout rate of 0.6 with the first architecture, which is dropout first and then batch normalization. There is a holy war about this: whether you should apply batch norm first and then dropout, or vice versa, or neither, or both in either order. In practice it depends on the nature of your data.
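The model-selection automation is just an exhaustive sweep over hyperparameter combinations, scoring each with k-fold cross-validation. A minimal sketch of the mechanics; the scoring function here is a made-up stand-in for actually training the network on each fold (a real run would use something like scikit-learn's `GridSearchCV`):

```python
import itertools
import statistics

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(param_grid, score_fn, n_samples, k=3):
    """Try every combination; return the one with the best mean CV score."""
    keys = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = [score_fn(params, fold) for fold in k_fold_indices(n_samples, k)]
        mean = statistics.mean(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

# Stand-in scorer: hard-wired so that dropout 0.6 with the smaller layer
# wins, echoing the result reported in the talk; a real scorer would train
# the Keras model on the fold and return its validation balanced accuracy.
def fake_score(params, fold):
    return 1.0 - abs(params["dropout"] - 0.6) - params["units"] / 1000.0

grid = {"dropout": [0.2, 0.4, 0.6], "units": [64, 128]}
best_params, best_score = grid_search(grid, fake_score, n_samples=300, k=3)
print(best_params)  # {'dropout': 0.6, 'units': 64}
```

Three-fold cross-validation means every combination is scored three times on different held-out folds, so the comparison is less sensitive to a lucky split.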
So that's why you need to test it. And the lowest number of neurons won. After that we perform the evaluation, and for the evaluation we have the confusion matrix, which shows the true negatives, false positives, false negatives, and true positives.

First we create a benchmark. Suppose you have a model that predicts that every example is malware. What happens? These are the numbers you get, and since the dataset is imbalanced, you get pretty good numbers for accuracy, precision, recall, and F1 score. This is misleading, so you need a better metric for imbalanced datasets. The best metric, as far as I know, is balanced accuracy, and the balanced accuracy here shows that predicting every example as malware is very bad; this is the worst-case scenario. Then, when we apply our model to the test set, we get this result, which is much better. So even with a multilayer perceptron, a very simple deep neural network, we can already get very good results. Now imagine those very deep models with inception layers and so on; you can do much better.

For the next example I'll just show you the model. This is interesting because what I've done here is treat the binary data as if it were an image. Take a look: these are our images. I take the binary, treat each byte as a grayscale pixel, and scale it. Then we feed it to a convolutional neural network, because convolutional neural networks are specialists at dealing with images. We handle the same imbalance problem as before, and the visualization looks similar to what we saw earlier: a pretty complicated decision boundary. And then, deep learning. This is our convolutional model, and as you can see, it's very, very simple.
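The all-malware benchmark is easy to reproduce by hand. A minimal sketch with the 24:1 imbalance from the talk: plain accuracy looks great for the trivial predictor, while balanced accuracy (the mean of per-class recalls) exposes it as the worst case.

```python
def confusion(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = malware."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

def balanced_accuracy(y_true, y_pred):
    """(TPR + TNR) / 2: recall averaged over both classes."""
    tn, fp, fn, tp = confusion(y_true, y_pred)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2

# 24:1 imbalance; the "benchmark model" predicts malware for everything.
y_true = [1] * 24 + [0]
y_pred = [1] * 25
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc, balanced_accuracy(y_true, y_pred))  # 0.96 0.5
```

96% accuracy, but 0.5 balanced accuracy: no better than flipping a coin per class, which is exactly why the talk uses balanced accuracy as the headline metric.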
We have a convolutional layer, and then there is a model selection question here: should we apply max pooling first and then batch normalization, or batch normalization first and then max pooling? This is also a holy war; nobody knows which is better. But what you can do is perform model selection with cross-validation and see which works better for your kind of data. So basically we have a convolutional layer, then max pooling, then another convolutional layer, then max pooling, and then we flatten the output and feed it to a fully connected layer for classification, in our case binary classification.

We trained this for some time. I think I still have about five minutes. I want to show you how much training time I needed here: about 15,000 seconds, roughly four hours. Not too much, really. And the results: cross-validation, then evaluation. The benchmark again looks pretty good on accuracy, precision, recall, and F1 score, because those metrics don't deal well with imbalanced datasets, while its balanced accuracy is the worst possible. And when we use our model to predict on the test set, we get this. Also not great, but there are some explanations. The main one, I think, is that nowadays malware is almost always packed, so the content is encrypted or obfuscated, and what the network is seeing here is mostly random noise. You can't expect the algorithm to do much better than that. Still, it's a reasonable result. You could, for example, add a step to your preprocessing pipeline to unpack the samples, but that is quite complicated because each malware uses a different technique and a different key; sometimes the sample even fetches the key online. So it's complicated.

And then, finally, dynamic analysis. For dynamic analysis I will show you a network called long short-term memory (LSTM), which is pretty good at dealing with sequential data.
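The binary-to-image trick behind this CNN example is essentially a reshape: each byte becomes one grayscale pixel and the file is cut into fixed-width rows. A minimal sketch (the width of 16 is arbitrary here; real malware-as-image work typically picks the width from the file size):

```python
def bytes_to_image(data: bytes, width: int = 16):
    """Pad a byte string and cut it into width-sized rows of pixels 0-255."""
    padding = (-len(data)) % width          # pad so length divides evenly
    padded = data + b"\x00" * padding
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

# A 40-byte "file" becomes a 3x16 grayscale image, last row zero-padded.
img = bytes_to_image(bytes(range(40)), width=16)
print(len(img), len(img[0]))  # 3 16
```

From there the rows can be stacked into a 2D array, scaled to [0, 1], and fed to the convolutional layers like any other single-channel image.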
All the software you see for speech recognition and automatic translation uses this kind of network. Here I'm taking the sequence of API calls; T0, T1, and so on are the calls the malware makes to the Windows API. I kept just 1,024 of them. Then I do the job of cleaning and balancing the dataset, and take a look at this even simpler network. If you are running TensorFlow and Keras, there is a CuDNN variant that runs much faster than the plain LSTM version. This takes much more time to train: your GPU will really be burning there at 90 to 100% usage for several hours. In this case it took about 29,000 seconds, roughly eight hours. Here are the results of the model selection, and then the confusion matrix and the final results: we got 91% balanced accuracy. This is a little bit impressive, because our dataset is small. And you may have noticed the dataset has 1,024 features, but we used only about 300, because we need to convert these features from categorical to one-hot encoding before feeding the model, and my computer didn't have enough memory for all of them.

So now you have an idea of how you can apply deep learning to malware detection and classification; that was the idea of this talk. Let me just close here. My research is basically about malware behavior. I'm researching specialized data augmentation methods, because the current ones just don't work very well with this kind of data, and specialized representation learning techniques for malware detection and classification. This should lead to improvements in the detection and classification of zero-day, polymorphic, and metamorphic malware. And that's it. Thank you very much for attending.