Hi, this talk is about what machine learning can and can't do for security. Let me start by introducing myself. My name is Wendy Edwards, and I'm from Urbana, Illinois, which is about 200 kilometers south of Chicago; it's probably most famous as the birthplace of HAL 9000 in 2001: A Space Odyssey. In my day job I work as a software developer. I'm part of the NASA Datanauts program, which is what gave me an interest in data science and machine learning, and I also got to participate in the 2017 SANS Women's Academy. On Twitter I'm Wayward710, and my pronouns are she/her.

So what are we going to talk about? We'll start with terminology and concepts, then go through some examples, and we'll also talk about the limitations of machine learning.

First, probably a lot of you have heard of artificial intelligence, machine learning, and deep learning. They're essentially nested within each other: AI is the broadest category, then machine learning, then deep learning. Since I'm from Urbana, the birthplace of HAL 9000, here is a picture of HAL 9000. Artificial intelligence, which would include HAL, is the field of computer science that attempts to emulate human intelligence.

Slightly more narrowly defined, we get machine learning. Machine learning is really a lot of math: it looks for patterns and makes inferences. The two main categories are supervised and unsupervised learning. With supervised learning, you tend to have labeled data sets, which means you have the answers; anybody with a stats background may have heard of Bayesian modeling with prior probabilities, and that is heavily used in supervised learning. With unsupervised learning, you're drawing abstractions from unlabeled data sets.
With unsupervised learning, you don't have the answers: the data sets are not necessarily labeled, and the algorithm draws abstractions from that unlabeled data. One challenge with unsupervised learning, of course, is making sure it got things right.

Then we'll talk about deep learning, which uses the concept of neural networks. Neural networks are actually not anything new; they date back to the 1940s. They're modeled after the human brain, the whole idea of layers of neurons. The layers are made up of nodes, and a node is just where computation happens, which is roughly how a neuron works: it fires when it encounters sufficient stimuli. A node combines input from the data with a set of coefficients, or weights, that either amplify or dampen each input. It's like: do you want more of this input or less of that one? By going through this process, the algorithm tries to learn which weights bring it closest to the right answer.

Here we have a picture of a neural network; it's what's called a feedforward neural network. You have an input layer, then in this case one hidden layer with four neurons, though you could have more or fewer neurons and more or fewer hidden layers, and then an output layer. That's essentially what we're talking about.

Now we're going to look at some videos showing neural networks. You can see, although I don't think the animation works here, that it's gradually getting better. We're looking at the test loss and training loss, and you want those losses to be as small as possible. You can see it's trying to accurately categorize the dots in the picture, and it gets closer and closer. This is the TensorFlow Playground, which anybody can use.
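To make the node computation concrete, here is a toy sketch, not from the talk: each node takes a weighted sum of its inputs plus a bias and passes it through a sigmoid activation, and a feedforward pass just chains one hidden layer of such nodes into an output node. All names and weights here are made up for illustration.

```python
import math

def neuron(inputs, weights, bias):
    """A single node: weighted sum of inputs plus bias, passed through
    a sigmoid activation (it "fires" more strongly as stimulus grows)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes the result to (0, 1)

def feedforward(inputs, hidden_layer, output_weights, output_bias):
    """One hidden layer of neurons feeding a single output node.
    hidden_layer is a list of (weights, bias) pairs, one per hidden neuron."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return neuron(hidden, output_weights, output_bias)

# With zero weights and bias, the sigmoid of 0 is exactly 0.5.
print(neuron([0.0, 0.0], [1.0, 1.0], 0.0))  # 0.5
```

Training would then mean nudging the weights and biases to shrink the loss; this sketch only shows the forward pass.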
It's just a very interesting way to see neural networks in action. When you talk about neural networks, you have to talk about tweaking parameters: you might adjust your inputs, or you might add or remove layers. Here, let's see what happens if we tweak it by adjusting the input a little. We stayed with only one hidden layer of four neurons, but we added more features, which means more inputs. If you watch closely, you can see it was still very accurate but converged a lot more quickly than the previous one, and that can be very significant when you have big data. Here we'll look at a slightly different configuration; if I recall correctly, it works a lot more quickly with fewer inputs. That's just a neural network in action, and from the weights on each line you can see which inputs are weighted higher and which are weighted lower.

Deep learning, when we talk about it, could also be called stacked neural networks: a network composed of several layers. Big data is a very big challenge, and one thing that's helped neural networks is advancements in GPU processing; a GPU is a graphics processing unit, often used in large-scale computing. But it's not a silver bullet, because it needs an extensively labeled data set, and that is a lot easier said than done. If you can get such a data set, though, there's a whole lot you can do with it: computer vision, speech recognition, natural language processing, machine translation. I don't know if you've ever used Google Translate or DeepL.com, but those are machine learning. And for security, it can detect anomalies in user and network behavior, and that's going to be very important.
Basically, for you to use machine learning, what do you do with the data? With data selection, you're simply deciding: you've got a data set, say a PCAP file. What information in it do you care about? What do you want to represent? That's often called your feature set. For most machine learning algorithms you're going to need to encode the data in some kind of mathematical fashion; just dumping in the raw data usually is not going to work.

Then there's normalization. With normalization, you're essentially transforming values to a range between 0 and 1, and this is just something you do to get your data ready to plug into a machine learning algorithm. Here's an example of normalization: you're looking at the number of requests per second and your CPU utilization percentage. Those could tell you whether you're being hammered by a DDoS attack, or whether a Bitcoin miner is eating your CPU cycles, or whether something else is going wrong. Here the requests-per-second feature has a range ten times larger than the CPU-percentage feature, so if we didn't normalize those values, the distance calculation would be really skewed. We address this by normalizing both features to the 0-to-1 range, which is a pretty simple mapping.

Essentially, one big thing in machine learning is pattern recognition: you're trying to discover explicit or latent characteristics hidden in the data. One nice thing about that is that you can then use an algorithm to recognize other instances of data with the same characteristics. For example, botnets and command-and-control channels very often show similar behavior, and you also see similar patterns in malware. Which brings us to clustering.
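The min-max normalization described above is a one-line mapping. Here's a minimal sketch (the sample values are hypothetical, just to mirror the requests-per-second vs. CPU-percentage example):

```python
def min_max_normalize(values):
    """Rescale a feature to the 0-1 range so features with large raw
    ranges (e.g. requests/sec) don't dominate distance calculations
    over small-range features (e.g. CPU %)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

requests_per_sec = [100, 550, 1000]  # hypothetical samples
cpu_percent = [10, 55, 100]

print(min_max_normalize(requests_per_sec))  # [0.0, 0.5, 1.0]
print(min_max_normalize(cpu_percent))       # [0.0, 0.5, 1.0]
```

After normalization both features contribute on the same scale, so a distance between two samples weighs them equally.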
The whole idea behind clustering is that bad things happen together. With clustering, you want to figure out how to group things into clusters, so that items are more similar to other items in the same cluster than they are to items outside it. There are different techniques you can use; k-means and DBSCAN are two big ones. The advantage of DBSCAN is that you don't have to start with a preset number of clusters.

So let's talk about a clustering example. I've put the URL in the slides. This is a system that clusters incidents together; I believe it incorporates data from sources such as Splunk, so it can recognize similar incidents and keep track of responses in a similar manner. In a SOC, you're going to have a lot of disparate inputs coming from different sources. You collect all your sources, and this groups them into incidents. An incident is basically a collection of events happening on the same machine at the same time, and a lot of them may have the same root cause. You can then find similar incidents, maybe not on the same machine, maybe across the network. This can save SOC analysts a whole lot of time, because SOC analysts often end up spending their time on a lot of very tedious stuff.

Clustering takeaways: you can apply it to a lot of different kinds of data, you do need some statistical validation, and it's very useful when you have a whole lot of data that you need to sift through more efficiently. So that is clustering.

Next, classification. Classification is a little bit different, because with classification you've got predefined classes, and you're trying to figure out the odds that a sample belongs to a given class.
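As a rough illustration of the clustering idea (not the SOC tool from the talk), here is a minimal k-means in plain Python: it repeatedly assigns each point to its nearest cluster centre and then moves each centre to the mean of its cluster. The initialization is deliberately naive for brevity; real implementations pick starting centres randomly or with k-means++.

```python
def kmeans(points, k, iters=10):
    """Minimal k-means sketch: group 2-D points into k clusters so each
    point is closer to its own cluster's centre than to the others."""
    centers = list(points[:k])  # naive init: first k points
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2
                                            + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centre to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

# Two well-separated "incident" blobs fall out as two clusters of 3.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
print([len(c) for c in clusters])  # [3, 3]
```

DBSCAN would instead grow clusters from dense neighborhoods, which is why it doesn't need k up front.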
Classification is an example of supervised learning, which means you have to start with a pre-labeled data set. One point worth making is that a sample can belong to multiple classes at the same time; for example, a mango could belong to food, yellow, tropical, whatever. For an example of classification, one really big, very common classification problem related to security is classifying email as spam, benign, or possibly phishing; sometimes phishing gets its own class.

So let's look at some examples where you might use classification in security. You might look for a botnet, because botnets have a lot of commonalities. That's basically what a Microsoft study did: they issued a bunch of crafted HTTP requests to servers, looking for a very specific family of bot software. To be precise, it wasn't catching any kind of bot; it was catching one with a very common bot-related control panel that used WordPress. So they were querying servers, for example sending requests to paths consistent with known sketchy sites, and then they built a decision-tree model.

This, for example, is what a decision tree could look like if you were trying to categorize whether an unknown website is a botnet command-and-control site. You could ask: does it get a whole lot of daily visitors? If yes, it's probably not. Has it made the Alexa top sites list? If yes, probably not. But maybe it's not getting many visitors and its URL is auto-generated; that's a bad sign. It's not on the top sites list and there's an existing threat intel report on it; that's not good either. That's probably part of a botnet.

Now, classification takeaways. Classification, again, is a supervised learning model.
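The decision tree just described can be written out by hand as a chain of feature tests. This is only a sketch of the branching logic from the slide; the function name and the visitor threshold are made up for illustration, and a real model would learn its splits from labeled data.

```python
def looks_like_c2(daily_visitors, on_alexa_top_sites,
                  url_auto_generated, in_threat_intel_report):
    """Hand-written version of the decision tree described above:
    each branch tests one feature of a candidate website."""
    if daily_visitors > 10_000:   # lots of visitors -> probably not C2
        return False
    if on_alexa_top_sites:        # popular, well-known site -> probably not C2
        return False
    if url_auto_generated:        # DGA-style auto-generated name -> suspicious
        return True
    if in_threat_intel_report:    # already reported -> suspicious
        return True
    return False

print(looks_like_c2(50, False, True, False))  # True
```

A trained decision-tree classifier would pick both the features and the split points automatically, but the resulting structure is just like this, which is why decision trees are prized for explainability.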
There are four phases: training, validation, testing, and deployment. With classification, you have to break your data into training and testing data. One thing about classifiers is that a classifier is not necessarily an absolute oracle. It's not telling you that something definitely is or definitely isn't; it's telling you how likely it is that something belongs to a particular category. For example, with your email, you know how benign emails sometimes get caught in your spam filter, and spam sometimes gets through? That's an example of it not being perfect.

Another example: finding malicious PowerShell scripts. One of the things that attackers using PowerShell like to do is obfuscate their scripts. For example, any idea what this means? Yeah, me neither. So what the researchers did was take a whole lot of PowerShell scripts and apply natural language processing, which makes this a very interesting example of machine learning. There are all these techniques in NLP that attempt to extract meaning from words, and they used a well-known one called the word2vec algorithm. They were able to extract tokens and represent them in a way that captured their context, and this graphic is just an example of some clusters they found that way. They started with a very large set of scripts already labeled clean or malicious, and the project worked very well. It did really well at recognizing aliases, which are pretty commonly used in PowerShell. So yeah, it was pretty successful.

Another thing you can do is anomaly detection. Essentially, you've got to think about what represents normal. So what is normal? What's described in your data set could be anything: your network logs, your system logs, your network flows, whatever.
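To show what "the classifier gives you a likelihood, not a verdict" means, here is a tiny Naive Bayes spam scorer, a minimal sketch rather than anything from the talk; the training messages are invented, and real spam filters are far more sophisticated. It returns P(spam | words) as a number between 0 and 1.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs, label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def spam_probability(text, word_counts, label_counts):
    """Naive Bayes: returns P(spam | words) -- a likelihood, not a verdict."""
    words = text.lower().split()
    vocab = len(set(word_counts["spam"]) | set(word_counts["ham"]))
    scores = {}
    for label in ("spam", "ham"):
        total = sum(word_counts[label].values())
        # log prior + log likelihood of each word, with add-one smoothing
        s = math.log(label_counts[label] / sum(label_counts.values()))
        for w in words:
            s += math.log((word_counts[label][w] + 1) / (total + vocab))
        scores[label] = s
    m = max(scores.values())
    odds = {lbl: math.exp(s - m) for lbl, s in scores.items()}
    return odds["spam"] / (odds["spam"] + odds["ham"])

msgs = [("win free money now", "spam"), ("free money free prizes", "spam"),
        ("meeting at noon", "ham"), ("project status meeting", "ham")]
wc, lc = train(msgs)
print(spam_probability("free money", wc, lc))  # well above 0.5
```

It's then up to you where to set the decision threshold, which is exactly why benign mail occasionally lands in the spam folder and vice versa.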
And then if something happens that's outside normal, those events are considered outliers. A lot of times unusual is not necessarily bad, but it often is; it might suggest fraud or something like that.

So with an intrusion detection system, an IDS, what do you want? You want low false positives and low false negatives: a false positive wastes your SOC analysts' time, and with a false negative you miss an attack. You want a reasonable learning curve, you want something that can keep up as the security landscape constantly changes, you probably have only a limited amount of resources, and you want explainable alerts. For example, if you get an alert, you don't want to disable somebody's account without knowing why.

These are just some things you'd look for with a host intrusion detection system: unusual processes, a sketchy user account, unusual kernel modules loaded, DNS lookups to sites that are known bad or problematic, unusual network connections, or unexplained changes made to your registry. You look for that kind of thing. There's a pretty useful tool for host intrusion detection called osquery. It can also measure reliability and compliance, but it's often used for intrusion detection; for example, you can schedule queries to run and collect the results later. One limitation is that it's not built to operate in an untrusted environment, and it does not have any built-in orchestration, but it can work with Chef, Ansible, Puppet, and the like.

Now, network intrusion detection. Say you've got a compromised host. It's probably going to need to initiate communication outward somehow, because a lot of the time a firewall is not going to let an outside attacker come in and establish a connection with that host.
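The "outside normal" idea can be made concrete with a very simple outlier detector, a toy baseline, not anything the talk's tools use: flag values that sit far from the median, measured in units of the median absolute deviation (which, unlike the standard deviation, isn't inflated by the outliers themselves). The sample numbers are hypothetical requests-per-second readings.

```python
import statistics

def find_outliers(values, threshold=3.5):
    """Flag points far from the median, in units of the median absolute
    deviation (MAD) -- a crude but robust baseline for 'outside normal'."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) > threshold * mad]

requests_per_sec = [100, 102, 98, 101, 99, 1000]  # hypothetical readings
print(find_outliers(requests_per_sec))  # [1000]
```

A real anomaly-based IDS models "normal" over many features at once, but the shape of the problem is the same: learn a baseline, then alert on points that sit too far from it.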
Some examples of what you'd look for are botnets, APTs, adware, and spyware. You've got a number of tools you can use. tcpdump will capture all your network traffic. Snort is a rules-based engine. Zeek, formerly known as Bro, supports deep packet inspection and has its own programming language. Traditionally this is all signature-based, but that can be a limitation; for example, when you're dealing with a zero-day or something like that, there is no signature.

These are some of the things you'd look for in web application intrusion detection. Unusual IP addresses: maybe you've got a website that's purely local and you're suddenly getting a lot of traffic from China, Russia, or Iran. Strange URLs: if you're seeing really strange URL entities, they may be trying to fuzz it, and if they're using strange characters, they may be probing for vulnerabilities, like SQL injection. User agent patterns: for example, with automated access, if you're using curl and you don't set a user agent string, it will show curl's default one.

Then there's malware analysis. Malware is basically grouped by family, and you've got a couple of ways you can analyze it. It's traditionally identified by signature matching, but that won't catch your zero-days. Static analysis is typically when you look at the binary, look at the code, maybe decompile it. Dynamic analysis is when you run it in a sandbox. And these are just some of the things you would see in Android malware: strange strings, API calls, network behavior. For example, you might see references to system binaries, server addresses, or checks to see whether it's running in an emulated environment; if you've got malware, maybe the authors want to know whether it's running in a sandbox. And again, like we talked about in a previous example, obfuscation.
And I've got to show my Gen X-ness: I don't know if anybody recognizes this guy, but this is Bob Seger and the Silver Bullet Band. The point is, there's no silver bullet. As we talked about before, there's explainability: do you always know why the system is giving an alert? There's a whole lot of human knowledge required. There are also problems called overfitting and underfitting: underfitting is when the model doesn't fit your data well, period; overfitting is when it matches your training data too well but not your real-world data. There's garbage in, garbage out, which means you start with bad data. And your AI system can itself be attacked; directing a bunch of garbage into a machine learning system is itself a form of attack.

So, takeaways. You've got the potential to automate some tasks and increase efficiency. The ability to do anomaly detection, to notice when something strange is going on, is very useful for zero-days or APTs or anything without an established signature. There's always a risk of error; one example I think we all saw was when Facebook tried to automate some of its content moderation, and a lot of us saw it doing some really strange things. So there's always a risk of error, and I think we'll always need competent security professionals. This isn't going to replace us.

These are just some examples of machine learning software you could use. I'd like to give some credits to my sources and list some additional resources, and thank you so much for your time.