Let's get started. But before I do, I want to give one final warning: this is a talk about defending. I'm not breaking anything, so you might have stumbled into the wrong room. I'm trying to build something here. This is not an Android malware thing that people are going to patch tomorrow. I'm trying to make something that could potentially change the game in a few ways and help with a few problems that are very, very dear to my heart when it comes to defending, blue teams, things like that. Also, there's a lot of math involved — not so much that people will start rioting and throwing things at me, but if you haven't had lunch yet, you might experience some difficulties. It's not good to do math on an empty stomach. Anyway, if nobody is leaving, I guess we should get started.

Just a quick note on where I'm coming from, because a bit of this presentation is going to be a bit ranty because of that experience. I've been in information security for about 12 years, and one of the reasons you've never heard of me is that I did most of it in Brazil. I used to work for one of the largest information security consultancies there, and what I did most was leading SOCs: putting security operations centers together, building SIEM solutions, building log management solutions, either inside a company or in one of those huge, massive SOCs you see in the movies — and, of course, in all your organizations — trying to make the day-to-day of log management and defending actually work. After I was done with that, I got into machine learning, found it to be a very, very interesting subject, and was quick to try to find ways it could potentially help the problems we've been facing. And this is my first presentation, so I do want my shot. Perks of the job — don't tell my employer, but I used to drink before presentations all the time. That's the only way I could get through them. Excuse me.

Anyway, here's the overall structure. I'm going to talk about the problem, then about machine learning, so that I can make sure we're all on the same level. I don't want to dazzle you with some math shit and just tell you it works; I want you to get the basic feeling. Of course this is not Machine Learning 101 — we'd be here four hours talking about that. I don't have the time, and it's not that cool anyway. But I do want to make sure you understand where machine learning is coming from and what the special considerations are for information security, which is something very, very important that most machine learning experts — or data scientists, or whatever you want to call them — will not understand. And then a bit of a case study, which actually goes through the process of one of these algorithms that I designed. So let's get moving, because we have a lot of ground to cover.

First of all, you've all heard of log management, right? You've got a bunch of logs, you've got certifications, you've got best practices. It's everywhere; you just have to deal with it. Everything is generating logs for you. But the problem I have — and bear in mind I've worked with a lot of SIEM and log management solutions over the years — is that nobody's happy. And everyone's deploying this. They have the state of the art.
They're doing everything they can. And everybody — and when I say everybody, you can quote Gartner — 93% of organizations had breaches they couldn't catch with the state-of-the-art deployment they had. Everyone is unhappy and looking for other options. It can't be all that bad — what's going on here?

I got this great graph from SANS. And I do apologize: they don't have the horizontal axis markings, so I have no idea what those numbers are, but that first bar seems big, so I'll take their word for it. What really caught my eye was this: what's the biggest problem in SIEM deployments? It seems to be "I cannot identify key events from normal behavior." That huge bar is exactly that. SIEM — security information and event management solutions. They have correlation rules, they do all kinds of fancy shit, they put some graphics on top, try to pretend that you understand what the logs are talking about. And the biggest complaint is "I cannot tell key events from normal behavior." It's as if you bought a car and your key complaint is "I can't seem to get from point A to point B. The car doesn't start. I cannot find the accelerator." This is what we're buying these things for, and we're not able to make them work.

One of the reasons I believe we cannot make them work is how you're supposed to use these things. You're coming up with magic numbers: okay, I want to be alerted if someone is port scanning me, if they hit me, like, five times. So if I hit you four times, I guess I'm home free, right? Absolutely no problem with that. Or you're making up these arbitrary rules where if something happened and then some other things happened, okay, that's an alert, that's something that should be brought up to me. And in a way, it's all permutations of this — all those 300-and-I-don't-know-how-many rules people put together. And when you're on an engagement, you're just iterating on this until the customer is satisfied, or you run out of money, which is much, much more likely. I don't know, it sounds like a Ponzi scheme to me — like this thing was built to make sure we generate the most consulting out of it. And I don't claim that SIEM solutions are the hugest offenders; if any of you has ever worked with identity management, those guys take the crown.

But anyway, here's the point I want to make. One of these vendors has this huge training curriculum, right? You want to be an expert in the solution, you get one month of training: four weeks, 20 days. There is not a single hour of that training that explains the content that comes with the product. It will teach you for four weeks how to build new rules, how to create new dashboards, how to create new things. But aren't you telling me you have all these amazing dashboards? How do I use them? How do I apply them to what I need to do? So it seems that things are harder than they should be. And I don't want to get into behavioral rules here — I'll talk a little more about them later; they do help a little with this configuration. But I'm not really here to bash the tools, right? I mean, some people are very, very good at log management.
But there are very few of them, right? I've worked with teams that were very, very good — you gave them like six months and they would make these things sing. The problem is that this does not scale. We do not have enough people who are good enough at understanding all of that. These are fantastic tools if you want to build something, but they're not ready. And most people buy them thinking they will solve a problem, that they will be ready — and it's not there. It's definitely not there.

What really got me scared was big data. Right now we have these smallish databases — if you think about SIEM solutions, for anyone here who is in general IT, this is nothing more than a highly vertical business intelligence, data warehousing solution from the '90s, from before anyone invented columnar databases. But now they're starting to catch up, and now you're going to have integration, you're going to be ingesting petabytes of data per day. Who's going to do that? Who's going to analyze it? Are we going to create all the rules ourselves? Usually, when you handle this kind of data, you really have to work with statistical analysis. You've got to know this stuff, and it's a whole different discipline from traditional information security analysis. I've met people who are very good at both, but that makes the pool even smaller.

Given all of this, the only solution I can come up here and tell you about is that we need an army. Let the robots talk to themselves, right? If there's a machine generating data, let's have a machine reading the data for us. We're hopeless. We're not going to be able to keep up. There's absolutely no way.

And that's when we start talking about machine learning. I want to make sure we get the basics right so that we can understand what it can and can't do. We really haven't gotten to the Skynet stage as of now. I know there's a little bit of discussion — I've met some people who are really intent on building this: yeah, let's put all this machine learning stuff together, it's going to be awesome. Not so sure. But the main point when you think about machine learning is that you're not writing the code for the decision-making. You're writing code that takes in some data, and then it starts making inferences based on that data. It's pretty much as if I went to the computer and said: computer, this is a chair. Chair, okay? And the computer looks at the chair — okay, it's got some metal rims, it's got this stuffing on the back, got it, it's a chair. Okay, this is another chair. Oh, that's a bit different, but okay, I think I get the idea. And as you show enough chairs to the computer, it will eventually be able to generalize what a chair is. The secret is all in how you tell the computer what a chair is. That's the real science-slash-art around this: how do you build what we call the features, so the algorithms can take them in and decide, based on what they saw, whether something is this or that? I'm going to give a lot more examples.

And this is everywhere, absolutely everywhere, and I think it's a shame that we don't use it more in information security. Everyone else is selling the shit out of this. You go to Amazon, okay?
Amazon is one of the big examples here. They use this technique called collaborative filtering, which pretty much says: okay, if you bought these things, and there are a lot of people just like you who bought these things, and they bought something else, I'm probably going to suggest it to you — you guys are pretty similar. If you look at the math, they have this billion-row, billion-column matrix, and they just multiply the hell out of that shit, and they come up with "you should be reading this book." It's actually quite awesome. This is something that has been studied and is very well understood by the marketing and sales communities.

When you talk about trading — and this is a very good picture, one of the cautionary tales here — this is the flash crash that happened, I think, two or three years ago. Most high-frequency trading right now is run by these large quantitative funds; it's algorithms. They're just talking to each other, selling and buying based on what's happening. And some of them got confused and almost crashed the whole economy. Someone was watching the monitor going: no, no, no, wait, wait, just back up a little bit. Okay, you guys play nice now. But it's everywhere, and this is a very sensitive place for it to be. On the other hand, you've got people doing some really, really cool stuff with image and voice recognition. So this is actually a picture — that's good. Should I continue? No?

[The DEF CON goons interrupt for the first-time speaker's shot.] All right, so evidently he's the only one who does not know why we're here. What are we called? Shot the noob, that's right. So, your first time speaking at DEF CON? Yes, sir. Congratulations. Thank you. We'd like you to raise your hand if you're a first-timer. You in the blue shirt, because you were faster than everybody else. Sunday morning at DEF CON — you've got to love it. Thank you. What is this shit on the screen? Oh my god. It's an evil cat. It's evil machine learning. That is really cool. All right, data visualization of some kind. And everybody first time at DEF CON — cheers. We'll see you soon.

Okay, I'm happy now. I guess we can continue. Anyway, what's this picture about? The guys from Google set up a little 16,000-core cluster and told the machine: okay, find me cat pictures on the internet. They really know their audience, right? And it did. They used a technique called deep learning, which creates an arbitrarily deep and complicated neural network out of the blue. I don't claim to understand it, but it's mighty fancy. Anyway, this is a visualization of what a computer thinks a cat is. That's awesome — I can see a cat there. Man, this is the future: the computer is watching cat videos now.

Anyway, on a more serious note: what is this being used for in security right now? A lot of fraud detection systems use some type or another. The most basic technique they use is called clustering, which I'll talk a little more about further on. They're trying to find deviations from a pattern. So if they can create a baseline and identify where you sit within a group of customers, they can see whether you're not being you, in the dimensions they're able to look at.
So if you used to do this kind of shit with your credit card and then you do something very different, okay, that's probably a flag. And here's where I touch on behavior monitoring: behavior monitoring, no matter what people tell you, is not machine learning. My economics calculator, the HP 12C, can do rolling averages. That's not learning. Rolling averages are very easy to do. The statistical analysis is helpful, and it's a first step in understanding maybe a bigger scope, but it's not machine learning in any shape or form.

And finally, spam filters, which are the unsung heroes of machine learning. Remember the Bayesian filters? Yeah, that's it — that's actually the algorithm they use. And the point I'm trying to make here is: how many talks did you see this year, or in the past two years, about spam? Nobody seems to be doing research on that anymore. I have my Gmail account; I opened it in 2004, which is a long time ago, and I'm pretty sure every single spammer on the internet has a hold of my address. I don't see spam. I don't get spam. Do you get spam on Gmail or anything like that? I do get all the crap I signed up for, and I do get phishing emails, of course — but phishing emails are specifically crafted to look like a normal email, to get past all of this. Spam is a problem we don't really look at anymore, because it took, I don't know, 10 years, maybe 15. Actually, while doing research for another talk, I found a paper from '98 that talks about techniques for using Bayesian learning in spam filters. It seems like the problem is solved, in a way. And that's one of the messages I want to leave you with: if we start doing this work now, maybe we'll get very good at picking this stuff up in about five to ten years. And that's the power — once you have enough data, and arguably Google is the one with the most data, we can probably agree on that, you can get pretty good at this.

Anyway, now we start getting a little more technical. There are two big kinds of machine learning. You've got your supervised learning, where you are actually telling the computer, or the program, what these things are. And there are two major groups there. One is called classification, where I'm telling it: okay, this is a chair, this is a table, this is a chair, this is a table — a bunch of chairs, a bunch of tables. Now here's an object: what is it? Is it a chair? Is it a table? You're giving it data to train on, and you're giving it labels — this is a chair, this is a table — and then you present new data for it to predict on (and that's the word we use): is it a chair or is it a table? When we think about regression, we're not looking for a binary answer; we're looking for: okay, how much of this is a table and how much of this is a chair? I know that doesn't sound like it makes sense, but think of a stool, or one of those piano benches that actually looks like a table. The computer would get confused: yeah, it's, what, point-four chair?
Yeah, it's a very general example, but this is the kind of stuff we do. We're trying to analyze where something lies, and based on that we can either say, okay, now this is definitely a chair or definitely a table, or we can use the data to make the humans better informed about the decisions they should take.

Then you've got unsupervised learning, which is: okay, I don't know anything about this — look at it and tell me what you can find. You have two big groups there as well. You have what's called clustering, which is the one I mentioned before in fraud solutions, and which is by far the most abused machine learning technique for anything, ever. It's not always applicable, and it has a very, very fatal flaw: you have to actually tell the computer how many clusters you're looking for, so that it can guess what the separation of the data you send it should be. Of course there are techniques to discover that, but then it starts getting very complicated, and most people don't actually do all that legwork, so it can get a little fuzzy. And finally there's decomposition, which in a nutshell is a tool for you to design better algorithms. So: I've got a chair, and I need to tell the computer what a chair is, and there are all these different things I could tell it — its height, what it's made of, and so on — and I end up with, I don't know, a hundred possible variables out of a chair. I'm trying to find which ones matter the most in deciding whether something is a chair or a table. It could be argued that for chair versus table it's the height. So if I ran this chair/table model through a decomposition technique — the PCA there is principal component analysis, which finds the components that really make a difference — it would probably tell me that height is one of the variables I should definitely choose for my model, because it makes an awesome difference. By the way, if you like this stuff, this tutorial is awesome.

So it's not magic. You still have to train the computer, you still have to give it data. But one of the basic principles of machine learning is that if you design your algorithm well, if you design your model well, it will generally get better with more data. There are a lot of mathematical proofs showing that the more data you have — and this is the drawing on the left — your E_in, which is the error you have inside your training model, on the data you know you're training with, and your E_out, which is the error on stuff the model has never, ever seen before in its life, will converge toward an expected error. And don't let anyone tell you they have a hundred-percent working machine learning model, because there's no such thing. It will make mistakes. We make mistakes when we look at things; it's all part of the way we make sense of reality, and we're trying to emulate that in a computer. There will always be errors, and you can always hope, and work your model, for them to be the least they can be. But the point is, you have to be careful with what data you're taking in.
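To make the chair/table idea a bit more concrete, here's a minimal sketch — not from the talk, with completely made-up numbers — of supervised classification on labeled examples, plus a PCA-style decomposition that tells you which direction in the feature space carries most of the difference:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each row is one object: [height_cm, top_area_cm2, has_backrest]; label 0 = chair, 1 = table.
X = np.array([
    [45, 1600, 1], [48, 1500, 1], [50, 1700, 1],      # chairs
    [75, 9000, 0], [72, 12000, 0], [78, 10000, 0],    # tables
], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised classification: learn from labeled examples, then label a new object.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[47, 1550, 1]]))      # -> [0], i.e. "that's a chair"

# Unsupervised decomposition: which direction in feature space separates things the most?
Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Xs)
print(pca.explained_variance_ratio_)     # almost all the variance sits on one component:
                                         # the combined "tall, big top, no backrest" direction
```

That last line is the PCA point from above: out of all the variables you could describe a chair with, the decomposition tells you which combination actually separates chairs from tables.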
So I'm going to give you an example where I've used the SANS DShield data, which is this firewall data that you get from them, about firewall blocks. They wouldn't tell me, but I would assume it's pretty US-centric, so there is definitely a bias in the results I'm going to show you here. You always have to be cognizant of that.

And you always have to look out for adversaries. This is one of the points I wanted to make in this talk: most of the people who are building these models and doing the machine learning do not understand that people would like to fuck with them. We understand that on a very deep, personal level. One of the first things that came to my mind was: okay, even if I build something here, how could I potentially exploit it? How could I send some random noise, things like that, that would render it completely useless? I'm going to talk a little about that as one of the weaknesses of this specific model I built. But it is there, right? People will try to mess with you. And if this becomes a valid method of actually defending and helping defense — and I really believe it will — there will be a lot of talks at this conference, with some really crazy math guys, on how they will defeat it. It will be the crypto wars all over again, or something like that.

The point I'm trying to make, just to exemplify this: remember spam engines? When we started getting this Bayesian thing going, the spammers just started pasting whole sonnets of Shakespeare at the end. And then the model would think: hmm, I haven't really seen many spams with the word "thou," so that's probably a legitimate email. That's pretty much what it was. And then people refined that, made sure the filters would understand that sort of thing, evolved the model as well, and we got to the level we're at now. But people will always find a way to break things, and especially in these kinds of applications, this is something we have to be very, very careful about. (There's a little sketch of that padding trick below.)

Anyway, enough introduction. Let's get to it — let's chew on the logs. Most of the talk here is about the feature engineering, which is the really important part. On that slide about the kinds of machine learning there were some names of algorithms; there's a whole bunch of them, and you just try them all — you create a process to try them and see which works best for your data, because there's no one-size-fits-all. The real problem is what data you're feeding it. That's the difficult selection process, and it's what everyone will tell you is the real hard work in anything like this.

Anyway, I was telling you about DShield. What I did is I've been collecting their bulk logs. If you go to the website, you'll see a lot of top-10 ports being attacked, top-10 places bad people are coming from on the internet, things like that. But if you ask really nicely — and they were completely awesome about that; I really would like to thank the SANS Institute for their help here — you can get the bulk data. I started mining it in January, so I've got like seven months of shit. And this is very basic stuff.
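Here's that Shakespeare trick in miniature — a toy Bayesian filter over a made-up four-message corpus, purely to illustrate how flooding a spam message with words the filter has only ever seen in good mail drags its score down. None of this is from the talk:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: two spam messages, two legitimate ones.
train_texts = [
    "cheap pills buy now",                       # spam
    "win money now click here",                  # spam
    "meeting agenda attached for tomorrow",      # ham
    "thanks for the agenda see you tomorrow",    # ham
]
labels = [1, 1, 0, 0]                            # 1 = spam, 0 = ham

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_texts), labels)

spam = "buy cheap pills now"
# The old spammer trick: pad the message with words the filter associates with good mail.
padded = spam + " " + "meeting agenda tomorrow thanks " * 20

for text in (spam, padded):
    p_spam = nb.predict_proba(vec.transform([text]))[0, 1]
    print(round(p_spam, 3), repr(text[:30]))
# The padded copy scores far less spammy: every repeated "innocent" word multiplies in
# more evidence for ham and drowns out the handful of genuinely spammy tokens.
```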
Anyway, back to the DShield data, because this is one of the points I'm trying to make: this is firewall data, blocked firewall data, and it's summarized. I get to know how many blocks there were for each one, which I use as one of the ways to decide who I'm going to select for the model. But that's all there is. Still, you know that for this group of people submitting the log files, these are the guys who were potentially attacking them, right? You always start with a port scan, you always start with something like that to see which machines are up. So if people were hitting them on the firewall — and this is one of the points I'm trying to make — on machines that didn't even have that port open, well, maybe if they had hit the right port, that would have been an issue. That would be something worth looking at.

Here's just a summary of the amount of data. Roughly, per day, I got a million observations. An observation is pretty much "this IP address attacked this port." From that, I would select the behaviors I'd be looking for. And when you decompose that — I didn't get the raw logs to decompose, I just got a number there — you'd get roughly 30 million log events per day. One thing I want to point out is that this is not big data at all. This is like nothing. I'm running this on my laptop, and it's not even a good one at that. So don't let anybody tell you, "oh my god, you need the cloud" — yeah, sure you do. I'm really trying to bring this to a tangible level; that's one of the objectives I have in this talk. I'm just getting some data and doing the mining on my own laptop.

One of the intuitions of this model is proximity. Anyone who has ever done real SOC work has developed this instinct: okay, I've seen this IP address before, or I've seen people coming from that side of the woods before. And it's interesting, because I've seen a lot of things get caught out of mistakes. People would look at an IP address — oh, I remember this guy, let's look at this — but no, that guy had never actually appeared; it was an IP address that was similar to that one. And you follow it, and yeah, there's actually something there. So I started doing some research on that, and there's some anecdotal evidence and some really, really hard statistical analysis about this.

One of the things I like to point out is the Spamhaus/CyberBunker thing. I don't mean the DDoS, I don't care about that. I care about the fact that Spamhaus actually stood up and said: okay, forget about that ISP. Anything that comes from there is bad; they're just going to try to spam you; just block the whole thing, I can't take it anymore. That's an interesting conclusion, and they really try to be — can I say level-headed? Sorry, I don't know anyone there, I can't really comment on that; you'll have to take their word for it. Then you've got the Google malware report, where they said it blatantly: there are places that are more likely to host malware than not. And there's a paper from a researcher in Brazil — Moura, I think; I forgot his first name — who did a statistical analysis of the past seven years of logs from the Brazilian research network, the one that connects the universities, and he was proving this shit statistically, all the time.
This is not random at all. There is some information we can potentially extract from it. So, to create my features, I started grouping those logs by arbitrary netblocks — let's pick some and see if it comes out all right — and of course by ASN, which is what really shines: the autonomous system they're coming from on the internet. For that I used a lot of Team Cymru's services; they have an awesome whois service you can pull from, things like that.

So here's a visualization of that. It seems like a lot to take in, and that's understandable, because that's the internet up there. The point is that this is a projection that tries to maximize proximity as you draw the IP addresses. I put the drawing there — this is called the Hilbert curve, and that's the kind of transformation it's doing. If you think of the IP addresses as if they were on a straight line, from 0.0.0.0 to 255.255.255.255, I'm twisting that line around to make sure the IP addresses stay as close as possible to their neighbors. (There's a little sketch of that mapping below.) What you're looking at is accumulated data up to July 20th, of people who were trying to attack the DShield contributors on port 22. I don't think that's random. You can see, of course, that some places on this map are dark by nature — the DoD is not really doing anything there, although they have like two /8 blocks; even IBM has a whole /8, nine-dot-something, that they don't really use. But even in the other places you can see there's some density going on. And by the way, if you're wondering: the DEF CON network is up there too.

So there is some clustering here. And if you look at one of these clusters, you can start to jump to conclusions, right? Oh my god, look at these guys, they're obviously up to something. Then you do some more mining of it and — oh my god, they're definitely up to something. And I mean, this is data. But the point I'm trying to make is that it's very easy to do this and start jumping to conclusions, and we really have to see the thing through to the end. Because if you look at it, what it actually is, is us. We're pretty much beating the shit out of each other all the time. So I just want to make the point that if you're blocking on "oh, I hate this country, I'm not going to let it in" — that doesn't really mean anything. You just have to be careful with that.

Anyway, let's get moving. So we've got this proximity idea. But I also want to be able to decay it, because the neighborhoods might renovate. If people are changing ISPs, changing their anonymous proxies, I want to be able to keep up with that; otherwise I just have a massive blacklist that goes on forever, and I'm eventually going to block the whole internet off. So as time passes, I want to be able to forget these things happened, and do some sort of exponential decay. You choose your metric, and you can see that after three to four months it doesn't really matter anymore — you've completely forgotten that those people were attacking you. Which actually pretty much mimics what an analyst would do, right? You only have so much memory.
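For the curious, here's roughly what that straight-line-to-square twisting looks like in code — the standard Hilbert d2xy conversion applied to IPv4 addresses. The grid size here (256×256, one cell per /16) is just a choice for the sketch, not necessarily what the talk's map used:

```python
import ipaddress

def hilbert_d2xy(order, d):
    """Convert distance d along a Hilbert curve covering a 2**order x 2**order grid to (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def ip_to_pixel(ip, order=8):
    """Map an IPv4 address onto a 2**order x 2**order Hilbert map (order 8 = one cell per /16)."""
    d = int(ipaddress.IPv4Address(ip)) >> (32 - 2 * order)
    return hilbert_d2xy(order, d)

# Addresses that are close on the number line land on neighboring cells of the map.
print(ip_to_pixel("10.0.0.1"), ip_to_pixel("10.3.0.1"))
```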
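And here's one way to get a decayed "were you hitting me?" score that behaves like the one he describes, where hitting every single day pushes you toward 1 and going quiet lets you fade back toward 0. The exact update rule and the 30-day half-life are assumptions for the sketch, not the talk's actual parameters:

```python
import math

HALF_LIFE_DAYS = 30
DECAY = math.exp(-math.log(2) / HALF_LIFE_DAYS)   # per-day retention, ~0.977 for a 30-day half-life

def update_ranks(ranks, seen_today):
    """ranks: {key: score}; seen_today: keys (IPs / netblocks / ASNs) that hit the firewall today."""
    new = {}
    for key in set(ranks) | set(seen_today):
        hit = 1.0 if key in seen_today else 0.0
        new[key] = DECAY * ranks.get(key, 0.0) + (1.0 - DECAY) * hit
    return new

# Toy usage: you keep one ranking per aggregation level and update it once per day.
ranks = {}
for seen in [{"203.0.113.7", "AS64500"}, {"AS64500"}, set()]:
    ranks = update_ranks(ranks, seen)
print(ranks)   # the ASN that keeps showing up outranks the one-off IP; both fade if they go quiet
```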
Like I was saying: you go out to party and things like that, so if something happened a month ago, you're not as likely to remember it as if it had happened yesterday.

So, given those two intuitions, we have to start calculating. The point is that we create some rankings, which are going to be the features for our model: by IP address, by some netblocks that you choose, and by ASN. If data is missing, I took a shortcut that says: if I can't discover what your ASN is, or if you're a bogon, you're bad, man — I'm just going to give you a high score. It's a shortcut, but it's there to make sure we don't ignore those people, which would maybe be bad in a scenario like this. And the point is that for each day, based on the time series you saw, we have a calculation: this is what happened this day. The following day, we decay that by the exponential function and then add what happened on that following day — we're pretty much adding one. This happened this day, this happened that day, and we keep decaying the rest.

The importance of keeping this history is that if you're using log data to detect stuff, log data has dates. I can't use future calculations with things from yesterday — that doesn't make any sense; the model won't make any sense. It will suffer from look-ahead bias, where something that happened in the future influences the past, and that's a very rookie mistake you can make with this sort of thing. You just have to make sure you're pairing the features you have with data from the right dates.

And this is just an example. The vertical scale is actually log scale, so when you see six there, it's a million, right? And if you get a one on this score, which is the horizontal axis, it means you were hitting me every single day. I found it pretty interesting that you could get at least another 10,000 guys — this is 3389, this is RDP — who were hitting me every single day. Even if I were using that as a blacklist, which is a perfectly valid way to do this, look at all the guys I'd be ignoring compared to the rest of the data I have. And I have another example here on port 22, where you can see pretty much the same behavior. There will be some guys who will be able to get you; but if you only look at the daily blacklists, there's so much information you're leaving behind that you could potentially leverage to help what you're doing.

We good? Too much math? Yeah. Oh man, I'm not going to finish like this. I'm sorry. All right, all right — enough with your math bullshit, it is time to drink. Wait a second, did we already do a shot with you this morning? You did? No, not this morning, you just came into this room. You can do another one, no problem. Matt, he wants to do another one. I'm sorry, is that a Russian accent I'm hearing? No, it's Brazilian. Really? Yeah. Oh my god. Wait, just wait — maybe we just want to drink, I don't know, some of the audience. Wait, wait, what? I saw you earlier, right here — you've done a shot? Yes. This is the second time? Fuck, that is awesome. It's great.
All right, for the second time, we've got a few shots. Wait, wait — what's your name? Steve? Clearly we've lost track at this point. Steve, this is everybody. Everyone, first time. It's Andy. Here's to Steve. I guess we'll be back in five minutes.

Okay, let's make sure we get close to finishing this. The point here is, we've got a bunch of numbers, right? — sorry guys, we should do a drinking talk — anyway, we get these features calculated and we pair them with the data we had. The assumption here is that if someone hit the firewall, they're out to get you and they should be considered bad. So you put those guys into your training set: pretty much the IP address and the features you calculated, and you feed that to the model. And you usually have to take more than one day of data — that's why separating the ranks on a daily basis makes sense, because you want to make sure you're pairing the right date with the right data you're using.

Some rules of thumb — it all depends on the algorithm you're using. The point is you can't only have malicious data; you have to have good data as well, otherwise the computer will just say, "yay, everything is a chair, man — that was an easy job." So you've got to find something, and what I did was take a bunch of IP addresses from Alexa and from Chromium to make sure we had enough benign data to pair it up with, so it was at least 50-50. As far as algorithms, I'm not going to go into this because I don't have enough time, but support vector machines are awesome. They do math that people don't even understand. It's like the mathematicians said: well, I think there is an infinite dimension where this makes sense — I don't care, I'm just going to calculate it on the matrix and it should sort itself out. And it works. Scary. Scary as fuck.

Anyway, here's the point. We train this every day. I've got a bunch of ports — what's happening on port 22, what's happening on port 3389, what's happening on port 25 — and I would get something around 83 to 95% training accuracy. Accuracy is what I got right — bad is bad and good is good — divided by the whole number of things I gave it to process. And this is good, but it's not a really honest measure, especially if you're using time-sensitive data; there's a technique called cross-validation, and it doesn't really work very well for that. So what I did to actually test it was: okay, let's look at the following day. I did all this calculation for today — who are the guys most likely to be attacking me tomorrow, based on all the data I've got? And there I would get something from 75 to 85%. I'm going to break that down for you so you can really understand what that means. Man, this is hard.

Anyway, just to give you an idea of the progression: of course, if you have a model from February and you try to run it against something that happened in July, you're going to have a terrible time, because a lot has happened since then. It also illustrates a little bit how much the environment moves around. So here's the point I'm trying to make. Okay, it's 79 to 95%. What's the good?
What's the bad stuff I got right, and what's the good stuff I got right? So: true positives and true negatives. The numbers are a little skewed — you can see the true negatives were really bringing it up — and I think once I get different data it should even out a little. But the point I'm trying to make is that if you work it out, given this error rate, something that this picks up in your logs and tells you to look at is about 13 to 18 times more likely to be attacking you than all the rest. So wouldn't you, rather than picking at random, look at those first? I'm not saying this catches everything — and neither do the analysts, you know, spoiler alert. But if you have time constraints — and people only have so much time; they have to eat, they have to sleep; I know, it's terrible, right? Labor laws — then let's try to make this a little bit easier. That's where this is coming from. (There's a rough sketch of this train-today, score-tomorrow loop below.)

So this is an idea of prediction. It's the same map you saw before; I'm sampling 100K IP addresses from each place, and the brighter the tile, the more likely it is that people will be attacking you from there. It's a logarithmic scale, so it goes like 10, 100, 1,000. Anyway, there they are — you can see it's everywhere.

Now, challenges. IP addresses are bad. They're the worst thing you can have when you're trying to do real incident response. But the point I'm trying to prove here is that even with really shitty data, we can get some interesting results. Anonymous proxies — okay, it holds up well. But Tor: there's not a lot of clustering, so if you're coming from Tor, it starts messing things up a little. And if you're just changing your IP address every 30 minutes, fuck you. But I believe that if you can reduce the cycle — this is a daily resolution — you could start getting smarter about that, and then I wouldn't just go on IP addresses, I would go on different stuff.

So anyway, where am I trying to take this? As it is, it could potentially help security analysts. Like I said, it brings a new priority dimension to SOC work. You already do "these are my most important assets"; now you also have "these are the guys most likely to be able to get me" — so let's see who they are and invest some time in that. What I think is really cool is that I created a model for firewalls, right? I could do exactly the same thing for IPS. I could do exactly the same thing for WAFs. And if I start taking the inference that each of these individual models gives me on these IP addresses and I combine them, I could come to you and say: okay, this guy is 200 times more likely to attack you — I would block this fucker. That's the kind of confidence we're trying to build. And I know blocking is a very bad word in information security, especially if you're defending, but maybe we can get to that confidence.

And the way I'm trying to do this: there's actually a project where I'm pretty much begging for data. You send me data, I'll send you reports based on what the model sends back. And the point is that the more data I have, the more I can start fighting the bias that I got from the SANS database, okay?
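Here's a rough sketch of that daily loop: for each port, train a model on today's features and labels, then score the next day's observations, so no future information leaks into the past, and report how much more likely the flagged addresses are to actually attack. The feature layout, the toy data, and the lift calculation are my assumptions, not the project's actual code:

```python
import numpy as np
from sklearn.svm import SVC

def train_and_score(day_t, day_t_plus_1):
    """day_t / day_t_plus_1: (features, labels) pairs; label 1 = seen attacking, 0 = benign."""
    X_train, y_train = day_t
    X_test, y_test = day_t_plus_1
    model = SVC(kernel="rbf").fit(X_train, y_train)   # train only on what was known at day t
    pred = model.predict(X_test)                      # predict who will hit us on day t+1

    tp = np.sum((pred == 1) & (y_test == 1))
    fp = np.sum((pred == 1) & (y_test == 0))
    tn = np.sum((pred == 0) & (y_test == 0))
    fn = np.sum((pred == 0) & (y_test == 1))
    accuracy = (tp + tn) / len(y_test)
    # "Flagged addresses are N times more likely to attack than the rest":
    rate_flagged = tp / max(tp + fp, 1)
    rate_unflagged = fn / max(fn + tn, 1)
    lift = rate_flagged / max(rate_unflagged, 1e-9)
    return accuracy, lift

# Toy usage: features could be [ip_rank, netblock_rank, asn_rank] from the decayed scores.
rng = np.random.default_rng(0)
def fake_day(n=500):
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 0.2, (n, 3)) + y[:, None] * 0.8   # attackers sit at higher ranks
    return X, y

print(train_and_score(fake_day(), fake_day()))
```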
So there are some URLs, there's a Twitter feed — if you look it up on Google it might come up, I'm not sure. Anyway, I'll be around. So, takeaways: machine learning is cool, and it can help. It's not a monster. Of course there's marketing hype and stuff, but it can really help. And I think 13 to 18 times is pretty, pretty good. Thanks.