So, before we begin, I just want to give a couple of shoutouts today, just a little preview of the Packet Hacking Village. Later on today, we've got Chef. You'll be speaking later, at what time? 1610. 1610. We've got 1610. And also, I want to give a very warm welcome to DEF CON to Megumi, from Japan. Megumi, you'll be giving a training at the workshop area. Welcome to DEF CON, and thank you so much. Thank you. So, right now, we're going to start off with the third talk of the speaker workshop. And without much ado, it is absolutely my pleasure to introduce you to Princess Leia.

Hi, everyone. Thank you so much for your patience. Sometimes Linux works with displays, and sometimes it tells me to go fuck myself. So, the fact that it's up here right now, I'm very happy. I may look over here because my presenter view is about this big, so I can't actually read it on my laptop. So, I apologize in advance if I'm glancing over a bit.

So, this is Iron Sights for Your Data: Predictive Analytics for Your Blue Team. Before I start, I should tell you a little bit about myself. I actually have a master's in education and I'm ABD in research psychology. I almost finished the PhD, and then I realized I didn't want to be a professor for the rest of my life, and I noped out of there. I also realized I really enjoyed doing the data part, not so much the research on some of the topics I was doing in research psychology, so I just went back into data analytics. I've been a data analyst in higher education off and on for about 13 years. I love data, and so I play with it in my spare time. I'm also a fiber artist: I knit, I crochet, and I'm learning to sew. That's my cat. And my Twitter is at sweet girl, or you can email me. And if any of you want to try to say my last name, it's Figueroa. You can just call me Princess Leia.

Okay, so before we start looking into predictive analytics, one of the things we need to look at is where we're currently at.
So, one of the things I use is the Data Breach Investigations Report data that's put out by Verizon. It's one of the nicer ones; it's a little more comprehensive. In 2016, there were more than 64,000 reported security incidents. I'd like to point out that all of these are responsibly disclosed items. These are not the ones that the companies pretend didn't happen. More than 3,100 of them were confirmed data breaches where people's data was leaked. In 2015, you're looking at 79,000 incidents. Not as many confirmed data breaches, but there are still a lot. Same thing in 2014. I anticipate 2017 is going to have an even higher number of data breaches attached to it.

But it can't possibly be that bad, right? It's not on the news all the time. Here's a year in review for last year. I'd also like to point out, if you have Blue Cross Blue Shield, or any healthcare provider for that matter, make sure that they tell you if they have a data breach. Over 5 million people's records were leaked last year. So these are just the big ones, where enough data was breached that it could possibly have caused some kind of issue. That doesn't include all of the data breaches. I just picked the largest ones. So yeah, it's pretty bad. And I apologize for Linux. This was really cute, where he freaked out and fell down. Linux.

So what the hell do we do? Well, you can run around and scream. I know a lot of you work with people that are like, oh god, I don't know what to do. We're just going to let them have all the data, because they have it anyway. There's no point. Or you can weaponize your data.

So the underlying philosophy behind this talk is that past behavior often predicts future behavior. And I'm sure those of you who have taken psychology at any point have heard this. In other words, people often repeat the same behaviors. And a lot of people are like, well, it's not necessarily people. But people are behind all the data breaches. They design the botnets.
They design everything else. So we're looking at that philosophy.

So what the heck does it have to do with security? A lot of times data breaches, attacks, threats, or vulnerabilities come up in cycles. The same types of vulnerabilities that caused company X to get hit are most likely the ones that are going to cause it to get hit again. Attacks, attackers, and attack vectors typically follow patterns you can recognize. Usually there's one big name associated with some type of attack, or it's an attack that has been recycled and just made pretty, and then it continues on. But you can trace it back to the original source. So learning about where the stuff came from can help you understand what's going to happen next.

So how do I weaponize my data? Well, before you can weaponize it, you have to understand how to approach it. This is actually a two-pronged approach: data mining and predictive analytics. So why bother mining your data? There's a lot of it. I know any of you who run logs have pages and pages and pages, and just massive amounts of data storage. And why should you use predictive analytics? Well, I'll tell you one thing. Facebook uses it. Big Brother uses it. The three-letter agencies use it. Pretty much everybody else uses it. So why shouldn't you?

More seriously, why bother? Well, one thing is that forecasts can predict attack surfaces, vectors, and actors. Just like in weather prediction, forecasts can help you prepare for something bad. If you know a hurricane's coming, you're not going to sit in your house that's directly in the path. Well, you might. But most people won't do that. And there are too many breach reports, too many ice giants, not enough Nordic gods. It's not perfect, but knowing what your data is can put you in a more defensible position because, like most people, you probably have information coming at you from every source.
And not having a way to make it easier to understand and codify it makes it a lot harder to actually see what's going on. It's just a bunch of noise. This is an ongoing effort that requires a team effort. But if you do this kind of thing, it can make it so you can effectively fight back.

So, I know some of you are here like, but I have this: Palantir and Maltego, Kibana, Splunk. You have all these fancy tools for working with your data. You can use them, but only after you look at your data. Actually, if you use one of those things first without cleaning your data and trying to figure out what it is, it's basically like you're using a fire hose to fill a shot glass. Everyone is going to get wet and nothing's going to get done.

So before we talk about weaponized data and doing the data mining and predictive analytics, this is what an actual weaponized data feedback loop would look like. You start with the data collection. Data mining allows you to clean it and codify what you need, which allows you to perform predictive analytics on the data, and you can forecast from it. Once you forecast, you have a narrower scope to look at, and you can start collecting data within that area. Don't stop collecting the other data, though. Seriously, you need it. But that allows you to look further at the topics that you need.

Say, for instance, you have a problem with passwords. Somehow, logins are getting compromised left and right. You have other issues in the network, but that's the big one. With this, it'll let you know, hey, I've noticed that it's always the top officials in this business that are having their passwords brute-forced. That allows you to focus better on them. Maybe get them a password manager or something along those lines, but that tells you where you need to focus. The feedback loop also makes sure the forecast becomes a lot more accurate. Without the feedback loop, you're just shouting into the void.
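To make the password example concrete, here is a minimal Python sketch of the "narrow your focus" step. It is an illustration under stated assumptions, not anything shown in the talk: the incident records, the field names (`type`, `victim_role`), and the `focus_area` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical, hand-made incident records; the field names are illustrative only.
incidents = [
    {"id": "INC-001", "type": "password_brute_force", "victim_role": "executive"},
    {"id": "INC-002", "type": "password_brute_force", "victim_role": "executive"},
    {"id": "INC-003", "type": "phishing", "victim_role": "staff"},
    {"id": "INC-004", "type": "password_brute_force", "victim_role": "staff"},
    {"id": "INC-005", "type": "password_brute_force", "victim_role": "executive"},
]


def focus_area(records, incident_type):
    """Return (role, count) for the role hit most often by incident_type, or None."""
    counts = Counter(
        r["victim_role"] for r in records if r["type"] == incident_type
    )
    return counts.most_common(1)[0] if counts else None


result = focus_area(incidents, "password_brute_force")
if result is not None:
    role, hits = result
    print(f"Focus data collection on '{role}': {hits} brute-force incidents")
```

In a real loop, the role that comes out on top would decide where extra collection and monitoring go next, and the next round of incidents would be fed back into the same count.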
So before you start, you need to have a framework for collecting data. A good framework includes incident tracking. You need victim demographics. I know people don't think that's that important, but sometimes you can find out that there's something going on. Maybe it's a specific subset of users who have this Windows box at their home that are being attacked. Sorry, it's, like, in my face. Is that better? Yes. Okay, sorry. My mouth doesn't open that big.

Okay, a good framework for data collection should include incident tracking. Each incident should have its own unique identifier. If you have redundant identifiers, it'll just make the data dirtier. You also need victim demographics. Trust me on this. It can be a pain in the butt, but if you have something that pops up consistently in victim demographics, then that can help you pinpoint what's going on, or where some issues are, or where you need to focus. You also need an incident description. I've looked through a lot of these, and my favorite was "person was hacked." Okay, that's not telling me anything. What happened? I don't care if it takes you four pages to describe it. If you have a good data person, they can figure out what you're saying, but telling me the person was hacked means nothing. Discovery and response: what did you do? What happened as a result? And impact assessment. Sometimes you don't know the impact, and then you can say that. But this kind of data collection makes it a lot easier for people to use the data in a manner that allows for predictive analytics.

Now, there are some good frameworks out there. I like the Verizon one because, like I said, I live in data, and it's nice and set up, and there's little columns, and everything is nice and neat. Data's not neat, but you can kind of force it into boxes. There are less Verizon-y ones, like CERT, FAIR, NIST RMF, OCTAVE, the SANS 20 Critical Security Controls, TARA, and others, or you can make your own.
One that has all of those pieces, all of those stipulations.

Data collection. I'm sure you have incident reports, and system logs, and application logs. All of these allow for different kinds of analysis. For instance, incident reports let you do trend analysis. System logs let you do broad analysis, and application logs can allow focused analysis. All of these can fit in, and it's basically the idea behind Facebook, or any of the other big ones. The more data you get, the better it is. Many tools exist to help you extract it. Logstash is free and flexible. You may have Splunk, which I've heard great things about, but I can't afford. Or Security Onion, which is kind of the kitchen sink of all operators. There are other processing tools, but you can ask your operations team. They should know about this.

Now, data mining and predictive analytics are sometimes used interchangeably, which is wrong, and people should not do that, but nobody asks the data people. Data mining produces decisions based on normal reports. Predictive analytics uses data mining reports to move forward. This is kind of how it goes. You define the project, collect the data, do the data analysis and the statistics, then you do predictive analytics modeling, and you deploy it. What it doesn't show is that there's a nice little feedback loop that goes back to the definition and data collection, because sometimes when you're doing this type of data collection and predictive analytics, you realize that you were looking in the completely wrong area. You're like, I am so sure that people are coming from outside the network and attacking, but no, there's something inside the network that is causing all of these issues, and you never spotted it before because of all of the noise.

Data mining finds valuable information that is hidden in large volumes of data. There are five major elements: ETL (extract, transform, load), storing and managing the data, providing data access, analyzing it, and presenting it. You can use different levels of analysis.
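As a rough illustration of the data collection framework and the trend analysis that incident reports allow, here is a short Python sketch. The `Incident` fields mirror the elements named earlier (unique identifier, victim demographics, description, discovery and response, impact), but the class, the field names, the helper functions, and the sample records are all assumptions made for this example, not a schema from any real framework.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    """One incident report; every field name here is hypothetical."""
    incident_id: str       # should be unique per incident
    occurred: date
    victim_group: str      # victim demographics
    description: str       # what actually happened, in detail
    response: str          # discovery and response
    impact: str = "unknown"  # say so when the impact is not known


def duplicate_ids(incidents):
    """Cleaning step: list identifiers that appear more than once."""
    counts = Counter(i.incident_id for i in incidents)
    return sorted(k for k, n in counts.items() if n > 1)


def monthly_trend(incidents):
    """Trend analysis: number of incidents per YYYY-MM, in date order."""
    counts = Counter(i.occurred.strftime("%Y-%m") for i in incidents)
    return dict(sorted(counts.items()))


# Made-up sample reports for illustration.
reports = [
    Incident("INC-001", date(2017, 1, 12), "executives",
             "Repeated failed logins followed by account takeover", "Password reset"),
    Incident("INC-002", date(2017, 1, 30), "staff",
             "Phishing email with credential-harvesting link", "Blocked sender"),
    Incident("INC-003", date(2017, 2, 3), "executives",
             "Brute-force attempts against VPN login", "Account lockout enabled"),
    Incident("INC-003", date(2017, 2, 9), "students",
             "Malware found on lab workstation", "Reimaged machine"),
]

print("Duplicate identifiers to fix:", duplicate_ids(reports))
print("Incidents per month:", monthly_trend(reports))
```

The duplicate check runs before the trend count on purpose: redundant identifiers are exactly the kind of dirt that should be cleaned out before any analysis is trusted.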
Predictive analytics extracts data from the existing sets. Basically, predictive analytics allows you to identify trends. You can kind of see this when they're talking about housing markets, or rentals, or any of those big things, but this is actually more important for security, because you can predict all of these things. It's not an absolute science, though. So if someone says, I'm using predictive analytics and I can tell you exactly which targets are going to get hit at what time, they're full of bullshit, and don't pay their contractor fee.

This is how predictive analytics should work. It is a continued, iterative cycle. You have reporting and analysis: what happened and why it happened. Monitoring, which a lot of people in security spend a lot of time doing: what's happening now. Predictive analytics is what's going to happen in the future. This allows you to start an action, which sometimes changes what's going on. So if you're using predictive analytics and find an attack vector and are able to successfully shut down that area, sometimes the same attackers will try a different area using the same sort of attack vector, and that can allow you to predict that kind of action. It's a weird concept, but just think of it as a spiral. You can't stop doing this once you start, which is a little disheartening, but at the same time it does make your security team a little better.

So, forecasts. Models allow forecasting. A lot of times it's very much like the weather forecaster. Sometimes they're spot on, and sometimes they're off, and that's because sometimes things change in the model that you can't account for. Let's say that there was a threat actor and something changed. Maybe the threat actor that you were looking at was arrested. You didn't know that, and so you're looking at this area, but someone else had found out about the work they did and decided to attack from a different point.
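Here is a small Python sketch of what a forecast and a check on how far off it tends to be could look like. It assumes a simple straight-line trend fitted by least squares, which is a stand-in chosen for brevity and not a model named in the talk; the function names and the monthly counts are made up.

```python
def linear_forecast(history):
    """Fit a least-squares line to history and predict the next value."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n  # value of the line at the next time step


def backtest(history, min_points=3):
    """Mean absolute error of one-step-ahead forecasts over past data."""
    if len(history) <= min_points:
        raise ValueError("not enough history to backtest")
    errors = []
    for t in range(min_points, len(history)):
        predicted = linear_forecast(history[:t])  # forecast using only the past
        errors.append(abs(predicted - history[t]))  # compare with what happened
    return sum(errors) / len(errors)


# Made-up monthly incident counts, for illustration only.
monthly_incidents = [12, 15, 14, 18, 21, 20]

print("Forecast for next month:", round(linear_forecast(monthly_incidents), 1))
print("Typical miss on past months:", round(backtest(monthly_incidents), 1))
```

The backtest is the weather-forecaster point in code: the forecast is only a trend estimate, and the error against what actually happened shows how much to trust it. A sudden jump in that error can be a hint that something outside the model changed.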
It does give you minable data for future forecasts as well, because once you pull the prediction down you can also pull what actually happened, compare those, and create a model that takes those two into account. So eventually the model does become more accurate, but it's never going to be 100% perfect.

So before you begin using any type of predictive analytics and data mining, you need to select a tool and learn to use it effectively, and you'll want to use it in that feedback loop. Sorry, it's really, sorry, the microphone. Anyway, the best approach is to combine both data mining and predictive analytics tools with a nice friendly GUI. It really gets messy. I play with some of this stuff. It's really, really messy on the back end, and I know most of you live in, you know, bash or everything else. Trust me, it gets messy with data, because data.

The top tools in use are enterprise tools like IBM and SAS. Where I work, I'm at a SAS shop. I don't really like it because it's clunky and expensive. In my free time I use a lot of R. R is a nice open source project. It's free. They do have things you can pay for, but it's friendly, and I like open source. There's also Weka, KNIME, and RapidMiner. There are probably other tools that you can use. I prefer R just because it's one of the ones I can install at work and work on in my free time.

So this is the blue team predictive analytics cycle from earlier. Basically, the blue team actually focuses defense, so you add an extra step of focusing defense and data collection and monitoring it. Now, remember your front-end tools? Kibana, Splunk, Maltego, Palantir, Graphite, Graylog. Once you have all the data cleaned up, you can actually feed it into these, and it actually makes it a lot easier. It cleans out some of that noise, that background noise, things that you don't need to worry about, and allows you to focus your data so it's a lot more efficient. Data is everywhere. It's useful. It's pretty powerful.
Do you have any questions? I'm sorry that went wrong. I'm super nervous. Any questions? Yeah.

What are the top three things not to do when you're migrating from just being a command-line jockey to starting to use the analytic and predictive tools?

Top three. Don't assume that your command-line skills work as well in one of these tools. So one of the things I like about R, sorry, I'm short and this thing is so tall. Anyway, R has a very nice GUI that you can use, RStudio. It has a pseudo command line, so you can still use those skills, but it allows you to kind of convert into the R language, because it doesn't always work exactly like it does on the Linux command line or anything like that. Another one is assuming that these tools make the same kind of logical sense that you're used to when working on the command line. They don't always make the same sense, and sometimes you have to do some mental acrobatics. And the third is assuming that everything you have is valuable data. It's all valuable in the sense that it's data, but trying to get it all to process through without doing pre-cleaning is a huge mistake, because it'll end up bogging down your time and preventing you from focusing on what's important.

Hi. Approximately how many analysts, what is the ratio that you would suggest for the number of people looking at the data to the quantity of the data?

That really depends. So if you have someone who's very comfortable with the data, you can usually get away with one or two analysts, as long as you...

We have a lot of data, like a lot, a lot.
You could usually, honestly, as long as you have support from the team and you have, like, a DBA in place. So I would suggest a DBA and then at least one to two analysts. If you have a lot, maybe three. But honestly, you have to make sure that the analysts get along and work in the same kind of flow. This sounds kind of weird, but I work with a group of five analysts. There's five of us. Two of them, I won't work on any of their code projects, because I'm pretty sure there is crack smoking going on while they're working on these projects. I don't understand their code, and it's supposed to be code that I can read. So I work with the others. But as long as there are as many analysts as you think you need, a good data analyst will be able to pull out important things. And so, this is kind of, ooh: a good analyst can look at data and see trends before they start pulling anything out. In the same way that they still use humans to look at child porn in order to recognize these scenes, because the human eye recognizes it better than any computer model, a good analyst will recognize a pattern and get to work on it. So you need someone who really, really likes data, and not someone who just wants a good paycheck.

And just a quick follow-up question: what do you guys usually use for data storage, in terms of how much data you have? And a follow-up to that, this is my last one: what time window do you look at?
We have a seven-year time window on data. So I work for a college, and so we have a seven-year time window. And for our particular storage, for the data that we work with, I think we have 2,000 terabytes. It's a lot of students: every single data piece related to every single student back to the start of time, so some data from 40 years ago that's been digitized and put up. But we do keep seven years. For doing this kind of analytics, I would recommend keeping your data for a year before you dump it, but you need to make sure that you have an adequate summary report of the data that was dumped, so people can use that model to continue working. You don't necessarily need seven years. We have some federal requirements.

I had a follow-up question related to how many analysts. Yeah. Is it a thing where you can kind of put in as much as you want and get out as much as you put in, or is there, like, a minimum amount of eyes that you need to have on something to get any meaningful use out of analysis?

If you have a good analyst, you can get away with one, but to be safe you probably want two. Huh? The bus problem. Yeah. So in my case, on the stuff that I work on, we do a lot of predictive modeling and stuff. I always keep something on the server with all of my latest code, because if something happens to me, someone needs to be able to come in and pick it up. Not all analysts work that way. As I said, part of our team kind of works fast and loose, and I'm sure they're coding in Klingon or something. When you try to run it, SAS is like, I don't know, peace out, we're not doing this. But really, a good analyst who's passionate about data is going to be able to give you a lot more than a team of 20 analysts who don't really care. Yeah, you want people who care about the data. And I know that sounds weird, but you have people like me that are like, ooh, I get to play with data. It's a little odd, but it's the same way people like playing with bugs or anything else, but
I'm super nervous. I'm sorry.