Okay, good afternoon. My name is Ryan Talabis and I'm a security consultant in Hawaii. I work for a company called Secure-DNA. I'm also a member of the Hawaii Honeynet Project, and today I will be talking about Dangerous Minds: The Art of Guerrilla Data Mining. Obviously, my talk will be about data mining and how it pertains to information security. I'm also a big military intelligence fan, so this is also going to be about data mining and how it pertains to information warfare and military intelligence. My hope for this talk is to give you a short glimpse of how you can use data mining in your own research. I really hope that you gain something from it so we can share. I'm always looking for new data sets to use and different techniques, and I'm really looking forward to having you learn and having us do more collaboration.

Before all that, I always start with a quote. I think this quote actually embodies the principles of this whole talk. It is said that if you know your enemies and know yourself, you will not be imperiled in a hundred battles. It's true, right? Very true. So enough about quotes. Let's go on with the talk.

Okay, a brief background. This research actually got formalized and started about two years ago. I presented at Black Hat this thing called security analytics. It is the concept of using data mining and artificial intelligence in security and information warfare. What I did then was present techniques and theories about data mining that we could use in information security. Back then, I presented a lot of algorithms, math, and stuff like that, and I think I ended up with a pretty boring talk. I've learned my lesson. For this talk, there will be no math or complex algorithms. For this talk, we move from theory to practical applications. 
Hopefully, I'll be able to provide you with scenarios and, more importantly, tools and examples to leverage these techniques. Let's start.

Information. Information is the key. It is the key to data mining. I am focusing on two areas: information security and information warfare, pretty much military intelligence. In the military, reconnaissance, information gathering, and espionage play an important part in battle tactics. Yes, that is Sparta. I just put that there because it highlights the importance of information, like how a small force can leverage it to gain an advantage over a larger force.

Information security. Obviously, the more information you have, the better your chance to protect your organization. If you have good threat intelligence, you have a better chance of drafting better policy, hiring better people, and using the correct tools. Those are the cornerstones.

Let us now talk about information warfare. Information warfare is the use and management of information in pursuit of a competitive advantage over an opponent. The key things here are use and management, because information is just ones and zeros if not used properly. Think of it: a book is just pieces of paper if not read and understood. Data analysis is what makes information actually meaningful.

This is not new. Everybody plays the game of information warfare. So, the usual suspects: CIA, FBI, NSA, IEO, and of course, let us not forget foreign governments. Don't think that it's only the big foreign governments like China, Russia, North Korea. I come from a developing country, and they have something. Everyone has something. Everyone plays the game. What I forgot to put there is crime syndicates. I actually saw this really nice talk yesterday about spam and pump-and-dump in the stock market. That's pretty cool. 
Aside from that, big corporations: think about corporate espionage, corporate intelligence. That's also a kind of information warfare. So everyone pretty much plays the game of information warfare.

Let's go to more specifics, like projects. We have Project ECHELON, probably the most popular; I guess you've heard about that. We have TALON, ADVISE, MATRIX, Able Danger. That was pretty cool. Anyone here familiar with Able Danger, like the 9/11 stuff? Pretty cool stuff. But I'm not a good resource to talk about these, and I won't be talking about them. All I wanted to point out is that these are large endeavors. Millions, billions of dollars.

Why are these large endeavors? Because there are distinct challenges in doing data mining. First and foremost, there's the sheer amount of data. There's just too much data. Resources, way too little. It could be lack of effort, lack of tools, lack of money, or a combination of everything. So in that diagram, as you see, that's where the challenge is. You have so much data, but what you need is to funnel it and process it so you'll get something meaningful. That's the goal. That's what we are looking at.

So this is where my research comes in. I call it the Veritas Project. Veritas is Latin for truth. This project is modeled after a threat and intelligence gathering premise, which is just a fancy way of saying that I based it off some Cold War spy network movie. It will make sense a little later. I primarily based it on a community sharing approach, using tools, technologies, and techniques that are freely available. My goal is that you guys can also use it, because information is a realm for everyone. So hopefully you'll be able to use the technologies, tools, and techniques that I'll be talking about a little later in my talk. 
Okay, now I've told you about the spy network, right? So here's the analogy. Think of it this way. You have agent one and agent two. Agent one gathers some piece of information from country one. Agent two picks up information in country two. Both pieces of information are seemingly unrelated, so you have two pieces of not really usable information. It goes to HQ and gets filed: still two separate pieces of information.

But here's the important thing: the analysis. This is what makes the framework work. For example, the analyst finds agent one's information, and the analyst finds information from agent two. They put it together and they find relationships. That is the key for this talk: finding relationships between large sources of data. When they put one and one together, you find relationships. So one mediocre piece of information becomes something more relevant, something more useful. And that's where the decision makers take action, based on what we find.

So stepping back, let's go now to the actual framework. We have data collection, we have data storage, we have data analysis, and we have decision making. Let's go through each one of those circles.

Now, the sources of data really depend on what your research will be. For example, if I had a honeypot and I wanted to do some research on hacker chatter, I'd use chat logs, right? If I wanted to research military intelligence, I'd use press releases and news releases. So the sky's the limit here. You can use anything that fits your research. And the idea is that the more information you can gather, the better your results will be. 
So that's the simple part. Data storage: information can be stored pretty much anywhere, relational databases, flat files. This is possibly the easiest part. As long as you have something to store the information, you should be good.

But here is the core of everything: analysis. This is the most important part of the framework, crunching large amounts of data. And the more important thing is that we find relationships within that data. That is where the data becomes more meaningful. I told you that I won't be talking about algorithms, but I have to mention a few. If you do your own data mining research, you'll probably encounter them: k-means, neural networks, support vector machines. We can talk about them, but maybe after the talk, if you would like. I'd rather not, but if you want.

Doing this manually is not easy. But there are a lot of tools out there that you can actually use. Some of the good data analysis tools that are free are these. I use the first one a lot, Text Garden. It's a group of Windows binaries from a university in Slovenia. It's very, very useful. It's hard to use, though, so it takes some getting used to. OntoGen is very useful too. It has a really nice GUI, and this is actually where I started off. I used that first, and that got me going with all this data analysis stuff. Later I'll give you an exercise so you can use it after this talk, and hopefully you'll get going with your own research.

Next are the more popular ones. You've probably heard of them: Weka and RapidMiner. I think that particular picture is actually from RapidMiner. And we have Tanagra; Orange, which is a Python implementation; and Mead, which is more for sentiment analysis. But I'll talk about that later. 
I don't really use it, but sentiment analysis is something that's up and coming. Obviously, with all this data analysis, it's always up to us to do the interpretation. I have some samples a little bit later.

So that ends the concept part of the talk. As I've said, I'd like to focus more on applications, so the rest of the talk will be on applications and what I did using these particular concepts. Oh, that was fast.

So let's talk about the scenarios now. I have a couple of scenarios to show you using data mining. First and foremost is trends research. That's really my flagship thing. I do a lot of trends research, like finding relationships between different topics over time. I have two versions of that: one for information security and another for information warfare, the military intelligence stuff. The next one is malware taxonomy. It's not really that exciting, but that's how I got started with all this data mining stuff, and I think it's the perfect place for you guys to start also. So I'll show you that. Then sentiment analysis, just because it's the up-and-coming thing today. Then, if we have time, something that's not security, not even information warfare: opinion polls. That's pretty interesting too.

So let's talk about trends research. What we're going to do is find increases in chatter and, more importantly, find secondary topics that also increased because of the primary topic. You'll see this more clearly later in the graphs. As I've said, we have the framework, right? Data collection, data storage, data analysis, and basically interpretation. Let's start with data collection first. 
So with this particular project, what I did was just collect information security items. These could be news articles, press releases, blogs, forums, whatever. It really depends on what you're going to research, and the more information you gather, the better. For example, here I just use a crawler. Sometimes I manually take them. It all depends on what your research will be. Data storage is just a simple relational database. I just put it all there. The core, the data analysis tool I used, is the one I told you about a while back, Text Garden. Since this was my very first implementation, it wasn't too fancy. I did the analysis per month. Later, version two will be daily. And obviously I have to put it there: decision making and interpretation.

Let me show you something here. Okay. This is actually an engine running off what I was telling you guys about. Unfortunately, it doesn't really fit too well on the screen, right? There. So for example, let's do a quick search. Oops, nope, not that. Okay, let me try one more time.

So as you can see, these are increases in chatter on a specific topic, China. That's really simple to do: find increases in chatter. Useful, but really simple. The important thing here is to find relationships, to find topics that are related to that increase in chatter. This is where this computer-assisted thing comes in. It uses the algorithms that are built in to find the secondary topics that were related to those increases. So let's click that one. Okay, pretty, huh? Pretty, but quite confusing. So that's it.
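A minimal sketch of the "increases in chatter" step, assuming the collected items are available as (date, text) pairs. The talk does not describe how Text Garden computes this internally, so the function names, the sample data, and the mean-plus-standard-deviation threshold below are illustrative assumptions, not the speaker's implementation.

```python
from collections import Counter
from statistics import mean, pstdev


def monthly_mentions(documents, topic):
    """Count, per month, the documents whose text mentions `topic`.

    `documents` is an iterable of (iso_date, text) pairs,
    for example ("2008-03-05", "some article text").
    """
    counts = Counter()
    needle = topic.lower()
    for iso_date, text in documents:
        month = iso_date[:7]      # "2008-03-05" -> "2008-03"
        counts[month] += 0        # keep quiet months in the series
        if needle in text.lower():
            counts[month] += 1
    return dict(sorted(counts.items()))


def find_spikes(series, k=1.0):
    """Return the months whose count is above mean + k * standard deviation."""
    values = list(series.values())
    threshold = mean(values) + k * pstdev(values)
    return [month for month, count in series.items() if count > threshold]


# Hypothetical sample data, not taken from the talk.
SAMPLE_DOCS = [
    ("2008-01-03", "Report on China-based intrusions"),
    ("2008-01-10", "Patch Tuesday summary"),
    ("2008-02-01", "New worm spreads over instant messaging"),
    ("2008-03-05", "China probe of lab networks"),
    ("2008-03-06", "More China chatter on forums"),
    ("2008-03-09", "CHINA linked phishing campaign"),
    ("2008-04-01", "Quiet month for vulnerabilities"),
]

series = monthly_mentions(SAMPLE_DOCS, "china")
spikes = find_spikes(series)
```

With this sample, `series` holds one count per month and `spikes` lists the months that stand out. Switching from monthly to daily buckets, as in the later version of the project, would only change how the date is truncated.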
Let me go back to the slides because it's clearer there. I just wanted to show you that so you'll know that I actually have something working and I'm not making all of this up.

So here we go. This was what you saw a while ago: increases in China chatter, increases in security chatter. And the question is, what other topics are correlated with these activities? You saw this also, right? These are the correlated topics. Some are obvious. Some are not. For example, one of the correlated activities here is nuclear activity. Based on that, you have your hypothesis now: hacker activity from China, nuclear activity. So where does that get us? You track that. This is tracking the two, nuclear activity and hacker activity from China. As you can see here, there's actually some sort of overlay. And as we all well know, where there is smoke, there's probably fire, right?

You might be thinking, yeah, I knew that. I knew there was hacker activity in China, and that it's related to nuclear activity, like nuclear labs and stuff like that. But think about it. It's not that easy for computers. With the human mind, it's pretty cool because we have intuition. We have hunches. Suddenly we connect one thing to another. It's not easy for computers, and that's what this particular thing does.

So let's do something that's a little bit closer to home. Let's try to track chatter about DEF CON. This is a pretty limited data set, mostly 2008 and early 2009. What you'll see here are increases in chatter. This one is around the time of the DEF CON call for papers, and this is the increase in chatter around the actual conference itself. Then here, around 2009, another increase in chatter: call for papers.
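The overlays in this part of the talk compare two chatter series by eye. One way to put a number on an overlay is the Pearson correlation between the two series. This is a hedged sketch: the talk does not say which measure the engine uses, and the topic names and counts below are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                # a flat series carries no trend to match
    return cov / sqrt(var_x * var_y)


def rank_related(primary, candidates):
    """Sort candidate topics by how closely they track the primary series."""
    scores = {name: pearson(primary, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical monthly counts, not taken from the talk.
china_chatter = [1, 2, 8, 3]
secondary_topics = {
    "nuclear": [0, 1, 6, 2],
    "weather": [5, 5, 5, 5],
    "sports": [4, 3, 1, 3],
}

ranked = rank_related(china_chatter, secondary_topics)
```

A high score only says that two series rise and fall together. Whether that means anything is still the interpretation step of the framework, which stays with the analyst.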
So obviously, if we put in something like paper submissions, you'll see that there's an overlay. There's a correlation between paper submissions and DEF CON. So: call for papers, increase in submissions, then the conference itself. This one is kind of weird because it actually appeared before the call for papers. I've yet to explain that. But this is call for papers, increase in submissions. We all know that, right? It's as expected.

But when you are doing data mining, there are some things that are kind of weird. Why did this happen? Why is this thing correlated? For example, let's do this: DEF CON and crime. Whoop. Okay. Let's go back there. So this was submissions, and this was crime. DEF CON, crime. DEF CON, crime. DEF CON, crime. Yeah, that's strange, right?

So, conclusion. You know this, right? 1984, Big Brother. So let's do some Big Brother logic. DEF CON is correlated with crime. Crime is bad. Therefore, DEF CON is bad. Epic fail. Really. Strange things you find.

So that was version one. After that DEF CON thing, I actually got inspired to make a version two. It's very similar. The only difference is that I used military chatter this time. I used the same thing, Text Garden, but I process it daily. So it's a little bit harder, but it's much more accurate now, though I focus more on military intelligence stuff.

So what has this got to do with the research? Okay, let's open that VM again. Military intelligence. I'm from Hawaii. In the past few months, there are two things that I've been thinking about: North Korea and missiles. So I began trying to track military chatter, or military press releases, about North Korea. Here are the spikes in chatter. Obviously you'll notice this one, right? 
Next, I began doing an overlay with missiles. It's quite obvious, right? You see the overlay here with North Korea and missiles. It actually began somewhere here and goes up to here. So let's go to the slides.

Here's the North Korea chatter. You can see the overlay with missiles. As you can see, there's a direct correlation. As expected, right? You have talk about missiles, North Korea, missiles, North Korea, missiles, North Korea, and boom, here you go. You can actually see the pattern here: it's going up, then there's something, then there's an action. So those were the things that I was thinking about back in Hawaii, North Korea and missiles.

After that, I felt a little bit safer, and I began thinking of other stuff to use this for. It's funny what drives your research, right? I began doing some more overlays. I don't know why I did this, but suddenly I began overlaying different hotbeds, for example Korea, then Iraq. I was trying to find correlations between different hotbed countries, so I began adding them one by one. As you can see here, you have the Korea spike and you have a spike in Iraq chatter also. Then I put in Iran. Okay, this is going to be slower now, but you can see there's also the overlay at this one particular point. Let's do it in the slides because it's going to be slow.

North Korea and Iraq, right? North Korea, Iraq, and Iran. Russia. You see that particular point, right? Everything's correlating there. There's like a nexus or something. So here it is again. As you can see in this particular graph, you see a slight increase here, then you find the nexus and boom, it goes up. This is around, when's the North Korea one? 
The missile. May? I think it was May. Yep, I believe it was May 2009. Was it July? July? I forgot.

So anyway, these things are what you call tipping point events: activities which result in worldwide shifts in stability. Have any of you guys read The Tipping Point? The book. It's pretty cool. The idea there is that there's one single event which changes things, and from that particular point, the tipping point event, it goes up. It happens everywhere, even in epidemics, fashion trends, and stuff like that.

Obviously, I have a theory here also. Basically, it's about human nature and fear. When everyone is afraid, and right now it's so easy for that to happen because there's the media and all of that stuff, there's an increase in worldwide tension, and with this increase in worldwide tension, actions are easily triggered. It's basically human psychology. Fear, tension, then suddenly action is easily triggered. For example, if you are angry at someone, you take out your anger on other people. Even if it's totally unrelated, somehow this triggers other stuff. A wise man actually mentioned this way before I was born. Fear leads to anger. Anger leads to hate. Hate leads to suffering. So that is my first theory.

The second theory: I like the military intelligence stuff, but some of my friends in Hawaii are big conspiracy theory fans. So the second theory is Skynet. There is something out there.

So anyway, that was version one and version two of the trends analysis. The malware taxonomy is not as fancy as that. It doesn't even do things historically, but I started off my data mining activities with this.
So it's basically grouping similar malware together, finding relationships between them. For data collection, these are notes from malware analysis. These are not signatures. These are actually text, natural language, like a description: this particular malware does blah, blah, blah, something like that. Data storage is just flat files. And what I used here is probably the simplest tool that you can use for your own research right from the get-go. I used OntoGen.

So let's start now. I actually cheated. I started processing at the beginning of my talk because it takes a while, especially since I'm using my netbook. It's pretty simple. You just grab OntoGen, ontology folder. Then you pick the actual folder where your data is located. That's what I did. I have some instructions in the slides for doing this.

So what happened there is that it processed it. I had about 2,000 malware descriptions. What I want is for it to tell me, for those 2,000 documents, what the groups of similar malware are. This is not automatic, what you call unsupervised learning. This is semi-supervised, because it suggests. As you can see here, it gave me suggestions of what those particular groups are. Let's just run through this. Check, check, check. So from those 2,000 individual text files, it gave me 10 groups. These are the common groups. You'll see we have macro viruses and boot sector viruses, and you can actually start breaking these down. For example, for the macro viruses, you can ask it to suggest again and break it down further, and you'll see Excel and Word. So it's pretty simple to use. Very, very easy. Just install it and you have something. So let me go back to the slides.
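For readers who want to see the idea behind grouping descriptions by similarity without the GUI, here is a minimal sketch. It is a stand-in, not OntoGen's algorithm: it weights words with TF-IDF, compares documents with cosine similarity, and puts each document into the first sufficiently similar group. The threshold and the sample descriptions are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Turn each document into a {term: weight} dictionary."""
    term_counts = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n = len(docs)
    return [
        {term: tf * (math.log(n / doc_freq[term]) + 1.0) for term, tf in counts.items()}
        for counts in term_counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(docs, threshold=0.3):
    """Greedy grouping: join the most similar existing group or start a new one."""
    groups = []       # lists of document indexes
    centroids = []    # summed term weights per group
    for index, vector in enumerate(tfidf_vectors(docs)):
        scores = [cosine(vector, centroid) for centroid in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            groups[best].append(index)
            for term, weight in vector.items():
                centroids[best][term] = centroids[best].get(term, 0.0) + weight
        else:
            groups.append([index])
            centroids.append(dict(vector))
    return groups


# Hypothetical descriptions, not taken from the speaker's data set.
DESCRIPTIONS = [
    "Macro virus that infects Word documents through a malicious macro",
    "Boot sector virus that overwrites the master boot record on disk",
    "Macro virus spreading through Excel spreadsheets with a malicious macro",
    "Virus that infects the boot sector and master boot record of floppy disk",
]

groups = group_similar(DESCRIPTIONS)
```

Here the two macro virus notes end up in one group and the two boot sector notes in another. A greedy pass like this depends on document order, which is one reason a tool that lets the analyst accept or reject suggested groups is more practical for a couple of thousand files.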
So the instructions are all here. It was pretty simple. These are my data sources. These are all text files, text files with malware descriptions. What I wasn't able to show you was this particular thing. This actually shows you which documents are included in your group. It grouped them together, sorted by similarity. Oops. You'll see here it grouped together all the macro viruses. It took minutes to process those 2,000 documents, and you can probably think of other very useful things to use this for. It also does a pretty neat visualization, but it's too slow for me to show you here. It clusters together all the similar instances. Very pretty.

So this is something that I hope you can try out after this talk, so you can start doing stuff and we can share information. For example, try this out at home: Nessus vulnerability profiling. You have thousands of results, and all you need to know is what types of vulnerabilities define the network. Think of it like a network DNA.

Let's go through the four-part framework. Data collection: just run Nessus on your network and grab the XML files. And what do you do with the XML file? You just parse it. And the important thing here is that for each of those findings, you put it in a text file, because text files are the only thing OntoGen can handle.
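The parsing step just described can be sketched as follows. This is an illustration under stated assumptions: it expects the Nessus v2 XML layout (`ReportHost` elements containing `ReportItem` elements with `pluginName` and `pluginFamily` attributes and `synopsis` and `description` children), which may differ from the export the speaker used. The sample XML, the function name, and the file naming are hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical, heavily shortened sample in the Nessus v2 layout.
SAMPLE_NESSUS_XML = """
<NessusClientData_v2>
  <Report name="demo">
    <ReportHost name="10.0.0.5">
      <ReportItem port="22" pluginName="OpenSSH outdated" pluginFamily="Misc.">
        <synopsis>The SSH server is out of date.</synopsis>
        <description>The remote OpenSSH version has known flaws.</description>
      </ReportItem>
      <ReportItem port="80" pluginName="Web server directory listing" pluginFamily="Web Servers">
        <description>Directory browsing is enabled.</description>
      </ReportItem>
    </ReportHost>
    <ReportHost name="10.0.0.9">
      <ReportItem port="22" pluginName="OpenSSH outdated" pluginFamily="Misc.">
        <description>The remote OpenSSH version has known flaws.</description>
      </ReportItem>
    </ReportHost>
  </Report>
</NessusClientData_v2>
"""


def export_findings(nessus_xml, out_dir):
    """Write one text file per finding and return the list of written paths.

    The host name is left out on purpose so that grouping is driven by the
    type of vulnerability, not by which machine it was found on.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    root = ET.fromstring(nessus_xml.strip())
    written = []
    for host in root.iter("ReportHost"):
        for item in host.iter("ReportItem"):
            parts = (
                item.get("pluginName", ""),
                item.get("pluginFamily", ""),
                item.findtext("synopsis", default=""),
                item.findtext("description", default=""),
            )
            text = "\n".join(part.strip() for part in parts if part.strip())
            path = out / f"finding_{len(written) + 1:05d}.txt"
            path.write_text(text, encoding="utf-8")
            written.append(path)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_findings(SAMPLE_NESSUS_XML, tmp):
            print(path.name)
```

For a real scan, the XML text could be read with `Path("scan.nessus").read_text()` (file name hypothetical) and the output folder handed to the clustering tool. The same finding on two hosts produces two files, which is what gives common vulnerability types more weight.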
So what you can do is probably parse it using Perl or something. Just grab the information there and spit it out as text files, one finding per text file. That's very important because it will give you the weights. For example, if there are more of a certain type of vulnerability, there will be more text files, and the weight for that type will be higher. Then data analysis: just install OntoGen, the one that I showed you a while ago, and process the text files with it. What you'll see is a clustered view of the common vulnerabilities in your network. So if you have tons of results and you just want to find out what the major types of vulnerabilities in your network are, hopefully you'll be able to see it through that exercise. Just contact me if you need help. This is probably the simplest data mining exercise that I can think of. It started me off doing this, and hopefully it can start you off also.

Okay, sentiment analysis. This is not really new, because two years ago I was already talking about it. The idea of sentiment analysis is putting weights on words, negative and positive, because certain words are negative and certain words are positive. For example, if you say suicide, that's negative, right? That's the really basic explanation of what sentiment analysis is. Let's think of a hypothetical situation. Let's say the government is monitoring your search terms, and they're able to tie them to households and individuals. I know it's impossible, but it's a hypothetical situation. For example, there are two households. One is talking about poison in food, divorce law, insurance before divorce, travel, Google Maps.
So is the high risk one the Pikachu one? Right. Think of it like that. If they were able to monitor everything, you'd say it's impossible to do that manually, with everyone looking at search terms and stuff. But hey, if you can do sentiment analysis, it'll just flag you. This is negative, this is negative, and that will be a person of interest, right? So that's what sentiment analysis is.

It sounds like science fiction, right? Not really. Just two months ago a patent was filed for sentiment classification. You're probably wondering who filed it. Google. It's a little bit different, though. Don't think that Google will start looking at your search terms and doing sentiment analysis there. This is more focused on documents, like how one document will have a different sentiment than another. But you see, the technology is out there. The concept is out there, and there are actually tools out there, like the one I told you about, Mead. It's a free tool that you can use. I haven't really used it, but you can try it out.

So, what is the public thinking about? This is a little bit different. This is not about information security. This is not about information warfare. It's about opinion polls. Everyone nowadays thinks that decisions are very highly dependent on opinion polls, which is kind of true. So do you remember the Obama town hall meeting? That was some months ago. I forgot when, but there were lots and lots of questions. What I was trying to do was go through all the questions to get a pulse of what people are concerned about. This is one of the samples here, which is healthcare. There are actually more. There was auto industry, economics and difference. But I'll show you healthcare. I actually have a site. 
It was this American Minds thing, and one of the very important things that I found in healthcare was this: cannabis, marijuana. So people like marijuana. What people like is good. So marijuana is good. So there are probably some benefits to having Big Brother, huh?

Anyway, this would not have been possible without the other people who are involved in this. It's not my own. A lot of the stuff that I've been using was built by other people, and that's why I'm sharing this with you. Hopefully you can use it also. So thank you very much.

I have time, right? In case there was more time, I prepared this. You know the chat logs, the hacker chatter thing. You can actually use that too. If you have a honeypot or something, you get IRC chat logs. I'm not too familiar with the privacy and legal side of monitoring IRC chat logs in your own honeypot. But one thing you can actually do is monitor chat logs, find persons of interest, and correlate different cells together. For example, you're monitoring different chat rooms and you try to work out, okay, who are these people? Who is related to whom? So you can do that too. These are sample chat logs, and you can cluster them by who is related to whom.

So those are some of the other things that you can do. Actually, the sky's the limit. As long as you have data, as long as you have information, you can pretty much do data mining, and the tools are out there. It's pretty easy to use something like OntoGen. I really hope you try out that exercise. Please feel free to email me if you have problems with it or if you want some assistance. In any case, I'll be accepting questions in room 104 if you have any. So thank you very much. Mahalo.