My name is Andrew Hay. Very, very high-level background about me: I learned graph theory so that I could understand what the hell my researchers were telling me, because they were, and they are, way smarter than me. This was at OpenDNS, where I was expanding and growing and managing the data science team and the analysts. So I would be talking to people, and some of you may know Dhia Mahjoub, but he's got a PhD in graph theory. So one, I didn't know that that was a thing. And then in talking to him I realized he is way smarter than I would ever care to be as it pertains to graph theory. So I gleaned as much information as I could, which was a big challenge because I have always been incredibly, incredibly shitty at math. So if you can't math good like me, then this is the talk for you. And if you are good at math, well, good for you. Your mother must be very, very proud. So basically what we're going to go over: there's going to be a very gentle introduction. There's not going to be a lot of math because, again, I suck at math. I'll talk about some of the graphs in everyday life, very, very high level. I'm focusing a lot on the tooling aspect, to help you build your own tools or get accustomed to tools and platforms that will let you apply graph theory as it relates to OSINT. Because it's very, very powerful and probably a little bit easier than you think. And cheap, very, very cheap. So, show of hands. How many believe that A is a graph? One, two, a couple people. Okay. How many believe that B is a graph? How many believe that C is a graph? Well, the good news and bad news is you're all right and wrong. If you look any of these charts up on something like Wikipedia, it actually says, oh, it's a type of graph that is used for charting this graphical reference model. Like, well, that doesn't really clear things up. Thanks.
So, just for the context of this talk and from a data science perspective, when we're talking about a graph, it really is a collection of vertices, nodes, dots. And I really do use those interchangeably, and a lot of people do. Like, you'll hear people say vertices, child node, whatever. Each of those entities represents something, whether it be a name, an IP address, a hash; it could be anything in the picture. And then there's the edges. Those are the relationships that tie those vertices together. There is some sort of relationship between the two. So now you don't have to get a PhD in graph theory. Or at least an undergrad in graph theory. And I just want to say this: I poke a lot of fun at graph theory and people that understand it way better than me, but that's because it impresses the hell out of me how much they know. So it's very tongue-in-cheek. So no one should get very upset. Or too upset. I'm a big guy. I can take it. So this is a typical graph that you will see. And you can start this by yourself on a piece of paper. I've done a lot of projects where I'm trying to map like information together just on a big sheet of paper with circles. And I probably look like a toddler trying to color the worst picture in the world. But so we have a vertex, one, that is connected to vertex three via nine. So nine is denoting that edge connecting the two. You just graduated to your master's program. So again, it's a very basic structure that you can apply to things. So a person, Bob, created or installed, let's say, an application on this server. So now you have a person, Bob, that you could map to additional product installations, additional IP addresses, additional data, to build this huge semantic network graph of data that will give you a very good idea of what is happening as it relates to Bob or the organization. So you can also add additional information to add more context.
That extra context is typically known as a label or an ID. So for the person, we could say that this entity, this vertex, is a person. And that person in this instance happens to be Bob, or Marco. Marco. And Marco is 29 years old. And the software that was installed was a Java program called Lop. These aren't my pictures, by the way. All the sources are at the bottom. So now that we've gone over that, how many believe that A is a graph? B? C? So if you talk to a data scientist, they will get extremely angry if you call a chart a graph. Likewise, they will get very angry if you call a plot a graph. But like I said before, they are all graphing mechanisms. And it looks like we're going to build a fort with that cardboard. Awesome. So again, for the purposes of this talk, a graph is a connection of nodes via edges to show you, or to aggregate, like information that you may need to reference at a later time. When you're talking about graph theory, you'll hear terms like network and graph used interchangeably. And because I came from a networking background when I started my career, I tend to focus on network, because a network graph makes sense to me visually if I'm trying to map packets going from hop to hop to hop, or from router to router to device. I spent most of my early formative days using tcpdump, so that's just where my mind goes. But you can use them interchangeably. If the edges are directional... Jeez. Is he all right? He's good? Okay. Good. All right. He can watch the video. So you'll have the direction. You can have directionality associated with a graph, but you don't need to have a direction. It is helpful when you're trying to map out a large amount of information, as anyone who's used Maltego will know. So this goes to this, not vice versa. When you have a directed graph with the arrow going one way, that's sometimes called a digraph.
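The labelled vertex-and-edge picture described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Vertex` and `Edge` classes and the `neighbors` helper are made-up names, the ids 1, 3, and 9 and the Marco and Lop values follow the slide as described in the talk, and the `created` relationship name is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: int
    label: str                      # what kind of entity this is, e.g. "person"
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    eid: int
    label: str                      # the relationship (assumed name: "created")
    src: int                        # id of the vertex the arrow leaves
    dst: int                        # id of the vertex the arrow points at


vertices = {
    1: Vertex(1, "person", {"name": "Marco", "age": 29}),
    3: Vertex(3, "software", {"name": "lop", "lang": "java"}),
}
edges = [Edge(9, "created", src=1, dst=3)]   # vertex 1 tied to vertex 3 via edge 9


def neighbors(vid, directed=True):
    """Ids one hop from vid; ignore arrow direction when directed is False."""
    out = [e.dst for e in edges if e.src == vid]
    if not directed:
        out += [e.src for e in edges if e.dst == vid]
    return out


print(neighbors(1))                  # follows the arrow from Marco to lop
print(neighbors(3))                  # nothing leaves lop when direction matters
print(neighbors(3, directed=False))  # undirected view: lop is tied back to Marco
```

The `directed` flag is the whole difference between a digraph and an undirected graph here: the stored edge is the same, only the way it is read changes.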
But a lot of people just call a digraph the directional graph. So if all the edges are bidirectional, which happens, you can have the arrows going both ways. We don't judge in graph theory. We don't judge. Or it can be unidirectional. Or you could have no directionality at all. It's entirely up to you how you want to represent the data for your purposes. So there's some variations as well. You may have seen this in some larger applications, where the edges may be thicker or a different color, or instead of a solid line it's a dotted line, and that denotes something else. It's very common. And again, it's up to you as the data scientist, OSINT person, shitty math person to make it look like you want it to look. And the same thing with the nodes. You can make them any color you want. That makes it very easy to spot when something seems out of place. Or you have a cluster of information that you may not have understood, but now when it's clustered like that, it kind of pops and you can draw additional conclusions. So, graphs in everyday life: we run into graphs all the time. And probably my favorite, and again, I'm a network guy: remember when those maps of the internet came out? It's like, that is awesome. So I'm Canadian, and it's like, oh, there's Canada, this tiny, tiny little network connection, then there's the United States, and there's like all of Europe, and Canada has like one little dot. But if you look at the graph on the left, you can see colors, and those colors are associated with specific countries. So let's say that red, and it probably is red, is associated with the United States. In the internet graph you can see that huge cluster of information as it pertains to the United States. So if you are a visual person like me, it makes it very easy to zero in on something. If you're colorblind, I apologize. I didn't do the colors. Another perfect example is Google Maps or any sort of mapping algorithm.
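The mapping example boils down to finding a path between two vertices. A minimal sketch, assuming a made-up road network and an unweighted breadth-first search: real mapping services use weighted algorithms and far richer data, so this only shows the graph idea, and `roads` and `shortest_route` are hypothetical names.

```python
from collections import deque

# Hypothetical intersections and the roads between them (undirected).
roads = {
    "A": ["C", "D"],
    "C": ["A", "B"],
    "D": ["A", "E"],
    "E": ["D", "B"],
    "B": ["C", "E"],
}


def shortest_route(graph, start, goal):
    """Breadth-first search: returns a path with the fewest hops, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


route = shortest_route(roads, "A", "B")
print(route)  # ['A', 'C', 'B']
```

Breadth-first search treats every road as the same length; swapping in edge weights (distance, traffic) would call for something like Dijkstra's algorithm instead.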
How do I get from A to B? Like, oh, well, you go through C. And say what you want about all of the different mapping algorithms and mapping applications, they have probably helped you at one time or another. They may have had you off by a block or so, but they get you in the right neighborhood. And you can also use graphs for perception and attitude analysis. Like, oh, what hashtags are trending now? Something about Britney Spears, undoubtedly. Which presidential candidate is being talked about the most? I think we know who that is. And then we can use this theory and apply OSINT to it. So, 10,000 foot view. Hang on, we're going to go really quickly through the tools. And all these slides are available, so you don't have to take pictures. Don't feel you have to take pictures of the slides. They probably won't come out that well, anyway. So, has anyone here used Google Fusion Tables? It's one of those pseudo-undocumented, perpetual-beta type of services that they have. But it's a fantastic way to build a network graph. Because you're just putting a whole bunch of data into a table, like you would a Google spreadsheet, and then saying, okay, this belongs to this. This belongs to this. And this is people. This is IP addresses. This is an address. This is an email address. And you're building all this, and then you hit go, and it's like, boop, hey, we've got a graph. That's kind of cool. How much did it cost you? Nothing. Well, your soul, a little piece of your soul, maybe, because it's Google. But it has a very robust, yeah. Is anyone here from Google? Does anyone here want to admit that they work at Google now? Yeah. Yeah. Are you the developer of Fusion Tables? Because I'm a big fan. So yeah, if you want to take a look at the docs: developers.google.com, Fusion Tables. Very cool, SQL-like interfaces to do queries and move around your data.
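The "put it in a table and say this belongs to this" idea can be sketched without any particular service. The snippet below is not the Fusion Tables API; it only assumes a small CSV with made-up column names and placeholder values, and turns each row into an edge.

```python
import csv
import io
from collections import defaultdict

# Hypothetical relationship table: one row per "this belongs to this".
table = """source,source_type,target,target_type
Bob,person,bob@example.com,email
Bob,person,192.0.2.10,ip
Mary,person,192.0.2.10,ip
"""

adjacency = defaultdict(set)   # node -> set of connected nodes (undirected)
node_type = {}                 # node -> kind of entity

for row in csv.DictReader(io.StringIO(table)):
    node_type[row["source"]] = row["source_type"]
    node_type[row["target"]] = row["target_type"]
    adjacency[row["source"]].add(row["target"])
    adjacency[row["target"]].add(row["source"])

# Who is tied to the shared IP address?
print(sorted(adjacency["192.0.2.10"]))  # ['Bob', 'Mary']
```

Keeping the entity type next to each node is what later lets a visualization color people, IP addresses, and email addresses differently.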
A lot of people bemoan SQL, but it still gets the job done quite frequently. Graphviz: I won't say it's one of the older ones, but it's been around for a while. It allows you to build somewhat complex network graphs, and it allows you to export them to all the various supported formats: PDF, SVG, PNG. But what makes it kind of cool is that you can embed hyperlinks. So if you want to embed hyperlinks and click on something and have it open up something else, you can, and you all know how hyperlinks work. Some of us more than others. Now, I found this one very interesting. So VIS, has anyone heard of VIS before? It was designed for journalists and authors and people that don't have a lot of technical acumen to start creating graphs of their sources and their information. So that when they're writing a story, instead of just a whole bunch of scribbled notes, they have that visual representation of how Bob was related to Mary and how they were both involved in, or associated with, a particular newspaper and then a story. So it makes it a lot easier for someone who's not incredibly technical to wrap their head around a semantic network graph. Now, Gephi. Gephi is one of the old workhorses. This is one of the greatest data science tools for visualizing data because it can create a ridiculous amount of nodes and edges. People call it Photoshop, but for graph data. So you can change pretty much everything and anything. If you think of it as a tool built by data scientists for data scientists, that kind of sets your expectation, or level of knowledge, as to how difficult it is to use. But it's kind of like, does anyone here use R? Yeah? Yeah, on a daily basis. I use R every day. It's the kind of thing that if you don't use it frequently, you lose a lot of that knowledge, just like any programming language. Gephi, I find, is very similar in that respect. Has anyone heard of OpenGraphiti? Hey.
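Going back to the Graphviz point about embedded hyperlinks: Graphviz reads a plain-text format called DOT, and a node's `URL` attribute becomes a clickable link in outputs such as SVG. The sketch below just writes that text from Python; `to_dot` is a hypothetical helper, and the names and the example.com link are placeholders.

```python
def to_dot(edges, links):
    """Build DOT text for a directed graph.

    edges: list of (source, target, label) tuples
    links: dict mapping a node name to the URL it should open
    Note: names are not escaped, so avoid double quotes in them.
    """
    lines = ["digraph sources {"]
    for node, url in sorted(links.items()):
        lines.append(f'  "{node}" [URL="{url}"];')
    for src, dst, label in edges:
        lines.append(f'  "{src}" -> "{dst}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


edges = [
    ("Bob", "Newspaper", "writes for"),
    ("Mary", "Newspaper", "quoted by"),
]
links = {"Newspaper": "https://example.com/newspaper"}

dot_text = to_dot(edges, links)
print(dot_text)
```

Saving the output as `graph.dot` and running `dot -Tsvg graph.dot -o graph.svg` should produce an SVG in which the Newspaper node is a link.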
So, Thibault Reuille at OpenDNS created OpenGraphiti, and it's a 3D immersive semantic network graph engine. And he built it. It was amazing. He hooked it up to the Leap Motion and the, what the hell are the goggles? Yeah, Oculus Rift. So it was funny because you could see people and they were like, oh, grabbing nodes and edges and moving them around and then falling over eventually. It was great. But if anything came from OpenGraphiti, it's that it justifies the purchase of an Oculus Rift for your SOC. So if you ever see them, just think, hey, thanks for letting us buy toys. But it has an API, a very accessible API to graph things on the fly and make those connections. So I used to be an industry analyst at 451 Research, and I was always curious which companies had received funding from which venture capital firms and what overlap there was in actual types of products. So I put five years of security mergers and acquisitions into my model, and I was able to see who was acquiring the most of a particular type of company and who could potentially be looking to acquire a company of that type. That's kind of interesting information for an analyst firm, or for bankers, to say, hey, you know what? This company doesn't have a product of this type but all of their competitors do. Who is available and for what price? So it's kind of fun. Actually, I think that data set's included in OpenGraphiti. And then there's Maltego. I love this tool. It's fun. Apparently I'm very bad at typos too. Or good at typos, depending. It's the Community Edition. If you've not used this, download the Community Edition. It is great to play with. You can have directed graphs. You can map your data quite easily. It's not that steep a learning curve. It looks a little bit daunting at first, especially if you try and use all of the visualizations and all of the connectors and run all the transforms.
That's just not a good idea, because that will just frustrate you and you won't want to use the tool again. They also have CaseFile, which is essentially an offline version. So if you're doing incident response or forensic analysis and you want to trade up from your handwritten notebook, it's not bad to use. Now, databases: has anyone used Neo4j before? Does anyone like Java? The hands all go down. So Neo4j is a graph database that has a wonderful front end that does all of the visualizations of your graph associations for you. Now, like any database, if you run really poorly structured queries, you're going to have a very bad day when it tries to visualize all of the results. Especially if you're not on a very powerful machine. So you have to be very careful. Just like running SQL queries against a large data set or a large database in production, you could have adverse effects trying to visualize things back. And even working with the data, like moving around and zooming in, you could eat up a lot of your memory. But it's still pretty cool. Especially if you don't want to do a lot of the work yourself of trying to manipulate the nodes and edges. It has a pretty cool query language, called Cypher, which is very similar to SQL. But yeah, and free. OrientDB: I haven't had a lot of time working with OrientDB. I played around with it a bit. I actually favor Titan. Every big company that is doing threat intelligence, whether it's Cisco, OpenDNS, anyone that's doing massive-scale graph representation, odds are they're using Titan. Because it sits as an abstraction layer on top of your data, usually Hadoop, and allows you to have a graph database abstraction layer without really impacting your core data that much. So very, very cool. I love the graphic too. Graphic's awesome. Good logoing. And then TinkerPop, which is, let's be honest, really fun to say. TinkerPop. It's an open source graph computing framework.
I'd say TinkerPop is relatively new compared to some of the other tools. But again, you can have it as an abstraction layer over different databases to get that graphing that you want. So, NetworkX. Any Python programmers in the room? Yeah, I really shouldn't have put my hand up, because if you looked at my Git repo, you'd be like, you're not a Python programmer at all. And you'd be right. So NetworkX is what most people use for creating semantic network graphs. graph-tool is also fairly cool. SNAP I haven't used, but because it's written in C++, it is pretty fast from what I've heard in talking to people. And it does scale immensely. So if you need to make huge network graphs, and I'm not talking about visualization, I'm talking about just making those associations, you might want to look at SNAP. Although NetworkX is still pretty awesome. Unless you're doing NASA-level data manipulation, you're probably going to be okay with NetworkX. SemanticNet: so Thibault, who created OpenGraphiti, created a very simple graphing network library. And it's just creating semantic graphs in JSON. That makes it very easy to use in your other tools. And because it's JSON, it's easy to read. I tend to like that. Plotly: if anyone's not used Plotly, it's kind of cool. You can get a lot of your graphs and visualizations without having to do a lot of work. So this is really a module that just lets you map everything, and then you push it off and Plotly will plot it for you in a cloud environment and give you visual dashboards. So if you're not a JavaScript whiz and don't like to do the viz stuff on the fly, take a look at Plotly. It'll make your life a lot easier. Now if you are a JavaScript kind of person, Viz.js is very, very popular for creating semantic network graphs. JS NetworkX, again, it's a JavaScript port of NetworkX.
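To show why a JSON representation of a graph is easy to pass between tools, here is a small sketch using only the standard library. The layout (a `nodes` list and an `edges` list with `src` and `dst` keys) is a generic assumption for illustration, not the actual schema of SemanticNet or any other library named in the talk.

```python
import json

# Hypothetical graph: one person connected to one IP address.
graph = {
    "nodes": [
        {"id": "n1", "type": "person", "label": "Bob"},
        {"id": "n2", "type": "ip", "label": "192.0.2.10"},
    ],
    "edges": [
        {"id": "e1", "src": "n1", "dst": "n2", "label": "logged in from"},
    ],
}

# Serialize to readable text, then load it back as another tool would.
text = json.dumps(graph, indent=2, sort_keys=True)
restored = json.loads(text)


def out_edges(g, node_id):
    """Edges that leave the given node."""
    return [e for e in g["edges"] if e["src"] == node_id]


print(text)
print([e["dst"] for e in out_edges(restored, "n1")])  # ['n2']
```

Because the file is plain text, it can be diffed, versioned, and read by a JavaScript front end just as easily as by a Python script.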
So if you've heard good things or you like the way NetworkX works, but you're a JavaScript person, then take a look at JS NetworkX. So let's make sure we're, yeah. All right. So when I was at OpenDNS, I always loved when someone would create a report where the first thing it would say is, we believe these Chinese hackers did X, Y, Z, or we believe these Russian hackers did X, Y, Z, based on these two IP addresses. And then they would expand the report. And if you're lucky, they would put all of their IOCs at the bottom of the email, or at the bottom of the blog post, or scattered throughout screenshots in the blog post. If they're not there, sometimes you had to kind of go through back channels and ask someone, hey, can I get the IOCs, and then start doing some research on them. What I've found is that when I have that starting point, generally I can find additional new information that may be pertinent to that investigation, that they had not considered or that just wasn't available at the time they did their research. So if you want to get your feet wet, start looking at other people's published IOCs and then see where you can pivot off and add additional nodes and edges to make associations. So this is from Minerva Labs. They did a really good blog post talking about a binary that was mimicking Navicat, info-stealing malware, POS or Antimodule. So I loaded the indicators into Maltego. Maltego CE, because it's free, and I'm cheap and poor. So really what I did was, I had available the hashes, domains, IP addresses, and email addresses, and I manually went and created those nodes. And as you can see, those nodes are not associated with anything yet. They're just nodes that have no edges whatsoever. So there's no association other than I know I put them in there.
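The starting state described here, typed nodes with no edges yet, can be sketched as follows. The indicators are placeholders rather than the ones from the report, `classify` is a hypothetical helper, and its rules are deliberately rough assumptions (hex strings of MD5, SHA-1, or SHA-256 length count as hashes; anything left over is treated as a domain).

```python
import ipaddress
import re

HASH_RE = re.compile(r"[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64}")


def classify(indicator):
    """Guess the entity type of a published indicator (rough heuristic)."""
    if "@" in indicator:
        return "email"
    try:
        ipaddress.ip_address(indicator)
        return "ip"
    except ValueError:
        pass
    if HASH_RE.fullmatch(indicator):
        return "hash"
    return "domain"


# Placeholder indicators, as if copied from the bottom of a blog post.
indicators = [
    "d41d8cd98f00b204e9800998ecf8427e",
    "example.com",
    "192.0.2.10",
    "someone@example.com",
]

nodes = {ioc: {"type": classify(ioc)} for ioc in indicators}
edges = []  # nothing is associated with anything yet

for ioc, attrs in nodes.items():
    print(attrs["type"], ioc)
```

At this point the graph is only an inventory; the associations come from enrichment.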
So what you can do, which is very cool with Maltego, is if you right-click on certain nodes you can enrich that data automatically. Now, if anyone has run all transforms after they've loaded a ton of transforms into the plugins folder, you're going to have a bad day, because it's going to look like someone threw a bunch of spaghetti on the wall in different colors and then you have to make sense of it after. If you go through an iterative process of running the transforms, or creating those associations between the nodes, it's going to be far more manageable. So what I used: I had access to VirusTotal, the VirusTotal public API, which anyone can get; ThreatCrowd, which anyone can get access to; and PassiveTotal, which was free, I just had to register. So those were sources of information that I was able to use. And when I started enriching the data, it drew the associations together for me. So this is the most zoomed-out view that I could get that showed all of the different nodes, and you can see the associations, or all of the data that's associated with each node. That makes it a little hard to read, but luckily they have a nice color-coded legend at the bottom that helps you make sense of things. And you can click in here, zoom in, and figure out what's going on. So when you zoom in, looking at a particular hash, you could see, because of the VirusTotal transform, how the various vendors were recognizing that particular variant, which may or may not be useful. But if you start seeing associations with additional hashes that weren't mentioned in that original report, then you can draw inferences that they may be related and research a little bit more in depth. You can also see very interesting intersections. So the domains are all associated with the same domain registrant. I got that from PassiveTotal automatically. They're all associated with JD1ZZL33 at gmail.com.
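The iterative approach, one transform at a time rather than everything at once, can be sketched like this. Nothing here calls VirusTotal, ThreatCrowd, PassiveTotal, or Maltego: `LOOKUPS` is a hypothetical local table standing in for whatever a real transform would return, and all values are placeholders.

```python
# Hypothetical results a transform might return, keyed by transform name.
LOOKUPS = {
    "registrant": {
        "example.com": ["owner@example.net"],
        "example.org": ["owner@example.net"],
    },
    "resolves_to": {
        "example.com": ["192.0.2.10"],
        "example.org": ["192.0.2.10"],
    },
}


def run_transform(name, nodes, edges):
    """Apply one named transform to every current node; return new edge count."""
    table = LOOKUPS[name]
    added = 0
    for node in list(nodes):          # copy, because nodes grows as we go
        for found in table.get(node, []):
            nodes.add(found)
            edge = (node, name, found)
            if edge not in edges:
                edges.add(edge)
                added += 1
    return added


nodes = {"example.com", "example.org"}
edges = set()

# Run one transform, look at what changed, then decide on the next one.
for transform in ["registrant", "resolves_to"]:
    added = run_transform(transform, nodes, edges)
    print(transform, "added", added, "edges")

# Intersections: anything that more than one node points at.
in_degree = {}
for _, _, dst in edges:
    in_degree[dst] = in_degree.get(dst, 0) + 1
shared = sorted(n for n, d in in_degree.items() if d > 1)
print(shared)  # ['192.0.2.10', 'owner@example.net']
```

The `shared` list is the programmatic version of spotting that several domains converge on one registrant or one IP address.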
So the registrant isn't hidden behind a privacy service, it's not a hidden email registrant, it's an actual Gmail account. And you can also see that for fraternitylaw.co.in, which is allegedly in India, you could see the changes in dates. And then you can also see associations like how the domains are associated with the same IP address. So all of the domains that are in that report were associated with that same IP address. If I wanted to, I could then say, given this information, what additional domains are associated with that IP address? That could be a huge list, if it's a virtual hoster. It could be a very small list, especially if this threat actor is only registering domains as they need them for specific campaigns. And then you can just enrich the data with all the other domains registered using that particular email address. So we could find streetfighterx.com. You could find a whole bunch of additional domains associated with that registrant that may or may not have been used in a previous campaign or might be used in a future campaign. So that gives you additional IOCs to track and monitor as time goes on. So when you start making all these associations, things get a little sloppy. One thing I recommend when you're working with many indicators and a lot of overlapping associations is to create a copy of that graph. One thing I'll say about Maltego is it's hard to back out to the previous revision, or several previous revisions, because the revision control is more like Microsoft Word than it is a traditional revision control system. And I'm not going to fault them on that. It's a great tool otherwise. But if you want to run test transforms and just see what kind of data can shake out by making assumptions, then just create a copy of that graph and play with that one. Keep your master copy untouched and untarnished.
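Pivoting on a registrant while keeping the master graph untouched can be sketched with a deep copy. `REGISTERED_BY` is a hypothetical stand-in for a reverse-registrant lookup, `pivot_on_registrant` is a made-up helper, and the domains and address are placeholders.

```python
import copy

master = {
    "nodes": {"example.com", "example.org", "owner@example.net"},
    "edges": {
        ("example.com", "registrant", "owner@example.net"),
        ("example.org", "registrant", "owner@example.net"),
    },
}

# Hypothetical lookup: every domain registered with a given email address.
REGISTERED_BY = {
    "owner@example.net": ["example.com", "example.org", "new-campaign.example"],
}


def pivot_on_registrant(graph, email):
    """Add all domains registered by email; return the ones that were new."""
    new_domains = []
    for domain in REGISTERED_BY.get(email, []):
        if domain not in graph["nodes"]:
            new_domains.append(domain)
        graph["nodes"].add(domain)
        graph["edges"].add((domain, "registrant", email))
    return new_domains


# Experiment on a scratch copy so the master is easy to fall back to.
scratch = copy.deepcopy(master)
found = pivot_on_registrant(scratch, "owner@example.net")

print(found)                   # ['new-campaign.example']
print(len(master["nodes"]))    # 3, the master is unchanged
print(len(scratch["nodes"]))   # 4
```

If the pivot turns out to be useful, the same call can then be run against `master`; if it produces noise, the scratch copy is simply thrown away.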
And if you find that an association adds value in your copy, then go back to your master and run the same association. It'll make trying to back out a lot easier. So you don't need a degree in mathematics to understand graph theory. If you have one, granted, you will understand far more than me or the average person. But if you know the basics, it becomes so easy to use open source intelligence information to draw associations and make inferences based on that data, to help with your investigations, to help with your recon, to help with anything you want to do with that data. And like I said, you can use a tool, or if you're not a developer, or not even, I'll say, a punchy scripter like me, use a piece of paper. All of the graph theory I learned from Dhia was him sitting with a notebook drawing things for me. We didn't even touch a computer. He was like, okay, this is what a directed graph is. This is how you make these associations. And it was all in just a paper notebook. So you don't need fancy tools to do this association. Granted, a tool makes it much easier to enrich the data, instead of writing out a hash on paper and then saying, oh yeah, it's all these different associations. But you don't really need one. So the connection of any related information will help you visually represent the data. Some people are not visual people. I'm a visual learner. The application of graph theory to incident response, and just investigations in general, has greatly expedited my research when I'm conducting it. Because I see the associations, I see groupings of like data, and I can focus, or completely take that information and just move it off to the side and not have to worry about it, because it's not pertinent to my investigation at that point in time. Like I showed you, there's a growing number of tools just in preparing this talk and giving this talk.
There's probably been like five new really cool tools that popped up that apply to graph theory, or help you apply graph theory as it relates to malware analysis or open source intelligence in general. But don't feel as though you have to pick one particular tool because it's what everyone uses. So if you don't like NetworkX, if you're not a Python person, you're a JavaScript person, there are other tools you can use. If you don't want to have to use a graph database, then don't use a graph database. Work with flat datasets. Some of the best data scientists I've ever met hate databases. They would rather use just big flat files and just aggregate. They would rather use bash to analyze their data, and then create really intricate and ridiculously long bash scripts to start mapping all of their data to one another. So pick what works for you, whatever is going to help you get the job done. One piece of advice: if you're doing any sort of development and you're saying, you know, I'm spending an awful lot of time on this, maybe I should just do this by hand, the answer is almost always no, you should not. Because you've already started down a particular track, and if you write a tool or write a script to help you do this, it's repeatable, and you can do it again in the future and you can just modify that code. Or better yet, if you're a shitty programmer like me, put it up on Git and say, hey intern, part of your job now is to fix this dumpster fire I have going on over here in my Git repo. And yeah, it works out really well. I've had people compliment me on my code, my stylings. Oh, thanks. That was a great intern that I hired to fix that. So with that, I think we're kind of back on track. I tried to go as fast as I could, and it may have been too fast, but if anyone wants to contact me, please feel free to reach out. Email, you can try phoning, I probably won't pick up, but Twitter or email, I answer those. Does anyone have any questions?
Because we have time, hey. No questions. Oh, yes, hi. Yeah, so just to reiterate the question: what's my confidence in some of the transforms for the publicly available information? I use them all more as indicators or echoes that I need to follow up on. I would never say they're the definitive truth. And I would say that about all threat intelligence vendors and threat intelligence providers, because the data is only as good as what is coming in. So I take every association with a grain of salt. And to your point, with some of these crowdsourced group threat intelligence tools for IPs, it's like, oh great, your honeypot found an Nmap scan. That means that China is attacking today, so that's everyone's top priority. Thanks, honeypot. But that crowdsourced intelligence can be easily gamed, and that's a whole other talk, tainting the well of information. Yeah, well, yeah, VirusTotal is a great example. If I was a malware author, I'd be uploading just garbage all the time just to hide my tracks. But you know I'm not, because I've already said I'm a shitty programmer, so it's not me. And I'm bad at math, so you know I'd get bored really quick. Any other questions? No? You can beat the lines. Run!