Thank you! Okay, so just a little more on who I am and why I'm here. John Marriott: IT dork, security dork. I like data and long walks on the beach, and I have lots and lots of experience in security. For almost my entire career I've also been studying in higher education. My very first job was at a university, straight out of high school, and for the entire decade I was there I took courses for free; it was one of the perks of being a permanent employee. Then I left the university, moved on to some other roles, and quickly found I actually missed putting in a full day of work and then coming home to do homework. I'm a bit odd that way. This is a photo of me as a child; I'm the one making the stupid face and wearing a tie, which I still do fairly frequently. A mildly interesting fact: if you're into EDM, electronic dance music, the kid directly in front of me is deadmau5, who went on to become quite famous. That's Joel Zimmerman back in fifth grade. I work at the Independent Electricity System Operator, a public organization in Ontario. It's part of our code of conduct that whenever we speak publicly, we have to make it explicitly clear that we are not representing the IESO or its opinions. So any expressions of opinions, thoughts, feelings, insecurities, inappropriate stories, questionable fashion tastes, hugs that are too long, poor judgment, good judgment, moist handshakes, and not following PEP 8 conventions while writing Python are entirely my own and not representative of my employer, my boss, my colleagues, my neighbors, my family, my wife, or my dog. So data science is definitely a bit of a buzz term. You can find it in the news fairly frequently now, and if you're working in a corporate security setting, you no doubt have vendors coming to you every day telling you about all the great data scientists they have working for them and all the machine learning and AI that they have. But it's a really poorly defined term, even though if you go back through the literature you can find it used as far back as the 60s, sometimes as the original term for computer science. So just to level set, I figured we'd give a brief definition. Data science is "the sexiest job of the 21st century," and that came out of the Harvard Business Review. This statement is 100% correct; it's not actually the job that's sexy, it's the people. So all you have to do is start saying "data scientist" and you will be 10% better looking. Another one: data science is just what they call statistics in California. That's a pretty fair criticism, because a lot of what data scientists do is really the same thing statisticians have been doing for many, many years. What I like, and I think the best definition, is the one on the far end: a data scientist is somebody who is a better statistician than a computer science major and a better programmer than a statistician. The reality is, when you're doing any kind of data science (and we'll talk about what we mean when we say that), you'll spend most of your time getting data, cleaning data, asking questions, and graphing things out. You'll spend very little time actually building models and running them. When I show some code later, the actual model building, the actual machine learning, the AI, all of that is like two lines of code.
It really is, because the libraries are so robust and powerful now that it can be very, very easy for anybody to do. I thought this was a fun infographic. It's about the "modern data scientist" and how they have to be good at math and statistics, programming, and databases, and need domain knowledge, soft skills, communication, and the ability to visualize and tell stories. I can assure you I have never met any one person who was good at two of those things, let alone all of them. I'd say if you ever meet a person like this, you should kidnap them and hold them for ransom; they are a very valuable person and you should never let them go. This is a Venn diagram from a guy named Drew Conway. I like this one because it shows that data science is very much interdisciplinary, and it also shows hacking skills and the fact that there is a danger zone. Python is a great language. I love Python. It's very easy to pick up, very easy to learn, and because of how simple the machine learning and AI libraries have become, it's very easy to throw something together, feed it some data, and get answers that sound nice but don't really make sense unless you have a little bit of a math background. Everything I'll be talking about here, I tried to keep at a high school level. If you don't have an undergraduate degree in math, you're going to be okay. There are plenty of techniques you learned in high school or elementary school that you can use just to make sure you're on the right path, that your analysis makes sense, and that you're not claiming anything too crazy. Why should you care about data science? Look at the breach reports; these numbers came out of FireEye: 146 days is the global average before a breach is detected. A lot of times companies are popped and they have no idea. I've definitely been involved in incident response where, when you look at the system, you can see this happened a long time ago. There was indication that it was compromised, but nobody looked at the data, or didn't look at the right data, or didn't do the right analysis to realize what was going on, or it just got buried in alerts. A SIEM is really good at correlation. Where I work, our SIEM processes around 25,000 events per second, which creates hundreds of millions of events per day. The correlation is great, but that deeper analysis, trying to find underlying patterns and really get some insight into what's going on, just isn't there. A SIEM just isn't the right tool for that kind of analysis. UEBA, if you've been to any of the other, more vendor-centric conferences, is another huge buzzword in the industry. Vendors are of course pitching that it's going to solve all of your problems, just like every other solution they've ever had. It probably won't, but it is based on machine learning for the most part. Another reason to care: the ability to take data and understand it, process it, extract value from it, and tell a story out of it is a valuable skill regardless of what industry or sector you work in, whether that's information security or somewhere in finance. If you can make sense of data and tell a story about it, that makes you a very valuable person in any organization. So, some warnings about data science. Bad analysis of bad data is bad; that's pretty straightforward. Bad analysis of good data.
Even if you have the right data, if you do the wrong kind of analysis, or you think about it the wrong way, you're not going to get very good results. That's fairly obvious. Good analysis of bad data, and we'll give an example I think everybody has seen before, is also bad. And even if you do everything right, you have the data and you do good analysis, it doesn't necessarily mean that all of a sudden everything is sunshine and rainbows and happiness. It's just the nature of machine learning and AI and data science that sometimes, when everything goes right, the result is still not all that useful for you. Next, XKCD. This one actually came out fairly recently, so I was able to toss it in at the last minute. The idea is this person coming in and saying, oh, I'm going to solve all your problems because I'm using algorithms, and you guys have never thought of doing that before. Then after six months, they realize, oh, actually these are hard problems, there aren't easy solutions, and the belief that algorithms are going to solve everything is a little far-fetched. So in 2015, Verizon did their annual Data Breach Investigations Report, and the top exploited vulnerabilities that came out of it, according to them, were around Windows XP and a cross-site scripting bug in a PHP application from around 2002. Most people in the security world looked at that and said: this doesn't make sense; I refuse to believe that this was actually the number one exploited thing in 2015; it seems very, very unlikely. Without access to their data and how they collected it, it's a little hard to make that sort of assessment, but this definitely looks like a case where they had bad data. Even if they did all their analysis correctly, without good data, an actual source of truth about what's really going on, you're going to come to conclusions that are maybe a little ridiculous. In academia: this is a paper that came out in 2006 that was trying to do classification of malware. The technique they took was: they had a number of known malware samples, they took a bunch of Microsoft executables and binaries, and they did n-gram analysis. You decompose the binary into bytes and then you basically just count how many occurrences of the different byte sequences happen within the files. Their analysis showed that, yeah, you can do this, and you can build pretty good models at predicting which things are malware versus which things are benign. It was good science in the sense that you can follow the methodology and it's very easy to reproduce. I think it was around an 89% success rate at determining, just based on byte frequency, which things were good and which things were bad, and that's a reasonable success rate for a model. But some other researchers went and recreated it, and they said, okay, we can recreate this, this is good, it works. But when they tried it on non-Microsoft binaries, they found it had a success rate of less than 30%. It was worse than flipping a coin. With some deeper analysis you would find that, yeah, in every Microsoft binary the string "Microsoft" is embedded, and so inadvertently the classifier basically just said: if it says Microsoft, it's good, and if it doesn't, it's bad.
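To make that n-gram technique concrete, here's a minimal sketch of byte n-gram counting, the kind of feature extraction that paper describes. The file name and the n-gram size are illustrative assumptions, not the paper's actual setup:

```python
from collections import Counter

def byte_ngrams(path, n=2):
    """Count byte n-grams by sliding an n-byte window across a binary file."""
    data = open(path, "rb").read()
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Hypothetical usage: inspect the most common byte pairs in a sample
# counts = byte_ngrams("sample.exe")
# print(counts.most_common(10))
```

A classifier trained on counts like these will happily latch onto any byte sequence that separates the classes, including the bytes spelling "Microsoft," which is exactly the trap described above.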
So you can see how, even if you're doing good analysis and you're trying to do the right things, cross-validation, testing, making your work reproducible, it's very easy for human error, maybe not having that full domain knowledge, to undercut the value of the work. And this is where the libraries make it really easy: you can just feed them all kinds of data, and if you don't really understand what your data is or how some of the underlying algorithms work, you can play with things and get numbers that look reasonable and sensible but are ultimately meaningless. I do love XKCD. Let's talk about what we're going to do here. Figuring out what question you want to answer is usually, honestly, the hardest part of any kind of data analysis; the cleanup and the analysis and all that is actually really easy once you have good questions. Then it's where to get the data and how to process it; that's really where you're going to spend a lot of your time. For the example we're going to work through: there's a researcher who produces lists every day of the domains from the domain generation algorithms of a number of malware families. He publishes these nice long lists with lots of extra data around them. We're going to use that and label it as bad domains, and then for the good domains we'll use what used to be the Alexa top million. That no longer exists as a free public list, but another company picked the idea up, so it's the Majestic Million now. So we're going to look at around 400,000 algorithm-generated domains, compare those to regular domains that are just popular and that people actually visit, and see if we can classify them. Exploring the data: any time you get a new pile o' data, you're going to want to spend some time just looking at it, exploring it, understanding what's there, what's useful, and what's not useful. Python makes it very easy, and we'll look at some of the code in a little bit, to read in massive files and CSV files, parse them, and split them up; it does that very, very well. But you really do need to look at the data and figure out what makes sense and what the interesting bits are. Vectorize: that's where we take the data and turn it into numbers, because you can't really do math on strings; it just doesn't make much sense in a machine learning context. You typically need some way to convert a string like a domain name into a bunch of numbers so you can actually do analysis on it. And if you have categorical data, something like a port, so port 80, port 443, those are numbers and you can do math on them, but if you add two ports together and divide them, have you actually found the mean of anything? You could do that math, and the math "works," but the result is meaningless and will just give you weird answers, so that's the kind of thing you need to avoid, or at least be aware of, when vectorizing.
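As a quick sketch of that categorical-data pitfall: you can average port numbers, but the answer means nothing, so categorical values like ports are usually one-hot encoded into separate 0/1 columns instead. The column names here are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"port": [80, 443, 22, 443, 80]})

# The math "works" but the answer is meaningless: 213.6 is not a real port
print(df["port"].mean())

# One-hot encoding gives each distinct port its own 0/1 feature instead
print(pd.get_dummies(df["port"], prefix="port"))
```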
Then we're going to train a model, and you can do that in one line of code; the entire process of building and training a model on all this data could be done in, I think, around six lines of code if you really wanted to and skipped a couple of things. Once we have that model trained, we're going to spend a little time looking at how you can optimize it: how you can use a little bit of computer science and let the computers figure out not only which model to use, but also how to configure it, because every model out there has dozens of parameters you can tweak, and you can just have the machine run through all of them. For the tools: Jupyter Notebook basically creates a web page where you can write code. It works as your IDE, but it also saves all the output into the same web page you're writing your code in, and while from a security perspective that seems like crazy sauce, it works really well because it gives you something you can easily share with others. Whenever you're trying to do data science, or any real kind of data analysis, you'll likely need to share it with other people to get feedback and bounce ideas around, and Jupyter gives you a very easy way to write code, run the code, share your results, and experiment. So Jupyter is definitely the way to go. You can use different languages with it; it's super powerful. Python: there's more than Python you could use, but it just makes everything easy. It's generally very easy to read, which I think is actually its greatest strength; most people can pick up Python pretty quickly, and that's why I would definitely recommend it. It's not the only way you can go. Java has some great libraries, but Python can integrate with everything you're likely to need. And scikit-learn, a machine learning library for Python: it has, without a doubt, the best documentation of any open source project I've ever seen, especially of any library. They put together all kinds of graphs, neat little flow charts like this one, and working code for hundreds of different models and how to tweak them. I assume it's all written by graduate students who should be working on their theses but are putting it off; having worked with a number of them, I know how common that is, and it's the only reason I can think of that this exists and is as well documented as it is. But I'm definitely glad for it. And Docker: installing all the libraries and getting them working together can be a huge pain in the butt, whereas with Docker your life is definitely easier. It was literally two lines. I got a new Mac a couple of weeks ago: you download the Docker app, install it, run one line in the command line, and you are up and running. It takes care of everything. All the analysis I did was done inside a container in Docker with around a two-gig memory limit, so you really don't need much power to process things, and we'll see shortly how big some of the data I was working with is. So yeah, let's actually take a moment.
Usually the first step is to look at the data in something other than Python; in this case I'm using Vim. The only thing you really need to take away from these two files is that the first one has a long disclaimer at the front, and when we read it in Python we actually have to tell it to skip a bunch of lines. Otherwise pandas, which we use for some of the data analysis, just handles everything beautifully for us. Ultimately, this is about 1.4 million lines across these text files. Pandas is one of the easiest ways to manipulate data I've seen in any programming language. You can see how, with a single line, we read in a CSV file (we can even read in zip files; it's just super easy to use), and it puts the data into something that looks a bit like an Excel spreadsheet, called a DataFrame. It's basically a matrix with an index: you can easily search it, query it, and cut pieces off of it. It's just very, very nice and easy to use. The first thing we're going to do, because we did a little analysis before putting the presentation together just to make me look smarter, is drop all of the columns out of what we're labeling as bad data except the domain; that's what we're going to focus on. We're going to add a label to it, because we're going to be doing supervised learning, and in supervised learning you actually tell the computer: here's the question, and here's the answer. You then hope that, with that information, it can eventually take a question it's never seen before and generate an answer. So we added the label, and we also added a length column. It wasn't in the original data, but it lets us do some exploration, look at the data a little more, and see whether there really are any differences between domain names that come from a domain generation algorithm and regular domains. With pandas you can do descriptive statistics with, again, just one little line: point it at a column of the DataFrame, call describe, and it bounces this out for you. Already we can start to see that, yeah, the bad DNS names, the ones that ultimately come from an algorithm, tend to be longer and a lot more varied, but they're actually not the longest; the good list actually has some longer domains. So length alone isn't a super good predictor, but it definitely suggests that if the only thing you knew was the length of a domain, you could probably make a reasonably educated guess. And we'll graph that out, because it's usually just easier to spot these things when you visualize them. Here we're just doing some bins. The y-axis is the proportion, so what percentage of the domains are of each length, and you can see that, just like the descriptive stats showed us, the good ones tend to be a little shorter with a little wider spread, and the bad ones have that kind of U shape. Again, it's not a nice even distribution, and that's because there are a couple of different algorithms doing the generation, so each one has its own quirks. But you start to get that visual sense that, yeah, we're going to build a classifier to figure out good from bad, and that's probably going to work out well for us.
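Here's a minimal sketch of that loading and labeling step in pandas. The file names, the number of disclaimer lines to skip, and the column layouts are assumptions for illustration, not the exact files from the talk:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Skip the disclaimer at the top of the DGA list (the line count is a guess)
bad = pd.read_csv("dga_domains.txt", skiprows=14,
                  names=["domain", "family", "date"])  # assumed layout
bad = bad[["domain"]]        # drop every column except the domain
bad["label"] = "bad"

# pandas reads zipped CSVs directly; "Domain" is an assumed column name
good = pd.read_csv("majestic_million.csv.zip", usecols=["Domain"])
good = good.rename(columns={"Domain": "domain"})
good["label"] = "good"

df = pd.concat([bad, good], ignore_index=True)
df["length"] = df["domain"].str.len()    # the extra exploration column

# One line of descriptive statistics per class
print(df.groupby("label")["length"].describe())

# Binned length distributions, with proportions on the y-axis
for label, grp in df.groupby("label"):
    plt.hist(grp["length"], bins=30, density=True, alpha=0.5, label=label)
plt.xlabel("domain length")
plt.legend()
plt.show()
```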
Within the machine learning space you'll hear everyone talk about features and vectors. Within pandas, a feature is really just a column, and that's a very easy way to think of it: if you have a massive Excel spreadsheet, each column is a feature, each row is a vector of those features, and those values need to be numbers so you can do some learning on them. One of the open challenges in the field is the curse of dimensionality. As you add more features, you're adding more dimensions. If you thought of height and width, that's two dimensions; you could represent those as x and y, and if you wanted to add a third dimension, that becomes z, so you have an x, a y, and a z column. When we get a little further on, we're going to have over 1,400 columns, which becomes a little hard to describe in dimensional terms, but all the math you would use from trigonometry for dealing with points on a graph still works. You can use it to measure how far apart things are, or the angle between points, to try to understand how similar they are. All that high school math is actually still useful, even though I personally never would have guessed that, or admitted it, in high school. Because we don't know in advance what will work, we're going to try a couple of different things. The first is unigrams: we take a domain name and count how many a's are in it, how many b's are in it, and so on. "a" is a unigram and "b" is a unigram; they're called unigrams because they're one character long. We'll also try bigrams: "aa" is a bigram, "ab" is a bigram, and you can imagine how that goes; they're called bigrams because there are two characters. It's quite easy to do in Python, again, like two lines, and I'll get you there. When we do that, we end up with a really big matrix: 1.4 million rows and 38 columns across, and it's going to get bigger still. Fortunately, Python is very good with a data type called sparse matrices. Within this matrix, most of the values are zero, and we don't actually have to store those; the way the math works, we can ignore them for what we're trying to do. That makes it so you can do all your analysis inside a VM running on a laptop, and that's true in a lot of cases. So we take our unigrams, which are the features we're looking at, and again we graph them out, just to look at the data and try to understand: did we do it right? Did we make a mistake? Are there really differences we can see? Does this make sense? And what you see is, yeah, bad domains look very different from good domains when you start mapping out the features. The x-axis there is the actual feature name, so it starts with a dash and then a dot, then all the numbers zero to nine, and all the letters of the alphabet. You can see that bad domains tend to be a little more evenly distributed, while in good domains you don't really see a lot of j's; it's just not a popular letter in domain names, whereas it's actually fairly common in bad domains. So you can start to imagine how the math might work, how you could use those specific features to predict: is this good, or is this bad?
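Vectorizing the domains into those 38 unigram columns really is about two lines with scikit-learn's CountVectorizer. A minimal sketch, assuming the df DataFrame from the earlier snippet:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character unigrams: one column per character seen ('-', '.', '0'-'9', 'a'-'z')
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 1))
X = vectorizer.fit_transform(df["domain"])   # a scipy sparse matrix

print(X.shape)                             # roughly (1.4 million, 38)
print(vectorizer.get_feature_names_out())  # the names on the graph's x-axis
```

Because fit_transform returns a sparse matrix, the mostly-zero counts cost almost nothing to store, which is what makes laptop-scale analysis possible.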
What we're actually going to do, though, because I did some analysis beforehand, is use bigrams: unigrams get you an OK model, but you can do much better with bigrams, and there's a listing of the bigrams in the middle there, starting with dash-dash. If you want the math on it, 38 squared is 1,444 possible bigram features, and we end up with slightly fewer, around 1,438, because some bigrams never actually occur in the data; I didn't even bother looking up which ones are missing. When we get this massive matrix of 1.4 million rows by those 1,400-odd columns, only about 1% of it actually contains any data; the rest is all empty, which is good, because otherwise we couldn't store it in memory. If you need to store things on disk and go even bigger, there are feature-hashing techniques you can use. For most of the work that happens here, you don't even need separate libraries if you want to parallelize it or start farming it out. All of the feature extraction is stateless, so it's very easy to use something like Pool in Python (sketched below) to spread that work out locally, or in Jupyter you can actually drive a whole cluster, and it's really easy to write the code to do that: maybe four or five extra lines to farm most of the work out if you actually need to go bigger. A lot of times you won't need to.
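Here's a minimal sketch of that Pool idea: the per-domain extraction is stateless, so it can be mapped across processes in any order. The hand-rolled bigram counter is a stand-in for illustration, not CountVectorizer itself:

```python
from collections import Counter
from multiprocessing import Pool

def bigram_counts(domain):
    # Stateless per-domain feature extraction: safe to farm out anywhere
    return Counter(domain[i:i + 2] for i in range(len(domain) - 1))

if __name__ == "__main__":
    domains = ["example.com", "qx7vkzpw.net", "google.com"]  # toy input
    with Pool(processes=4) as pool:
        features = pool.map(bigram_counts, domains)
    print(features[0])
```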
This is the same graph we saw before, but this time looking at the bigrams, and you can see it starts to get really hard to read because there's a lot more data: much, much wider, much, much taller. This kind of thing happens all the time when you're doing analysis, so what we'll do, just to make it a little easier to read and a little more sensible, is scale the numbers: we take a logarithm of the y-axis so that all the numbers actually fit on the page. You probably worked with logs back in high school; this is an okay thing to do. On a log scale, each step up the y-axis represents a multiplication rather than a fixed increase, so it lets us show how big the data really is. And again, we get a very similar pattern to what we saw with the unigrams: the bad domains have a lot of symmetry going on, and the good domains are kind of all over the place. We're humans; we tend to like the same letters over and over again, we're not very good at using the whole alphabet, and that shows up here. A very easy and common mistake a lot of people make when they first start trying to do analysis and train models is overfitting. They take all the data they have, run their model on all of it, and get numbers that say, wow, my model can predict everything 100% of the time. That's because it's predicting things it has already seen, and a model that can only predict what it has already seen isn't very good. If I could tell you what happened yesterday, that's not very impressive; if I could tell you what's happening tomorrow, that is impressive. Being able to predict things you haven't seen before is good; predicting things you've already seen is not impressive at all. Here are a couple of graphs showing what it means for something to be underfit versus overfit. The y-axis is the price of a home and the x-axis is the size of the home. The very first one uses too simple a model and underfits, so you get something that looks like a straight line: houses get bigger, the price goes up, very, very simple. If you get things just right, the fit is actually more curved: there kind of is an upper limit to how much a big house is worth, and a house that's two million square feet isn't actually worth twice as much as a house that's one million square feet. The third one is the worst case, overfitting: you get this thing that wiggles all over the place, because it's trying to match the data you gave the model exactly, and that's bad because it won't be very good at predicting things it has never seen before. Fortunately, there are lots of libraries to help us with that. The first rule is: don't train on all of your data. The more data you have, the more you can send to the trainer, but if you don't have a lot of experience with it, 80/20 is a pretty good rule: 80 percent of the data for training, 20 percent for validation and testing, for figuring out whether your model works and how well it does on data it has never seen before. In this case we trained our model; it took about eight seconds on the unigram data, and we got an accuracy score of about 87 percent. So 87 percent of the predictions it made were correct: if it said a domain was bad, it was probably bad, and if it said a domain was good, it was probably good. That number on its own might look okay, and it certainly sounds better than a coin flip, but our data wasn't balanced, which just means we didn't have the same number of good examples and bad examples. Almost 80 percent of our data was good, so if the model just said "good" every single time, it would be right almost 80 percent of the time, and that wouldn't be a very good model, because it would never catch anything bad, but it would still produce a number that looks kind of good. That's where a little more validation, thinking about the data, or just looking at it will help you avoid problems like that.
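A minimal sketch of the 80/20 split and the training step, continuing from the earlier snippets (swap ngram_range to (2, 2) for the bigram version):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 1))
X = vectorizer.fit_transform(df["domain"])
y = df["label"]

# Hold back 20% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on unseen data
```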
If we use bigrams instead of the unigrams, our model score goes up a little bit, to about 91 percent, which suggests that, yeah, bigrams are the better way to go. What we're doing here is a confusion matrix: we're trying to figure out how good our model really is. We have an accuracy score, that 91.7 percent, but rather than just saying "this is how many predictions I made and how often I was right," let's do a deeper dive: if I said something was bad, how often was it actually bad? And if I said something was good, how often was it actually bad? You can see that if we say it's bad, it really is bad about 92 percent of the time; if we say it's good and it's actually bad, that happens around 8 percent of the time, and we get about the same for good. So we're wrong about 8 percent of the time in both directions, which suggests our model is okay, not too bad. But in data science we actually have the word science, which should make you think of experimentation. There are a lot of different models out there, and different models behave differently on different data, so rather than manually running every single one, you can use a simple loop (sketched below) and have Python try a bunch of different models, then tell you which one worked better. That's exactly what we're doing here: here's my data, here's a bunch of models I think will probably do okay; run them, Python, and let me know how things work out. What we see in this case is that logistic regression, a very simple model, tends to work very well most of the time. The original model we were using was multinomial Naive Bayes, which worked very well against spam in the early 2000s; when spam was getting really bad, Naive Bayes classifiers were actually what helped bring it down. It's a very simple way to do some processing and figure out, spam or ham, is it good or is it bad? It works well and it's usually the first thing you should try, but in this case the other models just work better for the data we have. We also did cross-validation: we took the data we used for training and tried slicing it up in different ways, just in case we happened to get all good data or all bad data. When you're randomly sampling data you don't know what you're going to get, so it's a good idea to sample multiple times, just to make sure your model performs consistently. If the numbers at the bottom were something like 98, 50, 48, we would say our model had some problems, because depending on which data it looked at, it was getting very different scores. Here, even though it looked at a different slice of our data each time, it gets about the same number every time, which suggests we're on the right track.
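A minimal sketch of that try-several-models loop with cross-validation, plus the confusion matrix, building on the split from the previous snippet. The candidate list here is illustrative, not necessarily the exact set from the slide:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "multinomial_nb": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
}

for name, clf in candidates.items():
    # Three different slices of the training data per model; consistent
    # scores across the slices suggest the model behaves predictably
    scores = cross_val_score(clf, X_train, y_train, cv=3)
    print(name, scores.round(3))

best = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, best.predict(X_test)))
```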
Some more experimentation: the scikit-learn documentation is fantastic; they go into tons of detail on just about everything they have, and with every single model there are often dozens of parameters you can tweak and tune and set to all kinds of values. Knowing which ones you should touch is actually really hard, and fortunately, because it's Python, somebody wrote a library where you can just say: try all these different values that I think might matter, in different configurations, and again, let me know which one worked best. Using the model we got out of the previous code and having it try a number of different parameters took about 35 minutes to run on my laptop, which is still not very long considering the total matrix size: if every cell actually contained data, it would be around two billion values. It spits out and tells us the exact way to configure our model to get the best results, as far as it can tell. If we look at our confusion matrix again, things have gotten better. If we say something is good, it's almost certainly good. If we say something is bad, we're not quite as good: typically, for every 20 bad domains we make a prediction on, we'll be right 19 times and wrong once, whereas for good domains, for every 100 we see, we'll be right 99 times and wrong once. So we're definitely better at predicting good things than bad things, but this is still a pretty good step in the right direction, and it's a pretty useful model now. Most importantly, I didn't want to tempt the demo gods and actually run it live on my laptop, but if we do a prediction on insect.io, which is not in the original training set, we can see that it comes out as good, which, considering the conference we're at, I certainly hoped would be the case. Then we ran one on a random domain from a domain generation algorithm, and it correctly guessed that, yes, it was bad. That's what I'm going to cover for classifiers. Within the blue team space, classifiers are probably one of the more useful tools available. Taking the code I've written here, you'd need to change maybe one line if, instead of analyzing domains, you wanted to analyze text messages or email messages: change one tiny little configuration, and instead of bigrams on letters it would work on words. You could use it for spam filtering, or if you see a ton of false positives coming in through your ticketing system, you could feed those into a classifier, tell it which ones you know are real and which ones you know are false positives, and then you'd have a preprocessor you could feed new ticket data into that tells you: is this something I really need to look at, or is this probably just a false positive? That's one of the things I'm currently doing some work around, and it's very promising. Model persistence: Python notebooks are great to work in and easy to share, but really hard to use in production, because a notebook is a website that lets you run any Python code you want, so not something you'd actually want out there. Fortunately, extracting the model out of your code and putting it onto disk, so you can use it again and again in other places, make a web service out of it, or make an executable out of it, is very easy. You just use the native Python pickling facility, which serializes your objects. It's very easy to use and works fairly well; there are a couple of quirks, and depending on which version of Python you're using you can run into some oddness, but in general it's fairly robust.
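Here's a minimal sketch of both ideas: the parameter search with GridSearchCV and the pickling step. The parameter grid is an illustrative guess, not the configuration from the talk, and it reuses the vectorizer and split from the earlier snippets:

```python
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Try every parameter combination with cross-validation, keep the winner
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Predict on a domain that was never in the training data
print(search.predict(vectorizer.transform(["insect.io"])))

# Model persistence: pickle the fitted vectorizer and model together so a
# web service or standalone script can reload them without the notebook
with open("dga_model.pkl", "wb") as fh:
    pickle.dump({"vectorizer": vectorizer,
                 "model": search.best_estimator_}, fh)
```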
I want to give you one more example, and we'll go through this one a little quicker. This one is unsupervised. The previous one was supervised learning: we tell the computer, here's a question and here's the answer, and then eventually we give it a question and have it give us the answer. In unsupervised learning, we don't necessarily know what the answers are. We have a big pile of data, and what we want the computer to tell us is whether it can be sorted into different things. If we took this approach to the original data we had, because maybe we didn't know which domains were good and which were bad, all we'd know is that we have domains, and what we'd want to know is whether there are different kinds of domains. If I had this list of 1.4 million domains, are some good and some bad, or are they something and something else? Fortunately, there's a whole algorithm dedicated to that called mean shift. With most clustering algorithms, you actually have to know in advance how many clusters you have, and they figure out how to sort your data into those piles; with mean shift, you don't really have to know much in advance, so it's a good starting place. If we feed it all of our data, it says that, yeah, there are two different things in there, two clusters it can identify, using the same bigram features we used for our classifier. One of the things you do when clustering is measure similarity. This is where the trigonometry I mentioned comes in: all these things are now points in space, and you can measure the angles between them, which ends up giving you a similarity matrix like this one here. I don't want to get into linear algebra, but the top left corner says one, and that's the most similar two things can be; as you can see, the index and the column have the same name, so it's the same object, and you would certainly expect something to be as similar to itself as it can possibly be. Everything else is between zero and one, and that's just how similar they are: closer to one is more similar, closer to zero is less similar. Because you can be dealing with a large volume of data, the algorithm will tell you at the end that there are clusters, but that doesn't mean there really are unless you actually start looking at the data. This is one way you can graph things out: I took the first column of that similarity matrix and plotted every other item in the matrix by where it sits on the y scale, so y at the top is one, meaning most similar, y at the bottom is zero, meaning not similar at all, and the numbers along the bottom are the index. We color-coded the points based on the predictions: blue was the bad domains, category one, and green was category two, which happened to be the good domains. You can see that, yeah, they kind of naturally cluster when they're plotted out this way. This isn't me telling it what to do; these values came out of the machine learning, and the coloring was all done by the machine. It looks like the blue things are different from the green things, which, when you're doing unsupervised learning, is about as good as you can do most of the time: you know these two things are distinct, and plotting it out helps you validate that it makes sense, and then you go back through and look at all the data.
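A minimal sketch of the unsupervised version. Scikit-learn's MeanShift wants dense input and gets slow on big data, so this sketch clusters a small random sample of the feature matrix rather than all 1.4 million rows:

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.metrics.pairwise import cosine_similarity

# Densify a small random sample of the (sparse) feature matrix
idx = np.random.choice(X.shape[0], size=2000, replace=False)
X_sample = X[idx].toarray()

# No labels and no cluster count needed; mean shift finds the clusters
labels = MeanShift().fit_predict(X_sample)
print(np.unique(labels))       # ideally two clusters fall out

# Cosine similarity matrix: 1.0 on the diagonal (everything matches itself),
# values near 1 mean similar domains, values near 0 mean dissimilar
sim = cosine_similarity(X_sample[:10])
print(np.round(sim, 2))
```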
So, I'll try and wrap this up for you guys. Some takeaways: we all have data; everybody has data. One of the challenges I definitely had putting this presentation together is that I couldn't show you everything I'm actually working on, because it's all data proprietary to the company, and like most companies, we're fairly shy about sharing our own data. But everybody definitely has data, and there's tons of data out there. Exploring data has actually gotten very easy. It used to be kind of a pain in the butt if you tried to do these kinds of things in C or Java or anything else, but the libraries have made it so much easier that anybody can really start playing around with building models and machine learning and data science. And again, there's a wealth of resources out there. For the program I did through Harvard, almost all of the materials are also available for free online. Basically, if you enroll in the course, you get to interact with the professors and do homework and have it marked, but the lectures, the slides, the homework, all of that is put online for free and anybody can take it. There's some really fantastic content. Coursera also has some great material, and Stanford has some fantastic work as well. If you want to get involved in this, though, you're definitely going to have to read and learn. It's a big field that crosses many domains, so it's not something you can really learn in an hour. I had an hour to talk to you a little bit about it and maybe get you interested, but this is a deep topic, and there's no way we could spend nearly enough time on it in the time that we have. But there's a wealth of information out there if you want to learn more. Thank you for watching.