Machine learning for fun and profit is what I want to talk about today, and I've got 30 minutes to convince you why this is a great idea.

So, in a nutshell, let's start with the premise. What we're after here is to use Ruby to answer questions about your users and make your business money. Quick hands: how many people have a users table? Right. Yeah, the people who don't have their hands up probably do Internet of Things, maybe Bitcoin or something crazy like that, where we don't have any real users that we know about, just aliases and such. But everybody ends up having a users table somewhere in their system.

So we want to be able to make money from this. Whether you are a small band of bootstrappers or you've got more traditional VCs, you've got some sort of business plan, and that business plan starts with something ridiculous, and at the end of it there's profit, and in between there's this magic thing. What I want to talk to you about is that magic thing in between those two places: how collecting something can lead to profit, or to insight, or to social good for that matter.

We can do this a lot of different ways. The tools we're going to use are Ruby, which is exactly where we should be with this conference here in the Rocky Mountains; a users table, which might look something like that; and, as promised, science. Just to be clear, we're not going to use this kind of science, the crazy mad science. We're going to use the kick-ass science, like this. I find it fun that both of those people are named Neil, and they both have three names.
It's a slight disambiguation problem between the two Neils. And the person who's taking you there is me. Just real quick: the reason the arrow's there is I did all the code for this at RailsConf, and the code hasn't changed all that much, so it still says RailsConf instead of something a little more generic. I tweeted out what this was this morning, and I'm happy to show you and help you get the code down, because I'm not going to do live code. I'm going to show you the results of the code.

I've been doing this a long time. These are real conferences I presented at, and conference talks that I was at, when building a data warehouse was like 10 gigabytes. "Wow, you've got 10 gigabytes? Well, I'm from NASA, I've got a terabyte." Who doesn't have a terabyte of data anymore? I started programming data analysis stuff in Microsoft Visual Basic. You can't see it in that picture, but down at the bottom it says "for DOS", which was an actual, for-real product. We got a math coprocessor. And this picture is starting to become hot again: this is a back-propagation neural network. I used to do neural network problems, and the computational power wasn't there and it wasn't sexy anymore, so I went and did other things like, you know, Ruby and Java and more Ruby. Now I'm getting back to doing this stuff again, which makes me really happy. So I've been doing data science and data analytics in some form for a really long time, from a lot of different dimensions.

So let's start with this question: who are your users? How many people think they really know a lot of information about their users? They have good analytics about their users.
They feel really comfortable knowing their users. There's got to be a few of you, because some of you have to be at a startup where you know all the names of the people that use your product. So, I mean, if you know your users really well, then this is not applicable to you. All right, it's applicable to you in a few months, right, when you have hundreds of users or thousands of users or tens of thousands of users. I find all of this kind of funny, because I was a chemist in another life, and we used numbers of things and statistics that were like 10 to the 22nd, 10 to the 25th: ridiculous numbers of users that aren't available in real life. So I get kind of wiggy about statistics that only involve, you know, a few million users or a few hundred thousand users, but I'm learning to adapt.

This is how most people look at their users, right? Maybe you use Mixpanel, maybe you use Heap or one of these other tools. You know, Google is telling you something about your users. What the analytics are telling you about your users is this: you've got this user, and this user is some sort of aggregation of all your other users. That's not a very good story. I don't want all my users to be the same. How can I market to my users? How can I learn something about my users? How can I change my users' lives, depending on what my goal is?
By that information that I have there? Because if we use bad statistics (a lot of you might have seen this xkcd before), if you start looking at aggregates, you can draw some really horrible conclusions. All right, so let's figure out a way to do that.

This is what our users look like right now. At least we have this idea that we've got different kinds of users. So now we want to figure out a way to fill in the blanks here.

The first thing I'm going to do works no matter how little Ruby experience you have. I know there's folks from the various schools here, I know there's a lot of new people here, I met a bunch of people at RailsBridge; there's all sorts of people here. This is what we need to do for the first thing: we're going to do gender assignment. I hate saying that out loud, because that's not really what we're going to do. What we're going to do is take data you have, the first name of a user, and take this gem called sexmachine (I did not name it; it's a C library under the hood), and we're going to assign people a sex, male or female, based on their name. This gem is really fun for doing that.

A lot of times you'd say, well, you know, just collect gender information up front. Well, boy, that's a minefield, right? A field full of mines. What did it take Facebook, 40 different choices, to be able to handle gender? The example I like to think of here is ordering t-shirts for a conference. For real: you didn't think to ask how many were women, you didn't think to ask what their preferences were for girl-cut t-shirts or boy-cut t-shirts, so you're going to try to make an intelligent guess. We don't have to be precise; we just need to be in the ballpark.
So, you know, that's one practical application of this. This is what the code does: we get a detector with sexmachine, we give it names, and it prints out genders. For instance, these are real results from a set of data, and all this data is in that repo. So you laugh at that. Okay, is there anyone from the UK in the room? Okay, no one from the UK. Jamie is almost always male in the UK. This gem, the library underneath, is smart enough that if you give it locale information it can swap back and forth. But I actually have friends whose children are named River, Cedar, Justice. And it says "andy" because it's androgynous; that's what its default is. You can make it something nicer, like "unable to compute" or something like that. But you can take this and run it against a set of data and get at least a rough idea.

Here's a real-life example. At Treehouse, we were curious how many women were among our users, and we don't ask for that information up front. I don't need to know exactly how many; we're just trying to get a ballpark. Is it 50/50? Is it 60/40? Using first-name analysis gave us a pretty good guess at that, which was really fun. And it was easier than sending out a survey, because if you've ever sent out surveys, you know that it's impossible to get back all of the results, good results, and consistent results. So that's number one. You can take it home.
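The first-name approach described above can be illustrated with a small stand-in. This is only a sketch: it does not call the gem from the talk or reproduce its API, and the `NAME_TABLE` entries, the `:uk` locale key, and the helper names `guess_gender` and `ballpark_split` are all made up for illustration. It shows the two ideas from the talk: a locale can change the answer for a name like Jamie, and unknown or ambiguous names fall back to "andy".

```ruby
# Hypothetical lookup table: first name => guess per locale.
# A real name database would hold tens of thousands of entries.
NAME_TABLE = {
  "john"  => { default: :male },
  "sally" => { default: :female },
  "jamie" => { default: :andy, uk: :male },
  "river" => { default: :andy }
}.freeze

# Returns :male, :female, or :andy (androgynous or unknown).
def guess_gender(first_name, locale = :default)
  entry = NAME_TABLE[first_name.to_s.strip.downcase]
  return :andy unless entry

  # Fall back to the default guess when the locale has no override.
  entry.fetch(locale) { entry[:default] }
end

# Rough percentage split across a list of first names.
def ballpark_split(first_names)
  total = first_names.size.to_f
  first_names.map { |name| guess_gender(name) }
             .tally
             .transform_values { |count| (100 * count / total).round }
end

p ballpark_split(%w[John Sally Sally River])
```

The point of the split is the ballpark, not the individual guesses: a 60/40 answer is useful even when some single names are wrong.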
You can start doing it today: a very simple gem, very easy, not hard Ruby code.

Next up: location awareness. A lot of people probably already deal with geolocation. If you're a smaller company, or you want to do it for fun, you can do your own geolocation, which is getting another piece of data. There's this really cool, buzzword-compliant tool called freegeoip, which is a hosted service, but you can download it. It's Go, it needs Python to run scripts, and it uses MaxMind to pull down some free data. There's a little script in that repo where we use Faraday, which people are probably comfortable with. I'm running my own copy of that IP database, maybe one that I've enhanced a little with some extra information, possibly political affiliation by state or average income by zip code. So that's what the request looks like. I could do this with curl and do it all from the command line, but I'm trying to, you know, make it a little sexier.

I can loop over all the users in my database. If you're using Devise, you probably already have the IP address in there; that's one of the default things. I'm using the Resolv IPv4 regex so I don't have to write regexes, because my regex is bad. If the address matches the IPv4 regex, it's going to take that IP, look it up, and spit some data back. I'm going to get the data by doing that, I'm going to parse some JSON, and I'm going to get stuff that looks like this: average income, political leaning, and latitude and longitude, which are pretty crucial for "what's near me". And then what I've done is I've taken this and made it into something more like this: people that look real, people that you could walk up to and say, "Oh, okay, you are a real person. You're not some fake cardboard cutout."
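A minimal sketch of that lookup step, under stated assumptions: the talk's script used Faraday against a self-hosted geolocation server, while this version takes any callable that returns a JSON string so it runs offline. The `locate` helper, the `FAKE_LOOKUP` stand-in, and every field name in the fake response are hypothetical. The only piece taken directly from the talk is validating the address with the standard library's `Resolv::IPv4::Regex` before looking it up.

```ruby
require "json"
require "resolv"

# Looks up one IP address. `fetch` is any callable that takes an IP string
# and returns a JSON string (in production, an HTTP GET to your geo server).
# Returns a Hash of location data, or nil if the value is not a valid IPv4.
def locate(ip, fetch)
  return nil unless ip.to_s.match?(Resolv::IPv4::Regex)

  JSON.parse(fetch.call(ip))
end

# Offline stand-in for the HTTP call; the fields are invented for illustration.
FAKE_LOOKUP = lambda do |ip|
  JSON.generate(
    "ip" => ip,
    "latitude" => 40.01,
    "longitude" => -105.27,
    "average_income" => 50_000,
    "political_leaning" => "unknown"
  )
end

%w[8.8.8.8 not-an-ip].each do |candidate|
  puts "#{candidate}: #{locate(candidate, FAKE_LOOKUP).inspect}"
end
```

Injecting the fetcher keeps the validation and parsing testable without a network; swapping in a real HTTP client is a one-line change at the call site.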
So, woo, users are described. Everybody with me so far? I've been going lightning fast, and I'm going to drink more coffee, possibly, and we'll see where things go. Because now that we've got our users described, we're going to go off the edge of the map a little. This is where the more advanced among you might find some joy. At the edge of the map there's always dragons, and in Ruby there are a lot of dragons here. I am just so thankful I'm not teaching this as a two-hour workshop, that's all I'm going to say, because the dragons can look like that. That's Smaug. Wow, it's really dark. That's Smaug. That is a scary dragon to me. That is a scary dragon to my children. I want it to be a dragon that looks more like this one. That's a little happier. I'm pretty sure that's supposed to be an apple, but when I squint it looks like a ruby. So, you know, I'm glad that Pete's Dragon was guiding me down the path that I was hoping to go with all this code, because the dragons get in the way and we want to slay them.

Let's look at this picture. A lot of people here. So again, our users. Can anyone see clusters of people there, any groupings? Do you know why they're grouped? I mean, it looks like maybe someone fell down and is hurt, and people are looking there. And these people have their iPods and their earbuds in and they're walking away. There's a mime performing over there.
I don't know exactly what's going on in all of these, but we can look at this and intuit that there are clusters. We know it's hard for a computer to intuit things, though. So clustering is a really common class of problems, and we're going to do something pretty straightforward in Ruby. We're going to include some important properties about the user, whatever those properties are: you know, dollars spent on your e-commerce platform. For us, we've done things at Treehouse like number of badges earned, points earned, time spent on the site, anything we want to get a little more information about. And there's this beautiful gem called ai4r, Artificial Intelligence for Ruby, that hasn't been updated in eons, and it's good that it just works for the most part. So that's one of the places you'll find dragons. This gem wraps up a lot of really complicated math, because you don't want to implement this stuff by hand. I've seen people implement it by hand and it's scary. I've implemented it by hand and it's just heinous.

But here's how clustering works; we're going to walk through code real quick. What we're doing is something called k-means clustering, and the k is how many clusters we have. So you say in advance, "Hey, I've got three kinds of customers: great customers, mediocre customers, and the customers I really don't want to spend any money on." What I want to do is figure out which of my users are more likely to be which, based on whatever that important property is, maybe how frequently they shop. So I'm going to start by randomly sorting people into groups.
That's all. It's like dealing out cards. If you picture red, green, and blue cards and you deal them out into piles, we've got big piles of mixed-up colors of cards. All of you could figure out how to sort that; you might even be able to figure out how to get a computer to sort that. What we're going to do next, though, is where it gets interesting. You can see the slightly highlighted line up there that says "calculate centroid". We're going to figure out the center of each cluster, of our three piles, what the center means. Then we're going to go through everything in that pile, and if the center of our cluster is red, we're going to say: is this red? No, put it over here, because it's more blue. Is this one red? No, put it over here, because it's more green. We're just going to sort through them, and if we go through these passes enough times, we're eventually going to get to the point where nobody is switching groups anymore, and we've gotten people clustered into k groups.

So the first step is to calculate the center. The second is to loop through the people and figure out the distance; the GD is a geometric distance, and there are a lot of ways to do this that we'll talk about. And then that last little bit, the "changed equals true" awfulness there, is basically to keep it from sorting people back and forth in a kind of infinite loop. So now you've just learned how to do k-means clustering. Woo. We're getting a lot of stuff done.

Math. How many people love math? All right. Linear algebra people in the house? All right, a few linear algebra people in the house. Okay, you linear algebra people, please don't correct me. Please don't embarrass me by correcting my horrible math.
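The k-means passes walked through above can be sketched in plain Ruby. This is a hand-rolled illustration, not the gem's implementation and not the code from the slides: the helper names, the fixed random seed, the `max_passes` cap, and the sample `[badges, hours]` numbers are all assumptions. It follows the same steps as the talk: deal everyone into k piles, compute each pile's centroid, move each person to the nearest centroid, and stop when a pass changes nothing.

```ruby
# Straight-line (Euclidean) distance between two equal-length numeric arrays.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# The centroid is the per-dimension average of a pile of points.
def centroid(points)
  points.transpose.map { |dimension| dimension.sum / dimension.size.to_f }
end

# Returns one cluster label (0...k) per input point.
def k_means(points, k, rng: Random.new(42), max_passes: 100)
  # Deal the shuffled points into k piles, like dealing out cards.
  order = points.each_index.to_a.shuffle(random: rng)
  assignments = Array.new(points.size)
  order.each_with_index { |point_index, n| assignments[point_index] = n % k }

  max_passes.times do
    centroids = (0...k).map do |cluster|
      members = points.select.with_index { |_, i| assignments[i] == cluster }
      # An emptied pile is restarted from a randomly chosen point.
      members.empty? ? points.sample(random: rng) : centroid(members)
    end

    changed = false
    points.each_with_index do |point, i|
      nearest = (0...k).min_by { |cluster| distance(point, centroids[cluster]) }
      next if nearest == assignments[i]

      assignments[i] = nearest
      changed = true
    end
    break unless changed # nobody switched piles, so we are done
  end
  assignments
end

# Made-up users described by [badges earned, hours on site].
users = [[2.0, 1.0], [3.0, 2.0], [40.0, 35.0], [42.0, 30.0], [90.0, 80.0], [95.0, 85.0]]
p k_means(users, 3)
```

Different starting deals can settle into different groupings, which is why the seed is exposed as a parameter here.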
I was a chemist. I did a lot, lot, lot of linear algebra, but I was never very good at it. I made other choices with my evenings, and my eight o'clock calculus class was not my best. So we've got linear algebra, which we need to know: matrices and vectors are how all of this stuff works. Knowing how to do set operations in the database can let you do some of these things sometimes. And you've got to know there are better math tools than Ruby. Ruby is kind of the gateway drug, so here's where I come clean and say, you know, Ruby is probably not the best way to do this. But it's a fine way to do it and find out if it's good enough for you, because if it is, then maybe you'll make the effort and learn one of the other tools when it becomes too hard, and we'll talk about some of the reasons.

Like I said, that was called k-means clustering. Lots of words. The word that I see is "Voronoi cells", because it just sounds cool. So you math people, you can explain what a Voronoi cell is. It's some magic.
That's what k-means does, whatever. Use the gem, cluster your stuff, and find out what's interesting. It was really fascinating to cluster the people that worked at Treehouse based on the number of badges they earned, to see who you were most like. It was very funny not to see all the developers in the same place, and to see how many designers were ahead of developers, with big clusters on different axes. It was really fun.

So there are a lot of algorithms here, and here's one thing that's fascinating: the algorithms are all going to give you different results. That's got to make you feel good when you're trying to pitch to the VC or your investors or your community or your boss: "Hey, sometimes this gives different answers." It can give different answers depending on how you sort in the beginning, and it can give different answers depending on your algorithm. And you could derive all of these algorithms. The first way we did the clusters, we sorted into k buckets, and we said, okay, there's three buckets.
We're going to fill three buckets. Another way to do it is to basically have one cluster for everybody. That's the extreme option, right? Everybody's a single cluster. You could do this in this room if we had more time: you could look around and say, who's most like me? Maybe you'd choose by beard length, or maybe you'd choose by, you know, male or female, or any of the other aspects. And if I said, "Move one step closer to whoever's closest to you," eventually all of these single people would end up in some sort of clustered groups. When people stop deciding that anyone else is more who they want to be like, that's where these algorithms end. The other way is the reverse: start everybody in one giant group, and if you feel like you are different from the rest of the group, step away. That's how divisive hierarchical clustering works.

Plus, there's a whole bunch of ways to figure out the distance between centroids. I kind of hedged earlier and said you calculate the center and figure out which one you're closer to. There are at least seven common algorithms for figuring out the difference. They all give you different answers, they're all appropriate in different scenarios, and the only way to find out is to use them on your data. Now I do sound like a snake oil salesman. All right, we are on time.

Last thing, and it's a big concept. Liking is really important, and one of the best ways to see what people like is to do something called collaborative filtering. There are zillions of collaborative filtering algorithms. You may remember that Netflix had a giant competition to figure out who could improve their recommendation engine. Well, they used some of the magic that's here. They used something called SVD, which is not a disease; it's singular value decomposition, which comes out of linear algebra, which is what that gem is. And again, all we need is some important properties from our users. So here's what we're going to do.
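A quick aside on the distance measures mentioned above, before the SVD walkthrough. The sketch below compares three common ones (Euclidean, Manhattan, and cosine distance) on the same made-up data to show how the "closest" user can change with the measure. The user names, the numbers, and the `nearest` helper are hypothetical, and these three are picked only as familiar examples, not as the specific list the talk had in mind.

```ruby
# Straight-line distance.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Sum of per-dimension differences ("city block" distance).
def manhattan(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

# 1 minus the cosine of the angle between the vectors:
# ignores magnitude and compares direction only.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  length = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  1.0 - dot / (length.call(a) * length.call(b))
end

# Name of the candidate closest to the target under the given metric.
def nearest(target, candidates, metric)
  candidates.min_by { |_, vector| send(metric, target, vector) }.first
end

# Made-up users described by two numeric properties.
TARGET = [1.0, 1.0].freeze
CANDIDATES = {
  "Ann" => [4.0, 4.2],
  "Raj" => [1.0, 6.0],
  "Lee" => [10.0, 10.0]
}.freeze

%i[euclidean manhattan cosine_distance].each do |metric|
  puts "#{metric}: #{nearest(TARGET, CANDIDATES, metric)}"
end
```

With this data each measure picks a different user, which is the talk's point: the choice of distance is part of the model, and the only way to judge it is to try it on your own data.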
I kind of turned this around; I've done this talk a couple of different ways in different formats. This is SVD, and it makes my head hurt just to look at it, so I thought a visual would be better. I mean, how many people have done SVD? Anyone ever done singular value decomposition? If you've done linear algebra, yeah, you know what this is. Otherwise it's a whole bunch of matrix barf up here. So the Russian that star had earlier was a really good example. That would have been a much better way to show this.

But a visual usually works for people, so this kind of shows an example of what's going on. What we're doing with collaborative filtering is taking everybody's opinions about something. If you want to think about a matrix, think users going up and down, and think, you know, ratings of movies going across this way. That's some crazy infinite-dimensional space; I mean, the dimension is the number of users times the number of ratings that they've done. It's just an insane thing to visualize, and it's a hard thing to calculate. But what SVD does is basically pull out a constant. That constant is really expensive to calculate, but once you have it, for anybody else coming in you can take their set of answers, multiply it by that constant (for our purposes), and it puts them on a graph like this. So you can take someone like Bob, who's new, take his ratings, multiply them by that giant constant that was really expensive to calculate, and put him on this graph. Then you can use some distance algorithms to see who's closest. You can see here that Ben and Fred were the closest to Bob, and that, you know, John, I'm down here; I clearly do not like the same kind of movies Bob likes. Tom is close, but not too close, so Ben and Fred are probably the people we want to use.

So I'm going to show some garbage up on the screen. Linear algebra gem. What is beautiful is line three there.
We can take a matrix and we can call singular value decomposition, and it does it. We don't have to implement the math, which is great, because smarter people than me have implemented that. Then we're going to use some really horrible stuff down here to basically do some math on columns. And again, what's beautiful about this is you don't have to know what it does. I only halfway follow it because, like I said, my linear algebra is still shaky. This is straight out of some examples that Ilya Grigorik (did I get his name right? the guy who's at Google) had done years and years ago when he wrote some articles about this. Basically, what we're doing with this is generating that picture. We're taking all of those rows, creating a constant, multiplying Bob by that constant, putting him in this two-dimensional space, and seeing who is close to him.

I had some fascinating times running this against the Treehouse data to see who was most like who. This is where you calculate out the difference between Bob and someone else. What we wanted to do was figure out: hey, if you've done a lot of HTML and JavaScript and Ruby, are you the same as someone who's done a lot of JavaScript and CSS and Ruby? Maybe you all should be friends on Treehouse, and maybe you should do different things with each other and help each other out. All this math, all this code, is in that repo, and it all just works and you can enjoy it. I'm so embarrassed there's a counter there, but that's just what it takes to get the math to work sometimes. And then we did cosine similarity at the end, which is how we figured out how far apart things were.

And now we're at a point. Our goal was to figure out how to use Ruby to answer questions about our users and our business, and what I tried to do very, very, very quickly was give you some tools.
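The SVD steps described above (decompose the ratings matrix, keep two dimensions, fold the new user in, compare with cosine similarity) can be sketched with only Ruby's bundled `matrix` library. This is an assumption-laden illustration, not the code from the slides: the talk used a linear algebra gem with a built-in SVD call, whereas the standard library has none, so this version derives the pieces from the eigen-decomposition of A^T A. The ratings are sample numbers, and the helper names are made up.

```ruby
require "matrix"

# Illustrative ratings: rows are items (say, six seasons of a show),
# columns are users. These are sample numbers, not real data.
USERS = %w[Ben Tom John Fred].freeze
RATINGS = Matrix[
  [5.0, 5.0, 0.0, 5.0],
  [5.0, 0.0, 3.0, 4.0],
  [3.0, 4.0, 0.0, 3.0],
  [0.0, 0.0, 5.0, 3.0],
  [5.0, 4.0, 4.0, 5.0],
  [5.0, 4.0, 5.0, 5.0]
]

# Rank-k pieces of A ~ U * S * V^T, taken from the eigenvectors of A^T * A.
# Returns [columns of U, singular values, columns of V], largest first.
def truncated_svd(a, k)
  eig = (a.transpose * a).eigen
  top = eig.eigenvalues.zip(eig.eigenvectors).max_by(k) { |value, _| value }
  sigmas = top.map { |value, _| Math.sqrt(value) }
  vs = top.map { |_, vector| vector.normalize }
  us = vs.zip(sigmas).map { |v, sigma| (a * v) / sigma }
  [us, sigmas, vs]
end

def cosine_similarity(a, b)
  a.inner_product(b) / (a.norm * b.norm)
end

# Folds a new user's ratings into the k-dimensional space and ranks the
# existing users by cosine similarity, most similar first.
def most_similar(new_ratings, ratings = RATINGS, names = USERS, k = 2)
  us, sigmas, vs = truncated_svd(ratings, k)
  newcomer = Vector.elements(
    us.zip(sigmas).map { |u, sigma| new_ratings.inner_product(u) / sigma }
  )
  names.each_with_index
       .map { |name, j| [name, cosine_similarity(newcomer, Vector.elements(vs.map { |v| v[j] }))] }
       .sort_by { |_, similarity| -similarity }
end

bob = Vector[5.0, 5.0, 0.0, 0.0, 0.0, 5.0]
most_similar(bob).each { |name, similarity| puts format("%-5s %.3f", name, similarity) }
```

The expensive part is `truncated_svd`; in a batch setting it would be computed once and reused, with each new user costing only the cheap fold-in and similarity steps.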
There's four tools. You can assign gender to your users, for some degree of gender. You can figure out where your users are and mark them up with location data. You can figure out what clusters your users are in, based on any sort of important criteria. And then, based on ratings, you can do some collaborative filtering on your users to see who your users are most like, and what other users might be good people to recommend as a friend or a mentor or such. Everybody with me so far? All right. So we filled in that spot with all those tools, and then we end up in this spot, rolling in money, right? Because that's what the goal is. Or rolling in the credit and joy of our users and our community. I talk about profit, but this is all about solving problems. It's not always about profit.

So, real quick, credits, and then I'll have time for questions. I've gotten to workshop this a few places: Turing, RailsConf. RailsConf was the two-hour version of this, which, if you can possibly stand it, is on Confreaks, where I actually did all the code. And I'm really thankful to be here at Rocky Mountain Ruby. I've really got to thank Boulder, because everywhere I've worked in Boulder since I've been here, since Sunday, I've run into people at every single one of these places. RailsBridge was great. I really liked your supermarket down the street; that was really fun to be at. I had to give up one more set of maple bacon donuts so I could come and do this talk, but boy, the Tea and Cakes maple bacon donuts, I like those.

I have to do the shout-out so I get paid: Treehouse is who pays me to do this stuff. We are hiring right now. We are hiring at least two developers. And a quick thanks. That's me, but I've got two more slides, hang on. That's me. I'm not on the Twitters all that much.
I am not on the Facebook at all. I am old, and social media scares me, and you kids should get off my yard.

I am often asked these, so I know a lot of the answers to these questions in advance. People ask, what books? There are a lot of books, and frankly they go on sale all the time at O'Reilly. If you're in a users group, your users group should be able to give you 50% off right off the bat, so you won't end up spending a lot for these. Machine Learning for Hackers is lots of fun. All these books are really, really fun, and they are somewhat pan-tool: they use different tools; they're not all uniform with a single tool.

If you are less of a book learner and more of a video learner (you know, working at Treehouse, I can appreciate that), Coursera has a really famous class, Andrew Ng's class on machine learning, over there on the far left. Has anyone taken that? Well, you need to know some linear algebra for that, right? And in week five, when they start doing back-propagation neural networks (hey, look, I brought that full circle), boy, that's when people usually start having some pain. Do not neglect to watch his video on linear algebra. So he's from Stanford, and that's on Coursera. That really horrible graphic down there is the inverse, apparently, of Try R, which O'Reilly did with Code School, and which is a really quick way to learn R, one of the languages that's important for doing data analytics. The R logo is up in the corner. Johns Hopkins has a giant class that they do on Coursera as well, and it looks like this toolbox thing. There's like 10 versions of it.

Things you're going to learn: like I said, Ruby is your gateway drug into here. Python is really good at math. Don't hate on the Python; Python's really good with the math. R is weirdly good at the statistics, but, for the most part...
It's both memory- and CPU-bound, which is insane. I have a script I inherited that runs for 25 hours, and you can't put more resources on it because it can't use more CPUs. It's ridiculous. Octave is the very user-hostile matrix algebra tool that's sort of like the slightly less hostile MATLAB. Mathematica exists, of course, and there's lots of fun in all the Wolfram stuff. And there's Julia, which is kind of the new kid in town for doing mathematical computation, and it's kind of fun. You can do all these algorithms; people have already written them. The algorithms are rarely as important as the using-the-algorithms part of it.

And with that, I'm going to skip ahead to the thanks so I can take a couple of questions. I'm John Paul. I work at Treehouse. I like data science, and I am happy to answer any questions you have about data science, either that you're willing to ask me to my face, or, if you would rather, ask me quietly afterwards or later this weekend. So what do you want to know? What can I answer?

Yeah, so the question is, do the tools work with names other than English ones? That sexmachine gem, a German guy wrote it. There's 46,000 names in it, and they're done by frequency and locale. So it's got Arabic names, it's got Chinese names. Most of them have been transliterated, so it's not actually Arabic script in it; there's that kind of gotcha. But it should work on a fairly international group pretty well. And, you know, it's only so accurate, but it's really fun to see what it'll do, and it honestly does not take long to run it on your data. But yes, it is internationalized. So yeah, that was a good question.

Is that a hand over there? No hands there. Yes, sir. The question is, have we done anything with visualizing data?
What I would do is refer you to this graphic. If you ask if I've done visualization of data: clearly, visualization is not my strong point. So yeah, you know, yes, we have done some stuff with it. There are two traditional solutions to visualizing the data. One is to run the big calculation and stuff it into something like D3 to get pretty pictures. Most of the calculations we do are nowhere near real-time; it happens in batches, so it's easy to do that. It's also easy to do some of the, you know, Google Charts or something like that. But honestly, a lot of times the output of this is spreadsheets, and then you dump it into Excel, so you don't have to reinvent Excel for the business analyst or the exec or whoever is looking at this and wants to slice and dice it a bunch of different ways. So it is not hard to do this. The scripts and code that I have here take the data and stuff the new data back into a database, so then anything that you'd use would work with that. Tableau is pretty common. Business intelligence, Crystal Reports, whatever. I mean, you can use whatever visualization tool you want, because the calculating is the hard part. But yeah, you can see what I do with visualization. So that's a good question.

What else? I've got at least two more minutes. You don't want me to sing. Yeah, I shouldn't have said that, because I knew someone would say that. La la. So, okay, well, if you don't have any more questions: I'm sorry I went through that quite so fast. All that code is up. I will happily take pull requests.

Oh, here's what I should say. All right, true story: Ruby. You're going to have to install some stuff from brew, because all of those things that do math require C. And what's scary is it doesn't really require C. It needs Fortran. Yeah, so you're using Ruby's C bindings to, under the hood, talk to Fortran.
So good luck with that. There is brew for it, and it does run on Ruby 2.1.2. I've verified that; I've not run it on 2.1.3. But it was frankly, excuse me, a bitch to get up and running the first time, and if you're on 1.8.6 or 1.9, God help you getting this stuff done. So yeah, I should have given you that caveat. But go do it. It's really fun. It's not hard. Everybody should be doing some data science. I really appreciate being out here and being in Boulder. Thanks a lot.