People cool if I go ahead and kick into this right on time? Everybody okay with that? Looks like we've got a lot of seats full. Thank you guys so much for coming down to the talk. I'm John, and I'm going to talk about machine learning. I know you could be at Sandi's talk right now, so I appreciate you coming down to this instead. Sandi will be great on video, she's a wonderful person, and I know that I have to deliver at least as much value as you would have gotten out of Sandi's talk. So you've set a pretty high bar for me, I appreciate it, and I hope I don't let y'all down.

So what's our goal? I like one takeaway. Three takeaways is great; one takeaway is even better. I want to use Ruby to answer questions about your users and your business. That's my goal, and we're going to use machine learning to do it. (There are some chairs down here, guys, if you want to grab them and scoot them around somewhere, put them around the back, something like that. This room is set up kind of funky.)

So I've got a question for all of you; this is going to be interactive for a bit. How many people have a users table in their Rails app? Okay, this is a better question: how many people don't have a users table? All right. I'm just curious, what's the God object in your app instead of users? Sorry? Asset? Okay, that one makes sense. What else? Assets are users. So: machine learning for fun and profit with your users table, sorry, assets, and things like that. You'll probably find the same techniques apply, but nearly everybody's got a users table, which is what started this thing out.

Now, what is the goal of y'all's business? Anyone, just shout it out. What is the real goal of your business? Make money. Thank you. So we've got users and we've got the profit, right? Who's got a plan for making money from their users? You just need to raise your hand. All right, you're first, so I'm going to put you on the spot: what's your plan? How do you turn users into profit? Awesome. I understand that
business plan. That is awesome. So we make loans, people pay off the loans, and we make money on that. That's awesome. Who works for a social-network-type company that's going to monetize the attention economy of the... yeah? Oh, well, okay, for sure. I've done that too, and that's what kind of frames the story for me, because we're probably all familiar with people with that plan, right? We've got users; we want to make a profit.

Everyone knows our friends the underwear gnomes. There's that wonderful part where the gnomes are explaining to the South Park boys that step two is this big question mark: they collect the underpants, and then they're going to make profit. It's just strange to speculate on what business models you could actually build by collecting underpants and using machine learning on underpants to create profit, but we're not going to do that today. We're going to figure out how to fill in that question mark: the stuff that's in your users table right now that you can turn into money, or hopefully some kind of money.

We're going to use a particular set of tools. Ruby, which everybody here is probably pretty familiar with. We already talked about how everybody's got a users table; sorry, almost everybody. And we are going to use science. I'm a big fan of science; I was a chemist in another life. But science can take you down a bad path, so I want to be sure that when we're thinking about the data science we're going to do today, we're a little less this crazy guy, the professor (I think this is from the third Back to the Future movie), and a little more kick-ass science guy. Neil deGrasse Tyson is one of my favorite guys in the world. So we're going to use our users table to figure out how to make a profit with data science, and we're going to try to do it thinking more kick-ass like Neil deGrasse Tyson than crazy like that. So, real
quick, the obligatory: that's me. I'm John Paul Ashenfelter. I work here at Treehouse. I asked earlier, but how many fans of Treehouse? A few, okay. Before that I worked at General Assembly, so I've got two of the big ones taken care of. I'll probably go over to Dev Bootcamp and get a job there next so I can just continue working for education companies. I've got Treehouse stickers up here for anybody that wants them, because we do have pretty cool branding. You can come get Mike the Frog, or you can come get... I've got to say, they sent me these, and I really have no clue what a boat is for Treehouse, but it's wonderful.

More importantly, why should you care about me and data science, and about me being the one to tell you? I've been doing this for a long, long time. You can't see too many details, but this is from 2006. I started the data warehousing track, which is another kind of data science, at the MySQL conference, and I taught it a lot at O'Reilly's Open Source Convention. You can see what I highlighted here, because it's just funny to go back and look at where we were just about 10 years ago: we were talking about big databases that were in the 10 to 100 gigabyte range. I mean, that was just huge. It was hard to figure out how you store data that big in those times. Who's got a database bigger than a hundred gigs? Just curious. Yeah, a fair handful. People bigger than a terabyte? Do we have Facebook here with their exabytes of data? I don't think there are any Facebook people here, because they're all PHP, right? Anyhow, data has changed a lot, and that means the tools have changed a lot.

Just one more quick digression from the history of me. You can't see everything there, but when I started doing this I actually started with neural networks back in grad school, or actually undergrad, to be honest. You can't see it over here, but down at the bottom it says "for MS-DOS". So we did Visual Basic, with no version number, for MS-DOS, and we
had to buy a math coprocessor for the computers we ran it on, because the 386 math coprocessor was an additional cost. You slotted it in and all of a sudden your math got better. Most of the time when you were running numerical simulations back in those days, it was kind of like this: you pressed play and you waited, on average, two or three hours. I had some runs that took three days between data points. Literally three days. That's changed a lot. So I've been doing this for a long time. Ironically, someone else started a lot of things like this at exactly the same time: the month I started my research project, this was the cover of Inc. magazine. So, just saying, there was some interesting stuff going on at that time that was far more interesting than what I was doing.

Data science: this is our format. We're going to start with a problem and some data, and we're going to do some stuff with code to get some kind of results. We're going to learn something about our users, and we're going to use that to make money. I started this as "machine learning for fun and profit". As I've been doing a lot more of this and thinking about it a lot more, I've started to think about it like storytelling: in our case, for these examples, storytelling about your users, because I think stories are a much more powerful metaphor. So this is sort of arranged in stories, and we're going to start where people like to start, with simple stories. Stories you tell around the campfire, stories you tell to make people happy, stories you tell to teach people things, stories that we all love and enjoy.

So I'm going to ask a question: who knows who their users are? Do any of you actually really work with your users table? Maybe you're in marketing, maybe you're on the business dev side. Do very many people really feel like they know who their users are? So no one's really ready to say, "I do, I know my users." That's probably because you've got a
lot of them, right? It's easy when you've got five or ten users. I literally look out in here and I can't count how many people there are, because your mind goes "one, two, many" and it's kind of gone. So it's hard.

I'll tell you one thing I bet all of you know about your users. How many people are familiar with thinking about your users like that? Google Analytics, Heap, Mixpanel, KISSmetrics, any of these things: you're used to thinking about your users like that, which is another way of saying thinking about your users like that, right? They're all exactly the same person. And then what do you do? You take that and you aggregate it. Honestly, if you look at a typical Google Analytics dashboard, about the only thing in there that tells any sort of story is that we've got a lot of people in North America. In this one it's a little more of a story. I can tell a story looking out at you guys: there are a lot of white guys with facial hair, there are more women than there used to be, and we're missing all of the people with colored hair. I can notice a few things, but it's very superficial data, much like that Google Analytics dashboard showing me, hey, we got a lot more traffic from the US.

So this is what our users look like, and we add things up about users. We use vanity metrics: we've got 10 billion users, the users spent so much. We roll them up into aggregates all the time. Now, aggregates are okay, but they really don't tell the whole story. Aggregates tell you about your average user. How many of y'all dream of being the average user of a company? Really, no one wants to be the average user of a company. We all know that everybody's not a special snowflake; we've been hearing that over and over, that we should all have the same tools. Everybody wants to feel special, though,
regardless of how we're looking at the data. So that means we need to tell good stories. Aggregates are boring. SQL, DBAs from the past, people who deal with reports... anyone? Anyone? Right, that's why you're Ruby devs, so you don't have to do SQL and reports and all those things. That's for the Java guys running BIRT and the people who are using Crystal, Business Objects, whatever it's called this week, and Oracle, and a world most of us don't live in.

But aggregates can tell more of a story. They can turn into events in motion, which are more interesting. Seeing aggregates over time is wonderful: being able to press the play button and see data change over time, seeing your users grow over time, your number of tweets grow over time, your cash base grow over time. And then context makes it interesting. There are a lot of questions where you want to know things about the context, and when you're putting all those things together, you're telling stories.

So I was thinking that there are some users in my database, and the users in my database spent good money at my company, and then I thought, I wonder how many of them are female? And then I had a revelation. I'm trying to do Ira Glass's storytelling, which no one does as well as Ira Glass. This is a napkin representation of the storytelling that This American Life does: something happened, then something happened, oh my god, insight. Something happened, something happened, something happened, oh my god, more insight. If you've ever listened to This American Life, you learn stuff from it. They're telling individual stories, and then you come up with some better picture, some better understanding, some insight to guide your life. Morning Edition does it in a similar way. The big V there is that after their intro they go way down into the trough to talk about all the details, and then they come back out of the trough and they say, well, we talked to John Paul Ashenfelter about this, and we talked to Avdi Grimm about this, and we talked
to this other person about this, to put it back into human context. There are different ways of telling stories, and there's the very internet way, which is still a good way to tell stories because you all click on stuff like this: "Seven unbelievable facts about your users. Click here for more." All these things are ways people want to hear about data.

So, your users: what do you know, how do you know it, and what's missing? This might be what your users look like; you've got some vague outline of them in your head. How do you find out more about your users? Let's keep going with the male/female distribution. How many people collect that right now, like actually ask in registration or something? So, not many. How would you figure it out? Just holler out: if you needed to know, for whatever reason, what the male/female ratio of your population was. Name analysis! Wow, that would be a good one, if only someone was doing a talk at this conference on name analysis. Thank you. There was nothing paid for that.

What's the traditional way to do it? Ask them. How do you ask them? Surveys, thank you. Does anyone know what survey response percentages are like? Have you ever done a survey through SurveyMonkey or something like that? Yeah, go ahead. They're tiny. They are. And that's assuming that people open the email in the first place, which is more than likely how you shipped it out. So you're multiplying tiny numbers, which leads to even tinier numbers, and you end up with very small data sets that you extrapolate from and hope that they're somehow relevant, hope that they're somehow statistically significant. There are statistical techniques for dealing with that, but wouldn't it be better if you had more confidence and knew more about your users?

Descriptive data lets you slice your users into segments. You can use things like lookup tables to do this, which
we're going to do in a second. You can do the name analysis, which we're going to do in a second. Most of these are fast and easy to do, and they're going to give you way better results. So if I said I could give you something like 80% accuracy on male and female gender based on first name, who thinks that's worse than what they get from a survey? Right. It's at least as good as what you're going to get from a survey, probably better, and it takes very little time and effort. So that's one thing I'm going to send you home with today.

Let's talk about it. This is one of the first examples we're going to do, and we're going to see how these examples go. The first two I know you can do without any crazy gems or any linear algebra or anything like that. So, yeah, the gem's called sexmachine. I did not write it. This is literally the code: you're selecting all your users by first name, and then we're going to take the sexmachine gem and analyze them. Let's run through the code real quick, and then we'll do the code and see if that works. How many people have got the gems installed so far? Okay, so there should be at least someone nearby whose screen you can kind of see. I'm going to explain it, and we're going to take a minute to do it. Maybe while people are doing it you can try one more network to see if you can get the gems installed from the repo, and we'll go with it.

Basically, sexmachine is pretty trivial. I'll tell you a little bit about what's under the hood. You create a detector, and there are a couple of cool things you can do. There's case sensitivity; most people don't want case sensitivity. You can also pass it locales, because different names are masculine or feminine in different locales. How many people in here are British? UK British? Okay. Just the British people and the US people: think to yourself your answer. The name Jamie: boy or girl? British people: boy. US people: girl.
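The lookup-table idea described here (a table of first names, a per-locale split, and cut-offs for how sure the table has to be) can be sketched in plain Ruby. This is a minimal illustration only, not the gem's code or API: `NAME_TABLE`, the percentages in it, the two cut-off constants, and `guess_gender` are all hypothetical names and made-up numbers chosen for the example.

```ruby
# Hypothetical lookup table: share of people with each first name who are
# male, per locale. The numbers are invented for illustration only.
NAME_TABLE = {
  "john"  => { us: 0.99, uk: 0.99 },
  "mary"  => { us: 0.01, uk: 0.01 },
  "jamie" => { us: 0.25, uk: 0.85 },
  "kyle"  => { us: 0.97, uk: 0.97 }
}.freeze

# Assumed cut-offs: at or above SURE we commit to a gender, between LIKELY
# and SURE we only say "mostly", and everything else is androgynous.
SURE   = 0.90
LIKELY = 0.70

# Returns :male, :mostly_male, :female, :mostly_female or :androgynous.
# Lookup is case-insensitive; unknown names fall through to :androgynous.
def guess_gender(first_name, locale: :us)
  share_male = NAME_TABLE.dig(first_name.to_s.strip.downcase, locale)
  return :androgynous if share_male.nil?

  share_female = 1.0 - share_male
  if    share_male   >= SURE   then :male
  elsif share_female >= SURE   then :female
  elsif share_male   >= LIKELY then :mostly_male
  elsif share_female >= LIKELY then :mostly_female
  else  :androgynous
  end
end

# Example: the same name can land differently depending on the locale.
%i[us uk].each do |locale|
  puts "Jamie (#{locale}): #{guess_gender('Jamie', locale: locale)}"
end
```

Treating unknown names as androgynous keeps junk rows such as placeholder names out of the male and female counts, at the cost of a larger "don't know" bucket.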
Right? I mean, it's not certain. Jamie's a little bit of an androgynous name, but in Great Britain it is far more boy than girl, and in the US it is far more girl than boy. So this library understands some of those things, if you want to lock it to a particular region. Basically, we've got this detector, which we're going to create from the sexmachine gem, and we are going to get the gender of names. That's literally all this gem is. Now, that's a lot easier even than putting together a Wufoo survey and sending out an email to all your users, and honestly, later on you can check how right it is, if you want to send out a survey because you need better statistics. So that is literally all there is to the sexmachine gem.

I'm going to tell you just a bit about where it came from, because you should always question these black boxes. We've got this black box that you put a name in and you get a gender out. For all you know, it's random. Hopefully it's not random. What is interesting about this gem is that the data is seven years old. It's a collection of forty, forty-two thousand names, I think, that some guy in Germany put together by checking census data from all sorts of different countries. It's got percentages by country, and it's packaged up, because God forbid I use a gem that's pure Ruby in this talk, in a C extension. So the sexmachine gem wraps the C code that has all the names in it. But it's very easy to muck about with this gem, and it runs really quick.

So let's do our first exercise. There are two files, if you look in the exercise one gender thing in this repo, and I'll put the repo up again for people that might not have it. "Check your gender" lets you check your own gender, so you can put your name in it, and you can see some of the unusual ones. I have a couple of questions just before you see the results of it. I literally polled friends that have children with unusual names, okay? So I want to know,
male or female, when I holler these names out. Cedar? Okay. River? Justice? It's just fascinating, right? Those are all in the "check your gender" file. I put that one together so you can just put your name in it, and I put some other examples in so you can see what sexmachine has to say about those. Those are all true stories; friends of mine all have children named those various things.

And then there's "assign gender to users". There's this ongoing story that we can tell through what's in this database. There's a machine learning SQLite file, so it's not hard. I would encourage you, if you've got a slice of your users table on your local machine, to go ahead and hook it up. Use the Postgres gem, use the MySQL gem, if you want to take this and actually run it against your real data right now. There's no reason you have to use mine. But I gave you a set of data: I pulled a bunch of the people that work at Treehouse, I dumped out some of the personal data but kept their names, and I ran that into a SQLite database so it was easy to distribute. That "assign gender to users" file is actually more of a read-from-the-database, pick-a-gender, write-it-out.

So what I want to do is see how it works for us, going ahead and trying that. We're going to try it for maybe five to ten minutes. This one's not too hard, so it's either going to run or it's not going to run. Either you have the gem down or you don't and the Wi-Fi is going to kill you, and let's just see what happens. Is everybody cool with that plan? All right, let's take five minutes to start, and then we'll go a few more and see if we can get either "check your gender" or "assign gender to users" running. I would be particularly interested, when you do the first one, if you feel really upset about what it tells you you are. I usually go by John Paul, because there are so many Johns at most of the companies I work at, and it says I'm androgynous. I guess it's because I
have two names, but if I do John or Paul, obviously it says I am masculine. So it'll be interesting to see what it says for you, especially if your name is Justice or River or Cedar or something like that. So let's give it a shot and see how this goes. This is your time to do something, so I'll go ahead and show some code up here.

Okay, so this is what we've got. That's "assign gender to users". Hang on, let me get the other one. What's the other one called? "Check my gender". All of these you can just run in Ruby from the command line; if you've got it, you can just ruby these files. You can put your name here. "Your name here", by the way, is androgynous as well. If you actually run it just like this, it'll tell you "your name here" is androgynous, "test user one" is androgynous, "test user 10" is androgynous. I learned a lot about what is an androgynous name looking through some of the junk data in our tables. Anyhow, let's give it a shot. Yes, sir? Yeah, sorry, I'll bring the repo up too. The repo is right there.

So let's see how this works. Nothing like live coding with all of you. In case you really want to know, the before and after are in there. All the code is there. It's no big deal if it doesn't work for you, or if you don't want to do it right now, or whatever. It's got the before and after, so you can take it back, hook it into your users table, and get something of value right away. If you run it and you're surprised or happy about the gender assignment, just raise your hand and tell people. Yeah? Really? How about that. I would have to agree with that too. We actually have Kyles at Treehouse, so yes, I actually knew that one was weird. I don't know why. Fascinating. So Kyle clearly is a problem. What else? It could be the locale. The default locale is US, though, which is weird. It also spits up a lot when it doesn't know a name, and so some of the newer names,
like, it's not so good with names that are common now, like Khaleesi. My wife delivers... you all laugh, you all think I'm making that up. My wife delivers babies; that's what she does. She hasn't delivered one, but they keep track of the hot names, and you would be stunned at what the hot names are. We're kind of past the Cheyennes and the Shanias; that era's kind of over. But it changes over time. This gem was first done in 2007. There are some interesting things, and the good news is that it's really easy to hack the data format for it, because it's basically just a big text file.

How are people doing? Are people getting this to run okay, those that have it? I mean, if you don't have it downloaded... yeah, cool. Anyone else surprised, shocked, upset, or disappointed to find out they're androgynous, or mostly male, or mostly female? I laugh every time. I think "gender assignment" is just the wrong thing to call this. But yeah, what is androgynous? There you go. You can see what it did for Justice. I didn't put in Charity, but Justice, and we should see what it does for Khaleesi. I think it throws up its hands and calls anything it doesn't know androgynous too. There are also some ways in the C code to set the threshold for saying androgynous.

So are we close enough that I should go on? I have no clue how to take the pulse of this, because either you've got it and you can run it, or you can't run it. Yes, there's a threshold. Off the top of my head I can't remember what it is, but it's something like: if it's somewhere around 80 to 85% sure, it will say male or female, and there's this window where it's kind of sure, and it does the "mostly male" and "mostly female". And if you run that second file... this one I can run, so we can at least see what's going on if you do "assign gender to users". Let's get a terminal up here. Hang on, I know the terminal is not
there yet. Sorry, I made this bigger so you could see it, and then it doesn't fit on the screen. So if we "assign gender to users" and run it across the database that I gave you, you can see that Treehouse apparently is pretty male and pretty androgynous. The androgynous is all sorts of test junk in there. I did a longer write-up of this, but: garbage in, garbage out. We have three mostly female and seven mostly male, and then a whole bunch of androgynous names and just a handful of women. We have one woman named Fabby, and it doesn't know what to do with her. There are all sorts of people it's confused about. Aimee, A-I-M-E-E, it gets as mostly female, I think, but with that one it was having an interesting time.

So anyhow, gender. What I just did is save you having to do a survey, having to compile the results, and having to deal with the statistical sampling technique you'd have to use to backfill it enough that you felt confident you had a decent overall segment of your users, so you could figure out who's male and female.

The problem I originally did this for was to figure out how to order t-shirts. I want to put this in a real context: we were trying to figure out, for one of the meetups, how many male and female t-shirts we should order. Of course I could have counted, because Treehouse only has about 70 employees, but I thought it was a good example of figuring out how to use machine learning to do that, because it let us know how many male and female t-shirts we needed. It was about 10% female t-shirts. It made it really easy. And now (hey, all you Treehouse users who mentioned you're here) we've recently started sending out t-shirts to people who subscribe, and we were like, I wonder how many male and female ones we need and what it would cost? Well, now we have a good way to estimate how much that would be, because
we can run this against our user base, figure out how many are probably female and probably male, and get good estimates on that. So hopefully that's one take-home: you can assign gender to your users. Everybody cool with where we are so far?

All right, so the next one also is not very sexy machine learning. This one is geolocation. I bet a ton of you do geolocation, right? People who already do geolocation from IP address for their users? A fair bit. How many people use a third-party service for that? MaxMind, probably. Sorry, what else? Cool, so there's a handful of companies that do it. Anyone used freegeoip? That's why I was curious. freegeoip is the focus of the next one. Again, it's something you get to take home, it's something you can use today, and there are real reasons for using it.

Let me get ahead of my talk. The context for this at Treehouse was that we wanted to know better what we needed to do with our support hours. We wanted to see where people were. We didn't need super accuracy, so this was a good technique for it. We just needed to know roughly how much West Coast time, how much East Coast time, how much European GMT time: what does the profile of our users look like, so we knew better how to staff the support people. Again, this is going to let us put a financial value on our users and on how much we spend on support, and make sure we're spending money effectively to support our users, make people love us more, and give them a really good experience.

So basically our technique is very similar. We're going to select the IP address. Who has the IP address in their users table? It would be anyone who uses Devise; by default that has an IP address. So almost any vanilla Rails app has an IP address in there, and a lot of other people just put an IP address in there in general. freegeoip.net is a service, but the code is all open source. They use two things. They use the MaxMind free
location database, which has about 20 miles accuracy, maybe five miles accuracy depending on where you are. It's good enough for a lot of the kinds of things that people need to do. For us, we needed time zone resolution. Easy enough, though I guess people here in Chicago are right on the wrong edge of a time zone, and then there are the people in Indiana that are on the other edge of that time zone, so I guess maybe 20 miles does matter.

If you want to run this, you can get it all from GitHub. It needs Python and Go, because it wouldn't be fun if we didn't have as many languages as possible. It uses Python to pull down all the data from MaxMind. It then takes that data and munges it with a local CSV file that adds more information about locations and countries. Then you spin up a Go server. How many Go programmers have we got? Awesome. So we spin up a Go server, and then we can use Ruby to throw IP addresses at it. It seems like a lot of work. One nice thing about this is that you can control it and keep it in-house. You can take the data and add your own data to it, to make the country information richer and the IP address information richer, and you can basically have a good time with it.

The code is pretty straightforward. We're going to walk through the code and give you an opportunity to do it. In my perfect world, I thought I'd sit up here and just run the server so you all could hit it and you didn't have to install Go. I'm pretty sure that's not going to work against the conference Wi-Fi, so this might be a take-it-home.

Basically, if we look through the code, and this is all in the exercise two location thing, this is all Ruby. We're going to set a geocoder, which for us is going to be localhost. We're going to use Faraday, because I'm old school, just to make a request. Then when we do that, we're going to grab the user and make sure it's an IPv4 address with a regular expression; that's what that little bit here does. The reason I do that is that our load balancer at Treehouse
was misconfigured for a while, and some of our users have the load balancer's IP or IPv6 data in there, both of which are problematic: (a) it says it's coming from the load balancer, which has nothing to do with the user, and (b) IPv6 can be a problem with some of the libraries. So match it against IPv6, so we're throwing away the data that's bad. Everyone following the Ruby so far? Nothing rocket science here. Rock star, rocket science, I'm going to get it straight.

Then we're just going to do a JSON get. We're calling that connection that we set up with Faraday to get the JSON representation of the current login IP, and we're going to parse that out as some JSON data. What that's going to give us, among other things, is a latitude and a longitude in a big JSON packet. Again, what I did is set up a bunch of data in the machine learning SQLite database. This runs against that, throws data at a Go server, and then stuffs the result into a new table, so you've got it.

Again, we could ask people where they live. It doesn't really matter where they say they live if we can figure out where they live based on their IP address. Is this perfect? No. I did a lot of geolocation work at General Assembly, and we tried to deal with things like: I leave on a plane from San Francisco and I'm going to New York, both places where we had a presence, and I have Wi-Fi on the plane. Should I show you New York, San Francisco, or something else? My answer was let them choose, but the answer we had internally was let's use some sort of magic to figure out where they are and try to assign them to San Francisco or New York, which we could do as well.

Anyhow, how do we do this code? You see "assign location to users" in the second exercise. It's going to look like this, and you can see that I was pretty honest about what it is. One thing that's really important is this doesn't work if
you're not running the Go server. So I'm going to go ahead, because at least I can demo this. I'm going to run a Go server over here. Hello, there is our Go server. We've already downloaded the data and parsed it with Python, and Go is running our server here. Then we're going to go over here (sorry, I keep saying "over here", and I'm saying it too much) to exercise two. We're going to make it so you can actually read this, and you're welcome to do it too. So we're just going to ruby "assign location to users". It's going to go ahead and stuff a lot of users into our database. We can see that it asked the Go server, and you can see the Go server is doing its thing. The Go server is wicked fast, by the way. I love this little Go server; this made me want to try Go. So it's just sending a bunch of IP addresses from our database, and we're getting all the JSON location data. Now we've got latitude and longitude.

So, do you want to take five minutes and try it? You've got your users table; you can hook it up. How many people said they had Devise or something with IP addresses? Just a hint: 127.0.0.1 doesn't geolocate. 10.anything doesn't geolocate. All the 192.168 addresses don't geolocate. The junk addresses don't geolocate, so you need to throw those out.

So, two things so far: we've got a way to assign gender to users, and we've got a way to assign location to users. We took these guys and maybe turned them into these guys, so they actually are people, in full color, who look like people, as opposed to these vague silhouettes. We described our users. That is kind of the end of act one. So let's take a couple of minutes to see if you can get the geolocation running. Would it help if I wandered around? I'm not sure how useful it is, because the code either runs or it doesn't, because the bundle install is such a pain in the butt with the internet in here. So I'm trying to take a pulse. I'm taking a survey,
which I already told you is wrong. I should just use some sort of machine learning to figure out whether I'm going fast or slow enough. I could use maybe eyeball contact or some sort of other statistic. We doing okay? Okay. The linear algebra, yeah, there's some... yep. Yeah, basically you need... You were saying that you were Kyle? See, Kyle is already a person to me, because we used the gender assignment thing to make sure he was male. I'm going to guess he's from the Pacific Northwest and be wrong. No? Okay, good. Exactly, yeah. And I was going to put it on memory sticks, but I can't put Go on a memory stick in a way that installs, and I can't put all the C libraries on. When we get to the conclusion you'll see why we chose this and what some of the options are.

But the question that I was looping back to: it's probably that... oops, sorry, it's probably this Quora post. Well, anyhow, I've got the Quora post here, but what it turns into is probably installing F2C. You can install that if you can get an internet connection, and that'll probably let the linear algebra gem compile. Because we're doing polyglot, remember: we've done Ruby, we've done Python, we've done Go, we've done C with sexmachine, and now we're adding Fortran to the mix, because this is taking a Ruby library and using C bindings to take the Fortran C bindings and plug it all together, because that is how you win.

How many people have ever used Fortran anytime in their life? Wow. I mean really used Fortran, not used it under the hood. Fortran was my second or third programming language. Side project? Yeah, everybody should pick up Fortran as a side project, I totally agree with that.

Okay, so we've got about 45 minutes to get through the second half. How are people doing with the geolocation? They're just fine, because no one has Go and they can't run Go, and we don't have any internet in here, and I can't run things.
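For anyone following along without the Go server, the filtering step from the geolocation exercise can be sketched in plain Ruby. This is a minimal sketch, not the workshop's code: the method names `geolocatable?` and `extract_lat_lng` are hypothetical, the JSON key names `latitude` and `longitude` are an assumption about what the geolocation server returns, and the HTTP call itself (made through Faraday in the talk) is left out so the example runs with only the standard library.

```ruby
require "ipaddr"
require "json"

# Ranges that never geolocate: loopback plus the private IPv4 blocks
# mentioned in the talk (10.x, 192.168.x) and 172.16.0.0/12.
NON_ROUTABLE = %w[127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16]
               .map { |cidr| IPAddr.new(cidr) }

# True only for a public IPv4 address worth sending to a geolocation server.
# (Hypothetical helper; the name is not from the workshop repo.)
def geolocatable?(ip_string)
  ip = IPAddr.new(ip_string)
  return false unless ip.ipv4? # skip IPv6, as the talk does
  NON_ROUTABLE.none? { |range| range.include?(ip) }
rescue IPAddr::Error
  false # junk strings are not addresses at all
end

# Pull latitude and longitude out of a JSON response body.
# The key names here are assumptions about the server's response format.
def extract_lat_lng(json_body)
  data = JSON.parse(json_body)
  [data["latitude"], data["longitude"]]
end

ips = ["127.0.0.1", "10.1.2.3", "192.168.1.10", "2001:db8::1", "8.8.8.8"]
puts ips.select { |ip| geolocatable?(ip) }.inspect
```

Only the addresses that pass `geolocatable?` would be sent on to the geolocation server; everything else is thrown away before any network call is made.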
Okay to move on? All right. So if you leave now, you have taken away two useful things, hopefully: you can do the gender assignment of your users, you can do the location, and you can find out more about your users, hopefully in a useful way. We're going to start... well, I always get ahead of my slides. Hang on, I'll shut up till I have a slide to talk from.

So we're past this point, and now we're going to stories of myth and legend. We took stories that were simple: who you are, tell me about yourself, tell me your gender, tell me your location, the normal things you say when you're introducing yourself to somebody. You're telling a story about yourself. Now we're going to do something crazy. I put this here, "there be dragons," because I knew we were going to have trouble with linear algebra, I knew we were going to have trouble with compilation, I knew we were going to have trouble with Wi-Fi. All these things are a problem.

Dragons can be scary, right? Smaug from The Hobbit is pretty freaking scary, with Benedict Cumberbatch. But they're not always scary. Who knows what this movie is? Yeah, a handful of people know it. It speaks dragon. I did not add anything to that; it looks like he's holding a ruby, and that just came straight off Google Docs, and that let me know I was on the right path. Because there are dragons at the edge of the map, and we are at the edge of the Ruby map, and we want to have less of him, who's eating us, and more of him, who is our friend, as we get here to the end of the map of what Ruby can do.

So we've been looking at how people can be described. Here we've got a whole bunch of people, and now we can describe them a little better: we can put a gender on them, we can put a location on them. But the next step is to put people into clusters, to put people into groups, because we form tribes naturally. That's the people telling the stories around the campfire, and myths grow up around groups of people and individuals,
right? But people agglomerate towards those people. This is a random grouping of people, but you look at that and you're like, there's some order. You can see clusters in there. You can argue about how you draw the cluster: I could draw the cluster and say here's a cluster of people, someone else might say here's the cluster of people, someone else might say well, that's a really good cluster and this is just some weird-shaped cluster. But it doesn't matter. Visually you can look at that and say, okay, some of those people are not like the other people; they're grouped differently.

And your users are like that too. When we're looking at users we often use that giant Google Analytics pile of aggregates, and Batman was slapping us for using aggregates because they don't tell the whole story. So what we can do is take important properties about your users, whatever those are (I'm going to do it in a sort of Treehouse context, but you can figure out whatever your important properties are), and we're going to find ways to see the inherent structure there and to find similarity there. Those are the two things we're going to do.

So this is where we get into math, too. Before I go on, how many people had linear algebra in the past? Okay, all of that was because you were computer science majors and they made you do it, right? So linear algebra is fascinating and underlies all this. Linear algebra is also very easy to do wrong. I've got one thing here that's done kind of by hand, and then we're going to use this wonderful library: the ai4r (artificial intelligence for Ruby) gem is chock full of clustering algorithms and ID3 decision trees and all sorts of wonderful stuff, so you don't have to do it by hand. But it relies on the linear algebra gem. Anyhow, we're going to take important properties from our users, and either by hand or using this gem we're going to do stuff with them. So here
is a specific example. At Treehouse we treat all our users pretty much the same, but it would not be out of the ordinary to think maybe we have kind of casual users, maybe we have professional users, and then we have the crazy people that earn every single badge we have and 25,000 points and all that. We've got our super users, we've got normal users, we've got casual users. That's a hypothesis; we could find out if that's true.

What we're going to do is use a technique called k-means clustering. We'll define it in a second, but basically it's a way of saying I want to take all this data and put it into a particular number, K, of groups. So I'm not clustering and, you know... let me rewind. I'm saying I know how many clusters I want: I want three, I want five, I want ten. The example I'm using right now is three: I want to break into what I think are casual users, super users, and professional users, let's say. What you'll find if you do a lot of machine learning is you will take a lot of those assumptions and try it with three, four, five, seven, and ten or something like that and see if the results make sense, because you don't know. With this algorithm, much like a lot of statistics, you make assumptions at the beginning and then you have to kind of stick with them all the way through to the end. So for k-means clustering, we're going to pick K clusters, we're going to put these users into groups, and we're going to see what we can learn from them.

So I want to talk through some code. If you look in the EX three, the example three folder in the repo, I'm going to just pull bits and pieces out of the first clustering. So I'm going to make some clusters. Like I said, for this example we're going to do three, so I'm going to make three clusters. We don't have to know what a cluster is; a cluster is just a group. And then I'm going to take all of my users and I'm just going to mix them up a little bit, so I end up randomly
sprinkling them. The visual of this is I'm taking all of my users and I'm just throwing them out on the floor in any random order. That's what I'm doing. And the value I choose to use is the number of badges people have earned at Treehouse, because I think badges have a correlation with whether these people are power users or casual users. At Treehouse you get a badge for finishing a significant chunk of work, basically. So badges are the one thing I'm going to measure. I can use way more than one, but it's easier to work with one. I'm basically going to throw the people out on the floor, and I don't really care how they're organized, because I'm going to try to find some order in there.

And then I'm going to go on to the actual math method. I'm going to find the center of each cluster. Remember, I started with three clusters and kind of threw people in randomly, so people are in these clusters. I'm going to figure out the center using some mysterious math, and then for each person I'm going to go through all the other clusters and see if they're closer to another cluster's center than they are to the center of the one they're in. So when I'm looking at all these people on the floor, I find three visual points that are centers, and then if someone's hanging way out here on the edge between this one and this one, that person probably really belongs over here, so I'm going to put them over there. I'm going to do that for each person, and when that's done I'm going to do it again, but I'm going to calculate new centers for everything. Eventually it will stop moving: people will sort themselves out to the closest center, the center will move a little, and it'll separate people out slowly. It's really cool to see the visualization. It's really hard to do the visualization in Ruby, so I've
got a text visualization. But I'm basically going to keep doing that till it's done, and at the end of that I'm going to have three groups, and then I can look at the statistics of those three groups.

You might wonder what calculate GD is. Calculate GD is calculate the geometric distance. You'll find for all of these algorithms that calculating the distance is the one true thing that differs between them. There are all sorts of ways to do it. I'm using a geometric distance, which you can think about like a hypotenuse. There's Manhattan distance, which is like blocks in Manhattan, where you never take a diagonal because there's a building there. There are all sorts of ways to do these things, and that's when the linear algebra pays off, to know some and to understand the bits.

But anyhow, we've got the assigning users to a segment. Just curious, how many people actually have the linear algebra gem installed? It's probably like 10 or 12 of you, right? Okay, so I'm going to show it up here. Feel free to go ahead and do it, but I'm going to give you an idea of how this actually works. We can close off our Go server because we don't need it anymore, we can close off this, and we're going to go ahead and open... what is this? This is the cluster. All right. If you skim through the code, you can see there's our calculate centroid, and there's, you know, blah blah blah, there's some math. The math is not very complicated; this is all square-root math, and this is basically just what I said we're doing earlier when I went through all those bits and pieces. I put in a bunch of really ugly puts so we can see what's happening, and it'll make sense when I run it, so I'll just shut up and run it. I'm going to pull this over so you can see it. And you'll see there's two different gems in here, or sorry, two different files in here. One's doing it with ai4r, and basically what that does is hide all the details. It does k-means; you don't have to
derive anything, you don't have to do any math. It's wonderful.

Just as an aside, I'll tell you about the first time I saw clustering. I was at La Conf last year in Paris, and Tamer did a machine learning thing where he basically said: eight people sit at a table; fill out the survey, and I will assign you to tables based on machine learning, because there are K tables and each table can have so many people at it. He basically used a version of this kind of clustering to figure out who should sit with whom based on some interesting questions. Most people ended up just sitting where they wanted to, but it was an interesting experiment. And he did not use the linear algebra gem, so if you think the linear algebra gem is hard to understand, doing it by hand is just horrible. So anyhow, enough slamming on that.

So we're going to assign users to segments. A bunch of stuff happened, so I want to talk through it, and I couldn't think of a bigger way. Remember, we basically threw all these users out onto the floor. We're going through each cluster that we initially assigned them to, and we're calculating the center of each cluster, and then for each person we look at all the other clusters and see if we should move them from the one they're in to one that is, geometric-distance-wise, closer to them. You can see we moved a lot of people from cluster zero to one or two because they were closer, and then we went to cluster one and did the same thing, and then we went through cluster two. And then you see it says "iterate again": we're just starting again, because we went through one pass of all the clusters and moved people around. We iterated again, and, skimming through all these iterations, at the end you can see what happened: the movements got smaller, there were more and more people in the right cluster and fewer and fewer people who needed to get moved, and eventually nobody moved and it said, okay, we're in a steady
state, let's stop. What I did then is I spit out the clusters to see how many people were in each group, and because badge count was what was important to me, I wanted to see what the badge average was. There were not quite two hundred people (I'm sorry, I don't remember; we can add up how many people were in this). So 61 people were in cluster zero and they had an average of 12 badges. And if you look at what the badges are (this is, like I said, the world's worst visualization because it's text, but it was easier to do it this way in Ruby) you can see, okay, they don't have too many badges. I could buy that. And you look at this one: this cluster has an average of 51. There's a lot fewer people in it, there's 30 in this one, and you look at it and just intuitively you're kind of like, yeah, that makes sense: 56, 39, 40. All right. Just like we looked at the picture earlier, we can intuitively say here's a cluster and here's a cluster, and the edges might be a little foggy but they're good enough. And then when we get down to the third cluster, it's like all the people with crazy many badges. And this is all staff; we have students with tons more. The centroid was around 150. So we had clusters that were really well separated: the first one was like 12, the second one, what was it, 48? And the last one's 150. What was the second one? 51. So 12, 51, and 150. We have really good separation here, and so maybe we do have three clusters of users.

I can run this again and cluster into five, and I'll probably believe the clusters that come out. I can also put more dimensions in here. It's easy to think about in two dimensions; it gets weirder to think about when it gets multi-dimensional. That's what I did in the next example: I put in not only their total badges but how many points they earned for each one of our 10 major areas, HTML, CSS, JavaScript, because I was like, maybe we have
clusters of people. It stands to reason JavaScript and CSS go together with HTML if you're learning that, and maybe they're different from Ruby, because Ruby's over here. There's overlap between the skills, but they're probably different people, and the people that are taking WordPress are probably really different from those people too, though they take some HTML too. So what I did with the next one is I used the gem to make life easier.

Actually, sorry. So linear algebra is all about matrices and vectors. Under the hood that's all matrices and vectors. You already do a lot with sets, which are kind of like matrices and vectors, if you know SQL. As we're finding, there are way better numerical tools than Ruby, okay? Ruby is good to get your feet wet; Ruby's good to do stuff. Python of course is the king for doing numerical programming, and that's okay. It's okay to have something that's different. R is crazy. R has been around forever; it's strangely like Lisp in some ways (Lisps are hot right now, and it's weird to think R is sort of like that), but there's some things about scope. I was getting ready to say that: there's not only MATLAB, but there's an open source MATLAB called Octave, which I didn't know about. I just started, I took that Stanford machine learning class, that big giant MOOC on Coursera, to see what it was, and they now pretty much standardize on Octave, which is just as hostile as MATLAB but $2,000 cheaper. I used to do MATLAB and Mathematica way back when I was a chemist, because those were the tools of quantum physics and stuff like that. So R, Python, Octave or MATLAB, Mathematica: there are tools that are really good at this, and they all have strengths and weaknesses. For instance, R is memory bound. It's really great, and there are some weird ways around the memory binding of it, but if you have a 16 gig data set you need like 16 gigs. Well, you need more to get everything to run, but you have to have enough memory to
store your stuff. R is weird; most other languages manage. Ruby is probably pretty bad when you get up there, too.

So one more thing about k-means; I should have started with this. I highlighted all the fun words: vector quantization; partition n observations into k clusters, which is what we did (we had n users, k clusters); nearest mean. My favorite word out of that is Voronoi cells, which sounds like some sort of science fiction something. Apparently he's a mathematician from the late 1800s who did computational geography, or geometry, computational. Those are Voronoi cells, so that's apparently what we just worked with.

All right, so the next example is where we talk about the alternatives to k-means. There are other clustering tools. Basically it's the same thing: I've got a bunch of users, I'm throwing them out on the floor, and I'm going to try to put them together. What we did before is we arbitrarily chose three centers, because we put people into three groups, and then we moved them to the closest one. Here, every user starts as their own cluster, so there's n clusters, and then it looks to see which two clusters are closest, and the two closest clusters get merged together. So it kind of agglomerates up, and then it stops when it gets to the number of clusters you say, or to the distance between the clusters; there's two different ways to have stop conditions. If you look in the ai4r example, we're using something called complete linkage. There are 11 different ways to measure the distance in the ai4r gem, and they will all give you slightly different results. There are different ways to agglomerate the users up and measure the distance in between them. The other way is to do it in reverse: divisive hierarchical clustering starts with one cluster, pulls the person that's furthest out into a new one, and then keeps dividing, or kind of scatter and pull together. And the big difference is how they measure distance. You might say, since they're all giving us different answers, what's the point of this? Well, in any
sort of machine learning problem, what you're doing is traversing this multi-dimensional space and trying to find a minimum. There are local minima and there are global minima, and the goal is to get to the best answer in a reasonable amount of time. And with all these tools you can just change the name of the linkage. That's the ai4r gem. The ai4r gem is great: you can say complete linkage, single linkage; you just go through all the clusterers and see how it changes your data. It'll change the averages a little, it'll change the clusters a little, but it probably won't change your conclusions all that much, which is what's really interesting. And if it does, it's probably because there was a local minimum that all the others got stuck in except the one that was different. It's pretty cool to be able to dig into this and see what's really going on.

So, ai4r. And please, linear algebra is worth installing just for getting ai4r, because it doesn't just have clusterers. If you want to do ID3 decision trees and get your feet wet with that, or if you want to get your feet wet with neural networks, which I've got to say are a pain in the ass to program, having started years and years ago with back propagation... When you're trying to make money from your users, the last thing you want is to use the wrong algorithm. You're going through enough effort to get all the data together, get the data clean, get the data into the system, and figure out how the heck to run this stuff; you don't want a bad algorithm on top of that. There are far smarter people than me that do these algorithms, and I'm very happy to steal. It's much better to take it and use it to learn about your users and make money instead of learning how to do a new numerical simulation, unless you love numerical programming. If you love numerical programming, that's fine.

So with this ai4r gem, we grab the users, we put them in a data set, and you can see right here, here's the money shot: the clusterer. We're using complete linkage, and we're
telling complete linkage that we want the data set in three clusters. I mean, how much easier is that than figuring out what the algorithms are? There's no geometric distance, there's no math; all the math is under the hood. And then we can spit out data about it. So I used this one to do complete linkage on the same data set, and we get slightly different answers than we did before. You're welcome to do this, for the 12 of you that managed to get this installed.

Okay, so this data is just pure data. This is the one I want to do. We used two different things in here, if we dig under the hood. One, I did the same badge exercise, same people with badges. This one came out kind of different than the other one did. There are a lot fewer people in the top one, and (I should have calculated the average) I think the average is probably a little lower for the second group, which is weird, and a lot more people are in the bottom group here. If I was trying to figure out how many badges you have to have to be a super user up there in group 3, the other one said 150 and the average here is like 200. I'd probably err on the side of 150, because I'd rather call more people super users than fewer. But either way, I know that I have this group of people, and it's a fairly rarefied group compared to the big group of people. So somewhere there's this magic point between 30 or 40 badges and up around 100 where they jump from one group to another. What can I do to figure out how people get from one group to another?

The stuff down here at the bottom is where I used something even more incomprehensible. I put in all of their point earnings in the 10 major categories; that's why there's 10 chunks of data here. That's saying I don't care about your total points, I care about the point distribution among WordPerfect... sorry, WP. I think WordPerfect because it's just built into my DNA, and it means WordPress where we are, which never crosses my mind as a tool I'd use. So sorry to the WordPress people.
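The two clustering approaches walked through so far, k-means with a fixed K and agglomerative clustering with complete linkage, can be sketched in plain Ruby with no gems. This is a minimal sketch under stated assumptions, not the workshop's code and not the ai4r API: the method names are hypothetical, the badge counts are made up, and it works on a single measurement per user. One deliberate difference from the talk: instead of throwing users into random starting groups, the k-means starting centers are spread across the sorted values, so the example gives the same result on every run.

```ruby
# Mean of a list of numbers: the "center" of a one-dimensional cluster.
def centroid(values)
  values.sum.to_f / values.size
end

# k-means on one measurement per user (for example, badge counts).
# Assumption: starting centers are evenly spaced picks from the sorted
# values, rather than the random starting groups used in the talk.
def kmeans_1d(values, k, max_passes = 100)
  sorted = values.sort
  centers = (0...k).map { |c| sorted[c * (sorted.size - 1) / [k - 1, 1].max].to_f }
  groups = {}
  max_passes.times do
    # Put every user with the nearest center...
    groups = values.group_by { |v| centers.each_index.min_by { |c| (v - centers[c]).abs } }
    # ...then recompute each center from the users now assigned to it.
    updated = centers.each_index.map { |c| groups[c] ? centroid(groups[c]) : centers[c] }
    break if updated == centers # steady state: the centers stopped moving
    centers = updated
  end
  groups.values
end

# Complete linkage: the distance between two clusters is the distance
# between their two farthest-apart members.
def complete_linkage_distance(a, b)
  a.product(b).map { |x, y| (x - y).abs }.max
end

# Agglomerative clustering: start with one cluster per user and keep
# merging the two closest clusters until only `target` clusters remain.
def agglomerate(values, target)
  clusters = values.map { |v| [v] }
  while clusters.size > target
    i, j = (0...clusters.size).to_a.combination(2).min_by do |a, b|
      complete_linkage_distance(clusters[a], clusters[b])
    end
    merged = clusters[i] + clusters[j]
    clusters = clusters.each_index.reject { |idx| idx == i || idx == j }
                       .map { |idx| clusters[idx] }
    clusters << merged
  end
  clusters
end

# Made-up badge counts, not Treehouse data.
badges = [3, 8, 12, 15, 40, 48, 51, 56, 140, 150, 165]

kmeans_1d(badges, 3).each do |group|
  puts "k-means: #{group.size} users, average #{centroid(group).round(1)} badges"
end
agglomerate(badges, 3).each do |group|
  puts "complete linkage: #{group.size} users, average #{centroid(group).round(1)} badges"
end
```

On toy data this well separated, both methods land on the same three groups. On real data, as the talk notes, the groups and averages can shift depending on the algorithm and the linkage, which is the reason to try more than one and compare.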
Anyhow: WordPress, design, PHP, HTML, JS, CSS. That's what all these numbers are, how many points people have. So I was digging down into the points. This is still pretty incomprehensible, but you can see we have groups of people. If we threw this on a visualization it would help a ton, because this just looks like vomit.

So now we've done two different kinds of grouping. If I want to change this from complete linkage to single linkage, or one of the other ones that's supported by ai4r, I change that one line of code to use the different linkage and I see if my results change significantly. So now we've done the first of our really nasty sets of data analytics: we've worked with algorithms to take our users and segment them in different ways. We described them: we said they're male or female, we said where they're from. We could have used that as input into here. We could have said male or female, latitude, longitude, cluster on that. Maybe we have a big female following in the UK at Treehouse; that would be an interesting piece of data to know. Maybe we have a huge male following in the Pacific Northwest. I think that's what we were starting to dig out. For visualizing this, people use things like D3, or I guess you could use some of the ggplot things. Some of this ends up for me going into R, so I use some of the R tools to show it off. But this lets us get Ruby results pretty quickly. You saw that that one ran for a little longer, though; there was a definite pause when that one ran. It was taxing my little MacBook Air here a lot.

The Wi-Fi situation and the linear algebra compilation problems... what we're going to do in the last bit, getting back to the talk: let's talk about likes, things that are like each other. Let's talk about similarity. Again, I'm interested in how people collaborate, how people recommend things to one another, how people find people who are similar to them. And everybody here, I would bet, has at some point used Netflix or something similar that tells you
what you're going to like based on how you've rated movies or how you've rated purchases or something like that. So again, we're starting with important properties, like before when we were using badges or point totals. This is pure linear algebra, and this magic thing called SVD, singular value decomposition. It's one of a handful of techniques for simplifying complex matrices. Basically the gist of SVD is you have users and ratings of something: all the users, all the movies. And I think I'm going to do something out of order here and give you a visual. We're going to come back to how the math got calculated, but basically what we're doing is taking all the Netflix users and all the movies (imagine how huge that matrix is, just in your mind, this giant matrix) and we're collapsing it down, using math, into a two-dimensional space. There are very clear proofs you can do if you care about linear algebra proofs. They make my head explode, so I'm willing to trust the people that are much smarter than me that have done the math.

But basically we're going to summarize everybody onto a board like this, and then we're going to take someone new, Bob in this example. We're going to throw Bob on the board by taking this greatly reduced simplification that SVD gives us, putting Bob through a mathematical process, and placing him on this board, and then see who he's similar to. We're going to use something called cosine similarity, which basically measures the angle that Bob has from the origin and finds people that are very close to the line that he draws. That line from the origin through Bob says this is Bob's space of what Bob likes, or a good approximation of it, and we find the closest people and say, I'm going to recommend things to him. I mean, that's amazing if you think about it: taking giant sets of users and ratings and collapsing them down into something where we can just throw Bob in against one simple matrix, spit out a number, and do a distance
calculation. It's amazing how that works. It is nuts.

So this also looks like the terminal vomited a bit. We're going to make matrices in the linear... we're going to do this real quick. The key part is right here: U, sigma, and V transpose. That's why those are named like they are, even though later on it transposes the VT. The magic happens here. There's actually a singular value decomposition method in the linear algebra gem; that's why you want it, because it saves you having to do matrix math by hand, and matrix math by hand is just stunningly nasty, as that La Conf exercise that Tamer did would demonstrate. What that does is basically give you a two-dimensional representation of this giant matrix. The singular value decomposition theorem is that any matrix can be decomposed into the U, the sigma, and the V transpose. Magic happens, and then you end up extracting a two-dimensional space from it. It's a black box, right? And unless you're writing the code for it, you don't really care about that black box. I hate to belittle the fact that there's really complex math in there, but it should be invisible, because you don't care about that. You care about the users. You just want it, right?

Anyhow, we're going to use that magical singular value decomposition method to get out the matrices we want. Then we're going to flatten it into 2D space, because a theorem says we can. Then we're going to take Bob's values, we're going to turn Bob into a matrix, and we're going to multiply that by the bits and pieces that we extracted from that other matrix using math. Then we're going to use more matrix math to magically calculate the cosine similarity by using normalized matrices and matrices dot transpose, and I'm just going to start from there. I mean, I can sort of follow the math, but that's the magic of SVD. And then you end up with being able to loop through
all the users that are similar to Bob and decide who's similar enough to him to recommend to him. So we end up here. I'm going to run through this real quick, and then we will wrap up and talk about questions.

SVD is the one that... I don't do a lot of recommendations at Treehouse, so everything I've done with SVD is kind of more, what's the best word to put it, more exploratory. What am I thinking of? Hang on. That's what it is. So what I'm doing, again, is figuring out similar users. I'm trying to figure out, based on the points they've earned in various courses, what course we should recommend to people based on who they are most similar to: whether we should put more of them in HTML, put them more in a CS track, put them more in a Ruby track, an iOS track, what have you. And the easiest way is probably just to run it. So we're going to get rid of this, we're going to sit over there, and we are going to ruby... okay.

So what we're doing (it's a little off screen) is taking all the users, and we use the same users for all of these things. Oh, it wrapped, sorry. You'll see in one line that a bunch of our users in this test never earned anything. They're mostly test users. If you put zeros in, zeros blow up to not-a-number, and zeros are useless for recommendations, because how could someone who's never done anything give you a valid recommendation? So there are math and logic reasons for throwing them out. So we've got 85 users left. We're going to get Bob's point scores. Those are actually my point scores on Treehouse because, like I said, I do a lot of our exercises. And the other users: I like that the most similar user, in one way, is an unsubscribed user from our test database, which gave me 0.999 similarity. And here's another user down here that's got 0.997. Passan, who runs our conference programs, is 0.996. Demo: demo is pretty similar to me, that's great. And for some reason our guy who does finance is really similar to me, which surprised
me, because he basically does accounts. I didn't actually... so that was exciting. And honestly I don't know who Luke is; I think Luke is an old employee, or fake, looking at those scores.

Anyhow, I put in my scores and I wanted to find the people that are similar. You see I have a whole bunch of zeros for tracks I haven't done, and the goal here is: what track should I do next? Because SVD is best at saying, if you haven't done this, this is one you'd like. Unsubscribed user is the most similar to me, and it's suggesting I start JavaScript, CSS, or WordPerfect (joking, I know it's WordPress). So it's suggesting that those are the tracks I should start, in order of which one I would probably like most. And now I know that I could tell the user, hey, the JS track might be where you want to go next, based on what you've done and the recommendations of all of our users.

So, SVD. We doing good? I'd give you a minute to run it, but I know linear algebra is just going to blow up for anybody who doesn't have it installed, and you can run it whenever you want now, which is awesome too.

So, to wrap up: the goal was Ruby to answer questions about your users and your business, and I want to make sure you left with some tools, because you gave up two slots, and seeing Sandy talk, to watch this, and fought with linear algebra, and fought with Wi-Fi, and we're back here in the furthest part of the dungeon in the basement that was possible. So I want to make sure you had something useful to take out of here.

I keep thinking about a black box. There are good black boxes and there are bad black boxes. A lot of machine learning, for our intents and purposes, especially if you're a Ruby person, can be a black box. You don't have to know the details of SVD if you can get it implemented right. You don't have to know how a neural network back propagation algorithm works, or how an ID3 decision tree works, if you can find one and use it. You have to know what the goal of it is and the foundation of
what kinds of questions you can ask with it but you don't actually have to implement the math and that's good because the math isn't what's exciting unless you work at MATLAB the math is what gets you to having a better plan right because we go back to our friends the underpants gnomes so we're at the very beginning a lot of times we've got all these users in our table and there's got to be a way to make more money from them at Treehouse we had a really interesting discussion about how to increase ARPS to plan what's the best way to increase your average revenue per subscriber if you're a subscription based business anybody raise the price thank you so after that got shot down we have multiple tiers we have silver and gold and the only way ARPS can go up if you can't raise the price is to move people from the lower tier to the higher tier and so the goal was to figure out what we can do to move people from a lower tier where they're paying a lower amount per month to a higher tier and so the way to start with that was figure out more information about our higher tier users we're transitioning from calling them Gold to Pro and I'm not sure which they are now but so our Gold or Pro users versus our basic silver users how do we get people to move from one to the other and before we can do that we need to understand more about them and we can offer discounts we can treat them as a big agglomerated mess of people who are all the same or we can treat them individually and the kinds in the same way as far as silver and gold does time zone make a difference because you know in foreign countries we have far fewer people in gold for instance so it was fascinating and the goal at the end of all this was to use some black boxes to have a better business plan so we can roll in money because rolling in money is usually what the goal is that keeps you employed and that is what I hope the tools that you have from this started with real quick I'm going to
do credits and then I'll do questions so thanks to the RailsConf team which is awesome for having me do this Jeff runs the workshops if you don't know Jeff gSchool sorry Turing.io is awesome it's also where Katrina works they're great people I work at Treehouse I have all sorts of Treehouse stickers three different kinds up here I like taking the train so I ended up coding a lot on the train on the way out here thank you Amtrak that is my contact info which will come back up no worries real quick recommendations because people always ask where to look and what to do with stuff O'Reilly has great data science books and they go on sale fairly regularly and they also have something called the data science tool kit which is like five books or seven books I forget however many it is they go on sale pretty regularly and I think all these are in it none of them talk about Ruby so just be prepared because Ruby is not the optimal language for this but you can read about SVD in one of these books and then go implement it or use one of the implementations of it in Ruby you can read about what an ID3 decision tree looks like and then come back and do it in Ruby so these books certainly have their place and these are some of the ones I found most useful I also really like Lean Analytics from the Lean Startup series even better if you like more class oriented stuff Coursera has that one machine learning class which is brutal uses Octave teaches you Octave which is kind of open source MATLAB and this week is actually backpropagation so I'm taking my pass because I'm here at RailsConf from doing the homework this week and Coursera has a couple other ones Johns Hopkins has a whole data science theme the first one is a gimme for anybody that's interested in it because basically by the end of it you've put a markdown document on GitHub so for most Ruby people the bar is pretty low rather than its data scraping and cleaning and then it gets into using R to do stuff there's
like nine parts it's one of the specializations they offer so they can actually get money so for like 500 bucks you can get a certificate for zero dollars you can take the same stuff and do the same homework and get all the same learning that horrible horrible horrible copy image there is the try R thing Code School has a try R that they did in association with O'Reilly so R is worth it just to appreciate Ruby and to try a Lisp and to actually be able to do some really really elegant mathematical calculations and plot some really horrible graphs so all those are tools of the trade you'll see them and then that's me I would happily answer questions you can hit me up on Twitter where you know I may or may not respond because I only Twitter at conferences I will be around today tomorrow I'll sit in here we can try to get the linear algebra gem installed if you're really upset that we couldn't do it it's really internet related for the most part and then the f2c thing so today Fortran C Python Go Ruby and then you also can go home with users that you can figure out their gender figure out their location and then one day when you get the linear algebra gem finally installed for real you can go ahead and do either SVD or K-means clustering K-means is so fun I mean it's ridiculous to talk about math like that I love seeing what you can do with crazy clusters you want to cluster your people into 100 alright what happens what do you find from that clustering them into two groups wow I didn't ever know that cluster them on crazy things like points earned and how long they've been a member doesn't seem so crazy maybe there is a correlation there all sorts of cool things you can do with it so I hope it was worthwhile thanks for skipping Sandy to do this thanks for skipping the other slot I'm answering questions that's me thank you
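
Editor's sketch of the K-means idea mentioned in the closing remarks: cluster users on points earned and how long they have been a member. This is a minimal plain-Ruby illustration, not the code from the talk and not the API of the linear algebra gem. The method names (`squared_distance`, `mean_point`, `kmeans`), the sample `users` data, and the choice to seed centroids with the first k points are all assumptions made for the example.

```ruby
# Minimal k-means sketch in plain Ruby (no gems required).
# All names and data here are hypothetical, for illustration only.

# Squared Euclidean distance between two equal-length numeric arrays.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Component-wise mean of a non-empty list of points.
def mean_point(points)
  dims = points.first.size
  (0...dims).map { |d| points.sum { |p| p[d] } / points.size.to_f }
end

# Returns [assignments, centroids]. assignments[i] is the cluster index
# of points[i]. Centroids are seeded with the first k points, which is a
# naive but deterministic choice (an assumption of this sketch).
def kmeans(points, k, max_iterations: 100)
  centroids = points.first(k).map(&:dup)
  assignments = nil

  max_iterations.times do
    # Assignment step: each point goes to its nearest centroid.
    new_assignments = points.map do |p|
      (0...k).min_by { |i| squared_distance(p, centroids[i]) }
    end
    break if new_assignments == assignments # converged
    assignments = new_assignments

    # Update step: move each centroid to the mean of its members.
    centroids = (0...k).map do |i|
      members = points.each_index
                      .select { |j| assignments[j] == i }
                      .map { |j| points[j] }
      members.empty? ? centroids[i] : mean_point(members)
    end
  end

  [assignments, centroids]
end

# Made-up rows: [points_earned, months_as_member]
users = [[120, 2], [150, 3], [90, 1], [4000, 24], [4200, 30], [3900, 22]]

assignments, centroids = kmeans(users, 2)
puts "assignments: #{assignments.inspect}"
puts "centroids:   #{centroids.inspect}"
```

In a real users table the columns would usually be rescaled first (for example to a 0..1 range), because otherwise the column with the largest numbers, points earned here, dominates the distance. Changing `k` from 2 to something larger is the "cluster your people into 100" experiment.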