Okay, I think I'll go ahead and get started. First of all, a program note: it turns out this talk was originally called Machine Learning for Fun and Profit, and there are like six other talks that are X for Fun and Profit. So the title of this talk is officially changed to Machine Learning for Insert Cliche of Your Choice Here.

I'm Chris Nelson. I'm with Gaslight; we do web and mobile app development. We're also doing a training class in Backbone and CoffeeScript in San Francisco in early December, so if you're into that, please check us out.

What I'm going to be talking about today is machine learning. Let's see if I can focus or narrow that a little bit. Yeah, that center's a lot better. Machine learning is a very broad topic, and briefly, the way it's defined, at least on Wikipedia, is as a branch of artificial intelligence that has to do with taking a set of examples, or a set of data, trying to learn what the rules are, and predicting outcomes based on it. There's a whole bunch of different algorithms to do that, and this talk is really more of a depth talk than a breadth talk. So I'm not going to survey all the different machine learning algorithms out there; there are quite a few. I'm going to drill in pretty specifically on decision trees, and even more specifically on a particular algorithm that implements decision trees, and go into how it works in detail. Hopefully that gives you a comfort level to know when it's appropriate to use decision trees, where they do really well, where they fall down, and how the algorithm actually works. I'll also give you some resources at the end, so if you want to go learn the other algorithms I don't cover, I can point you at something. That's roughly the plan.

Just a fair warning: there'll be some math. I myself am not a particularly mathy guy; I actually had to relearn some math I'd forgotten to get ready for this talk, and that was a good thing for me. I'll go through it in detail, and I don't think it's terribly difficult at all.

How this talk came about is an actual project I was on for a state government. The project was all about recommending home improvements to homeowners and figuring out what incentive programs they might qualify for. So it had a lot of different rules: I should replace the heating system if I have an old heating system, and this type of heating, and this particular efficiency rating, blah blah blah, then I might qualify for an incentive program. A lot of rules, and fortunately, when we came to the project, they were already expressed to us as Cucumber tables. That was a pretty nice situation to be in, actually: we already had customer-written cukes to start off. One of the rare situations where I've had that good fortune.

So we started in on the project, and we had rules. This is one of the simpler rules we had, about whether I should recommend the homeowner replace the pool pump. As you can see there, if I don't have a pool, obviously that upgrade is not recommended, as you might guess. If I do have a pool and it has an efficient pump, then I don't need to recommend it. If I have a pool and it doesn't have an efficient pump, then I should recommend the upgrade. So fairly simple, and as you might guess, it turns into fairly simple code.
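As a sketch, the corrected version of that slide's code might look something like this. Note it's written the way it should read (the slide, as I'm about to point out, had it backwards), and pool? and efficient_pool_pump? are hypothetical method names of mine, not the project's real ones:

```ruby
# A minimal sketch of the pool pump rule; pool? and efficient_pool_pump?
# are hypothetical methods on the property object.
def recommend_pool_pump_upgrade?(property)
  # No pool, or a pool with an efficient pump: no recommendation.
  # A pool with an inefficient pump: recommend the upgrade.
  property.pool? && !property.efficient_pool_pump?
end
```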
You know, on the slide I look at the property, and if it has a pool and it has an efficient pool pump... that's backwards, I'm sorry. If I have a pool and it doesn't have an efficient pool pump, that's the way it should actually read, as in the sketch above. Sorry about that.

So this, then, is a slightly more interesting example. This one is about replacing the lighting system in the property, and it's really based on two different things: what type of lighting system I currently have, and then, if I don't know that, how long ago the lighting system was replaced. That turns into slightly more interesting code. I look at the lighting type first, and if it's one of those types where I know I should recommend an upgrade, then I recommend the upgrade. If the type is "don't know", I fall back and look at how long ago it was installed to decide.

So again, not terribly difficult code to write, and I had the Cucumber tables to make sure the code was right and actually satisfied the rules. But I got done with a couple of these, and actually the guy I did this project with pointed out that that number is wrong: I had two down and something like 300 to go. A lot of rules to write with unit tests, and of course a very aggressive schedule to meet. So a little bit of sadness ensued.

Fortunately, I had an ace in my pocket, and that is: I in fact pair program with a wizard. So I said to the wizard, "Wizard, what think you of our dilemma?" And the wizard stroked his beard for a minute, banged his staff loudly on the ground, and said, "None shall pass." And then he said to me, "And also, you might want to look at decision trees." When a wizard tells you to do something, of course you do it, and that led me to learn more about decision trees, which I had either not learned about in CS school or completely forgotten about. So I had to dig in and learn, and that's how this talk came about.

So what are decision trees? In brief, let's go back to that table. A decision tree is basically a tree-based representation of the decisions I would make as I went through and implemented this table. If we just look at this table, it's pretty obvious what the decision tree for it is going to look like. It's going to look something like this. Ooh, and that's off the edge of the screen, that's so sad, but not too bad; I can tell you what it's about. Basically, I look at the type first, and if it's one of those types I already know the outcome for, I'm done. If it's "don't know", then I have to fall through, look at last replaced, and make the decision based on that. So that's just a tree-based representation of the table we were looking at a second ago.

But the interesting thing to point out: if I look at this as a human, it's really obvious how I should build that tree, where I should root it, and what I should look at first. A computer doesn't really know that, though, and there's actually more than one way to express the same table as a decision tree. I could also express it like this, where I look at last replaced first and then look at the type. It's just that there are a whole lot more decisions to make, and it's kind of silly; I would never choose to implement the logic like this as a human. But the important thing is that there really are multiple ways to build a decision tree for the same set of examples or rules.
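For reference, here's a minimal sketch of that lighting rule from a minute ago in plain Ruby; again, the accessor names are hypothetical ones of mine:

```ruby
# A minimal sketch of the lighting rule; lighting_type and
# lighting_installed_years_ago are hypothetical accessors on the property.
def recommend_lighting_upgrade?(property)
  case property.lighting_type
  when "screw-in incandescent", "screw-in CFL"
    true   # types we always recommend upgrading
  when "pin-base fluorescent"
    false  # already efficient, nothing to recommend
  when "don't know"
    # fall back to how long ago the lighting was installed
    property.lighting_installed_years_ago > 10
  end
end
```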
So of course, if I really want an algorithm to build those trees for me, it needs some way to figure out which attribute to look at first. It turns out there is an algorithm that's good at doing that, called ID3. It's one of the more popular algorithms for decision trees. It stands for (I had to look this up) Iterative Dichotomiser 3. I think that S could also be a Z, but if you're a collector of weird words, "dichotomize" is kind of a cool word. It means to split or classify something into two parts. It's up there with omphaloskepsis in my book as far as weird words to collect. ID3 was written by a guy named Ross Quinlan.

What it uses to figure out which attribute to put at the root of the tree, and recursively which attribute to look at next as you go down and build that tree, is a measure called entropy. The way I'm defining entropy for the purposes of this talk is a measure of how much variability I have in a given set. There's a formula it uses to measure this, and we're going to break it down and look at some examples. It looks a little scary, and maybe it's not scary at all to you, but I had to stare at it for a bit to understand what was going on. Basically, to calculate the entropy of the outcomes in that table, I loop over the different possible outcomes; in that set there are only two, "recommended" and "not recommended". For each one I take the frequency of that outcome times the log base two of the frequency of that outcome, sum those terms up, and negate the total (the logs are negative, so the negation makes entropy come out positive). So it's actually fairly straightforward. Excuse me, I've got to get some water so my voice doesn't give out.

So if we take our table again, we can calculate the entropy of the result column for that whole table. We look at the frequency of each outcome: "recommended" actually occurs five eighths of the time, so we take five eighths times log base two of five eighths, add the frequency of "not recommended", which is three eighths, times log base two of three eighths, and negate. We come up with a number, and that number is about 0.95 blah blah blah. That's the entropy of the whole result.

Entropy by itself is interesting, but it doesn't actually tell me which attribute to pick. I wanted to show you entropy so that when I showed you gain, it would make sense. Gain, basically, is a measure of the effect on entropy when I choose a given attribute to split up the table by, so gain is obviously based on entropy. The gain on that table for a given attribute is the original entropy of the outcome, 0.95, minus a sum. I'm sorry, let me get that right: I loop over the possible values for that attribute, and for each value, the term is the frequency with which that value occurs times the entropy of the results for the rows with that value. If that doesn't make sense yet, that's totally cool, because we're going to go through it in a little more detail and show exactly what it means, but that's the actual calculation involved. Let's take a look at it in more detail so it makes more sense.
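To make the two formulas concrete, here's a small Ruby sketch of both calculations. This isn't any gem's code, just the math as described, with the 0.95 number checked at the end:

```ruby
# Entropy of a list of outcomes: -sum(p * log2(p)) over each distinct outcome.
def entropy(outcomes)
  n = outcomes.size.to_f
  outcomes.tally.values.sum { |count| -(count / n) * Math.log2(count / n) }
end

# Gain for splitting rows on one attribute: the entropy of the whole result
# column minus the weighted entropy of each subset sharing a value.
def gain(rows, attribute, outcome_key)
  total = entropy(rows.map { |r| r[outcome_key] })
  n = rows.size.to_f
  total - rows.group_by { |r| r[attribute] }.sum do |_value, subset|
    (subset.size / n) * entropy(subset.map { |r| r[outcome_key] })
  end
end

# The result column from the talk's table: five "recommended", three "not".
entropy(["rec"] * 5 + ["not"] * 3)  # => 0.954...
```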
So, starting with the gain for type: we go through each possible value for type and calculate the entropy of the outcomes for that value. Yeah, we can read that okay. In this case it's pretty simple. We take pin-base fluorescent as the first value for that attribute, and we see that both of those rows come out "not recommended", so I only have a single outcome there. If I calculate that, the frequency of "not recommended" is 1, all the time, and 1 times log base 2 of 1 is 0. And "recommended" doesn't occur at all, so that's 0 times log base 2 of 0, which I think is negative infinity or something, but it doesn't matter, because I multiply by 0. So it's a lot of math to tell me something that's intuitively completely obvious: the entropy of a set where everything is the same is 0. There isn't any entropy; it's all the same. A lot of math to tell you what's intuitively obvious, but since computers don't have intuition, we need math instead.

I do the same thing for screw-in incandescent (it's just "recommended" instead of "not recommended", and the entropy is still 0), and the same for screw-in CFL, as you can see there. The only entropy I have to work with is over on "don't know", where "recommended" appears half the time and "not recommended" appears half the time, and if I calculate that out, the entropy is 1.

Then I go back and plug that into the original equation for gain that we talked about earlier. I take the proportion of each of the possible values for lighting type. For the first three the entropy is 0, and one quarter times 0 is still 0; for the last one (oops, I'm sorry, that should have been one quarter times 1 there) it comes out to 0.25. So the gain for lighting type is 0.95 minus 0.25, which is about 0.7. A higher number for gain is better: it means I've reduced the entropy by that amount. Makes sense so far, except for my typo?

All right. We probably already knew this from looking at the table, but now we have a way for the computer to figure out that it should start the tree at lighting type. And just to verify: the gain for installed-on doesn't reduce the entropy very much; it's actually 0.05, if you're interested. So it's obvious that lighting type is a lot better.

All this is really neat and cool, but how do we make practical use of all this goodness? There are a few different implementations of the ID3 algorithm in Ruby, and the first one we're going to look at is AI4R. AI4R is a gem that implements a lot of different artificial intelligence algorithms: there are implementations of neural networks, genetic algorithms, and classifiers, which is what we're talking about here, and it has an implementation of ID3. One of the cool things about this gem is that it will actually output code based on the tree it builds from the set of examples you pass it. We'll see what that looks like. I'll bump over to my editor now, at least I'll try to; first I have to break out of this mode. Okay, over in my editor, let me make that smaller, sorry. I don't really need to look at the tree. Can everybody read that okay?
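Using the entropy and gain sketch from above, here's that whole calculation run end to end. Note the rows are my reconstruction from the counts given in the talk (two rows per type, "don't know" split evenly), not the original slide:

```ruby
# Reconstructed example table, consistent with the talk's 5/8 and 3/8 counts.
rows = [
  { type: "pin-base fluorescent",  result: "not recommended" },
  { type: "pin-base fluorescent",  result: "not recommended" },
  { type: "screw-in incandescent", result: "recommended" },
  { type: "screw-in incandescent", result: "recommended" },
  { type: "screw-in CFL",          result: "recommended" },
  { type: "screw-in CFL",          result: "recommended" },
  { type: "don't know",            result: "recommended" },
  { type: "don't know",            result: "not recommended" },
]

gain(rows, :type, :result)  # => 0.704... (0.954 total entropy minus 0.25)
```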
So, yes, I'm on the right example. In order to use AI4R, we build a data set first, and DataSet actually has a convenient method to load a data set from CSV. Basically I have a CSV that's exactly what was in that table we were looking at earlier, I feed that in, and it builds an ID3 classifier for me. Then I have a method that takes that decision tree and evaluates a given property against it, and I define a set of examples over here so I can take a look at it in IRB.

So let's go do that now. If I hop over here and call evaluate on the replace-lighting-fixtures rule with my example data: if I use the pin-base fluorescent example, sure enough, it tells me that's not recommended. If I look at one of the other types of lighting, like a screw-in CFL, it tells me it is recommended. If I look at "don't know" plus an old install date, it tells me it is recommended. So it's doing the right thing.

But as well as just being able to evaluate an example, one of the cool features of this gem is that it will actually output that code. Let's look at what that looks like. I need to get at the decision tree in there, and then I can call get_rules, and it actually outputs Ruby code for me, comparing the lighting type (that elsif is a little cut off there). So if I actually have an object that implements methods for all those columns, that responds to lighting_type, I could literally take that get_rules output and do an instance_eval on a property object, and it would literally be writing my code for me. When I first ran across this and started using it, I was pretty blown away. But it turns out there's more to the story, so we'll continue on. Any questions so far?
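From memory, the AI4R side of that demo looks roughly like this. The CSV path and the example row are placeholders of mine, but load_csv_with_labels, build, eval, and get_rules are the gem's real API:

```ruby
require 'ai4r'

# Load the Cucumber-style table from CSV; the last column is the outcome.
data_set = Ai4r::Data::DataSet.new.load_csv_with_labels('replace_lighting.csv')

# Build the ID3 classifier from the examples.
tree = Ai4r::Classifiers::ID3.new.build(data_set)

# Evaluate one example: attribute values in column order, minus the outcome.
tree.eval(['screw-in CFL', 'less than 10 years ago'])  # => "recommended"

# And the party trick: emit the learned tree as Ruby code.
puts tree.get_rules
```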
So that's AI4R in a nutshell. Yeah, we already did that. So that's really awesome; however, there's a rather large caveat with the ID3 implementations that currently exist. To understand it, let's take a different permutation of our example from a minute ago, and imagine that for last replaced, instead of conveniently having "more than 10 years ago" and "less than 10 years ago", we now have numeric values for the year it was last replaced.

Think about what it's going to come up with now for the gain of last replaced. To compute the gain, it's going to take each possible value and look at the entropy of the outcomes for that value, and you can probably guess where that goes. What we're talking about here is called the entropy bias. The entropy bias refers to the fact that the algorithm is biased by default toward attributes that have a large set of possible values, or in this case really an infinite set of values. How this plays out: the entropy of installed-at for any single value (1990, in this case) is going to be zero; there isn't any. So it adds up all that zero entropy and calculates the gain as the total entropy of the set minus zero, which is always the maximum possible gain. So it's always going to split on that attribute first, even though that's not really predictive of anything, not really predictive of an outcome. It kind of falls over at that point and is not useful. Which, you know, made me a little sad, but I got over it.

How I got over it: I read and did some more research, and it turns out there is actually a solution to this problem. The entropy bias is really about how we handle continuous attributes, attributes over a continuous range rather than a set of discrete values, and what we need to do is discretize those values. It turns out there's a way to do that which is a little tedious but not particularly complicated. To handle a continuous value like installed year, we sort all the values, build the list of halfway points between each adjacent pair, and measure the gain from splitting at each of those points. The split point with the highest gain wins. All of a sudden we have a split point, and we can discretize every row into "less than the split point" or "more than the split point", and we're back in business as far as the rest of the algorithm goes; I'll sketch that in code in a moment. So, kind of cool.

Unfortunately, that AI4R ID3 implementation does not implement this feature. The algorithm actually comes from the same guy who did ID3; he did a later algorithm called C4.5. I don't know why it's called that, maybe the C is for continuous, but I'm really just making that up. One of the cool things it does have, along with some other improvements, is what we just talked about: being able to handle continuous attributes. So that's pretty cool, and it turns out there is an implementation of this in Ruby by a guy named Ilya Grigorik. He's spoken at some Ruby conferences; he's a super smart guy, and if you have a chance to talk to him, you should. I don't think he's here, or at least I didn't see his name giving any talks.
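Here's that split-point selection as a small sketch, reusing the entropy helper from the earlier sketch; the rows and names are made up for illustration:

```ruby
# Find the best threshold for a continuous attribute: sort the values,
# take the midpoint between each adjacent pair as a candidate split,
# and keep the candidate whose binary split yields the highest gain.
def best_split_point(rows, attribute, outcome_key)
  values = rows.map { |r| r[attribute] }.uniq.sort
  candidates = values.each_cons(2).map { |a, b| (a + b) / 2.0 }
  total = entropy(rows.map { |r| r[outcome_key] })
  n = rows.size.to_f
  candidates.max_by do |point|
    below, above = rows.partition { |r| r[attribute] < point }
    total - [below, above].sum do |subset|
      (subset.size / n) * entropy(subset.map { |r| r[outcome_key] })
    end
  end
end

rows = [
  { installed: 1990, result: "recommended" },
  { installed: 1995, result: "recommended" },
  { installed: 2008, result: "not recommended" },
  { installed: 2012, result: "not recommended" },
]
best_split_point(rows, :installed, :result)  # => 2001.5
```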
Anyway, he did a gem called DecisionTree, obviously enough, conveniently enough, and it has some nice features. One of the features is that it can actually output the tree it builds graphically, so that's pretty cool. It also, more importantly maybe, deals with continuous attributes correctly. And just as a little side fun thing, it happens to add an entropy method to the Array class. As I was prepping for this talk and doing these entropy calculations, I could just build an array, call array.entropy, and it would return a value. I don't know if that's all that practical unless you're giving a talk on entropy, but it's still pretty cool.

So let's see that in action. How do we jump out of here... I should probably start with the code for the DecisionTree example first, so it makes sense. This is using the DecisionTree gem. It's a little more work to set up initially, in that I have to take the two-dimensional array I get back from parsing my CSV myself, pop off the labels, and drop the last one. The labels in this case are lighting type, installed at, and result, but really I only want the labels for the attributes, so I drop result off the end, and then I just have lighting type and installed at in that labels array. The other thing I need to do is munge my data a little bit. The values are all strings coming back from parsing the CSV, so I convert the ones that are all digits into actual numbers. It's a little cut off, but there's a to_i over there at the end. So just a little bit of data munging, not much.

At that point I'm ready to feed the DecisionTree ID3 class the labels, the data itself, a default value, and a config object. That config object is all about telling it which attributes to treat as discrete and which attributes to treat as continuous. If you've ever looked at this gem, that config thing is brand new, it's in like 0.4, so you may need to bump your gem version. So I can tell it that lighting type is discrete and last replaced is continuous, and then I set up some example data and I'm in business.

Now let's see that work. I have a tree, and tree has a predict method, and I again can look at my example data: pin-base fluorescent is not recommended, screw-in CFL is recommended, and we can see it's working just as it did before. I also told you there's that Array feature: if we look at an array here, we can actually get the entropy, if I can type. So that's kind of cool. And the last cool feature is really that tree.graph call: if I dump that out, it writes a PNG file for me, and among other things that lets me see exactly what it did as far as building that tree. It split on lighting type, just the way it should, and then it figured out, based on the set of data I have, that the split point is 2003.5. Originally I had "10 years ago", so that's really close, and if I gave it more examples it would get closer and eventually land exactly on 10 years ago. So, pretty cool. I think that's all I really had to say about the DecisionTree implementation. Any questions on that?
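A rough reconstruction of that setup with the decisiontree gem. The CSV path and column names are mine, and the exact config-hash key format may vary by gem version, but the ID3Tree constructor, train, predict, and graph calls follow the gem's documented API:

```ruby
require 'csv'
require 'decisiontree'

# Parse the examples; pop the header row and keep only the attribute labels.
rows   = CSV.read('replace_lighting.csv')
labels = rows.shift[0..-2]   # e.g. ["lighting_type", "installed_at"]

# Munge: CSV values come back as strings, so make the all-digit ones Integers.
data = rows.map { |row| row.map { |v| v =~ /\A\d+\z/ ? v.to_i : v } }

# Labels, training data (outcome last in each row), a default value, and a
# config hash marking each attribute as :discrete or :continuous.
tree = DecisionTree::ID3Tree.new(
  labels, data, 'not recommended',
  lighting_type: :discrete, installed_at: :continuous
)
tree.train

tree.predict(['screw-in CFL', 2010])  # => "recommended" (illustrative row)
tree.graph('lighting')                # writes lighting.png via GraphViz

["rec", "rec", "not"].entropy         # the Array#entropy extension the talk demos
```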
Alright, so the last section is really about whether I have rules or whether I have examples. Decision trees in general are really good at taking lots of examples and inferring rules from them. But as we got farther along in this project, what we realized is that we actually had the rules themselves, and we didn't need anything as complicated as the decision tree algorithms. To show you what I mean, let's look at one of the later rules we had to implement. Oh, and that's a little bit cut off, but I think there's enough there to show what's going on.

What we really had for some of these later rules is a more simplified table: if the heating type is electric resistance, none of those other values, none of those other attributes, matter; the outcome is not recommended. The same goes for some of the rows later on. Basically, any time I have a blank in that table, it means that attribute doesn't matter for that row. Out of the box, the decision tree algorithms don't know how to deal with these blank values (I've talked about this a little with other people, and there might be a way to adapt them).

But what I realized is that the decision tree algorithms are really all about figuring out which decisions to make first, and in a situation like this you can use a really simple brute-force approach: you start with your example, step down the table, do compares at each row, and the first time you find a match, that's the outcome and you're done, as in the sketch after this demo. So I didn't actually need decision trees, although learning about them and figuring out where to apply them was very cool and I enjoyed it; I actually had a simpler problem. I looked around, and there might have been an implementation that did what I wanted already, but it was simple enough that I ended up writing a gem to do it myself. I called it decision table, because it's not really a decision tree algorithm per se. It's really for when you have rules expressed in a tabular format, rather than a bunch of examples you need to learn the rules from. The key point is that you already know what order to do the comparisons in. That's what those algorithms are really figuring out, the order of the comparisons, and if you already know it, you can just use a simple brute-force approach like I did.

We can look at this as well. We'll bop over to the decision table example. A decision table, again, can operate on a two-dimensional array that I parse from CSV. I have a simplified space-heating CSV that is basically exactly that table I just showed, and I feed that to my decision table rule set. At that point I'm ready to run my example data through it, and I have an evaluate method just like I did for the AI4R example. Oh yeah, sorry, I have a different set of examples in this case. I think I have a gas furnace, efficient: in that case, since it's already an efficient gas furnace, it won't recommend I replace it. An inefficient gas furnace: it will. A "don't know" that was replaced a long time ago: it'll recommend it. (What happened here? Sorry about that.) So it's working just as I expect, and it's just a simpler, brute-force kind of approach for when I need that.
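This isn't the decision_table gem's actual API, just a minimal sketch of the brute-force idea it's built on: walk the rows top to bottom, treat blanks as wildcards, and return the outcome of the first matching row:

```ruby
# First-match evaluation over a rules table. Each rule row holds one value
# per attribute (nil acts as a wildcard) plus the outcome in the last slot.
def evaluate(rules, example)
  rules.each do |*conditions, outcome|
    # A row matches when every non-blank condition equals the example value.
    return outcome if conditions.zip(example).all? { |want, got| want.nil? || want == got }
  end
  nil
end

rules = [
  # heating type,         efficient?, outcome
  ["electric resistance", nil,        "not recommended"],
  ["gas furnace",         true,       "not recommended"],
  ["gas furnace",         false,      "recommended"],
]

evaluate(rules, ["gas furnace", false])  # => "recommended"
```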
So, to kind of summarize some of the things I learned through all this. The first thing is: if you have a situation like this, where rules are expressed to you by the customer in a tabular format, don't, you know, write all the code yourself. You don't need to do that; there are several approaches that'll work. If you have examples and you need to infer the rules, something like ID3 or the DecisionTree gem is a great way to go. If what you really have is just a table of the rules, you can use something like decision table and turn that into a DSL, essentially. But what was really, really helpful in our situation is that we wrote a few of the rules by hand, and we had the test cases, so when we were ready to replace them with the decision tree algorithm or the decision table, we were able to verify it was still working as expected. In this case the unit tests were actually more valuable than the implementation code, because they allowed us to replace the implementation code with something better. So, a really strong case for unit tests.

The other thing is that you really need to understand the algorithms, what they're good for, and when they fall down. Initially, when I started looking at decision trees, they just sort of looked like a wonderful magic black box that would write my code for me, and as I dug in some more, it turned out the answer was "maybe, kinda". It's really important to understand how these things work rather than just plugging them in. And of course, the last thing: the simplest thing that can possibly work is always a good choice.

Some resources to look at: igvita.com is Ilya's website, and there are a bunch of good blog posts there on different AI algorithms. He goes into detail on things like Bayesian classification and singular value decomposition (I think I'm getting those words wrong), lots of different algorithms that are worth looking at. There's also a really good discussion of ID3 there with an implementation in Python. And last but not least, I actually have all my slides available on GitHub; this whole thing is done in reveal.js. I don't know how much time I have left, but it looks like, yeah, I do have a few minutes for questions, if there are any.

Using decision trees, can you handle noise in the training data? Like, if you have a few incorrect results in there, will it deal with that?

You know what, I'd have to give that a try and find out; I don't know the answer to that off the top of my head. Oh, I'm sorry, to repeat the question: would it deal with noise in the data? Would it figure out that some rows had values that were incorrect? [Audience suggestion] I'm sorry, I didn't hear that. [Partly inaudible] I think it would have a better chance, if it was a continuous attribute rather than a discrete attribute, of continuing on and working through that situation. I don't know what it would do if a discrete attribute had an outcome that was wrong.

Are these pruned decision trees? I don't know the answer to that, I'm sorry. Pruned decision trees? Yeah, the question was about pruned versus unpruned decision trees, and I do not know the answer to that.

Well, if there are no more questions, thanks a lot.