As usual, all the slides begin with these Creative Commons notes about fair use and share-alike for the slides and the software. We're running about 20 or 25 minutes ahead of schedule, but we'll probably need that extra time for Module 4. Right now we're in Module 2, decision trees, which is a combination of a lecture and a lab. It's going to be about an hour and a half long, and then we'll give you a 40 to 45 minute break for a late lunch for some of you, or a regular lunch for those of us on the West Coast.

So, to highlight: we're talking about decision trees, with a bit of an introduction to why they're useful. Decision trees are used in classification, which is something that's frequently done in machine learning, and we're going to show you some examples of classification, meaning classifying nominal or categorical data. You can also use decision trees and random forests to do regression, which we'll talk about tomorrow. We're going to talk a little bit about classification and clustering, introduce decision trees in machine learning, and introduce the concept of entropy, normally called Shannon entropy, and the Gini index. These are both measures of information that are used to evaluate decisions as part of decision tree optimization. We'll talk a little bit about feature selection, and then introduce a classic problem that's widely used to help people understand the basics of both decision trees and machine learning: the iris classification problem. Irises are flowers, and we'll be looking at how to classify them based on their petal and sepal length and width. With that problem explained, we'll go through the details of the Python code for this particular decision tree, and then look at other types of data you can use with decision trees. It's also an important lab because it will be your first chance to fiddle around with Colab using some real code, so we'll try to give you a gentle introduction to Colab and the code.

So, as advertised, a little bit about clustering and classification. Clustering and classification are different in terms of the mathematics and the computing. Many of us lump the two together, but they are fundamentally different. Clustering is essentially a process of grouping objects that are logically similar: similar in shape, similar in color, or in other properties. In clustering the data is unlabeled; the classes are yet to be defined or labeled, and it's up to an unsupervised machine learning method or an unsupervised statistical method to perform the logical grouping, and that grouping then helps you classify or label things. Classification is done where you have advance or prior knowledge and the objects are labeled, so the sock is labeled as a sock, or the red ball is labeled as a red ball: some label that you, or some all-knowing being, has given about the object's properties or characteristics, and objects are assigned to categories based on those properties. So clustering is done without labels; classification is done with labels.
So clustering is fundamentally different from classification. Classification is typical of supervised learning; clustering is more typical of unsupervised learning. In supervised machine learning, we focus on algorithms that learn how to assign or group things according to class labels. The reason I'm introducing decision trees as essentially our first example is that they are among the simplest of all machine learning algorithms. They're generally simple to understand and simple to implement, and they're a useful method for both classification and regression. With a decision tree, the model learns to split, categorize, or regress the data based on a series of decisions, usually yes or no, or greater-than or less-than if it's numeric data, plus some inherent knowledge about the cost of those decisions. In the example on the right, survival is the measure of cost: if you survive, that's good; if you don't survive, that's bad.

As I mentioned before, decision trees have a structure with edges, which are the lines or arrows that connect the boxes, and the boxes, which are the nodes or leaves. For the survival of passengers on the Titanic, we have a root node at the top, which is gender: whether you were male or female was a key deciding factor in whether you survived. After gender there was a node on age, how young or old you were. If you were older than nine and a half, you generally died; if you were younger than nine and a half, you generally lived. But among those younger than nine and a half, if you had fewer siblings or spouses travelling with you, you generally survived; if you had a large family, it wasn't good news. This is the decision tree that was learned from survivor data in the Titanic manifest. It wasn't the decision tree the captain or the crew used, but clearly some decisions were made about women and children first.

The decision tree has a formal definition: it's technically a flowchart. The nodes, the boxes or circles, are tests on an attribute: what's the age, what's the gender, how many siblings? The branches or edges are the outcomes of those tests, the decisions. The paths from the root, the top node or top box, down to the final boxes are essentially the classification rules. We make decision trees all the time, and flowcharts are drawn all the time, which is why decision trees are intuitive to understand. But writing a program that learns a decision tree is not totally trivial.

Now, with decision trees we talk about classification or regression. Classification predicts classes or categorical values; regression predicts continuous values, like linear regression fitting a line, or nonlinear regression fitting a curve to your data. Classification works with nominal or categorical data; regression works with numeric values. Decision trees are formally called classification and regression trees, or CART. When you're building a decision tree, you're trying to decide which features to choose and which conditions to use for splitting. There are obviously many possible features among passengers on the Titanic: gender and age are obvious ones, but they could have chosen clothing, or first, second, and third class, or eye color or hair color.
Any of those things could have been deciding factors. But within a decision tree you also have to decide when to stop. Eventually splitting things, as they say, splitting hairs, becomes useless: if you're not separating any objects anymore, there's no point introducing more decisions. So the nodes, as I said, are decision rules or condition rules. In the case of "is this individual male or female", there's a yes or no. The green and red boxes are leaf nodes, which indicate a fate. And the edges, again, are the paths: you take one path, you hit another decision rule, you take another path, you hit another decision rule. The choice of male or female is pretty binary, but in the case of age it's no longer categorical, it's numeric. So should the split be at age greater than 10, greater than nine, or greater than nine and a half? Similarly for the number of siblings and spouses: should the threshold be more or less than two and a half, or more or less than three? Those numbers are determined by training on the data and also by testing on the data.

Decision trees use a lot of terminology borrowed from trees. There is a root; the root of a decision tree is at the top, whereas in a real tree the root is at the bottom. The root node usually represents the entire population or sample set, and eventually that root is divided into two or more homogeneous sets. Splitting is a term we use in decision tree theory for dividing a node into two or more sub-nodes. When a sub-node splits into further sub-nodes, it's called a decision node, so technically all of the nodes except the terminal nodes are decision nodes. A terminal node is one that doesn't split, one where no more decisions need to be made; that's the final grouping. Obviously, if a person has died on the Titanic, you can't separate further between dead and really dead; it's just a terminal node. There are also parent and child nodes: the parent node is the node above one or more child nodes, and sub-nodes are considered child nodes.

There are advantages and disadvantages to decision trees. Because decision trees are something we do all the time in life, they're pretty easy to understand and they're interpretable. Technically they're called white box models, whereas things like neural nets or deep neural nets, and even hidden Markov models to some extent, are called black box models because they're really hard to understand. It's possible to write out a decision tree as a series of sentences or as a paragraph, and people can understand that even if they're not mathematically inclined. Decision trees can handle both numeric and categorical or nominal data: we have the case of male/female, which is categorical, and then age, greater or less than nine and a half, which is numeric. The data can basically be brought in as is. They mirror our own ways of thinking, and there's essentially a form of feature selection built into the process, so you don't have to do prior LASSO feature selection or things like that. The other thing is that you don't have to normalize or transform the data. Recall that I talked earlier about the need for transforming or normalizing data, and that's particularly true for neural networks and other kinds of models.
But with decision trees, you don't have to do that. Now, there are disadvantages with decision trees. They're not the most robust: they can be susceptible to small changes in your training data set or training set size, although there are tricks called bagging and boosting that can fix this. It's a heuristic method, so it's not as mathematically robust as, say, neural nets are. There is a tendency to overfit, but again, if you do your proper testing and validation, that's not an issue. The decision tree algorithm is a greedy algorithm, so it's not guaranteed to give you the best solution, but random forests, which are a combination of trees, essentially a consensus of trees, largely fix the problem of finding optimal solutions. There's also some inherent bias when you have more and more categorical levels, so that's another weakness of decision trees.

So I'm going to show you how to learn a decision tree, and the term "learning" here means how to create the model; that's the language machine learning specialists use. The most common algorithm is called recursive binary splitting, or RBS, also known as the Iterative Dichotomiser. That algorithm went through three rounds: ID1, which was improved to ID2, which was improved finally to ID3, and ID3 is the algorithm used by just about everyone today. In this ID3/RBS method, all of the features in your table are considered, and different split points are tried and tested using a certain cost function; I'll talk about the cost function in more detail a little later. As you iterate through candidate splits between left and right, the split with the lowest cost, that would be the Gini index, or the highest information gain, that's the Shannon entropy, is the one that's selected. If there are three different features we're looking at, say age, gender, and number of siblings and spouses, then there are three features and therefore three possible splits to consider. You consider age and ask which split gives you the best information gain, you consider gender, you consider siblings and spouses, and you find out which one gives you the lowest cost or the highest information gain. In the Titanic survivors case, gender actually turns out to be the one with the highest information gain, or the lowest cost in terms of the Gini index; the next one is age, and then siblings and spouses. So those three features are sequentially ordered based on how well they predict who survived and who didn't.

Information gain is based on a concept called Shannon entropy. Some of you may have heard of this before; it's central to information theory and it's also used in sequence motif evaluation. It's a measure of uncertainty or disorder: if p_i is the probability of being in class i, the entropy is the sum over all C classes of −p_i × log2(p_i). Because you're taking a log, it takes a little more time to compute. Information gain is then the difference between the entropy of the data set and the collection of entropies of the subsets produced by splitting on a feature.
So it might be the entropy of the parent minus the weighted entropy of the child sets as you split them out. Generally, the split with the maximum information gain is the one at the root, and then sequentially the other splits with high information gain sit near the root, with less and less information gain as you go down the tree. Probabilities, by definition, have to be less than or equal to one, so a 90% probability is 0.9, a 10% probability is 0.1. Because these numbers are less than one, their logs are negative, so you put a negative sign in front to make sure the entropies come out positive. If you have a two-class situation, the maximum entropy is one; with four classes, the maximum entropy is two, and so on. And this is just showing you the calculation: with two classes you sum over class one and class two, the probabilities of being in one or the other are each one half, so minus one half times log base two of one half gives you one half, plus another one half, for a total of one.

Here's an example with three features and observations about cars. This is no longer the Titanic survivors; it's just an example we grabbed from the web, and it's a good one. The features are age, the age of the car; mileage, the number of miles on the odometer; and whether it has been road tested, whether someone has taken it out for a spin and found it to work. Based on whether the car is old or new, high or low mileage, and road tested or not, there were decisions about whether to buy or not to buy. Those are the observations: the features are the columns and the observations are the rows. The decision, buy or not buy, is a two-class problem, and there are four observations in total. We can calculate the entropy of the root node, the buy-or-not-buy decision, from the count of cars we bought out of the total number of examples. Two out of four should be bought, so p for buy is 0.5 and p for not buy is also 0.5. If we plug that into the Shannon entropy formula and remember that log2 of 0.5 is minus 1, each term comes out to 0.5, and 0.5 plus 0.5 gives you an entropy of 1. So we've calculated the entropy for the root node.

Now we're going to iteratively assess which feature has the best information gain. Is age more useful? Is mileage more useful? Is road testing more useful for making the decision to buy? We're going to look at each of the three features, age first, then mileage, then road tested, and evaluate their entropy and then their information gain. If we look at age, there were three cars with a recent age: in two instances the recommendation was to buy, and in one it was not to buy. In the case of old, there was one car and the recommendation was don't buy. So we can plug in the ratios for the recent group: the probability of buy is two out of three, 0.667, and the probability of not buy is one out of three, 0.333.
The calculation of the entropy for the recent group is shown at the bottom, with all the decimal places and the multiplication; we come up with an entropy of 0.918. The old case is a simpler calculation: there's only one instance, so the probability is 1, and 1 times log2 of 1 is 0, so the entropy for the old case is 0. So how do we determine the information gain? The information gain is determined from the weighted entropy of the child nodes, and those entropies are 0.918 and 0.0. Three of the four instances fell into the recent group and one of the four fell into the old group, so we weight the left node by three quarters and the right node by one quarter. We do the math and get a weighted average entropy of 0.688. Then we do the subtraction: the parent node had an entropy of 1, the weighted child entropy is 0.688, so the information gain for age is about 0.31. So again, it's a fairly detailed calculation, and we're trying to estimate which feature gives the most information gain: age, mileage, or road tested.

Now let's look at mileage. There were two instances of low mileage, one buy and one don't buy, and two instances of high mileage, again one buy and one don't buy. So you can probably already tell that mileage doesn't seem to make much difference to the decision to buy or not buy. We can calculate the entropy for low mileage and for high mileage, and in both cases the entropy is 1. We can then calculate the information gain: the weighted child entropy is 2/4 times 1 plus 2/4 times 1, which is 1. The root node had an entropy of 1, the children have a weighted entropy of 1, so 1 minus 1 gives an information gain of 0. Basically, mileage tells you nothing; it doesn't help you decide whether to buy or not to buy. Age, on the other hand, because its information gain wasn't zero, does tell you something: it suggests that a recent car is probably the better one to buy.

Okay, we've looked at age and mileage; let's look at road testing. There are two instances where the decision was buy and the car had been road tested, and two instances where the car wasn't road tested and the decision was don't buy. We can calculate the entropy for these, and as it turns out the entropy is zero for both the road-tested group and the not-road-tested group. So the weighted child entropy for those two nodes is zero plus zero, and the information gain is 1 minus zero, the entropy of the parent minus the entropy of the children. That means that, in terms of information gain, road testing is the most informative feature about whether to buy or not to buy. Recall that mileage had an information gain of zero and age had an information gain of 0.31. So we've now done our calculations, and this is how we choose which decision rule should be applied first at the root node, having looked at age, mileage, and road testing.
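To make the arithmetic concrete, here's a small Python sketch that reproduces the numbers we just walked through. The exact rows of the toy car table are reconstructed from the counts described above, so treat them as illustrative rather than the original slide data; the function and variable names are also just for illustration.

```python
import math

# Toy used-car table reconstructed from the counts in the lecture:
# four observations, three features, and the buy / don't-buy decision.
cars = [
    # (age,     mileage, road_tested, buy)
    ("recent", "low",  "yes", True),
    ("recent", "high", "yes", True),
    ("recent", "low",  "no",  False),
    ("old",    "high", "no",  False),
]

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)) over the classes present."""
    total = len(labels)
    ent = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        ent -= p * math.log2(p)
    return ent

def information_gain(rows, feature_index, label_index=3):
    """Parent entropy minus the weighted entropy of the child nodes."""
    parent = entropy([r[label_index] for r in rows])
    weighted_child = 0.0
    for value in set(r[feature_index] for r in rows):
        child = [r[label_index] for r in rows if r[feature_index] == value]
        weighted_child += len(child) / len(rows) * entropy(child)
    return parent - weighted_child

for name, idx in [("age", 0), ("mileage", 1), ("road tested", 2)]:
    print(name, round(information_gain(cars, idx), 3))
# Matches the lecture: age ~0.31, mileage 0.0, road tested 1.0
```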
We pick the feature with the maximum information gain, and road test has it, so that becomes our root node, or if you like, the first decision rule. Road test as the root node: yes, two instances; no road test, two instances. And we get clean separation. In fact, this decision tree boils down to a very simple architecture: start with the collection of four vehicles and ask, was it road tested or not? Yes or no, and both branches lead to terminal nodes. You don't need any more decisions, because you've determined the best rule for deciding whether to buy or not to buy. This algorithm could potentially be used in all future cases where people are trying to decide whether to buy a car: just ask whether it was road tested; don't look at the mileage, don't look at the age. Now, that's pretty naive, and it's not how most people buy cars, but it's an example.

Now, the Shannon entropy is, I think, the more intuitive measure, and it's typically used more often in decision trees. But if you have lots of data, lots of numbers, and, in our case, not a lot of time, we can use the Gini index as another measure for classification. The Gini index was originally used as a way of measuring wealth disparities between countries; it came out of economics and sociology, but it has since become quite useful for classification. It looks a little like the entropy, except there's no log anymore: it's 1 minus the sum of the squared probabilities rather than a sum of p log p terms. The Gini index is a number between zero and one, not unlike information gain, which is also between zero and one. The Gini index is abbreviated GI, and information gain is abbreviated IG. The Gini index has a value of zero when everything is in the same class and a value approaching one when things are randomly distributed across many classes. (In the original economic sense, a Gini of zero corresponds to perfectly equal wealth and a value close to one to extreme inequality.) Just as with the entropy, p_i is the probability of being in class i, so it has the same meaning. In terms of computation, calculating a log is expensive, but squaring a probability is fast, so the Gini index is faster to calculate. And with Gini indices, it's the low number that's best, whereas with information gain it's the high number that's best when choosing the root.

So you can choose either information gain or the Gini index. These are the measures that essentially do your feature selection; they're what decided whether you should use mileage or age or road testing, or, in the case of the Titanic, gender, age, and number of siblings. High information gain or low Gini index: those decide how you choose your features, which become the root node and then the subsequent decision nodes as you go down the tree. This natural way of selecting features is a way of reducing the amount of data. I talked to you earlier about this idea of feature selection, and it's really important. The nice thing about decision trees is that feature selection is an inherent part of the process.
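For comparison with the entropy function above, here's a minimal sketch of the Gini index calculation, 1 minus the sum of the squared class probabilities; the function name and the example labels are just illustrative.

```python
def gini(labels):
    """Gini index: 1 - sum(p_i^2). Zero for a pure node, higher when mixed."""
    total = len(labels)
    return 1.0 - sum((labels.count(cls) / total) ** 2 for cls in set(labels))

# A pure node versus an evenly mixed two-class node.
print(gini(["buy", "buy", "buy"]))        # 0.0 -> pure, the best possible split
print(gini(["buy", "no", "buy", "no"]))   # 0.5 -> maximally mixed for two classes
```

Note that, because there's no logarithm, this is the cheaper of the two measures to compute, which is the speed advantage mentioned above.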
As you build the decision tree, features that fall away from the root node and are no longer considered are essentially a way of pruning your tree, and I'll talk about that a little later. Entropy and Gini are formally a little different; this slide just plots their characteristic impurity curves, and they have basically the same shape, with the entropy curve just a little broader than the Gini curve. Mathematically they're so similar that there's not much to distinguish them in terms of utility, which is why you can use either; the Gini index just has the advantage of being faster to calculate.

Now, pruning is something I just mentioned, and on the right is a picture of what happens when you prune a tree. Pruning literally means cutting off branches, which thins out the tree, and a thinned-out tree usually looks better if you're an arborist. In the case of decision trees, it actually improves their performance: it helps eliminate problems of overfitting and it reduces the complexity of the decision tree. I think you're probably hearing some background noise; that's the buzz saw that's slowly demolishing our kitchen up here, so I'll try to speak louder so you can hear me over the sound. In terms of pruning, there are different methods, such as weakest link pruning and reduced error pruning. We're not going to go into them very much here, but they're available for more advanced decision trees.

This is just another picture of how feature selection is inherently done with decision trees. When we calculate the root node or the subsequent decision nodes, we're always doing information gain or Gini index calculations. Here I'm showing different colored features; they could be color, age, gender, siblings, mileage, whatever features we're considering. I've crossed out the features with the lowest information gain, and the ones with reasonably high information gain are the ones that are kept. In the case of the Titanic data set, a selection could have been done early on to measure the information gain of features like hair color, eye color, or first versus second class. That may have been done for this particular study, which would have allowed them to reduce the initial set of features to just these three: gender, age, and siblings. And that's why they were able to come up with a fairly simple and robust decision tree for predicting Titanic survivors.

This is how the data was tabulated from the passenger list or manifest. There were 1,317 passengers on the Titanic when it set sail. They identified who survived, and that's the thing you're trying to predict. Then they had information about whether each passenger was male or female, their age, and how many family members they had, which included both siblings and a spouse. You can see some had a family size of five, some had a family size of one, and some were travelling alone. Because we're trying to predict survivorship, that's the useful label, and the characteristics of the passengers, gender, age, and SibSp (the sibling/spouse count), are the useful features.
Now here's a table with a useless feature: the zodiac sign. We've taken each person's zodiac sign and put it in numeric form, converting Sagittarius to 12 and so on, all derived from their birth date. These are listed along with SibSp, age, and male/female. If we did an information gain calculation, we would find that the zodiac sign has essentially no information gain about who survived, so in that regard zodiac sign is a useless feature. And here's the measure, which in this case is actually the Gini index. For the Gini index, low numbers win and high numbers lose; for information gain, high numbers win and low numbers lose. The Gini index here hovers between 0.11 and 0.28 for SibSp, age, and male/female, whereas the Gini index for the zodiac sign is 0.98, so it's not useful. Through this information gain or Gini calculation we've essentially done some feature selection.

Now, the RBS or ID3 algorithm could in principle go on ad infinitum, so you have to have a way of stopping it from running forever. There are certain rules you can use. You can set a minimum number of objects per node: if you have just one object left in a decision node, or zero, then there's no point making another decision. You can also set a maximum depth. On the tree on the right, the tree starts with a couple of leaves, then four leaves, then maybe 30 leaves; the depth is the length of the longest path from the root. A lot of people just choose a number, maybe a depth of four or five. The depth should be a fraction of the total number of features. In the case of the Titanic, we had three features left after feature selection, and we went to a depth of about three; the depth shouldn't exceed the number of features, because it gets redundant. With the original Titanic data there were probably about 50 different features considered, and from that initial set of 50 it went down to just three, so again the depth is a fraction of the total number of features. There's a small sketch of these stopping rules in code below.

So that's the background on how we decide when to split, which features are relevant or not, how we use either information gain or the Gini index, and how that builds up the algorithm, the model, to produce a generalized decision tree. Of course, with the Titanic decision tree you could say, okay, the next time you have a large ship about to hit an iceberg, pull up this decision tree so we can decide who should live, following, I guess, the best rules or best practices. If you're buying a car, pull out the decision tree you've made, which says choose the one you've test driven. Again, these are kind of silly in the sense that they're not really intended to be predictive, but that's the point: you create a decision tree on some training data, you test it, and if it's robust, it can be used for future classifications. So the Titanic one would be the one used by captains whenever they're sailing ships.
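Here's that small sketch of the stopping rules, a minimal, illustrative helper; the name `should_stop` and the default `max_depth` and `min_size` values are assumptions for illustration, not the course code.

```python
def should_stop(node_rows, depth, max_depth=3, min_size=1):
    """Stop splitting when a node is too small, already pure, or too deep.

    node_rows: list of (features, label) pairs in the current node.
    max_depth is kept to a fraction of the number of features, as discussed;
    min_size is the smallest node still worth splitting.
    """
    labels = [label for _, label in node_rows]
    if len(node_rows) <= min_size:   # too few objects left to split
        return True
    if len(set(labels)) == 1:        # node is already pure, nothing to separate
        return True
    if depth >= max_depth:           # longest allowed path from the root
        return True
    return False
```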
So in this machine learning workflow, we're going to follow the six-step process: define your problem, propose a rough solution, construct your data set, transform your data set and select features, choose a model and train and test it, and then use it to make predictions on new examples. The example we're going to do here, which is a little more realistic than Titanic survivors or choosing a car, is iris classification. Iris flowers: how do you distinguish between different species? If you're a botanist, this is really interesting; if you're not, it's, I guess, a trivial example that shows you how decision trees work. So that's our problem: how do we classify flowers based on their shape and measurements?

Next, we need a training set. Where do we get this data set? This data set actually came out in 1936 from a famous statistician named Ronald Fisher, the Fisher's exact test that some of you might have heard of. He did a lot of statistical work, and he also pioneered the concept of linear discriminant analysis, or LDA, which is a precursor to partial least squares discriminant analysis and also, in a sense, a precursor to regression. The data set, which is used by many, many people around the world as a test set, has 50 samples of each of three different iris species: Iris setosa, Iris virginica, and Iris versicolor. Pictures of versicolor, virginica, and setosa are shown below. Setosa is an iris with very small petals and very small sepals; versicolor has much bigger petals and relatively big sepals, as does virginica, whose petals are generally a little longer than versicolor's. They're all purple irises, so color doesn't help. What you have to do is look at the length and width of the sepals, which are the broader outer parts of the flower, and the petals, which are narrower. By looking at the sepal and petal dimensions, length and width, you should be able to classify the flowers by species.

This is actually from the paper, so it's maybe a little faded, but this is the typed-up data. It was collected in Quebec by a botanist named Edgar Anderson; Fisher got the data and put it into his paper. Here are the four measurements, sepal length, sepal width, petal length, and petal width, for Iris setosa, versicolor, and virginica. And you can see right away, you don't have to be a computer, that setosa has a very, very small petal width and petal length relative to virginica and versicolor. On the other hand, the sepal lengths and sepal widths are about the same; virginica is generally longer than versicolor, although they overlap. Likewise, you can probably see a slight trend in petal length, with virginica being slightly longer than versicolor. As I say, you don't have to be a genius to figure out how things will separate here. But the fact that you, with your eyes and a little bit of math, can detect the pattern makes it a good test of whether the computer can also detect the pattern. So we have a data set of 150 flowers times four dimensions, sepal and petal length and width. We're then going to transform our data set into the most usable form: five columns, with the species, setosa, versicolor, or virginica, as the label.
I'm not putting all 150 rows (50 times three) here, just a few. The petal and sepal lengths and widths are now separated into their own columns, along with the species label, so this is a format that is very readable and very usable for a decision tree, or for a neural net or any other machine learning method.

Okay, so we've put our data together; we've done the first three steps of the process. We're now going to choose our model, and we've already decided we're going to use a decision tree. We could use a neural net or an SVM or a Markov model, any number of things, but this is a simple data set with a modest number of examples for training and testing, so a decision tree is probably a safe choice. You may recall I said you typically want a minimum of about 1,000 data points; here we're only using 150. The reason is that it's a really simple problem and the separation is very obvious. In some cases you can use machine learning when the separations are obvious; on the other hand, maybe you don't even need machine learning when it's that obvious. So this is purely for illustration purposes.

If we were to program this from scratch, and we're not going to make you program because there isn't enough time, you would go to the Colab website, select a new notebook from the File menu to create a new program, change the file name to something like "iris decision tree in Python", and then you'd have a notebook where you can start entering text and code, type in "hello world" and whatever else. Since we don't have time to break out into a lab and start coding, we're giving you the code instead. It's in Module 2 under the CBW machine learning folder, and you can use either the Python code or the R code, so you have choices; we're going to work with the Python. If you click on the Python code, it will open up in Google Colab. It's about a hundred and some lines long, and we're going to go through the program. It's broken up into about eight sections. There's a section where we read the data, the table in the format I just showed; we check the data to see if anything is missing; then we create our training and testing sets, using roughly the two-thirds/one-third rule, two-thirds for training and one-third for testing. Then we write a bunch of functions. The functions do the splitting, because that's what a decision tree is: making a decision, going left or right, yes or no. So first we need a function that does a split. Then we need a function that calculates the Gini index; we could have used information gain, but we're using the Gini index here. We need a function that determines the best split: do you split at nine and a half, or at ten, or at nine? We also need a function that says when to stop, a terminal node function that says, okay, you don't have to go any further, you've hit the point where there's nothing more to split. Then we write the splitting function itself, the decision tree proper, and that function is just called split. And finally, once we've got the trained decision tree, we need a prediction function so we can actually use it.
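Before walking through each section, here's a hedged sketch of roughly what the first few sections look like: reading the CSV with pandas, checking for missing values, and doing the 70/30 split. The file name `iris.csv` and the column layout are assumptions, not necessarily what the workshop notebook uses, and the full course script also imports NumPy for array handling.

```python
import pandas as pd

# Hypothetical file name / column layout matching the table shown earlier:
# sepal_length, sepal_width, petal_length, petal_width, species
iris = pd.read_csv("iris.csv")

# Check for missing values in any column.
if iris.isnull().values.any():
    print("There is a missing value in one of the columns.")
else:
    print("Data set is complete, no missing values.")

# Shuffle the rows, then split 70% for training and 30% for testing.
iris = iris.sample(frac=1, random_state=42).reset_index(drop=True)
cutoff = int(0.7 * len(iris))
train, test = iris.iloc[:cutoff], iris.iloc[cutoff:]
print(f"Created training set ({len(train)} rows) and test set ({len(test)} rows)")
```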
What we've essentially created with this decision tree is a program that could be used if we went back to the Gaspé Peninsula, or out to our backyard, with a whole bunch of irises and wanted to figure out what species we have: we just type in the petal and sepal lengths and widths, and the program tells us whether it's virginica, setosa, or versicolor.

So let's go through it. If you were typing the program, you would first import NumPy and pandas. Remember, I told you a little about NumPy and pandas: pandas is used for handling tables or data frames, and NumPy gives you nice math operations and array handling, so they're a useful set of library functions. Here's the code for reading the data. This is the structure of the data in the CSV or text file: sepal length and width, petal length and width, and the species. The "pd" is the pandas alias, and we use it to read the CSV file; we can then index the data with array slices, where the colon just means take everything to the end.

Once we've read in the data, we want to make sure it's valid: was there a mistake in the reading, is there missing data? This data check, which you'll see in essentially every program we use for the rest of today and tomorrow, is always there, and it just looks for missing values. If it finds a null in any column, it prints that there's a missing value in that column; if it sees nothing unusual, it prints that the data set is complete, no missing values. So we've now finished collecting the data; we haven't quite selected features yet, but we're moving on to our ID3/RBS decision tree model.

In this part, with the data read in, we create the training and testing sets: 70% for training, 30% for testing. We could also do a three-fold cross-validation, re-randomizing and taking another 30%, and another 30%. This is the code to split it. First we scramble the data to make sure it's randomized, so we don't have all the setosas in one group and all the virginicas in another; randomizing your data is important so that it doesn't bias your training set. Once we've randomized, we split it up: we take the length of the array, use the first 70% of the rows for one set and the remaining 30% for the other. So we've split 70/30 here, and we print out that we've created the test and training sets.

Now, the essence of decision trees, as I highlighted with the earlier examples, is to determine which features have the most useful information gain or Gini index. In this case we're looking for the lowest Gini index. Starting from all samples, we try each feature, petal length, petal width, sepal length, sepal width, and each cutoff, two centimeters, three centimeters, five centimeters, and so on, for each of them. It turns out that if you do this for all of them, you'll find one that works really well.
And that's petal length, with a cutoff of less than or equal to 2.45 centimeters. That one gives you a Gini index of 0.665, and the lowest Gini index, remember, is the best. With that split you separate out the setosas completely, and group the versicolor and virginica together. This is shown in the diagram on the right, with the red showing the setosas and the versicolor/virginica group on the right. That one separation based on petal length alone cleans up your data set very nicely; it's the most powerful separator. What's more, once you've separated out setosa, it's a pure node, so there's no more separation; it's a terminal node. The other branch is a mix of versicolor and virginica. It has a lower Gini index than the root node, but only because it's already been partially separated; it's still an impure node, a decision node, so it can be split further. What I've shown at the bottom is essentially what happens as we scan petal length and calculate the Gini index: somewhere around 2.4 or 2.5 you hit the minimum Gini index, and as you go too low or too high the Gini index starts climbing again. That sweet spot, the cutoff of 2.45 centimeters, is the first separation point.

So what we're going to do is write code to do that split calculation. First, we have to be able to identify split points, so we split the rows of data given a feature and a certain cutoff. This is just a way of testing and asking: what happens if I use a petal length of 1.2? What about 1.3? What about 1.4? It just determines how things split out. Now that we can split the data, we have to calculate the Gini index for each of these cutoffs. We have left and right groups, and we use this test split function a little later; we have the class labels, virginica, versicolor, and setosa, coded as 0, 1, and 2; and we have a check to make sure we don't perform a Gini index calculation on an empty group. This is what the code is; it's more comments than code, but it's simply calculating the Gini index. There are a couple of parts to it: we initialize a score and sum over the classes, because remember the Gini index is a sum. We take each probability p, multiply p times p, compute 1 minus the sum of the p squareds, and then weight by the size of each group. So this returns our Gini index.

Now that we've got the Gini index function and the test split function, those two are used to help build the get split function, which is a bit more code: a top part, a middle part, and a bottom part. For each feature we find the minimum and maximum value, and we increment the dimensions, petal length, petal width, sepal length, and sepal width, by 0.1 centimeters. So we're going to count up by 0.1.
We calculate the Gini index, and all that's really happening is that this calculation is done over all of the petal lengths, all of the sepal lengths, all of the petal widths, and so on. After that's done, we return the split with the lowest Gini index.

We could keep calculating Gini indices and splitting forever, so to avoid an infinite loop we have to stop growing the tree. We have a maximum depth parameter, the maximum number of levels in the tree, and given that there are just three classes of flowers, the depth we would choose is probably three. We also have to handle the situation where the data isn't split perfectly, in which case we just use the most common class value in the terminal group. So now we have all of the functions: the Gini index calculation, the test split calculation, the get split calculation, and the node termination function. We've also read in and formatted our data. Now we can put all of those functions together to perform the decision tree calculation. The decision tree uses the function called split, which uses information from get split and breaks things into left nodes and right nodes; the full set of code is shown across these three slides, and it recursively calls split. As it goes down, we check the depth and check the left child: if we've reached the maximum depth we stop the recursive splitting, and if not, we go to the left child and either perform more splitting or, if there are too few samples, stop. So we continue splitting on the left side of the tree, and then we process, or split, the right side of the tree. Remember, the root node goes left and right; if the left side hasn't terminated we keep splitting, and on the right side, if those nodes haven't terminated, we keep splitting. We keep running this splitting process until either we've hit the maximum depth or the left and right child nodes have no more data, so there's nothing further to split. That's the end game for the decision tree.

Once we've created our decision tree, you can use it, and the point, obviously, once you've got a model, is to use it. The intent here is a program that would allow you to go out to your backyard or to the Gaspé, measure petal and sepal lengths and widths, type them in, and have it automatically identify the species; you don't have to remember. This is the prediction function for the decision tree: it follows the decision path down the tree and outputs the predicted class. All of those functions, reading the data, the Gini index calculation, the get split function, the split function, and the prediction function, come to about 123 lines: 91 lines of code and 32 lines of comments. If you were to run this in Python it would take about one to two seconds, and about the same in R.
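The actual course code is in the Colab notebook under Module 2; as a rough reference, here's a minimal sketch of the same set of functions just described, test split, Gini index, get split, terminal node, recursive split, and predict, assuming each row is a list of the four measurements with the species label in the last position. It's an illustrative reconstruction, not the workshop's exact code.

```python
def test_split(index, value, rows):
    """Split rows into left/right groups on feature `index` at cutoff `value`."""
    left = [row for row in rows if row[index] < value]
    right = [row for row in rows if row[index] >= value]
    return left, right

def gini_index(groups, classes):
    """Weighted Gini index of a candidate split (lower is better)."""
    n_total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:                 # skip empty groups to avoid dividing by zero
            continue
        score = 0.0
        for cls in classes:
            p = [row[-1] for row in group].count(cls) / len(group)
            score += p * p
        gini += (1.0 - score) * (len(group) / n_total)
    return gini

def get_split(rows):
    """Try every feature and every observed value; keep the lowest-Gini split."""
    classes = list(set(row[-1] for row in rows))
    best = {"gini": float("inf")}
    for index in range(len(rows[0]) - 1):
        for row in rows:
            groups = test_split(index, row[index], rows)
            g = gini_index(groups, classes)
            if g < best["gini"]:
                best = {"index": index, "value": row[index], "gini": g, "groups": groups}
    return best

def to_terminal(rows):
    """Terminal node: predict the most common class left in the group."""
    labels = [row[-1] for row in rows]
    return max(set(labels), key=labels.count)

def split(node, max_depth, min_size, depth):
    """Recursively grow the tree until max depth or too few samples."""
    left, right = node.pop("groups")
    if not left or not right:
        node["left"] = node["right"] = to_terminal(left + right)
        return
    if depth >= max_depth:
        node["left"], node["right"] = to_terminal(left), to_terminal(right)
        return
    for side, rows in (("left", left), ("right", right)):
        if len(rows) <= min_size:
            node[side] = to_terminal(rows)
        else:
            node[side] = get_split(rows)
            split(node[side], max_depth, min_size, depth + 1)

def build_tree(train_rows, max_depth=3, min_size=1):
    """Build the tree from the root split, then grow it recursively."""
    root = get_split(train_rows)
    split(root, max_depth, min_size, 1)
    return root

def predict(node, row):
    """Follow the decision path down the tree and return the predicted class."""
    branch = node["left"] if row[node["index"]] < node["value"] else node["right"]
    if isinstance(branch, dict):
        return predict(branch, row)
    return branch
```

With these in place, you would call something like `tree = build_tree(train_rows)` once, and then `predict(tree, row)` for each new flower you measure.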
So what you want to do first is test on your training set. We could have used three different training sets and averaged over them, but for this single 70/30 split, 70% training and 30% testing, we're looking at roughly 105 training examples. In terms of observed versus predicted, we were able to come up with a model that, for these 100 or so examples, perfectly predicts which ones are setosa, which are virginica, and which are versicolor: the diagonal of the confusion matrix is all ones and the off-diagonal elements are all zeros. So that's really good, but that's on our training data set, so we're not done, because we now have to assess performance on our test data set, the 45 or so flowers we held out.

If we test on the test data set, you'll see we aren't quite perfect: the diagonal is not 1, 1, 1, it's 1, 1, 0.93, because we get some confusion between virginica and versicolor. We're 93% correct on versicolor, with a 7% error mixing up versicolor and virginica. That's still okay, and this is where you want to assess whether you've over-trained or under-trained. As a rule, you do slightly better on the training data and slightly worse on the testing data, and generally you want training and testing performance to be within about five or six percent of each other. For setosa-to-setosa the error is zero percent, virginica-to-virginica zero percent, and versicolor-to-versicolor a seven percent error, so averaged over all three classes it's about 2.3 percent. You're within the allowed error, and it certainly indicates we have not over-trained the system. This is really important, and a lot of people forget to do it: they just evaluate their training data over and over again and say, look, I'm perfect, but when they try it on new data it's abysmal. In this case we deliberately said, let's look at how we do on about a hundred flowers, where we do really well, and then let's see how we do on a holdout set of 45, where we still do pretty well, about 95 to 96% accuracy overall.

So we've now generated a machine learning model, a decision tree. The code is there to make predictions, because we've essentially validated it and determined that it's pretty accurate, so this model could be used by anyone out in the field, in the Gaspé or in your backyard, to determine what irises you have growing around you. It's written in what we'll call pure Python, so we're not invoking any particular Keras or scikit-learn modules; this is all pure Python, and there's also a version in pure R. It was trained on 105 flowers and tested on a set of 45. You could adapt this code for almost anything where you're doing classification; it's a general approach, and it uses the Gini index, which is a valid cost function for assessing splits. And as I said, if you're more comfortable in R, you can go to the website, which has the R code written out, but we'll be doing everything in Python, just as the illustration and to highlight what's being done algorithmically.
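To make the observed-versus-predicted table concrete, here's a hedged sketch of how you might tabulate the confusion matrix on the held-out test set, reusing the `train`/`test` split and the `build_tree` and `predict` functions sketched earlier; `pd.crosstab` with row normalization is one convenient way to do it, though the course code may tabulate it differently.

```python
import pandas as pd

# Assumes `train` and `test` are the data frames from the earlier split,
# with the species label in the last column, and the tree sketch above.
train_rows = train.values.tolist()
test_rows = test.values.tolist()
tree = build_tree(train_rows, max_depth=3, min_size=1)

observed = [row[-1] for row in test_rows]
predicted = [predict(tree, row) for row in test_rows]

# Row-normalized confusion matrix: the diagonal is the per-class accuracy,
# off-diagonal entries show which species get confused with which.
confusion = pd.crosstab(
    pd.Series(observed, name="observed"),
    pd.Series(predicted, name="predicted"),
    normalize="index",
)
print(confusion.round(2))
```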