We're now in module two, and we're going to focus on decision trees. We've got about an hour and a half, or maybe an hour and twenty minutes, to cover this one. It's a mixture of a lecture and a lab. The way we'll do a lot of these is that we're not just going to have labs alone; we're always going to have a bit of a lecture, and then we'll allocate about 15 or 20 minutes for people to do the lab. We're not going to make you code; otherwise, I think we'd be sitting and watching all day as people tried to figure out how to code. A lot of the code has been given to you, so what you'll mostly be doing is running programs, playing around, and looking at things just to get a feel for what's there. I'll explain that later on. So this is a combined lecture and lab module on decision trees. We've seen a picture like this before, when we talked about classification and machine learning. We might have labeled data: here's the red group and the blue group; we could call them Republicans and Democrats if you want. We've run a machine classifier that has now separated them, and we can see how distinct they are. Machine learning, or any classification algorithm, which includes things like logistic regression, principal component analysis, partial least squares discriminant analysis, SVMs, and neural nets, can do classification. They can take labeled data, which initially looks really confusing, and pull the groups apart and separate them. So today, at least for this module, we're going to look at: what is classification? What is clustering? We're going to talk about decision trees, which are a form of machine learning, and we're going to talk about how you can convert a decision tree into something a little more mathematical using information gain, Shannon entropy, and the Gini index.
Then we're going to talk about feature selection, and we're going to show how you can use this to do classification for flowers, a toy problem that introduces a lot of people to decision trees, and it's actually a really nice one. We're going to go through the Python code for a decision tree to separate irises, which are the flowers we're looking at. And then you'll be able to use the Colab for your first lab, and we'll see how you do with actually running the programs we're going to provide to you. So, diving right in: clustering is something different from classification. They are not the same. A lot of people think they are, but they're different. Clustering is a way of grouping objects that are logically similar. Clustering is like matching socks after you've washed and dried them and they're all separated and now you have to match pairs of socks. The object classes haven't been defined or labeled. To do your sock clustering, if you were blind and couldn't see the colors, you'd probably have to do it by shape or size, hoping that they all pair up by size and you can tell. If you had vision and could see color, then you could start matching the socks by color, though you might still have to match by size; you have to figure that out yourself. You might use some parameter to help with that clustering. You could do the same thing with clustering balls, clustering toys, or clustering paintings. Classification is something where the objects are labeled; essentially, you're grouping by the label. So we have the blue and the red objects, the Democrats and the Republicans, and they're categorized based on their properties. In supervised machine learning, which is the most common kind, we use algorithms to essentially perform that classification, and in some cases to help with assigning class labels to new data.
So here's a bunch of new individuals; we don't know what they are, whether they're Republicans or Democrats. We can run our machine learning classifier and say, oh, this person is a Democrat, or this person is a Republican. Decision trees are the simplest thing to implement for machine learning. They're easy to understand, they make sense, and they produce a rationale you can follow. You can use decision trees for classification, which is what they're primarily used for, but you can also use them for curve fitting or line fitting, which is regression, so you can use them for numeric data. In the machine learning approach to decision trees, the computer or the model learns how to split and categorize, or, in the case of regression, how to fit or regress the data, based on decisions. It might use a greater-than or less-than number, it might use a yes/no, and it evaluates the cost of those decisions: whether this is a good split or a bad split, a good decision or a bad decision. The classic example of a decision tree is the Titanic sinking: what do we do? Women and children first. Looking at the data that was collected, we can learn how those decisions were made, and we can see that most women survived the Titanic. Of the men who did survive, they were usually younger, and they were usually traveling with siblings or spouses. If you were single, male, and old, you usually died in the Titanic sinking. You can see that there are boxes marked with things like gender, age, survived, the age cutoffs, the decisions; and we have branches, which are the lines, and those are called edges. So that's the tree-like structure in a decision tree: seven nodes and, I think, six edges. The definition of a decision tree is a flow chart in which each internal node represents a test on an attribute: male or female, old or young, siblings or no siblings. And each branch represents the outcome of that test.
Are you male, are you female, are you young, are you old, are you with a family or not? And there's a leaf node that represents the class label; in this case, survived or died. The path from the root, which is at the top of the tree, to each of the leaf nodes gives the classification rules. There are two types of decision trees. The classification tree is the classic one; it's for classification. Once it's constructed, it can predict or classify things. It could predict, if there's another sinking of another ship off Newfoundland, what you're supposed to do: the ship's captain could call up his decision tree and say, okay, these are the people that should get on the rescue boats, and these are the ones that have to go down with the ship. Same sort of thing for deciding whether you're Republican or Democrat: did you vote for Trump or not, how old are you, are you male or female, are you college educated or not? Those would be decision points in a tree for deciding whether you're Republican or Democrat. The regression tree is not for classification; it's for handling or fitting numbers, for curve fitting, and you can use decision trees for that as well. The official or formal name for decision trees is the classification and regression tree, or CART, because it can be used for both. Most people don't use decision trees for regression, which is a shame, because they can be very useful there. So, when you build a decision tree, you have to decide which features, which conditions, to split the groups on, and you also have to decide when to stop. If there are no more people left on the boat, then you don't need to keep deciding who should get into the rescue ship; or if you've classified your whole collection of people as Republican or Democrat and there's no one left, you can stop.
So for this decision tree we had to ask: are you male or female? Yes or no. Are you greater than nine and a half, or less than nine and a half? Do you have siblings and spouses? Yes or no. Those are the questions that are asked. What's marked in black is the decision or condition: are you male or female, are you young or old. Then there are leaf nodes, and those give the result. If you were female, you survived as a rule. If you were male and older than nine and a half, you generally died; if you were a young male with family, you also generally survived. And the edges are the lines that connect the final results to each of those leaf nodes. Now, terminology in decision trees. The top node is the root node; that's the entire population. All voters in the US: that's the root node. All passengers on the Titanic: that's the root node. Eventually, if you're doing classification, it gets divided into the different groups. Splitting is another term, and it means dividing a node into two or more sub-nodes, so you can split into two, three, or four if you want. There's a decision node, and that's when a sub-node splits into further sub-nodes. There are terminal nodes, where the end is reached: if you're female, all women survived, or most of them did; but for males, there were lots of other decisions about whether you're young enough or old enough. We also call nodes parent nodes and child nodes. Nothing to do with the passengers on the Titanic; it's just tree terminology. The root node is typically a parent node. Later on, the male node was divided into other groups, so it was also a parent node that was then broken down into: are you young or old? So decision trees are pretty easy to understand.
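To make the flow of those decisions concrete, here's a minimal sketch of the Titanic tree as a plain Python function. The function name is made up, and the thresholds (age 9.5, siblings/spouses 2.5) follow the classic published version of this tree; treat them as illustrative rather than as the lab's code.

```python
def titanic_prediction(sex, age, sibsp):
    """Toy decision tree for the Titanic example.

    sex: "male" or "female"; age: years; sibsp: siblings + spouses aboard.
    Thresholds are illustrative, taken from the classic published tree.
    """
    if sex == "female":
        return "survived"   # root split: women generally survived
    if age > 9.5:
        return "died"       # older males generally died
    if sibsp > 2.5:
        return "died"       # young boys from very large families died
    return "survived"       # young boys with small families survived

print(titanic_prediction("female", 40, 0))  # survived
print(titanic_prediction("male", 40, 0))    # died
print(titanic_prediction("male", 8, 1))     # survived
```

Each `if` is an internal node testing one attribute, and each `return` is a leaf carrying the class label, which is exactly the flow-chart definition above.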
I think almost everyone should have understood how we did the Titanic separations, women and children first, and how the decisions were made. This is actually from real data that was collected after the Titanic and evaluated for who survived. You can see how the decisions were made; it wasn't a black box, it wasn't a neural net. So we call these white boxes, which means you can understand them; they make sense. The decision tree examples I've been giving are mostly categorical, but you can do it for numbers too. You don't have to do data transformation, you don't have to do data normalization, you don't have to do data scaling. It mimics or mirrors how humans think and how we learn: a lot of the learning and approaches we take to life really do involve decisions, where we make choices and assess the costs of those choices. Decision trees also have a kind of built-in feature selection, so you can have all kinds of garbage data and the tree will do the feature selection on its own. And as I said, you just don't need to normalize or do any statistical fixes. On the downside, decision trees are not the most robust machine learning method. The decision tree's descendant, the random forest, is pretty robust, but a decision tree on its own is not as good as an LSTM or a GNN. Small changes in the training data set can screw up the whole decision tree. There are tricks called bagging and boosting that work around that. It's kind of a heuristic method, though there are robust evaluation functions, the Gini index and information gain. It can be prone to overfitting, and it's not guaranteed to give the best solution; this is where random forests can fix things. These trees are not mathematically proven to be optimal, because the method is what's called a greedy algorithm, and it can keep on splitting into some fairly ridiculous categories, so it's not always as intuitive as we might like. Still, we can understand the Titanic model and the decision tree approach there.
Someone told us the rule: women and children first. But how do you learn a decision tree? That's a little different. The formal way of learning a decision tree is called recursive binary splitting; it's also called iterative dichotomization. So here I've taken a tree and turned it upside down, with the roots in the pot, and there's the root node, and you can see the leaf nodes. Recursive binary splitting splits two ways at a time. You start off with all the features, and you start trying different splits: okay, should I split by age first, or by male and female? Should I split by their zodiac sign? Maybe that was how they actually decided who was going to be saved. And then you test and see how well you did in predicting or determining how the Titanic survivors were selected. The split with the lowest cost, or equivalently the highest information gain, is the one that's selected. From our decision tree, if you choose gender first, that probably gives the most information: that's how they initially grouped people and said, okay, all women on this side, and if you're a young kid we'll also throw you in there too. If we have three different features, age, family size, and gender or sex, then there are three candidate splits: do we split by age first, by gender, by family size, or by zodiac sign? In each case we calculate the cost of that split, and that could be the information gain, the entropy loss, the Shannon entropy, or the Gini index; those are all ways of calculating that cost. Then we repeat that calculation for the other splits, and again at each level, as we keep on cutting through with this recursive binary splitting. Okay. Information gain, or IG, is based on entropy, which, if you come from physics, is a measure of uncertainty or disorder.
The entropy here is called Shannon entropy, which was developed for information theory. Entropy is based on the probability of being in a class: entropy is the sum, over the classes, of minus p·log₂(p). You can calculate the information gain for an entire data set, and you can calculate the information gain for a specific feature, and when you subtract the two from each other, you get the information gain for that specific feature. So is it age, gender, or number of siblings? This tells us which feature, which attribute, in the Titanic case age, gender, or family size, gives the maximum amount of information about how to classify or split things. In Shannon entropy all the probabilities have a value of one or less. You need the negative sign because the log of anything less than one is a negative number. If you have two classes, the likelihood of being in any one of them is one half, so the entropy summed over the two classes is minus one half times log base two of one half, summed over the two states. That's minus one half plus minus one half, each multiplied by the negative sign, which gives you an entropy of one. You can do the same calculation for four classes: now it's minus one quarter times log base two of one quarter, summed over four states, which gives two. So you get different maximum values depending on whether there are two, three, or four classes. Now I'm going to take an example. This one's not from the Titanic; this one is about buying cars. We're talking about cars that are old or recent, cars that are low mileage or high mileage, and cars that have or haven't been road tested, where this person has driven them or not. And there are recommendations associated with this: if it's a recent car, low mileage, and road tested, you should buy it.
If it's a recent car, even with high mileage, and it's been road tested, you should buy it. But if it's an old car and it hasn't been road tested, even if it's low mileage, the recommendation is don't buy. Same with a recent car, high mileage, that hasn't been road tested: don't buy. So this is the table of data, and we're going to try to come up with a decision tree that would learn this, so that if you came up with another example and said, okay, what if I have an old car with high mileage that has been road tested, it could make a recommendation. This is the training data that we build our decision tree with. So we have to start doing some math, because we have to calculate the Shannon entropy using that formula we saw before. The first thing we're going to do is calculate the entropy for the root node, which is buy or don't buy. We take the number of cases where we buy, which is two out of four, and the number of cases where we don't buy, which is two out of four. We can plug in the numbers, remembering that log base two of 0.5 is minus one and log of one is zero, and we can calculate the entropy for the root node: it's one. From that root node, we're going to look at the age of the car. There are three recent cars and one old car. Of the recent cars, two we recommended buying and one we didn't; the old car we also said don't buy. When we split this node based on age, we can calculate the entropy of each child, using recent versus old as the split decision. Based on the number of instances, we can plug them into the p·log(p) formula; I've done those calculations at the bottom. The recent branch still has some uncertainty left: its entropy is 0.918. The old branch has an entropy of zero; it's pure, so it tells us the answer completely.
We can calculate the information gain by taking the entropy of the parent node, which was one, and subtracting the weighted average of the two child nodes. One child had an entropy of zero and the other had 0.918. We could weight them one half and one half, but in this case three of the four instances were recent and one of the four was old, so we weight them three quarters and one quarter. That gives a weighted average of 0.688, and we can say the information gain from using age as a discriminator is one minus 0.688, which is about 0.31. Not great, but there's information there. Okay, so what if we tried mileage as our first choice to split on? In this case we've got information about low mileage and high mileage, and it was kind of random: half the low-mileage cases we buy, half we don't. We can calculate, based on the instances, the entropy for each of them. The entropy for the low-mileage branch is one, the entropy for the high-mileage branch is one, and the weighted average of the child node entropies is one. The information gain is one minus one, which is zero. So this feature is completely useless: there's no information in mileage playing a role in our decisions, and it won't make a good split. Then there's the road test. You'll notice that in every case, if we road tested, we bought, and if we didn't road test, we didn't buy. You can calculate the entropy for road testing: using that formula, we get a zero entropy for the road-tested branch and a zero entropy for the not-road-tested branch, so the weighted average is zero. The information gain is the parent node entropy, one, minus the weighted child entropy, zero, so we get an entropy gain or information gain of one. This is perfect.
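The whole car calculation can be reproduced in a short script. The rows below encode the four training cars from the example; `entropy` and `information_gain` are hypothetical helper names, not functions from the lab code.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(rows, labels, feature):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(labels)
    weighted = 0.0
    for value in set(r[feature] for r in rows):
        child = [y for r, y in zip(rows, labels) if r[feature] == value]
        weighted += len(child) / n * entropy(child)
    return entropy(labels) - weighted

# The four training cars from the example
cars = [
    {"age": "recent", "mileage": "low",  "tested": "yes"},  # buy
    {"age": "recent", "mileage": "high", "tested": "yes"},  # buy
    {"age": "old",    "mileage": "low",  "tested": "no"},   # don't buy
    {"age": "recent", "mileage": "high", "tested": "no"},   # don't buy
]
labels = ["buy", "buy", "dont", "dont"]

for f in ("age", "mileage", "tested"):
    print(f, round(information_gain(cars, labels, f), 3))
# age 0.311, mileage 0.0, tested 1.0 -- road testing is the perfect split
```

The numbers match the hand calculation: age gives about 0.31 bits, mileage gives nothing, and road testing gives a full bit of information.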
This one fully separates these things, and it should probably be our first choice for the decision tree. I've just written down the information gains: road testing got an information gain of one; mileage got an information gain of zero, no information; and age tells us a little bit but not as much as road testing. The maximum information gain is for the feature road testing, so this would be the root node, and this is how we distinguish them: do we road test? Yes or no. And that determines whether we buy or don't buy. Now, information gain using Shannon entropy is formally correct, but it involves calculating logs, and logs are expensive to calculate on a computer. So you can use something called the Gini index, which is actually used by economists to measure things like wealth disparity, but here it measures the probability of variables being wrongly classified. Like Shannon entropy, it can range between zero and one, but it's the Gini index, GI, instead of information gain, IG. It gets a little confusing: the GI is zero when everything belongs to a certain class, and approaches one when things are random. So with IG, high information gain is good; with the Gini index, low GI is good and high GI is bad. It's similar in that we use probabilities, the same pᵢ as in the information gain and entropy method, but it's not using logs, just p-squared terms. Different people and different algorithms will use either information gain and entropy or the Gini index; both can be used. Highest information gain is good; lowest Gini index is good. Those are the splits that are placed at the root of the decision tree; that's how you decide how features are ranked, and then you rank things subsequently below that. So say you have one split with an information gain of 0.9, or a Gini index of 0.1.
That's the one that becomes the root node. Then you might get another split with an information gain of 0.4, or a Gini index of 0.3; that might be your second one. And then you might have another node with an information gain of 0.2 and a Gini index of 0.8; that would be the third layer of your decision tree. It's a way of doing feature selection, a way of getting rid of things that are useless. In the case of the car classification, it was mileage that was useless. As I said, it's an automatic way of doing feature selection and, essentially, of pruning the tree. Here are the mathematical functions: the information gain, which is shown in green, and the Gini index. There's also something called the impurity index; they are slightly different in terms of their shape, but they're close enough, and both are totally valid. Now, when you're running machine learning algorithms, you also want to make sure they're not producing trees that are too complicated. This is called pruning: getting rid of branches that are of low importance. Pruning actually improves decision tree performance; it's known to reduce overfitting, and obviously it makes the tree less complicated. There are methods called reduced error pruning and weakest link pruning; these are applied to decision trees to clean them up. In essence, these methods just look at all the calculated information gains. So if there are seven ways of splitting, you evaluate all of them and only include the high information gains: say, anything with an information gain above about 0.4, or a Gini index below about 0.3. In the case of the Titanic, there were just three features that they eventually decided on; they probably had other things they could have chosen. Among the columns, there's the passenger ID, which is kind of useless, so that's not a useful feature, and survival is what you're trying to predict.
The other features were: are you male (zero if not, one if you are), your age, and the size of your family. In terms of the actual data, there were 1,317 passengers on the Titanic, and all of them have the information: age, male or female, and family size. Now, if you also had data on the zodiac sign they were born under, well, I think zodiac sign would be a useless feature: no information, and therefore no information gain. You could determine from your Shannon entropy that zodiac had nothing to do with survival, whereas for age and male/female, in terms of the ranking, male/female had the highest utility, age was second, and family size was third. So here is our feature selection; these are what the Gini indices were, and in terms of information gain it's roughly one minus those. The lowest Gini index, the best, was sex; next best was age; last was family size; but zodiac had a Gini index of 0.98, or an information gain of pretty much zero. So this is how you make your feature selections; it's how you decide where your nodes will be placed: which is the first node, the second node, the third node. You also have to have a reasonable number of training inputs and consider the number of objects affected: if you have a decision tree that splits three objects into three different categories, it's probably not a very useful thing. Sorry, I think Angel has her hand up; she wants to ask a question. Sure. Thanks, maybe this is a silly question, but I was wondering: suppose on the Titanic there were a lot more children; would the Gini index for age have gone down? Like, would it become more important to classify by age before classifying by sex, just because there were so many more kids? Yeah, again, I don't know if I can do the math in my head, but I suspect so.
If you had a different population than what was actually on the Titanic, or what was known to be on the Titanic, then yes, I suspect the Gini index for age would have gone down. But again, it's based on the historical data that you train on. I understand; I'm just trying to understand how that would factor into the model. But yeah, thanks. So, you decide the number of inputs you want to put in. You also have to decide on the maximum depth of your model: is it going to be a three-layer decision tree, a four-layer, a 200-layer? The depth of a tree should be small: two to three, maybe a maximum of four layers. There's no hard and fast rule, but you'll see most decision trees typically have a depth of three or four. Feature selection can help you discern which depth you want. So now we're going to take a real example, and we're going to follow the workflow that I described last time, with its six steps. The six steps are: define your problem and suggest a solution; once you've described your problem, construct your data set; then do data transformation and feature selection; then choose your model, whether it's a decision tree, a neural net, or whatever; then test and validate your model; and finally you say, okay, the model's ready, you release it and use it. In this case, the problem is: how do I classify iris flowers in my area of the country, based on their floral dimensions? It's a well-defined description of a problem. Maybe it's not an earth-shattering one, but it is one that we have data for. So, we need some data to train and test with. This is a data set that was introduced by Ronald Fisher; Fisher basically single-handedly developed much of modern statistics.
And he used it to describe linear discriminant analysis, or LDA, something I talked about before. We have data from three different purple flowers: Iris setosa, Iris virginica, and Iris versicolor, and these are examples of the different flowers. They essentially have six petal-like parts: some are called the petals and the others are called the sepals, and the sepals are generally longer and fancier. One species has relatively large petals and, I guess, moderately sized sepals; setosa has tiny petals and long sepals, and virginica, I think, is similar in having long sepals. The species can be differentiated by their petal and sepal dimensions. This is showing length and width, so there are four dimensions: two for petal size and two for sepal size, and this is how you classify those species. The data was actually collected in Canada by Edgar Anderson, and it was published in 1936; it shows actual measurements of Iris setosa, versicolor, and virginica. You can see that the sepal length for versicolor and virginica is generally long, and the sepal width is about the same for the two of them, but the petal is quite distinct. The petal length is very long for virginica, moderately long for versicolor, and very short for setosa. The petal width follows the same pattern: tiny for setosa, intermediate for versicolor, and fairly wide for virginica. That's highlighted here: versicolor sits in the intermediate range, virginica is the big one, setosa is the small one. You can look at it yourself; you don't have to be a botanist or a machine learning expert to see a trend here. This is what's called a toy problem. In most real machine learning problems, the data is too big, the separation is too complicated, or it's too noisy. That's why we can do this machine learning example with just 150 samples. But because it's so trivial, you really don't need machine learning for it.
And this is when to use machine learning: typically when the answer is not obvious, or when the data set is so large that you just can't figure it out by hand. So we're going to take the data set; in fact, we took the tables from the published paper and entered all of them. We now have 150 rows: 50 setosa, 50 versicolor, and 50 virginica, with all their sepal and petal lengths and widths. We've got the data; now we can choose our model, and we decided on a decision tree. What you're supposed to do now is move to Google Colab, which you learned how to do, and open a file there. Then you have a notebook, which is where you'd be typing your program, and you could all start coding this. Now, we don't have enough time for you to do the coding, so to save time, we've provided you with Python code that's already been written. It was written a couple of years ago by TAs who helped develop this course back in 2020. You can do this right now if you want: go to module two, which is where we're at, and go to the Python code in the CBW learning Google Drive. If you don't want to do that right now, we can save it for later, but this shows you how to navigate there, and I'd like people at least to try it. I'll continue lecturing, but just find your Python code, don't choose the R code, and look at module two. You're going to click on the iris decision tree version for Python, and you're going to open it with Google Colab. It's going to pop up some code, and it's going to be more than 100 lines. The general algorithm for this decision tree is fairly simple. It's going to read your data; in this case, the 150 iris dimensions, a table of four or five columns and 150 rows. Then you're going to check your data; this is something you always have to do with machine learning, because usually there's lots of data.
You want to make sure it's clean. Then you're going to do a training and testing data split: 70% is going to be training and 30% is going to be testing. Then you have to create your splitting function, because this is how you're going to decide how to split: into three groups, two groups, one group; do I decide on petal length, on sepal length? So I have to have a splitting function. I'm going to use the Gini index, because it's faster to calculate than the entropy, so I put in a Gini index function to calculate things and decide between petal length, sepal length, and everything else. Then I have to do an optimal split function, which decides, based on the Gini index, where to cut. I also have to have a terminal node function: when do I stop, because I've finished everything? So I need a function that says, have I finished? And then I also have to write a recursive splitting function. These are all functions I have to create to do this; they have names with underscores. And once I've created all of these, I still need a way of calling them, because once I've trained the tree, I want to be able to use it, so that if I have some more flowers with more dimensions, I can classify them and see how well I do. That's actually what I'm going to do with my testing set: train and train and train, then ask how well it does on unseen data. Does it classify things into setosa, versicolor, or virginica properly? With Python, because we're going to do some math, you have to import the NumPy library to handle arrays or matrices, and then pandas, which gives you data frame capabilities. These two are almost always used in Python any time you're doing this kind of work.
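As a rough sketch of what those Gini and optimal-split functions look like (the lab code is longer and uses its own names; `gini` and `best_split` here are hypothetical), an exhaustive search over feature/threshold pairs might be:

```python
def gini(groups):
    """Weighted Gini impurity of a list of label groups."""
    n = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        purity = sum((g.count(c) / len(g)) ** 2 for c in set(g))
        score += (1.0 - purity) * len(g) / n
    return score

def best_split(rows, labels):
    """Try every (feature, threshold) pair; keep the lowest weighted Gini."""
    best = (None, None, float("inf"))
    for feature in range(len(rows[0])):
        for row in rows:
            t = row[feature]
            left  = [y for r, y in zip(rows, labels) if r[feature] <= t]
            right = [y for r, y in zip(rows, labels) if r[feature] > t]
            g = gini([left, right])
            if g < best[2]:
                best = (feature, t, g)
    return best

# Two features per row: (petal length, petal width), a few illustrative flowers
rows = [(1.4, 0.2), (1.5, 0.2), (4.7, 1.4), (5.9, 2.1)]
labels = ["setosa", "setosa", "versicolor", "virginica"]
print(best_split(rows, labels))  # splits on petal length (feature 0) at 1.5
```

The recursive part then just calls `best_split` on each side until a terminal-node test (pure group, or maximum depth reached) says stop.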
So you're going to have code to import these two libraries. We also have our code for reading the data. So if you're looking at this, and I've just blown things up, you've got this function for reading, and it's reading a data1.csv file. And then there's a call that shows the data head, which just prints the first few rows, showing where the data set starts. So that's the reading. Then there's the data verification. We're just trying to determine if there are any missing values. You know, we've got 600 different values; what if we've only got 598? So it's looking through all of the data columns and trying to find out if there's any data missing in them. And if everything checks out, it'll print out, you know, "data set is complete, no missing values." And this is always good to have for any machine learning algorithm, because you can have, as I say, tens of thousands of data points, and you want to make sure they're clean. Now, if it's not clean, if there's a lot of data missing, we didn't put in a lot of fixes for it. In fact, how to impute is specific to each problem. You might also want to check to see if there are repeats, or if there are nonsensical values: date of birth and date of death, and if the date of death is before the date of birth, that's something you should fix. Those are things that happen or need to be done partly, you know, manually, with someone getting familiar with the data and saying, you know, what makes sense here. In some cases people use imputing. Sometimes the value is, you know, too low to measure, and so you'll give a lower estimate or an upper estimate. In some cases you might have data that's fairly consistent. You know, you've got a child that's age five, it's a boy, but you're missing the weight. You can probably use, you know, the average weight for a five-year-old, and you're probably pretty good. So that's a form of imputing.
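As a hedged sketch (not the course's exact code, and the column names are illustrative assumptions), the missing-value check and that group-mean style of imputing might look like this in pandas:

```python
import numpy as np
import pandas as pd

# Toy table with one missing petal length.
df = pd.DataFrame({
    "petal_length": [1.4, np.nan, 6.0, 1.3],
    "species": ["setosa", "setosa", "virginica", "setosa"],
})

# Data verification: count missing values in each column.
missing = df.isnull().sum()
if missing.sum() == 0:
    print("Data set is complete, no missing values")
else:
    print("Missing values found:")
    print(missing[missing > 0])

# A simple form of imputing: fill a missing measurement with the mean of
# its own group, like using the average weight for a five-year-old boy.
df["petal_length"] = df.groupby("species")["petal_length"].transform(
    lambda s: s.fillna(s.mean())
)
print(df["petal_length"].tolist())   # the NaN becomes 1.35, the setosa mean
```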
So effectively we've done these first three steps: we've defined our machine learning problem and we've constructed our data set. We didn't really have to select features, because this is a decision tree. And we've chosen a recursive binary splitting decision tree model, so that's already done. And as with every machine learning program, you have to divide your data into training and testing sets. So there are 150 flowers, 50 of each species, and we've decided to do roughly 70% for training and 30% for testing. It's a holdout split; technically it's not threefold cross-validation, because cross-validation is typically done within the 70% training portion. And you can see how the code is written: the green is the comments, and the function is defined with def. And we talked about how we've divided it: we've multiplied 0.7 by 150, the length of the data set. And then we've got another one, which is the testing data. So there's training data, and there's testing data. Once we've divided things into training and testing, then we have to call that Gini index, the Gini function. And this is where we're actually doing the calculation of the Gini index. We're trying to find the minimum Gini; that's the good thing with the Gini index. Remember: with information gain, high is good; with the Gini index, low is good. And if we're using petal length, we can choose different points: is a two-centimeter cutoff a good one? Is a three-centimeter cutoff a good one? A four-centimeter one? And by calculating at roughly every 0.1 centimeter interval, all the way through, you can come up with a minimum Gini index where you get a perfect separation between the setosas and both the virginica and versicolor. So at least you get one group perfect, and then you still have to classify versicolor and virginica separately. And that's on petal length, and the number is 2.4 to 2.5 centimeters.
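A minimal sketch of that 70/30 holdout split (a shuffled-index version under my own assumptions; the course code may divide the rows differently):

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed so the split is reproducible
n_rows = 150                           # 150 irises, 50 per species

indices = rng.permutation(n_rows)      # shuffle the row indices before splitting
n_train = int(0.7 * n_rows)            # 0.7 * 150 = 105 training rows

train_idx = indices[:n_train]          # 105 rows for training
test_idx = indices[n_train:]           # the remaining 45 rows for testing

print(len(train_idx), len(test_idx))   # 105 45
```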
And you can plot the Gini index as you increment through 0.1 values, starting at 1 centimeter: 1.1, 1.2, ... 2.1, 2.4, 2.5, 3, all the way up, plotting out your Gini index. And you're going to do this for sepal length, petal length, sepal width and petal width, and determine which one gives you your best Gini score, in this case the minimum, and which one gives you the worst Gini score. Now, before you can calculate a Gini index over the different split points, you have to figure out how to split things. And this is where we call that function I talked about in the algorithm outline, called test_split. And these lists will contain the split: a left split and a right split, two nodes that you call a left node and a right node. And based on that, we decide, okay, how many are going to be in the left node and how many are going to be in the right node. So this is a simple function just to make sure that we've got these two groups partitioned. They could be badly grouped or nicely grouped, but at least it separates the groups. Then, now that we can split things, we're going to calculate the Gini index. And so this is the Gini index function, gini_index. It's going to count all the samples at the split point, and then it's going to calculate the Gini index. And this is part one of it. We have these input classes, because there are three different types of flowers, and we don't want to perform a Gini calculation on an empty group, so that's a caveat that's put in the code. The second part, because this is a fairly long function, is where we actually calculate the Gini index. Summations are performed over each of the class values, and then we return this Gini calculation. You can see the formula for the Gini calculation in the earlier slides, and it's just calculated here. So it's the p squared, summing over them. So that's the end of the Gini function.
Then there's the get_split function. This is what determines the optimal split points, and it uses both the test_split function and the gini_index function that we previously wrote. We have to step through the minimum and maximum values to determine what's best, where to make the split. In this case, the split was at 2.45 centimeters, although this one only increments in 0.1 intervals, so it would be 2.4 centimeters. And as I said, this is incrementing, stepping by 0.1 instead of 0.05 centimeters. So this is going through as we're determining the setosa versus the versicolor versus the virginica, looking at whether it's petal length, petal width, sepal length or sepal width, and calculating the Gini index all the way through. So this is what we're doing with these functions: we're moving stepwise through the values, and you can see where the Gini index starts bottoming out and then starts climbing. We're determining those optimal splits to that 0.1-centimeter step. And then we have to decide when to stop growing the tree, when we've reached the maximum depth, which is a number of nodes from the root node. And this is the to_terminal function. It has to accommodate a few things: when the tree stops growing, when a class is settled, and, if things aren't split perfectly, we have to choose the most common class value. And then there's the recursive splitting function, which was the other function described in the algorithm, so we have to write that one too. In the split function, or get_split, we get the left and right groups, and we process and repeat, process and repeat. And we use a minimum size to force a terminal node if there are too few samples. So we split, and when there's nothing left, that's a terminal node. The split function has certain components to it.
If we reach an empty group, we see if this is a terminal node. If our maximum depth has already been reached, then we have to force terminal nodes and stop doing the splitting. If we still haven't reached the maximum depth, we can still do more splitting. And so split and get_split get called as needed. There's a question from Lance: specifically, do the slides correspond with the Iris TT4 Python code? Some of the functions in the slides are not in this notebook. They should. I mean, just checking; you guys should check, because that was supposed to be verified. If you're running the iris code, it runs, and that's where we're checking and helping a couple of people get the data imported. So yeah, Mark and Sagan, if you guys can also be on the chat room, please. Yeah, I'm in the chat. Yeah, I think in some cases the comments may have been added for the slides specifically and may not be in the original code. So the green stuff may not be in all the code, but all the elements are supposed to be there. Well, like, for example, the get_split function, I couldn't find that, and it doesn't quite follow what we're showing in the PowerPoint. Yeah. So we'll have to double-check that one, but in some cases the code's been slightly modified. If we look back at the original algorithm, there are supposed to be the test_split, get_split, to_terminal, split and gini_index functions. So those are, I think, the five functions that should be there. So something's been changed that shouldn't have been, but this is the way the code was written and should have been the way you guys got it in your files. It might be that some renaming has happened, which I didn't know about, but in the end there should still be these five functions.
So, once you've got this program written, you still have to be able to call it, so that you can take new data and predict with it. So this is the predict function, and it allows you to do this analysis and make your splitting decisions on the data set. The entire program, if it's different from what you guys got, was supposed to be 123 lines: 30 comment lines and 91 coding lines. Because it's a small data set, it should run very quickly. It should only take about a second to train, and then testing on the test data set of 45 samples takes maybe another second, so a couple of seconds for the whole thing. With three species, we actually have a confusion matrix that's three by three rather than two by two: setosa, virginica and versicolor. And we want to figure out what percentage falls in each cell; this should be 100%, 0%, or whatever the counts were. So the program was trained on this training set of 105, and the performance was perfect. So from the training, it learned. And the question is, has it overtrained? This is where we have to test our model with the holdout data set, and the holdout set is these 45. On that, we find out that we get an almost perfect result, but it gets a bit confused between virginica and versicolor. If it were perfect, we'd get, you know, 1, 1, 1 along the diagonal and zero everywhere else. But we can see that the versicolor and virginica are confused: we have some virginica being classified as versicolor. So, ideally, and this is something people need to be aware of, you want to have your training as good as you can get; in this case it's perfect. But even if it wasn't perfect, you know, supposedly it's better than what you could do manually or what other people have reported. And then you want to ask: have I overtrained?
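As a hedged sketch of such a predict function (dictionaries for internal nodes, plain labels for leaves; the 4.95 cm second cut is an illustrative assumption, not the course's trained value):

```python
def predict(node, row):
    """Walk the tree: a dict is an internal node, anything else is a leaf label."""
    branch = node["left"] if row[node["index"]] < node["value"] else node["right"]
    if isinstance(branch, dict):
        return predict(branch, row)
    return branch

# A tiny hand-built tree on petal length alone: first cut at 2.45 cm
# (separating setosa), then a second, assumed cut at 4.95 cm to split
# versicolor from virginica.
tree = {
    "index": 0, "value": 2.45,
    "left": "setosa",
    "right": {
        "index": 0, "value": 4.95,
        "left": "versicolor",
        "right": "virginica",
    },
}

print(predict(tree, [1.4]))   # setosa
print(predict(tree, [4.7]))   # versicolor
print(predict(tree, [6.0]))   # virginica
```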
And if it's overtrained, sometimes that's a little hard to tell. I mean, you could also have gotten a result with your testing data set that looks identical to your training. That's a good sign; it says it's robust. It's also fine if your performance on your testing data set is maybe five or six percent lower overall. In this case, if we calculate the average of 100%, 100% and 93%, divided by three, the average performance is about 96% or 97%. So this is within that five-or-six-percent range. If I had run it on the test set and gotten 0.35, 0.32, 0.46 on the diagonal, that's an average of about 35 or 40 percent. That's terrible. And it tells me that I've overtrained, that I've somehow either used the wrong features or hadn't had enough training data, and maybe my testing data is so different from the training data that that's why it's doing so badly. I'd have to evaluate that. But in this case, the training and testing performance is within that five-or-six-percent threshold, and so we can be confident that it hasn't been overtrained. So, we've evaluated, and you know, we've created the program, we've got our data set, we've chosen a model, written a program. There's some discrepancy between the program that's on the slides and what you guys got, and I hope that Sagan and Mark and Vasu can sort that one out, because the code wasn't supposed to change from the slides. And then you can test and validate the model, and once you've done that, and we've done that, we can use it to make predictions. So we've basically got a decision tree program. It's in Python, and it predicts iris flower classes. We trained it on a training set of 105 irises, and then we tested on a holdout of 45. And it's quite generic: it uses a general Gini index and general numeric inputs. So we could actually use it for classification of patients, cases and controls, with maybe different levels of gene expression, protein expression, metabolite levels or SNPs.
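That diagonal-averaging check can be sketched like this (the 0.93 virginica figure is an assumption for illustration, not the course's exact number):

```python
import numpy as np

# Hypothetical 3x3 confusion matrix as fractions (rows = true class,
# columns = predicted class): setosa and versicolor perfect, some
# virginica misclassified as versicolor.
cm = np.array([
    [1.00, 0.00, 0.00],   # setosa
    [0.00, 1.00, 0.00],   # versicolor
    [0.00, 0.07, 0.93],   # virginica
])

# Average per-class accuracy: the mean of the diagonal.
avg_acc = cm.diagonal().mean()
print(round(avg_acc * 100, 1))   # 97.7, within ~5-6% of a perfect training score
```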
And so in this case, we're going to dive a little bit into using the code. Now, I've shown the stuff in Python. We've written this in R as well, so people who are more comfortable with R can use that.