In this video, we will continue with the Decision Tree Classifier. We saw that the decision tree was developed in 1986 as the ID3 algorithm, and variants of it are used in the modern tools of today. Let us look at what the ID3 algorithm actually does: it answers two questions, which node to choose as the root node and where to stop. These are the key questions you have to answer in a decision tree. So, let us work with this particular table: you have attendance, midterm marks, and whether the final marks are greater than 70. You have seen this table multiple times. Which node do we choose as the root node, and where do we stop? I am not going to solve the decision tree for this particular table; instead, I am going to explain the mathematics behind which node to choose and where to stop, and you can apply those formulas to this table yourself. It involves only two things: entropy and information gain.

You all might know entropy from your own field; it is defined differently in different fields, electrical engineering has one definition, mechanical engineering another. Here it measures the uncertainty in the data; that is how we use it in computer science. Take S to be the complete data set. The entropy formula is H(S) = −Σ_i p_i · log2(p_i), where i runs over the classes: if it is binary, yes or no classes, there are 2 terms; if there are 4 classes to classify in the decision tree, there are 4 terms, and so on. Each p_i is the proportion of examples in S belonging to class i, so for a two-class problem p_1 is the proportion of yes and p_2 is the proportion of no, and the formula becomes H(S) = −p_yes · log2(p_yes) − p_no · log2(p_no), which is exactly the equation given here. Check the base of the logarithm, it is very, very important: because there are two classes, we use log base 2. If you compute this for the previous table, let us see how many yes and how many no: 1, 2, 3, 4, 5 yes out of 7. So p_yes = 5/7 ≈ 0.71 and p_no = 2/7 ≈ 0.29, and plugging these into the formula with log base 2 gives an entropy of about 0.87.

What does this 0.87 mean? Should I consider it good or bad? Let us understand entropy with a simple diagram of entropy against probability. For a two-class problem, the curve starts at 0 when the probability is 0, rises to a maximum of 1 when the probability is 0.5, and falls back to 0 when the probability is 1. So when the probability value is 0 or 1, the entropy is at its minimum, which is 0, and when the probability value is 0.5, the entropy is at its maximum. Let us try to understand what entropy measures from the probability value. Consider this: I have 30 students passing the exam and 30 students failing it. The probability of passing the exam, the probability of yes, is 30 out of 60 total students, which is 1/2, that is 0.5, and the probability of failing is also 0.5. Out of 60 students, any one of them could equally well have passed or failed; the uncertainty is at its highest.
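To make the formula concrete, here is a minimal Python sketch (my illustration, not from the lecture) reproducing the two numbers just discussed:

```python
import math

# 5 of 7 students pass ("yes"), 2 of 7 fail ("no")
p_yes, p_no = 5 / 7, 2 / 7
h = -p_yes * math.log2(p_yes) - p_no * math.log2(p_no)
print(round(h, 3))  # 0.863, the ~0.87 value from the lecture

# A 30/30 split gives the maximum entropy for two classes
print(-0.5 * math.log2(0.5) - 0.5 * math.log2(0.5))  # 1.0
```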
If you pick any student, it could be either pass or fail with equal probability, so the entropy is high. Now consider the same 60 students, but say 59 students pass and only one student fails. Then the probability of yes is 59 divided by 60, which is almost 1, about 0.98. If the probability of yes is almost 1, the entropy will be very low, which tells us the uncertainty in classifying is very, very low: pick any student from this class and you can say pass, because 59 out of 60 passed, and only one time in 60 will you be wrong. So that is the pattern: if there is an equal amount of both yes and no, black and white, the two classes balanced, the entropy will be high; if one class dominates, like almost all black or almost all white, the entropy will be low.

Our aim in a decision tree classifier is to bring that entropy down, not up, because I want each leaf to make a clean decision: all the students at this leaf will pass, all the students at that leaf will not. So my aim in a decision tree is to bring the entropy value as close to the minimum as possible. That is it; that is the basic concept a decision tree works on. In the previous example, 5 out of 7 versus 2 out of 7 is neither a perfectly even split nor a completely pure one, and if you plot the value 0.71 on the entropy curve, you land at about 0.87, exactly what we computed with the two-class entropy formula.

So, which node to choose? Now you can understand it. You start from the entropy of the whole set, 0.87 for our 7 examples, and you have to select one of the features and create a split, say attendance is high. If you split on attendance being high, what happens to the entropy, does it reduce or not? How do you know whether the entropy value is reducing? That is computed by the information gain. Let us look at what information gain is. It takes the entropy of the complete set without making any decision, at the root; the entropy of the complete set is 0.87. Without taking any decision, given a particular student, if I simply predict pass, I will be right about 71% of the time, that is the 0.71 probability, but we want a better classifier. Information gain is this entropy of the given set minus the weighted entropies of the subsets you get when you split on one of the features. Let us say the feature I select is midterm marks, so I consider midterm marks as the node. In general, you have to try all the features and different split values for each feature; here, take midterm marks as the root node. Now you need to decide how many branches you want out of it, two branches or three branches, two children or three children; I will make two children.
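Writing out the formula the lecture is describing here (this is the standard ID3 definition of information gain):

Gain(S, A) = H(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

where A is the attribute you split on, S_v is the subset of S for which A takes the value v, and |S_v| / |S| is the weight of that branch. The first term is the entropy before the split and the sum is the weighted entropy after it, so the gain is exactly how much uncertainty the split removes.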
One branch is, sorry, less than 50, and one is greater than or equal to 50. So I make a decision here. How does this number 50 come up? That is also part of the method: you have to try different threshold values, or group the feature into categories; that is all part of how you start. So, let us say I take midterm marks as the root node and I create two branches, less than 50 and greater than or equal to 50. The attribute A here is midterm marks, and there are two values v for it, the two branches; let us say there exist only these two groupings in this particular midterm marks feature. I am not using strict mathematical terms to explain this, because my idea is for you to understand what information gain and entropy are in a general sense. Again, if you really want to know the mathematics behind all this, the exact terms and definitions, please go ahead and read about it on the internet. My idea in this course is to keep the maths as light as possible while explaining the core concept behind these algorithms; that is why I keep it as easy as possible. But if you are interested and get motivated by these videos, I request you to go and read more.

So, consider midterm marks with two values, two decisions: less than 50 and 50 or more. Let us look at the table once again: there are 2 students out of 7 who got less than 50 marks, and the other 5 students got more than 50 marks. Now classify each branch. The two students who got less than 50 marks are both 'no'; all the other students passed the exam. Coming back to the formula, the branch with more than 50 marks has 5 students out of 7, so its weight is 5/7, the fraction of the complete set taking that particular value, and the branch with less than 50 has weight 2/7. In the more-than-50 branch, the probability of passing is 1, and if the probability is 1 the entropy is 0. Similarly, in the less-than-50 branch, the probability of passing is 0, which also gives entropy 0. Both branch entropies are 0, so the whole weighted sum is 0, and the information gain is the current entropy minus 0, that is 0.87. No information is lost, so this is the maximum possible gain, and you select the node with the highest gain as the root node. If you pick some other node, or some other decision, say a threshold of 40 instead of 50, you might get a different value, and that value will give less gain. So you have to pick the root node which gives you the maximum gain; the gain is always computed against the current set. At this step it is the root node; at the next step it is no longer the root, some other decision is being taken there, and whenever you want to make a decision you compute this information gain formula again. For example, for this particular table, the split at less than 50 separates yes from no completely, so you can simply draw it: the decision tree for this given table is very simple, yes on one side and no on the other, with only one root node.
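Here is the same calculation as a small self-contained Python sketch (again my illustration, assuming the counts from the lecture's table):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent, branches):
    """Gain(S, A) = H(S) - sum over branches v of (|S_v|/|S|) * H(S_v)."""
    total = sum(parent)
    weighted = sum((sum(b) / total) * entropy(b) for b in branches)
    return entropy(parent) - weighted

# Parent set: 5 pass, 2 fail. Split on "midterm marks < 50":
#   < 50  -> [0 pass, 2 fail],  >= 50 -> [5 pass, 0 fail]
print(information_gain([5, 2], [[0, 2], [5, 0]]))  # ~0.863: both branches pure
```

Both branches are pure, so the weighted entropy after the split is 0 and the gain equals the full 0.87 of the parent set, which is why this split is chosen as the root.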
Try building a decision tree with more complexity: create your own data, say 30 students with their attendance, their midterm marks, their engagement in class, whether they submit assignments on time, their interaction with Moodle, and so on. Create a big table along with the labels, compute the splits, and see how the decision tree is built; that is the idea. So that is how the root node is chosen. You also want to know when to stop: stop when there are no more conditions to test, or when all the examples at a node belong to one group, that is, when you have already reached a probability of 1 or 0. All the students at this leaf are 'no' only. When you reach that level, you stop growing the decision tree.

So, you have seen the decision tree, right? What are the drawbacks of decision tree learning? List down two points. Please pause, list them down, and then resume the video to continue.

First, the complexity increases as the number of decisions increases. It looks very easy if you have, say, 5 levels or 10 decisions to make; the tree looks nice. But if you have a very complex problem with 15 or 20 features, each with many categorical values, not just 2 decisions but many more to make, that tree becomes very, very complex. Even if it is visually laid out well, you do not understand anything, because the tree is too complex. Second, the decision tree has a very, very serious drawback: overfitting. It tries to build the tree exactly for the training data, so it performs very well on the training data, even 100% accurate, but on test data it is not guaranteed to perform well, because on the training data it picks up every small condition and tries to create a new branch for each one, which does not carry over to the test data. That is a bigger problem. Also, it is a greedy search algorithm, as I mentioned at the start of this video: it makes only locally optimal decisions at each step. Sometimes you have to combine the decisions of 2 or 3 features to make a good split, but that is not possible in a basic decision tree. Also, it is not directly applicable to continuous data: you have to define the decision splits before you start the tree, although the latest algorithms automatically make bins based on the distribution of each feature, so that is less of an issue now.

I said two problems, right: complexity and overfitting. Both can be reduced by pruning. What is pruning? It is basically cutting off branches, trimming the tree. If your tree is too big and too dense, you cut down the branches to trim it so that it can grow well and look good; exactly that happens in a decision tree also. Here, what they do is cut down the branches which do not contribute much to the decisions, in order to reduce the complexity. Such pruning also helps performance, in the sense that it improves the performance of the decision tree on the test data set as well. So those are the main disadvantages, but complexity and overfitting can be solved by pruning methods, and the other shortcomings are taken care of by the latest machine learning algorithms, so they are not an issue. So, in this video we saw what decision tree learning is, and I request you to go and try decision trees in our tools and check it out.
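As one way to try the exercise above, here is a small sketch using scikit-learn (one possible tool; the numbers and feature names below are made up for illustration, not the lecture's actual table). Setting criterion="entropy" uses the same information gain idea, and ccp_alpha turns on cost-complexity pruning, one standard pruning method:

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

# Toy data in the spirit of the lecture's table (values are invented):
# columns = [attendance_pct, midterm_marks], label = final marks > 70
X = [[90, 65], [85, 72], [40, 30], [95, 80], [30, 45], [88, 55], [50, 48]]
y = [1, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# ccp_alpha > 0 prunes branches that add complexity without enough gain,
# which is one way to fight the overfitting discussed above
clf = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01,
                             random_state=0)
clf.fit(X_train, y_train)

print(export_text(clf, feature_names=["attendance", "midterm"]))
print("test accuracy:", clf.score(X_test, y_test))
```

Increasing ccp_alpha (or lowering max_depth) gives a smaller, more general tree; try both on your own bigger table and compare training versus test accuracy to see the overfitting effect for yourself.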
I did not give a complex problem for the decision tree here, but I would request you to practice decision trees on a more complex set of problems, with more features and more decisions to make, and see how it works. Thank you.