Today we are going to discuss classification algorithms that are based on decision trees. At the end of the session, a student will be able to describe various decision-tree-based classification algorithms and compare them.

The tree we consider is generated from the given data set, and a set of rules is derived from it. The records available for developing the classifier are divided into a training set and a test set. The attributes of these records are classified into numerical attributes, which are numbers, and categorical attributes, which are text values whose domains are not numerical.

Turning to the properties of decision trees: an inner node of a decision tree represents an attribute, an edge represents a test on the attribute of the node above it, and the leaves output one of the classes. We generate a decision tree from the training data given to the system, using a top-down strategy.

As an example of a decision tree, suppose we have a training data set with five attributes. One of them is a special attribute called the class attribute; some of the others are numerical and the rest are categorical, as we have just seen. We then plot a tree, and this tree will have five nodes. In a decision tree, each leaf node represents a rule. Here we see that outlook can be sunny, overcast, or rainy, leading us to tests on humidity and windy and to the class play. Reading off the rules: if the outlook is sunny and the humidity is not above 75%, then you can go and play. Generating such rules from the decision tree is straightforward.

How do we apply classification? For classification, we traverse the tree from the root node to a leaf node. A record enters the tree at the root node. At the root, a test is applied to determine which child node the record will encounter next. This process is repeated until the record arrives at a leaf node. All the records that end up at a given leaf node of the tree are classified in the same way. There is a unique path from the root to each leaf, and that path is the rule used to classify the records.

The classification we see here is for an unknown record. Assume that for this record we know the values of the first four attributes: outlook is rainy, temperature is 70, humidity is 65, and windy is true. We start from the root node and check the value of the attribute associated with it; this attribute is the splitting attribute at the node. The splitting attribute is very important for a decision tree, and in our example outlook is the splitting attribute of the root. Since the given record has outlook equal to rainy, we move to the rightmost child of the root. There the splitting attribute is windy, and since windy is true, at the next level we arrive at the class "don't play".

The classifier we are developing must now be tested for its accuracy, which is the percentage of the test data set that is correctly classified. Consider rule one: there were two records in the test set satisfying outlook equal to sunny and humidity less than 75, and only one of them is correctly classified as play. Thus the accuracy of this rule is 0.5, which is 50%. Similarly, the accuracy of rule two is again 50%, and the accuracy of rule three is 0.66, or 66%.
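To make the traversal concrete, here is a minimal Python sketch of the weather tree described above. The tests hard-coded in classify mirror the rules read off the tree; the function name and the dictionary encoding of a record are illustrative assumptions, not part of any particular library.

```python
def classify(record):
    """Traverse the weather decision tree from the root to a leaf class."""
    if record["outlook"] == "sunny":
        # Along the sunny branch, the next test is on humidity.
        return "play" if record["humidity"] <= 75 else "don't play"
    elif record["outlook"] == "overcast":
        return "play"
    else:  # rainy
        # Along the rainy branch, the next test is on windy.
        return "don't play" if record["windy"] else "play"

# The unknown record from the lecture; note that temperature is never
# tested on this path, even though we know its value.
record = {"outlook": "rainy", "temperature": 70, "humidity": 65, "windy": True}
print(classify(record))  # -> don't play
```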
Now we look at an example where the splitting attribute at the root of this particular tree is the pincode, and we see that although age is used as a splitting attribute, the right child node has that same attribute as its splitting attribute: an attribute may appear at more than one node. At the root level we have nine records. All of them are tested against the criterion pincode = 50046, and as a result we get two subsets of records: records 1, 2, 4, 8 and 9 go to the left child, and the remaining records go to the right child. This is repeated at every node.

The advantages of using decision trees are that they generate understandable rules, they are able to handle both numerical and categorical attributes, and they provide a clear indication of which fields are most important for prediction or classification. The weaknesses are that growing a decision tree is computationally expensive, and that some decision trees can deal only with binary-valued target classes.

ID3 (Iterative Dichotomiser 3) is a decision-tree algorithm developed by Quinlan in 1986. Each node again corresponds to a splitting attribute, and the possible values of the attribute are indicated on the arcs. We now consider a quantity of interest called entropy: the algorithm uses the criterion of information gain.

Here we see an example from Quinlan's ID3 algorithm in which the class attribute buys_computer has two distinct values and therefore two distinct classes, so m = 2. Class C1 corresponds to yes and class C2 corresponds to no: yes, we are going to buy a computer, or no, we are not. There are nine examples of class yes and five examples of class no, as indicated in the rightmost column of the table.

The exact classification rules are extracted from the tree in the form of if-then rules. One rule is created for each path from the root to a leaf, and each attribute-value pair along the path forms a conjunction. The leaf node holds the final class prediction. Such rules are easy for human beings to understand because each one follows a single path. These are the resulting rules.

The basic algorithm for decision tree induction is a greedy algorithm in which the tree is constructed in a top-down, recursive, divide-and-conquer manner. It starts with all the training examples at the root. Attributes are assumed to be categorical, and the examples are partitioned recursively based on selected attributes, which are chosen, for example, by an information gain calculation. The partitioning stops when all samples at a given node belong to the same class, when no attributes remain for further partitioning, or when no samples are left.

The attribute selection measure, information gain, works by selecting the attribute with the highest information gain. Let S be the set of samples, containing s_i samples of each class C_i for i = 1, ..., m, with s samples in total. The expected information needed to classify a given sample is

I(s_1, ..., s_m) = - Σ_{i=1}^{m} (s_i / s) log2(s_i / s).

The entropy of an attribute A with v distinct values is

E(A) = Σ_{j=1}^{v} ((s_{1j} + ... + s_{mj}) / s) · I(s_{1j}, ..., s_{mj}),

where s_{ij} is the number of samples of class C_i in the j-th subset, and the information gained by branching on attribute A is

Gain(A) = I(s_1, ..., s_m) - E(A).

For the two-class case, the entropy of S can be written directly in terms of the counts of positive and negative samples, as follows.
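The following Python sketch implements these three formulas and checks them against the numbers quoted in this lecture; the function names and the list-of-counts encoding are assumptions made for the example.

```python
import math

def info(counts):
    """Expected information I(s1, ..., sm) for the class counts of a set."""
    s = sum(counts)
    return -sum((c / s) * math.log2(c / s) for c in counts if c > 0)

def entropy_attr(partitions):
    """E(A): information weighted over the partitions induced by attribute A.
    `partitions` is a list of per-value class-count lists [s1j, ..., smj]."""
    s = sum(sum(part) for part in partitions)
    return sum((sum(part) / s) * info(part) for part in partitions)

def gain(total_counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(total_counts) - entropy_attr(partitions)

# buys_computer example: 9 yes and 5 no overall.
print(round(info([9, 5]), 3))  # -> 0.94, the lecture's 0.940

# Partitions of age (youth, middle-aged, senior) as [yes, no] counts,
# taken from the standard buys_computer table.
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # -> 0.247
```

The small discrepancy (0.247 here versus the quoted 0.246) comes from the lecture rounding I = 0.940 and E(age) = 0.694 before subtracting.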
For this we consider a value t, the total number of instances, which is the sum t = p + n, where p and n give the class distribution of S. The entropy is then

Entropy(S) = -(p/t) log2(p/t) - (n/t) log2(n/t).

For an example where p = 9 and n = 5, we get an entropy of 0.940. Considering the extreme cases for the two classes: a completely pure set has entropy zero — for example, if all the samples come from the first class and none from the second, so 14 samples from the first class and zero from the second, the entropy is zero — and a set with equal occurrences of the two classes has entropy one.

For attribute selection by information gain computation, we compute the information gain for the attribute age. The gain for age is obtained by combining the information from all the age ranges, counting the yeses and the noes in each, and we conclude that Gain(age) = 0.246. Similarly we find Gain(income), Gain(student), and Gain(credit_rating). Since age has the highest information gain among the attributes, it is selected as the test attribute.

Now we take an example consisting of training data from an employee database, and we use the ID3 algorithm to construct a decision tree from this data. We first compute the initial entropy, then the gain of the department, the gain of the age, and the gain of the salary; at each subsequent level of the tree we recompute the entropy and repeat these computations.

Another attribute selection measure is the Gini index, used in CART and in IBM IntelligentMiner. Here all attributes are assumed to be continuous-valued, and we assume that several possible split values exist for each attribute; other tools, such as clustering, may be used to obtain the candidate split values. The method can be modified to handle categorical attributes. If a data set T contains samples from n classes, the Gini index is

gini(T) = 1 - Σ_{j=1}^{n} p_j²,

where p_j is the relative frequency of class j in T. After T is split into two subsets T1 and T2 with sizes N1 and N2, the Gini index of the split is

gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2).

The attribute that provides the smallest gini_split(T) is chosen to split the node. This requires enumerating all possible splitting points for each attribute; a solution to this problem is indicated here. A small numeric sketch of these Gini formulas follows the closing remarks below.

Finally, we represent all our data in the form of confusion matrices, and from these we compute the values we need to evaluate the classifier. For our references we have used your textbooks, Dunham, and Han and Kamber. Thank you.
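As noted above, here is a short Python sketch of the two Gini formulas; the candidate split counts are invented for illustration and are not taken from the lecture's data.

```python
def gini(counts):
    """gini(T) = 1 - sum of p_j^2 over the classes, for class counts in T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """gini_split(T) = (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

# Gini of the 9-yes/5-no set, and of a hypothetical split of it into
# [6 yes, 1 no] and [3 yes, 4 no].
print(round(gini([9, 5]), 3))                # -> 0.459
print(round(gini_split([6, 1], [3, 4]), 3))  # -> 0.367
```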