So, dear participants, welcome to the course on supply chain digitization, jointly taught by Professor Priyanka, Professor Sushmita and Professor Devabrata from IIM Mumbai. Today's lecture is lecture 4 of module 3, analytics in supply chain management, and we will continue from the previous lecture, lecture 3 of module 3. In the last lecture, if you recall, we developed a predictive analytics model, and the idea was to predict whether a machine will fail or not, based on various parameters such as age, utilization, MTBF (mean time between successive failures), unplanned downtime, oil contamination, overhaul scheduling compliance, scheduled lubrication compliance and so on. So the idea was: depending upon these parameters, can I predict whether a machine will fail or not? This is the model which we developed in the last class, where we discussed the model, how to interpret it, and its managerial implications. In today's class we will see how this model was developed; we will discuss the steps of building a classification tree model using this example. To start with, the first step is to put the complete training data in the root node. In our case the root node is node 0. In node 0, if you observe, we have 700 observations, and out of these 700 observations, 447 are yes and 253 are no. That means 64 percent of the time the machine failed and 36 percent of the time it did not fail. Suppose you have only these data from the plant, and you pick any machine at random. If I ask you what is the probability that that particular machine will fail, what would the probability be? It would be 64 percent.
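The baseline calculation above can be reproduced in a couple of lines of Python; this is a quick sketch using the class counts quoted for node 0:

```python
# Baseline (no predictors): estimate the failure probability from the
# class counts alone. Counts are taken from the lecture's root node (node 0).
total = 700
failed = 447       # "yes" observations
not_failed = 253   # "no" observations

p_fail = failed / total
print(round(p_fail, 2))  # 0.64 -> any randomly picked machine fails ~64% of the time
```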
So, now, irrespective of its age, its utilization, or its MTBF value, for any random machine you pick the probability will remain the same. But in reality the probability of failure depends upon all these factors: age, utilization, unplanned downtime, oil contamination, overhaul scheduling compliance, scheduled lubrication compliance and so on. Therefore, the idea is: can I take these parameters into the model and try to predict in a better way? Can I improve the accuracy? Right now the probability is 64 percent irrespective of age, utilization and so on; whatever the values of these parameters, if I pick any random machine the probability that it will fail is 64 percent. That is the current scenario. But can I improve this accuracy, can I make a better prediction? For that, we will try to utilize these parameters: age, utilization, MTBF value and so on. If I take these parameters, will my prediction be better or not? For that we need to split the root node. In this case the root node is node 0, which has 700 observations. I will try to split this node in such a way that my impurity is reduced the most. There are impurity measures like the Gini index, entropy and so on; we will talk about these indices in a later slide, but assume for now that you know about them, and let us find out how to split this node in two parts so that the overall impurity is reduced the most. So, let us assume that we bring in the variable oil contamination. I had age, utilization, MTBF value and all the other parameters; now, out of these, if I bring in oil contamination as a variable and split the root node, node 0, in two parts, then in part 1 you can see node 1, which has 141 observations, and in node 2 I have 559 observations. So 700 is divided in two parts: 559 plus 141.
Now, if you see, on the right-hand side, in node 2, the oil contamination value is more than 5.5; in the left node, node 1, the oil contamination value is less than or equal to 5.5. Now compare node 0 versus node 2. In node 0 I do not have any information: pick any random machine and the probability that it will fail is 64 percent. Now, if I tell you that the machine you are talking about has oil contamination more than 5.5, then I am in node 2, no longer in node 0, because I have the additional information that oil contamination is more than 5.5. Therefore I am in node 2, and in node 2 the probability that the machine will fail is 72 percent. So my accuracy has improved from 64 percent to 72 percent. That is the idea behind a decision tree: we have to split the node in such a way that the accuracy improves and the impurity reduces. Now, is it possible to improve the accuracy further? I got 72 percent accuracy by using the variable oil contamination and splitting node 0 into node 1 and node 2. Can I get more than 72 percent accuracy? Yes, it is possible; for that I need to split these nodes further. Now, if you look here, node 2 has been split using the MTBF value and node 1 has been split using utilization. So in the first step we brought in the variable oil contamination; in the second stage I am bringing in the variables MTBF and utilization. If you look into node 2, one branch has MTBF value less than or equal to 23.95 and the other has MTBF value greater than 23.95. At node 2 I have 559 observations, and if I split it further using the MTBF value, I get 493 observations in node 5 and 66 observations in node 6. In node 5, out of 493, 381 observations have the value yes and 112 have the value no; 381 out of 493 is about 77 percent.
So, if I summarize it: the machine which you are picking has oil contamination more than 5.5 and MTBF value less than or equal to 23.95; that means its mean time between failures is at most 23.95, and then the probability that the machine will fail is 77 percent. Now, if I compare these three probabilities, 77 percent is much better: my prediction accuracy has improved. Initially, when I did not have any variable and was just looking at the overall data set, the probability was 64 percent. As soon as I gave you the information that oil contamination is more than 5.5, the probability improved from 64 percent to 72 percent. When I give you the further information that oil contamination is more than 5.5 and the MTBF value is less than or equal to 23.95, the probability increases further to 77 percent. So this is how the accuracy keeps increasing: the idea is to split the data into smaller and smaller parts and improve the accuracy. Now, if you look into node 4, on the left-hand side, the accuracy is even higher, 88 percent. That means if I go from node 0 to node 1 and from node 1 to node 4, the accuracy increases to 88 percent. How can I reach node 4? Node 4 means any machine for which oil contamination is less than or equal to 5.5 but utilization is more than 92.05 percent. So if you pick any random machine from your plant and its oil contamination is less than or equal to 5.5 but its utilization is more than 92.05 percent, the probability that the machine will fail is 88 percent. Now compare node 0 versus node 4: obviously, node 4 is giving me better accuracy. Now compare node 0 with node 3. What happens in node 3? In node 3 I have 109 observations; out of 109, 90 observations are no and 19 are yes. That means 83 percent of these machines did not fail.
So, if I give you a machine whose oil contamination is less than or equal to 5.5 and whose utilization is also less than or equal to 92.05 percent, then the probability that the machine will not fail is 83 percent, and the probability that it will fail is very low, 17 percent. Now, comparing node 0 versus node 3, node 3 is giving me better accuracy. Therefore, what we saw is that if I split the node using some intelligent algorithm, in this case the impurity measures, either the Gini impurity index or entropy, and keep splitting further, the accuracy will keep on increasing. That is the overall idea. You may ask: can I increase the accuracy further? Yes, you can, but in this case we have used stopping criteria; there are multiple stopping criteria. The first stopping criterion is the level of the tree from the root node. I have used two levels: in the first level we used oil contamination, and in the second level we used utilization for node 1 and the MTBF value for node 2. So I have gone up to depth 2; that is my stopping criterion. If you go to further depth, the accuracy will increase further. Another stopping criterion is the minimum number of observations in each node. If you look into node 6, I have only about 9 percent of the observations, since 66 out of 700 is approximately 9 percent. So you can put a stopping criterion that if a node has less than 10 percent of the observations, then I will not split it further. A third stopping criterion is the minimum reduction in impurity. We will talk about impurity on the next slide, and at that time I will come back to this stopping criterion, but as of now these are the 3 stopping criteria which you can use to stop the decision tree process and stop splitting the nodes further. So, why do we need to stop?
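The three stopping criteria can be sketched as a simple check; this is a minimal Python illustration with assumed threshold values (depth 2, a 10 percent node-size floor, and a small impurity-reduction floor), not the lecture's exact implementation:

```python
def should_stop(depth, n_node, n_total, impurity_decrease,
                max_depth=2, min_frac=0.10, min_decrease=1e-4):
    """Return True if any of the three stopping criteria fires."""
    if depth >= max_depth:                # criterion 1: level of tree from root
        return True
    if n_node < min_frac * n_total:       # criterion 2: node has <10% of observations
        return True
    if impurity_decrease < min_decrease:  # criterion 3: minimum reduction in impurity
        return True
    return False

# Node 6 holds 66 of 700 observations (~9%), below the 10% floor,
# so it would not be split further even at depth 1:
print(should_stop(depth=1, n_node=66, n_total=700, impurity_decrease=0.05))  # True
```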
Because if we do not stop, if we keep on splitting the nodes further and further, then for the training data the accuracy will keep increasing, but for the test data the accuracy will start reducing; this problem is called overfitting of the model. Therefore, we would like to avoid that. Now I am sure you have got the idea of how a decision tree is built, but we still have to explain the concepts of the Gini index and entropy. So let us spend some time to understand what the Gini index and entropy are. Assume there are k classes in the data; then the impurity measures at node T of the decision tree can be calculated as follows. The Gini index at node T is 1 minus the sum, over i = 1 to k, of the squared proportion of observations belonging to class i given you are at node T: Gini(T) = 1 − Σᵢ pᵢ², where pᵢ is the proportion of observations of class i at node T. The other measure of impurity, entropy, is minus the sum over the k classes of the proportion of class i at node T multiplied by the log base 2 of that proportion: Entropy(T) = −Σᵢ pᵢ log₂ pᵢ. These are the formulas for the Gini index and entropy. Now we will explain these formulas using an example. Let us assume k = 2, so you have two classes, class 1 and class 2. Suppose that at some node 100 percent of the observations belong to class 1 and 0 percent belong to class 2. Then the Gini index will be 1 − 1² − 0² = 0, and the entropy will be −(1·log₂ 1 + 0·log₂ 0) = 0, taking 0·log₂ 0 as 0. If 100 percent of your data set belongs to one class, there is no impurity in the data: everybody belongs to the same class, there is no randomness, and therefore the impurity is 0; the Gini index is 0 and the entropy is also 0.
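The two formulas can be written directly as small Python functions; the sketch below reproduces the two-class examples from this slide:

```python
from math import log2

def gini(p):
    """Gini index for class proportions p (summing to 1): 1 - sum(p_i^2)."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy: -sum(p_i * log2(p_i)), taking 0 * log2(0) as 0."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

# Pure node: 100% in class 1, 0% in class 2 -> no impurity at all.
print(gini([1.0, 0.0]))         # 0.0
print(entropy([1.0, 0.0]) == 0) # True

# Maximally mixed node: 50% in each class -> maximum impurity.
print(gini([0.5, 0.5]))     # 0.5, the maximum Gini for two classes
print(entropy([0.5, 0.5]))  # 1.0, the maximum entropy for two classes
```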
Now see another example: class 1 has 0 percent and class 2 has 100 percent. Again 100 percent of the observations belong to the same class, class 2, so there is no randomness and no impurity in the data; ideally the impurity should be 0, and if we apply the formulas the impurity indeed comes out as 0. Now, what is the maximum possible impurity in a data set? If there are 2 classes, it is when 50 percent belongs to class 1 and 50 percent belongs to class 2. Let us take an example: I have good and bad products, and in a lot 50 percent are good and 50 percent are bad. Obviously the data has maximum impurity because the randomness is maximum, so the impurity should also be maximum. And if you apply the formulas, the Gini index is 1 − 0.5² − 0.5² = 0.5, and the entropy is 1. In both cases I see the maximum value when I have 50 percent of the observations in class 1 and 50 percent in class 2. That is the most random data, and this data is not giving me any additional information, because with 50 percent in class 1 and 50 percent in class 2 the randomness is at its maximum. Now, if I plot both the Gini index and entropy, you will see this kind of behaviour: on the x-axis is the proportion of observations belonging to class 1. If that proportion is 0, then 0 percent of the observations belong to class 1 and 100 percent belong to class 2; there is no randomness, everybody belongs to the same class, and the impurity has to be 0, which is exactly what is happening. Now think about the other extreme: 100 percent of the observations belong to class 1 and 0 percent belong to class 2. There is no observation in class 2, everybody belongs to class 1, so again there is no randomness and the impurity is 0, for both Gini and entropy.
So, the Gini index, shown by the orange graph, and the entropy, shown by the blue graph, are both 0 at these two extreme situations: there is no impurity. Then there is the situation we discussed on the last slide: 50 percent of the observations belong to class 1 and 50 percent to class 2, and you see I get the maximum impurity for Gini as well as for entropy. If I calculate the values for 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8 and 0.9, I get this kind of graph. It says that if most of the observations belong to one particular class, the impurity reduces; the impurity is maximum when one class has 50 percent of the observations and the other class has the remaining 50 percent. I hope the concepts of the Gini index and entropy are clear. Now we will go back to the previous example and see how this impurity concept is used to split the node, because Gini and entropy are among the most important concepts as far as classification trees are concerned. Let us see how we use this concept to split the node. At node 0 I have 700 observations, and right now 64 percent are in one class and 36 percent are in the other. The Gini value is 0.46; we can calculate it using the formula from the previous slides. Now, the maximum value of Gini is 0.5, and here I have got 0.46, which is very close to 0.5; that means there is a lot of randomness in the data set. This is obvious from the counts: 64 percent yes and 36 percent no, while the most random data would be 50 percent yes and 50 percent no, which is close. Therefore, I can reduce the impurity further. Now, I introduce a variable called oil contamination and split this node. But oil contamination is a continuous variable, so I also have to pick a value of it: I am not only choosing oil contamination, I am also picking a cutoff, namely oil contamination equal to 5.5.
So, 5.5 is the cutoff value of oil contamination, and I am splitting node 0 using this condition: if oil contamination is more than 5.5 then I am in node 2; if oil contamination is less than or equal to 5.5 then I am in node 1. The idea is to reduce the overall impurity, which right now is 0.46. How can I reduce it? If I reduce the impurity, then in one class I will have more observations and obviously my accuracy will be better. I have split this node using the parameter oil contamination with value 5.5. Now, if I calculate the Gini at node 1, it is 0.44; at node 2 the Gini is 0.40. We now calculate the overall reduction in impurity. How? Take the impurity at node 0, which is 0.46, then subtract the impurity at node 1 multiplied by the proportion of observations in that node, which is 0.2, and subtract the impurity at node 2 multiplied by the proportion of observations in node 2, which is 0.8. I get the value 0.052; that is the reduction in impurity. So, if I split node 0 using the parameter oil contamination with the value 5.5, I get node 1 and node 2, and my overall reduction in impurity is 0.052. In place of oil contamination, if you take any other parameter, MTBF, utilization, anything else, your reduction in impurity will be less than 0.052. Oil contamination with cutoff value 5.5 gives the maximum possible reduction in impurity; therefore, at node 0 this particular parameter and this cutoff of 5.5 are used. If you use any other cutoff instead of 5.5, suppose you take an oil contamination cutoff of 6, 7 or 8, your reduction in impurity will be less than 0.052. So 0.052 is the maximum possible reduction in impurity when splitting node 0, achieved with oil contamination and the cutoff value 5.5.
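The reduction-in-impurity calculation above can be checked numerically; this sketch plugs in the Gini values and node proportions quoted in the lecture:

```python
# Reduction in impurity when node 0 is split on oil contamination at cutoff 5.5,
# using the lecture's Gini values and the node sizes (141 and 559 out of 700).
gini_node0 = 0.46
gini_node1, w1 = 0.44, 141 / 700   # node 1 holds ~0.2 of the observations
gini_node2, w2 = 0.40, 559 / 700   # node 2 holds ~0.8 of the observations

reduction = gini_node0 - w1 * gini_node1 - w2 * gini_node2
print(round(reduction, 3))  # 0.052, matching the lecture's figure
```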
So, therefore, oil contamination with cutoff value 5.5 is picked at node 0 to split that node in 2 parts. Now, moving to node 2: node 2 currently has Gini 0.40, with 72 percent yes and 28 percent no. I have to split this node, and for that I have to bring in the particular parameter which gives me the maximum possible reduction in impurity. If I bring in MTBF with cutoff 23.95 and split node 2 in 2 parts, MTBF less than or equal to 23.95 and MTBF greater than 23.95, then I get node 5 and node 6, with Gini index 0.35 and Gini index 0.45 respectively. If you calculate the overall reduction in impurity, it is 0.40 (Gini at node 2) minus 0.35 (Gini at node 5) times 0.7 (since node 5 holds 70 percent of the observations) minus 0.45 (Gini at node 6) times 0.09 (since node 6 holds 9 percent of the observations), which comes to about 0.11. In place of MTBF, if you take any other parameter, age, oil contamination, utilization, anything, and split this node in 2 parts, your reduction in impurity will be lower than 0.11. Therefore, if I use MTBF with cutoff 23.95, my reduction in impurity is the maximum, 0.11; any other variable in place of MTBF, or any other cutoff in place of 23.95, say MTBF 20, 25 or 26, will give a lower reduction. So, through iteration we have found that MTBF with cutoff value 23.95 gives the maximum possible reduction in impurity when splitting node 2 in 2 parts. In the same way we can split node 1 using utilization with cutoff value 92.05; then the Gini indices of the two children will be 0.28 and 0.21, and the overall reduction in impurity will be 0.38.
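The "through iteration" search described here, trying every candidate cutoff on a continuous variable and keeping the one with the largest weighted reduction in Gini impurity, can be sketched as follows. The data is a tiny made-up example, not the lecture's dataset:

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try midpoints between consecutive sorted values; return the cutoff
    that maximizes the weighted reduction in Gini impurity."""
    n = len(values)
    parent = gini(labels)
    best_cutoff, best_reduction = None, 0.0
    pairs = sorted(zip(values, labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cutoff between equal values
        cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        reduction = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if reduction > best_reduction:
            best_cutoff, best_reduction = cutoff, reduction
    return best_cutoff, best_reduction

# Tiny made-up example: oil contamination readings with fail ("yes") / no-fail labels.
oil = [2.0, 3.0, 4.0, 6.0, 7.0, 8.0]
fail = ["no", "no", "no", "yes", "yes", "yes"]
cutoff, reduction = best_split(oil, fail)
print(cutoff)  # 5.0 -> this cutoff perfectly separates the two classes here
```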
So, in node 1 the best possible parameter to split on is utilization, with cutoff value 92.05; if I split this node using utilization at 92.05, I achieve the maximum possible reduction in impurity. This is the criterion by which any node is split in 2 parts in a classification tree: that particular variable and that particular cutoff score must give the maximum reduction in impurity, and that is how you keep on splitting the nodes. Now, let us summarize how to build a classification tree. In step 1, we start with the complete training data set in the root node; in our case we have 700 observations. In step 2, we split the root node using a predictor variable such that it results in the maximum reduction in impurity, where the impurity measure is the Gini index or entropy. We have explained both concepts, but to split the nodes in our case we used the Gini index, although we also learnt the concept of entropy. In step 3, we repeat step 2 for each internal node using the independent variables until a stopping criterion is met. What are the stopping criteria? The level of the tree from the root node is the first stopping criterion; in our case we used this criterion, going up to 2 levels, depth 1 and depth 2. Then we can have a minimum number of observations in each node; for example, we can require a minimum of 10 percent of the observations in each node. In our case we started with 700, so I would want a minimum of 70 observations, and so on; if a node has fewer than 70 observations, you do not split it further, you stop there. Another stopping criterion is the minimum reduction in impurity: I can require that the reduction in impurity be greater than or equal to some small value such as 0.0001. We can decide this cutoff; if the reduction in impurity is less than this, you do not split.
So, these are the 3 stopping criteria, or you can say the 3 criteria governing when to split a node; accordingly, the decision tree is created, and based on that you can take the decision. We will stop this video here. We have explained step by step how a decision tree model, specifically the classification tree model, is built, how we can use the Gini index and entropy to split the nodes, and what the stopping criteria are. In the next video we will discuss how to develop this model using Python code; again we will go step by step, starting from the data and then using Python code, and so on. Thank you for carefully listening to the lecture; looking forward to seeing you in the next video.
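As a preview of the next video, here is a minimal sketch of the kind of Python workflow involved, assuming scikit-learn is available; the data below is synthetic and the variable names are only illustrative, not the lecture's actual dataset:

```python
# A minimal classification-tree sketch with scikit-learn.
# The data is synthetic: labels follow a hypothetical rule in which high
# oil contamination or high utilization drives failures.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 700
X = np.column_stack([
    rng.uniform(0, 12, n),    # oil contamination (hypothetical scale)
    rng.uniform(60, 100, n),  # utilization (%)
    rng.uniform(10, 40, n),   # MTBF
])
y = ((X[:, 0] > 5.5) | (X[:, 1] > 92)).astype(int)  # 1 = fail, 0 = no fail

# Gini criterion and a depth-2 stopping rule, as in the lecture's tree.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["oil_contamination", "utilization", "mtbf"]))
```

With this synthetic rule, the fitted tree recovers cutoffs close to 5.5 on oil contamination and 92 on utilization, mirroring the structure discussed in the lecture.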