Dear participants, welcome to the course on supply chain digitization. It is jointly taught by Professor Priyanka, Professor Sosmeetha and Professor Devabrata from IIM Mumbai. This class is lecture 5 of module 3, the analytics in supply chain management module, and we will discuss the coding part in Python. In the last class, we discussed how a classification tree is built step by step. We discussed the Gini index, we discussed the concept of entropy, and we also discussed how a node is split to increase the accuracy of the model. So the theoretical part was covered in the last class. In this class, we will see how you can actually build a decision tree model using Python code. This is the model which we developed in the last class; we also discussed the managerial implications of this model and how it is developed step by step. Now we will reproduce the exact same output using Python code. I will first explain the code here in the PPT, and then towards the end we will run the same code in Python. So first let us understand the coding part, and at the end we will move into Google Colab, run the code and get the same output. To start with, the first thing is to import the data. We have a data set called maintenance.csv with 1000 observations. You can see serial number 0 to serial number 999, so 1000 rows are there, and I have the variables age, utilization, MTBF value, unplanned downtime, oil contamination, overhaul schedule compliance, scheduled lubrication compliance and machine failure. These are my columns, and I have 1000 rows. This data is in a csv file, so first I need to import it into Python. For doing that, I need a library called pandas.
The pandas library is used for data manipulation and analysis. So we are importing pandas and aliasing it as pd. Next, I am creating a data frame df, which is nothing but pd.read_csv applied to the file maintenance.csv. Now the whole data set will be stored in the data frame df. So df will have 1000 rows and all these columns: 1 age, 2 utilization, 3 MTBF value, 4 unplanned downtime, 5 oil contamination, 6 overhaul schedule compliance, 7 scheduled lubrication compliance and 8 machine failure. So I will have 8 columns and 1000 rows in the data frame df. By using these three lines of code, I get the data frame in Python. Later, I will also use a library called sklearn; I am mentioning it now, but it will be used later. sklearn, whose full name is scikit-learn, is a machine learning library for Python. It features algorithms like clustering, decision trees, random forests, logistic regression and so on. In today's class, we will call the decision tree algorithm from sklearn and use it. So now the data is imported and stored in Python. What is the next step? The next step is to list the variables. If I take my data frame, write df.columns and wrap it in list, I will get all 8 variables. Can you see on the right hand side? I have age, utilization, MTBF value, unplanned downtime, oil contamination, overhaul schedule compliance, scheduled lubrication compliance and machine failure. All 8 variables which I had in the data set are listed. Now out of these 8 variables, I have to define which ones are feature variables, that is, which variables are the x features, the independent variables. If you recall, the first 7 variables are independent variables. So I will use these 7 variables to predict the failure of a machine, which is my dependent variable.
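The import step above can be sketched as follows. Since the actual maintenance.csv file is not available here, a small inline CSV with illustrative column names and values stands in for it; in the lecture you would simply call pd.read_csv on the uploaded file:

```python
import io

import pandas as pd  # pandas: data manipulation and analysis

# Stand-in for maintenance.csv (illustrative rows); with the real file
# uploaded to Colab you would write: df = pd.read_csv("maintenance.csv")
csv_text = """age,utilization,MTBF_value,unplanned_downtime,oil_contamination,overhaul_schedule_compliance,scheduled_lubrication_compliance,machine_failure
5,80.2,21.5,3,6.1,1,0,1
2,65.0,30.1,1,4.2,1,1,0
8,92.5,18.7,5,7.3,0,0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # all 8 column names
```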
Machine failure is the dependent variable, the one I will predict, and all the others are my independent variables; in machine learning terms, these are also called feature variables. So I am writing X_features = list(df.columns), which lists all 8 variables, and then I am removing machine failure. After removing the machine failure variable, I am left with the first 7 variables. Then X_df is nothing but df[X_features], so if I list X_df.columns, I get the 7 variables. What did we do? We removed the dependent variable and constructed a new data frame X_df which has 7 columns. df has all 8 variables and 1000 rows; X_df has 7 variables, or 7 columns, and 1000 rows. The number of rows remains the same for both df and X_df; the only thing missing in X_df is one column, the machine_failure column. Through this, we have segregated the dependent and independent variables. Now we have to define the feature variables and the dependent variable. The feature variables are nothing but the independent variables which we have already defined, so X is X_df, and y, the dependent variable, is nothing but machine failure. So I have defined the dependent variable and the independent variables. If you look at the whole data set, machine failure is my y, and everything from age to scheduled lubrication compliance is my x. Now my idea is to predict y using these independent variables x. So how do I predict the value of y using x? We will do that next.
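The segregation of features and target described above can be sketched like this; a toy DataFrame with illustrative column names and values stands in for the real maintenance data:

```python
import pandas as pd

# Toy stand-in for the maintenance data frame (illustrative values).
df = pd.DataFrame({
    "age": [5, 2, 8],
    "utilization": [80.2, 65.0, 92.5],
    "MTBF_value": [21.5, 30.1, 18.7],
    "unplanned_downtime": [3, 1, 5],
    "oil_contamination": [6.1, 4.2, 7.3],
    "overhaul_schedule_compliance": [1, 1, 0],
    "scheduled_lubrication_compliance": [0, 1, 0],
    "machine_failure": [1, 0, 1],
})

# List all 8 columns, then drop the dependent variable.
X_features = list(df.columns)
X_features.remove("machine_failure")

X_df = df[X_features]       # 7 feature columns, same number of rows
X = X_df                    # independent (feature) variables
y = df["machine_failure"]   # dependent variable

print(X.shape, y.shape)
```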
For doing that, we have to split the whole data set. I had 1000 rows; we will split this data set into training and test data, because at the end of the day, I have to validate the model as well. See, if I take all the data and train the model, then I will not have any data left for testing purposes. So what we are doing here is splitting the data set into two parts, training data and test data, using these two lines of code. From the sklearn library, I am importing the train_test_split function, and then we are splitting the whole data set into training and test data: X_train, X_test, y_train, y_test, with a test size of 30 percent. So 30 percent of the observations will be stored as test data and 70 percent will be used as training data. When we develop the model, we will develop it on 700 observations. I have shown the X_train data here: 700 observations and 7 columns, so 70 percent of 1000, that is 700 observations, are stored as training data. If I go to the next slide, you will see the X_test data. How many observations? 300 rows, 7 columns, so 300 observations for the test data. Using these two lines of code, I can split the data set into training and test data. The test data I will store aside to validate the model; first I will use the training data, which is 700 observations, to train the model. Since we have to train the model and our algorithm is a classification algorithm using a classification tree, we need to call the decision tree classifier. So from the sklearn library, I am importing DecisionTreeClassifier. After importing it, I have to give the criteria. The first criterion is Gini.
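The split described above can be sketched as follows, assuming scikit-learn is installed; synthetic data with 1000 rows and 7 features stands in for the maintenance data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 7))     # 1000 rows, 7 feature columns (stand-in)
y = rng.integers(0, 2, size=1000)  # binary machine-failure label (stand-in)

# 30 percent held out as test data, 70 percent used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape)  # (700, 7)
print(X_test.shape)   # (300, 7)
```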
That means the Gini impurity index, which you learnt in the last lecture, will be used to split the nodes. And what is the stopping criterion? Maximum depth equal to 2. So I will go up to level 2 from the root node; the depth of the tree will be 2 at most, and the criterion used to split the nodes is Gini. There is another criterion you can use, entropy, but in this specific case we are calling the Gini index, which will be used to split each node into two parts. Then we are fitting the model on X_train and y_train. So we will use DecisionTreeClassifier with the Gini criterion and maximum depth 2 to develop our decision tree. I am using the training data to fit the model, and once we get the model, we will use the test data to validate it. Again, the main concepts, what pandas is, what sklearn is, what the criterion and max depth are, have been listed here for your reference. By doing this, we will get a decision tree. Next, I have to print the decision tree. The model is fitted, so how do we print it? To print it, we use a library called matplotlib, which is used for plotting; it is a very interesting library. We will import pyplot from matplotlib and write it as plt, so wherever you see plt, it is nothing but pyplot from matplotlib. Then from sklearn we have imported tree. So we will use plt and tree from sklearn to plot, or print, the decision tree. What is the size of the figure? 15 by 10. Then I am passing the feature names from X_train; I am taking all 700 observations from the training data, putting them into the model and printing it. And I have two classes: machine not failed, machine failed.
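The fitting and plotting steps described above can be sketched like this, assuming scikit-learn and matplotlib are installed; synthetic training data and illustrative feature names stand in for X_train, y_train and the real columns:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(700, 7))        # stand-in training features
y_train = (X_train[:, 4] > 0).astype(int)  # stand-in failure label

feature_names = [                          # illustrative names
    "age", "utilization", "MTBF_value", "unplanned_downtime",
    "oil_contamination", "overhaul_schedule_compliance",
    "scheduled_lubrication_compliance",
]

# Gini criterion, stop at depth 2 (root node plus two levels of splits).
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Print (plot) the fitted tree in a 15 x 10 figure.
plt.figure(figsize=(15, 10))
plot_tree(clf, feature_names=feature_names,
          class_names=["machine not failed", "machine failed"], filled=True)
plt.savefig("tree.png")
```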
Now if I run this code, I will get the decision tree. Python will give you this exact decision tree. If I go a little forward, this is what we showed earlier, and if I compare this decision tree with that one, it is exactly the same. So this is how you get the output in Python. Let me help you read this decision tree. First, you see I have 700 observations as the sample size. Out of these 700, in 447 observations the machine failed and in 253 it did not fail. Since more observations represent machine failure, the whole node is labelled machine failed, and the Gini index is 0.462. We have used oil contamination with the value 5.5 to split this node. If oil contamination is less than or equal to 5.5, I go towards one node; if oil contamination is more than 5.5, I go towards the other node, and so on. If oil contamination is more than 5.5, I have 559 observations; if it is less than or equal to 5.5, I have 141 observations. Out of those 141, in 94 observations the machine did not fail and in 47 it failed; since 94 is more than 47, that node is labelled machine not failed. In the other node, out of 559, the machine failed 400 times; since 400 is more than 159, that node is labelled machine failed. I can also see the Gini indices, 0.44 and 0.407. So for each node you can see the Gini index, the sample size and the predicted class: in this node the prediction is machine failed, in that node the prediction is machine will not fail. Now I can further split a node into two parts: if MTBF is less than or equal to 23.95, I go towards one node; if MTBF is more than 23.95, I go towards the other.
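The root-node Gini of 0.462 quoted above can be checked by hand: with 447 failures and 253 non-failures out of 700, the Gini impurity is 1 minus the sum of the squared class proportions.

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Root node: 447 machine-failed vs 253 machine-not-failed out of 700.
print(round(gini([447, 253]), 3))  # → 0.462
```

A perfectly mixed node (equal counts of both classes) gives the maximum two-class impurity of 0.5, while a pure node gives 0, which is why lower Gini after a split means a better split.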
If MTBF is more than 23.95, I have 66 observations; out of those 66, 47 observations represent machine not failed and 19 represent machine failed. Since 47 is more than 19, that node represents machine not failed, and the Gini value is 0.41. In the same way you can read the rest of the tree, and it is exactly the same representation as what we showed in the last class. As I was saying, we have 700 observations in total; if I switch back, the PPT also shows 700 observations. We used oil contamination as the splitting variable with a cutoff value of 5.5, and the Python output also uses oil contamination with cutoff value 5.5. If oil contamination is less than or equal to 5.5, I go towards one node; if it is more than 5.5, I go towards the other. If oil contamination is more than 5.5, I have 559 observations in the PPT, and see, in Python I also have 559 observations. If oil contamination is less than or equal to 5.5, I have 141 observations, and in Python I also have 141 observations. So you are getting the same output in Python. It also shows the Gini index: Gini is 0.44 for oil contamination less than or equal to 5.5 in both, and 0.407 for oil contamination more than 5.5, the same thing. Now, going further down, I am using the MTBF value, and in the Python output I am again using the MTBF value. For MTBF less than or equal to 23.95 I have 493 observations in both, and for MTBF more than 23.95 I have 66 observations in both, and the Gini indices also match: 0.41 and 0.35 in each. So exactly the same output shown over here is what you are also getting in Python. Similarly, the node having 141 observations is split using utilization with a cutoff value of 92.05.
So if utilization is less than or equal to 92.05, I reach a node having 109 observations with Gini 0.28. You can see the same thing in the PPT: splitting the node with 141 observations using utilization less than or equal to 92.05, I arrive here with 109 observations and Gini 0.28. So it is exactly the same output; only the representation is a little different, but you get exactly the same result using Python code. Now we have the output, and this output was generated using the training data, but I also have to use the test data to check the accuracy, whether the model performs well on test data or not. If it performs well on the test data, then we will say the model is good. So now let us see how to predict the behaviour of the model on the test data. We will use this code: from sklearn I am importing metrics, and then we are using predict on X_test; you see, X_train we used for training, and X_test we are using for testing. If you write these three or four lines of code, you will see the actual value of y versus the predicted value of y. So this is the actual value of y and this is the predicted value of y. For the 521st observation the machine actually did not fail, and the prediction is also that the machine did not fail. For observation 737 the actual is 0 and the prediction is also 0; for observation 660 the actual is 1 and the predicted value is also 1. There might be scenarios where the actual is 0 and the prediction is 1, and sometimes the actual is 1 and the predicted value is 0; these scenarios are possible. I have only shown the first 5 observations, but if you look at the whole data you will see some discrepancies: a 0 may be predicted as 1, and a 1 may sometimes be predicted as 0. Now, to see how many 0s have been predicted as 1, how many 1s as 0, how many 0s as 0 and how many 1s as 1, we need to know about the confusion matrix and the accuracy score.
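The prediction step above can be sketched as follows, again with synthetic data standing in for the maintenance set and assuming scikit-learn is installed:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 7))   # stand-in features
y = (X[:, 4] > 0).astype(int)    # stand-in failure label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=1)
clf.fit(X_train, y_train)

# Predict on the held-out test data and compare actual vs predicted.
y_pred = clf.predict(X_test)
comparison = pd.DataFrame({"actual": y_test, "predicted": y_pred})
print(comparison.head())  # first few actual-vs-predicted pairs
```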
Therefore, I need to import confusion_matrix and accuracy_score from sklearn. I will import these two functions and run this code, which gives me this kind of output on the right hand side. How do we read this? Out of the 300 observations in the test data, actually 178 are yes and 122 are no; that is the actual. What did the model predict? 73 no and 227 yes. So of course there is a discrepancy. Now if you look cell by cell: I have 57 observations where the actual was no and the model also predicted that the machine would not fail, so these are correct; and I have 162 observations where the actual was yes and the model also predicted yes. So 57 nos have been predicted as no and 162 yeses have been predicted as yes. Therefore the accuracy is 57 plus 162 divided by 300, which is 0.73, so 73 percent is the accuracy on the test data. Then there are 65 observations where the actual was no but the model predicted yes, and 16 observations where the model predicted no but the actual was yes; these are the discrepancies, the cases not predicted correctly. What was predicted correctly? 57 and 162, so 57 plus 162 divided by 300 gives me the accuracy score of 0.73, and this whole matrix is called the confusion matrix. So this is how you can actually test whether the model performs well on the test data or not; it is one way to check the validity of the model, using the confusion matrix and the accuracy score. Now what we will do is take you to Google Colab, run this code and get the same output. So first go to this website; you can search for Google Colab or type it in your browser and you will reach Google Colab.
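The accuracy arithmetic from the confusion matrix above can be checked directly; the four cell counts (57, 65, 16, 162) are the ones read out in the lecture, reconstructed here as label lists:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Cell counts from the lecture's confusion matrix:
#   57  actual no,  predicted no   (correct)
#   65  actual no,  predicted yes  (wrong)
#   16  actual yes, predicted no   (wrong)
#  162  actual yes, predicted yes  (correct)
y_true = [0] * 57 + [0] * 65 + [1] * 16 + [1] * 162
y_pred = [0] * 57 + [1] * 65 + [0] * 16 + [1] * 162

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))    # (57 + 162) / 300 = 0.73
```

Note that in scikit-learn's confusion_matrix the rows are the actual classes and the columns are the predicted classes, so the correctly classified counts sit on the diagonal.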
Then you have to log in using a Google account, upload the data file maintenance.csv, and run the code step by step as I am going to show you. So in the next 3 to 4 minutes I will show you how to run the code and get the same output. The code has already been explained; we will only show how to run it using Google Colab. Now, this is how Google Colab looks; if you go to Google Colab, this is how the page will appear. I have already put in the code for you, so I will just run it. The first thing you have to do here is use the file upload sign: upload the file maintenance.csv, and then write the code and run it. When I run the first cell, I am importing the data; you can see I am importing pandas as pd, the same thing we explained in the PPT, and importing the data. I have the whole data set here, all 8 variables and 1000 rows. Next, listing the columns: I can see I have 8 columns in the main data frame df. Now I am defining the feature variables: out of the 8 columns I am taking the first 7 and storing them as feature variables, that is, the independent variables, and the dependent variable, machine failure, has been removed. Then I am defining the feature variables and the dependent variable: the feature variables are X_df, all 7 independent variables, and the dependent variable is machine failure. Then I am splitting the data into training and test data; you can see test size 30 percent, training 70 percent, and this split is random, so I will randomly get 700 training observations and 300 test observations. I will park the test data aside for testing purposes. Then I run the code that builds the decision tree on the training data; the criterion is Gini and the maximum depth of the decision tree is 2.
So I am going up to 2 levels and using the Gini criterion to split the nodes. Then I am printing the decision tree: I am importing pyplot from matplotlib and tree from sklearn, and then printing the tree. This is how the decision tree is printed. You can see the whole decision tree, exactly the same output we showed in the PPT. Now the model is developed, and I have to test whether it performs well or not; for that I use the test data and run the model on it. This column represents the actual value of y and this column the predicted value of y. For example, for observation 935 the actual was 1 but the model predicted 0. All 300 observations are there, but because of space limitations only 10 are shown, the first 5 and the last 5; if you export the output you will see all 300 rows. Then we created the confusion matrix, and this is how it looks: 57, 65, 16, 162, and 73 percent is the accuracy. So this is the whole code, and if you run it you will get the same output as I got here; all the steps of the code have been explained in the PPT. The same values we showed in the PPT: the final accuracy is 0.73, and you can see 57, 65, 16 and 162 here as well. Whatever output you get in Python, we have also got the same output. That is all from today's session, and if you follow the same steps we did, you will also get the same output. With this we are finishing this part: we started with how to develop a classification tree to predict whether a machine will fail or not.
Then we used that example of whether a machine will fail or not, with different variables such as age, MTBF value, utilization and others, and predicted whether a machine will fail or not. We explained the concept of the Gini index and the concept of entropy, and finally, using the Python code, we have also shown the output. With this, we will stop here. We will see you in the next class. Thank you for listening carefully and practicing along with me. Thank you so much.