Dear participants, welcome to the course on supply chain digitization. It is jointly taught by Professor Priyanka Burma, Professor Sushmita Narayana and Professor Debra Prathadas from IA Mumbai. Today we are going to talk about lecture 9 of module 3, Analytics in Supply Chain Management.
In the last lecture, lecture 8 of this module, we developed a regression tree to predict the demand of retailers based on their region, balance credit, location, age, size, promotion and holidays. I will quickly summarize what this tree says, and then we will see how it is built.
If you look at the regression tree, the first split is on the size of the store. If the size is less than or equal to 30.5 thousand square feet and no promotion was given in that week, then at node 3 the estimated demand is 943. If the size is less than or equal to 30.5 thousand square feet and a promotion was given, the estimated demand is 2360. For a retailer whose size is more than 30.5 thousand square feet and whose store age is less than or equal to 17.5 years, the estimated demand is 2887. And for a retailer whose size is more than 30.5 thousand square feet and whose age is more than 17.5 years, the estimated demand is 8227.
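The four rules read off the tree can be written down as a small function. This is only a sketch of the tree as described in the lecture; the function name and argument names are made up for illustration, and size is assumed to be expressed in thousand square feet.

```python
def estimate_demand(size_thousand_sqft: float, promotion: int, age_years: float) -> int:
    """Return the leaf estimate of the depth-2 regression tree from the lecture."""
    if size_thousand_sqft <= 30.5:
        # Smaller stores are split on whether a promotion ran that week.
        return 943 if promotion == 0 else 2360
    # Larger stores are split on the age of the store.
    return 2887 if age_years <= 17.5 else 8227


# A large, old store with no promotion falls into the last leaf.
print(estimate_demand(size_thousand_sqft=36, promotion=0, age_years=24))
```

Note that promotion plays no role for the larger stores, because a tree of depth 2 has room for only one further split on each side.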
Using this regression tree, given a retailer's characteristics, we can estimate what the demand of that retailer would be. In the last class we focused only on interpreting the tree and its managerial insights. In today's session we will focus on how to get this tree using Python. I will first explain the code and its output here in the PPT, and towards the end we will move to Google Colab and show how the code can be run and the same results replicated.
As with the classification tree we saw in the maintenance problem, similar code is written for the regression tree. First, I need to import the data. For that we call the pandas library, which is a data manipulation and analysis library. I import pandas as pd and then read the data file, demand.csv, into Python.
After importing, if I print a few rows, it looks like this. The 310th retailer is in the south region, the balance credit amount is 12 lakh, the location is urban and the age is 24 years, which means the store has been operating for the last 24 years. The size of the store is 36,000 square feet, no promotion was given in that week and there was one holiday. For this retailer the actual order quantity was 8640. Similarly, retailer 311 is located in the west, the balance credit amount was 6 lakh, location urban, age 11 years, size 36,000 square feet, no promotion was given, there was one holiday, and the order quantity was 3960. I am reading out only 2 rows, but all 1000 records have been loaded into Python using the pandas read_csv command.
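A minimal sketch of this loading step is shown here. Since demand.csv itself is not reproduced, the two rows read out above are placed in an in-memory string; the column names, the rupee amounts for the balance and the size expressed in thousand square feet are assumptions, not the actual file layout.

```python
import io

import pandas as pd

# Stand-in for demand.csv: only the two rows read out in the lecture.
# Column names are assumed; 12 lakh is written as 1,200,000.
csv_text = """region,balance_credit_amount,location,age,size,promotion,holidays,order_quantity
south,1200000,urban,24,36,0,1,8640
west,600000,urban,11,36,0,1,3960
"""

# With the real file this would be: df = pd.read_csv("demand.csv")
df = pd.read_csv(io.StringIO(csv_text))
df.index = [310, 311]  # retailer numbers quoted in the lecture

print(df.head())
```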
Once we get the data into Python, we have to tell the model which variables are the feature variables, that is, the independent variables, and which variable is the dependent variable. In this case I have only one dependent variable: the estimated demand.
To do that, I first list the variables. Remember that df is my data frame: the file demand.csv has been read into Python and stored as the data frame df. I list all the columns of the data frame: region, balance credit amount, location, age, size, promotion, holidays and order quantity. So I have 8 variables.
Out of these 8 variables, I need to say which are the feature variables. We define X_features as list(df.columns), which gives me the list of all the variables, and then I remove order quantity from X_features, because order quantity is the dependent variable I would like to predict. Order quantity is nothing but the demand in this case. If you look at the output, I have all the variables except order quantity. So region, balance credit amount, location, age, size, promotion and holidays are my 7 feature variables, that is, my independent variables. Using these characteristics, I have to predict the demand of a retailer.
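The listing and removal step might look like the following sketch. The data frame here is empty and carries only column names, which are assumed spellings of the eight variables just listed.

```python
import pandas as pd

# Empty frame carrying only the (assumed) column names of demand.csv.
df = pd.DataFrame(
    columns=[
        "region", "balance_credit_amount", "location", "age",
        "size", "promotion", "holidays", "order_quantity",
    ]
)

# Start from every column, then drop the dependent variable.
X_features = list(df.columns)
X_features.remove("order_quantity")

print(X_features)
```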
One interesting point here is that region and location are not continuous variables. They are categorical in nature. I have 4 categories in region: south, west, east and north. I have 3 categories in location: rural, semi-urban and urban. I cannot enter this data into the model as text, so I have to convert it into binary variables. Whatever model you are building, whether a regression tree, a classification tree, logistic regression or any other machine learning model, you have to do this data processing first: if you have a categorical variable, you have to convert it into binary variables.
Let us see how the conversion works. If I have 4 categories, I need 3 binary variables, or dummy variables. If I have 3 categories, I need 2. In general, if I have m categories, I need m minus 1 dummy variables. In the case of region I have 4 categories, so I will form 3 dummy variables.
I have used the code encoded_df = pd.get_dummies on my data frame. get_dummies is a pandas function that creates dummy variables. Let me show you the output. Region had 4 categories, so I need 3 dummy variables: region_north, region_south and region_west. Each dummy variable takes one of 2 values, 1 or 0. How do I read this? For observation 978, region_north is 0, which means the region is not north for this retailer. region_west is also 0, and region_south is 1. That means retailer 978 is from the south region.
Similarly, 979 is from the west region. Can you tell me what 980 is? The 980th retailer is again from the west region. Now, what about retailer 981? It is not from north, not from south and not from west. So where is it from? It is from east, because there is no binary variable for east. East is my base category. 982 again is not from north, not from south and not from west, so it is from east. Similarly, 983 is from north. So using these three binary variables I can code all four categories. That is how you interpret the dummy variables: I had four categories in region and I converted them into three dummy variables.
Now, can you tell me what will happen with location? Location has three categories: rural, semi-urban and urban. I need to form two dummy variables, and that is what we have done: location_semi-urban and location_urban. Rural is my base category. If I look at 978, both values are 0. That means it is not semi-urban and not urban, so it is rural. So the first one is south and rural. Observation 979 is semi-urban, because the semi-urban value is 1. 980 is again semi-urban, 981 is urban, 982 is neither semi-urban nor urban, which means it is rural, and 983 is semi-urban.
That is how we convert a categorical variable into dummy variables. The main idea is that if there are m categories, I have to create m minus 1 dummy variables, and one category has to be the base category. In the case of region we have kept east as the base category, because dummy variables were created for north, south and west.
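The encoding of retailers 978 to 983 can be reproduced with a short sketch. The six rows below are reconstructed from the interpretation given above; passing drop_first=True is an assumption about how the lecture's code drops the base category, though it does yield east and rural as base categories because pandas drops the alphabetically first level. dtype=int is added only so the output prints as 0 and 1.

```python
import pandas as pd

# Region and location of retailers 978-983, as interpreted in the lecture.
df = pd.DataFrame(
    {
        "region": ["south", "west", "west", "east", "east", "north"],
        "location": ["rural", "semi-urban", "semi-urban", "urban", "rural", "semi-urban"],
    },
    index=[978, 979, 980, 981, 982, 983],
)

# m categories -> m - 1 dummy columns; the first level of each column is the base.
encoded_df = pd.get_dummies(df, drop_first=True, dtype=int)

print(encoded_df)
```

A row with all region dummies equal to 0 (981 and 982 here) belongs to the base category east, and a row with both location dummies equal to 0 is rural.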
So east is the base category for region. Similarly, location had 3 categories, rural, semi-urban and urban, and rural is the base category, so I have created 3 minus 1, that is 2, dummy variables. This treatment has to be done before model building. Whatever your model is, logistic regression, multiple linear regression, a decision tree or a random forest, you have to do this data processing first so that the model can handle the categorical variables properly.
Now the data processing is done and the categorical variables have been converted into binary variables. Notice that the number of variables has increased. Earlier I had 7 feature variables. Now, instead of the single region variable I have 3 variables, and instead of the single location variable I have 2 variables. So in total I have 10 independent variables, and the dependent variable is order quantity.
Once the dependent and independent variables are defined and the data processing is done, we have to split the data into 2 parts: training data and test data. Why is that required? I cannot train the model on the whole data set, because then I will have no data left to test the performance of the model. So we keep aside some portion of the data, which is not part of the training, to test the model. In our case we use 30 percent of the data as test data and the remaining 70 percent as training data. For this we use the scikit-learn library: from sklearn.model_selection we import train_test_split, a ready-made function for this split.
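As a rough sketch of the split, the code below builds 1000 rows of invented data (the values and the three column names are placeholders, not the course data set) and holds out 30 percent for testing. The random_state argument is an addition for reproducibility and is not mentioned in the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in for the encoded features and the demand column.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame(
    {
        "size": rng.uniform(5, 50, n),
        "age": rng.integers(1, 40, n),
        "promotion": rng.integers(0, 2, n),
    }
)
y = pd.Series(rng.normal(2000, 500, n), name="order_quantity")

# Hold out 30% of the rows; the remaining 70% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```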
If I call this function with test size equal to 30 percent, I get 30 percent test data and 70 percent training data. You can change this split, for example to 40/60, as you wish. So I have 700 observations as training data and 300 observations as test data. The test data I keep aside to test the performance of the model.
Once we have the training data, I need to build the model on it. Since we are using a regression tree, from scikit-learn I import DecisionTreeRegressor. scikit-learn is a machine learning library for Python that features various algorithms such as k-means, decision trees, random forests and regression. In our case we call the decision tree, and specifically the decision tree regressor. In the regressor I have to specify the maximum depth of the tree, that is, how deep we would like to go. We set it to 2 for now, and then I fit the regression tree on the training data.
Once the regression tree is fitted, I need to print the tree to see how it has come out. For that we import matplotlib, which is a plotting library for the Python programming language, and from scikit-learn I import tree. So I am importing 2 things: pyplot from matplotlib and tree from sklearn. Using these 2, I can print the regression tree, and this is how the output looks.
Now, how do we read this tree? You have already seen the output we explained in the last class. That output is exactly the same as the one you get from Python; I had just redrawn it so that it is easier to understand. Let us see how they match.
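The fitting and plotting steps could be sketched as follows. The training data is invented (700 rows generated from an arbitrary rule plus noise), so the resulting tree will not match the one on the slides; only the calls DecisionTreeRegressor(max_depth=2), fit and tree.plot_tree correspond to the steps described above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

# Invented training data: 700 rows, three placeholder features.
rng = np.random.default_rng(0)
n = 700
X_train = pd.DataFrame(
    {
        "size": rng.uniform(5, 50, n),
        "age": rng.integers(1, 40, n),
        "promotion": rng.integers(0, 2, n),
    }
)
# Arbitrary demand rule plus noise, only so the tree has something to learn.
y_train = 100 * X_train["size"] + 1500 * X_train["promotion"] + rng.normal(0, 200, n)

# Limit the tree to two levels of splits, as in the lecture.
model = DecisionTreeRegressor(max_depth=2)
model.fit(X_train, y_train)

# Draw the fitted tree; each box shows the split rule, error, samples and value.
fig, ax = plt.subplots(figsize=(12, 6))
tree.plot_tree(model, feature_names=list(X_train.columns), filled=True, ax=ax)
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```

With max_depth=2 the tree has at most four leaves, which is why the lecture's tree ends in exactly four demand estimates.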
The first thing is node 0, in which I have all 700 observations, that is, 100 percent of the training data. I had 1000 records in total and kept aside 300 for testing, so I am working with 700 observations. The average demand at this node is 2270, which is the predicted value. From here I use the variable size of the store to split this node into 2 parts: size less than or equal to 30.5 thousand square feet, and size more than 30.5 thousand square feet. In the Python output it is exactly the same: size less than or equal to 30.5, samples 700 and predicted value 2270. So node 0 matches.
Node 0 is then split into 2 parts on size less than or equal to 30.5. If I go in this direction, the statement is true; if I go in the other direction, the statement is false. That means this side is size less than or equal to 30.5 and the other side is size greater than 30.5 thousand square feet. If size is less than or equal to 30.5, the demand is 1902 and there are 612 observations. If size is more than 30.5, there are 88 observations and the demand is 4829. This matches exactly.
After coming to the left node, I have to check promotion: promotion less than or equal to 0.5. Promotion is a binary variable whose value is either 0 or 1, so less than or equal to 0.5 means promotion equal to 0, and the other direction, promotion greater than 0.5, means promotion equal to 1. For a retail store with size less than or equal to 30.5 and promotion 0, the estimated demand is 943, and you can see it matches. Similarly, for a retail store with size less than or equal to 30.5 where a promotion was given, the estimated demand is 2360, as I said.
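A quick arithmetic check ties these node values together. Each node's predicted value is the mean demand of the training observations in that node, so the root value should equal the sample-weighted average of its two children. The sketch below uses the rounded child means quoted above, so the agreement is only up to rounding.

```python
# Sample counts and (rounded) mean demand of the two children of node 0.
left_n, left_mean = 612, 1902    # size <= 30.5
right_n, right_mean = 88, 4829   # size > 30.5

# The parent mean is the sample-weighted average of the child means.
root_mean = (left_n * left_mean + right_n * right_mean) / (left_n + right_n)

print(left_n + right_n)   # 700 training observations
print(round(root_mean))   # 2270, the value shown at node 0
```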
In the same way I have to go in the other direction: for size greater than 30.5, the split is on age. This direction is age less than or equal to 17.5, and the other direction is age greater than 17.5. For a retailer whose size is more than 30.5 and whose age is less than or equal to 17.5, the predicted demand is 2887.3, that is 2887, as you can see. Similarly, for a retailer whose size is more than 30.5 and whose age is more than 17.5 years, the estimated demand is 8226.87, which rounds off to 8227, and that is what is shown.
So the tree we got using Python is exactly the same as the tree we showed in the last class. The tree also gives you the MSE values. If you look at this tree, the MSE value is given as 815, 1813, which matches exactly. At any node you go to, you will see the MSE value as well.
Now we will go to Google Colab and show how this output is generated. The code has already been explained to you. We import the data file demand.csv, then we list the columns, then we define the feature variables. You can see all the columns. Out of these, order_quantity has to be kept aside as the dependent variable, so we remove it and get region, balance credit amount, location, age, size, promotion and holidays as X_features. These are my independent variables, and the dependent variable is order_quantity. Then we do the dummy variable part, converting the categorical variables into dummy variables, and this is the output: region has been converted into 3 dummy variables and location into 2 dummy variables.
Then we define the feature variables and the dependent variable. X is the set of feature variables after encoding: we first encode the categorical variables, that is, convert them into binary variables, and whatever columns we get are the independent variables. The dependent variable is order_quantity.
Then we split the data into training and validation sets. X_train has 700 observations and X_test has 300 observations. Similarly, y_train has 700 rows and y_test has 300 rows. The test data is kept aside for validation, and we now use the training data, X_train and y_train. Then I import the decision tree regressor and run it with maximum depth equal to 2, so the tree goes up to 2 levels. Then we can print the tree, and this is how the printed output looks. It matches exactly what we showed in the PPT.
You can reproduce the same output yourself. We will share the data as well as the Python code, and you can use Google Colab, Jupyter Notebook or any other Python notebook to run it.
In the next lecture we will show how validation happens, that is, how to test the performance of this model. We have developed the model, but how good is it, and how does it behave on the test data? So far we have worked with the training data, but we have to see how it performs on the test data. In the next lecture we will show how that performance is measured and what measures you should take if the performance is not good. Thank you, see you in the next class.