This course is Quality Control and Improvement with Minitab; we are in session 26, and I am Professor Indrajit Mukherjee from the Shailesh J. Mehta School of Management, IIT Bombay. In the last session we were discussing multiple regression and how to select the variables. Confusion arises when we have multiple x variables regressed against a single y: we are not sure which variables should be considered and which should not. We took some examples last time, so let us see whether we can resolve the dilemmas we were facing. This was one of the problems we were dealing with at that point: electrical power consumption is monitored, and it may be related to variables x1, x2, x3 and x4; the details are given on the left side of the slide, and we want to see which model best explains y with respect to the given set of x variables. So we went to Minitab, where the variables are in columns C5 to C9 — C5 is y and C6 to C9 are the four x variables. Then we went to Stat > Regression > Regression > Fit Regression Model, selected y and the set of predictors x1, x2, x3 and x4, and in Model we considered all the variables and included the constant term. In Options we did not apply any transformation, and initially we did not use stepwise regression. Under Graphs we can ask for the normal probability plot, residual plots and order plots. Under Validation we can specify the 10-fold cross-validation that we discussed last time, and under Storage I choose to store the standardized residuals, and click OK.
Let us see the results we are getting. Minitab automatically gives you a regression equation, but what is surprising is that x1 is the only significant variable — its p-value is less than 0.05 — while the other terms are not significant. Also, the adjusted R-squared is about 76 percent, while the 10-fold cross-validated R-squared is around 48 percent. When a significant difference like this exists between the two, something is going wrong: the model fit may not be correct, or we may have overfitted the model. The analysis of variance also confirms that only x1 is significant. In this scenario, when we are not sure which variable should go in and which should go out, one thing we can do is remove x2, x3 and x4, retain only x1, and see what happens. This is the trial-and-error method I am showing here, without yet going into stepwise regression. So I remove the variables which are not significant, keep the stepwise option as None, click OK, and look at the performance of that model. What we see is that x1 is significant — the regression coefficient and the equation are given — but the explained variability is lower: adjusted R-squared is around 60.85 percent and the 10-fold cross-validation is about 50 percent. Now the cross-validation and adjusted R-squared are somewhat close to each other, but I still feel something more can be done. The lack-of-fit test also gives a p-value above 0.05.
So there is no sign of lack of fit. To avoid all this confusion, we can directly use stepwise regression. Go to Fit Regression Model, select all the variables x1 to x4, and in the Stepwise options choose Stepwise; there will be an alpha-to-enter and an alpha-to-remove, which you can keep at their defaults. This methodology works by adding and removing variables simultaneously, depending on which are significant and which are not. The theory can be found in any book that covers stepwise regression. There are two other related methods, forward selection and backward elimination; stepwise combines elements of both, so we prefer stepwise regression here. When I click OK, two variables are entered into the model: x1, with a p-value of 0.048, and x2, with a p-value of 0.026 — both significantly less than 0.05. Also, the adjusted R-squared has improved significantly from about 60 to 75 percent, and the 10-fold cross-validation from 50 to 57 percent. So we can say this may be the best model we have considered, and we should go ahead and implement it.
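The add-and-remove idea behind stepwise regression can be sketched in code. This is only a simplified, numpy-only approximation on synthetic data: it does greedy forward selection by adjusted R-squared rather than Minitab's alpha-to-enter/alpha-to-remove p-value rules, and all variable names and data are made up for illustration.

```python
import numpy as np

def adj_r2(y, X):
    """Fit OLS with an intercept and return adjusted R-squared."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_select(y, X):
    """Greedily add the column that most improves adjusted R-squared;
    stop when no addition improves it."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], -np.inf
    improved = True
    while improved and remaining:
        improved = False
        score, j = max((adj_r2(y, X[:, chosen + [j]]), j) for j in remaining)
        if score > best:
            best, improved = score, True
            chosen.append(j)
            remaining.remove(j)
    return chosen, best

# Synthetic data: x3 is nearly collinear with x1, as in the lecture's scenario
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + rng.normal(scale=0.05, size=50)
y = 3 + 2 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=50)

cols, score = forward_select(y, np.column_stack([x1, x2, x3]))
print(cols, round(score, 3))
# x2 (index 1) is reliably picked; the collinear pair x1/x3 contributes
# essentially one member, since adding the second barely helps
```

Real stepwise procedures also test whether an already-entered variable should be dropped at each step; this sketch only shows the entry side.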
For this model, the only remaining thing is to check whether the residuals follow a normal distribution, which we can do with a normality test. The residuals were saved in the last column — residual 2, the final residuals after running the model — and the test says there is no violation of the normality assumption we rely on: the p-value of 0.253 is more than 0.05. So the residuals are also satisfactory. If I consider x1 and x2 in the model with y, both are significant, and the model adequacy checks show no significant deviation. This is one scenario where we can use the stepwise regression method. But in the second case — where heat is the y characteristic — the confusion will arise again, so let us take that example: heat has to be regressed on the other variables. These are the variables now: heat with the four predictors x1, x2, x3 and x4. I go to Stat > Regression > Regression > Fit Regression Model, give heat as the response, select x1 to x4 as the predictors, and initially do not use stepwise regression, because I want to see the full-model results first. I also store the residuals, and click OK to see what the model says.
What you see here is that none of the p-values for x1 to x4 is significant, yet the adjusted R-squared is very high — about 97 percent — and the 10-fold cross-validation is also very good. So something is wrong: none of the variables is significant, but the model apparently predicts very well. We might be tempted to adopt this equation immediately, but we will not, because another issue appears here, known as the variance inflation factor (VIF). The VIF indicates whether there is multicollinearity in the data set. What is multicollinearity? It means the x variables are interrelated with each other — say x1 with x2, or x2 with x3 — and when that correlation is very high it influences the model: the model will not be correct, it will give biased judgments, and the signs of the coefficients may even flip, so that a coefficient which should be positive appears negative. So multicollinearity means there are strong relationships among the x variables, and it is flagged by an index known as the variance inflation factor. For example, suppose x1 and x2 are recorded simultaneously: when x1 is 1, x2 is 2; when x1 is 2, x2 is 4; when x1 is 3, x2 is 6. Then x2 has an exact functional relationship with x1, which means there is a very high correlation between the x1 data and the x2 data.
So I can calculate the variance inflation factor for this data set — for x1, and similarly for x2, since there are only two variables here. For each predictor I calculate an index R_i-squared, the coefficient of determination obtained when x_i is regressed on the other predictors — here, x1 regressed as a function of x2. Putting that value into the formula VIF_i = 1 / (1 − R_i²) gives the variance inflation factor; with only two predictors, the VIF for x1 will be the same as the VIF for x2. When multicollinearity exists, this is why we will not get the best models: the estimation of the beta coefficients goes wrong, and the predictions will also go wrong. If I ignore multicollinearity, my prediction model may show something quite different from the actual scenario. So I need to rectify it, and there are different ways of rectifying the multicollinearity problem. One approach we have already used is stepwise regression, which may eliminate the multicollinearity problem. Another method is known as best subsets regression, based on which we can select which variables go in and which go out. Let us look at stepwise regression first; so we go back to Regression.
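The definition just stated — regress each x_i on the other predictors, take R_i², and compute VIF_i = 1 / (1 − R_i²) — can be computed directly. A numpy-only sketch on made-up data (Minitab reports this same quantity in its coefficients table; a small amount of noise is added so the relationship is strong rather than exact, since x2 = 2·x1 exactly would give an infinite VIF):

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i on the remaining columns (with an intercept)."""
    n, k = X.shape
    out = []
    for i in range(k):
        xi = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xi, rcond=None)
        resid = xi - others @ beta
        r2 = 1 - (resid @ resid) / ((xi - xi.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
x2 = 2 * x1 + rng.normal(scale=0.1, size=40)  # strongly tied to x1, as in the lecture
x3 = rng.normal(size=40)                      # independent predictor for contrast
v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 1))
# x1 and x2 get very large VIFs (far above 10); x3 stays near 1
```

Applying the rule of thumb discussed below — act when VIF exceeds 5 or 10 — would flag x1 and x2 here and leave x3 alone.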
When we fitted the full model, the variance inflation factors appeared in the output; I copy it as an image and paste it into Excel so we can enlarge it and see the results. The VIF for each variable is shown in the VIF column: x1 has a VIF of 38.5, another variable is at 254, another at 46 — there is a high amount of correlation among the x variables. Which ones are highly correlated with which, we can see later from the correlation matrix plot. Whenever such a strong relationship exists, the VIF will exceed 5, or even 10. Statisticians generally follow rules of thumb here: if the VIF is more than the chosen cut-off, we take action to eliminate the multicollinearity problem from the regression equation, so that the prediction model becomes more accurate. The generally recommended threshold is 10, but some statisticians suggest that anything above 5 is already a concern. So we should try to remove the multicollinearity problem and then develop the regression equation. Here the VIFs exceed even 10, whichever criterion you select, so there is a problem, and the option we will take is stepwise regression. So: Regression > Fit Regression Model, with predictors x1 to x4.
In this case I will use stepwise regression and see how it works. Storage of the residuals is already set; under Graphs we ask for residuals versus fits and the normal probability plot, and under Validation, 10-fold cross-validation. When I click OK, what we observe is that only x1 and x2 are retained. After the stepwise regression we get a regression equation of the form heat = 52.58 + 1.46 x1 plus a coefficient times x2. The results also show the variance inflation factors: if we take this as the final equation, the VIFs are near 1, as you can see when you enlarge the output. When the VIF is near 1, there is essentially no multicollinearity problem, and we can assume the variables x1 and x2 are independent of each other. So we have simply removed x3 and x4. The R-squared value and the 10-fold cross-validation are now more or less close to each other, the ANOVA shows the retained terms are significant, and we can look at the residual normal probability plot to verify whether the final residuals are normal. What we observe is that this assumption is also validated: p is more than 0.05, so there is no problem with the errors or residuals. We can clear the unneeded output so that only the required information remains. So this is verified: two variables go in and two variables go out.
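The residual normality check described above can be reproduced outside Minitab. Note the assumptions: Minitab's default normality test is Anderson–Darling, whereas this sketch uses scipy's Shapiro–Wilk test (assuming scipy is available), and the residuals here are synthetic stand-ins for the stored column.

```python
import numpy as np
from scipy.stats import shapiro

# Stand-in for the stored standardized residuals from the fitted model
rng = np.random.default_rng(2)
residuals = rng.normal(scale=0.5, size=30)

stat, p = shapiro(residuals)
print(f"Shapiro-Wilk p-value = {p:.3f}")
# If p > 0.05, there is no evidence against the normality assumption,
# mirroring the p = 0.253 conclusion in the lecture's Minitab output
```

The decision rule is the same either way: a p-value above 0.05 means the residuals are consistent with the normality assumption the regression relies on.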
That means the stepwise regression has taken care of the multicollinearity problem here — we are somewhat fortunate that it did. Let us examine the multicollinearity issues directly. Go to Stat > Basic Statistics > Correlation, and select heat and the variables x1 to x4; I want to see the correlation matrix. In Options I choose the Pearson correlation, and under Graphs I ask for the correlation plot with p-values, because I am interested in which variables are significantly correlated with which; we can also display the pairwise correlations. When I click OK, we get a matrix plot showing the correlation between the y variable and each x, and the interrelationships among the x variables themselves. Looking at the first column: heat shows a p-value of 0.01 with x4, so heat is highly correlated with x4; with x3 not so much, since that p-value is more than 0.05, so that correlation may not be significant; but with x2 the p-value is less than 0.05. So heat is strongly correlated with x1, x2 and x4. Now, within the predictors: x1 has essentially no correlation with x4, but it has a high level of correlation with x3 — x1 and x3 are highly correlated. Similarly, x2 and x4 have an almost perfect relationship: the correlation coefficient is −0.973.
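The same kind of correlation matrix can be computed with numpy. This is an illustrative sketch on synthetic data constructed to mimic the lecture's pattern (x1 tied to x3, x2 strongly negatively tied to x4); it shows only the Pearson coefficients, not the p-values that Minitab adds to its plot.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=30)
x3 = 0.9 * x1 + rng.normal(scale=0.3, size=30)  # correlated pair, like x1/x3
x2 = rng.normal(size=30)
x4 = -x2 + rng.normal(scale=0.2, size=30)       # strong negative tie, like x2/x4

# 4x4 Pearson correlation matrix of the predictors
R = np.corrcoef(np.column_stack([x1, x2, x3, x4]), rowvar=False)
print(np.round(R, 2))
# Off-diagonal entries near +1 or -1 flag predictor pairs that
# threaten multicollinearity if both enter the model
```

Reading the matrix the same way as in the lecture: the large entry for the (x1, x3) pair and the large negative entry for (x2, x4) say that only one member of each pair should stay in the regression.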
So a strong negative correlation exists between x2 and x4, meaning those two variables are highly correlated, and similarly x1 and x3 are highly correlated. Whenever such high correlation exists and I want to do regression, one variable of each pair has to go out: one of x1 and x3, and one of x2 and x4. The stepwise regression has correctly identified two variables instead of four — it retained x1 and x2 and removed the other two — because of the multicollinearity problem. So part of the multicollinearity problem can be addressed by using stepwise regression, but that should always be verified by checking the variance inflation factors, the model adequacy and the other diagnostics; only then do we finalize the model. The correlation matrix helps you understand which variable is highly correlated with which, and hence which variables get removed and which get retained. So this is one way of selecting variables — stepwise regression — when a multicollinearity problem exists. There is another option which can be explored, known as best subsets regression. The limitation of the stepwise approach is that it gives only the final selection — here x1 and x2 — and drops x3 and x4. But the scenario can be that I want to explore what happens if I include x3 instead of x1, or x4 instead of x2, perhaps because those variables are easier to control: I may want a regression equation whose variables can be controlled easily, and maybe x1 and x2 are too difficult, whereas by significance the stepwise method is giving me x1 and x2.
But I want to see what happens if I use different combinations — say the x3, x4 combination. There is an option known as best subsets regression: if you go to Regression in Minitab, you will see Best Subsets. Click Best Subsets Regression, give heat as the response, and x1 to x4 as free predictors — variables that may or may not enter the model. If you want some predictor to always be in the model, you can put it under "Predictors in all models"; I am not specifying any, so every variable is free: show me all combinations and the best among them. When I click OK, we get output which I will copy and paste so it is easier to see. What you observe is an indicator known as Mallows Cp, which is generally used to select the best model and combination. The table lists, for each number of variables, the best one or two models. With one variable, the best model includes x4, which gives the highest R-squared and adjusted R-squared among single-variable models, and its summary is given. The second-best single-variable model, with an R-squared of 66.6 percent, retains only x2.
So we are shown what happens with x4 alone and with x2 alone — the best two single-variable models. With a single variable, four models can be developed; in general, if you have k variables, there are 2 to the power k possible subsets, so we can think of all possible models. Minitab reports only the best model with one variable, with two variables, with three variables, and finally with all variables. We will ignore the last one, because we want to reduce the number of variables: fewer variables means fewer things to control, and reducing variables also helps take care of the multicollinearity issue. To make the comparison, there are three values we look at — adjusted R-squared, predicted R-squared and Mallows Cp — and first we go by the Mallows Cp indicator, which reflects how small the sum of squared error of the candidate model is relative to the full model. The rule of thumb is that the Mallows Cp value should be less than, or close to, the number of variables considered in the model plus 1. So take the first row: the number of variables considered is 1, with x4 in the model.
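The 2^k enumeration that best subsets performs can be sketched with itertools. This is an illustration on synthetic data (four fictitious predictors, of which only the first two actually drive y), reporting the best subset at each size by plain R-squared; real best-subsets output adds adjusted R-squared, predicted R-squared and Mallows Cp to arbitrate between sizes.

```python
import itertools
import numpy as np

def r2(y, X):
    """R-squared of an OLS fit with intercept."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 4))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=40)

# Exhaustively score every non-empty subset; keep the best per size
best = {}
for size in range(1, 5):
    for subset in itertools.combinations(range(4), size):
        score = r2(y, X[:, list(subset)])
        if score > best.get(size, (-np.inf, None))[0]:
            best[size] = (score, subset)

for size, (score, subset) in sorted(best.items()):
    print(size, subset, round(score, 3))
# R-squared never decreases as variables are added, which is exactly
# why a penalized criterion like Mallows Cp is needed to pick a size
```

With k = 4 this is only 15 fits; the 2^k growth is why exhaustive search becomes impractical for large k and stepwise methods are used instead.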
So for x4 alone, the number of variables plus 1 is 1 + 1 = 2, and the Mallows Cp should be less than or close to 2 — but that is not the case here: the Cp is 138.7. Until that criterion is fulfilled, it is not a model we should select. The second single-variable model, at 142.5, is also far above 2, so it can be eliminated as well. But in the third case, the two-variable model with x1 and x2 has a Mallows Cp of approximately 2.7. The calculation of Mallows Cp, built from the sums of squared errors, is given in any standard textbook, so I am not writing the formula here. What I am recommending is this: with two variables, Cp = 2.7 is less than 3, so this is a possible combination — and stepwise regression has also shown that x1 and x2 is the best combination. Here, too, Cp is less than and very close to the number of variables plus 1: 2.7 is close to 3, so this is one of the competitive models. The other two-variable model has Cp = 5.5, which is more than 3, so it goes out; and there is a three-variable model whose Cp is less than 4, which is another possibility.
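Although the lecture leaves the formula to the textbooks, the standard definition is Cp = SSE_p / MSE_full − n + 2p, where p counts the parameters of the subset model (predictors plus intercept) and MSE_full comes from the model with all variables; the "close to number of variables plus 1" rule is then just "Cp close to p". A sketch on synthetic data (only the first two of four fictitious predictors matter):

```python
import numpy as np

def sse(y, X):
    """Residual sum of squares for an OLS fit with intercept."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return resid @ resid

def mallows_cp(y, X_full, cols):
    """Cp = SSE_p / MSE_full - n + 2p, with p = len(cols) + 1 (intercept)."""
    n, k = X_full.shape
    mse_full = sse(y, X_full) / (n - k - 1)
    p = len(cols) + 1
    return sse(y, X_full[:, cols]) / mse_full - n + 2 * p

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 4))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=40)

for cols in ([0], [0, 1], [0, 1, 2], [0, 1, 2, 3]):
    print(cols, round(mallows_cp(y, X, cols), 1))
# Dropping a truly active variable inflates Cp far above p (bias);
# the full model's Cp equals p by construction (here 5.0)
```

This reproduces the lecture's pattern: underfit subsets (like x4 alone with Cp = 138.7) show huge Cp, while an adequate subset sits near its own parameter count (like x1, x2 with Cp = 2.7 against a target of 3).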
Now you have to check whether the two-variable model is good, and what is happening with the three-variable models. Because of the correlation that exists between the pairs, you will find that the variance inflation factor becomes high whenever I consider x1 and x3 together, or x2 and x4 together — there will be problems. Based on the Mallows Cp criterion, too, x1 and x2 is the best: Cp is 2.7, predicted R-squared is 96 percent, which is quite good, and adjusted R-squared is a nice 97.4 percent. This is the closest model, and we should select it. So based on the Mallows Cp criterion and based on our stepwise regression, we are converging to the same model, which we can suggest as the best one: x1 and x2 regressed with y. Had we considered any other combination, maybe we would not get the best model, or the assumptions would be violated — normality, heteroscedasticity and the rest. Whenever we have selected the best model, there is always a requirement to check model adequacy: all the error terms and the assumptions on the errors are to be verified. We can also check what happens if I select x1, x2 and x4 — but x2 is highly related with x4, so if we select that combination, the variance inflation factor problem will come. Let us run that regression with x1, x2 and x4.
So I fit the model with x1, x2 and x4 — where x2 and x4, as we have seen, are highly correlated — with stepwise turned off, and do the calculation. What is observed is that the variance inflation factor is very high: if I paste the output here, you will find a VIF of 18.78 for x2 and x4. There is a very high correlation between x2 and x4, which was also prominent in the correlation coefficients. Whenever the VIF is this high, that type of regression equation cannot be used — that is the overall suggestion. Now let us take the earlier example with y to finish off, and afterwards we will continue with another example and another situation in multiple regression — what we can do when the error assumptions fail — before we go into the design-of-experiments part. Why am I explaining all this? Because when we develop regression equations in design of experiments, we should be concerned about the interrelations between the variables, how to select the best model out of many variables, and how to eliminate variables. So for the scenario with y: Regression > Fit Regression Model; I select y as the response and x1 to x4 as the continuous predictors, and use stepwise regression. The suggested model is x1 and x2, and the variance inflation factors are low, so this can be the best model; the only thing is that the adjusted R-squared is around 75.61 percent while the cross-validated value is 57 percent.
So there is some gap between 75 and 57 here, and we wonder whether we can improve the 10-fold cross-validation. Again we can run best subsets regression with y and x1 to x4 and see the model recommendations. What we see is that x1 and x2 — the subset suggested by stepwise — gives a Mallows Cp of 3.4, slightly more than the target of 3 (number of variables plus 1). But there is a three-variable subset, x1, x2 and x3, with Cp = 3.8, which is very close to 4. So based on the Mallows Cp index, if I consider the three variables x1, x2 and x3, three variables can be considered. So let us fit x1, x2 and x3, even though stepwise says x1 and x2 only; we remove the stepwise option and see what the model gives. If you click OK, you get the three-variable model, and what you see is that only x1 comes out prominent; the other two do not, because their p-values are more than 0.05, although the variance inflation factors are not alarming. The adjusted and predicted R-squared values are somewhat improved, and the 10-fold cross-validation is also somewhat improved, at 62.98 percent. But look at the residuals that I have saved: when I use the three-variable model and go to the basic-statistics normality test —
— if I test the last residual column, you see there is a violation of the error-distribution assumption, which appeared when I added x3. If I restrict this to the two-variable model suggested by stepwise regression — Regression > Fit Regression Model with x1 and x2 only — save the residuals, and run the normality test on the stored residuals (residual 3), what I see is that the residuals follow a normal distribution perfectly well, with a p-value of 0.253. So we go by what the statisticians suggest: use stepwise regression and do not add unnecessary, non-significant terms. But remember that whenever we remove a non-significant term we lose some amount of information, so sometimes people can ask why we should remove it at all. We can debate which terms to retain and which to remove — this is an art, not a perfectly black-and-white scenario, at least in multiple regression — but there are guidelines that can be incorporated, and based on them we can select the variables. So one option I have shown is best subsets regression, when we have different combinations of the variables and can select one or two of them and figure out which model is good; the other is to use stepwise regression, forget about enumerating combinations, and recommend whichever model it reveals as best. But you should be careful about the model adequacy checks: even after stepwise regression, you finally have to check model adequacy.
That is the suggestion. There are other, more statistically sound ways of dealing with multicollinearity: one is partial least squares regression and another is principal-component-based regression; these can also be adopted. So we will stop here. Next we will discuss another example where multiple regression runs into trouble — where the error assumptions fail — and how to deal with that; it was discussed for simple regression, so we will start from there. I also have another example on time, velocity, temperature and yield, where we will again discuss selection of the variables, and then we will move into the core concept of the improvement phase, which is design of experiments — we will now emphasize design of experiments and how to carry them out. The basic idea of design of experiments comes in our next session. Thank you for listening.