Welcome to session 25 of the course Quality Control and Improvement with Minitab. I am Professor Indrajit Mukherjee from the Shailesh J. Mehta School of Management at IIT Bombay. In the previous session we tried to understand simple regression and the associated model adequacy tests. There can be scenarios where a model adequacy check fails, and we want to see what is to be done in regression in that case. So I will take one more example to understand the complexities that can arise in simple regression. In this example we have two variables, demand (y) and energy usage (x), and the scatter plot you see here says that a linear relationship exists between x and y. The normal probability plot also says there may not be any problem, and the Anderson-Darling test confirms no problem with the normality of the residuals. But when we plot the residuals against the fitted values, we see a sudden increase in the variability of the residuals as the fitted values increase. So although we have fitted a regression model, the scatter plot shows a prominent linear trend, the normality check on the residuals is not violated (the Anderson-Darling test shows that), and the Durbin-Watson test gives a p-value that is not significant, so autocorrelation of the residuals is not a problem. However, the Breusch-Pagan test, when I run it in R, gives a highly significant p-value of 0.0007, which means heteroscedasticity is an issue here.
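The Durbin-Watson check mentioned here can be sketched by hand in pure Python. A value near 2 suggests no autocorrelation in the residuals, values near 0 suggest positive autocorrelation, and values near 4 negative autocorrelation. The residuals below are made up for illustration:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals, divided by the residual sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Independent-looking toy residuals give a statistic near 2.
resid = [0.5, 0.2, -0.3, 0.4, -0.1, -0.5, 0.3, 0.1]
dw = durbin_watson(resid)   # roughly 2.13 here
```

The statistic always lies between 0 and 4; Minitab and R's `lmtest::dwtest` report the same quantity along with a p-value.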
So the model cannot be generalized, and we need to correct it. What can be done? We already know that a transformation can be used, and we have applied the Box-Cox transformation here, which indicates a lambda of approximately 0.5. So the rounded value is lambda = 0.5, and this indicates that we have to raise the y values to the power lambda, that is, y^0.5. This is nothing but the square root of y, and we now model sqrt(y) as a function of x. So we have done that correction, and after the correction the regression equation is sqrt(y) = beta0 + beta1·x, with the intercept beta0 and the slope beta1 as estimated. The analysis of variance also shows that the x variable is significant, with a p-value less than 0.05, and when we plot the residuals versus the fitted values, we no longer see any abnormality. The Breusch-Pagan test was reconducted on the residuals saved after fitting the equation with sqrt(y), and the p-value is 0.89, which shows that the heteroscedasticity problem is gone. So this equation can be used for any unknown value of x to predict y: we predict sqrt(y), which can then be squared to recover y. So this is one of the examples. Let us try to see how we have done this in Minitab. Here the demand and energy usage columns are given, and the first thing we will do is look at the scatter plot: graphically we can plot demand against energy usage.
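The correction described above, taking the square root of y and refitting the line, can be sketched with the closed-form least-squares formulas. The data below are hypothetical, chosen so that the variance of y grows with its mean and sqrt(y) is exactly linear in x:

```python
import math

def fit_simple(xs, ys):
    """Ordinary least squares for y = b0 + b1*x (closed-form)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical demand/energy data; apply the lambda = 0.5 Box-Cox
# correction by regressing sqrt(y) on x instead of y on x.
x = [100, 200, 300, 400, 500]
y = [1.0, 4.0, 9.0, 16.0, 25.0]           # spread grows with the mean
b0, b1 = fit_simple(x, [math.sqrt(v) for v in y])
# sqrt(y) = 1, 2, 3, 4, 5 is exactly linear in x here: b1 = 0.01, b0 = 0
```

A prediction for a new x is then `(b0 + b1 * x_new) ** 2`, squaring the predicted sqrt(y) back to the original scale, exactly as the lecture describes.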
So if you click OK with the y and x variables, you get the same graph I showed in the slides: a strong positive relationship exists, as shown in the scatter plot. Then I go to Regression and Fit Regression Model, take demand as the response and energy usage as the predictor, and in Model make sure the constant term is included, so that beta0 will be in the model. In Options, no transformation is required at present, so I have selected no transformation, and we do not change anything else. One more thing that can be done whenever we are doing regression is validation; in many situations we validate the model. There are two methods here: validation with a proportional test set, and k-fold cross-validation. People generally prefer k-fold cross-validation, and the number of folds usually taken is 10. The theory behind 10-fold cross-validation is very simple: divide the data into 10 subsets; each subset in turn is used as the test set, on which the R-squared value is computed, while the remaining subsets are used to fit the model. That is one option we can keep in mind when generating the model, so that we can generalize it. But the model adequacy tests are still required, and they will show whether everything is fine. So we ask for the normal probability plot and also the Pareto plot of effects. For the residuals, what we want is the standardized residuals, because we use standardized residuals for the normal plot, for the residuals-versus-fits plot, and for the residuals-versus-order plot, which tells us whether there is any dependency between the errors.
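The 10-fold split described above can be sketched as follows in pure Python; the fit-and-score step is left as a comment since any regression routine can plug in there:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the row indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(30, k=10)
# Each fold serves once as the test set; the other 9 form the training set.
for test_fold in folds:
    train = [i for f in folds if f is not test_fold for i in f]
    # ...fit the regression on the `train` rows, then compute R^2
    #    on the `test_fold` rows; average the k test R^2 values...
```

The reported k-fold R-squared is the aggregate of the per-fold test R-squared values, which is why it guards against a model that only fits the data it was trained on.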
So this can be seen, and in Storage we can finally store the standardized residuals and click OK. When you click OK, the equation is generated, and the Pareto plot of effects shows that anything beyond the red reference line indicates a significant variable, which means energy usage is significant here. The cutoff depends on the alpha value we have taken; there is a formula to compute it, and anything beyond the cutoff indicates that the variable is important. This is the standardized effects plot. The normal probability plot shows no major deviation. We have also saved the residuals, so we can check the basic normality assumption by running the Anderson-Darling test on them, and what we see is that the Anderson-Darling test does not show any problem. But then, in the residuals-versus-fits plot, we see the funnel shape I mentioned, so heteroscedasticity is prominent in this graph. However, we do not see any trend suggesting autocorrelation. We have also computed the Durbin-Watson statistic, as I said earlier, and autocorrelation was not the issue we identified: the Durbin-Watson p-value is not significant. But the Breusch-Pagan test showed that there is significant heteroscedasticity. So the model cannot be generalized, and we need to do something about it. What we can do is go for a transformation, and using the Box-Cox transformation this was done.
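The Breusch-Pagan test used here can be sketched for the one-predictor case: regress the squared residuals on x and compute the Lagrange-multiplier statistic n·R². This is a simplified pure-Python version; the residuals below are made up to show a funnel pattern, and the chi-square critical value 3.84 (one degree of freedom, alpha = 0.05) is hard-coded because the standard library has no chi-square distribution:

```python
def breusch_pagan_lm(xs, resid):
    """Breusch-Pagan LM statistic for a single predictor:
    n * R^2 from regressing the squared residuals on x."""
    n = len(xs)
    z = [e * e for e in resid]
    xbar, zbar = sum(xs) / n, sum(z) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxz = sum((x - xbar) * (v - zbar) for x, v in zip(xs, z))
    b1 = sxz / sxx
    ss_reg = b1 * b1 * sxx                      # regression sum of squares
    ss_tot = sum((v - zbar) ** 2 for v in z)
    return n * ss_reg / ss_tot

# Funnel-shaped residuals: the spread grows with x, so the squared
# residuals trend upward and the LM statistic exceeds 3.84.
x = list(range(1, 11))
resid = [0.1, -0.2, 0.3, -0.4, 0.5, -0.7, 0.9, -1.1, 1.3, -1.5]
lm = breusch_pagan_lm(x, resid)   # well above 3.84 -> heteroscedastic
```

In R the same test is `lmtest::bptest(model)`, which is what produced the p-values quoted in the lecture.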
What we have to do is convert the y variable so that the residuals will not have a heteroscedasticity problem, and for that we have gone for the Box-Cox transformation. How do we do it? In the Box-Cox transformation dialog, select the variable to transform (demand), set the subgroup size, and choose where to store the optimal or rounded value; let us first look at the estimate and then save. If you click OK, Box-Cox will recommend the transformation that is required. Here you see the estimated lambda is 0.35, but you can round it, because the lower and upper confidence limits include 0.5. So we can take 0.5 as the rounded value, because lambda = 0.5, the square root transformation, is understood by many people. So the square root of y is required, and we can use that. What I have done is taken the square root of y in column C17, and then regressed C17 on C16. So I fit the regression model again; everything remains the same, only instead of demand I have taken the square root of y, and I click OK. What is generated now is that energy usage still shows significant p-values even after the transformation, and the 10-fold cross-validation R-squared is around 61 percent. That is on the lower side, but still we may be satisfied, because a significant relationship exists in this data set. The effects plot also shows that energy usage is significant and this variable needs to be considered, and there is no problem with the normal probability plot either.
Finally we want to check whether the heteroscedasticity problem is eliminated. Here also we do not see much problem in the behavior of the residuals with respect to the fitted values. We have done the Durbin-Watson test for autocorrelation on the transformed data and the residuals we generated, and we have also seen that the Breusch-Pagan test no longer shows significance after the conversion. After the square root transformation, the Breusch-Pagan p-value is 0.8985, which is more than 0.05, and that indicates the problem of heteroscedasticity is removed by using the Box-Cox transformation. Minitab also gives you a direct option: whenever I am doing regression, in Fit Regression Model, under Options, I can request a transformation with lambda = 0.5, and after that click OK. Minitab will then automatically apply the Box-Cox transformation with lambda = 0.5; we are also doing 10-fold cross-validation. In that case it will show the corresponding p-values and the final equation. So this is the final equation it is showing; we can copy it as a picture and paste it here, and you can see the final equation that is generated. The residuals are the actual minus predicted values for each value of x, and when we plot all the residuals and do the model adequacy checks for heteroscedasticity and the other assumptions,
what we observe is that it satisfies all the basic conditions. So this regression equation can be used for generalization within the range of x: we will not extrapolate, but within the domain of x where the equation was generated, for any value of x you give, I can predict y. How do we do that? Let us assume I want to predict the demand at an energy usage of, say, 700. Under Regression we have Predict, so you can do prediction here: enter energy usage 700, and it will give the predicted sqrt(y). In Options the confidence level can be set, in Results the regression and prediction tables, and in Storage you can store the output if you want; viewing the model is not required. What will happen is that the predicted value is given here. If I copy and paste it, you can see the values generated from the regression equation: the fitted value is around 1.97. This is the predicted square root of y; if you square it, you will get the actual value of y. It will also give you a confidence interval and a prediction interval, which are generated based on certain formulas.
So this is the expected value, but it has a range; the prediction will not be exactly that value. The confidence interval of the predicted value is approximately 1.05 to 1.34, and the prediction interval is around 0.04 to 2.25. This can all be done in the Minitab software: for a given value of x, what is the confidence interval and what is the prediction interval? The given value of x, 700, should be within the operating zone of the control variable; after generating the regression equation we do all the checks, and finally we adopt the equation. In real life also we develop models of y as a function of x and then use them to reach the optimal solution: what should x be so that I get the best y. So this is simple linear regression, where we have many things to understand: model adequacy, cross-validation, R-squared values, whether the slope beta is significant or not, the ANOVA analysis; all these things need to be considered when we are talking about regression. Now we can extend this concept from simple regression to multiple regression. Multiple regression is nothing but the case when we have more than one x, and this is the natural scenario we will encounter most of the time. So there is a matrix of x variables and a single y, and this is known as multiple linear regression. Simple regression is one y and one x, but in multiple linear regression there are multiple variables that influence the process CTQ, which is y; it can be influenced by many x variables, x1 up to xp.
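The confidence and prediction intervals reported by Minitab come from standard simple-regression formulas; a minimal sketch follows, with the t critical value passed in as an assumption (the stdlib has no t distribution, so it must be looked up from a table — 3.182 below corresponds to 3 error degrees of freedom at 95 percent):

```python
import math

def interval_widths(xs, resid_ss, x0, t_crit):
    """Half-widths of the confidence interval (for the mean response)
    and prediction interval (for a new observation) at x0.
    resid_ss is the residual sum of squares from the fitted line."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    s = math.sqrt(resid_ss / (n - 2))            # residual standard error
    se_mean = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
    se_pred = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
    return t_crit * se_mean, t_crit * se_pred

# Toy x data and a made-up residual sum of squares:
ci_hw, pi_hw = interval_widths([1, 2, 3, 4, 5], resid_ss=0.4,
                               x0=3, t_crit=3.182)
# The prediction interval is always wider than the confidence interval,
# because it carries the extra "+1" term for a single new observation.
```

That extra term is exactly why the lecture's prediction interval (0.04 to 2.25) is so much wider than its confidence interval.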
So x1 up to xp variables can influence the process CTQ, and all may be potential x's. Then we want to generate the functional relationship between y and x1 to xp. How do we do that? By regression: when we have more than one x variable, we call it multiple linear regression, and this is the generalized expression you see, the empirical model function. There will be some error whenever I am generating a function; it will not be exact, so there will be some difference between the actual and predicted values, which is the error or residual, and that can be saved so we can examine it. This is the matrix notation for the case when we have multiple x's for predicting a single y, and the coefficient vector beta, that is beta0, beta1, up to betap, is what is estimated. Beta can be estimated from the values of y and x: I will have y and multiple observations on x1 up to xp. For a given row of x settings there will be a corresponding y; one row of x gives me a certain y, the process output at that setting condition. If we have the y observations and the x observations, with a y for every row of x, then by matrix algebra we can compute the value of beta.
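The matrix calculation of beta described here, beta-hat = (XᵀX)⁻¹Xᵀy from the normal equations, can be sketched in pure Python on a tiny noiseless data set; the numbers are made up so that the true coefficients are recovered exactly:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for A x = b."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c]
                              for c in range(r + 1, n))) / M[r][r]
    return x

def ols_beta(X, y):
    """beta = (X'X)^{-1} X'y via the normal equations.
    X must include a leading column of 1s for the intercept beta0."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2:
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1 + 2 * x1 + 3 * x2 for _, x1, x2 in X]
beta = ols_beta(X, y)   # recovers [1, 2, 3]
```

This is the least-squares solution Minitab computes internally; with real noisy data the recovered beta minimizes the residual sum of squares rather than matching any "true" coefficients exactly.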
This is based on the least squares function: it minimizes the error, partial derivatives are taken, and that gives the values of beta0 up to betap. Minitab does it automatically for you. The example I am taking here is known as the pull strength data: I want to see whether there is any relationship between pull strength and the two x variables, wire length and die height. I have multiple observations; one set of observations is shown here and another set beside it, same variables and same y, placed side by side so that you can see the complete data set. There are two predictors and one predicted value, the CTQ, and I want to see whether both variables are important to include, or whether one is sufficient to model this, and how I should develop the model. So this is a scenario where multiple regression is required, so that we can model y as a function of multiple x's, x1 and x2. What we do here is again the regression analysis; instead of one variable, you enter two, wire length and die height. The p-values show as significant, with the same interpretation: if p is significant, it indicates that the variable is important and should be included in the model. For R-squared, the interpretation we have seen is SS-regression divided by SS-total, how much of the variability is explained, and this comes out to be 98 percent, which is quite high. But in multiple regression we look at R-squared adjusted.
R-squared adjusted gives you a more precise assessment; its formula is given here, and it is based on mean squares. This prevents overfitting of the model, because if you keep adding x variables, R-squared will always increase, but R-squared adjusted will not increase unless the added variable has a significant influence in reducing the mean squared error. Another measure is R-squared predicted: one observation is dropped, the equation is refit, and we check how close the prediction of that dropped observation is. Each observation in turn is removed, an equation is generated, and that single observation is predicted; that is the way R-squared predicted is computed, and Minitab's documentation, or any textbook, shows how it is calculated. Then the analysis of variance shows whether, with two variables, the overall regression is significant: yes, it is. And out of the two variables, which is important? Both are, because the p-values are less than 0.05, so our interpretation is that we should include both of them in the model. The overall equation you see is: expected strength equals beta0 plus beta1·x1 plus beta2·x2. Both coefficients are significant, and both carry plus signs, which indicates that both are positively linearly related to y: as x1 increases y increases, and as x2 increases y also increases.
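The adjusted R-squared penalty described here can be written out directly. In the sketch below, n = 25 observations and p = 2 predictors match the pull-strength setup, and the R-squared of 0.98 is the value quoted in the lecture:

```python
def r_squared(y, yhat):
    """Plain R^2 = 1 - SS_error / SS_total."""
    ybar = sum(y) / len(y)
    ss_tot = sum((v - ybar) ** 2 for v in y)
    ss_err = sum((v - f) ** 2 for v, f in zip(y, yhat))
    return 1 - ss_err / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
    Dividing by degrees of freedom turns sums of squares into mean
    squares, so adding a useless predictor can lower this value."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj = adjusted_r_squared(0.98, n=25, p=2)   # slightly below 0.98
```

Adding a third predictor that explains nothing would raise p to 3 while leaving R-squared essentially unchanged, and the adjusted value would drop, which is exactly the overfitting guard the lecture describes.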
So these things can be interpreted out of the analysis. This is the surface plot you see, showing how the pull strength y is related to the x variables; Minitab also gives you an option to make 3D plots when you have only two x variables and one y, so plotting is possible, and we will show that too. Now let us go to the analysis part and see how this is done. We open the data set, and there are three columns: pull strength, wire length, and die height. First we can see graphically how each of the two variables relates to y: we plot pull strength against wire length, and again pull strength against die height, and it draws two diagrams. In one of them, with wire length, you see a very strong positive relationship with strength, so we expect the coefficient beta to be positive. With die height it is somewhat less prominent, but we can still see an increasing trend; not so strong, but a positive relationship exists, so we should include that in the model also.
After the scatter plots, we can go to Fit Regression Model: take pull strength as the response, and die height and wire length as continuous predictors. In Model we include the constant term, so beta0 is in the model; other things we are not changing. We could add terms, but we are not adding any: we want a linear equation at this moment, not a polynomial. If a transformation is required we will see; it is not required here, so the no-transformation option is selected. Coding we can also avoid this time; sometimes the x variables are coded, and there is a theoretical advantage to coding, as is done in design of experiments, so that is an important aspect. Stepwise regression we will try to understand afterwards. Cross-validation is possible here also; I can go for k-fold cross-validation with k = 10. Then, in Graphs, we ask for the normal plot, residuals versus fits, residuals versus order, using standardized residuals, and the Pareto plot of effects to see which terms are significant. I click OK, store the standardized residuals in Storage, click OK again, and let us see what we observe. First is the equation that is generated; I have displayed the equation, so we can paste it here and see it. The next thing we get is the coefficient information.
So this is the equation. Second is model adequacy: we want to see whether both variables are important. What we observe is that wire length has a significant p-value, and die height also has a significant p-value; both variables are important. Then we can go to the ANOVA analysis, copy it as an image, and enlarge it. If the coefficients are significant, the ANOVA will also show that: wire length has a high F value, and die height is also high, so the p-values are significant and both variables are important. The overall regression is significant, which indicates that at least one of the variables is significant, and individually each p-value shows that both are significant. For the degrees of freedom: there are two x variables, so the regression has two degrees of freedom, each variable consuming one. The total degrees of freedom is the number of observations minus one, 25 - 1 = 24, and the error degrees of freedom is the remainder, which we can calculate. So it indicates that both variables are important for us, and there are some unusual observations that you can see. For the k-fold cross-validation, copy it as an image and paste it, and you can see that after cross-validation also the model gives good results.
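The degrees-of-freedom bookkeeping described here can be sketched as follows; the sums of squares are made-up placeholders, since only the df partition and the F-ratio arithmetic are the point:

```python
def anova_table(n_obs, n_predictors, ss_reg, ss_err):
    """Degrees-of-freedom partition and overall F statistic
    for a multiple regression ANOVA."""
    df_reg = n_predictors            # one df per x variable
    df_tot = n_obs - 1               # total df = n - 1
    df_err = df_tot - df_reg         # the remainder goes to error
    ms_reg = ss_reg / df_reg         # mean squares = SS / df
    ms_err = ss_err / df_err
    return df_reg, df_err, df_tot, ms_reg / ms_err

# 25 observations and 2 predictors, as in the pull-strength example;
# the SS values are invented just to show the arithmetic.
df_reg, df_err, df_tot, f_stat = anova_table(25, 2,
                                             ss_reg=5900.0, ss_err=115.0)
# df partition: 2 regression, 22 error, 24 total
```

The overall F statistic is MS-regression over MS-error; a large value (compared against the F distribution with df_reg and df_err degrees of freedom) is what makes the overall regression significant.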
Overall, R-squared adjusted is around 97 percent, which is very good. I said that R-squared predicted should be close to R-squared adjusted if the model is correct and can be generalized. For 10-fold cross-validation, we divide the data randomly into 10 folds; one fold is used for testing while the others form the training set, the equation is generated on the training set, and then the R-squared value is computed on the test fold. That is calculated for each fold, and the reported 10-fold R-squared value is 97 percent, which is quite good. Then we have the plots. In the effects plot, at an alpha of 0.05, you see the cutoff of 2.07; you can take the formula to do this calculation, and it is given in Minitab and in general books on how these cutoff values are computed. Term A is significant, beyond this cutoff, and term B is also significant, beyond it; both are important and can be included in the regression equation. In the normal probability plot there is not much deviation; most of the observations are in the middle part. We have saved the residuals, so we can check whether they follow the normality assumption: at the end you will find the saved residuals, run the Anderson-Darling test on them, and the p-value is not significant. So we can assume the residuals are more or less fine.
We have not yet generated the heteroscedasticity plots, so in Graphs we can also ask for residuals versus fits and residuals versus order and check them graphically. This is the graph generated from this data set, and there are not many unusual observations. A Breusch-Pagan test can be done here to confirm whether there is any heteroscedastic behavior of the residuals, but I do not think it is there, and there is also no trend in the standardized residuals against observation order. So in this case it is not prominent, but we can still do the individual tests. We can also do the surface plot, which I have not shown. Under Graph there are 3D scatter plot and 3D surface plot options; I can make a wireframe diagram or a surface plot. If you click OK, you choose which variable goes in the z direction and which are the x1 and x2 variables, and you can change the shapes and the types of plot. This is the surface plot that is generated, and you can rotate it; rotating the axes is possible in Minitab, and that gives you an idea of what the surface looks like. So we can see all aspects of regression here. Similarly, there is another data set, with variables x1 to x4, and we want to see the relationships in it, so we can close this one and look at another example.
To generate the regression equation for this one, go to Regression and Fit Regression Model. In this case only the variables change: the measured outcome is the response, and x1 to x4 are the predictors to include in the model. In Model, include the constant term; we do not add any other terms. In Options, no transformation, assuming everything is fine; set up validation, store the residuals, and click OK. Then the equation is given at the first stage, along with which coefficients are significant and which are not. What you see is that none of the variables comes out to be significant individually; all the p-values are high. But there is a clear relationship: the R-squared adjusted value is 97.36 percent, and the k-fold cross-validation R-squared is also high. In this scenario x1 looks quite prominent, yet there is some issue with the individual significance here, and no other obvious problem. So what if we had included only x1? We can fit the regression with only the x1 variable, click OK, and what I observe is that not much variability is explained; R-squared adjusted is low. So we keep adding the variables, and there is no lack of fit either, but for the overall explanation of the total variability: x1 explains some part, adding x2 adds some more, and then x3 and x4 likewise.
So this kind of simple analysis can be done. Let us also consider the C5 to C9 variables and see the regression equation that is generated. We take the y variable and x1 to x4 as the continuous predictors, selected in the continuous predictor field. If I fit this, only x1 comes out to be prominent; the other p-values are not significant, but we retained those variables because they raise the R-squared adjusted value. We can also do the selection automatically. Let us discuss how to decide which variable should be included and which excluded. Fit Regression Model has a Stepwise option; if I go for stepwise regression, it will automatically suggest which variables to keep and which not to keep. If I change the variables, say heat is the response and x1 to x4 the predictors, and I click OK, it will suggest which variables should be taken, and you see that it has considered x1 and x2 to be placed in the model. So x1 and x2 are prominent; these two variables are included, and we do not want to include all the variables. We will discuss more about stepwise regression; previously the problem we were facing was which variable to take, because individually they were not prominent.
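The idea behind stepwise selection can be sketched as a greedy forward search. The `score` function below is a hypothetical stand-in for whatever criterion is used (adjusted R-squared of the model built on a candidate subset, say), and the per-variable gains are invented so that x1 and x2 end up chosen, mirroring the lecture's outcome:

```python
def forward_stepwise(candidates, score):
    """Greedy forward selection: repeatedly add the variable that
    improves the score the most; stop when no addition helps."""
    chosen, best = [], score([])
    while True:
        trials = [(score(chosen + [v]), v)
                  for v in candidates if v not in chosen]
        if not trials:
            break
        top_score, top_var = max(trials)
        if top_score <= best:        # no candidate improves the model
            break
        chosen.append(top_var)
        best = top_score
    return chosen, best

# Hypothetical additive scores: x1 and x2 carry the signal,
# x3 and x4 add nothing once those two are in.
gains = {"x1": 0.60, "x2": 0.35, "x3": 0.0, "x4": 0.0}
toy_score = lambda subset: sum(gains[v] for v in subset)
selected, final = forward_stepwise(["x1", "x2", "x3", "x4"], toy_score)
# selected == ["x1", "x2"]
```

Minitab's stepwise procedure additionally supports backward elimination and alpha-to-enter/alpha-to-remove thresholds on p-values, but the greedy add-one-at-a-time loop above is the core of the forward variant.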
Which one goes in and which goes out? We were not certain. So we have methods known as stepwise regression and best subsets regression, which allow us to identify which variables to include in the model and which to exclude, for cases where I am not sure because none of them individually appears significant. These techniques automatically identify which variables maximize the R-squared predicted and R-squared adjusted values; here that is around 97 percent, and this is the best model you can get out of this data set. It tells me which variables should go in and which should not be included. We will continue this discussion of best subsets regression for scenarios where we are in a dilemma about which variable goes in and which goes out, and then extend it to discuss one more topic, known as multicollinearity, which I want to cover because it significantly affects model generalization. We will discuss that in our next class. So we will start from where we left off, take some examples, and see how to include x variables so that there is no dilemma and no confusion in selecting the variables: an easy way, using stepwise regression and the best subsets method, to select the variables that lead to the final generalized model. We will stop here and continue in our next session. Thank you.