Welcome to session 27 on Quality Control and Improvement with Minitab. I am Professor Indrajit Mukherjee from the Shailesh J. Mehta School of Management, IIT Bombay. In the previous session we discussed multiple regression and the problem of multicollinearity. We will take another example to again emphasize how to tackle multicollinearity. Multicollinearity, as we explained, means that relationships exist among the X variables themselves; this can distort the estimated relationship between Y and the X's, so we are not able to generalize the equation for use in real-life prediction. To detect it we calculate the variance inflation factor (VIF) and check that it is not more than 5. If it is more than 5, we have to adopt some means to deal with it. One option is that, out of a set of highly correlated variables, we keep one and eliminate the others. We can also attempt stepwise regression, which suggests which variables to retain, and then reconfirm whether multicollinearity still exists in the final model. We can also use best subsets regression, which offers multiple candidate models, because we do not want to stop at whatever stepwise regression gives: sometimes an alternate model is easier to control in real life, meaning its variables are easy to manipulate in the process. If we have to sacrifice some amount of R-squared for that, it is not a serious constraint in a production process; but if stepwise regression puts variables in the final equation that are very difficult to control, it becomes difficult to adopt that model in practice.
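The VIF check described above can be sketched in plain Python. For a predictor x_j, VIF_j = 1/(1 - R_j²), where R_j² comes from regressing x_j on the remaining predictors; with exactly two predictors this reduces to 1/(1 - r²), where r is their Pearson correlation. The data below are made up purely for illustration; this is not Minitab output.

```python
def pearson_r(x, y):
    # Sample Pearson correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    # With exactly two predictors, the R^2 of one regressed on the
    # other equals r^2, so VIF = 1 / (1 - r^2) for both of them.
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Made-up data: x2 is nearly a multiple of x1, so the two are
# highly collinear and the VIF should land well past 5.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
print(round(vif_two_predictors(x1, x2), 1))
```

A VIF near 1 means the predictor is essentially uncorrelated with the others; the 5 used as a cutoff here is the rule of thumb quoted in the lecture.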
So best subsets regression gives you some more options for selecting the variables. Beyond that, methods like partial least squares regression and principal component regression can also be adopted. Let us take one more example, where the Y characteristic is yield, which depends on time, velocity and temperature. These X variables were considered to be significant or potential variables, and we want to fit a regression model out of this; all the variables are continuous. When we ran the correlation analysis, we observed that velocity is highly correlated with time, and that yield is highly correlated with both time and velocity. The p-values indicate which correlations are significant: yield with time is significant, yield with velocity is significant, but temperature does not seem to significantly influence yield. Temperature with time is not significant, and temperature with velocity is not significant either; but velocity with time is highly correlated. So this correlation matrix gives some preliminary information about what to expect when we run the regression with all these variables. What we can see is that time and velocity are two highly correlated predictors, so we should adopt only one of them. How do we select which one?
One trial-and-error policy for selecting between them is to keep whichever predictor is more highly correlated with Y: the correlation of yield with time is 0.9454 and with velocity about 0.914, so maybe we should select time and drop velocity. Yield is then regressed on time, and if temperature turns out to be significant we include it as well. That is one suggested guideline: keep the X that is most highly correlated with the Y variable. We also want to see what happens if we use stepwise regression. The data set is in columns C11 to C20, and we want to implement this. In Minitab we go to Stat > Basic Statistics > Correlation, identify the variables whose relationships we want to understand (yield first, then time, velocity and temperature), choose the Pearson correlation in Options, request the correlation with p-values in Graphs, and in Results ask for the pairwise correlation table. Clicking OK produces the diagram I showed in the slides. It shows the relationship between yield and the other variables: yield is highly related to time (p-value near 0) and to velocity, but temperature does not seem to be significant. Time has no significant correlation with temperature, and velocity has no significant correlation with temperature, but velocity is correlated with time: that coefficient is about 0.958.
Next we use stepwise regression: Stat > Regression > Fit Regression Model, turn on Stepwise, keep the validation settings as usual, and under Storage save the standardized residuals. Clicking OK, let us see which variables go into the model and which stay out. The procedure has identified time and temperature. Initially we thought temperature was not highly correlated with yield, and indeed its p-value is just above 0.05, the usual cutoff; still, the analysis says the best model takes time and temperature. The adjusted R-squared is about 90 percent, and the 10-fold cross-validation value is more or less close to it, so the model seems adequate. If we take the significance level alpha as 0.1, we can also treat temperature as a significant variable and retain it. Let us check the normality of the residuals: testing the last residual column gives a p-value of 0.74, so there is no problem with the normality assumption. Now let us also see which model best subsets regression suggests. I have taken yield as the response with all three predictors; clicking OK produces the output, which we can copy as a picture and paste into Excel to examine.
Pasting it there, what do we observe? The columns are time, temperature and velocity. With time as the only variable, Mallows' Cp is about 4 and the adjusted R-squared is 88.5; but Cp is higher than the number of variables plus 1, so this model is not recommended, and the second one-variable model is not recommended either. The third model, with time and temperature, has Mallows' Cp less than the number of variables plus 1 (2 is less than 3), so it seems adequate and is one of the suggested models. Its R-squared of 92 is higher than before, the adjusted R-squared is also high, and the predicted R-squared is more or less the same; when we adopt it we can also check the 10-fold cross-validation. The next candidate has Cp of 5.9, which is quite high, so it goes away, and the last model, with all the variables, we do not consider either. So Mallows' Cp suggests the model with time and temperature, which agrees with what stepwise regression gave. So I fit that model again: Fit Regression Model with time and temperature as the two predictors, removing velocity, and this time without stepwise, since the variables are already identified; I request all residual graphs and standardized residuals.
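The Mallows' Cp screen used above has a simple closed form: Cp = SSE_p / MSE_full − (n − 2p′), where SSE_p is the error sum of squares of the candidate sub-model, MSE_full is the mean squared error of the full model, n is the sample size, and p′ counts the intercept plus the predictors in the sub-model. A model is flagged as adequate when Cp is close to (and not much above) p′. The SSE/MSE numbers below are made up just to exercise the formula.

```python
def mallows_cp(sse_p, mse_full, n, n_params):
    # Mallows' Cp for a candidate sub-model:
    #   Cp = SSE_p / MSE_full - (n - 2 * p')
    # where p' (n_params) = intercept + predictors in the sub-model.
    # A sub-model with Cp well above p' is suspected of bias.
    return sse_p / mse_full - (n - 2 * n_params)

# Hypothetical numbers: n = 16 runs, full model with 4 parameters
# and MSE_full = 2.0 (so SSE_full = 2.0 * 12 = 24).
print(mallows_cp(24.0, 2.0, 16, 4))   # full model: Cp equals p' by construction
print(mallows_cp(30.0, 2.0, 16, 3))   # a 2-predictor sub-model
```

Note that for the full model Cp always equals p′ exactly, so the criterion only discriminates among the sub-models.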
Clicking OK (with k-fold cross-validation already set under Validation), the final equation comes out with an intercept of minus 130.7; time is positively correlated, and temperature is also positively correlated. Temperature is retained although its p-value is not quite significant; we could also eliminate it and see how that model performs. Because the alpha-to-enter and alpha-to-remove were set to 0.15, temperature came into the model during stepwise regression, but we can retain it since its p-value is very close to 0.05. The normal probability plot of the residuals looks satisfactory, with all the points near the line; the residual-versus-fits plot also shows no abnormality or pattern, which a Breusch-Pagan test would presumably confirm; and there is no apparent autocorrelation, no trend, with points on both sides of the zero line. So the model should satisfy normality and the other assumptions. Checking the last residual column, the normality p-value comes out to be 0.7, which is satisfactory. Practically, you still have to think about whether to retain the last variable or remove it; but the adjusted R-squared has improved with temperature included, and with time as the only variable the adjusted R-squared may be lower. So let us try time as the only variable.
In Fit Regression Model I remove temperature. Keep a note of the two-variable results: adjusted R-squared around 90, predicted R-squared around 85, and 10-fold cross-validation also around 85, so roughly 90 and 85 is the range. With time alone, the adjusted R-squared is lower than before. Whether to adopt or remove a variable that is not quite significant depends on practical judgment, on the process engineers or somebody knowledgeable about the process. There is no lack of fit observed here, so the linear model seems adequate and we can adopt it; you have to think from practical aspects. As for multicollinearity, when we consider both time and temperature the variance inflation factor is around 1.05, which is quite satisfactory, so there is no multicollinearity issue. Whether to include both variables is a judgment you have to take, but I would suggest retaining temperature, because it improves the adjusted R-squared, the 10-fold cross-validation and the predicted R-squared. There are no black-and-white scenarios in regression: it depends on the process engineers; we look at the predictive behavior and adopt whichever model comes closest. The suggested model here includes both time and temperature; but if you go strictly by significance, temperature may be dropped and we stick to time as the only variable.
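The 10-fold cross-validation quoted alongside the adjusted R-squared works by partitioning the runs into folds and holding each fold out once while fitting on the rest. A minimal sketch of the fold assignment (round-robin; Minitab's own partitioning scheme may differ):

```python
def k_fold_indices(n, k):
    # Assign observation indices 0..n-1 to k folds, round-robin.
    # In k-fold cross-validation each fold serves once as the
    # held-out validation set while the model is fit on the rest;
    # the k held-out R^2 values are then pooled into one score.
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

for fold in k_fold_indices(10, 5):
    print(fold)
```

The point of the procedure in this lecture is the comparison: a cross-validated R-squared that stays close to the adjusted R-squared (here roughly 85 versus 90) suggests the model is not overfitting the historical data.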
That is one scenario; in regression we can have other scenarios as well, and this is the next example I want to place here. The variables are RPM, type of cutting tool, and surface finish: surface finish is Y, RPM is a continuous variable, and type of cutting tool is a categorical variable taking two values, 302 and 416. These are two tool types, so the numbers are only labels: I cannot say 302 is greater than 416, and the values cannot be arranged in ascending or descending order. Like different colors, this has to be treated as a categorical variable, and regression has an option to deal with categorical variables. With surface finish as Y, we go to Stat > Regression > Fit Regression Model and, besides the continuous variable, we can also include a categorical predictor in the regression model. So we enter the response as surface finish, the continuous predictor as RPM, and the categorical predictor as type of cutting tool. We can use stepwise regression here too and see which variables go in and which go out, with the same significance level, both variables and the constant term included in the model, and cross-validation set under Validation. In Results we display all results; there is also an option for the Durbin-Watson statistic, which has to be compared with a tabulated value if we want to use it.
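Under the hood, a two-level categorical predictor enters the model as a 0/1 indicator (dummy) column, which is why each tool type ends up with its own equation: the dummy shifts the intercept while the RPM slope is shared. The coefficients below are hypothetical placeholders, not the fitted values from the lecture's data.

```python
def encode_tool(tool_type):
    # Indicator (dummy) coding for the two-level categorical
    # predictor: 302 is the reference level (0), 416 is coded 1.
    # 302 and 416 are labels, so they must never enter the model
    # as raw numbers.
    return 0 if tool_type == 302 else 1

def predict_finish(rpm, tool_type, b0=10.0, b_rpm=0.02, b_tool=5.0):
    # Hypothetical coefficients, for illustration only: the dummy
    # term shifts the intercept, giving one fitted line per tool
    # type with a common slope on RPM.
    return b0 + b_rpm * rpm + b_tool * encode_tool(tool_type)
```

Evaluating both levels at the same RPM shows the two parallel lines Minitab reports as separate equations for tool 302 and tool 416.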
I am not using it here, because I can export the model to R and check the Breusch-Pagan test and the Durbin-Watson test with their corresponding p-values there. What I will do is again save the standardized residuals, and graphically request the normal plot, the residual-versus-fits plot, and the residuals-versus-order plot. Clicking OK shows the selection process and what was selected by default. For the 302 type of cutting tool there is one equation relating surface finish to RPM, and for 416 a different equation: with a categorical variable, each selected level gets its own equation with the continuous variable. The coefficients are given, from which the regression equations are developed, and the variance inflation factors are approximately 1, so multicollinearity is not an issue. Both predictors, type of cutting tool and RPM, are significant; the adjusted R-squared is 97.5 and the cross-validation value is 97.02, very close, so the model seems very accurate, and no lack of fit is observed. The only thing left is to check whether the normality assumption is violated: in the normal plot one point seems to have gone outside, so there may be a problem with normality, and we have to test it. Heteroscedasticity does not seem to be a problem since the residuals look random, and autocorrelation does not look significant; those two checks can be done, but what we want to see now is normality. The C33 column of stored residuals will tell us whether the normality assumption is violated, so we go to Stat > Basic Statistics > Normality Test and test the residuals.
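The Durbin-Watson statistic mentioned above is easy to state: DW = Σ(e_t − e_{t−1})² / Σe_t², computed over the residuals in run order. Values near 2 indicate no first-order autocorrelation; values toward 0 suggest positive autocorrelation and values toward 4 negative autocorrelation (the formal test compares DW against tabulated bounds). A minimal sketch:

```python
def durbin_watson(residuals):
    # DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), over residuals in
    # time (run) order. Near 2 -> no first-order autocorrelation;
    # near 0 -> positive, near 4 -> negative autocorrelation.
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# A sign-alternating residual series (negative autocorrelation)
# pushes DW above 2; a drifting series pushes it below 2.
print(durbin_watson([1, -1, 1, -1]))
```

This is only the statistic itself; the decision still requires the tabulated lower and upper critical values the lecture refers to.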
The normality test on the residuals gives a p-value less than 0.05, which indicates a normality problem with the residuals. In this case the Y characteristic itself needs a transformation, and we have two options: the Box-Cox transformation and the Johnson transformation. First we see whether Box-Cox works for this Y variable: Stat > Control Charts > Box-Cox Transformation, select surface finish with subgroup size 1, and in Options ask for the optimal or rounded value of lambda. The output suggests a rounded value of 0.5: the actual optimal lambda is about 0.72, but 0.5 lies within the confidence interval, so we take the rounded value. Lambda of 0.5 means Y to the power 0.5, that is, a square root transformation. So instead of surface finish we regress the square root of Y and repeat the same analysis. Here too the adjusted R-squared is about 97; the square root of Y has been regressed on RPM and the tool type, both variables are significant, the equations are given, and lack of fit is not prominent. One outlier is recorded, with a standardized residual around minus 3, beyond plus or minus 2. After this transformation a new set of residuals is generated.
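The Box-Cox family used above is (yᵏ − 1)/λ for λ ≠ 0 and ln(y) in the limit λ → 0, defined for y > 0; rounding λ to 0.5 is, up to a linear rescaling, the square root transform, which is why regressing √y is equivalent. A minimal sketch of applying the transform at a given λ (the search for the optimal λ, which Minitab does internally, is omitted):

```python
import math

def box_cox(y, lam):
    # Box-Cox transform of a positive observation y at a fixed
    # lambda: (y**lam - 1)/lam, with ln(y) as the lam -> 0 limit.
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

# lambda = 0.5 (the rounded value in the lecture) is a shifted,
# scaled square root: box_cox(y, 0.5) = 2*(sqrt(y) - 1).
print(box_cox(4.0, 0.5))
```

Because the transform at λ = 0.5 is a monotone linear function of √y, the regression on √y and the regression on the Box-Cox values give the same fitted structure and residual diagnostics.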
Let us check whether the correction has happened with the Box-Cox transformation or whether it is inadequate. I go to Stat > Basic Statistics > Normality Test, select the last residual column recorded, and look at the p-value: it is again less than 0.05. So there is still a problem; it is not resolved, the error is not coming out normal, it is not white noise. In that case I go for the Johnson family of transformations: this is under Stat > Quality Tools > Johnson Transformation (not Basic Statistics). I specify that the data are arranged in a single column, select surface finish, and store the transformed values in column C31. In Options, 0.1 is the value given for selecting the best fit. Clicking OK gives the transformation that is required: the transformed function is displayed, with a log term at the end, and it transforms the variable. Originally the p-value for the data was less than 0.05; after transformation the p-value comes out to 0.906, which means it has done a rightful transformation. We only have to confirm it: the Johnson-transformed values are saved in column C31, and now we regress this column on RPM and type of cutting tool.
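For context, the Johnson system comprises three families (S_B, S_L, S_U); Minitab searches over them and their parameters for the fit whose transformed data look most normal, which is why the output displays a fitted function with a log or asinh term. Below is a sketch of just the S_U transform, z = γ + δ·asinh((x − ξ)/λ); the parameter values used are purely illustrative, not the fitted values from this example.

```python
import math

def johnson_su(x, gamma, delta, xi, lam):
    # Johnson S_U transform: z = gamma + delta * asinh((x - xi)/lam).
    # Minitab estimates (gamma, delta, xi, lam) by searching the
    # Johnson families for the most-normal transformed data; the
    # arguments here are assumed placeholders for illustration.
    return gamma + delta * math.asinh((x - xi) / lam)

# At x = xi the asinh term vanishes, so the transform returns gamma;
# the transform is strictly increasing in x, preserving order.
print(johnson_su(5.0, 1.0, 2.0, 5.0, 3.0))
```

Unlike Box-Cox, S_U is defined for all real x (no positivity requirement), which is one reason the Johnson search can succeed where Box-Cox fails.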
So we go to Stat > Regression > Fit Regression Model and, instead of the square root, use the Johnson-transformed variable, with everything else the same. Clicking OK, we get the residuals and the equations: the Johnson-transformed variable regressed on RPM, with separate equations for the 302 and 416 tool types. The adjusted R-squared is 91 and the predicted value 88, very close, so this is quite acceptable; there is no lack of fit, and both predictors, the categorical one and RPM, are significant. Now we check the residuals to see whether the correction has happened: I go to the normality test, select the last residual column, and click OK. With this transformation the problem has certainly gone; the normality problem of the residuals is resolved when I use the Johnson transformation. Which one to use, Box-Cox or Johnson, you have to try out and figure out: whichever gives you errors that are white noise is the one to adopt. So this is one example where a categorical variable is included in the model, and it shows how to address a non-normal situation in multiple regression. We have also seen how to select variables when we are in a dilemma: stepwise regression is an important tool that can suggest which potential variables should be considered in experimentation.
But we have to understand regression's limits: we cannot extrapolate a regression equation. Whatever region the experiment covers, that is, the range of X within which the data were collected, is the only region in which we can predict. For a given value of X, the expected value of Y, E(Y|X), can only be calculated within the domain of X, between its lower and upper bounds; no extrapolation is generally accepted with a regression equation. Also, association does not mean causation: Y being a function of X does not mean X is a real variable that impacts Y. There can be scenarios where two variables are not physically connected in any way yet show high correlation; many examples can be cited. So regression does not prove causality. Design of experiments is the only way to establish causality: to understand causal relationships between variables, we have to intentionally induce variation by changing the factors we are interested in and observe the resulting functional relationship. The best, most appropriate way to develop the functional relationship is by design of experiments; there is no real alternative, since regression with historical data cannot prove association in the causal sense. Then what is the option for quality improvement? We have identified a few variables based on regression: there is a matrix of X variables, and we have also collected the Y variable information.
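The no-extrapolation rule above can be enforced mechanically: refuse to return a prediction when the requested x falls outside the interval [min(x), max(x)] observed during fitting. This guard function is my own illustrative sketch, not a Minitab feature.

```python
def predict_within_range(x, x_train, b0, b1):
    # Predict b0 + b1*x only inside the observed range of x.
    # Outside [min(x_train), max(x_train)] the fitted line carries
    # no evidence, so we raise rather than extrapolate.
    lo, hi = min(x_train), max(x_train)
    if not (lo <= x <= hi):
        raise ValueError(f"x={x} outside fitted range [{lo}, {hi}]")
    return b0 + b1 * x

# Hypothetical fit: intercept 1.0, slope 2.0, x observed on 1..5.
print(predict_within_range(3, [1, 2, 3, 4, 5], 1.0, 2.0))
```

In production code one might log a warning instead of raising, but the principle is the lecture's: E(Y|X) is only defined by the model inside the sampled domain of X.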
How do I connect the two and develop the function that explains the variability of Y? What is required is a systematic way of inducing variation, and that is statistical experimentation, the improvement phase: until and unless I have the right function, I cannot improve and cannot reach the global optimum or the best setting points. Design of experiments is the stepping stone of the improvement phase. As a recap of basics we have already spoken about: there are types of variability in a process. There is common cause variability, and there is special cause, abnormal variability, which is easy to detect with SPC control charts. To understand the functional relationship between Y and X we have to induce variability systematically, by changing the factors X. If we can do that systematically, we can generate a function: there is the real equation in the process, the empirical relationship that exists, and the model we develop from systematic, theoretically sound experimentation. We can then use optimization techniques to reach the global optimal point we are interested in, and for that design of experiments is most preferred. Design of experiments basically aims at reducing the common cause variability, and it can be applied when the process is in statistical control, that is, a stable process.
Only when the process is stable and in statistical control do we go for design of experiments, because we want to reduce the common cause variability further: the more I reduce variability, the more both accuracy and precision improve. We are interested in both bias and variance, as we also saw in capability analysis. Reducing variability further requires design of experiments, which helps us bring the mean to the target value and reduce the variability toward zero; we want minimum variability. Various design-of-experiment techniques have been proposed, and a huge amount of resources is available. A good preliminary book is Montgomery's, a good resource for the initial steps, but there are other books as well; Amitava Mitra's, for instance, also covers design of experiments in detail. I also want to recap one concept: variability has causes, these causes impact Y, and the causes are also somewhat interlinked with each other; together they can impact Y. In the scenario sketched here, the variables x1, x2, x3 and x4 are interrelated and not completely independent. If they were completely independent, we could directly understand what level of each X would deliver the optimal Y.
That is possible in the independent case, but not when there is a complex relationship among multiple X's that jointly impact Y; such scenarios cannot be identified by simpler analyses like design failure mode and effects analysis or process failure mode and effects analysis. For complex relationships, when we are developing the function, only design of experiments can help. Before going into the details, let me recap the process "P" diagram. There are certain control variables that are in the hands of the experimenter, and the experimenter changes these control variables; when you go to a process you observe the operator changing some of the variables, not all of them, but some key variables that are possible to control. These are the variables x1 to xp, known as controllable variables, and design of experiments is mostly about controllable variables. There are also inputs, and I have also talked about covariates, which can influence the outcome: we cannot control a covariate as it varies, but its influence can be considered when we are doing experimentation and developing the mathematical models. And there are uncontrollable factors: variables that are uneconomical to control, or about which we do not have enough information, because you cannot build a perfect model of Y as a function of all the variables x1 to xp.
There will be some amount of error in the model because of these noise variables: we do not get a perfect function, since some of the variables impacting the process are unknown to us. These are the noise variables, and their influence is expected to be small because we have identified most of the important ones. So some variables are controllable, some are covariates we have knowledge about, and some are noise variables over which we have no control and no sufficient information. In design of experiments we want to find a setting: a combination of the control variables, which are in our hands, that optimizes the Y CTQ so that the average value of Y is close to the target value and the variability of Y is very close to zero. The objective of design of experiments is thus twofold. First, which variables significantly impact Y; and second, given those variables, what combination or setting of the controllable x's gets Y exactly to the target value defined by the designer, with variability around the target near zero. A further objective is to develop the functional relationship so that we can optimize: whenever I have a function, I can optimize, and to get the function I need to know which controllable x variables contribute significantly to the variability in Y.
There will also be some error due to the noise variables that are not considered, being difficult or uneconomical to control, and this error may also be affected by the covariates, which influence the response but which we cannot control. I can only control the x variables; I have no control over the inputs or the uncontrollable variables. So most of the time we try to get the best combination of x1 to xp in the presence of this noise and of the covariate and input variability we experience in the process. This is what we will try to understand in design of experiments, along with how to do the experiments: there are different ways of experimenting and different designs available, and we will see a few of them. That is our objective in experimentation, and we will cover it in our next session. Thank you for listening.