Welcome to session 22 of our course on Quality Control and Improvement with Minitab. I am Professor Indrajit Mukherjee from the Shailesh J. Mehta School of Management, IIT Bombay. We are discussing analysis of variance (ANOVA) with examples and illustrating how to do it in Minitab. We are now at the interface between control and improvement. The purpose of ANOVA here is to identify which factor influences the CTQ output. We want to take the critical variables identified earlier, for example in the cause-and-effect diagram, and check whether they significantly influence the mean of the CTQ response. So we intentionally change the condition of X and observe the effect on the mean, or the variance, of the CTQ. ANOVA can be used for that. We took the hardwood concentration example, where the CTQ is tensile strength and the factor is hardwood concentration, and we saw how to do the analysis and determine that 20 is the best level at which to set hardwood concentration so as to maximize the CTQ response. We used ANOVA together with a multiple comparison test, Tukey's test, to confirm which level to freeze. Having found that changing this factor matters, we treat it as an important X that needs to be considered in further experimentation when we study more Xs together. The condition we assumed for ANOVA is that the factor is set at discrete levels and the response noted during experimentation, here tensile strength, is a continuous variable. Without a continuous response, ANOVA cannot be done.
From this analysis we get some additional information. The grouping information from Tukey's test assigns a letter to each level; levels that do not share a letter have significantly different means. That is why we froze the level at 20. Minitab also gives a model summary. One item I will explain here is R-squared, which is SS treatment, that is, the adjusted SS for hardwood concentration, divided by SS total. It is important because it tells us how much of the variability of Y in the data collected during experimentation is explained by the change in X. That is the fundamental concept. We are not discussing R-squared adjusted or R-squared predicted now; we will do that under regression analysis. The other item is S. The lower the value of S, the more adequate the model; it is used in regression also. S is the square root of the mean square error shown in the ANOVA table, so it estimates the standard deviation of the residuals.
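As a rough illustration of the two model-summary quantities just described, the sketch below computes R-squared as SS treatment divided by SS total and S as the square root of the mean square error for a one-way layout. The function name `one_way_anova_summary` and the data are made up for illustration; they are not the hardwood concentration data and this is not how Minitab is implemented internally.

```python
import math


def one_way_anova_summary(groups):
    """Return sums of squares, R-sq and S for a one-way layout.

    `groups` maps each factor level to its list of observations.
    """
    all_obs = [y for obs in groups.values() for y in obs]
    n_total = len(all_obs)
    grand_mean = sum(all_obs) / n_total

    ss_treatment = 0.0  # between-level variation
    ss_error = 0.0      # within-level variation
    for obs in groups.values():
        level_mean = sum(obs) / len(obs)
        ss_treatment += len(obs) * (level_mean - grand_mean) ** 2
        ss_error += sum((y - level_mean) ** 2 for y in obs)

    ss_total = ss_treatment + ss_error
    df_error = n_total - len(groups)
    ms_error = ss_error / df_error
    return {
        "ss_treatment": ss_treatment,
        "ss_error": ss_error,
        "ss_total": ss_total,
        "r_sq": ss_treatment / ss_total,  # share of Y variability explained by X
        "s": math.sqrt(ms_error),         # estimated std. deviation of residuals
    }


# Hypothetical data: three levels, three observations each (made-up numbers).
example = {
    "A": [7.0, 8.0, 9.0],
    "B": [12.0, 13.0, 14.0],
    "C": [17.0, 18.0, 19.0],
}
summary = one_way_anova_summary(example)
print(summary)
```

Multiplying `r_sq` by 100 gives the percentage that Minitab reports as R-sq.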
That is the information we get from the ANOVA output. You have to remember that there are model assumptions that need to be checked after you complete the analysis, so a residual analysis has to be done, because the conclusions rest on those assumptions. The first assumption is that the errors follow a normal distribution with mean 0 and variance sigma squared. Sigma squared is estimated by the mean square error, so we get an estimate of sigma from that. To check normality we need the residuals. What is a residual? Each observation in the data set is denoted y_ij. In the hardwood concentration experiment there were 24 observations, from 4 levels with 6 observations each. The average of level i is denoted y-bar_i-dot, for example y-bar_1-dot for the first level. The model's prediction for any observation is the average of its level, so the residual is the individual observation minus its level average: e_ij = y_ij - y-bar_i-dot. This gives 24 residuals e_ij, which can be stored in Minitab and then tested to see whether they follow a normal distribution.
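The residual calculation described above can be sketched in a few lines. This is only an illustration under the one-way model, where the fitted value for every observation is its level average; the function name `level_residuals` and the numbers are hypothetical.

```python
def level_residuals(groups):
    """Residual e_ij = y_ij - (average of level i), for every observation."""
    residuals = {}
    for level, obs in groups.items():
        level_mean = sum(obs) / len(obs)  # fitted value for this level
        residuals[level] = [y - level_mean for y in obs]
    return residuals


# Hypothetical data (made-up numbers): two levels, three observations each.
data = {
    "level_1": [7.0, 8.0, 15.0],
    "level_2": [12.0, 17.0, 13.0],
}
res = level_residuals(data)
print(res)
```

By construction the residuals within each level sum to zero, which is a quick sanity check on a stored residual column.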
The second condition to verify is whether the residuals are homoscedastic, meaning the variance of the residuals does not change with the levels of X. We did one such check at the start, before running the ANOVA, when we tested for equal variances: sigma_1 squared = sigma_2 squared = ... up to the last level. We used Bartlett's test when the underlying distribution of each group is normal; otherwise we used Levene's test or the multiple comparisons method. A similar test, the Breusch-Pagan test, can be applied to the residuals. It is most commonly used in regression, but Levene's test can also be used to check homogeneity of variance, on the residuals or on Y. The Breusch-Pagan test is not available in Minitab, so we can do it in R. The Durbin-Watson test is available in Minitab. It relates to another important assumption, independence of the errors: the error e_ij is independent of every other error. In other words, there is no relationship among the errors, no autocorrelation. So we need to check whether the errors are autocorrelated.
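To make the equal-variance check a little more concrete, here is a minimal sketch of the Bartlett test statistic using only the Python standard library. It returns the statistic and its degrees of freedom; under the null hypothesis of equal variances the statistic is approximately chi-square distributed with k - 1 degrees of freedom, so the p-value would come from that distribution (not computed here). The function name and the data are assumptions for illustration, not the lecture's worksheet.

```python
import math
from statistics import variance


def bartlett_statistic(groups):
    """Bartlett's statistic for equal variances across k groups.

    `groups` is a list of lists of observations.
    Returns (statistic, degrees_of_freedom).
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]  # sample variances (n - 1 divisor)

    # Pooled variance, weighted by each group's degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)

    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (
        sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)
    ) / (3 * (k - 1))
    return numerator / correction, k - 1


# Hypothetical data (made-up numbers): the second group is far more spread out.
stat, df = bartlett_statistic([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
print(stat, df)
```

A statistic close to zero corresponds to nearly identical sample variances; larger values point toward unequal variances. Bartlett's test is sensitive to non-normality, which is why Levene's test is preferred when the groups are not normal.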
For that there is the Durbin-Watson test statistic. Generally the errors should be independent, but in chemical processes, for example, observations are sometimes correlated, and then the errors are also correlated. The Durbin-Watson test can be done in Minitab; the only difficulty is that the computed value has to be compared with a tabulated value, which can be inconvenient, but there is a procedure for it. Now, what do we mean by heteroscedasticity, which the Breusch-Pagan test checks? Consider a plot with the residuals on the vertical axis and the levels of X on the horizontal axis, with say four levels 1, 2, 3, 4. If the spread of the residuals grows from level to level, the variance is changing. This non-constancy of error variance is undesirable. If I draw conclusions in this situation, the interpretation may be wrong and so may the judgment: I may reject a null hypothesis that should not have been rejected. So we have to take care of this. Heteroscedasticity is checked with the Breusch-Pagan test, Levene's test, or a similar test that establishes statistically whether non-constant variance exists.
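The Durbin-Watson statistic mentioned above is simple to compute from residuals taken in run order: the sum of squared successive differences divided by the sum of squared residuals. The sketch below is a minimal standard-library version with made-up residuals; judging significance still requires the tabulated bounds or a p-value from software.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals (made-up numbers).
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every run
constant = [1.0, 1.0, 1.0, 1.0]       # every residual equals the previous one

print(durbin_watson(alternating))
print(durbin_watson(constant))
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.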
In that case I need to make a correction before interpreting the ANOVA. The correction is a variance-stabilizing transformation of the Y variable, the CTQ, after which the ANOVA is run again. The two options I have shown you are the Box-Cox transformation and the Johnson transformation. We used them when the data were not normal; they can also be used when the variance is not constant, so that Y becomes closer to normal. After the transformation we rerun the analysis and check whether the residuals behave like white noise: they follow a normal distribution, have constant variance, and are independent. The Durbin-Watson statistic can be used for the independence check. In the residual plot, if the points scatter randomly around zero, on the high side and the low side, that is the expected behavior. If the spread changes across the levels of X, the variance is not constant. There can also be a U-shaped pattern in the residuals, which indicates that a higher-order term may have to be introduced; we will discuss this later. In that case there is non-linearity in a model that was assumed to be linear. So the plot lets us check for inequality of variance, and the random scatter is the scenario we expect, where everything is fine and there is no heteroscedasticity.
This is the visual impression we get when we plot the residuals, and Minitab provides several options for plotting them. The Durbin-Watson statistic is also available in Minitab, and both it and the Breusch-Pagan test are available in R, where the p-value tells us the result: if the Breusch-Pagan p-value is less than 0.05 we say heteroscedasticity is present, and if the Durbin-Watson p-value is less than 0.05 we say autocorrelation exists and the errors are not independent. These are the two tests we can use. Let me take one more example to complete the ANOVA analysis, and then we will move to examples where the assumptions are violated and see what options we have for stabilizing the variance or converting to a normal distribution.
The example is display design and percentage increase in sales. If you go to a mall, or to a shop such as Pantaloons, you will notice that the layout, the way you encounter the different products, is changed time and again. The aisle design can be changed. When shops change the design they expect more sales, so they keep updating it monthly or weekly, changing how people move through the shop so that products get more exposure and sales increase. In this experiment three different display designs were tried, and the percentage increase in sales, which is the Y or CTQ of interest, was monitored.
So this is a marketing example: how display designs are changed to raise the percentage increase in sales, and which design is optimal. First we do the basic check, the test of variances: ANOVA, then Test for Equal Variances. The responses are in one column, so the response is percentage increase in sales and the factor is display design. In Options, let us assume all the groups are normally distributed, which we can also check group by group, and select the test based on the normal distribution. We are not saving anything here; we only want to see whether the variances are the same. Clicking OK, the p-value for Bartlett's test is more than 0.05, so the variance is constant and there is no heteroscedasticity at this stage. Then we go to One-Way ANOVA. The response is again percentage increase in sales and the factor is display design. In Options we check Assume equal variances, which is already established, and in Comparisons we select Tukey's test. Clicking OK tells us whether display design is significant. I can copy the output as a picture, paste it into Excel, and enlarge it to see the result.
Display design is significant: when I change the display design, the mean changes for at least two levels. So the overall analysis says display design is an important variable; when you change it, the percentage increase in sales also changes. R-squared is around 85 percent, meaning 85 percent of the variability of Y is explained by the change in X. That is the interpretation from this analysis. Next, let me check whether the residuals were saved. I will delete this column and redo the analysis so that the residuals are stored at the end of the worksheet; in Storage I make sure Residuals is checked. In Graphs I can ask for the normal probability plot of the residuals to check normality, the residuals versus fits plot to see whether heteroscedasticity is present, and the residuals versus order plot to see whether autocorrelation exists. These plots give a visual impression. The Durbin-Watson test, the Breusch-Pagan test, and similar tests can then be used to confirm statistically whether there is significant heteroscedasticity or autocorrelation.
Clicking OK, the factor is again significant. If I copy the grouping information and paste it here, what we observe is that display design 3 gives a much higher percentage increase in sales: its group letter is A, which is different from the letter B shared by designs 1 and 2, both of which have a lower mean. So to maximize the percentage increase in sales I should go with display design 3. That is Tukey's comparison. The normal probability plot looks more or less acceptable, and the residuals versus fits plot does not show heteroscedastic behavior, which agrees with the earlier variance tests, Levene's test and Bartlett's test; we relied on Bartlett's test because we assumed normality. In the residuals versus order plot we also do not see much of a pattern, so the test for autocorrelation will probably not fail either. Now suppose the variance is not constant. In that case, in One-Way ANOVA, Options, you do not assume equal variances, so uncheck that option. When you then go to the multiple comparison tests, Minitab offers only the Games-Howell test, so select it. Games-Howell is used in combination with Welch's test whenever heteroscedastic behavior is observed.
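For readers who want to see what Welch's test does differently when equal variances are not assumed, here is a minimal sketch of the Welch one-way ANOVA statistic using only the standard library. Each group is weighted by its size divided by its own sample variance instead of using a pooled variance. The function name `welch_anova` and the data are made up for illustration; the p-value would come from an F distribution with the returned degrees of freedom and is not computed here.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Welch's F statistic for k groups with possibly unequal variances.

    `groups` is a list of lists of observations.
    Returns (f_statistic, df_numerator, df_denominator).
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n_i / s_i^2: precise groups count for more.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    w_sum = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / w_sum

    lam = sum((1 - w / w_sum) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    numerator = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    denominator = 1 + 2 * (k - 2) * lam / (k ** 2 - 1)

    df_num = k - 1
    df_den = (k ** 2 - 1) / (3 * lam)  # generally not an integer
    return numerator / denominator, df_num, df_den


# Hypothetical data (made-up numbers): three groups with different spreads.
f_stat, df1, df2 = welch_anova([
    [4.0, 5.0, 6.0, 5.0],
    [8.0, 12.0, 10.0, 14.0],
    [20.0, 21.0, 19.0, 22.0],
])
print(f_stat, df1, df2)
```

The fractional denominator degrees of freedom are the visible sign of the Welch correction in the output.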
Clicking OK, that takes care of the heteroscedasticity, and Minitab reports Welch's test. I can copy the output as a picture and paste it here. What do we get from Welch's test when the variance is not constant? Display design is significant here too, and the Games-Howell pairwise comparisons again show which design is different from which. So when heteroscedasticity is observed, the combination to use is Welch's test with the Games-Howell grouping information, and we can go ahead with the ANOVA. This is known as Welch ANOVA.
Sometimes the data are given to you differently, with each group saved in its own column. Here the columns are different types of chocolate: dark chocolate, dark chocolate combined with milk chocolate, and milk chocolate only. Some CTQ that is important, such as cholesterol level, is measured, and I want to see whether there is any difference between pure dark chocolate, the combination, and milk chocolate. This example is again taken from Montgomery's book: an experiment was conducted to investigate the effects of consuming chocolate on cardiovascular health.
What was measured is the antioxidant capacity of the blood plasma; that is the CTQ. The three treatments are dark chocolate, dark chocolate in combination with milk chocolate, and milk chocolate only. Twelve subjects were used, 7 male and 5 female, within similar ranges of age, weight, and body mass index. On different days a subject consumed one of the treatments, and one hour later the total antioxidant capacity was measured. These are the observations, and we want to check whether there is any significant difference in mean antioxidant capacity, for example whether it increases if only dark chocolate is consumed. The data are given in separate columns. So when you go to ANOVA, Test for Equal Variances in Minitab, you specify that the responses are not in one column but in separate columns, and then you enter the three columns. Before that, let us check whether each column follows a normal distribution. Let me check C21. We take dark chocolate first: the p-value is 0.07 against our level of 0.05, so we can safely assume it is normal. For the second column, DC plus MC, the Anderson-Darling test gives 0.69, so the normality assumption holds for this group too. The third column is milk chocolate only; let me check that the assumption is not violated there either.
All three columns satisfy the normality assumption. Now we do ANOVA, Test for Equal Variances. The data are in different columns, so we enter DC, DC plus MC, and MC, and in Options we select the test based on the normal distribution, so Bartlett's test will be used. Clicking OK gives the Bartlett's test output, which I copy as a picture and paste here. Bartlett's test gives a p-value of 0.08, which confirms that the variances are the same at all levels. With equal variances established, I run One-Way ANOVA. Everything is the same as before, except that I specify that the responses are in separate columns: dark chocolate, the combination, and milk chocolate. In Options we assume equal variances, and in Comparisons we select Tukey's test. We can view the residual plots, the normal probability plot and residuals versus fits, and we can store the residuals so that the residual checks can be done. Then click OK. The result is a significant difference, confirmed by the ANOVA p-value, which is close to 0. Which treatment differs from which is given by the letter codes.
Dark chocolate gives the higher mean antioxidant capacity of the blood plasma, which is a higher-the-better characteristic. DC is in a group that is different from the other groups, so dark chocolate alone gives a higher mean, which is what we were expecting. That is Tukey's test. The normal probability plot does not seem to violate the normality condition. The last three columns in the worksheet are the stored residuals, one per treatment. We can check the normality assumption on each column of residuals, or we can combine all the residuals and check them together for any violation of the assumptions. To combine them, cut one column with Ctrl+X and paste it below the other, do the same for the next, and then remove the two emptied columns, so that one column holds all the residuals we want to analyze. Then go to Stat, Basic Statistics, Normality Test, select this last residual column, and run the Anderson-Darling test. The p-value is 0.157, so there is no real deviation from the normality assumption. The other two checks can also be done: the Breusch-Pagan test in R, and the Durbin-Watson statistic, which can be compared with the table or read from the p-value reported in R, so we can say immediately whether the errors are independent or autocorrelated.
Those are the conditions. But there can be scenarios where an assumption fails, so let us take one such example and see the correction we may have to make. In this example different marketing strategies are used and car sales are reported. Although the number of cars sold is not a continuous variable, let us treat it as continuous and go ahead with the ANOVA assumptions. Different kinds of advertisement were used: sometimes quality was the theme, sometimes design flexibility, and sometimes price was given priority. The number of sales under each type of advertisement in different periods was noted, and we want to see which type gives higher car sales. ANOVA can be applied directly; let us assume all other conditions hold. In the worksheet the columns are C18 and C19: the response is in C19 and the factor is in C18. In Options we check Assume equal variances; we can verify that afterwards. What I want to show here is the residual, which is stored at the end of the worksheet. Let us run the ANOVA.
The last column is the residual from the ANOVA model. Let me check whether it is normally distributed. When you select the residual column and run the Anderson-Darling test, you find that the residuals are non-normal: the reported p-value is 0.015, which is less than 0.05. What is to be done when a condition fails, whether it is homoscedasticity, independence, or normality? First, look at what happens when I test Y itself within each group: quality, design flexibility, and price. For quality, the Anderson-Darling p-value is more than 0.05, which is satisfactory. For design flexibility the p-value is 0.939, also satisfactory. For the last group, price, going again to Basic Statistics, Normality Test, the p-value is 0.117, so this is also normal. So Y is normal within every group, yet when I run the ANOVA and save the residuals, the residuals do not follow a normal distribution. The assumptions of ANOVA are stated on the residuals, so we must always check the residuals to confirm them.
So I check the residuals, and if the condition does not hold, we transform the Y variable, the CTQ. One option we have already discussed is the Box-Cox transformation. The values are in one column, so I select that column and a subgroup size of 1. I am not storing anything for now; I only want to see the optimal Box-Cox transformation Minitab gives. It says to use a rounded value of 0. A lambda of 0 indicates the ln transformation, the natural logarithm to base e. So we create a column, call it ln of car sales, and compute it with the Calculator: the expression is the natural log of C19, and the result is stored in C20. Clicking OK applies the transformation to the data set. Let us check whether the transformed Y satisfies the normality assumption. I run the test on the ln car sales column and it follows a normal distribution, so there is no problem. Now I redo the ANOVA with all other settings unchanged; only the response has to be changed.
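The transformation step above can be sketched as follows. This assumes the simple power form in which the response becomes Y raised to lambda, with the natural logarithm used when lambda is 0, as described in this lecture; the textbook Box-Cox form, (Y^lambda - 1) / lambda, differs from it only by a shift and a scale. The function name `box_cox` and the sales figures are hypothetical, and the search for the optimal lambda is not shown.

```python
import math


def box_cox(values, lam):
    """Power transformation of a strictly positive response.

    lam == 0 gives the natural log; otherwise each value is raised to lam.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [v ** lam for v in values]


# Hypothetical response column (made-up numbers, not the car sales worksheet).
sales = [12.0, 30.0, 45.0, 150.0]
ln_sales = box_cox(sales, 0)       # the lambda = 0 case used in the lecture
sqrt_sales = box_cox(sales, 0.5)   # another common choice, square root

print(ln_sales)
print(sqrt_sales)
```

The transformed column then takes the place of the original response in the one-way ANOVA, with the factor column left unchanged.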
So, I have to change the response because I have transformed the data, on account of the non-normal residuals that we were getting. I have used a conversion, and on the converted Y I will do the one-way ANOVA. In this case the factor will be the same; only Y will be changed to the transformed Y, where lambda over here is 0, and that indicates a natural logarithmic transformation. Then I will again store the residuals at the end and see what happens. So, here also it is saying that there is no difference between the marketing strategies, because the p-value is not significant. So, marketing strategy does not influence the outcome, which is car sales. Now, let me check whether the final residuals are normal or not. So, I will do the normality test on the final residual column, residual 1, which is the one saved at the very end. If I click OK over here, you see that now the residuals follow a normal distribution. So, when I have done the transformation on Y and then run the ANOVA on the transformed Y with the same X factor, I get the true information. Sometimes, if you ignore the model adequacy test, you may conclude wrongly in certain scenarios; misleading information can come out of the ANOVA. So, a proper transformation may be used over here. That said, statisticians say that ANOVA is quite robust to non-normal behavior of the residuals: even if there is some deviation from the normality assumption, or the model adequacy checks are not fully satisfied, whatever results you get may still be adequate to draw conclusions from.
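As a rough illustration of rerunning the one-way ANOVA on the transformed response, here is a short Python sketch. The sales numbers are made up, and the Shapiro-Wilk test from SciPy is used as a stand-in for the Anderson-Darling test used in the lecture, so treat it as a sketch of the workflow and not as a reproduction of the Minitab output.

```python
import numpy as np
from scipy import stats

# Hypothetical car sales for three marketing strategies (made-up numbers).
sales = {
    "quality": [12, 18, 25, 40, 95],
    "design flexibility": [14, 16, 30, 38, 90],
    "price": [11, 20, 27, 45, 85],
}

# Same factor, but the response is now ln(Y) (Box-Cox with lambda = 0).
log_groups = [np.log(v) for v in sales.values()]

# One-way ANOVA on the transformed response.
f_stat, p_value = stats.f_oneway(*log_groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

# Store the residuals again (observation minus group mean) and re-check normality.
residuals = np.concatenate([g - g.mean() for g in log_groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)   # stand-in for Anderson-Darling
print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
```

With these made-up numbers the three group means of ln(Y) are almost identical, so the ANOVA p-value is not significant, which mirrors the conclusion in the lecture that marketing strategy does not influence car sales.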
So, all these checks need to be done, because the model adequacy check is an important aspect. If you correct the problem and then do the ANOVA, you can be much more sure that whatever conclusions you are drawing from the ANOVA or from the comparison tests are correct and can be generalized for the given levels. Then there is no ambiguity as such about the factors that we have selected. So, this one-way ANOVA is basically used when we are in transition, when we are just shifting from the control phase to the improvement phase and trying to determine which factors need to be considered in experimentation. For that, this type of small one-factor analysis, the one-way analysis of variance, can be used, and also the t-tests, the two-sample t-test and the paired t-test, and this kind of small experimentation. One more important tool that can be of help while selecting the factors is regression analysis, which can be used when X is continuous and Y is also continuous. Those kinds of variables can be best identified if we use regression. With simple historical data, where no experimentation was done, some correlations and some linear equations can be established. Those things we will see when we talk about regression analysis, which is extremely useful not only in design of experiments but also while segregating the factors; previous information can be used, and that will give you a lead on whether a factor should be included in or excluded from the analysis. So, we will discuss regression in our next lecture, from simple regression to multiple regression, which will be helpful in our design of experiments.
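As a small preview of that screening idea, here is a minimal Python sketch of a simple linear regression on historical data with `scipy.stats.linregress`. The numbers are made up, and the 0.05 cut-off and the name `keep_factor` are assumptions for illustration only.

```python
from scipy import stats

# Hypothetical historical records (made-up): a continuous X and a continuous Y.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

# Fit Y = intercept + slope * X by least squares.
fit = stats.linregress(x, y)
print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}")
print(f"R-squared = {fit.rvalue**2:.3f}, p-value for slope = {fit.pvalue:.3g}")

# Assumed screening rule: keep X for further experimentation if the slope is significant.
keep_factor = fit.pvalue < 0.05
print("include X in further experimentation" if keep_factor else "X may be dropped")
```

A significant slope here only gives a lead that X is worth carrying into the designed experiment; it does not replace the experiment itself.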
And moreover, we will add analysis of covariance, another important aspect which we have missed out over here. So, we will start with analysis of covariance and then we will shift to regression analysis, where X is continuous and Y is also continuous, and see how to identify whether X influences Y or not. Those are the kinds of things that will be used for screening in the screening phase, and then in full-blown experimentation in the improvement stage. So, these are the techniques which can be used, and Minitab gives you all the options to explore them. So, we will stop here and we will continue from here in our next session. Thank you.