Welcome to session 21 of our course on Quality Control and Improvement with Minitab. I am Professor Indrajit Mukherjee from the Shailesh J. Mehta School of Management, IIT Bombay. In this session we will see some examples of one-way analysis of variance. Last time we were trying to understand what analysis of variance is and why we use it, so let me start with a brief recap. Assume there is one factor, X1, and this diagram explains what we are trying to do; I can ignore the other factors here. I want to check whether the mean response of Y changes at different conditions, that is, different levels, of X1. By mean response I mean the population average at each level, which we can write as mu 1, mu 2, mu 3 and so on. The null hypothesis is that this average is the same at every level. The alternative hypothesis is that mu i is not equal to mu j for at least one pair with i not equal to j, that is, there are at least two levels where the averages are significantly different. This method was developed for the case of one factor. That is not the scenario in most designed experiments, but it is the most favorable one you can expect: I have one factor and I want to find the level that optimizes Y, the CTQ. When the one factor has only two levels, we suggested the two-sample t-test, or the paired t-test in certain scenarios.
If you have a controllable factor, one that is in your control as an experimenter, and it has more than two levels, then analysis of variance lets you keep the type I error controlled at 0.05. The type I error will not inflate if I apply analysis of variance instead of repeated two-sample t-tests, so the two-sample t-test is not recommended for more than two levels. What is recommended is analysis of variance, proposed by Ronald Fisher around 1921. It became very popular when it was proposed, people accepted it, and it is still used in design of experiments. We are now in the improvement phase: analysis of variance and the two-sample t-test both belong here, because after making improvements we want to check whether they are effective, that is, whether the results are statistically different. Here too we are checking whether the factor is significant, whether it influences Y, and what level of X will optimize Y. Analysis of variance is suggested for that. We are comparing different means, but variance information is used to judge the difference in means, and that is why it is known as analysis of variance. Then we explained how people run experiments with a single factor. Here the factor is hardwood concentration, with four levels: 5, 10, 15 and 20 percent. The repeated observations at each level are known as replicates, as we mentioned last time. Here n equals 6, which means that at 5 percent hardwood concentration the experiment was run on 6 different samples, and those 6 observations are shown here; similarly at 10 percent, and so on.
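To see why repeated two-sample t-tests are not recommended for more than two levels, here is a rough sketch of how the type I error can grow with the number of pairwise tests. It assumes the pairwise tests are independent, which is only an approximation when the comparisons share data, so the figure is indicative rather than exact.

```python
from math import comb

alpha = 0.05   # per-test type I error, as in the session
levels = 4     # 5, 10, 15 and 20 percent hardwood concentration

# Number of two-sample t-tests needed to compare every pair of levels.
pairs = comb(levels, 2)

# Chance of at least one false alarm across all tests,
# ASSUMING the tests are independent (an approximation).
familywise = 1 - (1 - alpha) ** pairs

print(f"pairwise tests needed: {pairs}")
print(f"approximate familywise type I error: {familywise:.3f}")
```

With 4 levels there are 6 pairs, and under this assumption the chance of at least one false alarm is roughly 0.26 rather than 0.05, which is the inflation that a single ANOVA F test avoids.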
This experiment was run in random order. Randomization is another important concept that was introduced. Why it is required we will understand later, but for now you should know that every experiment has to be randomized. How do we randomize? We pick any one of the levels and any one of the samples at random and run that combination. The tensile strength that results is the response, which we write as y ij. Here i is the level, running from 1 to a, where a is the number of levels, and j is the replicate, running from 1 to 6. So one observation is expressed mathematically as y ij. The a levels belong to the factor X, which is nothing but hardwood concentration. We have to decide what the factor is and how many levels we want to experiment with, and freeze that before starting the experiment. How is the range of the factor, and hence its levels, selected? It is based on engineering judgment about how far the process can go at its extremes. The lowest level may be just above the minimum and the highest just below the maximum, so that we stay within the feasible range. So we check the feasible range of X and select the levels based on it. There should also be a reasonable gap between the levels; they should not be very close. You can consult textbooks on how levels are selected.
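As an illustration of the randomization step described above, the following sketch builds the 24 level and replicate combinations and shuffles them into a run order. The seed value is a hypothetical choice made only so that the example is repeatable; it is not part of the session.

```python
import random

levels = [5, 10, 15, 20]   # hardwood concentration, percent
replicates = 6             # n = 6 observations per level

# Every (level, replicate) combination that has to be run: a * n = 24 runs.
runs = [(level, rep) for level in levels for rep in range(1, replicates + 1)]

# Shuffle a copy to obtain a random run order.
rng = random.Random(21)    # hypothetical seed, for repeatability only
run_order = runs[:]
rng.shuffle(run_order)

for k, (level, rep) in enumerate(run_order, start=1):
    print(f"run {k:2d}: {level:2d}% hardwood, replicate {rep}")
```

Each run appears exactly once, only the order changes, so the design stays balanced while the sequence is left to chance.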
So this is one example, where I want to maximize tensile strength. The hardwood concentration experiment was run with randomization and 6 replicates. The first run may be at 5 percent, the second at 10 percent, then 20 percent, then 15 percent, and so on, giving 24 observations: 6 replicates multiplied by 4 levels. We want to analyze these data and figure out at what level we should freeze hardwood concentration, if this is the only factor. We can also see whether there is any statistically significant difference between the levels, that is, whether changing the level changes the mean value of the response. That is what we do in analysis of variance. One assumption is that the factor takes discrete levels, the ones we are experimenting with, and that the response Y is a continuous variable. This is an important point: Y being continuous is the primary assumption, and only then can we apply one-way analysis of variance. One-way means one factor, at different levels, preferably more than 2. The levels are also in the hands of the experimenter: I am changing them, so they are in my control. The statistical model used in this case is known as the fixed effects model, and that is the analysis we are going to show, where I can set the factor levels at will.
With a fixed effects model we cannot generalize beyond the levels tested. We can make a judgment about 5, 10, 15 and 20 percent, the levels in the experiment, but not about arbitrary values between 5 and 20. When we want to generalize beyond the tested levels, treating them as a random sample from a larger population of levels, that is the random effects model. Here we are using the fixed effects model: these are the discrete levels, these are the outcomes of the experiment, and from them we want to determine the best level, that is, where to set hardwood concentration to maximize tensile strength. Then I told you that there is an ANOVA table, which is the table we will get from the analysis. In it, SS treatment is calculated. SS treatment is the variation of each level average from the overall average, the grand average, written y double dot bar in most textbooks. When we capture that variation it is known as SS treatment. The formula, which I did not give earlier, is shown here: the deviation of each level average from the grand average, squared, weighted by the number of replicates, and summed, gives SS treatment. The overall variation is SS total: each individual observation y ij minus the grand average y double dot bar, squared and summed. Then we have the error variation, SS error, because hardwood concentration is not the only factor that influences the overall variation of the process.
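The sums of squares described in words above can be written out as follows. This is the standard balanced one-way layout, with a levels, n replicates per level, level averages written with one dot and the grand average with two dots; the notation is an assumption made to match the spoken "y dot dot bar".

```latex
\begin{aligned}
SS_{\text{Total}} &= \sum_{i=1}^{a}\sum_{j=1}^{n}\left(y_{ij}-\bar{y}_{..}\right)^{2} \\
SS_{\text{Treatment}} &= n\sum_{i=1}^{a}\left(\bar{y}_{i.}-\bar{y}_{..}\right)^{2} \\
SS_{\text{Error}} &= \sum_{i=1}^{a}\sum_{j=1}^{n}\left(y_{ij}-\bar{y}_{i.}\right)^{2}
                  = SS_{\text{Total}} - SS_{\text{Treatment}}
\end{aligned}
```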
The variability due to X is induced: we are changing X intentionally. But there can be other X variables that we do not know about, so there will be some error in the estimation. That is the error sum of squares, SS error, which can be calculated as SS total minus SS treatment. The formula is also given here: each individual observation minus the average of its own level, squared and summed. This difference, individual observation minus level average, is also known as the residual, and we will see it again later in the ANOVA analysis. So we can calculate SS error and SS total like this, and each has degrees of freedom. If I have a levels, a minus 1 is the degrees of freedom for treatment. With n replicates per level there are a times n observations in total, 24 here, so the total degrees of freedom is a times n minus 1, which is 23. The error degrees of freedom is obtained by subtracting the treatment degrees of freedom from the total, which gives a into (n minus 1), 20 here. SS treatment divided by a minus 1 gives the mean square for treatment, which expresses the between-level variation. SS error divided by a into (n minus 1) gives the mean square error, which expresses the within-level, or error, variation. Fisher recommended calculating a statistic, mean square treatment divided by mean square error, which gives an F value that follows the F distribution when the null hypothesis is true. This F value can be compared with the tabulated value at level of significance alpha, with numerator degrees of freedom a minus 1 and denominator degrees of freedom a into (n minus 1).
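The calculation above can be sketched in plain Python. The worksheet values are not shown in the transcript, so the data below are an assumption: they are the textbook hardwood concentration data set, chosen because it reproduces the sums of squares and the F value quoted later in the session. Treat them as illustrative.

```python
# ASSUMED data: tensile strength at each hardwood concentration (percent).
# Not shown in the transcript; chosen because it matches the quoted ANOVA table.
data = {
    5:  [7, 8, 15, 11, 9, 10],
    10: [12, 17, 13, 18, 19, 15],
    15: [14, 18, 19, 17, 16, 18],
    20: [19, 25, 22, 23, 18, 20],
}

a = len(data)                                   # number of levels
all_obs = [y for ys in data.values() for y in ys]
N = len(all_obs)                                # total observations, a * n
grand_mean = sum(all_obs) / N
level_means = {lvl: sum(ys) / len(ys) for lvl, ys in data.items()}

# Sums of squares: total, between levels (treatment), within levels (error).
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)
ss_treat = sum(len(ys) * (level_means[lvl] - grand_mean) ** 2
               for lvl, ys in data.items())
ss_error = sum((y - level_means[lvl]) ** 2
               for lvl, ys in data.items() for y in ys)

# Degrees of freedom: a - 1, a(n - 1) and an - 1.
df_treat, df_error, df_total = a - 1, N - a, N - 1

ms_treat = ss_treat / df_treat
ms_error = ss_error / df_error
f_value = ms_treat / ms_error

print(f"Treatment: df={df_treat}, SS={ss_treat:.2f}, MS={ms_treat:.2f}, F={f_value:.2f}")
print(f"Error:     df={df_error}, SS={ss_error:.2f}, MS={ms_error:.2f}")
print(f"Total:     df={df_total}, SS={ss_total:.2f}")
```

The additive relationship can be checked directly: the treatment and error sums of squares add up to the total, and the same holds for the degrees of freedom.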
From that you get a tabulated F value. Compare the F calculated from the experimental results with the tabulated F: if the calculated F is higher than the tabulated value, we can expect the P value to be less than 0.05, and in Minitab we will go by the P value. So let us now do the analysis using Minitab. Before that, some assumptions have to be verified. One of them concerns the variance at the different levels. Suppose I plot X on one axis and Y on the other, with X at its levels of 5, 10, 15 and 20 percent. We expect variation at each level, because if I repeat the experiment, here 6 observations at 5 percent, the values will vary; we cannot get a single value. The second level will also show some variation, the third as well, and so on. We want to check whether the variances we are estimating at the different levels are all the same or not, because the ANOVA analysis changes depending on that. Just as we tested two sample variances earlier, here we have more than two, and we can test whether all the variances are equal or whether at least two differ. For that test an underlying assumption has to be checked: whether the values in each group follow a normal distribution, because the choice of test also depends on that.
So we first have to test, group by group, whether the data at 5, 10, 15 and 20 percent are normally distributed. If they can be assumed normal, we make that assumption and go ahead with the equal variance test. If the variances turn out not to be equal, we go ahead with a different test, known as Welch's test, which is also available in Minitab. So first we will open Minitab and analyze this data set, the hardwood concentration data, and then see how to run the ANOVA and how to interpret it. Here the hardwood concentration is in one column and the tensile strength data in another. We go to Stat, then ANOVA, where there is an option for the test of equal variances. First we will test whether the equal variance condition is satisfied. The dialog asks whether the response data are in one column or in separate columns; ours are in one column. So the response is tensile strength, which is C2, and the factor is hardwood concentration. Then you go to Options and choose the test based on the normal distribution. Before relying on that, normality has to be checked. Properly, you should separate the values for 5, 10, 15 and 20 percent and check each group. I am doing a rough approximation here and checking whether the values overall are normal.
So I go to Basic Statistics. What you should do is segregate the data set into 5 percent, 10 percent and so on, and check individually whether each group follows a normal distribution. I am running the normality test on tensile strength overall. I select tensile strength and choose the Anderson-Darling test. When you click OK, you get a P value of approximately 0.5. Again, this should be done group by group, but in this overall analysis the P value is more than 0.05, so we can treat the data as approximately normal. So I am assuming normality while checking whether the variances are the same. Now I go back to Stat, ANOVA, and select the test of equal variances, not one-way ANOVA yet. The response data are in one column, not separate columns, so I give tensile strength as the response and hardwood concentration as the factor. In Options I select the test based on the normal distribution. I do not change the confidence level, so it remains the same, and let us say we want to see all the results. I click OK, and among the results you will see Bartlett's test.
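The Anderson-Darling check mentioned above can be sketched by hand, both on the pooled values and group by group. This is a minimal stdlib version, not Minitab's implementation. The data are the assumed textbook values (the worksheet is not shown in the transcript), and 0.752 is the commonly tabulated 5 percent critical value for the size-adjusted statistic when mean and standard deviation are estimated from the sample.

```python
from math import erf, log, sqrt
from statistics import mean, stdev

# ASSUMED data (textbook hardwood concentration values, not from the transcript).
data = {
    5:  [7, 8, 15, 11, 9, 10],
    10: [12, 17, 13, 18, 19, 15],
    15: [14, 18, 19, 17, 16, 18],
    20: [19, 25, 22, 23, 18, 20],
}

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def anderson_darling(sample):
    """Size-adjusted Anderson-Darling statistic for normality,
    with mean and standard deviation estimated from the sample."""
    x = sorted(sample)
    n = len(x)
    m, s = mean(x), stdev(x)
    cdf = [normal_cdf((v - m) / s) for v in x]
    total = sum((2 * i + 1) * (log(cdf[i]) + log(1.0 - cdf[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    return a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)

CRITICAL_5PCT = 0.752   # tabulated 5% critical value for the adjusted statistic

pooled = [y for ys in data.values() for y in ys]
overall_a2 = anderson_darling(pooled)
print(f"overall: A2* = {overall_a2:.3f} (reject normality if above {CRITICAL_5PCT})")

for lvl, ys in data.items():
    print(f"{lvl:2d}%: A2* = {anderson_darling(ys):.3f}")
```

A statistic below the critical value corresponds to a P value above 0.05, which is the reading used in the session. With only 6 observations per group the group-wise check has little power, so it is a screen rather than a proof of normality.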
Because normality is assumed, Bartlett's test is the most suitable test here. It plays the role that the F test played when we compared two variances in the two-sample case. Under the normality assumption it is the most powerful test available, and statisticians recommend it. Look at the P value: it is more than 0.05, which indicates that we cannot reject the hypothesis that all the variances are the same. So overall there is no statistically significant difference among the variances, and the equal variance assumption is checked. Next I go to Stat, ANOVA, One-Way. In Options I tick assume equal variances, because we have already checked that. I click OK, and in Graphs you can select the box plot; some other assumptions also have to be checked, which we will see afterwards. So let us run this with tensile strength as the response and hardwood concentration as the factor, and see what the ANOVA says when the equal variance condition holds and group-wise normality also holds. When I click OK I get the ANOVA output shown here. It can be copied as a picture and pasted into Excel to enlarge the view. Let us see what the results indicate and what values we get for the mean square error that we studied. When I paste and enlarge the image, you see the source of variation, which is hardwood concentration, and its sum of squares.
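Bartlett's test referred to above can be sketched as follows. This is a hand computation of the usual Bartlett statistic, not Minitab's output. The data are the assumed textbook values, and 7.815 is the tabulated chi-square critical value for 3 degrees of freedom at alpha 0.05.

```python
from math import log
from statistics import variance

# ASSUMED data (textbook hardwood concentration values, not from the transcript).
data = {
    5:  [7, 8, 15, 11, 9, 10],
    10: [12, 17, 13, 18, 19, 15],
    15: [14, 18, 19, 17, 16, 18],
    20: [19, 25, 22, 23, 18, 20],
}

k = len(data)                                   # number of groups
sizes = [len(ys) for ys in data.values()]
variances = [variance(ys) for ys in data.values()]
N = sum(sizes)

# Pooled variance: weighted average of the group variances.
pooled_var = sum((n - 1) * s2 for n, s2 in zip(sizes, variances)) / (N - k)

# Bartlett statistic: compares log of pooled variance with logs of group variances.
numerator = (N - k) * log(pooled_var) - sum(
    (n - 1) * log(s2) for n, s2 in zip(sizes, variances))
correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (N - k)) / (3 * (k - 1))
bartlett_stat = numerator / correction

CHI2_CRIT_3DF = 7.815   # tabulated chi-square value, 3 df, alpha = 0.05

print("group variances:", [round(s2, 2) for s2 in variances])
print(f"Bartlett statistic = {bartlett_stat:.3f} on {k - 1} df")
print("equal variances rejected" if bartlett_stat > CHI2_CRIT_3DF
      else "no evidence against equal variances")
```

A statistic well below the critical value is the same conclusion as a P value above 0.05: the equal variance assumption is not rejected.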
The first row shows how much variation is attributed to changing the hardwood concentration. Minitab labels it Adj SS, adjusted sum of squares; the formula remains the same as in our slides. This is nothing but the sum of squares due to treatment. Its degrees of freedom is 3, because there are 4 levels and 4 minus 1 is 3. The total degrees of freedom is 23, because I ran 24 trials overall and 24 minus 1 is 23, and the total sum of squares is calculated as well. If you take the total sum of squares and subtract 382 from it, you get 130.2, the error sum of squares. The relationship is additive, so given any two of them we immediately get the third. The mean square for treatment is the sum of squares divided by its degrees of freedom: 382 divided by 3, which is about 127. The mean square error is 130 divided by 20, which is about 6.5. How is the F value derived? It is the mean square treatment divided by the mean square error. The mean square error is an estimate of the error variance, sigma squared, of Y; its square root estimates the standard deviation of the process. Here F is about 127 divided by 6.5, which comes to 19.61 when computed from the unrounded values. That is a very high F, and when F is high we expect the P value to be low.
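To connect the F value of 19.61 with its P value, here is a stdlib-only sketch that integrates the upper tail of the F distribution numerically with Simpson's rule. The helper names and the integration limits are assumptions made for this illustration. The truncation at 1000 is adequate for 20 denominator degrees of freedom but would not be for very small ones. The value 3.0984 is the tabulated 5 percent critical value of F with 3 and 20 degrees of freedom.

```python
from math import exp, lgamma, log

def f_pdf(x, d1, d2):
    """Density of the F distribution with d1 and d2 degrees of freedom (x > 0)."""
    log_beta = lgamma(d1 / 2) + lgamma(d2 / 2) - lgamma((d1 + d2) / 2)
    log_pdf = ((d1 / 2) * log(d1 / d2) + (d1 / 2 - 1) * log(x)
               - ((d1 + d2) / 2) * log(1 + d1 * x / d2) - log_beta)
    return exp(log_pdf)

def f_upper_tail(x, d1, d2, upper=1000.0, steps=20000):
    """P(F > x), by composite Simpson integration of the density on [x, upper].
    ASSUMES the tail beyond `upper` is negligible (true for d2 = 20)."""
    h = (upper - x) / steps                     # steps must be even
    total = f_pdf(x, d1, d2) + f_pdf(upper, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(x + i * h, d1, d2)
    return total * h / 3

d1, d2 = 3, 20          # a - 1 and a(n - 1) from the session
f_observed = 19.61      # F value reported in the session
f_critical = 3.0984     # tabulated F(0.05; 3, 20)

p_value = f_upper_tail(f_observed, d1, d2)
tail_at_critical = f_upper_tail(f_critical, d1, d2)

print(f"P(F > {f_critical}) = {tail_at_critical:.4f}  (should be close to 0.05)")
print(f"P value for F = {f_observed}: {p_value:.2e}")
```

Because the observed F is far above the tabulated value, the tail area is far below 0.05, which is the relationship between a high F and a low P value described above.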
So the result is highly significant. There are at least two levels between which the average response differs significantly. That means this variable X is important and influences the mean of the response, the CTQ. So this factor can be considered for further experimentation in future. But this conclusion rests on certain assumptions, such as normality, and those need to be checked. If instead you assume that the variances are different, another test equivalent to one-way analysis of variance, Welch's test, is applicable. In Options, if you do not assume equal variances, the statistical test that is run is Welch's test. If you click OK you get another set of values, shown here; I copy it as a picture and paste it. It is an equivalent and robust test, recommended when the variances are different. In Welch's test too we are checking whether changing the level influences the mean value of the CTQ, and the P value comes out less than 0.05, so the conclusion is the same. But this test is meant for the case where the variances are not the same. It is the counterpart of ANOVA when the variances at different levels are unequal. That is not the scenario here, but I have shown you the option.
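For reference, Welch's test for unequal variances can be sketched by hand as below. This follows the usual Welch one-way formula and is not taken from Minitab's output. The data are the assumed textbook values, so the numbers are illustrative.

```python
from statistics import mean, variance

# ASSUMED data (textbook hardwood concentration values, not from the transcript).
data = {
    5:  [7, 8, 15, 11, 9, 10],
    10: [12, 17, 13, 18, 19, 15],
    15: [14, 18, 19, 17, 16, 18],
    20: [19, 25, 22, 23, 18, 20],
}

k = len(data)
n = {lvl: len(ys) for lvl, ys in data.items()}
m = {lvl: mean(ys) for lvl, ys in data.items()}
s2 = {lvl: variance(ys) for lvl, ys in data.items()}

# Each level is weighted by n / s^2, so noisier levels count for less.
w = {lvl: n[lvl] / s2[lvl] for lvl in data}
w_sum = sum(w.values())
weighted_mean = sum(w[lvl] * m[lvl] for lvl in data) / w_sum

between = sum(w[lvl] * (m[lvl] - weighted_mean) ** 2 for lvl in data) / (k - 1)
lam = sum((1 - w[lvl] / w_sum) ** 2 / (n[lvl] - 1) for lvl in data)

welch_f = between / (1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
df_num = k - 1
df_den = (k ** 2 - 1) / (3 * lam)    # usually not a whole number

print(f"Welch F = {welch_f:.2f} with df = ({df_num}, {df_den:.2f})")
```

The denominator degrees of freedom drop from 20 to roughly 11 because the pooled error variance is no longer used. The statistic is still far above the tabulated 5 percent value for about 3 and 11 degrees of freedom, which is roughly 3.6, in line with the P value below 0.05 reported in the session.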
To get it, go to Stat, ANOVA, One-Way, and in Options untick assume equal variances, as you would if the test had shown that the variances are not the same. As soon as you untick it, Welch's test is used and the results are shown here, along with the other values such as the confidence intervals. The model summary shows the R-square value, which we are not explaining in detail at present; we are only explaining the overall idea. With Welch's test the conclusion is again that at least two levels are different. Now let us go back to the slides. What you see here is the same analysis of variance table that we derived using Minitab. This is the equal variance test, Bartlett's test, that I showed, and these are the 95 percent confidence intervals for the mean at the levels 5, 10, 15 and 20 percent. In the model summary, R-square tells us how much of the total variability is explained by changing the hardwood concentration. So how influential this factor is can be seen from the R-square value: the higher the R-square, the more the change in hardwood concentration accounts for the overall variability. R-square is known as the coefficient of determination. It will come up again when we discuss regression analysis and will be clearer once you see the regression formulas.
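As a small worked example of the R-square idea, the ratio can be computed from the sums of squares quoted in the session. The inputs are the rounded values as spoken, 382 and 130.2, so the result is approximate.

```python
# Rounded sums of squares as quoted in the session.
ss_treatment = 382.0
ss_error = 130.2
ss_total = ss_treatment + ss_error      # additive relationship

# Share of total variability explained by hardwood concentration.
r_squared = ss_treatment / ss_total

print(f"R-square = {r_squared:.1%}")
```

Roughly three quarters of the total variability in tensile strength is explained by the change in hardwood concentration, with the remainder left in the error term.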
So we leave the model summary aside for the present moment. We are interested in the ANOVA result, F of 19.61 and its P value, which indicates that mu i is not equal to mu j for at least one i and one j with i not equal to j. That is, at least two levels of hardwood concentration give a different mean response, CTQ or Y. Remember that Bartlett's test is the one used when normality is assumed. If you are not assuming normality, Minitab gives Levene's test and a multiple comparisons test. The multiple comparisons test is more powerful than Levene's test. Levene's test does not rely on the normality assumption and can be used when the sample size is small and the distribution is skewed. We may see some examples later of when it can be applied. Now another important concept, shown in this diagram. I know that at least two means are different, but which one is different from which? Is it 5 versus 10, 10 versus 15, or 15 versus 20? ANOVA cannot tell you that. For that we need a multiple comparison test. There are different methods for multiple comparisons, and we will use one of them, Tukey's method, which is available in Minitab. Other methods, such as Fisher's, are also offered.
There are different methods, but we will prefer Tukey's method here. I will explain one method; you can look at the others yourself. The overall objective is a paired comparison: comparing this level with that one, and that with the next, to see which pairs are similar and which are different. From the dot plots you can see that these two levels are very close, while 5 and 20 are well separated. Why am I doing this? Because I want to find the level I should freeze to get the maximum CTQ, the optimal level. To decide where to set it, 5, 10, 15 or 20 percent, I need to know which level is different from which: whether 20 is different from 15, 15 from 10, 10 from 5, and so on. I will do this in Minitab. ANOVA has told us that at least two levels are significantly different; now let us figure out which ones. We are checking for which pairs mu i is not equal to mu j. When I run Tukey's multiple comparison test, this information comes out as letter codes, written A, B, C and so on, and we are only interested in those letter codes. Let us see how they are read: levels that do not share a letter are significantly different.
Here 20 percent gets letter code A, 5 percent gets letter code C, and the two levels in between, 15 and 10 percent, both get letter code B. Sharing a letter means there is no significant difference between 10 percent and 15 percent. But 20 percent, with letter A, is significantly different from 10 and 15: its mean is statistically different from that of any of the other three levels. The level with letter C, 5 percent, is also significantly different from 10 and 15. It has the lowest mean, and since we want to maximize, we have to move toward the other side, toward higher concentration. So the test tells us which mean is different from which. This paired comparison is Tukey's multiple comparison test. How do we do it in Minitab? Because we found a significant difference between at least two levels, to understand which level differs from which I go to Stat, ANOVA, and again use One-Way. Now there is a Comparisons button. Everything else you keep at the default. In Options I assume equal variances and then run the test, and we get the analysis. The box plot is also shown here. It indicates how the values change as hardwood concentration increases. What you see is that 10 and 15 percent have more or less overlapping distributions.
There is a clear slope when I compare 5 with 10 and 15, and 20 also shows a steep rise compared with 10 and 15. 5 is the lowest and 20 is the highest. The box plot gives you that indication. Then, for the multiple comparison, I go to Stat, ANOVA, One-Way, and click Comparisons. I select Tukey's test and no other; other tests can be used, but I am using only Tukey's, which is a very robust test. The grouping information is what matters to us, so we keep it selected, click OK, and do not change any other default. Under Graphs you can ask for the box plot or skip it, since we have already seen it. When I click OK, I get the grouping information with letter codes. Scroll down, copy it, paste it, and enlarge it, and you see the same results as on the slide: 20 has letter code A, 15 and 10 have letter code B, and 5 has letter code C. The letter A does not match the letter of any other level, so the 20 percent level has a significantly higher mean than 5, 10 and 15. The two levels with letter B, 15 and 10, give statistically the same mean.
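The grouping above can be reproduced approximately by hand. The sketch below computes Tukey's honestly significant difference and assigns letters with a simple greedy rule. The data are the assumed textbook values, 3.96 is the tabulated studentized range value for 4 means and 20 error degrees of freedom at alpha 0.05, and the letter assignment is a simplification that gives each level one letter, whereas Minitab can show overlapping letters.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# ASSUMED data (textbook hardwood concentration values, not from the transcript).
data = {
    5:  [7, 8, 15, 11, 9, 10],
    10: [12, 17, 13, 18, 19, 15],
    15: [14, 18, 19, 17, 16, 18],
    20: [19, 25, 22, 23, 18, 20],
}

n = 6                                            # replicates per level
means = {lvl: mean(ys) for lvl, ys in data.items()}
ss_error = sum((y - means[lvl]) ** 2 for lvl, ys in data.items() for y in ys)
ms_error = ss_error / (len(data) * (n - 1))      # error df = a(n - 1) = 20

Q_CRIT = 3.96                                    # tabulated q(0.05; 4, 20)
hsd = Q_CRIT * sqrt(ms_error / n)                # smallest significant difference

significant = {(i, j): abs(means[i] - means[j]) > hsd
               for i, j in combinations(sorted(means), 2)}

def differs(i, j):
    return significant[(min(i, j), max(i, j))]

# Greedy letters: walk down from the highest mean and start a new letter
# whenever a level differs from the first member of the current group.
letters, group_head, code = {}, None, ord("A") - 1
for lvl in sorted(means, key=means.get, reverse=True):
    if group_head is None or differs(group_head, lvl):
        group_head, code = lvl, code + 1
    letters[lvl] = chr(code)

print(f"HSD = {hsd:.2f}")
for lvl in sorted(letters, key=means.get, reverse=True):
    print(f"{lvl:2d}%  mean = {means[lvl]:5.2f}  group {letters[lvl]}")
```

The gap between 20 and 15 percent is only slightly larger than the HSD, so that pair is close to the boundary, while 10 and 15 percent fall well inside it and share a letter.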
So if I have to freeze a level, I will go with 20 percent hardwood concentration, because it gives a significantly higher tensile strength. But suppose the letter codes had come out differently. Say 20 gets letter code A, 15 also gets A, and the other two get B. Both 15 and 20 then share the letter A, which means there is no statistical difference between 15 and 20 percent. In that case I will select the level with the lowest cost. If 15 percent hardwood concentration is the cheaper setting, I will freeze at 15, because there is no significant difference at the population level; remember that hypothesis testing draws conclusions about the population from sample information. In our actual results there is a statistical difference: 20 is A, 15 and 10 are B, and 5 is C. A is significantly different, so we should freeze at 20 percent, which is the optimal level assuming this is the only factor. Otherwise, when the top levels share a letter code, whether I freeze at 20 or 15 does not matter statistically, and what matters is which setting costs less. So we go with the lowest-cost setting among the statistically equivalent best levels, and that gives the overall optimal scenario.
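The selection rule described above can be written as a small function. The level means are rounded values derived from the assumed textbook data, the cost figures are entirely hypothetical, and the function name is introduced only for this illustration. It also assumes each level carries a single letter code.

```python
def choose_level(means, letters, cost):
    """Pick the cheapest level among those sharing the letter of the best mean."""
    best = max(means, key=means.get)
    candidates = [lvl for lvl in means if letters[lvl] == letters[best]]
    return min(candidates, key=cost.get)

# Level means (rounded, from the ASSUMED data) for a maximize-the-response problem.
means = {5: 10.00, 10: 15.67, 15: 17.00, 20: 21.17}

# HYPOTHETICAL relative cost of running at each concentration.
cost = {5: 1.0, 10: 1.2, 15: 1.4, 20: 1.6}

# Letter codes reported in the session: 20 stands alone in group A.
observed_letters = {20: "A", 15: "B", 10: "B", 5: "C"}

# The what-if scenario from the session: 15 and 20 share letter A.
what_if_letters = {20: "A", 15: "A", 10: "B", 5: "B"}

print("observed grouping ->", choose_level(means, observed_letters, cost), "percent")
print("what-if grouping  ->", choose_level(means, what_if_letters, cost), "percent")
```

With the observed grouping the choice is 20 percent regardless of cost, and in the what-if scenario the cheaper of the two equivalent levels, 15 percent, is selected.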
We will stop here. Next we will continue with the assumptions of one-way analysis of variance and discuss some more cases before moving on to experiments with more than one factor, which we will discuss later. We will figure out what else we need to check while doing one-way analysis of variance. Thank you for listening. We will continue in the next session, starting again with one-way analysis of variance and the model adequacy check. Thank you.