Welcome to session 20 of our course on Quality Control and Improvement with Minitab. I am Professor Indrajit Mukherjee from the Shailesh J. Mehta School of Management, IIT Bombay. In the last session we talked about the two-sample t-test, how to use it, and in what scenarios; today we will extend that concept and see in what other scenarios such tests are conducted. This was the two-sample t-test example we dealt with: there are two catalysts, we want to improve the yield, and we want to see which catalyst is effective. What happened is that the p-value we obtained is more than 0.05, so both catalysts are giving the same mean yield; that is, the population means mu 1 and mu 2 for catalyst A and catalyst B cannot be declared different. So I cannot reject the null hypothesis, and no improvement has happened with the different catalyst compared to the existing one. This is basically a one-factor, two-level experiment. What is a one-factor, two-level experiment? The factor here is the catalyst, and there are two levels, level A and level B. So with one factor, catalyst A and catalyst B, the experiment was conducted this way and we wanted to see the effectiveness of the catalyst. The two-sample t-test is the starting point of experimentation. But in case the distribution assumption fails for catalyst A and catalyst B, what do we have to do? We can check that next.
So, I will show you the non-parametric option for this. For the two-sample t-test we verified that everything was working, but suppose the distribution assumption fails; what is to be done then? We open the two-sample t-test file with catalyst A and catalyst B, and we want to see the difference between these two. There is a non-parametric option to check this: the Mann-Whitney test, where the medians are compared. Sample one will be catalyst A and sample two catalyst B, with a 95 percent confidence level and a two-sided (not-equal) alternative. I click OK, and the W statistic and p-value indicate whether a significant difference exists between the two levels, catalyst A and catalyst B, based on the medians. In this case we do not see any difference, so the non-parametric test also confirms there is no difference between catalysts A and B, and we can move ahead. Now, one important thing we also need to consider when doing experimentation with two-sample t-tests: we said that independence of the samples is required. But scenarios exist where the two observations come from one single sample. That is the case in this next example, where the cholesterol level of each subject was recorded: participants in a study evaluating the effect of exercise had their cholesterol measured before and after the exercise program.
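The Mann-Whitney comparison that Minitab performs here can also be sketched in Python with SciPy. A minimal example; the yield values below are hypothetical, not the course worksheet data:

```python
# Hypothetical yield readings for two catalysts (illustrative values,
# not the course worksheet data).
from scipy.stats import mannwhitneyu

catalyst_a = [89.7, 89.9, 90.0, 90.2, 90.5, 90.8, 91.1, 91.6]
catalyst_b = [89.8, 90.1, 90.3, 90.4, 90.6, 90.9, 91.0, 91.2]

# Two-sided Mann-Whitney test: compares the samples through ranks/medians,
# with no normality assumption on either sample.
stat, p = mannwhitneyu(catalyst_a, catalyst_b, alternative="two-sided")
print(f"W = {stat}, p = {p:.3f}")
if p > 0.05:
    print("Fail to reject H0: no significant difference between catalysts")
```

As in the Minitab output, a p-value above 0.05 means we cannot reject the null hypothesis of equal medians.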
So, we want to see whether exercise has a positive impact in lowering cholesterol levels. Total cholesterol was measured in each subject before, and again three months after, participation. Because the before and after readings come from the same subject, the data are expected to be correlated; this is paired data from a single sample. When the data are correlated like this, the two-sample t-test is not the appropriate test; the paired t-test is generally recommended where the data are significantly correlated. How do we see that? We go to the data set: the before and after readings are in columns C2 and C3, and these are the two data sets whose correlation we want to check. Earlier we checked independence; here also we go to Basic Statistics and then Correlation, and we expect a high amount of correlation between the data. In the results options we ask for the pairwise correlation table, because the p-value is required. When we do this, the r value is around 0.7, and a p-value is reported. To enlarge this, I copy it as an image and place it in an Excel sheet, and then we can see whether a fair amount of correlation exists or not.
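The same correlation check can be sketched with SciPy's `pearsonr`. The before/after values below are illustrative stand-ins, not the course data set; with numbers of this kind the correlation comes out around 0.7 with a small p-value, matching the pattern described:

```python
# Hypothetical before/after total-cholesterol readings for 15 subjects
# (illustrative values, not the course data set).
from scipy.stats import pearsonr

before = [265, 240, 258, 295, 251, 245, 287, 314, 260, 279, 283, 240, 238, 225, 247]
after  = [229, 231, 227, 240, 238, 241, 234, 256, 247, 239, 246, 218, 219, 226, 233]

r, p = pearsonr(before, after)
print(f"r = {r:.3f}, p = {p:.3f}")
# A significant positive r means the two columns move together: they are
# paired, not independent, so a paired t-test is the appropriate test.
```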
So, I will just enhance the image so that you can see it, and the corresponding p-value is 0.003. It says that sample 1 and sample 2, the before and after readings, are highly correlated. In this case statisticians suggest that we cannot go for the two-sample t-test; the most appropriate test here is the paired t-test, which checks whether there is any difference between the two conditions, before and after. What is suggested is that we calculate the difference for each pair, and based on the average difference and the standard deviation of the differences, a statistic is computed. This is the formulation you see here: D is the difference between before and after; the average of the differences is taken and divided by the standard deviation of the differences over the square root of the number of paired observations, which is 15 here. Based on this, the calculated t statistic and the corresponding p-value are obtained. We have ensured that there is a high amount of correlation, so in that case what is to be done is that I will go for the paired t-test. The difference can also be calculated directly: go to Calculator, enter before minus after, and save the result in column C4.
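The paired-t statistic described here, t0 = d-bar / (s_d / sqrt(n)), can be computed by hand from the differences. A small stdlib-only sketch, using illustrative difference values (not the course data):

```python
# Paired-t statistic from the differences d_i = before_i - after_i:
#     t0 = dbar / (s_d / sqrt(n)),  with n - 1 degrees of freedom.
# The 15 differences below are illustrative values, not the course data.
import math
import statistics

d = [36, 9, 31, 55, 13, 4, 53, 58, 13, 40, 37, 22, 19, -1, 14]
n = len(d)
dbar = statistics.mean(d)          # average paired difference
s_d = statistics.stdev(d)          # sample standard deviation of differences
t0 = dbar / (s_d / math.sqrt(n))   # paired-t statistic, df = n - 1 = 14
print(f"dbar = {dbar:.2f}, s_d = {s_d:.2f}, t0 = {t0:.2f}")
```

A t0 well above the tabulated critical value (about 2.145 for a two-sided test at alpha = 0.05 with 14 degrees of freedom) corresponds to a p-value below 0.05.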
So, that is where I will save it; I click OK and obtain the differences. The assumption of the paired t-test is that these differences should follow a normal distribution, and that can be verified with Basic Statistics and the normality test. C4 is selected, we click OK, and we get a p-value of 0.404, which satisfies our condition: the differences follow a normal distribution, so I can go for the paired t-test. Immediately, I go to Basic Statistics and the Paired t option, and choose "Each sample is in a column"; alternatively, summarized data on the differences can be given. Here I already have the before and after observations. In Options I do not change anything: I am testing whether the difference equals 0 against a two-sided (not-equal) alternative. In Graphs, if you want to see a boxplot of the differences, you can select it. Click OK, and you get all the results: the mean and standard deviation before and after, the confidence interval for the estimated mean paired difference, and the key test statistics, which I can copy and paste for viewing. I paste it and enhance the image. The t-test says that the p-value is approximately 0, up to three decimal places.
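The same two steps, a normality check on the differences followed by the paired t-test, can be sketched with SciPy. The before/after values are illustrative, not the course worksheet:

```python
# Hypothetical before/after cholesterol readings (illustrative values):
# first check normality of the differences, then run the paired t-test.
from scipy.stats import shapiro, ttest_rel

before = [265, 240, 258, 295, 251, 245, 287, 314, 260, 279, 283, 240, 238, 225, 247]
after  = [229, 231, 227, 240, 238, 241, 234, 256, 247, 239, 246, 218, 219, 226, 233]
diff = [b - a for b, a in zip(before, after)]

w, p_norm = shapiro(diff)        # normality check on the paired differences
t, p = ttest_rel(before, after)  # two-sided paired t-test, H0: mean diff = 0
print(f"Shapiro p = {p_norm:.3f}; paired t = {t:.2f}, p = {p:.5f}")
```

If the Shapiro p-value exceeds 0.05, the normality assumption is tenable and the paired-t p-value can be interpreted directly.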
So, in this case we can conclude that there is a significant difference between these two data sets, before and after; the reduction is significant. The means are also given: 261 before and 234 after, so basically the cholesterol has reduced with the exercise and dieting that were followed. And this is accurate, because we have used the paired t-test. We could also do a two-sample t-test, but its accuracy decreases when the correlation is high; we would lose some information. So what is suggested, and what has been shown statistically, is that the paired t-test is more effective here, and we go for it rather than the two-sample t-test. But if there is not much correlation, I will not go for the paired t-test; I will go for the two-sample t-test. There is another example: moisture meter readings with a new instrument versus an old instrument. I want to see whether there is a difference, because I want to implement the new instrument, which takes a reading very quickly; the old method may take 5-6 hours, whereas with the new one we get the reading in minutes. But are the readings accurate? The old method was taken to be accurate; that was the condition here, and the new readings were taken for the same samples. So, given the old and new readings, I want to see whether there is a significant difference between them. First I will go to Basic Statistics and check whether correlation exists between these two variables.
So, I go to Correlation and check the new-instrument and old-instrument readings, and the r value is approximately 0.889. If we enhance this image, you will find there is a high amount of correlation, 0.889, shown in the relationship diagram, and the corresponding p-value is around 0. I copy and paste this here for your convenience: you see that the p-value is approximately 0. So in this case we can say there is a high amount of correlation, and the suggestion is to go for the paired t-test instead of the two-sample t-test. So I go to Basic Statistics and then Paired t, enter the new-instrument and old-instrument columns, and in Options I keep the same settings; I want to see the differences. We can do a one-sided test also, but we are doing two-sided. In the output, the new instrument gives a reading of 38.43 on average, and the old instrument around 35, so there is a difference of around 3 units, and the paired t-test shows that it is significant. I copy this, paste it below, enhance it, and see: here also the p-value is less than 0.05. The conclusion we draw is that the two readings are different; the readings from the two instruments are quite different.
So, we will use the paired t-test whenever there is a high amount of correlation between the samples, meaning two readings are taken on the same samples. In manufacturing, for example, I may have hardness testing machines with different tips; to confirm whether different tips give different readings or the same reading, we also do a paired t-test, because the sample remains the same: on the same sample I am using two methods, and I want to compare whether one method gives a higher reading or both give the same hardness. For that also this type of analysis is required. So that is all about the paired t-test. The non-parametric option, when the normal distribution assumption on the differences is not satisfied, is the one-sample Wilcoxon test. In this case, let us say we take the difference D between the new and old readings; the calculator can be used: new instrument minus old instrument, saved in column C7. I click OK and the differences are recorded. Maybe the differences do not follow a normal distribution; here they will, since this is a classical example taken from a book, but in case the assumption fails, what you have to do is go to the non-parametric tests and, with these differences, run a one-sample Wilcoxon test. The hypothesis is that the median difference equals 0, and we have taken the two-sided (not-equal) alternative. From this you can get the p-value.
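The one-sample Wilcoxon test on the differences can be sketched with SciPy's `wilcoxon`, which applies the signed-rank test directly to a column of differences. The difference values below are hypothetical new-minus-old readings, illustrative only:

```python
# One-sample Wilcoxon signed-rank test on the paired differences
# (hypothetical new-minus-old instrument readings; illustrative only).
from scipy.stats import wilcoxon

diff = [3.1, 2.4, 4.0, 1.8, 3.6, 2.9, 3.3, 2.2, 4.5, 2.7, 3.8, 1.5]
# H0: the median difference is 0; two-sided alternative, no normality assumed.
stat, p = wilcoxon(diff, alternative="two-sided")
print(f"W = {stat}, p = {p:.4f}")
```

A p-value below 0.05 leads to the same conclusion as the paired t-test: the median difference is significantly different from zero.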
So, if I copy this image and paste it, what I get is the test of whether the median difference equals 0 or not, and the p-value is approximately 0.001, which is less than 0.05; that means the difference is quite significant. So that is the alternative we have in case you are unable to satisfy the assumption: we go for the non-parametric one-sample Wilcoxon test. That is the overall idea when I am experimenting with one factor at two levels. Now we will extend this concept to an important topic known as analysis of variance. For that, some understanding is required, and I am using the same diagram. These are the control factors, which are in my control while doing experimentation: I can change the levels of these factors x1, x2, up to xp. There can be p factors that can be controlled by the experimenter; these are known as controllable factors. There will be input conditions, or samples that go into the process; these parameters will be controlled; and there will be some noise variables, or uncontrollable inputs. In the presence of these, I need to determine the settings of the controllable factors so that I get the best CTQ output, close to target with minimum variability. For the two-sample t-test, what we did was assume that x1 is the only factor in the process and that it has only two levels, level 1 and level 2.
So, if x1 has more than two levels, say level 1, level 2, and level 3, one possibility is that I do individual assessments: whether level 1 and level 2 differ, whether the mean at level 2 differs from level 3, and similarly level 1 with level 3. I can do pairwise comparisons with two-sample t-tests and see which level differs from which. Why am I doing this? Because I want to see whether this factor x1 is influencing the CTQ. This is like screening: I want to screen the factors and do some preliminary analysis so that I understand whether this factor should be considered in full-fledged experimentation at a later stage. So I can experiment with more than two levels; it can be two levels, but also more, for a given factor. In this case, what is the best option? If I do all the pairwise comparisons and have to conclude based on them, the Type I error generally increases. The null hypothesis here is that all the means are the same, and the alternative is that some mean differs from another: mu i not equal to mu j for some i not equal to j. With three levels, three means are generated, and we ask whether the means of any two levels differ. When that is the concern, we go for one-way analysis of variance. Pairwise, I could compare mu 1 with mu 2, mu 2 with mu 3, and mu 1 with mu 3; that is what is shown here.
So, if there are three levels, there are basically 3C2 = 3 pairwise combinations we can do. What happens is that if I do these comparisons separately, the probability that the overall judgment is correct reduces, so the Type I error basically increases; in a statistics book you can see why. What is required is to keep the Type I error at the 0.05 we have assumed for the experimental hypothesis, so that my level of significance remains 0.05. For that, Fisher developed analysis of variance, where the level of significance remains 0.05 and I can make a conclusion that will be true 95 percent of the time. You can think of this as an extension of the two-sample t-test, but with more than two levels that we want to examine. There is only one factor, with different levels, and I want to check whether changing the levels impacts the mean CTQ or not; that is our objective here. I want to screen this factor and figure out whether, within the range of two or three levels of x1, the mean of the CTQ changes. That is our overall objective, and if it does change, then x1 is a critical variable that influences y, and we can take it forward to full-fledged experimentation, a full factorial or later response surface methodology; in that case this may be one of the factors screened in.
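Why the Type I error inflates can be seen with a quick calculation: if each of the 3C2 = 3 pairwise t-tests is run at alpha = 0.05 and the tests are treated as independent (an approximation), the chance of at least one false rejection is 1 - 0.95^3 ≈ 0.14, well above 0.05. A small sketch:

```python
# Family-wise Type I error across k independent pairwise tests, each run
# at alpha = 0.05: P(at least one false rejection) = 1 - (1 - alpha)^k.
alpha = 0.05
for k in (1, 3, 6, 10):
    fwe = 1 - (1 - alpha) ** k
    print(f"{k:2d} comparisons -> family-wise error ~ {fwe:.3f}")
```

This is the inflation that Fisher's single F-test in ANOVA avoids: one test, one alpha, for the hypothesis that all level means are equal.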
So, here the experimenter can do a simple experiment with a single factor and see whether it is influential. Although it is always suggested that we take all factors together and do the experiment, in initial studies we can screen factors this way and then go for a full factorial. So this is used for basic screening experimentation. But you have to remember that we cannot go for pairwise two-sample t-tests here; the t-test is not sufficient. We have to go for an F-test, which is given by analysis of variance, and consider that scenario. Let us take one example and try to understand the scenario before we go into the analysis; this is known as one-way analysis of variance. What we are doing is analyzing variation, but it is the difference between means that we are trying to report; the variance information is used for that, which is why it is called analysis of variance. This example is taken again from Montgomery's book: I want to maximize tensile strength, and I am changing the hardwood concentration, which is the only x factor I have, while tensile strength is the only y I am monitoring. The experimenters have decided that the hardwood concentration should be between 5 and 20 percent; that is the range of x we can take, and it can have more than two levels. The experimenter chose 5, 10, 15, and 20; these are the x levels you see, and the y observations can be written as y_ij, indexed by the i-th level and the j-th observation. In total there are a levels.
So, if you write the generalized form, y11 is the first observation, and there are n observations per level, n = 6 here, so the first row runs up to y1n, and there can be a levels. That is the general form of the data, and mathematically we can express it this way. What we are asking is: when I change the level from 5 to 10, 10 to 15, or 15 to 20, is there any significant difference happening in the CTQ y? If the mean changes significantly between any two levels, that means this factor is important and can be considered for further experimentation. But assuming there is no other prominent factor, I want to optimize the level, that is, find where the hardwood concentration should be kept so that I have maximum tensile strength, which is my objective here. This is the first step of experimentation, where we want to determine the condition under which the CTQ is maximized, because my target, tensile strength, is higher-the-better. The levels here are more than two: there are 4 levels, so a = 4; n = 6 is the subgroup size, so with 6 observations at each of the 4 levels there are 4 × 6 = 24 observations in total.
So, in this matrix you see 24 observations, all collected by randomization. What is randomization? A sample will be taken arbitrarily and run at, say, 5 percent hardwood concentration; there are 6 specimens at each level. The first specimen is run at 5 percent and the tensile strength observed, say 7 or 10. Then the run order is randomized: this data set is generated by randomization, meaning any of the samples and any of the levels can be selected for the next run. This is known as complete randomization. We do not fix 5 percent and complete all 6 observations in one go; all 6 samples at 5 percent should not be run together. Instead, we randomize the runs, for reasons we will understand afterwards. Levels are selected randomly, samples are selected randomly, and the samples are assumed to be homogeneous, so sample-to-sample variation is very small and only the variation due to the concentration is what we are considering. At each level we take 6 readings and compute an average: y1-bar for the first level, y2-bar for the second, and so on up to ya-bar, and then we can compute a grand average over all the observations.
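Complete randomization of the run order can be sketched in a few lines: build the full list of 4 × 6 = 24 runs and shuffle it, so that no level's replicates are run together in one block.

```python
# Complete randomization of the 4 x 6 = 24 runs: each run pairs a hardwood
# concentration level with a replicate number, and the run order is shuffled.
import random

levels = [5, 10, 15, 20]   # hardwood concentration, percent
replicates = 6
runs = [(lvl, rep) for lvl in levels for rep in range(1, replicates + 1)]
random.shuffle(runs)       # randomized run order, not level-by-level blocks
for order, (lvl, rep) in enumerate(runs, start=1):
    print(f"run {order:2d}: {lvl}% hardwood, replicate {rep}")
```

Running the levels in shuffled order spreads any drift in the process (ambient conditions, operator fatigue) across all levels instead of confounding it with one of them.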
So, this is the symbolic notation you will find in Montgomery's book; those are the symbols I have written, and you can use different symbols also. My overall objective is to determine the level, whether 5, 10, 15, or 20, at which the tensile strength is maximized. For that I have changed the levels intentionally; this is systematically induced variability that I have introduced. I have recorded the tensile strength, and based on these mean values I want to freeze a level where my tensile strength is maximized. How we do that is important for us, and we will discuss it. What Fisher has given is a table known as the ANOVA table, and Minitab will report this table, where you will see the source of variation, sum of squares, degrees of freedom, mean square, and the F statistic. This F statistic is important: it will be compared with a tabulated value, and if the calculated F is greater than the tabulated one, the p-value will come out less than 0.05. That is the process by which we make the interpretation. Here, "treatment" means changing the level, say from 5 percent to 10 percent; the total amount of variability that these changes create can be calculated and is known as SS treatment. Because of the change in treatments, from level 1 to level a, this is the overall variation that occurs, and we can think of it as a sum-of-squares measure.
So, the formulation of the sum-of-squares measures is given in any book: one for the treatments, when I change the levels, and one for the overall variation. Each observation differs from the overall average; if you take 7, say, with the overall average as the reference, and square that deviation, the sum of all such squares gives you the total sum of squares, SS total, while SS treatment is the variation due to treatment. The treatment will not explain all the variability, so SS total equals SS treatment plus some error, SS error. Fisher has given a way to calculate these: SS total is computed from the data matrix that was generated, SS treatment is computed from its formula, and SS error is SS total minus SS treatment; that is the formulation we can consider. Degrees of freedom are also reported: with a treatments, a - 1 is the treatment degrees of freedom; the total number of observations is 24, so the total degrees of freedom is an - 1 = 24 - 1; and if you subtract the treatment degrees of freedom from an - 1, what you get is the error degrees of freedom. Once the degrees of freedom are calculated, SS treatment divided by its degrees of freedom gives the mean square for treatment, MS treatment.
So, this has to be calculated; similarly, SS error divided by a(n - 1) gives MS error. Then these two are compared: the ratio MS treatment over MS error is taken, and this gives the F0 value. This F0 value has to be compared with the tabulated value, with a - 1 numerator degrees of freedom and a(n - 1) denominator degrees of freedom. F is tabulated at, let us say, the alpha level of significance, and F0 is compared with this tabulated value; if F0 is greater, the p-value will come out less than 0.05. We are going by the p-value method; we are not going by tables. The p-value will indicate whether to accept or reject the null hypothesis, which here is that the means at the different levels, 5 percent, 10 percent, 15 percent, and 20 percent, are all equal; the alternative hypothesis is that mu i is not equal to mu j for some i not equal to j. That is the condition I am checking, and you can see the formulation I have written. The comparison will be made, and if F0 is greater we reject the null hypothesis. That is the overall theoretical aspect covered here: how to collect the data, with randomization, and with n as the subgroup size, which is also known as the number of replicates. In experimentation this is an important thing: the more replicates I take, the more sure I will be of the final conclusion I am drawing.
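The whole ANOVA table described above can be reproduced directly from the sums of squares. A minimal Python/SciPy sketch, using the hardwood-concentration figures as commonly reproduced from Montgomery's example (treat the values as illustrative):

```python
# One-way ANOVA from the sums of squares, as in the ANOVA table:
#   SS_total     = sum_ij (y_ij - ybar..)^2
#   SS_treatment = n * sum_i (ybar_i. - ybar..)^2
#   SS_error     = SS_total - SS_treatment
#   F0 = MS_treatment / MS_error, with (a - 1, a(n - 1)) degrees of freedom.
# Data: hardwood-concentration example as commonly reproduced from
# Montgomery's textbook; treat the values as illustrative.
from scipy.stats import f as f_dist

groups = [
    [7, 8, 15, 11, 9, 10],     # 5 % hardwood
    [12, 17, 13, 18, 19, 15],  # 10 %
    [14, 18, 19, 17, 16, 18],  # 15 %
    [19, 25, 22, 23, 18, 20],  # 20 %
]
a, n = len(groups), len(groups[0])
grand = sum(y for g in groups for y in g) / (a * n)
ss_trt = n * sum((sum(g) / n - grand) ** 2 for g in groups)
ss_tot = sum((y - grand) ** 2 for g in groups for y in g)
ss_err = ss_tot - ss_trt
ms_trt = ss_trt / (a - 1)              # mean square for treatments
ms_err = ss_err / (a * (n - 1))        # mean square for error
f0 = ms_trt / ms_err
p = f_dist.sf(f0, a - 1, a * (n - 1))  # right-tail p-value of F0
print(f"SS_trt = {ss_trt:.2f}, SS_err = {ss_err:.2f}, F0 = {f0:.2f}, p = {p:.2e}")
```

With these figures F0 comes out near 19.6 with a p-value far below 0.05, so the null hypothesis of equal means across the four concentrations is rejected; as noted later, the F-test alone does not say which pair of levels differs.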
But the sample size will increase: here 24 experimental runs were carried out; if I had taken only two replicates, I would have had to do only 4 × 2 = 8 runs. So you have to find the optimal choice, how many replicates to use in the experiment, considering the cost. If you collect the data in this way, randomize the runs, and record the experimental results, the ANOVA analysis becomes easier, and it will indicate whether, when I change the levels, something significant happens: maybe the mean CTQ at 5 percent differs from that at 10, or 10 from 15, or 15 from 20. At least two levels differ such that, when I change from one to the other, the mean of the CTQ changes significantly, statistically significantly rather than "drastically": statistical significance exists between the two levels, so they are statistically different. That will be the conclusion. But ANOVA will not tell you which level differs from which; it will only say that some two levels are different. That is concluded from the F values or p-values you generate; Minitab will generate the p-values and based on those we will conclude. We will continue the discussion of ANOVA further, to understand more about ANOVA analysis in experimentation and how to implement it in Minitab, in our next session. So, thank you for listening; we will stop here and start from ANOVA next time. Thank you.