Welcome back. We will now be starting with the second part of our course on statistics for experimentalists. In this lecture, we will be looking at a relatively simple situation, where experiments are carried out by changing only one variable, or only one factor. Usually experimentalists vary more than one factor. A simple example: we are interested in the yield or conversion from a chemical reaction, so we may vary temperature and pressure; or temperature, pressure and flow rate of the reactants; or temperature, pressure, flow rate and the catalyst involved; and so on. For the purpose of illustrating the basic concepts, we are going to consider the variation of a single factor only. The other factors are assumed to be kept at fixed or constant values; they are not being changed. The reference for this lecture is the book by Montgomery and Runger, Applied Statistics and Probability for Engineers, 5th edition, Wiley India. Let us look at the terminology first. A factor is a controlled variable whose effect on the outcome is being investigated. A level is the value that is assigned to the factor, and many levels of the same factor may be tested. Suppose we want to study the effect of temperature on the yield of a chemical reaction. The factor is temperature, and we vary this factor to see its effect on the yield. The levels of this factor can be different temperatures: 30 degrees centigrade, 50 degrees centigrade, 100 degrees centigrade and so on. So we can have several levels of the same factor. Now let us look at another important term, treatment. It is a somewhat unusual term in experiments, but we encounter it often, so it is better to define it. It is very simple, in fact: a treatment is each level or setting of a factor, that is, the value taken by a factor when it is kept at a certain level. For our reactor example there may be a treatments, that is, a temperatures.
Many times we are not satisfied with doing the experiment only once. If we want to study the effect of temperature on the yield, we study various temperatures like 30 degrees, 40 degrees, 50 degrees, 100 degrees and so on; but then we have considered each level of the factor only once. That is not what I mean here. We want to repeat the experiment at the same treatment, the same level of a given factor, and see the effect of the repetition on the reproducibility of the response. That is, we want to carry out the experiment at the same temperature, let us say 50 degrees centigrade, repeat it several times, and see what the yield is each time. Repeats are also intuitively appealing, because if we get more or less the same response whenever we repeat the experiment at a given setting, then we are convinced that we have done the experiments properly, the equipment or the reactor is working properly, and we gain confidence in our results. Repeats are very important from a statistical point of view as well; repetition of experiments is essential. When there are a treatments and n repeats, we will have a total of a × n experiments. The next term we are going to define is the response: the outcome of the experiment for each treatment. What is the output from the reactor? What is the yield? That is what we call the response, and since there are several random factors that may influence the outcome of the experiment, the response is treated as a random variable. Normally we denote the response by y. We are going to concentrate on only one factor. The reason is that we want to establish the basic groundwork and introduce you to the concepts of variance, degrees of freedom, mean squares, the analysis of variance, the F test and the conclusions you draw after looking at the F statistic. This also involves hypothesis testing. We will cross that bridge when we come to it, so let us get on with this introduction.
The effect of changing the levels of one factor on the desired response is investigated; that is, we are looking at the effect of different treatments. There may be many settings or treatments of this factor, as well as many replicates or repeats for each treatment. Why does the experiment give different results even if we take all the precautions, keeping the factor level at pretty much a given value, while all other variables that may influence the experiment are well controlled? We are not varying them; we are making sure that the ambient conditions are not varying too much. Still we may get variation in the response. This is attributable to random errors. When we repeat the experiments we get variability in the response, and that may be attributed to random factors or random phenomena. So in order to get an idea about the experimental error we need repeats or replicates. Whenever we talk about experimental error we are not accusing the experimentalist of doing the experiments badly; despite his best efforts to maintain proper conditions there may be variation in the response. So we speak in a neutral sense whenever we refer to the experimental error in the data. When the level of a factor is changed, there is going to be a variation in the response, and we think that the variation is produced because of the change in the level of the factor, because of applying a new treatment. Let us look at crops being grown in a field where we want to test different fertilizers. In plot 1 we put fertilizer A and look at the yield. Then we apply fertilizer B, a new treatment, and look at the yield. If there is a difference in the yield, we think it is because of the change in the treatment, the change in the level of the factor. This is what we normally think; we do not consider that other factors could have caused the difference in the yield.
But the farmer, or the person doing this investigation, may firmly state: look, I varied only the fertilizer; the type of soil, the amount of watering, the length of watering and all other factors were unchanged. The only factor that changed was the type of fertilizer. Even then we have to be careful. We have to see whether the variation in the crop production, the variation in the reactor yield, the variation in the response generally, was due to changing the treatment or changing the level of a factor, or was because of random effects, which were not in our control, affecting the experiment and producing a variation in the response. The extent of this response change may differ. If there is a large change in the response of the experiment, then we think it is because of the treatment change. But sometimes there may be only a medium or small change when you change the level of the factor or change the treatment, and then you do not know whether the response changed because of the treatment or because of random effects. So we need to quantify this, so that the results may be presented in an unambiguous fashion. For that we are going to look at variance. Whenever we do repeats of experiments we look at the mean outcome, the mean yield or the mean crop production. But it is not only the mean that is important; in addition to the mean or average we also have to look at the variance. So whatever we studied in the first part of the course is becoming very relevant now. The variance is a very important quantity (let me not use the word factor, because we are already using that word for the controlled variable). Variance can have an important influence on the interpretation of the data. Let us see how this happens.
What we are going to do is compare the variation due to change in treatments with the variation due to repeats. As I said earlier, repeats are representative of the random phenomena: whenever we repeat the experiments we may get different results, and that variation is representative of the experimental errors that influence the process, over which the experimenter usually has no control. Of course he can change the level of the factor; he can go from fertilizer A to fertilizer B, or from 30 degrees centigrade to 50 degrees centigrade. So he has control over the variable or factor he is actually changing, and he can maintain it at a constant value. So we have change in treatment and we also have random errors, and we have to compare the two: the variability produced by the random error versus the variability produced by the change in treatments. In other words, we have to compare the variation between treatments to the variation within treatments. Let us look at the table of experimental data collection. We have the a treatments in this column, going from 1 up to a, and for the first treatment we have carried out n repeats. So you can see that we go from y11, y12 and so on to y1n, with the first 1 standing for the first treatment and 1, 2, 3 up to n standing for the repeats. We denote the experimental outcome as yij: the response is yij, where i is the index for the treatment and j is the index for the repeat. The treatments vary row wise, so i runs from 1, 2 up to a, whereas j runs from 1, 2 up to n. So in total we have a × n runs. All these runs are recorded as responses, and we have a × n elements in the table. Now we can total them: for a given treatment we add all the responses over the n repeats and we get y1 dot. We are fixing 1, which is the treatment, and the dot represents the summation.
So instead of writing sigma over j equals 1 to n of y1j, we write y1 dot; and when you add up the n responses for a given treatment and divide by the number of repeats, y1 dot by n, you get the average response for treatment 1, which is represented by y bar 1 dot. The bar represents the averaging. Similarly you can do this for the second treatment, and so on up to the ath treatment. So you will get y1 dot, y2 dot and so on up to ya dot, and the averages are denoted y bar 1 dot, y bar 2 dot and so on up to y bar a dot. Just as you did the totaling and averaging row wise, you may also do it column wise, although normally the row wise totals and averages are the ones used. So what I have done here is denote the totals: you have y1 dot, and when you go row wise for the second treatment and add all the n repeats you get y2 dot, because treatment 2 is fixed; and so all these responses are put in the appropriate terminology. Again, I can sum the values for the first repeat, summing over all the treatments, and write it as y dot 1. Similarly, for the nth repeat I total the responses over all the treatments and get y dot n. When you add all the responses you get the grand total y dot dot, and when you divide it by the total number of observations, which is a × n, the number of treatments times the number of repeats, you get y bar dot dot, which is the global average or grand average. The same thing I have put in this table, showing the averages: when I consider the first repeat and add the responses over the a treatments I get y dot 1, and when I average by dividing by the total number of treatments I get y bar dot 1.
So adding all these elements I get y dot 1, and y dot 1 divided by a gives me y bar dot 1. Similarly I can do the averaging for the other columns, and the global average is y bar dot dot. This is the terminology which I was explaining a couple of slides back. You are adding over the index j running from 1 to n while i is kept constant, so you write yi dot; dividing that sum by the number of repeats n gives y bar i dot, with i obviously running from 1 to a. Here j represents the repeats and i represents the treatments, and when I add up all the responses over all the treatments and all the repeats I get the grand total: the sum over i running from 1 to a and j running from 1 to n of yij equals y dot dot. The convention is to put the i index first and the j index next. Similarly I find the grand mean: the grand total divided by the total number of observations, a × n, gives y bar dot dot. This notation is usually found in statistical design of experiments textbooks, so it is important that we become comfortable and familiar with the dot notation. The total number of observations is the product of the a treatments and the n repeats per treatment, and the dot represents summation over the index it replaces: when we write yi dot, the dot replaces the summation over j. Now let us look at the experimental response. We want to model it. We are not going to do any complicated modeling; it is a simple linear model, but it carries a lot of punch, as we will see. yij, the response from the ith treatment and the jth repeat, is modeled as a sum of three terms: the first is the global average mu, then tau i is the effect of the ith treatment, and epsilon ij is the random error.
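Before moving on to the model, the dot-notation bookkeeping just described can be sketched in a few lines of Python. The 3 × 4 response matrix below is invented purely to illustrate the row totals yi dot, the treatment means y bar i dot, and the grand average y bar dot dot:

```python
import numpy as np

# Hypothetical data: a = 3 treatments (rows), n = 4 repeats (columns).
y = np.array([[28.0, 30.0, 29.0, 31.0],
              [35.0, 33.0, 34.0, 36.0],
              [30.0, 29.0, 31.0, 30.0]])
a, n = y.shape

y_i_dot = y.sum(axis=1)              # row totals y_i. (dot replaces the sum over j)
y_bar_i_dot = y_i_dot / n            # treatment averages y-bar_i.
y_dot_j = y.sum(axis=0)              # column totals y_.j (sum over treatments)
y_dot_dot = y.sum()                  # grand total y..
y_bar_dot_dot = y_dot_dot / (a * n)  # grand average y-bar..

print(y_i_dot)        # [118. 138. 120.]
print(y_bar_i_dot)    # [29.5 34.5 30. ]
print(y_bar_dot_dot)  # 31.33...
```

The same matrix is reused in the later sketches, so the sums of squares computed there can be traced back to these totals.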
It is interesting to see the different symbols. Mu has no subscript because it stands for the global average or mean response; tau i is the ith treatment effect and carries the index i corresponding to the treatment; and epsilon ij carries the indices corresponding to both the treatment and the repeat. We may write mu plus tau i as mu i, so that yij equals mu i plus epsilon ij. This is a simple linear model. We have not put in a nonlinear model, for example yij equals mu times sine of tau i raised to the power epsilon ij, some highly complicated model which we would find very difficult to work with. We have only a simple linear model, and we are talking about the effect of only one factor. So we have tau i, the representation of the single factor we are analyzing; tau perhaps stands for temperature or fertilizer. Tau can have different levels: temperature can take the values 30, 50, 80, 100 degrees centigrade, and fertilizer can be fertilizer A, fertilizer B, fertilizer C and so on. Since we have only one factor, we put only one tau i. If you are considering two factors, this linear model is simply extended: we can write tau i plus beta j, with epsilon ijk, because we now have a combination of two factors indexed by i and j, and k becomes the index for the repeats. We will be seeing two factors, and even more factors, shortly, so we do not have to worry about that now; let us focus on a single factor. Essentially, mu would be the response yij every time if the factor had no effect and there were no random fluctuations: we would get a unique value from our experiment when the treatments are not effective and random errors are absent. The next possibility is that random errors are present but the treatment effects are not; then this value mu gets spread out because of the random factors or random effects.
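The single-factor model yij = mu + tau i + epsilon ij can be simulated in a few lines; a minimal sketch, in which mu, the tau i values and sigma are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

mu = 30.0                          # assumed global mean
tau = np.array([-1.5, 3.0, -1.5])  # assumed treatment effects tau_i (chosen to sum to 0)
a, n = len(tau), 4                 # a treatments, n repeats each
sigma = 1.0                        # error standard deviation

# epsilon_ij ~ N(0, sigma^2): zero mean, the SAME variance for every treatment
eps = rng.normal(0.0, sigma, size=(a, n))

# y_ij = mu + tau_i + epsilon_ij -> each row is spread around mu_i = mu + tau_i
y = mu + tau[:, None] + eps

print(y.mean(axis=1))  # sample treatment means, scattered around 28.5, 33.0, 28.5
```

Each row of `y` is a normal distribution centered at mu i = mu + tau i, all with the same spread sigma squared, which is exactly the picture in the figure discussed below.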
The other possibility is that both of them are present: the treatment has an effect and the error has an effect. So we are now considering the variability given to a global response mu by the treatment as well as by the random noise or random errors. If the treatment has an effect on mu, then mu changes because of the treatment and takes a unique value mu i corresponding to the ith treatment. Remember we can apply a levels of the treatment, so depending on which treatment has been given, mu changes to mu i. A very interesting figure awaits us. Here you have mu i and mu j: one curve is the response spread produced by applying the ith treatment, the other is the response spread produced by applying the jth treatment. The center value of the first is mu i, defined as mu plus tau i. If tau i is 0, then mu i becomes mu; if tau j is 0, there is no effect of the jth treatment and mu j becomes mu. In that case both centers would sit at mu. But if tau i is effective, mu i will be different from mu j. And the very interesting thing is that all of this spread is due to the variance sigma squared. We assume that this variance comes from the random effects, the random fluctuating components which are not in our control, and that the variance of the errors is constant: the errors are assumed to have 0 mean and constant variance. So the net effect of all these errors on the response is to produce a spread around mu i and around mu j with the same constant variance sigma squared. I request you to take a closer look at this figure and make sure that you have understood the concepts. Continuing with the modeling of the experimental response, we have mu, the overall mean, a parameter common to all the treatments. This would be the response we would get if there were no effect of the treatments and no random error fluctuations.
Every time we do the experiment, whether we put fertilizer A, fertilizer B or fertilizer C, the field produces 1 ton per annum of rice grains; or the reactor produces exactly 30% yield irrespective of whether you set the temperature at 30 degrees centigrade or at 100 degrees centigrade. That is the common uniform value if none of the treatments and none of the random fluctuations influence the process, which is obviously not going to happen. Mu i is defined as the ith treatment mean: what is the mean response for the ith treatment? When I am operating the reactor at 30 degrees centigrade, what is the percentage yield? That is modeled as an addition to the mean mu, which corresponds to the unique value unaffected by the treatment and by the random error. So we are assuming there is an addition to mu. Of course, in some cases the effect of a treatment, tau j for instance, may actually reduce the value of mu, so that mu j would be mu minus tau j; but in general we represent mu i or mu j as mu plus tau i or mu plus tau j. What I am trying to say is that tau i may be positive or negative. So we have tau i, which we call the effect of the ith treatment, and epsilon ij, the random error contribution, which is normally distributed with 0 mean and variance sigma squared; we have this nomenclature to represent the normal distribution with 0 mean and variance sigma squared. Now we come to the null and alternate hypothesis statements, which we studied very recently. We can now see the topics from the first part of the course, for example the normal distribution and hypothesis testing, all falling nicely into place in the design of experiments. The null hypothesis is that mu 1 equals mu 2 and so on up to mu a, all equal to mu. What is the meaning of this statement?
All the responses are equal to mu: whether I apply the first treatment, the second or the third, the first temperature, the second or the third, the output does not change. There is no change, there is status quo, there is no effect of the treatment. Whether I operate the reactor at 30 degrees centigrade or at 80 degrees centigrade, the yield does not change. That is a skeptical, neutral view, and so we say that the null hypothesis indicates that there is no effect of treatment. It is a safe view. Now, the alternate hypothesis stands in opposition to the null hypothesis. The alternate hypothesis tries to revolt against the status quo and says there will be a change upon application of the treatment; the alternate hypothesis is always rooting for the change. It says: there may be many treatments, and I agree that some treatments may not be effective, but there is at least one pair of means mu i and mu j which are not equal. If at least one mu i is not equal to another mu j, then there is at least one treatment which is effective and different from the others, so at least one of the tau i values is not equal to 0. Going back: if all the tau i values are 0, what happens? Mu i becomes equal to mu, since mu i equals mu plus tau i, with i running from 1, 2, 3 up to the a treatments. So when every tau i is 0, none of the treatments produces a change from the global response; that is the view taken by the null hypothesis. But the alternate hypothesis says that among the a treatments there is at least one which produces an effect different from all the others, even in the extreme case where all the other treatments produce no effect and only one treatment produces an effect.
So the number of treatments actually producing effects may vary: there may be one treatment different from all the others, or all the treatments may be different from each other, and hence all the mu i's may differ from each other and from the global value mu. Essentially, the response yij is a combination of the treatment effect plus the random fluctuation effect. Going back to the graph, which I like very much: if there were no noise, we would have got two unique values mu i and mu j. Each would have been a vertical line, a Dirac delta impulse; that means you would have got unique values mu i and mu j different from each other. However, the values are spread about mu i and mu j because of the random factors, the random error components with variance sigma squared, and that causes a spread in these deviations. The extent of the spread is the same in both cases. What I am trying to say is that the distributions around mu i and mu j are spread in an identical fashion; only the center of one distribution is mu i and the center of the other is mu j. The spread is the same in both cases because the error is assumed to be normally distributed with 0 mean and variance sigma squared, and so these are also normal distributions. When this error distribution is superimposed on each and every one of the treatment means, we get normal distributions centered at mu i, with i running from 1 up to a, each with the same constant variance sigma squared. Now we have to resolve the total sum of squares; how to get the total sum of squares I will tell you in a moment. We resolve the total sum of squares into the error sum of squares and the treatment sum of squares. Whenever we found a variance, what did we do?
We found the mean first, then we subtracted the mean from each of the numbers, squared these deviations, and divided the sum of squared deviations by n minus 1, where n is the number of data points. This gave us the variance. Exactly the same concept we are going to apply here, but with different types of sums of squares, as will become obvious in a moment. So we are essentially looking at the error sum of squares and the treatment sum of squares. The total sum of squares represents the deviation of each and every experimental response from the global average value, y bar dot dot. The global average is subtracted from each and every experimental observation and these deviations are squared. Obviously, if we do not square them, the sum of the deviations will be 0; but when we square them, the negative as well as the positive deviations become nonnegative, and hence their sum will usually not be 0. Miraculously, if all the observations exactly match the mean value, the sum of squares will be 0, but that is very unlikely. So, to emphasize the point: i runs from 1 to a, j runs from 1 to n, the index i standing for the a treatments and the index j standing for the n repeats, and we take the sum of the squared deviations to get the total sum of squares. Very interesting mathematical manipulations are possible here; unfortunately time does not permit us to get into all these nice derivations. To some people these derivations may look complex, but they are very elegant.
It is a pity that there is not enough time to go through the mathematical derivations, which would bring out the elegance and beauty of statistics in their full glory, but let us take the main results and move on. The total sum of squares, the sum over i equals 1 to a and j equals 1 to n of (yij minus y bar dot dot) squared, may be split into two components: n times the sum over i equals 1 to a of (y bar i dot minus y bar dot dot) squared, plus the double summation over i equals 1 to a and j equals 1 to n of (yij minus y bar i dot) squared. Before we go on, please look at this equation and see what the terms actually represent. If I am not distracted by the n and the summations, the y bar i dot in one term cancels with the y bar i dot in the other, so essentially I have yij minus y bar dot dot, which is the left hand side. You may argue that this simple cancellation is not legitimate because the terms are squared; indeed, without the double summation and the factor n, what we really do is add and subtract y bar i dot inside yij minus y bar dot dot and then expand. So you get the point. There is also another interesting interpretation: if you remember the Pythagoras theorem, the square of the hypotenuse equals the sum of the squares of the other two sides of a right angled triangle. The same concept applies here: the sum of squares may be resolved into two components, one due to the treatments and another due to the error. Looking at it closely, we are saying the same thing: the total sum of squares equals the sum of squares due to treatments plus the sum of squares due to error. Here yij minus y bar dot dot represents the deviation of the individual observation from the global mean; y bar i dot minus y bar dot dot represents the deviation of the treatment mean from the global average; and yij minus y bar i dot is the deviation of the individual observation from the treatment mean.
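The decomposition of the total sum of squares into treatment and error components can be checked numerically. A minimal sketch, reusing the invented 3 × 4 response matrix from earlier:

```python
import numpy as np

# Invented data: a = 3 treatments (rows), n = 4 repeats (columns)
y = np.array([[28.0, 30.0, 29.0, 31.0],
              [35.0, 33.0, 34.0, 36.0],
              [30.0, 29.0, 31.0, 30.0]])
a, n = y.shape

y_bar_dot_dot = y.mean()      # grand average y-bar..
y_bar_i_dot = y.mean(axis=1)  # treatment means y-bar_i.

# Total SS: every observation against the grand average
ss_total = ((y - y_bar_dot_dot) ** 2).sum()
# Treatment SS: treatment means against the grand average, times n repeats
ss_treat = n * ((y_bar_i_dot - y_bar_dot_dot) ** 2).sum()
# Error SS: every observation against its own treatment mean
ss_error = ((y - y_bar_i_dot[:, None]) ** 2).sum()

print(ss_total, ss_treat + ss_error)  # the two agree, as the identity promises
```

The identity holds for any numbers, not just these; the squaring works out because the cross terms vanish when summed over the repeats.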
So we do n repeats for each treatment, we average the n repeats to get y bar i dot, the treatment mean, and we compare the treatment mean with the global average; that is the contribution of the treatment sum of squares. In the other term we are not considering the global average; we are comparing the individual response with the treatment mean. What we are doing is, for a given treatment, comparing each individual response for that treatment with the treatment average. If the error contribution were negligible or absent, then whenever we repeated the experiments we would have got the same response yij, in which case yij would have been the same as y bar i dot. But each repeat for a given treatment itself produces some variation, and that is how the error contribution comes in; we model the error contribution by this sum of squares. With that out of the way, we can now compare the treatment contribution with the error contribution: if the two are comparable, then we can say that the treatments are not really having any effect. But rather than looking at the total sum of squares and directly comparing the treatment sum of squares with the error sum of squares, we have to normalize each term, because each term in the sum of squares equation has different degrees of freedom. Let us now look at the degrees of freedom. You have a × n observations, but not all of them are independent. Of course all of them are important, but not all are independent, in the sense that if I add up the deviations yij minus y bar dot dot from the mean, the sum will be 0. So if I am calculating the global mean from the responses, then I need to know only an minus 1 of the yij values.
Knowing an minus 1 of the yij values and the global average, I can find out the remaining value; so we have a total of an minus 1 degrees of freedom. The same argument applies to the error sum of squares. Forget about the treatments for the time being: for a particular treatment we have found the treatment average based on the n repeats, so there are only n minus 1 independent entities; and since there are a treatments, the degrees of freedom are a(n minus 1). That is, there are a(n minus 1) independent entities in that expression. And since all the treatment means, when averaged, give you the global average, there are only a minus 1 independent treatment means. You can argue along those lines, or you can subtract the error degrees of freedom from the total degrees of freedom and you will get a minus 1; indeed the degrees of freedom for the treatment sum of squares, as we just saw, is a minus 1. Now we have to find the mean square treatments and the mean square error. The simple recipe is: the treatment sum of squares is divided by the treatment degrees of freedom, and the error sum of squares is divided by the error degrees of freedom; that gives the mean square treatments and the mean square error. Sum of squares of the treatments divided by a minus 1; sum of squares of the error divided by a(n minus 1). The expected values are pretty interesting: the expected value of the mean square treatments is sigma squared plus a contribution due to the treatments, while the expected mean square for the error is simply sigma squared. Again I am not going into the mathematical derivations; they are quite straightforward. If the treatments were ineffective, the tau i squared terms would all become 0 or close to 0, and we would have sigma squared again; the variance in the mean square treatments then becomes comparable to the variance of the error.
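The degrees of freedom bookkeeping and the mean squares can be sketched on the same invented numbers:

```python
import numpy as np

# Same invented 3x4 response matrix: a = 3 treatments, n = 4 repeats
y = np.array([[28.0, 30.0, 29.0, 31.0],
              [35.0, 33.0, 34.0, 36.0],
              [30.0, 29.0, 31.0, 30.0]])
a, n = y.shape

y_bar_i_dot = y.mean(axis=1)
ss_treat = n * ((y_bar_i_dot - y.mean()) ** 2).sum()
ss_error = ((y - y_bar_i_dot[:, None]) ** 2).sum()

df_treat = a - 1           # a - 1 independent treatment means
df_error = a * (n - 1)     # n - 1 independent repeats per treatment, a treatments
df_total = a * n - 1       # check: df_treat + df_error = an - 1

ms_treat = ss_treat / df_treat  # mean square treatments
ms_error = ss_error / df_error  # mean square error (unbiased estimate of sigma^2)
f0 = ms_treat / ms_error        # the ratio used in the F test below

print(ms_treat, ms_error, f0)
```

Note that dividing each sum of squares by its own degrees of freedom is exactly the normalization the lecture calls for before the two contributions are compared.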
Since the expected value of the mean square error equals the error variance sigma squared, we can say that the mean square error is an unbiased estimator of sigma squared. The mean square treatments would also be an unbiased estimator of sigma squared if the null hypothesis were true, that is, if all the treatment effects were negligible; then the expected mean square treatments equals sigma squared. If the null hypothesis were not true, the expected mean square treatments would exceed the expected mean square error: the effects due to the treatments start kicking in, and the expected mean square treatments becomes different from the expected mean square error. So what we do here is an F test on these two statistics. I request you to look again at the scope of the F test and what we were doing there: here we take the ratio of the mean square treatments to the mean square error, which we call F0. We are essentially looking at the ratio of two variances, which is precisely what the F test does. The mean square treatments and the mean square error will be comparable if the treatments have no effect, but the mean square treatments will be higher than the mean square error if at least one of the treatments makes a significant contribution. So we have to see whether they are really significant. We can then set up the analysis of variance table, where we list the treatments, the error and the total: the sum of squares due to treatments, the sum of squares due to error and the total sum of squares, with degrees of freedom a minus 1, a(n minus 1) and an minus 1 respectively. When we divide the sums of squares by their respective degrees of freedom, we get the mean square treatments and the mean square error, and we take the ratio of these two to find F0. We conclude at this point and will continue in the next lecture. Thank you for your attention.
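Putting the whole single-factor analysis together, here is a minimal sketch of the ANOVA computation on the same invented data, with the p-value taken from the upper tail of the F distribution; SciPy's built-in one-way ANOVA is used only as a cross-check:

```python
import numpy as np
from scipy import stats

# Invented data: a = 3 treatments (rows), n = 4 repeats (columns)
y = np.array([[28.0, 30.0, 29.0, 31.0],
              [35.0, 33.0, 34.0, 36.0],
              [30.0, 29.0, 31.0, 30.0]])
a, n = y.shape

y_bar_i_dot = y.mean(axis=1)
ss_treat = n * ((y_bar_i_dot - y.mean()) ** 2).sum()
ss_error = ((y - y_bar_i_dot[:, None]) ** 2).sum()

df_treat, df_error = a - 1, a * (n - 1)
f0 = (ss_treat / df_treat) / (ss_error / df_error)

# p-value: upper-tail area of the F distribution with (a-1, a(n-1)) df
p_value = stats.f.sf(f0, df_treat, df_error)

# Cross-check against SciPy's one-way ANOVA (each row is a treatment group)
f_ref, p_ref = stats.f_oneway(*y)
print(f0, p_value)  # a small p-value is evidence against the null hypothesis
```

If F0 lands far in the upper tail, we reject the null hypothesis that all treatment means are equal; the formal decision rule and significance level are taken up in the next lecture.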