Assalamu alaikum. Welcome to lecture number 43 of the course on statistics and probability. Students, you will recall that in the last lecture I discussed with you the F distribution, and we also discussed the role of the F distribution in interval estimation and hypothesis testing. The interval estimation and hypothesis testing that we discussed last time was for comparing the population variances of two normal populations. Before that, as you will remember, we had compared two population means with each other, both through the t-test and through the z-statistic. But you will agree that there may be many situations where we are interested in testing the equality of more than two population means. If we wanted to apply the t-test in such a situation, we would have to apply it many times.
For example, suppose that we are interested in comparing the means of the populations of heights of the adult males of three different countries. You will agree that, since we are talking about heights, these are normal populations. So, if we wish, we can do this through the t-test. But students, if the populations are A, B and C, how many times will I have to apply the t-test? First I will compare mu A with mu B. Then I will compare mu A with mu C. And lastly I will have to compare mu B with mu C. But if there are four populations instead of three, the number of tests that I have to run becomes larger. Only four populations, and I have to do the test six times: A and B, A and C, A and D, B and C, B and D, and C and D. In the same way, if you increase the number of populations even a little, you will see that the number of t-tests you have to run grows much faster (in general, k populations require k times k minus 1 over 2 pairwise tests). And obviously that becomes quite cumbersome. So we need a technique by which we are able to compare the means of three, four, five or even more populations at the same time.
And you will be interested to know that the great statistician Sir R. A. Fisher, back in 1923, introduced this concept, which is called analysis of variance and which enables us to test the equality of several population means. As you now see on the slide, analysis of variance, abbreviated as ANOVA, is a procedure which enables us to test the hypothesis of the equality of several population means. That is, we are able to test the null hypothesis that mu 1 is equal to mu 2 is equal to mu 3 and so on up to mu k, against the alternative that not all the means are equal. Students, you will have noticed that the alternative hypothesis has been stated as: not all the means are equal. That is, we did not say that mu 1 is unequal to mu 2 is unequal to mu 3 is unequal to mu 4 and so on. We did not say that all of the means must be unequal to each other. We said that the null hypothesis claims that all of them are equal, and its alternative is that not all of them are equal. Suppose there are seven of them. It is possible that six of them are equal and only one is different. If even one is different, then the null hypothesis is violated. So we can say that not all the means are equal, or we can even say that at least two of the k population means are unequal.
The next point is that analysis of variance has its application in regression and also in experimental design. What I will be presenting to you is the utilization of the technique of analysis of variance in experimental design. So the first question is: what exactly do we mean by experimental design? As you now see on the screen, by an experimental design we mean a plan used to collect the data relevant to the problem under study in such a way as to provide a basis for valid and objective inference about the stated problem. The plan usually includes the selection of treatments whose effects are to be studied, the specification of the experimental layout and the assignment of treatments to the experimental units.
All these steps are accomplished before any experiment is actually performed. Students, perhaps you are thinking that some very complicated matters have begun. You will, inshallah, understand this concept as we discuss it through the example that I will be presenting in a short while. But before that, let us talk about the two or three terms that have been used. The gist of the whole matter is that whatever we obtain from the experiment should be reliable and proper. And for that, if we design our experiment appropriately, obviously that will be very good.
Experimental design is a very vast area. What I will be doing is a very basic introduction to this concept. There are two types of designs, the systematic designs and the randomized designs, but the analysis of variance technique that I will be discussing is applicable only to the randomized designs. Students, the basic randomized designs are the completely randomized and the randomized complete block designs, and I will be discussing these one by one. As you now see on the slide, the completely randomized or CR design is the simplest type of the basic designs, and it may be defined as a design in which the treatments are assigned to the experimental units completely at random, that is, the randomization is done without any restrictions. This design is applicable in those situations where the entire experimental material is homogeneous, that is, all the experimental units can be regarded as being similar to each other.
Let me illustrate this concept with the help of an example. An experiment was conducted to compare the yields of 3 varieties of potato. Each variety was assigned at random to equal-size plots 4 times. The yields came out as follows. For variety A the yields were 23, 26, 20 and 17. For variety B, 18, 28, 17 and 21, and for variety C, 16, 25, 12 and 14.
Test the hypothesis that the 3 varieties of potato are not different in their yielding capabilities. Students, let us try to understand this interesting problem. This is an example of the completely randomized design. What is the situation? The situation is that there are 3 varieties of potato and we want to compare their yields on the average. But students, we should not allocate variety one to just one farm, variety two to a second one and variety three to a third one. We want replication, that is, we want to repeat this process. As you saw in this example, variety one was assigned to 4 different farms, variety two was assigned to 4 different farms, and so was variety three. This is called the repetition of the basic experiment, or replication.
What is the purpose of this? You saw that variety one was assigned to 4 different farms, and the results were not identical. As you once again see on the screen, for variety A, the yield in the first farm was 23, but the yield in the second was 26. The one for the third was 20 and for the fourth it was only 17. So this is the point: you may not get the same result every time. This is a very, very important point in experimental design.
Students, in the completely randomized design we assume that the 12 farms to which we assigned variety A, variety B and variety C are all homogeneous: in terms of soil, fertility, rainfall and weather conditions, they are all alike. And to these farms we assign the three varieties, four times each, completely at random. This is the basic layout of a completely randomized design. In this, of course, our assumption is that the 12 farms are of equal size. Think of one large piece of land that you have cut into 12 equal parts, and in each part exactly one of the varieties A, B or C is sown. In this way, the first variety is sown four times, the second four times and the third four times.
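The layout just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `completely_randomized_layout` and the seed value are hypothetical, and Python's `random.shuffle` stands in for whatever randomization device is actually used.

```python
import random


def completely_randomized_layout(treatments, replications, seed=None):
    """Assign each treatment to `replications` plots, completely at random.

    Returns a dict mapping plot number (1, 2, ...) to a treatment label.
    The name and signature are illustrative, not from the lecture.
    """
    rng = random.Random(seed)
    # One label per plot: every treatment appears `replications` times.
    labels = [t for t in treatments for _ in range(replications)]
    # Unrestricted randomization: any plot may receive any label.
    rng.shuffle(labels)
    return {plot: label for plot, label in enumerate(labels, start=1)}


# Three potato varieties, each sown four times on 12 equal-size plots.
layout = completely_randomized_layout(["A", "B", "C"], replications=4, seed=43)
for plot, variety in layout.items():
    print(f"plot {plot:2d}: variety {variety}")
```

Whatever the seed, the result always has 12 plots with each variety appearing exactly four times; only the order changes.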
But, to repeat myself, the varieties have been allocated to the farms at random. You can actually use a random number table to decide which variety goes to the first farm, which to the second farm, and in this way you can go on. Now that we have discussed the point that this is a CR design, the next step of course is: how do we proceed with testing what we want to test? Students, the basic format of the hypothesis testing procedure is exactly the same as before: the formulation of the hypotheses, the level of significance, the test statistic, the computation of the test statistic, the critical region and the conclusion.
So, as you now see on the screen, the null hypothesis in this particular problem is that mu A is equal to mu B is equal to mu C, that is, the mean yields of varieties A, B and C are the same. The alternative hypothesis is that not all the three means are equal: maybe all of them are unequal, or at least two of them are unequal. The level of significance is 5 percent, and the test statistic is F, equal to the mean square for treatments divided by the mean square for error, and it can be mathematically proved that if H naught is true, then this statistic follows an F distribution with 3 minus 1 and 12 minus 3, that is, 2 and 9 degrees of freedom.
Students, where do 3 minus 1 and 12 minus 3 come from? You see, we have three varieties of potato. Technically speaking, we have three treatments. So 3 minus 1 represents the number of treatments minus 1, and this of course is nu 1, the first of the pair of degrees of freedom. And what about nu 2? As I said, 12 minus 3. We have three varieties A, B and C, each sown four times, so the total number of farms involved was 12, and what we subtract from 12 is 3, the number of varieties. Now, if k represents the number of treatments, i.e. the number of varieties, then nu 1 is equal to k minus 1, and if n represents the total number of observations, which is 12 in this case, then nu 2 is equal to n minus k, that is, 12 minus 3.
Students, what is the fourth step in any hypothesis testing procedure? Of course, it is the calculation of our statistic. This needs a detailed discussion, because in this particular procedure, in order to compute the F statistic, you first have to accomplish many steps, and you have to construct what is called the ANOVA table. The table that you now see on the screen is the ANOVA table in the case of the completely randomized design. Students, as you can see, this table consists of five columns, headed source of variation, degrees of freedom, sum of squares, mean square and F. Let me discuss these with you one by one.
In the first column, source of variation, you can see three entries: the variation between treatments, the variation within treatments, which is also denoted by error, and, in the last row of the table, the word total, which stands for the overall variation in our data set. If you have a look at the data values once again, you can see that they are different. We have 23, 26, 20, 17, 18, 28 and so on. If we consider these 12 values together, it is obvious that they are not all the same, and there is an overall variation in these values. This is the overall variation, and it is represented in the last row of the ANOVA table.
Students, the other two sources of variation that we have identified are the variation between varieties and the variation within varieties. If you have a look at the data values once again, as you now see on the screen, the values for variety A are 23, 26, 20 and 17, for B 18, 28, 17 and 21, and for C 16, 25, 12 and 14. So the point to understand is that there may be a variation between the three varieties.
If we find the mean of the values for variety A, the mean of the values for variety B and the mean of the values for variety C, then it is possible that X bar A, X bar B and X bar C may be quite different from each other, or at least one of them may be different from the other two. So this is what we mean by variation between treatments.
The other one is the variation within treatments. If you consider just variety A, you will see that the values are not the same. As I said earlier, the values are different. There is only one variety, and yet we are getting different values. So the variability in the yields of variety A can be called the variation within variety A. Similarly, the variability in the yields of variety B can be called the variation within variety B, and the variability in the yields of variety C can be called the variation within variety C. Now, we can say that the term variation within treatments stands for the combined effect of these three variations.
Students, you saw that we used the word error, and it is a very, very important and widely used term. Let us think about why we say error. We can understand it in this way: if we have the same variety, and the farms are the same size and homogeneous, then if we are sowing the same variety in the four farms, theoretically we should get exactly the same yield from all the farms. In practice, we find that in spite of all this control, the yields are different. So this means that there is some kind of an error: a difference from what should have been, namely some identical value for all the farms.
Having discussed the first column of the ANOVA table, students, let us now concentrate on the second, third and fourth columns. As you now see on the screen, the second, third and fourth columns of the ANOVA table are entitled degrees of freedom, sum of squares and mean square.
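To make the two sources of variation concrete, here is a small Python sketch using the yields quoted in the example. The variable names are illustrative, and the range (largest minus smallest yield) is used only as a quick informal indicator of within-variety spread; it is not the measure that the ANOVA table itself uses.

```python
from statistics import mean

# Yields from the lecture's potato example, one list per variety.
yields = {
    "A": [23, 26, 20, 17],
    "B": [18, 28, 17, 21],
    "C": [16, 25, 12, 14],
}

# Between-treatments view: compare the variety means with each other.
variety_means = {variety: mean(values) for variety, values in yields.items()}

# Within-treatments view: the same variety does not give the same yield
# on every farm. The range is a rough, informal indicator of that spread.
within_ranges = {variety: max(values) - min(values) for variety, values in yields.items()}

for variety in yields:
    print(f"variety {variety}: mean = {variety_means[variety]}, range = {within_ranges[variety]}")
```

The means come out as 21.5, 21 and 16.75, so they differ somewhat between varieties, while the ranges of 9, 11 and 13 show that yields also differ considerably within each variety.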
The point to be understood is that the various sources of variation that we have been discussing are measured by computing what is called the mean square. And as you now see on the slide, mean square can be defined as sum of squares divided by degrees of freedom. The variation between treatments will be measured by computing the mean square for treatments, which is given by the sum of squares for treatments over the degrees of freedom for treatments, and the mean square error is given by the sum of squares for error divided by the degrees of freedom for error. Now, it can be mathematically proved, students, that in analysis of variance pertaining to the completely randomized design, the degrees of freedom for treatments are k minus 1, i.e. the number of treatments minus 1, and the degrees of freedom for error are n minus k, the total number of observations minus the number of treatments. Therefore, as you now see on the slide, MS treatment is equal to SS treatment over k minus 1, and the mean square for error is equal to the sum of squares for error divided by n minus k.
Now, the question is: how do we compute the various sums of squares? As you now see on the slide, the total sum of squares, denoted by TSS, is equal to sigma sigma x i j square minus CF, where CF stands for correction factor. Also, the sum of squares for treatments, denoted by SST, is equal to the summation over j of t dot j square, this whole expression divided by r, minus the correction factor, and r in this formula denotes the number of data values per column, that is, the number of rows in this particular problem. Also, the sum of squares for error, denoted by SSE, is equal to the total sum of squares minus the sum of squares for treatments. Students, it looks very complicated, but when we go through all the steps one by one, inshallah you will find that it is not at all as difficult as it appears to be at this time. These are the terms, and we will be explaining them step by step.
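The formulas just stated can be written as one short Python function. This is a sketch for the equal-replication case only; the function name `anova_sums_of_squares` and the tiny data set in the usage lines are made up purely for illustration and are not from the lecture.

```python
def anova_sums_of_squares(columns):
    """Sums of squares for a completely randomized design with equal replication.

    `columns` is a list of lists, one inner list per treatment, all of the
    same length r. Returns CF, TSS, SST and SSE as defined in the lecture.
    """
    r = len(columns[0])                      # data values per column (rows)
    if any(len(col) != r for col in columns):
        raise ValueError("this sketch assumes equal replication")
    n = r * len(columns)                     # total number of observations
    grand_total = sum(sum(col) for col in columns)

    cf = grand_total ** 2 / n                                  # correction factor
    tss = sum(x * x for col in columns for x in col) - cf      # sigma sigma x_ij^2 - CF
    sst = sum(sum(col) ** 2 for col in columns) / r - cf       # sigma_j t.j^2 / r - CF
    sse = tss - sst                                            # by subtraction
    return {"CF": cf, "TSS": tss, "SST": sst, "SSE": sse}


# Made-up toy data: two treatments, three observations each.
toy = anova_sums_of_squares([[1, 2, 3], [2, 3, 4]])
print(toy)
```

For the toy data the column totals are 6 and 9, so CF is 15 squared over 6, that is 37.5, and the function returns TSS 5.5, SST 1.5 and SSE 4.0.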
At this time, just note that the example we are doing and the formulae we are presenting pertain to that particular case in which every variety, or generally speaking every treatment, is allocated to the experimental material an equal number of times. That is, as you saw in the example, variety 1 was sown 4 times, variety 2 also 4 times and variety 3 also 4 times. Otherwise, you can also have situations in the completely randomized design where the various treatments are not allocated an equal number of times. That is, it was also possible to sow variety 1 five times, variety 2 three times and variety 3 four times. However, to keep things simple in the first instance, we are discussing the situation where all of the varieties are sown an equal number of times, and all the formulae that you just saw pertain to this particular situation.
You will have noted that, unlike the total sum of squares and the sum of squares for treatments, the sum of squares for error is found very easily: it is simply the total sum of squares minus the sum of squares for treatments. The reason for this is that it can be mathematically proved that, very interestingly, the total sum of squares can be partitioned into two separate and distinct parts, the sum of squares for treatments and the sum of squares for error, so that SST plus SSE comes out to be equal to the total sum of squares, and therefore SSE is equal to the total sum of squares minus the sum of squares for treatments.
A very similar situation exists for the degrees of freedom, as you now see on the screen. It can be mathematically proved that the degrees of freedom for total can be partitioned into two distinct parts, the degrees of freedom for treatments and the degrees of freedom for error, so that our equation becomes: the total degrees of freedom are equal to the degrees of freedom for treatments plus the degrees of freedom for error.
An equation very, very similar to the one that we had for the sums of squares. Now, it can be shown that the degrees of freedom pertaining to total are n minus 1. So, as you can see, n minus 1 is equal to k minus 1 plus n minus k. So the equation is correct and, as I said just a short while ago, this equation is equivalent to saying that the total degrees of freedom are equal to the degrees of freedom for treatments plus the degrees of freedom for error.
Students, I have discussed a little bit the various formulae and the various equations that we have. Let us now apply all of these to our example. But I think before we do that, let us analyze the data that we had. As you now see on the slide, we had 3 varieties A, B and C, and the yields of the 3 varieties are 23, 26, 20, 17; 18, 28, 17, 21; and 16, 25, 12, 14. If you look at the quantities that you have in brackets next to these data values, students, they are simply the squares of all these data values. The square of 23 is 529, the square of 26 is 676 and so on.
Now, if you concentrate on the row underneath the data values, it is entitled t dot j. The first value in this row is 86, which can be called t dot 1, the second one, 84, can be called t dot 2, and the third one, 67, can be called t dot 3. By t dot j we mean the total of the jth column. So the total of the first column, 86, is t dot 1, the total of the second is t dot 2, and generally speaking we have t dot j. Students, this matter of i and j is rather important, and it is very necessary that you do not get confused here. You will already know from pure mathematics that if you have a bivariate table, generally i stands for the rows and j stands for the columns, that is, first row, second row, ith row, and in the same way first column, second column, jth column and so on.
Now, regarding the t dot j that I have just presented to you: the j you have hopefully understood, the total of the first column is t dot 1 and the total of the jth column is, generally speaking, t dot j. One question will have come to your mind: what is this dot? It is an interesting notation. You see, if we were talking about any single value in that table, we would have said x i j, that is, x i j would have been the value in the ith row and the jth column. But this time we have put a dot in place of i, t dot j, and the dot, students, represents the fact that the sum we are taking is over i. As you now see on the screen, if you concentrate on the first column, 23 plus 26 plus 20 plus 17 is equal to 86. This sum, students, is over the 4 rows that you have. In other words, the values of the first column are being summed over the rows: the value in the first row, 23, plus the value in the second row, 26, plus the value in the third row, 20, plus the value in the fourth row, 17. So t dot 1, 86, stands for the total of the values in the first column, and the dot denotes that we have summed over the rows.
Now, let us concentrate on the row that we have underneath t dot j. This one is called t dot j square, and it is simply the squares of the values that you have in the row above, that is, 86 square is equal to 7396, 84 square is 7056 and 67 square gives you 4489. Students, if you look at the value to the right of 67, we have 237, and if you look at the value under that, it is 18941. What are these values? These are the sums of t dot j and t dot j square respectively, that is, summation t dot j, the summation being taken over j, that is, over the columns, is equal to 237. Also, sigma j t dot j square is equal to 18941. We have now covered almost the whole of our computation table. The only two items that are left are the values that we have in the last column and the last row of this table.
If you have a look at the last column first, you find the expression sigma j x i j square on the top of that column, and students, if you add the numbers which are inside the brackets in the first row, that is, 529 plus 324 plus 256, you obtain 1109, the first entry of the last column. Now let us look once more at this notation, sigma j x i j square. You remember that inside the brackets we have written the squares of our values x i j, and now we are summing them row by row, but the summation is happening over the columns. The first sum is 1109, the second one is 2085, the third one 833 and the fourth one 926.
And students, if you look at the bottom row of this table, you have sigma i x i j square. This time we are summing the same quantities that are inside the brackets, but of course now we are coming downward and we are summing over the rows, that is, the summation is happening over i because, as you remember, i stands for the rows, first row, second row, ith row and so on. Doing that, the first sum of the squares of the data values comes out to be 1894, the one at the bottom of the first column. Similarly, the second one is 1838 and the third one 1221. When we add all three of these, the total is 4953, exactly the same as what we obtained when we added all four values of sigma j x i j square that occurred in the last column of our table. It is obvious that the final sum is going to be the same.
Now students, this is the quantity for which we have the notation double summation x i j square, because we are adding over i and over j. Whether we first add over j and then over i, or go in the other order, it is a case of double summation. So, as you once again see on the screen, sigma i sigma j x i j square is equal to 4953. Another very interesting notation is t dot dot. What is t dot dot? As I have just told you, we put a dot in place of the index over which we are summing. So two dots mean that we have now summed over both i and j.
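The computation table described above can be reproduced with a short Python sketch. The variable names are illustrative; rows are indexed by i and columns by j, as in the lecture, and the comments record the values quoted from the slide.

```python
# Rows i = 1..4 (replications), columns j = A, B, C (varieties).
x = [
    [23, 18, 16],
    [26, 28, 25],
    [20, 17, 12],
    [17, 21, 14],
]
n_cols = len(x[0])

# t.j : the dot replaces i, so each column is summed over the rows.
col_totals = [sum(row[j] for row in x) for j in range(n_cols)]       # 86, 84, 67
col_totals_sq = [t * t for t in col_totals]                          # 7396, 7056, 4489

# sigma_j x_ij^2 : squares summed across the columns, one value per row.
row_sq_sums = [sum(v * v for v in row) for row in x]                 # 1109, 2085, 833, 926

# sigma_i x_ij^2 : squares summed down the rows, one value per column.
col_sq_sums = [sum(row[j] ** 2 for row in x) for j in range(n_cols)] # 1894, 1838, 1221

grand_total = sum(col_totals)           # t.. = 237
sum_sq_col_totals = sum(col_totals_sq)  # sigma_j t.j^2 = 18941
double_sum_sq = sum(row_sq_sums)        # sigma_i sigma_j x_ij^2 = 4953

print(col_totals, col_totals_sq)
print(row_sq_sums, col_sq_sums)
print(grand_total, sum_sq_col_totals, double_sum_sq)
```

Summing `col_sq_sums` instead of `row_sq_sums` gives the same 4953, which is the point made above about the order of the double summation.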
So, this means that we are talking about the total of all the observations. First you sum over the rows and then you sum over the columns, and what you obtain is t dot dot, and as you now see on the screen, the sum of all the observations, that is, t dot dot, is equal to 237. Now that we have all the required quantities, we are in a position to compute the various sums of squares that we need in order to fill out our ANOVA table.
As you can see, the correction factor is equal to t dot dot square over n, and that is 237 whole square over 12, and that is 4680.75. Now, the total sum of squares is given by sigma sigma x i j square minus the correction factor. So, substituting the values, 4953 minus 4680.75, we obtain 272.25. Also, the sum of squares for treatments, that is, SST, is given by sigma j t dot j square, this whole expression divided by r, minus the correction factor, and r, the number of rows, is 4. Therefore, substituting 4 in place of r, 18941 in place of sigma t dot j square and 4680.75 in place of the correction factor, the sum of squares for treatments comes out to be 54.50. As stated earlier, the sum of squares for error is equal to the sum of squares for total minus the sum of squares for treatments, and that is equal to 217.75.
Substituting all these values in the ANOVA table, we obtain, as you now see on the slide, the degrees of freedom for treatments, 3 minus 1, that is 2, and the sum of squares, as we just found, 54.50. Similarly, the degrees of freedom for the total are 12 minus 1, that is 11, and the total sum of squares, as we just obtained, is 272.25. 11 minus 2 gives us 9 degrees of freedom for error, and 272.25 minus 54.50 gives us 217.75 as the error sum of squares. As explained earlier, the mean square for treatments is given by the sum of squares for treatments divided by the corresponding degrees of freedom. So, 54.50 divided by 2 is equal to 27.25.
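The arithmetic of this step can be checked with a few lines of Python. This sketch starts from the three summary quantities quoted in the lecture (237, 4953 and 18941) rather than from the raw data; the variable names are illustrative.

```python
n, k, r = 12, 3, 4            # observations, treatments, replications per treatment
grand_total = 237             # t..
sum_sq_obs = 4953             # sigma sigma x_ij^2
sum_sq_col_totals = 18941     # sigma_j t.j^2

cf = grand_total ** 2 / n             # correction factor
tss = sum_sq_obs - cf                 # total sum of squares
sst = sum_sq_col_totals / r - cf      # sum of squares for treatments
sse = tss - sst                       # sum of squares for error

df_treat, df_error, df_total = k - 1, n - k, n - 1
ms_treat = sst / df_treat             # mean square for treatments
ms_error = sse / df_error             # mean square for error

print(f"{'Source':<20}{'d.f.':>6}{'SS':>10}{'MS':>10}")
print(f"{'Between treatments':<20}{df_treat:>6}{sst:>10.2f}{ms_treat:>10.2f}")
print(f"{'Error':<20}{df_error:>6}{sse:>10.2f}{ms_error:>10.2f}")
print(f"{'Total':<20}{df_total:>6}{tss:>10.2f}")
```

The printed rows agree with the values in the lecture: 54.50 and 27.25 for treatments, 217.75 for error, 272.25 for the total, with 2, 9 and 11 degrees of freedom.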
Similarly, the error mean square is found by dividing 217.75 by the corresponding degrees of freedom, which are 9, and so the mean square error is equal to 24.19. Students, we have filled out the ANOVA table almost completely, so that we are now able to compute F, our test statistic, and as you once again see on the screen, the computed value of F is to be inserted in the fifth and last column of the ANOVA table, and according to the formula that was presented earlier, it is equal to the mean square for treatments over the mean square error. Therefore, dividing 27.25 by 24.19, the computed value of F comes out to be 1.13.
Students, the fifth step of the hypothesis testing procedure is the determination of the critical region, and it can be shown that in this kind of situation, analysis of variance pertaining to the completely randomized design, which is also called one way analysis of variance, the critical region, as you see on the screen, is given by F greater than or equal to F alpha with k minus 1 and n minus k degrees of freedom. In this particular example, our level of significance alpha is 0.05, k minus 1 is equal to 2 and n minus k is equal to 9. Therefore, consulting the F table for 5 percent right tail area, we obtain 4.26 as our critical value. Now, since our computed value is 1.13, it does not fall in the critical region, and hence we accept our null hypothesis, and we may conclude that on the average there is no difference among the yielding capabilities of the three varieties of potato.
Students, this is the procedure of analysis of variance with respect to the completely randomized design. You saw that the computation step is elaborate, but otherwise the main system is just the same. Now, in this course we cannot afford to go into too many rigorous and very advanced mathematical details. But I would like to give you two or three very important points, which are the basic assumptions of this hypothesis testing procedure.
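The final two steps, computing F and comparing it with the critical value, can be sketched as follows. The critical value 4.26 is simply copied from the F table as quoted in the lecture; it is hard-coded here rather than computed, and the variable names are illustrative.

```python
ms_treatment = 54.50 / 2      # mean square for treatments = 27.25
ms_error = 217.75 / 9         # mean square for error, about 24.19

f_computed = ms_treatment / ms_error

# Critical value for alpha = 0.05 with (2, 9) degrees of freedom,
# taken from the F table as quoted in the lecture (not computed here).
f_critical = 4.26

# Critical region: F >= F_alpha(k - 1, n - k).
reject_h0 = f_computed >= f_critical

print(f"computed F = {f_computed:.2f}, critical value = {f_critical}")
print("reject H0" if reject_h0 else "do not reject H0")
```

If SciPy happens to be available, `scipy.stats.f.ppf(0.95, 2, 9)` is one way to obtain the same table value without looking it up; that is a convenience and not part of the lecture's procedure.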
The first assumption is that the populations whose means we want to compare are normally distributed. Number two, the standard deviations of these populations are equal, and this assumption is called homoscedasticity. And the third point, equally important, is that we assume that the samples that have been drawn from these populations are random and have been drawn independently. If you are confident that, for the phenomenon or the variable that you are dealing with, your assumptions are satisfied to a reasonable extent, then you can apply this procedure, and as you might have noticed, it is an effective procedure for comparing more than two population means.
Let us now begin the discussion of the other design that I mentioned, the randomized complete block design, which is also called the RCB design. Students, as you now see on the slide, a randomized complete block design is one in which, number one, the experimental material is divided into groups or blocks in such a manner that the experimental units within a particular block are relatively homogeneous, whereas the overall experimental material is not homogeneous. Number two, each block contains a complete set of treatments, that is, it constitutes one replication of treatments, and number three, the treatments are allocated at random to the experimental units within each block, which means that the randomization is restricted. A new randomization is made for every block. The object of this type of arrangement is to bring the variability of the overall experimental material under control.
Students, let us try to understand in simpler words these points that I have presented to you in a formal way. Go back to the earlier situation, in which we want to sow different varieties of potato, or of any other crop, and compare their yields. This time, suppose that those twelve farms are not homogeneous.
It may be that for the farms that are near the canal, the fertility level or the water level or various other things that matter in agriculture are somewhat different compared with those farms which are away from the canal, and the ones which are further away may be even more different. So the overall experimental material is not homogeneous, the way it was in the case of the completely randomized design. We apply the completely randomized design only when we are confident that the entire material is homogeneous. Since that is not the case here, what we do is divide our material into groups, technically called blocks, and within a block we expect that the material is relatively homogeneous. That is, we divide it in such a way that what is inside one block is of one type, what is inside another block is of another type, and so on. So, students, in this situation the analysis of variance is called two way ANOVA, and the procedures are quite similar to what we did in one way ANOVA; it is a kind of further extension of the basic concepts that you saw in the previous case.
Let us begin the discussion of this situation with the help of the example that you now see on the screen. In a feeding experiment on some animals, four types of rations were given to animals that had been divided into five groups of four each. The following results were obtained. What you have in front of you now, students, is a bivariate table. In the top row you have the four types of rations that were given to these animals, and in the first column you have the five groups into which these animals have been divided. These groups are what we will technically be calling blocks. The values in the body of the table represent the gains in weight of the animals to which these rations were administered.
We are required to perform an analysis of variance in order to test the null hypothesis mu A is equal to mu B is equal to mu C is equal to mu D against the alternative that not all the means are equal. In the next lecture we will be discussing this problem in detail and we will perform the procedure which will enable us to test this hypothesis. In the meantime I would like to encourage you to attempt quite a few questions pertaining to the simplest case one way analysis of variance that is valid in the case of the completely randomized design. Best of luck and until next time, Allah Hafiz.