As Salaamu Alaykum, welcome to lecture number 40 of the course on statistics and probability. Students, you will recall that in the last lecture I discussed with you the t distribution, a very important distribution in the theory of statistical inference, and having discussed its basic properties, we applied it to interval estimation regarding mu. In today's lecture we will proceed with the application of the t distribution to hypothesis testing, and also to interval estimation, not only for mu but for mu 1 minus mu 2.
Going back to what I said last time, students, you will remember that in one particular situation the Z statistic is no longer valid and we should apply the t statistic. And what is that situation? The parent population is normally distributed, the sample size is small, and the standard deviation of the population is unknown. This is the situation where we apply the t statistic. And what was the formula for the confidence interval for mu, students? As you now see on the screen, the formula is x bar plus minus t alpha by 2, at n minus 1 degrees of freedom, times s over square root of n, where s itself is defined as the square root of the summation of x minus x bar whole square over n minus 1.
Moving ahead, first of all let us consider hypothesis testing regarding mu in this particular situation, when the parent population from which the sample is drawn is normal, the population variance is unknown, and the sample size is small. Let me try to explain this with the help of an example. As you now see on the slide, just as human height is approximately normally distributed, we can expect the heights of animals of any particular species to be normally distributed as well. Suppose that for the past 5 years, a zoologist has been involved in an extensive research project regarding the animals of one particular species.
Based on his research experience, the zoologist believes that the average height of the animals of this particular species is 66 centimeters. He selects a random sample of 10 animals of this species and, upon measuring their heights, obtains the following data: 63, 63, 66, 67 and so on. In the light of these data, test the hypothesis that the mean height of the animals of this particular species is 66 centimeters.
We will approach this problem in exactly the same way as we have been doing before. This is going to be a two-tailed test. Why? Because our null hypothesis will obviously be that mu is equal to 66 and the alternative will be that mu is not equal to 66. If we are saying that it is not equal to 66, then there are both possibilities, that it is less than 66 or that it is more than 66, so it will be a two-tailed test. The second step is the level of significance and, as before, we can set it at 0.05, that is, a 5 percent level of significance. And what is the third step, students? The test statistic. As you now see on the screen, the test statistic to be used in this particular situation is t equal to x bar minus mu over s over square root of n, where s is the one in which the denominator is not n but n minus 1, as explained earlier.
Students, you remember that last time I discussed this in detail with you. If the parent population is normally distributed and you draw from it a sample which is not of a large size, then after drawing the sample you find x bar and s. Next, suppose that there is not just one sample of this type but all possible samples, that is, millions upon millions of samples, and for every sample you find x bar and s. Then, I said to you, there will also be millions upon millions of values of the quantity x bar minus mu over s over square root of n, where for mu we keep whatever hypothesized value we have.
And you will recall that I said to you that it has been mathematically proved that this quantity follows the t distribution having n minus 1 degrees of freedom. I have repeated all these things so that it is clear in your mind that whenever we use a formula in any particular situation, there is a mathematical background to it, a reason why we are using it. It does not come out of thin air.
Now we are confident that this is the formula to be used, because in this situation our three conditions are being fulfilled. We are talking about heights, so we know that they should be approximately normally distributed. Sigma is unknown because we did not capture all the animals of that particular species; we just have a sample. And the sample size that we have selected in this problem is only 10: we have captured only 10 animals and measured their heights. So you see that all three conditions have been fulfilled, and that is why I am going to use this particular statistic.
The fourth step of course is to compute the statistic. As you now see on the screen, the x values are 63, 63, 66 and so on, and when we add them the sum is 678. Also, when we take the square of every value and add these squares, we obtain 46,050. Substituting these values in the formulae of x bar and s square, students, we obtain the sample mean as 67.8 centimeters, and the sample variance comes out to be 9.0667, so that when we take the square root, the sample standard deviation comes out to be 3.01 centimeters. Substituting these values in the formula of t, we obtain t equal to 1.89. Please do note once again that in place of mu we have put the number 66, which is exactly what we hypothesized as the mean height of the animals of this particular species. What is the next step?
Of course, it is to determine the critical region. You remember that this is a two-tailed test; therefore half of the level of significance is to be taken on the right-hand side and half on the left-hand side. And what is the procedure? You remember the t table that I introduced in the last lecture: in the very first column you have the degrees of freedom, and in the top row you have the values of the area that you would like to have to the right of the t value that you want to determine. So what was the rule? You look against n minus 1 degrees of freedom under alpha by 2 if it is a two-tailed test. As you now see on the screen, in this problem there are 10 values and therefore we have to use 9 degrees of freedom, n minus 1 being equal to 9. When we look against 9 under the value 0.025, students, we obtain t equal to 2.262. And as you see in the diagram in front of you, since this is a two-tailed test, the critical value on the right tail is plus 2.262, whereas the critical value on the left tail will obviously be minus 2.262, because, as explained in the last lecture, the t distribution is absolutely symmetric around 0.
And what is the last step? Of course, the conclusion. In this problem, as you just noted, our computed t value is 1.89, and obviously it lies well within the acceptance region, between minus 2.262 and plus 2.262. Therefore, students, we will accept the null hypothesis; we have no reason to reject it, and we can say that the researcher's claim, or his idea about the mean height of the animals of this particular species, seems to be supported by these data.
All right, students, now that we have applied the t distribution to inference regarding one single population mean mu, why do we not proceed to the application of the t distribution in the case of two populations whose means, mu 1 and mu 2, we want to compare? So we will proceed to interval estimation and hypothesis testing regarding mu 1 minus mu 2.
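Before moving on, the one-sample test on the animal heights can be retraced numerically. The following is a minimal sketch, assuming only the summary figures quoted in the lecture (n = 10, sum 678, sum of squares 46,050) and the table value 2.262; the function name `one_sample_t` is a hypothetical helper introduced here, not something from the lecture.

```python
import math

def one_sample_t(n, sum_x, sum_x2, mu0):
    """Return (mean, sample variance, t) from summary totals.

    Hypothetical helper: the variance uses the n - 1 denominator,
    as in the lecture.
    """
    mean = sum_x / n
    # Sum of squared deviations via the shortcut: sum(x^2) - (sum x)^2 / n
    ss = sum_x2 - sum_x ** 2 / n
    var = ss / (n - 1)
    t = (mean - mu0) / math.sqrt(var / n)
    return mean, var, t

# Summary figures quoted in the lecture for the 10 animal heights
mean, var, t = one_sample_t(n=10, sum_x=678, sum_x2=46050, mu0=66)

# Critical value read from the t table: 9 d.f., area 0.025 in each tail
t_crit = 2.262
reject = abs(t) > t_crit

print(f"mean={mean:.1f}, variance={var:.4f}, t={t:.2f}, reject H0: {reject}")
```

The critical value is hard-coded from the table rather than computed, so the sketch needs nothing beyond the standard library.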
Let us start with the first of these, with the help of an example, as you now see on the slide. A record company executive is interested in estimating the difference in the average play length of songs pertaining to pop music and semi-classical music. To do so, he randomly selects 10 semi-classical songs and 9 pop songs. The play lengths in minutes of the selected songs are listed in the following table. For the semi-classical music, we have 3.80, 3.30, 3.43 and so on minutes, whereas for the pop music songs the play lengths are 3.88, 4.13, 4.11 and so on minutes. We would like to calculate a 99 percent confidence interval to estimate the difference in the population means for these two types of recordings.
You have seen, students, this is a very interesting problem, very musical. We would like to see whether the play lengths, the time durations, are greater for semi-classical songs or for pop songs. This person has decided to do a statistical analysis, which is a wonderful thing. He has drawn random samples of both types of songs and noted their time durations, and on the basis of these data we now want to construct a 99 percent confidence interval for mu 1 minus mu 2.
And what does mu 1 minus mu 2 represent? As we indicated last time, we first need to identify what subscript 1 will stand for and what subscript 2 will stand for. As we said last time, students, this is up to you: you can call semi-classical 1, or you can call pop music 1. So suppose that in this particular problem we decide to use subscript 1 for semi-classical music and subscript 2 for pop music. Now the level of confidence, students, is 99 percent, a fairly high level of confidence. So they want an estimation procedure that captures the true value with a high probability. Then the next question is obviously: what is the formula for the confidence interval for mu 1 minus mu 2? And what is mu 1 minus mu 2? The mean time for the semi-classical music minus the mean time for the pop music.
That means the difference, on the average, between the play lengths, the time durations, of the two types of music. The formula, as you now see on the screen, is x 1 bar minus x 2 bar plus minus t alpha by 2, with n 1 plus n 2 minus 2 degrees of freedom, multiplied by s p into the square root of 1 over n 1 plus 1 over n 2.
Students, you may have started to get confused again. But just see that the pattern is so similar to the one that we had last time. When we talked about mu, what was our formula? x bar plus minus t alpha by 2, with n minus 1 degrees of freedom, into s over square root of n. And now, because we are talking about the difference of two things, instead of x bar we have x 1 bar minus x 2 bar. After that, like before, plus minus t alpha by 2 with so many degrees of freedom. And then s p into the square root of 1 over n 1 plus 1 over n 2 replaces the earlier s over square root of n. That means that it is exactly the same fundamental pattern. If you understand these things, then believe me, you will not have a problem.
To apply this formula, first of all let me compute all these quantities. As you now see on the screen, for the semi-classical music the total number of observations is 10. Adding them and dividing by 10, we get x 1 bar equal to 3.465. Also, computing the standard deviation by the formula in which the denominator is 9 and not 10, we obtain s 1 equal to 0.3575. Similarly, for pop music, the number of observations is 9. Adding them and dividing by 9, x 2 bar comes out to be 4.064, and s 2 comes out to be 0.2417.
Now, the weighted mean of s 1 square and s 2 square gives us s p square, which we call the pooled variance; the letter p stands for pooled. s p square is equal to n 1 minus 1 into s 1 square plus n 2 minus 1 into s 2 square, divided by n 1 minus 1 plus n 2 minus 1. And it is obvious that we can write the denominator as n 1 plus n 2 minus 2. Students, I have just presented this formula for s p square.
You have seen that this is exactly what I said last time; it is very similar. You will remember that when we were computing p c hat, it was n 1 p 1 hat plus n 2 p 2 hat over n 1 plus n 2. That is, the basic formula of a weighted mean is w 1 x 1 plus w 2 x 2 over w 1 plus w 2, and this is exactly what we have here. It is just that n 1 minus 1 and n 2 minus 1 are acting as the weights. So, repeating the formula, the pooled variance s p square is equal to n 1 minus 1 times s 1 square plus n 2 minus 1 times s 2 square, over n 1 minus 1 plus n 2 minus 1.
Now, the next question concerns the confidence interval formula itself. Why are we going to look up the t value against n 1 plus n 2 minus 2 degrees of freedom? See, this is the same number that appears in the denominator of the pooled variance. As I said, what is the denominator of the pooled variance? n 1 minus 1 plus n 2 minus 1, or, written in short, n 1 plus n 2 minus 2. So this number acts as the degrees of freedom of the t distribution that you have to use in this case.
Now, in reality, for everything we could go into the detailed mathematical logic and do the mathematical derivation. But in this course I want to emphasize the applications and concepts more than the derivations. So I will just ask you to note the interval that we made last time for mu. What was it? x bar plus minus t alpha by 2, with n minus 1 degrees of freedom, multiplied by s over square root of n. The n minus 1 degrees of freedom there was also the denominator of the s square that we were using at that time, small s square being given by the summation of x minus x bar whole square over n minus 1. So this is the easiest way to remember it: whatever the denominator of the sample variance that we are using, our statistic follows the t distribution with that many degrees of freedom.
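Written out in symbols, the spoken formulas read as follows. This is only a transcription of the formulas as described in the lecture; the symbol \nu for the degrees of freedom is notation introduced here for convenience.

```latex
% One sample: interval for \mu
\[
\bar{x} \pm t_{\alpha/2,\,\nu}\,\frac{s}{\sqrt{n}},
\qquad s^2 = \frac{\sum (x-\bar{x})^2}{n-1},
\qquad \nu = n-1
\]
% Two samples, equal variances: interval for \mu_1 - \mu_2
\[
(\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2,\,\nu}\; s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}},
\qquad s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2},
\qquad \nu = n_1+n_2-2
\]
```

In both cases the degrees of freedom equal the denominator of the variance estimate being used.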
So, applying this now to this particular problem, the denominator of our pooled variance is n 1 plus n 2 minus 2. In this problem, students, n 1 is equal to 10 and n 2 is equal to 9. Substituting these two values in this expression, what do we obtain? 10 plus 9 minus 2, and that is equal to 17. Therefore, we have to look in the t table against 17 degrees of freedom. As you now see on the screen, if we look in the t table against 17 degrees of freedom under 0.005, we obtain 2.898. Students, why did I say we have to look under 0.005? You remember that in this problem our level of confidence is not 95 percent but 99 percent. And if we are keeping 99 percent of the area in the middle, then obviously there is half a percent in the left tail and half a percent in the right tail, and half a percent means 0.005.
Now that we have the t value, the pooled variance, n 1, n 2, x 1 bar and x 2 bar, I think we are ready to finally construct our confidence interval. As you now see on the slide, the 99 percent confidence interval for mu 1 minus mu 2, on substituting all these values, comes out to be minus 1.010 to minus 0.188.
Students, how can you interpret this confidence interval? Would you say that, on the average, the duration of the semi-classical music is more as compared with the pop music, or is it the other way round? You remember that subscript 1 stands for semi-classical music and subscript 2 stands for pop music, and our result for mu 1 minus mu 2 is negative. That means that, on the average, the semi-classical songs are of less time duration as compared with the pop music. And how much is that difference? For that, you have just constructed the confidence interval. And since our unit was minutes, you will interpret this answer in minutes as well. If you have a look at the interval again, it is minus 1.010 to minus 0.188. If you look at one extreme, it is minus 1.01 minutes; leaving aside the minus sign and considering only the absolute value, we are talking about 1.01 minutes. And if you go to the other extreme and look at its absolute value, we are talking about 0.188, or 0.19, minutes. So we will say that this time difference, on the average, that is, the mean difference, lies in this range.
Students, now please note a very important point. I said to you that the t statistic is valid in the situation when the parent population is normal, the standard deviation is unknown and the sample size is small. And all of this referred to the case when you have one population and you are drawing one sample from it. In what we have just done, we have two populations. These basic conditions will be applied in this case also, in the sense that if both of our populations are normally distributed, their variances are unknown, and the two samples that you have drawn are small samples, then you will be applying this formula. But what I have just said is not complete until I add one more very important statement. In the case of two normally distributed populations, if we are drawing small samples and if the population variances are unknown but equal, then, students, we apply the formula that we just applied. That is, the addition here is that although the population variances are unknown, we have reason to believe that they are not going to be different from each other, and we assume that they are equal. And that common unknown population variance sigma square is what we estimate by computing the pooled variance s p square from the two sample variances.
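The 99 percent interval for the song data can be reproduced from the summary statistics alone. This is a minimal sketch, assuming the sample means, standard deviations and the table value 2.898 quoted in the lecture; `pooled_ci` is a hypothetical helper name, and the equal-variance assumption is built into it.

```python
import math

def pooled_ci(x1_bar, s1, n1, x2_bar, s2, n2, t_crit):
    """Confidence interval for mu1 - mu2 assuming equal population variances.

    Hypothetical helper; t_crit must be read from a t table at
    n1 + n2 - 2 degrees of freedom.
    """
    # Pooled variance: weighted mean of the two sample variances
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    margin = t_crit * math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
    diff = x1_bar - x2_bar
    return diff - margin, diff + margin

# Subscript 1 = semi-classical, subscript 2 = pop (values from the lecture)
lower, upper = pooled_ci(3.465, 0.3575, 10, 4.064, 0.2417, 9, t_crit=2.898)
print(f"99% CI for mu1 - mu2: ({lower:.3f}, {upper:.3f}) minutes")
```

Both limits are negative, which is what leads to the reading that semi-classical songs are shorter on the average.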
In the problem of the semi-classical music and the pop music that we have just been doing, students, the time duration of the song is our variable of interest, and if you think about it, I think you will agree that, whether it be semi-classical music or pop, if we were to record the time durations of all the songs of that type and draw the histogram of those data, it would be approximately normal. Why? Because most of the songs will have a time duration close to mu 1 in the case of semi-classical music and close to mu 2 in the case of pop music. Most songs have a duration that is roughly the same as the average duration, so the hump of your distribution lies in the middle. There will be only a few songs whose time duration is very short or very long, and so in the tails your frequencies will fall, and this will happen in a more or less balanced manner. Hence you can assume that these populations of time durations of the songs are normally distributed.
Now that we have done the construction and the interpretation of the confidence interval for mu 1 minus mu 2, students, let us proceed to hypothesis testing regarding mu 1 minus mu 2. According to the example that we now have on the screen, from an area planted in one variety of a rubber-producing plant, 54 plants were selected at random; of these, 15 were off types and 12 were aberrants. Rubber percentages for these plants were, for the off types, 6.21, 5.70, 6.60, 6.60 and so on, and for the aberrants, 4.28, 7.71, 6.48 and so on. Test the hypothesis that the mean rubber percentage of the aberrants is at least 1 percent more than the mean rubber percentage of the off types. Assume that the populations of rubber percentages are approximately normal and that they have equal variances.
Students, in this problem two populations are involved, the population of all the rubber plants of the off type and the population of all the rubber plants of the aberrant type, so of course we need subscript 1 for one of the two populations and subscript 2 for the other. Suppose that we use subscript 1 for the aberrants and subscript 2 for the off types. Then what will our null hypothesis be, and what will be the alternative? As you now see on the screen, in this case H naught is written as mu 1 minus mu 2 is greater than or equal to 1, and H 1 as mu 1 minus mu 2 is less than 1. Students, why is this so? Our problem asks us to test that the mean rubber percentage of the aberrant type is at least 1 percent more than the mean rubber percentage of the off types. This means that the difference in the means has to be greater than or equal to 1, the 1 percent that I have mentioned.
Students, do not get confused here and think that because of the word percent we are talking not about mu but about p. I know it is a bit confusing the first time, whether we are talking about p or mu. But interpret it this way: what is extracted from the plants comes out as a mixture, and the rubber content of that mixture is measured in percentage form. So our unit in this case is percentage of rubber. What we are going to test concerns how much rubber, on the average, the aberrant type gives as compared with the off type. And when the word average is used, it is obvious that we are talking about mu. I repeat that we want to test that the rubber percentage of the aberrant type is, on the average, at least 1 percent more as compared with the off type. So the null hypothesis will be that mu 1 minus mu 2, the difference of the two means, is greater than or equal to 1. At least 1.
And then it is obvious that the alternative will be that mu 1 minus mu 2 is less than 1. Remember that the hypothesis placed in the null should always be the one which has the equal sign.
What is the next step? The level of significance, the probability of committing a type 1 error, that is, the probability of rejecting H naught when it is actually true. If we set it at 0.05, then, students, will my critical region lie in the left tail or in the right tail? You remember that last time I explained to you that you always look at the sign that you have in the alternative to decide about the critical region. Our alternative is that mu 1 minus mu 2 is less than 1. So, because of the less-than sign, the critical region is going to fall in the left tail. And if the level of significance is 5 percent, then I will have the entire 5 percent area in the left tail, to the left of my critical value. This is not Z; it is the t distribution having, in this case again, n 1 plus n 2 minus 2 degrees of freedom. So I will be looking at the t table for 12 plus 15 minus 2, that is 27 minus 2, that is 25 degrees of freedom.
Let us consider the test statistic which is going to enable us to test this particular hypothesis. As you now see on the slide, the test statistic for this situation is t equal to the quantity x 1 bar minus x 2 bar, minus the quantity mu 1 minus mu 2, this whole thing divided by s p multiplied by the square root of 1 over n 1 plus 1 over n 2. Students, I would like you to see the similarity of this formula with the one that we used last time when testing mu. Last time we had t equal to x bar minus mu over s over square root of n, and since at that time there was only one population and only one sample had been drawn, we had x bar minus mu in the numerator. This time there are two populations and two samples, and we are talking about the difference between means, so in the numerator, rather than x bar minus mu, we have x 1 bar minus x 2 bar, minus mu 1 minus mu 2.
That leaves the denominator. Note that if you take s p inside the square root sign, you will obtain s p square into 1 over n 1 plus 1 over n 2 inside the square root sign, and if you open the bracket you will have, inside the square root sign, s p square over n 1 plus s p square over n 2. This s p square is the estimator of the unknown common population variance sigma square. If sigma square had been known to us, perhaps we would be writing sigma square over n 1 plus sigma square over n 2 here. And in the t statistic that we had last time, what was in the denominator? s over square root of n. If we had wanted, we could also have written that as s square over n under the root. Now compare that with this one: last time we had s square over n in the denominator, and this time we have s p square over n 1 plus s p square over n 2. So the point to understand, students, is that the formula that you have now is a kind of extension of the formula that you had last time. And if you study all these things carefully and spend a little time on them, you will certainly come to understand the basic pattern.
All right, let us now do the fourth step, the computation of our test statistic. As you now see on the slide, in this problem the sample mean for the aberrant kind of rubber plant comes out to be 6.74, whereas the sample mean for the off types is 5.62. Also, the sum of the squared deviations of the observations from their mean is 15.9697 for the aberrant type, whereas the sum of the squared deviations of the observations from their mean for the off types is 5.7737.
Now s p square, which is equal to n 1 minus 1 into s 1 square plus n 2 minus 1 into s 2 square over n 1 plus n 2 minus 2, can also be written, students, as the summation of x 1 minus x 1 bar whole square plus the summation of x 2 minus x 2 bar whole square, over n 1 plus n 2 minus 2, because n 1 minus 1 times s 1 square is just the sum of squared deviations for the first sample, and similarly for the second. Substituting the values that we obtained just now, our s p square comes out to be 0.8697, so that s p is equal to 0.93.
Now in order to compute t we will be substituting all the values that we have computed, x 1 bar, x 2 bar and s p, and of course n 1 and n 2 are already known. But students, the question is, what do we substitute in place of mu 1 minus mu 2? As I indicated in the last lecture, we must take the value which we get from the null hypothesis, because we always begin a hypothesis testing procedure with the assumption that H naught is true. In this example, according to the null hypothesis, mu 1 minus mu 2 is greater than or equal to 1. So from the equal sign we pick up 1 and put it in the formula for t, and then, as you now see on the slide, t comes out to be 0.33.
The next step of course is the critical region. As I indicated earlier, this is a left-tailed test, and we will be looking at the t table against 25 degrees of freedom under 0.05, because our level of significance is 0.05. Hence the t value that we obtain is minus 1.708, the minus sign being there because it is not the right tail but the left tail of the t distribution.
Students, the last point is the conclusion. Our computed t value is 0.33 and our critical value is minus 1.708. So what is the conclusion? Is my value falling in the left tail or is it not? Of course not. Therefore we will not reject the null hypothesis. And what was the null hypothesis, students? That mu 1 minus mu 2 is greater than or equal to 1, that is, that the difference, on the average, is greater than or equal to 1 percent.
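The rubber-plant test can likewise be checked from the quantities on the slide. This is a minimal sketch, assuming the sample means, sums of squared deviations and the table value 1.708 quoted in the lecture; `pooled_t` is a hypothetical helper name introduced for illustration.

```python
import math

def pooled_t(x1_bar, ss1, n1, x2_bar, ss2, n2, delta0):
    """Two-sample pooled t statistic for H0: mu1 - mu2 = delta0.

    Hypothetical helper; ss1 and ss2 are the sums of squared
    deviations of each sample from its own mean.
    """
    df = n1 + n2 - 2
    sp2 = (ss1 + ss2) / df                      # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))     # standard error of the difference
    t = ((x1_bar - x2_bar) - delta0) / se
    return sp2, df, t

# Subscript 1 = aberrants, subscript 2 = off types (values from the lecture)
sp2, df, t = pooled_t(6.74, 15.9697, 12, 5.62, 5.7737, 15, delta0=1)

# Left-tailed test at the 5 percent level: table value for 25 d.f. is 1.708
t_crit = -1.708
reject = t < t_crit
print(f"sp^2={sp2:.4f}, df={df}, t={t:.2f}, reject H0: {reject}")
```

Passing the sums of squared deviations directly uses the second form of the pooled variance, so the individual sample variances are never needed.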
All right, students, the next topic that I would like to discuss with you is the application of the t statistic in the case of paired observations. What we have just dealt with was the situation in which the two populations, one of one type and the other of a different type, are independent of each other. The paired observations situation is the one where the two observations that you take occur in the form of a pair. Let me explain this point with the help of the example that you now have on the screen.
Ten young recruits were put through a strenuous physical training program by the army. Their weights were recorded before and after the training, with the following results. We have three columns. The first column gives the serial number of the recruits, 1, 2, 3, 4 and so on up to 10. The second column gives the weights of these recruits before the training and the third column gives their weights after the training. Students, we note that for the first recruit the weight before the training was 125 pounds, but after the training it became 136 pounds. For the second person the weight before the training was 195 pounds and after the training it became 201. You can study all the ordered pairs that you have, and do note that the weight is not increasing for every single one of these recruits. If you look at the third recruit, it was 160 pounds and after the training it became 158 pounds. If you look at the ninth recruit, the weight before the training was 195 pounds and after the training it became 190 pounds. Using a level of significance of 5 percent, would you say that the program affects the average weight of the recruits? Assume the distributions of weights before and after the program to be approximately normal.
I hope that this problem has clearly indicated to you the situation where we will be dealing with paired observations.
So this is what we would call natural pairing, and then there are situations where we have to compare, say, two fertilizers. What we should do is to apply both types of fertilizer on the same type of soil, and then we get a yield for fertilizer one and a yield for fertilizer two, and now they are comparable, because any difference is due to the fertilizer. This is actually taking me into the area of experimental design, which is a vast area. So sometimes the pairing is natural and sometimes it is by design.
How will we test the hypothesis that the physical training affects the weight? Students, the basic procedure is just the same: the formulation of the hypotheses, the level of significance, the test statistic and so on. The point to note is that the test statistic in this particular situation will be d bar minus mu d over s d over square root of n. It is exactly the same pattern as before, the only difference being that our variable now is not x but d, where d denotes the difference of the weights before and after the training. In the next lecture we will be discussing this point in detail. In the meantime I would like to encourage you to attempt a few questions of the type that we have discussed today. My best wishes to you and until next time, Allah Hafiz.