We have discussed in detail the tests for the parameters of normal populations: one-sample problems, where we tested the mean and variance of a single normal population, and two-sample problems, where we compared the means and the variances of two normal populations. However, when we have qualitative data we may instead be interested in testing for proportions. The basic model here is X ~ Bin(n, p), where n is known. When n is small we can work with X directly; define P-hat = X/n. Suppose the testing problem is H1: p <= p0 against K1: p > p0. Then a natural test is: reject H1 if X > c, where c is chosen so that P(X > c | p = p0) = alpha. Since the binomial distribution is discrete, an integer c satisfying this exactly may not exist, so we may need to use a randomized test to attain level alpha exactly. When n is large we can use the normal approximation: define B1 = (X - n p0) / sqrt(n p0 q0), where q0 = 1 - p0. When p = p0 and n tends to infinity, B1 converges in distribution to Z ~ N(0, 1). Therefore, for testing H1 versus K1 we may use a test based on normal critical values: reject H1 if B1 > z_alpha.
Similarly, for H2: p >= p0 versus K2: p < p0 the test is: reject H2 if B1 < -z_alpha; and for H3: p = p0 versus K3: p != p0 the test is: reject H3 if |B1| >= z_{alpha/2}. Let me give a simple example. Suppose in a random sample of 100 patients, 70 were successfully cured using a certain drug. Let p denote the overall proportion of patients cured by this drug, and suppose we want to test that the overall effectiveness is more than 50 percent, that is, H: p <= 1/2 against K: p > 1/2. The test statistic becomes B1 = (70 - 50)/sqrt(100 x 1/2 x 1/2) = 20/5 = 4. If we take alpha = 0.05, then z_0.05 = 1.645, and even the two-sided value z_0.025 = 1.96 is exceeded; at alpha = 0.005 the critical value is about 2.58, and 4 is still larger. So we reject H; that is, we may conclude that the overall effectiveness of the drug is more than 50 percent. Sometimes we may be interested in comparing two proportions: we have X ~ Bin(m, p1) and Y ~ Bin(n, p2), with m and n large, and we need to compare p1 and p2. So we may consider hypotheses of the form H1: p1 <= p2 versus K1: p1 > p2, H2: p1 >= p2 versus K2: p1 < p2, and H3: p1 = p2 versus K3: p1 != p2.
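The drug example can be checked directly; here is a minimal Python sketch (the function name prop_ztest is mine, and the 1.645 critical value is the standard upper 5 percent point of N(0, 1)):

```python
from math import sqrt

def prop_ztest(x, n, p0):
    """Large-sample z statistic B1 for a proportion, based on X ~ Bin(n, p)."""
    return (x - n * p0) / sqrt(n * p0 * (1 - p0))

# Drug example: 70 cured out of 100, testing H: p <= 0.5 against K: p > 0.5
z = prop_ztest(70, 100, 0.5)   # (70 - 50) / sqrt(25) = 4.0
z_alpha = 1.645                # upper 5% point of N(0, 1), from tables
reject = z > z_alpha           # True: effectiveness exceeds 50%
```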
So let us define p1-hat = X/m, p2-hat = Y/n, the pooled estimate p-hat = (X + Y)/(m + n), and the statistic Z = (p1-hat - p2-hat) / sqrt(p-hat (1 - p-hat) (1/m + 1/n)), which equals sqrt(mn/(m + n)) (p1-hat - p2-hat) / sqrt(p-hat (1 - p-hat)). When p1 = p2, Z is asymptotically N(0, 1), so we can construct tests for H1, H2, H3 based on Z. For H1 versus K1 the rejection region is Z > z_alpha; for H2 versus K2 it is Z < -z_alpha; and for H3 versus K3 it is |Z| >= z_{alpha/2}, if we are considering level-alpha tests. Let me also consider a related topic. In the binomial setting there are two categories: with one binomial the proportions of the two types are p and 1 - p, and here we considered p1 and p2. In general we can consider more categories. This gives rise to goodness-of-fit tests, and since the asymptotic distributions involved are chi-square, we call them chi-square tests for goodness of fit. Let me introduce the problem first. We want to test whether the sample comes from a known distribution, say F0(x). In the previous problems, in the usual parametric methods, we assumed the form of the distribution, such as normal or binomial, or, as in earlier examples, exponential or Poisson. But there can be situations where we would like to test whether a particular distribution, say binomial, uniform, or Poisson, actually holds.
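The pooled two-proportion statistic can be sketched as below; the counts in the example call are hypothetical, chosen only to exercise the formula (the function name two_prop_z is mine):

```python
from math import sqrt

def two_prop_z(x, m, y, n):
    """Pooled z statistic for H0: p1 = p2, with X ~ Bin(m, p1), Y ~ Bin(n, p2)."""
    p1_hat, p2_hat = x / m, y / n
    p_hat = (x + y) / (m + n)        # pooled estimate, valid under H0
    return (p1_hat - p2_hat) / sqrt(p_hat * (1 - p_hat) * (1 / m + 1 / n))

# Hypothetical data: 45 successes out of 100 versus 30 out of 100
z = two_prop_z(45, 100, 30, 100)
# compare z with z_alpha (one-sided) or |z| with z_{alpha/2} (two-sided)
```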
In that case we say that the sample comes from a known distribution F0(x). If the unknown distribution function is denoted by F(x) and F0(x) is the hypothesized CDF, then we want to test H0: F(x) = F0(x) for all x against H1: F(x) != F0(x) for at least some x. That is, the alternative hypothesis says that F is some distribution other than F0. In the chi-square test for goodness of fit we divide the range of the variable into k mutually exclusive regions. Usually these will be intervals; I say regions because if we are considering a binomial distribution the values are 0, 1, ..., n, and for a Poisson distribution they are 0, 1, 2, ..., an infinite set of values, but for practical purposes we can club some of the values together to make the number of regions finite. So k is finite. Call the regions R_i, i = 1, ..., k, and if X denotes an observed random variable, assume P(X in R_i) = pi_i, i = 1, ..., k. Now, when the sample is observed, each observation x_i falls in one of the regions R_i. Let O_i denote the observed frequency of region R_i, i = 1, ..., k. If there are n observations, the expected frequency of the ith region is E_i = n pi_i. We then construct W = sum over i = 1 to k of (O_i - E_i)^2 / E_i, which has approximately a chi-square distribution on k - 1 degrees of freedom.
So we can test H0 versus H1 by: reject H0 if W > chi-square_{k-1, alpha}. Let us consider an example. It is claimed that students' preferences for various disciplines are uniformly distributed. Let there be 5 options, say CS, EC, EE, ME and CH, with preference probabilities p1, p2, p3, p4 and p5 respectively. We want to test H0: p_i = 1/5 for i = 1, ..., 5 against its negation; that is, we are hypothesizing a discrete uniform distribution for the preferences. A random sample of 300 students was taken and their preferences recorded: the observed frequencies O_i for the five branches are 88, 65, 52, 55 and 40. We want to test whether the preferences are uniformly distributed or not. First the expected frequencies: the total is 300 and we assign probability 1/5 to each group, so each E_i is 60. We compute W = sum of (O_i - E_i)^2 / E_i, i = 1 to k. This has an alternative representation: expanding the numerator gives (O_i^2 + E_i^2 - 2 O_i E_i)/E_i, so W = sum O_i^2/E_i + sum E_i - 2 sum O_i = sum O_i^2/E_i - n, because sum E_i and sum O_i are both equal to the total sample size n. Using this formula, W = sum O_i^2/60 - 300, which works out to 21.63. There are 5 groups here, so we look at the chi-square value on 4 degrees of freedom: at the 0.01 level it is 13.28, and at 0.05 it is 9.49.
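The preference calculation can be reproduced in a few lines; this sketch shows both forms of W agreeing, with the 4-degrees-of-freedom critical value hard-coded from standard chi-square tables:

```python
# Observed preference counts for the five branches (CS, EC, EE, ME, CH)
O = [88, 65, 52, 55, 40]
n = sum(O)                                   # 300 students
E = [n / 5] * 5                              # expected 60 each under H0: p_i = 1/5

# Direct form of the statistic
W = sum((o - e) ** 2 / e for o, e in zip(O, E))
# Alternative form: sum O_i^2 / E_i - n, algebraically identical
W_alt = sum(o * o / e for o, e in zip(O, E)) - n

crit = 9.488                                 # chi-square upper 5% point, 4 d.f.
reject = W > crit                            # True: preferences are not uniform
```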
So you can easily see that H0 is rejected; that is, students' preferences are biased towards certain disciplines. In this particular case I assumed that F0 is completely known. But F0 may not be completely known: it may contain unknown parameters. For example, if it is a binomial distribution there is an unknown parameter p which has to be estimated; if it is a Poisson distribution the parameter lambda has to be estimated; if it is a N(mu, sigma^2) distribution, then mu and sigma^2 have to be estimated first and then used in the calculation of the expected frequencies. In that case the degrees of freedom of the chi-square are reduced by the number of unknown parameters estimated from the sample. So if F0 contains unknown parameters theta = (theta_1, theta_2, ..., theta_m), we have to estimate them from the sample, and consequently the asymptotic distribution of W is chi-square on k - m - 1 degrees of freedom. Let us take one example. 30 randomly selected documents of equal size are taken and the number of typographical errors in each is recorded; the data is summarized below. Making a frequency table by number of errors — 0 errors, 1 error, 2 or 3 errors, 4 or 5 errors, and more than 5 errors — the numbers of documents in these groups are 6, 5, 8, 6 and 5 respectively. We want to test whether a Poisson distribution appropriately fits the data, since error counts are count data. So we assume that X, the number of errors, follows a Poisson(lambda) distribution.
Then this lambda has to be estimated first from the given data. We may use the maximum likelihood estimator, the UMVUE, or the method of moments estimator; for the Poisson distribution all of them coincide and are simply x-bar. Here x-bar = 95/30 = 3.1667. Based on this estimate the fitted distribution is P(X = k) = e^{-x-bar} x-bar^k / k!. Now, the groups here correspond to the k mutually exclusive regions I mentioned at the start: region 1 is {0}, region 2 is {1}, region 3 is {2, 3}, region 4 is {4, 5} and region 5 is {X > 5}. The probability of region 1 is p1 = P(X = 0) = e^{-x-bar}, which can be calculated to be 0.04214. Similarly, p2 = P(X = 1) = x-bar e^{-x-bar}, the probability of region 2, turns out to be 0.13346. Next, p3 = P(X = 2) + P(X = 3) = x-bar^2 e^{-x-bar}/2! + x-bar^3 e^{-x-bar}/3!, the probability of the third region, which equals 0.4343. Similarly, p4 = P(X = 4) + P(X = 5), the probability of the fourth region, turns out to be 0.28841, and p5 = P(X > 5), the probability of the fifth region, is 0.10164.
Now, based on this we calculate the E_i, which are nothing but n p_i-hat = 30 p_i-hat, i = 1, ..., 5; I write p1-hat, ..., p5-hat because these are the estimates of the region probabilities. Then W = sum O_i^2/E_i - n = 21.99. How many degrees of freedom? We have 5 classes and one parameter has been estimated, so the degrees of freedom are 5 - 1 - 1 = 3. One can easily check the tabulated values at particular levels of significance; even at 0.005 the critical value is 12.838. So H0 is rejected: the error counts do not fit a Poisson distribution. Let me give one more example, where a Poisson distribution does fit the given data. The following data represents the frequency count of violent crimes reported in a month for 200 randomly selected districts across a country. The counts are grouped as 0, 1, 2, 3, 4 and 5 or more crimes, and the corresponding numbers of districts are 22, 53, 58, 39, 20 and 8. Again we would like to test whether the crime-count data fits a Poisson distribution. Here x-bar is approximately 2 — it is 2 point something, but 2 is sufficient for our purposes — and the expected frequencies come out to be approximately 27.1, 54.1, 54.1, 36.1, 18.0 and 10.5. Then W = sum O_i^2/E_i - n turns out to be about 2.3. Since there are 6 groups and one estimated parameter, the degrees of freedom are 6 - 1 - 1 = 4, and at the 5 percent level the critical value is 9.49. So we certainly have reason to believe that a Poisson distribution adequately represents this frequency distribution.
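The crime-count fit can be reproduced as follows; this sketch takes lambda = 2 exactly, as the lecture does, rather than the precise x-bar, so the statistic agrees only approximately with the quoted value:

```python
from math import exp, factorial

# Districts with 0, 1, 2, 3, 4, and >= 5 violent crimes (200 districts in all)
O = [22, 53, 58, 39, 20, 8]
n = sum(O)
lam = 2.0                                    # x-bar rounded to 2, as in the lecture

# Fitted cell probabilities: P(X = k) for k = 0..4, and P(X >= 5) for the last cell
p = [exp(-lam) * lam ** k / factorial(k) for k in range(5)]
p.append(1 - sum(p))
E = [n * pk for pk in p]                     # expected frequencies

W = sum((o - e) ** 2 / e for o, e in zip(O, E))
# 6 cells, 1 estimated parameter -> 6 - 1 - 1 = 4 degrees of freedom
crit = 9.488                                 # chi-square upper 5% point, 4 d.f.
fits = W < crit                              # True: Poisson fit is not rejected
```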
If we look at it this way, the problem of fitting a distribution basically reduces to a multinomial problem, because we are dividing the data into k categories. And once we are dividing into several categories, it is immaterial whether we classify in one dimension or in higher dimensions. So let us consider, in general, testing for independence in R x C contingency tables. In a contingency table we classify according to two attributes A and B: for A we have categories A_1, A_2, ..., A_R and for B we have B_1, B_2, ..., B_C. The observed frequencies are O_11, O_12, ..., O_1C in the first row, O_21, O_22, ..., O_2C in the second, up to O_R1, O_R2, ..., O_RC. We consider the row and column sums: summing the rows gives O_1., O_2., ..., O_R., summing the columns gives O_.1, O_.2, ..., O_.C, and the total sum is n. So we have the following notation: O_ij is the observed frequency of the (i, j)th cell, O_i. = sum over j = 1 to C of O_ij is the ith row total, and O_.j = sum over i = 1 to R of O_ij is the jth column total. Assume the theoretical probability of the (i, j)th cell is pi_ij. Then the marginal probability of the ith row is pi_i. = sum over j of pi_ij, and of the jth column is pi_.j = sum over i of pi_ij. If the row and column classifications are independent, then we must have pi_ij = pi_i. pi_.j. So we calculate the expected frequency of the (i, j)th cell using this assumption: E_ij = O_i. O_.j / n, where n is the sum of all the frequencies.
Using this, we form W* = double sum of (O_ij - E_ij)^2 / E_ij, which asymptotically has a chi-square distribution on (R - 1)(C - 1) degrees of freedom. So we reject the hypothesis of independence if W* > chi-square_{(R-1)(C-1), alpha}. Let me give one application. The following data represents the number of accidents taking place in three shifts of four factories producing an item, recorded over a year. We want to test whether the incidence of accidents is independent of factory and shift; that is, whether a particular shift in a particular factory has more or fewer accidents. With four factories A, B, C, D as columns and shifts 1, 2, 3 as rows, the recorded frequencies are: shift 1: 10, 12, 6, 7; shift 2: 10, 24, 9, 10; shift 3: 13, 20, 7, 10. The column totals are 33, 56, 22 and 27, the row totals are 35, 53 and 50, and the total is n = 138. Then, for example, E_11 = 33 x 35 / 138, and E_23 = 22 x 53 / 138, and so on. W* turns out to be approximately 1.81. The chi-square on (3 - 1)(4 - 1) = 6 degrees of freedom at level 0.05 is 12.59, so we can say that shifts and factories are independent with respect to the occurrence of accidents; the incidence of accidents is homogeneous across the factories and shifts. Let us take one or two more applications of testing. Over two seasons, a professional player — we may consider, for example, a basketball player — played exactly 5 minutes in each of about 200 games. Let x_i be the number of hits he makes in game i, i = 1, ..., 200, where each x_i can take the values 0, 1, 2, 3, 4.
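The independence test for the accident data can be sketched as below; the individual cell counts used here are the ones consistent with the row and column totals quoted in the lecture, and the critical value is hard-coded from chi-square tables:

```python
# Accident counts: rows = shifts 1-3, columns = factories A-D
O = [[10, 12, 6, 7],
     [10, 24, 9, 10],
     [13, 20, 7, 10]]
n = sum(sum(row) for row in O)                                 # total = 138
row = [sum(r) for r in O]                                      # row totals O_i.
col = [sum(O[i][j] for i in range(3)) for j in range(4)]       # column totals O_.j

# W* = sum over cells of (O_ij - E_ij)^2 / E_ij, with E_ij = O_i. O_.j / n
W_star = sum((O[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
             for i in range(3) for j in range(4))

crit = 12.592                    # chi-square upper 5% point, (3-1)(4-1) = 6 d.f.
independent = W_star < crit      # True: independence is not rejected
```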
So we have the following data: the values of x_i are 0, 1, 2, 3, 4, and the numbers of games with those values are 73, 82, 38, 7 and 0 respectively. We want to test whether a binomial distribution fits the data. A Bin(4, p) distribution has the parameter p, which we must estimate from the data. The cell probabilities are p1 = P(X = 0) = (1 - p)^4, p2 = P(X = 1) = 4p(1 - p)^3, p3 = P(X = 2) = 6p^2(1 - p)^2, p4 = P(X = 3) = 4p^3(1 - p), and p5 = P(X = 4) = p^4. The likelihood function is L(p) = [200! / (73! 82! 38! 7! 0!)] ((1 - p)^4)^73 (4p(1 - p)^3)^82 (6p^2(1 - p)^2)^38 (4p^3(1 - p))^7 (p^4)^0. This simplifies, and L(p) is maximized at p-hat = 0.224. Based on this we calculate p1-hat = 0.363, p2-hat = 0.419, p3-hat = 0.181, p4-hat = 0.035 and p5-hat = 0.003. The chi-square statistic then comes out to be approximately 0.178. We compare with the chi-square value on 5 - 1 - 1 = 3 degrees of freedom, since there are 5 categories and one parameter has been estimated, and one can see that the binomial fit is not rejected. I will now give one application of the general testing problem which we discussed for normal populations.
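The ML estimate of p has a closed form here, because the total number of hits over all games is itself binomial: p-hat = (sum of k O_k) / (4n). A small sketch of the fitting step, with variable names of my choosing:

```python
from math import comb

# Games with 0, 1, 2, 3, 4 hits (200 games); X is assumed ~ Bin(4, p)
O = [73, 82, 38, 7, 0]
n = sum(O)

# ML estimate: total hits divided by total trials 4n -> 179/800, about 0.224
p_hat = sum(k * o for k, o in enumerate(O)) / (4 * n)

# Fitted cell probabilities and expected frequencies under Bin(4, p_hat)
probs = [comb(4, k) * p_hat ** k * (1 - p_hat) ** (4 - k) for k in range(5)]
E = [n * pk for pk in probs]
```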
The summary data is as follows: we have two samples, for two types of elements present in the bones of children, with n1 = 121, x1-bar = 2.6, s1^2 = 1.44, and n2 = 61, x2-bar = 0.4, s2^2 = 0.0121. We want to test whether the two normal populations have similar means and variances. Before testing equality of means we must first test equality of variances: H0: sigma1^2 = sigma2^2 against H1: sigma1^2 != sigma2^2. The statistic s1^2/s2^2 turns out to be approximately 119.0. The F value on (120, 60) degrees of freedom, even at level 0.1, is only about 1.34, which the observed value far exceeds, so H0 is rejected. Now consider H0*: mu1 = mu2 against mu1 > mu2. Since the variances are unequal, we form the statistic (x1-bar - x2-bar)/sqrt(s1^2/n1 + s2^2/n2), which is approximately 20.0, and the degrees of freedom of the approximating t distribution turn out to be approximately 123. So H0* is certainly rejected. Note which procedure is used here for testing equality of means: I discussed four different procedures, and we first need to check the variances. Since equality of variances was rejected, we followed the procedure based on the separate sample variances; had it been accepted, we would have used the procedure based on the pooled variance. So only after deciding which situation applies do we apply the corresponding testing methodology. We have discussed some of the important parametric methods; there are many more, but in this particular course I will restrict attention to these. In the following lectures I will move on to multivariate analysis.
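The two-step procedure above can be sketched numerically; the degrees-of-freedom expression is the standard Welch-Satterthwaite approximation, which matches the lecture's value of about 123 (a sketch, not the lecture's exact rounding):

```python
from math import sqrt

# Summary data for the two samples
n1, xbar1, s1sq = 121, 2.6, 1.44
n2, xbar2, s2sq = 61, 0.4, 0.0121

# Step 1: F statistic for H0: sigma1^2 = sigma2^2, with (120, 60) d.f.
F = s1sq / s2sq                              # about 119, far above F tables

# Step 2: variances judged unequal, so use separate-variance t statistic
se_sq = s1sq / n1 + s2sq / n2
t = (xbar1 - xbar2) / sqrt(se_sq)            # about 20

# Welch-Satterthwaite approximate degrees of freedom
df = se_sq ** 2 / ((s1sq / n1) ** 2 / (n1 - 1) + (s2sq / n2) ** 2 / (n2 - 1))
```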
So we will have an elementary discussion of the multivariate normal distribution and the related distributions, and of how they are used for calculations, computations and inference when we have multivariate data. We will take that up in the following lectures.