In the previous class, we discussed methods for testing hypotheses about the parameters of one multivariate normal distribution and also of two multivariate normal distributions. For example, if we have a sample from a p-dimensional N_p(mu, Sigma) distribution, then we can test H0: mu = mu0. We saw that if Sigma is known, the test is based on a chi-square statistic, whereas if Sigma is unknown, the test is based on Hotelling's T² statistic, and we showed the equivalence that a suitable multiple of the T² statistic follows an F distribution; the corresponding formula was given. We also discussed the two-sample problem, that is, testing for the equality of the mean vectors of two multivariate normal populations, H0: mu1 = mu2. Again, if the variance-covariance matrices were known, the test was based on a chi-square statistic, and when they were unknown but equal, it was based on Hotelling's T² distribution. I also showed one more application of this type of testing, namely tests for linear functions of the mean vectors.

Now, another application is that we may consider equality of the components of the mean vector themselves. For example, mu1, mu2, ..., mup may denote the same characteristic measured on different components of something which may have similarities, and we want to know whether mu1 = mu2 = ... = mup or not; this is something like a test for homogeneity. We know that in analysis of variance we have such a test when we are considering several normal populations; it is called a one-way analysis of variance test. But there the populations are taken to be independent, that is, the sampling procedure gives p independent samples. Here, by definition, the components are not independent, because they come from one multivariate normal population. So now I present a procedure for this.

Let x1, x2, ..., xn be a random sample from N_p(mu, Sigma), where, writing the mean vector in row form, mu' = (mu1, mu2, ..., mup). Suppose we want to test H0: mu1 = mu2 = ... = mup against H1: at least one inequality. We can write H0 as the set of equations mu1 - mu2 = 0, mu1 - mu3 = 0, ..., mu1 - mup = 0, that is,

    [ 1  -1   0  ...   0 ]
    [ 1   0  -1  ...   0 ]  (mu1, mu2, ..., mup)'  =  (0, 0, ..., 0)',
    [ .   .   .  ...   . ]
    [ 1   0   0  ...  -1 ]

which we write as C mu = 0, where C is a (p-1) x p matrix. You see, the statement mu1 = mu2 = ... = mup can be viewed as p-1 simultaneous linear functions of the components of mu, so we can write it as C mu = 0. We are therefore equivalently testing H0: C mu = 0 against H1: C mu ≠ 0. Consider the transformation y = Cx. Then y is a (p-1) x 1 vector, and y follows N_{p-1}(C mu, C Sigma C').
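To make the construction concrete, here is a minimal sketch in Python (not part of the lecture; it assumes numpy and scipy are available, and the function name homogeneity_test is my own). It builds the contrast matrix C, transforms the data, and carries out the T² test via its F equivalent:

```python
import numpy as np
from scipy import stats

def homogeneity_test(X):
    """Test H0: mu_1 = ... = mu_p for one sample from N_p(mu, Sigma).

    X : (n, p) data matrix; rows are observations.
    Uses y = C x with the (p-1) x p contrast matrix
    C = [1 -1 0 ...; 1 0 -1 ...; ...].
    """
    n, p = X.shape
    # Contrast matrix: first column all ones, then -I_{p-1}
    C = np.hstack([np.ones((p - 1, 1)), -np.eye(p - 1)])
    Y = X @ C.T                       # transformed sample, shape (n, p-1)
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)       # divisor n-1; equals C S_x C'
    T2 = n * ybar @ np.linalg.solve(S, ybar)
    # T^2(p-1, n-1) relates to F via F = (n-p+1)/((n-1)(p-1)) * T^2,
    # with F on (p-1, n-p+1) degrees of freedom under H0
    F = (n - p + 1) / ((n - 1) * (p - 1)) * T2
    return T2, F, stats.f.sf(F, p - 1, n - p + 1)
```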
So, let us consider the vectors y_j = C x_j for j = 1, ..., n, and define ybar as the mean vector of the y_j's; we can also define the variance-covariance matrix based on the y's as S_y = (1/(n-1)) Σ_{j=1}^{n} (y_j - ybar)(y_j - ybar)'. In fact, ybar is nothing but C xbar = (1/n) C Σ x_j, and similarly S_y = (1/(n-1)) C Σ_{j=1}^{n} (x_j - xbar)(x_j - xbar)' C' = C S C', where S is the sample variance-covariance matrix based on the x's. So we can use the test statistic, let us call it T_y² = n ybar' S_y⁻¹ ybar. This has the T² distribution on (p-1, n-1): p-1 because y is (p-1)-dimensional, and n-1 because we have n observations.

Let me give one example, adapted from C. R. Rao's 1948 work, using the terminology of that paper: N is the amount of cork in a boring from the north side of a cork tree, and similarly E, S and W are the amounts from the east, south and west sides. It is assumed that (N, E, S, W) follows a 4-dimensional multivariate normal distribution with some mean vector mu and variance-covariance matrix Sigma, and we want to test whether the amounts are the same on all sides. As I have explained, we can use the set of hypotheses mu1 - mu2 = 0, mu1 - mu3 = 0, mu1 - mu4 = 0, or we can equally well use, say, mu1 - mu2 + mu3 - mu4 = 0, mu2 - mu4 = 0, mu1 - mu2 = 0, etcetera; any such set of p-1 independent contrasts will do, tested against H1: at least one inequality. The experiment reported in Rao had n = 28 trees. Based on the transformed observations, ybar was found to be (8.86, 4.50, 0.86)' and S was calculated as

    [ 128.72   61.41  -21.02 ]
    [  61.41   56.93  -28.30 ]
    [ -21.02  -28.30   63.53 ].

If we calculate T²/(n-1) here, it turns out to be 0.768, and converting to F by multiplying by (n-p+1)/(p-1) = 25/3, we get 6.402. The critical value of F on (3, 25) degrees of freedom at level 0.01 is smaller than this, so the result is significant: the amounts of cork collected from the four sides differ. This is an application of Hotelling's T²; basically, what we are showing here is that we can handle linear functions of the mean vector.

Now, as in the case of one variable, the importance of the normal distribution stems from the fact that if we consider sums of the observations from a sample, or means of the observations from a sample, then by the central limit theorem we get an approximate normal distribution. A similar result holds for multivariate data also, which we may call the multivariate central limit theorem, and it is one reason why the methods for the multivariate normal distribution are so widely applicable. It has the following form: let x1, x2, ..., xn, ... be a sequence of independent, identically distributed p-dimensional random vectors with mean vector mu and covariance matrix Sigma; then the asymptotic distribution of (1/√n) Σ_{i=1}^{n} (x_i - mu) is N_p(0, Sigma) as n tends to infinity.
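The cork computation can be reproduced from the summary statistics alone. A small sketch, assuming the reconstructed S above and the standard T²-to-F conversion:

```python
import numpy as np
from scipy import stats

# Summary statistics from the cork example (n = 28 trees, q = 3 contrasts)
n, q = 28, 3
ybar = np.array([8.86, 4.50, 0.86])
S = np.array([[128.72,  61.41, -21.02],
              [ 61.41,  56.93, -28.30],
              [-21.02, -28.30,  63.53]])

T2 = n * ybar @ np.linalg.solve(S, ybar)   # Hotelling's T^2
F = (n - q) / ((n - 1) * q) * T2           # F on (q, n-q) = (3, 25) d.f.
print(T2 / (n - 1))                        # approx. 0.768
print(F)                                   # approx. 6.40
print(stats.f.ppf(0.99, q, n - q))         # 1% critical value, approx. 4.68
```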
This is a version without normalization by the dispersion: in the univariate case we divide by sigma, but we do not do that here because Sigma is a matrix; at most one can consider multiplication by Sigma^(-1/2), which is of course easy to understand. This type of result helps establish that the one-sample and two-sample procedures we have discussed for the multivariate normal distribution are widely applicable.

Now, one more case. In the univariate setting we had approximate procedures for testing the equality of means when the variances are unknown and we have no other information about them. When the variances are equal we have the pooled procedure, for which I presented the multivariate analogue, the pooled Hotelling's T²; when they are not equal, in the univariate case we had approximate procedures, and in the multivariate case I now present a procedure based on a curtailment of the observations. So, the two-sample problem with unequal dispersion matrices. Going back to the notation I introduced earlier, let x_1^(1), ..., x_{n1}^(1) be a random sample from N_p(mu1, Sigma1), and let x_1^(2), ..., x_{n2}^(2) be a random sample from N_p(mu2, Sigma2). So we have two random samples, and the two samples are also taken to be independent of each other. We consider the test of H0: mu1 = mu2 against H1: mu1 ≠ mu2.

Let me write the summary statistics. The first sample mean, xbar^(1), is N_p(mu1, (1/n1) Sigma1), and the second, xbar^(2), is N_p(mu2, (1/n2) Sigma2). First consider the case n1 = n2 = n. If ybar = xbar^(1) - xbar^(2), then ybar is normal with mean mu1 - mu2 and dispersion (1/n)(Sigma1 + Sigma2); you can see that because the coefficients 1/n1 and 1/n2 are the same, I can combine Sigma1 + Sigma2. Therefore I can consider y_j = x_j^(1) - x_j^(2), define S = (1/(n-1)) Σ_{j=1}^{n} (y_j - ybar)(y_j - ybar)', and use Hotelling's T² = n ybar' S⁻¹ ybar, which follows T²(p, n-1) when H0 is true. This gives a generalization of the paired t-test used for univariate populations.

Now let us consider the second case, which is the more important one, n1 ≠ n2. Without loss of generality let n1 < n2. In this case define, for j = 1, ..., n1,

    y_j = x_j^(1) - √(n1/n2) x_j^(2) + (1/√(n1 n2)) Σ_{k=1}^{n1} x_k^(2) - (1/n2) Σ_{r=1}^{n2} x_r^(2).

You see the way we have defined this: the first term is an observation from the first sample, and the remaining terms involve the observations from the second sample. The definition runs over j = 1, ..., n1 only, and the remaining observations of the second sample are put together in the last sum. Let us see the effect.
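For the equal-sample-size case, a hedged sketch (the setup and the function name paired_hotelling_t2 are mine, not the lecture's):

```python
import numpy as np
from scipy import stats

def paired_hotelling_t2(X1, X2):
    """Test mu1 = mu2 with unequal dispersions when n1 = n2 = n.

    X1, X2 : (n, p) arrays of observations paired by index.
    Works with the differences y_j = x_j^(1) - x_j^(2), the
    multivariate analogue of the paired t-test.
    """
    Y = X1 - X2
    n, p = Y.shape
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)
    T2 = n * ybar @ np.linalg.solve(S, ybar)
    F = (n - p) / ((n - 1) * p) * T2     # F on (p, n-p) d.f. under H0
    return T2, stats.f.sf(F, p, n - p)
```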
If I take the expectation of this, I get the mean of the first term, mu1, minus √(n1/n2) times the mean of the second, mu2. Then, since the mean of each x_k^(2) is mu2 and there are n1 observations in that sum, the third term contributes (n1/√(n1 n2)) mu2 = √(n1/n2) mu2, and the last term contributes (n2/n2) mu2 = mu2 with a minus sign. Look at these terms: the second and third simply cancel out, so we actually get E(y_j) = mu1 - mu2. That means that if we base our test on the mean of the y_j's, it will be able to test the equality of mu1 and mu2.

Also, let us consider the covariance matrix between two observations, say y_alpha and y_beta, that is, E[(y_alpha - E y_alpha)(y_beta - E y_beta)']. Expanding, y_alpha - E y_alpha = (x_alpha^(1) - mu1) - √(n1/n2)(x_alpha^(2) - mu2) + (1/√(n1 n2)) Σ_{k=1}^{n1} (x_k^(2) - mu2) - (1/n2) Σ_{r=1}^{n2} (x_r^(2) - mu2), and similarly for y_beta. Taking the product with the first term gives simply delta_{alpha beta} Sigma1, the variance-covariance matrix of x_alpha^(1); here delta_{alpha beta} is 1 when alpha = beta and 0 otherwise. Collecting the coefficients of Sigma2 from all the remaining terms and simplifying, the cross terms cancel out and we get simply Cov(y_alpha, y_beta) = delta_{alpha beta} (Sigma1 + (n1/n2) Sigma2). So the y_j's are uncorrelated, with common dispersion Sigma1 + (n1/n2) Sigma2.

Based on this I can easily define a suitable statistic for testing mu1 = mu2: T² = n1 ybar' S⁻¹ ybar, which has the T² distribution on (p, n1 - 1), where ybar = (1/n1) Σ_{j=1}^{n1} y_j and (n1 - 1) S = Σ_{alpha=1}^{n1} (y_alpha - ybar)(y_alpha - ybar)'. Again this can be simplified: if I substitute the full form of y_alpha and ybar, the sums over the second sample are the same for every alpha and drop out of the deviations, so (n1 - 1) S = Σ (u_alpha - ubar)(u_alpha - ubar)', where u_alpha = x_alpha^(1) - √(n1/n2) x_alpha^(2) for alpha = 1, ..., n1 and ubar = (1/n1) Σ_{alpha=1}^{n1} u_alpha.

This procedure was proposed by Scheffé in 1943 for the univariate case, and Scheffé actually showed that, among procedures based on the t distribution, it gives the shortest confidence interval; here we are making a sacrifice of some of the observations. Bennett, in 1951, gave the extension to the multivariate case. One can actually consider it for several populations also, when we are considering linear combinations: what you will have to do is take the minimum of all the sample sizes and construct the statistic based on that. Let me demonstrate how this approach can be extended to more general cases. Consider x_alpha^(i) for alpha = 1, ..., n_i and i = 1, ..., k; these are samples from N_p(mu_i, Sigma_i).
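A sketch of this Scheffé-Bennett construction under the stated assumptions (numpy/scipy; the function name bennett_t2 is hypothetical):

```python
import numpy as np
from scipy import stats

def bennett_t2(X1, X2):
    """Scheffé-Bennett test of mu1 = mu2 with unequal dispersions, n1 < n2.

    X1 : (n1, p) sample from N_p(mu1, Sigma1)
    X2 : (n2, p) sample from N_p(mu2, Sigma2), with n2 >= n1
    """
    n1, p = X1.shape
    n2 = X2.shape[0]
    c = np.sqrt(n1 / n2)
    # y_j = x_j^(1) - c x_j^(2) + (1/sqrt(n1 n2)) sum_{k<=n1} x_k^(2)
    #       - (1/n2) sum_{r<=n2} x_r^(2),  for j = 1, ..., n1
    Y = (X1 - c * X2[:n1]
         + X2[:n1].sum(axis=0) / np.sqrt(n1 * n2)
         - X2.sum(axis=0) / n2)
    ybar = Y.mean(axis=0)
    # S may equivalently be computed from u_j = x_j^(1) - c x_j^(2),
    # since the remaining terms of y_j are the same for every j
    U = X1 - c * X2[:n1]
    S = np.cov(U, rowvar=False)
    T2 = n1 * ybar @ np.linalg.solve(S, ybar)
    F = (n1 - p) / ((n1 - 1) * p) * T2   # F on (p, n1-p) under H0
    return T2, stats.f.sf(F, p, n1 - p)
```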
So we are considering k independent samples from N_p(mu_i, Sigma_i), i = 1, ..., k, and we consider testing a linear combination of the mean vectors, H0: Σ_{i=1}^{k} beta_i mu_i = mu*, against H1: Σ beta_i mu_i ≠ mu*, where beta_1, beta_2, ..., beta_k are given scalars and mu* is a given vector. If the n_i are equal, there is no problem: we can combine the observations as in the paired test. If the n_i are unequal, let n1 be the smallest; as in the previous construction, we work with n1 and define, for alpha = 1, ..., n1,

    y_alpha = beta_1 x_alpha^(1) + Σ_{i=2}^{k} beta_i √(n1/n_i) [ x_alpha^(i) - (1/n1) Σ_{beta=1}^{n1} x_beta^(i) + (1/√(n1 n_i)) Σ_{r=1}^{n_i} x_r^(i) ].

Then the expectation of y_alpha is beta_1 mu_1 from the first term, plus Σ_{i=2}^{k} beta_i √(n1/n_i) [ mu_i - (1/n1) n1 mu_i + (n_i/√(n1 n_i)) mu_i ]. The first two terms inside the bracket cancel out, and √(n1/n_i) times n_i/√(n1 n_i) equals 1, so we get simply E(y_alpha) = Σ_{i=1}^{k} beta_i mu_i, which is the desired quantity in the hypothesis. Similarly, if we compute the covariance matrix E[(y_alpha - E y_alpha)(y_beta - E y_beta)'], it equals delta_{alpha beta} Σ_{i=1}^{k} beta_i² (n1/n_i) Sigma_i. If we take ybar as the mean of the y_alpha's, based on the n1 observations only, then in fact ybar = Σ_{i=1}^{k} beta_i xbar^(i), where xbar^(i) is the mean of the full i-th sample, and we set (n1 - 1) S = Σ_{alpha=1}^{n1} (y_alpha - ybar)(y_alpha - ybar)'. Then T² = n1 (ybar - mu*)' S⁻¹ (ybar - mu*) has the T² distribution on (p, n1 - 1), so we can carry out the Hotelling's T² test based on this. As before, if I define u_alpha = Σ_{i=1}^{k} beta_i √(n1/n_i) x_alpha^(i) for alpha = 1, ..., n1, then (n1 - 1) S is nothing but Σ (u_alpha - ubar)(u_alpha - ubar)'.

Another problem may also arise: I consider two subvectors of the full vector and want to test whether they have equal components; that is, if the first subvector has components mu1, mu2 and the second mu3, mu4, whether mu1 = mu3, mu2 = mu4, etcetera. This type of problem can also be handled using Hotelling's T²; let me give one example, as shown in the sketch below. Suppose x is partitioned into x^(1) and x^(2), and correspondingly mu = (mu^(1)', mu^(2)')', where each part has q components; the variance-covariance matrix is partitioned similarly into Sigma_11, Sigma_12, Sigma_21, Sigma_22. If I consider xbar^(1) - xbar^(2), it has a q-dimensional normal distribution with mean vector mu^(1) - mu^(2) and variance-covariance matrix (1/n) Sigma*, where Sigma* = Sigma_11 - Sigma_21 - Sigma_12 + Sigma_22. So if we want to test H0: mu^(1) = mu^(2) against H1: mu^(1) ≠ mu^(2), we can consider the statistic n (xbar^(1) - xbar^(2))' (S_11 - S_21 - S_12 + S_22)⁻¹ (xbar^(1) - xbar^(2)), which has the Hotelling's T² distribution on (q, n - 1).
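For the subvector problem, the statistic reduces to a one-sample T² on the within-observation differences. A minimal sketch, assuming the two subvectors occupy the first and last q columns of the data matrix:

```python
import numpy as np
from scipy import stats

def subvector_equality_test(X):
    """Test mu^(1) = mu^(2) for one sample from N_{2q}(mu, Sigma).

    X : (n, 2q) data; the first q columns are x^(1), the last q are x^(2).
    Equivalent to a one-sample T^2 test on d_j = x_j^(1) - x_j^(2).
    """
    n, p = X.shape
    q = p // 2
    D = X[:, :q] - X[:, q:]
    dbar = D.mean(axis=0)
    # np.cov(D) equals S11 - S21 - S12 + S22 from the partitioned S
    S_star = np.cov(D, rowvar=False)
    T2 = n * dbar @ np.linalg.solve(S_star, dbar)
    F = (n - q) / ((n - 1) * q) * T2    # F on (q, n-q) d.f. under H0
    return T2, stats.f.sf(F, q, n - q)
```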
I have shown that various inferential problems for the mean vectors of one, two, or several multivariate normal populations can be handled using the Hotelling's T² statistic. There are other things as well: for the variance-covariance matrix there are tests based on the Wishart distribution. However, I am not discussing that part right now, because testing for the variance-covariance matrix is somewhat more complicated; instead we move on to a more practically oriented problem, which is called the problem of classification. So let me introduce the problem of classification of observations.

Quite frequently we encounter problems of the following kind. Consider, for example, a new entrant to a college. The students of a college can be divided into two groups: those who go on to an academic career and those who go for a corporate job. Based on previous data, we have the distributions of the performance of the two groups of students. Now, if a new student arrives, to which group will he belong? This kind of problem can be considered in a more general setting: we have k populations, say pi_1, pi_2, ..., pi_k, and we want to classify a new observation x into one of the k populations. Broadly speaking, this is the problem of classification.

Now, there can be several variations. For example, we may know the forms of pi_1, pi_2, ..., pi_k; they could be, say, N(mu_1, Sigma_1), N(mu_2, Sigma_2), ..., N(mu_k, Sigma_k), and we have a new observable vector x that we want to classify. It could be that mu_1, Sigma_1, ..., mu_k, Sigma_k are known. There could be another version of the problem where these parameters are unknown; in that case we need some observations from each of the populations, because we will need to estimate mu_1, ..., mu_k and sigma_1, ..., sigma_k, and those observations are called training samples. There can be yet another type of problem where the forms of pi_1, pi_2, ..., pi_k are completely unknown; in that case we have nonparametric procedures.

So let me introduce the problem: what are the procedures, and in what way can we study them? What are the standards of a good classification? In a very rough or simple way, we can say that if we classify an observation into one of the populations, then either it is a correct classification or it is an incorrect classification. So a criterion for judging the goodness of a classification procedure could be the probability of incorrect classification, which we call the probability of misclassification. If the probability of misclassification remains low, it is a good procedure. It is something like the testing-of-hypothesis problem, where we accept or reject the hypothesis based on the sample.
Now, there the hypothesis could have been true and we rejected it, or the hypothesis could have been false and we accepted it; those were the two kinds of errors. When we are dealing with k populations in classification, the probabilities of misclassification (and of correct classification) become manifold: an observation could have belonged to pi_1 and we classify it as pi_2, or it could have been from pi_1 and we classify it as pi_3, and so on, and similarly the other way round; the observation could be from any pi_j and we may classify it into any of the pi_i's. Along with the probabilities, we can also have costs of misclassification: if you make a wrong classification, there can be some additional cost, and in a general decision-theoretic setup one can consider that as well. A particular case is that a correct classification incurs no cost and every wrong classification incurs, say, unit cost; then you get a 0-1 loss function.

So now let me give some notation. The classification of an observation depends on the measurements x = (x1, x2, ..., xp)' on that individual. We can consider R1 and R2 as a partition of the p-dimensional space: if x belongs to R1, classify x as belonging to pi_1, and if x belongs to R2, classify x as belonging to pi_2. R1 and R2 are disjoint regions in the p-dimensional sample space. I mentioned the kinds of errors; suppose in the beginning we consider only two populations, pi_1 and pi_2. We may define c(2|1) as the cost of misclassification when an individual is classified as coming from pi_2 whereas he actually came from pi_1, and similarly c(1|2) when he is classified into pi_1 but actually came from pi_2. So we have two costs of misclassification. In a decision-theoretic setup we can write this as a loss matrix: on one side the statistician's decision, pi_1 or pi_2, and on the other side the true population, pi_1 or pi_2. If the true population is pi_1 and we classify it as pi_1, the cost is 0; similarly, if the true population is pi_2 and we classify it as pi_2, the cost is 0. If the true population is pi_1 and we classify it as pi_2, the cost is c(2|1), and in the opposite case the cost is c(1|2). These two quantities are taken to be positive.

In general, a good classification procedure should have a small cost of misclassification. But as we have seen in previous discussions, in a statistical decision problem it is generally not possible to minimize the misclassification costs completely; in the testing-of-hypothesis problem also we saw that the type 1 error and the type 2 error cannot both be eliminated. The compromise worked out there was to fix the level of significance and then make the probability of type 2 error smallest, that is, the power of the test maximum. That was one of the compromise solutions considered. Indeed, the two-population classification problem can be viewed as a testing-of-hypothesis problem, and therefore both error probabilities cannot be minimized simultaneously.
So let us consider this cost function and see in what way we can carry out the minimization. One ingredient that we did not have in the testing-of-hypothesis problem is to assign prior probabilities to each of the populations. For example, we may know that both populations occur with equal probability, or that population 1 occurs with probability 1/3 and population 2 with probability 2/3, and so on. For example, you get a satellite image and you want to classify whether a region is land area or water area; if the image is of a portion of the earth, you know that the land area is, say, about 1/4 of the whole earth and the water area about 3/4, so you can take p1 and p2, the prior probabilities, accordingly. If you have prior probabilities, then we can reduce the several probabilities of misclassification to a single number, and we can consider Bayesian classification rules. So let me introduce this now.

Suppose q1 is the prior probability that the observation came from population pi_1, and let q2 be the prior probability that the observation comes from population pi_2; of course, q1 + q2 = 1. Let p_i(x) be the density associated with population pi_i; you may have a discrete or a continuous distribution, or it could be a mixture also, but in particular let us take it to be either purely discrete or purely continuous, so that we have a pdf or pmf associated with pi_i. We keep R1 and R2 as the regions associated with classifying an observation x into pi_1 or pi_2. We define the probability of correctly classifying into pi_1 an observation which is actually from pi_1 as P(1|1) = ∫_{R1} p1(x) dx. I am writing the density-function form; in the discrete case we can equivalently change the integral to a summation, so I am not discussing that case separately. One interpretation to note: dx here can be multivariate, depending on what kind of observation you have; in general we will be dealing with multivariate observations. Similarly, we define the probability of misclassifying an observation from pi_1 as P(2|1) = ∫_{R2} p1(x) dx; that is, it comes from pi_1, so the density is p1, but we integrate over R2. In a similar way, the probability of correctly classifying an observation from pi_2 is P(2|2) = ∫_{R2} p2(x) dx, and the probability of misclassifying an observation from pi_2 is P(1|2) = ∫_{R1} p2(x) dx.

So we can now consider the expected loss from the costs of misclassification; call it E, with E = c(2|1) P(2|1) q1 + c(1|2) P(1|2) q2. Let me explain this: if the observation is from pi_1, the prior probability of that population is q1, the probability that we incorrectly classify it into pi_2 is P(2|1), and we then incur the cost c(2|1). Similarly, if the population is actually pi_2, the prior probability is q2, the probability that we misclassify it into pi_1 is P(1|2), and the cost of misclassifying an observation from pi_2 into pi_1 is c(1|2). So this becomes the expected loss. We now want a procedure R, dividing the sample space of x (I have deliberately not fixed its dimension here) into R1 and R2, for which this expected loss is small.
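As an illustration of the expected-cost formula, here is a hedged numerical sketch (the populations, priors and costs are invented for illustration; the particular partition used in in_R1 anticipates the Bayes rule to be derived in the next lecture, and any candidate partition could be substituted):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Two hypothetical bivariate normal populations pi_1 and pi_2
p1 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
p2 = multivariate_normal(mean=[2.0, 1.0], cov=np.eye(2))
q1, q2 = 0.25, 0.75      # prior probabilities, q1 + q2 = 1
c21, c12 = 1.0, 2.0      # c(2|1) and c(1|2), costs of misclassification

def in_R1(x):
    # Candidate rule: classify into pi_1 where q1 c(2|1) p1(x) >= q2 c(1|2) p2(x)
    return q1 * c21 * p1.pdf(x) >= q2 * c12 * p2.pdf(x)

# Monte Carlo estimates of P(2|1) and P(1|2) for this partition
m = 100_000
x_from_1 = p1.rvs(m, random_state=rng)
x_from_2 = p2.rvs(m, random_state=rng)
P2_given_1 = 1.0 - in_R1(x_from_1).mean()   # mass of p1 falling in R2
P1_given_2 = in_R1(x_from_2).mean()         # mass of p2 falling in R1

E = c21 * P2_given_1 * q1 + c12 * P1_given_2 * q2
print(f"P(2|1) ~ {P2_given_1:.3f}, P(1|2) ~ {P1_given_2:.3f}, E ~ {E:.3f}")
```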
A procedure R that divides the sample space of x into R1 and R2 such that E is minimized for the given q1 and q2 is called a Bayes procedure, and we will see how to obtain a Bayes procedure. There is another situation, when there is no prior information; then I have two different quantities, the probability of misclassification from the first population and the probability of misclassification from the second. So let me define those as well. When there is no prior information about the probabilities of the populations, we consider two terms: the expected loss if the observation is from pi_1, which we call r(R, 1) = c(2|1) P(2|1), and similarly the expected loss if the observation is from pi_2, r(R, 2) = c(1|2) P(1|2). We can then give some decision-theoretic definitions, which I will explain in the next lecture: what it means for a procedure to be better than another procedure, or as good as another procedure, an admissible procedure, a minimax procedure. We will show that when the prior probabilities are known, the Bayes procedure can be determined, and when the prior probabilities are not known, we will try to find the minimax procedure. We will also develop the procedures for classification into multivariate normal populations. These things we will be covering in the following lecture.