In the last few lectures, I introduced the problem of classifying an observation into one of two populations and discussed various procedures. In particular, I showed that one can define Bayes classification rules, or what you may call good classification rules; we call them admissible rules, and the class of such rules forms a minimal complete class, that is, the rules beyond which you need not look. In particular, we considered applications to the classification of an observation into two multivariate normal populations: first the case when all the parameters are known, and then the case when the parameters are unknown.

Now I will generalize this to the problem of classifying one observation into one of several populations, so let me introduce the concept of optimal rules here and how to derive them. Suppose we have $m$ populations $\pi_1, \pi_2, \ldots, \pi_m$ with associated density functions $p_1(x), p_2(x), \ldots, p_m(x)$. We wish to specify $m$ mutually exclusive and exhaustive regions of the sample space, say $R_1, R_2, \ldots, R_m$: if the observation $x$ belongs to $R_i$, we classify $x$ into the $i$th population, for $i = 1, \ldots, m$.

We can also bring in the cost of misclassification, as we have done earlier. The cost of misclassifying an observation which is actually from $\pi_i$ into $\pi_j$ is denoted $c(j \mid i)$; alongside it we have $P(j \mid i, R)$, the probability that the classification rule $R$ puts into $\pi_j$ an observation which is actually from $\pi_i$. Be careful to distinguish the two: $P(j \mid i, R)$ is the probability of misclassification, while $c(j \mid i)$ is the cost.

If you remember the case of classification into two populations, I considered the situation where the initial proportions of the populations were fixed as $q_1$ and $q_2$ with $q_1 + q_2 = 1$. In a similar way, with $m$ populations I may consider the case when the initial proportions of these populations are known; we call them prior probabilities. Suppose $q_1, q_2, \ldots, q_m$ are the prior probabilities of $\pi_1, \pi_2, \ldots, \pi_m$ respectively, so that $0 < q_i < 1$ and $\sum_{i=1}^{m} q_i = 1$.

Now, $c(j \mid i)\, P(j \mid i, R)$ is the expected cost of misclassifying an observation from $\pi_i$ into $\pi_j$. An observation from the $i$th population may be put into any of the remaining populations, so we sum over $j = 1, \ldots, m$, $j \neq i$:

$$\sum_{\substack{j=1 \\ j \neq i}}^{m} c(j \mid i)\, P(j \mid i, R).$$
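To make the bookkeeping concrete, here is a minimal numerical sketch in Python; the array names `q`, `C`, `P` and the function `expected_loss` are my own illustration (with made-up numbers), not notation from the lecture. It evaluates the conditional expected loss for each population and, anticipating the next step, the total expected loss once the priors are brought in.

```python
import numpy as np

# Illustrative setup for m = 3 populations (all numbers are made up):
# C[j, i] = c(j|i), the cost of classifying into pi_j an observation from pi_i;
# P[j, i] = P(j|i, R), the probability that rule R classifies into pi_j an
#           observation actually from pi_i (each column sums to 1 over j).
q = np.array([0.5, 0.3, 0.2])                   # prior probabilities q_i
C = np.array([[0, 1, 2],
              [1, 0, 1],
              [2, 1, 0]], dtype=float)
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.7]])

def expected_loss(q, C, P):
    """Conditional losses r(R,i) = sum_{j != i} c(j|i) P(j|i,R), and the total."""
    m = len(q)
    mask = ~np.eye(m, dtype=bool)               # drop the correct-classification terms
    r = np.where(mask, C * P, 0.0).sum(axis=0)  # r[i], summing over j != i
    return r, float(q @ r)

r, total = expected_loss(q, C, P)
print(r, total)
```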
So the observation is from the $i$th population and can be put into any population other than the $i$th one; the sum above is the total expected loss arising from $\pi_i$. Multiplying by the proportion $q_i$ of the $i$th population and summing over $i = 1, \ldots, m$, the expected loss using the classification rule $R$ is

$$\sum_{i=1}^{m} q_i \sum_{\substack{j=1 \\ j \neq i}}^{m} c(j \mid i)\, P(j \mid i, R). \qquad (2)$$

Compare this with the expression we discussed earlier for two populations, so it will be clear how this has been obtained. There we had only two cost values, $c(2 \mid 1)$ and $c(1 \mid 2)$; here we have $c(j \mid i)$ for all $j \neq i$ and all $i$, that is, $m(m-1)$ values in place of two. Likewise, in place of the two probabilities $P(2 \mid 1, R)$ and $P(1 \mid 2, R)$ we now have $m(m-1)$ of them, and the expected loss of misclassification, which was

$$c(2 \mid 1)\, P(2 \mid 1, R)\, q_1 + c(1 \mid 2)\, P(1 \mid 2, R)\, q_2,$$

becomes the double sum above: an observation from the $i$th population can get into any population other than the $i$th, you add all such costs, weight by the prior probability of the $i$th population, and sum over $i$. So, as you can see, the expression becomes much more complex; however, the aim remains the same, namely to find a rule $R$ so that the expected loss (2) is minimized.

Now, just as in the two-population case we considered $q_1 p_1(x) / (q_1 p_1(x) + q_2 p_2(x))$, we can consider the conditional probability that the observation comes from $\pi_i$ given the observed value of the components of $x$:

$$\frac{q_i\, p_i(x)}{\sum_{k=1}^{m} q_k\, p_k(x)};$$

earlier the denominator had only the two terms, but now it has all $m$ terms. If we classify the observation as being from $\pi_j$, then the expected loss is

$$\sum_{\substack{i=1 \\ i \neq j}}^{m} \frac{q_i\, p_i(x)}{\sum_{k=1}^{m} q_k\, p_k(x)}\; c(j \mid i). \qquad (4)$$

We minimize the expected loss if we choose $j$ so that (4) is minimized; since the denominator is common for all $j$, we can simply consider $\sum_{i \neq j} q_i\, p_i(x)\, c(j \mid i)$ and select the $j$ that gives the minimum. In principle, this is a direct extension of the case of two populations. Of course, there may be a case when two different $j$'s give the same value; then it does not matter which one you choose, because either gives the same minimum.
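As a sketch of this minimization (again with illustrative names; `densities` stands for whatever densities $p_i(\cdot)$ are at hand), the Bayes assignment just described takes only a few lines: for a given $x$, compute $\sum_{i \neq j} q_i\, p_i(x)\, c(j \mid i)$ for each $j$ and take the minimizer.

```python
import numpy as np

def bayes_assign(x, densities, q, C):
    """Return the j minimizing sum_{i != j} q_i * p_i(x) * c(j|i).

    densities : list of callables, densities[i](x) = p_i(x)
    q         : prior probabilities; C[j, i] = c(j|i)
    The common denominator sum_k q_k p_k(x) is omitted, as in the lecture.
    """
    m = len(q)
    p = np.array([densities[i](x) for i in range(m)])
    scores = [sum(q[i] * p[i] * C[j, i] for i in range(m) if i != j)
              for j in range(m)]
    return int(np.argmin(scores))  # ties may be broken either way: same minimum value
```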
This procedure assigns $x$ to a region $R_j$. So we have the following result: if $q_1, q_2, \ldots, q_m$ are the prior probabilities of $\pi_1, \ldots, \pi_m$ and the cost function $c(j \mid i)$ is as above, then the region $R_k$ is given by

$$R_k:\; \sum_{\substack{i=1 \\ i \neq k}}^{m} q_i\, p_i(x)\, c(k \mid i) \;<\; \sum_{\substack{i=1 \\ i \neq j}}^{m} q_i\, p_i(x)\, c(j \mid i), \qquad j = 1, \ldots, m,\; j \neq k.$$

That is, if the $k$th sum is the minimum, you classify $x$ into $\pi_k$. I will not go through the proof; it is essentially the generalization of the proof for the two-population case. If I take any other rule, I can write down the difference between the expected losses of the two rules and, as in the two-population case, check the inequality term by term; it follows immediately, so I am skipping it here.

As in the case of two populations, we can consider optimality criteria such as admissibility, Bayes rules, etcetera; let me formally define them here. The conditional expected loss when the observation comes from $\pi_i$ is the term I wrote earlier, for which I will use the notation $r(R, i)$:

$$r(R, i) = \sum_{\substack{j=1 \\ j \neq i}}^{m} c(j \mid i)\, P(j \mid i, R).$$

A procedure $R$ is at least as good as a procedure $R^*$ if and only if $r(R, i) \leq r(R^*, i)$ for $i = 1, \ldots, m$. If strict inequality holds for at least some $i$, then $R$ is said to be better than $R^*$. $R$ is said to be admissible if there is no procedure better than $R$. A class of procedures is said to be complete if, for any rule not in that class, there is a rule within the class that is better. These definitions are similar to the ones I gave for the case of two populations, and we have the analogous result: if the $q_i$ are all positive, then a Bayes procedure is admissible.

Once again the proof is almost the same; let me at least exhibit this one. Let $R$ be the Bayes procedure and $R^*$ be any other procedure. The Bayes procedure gives expected loss less than or equal to that of $R^*$:

$$\sum_{i=1}^{m} q_i\, r(R, i) \;\leq\; \sum_{i=1}^{m} q_i\, r(R^*, i).$$

Suppose now that $R^*$ were better than $R$, so that $r(R^*, i) \leq r(R, i)$ for all $i$ with strict inequality for at least one value; for the time being, say this holds for $i = 2, \ldots, m$, with strict inequality for at least one of them. Substituting into the Bayes inequality and rearranging,

$$q_1 \bigl( r(R, 1) - r(R^*, 1) \bigr) \;\leq\; \sum_{i=2}^{m} q_i \bigl( r(R^*, i) - r(R, i) \bigr).$$

On the right-hand side, every term is less than or equal to zero, at least one is strictly negative, and all the $q_i$ are positive; therefore the right-hand side is strictly less than 0. This implies $r(R, 1) < r(R^*, 1)$, which contradicts $r(R^*, 1) \leq r(R, 1)$. So $R^*$ cannot be better than $R$; that means the Bayes rule $R$ must be admissible.
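Before moving on, here is a small Monte Carlo sketch of these definitions, assuming SciPy is available; `rule` is any function mapping an observation to a label, and all names are my own. It estimates the conditional expected losses $r(R, i)$, so that two rules can be compared component-wise as in the definition of "at least as good".

```python
import numpy as np
from scipy.stats import multivariate_normal

def conditional_risks(rule, populations, C, n_sim=20_000, seed=0):
    """Estimate r(R, i) = sum_{j != i} c(j|i) P(j|i, R) by simulation.

    rule        : callable, rule(x) -> label in {0, ..., m-1}
    populations : list of frozen scipy distributions to sample from
    """
    rng = np.random.default_rng(seed)
    m = len(populations)
    risks = np.zeros(m)
    for i, pop in enumerate(populations):
        draws = pop.rvs(size=n_sim, random_state=rng)
        labels = np.array([rule(x) for x in draws])
        for j in range(m):
            if j != i:
                risks[i] += C[j, i] * np.mean(labels == j)
    return risks
# R is at least as good as R* when conditional_risks(R, ...) <= conditional_risks(R*, ...)
# componentwise; a strict inequality somewhere makes R better.
```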
You can see that the proof is similar to the case of two populations, $m = 2$: there I took strict inequality only for $i = 2$ and obtained the reverse inequality for $i = 1$, and similarly for the other case. If the cost functions are given, then also the Bayes procedures are admissible. And the converse of this result is true as well: every admissible procedure is a Bayes procedure. I will not give the proof of this either, as it is similar to the case of two populations.

Now let us consider the classification of multivariate normal populations, the problem we treated for the case $m = 2$: classifying an observation into one of several multivariate normal populations. The populations are $\pi_1, \pi_2, \ldots, \pi_m$, where $\pi_i$ is the population $N_p(\mu_i, \Sigma)$. Let us define the function

$$u_{jk}(x) = \log \frac{p_j(x)}{p_k(x)}, \qquad p_j(x) = \frac{1}{(2\pi)^{p/2}\, |\Sigma|^{1/2}} \exp\Bigl\{ -\tfrac{1}{2}\, (x - \mu_j)'\, \Sigma^{-1} (x - \mu_j) \Bigr\}, \quad j = 1, \ldots, m,$$

where I am assuming $\Sigma$ to be positive definite so that the density exists. In the ratio $p_j(x)/p_k(x)$ the normalizing constant cancels out; the quadratic form corresponding to $k$ appears with a plus sign and the one corresponding to $j$ with a minus sign, and taking the logarithm removes the exponential. So we get

$$u_{jk}(x) = \tfrac{1}{2}\Bigl[ (x - \mu_k)'\, \Sigma^{-1} (x - \mu_k) - (x - \mu_j)'\, \Sigma^{-1} (x - \mu_j) \Bigr] = \Bigl( x - \tfrac{1}{2}(\mu_j + \mu_k) \Bigr)' \Sigma^{-1} (\mu_j - \mu_k)$$

after simplification. Now apply the earlier result on the regions of classification, taking the cost functions $c(j \mid i)$ to be equal. The Bayes classification rule becomes: classify the observation into the $j$th population, i.e. into $R_j$, if

$$u_{jk}(x) > \log \frac{q_k}{q_j} \qquad \text{for all } k = 1, \ldots, m,\; k \neq j.$$

Notice that the functions $u_{jk}$ satisfy the symmetry property $u_{kj}(x) = -u_{jk}(x)$. Now, what kind of regions are these? If you look carefully, since $\bigl( x - \tfrac{1}{2}(\mu_j + \mu_k) \bigr)' \Sigma^{-1} (\mu_j - \mu_k)$ is linear in $x$, each condition $u_{jk}(x) > \text{constant}$ gives a half-space bounded by a hyperplane. So the $R_i$ are regions bounded by hyperplanes: if the means span an $(m-1)$-dimensional hyperplane, then each $R_i$ is bounded by $m - 1$ hyperplanes.
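Here is a direct sketch of this rule for $N_p(\mu_i, \Sigma)$ populations (function names are illustrative): it computes $u_{jk}(x)$ from the simplified linear form and classifies into $\pi_j$ when $u_{jk}(x) > \log(q_k/q_j)$ for every $k \neq j$.

```python
import numpy as np

def u_jk(x, mu_j, mu_k, Sigma_inv):
    """u_jk(x) = (x - (mu_j + mu_k)/2)' Sigma^{-1} (mu_j - mu_k)."""
    return (x - 0.5 * (mu_j + mu_k)) @ Sigma_inv @ (mu_j - mu_k)

def classify_equal_cov(x, mus, Sigma, q):
    """Classify x into pi_j if u_jk(x) > log(q_k / q_j) for all k != j."""
    Sigma_inv = np.linalg.inv(Sigma)
    m = len(mus)
    for j in range(m):
        if all(u_jk(x, mus[j], mus[k], Sigma_inv) > np.log(q[k] / q[j])
               for k in range(m) if k != j):
            return j
    return None  # x lies on a boundary hyperplane (a zero-probability event)
```

Each inequality $u_{jk}(x) > \log(q_k/q_j)$ is a half-space, which is exactly the hyperplane-bounded picture described above.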
If the prior probabilities are not given, then in place of $\log q_k - \log q_j$ we can put some other values, and in order to maintain some symmetry of representation, look at what $\log q_k - \log q_j$ actually is: since the $q$'s are probabilities lying between 0 and 1, their logarithms are negative. So we can write $\log(q_k / q_j) = (-\log q_j) - (-\log q_k)$; with the minus signs, we can express the cut-offs in terms of non-negative values. If no prior probabilities of the populations are assumed, we can consider

$$R_j:\; u_{jk}(x) \geq c_j - c_k \qquad \text{for } k = 1, \ldots, m,\; k \neq j,$$

where the $c_j$'s are non-negative constants. Actually, any rule of this type is a Bayes rule: if in place of prior probabilities we put some other numbers, we can define respective prior probabilities in such a way that the constants come out exactly like this. So all such procedures give Bayes and admissible rules; basically, this is the minimal complete class of classification procedures for classifying into one of several populations. These rules are admissible, and they are also Bayes, because here the class of Bayes rules and the class of admissible rules is the same.

Now, if we want to find a minimax procedure, we can consider the probabilities of correct classification and make them equal: for a minimax procedure we may find $R$ so that the $P(i \mid i, R)$ are all equal. So let us look at the probabilities of correct classification, which we can also call PCC. If $x$ is the observation to be classified, consider the classification function we obtained,

$$u_{ji}(x) = \Bigl( x - \tfrac{1}{2}(\mu_i + \mu_j) \Bigr)' \Sigma^{-1} (\mu_j - \mu_i),$$

and of course $u_{ji}$ and $-u_{ij}$ are the same, so we do not have to consider both: we can work with $\binom{m}{2}$ classification functions. In fact $u_{jk} = u_{j1} - u_{k1}$, so when the means span an $(m-1)$-dimensional hyperplane only $m - 1$ of these functions are linearly independent.

Now, if $x$ belongs to $\pi_j$, that is, $x$ has the $N_p(\mu_j, \Sigma)$ distribution, what is the distribution of $u_{ji}$? Since $u_{ji}$ is a linear function of the normal vector $x$, we can apply the linearity property: the mean is $\bigl( \mu_j - \tfrac{1}{2}(\mu_i + \mu_j) \bigr)' \Sigma^{-1} (\mu_j - \mu_i) = \tfrac{1}{2} (\mu_j - \mu_i)'\, \Sigma^{-1} (\mu_j - \mu_i)$, and the variance is $(\mu_j - \mu_i)'\, \Sigma^{-1} (\mu_j - \mu_i)$. If I define

$$\Delta_{ji}^2 = (\mu_j - \mu_i)'\, \Sigma^{-1} (\mu_j - \mu_i),$$

which is a generalization of the Mahalanobis $D^2$ function I wrote in the case of two populations (it is the Mahalanobis squared distance between the populations $\pi_i$ and $\pi_j$), then in terms of this,

$$u_{ji} \sim N\bigl( \tfrac{1}{2} \Delta_{ji}^2,\; \Delta_{ji}^2 \bigr) \qquad \text{when } x \in \pi_j.$$

We can also look at the covariance of $u_{ji}$ and $u_{jk}$; these are scalar functions, and the covariance between them is

$$\operatorname{Cov}(u_{ji}, u_{jk}) = (\mu_j - \mu_i)'\, \Sigma^{-1} (\mu_j - \mu_k).$$

In the classification rule, when the prior probabilities are not fixed in advance, we have to determine the constants $c_j$, $c_k$, etcetera, which, as I mentioned, we can choose to be non-negative.
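These moments are exactly what is needed to write down the joint density $f_j$ used in the next step; a small sketch (illustrative names, nothing beyond the formulas above):

```python
import numpy as np

def mahalanobis_sq(mu_a, mu_b, Sigma_inv):
    """Delta^2 = (mu_a - mu_b)' Sigma^{-1} (mu_a - mu_b)."""
    d = mu_a - mu_b
    return d @ Sigma_inv @ d

def u_moments(mu_j, mu_i, mu_k, Sigma_inv):
    """Under x ~ N_p(mu_j, Sigma):
       u_ji ~ N(Delta_ji^2 / 2, Delta_ji^2),
       Cov(u_ji, u_jk) = (mu_j - mu_i)' Sigma^{-1} (mu_j - mu_k)."""
    delta2 = mahalanobis_sq(mu_j, mu_i, Sigma_inv)
    cov = (mu_j - mu_i) @ Sigma_inv @ (mu_j - mu_k)
    return 0.5 * delta2, delta2, cov   # mean, variance, covariance with u_jk
```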
So we can consider the probability of classifying into $\pi_j$ when the observation is indeed from $\pi_j$:

$$P(j \mid j, R) = \int_{c_j - c_1}^{\infty} \cdots \int_{c_j - c_{j-1}}^{\infty} \int_{c_j - c_{j+1}}^{\infty} \cdots \int_{c_j - c_m}^{\infty} f_j\bigl( u_{j1}, \ldots, u_{j,j-1}, u_{j,j+1}, \ldots, u_{jm} \bigr)\, du_{j1} \cdots du_{j,j-1}\, du_{j,j+1} \cdots du_{jm},$$

where $f_j$ is the joint density of the $u_{ji}$ for $i \neq j$, the lower limits are the $c_j - c_i$, and the upper limits are all infinity. For the minimax procedure we can then choose the $c_j$'s so that $P(j \mid j, R)$ is equal for all $j$.

Now, another situation arises if the parameters are not known. Then we can substitute estimates: for example, $\hat{\mu}_i = \bar{x}_i$, and the pooled estimator

$$\hat{\Sigma} = S = \frac{1}{\sum_{i=1}^{m} n_i - m} \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)',$$

when we have random samples $x_{i1}, x_{i2}, \ldots, x_{i n_i}$ from $\pi_i$. With these estimates the analogue of $u_{ij}$ becomes

$$w_{ij}(x) = \Bigl( x - \tfrac{1}{2}(\bar{x}_i + \bar{x}_j) \Bigr)' S^{-1} (\bar{x}_i - \bar{x}_j).$$

Now, as we discussed in the case of two normal populations, the distribution theory for this part is somewhat more complicated: the exact distributions of the $w_{ij}$ are quite difficult to obtain. However, the asymptotic theory is not difficult, because the strong law of large numbers holds: for large sample sizes, $\bar{x}_i$ converges to the corresponding $\mu_i$, $\bar{x}_j$ to $\mu_j$, $S$ to $\Sigma$, and $S^{-1}$ to $\Sigma^{-1}$ in probability, and so on. Therefore the asymptotic distribution of $w_{ij}$ is the same as that of $u_{ij}$, and the problem can be handled for large samples.
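Before taking up the unequal-covariance case, here is a sketch of the plug-in step just described (function names are my own): pool the samples to get the $\bar{x}_i$ and $S$, then evaluate $w_{ij}(x)$.

```python
import numpy as np

def pooled_estimates(samples):
    """samples[i] is an (n_i, p) array from pi_i; returns the x-bars and pooled S."""
    xbars = [Xi.mean(axis=0) for Xi in samples]
    A = sum((Xi - xb).T @ (Xi - xb) for Xi, xb in zip(samples, xbars))
    dof = sum(Xi.shape[0] for Xi in samples) - len(samples)   # sum n_i - m
    return xbars, A / dof

def w_ij(x, xbar_i, xbar_j, S_inv):
    """Plug-in analogue of u_ij, with estimates in place of mu_i, mu_j, Sigma."""
    return (x - 0.5 * (xbar_i + xbar_j)) @ S_inv @ (xbar_i - xbar_j)
```

For large $n_i$, as noted above, $w_{ij}$ behaves like $u_{ij}$.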
Now let us also go back to one of the problems I discussed earlier: classifying into two multivariate normal populations when the variance-covariance matrices are unequal. I discussed the rule when the two populations had known parameters; when $\Sigma_1$ and $\Sigma_2$ are different, I mentioned that in place of regions bounded by hyperplanes you actually get much more complicated regions, because one of the quadratic forms becomes a central chi-square and the other a non-central chi-square. Now we consider this case when $\Sigma_1$ and $\Sigma_2$ are unknown, so we have to substitute estimators; let me briefly discuss this case also.

So let $\pi_1$ be the population $N_p(\mu_1, \Sigma_1)$ and $\pi_2$ the population $N_p(\mu_2, \Sigma_2)$. Going back to the expression I derived earlier (in fact, there were powers of one-half there which I had missed at the time: it should be $|\Sigma_2|^{1/2}$ and $|\Sigma_1|^{1/2}$), the ratio of densities is

$$\frac{p_1(x)}{p_2(x)} = \frac{|\Sigma_2|^{1/2}}{|\Sigma_1|^{1/2}} \exp\Bigl\{ \tfrac{1}{2}\bigl[ (x - \mu_2)'\, \Sigma_2^{-1} (x - \mu_2) - (x - \mu_1)'\, \Sigma_1^{-1} (x - \mu_1) \bigr] \Bigr\}.$$

We can consider a likelihood ratio procedure. We have samples $x_{i1}, \ldots, x_{i n_i}$, a random sample from $\pi_i$, $i = 1, 2$, and we test the null hypothesis $H_0$: the observations $x, x_{11}, x_{12}, \ldots, x_{1 n_1}$ are from $\pi_1$ and $x_{21}, x_{22}, \ldots, x_{2 n_2}$ are from $\pi_2$; against the alternative hypothesis $H_1$: $x_{11}, \ldots, x_{1 n_1}$ are from $\pi_1$ and $x, x_{21}, \ldots, x_{2 n_2}$ are from $\pi_2$. In the likelihood ratio procedure I have to consider the maximization of the likelihood function under both the null and the alternative hypothesis. Under $H_0$ I have $n_1 + 1$ observations from $\pi_1$ and $n_2$ observations from $\pi_2$; under $H_1$ I have $n_1$ observations from $\pi_1$ and $n_2 + 1$ observations from $\pi_2$. Since all the parameters are unknown and unrestricted, this simply reduces to finding the maximum likelihood estimators in each case, and these we can write down easily, because in the multivariate normal distribution the sample mean and the sample variance-covariance matrix are the maximum likelihood estimators.

Under the null hypothesis the MLEs are (the superscript $(1)$ denoting "under $H_0$"):

$$\hat{\mu}_1^{(1)} = \frac{n_1 \bar{x}_1 + x}{n_1 + 1}, \qquad \hat{\mu}_2^{(1)} = \bar{x}_2,$$

$$\hat{\Sigma}_1^{(1)} = \frac{1}{n_1 + 1} \Bigl( A_1 + \frac{n_1}{n_1 + 1}\, (x - \bar{x}_1)(x - \bar{x}_1)' \Bigr), \qquad \hat{\Sigma}_2^{(1)} = \frac{1}{n_2} A_2,$$

where $A_i = \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'$; the numerator of $\hat{\mu}_1^{(1)}$ is the sum of all the observations from the first sample plus $x$, because under $H_0$ we are saying that $x$ comes from the first population. Under $H_1$, the alternative hypothesis, we have $n_2 + 1$ observations from the second population, so

$$\hat{\mu}_1^{(2)} = \bar{x}_1, \qquad \hat{\mu}_2^{(2)} = \frac{n_2 \bar{x}_2 + x}{n_2 + 1},$$

$$\hat{\Sigma}_1^{(2)} = \frac{1}{n_1} A_1, \qquad \hat{\Sigma}_2^{(2)} = \frac{1}{n_2 + 1} \Bigl( A_2 + \frac{n_2}{n_2 + 1}\, (x - \bar{x}_2)(x - \bar{x}_2)' \Bigr).$$

In the likelihood ratio criterion the exponential terms cancel out, and we get

$$\lambda = \frac{\bigl| \hat{\Sigma}_1^{(2)} \bigr|^{n_1/2}\, \bigl| \hat{\Sigma}_2^{(2)} \bigr|^{(n_2+1)/2}}{\bigl| \hat{\Sigma}_1^{(1)} \bigr|^{(n_1+1)/2}\, \bigl| \hat{\Sigma}_2^{(1)} \bigr|^{n_2/2}},$$

which after simplification (using the identity $|A + c\,vv'| = |A|\,(1 + c\, v' A^{-1} v)$) we can also write as

$$\lambda = \frac{\Bigl( 1 + \frac{n_2}{n_2+1} (x - \bar{x}_2)'\, A_2^{-1} (x - \bar{x}_2) \Bigr)^{(n_2+1)/2}}{\Bigl( 1 + \frac{n_1}{n_1+1} (x - \bar{x}_1)'\, A_1^{-1} (x - \bar{x}_1) \Bigr)^{(n_1+1)/2}} \cdot \frac{(n_1 + 1)^{(n_1+1)p/2}\; n_2^{\,n_2 p/2}\; |A_2|^{1/2}}{n_1^{\,n_1 p/2}\; (n_2 + 1)^{(n_2+1)p/2}\; |A_1|^{1/2}}.$$

If we consider the costs of misclassification to be the same and the prior probabilities to be equal, we can compare this ratio with 1.
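A numerically safer way to apply this criterion is on the log scale. Here is a sketch under the stated assumptions (equal costs and equal priors; `X1`, `X2` hold the two samples as rows, and the function name is my own); the $n_i/(n_i+1)$ factors inside the logarithms are the ones produced by the determinant identity used in the simplification above.

```python
import numpy as np

def log_lr(x, X1, X2):
    """Log of the likelihood ratio lambda above; classify x into pi_1 if > 0
    (equal costs and equal prior probabilities assumed)."""
    n1, p = X1.shape
    n2, _ = X2.shape
    xb1, xb2 = X1.mean(axis=0), X2.mean(axis=0)
    A1 = (X1 - xb1).T @ (X1 - xb1)            # within-sample scatter matrices
    A2 = (X2 - xb2).T @ (X2 - xb2)
    Q1 = (x - xb1) @ np.linalg.solve(A1, x - xb1)
    Q2 = (x - xb2) @ np.linalg.solve(A2, x - xb2)
    _, logdetA1 = np.linalg.slogdet(A1)
    _, logdetA2 = np.linalg.slogdet(A2)
    return ((n2 + 1) / 2 * np.log1p(n2 / (n2 + 1) * Q2)
            - (n1 + 1) / 2 * np.log1p(n1 / (n1 + 1) * Q1)
            + (n1 + 1) * p / 2 * np.log(n1 + 1) - n1 * p / 2 * np.log(n1)
            + n2 * p / 2 * np.log(n2) - (n2 + 1) * p / 2 * np.log(n2 + 1)
            + 0.5 * (logdetA2 - logdetA1))
```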
So we classify $x$ into $\pi_1$ if the ratio is more than 1; otherwise we classify $x$ into $\pi_2$. Now, this is one criterion, the likelihood ratio criterion. Another possibility is to come back to the original expression for $p_1(x)/p_2(x)$ and substitute estimates directly: in place of $\mu_2$ put $\bar{x}_2$, in place of $\Sigma_2^{-1}$ put $S_2^{-1}$, in place of $\Sigma_1^{-1}$ put $S_1^{-1}$, and in place of $\mu_1$ put $\bar{x}_1$. In both cases the exact distribution of the criterion is not easy to obtain. Since the exact distribution of the likelihood ratio criterion is quite complicated, an alternative approach is simply to plug the estimates into the quadratic function

$$(x - \bar{x}_2)'\, S_2^{-1} (x - \bar{x}_2) - (x - \bar{x}_1)'\, S_1^{-1} (x - \bar{x}_1)$$

and classify according to whether this is greater than or equal to some constant, or less than it (a small sketch of this plug-in rule appears at the end of this lecture). Once again the distribution of the criterion is quite complicated, even if I look at the asymptotic distribution for large sample sizes by applying the laws of large numbers. I have already discussed that if $x$ belongs to $\pi_1$, then asymptotically the second quadratic form is a central chi-square while the first behaves like a non-central one, so we are looking at the difference of the two, and it is going to be quite complicated; in case $x$ is from $\pi_2$, the first one is central and the second non-central. Once again, the exact distributions of these statistics are difficult to obtain. So what we are saying in particular is that when the variance-covariance matrices are unequal, classification rules can no doubt be easily written down, but obtaining desirable rules, such as a minimax procedure among them, is a difficult task, because the probabilities of correct classification or of misclassification are quite complicated.

Friends, we have now discussed many classification rules. In fact, I framed a general decision-theoretic approach to the classification problem by considering the Bayes decision rule, the criterion of admissible rules, the minimax classification procedure, the minimal complete class, etcetera. And in particular I showed applications to the classification of multivariate normal populations, both into two normal populations and into several. So I now wind up the discussion on the problem of classification. One can also consider various other classification procedures which are available nowadays, but that can be the subject of a full-fledged discussion of its own. I will move over to another topic, the problem of principal components. In the next lecture I will briefly introduce the problem of how to determine the principal components, and maybe I will also touch upon canonical correlations. That will be the topic of the next lecture.
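Finally, here is the promised sketch of the plug-in quadratic rule (names are illustrative; the cut-off $\log(|S_1|/|S_2|)$ mirrors the known-parameter rule under equal costs and equal priors, an assumption made here, and it would shift with other costs or priors).

```python
import numpy as np

def quadratic_plugin(x, X1, X2):
    """Plug-in quadratic rule: classify into pi_1 when
    (x - xb2)' S2^{-1} (x - xb2) - (x - xb1)' S1^{-1} (x - xb1) >= k,
    taking k = log(|S1|/|S2|), the constant implied by the density ratio
    when costs and priors are equal (assumed here for illustration)."""
    xb1, xb2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)             # sample covariance estimates
    S2 = np.cov(X2, rowvar=False)
    Q1 = (x - xb1) @ np.linalg.solve(S1, x - xb1)
    Q2 = (x - xb2) @ np.linalg.solve(S2, x - xb2)
    _, logdetS1 = np.linalg.slogdet(S1)
    _, logdetS2 = np.linalg.slogdet(S2)
    return 1 if Q2 - Q1 >= logdetS1 - logdetS2 else 2
```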