In the previous lecture I discussed the problem of classifying an observation into one of two multivariate normal populations under the assumption that all the parameters are known. I treated one case in detail, where the populations were N_p(μ₁, Σ) and N_p(μ₂, Σ), that is, the covariance matrices were assumed equal and known. In that case we were able to derive the distribution of the discriminant function and the probabilities of misclassification; the exact form of the rule was quite convenient to obtain. I also discussed the case Σ₁ ≠ Σ₂: there the same methodology no longer gives a linear discriminant function, the distribution involves central and non-central chi-square distributions, and it becomes rather complicated. The form of the rule can still be written down, but if we want to study its properties, or actually derive the minimax rule, that is much harder than in the case Σ₁ = Σ₂. However, in most practical situations the parameters of the populations are not known. I gave the example of classifying a disease: from their experience, medical practitioners may be able to specify the mean vectors and covariance matrix of each group. But there are many other classification problems, for example classifying a land area, or classifying economic conditions into low-income and high-income groups, where we may be willing to assume a multivariate normal distribution but cannot say what its parameters are. In that case we handle the problem by substituting estimates of the parameters into the discriminant function. This procedure was proposed by Fisher in 1936, and he called the result the linear discriminant function; essentially he used the same function I described in the previous lecture, but with estimates substituted. So let me specify the problem. We want to classify an observation x into one of two multivariate normal populations whose parameters are unknown, with π₁ = N_p(μ₁, Σ) and π₂ = N_p(μ₂, Σ). Since the parameters are unknown, we must have some information on the populations in the form of samples; these are called training samples. From π₁ we have a random sample x₁ⱼ, j = 1, ..., n₁, from N_p(μ₁, Σ); the sample sizes may be equal or unequal, so we allow unequal sizes n₁ and n₂. Note again that although the parameters are unknown, I have assumed the variance-covariance matrix to be common; the case of unequal covariance matrices leads to the same kind of complications as in the known-parameter problem. From π₂ we have the sample x₂ⱼ, j = 1, ..., n₂. Given these data we can consider maximum likelihood estimators, or unbiased estimators, or look at the sufficient statistics; this problem is well studied in estimation theory.
So I will not dwell too much on this and will simply write the estimators. For μ₁ and μ₂ we can take μ̂₁ = x̄₁ = (1/n₁) Σⱼ₌₁ⁿ¹ x₁ⱼ and μ̂₂ = x̄₂ = (1/n₂) Σⱼ₌₁ⁿ² x₂ⱼ. For Σ, since it is common to the two populations, we write the joint likelihood of both samples; the sufficient statistic reduces to the pooled sums of squares and cross-products, and the unbiased estimator is

S = 1/(n₁+n₂−2) [ Σⱼ₌₁ⁿ¹ (x₁ⱼ − x̄₁)(x₁ⱼ − x̄₁)' + Σⱼ₌₁ⁿ² (x₂ⱼ − x̄₂)(x₂ⱼ − x̄₂)' ].

If we divide by n₁+n₂ instead of n₁+n₂−2 we get the maximum likelihood estimator; for large samples it makes no real difference which one we take. Now recall the discriminant function I introduced in the previous lecture: classify into π₁ when

x'Σ⁻¹(μ₁ − μ₂) − ½(μ₁ + μ₂)'Σ⁻¹(μ₁ − μ₂) ≥ 0.

The left-hand side was the statistic u. If we substitute the estimates term by term — S for Σ, x̄₁ − x̄₂ for μ₁ − μ₂, and x̄₁ + x̄₂ for μ₁ + μ₂ — we obtain the statistic, call it W:

W = x'S⁻¹(x̄₁ − x̄₂) − ½(x̄₁ + x̄₂)'S⁻¹(x̄₁ − x̄₂).

The right-hand side of the rule does not depend on x, so it is the same for every observation we want to classify. This W is Fisher's linear discriminant function, proposed around 1936; Fisher derived it as the linear combination of the components that maximises the variation between the samples relative to the variation within them. W can be used as the classification criterion just as u was earlier, where the rule took the form u ≥ k versus u < k. For u it was actually proved that the rule is a Bayes rule, hence it falls in the class of admissible rules, hence it is desirable, and we could even make a minimax choice of the cut-off. W has not been derived in that fashion, mainly because p₁(x) and p₂(x) are not completely known here; we have simply substituted the estimates. Nevertheless we can expect W to behave in much the same way as u. So we use W for classification: R₁ is the region W ≥ c and R₂ is the region W < c. As I said, we do not have the optimality property proved for the known-parameter case, but since we are substituting good estimates of the parameters, we expect this rule also to be a good rule in the sense of having small probabilities of misclassification.
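To make the plug-in construction concrete, here is a minimal sketch in Python with NumPy. The names X1, X2 (training samples, one row per observation) and x (the observation to classify) are illustrative, not from the lecture.

```python
import numpy as np

def fisher_w(x, X1, X2):
    """Plug-in (Fisher) linear discriminant statistic W for one observation x.

    X1, X2 : (n1, p) and (n2, p) training samples from pi_1 and pi_2.
    Returns W = x' S^{-1}(xbar1 - xbar2) - 0.5 (xbar1 + xbar2)' S^{-1}(xbar1 - xbar2).
    """
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled, unbiased estimator of the common covariance matrix
    S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / (n1 + n2 - 2)
    d = np.linalg.solve(S, xbar1 - xbar2)        # S^{-1}(xbar1 - xbar2)
    return x @ d - 0.5 * (xbar1 + xbar2) @ d

# classify into pi_1 if fisher_w(x, X1, X2) >= c (c = 0 for equal priors/costs), else pi_2
```

Note that np.linalg.solve is used rather than forming S⁻¹ explicitly, which is the standard numerically safer choice.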
Another problem that comes to mind immediately is that, instead of one observation, we may have to classify a whole sample, as I mentioned in the known-parameter case. So in place of x we have, say, x₁, x₂, ..., x_N; let me change the sample size to capital N here to distinguish it from the small n₁ and n₂. The covariance matrix of these observations is again Σ; what is not known is whether their common mean is μ₁ or μ₂. But for estimating Σ we can make use of this third sample as well, because when we write the joint density its sum of squares and cross-products gets added in; instead of pooling two samples we pool three, which gives higher accuracy in the estimation of Σ. So suppose we want to classify the sample x₁, ..., x_N into π₁ or π₂. Define

S* = 1/(n₁+n₂+N−3) [ Σᵢ₌₁² Σⱼ₌₁ⁿⁱ (xᵢⱼ − x̄ᵢ)(xᵢⱼ − x̄ᵢ)' + Σᵣ₌₁ᴺ (xᵣ − x̄)(xᵣ − x̄)' ],

where x̄ = (1/N) Σᵣ₌₁ᴺ xᵣ. The classification criterion has the same form as before: the terms merge once we take S*⁻¹(x̄₁ − x̄₂) common, and the statistic becomes

W* = ( x̄ − ½(x̄₁ + x̄₂) )' S*⁻¹ (x̄₁ − x̄₂).

A general comment is that the probability of misclassification goes down as the sample size N increases, because the behaviour of x̄ approaches the true mean: if the true mean is μ₁ it approaches μ₁, and if the true mean is μ₂ it approaches μ₂. By the strong law of large numbers the sample mean converges to the population mean, and therefore the discrimination becomes much sharper. Now consider the distribution of the criterion, that is of W, or of W* in the sample case. The distribution of W or W* is complicated. In the known-parameter case, whichever population the observation came from, u had a univariate normal distribution and the means and variances were easy: ½Δ² and Δ² under π₁, and −½Δ² and Δ² under π₂, with Δ² known. In the present case it is not that simple: the distribution depends on n₁ and n₂ (and of course N), and on the unknown value of Δ², the Mahalanobis distance.
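Continuing the sketch above, S* and W* for a whole sample might be computed as follows; the array name X_new (the N observations to be classified together) is an assumption for illustration.

```python
import numpy as np

def w_star(X_new, X1, X2):
    """Classification statistic W* for a whole sample X_new of shape (N, p).

    Pools X1, X2 and X_new to estimate the common covariance matrix, then
    compares the mean of X_new with the two training-sample means.
    """
    n1, n2, N = len(X1), len(X2), len(X_new)
    xbar1, xbar2, xbar = X1.mean(axis=0), X2.mean(axis=0), X_new.mean(axis=0)
    pooled = ((X1 - xbar1).T @ (X1 - xbar1)
              + (X2 - xbar2).T @ (X2 - xbar2)
              + (X_new - xbar).T @ (X_new - xbar))
    S_star = pooled / (n1 + n2 + N - 3)          # pooled over the three samples
    d = np.linalg.solve(S_star, xbar1 - xbar2)
    return (xbar - 0.5 * (xbar1 + xbar2)) @ d    # assign the sample to pi_1 if >= 0
```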
I will give one representation here; in fact some work has been done by various authors on the distributions of W and W*, so let me mention a few of these facts briefly. One representation is the following. Define

y₁ = c₁ ( x − (n₁x̄₁ + n₂x̄₂)/(n₁+n₂) )  and  y₂ = c₂ (x̄₁ − x̄₂),

where c₁ = √((n₁+n₂)/(n₁+n₂+1)) and c₂ = √(n₁n₂/(n₁+n₂)). With this choice, y₁ and y₂ are independently normally distributed, and both have covariance matrix Σ. Further, E(y₁ | π₁) = c₁ n₂/(n₁+n₂) (μ₁ − μ₂) and E(y₁ | π₂) = −c₁ n₁/(n₁+n₂) (μ₁ − μ₂). The term y₂ does not involve x, so its expectation is the same under both populations: E(y₂) = c₂(μ₁ − μ₂) under both π₁ and π₂. Now let Y = (y₁, y₂) and M = Y'S⁻¹Y, with entries m₁₁, m₁₂, m₂₁, m₂₂. Then W can be written as

W = √((n₁+n₂+1)/(n₁n₂)) m₁₂ + (n₁−n₂)/(2n₁n₂) m₂₂.

The density of M was studied by Sitgreaves in 1952; Anderson in 1951, and of course Wald in 1944, studied the special case of equal sample sizes, where some simplification occurs: with n₁ = n₂ the second term above simply vanishes, the coefficients become symmetric, and things become much simpler. So let us discuss this case separately. When the sample sizes are equal, the distribution of W for x from π₁ is the same as that of −W for x from π₂. So if we take W ≥ 0 as the region of classification into π₁ and W < 0 as the region of classification into π₂, then P(2|1) = P(1|2); that is, the probability of misclassifying x when it is from π₁ equals the probability of misclassifying x when it is from π₂. So this case is simpler, and the resulting rule is a reasonable one, because the probability of misclassification is the same whichever population the observation comes from. The exact distribution is still quite complicated, but look at the expression for W: by the strong law of large numbers x̄₁ converges to μ₁, x̄₂ converges to μ₂, S converges to Σ, and so on.
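As a sanity check on this representation, the following snippet verifies numerically that the m₁₂/m₂₂ expression reproduces W on simulated data; all the sample sizes and the random data below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n1, n2 = 3, 8, 5
X1 = rng.normal(size=(n1, p)) + 1.0   # stand-in sample from pi_1
X2 = rng.normal(size=(n2, p))         # stand-in sample from pi_2
x = rng.normal(size=p)                # observation to classify

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / (n1 + n2 - 2)

c1 = np.sqrt((n1 + n2) / (n1 + n2 + 1))
c2 = np.sqrt(n1 * n2 / (n1 + n2))
y1 = c1 * (x - (n1 * xbar1 + n2 * xbar2) / (n1 + n2))
y2 = c2 * (xbar1 - xbar2)
Y = np.column_stack([y1, y2])
M = Y.T @ np.linalg.solve(S, Y)       # M = Y' S^{-1} Y, a 2 x 2 matrix

d = np.linalg.solve(S, xbar1 - xbar2)
w_direct = x @ d - 0.5 * (xbar1 + xbar2) @ d
w_repr = np.sqrt((n1 + n2 + 1) / (n1 * n2)) * M[0, 1] + (n1 - n2) / (2 * n1 * n2) * M[1, 1]
print(np.isclose(w_direct, w_repr))   # True: the two expressions for W agree
```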
We can also look at convergence in probability, that is the weak law of large numbers. Convergence in law, convergence in probability and almost sure convergence (the strong law) are all preserved under the usual algebraic operations — sums, products and so on. That means W converges exactly to u, the statistic I wrote when the parameters were known; in fact it converges in probability, and indeed it converges with probability one. If that happens, then for large sample sizes the distribution of W is almost the same as the distribution of u, and for u the probabilities of misclassification have already been worked out, so that case is settled. So let us look at the asymptotic distribution of W. By the laws of large numbers, x̄₁ → μ₁ in probability, x̄₂ → μ₂ in probability, and S → Σ in probability as n₁, n₂ → ∞; consequently S⁻¹ → Σ⁻¹ in probability. Since convergence in probability is preserved under algebraic operations, S⁻¹(x̄₁ − x̄₂) → Σ⁻¹(μ₁ − μ₂) and (x̄₁ + x̄₂)'S⁻¹(x̄₁ − x̄₂) → (μ₁ + μ₂)'Σ⁻¹(μ₁ − μ₂) as n₁, n₂ → ∞. So the limiting distribution of W is the same as that of u. For large samples, then, we behave as if the parameters of the populations were known, and the probabilities of misclassification are close to those of the known-parameter case; we are not going to do much worse here, it will be almost the same. There are other derivations of the criterion, for example one based on a regression criterion. There is also a criterion called the likelihood ratio criterion, which follows the likelihood ratio test procedure. If you remember, we wrote the rule p₁(x)/p₂(x) ≥ k. Recall the Neyman-Pearson lemma for the simple hypothesis testing problem: if H₀ specifies the density p₀ and H₁ specifies the density p₁, the most powerful test has acceptance and rejection regions based on the ratio of the two densities — reject when p₁/p₀ is greater than some constant. So the form is the same. In the hypothesis testing procedure, of course, the constant k is chosen to fix the size of the test, and subject to that we maximise the power. The reason is that the testing problem is interpreted differently: there we have probabilities of type I and type II error, and we do not treat them symmetrically — we fix the probability of type I error at some α and then try to minimise the probability of type II error.
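The asymptotic claim is easy to illustrate by simulation. The sketch below compares the estimated P(2|1) of the plug-in rule W with that of the known-parameter rule u; the particular μ₁, μ₂, Σ and sample sizes are arbitrary assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def disc(x, m1, m2, Vinv):
    # common form of u (known parameters) and W (plug-in estimates)
    return x @ Vinv @ (m1 - m2) - 0.5 * (m1 + m2) @ Vinv @ (m1 - m2)

n1 = n2 = 200                       # training sizes; increase them to see the rates agree
X1 = rng.multivariate_normal(mu1, Sigma, size=n1)
X2 = rng.multivariate_normal(mu2, Sigma, size=n2)
xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / (n1 + n2 - 2)
S_inv = np.linalg.inv(S)

X_test = rng.multivariate_normal(mu1, Sigma, size=20000)    # test points truly from pi_1
err_W = np.mean([disc(x, xbar1, xbar2, S_inv) < 0 for x in X_test])
err_u = np.mean([disc(x, mu1, mu2, Sigma_inv) < 0 for x in X_test])
print(err_W, err_u)   # estimated P(2|1) under the plug-in and the known-parameter rules
```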
Here that criterion is not used; rather, we look at the probabilities of misclassification: in the Bayes rule we minimise the expected probability of misclassification, and in the minimax rule we equate the two probabilities. So although the form of the rule is the same, we are not proceeding in the same way. Another procedure from testing is the likelihood ratio, and that procedure can also be adopted here, so I will derive the classification procedure based on the likelihood ratio criterion. In this criterion, if x is from π₁ then the observations x, x₁₁, ..., x₁ₙ₁ are from π₁ and x₂₁, x₂₂, ..., x₂ₙ₂ are from π₂; call this the null hypothesis H₀. If x is from π₂ then x₁₁, ..., x₁ₙ₁ are from π₁ and x, x₂₁, ..., x₂ₙ₂ are from π₂; call this the alternative hypothesis H₁. Note that both hypotheses are composite: π₁ is N_p(μ₁, Σ) and π₂ is N_p(μ₂, Σ), and μ₁, μ₂ and Σ are all unknown. In the likelihood ratio criterion with composite null and alternative hypotheses, we maximise the likelihood function over the null hypothesis space and over the alternative hypothesis space, take the ratio of the two maximised likelihoods, and compare it with a constant. Maximising the likelihood under a hypothesis is equivalent to finding the maximum likelihood estimators under that hypothesis. So consider H₀ first, and look at the problem carefully: x, x₁₁, ..., x₁ₙ₁ is a sample of n₁+1 observations from N_p(μ₁, Σ), so the likelihood is the joint likelihood of these n₁+1 observations together with the joint likelihood of the n₂ observations from the second population. The maximum likelihood estimators of μ₁, μ₂ and Σ under H₀ are then

μ̂₁⁽¹⁾ = (n₁x̄₁ + x)/(n₁+1), the mean of the n₁+1 observations,
μ̂₂⁽¹⁾ = x̄₂, the mean of the n₂ observations,

and

Σ̂⁽¹⁾ = 1/(n₁+n₂+1) [ Σᵢ₌₁² Σⱼ₌₁ⁿⁱ (xᵢⱼ − μ̂ᵢ⁽¹⁾)(xᵢⱼ − μ̂ᵢ⁽¹⁾)' + (x − μ̂₁⁽¹⁾)(x − μ̂₁⁽¹⁾)' ],

where the extra term appears because of the observation x.
If we substitute these values — x̄₁ in the first group and x̄₂ in the second — there is some simplification and the estimator can be written as

Σ̂⁽¹⁾ = 1/(n₁+n₂+1) [ Σᵢ₌₁² Σⱼ₌₁ⁿⁱ (xᵢⱼ − x̄ᵢ)(xᵢⱼ − x̄ᵢ)' + n₁(x̄₁ − μ̂₁⁽¹⁾)(x̄₁ − μ̂₁⁽¹⁾)' + (x − μ̂₁⁽¹⁾)(x − μ̂₁⁽¹⁾)' ].

Call the first double sum A. Now look at the nature of the last two terms: both are outer products of deviations from μ̂₁⁽¹⁾, one weighted by n₁. If we expand and combine them, the μ̂₁⁽¹⁾μ̂₁⁽¹⁾' contributions collect with weight n₁+1, and after carrying out the division we can express the whole thing as

Σ̂⁽¹⁾ = 1/(n₁+n₂+1) [ A + n₁/(n₁+1) (x − x̄₁)(x − x̄₁)' ].

So we are able to write down the maximum likelihood estimator in a convenient form. We spent some time just expressing it this way — the earlier form is also correct — but this form is helpful because ultimately we have to take ratios. Under H₁ the maximum likelihood estimators of μ₁, μ₂ and Σ are, in the same way, μ̂₁⁽²⁾ = x̄₁, μ̂₂⁽²⁾ = (n₂x̄₂ + x)/(n₂+1), and

Σ̂⁽²⁾ = 1/(n₁+n₂+1) [ A + n₂/(n₂+1) (x − x̄₂)(x − x̄₂)' ].

Now in the likelihood ratio criterion we take the ratio of the maximised joint likelihood functions. If you look at the exponent, after substituting the estimators it becomes e^{−(n₁+n₂+1)p/2} in both numerator and denominator, so it cancels. What remains is the determinant of the estimated covariance matrix in the denominator of each density, raised to the power (n₁+n₂+1)/2. So the likelihood ratio is

λ = L̂(H₀)/L̂(H₁) = ( |Σ̂⁽²⁾| / |Σ̂⁽¹⁾| )^{(n₁+n₂+1)/2},

and substituting the expressions for Σ̂⁽¹⁾ and Σ̂⁽²⁾ above, using the determinant identity |A + vv'| = |A|(1 + v'A⁻¹v), this becomes

λ = [ (1 + (n₂/(n₂+1))(x − x̄₂)'A⁻¹(x − x̄₂)) / (1 + (n₁/(n₁+1))(x − x̄₁)'A⁻¹(x − x̄₁)) ]^{(n₁+n₂+1)/2}.

In place of A we can actually use the matrix S which we derived earlier.
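As a rough sketch under the same assumed array names as before, λ can be computed directly from the two quadratic forms:

```python
import numpy as np

def likelihood_ratio(x, X1, X2):
    """Likelihood ratio lambda = Lhat(H0)/Lhat(H1) for classifying x.

    Large values favour H0 (x from pi_1): classify into pi_1 if lambda >= c.
    """
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    A = (X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)
    q1 = (x - xbar1) @ np.linalg.solve(A, x - xbar1)   # (x - xbar1)' A^{-1} (x - xbar1)
    q2 = (x - xbar2) @ np.linalg.solve(A, x - xbar2)   # (x - xbar2)' A^{-1} (x - xbar2)
    ratio = (1 + n2 / (n2 + 1) * q2) / (1 + n1 / (n1 + 1) * q1)
    return ratio ** ((n1 + n2 + 1) / 2)
```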
Let me just show the S term again: S = A/(n₁+n₂−2), so A = (n₁+n₂−2)S. If we use this expression, the ratio inside λ can be written as

[ n₁+n₂−2 + (n₂/(n₂+1))(x − x̄₂)'S⁻¹(x − x̄₂) ] / [ n₁+n₂−2 + (n₁/(n₁+1))(x − x̄₁)'S⁻¹(x − x̄₁) ].

So for the classification region R₁ we take this ratio, call it λ, greater than or equal to some value c. This can also be shown to be related to the earlier criterion in terms of W, W = x'S⁻¹(x̄₁ − x̄₂) − ½(x̄₁ + x̄₂)'S⁻¹(x̄₁ − x̄₂): when n₁ and n₂ are large the rule λ ≥ c is equivalent to W ≥ c* for a suitable c*. If we take c = 1 the rule is called the ML rule, the maximum likelihood rule: we simply choose the hypothesis under which the maximised likelihood is larger. Equivalently, define

Z = ½ [ (n₂/(n₂+1))(x − x̄₂)'S⁻¹(x − x̄₂) − (n₁/(n₁+1))(x − x̄₁)'S⁻¹(x − x̄₁) ];

then the rule is: region R₁ if Z ≥ 0 and R₂ if Z < 0. We can think of these quadratic forms as estimated distances: the second term is the estimated distance of x from π₁, based on the sample from that population, and the first is the estimated distance of x from π₂. So we are saying that if the distance from π₂ is larger than the distance from π₁ we put x into π₁, and vice versa. The rule is straightforward and quite heuristic. Moreover W and Z are not much different: their difference is

W − Z = ½ [ (1/(n₂+1))(x − x̄₂)'S⁻¹(x − x̄₂) − (1/(n₁+1))(x − x̄₁)'S⁻¹(x − x̄₁) ],

which converges to 0 in probability as n₁, n₂ → ∞. So the asymptotic probabilities of misclassification are the same whether we use W or Z; we can use either of these statistics. W and Z are also invariant under translation: if we shift all the observations by the same vector, nothing changes. So these rules are translation invariant, which is another plus point. In the next lecture I will discuss the criteria for classification into several populations, and we will also spend some time on principal component analysis and canonical correlations; those are the remaining topics.
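A corresponding sketch of the ML rule through Z, with the same assumed array names, is given below; one can check numerically that it agrees in sign with W for moderate sample sizes.

```python
import numpy as np

def ml_rule_z(x, X1, X2):
    """Z statistic of the maximum likelihood rule: assign x to pi_1 if Z >= 0.

    Z is half the difference between the estimated (scaled) squared distances
    of x from pi_2 and from pi_1, using the pooled covariance estimator S.
    """
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)) / (n1 + n2 - 2)
    q1 = (x - xbar1) @ np.linalg.solve(S, x - xbar1)
    q2 = (x - xbar2) @ np.linalg.solve(S, x - xbar2)
    return 0.5 * (n2 / (n2 + 1) * q2 - n1 / (n1 + 1) * q1)
```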
The problem of classification has been studied in great detail, but in this course we cover only the most popular setting, the one based on normal distributions. Results are available for other populations as well, and there is also current work on the case where there are restrictions on the parameter spaces. For example, in the two normal populations one may impose conditions such as all components of μ being the same. There can be several such cases: suppose you have two univariate populations N(μ₁, σ₁²) and N(μ₂, σ₂²) and some additional information of the form μ₁ ≤ μ₂, or σ₁² ≤ σ₂², or μ₁ = μ₂ together with σ₁² ≤ σ₂² — in each case, what would the classification rule be? Similarly one can consider exponential populations. Such problems are currently being studied by various researchers. So in this course we have given the gist, you can say the basic criteria, of how such classification procedures can be derived, and we have also given some optimality criteria. In the next lecture I will wind up this portion, the problem of classification, and give a glimpse of two other problems: principal component analysis and canonical correlations. Those we will cover in the next lecture.