Friends, yesterday I discussed in detail the procedures for classification into two populations. We gave a general framework: one is the Bayesian framework, and the other is the minimax method. We assumed probability densities p₁(x) and p₂(x) for the two populations, and we showed that if we know in advance the proportions of the two populations — that is, how many observations actually belong to the first population and how many to the second, so that we can assign a priori probabilities q₁ and q₂ — then we can develop a Bayes procedure, one which minimizes the expected probability of misclassification. We also introduced the concepts of admissible procedures, minimax procedures, and so on. In particular, we proved that every Bayes procedure is admissible and every admissible procedure is Bayes, and therefore the class of all Bayes procedures is a minimal complete class. In particular, a member of the class of Bayes rules will be a minimax procedure. Therefore, for all practical purposes we can restrict attention to rules of the Bayesian form. That form is also of a very nice nature: if p₁(x)/p₂(x) is greater than some constant you classify into π₁, and if it is less you classify into π₂. This gives a general framework for preparing classification rules for various problems. Now, the problem of classification initially started with the discussion of the normal distribution, so we first discuss the procedures for that case: classification procedures for two multivariate normal populations. Let me state the problem first. We have a population π₁ specified by a p-dimensional normal distribution with mean vector μ₁ and variance-covariance matrix Σ, and a population π₂ which is a p-dimensional multivariate normal distribution with mean vector μ₂ and the same Σ.
You can think of it like this: a patient goes to a medical practitioner and certain tests are conducted, certain measurements taken — it could be a blood test, it could be certain other measurements on the patient. Then a decision has to be made; for example, the first population may correspond to a particular disease and the second population's parameters may correspond to another disease. On the basis of the observations on the patient, x = (x₁, x₂, …, xₚ)′, we have to decide whether they match better with π₁ or with π₂. This is a classical example. We can think of other areas also, like land classification, or classification on the basis of the economic characteristics of a country, an individual, or an organization. These are problems where we can model according to the multivariate normal distribution: the different components will be individually normally distributed, and at the same time the correlation structure gives a multivariate normal distribution. Now, in this first model I start with a common covariance matrix. So here we specify μ₁ = (μ₁₁, μ₁₂, …, μ₁ₚ)′ and similarly μ₂ = (μ₂₁, μ₂₂, …, μ₂ₚ)′, and we assume Σ is positive definite. If Σ is positive definite, the density function of πᵢ can be written as

pᵢ(x | μᵢ, Σ) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −(1/2)(x − μᵢ)′Σ⁻¹(x − μᵢ) },  i = 1, 2.  … (1)

If we want to classify according to the rules that we discussed yesterday, these are the Bayes rules or the minimax rules.
So the class of all admissible rules is of the form p₁(x)/p₂(x) > k, or ≤ k, for classifying into the first or the second population. Of course, when a priori probabilities are known, the form of k is known to us exactly — it is of the q₂/q₁ type — but even if it is not known, the rule is a Bayes rule with respect to some prior. So the desirable rules are of this form only: if the prior probabilities are given, we can choose the corresponding Bayes rule in this class; if they are not, we can choose any rule in the class, or the minimax choice. But the framework is given, meaning we can always consider rules of this nature. So let us consider the region of classification into π₁, which we call R₁:

p₁(x)/p₂(x) ≥ k,  where k has to be chosen in a suitable fashion.  … (2)

If we substitute p₁(x) and p₂(x) here, the constant factor cancels — in the numerator we have μ₁ and in the denominator μ₂ — and the ratio becomes

exp{ (1/2)[ (x − μ₂)′Σ⁻¹(x − μ₂) − (x − μ₁)′Σ⁻¹(x − μ₁) ] } ≥ k.

If I take logarithms and arrange the terms, this becomes

(1/2)[ (x − μ₂)′Σ⁻¹(x − μ₂) − (x − μ₁)′Σ⁻¹(x − μ₁) ] ≥ log k.  … (3)

Let us simplify this further.
If we expand these terms, from the first quadratic form I get x′Σ⁻¹x − 2x′Σ⁻¹μ₂ + μ₂′Σ⁻¹μ₂ (the cross term appears twice, and since it is a scalar it can be written as 2x′Σ⁻¹μ₂), and from the second, with the minus sign, −x′Σ⁻¹x + 2x′Σ⁻¹μ₁ − μ₁′Σ⁻¹μ₁. You can see that x′Σ⁻¹x cancels, and the remaining terms can be arranged as x′Σ⁻¹(μ₁ − μ₂) plus (1/2)(μ₂′Σ⁻¹μ₂ − μ₁′Σ⁻¹μ₁). In this last part I can add and subtract the cross term μ₂′Σ⁻¹μ₁; if I do that, I can factorize and write it as −(1/2)(μ₁ + μ₂)′Σ⁻¹(μ₁ − μ₂). So the condition becomes

x′Σ⁻¹(μ₁ − μ₂) − (1/2)(μ₁ + μ₂)′Σ⁻¹(μ₁ − μ₂) ≥ log k.

This result was actually obtained by Abraham Wald in 1944. The left-hand side is called the discriminant function: the second term is the same for all observations, but when you take the observation x which you want to classify, the first part is what is actually used for discriminating between the two populations. So we have the following result: when the density functions are of the form (1), that is multivariate normal, and we want to classify an observation x into π₁, i.e. Np(μ₁, Σ), or π₂, i.e. Np(μ₂, Σ), then the optimal regions of classification are given by R₁:

x′Σ⁻¹(μ₁ − μ₂) − (1/2)(μ₁ + μ₂)′Σ⁻¹(μ₁ − μ₂) ≥ log k.
This is for classification into π₁; for π₂ the inequality is simply reversed. Why am I saying these are the optimal regions of classification? Because in the previous lecture I proved that the minimal complete class is exactly the class of Bayes rules — for good rules we need not look beyond the Bayes rules — and the Bayes rules are of the form p₁(x)/p₂(x) ≥ k, where k will be suitably chosen; it is basically a ratio of the prior probabilities, but it depends on which values of q₁ and q₂ we choose, which is why I have written that k has to be suitably chosen. In particular, if we consider prior probabilities q₁ and q₂ for π₁ and π₂ respectively, then k is nothing but q₂/q₁. If a cost function with misclassification costs C(1|2) and C(2|1) is used, then k = q₂C(1|2) / (q₁C(2|1)). In general k can be anything, but every choice gives a rule in the minimal complete class — an admissible rule, a Bayes rule. Now you may take the extreme case when q₁ and q₂ are the same, meaning we do not discriminate between the two populations, and C(1|2) = C(2|1). Then k becomes 1, so log k becomes 0, and the region R₁ simply becomes

x′Σ⁻¹(μ₁ − μ₂) ≥ (1/2)(μ₁ + μ₂)′Σ⁻¹(μ₁ − μ₂).

Now there can be a situation where a priori probabilities are either not assumed or we simply have no information — we cannot discriminate between the two populations on the basis of prior probabilities. In that case one can make the expected losses due to misclassification the same. So let me just mention that point here.
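The rule above can be sketched in code. This is a minimal Python sketch, not from the lecture; the function names are my own. It computes Wald's linear discriminant and compares it with log k, using a linear solve rather than forming Σ⁻¹ explicitly.

```python
import numpy as np

def linear_discriminant(x, mu1, mu2, sigma):
    """U = x' Sigma^{-1}(mu1 - mu2) - (1/2)(mu1 + mu2)' Sigma^{-1}(mu1 - mu2)."""
    d = np.linalg.solve(sigma, mu1 - mu2)   # delta = Sigma^{-1}(mu1 - mu2)
    return x @ d - 0.5 * (mu1 + mu2) @ d

def classify(x, mu1, mu2, sigma, log_k=0.0):
    # Classify into pi_1 (return 1) when U >= log k, else pi_2 (return 2).
    return 1 if linear_discriminant(x, mu1, mu2, sigma) >= log_k else 2
```

With μ₁ = (1, 0)′, μ₂ = (−1, 0)′ and Σ = I, the discriminant reduces to 2x₁, so points with positive first coordinate go to π₁.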
If we do not consider prior probabilities, we may choose k in such a way that the two expected losses due to misclassification are equal. That means I need the probability of classifying into π₁ when the observation belongs to π₂, i.e. under the assumption x ~ Np(μ₂, Σ), and the other one, the probability of classifying into π₂ when x belongs to π₁, i.e. x ~ Np(μ₁, Σ). So we need the probability of the discriminant being ≥ or < the cut-off, and for that we need its distribution. Denote this quantity by

U = x′Σ⁻¹(μ₁ − μ₂) − (1/2)(μ₁ + μ₂)′Σ⁻¹(μ₁ − μ₂),

under the assumption x ~ Np(μᵢ, Σ), i = 1, 2. Now you can use the linearity property of the multivariate normal distribution here. Since x is multivariate normal, x′Σ⁻¹(μ₁ − μ₂) is a linear function of x; moreover, what we are writing is a scalar quantity, because the dimensions are (1 × p)(p × p)(p × 1). So U will have a univariate normal distribution. Let us calculate it separately: when x ~ Np(μ₁, Σ), the expectation of U, call it E₁(U), becomes

E₁(U) = μ₁′Σ⁻¹(μ₁ − μ₂) − (1/2)(μ₁ + μ₂)′Σ⁻¹(μ₁ − μ₂).
Now this term can be simplified. Look at the coefficient of Σ⁻¹(μ₁ − μ₂): it is μ₁′ − (1/2)(μ₁ + μ₂)′ = (1/2)(μ₁ − μ₂)′ — the μ₁′Σ⁻¹μ₁ terms combine, and the cross terms cancel, exactly as in the derivation where the second part came from (1/2)(μ₂′Σ⁻¹μ₂ − μ₁′Σ⁻¹μ₁). So this term simplifies to

E₁(U) = (1/2)(μ₁ − μ₂)′Σ⁻¹(μ₁ − μ₂).

Let us put some equation numbers here: the definition of U is (7) and this is (8). What will be the variance of U? For the variance you have the formula for a linear function: since the variance of x is Σ, it becomes (μ₁ − μ₂)′Σ⁻¹ Σ Σ⁻¹ (μ₁ − μ₂); Σ⁻¹Σ is the identity, so what remains is

Var(U) = (μ₁ − μ₂)′Σ⁻¹(μ₁ − μ₂).  … (9)

Note that this particular quantity is basically a distance measure, given in 1930 by P. C. Mahalanobis; it is called the Mahalanobis distance measure, or Mahalanobis D². Let us give it the name Δ². Then what we have is that E₁(U) is half of this: E₁(U) = (1/2)Δ².
So what we have shown is that if x belongs to π₁, then U follows a normal distribution with mean (1/2)Δ² and variance Δ². Now consider the other case, x ~ Np(μ₂, Σ). The expectation of U then becomes

E₂(U) = μ₂′Σ⁻¹(μ₁ − μ₂) − (1/2)(μ₁ + μ₂)′Σ⁻¹(μ₁ − μ₂),

and once again this can be simplified: the coefficient is μ₂′ − (1/2)(μ₁ + μ₂)′ = −(1/2)(μ₁ − μ₂)′, so basically you get E₂(U) = −(1/2)Δ². The variance is the same, because the variance term does not change with the mean. That is, if x belongs to π₂, then the distribution of U is normal with mean −(1/2)Δ² and variance Δ². Now, this result is very interesting. We are using U to discriminate between the populations π₁ and π₂, and here you can see the clear demarcation: the average values of U under π₁ and π₂ show opposite behaviour — (1/2)Δ² in one case and −(1/2)Δ² in the other — where Δ² is the Mahalanobis distance measure. So if the distance between the two populations is given in terms of Δ², a clear-cut demarcation between π₁ and π₂ emerges: if x actually belongs to π₁, then the mean of U is (1/2)Δ², and in the other case it is −(1/2)Δ², exactly on the opposite side. This is quite interesting, and you can see heuristically that it is a good classification rule. Now we want to make the two expected probabilities of misclassification the same, so let us consider the probability of misclassification when the observation is from π₁.
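The distributional result for U can be checked by simulation. This is a small Monte Carlo sketch of my own (the particular μ₁, μ₂, Σ are illustrative, not from the lecture): we draw from each population and verify that U has empirical mean near ±Δ²/2 and variance near Δ².

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

d = np.linalg.solve(sigma, mu1 - mu2)   # Sigma^{-1}(mu1 - mu2)
delta2 = (mu1 - mu2) @ d                # Mahalanobis distance Delta^2
const = 0.5 * (mu1 + mu2) @ d           # the part of U free of x

x1 = rng.multivariate_normal(mu1, sigma, 200_000)
u1 = x1 @ d - const                     # U under pi_1: mean ~ +Delta^2/2
x2 = rng.multivariate_normal(mu2, sigma, 200_000)
u2 = x2 @ d - const                     # U under pi_2: mean ~ -Delta^2/2
```

Both samples should also show Var(U) ≈ Δ², matching equation (9).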
That is P(2|1). Now, you classify into π₂ if x′Σ⁻¹(μ₁ − μ₂) minus the constant falls below the cut-off — basically you are saying U < c, because our original classification rule is in terms of U only: U greater than or equal to some quantity, or U less than it. So we will use exactly that: P(2|1) = P(U < c | π₁). Under π₁ we have just derived that U follows a normal distribution with mean (1/2)Δ² and variance Δ², so we can standardize:

P(U < c | π₁) = P( (U − Δ²/2)/Δ < (c − Δ²/2)/Δ ) = Φ( (c − Δ²/2)/Δ ),

where Φ denotes the CDF of the standard normal distribution. Similarly, let us calculate the probability of misclassification if the observation is from π₂: the observation is from π₂ but I put it into π₁, so it is P(1|2) = P(U ≥ c | π₂). When x is from π₂, U has the N(−Δ²/2, Δ²) distribution, so this becomes

P(U ≥ c | π₂) = P( (U + Δ²/2)/Δ ≥ (c + Δ²/2)/Δ ) = P( Z ≥ (c + Δ²/2)/Δ ),

where Z is a standard normal random variable; this is 1 − Φ( (c + Δ²/2)/Δ ), which we can also write as Φ( −(c + Δ²/2)/Δ ).
So you can see we are able to evaluate the two probabilities of misclassification, and we can choose c such that these two are the same. If there is a cost function, we can include that also: if C(1|2) and C(2|1) are given as costs of misclassification, then one can choose c such that C(1|2)·P(1|2) = C(2|1)·P(2|1), that is,

C(1|2) Φ( −(c + Δ²/2)/Δ ) = C(2|1) Φ( (c − Δ²/2)/Δ ).  … (13)

Now, this is quite interesting: we are able to restrict our attention to a rule for which the two expected losses due to misclassification are the same. All the terms in this expression are known — Δ is based on μ₁, μ₂ and Σ, the parameters of the two populations, and C(1|2) and C(2|1) are assumed numbers — so from the tables of the cumulative distribution function of the normal distribution we can actually find a c for which equation (13) is satisfied. This is the minimax classification procedure: the c, which equals log k, is chosen so as to satisfy equation (13). You can picture this on the axis of U: the two means are at the points −Δ²/2 and +Δ²/2, with 0 somewhere in between, the two normal curves overlap in a grey area, and c lies somewhere in that region. If the observed point falls far to either side there is no problem, but in the overlapping portion we have to decide whether to put it in population 1 or population 2.
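The two probabilities just derived are easy to compute. A minimal sketch, assuming only the formulas above (the function names are mine); Φ is obtained from the standard library error function, Φ(z) = (1 + erf(z/√2))/2.

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def misclassification_probs(c, delta2):
    """P(2|1) = Phi((c - D/2)/d) and P(1|2) = Phi(-(c + D/2)/d), D = delta2, d = sqrt(D)."""
    delta = math.sqrt(delta2)
    p21 = phi((c - delta2 / 2.0) / delta)    # classify into pi_2, x from pi_1
    p12 = phi(-(c + delta2 / 2.0) / delta)   # classify into pi_1, x from pi_2
    return p21, p12
```

At c = 0 both probabilities equal Φ(−Δ/2), which is the equal-cost, equal-prior case discussed below.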
So the value of c depends on the nature of μ₁ and μ₂, because μ₁ and μ₂ affect the value of Δ² — whether there is a large difference or a small one — and also on the magnitude of Σ; all of this affects the overlap. It could be that the overlapping part is very small, in which case the classification will be good. If the overlap is larger, then certainly the classification rule will be somewhat worse, meaning the discriminating power of the rule will be much less. If we consider C(2|1) = C(1|2), it becomes a much simpler problem, because then equation (13) reduces to

Φ( −c/Δ − Δ/2 ) = Φ( c/Δ − Δ/2 ),

and you can see this gives c = 0. So the rule written for q₁ = q₂ — when we equate the two — is actually the minimax classification procedure. That is, if the costs are the same and q₁ = q₂, whatever rule is obtained is the minimax rule. But if C(1|2) ≠ C(2|1), you can still find c from the tables of the CDF of the standard normal distribution. We can also notice a further fact. If I consider the ratio

C(1|2)/C(2|1) = Φ( (c − Δ²/2)/Δ ) / Φ( −(c + Δ²/2)/Δ ) =: g(c),

then as c increases the denominator decreases while the numerator increases, and since both are non-negative, the ratio increases — g is increasing.
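Because g(c) is increasing, the cut-off can be found by any root-finding method on equation (13). The lecture only says it can be read off from normal tables; the bisection below is my own sketch (function names are assumptions), exploiting the fact that the left side minus the right side of (13) is decreasing in c.

```python
import math

def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minimax_cutoff(c12, c21, delta2, lo=-50.0, hi=50.0, tol=1e-10):
    """Solve C(1|2)*Phi(-(c + D/2)/d) = C(2|1)*Phi((c - D/2)/d) for c by bisection."""
    delta = math.sqrt(delta2)
    f = lambda c: c12 * phi(-(c + delta2 / 2.0) / delta) \
                - c21 * phi((c - delta2 / 2.0) / delta)
    # f is strictly decreasing: positive for very negative c, negative for large c.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With equal costs the root is c = 0, recovering the q₁ = q₂ rule; with C(1|2) > C(2|1) the cut-off moves to the right, making classification into π₁ harder.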
So g is an increasing function of c, and if it is increasing then certainly there exists a value c* for which equality is attained: g(c*) = C(1|2)/C(2|1). That means the solution will always exist. Now, regarding the computational part: both terms of expression (3) — let me go back to the original discriminant, x′Σ⁻¹(μ₁ − μ₂) − (1/2)(μ₁ + μ₂)′Σ⁻¹(μ₁ − μ₂) — involve the common vector δ = Σ⁻¹(μ₁ − μ₂), which is obtained as the solution of the linear system Σδ = μ₁ − μ₂. The remaining pieces, x′δ and (μ₁ + μ₂)′δ, are not difficult; the only difficulty is evaluating Σ⁻¹, or rather its product with μ₁ − μ₂. If we look at δ as the solution of Σδ = μ₁ − μ₂, then we can use an efficient numerical computation procedure — for example Gaussian elimination, Gauss–Seidel, or any other method — because this is basically a system of linear equations. We have a further interpretation of this: the discriminant function x′δ is the linear function which maximizes

[E₁(x′d) − E₂(x′d)]² / Var(x′d)

over all choices of d.
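The linear-system view of δ can be sketched directly. Below is a plain Gauss–Seidel iteration, one of the methods just mentioned — a minimal illustration of my own, not the lecture's code; it solves Σδ = μ₁ − μ₂ without ever forming Σ⁻¹, and converges for symmetric positive definite Σ.

```python
import numpy as np

def gauss_seidel(A, b, iters=200):
    """Iteratively solve A x = b; converges e.g. for symmetric positive definite A."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        for i in range(len(b)):
            # Update x[i] using the latest values of the other coordinates.
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x
```

In practice one would call a library solver (e.g. `np.linalg.solve`), but the point stands: δ is the solution of a linear system, not an explicit matrix inverse.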
In fact, if you look at the numerator, it is (μ₁′d − μ₂′d)², which we can also write as (μ₁′d − μ₂′d) times its transpose; written like this it becomes

d′(μ₁ − μ₂)(μ₁ − μ₂)′d.  … (17)

In a similar way, the denominator Var(x′d) equals d′ E[(x − E x)(x − E x)′] d, which is

d′Σd.  … (18)

So basically we want to maximize (17) with respect to d subject to (18) being constant. We therefore consider

d′(μ₁ − μ₂)(μ₁ − μ₂)′d − λ(d′Σd − 1),

where λ is a Lagrange multiplier. Taking the derivative with respect to d and equating it to zero, we get

(μ₁ − μ₂)(μ₁ − μ₂)′d = λΣd

(there is a factor of 2 on both sides, which cancels, so I need not write it). Now (μ₁ − μ₂)′d is a scalar, say ν, so we can write μ₁ − μ₂ = (λ/ν)Σd. You can see that the solution d is therefore proportional to δ, because δ was the solution of Σδ = μ₁ − μ₂. Now, so far this has been the classification of a single observation x into two populations, but the more general problem is that in place of one observation we may have a sample of observations. In that case we can work with the distribution of the sample mean, because by sufficiency, in the multivariate normal situation, x̄ and S are the sufficient statistics.
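The maximization claim can be verified numerically. This is an illustrative check of my own (the particular μ₁, μ₂, Σ are made up): the ratio (17)/(18) evaluated at d = Σ⁻¹(μ₁ − μ₂) equals Δ², and no other direction does better.

```python
import numpy as np

def fisher_ratio(d, mu1, mu2, sigma):
    """[E1(x'd) - E2(x'd)]^2 / Var(x'd) = (d'(mu1 - mu2))^2 / (d' Sigma d)."""
    diff = mu1 - mu2
    return (d @ diff) ** 2 / (d @ sigma @ d)

rng = np.random.default_rng(1)
mu1 = np.array([1.0, 0.0, 2.0])
mu2 = np.array([0.0, 1.0, 1.0])
sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])

d_star = np.linalg.solve(sigma, mu1 - mu2)     # the claimed maximizer, delta
best = fisher_ratio(d_star, mu1, mu2, sigma)   # equals Delta^2
# Random directions should never exceed the ratio attained at delta.
others = [fisher_ratio(rng.standard_normal(3), mu1, mu2, sigma) for _ in range(1000)]
```

That the maximum value is exactly Δ² follows by substituting d = δ: the numerator is (Δ²)² and the denominator is Δ².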
So we can actually consider the distribution of x̄: it will be Np(μ₁, Σ/n) under the first population and Np(μ₂, Σ/n) under the second. The problem has just shifted — in place of Σ we consider Σ/n — and the entire procedure remains the same. Let me just state this: in case we have a random sample of size n to classify, say x₁, x₂, …, xₙ, we use the sample mean vector and classify it into π₁, i.e. Np(μ₁, Σ/n), or π₂, i.e. Np(μ₂, Σ/n). This is because if the sample is from the first population the sample mean has the first distribution, and if from the second, the second. So the entire problem is just modified and the procedure remains the same. Now, the problem I have discussed so far is classification when the parameters of the populations are known, and therefore the procedure that I described in the previous lecture is completely applicable: I am able to derive a Bayes procedure, because the density functions are completely known, and in case the prior probabilities are not assumed, we can find a minimax choice — I have shown that in this particular case the choice can be explicitly found from the tables of the normal distribution. In fact, in the particular case where C(1|2) and C(2|1) are equal, the value of c was actually 0, and that was the minimax classification rule. But in a real situation this ideal setting will generally not prevail: the parameters of the two populations will be unknown.
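The sample-mean modification amounts to one line of change in the earlier rule. A hypothetical sketch (the function name is mine): replace Σ by Σ/n and apply the same linear discriminant to x̄.

```python
import numpy as np

def classify_sample(X, mu1, mu2, sigma, log_k=0.0):
    """Classify an n-by-p sample X via its mean: x_bar ~ N_p(mu_i, Sigma/n)."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    d = np.linalg.solve(sigma / n, mu1 - mu2)   # (Sigma/n)^{-1}(mu1 - mu2)
    u = xbar @ d - 0.5 * (mu1 + mu2) @ d
    return 1 if u >= log_k else 2
```

Note that because Δ² is computed with Σ/n, the effective distance grows like n·Δ², so the misclassification probabilities shrink rapidly with the sample size.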
If the parameters of the populations are unknown, we have to proceed with training samples: one sample from the first population and one from the second — these are called the training samples. From those training samples we estimate the parameters of the populations and then use the estimates; there can be several methods of estimation. So in the following lecture I will discuss this problem of classification into two multivariate normal populations when the parameters are unknown. Well, when the parameters are unknown there can be several situations. In the problem I discussed just now I took a common covariance matrix, and the first change can occur there: in place of a common covariance matrix one may have Σ₁ and Σ₂. Now suppose these are known; even then the solution I derived will not be that simple. For example, consider the case Σ₁ ≠ Σ₂, and let me take up the first sheet again, where I wrote the density function. In the density of πᵢ I will now get |Σᵢ| and Σᵢ⁻¹, so these two terms will play a role: earlier I was able to just cancel the determinant factor, but now it will remain, and the adjustment of the quadratic terms will also not happen, because earlier I cancelled x′Σ⁻¹x from both and now it will not cancel. So the rule becomes complicated when you write p₁(x)/p₂(x); it will no longer be of a simple form. Let me write it out:

p₁(x)/p₂(x) = ( |Σ₂| / |Σ₁| )^{1/2} exp{ (1/2)[ (x − μ₂)′Σ₂⁻¹(x − μ₂) − (x − μ₁)′Σ₁⁻¹(x − μ₁) ] } ≥ k.
So this is the classification region R₁: classify the observation into π₁ if this condition is satisfied, that is, if the ratio is at least k. Unlike the previous case, the terms cannot be cancelled here and we do not have very convenient forms. Of course, for given Σ₁ and Σ₂ the determinant factor is a constant which I can adjust to the right-hand side; taking logs,

(1/2)[ (x − μ₂)′Σ₂⁻¹(x − μ₂) − (x − μ₁)′Σ₁⁻¹(x − μ₁) ] ≥ log k + (1/2) log( |Σ₁| / |Σ₂| ),

and the right-hand side is a known constant, say c. So the form of the rule is the same, but it does not simplify in the fashion of the previous case, because the two quadratic forms remain separate, with different matrices Σ₂⁻¹ and Σ₁⁻¹. Of course, if you want to find the probabilities of misclassification — for example, the distribution of the left-hand side under π₁ — the mean becomes μ₁, but here you actually need to do the calculation, because linearity is no longer available. In the previous case the best feature was that we got a linear discriminant function, x′Σ⁻¹(μ₁ − μ₂), with the second part free of x. Here you can see the x′Σ⁻¹x terms do not cancel, and you get a quadratic. Therefore the distribution theory of quadratic forms, using the Fisher–Cochran theorem and related results, has to be utilized — that is, the distribution of terms like (x − μ₂)′Σ₂⁻¹(x − μ₂). For example, if you are considering π₁, the term centred at μ₂ will carry a non-centrality parameter depending on μ₁ − μ₂ and behave like a non-central chi-square type quantity, whereas the term centred at μ₁ will be a central chi-square.
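The unequal-covariance rule above can also be sketched in code. A minimal illustration of my own (names assumed): the log of the density ratio is a quadratic in x, compared against log k plus the determinant correction.

```python
import numpy as np

def quadratic_discriminant(x, mu1, mu2, sigma1, sigma2):
    """(1/2)[(x - mu2)' S2^{-1}(x - mu2) - (x - mu1)' S1^{-1}(x - mu1)]."""
    q2 = (x - mu2) @ np.linalg.solve(sigma2, x - mu2)
    q1 = (x - mu1) @ np.linalg.solve(sigma1, x - mu1)
    return 0.5 * (q2 - q1)

def classify_qda(x, mu1, mu2, sigma1, sigma2, log_k=0.0):
    # Classify into pi_1 when the quadratic exceeds log k + (1/2) log(|S1|/|S2|).
    _, ld1 = np.linalg.slogdet(sigma1)
    _, ld2 = np.linalg.slogdet(sigma2)
    c = log_k + 0.5 * (ld1 - ld2)
    return 1 if quadratic_discriminant(x, mu1, mu2, sigma1, sigma2) >= c else 2
```

When Σ₁ = Σ₂ the determinant correction vanishes and the quadratic terms cancel, recovering the linear rule of the common-covariance case.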
In the reverse case, when the observation is from the second population, the term centred at μ₂ becomes a central chi-square whereas the other becomes non-central. So this type of distribution theory comes in; the results are somewhat more complicated and cannot be described in a straightforward fashion. But of course it can be done, because nowadays computational facilities are available, and one can write a routine to evaluate the probabilities of misclassification numerically, so that you can equate them. In the next lecture I will discuss the case of unknown parameters in detail.