In the last class, I introduced the problem of classification of observations. The classical problem is this: we are given two populations pi 1 and pi 2, and for a new observation we must decide whether it belongs to the first population or the second. To derive a classification procedure we typically need training samples from both populations, and we use them to construct a proper classification rule. In the last lecture I introduced the expected loss when an observation is wrongly classified: it may belong to the first population but be classified into the second, or belong to the second and be classified into the first. Based on that, we defined what a Bayes procedure is, and I also described r(R, 1) and r(R, 2), the expected losses of a procedure R when the observation is from pi 1 and from pi 2 respectively. We say that a procedure R is at least as good as a procedure R* if r(R, 1) <= r(R*, 1) and r(R, 2) <= r(R*, 2); that is, both expected losses, whether the observation is from pi 1 or from pi 2, should be no larger. If at least one of these inequalities is strict, then the procedure R is said to be better than R*. Note that "less than or equal to" includes the case of equality; when both expected losses are equal, R and R* are said to be equivalent classification procedures. One may ask whether there is a procedure which is better than, or at least as good as, all other procedures, that is, whether a best procedure exists. The answer is no.
Usually, in a given classification problem, there is no procedure which is best, that is, at least as good as all other procedures. This can be explained as follows: suppose I force one error probability to zero, say there is no chance of misclassifying a pi 1 observation because everything is assigned to pi 1; then the probability of misclassifying a pi 2 observation becomes 1, and the first inequality cannot hold against a sensible competitor. So there is a genuine trade-off: sometimes one procedure is better on one component and sometimes the other is better on the other. We therefore introduce admissibility: a classification procedure R is called admissible if there is no procedure better than R; a procedure which is not admissible is called inadmissible. For example, I mentioned that if one of the inequalities is strict then R is better than R*; in that case R* is an inadmissible procedure. In a given problem we are interested in characterizing the admissible procedures, so that we can restrict attention to them and discard the inadmissible ones. It can be shown, under certain conditions, that the class of all admissible procedures is the same as the class of all Bayes procedures. We also consider complete classes of procedures. A class C of classification procedures is said to be complete if for any procedure not in C we can find a better procedure in C. The class C is said to be essentially complete if for any procedure not in C we can find a procedure in C which is at least as good. So in a given classification problem we would like to characterize the complete class of classification procedures.
Another point: there may be procedures which are equivalent; in particular, procedures that coincide except on a set of measure 0, or a set of probability 0, are certainly equivalent, and we may identify them. Next we consider the minimax procedure. A procedure R is minimax if the maximum of its expected losses is minimized over all rules R*: for any classification procedure we look at the values r(R, 1), r(R, 2) and take their maximum; the procedure for which this maximum is smallest is the minimax procedure. Now let us discuss the methods of determining Bayes procedures and the minimax procedure in a given classification problem, and how we start. First, finding Bayes classification procedures. Let q1 and q2 be the prior probabilities of the item being in pi 1 and pi 2, with densities p1(x) and p2(x). Given an observation x, the conditional probability that it came from pi 1 is q1 p1(x) / (q1 p1(x) + q2 p2(x)). Look at the denominator: the observation can come from the first population, with probability q1 and density p1(x), or from the second, with probability q2 and density p2(x); in the numerator we write the term corresponding to pi 1. This is just an application of Bayes' theorem. For the time being let us take the cost functions to be equal: C(1|2), the cost of misclassification into 1 when the observation is from 2, and C(2|1), the cost of misclassification into 2 when it is from 1, are both taken to be 1.
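The Bayes-theorem step above can be sketched numerically. This is a minimal illustration, assuming two hypothetical univariate normal populations pi 1 = N(0, 1) and pi 2 = N(2, 1); the densities and the evaluation point are my own assumptions, not from the lecture.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_pi1(x, q1, q2, p1, p2):
    """P(came from pi_1 | x) = q1*p1(x) / (q1*p1(x) + q2*p2(x)), by Bayes' theorem."""
    return q1 * p1(x) / (q1 * p1(x) + q2 * p2(x))

p1 = lambda x: normal_pdf(x, 0.0, 1.0)   # assumed density of pi_1
p2 = lambda x: normal_pdf(x, 2.0, 1.0)   # assumed density of pi_2

# At the midpoint x = 1 with equal priors the two densities agree,
# so the posterior probability of pi_1 is exactly 1/2.
print(posterior_pi1(1.0, 0.5, 0.5, p1, p2))  # -> 0.5
```

Points nearer to the mean of pi 1 get a posterior above 1/2, points nearer to pi 2 get one below; this is exactly the quantity the classification rule compares.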
Let R1 and R2 denote the regions of classification into pi 1 and pi 2. The expected loss is then q1 times the integral of p1(x) over R2, plus q2 times the integral of p2(x) over R1. This is the probability of misclassification, and our aim is to minimize it. Now compare the conditional probability of coming from pi 1 with the conditional probability of coming from pi 2, which has q2 p2(x) in the numerator; assigning to whichever is larger can be considered a reasonable classification rule. So, given an observation x to be classified into pi 1 or pi 2, we propose the following procedure: if q1 p1(x) / (q1 p1(x) + q2 p2(x)) >= q2 p2(x) / (q1 p1(x) + q2 p2(x)), assign x to pi 1; otherwise assign it to pi 2. Let me call this rule (3). You can easily see that, since the denominator is common, the proposed rule reduces to the following, call it (4): R1, the region of classification into the population pi 1, is the set where q1 p1(x) >= q2 p2(x), and R2 is the reverse, the set where q1 p1(x) < q2 p2(x). One point should be mentioned here: I am taking "greater than or equal to" in R1 and strict "less than" in R2; one can instead put strict "greater than" in R1 and "less than or equal to" in R2. For continuous distribution models this creates no problem. In the discrete case there may be situations where the probability of equality is positive; then one can randomize further, that is, consider a randomized classification rule which, when equality holds, assigns x to pi 1 with a certain probability and to pi 2 with the complementary probability.
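Rule (4) above is a one-line comparison in code. This sketch again assumes the hypothetical pair pi 1 = N(0, 1), pi 2 = N(2, 1), so with equal priors the boundary sits at the midpoint x = 1.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def bayes_classify(x, q1, q2, p1, p2):
    """Rule (4) with equal costs: return 1 (assign to pi_1) iff q1*p1(x) >= q2*p2(x), else 2."""
    return 1 if q1 * p1(x) >= q2 * p2(x) else 2

p1 = lambda x: normal_pdf(x, 0.0, 1.0)   # assumed density of pi_1
p2 = lambda x: normal_pdf(x, 2.0, 1.0)   # assumed density of pi_2

# With q1 = q2 = 1/2 the classification boundary is at x = 1.
print(bayes_classify(0.4, 0.5, 0.5, p1, p2))  # -> 1
print(bayes_classify(1.6, 0.5, 0.5, p1, p2))  # -> 2
```

Changing q1 and q2 shifts the boundary: a larger prior on pi 1 enlarges the R1 region.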
Let me just record this point: if q1 p1(x) = q2 p2(x), we can randomize and place x in pi 1 or pi 2 with probabilities alpha and 1 - alpha. Of course there is another possibility, q1 p1(x) + q2 p2(x) = 0, in which case the point can be assigned to either region. Now, we defined the probability of misclassification above, and our aim is to minimize it; we will show that the rule we have written, procedure (4), actually minimizes this probability of misclassification. So take any other procedure, say R* = (R1*, R2*), where R1* is the region in which the observation is classified into population pi 1 and R2* is the region in which it is classified into pi 2. For this rule the probability of misclassification is q1 times the integral of p1(x) over R2* plus q2 times the integral of p2(x) over R1*, which we can write as the integral over R2* of [q1 p1(x) - q2 p2(x)] dx, plus q2 times the integral of p2(x) over the whole space. Now look at the second term: no R1* or R2* appears in it, so whatever the procedure, that term is the same; the second term does not depend on the specific classification procedure, and we only have to look at the first term. The first term is minimized if R2* includes all points x for which q1 p1(x) < q2 p2(x) and excludes all points for which q1 p1(x) > q2 p2(x). Now suppose P(p1(x)/p2(x) = q2/q1 | pi i) = 0 for i = 1, 2, so that the boundary set has probability zero. If you look at procedure (4), it takes exactly those regions: R2 consists of the points where q1 p1(x) < q2 p2(x), and it excludes those where q1 p1(x) > q2 p2(x).
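The minimization argument above can be checked numerically in a simple assumed setting. For the hypothetical pair pi 1 = N(0, 1), pi 2 = N(2, 1) with q1 = q2 = 1/2, any cut-off rule "assign pi 1 iff x <= t" has misclassification probability PMC(t) = q1*(1 - Phi(t)) + q2*Phi(t - 2), and the Bayes boundary t = 1 should give the smallest value over a grid of cut-offs.

```python
import math

def Phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pmc(t, q1=0.5, q2=0.5):
    """Probability of misclassification of the cut-off rule at t:
    q1*P(X > t | pi_1) + q2*P(X <= t | pi_2) for N(0,1) vs N(2,1)."""
    return q1 * (1.0 - Phi(t)) + q2 * Phi(t - 2.0)

grid = [i / 100.0 for i in range(-100, 301)]   # cut-offs t from -1.0 to 3.0
best_t = min(grid, key=pmc)
print(best_t)                 # -> 1.0, the Bayes boundary
print(round(pmc(best_t), 4))  # -> 0.1587, i.e. Phi(-1)
```

This is only a check within the family of cut-off rules, but for these two normals the Bayes regions are exactly of this cut-off form, so the grid search recovers the Bayes rule.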
So procedure (4) satisfies exactly the condition stated above: the first term is minimized when R2* includes all points where q1 p1(x) is smaller and excludes those where it is greater. Therefore procedure (4) minimizes the probability of misclassification; it is also unique except on sets of probability 0, and hence it is a Bayes procedure. You can see that, if the prior probabilities of populations pi 1 and pi 2 are given as q1 and q2, the procedure given here is a Bayes procedure. Of course, here we assumed the costs to be equal, which is why they cancelled in both places. If a general cost factor is given, it must also be included; let me give the expression. For the general setup of cost functions, that is, when C(1|2) and C(2|1) are not identical, we write the expected cost of misclassification as PMC(R) = C(2|1) P(2|1, R) q1 + C(1|2) P(1|2, R) q2, which equals q1 C(2|1) times the integral of p1(x) over R2, plus q2 C(1|2) times the integral of p2(x) over R1. In place of the earlier procedure we now include the cost factor, and the procedure we consider, call it (6), is: R1 is the set where q1 C(2|1) p1(x) >= q2 C(1|2) p2(x), and R2 is the set where q1 C(2|1) p1(x) < q2 C(1|2) p2(x). One assumption has to be made when we write this: in problems involving gains the cost function could be negative, so we are assuming here that the cost function is non-negative; otherwise the inequalities would get modified.
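Rule (6) above, with unequal non-negative costs, can be sketched the same way. The normal densities and the cost values here are illustrative assumptions; note how a higher cost C(1|2) shrinks the pi 1 region.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def bayes_classify_costs(x, q1, q2, c21, c12, p1, p2):
    """Rule (6): assign pi_1 iff q1*C(2|1)*p1(x) >= q2*C(1|2)*p2(x).
    Here c21 stands for C(2|1) and c12 for C(1|2); both assumed non-negative."""
    return 1 if q1 * c21 * p1(x) >= q2 * c12 * p2(x) else 2

p1 = lambda x: normal_pdf(x, 0.0, 1.0)   # assumed density of pi_1
p2 = lambda x: normal_pdf(x, 2.0, 1.0)   # assumed density of pi_2

# With equal costs the boundary is x = 1, so x = 0.9 goes to pi_1;
# doubling C(1|2) moves the boundary left and x = 0.9 now goes to pi_2.
print(bayes_classify_costs(0.9, 0.5, 0.5, 1.0, 1.0, p1, p2))  # -> 1
print(bayes_classify_costs(0.9, 0.5, 0.5, 1.0, 2.0, p1, p2))  # -> 2
```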
So we have the following theorem: if q1 and q2 are the prior probabilities that an observation comes from populations pi 1 and pi 2, that is, the initially assigned probabilities, and if the costs of misclassification are C(2|1) and C(1|2) respectively, then the expected cost of misclassification is minimized by rule (6). In fact, we can write this rule in terms of the likelihood ratio, call it (7): R1 is the set where p1(x)/p2(x) >= C(1|2) q2 / (C(2|1) q1), and R2 is the reverse, where p1(x)/p2(x) < C(1|2) q2 / (C(2|1) q1). Further, if the probability of equality, P(p1(x)/p2(x) = q2 C(1|2) / (q1 C(2|1)) | pi i), is 0 under both populations, then this rule is unique except on sets of probability 0. So rule (7) is a Bayes procedure, and you can see we have actually solved a problem here. In many classification situations the sizes of the populations are known: for example, when classifying land versus water, or classifying students as very good versus mediocre, we may know the proportions. If we know them, we can assign q1 and q2, and then the Bayes procedure is the best procedure we need, because it minimizes the expected probability of misclassification. So in this first case, where the probability distributions are completely known, that is, p1(x) and p2(x) are known to us, we are actually able to get the best procedures. Now let us consider another situation: the initial probabilities may not be known, in which case we consider more general procedures.
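For concrete densities, the ratio form of rule (7) can often be solved in closed form. Under the assumed pair pi 1 = N(0, 1), pi 2 = N(2, 1), the log-likelihood ratio is log(p1(x)/p2(x)) = 2 - 2x, so "p1/p2 >= k" with k = C(1|2) q2 / (C(2|1) q1) is equivalent to x <= 1 - (1/2) log k. This is a sketch for that assumed pair only, not a general formula.

```python
import math

def boundary(q1, q2, c21, c12):
    """Cut-off x* of rule (7) for the assumed pair N(0,1) vs N(2,1):
    assign pi_1 iff x <= x*, where x* = 1 - (1/2)*log(C(1|2)*q2 / (C(2|1)*q1))."""
    k = (c12 * q2) / (c21 * q1)
    return 1.0 - 0.5 * math.log(k)

print(boundary(0.5, 0.5, 1.0, 1.0))           # -> 1.0 (equal priors, equal costs)
print(round(boundary(0.5, 0.5, 1.0, 2.0), 4))  # -> 0.6534 (boundary shifts left)
```

The design point: for exponential-family densities the likelihood-ratio inequality typically reduces to a simple inequality in x, which is why these regions have such clean forms.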
Another possibility is that we assign somewhat different probabilities, say q1* and q2* in place of q1 and q2; it is not always possible to fix the prior probabilities exactly, and in that case the Bayes rule will change. The question is: if the initial assignment is not correct and we get another rule, are we still all right? The answer is that even then we are reasonably all right, because all Bayes rules are admissible. This is proved in the following theorem: if P(p2(x) = 0 | pi 1) = 0 and P(p1(x) = 0 | pi 2) = 0, then every Bayes procedure is admissible. Let R = (R1, R2) be a Bayes procedure for given q1, q2, and let R* = (R1*, R2*) be any other procedure. Since R is a Bayes procedure, its probability of misclassification is no larger than that of R*: q1 P(2|1, R) + q2 P(1|2, R) <= q1 P(2|1, R*) + q2 P(1|2, R*). We can rearrange this as q1 [P(2|1, R) - P(2|1, R*)] <= q2 [P(1|2, R*) - P(1|2, R)]; call this (9). Now, q1 and q2 are assigned probabilities, so they lie between 0 and 1, and we can make use of that. What we want to prove is that R* cannot have both components of the misclassification probability smaller. Suppose P(1|2, R*) < P(1|2, R); then the right-hand side of (9) is negative, so the left-hand side is negative, which gives P(2|1, R) < P(2|1, R*). That means if P(1|2, R*) is smaller, then P(2|1, R*) must be larger. On the other hand, if P(2|1, R*) < P(2|1, R), then the left-hand side of (9) is positive.
If the left-hand side of (9) is positive, then the right-hand side is positive too, and we get P(1|2, R) < P(1|2, R*). That means both components of the misclassification probabilities for R* cannot be smaller than the corresponding components for the procedure R. Let me write it formally. If P(1|2, R*) < P(1|2, R), then the right-hand side of (9) is negative and hence P(2|1, R*) > P(2|1, R). Similarly, if P(2|1, R*) < P(2|1, R), then the left-hand side of (9) is positive and hence P(1|2, R*) > P(1|2, R). So R* cannot be better than R, and this proves that R is admissible when q1 lies strictly between 0 and 1. Now consider the extreme case, for example q1 = 0. Then (9) gives simply P(1|2, R*) >= P(1|2, R). For the Bayes procedure with q1 = 0, R1 includes only points for which p2(x) = 0, so P(1|2, R) = 0, and if R* is to be better than R, the only possibility is P(1|2, R*) = 0. Since P(p2(x) = 0 | pi 1) = 0, we have P(2|1, R) = P(p2(x) > 0 | pi 1) = 1. On the other hand, if P(1|2, R*) = 0, then R1* contains, up to sets of probability 0, only points for which p2(x) = 0, so P(2|1, R*) = P(R2* | pi 1) = P(p2(x) > 0 | pi 1) = 1. So we have proved that R* is not better than R: we took an arbitrary procedure R* and showed that it cannot be better than R. The reverse is also true under a certain condition: every admissible procedure is a Bayes procedure. We prove the following theorem.
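The trade-off behind this admissibility argument can be seen numerically. For the assumed pair pi 1 = N(0, 1), pi 2 = N(2, 1), the Bayes rules are cut-off rules "assign pi 1 iff x <= t", with P(2|1) = 1 - Phi(t) and P(1|2) = Phi(t - 2); lowering one error necessarily raises the other, so no rule in the family dominates another.

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

cutoffs = [i / 10.0 for i in range(-10, 31)]            # t from -1.0 to 3.0
errors = [(1.0 - Phi(t), Phi(t - 2.0)) for t in cutoffs]  # (P(2|1), P(1|2)) per rule

# As t grows, P(2|1) strictly decreases while P(1|2) strictly increases,
# so no cut-off rule has both error probabilities smaller than another's.
monotone = all(a21 > b21 and a12 < b12
               for (a21, a12), (b21, b12) in zip(errors, errors[1:]))
print(monotone)  # -> True
```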
Theorem: if P(p1(x)/p2(x) = k | pi i) = 0 for i = 1, 2 and for every k between 0 and infinity, call this condition (10), then every admissible procedure is a Bayes procedure. I mentioned the characterization of the class of admissible procedures, and that it can be shown under certain conditions that the class of Bayes procedures is the same as the class of admissible procedures; the previous theorem and this theorem taken together prove that statement. Let me give a proof. If (10) holds, then for any q1 the Bayes procedure is unique; moreover, the cdf of p1(x)/p2(x) is continuous under both pi 1 and pi 2. Let R be an admissible procedure. Then there exists a k such that P(2|1, R) = P(p1(x)/p2(x) <= k | pi 1) = P(2|1, R*), where R* is the Bayes procedure corresponding to q2/q1 = k, that is, q1 = 1/(1 + k) and q2 = k/(1 + k). Now, since R is admissible, we must have P(1|2, R) <= P(1|2, R*); and since every Bayes procedure is admissible, we also have the reverse, P(1|2, R) >= P(1|2, R*). So they are equal. What have we proved? We started with an admissible procedure R, and we matched it with the Bayes procedure R* having the same value of P(2|1); the constant k appears in the form of the Bayes procedure. If you remember the form written earlier, q1 p1(x) >= q2 p2(x), you can rewrite it as p1(x)/p2(x) >= q2/q1 for R1 and p1(x)/p2(x) < q2/q1 for R2.
Combining these two facts, R achieves the same error probabilities as the Bayes procedure R*, and by the uniqueness of the Bayes procedure, except on sets of probability 0, we conclude that R and R* are the same; thus R is a Bayes procedure. So we have proved a very significant result: the class of admissible procedures is exactly the class of Bayes procedures. Hence, in a given classification problem, we can restrict attention to the class of Bayes procedures. If statement (10) holds, then the class of Bayes procedures is minimal complete. This is a very powerful result, because it allows us to restrict attention essentially to Bayes procedures only. Now let us also discuss minimaxity. Remember that the minimax criterion is based on a different philosophy: we consider the worst-case scenario, the worst probability of misclassification, and among the worst cases we choose the best. In the Bayesian approach we consider only the average loss, or average probability of misclassification, whereas in the minimax approach we consider the individual error probabilities, look at the worst that can happen, and choose the procedure for which that worst value is smallest. So let R be the Bayes procedure with respect to the assignment of prior probabilities q1 and q2, and denote P_q1(i|j) = P(i|j, R); that is, when the priors are (q1, q2), we take procedure (4) written earlier as the Bayes procedure, and I denote its misclassification probabilities with the subscript q1. As q1 changes, say q1 = 1/2 or q1 = 1/4 and so on, P_q1(i|j) is a continuous function of q1.
In fact, P_q1(2|1) decreases from 1 to 0 as q1 varies from 0 to 1, and P_q1(1|2) increases from 0 to 1. So these are continuous functions varying in opposite directions between 0 and 1 as q1 varies from 0 to 1, and their graphs must cross at some point, say q1*, where P_q1*(2|1) = P_q1*(1|2). The Bayes procedure obtained when the prior probabilities are q1* and q2* = 1 - q1* is the minimax classification procedure. To show that it is minimax, let R* be any other procedure for which max{P(2|1, R*), P(1|2, R*)} <= P_q1*(2|1) = P_q1*(1|2). If the maximum is less than or equal to this common value, then both components are; and if R* were strictly better, its expected probability of misclassification under the priors (q1*, q2*) would be smaller than that of the Bayes procedure for those priors. This contradicts the fact that the latter is a Bayes procedure, and that every Bayes procedure is admissible; so no such strictly better R* exists. So, friends, today we have considered the basic problem of classification, and I have given a decision-theoretic formulation of this problem. We considered the costs of misclassification together with the probabilities of misclassification P(2|1) and P(1|2): P(1|2) is the probability of classifying into population 1 when the observation actually comes from population 2, and similarly P(2|1) is the probability of misclassifying into population 2 when it actually comes from population 1.
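The crossing construction above can be sketched numerically. For the assumed pair pi 1 = N(0, 1), pi 2 = N(2, 1), the Bayes boundary for prior q1 is t(q1) = 1 - (1/2) log((1 - q1)/q1), and we locate the q1* where P(2|1) = 1 - Phi(t) equals P(1|2) = Phi(t - 2) by bisection; here the problem is symmetric, so q1* should come out as 1/2.

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def errors(q1):
    """(P(2|1), P(1|2)) of the Bayes rule with prior q1, for N(0,1) vs N(2,1)."""
    t = 1.0 - 0.5 * math.log((1.0 - q1) / q1)   # Bayes cut-off for prior q1
    return 1.0 - Phi(t), Phi(t - 2.0)

lo, hi = 0.01, 0.99
for _ in range(60):                              # bisection on the error difference
    mid = 0.5 * (lo + hi)
    e21, e12 = errors(mid)
    if e21 > e12:
        lo = mid    # raising q1 enlarges the pi_1 region and lowers P(2|1)
    else:
        hi = mid
q1_star = 0.5 * (lo + hi)

print(round(q1_star, 4))            # -> 0.5, by symmetry of this example
print(round(errors(0.5)[0], 4))     # common error at the crossing, about 0.1587
```

At q1*, the two error probabilities are equal, and the corresponding Bayes rule is the minimax procedure described in the lecture.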
On this basis we considered two criteria. One is the Bayesian criterion: if we are somehow confident about the proportions q1 and q2 of observations from the two populations, then based on them we can find the rule which minimizes the expected probability of misclassification. This rule is called the Bayes rule, and its exact form was obtained here: q1 p1(x) >= q2 p2(x) and q1 p1(x) < q2 p2(x) as the regions of classification into pi 1 and pi 2 respectively. One can also add a cost factor in terms of C(1|2) and C(2|1), and then too the Bayes procedure is obtained. We also looked at the desirability of the Bayes rules in terms of complete classes: we proved that every Bayes rule is admissible and, under certain conditions, that every admissible rule is Bayes; therefore the class of admissible rules is the same as the class of Bayes procedures, and the class of Bayes rules is the minimal complete class. In practice this helps, because whatever prior assignment of probabilities we use, we are doing reasonably well: for whatever Bayes rule we propose, no better rule can be found, although other equally good rules may exist. Secondly, in the same class we can also determine a minimax rule: we vary q1 continuously until the probabilities P_q1(2|1) and P_q1(1|2) match, and the Bayes rule at that matching point gives the minimax procedure. So in the classical decision-theoretic formulation we have the solution of this problem. In the following classes I will look at classification procedures for normal populations, rather multivariate normal populations.
The original formulation is due to Abraham Wald in the 1940s, and then we will look at the procedures discussed by Fisher, that is, for the case when the parameters of the populations are unknown and must be estimated. We will consider the classical Fisher discriminant function, the Mahalanobis distance, and then Anderson's classification rules, among others. These are the things I will be following up in the next classes: we will next consider classification of observations into multivariate normal populations, using procedures initially proposed and studied by Fisher, Wald, Anderson, and others. We will discuss the properties of these classification procedures and how the procedures are actually obtained. This I will cover in the next lecture.