I gave you the Bayes decision rule. It probably looks completely theoretical to you, so let me work through an example to make it clearer. Let us take the number of classes to be 2, with the two density functions

p1(x) = x for 0 <= x <= 1,  2 - x for 1 < x <= 2,  0 otherwise;
p2(x) = x - 1 for 1 <= x <= 2,  3 - x for 2 < x <= 3,  0 otherwise.

Let us see what these density functions look like. Since I will be using this part of the board for some other calculation, let me sketch them here, marking 0, 1, 2, 3 on the x-axis. Between 0 and 1 the value of p1 is x, a straight line that is 0 at x = 0 and rises to 1 at x = 1; from 1 to 2 it is 2 - x, coming back down to 0 (we can include the endpoint, no problem); and it is 0 otherwise. Now look at p2: it is x - 1 when x lies between 1 and 2, so at x = 1 the value is 0 and at x = 2 the value is 1; then it is 3 - x, so at x = 2 the value is 1 and at x = 3 it is 0, again a straight line; and 0 otherwise. In the mathematics I gave you earlier, Omega was a subset of R^n; here Omega is a subset of the real line. What is our Omega? It is the interval [0, 3], because outside it both densities are 0. Now let the prior probability of the first class be P1 = P and that of the second class be P2 = 1 - P; the two sum to 1. As you can see, there is an overlap between the classes.

These are the density functions; now let us apply the Bayes decision rule. What is the Bayes decision rule here? Omega1^0 is the set of all x for which

P * p1(x) >= (1 - P) * p2(x)

(I am putting the equality in here). Now let us take some cases.

Case 1: 0 <= x < 1. When x lies between 0 and 1, the density of the second class is 0 — look at the picture — so p2(x) = 0. Then P * p1(x) is either greater than 0 or equal to 0, so it is always >= (1 - P) * p2(x). That means every x between 0 and 1 goes to class 1; the set [0, 1), closed at 0 and open at 1, is a subset of Omega1^0.

Case 2: 2 <= x <= 3 (I can just as well include the endpoints, no problem). When x lies between 2 and 3, we have p1(x) = 0: at x = 2 the value is 0, and beyond 2 the value is also 0. So (1 - P) * p2(x) >= P * p1(x), and this whole set goes to class 2.

Now the main case, case 3: 1 <= x < 2. Here p1(x) = 2 - x and p2(x) = x - 1, so the condition P * p1(x) >= (1 - P) * p2(x) becomes

P(2 - x) >= (1 - P)(x - 1).
Expanding, 2P - Px >= x - 1 - Px + P; the -Px terms cancel, and 2P - P = P, so bringing the 1 to the left gives P + 1 >= x, that is,

x <= 1 + P.

From these three cases, what can we say about Omega1^0? It is [0, 1) together with [1, 1 + P], which is simply [0, 1 + P]. And Omega2^0 is (1 + P, 2) together with [2, 3], which is (1 + P, 3]. Is it clear? When x <= 1 + P you put it in class 1; when x > 1 + P you put it in class 2. You might be asking: is 1 + P always at most 2? Yes, because this P cannot be greater than 1, 1 + P cannot exceed 2. So this is the region for class 1 and that is the region for class 2.

Now let us calculate the probability of misclassification. Look at the expression we had: it is the prior probability of class 1 times the integral of p1 over Omega2^0, plus the prior of class 2 times the integral of p2 over Omega1^0:

P_e = P * integral from 1+P to 3 of p1(x) dx + (1 - P) * integral from 0 to 1+P of p2(x) dx.

Since p1(x) = 0 for x >= 2, we need not take the whole of (1 + P, 3]; and since p2(x) = 0 from 0 to 1, the second integral really runs only from 1 to 1 + P. With p1(x) = 2 - x and p2(x) = x - 1 on those intervals,

P_e = P * integral from 1+P to 2 of (2 - x) dx + (1 - P) * integral from 1 to 1+P of (x - 1) dx.

The integral of 2 - x is -(2 - x)^2 / 2, to be evaluated at 2 and at 1 + P and subtracted: at 2 the value is 0, and at 1 + P it is -(1 - P)^2 / 2, so with the two negative signs the first term is P(1 - P)^2 / 2. The integral of x - 1 is (x - 1)^2 / 2, which at 1 + P is P^2 / 2 and at 1 is 0, so the second term is (1 - P) P^2 / 2. If you take P(1 - P)/2 common, you get (1 - P) from the first term and P from the second, and their sum is 1. So

P_e = P(1 - P)/2.

This is the Bayes error probability; that means if you take any other decision rule, its error probability will be greater than or equal to this.
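Before testing that claim, here is a quick numerical check of what we just derived — a minimal sketch of my own (the names are mine, not from the lecture): it recovers the boundary 1 + P and the error P(1 - P)/2 by brute force.

```python
import numpy as np

# The two triangular class-conditional densities from the example
def p1(x):
    return np.where((0 <= x) & (x <= 1), x,
                    np.where((1 < x) & (x <= 2), 2 - x, 0.0))

def p2(x):
    return np.where((1 <= x) & (x <= 2), x - 1,
                    np.where((2 < x) & (x <= 3), 3 - x, 0.0))

P = 0.3                                  # prior of class 1; try other values
x = np.linspace(0.0, 3.0, 300001)
dx = x[1] - x[0]

# Bayes region for class 1: all x with P * p1(x) >= (1 - P) * p2(x)
in_cls1 = P * p1(x) >= (1 - P) * p2(x)
print("boundary ~", x[in_cls1 & (x < 2)].max())        # expect 1 + P

# P(error) = P * (integral of p1 over the class-2 region)
#          + (1 - P) * (integral of p2 over the class-1 region)
err = (P * p1(x)[~in_cls1].sum() + (1 - P) * p2(x)[in_cls1].sum()) * dx
print("numerical ~", err, "  closed form:", P * (1 - P) / 2)
```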
Now you give me a decision rule — one of you please give me a decision rule. Okay, I will give one; afterwards you try something else. Say my decision rule is: if x <= 1.2 I put it in class 1, and if x > 1.2 I put it in class 2. I have just taken some decision rule. Note that in the Bayes rule there is a capital P, but in this rule there is no capital P — no prior probability anywhere.

Now let us find the probability of error, the probability of misclassification, for this rule. Here Omega1 = [0, 1.2] and Omega2 = (1.2, 3]. The probability of misclassification is again

P * integral over Omega2 of p1(x) dx + (1 - P) * integral over Omega1 of p2(x) dx = P * integral from 1.2 to 3 of p1(x) dx + (1 - P) * integral from 0 to 1.2 of p2(x) dx.

The same thing happens as before. Why did I take it only up to 2? Look at the definition of p1: anything greater than 2, the value is 0, so we need not take anything from 2 to 3, and the first integral runs from 1.2 to 2. Likewise p2 is 0 from 0 to 1, so the second integral runs only from 1 to 1.2. With p1(x) = 2 - x and p2(x) = x - 1, let us do the same calculation: -(2 - x)^2 / 2 evaluated from 1.2 to 2 gives 0 at 2 and 0.8^2 / 2 = 0.64 / 2 = 0.32 at 1.2; and (x - 1)^2 / 2 from 1 to 1.2 gives 0.2^2 / 2 = 0.04 / 2 = 0.02. So the probability of misclassification is

0.32 P + 0.02 (1 - P) = 0.02 + 0.32 P - 0.02 P = 0.02 + 0.3 P.

Now I claim that this is greater than or equal to the Bayes error. Let us see whether it is true or not, because that has been my whole claim: take any other decision rule, calculate its misclassification probability, and that probability has to be greater than or equal to the misclassification probability of the Bayes decision rule. The claim 0.02 + 0.3P >= P(1 - P)/2 holds if and only if, taking the 2 to the other side, 0.04 + 0.6P >= P - P^2, and bringing everything to one side (P^2 comes across, -P and +0.6P give -0.4P), if and only if

P^2 - 0.4P + 0.04 >= 0.

Is this greater than or equal to 0, yes or no? Yes — the left side is exactly (P - 0.2)^2, a perfect square. Note that it equals 0 precisely when P = 0.2, which is exactly the case in which the Bayes threshold 1 + P equals 1.2 and our rule coincides with the Bayes rule. You take any decision rule, any rule at all, find its misclassification probability, and it will be greater than or equal to the misclassification probability of the Bayes decision rule. I have just taken one rule, but it simply does not matter what your decision rule is; you will always be able to show this. In fact, at home you can try other such Omega1 and Omega2; you will get some such square term, and sometimes a square term plus some positive quantity, depending on what decision rule you have taken. Suppose, for instance, you take [0, 1.1] together with [1.7, 3] as class 1 and the rest as class 2. You can even take the opposite:
take [0, 1.1] together with [1.7, 2] as class 2 and the rest as class 1 — that also you can take. Then, too, you will get a square term like this plus some constant, a positive value. But this is something you will find for this example; it is only true for this example. With other distributions you will get other such expressions — it is not always true that you get a square term like this. What you are always in a position to show, however, is that the probability of misclassification for any other decision rule will be greater than or equal to that of the Bayes decision rule.

You see, we found the Bayes error probability to be P(1 - P)/2. What is the maximum value of this? The maximum is attained when capital P equals one half, and the maximum value is 1/8. What does P = 1/2 mean? Both classes have the same probability of occurrence, and the Bayes decision rule puts the threshold at 1.5 — you have, in a sense, the maximum uncertainty. If one class has a higher probability of occurrence, you can put more points into that class; but if both classes have the same probability of occurrence, you have the maximum uncertainty, and that is what is happening at P = 1/2. What is the minimum value? The minimum is 0, attained at P = 0 and at P = 1. P = 0 means the whole population is the second class — and look at your decision rule: (1, 3] is put in class 2. P = 1 means the whole population is the first class, and [0, 2] is put in class 1. Look at your decision rule.
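Here is a small Monte Carlo sketch of this whole comparison — my own illustration rather than anything from the lecture: it samples from the two triangular densities with prior P and checks both error formulas empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
P, n = 0.3, 200_000

# Draw a label from the prior, then x from the matching triangular density
y = rng.random(n) < P                        # True -> class 1
x = np.where(y, rng.triangular(0, 1, 2, n), rng.triangular(1, 2, 3, n))

bayes = x <= 1 + P                           # Bayes rule: threshold at 1 + P
fixed = x <= 1.2                             # the ad hoc rule from above
print("Bayes rule error :", np.mean(bayes != y), " theory:", P * (1 - P) / 2)
print("x<=1.2 rule error:", np.mean(fixed != y), " theory:", 0.02 + 0.3 * P)
```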
Again, this is true only for this example; with other distributions you are going to get different formulas. This distribution is known as the triangular distribution, because its shape is a triangle. You might ask me why I have taken the triangular distribution when there are many other distributions, such as the normal distribution. Well, number one, I am going to deal with the normal distribution now; and number two, these distributions are easy to handle — I can calculate all those integrals, whereas the integrals of the normal density are slightly more difficult to handle.

So my next example is basically a normal distribution example; this is example 2. Let us say you have capital M classes, each of them multivariate normal, with prior probabilities P1, P2, ..., PM, and density functions

p_i(x) = 1 / ( (2 pi)^(n/2) |Sigma_i|^(1/2) ) * exp( -(1/2) (x - mu_i)' Sigma_i^(-1) (x - mu_i) ),  i = 1, ..., M,

where Sigma_i is the variance-covariance matrix of the i-th multivariate normal, mu_i is the mean vector of the i-th multivariate normal, and P_i is the corresponding prior probability of the i-th class.

Now let us take a case. Case 1: assume that P1 = P2 = ... = PM = 1/M — all prior probabilities are equal — and also assume that Sigma_1 = Sigma_2 = ... = Sigma_M = I, the identity matrix. Let us see what exactly we are going to get. We put x in the i-th class if

P_i * p_i(x) >= P_j * p_j(x) for all j != i.

Let me write this out. Sigma is the identity matrix, so its determinant is 1 and its square root is also 1, so I am not going to write it; Sigma^(-1) is the identity matrix itself, and multiplication by the identity does not change anything. So the condition reads

P_i * exp( -(1/2) (x - mu_i)' (x - mu_i) ) >= P_j * exp( -(1/2) (x - mu_j)' (x - mu_j) ) for all j != i.

You might think this is a very, very complicated expression; actually it is not, as you will see. This P_i is the same as this P_j, so we can cancel them; the factor 1 over the square root of (2 pi)^n is there on both sides, so we can cancel it too. And comparing the two exponentials is the same as comparing their exponents: the logarithm is an increasing transformation, so I can always apply it on both sides, which is the same as removing the exponentials. Then I cancel the -(1/2) on the two sides, which flips the inequality. So: put x in the i-th class if P_i p_i(x) >= P_j p_j(x), and that is the same as: put x in the i-th class if (x - mu_i)'(x - mu_i) is less than or
equal to (x - mu_j)'(x - mu_j) for all j != i. What is the meaning of this? Is it not the case that we are calculating the distance between x and mu_i — and mu_i is the mean? So you put x in the i-th class if the distance between x and the mean of the i-th class is less than or equal to the distance between x and the means of all the other classes. There is a standard name for this classifier: it is known as the MINIMUM DISTANCE CLASSIFIER (I think I will write it in capital letters). You take x, you have these M means, you calculate the distance of x to all these means, you find where it is minimum, and you put x in that class — and this is the derivation of the minimum distance classifier; a small code sketch of it follows below. This is one of the standard classifiers that you will see in any pattern recognition book; in fact, many of the old pattern recognition books start classification with this classifier: you find the means, and then you put each point in the class whose mean is nearest. You might have wondered what the proof of this is, how it comes about — this is how it comes. Note that this classifier is one of the widely used classifiers in pattern recognition, and it is the best classifier under these assumptions. Look at the number of assumptions we have made: number one, the distributional assumption, that the distributions are all normal; number two, the prior probabilities are all equal; and number three, all the covariance matrices are equal — not only equal, but equal to the identity matrix. This is a vast range of assumptions and constraints, and under them this is the best classifier. And in fact, many times we use this classifier without really knowing all these details.

Now you might be having a very valid question — two questions, in fact. Number one: how do we know that the data given to us follows a normal distribution? That is a very basic question. Number two: if we know it is normal, how do we get to know the mean of the distribution and the variance-covariance matrix? These questions I will try to answer only partially. Why only a partial answer? The reason is that we are given a data set with finitely many observations, and on the basis of finitely many observations it is in general very difficult to say whether the distribution is normal or not.
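Before taking up those questions, here is the promised sketch of the minimum distance classifier — a minimal illustration of my own, with hypothetical means:

```python
import numpy as np

def min_distance_classify(x, mus):
    """Assign x to the class whose mean is nearest in Euclidean distance.

    Under the assumptions above (normal classes, equal priors, identity
    covariances) this IS the Bayes decision rule."""
    dists = np.linalg.norm(np.asarray(mus) - x, axis=1)
    return int(np.argmin(dists))             # index of the nearest mean

# Hypothetical means of three classes in the plane
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
print(min_distance_classify(np.array([2.4, 0.5]), mus))   # -> 1
```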
Let me mention a few things here. One of the things that any statistician learns in his second or third year of a statistics course, sometimes in the first year, is what is known as fitting a distribution: given a data set, how do you fit, let us say, a binomial distribution, a Poisson distribution, a normal distribution? And there are many other distributions — the chi-square distribution, the F distribution, the t distribution, the beta distribution, the Cauchy distribution, any number of them. He fits the distribution and then calculates a value known as the chi-square value; if the observed chi-square is less than the tabulated value of the chi-square, then we say that the data fits the distribution. This is one thing any statistician would do. The main assumption there is that the calculated chi-square statistic, which arises under a multinomial model of the binned data, converges to a chi-square distribution; the test developed there is what is known as an approximate test, not an exact test. Exact test means the distribution of the statistic is exactly known; approximate test means the distribution is approximated — as the number of observations goes to infinity, the statistic follows some limiting distribution. So that chi-square fit is an approximate test.

Now, suppose some distribution X fits one data set. It does not mean that some other distribution Y cannot also fit the data set — have you understood this point? So this actually says that when you have finitely many observations, it may be possible to fit more than one distribution; the distribution you fit may not be unique, and some other person may come along and fit some other distribution properly to the same data set. This is the first thing people learn about finding the distribution of a given set of points.

Then there is something slightly more advanced, called estimation of probability density functions: given a data set, you would like to estimate the corresponding probability density function. Now, what is the meaning of estimating a density function — a curve? That needs to be properly defined. But before going into all those definitions, we first need to know how to estimate a point. Say we assume a distribution is normal; then we have to estimate the mean of the distribution. The mean is just a single point — how do you estimate that point? Likewise the variance-covariance matrix; these are also finitely many quantities. First we need to know, given the type of distribution — say normal — how do you estimate the mean and the covariance matrix? There we are estimating points. The slightly more advanced question is how to estimate the density function directly from the given data set. About this density estimation: if you look at pattern recognition textbooks, in a book like Fukunaga's you will find one chapter devoted to it. One of the first papers was by Parzen, in 1962 — the Parzen window way of estimating a density function. Initially, people had taken the histogram as an estimate of the density.
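As an aside, here is roughly what that chi-square fitting looks like in code — a rough sketch of my own, with simulated data standing in for a real data set, and the usual caveat that bins with very small expected counts should be merged:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=1000)   # stand-in for the given data

# Compare observed bin counts with the counts predicted by a fitted normal
edges = np.linspace(data.min(), data.max(), 11)    # 10 bins
observed, _ = np.histogram(data, bins=edges)
mu, sigma = data.mean(), data.std(ddof=1)          # plug-in parameter estimates
expected = len(data) * np.diff(stats.norm.cdf(edges, mu, sigma))

chi2 = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 1 - 2      # bins - 1 - number of fitted parameters
print("observed chi-square:", chi2)
print("tabulated 5% value :", stats.chi2.ppf(0.95, dof))   # accept fit if smaller
```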
I hope all of you know the meaning of the word histogram; many of you have a background in image processing, and one of the first things you would have done there is find the histogram of your image. The histogram is one of the first estimates that people used. A slightly more advanced one is the method given by Parzen, nowadays known as the Parzen density estimate; that was in 1962. Nearly fifty years have passed since then — we are in 2010 — and there are now even many books dealing with the estimation of functions. One of them was written by one of the ex-directors of ISI, my organization: B. L. S. Prakasa Rao wrote a book on nonparametric functional estimation; you can find it on the internet. So there are many books on estimating density functions. But there is a small "but": there are many methods of estimating densities, and then you have the next question — which method is better? There are no good answers to these questions; a lot of work still needs to be done to find out which method of estimation may be better. To some extent, sometimes you can say; but ultimately one would like to ask, well, what is the best estimate of a density function? For that, let me just admit my inability: at least I do not know the best estimate of a density function. That is why my answer to estimating density functions, or finding density functions, is only partial.
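Since we have talked so much about it, here is a minimal sketch of the Parzen window estimate with a Gaussian kernel — my own illustration, with a hypothetical bandwidth:

```python
import numpy as np

def parzen_estimate(t, data, h):
    """Parzen window estimate of a 1-D density at the points t:
    the average of one Gaussian bump of width h per observation."""
    t = np.asarray(t, dtype=float)[:, None]              # shape (m, 1)
    bumps = np.exp(-0.5 * ((t - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(3)
data = rng.triangular(0, 1, 2, 500)      # a sample from p1 of our first example
print(parzen_estimate([0.5, 1.0, 1.5], data, h=0.2))
# true p1 values at these points: 0.5, 1.0, 0.5
```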
This brings us to, I should say, a grave question: if we do not know the density functions, then how do we apply this rule — the best rule, in the sense that it minimizes the probability of misclassification? Well, the answer is plain and simple: you estimate the density functions. Or you assume some functional form — if you think the data follows a normal distribution, or some other distribution, assume that functional form, estimate the parameters, and use it. But if you do not know the functional form and you do not want to do these things, then the answer is no: we have no way of applying this rule to the data set. If we do not know the density functions and the prior probabilities, and we have no way of getting good estimates of them, then it is extremely difficult to apply this rule. Note that one can make two types of mistakes here: number one, you may choose a slightly wrong functional form, because you do not actually know which is the best function to choose; number two, the estimation of the parameters is itself done by a method that can give only an approximate solution. Such errors or approximations one can make in this process; I will discuss at least a few of these things later.

Then the very next question is: if things are really like this, why am I teaching this? The answer is very simple: this is the best rule, so if you develop a new classifier, you can assess its performance by comparing your classifier with the Bayes decision rule. How do you compare? You generate points artificially from known distributions; then you can apply the Bayes decision rule, and you apply your classifier — the classifier that you have developed — and check whether the probability of misclassification of your classifier is very close to the Bayes error probability or not. If it is very close, then you can say that, at least in these cases, your classifier's performance is close to that of the Bayes classifier — that is what you can tell the reviewer of your paper. That is why the Bayes decision rule is the starting point of all of pattern recognition: number one, it is the best rule, and number two, unfortunately, you cannot implement it most of the time, so you need to develop your own classifiers. That is why there are simply so many classifiers in existence. You are asking me about SVMs — SVMs are the latest; slightly before that you have the multi-layer perceptron, which many people use for classification; there is the k-nearest neighbor classifier, which will anyway be covered later in this course; and so on — there are very many classifiers, and each of them has some relationship with the Bayes classifier. For example, for the multi-layer perceptron, which is anyway going to be taught in this course, there is a particular theorem — the paper was published around six or seven years ago, I do not remember the exact reference — in which the authors proved that there exists an architecture of a multi-layer perceptron for which the error rate is very close to the Bayes error rate. They only talked about the existence of such an architecture; they never told you how to get it. That is one reason the multi-layer perceptron actually works well; otherwise you really do not know. So there is always some relationship between the standard classifiers and the Bayes classifier. Why do many people use the minimum distance classifier? Because at least in some situations it works well — here I am telling you the positive point; earlier I mentioned all its constraints, but people use it because at least in some situations they got good results. Why do people apply the k-NN rule? Again, because at least in some situations they got good results.

So these are basically the problems with the Bayes classifier; the problem lies with the density functions. And there is also another problem. This is the expression for the error probability of the Bayes decision rule:

P_e = sum over i = 1 to M of P_i * integral over (Omega_i^0)^c of p_i(x) dx,

where M is the number of classes, P_i is the prior probability of the i-th class, p_i is the conditional probability density function of the i-th class, Omega_i^0 is the optimal set for the i-th class (I write the superscript 0 for optimal), and the superscript c denotes the complement. You need to evaluate this integral to get a value for the probability of misclassification, and it is an extremely complicated one.
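Because this integral is rarely available in closed form, in practice the comparison just described is done by simulation. A minimal sketch, assuming two hypothetical normal classes with identity covariance (all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
P1, n = 0.5, 100_000
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])    # hypothetical classes

y = rng.random(n) < P1                                   # True -> class 1
x = np.where(y[:, None], rng.normal(mu1, 1.0, (n, 2)),   # Sigma = identity
                         rng.normal(mu2, 1.0, (n, 2)))

# Bayes rule here (equal priors, identity covariances): nearest mean
bayes_pred = np.linalg.norm(x - mu1, axis=1) <= np.linalg.norm(x - mu2, axis=1)

# "Your" classifier: any rule at all, e.g. a threshold on the first coordinate
my_pred = x[:, 0] <= 0.8

print("Bayes error    :", np.mean(bayes_pred != y))
print("your rule error:", np.mean(my_pred != y))   # never beats Bayes (up to noise)
```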
Even for the case of normal distributions: with equal covariance matrices you can find it, but with unequal covariance matrices it is difficult. So the Bayes classifier is good, but its basic drawbacks are, number one, knowing the prior probabilities and the density functions, and number two, even if you know them, it is difficult to calculate this integral. The shapes of the sets Omega_i^0 might be very horrible: depending on the densities, the region for one class may look like one blob, you might have some other shape for another class, and if you have, let us say, three classes, the third region may be something else again. If you have very complicated regions, it is difficult for you to do the integration. That is one of the reasons why, even though the Bayes classifier is the best classifier, people have to go in for other classifiers, so that they can somehow say, at least approximately, that the performance of the classifier they developed is close to the performance of the Bayes classifier. This is the reason for developing all the new classifiers.

So in example 2 we dealt with case 1; now we will deal with case 2. Here, instead of M classes, I will take just two classes, and p_i(x) is multivariate normal, but I will make an assumption: Sigma_1 = Sigma_2 = Sigma. I am going to assume that the variance-covariance matrices are the same, and I am looking only at the two-class problem. Now let us see what happens to the Bayes decision rule. We put x in class 1 if P1 * p1(x) >= P2 * p2(x), which holds if and only if

P1 * 1/((2 pi)^(n/2) |Sigma|^(1/2)) * exp( -(1/2)(x - mu1)' Sigma^(-1) (x - mu1) ) >= P2 * 1/((2 pi)^(n/2) |Sigma|^(1/2)) * exp( -(1/2)(x - mu2)' Sigma^(-1) (x - mu2) ).

You will understand why this assumption is made. You can cancel the (2 pi)^(n/2) factors on the two sides, and you can also cancel |Sigma|^(1/2). Now apply logarithms on both sides — log means natural logarithm, to the base e — so the exponentials go away, and the condition becomes, if and only if,

log P1 - (1/2)(x - mu1)' Sigma^(-1) (x - mu1) >= log P2 - (1/2)(x - mu2)' Sigma^(-1) (x - mu2).

Now I will actually do this multiplication:

log P1 - (1/2)( x' Sigma^(-1) x - x' Sigma^(-1) mu1 - mu1' Sigma^(-1) x + mu1' Sigma^(-1) mu1 ) >= log P2 - (1/2)( x' Sigma^(-1) x - x' Sigma^(-1) mu2 - mu2' Sigma^(-1) x + mu2' Sigma^(-1) mu2 ).

This looks slightly complicated, but you will see that it reduces to something simpler. Look at x' Sigma^(-1) x: the term -(1/2) x' Sigma^(-1) x appears here, and the same quantity appears there, so I can cancel it out. Next, consider x' Sigma^(-1) mu1: x is n rows and one column, so x' is a 1-by-n vector, Sigma^(-1) is n-by-n, and mu1 is n-by-1, so this whole thing is 1-by-1 — it is a scalar. What about
mu1' Sigma^(-1) x? That is also a scalar, and I claim these two quantities are the same: the transpose of x' Sigma^(-1) mu1 is mu1' Sigma^(-1) x, and a scalar is equal to its own transpose, so this quantity is the same as that quantity. Similarly, x' Sigma^(-1) mu2 equals mu2' Sigma^(-1) x. Is that clear? So what happens? The condition becomes, if and only if (I need to write "if and only if" at each of these steps),

log P1 - (1/2)( -2 x' Sigma^(-1) mu1 + mu1' Sigma^(-1) mu1 ) >= log P2 - (1/2)( -2 x' Sigma^(-1) mu2 + mu2' Sigma^(-1) mu2 ).

Now what I will do is take the log P2 term to the left side; log P1 - log P2 I can just write as log(P1/P2). Bringing the remaining terms across, each -(1/2) becomes +(1/2): on the right you have x' Sigma^(-1) mu2 minus x' Sigma^(-1) mu1, that is x' Sigma^(-1) (mu2 - mu1), and in many books the constant term is also taken inside. So what you will find in many books is

log(P1/P2) >= x' Sigma^(-1) (mu2 - mu1) + ( mu1' Sigma^(-1) mu1 - mu2' Sigma^(-1) mu2 ) / 2.

Now, why did I do all these calculations? I will tell you the reason. If you look at the earlier expressions, with the term x' Sigma^(-1) x present you would think the decision boundary is quadratic in x — I hope you understand the meaning: if x' is (x1, ..., xn), and there is a Sigma^(-1), and again (x1, ..., xn), then at some point you get x1 squared, at some other point x2 squared, and at some other point probably x1 times x2 — all quadratic terms. But because of our assumption, the quadratic term is gone. Note that mu1, mu2, Sigma, P1, P2 are all known to us: log(P1/P2) is a constant, (mu1' Sigma^(-1) mu1 - mu2' Sigma^(-1) mu2)/2 is a constant, and Sigma^(-1) (mu2 - mu1) is a constant vector. What you have is something linear in x. What you get here is what is known as a linear discriminant function: if your distributions are normal, you have two classes, and the covariance matrices are the same, then the decision boundary between the classes will necessarily be linear. If the covariance matrices are not the same, you will get nonlinear decision boundaries, because you are going to get x' Sigma_1^(-1) x on one side and x' Sigma_2^(-1) x on the other, which do not cancel and are nonlinear in x. Here, what you have got is something linear in x — basically, you have got a linear discriminant
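Here is a minimal sketch of this rule in code, with hypothetical parameters; the names w and c are mine, chosen so that the rule reads "class 1 iff w.x + c >= 0":

```python
import numpy as np

def linear_discriminant(mu1, mu2, Sigma, P1, P2):
    """Return (w, c) so that the rule above reads: class 1 iff w @ x + c >= 0."""
    Si = np.linalg.inv(Sigma)
    w = Si @ (mu1 - mu2)
    c = np.log(P1 / P2) - 0.5 * (mu1 @ Si @ mu1 - mu2 @ Si @ mu2)
    return w, c

# Hypothetical parameters for two classes with a shared covariance matrix
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, c = linear_discriminant(mu1, mu2, Sigma, P1=0.5, P2=0.5)

x = np.array([0.4, 0.2])
print("class 1" if w @ x + c >= 0 else "class 2")   # the boundary is linear in x
```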
function in x. Why is it called a discriminant function? The word discriminant came from discrimination: you are going to discriminate between two classes. And since the function is linear, this is a linear discriminant function — the discriminant function between the classes is linear. So if you have two normal distributions with the same covariance matrix, you will get a linear boundary. If you have, let us say, three normal distributions with the same covariance matrix, what are you going to get? Between classes 1 and 2 there is a linear function, between 2 and 3 there is again a linear function, and between 1 and 3 there is again a linear function. So you are going to get piecewise linear decision boundaries between the classes. Similarly, if you have M classes, you are going to get that many such piecewise linear functions. Let us stop.
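To round off, a sketch of the M-class version under the same assumptions — assign x by the largest of M linear scores g_i(x) = x' Sigma^(-1) mu_i - mu_i' Sigma^(-1) mu_i / 2 + log P_i, which follows from the derivation above by dropping the terms common to all classes, so every pairwise boundary is linear (my own illustration, hypothetical parameters):

```python
import numpy as np

def gaussian_lda_predict(x, mus, Sigma, priors):
    """M normal classes with one shared covariance matrix: pick the class
    with the largest linear score g_i(x); every pairwise boundary is linear."""
    Si = np.linalg.inv(Sigma)
    scores = [x @ Si @ mu - 0.5 * (mu @ Si @ mu) + np.log(p)
              for mu, p in zip(mus, priors)]
    return int(np.argmax(scores))

# Hypothetical three-class problem with equal priors
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
print(gaussian_lda_predict(np.array([2.0, 2.2]), mus, np.eye(2), [1/3] * 3))  # -> 2
```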