Welcome back to the lecture series on pattern recognition. In the last class we discussed decision boundaries, decision regions, and discriminant functions, following the discussion on the normal distribution. For the task of pattern classification we took a very simple example of a minimum distance classifier, or nearest neighbor classifier, in which we simply took the distance of the sample x from the mean, without worrying about the class priors and the class distribution functions. We also discussed that this is not a sound method of classification, because you will get a good amount of error in the process; there are many examples that show this, which we will worry about later. To incorporate those missing quantities we bring in the classic Bayes theorem. So let us look at the slide for the Bayes theorem. The Bayes decision rule was discussed in an earlier class; the Bayes theorem is given by the expression you see at the top. We are trying to classify a sample x; the capital X with an arrow indicates a feature vector, and the notation x belonging to w_i indicates class i. What Bayes rule, or the Bayes theorem, says is that it requires three inputs: the three terms which you see on the right-hand side, marked 1, 2 and 3, and this is the output. What are the individual terms here? Different books use these terms in different ways, and I am going to use the most common names for the corresponding terms within the Bayes theorem, which is later used for class assignment. So what are they? The denominator term p(x) is called the unconditional density function, or evidence. Some books will casually call it the evidence, but we will stick to the name unconditional density function, or unconditional probability. It is the probability distribution of a feature x over the entire population; that is what it means. If you are taking color as a feature to discriminate between flowers and fruits, I may want to find out how many red flowers there are. It is not a question of identifying whether a flower is a tulip, a rose, or the classical Rajniganda, whatever color it has; the question is simply, how many flowers are red? So if the redness of color is the feature, its distribution can be calculated from a large set of samples. Take another example, say the height of a person. You take a group of individuals. You might categorize individuals by different attributes, say language, dress, color, food habits, whatever the case may be, but let us say I take height as a feature and use it for classification. If you take a group of 100 individuals, you would like to find out how many people have a height of, say, exactly 5 feet, or 5 feet 6 inches. That gives the unconditional density function, because you are sampling the data without worrying about which class the data falls under. That is p(x). The next term, P(omega_i), or P(w_i) if you use w_i as the symbol, is the prior probability that a random sample is a member of class i; look at the index, it is the same.
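Before going on with the remaining terms, here is the full expression on the slide, reconstructed for reference; it is the familiar Bayes theorem:

```latex
P(\omega_i \mid \vec{x}) \;=\; \frac{p(\vec{x} \mid \omega_i)\, P(\omega_i)}{p(\vec{x})}
\qquad \text{(posterior = likelihood} \times \text{prior / evidence)}
```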
This prior probability, that a random sample is a member of class i, sits in the numerator of the expression and is casually called the class prior. Let us take an example of a two-class problem: discriminating between apples and oranges. I give you a bag of 10 fruits; 6 of them are apples and 4 of them are oranges. I repeat, there are 10 samples, 6 apples and 4 oranges. So I can compute P(w_1) for apples and P(w_2) for oranges. Very simple: the prior for apples will be 6/10, that is 0.6, and the other will of course be 1 minus that, 0.4. Very simple; that is an example of a class prior. You do need to estimate it, because in the case of fruit classification the fruits are often seasonal, so it is good to know, for a particular season, which fruits are available in large quantities: say mango or jackfruit in summer, while apples may be more plentiful or cheaper in winter; oranges are also a seasonal fruit, although these days most fruits are available almost all year round, mangoes probably being the exception. Let us go back to the other term, the most significant one in the numerator. We have now discussed two of the three terms, the class prior and the unconditional probability. The remaining term, the most significant one, is called the class conditional probability, or likelihood: given a particular class w_i, how often does the feature value x occur? It is the likelihood of obtaining feature value x given that the sample is from class w_i, proportional to the number of occurrences of x among samples belonging to class w_i. That means, if you pick apples as the class, how many fruits will be red in color? That probability could be high. You would likewise compute how many red fruits you would have if you were given only oranges; now you know how often an orange can be red, or some other color, say black or green (oranges can indeed be green, but let us stick with red or black). So you can compute the likelihood, or class conditional probability as it is called, for a particular feature given a particular class. Given all three terms on the right-hand side, a very simple probabilistic computation gives you the left-hand side quantity, which is called the posterior probability: the probability that a feature vector x should be assigned to class w_i. You assign x to w_i if this posterior probability is highest for that particular class; the rule for assignment is always the same, assign x to whichever class gives the maximum posterior probability under Bayes rule. And how do we compute that? Using this formula. Just remember: on the numerator you have the class conditional probability and the class prior; the denominator is the unconditional probability. Can you repeat this with me? Numerator: first the class conditional likelihood, second the class prior; denominator: the unconditional probability. Using these three you compute the Bayes theorem. Assuming you know this formula, let us go back to our discussion of decision boundaries and decision regions, now to be obtained under the Bayes paradigm.
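As a minimal numeric sketch of this rule in code (the priors come from the 6-apples/4-oranges bag above; the likelihood values for the feature "red" are assumed purely for illustration):

```python
# Posterior computation for the 2-class fruit problem (apples vs. oranges).
# Priors from the bag of 10 fruits: 6 apples, 4 oranges.
priors = {"apple": 0.6, "orange": 0.4}

# Class-conditional probabilities (likelihoods) of observing the feature
# "red" given each class -- these numbers are assumed for illustration only.
likelihood_red = {"apple": 0.8, "orange": 0.05}

# Evidence (unconditional probability of "red"), by total probability.
evidence = sum(likelihood_red[c] * priors[c] for c in priors)

# Bayes theorem: posterior = likelihood * prior / evidence.
posteriors = {c: likelihood_red[c] * priors[c] / evidence for c in priors}
print(posteriors)  # assign x to the class with the larger posterior
```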
Look back at the Bayes expression we just discussed, and now consider a discriminant function g_i of the kind we talked about in the last class: the log of the numerator of the Bayes expression. You see these two terms; if I take the log, this is what I get, and I am ignoring the denominator. What was the denominator we are ignoring? We discussed it just a few minutes back: the unconditional probability. So we take the two numerator terms and take the log of that expression. Within it we now bring in this term, the class conditional probability, casually also called the likelihood, together with the class prior. The class conditional probability, or likelihood, is given by this famous expression, which we saw a couple of classes back and again at the beginning of the last class: the multivariate Gaussian density function. That means in a d-dimensional space this is the distribution. Look at the expression; there is a vector sign which I have kept consistent here, though after some time I may drop the vector notation (I have tried to be as consistent as possible throughout my slides). Ideally this should also carry a vector sign: this is the mean vector, this is the sample vector, here is the covariance matrix, and here the normalizing term. Using this expression inside the log, can we write out g_i? Let us work it out on the board. What did we have? We had the Bayes rule; I am writing just the two numerator terms. Can you tell me what they are? The first is the class conditional probability, or likelihood, multiplied by the class prior. Correct; that is my Bayes numerator, and what I am ignoring is the denominator term. Now I write g_i(x), my first discriminant function: I take the log of this, simply the log of the class conditional density plus the log of the class prior. For the class conditional term I substitute the multivariate Gaussian density function. The normalizing term is one over the square root of the determinant of Sigma_i for the corresponding i-th class times (2 pi) to the power d; note the subscript i, and the square root covers the whole term. Then the main part: the exponential of minus one half times (x minus mu_i) transpose, times the inverse of the covariance matrix, times (x minus mu_i); there is a subscript i on both. Mind you, x is a vector; I had the vector sign in my slides and am dropping it for convenience, but if you are copying you can put the vector sign on both, since both of these are vectors. If you substitute that and take the log, one factor comes out: minus d/2 times log 2 pi; then minus one half times the log determinant of Sigma_i; the log and the exponential cancel, so the quadratic term comes out with a minus sign, minus one half (x minus mu_i) transpose times the covariance matrix
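Collecting the pieces, the board derivation reconstructed in full reads:

```latex
p(\vec{x}\mid\omega_i)
  = \frac{1}{\sqrt{(2\pi)^{d}\,|\Sigma_i|}}
    \exp\!\left(-\tfrac{1}{2}(\vec{x}-\vec{\mu}_i)^{T}\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i)\right)

g_i(\vec{x}) = \ln p(\vec{x}\mid\omega_i) + \ln P(\omega_i)
  = -\tfrac{1}{2}(\vec{x}-\vec{\mu}_i)^{T}\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i)
    - \tfrac{d}{2}\ln 2\pi
    - \tfrac{1}{2}\ln|\Sigma_i|
    + \ln P(\omega_i)
```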
inverse, times (x minus mu_i), plus the log prior at the end. Let us keep getting used to this term. So this is what you get: this is your g_i. To be precise, let us correct the notation here, this is x; take out any mistakes of notation. Now if you look at this, it is what we get in the slide as well; compare the expression at the bottom of the slide with the one we have just derived, and it is the same. The only difference is that in the slide I have brought the covariance term to the beginning; the rest is the same, a slightly different arrangement, that is all. If you look into this expression, these can be seen as constant terms: they are not functions of x. The quadratic term is the only one that varies when you change x. Of course, if you change i, that is, go from one class to another, another term can also change, as can the class prior, unless the class priors are constant, or uniform across classes; that would mean you have the same number of oranges as apples when you go to the market or the grocery store, or, if you want to purchase two types of flowers, as many roses as tulips. So the class priors could be the same or different, and the covariance matrix could likewise be the same or different across classes. But given a particular class i, when you change the sample being tested, the only term that varies is this one, and this is the term that actually gives you the distance d we have been talking about for the last one or two classes: the distance of a sample from the class mean. The nearest neighbor classifier is actually a special case of this expression with the covariance term suppressed; keep only these two factors and you get the d we talked about, and we will see that again as a special case. So this is the distance, plus some constant terms which do not vary with x; if you go from one class to another, yes, some changes will take place, and we will discuss that, but it may also happen, as a special case, that the covariance term and the class priors are the same across classes. In fact the concentration henceforth will be mainly on this particular term. Let us go back to the slide. To recapitulate what we have just done: within the purview of the Bayes theorem, incorporating the multivariate Gaussian function as the class conditional probability for a particular class, we have obtained a discriminant function g_i which contains the class prior and the class pdf; the class pdf in turn contains the covariance matrix, which is very important and which we discussed a couple of classes back in terms of distance; and there are a few constant terms, based on the dimensionality of the problem and the determinant of the covariance matrix. If you look back at the expression in the slide, it is the quadratic term of g_i(x) which dictates the job of classification, and it also gives a distance measure, because all the rest are not functions of x. When you vary x within this expression, it
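A minimal sketch of this quadratic discriminant in code (NumPy), with made-up means, covariances, and priors purely to make it runnable:

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """Quadratic discriminant g_i(x) = log N(x; mu, Sigma) + log prior."""
    d = len(mu)
    diff = x - mu
    mahalanobis_sq = diff @ np.linalg.inv(Sigma) @ diff  # the term that varies with x
    return (-0.5 * mahalanobis_sq
            - 0.5 * d * np.log(2 * np.pi)                # constant in x
            - 0.5 * np.log(np.linalg.det(Sigma))         # constant in x, varies with i
            + np.log(prior))                             # class prior term

# Illustrative 2-class, 2-feature setup (all numbers assumed).
mu1, Sigma1, P1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]]), 0.6
mu2, Sigma2, P2 = np.array([3.0, 3.0]), np.eye(2), 0.4

x = np.array([1.0, 1.5])
# Assign x to whichever class gives the larger discriminant value.
label = 1 if g(x, mu1, Sigma1, P1) > g(x, mu2, Sigma2, P2) else 2
print(label)
```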
is this term which changes, and it actually gives a value of distance. Many cases may arise due to the varying nature of the covariance matrix Sigma: it could be diagonal with equal or unequal elements, and there may be off-diagonal terms, positive or negative. We have already seen some of these variations when we looked at the isocontour plots in 2D for a particular Gaussian example. You remember, two classes back we talked about asymmetric Gaussians, and in the last class about oriented Gaussians. What did that depend on? On whether the off-diagonal term was zero or not, and whether one diagonal term equalled the other or not; these were the factors which dictated whether the Gaussian was asymmetric or not, and whether it was oriented or not. The same thing is going to happen here: the diagonal and off-diagonal terms of the covariance matrix will dictate how this particular expression behaves. In fact it is already the expression of a distance, obtained via the discriminant functions, and it will give rise to decision regions (DRs), which in turn will give rise to decision boundaries (DBs). Let us look at some special cases of this function. Remember the expression for g_i we have just derived in class today; let the discriminant function for the i-th class be that same g_i(x). Assume the class priors are the same for two different classes i and j; in fact, whether it is a two-class problem or an arbitrary number c of classes, assume all the class priors are equal. Then the only term which actually matters is this probability function, and based on this we can write a simple expression like the one here. The constant term, which we broke into two parts in the last slide, is written as some constant q; it is constant because it is no longer a function of x. From it we subtract a constant k multiplied by a squared distance. Distance is usually a scalar, and indeed d_i squared, which is a weighted norm of a difference vector, is a scalar quantity, mind you. The constant k is simply 1/2, and d_i is given by this expression, which is going to be the most important part of the discussion for the rest of this part of the course: distance measures with multivariate Gaussian functions for classification and decision boundary estimation.
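Written out, the constant-minus-distance form just described (using the transcript's constant q and k = 1/2) is:

```latex
g_i(\vec{x}) = q_i - k\, d_i^{2}, \qquad k = \tfrac{1}{2}, \qquad
d_i^{2} = (\vec{x}-\vec{\mu}_i)^{T}\,\Sigma_i^{-1}\,(\vec{x}-\vec{\mu}_i)
```

where q_i collects the terms that do not depend on x.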
So for g_i only this term matters now (the other part is a constant): the classification is influenced by the squared distance, in the hyper-dimensional space, of x from mu_i, weighted by the inverse of the covariance matrix, and we are going to examine this in a little more detail in the rest of today's class and henceforth. This term has already been introduced earlier, maybe not explicitly: it is the Mahalanobis distance, the distance from x to mu_i in feature space weighted by the covariance matrix. So it is a weighted distance. If you had just taken x minus mu_i, which was the nearest neighbor or minimum distance classifier, you paid no attention to the class priors or the class conditional probability density functions; in such a case the Mahalanobis distance takes on the simple role of a squared Euclidean distance. But when it is weighted by the covariance matrix, its inverse to be precise, we speak of the Mahalanobis distance criterion, a quadratic term. So look at the slide; this is what I am going to discuss henceforth. It all boils down to the covariance matrix and the distance to the mean; the class priors are relevant in certain examples, and we will bring them back when we discuss a few things together a little later on. For a given x, and for some particular i = m, the value of g (it is the same small g; I have written it in capital here) is largest when the square of this distance is smallest for the class i = m. Remember, if you go back to the previous expression, we have a constant minus k times the squared distance, and there is a minus half in the k, so if the d value goes down, the g_i value goes up. Remember the class assignment rule: assign x to the class for which the discriminant function is maximum. Looking back at the expression, for g_i to be maximum, with the first term constant, the distance must be minimum. That is what is written here: g_m is largest where the distance is smallest, for that particular class i = m, and you assign x to that class m, following the nearest neighbor rule. The simplest case, of course, is that the covariance matrix equals the identity matrix I. This criterion then yields, as I have been saying for the last few minutes, our equivalent distance norm: the nearest neighbor classifier, or minimum distance classifier. It is equivalent to just finding the class mean to which x is nearest over all mu, and the distance function is of the simplest kind: substitute the identity matrix, and the two remaining factors yield the squared norm of the distance in d-dimensional space. That is going to be your d squared; remember, d here is written as a vector difference, but you are taking the squared norm, so it is a scalar quantity. In full vector notation, this is how it can be expanded, as shown below.
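With Sigma_i equal to the identity, the squared distance expands term by term as:

```latex
d_i^{2} = (\vec{x}-\vec{\mu}_i)^{T}(\vec{x}-\vec{\mu}_i)
        = \|\vec{x}-\vec{\mu}_i\|^{2}
        = \vec{x}^{T}\vec{x} \;-\; 2\,\vec{\mu}_i^{T}\vec{x} \;+\; \vec{\mu}_i^{T}\vec{\mu}_i
```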
Remember that g_i(x) is, up to the constants, this d squared divided by 2; the factor of 2 is the constant I am bringing in for normalization. Very simply substituting, dividing the whole expansion by 2, you get this expression. And how can I write this in terms of a linear form? Look at the terms: w_i0 takes care of this last term, and w_i transpose multiplied by x covers the middle one. It seems I am ignoring the first term, and indeed, if I ignore it, I can write the expression like this. I can do this only because, well, I would not say x transpose x is negligible; that is not the idea. I am ignoring this term because it is a class invariant term. Why class invariant? Because if I keep changing the value of i, this term does not change. Why am I doing this? Remember, I am interested in classification. I have formed the discriminant function, and each discriminant function produces a decision region; two discriminant functions, for two different values of the subscript, i and j, create two different decision regions, and between this pair i and j there will be a decision boundary. So if you move from class i to class j to compute g_i, look back at the expression: for the same sample x, if you change i to j, it is the other terms which change, not this one. Hence it is called a class invariant term: it does not change between classes for the same value of x, the test sample you are trying to classify after having learned the discriminant functions. As you move from class 1 to class 2 to class 3, or class a to class b to class c, the quadratic term x transpose x does not change across classes; it does not carry the subscript i, while the other two terms do. So I ignore it for the sake of comparing g_i(x) across classes, and I keep the remaining two terms. The only thing to note is that I have taken g_i to be minus of the distance expression, so the sign has been reversed here; you can see that the negative sign has vanished. Do not think this flip is automatic; it comes from the way we defined g_i, because I want to maximize the function by minimizing the distance, and this choice makes that work. In a very simple sentence: I have linearized the discriminant function, and this will actually give you linear decision boundaries. This is also, very casually, called a correlation detector, but that is not our main aim; we will say that this is the simplest form of a linear discriminant function. The linear discriminant function can also be written as the expression we had in the last slide: a sample x multiplied by a weight vector, plus a bias term. This is the bias term, which depends on the class mean; I repeat, the bias term depends on the class mean, and the weight vector also depends on the class mean. Everything depends on the class mean, because the covariance term has been ignored. Where did we ignore the covariance term? When we took the covariance matrix equal to the identity matrix; that is why we could write this in the first place, and all that remains is mu. That is why we are able to get a linear expression, as in the sketch below.
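A minimal sketch of this linearized classifier, assuming Sigma = I and illustrative class means (all numbers made up):

```python
import numpy as np

def linear_discriminant(x, mu):
    """g_i(x) = w^T x + w_i0 with w = mu and w_i0 = -mu^T mu / 2 (Sigma = I case)."""
    return mu @ x - 0.5 * mu @ mu

# Illustrative class means for a 2-class, 2-feature problem (assumed values).
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]

x = np.array([1.0, 2.0])
scores = [linear_discriminant(x, mu) for mu in means]
print(int(np.argmax(scores)) + 1)  # class with the largest discriminant wins
```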
The linear discriminant function for separable classes is given by the form we saw in the previous slide, and W is a d-dimensional vector, depending on the dimensionality of the problem: the vector of weights used for class i. The expression for W is simply the class mean; so it is a d-dimensional vector, and the same thing holds good here. This function leads to decision boundaries that are hyperplanes in higher dimensions; I talked about this in the last class: a point in 1D, a line in 2D, planar surfaces in 3D, and of course hyperplanes in higher dimensions. We will examine this linear decision boundary in very great detail. In the simplest 3D case it is a plane (in 2D it becomes a line) passing through the origin, and the expression reduces to what has been written here: the W vector is given by its 3 components (do not confuse this W with the class labels we may have used earlier), x is a 3-dimensional vector, and what is written is a simple dot product. In this particular case it is a plane passing through the origin, but of course that is a special case; you may also have a line passing through the origin or elsewhere in 2D space, a plane in 3-dimensional space, hyperplanes in n-dimensional space. Once these weights are learnt, is it really that simple, that I pick up the class means, assign them to my w_i's, assign the bias term w_i0 based on mu_i transpose mu_i, and the classifier's job is done? Well, in the case of linearly separable problems, problems that can be separated by a linear hyperplane, a line in 2D or a plane in 3D, this will work, and we will see that with an example now. We will take examples from geometry in 2-dimensional space and understand the significance of this linear decision boundary, which will lead us later on to an important concept: linear discriminant analysis, LDA, or Fisher's LDA (FLDA), a little later on; it is not an immediate extrapolation right now. What we will do is first learn the importance and significance of linear decision boundaries, then bring back what we ignored in the expression for g_i to get this linear decision boundary. What was that important factor? The covariance matrix. We will bring that covariance matrix back and see its significance, which will in fact bring nonlinearity into the picture of the decision boundary, not only through the diagonal elements but also the off-diagonal elements, and we will see some examples of those. Then we will again come back to the linear decision boundary using the Fisher linear discriminant analysis (FLDA) criterion, which will lead us to the popular method of supervised classification called LDA, linear (or Fisher's linear) discriminant analysis. That is going to be the overall discussion in the next few lectures. We will stop here.
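As a pointer toward the geometry to be examined next time (a standard consequence of the linear form above, stated here for reference, not taken from the slide): setting g_i(x) = g_j(x) for two classes gives the separating hyperplane

```latex
(\vec{\mu}_i - \vec{\mu}_j)^{T}\vec{x}
  \;=\; \tfrac{1}{2}\left(\vec{\mu}_i^{T}\vec{\mu}_i - \vec{\mu}_j^{T}\vec{\mu}_j\right)
```

which is a line in 2D and a plane in 3D, perpendicular to the line joining the two class means.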