So, this is where we stopped at the end of the last class. We were looking at the significance of these equations for learning, or updating, the weights based on samples from two different classes: based on these inequalities, that is, whether you are on the positive or the negative side of the hyperplane, you adjust the weights along the direction of the training sample. We also discussed the significance of the learning rate parameter, as it is called; you start with a larger value and keep reducing it as time progresses, and you stop when the iteration converges. Typically, though, the condition of convergence in a learning algorithm is not based directly on these inequalities; you find the error in predicting the classification of the training samples, and when that error is very small, or does not change over a certain amount of time, you stop the iteration and declare convergence. To be very specific, in the case of a feed forward neural network where the back propagation learning algorithm is used, the objective is to minimize an error term. This objective of minimizing the error term follows a similar logic: given a certain weight wk at any point of an iteration, you adjust the weight to get the new weight wk+1, and the change in the weight, delta wk, is given by this particular expression. Remember, as shown in this diagram, the change in the weight is along the direction of the sample vector; of course it could be on the negative side as well, depending upon which class you are working on. The least mean square learning rule, whose derivation I am skipping here, says that the change in the weight should be eta, the learning rate parameter, multiplied by the direction vector of the sample, multiplied by an error term. This error term is something new which we have not discussed earlier, but the logic is the same: we change the weight along the direction of the sample vector. Delta wk is a vector, as given here; it is parallel to, or along the direction of, the training sample, multiplied by the learning rate parameter and by another scalar quantity, the error term, defined here. What is this error term indicating? You look at the difference between the desired value of the discriminant function and the actual value obtained during classification. Typically you stop when this error term is 0, and it is 0 when the value of g, the discriminant function, is the same as the desired value; in our case that is simply positive or negative, but in a multiclass situation it could actually be a value. So when this error, the difference between the desired value required for the discriminant function g and the actual value of g, which is x multiplied by w as given in the earlier class, is large, you have a larger delta wk, because this factor will then be larger; if it is negligible or small, the change will also be small. So you see, there are three factors: the direction given by xk, the learning rate parameter, which reduces over time, and the error.
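As a minimal sketch of what this update looks like in code, assuming a two-class problem with desired outputs of +1 or -1 and a linear discriminant g(x) = w.x (the function and variable names, the decay schedule, and the stopping threshold here are illustrative choices, not taken from the lecture slides):

```python
import numpy as np

def lms_train(X, d, eta0=0.5, epochs=100, tol=1e-3):
    """Least-mean-square (delta rule) training of a linear discriminant g(x) = w.x.
    X: (n_samples, n_features) training vectors (augment with a 1 if a bias is needed).
    d: (n_samples,) desired outputs, e.g. +1 / -1 for the two classes."""
    w = np.zeros(X.shape[1])
    for epoch in range(epochs):
        eta = eta0 / (1 + epoch)          # learning rate parameter reduces as iterations proceed
        total_error = 0.0
        for xk, dk in zip(X, d):
            ek = dk - np.dot(w, xk)       # error term: desired value minus actual discriminant value
            w = w + eta * ek * xk         # delta w_k = eta * e_k * x_k, along the sample direction
            total_error += ek ** 2
        if total_error / len(X) < tol:    # stop when the error is very small, i.e. convergence
            break
    return w
```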
So if the error is large you move in larger steps, and if the error is small you move in smaller steps; and of course as the iterations proceed the learning rate parameter will also go down, forcing you to move in smaller steps anyway. When do you converge? Now you have an error term ek; you look at the error terms for all training samples, and if each error term is small, or their sum is small, or they do not change over time, then you can stop, because you have reached the solution space. You have reached the solution space because the term x multiplied by w is equal to dk, or very close to it, giving a null or very small error. This is the essence of the learning algorithm in the case of a perceptron as well as the feed forward neural network in general, and it is applicable, with minor modifications, to a multi-layered feed forward neural network as well, where the learning algorithm is called the back propagation learning law: BP neural network, back propagation neural network, it is called rather casually, but the error is indeed back propagated. The signal moves from left to right, from the input towards the output, but the errors are propagated back; this is an example of such an error term which is propagated back to decide how much the weights should be adjusted. The error is back propagated, but the signal flows forward, from the input multiplied by the weights to the output of the next layer, and so on. So it is applicable for a perceptron, although for a perceptron you can use the simpler form which we discussed in the previous slide, without the error term; but if you have different perceptrons forming a single layer, or multiple layers with different numbers of neurons, or perceptrons as they are called, to form a multi-layer feed forward neural network, a learning law of this type is adopted, where the errors are propagated back, measured, multiplied by the learning rate parameter eta, and that decides how the weights are modified, always along the direction of the training sample. This is what we have learned through the end of the last class and also today: it takes you, in the weight space, towards the positive zone of g, of course for the correct set of samples. We now move ahead and get back to our discussion of linear decision boundaries. Remember, in the last class we said that we would not discriminate between decision boundaries and discriminant functions; now we will treat them formally. Earlier we were not discriminating much: one discriminant function discriminates one class with respect to another, or the rest, and it was also acting as a decision boundary. What we will do now is take two discriminant functions, both of them linear. What are these two discriminant functions? They are for two separate classes i and j, 1 and 2, A and B; examples, again I repeat, fruits and flowers, cars versus trucks, aircraft versus trains, let us say, or two different landscapes in the case of remote sensing applications, whatever you are trying to discriminate, one class with respect to the other. You form a discriminant function for each class, and we have seen expressions for these, courtesy of the multivariate Gaussian distribution under the Bayes rule; that gives us a discriminant function, and the Mahalanobis distance criterion under it,
when we took the covariance matrix to be an identity matrix, boiled down to a linear discriminant function. If we take two such linear discriminant functions, as given here, you get a decision boundary by solving this; this is the general expression for a DB, a decision boundary, between two regions. If each of them is individually linear, then you get an expression like this, where these w's are the corresponding weights and these are the corresponding bias terms for the two different classes i and j, and this indicates a linear decision boundary, now precisely discriminating between these two classes. So you are using two discriminant functions, one for each class; each of these could also represent a group of classes put together, if necessary for certain applications, or i could be one class and j a mixture of other classes, but it is basically a binary classification problem: two linear discriminant functions representing two groups, or two sets of classes, or just two classes themselves, or even one versus the rest, whatever the case may be; we are not worried about that at this point of time. Put them into this expression and you get a linear decision boundary, and this is of course the expression of a hyperplane: a point in 1D, a line in 2D, a planar surface in 3D, hyperplanes in higher dimensions, separating the decision regions, or DRs, in high dimensional space. Of course we know that the hyperplane will pass through the origin if the bias terms are 0 or if they cancel out; in such cases we get back to the expression of that capital G which gave us the discriminant function. So we are looking at the generalized result of the Gaussian case of a discriminant function g. Remember, we started with these two classes a while back; we linearized under the assumption that the covariance is an identity matrix. Now we will not make it an identity matrix; let it be a general covariance matrix, but again we will take some special cases. So this expression is not new: this is the Mahalanobis term, this is the constant term, and this one of course is class dependent, because if the sigma i changes from one class to another this term could be different. This Mahalanobis distance in general, if this is not an identity matrix, gives you quadratic terms, a quadratic expression, and it spawns a number of different types of quadratic surfaces. Some of these examples we will take up in the next class when we talk of nonlinear decision boundaries, because today we will still talk of linear discriminant functions under special cases of the covariance matrix not equal to an identity matrix, and we will still be discussing linear decision boundaries in detail. But remember, in general this is the Mahalanobis distance criterion; it is a vector distance using the inverse of the covariance matrix, and it is denoted by this symbol, which we might use in certain cases. Remember, if this is an identity matrix then it is the simple Euclidean distance norm which we have used, and then we linearized as well by ignoring the class-invariant quadratic term; but in general this is the Mahalanobis distance criterion. So under the Bayes rule for class assignment we have assumed that g of x is equal to this. What is this term called in Bayes theorem? I repeat, we have done this: there are four terms, the class prior, the unconditional, the conditional and the posterior. Which one is this? It is the class posterior: given a sample x,
what is the class to which it belongs? Correct. So this is the left hand side of the Bayes expression; that is our discriminant function. We started with that and we made several assumptions: we ignored the class prior, and of course we always ignore the unconditional density in the denominator, but we mainly concentrated on the class conditional distribution, and that is where we brought in the multivariate normal density. So we will assume that the class prior is still the same for the time being. In some sense this is what we are talking about for g of x; I have just given it a different notation, because this is not the same as this. Put the unconditional density function here; the product of these two gives log of this plus this, and under this, if the prior is the same, it is the class conditional function here which is going to dictate, and we take this to be our normal density function. This is nothing new, we have done all of this, but I am just recollecting it so that we can put ourselves in perspective. The log prior term is still here, and if you break the expression into parts and simplify, it will give you this term and this term; the log prior is still here. This is the expression within the exponent; sorry, it was within the exponent, it has already been taken out; so the Mahalanobis distance criterion is still here. We have just ignored the constant. Look at the final expression here, what we get; I leave this simple derivation to you. What is the constant term we ignored? This one; this is the constant term we ignored when we went from this step to this step. The log prior is still here, the log of the determinant of the covariance matrix is still here, yes, and the inverse; this is the Mahalanobis distance here. So this is what we are working on, and earlier we had assumed this covariance to be an identity matrix.
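A small sketch of this general discriminant, assuming the form just described; the function and variable names are mine, and the class-independent constant -(d/2) ln(2*pi) is dropped exactly as in the derivation:

```python
import numpy as np

def gaussian_discriminant(x, mu_i, sigma_i, prior_i):
    """g_i(x) = ln P(w_i) - 0.5*ln|Sigma_i| - 0.5*(x - mu_i)^T Sigma_i^{-1} (x - mu_i).
    The class-independent constant -(d/2) ln(2*pi) has been dropped."""
    diff = x - mu_i
    mahalanobis_sq = diff @ np.linalg.inv(sigma_i) @ diff   # squared Mahalanobis distance
    return np.log(prior_i) - 0.5 * np.log(np.linalg.det(sigma_i)) - 0.5 * mahalanobis_sq

# Assign x to the class whose discriminant value is largest, e.g.:
# predicted = max(range(n_classes), key=lambda i: gaussian_discriminant(x, mus[i], sigmas[i], priors[i]))
```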
So now let us take the next simplest case, or a simpler case, of the covariance matrix: instead of taking it to be an identity matrix we will take it to be a diagonal matrix, and not only diagonal, all the variance terms are the same, and that too across classes. Physically, if you want to interpret this, what are we talking about? You are distinguishing between, say, fruits and flowers, two different classes, and I have taken features like colour, weight, smell, size and so on, and what I have found is that across the two different classes I seem to have the same variance, the same scatter, for all of these features: the variance of the colour feature across the fruits I have taken is the same as across the flowers. From my statement itself you can almost imagine how unrealistic this assumption is, but for the sake of the mathematical argument we will start to relax these constraints one after another. In the previous case, mind you, we took the covariance to be an identity matrix, which means the variance was equal to one; now at least we are allowing some variance, but the variance is the same across all dimensions and across all classes: fruits have the same variance of colour as flowers, the variance of weight is also the same, size also, and so on and so forth. Not a realistic assumption, but for the sake of the argument we will proceed. And we eliminate the class independent bias; which one did we eliminate in the previous expression? There were three terms, and the log prior is there; what did we eliminate, can you guess and tell me from the previous one? We will go back and have a look: this term. How could we eliminate this? Because the log of this matrix, actually I should put a mod here because it is the determinant, so the determinant sign is missing, please put the determinant of sigma. For a diagonal matrix, what is the determinant? Because we have made the assumption that it is strictly diagonal, the determinant will be the product of all the diagonal terms, correct, and that is the same across classes; that is why the term could be ignored here. And the inverse of a diagonal matrix will be 1 by those elements, and the elements are sigma squared, so 1 by sigma squared is now taken out, and the factor of 2 comes out anyway; it is already there. So in fact this term remains, with a 1 by sigma squared here, this one is ignored, and the log prior is there: this is how you get this expression. What do you do with this? In general this expression indicates constant hyperspheres centred around the class mean. Where is the class mean here? Mu i. I leave it for you to write this expression in two dimensions; write it in two dimensions, forgetting the constant term, and see what geometry you get; you should be able to extrapolate the statement which was at the bottom of the slide. If these are constant hyperspheres, what should they indicate in 2D? What is the projection of a sphere in 2D? Louder. You take a sphere in three dimensions and project it onto the two dimensional
world, a simple projection. Take a ball, think of a ball; what geometry will the figure indicate in 2D? It should be a circle. So that is the equation of a circle which you get in 2D; it will be a sphere in 3D, and hyperspheres beyond, but more on this later, because in general this appears to give a nonlinear decision boundary: the quadratic term here is what does that. But more of this later on. So this is an example of a diagonal covariance matrix, and this is the case when sigma 1 is not equal to sigma 2; then of course you have to write the inverse of the covariance matrix in this form. Of course, if we can drop the subscripts, meaning the individual variance terms are the same, you can have the same term along all of these; in fact you can take the constant out and say this is an identity matrix multiplied by 1 by sigma squared, which we did a little earlier. Considering the discriminant function, we had ignored this a little earlier; in general this will yield a weighted distance classifier, and depending upon the covariance terms, larger or smaller scatter or spread, we tend to put more emphasis on some feature vector components than on others. This will in general give hyper-elliptical surfaces in Rd, in d-dimensional space, for each class. Look at this expression now, with this covariance matrix: how is it different from the previous covariance matrix? Sigma 1, sigma 2, sigma 3 up to sigma d; remember the covariance is still the same for all classes, but the diagonal terms are now different, whereas in the previous case the diagonal terms were all the same, so those were giving hyperspheres. We are approaching the case where we can get nonlinear decision boundaries because of the quadratic term, and the surfaces will be spheres in the case when you have equal, or identical, terms along the diagonal; if they are unequal, as given here, you will have hyper-ellipsoidal surfaces. If they are all the same, as a special case, the ellipsoid becomes a sphere: in 2D a circle instead of an ellipse, in 3D a sphere instead of an ellipsoid, and in higher dimensions hyperspheres or hyper-ellipsoids. We will discuss this in detail in the next class on nonlinear decision boundaries, with examples, at least in 2D where we can show this sort of thing. Carrying on with the discussion of the decision boundaries, let us assume that all the class priors are the same; well, I must warn you a little that sometimes we write P of wi, but w is also used for the weight, so let us stick to ci for the classes. Then, eliminating the class independent terms, which, if you look at this, means these two are ignored, you are just left with the Mahalanobis distance. Let us concentrate on this term, which is telling you that the covariance decides whether to give more importance to a particular dimension or not, because this is what dictates which dimension has more importance. So this is the Mahalanobis term, and if you expand it in this particular form you can write the expression in this form, and this can now be written in terms of gi of x as this, provided you can switch off this term; and you can switch off this term if the corresponding covariance matrix is the same for all classes. Now note, we are not putting the constraint that the covariance matrix is diagonal; no, it can be any arbitrary covariance matrix, it may or may not be diagonal; what we are saying is that it is the same for all classes.
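Before moving on to the arbitrary shared covariance, here is a small sketch of the weighted distance idea for the diagonal case just discussed, assuming a diagonal covariance shared by all classes; the names are mine, not from the slides:

```python
import numpy as np

def weighted_distance_discriminant(x, mu_i, variances, prior_i):
    """Discriminant for a shared diagonal covariance Sigma = diag(variances).
    Each squared feature difference is weighted by 1/sigma_j^2, so features with a
    smaller scatter get more emphasis; the constant contours of the quadratic part are
    hyper-ellipsoids, and hyperspheres when all the variances are equal."""
    diff = x - mu_i
    weighted_sq_dist = np.sum(diff ** 2 / variances)
    return -0.5 * weighted_sq_dist + np.log(prior_i)
```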
How unrealistic can that be? It may be possible in certain situations, but what I am saying is that features such as height, weight, colour, size, spatial extent for two different types of objects, fruits and flowers, have the same covariance matrix; it is of course a symmetric matrix, but it is the same for both classes. So if it is the same you can drop the subscript i, and if you can drop the i you can switch off this particular term, and you will be left with only these two terms here; typically it is gi divided by 2, so the 2 goes off and you have a minus half factor here. So we will concentrate on this term; it has come back again, this is not a new term, a couple of classes back we talked about it, and in the last class we discussed it at length, which gave us the concept of perceptron learning with a linear decision boundary. The only difference now is that earlier this weight was based only on the mean; now the covariance term has come and sits here, the inverse of the covariance matrix, to be very precise. These weights, I have just changed the notation to a small w here, and this is the inverse of the covariance matrix. So we discussed this case earlier when this was an identity matrix, and when the covariance matrix was an identity matrix we just had the mean mu as the weights, and mu transpose mu without this term, for the linear decision boundary. We still have linear discriminant functions, we can still have linear decision boundaries, but the covariance term comes and sits here; it may be an arbitrary covariance matrix, it is just symmetric, that is all. So this is what we have, and we will analyze this for the rest of the class and move over to nonlinear decision boundaries in the next class. This is the last part of the discussion on linear decision boundaries, because this type of gi of x gives us DRs and DBs which are hyperplanes, and they will be linear by exploiting this constraint, as we did at the beginning of the class. Beyond this, whenever you have a diagonal sigma which is class dependent, remember this result is coming under the constraint that the covariance matrix is class independent; go back and look at this constraint, it is class independent, which means if you go from class 1 to 2, from i to j, it is the same covariance matrix. If it is class dependent, or if the off-diagonal terms are non-zero, you will typically have nonlinear discriminant functions and decision boundaries; nonlinear discriminant functions and decision boundaries is the correct way of saying it, there is no nonlinear decision region, but what it means is that the discriminant functions and decision boundaries give you DRs, so the boundaries of those regions will be nonlinear if you have a class dependent covariance matrix, and the off-diagonal terms typically have a bigger role to play. But as long as you have a class independent covariance matrix, diagonal or not, including the identity matrix which is a special case of a diagonal covariance matrix, you will have linear DFs and linear DBs. So again, to repeat, what are the special cases of the covariance matrix in which you will have linear DBs or linear discriminant functions? One: the covariance matrix is class independent; that is all, straight away, first of all.
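A minimal sketch of this linear form, assuming a covariance matrix shared by all classes as discussed above; the helper names are mine, not from the slides:

```python
import numpy as np

def linear_discriminant_params(mu_i, sigma, prior_i):
    """With a class-independent covariance Sigma, g_i(x) reduces to the linear form
    g_i(x) = w_i^T x + w_i0, where
        w_i  = Sigma^{-1} mu_i
        w_i0 = -0.5 * mu_i^T Sigma^{-1} mu_i + ln P(w_i)."""
    sigma_inv = np.linalg.inv(sigma)
    w_i = sigma_inv @ mu_i
    w_i0 = -0.5 * mu_i @ sigma_inv @ mu_i + np.log(prior_i)
    return w_i, w_i0

def g(x, w_i, w_i0):
    return w_i @ x + w_i0   # linear in x, so the decision boundaries g_i - g_j = 0 are hyperplanes
```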
Well, in some sense you can say that the identity matrix we considered earlier is a special case of this; that is also class independent. So as long as it is class independent you will have linear discriminant functions; if it is class dependent, even if diagonal, you will have hyper-ellipsoids or hyperspheres, which gives rise to the simplest case of nonlinear discriminant functions, which we will discuss next. Let us proceed with the discussion of the decision boundaries and discuss the effect of class priors in a more general case. Remember, just to revisit the equations, what is this equation? You have seen it many times: Bayes theorem. Under the Bayes theorem this is the posterior; remember, the c has been changed to wi, be careful, this is not the weight w, it denotes the class. This is the conditional density function which we are talking about here, and this is the multivariate Gaussian density which we started a few classes back, the normal distribution. If you take gi to be this plus the class prior, that means you are taking this term, the class conditional density function as given by this, plus the class prior, you can expand it using this expression; the log prior is here, and we have done this many, many times: take the log of this expression and you have the Mahalanobis distance plus these two terms, one of which is a constant anyway, and the other could be class dependent or class independent. Since we are talking about linear decision boundaries, what assumption did we make about the covariance matrix? It is class independent, so there should be no subscript here; although I have put a subscript, the covariance matrix is the same, so we can switch it off. Carrying on, we can switch off these two terms because they are class independent, and we are left with the Mahalanobis distance function, that is, the Euclidean distance weighted by the inverse of the covariance matrix, which is the same for all classes, plus the class prior. What is this prior term doing if the class priors are not the same? Remember what the class prior is: you are discriminating between two types of fruits, say mangoes versus apples; both are, let us say, seasonal fruits, and depending upon the time of year you may have more apples, say in winter, and more mangoes in summer or the rainy season. So the class priors will change, they may not be the same; that is what we are saying. How does this affect the decision boundary? Let us look. Cancelling the class invariant terms we are just left with these two; remember, this is the main one, responsible for all my linear and nonlinear decision boundaries, but under the case of the same diagonal sigma, that means we are dropping the subscript i and keeping diagonal, class independent elements, what do we have for the expression? So we drop the subscript; this is the expression which we have, plus the log prior; this is case (a) which we are discussing, diagonal; we have done this already. This is a diagonal matrix, we can take this out, the half is here, the inverse of the covariance matrix yields this term, and then we have the class prior. When we break open the expression, this is what we get: a nonlinear term here, and the linear term, as a function of x, here. We will analyze this expression in the next slide. Analyzing this expression again, the nonlinear term is here and the linear part is here; why are we switching this one off? Because if you move from one class to the other, changing i to j, you see that this
will change, this will change as well, this will change, but this will not change; so this is a class independent term; again, it is a quadratic term, but a class independent one, and it has appeared because of the class independent covariance matrix, the same diagonal sigma. So once you do this, this is what you have, and this is the same expression which we had just a few slides back, but instead of writing the inverse of the covariance matrix here multiplied by the mean, you are able to write it divided by the variance, divided by the variance here; the same thing comes here; remember the 2 and the 2 cancel out, so you are left with this divided by sigma squared here, and mu transpose mu divided by sigma squared; in general you will have the inverse of the covariance matrix here, basically this term is indicating the inverse of the covariance matrix, but the class prior is still here; earlier we had ignored it. The linear decision boundary is now this; we saw at the beginning of the class today that it can be written in this form, where each of these terms is individually given by this expression; so it is wk minus wl, transposed, where k and l are the two different classes, they are not identical, and the difference of the bias terms is also written here. Let us observe this expression in the next slide. The linear decision boundary is given by this, which is basically this, and I leave it as an exercise for you to prove that this difference in the bias terms can also be written as a function of this, like this, where x0, and this is interesting, is given by this particular term. Look at this term; I am leaving it as an exercise for you to prove it analytically; you can prove it with a little bit of juggling, maybe about half a page of derivation. Henceforth, if you write this term in terms of this and substitute back here, this expression can be written, and this is nothing new, we had this two classes back, as w into x minus d equal to 0, from where we derived the perceptron, the hyperplane: the plane passing through the origin, or a plane passing through a particular point. The only thing is, what is this w? It is the difference of the two mean vectors; if there are two mean vectors, w is the vector difference between them. Remember, w is the normal to the hyperplane, and the normal to the hyperplane is now the line joining the two means, so the hyperplane will be normal to the line joining the two means. I repeat: the w which you see here, we had seen earlier, two classes back, that if this is a hyperplane, this is the w, and what we are saying now is that this w is the vector joining the two class means. So if class mean one is here and class mean two is here, this is my w and this is the hyperplane. There is a diagram coming up in the next slide, but I hope you got the justification; maybe I will go to the board and just draw it for you, because I may not have a slide ready. So if this is mu k and this is mu l, then mu k minus mu l will be the w; the vector direction will be here or here, although strictly it does not matter; the expression says it is mu k minus mu l, so it will be pointing towards mu k; this will be w, and we said that w is orthogonal to the hyperplane, so this is the hyperplane H. I am drawing in 2D; this is my pattern space, or feature space, not the weight space
like we used in the case of the perceptron. So this should be x1, this will be x2, a two dimensional feature space; you can visualize it in a third dimension also. This is my hyperplane, for which I wrote that equation here, w transpose x minus x0; if you select points here on this hyperplane, this will be equal to 0, so this is the expression which defines this hyperplane H, and w is normal to it. You could ask me, where is this x0? It is a point here, a point here on this plane. Let us look at the expression for x0 now in the slide. x0 is a point on H, and if you look at this expression of x0, look at this term; can you tell me what this term is? It is the average of the two means. Go back to the board: these are the two means, mu k and mu l; the average of the two means will be the central point, let us say; so for the time being I will mark this as some x0 tilde, some average value of the two means, and it is only an approximation, because it is not the full expression; it contains only the first term from the slide. So let us go back to the slide. What I have marked there on the board is basically this term here, but there is another term, which is based on the class priors; we will explore that now. What is the effect of the class priors on that point x0? And x0 is a point which is actually lying on the hyperplane; so this is the x0. So we had this: w, very simply, is the vector along the line joining the two means, the vector from one mean to the other, orthogonal to the hyperplane, and the hyperplane passes through x0, because that is what actually decides my decision boundary. It should have been in between; x0 will be equal to this, under what condition? Tell me, what condition makes the second term vanish? Don't tell me sigma equal to 0; what will make this vanish? Don't tell me mu k should be equal to mu l. This will be equal to 0 when these two probability terms, the class priors, are the same; that means I get an equal number of mangoes and an equal number of apples. If that happens, this term vanishes, forcing x0 to be equal to the mean of the two class means; that means my hyperplane is strictly in the middle of the two class means, and in this case the hyperplane H is the decision boundary; earlier it was the DF, the discriminant function, but now it is the difference of two g's, which is a linear decision boundary. We will have some examples; let us look at a simple example which I have borrowed from a website, and I will give that reference a little later on. Look, these are two Gaussian functions drawn here, and if you look at the Bayes theorem ignoring the class priors, this point at the centre, at the intersection of these two, should have been the actual value of x0; x0 is a point here, mind you, the average of the two means; this is one mean, this is the other mean, class 1, w1, class 2, w2, two Gaussian distributions; the intersection in the middle should have been the point x0, corresponding to this expression. But in this figure one class prior is more than the other, so this term does not vanish; it pushes the decision boundary more towards the other class. Ideally I would have loved to have the decision boundary exactly in the middle of the two Gaussian functions, provided the class priors are the same.
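A small numerical sketch of this point x0, assuming the shared covariance sigma-squared-times-identity case discussed above, for which this is the standard form; the function and variable names and the numbers are illustrative, not from the slides:

```python
import numpy as np

def boundary_point_x0(mu_k, mu_l, sigma_sq, prior_k, prior_l):
    """Point x0 through which the hyperplane w^T (x - x0) = 0 passes, for Sigma = sigma^2 * I:
    x0 = 0.5*(mu_k + mu_l) - sigma^2 / ||mu_k - mu_l||^2 * ln(P_k / P_l) * (mu_k - mu_l).
    With equal priors the second term vanishes and x0 is the midpoint of the two means."""
    w = mu_k - mu_l                                              # normal to the hyperplane
    shift = (sigma_sq / np.dot(w, w)) * np.log(prior_k / prior_l) * w
    return 0.5 * (mu_k + mu_l) - shift

mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])
print(boundary_point_x0(mu1, mu2, 1.0, 0.5, 0.5))   # equal priors: the midpoint [2, 0]
print(boundary_point_x0(mu1, mu2, 1.0, 0.9, 0.1))   # pushed towards the class with the smaller prior
```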
But if the class priors differ, what does it mean in reality? There are more apples than mangoes, there are more flowers than fruits, there are more men than women, there are more forests than mountains, in a landscape there is more water than desert, there are more people smiling than people who are angry, if I am talking about discrimination of expressions in the human face; these are examples only, and in certain cases one class prior can be larger. The class prior pushes the decision boundary; it is an additive corrective term to the value of x0, and the point is pushed. It could even go to the other side if the class prior of class 2 were larger; increase that class prior more and you can see how far it goes. Why? Because this value is 0.9 and this is 0.1, whereas these are 0.7 and 0.3, so this class prior is now larger than this one, and it pushes the decision boundary from this point here more towards the other side, because this term will now be larger, as the numerator is more than the denominator, making the term larger for this figure than for the left-hand one. This is a case of a DB with the same diagonal sigma, the same for all classes, with identical diagonal elements, which is why you could take out that sigma, but one class prior is larger than the other, so the decision boundary does not lie in the middle. It is a point in 1D; we will now show examples in 2D where it becomes a line, and you can extend that to a plane in three dimensions and hyperplanes in higher dimensions. This is a Java applet, available as open source, which you can play with to manipulate the individual variance terms and the class priors; there is a slider bar to change the standard deviation. The standard deviation is equal to 1 for both classes, equal standard deviations, identical covariance matrices, in 2D; and both class priors are identical. So look at the DB here: this is that hyperplane H which is the DB; where is it sitting? This is that x naught; why? Because it is exactly the midpoint of these two means mu 1 and mu 2; well, they are written as m1 and m2, so take them to be mu 1 and mu 2. Two classes, equal priors, 0.5 each, as given in the two bars at the bottom; the class priors are the same, the ideal condition. If the class priors are the same, if you go back, this term cancels out and you have x naught exactly at the bisection point between mu 1 and mu 2, and the DB is actually the perpendicular bisector of the line joining the means; the DB is a perpendicular bisector of the line joining the class means, as a special case, if the class priors are the same and the covariance matrices are identical, that is, a class independent covariance matrix. What I will do now is first change the class priors and see what happens to this diagram, and then of course also change the variance terms and see whether that has an effect as well; it should. Let us look at this diagram on the right hand side. What I have done is maintain the same value of the variance, which is equal to 1, both here as well as there, but I have changed one of the class priors; look at the value given here, and this blue bar, indicating that the value is much larger, is 0.9, and the other class bar is 0.1; earlier both were identical. What has it done? Earlier the decision boundary was at this point of the line, as given in this diagram; it has pushed it towards which class? It has pushed it towards the class which had the smaller class prior; class one had the smaller class prior and the DB has moved towards it; earlier it was at the centre because the class priors were the same. The class prior for class one is
now 0.8, indicated by the big red bar there, and the other value is 0.2, which is much less. Look at what it has done: because the class prior of class one is higher, it has pushed the decision boundary from this point, as given here, more towards the other class. Why? The second term of the decision boundary will now change its sign and move it in the other direction; remember, it is a log ratio of class priors, so depending upon which term is larger you will have a positive or a negative numerical value, pushing the separating hyperplane, the DB, towards one particular class. Which particular class does it move towards? It moves towards the class mean which has the smaller class prior; you see, the smaller the class prior, the more the DB moves towards that class mean; here it moves towards the class 2 mean because that has the smaller class prior. So this is the effect of the class prior in the Bayes decision rule under the normal distribution: the class prior does not allow the decision boundary to sit strictly at the perpendicular bisector, whether it is a line, a plane, or even a point in 1D; remember the previous slide, where it pushed it towards the class mean or even further away. Let us go back and look at this example: it has even gone past the class mean of class 2, because the prior is really very large; I think there is a typo here, this should be w1; the class prior for w1 is 0.7 there and 0.9 here, so it has gone past the class mean of the second class itself; the mean is somewhere here and the boundary has gone past it. So if the prior ratio is too big, the boundary can even go past the class means; whether it goes past or not depends on the individual variance. So what I will do now is keep these unequal class priors and change the class variance, the scatter; we will reduce it in the next slide. Look, this is the effect of unequal class priors but a smaller scatter: the boundary has moved, but not that much, which means the variance has a role to play; this is the standard deviation. Let us go back to the expression: look, this is a multiplying factor, so if this is larger the effect is more pronounced, and if this is smaller the effect is less pronounced; that is why in those cases the effect is more, and if you enlarge it, if the variance is more, then there will be class overlap, plus the decision boundary will keep moving more towards the class mean and even further away. Remember, the plane will always be orthogonal: the separating hyperplane, or the DB in this case, will be normal to the line, the vector, joining the two class means, or equivalently the vector joining the two class means will be orthogonal to the plane; that will always be the case. The plane remains orthogonal to this vector, but it may move away if the ratio of the priors is higher or if the variance terms increase. Here we have reduced the variance; the shift is there, it is not at the perpendicular bisector, because the priors are unequal. Let us look at this case here: this is the case where the variance is really large; the separating hyperplane has come much closer to the class mean, and it may go even further away if we increase this, or reduce this further and push this value more towards 1. So this shows where the linear decision boundary will be located, depending upon two factors: the spread or scatter, that is, the individual scatter matrices or the variances of the features, and the class priors; these two decide where the decision boundary will be located.
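To make that dependence concrete, here is a quick usage of the boundary_point_x0 sketch from earlier, reusing the same mu1 and mu2 and the same assumptions; the numbers are illustrative only, and they show that for the same unequal priors a larger variance moves the boundary further from the midpoint:

```python
# Unequal priors (0.8 vs 0.2); the shift from the midpoint grows with the variance.
print(boundary_point_x0(mu1, mu2, 0.25, 0.8, 0.2))  # small variance: stays close to the midpoint [2, 0]
print(boundary_point_x0(mu1, mu2, 4.0, 0.8, 0.2))   # large variance: pushed well past it, towards mu2
```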
Remember, the separating hyperplane will always be orthogonal to the line joining the two class means, or the line joining the two class means will be orthogonal to the hyperplane, but its position, dictated by x naught, is governed by that second term. In general it will be at the perpendicular bisector, at the centre point of the line joining the two class means, but it will move back and forth, more towards one class mean, depending upon the ratio of the class priors multiplied by a factor which also depends on the variance; plus there is another factor in between. Let us go back to the expression: it is normalized by this; see, this is a vector, and this is also a vector term, so this is a normalized vector along the line joining the two class means, and the numerical value dictated by the log prior ratio and by the variance, the scatter or spread, dictates where you are. We will stop the discussion today with linear decision boundaries; next we will move towards nonlinear decision boundaries, where we will have class dependent variances, or covariance matrices, and also the effect of the off-diagonal terms; as long as the covariance matrices are unequal we will have nonlinear decision boundaries. Thank you, we will come back in the next class.