Continuing with feature selection: so far we have been discussing algorithms like sequential forward selection, sequential backward selection, generalized sequential forward, generalized sequential backward, the LR algorithm and the branch and bound algorithm; there are many other algorithms available. Now I will start discussing feature selection criterion functions, that is, how one chooses the criterion function for feature selection.

So again, you have $x_1$ to $x_N$, a total of $N$ features, and the feature vector for the $i$-th observation is

$$x^{(i)} = \left( x_1^{(i)}, x_2^{(i)}, \dots, x_N^{(i)} \right)^{T},$$

where $x_1^{(i)}$ is the value of the first feature for the $i$-th observation, $x_2^{(i)}$ the value of the second feature, and so on up to $x_N^{(i)}$; so the feature vector lies in $N$ dimensions.

Now let us say there are $K$ classes, and let us assume that we know the class conditional probability density functions: $p_1$ for the first class, $p_2$ for the second class, and so on up to $p_K$ for the $K$-th class. Let us also say we know the corresponding prior probabilities $P_1, P_2, \dots, P_K$; naturally each of them is greater than 0 and their sum is equal to 1. Let us just say we know all of these.

Now we are supposed to select a smaller number $b$ of features. Consider one such subset: let $B$ be a subset of $S$, the full set of $N$ features, such that $B$ contains $b$ features. With respect to those $b$ features we can define the density functions $p_{1B}, p_{2B}, \dots, p_{KB}$: using only the features in the subset $B$, the corresponding conditional probability density function for class 1 is $p_{1B}$.

Do you know how to get this $p_{1B}$ from $p_1$? If you do not, let me show you. Since $p_1$ is a density function in the $N$ variables $x_1, \dots, x_N$, integrating it over the whole $N$-dimensional space gives 1:

$$\int_{\mathbb{R}^N} p_1(x_1, x_2, \dots, x_N)\, dx_1\, dx_2 \cdots dx_N = 1.$$

Now look at the $b$ chosen features. Without loss of generality, say the subset is $B = \{x_1, \dots, x_b\}$. Take $p_1(x_1, \dots, x_b, x_{b+1}, \dots, x_N)$ and integrate out the remaining variables $x_{b+1}, x_{b+2}, \dots, x_N$, that is, integrate over the $(N-b)$-dimensional space $\mathbb{R}^{N-b}$:

$$p_{1B}(x_1, \dots, x_b) = \int_{\mathbb{R}^{N-b}} p_1(x_1, \dots, x_b, x_{b+1}, \dots, x_N)\, dx_{b+1} \cdots dx_N.$$

If you do this integration, the function you get is exactly the density for class 1 restricted to the features in $B$; let me write it as $p_{1B}$.
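To see the marginalization in a concrete setting, here is a small numerical sketch; the two-feature Gaussian density and the use of SciPy quadrature are illustrative assumptions of mine, not something from the discussion above. Starting from a joint density $p_1(x_1, x_2)$, integrating out $x_2$ gives the marginal $p_{1B}(x_1)$ for $B = \{x_1\}$, and the result still integrates to 1.

```python
# Illustrative sketch (assumptions: a 2-feature Gaussian class density, SciPy for quadrature).
# Marginalizing the joint density p1(x1, x2) over x2 yields p1B(x1) for B = {x1}.
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal, norm

# Hypothetical class-1 density over N = 2 features.
p1 = multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.3], [0.3, 2.0]])

def p1B(x1):
    """Marginal density of x1: integrate the joint over x2, the 'remaining' feature."""
    val, _ = quad(lambda x2: p1.pdf([x1, x2]), -np.inf, np.inf)
    return val

# The marginal is itself a density: it should integrate to 1 over x1.
total, _ = quad(p1B, -np.inf, np.inf)
print(total)  # ~1.0

# For a Gaussian the marginal is available in closed form (here mean 0, variance 1),
# so one value of the numerically marginalized density can be cross-checked.
print(p1B(0.5), norm(0.0, 1.0).pdf(0.5))  # the two numbers should agree
```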
What are the arguments of this $p_{1B}$? The variables $x_1$ to $x_b$ are still there, because the integration was carried out over $x_{b+1}, x_{b+2}, \dots, x_N$; the variables that remain are exactly the features you kept in $B$. That is your $p_{1B}$. Have you understood what I wanted to say? Similarly you can obtain $p_{2B}, p_{3B}, \dots, p_{KB}$, and you can do this for any subset $B$ of $S$ having $b$ elements; by varying $B$ you get the corresponding density functions. These are called marginal density functions: you do the integration with respect to the rest of the variables, so that the variables you want remain as they are.

Once you have $p_{1B}, p_{2B}, \dots, p_{KB}$, you can use the prior probabilities to obtain the Bayes decision boundary and the Bayes misclassification probability. So: find the misclassification probability of the Bayes decision rule using the features in $B$, and denote that misclassification probability by $J(B)$. Is this clear?

Now you are supposed to do the feature selection, and this $J$ is your criterion function: you must find that particular subset $B_0$ containing $b$ elements for which $J(B_0)$ is minimum. That is, find $B_0 \subset S$ containing $b$ elements such that

$$J(B_0) \le J(B) \quad \text{for every } B \subset S \text{ containing } b \text{ elements}.$$

Is this clear to you? This is the way feature selection should ideally be done, but usually people do not go in for it. What is the reason? The reason is very simple. First, for most problems you really do not know the probability density functions. Second, even if you know them, it is difficult to calculate $J(B)$, the misclassification probability. If you look back at my earlier lectures on the Bayes decision rule — which is supposed to be the best rule — you will recall why we went in for other decision rules at all: even when you know the probability density functions, it is extremely difficult to find the misclassification probabilities, and hence to find which subset actually gives the minimum. So although the expression for $J(B)$ is fine, it is difficult to evaluate, and that is why people do not follow this ideal route in practice.
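To make the ideal criterion concrete, here is a rough numerical sketch of how $J(B)$ could be approximated when the densities are assumed known. Everything in it is an illustrative assumption rather than part of the discussion above: the two classes are taken to be Gaussian, the marginals $p_{iB}$ are then obtained by keeping the relevant components of the mean and covariance, and $J(B)$ is estimated by Monte Carlo instead of being evaluated analytically.

```python
# Sketch of the ideal criterion J(B) = Bayes misclassification probability using only the
# features in B, minimized over all subsets of size b.  Assumptions (illustrative only):
# two classes, known Gaussian class-conditional densities, Monte Carlo estimation of J(B).
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, b = 4, 2                                   # total features, number of features to keep
priors = [0.5, 0.5]
means = [np.zeros(N), np.array([2.0, 0.1, 0.0, 1.5])]
covs = [np.eye(N), np.eye(N)]

def J(B, n_samples=5000):
    """Monte Carlo estimate of the Bayes error when only the features in B are used."""
    B = list(B)
    # Marginal class densities: for Gaussians, keep the sub-mean and sub-covariance.
    marginals = [multivariate_normal(means[j][B], covs[j][np.ix_(B, B)]) for j in range(2)]
    errors = 0
    for _ in range(n_samples):
        k = rng.choice(2, p=priors)                          # draw the true class...
        x = rng.multivariate_normal(means[k], covs[k])[B]    # ...then the observation, restricted to B
        post = [priors[j] * marginals[j].pdf(x) for j in range(2)]
        errors += int(np.argmax(post) != k)                  # Bayes rule: pick the larger P_j * p_jB(x)
    return errors / n_samples

# Exhaustive search: the subset B0 with the smallest estimated J(B) is selected.
scores = {B: J(B) for B in combinations(range(N), b)}
B0 = min(scores, key=scores.get)
print(sorted(scores.items(), key=lambda kv: kv[1])[:3], "selected:", B0)
```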
People then started wondering: is there any way to choose features using the probability density functions, but without actually calculating the Bayes misclassification probabilities — some way of looking at the density functions without trying to compute the misclassification probabilities? This was the question people asked themselves, and they came out with the probabilistic separability based criterion functions.

What is the meaning of probabilistic separability based criterion functions? Let me explain it in the following way. Say these are the two density functions for the classes — this one is the density function for class 1 and this one for class 2 — and note that there is a gap between them. These two functions then satisfy a property: $p_1(x) > 0$ implies $p_2(x) = 0$. Wherever $p_1$ is positive, $p_2$ is zero, and similarly the other way around.

Now suppose we have chosen the required number of features in such a way that those features separate out the density functions like this. What happens to the Bayes decision rule in this case? Does it give you any misclassification, yes or no? My claim is that there is no misclassification. Why? The Bayes decision rule, in the two-class case, is: put $x$ in class 1 if $P_1 p_1(x) > P_2 p_2(x)$. Here, wherever $p_1(x) > 0$ we have $p_2(x) = 0$, so all those points go to class 1; and in the other region $P_2 p_2(x) > 0$ while $P_1 p_1(x) = 0$, so those points go to class 2. So there is no misclassification.

So what people thought was this: choose the features in such a way that the separation between the density functions is as large as possible. That is the basic idea, and that is why it is called separability — separation between the density functions — and since we are looking at density functions, people called it probabilistic separability. These are the probabilistic separability based feature selection criterion functions. Have you understood?

Now the question is: how do you define the separation between two density functions? If you can define that, then you can say which particular features give you the maximum separation. So how does one define it? Say you have two classes, with density functions $p_1$ and $p_2$ and prior probabilities $P_1$ and $P_2$. In general you define a measure of the form

$$J = g_2\!\left( \int g_1\big(p_1(x),\, p_2(x),\, P_1,\, P_2\big)\, dx \right),$$

that is, you take some function $g_1$ of the densities and the priors, do some sort of integration, and then possibly apply an outer function $g_2$. This is a general way in which separation between density functions can be defined, but whatever you choose must have certain properties. What are the properties?

One: $J$ is maximum when $p_1(x) > 0$ implies $p_2(x) = 0$. That maximum value may be of the order of 1 for some choices and infinity for others; by varying $g_1$ and $g_2$ you get different maximum values — people have defined such measures in very many ways.

Two: $J$ is minimum if $p_1(x) = p_2(x)$ for all $x$ — in that case there is no separation between the density functions at all, so the measure should take its minimum value there.
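To make the no-misclassification claim above concrete, here is a short supplementary derivation; it uses the standard two-class expression for the Bayes misclassification probability, which was not written out above, so treat it as a standard fact rather than something from the lecture. For two classes, the Bayes error is

$$P(\text{error}) = \int \min\big(P_1\, p_1(x),\; P_2\, p_2(x)\big)\, dx,$$

because at each $x$ the Bayes rule keeps the larger of the two weighted densities and the smaller one is what gets misclassified. If the supports are disjoint, that is $p_1(x) > 0 \Rightarrow p_2(x) = 0$, then at every $x$ at least one of $p_1(x)$, $p_2(x)$ is zero, so the minimum is zero everywhere and $P(\text{error}) = 0$. At the other extreme, if $p_1(x) = p_2(x)$ for all $x$ and $P_1 = P_2 = 1/2$, the error reaches its worst value of $1/2$.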
Are you understanding? First I have been giving you the intuition; now I will give you the formulas. To repeat the requirements: the measure should be maximum when $p_1(x) > 0$ implies $p_2(x) = 0$, and minimum when $p_1(x) = p_2(x)$ for all $x$.

I am expecting a question from you here, so let me ask it myself. I wrote only one condition, $p_1(x) > 0 \Rightarrow p_2(x) = 0$. Do I also need to write the other one, $p_2(x) > 0 \Rightarrow p_1(x) = 0$? I claim that I do not, and let me tell you why. Suppose that other condition fails: then there is some $x$ for which $p_2(x) > 0$ and at the same time $p_1(x) > 0$. But if there is an $x$ for which both $p_1$ and $p_2$ are positive, that contradicts the first condition. Have you understood the logic? So I do not need to write it; the one statement is sufficient.

And the third property: I wrote the maximum and the minimum, and in all other cases $J$ should lie in between that minimum and that maximum.

Using these principles several criterion functions — that is, several separability measures — have been defined, and using those separability measures you get the corresponding criterion functions for feature selection. The first one I am going to give is due to Bhattacharyya:

$$J_B = -\log_e \int \sqrt{p_1(x)\, p_2(x)}\; dx,$$

where the log is to the base $e$. It looks quite complicated, but actually it is not. Let us check the second property first: suppose $p_1(x) = p_2(x)$ for all $x$. Then the integrand is just $p_1(x)$, and integrating a density over the whole space gives 1; $\log 1 = 0$, and minus 0 is 0. So the minimum value is 0. Now the other case: when $p_1(x) > 0$ implies $p_2(x) = 0$, the product $p_1(x)\, p_2(x)$ is 0 everywhere, its square root is 0, and the integral is 0. Strictly speaking $\log 0$ is not defined, but as the integral goes toward 0 its log goes toward $-\infty$, and with the minus sign $J_B$ goes toward $+\infty$. Have you understood why such an expression is taken? Is it clear?

And do you know who this Bhattacharyya was? He was a professor in Calcutta University. As I was mentioning in one of my discussions with you, ISI — the Indian Statistical Institute — was created by Mahalanobis way back in 1931 in Calcutta, and there has always been a close interaction between the scientists of the Indian Statistical Institute and Calcutta University. Bhattacharyya was a professor in Calcutta University; I think he died around 8 to 10 years ago. He never went abroad, and yet he was a very famous statistician. He developed this measure, and it is known as the Bhattacharyya distance — probably you have heard the term — and this is it.
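A small numerical check of these two properties may help; the one-dimensional Gaussian densities and SciPy quadrature used below are illustrative assumptions. Identical densities give a distance of 0, and the distance grows without bound as the densities are pulled apart; for equal-variance Gaussians with unit variance the value is $(\mu_1 - \mu_2)^2 / 8$, which is what the loop reproduces.

```python
# Bhattacharyya distance  J_B = -ln ∫ sqrt(p1(x) p2(x)) dx,  evaluated numerically.
# Assumptions (illustrative): one-dimensional Gaussian class densities, SciPy quadrature.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def bhattacharyya(p1, p2):
    integral, _ = quad(lambda x: np.sqrt(p1.pdf(x) * p2.pdf(x)), -np.inf, np.inf)
    return -np.log(integral)

# Property 2: identical densities -> the overlap integral is 1, so the distance is 0.
print(bhattacharyya(norm(0, 1), norm(0, 1)))          # ~0.0

# As the densities separate, the overlap integral shrinks toward 0 and -log of it blows up.
for mu in [1, 3, 6]:
    print(mu, bhattacharyya(norm(0, 1), norm(mu, 1)))  # 0.125, 1.125, 4.5  (mu^2 / 8 here)
```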
This distance was later generalized by Chernoff. How did he do it? He replaced the two square roots by the powers $s$ and $1-s$:

$$J_C(s) = -\log_e \int p_1(x)^{s}\, p_2(x)^{1-s}\; dx, \qquad 0 < s < 1.$$

As you can see, if you put $s = 1/2$ you get back the Bhattacharyya distance; by taking any $s$ lying between 0 and 1 you get a generalization. Then there is Matusita — it is Matusita — and how did he do it? His measure is

$$J_M = \int \left( \sqrt{p_1(x)} - \sqrt{p_2(x)} \right)^{2} dx.$$

And there are in fact many, many more: there is something called the divergence, and many other such measures, each of them using the density functions.

Using these separability measures between the density functions, one can always define the corresponding criterion functions for feature selection. For example, how do you define a criterion function using the Bhattacharyya distance? You say: take those $b$ features for which the Bhattacharyya distance is maximum. For each subset $B$ of size $b$ you get the corresponding marginal densities $p_{1B}$ and $p_{2B}$ — the ones I explained earlier — and you calculate the Bhattacharyya distance between them. So for every set $B$ containing $b$ elements you have a Bhattacharyya distance value, and you find that $B_0$ for which the distance, that is the separation, is maximum. Similarly you can use Chernoff, or Matusita, or any of the many other measures. You will get the definitions of these measures from Devijver and Kittler's book; there are in fact a great many of them, and the book goes on giving the corresponding formulas.
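The Chernoff and Matusita measures can be evaluated numerically in exactly the same way. The sketch below is illustrative (one-dimensional Gaussians, SciPy quadrature, parameter values chosen arbitrarily); it also checks that $s = 1/2$ in the Chernoff measure reproduces the Bhattacharyya distance, and that the Matusita distance equals $2\,(1 - e^{-J_B})$, which follows directly from expanding the square in its definition.

```python
# Chernoff:  J_C(s) = -ln ∫ p1(x)^s p2(x)^(1-s) dx,  0 < s < 1   (s = 1/2 gives Bhattacharyya)
# Matusita:  J_M    = ∫ ( sqrt(p1(x)) - sqrt(p2(x)) )^2 dx
# Assumptions (illustrative): one-dimensional Gaussian densities, SciPy quadrature.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p1, p2 = norm(0, 1), norm(2, 1.5)

def chernoff(p1, p2, s):
    integral, _ = quad(lambda x: p1.pdf(x)**s * p2.pdf(x)**(1 - s), -np.inf, np.inf)
    return -np.log(integral)

def matusita(p1, p2):
    integral, _ = quad(lambda x: (np.sqrt(p1.pdf(x)) - np.sqrt(p2.pdf(x)))**2, -np.inf, np.inf)
    return integral

jb = chernoff(p1, p2, 0.5)                            # Chernoff at s = 1/2 is the Bhattacharyya distance
print(jb, matusita(p1, p2), 2 * (1 - np.exp(-jb)))    # the last two numbers should agree
print([round(chernoff(p1, p2, s), 4) for s in (0.25, 0.5, 0.75)])  # varying s gives the whole family
```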
Now note that in all these cases I assumed that you have two classes. Suppose instead you have $K$ classes: densities $p_1, p_2, \dots, p_K$ and priors $P_1, P_2, \dots, P_K$. How are you going to measure the separation between $K$ density functions? We will look at what is known as the mixture density function,

$$p(x) = \sum_{i=1}^{K} P_i\, p_i(x).$$

That is one terminology for it; it is also, in a sense, the mean of the $K$ density functions: you multiply each density by its probability and take the summation. Now, in principle we should look at the separation between every pair of density functions and sum those up, but we do not need to do that: you can instead look at the separation of each density from this mean, the mixture, and take the summation. It is the same idea as with the variance — there you look at the difference between the mean and each individual value, square it and sum, rather than looking at the difference between every pair of values. So instead of the pairwise separations, we take the separation of each class density with respect to the mixture and sum it up. In this way you can have the corresponding generalization of these measures, not for two density functions but for $K$ density functions.

So what is the generalization? For the Bhattacharyya measure it is

$$J = \sum_{i=1}^{K} P_i \left[ -\log_e \int \sqrt{p_i(x)\, p(x)}\; dx \right]:$$

for each class $i$ you write the Bhattacharyya-type term with respect to $p_i$ and the mixture $p$, multiply it by the prior $P_i$, and take the summation over $i = 1$ to $K$. Have you understood this? It is simply the generalization of the Bhattacharyya distance from two classes to $K$ classes: whatever the formula was for two classes, it is generalized to $K$ classes in this way. You can write the corresponding generalizations for Chernoff, for Matusita and for the other separability measures, and there are simply many of them in the literature. In the early part of the literature on feature selection people wanted to use statistical principles directly, so you will find several papers there using these distance measures.

These are denoted as probabilistic dependence measures — that is the terminology used by Devijver and Kittler in their book. They called them probabilistic dependence measures for feature selection, and they are nothing but the generalization of the probabilistic separability measures to $K$ classes.

Do you have any questions? In all these cases I assumed that the density functions are available, and we developed the criterion functions for feature selection accordingly. In my next lecture I shall deal with the case where the density functions are not available but a training sample set is available, and with how you get the criterion functions for feature selection in that case. That is all for today.
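As a closing illustration of the $K$-class measure above — again a sketch under assumptions of my own (three one-dimensional Gaussian classes, SciPy quadrature), not something worked out in the lecture — each class density is compared with the mixture $p(x)$ and the Bhattacharyya-type terms are combined using the priors as weights.

```python
# K-class probabilistic dependence measure (generalized Bhattacharyya):
#   J = sum_i  P_i * [ -ln ∫ sqrt( p_i(x) * p(x) ) dx ],   with  p(x) = sum_j P_j p_j(x)
# Assumptions (illustrative): three one-dimensional Gaussian classes, SciPy quadrature.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

priors = [0.3, 0.3, 0.4]
classes = [norm(-3, 1), norm(0, 1), norm(4, 1.5)]

def dependence_bhattacharyya(priors, classes):
    mixture = lambda x: sum(P * c.pdf(x) for P, c in zip(priors, classes))
    total = 0.0
    for P, c in zip(priors, classes):
        integral, _ = quad(lambda x: np.sqrt(c.pdf(x) * mixture(x)), -np.inf, np.inf)
        total += P * (-np.log(integral))   # Bhattacharyya-type term between p_i and the mixture
    return total

print(dependence_bhattacharyya(priors, classes))       # larger when the classes are better separated

# Sanity check: if all class densities coincide, each integral is 1 and J collapses to 0.
print(dependence_bhattacharyya(priors, [norm(0, 1)] * 3))
```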