We have been discussing objective functions for feature selection. Earlier we discussed the probabilistic separability based feature selection functions; now we are going to discuss feature selection functions where we are given a training sample set, so here we assume that we have no knowledge of the class density functions. In this setup we have points $x_{ij}$, $j = 1, 2, \ldots, n_i$, $i = 1, 2, \ldots, k$. That is, we have $k$ classes, the number of points in the $i$th class is $n_i$, and $x_{ij}$ denotes the $j$th observation in the $i$th class. Thus the total number of observations is $\sum_{i=1}^{k} n_i$, and each $x_{ij}$ belongs to $\mathbb{R}^N$, that is, the number of features is $N$, and we are supposed to select $b$ of these $N$ features. Each $x_{ij}$ is a column vector, $x_{ij} = (x_{ij1}, x_{ij2}, \ldots, x_{ijN})^T$. Are the notations clear? Now the question is: somehow we have to select $b$ features. The intuition here is similar to the intuition behind the probabilistic separability based measures. What did we do there? We selected features in such a way that they maximize the distance between the density functions of the respective classes. Here we do not have density functions, but we have points, so we need to select features in such a way that they maximize the distance between those points. Now, how do we put this mathematically? Let $\delta(\xi_1, \xi_2)$ denote a distance between two vectors $\xi_1$ and $\xi_2$.
Using this $\delta$, how do we define the distance between the classes, the interclass distance? Let me first define this distance in the full $N$-dimensional space; if we define it there, then for any subset of the $N$ features we can define the distance correspondingly. I am writing a big formula. First I take the distance between points of two classes, $\delta(x_{i_1 j_1}, x_{i_2 j_2})$, and sum over $j_1 = 1, 2, \ldots, n_{i_1}$ and $j_2 = 1, 2, \ldots, n_{i_2}$; this gives the total distance between the points in class $i_1$ and class $i_2$. Then I take the average, that is, I multiply by $\frac{1}{n_{i_1} n_{i_2}}$. Suppose we are also given prior probabilities, with $P_i$ denoting the prior probability of class $i$; then I multiply by the corresponding prior probabilities, take the summation over the classes, and divide by 2:

$$J = \frac{1}{2} \sum_{i_1=1}^{k} \sum_{i_2=1}^{k} P_{i_1} P_{i_2} \, \frac{1}{n_{i_1} n_{i_2}} \sum_{j_1=1}^{n_{i_1}} \sum_{j_2=1}^{n_{i_2}} \delta\!\left(x_{i_1 j_1}, x_{i_2 j_2}\right).$$

Why do I divide by 2? The answer is simple: each pair of classes appears twice in the double summation, so we are measuring the same thing twice. Now take some subset of the features. Let $S = \{X_1, X_2, \ldots, X_N\}$ be the full feature set, and let $B \subseteq S$ be a set containing $b$ elements. With respect to those $b$ features, each observation has a corresponding $b$-dimensional vector, which I write as $\xi_{ij}(B)$: from $x_{ij}$, when you remove all the features not in $B$, the resultant vector is $\xi_{ij}(B)$. Is this clear? If it is not clear, you ask me. For example, if $b = 2$ and $B = \{X_1, X_2\}$, then $\xi_{ij}(B) = (x_{ij1}, x_{ij2})^T$, just those two components. So for $x_{i_1 j_1}$ you have $\xi_{i_1 j_1}(B)$, for $x_{i_2 j_2}$ you have $\xi_{i_2 j_2}(B)$; substitute these in the formula above and you get $J(B)$. Now we want to do the maximization: we select $B_0 \subseteq S$ containing $b$ elements such that $J(B_0) \geq J(B)$ for all $B \subseteq S$ containing $b$ elements. The intuition is the same for the probabilistic separability measures and for the interclass distance based measures: we select those features which maximize the separation.
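To make the formulation concrete, here is a minimal sketch of how $J(B)$ could be computed for a candidate feature subset. It assumes the training set is stored as a list of per-class NumPy arrays and the priors as a list of numbers; all the names here (`J`, `best_subset`, and so on) are my own illustration, not from any standard library.

```python
import numpy as np
from itertools import combinations

def J(classes, priors, B, delta):
    """Interclass distance criterion J(B).

    classes : list of arrays; classes[i] has shape (n_i, N), with the
              observations x_ij of class i as rows.
    priors  : prior probabilities P_i, one per class.
    B       : indices of the selected features (the set B, |B| = b).
    delta   : distance function on two b-dimensional vectors.
    """
    B = list(B)
    total = 0.0
    for i1 in range(len(classes)):
        for i2 in range(len(classes)):
            X1 = classes[i1][:, B]   # xi_{i1 j1}(B): keep only the features in B
            X2 = classes[i2][:, B]
            pair_sum = sum(delta(u, v) for u in X1 for v in X2)
            total += priors[i1] * priors[i2] * pair_sum / (len(X1) * len(X2))
    return total / 2.0               # each pair of classes is counted twice

# Squared Euclidean distance, the delta used later in the lecture.
sq = lambda u, v: float(np.sum((u - v) ** 2))

def best_subset(classes, priors, N, b, delta=sq):
    """Exhaustive search for B0: feasible only when C(N, b) is small."""
    return max(combinations(range(N), b),
               key=lambda B: J(classes, priors, B, delta))
```

Note that the exhaustive search enumerates all $\binom{N}{b}$ subsets, which explodes quickly as $N$ grows; how to search the subsets efficiently is a separate question from how to score them, and this criterion only does the scoring.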
Okay, now there is one small point: I have not told you what this $\delta$ is. I only said that $\delta$ measures distance, but in what way is it going to measure the distance? We can have several distance functions, and for different distance functions you are going to get different criterion functions. I shall talk a little about distances in one of my next lectures. Here I will take one particular form of the distance function, the usual one,

$$\delta(\xi_1, \xi_2) = \sum_{l=1}^{b} (\xi_{1l} - \xi_{2l})^2 = \|\xi_1 - \xi_2\|^2,$$

that is, the square of the Euclidean distance. Now, if you take the square of the Euclidean distance here for this $J(B)$, the whole expression gets simplified. Can you tell me how it may get simplified? If possible I would like you to do the calculation at home, and maybe you can show it to me tomorrow, but I will write down the final form. Let

$$\bar{\xi}_i(B) = \frac{1}{n_i} \sum_{j=1}^{n_i} \xi_{ij}(B),$$

which is the mean of the $i$th class on the basis of the training sample set, all of this with respect to the set $B$, and let

$$\mu(B) = \sum_{i=1}^{k} P_i \, \bar{\xi}_i(B)$$

be the overall mean. Then

$$J(B) = \sum_{i=1}^{k} P_i \, \frac{1}{n_i} \sum_{j=1}^{n_i} \delta\!\left(\xi_{ij}(B), \bar{\xi}_i(B)\right) \;+\; \sum_{i=1}^{k} P_i \, \delta\!\left(\bar{\xi}_i(B), \mu(B)\right).$$

This is what you get if you substitute the squared Euclidean distance for $\delta$ and simplify the whole expression; I want the class to do the calculation on your own. Now, can you tell me what these expressions are? Look at the first term: you find the distance of each point to its class mean, you find all those distances, and you take their average. It is basically the within-class distance. What about the second term? You take the mean of the $i$th class and measure its distance to the overall mean; that is the between-class distance. So we are actually trying to maximize the sum of the within-class distance and the between-class distance. The expression for $J$ was written purely intuitively, and then we used the usual squared Euclidean distance as $\delta$.
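If you would rather check the simplification numerically than do the algebra, here is a small sketch on invented toy data (the data, the priors, and the subset below are arbitrary, chosen only for illustration): it evaluates $J(B)$ both in the direct pairwise form and in the within-plus-between form, and the two values agree.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy problem: k = 3 classes in N = 5 dimensions, arbitrary priors.
classes = [rng.normal(loc=i, size=(n, 5)) for i, n in enumerate([8, 12, 10])]
priors = [0.2, 0.5, 0.3]
B = [0, 2]                                     # candidate subset, b = 2
sq = lambda u, v: float(np.sum((u - v) ** 2))  # squared Euclidean delta

# Direct pairwise form of J(B).
direct = 0.0
for i1 in range(len(classes)):
    for i2 in range(len(classes)):
        X1, X2 = classes[i1][:, B], classes[i2][:, B]
        direct += priors[i1] * priors[i2] * np.mean(
            [sq(u, v) for u in X1 for v in X2])
direct /= 2.0

# Simplified form: within-class distance + between-class distance.
means = [X[:, B].mean(axis=0) for X in classes]        # class means xi_bar_i(B)
mu = sum(p * m for p, m in zip(priors, means))         # overall mean mu(B)
within = sum(p * np.mean([sq(x, m) for x in X[:, B]])
             for p, m, X in zip(priors, means, classes))
between = sum(p * sq(m, mu) for p, m in zip(priors, means))

print(direct, within + between)                        # the two values agree
```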
With that substitution, the expression boils down to a form which is nothing but the sum of the within-class distance and the between-class distance; so if you use this criterion, you are trying to maximize that sum. Probably we can suggest a slightly better criterion function. What is that? We do not need to maximize the sum: the between-class distance we would like to maximize, while the within-class distance we may want to minimize. So how does one do it? Because on one hand we want to maximize the between-class part and on the other hand we want to minimize the within-class part, a slightly better criterion function is to take the between-class distance divided by the within-class distance and maximize the whole thing (or the within-class divided by the between-class and minimize it, whichever way you take it). Is this clear to you? Using this same intuition, there are a few more criterion functions which take the within-class scatter matrix, represented by $S_W$, and the between-class scatter matrix, represented by $S_B$, and then calculate $S_W^{-1} S_B$. Using $S_W^{-1} S_B$, people have selected features; if you go step by step from the basic formulation here, you arrive at $S_W^{-1} S_B$. You can use the trace of this matrix, or you can use its eigenvalues and eigenvectors, to obtain features.
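Here is a sketch of what such a scatter-matrix criterion could look like in the same setup, using the trace of $S_W^{-1} S_B$ as the score for a subset. The weighting by priors and sample means follows the formulation above, but treat the details (and the names) as one plausible variant rather than the definitive formulation.

```python
import numpy as np

def scatter_criterion(classes, priors, B):
    """tr(S_W^{-1} S_B) for the feature subset B; larger means the class
    means are better separated relative to the within-class scatter."""
    B = list(B)
    means = [X[:, B].mean(axis=0) for X in classes]
    mu = sum(p * m for p, m in zip(priors, means))
    b = len(B)
    SW, SB = np.zeros((b, b)), np.zeros((b, b))
    for p, m, X in zip(priors, means, classes):
        D = X[:, B] - m                        # deviations from the class mean
        SW += p * (D.T @ D) / len(X)           # within-class scatter S_W
        d = (m - mu)[:, None]
        SB += p * (d @ d.T)                    # between-class scatter S_B
    # Solve rather than invert; S_W must be nonsingular (in practice one may
    # need to regularize, e.g. SW += eps * np.eye(b)).
    return float(np.trace(np.linalg.solve(SW, SB)))
```

One could equally rank subsets by the eigenvalues of $S_W^{-1} S_B$, as mentioned above; the trace is simply the sum of those eigenvalues.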
About this particular aspect, $S_W$ and $S_B$, I think Dr. Sukhinder Das will teach you in detail; that is why I am not going into it. For my part, on the interclass distance topic I am stopping here: this is the basic formulation, and from here you can develop very many criterion functions, of which the one based on $S_W^{-1} S_B$ is one. If you go through some material on multivariate analysis, that is in statistics, where they deal with these matrices extensively, you will find many more ways of defining this sort of criterion function. This part is basically statistics, and you will find quite many books on it, both in the statistics literature and in the pattern recognition literature. There are formulas relating to what are called canonical correlations, and some other topics as well; if you go through the multivariate analysis books, basically discriminant analysis, you will find the different formulations of this way of selecting features. They do not use the term feature selection, but they give you all these formulations there. So at this point I am ending this way of selecting features, that is, interclass distance based feature selection, but I will also talk about something more. Do you know what entropy means? E-N-T-R-O-P-Y, entropy. Let me talk a little about that. The basic concept has come from physics; those of you with a physics background will know the laws of thermodynamics, where the second law deals with entropy. Entropy being large means you have more disorder; entropy being small means you have more order. That is the basic feeling about entropy, and the concept of entropy used in our field takes its main intuition from physics. Now, how does one say that there is more disorder in a system? Let us say my system takes only two values, 0 and 1. Suppose 0 occurs with probability 0.05 and 1 occurs with probability 0.95; in a second case 0 occurs with probability 0.1 and 1 with probability 0.9; then 0.2 and 0.8; and somewhere you are going to get 0.5 and 0.5. With only two values, where do you think you have the maximum disorder? I think it is at 0.5 and 0.5. In a case like 0.95 and 0.05 you know that most of the time 1 is going to occur, you can say that; in a case like 0.5 and 0.5 you cannot say it. So you have maximum disorder where 0 and 1 occur with equal probability. Similarly, if you have three states, each occurring with probability one third, there you have the maximum disorder. Are you understanding?
In fact, more and more states means more disorder, and for the same number of states, different probabilities of occurrence mean the disorder is less, while equal probabilities for each state mean the disorder is more. This is represented mathematically by the following formula. Suppose a system has $n$ states and the $i$th state occurs with probability $p_i$, so that $\sum_{i=1}^{n} p_i = 1$. Then the entropy of the whole system is said to be

$$H = -\sum_{i=1}^{n} p_i \log p_i.$$

Why is the minus sign there? Since all the $p_i$ lie between 0 and 1, and the log of any value between 0 and 1 is negative, the minus sign makes the quantity positive. One can also call this the information content of the system. Why is the word information used? If you have more uncertainty, then you have to find out more about the system; that means the information content in the system is more. That is the basic feeling, and that is why people also use the term information. Let me give you an example. Take today's weather in Chennai, and say there are four possible states: normal, hot, cold, and rainy. Normal means a pleasant temperature; hot means, well, it is hot; rainy means it is raining today; cold means it is cold. Generally we all know that Chennai is usually not cold, though today, if you go outside, it is slightly towards the colder side, am I correct? Now, among these four possibilities, what is the usual state? Probably it is usually hot; you people can tell me this better than I can. The normal, pleasant weather probably occurs only a few weeks, the rainy season a few weeks, and the cold winter a very few weeks; most of the time, I suppose, it is hot. That is why whenever people talk about Chennai it is "oh, it is hot": they have this information because they know these probabilities, and here the probability of it being hot is high. So when people come to visit Chennai, like me and my family, we do not carry any winter clothing. This is a consequence of there being less uncertainty: the entropy is less. On the other hand, for some other place where these states occur roughly equally likely, the entropy is more, and one has to be prepared for all the eventualities. If you are going to such a place and you do not know whether it is going to be hot or cold, on one hand you have to carry warm garments for the winter, and on the other hand you must be prepared to bear the heat; the more uncertainty you have, the more things you are going to carry. So this is one function that tells you how much uncertainty, or entropy, or information is present in a system.
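To tie the formula to the examples, here is a tiny sketch that evaluates $H = -\sum_i p_i \log p_i$ for the distributions we just discussed; the weather probabilities at the end are invented numbers, meant only to mimic the Chennai story.

```python
import numpy as np

def entropy(p):
    """H = -sum_i p_i log2 p_i, with 0 log 0 taken as 0; base 2 gives bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.05, 0.95]))     # ~0.29 bits: you can almost predict the outcome
print(entropy([0.2, 0.8]))       # ~0.72 bits
print(entropy([0.5, 0.5]))       # 1.0 bit: maximum disorder for two states
print(entropy([1/3, 1/3, 1/3]))  # ~1.58 bits: more states, more disorder
# Hypothetical Chennai weather (hot, normal, rainy, cold):
print(entropy([0.85, 0.05, 0.05, 0.05]))  # ~0.85 bits, well below the 2-bit maximum
```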
This is the function that is used a lot, though there are also a few more such functions. Now, there are some criterion functions which take this information-theoretic measure and use this definition to derive the form of the criterion function. This is one thing I wanted to tell you: though I dealt in more detail only with probabilistic separability and interclass distance based measures, these are not the only ways in which you can define a criterion function. You can also define criterion functions in the information-theoretic way, and in fact there are many other ways: there are ways based on fuzzy set theory, and there are ways which use a mixture of a few of these things. If you look at the literature on feature selection, this is an area in which you will find a great many papers, and you will come across several different criterion functions; there are simply too many of them. I have dealt with only a very small number of criterion functions; this is not the end, it is only the beginning, and many other criterion functions exist in the literature. Now I will show you a few of my slides. The interclass distance measure is the one I was telling you about today; the Bhattacharyya distance I already mentioned; mutual information I did not discuss, but it also exists; there is an entropy based one, which is an unsupervised criterion; and there are fuzzy feature evaluation indices and neuro-fuzzy feature evaluation indices. So, on the slide: the first is the interclass distance based one, which I discussed; the Bhattacharyya one I discussed; the third, mutual information, I did not discuss; the similarity and entropy based ones I also did not discuss, apart from a little of the formulation; and there are many others, like the fuzzy and neuro-fuzzy feature evaluation indices. In fact I wrote down only a few of them; there are simply too many other such measures. With this I stop. Okay.