Good morning. I shall be talking about feature selection today. Let me first tell you the problem of feature selection, and in order to understand it, let me give you one or two examples.

In my first example I am considering two communities of people: one community is Punjabis, the other is South Indians. We generally know that Punjabis are taller than South Indians; it is something we have seen many times, with many persons. So one of the distinguishing features that separates these two communities is height. It is something that all of us know. Now the question is this. Suppose you are given a data set consisting of Punjabis and South Indians, and you have measured features like height, weight and some other features. Height is measured, say, in centimeters, weight is measured, say, in kg, etc. Maybe you have around 40 or 50 such feature values taken on every one of these individuals; let us say you have a 40-dimensional feature vector. Using this information, how does one say that height is one of the distinguishing features between these two communities? We know that it is one, but the data should automatically tell you so. How will the data tell you? Naturally, you need to write an algorithm which produces the result that height distinguishes between South Indians and Punjabis. There may be many other features; I am not saying that height is the only one. There may be some other features which also distinguish these two communities, but height is surely one of them. The feature selection method should automatically yield height as a distinguishing feature between these two communities. How does the feature selection method give height as a distinguishing feature? That is the problem. Let me
tell you another example. I assume that you might have noticed this too. If you look at people with Mongoloid features, basically people living in the hilly regions of the northeastern states, their noses are shorter in length compared to the noses of people who live in the plains. So the length of the nose is a distinguishing feature: it distinguishes between the people living in the hilly regions of the northeastern states and the others. Here I gave you examples of different communities, but they need not be communities. If there is a classification problem, there might be some features which are able to distinguish the classes much better than the other features. So this is one of the problems in feature selection.

Now what are the other issues? Let us say your number of features is capital N: the features are x1, x2, ..., xN. This is the basic formulation. You are supposed to select b features, where b is less than capital N. Many times the value of b is known to you; sometimes it may not be known. What is the meaning of the value of b being known? In the experiments under consideration, the users generally have some constraints. The constraints may be regarding the amount of computation that the user needs to do. If the number of features is quite high and you are looking at something like variance-covariance matrices and their eigenvalues or eigenvectors, the number of computations may be too large. So in some situations the user may say, "My covariance matrix cannot be more than this size." Some such constraint the user may put in, and that naturally gives a constraint on the number of features. So sometimes the value of b, the number of features to be selected, is known from the point of view of computational complexity. Generally it varies
from problem domain to problem domain, the value of b, and many times the user may not know the exact value of b that he would like to have. He might sometimes say, "The number of features that I would like to get is between these two limits; anything between 10 and 20 is fine." And sometimes the user may say, "I really do not know the number of features that I would like to have." So b may not be known, b may be partially known, or b may be known. In the slides that I am going to present, I generally assume that the number of reduced features is known to the user. In practice that may not be the case. It is almost always the case that real-life problems are much more complex than what is written in textbooks and what we teach in the class.

So the number of features to be selected is small b, and naturally b is less than N. Now why do we need to do feature selection? The first reason is reduction in computational complexity; this point I have already explained to you. The second is that redundant features act as noise, so we are doing something like noise removal. I used the word redundant; what is the meaning of redundant? Let me explain. Say I have just two features x1 and x2, and let us say x2 = 2 x1 - 1. That is, when x1 takes the value 1, x2 takes the value 1; when x1 takes the value 3, x2 takes the value 5; when x1 takes the value, say, -10, x2 will be -21. So there is a direct relationship between x1 and x2. In such a case, do you really think that we should keep both the features? The answer is no. Since there is a direct relationship, one of the features is redundant, that is, it is useless: whatever you can get from x1 you can also get from x2
because of this linear relationship, so we can remove one of the features.

Please. I was expecting this question. When the relationship is nonlinear, there are usually some issues involved; let me tell you the issues. Let me first assume that there is a cubic relationship, something like x2 = 2 x1^3 - 10 x1^2 - 5 x1 - 7. I have just taken some polynomial of degree 3; I could have taken a polynomial of degree 2, because his question is about nonlinear relationships, and nonlinear means it can be degree 2 or degree 3, it may be a polynomial, it may not even be a polynomial, it can be some other relationship also. So I am breaking down his question into a few parts. In my first part I am taking a polynomial of degree 3. If you take some values for x1, you are going to get corresponding values for x2. But note that in the linear case, from x1 you can go to x2 and from x2 you can go to x1: x2 = 2 x1 - 1, whereas x1 = (x2 + 1)/2. Here x2 = 2 x1^3 - 10 x1^2 - 5 x1 - 7; can we express x1 in terms of x2 in the same way? Have you understood the point? In the linear case you have a both-way relationship. One may not always have a both-way, unique relationship, and the problem exists with squares also; instead of x1^3 I could have written some square. So if the both-way relationship is not there, as in this example, I can say that x2 is redundant, but probably I cannot say that x1 is redundant. Whereas in the linear situation I can say that either one of them is redundant; it does not matter whether it is x2 or x1, I can take any one of them. In the cubic case I probably cannot say that.

Now suppose you do not have a polynomial sort of relationship. You might have something like an exponential relationship, or some other relationship. In such a case, number one: are we in a position to establish the relationship, even if such a relationship exists? Suppose
one feature is x2 = 10 * 2^(-x1) - 3 e^(x1) + 4 log2(x1), let us say; some such relationship actually exists. Now are we in a position to recover this relationship? This is one of the problems. If there is no relationship, it is fine; but there may be a relationship which we are not in a position to find.

Now the second part. I hope many of you have had courses on numerical analysis. If you are given the values (x1, y1), (x2, y2), ..., (xn, yn), with distinct x values, we know that you can always fit a polynomial of degree n-1 through them, and in a pattern recognition problem you always have finitely many values. So here we have come to, I should say, a barrier. On one hand, if there are n observations you can always fit a polynomial of degree n-1; but on the other hand, probably there is no real relationship between those two features. The relationship between features has been a difficult problem, and it has been faced by different research communities at different times and in different situations.

Let me talk a little bit about statistics. There is a concept called the correlation coefficient, which tries to measure the relationship between two random variables x and y. Then there is a theorem, and its statement is like this: if x and y are independent random variables, then the value of the correlation coefficient is 0. There is a definition for the word independent. The definition says that x and y are independent random variables if the probability of x belonging to A and y belonging to B is equal to the probability of x belonging
to A times the probability of y belonging to B, for all A and B. That is, you call x and y independent random variables if P(x in A and y in B), the probability of this intersection of two events, is equal to P(x in A) P(y in B). Probably we all know the meaning of independence of events: you say that C and D are independent if P(C intersection D) = P(C) P(D). This is something that we read long back, and that definition is generalized here from independence of events to independence of random variables.

Now he is talking about joint probabilities. That is also true; from here you can go to the point that he is trying to make. That is, two random variables X and Y are independent if the joint probability density function is the same as the product of the marginal density functions. If the density function for capital X is small f, the density function for capital Y is small g, and the joint probability density function is small p, then p(x, y) = f(x) g(y) for all x, y. That is also true, but the first statement is taken as the definition and this one is a consequence of it, because random variables may not always have probability density functions; you might have discrete probability mass functions also. So this is the basic definition, and from here you get the notion of independent and identically distributed, IID. We say that x1, x2, ..., xn are IID, that is, independent and identically distributed. Independent means that the definition of independence that I gave you should hold for all these random
variables. Identically distributed means that the probability density function of x1 is the same as that of x2, and the same as that of xn. It need not be density functions: if it is a discrete probability, then the probability mass function of x1 is the same as the probability mass function of x2, and the same as that of xn.

A simple example of IID random variables is this. You take a coin and toss it, say, n times. The result of the first trial you call x1, the second trial x2, and the nth trial xn. Then x1, x2, ..., xn are IID. First, they are independent, because the result of the first trial does not have any impact on the result of the second trial. They are identically distributed since the coin is the same: the probability of head is the same throughout, in the first trial, second trial, third trial, fourth trial, etc. You have only two outcomes, head or tail, so if the probabilities of head are the same, the probabilities of tail are also the same. The random variable can take only two values: head with probability small p, and tail with probability 1-p. That is true for x1, the same for x2, the same for x3 and the same for xn. So they have the same distribution and they are independent. That is the meaning of independent and identically distributed random variables.

We were talking about the relationship between random variables and the correlation coefficient. If the random variables are independent, then the correlation coefficient is 0, but the converse is not true: the correlation coefficient can be 0 while the random variables are not independent. The example that people give is this. x takes the values -1, 0 and 1, and y takes the values 0 and 1; in fact y is equal to x^2.
Now, with y = x^2, x takes the values -1, 0 and 1, and y takes the values 0 and 1. These entries give you the joint probabilities: P(x = 0, y = 0) is 1/2, P(x = -1, y = 1) is 1/4, and P(x = 1, y = 1) is 1/4. So P(y = 0) is 1/2 and P(y = 1) is 1/2; P(x = -1) is 1/4, P(x = 0) is 1/2, and P(x = 1) is 1/4. You can very easily see that the value of the correlation coefficient is 0, but the relationship is y = x^2. So the problem is that the correlation coefficient is not the right one to measure the relationship.

Well, what is the other one? Which one is going to measure the relationship better? This is a long-standing problem in statistics. We know the meaning of independence of random variables. The definition of dependence is the following: x and y are said to be dependent random variables if they are not independent. But how much dependent they are, for that there is no measure. This problem has been there for a long, long time, it persists even now, and the same problem is there in feature selection. How do we know that two features are related? The word related I am using loosely, not from the point of view of correlation, but in the general meaning of the word relation: there is some function, and if you apply that function to one variable you can go to the other. Then, number one, how do we get hold of that function? And number two, it should not be a function of the following sort: when you have x1, x2, ..., xn and the corresponding y1, y2, ..., yn, you fit a polynomial of degree n-1, and then the error value will be 0. It is not a function of that sort.
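The two points just made can be checked with a few lines of Python. This is only an illustrative sketch: the probabilities are the ones from the y = x^2 example, while the five (x, y) points used for the polynomial fit, and the helper names `expect` and `lagrange`, are hypothetical and chosen only for the demonstration.

```python
from fractions import Fraction

# Distribution from the example: P(x=-1)=1/4, P(x=0)=1/2, P(x=1)=1/4, and y = x^2.
pmf_x = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}

def expect(g):
    """Expectation of g(x) under pmf_x."""
    return sum(p * g(x) for x, p in pmf_x.items())

e_x = expect(lambda x: x)             # E[x]
e_y = expect(lambda x: x * x)         # E[y] with y = x^2
e_xy = expect(lambda x: x * (x * x))  # E[x * y]
covariance = e_xy - e_x * e_y         # numerator of the correlation coefficient

# Independence would require P(x=1, y=0) == P(x=1) * P(y=0).
p_joint = Fraction(0)             # x = 1 forces y = 1, so the event is impossible
p_product = pmf_x[1] * pmf_x[0]   # P(y=0) is the same as P(x=0)

def lagrange(points):
    """Return the degree n-1 interpolating polynomial through n points."""
    def poly(t):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= Fraction(t - xj, xi - xj)
            total += term
        return total
    return poly

# Hypothetical, unrelated values: the fit is still exact on the data.
points = [(0, 3), (1, -1), (2, 4), (3, 1), (5, 9)]
poly = lagrange(points)
max_error = max(abs(poly(x) - y) for x, y in points)

print("covariance:", covariance)
print("P(x=1, y=0):", p_joint, "versus P(x=1) P(y=0):", p_product)
print("largest fitting error:", max_error)
```

The covariance vanishes although y is a function of x, and the interpolating polynomial reproduces arbitrary values exactly, which is why a zero fitting error by itself says nothing about a real relationship between two features.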
So here the problem is not well defined, and I really do not know how to define the problem properly, because this is a research issue. How to measure the relationship between two features is a research issue. There are already very many papers; quite a bit of the literature on pattern recognition is in fact devoted to measuring the relationship between two features. The problem is relevant even now, because it is still not solved to the satisfaction of all of us, and there is still a lot of work to be done.

With this I will go to the next use of feature selection: insight into the classification problem. Insight in the sense that, if I want to classify Punjabis and South Indians, then my feature selection criterion and algorithm should automatically provide height as one of the main features. That means from the data I am getting the conclusion that this feature is highly relevant for classification. This sort of conclusion I would like to get from the data by using these methodologies. Then these methodologies are giving us an insight, an in-depth idea about the classification problem: which features are more important and which features are less important. So these are the uses. One can think of a few more, but these are the main three.

So what are the basic steps of feature selection? First, you need to define an objective function which somehow measures the importance of a collection of features. Then, once you define the objective function, the second part is that you need to optimize it, by minimization or maximization depending on the type of function: for some functions you need to minimize, and for some other functions you need to
do maximization. So basically there are two steps in feature selection. The first step is to define a function J, an objective function, on every subset of features; this objective function gives you, somehow, the importance of that particular collection of features. The second step is that once you have defined the objective function, you need to do the optimization, maximization or minimization depending on the type of function.

Now how does one define the objective function, or criterion function, J? As I said, there are very many papers on this. I will discuss how to define these criterion functions later. For the present moment I will assume that the criterion function is already given to us. Then how does one do the optimization, maximization or minimization, of the criterion function? And I shall assume that the number of features you want to reduce to, that is the value of small b, is known. You will understand why I am giving importance to this particular thing.

Algorithm development. Suppose your capital N value is 100, that is, you have 100 features, and from these 100 features you would like to select 10. Then how many possible sets containing 10 elements can be formed from capital N? It is 100C10; this many subsets of size 10 we can have from 100 features. Now what is the value of 100C10? Can you tell me approximately? It is greater than 10 to the power of 12. Now my question to you is: are we in a position to go through all these subsets to find the optimal subset? What is your answer? The answer is no. 10 to the power of 12 is a huge number; after 1 you have to put 12 zeros. So generally we do not like to search all these many subsets to get the optimal one. Now you have the
other side. Suppose I do not search the whole space; can I guarantee that I will get the optimal subset? Have you understood the problem, or shall I repeat it? On one hand, we do not want to search the whole space to get the optimal subset. So out of 10 to the power of 12, let us say I would like to search 10 to the power of 9. Then can I claim that the optimal subset will always be within these 10 to the power of 9? We cannot say that. That means we have a very big problem.

So let me state the problem. To this day, there exists no feature selection algorithm which provides the optimal subset of features for any criterion function without doing exhaustive search. First I will expand on this, then I will come to you.

Let me tell you the meaning of "any criterion function". There are some criterion functions which satisfy some properties, and you can use those properties to obtain a feature selection algorithm which gives you the optimal feature subset without doing exhaustive search, because the criterion function has those properties. But when I say "for any criterion function", it is not necessarily true that every criterion function has such properties; some criterion functions may not have any. Are you understanding what I am trying to say? "Any criterion function" means that you have a criterion function, you do not know what criterion function it is, but given a set of features it will give you the value. Now you want to get the optimal set of features, and you want to write an algorithm for this without using the properties of the criterion function. Then, without doing exhaustive search, you cannot guarantee optimality. That is what my statement is.
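To make the counting argument and the idea of exhaustive search concrete, here is a minimal Python sketch. The function name `exhaustive_select` and the criterion `toy_criterion` are hypothetical; the toy criterion is an arbitrary black box with no useful structure, standing in for "any criterion function", and it is assumed that J is to be maximized.

```python
from itertools import combinations
from math import comb

# Number of subsets of size 10 that can be formed from 100 features.
n_subsets = comb(100, 10)
print(n_subsets > 10**12)  # the count exceeds 10 to the power of 12

def exhaustive_select(n_features, b, criterion):
    """Evaluate the criterion on every subset of size b and keep the best one."""
    best_subset, best_value = None, float("-inf")
    for subset in combinations(range(n_features), b):
        value = criterion(subset)
        if value > best_value:
            best_subset, best_value = subset, value
    return best_subset, best_value

def toy_criterion(subset):
    """Hypothetical black-box J: returns a value, exposes no exploitable property."""
    return (7 * sum(i * i for i in subset)) % 11

# Feasible only because the toy problem is tiny: 6 features, choose 2, so 15 subsets.
best_subset, best_value = exhaustive_select(6, 2, toy_criterion)
print(best_subset, best_value)
```

With 6 features the loop visits 15 subsets; with 100 features and b = 10 the same loop would have to visit every one of the `n_subsets` candidates, which is the cost the lecture is warning about.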
Whatever optimal algorithms you are seeing, optimal meaning that you are able to get the optimal set of features without doing exhaustive search, they are using the properties of the criterion function.

Please. Sure, it is possible, but "for any" means that you should not use all those properties. So if you people are interested, you may try to develop algorithms which get the optimal features without doing exhaustive search. This I mentioned sort of jokingly; I do not know whether it is possible or not. But to this day there is no such algorithm. That means whatever algorithms exist do not guarantee to provide the optimal feature subset for any criterion function. So yes, you are not guaranteed to get the optimal solution, though naturally you are going to get some suboptimal solution.

Before I go into the algorithms I would like to make one more comment. This comment was actually made by T. M. Cover and Van Campenhout. They wrote a paper in the IEEE Transactions on Information Theory, and the title of the paper is "The two best features are not necessarily the best two". There they provided an example. In that example they have taken one particular criterion function J and four variables; let us represent the four variables as x1, x2, x3, x4. Let us say the criterion function is to be maximized. The criterion function has the maximal value for x1, the second highest value for x2, then x3, let us say, and the least impressive is x4. Now they are supposed to select two features. What do we expect? We expect that x1 will be there among those two features. But the example is constructed in such a way that x3 and x4 as a pair is better than any other pair. 4C2 is six pairs, so this pair is better than the other five pairs. That is why the
title of the paper is "The two best features are not necessarily the best two" jointly. Yes, this is, I think, a one-and-a-half-page or two-page paper. You should go through the journal; you will find it. Usually in IEEE journals there is something called a short paper and something called a regular paper, and for regular papers they publish the photographs of the authors. Now this is a short paper, but they published the photographs of the authors, I think because they found the paper to be really important. It tells us something that all of us are probably feeling but are not able to put quantitatively, which they have done. The function J is an entropy-based function. Anyway, you can go through the reference; it is a very standard reference. Yes, that probably is not available; it is a very, very old paper.

By the way, do you people know anything about T. M. Cover? Have you heard the name? In 1967 he wrote a paper with Hart, H-A-R-T, the same Hart as in Duda and Hart, on the nearest neighbor decision rule: Cover and Hart. The name is spelled C-O-V-E-R. As you can see, he is originally an electrical and electronics engineer. These are 1967 and '73 papers I am talking about, and we are in 2011; he is active even now. He is from Stanford University, which, as you know, is a great university, one of the greatest universities in the world, along with MIT. He is associated both with the statistics department and with the electrical engineering department. He received several awards. He is one of the father figures of pattern recognition. He worked quite a lot on feature selection; nowadays one of the terminologies that is used is portfolio management. Anyway, it is
one of the consequences of the feature selection problem that has been stated here, and he works quite a lot on that, portfolio management.

So now probably you have understood why feature selection is a difficult topic, because of that particular example by T. M. Cover and his student. This will also tell you why you probably need to do the exhaustive search. I was expecting a question from you; anyway, I will ask the question myself and give you the answer: how is this happening? The reason is like this. Note the examples I was giving you. Suppose x1 is directly related to x2 by a linear relationship. Then if you put them together there is no extra information: whatever is there in x1, the same thing is there in x2. So if you put x1 and x2 together, you get only whatever is there in x1; you are not going to get anything extra. Whereas if you put x3 and x4 together, maybe you will be getting more than what is there in x1. Have you understood what I wanted to say? Suppose x1 is a linear function of x2. Then putting x1 and x2 together is the same as just writing x1 two times, so there is nothing extra that you are going to get: J(x1, x2) is going to be the same as J(x1). Whereas x3 and x4 individually probably do not have much importance, but when you put them together, collectively they may give a lot of importance, so the value may become more. I would like to put it like this: x1 = 2 x2 - 1, say, so a linear relationship exists. If you put x1 and x2 together it is the same as just using x1 two times, so you are not going to get anything extra: J(x1, x2) will be the same as J(x1). Whereas x3 and x4 individually are unimportant, but then if you put them
together, probably the importance may be more. Maybe in the next class; already the time is over.
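The effect described above can be reproduced on a small constructed data set. The sketch below is a toy construction and not the example from the paper by Cover and his student: the class label is assumed to be the exclusive-or of x3 and x4, x2 is a noisy copy of the label, x1 = 2 x2 - 1 is the redundant feature, and J is taken to be the mutual information (in bits) between a feature subset and the class, which is one possible entropy-based criterion. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(values):
    """Empirical entropy in bits of a list of hashable values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def build_rows():
    """Toy data: y = x3 XOR x4, x2 agrees with y in 3 rows out of 4, x1 = 2*x2 - 1."""
    rows = []
    for x3 in (0, 1):
        for x4 in (0, 1):
            y = x3 ^ x4
            for k in range(4):
                x2 = y if k < 3 else 1 - y
                x1 = 2 * x2 - 1
                rows.append(((x1, x2, x3, x4), y))
    return rows

def J(subset, rows):
    """Mutual information between the selected features and the class label."""
    xs = [tuple(x[i] for i in subset) for x, _ in rows]
    ys = [y for _, y in rows]
    return entropy(ys) + entropy(xs) - entropy(list(zip(xs, ys)))

rows = build_rows()
single = {i: J((i,), rows) for i in range(4)}                    # index 0 is x1, ..., 3 is x4
pairs = {p: J(p, rows) for p in combinations(range(4), 2)}
best_pair = max(pairs, key=pairs.get)

print("individual values:", single)
print("pair values:", pairs)
print("best pair (0-based indices):", best_pair)
```

In this construction x1 and x2 are the two best features individually, the pair (x1, x2) carries no more than x1 alone, and the pair (x3, x4), whose members are individually useless, is the best pair. A selection rule that ranks features one at a time would miss it.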