Yeah, so we have a set of features. S is the set of features, b is the number of features to be chosen; naturally b is less than N. Now we need to define a function J. What is the meaning of defining a function J? Let us first look at what is known as the power set. Do you all understand the meaning of power set? The power set is the set of all subsets, so I am looking at the power set of S. How many elements are there in the power set of S? 2 to the power N. And what is the power set of S? It is the set of all B such that B is a subset of S. Now the objective function J is to be defined from the power set of S to (minus infinity, plus infinity); it may be 0 to infinity also, I am writing it in a very general form. And J is to be optimized. What is the meaning of J is to be optimized? Let us look at the set of all B such that B is a subset of S and B contains small b elements; that means we are going to look at all possible subsets of S containing small b elements. For every such subset of S we have to somehow attach a value which tells you the importance of that set, and the word importance may differ from context to context, situation to situation. It may be a negative value also; it need not necessarily be positive; it depends on what sort of function you are going to define. So if you want, you can always confine it to 0 to infinity; I am not saying that you should not. But as a general formulation, J, which is supposed to give you some sort of importance of a particular subset, may take values between minus infinity and plus infinity. I am not saying that J will take all the values between minus infinity and plus infinity; it cannot, because the interval is an uncountable set with infinitely many elements, whereas here you have finitely many subsets, so J can never take all the values in this interval. Have I answered your question? I think your trouble is why minus infinity. J assigns a value to every subset. What is P of S? P of S contains all possible subsets. When we say that f is a function from A to R, we take an element of A and calculate f of x; here an element is a set, so you are going to have J of that particular set. J takes P of S as its input: it takes every element of P of S and for every element it gives a value as output. And what is an element of P of S? It is a subset of S. So J attaches a value to every subset. Is it clear? And we need to find this function J; our objectiveive is to find J. Number one: J is not unique. I will give you several examples of J in the later portion of my talk. I am writing all these things to tell you how to formulate the problem mathematically. For the mathematical formulation, J is to be defined for every subset of S; that means the domain of J is the power set of S, and this is the range. Is this clear? I will surely tell you how to define J, after one or two lectures, but this is the way you are going to formulate the problem. So once a function J is defined, we are supposed to find the optimal value of J. What is the meaning of finding the optimal value of J? First you are going to look at all possible subsets of
S having small b number of elements. I am denoting the collection of all possible subsets of S having small b elements by script A suffix b. Now J is to be optimized over this collection. If we are looking at a maximization problem, then: find B0 belonging to A_b such that J(B0) is greater than or equal to J(B) for all B belonging to A_b. For a minimization problem it is the reverse: find B0 in A_b such that J(B0) is less than or equal to J(B) for all B in A_b. If J is to be maximized we should find B0 satisfying the first; if J is to be minimized, the second; and B0 should belong to script A_b. Is this fine? Okay, I will tell you an example of J; I think that will make it clearer. I hope all of you remember what the Bayes decision rule is; we know that it minimizes the probability of misclassification. Now we are supposed to find small b features such that, if you use those small b features, the probability of misclassification is minimized. For every collection of small b features: say capital N is equal to 10, so you have 10 features, and small b is equal to 2, so you are supposed to select two features. How many two-element subsets can you have? 10 C 2, that is 45 subsets. Take one subset, say the subset containing X1 and X2. If you take X1 and X2 as your only features, what is the minimum probability of misclassification that you can get? There is one value. Similarly take X1 and X3: what is the minimum probability of misclassification that you can get? Like that, for each of the 45 subsets you are going to get a minimum probability of misclassification. Now, among these 45 subsets, whichever provides the minimum of these minimums, that is your optimal subset. Is it clear or is it not clear? For the subset having elements 1 and 2, that is the two features X1 and X2, what is the meaning of the minimum probability of misclassification? The meaning is: if you take just these two features, you can have several decision rules, and for each decision rule you have a probability of misclassification. Among all these decision rules, which one will provide you the minimum probability of misclassification? Are you understanding? If you take only X1 and X2, just these two features, and say the number of classes is 2, then I might have a decision rule like: if X1 plus X2 is less than or equal to 1, put the point in class 1; if greater than 1, put it in class 2. This is one decision rule, and I can have another decision rule; in fact the number of decision rules is infinite, and for each decision rule you have a probability of misclassification. Among all these decision rules, take the one that provides the minimum probability of misclassification; that is the value I am looking at: for just those two features, what is the minimum probability of misclassification that you can get? Similarly, if you take features X1 and X3, what is the minimum probability of misclassification that you can get? Like that, for every such pair you are going to have a minimum probability of misclassification, and the pair that provides the least of them is the best set of features. So this is a way of defining J; a way, not the only way. Have you understood? I suppose it is clear to you now. So if J is to be maximized, then B0 is the one that you need to find satisfying the first property; if J is to be minimized, then find B0 with the second property. Okay, now, there are two parts. The first part, as mentioned here, is that the objective function J, which attaches a value to every subset of features, is to be
defined. The second part is that algorithms for feature selection are to be formulated. In my next two or three talks I am going to assume that J is already given to you, and I will only talk about how to get the algorithms; then in the later portion I will tell you a few ways of defining J. Is this fine with you? Initially I assume J is given, and I will talk about the second part, that is, how to get the algorithms which provide you the optimal set of features. So now I am assuming that J is known and given, and b, the number of features to be selected, is also known. Let me also assume, without loss of generality, that J is to be maximized. Now let us look at the algorithms. I already told you some of the difficulties with the algorithms in the previous lecture, in the sense that when capital N is equal to 100 and small b is equal to 10, then 100 C 10 is more than 10 to the power 12, so we are not in a position to do an exhaustive search here. I also mentioned that if you know some properties of the function J, then many times you can develop algorithms which provide you the optimal feature subset without doing the exhaustive search. Let me give you an example of that. What I am going to do is assume that my function J satisfies a certain property, and with that property I will give you an algorithm for finding the optimal feature subset without doing the exhaustive search. Let me first tell you the property. So this is your set of features. The property is: if you take any two subsets A1 and A2 such that A1 is a subset of A2, then J(A1) is less than or equal to J(A2). Is this property clear to you? Including more features means you have more information; that is what it is trying to tell you, and here we are trying to do the maximization. This property need not always hold good. If some features act as noise, then if you add more features the importance of that feature subset may decrease; in some problems it really happens. So it is not necessarily true that this property holds good for all criterion functions; it holds good for some criterion functions. Now, if this property holds good, then some people have developed an algorithm for finding the optimal feature subset without really doing the exhaustive search. See, sometimes even if this property holds you may need to do the exhaustive search if you follow that particular algorithm, but most of the time you do not. The algorithm is the branch and bound algorithm. It was developed by Narendra, an Indian, and Fukunaga; I hope all of you know who Fukunaga is, he wrote a book on statistical pattern recognition. Now I will give you that algorithm. Basically there is a tree for this algorithm, so I will draw the tree. Here there are in total six features, and I would like to select two features. So how many possible subsets are there? 6 C 2. What is the value of 6 C 2? 15. How many nodes are there in this bottom layer? Please count: the number of nodes in this layer is actually 15. Each node denotes a subset of
the original feature set. This node denotes the whole set, that is, six features. At this level each node denotes a set of five features, at this level each node denotes a subset of four features, at this level each node denotes a subset of three features, and at this level each node denotes a subset of two features. Next: this node has five features, this node has six features, and this is a subset of this. In fact, if you go through any such branch, this is a subset of this, this is a subset of this, and so on: each node is a subset of the one above it. So what happens is that from here you remove one feature to get this, you remove another feature to get this, one more feature to get this. Let us say, without loss of generality and just for explanation purposes, from this one feature one is removed to get this, and feature two is removed to get this. Let us just say there are six features x1, x2, x3, x4, x5, x6. Now, when I am drawing this particular branch, I always keep the features one and two. That means, let us just say feature three is removed to get this, then feature four is removed, feature five is removed, and feature six is removed; so ultimately this leaf will be the subset containing features one and two. Now I get the value of J for this one; I have some particular value of J for this leaf. Then I compare this value with the value at this node. Suppose this value is greater than this: then, by the property, this value would surely be greater than this, this, this and this. Are you understanding? Suppose this value is greater than this; then it will surely be greater than all these things.
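The pruning logic just described can be sketched in code. This is a minimal illustration of the idea, not the exact Narendra–Fukunaga tree layout or branch-ordering heuristic: it assumes the monotone property from above (A1 a subset of A2 implies J(A1) ≤ J(A2)), and the feature names and the additive weighted-sum criterion are made up for the example.

```python
def branch_and_bound(S, b, J):
    """Best b-subset of S under a monotone J (A1 subset of A2 => J(A1) <= J(A2)).
    A whole branch is discarded once its J value cannot beat the best leaf."""
    best = {"set": None, "val": float("-inf")}

    def search(current, removable):
        if len(current) == b:                 # reached a leaf of the tree
            if J(current) > best["val"]:
                best["set"], best["val"] = frozenset(current), J(current)
            return
        if J(current) <= best["val"]:         # bound: by monotonicity no subset
            return                            # of `current` can exceed J(current)
        for i, f in enumerate(removable):
            # remove features in a fixed order so no subset is visited twice
            search(current - {f}, removable[i + 1:])

    search(frozenset(S), sorted(S))
    return best["set"], best["val"]

# Toy criterion: additive "importance" weights (made up for illustration);
# an additive J with nonnegative weights is monotone by construction.
weights = {"x1": 0.9, "x2": 0.7, "x3": 0.4, "x4": 0.3, "x5": 0.1, "x6": 0.05}
J = lambda B: sum(weights[f] for f in B)

best_set, best_val = branch_and_bound(set(weights), 2, J)
# best_set is the pair {x1, x2}, the two heaviest features
```

Compared with evaluating all 6 C 2 = 15 pairs, the bound lets the search skip every branch whose value already falls below the best pair found so far, which is exactly the saving described in the lecture.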
So if this value is greater than this, then I do not need to search this whole subtree. I do not need to search the whole tree; if this value is greater than this, and greater than this as well, then I will just stop, and I conclude that this is the best feature subset. Are you understanding? I conclude that this is the best feature subset. And suppose this is not greater than this; then what will I do? Let us just say I remove feature 3 here to get this; then in this whole branch I will keep feature 1 and feature 3 throughout. That means, say feature 4 is removed, feature 5 is removed, feature 6 is removed; then this leaf is going to be the subset {1, 3}. Now I will find the value of J; naturally, when I am doing all these things, when I come to this I am going to compare this with this also. Let us say this is greater than this; now my value of J is this. I compare this one with this and this: if the value of J for {1, 3} is greater than this, and greater than this as well, then I do not need to search the whole tree. Narendra and Fukunaga made the tree intentionally asymmetric. The reason is that if the tree is symmetric you do not get any advantage. Our aim is not to search the whole space, so the part that we do not want to search, we want it to be quite large, and the part that we would like to search should be a smaller area. The part that we do not want to search is actually here; that is the way they wanted to formulate it; and the part that they want to search, most of it, is here. I think before I proceed further, let me first tell you how to construct the tree, and then I will tell you which parts you do and do not need to search. For the construction of the tree, note that this node has three branches, whereas this one has only one, this one has two, and somehow this one has one and this one has two; and if you look at this one, it has three branches. The question is: how does one decide the number of branches of a particular node? First let us see how many levels there are. Let us just say this is the 0th level, this is the first level, this is the second level, the third level, the fourth level; so the number of levels is capital N minus small b plus 1, that is 6 minus 2 plus 1, which is 5; 5 levels are there. Right, now let me write down the formula: the number of branches from a node, plus the number of features to be preserved (to be kept) at that node, must be equal to b plus 1. Let me explain. Look at this node: the number of features to be preserved at that node is nothing, that is 0, so for this you need to have three branches; b is 2, so b plus 1 is 3. Now come to this: this is one branch, this is another; you need to remove one feature to get this, let us just say feature one is removed (how we come to feature one, I will come to that later), and here feature two is removed to get this. Now, when you come to this one, I am saying that you have to always keep these two features. So if you look at this, the number of features to be preserved at that node is 2, so the number of branches should be one, to make it three again; for this the number of branches is one, to make it three again; and for this the number of branches is one, to make it three. Now if you come to this node, the number of features to be preserved is only one: you have to always keep feature one in all of these, that means feature x1 should be present throughout. So one feature is to be preserved, and the number of branches is 2. So let us just say feature 3 is removed to get this; now throughout this portion, features 1 and 3 have to always be
there; so here one, here one, here one. Now if you come to this, again feature one is the only feature to be preserved, so for this you need to have two branches; for this you need to have two branches. The tree is constructed from right to left; it is not constructed from top to bottom, level by level. First you construct this, then you construct these portions, then you construct this. If you come to this, there are no features to be preserved, so you need to have three branches, b plus 1 being three. Now let us just say feature 2 is removed, feature 3 is removed, and say here feature 4 is removed. Now in this whole subtree you have to always keep 2 and 3, so the number of branches for this is one, and for this also one. Again, for this only feature 2 is to be kept, that is one feature, so you should have two branches; similarly here no feature is to be kept, so you need to have three branches. So this is how the tree is generated. Now the question is: we know how to generate the tree, but I said feature 1 is removed to get this, feature 2 is removed to get this, feature 3 is removed to get this; how does one decide those things? People have followed many ways here; I am telling you one way, but it is not the only way that you will find in books. You will surely find this algorithm in Fukunaga's book on pattern recognition, and you will find it in Devijver and Kittler's book on pattern recognition; you will find it in many, many places. The branch and bound algorithm for feature selection is a very standard algorithm; people have been using it for a long time. And you know when it was published: Narendra and Fukunaga wrote two papers, both published in the IEEE Transactions on Computers, one in 75 and another in 77. One was the branch and bound algorithm for feature selection, the other a branch and bound algorithm for the k-nearest neighbor rule. One was in
75 and one in 77; I do not remember which was in which year; you can search the internet for it. So this algorithm has been in existence for a long, long time, and people have made many modifications to it. Now I will tell you a way of choosing these features. Here, in this example, capital N is equal to six. Choose three features randomly from these six features; let us name those three features x1, x2 and x3. Now let us look at the value of J for S − x1, S − x2 and S − x3; S − x1 means that if you remove x1, you are left with five of the features in S. Let us say the relationship is J(S − x1) ≤ J(S − x2) ≤ J(S − x3); that is, if you remove x1 the value of J is reduced the most, here it is reduced the second most, and here it is reduced the least. In such a case, here you write x1, here you write x2 and here you write x3, so that this leaf is going to be {x1, x2}. Now look at the way this is done: J(S − x1) ≤ J(S − x2) ≤ J(S − x3) means this value is greater than or equal to this value, so if this is actually greater than this, this is anyway greater than this; so if this is greater than this, you do not need to check anything. Is this clear? Now, why is it written like this, why is x1 written here and x2 written here? Note that among those three features, removal of x1 has reduced the value of J the most; that means x1 is expected to have the most information. Then the second most important feature is x2, and seemingly the least important feature is x3. So among these three features, which two are expected to be more important? x1 and x2; that is why we have kept x1 and x2 here. So when you decide the number of branches, whether it is 2 or 3 or 4, then from the set of features under consideration — for example, here the number of branches is 3 and the number of features that you have is 5 — so from these five features
select three features randomly and find which one has the most information; that one you write here, and that feature is removed here. The one which has slightly less information than this, but more than the third one, you write here, meaning that feature is to be removed here; and those two features have to be kept in the third one. This is the way you are going to proceed. I think it may be clear to you now why the phrase branch and bound is used: you will develop a branch only when the bound condition is not satisfied, that is, only when you cannot prune. Are you understanding what I said? First you develop this, and then you compare this with this and this; this is anyway greater than or equal to this, so if this is greater than this, then you do not need to do all these things, you have got the optimal feature subset. Otherwise you need to develop this, and suppose this is greater than this; then you do not need to do all these things, you can just be satisfied with this. So if this is greater than this, note that quite a bit of the tree you are not searching. So my comment about the vast portion of the search space that we do not want to search, I hope by now you have understood it. Is this clear to you? I will stop here, four minutes are left, but I surely want your questions. Right: I may select X1, X2, X3, and someone else may select X4, X5, X6? That can happen; if at the start you are selecting bad features, that can happen, and then we may need to search the whole tree. Yes. It is not necessarily true that we will never search the whole tree; what I am trying to say is that many times we do not need to search the whole tree. Can we say what the probability of searching the whole tree is, given a random set of features? I think some such work has probably been done, on the probability of searching the whole tree and on the average complexity; I think some such works have already been done. But the worst case is that you have to search the whole tree, and in fact searching the whole tree means it is not only these 6 C 2, that is 15, leaf nodes; it is also these nodes, plus these, plus this. So it is much more than capital N C small b; but many times you do not need to do it. Please. In a real scenario, sir, given a set of features, it is difficult to ensure that the criterion function has the property? Agreed; that was the first thing I said. Whatever you are saying is perfectly correct: if for some real-life problem the criterion function does not satisfy the assumption that was made, then this algorithm is not really useful, because the main assumption is not satisfied. How will it be ensured that our criterion function satisfies it? Yes, that itself is a complex job; we cannot always ensure it, because some functions do not satisfy it. The example that I gave you, the minimum probability of misclassification, does not satisfy this; I can assure you of that. The claim that the moment the number of features is increased the probability of misclassification decreases — that is not true. If I have a feature set and I want to apply this algorithm, what should we look for, whether we can apply it for the given set or not? Even if you have a set of features and a criterion function, and you are satisfied that the criterion function satisfies the assumption that is made, it may still be difficult for you to apply this particular method: if N is equal to 100 and b is equal to 10, look at the size of the tree that you are going to draw. For really very high values of N, this algorithm may be extremely difficult to implement even if the criterion function satisfies
those properties, and if it does not satisfy them, then you have a real problem. Is there any way to choose these features, like Xi1, Xi2, Xi3, optimally at the first step, so that the search space is reduced? So what some people do is something like sequential backward: the removal of which feature makes the J value decrease the most? That particular feature, which decreases the value of J the most when it is removed, is the one you can write here, instead of a randomly chosen Xi1. That means: you have six features; first remove feature one, then you have five features, find the value of J; then from the six-feature set remove feature two, again you have five features, find the value of J; like that, you do it for all six subsets, each having five elements, and find that subset where J decreases the most from six features to five, meaning removal of that particular feature made the value of J decrease that much. Have you understood? That one you write here. Then why can't we just use sequential backward search? We shall be discussing all these things; I am going to cover sequential forward and sequential backward selection in my next lecture. Any question?
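Ahead of that next lecture, the sequential backward idea mentioned in this answer can be sketched as follows: starting from all N features, repeatedly evaluate J on every subset with one feature removed and drop the feature whose removal hurts J the least, until b features remain. The additive weights below are made up for illustration, and this greedy sketch carries no optimality guarantee, unlike branch and bound under a monotone J.

```python
def sequential_backward(S, b, J):
    """Greedy sketch of sequential backward selection: drop, one at a time,
    the feature whose removal reduces J the least, until b features remain."""
    current = set(S)
    while len(current) > b:
        # evaluate J on every subset with one feature removed;
        # the feature whose removal leaves the highest J is dropped
        drop = max(current, key=lambda f: J(current - {f}))
        current.remove(drop)
    return frozenset(current)

# Toy additive criterion with made-up importance weights.
weights = {"x1": 0.9, "x2": 0.7, "x3": 0.4, "x4": 0.2}
J = lambda B: sum(weights[f] for f in B)
chosen = sequential_backward(set(weights), 2, J)
# chosen is {x1, x2}: x4 is dropped first, then x3
```

Each step needs only as many evaluations of J as there are remaining features, which is why this kind of greedy search is so much cheaper than exploring the branch and bound tree.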