You will find here the solution tree for the branch and bound algorithm for N = 6 and b = 2. One can then set questions of the sort: draw the solution tree for the branch and bound algorithm when N = 7 and b = 2, or when N = 7 and b = 3, or with some other numbers; these can be given as class exercises. One can also give an exercise like: write a C program for finding the optimal set of features using the branch and bound algorithm; you supply a criterion function which satisfies the main assumption (monotonicity) of branch and bound, and ask students to write C code implementing the algorithm. So the branch and bound algorithm provides you the optimal feature subset for a special type of criterion function. But there are other, more general techniques which do not use any properties of the criterion function, and there are some famous methods among them: sequential forward selection, sequential backward selection, generalized sequential forward selection, generalized sequential backward selection, and something called the LR algorithm. These are some of the well known techniques for feature selection, and I will describe one or two of them here. This is the sequential forward selection method. Initially we start with the null set A0. Now A_k denotes the set of k features already selected. We select the (k+1)-th feature, call it x_i0, from {x1, x2, ..., xN} minus A_k, that is, from the complement of A_k; it is selected in such a way that including it with A_k increases the value of J the most. Then you write A_{k+1} = A_k ∪ {x_i0}, and you run the loop (not the program, the loop) b times. So initially you take the null set; now you have all N features, and you ask:
which feature has the maximum value of J? You just find it and include it with A0. A0 is the null set, so A1 will have just one feature. Next, the inclusion of which feature into A1 will make the resulting set have the maximum J value? You include that one and you get A2; like that you get A3, A4, and so on up to A_b. This is known as the sequential forward selection algorithm. The phrase "sequential forward selection" I suppose is clear to you: the word forward is used because each time we are adding features, and sequential because we are adding one feature at a time. And since the word forward is used for addition of features, we also have what is known as sequential backward selection, where you are going to subtract features. No, no, you are starting with A0, so k is 0 to start with, right, and again we are considering the best features, agreed. Then your question is: the feature set that we obtain, is it really the optimal feature subset? The answer is, I cannot guarantee it. This method does not guarantee that you will get the optimal feature subset; it will give you a feature subset, but not necessarily the optimal one. Have you understood? Because the first feature we selected is going to be there throughout, and it may so happen that the optimal feature subset does not contain that particular feature. Once a feature is included it remains there, whereas in the optimal subset it may not be present. If you apply the sequential forward selection method to the example from this morning, we will select feature x1, and x1 will be there throughout; you will never get {x3, x4} that way if you apply this method. Any other question? It starts with A0, yes. So we are using "forward" for the addition of features, and there is also something called a sequential backward selection algorithm.
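The forward-selection loop just described can be sketched in a few lines. This is only a minimal sketch, not code from the lecture: the function name sfs and the toy additive criterion J (a sum of per-feature scores, which ignores interdependencies) are my own assumptions for illustration.

```python
def sfs(features, J, b):
    """Sequential forward selection: start from the null set and,
    b times, add the single feature that increases J the most."""
    selected = []
    remaining = list(features)
    for _ in range(b):
        # pick the feature whose inclusion maximizes J(selected + feature)
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy criterion: J is a sum of per-feature scores (assumed for the demo)
scores = {"x1": 0.9, "x2": 0.5, "x3": 0.4, "x4": 0.3}
J = lambda subset: sum(scores[f] for f in subset)
print(sfs(scores.keys(), J, 2))  # picks x1 first, then x2
```

With an additive J the greedy choice is trivial; the interesting cases are criteria where features interact, which is exactly why the method cannot guarantee the optimal subset.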
There you are always going to subtract features. So how is it done? You write Ā_k; let us say it denotes a subset containing N − k features. Suppose we already have Ā_k. Then from Ā_k we choose a feature x_i0 ∈ Ā_k such that J(Ā_k − {x_i0}) ≥ J(Ā_k − {x_i}) for all x_i ∈ Ā_k. Now you write Ā_{k+1} = Ā_k − {x_i0}. You start with all N features and you go down to small b. That means initially you start with Ā_0, the subset containing N − 0 = N features, which is nothing but the whole set S, and then you start removing features one by one. Is this correct? You remove x_i0 such that J(Ā_k − {x_i0}) ≥ J(Ā_k − {x_i}) for all x_i ∈ Ā_k. Will it be greater than or equal to, or less than or equal to? We are doing maximization, so it is greater than or equal to: there also it was ≥, here also it is ≥; you need to convince yourself that it is ≥. You remove the feature which has minimum information; minimum information means that removal of that feature does not reduce the value of J that much, whereas the other features would each reduce the value of J a lot; removal of x_i0 does not reduce it that much, so you remove x_i0. Have you understood? So this is sequential forward search and sequential backward search. Between these two methods, if I ask you which one you will choose and why: let us say b is less than N/2. That is what you are trying to ask me. If b < N/2, which one will you choose, and why?
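The backward loop can be sketched in the same style as the forward one; again this is my own minimal sketch with an assumed toy criterion J, not the lecturer's code.

```python
def sbs(features, J, b):
    """Sequential backward selection: start from the full set and
    repeatedly delete the feature whose removal keeps J highest,
    until only b features remain."""
    selected = list(features)
    while len(selected) > b:
        # remove the least informative feature: the one whose deletion
        # leaves J(selected minus that feature) as large as possible
        worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
        selected.remove(worst)
    return selected

# same toy additive criterion as before (assumed for the demo)
scores = {"x1": 0.9, "x2": 0.5, "x3": 0.4, "x4": 0.3}
J = lambda subset: sum(scores[f] for f in subset)
print(sbs(scores.keys(), J, 2))  # x4 is dropped first, then x3
```

Note the direction of the comparison: we keep the subset with the greater J, so the deleted feature is the one whose absence hurts J least, matching the ≥ in the formula above.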
With the backward one you will do more and more computations, right; in this case you will use the forward one, since the number of computations may be less. But from the point of view of accuracy, which one will you find to be better? Computationally, you think the forward one needs fewer computations than the backward one; okay, if b < N/2, right. But accuracy-wise, which one may you find better? Think about it. Probably there is no general answer regarding the accuracy, but many times the accuracy of the backward one is slightly better than the accuracy of the forward one, because in the backward one you start from the existing set of features, which is generally a very big number, and reduce from there, so you are taking the interdependencies among features into consideration more than in the forward case. I am not claiming that the accuracy of backward selection will always be better than that of forward selection; I do not want to make that statement. But there you take care of the interdependencies more, and that is why you do more computations. So many times the accuracy of backward selection, accuracy from the point of view of J, is found to be better than what you get using forward selection. Many times, yes, but not always. The problem is that the first feature you select is the one for which the value of J is maximum, but it may so happen that a combination of a few other features gives a much better value of J, and that much better value you may never reach while this first feature is present. So yes, many times whatever you said is going to happen, but it still does not always lead you to the optimal feature subset; it gives you some feature subset which is sometimes
optimal, but not always optimal. But your comment, I think, is valid very many times; very many times, that much I can say. A student asks: when our aim is to remove duplicate features and get a subset containing only unique features, should we or should we not use this method? First, a similar thing will happen even if you use backward selection: in backward selection also, the features that are removed are the ones not expected to contain much information, so the same issue arises there too. And if some such redundancy is present, you can surely use forward selection; I am not saying that you should not use it. If I say that you should not use it, I must be in a position to give you a better algorithm, which I am not, so I have no way of saying that you should not use it. But if you want me to state that the redundant features will always be removed, some such statement I am not prepared to make. I can never tell you not to use an algorithm, not forward selection or backward selection; that I cannot tell you. You are not exactly satisfied? The student says: no sir, that is not my point. I am thinking that, the way we do for data selection, where we generally cluster the data, take the cluster means, and then say these are the instances that can represent the whole set, can we do something similar for features? Does clustering of features exist? Yes, it has been done; I will come to it. If time permits, at some point I will surely discuss clustering of features. It has been done, and the results are many times satisfactory, but they can always be improved. The main problem with feature selection methodology is that there is nothing called a best feature selection method. That means whatever algorithms are there, they all have some negative points, so you can
always improve upon them. When I say that you can always improve upon them, as a researcher one may feel happy that there are ways to improve, that there is so much work still to be done; but as a person working in the field, if everything is to be improved, then what is the state of the art? Are you understanding? From that point of view it is not exactly good, but that, unfortunately, is the state of affairs in feature selection: there are too many methods, the methods are to be improved, and there is the main drawback about these algorithms which I have already stated. So there is always scope for improvement; that is the state of affairs. A student asks: one more question, please. I am again taking the analogy of clustering. In clustering the number of clusters is so important, right? Here less attention has been given to the optimal number of features to be selected: we assume that we will select some b number of features in the final subset, but that may not be the optimal feature set in many cases; it should vary with the data. Why has so little attention been given to finding the optimal number of features in the final feature set? Well, when do you call the value of b optimal? The student continues: like in the iris data set, which has four features; as far as I remember, I think two of the features are good and the other two are not that good, something like that. In the iris data set, yes. So in that data set, if we have to select three features, that will give a poorer result than selecting two features; is it the case with every data set? No, it is not the case with every data set. In iris you are somehow able to say that the number of features to be selected is two, and two is in some sense optimal; can you say that with every data set? Probably not, because we do not
know how many features are good and how many noisy features are there. Probably, yes, that is true, but on the other hand, even when you are given the complete data set (in the UCI archive there are many data sets available, and you can apply your feature selection methods there, so the information is given to you), can you say, for each data set, or at least for some data sets, what is the optimal number of features? The phrase "optimal number of features" is to be defined, and you would like that definition to be universal, so that you can apply it to every data set and then say, for each data set, this is the optimal number of features to be selected. So on one hand you would like to define the optimality of the number of features, and on the other hand you would like to do it in a universal way. In my opinion it has not been done, and probably it cannot be done, because if you define "optimal number of features" in one way, I can give another definition in some other way. It is like asking which clustering is better: can you really say? I gave examples where, for the same data set, different clusterings have equal meaning. I am sure one can construct examples for feature selection too, where for the same data set you get two or three different numbers of features which have, in some sense, the same importance value. Maybe for two features you get {x1, x2} as one set which has the same value as, let us say, {x3, x4}, and both are better than all the other pairs; so if you are to select two features, {x1, x2} is one answer and {x3, x4} is another. Similarly for three features you might get several such combinations. I mean, that uniqueness
probably you may not have, either for the number of features or, once you fix the number, for the particular set of features found by the search. So, just as we cannot say which clustering is better in some examples, here also I am sure there are many examples where the optimality of the number of features is something one is not in a position to define. But this is my personal feeling, and it can always be criticized; it can always be criticized, okay. Any other question? A student asks: the non-monotonic property, that the two best features are not always the best two? Yes, this is what I was trying to tell you. So we discussed sequential forward selection and sequential backward selection. There are also generalized algorithms in this regard, called generalized sequential forward selection and generalized sequential backward selection. The word "generalized" refers to the number of features that are added or subtracted at a time. Note that in the usual sequential forward selection we add one feature at a time, and in the usual sequential backward selection we delete one feature at a time; but in the generalized sequential forward selection we add r features at a time, and in the generalized sequential backward selection we remove r features at a time. So one can have algorithms parameterized by this value r. A slightly more complicated method is what is known as the LR algorithm. Since the analogy between clustering and feature selection has already been brought up by one of the students: in clustering we have algorithms where you go on splitting the clusters, and algorithms where you go on merging the clusters. Merging the clusters is something like forward selection; splitting the clusters is something like backward selection.
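The generalized forward step, adding the best group of r features at once, can be sketched as follows. This is my own illustration: the name gsfs and the toy additive criterion J are assumptions, and the sketch brute-forces all r-subsets of the remaining features, so its cost grows quickly with r.

```python
from itertools import combinations

def gsfs(features, J, b, r):
    """Generalized sequential forward selection: at each step, add the
    group of r features that maximizes J together with the current set
    (r = 1 reduces to plain sequential forward selection)."""
    selected = []
    remaining = set(features)
    while len(selected) < b:
        step = min(r, b - len(selected))   # last step may need fewer features
        # among all r-subsets of the remaining features, take the one
        # that maximizes J together with what is already selected
        best = max(combinations(sorted(remaining), step),
                   key=lambda grp: J(selected + list(grp)))
        selected.extend(best)
        remaining.difference_update(best)
    return selected

scores = {"x1": 0.9, "x2": 0.5, "x3": 0.4, "x4": 0.3}
J = lambda subset: sum(scores[f] for f in subset)
print(gsfs(scores.keys(), J, 2, 2))
```

With an additive J the group step gives the same answer as one-at-a-time selection; the point of the generalization is criteria where features interact, where the best pair need not contain the best single feature.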
In clustering you also have split-and-merge techniques; similarly, here there is an algorithm called the LR algorithm, where in one iteration you add L features and you remove R features. Now L and R have to be chosen in such a way that ultimately you end up with the required number of features, b. There is a starting point for this LR algorithm. If you are starting with the null set, then you should first add L features and then remove R features; that means R must be less than L. So starting from the null set you add L features, remove R features, again add L, remove R, add L, remove R, so that as you go on, at some point you get b features; you have to choose the corresponding L and R. If the starting point is not the null set but the entire set, then first you remove R features and then add L features, remove R, add L, and so on; here R must be greater than L, because the set should ultimately decrease. So again R and L have to be selected in such a way that you end up with small b features. That is basically the LR algorithm. The LR algorithm is, in a sense, more generalized than generalized sequential forward search or generalized sequential backward search, and the generalized forward and backward search algorithms are more generalized than their original counterparts, sequential forward and sequential backward search. You might be wondering why people are making it more and more complicated. The reason is that somehow you should try to take the interdependencies among features into account as much as possible.
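A sketch of the plus-L, take-away-R loop for the null-set start (L > R). The function name and the simplifying assumption that b is a multiple of L − R are mine, and each add/remove phase is done greedily one feature at a time, as in the plain forward and backward methods.

```python
def lr_select(features, J, b, L, R):
    """LR algorithm, starting from the null set: add the best L features
    one at a time, then delete the R least useful ones, and repeat until
    b features remain (assumes L > R and that b is a multiple of L - R)."""
    assert L > R and b % (L - R) == 0
    selected, remaining = [], set(features)
    while len(selected) != b:
        for _ in range(L):   # plus-L phase: greedy forward steps
            best = max(remaining, key=lambda f: J(selected + [f]))
            selected.append(best)
            remaining.discard(best)
        for _ in range(R):   # take-away-R phase: greedy backward steps
            worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
            selected.remove(worst)
            remaining.add(worst)
    return selected

scores = {"x1": 0.9, "x2": 0.5, "x3": 0.4, "x4": 0.3}
J = lambda subset: sum(scores[f] for f in subset)
print(lr_select(scores.keys(), J, b=2, L=3, R=1))
```

Starting from the entire set instead, you would run the removal phase first, with R > L; the sketch covers only the null-set start.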
By adding r features at a time in generalized sequential forward search, you are somehow trying to take care of the dependencies among the features; similarly, when you delete r features at a time in generalized sequential backward search, there also you are trying to take care of the interdependencies. And in the LR algorithm, which is the most complicated of all these five methods, at a time you are adding L features and removing R features; there also you are somehow trying to take care of the interdependencies among the features as much as possible. For all these methods there is a nice book, Devijver and Kittler's book on pattern recognition; that book contains six chapters on feature selection and extraction, and all these methods are very nicely discussed there. Naturally you can see that these methods, even though they have been in existence for a long, long time, are still in use; people are still using many of these techniques. The reason is that they have not been able to find really better methods than these. Yes, sometimes better methods have been found in some applications, that is quite true, but these methods are still relevant, because people are still using them in their fields and still getting good results with them. If you think that some method is really useless, then one must be able to show that some other method always works better, and show it using several, several examples. In that way, people have not been able to show that much; I mean, they have not been able to make that many negative remarks about the sequential forward, sequential backward, generalized sequential forward, generalized sequential backward, and LR methods. They have been able to show some things
regarding the branch and bound algorithm: because of that particular assumption about the criterion function, and because the tree, with all those links, may be really difficult to implement when the number of dimensions N is very large. So people have been able to show some negative things about the branch and bound algorithm, but not that much about the other five methods that I have described in the last hour and fifteen minutes. She was asking me about feature clustering; some methods are there. If you want to do something like feature clustering, let us go back to what we did for clustering of points. When clustering points we used certain similarity or dissimilarity measures between points, and we used those measures in our clustering algorithms. Now for features also you somehow need to get the similarity between two features, or the dissimilarity between two features. How do you get it? Once you have a similarity or dissimilarity between features, you can apply an algorithm similar to a clustering method and do clustering of features. So how do you get a dissimilarity between two features? Suppose you have two vectors (a1, a2, ..., am) and (b1, b2, ..., bm). This is a similarity measure between two vectors; I showed this slide in one of my previous lectures. You have a vector a1 to am and another vector b1 to bm; then the similarity between these two vectors, and this is one way of defining similarity, not the only way, is the cosine of the angle between the two vectors. Now suppose that for features x1 and x2, when x1 takes the value a1, x2 takes the value b1; when x1 takes the value a2, x2 takes the value b2; and when x1 takes the value am, x2 takes the value
bm. Then the similarity between these two features can be measured by the same quantity. Is it clear to you, or shall I tell you once again? I will repeat it. There are two features x1 and x2: when x1 takes the value a1, x2 takes the value b1; when x1 takes a2, x2 takes b2; when x1 takes am, x2 takes bm. You would like to find the similarity between x1 and x2, and you can use this cosine measure. A student asks: so is it a one-one function from the a values to the b values? No, no, it is like this: for one person the height is a1 and the weight is b1, for another person the height is a2 and the weight is b2, and for the m-th person the height is am and the weight is bm. That is a relevant question. There is another relevant question: where is the correlation coefficient coming into the picture? If I write a vector y1 as (a1 − ā, a2 − ā, ..., am − ā), where ā is the mean of the a values, and a vector y2 as (b1 − b̄, b2 − b̄, ..., bm − b̄), then the similarity between y1 and y2, computed by that same formula, is actually the correlation coefficient. A student remarks: but then we are getting a linear correlation coefficient; we are just looking at the linear relationship, like the correlation coefficient you mentioned in the last class. I mentioned the correlation coefficient here because I wanted you to understand the difference between the two formulations: in this one, if you put back ā you get the raw a vector, and if you put back b̄ you get the raw b vector, and the centered version gives you the correlation coefficient. So just the removal of the means gives you the correlation coefficient, and if you do not remove the means it is simply the cosine of the angle between the two vectors.
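The relationship can be verified directly from the standard formulas; the height/weight numbers below are toy values of my own, not data from the lecture.

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two value vectors: one way of
    defining the similarity between two features."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def correlation(a, b):
    """Subtracting each feature's mean first turns the same cosine
    formula into the (linear) correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

heights = [150, 160, 170, 180]   # feature x1: values a1..am (toy numbers)
weights = [50, 60, 70, 80]       # feature x2: values b1..bm
print(correlation(heights, weights))   # 1.0: perfectly linear relationship
print(cosine(heights, weights))        # close to, but below, 1.0
```

On this perfectly linearly related toy data the correlation is exactly 1, while the raw cosine is slightly below 1 because the vectors are not centered.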
So why did I mention this? The reason is that you need to really understand that there is a relationship between the two: I just removed the means, and you get the correlation coefficient; if you do not remove the means, it is just the cosine of the angle between those two vectors. So this is one way of dealing with it, but it is not the only way; there are other ways in which one can define similarity or dissimilarity between two features. Dr. Sukhande Das will teach you principal components at some point; after he teaches principal components, I will talk about similarity or dissimilarity between two features by looking at the covariance matrices. I also have to tell you a few things, since we have been talking about the correlation coefficient, about this word linear, the linear correlation coefficient. The reason why one uses the word linear is that whenever a linear relationship exists between the variables, you are able to capture it nicely using the correlation coefficient, whereas when the relationship is not linear, you are not able to capture the relationship properly using the correlation coefficient. That is why one would like to call it linear; though I do not like the word used in that way, I understand why it is used. I would like to tell you some more things regarding the correlation coefficient. Suppose you have a two-dimensional data set, two variables x and y, and you have found the correlation coefficient between them. Then you rotate the whole data by some angle theta, so you get new points (x1, y1), (x2, y2), ..., (xn, yn). Are you understanding? Now try to find the correlation coefficient between the new variables. Do you think it will be the same as the previous one, or do you think there will be a difference? The answer is: they are not
necessarily the same. If you rotate the whole data, the correlation coefficient values need not remain the same; that is one problem, and there are other issues regarding the correlation coefficient as well. But then, how do you find one particular attribute, some quantity, that remains invariant even if you rotate the data by any angle whatsoever? Do you have something of that sort? My answer is yes, there are some things that will not change even if you rotate the data by any angle whatsoever: 0 to 359 degrees, or, if you look at it in radians, uncountably many angles in which you can do it. There is something that is invariant with respect to this rotation, and you will learn those things when we deal with principal components, where you will find that they are basically rotation invariant. I think I will stop here.
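As a small numerical check of the two claims above (the toy data and helper names are my own, not part of the lecture): rotating a two-dimensional data set changes the correlation coefficient, while the total variance, which is the trace of the covariance matrix and hence the sum of its eigenvalues, the quantities principal components expose, stays the same.

```python
import math

def correlation(xs, ys):
    """Linear correlation coefficient of two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def total_variance(xs, ys):
    """Sum of the scatters of x and y: proportional to the trace of the
    covariance matrix, i.e. to the sum of its eigenvalues."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) ** 2 for x in xs)
            + sum((y - my) ** 2 for y in ys))

def rotate(xs, ys, theta):
    """Rotate every 2-D point (x, y) about the origin by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return ([c * x - s * y for x, y in zip(xs, ys)],
            [s * x + c * y for x, y in zip(xs, ys)])

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.5, 1.9, 3.1]                    # toy data, strongly correlated
xr, yr = rotate(xs, ys, math.pi / 4)
print(correlation(xs, ys), correlation(xr, yr))        # the correlation changes
print(total_variance(xs, ys), total_variance(xr, yr))  # this does not
```

Rotation is an orthogonal transformation, so it preserves sums of squared distances from the mean; that is exactly the invariance that principal components exploit.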