Mastering feature selection basics for developing your own algorithm. This probably goes with the theme of open data science. This is a beginner-level session, so I want to give you some basics so that you can develop an algorithm on your own from simple intuition. For maybe the first 20 minutes I will try to explain what feature selection is and why people are still working on it, and then I will talk about some of our intuition and how we have worked on it. By the way, I am Dr. Shri. I teach at the University of Calcutta. I am also an executive committee member of the Society for Data Science, a non-profit organization trying to promote data science activities and nurture the community, and I am a life member of the Master's Research Club. So here are my contents: introduction; what is feature selection; why we do feature selection; intuition to build a simple feature selection algorithm; extension to graphs; can we make Naive Bayes less naive; and finally we conclude. That is the outline.

So basically feature selection is a preprocessing step. Anywhere you are dealing with machine learning or data science, whether it is classification, clustering, regression, time series or recommender systems, you are dealing with data. You have rows and columns, and the columns are your attributes, features or covariates, depending on which community you are from. The idea is to find a subset of the features without compromising on the model performance: you cannot compromise on the classification accuracy or the clustering validation measure you are dealing with.

Some other related terms that need a mention are feature extraction, dimensionality reduction and feature engineering. Feature extraction is something you do when your data is unstructured: you have images and you try to extract some shape features or texture features, or you have text documents and you try to find the parts of speech. That is feature extraction. Dimensionality reduction, if you have done an example of it, is principal component analysis: you have features or attributes but you transform them, making them a linear combination of the original features. So for problems where domain understanding is very important, dimensionality reduction is not recommended. How does dimensionality reduction happen? The variability of the data remains as is, but you project it to some lower dimension so that the variability stays intact. So if you have 10 features, you may get 3 principal components which explain almost the same amount of variability. Coming to feature engineering, probably all three of the things we discussed come under feature engineering.

Now let us see some of the key benefits. Of course the model becomes more understandable, you can do better visualization, the model generalizes better, it is efficient in terms of time and space complexity, and your data collection effort goes down. And it has varied applications, from text classification and image retrieval to medical diagnosis and recommender systems; it applies across many domains. Fine, but is it still a challenging problem? It seems quite simple so far. So how can we do it? First, what are feature subsets? If you look at this example, if I have a dataset of 3 features, it will have 2 to the power 3 minus 1 = 7 non-empty subsets: {f1}, {f2}, {f3}, {f1, f2}, {f2, f3}, {f1, f3} and {f1, f2, f3}.
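To make the counting concrete, here is a minimal Python sketch (mine, not from the talk) that enumerates the non-empty subsets; with n features there are 2**n - 1 of them, which is why exhaustive enumeration blows up quickly.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of a feature list (2**n - 1 of them)."""
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            yield subset

# Three features give 2**3 - 1 = 7 subsets, exactly as on the slide.
for s in all_feature_subsets(["f1", "f2", "f3"]):
    print(s)
```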
Now let us say I have somehow defined a measure of how good a feature subset is. If a higher value means better, then the one at the top, only {f1}, is the best feature subset. Looks quite good, right? Why do we really need to write our own algorithms, or even study this? Well, take the dataset that is like the "hello world" of computer vision, the MNIST dataset of handwritten digit recognition. The input is a 28 x 28 pixel image, and 28 x 28 means 784 features. Let me show you the magnitude of 2 to the power 784: this is how many subsets there are, even if I have a measure and can go through them one by one. Now say I have a computer that can evaluate 10 to the power 8 feature subsets per second, running all day, every day. This, on the slide, is the number of years it would take. That is why people are still studying feature selection, and you can understand that 784 features is not even large.

Now I will give you an informal understanding; I hope you have already appreciated why feature selection is done. This is the textbook example of text classification: classifying an email as spam or as a normal email. In the most simple form you represent a text by counting the words it contains. Even with maybe 500 emails you can have 1000 distinct words, a lot of features, and you understand that 2 to the power 1000 is a big number. But if we somehow find that when we plot the normal emails and the spam emails with respect to the counts of the words "discount" and "free", they are separated like this, with the spam ones having the higher frequencies, then the classification model becomes very, very simple. So this is one of the motivations.

So far we have described feature selection in the context of classification, and our talk will be on classification too, but just as food for thought, here is an example for clustering. Clustering is all about natural grouping. How will you group these naturally? Shape, color; some may say volume. The word "natural" is not very technical, so doing feature selection for clustering is much more difficult than doing it for classification.

Now I will discuss how to build a simple feature selection algorithm, using a dataset that is publicly available on UCI as well as Kaggle: a breast cancer dataset. There are 32 attributes. The first is the ID, an identification number for the patient, not required for the problem. Then the diagnosis: whether the tumour is malignant or benign. And then you have 30 real-valued features; the dataset documentation explains how these real-valued attributes are calculated. Initially we can just do descriptive statistics. I have not shown all the features, but we start with this, and maybe we are lucky and see that for some of the features the standard deviation is 0. Even for the MNIST dataset that I talked about, some features had standard deviation 0. So I can drop them; maybe that is my first step.
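As a minimal sketch of that first step, assuming the data sits in a CSV with hypothetical file and column names, the zero-variance filter is a few lines of pandas:

```python
import pandas as pd

# Hypothetical file and column names; the WDBC data is on UCI and Kaggle.
df = pd.read_csv("breast_cancer.csv")
df = df.drop(columns=["id"])            # the patient ID is not a feature

print(df.describe())                    # descriptive statistics per column

# First step: drop features whose standard deviation is 0,
# since a constant column cannot discriminate anything.
numeric = df.select_dtypes("number")
constant = numeric.columns[numeric.std() == 0]
df = df.drop(columns=constant)
print("dropped:", list(constant))
```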
But before building our algorithm, let us see what a good feature is and what a good feature subset is. A good feature is something which is related to the target variable. If I am thinking of the placement success of a student, an aptitude test score is a relevant measure, while his or her blood group, father's name or native place may have no bearing. So: relevant, and not redundant. I have given an example that the scores of all mock tests may not be important: if you have taken a lot of mock tests, maybe after 6 or 7 of them the same pattern keeps coming. And here I have put a question: is redundancy bad? If we get more features and they are redundant, is that bad? Let me give a small example. Maybe you are preparing a sweet dish; you have added sugar on one side, you are also adding honey, and maybe also jaggery. Of course you understand it will be over-sweet. In the same way, if you have features which express the same information, your model gets complicated, or biased towards that information. So what will be a good feature subset? A good feature subset will have relevant features which are not redundant, and the subset itself is not unnecessarily large.

Now let us think of building a feature relevance measure. What would be the intuition? Say we are talking about binary classification, like the dataset we saw, and I plot the value of a feature on the x-axis and the class-wise frequency or probability distribution on the y-axis. What should I want: the two distributions to be close, or to be separated? I want them to be separated, right? So here is a plot, and you can look at features 9 and 10. Are they discriminating; do you feel they are relevant? Let me tell you that the blue one is the malignant plot and the amber one is the benign plot. They look quite identical, don't they? So they are not discriminating. But if you look at, say, feature 1, that is quite discriminating, and features 7 and 8 are quite discriminating. This can give you some idea. And this is not the only plot: you can also use a box plot. It is another way of looking at the same thing, but the box plot gives you one more thing, which is the outliers. If you compare, say, feature 3 and feature 8, they may have the same amount of separability, but feature 8 has many more outliers, and you want the features which have fewer outliers. These are some intuitions; as I said, we want to build feature selection methods which are intuitive. And if this can be formulated as a numerical measure, we call it a feature relevance measure. (This slide is just a reference on box plots in case you need to refresh yourself; I am not spending time on it.)

So, a basic feature selection algorithm. I will give it maybe 2 inputs: one is the dataset and the other is the number of features I want to select. I will rank the features by feature relevance and take the top k features. That is the most basic feature selection algorithm. And there are several options: I can take the top k, I can take the top p percent, or I can cut off where there is a significant drop. What do I mean by that? Say there are 10 features. Feature 1 has a score of 10, feature 8 has a score of 9.5, feature 7 has a score of 9, and then there is a drop: the next feature, maybe feature 4, has a relevance score of 4. There is a significant drop, so we just stop at the first 3 features. That is another way.
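Here is a small sketch of this ranking step. The talk does not fix a particular relevance measure, so scikit-learn's ANOVA F-score stands in for it below, and the drop_ratio heuristic is one hypothetical way to formalize "significant drop":

```python
import numpy as np
from sklearn.feature_selection import f_classif

def rank_and_take_top_k(X, y, k):
    """Rank features by a relevance score and keep the top k.
    The ANOVA F-statistic stands in here for 'feature relevance';
    any class-separability measure could be plugged in instead."""
    scores, _ = f_classif(X, y)
    order = np.argsort(scores)[::-1]        # most relevant first
    return order[:k]

def take_until_drop(scores, drop_ratio=0.5):
    """Variant: keep features until the score falls off a cliff.
    With scores 10, 9.5, 9, 4 this stops after the first three."""
    order = np.argsort(scores)[::-1]
    kept = [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if scores[nxt] < drop_ratio * scores[prev]:
            break
        kept.append(nxt)
    return kept
```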
These steps have just been explained on the slide: you have the features, you calculate the measure, you rank them, and you take the best features, in 3 steps. So this takes care of relevance. But I talked about another thing; what was that? Redundancy. We have not taken care of redundancy yet. We can measure redundancy by correlation. These 2 features show high correlation: a Pearson correlation coefficient of 0.86 with a very small p-value, so the correlation is statistically significant. How can I add this redundancy check to the simple algorithm we built? What we will do now is still rank the features and add them one by one to the feature set, but when I evaluate the next feature, I will make sure it does not have a high correlation with the features I have already selected. I have explained this with an example. Of course this needs another parameter, a threshold, theta; it tells you how much correlation I will allow.

So here is the example. Say I have a dataset with 4 features, F1, F2, F3, F4, with relevance scores 0.9, 0.8, 0.86 and 0.4, and I also have a redundancy matrix. You will see that the upper triangular part is unfilled, or marked 0. What is the reason for that? The matrix is symmetric: the correlation between F1 and F2 and between F2 and F1 is the same. Now, F1 has the highest relevance, so I pick F1 first with 0.9. Next, F3 is the next one in terms of relevance, with 0.86. But before picking it I need to check that it does not have a high correlation with F1, and when I check, I see a correlation of 0.7. I have used a threshold of 0.67, so I will not include this feature; that amount of redundancy is not allowed. The next candidate, F2, passes the check, so my final feature subset is F1 and F2, as I have taken k = 2 and threshold = 0.67. Now I think you have got an idea of how to build a basic feature selection algorithm.
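Here is a sketch of this greedy rank-and-check procedure, replaying the worked example. Note that only the 0.7 correlation between F1 and F3 was quoted in the talk; the other matrix entries below are made up to complete the illustration.

```python
import numpy as np

def greedy_select(relevance, corr, k, theta=0.67):
    """Walk features in decreasing relevance; accept a feature only if
    its absolute correlation with every already-selected feature
    stays below the threshold theta."""
    order = np.argsort(relevance)[::-1]
    selected = []
    for f in order:
        if all(abs(corr[f, s]) < theta for s in selected):
            selected.append(f)
        if len(selected) == k:
            break
    return selected

# The talk's example: relevances of F1..F4; only the F1-F3 value (0.7)
# was given, the remaining correlations are hypothetical fill-ins.
relevance = np.array([0.9, 0.8, 0.86, 0.4])
corr = np.array([[1.0, 0.3, 0.7, 0.2],
                 [0.3, 1.0, 0.4, 0.1],
                 [0.7, 0.4, 1.0, 0.5],
                 [0.2, 0.1, 0.5, 1.0]])
print(greedy_select(relevance, corr, k=2))   # [0, 1] -> {F1, F2}
```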
Now it is time to understand the feature selection family in more detail. In terms of approach, it can be a filter approach, a wrapper approach, an embedded approach or a hybrid approach. What is a filter approach? A filter evaluates some statistical property of the feature subset (we will see an example), and it does not need any model: I take the feature subset, get a measure, and then I can see which subset is good and which is not. Then there is the wrapper. A wrapper needs a model: we take the feature subset, train a model on it and see how well it performs; that is the measure. Embedded means the following: in all machine learning models you are trying to minimize a loss, how well the model fits the data. Often there is also a penalty term apart from the loss function, which makes sure your model is not overly complicated. You can make that penalty a function of the features, and then it will not allow the model to select a subset with a large number of features. And of course, as the name suggests, hybrid: you may start with a filter approach to filter out some features, and then use a wrapper. Each has advantages and disadvantages.

In terms of reduction technique, it can be univariate, so the ranking-based method we saw comes under univariate; it can be a search-based method, where you find some strategy to search the entire feature-subset space (again I will give an example); or it can be a grouping-based method, where you cluster features and try to pick some from each cluster. In terms of supervision it can be supervised or unsupervised, for classification or clustering. In terms of output it can be ranks, so you get a ranked list of features, or it can be a subset of features.

Here is one example of the kind of measure we are talking about. If you look at it, it is a function of three things: S_k denotes a feature subset and |S_k| denotes its cardinality; Rel(f_i, c) indicates the relevance of an individual feature f_i of the subset to the class c; and Red(f_i, f_j) measures the redundancy between a pair of features. One common form is something like J(S_k) = (1/|S_k|) * sum over f_i in S_k of Rel(f_i, c) minus (1/|S_k|^2) * sum over pairs f_i, f_j of Red(f_i, f_j). A very famous, very highly cited feature selection algorithm built on this idea is mRMR, minimum redundancy, maximum relevance; that is where the term comes from. Again I have given an example of how it works: if the feature relevance scores and the feature redundancy matrix are given, you can simply evaluate subsets one by one. I have taken {F1, F2, F3} and {F2, F3, F5} just as an example. Both have the same mean relevance of 0.5, but {F1, F2, F3} has a lower interaction of 0.07 while {F2, F3, F5} has a higher interaction of 0.12, so the first feature subset is better. But this only allows the following: if you have a feature subset, you can measure how good it is. That is all. We are still not sure how this handles the large numbers I showed you in the context of MNIST.

For that, there are several search techniques that have been analyzed; that is a subject by itself. Basically, search methods can be differentiated between continuous and discrete, meaning the variables you are dealing with are of continuous or of discrete nature. In our case, when we are doing feature subset selection, we are dealing with a discrete input. What do I mean by that? Say you have five features in a dataset and you want to select only the first feature and the fifth feature: you represent this as a binary string, 1, 0, 0, 0, 1. That is why it comes under discrete search. And discrete search can again be split into complete methods and approximate methods. Complete methods can find the optimal solution, like branch and bound, dynamic programming, or the brute-force method I mentioned, going through subsets one by one. But you can understand that this will not work at scale. So there are approximate methods, which can be roughly bifurcated into nature-inspired ones, like genetic algorithms or particle swarm optimization, and others like simulated annealing. How do we do a search? We start with the original set of features, we generate a subset, and we evaluate how good it is. If it meets my termination criterion, I end there; or else I generate another candidate subset. That is how a search process typically continues.
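As an illustration of discrete search, here is a toy hill climber over binary masks with an mRMR-flavoured subset score. It assumes a relevance vector and a correlation-like redundancy matrix with ones on the diagonal, and it sketches the generate-evaluate-terminate loop, not the talk's exact method.

```python
import numpy as np

rng = np.random.default_rng(42)

def subset_score(mask, relevance, redundancy):
    """mRMR-flavoured score: mean relevance of the picked features
    minus their mean pairwise redundancy (off-diagonal entries only;
    the diagonal is assumed to be 1, as in a correlation matrix)."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return -np.inf
    rel = relevance[idx].mean()
    if idx.size == 1:
        return rel
    sub = np.abs(redundancy[np.ix_(idx, idx)])
    red = (sub.sum() - idx.size) / (idx.size * (idx.size - 1))
    return rel - red

def hill_climb(relevance, redundancy, iters=1000):
    """Search over binary masks (1,0,0,0,1 means 'pick f1 and f5')
    by flipping one bit at a time and keeping improving moves."""
    n = relevance.size
    mask = rng.integers(0, 2, n)
    best = subset_score(mask, relevance, redundancy)
    for _ in range(iters):
        j = rng.integers(n)
        mask[j] ^= 1                       # generate a neighbouring subset
        score = subset_score(mask, relevance, redundancy)
        if score > best:                   # evaluate; keep or undo the move
            best = score
        else:
            mask[j] ^= 1
    return mask, best
```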
So now I will move to something called graph-based feature selection. Again this uses a public dataset, the breast tissue dataset. It contains electrical impedance measurements of freshly excised tissue samples from the breast, and the problem at hand is predicting the class value. This is not binary classification; there are six classes: carcinoma, fibro-adenoma, mastopathy, glandular, connective and adipose. And you have nine features, whose descriptions are given. But let us try to see how, starting from a dataset, I can generate a graph, using the intuition we have been talking about.

Let us start with the correlation matrix. I am showing the correlation matrix, and you will see that some pairs have a very high correlation, 0.98. If you remember, the range of correlation is from minus 1 to plus 1, and do you think the sign matters? No, it does not; the absolute value is what matters. So we have quite high correlation for some pairs and very low correlation for others. Good; now we have a correlation matrix. Can we convert it into a graph? Any thoughts? Perfect. How do we convert this into a graph, what will be the nodes and edges, and how is a graph represented as a data structure? Some of you may not be from a computer science background, but generally we use a graph whenever there is a relationship, a social network being the typical example; you show that in terms of a graph. And a graph is typically represented as a matrix, the adjacency matrix: you have as many rows and columns as you have nodes, and a one means there is an edge, say between A and B. Wherever there is a one, there is an edge, for example between D and E if you look at the fourth row.

So, as she suggested, there are two ways I can create a graph out of the correlation matrix. One is to create a weighted graph; that will be a fully connected graph, with all nodes connected to each other. The alternative approach is to use a threshold: I keep an edge only when the absolute correlation is more than that threshold (we were keeping a threshold of 0.67, if you remember); otherwise there is no edge. If we apply that to the dataset, I get a graph like this out of the dataset that I have. We call it a feature association map, or FAM. Now can you tell me how I can select some features out of it? At least this much is intuitively clear: there is a big cluster where features are interconnected, and there are two features outside that cluster. So probably I can pick one or two from the cluster and the two from outside, and that can give me a good feature subset. Again, this is the intuition: how you can use the correlation matrix to tackle the problem of redundancy.
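A minimal sketch of that thresholding step with networkx, assuming the features sit in a numeric pandas DataFrame (df.corr() gives Pearson correlation by default):

```python
import networkx as nx
import pandas as pd

def feature_association_map(df: pd.DataFrame, theta: float = 0.67) -> nx.Graph:
    """Nodes are features; an edge joins two features whenever the
    absolute correlation between them exceeds the threshold theta."""
    corr = df.corr().abs()                 # the sign does not matter, the magnitude does
    g = nx.Graph()
    g.add_nodes_from(corr.columns)
    cols = list(corr.columns)
    for i, fi in enumerate(cols):
        for fj in cols[i + 1:]:
            if corr.loc[fi, fj] > theta:
                g.add_edge(fi, fj, weight=corr.loc[fi, fj])
    return g
```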
Here is another concept from graphs, called a vertex cover. A vertex cover is, first of all, a set of nodes, and it ensures that every relationship, meaning every edge, has at least one of its endpoints picked; that means the relationship is covered. And you try to do it minimally, because there will be some nodes which cover two relationships; in that case I will pick the one that is present in both. This concept is the minimum vertex cover, but finding it is an NP-hard problem. Just by intuition: if you have a graph like this, with four edges and five nodes, then picking nodes 4 and 2, or 4 and 0, covers all the relationships. So from the cluster I was showing you, if I use a vertex cover and take those features, that will give me a good starting point. It is difficult to get the minimum vertex cover, but we can get a minimal one, so let us not split hairs on that. A very simple algorithm can be: I start with an empty result set; I pick an arbitrary edge, say with endpoints u and v; I add u and v to the result set; and whatever edges were incident on u or v, I remove them from the graph. Then I again pick an arbitrary edge, add its endpoints to my result set and reduce the graph, until no edges remain. This is how I can form a very simple algorithm.
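That arbitrary-edge routine is the classic 2-approximation for vertex cover; here is a sketch:

```python
import networkx as nx

def minimal_vertex_cover(g: nx.Graph) -> set:
    """Pick any edge, add both endpoints to the cover, delete every
    edge incident on them, and repeat until no edges remain."""
    g = g.copy()
    cover = set()
    while g.number_of_edges() > 0:
        u, v = next(iter(g.edges()))       # an arbitrary remaining edge
        cover.update((u, v))
        g.remove_nodes_from((u, v))        # also removes incident edges
    return cover
```

Chained with the FAM builder above, something like minimal_vertex_cover(feature_association_map(df)), together with the features left isolated by the threshold, gives the kind of subset being discussed here.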
We applied this on the dataset I described. The all-features accuracy was 62.1 percent; with this, it came to 65.5 percent, maybe not a great improvement, around 3.5 percent, but with a feature reduction of 65.56 percent. And I will show you some other feature association maps; you get very, very interesting patterns. Figure 3 shows a particularly well-studied dataset called the Madelon dataset. If I am not wrong, it has 500 features; it was given in a competition, and only around 15 features were informative, the others were created from noise. And probably when we use this graph, we are getting the features which are relevant: all those noise features drop away because they are not related to each other. And in this other one there is a color coding: green means there is redundancy but it is moderate, blue means the redundancy is a little higher, and red means the feature has redundancy with more than one feature. That is how we have color coded it. The idea is that even before you start building your model, this is something you can take to your user or customer and show, because a lot of people in the community will just do dimensionality reduction. As I said, one of the advantages of feature selection is that it retains the original feature meaning, so this is where you can engage with your customer and ask: see, here is a set of redundant features; which ones do you want to keep?

We ran this against quite a few datasets; these are some of them, and I will show you our results on both supervised and unsupervised problems. UFAM, or Unified FAM, was our method, and you will see that it achieved the best classification accuracy and also the lowest mean rank. What do we mean by mean rank? It is a standard way of comparing algorithms: you find out how each algorithm ranks on each dataset and then take the mean of the ranks. This is more robust than the average accuracy, because with average accuracy, one dataset where some algorithm gives very high accuracy can compensate for losing out on other datasets. That is why mean rank is a much more robust measure. UFAM actually performed better in all three respects. We discussed mRMR, and we got a better result than that. I am not showing you execution time, but execution-time-wise our methods were also much simpler and faster; there was an order-of-magnitude reduction, so if something took minutes, we were doing it in seconds. And in the unsupervised setting we also got a very good result. For clustering you do not measure by accuracy, you measure by purity, which basically measures how homogeneous the clusters are. There too we achieved the best mean rank, though maybe not the best average purity; average-purity-wise our method was second best. So that was about graph-based feature selection.

Now let us come to the text example we were talking about. I am sure you have heard a lot about text classification in the last two days. Basically you have a document as input and a label as output. Some examples are spam classification, sentiment analysis, topic classification, or determining the gender of the writer; there are a lot of text classification tasks. And the Naive Bayes classifier, you know, works on conditional probability. Now, can anyone tell me why Naive Bayes is called naive? It is because it assumes all features are independent. Here is one empirical study we did on some textual datasets, comparing Naive Bayes with decision trees, support vector machines and kNN, and you can see that Naive Bayes was the poorest, getting the lowest classification accuracy; these are classification accuracy figures. That was the motivation to improve Naive Bayes. Can we do something to improve it? Let me give you a clue: Naive Bayes expects the features to be independent. So can I give it features which are, kind of, independent of each other? That is what we try to do. We cannot ensure complete independence, but we can reduce the dependence. How can we do that? We can simply cluster the features. What is the property of clustering? Within a cluster the features are homogeneous, and between clusters they are heterogeneous. So if I pick some features from each cluster, maybe that will give me more independent words; in our method, of course, the features are the words.

These are some very simple steps. We start by forming the term-document matrix; this is the baseline way of getting structure out of text. Of course you could use topic modelling or word embeddings, the newer methods that have come up, but let us focus on the term-document matrix for now. What we do then is transpose it, because if I cluster the document-term matrix directly, I will be clustering the documents. We do not want that; we want to cluster the words. So we take the transpose and create k clusters (I am not going into how to decide k; there are several ways for that). Then we select the most representative word from each cluster, the one closest to the cluster centre. We applied k-means, and one problem with k-means is that the centre is not itself a word, so we compute which word vector has the minimum distance to the centre, and that word we pick as the selected feature.
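Here is a sketch of that pipeline with scikit-learn. The dense conversion of the term-document matrix is only sensible for modest vocabularies, and this is my reading of the described steps, not the authors' exact code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def representative_words(docs, k):
    """Cluster the words of a corpus (not the documents) and keep the
    word nearest each k-means centre, giving k roughly independent
    features for a Naive Bayes classifier."""
    vec = CountVectorizer()
    tdm = vec.fit_transform(docs)              # documents x terms
    words = vec.get_feature_names_out()
    wv = tdm.T.toarray()                       # transpose: terms x documents
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wv)
    picks = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dist = np.linalg.norm(wv[members] - km.cluster_centers_[c], axis=1)
        picks.append(words[members[np.argmin(dist)]])   # closest to the centre
    return picks
```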
So here are the results. We have now run on much larger text datasets, and you can see there is a remarkable improvement in the classification accuracy we achieved. You might well ask: fine, you have improved Naive Bayes, but is it comparable with other classification algorithms? Here is that comparison, and you can see that apart from the first dataset, on all the other datasets we got better results using our method. Another challenge you might throw at our method: does it do well compared to other feature selection methods? We compared with a forward-search-based wrapper; you already know what a wrapper is, and in this wrapper the forward search greedily picks up one feature at a time. You can see that on all the datasets we achieved better classification accuracy, but that is not the most important point. The most important point is the time taken: just look at the time on the first dataset, where that algorithm took 53 minutes and ours took less than a minute. That is the execution efficiency, if I may call it that. And now coming to the feature reduction we actually achieved: you can see that in some cases there were more than 3000 features and we have come down to three. It gives you a much reduced feature space, so we can almost start realizing the dream we were talking about: draw the data in two dimensions and separate spam from ham. We are almost at that level.

So that is about what I had to discuss. Now, a quick conclusion. I hope you understood why feature selection is important for any machine learning pipeline: with large dimension, the traditional concepts of neighbourhood, searching and finding kernels either become too expensive or break down completely. What you can do is follow your intuition about the model or the data and build your own method; that is what I actually wanted to motivate you to start doing. Here are some other areas we are working on. We are working quite seriously on forecasting of solar and wind energy generation; we are developing some algorithms in quantum machine learning; we are doing some work on aspect-based sentiment analysis; and feature selection being my PhD topic, I continue working on it in different directions, including fuzzy and non-parametric feature selection. And I will end by mentioning ICDMAI, the International Conference on Data Management, Analytics and Innovation, to happen next year in New Delhi, January 17 to 19. The call for papers is up, and I definitely encourage and invite you to submit papers there.
So we can meet again in Delhi in January. And I also acknowledge the Master's Research Club, especially Jhar Mustafi, for encouraging us to submit proposals to ODSC. Okay, so thank you, and any questions?

Question: [largely inaudible; about whether classical feature selection methods can fail on certain datasets]. Answer: Not really, but once in a while you need to make sure there is enough feature interaction; if there is no feature interaction, you see that immediately. We have developed one heuristic and an evaluation strategy that kind of checks whether a dataset is feature-independent or has strong interactions.

Question: [about how the graph-based method accounts for relevance]. Answer: Yes. One of the ways we have done it is that when there are multiple candidate minimal vertex covers, we pick the one with the highest relevance. Also, in another work, we built what we call a feature information map rather than a feature association map, to take care of what you are talking about: there we first remove the features with very low relevance, so the graph is much smaller in size and we are dealing only with relevant features.

Question: For the graph method earlier, you talked about correlation, which only works for continuous values; for other types, are there heuristics? Answer: Yes, for categorical features you can use measures like chi-square. But one problem is that when we are dealing with mixed domains, bringing them to the same scale and comparing is a challenge. So far what we have thought is to deal with them separately and then combine with some formula.

Question: In your techniques, do you treat hyperparameters and features separately? Because both of them interact with each other. How do you solve for the best feature subset and the best hyperparameters at the same time? Answer: That is all about heuristics. Generally we do not invest a lot of our time on hyperparameters, because we are modelling people; we generally refer to the literature and see what works well. Only when we see it is not giving a good result at all do we think about it, and then you can do hyperparameter tuning using a grid search or a more intelligent search; we use things like that. Question: So you are treating them as independent problems? Answer: Yes.

Question: [about measuring feature relevance in the unsupervised setting]. Answer: That is a very good question. One of the ways is looking at the entropy of the features. There is also something called the Laplacian score, which looks at the neighbourhood of the data points and derives a score for each feature from that: a feature that preserves the local structure well is a good feature, something of that sort. And we have also developed our own method based on principal component analysis: we convert the unsupervised problem into a supervised one by treating the principal components as the target, using them as a proxy, and that has also been investigated.