Good morning, everybody. My talk is going to be about feature selection; the title is up there on the screen. I'm going to do a few things with you today. I'll briefly introduce myself, because I'm not rude, at least not too much. I'm going to tell you a story that I hope you will like, and that I think some of you might relate to. And then I'm going to show you some code and some results, some numbers. I know you're all wondering: could I be any more vague? I could have been, but this will be enough for today.

First things first, a little bit about myself. There are a few things going on in my life. I am a physicist who translated into a data scientist for some reason. I work with many different types of technologies that I'm very passionate about; they wouldn't all have fit on the slide, so I just added the first few. And I'm a ham radio operator. Is anybody here one? One, two, three, four. Great, I'm not alone, because I usually am. All of this can be summed up in the fact that I'm a generic nerd, so some of you might relate to that as well.

I also want to tell you a couple of things about the company I work for. It's called Optum; it's right there in the bottom right-hand corner. Has anybody here heard of Optum? No one? One hand. Oh yeah, of course it's you. Sorry guys, he's a colleague of mine; he sits about two meters from me. Anyway, let's try this again, since no one has ever heard of Optum: have you ever heard of UnitedHealth Group? No one. Okay, good start. No one ever has, by the way, not in Europe at least. As Optum, we're part of UnitedHealth Group, a very big American corporation that works around health care and health services. It's actually quite a big company; we're talking about almost 300,000 people working for it. This slide is actually a bit old, because it's from last year: it says we ranked sixth on the Fortune 500 list.
That's not true anymore, because we're fifth now. Yay, go us. It's actually a pretty interesting company, and if anybody is interested in working with hardcore data science and cutting-edge health care, we are hiring worldwide, so have a look: there are a couple of careers websites at the bottom of the slide.

Anyway, back to us. The main purpose, as I mentioned earlier, is for me to tell you a story, and like many stories, like many fairy tales, this one starts with a brave white knight in shining armor. By the way, that white knight is basically all of us; it's just a metaphor. And like any good story, this one also needs a villain, because this knight in shining armor has to defeat and slay a terrible dragon. You might be wondering: okay, we're all data scientists, we don't usually fight dragons. Well, again, it's a metaphorical dragon, and the way it usually comes into existence is this; I think some of you might relate to the situation. It's basically the evil overlords atop their ivory tower going: we have all of this data, let's just do something with it, even if they have no idea what.

So what happens at this point is that you get handed your own dragon, and it usually comes in the form of a very big dataset you know nothing about and are supposed to do something with. And yes, it is that vague sometimes; this happened to me not too long ago. It might look something like this. You start a new project, and this actually happened: you get handed a big dataset, you know nothing about it, you just know that somebody processed it in some way, and you have to do something with it. So you start by having a look at it, poking around, seeing what's in there, and a few things like this might happen. You realize that the dataset is huge and you have no idea what's in it. It has over 800 features.
What are they? So you decide to have a look: what are these? And I kid you not, this happened, and there were hundreds of features like that. I had no idea what they were, and there was no data dictionary, of course; there never is. And it goes on. There were about 200 features with reasonable names, where you could just about make out what they were, and then there were about 600 of these. What was that saying? Garbage in, garbage out, something like that. I mean, it's just too much, and I'd been feeling like that for weeks while working on that project. Can anybody relate to that? Show of hands? Yeah, of course.

The sad reality, or actually the happy reality if you think about it, is that of all that information we usually need only a very tiny portion to reach our goals, to accomplish our task. I'm not saying that everything else is definitely useless for what we're trying to do, but it's probably just adding something on top of what we need: a plus, but not something we strictly need. And of course, when you're working with such a big amount of data, such a big amount of information, many problems can arise. Like the fact that you have no idea what's in there. Or the fact that, even if you do have some idea of what you're trying to do and how, training time might increase, sometimes exponentially, with the number of features, and you don't know which features are relevant and which are garbage. Hardware requirements might increase, because sometimes you just need to load a good chunk of the data into memory, and then you have memory constraints or CPU constraints; or, if you need to go for deep learning, you might have big constraints regarding GPU availability, which are not always satisfiable. For example, in our company we work with very sensitive data.
So we are not allowed to use AWS: if it's not in our server farm, you can't have it. That's the problem. There are also problems like decreased performance and overfitting, because again, you don't know which features are noise and which are relevant for your problem. And there's the risk of leaks, because some information about the target might actually leak into some of your features. You might be thinking: yeah, sure, but it won't happen to me, I'm really careful when I do my feature engineering. Sadly, it can happen to you, and it happened to me not too long ago. I was working on a classification project, and I had spent a couple of days sifting through the data. I could understand most of it; I had a fair idea of what most of those features were. I did some clever feature engineering tricks, and then I trained my first model, just to have a baseline, and I got an incredibly high accuracy: 99.9%. Has that happened to you? You train the first model, sky-high accuracy, and of course you're like, woohoo, great, problem solved. You know, for the first two seconds or so, because then you're like: yeah, this doesn't sound right, this doesn't look right, there's something wrong here. And it turns out that there was something wrong: there was some small leakage from the target into one variable, and once I realized that and fixed it, accuracy went back to a normal value, something I would have expected. This is the insidious part: a 99.9% accuracy screams leak, but an accuracy caused by a leak in the order of 85 to 89 percent would not have raised any alarm.
I wouldn't have been that suspicious, but the results would still have been wrong. So you should always be careful about these things.

Anyway, getting back to what we're here to talk about: feature selection. We can't really work with a very big dataset here, and we can't really expect to tackle a big problem, because this is just a talk; we don't have that much time, and I'm pretty sure you don't really have that much desire to do that either. So let's restrict the scope of the problem: let's work with a simple dataset that has a target, so we have a classification problem, and let's see what we can do about it. Full disclosure: I'm using a small public dataset that probably most of you have worked with. I'm not going to tell you which one it is until the end of the talk, but some of you might guess. Where is my cursor? I need my cursor. Ah, this is what the dataset looks like: we have 500-odd rows, 30 features, and these are the feature names, x00 through x29. So we have no idea what they are, and we have to deal with it.

This is a problem, because which features do we choose, and how do we choose them? It's of critical importance, because different problems demand different types of solutions, and we have to be really mindful of how we choose our metrics, because sometimes we use techniques or metrics that are just not relevant, or plain wrong, for our type of problem. Here's an example of that. Are you familiar with Anscombe's quartet? Are you familiar with using correlation for choosing good features? A lot of times we just use correlation to see if a feature is correlated with the target: if it is, okay, it's important, let's keep it; if it's not, maybe let's drop it. But this can be a bit deceiving, no?
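A tiny illustration of why correlation alone can deceive (my own example, not from the slides): a feature can determine the target completely and still show zero linear correlation, because Pearson's r only sees linear relationships.

```python
import numpy as np

# y is fully determined by x, yet Pearson correlation sees nothing,
# because the relationship is symmetric rather than linear.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.6f}")  # essentially zero
```

A correlation-based filter would happily throw x away here, even though it predicts y perfectly.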
Anscombe's quartet is four famous examples of datasets with the same summary statistics: the four pairs of x's and y's have the same mean, variance, standard deviation, and regression coefficient. And that's a problem, because in some of those cases the correlation reflects a real linear relationship, and in the other cases, yeah, not so much. So there is no technique that fits all problems, and you might be familiar with the no-free-lunch theorem; sometimes lunches are actually very expensive, not just not free. Whenever you're working with these kinds of tools for feature selection, always maintain a healthy dose of skepticism. That's never wrong.

But then how can we perform feature selection? Well, there are three main classes of algorithms that can do it for us, namely filter methods, wrapper methods, and embedded methods, and I'm going to tell you a little more about each in the next few slides. Filter methods are the simplest form. They usually perform univariate analysis between the features and the target, or on the features themselves. A clear example would be a variance threshold: you drop all features that have less than some amount of variance. This happens all the time to me: I always get handed datasets which contain at least 10 percent of features that just don't change, or that are all missing. They're clearly not really useful, at least not the way they are.
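As a concrete sketch of the simplest filter method mentioned here, scikit-learn's `VarianceThreshold` drops exactly those dead columns (toy data, not the talk's dataset):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the middle column is constant, the kind of dead feature
# that turns up in real datasets all the time.
X = np.array([[1.0, 7.0, 0.1],
              [2.0, 7.0, 0.9],
              [3.0, 7.0, 0.4],
              [4.0, 7.0, 0.6]])

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance columns
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # which columns survived
print(X_reduced.shape)
```

Note that with a threshold above zero, the features should be on comparable scales for the cutoff to be meaningful, which is the caveat the talk raises later.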
Some examples of this would be techniques like the t-test and ANOVA, Granger causality if you're working with time series, LDA, things like that.

Then there are wrapper methods. These are a bit more sophisticated, and the way they work is that they train models on subsets of features and look for which features are useful: for each subset of features you train a model, usually classification or regression, and you determine which models perform best. So this is a bit more sophisticated than the first type of algorithm, but it's a bit prone to overfitting, because you don't really have fine control over what the models do. And they have a big problem: you have to train a lot of models. If you have a small dataset, these are fine; if you have a big dataset, probably not the best idea.

Then, as I mentioned, there are embedded methods. These try to take the best of both worlds, of filter and wrapper methods. They perform a classification or regression, but they also have their own internal way of performing feature selection. A few examples would be decision trees or random forests, which basically use their own internal representation of the features to determine the best splits, and so to determine which features are useful. These are quite good: they're less prone to overfitting, they usually yield good results, and they usually perform their search using cross-validation.

All right, let's close this small parenthesis and try to remember why we're here. Remember the knight in shining armor? Come on, we have a dragon to defeat. We have to slay the dragon, and the way we're going to fight it is with code. I'm going to show you an example, let's say, on the dataset.
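Before the talk's own code, a quick sketch of the wrapper/embedded distinction just described, on synthetic data (my own illustration): recursive feature elimination as a wrapper, which repeatedly fits a model and prunes the weakest feature, and a random forest's built-in importances as an embedded method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Wrapper: RFE refits the model many times, dropping the weakest feature
# each round -- accurate but expensive on big data, as the talk notes.
wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("RFE kept:", sorted(wrapper.get_support(indices=True)))

# Embedded: a random forest ranks features as a side effect of training once.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("forest top 3:", sorted(forest.feature_importances_.argsort()[-3:]))
```

The two rankings don't have to agree, which is precisely why an ensemble of selectors, like the one shown next, can be useful.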
The one I showed you earlier. We're going to perform a small feature selection exercise using an embedded... sorry, an ensemble system, because we want to make things a little more spicy, more complicated; otherwise anybody would be able to do it. A word of advice: this code is formatted for slides and presentation, so don't just take it, use it, and expect correct results. I'm omitting a few things. I actually have part of this code embedded in a personal library, but there's a lot I'm leaving out here, like validation, error handling, and so on. Take inspiration by all means, but I would advise against using it as it is.

My requirements for this kind of library were these: it didn't need to be a free-for-all algorithm, there had to be some guidance as to what it was supposed to do, and it had to be extensible, so you should be able to just add new selectors to the ensemble algorithm and expect it to work. And there are a few ways we can combine the results at the end, because we're using different feature selection techniques and then merging their results; we could add a weighting mechanism, for example, if you want the results from a specific algorithm to be treated as more important than the others. That's not in here, but that was the idea.

So let's have a look at the code, and let's start simple: we just need to import the things we need. These would be the building blocks of the class and of the selector, and we're going to use just five techniques for this one, because it's enough for a demo. So we're going to use
So we're going to use Random forests decision trees and various threshold Flavors for classification and regression because we might want to work in both worlds and they do work in a slightly different way So it's a good idea to have them separate and then we set up our class Based on the type of problem we want to tackle So we might have a regression problem a classification problem or a generic problem like when we don't have any targets And we have to still sort out all the garbage because we don't want to deal with that in that case we Are probably better to you We kind of doomed to use filter methods and see what features are just not relevant. I in this case I'm just using a variance select variance threshold and Which has to be used kind of carefully because that has meaning all if if all the features are in the same scale Especially if you're using a hard core hard sets threshold But this library takes care of it in the in the background. So We're all good for that Then we have to initialize our class We need to know how many variables how many features we want to get from each of those models We want to know We want to set a few thresholds like the actual library that I'm using actually is actually able to compare Train results test results compare the scores and see if they differ by a certain amount of score And if that happens it raises a warning like if you get 90% accuracy in your training set and 20% accuracy on a test set That's not a good sign And you should really review what you're doing But again, we don't want to block results. We just want to Be notified of that Then we need to select all the models based on the analysis type like in this case The analysis type keyword would identify regression classification other or Two of them anyway and based on that we select the models from those that are in our class We instantiate them and we put them somewhere safe In this case or somewhere safe is just alive estimators. 
So, those are the ones that were chosen. By the way, if you have any questions about the code or about anything else, just raise your hand and ask. No? Okay.

We have our set of models in our internal object, and now we need to fit them on the data. This is actually quite easy: we just go over all of them and fit each one. At this point we have a set of models, or techniques, that are fitted, that know about our data, and we can extract results from them: we can see what they consider important in the dataset. The way it can be done, or at least the way I did it (there are many other ways), is to pull the relevant features from the model itself. Some models have an attribute called feature_importances_, some have variances, some have scores, and so on. For the models I was dealing with, these were the ones I needed, and I didn't have to add anything else. I get the score, or the importance, for each feature from each model, and I combine them; as you can see in the last row at the bottom, I only keep the first n features from each model.
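The attribute dispatch described here can be sketched like this. The attribute names are scikit-learn conventions; the function names and toy data are my own illustration, not the speaker's library:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

def feature_scores(model):
    """Pull a per-feature score out of a fitted estimator.

    Which attribute exists depends on the estimator, so we try the
    common scikit-learn names in turn.
    """
    for attr in ("feature_importances_", "variances_", "scores_"):
        if hasattr(model, attr):
            return np.asarray(getattr(model, attr))
    raise AttributeError(f"no known score attribute on {model!r}")

def top_n(model, n):
    """Indices of the n highest-scoring features according to one model."""
    return set(feature_scores(model).argsort()[-n:])

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
vt = VarianceThreshold().fit(X)

print("forest top 3:  ", top_n(forest, 3))
print("variance top 3:", top_n(vt, 3))
```

The same `top_n` call works for both estimators even though one exposes importances and the other exposes variances, which is the whole trick.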
This is specified at initialization. So I only want, for example, five features from each model. You will end up with more than five once you combine them, because there's no guarantee that all the results will be exactly the same set; actually, that's quite a rare occurrence. Then we have to combine the results, and the way we do it is by having the models cast votes. I have a list of models with their results, the features they consider important with their scores, and I count how many times each feature has been identified by one of those models: features that get more votes, that have been chosen by more models, get a higher value. This is where a good weighting system might come into play, because again, some models might not be that important: for a classification problem, a variance threshold is probably not as good at identifying good variables as a decision tree, for example. Maybe; it depends on your data and on your problem. But this is the basic implementation; we don't really need anything more than that, and at this point we're good to go.

So let's see if it works. Remember, our dataset was those 500-odd rows and 30 columns, with those column names. We instantiate our ensemble feature selector (it's really hard to see my cursor there); we're treating this as a generic classification problem, so we're going to use all the classifiers and all the other techniques. We set the names of the variables, so that we have something more explicit than just indices, and we want to get five features from each model. Then we fit on the dataset and the target.
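The vote-counting step can be sketched with a plain `Counter`. The selector names and the picked indices below are made up for illustration; conveniently, their union also comes out to nine features, like the run in the talk:

```python
from collections import Counter

# Suppose each selector returned its top-5 feature indices (toy results).
picks = {
    "random_forest":      {0, 3, 7, 12, 21},
    "decision_tree":      {0, 3, 9, 12, 25},
    "variance_threshold": {0, 5, 7, 12, 28},
}

votes = Counter()
for selector, features in picks.items():
    votes.update(features)  # one vote per feature per selector

# Features chosen by more selectors rank higher; a weighting scheme could
# scale each selector's contribution here instead of counting them equally.
for feature, count in votes.most_common():
    print(f"x{feature:02d}: {count} vote(s)")
```

Even with five picks per selector, the combined set is larger than five because the selectors rarely agree perfectly, which is exactly what the talk observes next.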
And we get the votes. What it looks like is this: nine features were selected, these here, and these are the numbers of votes they got. You can actually go inside the library, inside the object, and examine the scores: if you want to know how important the single features were according to each model, here are the individual scores, five features per model. But since each model selects five features, you end up with more than five once they're combined; we have nine.

All right. So, are you satisfied? It works. You shouldn't be. Okay, it works, but does it work? I mean, is this a correct result? Is this a useful result? We don't know. So what I did is I trained the same type of algorithm, a logistic regression (this was a binary classification), first using the whole dataset and then using just the subset my ensemble had identified, and I collected the results. What would you expect? Better, same, worse? By how much? It doesn't have to be a precise number. No takers? I'm sorry, I'll be a disappointment to you guys. Anyway, I'm using the same seed for both runs, and the model score using nine features is actually a tad better than the model using all the features. But both those scores are really high, so I couldn't expect to get a 50% increase. Full disclosure: I ran this a few times, and it's not always better, because there's a random component when using different seeds. But all the results I could find were compatible with each other, within two or three percent. Sometimes it's better, sometimes it's worse; I think the worst score I've seen, using eight features that time, was in the range of 94, something like that.
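A reproducible stand-in for this experiment (synthetic data, not the actual dataset from the talk): train a logistic regression on all 30 features and on a forest-selected subset of nine, and compare cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the talk's data: ~500 rows, 30 features.
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# Keep the 9 features a forest ranks highest, mimicking the ensemble's output.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
keep = forest.feature_importances_.argsort()[-9:]

model = LogisticRegression(max_iter=2000)
full = cross_val_score(model, X, y, cv=5).mean()
reduced = cross_val_score(model, X[:, keep], y, cv=5).mean()
print(f"all 30 features: {full:.3f}")
print(f"9 selected:      {reduced:.3f}")  # typically within a few percent
```

As in the talk, the two scores tend to be comparable, with either one coming out slightly ahead depending on the seed.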
So it's good because...

Yes? But why? Sorry, I don't follow. Maybe find me later and we can discuss it.

Anyway, as I was saying, these results are always comparable, and it strongly depends on how the features are selected. And this is a small example: we're working with 500-something rows, fewer than 600, and 30 features. But imagine that you could safely drop two-thirds of your features and get comparable results, and imagine that your dataset is something big, like millions of rows and thousands of features. There would be a huge speed-up in your training, a huge speed-up in your work. I've used this in a couple of projects and it actually worked. How well? Well enough: I got results comparable to the ones I got without the feature selection.

To sum up, the main points are these. You should always take care to know what's inside your data; you should always spend time getting to know it intimately. If you're in a situation like the one I showed you earlier, then good luck, you have all my sympathy. You should always take care to select the features that are relevant for your problem, because you might have features that are really important for something else but that you don't need here. Feature selection essentially simplifies the models, because it takes the garbage, the stuff you're not interested in, out of the picture. And if you have a very specific problem, you can build your own feature selector, because you know what metric you're looking for, and you can stick it into something like the ensemble I've shown you and give it a very high weight. That would help a lot; I've done it a couple of times and it really helped me. Maybe it will help you too.
Maybe it helped you too um Feature selection increases generalization because again you get rid of noise and If you're working especially with linear models, you don't want to have too much noise in there Especially if you have highly collinear variables because that will throw all your coefficients off the roof And you won't be able to use them directly and that's a big problem Sometimes it helps you avoid the curse of dimensionality because you know as you know the higher the number of dimensions the Less meaningful distances become and that's essentially what the curse of dimensionality is and therefore it's hard to compare objects And records between one another And in general it removes noise and simplifies everything and again If you have to choose between something complex and clever and something simple and clever You should definitely choose the simple and clever and That would be it for me. I'm open for questions now Thanks very much. Are there any questions here? Thank you for your talk. I'm also very interested in this kind of automated machine learning Where you don't know have any clue about features. It's a it's an interesting sport But have you looked at papers Like the data science machine or h2o that kind of tried to do this Automatic machine learning I have and personally I prefer to work on features myself because at least they get to know the data I mean worst case scenario you get to know what's in there. Even if I don't really know what the feature represents um, I don't trust Automatic feature selection completely when it comes from the outside because I want to get I want to know what's going on under the hood But that's one of my idiosyncrasies. So There's nothing wrong with that Hi, thanks for the talks Do you have any particular class of model in which you think that feature selection is really important As compared to other ones. 
I don't know linear models and compared to random forests So some models do perform feature selection Within the implementation itself like if you're using random forest you are performing feature selections before Actually running the classification That being said Some models are very sensitive to feature some models not so much, but I would say that Going through the effort of doing that is a worthwhile Investment of time because you get to know the data and that's paramount in my opinion Did they agree but do you think there are some particular class of models in which is Incredibly important to do feature selection Depending on what you're trying to do definitely for general linear model Models that's what that's one. I I always am I'm very afraid to touch without having a fair understanding of the data Any more questions? They're making you travel today So did you try regularization like the rich? in Psyched learning because that's usually quite okay with dealing with like a lot of variables Yeah, that's another technique actually that kind of falls under the umbrella of Being one embedded methods bridge and last as well because you can use the coefficients that you get As a measure of how the the the variables are related to the to the target and yeah, that's I've done that And it's a good technique, but again There's no freelance so it might work very well on some type of data and be completely off on other And which was the data set because I have my own pipeline so I want to try right all right So what do you think that data set was? It's a famous one No, I work in healthcare Diabetes there you go Anyone else? 
Thanks for the talk, very interesting. So, in machine learning, if I preprocess my data, I split it first and then apply the same transformations from the training set to the test set. Do you think the same should be done for feature selection as well: that you select your features only on the training set, so that leakage does not occur?

That's what I usually do. When I perform feature selection, I usually work just on my training set, and I usually have not just a test set but also a validation set that I put safely away long before I start working. That being said, you can work around this and maybe have a look at different permutations of the dataset; I do that sometimes if I want to test the stability of my selection methods, but that only works with data that is not time-dependent.

It happens very often that not just a selection of features but also a selection of rows might help, because essentially feature selection is eliminating features that are duplicates or near-duplicates. You could also eliminate rows that are near-duplicates, so you would essentially have a smaller dataset to learn your model on, without the noise. Is that something you look at as well?

It depends. Removing exact duplicates is always something you should take care of, because they don't really add extra information, unless you're trying to oversample for imbalanced classes. Removing almost-duplicates or near-duplicates, though.
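On the first answer above, selecting features on the training data only: scikit-learn's `Pipeline` automates exactly this. Put the selector inside the pipeline, and cross-validation refits it on each training fold, so the held-out fold never influences the selection. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=25, random_state=0)

# Because the selector is a pipeline step, it is fitted fold by fold,
# only ever seeing the training portion of each split.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=8)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"leak-free CV accuracy: {scores.mean():.3f}")
```

Running the selector on the full dataset before splitting, by contrast, is the subtle leak the questioner is asking about.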
That's a bit more complex, because they might actually add information, and you just don't realize it by looking at a record. Some algorithms are more sensitive to this than others. If you were using deep learning and had a very deep neural network, I would probably leave all the near-duplicates in, because they might carry nuances you don't really see just by looking at them; they might have some hidden dependencies. But if you have 500 records and you're using a decision tree, you can safely remove duplicates.

Just to continue what I asked before: what if you have many features but very limited data? Like, you have 800 features but only 40 rows. What kind of feature selection can you do with that?

That's a very tough problem. I can't say I've had to work in situations like that, so I don't know what I would do right now. But personally, I would definitely have a look at the variance first, because with very few rows and a lot of columns there's bound to be, hopefully, a good chunk of features that don't really vary much, and I would probably start by eliminating those. After that, you kind of have to play it by ear.

Okay, so thank you for your talk. I just want to continue the question about regularization. Did you compare this with traditional lasso, in your findings? And did you use regularization, like lasso regularization, in your linear models after doing this initial feature selection?

So, there are two things I want to say about this.
I have used lasso and ridge for feature selection in techniques like this, in my work, in my job. I haven't included them in this example because of time, and because it would have added an actual layer of complexity: their interface for getting the coefficients is slightly different, and I didn't really want to tackle that during the talk. But it's definitely something you should investigate if you're interested, because it works. Sometimes.

But do you happen to have a comparison? I mean, did you run a comparison between having this voting system versus just using lasso, to say, okay, this voting system works better? Or is it just that you wanted to have more control?

No, I haven't run a comparison strictly like that. I have used results from lasso within algorithms like this, so I didn't really need to compare those results with the final one.

Thanks. Anyone else? Wave if you have a question. No? All right. Okay, well, can we thank our speaker again? Thank you all.