So, hi everybody. Today I will talk about a specific topic that is rather unofficial in data science, called feature engineering. There is no discipline behind it that you can learn at university or find a course on; it is just considered to be something you should be able to do, but it's not always that simple. By the end of the presentation you will also get what I mean by saying "data is not flat".

The structure of this presentation will be as follows. I will briefly mention what feature engineering is and what it is not, and I will concentrate on one certain kind of feature engineering; there are several types. Then I will give you a classification example, a really silly project of mine, and you will probably like it, at least I hope so. In the conclusion I will again tell you how to use feature engineering, what you can do almost automatically, and what actually helps and works in most cases. Afterwards I hope we will still have time for a Q&A session.

So what is feature engineering? According to Wikipedia, and it is a pretty accurate definition, it is the process of using domain knowledge of the data to create features that make machine learning algorithms work. What does that mean? You have probably all heard of the concept "garbage in, garbage out": if you have garbage data that doesn't even reflect the situation you're trying to model, you can try whatever model you can come up with and it still won't produce any good results. So feature engineering is the art of preparing data in such a way that any algorithm can perform better.

As seen in the definition, feature engineering builds on expert domain knowledge. By that I mean that someone who drives a taxi probably knows a bit more about the whole industry and how things work there than some college student. The process can be both automated and manual. I won't talk about the automated process; that is more of a machine learning approach covering feature learning, for instance as part of knowledge transfer. There are also a lot of Python tools where, if you have a lot of tables, you can concatenate them in some way and run automatic functions that create new features as some sort of transformation of the old columns. I won't talk about that either; I will talk explicitly about manual creation. However, all of these approaches are quite expensive: you need time to identify what you're working with and which features might be useful, and sometimes it is important to first take a look at your model, and sometimes that doesn't mean anything at all. As "garbage in, garbage out" states, data quality is probably far more important than the actual model.

Feature engineering is not part of the data engineering process. Data engineers are the people who collect the data initially and do some initial pre-cleaning, things like taking completely insane samples out and somewhat flattening the data if necessary. They make the data available, but they have no idea what the data will actually be used for, and thus they are not responsible for specific domain knowledge. It is also not part of pre-processing, even when that is done by data scientists. Data science is an iterative process: you create a model, you feed some data in, and if you don't even know how your model will behave, you just watch what happens; then you start tweaking parameters on the data side and on the model side, and this iteration repeats itself many times. Feature engineering comes into question when your model does not perform well at all, and even additional models do not perform well, and you have a sense that the data is at fault.

As I said, there are pretty much two main approaches, manual and automated. For the automated one there are tons of articles: just type in "feature selection", "automatic feature engineering" or "feature creation Python" and you will get all the possible blog posts and GitHub repositories; just play around with them, they are pretty powerful. But the thing with those approaches is that you need quite a lot of data, or high-dimensional data; that is where they work best, and basically you try to reduce the amount of noise or useless information you feed into your model. The second approach is used when you don't have enough data, meaning either very few samples, or very low dimensionality, or both. The case I want to show you today is one where I don't have data; not none at all, but about a thousand samples, which is nothing, and at the same time it is essentially one dimension.

So what can you do with this? Again, I quite like the Wikipedia article on feature engineering; it has a lot of resources and links to further sources to read, and it describes a process for how to approach feature engineering: brainstorming and testing features, deciding what features to create, creating the features, checking how the features work with the model, and improving them if needed. Improving means you probably have to scale them, or, if it's a categorical feature, you need to encode the categories in a way the model will actually understand, so you don't end up with a skewed distribution; a minimal sketch of such an encoding follows below. So you need to really know your data. And then repeat it, and repeat it, and repeat it. This is the process we're all going to do today.
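For instance, here is what encoding a categorical feature could look like in plain NumPy, the stack used later in this talk. This is only an illustration; the feature and its categories are made up, and the talk itself doesn't prescribe one-hot encoding:

```python
import numpy as np

# Hypothetical categorical feature: day of week as strings.
days = np.array(["mon", "tue", "sat", "sun", "mon"])

# Map each category to an index, then one-hot encode, so the model sees
# membership flags instead of an arbitrary (and misleading) numeric order.
categories = sorted(set(days))                    # ['mon', 'sat', 'sun', 'tue']
index = {c: i for i, c in enumerate(categories)}
one_hot = np.eye(len(categories))[[index[d] for d in days]]

print(one_hot)  # one row per sample, exactly one 1 per row
```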
Now for the example. As I said, it's a free-time project of mine. I have a friend; he is a computer science student, and he has very weird sleeping cycles. You know those typical computer science people who go to bed at, in the best case, seven a.m. and wake up at eight. He's quite a big part of our circle of friends, and everybody is always interested in when he is awake; on Telegram everybody keeps asking "is he awake, is he awake?". So, as part of the running gag, you could say I was given a dare: write a model that actually tries to learn his sleeping patterns, so that instead of asking and waiting for a response, we get an approximate answer right away.

For that reason, again, it was a dare project, so I wrote yet another wrapper around fully connected neural networks. I did it because I needed it to be configurable my way; it's nothing fancy, just NumPy with some extra features. This NN generator is available on PyPI if you want to play around with it. The feature I needed was that the whole model should be configurable with JSON, so that I could run several slightly different models simultaneously, or at least have a clear picture of where I came from and where I am now. You can just give it an architecture, which can be flat. Currently only a sigmoid output, meaning binary classification, is implemented; that is the last layer, and before it you can have as many layers as you want, with any number of nodes in them. You have activations for each layer, and something like a confidence threshold: you don't get the zero or one right away, you get a probability, and you need to interpret it, so you can set the threshold to whatever you want. You can have different learning rates, pretty much all the hyperparameters.
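The exact schema isn't shown in the talk, so here is a hypothetical sketch of what such a JSON config might look like. All field names are assumptions for illustration, not the package's real API:

```python
import json

# Hypothetical config; the real schema of the package may differ.
config = json.loads("""
{
    "architecture": [16, 8, 1],
    "activations": ["relu", "relu", "sigmoid"],
    "confidence": 0.5,
    "learning_rate": 0.01,
    "epochs": 5000
}
""")

# The last layer is a single sigmoid unit (binary classification); the
# "confidence" field is the probability threshold for predicting "awake".
print(config["architecture"], config["confidence"])
```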
Originally I was given just a thousand samples. These were timestamps, plus the state of whether the guy was awake or not, at a step of a certain number of minutes. That is pretty much what we're going to use. We're going to create an input structure suitable for building the model, create and train the model, and just see what the prediction and the accuracy are in comparison to the test set. I don't have a validation set, because it's just a thousand samples, and for the purpose of this presentation it's not needed.

So originally I thought: what will happen if I just throw pretty much a time series in and see what happens? Well, nothing good. The model performed at 30% accuracy, which is way worse than even a coin flip; I might as well just wait and see whether the guy is awake or not. So I thought: maybe it's not really a prediction problem, it's not regression, it might be a classification problem. In that case I obviously lack data, but the good thing is that I have time data, and time has certain attributes to it. You can describe a certain moment not even by the date, but by what day of the week it is. For us humans that is important, because from Monday to Friday we tend to work, meaning we have a certain behavioral pattern, and on the weekend our behavioral pattern might not end, but it changes, and this difference matters. It is also important whether it's late in the evening, early in the morning, or lunch time. So you can see where I'm going: I tried to unwrap the timestamp into a vector containing these features describing each timestamp, roughly as in the sketch below. Basically I went through brainstorming and testing features, deciding what features to create, the day and the hour, then I created them and checked how they work. By the way, I'm using a two-layered fully connected neural network; this proved to be the final best model, because after I did some feature engineering I also played with different architectures, and this one had the best performance.
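The original file format isn't shown, so assume plain Unix timestamps; a minimal sketch of unwrapping them into day-of-week and hour features could look like this:

```python
from datetime import datetime, timezone

import numpy as np

def unwrap(ts: float) -> np.ndarray:
    """Turn a raw Unix timestamp into behavioral time features."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return np.array([dt.weekday(), dt.hour])  # day of week (0 = Monday), hour

# One row per sample; y would be the awake/asleep labels from the file.
X = np.array([unwrap(t) for t in (1554105600.0, 1554148800.0)])
print(X)
```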
So, taking it from the very beginning: we run this with two features, the day of the week and the hour, and the model suddenly jumps to 81% accuracy, which is really not too shabby. That made me think: what else do I know about this guy? I know that he's a student and that he's a human. I know that he also lives in Hamburg, where I come from, and like any living being in Hamburg we are affected by the weather, which tends to be a little better in summer and is completely disastrous in late winter and early spring. So the season might also be important. So I said, okay, let's throw the season in as a third attribute, and now we have a three-dimensional vector as input data. Now the model performed 8% better than the previous one; the model is the same, the date is just unwrapped further. So 89% is already way better than what I started with, but at the same time, 95% and above is what is considered production ready, so I wanted to go further; I wanted to push the limits.

So what else do I know about this guy? He's a student. Isn't that important? Well, actually yes, because he has an exam phase, he has a lecture phase, and he has vacation. I know myself, and I tended to behave completely differently during exams and during vacation; I slept the whole day through, the whole life through, so why should it be different for him? This is a really artificial feature, because it is static information taken from the website of the university where he and I are both students. But you probably cannot omit such information if you know how to collect it, or if you trust the sources that provide it. Anyhow, this is more or less static, and now we have the four-feature model, which reached 95% accuracy. For me that was it: 95% is industry ready, I'm done. But if I take a step back and look at it, all I did was try to reconstruct behavioral attributes, or rather the time attributes of this guy's behavior. I started with only the timestamp, and I ended up with pretty good results, only by feature engineering.
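Here is a sketch of what those two extra features might look like. The season cut-offs and the university dates are invented for illustration; in the talk, the real phases came from the university website:

```python
from datetime import datetime

def season(dt: datetime) -> int:
    """Meteorological season from the month: 0 winter ... 3 autumn."""
    return {12: 0, 1: 0, 2: 0,   # winter
            3: 1, 4: 1, 5: 1,    # spring
            6: 2, 7: 2, 8: 2}.get(dt.month, 3)

# Static knowledge that would come from the university calendar;
# these date ranges are made up for the example.
EXAM_PHASES = [(datetime(2019, 2, 1), datetime(2019, 3, 15))]
VACATION_MONTHS = {3, 8, 9}

def university_phase(dt: datetime) -> int:
    """0 = lectures, 1 = vacation, 2 = exams."""
    if any(start <= dt <= end for start, end in EXAM_PHASES):
        return 2
    return 1 if dt.month in VACATION_MONTHS else 0
```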
You can just believe me, or you can obviously try it out yourself, but tweaking the model parameters and trying different models didn't perform well at all: I never got better than 45% accuracy on timestamps alone. I tried LSTMs; they didn't perform. I tried some models from classical econometrics like ARIMA, but that is not exactly the problem I was approaching.

So, let's check whether this guy is awake now. No, he isn't; it's about three a.m. for him, so he will probably be awake in two or three hours. Ping me on Telegram or wherever if you're interested in whether he is really awake.

This is pretty much what I wanted to show you: it is very simple, yet very powerful. All I used was myself as an expert: I know this guy, so I pretty much simulated my own train of thought, what I know about him and what information I use, and as a programmer I then tried to implement or collect that data. And that's everything, based only on a timestamp. Imagine what you can unwrap if you have more data. Another case I can think of: there is a competition on Kaggle about taxis, where you have to compute the waiting time for a taxi based on three or four years of taxi data from a country like Mexico or Ecuador. The normal data you can see in the files they provide is geo-coordinates of pickup and drop-off, timestamps of pickup and drop-off, the distance the cab actually drove, how long the whole ride lasted, and things like that. But you can throw in more. For instance, you can again automatically unwrap the timestamps into a day and an hour, and based on the day and the hour you can decide whether it's rush hour; you know that in rush hour cars tend to have longer driving and waiting times than normal, because the roads are full. You can also define something like national holidays. This is where real expert domain knowledge comes in, and this is purely manual; I cannot really imagine anything coming close to this train of thought and this performance using an automatic approach, because the information is simply not there in the raw columns, at least it looks that way. A sketch of such a rush-hour feature follows below.
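For instance, in pandas; the column name and the rush-hour windows here are assumptions, not the actual Kaggle schema:

```python
import pandas as pd

# Hypothetical column name; the real competition files will differ.
df = pd.DataFrame({"pickup_datetime": pd.to_datetime(
    ["2019-04-01 08:30", "2019-04-01 14:00", "2019-04-01 18:15"])})

df["weekday"] = df["pickup_datetime"].dt.weekday
df["hour"] = df["pickup_datetime"].dt.hour
# Rush hour: weekday commute windows (a guess; tune against the actual city).
df["rush_hour"] = ((df["weekday"] < 5)
                   & df["hour"].isin([7, 8, 9, 16, 17, 18])).astype(int)
print(df)
```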
Feature engineering is costly. It took me about three days: one day to build the model, and one day just thinking, okay, what can I do? I have only this much data, and only this data; what should I do? What helped me was, again, trying to reconstruct my own thoughts. Even the choice of fully connected neural networks is a reconstruction of how my own head works. Feature importance also played a role there; the layers reflect, for instance, which personal attributes I value more. I know that exams are not that important for this guy, but the day of the week is way more important, and you could see that: after adding just the day and the hour we gained about fifty percentage points of accuracy, and by adding only the season we got another five to seven percent.

It is also very error-prone, because it is expert domain knowledge. Especially in industry, you will mostly have to drag in someone who knows their stuff, for instance some engineers or even taxi drivers, and in the communication between the expert and the programmer, data engineer, or data scientist, some information will be lost. Also, again, it is not really clear at what scale and which features should be incorporated into the model or into the dataset. Sometimes you are so happy about the results and the progress you are making that you throw in features that are based on one another, meaning you land in a multicollinearity situation where you have quite a lot of noise but not that much meaningful data. So a sanity check is always a good thing: create several features, think about how they perform, maybe fine-tune the model, try to get the best out of this small scenario, this local optimum, and then think again, because every brainstorming session will be affected by the previous one, and at some point you will run out of ideas.

Feature engineering is not knowledge transfer, but it can be used as part of it: you train models on a certain amount of data that at first glance has nothing to do with your task, but you have to train the first model somehow, so it can be useful for that step as well. And by unwrapping features you obviously create even more features, so you can then use feature selection, automatic feature creation and whatever else, because you have much more space to do that. I think it's very powerful; you can argue with that, but I think it is. And it worked with all the models I tried: some neural networks, classical approaches like decision trees, and they all performed way better when they got more data. It's like, yeah, duh, but every time you see it in the code and in the results, it's still a wow. So this is pretty much everything I wanted to say about feature engineering. Thank you for your attention.

Moderator: Thanks for the great talk. There's time for questions.

Q: Hi, thanks for the talk. I was just wondering, have you compared automated feature selection to your insightful feature selection, and do they perform better or worse?

A: I wouldn't use feature engineering when I have enough data to actually use feature selection. Feature selection comes in when you have enough data and want to take some features out to prepare for training, and feature engineering pretty much enables feature selection, so I won't compare them on this level.

Q: Thank you for the talk. How did you select the frequency of your data?

A: I was given it; I was pretty much just handed a text file, and I knew nothing more than that.

Q: So in your case you want to know at what hour your friend is awake, not in what minute or second?

A: It depends on the task, obviously. In this case a step of some minutes was enough for humans to know whether somebody is awake or not, available or not. A real-world application would be an automated system for remote teams that allocates slots for meetings or discussions based on when people tend to be available. There you could use, for instance, multiclass models with states like "I am doing my correspondence", "I am available for a talk", "I am focusing on new features", and so on and so forth. So in a real-world example you would probably need a different frequency of data; in my case it was more than enough. And again, depending on the frequency of the data, different features will be more or less important. With the exams, for instance, I was just lucky that I had two university phases in the data sample; otherwise it wouldn't have made any difference.

Moderator: Okay, some more questions?

Q: Hi, thanks for the talk. You seem to be advocating adding more and more features until you've improved your model enough. Is there some point where that starts to become counterproductive and you hit the curse of dimensionality by adding more and more dimensions to your model?

A: In my case it's rather hard, because honestly, for me it was already kind of "what else can I even add?". But if you start not with one dimension but with more, then basically yes: you can hit the problem where, trying to gather all the data, you are simulating the whole real world, which is the point where you start noticing patterns or behavior, if we are talking about classification, that are not typical for your case but are more general; for instance, there are historical cycles or whatever. It is not only the amount of data that is important, but also the things that show exactly when something happens and why it happens, so that you can learn from the features. This is where it starts to be counterproductive.
That is why I said: define not more than four features first, test them out, try to replace them. That is a pretty small amount of work to start with, like four features, unless you have a really giant model; but in that case you are probably already doing something wrong, starting feature engineering with such a complex model. After a while, I personally tend to say that it depends on how complex the thing you want to learn is. If you want to learn just the sleeping cycles of a dude, you don't need much data. If you're trying to build a self-driving car, oh boy, you're going to need quite a lot of sensors, quite a lot of data, quite a lot of features. So basically, the complexity of the task dictates the complexity of the data and, directly, the number of features needed. You can estimate yourself how complex your data should or could be, maybe reduce it a bit, and then start working.

Moderator: Okay, any more questions? Come on, I'm not biting. Then maybe I have a question myself. I'm not a data scientist, so I don't have much of a clue about this, but I've heard that as you add more and more information to your model, there's a risk of overfitting. Did you do some sort of cross-validation? Because your precision goes up, but that's sort of to be expected anyway, isn't it?

A: Yeah, true. In this case I didn't, because I had way too little data and I was in the territory of always underfitting, and the number of features was totally under control. If I started to throw in things like "does this guy live with his parents", "does he have a pet", and so on and so forth, then I probably would overfit at some point. So basically, in order not to overfit, or to keep the risk of overfitting low, you need to keep your data not only small but also low-dimensional, so as not to overcomplicate the description of the situation.

Moderator: Well, that's great. Oh, there's still time.

Q: What if we come up with a feature that is really not helping with the accuracy? Do we just drop it, and how can we observe that?

A: There was this thing: when I initially started creating features, I didn't think about the scale at all at first, but that was okay because I had like four categories. At some point I tried to scale them, and you saw it here with the three-feature model: one version is basically the raw one, from zero to whatever, and the other is more of a normalized one, from minus two to two or so. And those models don't always perform better. For instance, I tried to define the season as minus two, minus one, zero, one, and it decreased the performance by seven percent. The sketch below shows the two kinds of scaling I mean.
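The slide itself isn't reproduced here, so as an assumption, these are the two common transforms that answer seems to contrast: raw min-max scaling to a fixed range versus standardization, which lands values roughly in a small symmetric range like the "minus two to two" above:

```python
import numpy as np

X = np.array([[0.0, 23.0], [6.0, 7.0], [3.0, 14.0]])  # e.g. day, hour columns

# Raw min-max scaling to [0, 1], column-wise.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (zero mean, unit variance), which typically puts values
# into a small symmetric range such as roughly -2 to 2.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_minmax, X_std, sep="\n\n")
```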
Also, some features, as I said earlier, are just a slightly different form of already existing ones, so they won't really decrease your performance, but they won't help either. And always check your data to see whether the feature you want to create is even present. I had another case where I was really interested in whether national holidays affect taxi waiting times. I spent, I don't know, not that much time, maybe four hours, really digging up all the information, and it turned out the files just don't contain those sample points. Time wasted.

Moderator: Okay, one last question.

Q: Thanks. When you're dealing with these small, low-dimensional datasets, how do you differentiate between the case where there just isn't enough information to build a useful model and the case where you just haven't found the right manually engineered features yet?

A: Mostly gut feeling and common sense, honestly. It might differ; in this case it was pretty obvious. I simulated my own thoughts about the problem, figured out what I personally call a feature, what my brain defines as a feature, and that is pretty much where I stopped. In more complex situations you would probably start with the model first and not the data: if I have at least, say, four features, and after a while I notice that no model tuning actually works, but the data is clean, I have no missing values, it is sane, then I will ask what my model is supposed to describe. As a human I am smarter than the model, because I know more and I've seen more, and this is where I start trying to dig deeper and bring in some features that the model might find sufficient, or might not.

Moderator: Thank you very much. Okay, so give a big hand to Alicia again.