Hello everyone, welcome to another episode. Today we're going to be talking about pre-processing techniques in logistic regression. Before we move on, just a quick favor: can you destroy that like button for the YouTube algorithm gods? That would be lovely. The more likes videos like this get, the more the algorithm goes, "hey, this video is pretty sick," and sends it out to people like yourself. By just hitting that like button, you're helping us all out as a community, so thank you.

All right, so we start off with some libraries: NumPy, the standard math library; pandas, the data frame manipulation library; SimpleImputer, a good way to impute values in a data frame, which means filling in missing values, required for pre-processing; LogisticRegression to build our classifier, the main subject of this video; a couple of performance metrics, an ROC AUC score as well as a precision-recall score; train_test_split to split your data set into training and test sets; and Pipeline to create machine learning pipelines so that we have a streamlined process from pre-processing through to the actual model. I have an entire video dedicated just to pipelines, so do check that out after this one. Then some encoders: OrdinalEncoder, StandardScaler and OneHotEncoder. DataFrameMapper is used to map these pre-processing techniques to particular columns in your data set or training set.

Last but not least, we have our data set generator, which is called Snape. Quick shout-out to Snape: I always find it very difficult to just create data on the run, and as you've seen in the last few videos, I spent a lot of time just describing my data because I wasn't using a normal Kaggle data set. Snape makes that job easy; it's simply a convenient artificial data set generator. So if you need to generate artificial data, Snape is kind of the way to go. I'm going to link to it in the description too so that you can read through it and install it.

What I have done here is build a wrapper function around Snape, just to fit the confines of this video, so that we can pass in: do we want categorical features, yes or no? Do we want our data set to be balanced? Do we want correlated features? Do we want missing values? What size of data set do we need? And it returns the data frame, the label column, the categorical feature columns and the numerical feature columns. So that's our wrapper. Then we have this evaluation piece right over here, which is basically computing the ROC AUC score as well as the precision-recall AUC (there's a rough sketch of this just below).

All right, those are all the helper functions, and now we can actually get into the meat: how is logistic regression affected by the five pre-processing techniques that we mentioned? First is standardization. Really quickly, standardization is the process of making sure that a particular feature column has a mean of zero and a standard deviation of one. You can see that the mean is definitely not zero and the standard deviation is definitely not one for these columns, so this is not standardized data. With this data, we split it into training and test sets, where the test size is 10% of the overall data.
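Just so the helper pieces are concrete, here is a rough sketch of what the evaluation function and the split might look like. This is my own reconstruction, not the exact notebook code: I'm assuming a helper called evaluate that takes a fitted pipeline, and I'm approximating the precision-recall AUC with average precision.

```python
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(pipeline, X_test, y_test):
    """Sketch of the evaluation helper: ROC AUC plus precision-recall AUC."""
    # Score with the predicted probability of the positive class.
    probs = pipeline.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test, probs)
    # Average precision is a common summary of the precision-recall curve.
    pr_auc = average_precision_score(y_test, probs)
    return roc_auc, pr_auc

# 90/10 split, with X and y coming from the Snape wrapper described above:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
```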
And then we just assign those to X_train, y_train, X_test and y_test. Now what we have here is a little machine learning pipeline: we first create the logistic regression classifier, then create a pipeline, and the pipeline only has one thing in it right now, just the classifier. I'm supplying verbose=True so that we can see additional information about every step in the pipeline; in this case, the main information we get is the total time it takes to execute each section. So first we fit the pipeline on the training set, passing in the labels, and then we evaluate it. Looking at this, the AUCs come out around 81%, which is pretty good, I guess, and it took a total of 0.3 seconds to train this classifier.

Now, let's say we did standardize. If we apply standardization, there's a built-in transformer so I don't need to code it out: it's called StandardScaler. What we're doing here is applying a StandardScaler to every single numerical variable, which ensures the mean is zero and the standard deviation is one for each of them. Looking at this now, the total time for pre-processing is 0.1 seconds, and the training time is now only about 0.1 seconds as well, as opposed to the 0.3 seconds it was before, although the performance of both models is exactly the same. So what's the result? Standardizing your features in logistic regression helps speed up convergence during training, but it doesn't necessarily improve the performance of your model. Either way, it's good practice to standardize your numerical features.

All right, now let's do encoding. For encoding categorical variables there are two major approaches (there are many, but two well-known ones): one-hot encoding and ordinal encoding. Let's see both cases right now. I'm just creating my dummy data: no fancy parameters, just standard data with four categorical features and four numerical features, let it rip. Now for one-hot encoding, let's say I want to one-hot encode the categorical features, which is about four features here. We'll see that it took almost a second for the entire process to complete, and we get a precision-recall AUC of around 80% and a normal ROC AUC of about 80% too, which looks pretty good. Notice, though, that I set the number of iterations for my logistic regression to a thousand. By default it's a hundred, but I needed a little more for it to converge with one-hot encoding. This is mostly because there's just so much more data here: one-hot encoding really expands your entire data set, so you need that extra training time, especially as the encoded data set gets larger.

The other way is ordinal encoding. Instead of getting one column per value a categorical feature can take, say five or six columns for five different values, with ordinal encoding all we do is make sure that the first class is zero, the second class is one, the third class is two, and so on. See the sketch below for how both set-ups might be wired up.
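Here's a minimal sketch of how the two encoding set-ups might be wired through DataFrameMapper. The column names, the StandardScaler on the numeric columns, and the max_iter value are illustrative assumptions on my part, not pulled from the notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

# Illustrative column names; in the notebook these come from the Snape wrapper.
cat_cols = ["cat_0", "cat_1", "cat_2", "cat_3"]
num_cols = ["num_0", "num_1", "num_2", "num_3"]

# Option 1: one-hot encode the categorical columns (expands the feature space).
one_hot_mapper = DataFrameMapper(
    [(cat_cols, OneHotEncoder(handle_unknown="ignore"))]
    + [([c], StandardScaler()) for c in num_cols]
)

# Option 2: ordinal-encode the categorical columns (keeps the column count).
ordinal_mapper = DataFrameMapper(
    [(cat_cols, OrdinalEncoder())]
    + [([c], StandardScaler()) for c in num_cols]
)

# One-hot encoding tends to need more iterations to converge, hence max_iter=1000.
one_hot_pipe = Pipeline(
    [("preprocess", one_hot_mapper), ("clf", LogisticRegression(max_iter=1000))],
    verbose=True,
)
ordinal_pipe = Pipeline(
    [("preprocess", ordinal_mapper), ("clf", LogisticRegression())],
    verbose=True,
)
```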
And so, if we just look at the output of the pre-processing phase, say the first five rows transposed here, the data still has the same number of features as before: eight features here, as opposed to the 32 or so we saw with one-hot encoding. Because of this, training is much faster. It's a little harder to compare directly, but the AUC is 81%, which is more or less similar to what we saw before. The one thing to be careful about with ordinal encoding is the nature of the categorical variable. It won't make sense if the variable is something like gender, where zero is male and one is female; that would imply female is greater than male, or the other way around if we swapped the values. It does make sense for categories with a defined order, like size (small, medium, large), where you can definitively say one is greater than another. So overall, ordinal encoding works well when an ordering relationship exists between the category values; otherwise you'd prefer one-hot encoding. One-hot encoding, though, takes a lot of space and more training time, so you have to watch out for convergence: sometimes it just won't converge within the default configuration of logistic regression, in which case you'll need to modify it. Cool, so two of them down.

All right, next is data imbalance. What happens with logistic regression if your data is not balanced, where the positive class is severely underrepresented relative to the negative class? Again, I'm using the same exact pipeline, but this time with just the ordinal encoder for the sake of simplicity; imagine that all of our categorical variables can be ordinally encoded, so that should be fine. Overall we get an ROC AUC of about 78%, but the precision-recall AUC is only about 40%. This is also the main reason I wanted to introduce the precision-recall AUC as a metric: with imbalanced data, it is so much easier to look good on accuracy-style metrics. Say 90% of the data is the negative class. I don't even have to write a model; I can write a function that just says "the class is zero" and always predicts zero, and my accuracy shoots through the roof. So it's not a very fair way to assess the performance of a model to use only the normal AUC or accuracy metrics, and hence the precision-recall AUC is used as well. You can see that performance takes a hit here: I'm just passing in a plain logistic regression, and it struggles to pick up the positive class. So how do you actually deal with unbalanced data? You pass in class_weight='balanced'. Since the ratio is roughly nine to one, I want to weigh the positive samples about nine times as heavily as the negative samples, as shown in the sketch below.
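Concretely, handling the imbalance is then just a one-argument change on the classifier. This is a sketch along the lines of the pipeline above, reusing the ordinal mapper from the earlier encoding sketch.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# class_weight="balanced" reweights classes inversely to their frequency, so with
# a roughly 9:1 negative-to-positive split the positive samples end up weighted
# about nine times as heavily as the negative ones.
balanced_pipe = Pipeline(
    [
        ("preprocess", ordinal_mapper),  # ordinal_mapper from the encoding sketch above
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ],
    verbose=True,
)
```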
This weighting is required because you want your logistic regression model to pick up on those minority samples; you want it to be able to detect them. The result is that an unbalanced data set doesn't harm accuracy much, but it does harm the precision-recall metrics of the positive class, and this is mostly due to lower predicted probability values.

All right, moving on to correlated features. What happens if your data is correlated? Here I just pass in correlated_features=True to the data generator, and the performance is pretty high, almost 90%, which is cool. Now I'm using statsmodels here just to see the explanation of these variables. We know the model is at around 90% performance, but which variables are contributing to that, exactly? One of these variables, x6, already seems a bit redundant: we can see that it might be explained by some of the other variables, and there may also be very strong multicollinearity, as the warning here suggests. You can get all of these insights with a simple OLS, which is just an ordinary least squares regression model.

So let's now compute something called the variance inflation factor. Here's the function that computes it, and what it's actually doing is this: say there are eight columns and you're doubting x6. It will train a linear regression using the other seven columns as features, with x6, the column we're trying to debunk, as the target, to see whether there is collinearity. If x6 really is redundant, the R-squared of that model should be high, because the other columns should explain everything that is in x6. If they explain everything, R-squared is close to one, which means the denominator in 1 / (1 − R²) is close to zero, and hence the variance inflation factor blows up towards infinity. And that's exactly what's happening. R-squared, by the way, is the fraction of variance explained: how much of x6, in this case, is explained by the other columns. We do that for every single column: first x0 as the label with everything else as features, then x1, then x2, and so on (there's a small hand-rolled sketch of this a bit further down). And what we get is infinity for everything, which means that almost every column can be somewhat explained by all the other columns put together.

Now, this doesn't mean that everything is useless; we just need to pick out which columns are actually useful. So what I do is create a correlation matrix to get us started. The biggest thing that jumps out is that x3 and x0 are perfectly correlated, so you can start by eliminating one of them, i.e. start by removing perfect multicollinearity. I remove the feature x3 and pass all of this into the same exact pipeline, no changes whatsoever, and you can see that the model performance is still at 90%.
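In case the mechanics of that computation feel abstract, here's a small hand-rolled sketch of the variance inflation factor, regressing each column on all the others. The notebook presumably uses its own function (statsmodels also ships a variance_inflation_factor helper), so treat this as illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(X: pd.DataFrame) -> pd.Series:
    """For each column, regress it on the remaining columns and report 1 / (1 - R^2)."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=[col])
        # R^2 of the linear regression predicting this column from all the others.
        r_squared = LinearRegression().fit(others, X[col]).score(others, X[col])
        # If the other columns explain this one almost perfectly, R^2 -> 1 and VIF -> infinity.
        vifs[col] = np.inf if np.isclose(r_squared, 1.0) else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs)
```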
Coming back to the model with x3 removed: when we try to explain the variables, the R-squared is still at the same 48.3%, nothing has changed, and x3 is now gone. We still see some sign of high collinearity, so we're still in roughly the same spot. All right, let's keep going: at least we've removed the perfect multicollinearity, but now let's try to remove merely highly collinear features. One candidate we can already see right here is x6: it isn't significant, which means a lot of it could already be explained by the features that already exist. So let's remove x6 as well, which we do here, and lo and behold, training the model doesn't decrease performance, which is great, and the R-squared is still 48.3% explainability. That hasn't changed even though we've now removed two features, and all of our remaining features look significant. The coefficients are a little smaller for quite a few of them, but we are no longer getting that warning about high multicollinearity, so we're probably in decent shape.

Now let's take a look at the new variance inflation factors, and look at that: we're finally getting some finite values. So, cool: removing x6, we didn't lose explainability or performance, which is great. When you get to this stage, a typical rule of thumb is that you don't want to use features whose variance inflation factor is too high, because they can be explained by the other columns. In this case, x7 could be explained, so a remedy you might think of is removing it. Let's see what happens if we do remove x7: now performance really starts to decline, and even the explainability is declining here too. On the other hand, with it removed the variance inflation factors are now around one, which is ideal, meaning all of the remaining features are very decorrelated from each other, which is what you want. But again, we've lost some explainability.

So what can you do to remedy this? First, you can try including polynomial features of these variables in the model to capture more complex interactions, because only linear interactions are detected by logistic regression, linear regression, and these other linear models in general. Or you can go for a more complex model than logistic regression, since it might not be enough to capture all the patterns in your data. Just some remedies there. I hope this entire discussion of the variance inflation factor made sense, and how you'd get past that hurdle. I also left notes here in case you're wondering.

All right, the last stage: what happens if logistic regression encounters missing values? Well, you generate that data over here, and I literally just try to train our model with the missing values in place. What happens is nothing, because it's an error: we can't have NaNs in logistic regression and train it like that. In general, it's just not good practice to leave values as None. We want to impute those values instead; a rough sketch of that follows below.
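As a minimal sketch of that imputation step, here's how it might look with SimpleImputer inside the same DataFrameMapper set-up. For brevity this only maps the numerical columns, which is an assumption on my part, not necessarily what the notebook does.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper

# Fill missing numerical values with the column mean ("mean" is SimpleImputer's default).
# Switching to strategy="constant", fill_value=0 would impute zeros instead.
impute_mapper = DataFrameMapper(
    [([c], SimpleImputer(strategy="mean")) for c in num_cols]  # num_cols as in the earlier sketch
)
impute_pipe = Pipeline(
    [("impute", impute_mapper), ("clf", LogisticRegression())],
    verbose=True,
)
```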
So that's data imputation, which I mentioned before, performed here for the numerical values. Typically we impute with the mean, and that's the default strategy for SimpleImputer, but if you look at the docs you can also impute with a constant value. Again, this really depends on the business use case: a safe default is to go with the mean, but if you have a situation where you want values imputed with zero, for example, you can do that too. So you train your model, and you can see that compared to all of the cases we had before, with a balanced data set and no missing values, performance there was definitely higher: missing values obviously hamper performance.

And that's it. I have a summary of everything we talked about right here. So to summarize, we looked at five pre-processing techniques and how they affect logistic regression. Standardization, the process of making sure that every numerical column has mean zero and standard deviation one: performance doesn't necessarily improve, but convergence is faster. Encoding of categorical variables: you can use ordinal encoding wherever it's appropriate, because it converges much faster than the alternative, one-hot encoding, which can explode your data set; but one-hot encoding may be required where your categorical data has no order and you can't say one value is greater than another, like gender. Data imbalance: logistic regression obviously performs better with balanced data, but with unbalanced data we can oversample or, rather, overweight the positive class, which is the minority. You can also do some subsampling of the majority class, and a combination of both might work well. Fun fact, though: if you're trying to use the probability outputs of logistic regression as actual probabilities, you'd need to do some model calibration here, because after reweighting the outputs would not be representative of actual probabilities. I have an entire video dedicated to model calibration, so please do check that out. Then we have data collinearity: we want to remove perfectly collinear features, and then we may want to try different modeling strategies to capture nonlinear interactions, or use polynomial features within the same logistic regression to capture more complex interactions. And finally, missing values, which you'd want to impute with either a constant value, the mean, or something else defined by the business use case and the semantics of what that column actually means.

All this code will be available on GitHub. So yep, that's it. I hope you all enjoyed this video, and do stay tuned for some more awesome machine learning, data science and AI content. I will see you guys in the next video. Take care.