This video is partially sponsored by Kite. They provide a code completion service for machine learning code, and it integrates really well with your editor and even Jupyter notebooks. Click the link in the description to try Kite for free.

Hey everyone. Today I thought we'd do a different kind of video where we look at code for building an end-to-end machine learning pipeline, in just a very few lines of code. Hopefully this gives you some new implementation strategies for everything from preprocessing to actual model training and evaluation. So let's just get started.

In this notebook I'm importing a handful of libraries, which I'll explain. numpy is the numerical math library; pandas is the DataFrame manipulation library, pretty useful stuff. CatBoostClassifier lets us run CatBoost; I love CatBoost, amazing stuff. make_classification is just a good way to create dummy classification data, which I'll show you right below. SimpleImputer handles imputation, which is the process of filling in missing values with an imputed value, such as the mean, or some placeholder class in the case of categorical variables. LogisticRegression is the second model we'll train on our classification dataset. roc_auc_score gives us an AUC metric so we can tell how well a model is performing. train_test_split splits your data into train and test sets. Pipeline is kind of the MVP of this session: it lets us chain multiple steps to execute one after another, from preprocessing to actual model training and even evaluation. Then there's a bunch of scikit-learn's built-in preprocessing classes, OrdinalEncoder, StandardScaler, and OneHotEncoder, which I'll introduce as and when they show up. Finally, there's DataFrameMapper, from the sklearn-pandas library, which maps columns to transformations so that we can perform a specific transformation on each column of our DataFrame before passing it into the actual model for training.

All right, now that we have that out of the way: we have this big chunk of code that honestly just creates our dummy data the way I wanted it, with four categorical and four numerical features. In this line here we create 10,000 samples with four numeric features, all of them informative, and we make a 50/50 label split, so 5,000 positive labels and 5,000 negative labels. Then we add some categorical features. The way I'm doing this, in a slightly sleight-of-hand way, is that each time I want to create a categorical column, I determine the number of classes by drawing a random number between two and ten, and then I fill the entire column with random values between zero and that number. So if I draw seven classes for the first column, it gets random values between zero and six, and that becomes the feature column I append to the dataset. Then I just convert everything into a DataFrame. And in this little chunk of code right here, I scale our numeric values, because I wanted them to be kind of all over the place.
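To make this concrete, here's a minimal sketch of the data setup described here and in the next paragraph. The column names and the overall shape follow the description, but the specific random ranges (and the rng seed) are my own stand-ins, not necessarily the notebook's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)

# 10,000 samples, 4 informative numeric features, 50/50 label split
X, y = make_classification(
    n_samples=10_000, n_features=4, n_informative=4,
    n_redundant=0, weights=[0.5, 0.5], random_state=0,
)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 5)])

# Shift/scale each numeric column so it no longer looks like a unit normal
# (the exact ranges here are arbitrary stand-ins)
for col in df.columns:
    df[col] = df[col] * rng.uniform(1, 50) + rng.uniform(-100, 100)

# Four categorical columns: draw a class count between 2 and 10, then
# fill the column with random string-valued classes like "str_3"
for i in range(5, 9):
    n_classes = rng.integers(2, 11)
    df[f"feature_{i}"] = [f"str_{c}" for c in rng.integers(0, n_classes, size=len(df))]

# Inject ~30% NaNs into every feature column
for col in df.columns:
    df.loc[rng.random(len(df)) < 0.3, col] = np.nan

df["label"] = y
```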
As for make_classification: the way it works, it gives us roughly normally distributed data with a standard deviation of one, and I just wanted to shift the mean and standard deviation of those values so they resemble data you would actually see in a real dataset. For the categorical features, I also wanted to spice things up: instead of just numeric categorical values, we're dealing with string values, which we typically do see in real data. And then I'm introducing some NaNs throughout. What this does is basically say: I want 70% of my data to not be NaN, but 30% of it should be NaN, and I'm doing this for every column. Then we just create our final DataFrame.

So this is the final product of our dataset. I'm just taking three random samples, and you can see that features one through four are numerical features with a bunch of NaNs, while features five through eight are string features with random categorical classes, plus the corresponding labels. Great, we spent all this time constructing our dataset. Very nice.

Now I'm just going to split this into train and test sets in a 90/10 ratio. I'm choosing not to shuffle here, since a lot of data is time-sensitive or time-oriented, and shuffling would lead to data leakage. Again, this really depends on your problem, but I'm treating it as the typical scenario. Then we extract our train and test values for X and y.

Now we get to the actual preprocessing and training; this chunk is really the meat of the entire code. What I'm doing here is creating a column mapper. Using DataFrameMapper, we map columns to the data transformations we want to perform on those columns. For every categorical column, I first impute a value, in this case with SimpleImputer, and wherever there are NaNs I fill in a string class called "UNK". Once that value is filled, I perform ordinal encoding, which encodes the values from zero all the way up to the actual number of classes minus one. I'm using an OrdinalEncoder here because it works well with the CatBoost classifier we're about to use; hence the ordinal encoding. I'm also passing in options for handling unknown values. This is because when you're using this model in real time, you might have, say, ten predefined classes that you trained your data on, and then another class, not just a NaN or unknown, might appear at inference time. We want to make sure the model doesn't go haywire when it sees that, so I'm just encoding it with a value of negative one, and that should work, especially since it's a CatBoost classifier, which handles this kind of ordinal encoding strategy well. For the numeric features, I'm just using the default SimpleImputer.
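Putting the mapper together with the pipeline, training, and evaluation steps described over the next few paragraphs, a rough sketch might look like this, reusing df from the sketch above; the CatBoost settings and the file name are placeholders:

```python
import joblib
from catboost import CatBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn_pandas import DataFrameMapper

# 90/10 split, no shuffle (see above re: time-ordered data and leakage)
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="label"), df["label"], test_size=0.1, shuffle=False)

num_cols = [f"feature_{i}" for i in range(1, 5)]
cat_cols = [f"feature_{i}" for i in range(5, 9)]

mapper = DataFrameMapper(
    # Categorical: fill NaNs with an "UNK" class, then ordinal-encode;
    # classes unseen at training time get encoded as -1 at inference
    [([c], [SimpleImputer(strategy="constant", fill_value="UNK"),
            OrdinalEncoder(handle_unknown="use_encoded_value",
                           unknown_value=-1)])
     for c in cat_cols]
    # Numeric: the default SimpleImputer (mean imputation, see below)
    + [([c], SimpleImputer()) for c in num_cols],
    df_out=True,  # return a DataFrame instead of a bare array
)

pipeline = Pipeline([
    ("preprocess", mapper),
    ("model", CatBoostClassifier(verbose=0)),
])
pipeline.fit(X_train, y_train)

# Inspect what the mapper alone does: mapper.transform(X_test)

# Persist the whole pipeline so it can be loaded elsewhere without retraining
joblib.dump(pipeline, "pipeline.joblib")

# AUC on the train and test sets
for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
    print(name, roc_auc_score(y_, pipeline.predict_proba(X_)[:, 1]))
```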
Here, the default imputation strategy for numerical features is the mean, so wherever the mapper sees a NaN, it imputes the mean of that column in its place. We pass all of this to our DataFrameMapper, and I'm setting df_out=True so that it actually returns a DataFrame. And that's our mapping done. We also have a classifier: a very rudimentary CatBoostClassifier that will work on this data.

Now we have this Pipeline we're implementing, which basically says: first, preprocess with the mapper I created, which performs the imputation and the encoding for the categorical variables and the imputation for all the numerical variables; then pass that DataFrame into the classifier, which is used for training. Then we just fit this entire pipeline on our training data. That's all the training that happens, and poof, we're done with training.

Now, if you want to see what the mapper actually outputs, it's pretty cool. We take the mapper and just transform the test set. Initially, this is what the test data really looks like; I just took the first five rows, with the four numerical features first and then the typical string categorical features. After preprocessing the data, you can see that this NaN over here is now imputed with the mean value of feature one, and these first three NaNs here are imputed with the mean of feature two: they're all the same value, the column mean. And when you come to the categorical variables, you can see they're definitely ordinal encoded. So we're doing fine here, and this preprocessed DataFrame is what gets passed into the CatBoost classifier for training. All of this happens very seamlessly.

Once we're done with all that model training, we have this pipeline object, which we can just dump to disk so that we can load it in some other file, at some other point, without having to retrain the model again and again. As far as the evaluation strategy is concerned, I take whatever data I pass in, get predictions for it, and compute the AUC. I do this for the train set and the test set, and it looks like we're getting pretty good AUC, so CatBoost gives us a pretty good model.

Now, let's say I don't want to use CatBoost, and I want something simple like logistic regression. That calls for different strategies: we can't really use ordinal encoding, we'd need one-hot encoding, plus a couple of other changes. But honestly, because this is such a streamlined process with a very specific way of doing things, we don't really need to change much code at all; the fixes are very simple. First of all, the main thing is that the classifier is now LogisticRegression. And instead of the OrdinalEncoder, I just replace it with scikit-learn's OneHotEncoder, which works well with logistic regression.
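As a sketch, the swapped-in version might look like this, reusing the names from the previous sketch. It also includes the StandardScaler I talk about next; note that on scikit-learn versions before 1.2, the sparse_output flag is spelled sparse:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

lr_mapper = DataFrameMapper(
    # Categorical: impute "UNK" as before, then one-hot encode;
    # handle_unknown="ignore" zeroes out classes unseen at training time
    [([c], [SimpleImputer(strategy="constant", fill_value="UNK"),
            OneHotEncoder(handle_unknown="ignore", sparse_output=False)])
     for c in cat_cols]
    # Numeric: mean-impute, then standardize to zero mean, unit variance
    + [([c], [SimpleImputer(), StandardScaler()]) for c in num_cols],
    df_out=True,
)

lr_pipeline = Pipeline([
    ("preprocess", lr_mapper),
    ("model", LogisticRegression(max_iter=1000)),  # max_iter bumped as a safe default
])
lr_pipeline.fit(X_train, y_train)

# The same inspection and evaluation code works unchanged, e.g.
# lr_mapper.transform(X_test) and roc_auc_score on lr_pipeline.predict_proba
```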
And I'm also adding a StandardScaler, which, after imputation, standardizes our values so that each feature has zero mean and unit variance, using the formula (x - mean) / std. (The formula (x - x_min) / (x_max - x_min), which squeezes features between zero and one, belongs to MinMaxScaler, a different scikit-learn scaler.) Technically, you don't really need to do this for logistic regression, but it's always good practice, and there's certainly no harm in it. I'd encourage you to read more about each one of these in scikit-learn's documentation, which is actually pretty good, so check that out. Again, we're just using the same DataFrameMapper, piping it in a very similar way, and then it's literally the same code to train the model. And we train our model.

Now, let's just take a look at what data transformations actually happened at preprocessing time here. Say we transform our data. This is the initial data we had; it's the same test data we saw before, just transposed here because that makes it easier to see. So this column labeled 9,000-something is actually one sample that gets passed into the model. After preprocessing, we get all of this. Let's just compare the first column here with the first column there, since that's the same sample. First of all, 629 became something like negative 0.08, which does make sense because we also performed feature scaling. Also, if you look at feature five, we had a NaN over there, so UNK is one-hot encoded as 1 and everything else is 0, which is correct. For feature six, str_0 is 1, which I think is the case here; yes, str_0 is 1 for feature six and everything else is 0. For feature seven, again it's a NaN, so UNK is 1 and everything else is 0. And for feature eight, str_1 is 1, which agrees exactly with what we see here: a NaN, hence UNK, and then str_1 as the one-hot 1. It's this entire DataFrame that now gets piped into our logistic regression model to train. Well, in this case it's the test data, but we'd do the same thing with the training DataFrame, X_train, and pass that into logistic regression for training. Then we evaluate our model on the train and test sets, and those look like pretty good AUC values.

That's all I have for you right now, but I hope this gives you a good, streamlined, end-to-end strategy for coding up a machine learning pipeline. Hopefully it also makes you realize that you don't need to write 2,000 lines of code to build a pipeline, and that these pipelines are very easy to modify: scikit-learn has so many built-in functions that it makes your job easy even if you want to change your entire modeling strategy. I'm probably going to put this code on GitHub. If you liked what you saw, give this video a like, share it if you can, subscribe, and hit that bell. I'm building a community here, and I really do need your help for it. Anyway, take care, have fun, I'll see you soon. Goodbye.