Hi there. In this video tutorial we're going to talk about training a neural network when we have an imbalance in the classes of our target variable. This is a notebook and video tutorial explaining the Keras website's example on class imbalance, and you can find the link to that page. I'm going to be slightly more verbose than that page so that you can really understand how to deal with class imbalance. My aim in this notebook is for you to understand what class imbalance is, to understand how to express class imbalance as weights for the classes of a target variable, to understand metrics such as true positives, false positives, true negatives, false negatives, recall (also called sensitivity), specificity, precision (also called positive predictive value) and negative predictive value, and then how to use the class_weight argument when we're fitting the training data to our neural network. I'm going to assume that you know Python, that you're familiar with TensorFlow and especially Keras, that you know how to construct a densely connected neural network using the sequential architecture, and that you understand constraining the hypothesis space through dropout. So let's have a look at the packages that we're going to make use of. We're going to use pandas and NumPy, so let's just run that cell. We're going to import Keras from TensorFlow, and just so that we're all on the same page depending on when you watch this video, let's make sure that we're using the same version of Keras, or at least that you know what version of Keras is used in this notebook. We run that cell and we see it's version 2.4. I'm also going to make use of Matplotlib, and I'm setting the style there to whitegrid; because this is a retina display, I'm also using the config magic command. Our data, by the way: we're here in Kaggle, so the data is available for us as you can see on the right-hand side.
There is the input folder, and inside of that there is a subfolder called credit card fraud, and we see the creditcard CSV file there. Let's click on that so we can have a quick look at it; it pops up at the bottom and we can see there's a Time variable, and that is, I think, the number of seconds since the first recorded transaction. So this data set is all about credit card fraud. It's a binary classification problem encoded with a zero and a one. We have this data, and the target variable right at the end states whether the transaction was a real transaction or a fraudulent transaction. And we can see that our variable names are just V1, V2, and so on, so principal component analysis has already been applied to this data, in an effort, of course, to anonymize it. So we needn't care about what these variables mean, because there's already been a projection onto a lower-dimensional space. What we're not going to make use of is this Time variable; we're going to delete that before we model. So there we go, we can just import the data. I'm using the read_csv function from the pandas package, and to reference this file inside of Kaggle it's two dots, forward slash, then the input folder as we can see up there, then credit card fraud, and then the actual CSV file. That imports the data set for us. We can use the columns attribute and it shows us all the columns that are there. We can see them as we saw before: Time, V1, and then, lastly, off the screen, there's Amount and then our target variable, which is Class. So what we're going to make use of is the drop method. df is the computer variable that I assigned the data frame to when I imported the CSV file.
So df.drop: since I have just the one column that I want to drop, I can reference it as a string. I have to say axis equals one; that means a column. And I'm setting my inplace argument to True so that the change is permanent. As I mentioned, all the data has already been projected onto a lower-dimensional space through principal component analysis. The only value that is still interpretable, I think, is the amount of the transaction, and we see an amount of 88 units; I presume this would be dollars, but whatever the currency is. You can see the summary statistics using the describe method there on df.Amount. Remember, if we reference just one of the columns, that returns a pandas Series for us, and we use the describe method on that Series to see the summary statistics. Let's have a look at the shape of our data frame, using the shape attribute, and we see we have 284,807 instances, rows of data, along 30 columns. Now, the value counts: if we look at the class, so df.Class, remember that gives us that column as a pandas Series, and we call the value_counts method on that Series, and we can see the class imbalance. Zero, the non-fraudulent transactions, and one, the fraudulent transactions, and we see only 492 fraudulent transactions out of 284,807. So a big class imbalance. We would love that to be 50-50 when we fit our training data to our neural network, and that's clearly not the case here. Remember, if we set the normalize argument to True, value_counts gives us back the proportions, and then we can see 99.8% of our target class is zero, or non-fraudulent transactions.
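As a quick sketch, the drop and value_counts steps might look like this in code. Note that the frame below is a tiny made-up stand-in for the real Kaggle CSV (which the notebook reads with read_csv), so the numbers here are illustrative only:

```python
import pandas as pd

# Tiny stand-in for the credit card frame; the real file would be read with
# pd.read_csv("../input/creditcardfraud/creditcard.csv") on Kaggle
df = pd.DataFrame({
    "Time":   [0, 1, 2, 3, 4],
    "V1":     [0.1, -0.2, 0.3, 0.5, -0.1],
    "Amount": [10.0, 88.0, 5.0, 2.5, 40.0],
    "Class":  [0, 0, 1, 0, 0],
})

# axis=1 means we're dropping a column; inplace=True makes the change permanent
df.drop("Time", axis=1, inplace=True)

counts = df["Class"].value_counts()                      # absolute counts per class
proportions = df["Class"].value_counts(normalize=True)   # proportions per class
```

On the real data, counts would show 284,315 zeros and 492 ones, and proportions would show roughly 99.8% zeros.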
What I prefer to do is convert this data frame object into two NumPy arrays: one is my feature matrix, which we'll assign to the computer variable uppercase X, and our target column we'll convert into a NumPy array, which is then a vector. How I do that is df.drop: we drop the Class column, we say that it's a column by stating axis equals one, and this time I don't put inplace equals True, because I'm only dropping Class from the data frame temporarily before assigning the result to the variable X. I also add the to_numpy method to convert those columns into a NumPy array, our feature matrix. And then my pandas Series, df.Class, I convert into a NumPy array and assign to lowercase y. So if we do that, we have a feature matrix and a target vector, both as NumPy arrays. Just as a bit of a sanity check, we'll use the shape attribute on both X and y, and we can see that we still have, in our feature matrix, 284,807 rows, but now 29 columns, because we dropped the Class column, which is in the vector y: its shape is 284,807 comma nothing as far as that tuple is concerned, just to indicate that it's along a single axis. Now we're going to do the splitting of our data; we're just going to have a training and validation split. And because the data is already randomized, we don't have to use, for instance, the train_test_split function from the scikit-learn library; we can just take the first 80% of the values as our training set and the last 20% as our validation set. So I'm going to work out what 20% of the values looks like, and I'm going to save that as an integer in the computer variable num_val_samples.
So that's an integer, and what we want is the length of X, so len, our function for length, of X, which gives me the number of rows, times 0.2, converted to an integer. That represents 20% of the values. What we're going to create is train_features, train_targets, val_features and val_targets. We take the feature matrix X and we use slicing notation, colon minus this value: that says everything up to the last 20%, which is 80% of the data, and that goes into my training set, for the features as well as for my target vector. And for the last 20% we do the opposite: the negative of the integer representing the last 20%, and then colon, so counting from that point to the end. And that's how we get the 80-20 split in our data. As I said, we're not taking random rows, because the data is already in a random order. Now we're going to use the bincount function from NumPy; I imported NumPy as np, so np.bincount of train_targets. So only in our training set now, the 80% of the data that we've chosen as our training set. Let's have a look at that, and it gives us how many are zeros and how many are ones, and we can still see the class imbalance there. And now the crucial bit as far as this class imbalance is concerned: how do we convert these counts into weights? bincount gives us a NumPy array with two values, indexed zero and one, and we just take the reciprocal of each of these. One divided by 227,429 becomes the weight for the zero class, and one divided by 417 is going to be the weight for the fraudulent transactions. So valid is zero, fraudulent is one, and we take one divided by those counts.
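The split and the reciprocal-count weights described above can be sketched as follows. The arrays here are small made-up stand-ins for the real X and y (which come from df.drop("Class", axis=1).to_numpy() and df.Class.to_numpy()), so the counts are illustrative:

```python
import numpy as np

# Illustrative stand-ins for the real feature matrix and target vector
X = np.random.default_rng(0).normal(size=(1000, 29))
y = np.zeros(1000, dtype=int)
y[::20] = 1   # roughly 5% "fraudulent" rows, purely for illustration

# First 80% for training, last 20% for validation (data assumed pre-shuffled)
num_val_samples = int(len(X) * 0.2)
train_features = X[:-num_val_samples]
train_targets = y[:-num_val_samples]
val_features = X[-num_val_samples:]
val_targets = y[-num_val_samples:]

# Count each class in the training targets and take reciprocals as weights
counts = np.bincount(train_targets)
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]
```

The underrepresented class ends up with the much larger weight, exactly as in the notebook's 1/227,429 versus 1/417.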
So that's crucially how we set up the weights. And if we print those values to the screen, you can see that because there's this overwhelming number of valid, or zero, values, the weight for that class is going to be 4.397 times 10 to the power of negative 6, so about 0.0000044. And the weight for the underrepresented class, of which there were only 417 instances, is going to be about 0.0024. You see the big difference: the class that's underrepresented gets a much larger weight. That's quite crucial. Next we're going to do a bit of data preprocessing, and what we're going to do is standard scaling. Remember that we take one column at a time, and for that column we calculate its mean and standard deviation, x-bar and sigma. Then we take each value down that column, subtract the mean of that column from it, and divide by the standard deviation of that column. That transforms every value into units of standard deviation away from the mean; every value gets transformed into a z-score. So, just standard scaling. First of all we calculate the mean and the standard deviation, and I'm going to use NumPy's mean function and NumPy's standard deviation function, np.mean and np.std, assigning them to computer variables that make sense: mean and standard deviation. And we calculate both the mean and the standard deviation of our training set. Then we do the transformation: train features minus the mean, divided by the standard deviation. But notice that for the validation set, we also subtract from it the mean of the training set and divide by the standard deviation of the training set. That's quite important. So let's use Keras.
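The scaling step above, with the training-set statistics applied to both sets, can be sketched like this (the arrays are random stand-ins for the real features):

```python
import numpy as np

# Stand-ins for the real train and validation feature matrices
rng = np.random.default_rng(1)
train_features = rng.normal(loc=5.0, scale=2.0, size=(800, 29))
val_features = rng.normal(loc=5.0, scale=2.0, size=(200, 29))

# Column-wise mean and standard deviation of the *training* set only
mean = np.mean(train_features, axis=0)
std = np.std(train_features, axis=0)

# z = (x - mean) / std, using the training statistics for both sets
train_features = (train_features - mean) / std
val_features = (val_features - mean) / std
```

After this, every training column has mean 0 and standard deviation 1; the validation columns are close to that but not exactly, because they were scaled with the training statistics.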
Let's use the sequential architecture to create a neural network. We're going to assign that to the computer variable model, so it's keras.Sequential. And let's have a look at our layers. We're going to have one, two, three dense layers and then an output layer. Each of the dense layers is going to have 256 nodes, and we're going to use the ReLU, rectified linear unit, activation function for each of them. Our input shape is our training features' shape, the last value. Let's just put in a little code cell to make sure what this represents. We take train_features.shape, and you see the shape of that is the rows and the columns, of course, and what we want with negative one is the last one. Our input shape is 29, because we have 29 values going into our neural network at any one time. Then another densely connected layer with a rectified linear unit activation function, but now we are constraining our hypothesis space by using regularization; in this instance, dropout of 30% of the nodes. Then another dense layer with ReLU activation, and again some dropout. And our output layer is going to be a single node, because remember, we have a binary classification problem, and we're just going to use the sigmoid activation function, so we constrain that output to the interval from zero to one. And if we look at the summary of this model, we can see we have about 139,000, or exactly 139,521, trainable parameters. Now let's train our model. What we're interested in is a couple of metrics, and we want to spend some time on the idea of how to read these metrics. In this little table here, we can see zero and one in the top row, in bold. Remember, zero represents a valid transaction and one a fraudulent transaction. So that's what the model predicts.
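A sketch of that architecture in Keras might look like this; with a 29-feature input, the parameter count works out to the 139,521 mentioned above:

```python
from tensorflow import keras

# Sketch of the architecture described above: three Dense(256, relu) layers,
# dropout of 0.3 after the second and third, and a single sigmoid output node
model = keras.Sequential([
    keras.Input(shape=(29,)),             # 29 features after dropping Time and Class
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.3),            # constrain the hypothesis space
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation="sigmoid"),
])
```

The sigmoid on the single output node is what constrains the prediction to the interval from zero to one for this binary problem.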
On the left-hand side, up and down, we have the actual zero and the actual one; across the top, in bold, we have the predictions. Now, when we talk about these metrics, we have to choose either the one or the zero as the positive class, and the other as the negative class, and that really depends on the problem that you're dealing with; in each instance it will be different. We don't mean positive and negative in a human psychology sense. We just have to choose one of these as our positive and one as the negative in a way that makes sense for the type of analysis that we're trying to do. In this instance, what we've chosen is that the one, the fraudulent transaction, is our positive class. Of course, we don't want fraudulent transactions, so fraud is a negative thing, isn't it? But as I said, that's not the psychology that we attach to it. So the one, the fraudulent transaction, is our positive class, and that means zero, the valid transaction, is our negative class. If the actual value was zero, so it's a valid transaction, and the model predicts the bold zero at the top, on that intersection we see TN: that's a true negative, because a valid transaction is our negative class. If it really was a valid transaction and the model predicts a valid transaction, that's a true negative; it got it right. On the main diagonal, at the bottom right, we see TP, true positive. If it was a positive case, a fraudulent case, and the model predicts the bold one at the top, that it was a fraudulent transaction, that's a true positive. Then across from that, we have an actual class of zero, but the model predicts a one: that would be an FP, a false positive. And if it really was a fraudulent transaction and the model predicts a valid transaction, we predicted a zero, and that's a false negative.
And we have to decide how expensive these mistakes are; in this instance, expensive obviously pertains to financial expense, but in another problem set, if we deal with, say, human disease, expense might be how bad it would be for a patient if a diagnosis is missed, or whether you would rather accept a higher number of false positives instead of having false negatives. So you have to evaluate the real-world problem that you are dealing with. And here are the metrics. The first ones are absolute values: how many were true positives, how many were true negatives, and then the false positive and false negative counts. Then we have these proportions, the first one being recall, which in healthcare, for instance, or in other domains, might be called sensitivity. That's the true positives divided by true positives plus false negatives. The false negatives, remember, are actually also positive, so it's the true positives divided by how many are actually positive. Specificity is the other way around: it's the true negatives divided by how many are actually negative, so how many of the actual negatives are correctly predicted to be negative. What we can think of for sensitivity, or recall, is: how many of the actual positive cases are going to be picked up? And for specificity: how many of the actual negative cases are going to be picked up? Then precision, or positive predictive value, is: given that a positive class was predicted, what proportion of those predictions is actually positive? And negative predictive value: if the model predicts a negative, what proportion of those predictions is actually negative? Now, the metrics we're going to store as a Python list. From keras.metrics we have false negatives, false positives, precision and recall, and we're giving them all a name.
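With class 1 (fraud) as the positive class, those four proportions can be written out directly from the confusion-matrix counts; the counts below are made up purely for illustration:

```python
# Hypothetical confusion-matrix counts, class 1 (fraud) as the positive class
tp, fp, tn, fn = 66, 709, 56_160, 9   # illustrative numbers only

recall = tp / (tp + fn)          # sensitivity: share of actual positives found
specificity = tn / (tn + fp)     # share of actual negatives found
precision = tp / (tp + fp)       # PPV: share of positive predictions that are right
npv = tn / (tn + fn)             # NPV: share of negative predictions that are right
```

With these made-up counts, recall is high (88%) while precision is very low (about 8.5%), the same trade-off pattern the notebook ends up with.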
So fn, fp, precision and recall; we store those in a Python list. We're going to compile our model. We're going to use Adam, adaptive moment estimation, as our gradient descent optimizer. Our loss function has to be binary cross-entropy: we have a single output node constrained to the interval 0 to 1, so our loss has to be binary cross-entropy. And the metrics are going to be the Python list of metrics that we want to keep track of. Now, very importantly, for the class_weight argument that we're going to use when we fit our data, we have to save the weights as a dictionary, where the keys are the actual class values, the 0 and the 1, and the value for each key is the weight we computed initially by taking the reciprocal of the class counts, so 1 divided by the number of cases of each class in our target variable. We save that as a dictionary, and we can finally fit our model. So model.fit, assigned to the computer variable history: train_features, train_targets, our batch size 2048, we're going to run through 30 epochs, and we're going to set verbose to 2 so that we can see the results as they come out. Our validation data is going to be the tuple of my validation feature matrix and target vector. And then we have this class_weight argument, and we're setting that to class_weights, which, remember, is this dictionary. Now, I'm running this off of a CPU; I have not turned on the GPU accelerator here on Kaggle, so this is going to take a couple of seconds per epoch. I'll train this and I'll see you on the other side. So there we go, our model is trained. Let's look at some of these metrics. The first one we're going to look at, as far as the validation set is concerned, is the false positives: the model predicted that a transaction was fraudulent and in actual fact it was not.
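The weight dictionary and the fit call described above can be sketched like this; the fit call itself is shown as a comment, since it assumes the arrays and the compiled model from the earlier cells:

```python
# Reciprocals of the training-set class counts, as computed earlier
weight_for_0 = 1.0 / 227_429   # valid transactions
weight_for_1 = 1.0 / 417       # fraudulent transactions

# Keys are the class labels, values are their weights
class_weights = {0: weight_for_0, 1: weight_for_1}

# history = model.fit(train_features, train_targets,
#                     batch_size=2048, epochs=30, verbose=2,
#                     validation_data=(val_features, val_targets),
#                     class_weight=class_weights)
```

During training, Keras multiplies each sample's contribution to the loss by the weight of its class, so the rare fraudulent samples count for much more.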
It was a valid transaction. Because we assigned the result of training to the computer variable history, we can look at some of its attributes. The one we want is history.history, and within that val_fp, and we want the last value. So 709 of all our cases were predicted as one, but those were false positives. Then we look at the false negatives: how many were predicted as valid but were actually fraudulent? And here is the important bit. You have to decide how expensive these mistakes are in the scenario that we're dealing with; there is a trade-off in these models. We might want the number of false negatives to be as low as possible, because we don't want to miss those fraudulent transactions. But because of that, we're going to have a higher number of false positives, and we'll have to investigate some of those transactions, which comes with a cost, time, money, et cetera, even though they were actually valid. So you can play with that trade-off. What I want to do is end off with two plots. We're just going to use Matplotlib here, and we see the false positives and false negatives across all the epochs. As you can see there, the false negatives stay very low, and that's just what we want: we don't want to miss the transactions that are fraudulent. But on the other hand, we see higher numbers of false positives throughout training, and as much as we're going to flag these transactions, it's going to cost time and money, as I said, to investigate them. Then we also look at recall and precision. We can see the recall at the top, and that is quite high. Remember, recall is sensitivity: how many of the actual positive cases, in other words, how many of the actual fraudulent cases, are being picked up by this model? And that's quite high; we're sitting above 80%, 90% there. But we see a very low positive predictive value.
So given that the model says a transaction is fraudulent, how many of those are actually fraudulent transactions? That's the positive predictive value, and you see that it's extremely low. We can see that from these two values: 709 cases were marked as positive, but those were false positives. And that can be fine in this setting, in as much as we want to pick up all of the fraudulent transactions; the price that we pay is that, amongst all those that we do predict as positive, many are actually going to be incorrect. It all depends on the setting of the problem that we're dealing with. So that's it, a short tutorial explaining the Keras webpage on dealing with class imbalance.
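The two closing plots can be sketched with Matplotlib like this; the per-epoch numbers below are made up, standing in for the real history.history dictionary that model.fit fills in:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Made-up stand-in for history.history; real keys come from the named metrics
history_dict = {
    "val_fp": [2000, 1400, 1000, 800, 709],   # illustrative per-epoch counts
    "val_fn": [30, 18, 12, 10, 9],
    "val_recall": [0.70, 0.80, 0.85, 0.88, 0.90],
    "val_precision": [0.20, 0.12, 0.10, 0.09, 0.085],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# First plot: false positives stay high while false negatives stay low
ax1.plot(history_dict["val_fp"], label="val false positives")
ax1.plot(history_dict["val_fn"], label="val false negatives")
ax1.set_xlabel("epoch")
ax1.legend()

# Second plot: high recall paired with low precision
ax2.plot(history_dict["val_recall"], label="val recall")
ax2.plot(history_dict["val_precision"], label="val precision")
ax2.set_xlabel("epoch")
ax2.legend()
```

With the real training history, the same code shows the pattern described above: false negatives and recall where we want them, at the cost of false positives and precision.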