 Now we've looked in the two preceding videos. We looked at regularization and dropout regularization as methods to tackle this problem of overfitting. So whereas those two videos really were not absolutely necessary to watch, you could just read the the documents that are on our pubs or download them from github and get a sense of what we're trying to achieve here. This is the important video we're actually going to write the code and we're going to implement L2 regularization and also dropout. So I'm not going to do this in the our pubs document during this recording. We're going to use our studio here just so that I can also use the opportunity to show you one or two things in our studio. So here we go. This is the document. You can download it from github or just view it on our pubs as I mentioned. So YAML up there just to show that this is going to be an HTML document that we don't want numbered sections and that we want a table of contents. You can always just copy those. My first cell here, which won't be included in the document. So we have this include equals false statement here, and I also have a name. Now, this is automatic. Remember, you can always put names and what we're going to do is just put a bunch of names to each of the code chunks. And if you go to the bottom here, you can actually see all the names there, which allows you to quickly find where you want to go in the document. That is very helpful. So let's run this first just to set things up. And these are the libraries. We're going to use Keras. Obviously, we're going to use reader tidier, tidier, tibble, and plotly. Let's run those. And we've got some cascading style sheets as well. Now, one thing I wanted you just to notice here, just a quick sidetrack. We get these warning messages. One way to get rid of them is if we are in a code chunk like this, go to this little gear icon there and show warnings and show messages. You can click those so they off, see the green disappeared. And then this code appears your message equals false warning equals false. So you can either type that in or you can just do it from this little gear icon, hit apply. And then when you run it, those messages won't appear. We've got the cascading style sheets there. So that's a bit of a side issue. Let's get to actually implementing regularization and dropout. So we're going to do some sentiment analysis. That's very exciting. That's a kind of data that we haven't looked at before. So in Keras, we can import this data set. It contains 50,000 pieces of text. So someone wrote a piece of text and it was about just reviewing movie database, or movies at least. So you write a piece of text. So the only input is this text and some wrote long paragraphs, some only wrote sentences. So how do you convert that into something that's computable? So as I mentioned, there's these 50,000 examples. They are labeled and they are labeled as either positive or negative towards that movie. So we have the sentiment positive or negative and they were encoded as integers 0 and 1. So the way to do that is by looking at each of the specific words. And I'm going to get into that by showing you in the code what it means. It's called multi-hot encoding. So we've looked at one hot encoding. We're now going to take the feature variables and multi-hot encode them. Very exciting stuff. And I'm not going to say too much about it. Now I'm just going to show you what it looks like and then it'll make much more sense. Now the dataset that we are going to import from Keras is not a normal spreadsheet. It actually imports quite a few things. And one of the things that it imports is behind the scenes is just this enormous list of words. And it's a list of words that occur very commonly and up to a list of words that don't occur so commonly. So we can set how many of those words in there, quite a few of them. We're going to set a limit of 5,000. So we're also just going to import this list of 5,000 words. And what's going to happen is we're going to create 5,000 variables. So in our features, our feature set is going to contain 5,000 variables where each of the variables is just one word. And you can almost guess that what the multi-hot encoding is going to be about. So let's import this with the number words argument set to 5,000. And we're going to save that as IMDB. It's going to take just a little bit of time to download. There we go. And now we're going to use this dollar train and dollar test that's part of this IMDB that was imported. And we're going to create this training data and training labels and test data and test labels these four objects. And each can contain at least the train and the test set. They're going to contain 25,000 subjects each. And that is wrong. I mean, we never in real life, please don't split things 50-50. We've discussed this in large data sets like this. We might have only used 10% as our test set. But this is built into this data set. So we're going to go with the flow. And we're going to create these four computer variables, train data and train labels and test data and test set. Now we're going to use multi-hot encoding. Now as before, remember we did that for the target variable. That's not what we're doing here. We're doing the one-hot encoding. We're doing multi-hot encoding on the training and the test set, the feature variables. Now with the one-hot encoding, there was this two categorical function inside of Keras, but there isn't the similar one. So we have to write one. I'm not going to go into this. This is a bit of scary R code here. You can copy and paste this and always use this. Perhaps later when we talk about R again, we might go through what these four loops are doing and what the addressing is doing, etc. It's not very difficult when you read this. It's almost like reading English. You'll figure out what it does. And what we're going to do is we're going to take the training data and the test data. Remember those are just sentences, paragraphs, and we're going to turn them into multi-hot encoding across 5,000 variables. So we're going to in the end have this spreadsheet that contains 5,000 columns. So let's run that. And that is both for the training data and the test data. So we pass that to this multi-hot sequence. And we saw when we created this function, we had two arguments. The second argument was the number of dimensions, which are 5,000. Now, what I'm going to do is just to show you the first one, and only the first 10. So I'm using addressing here for test data. So we're going to take this data, which has 5,000 columns and 25,000 rows. I'm only going to show you row number one and columns one to 10, just to show you what the multi-hot encoding does. So there we go. Only the first 10 of the 5,000. And we see here that word number one did occur in that first subject's review. So that text that that first subject wrote, it's going to look for in that text 5,000 words. And if that word occurs in that text, it gets a one. So word number one, we don't know what these words are, and it doesn't matter. Word number one, word number two, was there. Word number three did not occur in this person's review. And word number four, five, ten, they all occurred, and then we'll have zero and ones up to 5,000 of them. And if you think about that, that is a very clever way of converting something that is this text of different links and different, whatever, paragraphs, et cetera, turn it into data that is computable. It's a very clever way of going about it. Now let's create a baseline model. It's going to be a densely connected model. Let's have a look at it. Once again, I'm using a name here, baseline model here. And as I said, when you go down here, you can actually find this baseline model because it was two hashtags. It's actually a level two heading there. And I can find baseline model chunk there, and it'll actually jump to this. Now, again, that's a bit of our studio information there for you. Let's create this. I'm going to call it baseline underscore model. It's going to be a curious sequential model. And we use the pipe symbol. Again, remember to pass this curious sequential model as first argument, which is supposed to go right there. And we can to layer these one in inside of the other. So the first one anyway, is a dense layer. It has 16 nodes, we're going to use the ReLU activation function rectified linear unit activation function. And for the first layer, remember, we have to tell it what shape, what the input shape is, and that's a 5000 element vector. And remember, I'm just using num words here, because I saved num words up in the code as a value as a object with a value of 5000. So you've got to put the input shape there. Then a second densely connected layer also with 16 nodes ReLU. And then we're going to have another dense layer with only a single unit. And its activation is going to be the logistic sigmoid function. So just sigmoid. And that, remember, the sigmoid function that gives us a value between zero and one. And it'll automatically have this split at 0.5. So a value of less than 0.5 will be coded as an output, a y hat output as zero and 0.5 up to one will be a one. And that's different from the one hot encoding we showed before. So this is a different way of going about it. If you only have a binary output zero and one, zero and one, zero and one, if that's your target, you needn't do the one hot encoding as we did before. I showed you that because that's very common to use. Because in many instances, we're not only going to have two values, we're going to have a target sample space which is zero, one, two, three, four, five, six, seven, eight, 10 of them, 20 of them, a thousand of them. And then that makes one hot encoding very useful. So in most circumstances, use one hot encoding. This is a very special case where we only have a binary output, and we're going to have a sigmoid output. We're going to compile our model then. I'm going to use the atom optimizer. And my loss function is going to be binary cross entropy. We're going to cover this in future lessons. So atom and binary cross entropy, my metric again, is going to be accuracy. And then I'm using a pipe here to just load baseline model as the first argument in the summary function. So I could also just have written remember summary and then baseline function just to show you a different way of doing it. There we go. And there we go. We see we have 80,305 parameters that have to be learned. So now let's go through and fit the data that we have. So we have baseline model there. We fit to that, the train data and the train labels. We're going to run through 20 epochs, batch size, we're going to cover that in future, 512. And as my validation data, I'm passing a list, which is the test data and the test labels. We've spoken about this before. If you watched the previous lectures, that you can use your test data as validation. And the velocity we're going to set as two, so that we get both training and test data, we the training and test data, we're going to get the loss and accuracy for both. And that's why the verbosity, the verbose is set as two. Once again, you can see message equals false and warning equals false. If I type that in, but I could also go in the little gear and you can see that they are not selected. So this is going to take a bit of time. Let's run it. And there we go. It starts as the noise outside with a neuroscience center being built just gets to a crescendo. It really is maddening to be in this office. And I spend as little time here as possible because of all that noise. We see that there is a problem here. We can see in blue the loss of the training data coming down, down, down, but look at the loss of our test data, our validation set going up, up, up. There's this huge difference between the two. And if we look at the accuracy, where's the accuracy of the training data keeps going up. We see that the validation or test set, that accuracy is really performing poorly. So there's this huge discrepancy between the training and the validation set there. In other words, we have high variance. We have overfitting. It's overfitting here because it is really, the accuracy is really good going towards 99%, 98% then end with a very small loss of 0.0583. So that's very close to what we would think of this baseline ground truth. So that's overfitting here with a very high variance. And if we look at that, we see I'm not running this on a GPU. This is a Core i7 Intel CPU and it takes about two seconds here for each of the epochs to run. So it takes quite a bit of time. Now let's simplify this model because that's one way to go about it. Let's simplify this model. So instead of 16 nodes in each, we're just going to keep everything the same, but we're going to only have four nodes. So there's not a lot of capacity here to learn from the data. Let's see what that does. Going to create that model and then let's fit the data exactly the same as we did before. And as the model runs, we can see we have a similar problem in that we still have this issue of high variance with the overfitting of our training data. Now let's just make things a lot worse and I'm going to create an enormous, well, that's a relatively enormous model here. And I'm going to have 512, 512 nodes. Now remember, my input space was a vector of 5,000. So 512, perhaps not that big, but let's put 512 nodes in each of our layers and let's see what happens. So let's create that model and let's run it. And there we go. We can see there's an even bigger problem here because this network had a large capacity to learn. It really learned the training data very well, but performed very, very poorly on our validation set. Let's scroll down. And what I'm going to do here is just use plotly. And here in figure one, I'm just going to plot the comparison of what we've done before we've had our baseline, our small and our bigger model. Let's have a look at this. I'm just going to create a data frame in the code up here. We might have a look at this in future lectures if we just look at R a bit closer. If we have time for that. So there we go. I'm just creating a bit of a data frame. And then I'm using plotly. I do have a few separate lectures on YouTube on using plotly and R. And let's just run this code because that's what is more important. We saw all of this happening in our studio on the right hand side here. But let's have a look here at figure one. I like plotly because it's interactive. I can zoom. I can save this as a PNG file separately right from here, etc. And as I hover over it, we can see what is happening. So look at our bigger train here at the bottom doing very well. We see on the Y axis, we have the epochs on the X axis, we have the loss as I specified here in the code. And there's our bigger train. And let's look at our bigger where is our bigger loss, bigger loss, bigger validation. There we go. So bigger train and bigger validation. Look at that large or very high variance that we have there. If we look here at baseline train and baseline, there we go baseline validation. So that's this green and the orange one up there still a big difference. And we look at the smaller one. It didn't perform as well obviously on the training set. But we constrained the hypothesis that's doing better on the real world or test data than the other two. You see the smaller difference there. So a very nice graphical representation of this idea of very high variance overfitting. And then how we constrain the hypothesis space, which is actually getting better with the baseline and getting even better with a smaller model. Let's now move over to L2 regularization. And straight off the bat, let me show you what's going on here. We're going to put 16 units in each of our two hidden layers. So that's the same in our output, just being one and activation being the logistic sigmoid function. And what we're doing here is we are adding kernel regularizer as a keyword argument here. And we're doing regularizer L2. So L2 regularization. And we're passing a value of lambda here of 0.01. So we use it as an argument inside of that layer as simple as that. Everything else is staying exactly the same. There we go. And let's run this model and let's see what happens. Okay, so we can see that we still have some overfitting going on here, but certainly it's not as bad as we've seen before. And then the end will do another plotty graph and we'll see the difference. Now that is implementing regularization. We've got it. Let's do the baseline, which compare the baseline and regularization in a plot here. Let's have a look at that. And if we look at that, there was our baseline training and up here was our baseline validation. And we can see the regularization training and the testing much closer to each other than the baseline. So we made some improvement. Let's do some dropout. Now in the code here, where as we use dropout or regularization, at least as an argument, the dropout here, we're actually going to use a dropout layer. So we're going to, there are different ways of doing this in Keras and Intensive Flow. But here, we're just going to add them as a separate layer. I want to show you different ways of doing it. So layer dense, it's going to have 16 activation value. Our input shape is still going to be that vector of 5000 elements. Then a dropout layer set at 0.6, another 16 unit, another dropout of 0.6, sigmoid, everything else stays exactly the same. So note that we use the pipe and we feed that to the next layer and it's a layer, we had layer dense and then layer dropout, layer dense and layer dropout. So that is how we introduce this idea of dropout to the model. Let's run this model and see what it does. Okay, so there we go. We see another improvement as we look on the right hand side here. Let's go down and just compare that to the baseline with our plotly code here. And there we go. Look at this difference. So here we have the baseline again and then here we have the dropout training and the dropout validation. So let's just compare the whole lot to each other. Let's just have a look at this, just plotting the regularization versus dropout in this instance. And there we go. We see the L2 train and L2 validation at the top and in the middle here we see the dropout train and the dropout validation. So in this instance dropout helped us quite a bit. It was quite a bit better than regularization. Now that's by no means the norm. These are hyper parameters both in the design. I mean we had two layers, we had different number of nodes. We could have had three layers, four layers, etc. Different parameters we could have chosen for our dropout, different for our regularization. These hyper parameters, what we used here and the improvement that we saw here, that was specific to this data. If you use other data it's not going to look the same. The dropout is not going to necessarily be better than the regularization. These hyper parameters you have to sit with, you have to play with over and over and over again. That is unfortunately the way deep neural networks work. You have to work at it. It's time consuming. If your models are large and they run over many hours, you really have to look at multi-GPU systems just to maintain your sanity because you want it to run as quickly as possible. So that in short was how to implement regularization and dropout. It's very easy to do and use your data sets. Try and implement these to download this code and do exactly this by trying to use your own code as well. Perhaps in the future I'll just show you how to create a bit of your own code and how to introduce your own code. And that's it for regularization and dropout.