In this video, I want to talk to you about regularization. We looked at the issue of high variance, where the model we build is just too specific to the training set and doesn't generalize well, and regularization is one of the ways we can tackle this problem. When you come to write the code, it amounts to just a few simple additions, a few characters that you type when you create the model. So strictly speaking, this video is not necessary. You don't have to watch it, especially if you're not interested in the mathematics behind what is going on. In that case I'd advise you to just listen with one ear, or simply read the RPubs document, or download it from GitHub, and read through it. As with the previous one, having the document in front of you and going through it a couple of times is perhaps better than listening to this video. But if you're interested in the mathematics, or at least in the philosophy and a light touch of the mathematics, watch on. So the problem here is that we have high variance: the model fits our training data too well. It has memorized that data, and we need to do something about it. The first concept I really want to talk to you about is the hypothesis space. One way to look at the hypothesis space is as the set of all possible solutions our cost function can reach through gradient descent, all the optimal values of the weights and biases, the parameters. All of those values together make up the hypothesis space. One approach is to constrain this hypothesis space so that not all possible values are available to the model during gradient descent, this forward and back propagation process. If we can limit that hypothesis space, then perhaps we can get to some values.
They might not be the best for the training data, but they might generalize better to the test set or real-world data. One way we can go about this is to consider a concept called complexity. The hypothesis space is large: the more feature variables we have, and the bigger the network, the more parameters there are, and so many possible values for them. If we can create a measure of complexity, then perhaps we can start cutting down on that complexity. The way this is done is to take the hypothesis space and make a nested sequence out of it. We use the double-struck script notation for the H, denoting that each of these is a set, and hypothesis subspace H1 is contained within H2, which is contained within the total space H at the end. So we create a sequence in which every subspace lies entirely within the next. This is the big one, and then we take smaller and smaller subsets of it. If we can create this, we can start thinking of attaching a number, some value, to the hypothesis space and deciding where we want to make the cutoff, so that anything beyond that point is no longer available as a solution, because that is perhaps where the solutions lie that are just too good for the training set and not so good for the rest. So let's look at ways we can denote complexity. Can we wrap our heads around a way of classifying complexity? Here are four for you. I'm just considering simple linear regression, which we started the series off with, but it extends naturally, as I write here, to the parameters of neural networks. One way is to look at the dimensionality of the input space: how many feature variables you have. If you have fewer feature variables, you cut down on complexity.
That is a measure of complexity. Now, that's not something we'll use here, and it has nothing directly to do with regularization; I just wanted to introduce the concept of complexity, because I can now constrain this measure by taking only four of my possible 34 feature variables. Or any number, really, but you can see that I get a sequence of complexity if this is my measure: two feature variables is less complex than five, which is less complex than 10, which is less complex than 20. Another way is to look at the number of non-zero coefficients. Remember, in linear regression, and even in the forward propagation section, we take a weight times x sub one plus a weight times x sub two, and so on; all these weights, these parameters, are coefficients. If I count up all the ones that are non-zero, that can be a measure of complexity, and I can cut down by just throwing some of them away. This measure of complexity is known as the script L sub zero, that is, L0 complexity. Another way is to take all the weights, my parameters, take their absolute values so they're all positive, and add them all up, even if there are millions of them. That is called L1 complexity, or lasso complexity. And if instead I square each of them and then add them all, that's called L2 complexity, or ridge complexity. That's the one we're going to concentrate on here when we talk about regularization, because you get L1 and L2 regularization. If you take measure number four, that allows me to symbolize it: let's call our symbol for the complexity based on this little equation omega. We can then choose some element of omega, call it r, such that we can constrain the hypothesis space.
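To make these three complexity measures concrete, here is a small sketch in Python with NumPy (the course's code is in R with Keras, but the arithmetic is the same; the weight values here are made up for illustration):

```python
import numpy as np

# A hypothetical vector standing in for all of a model's weights.
w = np.array([0.0, -3.0, 0.5, 0.0, 2.0])

# L0 complexity: the number of non-zero coefficients.
l0 = np.count_nonzero(w)      # 3

# L1 (lasso) complexity: the sum of absolute values.
l1 = np.sum(np.abs(w))        # 5.5

# L2 (ridge) complexity: the sum of squares.
l2 = np.sum(w ** 2)           # 13.25

print(l0, l1, l2)
```

Notice that zeroing a weight lowers all three measures at once, which is why shrinking weights makes a model "simpler" by any of these yardsticks.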
So we're only going to look at hypotheses that fall within this level of complexity; nothing beyond it, to the right-hand side. And now let's get to regularization, because we can set this r. It's not literally a value we set; think of it as a concept: somehow we're going to build a constraint into the system. There are two ways to go about it when we look at the cost function. In equation three I'm just showing you the cost function again: this curly C is a function of all the weights and biases. Remember, this cost function is a function of the predicted values and the actual values; if it were linear regression, it would be the squared difference between them, and so on. It's going to be some measure of error between the predicted value and the actual value; we sum all of those and divide by how many there are. So that's the idealized, or I should say generalized, form of the cost function. The whole idea is that as we do forward and back propagation, we get these optimal values of all the weights and biases, the parameters, but we want to constrain that somehow. One way is to build the constraint directly into the function; that's called an Ivanov-type constraint. The other way is called Tikhonov regularization, and that's what we do here: we add a term, as you can see in equation four. It looks horrendous, but we'll break it down; it's actually very, very simple. When you see notation like this you think, wow, but it really is very fancy writing for something very simple. We have this lambda value. Lambda is a hyperparameter, something you will have to decide on when you design your network, divided by 2m.
And what we do is take the square of something to do with the weights; you can see that in the term. Now look at what's happening: we are adding something to the cost function. The cost, remember, is a value in the end, and we're making it larger; we're increasing the cost. When back propagation, through gradient descent, tries to minimize this cost function, it now has something extra to deal with. And one way it can minimize the cost is by making each of the weights, because you can well imagine this is some matrix of all the weights, tend towards zero. If they tend towards zero, the model becomes simpler, less complex. If some of the values in W get closer and closer to zero, it's easy to imagine that we have a simpler model; we may even effectively knock out some of these weights. And if it's a simpler model, we have actually constrained the possible values it could take. The way we did that was Tikhonov regularization: we add a term. Since it's the addition of a positive value, we thereby force gradient descent to select smaller and smaller versions of all the parameters. Another thing this actually does: think of the tanh function we use for activation. If we make the weights smaller, closer to zero, we stay in the near-linear part of the tanh curve, and then we almost have a linear model. Remember, a linear model gives us a much straighter decision boundary, and the decision boundaries we discussed in the previous video bring us closer to that kind of model. The end effect of all these ideas is that we are constraining the hypothesis space: not all values are possible anymore.
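As a sketch of the regularized cost in equation four, assuming mean squared error as the base cost (the function name, data, and weight values here are my own, purely for illustration):

```python
import numpy as np

def l2_cost(y_hat, y, weights, lam):
    """Base cost (mean squared error here) plus the ridge penalty
    lambda / (2m) times the sum of all squared weights."""
    m = y.shape[0]
    base = np.mean((y_hat - y) ** 2)
    penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return base + penalty

y     = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8, 0.6])
weights = [np.array([[3.0, 4.0], [2.0, 1.0], [1.0, 1.0]])]

print(l2_cost(y_hat, y, weights, lam=0.0))  # 0.0625: no penalty
print(l2_cost(y_hat, y, weights, lam=0.1))  # 0.4625: penalty added on top
```

The only way gradient descent can pull that second number back down is by shrinking the entries of the weight matrices, which is exactly the constraining effect described above.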
And that's where this idea of r being an element of omega comes in, because I can set this lambda value when I write the code. The larger I make it, the bigger the value I add to my cost function, thereby driving the values of W smaller and smaller, closer to zero. That is the thought behind this. So there is Ivanov regularization, where I actually build the constraint into the function itself; adding a term, Tikhonov regularization, is a somewhat different idea, but they all boil down to a similar thing. You can see down here the mention of Lagrangian duality theory. We're not going to discuss it, but it shows that they come down to the same thing. The ultimate goal here is just to generalize the model, and whether you build the constraint in or add another term, you're going to end up with the same thing. This is done all the time in machine learning. It's called L2 regularization because we use this squared form. Taking just this part down here, let's break it up. It looks even worse, doesn't it? But let me show you what it's all about. Remember, in forward propagation we take the weight matrix and multiply it by the vector, the column vector of the previous layer. The l here refers to the current layer, and l minus 1 to the previous layer. So to get the values inside the current layer, you take the weights of that layer and multiply them by the previous layer's column vector. Watch those videos again. To do that, just look at the dimensions of the weights: after transposing, the weight matrix should be l by l minus 1, that is, the number of nodes in the current layer times the number of nodes in the previous layer.
And remember, the previous layer is a column vector with dimensions l minus 1 by 1, and if you multiply those out, you get l by 1. So let's look at W and make it something 3 by 2: 3 rows, 2 columns. What that expression at the top does is very simple. It just runs through each of the rows of each column. It says 3 squared plus 4 squared, then 2 squared plus 1 squared, then 1 squared plus 1 squared, and you add all of those and get 32. It's as simple as that: the whole thing just means square all the values and add them up. And that's ridge, or L2, regularization. That's all we're going to do: we add this term, which squares all the values in the weight matrix and sums them. Now think about taking the derivative. If this function C is what we would have had without the regularization term, the derivative of the extra term is very simple, and that's why we put the half in there. It's just a scaling factor, because when the 2 comes forward during differentiation it cancels against the half, so we're left with just lambda over m times the weights. Remember, these are just additions, and if you think back to derivatives, when you have a bunch of terms that are added, you can take the derivative of each term separately. That's all that happens, so this term is actually very easy. When we update the weights, we just subtract the learning rate times this very simple derivative, so it doesn't add much to the computation. And in the end, you're actually just going to write a line of code to add the regularization. But I think now you'll have a deeper understanding of what this regularization is. You can, of course, also do L1 regularization.
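The worked 3-by-2 example and the derivative can be checked in a few lines (a sketch; the values of lambda, m, and the learning rate are illustrative choices of my own):

```python
import numpy as np

# The 3-by-2 weight matrix from the example.
W = np.array([[3.0, 4.0],
              [2.0, 1.0],
              [1.0, 1.0]])

# Square every entry and add them all: 9 + 16 + 4 + 1 + 1 + 1 = 32.
sum_sq = np.sum(W ** 2)
print(sum_sq)  # 32.0

# Derivative of (lambda / 2m) * sum_sq with respect to W: the 2 from
# differentiation cancels the half, leaving (lambda / m) * W.
lam, m, alpha = 0.1, 100, 0.01
penalty_grad = (lam / m) * W

# Weight update: subtract learning rate times (usual gradient + penalty gradient).
dC_dW = np.zeros_like(W)   # placeholder for the unregularized gradient dC/dW
W_new = W - alpha * (dC_dW + penalty_grad)
```

Because the penalty gradient is just a scaled copy of W itself, every update nudges each weight a small fraction of the way towards zero, which is why this is also called weight decay.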
We just add the absolute values instead; we looked at that at the top. Adding the absolute values of all the weights would be L1 regularization; here we have L2 regularization. You can see it really is very simple. And that's it: we're adding to the cost function, thereby driving some of the weights towards zero, making for a less complex system. That really constrains the weights, the parameters we are trying to learn, to only certain values, with only part of the hypothesis space available. We're constraining the hypothesis space. So we're going to get worse performance on the training set, but the model might generalize better, and that's what we are after: generalizing to the test set or real-world data. So in short, that is regularization. The document will be available; read through it. It will make sense after you go through it a couple of times. It's actually very easy to understand, and the calculations are very easy. Eventually, when we do write the line of code using TensorFlow or Keras, with the TensorFlow backend in R, it really is simple, as long as you have some understanding of what is happening behind the scenes: we're just trying to constrain the hypothesis space. I'll speak to you again.
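As a postscript, here is a tiny numerical sketch of the trade-off described in this video: the same gradient descent run with and without an L2 penalty, on made-up data (all names and values are illustrative, not from the course). In Keras, the same idea is a one-argument addition to a layer, an L2 kernel regularizer.

```python
import numpy as np

# Synthetic regression data with known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def fit(lam, steps=2000, alpha=0.05):
    """Gradient descent on least squares with an L2 penalty of strength lam."""
    m = len(y)
    w = np.zeros(3)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / m + (lam / m) * w
        w -= alpha * grad
    return w

w_plain = fit(lam=0.0)
w_ridge = fit(lam=50.0)

# The penalized fit ends up with a smaller sum of squared weights,
# i.e. it sits in a more constrained hypothesis space.
print(np.sum(w_plain ** 2), np.sum(w_ridge ** 2))
```

Raising lambda shrinks the learned weights further, trading training-set fit for the simpler, better-generalizing model the video is after.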