In this video, I want to briefly discuss some techniques for improving the training phase of neural networks. Training is a very empirical process; we're not talking about inferential statistics, where we have well-defined equations and tests. It is also iterative: you run a model, look at the results, change things, and repeat this over and over again. For very good training you also need very big data sets. All of this takes a toll on your computing resources and your time, so we have to look at things we can do to improve the learning process. In this video I'm going to mention a few of them. I would rather you read the document that's available on our pubs, or download the RMD file from GitHub and have a look at it yourself. There is some mathematics in it, but I've kept it very easy. Here I'll just mention a few concepts, so that when we do get to the code you at least have some intuitive feel for what is happening and why we are using it. One of the first things we want to do when we bring the data in is what is called normalizing the input features. We can also call it standardizing; there are different forms of scaling the input. Imagine you have a number of feature variables and they all have very different scales: some might be just fractions, some might have data point values in the thousands. Scales that different can lead to a cost function whose multi-dimensional gradient surface is very elongated in some directions compared to others. What you want to do is bunch those variables up so that they're all on the same scale. One way to do that is to normalize, or more precisely, to standardize.
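Just to make that concrete, here is a minimal sketch in plain Python with made-up numbers (standard library only; this is not the course's RMD code):

```python
import statistics

# Two hypothetical features on very different scales
income = [42000.0, 58000.0, 31000.0, 77000.0]  # values in the thousands
ratio = [0.12, 0.07, 0.31, 0.22]               # values that are fractions

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

income_std = standardize(income)
ratio_std = standardize(ratio)
# Both features now have mean 0 and standard deviation 1,
# so they are on the same scale.
```

The function name and the numbers are invented for illustration; the point is only that both columns end up on one common scale.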
You calculate the mean and the standard deviation for each of your feature variables. Then you go feature variable by feature variable, data point value by data point value; that's what the x subscript i is. From each value you subtract the mean for that variable and divide by the standard deviation of that variable. That's a very common thing to do. When we start looking at images, an image is made up of pixels, and a pixel is just some brightness value from 0 to 255, so we can simply divide every value by 255, the maximum; that would be another way of scaling. Standardization, though, is one of the most common ways. You can imagine that if you do it, the variable ends up with a mean of 0 and a standard deviation of 1, and that's why we call it standardization. There is an important thing to note, though: when you split your data into training and test sets, you calculate the mean and the standard deviation on the training set. Then you use those very same values on the test set. Don't compute the mean and standard deviation on the whole data set, and don't compute the test set's own mean and standard deviation for its normalization; you use the parameters from the training set and apply those same parameters to the test set. That's a very important point, and it sometimes gets missed. Another thing I want to mention is vanishing and exploding gradients. Just think about fractions: if I take a half and multiply it by two thirds (multiplication with real values is commutative, so I can switch those two around), I get a third; a quarter times nine tenths is nine fortieths. In general, if I have A over B times C over D, I get AC over BD, and if I put a constraint on this so that A is always smaller than B, then A over B is always going to be some fraction less than 1.
Similarly, C over D is going to be some fraction less than 1, and I make A, B, C and D all positive integers; those are my constraints. If I do this multiplication, what we always find is that AC over BD is smaller than A over B, and smaller than C over D as well. So what I'm trying to say is that this product is always going to get smaller and smaller and smaller. Now think about what we do with the weights (forget the biases for the moment). We start with a weight matrix times the input vector; that gives us something which goes through an activation function, the addition, all of that, and then I take weight matrix 2 and multiply it by that result, and so on and so on. You can see that if these weights are all tiny, between 0 and 1, the products are going to get smaller and smaller and smaller, and the derivatives smaller and smaller, until they effectively vanish. That is what we call the vanishing gradient problem. By the same argument as before, if these values are all more than 1, the product is just going to get bigger and bigger and bigger, and you get the exploding gradient problem.
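You can see the effect numerically with a toy product of repeated factors (plain Python, made-up factors; no actual network involved):

```python
# Toy illustration: repeatedly multiplying values below 1 shrinks the
# product toward 0 (vanishing), while values above 1 blow it up (exploding).
small = [0.5, 0.9, 0.8, 0.7, 0.6] * 10  # fifty factors, all below 1
large = [1.5, 1.2, 1.8, 1.1, 1.4] * 10  # fifty factors, all above 1

vanishing = 1.0
for f in small:
    vanishing *= f  # shrinks toward zero

exploding = 1.0
for f in large:
    exploding *= f  # grows without bound
```

After fifty factors the first product is vanishingly small and the second is enormous, which is exactly what happens to gradients propagated through many layers of consistently small or consistently large weights.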
For the same argument you're going to get that, and similar arguments hold for the back propagation step with its updates, of course; the whole system means you get either the vanishing gradient problem or the exploding gradient problem. One way we can mitigate that is in how we initialize our weight values when we set up the problem. Remember, when we start a neural network, those initial weights for the first multiplication are just random values. We can normalize them: we set the variance of the weight matrix to the reciprocal of the number of input nodes that the matrix is to be multiplied by. Read that sentence again; it's not that difficult. One small caveat: if we use ReLU, we use 2 over n instead of 1 over n. So we take n, the number of nodes feeding into that layer, and scale each random value in the matrix so that the variance comes out right (in practice, multiplying each value by the square root of 2 over n); for ReLU it's 2 over n, for the others it's just 1 over n, and for tanh activation functions this used to be called Xavier initialization. It's just something we can set in code. The next thing I want to mention is mini-batch gradient descent. Think about an epoch: before you take any step in any direction down your cost function's gradient, you have to work through all of the values, and if we have very big data sets of millions of samples, that's a lot of computation before you can take one tiny little step. So there was this invention called mini-batch gradient descent, where you break up your data set, all the rows, into little sections called
mini batches. Instead of using the whole data set for the forward propagation and back propagation steps that update your weights and biases, you do them on one small batch, then the next batch, and the next, and on each of those you take a little step. So in running through one epoch, which means going through the whole data set, you've already taken many steps before you reach the end of the data. We call this mini-batch gradient descent because you create these little mini batches, but when we write code it's usually referred to as batch size. Strictly speaking, batch gradient descent refers to using the whole data set, and using sections of it is called mini-batch gradient descent, but in code we set the batch size, which refers to the mini batch. The extreme is a mini batch of one: for every single example you get a predicted y, evaluate the cost function and update. That's called stochastic gradient descent, and you can imagine that with a million rows of data, that's a million little steps through one epoch, and it's going to wander around quite a bit; not aimlessly, but quite a bit, and there are ways to mitigate that. The usual thing is to go somewhere in between, and we set these mini-batch sizes to powers of two, usually 128, 256 or 512, which works really well with the memory architecture of many systems. What you do need, depending on the type of data you have, is for these batches to fit within the memory capacity of your CPU or GPU. If you can set the batch size so that it maximizes that memory potential while still fitting, do so; if it doesn't fit, you can't run it. That's the idea behind mini-batch gradient descent. Then there's gradient descent with momentum. Somehow, what we're trying to
do is speed up gradient descent in the direction it eventually needs to go. The idea behind it, and behind RMSProp (root mean square propagation), which we'll get to as we scroll down, is quite technical: both use something called the exponentially weighted moving average. Let me stop and try to explain that very quickly. Imagine you have a set of data points. It's very easy to calculate the mean, but you can also have a moving mean, a moving average. In other words, you start at the first value; imagine it's 10, so the average so far is just 10. You go on to the second one, which might be 20, so now the average of those is 15. The next might be 17, and the average of 17 and the previous 15 is 16. So you have this moving average as you go through the values. What you could also do is weight the previous values so that they don't all count equally toward the current mean as you run through the numbers. A very good way is to weight them exponentially, so that the values just prior to the one you're calculating at the moment count more than those further back in time. That's called an exponentially weighted moving average, and it's the idea we use with momentum: we keep track of the gradient values and average over the last several of them, but the further back in the gradient descent they are, the less they contribute to the current average. So instead of the raw gradient for the current descent step, you actually use an average over the last several, and because you retain some of that history, you move forward more quickly. That's called momentum. So here I just have some data point values; you can look at the code, it's easy to set up some example data, X and Y, and
we're just going to add some random noise to it. It's in the form of a sine function with some random noise, and you can actually see the true sine function there; 1 is added to every value, so it starts at 1 rather than 0 like the normal sine function. If we add an exponentially weighted moving average (you can look at the code to do that), you see this green line, which takes the data points from left to right as it goes and calculates the exponentially weighted moving average. Of course it starts at zero, because at the first step nothing came before, so there is no average yet. It takes some time, and you see that it always lags a little behind the true value, which in this instance is the sine function; that is what you want. The equation for it is there: the new moving average is some fraction beta times the previous moving average, plus one minus beta times the current value. By setting beta smaller, the updates can be much more ragged up and down, because the average pays much more attention to just the most recent values and the decay backward in time is very quick. You can play with this value; usually we set it at about 0.9, and you can play with the code as well: put in different values for beta and you'll see this green line change quite dramatically. If you expand the equation and approximate a little, what you find is that the number of previous data points over which the average is effectively computed is approximately one over one minus beta. So what we do with momentum is take the updates in each weight, that is, the gradient with respect to that specific weight,
the partial derivative for that specific weight, and we keep track of it as we run through the gradient descent steps. It's the same idea: we have a beta value between 0 and 1, we take beta times the previous average, add one minus beta times the current partial derivative, and that gives us the new value, which is what we plug in when we do the weight update. Another way to go about it is RMSProp, and it is almost exactly the same, except that we square the partial derivative in the average, and when we do the update we take the partial derivative and divide it by the square root of the value we calculated. It's actually a very simple concept. What's very powerful is to combine this idea of momentum with RMSProp, and when we combine them it is called Adam; you've seen we've used Adam before. That stands for adaptive moment estimation: ADA from adaptive, M from moment, and there's no E for the estimation, but it's Adam. We just combine the two. One thing I want to show you is that there's a way to get rid of that initial period of catching up. No matter which equation you use, the V equation here for momentum or the S equation here for RMSProp (I've just used rho here to indicate it's either one), you take it and correct it by dividing by one minus beta to the power t, where t is whatever step you are at (we used i before, but it's whatever step you are on from left to right). As t gets larger, beta to the power t approaches zero, so you're effectively dividing by one, and the further along you go the correction has negligible effect; you end up with the same green line, but initially it moves the line up so that it starts at a much more appropriate spot. So we do correct this, and then we just combine: the update is going to use the bias-corrected V from momentum and the bias-corrected S from
RMSProp. We divide the corrected V by the square root of the corrected S, and that ratio becomes what we multiply the learning rate by. So we've got two beta hyperparameters we can set; there are defaults, of course, and the defaults most people use are 0.9 and 0.999 (have a look at the current documentation to see what the defaults are), but you can set them yourself when you use Adam. The next concept is learning rate decay. We've always just set alpha to 0.001 or 0.003, but what you want is to start with a large learning rate and make it smaller and smaller as you move along, so that you don't overshoot when you get to the theoretical minimum. It means you end up taking smaller steps as you approach the minimum; initially the large rate at least speeds up the learning, but in the end you're no longer taking the big steps that overshoot. There are various ways to go about it: you can see one equation there, but there are many types, exponential decay, staircase decay, all sorts of decay, and the best way to look at them is to use them as we write the code. We're going to start using all of these, but at least you've now heard of them and have some idea of what is happening; that's all I'm trying to do here. The last way to try to improve your learning is called batch normalization. Just as we normalize, or standardize, our input variables, we can also normalize the values at each deep or hidden layer. What we do is normalize the values in the nodes, and those are the values before the activation function kicks in: you do all your multiplications, your weight matrix times your vector of input values, then you normalize those results, and then apply the activation function to each of the values in each of the nodes. You can see what happens there. There's also a way to rewrite this value, this normalized value, before
the activation takes place. Just to mention, you can also do it after activation; there are papers on that as well. The idea, though, is to take this normalized value and add parameters to it, so that you get this Z that I've written here with a tilde, with two learnable parameters; we'll see how to set those in a future video as we make use of batch normalization. They are not hyperparameters but learnable parameters, and we can use them very effectively to set and optimize these values before the activation kicks in, which really helps gradient descent as well. So with all of this we're really just trying to reduce the computing resources and the time that training consumes, and as I've mentioned, we're going to start using these in the very next video. As you start using them, refer back to this very simplified version; you can of course read the original papers about all of these if you're interested in the mathematics, but as long as you have some understanding of what we're trying to achieve here, that is fine. As you start using the code and see the effect these things have, that's the important part. So in the next video we're going to start looking at implementing some of these.
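As a small preview of what such code might look like, here is a sketch of the weight initialization idea in plain Python (the function name, sizes and seed are made up; real frameworks do this for you):

```python
import math
import random

def init_weights(n_in, n_out, activation="relu", seed=0):
    """Return an n_out x n_in weight matrix whose values have variance
    2/n_in (He initialization, for ReLU) or 1/n_in (Xavier, for tanh)."""
    rng = random.Random(seed)
    scale = math.sqrt((2.0 if activation == "relu" else 1.0) / n_in)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]

W = init_weights(n_in=500, n_out=100, activation="relu")
flat = [w for row in W for w in row]
var = sum(w * w for w in flat) / len(flat)  # sample variance (mean is ~0)
# var comes out close to 2/500 = 0.004, as intended
```

Each random value is multiplied by the square root of 2 over n (or 1 over n), which is what sets the variance of the matrix to the reciprocal mentioned above.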
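Similarly, here is a sketch of one Adam update for a single scalar weight, combining the momentum average V, the RMSProp average S and the bias correction discussed above (plain Python; the function and the toy problem are invented for illustration):

```python
import math

def adam_step(grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight.
    state holds (v, s, t): momentum average, squared average, step count."""
    v, s, t = state
    t += 1
    v = beta1 * v + (1 - beta1) * grad         # momentum: EWMA of gradients
    s = beta2 * s + (1 - beta2) * grad * grad  # RMSProp: EWMA of squared gradients
    v_hat = v / (1 - beta1 ** t)               # bias correction for V
    s_hat = s / (1 - beta2 ** t)               # bias correction for S
    update = lr * v_hat / (math.sqrt(s_hat) + eps)
    return update, (v, s, t)

# Toy use: minimize f(w) = w^2 (gradient 2w) starting from w = 5.0
w, state = 5.0, (0.0, 0.0, 0)
for _ in range(10000):
    update, state = adam_step(2 * w, state)
    w -= update
# w ends up near the minimum at 0
```

The epsilon term simply guards against division by zero; 0.9 and 0.999 are the default beta values mentioned above.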
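And finally, a sketch of the batch normalization idea for a single node (plain Python with made-up pre-activation values; in the real thing, gamma and beta are learned during training rather than fixed):

```python
import statistics

def batch_norm(z_batch, gamma=1.0, beta=0.0, eps=1e-8):
    """Normalize one node's pre-activation values across a mini batch,
    then scale and shift with the learnable parameters gamma and beta."""
    mu = statistics.mean(z_batch)
    var = statistics.pstdev(z_batch) ** 2
    z_hat = [(z - mu) / (var + eps) ** 0.5 for z in z_batch]
    return [gamma * z + beta for z in z_hat]  # the Z with a tilde

# Pre-activation values for one node over a mini batch (made-up numbers)
z = [2.0, -1.0, 0.5, 3.5]
z_tilde = batch_norm(z)                        # mean ~0, std ~1
z_scaled = batch_norm(z, gamma=2.0, beta=1.0)  # mean ~1, std ~2
```

The first call just standardizes the batch; the second shows how the two learnable parameters let the network choose a different mean and spread for the values before the activation is applied.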