Okay, so far we've looked at momentum, which keeps moving in the same direction to smooth out the wiggles, and we've looked at renormalizing the features and the outputs of each layer within a network. All of the sophisticated deep learning gradient descent methods use those, along with mini-batches, but they also use methods that adjust the learning rate for every single weight separately. There are a bunch of reasons for that, and we'll get to them in a second, but first let me just note that as you go deeper and deeper in a network, the gradients at different layers look quite different, and one wants weight-specific ways to adjust the updates that take into account how fast each weight has been changing. The other thing that happens is that periodically the weights, or the gradients, become enormously large or small, and many people put in what they call gradient clipping: if the gradient is bigger than, say, a million, truncate it to a million. Sometimes they also clip the bottom side: if its absolute value is smaller than 10 to the minus 4, set it to 10 to the minus 4. Enormously big gradients can lead to instabilities, which we want to avoid, but beyond clipping, all of the sophisticated methods use weight-specific learning rates. I'm going to show a set of these, starting with the simplest one, AdaGrad, and adding in more features.

So AdaGrad, the adaptive gradient algorithm, adapts the learning rate for each parameter based on how big its previous gradients have been. And the algorithm is really simple. It says: update the weights, each weight individually, by subtracting off a global learning-rate constant, call it eta if you like, divided by a scaling term for that weight, v at time t plus 1, and multiplied by the gradient of the loss function for that weight at the current time. This is to be interpreted as applying to each weight separately. And the scaling term is just the square root of the sum of the squares of all the gradients of that particular weight so far, an L2 norm of how big the previous gradients were.

What does that mean? If this weight has been changing a whole bunch, we divide by how big its gradients have been on average, and its step gets smaller. If this weight has been changing very, very little, maybe because we've hardly ever seen the features it connects to, then when we do get a gradient we divide by something very small, and its step gets bigger. So this tends to shrink the updates for weights with historically big gradients and boost the updates for weights with historically small ones, adaptively, weight by weight. Cool.

This can then be modified in a bunch of ways. One modification is called RMSProp, for root mean square propagation, and the idea is to use a moving average instead of a total sum. Exact same update, but now v is not the sum of all the squared gradients; it's an exponentially weighted moving average: one minus alpha times the current gradient squared, plus alpha times the preceding average. That keeps a running average that gradually forgets the gradients from earlier.
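Just to make those concrete, here's a minimal NumPy sketch of one clipping step, one AdaGrad step, and one RMSProp step. This isn't code from the lecture; the function names, the hyperparameter values (lr, alpha, eps), and the clipping thresholds are all illustrative choices, not values given above.

```python
import numpy as np

def clip_gradient(grad, max_abs=1e6, min_abs=1e-4):
    """Illustrative clipping as described above: cap huge gradient entries
    and floor tiny ones, keeping the sign. Thresholds are just examples."""
    return np.sign(grad) * np.clip(np.abs(grad), min_abs, max_abs)

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad update: accumulate the sum of squared gradients for each
    weight, then divide that weight's step by the square root of the sum."""
    accum = accum + grad ** 2                   # total squared gradient so far
    w = w - lr * grad / (np.sqrt(accum) + eps)  # per-weight adaptive step
    return w, accum

def rmsprop_step(w, grad, avg, lr=0.001, alpha=0.9, eps=1e-8):
    """One RMSProp update: same idea, but the accumulator is an exponentially
    weighted moving average, so older gradients are gradually forgotten."""
    avg = alpha * avg + (1 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg) + eps)
    return w, avg
```

Each function takes the current weights, the current gradient, and the per-weight accumulator, and hands back the updated versions, so you'd call it once per mini-batch inside your training loop.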
Even better than RMSProp is Adam, which is, I think, the most widely used method right now; it has momentum as well as the moving average. So it keeps m, which is the weighted moving average of the gradient, and v, which is the weighted moving average of the gradient squared, and combines the two of them in the update equation: m supplies the direction of the step, and the square root of v supplies the normalization constant for each weight. There's a small sketch of the update at the end of this section. So we've seen a bunch of these, all looking basically the same, all adjusting how fast every single weight moves.

Well, there are a bunch of cool places to play with these. In your notebook is one of them, which I rather like. You can pick any starting point, click on it, and watch each of the methods race along to see which local minimum it converges to. Give it a shot, play with it, see how it works, and see how the different methods converge in different situations.
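And here is the corresponding NumPy sketch of an Adam step, again not code from the lecture. The bias-correction lines (m_hat, v_hat) are part of the standard published algorithm even though we didn't discuss them above, and the hyperparameter defaults (lr, beta1, beta2, eps) are the usual textbook values, not numbers from this lecture.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1).
    m: weighted moving average of the gradient (the momentum part).
    v: weighted moving average of the squared gradient (the RMSProp part)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction, standard in
    v_hat = v / (1 - beta2 ** t)                 # the published algorithm
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight adaptive step
    return w, m, v

# Toy usage: minimize f(w) = sum(w**2) from a random start.
w = np.random.randn(5)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    grad = 2 * w          # gradient of sum(w**2)
    w, m, v = adam_step(w, grad, m, v, t)
```

The point of the toy loop is just to show how the state, m, v, and the step counter t, gets threaded through from one update to the next.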