We've seen regularization by putting penalties on the weights, but another, equally widely used regularization technique is built into stochastic gradient descent itself. Some of its uses are obvious and some are rather more subtle. Starting with the more obvious one: when we train a neural net, we start with small random weights, typically drawn from a Gaussian. As we take gradient steps over mini-batches, the weights will on average get bigger and bigger. If we do early stopping, we stop before the weights get as big as they might have gotten, so we have in effect done weight shrinkage, in a sense similar to L2 or L1 regularization. Pretty much everyone in deep learning uses both weight penalties like L2 or L1 and early stopping to keep the weights small.

But stochastic gradient descent has another, less obvious property. As we saw before, people tend to use large neural nets that could, and maybe should, overfit. They can train to almost zero training error with very small testing error. The magic of stochastic gradient descent is that it tends to converge to relatively shallow, flat minima in the loss function, which tend to generalize better. Different gradient descent methods really do converge to different solutions. Shown here is a nice set of results: on the left-hand side, training error versus mini-batch epoch for a number of different methods. You can see Adam with default settings up here and Adagrad below, methods we've seen before. On the right-hand side is the testing error; note that the axis does not go from zero to 20 but from roughly 7 up to 20, so it's not the case that the testing error is really as low as the training error. The training error for the best methods, the red curve, goes almost to zero, while the testing error goes to about 7.65 if you do just vanilla stochastic gradient descent. So what do we see? First of all, Adam out of the box, not tuned, gives a training error and a testing error of about 13%, and if you pick better hyperparameters for it, it does a better job: the purple curve down here is more accurate. The difference between the best method, for example pure stochastic gradient descent, and the worst, raw Adam, is almost a factor of two: roughly 7% testing error versus 13 or 14%. So how you do the convergence makes a lot of difference to what you converge to, and perhaps surprisingly, the methods that overfit the most on the training data are often doing the best on the testing data.

So what's different across these methods? Can we get some intuition? I've mentioned this before, but let's come back and think more carefully about it. If you take a small gradient step, you're going to do a better job of converging carefully and winding your way down to find a perhaps narrow, deep local optimum. That could be good, because you're doing a good job of fitting, or it could be bad, because you're overfitting; but it certainly provides less regularization than a large step. If you take a step with a large learning rate, you're going to tend to jump over some of these deep ravines and land in a very shallow, broad, wide minimum of the loss function, so you're going to fit the data less well. In general, these broad, flat minima are more robust: they're less likely to overfit, and you're less likely to get trapped in some bad local minimum.
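If you want to play with two of the ingredients just discussed, early stopping and the choice of optimizer, here is a minimal sketch assuming a PyTorch-style setup. The model, the synthetic data, and the hyperparameters are illustrative placeholders, not the ones behind the figure; the point is just where the optimizer choice and the early-stopping check plug in.

```python
# Minimal sketch (PyTorch): early stopping plus a choice of optimizer.
# The model, data, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data standing in for a real training/validation split.
X = torch.randn(1024, 20)
y = X @ torch.randn(20, 1) + 0.1 * torch.randn(1024, 1)
train_ds = TensorDataset(X[:768], y[:768])
val_X, val_y = X[768:], y[768:]

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

# Swap between plain SGD and out-of-the-box Adam to compare convergence behavior.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)   # or: torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    for xb, yb in DataLoader(train_ds, batch_size=32, shuffle=True):
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()

    with torch.no_grad():
        val_loss = loss_fn(model(val_X), val_y).item()

    # Early stopping: halt before the weights grow as large as they could,
    # which acts as an implicit shrinkage penalty.
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```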
Of course, if you take too big a step, you'll end up underfitting and landing in a really poor local minimum that's good neither on the training data nor the testing data. So tuning these step sizes, how big a step you take, particularly early in training, affects where you converge and where you end up at the end.

There's something else you can adjust that has a very similar effect, which is the batch size. A smaller batch size gives you a noisier estimate of the gradient, so it bounces you around more, and it may therefore also tend to improve generalization. A larger batch size does a better job of converging you smoothly to a local optimum, perhaps too good a local optimum. So increasing the batch size is like decreasing the learning rate. If you halve the learning rate, you go half as fast; if you instead make the batch size somewhere between two and four times bigger (since the noise in the gradient estimate shrinks roughly like one over the square root of the batch size), you descend at about the same effective speed. So you can adjust the regularization not just with L2 penalties, but by taking larger steps or using smaller batch sizes. Give it a try.
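To make that square-root relationship concrete, here is a small sketch in plain NumPy (no particular framework assumed). It uses a toy least-squares problem, evaluates the mini-batch gradient many times at a fixed point, and prints how its standard deviation shrinks as the batch size grows; the numbers and problem are placeholders, but the roughly 1/sqrt(batch size) trend is the thing to look for.

```python
# Sketch: how mini-batch size affects gradient noise on a toy least-squares problem.
# The std of the mini-batch gradient shrinks roughly like 1/sqrt(batch_size),
# which is why a 4x larger batch behaves somewhat like a halved learning rate.
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=N)

w = np.zeros(d)  # evaluate gradients at one fixed point in weight space

def minibatch_grad(batch_size):
    idx = rng.choice(N, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of mean squared error

for B in (8, 32, 128, 512):
    grads = np.stack([minibatch_grad(B) for _ in range(500)])
    noise = grads.std(axis=0).mean()
    print(f"batch size {B:4d}: average gradient std ~ {noise:.3f}")
```

Each quadrupling of the batch size should roughly halve the printed noise, which is the same trade you make when you halve the learning rate instead.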