The final mathematical concept I want to present today is the idea of natural gradients. People have noticed for a long time that neural network training suffers from what's called catastrophic interference. If you train up a network to learn the times tables — one times five, three times seven, four times eight — and then train it additionally to learn, say, 17 times 32, the new training makes it forget what it learned before. Humans are pretty good at this: when you learn new things, it doesn't cause you to forget old things. But think about how gradient descent works. As you get a new observation, you adjust the weights to do a good job on that new observation, and hence you necessarily tend to degrade how you did on the old observations. You can run through the different training points over and over again and eventually get there, but neural nets tend to be just a little bit unstable, in that if you train long enough on new data, you forget the original data you trained on before. Natural gradients try to address this problem that adding new training data leads you to forget what you've learned before.

So the idea of natural gradients is to move in a gradient direction that tries to keep the neural net from changing too much — to keep it still doing what it did before, in addition to learning the new data. That is to say, make the outputs of the neural network close to what they were before you changed it. So we want a mathematical definition of what a similar output is, rather than a similar squared error of outputs. The idea that seems to work best, and it has beautiful math behind it, is KL divergence. You can view any network which does classification as giving you a probability distribution over outputs y as a function of the inputs x, right? The neural net learns the probability of y given x — the probability distribution of the labels given, say, the images as input. What we'd like in a natural gradient is to move down the gradient, decreasing error, but to do it in a way that changes the outputs of the neural net on the other data as little as possible, right? One that keeps the KL divergence between the old output distribution and the new output distribution as small as possible.

So we can look at the equation for that. What it does, in effect, is take our now very familiar update — change each weight by negative learning rate times the gradient with respect to that weight — and put in one more term: the inverse of a matrix, the Fisher information matrix. Now, if we do this for a mini-batch or a point i, there's a d-dimensional weight vector and a d-dimensional gradient, so this needs to be a d-by-d matrix. And it turns out, through magical math which we won't cover in this course, that if you take the KL divergence between the probability distribution over y at the original weights and at the weights plus the little gradient change you add, that's roughly equal to one half times the change in weights, times the Fisher matrix F, times the change in weights again. If you don't get the details, they're not critical; the important idea is that this equation magically changes the gradient direction into one which still moves down the gradient, but tries to change the outputs of the network as little as possible. So it's cool. Why does no one actually use these things yet?
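In symbols — using my own notation, since the lecture's slides aren't reproduced here: w for the weights, eta for the learning rate, L for the loss, and p_w(y | x) for the network's output distribution — the update and the approximation look like this:

```latex
% Ordinary gradient descent vs. the natural gradient update.
% F is the Fisher information matrix (notation is mine, not the lecture's).
w \leftarrow w - \eta \, \nabla_w L
\qquad\text{becomes}\qquad
w \leftarrow w - \eta \, F^{-1} \nabla_w L,
\quad
F = \mathbb{E}_{x,\; y \sim p_w}\!\left[
  \nabla_w \log p_w(y \mid x)\,
  \nabla_w \log p_w(y \mid x)^{\top}
\right].

% The "magical math": for a small weight change \Delta w, the KL divergence
% between the old and new output distributions is approximately a quadratic
% form in F.
\mathrm{KL}\!\left(\, p_w(y \mid x) \;\big\|\; p_{w + \Delta w}(y \mid x) \,\right)
\;\approx\; \tfrac{1}{2}\, \Delta w^{\top} F \,\Delta w .
```

The F-inverse factor is what rescales the raw gradient so that a step of a given size causes as small a change as possible in the output distribution.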
Because if I've got a 10-million-dimensional weight vector — which is not that big, right? There are billion-dimensional ones being used commercially — then F is going to be 10 million by 10 million. And that's far too big for anyone to compute, let alone invert. So one has to approximate it in some fashion. It hasn't come into commercial usage, but it's a super cool idea. And it's nice to think of this as one more thing you want in a good gradient: something that doesn't mess up too much of what you've already learned.
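To see why the size of F is the bottleneck, here is a minimal sketch of one exact natural-gradient step on a toy softmax classifier. Everything here — the random data, the linear model, the empirical-Fisher estimate, the damping term — is my own illustrative assumption, not something from the lecture. F comes out d-by-d: twelve by twelve in this toy, but 10 million by 10 million for the network above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): 4-dim inputs, 3 classes, linear softmax model.
n, d_in, k = 64, 4, 3
X = rng.normal(size=(n, d_in))
y = rng.integers(0, k, size=n)
W = rng.normal(scale=0.1, size=(d_in, k))
d = W.size  # total number of parameters; F will be d x d

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def per_example_grads(W):
    """Gradient of -log p(y_i | x_i) w.r.t. W, flattened, one row per example."""
    P = softmax(X @ W.reshape(d_in, k))   # (n, k) predicted probabilities
    E = P.copy()
    E[np.arange(n), y] -= 1.0             # dL/dlogits for cross-entropy
    # Outer product of x_i with (dL/dlogits)_i gives each example's weight gradient.
    return np.einsum('ni,nj->nij', X, E).reshape(n, d)

G = per_example_grads(W)
grad = G.mean(axis=0)                     # ordinary mini-batch gradient

# Empirical Fisher: average outer product of per-example gradients.
F = (G.T @ G) / n                         # (d, d) -- the part that explodes with d

# Natural gradient step: solve (F + damping*I) delta = grad rather than
# forming F^{-1} explicitly; damping keeps the solve well-conditioned.
eta, damping = 0.1, 1e-3
nat_grad = np.linalg.solve(F + damping * np.eye(d), grad)
W = W - eta * nat_grad.reshape(d_in, k)

print("F shape:", F.shape)  # (12, 12) here; 1e7 x 1e7 for a 10M-weight net
```

Practical methods like K-FAC make this usable by approximating F with small per-layer blocks instead of one giant d-by-d matrix, which is exactly the kind of approximation the lecture alludes to.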