Okay. Why is training a deep neural net so hard? Well, at some level, the answer is that neural nets are really flexible, very expressive approximation schemes. In theory they can approximate any reasonable function, and we will see they can approximate not just nicely behaved ones, but also weird things like language. They can even learn to do multiplication and other discrete computations, things which are extremely non-convex. And any time you try to optimize something that's incredibly non-convex, it is really hard to optimize. The good news is we don't want to find the global optimum anyway, because it's probably overfit. The bad news is it's still tricky to get the optimization to work right. And if you look at what's going on, even in the simplest, simplest network, it's not trivial.

So consider a simple linear neural network with a quadratic loss function. What you see is that the error, the loss, is a quadratic function of, say, one of the weights, and you'd like to just walk down that slope by gradient descent to the bottom. Sounds easy, but it's not that easy. So picture a two-dimensional version, weight one and weight two, where you have contour lines of equal loss; we want to just sort of walk down that slope, going down to the minimum, which in this case will be at the center of the contours.

This week I'm going to show a ton of methods, all of which are basically saying: take the old set of weights, w at iteration t (epoch t, or time t), the full set of all the weights in the network, and subtract off some learning rate eta, which we will fine-tune a lot, times the gradient: the derivative of the loss function f (the squared error, or the maximum entropy loss), which of course depends on the current weights, and is evaluated over either all of the training points, one of the training points, or a mini-batch of the training points. We take the derivative of that loss function with respect to each of the weights:

    w_{t+1} = w_t - eta * ∇f(w_t)
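This basic update can be sketched in a few lines of Python on a toy quadratic loss; the matrix A, the learning rate, and the starting point below are illustrative choices, not from the lecture:

```python
import numpy as np

# Toy quadratic loss f(w) = 0.5 * w^T A w, whose gradient is A @ w.
# A is positive definite; its unequal eigenvalues (1 and 10) make the
# contour lines elongated ellipses, like the picture described above.
A = np.array([[1.0, 0.0],
              [0.0, 10.0]])

def grad(w):
    return A @ w

eta = 0.09                    # learning rate, needs tuning as noted above
w = np.array([10.0, 1.0])     # starting point
for t in range(100):
    w = w - eta * grad(w)     # w_{t+1} = w_t - eta * grad f(w_t)

print(w)  # should end up near the minimum at the origin
```

Here the minimum is at the origin, so we can check convergence by looking at how small w has become after 100 steps.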
So if there are d weights, this is a d-dimensional derivative; this nabla is a d-dimensional gradient, and we update the weights by subtracting off a learning rate times the slope. Sounds easy, should work well, but what happens even with a nice little quadratic loss function? Here's a simple version: we're sitting at some point, we compute the gradient, and the gradient going down the loss function takes us off in some direction. If we could go straight to the minimum we'd go exactly there; in practice we take a small step, and we might take too small a step and stop short, or go a little bit past. In any case, the first thing to note is that the gradient down the slope of the loss function does not point at the global minimum, or even the local minimum, which is the same thing here; it points in a different direction. And if you take a sequence of steps, each time going down the slope, you go down, down, down, and you zig, zag, zig, zag; you can zig-zag a whole bunch of times before you get to the minimum. Now, that doesn't sound so bad in two dimensions, but you're in very high dimensions, and so there's lots of zigging and zagging and not enough moving in the direction you actually care about.

So what we'll do this week is try to figure out how to fix gradient descent, so there's less moving back and forth and more moving in the direction we want to go. What are we looking for in a gradient descent algorithm? We want something that's fast, that takes little compute time, and not just a small number of multiplications, as we'll see, but little actual time on the computer. And we'll see next week that different gradient descent algorithms converge to different optima, so you'd also like it to converge to a good optimum, not just to get to some optimum quickly. We will cover a whole bunch of different gradient descent methods this week.
I won't read them all here, you can see them, but that'll be fun.