It should be obvious that the learning rate matters. If the learning rate is too high, then as you iterate, gradient descent tends to bounce you farther and farther away and the weights diverge: you take too big a step, you jump out of your trough, and you land at really bad solutions. If the learning rate is too low, well, it takes forever; you burn lots of compute gradually converging to your solution. And it can be the case that a moderately-too-high learning rate gets you stuck in a local optimum that never quite lets you out again. A good learning rate takes you down smoothly, converging to at least a good local optimum. We'll worry next week about which optimum you converge to; this week we'll worry about how fast you converge to it.

By the way, I've stolen this slide from a really nice website, the Stanford deep learning course by Fei-Fei Li and company. It's a good place to read, almost as a textbook, about some of the techniques we're covering today and throughout the semester.

So what do we want to do? We want to adjust the learning rate so it's close to optimal. If you learn too fast, you bounce farther and farther until you bounce all the way out of your local optimum. You often converge to optima with huge weights, either very positive or very negative, so you converge to really, really stupid local optima, and the gradient can also become tiny, causing convergence problems. If you learn too slowly, as I said, you learn too slowly.

So one solution is to adjust the gradient step each time: if your error is going up, switch to a smaller learning rate; if your error is going down, you're doing fine, maybe speed up a little. That works nicely, and we'll see there are a lot of other variations that play on that theme and work better.

The basic way to do things is to start with some weights and take a learning step of size eta, making eta a function of t. Now t is going to index an epoch, or a mini-batch.
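The "error up, shrink the rate; error down, grow it a little" idea can be sketched as follows. This is only an illustrative sketch on a toy quadratic loss; the function names and the grow/shrink factors are my choices, not anything fixed by the lecture.

```python
import numpy as np

def loss_and_grad(w):
    # A simple quadratic bowl: loss = 0.5 * ||w||^2, gradient = w.
    return 0.5 * float(w @ w), w

def adaptive_gd(w, eta=0.5, grow=1.1, shrink=0.5, steps=50):
    # Adaptive step heuristic (illustrative constants): if the error went
    # up, reject the step and shrink eta; if the error went down, accept
    # the step and grow eta slightly.
    loss, g = loss_and_grad(w)
    for _ in range(steps):
        w_new = w - eta * g
        new_loss, new_g = loss_and_grad(w_new)
        if new_loss > loss:        # error went up: smaller learning rate
            eta *= shrink
        else:                      # error went down: speed up a little
            w, loss, g = w_new, new_loss, new_g
            eta *= grow
    return w, loss

w_final, final_loss = adaptive_gd(np.array([3.0, -4.0]))
```

Note that a rejected step leaves the weights unchanged, so the loss never increases; eta just oscillates around the largest stable step size.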
We'll do batches of, say, 50 points: t = 1 is the first 50, t = 2 is the next 50, t = 3 is the next 50, and so on. At each of those we take the gradient averaged over the 50 points. That is, what's the derivative of the loss function summed over those 50 points with respect to all the weights? And how do we change eta? We can keep eta constant; we can decrease it linearly with 1 over t; we can decrease it with 1 over the square root of t; we can decay it exponentially. There are many, many different ways to do it. But all of these have the property that they are fixed schedules: you decide in advance exactly how you will decrease the learning rate. All of them start with a large learning rate and make it smaller.
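Those fixed schedules, and the mini-batch loop they plug into, might look like this. It's a sketch under my own assumptions: the starting rate eta0, the decay constant k, and the noise-free least-squares demo problem are all illustrative choices.

```python
import numpy as np

# Fixed learning-rate schedules as a function of the mini-batch index t.
# eta0 and the decay constant k are illustrative, not prescribed values.
def eta_constant(t, eta0=0.5):            return eta0
def eta_inverse(t, eta0=0.5):             return eta0 / (1 + t)
def eta_inv_sqrt(t, eta0=0.5):            return eta0 / np.sqrt(1 + t)
def eta_exponential(t, eta0=0.5, k=0.01): return eta0 * np.exp(-k * t)

def minibatch_sgd(X, y, schedule, batch_size=50, epochs=20):
    # Least-squares regression by mini-batch gradient descent: each step
    # averages the gradient over one batch of 50 points, then scales it
    # by eta(t), where t counts mini-batches.
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in range(0, len(X), batch_size):
            Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
            # average gradient of 0.5 * (x.w - y)^2 over the batch
            g = Xb.T @ (Xb @ w - yb) / len(Xb)
            w = w - schedule(t) * g
            t += 1
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
w_true = np.array([1.0, -2.0, 3.0])
y = X @ w_true                      # noise-free targets for the demo
w_hat = minibatch_sgd(X, y, eta_inv_sqrt)
```

Swapping in `eta_constant`, `eta_inverse`, or `eta_exponential` changes only how aggressively the steps shrink; the loop itself is the same.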