All right, so as we can see, there is no Yann around yet. But we have a recording for everyone who's joining late, so they will know what happened at the beginning of the class. What's happening at the beginning of the class? I have no idea. I've figured out that Yann is not here yet. So, interesting. I was saying that this semester I was actually thinking of mostly re-teaching the things from last semester, but then halfway through: oh no, I cannot, because things changed. I changed the way I see things. Actually, I think the things are still the same, right? But since we taught energy-based models — latent-variable energy-based models — in the first part of the semester, you now have an additional tool to understand other topics and subjects. And I actually need to use that tool, because you already know it, right? I cannot pretend it doesn't exist. And so I think I've been doing a reasonably good job — perhaps; you be the judge — of re-proposing the same topics as last year, but in a new light, which was rather interesting and new for me, at least. I didn't necessarily know this last year, or last semester. Then we also had the homework on energy-based models, thanks to Vlad, which should have given you an understanding of how minimizing an energy — something like doing gradient descent — can also be considered inference, or is considered inference, right? And that was a proper case of it: if you take the argmin, you find the minimum path, right? The Viterbi algorithm. If you don't take the minimum but the soft minimum, you get the forward algorithm, the one we saw in speech recognition. What else? Oh yeah, and then last class we saw that the control part — optimal control — is nothing but energy minimization with respect to the latent, and the latent is actually the control. So all these things took years, let's say years for me, to understand clearly and then deeply. But now you don't need years, right? Because since you already know how these energy-based models work, everything else just works because the other thing works, right? You can see everything from the same perspective, and nothing should surprise you too much, okay? Because it's all the same stuff over again. If I had known in advance that I was going to be talking now, I would have prepared something, but I can't improvise. I don't have a piano here, so no music for you. What else can we do? I don't know, seriously. [15 minutes later.] We are about 15 minutes into the class and Yann is not answering. I have no idea if he's coming to class. Hello? Oh, good morning. Hello, everyone. We were worried that you were not going to come anymore. Sorry, I had a number of different problems. I hope you fixed them. Yeah, yeah, yeah. By the way, there are outdated slides on the website. You want to turn on your camera? Yeah, it's going to take a while. I'm not sure why, but I'm having issues with it. Okay. Okay, I can fix that in post-editing. No, you don't want to do that. It's just that I'm going to freeze for a few seconds. All right, we're going to talk about optimization today — a little bit of the theory and the practice. You've played with various optimization schemes, so you have the background. You've used them a lot, but the question is: can we explain why all those tricks?
Where do they come from, essentially? All right: optimization for deep learning. I should tell you right now that I borrowed a lot of material from Aaron DeFazio. Aaron DeFazio is a specialist in optimization who works at Facebook AI Research. He gave a guest lecture in this course last year, and it was really good, so I borrowed a lot of material from it, and some other things as well. Okay, let's jump in. We all know about gradient descent, right? In the context of machine learning we have a loss function, which generally is an average over individual per-sample losses that take individual samples (x_i, y_i) as well as a parameter. We're going to denote the objective function we're minimizing just f(w), w being the parameter vector — or parameter object, whatever it is — that we're going to optimize. So we're hiding the whole dataset inside this f. And gradient descent, of course, as you know, is an update formula where we compute the new parameter value at iteration k+1 by an additive update of the old parameter value, and the update is proportional to the negative gradient of the function at the current location: w_{k+1} = w_k − γ_k ∇f(w_k). The gradient, when you write it as a partial derivative, as we've seen, is a row vector, so here I've written it with a transpose. I haven't been very systematic about this in the notation later, but that's the idea. And then you have this learning rate γ_k. The learning rates you've been playing with so far were scalar positive values that you would decrease over time, and you would actually use stochastic gradient, not full gradient: the gradient you computed was not the gradient of this average function, but an estimate of it on the basis of a small number of samples — perhaps even a single sample, or a mini-batch. That's called stochastic gradient, of course, as you know. Now, what we're going to see is that there are various methods in which this learning rate can be a diagonal matrix, which means you have a separate learning rate for each weight — when you multiply a diagonal matrix by a column vector, you scale each component by some coefficient — or it could be a full matrix, or a low-rank matrix. There are various methods, and I'm not going to go into all of them, because most of them are not really relevant to machine learning — at least not to parameter updates; they can be relevant to inference. And the reason they're not so relevant, even though most of the optimization literature actually concerns those methods where the learning rate is basically a full matrix, is that we don't use them much in machine learning: if we have 100 million parameters, this matrix would be 100 million by 100 million, and that's impractical. There have been a lot of attempts to exploit the advantages of those methods, but in the end nobody uses them. We all know that setting the learning rate is a bit of an art, and there's a way you can analyze this theoretically — certainly in the one-dimensional quadratic case, but also in the multidimensional quadratic case — and then extend it to the multidimensional non-quadratic case.
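To fix notation before the quadratic analysis, here are the two update rules in code — a minimal sketch, assuming hypothetical callables grad_f (full gradient) and grad_f_sample (per-sample gradient), not any particular library API:

```python
import numpy as np

def gd_step(w, grad_f, lr):
    # Full gradient descent: w_{k+1} = w_k - lr * grad f(w_k)
    return w - lr * grad_f(w)

def sgd_step(w, grad_f_sample, batch, lr):
    # Stochastic gradient: estimate grad f from a mini-batch
    # (or a single sample) instead of the whole dataset.
    g = np.mean([grad_f_sample(w, x, y) for x, y in batch], axis=0)
    return w - lr * g
```

With a batch of one sample this is plain SGD; with the whole dataset it reduces to full gradient descent.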
So in the quadratic case, in one dimension: if you set the learning rate to a small value, the parameter will slowly progress towards the minimum, and it's going to be what any reasonable person would think of as exponential convergence — exponential in the sense that the distance between the scalar weight value and the optimal value is multiplied by a constant less than one at every time step. So in that sense it's exponential, but in optimization terms, if you read the optimization literature, that's called linear convergence. It's called linear convergence because they work in log space: the number of significant digits that get added to the solution at every time step increases linearly. Okay. And for people who specialize in optimization, that's bad, that's slow. What you want is something like superlinear or quadratic convergence. Unfortunately, that's basically unattainable in machine learning because of the size of the networks we have — the number of parameters — and the size of the objective function. So we have to resort to simpler methods like stochastic gradient and its variants, but that doesn't mean we can't accelerate training — and that's basically the focus of this lecture. Now, in the one-dimensional quadratic case, there is a value of the learning rate that is optimal, that will make you jump directly to the minimum in one step. Okay, and we'll see what value that is. It's called Newton's algorithm, actually, if you do this, if you compute this optimal learning rate. If you increase the learning rate beyond that optimal value, in one dimension, the system starts oscillating, but it still converges. And if the learning rate is more than twice the optimal value, it diverges. Okay. The situation in multiple dimensions is considerably more complicated, because there the convergence speed and the optimal learning rate depend on the curvature — the second derivative — of the function, and in multiple dimensions you can have different second derivatives in different directions. That's what makes things complicated. So in fact, here is an example of this. Here we have a two-dimensional objective function, and on the right I've plotted the lines of equal cost. We have a minimum here in the center. This is a quadratic function, so it's a bowl — a quadratic bowl — and it's got different curvatures in the two directions. This axis here, the main axis of those ellipses, is called one of the principal axes of that objective function, and the orthogonal one is another. Those are related to the eigenvectors of the Hessian; we'll talk about this later. The gradient vector at every point points uphill and is orthogonal to the lines of equal cost. Okay. Now, if you have an algorithm whose update has a strictly negative dot product with the gradient, it's going to get closer to the minimum. So any gradient-based method — any descent method, actually, in the continuous convex case — where the descent direction has a negative dot product with the gradient will eventually take you down to the minimum. It doesn't have to be the gradient itself; the steepest descent direction is the negative gradient.
But if you take the negative of the gradient, you go downhill — yet it's not necessarily the best direction, because as you can see in this example, the negative gradient doesn't actually point towards the minimum. It points a little off to the side. So in fact there may be a better direction, one that points directly at the minimum. Unfortunately, that direction is very hard to compute: it involves inverting a huge matrix whose size is the number of parameters squared. Again, that's Newton's algorithm; we'll come back to it. All right, so here are two examples. If you have a small enough learning rate, the trajectory that gets followed will first converge along the direction where the curvature is large. You see the ellipses here are elongated in one direction: in the elongated direction the curvature is low, and in the narrow direction the curvature is large, right? And what you see in the dynamics of convergence is that along the direction of high curvature, convergence is faster than along the direction of low curvature. Now, if you increase the learning rate a little bit, you can get the situation here on the left, where in the direction of high curvature things start to oscillate — they still converge, but they oscillate — whereas in the direction of low curvature, convergence is slow. Okay. So here is the main problem we're going to have to face: in a high-dimensional space, like the space of the cost function of a neural net, we're going to have some dimensions in which the curvature is large and some in which it is small. The overall speed of convergence is limited by the slow convergence along the directions where the curvature is small. We could crank up the learning rate, but we can't crank it up too much, because beyond a certain point the system starts diverging in the directions where the curvature is large. So basically, the convergence speed is limited: the maximum learning rate is determined by the largest curvature, and once we've set that learning rate, the time to convergence is determined by the direction of lowest curvature. There are directions where the function is completely flat — we don't care about those — but in directions where the curvature is small yet nonzero, it's going to take a long time to converge. The ratio of the largest curvature to the smallest is called the condition number, and basically that determines how efficiently you can optimize a function with gradient descent, including stochastic gradient. Now, there's a second issue, of course, which is that the functions we optimize in machine learning are non-convex. A lot has been written and said about the fact that the objective function is non-convex. This makes a lot of theorists very uneasy, and in fact, one of the reasons neural nets were not well appreciated in machine learning for a long time is that the objective function is non-convex: if you're a theoretician, an academic, you want to prove things — and here you can't prove anything, essentially. You cannot prove the convergence of gradient-based learning in multilayer neural nets.
You cannot prove that your optimization won't get stuck in a local minimum. But empirically, we know for a fact that it's pretty rare, in the typical architectures we use, with the proper tricks. It's very rare to get stuck in local minima. Very consistently, when you train a neural net with the appropriate black art that goes on behind the scenes — which is implemented in PyTorch and other frameworks — training is pretty reliable. You pretty much get the same result every time. You don't get the same parameter values — those depend on the initialization and on the random choice of training samples — but you get pretty much the same final loss every time. Now, there have been some studies suggesting that the number of isolated regions to which a gradient-based algorithm converges, when you're training a neural net, is very small. So essentially, most of the minima, despite being local, are connected: you can go from one minimum to another without having to go up a hill, essentially. They're not all completely connected, but there's a lot of degeneracy in the minima. This is particularly true in very large neural nets. The neural nets we use are very often way oversized — overparameterized — for the problems we want to solve. In fact, we can see this because when we train a neural net, we can generally get to zero error on the training set, and that's because the networks we train are much larger than necessary to learn the function. The network completely nails the objective: it finds a minimum that's really close to zero — if that's at all possible; of course, it may not be. And why do we do this? We do it because it makes the objective function highly degenerate, which means that at a minimum there are many directions in which you can move without changing the objective function at all. There's only a small number of directions in which changing the weights makes the loss increase; in many directions it just doesn't change. Imagine a neural net with multiple layers, with a unit somewhere that has a bunch of weights coming into it, and imagine that all the weights coming out of this unit are zero. Essentially, no one sees that unit in the rest of the network. You can change the incoming weights of that unit and it will have zero effect — no effect whatsoever on the loss. And these things happen a lot. Or imagine you have two inputs that are highly correlated — let's say equal, up to a constant factor — and two separate weights connecting to those two inputs. Since the two inputs are essentially equal, it doesn't matter what the individual weight values are; the only thing that matters is the sum of the two weights, because that sum determines the importance of the feature, which is replicated twice. Which means there is a direction in weight space where you can change the first weight by a little and the other weight by a little in the opposite direction: you don't change the sum, you don't change the overall function of the system — yet you've changed the weights. That's one direction in which you can change the weights and it makes no difference to the output. So in an overparameterized neural net, you have lots of dimensions like this, where essentially it doesn't matter what changes you make.
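Here's the duplicated-input example in a few lines — a minimal sketch (the numbers are arbitrary), just to show the flat direction in weight space:

```python
import numpy as np

# One sample whose first two input features are exactly equal.
x = np.array([0.7, 0.7, -1.2])
y = 1.0
loss = lambda w: (y - w @ x) ** 2

w = np.array([0.3, 0.5, 0.1])
w_moved = w + np.array([10.0, -10.0, 0.0])   # move along (+delta, -delta, 0)

# Only the sum of the first two weights matters, so the loss is unchanged:
print(loss(w), loss(w_moved))                # two identical values
```

The Hessian of this loss has a zero eigenvalue along that (+1, −1, 0) direction — exactly the kind of degeneracy just described.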
Right, so of course, in practice, we all use stochastic gradient. And as you know, stochastic gradient consists in evaluating the gradient on the basis of a small number of samples, or even a single sample. Some people in optimization don't like the name stochastic gradient descent, SGD, because it's not a descent algorithm: sometimes you go uphill, because the gradient is noisy. So some people have advocated the name stochastic gradient optimization rather than stochastic gradient descent. But the phrase has caught on, so it's called SGD. And on average, SGD behaves like gradient descent. If you really want to learn everything about SGD and about stochastic optimization methods, I recommend the paper by Léon Bottou, who's a colleague at Facebook, Jorge Nocedal, who is a very famous person in optimization, and their colleague Curtis. It's a very long paper — 75 pages — which has the entire theory of why SGD converges. It turns out to be surprisingly difficult to prove theoretically that SGD converges. You can prove it in very simple cases — for least squares, for quadratic functions with independent variables, this was proven back in the 1950s, when it used to be called stochastic approximation — but in the general case? And then, can you accelerate it? Can you use second-order methods by estimating the curvature? If you want to know everything about this, this is the paper. You'll need some time ahead of you to digest it all. So the trajectory followed by SGD is kind of erratic, but it's faster in the context of machine learning, and we'll see why in a minute. Okay, so let's analyze the convergence of gradient descent. Our function, again, is an average of per-sample loss functions; we just call it f(w). And full gradient descent is the update w_{k+1} = w_k − γ ∇f(w_k): step size — learning rate — times the gradient of the function with respect to w. If I were being completely correct here, I would have had to write a transpose on this gradient: if you denote it as a partial derivative, you have to transpose it, or you can write it with the nabla symbol, and then it's already transposed. Okay, so what we're going to show over the next few slides is that the optimal learning rate in one dimension is equal to the inverse of the second derivative of the objective, and then we're going to generalize this to multiple dimensions. Here's the little diagram I showed earlier, and we're looking for the optimal value of the learning rate; we want to arrive at the result that the optimal learning rate is the inverse of the second derivative. So here's a little diagram on the right, which is a little complicated; I'm going to spend a bit of time on it, but in the end it doesn't appeal to any particularly complex concept. Okay, so let's say we have a quadratic objective function like this. We are at location w_k, and we're going to take one step, which takes us to location w_{k+1}, and we'd like w_{k+1} to be at the minimum of the function. So what is the learning rate that will take us to that minimum? This is our objective function f(w), and this is f(w_k) right here. And what I've plotted at the bottom is the derivative of that function. Because this is a quadratic function, the derivative is a line: the derivative of a second-degree polynomial is linear, right?
So the value of that linear function is the gradient, and its slope — the derivative of the derivative — is the second derivative of our original objective function, right? That's the function, that's its derivative, and the derivative of the derivative is the second derivative: the slope of that line. Okay, now there's an interesting relationship here, which is indicated by those three formulas. We are at w_k and we take a step towards w_{k+1}, so this distance here is w_{k+1} − w_k, right? When we make this step, because the slope here is the second derivative, the increment of the derivative — the difference between the derivative at w_{k+1} and the derivative at w_k — is obtained by taking this distance and multiplying it by the slope: f′(w_{k+1}) − f′(w_k) = f″(w_k)(w_{k+1} − w_k). And that's the formula we have here. I wrote it as if it were multidimensional; we're in the scalar case on the diagram, but in fact that relationship is also true in multiple dimensions. Imagine w is a vector: you turn the difference into a row vector so you can pre-multiply by this thing, which ends up being a matrix — the Hessian matrix. I'll come back to this. But let's think about the one-dimensional case for now. This is the difference between the gradients at w_{k+1} and w_k, right? It's an important formula because it tells you the relationship between the second derivative, the difference of two gradients, and the increment — and it's basically the main thing you need to know about gradient-based optimization. All right, so now we want to solve this so that the gradient at w_{k+1} is zero, okay? I've rewritten the formula here by just transposing everything, so this is the transpose of the second derivative matrix. In fact, you don't need to transpose the second derivative matrix, because it's actually symmetric, okay? The matrix of second derivatives is symmetric — and again, I'll come back to exactly what this matrix of second derivatives means. So this matrix of second derivatives, called the Hessian matrix, which doesn't need the transpose, multiplied by the difference between where you are and where you want to be — or where you were and where you are now — is equal to the difference of the gradients: the gradient where you are versus the gradient where you're going to be. Okay, so we want to jump directly to the minimum. At the minimum — our next step would be at w_{k+1}, at that minimum — we want the gradient to be zero, right? So we set it to zero, and we're left with this equation, essentially. The symbol we use for the minimum is w*: the value of w that minimizes the loss. I've just rewritten the equation with that set to zero: (w* − w_k), pre-multiplied by the Hessian matrix, the second derivative, equals minus the gradient — H(w_k)(w* − w_k) = −∇f(w_k). That's the term that's left here. Again, that's basically the main formula you need to know about gradient-based optimization. All the methods people have derived come from this, essentially. Now, here's the thing: in one dimension, that's easy to solve — you just do this, right?
w* = w_k minus the inverse of the second derivative times the gradient. And if you identify this with the gradient update formula, the inverse of the second derivative plays the role of the learning rate: it's a gradient update where the learning rate has been replaced by 1/f″(w_k), okay? So you conclude: if I want to jump directly to the minimum, I just set the learning rate to the inverse of the second derivative, and that makes me jump to the minimum in one dimension. In multiple dimensions, a scalar rate like that will only make me jump to the minimum along the direction of largest curvature — if I can figure out what that direction is — but it's not going to make me converge fast in the other directions; I'm going to have to wait for those to converge, okay? So the scalar version only works in one dimension. Now, in multiple dimensions, the formulas themselves still work, okay? Now this is a matrix, and what you need to compute here is the inverse of that matrix. Actually, you don't do it that way: if you want to do it, you solve the linear system. This is a linear system where the unknown is w* and the Hessian is the matrix of coefficients — it's like Ax + b, where that's the unknown — and you basically solve the system for w*. So there's a question here: what if that matrix doesn't have an inverse? Yeah, we'll talk about this. Most of the time — okay, there are two issues. This is for the quadratic case, and for the positive definite case, where the Hessian matrix is invertible, which means it has no zero eigenvalues — and certainly no negative eigenvalues, because then the function would not be convex, okay? So we're assuming that in all directions the loss function curves upward, which of course is not true in the case of neural nets. As I said, many directions are flat, okay? So we can't even do this with neural nets — we can't invert that Hessian matrix. But let me explain a little more what the Hessian matrix actually is. I have a scalar function f, which takes a parameter and produces a scalar: f(w), my loss function. But w is a vector. So df/dw, of course, is a vector whose dimension is the same as w — call it n. And the second derivative, the Hessian matrix, denoted d²f/dw², is an n-by-n matrix. The term (i, j) in this matrix is a second derivative of the function, at a particular point w: you differentiate first with respect to w_i and then with respect to w_j, or the other way around — it doesn't matter, because it's symmetric, okay? You can write it as ∂/∂w_i (∂f/∂w_j) = ∂/∂w_j (∂f/∂w_i). So this inner thing is the gradient, a vector, and the whole thing is the derivative of a vector with respect to a vector of the same size: df/dw is a function that takes w, with n dimensions in, and produces a vector, with n dimensions out. The derivative of that function is a matrix — it has to be a matrix, because for every input and every output there's a scalar derivative that indicates how that output is influenced by that input.
If you wiggle input w_i of this gradient function by delta, output number j will wiggle by this second-derivative term times delta, okay? Because it's the partial derivative of output number j with respect to input number i. So that's what the Hessian matrix is: the diagonal terms are ∂²f/∂w_i², and the (i, j) term is ∂²f/∂w_i∂w_j, okay? That's the Hessian matrix. Okay, now what are the properties of this Hessian matrix? Take a quadratic function in two dimensions, with coordinates w1 and w2, and suppose my function is elongated, meaning the curvature in this direction is smaller than the curvature in that direction. In other words, ∂²f/∂w1² is, say, 2 — a random example — and ∂²f/∂w2² is 1, or something like that, right? So the curvature in w1 is twice the curvature in w2. What is the Hessian matrix? In this particular case — I've put the minimum at zero; I could shift it, but this keeps it easy — I can essentially write the function as f(w) = (2·w1² + 1·w2²)/2, okay? That's my function f(w). If I differentiate with respect to w1, the w2 term goes away, the 2 from the derivative cancels the one half, so I get 2·w1, okay? And if I differentiate again, I get 2. Same for w2: I get 1 for the second derivative. Now, if I try to compute the cross term ∂²f/∂w1∂w2 for this particular simple function: I first differentiate with respect to w1, which gives 2·w1, and then differentiate that with respect to w2 — and I get 0, right? Because there are no w2's left. So now I can write this Hessian matrix down: it's diag(2, 1) — 2 and 1 on the diagonal, no off-diagonal terms. It's a diagonal matrix. That makes it easy. Now, if I wanted to use a gradient descent algorithm to optimize this function, I could actually compute an optimal learning rate that makes it converge in one step, and here is how you do it. So let me redraw the function here; again, it's elongated. If I'm here and take a gradient step, I'm not going to go towards the minimum: I'm going to go in this direction, orthogonal to the line of equal cost, okay? And I'm going to have to follow this kind of curvy trajectory, converging to the minimum in several steps. But I can use the trick: w* = w_k — wherever I am here — minus H⁻¹, the inverse of the Hessian matrix, times the gradient. And I need to transpose this if I want a column vector. And H⁻¹ is trivial, right? It's just diag(1/2, 1), times the gradient, okay? That I can compute really easily. Even if I had a neural net with 100 million dimensions, I could compute this inverse super easily, because it's only a diagonal matrix, right? I don't need to store the whole matrix, just the diagonal terms, and I do a term-by-term multiplication of those factors by the gradient.
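That term-by-term multiplication, in code — a minimal sketch of the diagonal-Newton step for this exact two-dimensional example:

```python
import numpy as np

# f(w) = (2*w1^2 + 1*w2^2) / 2, minimum at the origin.
grad = lambda w: np.array([2.0 * w[0], 1.0 * w[1]])
h_diag = np.array([2.0, 1.0])          # diagonal of the Hessian: the curvatures

w = np.array([3.0, -4.0])              # any starting point
w_new = w - (1.0 / h_diag) * grad(w)   # elementwise: per-weight rate 1/curvature
print(w_new)                           # [0., 0.] -- the minimum, in one step
```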
But that gives me a different learning rate for each dimension. It tells me: use a learning rate in the vertical direction that is twice as large as in the horizontal direction, because the curvature there is half as large, okay? So if I use this update formula, my weight update actually points directly towards the minimum. And in fact, if I use H⁻¹ here, with no extra learning rate, it takes me directly to the minimum, okay? This is called Newton's algorithm. And as the name suggests, this is not a new idea, okay? It goes back to Newton. Newton was essentially trying to solve equations of the type g(w) = 0 — which is equivalent to our problem, df/dw = 0, finding the minimum of a function. Finding the minimum of a function and finding the zeros of an equation are the same thing, okay? And Newton invented the method for that purpose. He said: compute the derivative of g — which is the second derivative of f — and take a step proportional to the inverse of that derivative, and you'll get closer to the minimum, provided the second derivative is positive, okay? If it's negative, you have to do the reverse. And that's the issue with Newton's algorithm: it really assumes the function you're optimizing is convex. It doesn't need to be exactly quadratic — if it's not quadratic, you just won't converge in one step — but it needs to be convex, because if it's not convex — say the curvature here is negative, the second derivative is negative — then when you multiply your gradient by the inverse of your second derivative, it's not going to take you downhill; it's going to take you uphill, because it has the wrong sign, okay? So Newton's algorithm only works if your objective function is convex, which means we can't use it for neural nets — at least not in its original version. So there are all kinds of ways to find close approximations to the H matrix, for non-convex functions, that are positive definite. And I'm just going to flash the formula at you so that you're aware it exists — in practice, it's really not used for machine learning. There's an algorithm, which you may find in various packages, called the Levenberg-Marquardt method. And it says: I'm going to use an approximation of H that I can guarantee is positive semi-definite, okay? It only has non-negative eigenvalues, which means in every direction the curvature is positive or zero. And the way you do it: your loss function f(w) is an average of per-sample loss functions, and each per-sample loss is something like a squared error between a target and a neural net function g applied to an input, with a parameter, okay? Which is common. It doesn't need to be a squared error, but if it is, you can approximate H using the Jacobian matrix of g. Let me do the full calculation, because otherwise it's going to be hard to understand. Okay, so what is H? I'm going to put a one half in front so I don't have to carry a two all the time. So H = d/dw [ d/dw (1/2)(y − g(x, w))² ] — and really, this is the H for a single sample.
I need to average this over all the samples, indexing the samples with i and putting a 1/(2p) in front, okay? So, the first derivative: d/dw of (1/2p) Σ_i (y_i − g(x_i, w))² gives −(1/p) Σ_i (y_i − g(x_i, w)) · ∂g(x_i, w)/∂w, where ∂g/∂w is the Jacobian matrix of g with respect to w. It's a matrix, because g may be a function with multiple outputs, right? It's a neural net, right? And now I need to differentiate this whole thing with respect to w again. Each term is a product of two factors — the residual (y_i − g(x_i, w)) and the Jacobian — so it's like differentiating u·v: I need u·v′ plus v·u′. Differentiating the residual gives the Jacobian again, transposed, times the Jacobian; differentiating the Jacobian gives the second derivative of g. So: H = (1/p) Σ_i [ (∂g(x_i, w)/∂w)ᵀ (∂g(x_i, w)/∂w) − (y_i − g(x_i, w)) · ∂²g(x_i, w)/∂w² ]. This last factor, the second derivative of g with respect to w, is an absolutely horrible object, okay? Why is that? g takes a vector and produces a vector. The Jacobian of that is a matrix, okay? Because for every pair of variables, it gives you a derivative. Now you differentiate that with respect to the weights again, and what you get is a third-order tensor — a tensor with three dimensions. So this is a horrible object. It's nonzero if the g function is nonlinear; if g is linear, it's zero, okay? And if the error is small, the whole term is small too. So here's the trick of the Levenberg-Marquardt algorithm — or rather, the so-called Gauss-Newton approximation, okay? The Gauss-Newton approximation is to say that this second term is basically zero, we can ignore it, and we use an approximate Hessian equal to just the first term. Now, this we can compute with backprop, but it's complicated, because we need a gradient for every output, okay? That gives us the Jacobian, and the product of the Jacobian transposed with the Jacobian is an n-by-n matrix, where n is the number of parameters. And it's positive semi-definite, because it's basically the square of a matrix, right? It's a matrix times its own transpose — like a square of a matrix — so it can only have non-negative eigenvalues: positive curvature, or zero curvature if we're unlucky. So here is the trick of the Levenberg-Marquardt algorithm: we replace H by this term, basically the squared Jacobian of our neural net function, okay? And this was developed in the general case, not for neural nets in particular.
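Here's that approximation in code — a minimal sketch with a toy scalar-output model (g, jac, and the data are made up for illustration), including the small damping term λI that makes the approximate Hessian invertible, which is the Levenberg-Marquardt fix described next:

```python
import numpy as np

# Toy model g(x, w) = tanh(w . x), squared-error loss.
def g(x, w):
    return np.tanh(x @ w)

def jac(x, w):
    # d g(x, w) / d w for one sample: (1 - tanh^2) * x
    return (1.0 - np.tanh(x @ w) ** 2) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = np.zeros(3)
lam = 1e-3                                    # Levenberg-Marquardt damping

J = np.stack([jac(x, w) for x in X])          # (p, n): one Jacobian row per sample
H_tilde = J.T @ J / len(X) + lam * np.eye(3)  # J^T J is PSD; + lam*I makes it PD
grad = -(y - g(X, w)) @ J / len(X)            # gradient of the averaged loss
w = w - np.linalg.solve(H_tilde, grad)        # Newton-like step: solve, don't invert
```

Note that we solve the linear system rather than forming the inverse explicitly, as suggested earlier.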
And the Levenberg-Marquardt fix is to add to this a small constant times the identity matrix. It's as simple as that, and it makes the matrix invertible, okay? So now we have an approximation of the Hessian that is positive definite, which means it's invertible, which means we can solve a system where this matrix is the matrix of coefficients, which means we can run something akin to Newton's algorithm. But again, this is impractical for neural nets. You could use it, for example, in the context of energy-based models, to do inference over the latent variable, because there it's not stochastic gradient and maybe the dimension is small enough. But we don't really use this. Now, you can make diagonal approximations of this, and those would be practical: what if you could compute just the diagonal terms of H̃? Then you would have a learning rate for each dimension that would be more or less optimal. There are methods like this — I actually developed one about 30 years ago — but they're not used that much in practice. Instead, there are other methods in use that I'll get to in a minute. Now let me give you another example, to build intuition about what this matrix of second derivatives means. Take the little example I had before, but tilted a little. To write a function like this, I can't write it the way I did before — the function I had before only had w1² and w2² terms; for this one, I need the cross product w1·w2 as well. Basically, you can think of this function as the same function as before, except in a different reference frame. So let me write the same function with different variables, u1 and u2. The real variables I'm observing, I'm manipulating, are w1 and w2, but the function expressed in u is f(u) = (2·u1² + 1·u2²)/2 — the same function I had before. If I now want to express that function in the reference frame of w, I need to transform the space of u into the space of w. I can write f as a quadratic form: f(u) = ½ uᵀ diag(2, 1) u — a quadratic form with a diagonal matrix. To go between u and w, there's a rotation matrix Q that turns w into u. So I can write this as f(w) = ½ wᵀ Qᵀ diag(2, 1) Q w, right? Simply because this is u and this is u transpose. I haven't changed the function; I've just expressed it as a function of w. Which means that in the space of w, my Hessian matrix is now Qᵀ diag(2, 1) Q. It's still a quadratic form, but this matrix is not diagonal anymore. Its eigenvalues are still 2 and 1, and its eigenvectors are the columns of Qᵀ — simply the coordinates of those two u axes in the frame of reference of w. So here's the thing: the Hessian matrix of a quadratic function is diagonal in the eigenspace of the Hessian. You compute the matrix of second derivatives, you compute its eigenvalues and eigenvectors, and when you change the frame of reference to that eigenspace, the Hessian becomes diagonal. Here I've done the opposite operation.
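A quick numerical check of that change-of-frame argument — a sketch with an arbitrary rotation angle:

```python
import numpy as np

theta = 0.3                                   # arbitrary tilt angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
H = Q.T @ np.diag([2.0, 1.0]) @ Q             # Hessian in the w frame

print(H)                       # no longer diagonal: a cross term has appeared
vals, vecs = np.linalg.eigh(H)
print(vals)                    # [1., 2.]: the original curvatures survive
print(vecs)                    # columns: the principal axes (rows of Q, up to sign)
```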
I started from the diagonal Hessian matrix and then expressed it in a rotated space, to make the structure clear. But basically, that's the intuition. So inverting a matrix of this type is not going to be easy, and it's going to be intractable, basically, if w is very high-dimensional. Let me take a very concrete example. Let's say we have a bunch of data points — perhaps we want to do classification into categories, perhaps we just want to do regression. So we have a dataset where the output y is a scalar and the input is a vector, our model is a linear model, and we want to do least squares, okay? So our objective is F(w) = (1/p) Σ_i (y_i − wᵀx_i)² — basically linear regression, where x is multidimensional and y is a scalar. What is the gradient? Okay, I'm going to put a 2 in the denominator, so F(w) = (1/2p) Σ_i (y_i − wᵀx_i)². Differentiating, we get 2 times the parenthesis times the derivative of the parenthesis with respect to w; the 2 cancels that 2, so we get dF/dw = −(1/p) Σ_i (y_i − wᵀx_i) xᵢᵀ, okay? Again, it's a row vector, right? Differentiate this again with respect to w: only one term survives, the wᵀx_i one, and what's left is d²F/dw dwᵀ = (1/p) Σ_i x_i xᵢᵀ. In fact, the proper way to write the second derivative is d²F/dw dwᵀ, but we just write dw². So what this tells you is that the Hessian matrix of a quadratic error function, when you do regression, is the covariance matrix of the inputs. This is super important. It actually has useful practical consequences. And why is that? What does it mean? It means that if you have, say, a single unit — a single neuron in your neural net — it has multiple inputs, and those inputs could be real inputs or outputs of other neurons. Okay. What do I need to do so that H is well behaved? Of course, there's a whole other neural net around it, but locally this system kind of optimizes its own little objective function, if you want, right? It's not exactly quadratic or anything. But whatever it does, how do we make sure H is well behaved? We'd like H to be as close to diagonal as possible, and we'd like all the terms on the diagonal to be equal, so we can use the same learning rate for every weight. Okay. So ideally, H should equal the identity. What does that mean? Well, H_ij = (1/p) Σ_k x_i⁽ᵏ⁾ x_j⁽ᵏ⁾ — now I have to use a different index: the i-th input times the j-th input, for a sample k — which is the covariance between x_i and x_j. Now, I want my H to be the identity matrix, so what I'd like is for this term to be 0 when i is different from j, and for the diagonal term H_ii to be 1 when i equals j. If I do this, and I minimize a squared error or something like that, my Hessian matrix is going to be the identity, and I can use the same learning rate for all the weights. And I'm not going to be in the situation where the cost function is elongated along some of the dimensions I'm considering here. So how do I do this? First of all, the first thing I need to do is subtract the mean from all the x variables. Okay.
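Here's that whole argument run numerically — a sketch with made-up data; the mean of 10 anticipates the example in the next paragraph:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10_000
X = rng.normal(size=(p, 2)) * [1.0, 0.1] + 10.0  # mean 10, unequal spreads

H = X.T @ X / p                  # Hessian of the regression loss: input covariance
print(np.linalg.cond(H))         # huge: the 10*10 = 100 cross terms dominate

Xn = (X - X.mean(axis=0)) / X.std(axis=0)        # zero mean, unit variance
Hn = Xn.T @ Xn / p
print(np.linalg.cond(Hn))        # ~1: same learning rate works for every weight
```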
So the x's have to have zero mean, because if they don't have zero mean, it's going to be really hard to make that product zero. Imagine that x_i and x_j have mean 10, okay, with some small variation on top of this mean value of 10. This product here is going to be something like 100, right? Because it's the product of 10 by 10, averaged over samples. So I cannot get zero here unless my variables have zero mean. All right? So that's lesson number one: variables should have zero mean. The diagonal term here is the variance of each variable, and I want them to be one. So that's lesson number two: variables should have variance one. Okay. So how do I do this? To guarantee that the variables inside a neural net have zero mean and unit variance, what I can do is take all the variables I have, subtract the mean, and divide by the standard deviation — to have variance one, you divide by the standard deviation, not the variance, right? And what I get at the output of this is variables that have zero mean and unit variance. And you've been using batch normalization — that's exactly what batch normalization does. Okay. So now you have an explanation for batch normalization. You also have an explanation for why, when you use images from ImageNet, the first thing you do to the images is subtract the mean of the pixels, computed over the training set, and again divide by the standard deviation. So that gives you a kind of intuitive, or semi-intuitive, justification for why variables in a neural net should have zero mean and unit variance: it equalizes the curvature of the cost function with respect to all the parameters in the system. Okay. So we've seen a lot of things that are relatively complicated here, but those formulas are valid for the multidimensional case. It's just that they're not practical, because we can't solve a system whose dimension is the size of our neural net. If you want the full theory, this is a slide from Aaron DeFazio that works out the convergence rate. So this is interesting. Let's say you have a quadratic error function of this form — a quadratic form. You compute the gradient, and you know that the solution is obtained by inverting the A matrix — which is basically the Hessian — and multiplying by b. But we're going to use gradient descent — full gradient, in this case — and the question is: how fast does it converge, right? So we compute w_{k+1} − w*, and then we replace w_{k+1} by its expression here, okay, inside. And we compute the gradient, because we know it's Aw − b. So we get this formula here; a little bit of hocus pocus inside — just rewriting some of those terms, using the fact that w* = A⁻¹b. And what we get in the end is that the distance between w_k and w* gets multiplied by a factor equal to (I − γA): the identity minus the learning rate times the Hessian. And that gives us the distance from w_{k+1} to w*, okay, w* being the solution, the minimum of the quadratic function.
So this is what I told you before: if we use gradient descent, we get an exponential decay of the distance to w* — w converges towards w* by multiplying the distance to w* at each step by a factor, and that factor is (I − γA), the identity minus the learning rate times the Hessian matrix. So if the learning rate, which is a scalar in this case, is larger than one over the largest eigenvalue of A, then this factor matrix has negative eigenvalues in some directions, okay, because this term can be larger than the identity in some directions — and things start to oscillate. And if this entire matrix has eigenvalues larger than one in magnitude — beyond minus one — this is not going to shrink anymore, and we get divergence. That's what limits the size of the learning rate: it has to be smaller than one over the largest eigenvalue of the Hessian — the inverse of the largest curvature — which is the same statement I made before. Or twice that, actually, if all you want is convergence; but you get oscillation if it's larger than one over the largest eigenvalue. So that gives you the convergence rate, which, you know, optimization people call linear and everybody else would call exponential. And it says: the rate of convergence in the direction of the lowest eigenvalue is going to be slow — it's going to take a long time, and it depends on the ratio between that eigenvalue and your learning rate. Overall, the per-step shrink factor is one minus the ratio of the smallest eigenvalue to the largest, and the ratio of the largest eigenvalue to the smallest is called the condition number. You want it to be as close to one as possible, but you can't if your function is elongated. I just went through that. So this slide is just rewriting what I wrote on the whiteboard: for a quadratic error function, the Hessian is basically the covariance matrix of the input, up to a constant factor. Right. So here's an example — one I built for a tutorial at NIPS many years ago, decades ago, in fact. Here's a quadratic function with two eigenvalues; the largest eigenvalue is 0.84 in this particular case. If you set the learning rate to 1.5, which is a little bit larger than one over the largest eigenvalue, you get oscillation in the high-curvature direction, but slow convergence in the other direction. The largest learning rate you can use is 2.38, which is 2/0.84. And if you set it to 2.5, then you start having divergence in the direction of high curvature. You still get convergence in the direction of low curvature, but, you know, that's not a good idea. So now, there's something very curious about stochastic gradient descent — because we use stochastic gradient descent, not gradient descent. Are the eigenvalues of the Hessian relevant for stochastic gradient descent? The answer is yes — but not exactly, in the sense that the largest usable learning rate still has something to do with the largest eigenvalue of the Hessian matrix, which is the largest curvature.
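Before going on with SGD, here's a quick numerical check of those thresholds — a sketch using the 0.84 eigenvalue and the 1.5 / 2.38 / 2.5 rates from the example above (the 0.1 second eigenvalue is made up):

```python
import numpy as np

H = np.diag([0.84, 0.1])              # curvatures; 2 / 0.84 ~ 2.38 is the edge

for lr in (1.0, 1.5, 2.38, 2.5):
    w = np.array([1.0, 1.0])          # starting point; the minimum is at the origin
    for _ in range(50):
        w = w - lr * (H @ w)          # gradient of the quadratic is H w
    print(lr, w)                      # 1.5 oscillates but shrinks; 2.5 blows up
```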
But the optimal learning rate is not just one over that value: it's something complicated that we actually don't quite know. If you want to really know everything about it, you need to read that paper by Bottou, Curtis, and Nocedal. But there are values of the learning rate for which convergence will be faster than gradient descent with its optimal learning rate — because stochastic gradient descent exploits the redundancy in the data; I've explained that before. Okay, so here's the takeaway: you need to center all the variables that enter a weight — that's the statement I just made at the whiteboard — and normalize the variances of all the variables that enter a weight. And this is a justification for a lot of the normalization tricks people have come up with, which I'll come to in a minute. Now, let's take another example here. This is the simplest multilayer neural net you can imagine, okay? It's a two-layer neural net: one scalar input, one hidden unit — in the diagram here it's drawn as a sigmoid, a hyperbolic tangent, but in the formula it's linear — and one output. Okay, and you're training this neural net with a single sample, and that single sample is: input equals one, output equals one. So basically, you're training this very simple neural net to learn the identity function. Okay, and the loss function is very simple: it's (1 − w1·w2)² — one, the desired output, minus the product of the two weights, multiplied by the input, which is one, so I don't show it — squared. Squared error. Okay, that's your objective function. You can plot it in two dimensions, and you get a plot like this, where the solution set is a hyperbola, right? It's a hyperbola where w1 takes some value and w2 takes one over that value, so the product is one and the loss is zero. Okay, so you get a hyperbolic region here. This one is not completely a hyperbola, because there's a sigmoid in between, but essentially there is a hyperbola on this side, where both weights are positive, and one on that side, where both weights are negative. There is a solution on each. In the middle, you have a saddle point: when both weights are zero, the cost is equal to one. And if you move in the other directions — where you have a positive w1 and a negative w2, for example — the cost increases like a fourth-degree polynomial. Okay, imagine w1 equals minus w2: then this whole thing is like a fourth-degree polynomial, so it increases really fast. Now, the saddle point here is flat. Okay, so if you start with w1 = w2 = 0, this thing is not going to take off. The gradient is zero, and the weights are not going to change, because the gradient is zero. This is why, when you train a neural net, you need to initialize the weights to random values: to break the symmetry, because if you set all the weights to zero, nothing takes off. It's a saddle point. Even with stochastic gradient, nothing takes off. Batch normalization may make things take off, but basically you need to break the symmetry. You need to tell the system which of the equivalent choices to make. If I have a neural net with multiple hidden units, I can exchange two hidden units and have them carry their weights with them: I get a different point in weight space, but the same input-output function.
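Going back to the two-weight example for a second: here's that saddle in a few lines of PyTorch — a minimal sketch (the 0.5 init is arbitrary; any nonzero start breaks the tie):

```python
import torch

# Toy two-weight "identity" net from above: loss(w1, w2) = (1 - w1*w2)^2
loss = lambda w: (1.0 - w[0] * w[1]) ** 2

w = torch.zeros(2, requires_grad=True)   # all-zero init: the saddle point
loss(w).backward()
print(w.grad)                            # both components zero -> nothing takes off

w = torch.tensor([0.5, 0.5], requires_grad=True)
for _ in range(100):                     # plain gradient descent
    if w.grad is not None:
        w.grad.zero_()
    loss(w).backward()
    with torch.no_grad():
        w -= 0.1 * w.grad
print(w.detach(), loss(w).item())        # lands on the w1*w2 = 1 branch, loss ~ 0
```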
There are symmetries in neural net architectures: I can transform the weights in particular ways without changing the input-output function at all. I need to break that symmetry to get the system into a region where it has already made a choice of which weight is going to be positive or negative. In fact, there's a curious phenomenon here: there have been a number of papers over the last couple of years on something called the lottery ticket hypothesis. If you initialize a neural net with random weights that have the right signs — you know what the solution is, because you trained the network before, so you initialize the weights so that they have the signs of the solution you obtained — and you train your net, it's going to find a solution really quickly. It learns incredibly fast. If you set to zero all the weights that end up being zero when you train the network — a little bit of pruning — same thing: the system basically finds the solution really quickly. If it already knows which weights need to be zero and what sign the other weights need to have, the optimization becomes super simple and trivial. The main difficulty of training a neural net is breaking the symmetries: figuring out which weights are important and which aren't, and which sign each weight should take. Figuring out the optimal values, once you know which weights should be pinned to zero and what signs the others should have, is basically trivial. So, we talked about stochastic gradient optimization. Full-batch gradient descent is the worst method in literally all situations, because of its slow, linear convergence, whose rate is determined by the condition number of the Hessian matrix — unless you have a separate learning rate per dimension, it really doesn't work very well. But SGD works pretty well, as long as you take precautions, and we'll get to the precautions in a minute. Of course, SGD works because on average it does gradient descent, just with some noise. The noise may be an advantage, certainly because you exploit the redundancy in the samples: if the samples are very similar, you're better off just doing more updates rather than refining your estimate of the gradient over a large number of samples. In fact, it's been shown that if you use mini-batch training and your batches are too large, you actually take more time — more computation — to converge. It may be faster in wall-clock time, because you can parallelize more, but in terms of the amount of computation to reach a particular result, it may actually be slower to use large batches. In fact, the joke I make very often is that the optimal value of the batch size is one. It's almost always one. The reason we don't use one is the limitations of our hardware. The only reason we use mini-batches is that we use GPUs: it's easier to parallelize when you have a batch of samples than not, but there is nothing fundamental about batching. So anything you do that is premised on the fact that we use batches is probably a bad idea, because batches are an artifact of the hardware we're using. There's nothing principled about it; it's just the constraint of the hardware we have. But if you really want to use a full-batch method, you should use an algorithm called limited-storage BFGS, or limited-memory BFGS, also known as LBFGS. I do not recommend using it for any kind of real-size neural net, because it's a non-stochastic method.
But it also is much more expensive per iteration. Still, if you have a highly ill-conditioned function to optimize and the dimension is reasonably small, then you can use this LBFGS technique. It basically estimates the Hessian as it goes. It's called a quasi-Newton method: as the system changes the weights, it estimates the inverse second-derivative matrix, or rather a low-rank, compact version of it, if you want. So this could be useful for things like inference in an energy-based model or a graphical model, but not so much for learning. Okay, practical tricks now, in the 20 minutes we have left. People very often use momentum. So what is momentum? Momentum is basically splitting the update formula into two formulas. The first one says: I'm going to have a descent vector, which I'll call p k plus one, and it's going to be equal to the previous descent direction multiplied by some coefficient beta, which I can set to zero or to 0.9 or something, plus the gradient. Then I use that descent direction to do my parameter update. People call this momentum because it's kind of like a heavy ball rolling down a landscape with some friction: those formulas are basically equivalent to simulating a ball that runs down the loss surface with inertia but also friction. If you set this beta parameter to zero, then p k plus one is just equal to the gradient and you're back to normal SGD. Now, you can also write this as a single update on top of SGD, essentially, because the difference between your current weight value and the previous one, that's p k. And you have the beta parameter here, which you can set all the way up to one, and then you subtract the gradient and update your weight with that. Oops. This beta, by the way, is not the same as that beta: this is beta hat, this is beta. This one doesn't get multiplied by the learning rate; that one does get multiplied by the learning rate. So there's a simple relationship between the two. That accelerates things. Without momentum, you get this sort of erratic, oscillatory behavior in the high-curvature dimension, and it goes slowly in the other direction, for a given number of iterations. With momentum, it basically smooths out the oscillations. It doesn't reduce their amplitude; it just makes the trajectory smoother. But it accelerates in the direction of low curvature, because the momentum builds up, if you want, and so the ball accelerates in that direction. It doesn't accelerate in the direction where it oscillates; it accelerates in the direction where it doesn't oscillate. So that partially corrects for those ill-conditioning issues without having to do complicated things. That's the effect. There's a slight issue, which is that with large values of the beta parameter, the system tends to overshoot the solution and creates oscillations in some dimensions. So you've got to be careful about it. But it's basically a free lunch, because it's computationally really cheap in almost all situations. Good values of the beta parameter are between 0.9 and 0.99. Sometimes you get a little bit of gain by tuning it, not always. It certainly accelerates things.
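Here is a minimal sketch of those two formulas in action, on a small ill-conditioned quadratic (the test function and hyperparameter values are mine):

```python
import numpy as np

# SGD with heavy-ball momentum, written as the two-step update:
#   p(k+1) = beta * p(k) + grad f(w(k))
#   w(k+1) = w(k) - alpha * p(k+1)
def sgd_momentum(grad_fn, w, alpha=0.01, beta=0.9, steps=500):
    p = np.zeros_like(w)             # descent direction
    for _ in range(steps):
        p = beta * p + grad_fn(w)    # beta = 0 recovers plain (S)GD
        w = w - alpha * p
    return w

# Ill-conditioned quadratic: f(w) = 0.5 * (100 * w1^2 + w2^2).
grad_fn = lambda w: np.array([100.0, 1.0]) * w
print(sgd_momentum(grad_fn, np.array([1.0, 1.0])))            # near (0, 0)
print(sgd_momentum(grad_fn, np.array([1.0, 1.0]), beta=0.0))  # much slower along w2
```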
There are different types of momentum methods. There's the simple one we just explained, and a slightly more sophisticated one called Nesterov's momentum. Nesterov is a very famous scientist in optimization. There, the first formula is identical, and the second one involves recomputing the gradient, essentially, or looking ahead. So this is like a look-ahead: this is like a new parameter value, and you basically combine that with the gradient here. And this can be shown to actually accelerate convergence; there is theory about this. But the theory being what it is, the surprising thing is that it basically doesn't work much better than regular momentum in the context of neural nets, for mysterious reasons. So, certainly, momentum smooths the gradient noise, which is also something that batching does. But smoothing can be good or bad; it's not clear whether smoothing is good or bad. Here's an example: this looks like a very erratic, noisy trajectory; this looks like a much smoother one. But is this better than that? Not clear, because there is some advantage to having a noisy trajectory. First, it might help you escape annoying areas like saddle points faster, so it can actually accelerate learning. Second, it can improve generalization, because it drives the system towards regions of the space where the minima are flatter. What the system minimizes with stochastic optimization is basically the average of the objective over all the weight values visited by SGD. And if there is fluctuation, and the cost function has positive curvature, the fluctuations are going to cause the loss function, on average, to be higher. So this drives the system towards regions of the space where the curvature is very low, where the minima are flatter. And for various reasons, those minima are better for generalization. Because if you think about it, the optimal value of the weights on your training set and the optimal value on your test set are slightly different values. If you trained your system on your test set, you would get a different optimal parameter. So if you want the error on the test set to not be too different from the error on the training set, you want your loss function to be really flat around the solution. So you want to find a solution where the loss is really flat, and having noise in the gradient actually helps with this. There are even papers that say you should not just rely on the noise in SGD; you should actually add noise to the gradient to make the system look for flatter solutions. So that's the point. Those things are still being discussed; nobody really agrees completely on all of this. Okay, so what about automatic learning rate adjustment? There's a technique, a fairly complicated technique that I came up with over 30 years ago, which I'm not going to explain, that consists in estimating the diagonal terms of the Gauss-Newton approximation of the Hessian matrix. So, basically, a diagonal Levenberg-Marquardt approximation. And it turns out you can compute those diagonal terms of the Hessian using a backprop-like formula.
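As a rough sketch of what that buys you: each weight gets its own learning rate, scaled down where the estimated curvature is high (this follows the diagonal Levenberg-Marquardt idea; the function name and the value of mu are mine):

```python
import numpy as np

# Per-weight learning rates, diagonal Levenberg-Marquardt style:
# eta_i = eta / (mu + h_i), where h_i estimates the i-th diagonal term of
# the (Gauss-Newton) Hessian and mu keeps the rate bounded when the
# curvature estimate is near zero.
def per_weight_lr(eta, h, mu=0.1):
    return eta / (mu + h)

h = np.array([50.0, 1.0, 0.01])   # curvature estimates per weight
print(per_weight_lr(0.1, h))      # small steps where curvature is high
```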
But those diagonal-Hessian backprop formulas are not actually supported in current frameworks like PyTorch, so that algorithm is not used very much, although some people have revived it. Instead, people have tended to use simpler ways to adjust the individual learning rate of each weight, each parameter: a separate learning rate for each parameter. Here is a simple idea. As I told you earlier, batch normalization and other things like it will normalize the variance of the variables in your neural net. But some variables are not going to be normalized. In that case, what you should do is modify the update rule so that the learning rate reflects the fact that the corresponding input doesn't have unit variance, unit standard deviation. And this trick, which is RMSprop, consists in dividing the gradient for a particular weight by the standard deviation of that gradient. This is for one particular weight; the index has been dropped. So you divide by the square root of a running average of the gradient squared. I take back what I said earlier: this is a running average estimate of the square of the gradient. And the square of the gradient is a substitute for the corresponding diagonal term of the Hessian. It's not exactly the same thing, but it's sort of an approximation to it. So this is a bit like dividing by the diagonal Hessian. Strangely enough, there's a square root; I would not put a square root there, but RMSprop puts a square root, and why not? The epsilon factor is there to prevent the whole thing from blowing up when the variance of the gradient is very small. There's a much more complicated version of this called natural gradient, where you use the full covariance matrix of the gradient instead of treating each component independently. RMSprop is a good trick; that one is a better trick, but more expensive. Now, Adam is a very popular method, and it's basically RMSprop combined with a kind of momentum. So it's got a momentum formula on the gradient, and a running average, a momentum formula if you want, on the square of the gradient. And then your update is the momentumized gradient divided by the square root of the running average of the squared gradient. And this, again, is done separately for each dimension, so each weight has its own effective learning rate. So it's basically RMSprop with momentum. Here's how it works when you compare all those methods: SGD fluctuates a lot; RMSprop still fluctuates, but corrects the scaling and gives you a straighter direction towards the minimum; and then Adam adds momentum, so it overshoots a little bit but converges towards the minimum a little faster. For ill-conditioned problems, which may occur depending on your architecture, Adam is often much better than pure SGD, and it's better than RMSprop because of the momentum term. It's very poorly understood theoretically, and it has some disadvantages: there are some cases where it doesn't converge, and it sometimes gives worse generalization error. This is something really annoying: good optimization methods sometimes result in worse generalization error. It requires a little more memory, but that's not a big issue. And it needs a bit more tuning, because there are two extra hyperparameters, essentially, so that requires a bit of adjustment.
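As a sketch, here is what those two update rules look like written out (the hyperparameter values are the commonly used defaults, and the bias-correction terms in Adam come from the published algorithm rather than from the slide):

```python
import numpy as np

# RMSprop: divide each gradient component by the square root of a running
# average of its square (plus epsilon), so each weight gets its own scale.
def rmsprop_step(w, g, v, alpha=1e-3, rho=0.99, eps=1e-8):
    v = rho * v + (1 - rho) * g**2           # running average of g^2
    w = w - alpha * g / (np.sqrt(v) + eps)   # per-weight scaled update
    return w, v

# Adam: RMSprop plus momentum on the gradient itself (t starts at 1).
def adam_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # momentumized gradient
    v = beta2 * v + (1 - beta2) * g**2       # running average of g^2
    m_hat = m / (1 - beta1**t)               # bias corrections for the
    v_hat = v / (1 - beta2**t)               # zero-initialized averages
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```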
Okay, lastly, I want to talk about normalization techniques. You've been using batch norm and things of that type, and basically those are designed to cancel the mean and normalize the variance, or standard deviation, of all the variables in your network. Very often, those batch norm or normalization layers, whatever they are, are inserted between linear layers and activation functions. But you could argue that a normalization layer should just as well be placed after the activation function. In fact, you could have both. The reason for having it before is that you want the activation function, if it's a ReLU, for example, to be on roughly half the time and off half the time, and the best way to do this is to cancel the mean of whatever variable enters the ReLU. Subtract the mean, essentially. You could normalize the variance as well, but it's not as important, because a ReLU is equivariant to scaling anyway, so it doesn't matter much. But then, the next layer up is probably going to be another linear layer, and that linear layer has weights, and those weights really want their inputs to have zero mean and unit variance, and also to be decorrelated as much as possible. Decorrelation we don't really do; I mean, we know how to do it, but it's too expensive. But certainly subtracting the mean and dividing by the standard deviation is something we can do. So we could have a normalization layer right after the activation function too. It's very disputed where you should put them and what the optimal thing to do is. Truthfully, though, most people put it just before the activation function and not afterwards. Now, if you have a ReLU, the ReLU is not going to produce zero-mean outputs. To me, it hurts; I think it's horrible, but that's what people do. So here is the recipe for normalization layers. You take an activation and you subtract some constant, normally an estimate of the mean of the activation. It could be a mean of that particular variable over time, either just over this mini-batch or over a longer window with a running average, for example. Or it could be a mean computed over the whole layer, not over time but over the layer. Or it could be over a feature map of a convolutional net, or multiple feature maps, over channels but not over space. So that's the mean cancellation. Then you divide by the standard deviation of the activation, computed in the same way. And then you have two learnable parameters, which are basically a scaling factor and a bias, before you feed the result to the ReLU. That's the standard normalization layer. You need those two things if you want the system to be able to change the average activation going into the ReLU, et cetera. I'm not sure this one is particularly useful, actually.
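A minimal sketch of that standard normalization layer, with the mean and standard deviation computed over the mini-batch (one of the choices mentioned above; the function name is mine):

```python
import numpy as np

# y = a * (x - mean) / (std + eps) + b, with learnable scale a and bias b,
# applied before the ReLU. Here the statistics are taken over the batch axis;
# running averages or per-layer/per-channel statistics plug in the same way.
def norm_layer(x, a, b, eps=1e-5):
    mu = x.mean(axis=0)              # mean estimate
    sd = x.std(axis=0)               # standard deviation estimate
    return a * (x - mu) / (sd + eps) + b

x = np.random.randn(32, 8) * 3.0 + 5.0                   # 32 samples, 8 units
y = norm_layer(x, a=np.ones(8), b=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1
```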
Now, here's a bit of detail on batch norm; let me show it here. Here's a 3D tensor where two dimensions have actually been collapsed: you have H and W, the spatial locations for a 2D convolutional net, collapsed into one axis; the channel, or feature index, direction here; and then the instance within the batch in that dimension. Batch normalization says: you compute the mean of a feature map, a single mean for all the variables in the feature map. So you average over space and over the batch, and you get a single value for each feature map. You subtract that, you compute the standard deviation, or the variance, the same way, and then divide by the standard deviation. Again, that's a single value per feature map in your convolutional net. That's batch norm for convolutional layers, essentially. But there are other proposals. Layer norm averages over space and channels, but not over the batch. Instance norm averages over space, but neither over the batch nor over channels. And group norm averages over channels, but only over a small group of channels, not the entire thing.
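Here's a compact way to see the difference: each variant is just a choice of which axes of the (batch, channel, space) tensor you average over (a sketch; the group size of 8 is an arbitrary example):

```python
import numpy as np

x = np.random.randn(16, 32, 8, 8)        # (batch N, channels C, height H, width W)

def normalize(x, axes):
    mu = x.mean(axis=axes, keepdims=True)
    sd = x.std(axis=axes, keepdims=True) + 1e-5
    return (x - mu) / sd

batch_norm    = normalize(x, (0, 2, 3))  # over batch and space: one value per feature map
layer_norm    = normalize(x, (1, 2, 3))  # over channels and space, per sample
instance_norm = normalize(x, (2, 3))     # over space only, per sample and per channel
# Group norm: split the 32 channels into groups of 8, normalize within each group.
g = x.reshape(16, 4, 8, 8, 8)            # (N, groups, channels per group, H, W)
group_norm = normalize(g, (2, 3, 4)).reshape(x.shape)
```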
Now, why use all those different things? I'm a little uneasy about batch norm, even though it's very popular, because, again, as I said, the batch is sort of an artificial result of the hardware constraints of our systems. Why should we average over a batch? Why would the batch size that just happens to be right for the GPU also be the right size for normalization? There's no reason for it at all, other than convenience. There's no deep reason; it's just convenient. In fact, it might be better to use a running average over multiple batches, or maybe a subset of a batch, but it's not as easy, so people don't do it. They do this because it's simple. Same for the others: they're simple, but the fewer variables you use in your normalization, the more chances you have of things going south, blowing up, or not doing the right thing. So, normalization helps, for the intuitive reason I explained earlier: you want the local Hessian matrix for a given unit to be close to the identity, essentially. But for a full multi-layer net, why does it really help? That's still disputed. There have been papers on this since the 1980s; I had some papers on this in the 1980s, and there were a few papers in the 1990s. Then neural nets became unpopular, so nobody wrote about it. And then it became really interesting again when deep learning started to emerge, around 2014 or so, when people came up with all those methods like Adam, and there's another one called LARS, which is another variation. But the theory behind this is very disputed, and in practice some things work in some cases, not in others. So, not clear. Basically, normalization gives you peace of mind: it allows you to be more careless about how you build your neural net, how you scale your values, et cetera; you don't have to think about it too much. There are also some theoretical studies on the number of saddle points. There's a lot of very interesting theoretical work about the complexity of the loss surface of deep learning systems and the complexity of the functions that deep architectures can approximate. I'm not going to go into that, but just to tell you: there are various papers saying that the number of saddle points present in the objective function is very large, combinatorially large, actually. But they don't seem to be too much of a problem, in the sense that, as long as you don't get too close to them, SGD will find a good solution. There are two questions about this. Go ahead. Why is SGD preferred in SSL tasks? Well, SGD is preferred whenever the objective function you're minimizing is an average over many samples, and SSL is just an example of that. Okay, and then there is one more question: there was a recent paper from OpenAI or DeepMind challenging the use of batch normalization. Yeah, I mean, that's part of what I was just saying: a lot of the results are disputed, and the reasons for those methods to work are disputed. There are some semi-theoretical, intuitive arguments for why they work; there's no proof or anything. Experimentally they work in certain cases, but sometimes they don't, and we don't really know why. There is this issue I mentioned, that the methods that tend to be fast also tend to produce generalization that is not as good. Strangely, techniques that reduce noise, like momentum, may accelerate training, but in the end they may cause the system to find worse solutions, because the noise helps regularize the system. When we regularize neural nets, we can add noise to the weights. What people actually do is add noise to the states, to the activations: this is what dropout does. Dropout basically suppresses some of the activations of some of the units in a layer. That makes the system noisier, but it helps. There are other semi-theoretical arguments for that. So a lot of those things interact, the dynamics are very complicated, and we don't understand all of it. These techniques are used for learning, right? Whenever we train our models, the objective function is this average of per-sample losses. But we also use gradient descent for minimizing the energy to do inference, right? In those cases, do we use different algorithms? Yeah, so when you're doing inference in an energy-based model, you're finding the minimum of a function, which is just the energy function for the one sample you're considering. And this is not a sum or an average of many terms that are almost identical. So just remember this: you only want to use stochastic methods when your objective function is a sum of many, many terms, many of which are very similar to each other. SGD will exploit the redundancy between the samples. I think I've used this example before in a preceding lecture, but I'll use it again. Imagine I give you a training set with 1 million training samples, but in fact it's 10,000 repetitions of the same 100 samples. If you use batch gradient, you're going to compute the average of the gradient over the 1 million samples, and without noticing it, you're going to do 10,000 times more work than necessary, because you could have obtained the same estimate of the average gradient with only 100 samples. You just didn't realize it. Now, if you use SGD, by the time you've computed 1 million gradients, you've actually done 10,000 passes over those 100 samples. So it goes 10,000 times faster. This is not just two times faster; it's 10,000 times.
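Here's a scaled-down sketch of that thought experiment (100 distinct samples repeated 1,000 times rather than 10,000, on a least-squares problem; the setup is mine):

```python
import numpy as np

# "Big" training set: the same 100 samples repeated 1,000 times.
rng = np.random.default_rng(0)
w_true = np.ones(5)
X = rng.standard_normal((100, 5))
y = X @ w_true + 0.1 * rng.standard_normal(100)
X_big, y_big = np.tile(X, (1_000, 1)), np.tile(y, 1_000)

def loss(w):
    return np.mean((X @ w - y) ** 2)

lr, w0 = 0.01, np.zeros(5)

# One full-batch gradient step costs 100,000 per-sample gradient evaluations...
w_batch = w0 - lr * 2 * X_big.T @ (X_big @ w0 - y_big) / len(y_big)

# ...while the same budget of per-sample SGD steps makes 1,000 passes
# over the 100 distinct samples.
w_sgd = w0.copy()
for i in range(len(y_big)):
    xi, yi = X_big[i], y_big[i]
    w_sgd -= lr * 2 * xi * (xi @ w_sgd - yi)

print(loss(w_batch), loss(w_sgd))  # the single batch step barely moves; SGD is near the optimum
```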
Now, of course, in reality, in a training set, you never have samples that are exact replicas. But you do have samples that are very similar: you do data augmentation, so you have a whole bunch of samples that are basically all the same with some slight variation. Two examples of the digit one in MNIST are very, very similar, and by the time you get to the top layers of the network, they're basically identical. So you need to use SGD to exploit that redundancy. This is something that took Léon Bottou, myself, and a bunch of others about 15 years to convince the machine learning community of: that the three lines of SGD were more efficient than the super complicated optimization algorithms they were using for support vector machines and whatnot. But it's true. In the other case, though, if we're doing, say, optimal control, where we have one vector of actions for a given well-defined cost, then instead you're going to be using other types of algorithms. Right. So if you have a function to optimize, and that function is not an average over lots of samples, so there's no redundancy to exploit, and the function may be convex or non-convex, probably non-quadratic, you want to use an optimization method that is not stochastic and that takes advantage of the second-order properties, that estimates the inverse Hessian as it goes and things like that. Now, if the dimension of the variable you're optimizing is large, say more than 100 or so, you don't want to use any sort of order-n-cubed method like Levenberg-Marquardt; that's going to be too slow. Or full BFGS, which is another method that does this. So, BFGS is a so-called quasi-Newton method that estimates the inverse Hessian as the algorithm proceeds: basically, by computing the difference between gradients at different places, it can update a matrix that is an estimate of the inverse Hessian. That's what quasi-Newton methods are. So BFGS is at one end of the spectrum. At the other end of the spectrum is something called conjugate gradient, which is very efficient. It's an order-n algorithm, so there are no matrices in it, just vectors you compute dot products of. But it exploits the second-order properties of the objective function. It requires a line search between successive search directions, but it's implemented in SciPy; you can just use it. In between, there is something called LBFGS, which means limited-memory BFGS, an invention by Jorge Nocedal, the co-author, with Léon Bottou, of this big paper on SGD. It's not a stochastic method; it's designed for full gradients. And it's basically intermediate between BFGS and conjugate gradient. The L refers to a parameter that controls, basically, the rank of the matrix, not really the rank, but the complexity of the matrix that you use as your approximation to the Hessian. That parameter is equal to n when you use BFGS, and equal to 1 when you use conjugate gradient. Set it to three or four and you have limited-storage BFGS: it's only a little more expensive than conjugate gradient, but it might help the speed of convergence. So it's something you might want to use. I would start with conjugate gradient and then maybe try LBFGS. Cool, awesome. And so this is the end of the class for today, I think. All right. Thank you everyone for joining. We'll see you tomorrow for learning the truck backer-upper: controller and policy, controller and emulator. All right. Bye-bye, everyone. Take care, everyone.