 Hello and welcome to lecture number 12 in numerics of machine learning My name is Lucas. I'm a PhD student in professor Hennig's group And I'm gonna talk about second-order optimization for deep learning. So last week in Frank's lecture and also in the exercise You have probably seen that deep learning of neural networks is pretty tedious and an extensive expensive process and Today we're gonna learn about a different class of optimizers namely second-order methods that at least Don't have some of these problems that first-order methods have Yeah, so that's the plan for today, but we will also see that these specific optimizers have other difficulties and unsolved issues Especially in the context of deep learning that I want to address So I want to start with a brief example So this here is an example of a minimization problem and you can see the so-called Rosenbrock function in 2d Which is this banana shaped function, which has a global minimum at at 1 1 and we start three optimizers at this initialization of the parameters and Here you can see basically the three trajectories of SGD with some learning rate Adam and LBFGS So maybe the first thing you notice is that for SGD and Adam I need to specify some learning rate, right? So for those two I have hyper parameters and that you can basically by hand just by trying out different values and For LBFGS apparently I don't have to I just use the default parameters and apparently it works So this is maybe one first observation from this little example The next one is that if you take a look at the trajectories then for SGD, we have this very jumpy behavior so it seems to kind of serve this very narrow valley and Doesn't seem to make Yeah, good progress I would say Whereas for Adam this looks a little bit better a bit more stable because it seems to make more progress along the low curvature direction as well and LBFGS just looks nice in my opinion. It seems to make kind of well-informed steps and It also seems to make only very few steps So maybe one guess you could already have is and that the convergence of LBFGS is faster than for the other two So let's check if this is right So at the top plot here I plotted the loss value so the value of the function we're trying to minimize and on the x-axis We have the number of steps So for LBFGS, I think it's seven steps until we reach the tolerance of ten to the minus eight For Adam, it's in the order of a thousand and for SGD. It's in the order of ten thousand So, yeah, and it seems that And SGD and Adam take many steps to reach the tolerance Whereas LBFGS only requires less than ten steps and now you could maybe argue with that Okay, maybe it does less steps, but it could still be more expensive in terms of runtime, right? Because if every step is just much more expensive than yeah, maybe it's just not worth it But actually it turns out that also in runtime the difference is not Yeah that large anymore as in terms of the number of steps, but still LBFGS is much faster than Adam and certainly faster than SGD so my question is first of all, how do such methods work and Secondly, do they maybe also have potential for deep learning? Why don't we apply them to deep learning, right? So this is what I want to address in this lecture and Before we get to the fun the interesting stuff I want to briefly summarize and recap first all the methods But I'm gonna go through this quite quickly Because first of all, I think you have seen this in multiple lectures before and also last week in parts by Frank So first of all, we have training data when we do deep learning We have basically training examples of inputs x and outputs y and we have this Training set which has n such examples and these examples are assumed to come from a Data-generating process basically so a distribution which is called P data and We assume that we have IID samples from this distribution Then we have a neural network or a model that makes predictions based on a parameter vector theta which comes from this space and An input which comes from this space and this is then mapped to a C dimensional output vector Which is basically a C is the number of classes, right? So this is a vector of dimension C and In the last step, we're gonna compare this prediction y hat with the actual label from the training data y so This would be the loss function. So it takes those two things maps it to a real number and This is basically so the loss is basically a measure of dissimilarity right between the prediction and between the actual label and Now we can Do this for the entire distribution of data that we have so the expected risk is defined as the expectation and basically Taking samples from this data-generating process and then taking the expectation over these numbers Okay, so this would be the expected risk and So the problem with this is if this is our objective function, which you want to minimize we cannot because we usually don't have access to this distribution So what we do instead is we compute a Monte Carlo estimator You should know about them from Professor Henning's lecture. So you can basically replace the expectation With an average you could say and we just plug in those numbers of the loss for samples coming from the distribution This is basically the idea of a Monte Carlo estimator so we end up with the empirical risk the Monte Carlo estimator of the expected risk and Our goal in deep learning is to find Configuration of our network so such a parameter vector theta that minimizes this empirical risk Okay, I'm sure you know that there are Numerical methods for this like gradient descent. So here the idea is pretty simple We just go into the direction of the negative gradients because the negative gradient points into the direction of steepest descent, right? So this is usually what for first-order methods do they take the gradients and they Compute the next iterate in this case theta k plus one and In this case of gradient descent This is the old parameter vector Minus alpha times g alpha is the learning rate or step size and g is the gradient of our target function The expected risk so in 2d. This would maybe look something like this So the the lines here are contour lines so we have kind of a bowl shaped function and We start with some iterate theta k and go into this direction of the negative gradient. This would be the idea Now maybe most of you already see the problem We have the same problem as before we cannot really compute this gradient because we don't have access to the data generating process so again We can compute an estimate of this on finite data using a Monte Carlo estimate Which leads to stochastic gradient descent SGD So the idea is we can estimate this thing on finite data So this could be training data. It could also be Which it is in practice a mini batch of data, which is much smaller than the actual training data and Then compute Monte Carlo estimator for the gradient, which looks like this So it's again an average over individual gradients and these individual gradients So there is one for each Example in the mini batch, right? So for example if we have three such Examples a batch size of three then we get three individual gradients as Frank has already Also shown you in the last lecture and the average of these three orange arrows is this estimate she had okay and A nice property of this gradient is that it's unbiased because it's a Monte Carlo estimator. It's Yeah, provably unbiased and The SGD update then looks like this Basically exactly the same as gradient descent just with a stochastic gradient and The efficiency of such methods can be improved if we use for example moving averages so something like momentum Because then you get more robust estimates of The gradient for example and you can also try to incorporate higher moments of the gradient So something like a second moment for example, which is for example also done in Adam So we compute an estimate of the variance basically All right, so on this slide And I try to show you that gradient based methods are sensitive to the condition number So what is the condition number? It's simply the ratio of the maximum and the minimum directional curvature and the condition number is denoted by Kappa and It is defined as L divided by mu So L would be the maximum curvature and you would be the minimum curvature in your in your loss function and Let's assume for a bit that mu is larger than zero. I will later address this point in more detail But here I want just to emphasize that if the condition number is large is it's a problem for gradient based methods That's the point of this slide so if the large curvature in this in this ball shaped function is the same as the small curvature Which means that we're in such a case here then the gradient is actually a decent direction I would say so Because it pretty much points into the direction of the minimum. So it's a good search direction, right? And if you have an appropriate Learning rate, this is a good step and we make fast progress Whereas if the situation looks more like this and we have a direction in which the curvature is large So this would be this direction and we have another direction in which the curvature would be very small So in this direction the the gradient barely changes because it's really flat, right in this case The condition number would be large and you can see that The gradient doesn't really point into a good direction because it would basically move like orthogonal to the Connection line to the minimum and we won't make much progress here And this intuition can be basically turned into a theorem for the convergence of these methods So in this case for gradient descent So if your function is l smooth and mu convex Which basically means that the maximum curvature is l and the minimum curvature is larger than 0 Then gradients descend with some step size 1 over l. So we need to specify the correct step size here Achieves this convergence so the distance to the minimum in the next step is going to be smaller equal to constant factor times the current distance to the minimum And this factor is smaller than 1 let's again Consider like the different values of kappa. So if kappa is around 1 then this factor is around 0 Which means that we basically converge in just one step Whereas if this kappa is large and we have an ill-conditioned problem Then this number here is gonna be close to 1 and the reduction each step is only gonna be very small So we make and small Progress and therefore we have slow convergence And the result is pretty much similar for SGD it convergence also linearly which means that it also has this constant reduction factor in each step But it converges only in expectation because this is now a stochastic process It doesn't really converge to the solution but only to some like level determined by the noise in the gradients and Also at a slightly slower rate depending on how noisy the gradients are But my point here is that gradient-based methods are basically sensitive to ill-conditioned problems Now maybe the question is how can we improve these updates, right? What can we do to the gradients? How can we rescale it to make it better? So I would say what we should do is we should make a small step in this direction For example, maybe up to here Which is basically saying we multiply it by 1 over L Because L is the large curvature direction Whereas in the small curvature direction where we have curvature mu We want to also multiply or rescale a gradient in this direction with 1 over mu Okay, because then if we do this then we have a small update in this direction and a large update in this direction And we would maybe get something that points more into the direction of the minimum So what quantity am I describing if I say multiply by 1 over L or multiply by 1 over mu? Which quantity is that? Yes Yeah, but we're not talking about variance here, but I mean this would also hold for the deterministic case, right? Yeah, it has to do with curvature exactly and Maybe more precisely it has to do with inverse curvature, right? Because I say multiply by 1 over L or a multiply by 1 over mu. So what I'm describing here is a Quantity that is or this quantity I'm describing is basically inverse curvature So in order to reduce this or eliminate This dependency on the condition number. We have to introduce or access inverse curvature and This leads directly to second order methods So With second-order methods. I mean methods that use curvature or inverse curvature somehow to improve the steps and The central quantity of second-order optimization is the so-called Newton's step so I Want to give a quick derivation here So basically this is all based on a quadratic Taylor approximation of the lost landscape around theta k So we're gonna try to approximate our loss function at our current iterate theta k with some quadratic function and This quadratic function has a specific form which is a constant Plus a linear term in theta with the gradient and the quadratic term in theta with the Hessian of Of the loss function and we're gonna assume that the Hessian is positive definite Again, I'm gonna go into more details later So basically what we're doing here is we compute a bold-shaped function that serves as a proxy or an approximation of The actual target function and now because this approximation has such a specific form, right? We can we have this precise form here and we can actually do minimization of this proxy in closed form So the idea for the Newton's step is we could choose our next iterate theta k plus one Such that it's the argmin of the quadratic Okay So that basically means we go from here directly into the minimum of the quadratic function and this would be the new step and it's pretty easy to Yeah to derive that the closed form formula for the Newton's step Well, how do we minimize q? We can simply do that by computing its gradients and setting it to zero Which is pretty straightforward and because the Hessian is positive definite It's also invertible and we can write down the Newton's step Which is exactly this quantity here Okay, so we need access to inverse curvature and the gradient Okay, this would be the Newton's step Any questions so far? Was the derivation clear? Am I boring you? Okay Okay, so the Newton method can be much faster than a gradient based method in certain situations So for a Newton method we actually get local quadratic convergence. So earlier as a reminder For the gradient based methods or a basic gradient based methods like SGD or gradient descent We get linear convergence so conversions of this form Where as we can get quadratic convergence order for Newton's method so if we have target function which is twice differentiable and we have Hessian which is Lipschitz continuous. It basically just means that the Hessian cannot change arbitrarily fast In a neighborhood of the solution and the sufficient conditions have to be satisfied at this point Which means that the gradient is zero and that the Hessian is positive definite at this at this point And if we are close enough to the solution already then the sequence of Newton updates achieves quadratic convergence order and Here I try to compare them a little a bit visually Just for a bunch of example numbers just to make this point so I basically just compare linear convergence to quadratic convergence and The data is the same for both plots and just here I have the a linear scale on the number of steps and here I have the distance to the optimum currently and you can see that it's basically a vertical line for The quadratic convergence and it's a straight line for linear convergence because in a semi logarithmic plot like we have here If we have a constant reduction factor in each step this is going to be a straight line in a in a semi log plot But I mean we cannot really see what the what the orange line is doing here Right, so I also plotted the number of steps in a logarithmic scale And we can see that for for this particular example We would need something like 10 steps for Newton and something like a thousand steps for a method with linear convergence So my point is Newton method can be really fast in certain situations so on This light a little maybe an intermediate summary so Newton methods seems great, right? It is it is robust against ill-conditioned problems. So it doesn't have this dependency on the condition number second other methods can be Really fast to get quadratic convergence if we're initializing close enough to the optimum in Contrast to gradient-based methods Newton steps have a natural length. So with that I mean that In the proof for the convergence of gradient descent method We needed that we needed to specify the learning rate basically. We needed the learning rate to be 1 over L and It's such a thing didn't appear in the convergence theorem for Newton, right? So apparently Newton steps have a natural length scale and don't need this learning rate at least in certain situations and Last but not least they have a long success story and are the default choice in many scientific fields So why don't we use them for deep learning? Why don't you use them for deep learning? Yes Yeah, so the question is basically about a computability and you said that it's easy for the gradient But what about the Hessian and it turns out that you can actually compute additional stuff in the backward path Like coverage information as well and it comes at a price, but in principle you can do it. Yes Yes exactly Yeah, that's a good point. So cost is a problem. How do we start the two metrics is a problem more ideas? Yes Sorry, can you repeat? They are stuck at local stuff. Okay, but this is maybe also true for a gradient-based method, right? They could also converge to a local minimum Yes Yes, that's a good point. So exactly so Yeah, your point was that the Hessian is not necessarily positive definite which means that we basically have a non-convex problem and then What is the Newton's step? We're gonna take a look at this as well Maybe one point that didn't come up yet if you think about The difference between gradient descent and SGD there is a difference, right? So stochasticity exactly So I would say these are the three main problems for applying second-order methods to deep learning Non-convexity, how do we deal with a non-convex objective function which usually we have in deep learning? What about the cost? How can we access store and invert even? huge curvature matrices and The third point is about stochasticity. So how can we handle stochasticity and in the following I in the rest of the lecture basically I want to go through these three problems with you and Give you or try to give you an overview of the ideas that exist for these problems So of course, I cannot go into the technical details for for every method But I'm rather trying to give you some rough idea of the concepts that are applied question With stochasticity, I mean so if you have usually you cannot Compute those quantities you want precisely, right? If you want to compute a precise Newton's step You need a precise Hessian and the precise radiant and Because we cannot compute those quantities for the expected loss We have to estimate them, right like we had to for SGD for example and this can have a very a very strong effect on the performance of these methods and We should at least know how they are affected by noise if we want to apply them in these situations All right So let's start with the first point non convexity. How can we deal with non convex objective functions? So earlier we assumed that the Hessian is positive definite, right? And when we derived the Newton step and the question is why? So I would say it's pretty much clear that we cannot allow Eigenvalues to be zero because if we have a zero eigenvalue Then the Hessian is simply not invertible and we have a problem Because then the Newton step does also not exist, right? But why didn't we allow negative eigenvalues? So if we consider a cut through the quadratic model along the corresponding normalized eigenvector so we're gonna take this quadratic model this bold-shaped function and we basically restricted to a viewing at it over a line segment and so we get a one-dimensional function, right and And Here I'm basically showing you how this function looks like So we're gonna start from our current iterate and we go into the direction of the eigenvector e With some you could say a step size tau. So we're gonna consider this as a function of tau And if you know just plug in the numbers for the quadratic model you can write it in this form and Now the interesting bit is the curvature. So and We have H times e e is an eigenvector. So h times e is the same as lambda times e We can pull lambda in front because it's simply a scalar and we are left with e transpose e and Because I assume that it's a normalized eigenvector. This is the squared norm, which is just one so we can simplify to this thing and If we consider this as a function of tau, this is simply a quadratic function, right? It's simply a parabola Which makes sense because a cut through a multi-dimensional quadratic is a parabola. It kind of makes sense but the interesting thing again is So so first of all, this is a parabola right in tau with a constant curvature lambda so if you compute the second derivative of this function with respect to tau, it's gonna be lambda and This is nice because it gives us an intuition Between the connection of eigenvalues and curvature so if you consider a cut through a quadratic function along some eigenvector The curvature along this direction is going to be the eigenvalue. So I think this is a nice intuition And in this specific case We assumed that this lambda is negative, right? We want to know what happens if we take negative eigenvalues so think about how this parabola looks like if this Curvature is negative. It's bent downwards, right? So if you take a large enough tau if you let this go to infinity the quadratic model is simply unbounded from below Which means that the Newton update is meaningless because it was fundamentally based on the idea of minimizing this quadratic And now if I have a negative eigenvalue Then this quadratic does not have a minimum So in this case like you correctly said It doesn't work, right? And in deep learning The problem is that Usually the Hessian is indefinite so that there are positive and negative eigenvalues So the Newton step is meaningless So what can we do to address this problem? So one idea you could have is So the problem seems to be the Hessian, right? So maybe we can just use a different quantity and And one example of such a quantity is the ggn which is the generalized Gauss-Newton matrix or ggn in short and As we're gonna gonna see this is a positive semi-definite approximation to the Hessian And we can basically use it as a replacement of the Hessian So here I want to give a little derivation So the ggn kind of arises when we take a look at How the Hessian decomposes using this split So the loss function is basically or if you want to compute the loss What you do is basically you take the parameter vector theta you map it through your or you Apply it to your model and then you're gonna take the model outcome and apply your loss function, right? so this is kind of a two-step process and And The derivation for the ggn is based on applying the chain rule to this split and Because all those quantities are vectors and matrices. It's a little bit difficult and you would actually need something like Matrix differential calculus to do that properly, but I still want to give you like the rough idea of how the derivation works so if we now compute the gradient of the loss and We want to do this with chain rule according to the split The idea would be to compute the derivative of the loss with respect to f and then of f with respect to theta So this would be the chain rule, right? And this is exactly what happens here So if you want to compute the first derivative of the loss with respect to theta We first take the derivative of the loss with respect to f so with respect to the first argument And then we take the derivative of f with respect to the parameter theta So this is basically simply how chain rule works Just that it looks a little bit more difficult here because we have those matrix quantities here So Jay would simply be the Jacobian of the model with respect to theta Okay, now if he wants to do this another time we can and Because we have this product here. We basically need to apply the product rule Okay, so how did product rule work again? So we fix the first thing and just compute the derivative of the second Which will give rise to this term and then we'll do it the other way around and we're gonna compute the derivative of this term and multiply with So we fix this quantity and this is what happens here in the second line So here we just copy the Jacobian transpose and then compute the derivative of this thing with with respect to the parameters and we can apply change rule again So in order to compute the derivative of this with respect to theta You can compute the second derivative of the loss with respect to f and then multiply again with the Jacobian so this is simply change rule another time and For the other thing we have to compute now the derivative of this matrix which is yeah a little bit hard because as I said You would actually need to metrics differential algebra to do this properly But basically the idea is to fix one output of the model f then this mapping is basically So f would be if you fix one component, it's just a real number and we can compute second derivatives and Then this is basically simply in the the Hessian With respect to theta of the model But only this the c the cth component of this model and we multiply with the Cth component of the gradient vector of f with respect to of the loss with respect to f So maybe this was a bit confusing But the core idea is basically it to consider the loss function as this split Into a first applying the model and then applying the loss function and applying change rule to this split This is the core idea and you can see that the Hessian then kind of Yeah decomposes into two parts the first part is gonna be what we Define as the GGN and The remaining part is just a residual that kind of makes a difference between GGN and Hessian But the GGN is now defined as this quantity. So just the first part But this would only be the GGN for one particular training example. So for one input output pair xy and Exactly the same as for the Hessian and we're gonna define this quantity as the expectation Over the data-generating process of this product of Jacobian This is the Hessian of the loss with respect to f and the Jacobian again Okay, so this is the GGN and A little bit of intuition what this or how this matrix is different from the Hessian is The GGN neglects curvature from the model because it basically just says okay I ignore this this entire second term and this second term contains Second derivatives of the model with respect to theta. So we're basically neglecting Curvature information coming from the model rather than from the loss and A different way to say this is if the model f is linear in theta Then the Hessian would be zero right the second derivative of a linear function is zero So this term would drop out anyway and in this case Hessian and GGN coincide So if you consider a linearization of your model in theta and then GGN and Hessian are actually the same quantity Okay, so now why did we do all this? I mean the point of this was that we need something that is positive definite, right? And now let's check if this is actually the case So on this slide I want to show you two main properties of the GGN. First of all, it's a metric So if you consider G transpose, which is simply the expectation we just saw I omitted this Distribution because I didn't want to clutter up the slide But yeah, the expectation is still over the X and Y coming from the data generating process and we take the expectation transpose and Because the expectation of a matrix is defined element wise it doesn't make a difference if you pull the transpose inside or not So we can do that and now apply apply the rules that go for the transpose So we have the Jacobian transpose in front then the Hessian transpose in the middle and Jacobian transpose transpose so Jacobian at the right so we arrive at this quantity and now the Hessian is Symmetric because second derivatives are symmetric, right? So we arrive again at G so it turns out the GGN is symmetric Now let's check if it is also positive semi-definite So let's apply a vector from the left and the right some vector V to the GGN Then this is simply a plug in the definition We can pull the vectors inside of the expectation because they are just constants, right? We can pull them inside and outside of the expectation because of its linearity and Then we can basically group those Jacobian vector products together and this is inside the expectation is equal or larger to zero if This Hessian is positive semi-definite, right? And if we take the expectation this property still holds so the question that is left is is this Hessian actually a positive semi-definite and It turns out for basically all relevant loss functions it is and Maybe this is most obvious if we consider the mean squared error loss so in this case the loss would simply be Computing the norm squared between the outputs of the model and the true label and this norm is squared and In this case the Hessian would be a scalar times the identity matrix So it clearly is positive definite and that means this property basically also transfers to the GGN And the GGN is positive semi-definite, which is exactly what we wanted Yes I think you could still compute something like the GGN for that But I mean it's it's really it's fundamentally based on the split So if you don't have the splits, I'm not sure if it's still well defined Yeah So one thing that I that I can think of is maybe compute the GGN still for the loss function and Simply compute the Hessian of your regular riser because if for example for a two realisation This is simply that identity matrix, right? And then you could maybe have something like GGN plus Parameter times identity, which is simply damping and we're gonna talk about this a bit later So this is maybe one thing you can do But I think like in this definition. It's it requires the split of model and loss function All right, so this is one possibility For replacing the Hessian with a GGN Because in this case is positive semi-definite, right? And it gives us a well-defined new step And there is also another matrix, which is called the Fisher information matrix so Assuming that you can write your loss function as negative log likelihood So this is for example true if you have A softmax and across entropy loss then your your model basically describes a distribution over the classes Because you get this this vector of dimension C and it's basically a distribution over Yeah, as I said the classes and In this case you can define the Fisher information matrix as this quantity and At this point. I don't want to go into the details But this matrix is also positive semi-definite and maybe I can try to give some intuition So if you replace the Hessian with the Fisher and compute a Newton step Then this is steepest descent in distribution space. So what does this mean? so One way to think about it is steepest is basically a ratio, right? If we're talking about how steep a function is it's basically a ratio and it's a ratio of a difference between two function values and a difference between two parameter vectors and With this ratio, you can basically say how steep the decrease in the function is But this notion of steepness depends on the metric you apply it to your parameter space So for example, if you say my parameter space I want to apply the a clean metric then what comes out of is it's just a gradient descent because the negative gradient points into the Direction of steepest descent, but it could also use a different metric for measuring distance between two parameters and Because we have this interpretation That the parameter basically models Distribution we could also say we want to measure the distance between two parameter vectors by measuring the distance between the two Distributions for example in culpac lipler divergence and when you do that then The steepest descent is actually this step. So the Fisher inverse times the gradient So this is what I mean by steepest descent in distribution space. It's basically applying Culpec lipler divergence to the distributions. Yes Sorry say again Yes the most in some way, yeah exactly yeah Okay, and there is a strong connection between the Fisher information metrics and the Ggn because in many cases for example Softmax was entropy classification. Those two matrices are exactly the same okay, so We have solved the the problem. I would say kind of because we have found like a reasonable Meaningful curvature proxies we can use instead of the Hessian and we get a well-defined new step But even in the convex case even if the Hessian was positive definite we have still one problem and The problem is that the quadratic model can be arbitrarily bad. So if you consider this problem here So a one-dimensional visualization. We have the loss function in black and We have the quadratic approximation shown in blue and Clearly if we minimize this function and go to this point It's not a good step because we actually not decrease the loss but rather increase it even And this is maybe something that we want to avoid and this leads to the trust region problem and damping So the idea is basically to restrict the updates to lie within some radius are So basically a radius in which we trust our quadratic approximation to be precise and then the trust region problem is We still gonna define the next iterate as the argument of the quadratic But under an additional constraint that we don't move too far away from our current iterate We want to stay within this trusted radius basically and this leads to a modified Newton step Which you can see here. So basically the thing that changed is that we can simply add Delta times identity So Delta is simply a scalar and this is the identity matrix and we add this to the curvature matrix for some Delta larger equal to zero and With damping we can basically interpolate between gradient descent and the full Newton step So if you take Delta equals zero here, then this is simply the new step, right? And if you make Delta really really large then this term in the sum is kind of dominated by the identity and we can forget about the Hessian and Now what is Delta times identity inverse? Well, it's just one over Delta so we're left with Next iterate is current iterate minus one over Delta times gradient Which is simply gradient descent with a small learning rate, right and basically with damping we can interpolate between those two extremes between the full Newton step and Gradient descent with a very small step size So with damping we can basically control how a conservative we want our updates to be and now the question of course is how can we choose the radius and It turns out that Even if we could change a reasonable radius, then oh, sorry Even if we could define a reasonable radius then it would be really hard to compute the corresponding Delta because I Mean this Delta depends on the choice of radius, right? And it's easier actually to work with the damping directly In a heuristic way So for example, there is this living back mark what heuristic and it's basically based on the so-called reduction ratio So it compares to things it compares how much and so if my quadratic model is correct What's decrease in the loss function? Do I expect when I take this step which would be the distance from going from here to there? and it compares this to the actual reduction of the loss function, so in this case the jump from here to there and Depending on how well those two numbers match it will adapt the damping So for example, if they are pretty much the same that means that our quadratic model is a good approximation of the loss function Then the damping will stay the same whereas for example in this case the reduction that we expect is Much larger than the expect than the reduction that we get so in this case We would certainly increase the damping by some factor. So this is basically how this heuristic would work So Here I want to try to give an answer to the question we had How can we deal with non convex objective function? And I think the answers that we can give are pretty concrete in this case So instead of the Hessian we can compute positive semi-definite curvature matrices like the GGN and the Fisher They provide meaningful curvature proxies So, I mean they have some there is some intuition behind them We can interpret them and often they are even the same which is also nice And it's pretty easy to give unbiased estimators of both of these matrices on finite data so we can basically replace the expectation by a Monte Carlo estimator for both cases pretty easily and The second thing we talked about is damping so we can control how conservative updates we want to be and Using damping heuristics like live in back mark what? This actually works pretty nicely in practice. So I would say for this question. We have pretty much concrete answers So let's move to the next problem cost So the question here is how can we access and invert this huge curvature matrices? So to illustrate this problem. I have a little Example so if you for example consider the all seen and seen at work, which has 1.4 million parameters Which is I would say in modern times Maybe a moderate size of a new network, maybe even small Then the Hessian requires to store these squared numbers, right? It's a matrix and If for each number we need four bytes So if you want to start this as a float we would have to store 8,000 gigabyte So I don't know what like machines you have but I don't have a machine where I can store or even invert such a matrix So what can we do about this? So the problem here is a Storage is o of t squared and inversion at least in the upper bound is cubically in cost so how can we do this and For this question, I would say we don't have to reinvent the wheel, right? We can basically borrow ideas from the new algebra Because I mean for example from Marvin's lecture and also from a unit Hans Wenger lecture And there are certain ideas we can just use for example. We could use a low rank approximation of the Hessian Which leads to something like a BFGS or LBFGS? We can use iterative methods that So something like CG and gradients which leads to Hessian free optimization and we could use Structured approximations or approximations that make use of the structure of the curvature matrix, which leads to something like k-fac So in the following and I would like to briefly show you how these ideas work So let's start with BFGS So the core idea of BFGS is that we want to gradually learn an approximation to the inverse Hessian from grading observations So there's a lot of stuff going on in this statement. So first of all, it's a great It's a gradual process. So it's an iterative process where we step by step learn how the Hessian or the inverse of the Hessian looks like and It can actually so I deliberately put learn here because you can actually see this as an inference process in the the Bayesian framework and It's an approximation to the inverse Hessian. So not the Hessian, but the inverse Hessian directly and I'm going to show you how this works and This is going to be based on gradient observations Okay, so we're not going to actually compute curvature information here But we kind of gonna deduce from gradient observations how the inverse Hessian will look like and How does this work? So I brought another visualization in 1d so if you think about the situation where you have two parameter vectors or in this case numbers theta k theta k plus 1 and You can evaluate the gradient at both positions. So this is the gradient, right? Not the target function, but the gradients now if the question is can you give an approximation of the curvature so this the The the curvature of L. So the first derivative of the gradient, right? And how can we approximate first derivatives? Well, we can use the secant, which is the screen line and compute the slope of this function and The slope of this function is simply yk divided by sk So basically a finite difference approximation to the gradient function, which gives rise to the second derivative Is this clear? Okay, I see some nodding And basically the same idea and you can basically transfer this idea also to the multi-dimensional case so if you multiply this by sk then it just goes to the other side and We arrive exactly at the so-called secant equation So we have yk, which is the difference in the two gradients We have sk, which is the difference in the parameter vectors and here we have the Hessian matrix and and Now if we want to get to a bfgs we basically all the idea is so we want an approximation of the inverse Hessian, right? So we're gonna take properties from the actual inverse Hessian and we're gonna require our approximation to have the same properties That's what I would say a reasonable idea So the first constraint we will have on our approximation to the inverse, which is denoted by b Is that it fulfills the secant equation for inverse So here I have the secant equation for the actual Hessian And if you multiply this from the left with the inverse Hessian you get inverse Hessian yk is sk Which is basically the Yeah secant equation in the variant for the inverse Hessian not for the Hessian, but yeah, the idea is still the same Also, we know that the Hessian is symmetric, which means that the inverse Hessian is also symmetric So we're gonna require our approximation bk plus 1 to also be symmetric and Lastly, we're gonna require our approximation to not or to change The changes should be as small as possible between two iterates and these conditions actually give rise to the bfgs update so You can write write down a recursive formula for how to construct these approximations to the inverse Hessian And these updates only involve the previous approximation and the vectors sk and yk right, so Basically, that's the idea. Yes This one. Yeah, so in this case, it's supposed to be a one-dimensional quantity, but yes in general It's a vector. Yeah Yes so this is just for the 1d case and This is for the multi-dimensional case Because like you said this this formula simply doesn't make sense for vectors So this is simply to explain why the form in the multi-dimensional case looks as it looks Not as a proper derivation because I just wanted to make the core idea clean clear Because deriving this would be more complicated. Yes a good question So usually this is initialized as a multiple of the identity matrix because what else do you do, right? But yeah, it I am I am not completely sure so I know that this is like the default choice But there may be cases where you want a different initialization so some kind of Better approximation to the inverse already if you can compute it then just the identity matrix Basically some yeah, you could phrase it this way. Yeah All right So now we still have a problem because this matrix pk So the approximation to the inverse Hessian is still d times d right and in general this is a dense matrix So no sparsity and we still have to store it So the idea of Or the the insight that is important here is that at iteration k This approximation is the result only of beef be zero and The entire history of those vectors sk and the vectors yk So if we store those vectors we can basically recursively construct this matrix And the idea of lbfgs is that we can still get a good curvature approximation or sorry approximation to the inverse Hessian bk hat by only using the most recent curvature information so we're not going to use the entire history of vectors sk and yk going from like the current iteration back to zero, but only like a fixed Window of some fixed size l and With this the claim is that we can still get a good curvature estimate and that means that we have less memory consumption because now we only have to store 2l vectors instead of 2k and Yeah, so if l is reasonably small in some cases, it's even one This is much cheaper than storing the entire matrix and I mean ultimately we also We want to compute the Newton update right so we want to Apply this approximation of the inverse to the gradient and it turns out that there is also an efficient algorithm for that Which is called the lbfgs to loop recursion But at this point I cannot go into the details But just wanted to let you know that there is a specifically designed procedure that is very efficient to apply this matrix to the gradient So next I want to talk about Hessian free optimization So the core idea here is to basically use CG for minimizing the quadratic because CG as like It's specifically designed for minimizing quadratic functions, right? So this is exactly the case maybe where we want to apply it And the nice thing about CG is that it only requires us to do a matrix vector products And not actually So we don't need that the curvature matrix to be in memory explicitly We only need to be able to multiply with this matrix and It turns out that also the Hessian free approach that is described in this paper Actually uses the ggn which is a little bit confusing Because it's called Hessian free optimization But it uses the ggn as curvature matrix and we can multiply with the ggn without ever building it in memory explicitly and This is why it's called also a matrix free approach Now the idea is so for a given input output pair We want to compute the product between the ggn and some vector, right? Because this is what we do in CG. We have to multiply the curvature matrix with some vector And for only one input output pair So if we use the kind of Monte Carlo estimator for the ggn only on one data point The ggn would look like this, okay? And I want to use this setting for this simple example and the matrices are C times D and The Hessian is C times C. So the matrices look a little bit like this So D is the number of parameters in our vector and our network So this is going to be a very large number Whereas C is at least compared to D often very small So the thing we want to compute looks approximately like this Now how do we do these matrix matrix or matrix vector products? so the Like the naive thing to do would be to just follow this rule and compute J transpose times Hessian times J But then we're screwed because then this is a D times D matrix, right? So how would you compute these this product between those two between those four objects? Yes? Exactly, can you explain briefly why this is a good idea No, so J transpose J times V would actually be Yeah, exactly Exactly, so you basically only have to you only have those tiny vectors of size C that you have as intermediate results And at no point you have this D times D matrix in memory explicitly, which is nice but still You could think that Okay, but I need somehow to have access to the Jacobian and this matrix and the Jacobian is also a C times D Why it's quite large so we need some way to efficiently multiply the Jacobian with a vector or Jacobian transpose with a vector and so on we need to do these matrix vector products efficiently and I am no expert and auto diff, but I'm still gonna try to explain the core idea of how this can be done efficiently so We don't have to construct these matrices explicitly and as an example we consider the first operation so we're gonna Try to figure out an efficient way to compute Jacobian times vector so the core idea is To replace the Jacobian vector product, which you want to compute by I by a directional derivative And this is how it's gonna work So we basically compute or we consider linearization of our model, okay, so this is simply a Taylor approximation We approximate f at theta plus delta theta by f of theta plus the Jacobian times Delta theta and we have some error term Okay, now here appears the Jacobian vector product But we don't want this to be a delta theta, right? We want this to be V so maybe we can sneak the vector inside there and Yeah, try to see try to see what's happen happening So we're gonna set delta theta to r times V so r is a small constant a scalar and V is gonna be this vector that we want to multiply with and then you can rearrange the terms and It turns out that you can write it this way. So J times V is This difference between two functions divided by r and we have this error term in principle, you could already use this approximation as an approximation to the Jacobian vector product, right because you would simply have to Evaluate the function two times Divide by r and you get an estimate of the Jacobian vector product, but it turns out that this is numerically really unstable So on the one hand you want this error term to be small So you want this r to be small But if r is small you divide by a small number and these two vectors are also quite similar for small numbers, so you get numerical cancellation and your stability gets screwed so We don't want to use this finite difference approximation, but something else But now if we consider this as a function in r We basically have Some function evaluated at r minus some function evaluated at zero divided by r Now if we let r go to zero, this is a derivative, right? So for r going to zero we obtain Jacobian vector product is simply the derivative of this function with respect to r Evaluated r equals zero and this is an operation that you can actually implement very efficiently in Autodiff library like PyTorch and it turns out that you can implement all those three vectors very efficiently Without ever having one of those matrices in memory explicitly So this can be very can can be done very efficiently using those autodiff tricks basically All right, so that's it for Hessian free optimization. We saw that the core idea was applying a CG and We can use additional Autodiff tricks to even make those individual matrix vector products and very efficient So as a last technique I want to present k-fac, which is the chronic affected approximate curvature and This is based on the Fisher information matrix, so it tries to build an estimate for that matrix and It's basically based on two approximations. The first one is a block diagonal approximation so We assume that the Fisher RV Yeah We will use an approximation such that the Fisher matrix has this block diagonal structure where it has one block for each layer Which basically means that we will neglect curvature interactions between layers This is the first approximation and it's reasonable because It makes inverting this matrix trivial when you can invert the blocks, right? so the invert the inverse of this matrix is going to have the same block diagonal structure, but it's going to have the block in inverses on the diagonal, so if you can compute those it's easy to invert the this matrix f hat and The second approximation is that we exchange two operations Taking the expectation and the chronicle product so what I mean by that is For a linear layer the block can actually have or can actually be written Precisely in this form. So this is not an approximation. It actually has this form and So it's an expectation over a chronicle product over two things over these two vectors and This is approximated using first of all the expectation over the first thing then the expectation over the second thing and Then computing the chronicle product of the tool So this is where the approximation happens and this is also reasonable because if we can write the block as this product of two matrices and Then inversion of the block is really easy so if you want to compute the inverse of the block then I Mean according to our approximation. So this actually shouldn't be an equal sign It should be an approximate sign because of this approximation, but we can plug in the form for the block Which is a I chronicle bi and now we have this nice property of the chronicle product that we can basically pull the inverse operator inside and we get AI inverse Chronicle factor bi inverse and these two matrices are tiny right so compared to the block and the entire matrix these two Components AI and bi are really small and inverting them can actually be done So basically the cost of storing and inverting F comes down to storing and inverting those components AI and bi Right because From them you can basically build the entire matrix and also it's easy to construct the inverse So that's it for K-fac Let me summarize again So how can we access and invert the curvature matrix? That was the question we started with we saw three approaches and What I'm trying to do on this slide is first of all give you a little summary But also try to give you certain pros and cons for the different things Which sometimes are also based on a personal Experience or what I know from colleagues, but don't cite me on that. Okay, so For BFJS and LBFJS. This is basically a dynamic lowering approximation of the Hessian and It's actually the default choice for Small deterministic problems. So for example, if you call sci-pi optimized minimize, that's gonna be using BFJS or LBFJS so for example, if you have a really small network and Noise is not so much of an issue for example, because you only have few data points and you can basically compute the Estimates very precisely then it might actually be worth a shot to try out BFJS or LBFJS And to solve this problem because they can actually be really fast in certain cases Next we also discussed the Hessian free optimizer, which is quite close to the notion of Newton steps So for BFJS, we had this approximation by only a few vectors and in K-FAC we had this chronicle factorization at the block diagonal and So these different approximations and in the Hessian free approach. We don't have those approximations, right? We take the GGN instead of the Hessian But apart from that we don't apply any further Approximations, so it's actually quite close to the notion of the classical Newton step It requires little memory So I mean we don't have to store any vectors like for example for LBFJS or for K-FAC We have to store those chronicle factors But instead we have more sequential work because we use an iterative approach So if we do a lot of CG iterations, this is also quite costly And the Hessian free seems to require a larger mini batch sizes, which is problematic if you have batch norm layers So this is a little bit of a complicated story and I experienced this myself unfortunately So the problem is if you want to use larger mini batches, so for example the authors of this paper by Martens They use something like in the order of five thousand as a mini batch size For an I think an auto encoder on MNIST and if you have such large mini batch sizes you have to kind of split the computation into a Chunks of data you can actually handle so for example if you want to compute the gradient on a large mini batch You you take chunks of data which for which you can compute the gradient and then you aggregate the results to compute the entire gradient, okay? but this is kind of based on the assumption that If you take one mini batch and compute the gradient the result should be the same if you split this mini batch into for example two parts and Feed them through your network separately compute individual two gradients and aggregate those two, right? This should be the same and if you apply a batch norm, it's not so this is going to be a problem and Unfortunately, so batch norm is still used a lot And if you want to use testing free with large batch sizes and batch norm you're gonna have a problem, yeah So the last thing we discussed is a k-fag It's basically a lightweight representation of the Fisher information metrics It's widely used for example also in uncertainty quantification and Philip will talk about this next week and The k-fag optimizer as my last point heavily relies on a heuristic damping so this paper by Frederick Benzing and basically gave evidence that The k-fag optimizer is actually more similar to a first-order method than to a second-order method and It attributes the good performance of k-fag more or less exclusively to the heuristic damping that is used in this In this optimizer So it's actually questionable if k-fag Still contains or to what extent k-fag this approximation still contains Information from the actual Fisher because of those approximations that are made there So I would say in summary and we have some pragmatic approaches like engineering style approaches to Disquestion we had earlier Which all have their specific pros and cons now Finally, I want to address this last problem, which is the causticity So here the problem is we want to compute the Newton step So we would like to invert the Hessian and apply to the gradient But we only have estimates thereof So we can only compute an estimate of the Hessian and an estimate of the gradient Now because we have finite data, right? We cannot access this data generating process So how does this change? things so even if you have unbiased stochastic estimates, so in Expectation this is the gradient and an expectation. This is the Hessian so they are unbiased But even if we use unbiased stochastic estimates, the Newton step will be biased So if we consider the expectation over the stochastic Newton step, so this is the thing we actually can compute It turns out that here is not equal and on the right we have the thing we actually want to compute Now why is that? so first of all the Expectation factorizes if those two things are independent of each other, right? So for example if we take one mini batch to compute the gradient and one mini batch to compute the Hessian Then this expectation factorizes and we can actually split this computation into two parts Computing the expectation of the inverse Hessian and computing the expectation over the gradient And this is not the same as pulling the inverse out of the expectation We will see in a bit why this is the case But then this is actually the same as the Newton step because we assume that we have unbiased estimates for h hat for unbiased estimates for h and g But the point where it definitely goes wrong is this one, okay? so The my claim here is that a stochastic Newton step is biased so in expectation. It's not correct and I also want to I will try to provide some intuition for this problem also in 1d So let's assume a very simple case So we assume that g hat and g is simply one so the gradient and the noise on the gradient is not an issue So the question boils down to a why is the expectation over 1 over h hat not the same as the 1 over h right This is kind of the question if we consider this one-dimensional case So why is there no equal sign between those two quantities? This is the question So let's try to understand this in 1d So We have an estimate of the curvature So this is actually a distribution, right? It has some uncertainty and this is shown as the screen distribution and In expectation our Our estimate is correct. So in expectation. We're gonna get h Now if we map h through this inverting function, so the inverting function is simply taking the Inverse of the argument so in 1d this is simple, right? We can just write down the diffraction Then mapping h through this function is gonna give 1 over h So this is the thing that we actually want to compute Now what is the thing we actually get? so if we consider the screen distribution and we consider it as a left half and the right half then Mapping the right half through this function will not change a lot of the of the shape of the distribution, right? But if we map the left part of the distribution which comes close to zero if we map this through this inverting function Then it's gonna distort the the interval a lot So if we map this distribution through the inverting function, then it will actually Break the symmetry and we get this heavy tail if this distribution comes close to zero and that means Because of this heavy tail the expectation expectation is moved to above, right? So in expectation our estimate of the inverse curvature is gonna be too large This is basically what I'm trying to to say here and if you didn't like my visualization You can also like show this in in math So this function phi is convex for positive values so that means that gents is inequality applies and You can just plug in the numbers and see that the expectation over One over h hat is larger than one over h, which is exactly what I claim in this visualization so this means Due to the uncertainty in our curvature estimates our Newton's step is gonna be too large in expectation and and This also transfers to the multidimensional case So the bias can can affect different directions directions very differently so if you take a look at the 2d visualization imagine we are right here and Basically, we're gonna move in two directions. We're gonna move into a high curvature direction and we're gonna move into a low curvature direction when we apply the Newton's step and The high curvature direction so this direction is Maybe not so much of an issue because here the curvature is large Which means that it's far away from zero and in this case This bias is not really present, right the distribution if we map it through and this inverting function It's not really changed much. It's not really distorted and we get a decent estimate of the inverse curvature in this direction So maybe in this direction our update will look fine like this one It would have the correct length basically Whereas if we have a similar noise level for a smaller curvature So if the distribution gets close to zero and we invert that distribution this bias Can really have a strong effect and our expectation Expectation our step is gonna be too large So we're gonna have a situation maybe something like this where the update in this direction is gonna be too large so if we construct the Newton's step from those two components basically in which and yeah It will result in a step that looks like this and this step is certainly not ideal because you can clearly see that Overstep the minimum by a lot because of this bias happening in the low curvature direction So so far I talked about biases because of those stochastic estimates and Now I want to talk a little bit about instabilities, but the argument is pretty much the same So it can simply happen due to chance that we sample an estimate which is close to zero and in this case this small error in the curvature estimate is Amplified by a lot if we invert it and we just get a step, which is basically arbitrary the large So this can cause an unstable step if we just are unlucky with our sample and If this only happens in one direction and messes up the entire step. So you saw on the slide before We can have on a decent update in one direction But if we screw up in another direction the entire stop step doesn't make sense anymore so one I don't want to call it solution part one way to to Yeah, maybe soften this these biases and instabilities a little bit is applying damping so Remember that damping was basically adding and a constant to the curvature Which means in the situation we are gonna move this We're gonna move this green distribution to the right so we're gonna move it away from zero and that means that yeah the biases we make and The instabilities are not so much of a problem anymore Because these happen when the distribution gets close to zero, right? So if we move it away from zero, we won't have these problems in the same extent But there are certain problems with damping. So first of all, it's just a scalar number, right? So it basically treats all directions equally and maybe this is a little bit over of an oversimplification to just have like one number and Yeah, trying to solve all this complicated stuff that is going on with this one number Secondly, it's adapted holistically often So kind of as an auto loop optimization process where you consider those those reduction ratios and Yeah, it would be a nice to have maybe something that is Yeah smarter than that and maybe something that is also aware of the stochasticity That is in the estimates. So ideally, I think what we would like to have is not something that adapts the damping like in an auto loop optimization iristically, but actually in the inverting Algorithm have something that is aware of the stochasticity aware of the uncertainty in the estimates And then can decide for itself whether it wants to makes whether it wants to make certain updates in certain directions Or not to make them. I think this was would be more reasonable So to answer this question how to handle stochasticity, I think this problem is just fundamentally not solved yet So my guess would be that we need specialized algorithms that are informed by richer quantities So by richer quantities quantities, I mean Distributions rather than point estimates So if you had access to the screen distribution you saw in the visualizations, then you could actually maybe come up with a Rule what updates to apply and what updates not to apply because you can reason about how uncertain you are about these updates and with specialized algorithms, I mean algorithms that can actually use those distributions and Like in the inverting process already use those richer quantities So with that and this was the the last problem we want to want to discuss so The summary would be so last week we saw that training neural networks With first-order methods like SGD and Adam remains expensive tedious and brittle. I'm sure you've also experienced this in the in the exercise and Second-order methods. I think offer a partial answer to that so For example in that they are robust against ill conditioning But they come of course also with their challenges so and we talked about non-convexity where we Basically have concrete answers like we can use Positive semi-different at curvature proxies like the GGN or the Fisher. We can use damping to make more conservative steps The next problem was cost. So I presented BFGS, LBFGS, the Hessian free optimizer and KFAC as Smart ways to deal with this problem of huge curvature matrices and how to invert them And and they all come like with several pros and cons and lastly I presented stochasticity as a fundamental problem, which is not solved yet and But maybe until then I think curvature of course remains still a useful quantity. So As I already mentioned earlier, Professor Hennig is going to talk about Uncertainty quantification and the role of curvature in this case So with that I'm done. Thanks for attending. Thanks for answering my questions and yeah, that's it for me. Thank you