So, as you can see, this zigzagging behaviour is the main reason why the steepest descent method is slow, and it is especially slow when the condition number is large. What we can now think of, as I mentioned in the previous lecture, is a method that does not simply follow the steepest descent direction, but also takes into account how that direction itself is going to change. That is, it must use not only the gradient, which tells you the direction of decrease of the function, but also how the gradient itself changes, and the way the gradient changes is captured by the curvature of the function, in other words by the Hessian. A better method would be one that, while sitting at the current point, already recognizes that although this is the direction of steepest descent, moving along it will not give a sustainable decrease, because it would have to keep changing direction again and again. Ideally we want a method that, at the current point itself, looks at the curvature and this other information and identifies a better direction to move in, and that method is Newton's method. It is again an iterative method like before, but now we take a Newton step: at iteration k the Newton step is defined as p_k^N = -[∇²f(x_k)]^{-1} ∇f(x_k), the negative of the inverse Hessian at x_k applied to the gradient at x_k. Now, here is one important thing to note: the Hessian need not be positive definite, and in that case the Newton step, or Newton direction, may not be a descent direction. As a consequence, a lot of what we discussed so far does not directly carry over to Newton's method: we are not necessarily decreasing at every step; we may not even be getting descent, and in fact we may be increasing the objective value. So when applying Newton's method, although the method is intelligent in the sense that it takes the curvature of the function into account, we have to make sure that we are in fact getting descent. One other thing I want you to note is that the Newton step already has the step size baked into it; the step length has effectively been found for you by taking the curvature into account through the Hessian. So one does not usually need an additional step size on top of this: the Newton step is not just a direction, it is a complete step to the next iterate. With that, let us now discuss the rate of convergence of the Newton method.
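To make the iteration concrete, here is a minimal sketch of the plain Newton iteration just described, assuming the gradient and Hessian are available as callables; the function and variable names are my own choices for illustration, not from the lecture.

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-8, max_iter=50):
    """Plain Newton iteration x_{k+1} = x_k - [Hess f(x_k)]^{-1} grad f(x_k).

    grad, hess: callables returning the gradient vector and Hessian matrix.
    Note: there is no safeguard here -- if the Hessian is not positive
    definite, the Newton step may not be a descent direction, as discussed.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve Hess(x) p = -g rather than forming the explicit inverse.
        p = np.linalg.solve(hess(x), -g)
        x = x + p  # full Newton step: the step length 1 is "baked in"
    return x
```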
So, here is the theorem. Suppose f is twice continuously differentiable and its Hessian is Lipschitz continuous in a neighborhood of a point x* at which the sufficient conditions of optimality are satisfied. What does this mean? It means x* is a point at which the gradient is zero and the Hessian is positive definite, that is, ∇f(x*) = 0 and ∇²f(x*) is positive definite. Consider the iteration x_{k+1} = x_k + p_k^N, where p_k^N is the Newton step defined above. Then, if the starting point x_0 is sufficiently close to x*: first, the iterates x_k converge to x*; second, x_k converges to x* quadratically, meaning the error between x_k and x* decreases to 0 quadratically; and third, the norms of the gradients ‖∇f(x_k)‖ also converge to 0 quadratically. Let us take note of a few things. The theorem assumes f is twice continuously differentiable with a Hessian that is Lipschitz continuous in a neighborhood of x*, where the sufficient conditions of optimality hold, so x* is a local minimum of f, and it considers the iteration x_{k+1} = x_k + p_k^N with a Newton step. Now here is the main thing: if x_0 is sufficiently close to x*, you are guaranteed that the iterates converge to x*. What does "sufficiently close" mean? It means you are in a part of the space around x* where the function f looks convex. The function may very well do other things elsewhere, but at x* the gradient is zero and the Hessian is positive definite, so in a neighborhood around x* the function is actually convex, and the theorem is saying that you start your iteration in that sort of neighborhood, a region where the function is convex. If the function is convex there, the Hessian is positive definite, and then the Newton direction is in fact a descent direction. So if you start your iterates in this kind of region, the Newton direction gives you descent, the iterates converge to x*, and moreover they converge quadratically, with the norm of the gradient also vanishing quadratically. In other words, if you start in this sort of region, the Newton method not only converges, it converges quadratically, faster than the steepest descent method. The main point here is that because the Newton direction is not guaranteed to be a descent direction in general, we have to include this rider that you start in a kind of basin of attraction, a neighborhood of the true minimum where the function is convex.
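The lecture states the quadratic rates verbally; one standard way to write the second and third conclusions is shown below, where the constants C and C̃ (not specified in the lecture, so take them as assumptions) depend on the Lipschitz constant of the Hessian and on the inverse Hessian at x*, and the bounds hold once the iterates are near x*.

```latex
\[
  \|x_{k+1} - x^{\ast}\| \;\le\; C \,\|x_{k} - x^{\ast}\|^{2},
  \qquad
  \|\nabla f(x_{k+1})\| \;\le\; \tilde{C} \,\|\nabla f(x_{k})\|^{2}.
\]
```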
Now, if the function is actually convex everywhere, then things become easier and this "sufficiently close" requirement has no bearing on the final result. But the main thing to note here is how we have now obtained quadratic convergence, as opposed to the linear convergence that steepest descent was giving us. We can also consider other variants of the Newton method. In situations where the Hessian is not positive definite, one can come up with variants that tend to mimic the behaviour of the Newton method. For example, you could simply take p_k = -B_k^{-1} ∇f(x_k), where B_k is symmetric and positive definite. This kind of iteration is what is called a quasi-Newton method. What it tends to do is bring in some information about the curvature while also giving you guaranteed descent. The matrix B_k is obtained from the past derivatives and past information you have collected about the function, and there are many different ways of updating B_k; one of the most popular is what is called the BFGS update. We do not have the time to go into all the details, but BFGS gives you a method for updating B_k from iteration to iteration. With this, we now have a wide gamut of methods for solving unconstrained optimization problems, and from here we will move on to constrained optimization.
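The lecture does not write out the BFGS formula, but for reference here is a minimal sketch of the standard BFGS update of the Hessian approximation B_k, with s and y the step and gradient-change vectors; the function name and interface are my own for illustration.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of the Hessian approximation B_k.

    s = x_{k+1} - x_k                    (the step just taken)
    y = grad_f(x_{k+1}) - grad_f(x_k)    (the change in the gradient)

    If B is symmetric positive definite and the curvature condition
    s^T y > 0 holds, the updated matrix stays symmetric positive
    definite, so p = -B^{-1} grad_f(x) remains a descent direction.
    """
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```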