Welcome everyone. In the previous lecture we completed our study of the steepest descent method, or more generally of line search methods. What we will now analyze is the basic issue of how fast these methods can be; that is, we will discuss the rate of convergence, or speed of convergence, of algorithms. In order to do that we first need a notion, a definition, of what we would call the rate of convergence, so let me define that for you to begin with.

Let r_k be any sequence of numbers that converges to r*. We say that the order of convergence of r_k to r* is the largest value of beta >= 0 such that the limit, as k tends to infinity, of |r_{k+1} - r*| / |r_k - r*|^beta is finite. What does this mean? If this limit is finite, it effectively tells you that for large enough k, the absolute difference |r_{k+1} - r*| is roughly equal to some constant times |r_k - r*| raised to this largest value of beta. That is what this definition is telling you: for large enough k, the iterates behave in this kind of way.

If beta turns out to be 1, we say the convergence is linear, which means that the updated distance between where you want to be and where you currently are is roughly a constant times your earlier distance |r_k - r*|. Of course, because the sequence converges, |r_k - r*| eventually decreases to 0, as does |r_{k+1} - r*|; what we are asking is how much progress we make in the new iteration relative to where we were in the previous iteration, and that is what is being captured by the rate of convergence. If beta equals 2, we say the convergence is quadratic, and this is usually very, very good, because if your iterates converge quadratically, the progress you make at each iteration is much more than what you made at the previous iteration. And if beta is equal to 1 but the constant here, the value of the limit, is actually equal to 0 (or, more generally, if beta is greater than 1), then we say the convergence is superlinear.

Now, to study rates of convergence, we will first look at the simplest form of a descent method, the steepest descent method, and we will study it first only on quadratic functions, because it turns out that a lot of what we want to learn about rates of convergence can be learned from just this. So let us look at the function f(x) = (1/2) x^T Q x - b^T x, where I am going to take Q to be symmetric and positive definite.
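To keep this definition handy, here it is written out compactly in symbols; this is just the statement described above, nothing new is being added:

```latex
% Order of convergence of a sequence r_k -> r^*:
% the largest beta >= 0 for which the limit below is finite.
\[
  \beta^{\ast} \;=\; \sup\left\{ \beta \ge 0 \;:\;
  \lim_{k\to\infty} \frac{\lvert r_{k+1}-r^{\ast}\rvert}
                          {\lvert r_{k}-r^{\ast}\rvert^{\beta}} < \infty \right\}.
\]
% beta^* = 1 with a finite nonzero limit            : linear convergence
% beta^* = 2                                        : quadratic convergence
% beta^* = 1 with the limit equal to 0 (or beta^* > 1): superlinear convergence
```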
The gradient of f, as you can evaluate, is grad f(x) = Q x - b, and the minimizer x* of f is then the unique solution of Q x = b. Now, the good thing about quadratics is that you can calculate a lot of these things in closed form. In particular, if you are doing steepest descent with exact line search, the actual step that minimizes the function along a certain search direction, then the search direction for us is p_k = -grad f(x_k), and we can write out f(x_k + alpha p_k) = f(x_k - alpha grad f(x_k)) in closed form and find the alpha that minimizes it. Call that minimizer alpha_k; it turns out that

alpha_k = grad f(x_k)^T grad f(x_k) / (grad f(x_k)^T Q grad f(x_k)).

The iteration then becomes

x_{k+1} = x_k - [grad f(x_k)^T grad f(x_k) / (grad f(x_k)^T Q grad f(x_k))] grad f(x_k).

You can now substitute grad f(x) = Q x - b into all of these expressions and get x_{k+1} in a much more explicit form in terms of x_k. Using this, we can also compute the distance between x_{k+1} and x*, and the difference f(x) - f(x*). First let us define the Q-norm by ||x||_Q^2 = x^T Q x. This is effectively a kind of skewed or tilted norm, where the weighting matrix is Q. It is then very easy to show that (1/2) ||x - x*||_Q^2 = f(x) - f(x*).

Using this we can derive the following theorem. When the steepest descent method is applied to the function f with exact line search — and by exact line search I mean that we are finding alpha in closed form; what we did in the previous lecture, where we were looking for alphas that satisfied the Wolfe conditions, is called inexact line search, because there we are not actually finding the exact minimum, we are just imposing conditions that the accepted alpha must satisfy — then we find that

||x_{k+1} - x*||_Q^2 <= ((lambda_n - lambda_1) / (lambda_n + lambda_1))^2 ||x_k - x*||_Q^2,

where 0 < lambda_1 <= ... <= lambda_n are the eigenvalues of Q, with lambda_1 the smallest eigenvalue and lambda_n the largest.
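As a rough numerical sketch of these formulas — the matrix Q, the vector b, and the starting point below are arbitrary choices for illustration, not anything from the lecture — here is steepest descent with the closed-form step size on a small quadratic, checking the Q-norm contraction bound from the theorem at each iteration:

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T Q x - b^T x, Q symmetric positive definite.
# Q, b, and x0 are made-up values used only to exercise the formulas above.
Q = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
x_star = np.linalg.solve(Q, b)        # minimizer: unique solution of Q x = b

def grad(x):
    return Q @ x - b                  # gradient of the quadratic

def q_norm_sq(v):
    return float(v @ Q @ v)           # squared Q-norm ||v||_Q^2 = v^T Q v

lam = np.linalg.eigvalsh(Q)           # eigenvalues in ascending order
factor = ((lam[-1] - lam[0]) / (lam[-1] + lam[0])) ** 2   # theoretical contraction factor

x = np.array([5.0, 5.0])              # arbitrary starting point
for k in range(10):
    g = grad(x)
    alpha = (g @ g) / (g @ Q @ g)     # exact line search step for a quadratic
    x_next = x - alpha * g
    ratio = q_norm_sq(x_next - x_star) / q_norm_sq(x - x_star)
    print(f"iter {k}: error ratio = {ratio:.4f}   (bound = {factor:.4f})")
    x = x_next
```

Each printed ratio should come out no larger than ((lambda_n - lambda_1)/(lambda_n + lambda_1))^2, which is exactly what the theorem asserts.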
So, now what is this relation saying? This squared Q-norm is, up to the factor of one half, nothing but the difference between the function value and the optimal value of the function. So this term on the left, and the corresponding term on the right, are essentially capturing something like an error: the departure that you have from your optimal solution. The relation says that the error at iteration k+1 is less than or equal to a constant times the error at iteration k, and you can see what this constant is: it is something that depends on the eigenvalues of Q.

So what usually happens when you run this sort of steepest descent algorithm on a quadratic function like this? Remember this is a convex quadratic function, because I assumed that Q is symmetric and positive definite. Let me draw the contours of f for you: an outermost contour, another contour inside it, another one inside that, and so on. Here is what we get. You start at a point x_0 on the outer contour and go in the direction of steepest descent, and you keep going until you have minimized the function along that direction; that minimum gets attained at the second point, x_1. From x_1 you again go in the direction of the negative gradient, minimize the function along that particular direction, reach the point x_2, then again go along the negative gradient at x_2, and so on. These are the iterates that your algorithm produces when you apply steepest descent to this sort of quadratic problem.

Now, this was a very simple problem because we actually know where the solution is — you could have found it by simply inverting the matrix Q — but we are trying to see how the steepest descent method actually behaves. The main thing you can observe here is that the steepest descent method tends to zigzag: it keeps zigzagging this way all the way towards the solution. The reason for this zigzag is the skew that is introduced by your Hessian matrix Q. The extent of zigzagging really depends on how much you need to keep changing your directions of descent, and that in turn depends on how different the eigenvalues are.
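One way to see the zigzag mechanically — this is a standard observation, though it is not spelled out above — is that with exact line search the step stops exactly where the directional derivative along p_k vanishes, so the new gradient is orthogonal to the old one and successive steps turn at right angles. A minimal sketch, where Q, b, and the starting point are again arbitrary illustrative choices:

```python
import numpy as np

# Deliberately ill-conditioned quadratic so the zigzag is pronounced; b = 0, minimizer at the origin.
Q = np.array([[10.0, 0.0],
              [0.0, 1.0]])
b = np.zeros(2)

x = np.array([1.0, 10.0])          # arbitrary starting point
g = Q @ x - b
for k in range(5):
    alpha = (g @ g) / (g @ Q @ g)  # exact line search step
    x = x - alpha * g
    g_next = Q @ x - b
    # Consecutive gradients are (numerically) orthogonal: the right-angle zigzag.
    print(f"iter {k}: grad(x_{k+1}) . grad(x_{k}) = {g_next @ g:.2e}")
    g = g_next
```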
So what does this say in the simplest case? If all the eigenvalues are equal — say Q is essentially a multiple of the identity matrix, with all eigenvalues equal to 1 — then lambda_n would be equal to lambda_1, the constant in this inequality would be 0, and x_{k+1} would be exactly equal to x*. So in the case when the eigenvalues are all equal, the first step itself would not zigzag at all; it would point you straight towards the actual global minimum of the function. It is because there is a skew in the shape of the function that you are taken along this sort of path: first you go in one direction, then another, then another, and so on. The extent to which you end up zigzagging in general depends on the condition number, the ratio of the largest to the smallest eigenvalue. As the condition number grows larger, the contours of the quadratic become more elongated — they tend to get more stretched out — and more zigzagging occurs.

Nonetheless, one thing that this result tells us directly is that the steepest descent method converges linearly in general. There is all this zigzagging, but it does converge, and it converges only linearly. If your condition number is mild, meaning close to 1, then the contours themselves look more circular, the first step itself gets you close to the solution, and the actual number of iterations would probably be somewhat smaller. So here is the main point: the rate of convergence depends on the extent of the curvature, that is, on how different the smallest and largest eigenvalues are. Let me make a note here: steepest descent converges linearly on this sort of quadratic problem.

Now, it turns out that for problems that are more nonlinear — not necessarily quadratic — a pretty similar sort of result holds. I will just state the theorem for you. Suppose f: R^n -> R is twice continuously differentiable, and that the iterates generated by the steepest descent method with exact line search converge to a point x* at which the Hessian of f is positive definite. Let r be any scalar that lies in the interval from (lambda_n - lambda_1)/(lambda_n + lambda_1) to 1, where 0 < lambda_1 <= ... <= lambda_n are the eigenvalues of the Hessian at x*. Then for k large we have

0 <= f(x_{k+1}) - f(x*) <= r^2 (f(x_k) - f(x*)).
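Written out compactly, the statement just given reads as follows (same notation as above; nothing beyond the lecture's claim is being added):

```latex
% Steepest descent with exact line search on a twice continuously
% differentiable f, with iterates x_k -> x^* and \nabla^2 f(x^*) positive
% definite, eigenvalues 0 < \lambda_1 \le \dots \le \lambda_n.
% For any r in the interval ((\lambda_n - \lambda_1)/(\lambda_n + \lambda_1), 1)
% and all k sufficiently large:
\[
  0 \;\le\; f(x_{k+1}) - f(x^{\ast})
    \;\le\; r^{2}\,\bigl( f(x_{k}) - f(x^{\ast}) \bigr).
\]
```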
Now, you can see why this is the case: f(x_k) - f(x*) is essentially the same as the error term that appeared in the quadratic case. So what is happening here is that if you are considering a twice continuously differentiable function, you look at the iterates that come from the steepest descent method with exact line search, and they converge to an x* where the Hessian is positive definite, then the function value error is going down to 0 linearly — it is showing linear convergence — and the constant outside again depends on the condition number of the Hessian at the point to which the iterates converge.

So what is the lesson here? One, of course, is that the rate of convergence of the steepest descent method is linear — that is the best you can expect from it in general. And second, the study of the steepest descent method on a quadratic function gives you a good sense of what would be happening for a more general nonlinear function. The reason for that is that the rate of convergence at the end of the day depends essentially on how the iterates behave close to the solution, because it is eventually a limiting quantity, and when you are close to the solution a quadratic approximation of the function is a fair approximation. So the performance of an algorithm on a quadratic tends to tell you how the algorithm would perform on more nonlinear types of functions.