A practical line search algorithm tends to work like this; let me write it out. You start with some alpha bar > 0, you choose a rho between 0 and 1 (I will tell you why this rho is needed), and you choose your constants c1 and c2. Then one searches over alpha: initialize alpha = alpha bar, and at each iteration check whether the Wolfe conditions are satisfied; if not, change alpha to rho times alpha. So what you are doing is starting with a large enough alpha and backtracking until the Wolfe conditions are satisfied. Once they are satisfied, you define alpha_k = alpha, the next iterate x_{k+1} is defined as x_k + alpha_k p_k, and you reset k to k + 1. So the algorithm effectively takes you through iterations x_{k+1} = x_k + alpha_k p_k, where p_k is a sequence of directions and alpha_k satisfies the Wolfe conditions at each k. What we get, therefore, is a sequence of iterates given by this recursion. Now, what I have not told you yet is how this p_k is to be chosen; we have been silent on that topic so far. The p_k can be chosen in a variety of ways, and I will give you a general result here, but the main thing is that p_k should be a direction in which the function decreases.
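The backtracking loop just described can be sketched in code. This is a minimal sketch: the function names, the default values of alpha bar, rho, c1, c2, and the iteration cap are my illustrative assumptions, not part of the lecture.

```python
import numpy as np

def backtracking_wolfe(f, grad, x, p, alpha_bar=1.0, rho=0.5,
                       c1=1e-4, c2=0.9, max_iter=50):
    """Backtrack alpha <- rho * alpha from alpha_bar until the Wolfe
    conditions hold along the direction p (assumed a descent direction)."""
    alpha = alpha_bar
    fx, gx = f(x), grad(x)
    slope = gx @ p                      # directional derivative; < 0 for descent
    for _ in range(max_iter):
        x_new = x + alpha * p
        armijo = f(x_new) <= fx + c1 * alpha * slope   # W1: sufficient decrease
        curvature = grad(x_new) @ p >= c2 * slope      # W2: curvature condition
        if armijo and curvature:
            return alpha
        alpha *= rho                    # backtrack
    return alpha

# Example: f(x) = 0.5 ||x||^2, steepest descent direction from x0
f = lambda x: 0.5 * (x @ x)
grad = lambda x: x
x0 = np.array([2.0, -1.0])
alpha = backtracking_wolfe(f, grad, x0, -grad(x0))
```

One caveat worth noting: pure backtracking from a large alpha primarily enforces the sufficient-decrease condition; the curvature check is included here exactly as the lecture states the loop, with a cap on the number of shrinks as a safeguard.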
So, the simplest thing you could do for this purpose is to take p_k to be the negative of the gradient of the function. Locally we are assured that if you move infinitesimally in that direction, the function is guaranteed to decrease. But that is not the only choice; one could do many other things to choose the p_k, and in fact variants of this choice give rise to various different types of algorithms. I will now mention a sort of generic, sweeping result which tells us for what sort of p_k line search algorithms actually converge. That brings us to the topic of convergence of line search algorithms. Define cos theta_k as the cosine of the angle between the negative of the gradient and p_k: the negative of the inner product of grad f(x_k) with p_k in the numerator, divided by the norms of these two vectors. Obviously this quantity is well defined only when the two vectors are nonzero; p_k is a direction we are choosing, so it is nonzero, and if the gradient is also nonzero then the quantity is well defined. What is theta_k here? It is capturing the angle between p_k and the steepest descent direction, the direction you could pick that gives you the most decrease in the vicinity of the current point. The theorem, which we will not prove but which I will state, is this.
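As a small sketch (the helper name is mine, not the lecture's), the quantity cos theta_k is just:

```python
import numpy as np

def cos_theta(grad_fx, p):
    """Cosine of the angle between the search direction p and the
    steepest descent direction -grad f(x_k); well defined only when
    both vectors are nonzero."""
    return -(grad_fx @ p) / (np.linalg.norm(grad_fx) * np.linalg.norm(p))

g = np.array([3.0, 4.0])
print(cos_theta(g, -g))   # p collinear with -grad f: cosine is 1
```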
So, consider any iteration of the form x_{k+1} = x_k + alpha_k p_k, where alpha_k satisfies the Wolfe conditions (W1) and (W2). Suppose f is bounded below, which means the infimum of f over R^n is greater than minus infinity, and is continuously differentiable. Suppose also that the gradient grad f is Lipschitz continuous on an open set N containing the set L. What is this set? L is called the level set: it is the set of x's for which f(x) takes a value less than or equal to f(x_0), where x_0 is your starting or initial iterate. And what does Lipschitz continuity mean here? It means there exists an L' > 0 such that norm of grad f(x) minus grad f(x bar) is less than or equal to L' times norm of x minus x bar, for all x, x bar in this open set N. So we can take p_k to be any set of directions and choose alpha_k such that the Wolfe conditions are satisfied; we assume the function is bounded below and continuously differentiable, and the gradient is Lipschitz on the set we are considering. What is this set L? Starting from x_0, take all the x's that give you a better value, a lower value, than the one you are starting with; that entire region is the level set, that is your set L. Then the theorem says that the summation over k of cos^2 theta_k times the squared norm of grad f(x_k) is finite; this whole sum is finite. Now, what does this mean?
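Collected in one statement (this is the result known in the optimization literature as Zoutendijk's theorem; I am supplying the name, the lecture does not use it):

```latex
% Assumptions: x_{k+1} = x_k + \alpha_k p_k with \alpha_k satisfying the
% Wolfe conditions (W1), (W2); f bounded below and continuously
% differentiable; \nabla f Lipschitz on an open set N containing the
% level set (written \mathcal{L} here to avoid clashing with the
% Lipschitz constant L').
\[
  \mathcal{L} = \{\, x \in \mathbb{R}^n : f(x) \le f(x_0) \,\}, \qquad
  \|\nabla f(x) - \nabla f(\bar x)\| \le L' \,\|x - \bar x\|
  \quad \forall\, x, \bar x \in N \supseteq \mathcal{L}.
\]
Conclusion:
\[
  \sum_{k \ge 0} \cos^2\theta_k \,\|\nabla f(x_k)\|^2 < \infty .
\]
```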
So, the claim is that if your function has these properties and you choose your iteration in the way indicated, with alpha_k satisfying the Wolfe conditions, then this particular sum has to be finite. Now, this looks like a bizarre technical conclusion, but it actually says a lot in one sentence. This is an infinite sum, so if the infinite sum is finite, the terms must be going to 0. In particular, the limit as k tends to infinity of cos^2 theta_k times the squared norm of grad f(x_k) is equal to 0. Now, if you look at this limit, it has two factors: the norm of the gradient of the function, and the square of the cosine of the angle between the negative gradient and the direction p_k that you have chosen. This tells you something quite nice and powerful. Suppose p_k is chosen so that cos theta_k is always at least some constant epsilon which is positive. Then cos^2 theta_k is bounded below by epsilon^2, that factor can effectively be pulled out of the limit, and what we are left with is that the limiting value of the gradient of the function must be 0. Which means that if x_k converges to some x*, then the gradient at x* must be equal to 0.
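Spelled out, the limiting argument is:

```latex
\[
  \sum_{k \ge 0} \cos^2\theta_k \,\|\nabla f(x_k)\|^2 < \infty
  \;\Longrightarrow\;
  \lim_{k \to \infty} \cos^2\theta_k \,\|\nabla f(x_k)\|^2 = 0 .
\]
If $\cos\theta_k \ge \epsilon > 0$ for every $k$, then
\[
  \epsilon^2 \,\|\nabla f(x_k)\|^2
  \;\le\; \cos^2\theta_k \,\|\nabla f(x_k)\|^2 \;\longrightarrow\; 0
  \;\Longrightarrow\;
  \lim_{k \to \infty} \|\nabla f(x_k)\| = 0 .
\]
```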
So, if p_k is chosen in such a way that it makes an acute angle with the negative of the gradient, which means it has a nonzero component along the negative gradient, then you are guaranteed that the sequence of iterates actually gets you to a point where the gradient becomes equal to 0, which means you are satisfying the necessary conditions of optimality. Now, it is important here that the cosine is some epsilon which is positive and remains positive, and the way we have done this is by choosing an epsilon that is independent of k. If epsilon also depended on k and started decreasing to 0, then the limit of this product being 0 would not let me conclude that the gradient goes to 0: it could well be that the gradient still remains bounded away from 0 and yet this limit goes to 0. So this condition is effectively telling you that the way to ensure your gradient vanishes is to make sure that this cosine remains bounded away from 0; that is, p_k should continue to make an angle with the negative gradient that stays bounded away from 90 degrees, whatever that gradient may be. If p_k continues to do this, then the gradient itself will go to 0. The simplest approach to ensuring that is to take p_k to be the negative gradient itself; then you are collinear with the negative gradient.
So, in that case the cosine of the angle will always be 1, and the gradient would go to 0. This kind of algorithm is what is called the steepest descent algorithm. Now, steepest descent only ensures that you are taking the direction of steepest descent at each point, and that may or may not be right for you in the long run, in the sense that what looks like the steepest descent at a particular point may not give you a sustainable decrease all the way down as you go further. So you also have to take into account how the directions of steepest descent themselves change. The ideal way is to also take into account the curvature of the function, and that gives you a much richer class of algorithms, but they are once again another class of the kind I have mentioned. So, when you are designing these sorts of algorithms, or when you are coming up with your iterates, so long as you make sure that the cosine of the angle continues to be bounded away from 0 you should be fine: your iterates will take you to a point where the gradient vanishes. All right, with that I will stop this lecture, and we will continue next time.
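Putting the pieces of the lecture together, here is a minimal sketch of steepest descent with a backtracking line search. The function names, tolerances, and the fact that the inner loop checks only the sufficient-decrease (Armijo) condition are my illustrative choices; with p_k = -grad f(x_k), cos theta_k is identically 1, which is exactly the case the convergence result covers.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, rho=0.5, c1=1e-4, max_iter=500):
    """Steepest descent: p_k = -grad f(x_k), with step lengths found
    by backtracking until the sufficient-decrease condition holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # gradient has (numerically) vanished
            break
        p = -g                            # steepest descent direction
        alpha, fx, slope = 1.0, f(x), g @ p
        while alpha > 1e-12 and f(x + alpha * p) > fx + c1 * alpha * slope:
            alpha *= rho                  # backtrack
        x = x + alpha * p
    return x

# Example: minimize f(x) = (x1 - 1)^2 + 2 (x2 + 3)^2, minimizer (1, -3)
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
x_star = steepest_descent(f, grad, [0.0, 0.0])   # converges to (1, -3)
```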