So far we have been looking at iterative methods for solving linear algebraic equations: the Jacobi, Gauss-Seidel and relaxation methods and their variants. We have also analyzed their convergence behavior, and we know how to ensure convergence by modifying the problem. Now, there is one more iterative method which is quite popular and which also converges pretty fast. It is a method based on numerical optimization. One of the reasons for covering it is that it will also be useful when we go forward to non-linear algebraic equations. As I said, we want to solve the problem A x = b, and this can be done by minimizing, with respect to x, the objective (A x - b)^T (A x - b). If I minimize this with respect to x, then I reach the solution of A x = b. In fact, if I take this objective as phi, then the necessary condition for optimality, dou phi by dou x = 0, turns out to be A^T (A x - b) = 0, and obviously, if A is non-singular, satisfying this necessary condition takes you to the optimum. What is the second derivative here? The second derivative is A^T A, which is symmetric positive definite, so you are actually reaching the global minimum. Yesterday somebody had a doubt about whether the iterative methods we have looked at give you a local solution or a global solution. For the Jacobi, Gauss-Seidel and relaxation methods, if we know that the iterations are converging, they will converge to the global solution; for linear algebraic equations there is nothing like a local versus global solution. Now, of course, we can simplify this a little bit. If A is a symmetric positive definite matrix, then we can just minimize, with respect to x, the function (1/2) x^T A x - x^T b.
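The equivalence claimed above can be checked numerically. Here is a minimal sketch (the matrix and right-hand side are made-up examples, not from the lecture): we solve A x = b directly and confirm that the necessary condition A^T (A x - b) = 0 holds at the solution, and that the objective grows if we perturb away from it.

```python
import numpy as np

# A made-up non-singular (non-symmetric) matrix and right-hand side.
A = np.array([[3.0, 1.0], [0.0, 2.0]])
b = np.array([1.0, 4.0])

def phi(x):
    """phi(x) = (A x - b)^T (A x - b)."""
    r = A @ x - b
    return r @ r

x_star = np.linalg.solve(A, b)          # solution of A x = b
grad_at_star = A.T @ (A @ x_star - b)   # necessary condition: should be ~0

print(np.linalg.norm(grad_at_star))
print(phi(x_star) < phi(x_star + 0.1))  # perturbing increases the objective
```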
If A is symmetric positive definite, then minimizing this objective function will give you the optimum. Now I want to develop a general method, a gradient-based optimization method. In my notes this is described in Appendix D, on page 48. I want to solve this problem using a numerical search; I do not want to use the optimality condition directly, because solving that condition for x would again be either a direct method or an iterative method like Gauss-Seidel, and I do not want to go into that. I want an iterative scheme which is based on optimization techniques. So I am going to do the gradient method, also called the steepest descent method. Right now I am concerned with developing an iterative method for minimizing, with respect to x, some objective function phi(x), where phi maps R^n to R. Phi is a scalar objective function; it need not be a norm, and it need not always be positive. I am not worried about that; I just want some scalar objective function from R^n to R, and I want to come up with an iterative scheme to reach a local minimum, local because in general phi(x) need not be nicely behaved. After I derive the method, I want to apply it to our specific case. So the purpose is two-fold: one is to introduce gradient-based methods and their variants, which are very useful in optimization, with applications I will show you later; the other is to apply this numerical search to our specific problem of solving linear algebraic equations. This method is also known as steepest descent; you may have done it in your undergraduate studies, I am not too sure.
It is also called Cauchy's method; it is known by various names. The basic idea starts with level surfaces. What is a level surface? A level surface is the set of all points x such that phi(x) is equal to a constant. Say x is a two-dimensional vector with components x1 and x2. In the x1-x2 plane I plot the locus of all points for which phi(x1, x2) equals a constant: this curve for c1, this for c2, c3, c4 and so on, say phi = 4, phi = 3, phi = 2. These are called level surfaces. Note that I am not plotting phi(x) itself in this plot. If you made a three-dimensional plot with axes x1, x2 and phi, a level surface is nothing but a horizontal cross-sectional plane projected onto the x1-x2 plane. Have you seen the Matlab logo? It is like a single peak. Take the height of a mountain above the ground surface as the objective function; I am trying to find the set of all points where the height is constant. How will you get it? Take a plane horizontal to the x-y plane, cut the surface with it, and project onto x and y; you will get the set of all such points. So view phi(x) as a height and (x1, x2) as ground locations; the points at a constant level form a level surface, which is probably the reason for the name.
Now, I am going to use the local behavior of these level surfaces to come up with an iteration scheme for solving the minimization problem. For the time being, I am going to forget about solving linear algebraic equations; I am just concentrating on the general problem, for any phi(x), not a specific one. So let us say I have some guess solution x_k. It is unlikely that x_k minimizes phi, but it is my guess. What is the philosophy of iterative methods? You start with one guess, move on to the next guess, and hope that the iterations converge to the solution. In this case, they will converge to a local solution; in some cases they will converge to the global solution, but that depends on the problem and on your initial guess. If the problem is highly non-linear with a complicated shape, the result depends on the initial guess; if the problem has only one peak or one valley, it will reach the global minimum. So x_k is my guess solution. Our good old friend is the Taylor series theorem, and I am going to apply it to phi(x), writing x as x_k + delta x_k, where delta x_k is obviously x - x_k. If I do a Taylor series approximation in the neighborhood of x_k, then phi(x) is approximately equal to phi(x_k) plus grad phi(x_k) transpose times delta x_k, where grad phi(x_k) denotes the gradient of phi evaluated at x_k; let us fix that notation. There are also higher order terms, which I am neglecting, because I am looking locally: how does this function behave in the neighborhood of x_k?
And then I want to look at a level surface, phi(x) = constant, in a small neighborhood of the point x_k, where I have this approximation of phi(x). What happens at x = x_k? Delta x_k = 0, which means that if x_k lies on the level surface, phi(x_k) is equal to the constant at that point. Suppose this is your x_k; I am trying to model this curve locally. You will see that I will actually model it using the tangent, the tangent plane; that will become clear soon. What is the simplest approximation you can construct for a curve? A straight line. Locally, in a small neighborhood, you can construct a straight-line approximation to the curve, and that is what I am doing: I am getting the local slope of this line through the Taylor series. The Taylor series is my vehicle for constructing a local approximation. Now phi(x_k) is the constant c, so if I substitute in the approximation, I get c = c + grad phi(x_k) transpose delta x_k, which implies that grad phi(x_k) transpose delta x_k = 0. Is everyone with me on this? Phi is a scalar function, but the gradient is a vector and delta x_k is also a vector, and this inner product is zero. Geometrically, what does it mean? The gradient is perpendicular to delta x_k: locally, grad phi is orthogonal to x - x_k. This is what we have found out. Actually, grad phi(x_k) transpose delta x_k = 0 is the equation of the tangent plane to the level surface.
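This tangent-plane statement can be illustrated numerically. In the sketch below (with a made-up quadratic phi, not one from the lecture), moving a small distance along a direction orthogonal to the gradient changes phi only to second order in the step size, while moving along the gradient changes it to first order:

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # made-up symmetric positive definite A
b = np.array([1.0, -1.0])

def phi(x):
    return 0.5 * x @ A @ x - x @ b

xk = np.array([1.0, 2.0])
g = A @ xk - b                   # gradient of phi at xk
t = np.array([-g[1], g[0]])      # g rotated by 90 degrees: tangent direction, g . t = 0
t = t / np.linalg.norm(t)
u = g / np.linalg.norm(g)

eps = 1e-4
along_tangent = phi(xk + eps * t) - phi(xk)    # O(eps^2): stays on the level surface
along_gradient = phi(xk + eps * u) - phi(xk)   # O(eps): leaves the level surface
print(abs(along_tangent), abs(along_gradient))
```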
In general, I am talking in n dimensions: it is a tangent hyperplane in the n-dimensional space. What I want to show here is that this local behavior of the function in the neighborhood of x_k can be used to find the direction in which the function decreases at the maximum rate. See, if I am at x_k and I am minimizing phi, then to move from x_k to x_{k+1}, which direction should I move in? I should move in the direction in which the function decreases at the maximum rate. Why is that the gradient direction? I want to prove it. Here the gradient is the directional derivative: I want to show that if delta x_k is aligned along the gradient, the function increases at the maximum rate, and if it is aligned along the negative of the gradient, the function decreases at the maximum rate. So the local gradient gives me the direction of maximum rate of increase, and its negative gives me the direction of maximum rate of decrease, and I am going to use it to come up with the new point x_{k+1}. Before I do that, I have to show that this really is the direction of maximum descent. The first interpretation we have learnt is that grad phi(x_k) transpose delta x_k = 0 is nothing but the equation of the tangent hyperplane, with delta x_k locally perpendicular to the gradient. Now I construct a small unit ball in the neighborhood of x_k: the set of all points x such that the magnitude of delta x_k is at most unity. If you go back to the figure, pick up any point in this ball; its distance from x_k is at most 1. Is this clear? I am just picking up a set on which to do the analysis.
Now what is going to help me here is something that you can probably guess: the Cauchy-Schwarz inequality. The quantity grad phi(x_k) transpose delta x_k is an inner product of two vectors, so by Cauchy-Schwarz its absolute value is less than or equal to the norm of grad phi(x_k) times the norm of delta x_k. But I am looking at all x in the unit ball, so the norm of delta x_k is at most 1, and the maximum value it can take is 1. Which means I can write that the absolute value of grad phi(x_k) transpose delta x_k is at most the norm of grad phi(x_k). Expanding the absolute value, this also means that minus the norm of grad phi(x_k) is less than or equal to grad phi(x_k) transpose delta x_k, which is less than or equal to plus the norm of grad phi(x_k). So in a unit ball in the neighborhood of x_k, this inner product is bounded between these two numbers, one positive and one negative; it cannot be smaller than the lower bound. When does it attain the smallest value? When delta x_k is aligned along the negative of the gradient direction; when delta x_k is aligned with the gradient direction, up to sign, the inequality becomes an equality. Now why am I worried about this? Let us return to the figure and the Taylor expansion: phi(x) - phi(x_k) is approximately grad phi(x_k) transpose delta x_k. I want to go from x_k to a new value x, and the behavior of this inner product dictates how the function behaves locally. This is just the Taylor series expansion I wrote some time back, rearranged: phi(x_k) has been taken to the left-hand side. So if I move away from x_k to some new x, which direction should I move in?
If I want to decrease the function, which direction should I move in? I should move along the negative of the gradient direction. I am restricting myself to a unit ball around x_k, and I want to move inside this unit ball in such a way that the function decreases at the maximum rate. From the Cauchy-Schwarz inequality, I know that the maximum rate of decrease is obtained when delta x_k is aligned along the negative of the gradient direction; then I get the minus sign, the lower bound. When do you get equality in the Cauchy-Schwarz inequality? We had related it to cos theta, the angle between the vectors; here I am talking about two special angles, 0 and 180 degrees, the positive and negative gradient directions. If you are maximizing the function, you should move along the positive of the gradient direction; if you are minimizing the function, you should move along the negative of the gradient direction, because then the difference phi(x) - phi(x_k) is the smallest, the most negative it can be. So if I move along the negative of the gradient direction, I will decrease the function, and the way I should choose my next point x_{k+1} from x_k is by moving along the negative of the gradient direction, since I am minimizing the function. My objective was to minimize phi(x) with respect to x, and locally the function decreases fastest if I move along the negative of the gradient. What is the negative of the gradient direction? It is minus grad phi(x_k) divided by its norm. Why am I dividing by the norm? Because I am looking at unit vectors.
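The Cauchy-Schwarz argument can be checked by sampling: among all unit directions d, the first-order change grad phi transpose d is smallest when d points along the negative gradient, where it equals minus the norm of the gradient. A small sketch with a made-up gradient vector:

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.array([2.0, -1.0, 0.5])       # a made-up local gradient, grad phi(x_k)
d_star = -g / np.linalg.norm(g)      # negative unit gradient direction

# First-order change of phi along a unit direction d is g . d;
# by Cauchy-Schwarz it is bounded below by -||g||, attained at d_star.
best = g @ d_star                    # equals -||g||
for _ in range(1000):
    d = rng.standard_normal(3)
    d = d / np.linalg.norm(d)        # random unit direction
    assert g @ d >= best - 1e-12     # no sampled direction beats -||g||
print(best, -np.linalg.norm(g))
```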
So, this is a unit vector. What is the inner product of this vector with the gradient? The inner product of a vector with itself is the square of its norm, so if you take the inner product of the negative unit gradient direction with the gradient, you get minus the norm of the gradient. Is everyone with me on this? If you move in the negative of the gradient direction, the function will locally decrease at the maximum rate. So that is going to be my algorithm: to find x_{k+1}, I take x_k minus lambda times g_k, where g_k is grad phi(x_k) divided by its norm. Is this fine? Whether you put the minus with lambda or with g_k does not matter; whichever way you want to look at it, I am taking the unit direction along the gradient and moving along its negative. Of course, I am looking at the 2-norm; I am not really worried about other norms, so wherever I write a norm here, it is the 2-norm. So minus g_k is my negative gradient direction, and what is this lambda? I know that locally, if I go along the negative of the gradient direction, the function decreases. But how much do I move? I just know that this direction is the steepest descent; should I move 1 meter, 5 meters, 10 meters? So I am going to put one unknown here, lambda, which is my step length, while g_k is my direction. Having decided to move in this direction, I am going to solve another optimization problem: lambda_k is the minimizer, with respect to lambda, of phi(x_k - lambda g_k). What is the difference between the original problem and this minimization problem? This is a one-dimensional minimization problem: lambda is a scalar, the step length; the direction is fixed, and how much to move is given by the step length parameter. Now, how do we solve this problem?
In some cases this problem can be solved analytically; in other cases it has to be solved numerically. This one-dimensional optimization problem is called a line search, because we know in which direction to move; we just want to find out how much to move. In phi(x_k - lambda g_k), x_k is known and g_k is known, so lambda is the only unknown, one scalar to find. What I have to do, of course, is solve dou phi by dou lambda = 0: whichever value of lambda satisfies this optimality condition and gives the minimum, I choose as my step length. If phi is a highly complex non-linear function, this has to be done numerically, by an iterative one-dimensional optimization: you guess and then you find the minimum. I have described that in the notes; but in the case of solving A x = b we will have a nice time, because we can do this analytically. So let me go back: is the idea clear? The line of argument is this. Locally, the direction in which the objective function decreases fastest is the negative of the gradient. You know the direction to move, but you do not know how much to move; that is quantified by lambda, and we obtain lambda by a one-dimensional minimization with respect to lambda. When I take one step, I would like to decrease phi as much as possible. How do you find out how much is possible? Imagine that you are going down a slope; say the slope goes down like this and then it flattens out.
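When dou phi by dou lambda = 0 cannot be solved in closed form, the line search is done numerically. Below is a minimal sketch using golden-section search, one standard one-dimensional method, on a made-up non-quadratic phi; the bracket [0, 1] for lambda is an assumption for this example, not something from the lecture.

```python
import math

def golden_section(f, a, b, tol=1e-10):
    """Minimize a unimodal scalar function f on the interval [a, b]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c                      # minimum lies in [a, old d]
            c = b - invphi * (b - a)
        else:
            a, c = c, d                      # minimum lies in [old c, b]
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

# Made-up non-quadratic objective and its gradient.
def phi(x):
    return (x[0] - 1.0) ** 4 + (x[1] + 2.0) ** 2

def grad(x):
    return [4.0 * (x[0] - 1.0) ** 3, 2.0 * (x[1] + 2.0)]

xk = [3.0, 0.0]
g = grad(xk)
# Line search: minimize phi(x_k - lambda * g_k) over lambda in [0, 1].
lam = golden_section(lambda l: phi([xk[0] - l * g[0], xk[1] - l * g[1]]), 0.0, 1.0)
x_next = [xk[0] - lam * g[0], xk[1] - lam * g[1]]
print(lam, phi(x_next) < phi(xk))    # the chosen step decreases the objective
```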
Now, locally, if you go down one meter, your height will decrease, but your height might decrease even more if you go five meters. So how do you know how far to go? I know this is the local descent direction, but should I go one meter, three meters, five meters, nine meters? Nine meters might take me back up; I do not know, the contour could go down and then turn upward. So I should find the best possible step length by a minimization. Just remember one thing: you are trying to make a local move based only on the local derivative, and one derivative of a function carries limited information. So you cannot take too large a step using just the local gradient information; but you should not take too small a step either. To balance that, we introduce this lambda, minimize the function with respect to lambda, and thereby find out how much to move. Now, let us see the application to solving A x = b. For the sake of simplicity of notation, I am going to take phi(x) = (1/2) x^T A x - x^T b and derive the algorithm for the case where A is symmetric positive definite. If your matrix A is not symmetric positive definite, you already know what to do: pre-multiply both sides by A transpose. Now let us apply the algorithm. This is my phi, and I have a guess solution x_k. What is the local gradient, grad phi? Differentiate (1/2) x^T A x - x^T b with respect to the vector x.
The derivative of this objective function with respect to x gives you A x - b. What is the gradient evaluated at x_k, your guess solution? It is A x_k - b. Everyone with me on this? So what do I want to do next? Well, I do not always have to use a unit direction; I wrote the algorithm with a unit direction, but I can write it with the unnormalized direction, and lambda will get scaled accordingly. So I can say that I want to move by lambda times g_k, where g_k = A x_k - b. Now do the step length minimization. Can you solve it? Just try it. What is the step length minimization problem now? What is phi(x_{k+1})? It is (1/2) (x_k + lambda g_k)^T A (x_k + lambda g_k) minus (x_k + lambda g_k)^T b. What is known here? I know x_k, and I know g_k because g_k is a function of x_k; the only unknown is lambda. Can you tell me what dou phi(x_{k+1}) by dou lambda is? I want to set it equal to zero; just find out. There is one small correction here: I want to move in the negative of the gradient direction. The gradient direction is g_k = A x_k - b, so the direction in which I want to move is its negative, and x_{k+1} = x_k - lambda g_k.
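The formula grad phi = A x - b can be verified against a finite-difference approximation of the derivative. A small sketch, with a made-up symmetric positive definite A:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # made-up symmetric positive definite A
b = np.array([1.0, 2.0])

def phi(x):
    return 0.5 * x @ A @ x - x @ b

x = np.array([0.3, -0.7])
analytic = A @ x - b                     # claimed gradient of phi at x

h = 1e-6
numeric = np.zeros(2)
for i in range(2):                       # central differences, coordinate by coordinate
    e = np.zeros(2)
    e[i] = h
    numeric[i] = (phi(x + e) - phi(x - e)) / (2.0 * h)

print(np.max(np.abs(analytic - numeric)))   # agreement up to rounding
```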
So, put a minus here. What you have to do, of course, is expand this. You will realize that the term x_k^T A x_k vanishes on differentiation because it is not a function of lambda; you only have to keep the terms in which lambda appears. There will be cross terms linear in lambda, and a lambda squared term, lambda^2 g_k^T A g_k; here again you can neglect the term x_k^T b because it is not a function of lambda. What do you get after you minimize? Just expand and try; you do not have to substitute for g_k, maintain everything in terms of g_k, and find out which value of lambda satisfies the optimality condition. When you expand, you get a polynomial in the single variable lambda, with lambda squared, lambda and constant terms, because lambda is a scalar while g_k and x_k are known vectors. Setting the derivative with respect to lambda to zero, you should get an equation of the type lambda g_k^T A g_k - g_k^T g_k = 0; just check this, using g_k = A x_k - b. So it actually turns out that the lambda_k which minimizes this is nothing but g_k^T g_k divided by g_k^T A g_k. So my numerical algorithm becomes: how do I go from x_k to x_{k+1}? I first compute the negative of the gradient direction. See what the simplicity is here: no matrix inversion is involved. I just have to compute the gradient direction, and the gradient direction g_k = A x_k - b is nothing but the error between the left-hand side at the guess solution and the right-hand side b; actually, I want this error to vanish.
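Putting the pieces together, here is a minimal sketch of the resulting algorithm for A x = b with A symmetric positive definite (the test matrix below is a made-up example): compute g_k = A x_k - b, step along minus g_k with lambda_k = g_k^T g_k / (g_k^T A g_k), and stop when the norm of g_k is small.

```python
import numpy as np

def steepest_descent(A, b, x0, eps=1e-12, max_iter=10000):
    """Steepest descent for A x = b, with A symmetric positive definite."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(max_iter):
        g = A @ x - b                    # gradient of (1/2) x^T A x - x^T b
        if np.linalg.norm(g) < eps:      # termination: gradient ~ 0
            return x, k
        lam = (g @ g) / (g @ (A @ g))    # optimal step length from the line search
        x = x - lam * g                  # move along the negative gradient
    return x, max_iter

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # made-up symmetric positive definite matrix
b = np.array([1.0, 2.0])
x, iters = steepest_descent(A, b, np.zeros(2))
print(x, iters)                          # x satisfies A x ~ b
```

Note that each iteration needs only matrix-vector products and inner products, no matrix inversion, which is exactly the simplicity pointed out above.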
When will you get the solution? When the gradient becomes equal to zero. What is the meaning of the gradient becoming zero? You have reached the solution; a very straightforward, simple interpretation in this case, because the gradient becoming zero is a necessary condition for optimality. If the gradient is non-zero, you keep moving. How much do you move? Lambda_k times g_k; this is the optimum step length. If you move less than this, you are not decreasing the function enough; if you move more than this, that will not help either. Using only the local gradient, this is the optimum distance to move each time. Computing lambda_k is a scalar calculation, just inner product calculations with A symmetric positive definite, so calculating this scalar is very easy, and calculating the error g_k is very easy. When will you terminate the iterations? When g_k is very small: I could terminate by checking that the norm of g_k is less than some epsilon. Or, sometimes it is better to check g_{k+1} - g_k; if there is no significant change in the derivative, that can also be a termination criterion. If you have very large matrices, this method is very useful; it can come to the solution quickly, particularly if A is symmetric positive definite. I think there is a specific result about this; I will talk about it later. There is a modification of this called the conjugate gradient method, which we will discuss quickly in the next lecture, and then I will move on to well-conditioned and ill-conditioned systems.
So this method is actually very often used for solving large-scale problems. The computations involved are very simple: you just have to compute the gradient direction and inner products, and you can very quickly get approximate solutions, or quickly come very close to the true solution, using this method.