In this module we are going to talk about direct minimization algorithms. You may recall that there are essentially two ways of numerically solving the least squares problems that arise in the context of linear and nonlinear inverse problems. One is to use matrix-based algorithms, and we saw at least three families of these. The first is the Cholesky factorization, derived from the LU decomposition. The second is the QR decomposition, which comes out of the Gram-Schmidt orthogonalization process. The third is the singular value decomposition, derived from the eigendecomposition of the Gramians of the matrix H. These matrix-based techniques are called direct methods. We also alluded to iterative methods for solving linear systems, even though we did not cover them. In this lecture we are going to use minimization algorithms to find the best least squares estimate. You may recall that we already reviewed the basic principles of constrained and unconstrained multivariate optimization in one of the early modules on mathematical preliminaries; we will draw many of the ideas in this module from those discussions. With that as background, let me state the minimization problem in one dimension. Let f be a convex function from R to R, that is, a scalar-valued function of a scalar. An example of such an f is f(x) = ax² + bx + c with a > 0. By completing the square we can rewrite this as f(x) = a(x + b/(2a))² − (b² − 4ac)/(4a). The second term does not depend on x; the first term does. So by choosing x = −b/(2a) we annihilate the first term, and therefore x* = −b/(2a) is the minimizer.
The minimum value of the function is f(x*) = −(b² − 4ac)/(4a). Geometrically, with a > 0, f(x) represents an upward-opening parabola. If b² > 4ac the parabola intersects the x-axis at two points and the minimum lies below the x-axis; otherwise the minimum lies above the x-axis. With a > 0 this quadratic is a convex function, a simple parabola with a unique minimum, and the minimizer is x* = −b/(2a). Let me continue this example of minimization in 1D. Consider f(x) = x² + x + 1, which can be written as (x + 1/2)² + 3/4. In this case x* = −1/2 and f(x*) = 3/4. Since b² < 4ac, the roots x₁ and x₂ are complex and f(x) lies above the x-axis; that is, it has no real intersection with the x-axis. This continues the discussion of the problem on the previous slide. Now I would like to generalize to n dimensions and consider functions that are scalar-valued functions of a vector and convex on Rⁿ. A typical convex function on Rⁿ is the quadratic f(x) = (1/2) xᵀAx − bᵀx + c, where A is a symmetric positive definite matrix. Using the principles of multivariate calculus, the gradient of this f is ∇f(x) = Ax − b.
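As a quick numerical check of the 1D examples above (my own illustrative sketch, not part of the lecture; the function name is mine), the closed-form minimizer x* = −b/(2a) and minimum value −(b² − 4ac)/(4a) can be coded directly:

```python
def quad_min(a, b, c):
    """Return (x_star, f(x_star)) for f(x) = a*x**2 + b*x + c with a > 0."""
    assert a > 0, "convexity requires a > 0"
    x_star = -b / (2 * a)                    # annihilates the squared term
    f_star = -(b * b - 4 * a * c) / (4 * a)  # minimum value
    return x_star, f_star

# the lecture's example f(x) = x^2 + x + 1:
print(quad_min(1.0, 1.0, 1.0))   # (-0.5, 0.75), i.e. x* = -1/2, f(x*) = 3/4
```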
So at the point where the gradient vanishes, x* = A⁻¹b, and this x* is called the minimizer of f(x). The minimum value is f(x*) = c − (1/2) bᵀA⁻¹b; I would like all of you to work this out in detail and understand how the specific values are defined. While one could solve for the minimum by solving Ax = b using any of the matrix techniques (A is SPD, so I could apply a Cholesky- or QR-type algorithm), instead we seek to minimize f(x) iteratively by a gradient search, and that is the theme of this module. So what is the basis for the gradient technique? The gradient technique rests on the concept of a descent direction, that is, a direction along which the function decreases (decreasing means descent; increasing means ascent). Consider a contour on which f(x) takes a constant value c, and let x be a point on that contour. The gradient of f at that point gives the direction of maximum rate of increase, so the negative of the gradient is the direction in which the function decreases at the maximum rate. Any direction p that makes an acute angle θ with the negative of the gradient is called a descent direction, and the descent direction is characterized by the inequality pᵀ∇f(x) < 0.
So what does it mean for the inner product of a direction p with the gradient to be less than 0? Consider the tangent line to the contour of f at the point x. On one side of the tangent lies the direction of increase, on the other the direction of decrease; p points into the decreasing side, and the angle θ that p makes with the negative gradient satisfies |θ| < 90 degrees, which is exactly pᵀ∇f(x) < 0. Any p satisfying this is called a descent direction. What is the implication of a descent direction? If you move a small distance along p away from x, the function f is guaranteed to decrease. That is the defining property of a descent direction. Why are we interested in descent directions? We are interested in minimization, which means trying to find the point where the function takes its least value, so it makes sense to search for good descent directions and move along them to find the minimum. If you were trying to maximize, you would consider directions p on the other side of the tangent, called ascent directions; ascent and descent directions are duals of each other, so without loss of generality we will consider only minimization and descent directions. Now let p be a descent direction and let α be a small positive real number, with x the current point on the contour f(x) = c as in the previous picture. What does x + αp tell you? From x you move along p by a small distance, and the distance is controlled by α.
So x + αp is a neighbor of x, and f(x + αp) can be expanded in a first-order Taylor series: f(x + αp) ≈ f(x) + α pᵀ∇f(x). If p is a descent direction, then from the previous slide pᵀ∇f(x) < 0, and α is positive, so I am subtracting something from the value f(x); therefore f(x + αp) < f(x). What does this tell you? Starting at x, if you identify a descent direction and move a small distance along it, f(x) is guaranteed to decrease. That is the conclusion, and that is the power of the Taylor series expansion combined with p being a descent direction. Now our aim is to maximize the decrease; in other words, we are always greedy in problem solving. If I move a small distance away from x, controlled by α, I am looking for the direction p in which the decrease in f is maximal. A little reflection will reveal that this direction is p = −∇f(x): the negative of the gradient is the direction in which f(x) decreases at the maximum rate, and such a p is called the steepest descent direction, precisely because the rate of decrease along it is maximal. If I substitute p = −∇f(x) into the Taylor expansion above, I get f(x − α∇f(x)) ≈ f(x) − α‖∇f(x)‖², where the quantity being subtracted is α times the squared norm of the gradient at x, so the new value is definitely less than f(x).
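The descent condition and the Taylor-series decrease can be checked numerically on a small quadratic. This is my own sketch (the matrix, vectors, and step size are assumed, not from the lecture); it verifies that pᵀ∇f(x) < 0 for p = −∇f(x), that a small step along p decreases f, and that the formula ∇f(x) = Ax − b agrees with a finite-difference gradient:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # an SPD matrix (illustrative choice)
b = np.array([1.0, 1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x    # quadratic objective
grad = lambda x: A @ x - b               # its gradient

x = np.array([2.0, -1.0])                # some operating point
p = -grad(x)                             # steepest descent direction
alpha = 1e-3                             # small positive step

assert p @ grad(x) < 0                   # descent condition p^T grad f(x) < 0
assert f(x + alpha * p) < f(x)           # guaranteed decrease for small alpha

# central differences agree with grad(x) = A x - b:
eps = 1e-6
num = np.array([(f(x + eps * np.eye(2)[i]) - f(x - eps * np.eye(2)[i])) / (2 * eps)
                for i in range(2)])
assert np.allclose(num, grad(x), atol=1e-6)
```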
Therefore moving along the negative of the gradient, which is also called the steepest descent direction, guarantees the maximum rate of decrease of f(x) at x. If I am minimizing f(x) and moving away from the current operating point, the best direction to move is always the negative of the gradient at that point. This is the basis for almost all known minimization algorithms. These are the two key ideas, and now I am going to put them together. Let x_k be the current operating point. What is the current operating point? In my journey I started at a point x₀, I would like to settle down at the minimum, and at the k-th iteration I am somewhere, at the point x_k. I am going to define the residual r_k, a shortened form of r(x_k), as the negative of the gradient at x_k. Now recall that ∇f(x_k) = Ax_k − b, so the negative of the gradient is r_k = b − Ax_k, because f is the quadratic given above. This residual is not unrelated to the residual we talked about in the least squares problems: there we had z = Hx and called r(x) = z − Hx the residual. Here b plays the role of z and A plays the role of H, so you can see the relation between b − Ax and z − Hx. This r_k is the steepest descent direction of f at x_k, so from the previous discussion the steepest descent direction now has two interpretations: it is the residual, and it is the negative of the gradient, and by definition both are the same.

Now how do I know that I have reached the minimum? At the minimum the necessary condition is that the gradient must vanish, therefore r_k = 0 when x_k = x* = A⁻¹b. Please remember, A⁻¹b is an expression for the minimizer; I do not know its actual numerical value, but I do know the mathematical expression that defines it for this quadratic problem. So how do I test whether I have reached the minimum or not? One way is to compute the norm of the residual: if ‖r_k‖ is very close to 0, then I am very close to the minimum. So ‖r_k‖ is a measure of how far my current operating point x_k is from the minimum x*; I now have a meter, so to speak, that measures my distance from the minimum. Please recall that I do not know where the minimum is; if I had known, there would be no need to do any of this. Without knowing the minimum, I need to develop a test for how far away I am, and one good measure of the distance of the current point from the minimum is the norm of the residual, which is also the norm of the negative of the gradient. That is a very beautiful way to measure how far I am from the minimum, and therefore ‖r_k‖ can be used as a convergence test for the iterative minimization: it measures the distance from where I am to where I want to be.

With this, I am now going to define the framework for the so-called steepest descent algorithm. Let x_k be the current operating point, let r_k be the direction of steepest descent at x_k, and let α be the step length. Then x_k + α r_k is my new operating point. What is the property of the new operating point? f(x_{k+1}) < f(x_k). But even though f(x_{k+1}) < f(x_k), the size of the decrease |f(x_{k+1}) − f(x_k)| depends on the choice of α. α is called the step length parameter; it decides how far I go from x_k in the direction r_k. So what have we seen? At step k, x_k is fixed, and the direction of search r_k is also fixed. Therefore the question is: given x_k and r_k, how do we choose α such that we get the maximum decrease in f(x) as we move from x_k to x_k + α r_k? I hope this question is clear: α is so far an arbitrary parameter and we have not explained how to pick it. Having decided to move along the negative of the gradient, the question that remains is how far to go in the chosen direction, and that question is embedded in the statement above.

Now look at this. We originally started minimizing f(x); we are at the current operating point x_k and have decided to move along the direction r_k. The search along r_k is one-dimensional, so I would like to minimize f(x) as a function of α in that direction, and we get a new one-dimensional minimization problem. This problem can be stated as follows: define g(α) = f(x_k + α r_k). Please understand: f is a known quadratic, x_k is known, and r_k is known, so if you substitute x_k + α r_k into the quadratic form, everything is known but α, and I get a function of α alone. α is a real parameter, so g maps R to R. Therefore we have reduced the problem to a 1D minimization problem, and that is the important recognition one needs to develop at this point.

I am now describing the fundamental principle of successive minimization, the divide and conquer principle: given an n-dimensional minimization of f(x), we reduce it to a sequence of one-dimensional minimizations of g(α) at x_k along the steepest descent direction r_k, the negative of the gradient. What is the idea here? I am at the point x_k, and along the direction r_k I go a distance controlled by α; the step is α r_k, and the resulting point becomes x_{k+1}. Then from x_{k+1} I consider r_{k+1}, move α r_{k+1}, and define x_{k+2}, and so on: I move from x_k to x_{k+1} to x_{k+2}. So the basic idea is: given x_k and the direction r_k, I minimize f(x) along r_k and let x_{k+1} be the minimizing point along that direction; along the direction, f(x) becomes a function of α, which is g(α). Having found x_{k+1}, I again compute the negative of the gradient, move α r_{k+1}, and that defines the point x_{k+2}. Therefore I start with x₀, go to x₁, to x₂, up to x_k and x_{k+1}, and the search continues. In going from x₀ to x₁, and from x₁ to x₂, I essentially solve a one-dimensional minimization problem, and this one-dimensional problem always lies along the negative of the gradient at the current point. Herein lies the basic principle of divide and conquer: a given n-dimensional minimization problem is reduced to a sequence of one-dimensional minimization problems g(α) at x_k along the steepest descent direction r_k, for k = 0, 1, 2, 3, ..., and that generates the sequence of points x₀, x₁, x₂, ..., x_k, x_{k+1}, .... What is the idea here? x₁ is closer to the minimum than x₀ was; x_{k+1} is closer to the minimum than x_k was. So there are two ideas at work. There is a greedy principle: whenever I move, I would like to get closer to the minimum. There is also a divide and conquer principle: I solve a given n-dimensional minimization problem as a sequence of one-dimensional minimization problems, dividing a larger problem into smaller ones and solving a sequence of simpler problems to solve a complex problem. It is an amalgam of these two principles, the greedy and the divide and conquer, that is the basis for the iterative framework for minimizing f(x). I hope the basic ideas are very clear. I would also like to remind the reader of the relation to the idea of hill climbing. A mountaineer who wants to reach the top of Everest moves from base camp to base camp; they cannot go in a straight line from where they are to the peak, and if they could, there would not be much interest in climbing Everest.
So the path to the peak from where you are depends on the local properties of the mountain. A clever mountaineer always looks at the local terrain and chooses a direction such that, after walking for some hours, the elevation has increased; every time they move from one base camp to another, the height of the camp becomes higher and higher and they get closer and closer to the summit. Humans have been utilizing this idea of hill climbing, based on the rate of change of the terrain at a given point, for a long time. The steepest descent algorithm is patterned after what humans do when they climb hills to reach the peak; the only difference is that instead of climbing the hill I am descending to the valley, but the analogy between hill climbing and descending to the valley should be extremely clear from the basic discussion. So what are the key ingredients in any minimization? The current operating point x_k, the descent direction r_k, and the step length parameter α, with x_{k+1} = x_k + α r_k. I know x_k, I know r_k, and I want to decide the best α_k. I would like to illustrate this idea of deciding the best α using a quadratic minimization problem as the model problem: f(x) = (1/2) xᵀAx − bᵀx + c with A SPD, which is the problem we started with. I am going to set x_{k+1} = x_k + α r_k; I still do not know the best value of α, so α is a free parameter. I define g(α) = f(x_{k+1}) = f(x_k + α r_k). If I substitute x = x_k + α r_k into f(x), it takes a particular form; I would like all the readers to carry out the substitution and the simplification.
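The substitution the lecture asks the reader to carry out can be checked numerically. In my own sketch below (the matrix, vector, and test points are assumed), expanding f(x_k + α r_k) for the quadratic f(x) = (1/2) xᵀAx − bᵀx gives g(α) = f(x_k) + α r_kᵀ(Ax_k − b) + (1/2) α² r_kᵀAr_k, a quadratic in α, and the two forms agree:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # SPD (illustrative choice)
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x

x_k = np.array([1.0, -1.0])              # some current operating point
r_k = b - A @ x_k                        # steepest descent direction

g = lambda a: f(x_k + a * r_k)           # f restricted to the line through x_k
g_expanded = lambda a: (f(x_k)
                        + a * (r_k @ (A @ x_k - b))        # linear term in alpha
                        + 0.5 * a**2 * (r_k @ A @ r_k))    # quadratic term

for a in (0.0, 0.1, 0.5, 2.0):
    assert np.isclose(g(a), g_expanded(a))   # the expansion matches exactly
```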
f(x_k) is a number, r_kᵀAx_k is a number, and r_kᵀAr_k is also a number, so g(α) is a simple quadratic function of α. That is to be expected, because g(α) is a slice of f(x) along the direction r_k. To minimize g with respect to α (and please understand, minimizing f(x) along the direction r_k is the same as minimizing g with respect to α), we set the derivative of g with respect to α equal to 0. The minimizer of g(α) is

α_k = −r_kᵀ(Ax_k − b) / (r_kᵀAr_k),

and remembering that b − Ax_k = r_k, the whole expression becomes

α_k = (r_kᵀr_k) / (r_kᵀAr_k).

So the optimal value of α, the one that maximizes the decrease in the value of the function along the direction r_k, is α_k, and this α_k is always greater than 0 unless r_k = 0. When r_k = 0, we have already reached the minimum; as long as I am not at the minimum, the step length is positive. I now have all the information I need to build the algorithm.

The steepest descent algorithm, also called the gradient algorithm, can be summarized as follows. Let f(x) = (1/2) xᵀAx − bᵀx + c, and let x₀ be a starting point.
1. Compute the residual r_k = b − Ax_k, the negative of the gradient (at k = 0, r₀ = b − Ax₀).
2. Compute the optimal step length α_k = (r_kᵀr_k)/(r_kᵀAr_k).
3. Update the iterate: x_{k+1} = x_k + α_k r_k, moving from k to k + 1.
4. Test for convergence; if the test passes, exit, otherwise update the residual: r_{k+1} = r_k − α_k A r_k.

Step 4 is called the residual update: the negative of the gradient at x_{k+1} is related to the negative of the gradient at x_k by a correction term, and from the definition of r_k one can very readily verify this update. These four steps together give the framework for the optimization algorithm. The algorithm generates a sequence of iterates x₁, x₂, ..., x_k, ..., and by the greedy principle x_k is closer to x* than x₀ was; that means I am moving towards my goal. The algorithm is extremely simple and very easy to implement, and it is called the gradient algorithm or the steepest descent algorithm.

Now I would like to talk about some properties of the residual. The residual at x_{k+1} is r_{k+1} = b − Ax_{k+1}; substituting x_{k+1} = x_k + α_k r_k gives r_{k+1} = r_k − α_k A r_k, which is exactly the residual update you have seen in step 4 of the algorithm, so here is the verification of the correctness of step 4. Next, take the inner product of r_k and r_{k+1}: you can readily see from the calculation r_kᵀr_{k+1} = r_kᵀr_k − α_k r_kᵀAr_k = 0, by the choice of α_k. So what does that mean? Two successive search directions are orthogonal; please remember that r_k is the direction of search at x_k.

With these properties, I can now ask the fundamental question: while I know x_k is moving towards x*, what happens to x_k as k becomes large? Under what condition does the limit of x_k equal x*, where x* = A⁻¹b? That is the convergence question: when is the limit equal to the minimum? If I can show that x_k tends to x* in the limit, then I have guaranteed the convergence of the algorithm. Let me summarize what we have done so far. We introduced the notion of a descent direction and the notion of the steepest descent direction. We then said that, given an operating point and a descent direction, we find the value of the step length for which we get the maximum decrease; I move from base point to base point, guaranteeing that at the chosen point, along the chosen direction, I have obtained the maximum decrease possible. That is my greedy strategy implemented. The divide and conquer strategy says that I search for the minimum along successive gradient directions, and the previous analysis tells us that successive directions of search are mutually orthogonal. With that, we now have the burden of showing that, while the iteration moves towards the minimum, it will indeed hit the minimum as the number of iterates k grows unbounded; that is what is called the convergence proof.

In order to prove convergence, I am going to introduce another term, the error e_k = x_k − x*. I do not know the numerical value of x*; I only know its form A⁻¹b. If I multiply e_k by A, I get Ae_k = Ax_k − b = −r_k, so the error and the residual are related by Ae_k = −r_k. We already know that the minimum occurs where the norm of the residual is 0, and when the norm of the residual is 0, the norm of the error must also be 0. Therefore I could use either r_k or e_k to analyze convergence. The only difference is that r_k is measurable and e_k is not; e_k is not measurable because I do not know x*. Even so, e_k is very useful in the theoretical analysis: r_k is measurable, but e_k is useful in proving convergence of the sequence. To prove convergence, instead of showing x_k tends to x*, it is enough to show that e_k tends to 0 as k tends to infinity; that is an equivalent way of proving convergence. Those of you involved only in applications can simply take the algorithm, program it, and apply it, but we would like to go a bit further and provide a complete analysis of this algorithm.

In order to prove convergence, I am going to define an energy function E(x_k) = f(x_k) − f(x*), the value of the function at x_k minus the minimum value. Setting b = Ax* and simplifying, it can be shown that E(x_k) = (1/2) e_kᵀAe_k. You may recall from our module on matrices that this is called an energy norm; E(x_k) is one half of the square of the energy norm of the error. E(x_k) > 0 unless e_k = 0. Why is that? Because A is positive definite: from the definition of a positive definite quadratic form we already know xᵀAx > 0 for all x ≠ 0, and it equals 0 only when x = 0. Since A is positive definite, E(x_k), being the (squared) energy norm, is a good measure of how far I am from the minimum: if E(x_k) = 0 I am at the minimum, and if not, I am not. It is a kind of meter that tells how far x_k is from x*. Again I want to reinforce: E(x_k) = 0 only when x_k = x*.

So what is the framework for the proof of convergence? I want to evaluate E(x_k) along the trajectory and prove that it is a decreasing function of k. E(x_k) is bounded below by 0, so I have a positive function bounded below by 0; if I can prove that E(x_k) tends to 0 as k tends to infinity, I will have proved convergence. This framework of using an energy function to prove convergence is an old idea due to the famous Russian mathematician Lyapunov, and it has come to be called the Lyapunov technique; the convergence proof we are going to give is directly related to the fundamental principles Lyapunov introduced around the turn of the last century, in the early 1900s. We now derive a recursive relation for E(x_k). Since I am interested in evaluating E(x_k) along the trajectory, I first compute E(x_{k+1}) and relate it to E(x_k). Substituting x_{k+1} = x_k + α_k r_k, using x* = A⁻¹b, and simplifying (a nontrivial simplification that I recommend all of you undertake), it can be shown that

E(x_{k+1}) = β_k E(x_k),  where  β_k = 1 − (r_kᵀr_k)² / ((r_kᵀAr_k)(r_kᵀA⁻¹r_k)).   (*)

So I have now related E(x_{k+1}) to E(x_k), with β_k as the multiplying constant. What does this tell you? If you can show β_k < 1, then E(x_{k+1}) < E(x_k). The whole proof of convergence now rests on showing that β_k is bounded by a constant less than 1; if it is, then from equation (*) we will essentially have proved convergence. So convergence boils down to showing that β_k < 1.
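The whole machinery developed so far, the iteration, the optimal step length, the residual update, the orthogonality of successive residuals, and the energy contraction E(x_{k+1}) = β_k E(x_k), can be put together in a short program. This is my own minimal sketch (the matrix, right-hand side, and tolerances are assumed, not from the lecture):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Gradient (steepest descent) iteration for f(x) = 0.5 x^T A x - b^T x, A SPD."""
    x = np.array(x0, dtype=float)
    r = b - A @ x                        # residual = negative gradient
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:      # convergence test on ||r_k||
            break
        alpha = (r @ r) / (r @ A @ r)    # optimal step length alpha_k
        x = x + alpha * r                # iterate update
        r = r - alpha * (A @ r)          # residual update (step 4)
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = np.linalg.solve(A, b)                          # x* = A^{-1} b
assert np.allclose(steepest_descent(A, b, np.zeros(2)), x_star)

# one explicit step: orthogonality of residuals and the energy contraction (*)
x_k = np.zeros(2)
r_k = b - A @ x_k
alpha_k = (r_k @ r_k) / (r_k @ A @ r_k)
x_k1 = x_k + alpha_k * r_k
r_k1 = r_k - alpha_k * (A @ r_k)
assert abs(r_k @ r_k1) < 1e-12                          # r_k^T r_{k+1} = 0

E = lambda x: 0.5 * (x - x_star) @ A @ (x - x_star)     # energy function
beta_k = 1 - (r_k @ r_k) ** 2 / ((r_k @ A @ r_k) * (r_k @ np.linalg.inv(A) @ r_k))
assert np.isclose(E(x_k1), beta_k * E(x_k))             # E(x_{k+1}) = beta_k E(x_k)
```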
This quantity β_k < 1 is analyzed through a very well known result called the Kantorovich inequality; in fact it was Kantorovich, another Russian mathematician, who was the first to prove the convergence of the gradient algorithm. So I am now going to quote the Kantorovich inequality, which is a very powerful and very useful inequality. There are several different versions of it in the literature; I am stating the simplest possible version, and it goes as follows. Let A be a symmetric positive definite matrix with eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_n; since A is SPD, even λ_n, the smallest eigenvalue, is greater than 0. The Kantorovich inequality says that for any vector y in R^n,

    (y^T y)^2 / [(y^T A y)(y^T A^{-1} y)] ≥ 1 − [(λ_1 − λ_n) / (λ_1 + λ_n)]^2.

The proof is very well known; you can find several versions of it in Wikipedia, for example, or in many textbooks on optimization theory. I think it is an interesting exercise to prove this inequality, but for now we take it for granted. Please remember the inequality holds for any y; the vector I am interested in is r_k, about which I know nothing in advance except that it is the negative of the gradient at the point x_k. Therefore, substituting y = r_k into the Kantorovich inequality and combining it with the expression for β_k from the previous slide, it readily follows that

    β_k ≤ [(λ_1 − λ_n) / (λ_1 + λ_n)]^2.

Dividing the numerator and the denominator by λ_n gives

    β_k ≤ [(λ_1/λ_n − 1) / (λ_1/λ_n + 1)]^2,

and the numerator is smaller than the denominator. Now please recall κ_2(A), the spectral condition number of the matrix A, which is the ratio λ_1/λ_n of the largest to the smallest eigenvalue. Therefore

    β_k ≤ [(κ_2(A) − 1) / (κ_2(A) + 1)]^2.

I am going to call this quantity β, and it is trivial to verify that β < 1. So β_k is uniformly bounded by β < 1 for all k. I also want you to recognize that the condition number is always greater than or equal to 1 when A is SPD; so you can readily see the notion of a condition number being critically used in the proof of convergence of the gradient method. Combining all these,

    E(x_{k+1}) ≤ β E(x_k),

where I have replaced β_k by its uniform bound β < 1. Iterating this relation gives

    E(x_k) ≤ β^k E(x_0),

and β^k, with β < 1, goes to 0 as k goes to infinity. Therefore E(x_k) → 0 as k → ∞, which implies x_k → x*, and hence convergence. So by exploiting the greedy nature of the method, by relating the energy norm of the error to the residual r_k as in the previous slides, and by cleverly invoking the Kantorovich inequality, we are able to prove convergence. What does this mean? It means the gradient method for quadratic functions will indeed converge starting from any point.
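Since the Kantorovich inequality is the linchpin of this argument, it is worth checking numerically. Here is a small sketch (my own illustration, not from the lecture; the matrix and sample size are arbitrary choices) that verifies the inequality for random vectors against a random SPD matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive definite matrix A.
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)

lam = np.linalg.eigvalsh(A)          # eigenvalues in ascending order
lam_n, lam_1 = lam[0], lam[-1]       # smallest and largest eigenvalue
bound = 1.0 - ((lam_1 - lam_n) / (lam_1 + lam_n)) ** 2

A_inv = np.linalg.inv(A)
for _ in range(1000):
    y = rng.standard_normal(n)
    ratio = (y @ y) ** 2 / ((y @ A @ y) * (y @ A_inv @ y))
    assert ratio >= bound - 1e-12    # Kantorovich inequality holds for every y
```

Note that 1 − [(λ_1 − λ_n)/(λ_1 + λ_n)]^2 simplifies to 4λ_1λ_n/(λ_1 + λ_n)^2, which is the form in which the inequality is often quoted.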
So that is the main theorem, the main summary: if f(x) = ½ x^T A x − b^T x + c and A is SPD, then the gradient algorithm, starting from any point x_0, converges to the minimizer as k → ∞. That is a guarantee; it is what is called a convergence theorem, and everyone who uses the gradient algorithm to minimize quadratic functions relies on the power of this theorem. So here is the basic idea: by cleverly measuring how far we are from the minimum and applying the greedy principle, we have shown not only that we move closer to the minimum at every step, but that we land exactly on top of it as k goes to infinity. Once convergence is proved, the next question is how fast we approach the minimum; that is what is called the rate of convergence. Please understand that the rate of convergence depends on β: the smaller β is, the faster β^k goes to 0. For example, if β = 0.9, β^k decays to 0 slowly, but if β = 0.1 it decays very fast. So the smaller the value of β, the faster the convergence; the larger the value of β, the slower the convergence. Now, β in turn depends on the condition number. Please go back to the previous slide, where

    β = [(κ_2(A) − 1) / (κ_2(A) + 1)]^2.

If κ_2(A) is very large, say 10,000, the ratio inside the square is 9999/10001, which is very close to 1, so β is very close to 1. On the other hand, if κ_2(A) is 10, the ratio is 9/11, β is much smaller than in the previous case, and β^k comes down very fast.
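To make the contraction concrete, here is a minimal steepest-descent sketch (my own illustration; the matrix, right-hand side, and starting point are arbitrary choices) that checks the per-step bound E(x_{k+1}) ≤ β E(x_k) in the energy norm and the resulting convergence:

```python
import numpy as np

# Minimize f(x) = 0.5 x^T A x - b^T x by steepest descent with exact line search.
A = np.diag([1.0, 3.0, 10.0])              # SPD with kappa_2(A) = 10
b = np.array([1.0, 1.0, 1.0])
x_star = np.linalg.solve(A, b)             # the exact minimizer

beta = ((10.0 - 1.0) / (10.0 + 1.0)) ** 2  # uniform bound from the Kantorovich inequality

def E(x):
    e = x - x_star                         # E is the energy norm of the error
    return e @ A @ e

x = np.array([5.0, -3.0, 2.0])
for _ in range(100):
    r = b - A @ x                          # residual = negative gradient
    if np.linalg.norm(r) < 1e-12:
        break
    alpha = (r @ r) / (r @ A @ r)          # optimal step length for a quadratic
    x_next = x + alpha * r
    assert E(x_next) <= beta * E(x) + 1e-12   # per-step contraction holds
    x = x_next

assert np.allclose(x, x_star, atol=1e-6)      # converged to the minimizer
```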
Therefore ultimately the rate of convergence is controlled by the condition number alone; and since the condition number of a given matrix A is fixed, the rate of convergence is fixed by A. That is an important conclusion that comes out of this. Now I would like to bring up a practical question. The theory says that E(x_k) → 0 as k → ∞, but if I am doing arithmetic on a finite precision machine, what am I really looking for? We already know

    E(x_k) ≤ β^k E(x_0),   that is,   E(x_k) / E(x_0) ≤ β^k.

I do not want to wait until β^k goes all the way to 0; I only want to make β^k less than ε, where ε = 10^(−d). So what is d? In single precision arithmetic I cannot compute anything beyond about 5 or 6 decimal digits of accuracy; in double precision arithmetic I cannot compute anything beyond about 13 or 14 decimal digits. So by picking d = 6 I can think of single precision arithmetic, and by picking d = 13 or 14 I can think of double precision. Whatever it may be, d is a coefficient I choose, using my own judgment, to decide when to exit the algorithm. Setting β^k = 10^(−d) and taking logarithms on both sides, for a given d the required iteration count is

    k* = ⌈ d / log_10(1/β) ⌉,

and since β < 1, 1/β is greater than 1.
So the logarithm of a number greater than 1 is positive, and d is positive. Here ⌈x⌉ is the ceiling of x, whose value is the smallest integer greater than or equal to x; so k* is the smallest integer not less than d / log_10(1/β). What does this mean? For a given β, in k* iterations the ratio of the energy of the error at step k to the energy at step 0 will be less than or equal to 10^(−d). This is the actual measure one would use in practical applications: by picking k*, which depends on d, I can precompute the number of iterations needed to get that close to the minimum. Now I am going to give you a feel for the dependence of k* on β and κ. When the condition number κ_2 is 1 — and for which matrix is the condition number 1? the identity matrix — β is 0, but that case generally does not happen, so we are not interested in quadratic forms whose SPD matrix is the identity. Let us consider the other cases. When κ is 10, β ≈ 0.67 and k* is about 40; likewise, when κ is 10^4, β ≈ 0.9996 and I would indeed need about 40,000 iterations. So you can really see the power of the measure given by κ: as κ increases, the number of iterations needed also increases. These figures are for d = 6. So, by prespecifying the accuracy with which we want to determine the minimum, for a given κ we can estimate the number of iterations needed to achieve it.
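The iteration-count formula is easy to tabulate. A small sketch (my own, using the κ values from the lecture's table):

```python
import math

def k_star(kappa, d):
    """Smallest k with beta**k <= 10**(-d), where beta = ((kappa-1)/(kappa+1))**2."""
    beta = ((kappa - 1.0) / (kappa + 1.0)) ** 2
    return math.ceil(d / math.log10(1.0 / beta))

for kappa in (10, 100, 10_000):
    print(f"kappa = {kappa:>6}, k* = {k_star(kappa, d=6)}")
```

For d = 6 this gives a k* in the tens for κ = 10 and in the tens of thousands for κ = 10^4, in line with the orders of magnitude quoted above.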
Once I know the desired accuracy and hence the number of iterations, I can decide the total amount of time needed to reach the minimum using the gradient search algorithm. So this table essentially summarizes the performance of the gradient algorithm in minimizing quadratic objective functions. I am now going to illustrate this with an example. Consider the matrix

    A = [ 1  0 ; 0  λ ],

where λ > 1 is a parameter, and let f(x) = ½ x^T A x. For this A the function is f(x) = ½ (x_1² + λ x_2²); its level sets are ellipses, and the minimum of f occurs at x* = (0, 0). I can compute the gradient of this function: ∇f(x) = (x_1, λ x_2). Just for the fun of it, I am going to take x_0 = (λ, 1) as my initial point. You can verify that α_0 = 2/(λ + 1), and hence

    x_1 = [(λ − 1)/(λ + 1)] (λ, −1).

So I am actually computing the progress of the iteration from step 0 to step 1 for this specific example, and I would encourage you to verify each of these values. Continuing, it can be shown that starting from x_0 = (λ, 1),

    x_k = [(λ − 1)/(λ + 1)]^k (λ, (−1)^k).

Please understand: the constant (λ − 1)/(λ + 1) that is raised to the power k is less than 1, so that factor goes to 0, and therefore x_k → 0 = x* as k → ∞. So we have verified convergence for any λ. For the particular example λ = 4,

    x_k = 0.6^k (4, (−1)^k),

and 0.6^k → 0 as k → ∞, so x_k → x*.
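This closed form is easy to check against the actual iteration. A small sketch (my own verification of the formula above, with λ = 4):

```python
import numpy as np

lam = 4.0
A = np.diag([1.0, lam])                  # f(x) = 0.5 x^T A x, minimizer x* = (0, 0)

x = np.array([lam, 1.0])                 # start at x_0 = (lambda, 1)
rho = (lam - 1.0) / (lam + 1.0)          # contraction factor, 0.6 for lambda = 4

for k in range(20):
    # Closed form claimed in the lecture: x_k = rho**k * (lambda, (-1)**k)
    expected = rho ** k * np.array([lam, (-1.0) ** k])
    assert np.allclose(x, expected)
    r = -A @ x                           # residual = negative gradient
    alpha = (r @ r) / (r @ A @ r)        # exact line search step
    x = x + alpha * r
```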
Now what is the basic idea here? The iterates have a zigzag behavior. Look at the closed form: the first component is 0.6^k · 4, and the second component carries the factor (−1)^k, so when k is even the second component is positive and when k is odd it is negative. What does this mean? Even iterates lie above the x axis and odd iterates lie below it, so, starting from x_0, the sequence zigzags across the x axis toward x* with decreasing amplitude. In other words, the iterates exhibit an oscillatory behavior, and that is essentially what is responsible for the slow convergence. You saw in the previous table that when κ is of the order of 10^4 it takes about 40,000 iterations; why does it take that many? Because the iterates oscillate: instead of moving directly toward the minimum, they keep zigzagging while making progress, so the progress toward the minimum is slower. That is an inherent behavior of the gradient algorithm; the presence of this oscillation is what is responsible for the slow convergence exhibited by this specific problem.
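The zigzag can be observed directly by tracking the second coordinate of the iterates. A tiny sketch (my own, continuing the λ = 4 example above):

```python
import numpy as np

lam = 4.0
A = np.diag([1.0, lam])
x = np.array([lam, 1.0])

second = []
for _ in range(10):
    second.append(x[1])
    r = -A @ x                          # negative gradient
    x = x + ((r @ r) / (r @ A @ r)) * r # exact line search step

# Consecutive iterates land on opposite sides of the x axis,
# with the amplitude of the oscillation shrinking each time.
for a, b in zip(second, second[1:]):
    assert a * b < 0
    assert abs(b) < abs(a)
```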
Now I would like to talk about the 1D search. Remember, we have to decide α_k: given an operating point x and a descent direction p, the optimal step length is obtained by minimizing g(α) = f(x + αp). I am going back to one of the problems we discussed earlier: we want to set the derivative of g with respect to α to 0 and solve. When f is quadratic, g is quadratic and g'(α) is linear in α, so this can be solved in closed form. When f is not quadratic, g'(α) = 0 can be solved only numerically. What does this mean? I am now thinking of extensions of the gradient algorithm to non-quadratic functions. Everything we have proved so far is for quadratics, but not all functions we are called upon to minimize are quadratic. When you are minimizing a non-quadratic, possibly highly nonlinear function with the gradient method, choosing the step length parameter α is a little more involved, because the equation g'(α) = 0 is no longer linear; I would like to bring that caution to the forefront right now. So what do we do in that case? We compute a quadratic approximation. Recall that g(α) is the slice of f along the search direction starting at the current point; I do not know g(α) in closed form because f is nonlinear, so I fit a quadratic model

    m(α) = a α² + b α + c

that approximates g(α). The model has 3 unknown parameters a, b and c, and I have 3 pieces of information about g: g(0) = f(x), g(1) = f(x + p), and the derivative g'(0), which is the directional derivative ∇f(x)^T p. Matching the model to these: m(0) = c, so c = g(0); m(1) = a + b + c, which I equate to g(1); and m'(α) = 2aα + b, which at α = 0 is b, so b = g'(0). I now know c, I know b, and I know a + b + c, so I can solve:

    c = g(0),   b = g'(0),   a = g(1) − g(0) − g'(0).

These are all computable quantities, because g is the slice of f along the search direction and f is known. Once I have a, b and c, even though I do not know g(α) exactly, I have a quadratic approximation to it; I minimize the model by setting m'(α) = 0, which gives the step length

    α* = −b / (2a) = −g'(0) / [2 (g(1) − g(0) − g'(0))].

You would generally use this expression when applying the gradient method to a general nonlinear function, whereas you would use α_k = r_k^T r_k / (r_k^T A r_k) when f is a quadratic function. So we have now talked about generalizing the gradient method to problems where the function may not be quadratic.

Now a look back, a summary of sorts. The gradient method converges only asymptotically, even for quadratic functions; that is the important point. It converges for quadratic functions, but asymptotically — that is, as the number of iterations goes to infinity. In practice we may not have all the time needed to wait for convergence, so we cut it off at a desired place: we stop when the error ratio falls below 10^(−d), where d is 6 or 10 or 14 depending on the accuracy we want, in which case the total number of iterations can be precomputed according to the table, and it fundamentally depends on the condition number of the matrix A. And all these good results hold only for quadratic functions; the key thing I want to emphasize is that the convergence is asymptotic. That behooves us to ask a question: is finite time convergence possible at all? Theoretically the answer is yes: the well-known conjugate gradient method, built on the conjugate direction idea, can in principle achieve this goal of finite time convergence, at least for quadratic functionals. So what is the summary here? Gradient algorithms are very good for quadratic functions; they converge asymptotically, and we can precompute the number of iterations needed to reach a desired accuracy. Those are the fundamental properties of the gradient algorithm, and they also set the limits of its power. Once we have understood the asymptotic nature of its convergence, we would like to ask ourselves: is there a way to start from an arbitrary point and reach the minimum in finite time? That is the ultimate desire in the design of any minimization algorithm, and we would like to explore, at least theoretically, whether it is feasible; you cannot do anything in practice if it is not theoretically possible. This question of finite time convergence arises directly from the analysis of the gradient algorithm, and the answer is yes: the class of methods called conjugate direction or conjugate gradient methods provides, in principle, this framework of finite convergence. That is the next topic, which we will pursue in our next lecture.
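Looking back at the quadratic-model line search described above, here is a minimal sketch of that step-length rule (my own illustration; the function and direction are arbitrary choices). On a quadratic f the model is exact, so the fitted step must agree with the closed-form r^T r / (r^T A r):

```python
import numpy as np

def quadratic_fit_step(f, grad_f, x, p):
    """Fit m(alpha) = a*alpha**2 + b*alpha + c to g(alpha) = f(x + alpha*p)
    using g(0), g(1), g'(0), and return the minimizer of the model."""
    g0 = f(x)                       # g(0) = c
    g1 = f(x + p)                   # g(1) = a + b + c
    dg0 = grad_f(x) @ p             # g'(0) = b  (directional derivative)
    a = g1 - g0 - dg0
    b = dg0
    return -b / (2.0 * a)           # alpha* = -b / (2a)

# Sanity check on an actual quadratic, where the model is exact.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
bvec = np.array([1.0, 0.0])
f = lambda x: 0.5 * x @ A @ x - bvec @ x
grad_f = lambda x: A @ x - bvec

x = np.array([2.0, -1.0])
r = -grad_f(x)                      # steepest descent direction
alpha_model = quadratic_fit_step(f, grad_f, x, r)
alpha_exact = (r @ r) / (r @ A @ r) # closed-form step for quadratics
assert np.isclose(alpha_model, alpha_exact)
```

For a general nonlinear f one would use `quadratic_fit_step` as the step-length rule inside the gradient iteration, since no closed-form α exists there.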