So far, in the context of minimization algorithms, we have analyzed almost all the properties of gradient-based algorithms for the quadratic minimization problem, and we have also indicated how to adapt the gradient algorithm to non-quadratic functions. Whether or not the function is quadratic, the fundamental principle is the same: I have an operating point, and I have the direction of the gradient. For quadratic functions there is an explicit formula for the step length; for non-quadratic functions, in general, no closed-form expression for the optimal alpha_k exists, and we can only compute approximate values of the step-length parameter at iteration k. That was done by fitting a quadratic model to the slice of f(x) centered at x_k in the direction r_k. We could further refine that idea by fitting a cubic or fourth-degree polynomial, which would give better and better approximations of the slice of f(x) in that direction; by first fitting a curve and then minimizing the fitted curve, we fix alpha_k. Together, these ideas cover the general applicability of gradient-based algorithms to both quadratic and non-quadratic functions. In either case the convergence is asymptotic. Often we may not have the resources to wait until convergence; more often than not we want to cut off the iterations early, not arbitrarily, but by measuring how far we are from the minimum and allowing a tolerance epsilon = 10^-d, where d could be 6, 10, or 15, depending on the application.
So the best we can hope for with the gradient algorithm is asymptotic convergence. That motivates the idea of examining, at least theoretically, methods that can guarantee finite-time convergence, and that gives rise to the notion of what are called conjugate direction methods; a specific member of this class is the conjugate gradient method. So our next topic is to explore the power of conjugacy to produce, theoretically, a basis for algorithms that can guarantee finite-time convergence for quadratic functions. Please understand that quadratic functions are essentially model problems. In every area we have a notion of a model problem; for example, in a first course on differential equations, the harmonic oscillator is the model problem: understanding the basic principles of its design and analyzing its properties. Likewise, in optimization theory the model problem is always the minimization of a convex function given by a positive definite quadratic form. If you can guarantee the performance of these algorithms on the model problems, then you have something to hold on to; that is the basis for concentrating on quadratic functions. So we are going to look at the derivation and basic principles of conjugate direction methods and of the conjugate gradient method as our next topic. Please recall from our earlier discussion of finite-dimensional vector spaces: given a matrix A which is SPD, consider a set of directions p_0 to p_{n-1}, each a vector in R^n.
So I have n vectors; please note that instead of labeling them 1 through n, we label them 0 through n-1. Nothing is lost, and it is one of the standard conventions in the literature. We say a given set of n directions is A-conjugate if p_i^T A p_j = 0 for i != j, and p_i^T A p_i != 0. That is the fundamental definition of conjugacy. If A is the identity matrix, conjugacy reduces to orthogonality; so A-conjugacy is a generalization of the notion of orthogonality, a very simple but very clever extension. We are going to state our first result, which is as follows. If a set of vectors p_0, p_1, ..., p_{n-1} is linearly independent, we know they form a basis. Linear independence of vectors is a fundamental property, and what is important here is that if you have a set of vectors known to be A-conjugate, they are immediately linearly independent as well: A-conjugacy implies linear independence. A set of n linearly independent vectors can be taken as a basis for an n-dimensional space. Therefore, by virtue of this property, if I have a set of n A-conjugate directions, those being linearly independent, one can build the analysis using these A-conjugate vectors as the basis for the n-dimensional space in which we perform the computations. We will soon see that doing arithmetic and analysis in this conjugate basis simplifies the overall analysis. The trick is that analyzing the problem not in the original basis but in the conjugate basis brings out the underlying structure of the problem and further simplifies the development of minimization algorithms; that is the basic thought process. I am now going to verify some of the claims we made.
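The definitions above can be checked numerically. The sketch below is not from the lecture slides; the 3-by-3 SPD matrix is an arbitrary illustrative choice, and the candidate directions are taken to be the eigenvectors of A (the lecture shows later that these are always A-conjugate). It verifies p_i^T A p_j = 0 for i != j, p_i^T A p_i != 0, and that the directions are linearly independent.

```python
import numpy as np

# An arbitrary illustrative SPD matrix (strictly diagonally dominant,
# positive diagonal, hence symmetric positive definite).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

_, V = np.linalg.eigh(A)   # columns of V: eigenvectors of the symmetric A
P = V                      # trial set of conjugate directions p_0, p_1, p_2

G = P.T @ A @ P            # "Gram matrix" in the A-inner product
print(np.allclose(G - np.diag(np.diag(G)), 0.0))  # True: off-diagonals vanish
print(np.all(np.diag(G) > 0))                     # True: p_i^T A p_i > 0
print(np.linalg.matrix_rank(P))                   # 3: linearly independent
```

The diagonal Gram matrix G is exactly the statement of A-conjugacy, and full rank confirms that conjugacy has indeed delivered linear independence.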
So, conjugate vectors as a basis for R^n: because they are linearly independent, we should be able to use them as a basis. Let x_0 be a fixed vector in R^n; then for any x, x - x_0 is an arbitrary vector. What is the idea here? I pick x_0 and consider it as the new origin, and consider any vector x with respect to x_0; so x - x_0 is the vector relative to the origin at x_0 (if x_0 = 0, x is the vector with respect to the origin itself). So for any vector x, let x - x_0 be expressed as a linear combination of p_0, p_1, ..., p_{n-1}. I have not proved that A-conjugacy implies linear independence; that is a homework problem for you, and I am going to build on that result. If they are linearly independent, any arbitrary vector can be expressed as a linear combination of the A-conjugate vectors, and that is what this statement is all about: x - x_0 = sum_j alpha_j p_j. By A-conjugacy, I can multiply both sides by A and then by p_k^T: p_k^T A (x - x_0) = sum_j alpha_j p_k^T A p_j. But p_k and p_j are bound by A-conjugacy, so p_k^T A p_j = 0 if k != j (j is the summation variable, k is fixed). Therefore the left-hand side equals alpha_k p_k^T A p_k, and so alpha_k = p_k^T A (x - x_0) / (p_k^T A p_k). This applies for any k, so this gives you the values of alpha_k. Let us see what we have accomplished: we are given p_0, p_1, ..., p_{n-1}, we are given x_0, and x - x_0 is given. I am trying to express x - x_0, a known vector, as a linear combination of the known p's; the only unknowns are the alphas.
So the whole question is this: if the p's are linearly independent, I should be able to express any arbitrary vector as a linear combination of them, and the problem reduces to finding the alphas. The value of alpha_k needed here is given by the ratio above; every quantity on the right-hand side is known, so alpha_k can be computed in principle. What does this tell you? Given any arbitrary vector x (with x - x_0 being another arbitrary vector), I can find the coefficients needed to express x - x_0 as a linear combination of the conjugate basis. That is the take-away message: any arbitrary vector can be expressed uniquely as a linear combination of the conjugate vectors. With that in mind, I am now going to talk about the solution of Ax = b using conjugate vectors. You may wonder: we started with minimization, and now I am considering Ax = b. I would like you to recall the following fact. If f(x) = (1/2) x^T A x - b^T x + c, the gradient of f(x) is Ax - b, and setting the gradient to 0 gives Ax = b. Therefore you can readily see that if f(x) is such a quadratic function, at the minimum we must have Ax = b; conversely, Ax = b says precisely that the gradient of the quadratic function vanishes. Therefore, minimizing a quadratic function and solving a linear system are equivalent problems, and we can pose the conjugate gradient method either as solving a linear system or as minimizing a quadratic form. So let us assume we have been given a linear system Ax = b; let x* be the solution, that is, x* = A^{-1} b, with A symmetric positive definite.
Let p_0 to p_{n-1} be the A-conjugate directions; we have already discussed the existence of A-conjugate directions. Let x_0 be the initial guess in my process of discovering x*, the solution I am seeking, which also minimizes f(x). Then A(x* - x_0) = A x* - A x_0 = b - A x_0 = r_0, the initial residual at x_0, since A x* = b. I can then express x* as x_0 plus a linear combination of all the conjugate direction vectors, and from the previous analysis we now know the alpha_k that make this expression true: alpha_k = p_k^T A (x* - x_0) / (p_k^T A p_k) = p_k^T r_0 / (p_k^T A p_k). What does this tell you? The minimum x*, which is also the solution of Ax = b, can be expressed as x_0 plus a linear combination of the conjugate vectors. This is the important result; x_0 could in principle be 0, or it could be any vector. This ability to express the minimum as a linear combination of the conjugate directions is a very powerful principle that comes out of the analysis we have already presented; it follows from the linear independence of the conjugate directions. So: A-conjugacy, then linear independence, then the consequences thereof; that is the path. With that property at the back of our mind, I am now going to pursue the notion of quadratic minimization. Let A be an n-by-n matrix; you can see I am still considering the model problem, with A SPD, f(x) = (1/2) x^T A x - b^T x + c, and the minimizer given by Ax = b. Let r(x) be the residual, which is the negative of the gradient.
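The expansion x* = x_0 + sum_k alpha_k p_k with alpha_k = p_k^T r_0 / (p_k^T A p_k) can be tried out directly. In this sketch the matrix, right-hand side, and the eigenvector-based conjugate directions are all illustrative assumptions, not taken from the lecture.

```python
import numpy as np

# Illustrative SPD system A x = b.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

_, P = np.linalg.eigh(A)        # columns: one known A-conjugate set

x0 = np.zeros(3)                # any initial guess works
r0 = b - A @ x0                 # initial residual r_0 = b - A x_0

# alpha_k = p_k^T r_0 / (p_k^T A p_k), computed for all k at once.
d = np.einsum('ij,ij->j', P, A @ P)   # d_k = p_k^T A p_k
alphas = (P.T @ r0) / d

x = x0 + P @ alphas             # x_0 plus the conjugate-direction expansion
print(np.allclose(A @ x, b))    # True: x reproduces the solution of A x = b
```

No matrix inversion was performed: the solution was assembled purely from inner products against the conjugate directions, which is the principle the rest of the lecture builds on.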
So I would like to minimize the sum of squares of the residual, (1/2) r^T r with r = b - Ax; when expanded this is a quadratic in x, and if I compute the gradient and set it to zero, I again arrive at Ax = b when A is SPD. This essentially tells you that quadratic minimization problems and the solution of linear systems are essentially one and the same; the minimization of the sum of the squares of the residual is what we are concerned with, and this is its relation to the given quadratic function. We are now going to look at a representation in the new basis constituted by the conjugate directions. Let P be the n-by-n matrix built out of the conjugate direction vectors: the p's are the columns of P, so P^T has the p's as its rows. If I now form P^T A P and multiply out, I get a matrix whose first row is p_0^T A p_0, p_0^T A p_1, ..., p_0^T A p_{n-1}; whose second row is p_1^T A p_0, p_1^T A p_1, and so on; and whose last row is p_{n-1}^T A p_0, p_{n-1}^T A p_1, ..., p_{n-1}^T A p_{n-1}. In view of the conjugacy property, all the off-diagonal elements are 0, while the elements along the diagonal are nonzero. I am now going to call d_i = p_i^T A p_i; therefore P^T A P simply becomes a diagonal matrix with the d_i as the diagonal elements.
From the previous slide we know x = x_0 plus a linear combination of the p_i, the coefficients being the alphas. Let alpha be a vector in R^n, alpha = (alpha_0, alpha_1, ..., alpha_{n-1}). Then P alpha = sum_{j=0}^{n-1} alpha_j p_j; P alpha is a very succinct way of representing that sum. So, from the previous slide, any x can be expressed as x_0 + P alpha. What do I know? x_0 is known and P is known, so if you give me an x there is a corresponding alpha: I am transforming x to alpha, the vector x is being transformed to the vector alpha, and that is the linear transformation we are talking about. If I consider a point in space, x gives its coordinates in the ordinary basis and alpha gives its coordinates in the conjugate basis; this is a simple coordinate transformation from the ordinary basis to the conjugate basis, x transforming to alpha. So instead of working the problem in the x-space, we can now work the problem in the alpha-space. I also want to remind you that the alpha here is not the same alpha we talked about in the gradient methods: there, alpha was a scalar step-length parameter; here, alpha is a vector, the transformed vector that represents points with respect to the new basis. Even though we use alpha in different places, it is imperative that we understand the distinction between these uses. So what is the basic idea? I have been given f(x) = (1/2) x^T A x - b^T x + c, and we already know any x can be represented by x_0 + P alpha.
So f(x) can be replaced by f(x_0 + P alpha); since x_0 is known and P is known, this is simply a function of alpha, which I am going to call g(alpha). So g(alpha) is a new name for the same function: f(x) represents the function in the old basis, g(alpha) represents the same function in the conjugate basis, and I am now going to work the problem in the conjugate basis for the sake of this analysis. When I substitute x = x_0 + P alpha into the expression for f(x) and simplify using a sequence of matrix-vector operations, it becomes f(x_0) + sum_{k=0}^{n-1} g_k(alpha_k), where each g_k(alpha_k) is a quadratic function of alpha_k alone. That means I have decomposed f(x). f(x) has coupling because the off-diagonal elements of the symmetric positive definite matrix A couple the various variables; the importance of representing the function in the conjugate basis is that it is decoupled. That is the decomposition we are talking about: g(alpha) is simply the sum of the g_k(alpha_k), where each g_k(alpha_k) depends only on alpha_k. There are no product terms among the alphas; each g_k is a quadratic in alpha_k only. This is the decoupling, or decomposition, that we have achieved in going from the ordinary basis to the conjugate basis; nothing has changed except the representation.
Therefore, minimizing f(x) in the x-space, the original coordinate system, is equivalent to minimizing f(x_0 + P alpha), that is, minimizing g(alpha) with respect to alpha. But g(alpha), from the previous page, is f(x_0) plus a sum of terms, each of which depends only on an individual alpha_k. Therefore the minimum over the vector alpha can be replaced by a sum of minima over each alpha_k separately, and since f(x_0) is a constant it does not change the analysis. So what have we accomplished? Minimization in the n-dimensional space has been reduced to the minimization of n one-dimensional functions g_k(alpha_k); please remember each g_k depends only on alpha_k. This transition is very crucial, very critical, and it depends on our ability to decompose: g_0 depends only on alpha_0, g_1 depends only on alpha_1, and in general g_k depends only on alpha_k. Therefore we have turned the problem into n independent minimizations of the g_k(alpha_k), each a one-dimensional problem. In other words, the minimization of one n-dimensional problem is converted into the minimization of n one-dimensional problems; that is divide and conquer. A hard problem is decomposed into n small one-dimensional subproblems; that is the fundamental achievement of going from the original basis to the conjugate basis. I hope you are able to recognize the power of conjugacy in transforming an n-dimensional minimization into n one-dimensional minimizations. Now let us concentrate on the one-dimensional problem: g_k(alpha_k) is a quadratic function in which d_k is known, r_0 is known, and p_k is known, so it is simply a function of alpha_k.
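The decoupling can be verified numerically. Substituting x = x_0 + P alpha into f(x) = (1/2) x^T A x - b^T x gives g_k(alpha_k) = (1/2) d_k alpha_k^2 - (p_k^T r_0) alpha_k, with d_k = p_k^T A p_k; this explicit form is my own working-out of the "given by this function" on the slide, so treat it as a derived assumption. The matrix, x_0, and the test vector alpha below are illustrative.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
f = lambda x: 0.5 * x @ A @ x - b @ x   # the quadratic model problem (c = 0)

_, P = np.linalg.eigh(A)                # columns: A-conjugate directions
x0 = np.array([1.0, -1.0, 0.5])
r0 = b - A @ x0
d = np.einsum('ij,ij->j', P, A @ P)     # d_k = p_k^T A p_k

a = np.array([0.3, -0.7, 1.2])          # an arbitrary coordinate vector alpha
g = 0.5 * d * a**2 - (P.T @ r0) * a     # the n decoupled 1-D quadratics g_k

# f in the conjugate basis is f(x0) plus a SUM of 1-D pieces: no cross terms.
print(np.isclose(f(x0 + P @ a), f(x0) + g.sum()))   # True
```

Since the identity holds for every alpha, minimizing over the vector alpha really does split into n independent scalar minimizations, one per g_k.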
If I compute the derivative of g_k with respect to alpha_k and equate it to 0, I get alpha_k = p_k^T r_0 / d_k, a very well-known formula, and you can readily see it is closely related to the formula we derived in the earlier slides. Therefore we have minimized each of these functions separately with respect to alpha_k, the minimizing alpha_k being given by this formula; we have achieved the minimization of n one-dimensional functions, each with its minimizer in closed form. This provides us a framework for what is called the conjugate direction method; it is an idea. Let me summarize it now. Let f(x) = (1/2) x^T A x - b^T x, and let r_0 = b - A x_0. Let us assume I am given a set of n A-conjugate directions; the whole analysis depends on the existence of the n conjugate directions, pre-specified and given to us. If somebody gives me a set of n A-conjugate directions, where A is the symmetric positive definite matrix of the quadratic form, then from the above analysis, what can we do? I can minimize each of the g_k(alpha_k), and that is exactly what is being done here, for k running from 0 to n-1. Step 1: find alpha_k by the formula given on the previous page. Step 2: compute x_{k+1} = x_k + alpha_k p_k; that means I am moving along the conjugate direction. I would like to remind you that this is distinct from what we did in the gradient method, where x_{k+1} = x_k + alpha_k r_k: there we moved in the direction of r_k, the negative of the gradient, whereas here I am moving in the A-conjugate direction.
That is the primary difference between the two ideas: p_k is a conjugate direction. The residual can also be updated, according to step 3. I then test whether the residual is 0; if it is, I exit with x* = x_{k+1}; otherwise I continue. Another fundamental difference between this algorithm and the gradient algorithm is that the gradient algorithm had a loop over k = 0, 1, 2, 3, ... to infinity, an infinite loop with an exit condition; here we have a finite loop from 0 to n-1. That essentially tells you we have finite-time convergence, and the finite-time convergence is essentially implied by the decomposition we produced earlier: I can solve one n-dimensional problem as n one-dimensional problems, and if I work through these n subproblems in sequence, I am done. The notion of finite-time convergence is inherent in this analysis. The conjugate direction framework essentially summarizes this idea, conditioned on the fact that I have been given a set of n A-conjugate directions; I still allow the possibility of exiting early if the residual vanishes. This provides a general framework for finite-time convergence. The idea was proposed by Hestenes in the early 1950s; it is one of the landmark results in the theory of minimization, and ever since, conjugate-direction-based ideas have been exploited. I would like to emphasize that this is not yet the conjugate gradient method; it is simply the conjugate direction framework, an idea. If you do not have an idea, you cannot develop an algorithm.
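The finite loop just described can be sketched in a few lines. This is a minimal illustration, not the lecture's slide code; as before, the pre-specified A-conjugate directions are supplied by the eigenvectors of an illustrative SPD matrix.

```python
import numpy as np

def conjugate_direction_method(A, b, x0, P):
    """Conjugate direction framework: P's columns are n pre-specified
    A-conjugate directions. Finite loop, k = 0 .. n-1."""
    x, r = x0.copy(), b - A @ x0
    for k in range(A.shape[0]):
        p = P[:, k]
        Ap = A @ p
        alpha = (p @ r) / (p @ Ap)   # step 1: step length along p_k
        x = x + alpha * p            # step 2: x_{k+1} = x_k + alpha_k p_k
        r = r - alpha * Ap           # step 3: r_{k+1} = r_k - alpha_k A p_k
        if np.linalg.norm(r) == 0.0: # early exit if the residual vanishes
            break
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
_, P = np.linalg.eigh(A)             # one known A-conjugate set

x = conjugate_direction_method(A, b, np.zeros(3), P)
print(np.allclose(A @ x, b))         # True: converged in at most n = 3 steps
```

Note the contrast with the gradient method: the loop bound is n, not infinity, and the search directions are the given p_k rather than the residuals.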
So what is the essence of this idea? If I have a quadratic function, then by choosing a set of n A-conjugate directions I can convert one n-dimensional minimization problem into a set of n one-dimensional minimization problems, and I can solve these n one-dimensional problems in sequence; in no more than n steps I should be able to achieve the overall minimum of the original function f that I am seeking. That is the message of this analysis, the analysis of what is called the conjugate direction framework. We would like to do some checking to further reinforce the power of the conjugate direction methodology. Let us assume we are given x_k and p_k: x_k is the given operating point and p_k is the conjugate direction. Even though we do the analysis in the conjugate basis, the computations are done in the original basis; I want you to remember this. We do the analysis in the conjugate basis but compute in the original basis, so we need to be able to go between these two representations, and we would like to reinforce some of the properties of conjugate directions by examining specific relations. So let x_k be given and let p_k be the conjugate direction along which I am going to move from x_k to x_{k+1}. The one-dimensional minimization problem in this case becomes g(alpha) = f(x_k + alpha p_k); if you substitute x_k + alpha p_k into the function, the expression reduces to f(x_k) plus a quadratic in alpha, and since x_k is given, f(x_k) is a known constant. You can readily see that minimizing this quadratic in alpha gives alpha_k = p_k^T r_k / (p_k^T A p_k), which is the formula we obtained in our earlier analysis.
This is a further corroboration and verification of the properties, looking at the conjugate direction method as one that starts at x_0 and minimizes the function along the one-dimensional direction p_k. That is one aspect of the verification. We are doing everything analogously to the gradient method; the only difference is that in the gradient method we moved along the negative of the gradient, whereas here we move along the conjugate direction p_k. Now, to verify the expression in step 3 of the conjugate direction method, let us do a quick review from step 2. If you iterate from x_0, then x_{k+1} = x_k + alpha_k p_k, and r_{k+1} = b - A x_{k+1}; substituting for x_{k+1}, this becomes b - A(x_k + alpha_k p_k) = r_k - alpha_k A p_k. That is exactly the expression in step 3. Next, the relation between r_{k+1} and p_k: using the alpha_k from step 1, I would like you to verify that p_k^T r_{k+1} = 0; these are all important properties one should verify. That means r_{k+1} and p_k are orthogonal to each other. Please remember that in the gradient method r_{k+1} and r_k were orthogonal; here r_{k+1}, the negative of the gradient of the function at x_{k+1}, is orthogonal to p_k, which implies that x_{k+1} minimizes f(x) along the line x_k + alpha p_k.
From here you can readily verify the following sequence of relations: p_k^T r_{k+1} = p_k^T r_{k+2} = ... = p_k^T r_n = 0, whereas p_k^T r_k, p_k^T r_{k-1}, ..., p_k^T r_0 are in general nonzero. The first chain is an orthogonality property. These two facts essentially follow from the analysis we have given; they are important properties that one should examine and understand, because they reveal the intrinsic relation that exists between the conjugate directions and the residual directions, which are the negatives of the gradients. Another interesting aspect of the conjugate direction method is what is called the expanding subspace property. x_{k+1} can be expressed as x_0 plus the accumulated steps, so r_{k+1} = r_0 - sum_{j=0}^{k} alpha_j A p_j, which we have already seen; taking the inner product of both sides with p_j, you can readily verify that p_j^T r_{k+1} = 0 for every j = 0, ..., k. What does this mean? Previously we saw that r_{k+1} is orthogonal to p_k; here we see the stronger statement that r_{k+1} is perpendicular to the whole set p_0, ..., p_k, and that is what is called the expanding subspace property. It tells you that x_{k+1} minimizes f over x_0 plus the span of p_0, ..., p_k. So the basic idea is that x_{k+1} minimizes over a subspace, and the subspace is expanding: x_1 minimizes over the span of p_0, x_2 over the span of p_0 and p_1, x_3 over the span of p_0, p_1, p_2, and in general x_{k+1} over the span of p_0, p_1, ..., p_k.
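The expanding subspace property is easy to observe numerically: after every step of the conjugate direction method, the current residual should be orthogonal to all previous directions, not just the most recent one. The setup below reuses the illustrative matrix and eigenvector directions from the earlier sketches.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
_, P = np.linalg.eigh(A)        # A-conjugate directions p_0, p_1, p_2

x = np.zeros(3)
r = b - A @ x
ok = True
for k in range(3):
    p = P[:, k]
    alpha = (p @ r) / (p @ (A @ p))
    x = x + alpha * p
    r = r - alpha * (A @ p)
    # r_{k+1} should be orthogonal to the span of p_0 .. p_k.
    ok = ok and np.allclose(P[:, :k + 1].T @ r, 0.0)

print(ok)                        # True: the orthogonality keeps expanding
```

Each pass through the loop enlarges the set of directions the residual is orthogonal to, which is exactly why the final iterate minimizes over all of R^n.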
In this way, when I consider all the vectors p_0 through p_{n-1}: if x belongs to x_0 plus the span of all of them, then the final iterate minimizes f(x) over that entire set. That is the fundamental relation that comes out of this expanding subspace property. So, in addition to minimizing along the line x_k + alpha p_k, each iterate x_{k+1} also minimizes over the subspace. Therefore x_n minimizes f(x) over all of R^n. So you can see the summary: each iterate not only minimizes along the direction chosen, it also minimizes over the subspace spanned by all the previous conjugate directions. When I come to x_n, it minimizes f(x) over x_0 plus the span of p_0, p_1, ..., p_{n-1}; since p_0, p_1, ..., p_{n-1} span the whole space, being a basis, I have minimized over the entire R^n. That is the fundamental idea of the expanding subspace property, and it essentially gives you the notion of finite-time convergence, which we now state explicitly: given f(x) = (1/2) x^T A x - b^T x and a set of n A-conjugate directions, the conjugate direction framework guarantees convergence in at most n steps; that means finite-time convergence.
But what is the tacit assumption we are making? The implicit assumption is that the computations are error-free; that is, I have a hypothetical machine with infinite precision. If I have a computer with infinite precision, there is no computational error and I can check A-conjugacy perfectly. If I have the ability to enforce A-conjugacy exactly, this framework provides finite-time convergence of the minimization algorithm, provided we do the searches along the conjugate directions; that is the principal conclusion. The remaining question is this: we have assumed the presence of n conjugate directions, but we have not shown that such conjugate directions exist. To prove existence, I am now going to look at the eigendecomposition of A and show that the eigenvectors of A can in principle be used as conjugate directions. If I can show that, we know at least one set of conjugate directions exists, and if one set exists, the conjugate direction framework as we have developed it makes sense. So, to show that such conjugate directions exist, I start with the given matrix A, which is SPD, and consider the eigenrelations A v_i = lambda_i v_i. I collect the eigenvectors into a matrix V and the eigenvalues into a diagonal matrix Lambda; the relations can then be expressed in matrix form as A V = V Lambda, with V^T V = I. From this we have either V^T A V = Lambda or A = V Lambda V^T.
So what does V^T A V = Lambda tell you? Writing it out, with the rows v_1^T, v_2^T, ..., v_n^T multiplying A and the columns v_1, v_2, ..., v_n, the product is the diagonal matrix with lambda_1, lambda_2, ..., lambda_n on the diagonal. This tells you v_i^T A v_j = 0 if i != j and v_i^T A v_i = lambda_i != 0 if i = j; that is, the v_i are A-conjugate. This proves that at least one system of A-conjugate directions exists for a given matrix: the A-conjugate directions are essentially the eigenvectors of A. However, even though this proves existence, it is computationally extremely demanding to find the complete eigensystem just to use it as a set of conjugate directions: you would have to spend a great deal of computation finding the eigenvectors, and then still perform the n one-dimensional minimizations as dictated. Therefore this idea of using the eigenvectors of A as conjugate directions, while feasible in principle, is computationally demanding, and we should look for an alternative, much less expensive way of generating conjugate directions. The idea of incorporating the construction of the conjugate directions into the search itself, as we go along, is the principle embodied in the conjugate gradient method. So what is the difference between the conjugate direction method and the conjugate gradient method? The conjugate direction method is not an algorithm; it is a framework. It tells you: if you give me a set of n A-conjugate directions, I can do the analysis and prove convergence in n steps.
So, that is simply the framework; it does not consider how one delivers the n A-conjugate directions. To make this framework a reality, we must integrate the process of defining the conjugate directions with the one dimensional search, combined in a way that guarantees not only conjugacy but also finite time convergence. These two ideas melded together give rise to a new class of algorithms called conjugate gradient algorithms. That is the difference between conjugate direction and conjugate gradient. Now, having seen the basic principle, let me describe the steps of the conjugate gradient algorithm. Given the function f(x), let x_0 be the initial point and r_0 the initial residual. I choose the initial conjugate direction to be the same as the initial residual: I need a set of n conjugate directions, and while the first direction could be anything, here we pick p_0 = r_0, the residual (the negative gradient) at x_0. Then, for k running from 0 to n - 1, I compute alpha_k by the formula alpha_k = r_k^T r_k / p_k^T A p_k, or by the alternate formula alpha_k = r_k^T p_k / p_k^T A p_k. These two formulas are equivalent; we are not going to indulge in the proof of the equivalence, but the many books and papers written on the conjugate gradient method show that alpha_k can be computed in these two different ways and both are equivalent. Next I update the iterate, then update the residual, and then test for convergence; the test rests on the smallness of the magnitude of the residual. If the residual is not small, we continue, and we must first define the next conjugate direction — the new residual is not directly the new conjugate direction.
So, steps 5 and 6 together define the conjugate direction: p_{k+1} is the new conjugate direction, p_k is the old conjugate direction, and r_{k+1} is the new residual I have already computed. Using the new residual and the old conjugate direction, I take a linear combination to get the new conjugate direction; the coefficient of the linear combination is beta_k, and beta_k is again given in two equivalent ways, one by the formula beta_k = r_{k+1}^T r_{k+1} / r_k^T r_k and another by the formula beta_k = -r_{k+1}^T A p_k / p_k^T A p_k. So steps 5 and 6 define the conjugate direction; steps 2 and 3 define the update of the iterate and of the residual; step 1 gives you the step length parameter alpha_k; step 5 gives you the coefficient beta_k needed to define the conjugate direction; step 4 is the convergence test. The loop is repeated no more than n times, for k from 0 to n - 1. So, on a computer with no round off errors, this gives you a framework to minimize in n steps. The advantage of this framework is that you need not be given a set of conjugate directions a priori: the conjugate directions are built iteratively as we proceed. This ability to integrate the search and the building of the conjugate directions simultaneously is the power of the idea behind the conjugate gradient algorithm, which has been a very powerful workhorse in industry. So, we understand the properties of steps 1 through 4.
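The six steps just described can be put together as follows. This is a minimal sketch in Python; the function name is mine, and of the equivalent formulas I use alpha_k = r_k^T r_k / p_k^T A p_k and beta_k = r_{k+1}^T r_{k+1} / r_k^T r_k.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-12):
    """Minimize f(x) = 0.5 x^T A x - b^T x for SPD A; in exact
    arithmetic this converges in at most n steps."""
    x = np.asarray(x0, dtype=float).copy()
    r = b - A @ x                        # initial residual r0 (negative gradient)
    p = r.copy()                         # first conjugate direction p0 = r0
    for k in range(len(b)):              # k = 0, 1, ..., n-1
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)       # step 1: step length alpha_k
        x = x + alpha * p                # step 2: update the iterate
        r_new = r - alpha * Ap           # step 3: update the residual
        if np.linalg.norm(r_new) < tol:  # step 4: convergence test
            break
        beta = (r_new @ r_new) / (r @ r) # step 5: coefficient beta_k
        p = r_new + beta * p             # step 6: new conjugate direction
        r = r_new
    return x

# Usage: on a small SPD system the result matches the direct solve.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, np.zeros(2))
assert np.allclose(A @ x, b)
```

Note that the loop never runs more than n times, reflecting the finite time convergence guarantee under exact arithmetic.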
So, the only thing one still needs to understand is that the p_k's defined by steps 5 and 6 are indeed A-conjugate. Here is a summary of the properties of the conjugate gradient (CG) algorithm. First, the conjugate directions are computed internally (the slide has a typo, "internationally"; it should read "internally") in steps 5 and 6; they are not given a priori. Second, alternate choices of alpha_k and beta_k are given in the respective steps, and they are equivalent. Third, the p_k's are A-conjugate; I am not going to indulge in the proof of that, which would take a little longer, but the proof that the p_k's so generated are A-conjugate is contained in several sources. Fourth, r_{k+1} is perpendicular to r_k, just as happens in the gradient algorithm, so the residuals preserve the same property; moreover r_{k+1} is perpendicular to the span of the earlier directions (I think the index on the slide must be k; I have to carefully check some of these things, and I will do that). Another property is that span{p_0, ..., p_{n-1}} is the same as span{r_0, ..., r_{n-1}}, which in turn equals span{r_0, A r_0, A^2 r_0, ..., A^{n-1} r_0} (the exponent on the slide should read n - 1). The space generated like this is called the Krylov subspace generated by A and r_0: given a vector r_0 and a matrix A, by successively multiplying r_0 by A, A^2, ..., A^{n-1}, I create different vectors, and the span of these vectors is the Krylov subspace. You can readily see that the same space has different representations — the span of the p's, the span of the r's, and the span of the vectors r_0, A r_0, A^2 r_0, and so on. It is this property of equivalent representations of the same space using different bases that is the ultimate power of the conjugate gradient algorithm.
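Two of the listed properties — mutual orthogonality of the residuals, and the Krylov subspace characterization of the search directions — can be spot-checked numerically. The sketch below is my own, on an arbitrary small SPD system; it runs the CG recurrences and tests both properties.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)            # arbitrary SPD test matrix
b = rng.standard_normal(5)

x = np.zeros(5)
r = b - A @ x
p = r.copy()
residuals, directions = [r.copy()], [p.copy()]
for k in range(4):                        # run the CG recurrences
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
    residuals.append(r.copy())
    directions.append(p.copy())

# Property: residuals are mutually orthogonal, r_i^T r_j = 0 for i != j.
R = np.array(residuals)
G = R @ R.T
assert np.allclose(G - np.diag(np.diag(G)), 0.0)

# Property: p_k lies in the Krylov subspace span{r0, A r0, ..., A^k r0}.
K = [np.linalg.matrix_power(A, j) @ residuals[0] for j in range(5)]
for k, pk in enumerate(directions):
    basis = np.column_stack(K[: k + 1])
    coef, *_ = np.linalg.lstsq(basis, pk, rcond=None)
    assert np.allclose(basis @ coef, pk)
print("residual orthogonality and Krylov property verified")
```

With finite precision these identities hold only to round off, which is exactly the limitation taken up next.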
So now consider the conjugate gradient algorithm with finite precision arithmetic. Until now we talked about conjugate gradient with infinite precision. When used on real computers, because of finite rather than infinite precision arithmetic, we cannot check A-conjugacy perfectly. What we think of as A-conjugate directions are not precisely A-conjugate; they are only approximately A-conjugate. Therefore, if x* is the optimum solution, it can be shown that the error (the notation on the slide should be E(x_k)) is given by the quadratic function E(x_k) = (1/2)(x_k - x*)^T A (x_k - x*), much like the E(x_k) we used in the gradient method. So what happens when there are round off errors? You start at x_0, you perform n steps, and you arrive at a point which, because of finite precision arithmetic, will not be the minimum — I should not call it x*, so let me change the notation a little: call it x bar. This x bar will not be equal to x*, the minimum, but it will be close to it. Then what do we do? You start from x bar, do n more steps, and go to x double bar; you start from x double bar, do n steps, and get closer still. So it becomes an iterative process. It turns out that if you consider it as an iterative process, one can bound the ratio E(x_k)/E(x_0) by 2 times a function that depends on kappa, much like the gradient algorithm, where kappa = lambda_1/lambda_n is the spectral condition number. So you can readily see that with finite precision arithmetic you cannot achieve finite time convergence; it looks as though it is an infinite process, and in this case the convergence rate is given by that expression.
So now I can do what I did for the gradient algorithm: set this upper bound equal to epsilon = 10 to the power of minus d, take the logarithm, and compute an expression for k*. By simplifying, I can readily see that k* is given approximately by (d + 1) times the square root of kappa(A), divided by 2. So k* is the number of steps in the iterative process needed to get within the order of 10 to the power of minus d of the minimum, where d is the precision — 6, 14, and so on; this follows the same idea as before. Now I would like to compare conjugate gradient against the gradient algorithm. Here, for various values of kappa, are the number of iterations needed by the gradient algorithm and the number needed by the conjugate gradient algorithm. With finite precision arithmetic, conjugate gradient beats the gradient method hands down; it performs extremely well, and that is the power of the conjugate gradient method relative to the gradient algorithm — the difference, as you can see, is very measurable. So let me summarize how one utilizes the conjugate gradient method: if you start at x naught, you go to x bar in n steps; if you start at x bar, you go to x double bar in n steps; if you start at x double bar, you go to x triple bar in n steps. It is said that most of the time, when you are doing an experiment, it is enough to repeat it about 3 times.
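A rough sketch of such a comparison table follows. The CG count uses the lecture's estimate k* of about (d + 1) times sqrt(kappa) over 2; the gradient count is an assumption of mine, following the analogous derivation for the gradient method where the rate depends on kappa rather than sqrt(kappa), so the exact gradient numbers are illustrative only.

```python
import math

def k_star_cg(kappa, d=6):
    # Lecture's estimate for CG: k* ~ (d + 1) * sqrt(kappa) / 2
    return math.ceil((d + 1) * math.sqrt(kappa) / 2)

def k_star_gradient(kappa, d=6):
    # Assumed analogue for the gradient method: kappa replaces sqrt(kappa)
    return math.ceil((d + 1) * kappa / 2)

print(f"{'kappa':>10} {'gradient':>12} {'CG':>8}")
for kappa in (10, 100, 10_000, 1_000_000):
    print(f"{kappa:>10} {k_star_gradient(kappa):>12} {k_star_cg(kappa):>8}")
```

The gap widens as kappa grows, which is the "hands down" advantage described above: the CG count grows like sqrt(kappa) while the gradient count grows like kappa.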
So let me give you another graphical representation. You start at x naught and get to x bar; starting at x bar you go to x double bar, which is closer; x triple bar is closer still; and x star is the minimum. So x double bar is closer to the minimum than x bar, and x triple bar is closer to the minimum than x double bar. It is said that if you apply the conjugate gradient method in 3 phases — 3 passes of n iterations each — in principle you will get very close to the optimum, and that is the power of the conjugate gradient method. With this we conclude the overall presentation of the minimization algorithms: we said there are gradient algorithms, conjugate gradient algorithms, and quasi-Newton algorithms; for lack of time we will not indulge in the analysis of the quasi-Newton algorithms. We have given several sets of exercises relating to the verification of many different properties of the gradient and conjugate gradient algorithms. I would like you to work through the proof of the Kantorovich inequality. I would like you to implement the gradient algorithm and the conjugate gradient algorithm on the same problem and compare their convergence. I would like you to take a test problem with the given A, apply the gradient algorithm, and verify that the iterates proceed in the theoretical way; starting with the initial condition (2, 1) for this problem, verify that f(x_{k+1}) = (1/9) f(x_k). I would like you to draw the contours and superimpose the trajectory so that you can visually demonstrate the convergence. I would now like to combine a couple of problems: consider the 4 by 4 grid with 16 points, with 2 observations given at each location, so distribute 2
observations in each of the grid boxes, giving m = 18 observations. With 18 observations and 16 points, this is an overdetermined system. Build the interpolation matrix H, which is 18 by 16, and create artificially 18 observations of temperature: let z_i = 70 + v_i, where v_i is random noise, for i = 1 to 18. So we have the 18 observations and the matrix H, and we can now consider the quadratic minimization problem f(x) = (z - Hx)^T (z - Hx), where z contains the 18 observations you have generated and H is the matrix you already have; this is a quadratic function of x. To this quadratic function, apply the gradient algorithm and the conjugate gradient algorithm and compare them. In particular, I would like you to plot the value of f(x_k) for the gradient algorithm and for the conjugate gradient algorithm; you will readily see that f(x_k) decreases faster for the conjugate gradient algorithm, because we have already seen from the table that the conjugate gradient algorithm requires a much smaller number of iterations than the gradient algorithm. This will help you verify the power of the conjugate gradient algorithm in solving problems. With this we come to the end of the discussion of the optimization algorithms. We have also now completed the fundamental mathematical background needed — finite dimensional vector spaces, matrix properties, results from multivariate calculus, principles of optimization, matrix based algorithms, and minimization algorithms. These are the topics that address the crux of the mathematical tools needed in doing data assimilation. With this behind us, from now on we will concentrate on solving various types of inverse problems. Our next step is to look
at dynamic inverse problems leading to the standard 4D-Var methods, and that is what we will begin in our next lecture. Thank you.
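As a postscript, the combined grid exercise described above might be set up as follows. This is only a sketch under assumptions the lecture leaves open: the interpolation matrix H is built here as a hypothetical 0/1 matrix assigning each observation to a random grid point, and a tiny ridge term is added to H^T H because such a matrix can be singular.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 18, 16                                        # 18 observations, 16 grid points
H = np.zeros((m, n))
H[np.arange(m), rng.integers(0, n, size=m)] = 1.0    # hypothetical interpolation matrix
z = 70.0 + rng.standard_normal(m)                    # z_i = 70 + v_i, v_i random noise

# f(x) = (z - Hx)^T (z - Hx); its minimizer solves H^T H x = H^T z.
A = H.T @ H + 1e-8 * np.eye(n)                       # small ridge: H^T H may be singular
b = H.T @ z
f = lambda x: float((z - H @ x) @ (z - H @ x))

def run(use_cg, iters=40):
    """Steepest descent (use_cg=False) or conjugate gradient (use_cg=True)."""
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    history = [f(x)]
    for _ in range(iters):
        d = p if use_cg else r                       # search direction
        Ad = A @ d
        alpha = (r @ d) / (d @ Ad)                   # exact line search step
        x = x + alpha * d
        r_new = r - alpha * Ad
        if use_cg:
            p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
        history.append(f(x))
        if np.linalg.norm(r) < 1e-12:                # stop once the residual vanishes
            break
    return history

grad_hist, cg_hist = run(False), run(True)
print("f(x_final): gradient =", grad_hist[-1], " CG =", cg_hist[-1])
```

Plotting `grad_hist` and `cg_hist` against the iteration index reproduces the comparison asked for in the exercise: both curves decrease, with the CG curve dropping faster.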