We have been looking at methods for solving non-linear algebraic equations. So far we have covered gradient-free methods, or successive substitutions, and then gradient-based methods, and now I want to move on to a third category, which is optimization-based methods. Let me first collate the different methods we have seen. For solving non-linear algebraic equations, the first class was successive substitution, with variants such as Jacobi iterations, Gauss-Seidel iterations and so on. The second class is gradient- or slope-based methods. Here we looked at the univariate secant method, a slope-based method that uses two consecutive iterates to find a slope and then uses that slope to find the next step, and then the multivariate secant method, popularly known as Wegstein iterations. Under the same category we moved to Newton's method, sometimes known as the Newton-Raphson method, and its modifications: the damped Newton method, which adjusts the step length, and the quasi-Newton method, which updates the Jacobian matrix using rank-one matrices, that is, Broyden's update. Of course you can combine these two and have a damped quasi-Newton method, and so on; all the combinations are possible. So these are the categories we have looked at for solving non-linear algebraic equations. The third category that I want to look at is optimization based. The first two categories we have looked at in detail; the third one, the optimization-based one, has its parallel in solving linear algebraic equations. Linear algebraic equations we solved using the gradient method and the conjugate gradient method; likewise, here too we can pose solving non-linear algebraic equations as an optimization problem and iteratively solve the optimization problem until we reach the solution. So what are these optimization-based methods? We want to solve f(x) = 0, where x belongs to R^n, an n-dimensional vector. I formulate an objective function phi(x) = (1/2) f(x)^T f(x), which is nothing but (1/2)[f1(x)^2 + f2(x)^2 + ... + fn(x)^2], where the function vector f has components f1 to fn. We want to minimize this with respect to x, so we solve the problem: minimize phi(x) with respect to x. What is the necessary condition for optimality? The gradient is equal to zero: grad phi = (df/dx)^T f(x) = 0. This is the optimality criterion, and you can see that when the Jacobian is non-singular at the optimum, the only way the necessary condition can be satisfied is when f(x) = 0. So when you reach a stationary point where the Jacobian is non-singular, you have reached the optimum, and you have reached the solution of f(x) = 0. That is the idea. Now how do you solve this?
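As a minimal sketch of this formulation (in Python, with a small hypothetical two-equation system chosen only for illustration, not taken from the lecture), the objective phi can be coded directly from the function vector f:

```python
import numpy as np

def f(x):
    # Hypothetical 2-equation system, used only to illustrate the idea
    return np.array([x[0]**2 + x[1] - 3.0,
                     x[0] + x[1]**2 - 5.0])

def phi(x):
    # phi(x) = 1/2 * f(x)^T f(x); phi is zero exactly at a root of f
    fx = f(x)
    return 0.5 * fx @ fx

print(phi(np.array([1.0, 1.0])))   # positive: not a root of this f
print(phi(np.array([1.0, 2.0])))   # zero: [1, 2] happens to solve this f(x) = 0
```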
Well, we do it iteratively using different numerical optimization methods. The simplest one to use would be the gradient method, but as I told you the gradient method has a problem; it is more useful as a basis for deriving more complex methods, and in practice we would probably not use the gradient method but the conjugate gradient method. But just to state the algorithm: in the gradient method we start with an initial guess x0 and then generate x_{k+1} = x_k + lambda_k g_k, where g_k = -grad phi(x) evaluated at x = x_k. Let me develop a notation here: J_k = df/dx evaluated at x = x_k, and f(x_k) = f^k; we have used this same notation earlier. Then g_k = -J_k^T f^k. How do you compute lambda_k? It is a scalar step-length parameter, and it is found by one-dimensional minimization. So this is an iterative numerical procedure: you start from x0, generate a new guess in the direction of the negative of the gradient, and at each step how much to move is decided by minimizing with respect to lambda, that is, lambda_k = arg min over lambda of phi(x_k + lambda g_k). So lambda_k is just a one-dimensional minimization; g_k is the gradient direction which you have computed, x_k is known to you, g_k is known to you, and lambda is the only unknown, found by this minimization.

Of course, the more popular and more computationally efficient way is to use conjugate gradients. In fact, in MATLAB, when you are solving simultaneous non-linear algebraic equations, there is a subroutine called fsolve. I think fsolve is not a Newton-Raphson solver but an optimization-based solver: it actually forms an optimization problem and minimizes by an iterative search. Let me confirm, but I think fsolve is optimization based rather than Newton-Raphson. The reason this matters is that even though we have written the gradient direction in terms of the Jacobian multiplied by f^k, when you numerically compute the gradient of phi(x_k), this numerical computation does not involve explicitly computing the Jacobian. See, phi(x) is (1/2)[f1(x)^2 + ... + fn(x)^2]; to compute its numerical gradient I only need to evaluate the function vector and take its norm, I do not have to explicitly compute the Jacobian. Computing grad phi(x) is analytically equivalent to computing (df/dx)^T f(x), the two are equivalent, but the numerical computation only perturbs the function. How do we compute a numerical Jacobian or numerical gradient? We have a subroutine which we wrote in the programming tutorial: we perturb each element by plus epsilon and minus epsilon and then take the difference.
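To make the iteration concrete, here is a minimal sketch of the gradient method with a central-difference numerical gradient and a crude bounded one-dimensional search for lambda_k. It reuses the phi defined in the sketch above; the search bracket (0, 10) and the use of scipy's minimize_scalar are choices of this sketch, not something prescribed in the lecture:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def num_grad(phi, x, eps=1e-6):
    # Central-difference gradient of the scalar objective phi; no Jacobian needed
    g = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (phi(xp) - phi(xm)) / (2.0 * eps)
    return g

def gradient_method(phi, x0, tol=1e-8, max_iter=1000):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = -num_grad(phi, x)                    # g_k = -grad phi(x_k)
        if np.linalg.norm(g) < tol:
            break
        # one-dimensional minimization for the step length lambda_k
        lam = minimize_scalar(lambda t: phi(x + t * g),
                              bounds=(0.0, 10.0), method='bounded').x
        x = x + lam * g
    return x

# x_sol = gradient_method(phi, np.array([0.5, 0.5]))   # phi from the earlier sketch
```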
So basically, to compute d phi/d x_i, we approximate it as [phi(..., x_i + epsilon, ...) - phi(..., x_i - epsilon, ...)] / (2 epsilon): I perturb only the i-th element, the remaining elements stay the same, and divide by 2 epsilon. Of course, here I have not written out the remaining elements of x; phi is a function of all of them. Doing this function calculation does not anywhere involve explicitly computing the Jacobian; you see the advantage. Analytically grad phi(x) is equivalent to (df/dx)^T f(x), the two are the same thing, but numerically you do not have to go through that route, you do not have to compute the Jacobian. So there are advantages in using an optimization-based search: for a large-scale problem you are not required to compute, say, a 1000 x 1000 Jacobian matrix. Is this clear?

Now, what about the conjugate gradient method? The conjugate gradient method of course also requires the current gradient information, so we need G_k, which is the negative of grad phi with respect to x, evaluated at x = x_k; this is the direction. What we do next is find the conjugate search direction. What was the idea in generating conjugate directions? Given some positive definite matrix A, we said that two search directions S_k and S_{k-1} are A-conjugate when S_k^T A S_{k-1} = 0; any two successive directions should have this property. What is done in the conjugate gradient method here is to find the search directions with A = I, that is, successive directions are made conjugate with respect to the identity matrix, perpendicular to each other. So we construct S_k = beta_k S_{k-1} + G_k, where G_k is the gradient direction we have calculated at the current point and beta_k is computed from the current and previous gradients (for example, the Fletcher-Reeves choice beta_k = (G_k^T G_k)/(G_{k-1}^T G_{k-1})). The remaining part is similar to the gradient method: once you know the search direction, you set x_{k+1} = x_k + lambda_k S_k and find lambda_k by one-dimensional minimization of phi(x_k + lambda S_k) with respect to lambda, so as to find the step length. The beta_k computed here is only for finding the new search direction, which is conjugate with respect to the identity matrix; that is the direction computation, and once you have the direction S_k, the line search decides how much to move in that direction. One interpretation of these conjugate directions is that each one is a linear combination of all the previous gradients, because you kick off the calculation with S_0 = G_0, so the first direction is the same as the gradient direction; next, S_1 = beta_1 S_0 + G_1, a linear combination of G_0 and G_1; and because S_1 is a linear combination of G_0 and G_1, S_2 turns out to be a linear combination of G_0, G_1 and G_2, and so on.
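Here is a minimal sketch of the conjugate gradient iteration just described, using the Fletcher-Reeves choice of beta_k (one standard formula; the exact expression used in the lecture was left to the board). It assumes the num_grad routine and phi from the previous sketches, passed in through the grad and phi arguments:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient_method(phi, grad, x0, tol=1e-8, max_iter=500):
    x = x0.astype(float)
    g = -grad(x)                   # G_0 = -grad phi(x_0)
    s = g.copy()                   # S_0 = G_0: first direction is the gradient direction
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        lam = minimize_scalar(lambda t: phi(x + t * s),
                              bounds=(0.0, 10.0), method='bounded').x
        x = x + lam * s
        g_new = -grad(x)
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves beta_k
        s = beta * s + g_new               # S_k = beta_k * S_{k-1} + G_k
        g = g_new
    return x

# Usage (with phi and num_grad from the earlier sketches):
# x_sol = conjugate_gradient_method(phi, lambda x: num_grad(phi, x), np.array([0.5, 0.5]))
```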
So it is like saying that instead of using only the current gradient, once I get this direction I know that I want to move in a direction which is a linear combination of all the past gradients. Just imagine you are going down a valley. What do you mean by using a gradient step? You are looking at the current slope only. Looking only at the current slope can be too local; you have no information about what has happened in the past. If you somehow make your next move based on information about the past slopes as well, your move will be much better than basing your decision only on the current slope, because by taking the past slopes into consideration you are, in some way, taking into consideration the curvature along the path before making your decision. See how the conjugate gradient method proceeds: the step is x_{k+1} = x_k + lambda_k S_k. If I look at the progression of the directions: S_0 = G_0, the first time it is just the negative of the gradient direction; next, S_1 = beta_1 S_0 + G_1, the negative of the gradient at the new point plus a contribution from the old direction; then S_2 = beta_2 S_1 + G_2, and so on. So as you progress, you are basing your move not just on one gradient but on a history of gradients. That is why the conjugate gradient direction turns out to be more powerful: it moves much faster even when you go close to the optimum. What is the problem with the gradient method? When you go close to the optimum it tends to become slow, whereas here the move is based on the past history of gradients. It is like, when you are going down a hill, trying to make use of the curvature rather than only the local slope; the gradient method will just look at the local slopes, only at the negatives of G_0, G_1, G_2, G_3, G_4, one at a time. That is why the conjugate gradient method is more powerful.

Now there is one more variant. In the gradient method you are only using the local gradient information. If you can generate something more, namely the Hessian, the matrix of second derivatives of the objective function, then your moves are much better, because you are using more information about the local curvature than just the gradient. Hessian computation requires second derivatives to be calculated, and the optimization methods that use the Hessian to generate the search direction are called Newton's optimization methods. So we will briefly look at this Newton's method now. Do not confuse it with the Newton-Raphson method we saw earlier for solving f(x) = 0 directly; the names appear similar, but the method here is an optimization-based method. So under the category of optimization-based methods we again have Newton's method and the quasi-Newton method.
So, the Newton and quasi-Newton methods. What did we start with? We started by saying that to solve f(x) = 0 we form the objective function phi = (1/2) f(x)^T f(x). And what is the necessary condition for optimality? The necessary condition is grad phi = 0: at the stationary point the gradient of this objective function should equal the zero vector. Note that this condition is itself a set of non-linear algebraic equations. What is phi? It is a scalar function: phi is nothing but (1/2) ||f(x)||_2^2, half the square of the 2-norm, and a norm is a scalar, so the objective function is a scalar objective function. And what is grad phi, a vector or a matrix? It is a vector, not a matrix: grad phi = [d phi/d x_1, d phi/d x_2, ..., d phi/d x_n]^T, and this vector I want to set equal to the zero vector. Because phi is a scalar, its gradient with respect to x is a vector. If I could solve this equation exactly, I would get the solution, but my problem is non-linear and I cannot solve it exactly; what I decide to do instead is to solve it using a Newton step, that is, I am going to linearize this equation. (How will I get f(x) = 0 from this? Analytically, grad phi = (df/dx)^T f(x); if the Jacobian matrix is non-singular, then grad phi = 0 can hold only when the vector f(x) itself is zero. So setting the gradient to zero and getting the solution of f(x) = 0 amount to the same thing.) So now I want to come up with a Newton step here. I take grad phi(x) and use our good old Taylor series expansion about x_k: grad phi(x) is approximately grad phi(x_k) plus the second derivative of phi at x_k times (x - x_k). In fact you will realize why this is called Newton's method, because this is exactly what you do in the Newton-Raphson method for solving algebraic equations: you have non-linear algebraic equations, you linearize, and instead of solving the original problem you solve the linearized problem. We call the second-derivative matrix H_k; H_k is the Hessian, and the other term is the gradient, which we have shown earlier is nothing but J_k^T f^k. And now, instead of solving grad phi(x) = 0, I am going to set this linear approximation equal to the zero vector. H_k is a square matrix.
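Written compactly, the linearization about x^k described above is:

\[
\nabla\phi(x) \;\approx\; \nabla\phi(x^k) \;+\; H_k\,(x - x^k),
\qquad H_k \equiv \nabla^2\phi(x^k),
\qquad \nabla\phi(x^k) = J_k^{T} f^{k},
\]

and it is this linear approximation, rather than the original grad phi(x) = 0, that is set equal to the zero vector at each iteration.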
So H_k is an n x n matrix: phi is a scalar objective function, and its second derivative with respect to x is the Hessian matrix; we have looked at the Hessian matrix earlier when we talked about the necessary and sufficient conditions for optimality. The gradient is J_k^T f^k; well, again, I have written it here as the Jacobian transpose times f^k, but you do not have to explicitly compute the Jacobian, writing it this way is more for getting insight. Now I solve this linearized equation: I get delta x_k = -H_k^{-1} grad phi(x_k), just by solving the approximate Taylor series expansion set to zero, and what I get is a Newton-like step. This is my search direction. The way I construct the next iterate is x_{k+1} = x_k + lambda delta x_k, and then you do the same thing as before: a search with respect to lambda, that is, lambda_k = arg min over lambda of phi(x_k + lambda delta x_k). That part remains the same, a one-dimensional minimization, but the step, or rather the direction for moving, is obtained using the second-order derivative. So if you compare the optimization-based methods, the raw gradient method, the conjugate gradient method and Newton's method, Newton's method is of course faster in converging, because you are using second-order derivative information. It is a more powerful method, but the price to pay is computing the Hessian. The Hessian is an n x n matrix; if the number of equations or variables is large, the Hessian matrix is large, and you have the same trade-off as before: fewer iterations but a large amount of computation per iteration, versus more iterations with less computation per iteration. It is a balance: in some cases it is worth computing the Hessian and taking quick steps; in other cases the Hessian computation can be complex, and you might prefer to use just the conjugate gradient method and search for the optimum. Of course, the direction you get should be a descent direction, and that requires the Hessian to be positive definite, and so on; the convergence behaviour will depend upon the nature of the local Hessian H.

So what is the advantage of using the Hessian? You are using second-order information and convergence can be much better. What is the trouble? You have to compute an n x n matrix. In Newton's method for f(x) = 0, how did we get over the analogous problem? We got over it using the Broyden update: we used rank-one updates to update the Jacobian, just like a difference equation, and then used the updated value of the Jacobian rather than actually recomputing it. The same thing is done here: the quasi-Newton methods use a rank-one update of the Hessian inverse, because what you need to compute here is the inverse of the Hessian matrix. So in quasi-Newton methods, let us define the matrix L to be H^{-1}. What is the troublesome step in all this? It is not the gradient calculation; the gradient is just one vector.
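Before moving on to the quasi-Newton update, here is a minimal sketch of the Newton optimization step just described: a finite-difference Hessian of phi, the step delta x_k = -H_k^{-1} grad phi(x_k), and the same one-dimensional search for lambda_k. The finite-difference Hessian and the search bracket are choices of this sketch, not prescriptions from the lecture:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def num_hessian(phi, x, eps=1e-4):
    # Finite-difference Hessian of the scalar objective phi (an n x n matrix)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp, xpm, xmp, xmm = x.copy(), x.copy(), x.copy(), x.copy()
            xpp[i] += eps; xpp[j] += eps
            xpm[i] += eps; xpm[j] -= eps
            xmp[i] -= eps; xmp[j] += eps
            xmm[i] -= eps; xmm[j] -= eps
            H[i, j] = (phi(xpp) - phi(xpm) - phi(xmp) + phi(xmm)) / (4.0 * eps**2)
    return H

def newton_opt(phi, grad, x0, tol=1e-8, max_iter=100):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(num_hessian(phi, x), -g)   # delta x_k = -H_k^{-1} grad phi_k
        lam = minimize_scalar(lambda t: phi(x + t * dx),
                              bounds=(0.0, 2.0), method='bounded').x
        x = x + lam * dx
    return x
```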
The gradient calculation is just one vector calculation by numerical perturbation, not a big issue: there are only n components in the gradient of the scalar function phi, so it takes much less computation than computing the Hessian, which would require a lot more. So to avoid Hessian computations we do the gradient computation by numerical perturbation, but for the Hessian we do an update. Let us define L = H^{-1}; then we have an update of the Hessian inverse. This is called a variable metric method, the Davidon-Fletcher-Powell (DFP) method, in which you update the inverse of the Hessian iteratively. Only once, in the beginning, do you compute the inverse, L_0 = H_0^{-1}; after that you just update it without actually having to compute it explicitly. That is the quasi-Newton idea: you do not have to keep computing the Hessian explicitly. The update has the form L_{k+1} = L_k + M_k - N_k. In quasi-Newton methods this is the philosophy: you define a matrix L = H^{-1}, you compute it only once, and then every time the new inverse is the old inverse plus some correction, and this correction does not involve explicit Hessian computation. What is this correction? The derivation of this particular approach you can find in any book on optimization; I am going to leave it to your curiosity, you can go to the book by S. S. Rao or any other book on optimization and you will find the derivation. I am just giving you the final formulas: you define a vector q_k, which is grad phi evaluated at iteration k+1 minus grad phi evaluated at iteration k. The correction M_k is a rank-one matrix built from delta x_k, the previous move; delta x_k times delta x_k transpose is a rank-one matrix. N_k is the other correction; M_k and N_k are two rank-one corrections. The sequence is like this: you have already computed delta x_k, you have then minimized and found lambda_k, and you have found x_{k+1}. Once you have found x_{k+1}, you are preparing for the next iteration: you go back, find the new gradient at k+1, take the difference between the new gradient and the old gradient, and use that difference, the move delta x_k, the lambda_k you have found and the previous inverse, all of them, to come up with the approximate inverse of the Hessian for the next iteration. That is how the quasi-Newton methods proceed. Well, this is not a full course on optimization; I am just trying to give you the idea that non-linear algebraic equations can be very efficiently solved using optimization-based methods. As I keep saying through our discussions, we are just touching the tip of the iceberg, we are not really getting into the depths.
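For reference, here is a small sketch of one DFP update of the inverse-Hessian approximation L, with the rank-one corrections M_k and N_k written in the standard form found in optimization texts such as Rao's; dx is the move x_{k+1} - x_k and q is the gradient difference q_k defined above:

```python
import numpy as np

def dfp_update(L, dx, q):
    # Davidon-Fletcher-Powell update of the approximate inverse Hessian:
    #   L_{k+1} = L_k + (dx dx^T)/(dx^T q) - (L q q^T L)/(q^T L q)
    # dx = x_{k+1} - x_k (the previous move), q = grad_phi(x_{k+1}) - grad_phi(x_k)
    Lq = L @ q
    M = np.outer(dx, dx) / (dx @ q)     # rank-one correction M_k
    N = np.outer(Lq, Lq) / (q @ Lq)     # rank-one correction N_k
    return L + M - N
```

Inside the iteration, the search direction at each step is -L_k times the current gradient, and L is refreshed with this update after every accepted step.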
Well, the last thing I want to talk about here is a method which is very popular for solving non-linear algebraic equations, called the Levenberg-Marquardt method. The Levenberg-Marquardt method is a combination of the gradient method and the Hessian (Newton) method. What is the nice thing about the gradient method? When you are far away from the optimum it makes very long strides, it tries to move towards the optimum very quickly; but once it comes close to the optimum there is a problem, it becomes very, very slow. The Hessian method, on the other hand, is very fast when you come close to the optimum. So there is merit in having a method which is a mixture of the two, which initially starts like a gradient method and later on behaves like a Hessian method. In modern parlance it is like a multi-agent optimization method: there are two agents, one is the gradient method, the other is the Hessian method, and you mix them in such a way that each one dominates when it is most useful. The modification is small, and I am just going to give the philosophy, not the details. The search direction S_k is found as S_k = -(H_k + theta_k I)^{-1} grad phi(x_k). If you put theta_k = 0, this is nothing but a Newton step; if you go back and check, the Newton step also involves the negative of the gradient direction, except that it is pre-multiplied by H inverse. What we do is start with a very large theta, say 10^5; you have to take this number sufficiently large compared to the elements of the Hessian, which will typically not be very large. Then (H_0 + 10^5 I)^{-1} is approximately (10^5 I)^{-1}, because the theta term dominates over the Hessian. So initially, with theta_0 chosen to be 10^5, the direction S_k is along the negative of the gradient direction, and you are moving along the negative of the gradient. Then you go on reducing theta_k as k increases, by some logic. As theta_k reduces, the theta_k I term becomes smaller and smaller and the Hessian starts dominating. So initially the method behaves like the gradient method and later on it behaves like the Hessian method. For the Levenberg-Marquardt method I think you have a program in the MATLAB toolbox, or in other toolboxes such as the Scilab toolbox; you will find it there. It is one of the popular methods used for solving non-linear algebraic equations.

So with this we come to the end of the algorithmic part of solving non-linear algebraic equations. We have looked at different methods: gradient-free or successive substitution methods, and gradient- or slope-based methods, in which we had the Wegstein method, Newton's method, the damped Newton method, and so on.
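As a rough sketch of the Levenberg-Marquardt idea applied to the objective phi (the grad and hess arguments could be the finite-difference routines from the earlier sketches; the simple geometric reduction of theta is an assumption of this sketch, since practical implementations adjust theta up or down depending on whether a step actually reduces phi):

```python
import numpy as np

def levenberg_marquardt(phi, grad, hess, x0, theta0=1e5,
                        shrink=0.5, tol=1e-8, max_iter=200):
    x, theta = x0.astype(float), theta0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # S_k = -(H_k + theta_k I)^{-1} grad phi(x_k):
        # large theta -> gradient-like step, small theta -> Newton-like step
        s = np.linalg.solve(hess(x) + theta * np.eye(len(x)), -g)
        x = x + s
        theta *= shrink            # assumed reduction rule for theta_k
    return x
```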
Then we had the Broyden update for Newton's method. We then moved on to optimization-based methods: the gradient method again, then Newton's method and the quasi-Newton method, which I have gone through quickly, and Levenberg-Marquardt, which tries to merge the two, the gradient method and Newton's method. Solving non-linear algebraic equations is a complex problem, and the more advanced we become in computing, the larger the problems we want to solve; it is an ever-challenging problem and there are many, many methods. So a question that naturally arises is: which is the method? There is no the method. If there were the method, we would not be required, right? We would be out of business. You need an expert, a person who understands the physics of the problem and knows how to put together a solution, how to concoct a recipe, just like when you cook you have to know a recipe or form one. Here also you have to form a solution for a particular problem: sometimes Newton's method will work, sometimes an optimization-based method will work, sometimes the Wegstein method will work. So beyond a point you have to develop expertise in how to go about solving these problems. It does require this human element; otherwise MATLAB would do everything and you would just hand the problem to MATLAB, but it does not happen that way, and that is good, that is why we get jobs. We will now continue: I will say a few things about the convergence of these methods for non-linear algebraic equations. We cannot do justice to that in this course, it is a very advanced topic, but at least you should be sensitized to what is involved when we talk about convergence. It is considerably more complex than the convergence of iterative schemes for linear algebraic equations, because here you have non-linearity and things do not work out as nicely as they do for linear algebraic equations. But we will have a peek at that and then move on to ODE initial value problems.