In the last module, 4.1, we provided an interlude summarizing the computational problems arising from the formulation of least squares problems, linear, non-linear and so on, and we indicated the two pathways among the methods: one by matrix methods, the other by direct minimization. In this module, 4.2, we take up the matrix methods, which are mostly decomposition-based techniques. We will primarily be talking about three decomposition techniques, Cholesky decomposition, QR decomposition, and SVD, and we will look at the details of how these are organized.

Matrix methods for solving Ax = b: we are interested in solving the specific problem Ax = b where A is SPD, symmetric positive definite. But before I get a little deeper into solving systems with a symmetric positive definite matrix, I would like to provide a broader overview of the available classes of methods one can use to solve a matrix system. One class is called direct methods, the other iterative methods; both lie within the matrix methodology. The direct methods rely on what is called a multiplicative decomposition of A; these methods have in general a complexity of O(n^3), and they give the exact result when there are no round-off errors in the computer. In this class we are going to look at three families of methods: LU, QR, and SVD. LU is the forerunner of all the others, so I am going to start with basic LU, then QR, then SVD.

Multiplicative decomposition of a matrix is very similar to multiplicative decomposition of integers. Given any positive integer n, there is a fundamental theorem that says I can express it as a product of powers of primes; that is called prime decomposition, and prime decomposition is a multiplicative decomposition within the context of integers. For example, 25 = 5^2, the square of a prime; 24, on the other hand, can be expressed as 4 times 6, that is 2 times 2 times 2 times 3, so 24 = 2^3 times 3, where 2 and 3 are primes. Much like any positive integer can be decomposed into a product of powers of primes, any non-singular matrix can be expressed as a product decomposition, a multiplicative decomposition. What we have been used to doing with numbers I would like to be able to translate to matrices, and that is what these three decomposition methods entail. They all belong to the multiplicative class, but the details of the derivation of the factors differ because the properties of the factors differ.

On the other hand, a system like Ax = b can also be solved by iterative techniques. The iterative methods rely on what is called an additive decomposition of A. To make an iterative method operative we have to indulge in what are called convergence proofs: we have to show that the method converges iteratively to the solution I am seeking. In any iterative method one also has to contend with the derivation of the rate of convergence; it is one thing to prove that the iterative method converges, another to find the rate at which it converges to the optimum. The complexity of these methods depends on the cost per iteration and the desired accuracy, as opposed to the fixed cost of the direct methods, which are all of the type O(n^3). What is O(n^3)?
If A is a matrix of size n by n, the total amount of work to be done in solving the system Ax = b is of the order of n^3 operations. For example, take n = 10^6; a million-size problem is very routine in the geophysical domain. If I am interested in solving a static inverse problem, I convert it into one of solving a linear system Ax = b, where A is a matrix of size 10^6. The total number of basic operations (additions, multiplications, subtractions, divisions) that the computer has to perform is of the order of (10^6)^3 = 10^18. Ten to the power 18 operations is a large amount of work, and it is going to take quite a long time; we will provide an estimate of how long it takes to solve a million-by-million system as we go along. At this stage I would like to concentrate on the two mutually exclusive classes of algorithms, one depending on multiplicative decomposition and the other on additive decomposition.

There is also an analogy with integers for the additive case. If I have the number 4, I can express it as 1 + 1 + 1 + 1, as 2 + 1 + 1, or as 3 + 1; these different ways of expressing 4 additively are called additive decompositions. Much like I can express numbers additively, I should also be able to express matrices additively. So we have multiplicative and additive decompositions of numbers, and likewise for matrices, and these two classes of methods depend on our ability to decompose matrices multiplicatively and additively. Some of the methods of the iterative type are the Jacobi method, the Gauss-Seidel method, and successive over-relaxation. These iterative methods are easy to program; the total cost depends on how much it takes to perform one iterative step, and the number of iterative steps depends on the desired accuracy. So these are two competing approaches one can utilize to develop program systems for doing data assimilation.
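The lecture only names these iterative methods, so as a concrete illustration here is a minimal sketch of the simplest of them, the Jacobi iteration, built on exactly such an additive splitting A = D + R; the function name, tolerance, and test matrix are my own choices, not from the lecture.

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=500):
    """Solve Ax = b iteratively using the additive splitting A = D + R,
    where D is the diagonal of A and R holds the off-diagonal entries.
    Converges, for example, when A is strictly diagonally dominant."""
    D = np.diag(A)                      # diagonal part of A (as a vector)
    R = A - np.diagflat(D)              # off-diagonal remainder
    x = np.zeros_like(b, dtype=float)
    for k in range(max_iter):
        x_new = (b - R @ x) / D         # one Jacobi sweep: x <- D^{-1} (b - R x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

A = np.array([[4.0, 1.0], [2.0, 5.0]])  # strictly diagonally dominant
b = np.array([1.0, 2.0])
x, iters = jacobi(A, b)
print(x, iters)                          # close to np.linalg.solve(A, b)
```

The cost per sweep here is one matrix-vector product, O(n^2), and the total cost is (cost per iteration) times (number of iterations needed for the desired accuracy), exactly as described above.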
The LU decomposition method is our first one, and I am going to concentrate on some of its major aspects. It is derived from the classical Gaussian elimination method that we were all introduced to in a first course in algebra, to solve a 2 by 2 system. How do we solve a 2 by 2 system when we learn algebra? I have ax + by = f1 and cx + dy = f2. We eliminate x from one of the equations: multiplying the first equation by -c/a gives -cx - (bc/a)y = -(c/a)f1; adding this to cx + dy = f2, the x-terms cancel, and I get (d - bc/a)y = f2 - (c/a)f1. By dividing I get y; once I have y, I substitute into one of the equations and recover x. This is called the method of elimination, and please remember that Gauss was the one who invented it. Please realize Gauss's fundamental inventions: he developed the Gaussian distribution to describe observational errors, he developed the least squares methodology to solve problems in astronomy, and he also invented this method of elimination to solve linear systems. Knowingly or otherwise, we use many of the results of Gauss routinely in all our work.

This method of Gaussian elimination, when written in matrix formulation, can be shown to be equivalent to LU decomposition. So what is LU decomposition? Given a matrix A, I can express it (this is where the multiplicative decomposition comes in) as the product of two matrices, A = LU, where L is a lower triangular matrix and U is an upper triangular matrix. In a lower triangular matrix everything above the diagonal is zero; in an upper triangular matrix everything below the diagonal is zero. There is a general theorem: any non-singular matrix can be expressed as the product of an L and a U with these special structures, and this decomposition is mathematically equivalent to the Gaussian elimination method we are introduced to when we first develop our tools in algebra. I am not going to prove that it can be done; one way would be to show that such an L and U exist for a given A, but I am not going to worry about existence. I am going to approach it by a constructive procedure that actually delivers the factors by solving the equation A = LU. So A is given; assume a particular structure for L and a particular structure for U. If you count the unknowns in the L matrix, they are all below the diagonal (its diagonal entries are fixed, as I will explain in a moment), so L has n(n-1)/2 unknowns; the unknowns in the U matrix sit on and above the diagonal, so U has n(n+1)/2 unknowns. Together these make n^2 unknowns. If I multiply L and U and equate the product to A, there are n^2 elements in the matrix on the left and n^2 elements in the product matrix on the right, so I get n^2 equations in n^2 unknowns; by solving them I uncover the elements of L and U. That is the general idea of the methodology.
But before I go further I would like to talk a little bit about the structure. In the U matrix I allowed the diagonal to be arbitrary: u11, u22, and so on up to unn. In the L matrix, however, I fixed the diagonal. Suppose instead I also made the diagonal of L free, l11, l22, ..., lnn; then the total number of unknowns in L and U together would be larger than n^2, but there are only n^2 equations. Please realize how we got here: I started with an under-determined or over-determined system and converted it into a linear system Ax = b; to solve that linear system I cannot again go back to an over-determined or under-determined system, that becomes circular. I have to have a determined system. It turns out that, without loss of generality, I can assume the diagonal entries of L are all unity; this is one way of enforcing solvability, and then there are n^2 equations in n^2 unknowns.

We will illustrate this idea by a very simple example. In general the above decomposition holds for any non-singular matrix, but since we are interested in symmetric matrices, I am going to start with a simple symmetric matrix: take A with rows (1, 3/2) and (3/2, 7/2), and let L and U follow the conditions we have already stated. If you multiply L and U, you get a matrix whose entries must be equated to those of A, one by one. From the first row you already get u11 = 1 and u12 = 3/2. Next you have l21 u11 = 3/2, and since u11 is already known to be 1, l21 = 3/2. Now consider the last equation, l21 u12 + u22 = 7/2; I already know l21, I already know u12, and I know the right-hand side, so I can determine u22 = 7/2 - 9/4 = 5/4. In this particular way L and U are determined. Now look at the pattern: once u11 is known I can compute l21; once u12 is known I can compute u22. There is a particular order in which we can uncover the unknown elements, and this structure is embodied in the LU decomposition algorithm.

I am providing a pseudo-code, and I am sure you can read it; I do not need to repeat its instructions. You can readily see there is a for loop, another loop at the same level, and a third loop which I did not write out but which is intrinsic: the summation. You cannot simply write a summation sign in a computer program; you have to write it as a do loop. So there is a triply nested do loop, and when there is a triply nested do loop the overall cost is n^3; that is where the total cost of O(n^3) comes into play. It can be verified that the total number of operations to be performed is of the order of n^3, and I will count one representative step in detail just after the sketch below.
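The pseudo-code itself lives on the slides; as a stand-in, here is a minimal Python sketch of the same triply nested computation in Doolittle form (unit diagonal on L, no pivoting, so it assumes all leading minors are non-zero), checked against the 2 by 2 example above.

```python
import numpy as np

def lu_decompose(A):
    """Doolittle LU decomposition: A = L U with unit diagonal on L.
    The inner dot products are the 'third loop' hidden in the summation."""
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for r in range(n):
        for c in range(r, n):            # row r of U
            U[r, c] = A[r, c] - L[r, :r] @ U[:r, c]
        for i in range(r + 1, n):        # column r of L
            L[i, r] = (A[i, r] - L[i, :r] @ U[:r, r]) / U[r, r]
    return L, U

A = np.array([[1.0, 1.5], [1.5, 3.5]])   # the 2x2 example from the text
L, U = lu_decompose(A)
print(L)          # [[1, 0], [1.5, 1]]
print(U)          # [[1, 1.5], [0, 1.25]]
```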
Now let us verify the count. Consider, for example, the inner summation for a fixed r: j runs from 1 to r - 1. It is a summation, but before I sum I have to multiply, so there are r - 1 multiplications, giving me r - 1 numbers. To add 2 numbers I need 1 addition, to add 3 numbers I need 2 additions, so to add r - 1 numbers I perform r - 2 additions; and there is one more subtraction (subtraction and addition are essentially the same). Therefore this particular step requires (r - 1) + (r - 2) + 1 = 2(r - 1) operations. But this step is itself embedded in the enclosing do loops, so it is repeated as the outer indices run; likewise you can count the operations in the other statements, and then the operations overall. If you add up all these expressions you can verify that the total amount of operations is of the order of n^3. I would like to leave this as a homework problem: please do it and convince yourself that LU decomposition requires O(n^3) operations.

Now the question is this: once I have computed L and U, what do I do with them? Let us go back. A = LU, therefore Ax = LUx, and LUx can be written as L(Ux). Now I write Ux = g; if Ux = g, then the system becomes Lg = b. So Ax = b reduces to Lg = b, and then, to uncover x, I have to solve Ux = g. The LU decomposition framework therefore proceeds as follows: you are given A and b; first decompose A = LU; then, using the L you found in the previous step and b, solve the lower triangular system Lg = b for g; finally, since you already know U and you know g from the previous step, solve the upper triangular system Ux = g. That is how you solve in three steps: the LU decomposition step, the solution of a lower triangular system, and the solution of an upper triangular system. These three together constitute the method of LU decomposition, and this method is applicable to any matrix so long as A is non-singular.
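In code, the three-step framework is only a few lines; here is a sketch using SciPy's standard LU routines (which, unlike the simple scheme above, include pivoting for numerical stability):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4.0, 3.0], [6.0, 3.0]])
b = np.array([10.0, 12.0])

lu, piv = lu_factor(A)      # step 1: decompose A = LU (with pivoting), O(n^3)
x = lu_solve((lu, piv), b)  # steps 2-3: forward solve Lg = b, back solve Ux = g, O(n^2)
print(x)                    # matches np.linalg.solve(A, b): [1, 2]
```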
So this is the mother of all direct methods, LU decomposition; it is from here that all the other methods start. Now I am going to talk about the method of solving the lower triangular system. Please remember we have already seen how to decompose; I have given a pseudo-code for that. Now I want to solve a lower triangular system, but before I go further I want to point out that solving a lower triangular system and solving an upper triangular system are essentially similar mathematically. Why is that? The transpose of a lower triangular matrix is upper triangular, and the transpose of an upper triangular matrix is lower triangular; therefore once I know how to solve a lower triangular system I also know how to solve an upper triangular system, and I do not have to deal with them separately. So suffice it to say that I would like to solve a lower triangular system Lg = b, with entries l11, l21, l22 and so on, unknowns g1, g2, ..., gn, and right-hand side b1, b2, ..., bn. From the first equation I can recover g1 directly; then, for i running from 2 to n, I can recover each of the other gi by a simple formula: subtract from bi the terms involving the already-computed g's and divide by lii. Using this loop, and another embedded loop for the summation, I can solve the system. It can again be verified that the total number of operations required is only O(n^2). Please remember LU requires O(n^3), but solving a lower triangular system is much cheaper, only O(n^2), and n^2 does not grow as fast as n^3.

For the sake of completeness I am also giving you the algorithm for an upper triangular system, Ux = g, where g was computed in the previous step. Here I can recover xn first; then, running i from n - 1 down to 1, I can recover all the remaining xi. This method is called back substitution. The formula is again a do loop with another embedded do loop for the summation sign, and these two solvers each require O(n^2) operations.

So, in summary, if I am going to solve Ax = b using LU decomposition: the LU decomposition step takes O(n^3), the lower triangular system O(n^2), and the upper triangular system O(n^2). Of the three, O(n^3) is the dominant term, so the total cost is of the order of n^3; that is where the whole estimate comes into play. These costs are actually polynomials in n, and you can compute the actual polynomial that gives the amount of work; that is the homework problem I am leaving to the reader.
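Here is a minimal sketch of the two triangular solvers just described, forward substitution for Lg = b and back substitution for Ux = g, reusing the factors of the earlier 2 by 2 example; the function names are mine.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lg = b for lower triangular L; O(n^2) operations."""
    n = L.shape[0]
    g = np.zeros(n)
    for i in range(n):
        g[i] = (b[i] - L[i, :i] @ g[:i]) / L[i, i]
    return g

def back_substitution(U, g):
    """Solve Ux = g for upper triangular U; O(n^2) operations."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):       # x_n first, then down to x_1
        x[i] = (g[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

L = np.array([[1.0, 0.0], [1.5, 1.0]])
U = np.array([[1.0, 1.5], [0.0, 1.25]])
b = np.array([1.0, 2.0])
g = forward_substitution(L, b)           # Lg = b
x = back_substitution(U, g)              # Ux = g, so (LU)x = b
print(x)                                 # [0.4, 0.4]
```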
So this essentially provides you with pseudo-codes for all three pieces: LU decomposition, the lower triangular solver, and the upper triangular solver. You can code them in your favorite language: C, C++, Fortran, MATLAB, whatever it is. In fact MATLAB has excellent programs already written which you may readily use, but I think it is better to understand the intelligence behind how MATLAB solves the problem, to have a total understanding of the programs you may develop. Indeed, if you are trying to develop a data assimilation system for work in an operational center, they generally do not depend on anybody; they code everything from the ground up so that they have total control over how things happen. So if you are interested in developing large-scale systems, one needs to know the nuts and bolts of how these algorithms are implemented.

Now I am going to give you an indication of the time involved; I think this will be a very instructive exercise. Let us assume n is a million, so n^3 = 10^18 operations have to be performed. Let us pretend I have a very good machine: a machine that takes 10^-12 seconds per operation, which is called a teraflop machine. I want to quickly add that there are faster, petaflop machines, in which each operation takes 10^-15 seconds, but there are not too many of them on the face of the earth right now, only a handful across the world at very special places; teraflop machines are much more common. So let us pretend we have access to a teraflop machine: each operation takes 10^-12 seconds, and I have to perform 10^18 operations, so it would take 10^18 times 10^-12 = 10^6 seconds to solve the problem. Ten to the power 6 seconds is a million seconds; now let me estimate how much a million seconds is. There are 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 365 days in a year, so there are roughly 31.5 million seconds in one year, and I want only 10^6 of them. There are 60 times 60 times 24 = 86,400 seconds in a day, so 10^6 divided by 86,400 gives the total number of days to complete all 10^18 operations on this machine, and that is equivalent to 11.57 days to solve Ax = b when n is a million. And this is conditioned on the fact that I have a teraflop machine; if you do not have a teraflop machine, the story is much worse.
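The back-of-the-envelope arithmetic is easy to check in a few lines:

```python
ops = (10**6) ** 3          # n^3 operations for n = one million
sec_per_op = 1e-12          # teraflop machine: 10^12 operations per second
seconds = ops * sec_per_op  # 10^6 seconds of compute
days = seconds / (60 * 60 * 24)
print(days)                 # about 11.57 days
```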
Now I want to ask the following hypothetical question: would you wait for 11 and a half days to solve one problem? That is totally impractical. We have to create forecasts, especially in the meteorological context in atmospheric science, every 24 hours, and within that cycle the data assimilation person may not get more than 4, 5, or 6 hours. Observations have to be collected, observations have to be made ready, the models have to be run, the data assimilation part has to be established; once the data assimilation part is done and the inverse problems are solved, then one has to generate the forecast. And what we are talking about here is that solving one part of the data assimilation problem, namely solving Ax = b, is going to take on the order of 11 and a half days. So now you can see the monster. The monster is not there because we do not know how to solve the problem; we know how to solve the problem exactly. The monstrous nature comes from the sheer size of the problem, 10^6. And if you talk to meteorologists or oceanographers, 10^6 is not all that big; they would like to refine the grid to much smaller grid resolutions. To improve the accuracy and predictions of the model, on one hand you have to reduce the grid spacing; reducing the grid spacing increases the size; if the size increases, the problem becomes larger; and if the problem becomes larger, our computers are not enough to solve it within the time allowed for generating the prediction. This is the dilemma faced the world over by all the meteorological operational centers. So what is the way out? One way is to reduce the size of the problem, but reducing the size means making the problem coarser, and if you make the problem coarser there are more model errors. Or you buy the best machine money can buy, but there are no machines faster than petaflops these days. These kinds of problems provide the impetus for the growth of ever faster computers, from tera to peta to exaflop machines. Until faster and faster machines come into being, we may have to content ourselves with solving only smaller-size problems, because of the constraint of time within which we are allowed to operate. That is the end result of this computational analysis.

Having discussed the solution of Ax = b for a general matrix, I would now like to transition slowly to the case of the matrix we are interested in: from a general A to a symmetric A, and from symmetric to symmetric positive definite. So suppose the matrix A is symmetric. I am going to concoct a diagonal matrix D out of the diagonal elements of U. In that case I should be able to express U as the product DM, and it can be easily verified by a simple matrix calculation that M is an upper triangular matrix whose diagonal entries are all 1. So I can use this in my LU decomposition to express A = LDM: L is a lower triangular matrix with all 1's along the diagonal, M is an upper triangular matrix with all 1's along the diagonal, and D is a diagonal matrix whose diagonal elements were part of U. I have separated them out; that is a further decomposition.
Now, if A is symmetric, it can be verified that M = L^T; M and L are not distinct. Therefore I should be able to express A as the L D L^T decomposition. This is the special form the LU decomposition takes when A is symmetric. In the LU case I have to compute L and U separately, because L is not equal to U; but here U has been replaced by D L^T, so if I compute L I already know L^T as well, and half the work is saved. All I need to do is compute L and compute D. So when the matrix A is symmetric, I am saving a great deal by not having to compute two triangular matrices, because of this simple form A = L D L^T.

Again I am going to give you a simple example; it is instructive to go through these simple examples. A is the symmetric matrix from before, and I have already shown its LU factors: the diagonal elements of U are 1 and 5/4. So U can be expressed as DM, where D = diag(1, 5/4) and M is unit upper triangular, and M and L are essentially transposes of each other since A is symmetric. Moreover the diagonal elements of D are positive, and if the diagonal elements of D are positive I can take what is called the square root of D: the square root of a diagonal matrix is simply the diagonal matrix whose entries are the square roots of the corresponding diagonal elements. Here the diagonal elements are 1 and 5/4; the square root of 1 is 1 and the square root of 5/4 is sqrt(5)/2, so that gives the square root matrix. I can thus express D as D^{1/2} times D^{1/2}, much like I can express any positive number a as sqrt(a) times sqrt(a).

So what have I now done? I have identified L, U, D, and M; I have shown M = L^T; further, I took the square root of D, so D can be expressed as the product of its square roots. Therefore A can be written as L D^{1/2} D^{1/2} L^T, and I can combine these in pairs: the first part is L D^{1/2}, and the second part is D^{1/2} L^T, which is exactly (L D^{1/2})^T. I am going to call L D^{1/2} = G, so the second factor is G^T, and therefore A = G G^T. Now look at this: G is called the Cholesky factor, and A = G G^T is called the Cholesky decomposition, and I have shown you through this example how to compute the Cholesky factor. In the Cholesky case there is no L and U separately; there is only G, and once you find G you can compute G^T very readily, so the amount of work is essentially reduced to half. Therefore the cost of Cholesky decomposition is about half of what it takes for an LU decomposition. Of course, for large problems even this is going to be large, but we are trying to see how the various decomposition methods are related to each other.
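Here is the same 2 by 2 example carried through in NumPy, verifying A = L D L^T and A = G G^T with G = L D^{1/2}:

```python
import numpy as np

# The running 2x2 example: A = L D L^T with unit-diagonal L.
A = np.array([[1.0, 1.5], [1.5, 3.5]])
L = np.array([[1.0, 0.0], [1.5, 1.0]])   # unit lower triangular
D = np.diag([1.0, 1.25])                 # diagonal of U, all positive

# Square root of a diagonal matrix: take square roots entrywise.
D_half = np.sqrt(D)
G = L @ D_half                           # Cholesky factor G = L D^{1/2}

print(np.allclose(A, L @ D @ L.T))       # True: A = L D L^T
print(np.allclose(A, G @ G.T))           # True: A = G G^T
print(np.allclose(G, np.linalg.cholesky(A)))  # matches NumPy's Cholesky
```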
So let me now summarize the whole thing. Let A be a symmetric positive definite matrix; then the diagonal elements of D are positive, and if the diagonal elements of D are positive I can afford to take the square root, in which case I can express D = D^{1/2} times D^{1/2}. So L D L^T becomes L D^{1/2} D^{1/2} L^T, which I can associate as the product of (L D^{1/2}) and (L D^{1/2})^T; you can readily see that the transpose of a diagonal matrix is the same diagonal matrix. I now define G = L D^{1/2}; that is called the Cholesky factor, in which case A becomes G G^T, and that is called the Cholesky decomposition. The diagonal elements of D^{1/2} are u11^{1/2}, u22^{1/2}, ..., unn^{1/2}, and that is the square root of the diagonal matrix. In some circles the matrix G that we compute is also called a square root of A. If I can talk about the square root of a diagonal matrix, and about the square root of a number, I should also be able to talk about the square root of a matrix. But here comes the difference. With numbers, the square root of a positive number is real: if a = 5, sqrt(5) is real; if a = -5, sqrt(-5) is complex. So if you want to remain within the real world, the square root is defined only for positive numbers. Likewise, if you want to be able to define real square roots of matrices, the matrices have to be positive definite. Square root of a positive number; square root of a positive definite matrix. So the Cholesky factor is also called the square root of A: if A = G G^T, I call G the Cholesky factor and also the square root of A; both names apply simultaneously, different people use different characterizations of G, but the end result is a multiplicative decomposition.

I am now expressing the computation of G in the form of a pseudo-code, which you can readily follow. The only difference between LU decomposition and Cholesky decomposition is that in LU decomposition there is no square root operation, while in Cholesky there is. Square root operations are tricky: the square root is not a basic operation, and taking the square root of a given number may take a lot more time. So this is Cholesky decomposition with a square root operation: I have to perform additions, multiplications, subtractions, divisions, and square roots. Counting as before, we still need O(n^3) operations, but the leading coefficient of the polynomial that represents the amount of work is about one half of that required for LU decomposition; therefore Cholesky decomposition of a symmetric positive definite matrix is cheaper than LU decomposition of a general matrix.

Now for the framework. Let A be SPD with A = G G^T. Then Ax = G G^T x = G (G^T x); I am going to call G^T x = y, so the system becomes Gy = b. Given this framework I have a three-step algorithm: compute G; then solve the lower triangular system Gy = b; and then solve the upper triangular system G^T x = y. You can see the second step depends on the first step, and the third step depends on the first and the second.
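Since the pseudo-code for computing G is on the slides, here is a minimal Python version of the standard algorithm it describes; a sketch only, omitting the check for a non-positive diagonal that production code would need.

```python
import numpy as np
from math import sqrt

def cholesky(A):
    """Cholesky decomposition of an SPD matrix: A = G G^T, G lower triangular.
    About half the arithmetic of LU, plus n square root operations."""
    n = A.shape[0]
    G = np.zeros((n, n))
    for j in range(n):
        # diagonal entry: g_jj = sqrt(a_jj - sum of squares already in row j)
        G[j, j] = sqrt(A[j, j] - G[j, :j] @ G[j, :j])
        for i in range(j + 1, n):
            # below-diagonal entries of column j
            G[i, j] = (A[i, j] - G[i, :j] @ G[j, :j]) / G[j, j]
    return G

A = np.array([[1.0, 1.5], [1.5, 3.5]])
G = cholesky(A)
print(G)                                 # [[1, 0], [1.5, ~1.118]]
print(np.allclose(A, G @ G.T))           # True
```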
So we solve the system again by a three-step procedure, quite parallel to the LU decomposition but at a lesser cost: the lower triangular system costs O(n^2), the upper triangular system O(n^2), and the decomposition itself O(n^3). The overall cost is still O(n^3), but with a smaller coefficient on the leading term, so it is somewhat cheaper.

Now, how does one solve the normal equations by applying Cholesky decomposition (this is what MATLAB does for you, for instance)? One has to do the following: you have H; you compute H^T H and call it A; then you split A = G G^T; once G is computed, you use the lower triangular and upper triangular solvers to solve the resulting linear system. This way of solving the linear least squares problem using Cholesky decomposition is a fundamental and basic tool, and it has come to be called the method of normal equations. It provides the algorithmic setup by which we can solve the linear systems to which the linear least squares problem gives rise in our analysis.

Let me summarize the method of normal equations. Let H be a full-rank m by n matrix. Compute A = H^T H: that takes O(m n^2) operations. Compute H^T z: that takes O(m n) operations. Compute the Cholesky factor G of A: that takes O(n^3) operations. Solve the resulting lower triangular system: O(n^2). Solve the upper triangular system: O(n^2). Then you have the overall solution of the linear least squares problem. This is for the over-determined system; the under-determined system can again be solved by the analogous procedure.

With this we have provided a complete story: starting with the formulation of the linear least squares problem, converting it into minimizing the sum of the squares of the residuals, computing the gradient, equating the gradient to zero, leading to a symmetric positive definite system, and solving that system by Cholesky. We have provided a complete path from formulation to analysis to algorithms to computational complexity to pseudo-program, to be able to deliver the result. This complete pathway is exactly what happens when someone says, "I have developed a data assimilation system for use in practical applications."
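Putting the whole pipeline together, here is a minimal sketch of the method of normal equations, with SciPy's cho_factor/cho_solve standing in for the hand-written Cholesky and triangular solvers above; the test data are invented for illustration.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def least_squares_normal_equations(H, z):
    """Solve min ||z - Hx||_2 via the method of normal equations:
    form A = H^T H (SPD when H has full column rank), then Cholesky."""
    A = H.T @ H                    # O(m n^2)
    b = H.T @ z                    # O(m n)
    c, low = cho_factor(A)         # Cholesky factorization, O(n^3)
    return cho_solve((c, low), b)  # two triangular solves, O(n^2) each

# over-determined example: 4 observations, 2 unknowns
H = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
z = np.array([1.0, 2.1, 2.9, 4.2])
x = least_squares_normal_equations(H, z)
print(x)                           # matches np.linalg.lstsq(H, z, rcond=None)[0]
```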
I would now like to take a few minutes to discuss the notion of square roots of a matrix. For numbers we know: the square root of a positive number is real, the square root of a negative number is complex. In the case of matrices there are three possible ways to define the concept of a square root in a mathematically consistent way. One is via the Cholesky factor. Please understand these definitions are man-made; you can define anything you want so long as you are consistent. If you forget the transpose for a moment, A = G G^T looks like A = G times G, that is G^2, so G is a square root of A in that sense; the Cholesky factor defines a square root.

Secondly, we can express A as the product of a symmetric matrix S with itself, A = S^2. You can parameterize the elements of S, equate the elements of S^2 to the elements of A, and solve for the elements of S; that is also possible mathematically. Even though I am not going to show the procedure, it is not too different from the LU construction: there we assumed the elements of L and U were unknown; here I assume the elements of S are unknown. The elements of S^2 are functions of the elements of S; you equate the corresponding elements, you solve for the elements of S, and you get what is called the symmetric square root. What is the difference between the Cholesky square root and the symmetric square root? In the Cholesky case the square root G is a lower triangular matrix; in the symmetric case S is a full matrix, it does not have any special triangular structure, but it is symmetric: its upper half and lower half are essentially the same.

A third kind of square root comes from the eigen decomposition. From the module on matrices we have already seen that a symmetric matrix can be expressed as the product V Lambda V^T; this is called the eigen decomposition (we are assuming A is SPD, and any SPD matrix can be expressed this way using its eigenvalue decomposition). Now V Lambda V^T = V Lambda^{1/2} Lambda^{1/2} V^T = (V Lambda^{1/2}) (V Lambda^{1/2})^T = V-bar V-bar^T. Here again it looks like the Cholesky form, so V-bar = V Lambda^{1/2} is considered a square root of A; this square root is given by the eigen decomposition. So now if you say "a square root of a matrix," there are three different ways of computing square roots, and you have to specify the method by which you compute it. In data assimilation, the Cholesky-based square root as well as the eigen-decomposition-based square root are very popular; we seldom use the symmetric square root, but it is mathematically possible. So, from the square root of a number to the square root of a matrix: there are different ways of defining square roots in a consistent way, and we can utilize these square roots to our benefit when we do analysis with matrix algorithms. That is a summary of one class of matrix techniques, the one based on Cholesky decomposition, which is essentially derived from the LU decomposition.
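A quick numerical comparison of the three square roots on the running 2 by 2 example; note that I compute the symmetric square root via the eigen decomposition, the usual computational shortcut, rather than by the element-matching construction described above.

```python
import numpy as np

A = np.array([[1.0, 1.5], [1.5, 3.5]])   # SPD running example

# 1) Cholesky square root: lower triangular G with A = G G^T
G = np.linalg.cholesky(A)

# 2) Symmetric square root: S = V diag(sqrt(lam)) V^T, with S S = A
lam, V = np.linalg.eigh(A)               # eigen decomposition A = V diag(lam) V^T
S = V @ np.diag(np.sqrt(lam)) @ V.T

# 3) Eigen-based factor: Vbar = V diag(sqrt(lam)), with A = Vbar Vbar^T
Vbar = V @ np.diag(np.sqrt(lam))

for R in (G, S, Vbar):
    print(np.allclose(A, R @ R.T))       # True for all three
print(np.allclose(S, S.T))               # the symmetric one really is symmetric
```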
Now I would like to move slowly into the next decomposition method, the QR decomposition. For that I need some preliminaries, so I am going to recall some definitions from matrix theory. Let A be a matrix of size n by n. We say A is an orthogonal matrix if its inverse is its transpose, that is, A A^T = A^T A = I. Orthogonal matrices are very powerful and have very special properties; one property of orthogonal matrices we are going to illustrate here.

So let A be an orthogonal matrix and let x be any vector. If you multiply a vector by a matrix you get another vector, so y = Ax is a vector, an orthogonal transformation of the vector x using the orthogonal matrix A. I want to compute the square of the 2-norm of y. By definition, the square of the 2-norm of y is (Ax)^T (Ax) = x^T A^T A x; if A is orthogonal, A^T A is the identity, so this equals x^T x, which is the square of the norm of x. So look at this: I start with a vector x, I linearly transform x into y, and these two vectors happen to have the same length. What does this mean? It means the 2-norm is invariant under orthogonal transformation. And what does that imply? If two vectors have the same norm, they lie on the same circle centered at the origin with radius equal to the norm of x: y lies on that same circle, since we first saw that the norm of y equals the norm of x. So if I have a vector x and I multiply it by an orthogonal matrix, the orthogonal transformation simply rotates it, without elongating it, without shrinking it; the length remains the same. That is the fundamental property I would like to emphasize at this moment: the 2-norm is invariant under orthogonal transformation. This is going to be very useful in the development of QR.

Now for the QR decomposition algorithm. I am going to consider the case m greater than n for simplicity. Let H be an m by n matrix. There is a fundamental theorem that says there exists an orthogonal matrix Q and an upper triangular matrix R such that H = QR, where Q Q^T = Q^T Q = I_m because Q is an orthogonal matrix, and R is upper triangular. In matrix notation: H is m by n, Q is m by m, and R is m by n; so R is a rectangular matrix, H is a rectangular matrix, but Q is a square matrix. This is called the full QR decomposition. Now look at that: LU decomposition is for square matrices, A = LU, while QR decomposition applies even to general matrices, namely rectangular matrices; so it is much more powerful, a generalization of sorts in one way. And what is the idea? In the normal equations you start with H, you compute H^T H, and then you decompose; why create A and then decompose it, why not decompose H itself directly? That is the idea. H, the m by n matrix, can be expressed as the product of two matrices with special structures: one orthogonal, the other upper triangular. Orthogonal matrices have special properties, and upper triangular matrices are structurally simpler matrices. What do I mean by orthogonal here? In an orthogonal matrix, the columns of Q are orthonormal vectors. What do you mean by orthonormal? The length of each vector is 1, and if I pick any two vectors and take their inner product, it is 0; that means any two columns are mutually orthogonal and every column is of length 1. Such a collection of vectors is called orthonormal. So this is a very special form of decomposition, also a generalization of the LU decomposition; there are lots of beautiful properties in going from LU to Cholesky to QR.
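Both facts, the invariance of the 2-norm and the shapes in the full QR decomposition, are easy to see numerically; a small sketch with NumPy's qr:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an orthogonal matrix Q (from the QR of a random matrix) and check
# that the 2-norm is invariant under the transformation x -> Qx.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
x = rng.standard_normal(4)
print(np.allclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # True

# Full QR of a rectangular H (m > n): Q is m x m, R is m x n.
H = rng.standard_normal((5, 3))
Qf, Rf = np.linalg.qr(H, mode='complete')
print(Qf.shape, Rf.shape)                  # (5, 5) (5, 3)
print(np.allclose(Qf.T @ Qf, np.eye(5)))   # orthonormal columns
print(np.allclose(H, Qf @ Rf))             # H = QR
```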
Now I would like to go back to the case where m is larger than n, the over-determined system. There are m rows, and I am going to cut R at n rows: the top block has n rows and the bottom block has m - n rows. If I partition R like this, I should partition Q accordingly: the part with the first n columns I am going to call Q1, and the part with the remaining m - n columns I am going to call Q2. So Q = [Q1 Q2] is a partition of Q, and R is partitioned into R1 on top of a zero block: the lower part of R is entirely a zero matrix, and R1, the first n rows of R, is an n by n upper triangular block. Therefore H = QR = [Q1 Q2][R1; 0] = Q1 R1 + Q2 0 = Q1 R1. So H = Q1 R1, and this is called the reduced QR decomposition: I do not have to carry all those extra columns and all those zeros. The property of Q1 is that Q1^T Q1 = I_n; Q1 is a rectangular matrix with m rows and n columns whose columns are still orthonormal. So, to keep the names straight: the factorization H = QR with square Q (slide 20) is the full QR decomposition, and H = Q1 R1 (slide 21) is the reduced QR decomposition.

Now let us see how I can utilize this in my least squares problem. Please go back: my residual is r(x) = z - Hx, and my f(x) is the square of the norm of the residual; we already know that. We also know that the norm of a vector remains invariant under an orthogonal transformation; we saw this to start with, and the columns of Q are orthonormal. So the norm of r(x) equals the norm of Q^T r(x), which is essentially an orthogonal transformation of the residual vector. If I substitute r(x) = z - Hx in here, I get Q^T r(x) = Q^T z - Q^T H x. But we already know Q = [Q1 Q2], therefore Q^T z stacks Q1^T z on top of Q2^T z; and Q^T H x = Q^T Q R x = Rx, which stacks R1 x on top of a zero block, because R2 = 0. So when I combine these operations, my f(x) now becomes

f(x) = ||Q1^T z - R1 x||^2 + ||Q2^T z||^2.

This is the sum of the squared residuals. What is the import of this? x is unknown, and my job is to minimize f(x) by changing x; but the right-hand side consists of two terms, and the second term is independent of x. I cannot alter a term that does not depend on x, because x is the only free variable here; the second term is a constant and I cannot do anything with it. I can only manipulate the first term, so I am going to minimize f(x) by minimizing only the first term. I hope the argument is clear: this decomposition reduces f(x) to a sum of two terms, one depending on x and the other not depending on x; I can control only x, there is no other control I have; if I change x the first term changes and the second does not; therefore I need to concentrate only on manipulating the first term, since only the first term depends on x. So when does f(x) reach its
minimum? It is when the first term is 0, and the first term is 0 when R1 x = Q1^T z. Therefore the least squares solution is obtained by solving R1 x = Q1^T z; that is, x = R1^{-1} Q1^T z, obtained by solving an upper triangular system. So, by spending money not on the Cholesky decomposition of H^T H but on decomposing H itself into Q and R, we have now reduced the problem of computing the solution of a linear least squares problem to one of solving an upper triangular system. This is a very important development. Please understand: the solution of an upper triangular or lower triangular system costs only O(n^2), whereas the solution of a full system costs O(n^3); therefore this solution process is much cheaper than solving the normal equations. But what is the catch? The catch is that I still have to pay for the QR decomposition. So this methodology is very elegant, but it rests on being able to perform the QR decomposition of the matrix H, and that is a very fundamental operation. Our next step in the development of matrix methods is therefore to acquire the ability to factor H as Q times R, with Q orthogonal and R upper triangular. If I can make that happen, then, since R is partitioned into R1 and R2 = 0 and Q into Q1 and Q2, I know R1 and I know Q1, and I can compute the solution very easily. Everything rests on our ability to do the QR decomposition, to which we now turn.
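To close the loop, here is a minimal QR-based least squares solver on the same small example used for the normal equations, together with a check that the leftover residual is exactly ||Q2^T z||^2:

```python
import numpy as np
from scipy.linalg import solve_triangular

def least_squares_qr(H, z):
    """Solve min ||z - Hx||_2 via the reduced QR decomposition H = Q1 R1:
    the minimizer satisfies the upper triangular system R1 x = Q1^T z."""
    Q1, R1 = np.linalg.qr(H, mode='reduced')   # Q1: m x n, R1: n x n
    return solve_triangular(R1, Q1.T @ z)      # back substitution, O(n^2)

# same over-determined example as before
H = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
z = np.array([1.0, 2.1, 2.9, 4.2])
x = least_squares_qr(H, z)
print(x)                                       # agrees with the normal equations

# the minimum of f(x) is the part we cannot control: ||z - Hx||^2 = ||Q2^T z||^2
Qf, _ = np.linalg.qr(H, mode='complete')
Q2 = Qf[:, 2:]                                 # last m - n columns of the full Q
print(np.allclose(np.linalg.norm(z - H @ x)**2,
                  np.linalg.norm(Q2.T @ z)**2))  # True
```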