In the previous module (8.1) we gave a complete derivation of the Kalman filter for the case where the dynamics is linear and the observations are linear functions of the state; this is the classical LQG setting. We derived the Kalman filter equations and illustrated them at length with a scalar example: we examined the evolution of the model forecast without observations, then brought in an observation and carried out the filter derivation. We also analyzed the growth rate of the forecast errors, examined under what conditions the forecast errors converge, and discussed the stability of the filter in terms of what happens to those errors. You can readily see that this analysis provides a complete characterization of the behavior of the classical Kalman filter. So when we say "Kalman filter" we essentially mean the application of the filtering equations to linear dynamics with observations that are linear in the state; that is the classical LQG.

Now, in the early days when the Kalman filter was first implemented — one of the early applications was in space travel — the onboard computers had processors with very short word lengths, so arithmetic on them was prone to round-off error. One way to improve the quality of the computation is to reduce the condition numbers of the matrices involved, and one way to reduce the condition number is to work with square-root factors of the matrices instead of the matrices themselves. Which matrices are involved? The forecast covariance and the analysis covariance. If these matrices happen to have large condition numbers, difficulties can arise from numerical instabilities. So, in the early to mid 1960s, it was felt that one needed to improve the conditioning of the calculations in order to improve the overall stability and accuracy of the computed solutions; that led to what are called square-root filters. Since we have derived the Kalman filter equations within the covariance setup, the corresponding square-root form is called the covariance square-root filter. I also want to remind you that the inverse of the covariance is called the information matrix, so one could just as readily derive the Kalman filter equations in information form — the information filter — by working with the inverse of the covariance instead of the covariance itself. If the covariance is symmetric positive definite, its inverse is also symmetric positive definite, so both the covariance and the information matrix have positive square roots; hence we have the covariance square root and the information square root. These are some of the different forms in which the Kalman filter equations can be readily written. The square-root versions in general have better stability properties, and we are going to give an argument for how to derive the Kalman filter equations within the framework of the covariance square-root filter. That is the aim of this module. There is also another motivating aim: soon we will move to what are called nonlinear filters. Nonlinear filtering problems are very difficult to solve; we can only approximate, and one form of approximation uses what are called ensemble filters.
There is a natural connection between ensemble realizations of filters and the covariance square-root filter, and that is another reason we would like to familiarize the reader with the derivation of the Kalman filter equations not in covariance form but in terms of the square root of the covariance. That leads to the covariance square-root filter. Please recall that a square root of a symmetric positive definite matrix A can be obtained in one of three ways: via the Cholesky factorization, via the symmetric square root (computed, for example, by a Newton iteration), or via the eigenvalue decomposition. For simplicity of calculation, to illustrate the basic ideas, we will use either the Cholesky or the eigen decomposition, whichever is convenient for the discussion. In this particular case we rely on the eigenvalue-decomposition-based square root. If A is SPD, then A x_i = lambda_i x_i with all eigenvalues lambda_i positive, and collecting the eigenvectors x_1, x_2, ..., x_n as the columns of a matrix X, which is orthogonal, this can be written succinctly as A = X Lambda X^T, where Lambda is the diagonal matrix of eigenvalues. Exploiting this property and splitting Lambda as Lambda^{1/2} Lambda^{1/2}, we get the decomposition A = X-bar X-bar^T with X-bar = X Lambda^{1/2}; Lambda^{1/2} is the square root of the diagonal matrix of eigenvalues, and since A is positive definite all the eigenvalues are positive, so the square root exists. In component form, A = sum over i of lambda_i x_i x_i^T. This is the crux of the eigen decomposition, and this is how the matrix square root we will use in the covariance square-root filter derivation is defined.

Now what are we going to do? We are going to talk about two things: the full-rank and the reduced-rank formulations. What do we mean by full-rank and reduced-rank? If I keep all the eigenvalues and eigenvectors in the sum, that is the full-rank factorization. Now suppose the eigenvalues lambda_1, ..., lambda_r are large while lambda_{r+1}, ..., lambda_n are small. It can also be shown that the trace of A, which is a covariance matrix (A is SPD), equals the sum of all the eigenvalues, trace(A) = sum from i = 1 to n of lambda_i. I can divide this into two summations: the sum from i = 1 to r of lambda_i plus the sum from i = r+1 to n of lambda_i. If the second part is small compared with the first, I can approximate the trace by the sum from i = 1 to r of lambda_i. Therefore I take only the first r columns of X-bar, corresponding to the first r eigenvalues, where r is a number with 1 <= r <= n. I can then approximate A by the product of this reduced matrix X-bar_r and its transpose. Please understand that the rank of this matrix product is r, so this is what is called a rank-r approximation to A.
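To make the eigen-decomposition square root concrete, here is a minimal sketch in Python/NumPy; the function name eig_sqrt is my own choice, not from the lecture, and it simply builds X-bar = X Lambda^{1/2} so that X-bar X-bar^T recovers A.

```python
import numpy as np

def eig_sqrt(A):
    """Square root of an SPD matrix A via its eigen decomposition A = X Lam X^T.
    Returns Xbar = X Lam^{1/2}, which satisfies Xbar @ Xbar.T == A."""
    lam, X = np.linalg.eigh(A)   # eigenvalues (ascending) and orthonormal eigenvectors
    return X * np.sqrt(lam)      # scale column i of X by sqrt(lambda_i)
```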
We talk about a rank-r approximation because of computational demands. If the matrix is very large and the eigenvalues, ordered lambda_1 >= lambda_2 >= ... >= lambda_n, fall off very sharply, then it can be shown that the ratio of the partial sum to the total sum, (sum from i = 1 to r of lambda_i) divided by (sum from i = 1 to n of lambda_i), may exceed 90 percent, say. That means 90 percent of the information is contained in the first r eigenvalues; or it could be 95 percent. So the value of r depends on the percentage you are interested in, that is, on the level of approximation you are willing to accept. If I take the full rank, 100 percent of the variance is accounted for; if I go for a reduced rank, I neglect the eigenvectors corresponding to the very small eigenvalues. So we can talk about a rank-r approximation. In this setting each x_i x_i^T is a rank-one matrix, and a sum of r rank-one matrices gives a rank-r approximation. Please also realize that the eigenvectors are linearly independent to start with, so any subset consisting of the first r of them is also linearly independent. These are the choices one has in terms of approximation.

Now I want to go back to where we are. We started with a positive definite matrix A, obtained its eigen decomposition, and from it a square-root factorization A = X-bar X-bar^T. We then interposed another idea: if, in practice, the eigenvalues fall off very sharply, the sum of the trailing small eigenvalues does not add up to much, so we may be able to cut off at a sensible value r. The value of r is decided by the ratio of the sum of the first r eigenvalues to the total sum of all the eigenvalues — whether you want to explain 90, 95, or 99 percent of the variance. So you start by deciding the percentage you are comfortable with; based on that I can fix the value of r, pick the first r eigenvalues and eigenvectors, and consider an approximation to A as the product of two rank-r factors. That is what is called the reduced-rank approximation to the square root, obtained by appropriately selecting the columns that go into the definition of the square-root matrix. Why would you reduce the rank? If the full-rank computation becomes too difficult or too expensive, we may settle for a rank-r approximation. So what we have done is simply explain one way of taking square roots, in either full-rank or reduced-rank form. The advantage of the eigenvalue-decomposition-based square root is precisely this flexibility of being able to consider a square root of full rank or reduced rank; that is the key message of the first four slides of this module. Now I am going to go back to using the Cholesky factor as the square root.
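Here is a sketch of the reduced-rank selection just described, again in NumPy; the threshold frac (e.g. 0.90 or 0.95) is the fraction of variance you decide to retain, and the helper name reduced_rank_sqrt is hypothetical.

```python
def reduced_rank_sqrt(A, frac=0.90):
    """Rank-r square root of SPD A: keep the largest eigenvalues whose partial sum
    explains at least `frac` of trace(A) = sum of all eigenvalues."""
    lam, X = np.linalg.eigh(A)
    order = np.argsort(lam)[::-1]            # sort eigen-pairs in decreasing order
    lam, X = lam[order], X[:, order]
    explained = np.cumsum(lam) / np.sum(lam) # ratio of partial sum to total sum
    r = int(np.searchsorted(explained, frac) + 1)
    Xbar_r = X[:, :r] * np.sqrt(lam[:r])     # n x r factor: Xbar_r @ Xbar_r.T ~ A
    return Xbar_r, r
```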
So we are now going to use the Cholesky factorization, which means we are talking about the full rank. I am still concerned with the Kalman filter: recall that we would like to rewrite the Kalman filter equations in covariance square-root form. Let the forecast covariance at time k be written as the product P_k^f = S_k^f (S_k^f)^T of the Cholesky factor S_k^f and its transpose. Please understand that in a Cholesky factorization A = G G^T, G is lower triangular and G^T is upper triangular; so here S_k^f is a lower-triangular square root of P_k^f. Likewise the analysis covariance has its square root, P_k^a = S_k^a (S_k^a)^T, and the model noise covariance has a factor Q_k = S_k^q (S_k^q)^T. So S_k^f, S_k^a and S_k^q are all lower-triangular matrices, and each covariance is a lower-triangular factor times its transpose. We are going to rewrite the Kalman filter equations using S_k^f instead of P_k^f and S_k^a instead of P_k^a.

Now consider the forecast of the covariance matrix from the Kalman filter derivation we have already done: the forecast covariance at time k+1 depends on the analysis covariance at time k, P_{k+1}^f = M_k P_k^a M_k^T + Q_{k+1}. Replace the analysis covariance and the model noise covariance by their square roots: P_{k+1}^f = M_k S_k^a (S_k^a)^T M_k^T + S_{k+1}^q (S_{k+1}^q)^T. This can be rewritten in matrix form as P_{k+1}^f = [M_k S_k^a, S_{k+1}^q] [M_k S_k^a, S_{k+1}^q]^T. Now I am going to concoct a new matrix, S-bar_{k+1}^f = [M_k S_k^a, S_{k+1}^q], so that P_{k+1}^f = S-bar_{k+1}^f (S-bar_{k+1}^f)^T has the form of a square-root factorization. What does this tell you? It tells you that if I have a Cholesky-type factor for P_k^a and one for Q_{k+1}, I can get a corresponding factor for the forecast covariance from this equation. Each of the original factors is n by n. However, the matrix that factors the forecast covariance is, unfortunately, of size n by 2n. An n-by-2n matrix is not square; it is a rectangular matrix with double the number of columns compared to rows, and that is undesirable. While we are interested in keeping the square root, we cannot allow the size of the matrices to keep expanding. The expansion here comes from the fact that the forecast covariance is the sum of two terms, one from the analysis covariance and one from the model error (model noise) covariance. This doubling of the columns increases both storage and computation time, so we cannot allow it; computationally we are required to reduce the factor back to an n-by-n matrix. And when we reduce the n-by-2n matrix to an n-by-n matrix, we have to make sure that we do not lose any information.
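As a small sketch of the construction just described (assuming M, S_a and S_q are NumPy arrays holding M_k, S_k^a and S_{k+1}^q):

```python
def forecast_sqrt_augmented(M, S_a, S_q):
    """Pre-array Sbar = [M S^a, S^q] of size n x 2n; its outer product
    Sbar @ Sbar.T equals M P^a M^T + Q, the forecast covariance.
    This factor still has to be compressed back to n x n (next step)."""
    return np.hstack([M @ S_a, S_q])
```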
So this must be an information-lossless transformation that compresses the number of columns from 2n back to n. That is the goal, and it can be done very easily by a method we already know: the Gram-Schmidt orthogonalization process, which we saw in the context of the QR decomposition. What is the basic idea? Please understand that S-bar_{k+1}^f is a matrix of size n by 2n, so its transpose (S-bar_{k+1}^f)^T is a 2n-by-n matrix with more rows than columns; this corresponds to the over-determined case of H that we used in the context of static inverse problems. I would like to find a matrix Q of size 2n by n with orthonormal columns, so that Q^T Q = I (that is the condition we need), and an n-by-n triangular matrix such that (S-bar_{k+1}^f)^T = Q R. The left-hand side is given, and finding such a Q and triangular factor is exactly the classical QR decomposition applied to the matrix (S-bar_{k+1}^f)^T. The QR decomposition will then deliver a Q, which is 2n by n, and an n-by-n triangular factor whose transpose we take as S_{k+1}^f. With these two matrices you can really see how we move towards reducing the n-by-2n factor to an n-by-n one. To see this, consider the old expression from the previous slide, P_{k+1}^f = S-bar_{k+1}^f (S-bar_{k+1}^f)^T, and substitute the QR decomposition; since Q^T Q = I, it reduces to P_{k+1}^f = S_{k+1}^f (S_{k+1}^f)^T. So by using the trick of the QR decomposition on the expanded matrix, I can compress the factor back to n by n. This is an additional step we have to perform if we want to keep track of the full rank. Here you see some of the challenges in rewriting the Kalman filter equations from covariance form to covariance square-root form. Again, let me remind you why we go from the covariance to the covariance square root: simply because we would like to improve the conditioning, and the condition number of a square-root factor is much better than that of the original matrix (for an SPD matrix it is its square root), which is a very well known result.
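A minimal sketch of this compression step using NumPy's reduced QR (the function name compress_sqrt is mine): the QR of the 2n-by-n transpose gives an n-by-n triangular factor whose transpose is the compressed square root.

```python
def compress_sqrt(Sbar):
    """Compress an n x 2n square-root factor back to n x n.
    Sbar.T = Q R with Q (2n x n, Q.T @ Q = I) and R (n x n upper triangular),
    so Sbar @ Sbar.T = R.T @ R and S = R.T is an n x n lower-triangular factor."""
    Q, R = np.linalg.qr(Sbar.T)   # 'reduced' QR by default
    return R.T
```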
So we have expressed P_{k+1}^f in terms of a square root built from n-by-2n and 2n-by-n matrices. With this we come back to rewriting the filter in terms of square-root matrices. Given S_k^a and S_{k+1}^q, the square roots of the analysis covariance and the model noise covariance, consider the forecast step: instead of updating the forecast covariance, I am going to update its square root. The square root of the forecast covariance is related to the analysis square root as we have already broken it down, S-bar_{k+1}^f = [M_k S_k^a, S_{k+1}^q], and applying the QR decomposition to its transpose gives the required Cholesky-type factor S_{k+1}^f, as we have seen on the previous slide.

The data assimilation step now involves the Kalman gain, which contains the forecast covariance, and the Kalman gain can be rewritten using the square-root form. I am going to rename the product that appears here: let A = (H_{k+1} S_{k+1}^f)^T. Then H_{k+1} P_{k+1}^f H_{k+1}^T can be rewritten as A^T A, and using the same factorization for P_{k+1}^f the Kalman gain takes the form K_{k+1} = S_{k+1}^f A (A^T A + R_{k+1})^{-1}; this is the square-root version of the Kalman gain. Having the gain in square-root form, I can now substitute into the analysis covariance, P_{k+1}^a = (I - K_{k+1} H_{k+1}) P_{k+1}^f, and simplify; combining this with the square-root expression for K_{k+1} from the previous slide gives the square-root form of the analysis covariance, P_{k+1}^a = S_{k+1}^f [I - A (A^T A + R_{k+1})^{-1} A^T] (S_{k+1}^f)^T. To complete the square-root version, I simply need to compute the square root of the n-by-n matrix inside the brackets, I - A (A^T A + R_{k+1})^{-1} A^T. This has to be done in a couple of stages. First compute B as the solution of (A^T A + R_{k+1}) B = A^T; B then equals (A^T A + R_{k+1})^{-1} A^T, which, if you go back to the previous step, is exactly one of the pieces of the expression we already have. So this is the part we are now concerned with, and it is computed by solving a linear system; this linear system is itself solved using a Cholesky factorization. Please understand that even though the expression uses an inverse, I do not want to form the inverse explicitly; I want the product, and the product is simply the solution of (A^T A + R_{k+1}) B = A^T, so I am just solving a system of linear equations. Once B is obtained, the term from the previous slide becomes P_{k+1}^a = S_{k+1}^f (I - A B) (S_{k+1}^f)^T.
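Here is a sketch of the gain and the B-solve under the lecture's notation, with A = (H S^f)^T and the solve done by a linear solver rather than an explicit inverse, as emphasized above. The helper name and the use of np.linalg.solve (in place of an explicit Cholesky solve) are my own simplifications.

```python
def sqrt_gain_and_B(S_f, H, R):
    """Square-root form of the Kalman gain.
    A = (H S^f)^T (n x m); B solves (A^T A + R) B = A^T, so B = (A^T A + R)^{-1} A^T.
    Then K = S^f A (A^T A + R)^{-1} = S^f B^T, without ever forming P^f."""
    A = (H @ S_f).T                 # n x m
    G = A.T @ A + R                 # m x m, symmetric positive definite
    B = np.linalg.solve(G, A.T)     # m x n; an explicit inverse is never formed
    K = S_f @ B.T                   # Kalman gain, n x m
    return K, A, B
```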
So I need to find the square root of the term within the parentheses. Once B is found, I need the square root of I - A B: I know what A is, I now know what B is, so I multiply them, subtract from I, and find the square root; this square root, too, can be obtained by a Cholesky factorization. Substituting all this back, P_{k+1}^a = S_{k+1}^f C C^T (S_{k+1}^f)^T, which I now define as S_{k+1}^a (S_{k+1}^a)^T, where S_{k+1}^a = S_{k+1}^f C. This is one of the fundamental relations in the square-root formulation. The key is to find C: to find C I have to do a factorization of I - A B, and to find B I have to solve a linear system. By doing these steps I can find C. What does C do? It transforms the forecast square-root covariance matrix into the analysis square-root covariance: from forecast to analysis, the transformation is obtained by multiplying the forecast square root by C on the right, where C is obtained from the solutions of the two subproblems above.

With this I have now come to the summary of the square-root version of the classical Kalman filter equations. The model equation and the observation equation are as before, and the forecast is given by the classical Kalman filter. I assume the availability of the square root of the analysis covariance and the square root of the model noise covariance; S_{k+1}^f is then available from the QR decomposition (the Gram-Schmidt orthogonalization) of the augmented matrix, remembering that Q^T Q = I. Now I go to the data assimilation step, which is essentially the same as before except that the gain K_{k+1} appears in square-root form, with A = (H_{k+1} S_{k+1}^f)^T, B = (A^T A + R_{k+1})^{-1} A^T, C C^T = I - A B, and S_{k+1}^a = S_{k+1}^f C. One part gives you the forecast, one part gives the analysis, and the factorization steps take care of the square roots. So this is the summary of the square-root version of the Kalman filter equations. You can see there are lots of mathematical ideas, but everything rests on the ability to compute the square root of a covariance matrix and the transformations needed to compress the expanded factors. By combining the square root and the QR decomposition, I can in fact derive the square-root version of the Kalman filter. It was this kind of square-root Kalman filter that was implemented in the Apollo mission, which led to a very successful tracking capability; that was one of the major applications of the Kalman filter equations in space exploration.

Given that as background, I am now going to talk about what is called Potter's algorithm. Potter's algorithm is a further simplification of the square-root version of the Kalman filter. To illustrate it, I am going to assume m = 1, meaning I have only a single scalar observation.
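Putting the pieces together, here is a minimal one-step sketch of the covariance square-root filter as summarized above; it reuses the hypothetical helpers compress_sqrt and sqrt_gain_and_B from the earlier sketches, and x_a, z denote the previous analysis mean and the new observation.

```python
def sqrt_kf_step(M, S_a, S_q, H, R, x_a, z):
    """One forecast + analysis step of the covariance square-root Kalman filter."""
    n = S_a.shape[0]
    # Forecast: state and square root of P^f = M P^a M^T + Q
    x_f = M @ x_a
    S_f = compress_sqrt(np.hstack([M @ S_a, S_q]))
    # Analysis: gain, transformation C with C C^T = I - A B, new square root
    K, A, B = sqrt_gain_and_B(S_f, H, R)
    C = np.linalg.cholesky(np.eye(n) - A @ B)
    S_a_new = S_f @ C                     # S^a = S^f C
    x_a_new = x_f + K @ (z - H @ x_f)
    return x_a_new, S_a_new
```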
So H_k is 1 by n while the state is still an n-vector; z = H x is a scalar (1 by 1), being the product of a 1-by-n row and an n-by-1 column, and R_k, the variance of the scalar observation, is the scalar r. The forecast step is the same as in the general equations we gave on the previous slide. Now let us look at the simplification that results when m = 1 in the analysis step. H_{k+1} P_{k+1}^f H_{k+1}^T is a scalar. The quantity A = (H_{k+1} S_{k+1}^f)^T is simply a column vector, so A^T A is a scalar, r is a scalar, and we define the scalar alpha = 1 / (A^T A + r); therefore A^T A + r = 1/alpha and A^T A = 1/alpha - r. The Kalman gain in this particular case can therefore be written as K_{k+1} = alpha S_{k+1}^f A, which is again a column vector, and the analysis covariance matrix can be written as P_{k+1}^a = S_{k+1}^f (I - alpha A A^T) (S_{k+1}^f)^T — note that it is the identity I, not 1, that appears inside the parentheses. In this particular case A is a column vector and A^T is a row vector, so A A^T is an outer-product matrix, and alpha is a constant: the bracketed term is the identity minus a constant times a matrix. This is the expression for the analysis covariance at time k+1; please remember I am talking about the data assimilation step, computing the Kalman gain and the analysis covariance. Now, the outer factor S_{k+1}^f is already in square-root form, so to get a square root of the analysis covariance I only need the square root of the term within the parentheses; the analysis is very similar to the general case we talked about. I would like to express the square root of I - alpha A A^T as a matrix of the same form with a different constant beta, that is, I would like (I - beta A A^T)^2 = I - alpha A A^T; this relation is what helps us find beta. Let us remind ourselves that A is a column vector, so A A^T is a rank-one matrix; that is what we are exploiting. Because of the square involved, the relation implies I - alpha A A^T = I - 2 beta A A^T + beta^2 A A^T A A^T, and since A^T A is a scalar the last term can be written succinctly as beta^2 (A^T A) A A^T. I would like to find a relation between alpha and beta, so I equate both sides term by term: there is an A A^T on the left and an A A^T on the right, so if I
consider and equate the coefficients of the A A^T terms on both sides, I get the equality -alpha = -2 beta + beta^2 (A^T A), which tells us the relation in which beta and alpha must stand. Now please understand that A^T A is a scalar, so this is a quadratic equation in beta; its solution is given by the quadratic formula, and if I substitute back and unwind all the expressions I get beta = alpha / (1 + sqrt(alpha r)). I can now replace the matrix within the parentheses by (I - beta A A^T)^2, because I have found its square root, and that gives the analysis square root as the forecast square root multiplied by I - beta A A^T; this matrix plays the role of the C matrix we talked about on the previous slides — it is the transformation. Therefore P_{k+1}^a = S_{k+1}^a (S_{k+1}^a)^T with S_{k+1}^a = S_{k+1}^f (I - beta A A^T): the forecast square root times the transformation gives the analysis covariance square root. So we now have a very simple factorization scheme when there is only one scalar observation: I do not have to take a matrix square root, and a matrix square root is more expensive than scalar square-root operations. That is the advantage of the case m = 1.

How do we apply this m = 1 result to the general case? In the general case my observation is an m-vector, z = H x + v. Now consider v: its covariance is R, and R can be expressed as a product L L^T, a Cholesky decomposition. If I multiply the observation equation on both sides by L^{-1}, I get a new equation for the observation, z-bar = H-bar x + v-bar with z-bar = L^{-1} z, H-bar = L^{-1} H and v-bar = L^{-1} v. What is the point of this? The covariance of v-bar is E[v-bar v-bar^T] = L^{-1} Cov(v) L^{-T} = I. What does this mean? It means the transformed noise covariance is the identity. So what we have is z-bar = H-bar x + v-bar, where v-bar is a normal noise vector with zero mean and the identity matrix as covariance; therefore there is no correlation between z-bar_i and z-bar_j for i not equal to j, because these components are uncorrelated. Now I can write z-bar = (z-bar_1, z-bar_2, ..., z-bar_m) and consider each of these components as a single observation. There are m single observations, and I can successively assimilate each of them by the Potter's algorithm given in the previous case. So instead of assimilating them all together, which involves matrix operations, I can, by doing this transformation, convert the problem into one of assimilating m scalar observations.
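Here is a sketch of Potter's scalar-observation update as just derived; h is the 1-by-n observation row, r the scalar observation variance, and the function name potter_update is my own.

```python
def potter_update(x, S, h, r, z):
    """Potter's algorithm: assimilate one scalar observation z = h x + v, var(v) = r.
    Only scalar square roots are needed, no matrix factorization."""
    A = S.T @ h.reshape(-1, 1)                 # column vector (n x 1)
    gamma = float(A.T @ A)                     # scalar A^T A
    alpha = 1.0 / (gamma + r)
    beta = alpha / (1.0 + np.sqrt(alpha * r))  # root of beta^2*gamma - 2*beta + alpha = 0
    K = alpha * (S @ A)                        # Kalman gain (n x 1)
    x_new = x + (K * (z - float(h @ x))).ravel()
    S_new = S - beta * (S @ A) @ A.T           # S (I - beta A A^T)
    return x_new, S_new
```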
It is this ability to convert the problem, using the transformation by L^{-1}, that is called the whitening transformation. What is whitening? Before the transformation the components of v may be correlated (v_i and v_j may be correlated), but after the transformation the components v-bar_i and v-bar_j are uncorrelated. So here, if v-bar = (v-bar_1, v-bar_2, ..., v-bar_m) and you take any two of its components, they are uncorrelated; therefore I can consider the observation vector to be made up of m single uncorrelated observations, and if there are m single uncorrelated observations I can apply Potter's formula one at a time. So I start with the forecast and assimilate the first component to get an analysis; starting with that analysis I assimilate the second component and get another analysis; starting with that second analysis I assimilate the third observation, and so on. If you assimilate all m components one at a time, starting from the forecast, then in m steps you get the final analysis, which takes advantage of all m individual observations arising from the whitening transformation. In this way one can bring simplicity into the computation by using suitable transformations. We have now talked about several different types of transformation in this lecture: the square-root transformation of matrices, the QR decomposition used to reduce or compress the resulting square roots in a lossless way, and the notion of Potter's algorithm applied component-wise by means of what is called a whitening transformation, or whitening filter. What is a whitening filter? If I pass a correlated noise through a whitening filter, the output becomes uncorrelated, in the sense that v was correlated but the components of v-bar are uncorrelated. This ability to transform a noise vector with a correlated covariance matrix into one whose components are uncorrelated is called whitening, and by applying the whitening filter we are able to reduce the problem of assimilating a vector of m observations to assimilating a sequence of m scalar updates, one per single observation.
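A sketch of the whitening step followed by sequential scalar assimilation, reusing the hypothetical potter_update above; R is the m-by-m observation noise covariance.

```python
def whiten_and_assimilate(x_f, S_f, H, R, z):
    """Whiten with the Cholesky factor R = L L^T, then assimilate the m components
    of zbar = L^{-1} z one at a time; each whitened component has unit variance."""
    L = np.linalg.cholesky(R)
    Hbar = np.linalg.solve(L, H)      # L^{-1} H, done by a linear solve, not an inverse
    zbar = np.linalg.solve(L, z)      # L^{-1} z
    x, S = x_f, S_f
    for i in range(zbar.shape[0]):    # m scalar updates, one per whitened observation
        x, S = potter_update(x, S, Hbar[i, :], 1.0, zbar[i])
    return x, S
```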
So the assimilation step is broken down into m simple steps in which you take square roots of scalars rather than square roots of matrices. The total time required to compute matrix square roots can be much larger, so if you can apply Potter's algorithm it greatly simplifies the analysis; not only that, it also improves the conditioning of the calculation — the conditioning of the matrices — which in turn helps induce stability in the computations. With that we come to the end of the discussion of the covariance square-root filter. This material is covered in chapter 28 of our book, and there are plenty of calculations one has to work through in order to verify all of these results; that is why I have not given you any additional exercises — verifying all the calculations here is itself an exercise. I hope you have come to appreciate the questions involved: what type of square root to use, whether to use a full-rank or a reduced-rank approximation, how these transformations fit together in the square-root version of the covariance update, and, if you want to incorporate one observation at a time, how Potter's algorithm comes to the rescue. By using the whitening transformation with Potter's algorithm we can further simplify the assimilation of vector observations. That is the overall summary of this module. Thank you very much.