In this module 8.4 we are going to look at strategies for approximate nonlinear filtering using the so-called ensemble or reduced rank approximations; these have come to be called ensemble or reduced rank filters. Let us quickly review where we are. For the LQG problem, where the system is linear, the observations are linear functions of the state, we use a least squares criterion, and all the distributions are Gaussian, we could derive exact filter equations for the mean and the covariance of both the forecast and the analysis. But when the system is nonlinear, or the observations are nonlinear, or both are nonlinear, we run into difficulties, and we have seen those difficulties in a variety of ways. The difficulty essentially arises from the fact that distributions other than the Gaussian are not, or cannot be, specified simply by the first two moments. So our aim in nonlinear filtering is to work in the infinite dimensional space of distributions: what is the forecast distribution, what is the analysis distribution, how do the forecast and analysis distributions interact with each other, and how do we update these two distributions? We derived a sequential scheme for updating the forecast distribution as well as the analysis distribution exactly, using the Bayes framework, much like we did in the linear framework. So what is the difference? In the linear framework the problem is finite dimensional; there are only two quantities, the mean and the covariance. But the nonlinear filter is infinite dimensional, because we are trying to give an updating scheme for the distributions themselves. So it is not that we do not know how to solve nonlinear problems; we know theoretically how to solve them, we know the exact solution. Then what is the problem, what is the challenge? It is very difficult to compute the exact forms of these distributions; they can only be handled numerically. Given this difficulty we then moved down to computing approximate moment dynamics. We derived expressions for the evolution of the mean and the covariance, and at that juncture we met the so-called closure problems, which are typical of any and every nonlinear system. As a consequence we saw the zeroth order filter, which depends on a linearized approximation, we talked about the first order filter, which has come to be called the extended Kalman filter, and we also talked about the second order filter, which is a little more accurate than the first order filter. While these are meaningful approximations that have been used in several walks of life, they remain computationally severe for geophysical applications, because in geophysical applications the size of the problem is easily of the order of a million. So we have to deal with vectors and matrices of size a million or more, and with the ever increasing interest in finer grids the number of grid points in a typical global model may reach a billion very soon. The desire to have a more accurate representation using a computational grid with a smaller grid size is thus becoming one of the challenges.
These challenges cannot be met until we have more powerful computers. Given all these challenges around us, the reduced rank filters based on the ensemble idea, which is essentially a Monte Carlo idea, have become very popular. The idea is very simple: you do not need to write an adjoint, and you do not need to solve any least squares problems. The basic idea is that you have a model code; if you have a multiprocessor you pick several initial conditions from an appropriate distribution and run the model forward in parallel. Once the forecasts come out of, say, 50 model runs of the same model code, each starting from a different initial condition, you get a forecast ensemble generated from an initial ensemble. Once the forecast ensemble is generated, the forecast mean and the forecast covariance are computed as the sample mean and the sample covariance of this forecast ensemble. After having created the forecast ensemble, we then assimilate the observation into the forecast ensemble to create an analysis ensemble, and from the analysis ensemble we again start the model moving forward, creating the next forecast ensemble. This notion of creating a forecast ensemble, from there an analysis ensemble, and starting the model forward again from the given analysis ensemble is a sequential way of thinking, so we still remain within the sequential framework. The only difference is that we never look back; we simply keep moving forward, not on one single model run but on an entire ensemble of model runs; a sketch of this cycle is given below. That is the basic idea. Where does the reduced rank come from? Given the size of the problem, say a million, an ensemble based computation of a full rank covariance matrix would need of the order of a million ensemble members, and running a million ensemble members is impractical with today's technology. So what do they do? They pick a small ensemble, 50, 100, 200. By creating, say, 50 ensemble members of a model run I am going, using simple statistical principles, to estimate the mean and the covariance from those 50 members. Basic statistics tells you that if the number of samples is of the order of 50, the maximum rank of the sample covariance matrix can be no more than 50. So we are trying to approximate a matrix of rank a million by a matrix of rank 50 or 100 or 200; that is what is called the reduced rank approximation. If we can capture some of the important modes of behavior of the model, maybe we can carefully approximate the forecast covariance, the analysis covariance, the analysis mean and the forecast mean, and keep going forward. This is a class of approximations which is computationally simple and which is scalable; the scalability largely depends on the size of the problem and the power of the computers. Because of this flexibility, this class of ensemble based methods has become very popular in many of the operational centers around the world. So in this module we are going to provide a very broad overview of the current methodology that has come to be known as ensemble methods or reduced rank filters.
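Before getting into the statistics, here is a minimal sketch of the forecast-analysis cycle just described. The routines `model_step` and `analysis_update` are hypothetical placeholders for the model code and the assimilation step; the sketch only shows the bookkeeping of the cycle, not any particular assimilation scheme.

```python
# Minimal sketch of the ensemble (Monte Carlo) filter cycle described above.
import numpy as np

def ensemble_cycle(analysis_ens, observations, model_step, analysis_update):
    """analysis_ens: (n, N) array holding N ensemble members of dimension n."""
    for z in observations:
        # Forecast step: run every analysis member forward through the model.
        forecast_ens = np.column_stack([model_step(x) for x in analysis_ens.T])
        # Sample statistics of the forecast ensemble (rank at most N - 1).
        xf_mean = forecast_ens.mean(axis=1)   # forecast product: sample mean
        Pf = np.cov(forecast_ens)             # forecast product: sample covariance
        # Analysis step: blend the forecast ensemble with the observation z.
        analysis_ens = analysis_update(forecast_ens, z)
    return analysis_ens
```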
So what is the basic idea? Let f(x) be the density of a random variable x. In general x is a vector in R^n and f(x) is its probability density, with a mean and a covariance, but let us simplify and start with a real random variable. So let x be a real random variable and let f(x) be its density function, with mean mu and variance sigma squared. Let x_1, x_2, ..., x_N be N iid samples. What is an iid sample? Independent samples that are all drawn from the same distribution f(x); the distribution remains the same for every draw. You can think of the distribution as a black box: you ask for a number and it gives you x_i, with i running from 1 to N. If you draw N such samples from the same box representing the distribution f(x), those are called iid samples. From the N members of the iid sample I can compute the sample mean and the sample variance, and from our class on statistical estimation we have already shown that these estimators are unbiased: x_bar_N is an unbiased estimate of mu and s^2_N is an unbiased estimate of sigma squared. This is very simple, basic statistics, probably the first thing you learn in any fundamental statistics course. What are the basic properties of these estimates? The probability that the estimate of the mean using N samples differs from the true mean by more than epsilon in magnitude tends to 0 as N tends to infinity. What does that mean? The sampling distribution of x_bar_N (x_bar_N is an estimate, and an estimate is a random variable with a distribution of its own) concentrates more and more tightly around the true mean mu as N, the number of samples, goes to infinity; that is the consistency condition we have seen earlier. Likewise, if I consider the sample variance computed using the formula on the previous page, the probability that it differs from the true variance by more than epsilon also goes to 0 as N goes to infinity. So both the sample mean and the sample variance come closer and closer to the true mean and the true variance as the number of samples goes to infinity; this is a very simple, fundamental fact. Another way of saying this: the sample mean is random, and its variance is sigma squared over N, which goes to 0 as N goes to infinity; that is the content of the first statement. The second statement is likewise captured by the expression for the variance of the sample variance, which also goes to 0 as N goes to infinity. The standard error, which is the square root of the variance of the sample mean, is sigma over the square root of N; it is proportional to 1 over root N, and it goes to 0 as N goes to infinity. So that is another way of stating these facts.
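As a quick illustration of these consistency statements (a small numerical check, not something taken from the slides), one can draw iid samples and watch the sample mean, the sample variance and the standard error behave as claimed:

```python
# The sample mean and sample variance of N iid draws approach mu and sigma^2,
# and the standard error of the mean shrinks like sigma / sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
for N in (10, 100, 10_000, 1_000_000):
    x = rng.normal(mu, sigma, size=N)
    xbar = x.mean()                          # unbiased estimate of mu
    s2 = x.var(ddof=1)                       # unbiased estimate of sigma^2
    print(N, xbar, s2, sigma / np.sqrt(N))   # last column: standard error of the mean
```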
In fact many of the fundamental ideas of ensemble filters rest on this consistency property of the basic estimates of both the mean and the variance. So what is the only difference? In the context of meteorological systems we are going to be concerned with random vectors instead of random variables, but whatever holds for random variables also holds for random vectors, and that is what we will exploit. Another tool that we need is the following. Suppose X is a normal random vector with mean mu and covariance Sigma. I can do a transformation. What is the transformation? First compute the Cholesky factors of Sigma, which give Sigma = L L transpose; please remember we have already talked about the Cholesky decomposition. The Cholesky factors are also called square roots. So let Sigma = L L^T, where L is called the square root of Sigma and L is a lower triangular matrix. We gave the algorithm for the Cholesky decomposition in great detail when we were dealing with numerical methods for solving static deterministic inverse problems, so given Sigma I should be able to find L very easily. Now let Y be a standardized normal vector, with mean 0 and the identity as the covariance. So in this case X is a vector, mu is a vector, Sigma is a matrix, L is a lower triangular matrix which is a square root of Sigma, and Y is a standardized normal vector. Now I can relate X and Y through relation (3): Y = L^{-1}(X - mu), or X = mu + L Y. This transformation relates a unit normal random vector to a general normal vector with mean mu and covariance Sigma, and the verification of statement (3) is given in the box on the right hand side; you can easily verify it. So what does this tell you, and how are we going to use it? We have talked about what is called a whitening transformation. What is it? In general Sigma is a matrix with elements sigma_ij, where sigma_ij is the covariance of X_i with X_j, and in principle this need not be 0. Therefore if I have a vector X whose mean is mu and covariance is Sigma, the components of X may be correlated, with the covariance between them given by sigma_ij. But when the covariance matrix is the identity, the off-diagonal elements are 0. So Y is a Gaussian random vector whose components are uncorrelated, while X is a Gaussian random vector whose components are correlated. Relation (3) therefore gives you a transformation from a correlated random vector to an uncorrelated one, or from an uncorrelated random vector to a correlated one. The transformation from X to Y is called the whitening transformation. Why whitening? White noise does not have any correlation, and Y has components which are uncorrelated. So given a correlated vector, this is how you convert it to an uncorrelated vector. We utilized this whitening transformation in the context of Potter's algorithm, when we did the square root version of the covariance filter a couple of classes ago in this same module. Now, with these two tools, this is in principle all you need to be able to do an ensemble method.
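A minimal sketch of relation (3) in code, assuming a small two dimensional example made up for illustration; `numpy.linalg.cholesky` plays the role of the Cholesky factorization discussed above:

```python
# Sampling a correlated Gaussian vector via the "coloring" transformation
# x = mu + L y, and whitening back with y = L^{-1}(x - mu).
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 2.0]])
L = np.linalg.cholesky(Sigma)            # Sigma = L L^T, L lower triangular

y = rng.standard_normal(2)               # y ~ N(0, I): uncorrelated components
x = mu + L @ y                           # x ~ N(mu, Sigma): correlated components

y_back = np.linalg.solve(L, x - mu)      # whitening: recovers the uncorrelated y
assert np.allclose(y_back, y)
```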
So let us assume I have a stochastic model; that means I have a model code that somebody has developed, and I have observations. X_0 has the standard specification: X_0 is normally distributed with mean m and covariance P_0, w_k has mean 0 and covariance Q_k, and v_k has mean 0 and covariance R_k. We also assume X_0, w_k and v_k are mutually uncorrelated. These are the properties of the model and of the observations, and this is the basic information you need to be able to run any stochastic data assimilation scheme, in particular the Kalman filter. So this corresponds to the following setup: the model is nonlinear and I have simply assumed the observations are linear functions of the state. I could have assumed the observations are also nonlinear; I am simply giving one of the many possible combinations, linear with linear, nonlinear with nonlinear, nonlinear with linear, lots of combinations of choices between models and observations. So how does the ensemble method start? I have a model, I have observations, I have the underlying properties, I have the basic facts already described. What do I need? Initially I am given the initial distribution of X_0, with mean m and covariance P_0. So first do a Cholesky factorization of P_0, which is S_0 S_0 transpose; please remember we have already seen Cholesky. Now let y_0(i), with i running from 1 to N, be the N samples drawn from the standard normal distribution. Look at this: there are two indices coming in here. In y_0(i), the 0 is the time index (we have to be very careful) and the i is the ensemble index. Later we will also need an indication of whether a quantity is a forecast or an analysis. So you can see I need the time, I need to know whether it is an analysis or a forecast, and I also need to keep track of which member of the ensemble I am dealing with. So each variable will carry a time index, an ensemble index, and a notation for whether I am concerned with the forecast or the analysis. Once you separate all these things, keep in mind how a particular notation is formulated. In our notation time is always a subscript, the ensemble index goes inside parentheses, and the forecast or analysis label goes as a superscript. So if I have a variable x, there are three places to attach labels: in parentheses for the i-th member of the ensemble, as a subscript for the time k, and as a superscript to say, for instance, that it is a forecast. So there are three ways of notating, to be able to distinguish the different quantities. Now, the y_0(i), i running from 1 to N, are N samples drawn from a box that is standard normal, so I get y_0(i) for i from 1 to N, and I now have a copious supply of samples. What is N? In principle N could be a million if the vector X_0 is of size a million, but even though the vector size may be a million I do not have the luxury of picking such a large sample. So we cannot do large sample theory; we have to do finite sample, small sample theory, and that is what we are going to be talking about. Now, what do I have? I have the initial distribution and I have N members of the sample drawn from a standard normal. I am now going to convert the standard normal ensemble members into ensemble members from the initial distribution, using the connection between X and Y from the previous slide. So now I am going to introduce another symbol.
So we will use x for the state and xi for the ensemble members. So xi_hat_0(i): what does it mean? This is the i-th member of the initial ensemble at time 0, and the hat indicates that it is the initial analysis ensemble. That is equal to m_0 (I will call the initial mean m_0 to be consistent) plus S_0, the Cholesky factor, times y_0(i). Why does this make sense? Please go back to the earlier slide: equation (3) says X = mu + L Y. If Y is from the unit normal, then in order that X be normal with mean mu and covariance Sigma, I need to transform Y using the affine transformation mu + L Y, where L is the Cholesky factor of Sigma. I am using the same principle here: S_0 is the Cholesky factor of P_0, y_0(i) is the i-th member of the ensemble drawn from the standard normal, and m_0 is the mean. So I have now created N samples. Now look at this: it could be quite expensive to create these samples, because I may be dealing with a million dimensional system. Once I have created the samples, what is the initial analysis mean? It is x_hat_0,N, the sample mean of the initial ensemble, and it equals m_0 + S_0 y_bar_0,N. So if y_bar_0,N is the sample mean of the y ensemble, then the sample mean of the initial ensemble at time 0 is m_0 + S_0 y_bar_0,N. Now what is the basic property? The y's are standard normal with mean 0, so the claim is that y_bar_0,N, given by this expression, tends to 0 as N tends to infinity, because the y's are samples drawn from a standard normal with mean 0. So these are the properties. What is the first thing we require for creating an ensemble? I first need a very good random number generator, to generate capital N samples of random vectors from the standard normal distribution. Once I have this, I can do the transformation. What is involved here? I have to do a matrix-vector multiplication, then a vector-vector addition, and I have to repeat this N times; that gives the total cost of computing the initial ensemble. What does this initial ensemble give you? I get an estimate of the initial ensemble covariance, that is, the analysis covariance at the initial time. You can see how: take the difference between the i-th ensemble member and the sample mean, form the outer product, sum over all N members, and divide by N minus 1. That is the sample estimate of the analysis covariance at time 0. Now, using the transformation, I can rewrite line 1 as line 2, and the quantity within the parentheses goes to the identity I as N goes to infinity. Why? Because the quantity in the parentheses is the sample estimate of the covariance of y, and the actual covariance of y is I; so in the limit it goes to I. So for a large sample the whole thing reduces to S_0 S_0 transpose, which equals P_0. Therefore the estimate of the analysis covariance at time 0 converges to the actual covariance P_0 as N, the number of samples, goes to infinity. Why are these limits important? I keep repeating this because the quantities calculated using ensemble members are ensemble statistics; there is a true statistic, and there is always going to be a difference between the two. The difference becomes smaller and smaller as the number of samples N becomes larger and larger.
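Putting the last two steps together, here is a small sketch of the initial ensemble generation xi_hat_0(i) = m_0 + S_0 y_0(i); the dimensions and the particular P_0 below are made up for illustration only:

```python
# Generate the initial analysis ensemble from N(m_0, P_0) and check that the
# sample mean and sample covariance approach m_0 and P_0 as N grows.
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 500                            # state dimension, ensemble size
m0 = np.zeros(n)
P0 = np.array([[2.0, 0.5, 0.0],
               [0.5, 1.0, 0.2],
               [0.0, 0.2, 0.8]])
S0 = np.linalg.cholesky(P0)              # Cholesky factor of P_0

Y0 = rng.standard_normal((n, N))         # N standard normal sample vectors
ens0 = m0[:, None] + S0 @ Y0             # initial analysis ensemble, shape (n, N)

xbar0 = ens0.mean(axis=1)                # sample mean  -> m_0 as N -> infinity
P0_hat = np.cov(ens0)                    # sample covariance -> P_0 as N -> infinity
```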
So if you restrict the number of ensemble members to be finite, let us say 100 or 200, that is all: your estimate is going to have an error, and that is an error you have to live with, not because you want to, but because of the computational limitations. It is not that I like to commit this error; I do not know how else to proceed, because the other approximations are not only approximations, they are also very expensive. For example, in the extended Kalman filter, if you look at the update for the forecast covariance, it still requires two matrix multiplications, and each matrix multiplication costs of the order of n cubed operations. So even though it is approximate, it costs a lot to compute the approximation. So what is the idea of the ensemble approach? If it is approximate anyway, why spend so much money? Can I compute an approximation in a quick and dirty way, can I compute an approximation cheaply? That is what leads to the ensemble ideas. So if I know how to create the initial ensemble, I can fast forward in time. What can I assume now? Let us assume that at time k I have the analysis ensemble (hat denotes analysis, superscript f denotes forecast): there are N members, and a typical member is xi_hat_k(i). Now, if xi_hat_k(i) is the analysis ensemble, the analysis ensemble is related to the analysis mean x_hat_k,N and the square root of the analysis covariance through y_k(i). So what is y_k(i)? Let us come back to this. Let P_hat_k,N be the N-sample approximation of the analysis covariance, and likewise the approximation of the analysis mean; P_hat_k,N can be expressed as the product of Cholesky factors S_hat_k and S_hat_k transpose. Therefore I know the mean, I know the Cholesky factor, and I can relate the analysis samples to samples from the standard normal distribution: the y_k(i) belong to the standard normal distribution. This formula is important; it is the same formula that holds at the initial time, and if you put k equal to 0 you get exactly what we had before. So if I can do it at the initial time, I can assume, by induction, that I can do it at time k. So I have created an initial ensemble; what is it? It is a swarm of capital N points in the little n dimensional state space. So I have an analysis ensemble at time k; now what do I do? I have to create a forecast ensemble for the next time. Going from time k to k + 1, I have this analysis ensemble, and I would like to create the forecast ensemble. What do I do? I take a particular analysis ensemble member, xi_hat_k(i), run it through the model, and get the forecast member xi^f_{k+1}(i). That is, the i-th member of the forecast ensemble at time k + 1 is obtained by taking the i-th member of the analysis ensemble at time k, running it through the model M, and adding a noise. Now let us talk about this noise: you do not add the same noise to every ensemble member.
So what is w_{k+1}(i)? It is the i-th realization of the model noise, drawn from the model noise generator N(0, Q_{k+1}). So the forecast is random because of two things: one, the previous analysis is random; two, this is the additional randomness we are injecting. That is how I generate the forecast ensemble: from the analysis ensemble at time k I create the forecast ensemble at time k + 1. In the standard nonlinear filter or the Kalman filter there is only one mean and only one covariance; I update the mean and I update the covariance. Updating the mean is not expensive; updating the covariance is very expensive. Let me remind you: in the classical Kalman filter P^f_{k+1} = M P_hat_k M^T + Q_{k+1}, and in Kalman filter applications this is the real killer. Why? I have to do two matrix multiplications, each of which takes O(n cubed) time, and n is of the order of a million. We have already seen that one such matrix computation on a petaflop machine, when n is a million, would take about 11 and a half to 12 days, so this step alone would take close to 24 days. We simply do not have the time; that is why we are looking for all kinds of approximations that can be computed in a lifetime. So let me summarize the scheme as it is given here: at time k I am given the analysis mean, the analysis covariance, and the ensemble itself; once I have the ensemble I can compute these quantities. I now run the analysis ensemble through the model and create the forecast ensemble; once I have created the forecast ensemble I can compute the forecast mean and the forecast covariance. This is what I was trying to tell you: the forecast covariance is a sum of N outer product matrices. The size of this matrix is n by n, but I have only capital N samples, so the rank of P^f_{k+1} estimated using N samples is no more than capital N, and capital N is much smaller than n. So by construction these matrices are rank deficient; we always work with reduced rank matrices. Computationally these reduced rank matrices are ill conditioned; you cannot invert them, you cannot do many things with them, so you have to be very clever in how you manipulate them. And why do we settle for these kinds of crude approximations? Because for geophysical problems these are the only things we can afford with today's technology; that is the bottom line. So I had the analysis at time k and I now have the forecast at time k + 1; I have moved things forward. Now what do I do? I have to do a data assimilation step. There are N forecast members. Let us go back and understand this: at time k + 1 I have N forecast ensemble members, a typical member being xi^f_{k+1}(i), and I have only one observation. Please remember the observation is z_{k+1} = H_{k+1} x_{k+1} + v_{k+1}; that is what is given to me, and we already know v is normal with covariance R. So what do we need to do? I have N strands of the forecast and only one observation. I am not going to go into all the details, but if you assimilate the same observation into every strand of the forecast, your analysis covariance will be underestimated.
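Before turning to the analysis step, here is a minimal sketch of the forecast step just described, with a toy map `M` standing in for the real model code and a fresh model-noise realization added to each member; all names here are illustrative:

```python
# Forecast step: xi^f_{k+1}(i) = M(xi_hat_k(i)) + w_{k+1}(i), one noise draw per member.
import numpy as np

rng = np.random.default_rng(3)

def M(x):                                # toy stand-in for the nonlinear model code
    return np.tanh(1.1 * x)

def forecast_step(analysis_ens, Q):
    n, N = analysis_ens.shape
    Lq = np.linalg.cholesky(Q)
    W = Lq @ rng.standard_normal((n, N))             # model-noise realizations
    forecast_ens = np.column_stack(
        [M(analysis_ens[:, i]) for i in range(N)]) + W
    xf_mean = forecast_ens.mean(axis=1)              # forecast sample mean
    Pf_N = np.cov(forecast_ens)                      # rank at most N - 1: reduced rank
    return forecast_ens, xf_mean, Pf_N
```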
If the analysis covariance is underestimated, that will lead to divergence of the filter, and computationally that creates major challenges and problems. To avoid that, we introduce the notion of what is called a virtual observation. You get one observation, z_{k+1}, and I am now going to create a set of virtual observations. What is z_{k+1}(i)? It is the i-th virtual observation, equal to the actual observation, given by the satellite, the radar, whatever it is, plus an i-th realization of the observation noise. Please understand that z_{k+1} already contains some noise; I cannot separate z_{k+1} from its noise, because the noise is unavoidable and inseparable, but I do know the properties of the noise. So what are we doing? We set up another random number generator, N(0, R_{k+1}), and create noise vectors v_{k+1}(i); I add v_{k+1}(i) to the actual observation to get the i-th member of the virtual observations, for i running from 1 to N. Then what do I do? I simply apply the Kalman filter equation: the analysis at time k + 1 for the i-th ensemble strand is xi_hat_{k+1}(i) = xi^f_{k+1}(i) + K [ z_{k+1}(i) - H_{k+1} xi^f_{k+1}(i) ]. Here I would like to point out a couple of things, and I am sure a careful reader will already have noticed: you started with a nonlinear model, but you are using a formula that is meant for the linear case. In this data assimilation step we are doing two things: we are not using the actual observation, we are creating virtual observations, and we are not doing any nonlinear update, only a linear one. That should not be surprising: even when the model is nonlinear and the observations could be nonlinear, the principle we are applying is the best linear unbiased estimate, BLUE, so the estimator still has a linear structure. That is why I use this form. So what is this? This is the estimate of the analysis; this is the forecast, which plays the role of the background; the term z_{k+1}(i) - H_{k+1} xi^f_{k+1}(i) is the innovation; and I multiply the innovation by a Kalman gain. So where is the rub? How do you compute the Kalman gain? As you know, the Kalman gain depends on the forecast covariance, the observation error covariance and the operator H. The operator H is very well known, the observation error covariance is known exactly, but the forecast covariance is known only approximately. So some ingredients are good and some are not, and we have to combine the good and the bad to create what we want. You can readily see that K, computed using the formula we already know, uses the forecast covariance; the forecast covariance is already approximate, so the K used here is also approximate, and that is another source of error. So let us now talk about some aspects of the analysis. I have done the analysis using the expressions on page 10, and I am now computing the analysis statistics. The analysis mean is given by this formula; if I substitute the expression for the virtual observations from the previous page, the mean of the virtual observations is given by that expression.
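The virtual observation device and the linear update can be sketched as follows; the gain K is assumed to be supplied (its computation is discussed next), and the names are illustrative:

```python
# Stochastic (perturbed observation) analysis step: one virtual observation per
# ensemble member, then the linear BLUE-type update with a given gain K.
import numpy as np

rng = np.random.default_rng(4)

def analysis_step(forecast_ens, z, H, R, K):
    n, N = forecast_ens.shape
    m = z.shape[0]
    Lr = np.linalg.cholesky(R)
    # Virtual observations: z(i) = z + v(i), v(i) ~ N(0, R), one per member.
    Z_virtual = z[:, None] + Lr @ rng.standard_normal((m, N))
    innovations = Z_virtual - H @ forecast_ens       # z(i) - H xi^f(i)
    analysis_ens = forecast_ens + K @ innovations    # xi_hat(i) = xi^f(i) + K [...]
    return analysis_ens
```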
To compute the covariance, I first have to compute the error. The error in the analysis has this structure: look at it now, v_{k+1}(i) is the i-th member of the observation error and this is its sample mean, so this difference is the anomaly, and that anomaly goes to 0 as N goes to infinity. These are some of the things one has to remember when manipulating these quantities; they can be very small, but we still need to carry all these small quantities along. Now I compute the analysis covariance. The analysis covariance is given by this expression; I substitute the expression for the analysis error of the i-th ensemble member, which gives this, and when simplified it becomes this. In this expression I want you to remember that R_{k+1,N} is the N-sample estimate of the observation noise covariance, and as N goes to infinity it tends to R_{k+1}. The other quantities follow readily from these expressions. I also want you to notice a couple of things: these terms come from the observation noise, this term comes from the forecast, and this is the cross covariance between the observation noise and the forecast. These are all finite sample quantities. How do we know these finite sample approximations make sense? To understand whether a finite sample approximation makes sense, I have to consider the asymptotic versions of these formulas. What gives you the right to use the finite sample expressions, with their errors? If we can show that the errors in the estimates computed from finite samples go to 0 as N goes to infinity, that gives you confidence: yes, I know I am committing an error, but I am committing it not because I want to; it is the direct result of the small number of samples. If I had the luxury of going to a larger number of samples all my errors would vanish, and that is a comforting thought. That is the reason why we do asymptotics, not because in practice the asymptotic limit is attained, but because asymptotic analysis gives you a guarantee that when you compute estimates with a finite number of samples you can think of them as being close to the truth, and how close you are to the truth depends on the number of samples.
So we have to go back and forth between asymptotic analysis and finite sample analysis all the time, and statisticians do this very cleverly, better than anybody else. I am now going to do some of this asymptotic analysis. As I said, this term is the anomaly in the observation noise, and its sample covariance tends to R_{k+1}, which I have already alluded to. I also want you to look at the cross terms: the third and the fourth terms in slide 12 involve the cross covariance between the forecast error of the i-th ensemble member and the anomaly of the i-th realization of the observation noise. It can be shown that this cross covariance goes to 0 as N goes to infinity; therefore, in the previous slide, the third and the fourth terms automatically go to 0. Therefore the expression for K, the Kalman gain, which I have postponed for this long, is given in terms of the N-sample approximation of P^f_{k+1}. Now look at the notation: xi^f_k(i) and xi_hat_k(i) denote the forecast and analysis ensemble members, but when it comes to P^f or P_hat I attach an N: P^f_{k,N} and P_hat_{k,N}. What are these? They are N-sample estimates. I want to be able to distinguish: if I write P^f_k, that is the actual forecast covariance; if I write P_hat_k, that is the actual analysis covariance; if I attach an N as a subscript, that is an N-sample estimate. So the gain uses the N-sample estimate of the forecast covariance times H_{k+1} transpose, and so on; it is the same formula as comes from the linear analysis, with the same quantities replaced by their N-sample approximations, and R_{k+1} we already know. Now let me go a little further. R_{k+1}, we are assuming, is full rank; P^f_{k+1,N} is low rank; so the product H_{k+1} P^f_{k+1,N} H_{k+1}^T is low rank, but I am adding that low rank matrix to a full rank matrix and taking the inverse. Does the inverse exist? Yes. Who guarantees that the inverse exists? The Sherman-Morrison-Woodbury formula: you can view this as a low rank perturbation of a full rank matrix, so the inverse can be shown to exist, and using it I can compute K. So K is an approximation. Now, what is the actual data assimilation step? That is on page 10, and I want to revisit it in the light of these calculations. I have now calculated K; if I have K, I have the virtual observations, I know everything, so I can compute the analysis ensemble, and from the analysis ensemble I can compute the analysis mean and the analysis covariance.
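A minimal sketch of the N-sample gain computation just described; the innovation covariance is solved rather than explicitly inverted, and it remains invertible because the full rank R is only perturbed by a low rank term:

```python
# N-sample Kalman gain K = Pf_N H^T (H Pf_N H^T + R)^{-1}.
import numpy as np

def ensemble_kalman_gain(forecast_ens, H, R):
    Pf_N = np.cov(forecast_ens)           # N-sample forecast covariance (rank <= N - 1)
    S = H @ Pf_N @ H.T + R                # innovation covariance: full rank R + low rank term
    # K = Pf_N H^T S^{-1}; S is symmetric, so solve a linear system instead of inverting.
    return np.linalg.solve(S, H @ Pf_N).T
```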
So let us now look at how these things are actually done: what do the operational centres do? They have a large code representing the model, and they run this model code in parallel for N initial conditions, with N of the order of 50 or 100; I do not think anybody in the world runs large models with more than about 200 ensemble members, not because they do not know how, but simply because they do not have the computing power. Why? You have to make a forecast every day: every day you issue forecasts one day in advance, two days in advance, three days in advance, five days, six days, seven days in advance, and you keep doing this every day. So you have at most 24 hours available. Of those 24 hours, the observation people have to collect all the observations from radars and satellites and bring them to a common platform, and the modelers have to run the model forward in time for the 50 or 100 ensemble members. The modelers deliver the model output, the observation people deliver the observations, and then the data assimilation person comes to work; the data assimilation person cannot start until all the observations are available and all the model runs are available. At that point he takes both and creates the analysis ensemble, and once he creates the analysis ensemble he can create the analysis mean and the analysis covariance. It is this analysis mean and analysis covariance, for one day, two days, three days, five days ahead, that is announced as the forecast product. So how much time is available to the modelers? Say six hours. How much time is available to the observation people to get all the observations on board? Maybe twelve hours. How much time does the data assimilation have? Six or eight hours. You have to finish all of this within a fixed number of hours, and each step depends on the previous one; the data assimilation person is the last to come to work. If you divide the 24 hours into these three parts you can see how busy the operational centers are; they do not waste an iota of a second. That is what happens in all the major forecasting centers such as NCEP in Washington DC, ECMWF, the British Met Office, the Japan Meteorological Agency, and I am sure the India Meteorological Department as well, depending on what kind of forecasting and data assimilation system they have implemented and what kind of computers they have. I hope I have made it clear how the data assimilation is done within the context of ensembles. Now some comments. When N is small, and N is always small, the cross product terms you remember from earlier will not be close to 0; that means there will be a large error in the N-sample estimate of the forecast covariance, and the forecast covariance error directly translates into Kalman gain error. That is something I want you to be cognizant of; that error goes to 0 only when the number of samples becomes large. I also want to make one more point. The forecast covariance is the bottleneck in Kalman filtering: in the extended Kalman filter it is the Jacobian times P_hat_k times the Jacobian transpose. When n is a million and I am using 100 samples, the rank of P_{k,N} is no more than capital N, which is far less than n; therefore it is rank deficient. I am trying to reinforce all these things by quantifying some examples. And please understand that the extended Kalman filter expression is itself an approximation, and even that approximation costs money; I have already alluded to this. Moreover D_M(x_k), the Jacobian, is a million by million matrix that is not available ready-made; I only know the analysis, and I have to evaluate the elements of this million by million matrix along the trajectory. That is another hidden cost that does not show up immediately. So in the case of the first order or the second order filter, P_k is full rank.
But it is still only approximate: full rank, but not exact. I would like you to keep all these things at the back of your mind. I am also now going to argue why we need virtual observations; I have already alluded to this. If you use the actual observation instead of virtual observations, you use the same single observation for every ensemble member. If you then compute the analysis error, it takes this form, and the analysis covariance matrix computed from it is given by this expression. We have discussed all this in much more detail in our book, but the point is that the analysis covariance then does not contain the K R K^T term, and this lack of the K R K^T term leads to underestimation. And if the analysis covariance is underestimated, the forecast covariance is also underestimated, because the forecast covariance depends on the analysis covariance. That is the reason why we use virtual observations: to make up for this deficiency that results in underestimation. The ensemble filter was known as a Monte Carlo method for approximating the Kalman filter for nonlinear systems in the control theory literature as early as the 1960s, but it was reinvented in the mid 1990s within the context of geophysical applications. Within that context, the person who made the ensemble approach popular is the Norwegian geophysical scientist Geir Evensen, around 1995. Ever since, ensemble methods have become popular and most of the centers have adopted them. When Evensen originally published the paper he did not use virtual observations; people who came after him quickly realized that if you use the same observation for every ensemble strand, the system leads to underestimation, and underestimation leads to numerical difficulties. To prevent such numerical difficulties from arising, an artificial device was introduced, and that is what the virtual observation is all about: it is simply a mathematical trick by which we can restore some sanity in a world where everything is approximate. What is the next idea? Some might say: instead of trying to modify the observation, why not simply change the Kalman gain, so that I do not have to mess with the observations? Why is messing with the observation considered bad in some circles? The observation is given by mother nature; mother nature is seen only through the observations, so why corrupt what mother nature tells us? Philosophically, some people are opposed to creating virtual observations because it is not consistent to meddle with what mother nature is telling us. Therefore they were looking for an alternate approach, and the alternate approach depends on the following. I have the analysis ensemble, I have the forecast ensemble, I have the innovation; please remember I am using the same observation. While using the same observation, can I use another weight matrix W such that I can restore sanity in the analysis ensemble? That is a very beautiful mathematical question, and it was answered in the affirmative: several classes of ensemble filters came out of this line of questioning.
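Before naming the two families, the underestimation argument from a moment ago can be written compactly in terms of ensemble anomalies; this is a sketch in the lecture's notation, not a verbatim copy of the slide expressions:

```latex
\begin{align*}
\text{same observation for every member:}\quad
  \hat\xi(i)-\bar{\hat\xi} &= (I-KH)\bigl(\xi^{f}(i)-\bar\xi^{f}\bigr)
  &&\Rightarrow\ \hat P_N \to (I-KH)\,P^{f}\,(I-KH)^{T},\\
\text{virtual observations } z(i)=z+v(i):\quad
  \hat\xi(i)-\bar{\hat\xi} &= (I-KH)\bigl(\xi^{f}(i)-\bar\xi^{f}\bigr)+K\bigl(v(i)-\bar v\bigr)
  &&\Rightarrow\ \hat P_N \to (I-KH)\,P^{f}\,(I-KH)^{T}+KRK^{T}=(I-KH)\,P^{f},
\end{align*}
```

where the last equality holds for the gain K = P^f H^T (H P^f H^T + R)^{-1}; dropping the virtual observations therefore omits the K R K^T term and systematically underestimates the analysis covariance.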
So the approach of doing the assimilation using virtual observations is called the stochastic method; the approach where you use only the single actual observation but change the weight is called the deterministic method. So there are deterministic methods and stochastic methods of implementing ensemble filters. The stochastic method relies on creating virtual observations; what is its advantage? You simply create the virtual observations and you are done: you can close your eyes and say, I have a decent approximation, limited only by the value of N. In the deterministic method I have to do a little more calculation, but I do not have to mess with the observations; the observations are the word of God. That is essentially the approach taken here. So how do we modify the Kalman gain instead of the observations? If this is the analysis ensemble and this is the mean of the analysis ensemble, then from them I can compute the analysis anomalies, and from the anomalies I can compute the analysis covariance by this expression. Please remember, I am talking like a broken record: I am following the same path as before, but with a different criterion for mixing the forecast information and the innovation information. W is the matrix by which I am going to change the Kalman gain to suit what I want, without having to generate virtual observations. Therefore, if the error structure is given by this, the posterior covariance, which is the analysis covariance, is given by this expression. Let us assume the forecast covariance has a square root factorization: this is the forecast covariance, and let us pretend a square root factorization of it is available. In that case, by replacing the forecast covariance by its square root and combining these two, I get that the analysis covariance is given by this product, as you can readily see. (I believe a parenthesis is missing on the slide here: the factor I minus W times S transpose should be enclosed in parentheses as a whole; please add that, and then the expression is correct.) With that, the question is how to choose W, and I am going to quickly talk about the method for choosing W. This idea appeared in the electrical engineering systems literature as early as 1968, and it was reinvented in the meteorological community in the late 1990s, after Evensen put forth his ideas. It is very hard to tell how much they knew of the previous literature, but in any case the ideas have been reinvented. So this is the analysis covariance: the forecast covariance minus this correction; you may remember this from the classical Kalman filter equations. What is the basic idea? I would like to factorize this expression. Why do I want to talk about factorization? Because there is a close connection between covariance square root implementations and reduced rank ensemble implementations; they are very closely related. So, assuming I have a square root factorization, I am now going to call A the transpose of H times the forecast square root, A = (H S^f)^T. So I am going to factor the forecast covariance and define this new quantity.
So this is the factorization of the forecast covariance, and this is the new quantity A I am introducing. With these changes in notation, consider P hat. What am I doing now? I am dropping the time index, dropping the subscripts. Why? Because all of this is happening at one given time, and I do not want to clutter the expressions by dragging all the subscripts along; so I drop the subscripts from the covariance matrices. Therefore P hat is the product of the square roots plus this correction. These two expressions are essentially one and the same: this is the covariance version and this is the square root version. With the square root version the expression becomes this, and after simple matrix algebra it can be written as this expression. Now what do I want to do? I would like to factorize the quantity within the square brackets here, and that is the factorization to be done. Let me remind you: P hat at time k + 1 becomes this, and now that it has been expressed in this way I would like to proceed as follows. If the forecast covariance can be expressed in square root form, the analysis covariance must also be expressible in square root form; and if I express the analysis covariance in square root form, I can identify W, the quantity I need, very easily. Now look at the notation: I am using several different symbols, because there are many quantities involved here, S^f, S hat, and S. S^f is the square root of the forecast covariance, S hat is the square root of the analysis covariance, and S is a generic square root, which I am going to use to define the square root of this particular quantity. So A is the matrix just defined, A^T A is a Gramian, and A^T A + R is a symmetric matrix; let S be its square root, the generic square root, and let R = F F^T. Therefore the inverse of A^T A + R is given in terms of the inverse of that square root. I am now going to consider a matrix of this type: I already know the square root of A^T A + R, I am going to substitute it, and then I am going to multiply and divide. What can we always do? If I have a, I can write a = a + b - b, adding and subtracting, or a = a b / b, multiplying and dividing; these are standard tricks in mathematics for creating new expressions. So what am I doing? I am multiplying and dividing by the appropriate matrix and its inverse. If I do that, I get an expression which, when simplified, becomes this; see the following slides for the details. The simplifications are given here, and using this simplification I can now substitute and verify my expression for the analysis covariance. Yes, I know I am going a little fast, but I want you to look through this patiently. It is all grunt work from matrix algebra, and if you want to be thorough in understanding data assimilation methods you have to be ready for this grunt work; making you ready for the grunt work is part of the aim of these lectures.
Therefore, this is the original expression, and by substituting the factorized expression from the previous slide I get this. Now what do I want? This is the expression I get from the Kalman filter, and this is the expression I get with W; I want these two to be equal. If I equate the two, everything except W is known. I am computing the same quantity in two different ways, and if I compute the same quantity in two different ways the results must be equal. So I equate these two: one expression has a W, the other does not; the expression with W is the new way of deriving the analysis covariance, the one without W is the classical way. What is our aim? I want to change the Kalman gain, but I do not want to change the result. So how do I use W? I still want the same result as I would have gotten had I used the standard Kalman gain, but without having to use virtual observations. That gives rise to an expression for W, which is given by this. Therefore, if you use this W as the new gain in our analysis step (let me go back to the analysis step on page 17), I will get the same result as the Kalman update, but without having to worry about virtual observations. So this corresponds to the deterministic approach to ensemble based filtering.
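As an illustration of the deterministic idea, reproducing the Kalman analysis covariance without perturbing the observations, here is a sketch of one well known square root update of this family: the ensemble transform form, which is closely related to, though not necessarily identical to, the W derived on the slides. The small check at the end verifies the identity S_hat S_hat^T = (I - K H) P^f used in the derivation above.

```python
# Deterministic square-root analysis sketch: with A = H S^f, choose a transform
# T with T T^T = (I + A^T R^{-1} A)^{-1}; then S_hat = S^f T satisfies
# S_hat S_hat^T = (I - K H) P^f, with no virtual observations needed.
import numpy as np

def sym_inv_sqrt(M):
    """Symmetric square root of inv(M) for a symmetric positive definite M."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def sqrt_analysis(Sf, H, R):
    A = H @ Sf                                   # observed forecast square root (m x N)
    C = A.T @ np.linalg.solve(R, A)              # A^T R^{-1} A  (N x N)
    T = sym_inv_sqrt(np.eye(C.shape[0]) + C)     # T T^T = (I + C)^{-1}
    return Sf @ T                                # analysis covariance square root

# Small consistency check on made-up matrices.
rng = np.random.default_rng(5)
n, N, m = 4, 3, 2
Sf = rng.standard_normal((n, N))
H = rng.standard_normal((m, n))
R = np.diag([0.5, 0.8])
Pf = Sf @ Sf.T
K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)
S_hat = sqrt_analysis(Sf, H, R)
assert np.allclose(S_hat @ S_hat.T, (np.eye(n) - K @ H) @ Pf)
```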
Some people call these ensemble filters, some people call them ensemble Kalman filters; I think it is a misnomer to call it an ensemble Kalman filter, because classically the Kalman filter refers to the LQG setting, and here nothing is linear, quadratic and Gaussian; everything could in general be nonlinear. So within the meteorological context my preferred way of describing it is simply as an ensemble filter or a reduced rank filter. Now, is an ensemble filter the same as a reduced rank filter? No: there are a million ways in which one can create a reduced rank filter, and the ensemble method is one of them. In my view reduced rank filters form a larger class of filtering mechanisms in which we deal with approximations of different degrees, and ensemble methods are simply one way of creating such approximations meaningfully, by simply running the model forward in time. What is the basic point here? If I have the muscle power to run the model in parallel a thousand times, so much the better; there is not much brain power involved, just simple statistics. So by substituting muscle power for brain power, ensemble methods rest on making quick, simple estimates. Of course there are nuances: do I use virtual observations, do I do other things. But in principle the total amount of computation one has to do in ensemble methods is dominated by the time required for the model runs, and that is where parallel computers come in. Do you make 200 runs of the same model, starting from different initial conditions, on the same computer? No: technology has provided very powerful supercomputers consisting of many individual units. So what do they do in practice? They have large parallel machines; they copy the model code onto many of these processors and let all of them run at nearly synchronous speed, starting from different initial conditions.

By exploiting parallel computing technology and combining it with the ensemble methodology, we can nowadays create very meaningful forecasts based on data assimilation techniques, and this is becoming one of the major workhorses of the forecasting industry, especially in weather forecasting, all around the world. The British Met Office spearheaded this, NCEP is following, as are Meteo-France, the Japan Meteorological Agency and many other centers. They have a research unit as well as a production unit, and the two work together: while they run the old version operationally, they also bring in the new codes, and for some time they run both systems in parallel in order to test them. Once the ensemble methods are well validated, they slowly provide a sunset clause for the old methods, and the new methods take over the generation of the daily forecast. That is the present situation. If you want more details you can refer to the papers by Hamill and to the paper by Tippett; references to all these papers are available in our textbook, which has rather extensive annotations to the literature. Now let p be a number between 1 and n, where n is the dimensionality of the state space; if p is less than n, I am going to be talking about rank-p reduced rank implementations of the square root versions of the covariance filters. So we are seeking rank-p square root filters. What is the use of square roots? Square root filters generally improve the condition number, and a better condition number improves the quality of the numerical computation; reduced rank generally reduces the computational cost, since if the dimension is large the cost is large. So you can think of this as a reduction largely driven by the desire to reduce the computational cost, with the added advantage of improving the condition number. So what do we do? Let us talk about the initialization stage, that is, the mean and the covariance. Let v_1, ..., v_n be the eigenvectors of P_0 corresponding to the eigenvalues lambda_1, ..., lambda_n, so that the (lambda_i, v_i) are the eigenpairs. I can collect all the eigenvectors into a matrix V and place all the eigenvalues along the diagonal of a matrix Lambda; this is the classical eigenvalue decomposition, which can be written as P_0 = V Lambda V^T, where V V^T = V^T V = I. I do not know how many times we have seen this; the eigenvalue decomposition is very fundamental. We have also seen that there are two methods for finding a square root: one using Cholesky, another using the eigenvalue decomposition. Here I am interested in reduced rank, so I am using the eigenvalue decomposition as the tool to implement reduced rank approximations; that is the pathway.
I can express Λ as Λ = Λ^{1/2} Λ^{1/2}. Associating one factor of Λ^{1/2} with V on each side, I can write P0 = S0 S0^T, where S0 = V Λ^{1/2}. Therefore P0 = S0 S0^T is the full square root factorization, with S0 as one factor and S0^T as the other. Even though this looks like a Cholesky factorization, I obtained S0 not as a Cholesky factor but as the product of V and Λ^{1/2}; I want you to recognize that this is a different way of doing the factorization. So far there is no approximation.

Now I would like to come to the approximation. We have ordered the eigenvalues in decreasing order, so let us take the top P eigenvalues λ1, ..., λP and the corresponding top P eigenvectors v1, ..., vP. Why? We have already alluded to this, but I will say it again. If P0 is a covariance matrix, its trace equals the total variance over all the components, and that total variance is the sum of λi for i = 1 to n. Now, in what ways can the eigenvalues be distributed? For some matrices the spectrum is nearly flat (call this case 2); for others it decreases rapidly (call this case 1). We are interested in the second picture: if the eigenvalues decrease very fast, I can pick a P such that the sum of λi for i = 1 to P, divided by the sum of λi for i = 1 to n, is greater than 90%. What does that mean? The first P components together explain 90% of the overall variance hidden in the system. It could be a 90% approximation, an 85% approximation, or a 99.9% approximation; you decide the level of approximation you are willing to accept. Given that level, say 98% or 99%, I can compute the P that satisfies this inequality: the ratio of the sum of the first P eigenvalues to the total sum of all the eigenvalues is greater than or equal to a previously chosen threshold of 80%, 90%, or 99%. That is how I am going to pick P.

If, on the other hand, the eigenvalues decay very slowly, what does it mean? You may need nearly the entire eigensystem, and P may be very close to n. So these are case 1 and case 2; I am considering case 1. In case 2 you may not be able to reduce the rank simply by chopping off the trailing modes, because even the last eigenvalue may be large. It all depends on the intrinsic properties of the eigensystem of the covariance matrix in question.
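A minimal numerical sketch of this mode-selection and factorization step, assuming NumPy; the function name, the default threshold, and the synthetic P0 in the usage example are illustrative choices of mine, not the lecture's:

```python
import numpy as np

def reduced_rank_sqrt(P0, threshold=0.90):
    """Rank-P square root of a covariance matrix P0 via eigendecomposition.

    Keeps the smallest number P of leading eigenpairs whose eigenvalues
    explain at least `threshold` of the total variance (trace of P0).
    Returns the n-by-P factor S0p and the chosen rank P, with
    P0 approximately equal to S0p @ S0p.T .
    """
    lam, V = np.linalg.eigh(P0)          # symmetric eigendecomposition, ascending order
    idx = np.argsort(lam)[::-1]          # reorder to decreasing eigenvalues
    lam, V = lam[idx], V[:, idx]
    lam = np.clip(lam, 0.0, None)        # guard against tiny negative round-off

    frac = np.cumsum(lam) / lam.sum()    # cumulative fraction of explained variance
    P = int(np.searchsorted(frac, threshold) + 1)

    S0p = V[:, :P] * np.sqrt(lam[:P])    # n-by-P factor: V_P @ diag(sqrt(lambda_1..lambda_P))
    return S0p, P

# Usage with a synthetic covariance (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
P0 = A @ A.T
S0p, P = reduced_rank_sqrt(P0, threshold=0.95)
print(P, np.linalg.norm(P0 - S0p @ S0p.T))
```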
So I am now going to assume that my system has a spectrum very similar to curve 1. If I do that, then (please go back to the previous page) recall what S0 is: S0 is an n by n matrix. I am going to keep the first P columns of it and drop the last n - P columns. The matrix built out of the first P columns of S0, call it S0(:, 1:P), is the n by P matrix consisting of the P dominant modes. What do I mean by dominant modes? The modes whose eigenvalues are the largest; the ones I am dropping account together for less than 10%, or 5%, or 1% of the total. So I am dividing the eigensystem into dominant and non-dominant modes and collecting all the dominant modes together. In that case, for some P in the range 1 to n, I can approximate P0 by S0(:, 1:P) S0(:, 1:P)^T; this is the rank-P approximation of P0.

So what are we now going to do? We are going to initialize a reduced rank square root filter with the mean and, instead of the full initial covariance square root, with the first P dominant modes of the covariance square root. I hope you see the difference; this is a very beautiful idea.

Now I am going to create a forecast. What is the forecast covariance? The full-rank forecast covariance has a square root implementation, which we covered in the last lecture on the covariance square root filter: I can rewrite it as S_{k+1}^f (S_{k+1}^f)^T, where S_{k+1}^f is obtained by placing the propagated analysis square root next to the square root of the model noise covariance, giving an n by 2n matrix in the full-rank case (this is the expansion we already talked about). But recall that we do not know the full analysis square root S_k; we know only its rank-P approximation. That is the important point: we have been content with the rank-P approximation. Similarly, the second block is the square root of the model noise covariance, and we may know only a rank-Q version of it. So I have to deal with a rank-P approximation in the first block and a rank-Q approximation in the second block.
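Putting the forecast step in symbols, as a restatement under the notation above (I write the noise rank as q to avoid a clash with the covariance symbol Q_k; that relabeling is mine, since the lecture uses Q for both, and M_k denotes the, possibly linearized, model dynamics):

$$
P_{k+1}^f \;=\; M_k\, P_k^{a}\, M_k^{\mathsf T} + Q_k
\;\approx\; S_{k+1}^f \big(S_{k+1}^f\big)^{\mathsf T},
\qquad
S_{k+1}^f \;=\; \Big[\; M_k\,\widehat S_k^{\,a} \;\;\Big|\;\; \widehat Q_k^{1/2} \;\Big] \;\in\; \mathbb{R}^{\,n\times(P+q)},
$$

where $\widehat S_k^{\,a}$ is the available n-by-P rank-P analysis square root and $\widehat Q_k^{1/2}$ the available n-by-q square root of the model-noise covariance.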
So in total this matrix S_{k+1} becomes a matrix of size n by (P + Q), but our calculations are always rank P. How do you reduce a matrix of size n by (P + Q) to an n by P matrix? That is what is called the rank reduction process. The rank reduction process depends on whether P + Q is less than n or greater than n; P is the rank of the analysis covariance square root, Q is the rank of the square root of the process noise covariance, and since P and Q are two integers, P + Q could be less than n or greater than n. I am going to consider case (a), where P + Q is less than n.

What are we going to do? We are going to fall back on the singular value decomposition. Now you know why we covered the singular value decomposition at great length, why we covered the Cholesky decomposition at great length, and why we covered the eigenvalue decomposition at great length: if you understand these matrix methods well, that is, the basic principles underlying the matrix algorithms, you will find them used repeatedly in any advanced method for dynamic data assimilation, and data assimilation in a stochastic context is in general an advanced method in which several algorithms from matrix theory are used again and again.

Step one: we already have the matrix S_{k+1}, which is n by (P + Q). Let V be an orthonormal matrix of eigenvectors of the Gramian S_{k+1}^T S_{k+1}. We are essentially doing the same singular value decomposition as before. Please recall the basic idea from our static case: there H was an m by n matrix and we had two kinds of Gramian matrices, H^T H and H H^T, and by doing an eigenvalue decomposition of one I derived the eigenvalue decomposition of the other using the SVD. The same principle applies here, with H replaced by the matrix S_{k+1} defined on page 25. So by replacing H with S_{k+1} I can carry out all the analysis I did for H, because the two situations are mathematically similar. Doing the eigenvalue decomposition of this Gramian gives me the eigenvalues and eigenvectors of S_{k+1}^T S_{k+1}; once I know that decomposition, I can obtain the decomposition of S_{k+1} S_{k+1}^T using the standard rule of the SVD, and by theory these two matrices have the same set of nonzero eigenvalues. I can define the matrix of eigenvectors accordingly, and by multiplying the two you can readily verify the relation. Therefore, take the matrix V defined here and keep its first P columns; it can be shown that the rank-P square root of the forecast covariance I want is obtained by multiplying S_{k+1}, the n by (P + Q) matrix I already have, by those first P columns of V. This is the required rank-P approximation.
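A minimal sketch of case (a), where P + Q is less than n, assuming NumPy; the function and variable names are mine, not the lecture's. The small (P + Q) by (P + Q) Gramian is eigendecomposed, which is equivalent to an SVD of the factor itself, and the leading P directions are used to compress it:

```python
import numpy as np

def reduce_rank(S, P):
    """Compress an n-by-m square-root factor S (m = P + Q) to rank P.

    Eigendecompose the small m-by-m Gramian S.T @ S (equivalent to an SVD
    of S) and keep the P leading eigenvectors U_P; then S @ U_P is an
    n-by-P factor whose outer product is the best rank-P approximation
    of S @ S.T.
    """
    G = S.T @ S                        # small (P+Q) x (P+Q) Gramian
    w, U = np.linalg.eigh(G)           # eigenvalues in ascending order
    order = np.argsort(w)[::-1]        # reorder to decreasing eigenvalues
    U_P = U[:, order[:P]]              # leading P eigenvectors (right singular vectors of S)
    return S @ U_P                     # n x P reduced square root

# Illustrative shapes only: n = 100 states, P + Q = 12 columns, keep P = 8.
rng = np.random.default_rng(1)
S = rng.standard_normal((100, 12))
S_red = reduce_rank(S, P=8)
print(S_red.shape)                     # (100, 8)
```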
Case (b) is similar; I do not want to go over the details by reciting the whole thing, so I leave case (b) as an exercise for you. Either way, I can carry out the same rank reduction by this process. Now I come back to the data assimilation (analysis) step, the last step in this reduced rank implementation. This step is very similar to its full-rank counterpart in the square root version of the Kalman filter.

So let me summarize the algorithm. Compute A: H is known and S_{k+1}^f, the rank-P square root, is known. Note that if you set P equal to n you get the complete full-rank filter, and if you pick P anywhere between 1 and n you get a reduced rank filter. This algorithm is beautiful in that sense: by varying the value of P you get quite a variety of approximations. So this algorithm gives you not one approximation but a family of approximations, parameterized by the value of P, with P lying in the interval 1 to n. Depending on how much money you can afford to spend, how much accuracy you want, and on the behavior of the eigenvalues, you can create a wide family of approximations. So you compute the matrix A, then you compute B, then you compute C; we have already seen these calculations in the context of Potter's implementation of the square root covariance filter, and the same quantities arise in the covariance square root filters, where we do exactly the same kind of computations (an illustrative sketch of such a square root measurement update appears at the end of this passage). I can now give an expression for a reduced rank approximation to the Kalman gain, and once I have a reduced rank approximation to the Kalman gain I have a reduced rank update. Please understand that here I am using only a single observation: there is no ensemble here; it is simply a deterministic reduced rank implementation, as opposed to the ensemble methods for reduced rank implementation, which can be either stochastic or deterministic. The rank-P square root of P_{k+1} is then given by C, and that is the end result. I have gone from the forecast step to the analysis step, and that completes the cycle in the computation of the filtering equations. This material is derived from chapter 30 of our book, where we discuss many of these things in great detail. The analysis of case (b), which I did not do explicitly, simply involves some other types of matrix manipulations; matrix algebra is the key here.

So with this, what have we accomplished? We have covered quite a few methods for doing data assimilation in stochastic dynamic models. The stochastic dynamic model can be linear or nonlinear, and it may have model errors; the model errors are always represented by stochastic noise, and that is the weakest link in this process. If I knew the model error, if I knew how to correct it, I would simply have corrected it.
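Returning to the analysis step summarized above: the matrices A, B, C refer to the slides, and I will not reproduce that algebra here. As a hedged, self-contained illustration of the same idea, here is the standard Potter square root update for a single scalar observation, applied directly to a rank-P square root; NumPy is assumed, the function name is mine, and the scalar-observation restriction is a simplification of this sketch rather than the lecture's exact algorithm:

```python
import numpy as np

def potter_update(x_f, S_f, z, H, R):
    """Potter-style square root analysis step for one scalar observation.

    x_f : (n,)   forecast mean
    S_f : (n, P) rank-P forecast covariance square root, P_f ~ S_f @ S_f.T
    z   : float  scalar observation
    H   : (n,)   observation row vector (1-D array)
    R   : float  observation error variance
    Returns the analysis mean and the updated rank-P square root.
    """
    phi = S_f.T @ H                        # (P,) projected observation operator
    a = 1.0 / (phi @ phi + R)              # inverse of the innovation variance
    gamma = 1.0 / (1.0 + np.sqrt(a * R))   # Potter's scalar factor
    K = a * (S_f @ phi)                    # (n,) reduced rank Kalman gain
    x_a = x_f + K * (z - H @ x_f)          # mean update
    S_a = S_f - gamma * np.outer(K, phi)   # S_a = S_f (I - a*gamma*phi*phi^T)
    return x_a, S_a
```

One can check that S_a @ S_a.T equals (I - K H) applied to the rank-P forecast covariance S_f @ S_f.T, which is exactly the property the reduced rank square root update is after.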
Coming back to the model error: the corrected model does not have as much error as the uncorrected model. So when do you consider the model error to be random? When I have an uneasy feeling that I have not incorporated several small-scale processes. And when do you neglect those small-scale processes? Suppose you discretize a primitive equation model on a grid. We all know from the basic sampling theorem that with a given grid length delta x I cannot represent processes whose wavelength is less than twice the grid length. The sampling theorem essentially tells you that when you discretize a continuous quantity you may miss the high-frequency, short-wavelength part. The primitive equation model covers the whole spectrum of processes; when you discretize it, you truncate: you can resolve frequencies only up to a particular level and cannot resolve processes at higher frequencies. Therefore we have some idea that the neglected terms are high-frequency terms, and one way to approximate this unknown high-frequency contribution is to add a random noise, for example white noise. Adding white noise to compensate for the model error is largely a reflection of our belief that we have captured all the low-frequency processes and that it is the high-frequency part we cannot capture because of the finite grid size. That is the credence one can bring to bear for representing the model error by a stochastic noise such as white noise. Observations, too, always have errors.

So we have quite a variety of data assimilation problems. The classical linear problems were solved by Kalman in 1960 and by Kalman and Bucy in 1961: Kalman did it in discrete time, Kalman and Bucy in continuous time. The discrete-time derivation was rigorous; the continuous-time derivation used many heuristic limiting arguments, and it was soon realized that the only way to obtain a mathematically sound derivation of these nonlinear filters is within the framework of what is called stochastic calculus, or Itô calculus. So there is a parallel development of nonlinear filtering within the stochastic calculus and stochastic modeling literature. Because we do not assume a background in stochastic calculus, we restricted ourselves to discrete time. We have given the complete solution of the LQG problem, the Kalman filter; a complete solution of the discrete-time filter equations for nonlinear models and nonlinear observations; several types of full-rank approximations, namely the zeroth-, first-, and second-order filters; and then the approximate reduced rank filters. Among the reduced rank filters coming out of ensemble methods we saw two versions, one using random (virtual) observations and one using a modified deterministic gain that differs from the Kalman gain; we also discussed deterministic reduced rank implementations of covariance square root filters. I am assuming this gives you a sense of the breadth and depth of the literature that has come to be called filtering within the context of data assimilation. As for the continuous-time theory, even though the ideas were known earlier, by 1969-1970 the entire problem of nonlinear filtering in continuous time using stochastic calculus was thoroughly understood, so in that sense the problem is essentially solved.
But the implementation of those continuous-time ideas is still a big challenge; in principle, nonlinear problems, whether you attack them through the continuous-time formulation or the discrete-time formulation, continue to be a challenge in the computational world. With that we conclude our discussion of data assimilation using nonlinear filtering techniques. Thank you.