Shall we start? Let's see. So thank you very much, Chef, for the invitation to talk. Again, my name is Colin Grudzien, and I'm an assistant professor at the University of Nevada, Reno, where I study data assimilation, where I try to work on theoretical methods that address the operational problem that we actually see in geophysical models. What I want to talk about is the Bayesian approach to data assimilation, which has become widely adopted for a variety of reasons, but in part because it provides some unification of a lot of the tools that we see in data assimilation, whether from statistical estimation, from nonlinear optimization, or, increasingly, techniques from machine learning. This theoretical approach can encompass many of these methods, and so it provides a very good tool set for understanding the methods and how we address the actual operational problem. As a quick outline, the following things will be covered in this lecture. We'll introduce the notion of the observation-analysis-forecast cycle, where a lot of the work that I'll be presenting today is inspired and motivated by operational short-range weather prediction. We'll then look at the mathematical model for such a cycle of prediction, which is known as a hidden Markov model, and particularly at how the Bayesian analysis is applied to such a hidden Markov model. Subsequently, we'll take a look at the Bayesian maximum a posteriori, or MAP, formalism, particularly with the Gaussian error model analysis, where two widely used and widely understood schemes that can be fit into this type of analysis are the ensemble Kalman filter and the ensemble Kalman smoother. That will close out part one, and then in part two we'll start looking at how we extend this formalism, particularly using the 4D formalism that has traditionally been used in variational approaches, say 4D-VAR. But we want to take a step back from that and focus, in particular, on a few estimators that are derived in a sequential, ensemble-variational formalism following the type of analysis in the EnKF, which is a little bit more towards my own specialty and expertise. One instance of this is the ensemble Kalman smoother, a well-established method that was developed by Bocquet and Sakov, and then a new work that I've just recently put into submission, currently in open review in Geoscientific Model Development of the EGU: this is the single-iteration ensemble Kalman smoother, which was developed by myself and Bocquet. So that's just a little sketch of where we're going, and yeah, let's get into it. And likewise, if there are any questions, please feel free; I think I have time to address any clarification as we are moving through this, if there's anything that needs to be asked on the spot.

I am very sorry I interrupted you. The point is that all questions will come to me, and then, yes, I will address you. We wouldn't like somebody to interrupt you during your presentation. And then after the 45 minutes, you can see these and be prepared for the question-and-answer time, when you will answer all questions.
Thank you very much. Very good, cool. So let's see, back to my slides. So for our motivation: in an application such as short-range weather prediction, we know that data assimilation provides a means to sequentially and recursively update our forecast with newly incoming information, where a Bayesian approach to DA has become widely adopted because it provides at least a partially unified treatment of the tools from statistical estimation, nonlinear optimization and machine learning for handling such a problem. And what we want to do now is introduce our mathematical formalism for this, where we'll suppose that we have dynamical physical states, which are written in a vector xk, where k is going to correspond to some time index, tk. And abstractly, we're going to represent the time evolution of these states with this nonlinear map, this curly M, where xk is given as a map M of xk minus one, lambda and some eta k, where xk minus one is the vector of physical states at an earlier time, k minus one, lambda is some vector of uncertain physical parameters on which the time evolution of our model depends, and eta k is some additive stochastic noise term representing the errors in the model for our physical process. And so the goal then is that we want to estimate the random vector xk, having a prior distribution on the previous states and on our vector of parameters, and some knowledge again of the physical process. And we will not assume that we actually know what the model errors are, because of course, if we knew what the errors were, we'd simply correct for them in our model itself. But without knowledge of the error realizations themselves, we'll suppose that we have knowledge of how they're distributed in general. And so at time k minus one, so one step back, we'll make a forecast for the distribution of this physical state with our prior knowledge, including the physics-based model, which in some sense encapsulates part of our Bayesian prior. This will be the general formulation, but for the rest of this lecture I want to focus on a slightly simpler setting that's just easier to discuss and understand, where we'll suppose that lambda is a known constant, so it's not going to be estimated simultaneously, and this forecast model will be perfect in the sense that we'll neglect this noise term. Though we should mention that this formalism extends directly to handling joint parameter estimation with the model states, as well as handling significant modeling errors in the formulation. Suppose that, in addition, in this forecast cycle we have real-world observations, these yk's, that are related to the physical states, so yk is given as a map H of xk plus some epsilon k. This H is our forward observation operator; it's a nonlinear map that relates the states that we wish to estimate to the values that are actually observed in practice. And typically we see in many geophysical models that the observation dimension, this ny, tends to be much smaller than the state dimension of the model, nx. So this information is generally considered to be sparse, and observations are not one-to-one with the unobserved states. So as an inverse problem, this, as we all understand, tends to be a very ill-posed inverse problem.
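To make the state-space formalism concrete, here is a minimal Python sketch of such a hidden Markov model, assuming a toy Lorenz-96-style process model and a sparse linear observation operator; the model choice, dimensions, and noise levels are illustrative assumptions, not the operational configuration discussed in the talk.

```python
import numpy as np

def lorenz96_step(x, dt=0.01, F=8.0):
    """One Euler step of the Lorenz-96 model: a stand-in for the nonlinear map M."""
    dxdt = (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F
    return x + dt * dxdt

def observe(x, obs_idx, R_sqrt, rng):
    """Sparse observation y = H(x) + eps, observing only some components with Gaussian noise."""
    return x[obs_idx] + R_sqrt @ rng.standard_normal(len(obs_idx))

rng = np.random.default_rng(0)
nx, ny = 40, 10                       # state dimension much larger than observation dimension
obs_idx = np.arange(0, nx, nx // ny)  # observe every fourth state component
R_sqrt = 0.5 * np.eye(ny)             # observation error standard deviation

x = rng.standard_normal(nx)           # uncertain initial condition
for k in range(1, 11):
    x = lorenz96_step(x)                    # forecast: x_k = M(x_{k-1})
    y = observe(x, obs_idx, R_sqrt, rng)    # data:     y_k = H(x_k) + eps_k
```

The point of the sketch is only the structure: the hidden state is advanced by the process model, and at each time a lower-dimensional, noisy observation of it is produced.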
And it's highly difficult, let's say, in the direct sense to treat this as an inverse problem. But the Bayesian formulation provides a philosophically somewhat natural interpretation of this as an inference problem, where we want to understand the likelihood of the states in our model that could correspond to such data. We'll then assume that we have some epsilon k; again, this is representing some sort of observation noise, the errors in our actual measurements, where again we don't know what these error realizations are, but we'll assume that we understand how they're distributed. And so therefore, again, at our time tk, we have our forecast distribution, generated, let's say, by using our prior on the previous states and using our physical model, M. We have an observation, yk, that comes with some amount of uncertainty; we'll assume that there's some error in these measurements. And we wish to find, then, what is the posterior distribution for xk conditioned on this vector of observations, yk. And so the recursive estimation of the distribution for xk, conditional on yk, can often be described as an observation-analysis-forecast cycle, in which, again, we assume that we have this forecast prior from the physics-based model, generated by the numerical simulation, where in this conceptual diagram we're imagining that time is proceeding left to right. And we have, let's say, this disk representing perhaps an ensemble forecast and the spread of the ensemble, or the uncertainty of that forecast. And so, using the forecast prior and now what we call a likelihood, that is, this observation, where this disk is representing some of the spread of the observation errors and the uncertainty, we want to estimate the Bayesian update of the prior to the posterior as we condition on the observation. Now, this orange disk is representing our posterior estimate, having combined both sources of information and hopefully having reduced the uncertainty by combining both pieces of information for this estimate. Now, the recursive estimates of the current state can be performed in this fashion, but a related question also regards our past states, where new observations from the future, of course, give information about the model states at past times. This can produce a retrospective posterior estimate for the past states, where we can imagine taking the observations at various times and interpolating this posterior backwards in time for the model states using all the information in this window. The recursive estimation of the present state, in the Bayesian stochastic filtering literature, is known as the filtering problem. And the conditional estimation of a past state given future observations is known as the smoothing problem. So what we want to note here is that, in this formulation, our filtering density for the current time is actually just a marginal of the joint posterior over all states in what we call the data assimilation window, the DAW. This is the period, let's say, of observations and lagged simulated states that we will be studying. And so what we're referring to now is this is actually our filtering density. This is the density for, let's say, X3, where time three is our current time, and we want to condition the density for this model state on the observations Y3, Y2 and Y1.
But this can be written identically in terms of, well, the marginal of the joint posterior on this side, where we have a joint density for X3, X2 and X1, so all these times conditioned on the observations up to the current time, but where we are averaging out the possible realizations at the past times. And so, again, this is to make the connection between these two types of problems, because a smoothing estimate can actually be produced in a variety of ways, where we exploit different formulations of the Bayesian problem. Particularly, we might only estimate a marginal, as on the left-hand side, or perhaps the entire joint posterior, as in the integrand, and various formulations of the smoothing can be performed by targeting different densities as such. The way that we split the posterior will be one of the themes in this: how we can utilize the different formulations of this smoothing problem to create efficient algorithms. So for the rest of this lecture, we'll consider how we can utilize this Bayesian MAP formalism to efficiently solve the filtering and/or the fixed-lag smoothing problem. So we recall again: we have our physical process model and we have our observation model. And let's denote the process model states and observation model states with this slice notation, where if we have a K index that's less than this L index, we'll refer to this slice of XL through K as just the time series of the model states, or respectively the time series of the observation states. And we want to assume now that the sequence of observation errors, that is, these epsilons as they range in time up to an arbitrary leading state and off to infinity, will be independent in time. That is to say, the observation errors at one time give no information about the observation errors at another time, which may or may not be a good assumption in practice, but this is part of the framework of this sort of hidden Markov model. So in the above, this is, let's say, one type of hidden Markov model, where again we suppressed this model error term, but we could easily generalize this and include that once again and still reside within this framework. What we say is that these dynamic state variables are known as the hidden variables, because again, they're not directly observed; they're only indirectly observed through these Y variables. And these assumptions then actually determine what we call the Markov probability densities for the hidden variables. A Markov process is something that's known as a memoryless process, in the sense that it's the probabilistic analog of an initial value problem from the deterministic analysis of ODEs, PDEs, things like this, where you have an existence and uniqueness result, and knowing the past state determines the next state.
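Spelling out the relation just described in symbols, the filtering density at the current time (written here for time three, as in the example above) is a marginal of the joint smoothing posterior over the data assimilation window; this is simply a restatement of the marginalization the speaker describes, in generic notation:

```latex
% Filtering density as a marginal of the joint posterior over the DAW
p(x_3 \mid y_3, y_2, y_1)
  = \int \!\! \int p(x_3, x_2, x_1 \mid y_3, y_2, y_1)\, \mathrm{d}x_2 \, \mathrm{d}x_1
```

The integrals average out the possible realizations of the past states x_2 and x_1, leaving the density for the current state alone.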
And so, respectively, what we're considering here is: if we have, let's say, a joint density for the model states from some initial time all the way through some leading time L, well, we can actually use this set of assumptions, this Markov model assumption, to decompose it in a quite elegant way. Just through direct conditional probability, if we want to shave off this leading time, we can write this as the conditional density given the past time series of model states from L minus one back to our initial time, times the joint density on the past model states. But the Markov assumption, and particularly this perfect model assumption, guarantees more, because this really is now the deterministic kind of setting for the model. This is to say, if we actually have knowledge of the state at the last time, that particular outcome completely determines the conditional probability for the state at the current time; it is then independent of the rest of this slice, all the remaining time indices back in the past. And so we can then write this joint density in terms of the conditional density for the current time given the last time, times the joint density for all past states. And this is the Markov property, whereby this becomes actually a very elegant representation: we just do this recursively, follow this argument several times over, and then, voila, the joint density for the whole time series can in fact be written directly in terms of a prior density representing our uncertain initial data for the initial value problem, basically the probabilistic representation of our uncertainty in the initialization of the physical model, times this chain of what we call Markov transition probabilities. That is, as K ranges from one to L, we have the conditional density for XK given the last time. And so together this actually gives the complete representation of the joint density under these hypotheses. And similarly, if we are looking at the observation model, we have this independence assumption on the observation errors above. If we want to write, let's say, a conditional density for the observed data at time K, conditioned on XK and conditioned on the previous time series of observations, well, we can simply write this as a conditional density on the model state alone. And this is because of the way the observed data is related to the model states and the fact that, under these assumptions, we are saying that the observation errors at different times are independent from each other. So knowing this outcome is sufficient to completely determine the density for the current observation, independent of the past time series of observations.

Colin, I am very sorry for interrupting you, it's Alec. I am not sure that everybody knows the Markov process. Could you just very briefly mention what a Markov process means, and its associated probabilities and so on? Very briefly, because I know that the participants, some of them are applied, I mean, they are not mathematicians. If it is possible, just very simply, yes. Thank you very much.

Certainly. So I mean, the way that you can really think about this is that it's really an analog of the initial value problem.
If we're thinking about, let's say, d dt of x equal to some f of x, and let's say you have the function x at zero equal to some initial data x naught, this is our traditional deterministic initial value problem. And we could include, again, time dependency, various parameters, anything like this. But this is to say now: suppose that we have uncertainty on our initialization of the model, and this is representing, let's say, maybe prototypically (we're going to be studying a lot of Gaussian models), some sort of bell curve that represents the possible values that we might initialize a model forecast with. Well, in the initial value problem, in the deterministic setting that we're setting up here with this perfect model assumption, if you actually knew a particular outcome for this uncertain initial data, existence and uniqueness says that this determines the model state at subsequent times as we forecast it with this model M. But it's with respect to the uncertainty of our initialization, which we represent with our prior knowledge. This is to say, given this uncertain initial data, we can, with the initial value problem, push this forward in time with only the dependence on the knowledge of which particular outcome had occurred. But without the knowledge of what had occurred, this is then treated probabilistically, where now we're writing this as a sequence of chains in this sense. If we knew what this was, we would write this exactly, but without exact knowledge of the initialization, we write this probabilistically as a sequence of these, well, chains of the probability. Is that helpful?

Yes, yes. Okay, very good, very good.

Let's see. Okay, so back to the slides. So the notion here is then, with these two models, we want to consider how we will estimate this filtering density, that is, this density for the current model state conditioned on the entire time series of observations from the leading time back to our first observation, using this Bayesian MAP formalism. And so, if we just follow the definition of the conditional density (this is just a standard axiomatic statement of probability), we can write our conditional density for the current model state given the past time series as equal to the joint density for the time series of observations together with our model state, divided by the density for the observation time series. And now, using again the same Markov and independence assumptions for the process, we can imagine taking from the left-hand side and shaving off the leading observation, and associating the model state with the past time series altogether as a sort of joint event. And if we condition this, well, then here we have again what we just looked at on the last slide: we have a conditional density for the current observation given our current model state and the past time series, times this joint density for our current model state and the past time series, not including the observation at time L. And again, with knowledge of this model state, we can simply discard the conditioning on the past time series, because it's sufficient to know this particular model state.
And so we reduce it to this form on the right-hand side. And once again, if we follow the same arguments, if we want to take a conditional on XL given our past time series, we could write that just as such, and then we have the density for the past time series; and respectively, in the denominator, we can write our conditional for the current observation given the past time series, times the density for the past time series. Canceling these terms, we have this form right here that we want to study: that is, where we have a likelihood of the observed data, a sort of forecast prior for the model, and this denominator term, which is just a nuisance term. Where, on the last slide, this is actually a complete description of the filtering cycle. That is, we have our filtering density that we'd like to study, and in terms of Bayes' law, this is given in terms of the following terms. We have our posterior estimate for the hidden states at the current time given all observations in the time series; that is, our conditional estimate given all of our past and present knowledge from the observed data. And this is the likelihood; again, this is the conditional density for how likely the current observation is given our model forecast. And subsequently this term is what we can refer to as a model forecast prior. That is, this is the conditional density for our current model state given all of our past data, where, if we suppose that we had computed the posterior probability density at a past time, that is, the conditional density for X at time L minus one given observations up to L minus one (so this is the filtering density just shifted one index back in time), well, we can write this model forecast probability density on this side. Again, we actually take the last posterior, we multiply it times the Markov transition kernel, this conditional probability for the current state given our last state, and we integrate out, averaging over the possible realizations of the last model state. And so this really is the exact analog, in that sense, of the initial value problem: this step is representing our model forecast, given whatever our best estimate of the initialization is, and in this case that is the last filtering density. And then, lastly, we have this nuisance denominator term, which is a marginal of this sort of odd joint density, that is, a joint density for our current observation and our current model state, given the past time series, where we've simply integrated out the hidden variable. That is, we can write it as a conditional density for our current observation given our current model state, times a conditional density for our current model state given the past time series, but again where we've integrated out the hidden variable. And that's the key point, and that's why we're going into some of these nuts and bolts of the mathematics: this, again, is the free argument in this density. This is the model state, and we're looking at the probability density for a model state in terms of this alternative formulation, given the Bayesian analysis on the hidden Markov model.
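For readers following along without the slides, the relations being described can be summarized in hidden Markov model notation; this is a sketch using generic symbols (x for hidden states, y for observations, L for the current time, with the slice notation of the talk), assumed from context rather than copied from the slides:

```latex
% Markov factorization of the hidden states and independence of the observation errors
p(x_{L:0}) = p(x_0) \prod_{k=1}^{L} p(x_k \mid x_{k-1}), \qquad
p(y_k \mid x_k, y_{k-1:1}) = p(y_k \mid x_k)

% Bayes' law for the filtering density: likelihood times forecast prior over a nuisance normalization
p(x_L \mid y_{L:1})
  = \frac{p(y_L \mid x_L)\, p(x_L \mid y_{L-1:1})}{p(y_L \mid y_{L-1:1})}
  \propto p(y_L \mid x_L)\, p(x_L \mid y_{L-1:1})

% The forecast prior is generated from the last filtering density by the model (Chapman--Kolmogorov)
p(x_L \mid y_{L-1:1}) = \int p(x_L \mid x_{L-1})\, p(x_{L-1} \mid y_{L-1:1})\, \mathrm{d}x_{L-1}
```

The proportionality statement in the middle display is exactly the recursion used for the MAP estimation discussed next.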
The denominator does not actually depend on this argument at all, this free variable. And the only terms that actually depend on this argument are, again, the likelihood and the forecast. Typically, the nuisance term, the denominator, is just mathematically intractable, and you typically have no real way of representing it. But the denominator, again, is independent of the hidden variable, and its purpose is really only to normalize the integral of the posterior density to one as we integrate over all possible model states. And so instead, if we just read this as a proportionality statement, we say the density for our filtering estimate of the current model state, given the past time series up to and including the current observation, can be written proportionally just as the likelihood of the current data given our model forecast, times the forecast from our last filtering density. So then we can devise this Bayesian maximum a posteriori estimate as finding a particular choice, let's say XL bar, that maximizes the posterior density on the left-hand side, but written in terms of these two right-hand-side components, because maximizing this up to proportionality is absolutely fine: again, the denominator is just a normalization term. And so we only have to handle these terms on the right-hand side, which is the elegance of this formulation. And it's actually a more general derivation than what we're presenting right here, because you can include modeling errors, you can include various other terms, and this can encompass a much more complex model, but we're taking just the simple view in this lecture. So again, for purposes of maximizing the posterior density, this denominator just leads to terms that we can neglect, just constants with respect to, let's say, an optimization problem or a statistical estimation problem. And so, in order to actually compute this MAP estimate sequentially and recursively in time, we want to find a recursion in proportionality, as we've written above, to express our estimation exactly through this recursive proportionality. Generally, this has no analytical solution, and this is partly because we haven't put very many assumptions on this at all. If you wanted to take a pure Bayesian approach, you could have very arbitrary error distributions, you could add in modeling errors, you could add in parameter estimation. But let's impose a little bit of a constraint on this to make it a little simpler. Let's suppose for the moment that this process model is actually just a linear transformation, and respectively, this forward observation operator is also just a linear transformation. And if we suppose furthermore that the prior on our initial data, representing the uncertainty in the initial condition, is just some Gaussian with some prior mean and some prior covariance, and respectively, the likelihoods of the observations, that is, the conditional density for an observation given the model states, is just a Gaussian centered at the observed model states with the observation error covariance, then these assumptions are sufficient to actually write the posterior density as Gaussian at all times.
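As a sketch of what these linear-Gaussian assumptions look like written out (the symbols for the linear operators, prior moments, and filtered moments below are generic notational choices, not necessarily those used on the slides):

```latex
% Linear-Gaussian hidden Markov model, in the perfect-model form used in the lecture
x_k = \mathbf{M}_k\, x_{k-1}, \qquad
y_k = \mathbf{H}_k\, x_k + \epsilon_k, \qquad
x_0 \sim N(\bar{x}_0, \mathbf{B}_0), \qquad
\epsilon_k \sim N(0, \mathbf{R}_k)

% Under these assumptions every filtering density remains Gaussian,
% parameterized by a filtered mean and covariance, as in the Kalman filter
p(x_L \mid y_{L:1}) = N\!\left(\bar{x}_{L\mid L},\; \mathbf{P}_{L\mid L}\right)
```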
That is, the left-hand side has a form which can be written exactly as the conditional density for the current model state, given all these observations, being a Gaussian with respect to some filtered mean and some filtered covariance. And this is traditionally computed in a recursive formulation by the Kalman filter, which is exactly a recursion on these two moments that completely parameterize this Gaussian distribution. So in the following, we'll make use of this linear-Gaussian approximation quite a bit, because it underlies most operational techniques in practice, whether it's the 4D-Var type approach or the EnKF type approach. This is a little bit of an implicit assumption, because it allows you to make very efficient recursions for the Bayesian MAP estimation problem. And in particular, when we want to step outside of these very restrictive assumptions and actually estimate this with a nonlinear model, typically this is done using nonlinear optimization techniques and/or Monte Carlo schemes, as in traditional Var, ensemble analyses, and hybrid ensemble-variational formulations. Just a note: we're going to pass a little bit back and forth between what we'll call the ensemble, or sample-based, estimates and the background theoretical quantities that we're referring to in these particular equations. If you want to describe this background mean and covariance, we would define them, in order to differentiate from our ensemble estimates, as this being basically the expected value with respect to the forecast density, and this being the expected value in the outer-product sense, giving the theoretical covariance. To make these notations really quick: this will be the standard Euclidean norm. And with respect to a positive definite matrix A, that is, a symmetric matrix with positive eigenvalues, we'll denote a vector norm with respect to A, written as this V with a subscript A, which we refer to as the transpose product through A inverse times V, in the square root sense. So this gives a weighted version of the Euclidean norm, where the weighting is inversely proportional to the eigenvalues of this matrix. And then, to extend this a little bit, let's consider that we have a matrix A of size n times m with full column rank. We will refer to its pseudo-inverse with this A dagger, where A dagger will be equal to A transpose A, inverse, times A transpose. And this has some very interesting and highly useful properties. We won't go into detail, but we just want to point out that this gives a left inverse, in some sense, in the column span, where if we apply A dagger to A, well, then in fact we have A transpose A, inverse, times A transpose A, aha, and we've guaranteed the existence of the inverse because of the full column rank. And so this gives an identity in the column dimension. And respectively, the composition in the other order, A on the left-hand side with A dagger, is in fact the orthogonal projection into the column span of A. So it's a very interesting construction. I don't think we have time to go through too many of the details, but the point of using this is so that, if we have just an A with full column rank as above, we can write the sort of norm, in quotes, with respect to some G given as A A transpose.
It's not really a norm, but we'd write the V sub G as A dagger applied to V, transposed, times A dagger applied to V, in the square root sense, where this gives a sort of weighted norm of V relative to its projection into the column span, weighted by the singular values of A. It's not a true norm, but rather a lift of a norm from R m to R n, which we're assuming to be the larger dimensional space here. Under the linear-Gaussian assumption, then, the filtering problem can be phrased in terms of this Bayesian MAP cost function, where the a posteriori cost is in terms of the deviation of our input, the model state, from the forecast mean in the norm with respect to the forecast covariance, plus the deviation of the mapped input from the observation data in the norm with respect to the observation error covariance, exactly as we describe here, in the sense of the weighted norms of the last slides. So the MAP estimate actually interpolates the forecast mean and the observation data relative to the uncertainty in each piece of data, as represented by its spread. Then, to render this into the traditional form for a right-transform analysis, we'll suppose that we can make a matrix factorization, decomposing the forecast covariance into Sigma Sigma transpose. And then, instead of optimizing the cost function above over the state vector, we can instead optimize it in terms of a weight vector, where we write the model state in terms of the deviation from the forecast mean as a combination of the columns of this matrix factor, with the weight vector giving the actual combination of the columns. If we write this just in weight space, the reductions we get with respect to this form of the model state in this norm reduce it to a standard Euclidean norm on the weight vector, representing how far a deviation we're making from the forecast mean. And then, respectively, substituting terms into this form, and making a few definitions really quickly, which I think I'm going to buzz through a little because we don't need to go into all the details: this is an innovation vector weighted by the observation uncertainty, and this Gamma over here, let's say in one dimension, would represent the standard deviation of the model forecast relative to the standard deviation of the observation error. We can reduce this into this weight-space cost function, written just in the standard Euclidean norm squared, where this is quadratic in the weights. And so it's globally minimized at the critical point, that is, when the gradient with respect to the weights is equal to zero. Solving this is actually extremely similar to the normal equations of least-squares regression, and this is of course because the Kalman filter is exactly the best linear unbiased estimator for a linear-Gaussian hidden Markov model. But let's note that the MAP weights, if you find the optimal W bar, don't by default provide an update to the forecast covariance. One suboptimal approach is to simply assume that this background covariance, whether in the forecast sense or in the filtering sense, will just be equal to some B, completely invariant in time.
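Putting the verbal definitions above into symbols (this is a reconstruction in generic notation, assuming the standard right-transform conventions rather than the exact symbols on the slides):

```latex
% Weighted norms and the pseudo-inverse used above
\|v\|_{\mathbf{A}} := \sqrt{v^\top \mathbf{A}^{-1} v}, \qquad
\mathbf{A}^\dagger := (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top, \qquad
\|v\|_{\mathbf{G}} := \sqrt{(\mathbf{A}^\dagger v)^\top (\mathbf{A}^\dagger v)}
\ \text{ for } \ \mathbf{G} = \mathbf{A}\mathbf{A}^\top

% Bayesian MAP cost in state space, with forecast mean and covariance
% \bar{x}_f and \mathbf{B}_f = \Sigma \Sigma^\top
J(x) = \tfrac{1}{2}\,\|x - \bar{x}_f\|_{\mathbf{B}_f}^2
     + \tfrac{1}{2}\,\|y - \mathbf{H} x\|_{\mathbf{R}}^2

% Substituting x = \bar{x}_f + \Sigma w gives the weight-space form, quadratic in w
J(w) = \tfrac{1}{2}\, w^\top w
     + \tfrac{1}{2}\,\bigl\|\,\delta - \boldsymbol{\Gamma}\, w\,\bigr\|^2, \qquad
\delta := \mathbf{R}^{-1/2}\,(y - \mathbf{H}\bar{x}_f), \quad
\boldsymbol{\Gamma} := \mathbf{R}^{-1/2}\,\mathbf{H}\,\Sigma
```

Here delta is the innovation weighted by the observation uncertainty, and Gamma plays the role of the forecast spread relative to the observation error spread that the speaker describes in the one-dimensional case.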
And this is related, let's say, to the traditional nonlinear optimization approach of 3D-Var, which is cost-effective because these covariances tend to be extremely large and difficult to compute. But what this lacks is the information of time-dependent second moments, and, let's say, recovering the structure of the errors as it develops in time. But in the linear-Gaussian model, we can actually make an optimal analysis by finding a recursive form for the filter covariance, and the forecast model is subsequently used to propagate the filter covariance, so that we can recursively solve this equation in time. So if we set the gradient equal to zero for the critical value of the weight vector, then we can write this by solving these equations, where we take H J inverse times nabla J, all evaluated at W equal to zero, where this H J is the Hessian, or formally the second nabla of the cost function. And this actually corresponds directly to a single iteration of Newton's descent algorithm from optimization, where the forecast mean is simply updated by using this particular weight vector to adjust it from the forecast, and we get the filtered mean. And if we subsequently define the right transform as the inverse square root of the Hessian, we similarly have an update for the covariance, where if we take a right multiplication of the matrix factor with this T and then take its outer product with itself, this gives the filtered covariance. So this derivation actually describes the square root Kalman filter recursion when it's written for the exact mean and covariance as recursively computed for a linear-Gaussian model. And in particular, under the perfect model assumption, the forecast can furthermore just be written as: we simply propagate the filtered mean to give the forecast mean, and propagate the filtered covariance matrix factor to give the next forecast covariance matrix factor, which gives a complete recursion in time simply within the matrix factor alone. So it's quite a nice and very elegant derivation. Now this brings us to the notion of the ensemble approach, where the mean and covariance update in the square root Kalman filter are written entirely in terms of the weight vector and the transform matrix. We want to note that the numerical cost of inverting this Hessian in the Newton step and the cost of computing this transform are actually subordinate to a singular value or eigenvalue decomposition of the Hessian, where both come for free if you calculate that decomposition of the Hessian. So these reductions are basically at the core of the efficiency of the ensemble transform Kalman filter, which will typically make a reduced-rank approximation to this background covariance. Instead, we suppose that we have some ensemble matrix, where this is our model state dimension and this is the ensemble size, where the members are assumed to be drawn from the appropriate distribution, and, let's say in practice, where the ensemble size is much smaller than the state dimension. We can write the cost function in the ensemble span, and we'll have a restricted Hessian, where this will actually just be of the size of the ensemble dimension times the ensemble dimension. So this is actually a major reduction in terms of the numerical cost when we simply perform, let's say, a singular value decomposition in the ensemble dimension.
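As a rough illustration of the right-transform analysis just described, here is a minimal ensemble transform Kalman filter analysis step in Python; the function name, the use of a linear observation operator, and the eigendecomposition-based transform are assumptions made for the sketch, and it omits the inflation, localization, and hybridization needed in practice.

```python
import numpy as np

def etkf_analysis(E, y, H, R):
    """One ETKF right-transform analysis step.

    E : (nx, ne) forecast ensemble, columns are members
    y : (ny,) observation vector
    H : (ny, nx) linear observation operator
    R : (ny, ny) observation error covariance
    Returns the analysis ensemble plus the optimal weights and right transform,
    which can be reused later for a retrospective smoothing pass.
    """
    nx, ne = E.shape
    x_mean = E.mean(axis=1)
    X = (E - x_mean[:, None]) / np.sqrt(ne - 1)   # normalized anomalies, a factor of the forecast covariance

    R_inv_sqrt = np.linalg.inv(np.linalg.cholesky(R))
    delta = R_inv_sqrt @ (y - H @ x_mean)         # innovation weighted by observation uncertainty
    Gamma = R_inv_sqrt @ H @ X                    # observed, weighted anomalies

    # Hessian of the weight-space cost function, restricted to the ensemble span
    hess = np.eye(ne) + Gamma.T @ Gamma
    evals, evecs = np.linalg.eigh(hess)

    # A single Newton step from w = 0 gives the optimal weights; the same
    # decomposition gives the right transform T = hess^(-1/2) for the anomalies
    w_bar = evecs @ ((evecs.T @ (Gamma.T @ delta)) / evals)
    T = evecs @ np.diag(evals ** -0.5) @ evecs.T

    x_a_mean = x_mean + X @ w_bar
    E_a = x_a_mean[:, None] + np.sqrt(ne - 1) * (X @ T)
    return E_a, w_bar, T
```

One could pair this with the toy model sketched earlier, calling etkf_analysis each time an observation arrives and propagating the analysis ensemble member by member with the forecast model between analysis times.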
Specifically, if we define these various ensemble-based estimates, which based on time I think we'll just brush over (these are just the ensemble mean, the ensemble covariance, et cetera), then the ensemble-based cost function is written in terms of these ensemble estimates. Again, making the appropriate substitutions, we have a term that includes the ensemble dimension minus one, corresponding to the use of the ensemble covariance and the pseudo-norm with respect to this ensemble covariance, and this term. But this is now an optimization of the weight vector in the ensemble dimension rather than in the state space dimension, and this is a key reduction that makes Monte Carlo methods like this feasible for the large size of geophysical models. But we want to note there's a big qualifier on that: covariance localization and especially covariance hybridization are what are used in practice to overcome the curse of dimensionality due to the extremely small feasible ensemble size. So this is actually just a sketch of the derivation of the, well, LETKF of Hunt and others, which is of course one of the widely used operational DA algorithms for short-range weather prediction. And in this formalism, in fact, we can define just a right ensemble transform, this Psi k, such that for our ensemble forecast at time k you can right-multiply by the appropriate right transform and recover the filtered ensemble, where these ensembles are assumed to be drawn from the appropriate densities, either the filtered or the forecast density. And we'll similarly associate our filtered ensemble with a smoothing ensemble, let's say, where we'll write this as the ensemble at time L, or for time L, conditioned on information up to time L, the leading time. And under the linear-Gaussian model, we have furthermore that this ensemble transform from the filtering step can be used so that, if we have a smoothing estimate at time K given observations up to L minus one, you can right-multiply by the filtering transform and condition on the current observation. So this will then perform a retrospective smoothing analysis on all past states stored in memory by using the latest right transform from the filtering update step. And this is at the basis of the EnKS, which I think, yeah, we just have a minute to go through; that's great, this is just where I wanted to stop. So the EnKS takes advantage of this retrospective right-transform analysis by using an additional inner loop of the filtering cycle. What we're imagining here is that time is on the horizontal axis, moving left to right forward in time, and each one of these small steps is representing a filtering step with an observation coming in, and, let's say, various pieces of information either come from memory, the model forecast, or the information flow. When we produce a filtering estimate, the way the EnKS works is to say, well, once you've produced this ensemble transform, one can subsequently pass the information from the observations backwards in time, using the right transform to condition the past estimates, as we go through and update these posterior estimates. And so this is a sketch, again, now of the LETKF and the ensemble Kalman smoother, where I think we've just hit 45 minutes.
So this is where I was going to stop with part one.
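To close out part one, here is a minimal sketch of the retrospective right-transform analysis underlying the EnKS as described above, reusing the hypothetical etkf_analysis sketched earlier; the function name and bookkeeping are illustrative assumptions, not the author's implementation.

```python
import numpy as np

def enks_retrospective_update(past_ensembles, w_bar, T):
    """Condition ensembles stored over the DAW on the newest observation.

    past_ensembles : list of (nx, ne) arrays, the lagged ensembles held in memory
    w_bar, T       : optimal weights and right transform from the latest filtering analysis
    Returns the retrospectively smoothed ensembles, in the same order.
    """
    smoothed = []
    for E in past_ensembles:
        ne = E.shape[1]
        x_mean = E.mean(axis=1)
        X = (E - x_mean[:, None]) / np.sqrt(ne - 1)   # normalized anomalies at the past time
        x_s_mean = x_mean + X @ w_bar                 # shift the past mean with the new weights
        smoothed.append(x_s_mean[:, None] + np.sqrt(ne - 1) * (X @ T))
    return smoothed
```

In a loop over analysis times, one would call etkf_analysis at the new observation time, apply enks_retrospective_update to the lagged ensembles kept in memory over the window, and then propagate the filtered ensemble forward to the next observation.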