We are now going to talk about inverse problems for deterministic dynamic models. In particular, we are going to discuss the well-known four-dimensional variational assimilation method, called 4D-Var, which is also known as the first-order adjoint method. Before starting on the statement of inverse problems based on deterministic dynamic models, let us quickly review where we are and how we got here. We first started with what data assimilation is and why we need it; then we covered the mathematical preliminaries from finite-dimensional vector spaces and matrix theory, then multivariate calculus, then optimization theory. After that we introduced the broad class of static deterministic inverse problems, and we discussed matrix methods as well as direct and iterative optimization (minimization) methods for solving static deterministic inverse problems. That is the sequence of topics we covered, and it naturally leads to an extension: we stay with deterministic models, but instead of static models we are now going to talk about dynamic models and the inverse problems that occur within that framework. So it is useful to begin with a classification of dynamic models. Let x_k be the state of a dynamical system; a dynamical system is something that evolves in time, as opposed to a static model, which remains constant with respect to time. Dynamic models can be either linear or non-linear. In a linear model the state evolves according to x_{k+1} = M x_k, where M is an n by n matrix called the state transition matrix. So a linear dynamic model is essentially given by the transition matrix M and an initial condition x_0. If the matrix M is given and x_0 is known, computing x_{k+1} = M x_k is the forward problem; but here we are going to consider the inverse problem, which assumes x_0 is not known. So that is where the inverse problem comes in.
So I would like to distinguish between the inverse problem and the forward problem in the dynamical context. If x_0 is known, computing the states is called the forward problem for the linear model. Likewise for the non-linear model: here I am given a state transition map M, a vector-valued function of a vector, M from R^n to R^n, and the state at time k+1 is given by M(x_k) as opposed to M x_k; that is the difference between the linear and non-linear models. Again, given x_0, for both the linear and the non-linear model, computing the future states is the forward problem. Now we are interested in the inverse version within the context of dynamical models. In the inverse problem of interest to us it is assumed that x_0, the initial condition, is not known; our goal is to estimate x_0 based on noisy, indirect information about a finite subset of states at times k >= 1. That is the inverse problem of interest. I would like to represent this pictorially: along the time axis k, I have been given observations at various instants in time, k_1, k_2, ..., k_i, ..., k_N; these observations are not the state directly but some functions of the state. So I have been given some observations at these times, and our aim is to find the initial condition x_0 such that the sequence of states computed using the dynamical model fits the observations in the least squares sense. And what is meant by the least squares sense? If I fix x_0, the model solution can be computed using the given model, whether linear or non-linear; that is the forward problem.
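The forward problem just described can be sketched in a few lines of code. This is a minimal illustration, not part of the lecture; the matrix M and the initial condition below are hypothetical numbers chosen only to show the iteration x_{k+1} = M x_k (linear) versus x_{k+1} = M(x_k) (non-linear).

```python
import numpy as np

def forward_linear(M, x0, K):
    """Iterate x_{k+1} = M x_k; returns the trajectory [x_0, x_1, ..., x_K]."""
    states = [x0]
    for _ in range(K):
        states.append(M @ states[-1])
    return states

def forward_nonlinear(m, x0, K):
    """Iterate x_{k+1} = m(x_k) for a vector-valued map m: R^n -> R^n."""
    states = [x0]
    for _ in range(K):
        states.append(m(states[-1]))
    return states

# Hypothetical 2-state linear model: given M and x0, the forward problem
# is just repeated matrix-vector multiplication.
M = np.array([[0.9, 0.1],
              [-0.1, 0.9]])
x0 = np.array([1.0, 0.0])
traj = forward_linear(M, x0, 3)          # x_0 through x_3
```

The inverse problem discussed next reverses this: the trajectory (observed indirectly, with noise) is given, and x_0 is the unknown.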
So given x_0, compute the forward solution; given the forward solution and the observations, which are indirect measurements of the state, we form the difference between the observation and the state at each observation time. We are going to compute the sum of squares of the errors between observations and states. You can readily see that this sum of squared errors is a function of x_0, because once x_0 is given the solution is determined for all time. Therefore our job is to adjust x_0 such that the solution calculated from x_0 best fits the observations, in the sense that the sum of squared errors is a minimum. So that is the inverse problem of interest: given information about the solution at times k >= 1, I would like to infer the state at time 0, starting from which the solution best matches the given observations. The basic idea is essentially the same as before: we are still going to pursue the least squares framework, so we are going to formulate the least squares criterion and look at ways to minimize it. The techniques are slightly more involved, largely because the model is not static but dynamic. So the basic principles of the dynamic inverse problem are not much different from those of the static inverse problem, except that we need to handle the nuances introduced by the dynamics. So we saw that the model can be linear or non-linear; we also said we are given a finite subset of observations. The observations themselves have to depend on the state: if the observations did not depend on the state, they would carry no information about it. That is what we mean when we say the observations contain indirect information about the state. It is from this indirect information that our job is to extract information about the initial condition; that is the inverse problem of interest here.
Again there are two cases to consider for the observation vector z, which is now a function of time. In the static problem we said z belongs to R^m; now we say z_k belongs to R^m. What is z_k? It is the observation at time k. The observation at time k may be a linear function of the state, z_k = H x_k + v_k, where H is an m by n matrix; the observations are now assumed to be corrupted by noise, and v_k is called the observation noise. This is the case where the observations are linear functions of the state. In the non-linear case I am given a non-linear map H from R^n to R^m, so z_k = H(x_k) + v_k; we again consider additive noise v_k, the observation noise, generally modeled as mean-zero Gaussian noise with a pre-specified covariance structure. In the context of geophysical applications the matrix H and the map H(.) are called forward operators. So you can think of the model space, which is R^n, and the observation space, which is R^m; the model state is x_k, the observation is z_k, and I can relate them using the matrix H or the map H(.), one a linear map and the other a non-linear map. So information about the state is not directly known. When H is the identity matrix, the observations really are measurements of the state. If H is not the identity, or the map H(.) is not a simple function, the observations are complicated functions of the state of the system, and it is from this complicated information that we have to unwind information about the state, using which I can estimate the initial condition by formulating a suitable inverse problem. You can readily see that in the static case we considered z = Hx as the linear problem and z = H(x) as the non-linear problem; the only difference between the static and the dynamic problem is that we have to keep track of the times at which observations were given to us.
So that is the primary difference between the static and the dynamic cases. Now I am going to give examples of the function H, just to give you a feel that this formulation is for real and captures the spirit of what happens in practice. Suppose I am interested in modeling the sea surface temperature of an ocean; let T be the sea surface temperature. For example, in the analysis of El Nino we are interested in understanding changes in the sea surface temperature in the equatorial Pacific west of Hawaii, where it is very difficult to go and make measurements. So we try to infer the temperature of the sea surface by looking at the energy radiated from the warm surface: we measure the radiated energy from a satellite and recover the temperature. So if temperature T is the state variable, the observation z is the energy E measured by the satellite, and the observation and the state are related by the Stefan-Boltzmann law, E = alpha T^4. So the state x is T, the observation z is E, and the forward operator H takes the form E = alpha T^4. In another problem your model's state variable could be the rate of rainfall: I would like to predict how much rain will fall in the next 24 hours, and to estimate that I need to understand and measure the rainfall rate. Often the rainfall rate, the state variable, is not directly observable, just as in the case of the radiated energy; I can only measure it indirectly. The indirect measurement comes from the reflectivity of the cloud as measured by radar: the radar sends out a beam towards the cloud, the cloud consists of a lot of droplets, the droplets reflect the energy, and the reflected energy received by the radar is a measure of the amount of water content in the cloud.
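The temperature example can be written as a tiny forward operator. This is a minimal sketch, not from the lecture: the constant alpha below is a placeholder (not the physical Stefan-Boltzmann constant), and the direct inversion assumes a noise-free observation and T > 0.

```python
ALPHA = 1.0e-8   # placeholder constant, not the physical value

def forward_operator(T):
    """Map the state (temperature T) to the observation (energy E = alpha T^4)."""
    return ALPHA * T**4

def invert(E):
    """Recover T from a noise-free energy observation (assumes T > 0)."""
    return (E / ALPHA) ** 0.25

T_true = 300.0
E_obs = forward_operator(T_true)   # what the satellite would report
T_rec = invert(E_obs)              # recovers T_true up to rounding
```

With noise added to E_obs, this direct inversion is no longer exact, which is precisely why the least squares machinery of the rest of the lecture is needed.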
The relation that connects the reflectivity to the rainfall rate is an empirically derived non-linear law; this law was essentially derived at the National Severe Storms Laboratory, which specializes in applications of radar to meteorological problems, essentially radar meteorology, and it was obtained as the result of empirical curve fitting. Another example: in the design of cruise control for cars, I would like to measure the speed of the car. The speed is essentially measured in terms of the voltage generated: a voltmeter is placed on a rotating shaft, so the voltage generated is proportional to the speed, and the relation between the voltage and the speed of the car is given by Faraday's law. So you can see from these examples how the real states of the model and the actual observations are related; I have given examples of the forward operator in which the first two cases are non-linear and the third is linear. Identifying the forward operator H is not a trivial problem: it depends on what the state variable of interest is in your analysis and on whether that state variable is directly measurable or can only be measured indirectly. For example, pressure, temperature, and wind velocity are all directly measurable state variables, but if your model contains a state variable such as vorticity, vorticity cannot be measured directly; it has to be inferred from certain observables, and that inference is done through certain well-known mathematical relations. Those relations constitute the forward operator in question. I would like now to summarize: I have a model, which could be linear or non-linear; I have indirect information about the state from the observations, which again could be linear or non-linear functions of the state; and we talked about the observation noise v_k. When we dealt with the static problem, even though we recognized that you can never make an observation without an error, while we
were cognizant of the fact that observations have errors, we purposely assumed, to simplify the problem, that there is no error; we simply closed our eyes to reality and considered the deterministic case. So in principle observations are riddled with errors. One way to model the error, going back to the days of Gauss, is to model the observation error as white Gaussian noise: at a given time the noise is Gaussian distributed, but between two different times there is no correlation; there is no temporal or spatial correlation between the observation noise affecting observations at two different times or at two different points in space. These are the key assumptions made to simplify the analysis. So v_k is the observation noise that affects the observation at time k; v_k is a vector of random variables with zero mean and covariance R_k, and it is assumed the covariance matrix is known. R_k is a covariance matrix of size m by m, and it is also assumed that R_k is symmetric and positive definite. In most cases we will assume R_k = R, which means our instruments are identical: even though I may be making measurements at various times, the instruments making the measurements at different times are identical, so their error properties are the same. So I am going to assume R_k = R, essentially without loss of generality. One further simplification is that R is a diagonal matrix for a given class of observations. What does this mean? I have z_k as my observation, a vector with components z_k1, z_k2, ..., z_km. We are going to assume z_k1 has its variance, z_k2 has its variance, and so on, but z_ki and z_kj, the i-th and j-th components of the observation, are uncorrelated. This implies that the matrix R is essentially the diagonal matrix diag(sigma_1^2, sigma_2^2, ..., sigma_m^2), where sigma_i^2 is the variance associated with the measurement
of the i-th component of the observation vector. These are simplifying assumptions, and they are pretty close to reality. Again, I would like to emphasize that the variance depends on the nature and type of the instruments used in the observations. Here comes the challenge: when it comes to satellites we generally do not know too much about the error properties, especially if a satellite has been around for a long time, because the quality of the measurements changes in time. So one probably has to make approximations. One approximation that is generally made is R = sigma^2 I, a diagonal matrix where all the diagonal elements are the same: I am not going to distinguish among the components because I just do not know. If you do not know something but still want to include a noise covariance, one simple assumption is that R is a diagonal matrix with constant diagonal elements. What is the implication? All the measurements are equally accurate, and no two components of the measurements are correlated; this is a greatly simplifying assumption. So one can assume R to be anything from a general symmetric positive definite matrix down to R = sigma^2 I; that is the range of variation. This range is allowed because we generally may not know the error properties of observations from radar, from satellites, or from certain buoys if they have been around for a long time. That is the dilemma one has to deal with: we want to be as close to reality as possible without making the problem too difficult to solve, and these compromises are covered in the basic assumptions about the observation noise. So now I am going to formally state the problem. Let k_1, k_2, ..., k_i, ..., k_N be the N times at which observations are available, for N being some finite number; it could be 1, 2, and so on; I do not think it should be less than or equal
to infinity; it will be strictly less than infinity: a finite number of observations. So we are given a finite set of N noisy observations of the state. Noisy observations "about" the state means these may be indirect measurements; we just do not know, but we should be ready for it. So z contains information about the state either through the matrix H or through the non-linear map H(.). So here is the problem: given the model equations, linear or non-linear, and the set of observations, which could likewise be linear or non-linear functions of the state, the inverse problem is to estimate the initial condition that best fits the observations. That is the statement of the inverse problem here. Before I go further: there are 4 different formulations of this inverse problem, since the model could be linear or non-linear and the observations could be linear or non-linear. In case 1 both model and observations are linear; in case 4 both are non-linear; in the other two cases one is linear and the other non-linear. The simplest possible case is case 1, where both are linear, and the most complex is case 4, where both are non-linear. We consider these two extreme cases: if we understand how to solve these two problems, the intermediate cases 2 and 3 become obvious. I would like to open up the whole class of problems one could consider. So that is what we have done thus far: we have essentially stated all the basic information one needs. Please remember, we need a ton of information: I need to know the model, I need to know the function that relates the observations to the state, we need to know the properties of the observation noise, and we need to know the times and the values of the observations at a specific finite number of times. Given all these, we can state the inverse problem in a more formal way, as we will see right now.
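The observation model and the noise assumptions above can be sketched numerically. This is a minimal illustration, not from the lecture: the 3-state system, the operator H that observes only the first two components, and the value of sigma are all hypothetical, and R = sigma^2 I is the simplest covariance choice just discussed.

```python
import numpy as np

rng = np.random.default_rng(0)

def observe(H, xk, R):
    """Simulate z_k = H x_k + v_k with v_k ~ N(0, R)."""
    vk = rng.multivariate_normal(np.zeros(R.shape[0]), R)
    return H @ xk + vk

n, m_obs, sigma = 3, 2, 0.1
H = np.eye(m_obs, n)            # observe only the first two state components
R = sigma**2 * np.eye(m_obs)    # identical, uncorrelated channels: R = sigma^2 I
xk = np.array([1.0, 2.0, 3.0])
zk = observe(H, xk, R)          # a noisy, indirect measurement of the state
```

Replacing R with a general symmetric positive definite matrix models correlated or unequally accurate channels, the other end of the range of assumptions described above.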
So I am going to start with case 1: the model is linear and the observations are also linear functions of the state (not of time — of the state, I am sorry). We are going to define a cost functional J; J is a scalar-valued function of a vector, the weighted sum of squared errors: J(x_0) = (1/2) sum_{i=1}^{N} (z_{k_i} - H x_{k_i})^T R_{k_i}^{-1} (z_{k_i} - H x_{k_i}). This should look very familiar to all of us: for the static problem we considered (z - Hx)^T W (z - Hx), the weighted version of the least squares criterion. Instead of z I have z_{k_i}, instead of x I have x_{k_i}, and instead of W I have R_{k_i}^{-1}; I am still allowing R to depend on time. Why this kind of weighting? I would like to talk about it for a moment before going further. To that end, consider what we normally do in basic statistics. Let x be a normal random variable with mean m and variance sigma^2. If I want to normalize it, I consider a new random variable y = (x - m)/sigma. Because I am subtracting m and dividing by sigma, y is normal with zero mean and unit variance; subtracting the mean and dividing by the square root of the variance is essentially normalization. Therefore if I consider y times y, this is ((x - m)/sigma) times ((x - m)/sigma), which equals (x - m)^2 / sigma^2, which I can write as (x - m)(sigma^2)^{-1}(x - m). Now x - m is a scalar, and the transpose of a scalar is the scalar itself, so without loss of generality I can put a transpose on the first factor: (x - m)^T (sigma^2)^{-1} (x - m). What is that? It is the square of the normalized random variable with zero mean and unit variance.
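The normalization argument can be checked numerically. A minimal sketch, with a hypothetical mean and variance: dividing by sigma produces a variable with (approximately) zero mean and unit variance, and the scalar weighted squared error (x - m)^T (sigma^2)^{-1} (x - m) is exactly the square of the normalized variable.

```python
import numpy as np

rng = np.random.default_rng(1)
m_mean, sigma = 5.0, 2.0                     # hypothetical mean and std. dev.
x = rng.normal(m_mean, sigma, size=100_000)

# y = (x - m) / sigma has (approximately) zero mean and unit variance
y = (x - m_mean) / sigma
print(y.mean(), y.var())                     # close to 0 and 1

# Scalar form of the weighted squared error equals the squared normalized value
x1 = 7.0
weighted = (x1 - m_mean) * (sigma**2)**-1 * (x1 - m_mean)
normalized_sq = ((x1 - m_mean) / sigma)**2
assert np.isclose(weighted, normalized_sq)
```

In the vector case, R^{-1} plays the role of (sigma^2)^{-1}: the weighting in J normalizes each residual by its own error statistics.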
So this is the form that is replicated here: this term is the error, and the weight function is chosen so that it normalizes the error, similar to what we have just done. We are interested in a normalized form of the error, which is the measure of the difference between z and Hx; z - Hx is called the residual in our old notation from the static problem. So what you see is the weighted sum of squared errors, or squared residuals; I am using the words error and residual interchangeably. I have this quantity at each time instant, and I have N time instants at which observations are available, so I sum it up from 1 to N. I have also added a multiplying factor of one half; that is a technical point I will explain in a moment. Before we go further, I hope the reason behind the choice of the weight function, the inverse of the covariance matrix, is clear from this basic idea of normalization of errors. Now I would like to talk about the factor of one half. To understand the half, I want to go back to some basic principles; let me make some space by erasing some of the unwanted things from here. Suppose I have a function f(x); suppose I also have the function f(x) + c; and suppose I have the function a times f(x) + c. As an example, f(x) could be x^2, so the second function is x^2 + c and the third is a x^2 + c. Let us graph these functions: this is x^2; if c is positive, this shifted curve is x^2 + c; and if c is positive and a is positive, this curve is a x^2 + c.
So let us assume a is 2, say. Now look at this: I have three functions, x^2, x^2 + c, and a x^2 + c. Even though the shapes of the functions are different, you see that the point at which the minimum occurs has not changed. That means the x that minimizes f(x) is also the minimizer of f(x) + c, and also the minimizer of a f(x) + c. So what does this mean? If I multiply a function by a positive constant, the location at which the minimum occurs does not change; if I add a constant to a function, the location at which the minimum occurs does not change; if I do both, the location still does not change. That means the minimizing value of x is invariant with respect to addition of a constant and multiplication by a positive constant. Given this property, I am multiplying the summation by one half. Why add a factor of half here? You can see that (z - Hx)^T R_k^{-1} (z - Hx) is a quadratic function; when you differentiate a quadratic function to compute the gradient, there is an annoying factor of 2 coming in, and the half cancels that 2. So it is simply an algebraic simplification in the ultimate formula: anticipating that I do not want to deal with that factor of 2 when I differentiate quadratic functions, I simply added the half to simplify some of the expressions.
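The invariance of the minimizer can be demonstrated directly. A minimal sketch, with a hypothetical f(x) = (x - 1)^2 whose minimum is at x = 1: scaling by a > 0 and shifting by c change the minimum value but not its location.

```python
import numpy as np

def argmin_on_grid(f, xs):
    """Location of the minimum of f over a grid of candidate points."""
    return xs[np.argmin(f(xs))]

xs = np.linspace(-3.0, 3.0, 601)
f = lambda x: (x - 1.0)**2          # minimum at x = 1
a, c = 8.0, 5.0                     # arbitrary positive scale and shift

x_f   = argmin_on_grid(f, xs)
x_fc  = argmin_on_grid(lambda x: f(x) + c, xs)
x_afc = argmin_on_grid(lambda x: a * f(x) + c, xs)
# all three minimizers coincide at x = 1
```

This is why the factor of one half in J can be introduced freely: it rescales the cost without moving the optimal x_0.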
So multiplying by a positive constant does not change the result, and adding a constant does not change the result; without loss of generality I can multiply the function to be minimized by any positive constant without altering any of the properties of the minimizer. So J_L is the weighted sum of squares in the linear case, and J_N is its counterpart for the non-linear case. Now look at the difference: it is essentially in one term. Where the linear version has H x_{k_i}, the non-linear version has H(x_{k_i}); everything else remains the same. So if the observations are non-linear functions of the state I use J_N, and if the observations are linear functions of the state I use J_L. So we have an expression to minimize; the value of J is a measure of the fit, and when J attains its minimum the fit is the best. Now, J is a function of x_0, but on the right hand side you do not see x_0 directly. On the right hand side x_0 enters indirectly through x_k: x_k is the state of the dynamical system at time k, and the state of a dynamical system at any given instant is a function of the initial state, so x_k depends on x_0 implicitly. So I would like to add a comment: J_L(x_0) and J_N(x_0) are implicit functions of the state, the state being the initial state, and our job is to minimize these implicit functions. That is the challenge associated with this formulation in the dynamic case.
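The implicit dependence of J on x_0 can be made concrete in code. A minimal sketch, not from the lecture: all numbers are hypothetical, and the point is that x_0 never appears explicitly in the residuals — the model must be run forward to reach each x_k.

```python
import numpy as np

def J(x0, M, H, Rinv, obs):
    """Weighted sum of squared residuals; obs maps time k -> observation z_k.
    x0 enters only implicitly: the model is run forward to reach each x_k."""
    cost, x = 0.0, np.array(x0, dtype=float)
    for k in range(1, max(obs) + 1):
        x = M @ x                      # advance the model: x_k = M x_{k-1}
        if k in obs:
            r = obs[k] - H @ x         # residual at an observation time
            cost += 0.5 * float(r @ Rinv @ r)
    return cost

# Tiny illustrative setup (all numbers hypothetical)
M = np.array([[1.0, 0.1],
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])             # observe only the first component
Rinv = np.eye(1)
truth0 = np.array([1.0, -0.5])

# Noise-free observations generated from the true initial condition,
# so J vanishes at the truth and is positive elsewhere.
obs, x = {}, truth0.copy()
for k in range(1, 6):
    x = M @ x
    if k in (1, 3, 5):
        obs[k] = H @ x

assert J(truth0, M, H, Rinv, obs) < 1e-20
assert J(truth0 + 0.2, M, H, Rinv, obs) > 0.0
```

Minimizing J over x_0 is therefore an implicit minimization: every evaluation of J hides a forward model run.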
Now let us go back. I have to minimize this with respect to x_0; x_0 does not occur explicitly, but it occurs implicitly through x_k, and x_k is related to x_0 through the model. Therefore the model equations are inherent within this expression: the x_k are not free variables, they are related to x_0 through the model, so the model is to be considered as a constraint, and that is what our next step is all about. While x_0 is free, J(x_0) depends on x_0 only implicitly through the model states x_{k_i}; that is what we just talked about. There are two approaches to handle this implicit minimization problem. Approach one is to express x_{k_i} explicitly as a function of x_0 using the model equations and solve the resulting unconstrained minimization problem. Or else, approach two, we solve this as a constrained problem, where we force x_k to be related to x_0 through the model equations; the model equations then constitute the so-called equality constraints. These are two equivalent ways of looking at this minimization problem, and both approaches are meaningful. However, the first approach is more suitable for linear systems, while the second is suitable for both linear and non-linear systems. I hope the formulation is very clear now: I have to minimize J_L(x_0) or J_N(x_0) as an implicit minimization problem, where x_k is related to x_0 through the model equations.
So either you use the model to express x_k in terms of x_0 explicitly and solve an explicit unconstrained problem, or you simply use the model as a constraint and formulate it as an equality constrained problem. We already know that equality constrained problems are solved using Lagrange multipliers; we have already come across the use of Lagrange multipliers for constrained optimization in the context of underdetermined static deterministic inverse problems, so the techniques are very familiar to us. The same kind of techniques are going to be applied in the dynamic case as well, so you can see the methodology is very similar between static and dynamic problems, except that in the dynamic problem we simply need to take time into account. Now a comment about the classification of strategies for solving this problem. Whenever we assume the model is deterministic, what is the implicit assumption? We assume the model is perfect. So in any analysis, when we assume the model to be deterministic, there is an implicit assumption, a belief that the model is the best: what I have is the best model known, there is nothing better available, so I am going to assume the model is perfect. If the model is perfect and I want to use the model as a constraint, I would like to enforce the constraints strongly, so I formulate the problem as a strong equality constraint problem. You may remember strong constraints and weak constraints; we have already talked about them within the context of constrained optimization.
So: perfect model assumption, and let me underline this, strong constraint formulation. Why a strong constraint formulation? If I assume the model is perfect, I want the model to have the final say in what I do, because I believe in the model so much that it has the last word. To provide variety, we will use the Lagrange multiplier technique to solve the linear model problem as a strong constraint problem. So you can see there are a number of formulations of the problem: the model assumption, linearity or nonlinearity; there is a tree of potential ways to mix and match. I will use the Lagrange multiplier technique in a strong constraint formulation to solve the linear model; for the non-linear model I will use the so-called adjoint method. Why use different techniques to solve different versions of the problem? To expose the richness of the class of methods that have come to be known for solving these kinds of dynamic problems. Later we will also solve the same problem using a new class of methods called the forward sensitivity method, which was developed by us some 5 or 6 years ago. So you can see the variety of techniques: model perfect, strong constraint, using Lagrange multipliers; model perfect, non-linear case, using the adjoint method; model perfect, using the forward sensitivity method to solve the same set of problems.
So I am going to illustrate 3 types of techniques to solve the inverse problem under the assumption that the model is perfect and is used as a strong constraint. Let us start with the linear case to get the ball rolling in the solution process. The model is x_{k+1} = M x_k, with x_0 the initial condition; the observations are z_k = H x_k + v_k; and the cost functional is as given above. Before I utilize Lagrange multiplier techniques and other tools, I am going to simply use the model to express x_k in terms of x_0, the simplest possible exercise; you can think of it as the method of elimination. Expressing x_k as a function of x_0 by iterating the model, you can readily see that x_k = M^k x_0. So I have related the model state at time k to the model state at time 0 through the k-th power of the state transition matrix. Yes, M^k involves matrix-matrix multiplication, and we know matrix-matrix multiplication is computationally very expensive; even though this could be very expensive, I want to provide a basis for comparison for later methods, and that is why we are going to indulge in this so-called method of elimination.
So look at this: the x_k are needed in the cost functional, and x_k is given here, so substituting x_k = M^k x_0 into expression (3), I get the following: H x_k becomes H M^k x_0. The right hand side is now an explicit function of x_0. Multiplying out, there is a constant term, a linear term in x_0, and a quadratic term in x_0. Even though the expressions are a little more complicated than in the static case, essentially we are dealing with a constant term, a linear term, and a quadratic term; the essential structure has not changed even though the form of the expressions has. My job is to minimize J(x_0) with x_0 belonging to R^n; this is an unconstrained minimization problem. So how do I solve it? I compute the gradient of J with respect to x_0. We know how to compute the gradient of a linear function and the gradient of a quadratic function; this is what we saw in the module on multivariate calculus, so we are going to draw on that experience. To do so I am going to simplify the expression so that it looks much simpler: I rewrite J as J(x_0) = (1/2) x_0^T A x_0 - b^T x_0 + c, where A = sum_i (M^{k_i})^T H^T R_{k_i}^{-1} H M^{k_i} is the sum of quadratic-term matrices, b = sum_i (M^{k_i})^T H^T R_{k_i}^{-1} z_{k_i} is the sum of linear-term vectors, and c = (1/2) sum_i z_{k_i}^T R_{k_i}^{-1} z_{k_i} is the constant. So, defining A, b and c by these substitutions, I can express J(x_0) in this compact form. The reason for writing it as (6) is that it is
a standard quadratic function: if A is symmetric positive definite, J(x_0) is convex, and if J is convex then, again from basic optimization theory, we know a unique minimum exists. I can compute the gradient, which equals A x_0 - b, and setting it to zero I get the optimal x_0 = A^{-1} b. The Hessian is essentially A, which is symmetric positive definite. So I have essentially solved the problem and computed the form of the solution: the optimal initial condition is given by this expression. Even though I have an expression for the optimal initial condition, this approach is far from practical. Why? Because it involves matrix-matrix products. Look back at A: it consists of (M^k)^T and M^k. If I have the matrix M, then M^2, M^3, and in general M^{k+1} = M^k M, have to be computed recursively, each step involving a matrix-matrix multiplication, and matrix-matrix multiplication takes on the order of n^3 operations. We saw earlier that n^3 complexity, when n is of the order of a million, would take a very long time. Therefore this method, while giving a very beautiful mathematical expression, is far from practical. The reason for doing this is not practicality but that it brings out the essence of the problem: when the model is linear and the observations are linear functions of the state, the J function so constructed reduces to a standard quadratic convex functional with a unique minimum; the optimizer exists and I can in principle give an expression for it. So I have theoretically solved the problem, but this way is not practical.
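The method of elimination just described can be sketched end to end for a toy problem. This is a minimal illustration, not an operational implementation: the 2-state model, the observation times, and the true initial condition are all hypothetical, and the observations are noise-free so that the closed-form solution x_0 = A^{-1} b recovers the truth exactly.

```python
import numpy as np

def eliminate_and_solve(M, H, Rinv, obs):
    """Method of elimination: substitute x_k = M^k x0 into J and minimize.
    A = sum_k (M^k)^T H^T R^{-1} H M^k,  b = sum_k (M^k)^T H^T R^{-1} z_k;
    the optimal x0 is A^{-1} b, provided A is symmetric positive definite."""
    n = M.shape[0]
    A, b = np.zeros((n, n)), np.zeros(n)
    for k, zk in obs.items():
        Gk = H @ np.linalg.matrix_power(M, k)   # H M^k: the costly O(n^3) part
        A += Gk.T @ Rinv @ Gk
        b += Gk.T @ Rinv @ zk
    return np.linalg.solve(A, b)

# Recover a known initial condition from noise-free, indirect observations
M = np.array([[1.0, 0.1],
              [-0.1, 1.0]])
H = np.array([[1.0, 0.0]])        # only the first state component is observed
Rinv = np.eye(1)
truth0 = np.array([0.7, -1.2])
obs = {k: H @ np.linalg.matrix_power(M, k) @ truth0 for k in (1, 2, 4)}

x0_hat = eliminate_and_solve(M, H, Rinv, obs)   # recovers truth0
```

Note that even though only the first component is ever observed, the model dynamics mix the components across time, which is what makes A invertible and the unobserved component recoverable; with n of the order of a million, however, the repeated M^k products make this approach impractical, exactly as argued above.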
I would like you now to concentrate on the expression for the matrix A, the expression for the vector b, and the expression for the constant c. These are computed using everything we know: M is the model, H is the forward operator, R_k is the observation noise covariance. So the model component, the observation component and the observational-error component all come together, and it is this combination of factors that makes up the matrix A. Having seen this simple method of elimination, I am now going to come back and formulate the problem as a Lagrangian-multiplier-based technique, a strong-constraint Lagrangian multiplier technique. To that end I am going to construct a Lagrangian function. This is the objective function, and xk = M xk-1 is the model; I am going to rewrite the model as the constraint xk - M xk-1 = 0. Now please remember xk is a vector, M is a matrix and xk-1 is a vector, so the whole thing is a vector belonging to R^n. I take the inner product of this vector with a vector lambda_k, giving lambda_k^T (xk - M xk-1); that is the Lagrangian multiplier term, and there is one such term for each k, summed from k = 1 to N. The lambda_k are called the Lagrangian multipliers; they are all vectors in R^n, and each inner product is a scalar. So L(x0, lambda) is a scalar function of x0 and lambda, where lambda is not a single vector but the collection of all the multipliers lambda_1, lambda_2, up to lambda_N. Each of them is a vector in R^n, and there are N of them, so this lambda is a vector that belongs to R^(nN). And you can see x0 belongs to R^n;
therefore L is a function that maps R^n x R^(nN) to R. That is the Lagrangian: it takes two inputs, the first is the initial condition, the second is the collection of N Lagrangian multipliers, each of dimension n. I hope the structure of the Lagrangian function is clear; it is simply an extension of what we have done previously. Once the problem is formulated as a strong-constraint problem using Lagrangian multipliers, the constrained optimization problem is solved as an unconstrained problem by finding the stationary points of L with respect to both x0 and lambda: compute the gradient of L with respect to x0, compute the gradient of L with respect to each xk, compute the gradient of L with respect to each lambda_k, and equate each of them to 0. This is the set of necessary conditions one has to satisfy at the minimum of the Lagrangian function. By setting the gradient of L with respect to lambda_k to zero, I get the model equation: xk - M xk-1 = 0 essentially means the model is satisfied, so I am forcing the model as a strong constraint using (11). At the minimum, xk must satisfy the forward model dynamics xk = M xk-1; the model must always have the last say, the model equation must always be satisfied. The xk's are not free variables; they are constrained by the model. Now I am going to compute the derivatives; these are the mechanics of the Lagrangian method. I have to compute the gradient of L with respect to xk, and that gives rise to equation (13). It is very easy for me to write equation (13), but it will take you a couple of minutes to derive it. I am not going to spend time deriving this equation from the Lagrangian, but it is an important exercise for you to apply the basic principles of multivariate calculus to compute
the expression for the gradient of the Lagrangian, and likewise the gradient here, and equate all of them to 0. Now we have to solve the system of three equations, (12), (13) and (14): (12) says that (11) is satisfied, (13) says the gradients of L with respect to the intermediate xk are 0, and (14) says the gradient of L with respect to xN is 0. In order to solve this system I want to concoct a new variable; I do not want to write too many things, so I am going to combine several quantities into one. To that end I define a variable called f_k. Let us understand what f_k is. z_k - H xk is the residual: if z_k were equal to H xk there would be no residual. H xk is the model counterpart of the observation, so the actual observation minus the model counterpart of the observation is the residual, or the error. Multiplying it by R_k^-1 is a form of normalization. I would now like to bring in the operator H^T; we are looking at it here for the first time. Here is R^n, here is R^m, and z_k - H xk belongs to R^m; I am going to call it e_k. Then I call eta_k = R_k^-1 e_k, which again belongs to R^m. So e_k is the actual error and eta_k is the normalized error. H is a map from R^n to R^m, and H^T is a map from R^m to R^n; therefore H^T eta_k, which I call f_k, lives in R^n. I hope that is clear. So we can interpret f_k = H^T R_k^-1 (z_k - H xk) as the normalized forecast error: z_k - H xk is the forecast error, multiplying by R_k^-1 is a form of normalization or weighting, and multiplying on the left by H^T transfers it from the observation space to the model space. Therefore f_k is the normalized forecast error viewed from the model space. It is very important to understand why we want to convert the forecast error from
one space to another: all the computations are done in the model space. This is the model space, this is the observation space; the data assimilation computations are done in the model space, which is R^n. Therefore when I have a quantity which is not in this space, I have to bring the corresponding version of that quantity from the observation space to the model space, and that is what f_k is all about. Why is f_k important? Look at (13): in (13) I have H xk - z_k, which is -e_k; applying R_k^-1 gives -eta_k; applying H^T gives -f_k. So you can readily see this quantity is -f_k, and likewise this quantity is -f_N; that is the reason we defined this term. Using f_k I can simplify the expressions for the derivatives and handle them a little more easily. I hope the notation is clear, and I hope the reason for defining these new quantities is also clear. With this in mind I am now going to solve (13) and (14). From (13), what do I have? -f_k + lambda_k - M^T lambda_{k+1} = 0. How can I rewrite this? As lambda_k = M^T lambda_{k+1} + f_k: I am relating lambda_{k+1} to lambda_k, so I am going backward in time. If (13) holds, this is the form it takes. If (14) is to hold, it takes the form -f_N + lambda_N = 0, therefore lambda_N = f_N; that is what (14) is all about. So (13) and (14), when solved, give rise to these two equations: lambda_N = f_N, which is (14), and lambda_k = M^T lambda_{k+1} + f_k, which is (13). So
lambda_k = M^T lambda_{k+1} + f_k holds for 1 <= k <= N-1, and lambda_N = f_N is the final condition. Therefore we have to compute the lambdas backward in time, starting from lambda_N, the final condition. This is called the backward dynamics, and this backward dynamics has come to be known as the adjoint dynamics. We can iterate the adjoint equation: the f_k's are known, M^T is known, lambda_N = f_N is known, so I can compute lambda_N, then lambda_{N-1}, down to lambda_1, backward in time using this backward adjoint equation. So we can compute lambda_1 by going backward. Now I would like to remind you of two things: the model is solved forward, x0 to x1 to x2 all the way up to xN, and the adjoint is solved backward, lambda_N, lambda_{N-1}, down to lambda_2, lambda_1. So we have the forward dynamics of the model and the backward dynamics of the adjoint, and this backward dynamics comes very naturally from the Lagrangian multiplier formulation. Once lambda_1 is known, from the Lagrangian (7) we can readily compute the gradient of L with respect to x0, which is -M^T lambda_1. So I compute lambda_1 by the backward adjoint equation, and I have the gradient. Once I have the gradient, we can do a lot of things with it. If this computed gradient is 0, then I am at the minimum, because at the minimum the gradient must vanish. Please understand: I have converted the constrained minimization into an unconstrained minimization of L, and the gradient of L with respect to x0 must vanish at the minimum. The gradient of L at the point x0 is given by -M^T lambda_1, where lambda_1 was obtained by the backward dynamics. So if it turns out that for the x0 I chose, from which I computed x1, x2, x3, x4, from which I computed the errors, from which I computed the f_k's, from which I computed the lambdas, from which I computed -M^T lambda_1, this gradient happens to be 0, well and good, I am done. But it stands to reason to expect that this calculation would in general not be 0, because I chose x0 arbitrarily, and an arbitrarily chosen x0 in general need not be optimal. Now, the gradient of L with respect to x0 under the constraint is the same as the gradient of J with respect to x0, so I have computed the gradient of the cost function using the backward adjoint method. Once I have the gradient, what do I do? I can use it either in the gradient method or in the conjugate gradient method to minimize the cost. Gradient and conjugate gradient methods we have already discussed in previous modules, so we are not going to repeat what we already know; the emphasis here is on computing the gradient of the cost function using a forward run of the model and a backward run of the adjoint. So how does this algorithm work? It works as follows. This algorithm has come to be called the 4D-Var algorithm; it is also called the first-order adjoint method. By whatever name you call it, the essence of the method goes as follows. Start with an arbitrary x0. You are given a model, so compute the sequence of states generated by the model. You are given a set of observations, so from the model you know x0 and the xk's, you already know H and z_k, and you can compute z_k - H xk, the forecast error; you apply the normalization and transfer it to the model space, so f_k can be computed explicitly. So: run the model forward, and given the observations compute the quantities f_k. Set lambda_N = f_N, solve lambda_k = M^T lambda_{k+1} + f_k by iterating backward in time, compute lambda_1, and compute -M^T lambda_1; that gives you the gradient. Use this gradient in a minimization algorithm to find the optimal
x0 by repeating steps 1 through 4 until convergence. So what does that mean? You start with an x0, you compute the gradient of J with respect to x0, and you move in the negative gradient direction: the new x0 is the old x0 minus alpha times the gradient of J at the old x0. Then you start with this new x0, run the model forward in time, calculate the forecast errors, run the adjoint backward, and repeat the loop. So there are two loops I want you to recognize: one is the optimization loop, the other is the gradient-computation loop, and one loop feeds into the other. To go from one initial-condition iterate to the next, I need to run the model once forward in time and the adjoint once backward in time; each gradient requires one model run plus one adjoint run. Once you have a gradient you can compute the new operating point, and you repeat this cycle until convergence. That is the essence of the 4D-Var algorithm. You can readily see this algorithm is expensive in its use of the model and the observations, and it provides incremental improvement of the cost function. If I use the gradient algorithm and the objective function is convex, we have already seen that the gradient method converges only asymptotically; if the objective function is convex and quadratic and I use the conjugate gradient method, I can hope to get convergence in a finite number of iterations. So this algorithm brings together all the principles we have seen so far, the optimization algorithms on one hand, the model run on another, and the adjoint method on the other, combining to make the algorithm called 4D-Var. That is the summary of the 4D-Var algorithm for the case of a linear deterministic dynamical system. Thank you.
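Putting the whole procedure together, here is a sketch of the 4D-Var loop for the linear case just summarized: a forward model run, the normalized forecast errors f_k, a backward adjoint run, the gradient -M^T lambda_1, and a steepest-descent update of x0. The system, step size and iteration count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def grad_J(x0, M, H, Rinv, zs):
    """Gradient of the 4D-Var cost at x0 via one forward and one adjoint run."""
    x, fs = x0.copy(), []
    for z in zs:                              # forward sweep: x_k = M x_{k-1}
        x = M @ x
        fs.append(H.T @ Rinv @ (z - H @ x))   # f_k = H^T R^-1 (z_k - H x_k)
    lam = fs[-1]                              # final condition lambda_N = f_N
    for f in reversed(fs[:-1]):               # backward adjoint sweep
        lam = M.T @ lam + f                   # lambda_k = M^T lambda_{k+1} + f_k
    return -M.T @ lam                         # grad J(x0) = -M^T lambda_1

# Toy linear system (all matrices and dimensions are illustrative assumptions).
M = np.array([[0.9, 0.1], [0.0, 0.8]])
H = np.array([[1.0, 0.5], [0.2, 1.0]])
Rinv = np.linalg.inv(0.25 * np.eye(2))
x0_true = np.array([1.0, -1.0])

x, zs = x0_true.copy(), []                    # noise-free synthetic observations
for _ in range(10):
    x = M @ x
    zs.append(H @ x)

x0 = np.zeros(2)                              # arbitrary starting guess
alpha = 0.02                                  # fixed step size (hand-tuned assumption)
for _ in range(2000):                         # optimization loop
    x0 = x0 - alpha * grad_J(x0, M, H, Rinv, zs)

print(np.round(x0, 4))                        # approaches x0_true
```

With noise-free observations the iterates approach the true initial condition; in practice one would replace the fixed-step gradient descent with a line search or the conjugate gradient method, as the lecture notes.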