In this lecture we are going to talk about the elements of Bayesian estimation. Bayesian estimation is the central piece of stochastic methods for data assimilation. We have already seen statistical least squares, and we have already seen the notion of maximum likelihood. We did not elaborate on those techniques too much because that would take us too far from the main goal of dealing with data assimilation. Yet the fundamental principles of statistical estimation are the foundations on which statistical methods for data assimilation are built. That is why a good understanding of the fundamental principles of statistical estimation is very useful, and in my view also necessary, to be able to appreciate all the nuances of the techniques for stochastic dynamic data assimilation. In that process we are going to start with the description of Bayesian estimation schemes, or Bayesian estimation methods. So I am going to develop the Bayesian framework. Please remember that the Bayesian framework is an alternative to Fisher's framework. In the Bayesian framework, the unknown vector X ∈ Rⁿ is random. This random unknown vector is supposed to follow a probability distribution p_X, known as the prior distribution. The prior distribution is essentially a summary of our understanding about the unknown. So what is key here? I only know the probability distribution from which X is drawn; I just do not know the actual realization of X that mother nature has picked. So when we say X is unknown, we do not know the actual value or realization of X, but we do know where X comes from: X is drawn from the prior distribution. Z ∈ Rᵐ is an observation about X; Z contains information about the chosen X. So given X, Z has a conditional distribution p(Z | X). Let X̂ = φ(Z). So φ is an estimator and Z is the data.
The estimator operates on the data to give us an estimate; we have already seen the notions of an estimator and an estimate. So if X̂ is the estimate and X is the value that mother nature has picked from the prior distribution, there is a difference between what mother nature has picked and the estimate, which is something we create from the knowledge of the observation about X. More often than not X̂ and X will not be equal, so there is an error: X̃ = X − X̂ is the error in the estimate. We are now going to associate a cost function C with this error. C is a functional: it maps Rⁿ into R, and it is defined over the set of all errors. C(X̃) is called the cost associated with the error. This is nothing new. We have already talked about f(x) within the context of static deterministic inverse problems. In that case we said Z is given, h(x) is the model-predicted observation, and Z − h(x) is called the residual, the residual error; they have similar connotations. So when I talked about (Z − h(x))ᵀ(Z − h(x)), that is essentially the sum of squared errors. That is a cost function: f maps Rⁿ into R, and f(x) is simply the sum of the squares of the residuals or errors. Likewise, in this case, C(X̃) is the cost associated with the error X̃. We would like to impose some special conditions on this functional C. If Z = h(x) and f(x) = (Z − h(x))ᵀ(Z − h(x)), we already know f(x) is quadratic, and quadratic functions have a unique minimum; they have the convexity property. So we did not have to worry too much about further constraining f(x), because f(x), by virtue of its definition, automatically had all the required properties. But here we may not know ahead of time what the form of the error is. I am simply trying to define a function that captures the cost associated with the error.
Therefore, to be consistent with what can be done, we would like to impose conditions which are as follows. First, if the error is 0, the cost is 0. That makes a lot of sense: much like with f(x), if Z = h(x), then f(x) = 0. So you can think of 0 as being the minimum of the cost. When there is no error, there should be no cost, no penalty, associated with it. Secondly, if A and B are vectors and ‖A‖ ≤ ‖B‖, then C(A) ≤ C(B). ‖A‖ ≤ ‖B‖ means the length of the vector A is less than the length of the vector B. So if A is one error and B is another error, and A is the smaller of the two errors, then the cost associated with A must be less than or equal to the cost associated with B. What does this tell you? It tells you some kind of monotonicity property: the cost function increases with the length of the error vector, which is also a very nice and desirable property. So you can see the cost function is 0 at the origin, and as you go away from the origin in the space that represents the error, the length of the error increases; as the length of the error increases, the cost function does not decrease, it either increases or remains the same. Those are the conditions. You can see we are already trying to develop a bowl-like shape for the cost function C. Why are we looking for such a bowl-shaped cost? In optimization parlance, a cost is something we would like to be able to minimize, and to be able to minimize it, the function should be endowed with a unique minimum. That is why we require C to satisfy some simple conditions that guarantee the behavior of the function around 0, where 0 corresponds to my estimate being equal to the unknown, that is, to the estimate being equal to the true value.
So here are some examples of cost functions. X − X̂ is the error, and (X − X̂)ᵀ(X − X̂) = X̃ᵀX̃ is a quadratic function, called the sum of squared errors. That is one way to concoct the form of C; it is the simple quadratic form we are already used to. The next one is the weighted sum of squared errors: in the first one there is no weight, and in the second one we add a weight W, so C(X̃) = X̃ᵀWX̃. This weighted sum of squared errors is also called the square of the energy norm of the error. So we are very familiar with the first one as well as the second one. There are a couple of other ways in which one can design the cost function. The third is the uniform cost. Given an ε > 0, pre-specified and small: if the norm of the error satisfies ‖X̃‖ ≤ ε, we say the cost is 0; if ‖X̃‖ > ε, we say the cost is 1. How does this function look pictorially? Plot the cost against ‖X̃‖: if ‖X̃‖ lies between 0 and ε, the cost is 0; otherwise it jumps to 1. This is called the uniform cost function, and this is for the case where the error is a vector. The last one: suppose we are dealing with the scalar case, where x, x̂, and hence x̃ are scalars. In this case one can think of what is called the absolute error.
In this case the cost is simply the absolute value of x̃, that is, the absolute value of the scalar variable which happens to be the error in the estimate. So there are 4 different ways of picking the cost function, and each choice leads to a different interpretation of the optimal estimate. All of these are used in decision making. Two of them, the quadratic ones, are already known to us; the uniform cost and the absolute error cost are additional versions of the cost function. So what are the general forms of functions that could be used as a cost function? One possibility is to require C to be symmetric and also convex. What does symmetry mean? It means the function is symmetric with respect to the vertical axis. For example, x² is a symmetric function: −x and x have the same value, namely x². So y = x² is an example of a symmetric function, and that is exactly what this symmetry refers to: the cost for x̃ is the same as the cost for −x̃, C(x̃) = C(−x̃). You can readily see this: if x̃ is a vector and you change the sign of each component, you get the point which is the reflection of x̃, and C has the same value at the two points x̃ and −x̃. And what does convexity mean? A picture will help. A function C is said to be convex if the following holds. Consider two points x and y with values C(x) and C(y). Every point on the line segment from x to y can be written as ax + (1 − a)y for some a in the interval [0, 1]; this is called a convex combination of the two end points, and by changing a over [0, 1] you obtain every point on the segment. So let us draw a vertical line at such an intermediate point, and let us call this point z.
z = ax + (1 − a)y for some a in [0, 1]. The value of the function at this point is C(z), while the value aC(x) + (1 − a)C(y) lies on the chord that joins C(x) to C(y). So we have a point on the curve and a point on the chord. What does the convexity inequality C(ax + (1 − a)y) ≤ aC(x) + (1 − a)C(y) say? It says that the value of the function at an intermediate point between x and y is always less than or equal to the value along the chord; that means the function lies below the chord. Such functions are called convex functions. We have already alluded to convexity when we did optimization; I am just trying to reinforce and remind you of some of the properties. So in general, what are the desirable properties of a cost function? We would expect the cost function to be symmetric, meaning the penalty or cost for x and −x are the same, and we would expect C to be convex. The quadratic function, in particular the parabola, is an example of a convex function, and convex functions generally have a unique minimum. These conditions essentially guarantee that the cost function is well defined and has a unique minimum. The guarantee of a unique minimum is important because when we use the cost function in our estimation we would like to get best estimates, best in the sense of minimizing the associated cost function. So we want to choose C appropriately so that we can make meaningful decisions when it comes to deciding on algorithms for estimating the unknown. Now I am going to state the general Bayesian problem. We are given the prior, the conditional distribution, the cost function, and the observation. Those are the given data.
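Before moving on, here is a small sketch of the four cost functions above, together with a numerical spot-check of the symmetry and convexity conditions for the quadratic cost. The function names and test points are my own illustrative choices.

```python
import numpy as np

def quadratic_cost(e):
    """Sum of squared errors: C(e) = e^T e."""
    e = np.asarray(e, dtype=float)
    return float(e @ e)

def weighted_cost(e, W):
    """Weighted sum of squared errors (squared energy norm): C(e) = e^T W e."""
    e = np.asarray(e, dtype=float)
    return float(e @ W @ e)

def uniform_cost(e, eps):
    """Uniform cost: 0 if the error norm is within eps, 1 otherwise."""
    return 0.0 if np.linalg.norm(e) <= eps else 1.0

def absolute_cost(e):
    """Absolute error for the scalar case: C(e) = |e|."""
    return abs(float(e))

# The quadratic cost is symmetric and convex; a quick random spot-check:
rng = np.random.default_rng(0)
convex_ok = True
for _ in range(500):
    x, y = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform()
    convex_ok &= np.isclose(quadratic_cost(x), quadratic_cost(-x))  # C(e) = C(-e)
    lhs = quadratic_cost(a * x + (1 - a) * y)                        # value on the curve
    rhs = a * quadratic_cost(x) + (1 - a) * quadratic_cost(y)        # value on the chord
    convex_ok &= lhs <= rhs + 1e-12                                  # curve below chord
print(convex_ok)  # True
```

Note that the uniform cost is symmetric but not convex; it is the quadratic and absolute costs that satisfy both conditions.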
The prior, the conditional distribution, the observation, and the choice of the cost function C; and we already know the function C can be chosen in one of many ways. I am now going to concoct what is called the Bayes cost function B(X̂). Please remember X̂ is the estimate, so this is the cost associated with the estimate X̂, B for Bayesian. It is defined as the expected value of C(X̃): given X̂ there is an associated error X̃, so given X̂ I can compute X̃ and evaluate the cost. We already know the estimate is a random variable, so X̃ is a random variable, and therefore C(X̃) is also random. So I consider the expected value of the cost function. This expected value can be written explicitly as an integral. Please remember C(X̃) = C(X − X̂); X̂ depends on the observation Z, while X has its own prior information. So there are two sources of randomness, one from the prior and another from the observation, and I would like to integrate this cost function with respect to the joint distribution of X and Z. X has a prior, Z has a conditional distribution; with two random variables or random vectors I can consider the marginal distributions as well as the joint distribution. So the Bayes cost function is simply the expected value of the cost associated with the error in the estimation, given by the double integral B(X̂) = ∬ C(x − x̂) p(x, z) dx dz, one integration with respect to x and another with respect to z, where the probability is the joint distribution of X and Z. Please recall that the joint distribution p(x, z) can be written as p(x | z) p(z) and also as p(z | x) p(x); both factorizations are meaningful, and it is from these two that we generally derive what is called the Bayes rule in an elementary course on statistics.
So if I now replace the joint distribution by one of the product forms, I get a new formulation for the Bayes cost function. If I substitute p(x, z) = p(x | z) p(z) into B(X̂), I can absorb one of the integrations into a new quantity, the Bayes cost associated with the estimate given the observation; please recall the estimate is a function of the observation. So I rewrite p(x, z) as p(x | z) p(z), and the inner integral with respect to x I denote by B(X̂ | z) = ∫ C(x − x̂) p(x | z) dx. Plugging this in, B(X̂) takes the form B(X̂) = ∫ B(X̂ | z) p(z) dz, and this form is very useful; it is one of the forms we are going to concentrate on. Now I am interested in minimizing the Bayes cost, that is, minimizing B(X̂) = ∫ B(X̂ | z) p(z) dz. In the region of interest p(z) is always greater than or equal to 0, since it is a probability density function. So if I want to minimize B(X̂), it is enough to minimize B(X̂ | z), because B(X̂) is a weighted linear combination of B(X̂ | z) with the non-negative weights p(z). So if B(X̂ | z) is minimized for every z, that naturally implies the minimum of B(X̂). Therefore minimizing the Bayes cost function reduces to minimizing the conditional Bayes cost function, conditioned on the fact that we have been given a set of observations z. So what is the basic idea here? Let me look back at it: mother nature has picked x from the probability distribution p(x), and I am going to observe and gain information about the chosen x.
So from the perspective of the estimator, everything starts with z; everything we do is conditioned on the given z. Given z I create an estimate x̂; given x̂ I have an error; and given the error I have a Bayes cost function conditioned on the given observation, B(X̂ | z). If I make B(X̂ | z) minimum, that automatically minimizes the Bayes cost function B(X̂), because p(z) ≥ 0. Therefore, without loss of generality, we can minimize the conditional Bayes cost, and that is one of the important conclusions that comes out of this analysis. With that very general principle as background, we are now going to look at what are called Bayesian least squares estimators. Please look at this: we are combining least squares with Bayes. What is the difference from the statistical least squares that we saw in the previous talk? There we were simply given the observation, and using the observation we estimated the unknown; we were not given any information about the unknown. That is what is called statistical least squares. What is the difference between statistical least squares and Bayesian least squares? Within the Bayesian framework we always have two pieces of information: one is the prior, and the other is the new information coming from the observation, given through the conditional distribution. So we have two pieces of information; we are slightly richer within the Bayesian setup than in simple statistical least squares. So we are going to revisit the design of least squares estimation techniques within the context of the availability of prior information.
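To make the reduction concrete, here is a tiny discrete illustration; the joint distribution and the competitor estimator are my own toy choices. With a quadratic cost, the estimator that minimizes B(x̂ | z) for each z, namely the conditional mean of x given z, also achieves a smaller overall Bayes cost B(x̂) than an arbitrary competitor.

```python
import numpy as np

# Toy joint distribution p(x, z) over x in {0, 1} and z in {0, 1}:
p_xz = np.array([[0.4, 0.1],   # p(x=0, z=0), p(x=0, z=1)
                 [0.1, 0.4]])  # p(x=1, z=0), p(x=1, z=1)
x_vals = np.array([0.0, 1.0])
p_z = p_xz.sum(axis=0)          # marginal of z
p_x_given_z = p_xz / p_z        # column j is p(x | z=j)

def bayes_cost(estimator):
    """B(x_hat) = sum over x, z of (x - x_hat(z))^2 p(x, z)."""
    return sum(p_xz[i, j] * (x_vals[i] - estimator[j]) ** 2
               for i in range(2) for j in range(2))

# Estimator 1: conditional mean E[x | z], which minimizes B(. | z) for each z.
cond_mean = (x_vals[:, None] * p_x_given_z).sum(axis=0)
# Estimator 2: an arbitrary competitor that ignores z.
other = np.array([0.5, 0.5])

print(bayes_cost(cond_mean) < bayes_cost(other))  # True
```

Here the conditional means are 0.2 (for z = 0) and 0.8 (for z = 1), giving a Bayes cost of 0.16 versus 0.25 for the constant estimator.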
So all the notation and concepts carry over, and I am now going to define some new quantities. X is the unknown and Z is given to us; from the previous slide we are now interested in analyzing the problem conditioned on the given set of observations, because that is the basis for the estimation techniques. So given Z, I am going to talk about the conditional expectation of X given Z; that is an important quantity, and I denote it by μ. Now please understand Z is random, so even though we are conditioning the expectation of X on Z, the conditional expectation is in general a random variable. So μ denotes the conditional expectation, and I want to remind you that μ in general is random. This μ can be evaluated by the standard conditional expectation formula, μ = E[X | Z] = ∫ x p(x | Z) dx. So μ is a function of the observation; observations are random, so μ is random. Now I would like to choose C(X̃) to be the weighted sum of squares; if I take the weight to be the identity I get the simple sum of squares, so this weighted version is a reasonably general way to start. Now B(X̂), please remember from the previous slide, equals the expected value of C(X̃), where C(X̃) = (X − X̂)ᵀW(X − X̂). Now what do we do? There are two terms in here, X and X̂, so I add and subtract μ. Why? There is a purpose for this; we are going to get an important conclusion out of it. If you write X − X̂ = (X − μ) + (μ − X̂), multiply out, and simplify, you get a sum of three expectations. It is a very simple idea: I have inserted the conditional expectation into the Bayes cost, and I am
going to see the role of the conditional expectation in trying to minimize the Bayes cost; that is the purpose. So let us consider the cross term, the one with the coefficient 2: E[(X − μ)ᵀW(μ − X̂)]. X̂ is a function of the observation, and from basic probability theory it is very well known that this expectation can be expressed as iterated conditional expectations. What is the basic idea? First I consider the conditional expectation of the quantity (X − μ)ᵀW(μ − X̂) given Z; this conditional expectation becomes a function of Z, and then I take another expectation with respect to the distribution of Z to get rid of the randomness with respect to Z as well. So the inside expectation is conditioned on Z, and the outside one is the expectation with respect to Z. Now look at the inner conditional expectation. If Z is given, then since X̂ is a function of Z, the quantity μ − X̂ is known given Z. So again the basic properties of conditional expectation tell us that I can take μ − X̂ out of the expectation operator; W is also a non-random quantity, so it too can be moved out. By pulling these two quantities out, I can express the inner expectation as (μ − X̂)ᵀW E[X − μ | Z]. Please remember what we have used here: (X − μ)ᵀW(μ − X̂) is also equal to (μ − X̂)ᵀW(X − μ), because W is a symmetric matrix and this quadratic form is a scalar, so the two are essentially the same; we used this fact in going from the left-hand side to the right-hand side. Now expectation is a linear operator, the expectation of a sum is the sum of the expectations, and given Z, μ is already known; therefore this expectation can now be written as
E[X | Z] − μ. But from the definition E[X | Z] is μ itself, so μ − μ = 0, and therefore the cross term, the term with the coefficient 2 in the previous expansion, is 0. If the cross term is 0, I can further simplify: my Bayes cost is essentially the sum of only two terms. Now you see why we introduced the conditional mean into the expression, and what the outcome of that mathematical simplification is: the third term, the one with the coefficient 2, vanishes identically. Now please understand I would like to minimize B(X̂). To be able to minimize, I should have a free variable. Z is given to you and X is unknown to you, so the only choice you have is the estimator φ. Once you choose an estimator, you have an estimate; so the only control you have is the choice of X̂. Now look at the expression. The first term, E[(X − μ)ᵀW(X − μ)], does not have X̂ as part of it; it is independent of X̂, and if something does not depend on X̂ I cannot change it. The second term, E[(μ − X̂)ᵀW(μ − X̂)], on the other hand, depends on X̂. The first term is the expectation of a quadratic form, and W is a symmetric positive definite matrix, so the quadratic form is always non-negative; the first term is the expectation of a non-negative quantity, which is going to be greater than 0 unless X equals μ.
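The vanishing of the cross term can be spot-checked by Monte Carlo on a toy scalar Gaussian model; the model, the competitor estimator, and all numbers below are my own illustrative assumptions, and the closed form used for E[x | z] is the standard Gaussian result.

```python
import numpy as np

# Monte Carlo check that E[(x - mu) W (mu - x_hat)] = 0 when mu = E[x | z]
# and x_hat is any function of z alone. Scalar case, W = 1.
rng = np.random.default_rng(2)
s_x, s_v = 1.0, 1.0
n = 200_000

x = rng.normal(0.0, s_x, n)
z = x + rng.normal(0.0, s_v, n)
mu = (s_x**2 / (s_x**2 + s_v**2)) * z   # E[x | z] for this zero-mean Gaussian model

x_hat = 0.3 * z + 0.1                    # an arbitrary estimator, a function of z only
cross = np.mean((x - mu) * (mu - x_hat))
print(abs(cross) < 0.005)  # expected True (tolerance is many times the MC error)
```

The key point the check illustrates is that x̂ may be any function of z; orthogonality of the error x − μ to functions of z is what kills the cross term.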
And we do not know whether X equals μ or not: μ is the conditional expectation, and X is the unknown that mother nature has picked, so more often than not the first term will not be 0. Now look at the second term. X̂ is my estimator, something I deliver, and the second term is also a positive definite quadratic form; a positive definite quadratic form is 0 only when the vector μ − X̂ is 0. By changing X̂ I can affect only the second term; therefore I pick X̂ = μ. X̂ = μ is called the posterior mean. Why posterior? Because it is computed after the observations are given; that is what posterior means. It is also called the conditional mean, if you want to call it that. So if I choose X̂ = μ, the second term is 0, and since I cannot change the first term, B(X̂) is minimum when X̂ = μ. So what is the optimal estimate? The optimal estimate is X̂ = μ, and μ is the conditional expectation; the conditional expectation is the optimal estimate within the Bayesian framework. Why is this called least squares? The function C we have chosen, if you go back and look, is a least squares type function, a sum of squared errors; that is where least squares comes in. How does Bayes come in? We have the prior and the conditional distribution. So we have combined everything in a beautiful way, namely the mean square (quadratic) cost, the prior, and the posterior, to come to an important conclusion: the Bayes cost function is minimum when the estimate equals the conditional mean. Therefore we now introduce a special estimator, X̂_MS, the Bayesian mean square estimate, just like X̂_LS for the least squares estimate.
So X̂_MS = E[X | Z], and what is the expression for this conditional expectation? It is X̂_MS = ∫ over Rⁿ of x p(x | z) dx, the expected value of x with respect to the conditional distribution of X given Z. We already know from Bayes' theorem that p(x | z), by simple application of the Bayes rule, is given by p(x | z) = p(z | x) p(x) / p(z). So this is the characterization of the optimal estimate within the context of the Bayesian framework: I have the prior, I have the conditional distribution, and I compute the mean. Now what is p(z)? p(z) is the probability density of the observation, and you can readily see that p(z) is the marginal density: p(z) = ∫ over Rⁿ of p(x, z) dx, integrating the joint density with respect to x, which can also be written as p(z) = ∫ over Rⁿ of p(z | x) p(x) dx. That expression is used in the denominator. By combining these we get the important formula for the optimal Bayesian least squares estimate, and that is the structure of the optimal estimate that comes out of this analysis. Now look at this: least squares again comes in a beautiful way. We can do least squares either within the Bayesian setting, where there are two pieces of information, the prior and the observation, or in a Fisher-like situation, where there is no prior; then I have only the one piece coming from the observation, and all I can do is extract the juice out of the observation. We have already seen the latter both in the case of deterministic least squares and in the case of statistical least squares; now we are redoing the same least squares within the framework of Bayesian analysis.
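As a sketch, the formula X̂_MS = ∫ x p(z | x) p(x) dx / ∫ p(z | x) p(x) dx can be evaluated numerically on a grid. The Gaussian model and its parameter values below are my own illustrative assumptions, chosen so that a closed-form answer exists to compare against.

```python
import numpy as np

# Grid evaluation of x_hat_MS = E[x | z] via Bayes' rule.
m_x, s_x = 1.0, 2.0   # prior: x ~ N(m_x, s_x^2)
s_v = 0.5             # likelihood: p(z | x) = N(z; x, s_v^2)
z = 2.0               # the observed value

x = np.linspace(m_x - 10 * s_x, m_x + 10 * s_x, 20001)
dx = x[1] - x[0]
prior = np.exp(-0.5 * ((x - m_x) / s_x) ** 2)       # unnormalized prior density
likelihood = np.exp(-0.5 * ((z - x) / s_v) ** 2)    # unnormalized likelihood at z
numer = likelihood * prior                           # p(z|x) p(x), up to constants
p_z = numer.sum() * dx                               # marginal p(z), up to constants
x_hat_ms = (x * numer).sum() * dx / p_z              # constants cancel in the ratio

# Closed-form Gaussian posterior mean for comparison (standard result):
closed = (m_x / s_x**2 + z / s_v**2) / (1.0 / s_x**2 + 1.0 / s_v**2)
print(abs(x_hat_ms - closed) < 1e-6)  # True
```

Notice that the normalizing constants of the two densities cancel in the ratio, which is why the unnormalized exponentials suffice on the grid.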
So having established the structure of the optimal Bayesian estimate using the least squares criterion, I am now going to discuss some of its properties. The first property is that this estimate is unbiased. What does that mean? X is the unknown and X̂_MS is the estimate of X; from the definition of unbiasedness in the earlier discussion, the expected value E[X − X̂_MS] must equal 0. To verify this, I again use the notion of repeated conditional expectation, a very beautiful mathematical trick which we have already used in the derivation of the structure of the Bayesian estimate in the previous slides. So I express this expectation as the iterated expectation E[E[X − X̂_MS | Z]]. X̂_MS is a function of Z, and the conditional expectation is again a linear operator, so I can express the inner expectation as the difference E[X | Z] − X̂_MS. But a little reflection reveals that these two are one and the same, so the difference is 0. That means the Bayesian estimate is automatically unbiased: X̃ = X − X̂_MS, and E[X̃] = E[X − X̂_MS] = 0, so the mean of the error in the estimate is 0. Now let us come back and revisit the Bayes cost. Consider B(X̂_MS | Z), the conditional Bayes cost given the observation, which we have already written down. Going to the previous slide, I would like to see what this quantity is: it is the integral of (x − X̂_MS)ᵀ(x − X̂_MS), the sum of the squared errors in the Bayesian estimate, which I am trying to
integrate against the conditional density, the posterior density, with respect to x; here I have assumed W = I. So what is this? It is the random variable minus its mean, transposed, times the random variable minus its mean: the total variance in the components of X̃. Therefore the conditional Bayes cost function is equal to the total variance, and by virtue of the property of X̂_MS it readily follows that the Bayesian estimate minimizes the conditional Bayes cost; the conditional Bayes cost is essentially the variance, so it minimizes the variance. So this is also a minimum variance estimate, and that is a beautiful interpretation of what the Bayesian estimate is all about and of the properties associated with it. I hope the concept is clear. Please also remember that when the estimate is unbiased, the minimum mean squared error is equal to the minimum variance, which you have already seen; that is one of the reasons why we require unbiasedness, as we alluded to when we discussed it. Because the estimate is unbiased, the right-hand side relates to the conditional variance given the observation, while the left-hand side is the Bayes cost function given the observation; and by virtue of the fact that the Bayesian estimate minimizes the conditional Bayes cost function, we already know the left-hand side is the minimum, because X̂_MS is a minimizer of the conditional Bayes cost function, and the right-hand side has the interpretation of a variance. That is why we can associate these important properties with this class of Bayesian estimation: the optimal estimate equals the posterior (conditional) mean, and it also minimizes the total variance in the estimate. So I would now like to bring out the similarity. Until now
we used only observations, in which case we said X̂_LS is a BLUE, a best linear unbiased estimator, and we had the Gauss-Markov theorem. The Gauss-Markov theorem refers to the optimality of the least squares estimate when there is only the observation and no prior. When there is an observation and a prior, within the Bayesian context, X̂_MS is also a BLUE; X̂_MS is the Bayes posterior mean. So X̂_LS is the BLUE using only the observation, and X̂_MS is a BLUE using the observation and the prior. You can see least squares coming in on both sides, one with and one without the use of the prior, and you can see the parallelism between the arguments in the Bayesian context and in the non-Bayesian way of estimation, all within the context of least squares. Thus far we have talked about the Bayesian context with the least squares cost function. Please remember you could have considered the Bayes cost function with any choice of cost function, and we have already given 4 choices: the first two relate to least squares, the third, the uniform cost, is something else, and the fourth, the absolute cost, is something else again. You can readily see that if I changed the form of the cost function, the form of the estimate would also correspondingly change; that means we would get a variety of different types of Bayesian estimates, one for each possible choice of the cost function. That goes to show the richness of the Bayesian formulation. Of course, even in the case when there is no prior we could have considered very many different types of cost functions; since we are trying to seek optimal estimates, we chose the least squares criterion because it has very nice properties with respect to the existence of a unique minimum and so on.
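Both properties above, unbiasedness and minimum variance, can be spot-checked by Monte Carlo for the scalar Gaussian model z = x + v. All the numbers here are my own toy choices, and the closed form E[x | z] = m_x + s_x²/(s_x² + s_v²)(z − m_x) used below is the standard Gaussian result.

```python
import numpy as np

# Monte Carlo check: the Bayesian MS estimate is unbiased and has smaller
# error variance than a competitor unbiased estimator (x_hat = z).
rng = np.random.default_rng(1)
m_x, s_x, s_v = 1.0, 2.0, 0.5
n = 200_000

x = rng.normal(m_x, s_x, n)
z = x + rng.normal(0.0, s_v, n)

w = s_x**2 / (s_x**2 + s_v**2)
x_hat_ms = m_x + w * (z - m_x)          # conditional mean E[x | z]

bias = np.mean(x - x_hat_ms)            # unbiasedness: should be ~0
var_ms = np.var(x - x_hat_ms)           # error variance of the Bayesian estimate
var_z = np.var(x - z)                   # error variance of x_hat = z (also unbiased)
print(abs(bias) < 0.01, var_ms < var_z)  # expected True True
```

The competitor x̂ = z is unbiased too, since E[x − z] = −E[v] = 0, which makes the variance comparison fair: the Bayesian estimate wins by exploiting the prior.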
So I am going to quickly illustrate some of these ideas using a simple example. Let z = x + v, where x is the unknown and v is the noise. The noise v has zero mean and variance sigma v squared. The unknown x is a random variable with mean m x and variance sigma x squared. We are assuming x and v are uncorrelated, which means the expectation of the product x v must be zero; in this case I do not have to worry about transposes, because x and v are scalar variables. Now z is the sum of two random variables, x and v, which are both normal and uncorrelated; the sum of two uncorrelated normals is again normal, so z is a normal variable whose mean is m x and whose variance sigma squared is the sum of the variances. So when you add two random variables, not only the mean changes but also the variance: here the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances. This happens because in this simple case z is the sum of x and v, both Gaussian and uncorrelated; if they were correlated, the expression for the variance would be different. We are considering only the simplest possible case.
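A quick numerical check of this fact; a minimal sketch, where the particular values m x = 1, sigma x squared = 4, sigma v squared = 1 are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
m_x, var_x, var_v = 1.0, 4.0, 1.0        # illustrative prior mean/variance and noise variance

n = 1_000_000
x = rng.normal(m_x, np.sqrt(var_x), n)   # draws of the unknown from the prior
v = rng.normal(0.0, np.sqrt(var_v), n)   # zero-mean noise, drawn independently of x
z = x + v                                # the observation model z = x + v

print(z.mean())   # close to m_x: the mean of the sum is the sum of the means
print(z.var())    # close to var_x + var_v: the variance of the sum is the sum of the variances
```

With a million samples the empirical mean and variance of z land within a few hundredths of m x = 1 and sigma x squared + sigma v squared = 5.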
So we now have all the information needed to compute the posterior using Bayes' rule. Please recall the conditional distribution: since z = x + v, if x is given then x is fixed and z is random only because of v, and v has zero mean; therefore the conditional distribution of z given x is normal with mean x and variance sigma v squared, which is the formula given here. The prior on x is already given, with mean m x and variance sigma x squared. Therefore the posterior, by Bayes' rule, is given by the ratio shown: I now substitute p(z|x), p(x) and p(z). We know the functional form of all three normal densities, so if you substitute each one and simplify, you get the following expression: beta times an exponential, where beta is a constant that depends on pi and the variances and can be computed explicitly, and the exponent is the sum of three terms. I would like you to carry out the substitution and simplification yourself; it is a very good exercise. Now look at the term under the bracket. By the method of completing the square, I can express it as x squared times one coefficient, minus 2x times another coefficient, plus a constant, after routine simplification. So I have an x squared term and a 2x term, and I would like to express the whole thing as (x minus something) squared plus a constant. To that end we define a new quantity sigma e squared: I define 1 over sigma e squared to be the sum of two reciprocals. In other words, the reciprocal of a variance is an
information measure: 1 over sigma v squared is the reciprocal of the variance of the noise, and 1 over sigma x squared is the reciprocal of the variance of the prior, and their sum is 1 over sigma e squared, which can be rewritten accordingly. I can also construct a quantity x hat MS divided by sigma e squared, given by the expression shown. Call these definitions (2) and (3). Using (2) and (3) we can express the right hand side of (1); please remember (1) has three terms on its right hand side. By completing the square and using the definitions in (2) and (3), the right hand side of (1) becomes simply 1 over sigma e squared times (x minus x hat MS) squared; x hat MS has been defined on the previous page and sigma e squared in (2), so (2) and (3) give you the basic definitions. With that, I can indeed rewrite the posterior density p(x|z) in this particular form, where alpha is a constant that can again be expressed explicitly; in this case we already know this is a density function, and alpha can be expressed in explicit form using the beta introduced earlier. We leave the explicit computation of the constants beta and alpha as an exercise. If you now look at this expression, it follows that x hat MS is the mean. Please go back: what is the normal density? p(x) equals 1 over (square root of 2 pi times sigma) times exp of minus (x minus m) squared over 2 sigma squared, where m is the mean and sigma squared is the variance of the random variable x.
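The completing-the-square step can be summarized compactly; a reconstruction in LaTeX, consistent with definitions (2) and (3) as described in the lecture:

```latex
% definitions (2) and (3)
\frac{1}{\sigma_e^{2}} \;=\; \frac{1}{\sigma_v^{2}} + \frac{1}{\sigma_x^{2}},
\qquad
\frac{\hat{x}_{MS}}{\sigma_e^{2}} \;=\; \frac{z}{\sigma_v^{2}} + \frac{m_x}{\sigma_x^{2}}.

% completing the square in the exponent of the posterior
x^{2}\Big(\tfrac{1}{\sigma_v^{2}}+\tfrac{1}{\sigma_x^{2}}\Big)
 \;-\; 2x\Big(\tfrac{z}{\sigma_v^{2}}+\tfrac{m_x}{\sigma_x^{2}}\Big)
 \;+\; \text{const}
 \;=\; \frac{(x-\hat{x}_{MS})^{2}}{\sigma_e^{2}} \;+\; \text{const}',

% so the posterior is Gaussian with mean \hat{x}_{MS} and variance \sigma_e^{2}:
p(x \mid z) \;=\; \alpha\,
  \exp\!\Big(-\frac{(x-\hat{x}_{MS})^{2}}{2\,\sigma_e^{2}}\Big).
```

Comparing the last line with the standard normal density is exactly how the posterior mean and variance are read off.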
So if I use that analogy here, it readily follows that the posterior mean is given by the formula from the previous page, which, using the expressions in (2) and (3), can be written as sigma v squared over (sigma x squared plus sigma v squared) times m x, plus sigma x squared over (sigma x squared plus sigma v squared) times z. A remark on notation: the normalizing constant I wrote earlier as alpha should really be the same as beta; I do not want to confuse the two, because here we are introducing a new alpha for the ratio sigma v squared over (sigma x squared plus sigma v squared), and the other coefficient is 1 minus alpha. So what does this tell you? The best Bayesian least squares estimate is a convex combination of the prior mean and the observation: alpha times m x plus (1 minus alpha) times z. Geometrically, if m x is a point here and z is a point there, any point along the line joining them has this particular form; that is what a convex combination is. Therefore the Bayes least squares estimate is a convex combination of the prior mean and the observation, and that is an important conclusion from this analysis. I have not shown the algebraic simplifications; I think it is a very good exercise to work through them. I can also rewrite the expression in (6) as x hat MS equals m x, the prior mean, plus (z minus m x) times sigma x squared over (sigma x squared plus sigma v squared). You can readily see that quantity, which on the previous slide we called 1 minus alpha; we now simply call it a gain term.
So the Bayesian estimate has a very beautiful structure and interpretation: it is equal to the prior mean plus a gain times (z minus m x). Here m x is what I already knew and z is the new information, so z minus m x gives me what is called the innovation, the information in excess of what I already knew. The new estimate is the prior plus a constant times the innovation, and this is the form that underlies the well-known Kalman filter; you can readily see the form of the Kalman filter coming in here. If I did not have any new information, my best estimate would simply be the prior mean; but if I get new information, I update my belief to obtain the posterior mean. So the posterior mean, which is the Bayesian optimal estimate, equals the prior mean plus a correction term, and the correction term is the product of a gain and the innovation; that is a very standard form, so you can think of this as a Kalman-filter-type equation. This form also has an adaptive property. If sigma x squared is much larger than sigma v squared, what does it mean? The observations are more reliable than the prior: if the prior variance is very large compared to the observation variance, the observations are more trustworthy, and in equation (6) the observations will be given more weight. On the other hand, if the observations are less reliable than the prior, for example if sigma v squared is much larger than sigma x squared, then the prior is more reliable than the observation, and the prior gets the larger weight.
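These two equivalent forms, and the adaptive weighting, can be sketched numerically. A minimal sketch; the helper name `bayes_scalar` and the particular numbers are illustrative assumptions, not from the lecture:

```python
import numpy as np

def bayes_scalar(z, m_x, var_x, var_v):
    """Scalar Bayesian least squares estimate, in both equivalent forms."""
    a = var_v / (var_x + var_v)             # weight on the prior mean (the lecture's alpha)
    convex_form = a * m_x + (1.0 - a) * z   # convex combination of prior mean and observation
    gain = var_x / (var_x + var_v)          # Kalman-like gain, equal to 1 - a
    kf_form = m_x + gain * (z - m_x)        # prior mean + gain * innovation
    assert np.isclose(convex_form, kf_form) # the two forms are identical
    return kf_form

m_x, z = 0.0, 10.0
# prior much less reliable than the observation: estimate moves close to z
print(bayes_scalar(z, m_x, var_x=100.0, var_v=1.0))   # approx 9.90
# observation much less reliable than the prior: estimate stays near m_x
print(bayes_scalar(z, m_x, var_x=1.0, var_v=100.0))   # approx 0.099
```

The two calls demonstrate the adaptive property: the same formula automatically shifts its trust between the prior mean and the data as the variances change.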
So that is a beautiful adaptivity property in the formula given in (6), and also in the Kalman filter form; that is what is called the adaptive property. Equation (7) is another way of rewriting the same thing: a is a constant, a equals sigma v squared over (sigma x squared plus sigma v squared), which is the quantity we have also called alpha. So x hat MS can be written as a times m x plus (1 minus a) times z, which in turn can be written as m x plus (1 minus a) times (z minus m x); you can see these two forms are essentially the same. The second is also called the Kalman-filter-like representation, the KF form. That is the basic idea behind Bayesian estimation. We now continue our illustration of the Bayesian least squares estimate for the vector case; we have already seen the properties of the Bayesian estimate for the simple scalar linear case, and this is the extension to vectors. So z = Hx + v, where z is a vector in R^m and x is a vector in R^n, and we use all the properties of matrices and vectors that we have been utilizing all along. The noise v is such that E[v] = 0, that is, its mean is zero, and its covariance is Sigma v; please note we are now using the subscript notation Sigma sub v. So v is normal with zero mean and covariance Sigma v. The unknown x has a prior distribution: the expected value of x with respect to the prior is m x, a vector, and the covariance of x is Sigma x, a matrix; so x has a normal prior distribution. Now Hx is a deterministic function of x, so Hx is random
because x is random. Hx has a normal distribution whose mean is H times m x and whose covariance is H Sigma x H transpose; that is a very simple exercise in probability theory. Indeed, the covariance of Hx is the expected value of (Hx minus H m x) times (Hx minus H m x) transpose, which equals H times the covariance of x times H transpose, that is, H Sigma x H transpose; that is the formula that comes out, and that is why this term occurs. We are also assuming x and v are uncorrelated. Since z = Hx + v, with Hx normal, v normal, and x and v uncorrelated, z is normal. The mean of z is H times m x, and if you do the simplification of the outer product as given here, you get the formula in (9): the covariance of z is H Sigma x H transpose plus Sigma v. I think it is worth remembering that z gets its randomness from two different directions, one from x through H and one from the additive noise v; since these are uncorrelated, the covariance of z is the sum of the two covariances, one coming from x through H and the other from v, and the total covariance of z is given by (9). That again comes from basic probability calculations. Next, the conditional expectation of z given x equals the conditional expectation of Hx + v given x; since x is given, Hx is already known and comes out of the expectation operator, and we are left with the conditional expectation of v given x. Because x and v are uncorrelated (and jointly normal), E[v given x] = E[v] = 0, so the conditional expectation of z given x is Hx. The conditional covariance of z given x is Sigma v, because from z you must subtract Hx; that is why the conditional covariance of z with
respect to x, that is, given x, is equal to Sigma v. So I have the conditional mean and the conditional covariance; and since the conditional distribution is also normal, having both gives me the conditional distribution of z given x, shown in (12). That is the distribution of the observation conditional on the fact that x has already been chosen by mother nature, even though I do not know the value of x. Now I would like to carry out the posterior analysis. The posterior distribution is obtained by invoking Bayes' rule; this is essentially a statement of Bayes' rule, and each of the densities involved is normal. Remember, this is exactly what we did in the scalar case: the numerator is the product of two normal densities and the denominator is another normal density. This ratio can again be expressed as a constant times an expression which, even though it looks complicated, is arithmetically easy to simplify. Now consider the exponent; again we proceed in parallel with the scalar case, except that scalar quantities become matrix and vector quantities. The exponent contains the quadratic term x transpose (H transpose Sigma v inverse H plus Sigma x inverse) x, a term linear in x, and terms with no x at all. So if I have a quadratic term, a linear term, and a constant term, what is the basic idea? You use the method of completing the square: you add and subtract a constant so that you can extract a perfect square. It is the same principle as in the scalar case, but the algebra is a little more involved in the
vector case. The exponent in (13) can therefore be simplified as (14), and after completing the square, (14) becomes identically equal to (15), where Sigma e inverse is given by (16) and x hat MS by (17). You can verify, by substituting (16) and (17) into (15), that (14) and (15) are equal. Yes, there is a good amount of algebra involved here, and since our aim is to indicate all the major steps, we leave the algebra for the reader to verify; I think anyone who wants to understand these derivations thoroughly must go through the details of all the simplifications. With this we have derived an expression for the best Bayesian estimate, and this is its covariance. The adaptivity property we discussed in the scalar case also applies here, but the interpretation is a little more complicated because matrices, and the operator H, come into play. In principle, though, all the conclusions carry over, including the adaptivity with respect to which piece of information gets more weight: there are two pieces of information, and the posterior mean is a linear function of the prior mean and of the new information that comes through the innovation. How these two terms are weighted relative to each other depends on the relative values of the covariance matrices of the prior and of the conditional distribution of z given x. The similarity is very obvious, and I would definitely like the reader to compare the scalar expressions with the vector expressions and identify which term in the scalar case corresponds to which term in the vector case; I think it will be very beneficial for everyone to do that. So with that we come to the end
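The transcript does not display equations (16) and (17) explicitly; assuming they are the standard information-form expressions for the Gaussian posterior, they can be checked numerically against the equivalent gain form familiar from the Kalman filter. A sketch with made-up matrices (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2                                # x in R^n, z in R^m
H = rng.standard_normal((m, n))            # observation operator
A = rng.standard_normal((n, n))
Sigma_x = A @ A.T + np.eye(n)              # prior covariance (symmetric positive definite)
Sigma_v = np.diag([0.5, 2.0])              # observation-noise covariance
m_x = np.array([1.0, -1.0, 0.5])           # prior mean
z = np.array([0.3, 1.2])                   # an observation

# Information form, as in (16)-(17):
#   Sigma_e^{-1} = H^T Sigma_v^{-1} H + Sigma_x^{-1}
#   x_ms = Sigma_e (H^T Sigma_v^{-1} z + Sigma_x^{-1} m_x)
Sv_inv = np.linalg.inv(Sigma_v)
Sx_inv = np.linalg.inv(Sigma_x)
Sigma_e = np.linalg.inv(H.T @ Sv_inv @ H + Sx_inv)
x_ms_info = Sigma_e @ (H.T @ Sv_inv @ z + Sx_inv @ m_x)

# Equivalent gain form: x_ms = m_x + K (z - H m_x),
# with K = Sigma_x H^T (H Sigma_x H^T + Sigma_v)^{-1}
K = Sigma_x @ H.T @ np.linalg.inv(H @ Sigma_x @ H.T + Sigma_v)
x_ms_gain = m_x + K @ (z - H @ m_x)

print(np.allclose(x_ms_info, x_ms_gain))   # True: the two forms agree
```

The agreement of the two forms is the matrix analogue of the scalar identity between the convex-combination form and the prior-plus-gain-times-innovation form, and it is exactly the algebra the exercise asks the reader to verify.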
of the discussion of the Bayesian methods. As you can see, it is this Bayesian method that is going to be the basis for the stochastic aspects of data assimilation. As an exercise: by substituting (16) and (17) into (15), verify that (14) and (15) are equivalent. As I have already mentioned, this relates to the principle of completing the square in matrix-vector notation, and algebraically it is non-trivial; please do work through it. A reference for this is Melsa and Cohn, Decision and Estimation Theory, McGraw-Hill; you can also refer to chapter 16 of Lewis, Lakshmivarahan and Dhall (2006). With that we conclude this elementary discussion of Bayesian least squares estimation. Thank you.