I have taught a lot of topics because students lack the background, and some of those topics are not directly related to Bayesian statistics. But today we are going to do the heart of the Bayesian calculation, the most important one. Because it is so important, I will cover it in a number of different ways so that we can all follow. We start with a variable X which is normal with mean θ and variance σ², and we want to do inference. I am going to assume that σ² is known, to keep things simple; later on we will work with more complicated models. So there is only one unknown parameter. (Strictly, the model has two parameters, θ and σ², or θ and σ.) Now, in the conventional econometric or statistical methodology, θ is some unknown value, there is one true unknown value, and the goal of inference is to find that true value. This is a really bad way of thinking about the problem, because there is no true value, and it can never be found. If something can never be found, there is no point in looking for it, like looking for flying horses: they don't exist. So if there is no true value, what is the point of looking for it? The whole conception is flawed. What is useful instead is to think of this as a model for the observation. The problem with conventional thinking is that it confuses the model with reality; it treats them as the same thing. If you treat them as the same thing, you say: here is a coin, I am flipping it, and the flips form a Bernoulli sequence. Then the null hypothesis is that the coin is fair, that is, p = 0.5 or not, and there are lots of tests for this. The problem is that this is something you can never know, because no matter how many times you flip the coin, a thousand times, a million times, the observed frequency will never be exactly 0.5. If the true p is close enough to 0.5, you will never detect the difference. Yet according to this conception, either p is 0.5 or it is not, and only one of these can be true. That is a real problem, because inference does not work like that. In our conception we say instead: I am taking a model. The model does not coincide with the coin; the model is a different thing. So if p = 0.5 is a good model for this set of observations, then p = 0.51 is also a good model, because they are similar. Once you separate the model from the coin, "the truth" disappears, and you say that all of these are good matches. So when we say that P(|X − θ| < σ) = 68%, conventional inference reads this as: if you make a confidence interval of plus or minus one standard deviation around the observation, there is a 68% chance of finding the true parameter inside it. This does not make sense, and people have been arguing about it for a long time: either the true parameter is in there or it is not; there is no probability about it, and there are many other problems with this reading. But we can say instead: values of θ within plus or minus one standard deviation of X are reasonably good matches to the observation. All of them are good; all of them are equally acceptable. And if you go out to two standard deviations, you cover 95%.
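As a quick numerical check of these interval probabilities, here is a minimal Python sketch using scipy's normal CDF; the particular θ and σ values are arbitrary illustrations, since any choice gives the same probabilities.

```python
# Check the 1/2/3-standard-deviation interval probabilities for a normal.
from scipy.stats import norm

theta, sigma = 0.0, 1.0   # illustrative values; any theta and sigma > 0 behave the same
for k in (1, 2, 3):
    p = norm.cdf(theta + k * sigma, loc=theta, scale=sigma) \
        - norm.cdf(theta - k * sigma, loc=theta, scale=sigma)
    print(f"P(|X - theta| < {k} sd) = {p:.4f}")
# Prints ~0.6827, 0.9545, 0.9973 -- the 68/95/99.7 rule quoted above.
```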
So this is more or less the range of values which are compatible with the data. If you go further out, beyond three standard deviations, you say this is a very bad model; between two and three it is pretty poor, though in some situations you might tolerate that much discrepancy. So you have a range of models which provide varying degrees of goodness of fit to the observation; there is no true model. Now, in the Bayesian view of the world, instead of thinking of θ as a true value, we assign a probability distribution to θ. Again, the standard Bayesian view also mistakenly assumes there is a true model, but we can drop that assumption and say: the probability distribution of θ tells us which values of θ are compatible with the data. Highly probable values of θ are good values, more compatible; values with low probability are less compatible. With this method we get a better approach to statistics, because we have a more flexible set of models with which to model reality, so we have a better chance of finding something suitable. This actually turns out to be the case: there are many situations where Bayesian models provide a more flexible class that can match real observations which conventional models cannot, because conventional models are too rigid. They insist there can be only one value, while allowing a range of values gives more flexible models. That is the rough general idea; now we come to the specific details. In the Bayesian view, the data distribution is the conditional distribution of X given θ, with θ regarded separately as a random parameter. Then we have the marginal distribution of θ, which we take to be normal with mean m and variance s². These are the parameters of the prior distribution, and to distinguish them from the parameters of the data distribution they are called hyperparameters. Sometimes you have hierarchical Bayesian models, which are also useful: multi-stage priors, so we can have priors over m and s² as well, and their parameters would be hyper-hyperparameters. But anyway. Is it called a hyperparameter because of the difference between the values? No. The data depends on the parameters, and the parameters have a prior distribution. This doesn't exist in conventional statistics, where parameters are fixed constants, not random. Once you allow random parameters, the distribution of the parameter has its own flexibility, governed by its own parameters, and those parameters of the prior distribution are the hyperparameters. They are not directly involved in the distribution of the data; they sit at one remove from the distribution of the observations. Now the key question: I have these two distributions, one conditional and one marginal, and I want the joint, the other conditional, and the other marginal. Deriving those three distributions is the problem. There are two major ways to do it, and many other methods besides. This is the key computation: this is the simplest case, and the same computation is repeated in more and more complicated forms in more and more complicated setups, so it is important to understand this basic calculation.
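Before starting the calculation, here is a minimal sketch of the two given distributions in code, which later checks can build on; the numerical values of m, s, and σ are illustrative choices, not part of the lecture.

```python
# The two distributions we start from: the data density and the prior.
from scipy.stats import norm

m, s = 1.0, 2.0      # hyperparameters of the prior on theta (illustrative)
sigma = 1.5          # known standard deviation of the data (illustrative)

prior = norm(loc=m, scale=s)                 # theta ~ N(m, s^2)

def data_density(theta):
    """Conditional distribution of X given theta: N(theta, sigma^2)."""
    return norm(loc=theta, scale=sigma)
```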
So, the joint distribution first. I want to calculate the joint distribution of X and θ. I can just write it down: it is the conditional times the marginal, so I multiply those two densities together to get the joint density. But that does not finish the job, because I have to recognize what it is. It is actually going to be, as we know, a bivariate normal density, and writing down the product will not by itself let me recognize that. Still, let me write it down. The conditional density of X given θ is (1/√(2πσ²)) exp(−(X − θ)²/(2σ²)), and I multiply it by the marginal density of θ, which is (1/√(2πs²)) exp(−(θ − m)²/(2s²)). That is the joint density. But what I actually want to know is: which bivariate normal is it? A bivariate normal has five parameters, and I am going to compute those five parameters. I can do this in several ways: directly, by manipulating this product into the bivariate normal form, or by using properties of expectations. I will do the latter first. What is the expected value of X, by which I mean the marginal expected value, since I want the marginal distribution of X? I don't have the marginal distribution of X; I have the conditional distribution of X given θ. To get the marginal mean there is a standard trick: E[X] = E_θ[ E[X | θ] ]. To see why, note that by definition E[X] is the double integral ∫∫ x f(x, θ) dx dθ, the integral of x against the joint density. I am not handed that joint density in the original problem, but I can write the double integral as ∫ ( ∫ x f(x | θ) dx ) f(θ) dθ: the inner integral is the conditional expectation E[X | θ], and the outer integral averages that quantity, whatever it is, with respect to the marginal density of θ. That is what the symbolism means: integrate with respect to the conditional distribution of X given θ, then with respect to the marginal of θ, and you get what you are looking for, E[X]. The first step is very simple. What is the expected value of X under the conditional density of X given θ, which is a normal distribution? Yes, it's θ. And what is the expected value of θ under the marginal density of θ? M, exactly. So in the joint density of (X, θ), the mean of both components is m: the mean of X is m and the mean of θ is m. It is important to understand this intuitively. θ is symmetrically distributed around m, because it is normal, and X given θ is symmetrically distributed around θ. So if you take θ out of the picture, which is what integrating it out does, X ends up symmetric around m: you start at m, generate a θ symmetric around m, then generate an X symmetric around that θ. Both variables therefore have the same mean.
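A two-stage simulation makes this concrete: sample θ from the prior, then X given θ, and average. The parameter values below are arbitrary illustrations.

```python
# Monte Carlo check that the marginal mean of X equals m.
import numpy as np

rng = np.random.default_rng(0)
m, s, sigma = 1.0, 2.0, 1.5                 # illustrative values
theta = rng.normal(m, s, size=1_000_000)    # draw theta from the prior
x = rng.normal(theta, sigma)                # then draw X given each theta
print(x.mean())                             # close to m = 1.0
```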
Now, the second-order parameters: the covariance matrix of the two variables. One entry of the covariance matrix is already known to us: the variance of θ is given in the problem, s². I need to calculate the other entries: the variance of X and the covariance of X and θ. What is the variance of X? The variance of X conditional on θ is σ², but I want the unconditional variance, so I need to do some work. The marginal variance is defined as E[(X − m)²], since m is the marginal mean; the conditional variance I already know is σ². Let me do it an easy way first. When X is normal with mean θ and variance σ², there is another way to write this, and it is very useful mentally to be able to switch back and forth between the distribution form and the model form. As a model, you can say X = θ + ε, where θ is constant conditionally and ε is an error term with ε ~ N(0, σ²). These two are the same thing: X ~ N(θ, σ²), or X = θ + ε with ε normal with mean 0 and variance σ²; adding θ just shifts the mean to θ. As long as we condition on θ, this is just fine; the two are identical. So now Var(X) = Var(θ) + Var(ε). Conditionally on θ, the first term is zero and the variance is σ². But unconditionally, when we let θ itself fluctuate, θ has its own variance, which is s², and that is the answer: Var(X) = s² + σ². It is very simple. There are many other ways to get the same result, but it is best to understand it intuitively: we generate θ, and θ has variance s²; post-experimentally θ becomes fixed, and I add an error term with variance σ² to get X. So X fluctuates by σ² around a θ which itself fluctuates by s². The two sources of fluctuation are additive because they are independent of each other, so you just get s² + σ². Now the third entry in the matrix, the covariance of X and θ: Cov(X, θ) = Cov(θ + ε, θ), which equals Cov(θ, θ) + Cov(ε, θ).
Now, ε has no covariance with θ, so that term is zero: the two are independent, and when two things are independent the covariance is always zero. And Cov(θ, θ), what is that? Yes, it's the variance of θ, which is s². So that is it: we have calculated the joint distribution of X and θ, and now we can compute the marginal of X and the conditional of θ given X, the two missing distributions. What is the marginal density of X? It is going to be normal. What is its mean? M, exactly. Its variance? s² + σ². So I have the marginal of X; the marginal of θ was already given to me; and there is only one more parameter, since the bivariate normal has five. I have the mean and variance of X, the mean and variance of θ, and the covariance of θ and X, which is just s². If I want the correlation, I have to divide the covariance by the standard deviations: ρ = Cov(X, θ) / (sd(X) · sd(θ)) = s² / (√(s² + σ²) · s) = s / √(s² + σ²). That is the definition of the correlation: the covariance divided by the square roots of the two variances. So this is something a little bit complicated, but not too much. Okay. Now I have the joint, the marginal of X, and one conditional; I need the other conditional. It is also useful to pick up the Bayesian terminology here. The density of the observations conditional on the parameters is called the data density. The marginal density of θ is called the prior density of θ. And the central thing in a Bayesian calculation is to calculate the posterior density of θ: the conditional density of θ given X. This is the target of a Bayesian calculation. The idea is that the prior distribution is what you know about θ before you observe the data; then you observe the data, which gives you more information, and so you update the prior. The posterior density combines the information in the prior, which you had before you looked at X, with the additional information you got out of X. Now, the posterior density is a normal density with a certain mean and a certain variance; we know that because we know the joint is bivariate normal, and we have formulas for it. The regression formula for the bivariate normal says that the standardization of θ satisfies E[(θ − m)/s | X] = ρ · (X − m)/√(s² + σ²), where ρ is the correlation coefficient and √(s² + σ²) is the standard deviation of X. From this we conclude E[θ | X] = m + (s · ρ / √(s² + σ²)) (X − m), where ρ is the correlation we just computed from the covariance s². I could now do a little algebra to put this into the form that is desired, but I will leave that to you; it is only minor algebra. That gives one parameter of the conditional density we are trying to calculate; the other parameter is the variance of θ given X.
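Before computing that conditional variance, here is a short simulation confirming the joint moments just derived, using the same two-stage sampling as before; the parameter values are illustrative.

```python
# Monte Carlo check of Var(X) = s^2 + sigma^2, Cov(X, theta) = s^2,
# and rho = s / sqrt(s^2 + sigma^2).
import numpy as np

rng = np.random.default_rng(0)
m, s, sigma = 1.0, 2.0, 1.5                 # illustrative values
theta = rng.normal(m, s, size=1_000_000)
x = rng.normal(theta, sigma)

print(x.var(), s**2 + sigma**2)             # Var(X): both ~ 6.25
print(np.cov(x, theta)[0, 1], s**2)         # Cov(X, theta): both ~ 4.0
print(np.corrcoef(x, theta)[0, 1],
      s / np.sqrt(s**2 + sigma**2))         # correlation: both ~ 0.8
```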
And the variance of the conditional density, by the bivariate normal formula, is (1 − ρ²) times the variance of θ: Var(θ | X) = (1 − ρ²) s², and since ρ² = s²/(s² + σ²), this works out to s²σ²/(s² + σ²). So that gives you one set of formulas, and from there it is a lot of messy algebra. I am not going to carry that derivation through completely; we can write down the answer. The conditional mean is the one we just found, and the conditional variance is (1 − ρ²) s². A lot of manipulation from this, simple high-school algebra, nothing complicated, will get you to the formulas that I am now going to show, and then I am going to derive the same formulas by another route. In these formulas it is very useful to introduce a quantity called the precision. The precision is 1 over the variance. I will call the precision of X, the data precision, p_X = 1/σ². Precision is the opposite of variance: if the variance is high, the data is very imprecise, it can vary a lot; if the variance is low, the data is very precise, sharply concentrated around its mean. So the precision has a natural interpretation. Then there is the precision of θ, which is p_θ = 1/s², again by the definition that precision is 1 over the variance. Now the very interesting and useful result is that E[θ | X] = (p_X · X + p_θ · m)/(p_X + p_θ): the data value times its precision, plus the prior mean times its precision, divided by the sum. This is a very beautiful formula, and it has a very natural interpretation. You can get from the bivariate-normal expression to this one, but you will spend a lot of ink and a lot of paper; I am going to do that calculation in a different way. So what is the beauty of this? First, think about the conditional density of X given θ, and suppose you only had the data, no prior. What would you do? The maximum likelihood estimate of θ is just X. That is easy to see if you take the likelihood function and maximize it with respect to θ: you will find that θ = X is the maximum. Here is the likelihood: we have an observed value of X, and I want the θ that maximizes exp(−(X − θ)²/(2σ²)). Setting θ = X makes the exponent zero; any other value of θ makes (X − θ)² positive, so the exponent becomes negative, and e to a negative number is smaller than e to the zero. That is the technicality of the maximization. The point is: what value of θ best matches the observation? Setting θ = X is the best match; it gives the highest probability to the observation, and if you move θ away from it, the probability of X declines. So, based on the data alone, our best estimate of θ is X, because X is symmetric around θ.
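A grid search makes the same point numerically; the observed value and σ below are illustrative.

```python
# The normal log-likelihood in theta, for one observation, peaks at theta = x.
import numpy as np

x_obs, sigma = 2.3, 1.5                         # illustrative values
grid = np.linspace(x_obs - 5, x_obs + 5, 10_001)
loglik = -0.5 * (x_obs - grid)**2 / sigma**2    # log-likelihood up to a constant
print(grid[np.argmax(loglik)])                  # 2.3, i.e. theta-hat = x_obs
```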
Now, separately, consider the prior. Suppose I have only the prior information: θ is normal with mean m and variance s², and that is all I know; X is not available, the data is thrown away. The data gives some information, and the prior gives separate information. So with only the prior, what is my best estimator of θ? M, exactly: the mean of the prior. That is the highest point on the prior density, the most likely value of θ based only on the prior information. So now I have both: the data and the prior, and both of them are giving me information. The data is telling me to estimate θ by X; the prior is telling me to estimate θ by m; and I have to combine these two pieces of information. But it is not going to be a simple average, because the two pieces also have certain accuracies. Suppose one of them is a very precise measurement and the other is very imprecise: then I should take the precise estimate and give less weight to the imprecise one. That is basically what the formula does: it takes a weighted average of the two estimates, with the precision of X as the weight on X and the precision of θ as the weight on m. I multiply by the weights and divide by the sum of the weights: the standard weighted average of the two. And it makes perfect sense; this is what it should be. The second formula, for the variance, is again very beautiful: the precision of θ given X, which measures how much I know about θ after combining the prior and data information, is simply p_X + p_θ. The precision of X measures how much information there is in the data, the more precise the more information; the precision of θ measures how much information there is in the prior; and we just add the two to get the combined information. If I want the variance, Var(θ | X) = 1/(p_X + p_θ) by definition, which equals 1/(1/σ² + 1/s²) = s²σ²/(σ² + s²). These are two very beautiful formulas, and they tell you exactly the posterior distribution of θ given X. If you talk about the weights, are those simple descriptive statistics, or do we calculate them as well? The weights are the precisions: that's right, the precisions are the weights, and they come from the variances given in the model.
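The two formulas translate directly into code. This is a minimal sketch for the one-observation setup of this lecture; the numbers in the example call are illustrative.

```python
def posterior(x, sigma2, m, s2):
    """Posterior of theta after one observation x, for the model
    X | theta ~ N(theta, sigma2) with prior theta ~ N(m, s2)."""
    p_x, p_theta = 1.0 / sigma2, 1.0 / s2          # the two precisions
    p_post = p_x + p_theta                         # precisions add
    mean_post = (p_x * x + p_theta * m) / p_post   # precision-weighted average
    return mean_post, 1.0 / p_post                 # (posterior mean, variance)

print(posterior(2.3, 1.5**2, 1.0, 2.0**2))         # -> (1.832, 1.44)
```

Note that the posterior mean lands between the data value 2.3 and the prior mean 1.0, closer to the data because the data precision is the larger weight here.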
Now I would like to derive these formulas another way. I skipped the direct derivation because it takes a lot of algebra, and very often when you do it you make a mistake and then you are puzzled. Instead I am going to start from the joint density and calculate the posterior density of θ given X directly, using a standard Bayesian trick. The trick is to collect all of the terms that involve θ. The joint density factors as c(θ | x) times m(x): some terms belong in the conditional density and some in the marginal. When you look at the conditional density of θ given x, every term that involves θ must end up in that conditional density; it has to, it cannot go into the marginal. The key observation is that once I have collected the θ terms, which form what is called the kernel of the density, I do not need to worry about the other terms: all I need at the end is an integrating constant. That constant will be some function of x, since x is involved, but that does not matter; I just compute whatever makes the expression integrate to 1 in θ, and that, by definition, is the conditional density. In other words: suppose I have something proportional to c(θ | x) and I want to discover c(θ | x) itself. I integrate it, whatever it is, with respect to θ, call the result C, and divide by C; the result is automatically a density. So anything proportional to the density is good enough, as long as I can calculate the constant. This means I can drop from the joint density every factor that does not involve θ, and pick them all back up at the end through the integration constant. That is a very key trick in this Bayesian calculation, and it makes the calculation easy. So: the factor 1/√(2πσ²) drops out, no θ; the factor 1/√(2πs²) drops out, no θ. In the exponent, I keep the −1/2 outside and expand. From the data density, (x − θ)²/σ² gives x²/σ² − 2θx/σ² + θ²/σ²; the x²/σ² term does not involve θ, and since it enters as a multiplicative factor exp(−x²/(2σ²)), I can drop it. That leaves the two θ terms, −2θx/σ² + θ²/σ². Similarly, there are two terms from the prior, which shares the same −1/2 factor: (θ − m)²/s² contributes θ²/s² and −2mθ/s², while the m²/s² term does not involve θ and can be taken out. So I can say c(θ | x) is proportional to this expression, or equal to some function of x times it; I could calculate exactly what that function is, but I don't care. Now I want to complete the square, because I know the form I am aiming for: (θ − something)² over something, since that is what a normal density looks like. I recognize θ²/σ² + θ²/s² as p_X θ² + p_θ θ², since 1/σ² = p_X and 1/s² = p_θ. Let P = p_X + p_θ; this is in fact the precision of the posterior density, and it is already showing up. So the posterior density is proportional to exp(−(1/2)(P θ² − 2θ(p_X x + p_θ m))). I have just rearranged the terms: θ² is multiplied by the total posterior precision; the term −2θx/σ² is −2θ times x multiplied by its precision; and the term −2mθ/s² is −2θ times m multiplied by its precision.
Now factor P out of the bracket: the kernel is exp(−(P/2)(θ² − 2θ(p_X x + p_θ m)/P)), where I have divided the linear term by P. To complete the square, notice that (p_X x + p_θ m)/P is exactly E[θ | X], the weighted average of x and m with the precisions as weights; let me call it that, because that is what it is. One more term is needed: (E[θ | X])². I add this term, and of course I must also subtract the same term; but the subtracted term involves no θ, so it can go with the f(x) sitting outside as my proportionality constant, and I can ignore it. So I get c(θ | x) proportional to exp(−(P/2)(θ − E[θ | X])²), where E[θ | X] is the precision-weighted average of x and m. But this is exactly a normal kernel, and all I need is the factor √(P/(2π)) to complete the integration to 1: that is the integration constant I was missing. And that is it. The posterior density is normal, with mean equal to the weighted average I calculated and precision P, which means variance 1/P, and we can see that this is the variance of the normal. That proves what I was trying to prove. Now, if you are very patient and careful, and instead of dropping terms as I did you keep all of them and keep piling them up, every term is there, no magic, and then you arrange those terms, I have done it, it is a messy calculation: you will get exactly this normal kernel on one side, and the leftover factors, once you multiply through by the reciprocal of the normalizing constant √(P/(2π)), assemble into the marginal density of x. If you keep track of all the terms, that marginal density will be exactly what I told you earlier: mean m and variance σ² + s². But it becomes very complicated before it simplifies to that form. So that is the key computation. This is the bedrock; this computation must be understood, because everything else builds on it as a more complicated, fancier version of the same thing.
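If you want to check the completion of the square without pushing the algebra by hand, a computer algebra system can do the bookkeeping. Here is a sketch with sympy: it expands the exponent of (data density × prior), subtracts the completed square, and confirms the difference is free of θ.

```python
# Symbolic check: exponent of the kernel equals -P/2 (theta - mu)^2
# plus terms that do not involve theta.
import sympy as sp

theta, x, m = sp.symbols('theta x m')
p_x, p_t = sp.symbols('p_x p_theta', positive=True)

# exponent of (data density * prior), before completing the square
exponent = -sp.Rational(1, 2) * (p_x * (x - theta)**2 + p_t * (theta - m)**2)

P = p_x + p_t                                  # posterior precision
mu = (p_x * x + p_t * m) / P                   # posterior mean
target = -P / 2 * (theta - mu)**2              # completed square

# the difference should contain no theta, so its theta-derivative is zero
print(sp.simplify(sp.diff(sp.expand(exponent - target), theta)))   # -> 0
```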
So let me restate all of this as a theorem. I found at least two lectures, which I have put on the web page, that do this calculation, because it is so central that everybody does it; but I did not find anything that was nice. I was planning to basically cut and paste those lectures into my slides, but the derivations were all very clumsy, so I decided to do it myself. We started by saying that the marginal density of θ is normal with mean m and variance s², and I am going to write this as N*(m, p_θ): when I write N*, I am parameterizing by the precision instead of the variance s². It turns out to be convenient to use both parameterizations in Bayesian calculations, sometimes the variance, sometimes the precision. It is useful to write down the expression for the N* density: N*(m, p_θ) has density √(p_θ/(2π)) · exp(−(p_θ/2)(θ − m)²). That is the normal density parameterized in terms of the precision. The conditional density of X given θ is normal with mean θ and variance σ², which in this notation is N*(θ, p_X), where p_X = 1/σ² and p_θ = 1/s². Now, what I want from this is the joint density. The joint density is actually not what is useful, but we calculated it, so let me put it down: (X, θ) is bivariate normal with mean vector (m, m), not just any normal density, both means equal the same m, and covariance matrix with Var(X) = σ² + s², Var(θ) = s², and Cov(X, θ) = s² in the off-diagonal entries. From this, the correlation, which is an important parameter, is ρ = s²/(√(s² + σ²) · s) = s/√(s² + σ²); there are many other ways to write it. The marginal density of X is automatic from the joint: normal with mean m and variance σ² + s². And finally the prize, c(θ | x), the posterior density: it is normal with mean (p_X x + p_θ m)/(p_X + p_θ), and if I write it as N*, its precision is just p_X + p_θ, which is a nice expression. That is the complete derivation of the normal-normal Bayesian calculation. This is the heart of all Bayesian calculations, and then lots of more complicated versions come up which we will be studying.
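As a sanity check of the theorem, one can normalize the product (data density × prior) numerically on a grid of θ values and compare it with the closed-form posterior; the parameter values below are illustrative.

```python
# Numeric check: grid-normalized (data density * prior) matches the
# closed-form N(mean_post, var_post) posterior.
import numpy as np
from scipy.stats import norm

x_obs, sigma, m, s = 2.3, 1.5, 1.0, 2.0        # illustrative values
grid = np.linspace(-10, 12, 40_001)

kernel = norm.pdf(x_obs, loc=grid, scale=sigma) * norm.pdf(grid, loc=m, scale=s)
post_numeric = kernel / np.trapz(kernel, grid)  # normalize numerically

p_x, p_t = 1 / sigma**2, 1 / s**2
mean_post = (p_x * x_obs + p_t * m) / (p_x + p_t)
sd_post = (p_x + p_t) ** -0.5
print(np.max(np.abs(post_numeric - norm.pdf(grid, mean_post, sd_post))))  # ~0
```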
Now I want to discuss some implications of this formula for the theory of Bayesian estimation. The Bayesian estimator is the mean of the posterior, and it should be contrasted with the classical estimator. First, I made the argument that the maximum likelihood estimator is θ-hat-ML = x: if X is normal with mean θ and variance σ², the maximum likelihood estimator of θ is just x. This is obtained by taking the density and maximizing with respect to θ, and it can be done almost visually, without calculus: to minimize (x − θ)² you must set θ = x, immediately, because that value gives zero and every other value is positive. Since the likelihood is a monotonic transform of minus this quantity, the maximizing value of θ for the likelihood is x. That is the maximum likelihood estimator. In the Bayesian estimator you put a prior, which has its own mean and precision, and you get the formula (p_X x + p_θ m)/(p_X + p_θ), where m is the prior mean and p_θ is the prior precision. Now let us set the prior mean equal to zero, just as a convenient reference point for understanding the properties of the Bayesian estimator. In that case the Bayesian estimator is (p_X/(p_X + p_θ)) · x, just a linear function of x. The classical estimator has this coefficient equal to one, so this is called a shrinkage estimator: it takes your maximum likelihood estimate and shrinks it toward zero, pulling negative values up toward zero and positive values down toward zero. Now, how much shrinkage? It is very sensible. If the prior precision p_θ goes to zero, this is called an uninformative prior. Bayesians are often accused of introducing garbage information: people say that you don't have any real prior information, or that even if you have information, if it is your personal information it should not intrude upon scientific estimation; you should only introduce objective information, and a personal belief that something is true about θ should not affect your estimate. To answer this objection, Bayesians say that we can use objective priors: an objective prior is one carrying almost no information, and the no-information case is exactly the prior precision going to zero. Now, if the precision goes to zero, then even a non-zero prior mean disappears from the formula, and p_X/(p_X + p_θ) goes to one, so you get exactly the classical estimator. That is perfectly sensible: if the prior information is very bad or very weak, you ignore the prior, and your estimator is the same as the classical one. Now consider the other case, where your prior information is valuable. Suppose, for example, that p_θ is equal to p_X. Then the weights will be half and half: you take the data, multiply it by one half, and average it with the prior mean. This ties into the concept we discussed earlier about equivalent information: the prior is equivalent to having another sample. How large is that sample? If p_θ = p_X, your prior sample is exactly as large as your data sample, and in that case you combine them with equal weights.
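Here is the shrinkage coefficient in action, with prior mean zero as in the discussion above; the precision values are illustrative.

```python
# Shrinkage coefficient p_x / (p_x + p_theta) for a few prior precisions.
p_x = 1.0                                      # data precision (illustrative)
for p_theta in (0.0, 0.5, 1.0, 4.0):           # prior precisions to try
    w = p_x / (p_x + p_theta)                  # coefficient multiplying x
    print(f"p_theta = {p_theta}: estimator = {w:.3f} * x")
# p_theta = 0   -> 1.000 * x, the classical estimator (uninformative prior);
# p_theta = p_x -> 0.500 * x, equal weights;
# larger p_theta shrinks harder toward the prior mean 0.
```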
Sometimes, when you have many observations, you can do this updating sequentially; one of your exercises does exactly that. Now that we have this formula, consider the following situation: instead of one X, you have a sequence of i.i.d. observations. We start with the same prior as before: θ is normal with mean m and variance s². I observe x₁ and do my Bayesian update: the posterior is N* with mean (p_X x₁ + p_θ m)/(p_X + p_θ) and precision p₁ = p_X + p_θ. The point is that the prior I used when I observed x₁ becomes, after updating, my information before I observe x₂. This is sequential: I start with the prior and no data; I observe x₁, update, and get a posterior; that posterior is my new prior before x₂, and after x₂ it updates into another posterior. I have all the formulas I need to calculate this. So what happens when I calculate the second-stage posterior? Let me call the first-stage posterior mean m₁ and its precision p₁. Then p₂ is just p_θ + p_{X1} + p_{X2}, and each observation's precision is 1/σ²: the precisions just keep adding up, there is no difficulty in the formula, and the numerator of the mean picks up the new term p_{X2} x₂. To see this easily, suppose s² is the same as σ². That means the prior is worth exactly one observation: before I start, I effectively have just one observation. (You can also arrange this by starting from zero information: send p_θ to zero, the prior variance to infinity; then after the first observation your posterior is worth exactly one observation.) In that case everything is multiplied by the same precision 1/σ², so all terms have exactly the same weight, and what you end up with is a simple average, not a weighted one: the posterior mean after n observations is (n·x̄ + m)/(n + 1), where x̄ is the mean of all the x's observed so far and the prior mean m counts as one extra observation coming from before. That is what you get when s² = σ²; if not, m gets a somewhat different weight, but the same formula holds. At the nth stage, the posterior mean is a weighted average of the sample mean of the observations so far and the prior mean m, with weights corresponding to the relative variances. (The variance of x̄ is σ²/n, which we will calculate later.) The sequential process also tells us how to apply this formula to a multivariate sample, because this is the main advantage of the conjugate prior: it combines nicely with the data. The posterior has the same form as the prior, so you can just update sequentially without doing any new computation; if the prior were not conjugate, you would have to do a fresh calculation at every stage, because the form of the formulas would change each time. Doesn't the second update throw away x₁? No, it will be x₁ plus x₂: you keep all of the information and you keep updating. All right, I think this completes one unit. There are lots of other things we need to do, but this is a good place to stop this lecture. Next time we will build upon this.
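To close, here is a sketch verifying the sequential-equals-batch point made above, using the same precision-weighted update at each step; the synthetic data and the choice s² = σ² are illustrative assumptions.

```python
# Sequential updating, one observation at a time, reproduces the batch
# formula (n * xbar + m) / (n + 1) when s^2 = sigma^2.
import numpy as np

rng = np.random.default_rng(0)
sigma2, m, s2 = 1.0, 0.0, 1.0       # s^2 = sigma^2: prior worth one observation
xs = rng.normal(2.0, 1.0, size=50)  # synthetic i.i.d. sample (illustrative)

mean, var = m, s2
for x in xs:                        # yesterday's posterior is today's prior
    p_x, p_t = 1 / sigma2, 1 / var
    mean = (p_x * x + p_t * mean) / (p_x + p_t)
    var = 1 / (p_x + p_t)

n = len(xs)
print(mean, (n * xs.mean() + m) / (n + 1))   # the two numbers agree
```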