Welcome back everyone. I hope you've all had a nice weekend. So, before we start with today's topic: today we're going to discuss more basic computations with Gaussians, and we'll also introduce the more general concept of a conjugate model, because this is particularly relevant for the Gaussian distribution. But before we start with all that, do people have questions about last week's material? I'll start getting the screen shared. I think the easiest thing to do, as usual, is just to unmute yourself and ask, but if you prefer to post in the chat for whatever reason, do that too; I'll try to keep an eye out for when things flash up. So, is there a question? Did I hear someone unmuting himself or herself, or was a microphone left open inadvertently? Good, we found the culprit. Good. So, more basic computations with Gaussians. Okay, the typical scenario in which we are interested in Bayesian inference is when we have observations of an unknown quantity. Suppose we have a quantity mu which we wish to observe, and we can only observe it with noise, because that is the reality. So instead of observing mu, what we observe is mu plus some noise. And typically we're interested in the scenario where we have multiple observations, so I'll put a little index i on the noise: we observe the quantity x multiple times, x_i = mu + epsilon_i, so each observation is the fixed quantity mu plus a random error term. The first concept that we need to introduce is the so-called i.i.d. assumption, which means independent and identically distributed. What does this mean? It means that every time I do a measurement, I'm drawing epsilon_i from the same distribution, which does not depend on i. In the Gaussian observation case, since it is an additive error term, we typically assume that it has zero mean.
That's the most common assumption, and some variance sigma squared: epsilon_i ~ N(0, sigma^2). So this is the typical scenario. The joint probability of the observations, our likelihood p(X | mu, sigma^2), where by the vector X I mean (x_1, ..., x_n), is going to be a product of terms, one for each x_i. This is the basic setup in which most Bayesian inference works: we have multiple i.i.d. measurements, and we wish to infer something about a latent variable mu. Notice that there are situations where the i.i.d. assumption is not valid. For example, if I'm observing a time series, typically I will not have i.i.d. data: if it is a dynamical process, time will play a role, so presumably there will be correlations in the noise at different time points. I might even have systematic errors; for example, I might have groups of variables with different distributions, so I might be under what is called covariate shift. But that's not what we're going to consider these days. So today, what we're going to look at is: given observations which are i.i.d. in this form, how can I compute posterior distributions over the parameters of this Gaussian? So I have a likelihood term which is a product, and then in general I'll have a prior term, which is a joint distribution over the parameters of the Gaussian. And the question is: what is the posterior distribution given the observations? Remember, we use the letter p for the prior, the likelihood and the posterior, but it should be clear from what appears in the conditioning that this is a posterior, because it's conditioned on the observations. Okay. So, let us start with the mean. Suppose that sigma squared is fixed and we want to compute the posterior over the mean, given the observations, keeping the sigma squared variable fixed.
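As a small aside (my own sketch, not part of the lecture; the function names are illustrative), the i.i.d. Gaussian observation model can be written in a few lines of Python, with the likelihood computed in log space so that the product over observations becomes a sum:

```python
import math
import random

def simulate_observations(mu, sigma, n, seed=0):
    """Draw n i.i.d. noisy observations x_i = mu + epsilon_i, epsilon_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [mu + rng.gauss(0.0, sigma) for _ in range(n)]

def log_likelihood(xs, mu, sigma):
    """log p(x_1..x_n | mu, sigma^2): one Gaussian log-density per x_i,
    summed, which is the i.i.d. product written in log space."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

xs = simulate_observations(mu=2.0, sigma=1.0, n=100)
# The likelihood is larger near the true mu than far from it:
print(log_likelihood(xs, 2.0, 1.0) > log_likelihood(xs, 5.0, 1.0))  # True
```
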
By Bayes' theorem, this is going to be proportional to the likelihood term times a prior term: p(mu | X, sigma^2) is proportional to p(X | mu, sigma^2) p(mu). Now if I write it out more explicitly, I get a normalization constant and then an exponential. My i.i.d. assumption means that the likelihood is a product of likelihood terms, one per observation, and since I'm assuming the observation noise is Gaussian, this probability can be written as an exponential of a sum of terms, i = 1 to n: exp(-1/2 sum_i (x_i - mu)^2 / sigma^2). That is my likelihood term, and then there is the prior term. A question: "In the first line you wrote that p(mu | X, sigma) is proportional to p(X | mu, sigma); did you want the square on sigma there?" Yes. Thank you for catching me out on the square. And of course there's another, even more important square that I missed here: it is (x_i - mu) squared, of course, since that is the density of a Gaussian. So thank you very much. Now, you can observe that these terms contain a quadratic form in mu: it's the exponential of a negative quadratic form in mu. If we want the posterior to turn out to belong to a family of distributions we have encountered before, then this must also be Gaussian: if you want to be able to do the calculations analytically, we need the overall joint distribution to also have the form of an exponential of a quadratic form in mu. So let us take mu to also be Gaussian: p(mu) is a Gaussian with a certain prior mean m and a certain prior variance s squared. If we plug that in, the likelihood term remains the same, and then, let's make this a brace, we add a term (mu - m)^2 / s^2. Now this is also an exponential of a quadratic form in mu, and so the posterior distribution is itself also Gaussian. So this is the joint distribution.
But if I fix X, because I'm conditioning on it to get the posterior, then I have the exponential of this quadratic form in mu, and therefore the posterior is also Gaussian. So this is the first example of conjugacy. Conjugacy means: if the prior is of type A, let's say Gaussian in this case, then the conjugate likelihood is a distribution of type B, such that the posterior is again of type A. So there are pairs of distributions: with a prior from one member of the pair and a likelihood from the other member of the pair, you get a posterior which is of the same form as the prior. What we get here is that if you place a Gaussian prior on the mean, and you observe with Gaussian noise, then the posterior is also Gaussian. So the Gaussian is self-conjugate on the mean parameter: if p(mu) is Gaussian and p(X | mu) is Gaussian, then p(mu | X) is also Gaussian. Yes, I saw that something flashed up in the chat: "why can we say that the likelihood is a Gaussian distribution? That is just an assumption." It is an assumption, but it is also a very standard one, and it is standard because the Gaussian distribution pops up everywhere due to the central limit theorem: any observation process that is effectively averaging over multiple independent events will be approximately Gaussian distributed. And the Gaussian also has this very nice property of being self-conjugate. So within the Bayesian scenario, if you place a Gaussian prior over the mean, and you have Gaussian observations, the posterior of the mean will also be Gaussian. In this calculation I'm keeping sigma fixed, so the likelihood is p(X | mu, sigma^2); to answer the question in the chat, it is X given mu.
And sigma we are ignoring for the time being; we'll get back to sigma in a minute. So this was the first observation, which enables me to introduce the concept of conjugacy. But what is the actual form of the posterior distribution? The posterior distribution is a Gaussian, and so it will have its own statistics: the posterior mean and variance. Let's see, one more question: "is conjugacy related to the fact that the Gaussian is an attractor in PDF space?" That is a very deep question, but the answer is no, not in any straightforward way, because conjugacy does not only involve Gaussians, as we shall see in another example in a couple of minutes; and also we are not thinking of finding the posterior as solving a dynamical system, in this case an infinite-dimensional dynamical system. Okay, so: posterior mean and variance. No, it's not the central limit theorem here either. Let's put the attractor-in-PDF-space idea aside: at the moment all we're doing is a single Bayesian computation, so we are not defining a flow on the space of probability distributions at all, in which case it might make sense to talk about attractors. Let's do the calculation of the statistics of this posterior. Since the posterior is Gaussian, as we've just seen, the probability of mu given X has its own statistics, and so it is possible to write it as (1/Z_hat) exp(-1/2 (mu - m_hat)^2 / s_hat^2). So what will the values of m_hat and s_hat be? Well, we also know that this has to be equal to the calculation we had just done: (1/Z) exp(-1/2 [sum_i (x_i - mu)^2 / sigma^2 + (mu - m)^2 / s^2]). So this is the general trick for doing calculations with Gaussians in a Bayesian fashion.
Yeah, so we know that the posterior is a Gaussian, so we can write it as this exponential of a negative quadratic form; but we also have another version that comes from the model. The two distributions have to be the same, and therefore we should be able to read off the posterior mean and the posterior variance by simply matching the terms of the relevant degree in this quadratic form. From the first expression I get that the coefficient of mu squared is 1/s_hat^2; that's the only coefficient of mu squared, and I'll forget about the factor of one half because it appears on both sides. This must be equal to the coefficient of mu squared in the second expression, where I have 1/s^2, plus n/sigma^2, because there is a mu squared in each of the n terms, each divided by sigma squared. And that tells me straight away that the posterior variance s_hat^2 is the inverse of (1/s^2 + n/sigma^2). "Sorry, you missed the power two." Yes, I missed the power two, thank you very much. Yes, please? "In the second line, we've got one over the square of the partition function, Z squared. Why do we have the square? I assume the two Z's come from one exponential times the other, but why are we assuming that the two Z's are the same?" Why do we assume the two Z's are the same: we don't. The PDFs have to be the same, but the normalization constants are not, and in fact I put a hat on one of them. As you've seen, sometimes squares appear in the wrong place and go missing from the right place; the square on Z was a slip, and the two Z's won't be the same. "Sorry to interrupt, but was the square on Z hat an error?" Yes, it was a mistake. Okay, then I get it, thank you. So this is already an interesting formula, and one of the reasons why it's interesting
is because it shows us the trade-off in the uncertainties. Okay, so the more you observe: if I let n become very large, making many observations, then the second term becomes dominant, and since the expression is raised to the minus one, I can immediately see that in the Gaussian scenario, with independent and identically distributed observations, the posterior variance collapses as the number of observations becomes large. The posterior variance over the mean parameter goes to zero, which means the mean will become infinitely well determined. Okay, please? "What is the difference between Z and Z hat?" They are the normalization constants, but the reason they are potentially different is that there are quite a few constant terms here; let me be more explicit. The two distributions have to be the same, but on one side there is going to be an m_hat^2 / s_hat^2 term, and on the other side there will be all sorts of sums of x_i^2 / sigma^2, plus m^2 / s^2. The alternative way to do these posterior computations would be to expand the various terms, then add and subtract a constant term to make this a perfect square in mu, and then read off what m_hat and s_hat^2 are. That is totally equivalent, but a little more lengthy, at least from my point of view. And if you do that, you will see that in order to obtain the normalization, you have to add and subtract something, which means taking some constants out of the exponential; those constants are not part of the distribution over mu, and, more concisely, they end up in the evidence. Okay, thanks. So: "can this be thought of as an iterative process, starting with a guess for the prior mean and standard deviation?"
Well, Rajat, yes, it can be thought of as an iterative process, precisely because, as we will see in a second when we work out the posterior mean, it depends on the sum of the observations. You take some observations and compute your posterior; if you take some more observations, you can recompute your posterior, keeping the previous one as a prior. So doing Bayesian inference using all the data at the same time, or using first one bit of the data and then another bit, does not change the result. This can be seen as an iterative process: you start with a prior belief, you make some observations, you find a posterior, and then you may want to use that posterior as the prior for another batch of observations, and that won't change the result. Now, there was another question: "does the evidence change any assumptions on the probability distribution of the posterior?" There are no assumptions on the probability distribution of the posterior; the distribution of the posterior is a consequence of the assumptions on the prior and the likelihood. The evidence, which is the marginal likelihood, will depend on those assumptions, that is, on what you choose as your prior and your likelihood, but not on the posterior. "Why is the hat on m missing in the last expression?" Here, because this is not the posterior mean: m_hat is the posterior mean, but this is the prior mean m. So it is actually correct; for once I've not made a mistake. Michael asks: "is the difference between this iterative approach and the frequentist approach only in the assumptions on the prior and the likelihood?"
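The point about iterative updating can be checked numerically. Here is a minimal Python sketch of my own (names are illustrative) using the closed-form posterior for the mean, which the lecture derives in a moment: updating on all the data at once agrees exactly with updating batch by batch, reusing each posterior as the next prior.

```python
def gaussian_posterior(m, s2, xs, sigma2):
    """Posterior over the mean mu for i.i.d. Gaussian data with known
    noise variance sigma2, starting from the prior N(m, s2):
    s2_hat = 1/(1/s2 + n/sigma2),  m_hat = s2_hat * (m/s2 + sum(xs)/sigma2)."""
    n = len(xs)
    s2_hat = 1.0 / (1.0 / s2 + n / sigma2)
    m_hat = s2_hat * (m / s2 + sum(xs) / sigma2)
    return m_hat, s2_hat

xs = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3]
# Batch update: all six observations at once.
m_all, s2_all = gaussian_posterior(0.0, 10.0, xs, 1.0)
# Sequential update: first three, then the rest, reusing the posterior as prior.
m1, s21 = gaussian_posterior(0.0, 10.0, xs[:3], 1.0)
m2, s22 = gaussian_posterior(m1, s21, xs[3:], 1.0)
print(abs(m_all - m2) < 1e-9, abs(s2_all - s22) < 1e-9)  # True True
```

The agreement is exact up to floating-point error, because the posterior precision and precision-weighted mean both accumulate additively over observations.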
So, the frequentist approach and the Bayesian approach are fundamentally different in the way you interpret the meaning of probabilities, but in practice many of the results come out very similar. Here we could think that we are getting a regularized version: many people use priors as if they were just providing some regularization, so that the results are not too sensitive to finite-sample effects. When you have few data, you regularize. The reality becomes a little more philosophical very quickly. So, the posterior mean. Let's get the pen to work again. Finding the posterior mean requires finding out what m_hat is, and to do that I have to compare the coefficients of the linear term in mu on both sides; I'll do this calculation and then get back to the chat. On the posterior side, the linear term has a factor of minus two, which appears everywhere, and apart from that it is m_hat / s_hat^2: that is the coefficient of the linear term in mu there. On the model side, the coefficient of the linear term in mu is m / s^2 from the prior, and then a sum of x_i / sigma^2 from the likelihood. So m_hat = s_hat^2 (m / s^2 + sum_i x_i / sigma^2), which tells us that the posterior mean is a weighted mean of the sample mean and the prior mean. And you can show relatively easily, because s_hat^2 goes like the inverse of (1/s^2 + n/sigma^2), that when the sample size becomes large, the likelihood term becomes dominant: the weight in front of the sum tends to essentially one over n, and the posterior mean tends to the frequentist sample mean.
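Putting the matched coefficients together, a short Python sketch (my own, not from the lecture; names are illustrative) shows both effects at once: the posterior variance shrinks like 1/n, and the posterior mean converges to the sample mean as n grows.

```python
import random

def posterior_stats(m, s2, xs, sigma2):
    # s_hat^2 = (1/s^2 + n/sigma^2)^(-1);  m_hat = s_hat^2 * (m/s^2 + sum_i x_i/sigma^2)
    n = len(xs)
    s2_hat = 1.0 / (1.0 / s2 + n / sigma2)
    m_hat = s2_hat * (m / s2 + sum(xs) / sigma2)
    return m_hat, s2_hat

rng = random.Random(1)
true_mu, sigma2 = 3.0, 4.0
for n in (1, 10, 1000):
    xs = [true_mu + rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    m_hat, s2_hat = posterior_stats(0.0, 1.0, xs, sigma2)
    sample_mean = sum(xs) / n
    # With growing n the posterior mean approaches the sample mean
    # and the posterior variance collapses toward zero.
    print(n, round(m_hat, 3), round(sample_mean, 3), round(s2_hat, 4))
```

For n = 1 the prior mean of 0 still pulls the estimate noticeably; by n = 1000 the prior is essentially irrelevant.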
So this is one other connection between the sample mean, which is what a frequentist would report, and the Bayesian equivalent. But notice that when you don't have very large numbers, then m may be important, and so choosing a prior with reasonable parameters can be very important. Now let's go to the chat and see. Okay, good, good. So this is when we place a prior on the mean. Now what about the variance, which is our other parameter? So now: priors on the variance. Let me define an auxiliary parameter, the precision, which is just an alternative parameterization of the Gaussian, much more convenient for certain types of calculation: the precision beta^2 is the inverse of the variance, beta^2 = 1/sigma^2. So my observation PDF now becomes p(x_i | mu, beta^2) = (1/Z) exp(-beta^2 (x_i - mu)^2 / 2). And in fact I shouldn't just write 1/Z; I should write the full thing. Maybe I should write it a little bigger, because here it's in the margins. Normally the normalization is one over the square root of 2 pi sigma^2, but here sigma^2 is 1/beta^2, so the density is sqrt(beta^2 / (2 pi)) exp(-beta^2 (x_i - mu)^2 / 2). So now, priors on beta^2. You notice that, as a function of beta^2, the likelihood term is a product of a monomial with rational exponent in beta^2 and a negative exponential in beta^2. And this has the same shape as the gamma distribution. Let me recall the gamma distribution.
The gamma distribution, let's say for a variable y with shape lambda and scale theta, is p(y | theta, lambda) = (1 / (Gamma(lambda) theta^lambda)) y^(lambda - 1) e^(-y / theta). Let me double check my gamma probability density function: yes, there is a theta to the minus lambda in the normalization as well; it depends on how you phrase it. Okay, so that's the gamma distribution. And as you see, if I place a gamma prior on beta^2 and multiply it by the likelihood, which is also a power of beta^2 times a negative exponential in beta^2, then I just have to change the exponent by one half per observation and adjust the scale accordingly. So if I place a gamma on beta^2, the posterior is again a gamma. This is our second conjugacy pair: a gamma prior on the precision plus a Gaussian likelihood brings you to a gamma posterior. And the calculation is again very simple, because you multiply the terms: the likelihood has a rational power of beta^2 and a negative exponential in beta^2, so each observation increases the shape by one half and contributes a (x_i - mu)^2 / 2 term inside the exponential. So as you see, not all conjugacy pairs have to include a Gaussian; it is not because the Gaussian is an attractor in the space of PDFs. In fact, here is a third example of conjugacy. Let the observed count n be Poisson distributed with mean mu, and let mu be gamma distributed with parameters theta and lambda. Recall that the Poisson distribution gives the probability of observing the count n as mu^n / n! times e^(-mu). You can see immediately that this again has a power of mu times a negative exponential in mu. So you multiply it with the gamma prior, which is of the form (1 / (Gamma(lambda) theta^lambda)) mu^(lambda - 1) e^(-mu / theta).
Taking the product of these two terms, all I'm doing is adding n to lambda and adding one to one over theta. So again: a gamma prior plus a Poisson likelihood gives a gamma posterior. There is an obvious question at this point. We have looked at what happens for the mean: if we keep the variance fixed and place a Gaussian prior on the mean, then we get a Gaussian posterior on the mean itself. We've seen that this makes a lot of sense: the posterior parameters are very clearly interpretable, the variance shrinks to zero when you have an infinite amount of data, and the posterior mean is very close to the sample mean, particularly when the number of data points becomes large. We've also seen, although without doing all the explicit calculations, that if you keep the mean fixed and reparameterize your Gaussian in terms of the precision, a gamma prior on the precision is conjugate to a Gaussian likelihood, so you get a posterior on the precision that is again a gamma. Of course, the natural question is: what happens if you don't keep the precision fixed when being Bayesian over the mean? What if you treat both as random variables, being Bayesian on both mu and beta^2? Here you still have your likelihood, p(x_i | mu, beta^2) = sqrt(beta^2 / (2 pi)) exp(-beta^2 (x_i - mu)^2 / 2). But now you see that the likelihood couples the two parameters: they appear as a product. So even with independent priors, the posterior will not be independent between mu and beta^2. So the kind of treatment we've done so far is a rather artificial one, and it turns out that you need to define a joint prior over mu and sigma squared (or beta squared).
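Both conjugate updates just discussed reduce to one-line parameter updates. Here is a Python sketch of my own; note that I use the shape-rate parameterization of the gamma (the lecture uses shape and scale, so this is a choice of convention on my part, with rate = 1/scale):

```python
def gamma_precision_update(shape, rate, xs, mu):
    """Gamma(shape, rate) prior on the precision beta^2 = 1/sigma^2,
    Gaussian observations with known mean mu:
    shape += n/2,  rate += sum_i (x_i - mu)^2 / 2."""
    n = len(xs)
    return shape + n / 2.0, rate + sum((x - mu) ** 2 for x in xs) / 2.0

def gamma_poisson_update(shape, rate, counts):
    """Gamma(shape, rate) prior on a Poisson mean mu, observed counts:
    shape += sum of the counts,  rate += number of observations."""
    return shape + sum(counts), rate + len(counts)

print(gamma_precision_update(2.0, 1.0, [0.5, -0.5, 1.0], mu=0.0))  # (3.5, 1.75)
print(gamma_poisson_update(2.0, 1.0, [3, 0, 2, 1]))                # (8.0, 5.0)
```

In both cases the posterior stays in the gamma family; only the two parameters move, which is exactly what conjugacy buys you.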
You can do the whole thing with sigma squared, of course; the corresponding distribution is called the inverse gamma, as opposed to the gamma, because it is the distribution of the inverse of a gamma-distributed random variable. And there is a special distribution called the normal-gamma, which is a combination of a normal distribution and a gamma distribution. It is a distribution over pairs of random variables, one of which is positive and the other real; it is essentially the product of a Gaussian and a gamma, but the Gaussian factor also contains the gamma variable. The normal-gamma distribution is the conjugate prior here. I'm not going to do the calculations; they become a little more complicated. One final thing that I want to tell you today is that you can also compute marginals. So, suppose again that you have your Gaussian likelihood, and you have placed a conjugate prior: a Gaussian on mu, keeping sigma squared fixed. You can also compute the marginal distribution of x if you marginalize out mu using this prior: the marginal p(x_i | sigma^2, m, s^2). Now, what will it be? Well, it's pretty clear that it is going to be another Gaussian, because you have a joint Gaussian and you are marginalizing. Your joint distribution over x and mu, which we wrote earlier, is the exponential of a negative quadratic form in both x and mu, so it is a joint Gaussian; and when you marginalize out one entry of a joint Gaussian, you end up with a Gaussian on the remaining variables. What Gaussian will it be? What will its parameters be?
Well, we have that x_i is obtained from mu plus i.i.d. Gaussian noise, so we can either do the integral, which is rather painful, or compute the two moments of this distribution. So, the expectation of x_i is going to be equal to... excuse me, a question: "over what variable are we marginalizing to get p(x_i | sigma^2, m, s^2)?" We are marginalizing mu; the expectation is over mu and also over epsilon. The expectation of epsilon is zero and the expectation of mu is m, so E[x_i] = m. Okay. Then we have to find the expectation of x_i squared, and once again we do essentially the same calculation we did before, I think last time. We write out the square of this term: mu^2 + 2 mu epsilon_i + epsilon_i^2. The cross term, since epsilon is independent of mu and epsilon has zero mean, gives zero. Then there is the second moment of mu: the variance is the second moment minus the square of the first moment, so the second moment of mu is s^2 + m^2. And the second moment of epsilon is sigma^2, because epsilon has zero mean. (Sorry, that's not an epsilon there, that's an s.) Okay: the variance of x_i is s^2 + sigma^2. And so you see that in our observation process, when we marginalize out mu, we are adding the two uncertainties: we are uncertain about mu by an amount s^2, and given mu, x has its own uncertainty, which is sigma^2. So when we marginalize, we sum the uncertainties, which makes perfect sense. This is another nice property of Gaussians: when you marginalize, you sum uncertainties. Next, let's talk about marginalizing the variance. But first there is a question about E[x] and m squared.
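The moment calculation can be sanity-checked by Monte Carlo. A sketch of my own (names are illustrative): sample mu from the prior, add the observation noise, and check that the empirical mean is close to m and the empirical variance close to s^2 + sigma^2.

```python
import random

def sample_marginal(m, s2, sigma2, n, seed=0):
    """Draw x = mu + epsilon with mu ~ N(m, s2) and epsilon ~ N(0, sigma2):
    samples from the marginal p(x | m, s2, sigma2)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        mu = rng.gauss(m, s2 ** 0.5)
        out.append(mu + rng.gauss(0.0, sigma2 ** 0.5))
    return out

xs = sample_marginal(m=1.0, s2=2.0, sigma2=3.0, n=200_000)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
# mean should be close to m = 1, and var close to s2 + sigma2 = 5
print(round(mean, 2), round(var, 2))
```
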
"Why is E[x] equal to m squared, or did I write something nonsensical?" No, no: E[x] is m, because E[x] = E[mu] + E[epsilon]; E[epsilon] is zero and E[mu] is m. E[x] is not m squared, it is m. And E[mu^2] is s^2 + m^2. Good. So, what about marginalizing mu and sigma together? You have your p(x | mu, sigma^2), and you have a joint prior over mu and sigma^2, which is the normal-inverse-gamma, the conjugate prior; or you parameterize by beta^2 and place a normal-gamma prior. Well, the calculations are a little intricate, but the marginal is distributed according to the so-called Student's t distribution, which some of you might have heard of. It is the distribution that underlies the t-test in classical statistics; it was invented by a statistician who went by the pseudonym "Student" to do quality control of beer at the Guinness brewery, and all of these fun things, and it is obtained by taking a Gaussian and marginalizing the mean and variance using a normal-inverse-gamma prior. So, question: "can I go back a slide?" Yes, sure. The key thing here is that the second moment is the variance plus the square of the first moment; that's where the m squared came from. Okay, so today we've seen the basics, and it goes quite a long way. There aren't many more basic calculations you can do with Gaussians, but they are the foundations of many calculations you may want to do later on, and they also give you a convenient way to understand what Bayesian computations return in comparison to the standard frequentist computations, where you take sample means and sample variances. So, a very profitable, insightful way of thinking of it is that Bayes, in some sense, regularizes your results when you don't have much data.
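Before moving on: the claim that marginalizing the variance gives a heavier-tailed Student's t can be checked by direct simulation. This is my own sketch, again under the shape-rate convention for the gamma; sampling a precision from a gamma and then a Gaussian with that precision is exactly the scale-mixture construction:

```python
import random

def sample_scale_mixture(shape, rate, n, seed=0):
    """Draw a precision beta2 ~ Gamma(shape, rate), then x ~ N(0, 1/beta2).
    Marginally, x is Student-t distributed (with 2*shape degrees of freedom
    under this parameterization)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        beta2 = rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes a scale
        out.append(rng.gauss(0.0, (1.0 / beta2) ** 0.5))
    return out

xs = sample_scale_mixture(shape=2.0, rate=2.0, n=100_000)
grng = random.Random(1)
gs = [grng.gauss(0.0, 1.0) for _ in range(100_000)]

def tail(ys):
    """Fraction of samples more than 4 units from the mean."""
    return sum(abs(y) > 4.0 for y in ys) / len(ys)

print(tail(xs) > tail(gs))  # the scale mixture has markedly heavier tails: True
```

Occasionally the sampled precision is very small, producing an observation far out in the tail; that is precisely why the mixture decays polynomially where the pure Gaussian decays like e to the minus x squared.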
Another way to think about it is that if you genuinely believe in the expert, then when you have little data the expert can give you very precious information about what your inferred statistics should be. In all cases where you have conjugacy, it is possible to compute the posterior analytically: the Poisson likelihood with a gamma prior, for instance. But unfortunately there aren't that many such cases; they all fall into the broad class of exponential family distributions, so conjugacy exists only in the exponential family, and it involves a handful of distributions. For complex models there are some calculations which are insightful, but exact posteriors are simply impossible. Question: "after marginalizing one of the variables, would the resulting distribution always have a Gaussian form?" No: if you have a joint Gaussian distribution and you marginalize any of the variables, then you always end up with a Gaussian. But if you have something that is not jointly Gaussian, for example a normal-gamma distribution, where you have a product of a gamma and a Gaussian which are coupled, then you won't get a Gaussian. In fact, what you get by marginalizing the variance under a normal-gamma prior is a Student's t, which has a heavier tail, and that is the interesting thing. You see, when you marginalize the mean, you sum the variances, but the tail behavior far from the mean stays Gaussian: the density still decays like e to the minus x squared. When you marginalize the variance, the tail decays more slowly, polynomially rather than exponentially, which is different because the Student's t is a scale mixture. Question: "is the choice of the shape of the priors only driven by necessity, or are there other reasons?" Well, for what we're seeing now, it's only driven by necessity, to be honest; if you don't choose them this way, the calculations are not tractable.
So, driven by necessity, and also by pedagogical reasons: doing the calculations analytically sheds some light on what is going on during the inference procedure. The reality is that you will never have conjugacy in any model that you genuinely believe to be a model of the world, and there are other techniques for computing approximations to the posterior, but then it is not so clear what is going on. Those belong to a more advanced course; we may mention them at the very end. But for today, for certain, we are just looking at choosing priors so that we can do the calculations. So I think it's coffee break time, if there are no more questions... there are some more things in the chat. "How can we figure out the exact value of the variables?" Okay, so the Student's t distribution, for example, has its own expectations that can be computed analytically, so you know the exact values there. What you mean, I think, is the statistics of the posterior distribution, and you can compute statistics for more distributions than just the Gaussian; for a Student's t you can as well. But in general, you're absolutely right: if you can't compute the posterior analytically, then the only way to get at its statistics is by approximating the distribution, mostly by approximating it as a collection of samples. Okay, so thank you very much.