Guys, how are the homeworks going? The last one was a little bit of a pain. Okay, so today what we're going to do is link the Kalman filter, which is a way for us to estimate the posterior after we make an observation, to another domain of state estimation called Bayesian estimation. We have a prior belief, we make an observation, and then we form another belief; I'm going to call that the posterior. Basically, what we want to show is that the two are related: when I tell you that after I made an observation I took my prior and added this gain multiplied by the error in my prediction to form my new estimate, my posterior, what I was really doing was finding the mean of a distribution. And when I tell you that I have an uncertainty P about my estimate, what I'm really saying is that P is the variance of that distribution. So we're going to formulate the same problem in a Bayesian framework, which means we're going to say we have some prior belief and some likelihood model. The likelihood model says: given the x that we're trying to estimate, this is the likelihood of measuring y. The prior is the probability of x to begin with. And then we're going to form a posterior, which says: what's the probability of x given that I just saw this y? Now, the trick along the way is to figure out how to take a joint probability distribution and factor it into two probability distributions, one of which is going to be the prior and the other the conditional, and the conditional is what we're interested in: it's going to be the posterior. So the whole business today is to figure out how to take a Gaussian distribution and divide it up into two parts.
So that when we multiply, one part is our prior belief and the other part is what we're interested in, the posterior. If we do that, then we'll end up demonstrating that what we were doing with the Kalman filter was estimating the mean and variance of the posterior distribution. So let's begin by talking about what our aim today is. We start with what we call our generative model, which gives the probability of y given x. Here y is the data that we observed and x is the state that we want to estimate, and when we did maximum likelihood estimation, we found the x that maximized this probability; that x is called the maximum likelihood estimate, x̂_ML. When we did the Kalman filter, we did something interesting: we said we have this thing x̂(n|n), which is equal to our prior plus something we call a gain, times the difference between what we observed and what would be predicted on that trial. We called this our posterior, and the first term was our prior. We also had some uncertainty associated with this: P(n|n) is our posterior uncertainty after we've formed the mean, and that's equal to (I − KC) times the prior uncertainty, where C is the matrix that relates y to x. So this is our posterior uncertainty, the value after the update, and in a way P(n|n) is the variance and x̂(n|n) is the mean. In Bayesian estimation, if this is your model of how the data is generated, what you're really after is p(x|y), the posterior probability.
What's the posterior probability? The data you observed came from some x, and what you want to know is how likely the different possible x's are, given that you have observed this y. Today what we want to show is that if we find the x that maximizes this posterior probability, then what we're really finding is a distribution whose mean and variance are exactly the Kalman filter's posterior mean and variance: that mean and that variance are the mean and variance of this posterior probability. So this is the probability distribution, this is its mean, this is its variance, and that's where we're going today. To get there, let me first show you the basic Bayes rule. Suppose I have two variables x and y with some joint probability; the way we derive Bayes rule is through the joint probability distribution. x can take some value x_i, y can take some value y_j, and the probability of this joint occurrence is of course p(x = x_i | y = y_j) times p(y = y_j), and it's also equal to p(y = y_j | x = x_i) times p(x = x_i). If I set those two expressions equal to each other, then p(x = x_i | y = y_j) = p(y = y_j | x = x_i) p(x = x_i) / p(y = y_j). The names we use to describe these probabilities are as follows: the left-hand side is called the posterior, p(y|x) is called the likelihood, p(x) is called the prior, and p(y) is called the marginal.
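The arithmetic of Bayes' rule is easy to sanity-check numerically. Here is a minimal sketch with a discrete two-state x; the probability values are purely illustrative, not from the lecture:

```python
# Bayes' rule for a discrete x with two states and a binary observation y.
p_x = {0: 0.8, 1: 0.2}           # prior p(x)
p_y_given_x = {0: 0.1, 1: 0.7}   # likelihood p(y = 1 | x)

# marginal: p(y = 1) = sum over x of p(y = 1 | x) p(x)
p_y = sum(p_y_given_x[x] * p_x[x] for x in p_x)

# posterior: p(x | y = 1) = likelihood * prior / marginal
posterior = {x: p_y_given_x[x] * p_x[x] / p_y for x in p_x}

print(round(p_y, 3))           # 0.22
print(round(posterior[1], 4))  # 0.6364
```

Observing y = 1 shifts belief toward x = 1 (from 0.2 prior to about 0.64 posterior), because y = 1 is much more likely under that state.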
So basically I take my likelihood, multiply it by the prior probability, divide by the marginal, and I get what's called the posterior probability. All right, you guys are familiar with this; we did this in the first lecture of the class. What I want to do today is show that if I take the joint probability distribution for the kinds of models we've been using and write the probability of x and y, with x the variable we're trying to estimate and y the variable we're measuring, then for a Gaussian relationship I have some mean and some variance, and what I have to do is factor that joint probability distribution into a conditional probability and a marginal probability. If I do that, then this conditional probability is going to have a mean and a variance that are precisely the Kalman filter equations. That's what we're doing today, and to do it we have to learn a little bit of math about how to break apart matrices and diagonalize them, because in a Gaussian distribution you have things like determinants and inverses of the variance. You normalize by the determinant of the variance-covariance matrix, and in the exponential you have the inverse of the variance. So if I have a joint probability distribution whose variance is a matrix, I'm going to have to be able to break up both its inverse and its determinant, so that I can write the joint distribution as one Gaussian times another Gaussian. We're going to factor a Gaussian distribution into two different Gaussians; that's what's hard to do, and we'll do it. We'll end up seeing that what we get is precisely the mean and the variance of the Kalman filter. All right, so here's my prior.
Okay, so x is normally distributed with mean x̂(n|n−1) and variance P(n|n−1), and I have y = Cx + ε, where ε is normally distributed with mean zero and variance R. What I'm interested in is writing the joint probability of x and y. So what does this quantity look like? Well, basically it's some normal distribution with some mean, call it (μ_x, μ_y), and some variance-covariance matrix whose blocks are the variance of x, the covariance of x and y, the covariance of y and x, and the variance of y. That's the basic structure of this joint probability: it has this mean and this variance-covariance matrix, and I want to know what those things are. Well, what's the expected value of x? It's just x̂(n|n−1). What's the mean of y? C x̂(n|n−1). That's the mean. And what's the variance-covariance structure? The variance of x is P(n|n−1). What's the variance of y? It's the variance of this system, C times the variance of x times C^T plus R, which is C P(n|n−1) C^T + R. So the variance of y is C times the prior uncertainty times C^T, plus R. Now, what's the covariance of x with y? That's equal to the expected value of (x − μ_x)(y − μ_y)^T, and x minus its mean is x minus x̂.
I'm just going to write x̂ for my prior estimate of x, so this is E[(x − x̂)(y − C x̂)^T]. Let's multiply this out: that's E[x y^T] minus E[x] x̂^T C^T, minus x̂ E[y^T], plus x̂ x̂^T C^T. What is y? Here's the equation for y: y = Cx + ε, so E[x y^T] = E[x x^T] C^T plus E[x ε^T], and that cross term is just zero because ε is independent of x with zero mean. The middle terms are each x̂ x̂^T C^T, so two of the three cancel out, and the covariance of x and y is just E[x x^T] C^T minus x̂ x̂^T C^T. But E[x x^T] minus x̂ x̂^T is just the variance of x, so this is P(n|n−1) C^T. So the covariance of x and y is P(n|n−1) C^T, and I'm going to write that down here. The covariance of y and x is just the transpose of that, C P(n|n−1); P is a symmetric matrix, so it's its own transpose. Okay, so that's the joint probability distribution. Now, why did I write the joint probability distribution? Because what I want to do is take this normal distribution, whose mean is a vector and whose variance-covariance is a matrix, and write it as something that has p(y) in it, the marginal, times something else. That something else is going to be my p(x|y), and I want to know what its shape is. I know what p(y) is, right? Here it is: it has mean C x̂ and variance C P C^T + R. So I want to be able to factor this normal distribution to get that. Now, what does it mean to factor? Well, what does this distribution mean here?
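The block structure we just derived, with covariance [[P, P C^T], [C P, C P C^T + R]], can be checked by simulation. This is a sketch in the scalar case with made-up values for P, R, C, and x̂ (none of these numbers come from the lecture):

```python
# Monte Carlo check of the joint covariance of (x, y) when
# x ~ N(xhat, P) and y = C x + eps with eps ~ N(0, R).
import numpy as np

rng = np.random.default_rng(0)
P, R = 2.0, 0.5        # prior variance and observation noise (illustrative)
C, xhat = 1.5, 1.0     # observation gain and prior mean (illustrative)
n = 200_000

x = xhat + np.sqrt(P) * rng.standard_normal(n)   # draws of x
eps = np.sqrt(R) * rng.standard_normal(n)        # observation noise
y = C * x + eps                                  # y = C x + eps

emp = np.cov(np.vstack([x, y]))                  # empirical 2x2 covariance
theory = np.array([[P,     P * C],
                   [C * P, C * P * C + R]])
print(np.round(np.max(np.abs(emp - theory)), 3)) # small sampling error
```

With this many samples the empirical covariance should match the predicted blocks to within a few hundredths.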
Well, p(x, y): say that x is a p-by-1 vector and y is a q-by-1 vector. Then the density looks like this: 1 over the square root of (2π)^(p+q) times the determinant of the variance-covariance matrix, call it capital Σ, times the exponential of −½ times the vector (x − x̂, y − C x̂) transposed, times Σ inverse, times (x − x̂, y − C x̂). That's just a scalar, right? It's just a probability density: you give me x and y, I give you a number. That's what I mean when I write that equation, and Σ is the variance-covariance matrix sitting inside it. Now, my problem is that I want to write this p(x, y) as something times an exponential that looks like the marginal of y. The size of y is q, so if I partition my variance-covariance matrix into blocks Σ11, Σ12, Σ21, Σ22, the marginal of y is going to be 1 over the square root of (2π)^q times the determinant of Σ22, times the exponential of −½ (y − C x̂)^T Σ22^{-1} (y − C x̂), where C x̂ is the prior mean of y. So what I want to do is take my joint probability distribution, factor out this term, and whatever remains is going to be my p(x|y). I can always write a joint probability in terms of a conditional, either as p(x|y) p(y) or as p(y|x) p(x), right? Here we factor out p(y).
Our prediction on every trial depends on our prior on x, and it also gives us a prediction on y, right? Because given a prior on x, we also have a prior on y. So p(y) is easy to write down: p(x) is our prior on x, certainly, but we also have a prior on y because we have a prior on x. And of course the factor we pull out has to be p(y), because what we want left over is the distribution of x; that's what we want to know. This is the joint probability; we can compute it, here it is. And I know what p(y) is: it's the normal distribution with mean C x̂(n|n−1) and variance C P(n|n−1) C^T + R. So now what I need to know is: if I can factor this exponential into the marginal piece, what remains? What remains has to be p(x|y). That has to be the posterior. So the question is how we do it. How do we take this matrix and find its inverse so that we have one term over here and the rest over there, and find its determinant so that we have one term here and the rest there? That's our problem: how do we factor those out? Questions? Do you see the basic idea of where we're going? Okay, let me go over it one more time. In principle, we can always write the joint probability distribution of x and y, because we have our model that says what x is and how y is related to it. We can compute the variance of y, which becomes this term here; we can compute the variance of x, which becomes this term here; and the covariance of x and y comes from this equation. So then we have the probability distribution for the joint variable, this hairy-looking scalar quantity. And now, if this is true, Bayes' rule tells me that I can take my marginal probability on y, multiply it by the conditional probability, the posterior, and get the joint probability. So how do I divide up these variance-covariance matrices?
So what we're going to learn is how to factor this matrix, so that essentially I can take this quantity, Σ22, and divide the rest of the matrix by it. That sounds really weird; what does that mean? I'll show you what it means. Say you have some matrix M with blocks E, F, G, and H, where the blocks are of course of the right sizes. What I'm going to do is multiply the left and right sides of this matrix by two matrices that each have a determinant of one. On the left I multiply by the matrix with blocks I, −F H^{-1} on the top row and 0, I on the bottom; on the right by the matrix with blocks I, 0 on the top row and −H^{-1} G, I on the bottom. The determinant of the first matrix, this times this minus this times this, is one; the determinant of the second is similarly one. So let's multiply this out and see what we get. Multiplying the first two: the top row gives E − F H^{-1} G and F − F H^{-1} H, and F H^{-1} H is just F, so that entry is zero over there. The bottom row is unchanged, G and H. Now multiply that by the right-hand matrix: the bottom-left entry becomes G − H H^{-1} G, which is G minus G, zero, and the bottom-right should be just H. Let's see if that's right. Yeah. Okay, so the product is just the block-diagonal matrix with E − F H^{-1} G and H on the diagonal. So why is that interesting? What we did is multiply our matrix M by some matrix on the left, which I'm going to call X, and some matrix on the right, which I'll call Z; the names don't matter. And the determinant of M didn't change when I multiplied it by this X and Z.
Its determinant stays the same, because the determinant of this new thing is det(X) times det(M) times det(Z); det(X) is one and det(Z) is one, so it's just equal to det(M). And what I end up with is that det(M) equals the determinant of my final block-diagonal matrix, which is det(E − F H^{-1} G) times det(H). So why is this useful? Well, look what we just did, and keep in mind that the matrix we're interested in is this Σ up there. I just said that det(Σ) is equal to the determinant of Σ11 − Σ12 Σ22^{-1} Σ21, where E is Σ11, F is Σ12, H is Σ22, and G is Σ21, times the determinant of Σ22. Why is this useful? Because look what I just did: I took the determinant of Σ, which is what I'm interested in since Σ appears in my denominator, and I want to be able to divide it into two parts, one factor over here and the rest over there. I just did that. I split it into det(Σ22), which is going to be part of the normalization for this guy, the probability of y, and what remains, the determinant of this structure, which is going to be the variance-covariance of whatever's over here if the factoring was done correctly. All right, but I'm not done. I also need to be able to factor out the inverse, Σ22^{-1}, right? So I'll have to take Σ^{-1} and write it in terms of this and whatever's left of it. And we're going to do that.
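The determinant identity det(M) = det(E − F H⁻¹ G) · det(H) is easy to verify numerically. The blocks below are made up purely for illustration, chosen symmetric like a covariance matrix:

```python
# Check the Schur-complement determinant factorization on a 3x3 example.
import numpy as np

E = np.array([[4.0, 1.0], [1.0, 3.0]])   # top-left block (illustrative)
F = np.array([[1.0], [0.5]])             # top-right block
G = F.T                                  # symmetric case: G = F^T
H = np.array([[2.0]])                    # bottom-right block

M = np.block([[E, F], [G, H]])
schur = E - F @ np.linalg.inv(H) @ G     # Schur complement of H in M

lhs = np.linalg.det(M)
rhs = np.linalg.det(schur) * np.linalg.det(H)
print(abs(lhs - rhs) < 1e-9)             # True
```

For these numbers both sides work out to 19, which you can confirm by expanding the 3x3 determinant by hand.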
So how are we going to do that? Well, let's go back to what we just wrote. I took my matrix M, multiplied by X on the left and Z on the right, and got this new matrix; let's call it W. So X M Z = W. What's the inverse of this quantity? It's Z^{-1} M^{-1} X^{-1} = W^{-1}, and that means M^{-1} is equal to Z W^{-1} X. What does that mean? Let me give myself some space here and write it out. M^{-1} is equal to Z, which is the matrix with blocks I, 0, −H^{-1} G, I, times W^{-1}, which is block diagonal with (E − F H^{-1} G)^{-1} and H^{-1} on the diagonal, times X, our first matrix. So I now have my inverse written in terms of something that has Σ22^{-1} in it, and if you look at this quantity, you notice that in its middle it has Σ22^{-1} in exactly the right place. Why is that the right place? Because what I need is the marginal's quadratic form in y, and I'm going to be able to get it since Σ22^{-1} sits by itself in the bottom-right block of this matrix. So I'll do it now; let me take it apart and write my joint probability distribution that way. This quantity E − F H^{-1} G, that is, Σ11 − Σ12 Σ22^{-1} Σ21, is called a Schur complement; that's just a shorthand name for this structure. To give you a little intuition of what it means, so this is not an unfamiliar thing, let me connect it back before we get too far from our Kalman filter. Σ11 is the prior uncertainty of x, P(n|n−1); Σ12 is the covariance, P(n|n−1) C^T; Σ22^{-1} is this quantity, (C P(n|n−1) C^T + R)^{-1}; and Σ21 is the covariance, C P(n|n−1).
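The inverse identity M⁻¹ = Z W⁻¹ X can be checked numerically the same way, using the same made-up blocks as before:

```python
# Check that M^{-1} = Z W^{-1} X, where X and Z are the unit-determinant
# multipliers and W = X M Z is block diagonal (Schur complement and H).
import numpy as np

E = np.array([[4.0, 1.0], [1.0, 3.0]])   # illustrative blocks
F = np.array([[1.0], [0.5]])
G = F.T
H = np.array([[2.0]])
M = np.block([[E, F], [G, H]])

I2, I1 = np.eye(2), np.eye(1)
Hinv = np.linalg.inv(H)
X = np.block([[I2, -F @ Hinv], [np.zeros((1, 2)), I1]])
Z = np.block([[I2, np.zeros((2, 1))], [-Hinv @ G, I1]])

W = X @ M @ Z                                 # block diagonal
Minv = Z @ np.linalg.inv(W) @ X               # the factored inverse
print(np.allclose(Minv, np.linalg.inv(M)))    # True
```

Inverting W is trivial because it is block diagonal: you invert the Schur complement and H separately, which is exactly the split we need for the exponent.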
If you guys remember your equation for the Kalman gain, it was exactly this term: K(n) = P(n|n−1) C^T (C P(n|n−1) C^T + R)^{-1}. So this Schur complement, when I factored out Σ22, is what I end up with: P(n|n−1) − K(n) C P(n|n−1), that is, (I − K(n) C) times P(n|n−1), and you can see that that's the posterior variance I've been using: this is P(n|n). Okay, but let's not get too far ahead of ourselves. Let me write down the probability. I have p(x, y) equal to a normal distribution: 1 over the square root of (2π)^(p+q) times det(Σ), times the exponential of −½ times the vector (x minus its mean, which is my x̂(n|n−1), call it μ_x; y minus μ_y) transposed, times Σ^{-1}, times the same vector. That's our original joint probability distribution. I just wrote for you that I can take the determinant here and write it as the determinant of this new matrix, the Schur complement, times the determinant of Σ22. And of course what that means is that I can take this term, 1 over the square root of (2π)^(p+q) times det(Σ), and write it as 1 over the square root of (2π)^p times this new determinant, times 1 over the square root of (2π)^q times det(Σ22). So I can take the determinant of the variance-covariance matrix of the joint probability and write it in terms of the determinant of the Schur complement times the determinant of Σ22, which happens to be the determinant piece of my marginal probability on y. Now what I need to do is the same thing here: take this exponential and write it as a multiplication of two separate exponentials. The exponential has −½, and the inverse inside it is the one we just factored, so it's (x minus its mean, y minus its mean) transposed, times Z W^{-1} X, times the same vector.
So I have to take that exponential and write it as a multiplication of two separate exponentials. Let me take the right part of that quadratic form first: X times the vector. It's easy to see that the top block is I times (x − μ_x) plus (−Σ12 Σ22^{-1}) times (y − μ_y), so when I multiply this out I get x − μ_x − Σ12 Σ22^{-1} (y − μ_y), and at the bottom I get 0 plus (y − μ_y), just y − μ_y. And since Σ is symmetric, Z is just the transpose of X, so the left part of the quadratic form gives the same vector transposed. So I keep multiplying and I get an exponential of −½ times (x − μ_x − Σ12 Σ22^{-1} (y − μ_y)) transposed, times the inverse of the Schur complement, times (x − μ_x − Σ12 Σ22^{-1} (y − μ_y)), plus (y − μ_y)^T Σ22^{-1} (y − μ_y). Each piece is just a scalar, so I have an exponential with two components. And if you look at this, what I've just done is take my determinant and write it as one quantity times another, and also take my exponential, which had Σ as its variance-covariance, and write it as one squared term, with the Schur complement as its variance-covariance, plus another component, y minus its mean, with Σ22 as its variance-covariance.
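This factorization of the joint Gaussian into conditional times marginal can be checked numerically in the scalar case; the parameter values and the test point below are illustrative:

```python
# Verify p(x, y) = p(x | y) * p(y) for a 2D Gaussian, using the
# Schur-complement conditional mean and variance.
import numpy as np

def gauss(v, mu, var):
    """Scalar normal density."""
    return np.exp(-0.5 * (v - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

P, R, C, xhat = 2.0, 0.5, 1.5, 1.0          # illustrative parameters
S11, S12, S22 = P, P * C, C * P * C + R     # joint covariance blocks
mu_x, mu_y = xhat, C * xhat

x, y = 0.3, 2.1                             # arbitrary test point

# joint density from the full 2x2 covariance
Sigma = np.array([[S11, S12], [S12, S22]])
d = np.array([x - mu_x, y - mu_y])
joint = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (
    2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# factored form: conditional (posterior) times marginal
cond_mean = mu_x + S12 / S22 * (y - mu_y)
cond_var = S11 - S12 ** 2 / S22             # Schur complement
factored = gauss(x, cond_mean, cond_var) * gauss(y, mu_y, S22)

print(abs(joint - factored) < 1e-12)        # True
```

The two evaluations agree at every (x, y), because the determinant split and the exponent split together are an exact rewriting of the joint density.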
So what I've done is take a normal distribution with mean (μ_x, μ_y) and variance-covariance Σ and write it as the product of two normal distributions. Let me write the first one: its mean is μ_x + Σ12 Σ22^{-1} (y − μ_y), and its variance is the Schur complement. Wait, that mean has a y-dependence? Yes, of course, of course it has to; let me show you what that means. Let me write it over here so we can put in our numbers. So I said: if you take a normal distribution with mean (μ_x, μ_y) and variance-covariance Σ, you can break it up into these two different normal distributions multiplied by each other, and one of them is my marginal, p(y); the other one must be my posterior, p(x|y), and that's why it has a y in it. And what we're going to see is that its mean is precisely the equation for the Kalman filter: it is precisely x̂ + K(y − ŷ). All right, let me show it to you. In our problem, we have a normal distribution whose prior mean is x̂(n|n−1) for x and C x̂(n|n−1) for y, and whose variance-covariance structure has blocks P(n|n−1), P(n|n−1) C^T, C P(n|n−1), and C P(n|n−1) C^T + R. This is what we called Σ, and this is the mean, μ_x and μ_y; this is how we started. So according to what I just wrote, I can write this as a normal distribution with mean μ_x, which is x̂(n|n−1), plus Σ12, which is P(n|n−1) C^T, times Σ22^{-1}, which is (C P(n|n−1) C^T + R)^{-1}, times y minus its mean, which is C x̂(n|n−1). This is its mean, and its variance is going to be the Schur complement, which I
think I just erased, but it's up there: it's Σ11 minus Σ12 Σ22^{-1} Σ21. Its variance is going to be P(n|n−1) minus Σ12, which is P(n|n−1) C^T, times Σ22^{-1}, which is (C P(n|n−1) C^T + R)^{-1}, times Σ21, which is C P(n|n−1). Okay, so that's its variance. This times a normal distribution with mean C x̂(n|n−1) for y and variance-covariance Σ22, which is C P(n|n−1) C^T + R. All right, so the first term is my posterior probability: the first term is p(x|y) and the second term is p(y). So what is p(x|y)? It's this normal distribution with that mean and that variance, as I wrote up there. So if you look at this equation, what does this mean? When I say my x̂(n|n) is equal to x̂(n|n−1) plus something I call the Kalman gain, K(n), times (y − ŷ(n)), that's my prior plus this K times the innovation, and my K by definition was equal to P(n|n−1) C^T times the inverse of the uncertainty of the observation, (C P(n|n−1) C^T + R)^{-1}, and it multiplies y minus my prediction of y, which is C x̂(n|n−1). So what I've been calling the posterior estimate is the mean of the posterior probability. And my posterior uncertainty, P(n|n), was equal to (I − K(n) C) times P(n|n−1), with K as defined, and that's the variance of the posterior probability distribution. So what we did today is take a joint Gaussian probability and factor out the probability associated with the marginal on y; we learned how to take an exponential and write it as a multiplication of two exponentials; and we found that the mean and the variance of what we call the posterior probability were in fact the posterior estimate x̂(n|n) and the uncertainty P(n|n) from the Kalman filter. This is the variance of the posterior probability and this is the mean of the posterior probability. I have to tell you, it's exhilarating to be able to actually do it on the board. All right guys, thank you for your time. David's going to teach you next.
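As a closing sanity check, the Kalman-filter form of the posterior variance, (I − KC)P, and the Bayesian Schur-complement form agree numerically; the matrices below are made up for illustration:

```python
# Compare the Kalman update against the Gaussian conditional (posterior).
import numpy as np

P = np.array([[2.0, 0.3], [0.3, 1.0]])   # prior covariance P(n|n-1), illustrative
C = np.array([[1.0, 0.5]])               # observation matrix, illustrative
R = np.array([[0.4]])                    # observation noise variance, illustrative
xhat = np.array([1.0, -0.5])             # prior mean
y = np.array([1.2])                      # observed measurement

S = C @ P @ C.T + R                      # innovation covariance, the Sigma22 block
K = P @ C.T @ np.linalg.inv(S)           # Kalman gain K(n)

# Kalman-filter form of the posterior
x_post = xhat + K @ (y - C @ xhat)       # x̂(n|n) = x̂ + K (y - C x̂)
P_post = (np.eye(2) - K @ C) @ P         # P(n|n) = (I - K C) P

# Bayesian form: Schur complement of the joint covariance
schur = P - P @ C.T @ np.linalg.inv(S) @ C @ P

print(np.allclose(P_post, schur))        # True: the two posterior variances agree
```

The posterior mean x_post is the conditional Gaussian mean by construction, and the variance check confirms the identity (I − KC)P = Σ11 − Σ12 Σ22⁻¹ Σ21 that the whole lecture builds toward.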