Two topics for us today; we're going to be covering two sections of the book, chapters 4.9 and 4.10. The first idea has to do with the relationship between what we did with the Kalman filter, which is basically to take into account our prior belief, the uncertainty we have on the estimate we're making, and the current observation, and maximum likelihood, which was the process by which we took data and asked: what is the likelihood of this data being generated given our model? From that we learned, for example, that when we have two sensors, we want to weigh each sensor based on its uncertainty. One of the problems you face when you want to apply these techniques to real data is: what should your uncertainty be when you start your estimate? To show you how to compute a reasonable starting uncertainty, we're going to go back to our problem with the two GPSs. We have two GPSs, they estimate our location, and we saw in maximum likelihood how to combine them. Now let's do the same problem using the Kalman approach, but beginning with the idea that we have no prior belief. If I have no prior belief, my prior uncertainty has a particular structure: it is basically infinite, I have no idea where I am. I make the two measurements, and I want to show you that in that case, your belief after you take the measurements is exactly the maximum likelihood estimate. So maximum likelihood is the solution when you essentially have no prior belief. If you do have a prior belief, the Kalman filter tells you how to combine that prior belief with the current observation; in maximum likelihood, all we do is take the observation and ask what the most likely location is.
In a Kalman filter, we have this additional thing which says: we also have this prior information that we've seen, and we're going to combine where I believe I am with the data that I see. And of course the Kalman filter then becomes a segue into the Bayesian way of thinking about estimation, which is where we're eventually going. So let's go back to our problem of hiking in the woods. We are at some location x, and we have two sensors. You can imagine that as you come walking, every time you take a few steps you take another measurement, so your position is going to change as a function of time; say that relationship is given by a matrix A, and then you take another sample. If we write out our generative model, it looks like this: x(n) = A x(n-1) + ε_x, where x is a 2-by-1 vector and ε_x is normally distributed with mean 0 and variance-covariance matrix Q. Your observation is y(n) = C x(n) + ε_y. The observation y is 4-by-1, two two-dimensional vectors, one from each GPS; x is 2-by-1, so C is 4-by-2; and ε_y is your measurement noise, normally distributed with mean 0 and variance-covariance matrix R. Say that this is the system I have, and I want to estimate where I am. If you do it via the Kalman filter approach, you say you have some prior belief, some x̂(n|n-1). For the purpose of our experiment, let's say this is the very first trial: my prior is x̂(1|0), trial 1 given that I've seen nothing in the past, and it has some uncertainty associated with it, P(1|0). This is how I start: I have some prior belief about where I am, and some uncertainty about where I am.
And now I'm going to make a measurement and update where I think I am. My x̂(1|1) = x̂(1|0) + K(1)[y(1) - ŷ(1)], where K(1) is my Kalman gain and my prediction is ŷ(1) = C x̂(1|0): my prior times the matrix C tells me my guess for what I should observe on that trial. So this is what you do to update your belief about where you are based on your prior belief and your observation. Any questions about that? Okay. Now let's ask: what if I don't have a prior belief? I have no idea where I am; I was placed in this place, so for me the prior uncertainty is infinity. I could be anywhere. Then how do we do this? How would I compute the matrix K? From the work we did last lecture, K for this problem is K(n) = P(n|n-1) C^T [C P(n|n-1) C^T + R]^(-1), the prior uncertainty mapped into observation space, divided by the variance of the observation. What is the shape of K? x is 2-by-1 and y is 4-by-1, so K has to be 2-by-4 for the update to work out. The matrix C is 4-by-2, and P, my uncertainty about x, is 2-by-2, because x is 2-by-1. Let's see if this works out: P is 2-by-2, C^T is 2-by-4, so P C^T is 2-by-4, which looks good; and C P C^T + R must be 4-by-4, with R itself 4-by-4, so that looks fine too. That's how I'm going to compute my Kalman gain, but my problem is that this thing I call my prior uncertainty is infinite, so I don't know how to compute this expression. So I'm going to show you how to do it.
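As a concrete sketch of this update step, here is the two-GPS problem in NumPy. Every number here is invented for illustration (the stacked readout C, the noise levels in R, the prior, the measurement), and the prior uncertainty is kept finite so the standard gain formula applies as written:

```python
import numpy as np

# Two GPS units each report the 2-D position, stacked into a 4-D observation.
C = np.vstack([np.eye(2), np.eye(2)])      # 4x2 observation matrix
R = np.diag([1.0, 1.0, 4.0, 4.0])          # GPS B assumed noisier than GPS A

x_prior = np.array([0.0, 0.0])             # x_hat(1|0): prior belief
P_prior = np.eye(2) * 10.0                 # P(1|0): broad (but finite) prior uncertainty

y = np.array([1.2, 0.8, 2.0, 1.5])         # stacked readings on trial 1

# Kalman gain: K = P C^T (C P C^T + R)^(-1), a 2x4 matrix
K = P_prior @ C.T @ np.linalg.inv(C @ P_prior @ C.T + R)

# Update: x_hat(1|1) = x_hat(1|0) + K (y - C x_hat(1|0))
x_post = x_prior + K @ (y - C @ x_prior)
print(K.shape, x_post)
```

With a prior this broad, the posterior is pulled almost entirely toward the variance-weighted measurements; shrink P_prior and the prior starts to dominate the update instead.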
To do it, we're going to add one more step here, which has to do with the posterior uncertainty: P(n|n) = [I - K C] P(n|n-1). Let's see if that works out: K is 2-by-4, C is 4-by-2, so K C is 2-by-2, and the whole thing is 2-by-2. That looks good. Okay, so let me write this out. What I'm going to do is show you how the posterior uncertainty is related to the prior uncertainty, but not in terms of the matrices themselves: in terms of their inverses. By doing it that way, after a couple of steps I'll have a new equation in which P(n|n)^(-1) is equal to P(n|n-1)^(-1) plus some other stuff. Writing it this way lets me get rid of the prior term, because if the prior uncertainty P(n|n-1) is infinite, its inverse is 0, and then I'm still okay: I can still compute my posterior uncertainty. So the first step is to rewrite this equation in terms of the inverses of the P matrices and drop the prior term, since in this case it's infinite. The second step is to rewrite the Kalman gain not in terms of the prior uncertainty but in terms of the posterior uncertainty. By doing these two steps, we'll be able to compute the Kalman gain even though we have infinite uncertainty about our past. And when I do that, you're going to see that we end up with the maximum likelihood estimate: a weighing of the two readings y_a and y_b based on their variances, which is our maximum likelihood estimate.
Because if you have no prior belief, effectively what you're doing is only taking the evidence, and that evidence is going to be weighted by its variance, which in this case is the matrix R. Okay, before I go through it, do you have any questions about where we're going? Our problem is that we start out with infinite uncertainty; we have no idea where we are. In that scenario, how the heck can we compute K? We can't from that equation directly, because I'd have to multiply something by infinity. The way we're going to do it is by rewriting our problem not in terms of the P's but in terms of the inverses of the P's, and then we can handle it. What we'll end up with is the idea that, with a flat prior, our Kalman update is effectively the maximum likelihood estimator. All right, let me write this equation, replacing K with what it is: P(n|n) = P(n|n-1) - P(n|n-1) C^T [C P(n|n-1) C^T + R]^(-1) C P(n|n-1). Now how the heck can that be simplified? There's a lemma I'm going to use called the matrix inversion lemma, and it says the following: [Z + X Y^(-1) X^T]^(-1) = Z^(-1) - Z^(-1) X [Y + X^T Z^(-1) X]^(-1) X^T Z^(-1). If you look at our equation, I can match it to the right-hand side of the lemma: I set Z^(-1) equal to my prior uncertainty P(n|n-1), I set X equal to C^T, and I set Y equal to R. With those substitutions, the right-hand side of the lemma is exactly our expression for P(n|n), which means I can rewrite that whole equation as the inverse of something.
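The matrix inversion lemma as stated above is easy to check numerically. The particular matrices below are arbitrary test values; the sizes mirror our problem, with X playing the role of C^T and Y the role of R:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = np.eye(2) * 3.0                 # 2x2, invertible
X = rng.normal(size=(2, 4))         # 2x4, plays the role of C^T
Y = np.eye(4) * 2.0                 # 4x4, plays the role of R

Zi, Yi = np.linalg.inv(Z), np.linalg.inv(Y)

# (Z + X Y^-1 X^T)^-1  ==  Z^-1 - Z^-1 X (Y + X^T Z^-1 X)^-1 X^T Z^-1
lhs = np.linalg.inv(Z + X @ Yi @ X.T)
rhs = Zi - Zi @ X @ np.linalg.inv(Y + X.T @ Zi @ X) @ X.T @ Zi
print(np.allclose(lhs, rhs))        # the two sides agree
```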
So applying the lemma with those substitutions: P(n|n) = [P(n|n-1)^(-1) + C^T R^(-1) C]^(-1). Inverting both sides, P(n|n)^(-1) = P(n|n-1)^(-1) + C^T R^(-1) C. Yeah, that's what I got. Okay, so what this means is that if I start with P(1|0) equal to infinity, so that its inverse is 0, then after I take my first sample my posterior uncertainty satisfies P(1|1)^(-1) = C^T R^(-1) C, that is, P(1|1) = [C^T R^(-1) C]^(-1). If I start knowing nothing, then after I take my first sample, what I know has an uncertainty set entirely by C and the measurement noise R. Any questions so far? (A student caught a sign slip on the board here; that would have come back to haunt me later.) All right. So all we've done so far is show that if we have no information to begin with, if we have infinite uncertainty, we can still compute our posterior uncertainty. That's a good first step. The second step is to take the Kalman gain, which the way it's written up there is in terms of the prior uncertainty, and rewrite it in terms of the posterior uncertainty, which we can do, and I'll show you how.
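Here is that information-form update in code for the two-GPS example (the noise values in R are made up). With an infinite prior, the prior information matrix is exactly zero, and the posterior uncertainty comes out as [C^T R^(-1) C]^(-1):

```python
import numpy as np

C = np.vstack([np.eye(2), np.eye(2)])   # two GPS readouts of a 2-D position
R = np.diag([1.0, 1.0, 4.0, 4.0])       # assumed noise: GPS B has 4x the variance of A

# Information (inverse-covariance) form: P(n|n)^-1 = P(n|n-1)^-1 + C^T R^-1 C.
# An infinite prior uncertainty means its inverse is the zero matrix.
prior_info = np.zeros((2, 2))
post_info = prior_info + C.T @ np.linalg.inv(R) @ C

P_post = np.linalg.inv(post_info)       # P(1|1) = (C^T R^-1 C)^-1
print(P_post)                           # diagonal entries 1/(1/1 + 1/4) = 0.8
```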
Before I do it, here's a heuristic that's useful in state estimation: if you begin your estimate with no idea where you are, what people often do is begin with a prior uncertainty equal to [C^T R^(-1) C]^(-1). What they're saying is: let's begin with the uncertainty that is going to end up being our posterior after the first sample anyway. If you know nothing else, that's a reasonable prior uncertainty. All right, now let's look at the way we wrote K. I'm going to multiply both sides of the gain equation by the inverse of that bracketed matrix, which gives K(n)[C P(n|n-1) C^T + R] = P(n|n-1) C^T. I can do this kind of manipulation because these are variance-covariance matrices; they're positive definite, so they always have an inverse. Multiply it through: K(n) C P(n|n-1) C^T + K(n) R = P(n|n-1) C^T. Now let me solve for K, keeping in mind the relationship that's interesting for us, the one between the prior and the posterior uncertainty. Bring the two terms with P(n|n-1) C^T together: K(n) R = [I - K(n) C] P(n|n-1) C^T. But [I - K(n) C] P(n|n-1) is exactly the posterior uncertainty P(n|n), so K(n) R = P(n|n) C^T, and multiplying through by R^(-1): K(n) = P(n|n) C^T R^(-1). I've now written my Kalman gain in terms of the posterior uncertainty, which I can take a step further. At the end of my first data point, I'm going to call that gain K(1).
That's going to be K(1) = [C^T R^(-1) C]^(-1) C^T R^(-1), using the posterior uncertainty P(1|1) = [C^T R^(-1) C]^(-1) we just computed. This is what the Kalman gain is on that first data point. Which means that if my prior is x̂(1|0) = 0 with P(1|0) infinite, then my posterior after I take my first data point is x̂(1|1) = K(1)[y(1) - ŷ(1)], and since ŷ(1) = C x̂(1|0) = 0, this means x̂(1|1) = [C^T R^(-1) C]^(-1) C^T R^(-1) y. And if you look at what we did with maximum likelihood, this is the maximum likelihood estimate: look at what it's doing, it's weighing each observation by the inverse of the noise in that measurement and then normalizing. If you put in the numbers, it turns out to be exactly the maximum likelihood solution. Okay, so we learned a couple of things. One, we learned the matrix inversion lemma, a useful lemma that we're going to use again when we do Bayesian estimation. Two, we learned that we can form an estimate of the posterior uncertainty even if our prior uncertainty is infinite, and we can form an estimate of the Kalman gain, here it is, even if our prior is infinite. And the basic meaning of an infinite prior is the maximum likelihood estimate: all you know is the current measurement, you don't know anything else. If you do have a prior estimate, then with the Kalman gain you get to incorporate both of those things together, your prior and the noise in your measurement. Okay. So let me now switch to a different topic, which has to do with noise.
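To see the equivalence numerically, here is a sketch in which two GPS readings with made-up variances are combined two ways: with the first-sample gain K(1) above, and with the inverse-variance weighting we derived for maximum likelihood. The two answers coincide.

```python
import numpy as np

sa2, sb2 = 1.0, 4.0                        # assumed variances of GPS A and GPS B
C = np.vstack([np.eye(2), np.eye(2)])      # both units read out the 2-D position
R = np.diag([sa2, sa2, sb2, sb2])

# First-sample Kalman gain with an infinite prior:
#   K1 = (C^T R^-1 C)^-1 C^T R^-1
Ri = np.linalg.inv(R)
K1 = np.linalg.inv(C.T @ Ri @ C) @ C.T @ Ri

ya = np.array([1.0, 2.0])                  # reading from GPS A
yb = np.array([3.0, 6.0])                  # reading from GPS B
x_kalman = K1 @ np.concatenate([ya, yb])

# Maximum likelihood: weigh each sensor by its inverse variance, then normalize
x_ml = (ya / sa2 + yb / sb2) / (1.0 / sa2 + 1.0 / sb2)

print(x_kalman, x_ml)                      # identical: [1.4, 2.8] for these numbers
```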
In the kinds of experiments you've been seeing here and the kinds of mathematics we've been doing, we always have additive noise, and that additive noise is Gaussian. In biological problems, that's not the case. It's not the case that when you look at a biological system, the noise in it is Gaussian with a fixed variance; the noise is what's called signal dependent. What that means is this: if you take your hand and push on a force transducer, and you measure the little force wiggles and look at their standard deviation, then as you increase the magnitude of the force, the standard deviation of that signal increases. The variability in the signal is not independent of the mean. In the equations I wrote, I put down N(0, Q): it doesn't matter what your mean is, the noise and the mean are independent in those cases. Now we want to consider scenarios that are more biological, where the noise depends on the magnitude of the signal. And what I want to show you is that when we do it that way, the Kalman gain changes. Notice that the gain written up there has no x in it; nothing says the value of x matters. All it cares about is the uncertainty in your measurement, and that's all it needs to formulate the gain. But if the noise structure is such that what you're measuring depends not just on a measurement noise that's Gaussian with a mean-independent variance, but on x itself, so that the bigger the x, the bigger the noise, which is what biological sensors are like, then K is also going to depend on your x. It gives you a way to weigh what you're seeing based on your measurement uncertainty as well as the value of the x you're trying to estimate.
So this is called signal dependent noise, and I want to show you how to handle it. What's nice about this is that the first paper showing how to do it came out only about ten years ago. People in engineering were happy with the standard Kalman description, even though almost no real system has purely Gaussian noise that's independent of its mean; most systems have the signal dependent structure. Until people began thinking about biological systems, they hadn't considered this. Anyway, it's not a difficult thing; it's pretty easy to do, and it's pretty important for our purposes. So what do I mean by signal dependent noise? Ask somebody to make a movement, say 10 centimeters. You give them a target, you remove feedback, you ask them to move, and you measure the standard deviation of their endpoint across repetitions; there's some scatter. Now ask them to make a 20 centimeter movement in the same duration, say 300 milliseconds, so they have to move faster. Now their endpoint variance increases, and it continues increasing with distance: if you plot distance in centimeters on one axis and the standard deviation of the endpoint on the other, the variability grows the faster they move. The bigger the signal going to their muscles, the more noise they have at their endpoint. A simpler way to see this is to ask somebody to produce force.
You measure the mean of the force, and you plot the standard deviation of the force against it. You ask somebody to produce one Newton; they produce the force and hold it, you remove visual feedback, and you see some wiggle there. Then they produce two Newtons, and you see that the wiggle is a little bit higher, and so forth. Plot the standard deviation and you again see that it increases as the mean of the force increases. So what does that mean? It means that force equals some input that the brain is sending, call it U, the command to your muscles, multiplied by a factor (1 + Cφ), where φ is normally distributed with mean 0 and variance 1: F = U(1 + Cφ). If I now ask what the variance of F is, that's the variance of the random part, which is Var(F) = U² C² Var(φ) = U² C². So if I plot the variance of the force as a function of the input U, it looks quadratic; it increases as U². And if I plot the standard deviation as a function of U, it's a line, the square root of that: SD(F) = C U. That's what signal dependent noise is. It's a noise structure where the input gets multiplied by the random variable, not just added to it. Up there, you just have addition; now the noise multiplies the signal, so its variance depends on the input. And if the random variable has variance one, then C is the slope relating the standard deviation of the force to the mean input.
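A quick simulation of this multiplicative model, with an invented noise gain c = 0.1, shows the linear growth of the standard deviation with the commanded level:

```python
import numpy as np

rng = np.random.default_rng(1)
c = 0.1                                       # assumed signal-dependent noise gain

u_levels = np.array([1.0, 2.0, 4.0, 8.0])     # commanded force levels
sds = []
for u in u_levels:
    phi = rng.normal(0.0, 1.0, size=100_000)  # phi ~ N(0, 1)
    F = u * (1.0 + c * phi)                   # force with multiplicative noise
    sds.append(F.std())
sds = np.array(sds)

# SD grows linearly with the mean: sd(F) = c * u, so the ratio sd/u is constant
print(sds / u_levels)                         # each entry close to c = 0.1
```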
So the general problem for estimation we can now write as follows. We have some state x that we're trying to estimate; we observe it through some variable y; and there's some input u that affects the state. Maybe u is the thrust of a rocket, x is the position of the rocket, and y is the telemetry from the rocket. Or u is the force my muscles produce, x is the position of my body, and y is the sensory feedback I'm getting. Over time x changes through a matrix A, another u acts on it, and then I measure y. So this is the basic setup for state estimation: x(n+1) = A x(n) + B u(n) + ε_u + ε_x, where B gives the relationship between u and x, ε_u is the signal dependent noise in the motor command, and ε_x is ordinary additive noise. And my measurement is y(n) = H x(n) + ε_s + ε_y, where H is the readout of the state (there may be only some subset of states I can read), ε_s is the signal dependent measurement noise (I believe that's the symbol used in the book), and ε_y is additive measurement noise. The quantities we're interested in that are different from before are ε_u and ε_s: they are the state dependent, signal dependent noises. The additive noises are our usual ones, ε_x distributed N(0, Q_x) and ε_y distributed N(0, Q_y). I want to write the signal dependent noises in a way that I can deal with them, and the interesting idea is to write them as a sum, which I'll do now so you can see how that's going to help. So what are they? ε_u is the noise in the motor commands. If u is a vector, each component has a standard deviation that grows with the signal: the first component of ε_u is c_1 u_1 φ_1, the second is c_2 u_2 φ_2, and so on,
down to c_m u_m φ_m, where the φ's are all normally distributed with mean 0 and variance 1, and each c_i is the gain that determines the slope. For each component of u, there's a signal dependent term, so you see that this noise depends on u. Okay? Similarly I can write it for the state: ε_s has components d_1 x_1 μ_1, d_2 x_2 μ_2, down to d_n x_n μ_n, where each μ is a random variable with mean 0 and variance 1. Now I can write this noise as a sum: ε_u = Σ_i C_i u φ_i, with i from 1 to m, where C_i is a new matrix that I'll define now. C_1 is the matrix with c_1 in its first diagonal entry and 0 everywhere else; C_2 has c_2 in its second diagonal entry and 0 everywhere else; and so on. Each of these matrices has a single nonzero number in it. All I've done is write the noise vector as a sum of a matrix C_i, whose entries I know, times the input vector u, times the scalar random variable φ_i. Now why did I do this? Because remember, in order to compute the Kalman gain, I need my posterior uncertainty, which means I have to know the variances in these equations; I have to know the variance of y. And if I write ε_u in this form, I can compute its variance: Var(ε_u) = Σ_i C_i u Var(φ_i) u^T C_i^T = Σ_i C_i u u^T C_i^T, since Var(φ_i) = 1.
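Here is that decomposition in code for a hypothetical two-component command, with invented gains c_1 = 0.1 and c_2 = 0.3. Each C_i carries one gain on its diagonal, and in this diagonal case the variance formula above reduces to (c_i u_i)² per channel:

```python
import numpy as np

c = np.array([0.1, 0.3])                  # assumed per-channel noise gains
u = np.array([2.0, 5.0])                  # a motor command vector

# C_i has c_i at entry (i, i) and zeros everywhere else
C_mats = [np.diag([c[0], 0.0]), np.diag([0.0, c[1]])]

# Var(eps_u) = sum_i C_i u u^T C_i^T   (using Var(phi_i) = 1)
V = sum(Ci @ np.outer(u, u) @ Ci.T for Ci in C_mats)

print(V)   # diagonal entries (0.1*2)^2 = 0.04 and (0.3*5)^2 = 2.25
```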
So this is a very critical step, because in order for me to compute the posterior uncertainty, the Kalman gain, and all that, I have to be able to compute variances, and here's the variance. Similarly for the state noise: ε_s = Σ_i D_i x μ_i, with i from 1 to n. All right, so now I can write the full model: x(n+1) = A x(n) + B u(n) + B Σ_i C_i u(n) φ_i + ε_x, and y(n) = H x(n) + H Σ_i D_i x(n) μ_i + ε_y. Now I want to estimate my state: x̂(n|n) = x̂(n|n-1) + K(n)[y(n) - ŷ(n)], where ŷ(n) = H x̂(n|n-1); the input term B u(n) is something I know, and it's already incorporated into the prior x̂(n|n-1). So I make a guess from my prior, I see the error in my guess, I multiply it by the Kalman gain, and I update my prior belief to form a posterior belief. Okay, so similar to before, what one has to do now is form the posterior uncertainty and minimize it. Bring all the terms in x̂(n|n-1) to one side: the estimation error is [I - K(n) H] times the prior error, plus K(n) times the measurement noise. So we form the posterior uncertainty P(n|n) = [I - K(n) H] P(n|n-1) [I - K(n) H]^T + K(n) [Σ_i H D_i x(n) x(n)^T D_i^T H^T + Q_y] K(n)^T, where the bracketed term is the variance of the measurement noise, with the random variables μ_i contributing the signal dependent part and ε_y contributing Q_y.
So to find K, you take that variance, here it is, and you minimize the trace: take the derivative of the trace of P(n|n) with respect to K(n) and set it to zero. Before, we used to just have Q_y in there, but now we have this x in there, which we didn't have before. If one takes the derivative, you get K(n) = P(n|n-1) H^T [H P(n|n-1) H^T + Σ_i H D_i x(n) x(n)^T D_i^T H^T + Q_y]^(-1). It's still the ratio of prior uncertainty to measurement uncertainty, but the measurement uncertainty is now three pieces: the prior uncertainty mapped through H, the signal dependent term, and Q_y. And now you see it depends on x. Basically, the larger your state is, the more uncertain you're going to be, based on the fact that there's signal dependent noise in your measurement. So if there's signal dependent noise in your sensors, if your sensors have trouble measuring large numbers because as the numbers get larger the variance in them gets larger, then the gain by which you estimate where you are depends not just on your measurement noise; it also depends on your estimate of where you are. Typically you substitute your prior for x here; that's your best estimate of x. The larger the state, the more that state itself becomes part of the weight that determines how you combine your prediction with what you see. Your K depends not just on the measurement noise ε_y, but also on the signal dependent noise that goes into the denominator. So as always, we have a ratio of two things, prior uncertainty to measurement uncertainty, and in this case the measurement uncertainty has the signal dependent noise in it.
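As a sketch of that gain, here is a small invented example: the readout H, prior uncertainty P, additive noise Q_y, and state-dependent gains in the D_i matrices are all made-up numbers. Evaluating the gain at a small and a large prior state estimate shows the gain shrinking as the state grows:

```python
import numpy as np

H = np.eye(2)                         # full readout of a 2-D state
P = np.eye(2)                         # prior uncertainty P(n|n-1)
Qy = np.eye(2) * 0.5                  # additive measurement noise
D_mats = [np.diag([0.2, 0.0]), np.diag([0.0, 0.2])]   # signal-dependent gains

def gain(x_hat):
    # Signal-dependent part of the measurement variance, evaluated at the
    # prior estimate of the state (our best available stand-in for x):
    sig = sum(H @ D @ np.outer(x_hat, x_hat) @ D.T @ H.T for D in D_mats)
    return P @ H.T @ np.linalg.inv(H @ P @ H.T + sig + Qy)

K_small = gain(np.array([1.0, 1.0]))
K_large = gain(np.array([10.0, 10.0]))

# Larger state, noisier measurement, smaller weight on the observation
print(np.trace(K_small), np.trace(K_large))
```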
We're going to see terms like this again in a few lectures when we do optimal control, because noise in biological systems typically has this signal dependent structure. What we need to do is find the variance of these kinds of structures, and the way to do it is to represent the noise vector as a sum; by doing so we can find the variance, and the rest of it is just the usual traces and derivatives of traces. Okay? All right guys, good luck with the homework. I'll see you Monday.