Let's get started, guys. Good to see you again. Looks like whatever the last person wrote on the board sat for a few days before anyone erased it, so it left fairly permanent marks, but hopefully you can still read it. If you're in the fourth row, I hope it's readable; there's plenty of room in the front. The material I'm going to cover today is sections 4.6 and 4.7. What I want to do is give you an intuition about what we mean by uncertainty and where uncertainty comes from when we do estimation. We're going to see that when we try to estimate a parameter, part of the uncertainty comes from the noise in our measurements, and part of it comes from the fact that the data we're getting may not be exploring the input space very much. For example, suppose I have an input variable X that tells me something about the relationship between inputs and a measurement variable Y. If X has many dimensions but I'm only exploring some small region of those dimensions, then the uncertainty I have in estimating the parameter W is going to reflect how much of that X space we actually moved around in. So if X is, say, five-dimensional, but three of those dimensions are only very occasionally measured, then the uncertainty in my estimate of W is going to reflect the fact that X didn't get explored very much. This shows up in the variance of my estimate of W: my uncertainty about W is reflected in that variance, and that variance depends not just on the measurement noise but also on how the input was sampled. This is going to segue into a very important algorithm, one that for many people is really the most important engineering discovery, in terms of mathematical discovery, of the 20th century: the Kalman filter.
And we're going to derive the Kalman algorithm today, and it's particularly relevant to us because Kalman himself came up with it basically a few blocks from where you're sitting, on Balboa Street, while he was working here in the late 1950s, 1959 or so. So it has historic value; in fact, personally I think we should build a statue of the guy to commemorate this amazing thing he did while he was in Baltimore. The idea of the Kalman filter is that we're going to estimate a parameter in such a way that we minimize our uncertainty about it after we update our belief. We're going to have some prior belief about what this parameter is, we're going to make some measurements, and then we're going to combine those two things to form a posterior belief, constructed so as to minimize our uncertainty. So today's discussion is about uncertainty: what is it, and where does it come from? As I mentioned, uncertainty comes from two sources: your measurement noise, and the way the input has been sampled. If the input isn't sampled very well, you're not going to have much certainty about the parameter. We'll see that the history of the inputs is really what's reflected in our uncertainty about the parameter we're trying to estimate. Again, imagine your inputs have many dimensions, and most of the time you see activity in two or three of those dimensions, but only occasionally in the others. That history, the way the input space was sampled, will be reflected in the uncertainty of our estimate; that uncertainty is really a reflection of the past history of the input.
And this uncertainty, which is basically our estimate of the variance associated with the parameter we're trying to estimate, is the key quantity used to combine the difference between what we see and what we predicted, our prediction error, with the past history. It's just like when we did the two GPSs: we had two pieces of information and combined them. Now, one of those pieces of information is the past, everything we've seen so far, and the other is what we observe now, the prediction error from the current event. We're going to combine those two things to form a belief, just like before, except that the "past" now summarizes all the information that came before, not just the current measurement. So the theme today is this notion of uncertainty, and then we'll derive the algorithm that shows us how to combine information from the past with the present: the Kalman filter. I want to start with a little reminder about what a covariance matrix is, because that's going to be one of the key objects in our discussion. Suppose I have a random vector X, made up of X1 and X2, that is normally distributed with mean (0, 0) and covariance matrix [1, −√2; −√2, 2]. If I sample this random variable from this distribution and plot it, with X1 on one axis and X2 on the other, the mean of my cloud of dots is going to be at the origin. The variance of X1 is 1 and the variance of X2 is 2, so there's twice as much variance along the X2 dimension as along X1. But there's also this negative covariance, −√2.
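A quick numerical sketch of this covariance matrix (the matrix is the one just written down; numpy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mean = np.zeros(2)
r2 = np.sqrt(2.0)

# The matrix from the board: var(x1) = 1, var(x2) = 2,
# and a negative covariance of -sqrt(2) between them.
cov = np.array([[1.0, -r2],
                [-r2,  2.0]])

samples = rng.multivariate_normal(mean, cov, size=n)

# The empirical mean sits near the origin, and the empirical
# covariance of the samples recovers the matrix we sampled from:
print(np.round(samples.mean(axis=0), 2))
print(np.round(np.cov(samples.T), 2))
```

(With this particular matrix the correlation happens to be exactly −1, so the "cloud" collapses onto a single line of negative slope, an extreme version of the tilt described here.)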
That means that as X1 increases, X2 tends to decrease, so when you sample points from this distribution you're going to see a tilted cloud: there's an axis of negative slope along which the data fall, and that tilt comes from the covariance term. As one variable increases, the other decreases. Take another distribution, with covariance matrix [1, 0.1√2; 0.1√2, 2], so almost no covariance. Then what I would see is an axis-aligned cloud: more variability in X2 than X1, but no covariance between them. And similarly, if the off-diagonal term were a large positive number, I would see a cloud tilted the other way. So now let me go back to our maximum likelihood estimate. From the maximum likelihood estimate I want to show you what our uncertainty is in our estimate of W, and then show how that uncertainty relates to our measurement noise on the one hand and to how the input space was sampled on the other, so we can see the relationship between these two things. Let's briefly review our basic model (let me see if I can find a slightly better pen). Suppose we have a model that says what I measure on any given trial is the truth on that trial plus some noise: y(i) = y*(i) + ε, where ε is my measurement noise with variance σ². And the truth y*(i) is generated by some true weight vector: y*(i) = w*ᵀx(i). So there really is a true relationship between the input x and these weights I'm trying to estimate, but I can't see the true y*; all I can see is y* plus the measurement noise ε. So what we have is that the probability of y(i) given x(i) and σ is a normal distribution with mean wᵀx(i) and variance σ².
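Here is that generative model as a quick simulation (the particular w*, σ, and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

w_star = np.array([0.5, -1.0])   # the true weights (hidden from us)
sigma = 0.3                      # measurement noise standard deviation

def observe(x):
    """One trial: the truth is y* = w*^T x, but we only see y = y* + eps."""
    y_star = w_star @ x          # hidden from the experimenter
    eps = rng.normal(scale=sigma)
    return y_star + eps

x = np.array([1.0, 0.5])
# Repeated observations at this same x scatter around w*^T x = 0.0
# with standard deviation sigma:
ys = np.array([observe(x) for _ in range(10_000)])
print(ys.mean(), ys.std())       # roughly 0.0 and 0.3
```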
And then when I want to estimate my parameter by maximum likelihood, I have a whole bunch of data points, so I want to maximize the probability of y(1), …, y(N) given x(1), …, x(N), the parameter w, and the variance σ. That's the likelihood function, and it's just the product of the individual probabilities:

L = ∏ᵢ (1 / (2πσ²))^(1/2) exp( −(y(i) − wᵀx(i))² / (2σ²) ),

with N of these terms. We can now write the log-likelihood:

log L = −(N/2) log 2π − N log σ − (1 / (2σ²)) Σᵢ (y(i) − wᵀx(i))².

To find our best estimate of W we maximize this log-likelihood with respect to the thing we're trying to find, w, so we take the derivative of the log with respect to that parameter. If you remember, we can write the sum term as (Y − Xw)ᵀ(Y − Xw), where Y is the vector of observations, X is the N × m matrix whose rows are the input vectors x(i)ᵀ, and w = (w₁, …, wₘ)ᵀ. We multiply that out and find the derivative with respect to w, and it becomes, I think, (1 / (2σ²)) (−2XᵀY + 2XᵀXw). Check that, see if it's right. Wow, amazingly it is. Pretty good.
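You can "check that" yourself with a finite-difference comparison against the data term of the log-likelihood (a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, sigma = 50, 3, 0.5
X = rng.normal(size=(N, m))
Y = rng.normal(size=N)

def data_term(w):
    # The w-dependent part of the log-likelihood: -(1/(2 sigma^2)) ||Y - Xw||^2
    r = Y - X @ w
    return -(r @ r) / (2 * sigma**2)

w = rng.normal(size=m)
# Analytic gradient of the data term with respect to w
# (the minus sign in front flips the -2 X^T Y + 2 X^T X w expression):
grad = (1 / (2 * sigma**2)) * (2 * X.T @ Y - 2 * X.T @ X @ w)

# Central finite differences, one coordinate at a time:
h = 1e-6
for j in range(m):
    e = np.zeros(m); e[j] = h
    fd = (data_term(w + e) - data_term(w - e)) / (2 * h)
    assert np.isclose(fd, grad[j], rtol=1e-4, atol=1e-3)
```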
Okay, so then we set that equal to 0, solve for w, and we get the equation you guys remember: ŵ = (XᵀX)⁻¹XᵀY. This is the maximum likelihood estimate, which you've seen a couple of times now. What I'm interested in now is the variance of this estimate: what is my uncertainty about it? The point I want to show you is that my uncertainty about this estimate depends not just on my measurement noise σ, but also on how the x's were sampled. To do that, let me write the variance of my estimate, var(ŵ). First, write ŵ = (XᵀX)⁻¹XᵀY. But what is Y? Y = Y* + ε, where, written as vectors, ε collects the noise on each trial and Y* the true value on each trial, so ŵ = (XᵀX)⁻¹Xᵀ(Y* + ε). Now, in the variance of ŵ, the Y* term has no uncertainty; the only source of noise is ε, whose variance is the identity matrix times the scalar σ². So var(ŵ) = (XᵀX)⁻¹Xᵀ (Iσ²) X(XᵀX)⁻¹, and when I multiply that out, the (XᵀX)⁻¹XᵀX just cancels, leaving var(ŵ) = σ²(XᵀX)⁻¹. That's the variance of my estimate. Now I want to spend a little time showing you what this means, what this inverse of XᵀX is. The point is that my uncertainty in what I'm trying to estimate depends on the measurement noise as well as on how the input space was sampled. So let's see what that means. Suppose I'm going to give you five trials, and x is
made up of a two-dimensional vector (x1, x2); y is just a scalar, and the w I'm trying to estimate is (w1, w2). So x1 and x2 are given to you, and of course y* will not be given to you; what you have is y, which equals the true wᵀx plus the noise ε. But suppose that for this example I knew y*. Say the five trials look like this: on trials 1 through 4 the input is x = (1, 0), on trial 5 it's x = (0, 1), and y* = 0.5 on every trial. If the true data were generated like this, then of course on the first trial you know x1 and x2, and the y I actually measure is 0.5 plus some noise; that's what I give you, the true 0.5 plus noise. Now from here I can compute the variance. Clearly, what is my estimate ŵ going to be, for this data to have been generated? I have the condition x = (1, 0) with y* = 0.5, and another condition x = (0, 1) with y* = 0.5. So what's my estimate of w? (0.5, 0.5), good. But what do you think my estimate of the uncertainty of w will be? Let's compute it. X is the matrix with rows (1, 0), (1, 0), (1, 0), (1, 0), (0, 1). I form XᵀX and invert it, and what do I get?
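For reference, the matrix algebra for this design takes a couple of lines (a sketch, assuming numpy):

```python
import numpy as np

# Five trials: x = (1, 0) on four of them, x = (0, 1) on one.
X = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [1, 0],
              [0, 1]], dtype=float)

XtX_inv = np.linalg.inv(X.T @ X)
print(XtX_inv)
# X^T X = [[4, 0], [0, 1]], so its inverse is [[0.25, 0], [0, 1]]:
# small uncertainty about w1 (its input was on four times), large about w2
# (on only once), and zero covariance since they were never on together.
```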
I get (XᵀX)⁻¹ equal to some matrix, but I want to see if you can guess what goes in it. Look what's going on here: I get four instances of x1 and only one instance of x2. So if you're thinking about uncertainty, you should guess that I'm going to know a lot about w1 and hardly anything about w2, because x2 was only on once, while x1 was on most of the time. What I'd expect is a small number in the first diagonal entry and a large number in the second, and in fact when you compute XᵀX and invert it, you get 1/4 in the first entry and 1 in the second. The second thing to notice is that the two inputs are never on at the same time, which means there's no covariance between them, so the off-diagonal entries are 0 and 0. What this means is that your uncertainty about w is proportional to your measurement noise, whatever it is, times this matrix describing how the input space was sampled. If I plot my estimate in the (w1, w2) plane, its mean is at (0.5, 0.5), and its variance is small along the w1 dimension and large along w2. In other words, if I took this design, generated real data from the known true value plus noise, collected five such data points, and computed my estimate and its uncertainty, I'd be very certain about w1 but very uncertain about w2. Let me stop for a second and see if you understood that. Let's do another one. Suppose my inputs look like this: the x1 column is (1, 1, 1, 1, 1) and the x2 column is (1, 1, 1, 1, 0), so X has rows (1, 1), (1, 1), (1, 1), (1, 1), (1, 0). Now (XᵀX)⁻¹ is going to be another matrix; see if you can guess what's inside it. First of all, you have a whole lot of cases where x1 is available to you and almost as many where x2 is available, so you're basically going to have about the same amount of certainty about w1 and w2; the two diagonal entries are going to be about the same. But then you have this other feature: whenever x1 was on, x2 was on as well; they were on together. What that means is that for the equation w1 x1 + w2 x2 = y to hold, if one weight were to increase, the other would have to decrease, so the estimates are going to have negative covariance; the off-diagonal entry will be a negative number. And indeed, when we multiply these out, that's what we get: the diagonal entries are 1 and about 1.25, and the off-diagonal is −1. When we find ŵ in this case, it's going to have the same mean as before, but notice that the input history is very different, which means my uncertainty about w is very different: it's associated with a negative covariance, a tilted ellipse in the (w1, w2) plane. So the history of the inputs, the past, is reflected in what we know about this parameter w. And that is really the idea behind what Kalman did. What Kalman was saying is that when you're doing your estimation, you have all the past, and then you have the current data point. If you're talking about Gaussian noise, all the past is really captured in two parameters, a mean and a variance, and the current data point can be incorporated into that history by taking into account this measure of uncertainty, the uncertainty I have about the parameter I'm trying to estimate. So let's now get to our Kalman estimation. Any questions so far? Alright guys, let's see what he did. Suppose we're trying to estimate some parameter w, and on trial n + 1 my estimate is going to be some function of its value on trial n, plus something I'm going to call the sensitivity to error, also called the Kalman gain, times the
difference between what I observed on that trial and what I predicted on that trial. So this is our basic learning equation: I make a prediction of what should happen, I make an observation of what did happen, and I change my estimate of w based on this K, this notion of sensitivity, which is a function of the trial; it depends on which specific trial I'm at. Remember when we were doing maximum likelihood: if I have two kinds of information, say two GPSs, one telling me I'm here and the other telling me I'm there, the way I combine them is by taking into account the variance of each, the uncertainty about each of those measurements. What Kalman is going to say is: this is one of your pieces of information, info one, what you believe about the world, all the stuff you've learned in the past; based on that you make a prediction, which then gives you a second piece of information, this difference, this error. You're going to have to combine one piece of information with the other, and how do we do it? Well, this K should be set in a way that reflects the uncertainties in each of the two pieces of information: the uncertainty in w, and the uncertainty in the error you're measuring. By knowing these two uncertainties we can weigh the two pieces and combine them to form the new estimate of w. That's basically the process. What we want is to combine these two pieces of information in such a way that we optimize something, and that something is going to be the variance of these w estimates, which is my uncertainty. To change things slightly, I now want to think of this not just in terms of trial n but in terms of what we might imagine is a generative model. So there's something that
we're trying to estimate, w; something that we measure, y; and another thing that we measure, x; and this y depends on w and x. At the beginning we have an estimate, call it ŵ(n | n−1): my estimate of w on trial n given everything I've seen in the past, the n−1 previous trials. That's my prior belief, before I see anything on this trial. Now I make an observation y(n), and after I've made the observation I update my belief to ŵ(n | n). So this is my prior, this is my observation, and this is my posterior: I begin with some guess, I make an observation, I change my belief after I've made that observation. We're going to use this terminology throughout: the estimate at trial n given n−1 previous trials is the prior; after I see the nth trial I make another estimate, my posterior, ŵ(n | n). Then for the next trial, ŵ(n+1 | n) becomes my prior, and so forth: on each trial I have a guess based on all the past history, I make an observation, and after the observation my changed guess is called the posterior, which then feeds the prior for the trial after. Now what we want to know is how to set this update; this we observe, this we guess. So let's write the equation: ŵ(n | n) = ŵ(n | n−1) + K(n) ( y(n) − ŷ(n) ), where the prediction ŷ(n) we'll write, because it's a little easier for us this way, as x(n)ᵀ ŵ(n | n−1). So it's our typical learning problem, with a little more notation to keep track of everything we've seen in the past: this is the posterior, this the prior, this the observation. So what Kalman said is: set
this parameter K, your sensitivity, in such a way that after you update w, your posterior leaves you as certain as possible about what you have estimated. He said: in addition to these estimates, which are really the mean of your distribution, you also have a variance associated with them. So you have var(ŵ(n | n−1)), which we're going to define as the matrix P(n | n−1): my uncertainty about my estimate. This is my mean, this is my variance, just like in the regression case, where this point is the mean of my estimate and σ²(XᵀX)⁻¹ is the variance of my belief. There's going to be some uncertainty in that estimate, and you saw that when we were doing regression there was a very clear relationship between the past input history and this variance. So what Kalman said is: find the K that minimizes the posterior variance P(n | n). Find the change in w such that when you set this K, the posterior variance becomes as small as possible. We'll have to talk a little about what we mean by that, because P is a matrix, and what does it mean to make a matrix small? We need some norm associated with it, which I'll show you. But for now let's just multiply out the update equation. I get ŵ(n | n) = ŵ(n | n−1) + K(n) y(n) − K(n) x(n)ᵀ ŵ(n | n−1), so ŵ(n | n) = ( I − K(n) x(n)ᵀ ) ŵ(n | n−1) + K(n) y(n). The other element here is that y(n) is made up of w*ᵀ x(n) + ε, so we can substitute that in as well, and that becomes our equation. Now what do we want to do? We want to find the posterior uncertainty P(n | n) and then minimize it with respect to K: find the K that minimizes the posterior
uncertainty. That's where we're going. So let's find the variance of that equation, because that's the thing we're going to minimize. What is the variance of ŵ(n | n), i.e., P(n | n)? For the first term, ( I − K(n) x(n)ᵀ ) ŵ(n | n−1), the variance is ( I − K(n) x(n)ᵀ ) P(n | n−1) ( I − K(n) x(n)ᵀ )ᵀ. For the second term, K(n) y(n), the only random variable is ε, so its variance is K(n) σ² K(n)ᵀ, where σ² is the variance of ε. So

P(n | n) = ( I − K xᵀ ) P(n | n−1) ( I − K xᵀ )ᵀ + σ² K Kᵀ.

Now let me multiply this out so we can attach a norm to this variance, because we have to define what we mean by "minimize the variance": the variance is a matrix, and I don't know what it means to minimize a matrix without giving you a norm for it. But first, what is this P in principle? It's a covariance structure in which the diagonal elements are the variances of the individual parameters, σ₁², σ₂², and so on, however many parameters we have, and the off-diagonal elements are the covariances between them. And a good norm is the trace. What is a norm? The qualities of a norm are as follows: first, the norm of a quantity is a single number that is always non-negative; second, the norm is zero only when the quantity is identically zero. If we use the trace of a covariance matrix, the sum of the diagonal elements, then first of all that sum is always going to be non-negative, because the diagonal entries are variances, squared quantities. Second, if the trace is zero, the matrix has to be zero. Why? Because all the diagonal elements are non-negative, so nothing can cancel; they all have to be zero. And if σ₁ is zero, σ₂ is zero, and σ₃ is zero, then all the
covariances have to be zero as well. So the trace of a variance-covariance matrix has the special property of being a norm. That's not true for a general matrix, but for a variance-covariance matrix the trace is a norm. So we're going to minimize the trace of the posterior variance-covariance matrix, and by that I mean: find the derivative of tr P(n | n) with respect to K(n), set it equal to zero, solve for K, and see what it tells us. That's the direction we're going. So let's move forward. The trace of P(n | n), the posterior quantity, is of course just a number, a scalar. Expanding the product, I get tr P(n | n−1), and then two cross terms; since the trace of a quantity doesn't change when you transpose it, those two cross terms are the same quantity, one being the transpose of the other, so I can write them together as −2 tr( K(n) x(n)ᵀ P(n | n−1) ). That takes care of those terms. Then I have the remaining quantity, and if you look at it, the term in the center, x(n)ᵀ P(n | n−1) x(n), is a scalar, so I can pull it out. And notice I was missing a K transpose in the last term: it's K times the variance of ε times K transpose, σ² K Kᵀ, so there's a Kᵀ here and here. So this
term, I just rewrote it like this. The trace of the sum is the sum of the traces, so writing the whole thing down, I need tr( K(n) K(n)ᵀ ). What is that? K Kᵀ is the outer product of the vector K = (k₁, …, kₘ)ᵀ with itself, so its diagonal entries are k₁², k₂², …, kₘ², and the off-diagonal terms don't matter for the trace. That means tr( K Kᵀ ) = Σᵢ kᵢ² = KᵀK, for any vector K. That lets me simplify the last term: the scalar x(n)ᵀ P(n | n−1) x(n) + σ² multiplies tr( K Kᵀ ) = K(n)ᵀ K(n). And for the cross term, the trace of a matrix is the same as the trace of its transpose, whatever the matrix is, so I can transpose it and the scalar part comes out. Alright, so next I have the problem of finding the derivative of that expression with respect to K. Let me check to see if I have any errors in my equation... it looks okay. So, the derivative of tr P(n | n) with respect to K(n): the derivative of the cross term is −2 times... careful, this has to be transposed, because we want a vector, an m-by-1 vector, which x(n)ᵀ P(n | n−1) doesn't give us; it
is going to be P(n | n−1) x(n). Then the derivative of the last term with respect to K is 2 ( x(n)ᵀ P(n | n−1) x(n) + σ² ) K(n). So now I set the derivative equal to 0 and solve for K(n):

K(n) = P(n | n−1) x(n) ( x(n)ᵀ P(n | n−1) x(n) + σ² )⁻¹.

Let me stop for a second and show you what this means. It says that your mixing term K, which is how we're going to combine our observation with our prior belief, is a ratio between my prior uncertainty and the term inside the inverse. So what is x(n)ᵀ P(n | n−1) x(n) + σ²? Well, your observation was y(n) = x(n)ᵀ w(n) + ε, so the variance of y is xᵀ (uncertainty of w) x plus σ², the noise in my measurement. So this term is the measurement uncertainty, and P(n | n−1) is my prior uncertainty, my belief before I started. It's saying: weigh the prior uncertainty in ratio to your measurement uncertainty, and use that ratio to scale your prediction error when you update your belief. You just measured something, there was an error between what you measured and what you predicted; the top equation says to multiply that error by a quantity that's the ratio between your prior uncertainty P(n | n−1) and your measurement uncertainty. So if your prior uncertainty is very low, K is going to be small, and your estimate won't change much: it's saying, sure, you had a prediction error, but you were so certain about your prior belief that I'm not going to weight that prediction error very much. It's a ratio of two things: the prior uncertainty P(n | n−1) and the current measurement uncertainty, the quantity inside the inverse. In this case it's a scalar, so we don't need to worry about the inverse,
but in general it may be a matrix quantity, which I'll show you in a minute. So now I have my update for the mean: ŵ(n | n) = ŵ(n | n−1) + K(n) ( y(n) − ŷ(n) ). I know how to get this equation, because I know K: based on my prior uncertainty, the current input x, and my measurement noise, I know how to set K, so I can get my posterior belief, the mean of where I believe my parameters are. But what about my posterior uncertainty, P(n | n)? I need to set both of these. P(n | n) depends on P(n | n−1) and on K(n), and K(n) I just computed for you, so I'm just going to substitute K(n) into the variance equation and simplify: P(n | n) = ( I − K(n) x(n)ᵀ ) P(n | n−1) ( I − K(n) x(n)ᵀ )ᵀ + K(n) ( x(n)ᵀ P(n | n−1) x(n) + σ² ) K(n)ᵀ. Basically I plug in K(n) and simplify; it's not difficult, just a bunch of multiplications, and a whole lot of things cancel out. You can see it: K has the term ( xᵀPx + σ² )⁻¹ in it, and we have the same term here without the inverse, so they cancel. And when I do it, what I get is

P(n | n) = ( I − K(n) x(n)ᵀ ) P(n | n−1).

Okay, so what I have are two equations: here's my Kalman gain for the trial, which depends on my prior uncertainty and the current input x; here's the equation for changing the mean of the estimate; and here's the equation for changing the variance of the estimate. So what I have now is that on every trial I have a technique for computing a
posterior belief, given that I start with a prior belief. So remember what my prior was: I had this w with its mean and its variance. That's my prior. I make an observation, which is made up of y(n) and x(n), and then I form a posterior estimate with mean w(n|n) and variance P(n|n), and I do this by computing my Kalman gain K(n). One final thing. You can begin to see, for example, that the posterior variance is something less than one multiplying the prior variance. So in principle your variances get smaller as you do this, simply because when you combine two sources of information you get something better than either one alone, just like when we had two GPSs. So the posterior variances become smaller as you get more data. The second thing to notice is that this gain K depends on the x's, the inputs that I'm seeing, and the history of those x's, which is stored in P. P is the uncertainty matrix, and in that uncertainty matrix I keep a history of all the past inputs I've seen. I take a current measurement, I have all the past history, and by combining the two I form my posterior. So what we've been doing is really an example of the problem of state estimation. There's some parameter you're trying to estimate, call it w, and that w gets observed through these quantities x and y. In a given trial, I have my prior belief w(n|n-1) and my prior uncertainty P(n|n-1). I make my measurement y(n), x(n). I compute my K(n). I form my posterior mean w(n|n) from K times the difference between my prediction and my observation, and I also form my posterior variance P(n|n). All right, so in principle there will be a next trial, another trial that looks just like this. Now, what's the relationship between w in this trial and w in the previous trial? Is this the same system that we're
going to be estimating again, or is this a completely different system? In principle we think there is a relationship, so I'm going to draw an arrow between the two. In all the experiments we've been talking about so far, we assumed that this relationship is identity: we assumed w(n+1) is the same thing as w(n), so we didn't think the system changes between one trial and the next. But it doesn't have to be that way. There could be some matrix A that defines the relationship between one trial and the next. What this means is that the generative model looks like this: w(n+1) = A w(n) + epsilon_w, where epsilon_w is drawn from a normal distribution with mean zero and variance-covariance matrix Q, and our observation is y(n) = x(n)^T w(n) + epsilon_y. Call this one epsilon_w and that one epsilon_y: epsilon_y is my measurement noise, and epsilon_w is my state noise. The first is called the state equation, and the second is called the measurement equation. So the idea now is that there's a system out there, and you're trying to estimate some parameter w for it. Until now we've been assuming that when we see x and y on a given trial, the w on the next trial is the same as the previous w. But in principle there could be some uncertainty in our belief that it's the same w. So I can write it like this: w on trial n+1 is w(n) multiplied by some matrix A, which could be identity, but there's also some noise in my belief about this particular state update equation. And my measurement y(n) is related to w through some x, but it also has measurement noise. So this is my measurement noise, and this is my state noise, and in the experiments we've been talking about the state noise has been zero. But it doesn't have to have variance zero; it could have variance Q. This doesn't change any of the math we've been describing so far. What it does is the following:
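The generative model just written on the board, a state equation plus a measurement equation, can be simulated directly. A small sketch with illustrative values for A, Q, and the measurement noise (none of these numbers come from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.eye(2)          # state transition; identity means w is (on average)
                       # the same from one trial to the next
Q = 0.01 * np.eye(2)   # state noise covariance (the variance Q above)
sigma = 0.1            # measurement noise standard deviation

w = np.array([1.0, -1.0])   # true parameter we will later try to estimate
trials = []
for n in range(100):
    x = rng.standard_normal(2)                           # input on trial n
    y = x @ w + sigma * rng.standard_normal()            # measurement equation
    trials.append((x, y))
    w = A @ w + rng.multivariate_normal(np.zeros(2), Q)  # state equation
```

With Q set to zero this collapses back to the constant-w setting we had before; with Q nonzero, the parameter itself drifts from trial to trial, which is exactly the situation the state equation is meant to capture.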
So far, what I've done is this: I have a prior mean and variance, I take a measurement, I compute K, and I compute a posterior. Now, what's my prior on the next trial, n+1? It's going to be my posterior mean here, w(n|n), multiplied by A, using that state equation. My posterior on this trial becomes my prior on the next: w(n+1|n) = A w(n|n) is my prior mean on trial n+1, and the prior variance is P(n+1|n) = A P(n|n) A^T + Q, which is basically the posterior variance propagated through A, plus the state noise. So if I only have this much information, I compute a prior, I take a measurement, I update my weight based on what I saw and form a posterior. Now what I need to do is propagate my information to the next trial. What's the relationship between the next trial and the current trial? The state equation tells me that relationship: on the next trial your w is going to be related to this w through the matrix A, but with some uncertainty in it as well, which we called epsilon_w. So in order to compute my prior on the next trial, w(n+1|n), I take w(n|n) and multiply by A; that's the mean of my prior. The variance of my prior is P(n+1|n) = A P(n|n) A^T + Q. And then, with this prior on trial n+1, I compute K(n+1) the same as before, I update my belief about w, and I form a posterior of that weight. The reason I wanted to show you the relationship between the posterior on one trial and the prior on the next is that in principle the parameter we're trying to estimate can itself transform, with some uncertainty. For example, say the thing we're trying to estimate is the state of some object, a rocket that has been sent off, and we're trying to estimate the position of that object. Well, that
position is going to change; it's not going to be constant. That position is going to depend on the thrust, on some input that we gave it. So in this case the thing we're trying to estimate depends on the commands we've given it, and then we have some telemetry from it: occasionally we get feedback saying it's here, it's here, it's here. Every time we get feedback, it allows us to estimate where it actually is; then, if we have no other information until the next feedback, we have to project forward in time where it will be later. So in that case the state is going to change, and it's going to change based on our knowledge of the dynamics of that system. For example, there could be other terms in the state equation, inputs that we've given it, so the thing can move based on things we've done. So the thing we're trying to estimate can change, but nevertheless the same mathematics we used to estimate this weight, which we thought was constant, can be used to estimate variables that change in time. And that's where Kalman filters make their biggest bang: estimating the states of linear dynamical systems. In a linear dynamical system, you give inputs, the state of the system changes, and occasionally you get feedback from it. Then you ask: what's the state of the system? That state is estimated through occasional measurements, but also through your belief about where it is, which comes from that top equation, the state equation. That belief is the prior that you combine with your measurement, and you then form a posterior that says: this is where it's actually located. And the whole process is based on minimizing the trace of the posterior uncertainty. And why a trace?
Because the trace of a covariance matrix is a norm: it's a single positive number summarizing the uncertainty of your estimate, and you want to estimate in such a way that you minimize that number, that is, maximize how certain you are. Okay, I'll stop there if there are no questions. You have plenty of homework examples to try this out and learn what it's about. The material is also written in the book, if you want to take a look at it; it will give you further insight. All right, guys, thank you so much. See you Wednesday.
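The whole cycle the lecture walked through, a measurement update using the Kalman gain followed by a time update that propagates the posterior into the next trial's prior, fits in one short loop. This is a sketch for the lecture's scalar-measurement setup; the function name `kalman_filter` and the test values are illustrative, not from the lecture.

```python
import numpy as np

def kalman_filter(A, Q, sigma2, trials, w0, P0):
    """Run the predict/update cycle over a sequence of (x, y) pairs.

    Measurement update: K = P x / (x^T P x + sigma^2)
                        w <- w + K (y - x^T w)
                        P <- (I - K x^T) P
    Time update:        w <- A w,  P <- A P A^T + Q
    """
    w, P = np.array(w0, float), np.array(P0, float)
    estimates = []
    for x, y in trials:
        s = x @ P @ x + sigma2                     # measurement uncertainty
        k = (P @ x) / s                            # Kalman gain K(n)
        w = w + k * (y - x @ w)                    # posterior mean w(n|n)
        P = (np.eye(len(w)) - np.outer(k, x)) @ P  # posterior variance P(n|n)
        estimates.append(w.copy())
        w = A @ w                                  # prior mean w(n+1|n)
        P = A @ P @ A.T + Q                        # prior variance P(n+1|n)
    return estimates

# Constant scalar state (A = I, Q = 0): estimates converge toward the
# repeatedly measured value, and the posterior variance keeps shrinking.
trials = [(np.array([1.0]), 2.0)] * 20
est = kalman_filter(np.eye(1), np.zeros((1, 1)), 0.01, trials,
                    w0=np.zeros(1), P0=np.eye(1))
```

With identity A and zero Q this reduces to the constant-parameter estimator from earlier in the lecture; with nonzero Q and a nontrivial A, the same loop tracks a drifting state, which is the linear-dynamical-systems use the lecture ends on.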