So, now we will consider the Kalman filter, which we introduced at the end of the previous lecture. The Kalman filter is a recursive algorithm for estimating the state of a system that evolves linearly, driven by Gaussian noise, and for which we get observations that are also a linear function of the state, corrupted by Gaussian noise. The system here is as follows: the state evolves as x_{k+1} = A_k x_k + F_k u_k + w_k, and at any time k we get an observation y_k = C_k x_k + G_k u_k + v_k. The x's and y's are vectors; let us assume the x_k's are all in some R^n and the y_k's are all in some R^m. The u_k's are a known exogenous sequence, that is, a disturbance whose value we know. The w's and v's are noise. Unlike in the past, we will now assume that the noise has a specific distribution: the w_k's are distributed as N(0, Q_k), the v_k's are distributed as N(0, R_k), and the initial state x_0 is distributed as N(x̂_0, Σ_0).
Now, what is this N(·, ·)? It is the Gaussian, or normal, distribution. We write N(x; μ, P) for the Gaussian density evaluated at x, having mean μ and covariance matrix P. If x is in R^n, this density is given by N(x; μ, P) = (2π)^{-n/2} |det P|^{-1/2} exp( -(1/2) (x - μ)^T P^{-1} (x - μ) ). Let us look at this carefully. Since x is a vector, its mean μ is also a vector, so x - μ is a vector; we take its transpose, multiply it by a matrix, and then multiply by the vector again, so the exponent is a scalar. Here P is the covariance matrix: if x is a random vector in R^n, its covariance matrix is E[(x - μ)(x - μ)^T], so P is an n × n matrix. It turns out that P is always positive semidefinite, and if the distribution is not degenerate it is in fact positive definite, so its determinant is always a positive quantity. The covariance matrix appears in two places: in the determinant inside the normalizing constant, and in the inverse inside the exponential. The dimension n appears explicitly as the power of √(2π) in the denominator. So this is the Gaussian distribution in n dimensions, with mean μ and covariance P.
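The density formula above translates directly into code; here is a minimal sketch in Python (the function name is ours, not from the lecture):

```python
import numpy as np

def gaussian_density(x, mu, P):
    """Evaluate the n-dimensional Gaussian density N(x; mu, P):
    (2*pi)^(-n/2) * det(P)^(-1/2) * exp(-0.5 * (x-mu)^T P^{-1} (x-mu))."""
    n = x.shape[0]
    diff = x - mu
    # The quadratic form (x - mu)^T P^{-1} (x - mu) is a scalar.
    quad = diff @ np.linalg.solve(P, diff)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(P))
    return np.exp(-0.5 * quad) / norm_const
```

For n = 1 with zero mean and unit variance this reduces to the familiar (1/√(2π)) exp(-x²/2), and at the mean the exponential term is 1, leaving only the normalizing constant.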
So, what we are assuming about the noise in our system is that in both cases, the system noise as well as the observation noise, it is Gaussian with mean 0; the covariance matrices are Q_k and R_k respectively. The system starts from a state x_0 which is drawn according to a Gaussian distribution with mean x̂_0 and covariance Σ_0. Now, what we are looking for, once again, is a filter: we want a way of recursively estimating x_k given y_1, ..., y_k, or more generally of recursively computing the conditional distribution of x_k given the observations. The way to do this, it turns out, is a simple application of the previous theorem, in which we had used Bayes' rule to compute the updated distribution. But it is a little more than just that, because there is a lot of underlying structure in the solution, which gives us a lot of insight into how the new distribution is derived from the old one. In a sense, this process of updating the distribution is really about updating one's belief about a particular state as the state keeps changing and as fresh observations keep coming; that is what this filter is helping us do. The beauty of the filter is really in the underlying structure that it clarifies about how the update of beliefs needs to happen. This filter was discovered by Rudolf Kalman and is one of the prettiest results in the theory of control.
Alright, so before we actually derive this, let us make note of an auxiliary result, which is the following simple observation. We say x and y are jointly Gaussian if ax + by is Gaussian for all scalars a and b. So, it is not enough that they are individually Gaussian: when there are two random variables whose marginals are Gaussian, that does not automatically make them jointly Gaussian. They are jointly Gaussian if every linear combination of the two variables is also a Gaussian random variable. One case in which being individually Gaussian becomes equivalent to being jointly Gaussian is when the two are independent: if x and y are independent Gaussians, then they are also jointly Gaussian. Now, why do we need to think about jointly Gaussian random variables? The reason is the following simple result. We have already seen that the conditional expectation of x given y is the minimizer of a particular optimization: it is the best estimate of x when you are given y, best in the sense that it minimizes the mean squared error.
So, if you minimize the expected squared error E[(x - f(y))^2] between x and a function of y, over all possible functions f, it turns out that the optimal choice is f*(y) = E[x | y], the conditional expectation of x given y; this is something we have already seen in one of our previous lectures. Now, what happens if x and y are jointly Gaussian? It turns out that then the conditional expectation of x given y is in fact a linear (affine) function of y. This is something really beautiful and elegant that happens when we are considering jointly Gaussian random variables: you take two random variables that are jointly Gaussian, and you want to estimate one from the other, and it turns out that the best thing one can do is a linear transformation of the given information. This linear transformation of course depends on the joint statistics, on the cross-covariance and so on, and is a fairly involved expression, but the most important thing is that it is linear in the observation. Notice that this property is not true in general: a conditional expectation involves an integral over the posterior, the posterior involves a prior and the observation kernel, and all of this usually leads to a mess; there is no guarantee that it reduces to any simple expression.
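For concreteness, the "fairly involved expression" for jointly Gaussian vectors has the standard closed form E[x | y] = μ_x + Σ_xy Σ_yy^{-1} (y − μ_y), stated here without derivation; a minimal sketch under that assumption (the helper name is ours):

```python
import numpy as np

def conditional_mean(mu_x, mu_y, S_xy, S_yy, y):
    """E[x | y] for jointly Gaussian (x, y):
    mu_x + S_xy S_yy^{-1} (y - mu_y), which is linear (affine) in y."""
    return mu_x + S_xy @ np.linalg.solve(S_yy, y - mu_y)

# Example: x ~ N(0, 1) and y = x + v, with v ~ N(0, 1) independent of x.
# Then Cov(x, y) = 1 and Var(y) = 2, so E[x | y] = y / 2.
est = conditional_mean(np.zeros(1), np.zeros(1),
                       np.array([[1.0]]), np.array([[2.0]]),
                       np.array([1.0]))
```

Note that however the observation y comes in, the estimate depends on it only through this fixed linear map; the map itself is determined entirely by the joint statistics.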
But in the case of jointly Gaussian random variables, this is what we get: E[x | y] is a linear function of y. Now, thanks to this linearity, the boxed expression that we were looking to compute becomes significantly easier to compute; the linearity reduces the complexity of the computation. Another corollary of this linearity is the following. Since x and y are jointly Gaussian, y itself is also Gaussian (you can always take a = 0 and b = 1), and E[x | y], being a linear function of the Gaussian y, is also a Gaussian random variable. Consequently, the best estimate of a Gaussian random variable given another Gaussian random variable is itself a Gaussian random variable. So what we are getting here is a kind of preservation of form: we start with a Gaussian random variable, we are given information which is also Gaussian, and the best estimate then turns out to be Gaussian itself. This suggests that there should be a very elegant and simple way to use the current estimate of a random variable in deriving an updated estimate. The things that we need to do as far as filtering is concerned, namely prediction and measurement update, are things we could potentially do rather elegantly, because there is a preservation of form.
Remember, this is much in the spirit of what we saw when we were doing linear quadratic problems, where the quadratic nature of the problem continued to be preserved, and that is why we could recursively keep computing the value function at each step without any blow-up in complexity. Something similar is happening here. The exact nature in which they are similar needs a little bit of elaboration, which I am not going to do here, but intuitively do remember that something of the same nature is going on in this problem as well. So, now let us see how this observation is going to be applicable to our problem. Our problem has a state with linear dynamics, and as a consequence of the linearity, the state at any time k can be written as a function of the initial state, all the noise up until time k, and all the exogenous inputs. In other words, we can keep back-substituting the x's using the dynamical equation, and as we do so we keep accumulating sums of noise terms; remember, we did something similar in one of our earlier problems, when we were looking at linear systems with partial state information. So you can think of x_k as, effectively, a linear function of x_0, the initial state, and w_0, ..., w_{k-1}. These, remember, are Gaussians, and they are independent; I forgot to mention that the noise is independent across time. So these are all independent Gaussian random variables, and consequently x_k itself is also Gaussian.
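The back-substitution argument can be checked numerically: run the recursion forward, then rebuild x_k directly from x_0, the inputs, and the noise, using products of the A_j matrices. A small sketch, with arbitrary matrices standing in for A_k and F_k and fixed sample values standing in for the noise (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_final = 2, 5
A = rng.standard_normal((k_final, n, n))   # stand-ins for the A_k matrices
F = rng.standard_normal((k_final, n, n))   # stand-ins for the F_k matrices
u = rng.standard_normal((k_final, n))      # known exogenous inputs u_k
w = rng.standard_normal((k_final, n))      # fixed samples of the noise w_k
x0 = rng.standard_normal(n)

# Forward recursion: x_{k+1} = A_k x_k + F_k u_k + w_k.
x = x0.copy()
for k in range(k_final):
    x = A[k] @ x + F[k] @ u[k] + w[k]

def transition(j):
    """Product A_{k_final-1} ... A_j (identity when j == k_final)."""
    Phi = np.eye(n)
    for i in range(j, k_final):
        Phi = A[i] @ Phi
    return Phi

# Unrolled form: x_k is linear in x0 and in each (u_j, w_j) term,
# with coefficients given by products of the A matrices.
x_unrolled = transition(0) @ x0
for j in range(k_final):
    x_unrolled = x_unrolled + transition(j + 1) @ (F[j] @ u[j] + w[j])
```

The two computations agree, which is exactly the statement that x_k is a fixed linear function of x_0 and w_0, ..., w_{k-1} (plus known input terms).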
So, these are independent Gaussians, and consequently x_k is also Gaussian. Something similar can be written for y_k: since y_k = C_k x_k + G_k u_k + v_k, it too can be written as a linear function of x_0, of w_0, ..., w_{k-1}, and of v_k. These are again independent Gaussians in the same way, and therefore y_k is also Gaussian. In fact, because these are all linear functions of independent Gaussians, the entire vector (x_1, ..., x_k, y_1, ..., y_k) is a jointly Gaussian vector. Why? Because if you take linear combinations of any of these components, what you end up with is a linear combination of the independent noise vectors, and, by what I wrote earlier, independent Gaussian random variables are jointly Gaussian. Hence what we are getting is that this is a jointly Gaussian random vector. Consequently, when we do estimations like these, when we estimate x_k given y_1, ..., y_k, the result is going to be a linear function of y_1, ..., y_k. So, when we make estimates of the kind that we are looking for in the filtering problem, the estimate turns out to be a linear function of whatever it is that we are conditioning on, and thanks to this we will be able to do a lot of our calculations easily.
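The stacking argument can be made concrete in the smallest possible case: write x_1 and y_1 as linear functions of the independent vector z = (x_0, w_0, v_1), and read the linear estimate off the joint covariance. A toy scalar instance with all variances set to 1 (the setup is ours, not from the lecture):

```python
import numpy as np

# Scalar system x_{k+1} = x_k + w_k, y_k = x_k + v_k, x_0 ~ N(0, 1),
# all noise variances 1.  Stack z = (x_0, w_0, v_1); then
#   x_1 = x_0 + w_0        -> row (1, 1, 0)
#   y_1 = x_1 + v_1        -> row (1, 1, 1)
M = np.array([[1.0, 1.0, 0.0],     # x_1 as a linear function of z
              [1.0, 1.0, 1.0]])    # y_1 as a linear function of z
Sigma_z = np.eye(3)                # independent components, unit variances

# Joint covariance of (x_1, y_1); Gaussian conditioning then gives a
# linear estimate E[x_1 | y_1] = (S_xy / S_yy) * y_1.
Sigma = M @ Sigma_z @ M.T
S_xy, S_yy = Sigma[0, 1], Sigma[1, 1]
coeff = S_xy / S_yy
```

Here the coefficient comes out to 2/3, so the best estimate of x_1 after one observation is (2/3) y_1; with more observations the same construction goes through with a taller stacked vector.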
Moreover, because it is a linear function of y_1, ..., y_k, and the y's are themselves Gaussian, this estimate is also Gaussian; and, more to the point, the conditional distribution of x_k given y_1, ..., y_k, which is really what we had denoted by π_k(x), is Gaussian, and can therefore be described using only two parameters: its mean and its covariance. So the challenge in designing the filter for a linear Gaussian system like ours is really about updating these two quantities, the mean and the covariance, recursively; this is essentially our challenge. We have to come up with a way to update the mean and the covariance recursively as we get fresh information. That is basically the essential idea behind the Kalman filter. The Kalman filter is given by a bunch of equations, and you will see that all they are really doing is updating these two quantities in a recursive manner. We will see the exact form of the filter in the next class.
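For orientation ahead of that derivation, the mean/covariance recursion has the standard textbook predict-then-update shape sketched below. This is only a preview under simplifying assumptions: the exogenous input terms F_k u_k and G_k u_k are dropped, and the function name is ours.

```python
import numpy as np

def kalman_step(x_hat, Sigma, y, A, C, Q, R):
    """One recursion of the (to-be-derived) Kalman filter, with the
    exogenous input terms omitted for brevity."""
    # Prediction: propagate the current mean and covariance through
    # the dynamics x_{k+1} = A x_k + w_k, w_k ~ N(0, Q).
    x_pred = A @ x_hat
    Sigma_pred = A @ Sigma @ A.T + Q
    # Measurement update: weight the innovation y - C x_pred by the
    # Kalman gain K, then shrink the covariance accordingly.
    S = C @ Sigma_pred @ C.T + R            # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    Sigma_new = (np.eye(x_hat.shape[0]) - K @ C) @ Sigma_pred
    return x_new, Sigma_new
```

Each call propagates the belief through the dynamics and then corrects it with the new observation; why these particular equations are the right update is exactly what the next lecture will derive.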