Welcome back. In the previous lecture we asked this question: what is the sufficient statistic that can be used for every partially observed Markov decision process, or partially observed stochastic control problem? That is what we will dwell on in this lecture. So the question for this lecture is: what is the universal sufficient statistic for any partially observed problem?

The first thing to realize is that there is really no such thing as a sufficient statistic that is universal regardless of the goal we are trying to accomplish. Sufficient statistics are tied to a specific goal: every problem we are looking to solve, every particular task we are looking to accomplish, has associated with it its own appropriate sufficient statistic. One cannot really talk of a sufficient statistic without giving the context of a task at hand. However, given the task, or more specifically given a class of problems, we can talk of the sufficient statistic that works for all problems in that particular class. That is a well-posed question; but without having specified a context or a problem class, to talk of a sufficient statistic does not really make sense, because every problem, every kind of goal, may require a different piece of information to accomplish that particular task.

Now, since we are talking here of partially observed Markov decision processes, that is, stochastic control problems with imperfect state information, the sufficient statistic we are looking for is one that lets us compute the optimal control. So when we ask for the universal sufficient statistic, we are asking the following: the optimal control u_k* = mu_k*(I_k), which at present is written only as a function of the information I_k, can it be written as a function of something simpler, say as a function of H_k(I_k)? This inner quantity H_k(I_k) is then our sufficient statistic. That is what we mean when we look for a universal sufficient statistic that works for every problem: we should be able to define such an H_k which works for every problem.

Now, what is common across all of these problems? They are stochastic control problems, which means they have a certain structure to the cost, a certain structure to the dynamics, and a certain assumption about the information. One kind of approach one could take to this question, inspired by what we have seen in the LQG (or LQ) setting, is to start guessing what the sufficient statistic might be. You might say: since in the LQ problem it turned out that the conditional mean of the state given the information was all we needed, maybe that is the sufficient statistic which will work for all problems. So we may guess that the conditional mean is the universal sufficient statistic. But it turns out this is not the case; it is insufficient for many problems.
So the conditional mean is not the right sufficient statistic. We may then say: why not generalize this and look at the second moment as well, that is, the conditional mean together with the conditional second moment E[x_k x_k^T | I_k], or equivalently the conditional covariance of x_k given I_k? This again is enough for the Gaussian problem, but it turns out this also is not sufficient in general; we need something more. We may then say: well, let us look at the third moment, the fourth moment, and so on, and this starts giving us a picture of what we really need. We should not repeat the mistake we made at the start of the course, when we looked at problems that involved risk: there we tried to preempt what information we would need in order to decide on a particular lottery, and we looked only at the statistic that seemed most appealing, which at that time was just the mean, and it turned out there was a fallacy in doing so.

So what is it that we ultimately need in order to make decisions in stochastic control problems? The structure of a stochastic control problem is this: there is a cost that depends on the state; the state itself transitions in a Markov manner; and the observations we get are also functions of the state. Thus what we see is that it is our ignorance of the state which is the issue. The challenge in a stochastic control problem is that we do not know what the state itself is; if we knew the state, all of these quantities derived from the state would also become known to us, and the problem would become a perfectly observed problem. Consequently, what we need in order to describe the optimal control and the optimal cost in a stochastic control problem is not just the first, second, or third moment of the state given the information, but all the moments of the state given the information. In other words, what we need is the probability distribution of the state given the information. If this is known, then everything we are attempting to compute about the optimum, meaning the optimal control and the optimal cost, can be computed from it. So the probability distribution of the state given the information is the universal sufficient statistic. Remember: it is not just the information, but the probability distribution of the state given the information, that is the universal sufficient statistic.
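To make this concrete, here is a small numerical sketch (my own illustration, not from the lecture) of why low-order moments cannot be universally sufficient: two beliefs over four hypothetical states with the same mean and the same variance, but for which the action minimizing expected cost differs. All states, costs, and probabilities below are made up for illustration.

```python
import numpy as np

# Two hypothetical beliefs over states {0, 1, 2, 3} with identical mean and
# variance but different higher moments.
states = np.array([0.0, 1.0, 2.0, 3.0])
belief_a = np.array([0.25, 0.25, 0.25, 0.25])
belief_b = np.array([1/3, 0.0, 1/2, 1/6])

for name, b in [("A", belief_a), ("B", belief_b)]:
    mean = b @ states
    var = b @ states**2 - mean**2
    print(name, "mean =", mean, "var =", round(var, 3))   # both: mean 1.5, var 1.25

# Hypothetical state-dependent costs c(x, u); columns correspond to actions u0, u1.
costs = np.array([[0.0, 0.1],
                  [1.0, 0.1],
                  [0.0, 0.1],
                  [0.0, 0.1]])

# The action with smallest expected cost differs between the two beliefs,
# even though their first two moments agree.
print("best action under A:", np.argmin(belief_a @ costs))  # action 1
print("best action under B:", np.argmin(belief_b @ costs))  # action 0
```

So any statistic built from a fixed, finite set of moments can be blind to differences in the belief that matter for the decision.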
Earlier we were trying to estimate certain moments derived from this distribution, but essentially what we are saying is: if we know the distribution, then for every problem, regardless of what the nature of the problem is, we should be able to describe the optimal control and the optimal cost. This universal sufficient statistic for partially observed problems has a name: this quantity, which is the conditional distribution (not the conditional mean, the conditional distribution) of the state given the information, is called the belief state.

What I will show you now, and convince you of, is that this is in fact the correct universal sufficient statistic. We will take an arbitrary partially observed problem, write out the belief state for that particular problem, and show that the optimal action and the optimal cost can be computed as functions of this belief state; as a result, the belief state becomes a sufficient statistic for the problem. So that is what we will move to now: the belief state formulation of a partially observed Markov decision process.

In order to describe the belief state formulation, let us first write out a general Markov decision process with partial observations. First, the state space: we have a state space X = {1, ..., n}; this is your state space. We have an action space U, with, let us say, actions 1 to U; this is the action space. And we have an observation space denoted Y. The state evolves according to a controlled Markov chain, so there are transition probabilities: p_ij(u) = P(x_{k+1} = j | x_k = i, u_k = u). We also have an observation kernel, or observation distribution, denoted b: b_i(y, u) = P(y_{k+1} = y | x_{k+1} = i, u_k = u); that is, the probability that the next observation is y, given that the state at that time is i and the action you take is u. And of course, as before, we incur a cost: a stage-wise cost c_k(x_k, u_k) plus a terminal cost c_N(x_N).
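As a concrete picture of these ingredients, here is a minimal sketch of how such a finite POMDP could be written down in code. The two-state "working/broken" flavour, all the numbers, and all the names are my own assumptions, chosen only to match the notation p_ij(u), b_i(y, u), c(i, u), c_N(i) above.

```python
import numpy as np

# A minimal, hypothetical POMDP specification:
# states X = {0, 1}, actions U = {0, 1}, observations Y = {0, 1}.
n_states, n_actions, n_obs = 2, 2, 2

# Transition probabilities: P[u][i, j] = p_ij(u) = P(x_{k+1} = j | x_k = i, u_k = u).
P = np.array([
    [[0.9, 0.1],    # action 0: the state may degrade
     [0.0, 1.0]],
    [[1.0, 0.0],    # action 1: repair, always returns to state 0
     [1.0, 0.0]],
])

# Observation kernel: B[u][i, y] = b_i(y, u) = P(y_{k+1} = y | x_{k+1} = i, u_k = u).
B = np.array([
    [[0.8, 0.2],    # after action 0, observations are noisy
     [0.3, 0.7]],
    [[0.8, 0.2],    # after action 1, same observation noise (an assumption)
     [0.3, 0.7]],
])

# Stage cost c(i, u) and terminal cost c_N(i), hypothetical numbers.
C = np.array([[0.0, 2.0],   # c(0, u0), c(0, u1)
              [5.0, 2.0]])  # c(1, u0), c(1, u1)
C_N = np.array([0.0, 5.0])

# Sanity check: each row of P[u] and B[u] is a probability distribution.
assert np.allclose(P.sum(axis=2), 1.0) and np.allclose(B.sum(axis=2), 1.0)
```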
What is the information that we have? At time 0, the only information we have is the distribution of the initial state: I_0 = pi_0, the probability distribution of the initial state. The information at any further time is I_k = (pi_0, u_0, y_1, ..., u_{k-1}, y_k). The problem then is to choose u_k = mu_k(I_k) to minimize the expected total cost, that is, the expectation of sum_{k=0}^{N-1} c_k(x_k, u_k) + c_N(x_N), conditioned on pi_0, the initial distribution, subject to the dynamics we have written, by choosing the appropriate mu's. Remember that we again have the structure I_{k+1} = (I_k, u_k, y_{k+1}).

Now, with this particular formulation, notice that this is, in spirit, the same as what we have written so far. In this formulation the belief state can be written as the probability that x_k equals i given I_k, which we denote by pi_k(i); notice that this is something we had also written out when we derived the filtering equation. So pi_k(i), or rather pi_k, is a probability distribution on X. This belief state pi_k is now going to be taken as our new state, and what we will be doing is writing a dynamic programming equation in which pi_k becomes the new state. That is what we will do next.

In order to formulate the POMDP, the partially observed MDP, in terms of the belief state, let us write out the objective, or the cost, of the partially observed MDP. Note that I have changed the notation a little: earlier pi was our notation for the policy, but now pi is the notation for the belief state, so I will use mu to denote the policy. The policy is mu = (mu_0, ..., mu_{N-1}); mu without an index denotes the policy. Then J_mu(pi_0) is just the expectation, fixing the policy mu, of the cost sum_{k=0}^{N-1} c_k(x_k, u_k) + c_N(x_N), conditioned on pi_0, that is, given the distribution of the initial state. The expectation is taken fixing mu, which means the action u_k is chosen according to the function mu_k, which in turn is a function of I_k. Now, what we need to show is that this cost can actually be written in terms of the belief state; presently the cost is written in terms of the actual state.
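Continuing the hypothetical sketch above (it reuses P, B, C, C_N, n_states, and n_obs from that block), here is one way to read the objective J_mu(pi_0): fix a policy mu, roll the chain forward from the initial distribution pi_0, accumulate the stage costs plus the terminal cost, and average over sample paths. The simple policy below, which looks at the information only through the most recent observation, is just a placeholder to make the simulation run; it is not claimed to be optimal.

```python
rng = np.random.default_rng(0)
N = 10                          # horizon (an arbitrary choice)
pi_0 = np.array([1.0, 0.0])     # initial state distribution

# A crude placeholder policy: repair (action 1) whenever the last observation was 1.
def mu_k(last_obs):
    return 1 if last_obs == 1 else 0

def rollout():
    x = rng.choice(n_states, p=pi_0)
    y, total_cost = 0, 0.0
    for k in range(N):
        u = mu_k(y)
        total_cost += C[x, u]                 # stage cost c(x_k, u_k)
        x = rng.choice(n_states, p=P[u][x])   # transition p_ij(u)
        y = rng.choice(n_obs, p=B[u][x])      # observation kernel b_i(y, u)
    return total_cost + C_N[x]                # terminal cost c_N(x_N)

# Monte Carlo estimate of J_mu(pi_0): the expected total cost given pi_0, with mu fixed.
J_hat = np.mean([rollout() for _ in range(5000)])
print("estimated J_mu(pi_0):", round(J_hat, 3))
```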
This x_k is the state of the problem, so what we need to do is express the cost in terms of pi_k(i); that is what we will do now. Start with the expectation, fixing mu, of the sum over k = 0 to N-1, and let us write out each of these terms. Notice that we will be taking actions u_k as functions of I_k. Consequently, I am going to use the following trick: instead of putting each cost in as is, I will write it by taking a conditional expectation with respect to I_k, so that each stage term appears as E[c_k(x_k, u_k) | I_k], and the terminal term likewise as E[c_N(x_N) | I_N]. Remember, we had discussed that the conditional expectation is always an unbiased estimate, which is also called the smoothing (or tower) property of expectation: the expectation of this conditional expectation is in fact the same as the expectation above. So taking the conditional expectation inside does not change anything for the problem, and I can always write the cost in this form.

What I will do now is write out these conditional expectations explicitly as sums. We can write the cost as an outer expectation of: a summation over k going from 0 to N-1 of, inside, a summation over i going from 1 to n (where n is the number of states) of c(i, u_k) times pi_k(i); plus something similar for the terminal term, a summation over i from 1 to n of c_N(i) times pi_N(i); and outside we still condition on pi_0. Now, each of these inner sums is in fact just an inner product between two vectors: there is a cost vector c_{u_k}, and there is the vector pi_k; remember, since we have finitely many states, pi_k can be thought of as a vector. So the stage term is just the inner product between c_{u_k} and pi_k, and the last term is likewise an inner product between a vector c_N and pi_N. In other words, the entire expression is now in terms of the pi_k's: the cost is the expectation, fixing mu, of sum_{k=0}^{N-1} c_{u_k}^T pi_k + c_N^T pi_N. And what is c_{u_k}? Well, c_u is simply the column vector (c(1, u), ..., c(n, u)), and c_N is itself the vector (c_N(1), ..., c_N(n)). What has happened as a result of this is that we have been able to write the cost of this problem in terms of just pi_k. The original cost, in which the state was x_k and which was expressed in terms of x_k, has now been expressed in terms of pi_k, that is, in terms of the belief state. So we realize that the cost can be expressed completely in terms of the belief state.
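Numerically, the inner-product form is immediate: with the hypothetical cost matrix C from the sketch above, the conditional expectation of the stage cost given I_k is just the dot product of the cost column for the chosen action with the belief vector.

```python
# For any belief pi_k (a probability vector over the n states) and any action u,
# E[c(x_k, u) | I_k] = sum_i c(i, u) * pi_k(i) = c_u^T pi_k.
pi_k = np.array([0.6, 0.4])      # a hypothetical belief at time k
for u in range(n_actions):
    c_u = C[:, u]                # the cost vector c_u = (c(1, u), ..., c(n, u))
    print("u =", u, "expected stage cost =", c_u @ pi_k)
```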
Now, the only thing that remains is to express the dynamics of the system in terms of the belief state; once that is done, the entire problem can be expressed in terms of the belief state. That is what we will do next.