Welcome everyone. In our previous lecture, we argued that for partially observed problems, that is, problems where we do not have perfect information about the state, a universal sufficient statistic is the conditional distribution of the state given the information. This is what we called the belief state. The reason it is universal is that it works for every stochastic control problem regardless of the exact algebraic form of the problem: whether it has linear or nonlinear dynamics, whether it has a quadratic cost or some other cost, this is a sufficient statistic that works for any of these problems. However, there were specific problem classes that admitted sharper sufficient statistics. In particular, for the linear quadratic problem, we had seen that the sufficient statistic was not the full conditional distribution but more specifically just the conditional expectation of the state given the information; the controller could be defined as a function of this particular quantity.

In order to argue this, we wrote out the formulation of a POMDP, that is, a partially observed Markov decision process. In this model the state of the system evolves on a finite set {1, ..., n}, the actions come from a finite set U, and we have an observation space Y. The state evolves according to a controlled Markov chain with transition kernel p_ij(u) = Prob(x_k = j | x_{k-1} = i, u_{k-1} = u); the observations we get are generated by a kernel b_i(y, u) = Prob(y_k = y | x_k = i, u_{k-1} = u), depending on the state and on the control applied at the previous time step; and we have a stage-wise cost. The problem is, of course, to choose actions as a function of the information at each time step, where the information I_k comprises the probability distribution of the initial state, the history of control actions, and the history of observations so far. The only difference from the formulation of the linear quadratic problem is that there we had an explicit dynamical equation rather than these kernels. But as I have said before for the case of perfect state information, these forms are interchangeable: one can always rewrite a problem whose dynamics are given in terms of kernels as one with a dynamical equation, and vice versa. The only difficulty is in deriving such an equation, but let us not get into those matters; for us these are theoretically equivalent.
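To keep these objects concrete, here is a minimal sketch of such a finite POMDP in Python. The sizes and the randomly generated kernels, as well as the names n_states, n_actions, n_obs, and pi0, are illustrative assumptions of mine; the arrays P, B, c, and c_N mirror the lecture's kernels p_ij(u) and b_i(y, u) and the cost vectors that we will use shortly.

```python
import numpy as np

n_states, n_actions, n_obs = 3, 2, 4  # illustrative sizes, not from the lecture
rng = np.random.default_rng(0)

def random_stochastic(shape):
    """Random array whose last axis sums to 1 (each row is a distribution)."""
    A = rng.random(shape)
    return A / A.sum(axis=-1, keepdims=True)

P = random_stochastic((n_actions, n_states, n_states))  # P[u][i, j] = Prob(x_k = j | x_{k-1} = i, u_{k-1} = u)
B = random_stochastic((n_actions, n_states, n_obs))     # B[u][i, y] = Prob(y_k = y | x_k = i, u_{k-1} = u)
c = rng.random((n_actions, n_states))                   # c[u][i]   = stage cost of action u in state i
c_N = rng.random(n_states)                              # terminal cost vector
pi0 = np.full(n_states, 1.0 / n_states)                 # initial belief pi_0, uniform for illustration
```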
So, the problem was to choose a policy, which is a sequence of mu's, where each mu_k is a function of I_k and yields the control action u_k, in order to minimize the cost starting from an initial state distributed according to pi_0. Remember also that we changed the notation somewhat; this is because I keep changing my reference books, and the result is a slight change of notation as well. pi, which was earlier denoting a policy, now denotes a belief: in this case the belief of the initial state, and later it will denote the belief state itself. The policy is denoted simply by mu, so mu without any index is the entire policy vector mu_0 to mu_{N-1} (I will write the time horizon as capital N to keep it distinct from the number of states n). What we argued at the end of the previous lecture was that the cost under a policy mu, starting from an initial state distributed according to pi_0, could be written in terms of the belief state. We took the total cost expression, with no dynamic programming applied as yet, and using the law of iterated expectations, or the smoothing property of expectations, we expressed the cost in terms of the pi_k's:

E[ sum_{k=0}^{N-1} g(x_k, u_k) + g_N(x_N) ] = E[ sum_{k=0}^{N-1} c_{u_k}^T pi_k + c_N^T pi_N ],

where pi_k, the belief state, is the distribution pi_k(i) = Prob(x_k = i | I_k), and the vector c_u has entries (c_u)_i = g(i, u).

In other words, for all practical purposes we now have a system whose cost is a function of pi_k, so we can take the state of this system to be pi_k. The only question left in front of us is how pi_k evolves as a function of pi_{k-1}: what is then the noise in the system, and, in short, what are the new dynamics of the system? For these new dynamics we are again back to an earlier question that we dealt with: how can we update pi_k from pi_{k-1} and any new observations that we get? If pi_{k-1} is, remember, the probability distribution of the state at time k-1 given the information up until time k-1, how do we update it to get the probability distribution of the state at time k using the information up until time k? If you recall, this is nothing but the filtering problem: the problem of filtering was precisely the problem of recursively computing these belief states. So, what I will write out now is an equation that will transform pi_{k-1} and the new information we get into pi_k, the belief state at time k. It comes simply from the filtering equation that we had written out earlier: we had written something like pi_k = T(pi_{k-1}, y_k), a function T of pi_{k-1} and the new observation y_k. At that time we were not explicitly writing out the control actions, because the control actions, remember, were just given to us; but now we choose that information also, so let us write it explicitly:

pi_k = T(pi_{k-1}, y_k, u_{k-1}).

It is the same algorithm as before; the only difference is the explicit appearance of the term u_{k-1} in all of these calculations. So, what is T? Well, it turns out that T(pi, y, u) can be written in the following form.
It can be written as

T(pi, y, u) = B_y(u) P^T(u) pi / sigma(pi, y, u).

Now, I will explain what all of these terms are. B_y(u) is a diagonal matrix. It comprises the probability kernels we had earlier, the b_i(y, u) for every state i and observation y; those are the ones we stack up along the diagonal, one for every value of the state. We have n states, so

B_y(u) = diag( b_1(y, u), b_2(y, u), ..., b_n(y, u) ),

with zeros everywhere off the diagonal. Remember this is small n, the number of states, not the time horizon N. The matrix P(u) is given even more easily: it is the matrix of the transition probabilities from state i to state j for a given action u. For every u you can think of these as forming a matrix; it will be an n x n matrix because there are n states. So P(u) is the matrix with entries p_ij(u), where p_ij(u) sits in the i-th row and j-th column, and it is its transpose that multiplies pi. Now, pi itself can be thought of as a vector, because pi, remember, is a probability distribution on the states, and the state space X is just {1, ..., n}. So pi is actually a vector in R^n: I can just stack up pi(1), pi(2), ..., pi(n) as a vector, and this gives me a probability distribution on X. But not every vector is a probability distribution: pi must be a vector in R^n that is componentwise nonnegative and whose components sum to 1. The way we write this is

Delta = { z in R^n : z >= 0, 1^T z = 1 }.

This bold 1 here is simply a vector with all entries equal to 1, of the appropriate dimension; we do not need to write the exact dimension because it is understood from the context, and since z itself is a vector of dimension n, 1 is also a vector in R^n. So pi is a vector from this set. In other words, because in our belief state formulation we are going to take pi as the state of the system, this set Delta is in fact going to be the state space of the belief state formulation; I will explain more about this in a moment. So, that is what is in the numerator of this expression for T. What is the denominator? The denominator is the expression sigma. Let me write out what sigma(pi, y, u) is; it is in fact rather easily given. As you would recall from the filtering equation, it is simply the numerator integrated with respect to the state itself, and because in this case we are talking of finitely many states, the integral is simply a summation.
So,

sigma(pi, y, u) = 1^T B_y(u) P^T(u) pi.

To summarize, we now have that pi_k can be given in terms of pi_{k-1}, y_k, and u_{k-1}:

pi_k = T(pi_{k-1}, y_k, u_{k-1}),

where the right-hand side is the function T written above. Now let us take a step back and try to understand what we have been able to accomplish. The main thing we have been able to do is this: in the previous lecture we wrote out the cost in terms of these pi_k's, and what we have done now is write out what looks like a dynamical equation. It looks as if pi at time k is a function of pi at time k-1, some noise, which is the observation y_k, and the action at time k-1. This is effectively the state-space formulation: it is as if the state is given by an explicit dynamical equation, with the state at time k a function of the state at time k-1, the action at time k-1, and noise; you can either index the noise at the next time step or simply say that the noise comes after the action, which is what happens here. In any case, this is what we get as our new dynamics equation. We therefore have a cost and dynamics, and we basically now need to think of the state not as a probability distribution but just as some vector, a vector that lies in the set Delta above. Delta is therefore our new state space, the state space of the belief state formulation; our dynamical equation is the boxed equation pi_k = T(pi_{k-1}, y_k, u_{k-1}); and the cost is the expression from before. All of this put together gives us a stochastic control problem which now lives in the space of the pi_k's, this space of vectors. Why is it a space of vectors? Because the initial problem had finitely many states, and probability distributions on those states are vectors. So, the belief state formulation has infinitely many states, because there are infinitely many probability distributions, but a finite-dimensional state space, since the dimension of that space is finite. The state transitions through these dynamics, and we have a stage-wise cost given in terms of these vectors. In other words, we can essentially ignore the fact that this was even a partially observed problem to begin with: it is simply a problem with a vector state space, whose vectors just happen to be probability distributions on a certain underlying state space, and we can compute the optimal policy in terms of these states, which are the belief states.
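The update T and the normalizer sigma translate almost literally into code. Continuing the hypothetical sketch from earlier (same arrays P and B), a minimal version could look like this:

```python
def sigma(pi, y, u):
    """sigma(pi, y, u) = 1^T B_y(u) P(u)^T pi: probability of observing y."""
    return float(B[u][:, y] @ (P[u].T @ pi))

def T(pi, y, u):
    """Filtering update: T(pi, y, u) = B_y(u) P(u)^T pi / sigma(pi, y, u)."""
    unnormalized = B[u][:, y] * (P[u].T @ pi)  # diagonal B_y(u) acts entrywise
    return unnormalized / unnormalized.sum()   # the sum is exactly sigma(pi, y, u)
```

Note that the update is undefined exactly when sigma(pi, y, u) = 0, that is, when the observation y is impossible under the current belief and action; such observations simply never occur, so this is not a practical obstruction.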
So, what has happened therefore is that the problem of incomplete or imperfect state information, of partial observation on a finite state space, has been reduced to a problem of perfect information on a different state space, the space of belief states, which is finite-dimensional but infinite. And on that space, because it is now a perfectly observed problem, we can simply apply the Bellman equation and the dynamic programming algorithm that we have developed for the case when the state is perfectly known. This therefore gives us a way of solving the partially observed problem through these means, essentially through the reduction to the belief state. Let me make a note of these observations. Thus the POMDP has been transformed into an MDP in which the state is the belief state pi_k; the state space is the set Delta of all such probability vectors; the observations are perfect (we now have perfect observations because we always know the probability distribution; it is exactly what we are computing); the state dynamics are pi_k = T(pi_{k-1}, y_k, u_{k-1}); the cost is sum_{k=0}^{N-1} c_{u_k}^T pi_k + c_N^T pi_N; and the initial state is pi_0. This is what we have been able to accomplish. You can see this is a very powerful reduction, because we have been able to reduce every problem with partially observed state information to a problem where the state information is perfectly known. Of course, this comes at the expense of some increase in complexity. For instance, we have to look at a much richer state space now, because the state space, which was earlier finite, has become infinite, as seen here. We also lose some amount of structure related to the cost, because the action now appears in a somewhat less explicit manner. But nonetheless, the issues related to not having information have been worked around, and this is something that we can celebrate.
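To make the reduced MDP tangible, here is a sketch of one step of its dynamics, continuing the same hypothetical model: the observation plays the role of the noise, arriving after the action with probability sigma(pi, y, u), and the stage cost is c_u^T pi.

```python
def step(pi, u):
    """One step of the belief-state MDP: returns (next belief, observation, cost)."""
    probs = np.array([sigma(pi, y, u) for y in range(n_obs)])  # distribution of y_k; sums to 1
    y = rng.choice(n_obs, p=probs)                             # the noise comes after the action
    return T(pi, y, u), y, float(c[u] @ pi)
```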
So, let me write out therefore the theorem associated with this, on a fresh page. The theorem is that the minimum expected cost is achieved by a policy mu* = (mu_0*, ..., mu_{N-1}*) of the form u_k = mu_k*(pi_k), where pi_k is now the belief state. In fact, the policy can be found through the backward induction of dynamic programming: mu* is the solution of the following DP algorithm. You initialize

J_N(pi) = c_N^T pi,

and then for k = N-1 down to 0 you just set

J_k(pi) = min_{u in U} [ c_u^T pi + sum_{y in Y} sigma(pi, y, u) J_{k+1}( T(pi, y, u) ) ].

All I am doing in the summation here is taking the expectation of the cost to go. J_{k+1} is evaluated at the next state; what is the next state? When the current state is pi, we take an action u, and we get an observation y, the next state is T(pi, y, u), and it comes up with the probability of seeing that particular observation, which is given by sigma(pi, y, u), where sigma, remember, is the term we wrote out earlier. The optimal action is the argmin of the same expression:

mu_k*(pi) = argmin_{u in U} [ c_u^T pi + sum_{y in Y} sigma(pi, y, u) J_{k+1}( T(pi, y, u) ) ].

And when you start off from any initial belief pi, the cost of this policy satisfies J_{mu*}(pi) = J_0(pi), and this has to be written, remember, for all pi; this means that J_0, the cost at the last iteration of the algorithm, is the optimal cost. This, therefore, is the dynamic programming equation for any partially observed problem.

So, with this we have been able to conclude our study of partially observed problems. Now, one thing you might wonder is: is this then the end of the complexities that arise in stochastic control problems? Is there anything more that one needs to know apart from what we have already studied? It turns out that I have surprises for you. There are many more complexities that arise, in fact mind-boggling levels of complexity, because of many subtle assumptions that we have made so far. We will look into all of these in the next lecture and in the remaining part of this course.
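Finally, as a rough illustration of this backward induction, here is a sketch that continues the hypothetical model from the earlier snippets. Since the belief space Delta is infinite, the sketch approximates it by a finite grid and projects each updated belief to its nearest grid point; this crude discretization is purely my own illustrative device, not the lecture's method (exact finite-horizon algorithms instead exploit the fact that each J_k is piecewise linear, being a minimum of finitely many linear functions of pi).

```python
from itertools import product

N = 5  # time horizon, an illustrative choice (capital N, unlike the n states)

def belief_grid(levels=10):
    """All beliefs whose entries are integer multiples of 1/levels."""
    return [np.array(q, dtype=float) / levels
            for q in product(range(levels + 1), repeat=n_states)
            if sum(q) == levels]

grid = belief_grid()

def nearest(pi):
    """Index of the grid belief closest to pi (a crude projection)."""
    return min(range(len(grid)), key=lambda i: np.linalg.norm(grid[i] - pi))

J = [float(c_N @ pi) for pi in grid]        # J_N(pi) = c_N^T pi
for k in range(N - 1, -1, -1):              # backward induction: k = N-1, ..., 0
    J = [min(float(c[u] @ pi)
             + sum(sigma(pi, y, u) * J[nearest(T(pi, y, u))]
                   for y in range(n_obs) if sigma(pi, y, u) > 0)
             for u in range(n_actions))
         for pi in grid]

print("approximate optimal cost J_0(pi0):", J[nearest(pi0)])
```

Even this toy version makes the increase in complexity noted above tangible: the number of grid points grows combinatorially with the number of underlying states.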