Welcome everyone. In the previous lecture we began our study of a problem class called stochastic control problems with imperfect state information, also known as partially observed Markov decision processes. What we did there was formulate the problem by laying out its various elements and the assumptions underlying it. The first element was the system dynamics: the state of the system at time k is denoted x_k, and the state at time k+1 is given by x_{k+1} = f_k(x_k, u_k, w_k), where u_k is the control action you have to choose and w_k is noise. Now, in comparison with the problem with perfect state information, this problem has a new element: the observation. The observation at any time is denoted by the letter z. The observation at time 0 is z_0 = h_0(x_0, v_0), where v_0 is the observation noise. The w's are the noise in the system evolution and are called the system noise; the v's are called the observation noise. So the observation at time 0 is a function of x_0 and noise; in other words, we get a noisy and limited version of the initial state as our observation at time 0. The observation at any time k is z_k = h_k(x_k, u_{k-1}, v_k): a function of the state at that time, the action you took at the previous time, and the noise present in your observation channel. Once again we want to minimize the total cost, which comprises a terminal cost and stage-wise costs, as before. But the important point is that we are no longer restricting ourselves to Markov-type policies: the action has to be chosen as a function of the entire information we have at that time.
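To make these elements concrete, here is a minimal sketch in Python. The specific scalar dynamics, observation channel, and noise levels are my own toy choices for illustration, not from the lecture:

```python
import random

# A hypothetical model of the form in the lecture (an assumption, not the
# lecture's example): scalar dynamics x_{k+1} = f_k(x_k, u_k, w_k) and
# observation z_k = h_k(x_k, u_{k-1}, v_k), with additive Gaussian noise.
def f(x, u, w):
    return 0.9 * x + u + w          # system evolution driven by system noise w

def h(x, u_prev, v):
    return x + v                    # a noisy, limited view of the state

rng = random.Random(0)
x0 = 1.0                            # true initial state, hidden from the controller
z0 = h(x0, None, rng.gauss(0, 0.1)) # all we see at time 0 is z_0 = h_0(x_0, v_0)
x1 = f(x0, -z0, rng.gauss(0, 0.1))  # state evolves under action u_0; still hidden
z1 = h(x1, -z0, rng.gauss(0, 0.1))  # z_1 depends on x_1, the previous action u_0, and v_1
```

The controller never touches x0 or x1 directly; everything it can act on must come through z0 and z1.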
And the information that we have at any time k consists of all the observations made up to that time, z_0 to z_k, and all the actions taken up to that time, u_0 to u_{k-1}. So u_k is chosen as a function of i_k, where i_k = (z_0, ..., z_k, u_0, ..., u_{k-1}) is the information vector: it collects all the information we have at that time. The policy comprises the n functions mu_0 to mu_{n-1} that minimize the total cost, with u_k chosen as mu_k(i_k), and i_k populated by the observations made and the actions taken so far. The way the system proceeds is this: at any time k you have the system state x_k; it produces the observation z_k, based on the action taken previously and the observation noise; this observation, together with the previous action, is added to the information you had at the previous time step, so that i_{k-1} combined with u_{k-1} and z_k gives i_k, the information at time k. Based on this information you choose an action u_k; after the action is chosen, the noise w_k gets realized; then, based on the noise, the chosen action, and the current state, the next state comes about, you move to time step k+1, and the system continues to evolve. This is clearly a different type of problem. What we will show today is that, despite the apparent differences between the problem with imperfect state information and the one with perfect state information, there is actually an underlying similarity between the two. In fact, you can reduce all these problems with imperfect state information to problems with perfect state information, provided one is willing to accept a certain mathematical complexity. This reduction is what I will show you today.
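The order of events just described can be sketched as a simulation loop. Everything model-specific here (the scalar dynamics, the noise levels, the naive policy) is a hypothetical stand-in; the point is only the bookkeeping of the information vector:

```python
import random

# A sketch of one episode under an assumed toy model: at each step we observe,
# extend the information vector i_k, choose u_k = mu_k(i_k) from i_k alone,
# then the system noise w_k is realized and the hidden state moves on.
def simulate(n, policy, seed=0):
    rng = random.Random(seed)
    f = lambda x, u, w: 0.9 * x + u + w     # toy dynamics (an assumption)
    h = lambda x, v: x + v                  # toy observation channel (an assumption)
    x = rng.gauss(1.0, 0.1)                 # hidden initial state x_0
    info = (h(x, rng.gauss(0, 0.1)),)       # i_0 = (z_0)
    for k in range(n):
        u = policy(info)                    # u_k = mu_k(i_k): uses i_k only, never x
        x = f(x, u, rng.gauss(0, 0.1))      # w_k realized, state moves to x_{k+1}
        z = h(x, rng.gauss(0, 0.1))         # z_{k+1} observed
        info = info + (u, z)                # i_{k+1} = (i_k, u_k, z_{k+1})
    return info

# a naive (hypothetical) policy: try to cancel the last observation
info = simulate(3, policy=lambda i: -i[-1])
```

After n = 3 steps the information vector holds z_0, u_0, z_1, u_1, z_2, u_2, z_3, i.e. 2n + 1 entries; its growth with k is exactly the complexity issue we will return to.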
This reduction is, in some ways, at the heart of how we think about problems where we do not have full information: we treat all our information as an underlying state, and we use that state to model the problem. So here, now, is the reduction. It proceeds in the following way. Notice that we always have i_{k+1} = (i_k, u_k, z_{k+1}). That is, the information at time k+1 comprises all the information you had at time k, the action you chose at time k, and the new observation z_{k+1} that you got at time k+1. This can be written for k = 0, 1, all the way to n-2. We also have, at time 0, that the information is simply i_0 = z_0. Now, if you look at this equation carefully, you can read it as follows: there is a "state" (and I will say state in quotes), which is i_k; there is an action u_k that you take based on that state; and there is a disturbance that comes externally, denoted z_{k+1}. So what this equation is saying, in this analogy, is that the state at time k+1 is given as a function of the state at time k, the action you chose at time k, and an external disturbance z_{k+1}, where the action at time k can be chosen as a function of the state. So let me write this down for you.
So, we can think of i_k as a new state at time k; u_k is still the action at time k; and z_{k+1} is the disturbance at time k. Once you make this observation, you can see that we again have the same kind of equation as before: the new state at time k+1 equals a function, let me write it as f-bar_k, of the new state at time k, the action at time k, and the disturbance at time k. Now, what is the characteristic of this disturbance? Observe that this disturbance is really, for us, the observation at time k+1. Given the information you have at time k and the action you choose at time k, it is independent of all the previous disturbances that have occurred: P(z_{k+1} | i_k, u_k) = P(z_{k+1} | i_k, u_k, z_0, ..., z_k). Here z_0 to z_k are all the previous disturbances, in this new terminology, and remember they are already included in the definition of i_k. So z_{k+1} has the property that, given i_k and u_k, it is independent of the previous disturbances. Now, although we did not explicitly allow for this kind of disturbance when we studied the problem with perfect state information, it turns out that the dynamic programming algorithm can also be applied when the disturbances are of this somewhat more general type. We had assumed that all the disturbances in our perfect-state problem were independent; here they are not independent, but each is independent of the previous disturbances given the state and the action, and that is sufficient for us to apply dynamic programming.
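In code form the "new" dynamics are almost trivial: if we represent information vectors as tuples, the function f-bar_k just concatenates, with z_{k+1} playing the role of the disturbance. A sketch (the entries are placeholder strings, just to show the bookkeeping):

```python
# The "new state" dynamics i_{k+1} = fbar_k(i_k, u_k, z_{k+1}), with tuples
# standing in for information vectors; z_{k+1} is the external disturbance.
def f_bar(i_k, u_k, z_next):
    return i_k + (u_k, z_next)

i0 = ("z0",)                       # i_0 = (z_0)
i1 = f_bar(i0, "u0", "z1")         # i_1 = (z_0, u_0, z_1)
i2 = f_bar(i1, "u1", "z2")         # i_2 = (z_0, u_0, z_1, u_1, z_2)
```

The price of the reduction is visible immediately: unlike x_k, the new state grows by two components at every step.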
So, in other words, dynamic programming can also be applied when the disturbances are of this kind, and that is what we will do. Now, in order to apply dynamic programming to this problem we need a few more steps, and they are not that hard: we first need to write the cost in the form of a typical dynamic programming problem. So notice the following. The stage-wise cost, E[g_k(x_k, u_k, w_k)], can in fact be written as an iterated expectation: it is the expectation of E[g_k(x_k, u_k, w_k) | i_k, u_k], where the inner expectation is taken with respect to the things we do not know once we condition on i_k and u_k. Since we are conditioning on i_k and u_k, we have knowledge of those two, but we do not have knowledge of x_k and w_k; so the inner expectation is taken with respect to x_k and w_k. Let us now write this inner term as a new function. What is it a function of? Well, it is obvious: since you are conditioning on i_k and u_k, the inner term becomes a function of exactly these two variables. Let us call it g-tilde_k(i_k, u_k) = E_{x_k, w_k}[g_k(x_k, u_k, w_k) | i_k, u_k]. This is the function g-tilde_k.
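For a finite model, g-tilde_k can be computed by plain enumeration. Here is a sketch under made-up ingredients (my assumptions, not the lecture's): x_k takes values in {0, 1}, w_k is -1 or +1 with equal probability, g_k(x, u, w) = (x - u)^2 + w^2, and the conditional pmf of x_k given i_k is taken as given — computing that pmf from i_k is itself a filtering problem, not shown here:

```python
# Stage cost g_k(x_k, u_k, w_k) for a hypothetical finite example.
def g(x, u, w):
    return (x - u) ** 2 + w ** 2

# g~_k(i_k, u_k) = E_{x_k, w_k}[ g_k(x_k, u_k, w_k) | i_k, u_k ], computed by
# enumerating x_k against its conditional pmf and w_k against its (independent)
# two-point distribution.
def g_tilde(p_x_given_i, u):
    total = 0.0
    for x, px in p_x_given_i.items():      # x_k ~ p(x_k | i_k)
        for w in (-1.0, 1.0):              # w_k = +/-1 with probability 1/2
            total += px * 0.5 * g(x, u, w)
    return total

# if i_k makes the two states equally likely, the reduced cost of u = 0.5 is:
val = g_tilde({0: 0.5, 1: 0.5}, 0.5)       # 0.5*(0.25+1) + 0.5*(0.25+1) = 1.25
```

Note that g_tilde depends on i_k only through the conditional distribution of x_k, a point that becomes important when we discuss sufficient statistics.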
Now, because the expectation on the left can always be written as this iterated expectation on the right, we can equivalently write the cost as the expectation of the sum of g-tilde_k(i_k, u_k) for k = 0 to n-1, plus the last term, which was g_n(x_n). Remember, we can absorb that term too into our equation here, by writing it as the expectation of E[g_n(x_n) | i_{n-1}, u_{n-1}]. At time n we do not know the state x_n, but since we know the information i_{n-1} and the action u_{n-1}, we can condition on them, take the conditional expectation of this terminal term, and then take the outer expectation; that reduces eventually to the original cost function we had. So this conditional expectation then becomes my new terminal cost. Thanks to this, we can now apply the dynamic programming algorithm, or the dynamic programming logic, to this sort of problem as well. So how is the dynamic programming algorithm applied? Let us write that down. We start from time n-1. At time n-1 we have the information i_{n-1}, and there are two costs to account for: the stage-wise cost at time n-1 and the terminal cost. What we will write out and minimize is the sum of both of these costs.
So we will be minimizing, over u_{n-1} in U_{n-1}, the expectation of g_n — remember, g_n is the terminal cost we had, a function of x_n, and x_n itself can be written in terms of x_{n-1} using the dynamical equations of the system, so I replace x_n with f_{n-1}(x_{n-1}, u_{n-1}, w_{n-1}); that takes care of the terminal cost — plus, of course, the stage-wise cost g_{n-1}(x_{n-1}, u_{n-1}, w_{n-1}). But remember, earlier we were simply taking an expectation of this over the noise, and we knew the state, so the state was a parameter; in this case we do not know the state, so the state is also a random variable. What we are really taking here is not a plain expectation but a conditional expectation, conditioning on i_{n-1} and u_{n-1}. So our cost-to-go at time n-1 is

J_{n-1}(i_{n-1}) = min over u_{n-1} in U_{n-1} of E[ g_n(f_{n-1}(x_{n-1}, u_{n-1}, w_{n-1})) + g_{n-1}(x_{n-1}, u_{n-1}, w_{n-1}) | i_{n-1}, u_{n-1} ].

It is a function of i_{n-1}, obtained by minimizing over the action u_{n-1} the conditional expectation of the terminal cost plus the stage-wise cost given i_{n-1} and u_{n-1}. We have explicitly conditioned on u_{n-1} here just to make sure you are clear that u_{n-1} is a parameter you have conditioned on, not a random variable. The expectation, remember, is over everything that is random after this conditioning, which is whatever is left: x_{n-1} and w_{n-1}. That is what we have written for time n-1.
Now, you might wonder why we have not written anything for k = n, since in the usual dynamic programming iteration we start from k = n and work backwards. There are two reasons. Firstly, it need not be written: you could always absorb J_n as the terminal term at time n-1 and start the iteration from time n-1; you could have done this even in the problem with perfect state information, so that step was never really necessary. Secondly, since we do not know the state at time n, we cannot set J_n as a function J_n(x_n); directly defining the last value function to equal the terminal cost is something we cannot do in a problem with imperfect state information. Consequently, what we do is simply absorb the terminal cost into the cost at time step n-1 and start the dynamic programming iteration from time step n-1. At any previous time, what we do is the following:

J_k(i_k) = min over u_k in U_k of E_{x_k, w_k, z_{k+1}}[ g_k(x_k, u_k, w_k) + J_{k+1}(i_k, u_k, z_{k+1}) | i_k, u_k ],

where the expectation is with respect to x_k, w_k, and z_{k+1}, and, just as in the case with perfect information, we now have the cost-to-go J_{k+1}, evaluated at the information vector (i_k, u_k, z_{k+1}), all conditioned on i_k and u_k. This has to be written for all k = 0 to n-2. What, then, is the optimal cost? The optimal cost is J*, and remember, we do not know the state.
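To see the whole recursion in one place, here is a sketch for a tiny made-up problem; none of these model choices come from the lecture, they only keep every information vector enumerable. Assumptions: two states, x_0 uniform on {0, 1}; deterministic dynamics x_{k+1} = x_k XOR u_k; the observation z_k equals x_k with probability 0.8; stage cost 0.1 per unit of control; terminal cost g_n(x_n) = x_n; horizon n = 2:

```python
P_CORRECT = 0.8     # P(z_k = x_k): quality of the (assumed) observation channel
HORIZON = 2         # n = 2 stages

def p_obs(z, x):
    # observation channel: P(z_k = z | x_k = x)
    return P_CORRECT if z == x else 1.0 - P_CORRECT

def belief(info):
    # exact filter p(x_k | i_k) for info = (z_0, u_0, z_1, ..., z_k)
    zs, us = info[0::2], info[1::2]
    p = {x: 0.5 * p_obs(zs[0], x) for x in (0, 1)}      # prior times P(z_0 | x_0)
    for u, z in zip(us, zs[1:]):
        p = {x ^ u: p[x] for x in (0, 1)}               # push through dynamics
        p = {x: p[x] * p_obs(z, x) for x in (0, 1)}     # Bayes update with z
    s = p[0] + p[1]
    return {x: p[x] / s for x in (0, 1)}

def J(k, info):
    # J_k(i_k) = min_u E[ g_k + J_{k+1}(i_k, u_k, z_{k+1}) | i_k, u_k ],
    # with the terminal cost folded into the last stage k = n - 1
    b = belief(info)
    best = float("inf")
    for u in (0, 1):
        cost = 0.1 * u                                   # stage cost g_k
        b_next = {x ^ u: b[x] for x in (0, 1)}           # p(x_{k+1} | i_k, u_k)
        if k == HORIZON - 1:
            cost += b_next[1]                            # E[g_n(x_n)] = P(x_n = 1)
        else:
            for z in (0, 1):                             # enumerate disturbances z_{k+1}
                pz = sum(b_next[x] * p_obs(z, x) for x in (0, 1))
                cost += pz * J(k + 1, info + (u, z))
        best = min(best, cost)
    return best

# J* = E_{z_0}[ J_0(z_0) ]; here P(z_0 = 0) = P(z_0 = 1) = 0.5 by symmetry
J_star = 0.5 * J(0, (0,)) + 0.5 * J(0, (1,))
```

Even in this toy problem the cost of the reduction shows: J_0 must be evaluated at every possible i_0, J_1 at every possible (z_0, u_0, z_1), and so on, with the number of information vectors growing exponentially in the horizon.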
So, we have to take an expectation: when we talk of the optimal cost, it is an expectation even with respect to the information we may have at that time, since, because of the channel noise, we do not know in advance which observation we would get. The optimal cost is given by J* = E_{z_0}[J_0(i_0)]. But remember what i_0 is: i_0 is nothing but z_0. So, in other words, this quantity is simply J* = E_{z_0}[J_0(z_0)]. Notice how the iteration proceeds. You first solve the optimization at time n-1, and this optimization is to be done for every value of i_{n-1}: for every possible value of the information you could have at time n-1. That gives you J_{n-1} as a function of i_{n-1}; it also gives you, as a corollary, u_{n-1} as a function of i_{n-1}. More generally, these minimizations also yield mu*_k(i_k): you get the action at time k as a function of the information at time k, and as a result you also get the decision rule, or policy, pi* = (mu_0*, ..., mu_{n-1}*), that results from the iteration — the optimal policy. So how do we proceed with this algorithm? You do the step for n-1, which means that for every value of i_{n-1} you compute this minimum; remember, what that entails is solving an optimization problem in which you have to compute a conditional expectation given i_{n-1}.
Then, once this is computed for n-1, it gets substituted into the equation for k = n-2; similarly, that gives you J_{n-2}(i_{n-2}), which is then substituted at n-3, and so on, until eventually you come to k = 0. At k = 0 you get J_0(z_0), you take the expectation of that, and that gives you J*. This, therefore, is the DP algorithm. In the next class we will think a little bit more about the complexity of this algorithm and also apply it to an actual problem.