Hello everyone. In the previous lecture we saw that the use of history and the use of randomization do not help in obtaining a better reward in a Markov decision process. Our main result was the one written here: the optimal reward over all deterministic Markov policies equals the optimal reward you would get by optimizing over all history-dependent randomized policies. This can be expressed in one simple line by saying that two concepts are equivalent: a Markov optimal policy and an optimal Markov policy. An optimal Markov policy is simply the policy that is optimal over the set of all Markov policies; that is the left-hand side here. The equality is telling us that an optimal Markov policy also happens to be an optimal policy which is Markov. In other words, if you look at the right-hand side, among the optimal policies there is one which is Markov. That is, looking for a Markov policy amongst the set of policies that are optimal over all history-dependent randomized policies, a Markov optimal policy, is actually equivalent to looking for an optimal policy over the set of Markov policies, an optimal Markov policy. This is one simple way by which you can remember this basic result of Markov decision theory.

Now we will get to a new class of problems, and this class is the main reason this course is interesting: problems with imperfect information. So far in the course we have never had to care about the issue of information. Whatever information we had came to us through the history that we knew about the system at any time. And the history, you will remember, comprised all the states and all the actions that had happened so far, together with the present state: the history at any time t was h_t = (s_0, a_0, s_1, a_1, ..., s_{t-1}, a_{t-1}, s_t), the state and action at time 0, the state and action at time 1, and so on, up to the state at time t.

Now we will have a more complicated situation: we will not have access to the state at any given time. We will, however, still be allowed to remember the entire history. Earlier we were also allowed to remember the entire history, but we did not need to, because all that mattered was knowledge of the current state. In this case, since we do not actually know the state, and the history does not give us knowledge of the state either, the problem will require us to remember the entire history up until any time t. The dilemma we are faced with is that, because we do not know the state at any time t, there is nothing in the history that we can discard as being no longer useful.
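In symbols, one compact way to write this result is the following (the shorthand here is mine, introduced for compactness, not notation fixed in the lecture):

```latex
% Shorthand (introduced here for compactness, not from the lecture):
%   \Pi^{MD} : set of deterministic Markov policies
%   \Pi^{HR} : set of history-dependent randomized policies
%   J(\pi)   : expected total reward under policy \pi
\sup_{\pi \in \Pi^{MD}} J(\pi) \;=\; \sup_{\pi \in \Pi^{HR}} J(\pi)
```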
We want to keep everything we have observed so far, because as more observations emerge, our estimate, or our belief, about the state improves. As a result, knowledge of the history now becomes paramount. Earlier, knowledge of the history was optional: all you needed to do was remember the state. Now, knowledge of the history is critical.

So let us write out what this problem class is. It is called problems with imperfect state information. I will be writing this in the stochastic control language, where it is known as stochastic control with imperfect state information. In the Markov decision processes world it is known as a partially observed Markov decision process, also sometimes called a POMDP. I will write out the model from the stochastic control standpoint: we will have a state-space model with a state equation, and we will also have additional elements that capture the information we receive. I will stick to that model for the time being; if necessary, we will go back to the POMDP model later for clarity or for examples.

The model is as follows. Once again the system has a state, and the state evolves as x_{k+1} = f_k(x_k, u_k, w_k), where x_k is the state at time k, u_k is the action, and w_k is the noise. We do not, however, know the state at time k; what we get are observations of the state through an observation medium, or observation channel. At time 0 we get z_0 = h_0(x_0, v_0): what we see at time 0 is the observation z_0, not x_0. At any future time, z_k = h_k(x_k, u_{k-1}, v_k): the observation z_k that we get at time k is a function of the state x_k at that time, the action u_{k-1} that you took previously, and noise v_k. The reason there is no dependence on the action in the equation for z_0 is that the observation you get at time 0 can be a function only of the state at time 0 and the noise at time 0; the action begins to make an appearance only at future times, so the second equation applies for k = 1 to N-1. The w's and the v's together comprise the noise in the system.

Now, as before, we need to choose actions to minimize a certain cost, so let me write out the cost function. We again have an expected cost that we want to minimize, with a terminal cost g_N(x_N) plus a sum of stage-wise costs: E[ g_N(x_N) + sum from k = 0 to N-1 of g_k(x_k, u_k, w_k) ].
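To make the moving parts concrete, here is a minimal simulation sketch in Python. The particular choices of f, h, g and the Gaussian noise are placeholder assumptions for illustration, not part of the lecture's model; only the structure (hidden state, observation channel, stage plus terminal cost) follows what was written above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5  # horizon

# Placeholder model (illustrative assumptions): scalar linear dynamics with
# additive Gaussian noise. h below also drops the u_{k-1} argument for brevity.
def f(x, u, w):        # state equation: x_{k+1} = f_k(x_k, u_k, w_k)
    return 0.9 * x + u + w

def h(x, v):           # observation channel: z_k = h_k(x_k, u_{k-1}, v_k)
    return x + v

def g(x, u, w):        # stage-wise cost g_k(x_k, u_k, w_k)
    return x**2 + 0.1 * u**2

def g_terminal(x):     # terminal cost g_N(x_N)
    return x**2

x = rng.normal()                      # hidden initial state x_0
observations = [h(x, rng.normal())]   # z_0 = h_0(x_0, v_0): all we ever see
cost = 0.0
for k in range(N):
    u = 0.0                           # actions held fixed here; policies come next
    w = rng.normal()
    cost += g(x, u, w)                # the cost runs on the true, hidden state
    x = f(x, u, w)                    # hidden state transition
    if k + 1 < N:
        observations.append(h(x, rng.normal()))  # z_{k+1}
cost += g_terminal(x)
print("observations seen:", observations)
print("realized cost:", cost)
```

The point of the sketch is the asymmetry: the cost and the dynamics run on x_k, while the only thing the controller ever sees is the list of z_k's.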
Now, as before, we need to choose not a sequence of actions but a sequence of strategies: we need to find a policy that we will apply in order to minimize this cost. Because this is, after all, a stochastic problem, we cannot really talk of which action we will choose at each time; rather, we need to talk of our policy, the function that will prescribe the action. But here is where the question becomes more interesting. We do not know the state at any time, so the question we need to ask is: what is it that we actually know at that time? In other words, what is going to be our information, on the basis of which we take these actions? The actions u_k are to be chosen based on a policy, but what should that policy take as its argument, as its input? This is the question that needs to be answered.

So here is where we start making assumptions about the information that we have in a problem. We will assume that the information we have at time k, a new quantity that now comes up, denoted I_k, comprises all the observations that have been made so far and all the actions that have been taken so far: I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}) for k = 1 to N-1, and I_0 = z_0, meaning at time 0 you have only the initial observation that was received.

What we need to do in this problem is choose our action u_k as a function of the information that we have. The problem statement is to choose u_k as a function of I_k; in other words, the problem is to find decision rules mu_k such that mu_k(I_k) = u_k. The tuple pi = (mu_0, ..., mu_{N-1}) is now our policy, and the problem is to minimize the cost over all policies which map the information known at each time to an action. Writing the actions out explicitly, we want to minimize E[ g_N(x_N) + sum from k = 0 to N-1 of g_k(x_k, mu_k(I_k), w_k) ] over all pi = (mu_0, ..., mu_{N-1}). The problem is that of finding these policies which map information vectors to actions.

Once we apply a particular policy, the action gets chosen based on the information you have at that time, and the state also evolves based on this policy. The state equation we wrote earlier now changes in form: under the policy, the state evolves as x_{k+1} = f_k(x_k, mu_k(I_k), w_k). And remember, the information we were getting was also a function of the action being chosen, so z_k too would, in general, depend on the policy.
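Gathered in one display, the problem reads as follows; every symbol is as defined above, only the layout is new:

```latex
\min_{\pi = (\mu_0, \dots, \mu_{N-1})}\;
\mathbb{E}\!\left[\, g_N(x_N) \;+\; \sum_{k=0}^{N-1} g_k\bigl(x_k,\, \mu_k(I_k),\, w_k\bigr) \right]
\quad \text{where} \quad
I_0 = z_0, \qquad
I_k = (z_0, \dots, z_k,\; u_0, \dots, u_{k-1}), \ k = 1, \dots, N-1.
```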
So here you have z_k = h_k(x_k, mu_{k-1}(I_{k-1}), v_k), with mu_{k-1} because it is the action of the previous time step that appears here; and similarly z_0 is just h_0(x_0, v_0). This, therefore, is our problem. The expectation in the cost expression is taken over all the noise in the problem, and the sources of noise, remember, are the w_k's and the v_k's.

Now let us dwell a little on this problem. First, let us understand the sequence in which things happen. Suppose you are in a state x_k at some time k. You do not actually know x_k; what you get is an observation z_k. The state comes to you in a corrupted form through an observation channel. This observation, remember, was a function h_k of x_k, u_{k-1}, and v_k: it depends on the state at time k, the action u_{k-1} that you took previously, and the noise. So the observation is obtained before you take the action at time k.

The sequence of events is as follows. When you are at time k, there is a state x_k; you have previously taken an action u_{k-1}; and these result in an observation z_k. This observation goes and adds to your kitty of observations. The information you had at the previous time step was I_{k-1} = (z_0, ..., z_{k-1}, u_0, ..., u_{k-2}). But at time k-1 you also took the action u_{k-1}, so knowledge of this action is now available to you, and in addition you now have the observation z_k that has been obtained at time k. In other words, I_k = I_{k-1} ∪ {u_{k-1}} ∪ {z_k} = (z_0, ..., z_k, u_0, ..., u_{k-1}). This is your new information. The observation adds to your information, and with this new information you now take an action.

So: the state results in an observation, via the current state, the previous action, and the noise; the observation adds to your information, which is all the previous information plus the immediately preceding action and the new observation; based on this information you take the action, u_k = mu_k(I_k); and now that the action is chosen, the state equation takes x_k and u_k and produces the next state x_{k+1}. While choosing the action u_k, you do not know w_k, and obviously you do not know v_k either, because all you are getting is the observation z_k; this observation is the only thing that is known to you.
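Here is the same sequence of events as a closed-loop simulation sketch in Python. The model functions and the decision rule mu below are placeholder assumptions (the rule simply feeds back an average of past observations); what the sketch is meant to show is the order of the steps and the information update I_k = I_{k-1} ∪ {u_{k-1}} ∪ {z_k}.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5  # horizon

# Placeholder model, as in the earlier sketch (illustrative assumptions):
f = lambda x, u, w: 0.9 * x + u + w    # x_{k+1} = f_k(x_k, u_k, w_k)
h = lambda x, u_prev, v: x + v         # z_k = h_k(x_k, u_{k-1}, v_k); u_{k-1} dropped for brevity
g = lambda x, u, w: x**2 + 0.1 * u**2  # stage-wise cost
g_N = lambda x: x**2                   # terminal cost

def mu(I):
    # Placeholder history-dependent decision rule mu_k(I_k): it may use the
    # whole information vector; here it averages all observations so far.
    return -0.5 * sum(I["z"]) / len(I["z"])

x = rng.normal()                                 # hidden state x_0
I = {"z": [h(x, None, rng.normal())], "u": []}   # I_0 = (z_0)
cost = 0.0
for k in range(N):
    u = mu(I)                      # act: u_k = mu_k(I_k), from information only
    w = rng.normal()               # w_k is realized only after u_k is chosen
    cost += g(x, u, w)
    x = f(x, u, w)                 # hidden state transition to x_{k+1}
    if k + 1 < N:
        z = h(x, u, rng.normal())  # observe z_{k+1} = h(x_{k+1}, u_k, v_{k+1})
        I["u"].append(u)           # I_{k+1} = I_k U {u_k} U {z_{k+1}}
        I["z"].append(z)
cost += g_N(x)
print("realized cost:", cost)
```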
So w_k gets realized after this; this is where the noise w_k comes in, and that results in the next state. This is the sequence of events that happens in the problem.

Now, the main thing for you to realize is that the information is playing a critical role. Since we are never getting the state itself, we are keeping with us all the observations that we have made so far and all the actions that we have taken so far; all of this is being preserved in order to take the next action. In other words, the policy has to be history dependent in a certain sense. By history here I do not mean the true history of the problem, but rather the history of information that you have received. The entire history of observations and the history of actions is available to us, and so, in a natural way, the problem is formulated as one where you have to pick the action in a history-dependent way.

You may wonder: why is there not a Markov way of choosing an action here? Well, there is, but it requires a much more evolved and much more sophisticated logic; I will explain that in a subsequent lecture. If by a Markov policy you mean using just the most recent information, then it really makes no sense to stick to such policies, because the most recent information need not be the most correct information. Since we are getting noisy information at each time, there is no reason to believe that the current observation is somehow the one with the highest fidelity. In fact, because we do not get correct information at any time, our best bet is simply to use all the information that we have, and that is essentially what is being done here by keeping track of the entire history of observations and actions chosen so far.

So, we are now at the threshold of a very interesting class of problems, and this is where the role of information, and eventually the role of communication, will start manifesting in this course. I look forward to seeing you further here.