In the previous lecture we looked at a Markov decision process on two states, and we saw that, because of the actions available at each state and the transitions they allow, the set of Markov policies in that problem coincides with the set of history-dependent policies. In this lecture we consider a variation of that problem in which we add an additional action to state s1.

The model is as follows. We again have states s1 and s2, and the earlier details remain the same: in state s1, taking action a11 gives a reward of 5 and transitions you to state s2 with probability 0.5, while with probability 0.5 you remain in s1. State s1 also has action a12, which gives a reward of 10 and transitions you to s2 with probability 1. In state s2 the only available action is a21, which gives a reward of -1 with probability 1 and keeps you in s2. The addition I am going to make is a third action in state s1, which we denote a13, so that three actions, a11, a12, and a13, are now available in s1. Action a13 gives a reward of 0 and, with probability 1, leads back to state s1.

Notice that, as a result, when you are in state s1 we can no longer conclude what the history of states and actions has been. We can conclude that if you are presently in s1 you were previously in s1, because there is no way to reach s1 from s2: when you are in s2 you stay in s2 by taking a21, the only available action. But it is not clear whether you are in s1 because you took action a11 and happened to remain in s1, or because you took action a13, which kept you in s1 with certainty. In other words, merely being in s1 does not let us reconstruct the history: the sequence of states is known, but not the sequence of actions, so the entire history is not known. The history therefore contains strictly more information than the knowledge of the current state, and as a result the set of history-dependent policies in this problem will not be the same as the set of Markov policies.

Let us then write out a deterministic history-dependent policy for this problem; call it pi^hd, with pi standing for the policy. I need to specify what the policy does at each decision epoch. At decision epoch 0, since three actions are now available in state s1, the decision rule can choose any one of them. As one example, let the decision rule d0 choose action a11 in state s1 and action a21, the only option, in state s2.
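Since the model is small, it may help to have it written out explicitly. Here is a minimal sketch in Python; the dictionary encoding and the lowercase names are my own choices for illustration, not notation from the lecture.

```python
# A minimal sketch of the two-state MDP described above.
# P[s][a] maps each successor state to its transition probability;
# R[s][a] is the (deterministic) reward for taking action a in state s.

P = {
    "s1": {
        "a11": {"s1": 0.5, "s2": 0.5},  # reward 5: stay in s1 or move to s2
        "a12": {"s2": 1.0},             # reward 10: move to s2 w.p. 1
        "a13": {"s1": 1.0},             # new action, reward 0: stay in s1
    },
    "s2": {
        "a21": {"s2": 1.0},             # only action in s2, reward -1
    },
}

R = {
    "s1": {"a11": 5, "a12": 10, "a13": 0},
    "s2": {"a21": -1},
}

# The deterministic epoch-0 decision rule chosen above:
d0 = {"s1": "a11", "s2": "a21"}
```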
At decision epoch 1 we have to look at things with a little more care. What is the history that has evolved by that time? At decision epoch 1 we have the following information: the state at time 0, the action taken at time 0, and the state at time 1, which is the state you are in at decision epoch 1. These three things together constitute the history at decision epoch 1. Notice that we did not have to make any of this fuss at decision epoch 0, because there the history was simply the state at time 0; the history was completely encompassed by the state, which is why the decision rule d0 could be written as a function of the state alone.

Because the history at decision epoch 1 comprises these three variables, we will write out the decision rule in a tabular form. In the rows we record everything that happened at decision epoch 0, namely the state we were in and the action we took, written together as a (state, action) pair; the state at decision epoch 1 is written separately, in the columns. The epoch-0 (state, action) pair can take the following values: you could have been in s1 and taken a11, in s1 and taken a12, in s1 and taken a13, or in s2 and taken a21. These are all the state-action combinations that could have occurred at decision epoch 0. For the state at epoch 1 there are two possibilities, s1 or s2, and for each case I need to specify the action to take: the entry corresponding to a given row and column tells me the action to take when the epoch-0 (state, action) pair and the epoch-1 state are as indicated. For example, one particular deterministic history-dependent decision rule could specify that when you were previously in s1, took action a11, and are currently again in s1, you take action a13; and when you instead end up in s2, you take a21, the only available action there.
Once again, this first entry simply says that when the history is: at decision epoch 0 you were in state s1, you took action a11, and you then reached state s1, then you take action a13; and when at epoch 0 you were in s1, took a11, and are presently, at decision epoch 1, in state s2, then you take a21. That is the specification, and filling up the entire table fully describes a deterministic history-dependent decision rule.

But notice something a little subtle: I cannot indiscriminately fill in whatever I want here. For example, suppose you were previously in state s1 and took action a12. Action a12 always transitions you to state s2 with probability 1, so there is no way you can actually be in state s1 at decision epoch 1. Consequently that cell is infeasible: the history (s1, a12, s1) simply cannot occur in this problem. You will instead end up in s2, where you have to take a21. Next, if you were in s1 and took the new action a13, you transition back to s1 with probability 1, so being in s1 at epoch 1 is feasible; suppose the rule takes action a11 there. But the other possibility, being in state s2, is not possible, precisely because a13 keeps you in s1 with probability 1: the history (s1, a13, s2) is infeasible. Similarly, when you are in s2 at time 0 and take a21, you cannot move back to s1, so the history (s2, a21, s1) is infeasible, and in s2 all you can do is take a21. Putting this together, the decision rule d1 looks like this:

  epoch-0 (state, action) | epoch-1 state s1 | epoch-1 state s2
  (s1, a11)               | a13              | a21
  (s1, a12)               | infeasible       | a21
  (s1, a13)               | a11              | infeasible
  (s2, a21)               | infeasible       | a21

This is one history-dependent deterministic policy. Any history-dependent policy, whatever it is, will have to respect which histories are actually feasible, so these infeasibilities will occur in any history-dependent deterministic policy you define. One way to encode this rule is sketched below.
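Continuing the earlier sketch, the decision rule can be stored as a dictionary keyed by feasible histories (state at epoch 0, action at epoch 0, state at epoch 1), with the infeasible histories simply absent. Again, this encoding is my own, for illustration.

```python
# The deterministic history-dependent decision rule d1 from the
# table above. Keys are the feasible histories; the four infeasible
# histories have no entry at all.

d1 = {
    ("s1", "a11", "s1"): "a13",
    ("s1", "a11", "s2"): "a21",
    ("s1", "a12", "s2"): "a21",  # ("s1", "a12", "s1") is infeasible
    ("s1", "a13", "s1"): "a11",  # ("s1", "a13", "s2") is infeasible
    ("s2", "a21", "s2"): "a21",  # ("s2", "a21", "s1") is infeasible
}
```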
However, this is not the only history-dependent deterministic policy. While these infeasibilities will remain in every history-dependent deterministic policy, there is significant variation possible in the actions chosen in the feasible cells: for example, in the first cell I could have chosen a12 or a13 instead of a13, or a11 in the (s1, a13) row, and so on. All of these are valid history-dependent deterministic policies.

Let us now look at a randomized history-dependent policy; denote it pi^hr. For this to be a randomized history-dependent policy, I need to tell you, at each decision epoch, what probability distribution is chosen over the set of available actions. Start with decision epoch 0. Once again, as before, at decision epoch 0 the entire history is encapsulated in the state at time 0; that is all the history information available. So the decision rule maps the state at time 0 to a probability distribution over the set of actions available in that state. Here, for example, is one such decision rule, d0^hr, written directly as a probability distribution: in state s1, take action a11 with probability 0.6 and action a12 with probability 0.3. Notice that this time we have three actions, so simply telling you the probability of taking a11 does not determine the probabilities of the other two actions; I also need to give you the probability of a12, here 0.3, and then the last remaining action a13 must have probability 0.1. Naturally, in state s2 there is not much choice: d0^hr(s2) assigns probability 1 to a21, the only available action. This specifies, at decision epoch 0, the probabilities of choosing the various actions when the history up to time 0, which is just the state at time 0, is s1 or s2 respectively.
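As a small sketch, again in my own encoding rather than the lecture's notation, the epoch-0 randomized rule is a distribution per state, and it can be sampled with the standard library:

```python
import random

# The randomized epoch-0 decision rule d0^hr described above:
# an {action: probability} distribution for each state.
d0_hr = {
    "s1": {"a11": 0.6, "a12": 0.3, "a13": 0.1},
    "s2": {"a21": 1.0},
}

def sample_action(dist):
    """Draw one key from an {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return random.choices(outcomes, weights=weights, k=1)[0]

# Example: the action at epoch 0 when the initial state is s1.
a0 = sample_action(d0_hr["s1"])
```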
Now let us write out what happens at decision epoch 1. Once again, as in the deterministic case, I need a table whose rows record what transpired at decision epoch 0 and whose columns record the state at decision epoch 1. In the rows I write the (state, action) pair at epoch 0, denoted (s, a). We have the same four possibilities as before, because whether we choose deterministic or randomized policies does not change the set of possible histories; the realized history may differ, but the set of possible histories is the same. So the rows are (s1, a11), (s1, a12), (s1, a13), and (s2, a21): at decision epoch 0, in state s1 you could have taken a11, a12, or a13, and in state s2 only a21. In the columns I write the states that could be realized at decision epoch 1; again there are two possibilities, s1 and s2.

But now I cannot simply write one particular choice of action in each cell. Because this is a randomized history-dependent policy, I must specify with what probability each action is chosen, so each cell must list the set of available actions together with their associated probabilities: in state s1 the actions a11, a12, and a13, and in state s2 only a21. The entry in a cell is the distribution q_{d1^hr}(a1 | s, a, s'), where (s, a) is the epoch-0 (state, action) pair given by the row, s' is the epoch-1 state given by the column, and a1 ranges over the actions available in s'.

For example, one possibility for the cell with history (s1, a11, s1) is to take action a11 with probability 0.4, a12 with probability 0.3, and a13 with probability 0.3: if at decision epoch 0 you were in s1 and took a11, and at decision epoch 1 you are again in s1, then at epoch 1 you take a11, a12, and a13 with those probabilities. If instead you end up in state s2, so the history is (s1, a11, s2), you have only one action to choose from, so a21 is taken with probability 1. Next, if at decision epoch 0 you were in s1 and took action a12, the model tells us you transition to s2 with probability 1, so there is no way you can be in s1 at epoch 1: the very history (s1, a12, s1) is infeasible, and none of the actions available at s1 can be taken there. Since you will end up in s2, there is only one action to choose, namely a21. These feasibility constraints do not have to be tracked by hand; they can be read off the transition probabilities, as the sketch below shows.
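Here is a small illustrative sketch, reusing P from the first code block: a history is feasible exactly when the transition kernel gives the epoch-1 state positive probability.

```python
# Enumerate the feasible length-one histories (s0, a0, s1): a history
# is feasible exactly when P[s0][a0] assigns s1 positive probability.
# This recovers the feasible cells and rules out, e.g., ("s1","a12","s1").

def feasible_histories(P):
    histories = []
    for s0, actions in P.items():
        for a0, successors in actions.items():
            for s1, prob in successors.items():
                if prob > 0:
                    histories.append((s0, a0, s1))
    return histories

print(feasible_histories(P))
# [('s1', 'a11', 's1'), ('s1', 'a11', 's2'), ('s1', 'a12', 's2'),
#  ('s1', 'a13', 's1'), ('s2', 'a21', 's2')]
```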
Returning to the table: if at decision epoch 0 you took action a13, then you can indeed be in state s1 at epoch 1; in fact you will be in s1 with probability 1, and there you can take the three actions a11, a12, and a13 with various probabilities, say 0.8, 0.1, and 0.1 in this example. On the other hand, if your history begins with s1 and action a13 at time 0, there is no way you could have ended up in state s2 at decision epoch 1, because taking a13 in s1 means you remain in s1 at epoch 1 with probability 1; the history (s1, a13, s2) is infeasible. Finally, if you are in state s2 and take action a21, you cannot move to state s1, so the history (s2, a21, s1) is again infeasible; such a history is simply not possible, and in state s2, the state you do end up in, one has to take action a21. Altogether the epoch-1 decision rule looks like this (a consolidated code sketch follows at the end of this section):

  epoch-0 (state, action) | epoch-1 state s1              | epoch-1 state s2
  (s1, a11)               | a11: 0.4, a12: 0.3, a13: 0.3  | a21: 1
  (s1, a12)               | infeasible                    | a21: 1
  (s1, a13)               | a11: 0.8, a12: 0.1, a13: 0.1  | infeasible
  (s2, a21)               | infeasible                    | a21: 1

This is a description of a randomized history-dependent policy. As you have seen through these lectures, we have various classes of policies, and they are all described in terms of the information available to them. What we will ask in the next lecture is whether the additional information and richness that history-dependent policies bring to the problem actually help us obtain a better cost. In other words, we come back to the question we asked two lectures ago: is there any benefit at all to having history-dependent policies and randomized policies? Do there exist deterministic Markov policies, using only the information of the current state, that are optimal? This will come up in the following lecture.
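To close, here is the consolidated sketch promised above: the full randomized epoch-1 rule as a dictionary over feasible histories, together with one simulated step of the policy. It reuses P, d0_hr, and sample_action from the earlier sketches and is only an illustration of the bookkeeping, not part of the lecture itself.

```python
# The randomized epoch-1 decision rule q_{d1^hr} from the table above:
# each feasible history (s0, a0, s1) maps to a distribution over the
# actions available in s1. Infeasible histories are again absent.

d1_hr = {
    ("s1", "a11", "s1"): {"a11": 0.4, "a12": 0.3, "a13": 0.3},
    ("s1", "a11", "s2"): {"a21": 1.0},
    ("s1", "a12", "s2"): {"a21": 1.0},
    ("s1", "a13", "s1"): {"a11": 0.8, "a12": 0.1, "a13": 0.1},
    ("s2", "a21", "s2"): {"a21": 1.0},
}

# One simulated step of the policy pi^hr = (d0_hr, d1_hr, ...).
# The transition kernels P[s][a] are {state: probability} dictionaries,
# so sample_action can draw the next state as well.
s0 = "s1"
a0 = sample_action(d0_hr[s0])
s1 = sample_action(P[s0][a0])
a1 = sample_action(d1_hr[(s0, a0, s1)])
```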