So, let us now see how we can represent a stationary problem like the one described in the previous lecture in a compact and clean form. The way to do this is to draw a diagram of the following kind. In this diagram the big circles denote the states of the system; as you will recall, the system in our problem can be in two different states, so I have drawn two circles labelled s1 and s2. The actions one can choose, the transition probabilities and the associated rewards are denoted by directed arcs between these states. For instance, there is a directed arc going from state s1 to state s2. On top of this arc I write the action that the arc represents, in this case a11. Below the arc I write the reward I would get and the probability with which this transition occurs: the reward from choosing action a11 in state s1, if you recall, was 5 units, and the transition from s1 to s2 occurs with probability 0.5. You will also remember that when one chooses a11 in state s1 there is another possibility, namely that one stays in state s1. This gives another arc that starts at s1 and ends at s1; on top of it I again write a11, and below it I again write the reward, 5 units, and the probability with which this transition occurs, 0.5. When I am in state s1 I also have another action available, a12; it takes me from s1 to s2, gives me a reward of 10 units, and this transition occurs with probability 1. When I am in state s2 only one action is possible, a21, and it transitions me from s2 to s2 itself, in other words it keeps me in state s2; the reward I receive is minus 1 and the transition from s2 to s2 happens with probability 1. That is what is written in the curly brackets here.

Notice that this picture is only a snapshot in time: it tells you what happens at any single instant. It is therefore an accurate representation of the problem only if the problem really is stationary, that is, if the rewards and transition probabilities are independent of time. What is also missing from a representation like this is the time horizon and, associated with it, the terminal reward or terminal cost that you are going to incur. In this case we will define these separately from the figure; the figure serves only as a way for us to understand what the transitions and rewards are at any instant in time, except for the terminal instant. The decision epochs for us are therefore 0, 1, 2, up to N minus 1; these are the stages or time steps at which decisions are made. The states the system can be in are s1 and s2, and the actions one can take depend on the state itself.
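To make this concrete, here is a minimal sketch in Python of how the diagram can be stored as data; the variable names are purely illustrative and not part of the original problem statement. Each (state, action) pair maps to the list of possible transitions, each carrying its probability and the one-step reward.

```python
# A minimal sketch of the diagram as data (hypothetical names).
# Each (state, action) pair maps to a list of (next_state, probability, reward).
transitions = {
    ("s1", "a11"): [("s1", 0.5, 5), ("s2", 0.5, 5)],
    ("s1", "a12"): [("s2", 1.0, 10)],
    ("s2", "a21"): [("s2", 1.0, -1)],
}

# State-dependent action sets, read off the same diagram.
actions = {
    "s1": ["a11", "a12"],
    "s2": ["a21"],
}
```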
So, A(s1) is the set of feasible actions in state s1, which is {a11, a12}, and A(s2) is the set of actions available in state s2, which is {a21}. The rewards are already written in the figure, but I will write them here again for completeness, with the time index t: the reward in state s1 when you take action a11 is r_t(s1, a11) = 5, the reward for action a12 is r_t(s1, a12) = 10, and the reward in state s2 for action a21 is r_t(s2, a21) = -1, and this is true for all times t. Because we are assuming a finite horizon here, let me write N < infinity. I also need to define a terminal reward; for simplicity I will take the terminal reward to be 0 regardless of the state you end up in, that is, r_N(s) = 0 for every state s. Finally, we have the transition probabilities: p_t(s1 | s1, a11) = 0.5 and p_t(s2 | s1, a11) = 0.5; p_t(s1 | s1, a12) = 0, since the system does not remain in state s1 at all when you take action a12, and p_t(s2 | s1, a12) = 1; p_t(s1 | s2, a21) = 0, since when you take action a21 you remain in state s2, and p_t(s2 | s2, a21) = 1. These are your transition probabilities.
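As a quick sanity check on these definitions, here is a small sketch that reuses the transitions dictionary from the earlier sketch; it verifies that the probabilities out of each (state, action) pair sum to 1 and samples a single transition. The function name step is just an illustration, not notation from the lecture.

```python
import random

def step(state, action, rng=random):
    """Sample (next_state, reward) for one decision epoch from the transition table above."""
    outcomes = transitions[(state, action)]
    assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9  # probabilities must sum to 1
    r, cumulative = rng.random(), 0.0
    for next_state, prob, reward in outcomes:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return next_state, reward  # guard against floating-point rounding

# Example: one transition from s1 under action a11.
print(step("s1", "a11"))  # ('s1', 5) or ('s2', 5), each with probability 0.5
```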
So, now let us write out the various types of policies. We will first do an example of a deterministic Markov policy. In order to do this I need to define a decision rule at each decision epoch, so if I have N decision epochs I need to define N decision rules. This can obviously become very long and laborious, so we will take a very simple case, N = 2. That means there are only two decision epochs, time 0 and time 1, and we will define decision rules for those two epochs.

Here, for example, is a deterministic Markov policy; let us denote it by pi^MD because it is Markov and deterministic, and it is a particular policy. I have to write a decision rule for each decision epoch. At decision epoch 0 the decision rule must specify an action for each state that could potentially occur at that epoch; there are two possible states in this problem, so I need to define an action for each of them. Here is one such decision rule: d_0^MD(s1) = a11 and d_0^MD(s2) = a21. Notice that the decision rule specifies an action for each state in the state space, which comprises the two states s1 and s2; it tells you which action you should choose in each of those states. The decision rule also has to be feasible, meaning it must specify an action that is actually available to you in that state. Here action a11 is available in s1, so it is fine to assign it to this decision rule, and a21 is the only action available in state s2, so you have to assign that in state s2.

Then let us move to decision epoch 1. Once again I need to specify what I would do in each state. Suppose that if I am in state s1 I take action a12, and if I am in state s2 then again I have no choice and must take action a21; so d_1^MD(s1) = a12 and d_1^MD(s2) = a21. Notice that each of these lines defines one decision rule: the first is a function mapping s1 to a11 and s2 to a21, the second is a function mapping s1 to a12 and s2 to a21. These are not the only possible decision rules; obviously there are many other decision rules which are also Markov and deterministic. For example, instead of taking action a11 at epoch 0 I could have taken a12, or instead of taking a12 at epoch 1 I could have taken a11 again. All of these are possible variations, and choosing one deterministic Markov decision rule at every epoch defines a deterministic Markov policy; what I have written here is just one example of such a policy.
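In code, this particular policy is simply one dictionary per decision epoch; the sketch below (reusing the actions dictionary from the first sketch, with illustrative variable names) encodes the decision rules d_0^MD and d_1^MD and checks that every prescribed action is feasible.

```python
# One decision rule per decision epoch; each maps every state to a single feasible action.
d0_MD = {"s1": "a11", "s2": "a21"}   # decision rule at epoch 0
d1_MD = {"s1": "a12", "s2": "a21"}   # decision rule at epoch 1
pi_MD = [d0_MD, d1_MD]               # the deterministic Markov policy for N = 2

# Feasibility check: every prescribed action must be available in that state.
for t, d in enumerate(pi_MD):
    for s, a in d.items():
        assert a in actions[s], f"epoch {t}: action {a} is not available in state {s}"
```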
Let us now look at a randomized Markov policy. Once again, at decision epoch 0 and at decision epoch 1 I need to specify what my decision rule is going to do. A randomized Markov decision rule takes as its argument the state you are in and produces as its output a probability distribution on the set of actions, in such a way that only the actions that are actually available can get positive probability.

Here is one potential randomized Markov policy. Let me define directly the probability distribution that the decision rule induces on the set of actions; the notation q_{d_0^MR} stands for the probability distribution of the randomized Markov decision rule at time 0. This has to be a probability distribution, so it must give me a probability for every action available in a particular state, and I need to write it for every state that can occur at decision epoch 0. Suppose the state is s1; then the probability of taking action a11 in s1 under this decision rule is q_{d_0^MR}(a11 | s1), and suppose this quantity is 0.7. What this says is that, according to this policy, with probability 0.7 I take action a11 when I am in state s1 at time 0. I have one other action available in s1, so for the policy to define a valid probability distribution it must give the remaining probability to that action: it takes action a12 with probability 0.3 when in state s1 at time 0.

This only defines the probability distribution on actions when I am in state s1; for me to completely define a randomized Markov decision rule I also need to tell you what I would be doing in state s2. In state s2 the same rule d_0^MR takes action a21 with what probability? Well, there is only one action available in s2, so naturally you have to take that action with probability 1: q_{d_0^MR}(a21 | s2) = 1. Any randomized Markov policy must satisfy this, because there is really no choice when you are in state s2.

Now let us go to decision epoch 1, where the decision rule is d_1^MR. Suppose I am in state s1; again I have to tell you with what probability I take the various actions. Say I take action a11 with probability 0.4 and action a12 with the remaining probability, 0.6; that is decision epoch 1. I also have to tell you what I would do in state s2, and again, since I have no choice, I take action a21 with probability 1. All of this defines a probability distribution on the set of actions for every state. Clearly there can be multiple such distributions: here I chose 0.7 and 0.3 as the probabilities of choosing actions a11 and a12, but they need not be these numbers; another distribution could use 0.25 and 0.75, or 0.3 and 0.7, and so on. Each such distribution defines a randomized Markov policy, and once you change the distribution at any one state you get a new randomized Markov policy.
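A randomized Markov policy can likewise be stored as one distribution per epoch and state; the sketch below encodes the distributions just described, together with a small helper that samples an action. The names q_MR and sample_action are illustrative only.

```python
import random

# q_MR[t][s] is a probability distribution over the actions available in state s at epoch t.
q_MR = [
    {"s1": {"a11": 0.7, "a12": 0.3}, "s2": {"a21": 1.0}},  # decision rule at epoch 0
    {"s1": {"a11": 0.4, "a12": 0.6}, "s2": {"a21": 1.0}},  # decision rule at epoch 1
]

def sample_action(t, state, rng=random):
    """Draw an action according to the randomized Markov decision rule at epoch t."""
    dist = q_MR[t][state]
    assert abs(sum(dist.values()) - 1.0) < 1e-9  # must be a valid probability distribution
    r, cumulative = rng.random(), 0.0
    for action, prob in dist.items():
        cumulative += prob
        if r <= cumulative:
            return action
    return action  # guard against floating-point rounding
```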
Let us now write out history-dependent policies. In order to do this we have to notice something rather peculiar about this particular problem. Suppose at some time instant you are in state s1; what is the history that could have led you there? If you look at the diagram, there is not much variation possible. The reason is that the only way you can presently be in s1 is if in the previous step you were also in s1 and took action a11. I am not saying that every time you take a11 you will end up in s1, because there is a chance that you transition to s2 as well; but the only way you can end up in s1 is by taking a11 in s1. So if your state at time t is s1, then the state at time t minus 1 must also have been s1 and the action at time t minus 1 must have been a11; but then if the state at time t minus 1 is s1, the state at time t minus 2 must also have been s1 and the action at time t minus 2 must also have been a11, and so on. In other words, because there is really only one way to get to state s1, if you are presently in s1 you must have always been in s1.

What this essentially tells you is that, thanks to the structure of the problem, the present state s1 also tells you the entire history of states and actions that have occurred so far: if you are presently in s1, then you have been in s1 at all previous times and have always taken action a11 at all of those time instants. As a result, there is no additional information in the history up to any time t beyond what the present state s1 already tells you; the state itself tells you the entire history. Consequently, there is no difference between history-dependent decision rules at state s1 and Markov decision rules at state s1: every Markov decision rule, when it acts at state s1, can be thought of without loss of generality as a history-dependent decision rule, and vice versa, because the information in the state is enough for us to conclude the entire history.

What happens when we are in state s2? There the situation is a little different. You could come to s2 in many different ways: you could have been in s2 previously and taken action a21, or you could have been in s1 and taken action a11, or you could have been in s1 and taken action a12. All three of these histories are valid ways of reaching state s2 at any given time.
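This structural argument can be checked by brute force. The sketch below, reusing the transitions and actions dictionaries from the first sketch and allowing either state as the starting state purely for the purpose of the check, enumerates every history of positive probability over three steps: exactly one of them ends in s1, the one that stays in s1 and takes a11 throughout, while several distinct histories end in s2.

```python
def feasible_histories(horizon):
    """All histories (s0, a0, s1, ..., s_horizon) that occur with positive probability."""
    histories = [(s0,) for s0 in actions]            # either state may serve as the start
    for _ in range(horizon):
        extended = []
        for h in histories:
            s = h[-1]
            for a in actions[s]:
                for s_next, p, _ in transitions[(s, a)]:
                    if p > 0:
                        extended.append(h + (a, s_next))
        histories = extended
    return histories

hists = feasible_histories(3)
print([h for h in hists if h[-1] == "s1"])
# [('s1', 'a11', 's1', 'a11', 's1', 'a11', 's1')]  -- exactly one history ends in s1
print(sum(1 for h in hists if h[-1] == "s2"))
# 7  -- many different histories end in s2
```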
So being in state s2 does not actually help you conclude what the previous history has been, in the way that we could conclude the entire history from being in state s1. However, being in state s2 has another consequence: there is no action other than a21 that one can take. So regardless of whether you have the information of the history or only the present state, when your present state is s2 you can only take action a21. As a result, once again, and for a different reason, the set of decision rules you can apply when you end up in state s2 is the same with the history or without it, because you have only one action.

The consequence of this is that if you ask what the set of history-dependent policies is in this particular problem, it is in fact equal to the set of Markov policies, and it does not matter whether the policies are deterministic or randomized: a history-dependent randomized policy can equivalently be thought of as a randomized Markov policy, and a history-dependent deterministic policy can equivalently be thought of as a deterministic Markov policy. In other words, there is no further richness and no further information available in knowing the history in this particular problem.

Now, this is obviously a coincidence in this particular problem; it is due to the structure of the problem, and I have deliberately chosen this problem in order to make this particular point clear to you: there can be times when you do not benefit from knowing the history. In the next lecture we will look at a variation on this problem in which we add another action at state s1, and in that case we will see that knowing the history and not knowing the history can lead you to very different policies altogether. This will be in the next lecture.