Welcome everyone. You may recall that in the previous lecture we saw various classes of policies.

The first class we defined consisted of deterministic policies. Among these, a Markovian policy was one where the decision rule at any time t mapped the state at time t to an action, with the constraint that in any state you could only take actions available in that state. Next we looked at what were called history-dependent policies, in which the decision rule at any time t mapped the history known up until time t to an action. The constraint again was that, since the history up until time t contains the state s_t at time t, this entire history could only be mapped to an action available in the state s_t. In case you have forgotten, the history h_t is the sequence of states and actions taken up until time t-1, together with the state realized at time t. A history-dependent deterministic policy mapped this entire vector of states and actions to an action at time t.

Then we looked at what were called randomized policies. In a randomized policy the decision rule is not deterministic; instead it defines a probability distribution on the set of available actions. A Markovian randomized decision rule was one where the decision rule at time t mapped the state at time t to a probability distribution on the set of actions, and one obviously had to assign positive probability only to actions available in the state you are in. The distribution induced by this decision rule was denoted q_{d_t}; to evaluate the actual probability of choosing a particular action a, you input the state s_t, so the probability is q_{d_t}(a | s_t). Given the state at time t and the Markovian randomized decision rule, this gives you a probability distribution on the actions: each probability has to be greater than or equal to 0, and the sum over all actions available in that state has to equal 1. Naturally, all other actions that are not available in that state are given probability 0. One also had a randomized history-dependent decision rule, which at any time t maps the history up until time t to a probability distribution on the set of actions available in the state at time t. Here q_{d_t}(a | h_t) is the probability of choosing action a when you have chosen the randomized decision rule d_t and the history realized up until time t is h_t. This again gives you a probability distribution on the set of actions, with the same constraints as before.
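To keep the notation straight, here is a compact summary of the four classes just recalled (a sketch in the notation used above):

\[
h_t = (s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t)
\]
\begin{align*}
\text{Markovian deterministic: } & d_t(s_t) \in A_{s_t} \\
\text{History-dependent deterministic: } & d_t(h_t) \in A_{s_t} \\
\text{Markovian randomized: } & q_{d_t}(a \mid s_t) \ge 0, \quad \sum_{a \in A_{s_t}} q_{d_t}(a \mid s_t) = 1 \\
\text{History-dependent randomized: } & q_{d_t}(a \mid h_t) \ge 0, \quad \sum_{a \in A_{s_t}} q_{d_t}(a \mid h_t) = 1
\end{align*}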
Now, we also saw another set of policies called stationary policies. A policy pi was said to be stationary if it is Markov and the decision rule applied at every time t is independent of t; that is, pi is stationary if d_0 = d_1 = d_2 = ... = d_{N-1} = d. In general one uses the word stationary to mean time-independent. So one talks of stationary problems, stationary transition probabilities, stationary costs, or stationary rewards; by this one means that these particular quantities, or the overall characteristics of the problem, are invariant with time. That means they do not depend on the time instant at which they are evaluated. Similarly, we also had a stationary randomized policy, which is one in which we apply the same randomized decision rule at each time. This does not mean that you take the same action at each time; it just means that you take the actions with the same probabilities at each time for the same state. Once you fix a state and you fix the decision rule, the probability distribution on the actions, q_d(a | s), does not actually depend on the time t. This was called a stationary randomized policy.

Finally, we took note that any of these deterministic policies is a special case of its randomized counterpart. A Markov policy is a special case of a Markov randomized policy; therefore the set of Markov policies is included in the set of Markov randomized policies, the set of stationary policies is included in the set of stationary randomized policies, and the set of history-dependent policies is included in the set of history-dependent randomized policies.

What we will do today is get more clarity about all of these different types of policies by looking at a specific example. In order to work out this example we need to understand how a randomized policy actually works: when we choose an action at random, what exactly is the cost or reward that we are going to get, and with what probabilities is the state going to transition to the future states? Let us do this, for simplicity, first for a Markov randomized policy; the same can be done for history-dependent policies as well. So let d_t be a Markov randomized decision rule, and let q_{d_t}(a | s) equal the probability of choosing action a under decision rule d_t in state s. The way this works is that when we are in state s and have chosen the randomized decision rule d_t, we end up choosing each action a with probability q_{d_t}(a | s). The question then becomes: what is the reward that we would get? Because we are not choosing any one particular action; we are actually choosing every action with a certain probability.
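To make this concrete, here is a minimal Python sketch of sampling an action from a (stationary) randomized decision rule. This is an illustration, not part of the lecture's development: the state and action names anticipate the example coming later, and the probabilities are placeholders.

```python
import random

# A stationary randomized decision rule: for each state, a fixed
# probability distribution over the actions available in that state.
# The same dictionary is applied at every time t, so q_d(a | s) does
# not depend on t; only the sampled action varies from step to step.
q_d = {
    "s1": {"a11": 0.5, "a12": 0.5},  # two actions available in s1 (illustrative probabilities)
    "s2": {"a21": 1.0},              # only one action available in s2
}

def choose_action(state):
    """Sample an action a with probability q_d(a | state)."""
    actions, probs = zip(*q_d[state].items())
    return random.choices(actions, weights=probs)[0]

# Applying the same rule at every decision epoch does not mean the same
# action is taken each time; each call in s1 may return a11 or a12.
print([choose_action("s1") for _ in range(5)])
```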
So what we do is say that what we are looking for is the expected reward. This is in the same spirit as expected utility theory: any time there is uncertainty involved, one takes an expectation of the reward, utility, or cost over that uncertainty. So what one looks at is the expected reward at time t in state s due to the decision rule d_t. Remember that d_t does not specify any specific action, but we will simply write this as the expected reward that you get when you apply d_t in state s. In a Markov randomized decision rule we take each action with a certain probability, so what we say is that we receive the deterministic reward r_t(s, a) of each particular action a with probability q_{d_t}(a | s), where the summation over a runs over all the available actions. In other words, the expected reward when you choose a Markov randomized decision rule d_t in state s is the expectation of r_t(s, a), where the expectation is taken over a with the probability distribution specified by the Markov randomized decision rule.

We can then ask a similar question about the probability of transitioning from one state to another. The state transition probability, the probability of transitioning to state j from state s under d_t, is once again an expectation of the state transition probabilities that we already have for each state and action. We take the given transition probability p_t(j | s, a), assume that a is now a random variable chosen with probability distribution q_{d_t}(a | s), multiply by its probability q_{d_t}(a | s), and take the summation over a in A. So under a Markov randomized decision rule the expected reward and the transition probabilities are given by these expressions (written out explicitly at the end of this passage).

This leads us to a fundamental question in all of Markov decision theory: when does there exist a Markov deterministic policy that is optimal over the set of all history-dependent randomized policies? That is, taking the most general set of policies, the history-dependent randomized policies, if one optimizes one's cost by allowing the action to be chosen in a randomized manner, taking into account all the information in the history, when do we still get a Markov policy that is optimal? This is a fundamental question. We are not yet at the stage where we can answer it; we will answer it in a couple of lectures, but remember that this is a question we are building towards. So what I want to do next is actually show you an example in which all of these various types of policies are computed and written out, so that you get clarity on how exactly these policies are defined.
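Written out, the two expressions just derived are:

\[
r_t(s, d_t) = \sum_{a \in A_s} q_{d_t}(a \mid s)\, r_t(s, a),
\qquad
p_t(j \mid s, d_t) = \sum_{a \in A_s} q_{d_t}(a \mid s)\, p_t(j \mid s, a).
\]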
So in the example that we will consider, there is a system that can exist in two states; these states are denoted s1 and s2. Note that these are not the states at time 1 and time 2; these are the two states that the system can exist in. When the system is in state s1, the decision maker has two choices for actions; let us denote these actions a11 and a12. The first index, 1, denotes that we are in state s1 and the second index denotes the index of the action: a11 is the first action in state s1 and a12 is the second action in state s1. When the state is s2, there is only one action, and let us say that action is called a21.

Now, if a11 is chosen in state s1, the decision maker gets a reward of 5 units, and the state transitions to s2 with probability one half or remains in s1 with probability one half. If a12 is chosen in s1, then he receives a reward of 10 units and the state transitions to s2 with probability 1, that is, with certainty. If a21 is chosen in s2, then he receives a reward of minus 1 unit, in other words he incurs a cost of 1 unit, and the state transitions to s2 with probability 1. Actually I should not really say transitions; what I really mean is that it remains in s2 with probability 1, since this action is chosen when you are already in state s2.

This is the way we have formulated the problem. I also need to tell you what the time horizon of the problem is and what the terminal costs are; these are yet to be specified, and I will specify them in a moment. But first, here is one way of encoding what we have so far.
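The following sketch encodes the data of this example, together with the expected-reward and transition-probability computations from earlier. The dictionary layout is my own choice for illustration, but the numbers are exactly those stated above.

```python
# Each (state, action) pair maps to its reward and its transition
# distribution over next states, exactly as stated in the example.
mdp = {
    ("s1", "a11"): {"reward": 5,  "transitions": {"s1": 0.5, "s2": 0.5}},
    ("s1", "a12"): {"reward": 10, "transitions": {"s2": 1.0}},
    ("s2", "a21"): {"reward": -1, "transitions": {"s2": 1.0}},
}

# Expected reward r(s, d) and transition probability p(j | s, d) under a
# Markov randomized decision rule q_d, using the formulas derived earlier.
def expected_reward(state, q_d):
    return sum(p * mdp[(state, a)]["reward"] for a, p in q_d[state].items())

def transition_prob(j, state, q_d):
    return sum(p * mdp[(state, a)]["transitions"].get(j, 0.0)
               for a, p in q_d[state].items())

# E.g. a randomized rule choosing a11 and a12 with probability 1/2 each in s1:
q_d = {"s1": {"a11": 0.5, "a12": 0.5}, "s2": {"a21": 1.0}}
print(expected_reward("s1", q_d))        # 0.5*5 + 0.5*10 = 7.5
print(transition_prob("s2", "s1", q_d))  # 0.5*0.5 + 0.5*1.0 = 0.75
```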
Before we move forward to those details, I want you to notice one important feature of this problem, which is that each of these rewards is actually independent of time. The reward I get when I am in state s1 and choose action a11 is 5 units regardless of the time at which the action was taken and the time at which you were in that state. So regardless of the time instant or the decision epoch at which this is happening, the reward is fixed: it is always 5 units. The same is the case with the other actions: the reward of choosing action a12 in state s1 is still 10 units regardless of the time instant at which this choice is being made, and finally the reward of choosing action a21 in state s2 is likewise independent of the time at which the choice is being made. The same is also true of the state transitions: you transition from s1 to s2 with probability one half regardless of the time at which these choices are being made, and the other transitions are also independent of time.

When this is the case, when the rewards and the transition probabilities are actually time-independent, it is possible for us to represent this problem in a much cleaner and more compact form. Instead of writing out all of these details verbally in such a long and tedious manner, one can actually do a quick pictorial representation that captures all of these elements of the problem. Note that this works only when the rewards and transition probabilities are stationary; in other words, the problem is stationary when the rewards are independent of time and the transition probabilities are also independent of time. In this case a cleaner representation is possible; we will cover this representation in the next lecture.
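To summarize the stationarity property just observed in symbols:

\[
r_t(s, a) = r(s, a) \quad \text{and} \quad p_t(j \mid s, a) = p(j \mid s, a) \quad \text{for all } t,
\]

so the subscript t can be dropped from both the rewards and the transition probabilities.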