Welcome everyone. So far we have been looking at Markov policies, either in the Markov decision process setting or in the stochastic control setting. The reason I articulated some time back was that one can work with Markov policies without loss of generality. Recall, though, that we had also discussed other types of policies, such as history-dependent deterministic policies and history-dependent randomized policies. What we will do today is actually prove that there is no benefit to be gained in a stochastic control problem or an MDP problem if you use the history and if you randomize. In other words, the optimal cost that you can get under history-dependent randomized policies will be shown to be equal to that under deterministic Markov policies. Consequently, one can work with only Markov policies without loss of generality. Markov policies, as we have seen, have elegance and simplicity; they are easy to explain and easy to understand. So this result will give us the license to use these kinds of policies in our problems.

Just to recap, remember that we had defined these classes of policies: Pi^HR, the class of history-dependent randomized policies, and Pi^HD, the class of history-dependent deterministic policies. Any policy pi belonging to Pi^HD comprises decision rules; let us call them mu_0 to mu_{n-1}, the decision rules or strategies at each stage. The character of these decision rules is that the decision rule at any time t maps the history up until time t to an action at time t, with the constraint that whatever history you have, you always take an action that is available in the state you are in.

Remember that the history at time t, denoted h_t, comprises the sequence of states and actions taken so far, together with the state at time t:

h_t = (s_0, a_0, s_1, a_1, ..., s_{t-1}, a_{t-1}, s_t).

This is the entire history of the problem up until time t. A history-dependent deterministic decision rule maps this entire vector to an action a_t, where a_t belongs to A(s_t), the set of actions available in state s_t. This is a deterministic history-dependent decision rule, or deterministic history-dependent strategy.

Similarly, there is the class of history-dependent randomized policies. If pi = (mu_0, ..., mu_{n-1}) belongs to this class, then each mu_t maps the history at time t to a probability distribution on the set of actions. Let us denote the distribution created from the history h_t, for simplicity, by q_{mu_t(h_t)}.
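Before moving on, it may help to see these objects concretely. Below is a minimal Python sketch, with made-up names (actions, mu_deterministic, mu_randomized, sample_action are illustrative, not from the lecture), showing a history as a tuple and the two kinds of decision rules:

```python
import random

# A history is a tuple h_t = (s_0, a_0, s_1, a_1, ..., s_{t-1}, a_{t-1}, s_t).
# A deterministic history-dependent decision rule maps the full history to a
# single action in A(s_t); a randomized one maps it to a distribution on A(s_t).

def actions(s):
    # Hypothetical feasible-action map A(s); a toy two-state example.
    return [0, 1] if s == 0 else [0]

def mu_deterministic(h):
    # Deterministic rule: pick one action, possibly using the entire history;
    # here it just takes the first feasible action in the current state.
    s_t = h[-1]
    return actions(s_t)[0]

def mu_randomized(h):
    # Randomized rule: return the distribution q_{mu_t(h_t)} over A(s_t);
    # here, uniform over the feasible actions, as an example.
    s_t = h[-1]
    acts = actions(s_t)
    return {a: 1.0 / len(acts) for a in acts}

def sample_action(q):
    # Draw an action a with probability q(a).
    acts, probs = zip(*q.items())
    return random.choices(acts, weights=probs)[0]
```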
This distribution has the property that q_{mu_t(h_t)}(a) >= 0 for all a, and that it sums to 1 over the actions available at time t: the sum over a in A(s_t) of q_{mu_t(h_t)}(a) equals 1. In other words, this probability distribution is supported on the set A(s_t) that we have here.

Now, what are the rewards and the transition probabilities under such policies? Suppose you have chosen a history-dependent randomized policy; at time t you are in state s_t, the history up until time t is h_t, and you apply the decision rule mu_t, which is a function of h_t. Let us take two cases. If mu_t is deterministic, then mu_t(h_t) is directly the action you take at that time, so the stage reward r_t(s_t, mu_t(h_t)) simply becomes r_t(s_t, a_t), where a_t is the action prescribed by the decision rule or strategy. On the other hand, if the decision rule you are considering is randomized, then you cannot specify any one particular action with it: what this decision rule gives you is a probability distribution on the set of actions, supported, remember, on the set of actions available at that time. The stage reward is then taken as the expectation over the actions that this distribution is supported on:

r_t(s_t, mu_t(h_t)) = sum over a in A(s_t) of r_t(s_t, a) q_{mu_t(h_t)}(a).

This is the reward you get at a stage when you choose a randomized history-dependent policy. Now remember, in this expression the expectation has only been taken over the randomization that the policy is creating. In addition, there are other random variables here, namely the state at that time and the history up until that time, and we have not yet taken the expectation over these. So the actual expected reward involves not only an expectation over the random choice of the action, but also an expectation over the states. The total expected reward is therefore

E[ r_n(s_n) + sum from t = 0 to n-1 of r_t(s_t, mu_t(h_t)) ],

where r_t(s_t, mu_t(h_t)) is to be read as above; written out as a double summation, each stage contributes the sum over a of r_t(s_t, a) q_{mu_t(h_t)}(a), the reward of each action multiplied by the probability with which that action is chosen.
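As a small sketch of this inner expectation (the names here are hypothetical, continuing the toy setup above), the stage reward under a randomized rule is just the q-weighted average of the per-action rewards:

```python
def expected_stage_reward(r_t, s_t, q):
    # r_t(s_t, mu_t(h_t)) = sum_a r_t(s_t, a) * q(a): the stage reward
    # averaged over the policy's randomization only; the expectation over
    # states and histories is taken separately, outside this sum.
    return sum(r_t(s_t, a) * prob for a, prob in q.items())

# Example with made-up numbers: two feasible actions, chosen with
# probabilities 0.3 and 0.7.
r_t = lambda s, a: float(s + a)           # hypothetical stage reward
q = {0: 0.3, 1: 0.7}                      # q_{mu_t(h_t)} over A(s_t)
print(expected_stage_reward(r_t, 1, q))   # 0.3*1.0 + 0.7*2.0 = 1.7
```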
The total expected reward we will denote by J^pi. In general this is a function of the initial state, so let me write it as J^pi(s_0). So J^pi(s_0) is the expected reward that you would get when you use a certain policy pi; here it has been written out for a history-dependent randomized policy. If the policy is not randomized, then the evaluation is trivial: one just uses the deterministic expression r_t(s_t, a_t) at each stage to evaluate the reward.

Let us also write out the transition probabilities under randomized policies. Again, with a little abuse of notation, we will write mu_t(h_t) even for a randomized decision rule. The probability of transitioning to a state j when you are in state s at time t is

P^pi(s_{t+1} = j | s_t = s) = sum over a in A(s) of p_t(j | s, a) q_{mu_t(h_t)}(a).

What is this summation? Inside it you have p_t(j | s, a), the probability of transitioning to state j from state s when you take action a; and the probability that you take action a is q_{mu_t(h_t)}(a). So, under the policy pi, the probability of transitioning from state s to state j is given by this particular quantity. The probabilities p_t(j | s, a) are specified for the uncontrolled chain, for every state and every action, whereas for the controlled chain, that is, once you put in a particular policy, the states transition with the probability above: you get to the next state j with exactly this probability.

Now, what we will discuss next is how one actually solves for the optimal history-dependent randomized policy. In other words, suppose you are looking for the quantity J*(s_0), the optimum over all history-dependent randomized policies of J^pi(s_0). Since we are talking of rewards, let me make this a maximization:

J*(s_0) = max over pi in Pi^HR of J^pi(s_0),

where J^pi(s_0) is as written out above. In the case of Markov policies, the way we addressed this was to write out what was called the principle of optimality and the dynamic programming algorithm.
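The controlled transition law is again a q-weighted mixture, this time of the uncontrolled kernels. Here is a minimal sketch under the same toy conventions, with p_t represented as a hypothetical function returning a dict {j: probability}:

```python
def controlled_transition(p_t, q, s):
    # P^pi(j | s) = sum_a p_t(j | s, a) * q(a): mix the uncontrolled kernels
    # p_t(. | s, a) with the action probabilities q(a) of the randomized rule.
    out = {}
    for a, pa in q.items():
        for j, pj in p_t(s, a).items():
            out[j] = out.get(j, 0.0) + pa * pj
    return out

# Example with a made-up two-state kernel.
p_t = lambda s, a: {0: 0.8, 1: 0.2} if a == 0 else {0: 0.1, 1: 0.9}
q = {0: 0.5, 1: 0.5}
print(controlled_transition(p_t, q, s=0))  # {0: 0.45, 1: 0.55}
```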
That is what we employed when we were looking for the optimal policy over the set of Markov policies. Now that we are concerned with history-dependent randomized policies, the question at hand is: how does one actually solve this? In other words, is there an analogous principle of optimality and a dynamic programming equation that we can use even for history-dependent randomized policies? The answer to this question is yes. There is in fact an analogous principle of optimality and a dynamic programming equation, and that equation looks very similar to the one that we wrote out when we were optimizing over Markov policies. In fact, that equation will also show us why deterministic Markov policies are optimal. All of this is what we will discuss in this lecture.

In order to discuss the optimality of Markov policies, let us define a couple of quantities. Define J*_t(h_t) as the max over all history-dependent randomized policies pi of J^pi_t(h_t). Now, what is J^pi_t(h_t)? Note that there is a superscript pi, which denotes the policy, and a subscript t, which denotes the time. J^pi_t(h_t) is the reward incurred, or obtained, however you want to think of it, from time t onwards until time n under the policy pi in Pi^HR, starting from history h_t.

What is the meaning of this? Remember that when I wrote simply J* without a subscript, that J*(s_0) denoted the optimal reward starting from time 0. But you can also talk of an optimal reward starting from some intermediate time, such as time 1 or time 2, up until the end of the same time horizon considered in the original problem. That optimal reward, starting at an intermediate time from a certain history h_t and running to the end of the horizon, is the optimal reward of a truncated problem. J^pi_t(h_t) is the reward that you get when you apply a policy pi to that truncated problem, and the max over all policies pi is the optimal reward for that particular problem. So J*_t(h_t) is the optimal reward from time t onwards.

Now, what is the principle of optimality for this problem? It is in fact the same principle of optimality, which simply says: if you consider an optimal policy and look at its truncation from any time onwards up until the end of the time horizon, then the truncated policy is also optimal for the truncated problem.
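In symbols, restating the two definitions just given (the conditioning on h_t makes the dependence on the starting history explicit, a notational reconstruction of what was said verbally):

$$
J_t^{\pi}(h_t) \;=\; \mathbb{E}^{\pi}\!\Big[\, r_n(s_n) \;+\; \sum_{k=t}^{n-1} r_k\big(s_k,\mu_k(h_k)\big) \,\Big|\, h_t \Big],
\qquad
J_t^{*}(h_t) \;=\; \max_{\pi \in \Pi^{HR}} J_t^{\pi}(h_t).
$$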
More precisely: if a policy pi = (mu_0, ..., mu_{n-1}) in Pi^HR is optimal for the MDP, then its truncation pi_t = (mu_t, ..., mu_{n-1}), the policy that applies the decision rules from time t onwards until time n-1, is optimal for the problem of maximizing J^pi_t(h_t) over all policies, starting from the history h_t. So if your randomized policy is optimal for the entire MDP, then you can start it from any intermediate step onwards and it will be optimal for the remaining problem, the tail problem, as well.

This is the principle of optimality. As a result of it, we have a dynamic programming equation, or Bellman equation, for history-dependent randomized policies as well. It is almost verbatim the one that we had written out earlier:

J_t(h_t) = max over a in A(s_t) of [ r_t(s_t, a) + sum over j of p_t(j | s_t, a) J_{t+1}(h_{t+1}) ],

where h_{t+1} is simply (h_t, a, j): you had the history h_t up until time t, you took the action a at time t, and you reach the state j at the next time step. So the history at time t+1 is the history until time t, the additional action that you took at time t, and the state that you have at time t+1, which is j. The bracketed expression in total is the right-hand side of the Bellman equation. Notice that this is optically very similar to the Bellman equation that we wrote out for Markov policies. In addition to this, we have the terminal condition J_n(h_n) = r_n(s_n), and as usual these equations have to be written out for all histories h_t and all terminal histories h_n. We will discuss more about this in a moment.
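To see the recursion operationally, here is a minimal Python sketch of backward induction on histories, under stated assumptions: a hypothetical feasible-action map actions(s), a kernel p(t, s, a) returning a dict {j: probability}, stage rewards r(t, s, a), and a terminal reward r_terminal(s); none of these names come from the lecture, and histories are tuples as in the earlier sketches.

```python
def solve(actions, p, r, r_terminal, n, s0):
    # Backward induction on histories:
    #   J_n(h_n) = r_n(s_n)
    #   J_t(h_t) = max_a [ r_t(s_t, a) + sum_j p_t(j | s_t, a) * J_{t+1}(h_t + (a, j)) ]
    cache = {}  # memoize J_t(h_t) by the pair (t, h_t)

    def J(t, h):
        if (t, h) in cache:
            return cache[(t, h)]
        s = h[-1]  # the current state s_t is the last entry of the history
        if t == n:
            val = r_terminal(s)  # terminal condition J_n(h_n) = r_n(s_n)
        else:
            val = max(
                r(t, s, a) + sum(pj * J(t + 1, h + (a, j))
                                 for j, pj in p(t, s, a).items())
                for a in actions(s)
            )
        cache[(t, h)] = val
        return val

    # J*(s_0): optimal reward starting from the trivial history (s_0,)
    return J(0, (s0,))
```

Note that inside J, the value of a history h is computed using only its last component s; by induction, J_t(h_t) depends on h_t only through s_t, which is precisely the observation behind the claim that deterministic Markov policies lose nothing, the result this lecture is building toward.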