So, if you go back to lecture 6, you will see that I had written there that we can look at a closed-loop problem or an open-loop problem. An open-loop problem was one where the actions at every time step were chosen before the system even evolves, before any of the uncertainty in the problem was realized. In a closed-loop problem, the action you choose at every time step is based on some amount of information available at that time step. What I had mentioned was that the action can be chosen based on the history up until that particular time: u_k is chosen based on the history up until time k. I had also mentioned that, without loss of generality, one can choose u_k with only the knowledge of x_k. This is a point I did not elaborate on, and it is something I want to do right now.

Often in these kinds of problems, one does not actually need to keep track of the entire history up until time k, because it turns out to be sufficient to know only the state at time k. In fact, the state of the system is supposed to capture all the essential information about the system: it gives you the complete configuration of the system, and it is also what determines the cost you would incur. Thus it does not benefit one to have any more information than just the state. But this fact actually needs a proof, and that is what we will look at now. So let us examine this question, of what information is available when we are choosing a particular action, in the framework of Markov decision processes.

Let us start by defining a few terms. We say that a decision rule at time t is Markovian if it maps only the state at time t to an action. Such a rule chooses the action to be applied at time t based only on the state realized at time t; it does not look at any other information. Now let us define a more general piece of information that could be available at time t. Define h_t as the sequence of states visited and actions chosen up until time t:

h_t = (s_0, a_0, s_1, a_1, ..., s_{t-1}, a_{t-1}, s_t).

So at time t we have seen the state and the action at every time up to t-1, and we have the knowledge of the state at time t itself. This is the history available at time t: it comprises the entire sequence of states the system has gone through and the sequence of actions chosen up until that time. This h_t denotes one particular history; the set of all histories is denoted by H_t, and that set is simply

H_t = S × A × S × A × ... × S × A × S,

where the first S × A is for the state and action at time 0, the next for time 1, and so on, up to the state at time t. For example, the history at time 0 is simply the initial state from which the system begins, so H_0 = S; the history at time 1 lives in H_1 = S × A × S; and so on.
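To keep these objects concrete, here is a minimal Python sketch of a Markovian decision rule and a history; the integer state and action spaces and the toy rule are my own illustrative assumptions, not anything from the lecture.

```python
from typing import Callable, Tuple

State, Action = int, int  # toy state and action spaces (assumed)

# A Markovian decision rule d_t looks only at the current state s_t.
markov_rule: Callable[[State], Action] = lambda s: 0 if s % 2 == 0 else 1

# A history h_t = (s_0, a_0, s_1, a_1, ..., s_{t-1}, a_{t-1}, s_t):
# all states visited and actions taken so far, ending in the current state.
History = Tuple[int, ...]

h0: History = (3,)                    # h_0 is just the initial state, H_0 = S
h1: History = (3, markov_rule(3), 7)  # h_1 = (s_0, a_0, s_1) lives in S x A x S
```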
A history-dependent decision rule is one which maps the entire history available at time t to an action. Now, just as with a Markovian decision rule, there is a constraint. I did not mention it earlier, but it goes without saying that d_t(s) must choose an action from the set A(s) of actions available in state s. A history-dependent decision rule is bound by the same requirement: d_t(h_t) must respect the constraint that the action being chosen is from that set. To state this, write h_t explicitly as h_t = (s_0, a_0, s_1, a_1, ..., s_{t-1}, a_{t-1}, s_t). Then we require

d_t(h_t) ∈ A(s_t),

and this has to hold for all s_0, a_0, s_1, a_1, ..., s_{t-1}, a_{t-1}. Regardless of which sequence of states and actions was realized, when you end up at state s_t at time t, you have to choose an action from the set A(s_t). A policy built from such decision rules is said to be history dependent.

Now let us write down a few sets. I will write Π^M for the set of all Markovian policies and Π^H for the set of all history-dependent policies. Clearly, any policy that is Markovian is a special case of a policy that is history dependent, so we always have Π^M ⊆ Π^H, and except for a few corner cases, this inclusion is in general strict.

We can also ask for something more specific from Markovian policies. Since a policy is a sequence of decision rules π = (d_0, ..., d_{N-1}), we can ask, if the policy is Markovian to begin with, that each of these d's be independent of the time instant. That means the rule by which you choose the action is the same at every time: the action will depend on the state at that time, but the rule that binds the state to the action does not change with time. Such a decision rule, and the resulting policy, is called stationary. So we say that π is stationary if d_0 = d_1 = ... = d_{N-1}, meaning these functions are the same for each time. Notice, once again, I am not saying that one takes the same action at each time; I am saying we have the same plan at each time. The action will change based on the state that gets realized, but the plan by which you choose your action as a function of the state does not change. That is what we ask for in a stationary policy. Obviously, a stationary policy is a further special case of a Markov policy. So let us write Π^S for the set of stationary policies, and we have Π^S ⊆ Π^M ⊆ Π^H.
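In the same sketch style (again with made-up action sets), a history-dependent rule may read the whole of h_t, but its output must still lie in A(s_t), and a stationary policy is simply a list of N identical decision rules:

```python
from typing import Tuple

History = Tuple[int, ...]  # (s_0, a_0, ..., s_{t-1}, a_{t-1}, s_t)

def admissible(s: int) -> set:
    """Toy action sets A(s) (an assumption for illustration)."""
    return {0, 1}

def history_rule(h: History) -> int:
    """A history-dependent decision rule: may use the whole history,
    e.g. the number of steps taken so far, t = len(h) // 2."""
    t, s_t = len(h) // 2, h[-1]
    a = t % 2
    assert a in admissible(s_t)  # the constraint d_t(h_t) in A(s_t)
    return a

# A policy is a sequence (d_0, ..., d_{N-1}); it is stationary when
# every entry is the same rule.
N = 5
markov_rule = lambda s: 0 if s % 2 == 0 else 1
policy = [markov_rule] * N
assert all(d is policy[0] for d in policy)  # d_0 = d_1 = ... = d_{N-1}
```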
Notice one other thing here: the concept of a stationary policy makes sense only when we are talking of Markov policies. There is no such thing as a stationary history-dependent policy, because as time goes on, the length of the history increases. One cannot ask that the functions mapping histories to actions be the same for each time, because the number of arguments these functions take itself changes with time. One needs a specific type of function, namely one that depends only on the current state, to even ask that it be independent of time. So there is no concept of a stationary history-dependent policy; one only has stationary Markov policies.

One can also consider, in addition to this, another type of choice of action. So far, given a particular state, or given a particular history, we have chosen a particular action deterministically using the decision rule, or strategy, that we have. The next state, given this action and the current state, was random, and therefore the overall problem was stochastic, but the choice of the action itself was deterministic. One can generalize this by asking: what if I chose my action also at random? Given that there is already stochasticity in the problem, maybe I can achieve something by randomizing my actions as well. This leads us to the concept of a randomized policy.

Let us begin with a randomized decision rule, which I will first define for the Markovian case. Let P(A), written with this fancy P, denote the set of probability distributions on the set of actions A. A randomized Markovian decision rule is a function d_t : S → P(A): for every state, it produces a probability distribution on the set of actions. It does not specify a particular action for you; rather, it tells you with what probability each action is going to be chosen. We write this probability as q_{d_t(s_t)}(a): the probability of choosing action a at time t in state s_t under the randomized decision rule d_t. So the decision rule defines for you not just one probability distribution but a family of probability distributions on the set of actions, one for each state, and the probability with which a particular action is chosen when you are in a particular state and applying a particular decision rule is given through this expression.
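Here is what a randomized Markovian decision rule d_t : S → P(A) might look like in the same toy setting; the particular probabilities are invented purely for illustration:

```python
import random
from typing import Dict

def randomized_markov_rule(s: int) -> Dict[int, float]:
    """d_t : S -> P(A): for each state, a probability distribution over
    actions rather than a single action (probabilities chosen arbitrarily)."""
    return {0: 0.7, 1: 0.3} if s % 2 == 0 else {0: 0.2, 1: 0.8}

def sample_action(dist: Dict[int, float]) -> int:
    """Draw an action a with probability q_{d_t(s_t)}(a)."""
    actions = list(dist)
    return random.choices(actions, weights=[dist[a] for a in actions])[0]

q = randomized_markov_rule(4)  # the distribution induced in state s_t = 4
a = sample_action(q)           # the action actually applied is now random
```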
So when you are in state s_t and the randomized Markovian decision rule is d_t, the probability of choosing an action a is q_{d_t(s_t)}(a). Because this is a probability distribution on the set of actions, these probabilities must each be nonnegative, q_{d_t(s_t)}(a) ≥ 0, and if I sum them over the actions available in state s_t, they should sum to 1:

Σ_{a ∈ A(s_t)} q_{d_t(s_t)}(a) = 1.

This is the form of a randomized Markovian decision rule. Let us add a notation for these policies: the set of all randomized Markovian policies will be denoted by Π^{MR}.

Now, just as we had deterministic stationary Markov policies, we can also think of randomized stationary Markov policies. A randomized Markovian policy π = (d_0, ..., d_{N-1}) is said to be stationary if d_0 = d_1 = ... = d_{N-1}. What does this mean? It means that the rule by which the probability distribution of the action is chosen, as a function of the current state, is the same at each time. It does not mean that the action is chosen with the same probabilities each time, because the distribution depends on the state you are in. But the rule that ties the current state to the probability distribution on the actions is fixed across time. In other words, the probability q_{d(s)}(a) of choosing action a in state s under decision rule d is independent of time. As you can see, this probability does depend on the state you are in, and it depends on the policy you have chosen, but once you have chosen the policy, it does not depend on time; in that case the policy is said to be a stationary randomized Markov policy. This set is denoted Π^{SR}, the set of stationary randomized policies.

Finally, we can also talk of randomization given the history so far: a decision rule d_t : H_t → P(A) that maps the entire history to a probability distribution on the actions. This generates a probability distribution of the following form: given a certain history h_t and a decision rule d_t, the probability with which you will choose an action a is q_{d_t(h_t)}(a), the probability of choosing action a at time t when the history is h_t under decision rule d_t. Such a policy is a randomized history-dependent policy.
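The two conditions just stated, nonnegativity and summing to 1 over A(s_t), are easy to check in the sketch, and the remaining two policy types look like this (all concrete numbers and names are again assumptions of mine):

```python
from typing import Dict, Tuple

def is_valid_distribution(dist: Dict[int, float],
                          admissible: set,
                          tol: float = 1e-9) -> bool:
    """q_{d_t(s_t)}(a) >= 0 for every a, support inside A(s_t),
    and the probabilities summing to 1."""
    return (all(p >= 0.0 for p in dist.values())
            and set(dist) <= admissible
            and abs(sum(dist.values()) - 1.0) < tol)

# Stationary randomized Markov policy: the same state-to-distribution
# rule d at every time step, d_0 = d_1 = ... = d_{N-1}.
d = lambda s: {0: 0.7, 1: 0.3} if s % 2 == 0 else {0: 0.2, 1: 0.8}
policy = [d] * 5

# Randomized history-dependent rule: maps the whole history h_t to a
# distribution; here it shifts probability as t = len(h) // 2 grows.
def randomized_history_rule(h: Tuple[int, ...]) -> Dict[int, float]:
    p = min(1.0, 0.5 + 0.1 * (len(h) // 2))
    return {0: p, 1: 1.0 - p}

assert is_valid_distribution(d(4), {0, 1})
assert is_valid_distribution(randomized_history_rule((3, 0, 7)), {0, 1})
```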
We will denote the set of such policies by Π^{HR}, the set of randomized history-dependent policies. Clearly, we have some inclusions here: any randomized policy is a generalization of the corresponding deterministic policy, or conversely, any deterministic policy is a special case of the corresponding type of randomized policy. For example, a deterministic Markov policy is a special case of a randomized Markov policy, because one can always choose a distribution in which one particular action is chosen with probability 1 and all other actions with probability 0. In that case the randomized policy is effectively just a function: it tells you that for that state, you choose that particular action. Explicitly, if you have a deterministic decision rule which gives you a particular action a_t when you are in state s_t, this is equivalent to the probability distribution

q_{d_t(s_t)}(a) = 1 if a = a_t, and 0 for all other actions.

This kind of embedding lets us view every deterministic policy as a special case of a randomized policy. As a consequence, we have the following inclusions. We had seen earlier that stationary Markov policies are a special case of Markov policies, which are a special case of history-dependent policies; history-dependent policies are in turn contained in randomized history-dependent policies; Markov policies are contained in randomized Markov policies; and stationary policies are a special case of stationary randomized policies. We also have that stationary randomized policies are contained in randomized Markov policies, which are contained in randomized history-dependent policies. In summary,

Π^S ⊆ Π^M ⊆ Π^H ⊆ Π^{HR} and Π^S ⊆ Π^{SR} ⊆ Π^{MR} ⊆ Π^{HR}.

So we have these relations across the different policy classes. What we will do next is an example in which all these different types of policies, and the way they play out in an actual problem, will be explored and will become clear.
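The embedding described above, mapping a deterministic rule to the degenerate distribution that puts probability 1 on its chosen action, is a one-liner in the sketch (the rules here are the same toy examples as before):

```python
from typing import Callable, Dict

def as_randomized(det_rule: Callable[[int], int]) -> Callable[[int], Dict[int, float]]:
    """Embed a deterministic decision rule as a randomized one:
    q_{d(s)}(a) = 1 if a = d(s), and 0 otherwise (zeros left implicit)."""
    return lambda s: {det_rule(s): 1.0}

det = lambda s: 0 if s % 2 == 0 else 1  # a deterministic Markov rule
rand = as_randomized(det)
assert rand(4) == {0: 1.0}   # degenerate distribution: action 0 w.p. 1
assert rand(7) == {1: 1.0}
```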