Welcome everyone. In the previous lecture we saw a problem involving a system that could be in one of two possible states, S1 and S2, with certain actions available in each state. We designed policies of both deterministic and randomized kinds that were Markov: a deterministic Markov policy and a randomized Markov policy. Then we altered the problem a little and also constructed history-dependent policies, both a deterministic history-dependent policy and a randomized history-dependent policy.

What we will do now is move on to the next type of question: how do we actually solve a problem like this? That is, how does one solve a stochastic control problem or a Markov decision process; how does one actually find the optimal policy? In this case we will first restrict ourselves to Markov policies, which means we will look for a policy that is optimal over the set of Markov policies for a particular problem, and we will show that this search for the optimal Markov policy can be simplified greatly by using what is called Bellman's optimality equation. We will get to this in a moment.

Let us go back to the stochastic control formulation of the problem where, if you recall, we were looking for an optimal policy. The policy π comprised the decision rules μ_0, ..., μ_{n-1}, and we wanted a policy in the set of Markov deterministic policies; let me emphasize that by writing Π_MD. The policy has to minimize a cost, and the cost you get when you apply a policy π to the problem is the expectation of a terminal cost g_n(x_n) plus the sum of the stage-wise costs g_k(x_k, u_k, w_k), with the sum running from k = 0 to n-1. Because we are choosing a Markov policy, u_k here is a function μ_k(x_k) of the current state, so the cost can equivalently be written as

J_π(x_0) = E[ g_n(x_n) + Σ_{k=0}^{n-1} g_k(x_k, μ_k(x_k), w_k) ].

The expectation here is over all the things that are random in the problem. Since we have already fixed a particular deterministic policy, the only randomness left is the noise w_0, ..., w_{n-1}. Now, this cost depends on two things. The first is the policy we are applying: the dependence is clear, since all the μ_k's appear in the expression. That is why the cost is written as J_π, with the subscript π emphasizing that the cost depends on the policy. The other dependence is on the initial state x_0.
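To make this concrete, here is a minimal sketch of estimating J_π(x_0) by simulation for a fixed Markov deterministic policy. The two-state system, the cost functions, and the noise distribution here are all made up for illustration; only the shape of the computation follows the formula above.

```python
import random

n = 3  # time horizon (toy value)

def f(x, u, w):
    return (x + u + w) % 2   # toy dynamics x_{k+1} = f_k(x_k, u_k, w_k) on states {0, 1}

def g(x, u, w):
    return x + 2 * u + w     # toy stage cost g_k(x_k, u_k, w_k)

def gN(x):
    return 10 * x            # toy terminal cost g_n(x_n)

# A Markov deterministic policy: one decision rule mu_k per stage,
# each mapping the current state to an action.
mu = [lambda x: 0, lambda x: x, lambda x: 1 - x]

def estimate_cost(x0, trials=100_000):
    """Monte Carlo estimate of J_pi(x0): average the accumulated cost
    over many independently sampled noise sequences w_0, ..., w_{n-1}."""
    total = 0.0
    for _ in range(trials):
        x, cost = x0, 0.0
        for k in range(n):
            w = random.randint(0, 1)   # noise w_k, uniform on {0, 1}
            u = mu[k](x)               # u_k = mu_k(x_k)
            cost += g(x, u, w)
            x = f(x, u, w)
        total += cost + gN(x)
    return total / trials

print(estimate_cost(x0=0))
```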
x_0 is where the system starts from, and the cost depends on x_0 as well, because where you start from determines what kind of cost you actually incur. For example, in the inventory problem, the total cost you incur depends on how much inventory you start off with. The cost also depends on a number of other things: the time horizon, the stage-wise cost functions, the terminal cost function, and so on. But all of those are implicit here; we are not emphasizing their dependence in this particular expression.

So the stochastic control problem we are faced with is to find the optimal policy in a certain class of policies. As I said, the class we are considering here is the class of Markov deterministic policies, and the problem becomes to find the optimal cost as a function of the initial state:

J*(x_0) = min over π ∈ Π_MD of J_π(x_0).

This is your stochastic control problem: it asks you to find the optimal Markov policy when you start off from a certain state x_0.

Let us dwell a little on the complexity of a problem like this. As stated, the problem asks us to search over all possible Markov deterministic policies, so let us see how large this set is and how vast the search is. Suppose the size of the state space is b, some finite number. Suppose the number of actions available at each time step, or decision epoch if you prefer, is a, also finite. And suppose n is the time horizon, the number of decision epochs. The question is: what is the size of the set of Markov deterministic policies?

Let us try to get an estimate of this quantity. At every time instant the system can be in one of b possible states, and in each state you have a choice of a actions. A decision rule is fully specified once you say which action is taken in each state. Since there are a choices for each of the b states, the number of decision rules at each time is a × a × ... × a, b times, which gives you a^b.
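As a quick sanity check on this count, here is a small sketch, with hypothetical sizes a = 3 and b = 4, that enumerates every decision rule at a single time step:

```python
from itertools import product

a, b = 3, 4                      # 3 actions, 4 states (toy sizes)
states, actions = range(b), range(a)

# A decision rule assigns one of the a actions to each of the b states,
# so it is just a b-tuple of actions, indexed by state.
rules = list(product(actions, repeat=b))

print(len(rules), a**b)          # both print 81
```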
Since there are n time steps, and a^b decision rules at each of them, the number of policies is (a^b)^n, in other words a^(b·n). Now let us run through this calculation and see how big this number could be. Suppose a is 10, meaning you have 10 possible actions; suppose there are 5 possible states, so b is 5; and suppose the time horizon is also 10. Then a^(b·n) becomes 10^(5×10), which is 10^50. This is the total number of Markov deterministic policies in this problem. So you can see that even a simple problem with just 10 actions, 5 possible states, and 10 time steps has a humongous number of Markov deterministic policies for us to choose from. The problem we have posed here, which asks us to find the policy that minimizes this cost over the set of all Markov deterministic policies, is essentially asking us to find the best policy out of 10^50 different policies. This is clearly a very complex problem: if you were to approach it in the most naive possible way, by listing out all possible policies, it would probably take you longer than the age of the universe to solve. What we will see in today's and the subsequent lecture is that there is actually a way to simplify this, where we use the structure of the problem: the fact that there is an underlying Markov process, that the cost is additive, and so on. All of this will be exploited to reduce the complexity of the search, so in a sense we will not attack this problem as one where we have to search over all Markov deterministic policies.

But before we get to that, let me make another point. Suppose we were not looking for Markov deterministic policies but optimized instead over history-dependent deterministic policies. The optimal cost we would then be looking for, let me denote it by J**(x_0), would be the minimum over π ∈ Π_HD of the cost that comes from that particular policy. Let me put a small tilde on the cost to distinguish it from the earlier term, because the cost I get from a history-dependent policy is actually a different expression:

J̃_π(x_0) = E[ g_n(x_n) + Σ_{k=0}^{n-1} g_k(x_k, μ_k(h_k), w_k) ],

where the action is now chosen as a function μ_k(h_k) of the entire history, and h_k is the history at time k: simply the sequence of states and actions that have been realized up until time k-1, together with the state at time k. This, then, is the expression when you have a history-dependent policy.
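Just to put a scale on that number, a back-of-the-envelope check:

```python
a, b, n = 10, 5, 10
num_policies = a ** (b * n)      # (a^b)^n = a^(b*n) = 10^50
print(num_policies)

# Even checking a billion policies per second, exhaustive search would
# take about 10^41 seconds; the universe is only about 4 * 10^17
# seconds old, so naive enumeration falls short by some 23 orders
# of magnitude.
print(f"{num_policies / 1e9:.1e} seconds")
```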
So what has changed here is that, in place of just x_k, which is only the current state, we now have the entire history present in the decision rule. That gives rise to a different expression, which I have called J̃, and when you minimize it over all history-dependent policies you get the cost J**(x_0). This is another way of posing the problem, where we now ask what optimal cost you can get if you allow for history dependence in your decision rules.

Now let us try to think about what kind of complexity this problem amounts to. Take a simple case once again: a is the number of actions, b is the number of possible states (the number of elements in your state space), and the question is what the size of Π_HD is. The number of history-dependent policies also depends on n, the time horizon, so let us compute it for various values, say n = 2 first. For n = 2 you have two decision rules to choose: a decision rule at time 0 and a decision rule at time 1. For the decision rule at time 0 you have a actions to choose from and b states at which you have to make this choice, so the number of possible decision rules there is once again a^b. Then take the decision rule at time 1. At time 1 you know not just the state at time 1 but also the entire history, so you have to ask yourself: what is the total number of histories at time 1? The history at time 1 involves the state at time 0, the action you took at time 0, and the state at time 1. The number of histories at time 1 is therefore b (the number of states at time 0) times a (the number of actions at time 0) times b (the number of states at time 1), which is a·b². For each possible history we have a choice of a actions, so the number of possible decision rules at time 1 is a^(a·b²). The total number of policies can therefore be as large as a^b times a^(a·b²). This is what you get when the time horizon is 2.

Now let us compute this again when the time horizon is 3. The number of decision rules at time 0 is still a^b, and the number of decision rules at time 1 is a^(a·b²). Then you have the decision rules at time 2. At time 2 you again have a actions, and a history now comprises two actions, the actions at times 0 and 1, and three states: the state at time 0, the state at time 1, and the state at time 2. So a²·b³ is the total number of histories you can have at time 2.
So a^(a²·b³) is the number of decision rules at time 2, and the entire product, a^b times a^(a·b²) times a^(a²·b³), gives the total number of history-dependent policies you can have. I do not need to tell you that this again can be an extremely large number; in fact it is much larger than even the number we computed for Markov deterministic policies. So this once again puts a question in front of us. We have posed the problem of Markov decision processes, we saw that it has several applications, we looked at a very clean application in inventory control, but what is actually the way to solve a problem like this? This is something we now need to address.

The crux of solving a problem like this, whatever the class of policies, whether we are looking for Markov deterministic policies, history-dependent deterministic policies, randomized policies, or any of these, is to use what we know about the structure of the problem. We know that the cost has an additive structure: there is a terminal cost and then there is a stage-wise cost. This means the total cost you incur over the time horizon is not tied in some complicated way to the specific trajectory you have followed. Rather, the way one accrues cost is stage by stage: at each stage, the state you are in and the action you take, combined with a bit of noise, determine the cost you accrue at that particular stage, and the total cost you accumulate is the sum of all these stage-wise costs. This is a crucial part of the structure of a stochastic control problem, or a Markov decision process problem, and we somehow have to make use of this structure in order to simplify the search over the set of all policies of whatever kind.

The other structural element, which I have not written here, is that the next state, x_{k+1}, depends only on the current state x_k, the action we choose in the current state, and an element of noise. The state at time k+1 does not depend on the previous states any further back than x_k. Given x_k, u_k, and the noise, the next state is completely determined; alternatively, given x_k and u_k, the probability distribution of the next state is completely determined. One does not need to know the entire history of states that have occurred in order to determine what the next state is. This again is very conducive to the kind of structure we have in the cost function, because it means that once you get to a certain state and take a certain action, you incur a certain cost; it does not matter how you got there, and it does not matter what has happened previously in the past.
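Following the counting argument above, at stage k there are a^k·b^(k+1) possible histories (k past actions and k+1 states), hence a^(a^k·b^(k+1)) decision rules at that stage; the general pattern is my extrapolation from the n = 2 and n = 3 cases worked out in the lecture. A small sketch comparing the two counts, using tiny hypothetical sizes so the numbers stay printable:

```python
def num_markov_policies(a, b, n):
    # a^b decision rules per stage, one rule per stage: (a^b)^n
    return a ** (b * n)

def num_history_policies(a, b, n):
    # At stage k there are a^k * b^(k+1) possible histories,
    # hence a ** (a^k * b^(k+1)) decision rules at that stage.
    total = 1
    for k in range(n):
        total *= a ** (a**k * b**(k + 1))
    return total

a, b = 2, 2                      # 2 actions, 2 states (toy sizes)
for n in (1, 2, 3):
    print(n, num_markov_policies(a, b, n), num_history_policies(a, b, n))
# Already at n = 3 the history-dependent count (2^42) dwarfs the
# Markov count (2^6).
```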
This kind of separability, or additive structure, in the cost function has to be exploited, and what we will see next is how it is actually exploited in order to produce a tractable method for computing the optimal solution, the optimal policy, for these kinds of problems.