Welcome back. What we will do now is develop a tractable way of finding the optimal policy of a stochastic control, or Markov decision process, problem. The main observation we made in the previous part was that the cost has an additive structure: the cost accrues to us stage by stage, based on what we do at each stage, where we are at that stage, and the noise in the system. It does not depend cumulatively on the entire sequence of states and actions we have taken. This is reminiscent of many other systems; in computer games or mobile phone games, for example, you accumulate points based on what you do in each level, not on the entire history of what you have done up to that point. This structure is what we will now exploit. The key analogy I want you to remember is that this structure lets us think of all these stochastic control problems as a kind of shortest path problem. So what is a shortest path problem? Let me write it here. Imagine we want to travel from one point, say Bombay, to another point, say Goa, and along the way there are numerous possible routes one can take. One could go by sea, or fly to the nearest destination and then take a taxi, or go entirely by road, or go by train, and so on. I am going to denote all of these by various paths. Here is one possible path; it passes through these intervening nodes, which could be villages, towns, or airports that fall along the way we choose to travel. Here is another path instance.
Another path could be the one that comes here, then does this, and then goes to Goa; yet another could be something like this. When you open Google Maps and look for a route from Bombay to Goa, this is essentially what it is solving for you: it searches over all these possible paths and tries to find the shortest path, or the most economical path, or whatever the criterion may be. Now why is this kind of problem akin to the stochastic control problem? Why is there a parallel between the two? The main reason lies in the structure of the cost of travelling from Bombay to Goa. Suppose we measure cost in terms of distance, so we want to minimise the distance travelled in going from Bombay to Goa. The distance travelled on any path, say this particular path that goes from Bombay to here, then to here, then to here, and then to Goa, is equal to the sum of the distances travelled on all its edges: the distance on this edge, plus the distance on this edge, plus the distance on this edge, plus the distance on this edge. So distance has the property that the total distance along a path from Bombay to Goa can be written as a sum of stage-wise distances, where at every stage the distance is simply the length of the edge shown on this graph. The same property holds for time. Say I wanted to find the path that takes the least time rather than the least distance; suppose this was that path.
Then the time you take on that path is equal to the time you take on this leg plus the time you take on this leg, and so on. This means the cost you incur accrues to you stage by stage, and that is exactly the structure we have in our problem: there too, the cost accrues in a stage-wise manner. As a result, we can think of a stochastic control problem as a kind of generalized stochastic shortest path problem, where the stage-wise costs are the costs on each of these edges. Choosing an action amounts to asking: given that you are at this town on the way, do you want to take the action of going to this town next, or to that one? That is essentially the choice we make when we choose actions. In full, the states of the system can be thought of as all these intervening nodes, the different villages, towns, airports, or train stations on the way from Bombay to Goa, and the actions tell us what to do next: which edge to travel on, this one or that one; and if you ended up here, well, there is actually no choice, you travel here, and so on. So this is how one can visualize the problem: effectively, we are solving a generalized version of the problem of finding the shortest path between two nodes.
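This additive structure can be sketched in a few lines of code. A small, hypothetical example (all node names and distances below are made up purely for illustration): the cost of a path is just the sum of its edge costs.

```python
# Stage-wise (additive) cost: the total cost of a path is the sum of the
# costs of its edges, just as total distance or total travel time is.
# All node names and edge weights here are hypothetical.
edge_cost = {
    ("Bombay", "A"): 150,
    ("A", "B"): 120,
    ("B", "Goa"): 300,
}

def path_cost(path, edge_cost):
    """Sum the edge costs along a path given as a list of nodes."""
    return sum(edge_cost[(u, v)] for u, v in zip(path, path[1:]))

print(path_cost(["Bombay", "A", "B", "Goa"], edge_cost))  # 150 + 120 + 300 = 570
```

The same function works whether the weights are distances, travel times, or any other stage-wise cost, which is exactly the point of the analogy.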
Now, this analogy is not exact, because in a shortest path problem the terminal place we want to reach is Goa and Goa only, whereas in the stochastic control problem we have a terminal cost that depends on which node we end up in, and so on; but broadly speaking the analogy goes through. Let us see what it actually amounts to. The analogy gives us one simple observation, which is as follows. Suppose I found the shortest path from Bombay to Goa under some cost criterion, be it distance or time or whatever, and suppose that shortest path passes through this highlighted node here, the yellow one. The actual shortest path is this green path: you go from here to here, then from here to here, then from here to here, and then from here to here; and as I said, it passes through this yellow node in between. Let me give this node a name; suppose it is Pune. I know the shortest path from Bombay to Goa probably does not pass through Pune, but I am just making it up. So we have found that the shortest path from Bombay to Goa passes through this node called Pune. Now let me ask you the following question: if this green path is indeed the shortest path from Bombay to Goa, then what is the shortest path from Pune to Goa? That is, suppose you ignore everything on the left here; you are not starting from Bombay but from Pune, and you want to know the shortest path from Pune to Goa. What does this tell you?
The fact that the shortest path from Bombay to Goa is the green path tells us that this leg from Pune to Goa must also be the shortest path from Pune to Goa. Why is that the case? Simply because of the following rather obvious observation. Suppose there was another path that was shorter, say this blue path here. Then upon reaching Pune, instead of continuing along the green path from Pune to Goa, I could have switched to the blue path, since after all it is shorter, and I would have obtained this combined green-plus-blue path as a path shorter than my original green path. That is, going from Bombay to Pune and then taking the blue route would be shorter than going from Bombay to Pune and then taking the green route to Goa, contradicting the assumption that the green path was shortest. This simple observation tells us that if you have found a shortest path from Bombay to Goa, and you stop at any intervening node on that path, in this case Pune, or any other node, say this one here, and ask what is the shortest path from this intervening node to Goa (we all want to get to Goa!), the simple answer is that it is the same path: the remaining leg, or technically the residual leg, of the shortest path you found from Bombay to Goa.
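You can check this observation directly on a toy road network. Everything below is hypothetical (the node names and distances are invented, and a brute-force search stands in for a real shortest path algorithm): we find the shortest Bombay-to-Goa path, pick an intervening node on it, and verify that the residual leg is itself the shortest path from that node to Goa.

```python
# A toy road network; all names and distances are hypothetical.
edges = {
    "Bombay": {"Pune": 150, "Satara": 280},
    "Pune": {"Satara": 110, "Kolhapur": 250},
    "Satara": {"Kolhapur": 120},
    "Kolhapur": {"Goa": 230},
}

def shortest_path(src, dst):
    """Brute-force search over all simple paths; fine for a toy graph."""
    best = (float("inf"), None)
    def dfs(node, dist, path):
        nonlocal best
        if node == dst:
            best = min(best, (dist, path))
            return
        for nxt, d in edges.get(node, {}).items():
            if nxt not in path:
                dfs(nxt, dist + d, path + [nxt])
    dfs(src, 0, [src])
    return best

dist_bg, path_bg = shortest_path("Bombay", "Goa")
mid = path_bg[1]                       # an intervening node (here: Pune)
i = path_bg.index(mid)
dist_mg, path_mg = shortest_path(mid, "Goa")
print(path_bg, dist_bg)
print(path_mg == path_bg[i:])          # the residual leg is itself shortest
```

If the residual leg were not shortest, splicing in the better tail would shorten the full path, which is exactly the contradiction argued above.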
This simple observation is going to be very powerful, and it is going to be used in solving stochastic control problems. It has been given an important name: it is called the principle of optimality, and it was first propounded by a scientist called Richard Bellman. So it is often called Bellman's principle of optimality, and it leads to an equation for solving stochastic control problems called the Bellman equation. Let me now write out the principle of optimality. Let pi star, given by (mu 0 star, ..., mu n minus 1 star), be in capital Pi MD; that is, I am going to write out the principle of optimality for Markov deterministic policies. (There is a similar principle of optimality for history dependent policies; we will come to that in a moment.) So let pi star be an optimal policy over the set of Markov deterministic policies. Now assume that when using pi star, a given state x i occurs at time i with positive probability. Consider the sub-problem whereby you are at state x i at time i and wish to minimize the cost-to-go from time i to time n. What is the cost-to-go? It is the cost you would incur if you started at state x i at time i and continued until time n; it is the cost you accrue from time i onwards.
Let me write out this expression. It is simply the expectation E[ g_n(x_n) + sum from k = i to n-1 of g_k(x_k, u_k, w_k) ]; the summation now runs from k equal to i instead of 0. This is my cost-to-go from time i to time n: you start at time i and continue until time n. The principle of optimality then says that the truncated policy (mu i star, mu i plus 1 star, ..., mu n minus 1 star), which starts from time i onwards, is optimal for this sub-problem. Let me restate the whole thing. Suppose you have a policy pi star which is a Markov deterministic policy and which is optimal over the set of Markov deterministic policies; that is, it minimizes the cost over that set. Assume that when using pi star, a particular state x i occurs at time i with positive probability. (When you use a particular policy, the sequence of states that gets realized is stochastic, because there is noise in the system: even though you fix the policy, the state you will transition to is not known to begin with. What we assume is that there is some state x i which occurs with positive probability when we use pi star.) Now consider the sub-problem which starts at state x i at time i (it is important here that we start at time i) and in which we want to minimize the remaining cost from i to n, namely the expectation of the terminal cost plus the stage-wise costs from i to n minus 1.
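To make the cost-to-go concrete, here is a minimal simulation sketch. The dynamics, costs, policy, and noise below are all hypothetical, chosen only to illustrate the expectation E[ g_n(x_n) + sum from k = i to n-1 of g_k(x_k, u_k, w_k) ] for a fixed Markov deterministic policy, estimated by Monte Carlo averaging:

```python
import random

# Monte Carlo estimate of the cost-to-go J_i(x_i) under a fixed policy, for
# a toy scalar system x_{k+1} = x_k + u_k + w_k over horizon N.
# Everything here (dynamics, costs, policy, noise level) is hypothetical.
N = 5

def g(k, x, u, w):     # stage-wise cost g_k(x_k, u_k, w_k)
    return x**2 + u**2

def gN(x):             # terminal cost g_N(x_N)
    return x**2

def mu(k, x):          # a fixed (arbitrary) Markov deterministic policy
    return -0.5 * x

def cost_to_go(i, xi, n_runs=20000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        x, c = xi, 0.0
        for k in range(i, N):          # stages i, i+1, ..., N-1
            u = mu(k, x)
            w = rng.gauss(0.0, 0.1)    # system noise
            c += g(k, x, u, w)
            x = x + u + w              # state transition
        total += c + gN(x)             # add terminal cost at time N
    return total / n_runs

print(cost_to_go(i=2, xi=1.0))
```

The point of the sketch is only the shape of the quantity: an expectation, over the noise, of the terminal cost plus the stage-wise costs accrued from time i onwards.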
What the principle of optimality claims is that this truncated policy (mu i star, mu i plus 1 star, ..., mu n minus 1 star) is an optimal policy for this sub-problem, the problem starting from x i at time i and running until time n. How is this analogous to what I was telling you about the shortest path problem? The analogy is easy to see. The optimal policy pi star is the optimal path you have chosen: the green path. The intervening node Pune is the state x i that occurs at time i; all we are assuming here is that Pune actually lies on the shortest path. Now, the shortest path problem as I have posed it is a deterministic problem, but because the stochastic control problem is stochastic in nature, we also need to say that the intervening state occurs with positive probability; that is the stochastic equivalent of saying that Pune is on the shortest path. You then consider the sub-problem of starting from Pune and going to Goa by the shortest path. The cost for that sub-problem is the cost-to-go: the cost you would incur if you started at state x i at time i. And the principle of optimality says that the remaining part of the green path, or equivalently the remaining part of the policy, what we called the truncated policy, is optimal for this sub-problem of starting from Pune and ending at Goa.
So this is effectively the analogy, and it also tells you why there is such a close correspondence between stochastic control problems and shortest path problems: shortest path problems capture most of the spirit of a stochastic control problem. As I said, there is a similar principle of optimality when the set of policies is not the Markov deterministic policies but the history dependent deterministic policies, and also for randomized policies; I will come to those in a subsequent lecture. What is the implication of this principle of optimality? It gives us an algorithm, a method for computing an optimal policy. This algorithm is what is called the dynamic programming algorithm. Why is it called dynamic programming? If you look at the nature of the algorithm, it goes stage-wise, and it optimizes something for each stage; I will explain exactly what it means to optimize for each stage. It does not do something as simple as finding the shortest path for each leg separately, it is not that simple, but it does go stage-wise. It therefore involves, you could say, a dynamic optimization: an optimization that happens for each stage as the problem goes along.
What, then, is the overall philosophy of this algorithm? It comes straight from the principle of optimality. The principle of optimality told us that the shortest path from Pune to Goa is this green path, but as a corollary it also tells us a few other things. Suppose I had found the shortest path from every intervening node to Goa: say the shortest path from Pune to Goa, and also the shortest path from this red node to Goa. If I had found these shortest paths, and I also knew the times or distances from Bombay to each of these nodes, then to find the shortest path from Bombay to Goa all I need to do is look over all the intervening nodes that can occur on my route from Bombay to Goa, use the fact that I have already computed the shortest path from those intervening nodes to Goa, and combine that with the time or distance it takes me to reach each intervening node. Combining these two, I can synthesize the shortest path from Bombay to Goa. This, essentially, is the dynamic programming algorithm: we synthesize the optimal cost using intermediate costs, namely the cost-to-go for each state. Details of this in the next part.
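The synthesis just described can be sketched as a backward pass on a toy road network (all names and distances hypothetical): first compute the cost-to-go J of every node, working backward from Goa, then read off the best edge at each node to recover the shortest path forward from Bombay.

```python
# Backward dynamic-programming sketch on a toy layered road network.
# J[node] is the cost-to-go: the shortest distance from that node to Goa.
# All node names and distances are hypothetical.
edges = {
    "Bombay": {"Pune": 150, "Satara": 280},
    "Pune": {"Satara": 110, "Kolhapur": 250},
    "Satara": {"Kolhapur": 120},
    "Kolhapur": {"Goa": 230},
    "Goa": {},
}
order = ["Kolhapur", "Satara", "Pune", "Bombay"]  # reverse topological order

J = {"Goa": 0.0}       # cost-to-go at the destination itself
policy = {}            # best next node from each node
for node in order:
    # Bellman-style backup: edge cost plus cost-to-go of the successor.
    u, c = min(((nxt, d + J[nxt]) for nxt, d in edges[node].items()),
               key=lambda t: t[1])
    J[node], policy[node] = c, u

# Follow the policy forward from Bombay to synthesize the shortest path.
path, node = ["Bombay"], "Bombay"
while node != "Goa":
    node = policy[node]
    path.append(node)
print(J["Bombay"], path)  # 610.0 ['Bombay', 'Pune', 'Satara', 'Kolhapur', 'Goa']
```

Notice that the full shortest path is never searched over directly: each node's answer is assembled from one edge cost plus an already-computed cost-to-go, which is exactly the stage-wise optimization described above.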