Welcome, everyone. In the previous lecture we were discussing the dynamic optimization problem, which involved taking decisions over n time periods, denoted 0, 1, 2, and so on, ending at a terminal time n. We took decisions for each time period; in our running example, these were decisions for an inventory control problem. Decisions were made at the left endpoint of each time period, and noise was realized during the period. Because of the presence of noise, the problem, which was first stated as having to decide the action to take at each time instant, actually became a problem of planning for every possible state that could be realized at each time instant. To recall: if x_k is the state at time k and u_k is the action at time k, then x_{k+1} = f_k(x_k, u_k, w_k), where w_k is the noise during period k. The problem was to minimize a cost that was the sum of a terminal cost and a cost at each time instant from 0 to n-1. This minimization was to be achieved by choosing actions at each time step, but because the plan had to be made before the realization of the uncertainty, our way of posing the problem was to minimize over policies, denoted mu_0 to mu_{n-1}. Notice that the cost we incur is therefore a function of the policy and also of the initial state, and the optimal cost, the cost of the optimal policy, was denoted J*(x_0). So notice that this optimization, the one I have put in a box here, is an optimization over a space of functions: the decision variable is a sequence of functions.
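To make these objects concrete, here is a rough Python sketch of how the expected cost of one fixed policy could be estimated by simulation. The transition functions f_k, stage costs g_k, and the noise model are placeholders to be supplied by the user, not specifics from the lecture:

```python
def policy_cost(x0, policy, f, g, g_terminal, noise, n, trials=10000):
    """Monte Carlo estimate of the expected cost J(pi, x0) of a fixed policy.

    policy     : list of n functions mu_k mapping state -> action
    f          : list of n transition functions f_k(x, u, w)
    g          : list of n stage-cost functions g_k(x, u, w)
    g_terminal : terminal cost g_n(x)
    noise      : function k -> sampled noise w_k for period k
    """
    total = 0.0
    for _ in range(trials):
        x, cost = x0, 0.0
        for k in range(n):
            u = policy[k](x)           # action chosen as a function of the state
            w = noise(k)               # noise realized during period k
            cost += g[k](x, u, w)      # stage cost g_k(x_k, u_k, w_k)
            x = f[k](x, u, w)          # x_{k+1} = f_k(x_k, u_k, w_k)
        total += cost + g_terminal(x)  # terminal cost g_n(x_n)
    return total / trials
```

This only evaluates one given policy; the hard part, which the lecture turns to next, is searching over all such sequences of functions.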
So what one needs to decide is this sequence of functions. Just imagine, for simplicity, that there were 10 states at each time instant, 5 admissible actions at each time instant, and, say, 3 time periods, so these decisions have to be made thrice. How many possible functions are we looking at? A function at the first time period maps the set of states to the set of actions, and for every state you can take 5 possible actions. As a result, the number of possible functions at time instant 1 is 5 raised to 10. Similarly, there are 5 raised to 10 possible functions at time instant 2, and 5 raised to 10 possible functions at time instant 3. The total number of choices therefore becomes 5^10 times 5^10 times 5^10. As you can see, this is an enormous number for a simple problem involving just 10 states, 5 actions, and 3 time periods. As the number of time periods increases, and as the problem gets more realistic with more states and more actions, the number of possible choices grows even larger. Consequently, thinking of this problem as a search in the space of functions is intimidating; it is practically impossible to solve the problem if you insist on thinking of it in terms of functions.
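The count in the toy example can be checked directly:

```python
# Counting policies in the toy example: 10 states, 5 admissible actions,
# 3 time periods.
n_states, n_actions, n_periods = 10, 5, 3

# A policy for one period assigns one of the 5 actions to each of the
# 10 states, so there are 5^10 such functions.
functions_per_period = n_actions ** n_states

# A full policy picks one such function per period, independently.
total_policies = functions_per_period ** n_periods

print(functions_per_period)  # 9765625
print(total_policies)        # 931322574615478515625, i.e. 5^30
```

Nearly 10^21 candidate policies for a problem this small is exactly why brute-force search over functions is hopeless.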
So our goal is now going to be to see if we can somehow get to the value of the optimal cost implicitly, either by cleverly evaluating the function only at relevant points, or by optimizing not over the space of functions but over the space of actions in some way. The cornerstone of this reduction is what is called the principle of optimality. The principle of optimality in dynamic programming states the following; let me state the actual principle and then explain what it means. Let pi*, denoted (mu_0*, ..., mu_{n-1}*), be an optimal policy for the dynamic programming problem that we have stated on the left. We will make a technical assumption: assume that when using pi*, a given state x_i occurs at time i with positive probability. Now consider the following subproblem: we are at x_i at time i and want to minimize the cost-to-go from time i to time n. So it is as if your decision problem has actually started at state x_i at time i. The original problem started at time 0 with state x_0, but we are not looking at that problem; we are looking at the subproblem which starts at time i from a nominal state x_i, and it continues from time i until the original time horizon that you had fixed, that is, until time n itself.
The cost then is the cost you would incur from time i up to time n; this is what we call the cost-to-go. The cost-to-go is an expectation: you still have the terminal cost g_n(x_n), plus the sum from k = i to k = n-1 of g_k(x_k, u_k, w_k). Remember that u_k is to be evaluated as a function of x_k, as before. So this now looks like any other dynamic programming problem, except that its time horizon is n - i rather than n, and it starts from state x_i. Now here is the result: the truncated policy, denoted (mu_i*, mu_{i+1}*, ..., mu_{n-1}*), is optimal for the above subproblem. So what is the principle of optimality saying? Let us go through this carefully. Suppose pi* is your optimal policy. Consider the subproblem in which you start at initial state x_i at time i, and you want to minimize the cost-to-go, the cost you incur starting from time i and going up to time n. For this problem, what is the optimal policy? The principle of optimality says: just look at your original policy and truncate it. The original policy has a function for each time instant from 0 to n-1; you look at the functions from time i to n-1.
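Written out in symbols, using the same notation as before, the cost-to-go of the tail subproblem starting at state x_i at time i is:

```latex
J_i(x_i) \;=\; \mathbb{E}\!\left[\, g_n(x_n) \;+\; \sum_{k=i}^{n-1} g_k(x_k, u_k, w_k) \,\right],
\qquad x_{k+1} = f_k(x_k, u_k, w_k),
```

and the principle of optimality asserts that the truncated policy (mu_i*, ..., mu_{n-1}*) attains the minimum of this cost-to-go over all admissible policies for the subproblem.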
So look at the truncation of this particular policy, (mu_i*, ..., mu_{n-1}*); that truncated policy is actually optimal for the subproblem. What is this effectively saying in plain English? Suppose we think of the problem of finding the shortest path from, say, Mumbai to Delhi, and suppose you have found the shortest path; that means you have found your optimal policy, which says go to this town, then to that town, and so on. Say the optimal path from Mumbai to Delhi passes through Jaipur, for instance. What does the principle of optimality say? If you have found the shortest path from Mumbai to Delhi, and that path passes through Jaipur, then the leg of your journey that goes from Jaipur all the way to Delhi must also be optimal for the problem of finding the shortest path from Jaipur to Delhi. We were not trying to solve the Jaipur-to-Delhi problem at all, but if your shortest path passes through Jaipur, then the shortest path you have found from Mumbai to Delhi must also, in that leg, give you the shortest path from Jaipur to Delhi. This is the principle of optimality.
We have encountered something like this in shortest path problems, in linear programming, and so on, but this being a stochastic problem, the statement here is a little more general. It is not as simple as saying the shortest path passes through such-and-such point, because the states you end up in, or the cities you end up passing through, depend on noise. So we need to put in a few qualifiers, which is what we have done here: qualifiers about the occurrence of a state with positive probability. We can only make the statement for states that do occur with positive probability, but once a state occurs with positive probability under the optimal policy, what we can be sure of is that, starting from that state until the very end of the time horizon, the truncated version of the optimal policy continues to remain optimal. What this means is that your way of thinking about dynamic programming can be broken down a little. We do not need to think of finding these n functions mu_0 to mu_{n-1} all together; rather, we can think of the problem in chunks. Now, chunks does not mean that you treat each time period separately; the goal is not to remove the interdependencies between time periods. Rather, we think of the problem starting from some time i until the very end, and then move that time i backwards. That would be the way we approach this particular problem; I will elaborate on it in a moment. This is essentially the principle of optimality, so let us see how we can employ it for our inventory control problem.
So what does the principle of optimality give us? We can read it as saying: whatever policy is optimal for the entire problem will continue to be optimal for the problem starting at any time i until the very end. So why not look at the extreme version of this, where you take i to be n-1 itself: you start from the tail of the problem and work backwards until you get to the initial time instant. Let us see how that works; this is the principle of optimality applied to inventory control. Let us start from the tail subproblem of length 1. Now remember, this is actually not just one subproblem but multiple subproblems, because there is a subproblem for each state x_{n-1}: we have fixed the time instant to be n-1, but the state also has to be specified, and it is notional, it can be any given state at that time. So we need to do our calculation starting from any state x_{n-1}. Suppose, then, that at the beginning of period n-1 the stock of the item is x_{n-1}, that is, you have x_{n-1} units of the item. Now, clearly, no matter what has happened in the past, the inventory manager should order the amount of inventory that minimizes the cost from time instant n-1 until the very end.
So what is the cost? Since you are starting at time instant n-1, the cost you incur is the cost of ordering, storage, and so on, incurred at that time: r(x_{n-1}) + c(u_{n-1}), the cost associated with that time instant, plus the terminal cost associated with time instant n. The inventory manager simply has to look at these two terms: the cost in the (n-1)-th time period and the terminal cost. So the optimal quantity to order is the solution of a minimization. Remember, we are now at time instant n-1, and we are given that we are starting from some state x_{n-1}. So for us this x_{n-1} is not a random variable anymore; it is given to us, so technically everything here is conditioned on x_{n-1}. Given x_{n-1}, the term r(x_{n-1}) is deterministic; likewise, u_{n-1} is to be chosen as a function of x_{n-1}, so that quantity is also deterministic. This optimization is now simply a vector optimization: although there is an expectation here, we are not optimizing over functions. All the randomness involved is in x_n, and that is because x_n = x_{n-1} + u_{n-1} - w_{n-1}, where w_{n-1} is the random term. So I can simplify a few terms here. Firstly, the term that depends only on x_{n-1} is just an additive constant; it has no effect on the minimization, so I can pull it out of the minimum. Likewise, c(u_{n-1}), being deterministic given the action, comes out of the expectation. So, putting everything together, here is what we are minimizing.
Putting everything together, what we are looking at is minimizing r(x_{n-1}) plus the minimum over u_{n-1} of c(u_{n-1}) plus the expectation of R(x_{n-1} + u_{n-1} - w_{n-1}), given x_{n-1}. And remember, we always want to order a non-negative amount, so the minimization is constrained to u_{n-1} >= 0; that was our constraint on the action. Let us denote this whole quantity by J_{n-1}(x_{n-1}). What is this quantity? It is the optimal cost you would incur if you started from state x_{n-1} at time n-1. Why is it the optimal cost? Because you have taken the best decision you could have taken starting from x_{n-1}: the optimal thing to do is to choose the action u_{n-1} that minimizes this particular cost. This J_{n-1} is called the value function at time n-1. Naturally it is a function of x_{n-1}, and the reason is that we started our calculation by letting the stock be at some nominal level x_{n-1}; so naturally the optimal cost you incur from that period onwards is a function of that level x_{n-1}. Next, let us look at the tail subproblem of length 2.
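Here is a toy numerical sketch of this length-1 tail subproblem. The ordering cost c, the holding/shortage cost R, and the demand distribution below are invented for illustration, and the additive r(x_{n-1}) term is dropped since it does not affect the minimizer:

```python
# Toy sketch of the length-1 tail subproblem for inventory control.
# All numbers are made up for illustration.
c = 1.0                            # per-unit ordering cost
demand = {0: 0.3, 1: 0.4, 2: 0.3}  # P(w_{n-1} = w)

def R(x):
    """Holding cost for positive stock, shortage penalty for backlog."""
    return 2.0 * x if x >= 0 else -4.0 * x

def J_n_minus_1(x, max_order=5):
    """Value function at time n-1: best cost-to-go from stock level x,
    together with the minimizing order quantity."""
    best_cost, best_u = float("inf"), None
    for u in range(max_order + 1):  # orders must be non-negative
        expected_R = sum(p * R(x + u - w) for w, p in demand.items())
        cost = c * u + expected_R   # c(u) comes out of the expectation
        if cost < best_cost:
            best_cost, best_u = cost, u
    return best_cost, best_u        # optimal cost and mu*_{n-1}(x)
```

Note that the minimization runs over a handful of actions, not over functions: the function mu*_{n-1} emerges only because we repeat this small optimization for each stock level x.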
Before moving to the tail subproblem of length 2, notice what we have got here. We have, of course, the optimal value of starting from any state x_{n-1} that you could potentially reach, but in addition we have also got the optimal action to be taken as a function of that state. The minimization over u_{n-1} here, the arg min, gives you the optimal action as a function of x_{n-1}: for the nominal x_{n-1} that you have chosen, the optimal action is the minimizer. So u_{n-1}* is implicitly being obtained as a function of x_{n-1}. This dependence of the optimal solution u_{n-1}* on the parameter x_{n-1} defines for you the function mu_{n-1}*, and it will turn out that this is also the corresponding component of the optimal policy for the overall problem. So implicitly we are getting the value function, and in the process of calculating the value function we are also obtaining the optimal inventory policy to be followed from that time onwards.
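Looking ahead to how the tail subproblems chain together, here is a sketch of the full backward recursion on a toy instance. The state range, costs, and demand distribution are all invented, and the per-period r(x_k) term is again omitted for brevity:

```python
# Backward recursion: starting from the terminal cost, compute the value
# function J_k and the optimal order policy mu_k* for k = n-1 down to 0.
# All numbers here are invented for illustration.
c = 1.0                            # per-unit ordering cost
demand = {0: 0.3, 1: 0.4, 2: 0.3}  # P(w_k = w), same every period
states = list(range(-3, 6))        # toy range of stock levels
n = 3                              # time horizon

def R(x):
    """Holding cost for positive stock, shortage penalty for backlog."""
    return 2.0 * x if x >= 0 else -4.0 * x

def clamp(x):
    """Keep the next state inside the toy state range."""
    return max(min(x, 5), -3)

J = {n: {x: R(x) for x in states}}  # terminal cost g_n
policy = {}
for k in range(n - 1, -1, -1):      # tail subproblems, moving backwards
    J[k], policy[k] = {}, {}
    for x in states:
        best_cost, best_u = float("inf"), None
        for u in range(6):          # only non-negative orders allowed
            expected = sum(p * J[k + 1][clamp(x + u - w)]
                           for w, p in demand.items())
            cost = c * u + expected
            if cost < best_cost:
                best_cost, best_u = cost, u
        J[k][x], policy[k][x] = best_cost, best_u
```

Here policy[k][x] is exactly the function mu_k* evaluated at stock level x: the recursion builds the optimal policy as a by-product of computing each value function, which is the plan the next part of the lecture develops.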