Now let us go to the tail problem of length 2; we are moving to time instant n - 2. Suppose we are at time n - 2 and the inventory level is x_{n-2}; that is the nominal level we are at. What should the inventory manager do now at time instant n - 2? Whatever decision he takes will incur a cost for that particular period: if he decides to order some goods, he incurs the cost of ordering, of under-ordering or over-ordering, and all of that cost for that period is incurred. But that decision also impacts the state that will occur in the future, and the state that will occur in the future is x_{n-1}. Now x_{n-1} gets realized as a function of x_{n-2}, u_{n-2}, and the noise w_{n-2}. But the previous calculation that we have done has told us what the optimal cost would be if you were to reach any such state x_{n-1}: as a function of the state that you could potentially reach, we have already computed the optimal cost. We already know that if I were to reach a certain inventory level the next day, what the optimal cost from there is going to be. So the way the inventory manager can think is that he has to minimize two things: the cost he will incur in this period, and the cost he is going to incur if he reaches a particular state.
So effectively the value function J_{n-1}, which we have calculated for stage n - 1 onwards, becomes the terminal cost for the calculation at stage n - 2. At n - 2 it is almost as if you have one stage to go: you incur a cost in this particular period, and there is a terminal cost, which is already given to you through your value function J_{n-1} as a function of the state you would end up in. So the optimal inventory to order minimizes the expected cost in period n - 2 plus the expected cost in period n - 1, assuming an optimal policy is used in period n - 1. What does this equal? It equals the expectation of r(x_{n-2}) + c(u_{n-2}) plus, if you were to reach some state x_{n-1}, the optimal cost you would incur from there, which is exactly what we just calculated: J_{n-1}(x_{n-1}). This is why doing the calculation in the previous step for every possible value of x_{n-1} is useful: we now have the optimal cost starting from any possible initial state, so we can plug it in as a function of that state. That state gets realized through x_{n-2}, u_{n-2}, and w_{n-2}, and then we can calculate the expectation and so on.
We can plug that in as a function over there, and as a result we can talk of what the optimal cost at stage n - 2 is going to be. Now remember that x_{n-1} = x_{n-2} + u_{n-2} - w_{n-2}, so this can be substituted into the last term. Once again, as before, x_{n-2} is deterministic, and u_{n-2}, because it is a function of just x_{n-2}, is also deterministic. Consequently the minimization can be written as

J_{n-2}(x_{n-2}) = r(x_{n-2}) + min_{u_{n-2} >= 0} [ c(u_{n-2}) + E[ J_{n-1}(x_{n-2} + u_{n-2} - w_{n-2}) ] ].

This whole thing is a function of x_{n-2}, which is why we denote it J_{n-2}(x_{n-2}). As before, the minimization here also yields u*_{n-2} as a function mu*_{n-2} of x_{n-2}. All the noise here is in w_{n-2}, so the expectation is with respect to w_{n-2}. As you can see, once again through this substitution we have obtained the policy at time step n - 2. More generally, what we are getting is the following recursion: the value function at time k, as a function of a nominal state x_k (not necessarily the realized state), is

J_k(x_k) = r(x_k) + min_{u_k >= 0} [ c(u_k) + E[ J_{k+1}(x_k + u_k - w_k) ] ],

that is, the cost in the current period plus the expectation of the value function from the next time onwards, evaluated at the state realized in this way. This equation is essentially the cornerstone of solving such problems.
This is what is called the dynamic programming equation for the inventory control problem. These functions J_k, as I have mentioned, are called the value functions. They effectively capture for you the optimal cost of the tail problem that starts at time k from any state x_k at that instant. Although what you are looking for is the optimal cost starting at time 0 from your specific initial state x_0, this way of solving has effectively solved not just that one problem, but also the problem for every intermediate time and every intermediate initial state. This may seem wasteful, because you have effectively solved so many additional problems in order to solve one, but it turns out that this is exactly the source of the simplification: if you do it step by step, in the way illustrated here, you actually get the solution of the original problem. The way one proceeds is this: start from the last step and solve that problem; go one step back and solve that problem, substituting into it the value function from the last step; then go to step n - 3 and substitute into it the value function of step n - 2; and so on and so forth, until eventually you come to step 0, which is in fact essentially your original problem.
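The backward procedure just described can be sketched in code. Everything numerical below is a hypothetical instance (grid size, demand distribution, cost functions, and the choice of terminal value function are all assumptions for illustration, not from the lecture):

```python
import numpy as np

# Hypothetical inventory instance: integer levels 0..M, demand w_k
# taking values 0, 1, 2 with the probabilities below.
M = 10                            # largest inventory level on the grid
N = 5                             # horizon length n
demand = np.array([0, 1, 2])      # possible demand values w_k
p_w = np.array([0.3, 0.4, 0.3])   # their probabilities

def r(x):                         # per-period holding cost (assumed form)
    return 0.5 * x

def c(u):                         # ordering cost (assumed form)
    return 2.0 * u

# Terminal value function: taken here as just the holding cost (an assumption).
J = np.array([r(x) for x in range(M + 1)], dtype=float)
policy = []                       # will hold mu_k* for k = 0..N-1

for k in range(N - 1, -1, -1):    # start at the last step, go backwards
    J_new = np.empty(M + 1)
    mu = np.zeros(M + 1, dtype=int)
    for x in range(M + 1):
        best = np.inf
        for u in range(M - x + 1):                # keep x + u on the grid
            nxt = np.clip(x + u - demand, 0, M)   # lost sales below zero
            val = c(u) + p_w @ J[nxt]             # c(u) + E_w[J_{k+1}(x+u-w)]
            if val < best:
                best, mu[x] = val, u
        J_new[x] = r(x) + best    # J_k(x) = r(x) + min_u [ ... ]
    policy.insert(0, mu)
    J = J_new

print("J_0 over all initial levels:", J)
print("first-period order policy mu_0*:", policy[0])
```

After the loop, `J` holds J_0 over every initial level and `policy[k]` tabulates mu_k*, exactly the by-product discussed above: the solution for every intermediate time and state.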
The major simplification that has happened is that at every stage the optimization you are doing is no longer over a space of functions; you are optimizing only over the space of actions, i.e., over vectors rather than over functions. That is where the major simplification comes from. All the other benefits we have obtained along the way, namely the value functions and so on, are side benefits of making this particular simplification; they are not what we aimed for. What we aimed for was the collapse of complexity, from having to minimize over something like 5^10 times 5^10 times 5^10 options to something much more concrete: just a vector optimization at each step. This puts us in a position to state the dynamic programming theorem. To state it, suppose you are minimizing the objective

E[ g_n(x_n) + sum_{k=0}^{n-1} g_k(x_k, u_k, w_k) ],

subject to the dynamics x_{k+1} = f_k(x_k, u_k, w_k), over all policies pi = (mu_0, ..., mu_{n-1}), and let the initial state be x_0. The theorem is as follows: for every initial state x_0, the optimal cost J*(x_0) of the above problem is equal to J_0(x_0), where J_0(x_0) is given by the last step of the following algorithm.
Initialize the value function as J_n(x_n) = g_n(x_n), and then compute the value function at each earlier step k as

J_k(x_k) = min_{u_k in U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ],   (*)

where you have substituted x_{k+1} = f_k(x_k, u_k, w_k) using the dynamics. You do this for k = 0 up to n - 1; in other words, you actually need to do this in reverse, starting from k = n - 1 and proceeding all the way down to 0. Moreover, if u_k* = mu_k*(x_k) minimizes (*), then pi* = (mu_0*, ..., mu_{n-1}*) is an optimal policy. The boxed equation (*) is called the dynamic programming equation. It was discovered by a scientist called Bellman, so it is often also called the Bellman equation, or the Bellman optimality equation.
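The algorithm in the theorem can be written down generically for finite state, action, and noise sets. This is only a sketch; the helper's name and the toy instance at the bottom are illustrative assumptions, not from the lecture:

```python
def backward_induction(n, states, actions, f, g, g_term, w_vals, w_probs):
    """Finite-horizon DP: set J_n(x) = g_term(x), then apply the Bellman
    recursion J_k(x) = min_u E_w[ g(k,x,u,w) + J_{k+1}(f(k,x,u,w)) ]
    for k = n-1 down to 0."""
    J = {x: g_term(x) for x in states}
    policy = [None] * n
    for k in range(n - 1, -1, -1):            # proceed in reverse
        J_new, mu = {}, {}
        for x in states:
            best_val, best_u = float("inf"), None
            for u in actions(k, x):
                val = sum(p * (g(k, x, u, w) + J[f(k, x, u, w)])
                          for w, p in zip(w_vals, w_probs))
                if val < best_val:
                    best_val, best_u = val, u
            J_new[x], mu[x] = best_val, best_u
        J, policy[k] = J_new, mu
    return J, policy    # J[x0] = J_0(x0); policy[k][x] = mu_k*(x)

# Toy instance: 3 inventory levels, order 0 or 1, demand 0 or 1 w.p. 1/2.
J0, pol = backward_induction(
    n=2,
    states=[0, 1, 2],
    actions=lambda k, x: [0, 1],
    f=lambda k, x, u, w: min(max(x + u - w, 0), 2),
    g=lambda k, x, u, w: u + 0.5 * x,         # ordering + holding cost
    g_term=lambda x: 0.0,
    w_vals=[0, 1],
    w_probs=[0.5, 0.5],
)
print(J0[0], pol[0][0])   # optimal cost and first action from x_0 = 0
```

Note that each inner minimization is over a finite set of actions for a fixed state, which is exactly the "vector, not function" optimization the theorem reduces the problem to.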
So this is going to be the way of solving any dynamic programming problem. You start from the last step, as if you had a time-horizon-one problem; find the optimal action in that problem; use the optimal cost, the value function, that you get from it as the terminal cost of the problem one step before; then find the solution and the value function of that problem, move one more step back, and so on, all the way until you eventually get to time step 0, with a terminal cost which is essentially the value function for time step 1. That is how you get to the optimal solution of the dynamic programming problem. Let us dwell once on what enabled this. The cost function, if you look at it, is actually additive: you have your terminal cost plus a cost at each step. Moreover, the state equation is of the form in which the next state is determined by the previous state, the action you take in that state, and the noise at that time; it does not depend in more complicated ways on states before or after.
This structure, a separable (additive) cost function together with this kind of relation between the current state and the next state, is what allows us to make this work. This sort of cumulative cost occurs when you are computing, say, the total reward, the total cost, the total amount of money you make over a time horizon, the total distance you have to travel, or the total amount of time spent; any of these quantities which are additive across time periods can be written in this form, and once the cost is additive like this, it can be reasoned through recursively, which is precisely what we are doing here. This is what is called the dynamic programming equation. Now only one special topic remains, which is solving such problems when you have a very specific cost structure. Next we will talk about problems where the cost function is quadratic and the dynamics are linear; in that case a very simple form emerges for the optimal policy, and the optimal cost function also ends up having an extremely simple form, which is what I want to show you. So let us now apply what we know about dynamic programming to a problem with a very specific structure, where the cost is quadratic, the dynamics are linear, and the noise is independent across time instants. This is a favorite problem of control theorists and of operations research: it is what is called the problem of a linear system with quadratic cost, and we will find the optimal policy for that sort of problem.
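As a preview of how simple the form is, here is a scalar sketch with hypothetical numbers (the general matrix derivation comes later): for dynamics x_{k+1} = a x_k + b u_k + w_k and cost sum of q x_k^2 + r u_k^2 plus terminal cost q x_n^2, backward induction keeps J_k quadratic, J_k(x) = P_k x^2 + constant, with P_k given by a one-line (Riccati) recursion, and the optimal policy linear, u_k = -L_k x_k. Zero-mean noise only shifts J_k by a constant and does not affect the gains:

```python
# Scalar linear system with quadratic cost (hypothetical numbers).
a, b, q, r, n = 1.0, 1.0, 1.0, 1.0, 20

P = q                 # P_n: terminal value function J_n(x) = q x^2
gains = []
for k in range(n - 1, -1, -1):
    L = a * b * P / (r + b * b * P)                         # u_k* = -L_k x_k
    P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)  # Riccati step
    gains.append(L)
gains.reverse()
print("gain at k = 0:", gains[0])
```

With these particular numbers the gains settle quickly, so the policy looks stationary far from the horizon; the point of the sketch is only that the whole backward recursion collapses to iterating one scalar, P_k.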
Before I go into this specific problem, recall what we were talking about in the previous lecture: we were referring to a problem where you wanted to minimize, over a time horizon n, a cost comprised of a terminal cost and a cost at each step, with dynamics in which the next state was defined as a function of the previous state, the action, and some noise.