So, this way of writing the problem, as I said, is not well posed. Because it is not well posed, the right way of posing the problem is to say that we are not looking for actions u_0 to u_{N-1}, but rather for functions. Let me denote these functions by mu_0 to mu_{N-1}, chosen such that the amount of stock to be ordered at time k is u_k = mu_k(x_k) for all values of x_k. So, whatever the value of the stock available at the beginning of the time period, you will be able to find u_k as a function of that value once you know mu_k. What we are looking for are these functions mu_0 to mu_{N-1}; once they are fully specified, they let you pick your action, the amount of stock to be ordered, as a function of the information that will be present. Therefore, the correct way of posing the problem is: minimize the expected cost, which includes the terminal cost plus all the running costs,

E[ R(x_N) + sum_{k=0}^{N-1} ( R(x_k) + c(u_k) ) ],

by choosing the functions mu_0 to mu_{N-1}. Mathematically, you are now optimizing this cost function by choosing a sequence of functions. The reason we needed to do this is that the demand is something that will be realized in the future, and its value is not known to us. So, we have to plan for every possible value, and when you plan for every possible value, what you are effectively doing is finding such a sequence of functions. Now, consider another case: suppose, hypothetically, someone gave you the exact value of the demand. In that case the expectation I have written here would be removed, and you would know the exact value of the stock present at the beginning of each time period. You could then plan to take a particular action, because you know the exact state; you do not need to plan for every possible state, and you can simply find the action for the particular state, or the particular stock level, that is going to be realized. That kind of case, where the demand is known, is the case where the noise is deterministic. It is really not noise anymore: the demand is deterministic, its value is known in advance, and all you are doing is planning for specific, known events that are going to happen in the future. That is a much easier version of this problem. The more general version involves taking these decisions without knowing what the information is going to be. This mu_0 to mu_{N-1} that I have written here is what is called a policy, and it is often denoted by pi.
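To make this concrete, here is a minimal Python sketch (not from the lecture) of what "choosing functions rather than numbers" looks like for the inventory example. The order-up-to form and the target level are purely illustrative assumptions, not the optimal policy:

```python
# A policy is a list of functions mu_0, ..., mu_{N-1}: each maps the
# observed stock level x_k to an order quantity u_k = mu_k(x_k).
def make_order_up_to(target):
    # Illustrative "order up to target" rule; the target is an assumption,
    # not something derived from the optimization.
    return lambda x_k: max(target - x_k, 0.0)

N = 4
policy = [make_order_up_to(10.0) for _ in range(N)]  # mu_0, ..., mu_{N-1}

# Whatever stock level materializes, the policy prescribes an action:
print(policy[0](3.0))   # -> 7.0: order 7 units when 3 are on hand
print(policy[0](12.0))  # -> 0.0: order nothing when stock already exceeds 10
```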
Let me mention one more point here. The distinction between taking a particular action, the amount of stock to be ordered, and planning for every possible level of stock is the distinction between what we call, in stochastic control and in games, actions versus strategies. What we have effectively done, because we do not know the information that is going to be realized in the future, is this: we are not committing to or deciding on a particular action to be taken; rather, we are coming up with a strategy, which says, "if I have this information, this is what I will do; if I have that information, that is what I will do." So, these functions mu_0 to mu_{N-1}, your policy, constitute what is called a strategy; it involves coming up with these plans. Whereas u_0 to u_{N-1}, the actual decisions you are going to take, constitute actions. Our problem, therefore, is to come up with these strategies, or these policies. Every time a problem involves noise, however simple or complex the noise may be, the problem shifts from the space of actions to the space of strategies: because we do not know the value of the noise in advance, you cannot plan for specific actions. The problem of choosing actions is simply a problem of choosing vectors: it is the problem of choosing the N vectors u_0 to u_{N-1}, and I can stack them up and think of this as one composite decision problem involving one long vector (u_0, ..., u_{N-1}). However, the problem above is not a problem of choosing vectors; it is a problem of choosing functions. So, the space of the problem itself has changed. The problem of choosing actions, which is the problem you would have if, say, the w_k's were deterministic, is simply a problem of vector optimization, whereas this problem is a problem of optimization over functions: it is a problem of finding the right sequence of functions, not just a sequence of vectors. This distinction is actually what makes dynamic programming, or dynamic optimization, significantly harder than static optimization. As I go a little further, I will explain how we can reduce this problem, which involves deciding these N functions, to a setting where we have to decide only actions, and from those actions recover what the functions should be. All of that will happen subsequently as we go further in this course.
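In code, the action-versus-strategy distinction is the difference between committing to one vector in advance and supplying a rule that reacts to whatever state is realized. A small sketch with purely illustrative numbers (the demand range, targets, and horizon are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 3
x_open = x_closed = 2.0

# Open loop (actions): one vector (u_0, ..., u_{N-1}) committed in advance.
u_open = [5.0, 5.0, 5.0]

# Closed loop (strategy): a function evaluated on the realized state.
mu = lambda x: max(8.0 - x, 0.0)

for k in range(N):
    w = float(rng.integers(0, 9))            # demand, unknown in advance
    x_open = x_open + u_open[k] - w          # ignores the realized state
    x_closed = x_closed + mu(x_closed) - w   # adapts to the realized state
```

The open-loop plan cannot correct itself after an unexpectedly large or small demand; the closed-loop rule re-evaluates at every step, which is exactly what the policy buys us.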
Now, to summarize what we have said in this example, let me write for you a general dynamic optimization problem, or a general dynamic programming problem, and its main constituents. The first constituent is the state of the system, denoted by x_k; in the previous problem it was the inventory level. The state is whatever is needed to capture the configuration of the system, some description we have of the system we are dealing with. In the example we saw, the system was the inventory in a shop, so it was enough to keep track of just the amount of inventory present at the beginning of the time period, and as a result we took that as the state. Now, what exactly you define as the state is part of problem modeling; there is usually more than one way of defining the state, but try to always keep the state as simple as possible. For instance, in the previous problem, since we were deciding the stock over these N time periods, we could have taken the state to be the entire history of stock levels up to time k. That entire history could be taken, but it has no bearing on what we are actually looking for, which is the amount of stock to be ordered: to decide how much to order, it is enough to know the stock level at the latest time period, and the entire history is not relevant for that. As a result, we picked a parsimonious definition of the state: simply the level of stock at the current time period. The second constituent is the state evolution, or dynamics. The state evolves based on the action that you take and the noise that comes up in the system. The way we express this is to write

x_{k+1} = f_k(x_k, u_k, w_k),

where u_k is the action at time period k and w_k is the noise at time period k. So, as a function of the action you take at time k, the noise that arises at time k, and the state at time k, you get the state at the next time period. Because w_k is random, the sequence of states will be random; it is not something whose value you would know at the beginning of the problem, and as a result the actions you end up taking will also be random, because they are chosen as a function of the information you have. Usually we take the w_k's to be independent random variables, in many cases also identically distributed, or zero mean, and so on, but that depends on the problem. They are independent of each other, and their distribution cannot be influenced through your actions.
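For the inventory example, assuming the usual inventory balance from earlier in the lecture, the dynamics f_k take a particularly simple, time-invariant form. A sketch:

```python
def f_k(x_k, u_k, w_k):
    # Inventory balance: next stock = current stock + order - demand.
    # (Allowing negative stock, i.e. backlogged demand, is a modeling
    # choice; a lost-sales model would instead clip the result at zero.)
    return x_k + u_k - w_k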
The third constituent is a control constraint, a constraint on actions. In our case, for example, the constraint was u_k >= 0, but more generally it could be any constraint, and moreover it could depend on the state you are in: it could be something like u_k in U_k(x_k), which is the most general way of writing the constraint. As a function of the state x_k that you are in, the actions you can potentially take have to lie in the set U_k(x_k). The fourth constituent is the additive cost form. The cost is written as

g_N(x_N) + sum_{k=0}^{N-1} g_k(x_k, u_k, w_k).

Remember, we take actions starting at time 0 up to time N-1, and the problem ends at time N; at time N, based on whatever the state is, you incur a terminal cost g_N(x_N), and no further action is taken. So, although I have said that we take actions at N time instants, those instants are denoted 0 to N-1. The inclusion of w_k in the stage cost is optional; you can also write g_k(x_k, u_k), without loss of generality. In the problem above, g_N(x_N) was simply R(x_N), and g_k(x_k, u_k, w_k) was R(x_k) + c(u_k). Having formulated this basic problem, our goal becomes the following: we would like to decide a policy pi, denoted mu_0 to mu_{N-1} as above, which maps states to control actions, so that u_k = mu_k(x_k) for all values of x_k and for all k. Now, what kind of policies are admissible? Remember, we have constraints on the control actions, so an admissible policy is one in which mu_k(x_k) always belongs to U_k(x_k), for all x_k and all k; that is, an admissible policy is one which satisfies the control constraints. Now, how does the problem begin? It begins with an initial state, say the inventory level at the start of the time horizon. Given an initial state and a policy mu_0 to mu_{N-1}, how does the system behave? It takes the initial state and uses the policy at time 0 to decide the action u_0 to be taken at time 0; u_0 then feeds into the state dynamics, and what you get is the state at time 1. More generally, at time k the next state emerges as x_{k+1} = f_k(x_k, u_k, w_k), where u_k itself is simply mu_k(x_k).
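Putting the four constituents together, here is a minimal sketch of one closed-loop trajectory of the general problem. The names f, g, g_N, admissible, and sample_w are placeholders for the problem data (dynamics, stage cost, terminal cost, constraint check, and noise distribution), not part of any library:

```python
def rollout(policy, x0, f, g, g_N, admissible, sample_w, rng):
    # One closed-loop trajectory: at each step the policy supplies
    # u_k = mu_k(x_k), admissibility u_k in U_k(x_k) is checked, the
    # stage cost g_k(x_k, u_k, w_k) accrues, and the state moves through
    # the dynamics x_{k+1} = f_k(x_k, u_k, w_k).
    x, cost = x0, 0.0
    for k, mu_k in enumerate(policy):
        u = mu_k(x)
        assert admissible(k, x, u), "policy violated the control constraint"
        w = sample_w(rng)
        cost += g(k, x, u, w)
        x = f(k, x, u, w)
    return cost + g_N(x)   # add the terminal cost g_N(x_N)
```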
So, because the action u_k is being chosen as a function of the state, you can substitute u_k = mu_k(x_k) into the dynamics, and the next state is then defined through the dynamics and the policy mu_k that you have chosen. In fact, you can do this substitution everywhere, including in the cost function, which becomes

E[ g_N(x_N) + sum_{k=0}^{N-1} g_k(x_k, mu_k(x_k), w_k) ];

I have replaced u_k there by mu_k(x_k). As you can see, the cost function now depends on two things: on pi, which is your policy, and on the initial state, because the state evolution your system sees is determined by the dynamics and the policy that has been chosen. Once I fix the initial state, that is, the state where I am starting from, and the policy that I have chosen, the noise distribution is what determines the evolution of the state sequence. What we have, therefore, is the expected cost evaluated over this noise sequence, and it depends on pi as well as on x_0. We denote it J_pi(x_0): this quantity is the cost incurred by policy pi starting from state x_0. The problem we want to solve, then, is to find a policy pi* in a space capital Pi, where capital Pi is the set of admissible policies, such that

J_{pi*}(x_0) <= J_pi(x_0) for all pi in Pi;

that is, the cost incurred by pi* is the least among the costs of all policies in the set of admissible policies. Notice that this appears to be a problem we would need to solve for each value of x_0, because we have not gotten rid of the initial state yet: we would have a cost for each initial state, and it appears prima facie that pi*, the optimal policy, could vary with x_0, because I have fixed one x_0 and found a pi*. However, the nature of the problem is such that it is typically possible to find a pi* that does not depend on x_0 itself. You can choose any arbitrary x_0 to start with, and you end up finding a policy that works for any initial state. That is how it often happens, because the policy itself is a plan for every possible state: it is not merely for the specific state you are looking at, so you usually get a policy that works for every possible initial state. The quantity J_{pi*} is also denoted J*(x_0); that is simply another way of writing the cost of the optimal policy, and it has a name: it is called the optimal value function.
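Before moving on, note that J_pi(x_0) is an expectation over the noise sequence, so for any given policy it can be estimated numerically by averaging many simulated trajectories. A sketch, reusing the rollout function from the earlier sketch (the argument names are the same placeholders as before):

```python
import numpy as np

def estimate_J(policy, x0, f, g, g_N, admissible, sample_w,
               n_runs=20000, seed=0):
    # Monte Carlo estimate of J_pi(x0): simulate the closed-loop system
    # under policy pi many times, with independent noise sequences
    # w_0, ..., w_{N-1}, and average the realized total costs.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_runs):
        total += rollout(policy, x0, f, g, g_N, admissible, sample_w, rng)
    return total / n_runs
```

Comparing estimate_J for two candidate policies approximates the comparison J_pi(x_0) versus J_pi'(x_0); the actual minimization over the whole space of admissible policies is what dynamic programming will address in the coming lectures.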
So, the optimal value function is simply a function that tells you what the optimal value of the dynamic optimization is going to be as a function of the initial state. It is called the optimal value function, or the optimal cost function: the optimal cost as a function of the initial state. What we will do in the next class is find ways of computing this optimal value function. I will pause here, and we will resume in the next class.