Welcome back. We saw in the previous lecture that for the inventory control problem, the question "how much inventory should I order?" is not well posed. The amount of inventory to be ordered depends on the amount of stock available, and that is not known in advance, because it in turn depends on the demand that evolves from the start of the time period up to that point. As a consequence, at the planning stage the right question to ask is not "how much inventory should I order?" but "what is my plan of action?" In other words: how much inventory am I going to order for each possible value of the stock level, i.e., for each value of the state that I could encounter in real time, when I am actually in the field. So what we can compute at the start of the problem is not the actual actions but these functions. We call these functions strategies or decision rules, and the sequence mu_0, ..., mu_{N-1} is called a policy. Another term that is often used is control law; mu_0, ..., mu_{N-1} is often called a control law.

Let us see an example of such a control law. It can be shown that, for some reasonable forms of the cost function, the optimal policy at any time k takes the following form:

mu_k(x_k) = s_k - x_k if x_k < s_k, and 0 otherwise,

where s_k is a suitable threshold. How do I interpret this policy? It is telling me a plan of action. It defines a threshold s_k and says the following: if my inventory x_k falls below the threshold s_k, then I order the deficit s_k - x_k, so as to top up to the level s_k; if my inventory is at or above s_k, then I do not need to do anything, and I do not order any further. In other words, what you get is a sequence of such thresholds: you simply monitor whether your inventory has fallen below the current threshold; once it does, you top it up to that threshold, and then carry on until it falls below the threshold again. In finite horizon problems, this threshold s_k, which is the parameter defining the policy, tells you what to do at each time step as a function of the inventory you see, and it would in general differ from one time step to the next; the correct threshold to maintain varies with time.
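To make the threshold rule concrete, here is a minimal Python sketch; the function name and the example numbers are mine, chosen purely for illustration:

```python
def mu_k(x_k: float, s_k: float) -> float:
    """Threshold (order-up-to) decision rule.

    If the inventory x_k is below the threshold s_k, order the deficit
    s_k - x_k so as to top up to the level s_k; otherwise order nothing.
    """
    return s_k - x_k if x_k < s_k else 0.0

# With threshold 10: at stock 4 we order 6; at stock 12 we order nothing.
print(mu_k(4.0, 10.0))   # 6.0
print(mu_k(12.0, 10.0))  # 0.0
```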
The inventory then tends to behave as follows. At any time k, when you order, you bump your inventory up to s_k; over that time period demand gets realized and the inventory falls. At the next time step k+1 you order again and bump it up to s_{k+1}; the inventory falls again, you refill again, and so on. This is how this particular policy works. Once again, at the risk of belaboring the point: when we are choosing mu_k we are not choosing an exact order quantity. Rather, we are specifying the plan for every possible value of the inventory; for every value of x_k we have a plan that tells us exactly what we should do.

With this, we can go back to the original stochastic control problem that we had written down, not the example but the general problem. What exactly are we solving here? The goal, we said, is to minimize the total cost incurred, and we wanted to do this by taking a sequence of actions. But then we realized that we cannot actually fix a sequence of actions at the time instant for which this problem is written: the problem is posed even before any of the uncertainty is realized, so it would be asking us to choose actions before the relevant information is available to us. What we can choose, therefore, are plans of action, i.e., policies. So the problem is to choose a policy that minimizes the total cost. A general stochastic control problem thus takes the following form: minimize the expected total cost, consisting of a terminal cost plus the stage-wise costs, over all policies mu_0, ..., mu_{N-1}, where each mu_k is a function that maps the state space at time k to the actions you can choose. For example, if X_k denotes the state space at time k and U_k the space of actions available at time k, then mu_k maps X_k to U_k. We have already seen one example of this: in the inventory control problem the state could be any real number, so the state space was R, and we were only allowing non-negative orders, so U_k was the non-negative real line.

There are also problems in which the action you can take is constrained by the state you are actually in. In such problems one defines a set U-tilde_k(x_k), the set of feasible actions at time k when the state is x_k: when the true state is x_k and you are at time k, the action you choose has to come from the set U-tilde_k(x_k).
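Written out, the general problem being described takes the following standard form; since the board itself is not part of the transcript, this is my reconstruction, using g_k for the stage-wise costs and g_N for the terminal cost:

```latex
\min_{\mu_0, \ldots, \mu_{N-1}} \;
\mathbb{E}\left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k\bigl(x_k,\, \mu_k(x_k),\, w_k\bigr) \right],
\qquad \mu_k : X_k \to U_k,
```

with the state evolving as x_{k+1} = f_k(x_k, mu_k(x_k), w_k).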
In that case, any policy that you choose must respect this constraint: for a policy to be admissible, it must map each state to an action that is present in the feasible set. That is, mu_k must satisfy that mu_k(x_k) lies in U-tilde_k(x_k). This is required of any policy you choose.

So in the inventory control problem we have seen all of these elements. There is a system that evolves in discrete time, whose evolution is given by a dynamical system of the form x_{k+1} = f_k(x_k, u_k, w_k). In our case this was x_{k+1} = x_k + u_k - w_k. The w_k were noise: independent across time, and not known while choosing u_k. We also saw that there was a constraint on the action: recall that we had u_k >= 0. In this case the constraint does not actually change with the state of the system; it is the same for every time and every state. But in general there could be constraints that are both state dependent and time dependent. The key element is that there can be constraints on actions, and any policy must satisfy these constraints. The cost also had a very nice additive form: the terminal cost in our problem was R(x_N), and the stage-wise cost was R(x_k) + c u_k. You will notice that there is no w_k in this cost, whereas the general stage-wise cost g_k(x_k, u_k, w_k) does involve w_k; that is quite alright, the form with w_k is simply the more general version of the problem.

So the lesson, in short, is that we now have a complete formulation of a stochastic control problem in which a sequence of decisions is taken over finitely many time steps. The elements of the problem are: an additive cost, and a sequence of actions that are chosen in real time. But what we choose in advance is a plan for the actions, namely the policy mu_0, ..., mu_{N-1}, where mu_k maps the state at time k to an action at time k. This policy must also satisfy the constraints: mu_k(x_k) has to be in U-tilde_k(x_k), the set of feasible actions at time k when the state is x_k. Now, when we choose a policy mu_0, ..., mu_{N-1}, this policy is chosen even before any uncertainty gets realized, i.e., even before the first piece of noise, w_0, gets realized.
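As a small sketch, the ingredients of the inventory instance might look as follows in Python; the holding and shortage rates and the per-unit ordering cost c are illustrative values that I have assumed, not numbers from the lecture:

```python
def f(x_k: float, u_k: float, w_k: float) -> float:
    """Dynamics x_{k+1} = f_k(x_k, u_k, w_k): stock plus order minus demand."""
    return x_k + u_k - w_k

def R(x: float) -> float:
    """A common convex holding/shortage cost (rates assumed for illustration)."""
    h, p = 1.0, 3.0   # holding rate and shortage rate (assumed)
    return h * max(x, 0.0) + p * max(-x, 0.0)

def stage_cost(x_k: float, u_k: float, c: float = 2.0) -> float:
    """Stage-wise cost R(x_k) + c * u_k; note that no w_k appears here."""
    assert u_k >= 0.0, "constraint on actions: orders must be non-negative"
    return R(x_k) + c * u_k
```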
So when we choose a policy, here is how the system evolves. First, the initial state x_0 gets realized. Then, as a function of this initial state, you choose an action u_0, and the function that tells you which action to choose is the policy: using mu_0, we get our action u_0 = mu_0(x_0). Then nature pitches in with w_0; nature introduces the noise into the system. As a result of the action you have chosen, the previous state, and the noise, you get your next state: x_1 = f_0(x_0, u_0, w_0), where u_0 is a function of x_0. This then goes on: knowing x_1, you choose u_1 as a function of it, and the function that tells you what to choose is mu_1, which gives you u_1 = mu_1(x_1); then comes w_1, which results in x_2 = f_1(x_1, mu_1(x_1), w_1); and so on. This is what happens at each time step. Eventually you reach time step N-1, where you are in state x_{N-1}; you use the mu_{N-1} that has been decided to choose u_{N-1}; then w_{N-1} comes in, and what you get is your final state x_N = f_{N-1}(x_{N-1}, mu_{N-1}(x_{N-1}), w_{N-1}).

The important thing to note here is the role that the policy plays. The policy does two things. One, of course, is that it chooses the action for you: here, for example, mu_0(x_0) is the action u_0 that you need to take. But in addition to this, the policy also decides the probability distribution of the state at the next time step. If you look at x_1, it is a random variable: it is the state that evolves at the next time step, and it is a function of the noise, but also of the previous state and of the action you have chosen. So by choosing mu_0 in an appropriate way, one can decide with what probability the various states are going to be realized; one can influence, indirectly, the probability distribution of the state through the choice of the policy. In other words, the policy determines the conditional probability distribution of the next state given the previous state. This distribution depends on the policy you have chosen, i.e., on mu_0, ..., mu_k. One thing I had forgotten to mention: a common shorthand for the policy mu_0, ..., mu_{N-1} is the Greek letter pi. So this distribution will often be denoted P^pi(x_{k+1} | x_k), the conditional probability of the next state given the current state.

So with this, what we have done is set up the problem of stochastic control where we are taking decisions in a sequence, and we have set it up formally as a problem of deciding a sequence of functions; that sequence of functions is what is called a policy.
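The evolution just described is exactly a closed-loop simulation, which the following self-contained Python sketch makes explicit; the thresholds, the demand distribution, and all numbers are placeholders I have chosen for illustration:

```python
import random

def simulate(x0, policy, f, stage_cost, terminal_cost, noise, N):
    """Roll out one trajectory under the policy (mu_0, ..., mu_{N-1})."""
    x, total = x0, 0.0
    for k in range(N):
        u = policy[k](x)             # u_k = mu_k(x_k): the decision rule acts
        w = noise()                  # nature pitches in with the noise w_k
        total += stage_cost(x, u)    # accumulate the stage-wise cost
        x = f(x, u, w)               # x_{k+1} = f_k(x_k, u_k, w_k)
    return total + terminal_cost(x)

# Inventory instance with made-up numbers: dynamics x + u - w, cost R(x) + c*u.
R = lambda x: max(x, 0.0) + 3.0 * max(-x, 0.0)   # holding/shortage cost (assumed)
s = [10.0, 8.0, 6.0]                             # time-varying thresholds s_k (assumed)
policy = [lambda x, s_k=s_k: max(s_k - x, 0.0) for s_k in s]
cost = simulate(x0=5.0, policy=policy,
                f=lambda x, u, w: x + u - w,
                stage_cost=lambda x, u: R(x) + 2.0 * u,
                noise=lambda: random.uniform(0.0, 5.0),   # placeholder demand
                terminal_cost=R, N=len(s))
print(cost)   # total cost of one random trajectory
```

Averaging this over many independent runs would estimate the expected total cost of the policy, which is precisely the quantity the problem asks us to minimize.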
What we will do in the next lecture is look at an alternative model, where the state at the next time step is not given in this kind of explicit way, as a function of the previous state, but rather in an abstract way: you only know the probability distribution of the next state given the current state and the current action. This is the model that is often used in the field of operations research, and it goes by the name of Markov decision processes. We will look at this model in the setting where you have finitely many states and finitely many actions. So, this is the agenda for the next video. Thank you.