Welcome everyone. We will now begin with problems of sequential decision making. At the end of the previous lecture we saw a system with a scalar state evolving over two time steps, and what we found there was that the information we are given about the system while taking an action makes a difference to how well we can perform. This is something we will dive deeper into, and it will come up again and again as we explore this problem class. What we will define now is a general-purpose way of formulating problems where we take actions over multiple stages, one after the other, with the goal of minimizing a cost defined over the time horizon across which we take those actions. This problem class is often known as a state space model for sequential decision making. The formulation goes by different names across disciplines: it is known as a stochastic control problem, the term that is common in the control community, and as a Markov decision process, the term that is common in the operations research and computer science communities. Our model is as follows. We assume that time evolves in discrete steps; anything that happens between two time steps is not accessible to us, and what we know is only what happens at the edges of the time steps.
So we are going to assume that time evolves in discrete steps. These will be denoted by k, and k will range over 0, 1, ..., N-1, where N is some finite integer. The system we are considering is modeled using its state: x_k is the state of the system at time k, and it is assumed to encompass enough information about the system that once we know the state, we can talk about the problem completely. The main point here is that the cost function we will consider in this problem, which we will see in a moment, has to be a function of the state. So the state is a set of variables that gives you a complete description of the configuration of the system. Because we are taking decisions in sequence, at every time k there is an action to be chosen: u_k is the control action, or decision, chosen at time k. Next, w_k is a random parameter; it is essentially an external disturbance, or noise. How is u_k different from w_k? The value of w_k is chosen by nature: its distribution is chosen by nature, and we cannot affect that distribution. Finally, the N I wrote here is called the time horizon of the decision problem. Now, when we take action u_k at time k, the state of the system evolves from x_k to x_{k+1}, and it evolves according to the following equation: we assume there is a function f_k such that when you take action u_k at time k, with the state being x_k, the next state at time k+1 is given by x_{k+1} = f_k(x_k, u_k, w_k).
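As a quick illustration, the evolution x_{k+1} = f_k(x_k, u_k, w_k) can be simulated in a few lines of Python. The particular dynamics, policy, and disturbance distribution below are made up purely for illustration; they are not part of the lecture's formulation.

```python
import random

def simulate(x0, policy, dynamics, horizon, rng):
    """Roll the system forward: x_{k+1} = f_k(x_k, u_k, w_k)."""
    xs = [x0]
    for k in range(horizon):
        u = policy(k, xs[-1])               # action chosen at time k
        w = rng.uniform(-1.0, 1.0)          # disturbance drawn by "nature"
        xs.append(dynamics(k, xs[-1], u, w))
    return xs

# Hypothetical scalar system x_{k+1} = x_k + u_k + w_k with a damping policy
rng = random.Random(0)
traj = simulate(2.0,
                policy=lambda k, x: -0.5 * x,
                dynamics=lambda k, x, u, w: x + u + w,
                horizon=5, rng=rng)
print(traj)  # the realized states x_0, ..., x_5
```

Note that because w_k is random, rerunning with a different seed produces a different trajectory from the same initial state and policy.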
So, w_k being random ensures that the next state that gets realized when you take an action at time k cannot be completely determined by just the action you take and the state you were previously in. Because w_k is random, the next state is random, and hence in some sense the future is random. The function f_k is called the dynamics of the system: it tells you how the state of the system evolves with time. The goal of the problem is to minimize the total cost incurred from k = 0 up to k = N. The total cost is given as follows: it is g_N(x_N) + the sum from k = 0 to N-1 of g_k(x_k, u_k, w_k). Notice that because of the presence of w_k in each term, and because the state is going to evolve randomly throughout the problem, this cost is actually a random cost. What we have seen from expected utility theory is that when one is faced with this sort of situation, one must minimize the expected cost. So the complete problem formulation is that one needs to minimize the expectation of this total cost. Let me now tell you what these terms are. The term g_N(x_N) is a function of only the state that gets realized at time N.
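Since the cost is random, one concrete way to evaluate the objective E[g_N(x_N) + Σ g_k(x_k, u_k, w_k)] for a given policy is a Monte Carlo average over noise realizations. The quadratic costs and Gaussian noise in this sketch are illustrative choices of mine, not part of the lecture's formulation.

```python
import random

def expected_cost(x0, policy, f, g, gN, n, num_samples, seed=0):
    """Monte Carlo estimate of E[ g_N(x_N) + sum_k g_k(x_k, u_k, w_k) ]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x, cost = x0, 0.0
        for k in range(n):
            u = policy(k, x)
            w = rng.gauss(0.0, 1.0)      # disturbance drawn by nature
            cost += g(k, x, u, w)        # stage-wise cost
            x = f(k, x, u, w)            # state evolves
        total += cost + gN(x)            # terminal cost added at the end
    return total / num_samples

# Toy quadratic example (all choices here are hypothetical):
J = expected_cost(1.0,
                  policy=lambda k, x: -0.5 * x,
                  f=lambda k, x, u, w: x + u + 0.1 * w,
                  g=lambda k, x, u, w: x * x + u * u,
                  gN=lambda x: x * x,
                  n=5, num_samples=2000)
print(J)
```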
We take decisions from time 0 to time N-1. Once we are done taking decisions, that is, once the last decision has been taken at time N-1, the system evolves for one more time step and reaches time N, at which point the state is x_N. So x_N is the state the system ends up with at the end of the time horizon; it is the state the system is in when we say the problem horizon has ended. It is therefore often called the terminal state, and the cost function g_N is called the terminal cost. The other terms are costs associated with every time step, so they have another name: they are called stage-wise costs. If you look at the cost function, it is a sum of costs that you incur at every stage. There is a cost for every stage from 0 to N-1, which depends on the state you are in, the action you took, and a random effect due to the noise in the system; and then there is a terminal cost, which is a function of the state the system finally finds itself in at the end of the time horizon. The expected total cost is what we need to minimize as part of the definition of the problem. Now, because we are taking these decisions in a sequence, we can consider two different types of decision problems here.
The first kind is what can be called an open-loop problem. In an open-loop problem we choose the decisions u_0 to u_{N-1}, all N of them, without any knowledge about how the system has evolved. It is as good as choosing these actions even before the system begins to evolve: they are chosen before the system evolves. The other type of problem is a closed-loop problem. In a closed-loop problem, when choosing u_k you have knowledge of everything that has happened in the system up until time k: you know the sequence of states the system has been through, and you know the previous actions that you have taken. So u_k is chosen based on the history up until time k. You will soon see that several parts of this history are actually redundant, and all one needs to know is the state at time k. So this problem is also often posed in the following way: u_k is asked to be chosen only as a function of x_k, and in these kinds of problems this restriction is without loss of generality. The latter is the kind of problem we will be considering in this course: closed-loop problems where u_k is to be chosen as a function of some information, and in this case the information at time k is x_k. One other point needs to be made about the closed-loop problem, which is to understand the sequence in which the noise, the system state, and the actions are realized.
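The gap between the two problem types can be seen numerically. In the sketch below, everything about the system (dynamics, costs, noise, the specific policies) is a hypothetical toy example of mine: the open-loop plan fixes all of u_0, ..., u_{N-1} in advance, while the closed-loop policy reads the realized state x_k before acting, and on this toy system that extra information yields a lower average cost.

```python
import random

def run(x0, choose_u, horizon, seed):
    """Simulate x_{k+1} = x_k + u_k + w_k with stage cost x_k^2 + u_k^2
    and terminal cost x_N^2 (a toy quadratic problem)."""
    rng = random.Random(seed)
    x, cost = x0, 0.0
    for k in range(horizon):
        u = choose_u(k, x)
        w = rng.choice([-1.0, 1.0])   # noise realized only after u_k is chosen
        cost += x * x + u * u
        x = x + u + w
    return cost + x * x

trials = 200
# Open loop: actions fixed before the system evolves (here, do nothing).
open_loop = sum(run(2.0, lambda k, x: 0.0, 5, s) for s in range(trials)) / trials
# Closed loop: u_k chosen as a function of the realized state x_k.
closed_loop = sum(run(2.0, lambda k, x: -x, 5, s) for s in range(trials)) / trials
print(open_loop, closed_loop)
```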
So u_k is chosen with knowledge of x_k, but when we choose u_k we do not have knowledge of w_k: w_k gets realized after we have chosen u_k. So when we choose u_k we know x_k, but we do not know w_k. In the dynamical equation, u_k is a function of x_k, but u_k is not a function of w_k. The state that evolves at the next time step is therefore not completely determined by your action and the previous state: it is determined by your action, the previous state, and an exogenous random effect, which is the effect of noise. To understand this problem class a little better, let us consider an example: inventory control. Inventory control problems are applicable in cases where, say, you have a shop selling shoes, and you want to decide how much inventory you should be ordering each day. The inventory gets consumed when demand arrives at the shop, but how many people will arrive and how many shoes will get sold is random. So across the decision epochs and time steps that arise in the problem, what you need to decide is how much inventory you should be refilling. Let us look at this problem in a bit more detail. Again we have a time horizon N. Let x_k be the stock available at the beginning of the k-th time period. So time has been slotted here: suppose this is 0, this is 1, this is 2, and so on up to N-1, and then here is N.
So if k is somewhere here, x_k is the stock that we have at the beginning of the k-th time period. When k = 0, x_0 is the stock you start with; x_1 is the stock you have at the beginning of the first time period, and so on. Now, u_k is the stock that we order: u_k is the stock ordered at the beginning of the k-th period, that is, the amount of additional shoes or inventory that you are ordering at the beginning of time period k. There are different types of models, but in this problem we are going to assume that stock ordered at time k gets delivered instantaneously, so it becomes available for fulfilling demand at the beginning of time period k itself. We will assume that delivery is immediate; physically, this amounts to assuming that the delivery time is far less than the time between decision epochs. Now, the demand we see at any time step is random, and we denote it by w_k. So the source of noise in this problem is the randomness in the demand. To be precise, w_k is the demand during time period k, not at time k, and we will assume that the demands w_0 to w_{N-1} are independent. So w_k is the demand that gets realized in the time period running from k to k+1.
Now, once again recall that in our stochastic control model u_k cannot be chosen as a function of w_k. As a consequence, w_k is not known to you at time k when you are choosing u_k: it is the demand that gets realized after you have chosen how much inventory to order. So the sequence of events is: there is a state at time k; you then choose the amount of inventory to be ordered at time k, which is u_k; and then comes the demand, a random event caused by nature, based on which your inventory gets consumed. Bear this sequence in mind: x_k becomes known, then u_k is chosen, and then the demand w_k is realized. We will continue with this model just after the break.
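To preview the event sequence x_k known, then u_k chosen, then w_k realized, here is a minimal simulation of the inventory problem. The lecture has not yet specified the dynamics, so the update x_{k+1} = max(0, x_k + u_k - w_k) (unmet demand is lost), the uniform demand, and the "order up to 5" policy are all assumptions of mine for illustration only.

```python
import random

def simulate_inventory(x0, order_policy, n, seed=0):
    """One run of a toy inventory model with immediate delivery.
    Assumed dynamics (not from the lecture): x_{k+1} = max(0, x_k + u_k - w_k)."""
    rng = random.Random(seed)
    x = x0
    history = []
    for k in range(n):
        u = order_policy(x)        # order placed knowing only x_k
        w = rng.randint(0, 4)      # demand w_k realized only afterwards
        history.append((x, u, w))
        x = max(0, x + u - w)      # delivery is immediate; excess demand lost
    return x, history

# Hypothetical "order up to 5" policy: refill the stock to level 5 each period.
final_stock, hist = simulate_inventory(3, lambda x: max(0, 5 - x), n=10)
print(final_stock, hist)
```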