Welcome everyone. We now continue with our study of stochastic control problems, but in a slightly different formulation from the one we have already studied. If you recall, in one of the earlier lectures I introduced a stochastic control problem where the state transition was given through explicit dynamics: the state at time k+1 was given explicitly as a function f_k of the current state x_k, the action u_k taken at time k, and the noise w_k that occurred during period k,

x_{k+1} = f_k(x_k, u_k, w_k).

What this means is that if you are given x_k, u_k and w_k, then this determines x_{k+1}. The next state is a deterministic function of the current state, the action and the noise, and it is given as an explicit function. In such cases we have the option of using the form of this function to solve the problem.

The kind of problem I will now describe is one where this function is not given explicitly. The next state still depends randomly on the current state and the current action, but there is no explicit formula or function relating the next state to the current state, the action and the noise. In some sense, the dependence on the noise, the previous state and the previous action is implicit. This is the formulation we will be studying today.

In this case I am going to assume that the states come from a certain set S. This is the state space, and I will assume for simplicity that S is finite, which means the system can take only finitely many states. We will also assume that at any state s the actions can be chosen from a certain finite set A_s. This is the set of actions available at state s: when the system is in state s, the actions you can choose are from A_s, and we again assume that A_s is finite.

This particular problem formulation comes from the domains of operations research, economics, computer science and so on, in which the convention often is that one wants to maximize a certain objective. The convention used here, therefore, is that the objective comprises not a cost but a reward, and one wants the maximum reward. Just as in the earlier model, we have a reward at each time step. The reward is denoted R_t(s, a): this is the reward the decision maker receives at time t when action a is chosen in state s.

Now, the key difference between the previous formulation and this one is the way the next state of the system comes about. What we will assume is that when you take an action a in state s at time t, the next state is realized through a probability distribution, as we will now make precise.
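Before writing down that distribution, here is a minimal sketch in Python of the explicit-dynamics formulation we just recalled. The dynamics function and noise distribution below are hypothetical, not from the lecture; the only point is that once the noise is realized, the next state follows deterministically.

```python
import random

# Hypothetical toy dynamics, only to illustrate the explicit formulation
# x_{k+1} = f_k(x_k, u_k, w_k): given x_k, u_k and the realized noise w_k,
# the next state is determined deterministically.
def f(x, u, w):
    # the action tries to push the state up; the noise may cancel the push
    return max(0, min(1, x + u * w))

x = 0                        # current state x_k
u = 1                        # action u_k chosen at time k
w = random.choice([0, 1])    # noise w_k realized during period k
x_next = f(x, u, w)          # x_{k+1} follows from (x_k, u_k, w_k)
print(x_next)
```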
To be precise, the state of the system at the next time step has a probability distribution given by P_t(· | s, a), where s is the state at time t and a is the action chosen at time t. So if you choose action a at time t and the state at time t is s, then the next state is realized according to this distribution. What this formulation tells us is that the state at time t+1 will be equal to j with probability P_t(j | s, a), where s is the state at time t and a is the action chosen at time t.

As before, we have a finite horizon, with t going from 0 to N, and because we are talking of a reward here, what one wants to do is maximize the total reward one receives. So the problem is formulated as follows: time runs in discrete steps over a finite horizon from 0 to N; we have a stage-wise reward R_t(s_t, a_t) at each time t; we also have a terminal reward R_N(s_N) at time N, which is a function of the state in which you end up at the end of the horizon; and what one wants to do is maximize the expected sum of all of these terms,

E[ R_N(s_N) + sum_{t=0}^{N-1} R_t(s_t, a_t) ],

where s_t is the state at time t and a_t is the action at time t.

There is a slightly more general version of this problem in which one allows the rewards to depend also on the next state, in addition to the current state and the action. In that form, the reward at time t is a function R_t(s_t, a_t, s_{t+1}) of the state at time t, the action chosen at time t, and the state that gets realized at the next time step, s_{t+1}. A problem of this sort can be reduced to one of the form written above by the following simple transformation: one replaces the reward by its expectation over the next state,

R̃_t(s, a) = sum over j in S of R_t(s, a, j) P_t(j | s, a).

What one is basically doing is this: because the action at time t is chosen using only the information available up until time t, the future state j is not included in that information, so we are equivalently taking the expectation over the future state. The distribution of the future state, given the current state and action, is given through the kernel P_t(j | s, a), which is already given to us; we take the expectation with respect to it, and that gives us an equivalent problem of the form described above. So one can always use R̃_t(s, a) as the reward function whenever the original reward depends on the current state, the action and the next state. This is a generalization that is often made.
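As a concrete illustration, here is a minimal sketch with a hypothetical two-state, two-action kernel (all numbers are invented for illustration). It shows how the next state is sampled from P_t(· | s, a), and how a next-state-dependent reward is reduced to the standard form by averaging over the kernel.

```python
import random

# Hypothetical two-state, two-action example; all numbers are illustrative.
S = [0, 1]                                   # state space
A = {0: ["stay", "go"], 1: ["stay", "go"]}   # actions A_s available in each state s

# Transition kernel P_t(j | s, a): probability that the next state is j
# when action a is chosen in state s.
P = {
    (0, "stay"): {0: 0.9, 1: 0.1},
    (0, "go"):   {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9},
    (1, "go"):   {0: 0.7, 1: 0.3},
}

def sample_next_state(s, a):
    """The next state is realized according to the distribution P_t(. | s, a)."""
    dist = P[(s, a)]
    return random.choices(list(dist.keys()), weights=list(dist.values()))[0]

def reduce_reward(R_saj, s, a):
    """Reduce a next-state-dependent reward R_t(s, a, j) to the standard form:
    R_tilde_t(s, a) = sum_j R_t(s, a, j) * P_t(j | s, a)."""
    return sum(R_saj(s, a, j) * p for j, p in P[(s, a)].items())

# Example: a reward of 1 whenever the next state turns out to be 1.
print(reduce_reward(lambda s, a, j: float(j == 1), 0, "go"))   # 0.8
```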
This entire tuple, in which we have T, the set of time steps (the N time steps); S, the state space; A_s, the set of actions at each state s; the transition probabilities P_t(j | s, a), which is what these are called; and the rewards, defines for us what is called a Markov decision process.

Now let us see how this relates to the formulation we had earlier. Apart from the change in notation, the basic relation is fairly simple. Notice that earlier we had x_{k+1} given as a function f_k(x_k, u_k, w_k). Consequently x_{k+1} is distributed according to a certain distribution; let us call it p_k. And what is this distribution given by? If you want to see what p_k(j | x_k, u_k) is, it is nothing but the probability that f_k(x_k, u_k, w_k) equals j, given that you are in state x_k and choose action u_k at time k:

p_k(j | x_k, u_k) = P( f_k(x_k, u_k, w_k) = j | x_k, u_k ).

This tells you the probability distribution of x_{k+1}. You can see that, because x_k and u_k are given here, the only entity that is random in this equation is w_k. With w_k being the only source of randomness, this conditional distribution is determined entirely by the probability distribution of the noise and by the dynamics f_k. So the earlier formulation can always be put in this particular form: instead of taking the function f_k and an explicit noise distribution for w_k, one can always transform it into a formulation of this kind, where one only has an abstract distribution of the state at the next time step given the current state and the current action.

In fact, the converse can also be worked out under fairly general conditions: given a probability distribution of this form, in many cases one can construct a noise random variable and a suitable function from it. This requires some amount of analysis, so I will not be going in that direction; it is not that important for us to explore it.

Now, the problem as we have formulated it is one where we are maximizing this sum of the terminal reward and the stage-wise rewards, and once again we arrive at the same question: at what stage is this problem posed? As in the earlier formulation, notice that this problem is again posed before even the first uncertainty in the problem gets realized. It is posed at a time when none of the uncertainty has been realized, and one has to decide what one wants to do with the problem.
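The forward direction of this reduction is easy to make concrete. Here is a minimal sketch, assuming a hypothetical finite noise distribution and the same toy dynamics as in the earlier sketch, of recovering the kernel p_k(j | x, u) by aggregating the probabilities of all noise values that lead to the same next state.

```python
from collections import defaultdict

# Hypothetical finite noise distribution and toy dynamics f_k; the point is
# only to illustrate p_k(j | x, u) = P( f_k(x, u, w_k) = j ).
noise_dist = {0: 0.5, 1: 0.5}          # P(w_k = w)

def f(x, u, w):
    return max(0, min(1, x + u * w))   # same toy dynamics as before

def transition_kernel(x, u):
    """Aggregate the noise probabilities of every w that lands on next state j."""
    p = defaultdict(float)
    for w, pw in noise_dist.items():
        p[f(x, u, w)] += pw
    return dict(p)

print(transition_kernel(0, 1))   # {0: 0.5, 1: 0.5}
```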
Because the uncertainty has not been realized, the states s_t at any time t are not known to us. We only know that they are random variables, but even their distribution is not known to us, and the reason is that the distribution depends on the actions we would choose through the course of the time horizon of the problem. As a consequence, this problem cannot be formulated as one where the choice is simply that of choosing the actions. What one has to do instead is come up with a contingency plan, in which an action is chosen based on whatever state gets realized. The plan that maps the state that gets realized to the action that will be chosen is what we have to choose.

Such a choice, at any time t, is a decision rule: a decision rule d_t that maps the state at time t to the actions you can choose, d_t : S -> A, where A is the union of the sets A_s over s in S, and where d_t must have the property that d_t(s) lies in A_s, the feasible actions available at that particular state. So what we need to choose are these functions. Such a function is called a decision rule, or a strategy, and what one needs to do is choose a policy pi, which comprises the decision rules d_0 through d_{N-1}. These N decision rules have to be decided in order to solve the problem. So the maximization here is actually over all possible policies; the variable we are choosing in this problem is the policy.

How does the policy manifest itself in this problem? The policy decides the action that gets chosen based on the state; the action that is chosen, together with the current state, decides the probability with which the next state is realized, through our transition probability (or transition function). So the probability distribution of the next state is shaped by the policy that is chosen. Consequently, although we are given all these transition probabilities, what we actually have in this problem is a choice over the shape those transition probabilities take: they are given to us for every state and every possible action at each time t, but the way we choose the action at each time t, when we are in a certain state, decides the shape the transition probability actually takes in closed loop. In closed loop, when we are actually implementing the policy, the probabilities with which the states get realized are determined by our policy. Consequently, although at a superficial glance at an expression like this it may not be obvious where the policy enters, the policy in fact tells you the probability distribution.
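To make this concrete, here is a minimal sketch, again using the hypothetical kernel and invented rewards from the earlier sketch, of a policy as a tuple of decision rules and of the closed-loop trajectory it induces: the decision rule picks the action, and that action decides which row of the kernel the next state is drawn from.

```python
import random

# Hypothetical kernel, rewards and horizon (illustrative numbers only).
P = {(0, "stay"): {0: 0.9, 1: 0.1}, (0, "go"): {0: 0.2, 1: 0.8},
     (1, "stay"): {0: 0.1, 1: 0.9}, (1, "go"): {0: 0.7, 1: 0.3}}
def R(t, s, a):          # stage-wise reward R_t(s, a)
    return 1.0 if s == 1 else 0.0
def R_terminal(s):       # terminal reward R_N(s)
    return 2.0 if s == 1 else 0.0
N = 3

# A policy pi = (d_0, ..., d_{N-1}) is a tuple of decision rules d_t : S -> A_s.
def d(s):
    return "go" if s == 0 else "stay"
policy = [d] * N

def run_episode(s0):
    """Closed loop: the decision rule picks the action, and the chosen action
    shapes the distribution from which the next state is drawn."""
    s, total = s0, 0.0
    for t in range(N):
        a = policy[t](s)                        # action dictated by d_t
        total += R(t, s, a)                     # collect the stage-wise reward
        dist = P[(s, a)]
        s = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
    return total + R_terminal(s)                # add the terminal reward R_N(s_N)

print(run_episode(0))
```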
That is, the policy determines the underlying probability distribution of the random variables s_t. Consequently, a notation that is often used to emphasize this is to write that the expectation E actually depends on the policy: one often writes E with a superscript pi, or a subscript pi, just to emphasize it.

Now, what we will do in the next part is consider a wider variety of policies. In this particular section, I have told you that the policy is a mapping from the state to the action. But why should it be a mapping only from the state to the action? It could be something far more than that: it could map the entire previous history to the action as well. All of this discussion, about the kind of information that can be used in defining the policy that chooses the action, will be taken up in the next piece.
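To close this section with a small illustration of that notation, here is a sketch (hypothetical one-step numbers) of a Monte Carlo estimate of the expected terminal reward under two different choices of the action in state 0; the estimate changes with the policy because the policy changes the distribution of the trajectory, which is exactly what the superscript pi on the expectation records.

```python
import random

# One-step illustration (hypothetical numbers): E^pi depends on the policy,
# because the action it picks in state 0 changes the next-state distribution.
P = {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.2, 1: 0.8}}   # P(j | s = 0, a)
R_terminal = {0: 0.0, 1: 1.0}                            # terminal reward R_1(j)

def estimate(action, n=100_000):
    """Monte Carlo estimate of E^pi[ R_1(s_1) ] starting from s_0 = 0."""
    dist = P[action]
    draws = random.choices(list(dist.keys()), weights=list(dist.values()), k=n)
    return sum(R_terminal[j] for j in draws) / n

print(estimate("stay"), estimate("go"))   # roughly 0.1 versus 0.8
```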