Okay, so we've now defined MDPs, we've seen some examples of MDPs, and we've seen how problems we might be interested in can be framed as MDPs. The next step is to figure out how to solve them. To solve an MDP means to find a policy pi star. Remember, we spoke about policies as maps from states to actions: the input is a state, the output is an action. We want to find the policy pi star that performs best, and that's called the optimal policy. Specifically, optimal means that following pi star maximizes the total reward, or utility, on average. The reason it's on average is that, as we've seen, any one policy can have many different outcomes, because any one action can produce many different outcomes. For example, in that grid world, if you move north you could end up drifting east or west with some probability.

Now, when we try to solve the MDP in reinforcement learning, we typically assume that P and R in our MDP tuple are unknown, right? We spoke about this when we were introducing MDPs. But let's first talk about how you would find the optimal policy even if you did know all four quantities, because that's not trivial: even knowing S, A, P, and R, it's not obvious how to find the optimal mapping from states to actions.

So here's an example of an optimal policy. Suppose the small negative reward we spoke about, the one you incur for every instant you spend in the grid world, has the value minus 0.03 for all the non-terminal states, while these two terminal states have rewards plus one and minus one. In this case it turns out, and we don't yet know exactly how to compute this, that the optimal policy is this one. And if you found yourself in any other state (remember, this was the start state), you would try to return to this trajectory, right? We kind of guessed when we were setting up the grid world that this would likely be the case.

This needn't always be the case, though. In particular, if the penalty for staying in a non-terminal state grows in magnitude, meaning the negative reward becomes more negative, you can get quite different policies. Let's look at a few examples. Earlier we saw minus 0.03; here is an example with a smaller negative reward, and the policy looks similar except that the one arrow that was pointing upwards in the earlier case is now pointing this way. I'll let you think about why: r of s equals minus 0.01 is a smaller negative penalty, which means less emphasis on speed, so a slower policy is acceptable, and you can think about why it would then be valuable not to rush back to the fastest path but instead to try to move into the blocked square. Think about what would happen if you tried to move into the blocked square. We already saw on the previous slide what minus 0.03 looks like, so what about minus 0.4? That's an order of magnitude larger, and you can see that the policy has changed: there is now much more emphasis on speed, so we no longer really try to return to this path.
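To make the setup concrete, here's a minimal sketch of the grid world's dynamics in Python. The 4x3 layout, the coordinates, and the 0.8/0.1/0.1 slip probabilities are illustrative assumptions (the lecture only says that moving north can drift east or west with some probability); the reward values are the ones from the slides.

```python
# Sketch of the grid world's transition and reward model.
# Layout, coordinates, and the 0.8/0.1/0.1 slip probabilities are
# assumptions for illustration, not taken from the lecture.

LIVING_REWARD = -0.03        # try -0.01, -0.4, or -2.0 to see the policy change
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}  # hypothetical coordinates of +1 / -1
BLOCKED = {(1, 1)}                        # the blocked square
WIDTH, HEIGHT = 4, 3

MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
SLIPS = {'N': ('E', 'W'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('N', 'S')}

def move(s, direction):
    """Deterministic helper: step one square, staying put at walls or the blocked square."""
    x, y = s
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in BLOCKED or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
        return s
    return nxt

def transitions(s, a):
    """P(s' | s, a): intended direction with prob 0.8, each sideways slip with 0.1."""
    left, right = SLIPS[a]
    return [(move(s, a), 0.8), (move(s, left), 0.1), (move(s, right), 0.1)]

def reward(s, a, s_prime):
    """R(s, a, s'): terminal payoff on arrival, otherwise the living reward."""
    return TERMINALS.get(s_prime, LIVING_REWARD)
```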
Returning to the minus 0.4 case: instead of going back to the old path, if we find ourselves over here we take the shortest path this way, and if we find ourselves over here we take this shortest path. That's one way to interpret what's happening. And if the emphasis on speed is even higher, say the negative reward is minus two, then that behavior is only further emphasized. In fact, there are now behaviors where you are okay with terminating at the wrong terminal state, not the correct one with the positive reward but the wrong one with the negative reward, because staying in the grid for one more time step is actually worse than ending up at the wrong state. So if you're closer to the bad terminal state, you simply choose to move towards it. That's an example of how the optimal policy is sensitive to the reward function.

Okay, so we still don't really know how to compute these optimal policies, but let's start working towards that. We'll start with the notion of a search tree, a tree of future outcomes associated with each MDP state. We've kind of seen diagrams like this before, when we drew the transition graph, but now we break each transition into two steps: after we execute an action, we end up at something that's not really a state. We sometimes call it a Q-state, for reasons that will become clearer later on. After executing action A from state S you end up at the Q-state (S, A); at this point we haven't actually transitioned into a new state, right? We've just committed to the action. That's why this node is a blue circle and not a red triangle: the red triangles here correspond to states in the true MDP. So from a state in the true MDP we execute an action, end up at this Q-state, and that Q-state can then transition to potentially many different outcomes, one of which could be S prime, depicted here. And then that process continues on and on. The triple (S, A, S') is called a transition; we've seen how that idea is captured by the transition probability P(S' | S, A), and we've also discussed that you receive a reward R which is a function of all three of S, A, and S'.

All right, remember we said we wanted to find policies that maximize the reward, but let's refine that statement a little and see where it takes us. At each step, an agent has to choose an action that maximizes the future expected rewards; the word expected is there because, of course, the environment is stochastic. Now suppose that over the next time steps you receive rewards R_t, R_{t+1}, R_{t+2}, and so on to infinity, because your problem can potentially stretch on for an infinite number of steps, as the grid world certainly can. Then you have an infinite sequence of rewards, and one problem you run into is: what does it really mean to maximize expected rewards? If all of those rewards were positive, their sum would be positively infinite. So what does it mean, then, to maximize the expected rewards?
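Before we get to the solutions, a quick aside on the search-tree picture: P(s' | s, a) and R(s, a, s') together are all you need to compute the expected immediate reward of a Q-state. A minimal sketch, reusing the hypothetical grid-world helpers from earlier:

```python
def expected_reward(s, a):
    """Expected one-step reward at Q-state (s, a):
    E[R | s, a] = sum over s' of P(s' | s, a) * R(s, a, s')."""
    return sum(p * reward(s, a, s_prime) for s_prime, p in transitions(s, a))

# e.g. expected_reward((0, 0), 'N') on the grid world sketched above
```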
Now, back to the infinite-rewards problem. One solution is to allow only a fixed number of time steps. For example, you could say that after a hundred steps in the grid world you will no longer permit the episode to continue: you forcefully terminate the episode after a hundred steps. That leads to finite-horizon formulations. There are also environments where a finite horizon doesn't even require changing the MDP, because the MDP is naturally finite horizon: if you have to catch a ball, the ball spends a fixed amount of time in the air and you can't do much about that. Now, imposing a horizon means that the time spent in the MDP effectively becomes part of your state, so you have to account for how much time you've already spent. That results in what are called non-stationary policies, because the true Markov state of the environment now contains the number of elapsed time steps as well.

Yet another way is to make sure there is an absorbing state, meaning a terminal state that any policy will eventually reach; you can engineer your MDP so that it has absorbing states no matter what you do.

We won't discuss these first two options in much detail. Instead we'll look at the third option, which is the most general and the most widely followed: discounted rewards. The idea is essentially that even if there are infinitely many potential time steps after the current one, the outcomes you care most about are the ones closest to you now, and outcomes that are farther away, say 500 steps from now, will matter less in your maximization problem than outcomes only five steps away. In other words, R_t and R_{t+1} will matter more than R_{t+2} and R_{t+3}, and so on. We'll see exactly how to set that up over the next couple of slides.

All right. So we've now introduced the idea of discounted rewards, where future rewards are worth less than current rewards; in particular, they are worth exponentially less. What that means is that you define a utility function that takes as input all the rewards coming in the future, and you define it not as the plain sum of those rewards, which is what you would normally take to be the total reward. If you wanted to maximize total reward you would simply sum up all the rewards, but as we discussed, an infinite-length sequence could produce infinite total reward, at which point total-reward maximization would become meaningless. Instead we define a discounted utility that looks like this: it has a factor gamma between zero and one, and that factor gets multiplied in once more every time you add a new reward. If you expand this out, the first term carries gamma to the power zero, which is one regardless of the value of gamma, so the expansion looks like R_{t+1} plus gamma times R_{t+2} plus gamma squared times R_{t+3}, and so on.
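Written out with the reward indexing used just above (future rewards starting at R_{t+1}), the discounted utility is:

$$
U(R_{t+1}, R_{t+2}, \ldots) \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, R_{t+1+k} \;=\; R_{t+1} + \gamma\, R_{t+2} + \gamma^{2}\, R_{t+3} + \cdots, \qquad 0 \le \gamma \le 1 .
$$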
In other words, the only reward you count in its entirety is the current one; future rewards are always discounted by this discount factor gamma between zero and one. If you set gamma equal to one, you end up in the same situation as before, where you just sum all the rewards; that's sometimes called the additive utility, or the total reward. One way to think about what's happening is that as you decrease gamma starting from gamma equals one, you are effectively shortening the time horizon over which you're trying to compute optimal policies. As gamma gets closer and closer to zero, the only thing you care about is the immediate reward; as gamma gets closer and closer to one, you care about all future time steps nearly as much as the immediate next one.

Now, what happens if all of these terms were exactly the same, all equal to the maximum reward you could obtain in this environment? Say, for the sake of argument, that that's the case. Then for any gamma strictly between zero and one, you can guarantee that the summation over all those terms is a finite quantity. If you remember geometric progressions from high school, you'll see that even if you received exactly R_max, the maximum reward at any one time step, at every single step, the maximum utility this expression can produce is R_max over one minus gamma: just the sum of a geometric progression. And because you're no longer doing that trick of controlling the horizon explicitly with finite-horizon approaches, you don't need a non-stationary, time-varying policy anymore: with this notion of utility you can account for infinite-horizon MDPs and still produce optimal stationary policies.

So again, let's visualize what's happening now that we have a discount factor. We start at a particular input state, we perform an action that takes us to a Q-state, one of these blue circles, which in turn transitions automatically to one of several potential outcomes, say this one, which is a state; then you perform another action, and so on. What we're saying, effectively, is that anything that happens in this immediate first transition is counted at full strength, but you keep diluting the contribution of later transitions: the transition from the next step to the step after that is diluted by a factor of gamma, the one after that by gamma squared, and so on. So by setting the discount factor gamma, you're effectively saying that the policy you're trying to discover is one that cares less about future rewards than about more immediate rewards, and the closer gamma gets to zero, the more this is the case.
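To make that finiteness claim concrete: if every reward in the sequence equals the per-step maximum $R_{\max}$ and $0 \le \gamma < 1$, the geometric series gives

$$
U \;\le\; \sum_{k=0}^{\infty} \gamma^{k} R_{\max} \;=\; R_{\max}\left(1 + \gamma + \gamma^{2} + \cdots\right) \;=\; \frac{R_{\max}}{1-\gamma},
$$

so every discounted utility is bounded by this finite quantity.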
Now remember, we've defined MDPs as tuples of S, A, P, and R. We've also defined the policy, which is a mapping from states to actions, and the utility, which is the sum of rewards over future time steps. In particular, we introduced the discount factor, which says we don't just sum up the rewards, we also discount them, and that's why the definition of the Markov decision process now includes a discount gamma alongside S, A, P, and R.
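As a closing sketch, here's one way to package that five-tuple (S, A, P, R, gamma) as code. The field names and the list-of-pairs encoding of P are illustrative choices, not a standard API:

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Tuple, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type

@dataclass
class MDP(Generic[S, A]):
    states: List[S]                                       # S
    actions: List[A]                                      # A
    transitions: Callable[[S, A], List[Tuple[S, float]]]  # P(s' | s, a) as (s', prob) pairs
    reward: Callable[[S, A, S], float]                    # R(s, a, s')
    gamma: float                                          # discount factor, 0 <= gamma <= 1

# e.g., wiring up the hypothetical grid world from earlier:
# grid = MDP(states=[...], actions=list(MOVES), transitions=transitions,
#            reward=reward, gamma=0.9)
```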