Hello, everyone, this is Alice Gao. In the previous video, I defined what a policy is for solving a Markov decision process. In short, a policy needs to specify an action for every possible state in the Markov decision process. We have to plan for all contingencies: the agent needs to know what to do in whichever state it ends up in. If the concept of a policy still feels abstract to you, don't worry. In this video, let me show you a few examples of optimal policies for the grid world, depending on the reward function.

When I introduced the grid world, I specified a particular value for the reward function, R of S. I said that the reward for entering any state other than the goal states is a small negative number, minus 0.04. But in fact, if we are willing to change this reward, we will observe something very interesting: the optimal policy for the grid world changes based on the reward function. So let's look at some of these examples, and you will see that as we change the reward function, the optimal policy changes to carefully balance the risks and the rewards.

Actually, before we look at the policies, let's think about what kinds of rewards there are in our grid world, and what kinds of risks there are that we want to avoid. In terms of reward, we obviously want to get to the better goal state, the nice goal state, S34, and we want to reach it as soon as possible. If we are going with a fixed sequence of actions, there are two paths: one going above the X and one going below it. Then there are also some risks. For example, if we follow the blue path, the one going above, then as we walk past the state S24, there is a chance that we might fall into it. If we fall into the minus one state, we exit the world unhappy. You can think of the S34 state as happily ever after, and the S24 state as unhappily ever after. There is another risk as well: suppose we never reach either goal state, or we take a really long time to reach one. The longer we wander around, the more negative reward we accumulate, because every step costs us a little bit.

So how would the optimal policy balance these rewards and risks? Let's take a look. I am going to look at five cases, depending on different ranges of values for the reward function. You can see that there are some gaps between these cases, so they do not cover the entire range of the reward function, but they are sufficient to illustrate how the optimal policy changes based on the reward function, and we will see some interesting things going on. Let me start with one of the middle cases. Then we will go to the case where the reward is worse, and after that to the remaining cases where the reward gets better and better. Hopefully this goes in a direction where, after we have looked at the absolute worst case, life can only get better. For each case, I will show you the empty grid world first and explain the reward value a little bit. Based on that reward value, I would like you to guess what the optimal policy looks like before I tell you the answer.
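Before the examples, here is a minimal sketch of how such policies can be computed with value iteration. The video does not spell out the transition model, so this assumes the usual grid-world dynamics: the agent moves in the intended direction with probability 0.8 and slips to each perpendicular side with probability 0.1, with no discounting, the X at S22, and (row, column) indexing that matches the S11/S34 naming with row 1 at the top. All of those details are assumptions for illustration, not something stated in this video.

```python
# A sketch of computing the optimal grid-world policy by value iteration.
# Assumed (not stated in the video): 0.8 / 0.1 / 0.1 slip dynamics,
# no discounting, the wall (X) at S22, +1 at S34 and -1 at S24.
ROWS, COLS = 3, 4
WALL = (2, 2)                        # the X; (row, col), row 1 at the top
GOALS = {(3, 4): 1.0, (2, 4): -1.0}  # S34 "happily ever after", S24 not
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def states():
    return [(r, c) for r in range(1, ROWS + 1)
                   for c in range(1, COLS + 1) if (r, c) != WALL]

def step(s, a):
    """Deterministic effect of action a in s; bumping into a wall stays put."""
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    if (r, c) == WALL or not (1 <= r <= ROWS and 1 <= c <= COLS):
        return s
    return (r, c)

def q(s, a, V, step_reward):
    """Expected value of a: 0.8 intended, 0.1 to each perpendicular side."""
    outcomes = [(0.8, step(s, a))] + [(0.1, step(s, p)) for p in SLIPS[a]]
    # Reward for entering s2 is +/-1 at a goal, otherwise R(s) = step_reward.
    return sum(p * (GOALS.get(s2, step_reward) + V[s2]) for p, s2 in outcomes)

def optimal_policy(step_reward, sweeps=500):
    V = {s: 0.0 for s in states()}   # goal values stay 0: the +/-1 is
    for _ in range(sweeps):          # paid once, on entering the goal
        for s in states():
            if s not in GOALS:
                V[s] = max(q(s, a, V, step_reward) for a in MOVES)
    return {s: max(MOVES, key=lambda a: q(s, a, V, step_reward))
            for s in states() if s not in GOALS}

print(optimal_policy(-0.04))  # the original reward from the earlier videos
```

Under these assumptions, each of the five cases below is just the result of calling optimal_policy with a reward in that case's range.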
Let's look at example number one. In this example, the reward function is in a negative range. Remember, our original value was minus 0.04, so this is a little bit worse than that: the maximum it can get to is around minus 0.08, and the minimum is around minus 0.4. What is the agent going to do in this case? Let me give you a general description as a clue for what the optimal policy looks like, and then I will let you guess what the policy should look like. In this case, life is quite unpleasant, as described in the title, and the agent is willing to take the shortest route to the plus one state. Along that shortest route, there is a chance that the agent will fall into the minus one state by accident, and the agent is willing to take that risk. Given this description, think about it, try to guess what the optimal policy looks like, and then keep watching for the answer.

Here's the answer. There are two possible paths to the plus one state: one going above the X and one going below the X, and these two paths have exactly the same length. But the bottom path is safer, because along the bottom path there is less of a chance of accidentally ending up in the minus one state. Because of this, the agent decides to start by going down, and once it is on the bottom path, it tries to follow it: down, down, right, right and right. Now, what if the agent tries to go down from S11 and actually ends up going right? Well, if it ends up going right, it will still try to follow the shortest path to the plus one state. In this case, the shortest path is going right, then down, down, and then right. And obviously, if we follow this shortest path, there is a chance that the agent might fall into the minus one state; we are taking that risk. There is one more thing we need to specify: what if the agent accidentally gets into S14, the state above the minus one state? In that case, we want to take the shortest route out of it, so we choose to go left. Again, when going left there is a chance that we end up going down and actually falling into the minus one state, but we are willing to take that risk in this case. To summarize this case: because life is quite unpleasant, the negative reward we accumulate while exploring adds up quickly, so we want to take the shortest path to get to the plus one state as soon as possible, and if there is a risk of falling into the minus one state along the way, we are willing to take it.
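To make "an action for every state" concrete, here is the case-one policy we just walked through, written out as the kind of explicit state-to-action table the optimal_policy sketch above returns. The (row, column) tuples and action strings reuse the same illustrative encoding assumed earlier.

```python
# Case one, R(s) roughly between -0.4 and -0.08: an action for every
# non-goal state, exactly as described above.
policy_case_1 = {
    (1, 1): "down",   # S11: head for the safer bottom path
    (2, 1): "down",   # S21
    (3, 1): "right",  # S31: bottom path: right, right, right
    (3, 2): "right",  # S32
    (3, 3): "right",  # S33
    (1, 2): "right",  # S12: slipped onto the top path; shortest route to +1
    (1, 3): "down",   # S13: cut down toward +1, accepting some slip risk
    (2, 3): "down",   # S23: pass beside the -1 state, accepting the risk
    (1, 4): "left",   # S14: shortest way out of the corner above -1
}
```

Notice that every non-goal state gets an action, including the states off the intended route; that is exactly the "plan for all contingencies" requirement from the start of the video.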
Let's look at the second example. In this example, I have decreased the immediate reward of entering a state considerably: when we enter any state that is not a goal state, we get a negative reward that is relatively large in magnitude, at most around minus 1.6. Given this, what does the optimal policy look like? Let me give you one clue: life is so painful, so horrible in this world, that we prefer to escape it as fast as possible. Given that clue, think about it for a moment, try to figure out the optimal policy yourself, and then keep watching for the answer.

Let's take a look at the answer. This is a world where life is painful and horrible; every step we take weighs heavily, because we incur a large negative reward. So what does the agent want to do in this kind of world? It wants to escape, because the more steps it takes, the more negative reward it accumulates. If there is an option to escape, it wants to get there as soon as possible. So this is what the agent does. If it is already on the bottom path, it follows the shortest path to the plus one state: going down and then continuing right until it gets to the plus one state. If it is already on the path above the X, it heads to the nearest exit as well, but here the nearest exit is actually the minus one state, not the plus one state. So in this case, the agent makes the somewhat suicidal move of going right and then down, which is the fastest way to the minus one exit. Now, what would the agent do in S11? How would it choose between the two paths? Remember, in the previous case the agent chose the bottom path because it still wanted to get to the plus one state. In this case, it does not matter which escape, which goal state, the agent reaches, and in fact the goal state along the top path, the minus one state, is closer. By that reasoning, the agent goes right from the start state, following the shorter path to get to the minus one state faster, because it just wants to escape. Finally, at the state S23, the agent chooses to go down. This one is harder to explain intuitively: you might imagine going right, which would be the fastest way into the minus one state, but the calculations end up saying that the agent prefers to go down in S23 instead of going right. One way to think about it is that going down costs roughly one more step of negative reward, but it aims the agent at the plus one exit instead of the minus one exit, and for rewards in this range the calculations say that trade is still worth it.

Let me stop here for this video. In the next video, let's look at the other three cases and see how the optimal policy changes based on the change in the reward function. Thank you very much for watching. I will see you in the next video. Bye for now.