Hi everyone, this is Alice Gao. In the previous video, I started talking about the optimal policies of the grid world. Depending on the reward function, the optimal policy of the grid world changes to carefully balance risk and reward. I covered the first two cases in the previous video: the case when life is quite unpleasant, and the case when life is extremely painful. Since we've covered the two worst cases already, life can only get better from there. So in this video, let's look at the remaining three cases, where life gets better, then better still, and eventually really good.

Here's the first case. Life is still quite unpleasant, but it's better than the unpleasant case we had at the beginning. In this case, the reward of every non-goal state is fixed at minus 0.04. This is the original value I gave you when I introduced the grid world. Let me remind you what the optimal policy looked like when life was slightly worse than the case we're looking at. When life is quite unpleasant, we have this optimal policy, where the agent tries to take the shortest route to the plus one state. While trying to get to the plus one state, there is some risk of falling into the minus one state by accident, and the agent is willing to take this risk.

Given that, let's look at this case, where life is slightly better. Same as before, let me give you a description of the optimal policy and let you guess what it looks like. The optimal policy for this world is conservative. Because life is slightly better, the cost of each action is not as bad as before, so the agent prefers to take the long way around to avoid reaching the minus one state by accident. Given this, think about what the optimal policy looks like, and then keep watching for the answer.

Here's the answer. Same as before, there are two paths to the plus one state, but the bottom one is always safer. That means if we are already on the bottom path, we go straight towards the plus one state. The top path is not as safe, and our strategy is conservative, so if we end up on the top path, we don't want to stay on it. We want to take the safe route to make sure we don't fall into the minus one state. This means that from the start we try to go down, and if we accidentally go to the right, we course-correct and go left. The same thing happens if we somehow end up in state S13: we go left again, and the same for S14. All of these left-pointing arrows are course corrections, saying that we don't want the short route; we want to go back to the beginning and follow the safe route along the bottom to the plus one state. And from S23, we try to head down towards the plus one state.

You can see this policy is more conservative than before. We're not willing to take as much risk; since each step is cheaper, we're willing to spend more steps to reach the goal state safely. But you can also see that this is not the most conservative policy possible. For example, look at the action in state S23: we choose to go down, and by doing this there is still a chance that we slip sideways into the minus one state. Similarly, in state S14 we choose to go left, and with this action there is still a 10% chance that we slip down into the minus one state.
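If you'd like to experiment with these policies yourself, here is a minimal sketch of the grid world's transition model in Python. The layout is my reconstruction from the descriptions in this video, so treat it as an assumption: 3 rows by 4 columns, states named S<row><col> with row 1 on top, a wall at S22, the minus one state at S24, the plus one state at S34, and the usual 0.8 / 0.1 / 0.1 slip probabilities.

```python
# A minimal sketch of the grid world's slip model, under an assumed layout
# reconstructed from the video: 3 rows x 4 columns, states (row, col) with
# row 1 on top, a wall at S22 = (2, 2), the minus one state at S24 = (2, 4),
# and the plus one state at S34 = (3, 4).

N_ROWS, N_COLS = 3, 4
WALLS = {(2, 2)}

# Unit moves (d_row, d_col) for each action; row 1 is the top row.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
# With probability 0.1 each, the agent slips to one of the two directions
# perpendicular to its intended action; it never moves backward.
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def step(state, action):
    """Return a dict mapping each possible next state to its probability."""
    outcomes = {}
    for direction, prob in [(action, 0.8),
                            (PERPENDICULAR[action][0], 0.1),
                            (PERPENDICULAR[action][1], 0.1)]:
        dr, dc = MOVES[direction]
        nxt = (state[0] + dr, state[1] + dc)
        # Bumping into the internal wall or the boundary leaves us in place.
        if nxt in WALLS or not (1 <= nxt[0] <= N_ROWS and 1 <= nxt[1] <= N_COLS):
            nxt = state
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return outcomes
```

For example, under this assumed layout, `step((1, 4), "left")` (state S14) puts probability 0.8 on S13, 0.1 on staying put (the slip toward the top boundary), and 0.1 on S24, the minus one state: exactly the residual 10% risk just described.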
So these are certainly not the most conservative options, but this policy is more conservative than before.

Here's the next case. Our reward is even better: it can be anywhere from just below zero down to minus 0.02-something, so it's even better than minus 0.04. Again, here is a description of the optimal policy. In this case, life is only slightly dreary. It's not so bad at all, and the agent takes no risk. It wants to avoid falling into the minus one state by accident as much as possible, and to achieve this, the agent is willing to bump into the wall many times. Given this description, think about what the optimal policy looks like, and then keep watching for the answer.

The optimal policy in this case is almost the same as in the previous case. Let me highlight the parts that are the same first. As in the previous case, we try to go down first, since that's the safe route, and if we accidentally end up on the other route, we course-correct: from S12 or S13 we go left, and we're willing to go the long way to stay on the safe route.

This policy differs from the previous one in two states: S23 and S14. The difference is that life got better, and the better life is, the more conservative we are; we're willing to wander for a longer time because we really want to avoid falling into the bad state, the minus one state. So in S23, we avoid falling into the minus one state by going left. If we go left, it's only possible to remain in the same state, go up, or go down; it's not possible to move backward, because the transition probabilities never let the agent turn 180 degrees. So there's no chance of falling into the minus one state. Similarly, in state S14 we choose to go up. Going up is the safest choice, the best way to ensure we never fall into the minus one state. As you can see, given these two choices there is zero probability of falling into the minus one state; it's simply not possible given the transition probabilities.

The consequence is that, because we're walking directly into a wall, we might have to bump into it quite a few times before we accidentally slip in the right direction. From S14 we might have to do this many times before we end up going left, and from S23 we might have to bump into the left wall many times before we end up going either up or down. But because the reward is relatively high, the agent is willing to do this to make sure it eventually reaches the plus one state.

Let's look at our final case. In this case, the reward of reaching any non-goal state is positive: R(s) is strictly greater than zero. This reminds me of a friend I made during my PhD. I used to go to lunch and dinner with her a lot, and whenever we got to the restaurant and the food arrived, she would look at it and say, "Alice, life is good." This is exactly the case when life is good. I don't think I need to say too much about this case, because it's quite intuitive: the reward is positive. Given this, take a guess, think about what the optimal policy looks like, and then keep watching for the answer.

So what does the agent want to do in this case? Well, every step the agent takes in this world, it gets a positive reward, say, a candy.
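To put a rough number on "many times" (my own back-of-the-envelope estimate, not a figure from the video): each bump is an independent trial, so the number of attempts until the agent escapes is geometrically distributed. From S23, going left escapes via an up or down slip with probability 0.1 + 0.1 = 0.2, so we expect 1 / 0.2 = 5 attempts; from S14, going up can only escape via the leftward slip (the rightward slip hits the boundary in the assumed layout), so the escape probability is 0.1 and we expect about 10 bumps. A quick simulation sketch under the same assumed slip model:

```python
import random

def attempts_until_escape(p_escape):
    """Count attempts until a slip finally moves the agent off the wall."""
    attempts = 1
    while random.random() >= p_escape:  # bump and stay with prob 1 - p_escape
        attempts += 1
    return attempts

for label, p in [("S23 going left", 0.2), ("S14 going up", 0.1)]:
    trials = [attempts_until_escape(p) for _ in range(100_000)]
    # Sample means should be close to 1 / 0.2 = 5 and 1 / 0.1 = 10.
    print(label, sum(trials) / len(trials))
```

You can also confirm the zero-risk claim with the step function from the earlier sketch: under the assumed layout, step((2, 3), "left") and step((1, 4), "up") both assign zero probability to the minus one state at (2, 4).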
If you get a candy for every step you take, would you ever want to leave this world? Probably not. So the goal of the agent in this world is to stay in it forever: as long as it remains in this world, every step it takes keeps accumulating positive reward. After T steps it has collected T times R(s), which grows without bound as T grows, so the agent can potentially accumulate an unlimited amount of positive reward. Therefore, the agent's primary objective is to avoid both goal states. This is funny, because previously we were trying to avoid one of the goal states because it's bad, while still trying to reach the other one, but now we want to avoid both of them.

We already know how to avoid the top one: as long as we go directly away from that goal state, we'll never get into it. We can use a similar strategy to avoid the bottom one: as long as we go left, we avoid the plus one state. For any other state, it really does not matter which direction we go, as long as we stay in this world, so we can represent that by saying any direction works. The optimal policy here says that for all of these other states, I don't care what the action is; any action is optimal. (There is a short simulation sketch of this policy below.)

That's everything for this video. I've shown you a couple more examples of the optimal policy for different reward functions, and we've made some interesting observations about how the optimal policy changes based on changes in the reward function. I haven't told you yet how to solve a Markov decision process and actually compute these optimal policies, but I hope these examples have sparked your interest in learning how. Thank you very much for watching. I will see you in the next video. Bye for now.
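Here is the promised sketch of the "life is good" policy, reusing the step function and the assumed layout from the first sketch. The per-step reward value is made up, since the video only requires it to be positive: the agent heads directly away from both goal states where it matters, acts arbitrarily everywhere else, and its total reward grows linearly with the number of steps.

```python
import random

# Hypothetical policy for the "life is good" case, reusing step() from the
# first sketch: go directly away from the minus one state S24 = (2, 4) and
# the plus one state S34 = (3, 4); any action works everywhere else.
POLICY = {(1, 4): "up", (2, 3): "left", (3, 3): "left"}
GOALS = {(2, 4), (3, 4)}
REWARD = 0.1  # assumed value; the video only requires R(s) > 0

def sample_next(state, action):
    """Draw one successor state from the slip-model distribution."""
    states, probs = zip(*step(state, action).items())
    return random.choices(states, weights=probs)[0]

state, total = (1, 1), 0.0
for _ in range(10_000):
    action = POLICY.get(state, random.choice(["up", "down", "left", "right"]))
    state = sample_next(state, action)
    total += REWARD
    assert state not in GOALS  # under this policy the goals are unreachable
print(total)  # 1000.0: the return grows linearly, without bound
```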