Hi everyone, this is Alice Gao. In the previous few videos, I defined what a Markov decision process is, and I showed you the 3x4 grid world and a couple of examples of the optimal policy for that grid world. In this video, I'm going to start talking about solving a Markov decision process using the value iteration algorithm. I'll do this in two steps. In step one, I will show you that if we already know the expected utility of the optimal policy, then we can use this information to easily figure out the optimal policy. Then in step two, I will show you how to solve for the expected utility of the optimal policy using the value iteration algorithm.

This is a reminder of what our grid world looks like. There are three rows and four columns. The robot starts from the top left-hand corner, and there are two goal states: a reward of minus 1 for S24 and plus 1 for S34. There is also a cost of exploring this world: whenever the robot reaches a non-goal state, it gets a small negative reward of minus 0.04. This is to discourage the robot from staying in this world forever. We also have a discount factor, gamma, which models the fact that if we can get the same reward either today or tomorrow, we prefer getting it today. Any future reward is multiplied by this discount factor. To keep our calculations simple, I've set the discount factor to 1. In general, the discount factor is a value between 0 and 1.

Let me start by defining some notation. We will use capital V to denote the expected utility of following a particular policy, and we'll use a superscript to denote which policy we are following. If the superscript is pi, we're following a given policy called pi; if the superscript is a star, we're following the optimal policy. You'll also note that capital V is a function of a state S. So the expected utility we're defining is really the expected utility the robot obtains by entering the state S first, getting the immediate reward for entering that state, and then following the given policy thereafter. One important point here is that capital V is not just the one-time reward of entering a particular state. We already have notation for that: we use big R of S to denote the one-time reward of entering a particular state. Capital V, on the other hand, denotes the long-term discounted reward the robot gets starting from state S, following the particular policy, and seeing where that policy takes it.

Let's look at an example now to see what the values of capital V look like. This is one example of the values of capital V, the expected utility of the optimal policy, for a discount factor of 1 and an exploration reward of minus 0.04. Let's make a couple of observations. In general, the closer we are to the plus 1 goal state, the higher the expected utility. You can see this most clearly if you follow the path from the starting state down towards the plus 1 state: the expected utility steadily increases, and that's because the closer we get to the plus 1 state, the fewer steps we need to reach it. If we take fewer steps, we incur less exploration cost before reaching the plus 1 state, so in the long run we expect our total discounted reward to be higher.
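Before we continue with these numbers, here is the definition of capital V written compactly. This is just a restatement of the description above: gamma is the discount factor, R(s_t) is the one-time reward for entering the state visited at step t, and s_0 = s is the state we enter first.

$$V^{\pi}(s) = E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t) \,\middle|\, s_0 = s,\ \text{actions chosen by } \pi\right], \qquad V^{*}(s) = V^{\pi^{*}}(s)$$

With the discount factor set to 1, this is simply the total reward the robot accumulates from state S onward.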
Now for the other states, the expected utilities are generally lower, and that's because we not only have to account for the plus 1 state, we also have to account for the effect of the minus 1 state. For example, let's look at the state S23. There is a sharp decrease from S33, which has an expected utility of 0.9-something, to S23, which has an expected utility of roughly 0.6. This big difference is largely due to the fact that, under the optimal policy, S23 has a relatively large chance of falling into the minus 1 state, whereas S33 has no chance of immediately reaching the minus 1 state. Under the optimal policy, the optimal action in S33 is to go right, so we might reach the plus 1 state, we might end up in S23, or we might stay where we are; none of those outcomes takes us directly into the minus 1 state. Also, take a look at S14. Its expected utility is the lowest among all the non-goal states, and that's because we are trapped in that corner and it's very likely that we will fall into the minus 1 state.

The purpose of this slide is to show you some example numbers. I haven't covered the value iteration algorithm yet, so I don't expect you to be able to derive these numbers, but once I talk about the value iteration algorithm, it will become clear where they come from.

Now for the next step, let's assume we already know these numbers; that is, we are given the expected utility of the optimal policy for each state. Given these, we can determine the optimal policy in two steps. In the first step, we look at all the possible actions we can take in the current state. So let's assume our current state is S and that there are some possible actions we can take in this state. For each action, we calculate the expected utility of taking that action, where this expected utility is again over the long term, not just the immediate reward. Once we've calculated the expected utility of taking each action, we just have to choose the best one.

For the first calculation, we define a new quantity called Q, which represents the expected utility of taking a particular action in the current state. In the current state S, if I take a particular action A, what happens? With some probability, given by our transition probabilities, I'm going to get to a next state S prime. And once I reach state S prime, what happens? We already have the quantity big V, which tells us how well we're going to do in the long term if we enter state S prime, get the immediate reward, and then follow the optimal policy thereafter. So we can use V star as the best estimate we have for how well we're going to do in the long term. Now, when I take a particular action A in the current state S, I might end up in multiple possible next states. To account for this, I calculate a summation over all the possible next states I can get into. This is the quantity Q. We calculate a Q value for each possible action, and each Q value tells us how good that action is. The next step is quite simple: we want to choose an action that maximizes our expected utility. In other words, we choose the action A that achieves the highest Q value. This gives us our optimal policy.
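To make these two steps concrete, here is a minimal code sketch of the policy-extraction idea just described. This is not code from the course: the dictionary layout is an assumption, the transition model P(s' | s, a) is assumed to be given, and the discount factor gamma multiplies the V star value of the next state (with gamma set to 1, as in our grid world, it has no effect).

```python
# Sketch: extracting the optimal policy from known V* values.
# Assumed data layout (not from the video):
#   transitions[(s, a)] is a list of (probability, next_state) pairs, i.e. P(s' | s, a)
#   V[s'] is the expected utility of the optimal policy starting from s'
#   (it already includes the one-time reward R(s') for entering s')

def q_value(V, transitions, s, a, gamma=1.0):
    """Q(s, a): expected long-term utility of taking action a in state s."""
    return sum(p * gamma * V[s_next] for p, s_next in transitions[(s, a)])

def optimal_action(V, transitions, s, actions, gamma=1.0):
    """pi*(s): the action with the highest Q value in state s."""
    return max(actions, key=lambda a: q_value(V, transitions, s, a, gamma))
```

For our grid world, the 80/10/10 noise model would be encoded in the transitions dictionary, with any deviation that runs into a wall mapping back to the current state.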
We'll use pi star to denote our optimal policy. Let's now try to apply these formulas in an example. Here's a question for you: given the values of capital V, the expected utility of the optimal policy, and the formulas I just talked about, determine the optimal action for state S13. Highlighted in the grid world, this is the state with a V value of 0.611. Take a few minutes to try this calculation yourself, and then keep watching for the answer. I'm going to give you the answer in this video, and I will make a separate short video to explain the calculation process.

The correct answer is left. Left is the best action: we can simply calculate the Q value for all four actions, and it turns out left achieves the highest Q value. One important point to remember here is that actions are noisy in this world; when you take a particular action, it may not achieve its intended effect. So it's not sufficient to simply look at this table of V values and compare the values in the different directions. For example, if you compare down and left, the expected utility of the neighbouring state in the intended direction is actually higher for down: down points to 0.660 whereas left points to 0.655. If you just compared these two numbers, you might think that going down is better. But recall that each action has a 10% chance of going either to the state on its left or to the state on its right, so we need to take those cases into account. If you do that, you will find that left is better than down. You can watch the separate video for the detailed calculation process.

That's everything for this video. After watching it, you should be able to derive the optimal policy for each state, given the expected utility of the optimal policy for each state, denoted by the capital Vs. Thank you very much for watching. I will see you in the next video. Bye for now.