Hi everyone, this is Alice Gell. In this video, let's work on applying the value iteration algorithm to the 3x4 grid world. Recall what the value iteration algorithm looks like; if you're fuzzy on the details, you may want to go back to the previous video to review. Also recall the values of some of the parameters I introduced a couple of videos ago. We will use a discount factor equal to 1, and we will assume that the immediate reward of entering any non-goal state is -0.04. And when we apply the value iteration algorithm, we will assume that all of our estimates start at 0 for iteration zero.

This video is going to be focused on example questions, answers, and also explanations for the calculations. Because of this, I won't make separate videos for the explanations; I will include the questions, the answers, and the explanations all together. Now, you should definitely take some time to do some of the calculations yourself, so make sure you pause the video, work through them, and then keep watching for the answers.

To start, we are going to assume that the v sub 0 values are all 0's, as I'm reminding you in the table that you can see right now. We will calculate two values for iteration number one. The first question is for state s23, which is this state right here, and the second question is for state s33, which is this state right here. Now pause the video, take some time to do these calculations yourself, and then keep watching for the answers.

For the first question, the correct answer is option A, and the exact value is that v sub 1 of s23 should be -0.04. I've included the calculations here. You can also see this calculation on a later slide in a bigger font, so if you have trouble seeing it, you can look at the annotated slides. Now let's go through the calculation and make sure we understand how to do the value updates. On the left-hand side, we have the target value that we are trying to calculate. And on the right-hand side, what do we have? First of all, we have the immediate reward of entering that state, which is -0.04, and then we add the discount factor, which is 1, multiplied by a maximum. The maximum is over the expected value of taking each of the four possible actions. So here, similar to before, I've labeled the four cases with the four possible actions that we can take.

Let's start with the action of going right, which intuitively seems like the worst choice (although I can't really say that yet), because we are heading directly towards the minus one state, which is probably not what we want to do. If we take that action, then what happens? With an 80% chance we'll get minus one, with a 10% chance we'll go up and get zero, and with a 10% chance we'll go down and get zero. So this is exactly the same as how you calculate the expected utility of taking an action, same as before. Next, let's look at the action of moving left. If we go left, with an 80% chance we'll bump into the wall and come back, which is why we have 80% multiplied by zero. Then with a 10% chance we'll go up, which is zero, and with a 10% chance we'll go down, which is zero, so everything is zero. Finally, the actions up and down are actually symmetric, so the calculations turn out to be the same: if we go up, then with an 80% chance we get zero, with a 10% chance we drift left, bump into the wall, and get zero, and with a 10% chance we drift right and actually go into the minus one state. So the only really different parts here are the 10% chance multiplied by minus one in each case; everything else is the same.

These calculations are really the same as how we would write out the Bellman equation, except we have to be careful that on the right-hand side, when we're doing the calculations, we use all the values from iteration zero, whereas on the left-hand side, we're deriving a value for iteration one.

Now that we've looked at the calculations, let's also look at the grid world to see whether this makes sense. Well, the calculation makes sense because going right is indeed the worst thing to do: there's a really high probability that we'll get into the minus one state and exit the world, and that's the worst outcome. Going up or down is slightly better because our chance of getting into the minus one state is now much smaller, only 10%, whereas going left seems like the best thing to do because it guarantees an expected utility of zero, as opposed to going right, which gives a negative expected utility. So indeed, the maximum here is going to choose left as the best action, which gives us an expected utility of zero, and that's how we get -0.04 + 1 × 0 = -0.04.
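If you'd like to check this arithmetic in code, here is a minimal Python sketch of this single update. The transition model (80% intended direction, 10% each perpendicular direction, and bumping into a wall leaves you in place) and the parameters are the ones stated above; the variable names are my own, not from the video.

```python
# One value-iteration update for s23, whose right-hand neighbor is the -1
# terminal state and whose left-hand neighbor is the wall.
gamma = 1.0       # discount factor
reward = -0.04    # immediate reward of entering a non-goal state
v0 = 0.0          # every non-terminal state's estimate at iteration 0

# Expected value of each action from s23 (the -1 terminal contributes -1):
q_right = 0.8 * (-1) + 0.1 * v0 + 0.1 * v0   # 80% straight into the -1 state
q_left  = 0.8 * v0 + 0.1 * v0 + 0.1 * v0     # 80% bump into the wall, stay put
q_up    = 0.8 * v0 + 0.1 * v0 + 0.1 * (-1)   # 10% drift right into the -1 state
q_down  = 0.8 * v0 + 0.1 * v0 + 0.1 * (-1)   # symmetric to going up

v1_s23 = reward + gamma * max(q_right, q_left, q_up, q_down)
print(v1_s23)  # -0.04, because the max picks going left (expected value 0)
```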
Let's look at the second question: what is v sub 1 of s33? I've highlighted the state on the grid world, and the process of calculation is extremely similar, so I'm not going to explain the process again. But one thing to be careful about here is that we just calculated the value of v sub 1 of s23, right? So if you were doing this yourself, you might have crossed out that zero and put -0.04 there, because that's the value for the next iteration. Be careful: when you're calculating v sub 1 of s33, you plug in the old value of zero for s23 rather than the new value of -0.04. Okay, that's the only thing you need to be careful about. Otherwise, you should get the answer of 0.76.

And again, this answer makes intuitive sense, right? For the state s33, the best thing we can do is go right, because that gives us a really high probability of getting into the plus one state. You can see in the expression there that going right gives us an expected utility of 0.8, which is great, whereas all the other actions are not that great. Going up and down are similar to each other because each gives only a 10% chance of getting into the plus one state, and going left is the worst because we're guaranteed to get zero. Therefore, the max is going to choose right, and we do our calculation, which is -0.04 + 1 × 0.8 = 0.76.

On the next slide, I've included the two calculations again in a bigger font, so you can take a look if you like.
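The same one-step backup can be sketched in Python as a sanity check. This assumes the layout described above, where s33's right-hand neighbor is the +1 terminal; note that all neighbors use their iteration-zero values, s23 included.

```python
# One value-iteration update for s33, whose right-hand neighbor is the +1
# terminal state. All neighbors use their iteration-0 values, so the freshly
# computed v1(s23) = -0.04 is deliberately NOT used here.
gamma, reward = 1.0, -0.04
v0 = 0.0  # iteration-0 estimate for every non-terminal neighbor, s23 included

q_right = 0.8 * 1 + 0.1 * v0 + 0.1 * v0    # 80% straight into the +1 state
q_left  = 0.8 * v0 + 0.1 * v0 + 0.1 * v0   # heads away from the goal: all zeros
q_up    = 0.8 * v0 + 0.1 * v0 + 0.1 * 1    # 10% drift right into the +1 state
q_down  = 0.8 * v0 + 0.1 * v0 + 0.1 * 1    # symmetric to going up

v1_s33 = reward + gamma * max(q_right, q_left, q_up, q_down)
print(round(v1_s33, 2))  # 0.76, because the max picks going right (0.8)
```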
If you want more practice questions, I've included an additional slide with an empty table for v sub 1, so you can do more practice and fill in this table. But in fact, this table is really easy to fill in. And why is that? Well, the reason is that the only state that will have a positive expected utility is s33, which we already calculated, and the reason is that this is the only state adjacent to the plus one state. From every other state, taking one step is not enough to reach the plus one state, so for every other state the updated estimate is -0.04.

In other words, you can intuitively interpret these value updates as follows: when we are calculating the estimates of the v values for iteration i, only the states that can reach the plus one state in i steps or fewer will have a positive estimate of the v value. If a state is unable to reach the plus one state within i steps, then it's only going to accumulate negative reward. In this case, for v sub 1, only s33 can get to the plus one state in one step, which is why it is the only state with a positive estimate for the v value. Every other state has a negative estimate, because the best thing it can do is get zero and then take one step, which costs it a little bit for the exploration.

I've included three more questions for iteration number two. These are really good practice questions for you, but I'm not going to go through the process again; I've included the answers and also the detailed calculation process so you can check your work. Here are the answers. For iteration two, for state s33 the value estimate is 0.832, for s23 the value estimate is 0.464, and for state s32 the value estimate is 0.56. I also have the value estimates for all nine states. The reason I included these three questions is that these are the only three states that will have a positive estimate for the v value, and that's because these states can reach the plus one state in two steps or fewer. We are dealing with iteration number two now, so if a state can reach the plus one state in two steps or fewer, it will have a positive estimate for the v value; otherwise, it will accumulate negative rewards.
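To check one of these by hand, here is the iteration-two backup for s33 in the same sketch style, now plugging in the iteration-one values. The geometry assumed (drifting up from s33 bumps the top wall and stays put, drifting down lands in s23) is my reading of the grid described above, and it reproduces the answer on the slide.

```python
# Iteration-2 backup for s33, now using the iteration-1 estimates:
# v1(s33) = 0.76, v1(s23) = v1(s32) = -0.04, +1 terminal contributes 1.
gamma, reward = 1.0, -0.04
v1_s33, v1_s23 = 0.76, -0.04

# Going right: 80% into +1, 10% drift up (bump the wall, stay in s33),
# 10% drift down into s23.
q_right = 0.8 * 1 + 0.1 * v1_s33 + 0.1 * v1_s23

v2_s33 = reward + gamma * q_right  # right is still the max over the four actions
print(round(v2_s33, 3))  # 0.832, matching the answer on the slide
```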
Finally, let me make a few observations about the value iteration algorithm. First of all, as I've already tried to explain, when we're executing the value iteration algorithm, every state accumulates negative rewards until the algorithm finds a path to the plus one goal state. This observation is a little bit specific to our grid world, but it can help you understand what the value iteration algorithm is doing. Because the immediate reward of entering any non-goal state is negative, a state is basically trying to find a path to the plus one goal state, and before it finds such a path, it keeps accumulating negative rewards. In other words, it's saying: if I take one step, I can only reach non-goal states with value zero, so all I've accumulated is the negative reward of exploration. But as soon as I can find a path to the plus one goal state, that plus one will add to my existing negative reward and make the total positive.

Now, I want to make a final point about the value iteration algorithm, and this point is very important. There are in fact two versions of the value iteration algorithm: the synchronous version and the asynchronous version. When I was walking through the calculations, I was strictly following the synchronous version.

The synchronous version says we start with the estimates for iteration i, which correspond to a big table in our example, and we use all of these estimates to calculate the estimates for the next iteration, iteration i plus one. Once we have calculated the estimates for all the states, we update all of them at the same time. In other words, let me look at the previous slide and explain this again: if you were to implement this in code, you would store two arrays, or two data structures. One data structure keeps track of all the estimates from the previous iteration; on the slide, those are the estimates for iteration one. Then you have another, identical data structure keeping track of the estimates for the new iteration, here iteration two. So we fill the blue table down here with the new estimates first, and then we replace the old estimates with the new estimates all at once. Alright, so that is the synchronous version of value iteration.

But there is in fact an asynchronous version. The asynchronous version says we do not have to sweep through all the states for one iteration before we move on to the next iteration. In fact, we can update the values of the states at any time, in any order. It doesn't matter whether we've updated the estimates for one state more frequently than the estimates for other states: eventually, if we do these updates for all of the states infinitely often, then we are guaranteed to converge to the optimal values. So the asynchronous version suggests that if there are certain states that are more promising, maybe you want to update their values more frequently so that they converge faster. The one thing you have to make sure of is that the estimate for each state is updated a sufficient number of times, so that all of the values converge.
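To tie this together, here is a minimal sketch of the synchronous version in Python for this grid world. The wall and terminal coordinates are my inference from the video's descriptions (the wall sits left of s23, the -1 terminal right of s23, the +1 terminal right of s33), and the helper names are my own; it reproduces the iteration-two answers above.

```python
# A minimal sketch of synchronous value iteration on the 3x4 grid world.
# States are (row, col) pairs matching the s{row}{col} names in the video.
# Inferred layout: wall at (2, 2), +1 terminal at (3, 4), -1 terminal at (2, 4).
GAMMA = 1.0
REWARD = -0.04                      # reward for entering any non-goal state
ROWS, COLS = 3, 4
WALL = (2, 2)
TERMINALS = {(3, 4): 1.0, (2, 4): -1.0}
MOVES = {'up': (1, 0), 'down': (-1, 0), 'left': (0, -1), 'right': (0, 1)}
DRIFT = {'up': ('left', 'right'), 'down': ('left', 'right'),
         'left': ('up', 'down'), 'right': ('up', 'down')}

def step(state, move):
    """Where one attempted move lands, deterministically (bumps stay put)."""
    r, c = state
    dr, dc = MOVES[move]
    nxt = (r + dr, c + dc)
    if nxt == WALL or not (1 <= nxt[0] <= ROWS and 1 <= nxt[1] <= COLS):
        return state
    return nxt

def value(V, state):
    """Terminal states are worth their fixed reward; others use the table."""
    return TERMINALS[state] if state in TERMINALS else V[state]

def backup(V, state):
    """One Bellman backup: reward plus the best action's expected value."""
    best = max(0.8 * value(V, step(state, a))
               + 0.1 * value(V, step(state, DRIFT[a][0]))
               + 0.1 * value(V, step(state, DRIFT[a][1]))
               for a in MOVES)
    return REWARD + GAMMA * best

states = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)
          if (r, c) != WALL and (r, c) not in TERMINALS]

V = {s: 0.0 for s in states}                 # v_0: all zeros
for _ in range(2):                           # two synchronous sweeps
    V = {s: backup(V, s) for s in states}    # new table built from the old one

print(round(V[(3, 3)], 3), round(V[(2, 3)], 3), round(V[(3, 2)], 3))
# 0.832 0.464 0.56 -- the iteration-two answers from the slides
```

Continuing from the sketch above, the asynchronous version would instead update the table in place, reusing fresh values immediately; the order and frequency of the updates can vary, as long as every state keeps getting updated:

```python
# Asynchronous (in-place) variant: each backup immediately overwrites V[s],
# so later backups in the same sweep already see the fresh values.
for _ in range(50):
    for s in states:          # any order works; states may even repeat
        V[s] = backup(V, s)
```

That's everything for this video. After watching this video, you should be able to apply the value iteration algorithm to solve for the expected utility of the optimal policy for a Markov decision process. Thank you very much for watching. I will see you in the next video. Bye for now.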