All right, having seen Bellman equations, we are now at long last at a place where we can actually look at our first algorithms for solving MDPs, meaning finding the optimal policies of MDPs when we do have the true transition probabilities and the true reward function. Remember that the Bellman equation gives us this definition of the optimal value function, and our very first algorithm, which is called the value iteration algorithm, basically consists of converting this definition into an update rule. Notice that this definition involves v star of s on both sides of the equation, so it's a recursive definition. If you've taken courses in programming, then you'll know that this is related to dynamic programming, where we often encounter situations like these. The idea is simply to update your current estimate: your new estimate of v star of s should be set to this right hand side, where the right hand side depends on your current estimate of v star of s. So there's going to be some sort of iterative process of continuously refining your estimate of v star of s, where the way you refine it is by setting the new, refined value equal to this right hand side. Effectively, you've taken the Bellman equation, which describes a property of the optimal value function, and converted that equation into an update rule. The way you do that is you start with the values set to zero for all states s, and then at every subsequent iteration, because you're going to compute the true value function iteratively with dynamic programming, you simply set the left hand side to the right hand side, where the left hand side is v i plus one and the right hand side uses v i, and the index refers to the iteration number. So the value function at the next iteration is computed from the value function as it currently stands: you just plug the current value function into the right hand side of the Bellman equation.

Okay, let's see how that actually plays out. So again, here is the Bellman update rule. At every iteration, we're going to update our value function, and at each update you update the value function for all states. After having run a lot of updates, so as i tends to infinity, you're guaranteed to get the correct optimal value function v star of s. Okay, so let's look at how this happens over the iterations. We're going to look at an MDP that looks a lot like what we saw earlier, the grid world MDP. This time, for simplicity's sake, we set the living reward to zero, which means that a time step spent in the environment is neither penalized nor rewarded. Earlier, we spoke about having a small negative living reward, but right now it's exactly zero. We still have the transition probabilities where only 80% of the time the chosen action is executed correctly, and we still have the two terminal states, which have rewards plus one and minus one. Okay, now let's think about what happens at the zeroth iteration and the first iteration. Remember, we said we were going to initialize our estimate of the optimal value function to all zeros, so at iteration zero the values are all zeros.
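Before walking through the grid world numbers, here is a minimal sketch of what that update rule can look like in code. It is written in Python against a hypothetical tabular representation of the MDP: the names states, actions, P, R, and gamma are assumptions made for this illustration, not anything specified in the lecture.

```python
# Minimal value iteration sketch over a hypothetical tabular MDP.
# P[s][a] is a list of (prob, s_next) pairs, R(s, a, s_next) is the reward
# for that transition, and gamma is the discount factor.
def value_iteration(states, actions, P, R, gamma, num_iters=100):
    V = {s: 0.0 for s in states}          # V0: all values start at zero
    for _ in range(num_iters):            # each sweep computes V_{i+1} from V_i
        V = {
            s: max(                                           # max over actions ...
                sum(p * (R(s, a, s2) + gamma * V[s2])         # ... of expected reward plus
                    for p, s2 in P[s][a])                     # discounted next-state value
                for a in actions)
            for s in states
        }
    return V
```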
And at the first iteration, we're going to take this value function and plug it into this equation here to get our first estimate, V1 of s. What will it look like? Well, the value terms are all going to be zero, because that's what we have over here in V0, but the equation also has this term here, R of s a s prime, and we do have the rewards for these two states already. So given that the values are zero and we have R of s a s prime, at V1 all that we'll get is plus one and minus one; we're just recovering the rewards that we already had. Now that we've computed the values for the first iteration, let's move on to computing the values at the second iteration, because that's exactly what value iteration asks us to do: sweep over all the states and apply the Bellman update rule up here, so that your new value function is a function of your old value function. So we're going to get V2 in terms of V1. Let's look in particular at these two squares once more, 3 comma 4 and 2 comma 4, and observe that because they're terminal states, there are going to be no new rewards. So you're going to set V2 of 3, 4 and V2 of 2, 4 to be equal to plus one and minus one. The episode has terminated, there's going to be no future, so there is no point in doing anything else; you can just keep the terminal states as they are. Let's look at something more interesting and look at this square, 3 comma 3. From 3 comma 3, there are various squares you can end up at. If you move in this direction, you could end up at this square or at this square. If you moved in this direction, you could end up at this square or at this square. And if you tried to move in this direction, you would only end up right here. Those are the rules of our grid world. So for the moment, consider the case where you try to move in this direction, and you can tell from the structure of the grid world that that is likely to be the optimal action. Just because the full maximum is cumbersome to compute, let's not compute it; instead, take moving towards the right to be the optimal action and just compute what comes on the inside. So let's ignore the maximization for now and assume that the maximizing action is the one that tries to move towards the right. Then how do we compute the value? Well, 0.8 is the probability that if you execute the action move right, you do in fact correctly move right; this structure of our MDP was already given to us in the transition function. So 0.8 times, and then what do we have here? R of s a s prime: there is no reward for moving from here to here. There will be a reward for being there, but there is no reward for the move itself, so we set that R equal to zero. And then, because we know the value at this square is plus one, and we're going to set the gamma value to 0.9, as we said over here, we have 0.8 times (zero plus 0.9 times one). But we're not quite done yet, because remember, if we move in this direction, there is still a chance that you might end up over here. So the next thing we have to compute is the term where, with 0.1 probability, you end up over here.
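For reference, here is that whole backup for square 3 comma 3 written out numerically in one place. This is just a small sketch using gamma equal to 0.9, a living reward of zero, and the 0.8 / 0.1 / 0.1 slip probabilities from the lecture; the two 0.1 terms are spelled out in the narration that follows.

```python
# Bellman backup for square (3,3) at iteration 2, taking "move right" as the
# maximizing action; V1 is +1 at the terminal square (3,4) and 0 everywhere else.
gamma = 0.9
q_right = (0.8 * (0 + gamma * 1.0)    # intended outcome: land on the +1 terminal
           + 0.1 * (0 + gamma * 0.0)  # one slip outcome: a square whose V1 is 0
           + 0.1 * (0 + gamma * 0.0)) # the other slip outcome, also a V1 = 0 square
print(round(q_right, 2))              # 0.72
```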
So let's take 0.1 times: 0.1 is the probability of transitioning to that state, the reward again is going to be zero, and the value function at that state is currently zero. So that term is 0.1 times (zero plus 0.9 times zero). And finally, there's a 10% probability that we instead move in this direction, and because of that, we will end up at the same square as before. That again produces the same kind of term, 0.1 times (zero plus 0.9 times zero). So both of these terms go to zero, which means that all that we're left with is 0.72. So, given that moving right is the optimal action, 0.72 is the value of this square. And at this point you can verify, I'm not going to go through all the entries in here, that all of the other squares will still have zeros; only 3 comma 3 gets updated. Okay. Now that we've found V2, let's compute V3. For V3, if you actually do the math, you'll find that it works out to be 0.52 over here, 0.43 over here, and zeros elsewhere. So you can see how this information about where the reward is kind of propagates outwards from the terminal states, because that's where we'd like to end up, where the optimal policy will end up. So you're kind of backing up the rewards from there through your entire value function matrix. And if you keep repeating this over and over, that information will continue to spread until it has fully propagated and all states have the correct value estimates. So here is value iteration once more: you start with the values of all states set to zero, and then you keep iterating, using the Bellman equation, which is a property of the optimal value function, as an update rule, and you end up eventually with the true value function. In fact, you can prove that value iteration does indeed converge to the optimal value function. All right. Next, we're going to look at another approach, an alternative to value iteration called policy iteration. Policy iteration involves as a subroutine the concept of policy evaluation, which simply means computing the value function corresponding to a policy. Remember, a value function is associated with some particular policy, and the optimal value function is the value function associated with the optimal policy. So in value iteration, we were dealing throughout with the optimal value function, but right now we're going to try to compute the value function for some arbitrary policy. And it turns out that there is a version of the Bellman equation that already exists for arbitrary policies. So just like we used the Bellman equation for the optimal value function in value iteration, in policy evaluation we're going to use the Bellman equation for an arbitrary policy, and we're going to use it as an update rule for computing the value of that policy. In particular, we're going to again start by setting the values of all states to zero. These are no longer the optimal values; these are just the values of some given policy pi. And then we're going to apply this update rule, which corresponds to the Bellman equation for an arbitrary policy.
And here, all that has changed, if you compare what's going on in here with what's going on over here, is that there is no maximization. Instead, you're just applying the action that is prescribed by the policy you're trying to evaluate, so there is no maximization over a. And in the reward, correspondingly, because a is set to pi of s, you again have R of s, pi of s, s prime. And that's basically it. Okay, how are we going to use this policy evaluation step as a subroutine in an algorithm that can actually solve an MDP? Well, that algorithm is called policy iteration, and it consists of two steps repeated over and over till convergence. The first step is precisely the subroutine that we just saw, policy evaluation, where you take the current policy pi and evaluate its value function. The second step is policy improvement, where you find the best action according to the value function you've just evaluated. So you evaluated the last policy, and based on that evaluation you have a value function, and based on that you're going to select the best action: you take the argmax over actions of the sum over next states of the transition probability times (the current step reward plus gamma times the value function that you currently have). That produces your new policy; it selects the best action from each state. So you go back and forth between evaluating the policy and improving the policy based on that evaluation, where evaluating means finding the value function. You find the value function for a policy, then find the policy that would be optimal under that value function, then find the value function for that policy, then find the policy that would be optimal under that value function, and you keep repeating this over and over. And again, it's guaranteed to converge. In particular, think about how this is related to value iteration. As a hint, I'll pose this question: if your policy evaluation step did not wait for the value function to converge, as we're asking it to do over here, but instead ran just one iteration of policy evaluation, how would this algorithm then be related to value iteration? And I'll drop another hint: it actually reduces to exactly value iteration if you do this. So policy iteration can be proven to converge to the optimal policy too. And I just told you that policy iteration reduces to value iteration if you do not apply the first step perfectly, if you do not perfectly evaluate the policy and instead do a very shabby job of evaluating it by running only one iteration of policy evaluation. Because you're evaluating the policy better here, policy iteration will need fewer outer loops, fewer steps of policy improvement, than value iteration needs iterations. But remember that policy evaluation can itself be expensive, so it need not be the case that you're faster overall. So we've talked about value iteration, where in each iteration you update the utilities, and that automatically corresponds to updating the policy; and policy iteration, where you run several iterations just to find the values for a fixed policy, and then, once you've converged to the values for the current policy, you update the policy using a policy improvement step.
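To make the two alternating steps concrete, here is a minimal policy iteration sketch in Python, using the same hypothetical tabular MDP representation as the value iteration sketch above (states, actions, P, R, and gamma are placeholder names assumed for illustration, not anything defined in the lecture).

```python
# Minimal policy iteration sketch over the same hypothetical tabular MDP:
# P[s][a] is a list of (prob, s_next) pairs, R(s, a, s_next) is the reward.
def evaluate_policy(policy, states, P, R, gamma, V=None, num_sweeps=100):
    """Bellman backups for a fixed policy: no max over actions."""
    V = dict(V) if V is not None else {s: 0.0 for s in states}
    for _ in range(num_sweeps):   # a fixed sweep count stands in for "until converged"
        V = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                    for p, s2 in P[s][policy[s]])
             for s in states}
    return V

def improve_policy(V, states, actions, P, R, gamma):
    """Greedy (argmax) action in every state under the current value estimate."""
    return {s: max(actions,
                   key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                     for p, s2 in P[s][a]))
            for s in states}

def policy_iteration(states, actions, P, R, gamma):
    policy = {s: actions[0] for s in states}              # arbitrary initial policy
    while True:
        V = evaluate_policy(policy, states, P, R, gamma)  # step 1: policy evaluation
        new_policy = improve_policy(V, states, actions, P, R, gamma)  # step 2: improvement
        if new_policy == policy:                          # policy is stable: converged
            return policy, V
        policy = new_policy
```

Splitting evaluation and improvement into separate helpers also makes the hint above easy to play with: weaken the evaluation step and the procedure drifts toward value iteration.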
And finally, these aren't the only two methods that exist. Just as policy iteration under some special conditions reduces to value iteration, you can actually find other methods that sit at intermediate points on a spectrum between value iteration and policy iteration, and these are typically called hybrid methods.
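One standard family of such hybrids is sometimes called modified policy iteration: keep a running value function, run only a few sweeps of policy evaluation for the current policy, then do a policy improvement step, and repeat. Here is a minimal sketch, reusing the hypothetical helpers from the policy iteration sketch above; with many evaluation sweeps it behaves like policy iteration, and with a single sweep it behaves essentially like value iteration.

```python
# Hybrid sketch: truncated policy evaluation (a few sweeps, carrying V over)
# alternating with greedy policy improvement.
def hybrid_iteration(states, actions, P, R, gamma, eval_sweeps=5, num_rounds=100):
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}      # arbitrary initial policy
    for _ in range(num_rounds):
        V = evaluate_policy(policy, states, P, R, gamma, V=V, num_sweeps=eval_sweeps)
        policy = improve_policy(V, states, actions, P, R, gamma)
    return policy, V
```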