Hi everyone, this is Alice Gao. In this video, I will continue talking about the policy iteration algorithm. The main idea of policy iteration is that we alternate between two steps. In the first step, we take a policy and evaluate it to update the utility values; this is called the policy evaluation step. In the second step, we take the updated utility values and derive a best policy given them; this is called the policy improvement step. In the previous video, I also showed you how to perform policy improvement and policy evaluation. It turns out that policy improvement is straightforward: we already know how to calculate the best policy given some estimates of the utility values. Performing policy evaluation is a bit more complicated, because it appears that we have to solve a system of equations which looks very similar to the Bellman equations. But it turns out that this system of equations is much simpler to solve than the Bellman equations. Let's understand why.

My main goal on this slide is to compare the equations for policy evaluation with the Bellman equations. First, let me highlight the parts of the two equations that are the same: we have the utility values, the reward for entering a state, and the discount factor. The remaining parts are what differ. So what is the main difference between the two? For policy evaluation, we are restricted to one particular policy. Because we are given the policy, it specifies what action we should take in each state. That is why we have the π(s) here, which is the action specified by the policy for this particular state. Given the action specified by the policy, we do not have to consider any other actions. That is why, in the policy evaluation step, you do not see a maximization over all possible actions, whereas in the Bellman equations you do see this maximum over all possible actions, because we don't know which action is the best. Mathematically, in terms of the expressions, this is the main difference between the two.

Now let's turn both of these into concrete equations, which will help us understand why solving the equations for policy evaluation is much easier than solving the Bellman equations. Let's write down both equations when the state we are considering is the starting state, so s is equal to s11, and let's assume that the policy π says we should go down when we're in state s11. Take some time, write down both equations yourself, and then keep watching for the answer.

Here are the answers. For policy evaluation, I've written down the equation, whereas for the Bellman equation, I've done this in previous lectures, so I've just copied and pasted the equation here. Again, you can see the main difference. For policy evaluation, we have already fixed the policy we're using, which means we have fixed the action we're taking. Since we know the action is going down, we only have to consider one of the four possible actions. So all we have is this one term for the action of going down: with an 80% chance we actually go down and reach s21, and with a 10% chance each we go left or right, where going left bumps into the wall and going right reaches s12.
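The equations themselves are on the slide, so here is a rough reconstruction of what they look like for s11, assuming the notation from the earlier lectures (V for utilities, R for the reward of entering a state, γ for the discount factor, P for the transition probabilities) and the 0.8/0.1/0.1 transition model just described:

```latex
% Policy evaluation at s = s_{11}, with the action fixed to \pi(s_{11}) = down:
% 0.8 we actually go down to s_{21}; 0.1 we slip left and bump into the wall
% (staying in s_{11}); 0.1 we slip right into s_{12}. No max over actions.
V^{\pi}(s_{11}) = R(s_{11})
  + \gamma \left[ 0.8\, V^{\pi}(s_{21}) + 0.1\, V^{\pi}(s_{11}) + 0.1\, V^{\pi}(s_{12}) \right]

% Bellman equation at s = s_{11}: one such term per action, and we take the
% maximum over all four actions (up, down, left, right).
V(s_{11}) = R(s_{11})
  + \gamma \max_{a \in \{\text{up},\,\text{down},\,\text{left},\,\text{right}\}}
      \sum_{s'} P(s' \mid s_{11}, a)\, V(s')
```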
So we have this one term without the maximization anywhere in it, whereas for the Bellman equation we do have the maximization over the four different actions: four terms for the four different actions, and the last one corresponds to the action of going down.

Now that I have carefully compared and contrasted these two cases, you should have enough information to figure out why policy evaluation is much easier to perform than solving the Bellman equations. So take some time and think about this yourself: why is it easier to solve the equations for policy evaluation than to solve the Bellman equations? Then keep watching for the answer.

Here's the answer. Policy evaluation is much easier to perform because the equations are linear: we don't have the maximization anywhere in them. We have many standard linear algebra techniques for solving a system of linear equations, whereas the Bellman equations are not linear, and there is no general, efficient algorithm for solving a system of non-linear equations. This is why performing policy evaluation is much simpler than solving the Bellman equations.

Given this, we can solve the policy evaluation equations in two ways. Let's look at them. First, we can perform policy evaluation exactly, solving these equations using standard linear algebra techniques. I'm not going to go over the details; that really belongs in another course on linear algebra or linear programming. But in general, if we have n states, then a standard technique will take roughly n cubed time to solve the system of equations. When the state space is small, when we don't have a lot of states, this is quite a reasonable time complexity. But when the state space is large, cubic time might still be too much, even though it is polynomial. So when the state space is large, we might want to solve the equations approximately instead of exactly.

If we're okay with solving policy evaluation approximately, then we can use something very similar to the idea of value iteration. After all, we have a set of equations that look very much like the Bellman equations, so we can do something very similar to the Bellman updates we did before. We take the policy evaluation equations and convert them into an iterative update rule: same as before, we turn the equality into an update rule where we plug in the current estimates of the V values on the right-hand side, and evaluating the right-hand side gives us the new estimates of the V values. The idea is that we can perform something very similar to value iteration: apply these update rules to refine the utility estimates for a few steps until they more or less converge. The number of steps is really up to us; it depends on how accurate we want the estimates to be. Using this approximation method, we can derive reasonably good estimates of the utility values without spending a lot of time.
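To make both options concrete, here is a minimal Python sketch of the two ways of performing policy evaluation. The data layout is my own assumption for illustration, not something from the lecture: R is a vector of rewards for entering each state, P[a] is the transition matrix for action a, policy[s] is the action the fixed policy chooses in state s, and gamma is the discount factor.

```python
import numpy as np

def evaluate_policy_exact(P, R, policy, gamma):
    """Solve the linear policy-evaluation equations exactly.

    In matrix form, V = R + gamma * P_pi V, i.e. (I - gamma * P_pi) V = R,
    where P_pi[s, s'] is the transition probability under the fixed policy.
    Standard solvers take roughly O(n^3) time for n states.
    """
    n = len(R)
    # Row s of the policy's transition matrix uses the action policy[s].
    P_pi = np.array([P[policy[s]][s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def evaluate_policy_iterative(P, R, policy, gamma, num_sweeps=20):
    """Approximate policy evaluation with value-iteration-style updates.

    Repeatedly plug the current estimates into the right-hand side:
        V(s) <- R(s) + gamma * sum_{s'} P(s' | s, policy(s)) * V(s').
    There is no maximization over actions, because the policy fixes the action.
    """
    n = len(R)
    V = np.zeros(n)
    for _ in range(num_sweeps):  # more sweeps give more accurate estimates
        V = np.array([R[s] + gamma * P[policy[s]][s] @ V for s in range(n)])
    return V
```

Either function produces the utility estimates that the policy improvement step then uses to pick a new best action in each state.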
This concludes our discussion of the policy iteration algorithm. Let me come back to slide six and summarize the key ideas. The key idea of policy iteration is that we iteratively improve both the policy and the estimates of the utility values. We do this via two types of steps. One is policy evaluation, where we use the current policy to estimate the utility values. After that, we do policy improvement, where we use the updated utility values to derive a new best policy. The policy improvement step turns out to be straightforward, as we've done similar things before. For the policy evaluation part, we need to solve a system of equations, but fortunately this system of equations is linear. There are two ways of solving it: we can solve it exactly in cubic time using standard linear algebra techniques, or, if we're okay with solving it approximately, we can use an iterative approach very similar to the value iteration algorithm. After a few iterations, we should be able to derive reasonably good estimates of the utility values, and we might be happy with that.

Thank you very much for watching. In the next video, I will start talking about passive reinforcement learning algorithms. Bye for now.