Hi everyone, this is Alice Gao. In this video, I'm going to introduce the policy iteration algorithm. Policy iteration is one of the two main approaches to solving a Markov decision process, the other being the value iteration algorithm. Knowing policy iteration is not only useful for solving a Markov decision process; it's also useful for understanding reinforcement learning algorithms. In particular, one key idea from policy iteration, called policy evaluation, will appear again and again in the reinforcement learning algorithms that I'm going to talk about in the next few videos.

To describe the high-level ideas of policy iteration, let's first recall how value iteration works. The main idea of value iteration is that we iteratively estimate the V values, which are the expected utilities of following the optimal policy starting from each state. Once we have accurate estimates of these expected utilities, we derive the optimal policy based on them. Now, when executing the value iteration algorithm, you may observe the following: although the utility values haven't converged yet, the optimal policy has already stopped changing. What does this observation tell us? It tells us that deriving the optimal policy does not necessarily require accurate estimates of the utility values; once the utility estimates are reasonably good, we can already derive the optimal policy. This observation inspired the policy iteration algorithm.

The main idea of policy iteration is that we alternate between improving the utility values and improving the policy. In particular, this is what happens. We start with an arbitrary initial policy. Based on that policy, we perform a step called policy evaluation to calculate updated utility values: the utility of each state if the current policy were executed. Next, now that we have the updated utility values, we perform a step called policy improvement, where we take the updated utility values and calculate a new policy based on them. The new policy answers the question: given these utility values, what is the best action to take in each state?

If I illustrate the idea with a picture, it looks like the diagram above. We start with some arbitrary initial policy and perform policy evaluation to get updated utility values. Based on the updated utility values, we derive a new policy. Based on the new policy, we again evaluate that policy to get updated utility values. We keep repeating this until we reach a point where the policy does not change anymore. At that point, further improvement in the utility values no longer matters, because as long as we keep using the same policy, we are done. This is reflected in the final step: once we have reached the optimal policy, the utility values computed from it lead us back to that same optimal policy.

So you can see that policy iteration has a slightly different focus from value iteration. The focus of value iteration is to first derive the most accurate estimates of the utility values and then derive a policy based on them, whereas policy iteration iteratively updates both the policy and the utility values until the optimal policy emerges.
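To make this alternating structure concrete before we look at the two steps, here is a minimal sketch in Python. The function name and the helpers `evaluate` and `improve` are my own illustrative placeholders for the policy evaluation and policy improvement steps described above, not names from the lecture.

```python
def policy_iteration(initial_policy, evaluate, improve):
    """Alternate policy evaluation and policy improvement until the policy stops changing.

    evaluate(policy) -> the utility of each state if `policy` were followed.
    improve(utilities) -> the best action in each state given those utilities.
    Policies are assumed to be plain dicts or tuples, so `==` compares them directly.
    """
    policy = initial_policy                  # start from an arbitrary policy
    while True:
        utilities = evaluate(policy)         # policy evaluation
        new_policy = improve(utilities)      # policy improvement
        if new_policy == policy:             # policy stopped changing: it is optimal
            return policy, utilities
        policy = new_policy
```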
Let's now take a look at how we can perform the two steps, policy evaluation and policy improvement. It turns out that policy improvement is the easier of the two. For policy improvement, we go from the utility values to an updated policy. In fact, we already know how to do this from the value iteration algorithm, because this is exactly what we do at the end of value iteration once the utility values have converged. We take the converged utility values and calculate the Q values, which are the expected utilities of taking a particular action in a particular state. Having calculated the Q values, the only thing left to do is figure out the best action in each state based on the Q values. That is why we take the maximum and find the action that achieves the maximum expected utility. So this is exactly the same step we performed in the value iteration algorithm to derive the best policy given a set of estimates for the utility values.

For the policy evaluation step, we go from a policy to updated estimates of the utility values. To do this, we can solve a system of equations, as I'm showing you here. This system of equations looks very similar to the Bellman equations, but it's actually a little different; it is in fact a simplified version of the Bellman equations. So why is it sufficient to solve this system of equations? Well, look at these equations. We already know the reward function, we already know the transition probabilities, and because we already know the policy, the policy pi determines the action we take in each state. Given this, we no longer need the maximum, because we already know which action we're taking, and the transition probabilities simply tell us the probability of reaching each successor state under that action. So in this system of equations, the only unknowns are the utility values. We can write down the system of equations and then solve for the utility values, as sketched in the code below.

Let me stop here for this video. In the next video, I will talk about how we can solve this system of equations to get the updated estimates of the utility values. It turns out that performing the policy evaluation step is much simpler than performing the corresponding step in the value iteration algorithm. Thank you very much for watching. I will see you in the next video. Bye for now.
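As a concrete illustration of the two steps, here is a minimal sketch in Python with NumPy, assuming state-based rewards R(s) and a discount factor gamma. The tiny 2-state, 2-action MDP and all of its numbers are made up purely for illustration; they are not from the lecture.

```python
import numpy as np

# A tiny, made-up MDP: 2 states, 2 actions, state-based rewards R(s),
# discount factor gamma, and P[a, s, s'] = probability of reaching s'
# from s when taking action a.
R = np.array([0.0, 1.0])
gamma = 0.9
P = np.array([
    [[0.9, 0.1],    # action 0 from state 0
     [0.9, 0.1]],   # action 0 from state 1
    [[0.2, 0.8],    # action 1 from state 0
     [0.1, 0.9]],   # action 1 from state 1
])

def policy_evaluation(policy):
    # Simplified Bellman equations for a fixed policy pi:
    #   V(s) = R(s) + gamma * sum_{s'} P(s' | s, pi(s)) * V(s')
    # With no max, this is a plain linear system (I - gamma * P_pi) V = R.
    n = len(R)
    P_pi = np.array([P[policy[s], s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def policy_improvement(V):
    # Q(s, a) = R(s) + gamma * sum_{s'} P(s' | s, a) * V(s'),
    # then pick the action with the highest Q value in each state.
    Q = R[:, None] + gamma * np.einsum('asn,n->sa', P, V)
    return np.argmax(Q, axis=1)

# Policy iteration: alternate the two steps until the policy stops changing.
policy = np.array([0, 0])                  # arbitrary initial policy
while True:
    V = policy_evaluation(policy)
    new_policy = policy_improvement(V)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print("optimal policy:", policy, "utilities:", V)
```

Note that policy evaluation here is a single call to a linear solver, which is the sense in which it is simpler than the corresponding update in value iteration: removing the max over actions turns the equations into a linear system with the utilities as the only unknowns.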