Hi everyone, this is Alice Gao. In this video, I'm going to start talking about the value iteration algorithm, which we can use to solve for the optimal policy of a Markov decision process. In particular, I'm going to introduce the Bellman equations, which are the key component of the value iteration algorithm.

In the previous video, I defined two important quantities: V and Q. V denotes the expected utility of the agent starting by entering a particular state, getting the immediate reward for entering that state, and then following a given policy thereafter. In other words, it denotes the long-term total discounted reward. Q is closely related: it also denotes a long-term total discounted reward, but Q assumes we are already in a particular state and asks what happens if we take a particular action. So Q measures the expected utility of taking a particular action in the current state.

It turns out that V and Q have a very natural relationship: we can define them recursively in terms of each other. You have already seen the second of these two equations. Equation 4 defines Q in terms of the Vs. What is the expected utility of being in a particular state s and taking an action a? Well, action a might take us to multiple possible next states. Once it takes us to a particular next state s', the value V(s') tells us how well we're going to do starting from that state s'. Taking the summation over all the possible next states, weighted by their probabilities, tells us how good it is to take this particular action in the current state s.

We can also define V in terms of Q. This is equation 3 right here. V*(s) denotes the total expected utility in the long term if we start by entering state s. What happens when we enter state s? First, we receive the immediate reward for entering that state, denoted by the big R. Next, we consider which action to take in the current state. There are many possible actions, and if we take a particular action, Q* tells us how good that action is in the long term: the action takes us to a next state, we follow the policy from there, and the Q value measures how much total discounted reward we're going to get. We choose the action that maximizes the Q value, because we're following the optimal policy, so we always take the best action in each state. In addition, we need to multiply the best Q value by a discount factor. Once we take an action, it takes us to the next state at the next time step, and the same reward received at the next time step is worth less to us. That is measured by the discount factor gamma here.

Intuitively, these two equations tell us how the V values and Q values are related to each other. Now, our goal is to derive an algorithm to solve for the V values; we don't really care about the Q values right now. So let's take these two equations, combine them to eliminate the Q values, and leave only the V values. By doing this, we get equation 5, which is called the Bellman equations. Equation 5 is really what you would have expected based on how I described the meanings of V and Q.
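The slides show equations 3 through 5, which the narration describes but does not read out in full. In standard MDP notation, with R(s) the immediate reward for entering state s, P(s' | s, a) the probability that action a taken in state s leads to state s', and gamma the discount factor, they would read as follows (a reconstruction from the descriptions above, not a copy of the slides):

    Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \, V^*(s')                              (4)

    V^*(s) = R(s) + \gamma \max_a Q^*(s, a)                                       (3)

    V^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a) \, V^*(s')            (5)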
But let's still take another look to make sure we understand what's going on. Equation 5 describes our long-term total discounted reward starting from entering state s. So again, what happens? First, we get the immediate reward for entering that state. Then we consider which action to take in that state and where that action would take us in the future. The action takes us to a next state at the next time step, and whatever we receive at the next time step should be multiplied by the discount factor, because we value future rewards less than we value current rewards. If we take a particular action, it may take us to a next state s' with some probability, and starting from that next state we will get a total discounted reward of V(s') in the future. This is the long-term reward we are going to get starting from state s'. Summing over all the possible next states we could reach gives us the Q value; as you can see, we simply plug the Q value in right here. And there are multiple possible actions we can take, and we want to choose the action that maximizes this Q value. That's why we have a max over the actions right there.

Notice that the Bellman equations are not just one equation. If we plug in different values for s, we get different equations, so equation 5 actually defines a system of equations. And it turns out that the values V*(s) are the unique solution to this system of Bellman equations.

To make sure that we understand the Bellman equations, let's do an exercise. On the next slide, let's write down the Bellman equation for V*(s11); here s11 plays the role of the state s. Take some time, do this yourself, and then keep watching for the answer.

Here's the answer. What is the expected utility if we start from state s11? First, we get the immediate reward for entering that state; that's the small negative value here. Next, we have a term where we multiply by the discount factor; I can write a one here because we assume the discount factor is one. This is multiplied by the result of a maximization, because we're maximizing over all the possible actions we can take in the current state. There are four possible actions, so inside the maximization I've written a bigger bracket to show that this term contains four quantities, each corresponding to our total expected utility for taking one particular action. The red letters here are not part of the Bellman equation; they're simply labels to help you see which row corresponds to which action.

Let's go through two of these. For the action of going right in this picture: if we successfully go right, which happens with an 80% chance, we get to s12. If we slip upward, we bump into the wall and come back, so we stay in s11. And if we slip downward, we end up in the state s21. This is reflected in the probabilities: with an 80% chance we go right to s12, with a 10% chance we slip down to s21, and with a 10% chance we try to go up and end up staying in the same state.
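In symbols, the row for going right that was just described would be (using the same notation as before):

    0.8 \, V^*(s_{12}) + 0.1 \, V^*(s_{21}) + 0.1 \, V^*(s_{11})

and this whole expression is one of the four quantities inside the max.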
Let's look at another one. For some of the other terms, I've combined outcomes that lead to the same state. For example, what happens if we try to go up? Well, if we successfully go up, we bump into the wall and come back. If we slip to the left, we also bump into the wall and come back. For both of these outcomes we end up in the same state, and the total probability is 90%. That corresponds to the first term here: with a 90% chance we stay in the same state. Finally, with a 10% chance we slip to the right and get to the state s12, which corresponds to 0.1 multiplied by the expected utility starting from s12.

The purpose of this exercise is to make the Bellman equations more concrete in your head. They're not just a bunch of mathematical symbols; we end up plugging in a lot of values and getting an equation where the only unknowns, the variables, are the V values. Also notice that each equation involves quite a few of the variables. This particular equation contains three of them: V(s11), V(s12), and V(s21).

That's everything for this video. After watching it, you should be able to explain how we can define V and Q recursively in terms of each other, and how we can use these recursive equations to derive the Bellman equations. You should also be able to give an example of the Bellman equations by plugging in values from the grid world example. In the next video, I will talk about how we can solve the Bellman equations for the V values. Thank you very much for watching. I will see you in the next video. Bye for now.
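If you'd like to experiment before the next video, here is a minimal Python sketch of the right-hand side of equation 5 for the single state s11, built from the transitions described above. The "left" and "down" rows are inferred from the same 0.8/0.1/0.1 slip model (the video only walks through "right" and "up"), and the reward and V values are hypothetical placeholders, since the actual numbers live on the slides.

    # A minimal sketch (not from the lecture) of one Bellman right-hand side,
    # equation 5, evaluated for the state s11 of the grid world.

    GAMMA = 1.0  # the exercise assumes a discount factor of 1

    # Transition model for s11: action -> list of (probability, next state).
    P_S11 = {
        "right": [(0.8, "s12"), (0.1, "s21"), (0.1, "s11")],
        "up":    [(0.9, "s11"), (0.1, "s12")],
        "left":  [(0.9, "s11"), (0.1, "s21")],               # inferred: 0.8 left and 0.1 up both hit walls
        "down":  [(0.8, "s21"), (0.1, "s11"), (0.1, "s12")], # inferred from the slip model
    }

    R_S11 = -0.04  # hypothetical; the video only says "a small negative value"

    # Hypothetical current estimates of V* for the states that appear above.
    V = {"s11": 0.0, "s12": 1.0, "s21": 0.5}

    def bellman_rhs(reward, transitions, v, gamma=GAMMA):
        # R(s) + gamma * max over actions of sum over s' of P(s'|s,a) * V(s')
        q = {a: sum(p * v[s2] for p, s2 in outcomes)
             for a, outcomes in transitions.items()}
        return reward + gamma * max(q.values())

    print(bellman_rhs(R_S11, P_S11, V))
    # With these placeholder values the best row is "right":
    # -0.04 + 1.0 * (0.8*1.0 + 0.1*0.5 + 0.1*0.0) = 0.81 (up to floating point)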