Okay, so we're still building towards how to solve an MDP, even in the simplified setting where we have access to P and R, so we have access to the entire specification of the MDP (S, A, P, R) and the discount factor gamma. Now we're going to define a function called the state value function. It's a function of both a state and the policy you're currently following, and it will serve as a useful abstraction as we go forward.

So let's define the value function of a state s under policy pi, written V^pi(s). It's simply the expected utility, which remember is the sum of discounted rewards, of starting in state s and acting according to the policy pi. You can write it down more mathematically as

V^pi(s) = E[ sum over t >= 0 of gamma^t r_t | s_0 = s, following pi ],

the expected sum of all future rewards after starting in state s at the current time instant. What's implicit here is that the rewards come from following the policy pi; there's a sequence of rewards generated by following pi. So that's the value function of a state s under policy pi.

Now we can define a similar quantity for the optimal policy. We aren't doing anything particularly exciting here; we're just plugging the optimal policy pi* into the same expression, and that gives us V*(s). It's exactly the same expression, except that now we're talking about the rewards generated by following pi*. So keep this in mind as we go forward.

Now that we've seen state value functions of policies, let's talk about action value functions of policies. Action value functions are the exact same thing we defined on the previous slide, except that now we assume we're not just at a current state, we've also executed a particular action from it, and we take the expected value of future rewards after having executed that action. So we're no longer computing the expected future utility conditioned only on being in a state; we're instead computing the expected future utility conditioned both on being in a state and on performing a particular action from that state.

You can think of this, in terms of the diagram we drew earlier, as computing the value function not of the true states, like we saw on the previous slide, but of the Q-states instead. These are the blue circles that correspond to a kind of imaginary state that exists after having executed a particular action a from state s. So remember, we call them Q-states, and sure enough we're calling these action value functions Q functions.

More formally, the Q value of taking action a in state s and then following policy pi is what the Q function is: it's the expected utility of taking action a in s and then following the policy pi. So it's still a function of both the policy and the state, but it's also a function of the action you've already executed from that state. You can define Q^pi(s, a) as the expected future utility, where now these future rewards come from executing the action a first and then following the policy. Remember, a policy can prescribe the actions to take from any state; that's what a policy is, a map from states to actions. But just for the first step, we're not going to follow the policy: instead we execute this first action a, and after that we start following the policy. So that's the Q value, defined in general.
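To make these two definitions concrete, here's a minimal sketch, not from the lecture, that estimates V^pi(s) and Q^pi(s, a) by sampling rollouts on a small made-up MDP. The two-state dynamics, the dict-based P and R, and every name below are illustrative assumptions, not part of the lecture's setup.

```python
import random

# Hypothetical 2-state MDP: P[s][a] lists (next_state, prob) pairs,
# and R[(s, a, s_next)] is the reward for that transition (else 0).
P = {
    0: {"stay": [(0, 0.9), (1, 0.1)], "go": [(1, 0.8), (0, 0.2)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
R = {(0, "go", 1): 1.0}
GAMMA = 0.9

def step(s, a):
    """Sample s' ~ P(. | s, a) and return (s', reward)."""
    next_states = [sn for sn, _ in P[s][a]]
    probs = [p for _, p in P[s][a]]
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R.get((s, a, s_next), 0.0)

def rollout_return(s, policy, horizon=100):
    """Discounted return of one episode following `policy` from s
    (truncated at `horizon`, approximating the infinite sum)."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        total += discount * r
        discount *= GAMMA
    return total

def v_pi(s, policy, n=2000):
    """V^pi(s): expected utility of starting in s and following pi."""
    return sum(rollout_return(s, policy) for _ in range(n)) / n

def q_pi(s, a, policy, n=2000):
    """Q^pi(s, a): execute a first, then follow pi afterwards."""
    samples = []
    for _ in range(n):
        s1, r = step(s, a)
        samples.append(r + GAMMA * rollout_return(s1, policy))
    return sum(samples) / n

pi = lambda s: "go"  # a fixed policy: always pick "go"
print(v_pi(0, pi), q_pi(0, "stay", pi))
```

Note how q_pi differs from v_pi in exactly the way the definition says: the very first action is forced to be a, and the policy only takes over from the next state onward.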
Now you can define this analogously for the optimal policy, and then it's called the optimal Q value, written Q*. Q* is just Q^pi when pi = pi*; that's basically what it is.

Now let's think about why we actually define this Q*. Why weren't we comfortable with just using the state value function? The answer, and the reason a lot of what we'll see going forward in RL has to do with this Q function, is that if you somehow had the optimal Q*, then you could easily determine what the policy pi* is. Remember, Q* is telling you the utility to be gained by executing an action a at state s and then following the optimal policy. That means you don't have to worry about things that happen after that first step, because afterwards you're guaranteed to be following the optimal policy; that's what the definition of Q* is. So now all that you have to optimize over is that first action a, and that gives you the policy. Specifically, you can greedily determine the optimal policy pi* from Q* by setting pi*(s), whose output remember has to be an action, to the argmax of Q*:

pi*(s) = argmax over a of Q*(s, a).

The action that maximizes Q*(s, a) is the action that will be output by the optimal policy.

All right, so having defined the state and action value functions, we're finally ready to encounter the Bellman equations, which are a cornerstone of all of reinforcement learning, really. The Bellman equation essentially connects value functions at consecutive time steps, and there are Bellman equations for both the state value function and the action value function. We'll write down a version for the state value function now and return to the other variants later.

Before doing that, let's observe that the state value function and the action value function are actually closely related. In particular, for the optimal policy they're related in this way: the maximum over all actions of Q* is exactly what V* is,

V*(s) = max over a of Q*(s, a),

because remember, V* is the optimal expected return, and Q* is the optimal expected return after executing action a. So if action a were optimal, the optimal expected return would be exactly the thing on the left. The optimal value function of s is what we get by picking the optimal action.
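Since the greedy extraction and this max relationship are so central, here's a tiny sketch of both, assuming Q* is already available as a table; the dict keyed by (state, action) and the numbers in it are purely illustrative assumptions.

```python
# Hypothetical Q* table keyed by (state, action); values are made up.
q_star = {
    (0, "stay"): 4.1, (0, "go"): 5.3,
    (1, "stay"): 2.0, (1, "go"): 3.7,
}
actions = ["stay", "go"]

def pi_star(s):
    """Greedy policy extraction: pi*(s) = argmax_a Q*(s, a)."""
    return max(actions, key=lambda a: q_star[(s, a)])

def v_star(s):
    """V*(s) = max_a Q*(s, a)."""
    return max(q_star[(s, a)] for a in actions)

print(pi_star(0), v_star(0))  # -> go 5.3
```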
Now the next thing will take a little bit of time to parse; this is basically the Bellman equation already. What it says is that having executed action a from state s, we could potentially transition to one of several possible states afterwards, and that's exactly what the transition probability P, which is one of the elements in the MDP tuple, tells us: what the probabilities are of transitioning to the various states. So P is going to tell us which state we transition to. Now for the rest of it, let's pretend we know we've transitioned to a particular state s'. Then Q* is simply the reward we get for transitioning from s, by executing action a, to state s', plus the discounted value function of the new state we ended up at, s'. And this is the optimal value function now, right? We're trying to compute the optimal state-action value function, and here we're using the optimal state value function.

So let's think about what's going on here. This is basically an expectation, because remember, an expectation is a sum of p(x) times x; the expected value of x is the sum of p(x) times x. So this is the expectation of the future reward, where the probability we compute the expectation over is the transition probability to new states after executing action a. You can think of this as the expected value, over the successor state s', of the current reward that you get for transitioning from s to s' through action a, plus gamma times the discounted future reward:

Q*(s, a) = sum over s' of P(s' | s, a) [ R(s, a, s') + gamma V*(s') ].

And now we can plug in V*. We can either plug the expression for V* into this and get the Bellman equation in terms of Q*, or we can do what we're doing right now, which is to substitute this expression for Q* into V*(s) = max over a of Q*(s, a). That gives us

V*(s) = max over a of [ sum over s' of P(s' | s, a) ( R(s, a, s') + gamma V*(s') ) ].

All right, so that's the Bellman equation, and we'll see very soon how this is extremely valuable in computing the optimal policy, both in known MDPs, where S, A, P, R, and gamma are all known, and in reinforcement learning, where P and R are going to be unknown.
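The lecture will get to algorithms soon, but as a preview of why this equation is so useful, here's a minimal sketch of one standard way to exploit it, value iteration, which repeatedly applies the Bellman backup on the same made-up dict-based MDP from the first sketch; the representation and all names here are my assumptions, not the lecture's.

```python
# Same hypothetical 2-state MDP as in the earlier sketch.
P = {
    0: {"stay": [(0, 0.9), (1, 0.1)], "go": [(1, 0.8), (0, 0.2)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
R = {(0, "go", 1): 1.0}  # transitions not listed give reward 0
GAMMA = 0.9

def bellman_backup(V, s, a):
    """Q(s, a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (R.get((s, a, sn), 0.0) + GAMMA * V[sn])
               for sn, p in P[s][a])

def value_iteration(n_iters=200):
    """Repeatedly apply V(s) <- max_a Q(s, a)."""
    V = {s: 0.0 for s in P}
    for _ in range(n_iters):
        V = {s: max(bellman_backup(V, s, a) for a in P[s]) for s in P}
    return V

V_star = value_iteration()
# Greedy extraction again: pi*(s) = argmax_a Q*(s, a).
pi_star = {s: max(P[s], key=lambda a: bellman_backup(V_star, s, a))
           for s in P}
print(V_star, pi_star)
```

Because the max and the expectation are exactly the two pieces of the equation above, each sweep inside value_iteration is just the Bellman equation applied once at every state.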