Okay, so now that we've seen how to solve an MDP with all of its parameters known, meaning specifically that the transition model P and the reward function R are given, and we've seen methods for solving it using value iteration and policy iteration, we are finally ready to get into our very first reinforcement learning algorithms. Remember, we said at the outset that reinforcement learning deals with MDPs where P and R are not given to you in advance.

One really key concept in many reinforcement learning algorithms is the idea of temporal differencing. To introduce that idea, let's look at the policy evaluation algorithm that we wrote out earlier for the case of fully known MDPs. To evaluate a policy, meaning to assign a value to each state, we start by setting all the values to zero and then iterate this update until convergence:

V_{k+1}^π(s) = Σ_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V_k^π(s') ]

You should recognize this: it's basically just the Bellman equation, written down as an update, and we keep iterating on it until we eventually converge. Now, you'll notice that this equation contains the known transition probability P and the known reward function R, and that's the reason we were able to do policy evaluation in this way earlier.
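To make that concrete, here is a minimal sketch of this known-model policy evaluation loop in Python. The data structures are my own illustrative assumptions, not something from the slides: P[s][a] is a list of (next_state, probability) pairs, and R(s, a, s2) returns the reward for that transition.

```python
# Minimal sketch of exact policy evaluation for a fully known MDP.
# P[s][a] -> list of (next_state, probability) pairs; R(s, a, s2) -> reward.
# Both structures are illustrative assumptions, not from the lecture.

def evaluate_policy(states, P, R, policy, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}          # start with all values at zero
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            # Bellman update: expectation over next states under P
            new_v = sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in P[s][a])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                   # iterate until convergence
            return V
```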
Now, how would we extend policy evaluation to a setting where the transition probabilities and the rewards are not known? Remember, we don't actually have access to the transition probabilities or the rewards in the reinforcement learning setting. But every time you perform an action in the environment, you get a sample from each of these distributions. In particular, the state that you end up in is observable to you, and that gives you some information about P(s' | s, a); similarly, you get a reward back from the environment. This is what we said the reinforcement learning setting was: every time you emit an action, you get to observe the next state and you get to observe the reward. So you get samples from both of these functions. You don't get the entire function, and you don't get a formula for the function, but you do get samples of each. And so by acting in the environment, by taking actions one at a time, you get more and more information about these two unknown quantities. Just to restate that once more: every time you take an action a from state s, you get a sample from the unknown transition probability and the corresponding reward.

This is going to be important to the idea of temporal differencing. In particular, if you stare at the equation we are using as our Bellman update for policy evaluation, the quantity on the right is an expectation: we're taking the expectation of the quantity inside the square brackets under the transition probability distribution. That's exactly what an expectation is, if you recall the expression for one.

Now, we've just finished saying that every time we perform an action in the environment we get to see one sample of this quantity. In particular, we get to see one sample of s', drawn from this distribution, and we get the associated reward. And of course we already have access to the current estimate of the value function. So what we're going to do is treat that single sample as a proxy for the entire distribution, as a proxy for the expectation over the distribution. Because a single sample is obviously a very noisy estimate of the expectation, we don't go the whole distance and iterate until convergence; instead, we just move the value function a little bit closer to the quantity on the right.

Let's see exactly how that works. After executing the action π(s), from the policy π that you're currently trying to find the value function for (because that's what policy evaluation does, remember), you get to see s' and the associated reward r. And because we also have access to the previous estimate of the value function, we can compute one sample of the quantity whose expectation we're trying to take:

r + γ V^π(s')

This is going to be our proxy for the expectation. Instead of setting V^π(s) to be equal to this quantity outright, we just move it to be closer to this quantity, and the way that works is by applying this update rule:

V^π(s) ← V^π(s) + α [ r + γ V^π(s') - V^π(s) ]

We take the current V^π(s) and apply an incremental update, where the increment is the difference between the right-hand side and the left-hand side, with the right-hand side replaced by the single sample, like we said.

Think about this a little bit. If, for example, you had set α exactly equal to one, you can work out that this would reduce to setting V^π(s) directly to the sample, which we said is a proxy for the entire expectation. But we don't usually set α to one; instead, we set it to a small positive value, so that we only move by a small step towards the sampled value. The intuition is that as you do this more and more, over all the samples you encounter, you end up computing something like a running average over the samples. So even though each sample by itself is a noisy estimate of the expectation, by computing this running average through the update rule you have effectively reproduced something like the full expectation.

So that's an example of what temporal differencing is. It basically reduces to the idea of treating every single sample you encounter as representative of the distribution. And what you gain is that you don't have to wait for a large number of interactions with the environment to improve your policy evaluation; you get to learn from every single step, from every single action that you emit in the environment and the feedback you get from it. You learn much more quickly, because you're making use of every single action, every single interaction with the environment, to update your policy, or your value function, or whatever it is you're trying to compute.
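Here is a minimal sketch of this TD-style policy evaluation in Python. The env.reset() and env.step(a) interface is a hypothetical one in the style of common RL environment libraries, not something defined in the lecture:

```python
# Sketch of TD(0) policy evaluation: learn V^pi from single transitions.
# env.reset() -> state and env.step(a) -> (next_state, reward, done) are
# assumed, hypothetical interfaces in the style of common RL environments.

def td0_evaluate(env, states, policy, gamma=0.9, alpha=0.1, episodes=1000):
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy[s]
            s2, r, done = env.step(a)        # one sample from P and R
            target = r + gamma * V[s2]       # sampled proxy for the expectation
            V[s] += alpha * (target - V[s])  # move a small step towards it
            s = s2
    return V
```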
So up until now we've seen temporal differencing applied to the policy evaluation problem. The next thing we're going to see is how we can apply this temporal differencing trick to learning the optimal Q function. We've already discussed that learning the optimal Q function amounts to being able to perform optimal actions in the environment, because all you have to do is pick the arg max of the Q function.

To do that, remember the Bellman equation for the optimal Q*, which we wrote down earlier:

Q*(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ]

All this is saying is that we're again taking an expectation, over the transition probability, of a quantity that now includes a maximum over Q*. This should remind you of what we did on the previous slide.

Let's also think about what the Q value iteration equation is. We've already seen the standard state value iteration equation; now we're going to look at state-action value iteration, or Q value iteration, which is derived quite analogously: you treat the Bellman equation as an update rule, so your new estimate of the optimal Q is just the right-hand side of the Bellman equation, with your previous estimate plugged in in place of Q*:

Q_{k+1}(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]

This is exactly how we treated value iteration in the state value case, if you recall from a few slides ago. And again, we're taking an expectation of this quantity under the transition probability, and we now know all about how to use temporal differencing to handle update rules like this, even when you don't have access to the transition probabilities or the true reward function. We know how to do this now, right?
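For reference, here is a sketch of this known-model Q value iteration, using the same illustrative P and R structures assumed in the policy evaluation sketch above:

```python
# Sketch of Q value iteration when P and R are known. Each sweep applies
# the Bellman optimality update, plugging the previous Q in for Q*.

def q_value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                # expectation over s' under P of [R + gamma * max_a' Q(s', a')]
                new_q = sum(p * (R(s, a, s2)
                                 + gamma * max(Q[(s2, a2)] for a2 in actions))
                            for s2, p in P[s][a])
                delta = max(delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if delta < tol:
            return Q
```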
So even though this expression requires access to the transition probability and to the reward function, we are now going to apply the TD trick, the temporal differencing trick, to it.

So again, here's the Q value iteration update rule, and what we're going to do, just like we did in the policy evaluation case, is treat the single sample you get as representative of the distribution. From performing a single action in the environment and getting the feedback from it, you get one single sample from the transition probability and one single corresponding reward, and you can treat this expression, computed for that particular transition and that particular reward, as a proxy for the overall expectation. And instead of outright setting the left-hand side of the equation to the right-hand side, you just move the left-hand side a little bit closer to the right-hand side using an incremental update, all the same stuff that we saw in the policy evaluation case.

So you execute a single action a from state s, and you observe s' and the reward r. The sample from the distribution is just this expression, computed for this one single transition:

r + γ max_{a'} Q(s', a')

This is going to be our proxy for the expectation, and we use it to create an incremental TD update rule that looks like this:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]

So Q(s, a) is replaced by Q(s, a) plus α times the difference between the right-hand side and the current value of Q(s, a). Again, if you set α exactly equal to one, you end up with the same update rule as before, except that you're replacing the entire expectation with just a single sample, and typically that would be a bad idea because you'd be updating based on a very noisy estimate. Instead of doing that completely noisy update, you smooth it out over time by relying a little bit more on your previous Q(s, a). That means you set α to a small value, so you don't outright update Q(s, a) to be based purely on this one single sample; instead, you also bring in knowledge from your past interactions, which have shaped your previous version of Q(s, a).

And again, the quantity in the square brackets is often called the Bellman error. It's simply the difference between the right-hand side of the equation and the left-hand side, in this case with just a single sample on the right-hand side instead of an expectation over a full distribution. This algorithm that we've just seen, which replaces Q value iteration with this incremental TD update, is exactly what Q-learning is.
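Here is a minimal sketch of that Q-learning loop in Python. The env.reset()/env.step(a) interface is the same hypothetical one as before, and the epsilon-greedy action selection is an extra assumption I've added so the agent keeps exploring; the lecture hasn't discussed exploration yet:

```python
import random

# Sketch of tabular Q-learning. env.reset() -> state and
# env.step(a) -> (next_state, reward, done) are hypothetical interfaces.
# Epsilon-greedy exploration is an added assumption, not from the lecture.

def q_learning(env, states, actions, gamma=0.9, alpha=0.1,
               epsilon=0.1, episodes=1000):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                   # explore
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])  # exploit
            s2, r, done = env.step(a)                        # one sample from P and R
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # TD update on the Bellman error
            s = s2
    return Q
```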
Now, here is an example of what a Q function might look like. Remember, a Q function is a function of both the state and the action, and because of that, in every cell of the grid-world example that we saw, we're going to have one value of Q(s, a) for every possible action a. So you can see, for example, that from this state the Q value for the action going north is 0.54.

Given the learned optimal Q function, you can easily find a policy by simply following the action that has the highest Q value. So in particular, in this cell the highest Q value is for the north action, so you would go north. Here the highest Q value is for the north action again, so you would go north once more. Well, actually, in this case it's the east action, so you would go east, then the east action again, and again, and then you end up at the correct square.

Let's look at another example. Say you started over here. The highest Q value is for this action, so you would follow it; then, once you reach this square, you would once again follow the best action, and then you would reach this square. Now remember that we have stochastic transitions, so at the moment we're kind of ignoring that when we say that if you just perform the west action you end up at this square; the actions in this environment are actually stochastic. But if you do end up at this square, then you would take this action, and so on and so forth, until you end up again at the starting square, and so you follow the same route that we saw earlier.

So this is an actual example of a learned Q function after 1000 iterations, learned without having access to the true transition probabilities or the true reward function, purely from samples like we saw on the previous slide.
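Reading the policy off a learned Q table like the one in this example is just an arg max per state; a minimal sketch, reusing the Q dictionary from the Q-learning sketch above:

```python
# Sketch: extract the greedy policy from a learned Q table by taking
# the highest-valued action in each state.

def greedy_policy(Q, states, actions):
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```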