Now that we've seen the classic DQN algorithm, let's look at an idea called double DQN, which further improves DQN's performance. The core problem double DQN addresses is the observation that the optimal Q-function Q*, which DQN is trying to learn, is never learned exactly; in particular, it is learned with a systematic error toward overestimating Q*. Look at the red lines in these plots of the Atari game environments where we showed that standard DQN performs well. Even in those cases, the DQN estimate of the value of a particular state-action pair is consistently higher than the true value over the course of training, and in some cases it keeps increasing. The true values are the flat horizontal lines, and in every one of these cases the DQN estimate is systematically and significantly higher than the true value.

This is not specific to DQN; it is true of Q-learning in general. The bias toward overestimating Q* comes from the way we set the Q target. Recall that the target is y_i = r_i + γ max_a' Q_φ(s'_i, a'): the current reward plus gamma times the maximum over actions of the Q-value at the next state. To see why this max operator leads to overestimation, let's look at a simpler example outside the context of reinforcement learning.

Suppose all we want to do is measure the maximum weight of 100 people, and suppose those 100 people all have the same true weight of 150 pounds. We measure each weight with a scale that is off by roughly plus or minus one pound, say with a standard deviation of one pound. The exact amount doesn't matter, as long as the noise is symmetric: each measurement is as likely to come out below 150 pounds as above it. We measure person one's weight and store it in x_1, person two's weight in x_2, and so on. At the end, since we want the maximum of these weights, we set y = max_i x_i.

Remember, all 100 people have the same true weight, so if there were no noise the max operator would be trivial and y would be exactly 150 pounds. But with noise, even symmetric noise, once we take the max, y is almost always greater than 150 pounds. You can convince yourself of this by observing that all you are really doing is sampling from the noise distribution 100 times and picking the highest value, which is almost always greater than zero. The same thing happens in Q-learning because of the way we use the max operator to set the target for the Q-function. When the noise is zero, y is exactly 150 pounds, and as the measurement noise increases, y increases. So under noise, the max is biased to be larger than it should be. That's the takeaway. Okay, so how do we actually fix this?
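To make that concrete, here is a minimal numpy sketch of the weight-measurement example; the noise scale, the number of people, and the number of repeated trials are just the illustrative numbers from above.

```python
import numpy as np

rng = np.random.default_rng(0)

true_weight = 150.0   # every person's true weight, in pounds
noise_std = 1.0       # symmetric measurement noise of the scale
n_people = 100        # how many people we measure
n_trials = 10_000     # repeat the whole experiment many times

# Each trial: measure all 100 people once, then take the maximum measurement.
measurements = true_weight + noise_std * rng.standard_normal((n_trials, n_people))
y = measurements.max(axis=1)

print("average max measurement:", y.mean())                          # ~152.5, well above 150
print("fraction of trials with y > 150:", (y > true_weight).mean())  # ~1.0
```

Even though every individual measurement is unbiased, the max of 100 noisy measurements overshoots the true value in essentially every trial, and the overshoot grows with the noise level.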
After all, the way we set the targets was derived from the Bellman equation; it's not something we just made up. So how do we fix the problem that the max is biased to be larger than it should be? Here is a simple idea: measure each person's weight two times, with independent noise each time (the scale produces independent noise on every measurement), and store the two measurements in x_{i,1} and x_{i,2}. Then set i* = argmax_i x_{i,1}; that is, i* is the index of the person whose first weight measurement is highest. To estimate the max, you now take that same person's second measurement, x_{i*,2}.

In other words, suppose that out of the 100 people you measured, the 34th person had the highest first measurement, say 150.7 pounds. You then look up that same 34th person's second measurement and use it as your estimate of the maximum. This new estimate is no longer systematically higher than 150 pounds: it is as likely to be above 150 pounds as below, because the second measurement of the 34th person is independent and could again come out on either side of 150 pounds. That's the intuition: use two estimates of the quantity whose maximum we want, one estimate to find the argmax and the other estimate to actually evaluate it.

Now let's apply this to deep Q-learning and train two independent Q-networks, Q_φ1 and Q_φ2, with parameters φ1 and φ2. We train them on disjoint subsets of experience so that they really are independent, which they have to be. Then, to train Q_φ1, you no longer use the standard Bellman target, the expression shown here. Instead, you apply the same trick we saw on the previous slide: set the target by taking the argmax over actions of Q_φ1, but then use the second estimate, Q_φ2, to evaluate the Q-value at that argmax. In other words, you find the action for which the first network's estimate is highest, and then you use the second network's Q-value for that same action.

One way to think about this is that the Q-network being trained, here Q_φ1, is used to select the action, and the other Q-network, the one not being trained right now, is used to evaluate it, and you alternate between the two. You keep training Q_φ1 and Q_φ2 simultaneously so they keep pace with each other, but on disjoint subsets of experience to keep them independent. Whenever you train Q_φ1, Q_φ1 does the selection and Q_φ2 does the evaluation; whenever you train Q_φ2, the two trade places, so Q_φ2 does the selection and Q_φ1 does the evaluation.
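Here is a rough PyTorch-style sketch of how the target for training Q_φ1 could be computed under this scheme. The function name, the batch tensor names (rewards, next_obs, dones), and the terminal-state masking are illustrative assumptions, not code from the lecture.

```python
import torch

def double_q_target(q_select, q_eval, rewards, next_obs, dones, gamma=0.99):
    """Double Q-learning target: q_select picks the action, q_eval scores it.

    q_select, q_eval : networks mapping observations -> (batch, n_actions) Q-values
    rewards          : (batch,) tensor of rewards r_i
    next_obs         : (batch, obs_dim) tensor of next observations s'_i
    dones            : (batch,) float tensor, 1.0 where the episode ended
    """
    with torch.no_grad():
        # Selection: action with the highest Q-value under the network being trained.
        best_actions = q_select(next_obs).argmax(dim=1, keepdim=True)   # (batch, 1)
        # Evaluation: the other network's Q-value for that same action.
        next_q = q_eval(next_obs).gather(1, best_actions).squeeze(1)    # (batch,)
        return rewards + gamma * (1.0 - dones) * next_q

# Training Q_phi1: targets = double_q_target(q1, q2, rewards, next_obs, dones)
# Training Q_phi2: the roles trade places: double_q_target(q2, q1, rewards, next_obs, dones)
```

The done mask is the usual bookkeeping for terminal transitions and is not part of the double-Q idea itself.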
Okay, now one thing to clear up: this looks a lot like another trick we've seen, the target network, where we also kept a second Q-network. In that case, though, the second network was a stale copy, a Q-network from a few iterations ago, and the intention was to avoid shifting targets in Q-learning. The double-DQN trick is independent of the target-network trick, and you can in fact combine the two. If you did combine the target-network trick with the double-DQN trick, you would effectively have four Q-networks, because you would keep an old copy of Q_φ1 and an old copy of Q_φ2 to serve as the target networks.
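As a rough sketch of what that combination might look like, building on the q1 and q2 networks from the sketch above: keeping stale copies of both networks comes from the lecture, but the exact wiring of which copy selects and which evaluates, and the refresh period K, are assumptions made here for illustration.

```python
import copy
import torch

# Stale copies of both online networks serve as the target networks (four networks total).
q1_target = copy.deepcopy(q1)   # old copy of Q_phi1
q2_target = copy.deepcopy(q2)   # old copy of Q_phi2

def target_for_q1(rewards, next_obs, dones, gamma=0.99):
    # One plausible wiring (assumption): the stale copy of Q_phi1 selects the action
    # and the stale copy of Q_phi2 evaluates it, so targets stay fixed between refreshes.
    with torch.no_grad():
        best_actions = q1_target(next_obs).argmax(dim=1, keepdim=True)
        next_q = q2_target(next_obs).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q

# Every K gradient steps, refresh the stale copies from the online networks:
# q1_target.load_state_dict(q1.state_dict())
# q2_target.load_state_dict(q2.state_dict())
```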