All right, so now that we've dealt with one of the problems that created that asterisk when we called this update rule gradient descent, let's talk about the second problem: the labels we're supposedly regressing onto are themselves moving, because they are a function of phi (there's a q phi inside the target), and we don't treat that correctly, since we never take a gradient through the target value. So how do we deal with this? Because the labels are moving, you aren't really doing gradient descent on any fixed objective; the objective function effectively changes as you optimize your q, and that can lead to instability.

The solution is quite simple: we deliberately keep the q phi that computes the labels fixed for some duration. In particular, the q phi we use to compute the label is not the q phi we're optimizing, but an older copy that we update much less frequently. Let's write that out a little more clearly. We're going to use two q networks. One is called the target network, and the other is the standard q network. When we compute the labels, we compute them using the target network, whose parameters we'll call phi prime instead of the phi in the original network. So we replace this term with the same expression, except that inside, the labels are computed using q phi prime. This part is computed by the q network, and this part by the target network. What this does is stabilize training, at least locally, for some duration, so that we're actually performing gradient descent: the labels stay fixed for some time because q phi prime is kept fixed for some time. We update the target network only occasionally, once every K iterations.

That takes us to deep Q-learning v3, which looks a lot like deep Q-learning v2, except that it has this one additional target network trick. In particular, everything in deep Q-learning v2 appears in steps 2 through 4 here, and now we have one additional loop on the outside, which occasionally saves the target network parameters. So we save the target network parameters, then collect a dataset, then optimize several times on many mini-batches against the labels provided by that target network, and we repeat that process while still using the same target network. That's the second loop, which we repeat N times without changing the target network: we keep improving our policy and collecting new data by interacting with the environment, all while using the same target network. We do this for a while, and then finally we update the target network in the outermost loop.

So you can now think of it this way: because we've fixed the problem of the targets changing, everything inside this portion is effectively supervised regression, and so we really are doing gradient descent correctly.
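To make the target network trick concrete, here is a minimal PyTorch-style sketch. The network architecture, dimensions, and hyperparameters are illustrative assumptions, not the exact setup from the lecture; the point is only that the labels come from a separate, occasionally-updated copy of the parameters and that no gradient flows through them.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (they happen to match CartPole, used in a later sketch).
obs_dim, n_actions, gamma = 4, 2, 0.99

def make_q_net():
    # A small MLP standing in for Q_phi: maps a state to one value per discrete action.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_q_net()                  # Q_phi, the network we optimize
target_net = copy.deepcopy(q_net)     # Q_phi', the target network (kept fixed for a while)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_targets(rewards, next_obs, dones):
    # Labels are computed with the *target* network, and torch.no_grad() means
    # no gradient flows through them, so they behave like fixed regression labels.
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q

def gradient_step(obs, actions, rewards, next_obs, dones):
    y = td_targets(rewards, next_obs, dones)
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q_phi(s, a)
    loss = F.mse_loss(q_sa, y)        # squared Bellman error as an ordinary MSE loss
    optimizer.zero_grad()
    loss.backward()                   # gradient with respect to phi only
    optimizer.step()

# Only occasionally (once every K iterations) copy phi into phi':
# target_net.load_state_dict(q_net.state_dict())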
This really is gradient descent on a regression loss now, because the squared Bellman error becomes an ordinary mean squared error loss once we fix the labels. So we're effectively doing supervised regression up to this point. Then we change the labels, right? We change the labels when we save a new target network, and then we run supervised regression again. The number of points at which we introduce that instability of having new values to regress to is much smaller now.

This deep Q-learning v3 is a generalized version of the classic deep Q-learning algorithm. The reason I say it's a generalized version is that the classic algorithm corresponds to K equals 1. That is the algorithm most widely known as deep Q-learning: K equals 1 means we update on only a single mini-batch before collecting more interaction in the environment and adding it to the replay buffer. That's all that's happening here, and you can take a look and convince yourself that it's exactly what's written with K equals 1; a sketch of that classic loop follows below.

This deep Q-learning approach has worked really well. It made a big splash in 2013 when the algorithm was developed, and it was considered a major step forward for reinforcement learning when it was shown to solve a variety of Atari games like the ones shown here. What was particularly striking about those results was that the very same DQN algorithm worked across a wide variety of games: the same algorithm could play Breakout, Space Invaders, and about 50 Atari games in total. Later, the same idea was also extended to continuous control. In particular, you can modify DQN using an approach called DDPG for problems where your actions are continuous rather than the discrete actions we've been dealing with. That lets you tackle things like robotics, where you typically need continuous actions to accomplish tasks, and it includes driving as well, where controlling the steering, for example, is a continuous input rather than a discrete one.
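As a rough sketch of that classic K = 1 loop, here is one way it might look, reusing q_net, target_net, and gradient_step from the sketch above. The choice of Gymnasium's CartPole environment, the fixed epsilon, and all hyperparameters are assumptions for illustration only, not the original DQN setup.

```python
import random
import numpy as np
from collections import deque
import gymnasium as gym  # assumption: Gymnasium's CartPole as a toy stand-in

env = gym.make("CartPole-v1")    # obs_dim = 4, n_actions = 2, matching the sketch above
buffer = deque(maxlen=100_000)   # simple replay buffer of transitions
batch_size, target_update_period, epsilon = 64, 1_000, 0.1

def sample_batch():
    # Draw a random mini-batch of transitions and stack them into tensors.
    obs, actions, rewards, next_obs, dones = zip(*random.sample(buffer, batch_size))
    return (torch.tensor(np.array(obs), dtype=torch.float32),
            torch.tensor(actions, dtype=torch.int64),
            torch.tensor(rewards, dtype=torch.float32),
            torch.tensor(np.array(next_obs), dtype=torch.float32),
            torch.tensor(dones, dtype=torch.float32))

obs, _ = env.reset()
for step in range(50_000):
    # Epsilon-greedy action from the current Q_phi.
    if random.random() < epsilon:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = int(q_net(torch.tensor(obs, dtype=torch.float32)).argmax())
    next_obs, reward, terminated, truncated, _ = env.step(action)
    buffer.append((obs, action, reward, next_obs, float(terminated)))
    obs = next_obs if not (terminated or truncated) else env.reset()[0]

    if len(buffer) >= batch_size:
        gradient_step(*sample_batch())        # K = 1: one mini-batch per environment step
    if step % target_update_period == 0:
        target_net.load_state_dict(q_net.state_dict())   # occasionally: phi' <- phi
```

The structure mirrors the lecture's description: interaction and gradient steps alternate one-for-one, while the target network is refreshed only once every target_update_period steps.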