All right, so we've now seen Q-learning and V1 of deep Q-learning, and we said that the incremental update rule in standard, tabular Q-learning gets converted into a gradient update in deep Q-learning. But then we said that even though this looks a lot like gradient descent, we put an asterisk there, because it's not actually gradient descent. Now let's unravel that asterisk a little and see what we find.

Observe that, the way this algorithm is currently written down, the update rule is applied to states and actions in sequence. As you execute actions in the environment and observe s_1, a_1, s_2, a_2, and so on, you apply this update rule to those samples as they come in. And those samples are strongly correlated, because any state you end up at will be in the neighborhood of the state you started in. So samples from consecutive time steps are correlated with each other.

Let's visualize that a little bit. Picture the full distribution of states. At any given point in time, the last few samples you've encountered have all come from a very small neighborhood within that state space. By applying your gradient update to those samples in sequence, you might very easily overfit to that small neighborhood. Then all of a sudden you find yourself in a different neighborhood of the state space where your function doesn't work anymore, and you have to overfit to that neighborhood once more. You might find yourself repeating this over and over, overfitting to one small neighborhood after another and never really fitting the full function. So that's one problem: we don't have IID samples from the full distribution, and therefore what we're doing is not really stochastic gradient descent, because SGD requires samples drawn IID from the distribution.

The second problem is a little more subtle. It arises from the fact that when we compute this gradient, we treat the target term as a fixed value, as a label: we measure the error between the current Q_phi and this target value. But that target value is actually a function of Q_phi itself, and we've conveniently ignored that when writing down the update rule. We've deliberately chosen to treat this term as a label independent of Q_phi, so we don't compute any gradients with respect to phi for it; we treat the objective as Q_phi minus some y. If you recall, we explicitly wrote it down as computing y_i = r_i + gamma * max_{a'} Q_phi(s'_i, a'), and then said the expression was Q_phi minus y_i. But that y_i is really a function of phi, and ignoring that is going to come back to trouble us. In particular, the target value is not actually fixed; it's changing as we learn.
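To make that "treat it as a label" step concrete, here is a minimal PyTorch-style sketch of the semi-gradient loss. It's only a sketch under assumed conventions: the names dqn_loss and q_net and the batch layout are illustrative, not something from the lecture.

```python
import torch

def dqn_loss(q_net, batch, gamma=0.99):
    # batch is assumed to hold tensors: states s, integer actions a,
    # rewards r, next states s_next, and episode-termination flags done.
    s, a, r, s_next, done = batch

    # Target y_i = r_i + gamma * max_a' Q_phi(s'_i, a').
    # Wrapping this in no_grad() is exactly the "treat it as a label"
    # step: no gradient flows through y, even though y depends on phi.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values

    # Gradients flow only through Q_phi(s_i, a_i) below, which is why
    # this is a semi-gradient update rather than true gradient descent.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return ((q_sa - y) ** 2).mean()
```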
Let's first deal with the sequential states being strongly correlated. To handle this correlation issue, remember what we said earlier about Q-learning being an off-policy algorithm, which means that the states and actions you base your updates on don't have to be drawn from any one particular policy. You can actually use any policy to act in the world. So we don't have to use only the most recent states and actions; we could instead use states and actions from a policy we were deploying many iterations ago. And that's exactly what we do: rather than relying only on our most recent experiences, we maintain a buffer of past experience, the replay buffer, and we perform Q-updates based on samples drawn from it. We draw samples from this store of previous experience and use them to inform the Q-update, rather than using only the most recent sample. That automatically breaks the issue of sequential samples being correlated with each other.

The other nice thing is that you're no longer throwing away all your past experience the moment you've used it once. You store it, and it can keep improving your method's performance long after you first had that experience; it continues to inform many, many future gradient updates. So this leads to more efficient learning overall, and it also addresses the non-IID problem.

So now let's look at what V2 of deep Q-learning looks like with this replay buffer built in. We collect a dataset using some policy and add it to the replay buffer. Then, in an inner loop, we sample a mini-batch of (s, a, s', r) tuples from the replay buffer, and we change our update rule to use that mini-batch rather than just the most recent sample. Now we have something that looks a little more like mini-batch stochastic gradient descent: we've sampled a mini-batch, and we apply a gradient step to it with a summation over all the samples in the batch. And you'll notice that this inner loop runs K times: every time after collecting a dataset, we loop K times, sampling a mini-batch and applying the update rule, over and over. Once we've looped K times, we go back and use a potentially new policy to add to our replay buffer. Nothing inside that inner loop involves any interaction with the environment; it's all just optimization based on experience we've already collected, as sketched below.
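Here is a rough sketch of that inner loop together with a simple replay buffer, reusing dqn_loss from the sketch above. Again the names are hypothetical, collate is a placeholder for stacking sampled tuples into tensors, and values like capacity, K, and batch_size are arbitrary defaults.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO store of past (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling over all stored experience is what breaks
        # the correlation between consecutive time steps.
        return random.sample(list(self.storage), batch_size)

def inner_loop(buffer, q_net, optimizer, K=4, batch_size=32):
    # Pure optimization on already-collected data: nothing in this
    # loop interacts with the environment.
    for _ in range(K):
        batch = buffer.sample(batch_size)
        loss = dqn_loss(q_net, collate(batch))  # collate: placeholder
        optimizer.zero_grad()                   # for stacking tuples
        loss.backward()                         # into tensors
        optimizer.step()
```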
So the process of collecting experience works like this. From the replay buffer, we perform this off-policy Q-learning, sampling mini-batches of past experience and updating our Q-function. Once we've done that, we can deploy a policy based on our current Q-function. That could be, for example, epsilon-greedy, which we've seen as an example of a policy you can use: not necessarily the optimal policy, but an epsilon-greedy version of it, meaning we sample random actions a fraction epsilon of the time. That generates some new data, which we put back into the replay buffer; that's the first line of the algorithm. And we keep iterating in this loop until we stop improving our policy.
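As a final illustrative piece, the epsilon-greedy action selection might look like this; a minimal sketch, where the function name and the NumPy usage are my own rather than from the lecture.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1):
    # Explore: a fraction epsilon of the time, pick a uniformly
    # random action.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    # Exploit: otherwise act greedily on the current Q-function.
    return int(np.argmax(q_values))
```

Each transition generated this way would be added back to the replay buffer (buffer.add(...) in the sketch above), closing the collect-then-optimize loop.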