All right, so we've now seen our very first reinforcement learning algorithm, called Q-learning. We saw it in the context of really simple settings, like the grid-world setting, where it was easy to enumerate all the states and all the actions and learn a Q-value associated with each state-action tuple. But this is not always the case. In more complex situations your state might consist of, for example, the positions of all the cars on the street you're driving on, or your action space might be really large: maybe your actions consist of all the muscle contractions in your body. In those cases it's not nearly as easy to enumerate the states and actions, and so it is practically impossible to use this simple tabular approach, where we tabulated Q as a function of (s, a) and literally updated individual entries separately. This is where deep learning helps us: the neural networks we've been learning about over the last few weeks are going to come to our rescue here. That takes us from what we've seen so far, which is sometimes called tabular Q-learning, to deep Q-learning.

Okay, so like we were just saying, it's very difficult to learn about every single state-action pair. There are too many state-action pairs to even have visited them all during any reasonable training duration, and it might even be the case that there are too many state-action pairs to hold the Q-table in memory. Remember, the size of the Q-table is the number of states times the number of actions, because Q is a function of (s, a). If you have, for example, hundreds of thousands of possible actions and millions of states, then holding that table in memory and making it useful becomes a real burden.

This should remind you of something we've encountered before in standard supervised machine learning. For example, it's impossible to have observed every possible image of a cat at training time, but if we are clever about how we represent images, it becomes possible to learn classifiers that can recognize new images of cats. We don't have to have seen cats in every possible image; we only need the right representation of images so that we can generalize. There is a very similar logic to how we'll proceed in this Q-learning setting: we want to generalize to new states we haven't seen before, by learning from some small number of states encountered during training and generalizing to new, similar states. The way to do that is by building abstractions, like we've seen before in machine learning.
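Before we build those abstractions, here is a minimal sketch of the tabular approach we're leaving behind, just to make the scaling problem concrete: one stored number per (state, action) pair, updated individually. The environment interface and the epsilon-greedy exploration are illustrative assumptions, not something fixed by the lecture.

```python
import random
from collections import defaultdict

# Tabular Q-learning: one entry per (state, action) pair.
# This is exactly the table whose size is |S| * |A|.
Q = defaultdict(float)

alpha = 0.1    # learning rate
gamma = 0.99   # discount factor
epsilon = 0.1  # exploration rate

def update(s, a, r, s_next, actions):
    # Sample-based target: r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Move Q(s, a) a little bit toward the target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def act(s, actions):
    # Epsilon-greedy: mostly pick the highest-Q action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

With millions of states and very many actions, this dictionary simply never fills in, which is exactly the problem the rest of the lecture addresses.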
So let's explore that idea a little further. Let's say that through our training experience learning to play Pac-Man, we've discovered that a state where we're boxed in by two monsters is a bad state. Now, if you tried to enumerate and represent all possible states, there would be a really, really large number of configurations of this world. And if we did that, we would know nothing about a second state that looks a lot like the bad state but isn't exactly the same state: the same kind of situation is happening, but in a very different portion of the Pac-Man grid. If we were just building tabular Q-learning representations, we wouldn't know anything about this state from having observed that the first state is bad. Or take an example that's even less different, where the only thing that's changed is that some of the food has disappeared.

This should give you the idea that we can't possibly transfer any knowledge unless we build some abstractions. In this case, the abstraction would have to encode something about the fact that you're boxed in by monsters, so that you can generalize from the first state to these other states. So our solution is going to be to describe a state, to represent a state, using a vector of features. Those features could be, for example, the distance to the closest ghost (the closest monster), the distance to the closest dot (the food), the number of ghosts, and so on. All of these seem like potentially important features that encode the important information about the game state without listing everything in the game. That is the key idea: we can build better representations of states, especially in settings with a really large number of states, and this is a relatively toy example compared to some of the real-world problems you might encounter.

This should again remind you of something we've now seen a couple of times, in the context of both computer vision and natural language processing: manually building good representations of images or of text is generally inferior to learning those representations inside a deep neural network. And that's exactly where we'll head next. In particular, we'll start discussing the idea of deep Q-learning, which essentially uses a neural network to predict the Q-function. The input to the network is the game state, which in this case might be represented as an image, so you might have a convolutional neural network processing it. The output of that network is the Q-values corresponding to the various actions. If you had, say, 15 actions in your environment, you would have 15 outputs, each of which is the Q-value for this state paired with the first action, the second action, the third action, and so on. That's your deep Q-network.
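Here is a minimal sketch of what such a network might look like, assuming the state is a small image and there are 15 discrete actions as in the example above. The layer sizes and input resolution are arbitrary illustrations, not an architecture specified in the lecture.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a game-state image to one Q-value per action."""
    def __init__(self, n_actions=15, in_channels=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)  # one output per action

    def forward(self, state):
        # state: (batch, channels, height, width) image tensor
        return self.head(self.conv(state))    # (batch, n_actions) Q-values

q_net = DQN()
q_values = q_net(torch.randn(1, 3, 84, 84))  # Q(s, a) for all 15 actions
```

One forward pass gives the Q-values for every action at once, which is exactly what we need both for acting (take the argmax) and for computing the max over actions in the target below.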
How are you going to train this deep Q-network? You have to use something related to the Bellman error. In particular, we're going to use a loss function that is simply the squared Bellman error and apply gradient descent on it. (There is a caveat here that we'll encounter very soon.) For the moment, recognize this expression as simply the right-hand side of the Bellman equation minus the left-hand side: it measures how inconsistent the current Q-function is with the Bellman equation. And the Q-function, remember, is now represented by the neural network itself, so we're asking how inconsistent this neural network is with the Bellman equation for the optimal Q-function. We take that inconsistency, square it, and use it as the loss function that we try to minimize with gradient descent. And eventually, just as in standard Q-learning, the policy action is the one with the highest predicted Q-value: if you find yourself at a particular state, you feed it through the trained Q-network, get the Q-values corresponding to the different actions, and pick the action with the highest predicted Q-value.

Okay, so remember what we said when we described the first Q-learning algorithm. We execute a single action from state s, observe the transition, and get a sample. That sample is used as a proxy for the expectation on the right-hand side of the Bellman equation, so we can measure the Bellman error in this manner, and we update our Q-function incrementally to be a little bit closer to what the right-hand side says it should be. We don't move it all the way there; we just move it a little bit closer.

So what does this look like when a neural network represents our Q-function? We've drawn the network as taking in (s, a) and outputting Q_phi(s, a), but really it takes the state as input and outputs Q_phi(s, a) for each of the different actions; its parameters are phi. Now let's write down what the Q-learning algorithm becomes in this setting. Again, we take one single action a_i and observe (s_i, a_i, s'_i, r_i), all the same as before. Then we form the target y_i = r_i + gamma * max_{a'} Q_phi(s'_i, a'), which is exactly the first two terms over here. The error is y_i minus the current Q-value, which we've also written down over here; it doesn't matter what the sign is, because we take the square of it anyway.

So what's happening in the update? It looks like we've suddenly introduced some kind of gradient: the derivative of Q_phi(s_i, a_i) with respect to phi, multiplied by this error. Observe that this is just the derivative with respect to phi of the squared Bellman error, (Q_phi(s_i, a_i) - y_i)^2. Since the derivative of x^2 with respect to x is 2x, the factor (Q_phi - y_i) reappears, and then by the chain rule we multiply by the derivative of Q_phi with respect to phi itself, which is exactly what appears in the update. If you don't see it immediately, convince yourself that this gradient is computed exactly like this. And so this is exactly what we said we were going to do.
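Putting that update into code: a single gradient step on the squared Bellman error for a batch of transitions (s, a, r, s'). A minimal sketch, assuming the DQN class above and PyTorch tensors; names are my own. Note that the target y is computed without tracking gradients, matching the derivation above, where y_i enters the update as a fixed target and only Q_phi(s, a) is differentiated.

```python
import torch

def q_learning_step(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    # Target: y = r + gamma * max_a' Q_phi(s', a'),
    # treated as a constant when we differentiate.
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=1).values

    # Current estimate Q_phi(s, a) for the action actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Squared Bellman error. Its gradient is
    # 2 * (Q_phi(s, a) - y) * dQ_phi(s, a)/dphi, as derived above.
    loss = ((q_sa - y) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```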
So we update the parameters of this Q-network with gradient descent, moving it closer to minimizing the squared Bellman error. We've gone from the incremental TD update to something that looks quite similar, in that it also has a learning rate alpha, but here we're applying that learning rate to the gradient of the squared Bellman error. In other words, we've gone from the incremental update step to gradient descent on the squared Bellman error loss. Now, there is an asterisk here, and we'll come back to it soon, because it's not quite as simple as the standard gradient descent we've seen so far.
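To see that correspondence at a glance, here are the two update rules side by side, in the notation used above (with the factor of 2 from differentiating the square absorbed into alpha):

```latex
% Tabular Q-learning: nudge one table entry toward the sample target.
Q(s,a) \leftarrow Q(s,a)
  + \alpha \Big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\Big]

% Deep Q-learning: one gradient step on the squared Bellman error,
% with target  y_i = r_i + \gamma \max_{a'} Q_\phi(s'_i, a') :
\phi \leftarrow \phi
  - \alpha \,\big( Q_\phi(s_i,a_i) - y_i \big)\, \nabla_\phi Q_\phi(s_i,a_i)
```

In the tabular case the "gradient" with respect to a single table entry is just 1, which is why the first rule is the special case of the second.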