All right, before moving on from DQN completely, let's think about what DQN would look like in environments that require continuous action spaces, and then see whether we can do something to alleviate the problems that arise in that context. So this is what our deep Q-learning network looks like so far. We have an input state S, and the outputs are the Q values corresponding to the different actions in our action space. We haven't mentioned this explicitly, but we've really been dealing with discrete actions so far, because that's the only way you can enumerate all the actions in this manner: from A equals 1 up to A equals capital A, where capital A is the finite size of the action space. And therefore we are able to have exactly that many output units, each of which produces the scalar corresponding to Q of S comma A equals 1, A equals 2, and so on. So this is how we've parameterized deep Q networks so far with these discrete action spaces.

Before looking at what happens in the continuous action setting, let's look at how we actually use this deep Q network to select an action. In particular, remember we said that to convert the Q values into a policy and select the optimal actions in the environment, all you have to do is take the argmax over A of Q of S comma A. And in this setting that's very simple: all of the outputs are produced in one shot, in a single forward pass through the network, so you just take the maximum of that finite set of values and you get your A star. And that's also your policy pi of S.

Now, this policy is used for selecting actions after having trained the Q network. But remember that we're doing something quite similar even during training. In particular, at training time we set the Q targets as the Bellman targets, and the Bellman target is the current reward plus the discount factor times the maximum over actions of Q phi at the next state. And this looks an awful lot like what we were doing over here. In particular, you can write it explicitly as Q phi of S i prime and pi of S i prime, pi being the argmax. So convince yourself that this maximization can be taken inside as an argmax, and when you do that, it's exactly the same as setting A i prime over here to pi of S i prime. This is obviously an operation we repeat very, very often, because we're using it at training time after all, so it had better be the case that we can compute it quickly. And in the discrete case we very much can, like we just went through: all you have to do is take the maximum of a discrete set of values.
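To make the discrete case concrete, here is a minimal sketch, assuming a PyTorch-style setup; the names (DiscreteQNetwork, greedy_action, bellman_target) are illustrative rather than taken from any particular codebase. The network has one output unit per action, the greedy policy is a single forward pass followed by an argmax, and the Bellman target reuses exactly that max over the output units.

```python
import torch
import torch.nn as nn

class DiscreteQNetwork(nn.Module):
    """Q-network for a discrete action space: one output unit per action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            # One scalar per action: Q(s, a=1), ..., Q(s, a=A), all in one forward pass.
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def greedy_action(q_net: DiscreteQNetwork, state: torch.Tensor) -> torch.Tensor:
    """pi(s) = argmax_a Q(s, a): one forward pass, then a max over the finite set of outputs."""
    with torch.no_grad():
        return q_net(state).argmax(dim=-1)

def bellman_target(q_net: DiscreteQNetwork, reward: torch.Tensor,
                   next_state: torch.Tensor, done: torch.Tensor,
                   gamma: float = 0.99) -> torch.Tensor:
    """Bellman target y = r + gamma * max_a' Q_phi(s', a'), i.e. Q_phi(s', pi(s'))."""
    with torch.no_grad():
        max_next_q = q_net(next_state).max(dim=-1).values
        # done is a 0/1 float mask so terminal transitions keep only the reward.
        return reward + gamma * (1.0 - done) * max_next_q
```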
So what happens if we try to port the same solution to the continuous action setting? Well, in the continuous action setting there is no meaningful way to enumerate the actions. You can't just list a finite set of actions, so you can't really do this anymore. Instead, you would have to come up with a Q network that looks a little different: in particular, you could take the state S and the continuous action A as input, and output Q of S comma A. That seems to solve the problem of not being able to enumerate Q of S comma A. But you still have a problem once you start thinking about how you would pick the policy action, which is the argmax over A of Q of S comma A. All of a sudden there is no simple way to take this argmax, and it looks more like an optimization problem. Taking the argmax of this network's output for a fixed state, over the action input, is perhaps the same type of optimization problem as the one we were solving when trying to find the weights of a neural network that minimize a loss function. So you could apply the same kinds of methods we used there; you could use gradient descent, for example. Of course, that's not something that converges very quickly, and you would have to solve this iterative gradient descent problem every time you wanted to take the argmax of Q of S comma A. That's not feasible, because remember, we want to solve this optimization problem really frequently, like we saw on the previous slide: both when computing the eventual policy actions and when setting the targets for our Q network. So this would be too expensive if we treated it as an optimization problem solved from scratch each time.

So what else could we do if we don't want to repeatedly solve this optimization problem? Is there some way to get around it? Well, here is one potential solution: let's train a neural network to produce the output of this optimization problem. Remember, the output of this argmax is a function of the input state S, so you have a separate optimal action A star corresponding to each input S. So why not train a network that maps from the input state to this output action A star, the solution of the optimization problem? What would that look like? It would be a network that takes S as input and produces A star as output. Now, this mapping from a state to the optimal action is exactly what the optimal policy is supposed to be doing, so let's just call this the policy network pi.

So far we've only said that the inputs are states S and the outputs are the optimal actions A star. How are we going to train this? We're going to train it to maximize the Q function. And of course we're parameterizing the Q function in a Q network, because we're operating with DQN. So now we have all the components we need: we train the Q network just like we normally do, using the standard squared Bellman error loss, and the policy network is trained in turn to maximize Q of S comma A with respect to the Q function we're currently learning. It's very common to call this kind of setup an actor-critic setup, where the policy is called the actor, because it produces the actions, and the Q network is called the critic, because you can think of it as evaluating a state-action pair and saying how good it is, which is exactly what Q of S comma A does. Remember, Q of S comma A is the expected utility of taking action A from state S and then following the optimal policy afterwards; we're not writing it explicitly anymore, but these are Q stars. So this is the actor and this is the critic, and this is our very first example of an actor-critic algorithm.
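Here is a correspondingly minimal sketch of that actor-critic setup, again in a PyTorch-style form with illustrative names (Actor, Critic, update) that are assumptions, not from any particular implementation. The critic takes the state and action together and is trained on the squared Bellman error, with the actor supplying the next action in place of the max; the actor is trained to maximize the critic's output by descending on its negative.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): takes the state and the continuous action together as input, outputs one scalar."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

class Actor(nn.Module):
    """pi(s): maps a state to a continuous action, standing in for argmax_a Q(s, a)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions bounded to [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def update(actor, critic, actor_opt, critic_opt, batch, gamma: float = 0.99):
    """One actor-critic update on a batch of transitions (s, a, r, s', done)."""
    s, a, r, s_next, done = batch

    # Critic: squared Bellman error, with the actor supplying the next action
    # in place of the max over actions.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic(s_next, actor(s_next))
    critic_loss = ((critic(s, a) - target) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize Q(s, pi(s)) by minimizing its negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

A full implementation of the algorithm described next would also keep slowly-updated target copies of both networks for computing the Bellman target, and add exploration noise to the actor's actions when collecting data; those details are omitted in this sketch.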
And this algorithm, where you take DQN and modify it in this way to work with continuous actions, is called deep deterministic policy gradient, or DDPG. We saw an example of how this works earlier with the robotics setups. In robotics you very often need to output continuous actions, for example the torques in the various motors of your robot arm. And you'll remember we saw an example earlier of how that works; that was, in fact, employing DDPG.