All right. So now that we've seen DQN, let's spend some time thinking about some other algorithms for reinforcement learning. In particular, we'll start with direct policy search and actor-critic, which are both in the same broad family of algorithms as Q-learning. Let's quickly recap what we already know, just to set that up.

You'll remember that we've spoken about Markov decision processes, where an agent executes an action a_t belonging to a set of actions A and then observes the resulting state s_t from a set of states S. This process repeats over and over: the world transitions according to a transition function P(s' | s, a), and there's a reward R(s, a, s') associated with every transition. The agent's objective is to maximize the discounted sum of rewards over time, sum over t of gamma^t R(s_t, a_t, s_{t+1}), by executing a good action sequence a_1 through a_T. And T in some cases could be infinity, so this could be an infinite action sequence.

Okay. Within this MDP formalism, we've seen two kinds of things: things we can do when we know the MDP, and things we can do when we don't. When we know the MDP, we can find the value functions and the optimal policies exactly, using methods like value iteration and policy iteration. We also know how to find the value function of some particular policy pi. Now, if we don't know the MDP, we have also studied how to do policy evaluation in that setting, and we have seen our very first RL algorithm, Q-learning, where we estimate the optimal Q value corresponding to a state and action. We have learned how to abstract the state using deep neural networks within the Q-learning approach, and we've seen how to improve that further with experience replay, target networks, and double Q-networks. And most recently, we've seen DDPG, where we include a policy network and also introduce an action abstraction into the Q-network, feeding the action in as an input, and this helps you handle continuous actions.

Okay, so broadly, all of what we've seen in the context of reinforcement learning, DQN and all its variants and DDPG, falls under the category of approaches called model-free RL. We'll see very soon why they're called model-free, but just to quickly foreshadow: the model that these algorithms are free of is the transition model. They don't explicitly learn a transition model P(s' | s, a); instead, they bypass that step and directly learn the policy. And you'll see when we talk about model-based reinforcement learning that there is the option of learning the model.

Now, the methods that we've seen, including DQN, double DQN, and something that we haven't seen called Rainbow, which builds on top of these approaches, are value-function-based methods, because of course they're based on the Q-function. And we've seen them already in some detail. But there's another major class of methods within model-free RL, called direct policy search methods. In direct policy search, we sidestep even the step of learning a Q-function. Not only do we not learn a model, we don't even learn a Q-function. We instead directly learn a parameterized mapping from states to actions. So remember, in the value-function-based methods, we learn the Q-function, and then we execute the action a* = argmax_a Q(s, a).
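To make that contrast concrete, here is a minimal sketch, my own illustration rather than anything from the lecture, of how a value-function-based agent picks an action: it evaluates a learned Q-network and takes the argmax over actions. The QNetwork class, its layer sizes, and the dimensions are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: maps a state vector to one Q-value per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Returns a vector of Q(s, a) values, one entry per action.
        return self.net(state)

q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(4)  # stand-in for an observed state s

# Value-function-based action selection: a* = argmax_a Q(s, a).
with torch.no_grad():
    a_star = q_net(state).argmax().item()
print(a_star)
```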
Except at the very end, when we spoke about DDPG, we did something kind of similar to this: we learned a network that maps states directly to actions. And we'll return to that very soon. But even in DDPG, we still had a Q-function. In direct policy search methods, we just directly learn a policy. So how do we do something like that?
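As a quick preview before we get into the details, here is a minimal sketch, again my own illustration with hypothetical names and sizes, of what "a parameterized mapping from states to actions" can look like: a small network pi_theta that outputs a probability distribution over actions, from which we sample an action, with no Q-function anywhere.

```python
import torch
import torch.nn as nn

# Hypothetical policy network pi_theta(a | s): state in, action distribution out.
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # Logits over actions define a categorical distribution pi_theta(a | s).
        return torch.distributions.Categorical(logits=self.net(state))

policy = PolicyNetwork(state_dim=4, num_actions=2)
state = torch.randn(4)  # stand-in for an observed state s

# Direct policy search: act by sampling from pi_theta(a | s); no Q-function involved.
# The parameters theta would be tuned to maximize expected discounted return,
# for example by policy gradient methods, which is where we're headed next.
action = policy(state).sample().item()
print(action)
```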