All right. So now that we've seen DQN, let's spend some time thinking about some other algorithms for reinforcement learning. In particular, we'll start with direct policy search and actor-critic, which are both in the same broad family of algorithms as Q-learning. And let's quickly recap what we already know, just to set that up.

You'll remember that we've spoken about Markov decision processes, where an agent executes an action a_t belonging to a set of actions A and then observes a state s_t from a set of states S, and this process repeats over and over. The world transitions according to a transition function P(s' | s, a), and there's a reward R(s, a, s') associated with every transition. The agent's objective is to maximize the discounted sum of rewards over time by executing a good action sequence a_1 through a_T, and T in some cases could be infinity, so you could have an infinite action sequence.

Within this MDP formalism, we've seen two kinds of things: things we can do when we know the MDP, and things we can do when we don't. When we know the MDP, we can find the value functions and the optimal policies exactly, using methods like value iteration and policy iteration, and we also know how to find the value function associated with some particular policy pi. If we don't know the MDP, we have also studied how to do policy evaluation, and we have seen our very first RL algorithm, Q-learning, where we estimate the optimal Q value corresponding to a state and action. We have learned how to abstract the state using deep neural networks within the Q-learning approach, and we've seen how to improve that further with experience replay, target networks, and double Q-networks. Most recently, we've seen DDPG, where we include a policy network and also introduce an action abstraction into the Q network, and this helps you handle continuous actions.

Okay. So broadly, all of what we've seen in the context of reinforcement learning, DQN with all its variants and DDPG, falls under the category of approaches called model-free RL. We'll see very soon why they're called model-free, but just to quickly foreshadow: the model that these algorithms are free of is the transition model. They don't explicitly learn a transition model P(s' | s, a); instead, they bypass that step to directly learn the policy. And you'll see when we talk about model-based reinforcement learning that there is also the option of learning the model. Now, the methods that we've seen, including DQN, double DQN, and something we haven't seen called Rainbow, which builds on top of these approaches, are value-function-based methods, because of course they're based on the Q function, and we've seen them in some detail now. But there's another major class of methods within model-free RL, called direct policy search methods. In direct policy search, we sidestep even the step of learning a Q function: not only do we not learn a model, we don't even learn a Q function. Instead, we directly learn a parameterized mapping from states to actions.
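To make that concrete, here is a minimal sketch (in PyTorch, with class and parameter names I'm choosing for illustration, not from the lecture) of the kind of object direct policy search learns: a parameterized mapping pi_theta from states to a distribution over actions, with no model and no Q function anywhere.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi_theta: a small network mapping a state to a distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        # theta = the weights of this network
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Returns pi_theta(. | s): a categorical distribution over actions.
        return torch.distributions.Categorical(logits=self.net(state))

policy = SoftmaxPolicy(state_dim=4, n_actions=2)
action = policy(torch.zeros(4)).sample()  # act by sampling from pi_theta(. | s)
```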
So remember, in the value-function-based methods, we learn the Q function and then execute the action a* = argmax_a Q(s, a). The exception was at the very end, when we spoke about DDPG and did something kind of similar to direct policy search; we'll return to that very soon, but even in DDPG we still had a Q function. In direct policy search methods, we just directly learn a policy. So how do we do something like that?

Okay, so it would be great if we could somehow train our policies with gradient ascent, just like we've been training our supervised learning algorithms, and our unsupervised learning algorithms for that matter, in the last few weeks. That is, we could take the parameters of the policy at the previous step and add a learning rate times the gradient of the thing we'd like to optimize, in this case the expected utility. Taking the gradient of the expected utility with respect to the parameters gives us the direction in which we should move to set the new parameters so that the value increases.

Now, what is the value that's increasing? It's the expected utility under the policy. And what does that mean? Well, every policy generates some trajectories; a policy induces a distribution over trajectories, and trajectories, remember, are state-action sequences. And of course these trajectories have corresponding rewards, so you can think of every policy as inducing a distribution over rewards. That's what we're doing here: we're taking the expectation of the rewards under the distribution of trajectories induced by this policy.

Okay, so I said it would be great if we could do something like this; it turns out we actually can. We can take the gradient, and it works out to the expression on the slide. Now, this expression looks quite complicated at first glance, but really it's quite easy to parse, so let's step through it one term at a time. On the outside, we have a 1/N and a summation over trajectories, which says we're averaging something over trajectories 1 through N. Within each trajectory, we also have a summation over time steps t = 1 to T. And remember, the larger this gradient, the more we're changing the policy parameters. This is the gradient of the expected utility with respect to theta, and it is proportional to the gradient with respect to theta of the logarithm of this term: the probability assigned by the policy to the action at time t, conditioned on the state at time t, within the current trajectory among the N trajectories. So we've seen some trajectories, and what we're saying, since we're taking the gradient with respect to the parameters theta, is that if we move in the direction that makes the actions we performed more likely, we'd be moving in the direction of the gradient, and therefore increasing the expected reward over time. Now, that doesn't by itself make a lot of sense, and that's because we haven't yet come to the last term.
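For reference, the expression on the slide isn't reproduced in this transcript, but the verbal description matches the standard REINFORCE estimator, which reads:

    \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big) \left( \sum_{t'=t}^{T} r_{t'}^{(i)} \right)

where the inner sum of rewards is exactly the last term discussed next.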
So far, what we've said, ignoring that last term, is that we're trying to make the actions we've already performed more likely under the policy; we're just trying to do more of what we've been doing. But that wouldn't make sense by itself, just like we said, and that's why the last term exists. That term is essentially going to evaluate how good each of those actions was. You can think of it as assigning a weight to each action, based on how good the rest of the trajectory was after you executed that action. That's why you have the summation from time step t until the end of the episode of the rewards at the subsequent time steps. All right, so what does this all say? It tells us that we're trying to make the actions that were good more likely under the policy. If that weighting quantity is small, then we aren't trying to make those actions more likely, so we'd really only be making the good actions, the ones that produced good rewards, more likely under our policy. And that's really quite intuitive when we think of it that way. Now, one thing I want to point out is that this term should have rung a bell: it is, of course, just a value function, and the next slide will make an explicit connection between policy gradients and value functions. But before we move on, I should note that on this slide we've ignored the existence of a discount factor, just to keep things simple; a lot of these same ideas continue to hold in the presence of a discount factor.

Okay, so having seen policy-search-based methods, we now have our two broad families of model-free RL: value-function-based methods and policy search. Within policy search there are several approaches, including REINFORCE, TRPO, PPO, and SVPG; we aren't getting into those in detail at this point. But what I do want to do is make more explicit the connection between the term that appeared in the policy gradient and value functions. In particular, if you think of value-function methods and policy-search methods as lying on two ends of a spectrum, there's a class of algorithms called actor-critic that blends both those worlds and tries to create the best of both. Among those actor-critic algorithms is the one we already saw at the end of the last segment on DQN for continuous actions, which is DDPG, and there are several others like TD3, A3C, SAC, IMPALA, etc. These represent some of the highest-performing algorithms we have today in reinforcement learning.

The idea here is essentially to say that within the expression for the policy gradient, you're going to replace the sum of real rewards, sum_t r_t, which looked a lot like a value function, with an actual value function. The reason you would do that is variance: if you only use the real reward from that particular trial, you have really high variance, because the rewards are, after all, stochastic, and the transitions are stochastic. So if you only look at what's happening in a single episode and use that to evaluate a particular action in our policy gradient expression, you aren't really making the best use of all the data you've encountered before. If instead you've been maintaining a value function, you don't have to rely purely on that single episode; you can use the expected reward produced by that value function, and you therefore get a much better evaluation of how good your actions were. You can plug that into your policy gradient, and that gives you better, less noisy policy gradients.
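Here's a hedged sketch of that substitution in code, again assuming a PyTorch setup; the function and argument names are illustrative, not from the lecture.

```python
import torch

def policy_gradient_loss(log_probs, rewards, values=None):
    """log_probs: log pi_theta(a_t | s_t) for one episode, shape [T].
    rewards: the real rewards r_1..r_T observed in that episode, shape [T].
    values: optional critic estimates of the expected return from each s_t."""
    # Reward-to-go: for each t, the sum of real rewards from t to the end.
    reward_to_go = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    # The plain policy gradient weights each action by the noisy
    # single-episode return; actor-critic swaps in the critic's expectation.
    weight = reward_to_go if values is None else values
    # Minimizing this loss is gradient ascent on the expected utility.
    return -(log_probs * weight.detach()).sum()
```

With values=None this is the plain policy gradient from before; passing in learned value estimates is exactly the lower-variance actor-critic substitution just described.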
So before wrapping up the segment, let me show you some examples of successes in model-free reinforcement learning in a domain that I particularly like, which is robotics. Here's an example from OpenAI of a robot hand solving a Rubik's cube. This was done with three RGB cameras and 16 motion-capture cameras as sensors, and with the equivalent of 100 years of experience in simulation; using a large compute cluster and accelerated simulations, that amounted to about 50 hours of real-world time. This was considered one of the major advances in dexterous manipulation for robots; things like this are notoriously hard to do with robots. Here's another example, again in the manipulation setting, where a robot is trying to turn a wheel to some given configuration, and you can see that it's managing to do it. It did this from image observations, in this case a single RGB camera, in about 20 hours of experience, so these things do take a really long time to train. More recently, from what are called state observations, meaning you're operating only in the space of joint angles and so on, you can teach a legged robot like these to walk in about two hours. Those are pretty low-dimensional observations, and that's why you can get away with two hours.