Okay. So having seen policy-search-based methods, we now have two broad families of model-free RL: value-function-based methods and policy search. Within policy search, there are several approaches, including REINFORCE, TRPO, PPO, and SVPG. We won't get into those in detail on this slide, but what I do want to point out, and make more explicit, is the connection between a term that appeared in the policy gradient and value functions.

In particular, if you think of value-function-based methods and policy-search-based methods as lying at two ends of a spectrum, there's a class of algorithms called actor-critic that blends both worlds and tries to get the best of both. Among those actor-critic algorithms is the one we already saw at the end of the last segment on DQN for continuous actions, which is DDPG. There are several others, like TD3, A3C, SAC, IMPALA, and so on, and these represent some of the highest-performing algorithms we have in reinforcement learning today.

The idea here is essentially to say that within the expression for the policy gradient, you're going to take that sum of actual rewards, Σ_t r_t, which looked a lot like a value function, and replace it with the value function itself. And the reason you would do this is variance. If you only use the actual rewards from that particular trial, you get really high variance, because remember, the rewards are, after all, stochastic, right? The transitions are stochastic. So if you only use a single sample, if you only look at what happened in a single episode and use that to evaluate a particular action in our policy gradient expression, then you aren't really making the best use of all the data you encountered before. If instead you've been maintaining a value function, you don't have to rely purely on that single episode; you can use the expected return produced by the value function. That gives you a much better evaluation of how good your actions were, and plugging it into your policy gradient gives you better, less noisy policy gradients.
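To make the substitution concrete, here is a minimal sketch in PyTorch; it is not from the lecture, and the ActorCritic class, actor_critic_loss function, and network sizes are illustrative assumptions. It contrasts the REINFORCE weighting (the raw Monte Carlo return G_t) with the actor-critic variant, which brings in the critic's value estimate, shown here in the common advantage form G_t − V(s_t), to reduce the variance of the gradient.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative actor-critic for discrete actions (names are assumptions)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                   nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 1))

def actor_critic_loss(model, states, actions, rewards, gamma=0.99):
    # Monte Carlo returns: G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running

    # Log-probability of the actions actually taken in the episode.
    logits = model.actor(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Critic's value estimate V(s_t) for each visited state.
    values = model.critic(states).squeeze(1)

    # REINFORCE would weight log pi(a_t|s_t) by the raw return G_t alone.
    # The actor-critic variant uses the learned value function as well,
    # here via the advantage G_t - V(s_t), which lowers gradient variance.
    advantage = returns - values.detach()
    actor_loss = -(chosen * advantage).mean()

    # The critic is trained to match the observed returns.
    critic_loss = (returns - values).pow(2).mean()
    return actor_loss + 0.5 * critic_loss

# Example usage with random data (obs_dim=4, 2 actions, episode length 10):
model = ActorCritic(4, 2)
states = torch.randn(10, 4)
actions = torch.randint(0, 2, (10,))
rewards = torch.randn(10)
loss = actor_critic_loss(model, states, actions, rewards)
loss.backward()
```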