Okay, so it would be great if we could somehow train our policies with gradient descent, just like we've been training our supervised learning algorithms, and our unsupervised learning algorithms for that matter, over the last few weeks. That is, we could take the parameters of the policy from the previous step and add a learning rate times the gradient of the thing we'd like to optimize, in this case the expected utility. Taking the gradient of the expected utility with respect to the parameters gives us the direction in which we should move to set the new parameters, theta new, so that the expected utility increases. Strictly speaking this is gradient ascent, since we add the gradient, but the idea is the same.

Now what is the value that's increasing? It's the expected utility under the policy. And what does that mean? Well, every policy generates some trajectories if you will; a policy induces a distribution over trajectories, and trajectories, remember, are state-action sequences. Of course these trajectories have corresponding rewards, so you can think of every policy as inducing a distribution over rewards. That's what we're doing here: we're taking the expectation of the rewards of the trajectories induced by this policy, under that same policy.

Okay, so I said it would be great if we could do something like this. It turns out we actually can. We can take the gradient, and it works out to this expression. Now the expression looks quite complicated at first glance, but it's really quite easy to parse, so let's step through it one term at a time. On the outside we have a summation over trajectories: the one over N tells us we're averaging something over trajectories one through capital N. Within each trajectory we also have a summation over time steps, t equals one to capital T.

Remember that this is the gradient of the expected utility with respect to theta, and the larger it is, the more we change the policy parameters. It turns out to be proportional to the gradient, with respect to theta, of the logarithm of the probability that the policy assigns to the action at time t, conditioned on the state at time t, within the current trajectory among the capital N trajectories. So we've seen some trajectories, and what this says is: if we move in the direction that makes the actions we performed more likely, because we're taking the gradient with respect to the parameters theta, then we're moving in the direction of the gradient, and therefore increasing the expected reward over time.

Now that doesn't by itself make a lot of sense, and that's because we haven't yet come to the last term. So far, ignoring that term, what we've said is that we're trying to make the actions we've already performed more likely under the policy; we're just trying to do more of what we've been doing, right? But that wouldn't make sense by itself, and that's why the last term exists: it's essentially going to evaluate for you how good each of those actions was.
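For reference, here is a hedged transcription of the expression being described, written in standard notation (the slide's own symbols may differ slightly). The update is

\theta_{\text{new}} = \theta_{\text{old}} + \alpha \, \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)],

and the gradient estimate being walked through term by term is

\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \left( \sum_{t'=t}^{T} r_{t'}^{(i)} \right).

The inner sum of rewards is the weighting term discussed next.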
So you can think of this last term as assigning a weight to each of those actions, based on how good the rest of the trajectory was after you executed that action. That's why you have the summation, from time step t until the end of the episode, of the rewards at the subsequent time steps, all right? So what does this all say? It tells us that we're trying to make the actions that were good more likely under the policy; if that weight is a small quantity, then we're barely trying to make the corresponding action more likely at all. So we're really only trying to make the good actions, the ones that were followed by good rewards, more likely under our policy, and that's really quite intuitive when we think of it that way.

Now one thing that I want to point out here is that this weighting term should have rung a bell: it is of course just a value function, or more precisely a sampled estimate of one. The next slide will make the connection between policy gradients and value functions explicit. But before we move on, I should note that on this slide we've ignored the existence of a discount factor just to keep things simple; a lot of these same ideas will continue to hold in the presence of a discount factor.
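To make the procedure concrete, here is a minimal sketch of this Monte Carlo policy gradient update in Python. Everything in it, the toy environment, the tabular softmax policy, the reward and transition rules, and the hyperparameters, is an illustrative assumption rather than anything specified in the lecture.

```python
# Minimal sketch of the policy-gradient update described above (no discounting).
# The environment, softmax policy parameterisation, and hyperparameters are made up.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # policy parameters

def policy(theta, s):
    """Softmax action probabilities pi_theta(. | s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta for a tabular softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

def sample_episode(theta, T=10):
    """Roll out one toy episode; dynamics and rewards here are placeholders."""
    states, actions, rewards = [], [], []
    s = rng.integers(n_states)
    for _ in range(T):
        a = rng.choice(n_actions, p=policy(theta, s))
        r = 1.0 if a == (s % n_actions) else 0.0   # made-up reward rule
        states.append(s); actions.append(a); rewards.append(r)
        s = rng.integers(n_states)                 # made-up transition
    return states, actions, rewards

alpha, N = 0.1, 16
for iteration in range(100):
    grad = np.zeros_like(theta)
    for _ in range(N):                             # average over N trajectories
        states, actions, rewards = sample_episode(theta)
        for t in range(len(rewards)):
            reward_to_go = sum(rewards[t:])        # rewards from step t to the end
            # push the performed action to be more likely, weighted by how good
            # the rest of the trajectory was after taking it
            grad += grad_log_pi(theta, states[t], actions[t]) * reward_to_go
    theta += alpha * grad / N                      # gradient-ascent step on theta
```

The two pieces that mirror the expression above are the reward-to-go sum and the grad_log_pi weighting; everything else is scaffolding for the toy rollout.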