We're happy to have her here. She's a professor at ETH Zurich, where she leads the Optimization and Decision Intelligence group. In her work she's involved in many areas of machine learning, but mainly from the optimization-theory perspective, establishing the detailed foundations and properties of algorithms for stochastic optimization problems. These techniques apply well to reinforcement learning, and to policy gradient in particular, where some of our latest work is also involved. So please let's welcome her together.

All right, thanks for the introduction. It's a really great pleasure to be here. As introduced, in this lecture I will discuss policy gradient methods. As mentioned, this is a topic close to my heart, and also a topic that has seen tremendous, very exciting progress both in practice and in theory. Of course, it's hard to imagine covering such a big topic in just one and a half hours, but I'll try my best to give you both a treatment of the basics of policy gradient methods and some high-level ideas of what's going on in the research directions concerning theoretical guarantees.

Okay, so here's an outline of the lecture for today. I think you have already learned a bit about policy gradient methods from the Monday lecture by Olivia; here we'll dive a bit deeper into the details of the specific algorithms and results.

All right, to start with: you all know by now that reinforcement learning is about an agent trying to learn to act optimally in an unknown environment. Oftentimes we say the agent's behavior is described by a policy pi, which is a mapping from the state space to the action space, or to distributions over the action space. The environment is usually defined by the model, which captures both the transition dynamics P and the reward function r; these are oftentimes unknown in the real world. At each round t, the agent observes some state s_t from the environment, then plays some action a_t according to its behavior strategy, the policy pi; the environment returns a reward signal r_t to the agent and transits to a new state s_{t+1} following the transition dynamics P. As for the goal, we can define the expected cumulative reward as the value function V^pi(s), for a given initial state s and policy pi. Here in particular I'm looking at the infinite-horizon setting, where we consider a discount factor gamma between zero and one, because we know from an economic perspective that humans usually favor immediate reward over delayed reward. The goal, oftentimes, is to maximize this value function V^pi over all feasible choices of policy pi.

So these are really the three key elements of reinforcement learning: the policy, the model, and the value function. That said, most reinforcement learning algorithms can be characterized into three big classes. First, the value-based methods, which directly try to learn or estimate the optimal value functions V* or Q*; based on these optimal value functions you can then recover the optimal policy, by taking the greedy policy with respect to them.
There are algorithms you already learned yesterday, like temporal difference learning, Q-learning, SARSA; these really aim to estimate the optimal value functions. On the other hand, policy-based methods directly search for the optimal policy within your policy class. You can search through all kinds of algorithms based on zero-order information of your objective function, or on first-order information; typical algorithms here are policy gradient methods, natural policy gradient methods, and related methods, some of which we'll discuss in more detail today.

You should also notice that for value-based methods, a big advantage is that they can easily leverage the Bellman or Markovian structure, using temporal differences to estimate the value functions effectively, in the sense that these estimators usually have much lower variance. However, the downside of value-based methods is that they are really not very scalable to large state and action spaces. Remember that the Q-learning update, for example, uses the Bellman optimality operator, which requires you to take the best action at the next state. If your action space is very large, or even continuous, this boils down to solving a hard non-linear optimization problem, which can already be very challenging. So policy-based methods come to the rescue to really scale things up to large and potentially even continuous action spaces. However, the downside, as you have also learned, is that they oftentimes suffer from very high variance when you try to estimate the gradient of your objective, and they can suffer in sample efficiency as well because of that.

So this lecture focuses on policy-based methods. The idea is that we want to parameterize our policy with some parameter theta, and we represent the policy pi as pi_theta. The goal is to find the best parameter theta that maximizes the expected long-term reward. Here is one common objective: we maximize the episodic reward over all possible choices of theta, finding the best parameter. There are many alternative objectives you can consider; instead of episodic reward you can also look at average reward. Here in particular we're going to look at the expected long-term reward when your initial state s follows some initial distribution mu, and you are interested in maximizing the expectation of your value function over that initial state s.

In particular, we're going to consider parameterized policies that are stochastic policies. These are policies where we define pi_theta(.|s), for any given state s, as a distribution over the action space. There are many reasons why we are interested in stochastic policies. One rather convenient reason is that stochastic policies allow us to deal with a more well-defined, continuous objective.
This objective is continuous, and oftentimes differentiable if your stochastic policy is differentiable; that allows us to use many algorithms from the rich class of gradient-based methods. There are also other reasons to favor stochastic policies. As mentioned in the previous lectures, if you have non-stationary environments, the optimal policy may not necessarily be deterministic; maybe the optimal policy has to be stochastic, and you want to be able to learn it. And in practice, a stochastic policy also encourages more exploration in the real world. So there are a lot of good reasons to learn stochastic policies.

So how do we represent or parameterize a stochastic policy? In the discrete-action setting there are many natural choices. One is direct parameterization: for any given state and action, we assign a parameter theta_{s,a} to pi(a|s). Since the policy has to be a distribution for any given state s, these parameters should be non-negative and sum to one when you fix any given s. This means that if your state and action spaces are large, you have to deal with this simplex constraint, which is oftentimes not very convenient. An alternative is the softmax policy, where you parameterize pi_theta using a normalized exponential function; as you can see, you can then choose an arbitrary theta without having to satisfy any constraints.

So far, if your state and action spaces are reasonably small, or of medium size, you can represent pi_theta this way; this is also called the tabular setting. But if the state and action spaces are very large, you don't really want to keep a parameter for every state-action pair, so we need to leverage the function approximation techniques you also studied yesterday. We can introduce some feature mapping phi and instead consider, for example, a linear function approximation, with theta belonging to some low-dimensional space of dimension d, much smaller than |S| times |A|. This is more computationally efficient, in the sense that you don't have to deal with a very large parameter vector anymore, but the caveat is that this policy class may not contain the underlying ground truth: the true optimal policy of the real problem may not always be realizable in this case. To take a step further, we can also introduce neural networks: instead of the linear function we can use f_theta, some nonlinear neural network, which gives you more representation power but also introduces additional computational challenges. So that covers the discrete setting.
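To make the discrete-action parameterizations concrete, here is a minimal sketch in Python (my own illustration, not from the lecture materials) of the tabular softmax policy and a log-linear variant; the feature map `phi(s, a)` is an assumed ingredient you would supply:

```python
import numpy as np

def softmax_policy(theta, s):
    """Tabular softmax: theta has shape (n_states, n_actions).
    Returns the distribution pi_theta(. | s) over actions."""
    logits = theta[s] - theta[s].max()      # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def log_linear_policy(theta, phi, s, n_actions):
    """Log-linear variant: pi_theta(a|s) proportional to exp(theta^T phi(s, a)),
    with theta in R^d and a feature map phi(s, a) in R^d."""
    logits = np.array([theta @ phi(s, a) for a in range(n_actions)])
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()
```

Note that the softmax form is unconstrained in theta, whereas direct parameterization would require projecting onto the probability simplex after every update.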
For continuous action spaces, there are also plenty of other ways to parameterize a stochastic policy; in this case you can consider a continuous probability distribution. One natural way, if your action space is one-dimensional, say, is to parameterize your policy as a Gaussian distribution: you parameterize the mean mu as some function of theta, you parameterize the variance sigma as also some function of theta, and then you represent your policy in that form. Of course, you can also consider many other types of probability distributions.

So once we have a parameterization, we have an objective over the parameter space theta, and a natural question is: what does this objective look like? What is the optimization landscape we're talking about? In general this objective, which we denote J(pi_theta) as a function of theta, is not necessarily concave. It is non-concave even in the very simplest settings, where you only take the direct parameterization or the softmax parameterization.

Let me give you a very simple example to illustrate this. Consider the MDP here at the bottom. You have five states, s1 through s5, and two actions, a1 and a2: a1 is to move up, a2 is to move right. There are three terminating states, s3, s4, s5, and there is only one transition where you get a positive reward R, namely when you transition from s2 to s4 by taking the up action there. Suppose there is no discounting, so gamma is equal to one. What is the value function at state s1? You only collect positive reward along this one path, so the value at s1 is basically the probability of taking the right action at s1, times the probability of taking the up action at s2, times the reward R. If you view these two probabilities as two parameters, the value is a product of two parameters, and such a product is neither concave nor convex.

To be more specific, consider these two policies, pi1 and pi2. Policy pi1 takes the right action at s1 with probability 3/4 and takes action a1 (up) at s2 also with probability 3/4; pi2 is defined as the opposite, with probabilities 1/4. Now look at the average of these two policies, which takes each of these actions with probability 1/2, and plug these policies in. You can see that the value functions do not satisfy the Jensen-type inequality that concavity would require: the value of the averaged policy lies strictly below the average of the two values.
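Here is a tiny numeric check of this example (my own sketch; R = 1 is an assumed value): the averaged policy attains value 1/4, strictly below the average 5/16 of the two policies' values, so concavity fails.

```python
def value_s1(p_right_s1, p_up_s2, R=1.0):
    # Only one rewarded path: right at s1, then up at s2 (gamma = 1).
    return p_right_s1 * p_up_s2 * R

pi1, pi2, mix = (0.75, 0.75), (0.25, 0.25), (0.5, 0.5)
v1, v2, v_mix = value_s1(*pi1), value_s1(*pi2), value_s1(*mix)
print(v_mix, 0.5 * (v1 + v2))   # 0.25 < 0.3125: Jensen's inequality fails
```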
So the concavity property fails for the direct parameterization. For the softmax parameterization it is similar: you can define four parameters, theta_1 through theta_4, one for each of these state-action pairs, and this is how the value function looks under the softmax parameterization. Again, you can pick simple values of theta_1, theta_2 and so on that correspond to the policies from the previous example, and you can check that the concavity property fails to hold in this case as well. So in general we are really dealing with a non-concave objective, even under these very idealized parameterizations, without even involving any neural networks. Once we parameterize these policies with neural networks, you can imagine the objective becomes heavily non-concave.

So how do we optimize, in general, a non-concave objective over the parameter space, or over the policy space? There are of course many algorithms in the literature, from many other fields, based on using only zero-order information of your objective: if you deploy your policy and have a way to estimate the objective value, you can apply many gradient-free methods, like hill climbing, simulated annealing, and so on. But for reinforcement learning in particular, it turns out that estimating the gradient is actually not an obstacle; you can estimate the gradient efficiently, and that's why gradient-based methods are particularly popular. This is the focus of this lecture: we're going to look more closely at policy gradient methods and natural policy gradient methods.

In general, of course, we cannot compute the gradient exactly; we can only find ways to roughly estimate it. Here I denote by hat-gradient some stochastic estimate of the true gradient, based on samples or trajectories you have observed. There are a number of fundamental questions we want to ask. One: how do we construct a good estimator, where "good" oftentimes means you want your gradient estimator to be unbiased and to have relatively low variance? Two: where does the algorithm converge to in general, and how fast does it converge to that limit point? And three: how do we improve it, how can we make these methods perform much faster in practice?

Before we dive into the details, let me start with a little quiz; maybe you can try to make a guess. Here are four algorithms, and these are plots of the trajectories of the four different algorithms. In particular, the green region here describes what the set of value functions looks like; this is, in fact, the set of value functions over all feasible policies. This is just a simple two-state MDP example, so the set is two-dimensional, and the dots are the trajectories of the four algorithms under different initial points. If you were to guess: which one of these behaviors corresponds to policy gradient methods? Yes, please. Okay, and how can you tell this is policy gradient?
"I don't know, just a feeling." Okay. "It's following exactly the vertices of the set, sort of." Yeah, I think your observation is close, but that does not necessarily indicate the direction of the gradient there. In fact, that one is an algorithm you already learned in one of the lectures. Any other guesses? "Figure A." Wow, okay. Why figure A? Because of the stochasticity, you think? So let me describe again what this green region is: it is the set of all achievable value-function pairs, (V^pi(s1), V^pi(s2)), over all pi in the feasible set. So if you see dots outside the region, those values do not correspond to any feasible policy pi.

So now we're guessing C. Okay, C looks promising, because first of all you can see it's very slow; the color code here indicates how many iterations it takes to reach the end point. You can also see that C shows an algorithm that is very sensitive to initial points: depending on where the initial point is, it might converge very slowly, or it might stall at some point. This is the typical behavior you would expect from gradient methods on non-convex objectives: they can be very slow, especially when you don't have the true gradient, but even with the true gradient they can still be slow, and they can be very sensitive to initial points. So C is in fact policy gradient. And figure A is indeed value iteration, because value iteration converges fast, but value iteration does not guarantee that the value functions at every iteration correspond to any policy; they can be infeasible. B is indeed policy iteration; you can see policy iteration is in fact very fast, converging in only two iterations in this case, no matter where you start. And the last one is an improved version of policy gradient, natural policy gradient; this algorithm converges much faster than policy gradient and is a bit less sensitive to the initial points. Okay, so now let's dive into why you would expect these types of behavior. Any questions at this point?

So, just to recall, this is the objective we want to maximize for a given policy pi_theta represented by theta. You can also rewrite this objective as the expected trajectory reward. Here I define tau, to align with the notation you have seen before, as a random trajectory: the sequence s_0, a_0, s_1, a_1, and so on to infinity. The probability of observing this random trajectory tau is: how likely you observe the initial state s_0, since it follows the distribution mu; times, at each state s_t, how likely you take action a_t under pi_theta; times how likely you transition from (s_t, a_t) to the next state s_{t+1}. That is the probability p_theta(tau) of the random trajectory tau. And R(tau) is defined as the total reward along this random trajectory.
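As a concrete illustration of these definitions (my own sketch, not from the slides): sampling a truncated trajectory while accumulating its log-probability log p_theta(tau) and its discounted return, assuming tabular dynamics `P[s, a]` and a policy function `pi(theta, s)` like the ones sketched earlier.

```python
import numpy as np

def rollout(mu, P, reward, pi, theta, T, gamma, rng):
    """Sample a length-T trajectory tau under pi_theta.
    Returns the trajectory, log p_theta(tau), and the discounted return R(tau).
    mu: initial state distribution; P[s, a]: next-state distribution;
    reward[s, a]: reward table; pi(theta, s): action distribution."""
    s = rng.choice(len(mu), p=mu)
    logp, ret, disc = np.log(mu[s]), 0.0, 1.0
    traj = []
    for _ in range(T):
        probs = pi(theta, s)
        a = rng.choice(len(probs), p=probs)
        logp += np.log(probs[a])        # policy term: depends on theta
        s2 = rng.choice(P.shape[2], p=P[s, a])
        logp += np.log(P[s, a, s2])     # dynamics term: theta-free
        traj.append((s, a, reward[s, a]))
        ret += disc * reward[s, a]
        disc *= gamma
        s = s2
    return traj, logp, ret
```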
So now consider the gradient of this objective J. You have already seen, from the day-one lecture by Olivia, that using the log trick the gradient is the random trajectory reward times the score function, the gradient of log p_theta. "It didn't come up." Okay, I checked the lecture slides, but maybe I'm wrong. Let me just briefly derive it here; it's one line of proof.

J(pi_theta) is the integral of R(tau) p_theta(tau) d tau, so its gradient with respect to theta is the integral of R(tau) times the gradient of p_theta(tau) d tau; p_theta is the only place where theta appears. Here is where the log trick comes in: the gradient of p_theta equals p_theta times the gradient of log p_theta, because by the chain rule the gradient of log p_theta is the gradient of p_theta divided by p_theta. So the integral becomes the integral of R(tau) times the gradient of log p_theta(tau), weighted by p_theta(tau), which can be written as the expectation, with tau distributed according to p_theta, of R(tau) times the gradient of log p_theta(tau). That is the gradient.

Yes? "How does the total reward R(tau) depend on the parameter theta?" Right, this is just a way of rewriting the objective: the dependence on theta really appears in the probability of the trajectory; that is where theta enters. And if you take the log of p_theta(tau), you get a sum of log mu, log pi_theta terms, and log transition probabilities; the pi_theta terms are the only place theta appears. So the gradient of log p_theta is just the sum over t of the gradients of log pi_theta(a_t | s_t); the other terms do not depend on theta and drop out.

Combining these two observations gives the so-called policy gradient theorem. This was initially introduced by Williams in 1992; it's a very simple result, based essentially on the log trick. The gradient of log pi_theta is usually called the score function, and if pi_theta is differentiable, for example under the softmax parameterization or with Gaussian policies, you can easily compute this score function. Just to give an example: if you consider a log-linear policy of this form and plug it in, you see that the score function has a very simple shape: the feature vector phi(s, a) minus the expectation of the feature vector over the random action a. That is, a centered, or normalized, feature vector.
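Here is a small sketch of this log-linear score function (my own, with toy features assumed), together with a sanity check of a fact used just below: the score has mean zero under the policy.

```python
import numpy as np

def score_log_linear(theta, phi, s, a, n_actions):
    """Score of a log-linear policy:
    grad_theta log pi_theta(a|s) = phi(s,a) - E_{a'~pi_theta}[phi(s,a')]."""
    logits = np.array([theta @ phi(s, b) for b in range(n_actions)])
    p = np.exp(logits - logits.max()); p /= p.sum()
    mean_feat = sum(p[b] * phi(s, b) for b in range(n_actions))
    return phi(s, a) - mean_feat

# Sanity check: E_{a~pi_theta}[score(s, a)] = 0.
rng = np.random.default_rng(0)
d, nA = 4, 3
feats = rng.normal(size=(nA, d))
phi = lambda s, a: feats[a]          # toy features that ignore the state
theta = rng.normal(size=d)
logits = feats @ theta
p = np.exp(logits - logits.max()); p /= p.sum()
avg = sum(p[a] * score_log_linear(theta, phi, 0, a, nA) for a in range(nA))
print(np.allclose(avg, 0))           # True
```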
Based on this, you can also easily see that if you take the expectation of the score function over a, that expectation equals zero. This holds not only for this simple example but in general, for any policy pi_theta. It's very easy to see why, and we'll use this very often later. The expectation over a is the integral over a of the gradient of log pi_theta(a|s), weighted by pi_theta(a|s). By the log trick, this equals the integral of the gradient of pi_theta(a|s) da; you can exchange the gradient and the integral, and the integral over a of the policy, which is a distribution, is one. Taking the gradient of the constant one gives zero. That's why you get this result. And it generalizes: if you take any function b that depends only on the state s, the same is true. You just plug b(s) in; it does not depend on theta, it does not depend on a, hence the expectation is still zero. We will see this used later.

Cool. So once we have the policy gradient theorem, you can easily derive a stochastic gradient estimator. The idea is: you have a policy pi_theta; you roll it out; you get an episode tau, the sequence of states, actions, and rewards. From this sequence you calculate the total reward and the sum of score functions, and this gives you a stochastic estimate of the true policy gradient. This stochastic gradient estimator, the REINFORCE estimator, is unbiased, because it is just a Monte Carlo estimate. However, it can suffer from high variance: there is randomness along the trajectory, and there is also correlation between the two terms, since both depend on the whole sequence. So if you naively use the REINFORCE estimator, it can suffer from high variance. Let's see how we can reduce the variance a bit.

One observation: if you look at these two terms, the policy at time t is independent of everything observed before that time; in particular, it is essentially independent of all the rewards seen before time t, because of the Markov property. This means you can simplify many of these terms and get a much more compact form, and that is another variant of the policy gradient theorem: you can replace the total reward by just the reward-to-go, that is, by the Q function. Let me show you the difference between the previous expression and this one.
In the previous REINFORCE expression, what we had was the total reward R(tau), the cumulative reward from time zero to infinity. In this new formulation we have the cumulative reward from time t onwards. The difference is the sum of rewards from zero to t-1, and those terms are independent of pi_theta(a_t | s_t). So if you take the expectation, it factors into the expectation of those reward terms times the expectation of the score function, because they are independent, and we just showed that the score-function expectation is zero. Hence you can remove all of these zero terms, and that's why you get this simplification. You can find a more detailed proof in the supplementary material.

So this is a simplified version of the policy gradient theorem, and based on this formulation you can derive another way of constructing the policy gradient estimate. At every iteration, you roll out your policy pi_theta and generate an episode tau; you calculate the reward-to-go, which is a Monte Carlo estimate of the Q function at (s_t, a_t); you plug it in, and you get a gradient estimator. This estimator remains unbiased, but it still requires you to compute the whole trajectory. So can we further reduce the variance?

One very common remedy, widely used in practice, is the baseline. Here we denote by b a function of the state s. You can show that your policy gradient is equivalent to this form where you simply subtract the function b(s) from your Q function. The reason is exactly what we discussed earlier: the expectation of b(s) times the gradient of log pi_theta is zero, so you are just adding zeros; it doesn't change the policy gradient.

What is a good choice of the baseline function b(s)? Ideally you want some function, or some estimate, that is positively correlated with Q; if it is, then you can reduce the variance. One typical choice of b is the value function at state s, because the value function is, in a sense, very similar to the state-action value function. If you use the value function as the baseline, what you have is the Q function minus the value function, and this is called the advantage function. The advantage function captures the advantage you gain by taking action a at state s rather than following your policy pi_theta; that is the intuition.

So, to summarize: there are several ways to write the policy gradient. There is the REINFORCE expression; there is the one using the Q value functions; and there is the one using the advantage function, or a baseline function.
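Putting these variants together, here is a hedged sketch (my own, not the lecture's pseudocode) of a Monte Carlo policy-gradient estimate with reward-to-go and an optional baseline; `score(s, a)` is assumed to return the score vector grad_theta log pi_theta(a|s).

```python
import numpy as np

def pg_estimate(episodes, score, gamma=0.99, baseline=None):
    """Policy-gradient estimate from sampled episodes.
    episodes: list of lists of (s, a, r) tuples.
    baseline(s): optional control variate b(s), e.g. a value estimate."""
    grad = 0.0
    for ep in episodes:
        # Reward-to-go, computed backwards: G_t = r_t + gamma * G_{t+1}.
        G, togo = 0.0, []
        for (_, _, r) in reversed(ep):
            G = r + gamma * G
            togo.append(G)
        togo.reverse()
        for t, (s, a, _) in enumerate(ep):
            adv = togo[t] - (baseline(s) if baseline else 0.0)
            # The gamma**t weight follows the exact policy gradient theorem;
            # practical implementations often drop it.
            grad = grad + (gamma ** t) * adv * score(s, a)
    return grad / len(episodes)
```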
Yes? "Does this baseline provably reduce the variance, even when it's just positively correlated?" First of all, it's based on an intuition; you can show that it reduces the variance in certain cases, but not necessarily always. You could always construct a b that explicitly minimizes the variance: compute the variance as a function of b and pick the minimizer, and that will always reduce it. The interesting question is whether this particular choice, the value function, gives you the minimum variance or not. I think there are papers claiming it does, but there are also papers disproving that; it is actually not the minimum-variance estimator. Nevertheless, it is still the most popularly used one. Yes?

Okay, so the question is: is variance always a bad thing in practice? I think it depends. If you care about convergence, if you want your algorithm to converge fast, then variance is a bad thing, because it introduces oscillation into the algorithm. But sometimes more variance can be a good thing, because it can help you avoid getting stuck in local solutions; when you are stuck at a local solution, you do want some noise to get out of it. So I would say it depends on what the goal is.

"Could you also say that the variance in your gradient estimate would, in some cases, for some environments, help exploration?" Could be, yeah. But if the variance is too large, then maybe it's not very helpful. It's a subtle issue, but that's a good question. Any other questions at this point? Yes?

"When you use the baseline as the value function, it's the value function of the same policy. Would there be any benefit to using the value function of a different policy as the baseline?" Right, that's also a very good suggestion; you can come up with different baselines. This one uses exactly the current policy, and there are benefits to that, because you can estimate it directly; you are estimating Q functions anyway. But of course it also makes sense, if you have other available policies or other quantities, to use those as a control variate to reduce the variance.

So far, all these expressions require you to generate the whole trajectory in order to estimate the policy gradient. Now I'm going to introduce an even more compact form of the policy gradient representation. You can rewrite the objective in a simplified way by introducing the discounted state visitation distribution. This is a distribution that characterizes how likely you are to visit state s if you start from some initial distribution mu and follow the policy pi, with the visits weighted by the discount factor.
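One standard way to draw samples from this discounted visitation distribution, mentioned again in a moment, is to truncate a rollout at a random geometric horizon. A sketch under the same tabular assumptions as before:

```python
import numpy as np

def sample_from_visitation(mu, P, pi, theta, gamma, rng):
    """Draw (s, a) from the discounted state visitation distribution d^pi_mu:
    roll the policy out for a random horizon H ~ Geometric(1 - gamma)
    and return the state-action pair reached."""
    H = rng.geometric(1.0 - gamma) - 1        # P(H = k) = (1 - gamma) * gamma^k
    s = rng.choice(len(mu), p=mu)
    for _ in range(H):
        a = rng.choice(P.shape[1], p=pi(theta, s))
        s = rng.choice(P.shape[2], p=P[s, a])
    a = rng.choice(P.shape[1], p=pi(theta, s))
    return s, a
```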
Using this definition, you can significantly simplify what we saw earlier. The expression from before is nothing but the expectation of the Q function times the score function at (s, a), where s follows the discounted state visitation distribution and a follows the policy pi_theta. This is just one line of proof, by invoking the definition.

Once we have this, you can construct a policy gradient estimate where, at every iteration, you just generate a state-action pair following this distribution and use an estimate of the Q function; you don't necessarily have to generate the whole sequence. Of course, how do you sample from the state visitation distribution? There are several ways; one is the random truncation idea above, generating a random horizon that follows a geometric distribution. This type of expression is very helpful, particularly when you try to analyze the behavior of the algorithm. Let me skip this part.

Okay, so to summarize: we have talked about how we estimate the policy gradient. That is basically the policy gradient method: if you cannot compute the policy gradient exactly, you construct one of these stochastic estimates of it. This could be REINFORCE, REINFORCE with baseline, or, more generally, based on what we have seen, any estimator built on an estimate of the Q functions. And we have already learned tons of ways to estimate Q functions, from critic algorithms like temporal difference learning; we can combine these temporal difference learning algorithms with policy gradient estimators, and that corresponds to actor-critic methods. I think you will have a tutorial on actor-critic methods, so I will not go into too much detail here.

For example, if you use just the REINFORCE estimator, this is how policy gradient looks: at every iteration you generate an episode, use this episode to estimate the total return G_t, plug it in, and you get an update of your parameter theta. Alternatively, you can use the idea of temporal difference learning to estimate the Q value functions. In particular, if your state and action spaces are very large, you can introduce function approximation and represent your Q function as Q_w; now you also keep track of the parameter w in order to estimate the Q value functions. So at every iteration you calculate the temporal difference error; based on the temporal difference you update the w parameter, following the TD learning algorithm; and then, for your policy gradient, you leverage this Q value estimate Q_w to estimate the policy gradient and take a policy gradient ascent step on the parameter theta. This is the most standard version of the online actor-critic method.
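A minimal sketch of this online actor-critic update (my own illustration; a SARSA-style linear critic is assumed, with feature map `phi` and score function `score` as before):

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s2, a2,
                      phi, score, alpha_w, alpha_theta, gamma):
    """One online actor-critic update.
    Critic: linear Q_w(s, a) = w^T phi(s, a), updated by TD(0).
    Actor: gradient ascent using Q_w in place of the true Q^pi.
    Two time scales: typically alpha_w >> alpha_theta."""
    q, q2 = w @ phi(s, a), w @ phi(s2, a2)
    delta = r + gamma * q2 - q                      # TD error
    w = w + alpha_w * delta * phi(s, a)             # critic update
    theta = theta + alpha_theta * q * score(s, a)   # actor update
    return theta, w
```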
As you can imagine, there are a lot of things to take care of. First of all: how do you parameterize your Q value function? If you use linear function approximation or neural networks, it may not capture the true Q value function, which means that using Q_w in place of the true Q function introduces some extra bias into the algorithm. Also, in general you want different step sizes for the policy parameter and the value parameter; this two-time-scale scheme turns out to be very crucial in practice if you want to guarantee convergence. On top of that, there are numerous ways to build better estimators for the Q functions or the advantage function, using concepts you have learned such as multi-step returns and eligibility traces, and so on, which I will not go into; I'm sure you will see some of those in the tutorial this afternoon.

Okay, so I also want to introduce another improvement, an algorithm called natural policy gradient. Before that, are there any questions about policy gradient? Yes?

"It seems that policy gradients do not depend on the MDP dynamics, but at the same time we want to know the influence of the policy on the states. So it's kind of weird, isn't it?" It's weird, you're saying, that it does not leverage the dynamics? That's a good observation, and it's what makes this different from many of the value-based methods; value-based methods rely crucially on the dynamics and the Bellman equations. The algorithm itself you can view as applying to essentially any objective under any dynamics, as long as you can estimate the gradient. But the way you estimate the gradient, for instance when you estimate the Q functions, still relies on the dynamics. And later you'll see that policy gradient methods can essentially be viewed as an approximation of the policy iteration algorithm: if you let the step size become very large, the method converges basically to policy iteration. So at a high level, policy gradient is an approximation of policy iteration; it is still doing something related to the dynamics. You will also see that a lot of the convergence analysis of these algorithms relies heavily on the dynamics as well. Good.

So, one of the most commonly used policy gradient methods in practice is the natural policy gradient. The idea was first introduced by Kakade in 2002. The algorithm behaves as follows: at every iteration you update the parameter theta not along the policy gradient itself, but along the natural policy gradient, defined as the inverse, or pseudo-inverse, of the Fisher information matrix times the true gradient. This is very similar to the natural gradient descent algorithm. The Fisher information matrix is defined from your policy pi_theta: you take the score function times the transpose of the score function and take the expectation over the state-action distribution. I denote it F(theta). This F(theta) may not always be invertible; that's why, instead of the inverse, one uses the pseudo-inverse of the matrix F(theta).
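In its simplest form, the update might be estimated like this (my sketch; a small parameter dimension d is assumed, since it forms the Fisher matrix explicitly):

```python
import numpy as np

def npg_direction_naive(grad, scores):
    """Naive natural-gradient direction: estimate the Fisher matrix
    F(theta) = E[score score^T] from sampled score vectors, then apply
    its pseudo-inverse to the (estimated) policy gradient."""
    S = np.stack(scores)              # (n_samples, d) score vectors
    F = S.T @ S / len(S)              # Monte Carlo Fisher estimate
    return np.linalg.pinv(F) @ grad   # O(d^3): only sensible for small d
```

The cubic cost of the pseudo-inverse is exactly the objection raised below, and it motivates the least-squares formulation that follows.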
The high-level idea of natural policy gradient is to leverage some curvature information, instead of just the raw first-order information. There are many benefits. One well-known benefit is a form of invariance: you would like to obtain more or less similar trajectories under different policy parameterizations, that is, to preserve some invariance properties, and this type of algorithm usually gives you those properties.

Another interpretation of natural policy gradient is as iteratively solving a quadratic approximation of the objective. Our original goal is to maximize J(pi_theta), but we want to make sure that at every iteration, when we update the policy, it does not drift too far from the previous policy. So we introduce a trust region, a constraint ensuring that the next iterate stays close to the previous policy in terms of KL divergence. With plain policy gradient, we essentially only ensure that the Euclidean distance between the two parameter vectors theta is not too large; here we require that the policies themselves, induced by the parameters, do not drift too far from each other in KL divergence. You can view natural policy gradient as approximating the objective J by a first-order Taylor expansion, and approximating the KL divergence to second order; that second-order approximation gives exactly the Fisher information matrix as the quadratic term. Doing this yields precisely the natural policy gradient method.

One might then ask: this algorithm requires inverting a matrix, and inverting a matrix can be very expensive; the naive computation cost can be quadratic or cubic in the dimension we're talking about. So why is this an attractive algorithm after all? It turns out you can compute the natural policy gradient direction without ever inverting the matrix F(theta); you don't even have to form the Fisher information matrix at all. Here is one simple result. The natural policy gradient direction, which I denote w*, is by definition the pseudo-inverse of the Fisher information matrix times the true gradient, and this w*
is the exact solution of the following least-squares problem: minimize over w the expected squared difference between the linear function w-transpose-score and the advantage function. If you solve this least-squares problem, its optimal solution is exactly the natural policy gradient direction. Why is that the case? This least-squares objective is convex, so the minimizer w* satisfies the first-order optimality condition: the gradient at w* is zero. The gradient of the objective is 2 times the expectation over s and a (I will omit the distributions) of (w-transpose times the score function, minus the advantage function) times the score function, and this has to equal zero. The term involving w gives exactly the Fisher information matrix, the expectation of score times score-transpose, applied to w; the advantage term gives the true policy gradient, by the policy gradient theorem. Solving this linear equation implies that w is exactly of the claimed form: the pseudo-inverse of the Fisher information matrix times the true gradient.

The implication is that to implement natural policy gradient, all you have to do is solve this least-squares problem, and there are tons of ways to solve least-squares problems; it's a very well-studied problem. You can solve it by stochastic gradient descent, by conjugate gradient, and by many other methods.

A side result of this observation is the following. Denote by A_{w*} the linear function you get by plugging in the optimal w*: A_{w*}(s, a) equals w*-transpose times the score function at (s, a). Then the true policy gradient, just by rearranging terms, equals the expectation of this linear function times the score function. Remember, w* is the pseudo-inverse of F(theta) times the true gradient; moving F(theta) to the other side, the true gradient equals F(theta) times w*, and plugging in the definition of F(theta), the score function times its transpose, and rearranging terms gives exactly this.

Now recall the policy gradient theorem: the true gradient equals the expectation, under the visitation distribution and the policy, of the score function times the advantage function A^{pi_theta}(s, a). What this result says is that you can simply replace the advantage function by this linear function, whose feature mapping is given exactly by the score function of pi_theta. It's a simple result, but quite insightful: when estimating the advantage function, it suffices to use a linear function approximation with these particular features, and this does not introduce any extra bias into the picture.
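So the natural-gradient direction can be obtained from samples without ever forming F(theta). A hedged sketch, assuming you already have score vectors and advantage estimates for sampled (s, a) pairs:

```python
import numpy as np

def npg_direction_lstsq(scores, advantages):
    """Natural-gradient direction as the least-squares fit
    w* = argmin_w E[(w^T score(s, a) - A^pi(s, a))^2],
    estimated from samples (s, a) with advantage targets.
    Avoids forming or inverting the Fisher matrix explicitly."""
    S = np.stack(scores)                        # (n, d) score vectors
    A = np.asarray(advantages)                  # (n,) advantage estimates
    w, *_ = np.linalg.lstsq(S, A, rcond=None)   # or SGD / conjugate gradient
    return w
```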
Let me elaborate a bit more. Usually we approximate the advantage by some linear (or nonlinear) function approximation with prefixed feature vectors phi. This result says that if you choose these particular feature vectors phi to be nothing but the gradient of log pi_theta, you exactly recover the true policy gradient, without introducing any bias. That's why, in practice, if you run actor-critic methods with this particular linear function approximation, it is sufficient for obtaining an unbiased gradient estimator.

Okay, are there any questions at this point? Yes?

"Did we put the objective gradient in the form of the state visitation distribution just to get it into the form of the Fisher matrix, so that we could do all of this?" I think it's more a matter of mathematical convenience. "Okay, but it's not some special mathematical thing that I'm missing?" No, it's mostly to make all these connections easier to see. But you can also use it as a way to construct different gradient estimators. "Thank you."

Okay, and of course, so far this is just the case where we assume you can compute things exactly. Question, please?

"You said we are now linear with respect to w, w*, but I don't see the connection between w* and theta." Yeah, there doesn't need to be any connection between them. w is the parameter you use to estimate your value functions; theta is the parameter of your policy. The w is really something you use only to estimate the value functions, and once you have them, you plug them into the policy gradient theorem to construct the policy gradient. So these are two separate parameters that you keep track of. "I see, thanks."

Yes? "Does using the Fisher information in the objective actually have an impact on the variance? It's unbiased, as you showed, but does it affect the variance, as opposed to using something else?" That's a very good question. It's harder to exactly characterize the comparison between the variances of the two estimators, because in practice, where does the variance come from? It comes from how you estimate, or how you solve for, this w*: you use samples generated from the policy pi_theta to estimate w*. In a way, this gives you easier control of the variance: the more samples you use to solve the problem, the more accurate a solution of w* you get, and you can make sure that in the end the approximate solution for w* does not have too much variance.
So here, I think, natural policy gradient sort of separates the variance and bias issues. In plain policy gradient, if you directly use the samples to estimate the gradient, you face the variance head-on; here, all you care about is the accuracy with which you solve for w*, and hence how accurately you compute the policy gradient. "Thank you."

Yes, there is another question. "Policy gradient methods are usually used for continuous action spaces. So for the mapping phi(s, a): wouldn't the w vector needed to create the unbiased estimator be infinite-dimensional in this case?" No, it does not have to be infinite-dimensional, because this is just a vector in, say, R^d: for any state and action you can construct feature maps lying in dimension d, and w is also in R^d. It does not have to be infinite. "Thank you."

Okay, so to finish, I want to shed some light on the convergence guarantees you can expect from policy gradient and natural policy gradient. The interesting question, on which many researchers have spent a lot of time in the past few years, is to understand when, and whether, policy gradient methods converge to a globally optimal solution. Because, as we mentioned, you are after all solving a non-concave objective, and we know that if you run gradient ascent on a non-concave objective, the best you can generally hope for is convergence to some stationary point, or critical point. So when do these algorithms converge to the truly globally optimal solution? Critical points carry no quality guarantees: they could be saddle points, local minima, or local maxima; they give you no assurance of quality. A critical point is simply where the gradient is very small, and for various reasons you might have vanishing gradients, which does not mean you have converged to a reasonably good policy. So we want to understand whether there are good global optimality guarantees for policy gradient methods. The other question is to understand why natural policy gradient performs better than policy gradient, and what the differences are in their sample and
computational efficiencies.

Let me give you a toy example again, similar to the earlier one but different: two states and two actions. Here is a visualization of the value function at one of the states, under direct parameterization. The x- and y-axes are theta_1 and theta_2, and the vertical axis is the value function under those parameters; you can clearly see the non-concave shape. If you run policy gradient methods on it, one observation for this particular example is that no matter where you initialize, with the same step size, they always converge to the optimal policy. You can also see that the larger the step size (from left to right I'm increasing it from 0.01 to 0.05 to 0.1 to 1), the faster the convergence to the globally optimal policy. And this is not just a one-off observation; you can observe it in many practical applications, even though the problem is non-concave.

So, question one: when does policy gradient converge to the globally optimal policy, and why? What is the intuition? One piece of optimization wisdom discovered in the past few years is that gradient descent, and even stochastic gradient descent, converges to the global optimum for problems with a benign non-convex landscape. That is, the objective is not concave, but it is "concave-like", in the sense that it satisfies, for example, a gradient dominance property: the gap between the value of any given policy pi and that of the optimal policy pi* is bounded by a constant times some power of the gradient norm, where the exponent could be one, two, or anything in between. The idea is that if the suboptimality is dominated by the gradient, then whenever the gradient is small, whenever you are near a critical point, you are also near a globally optimal point. So it is sufficient to find critical points.

And why does natural policy gradient perform better than policy gradient? There has been a lot of theoretical study of the convergence guarantees for natural policy gradient, and one explanation, or observation, is this: performing natural policy gradient in the parameter space is equivalent to performing something called policy mirror descent in the policy space, and policy mirror descent is known to achieve convergence guarantees that are dimension-free. We'll discuss some of these results in the remaining few minutes.

To give a sense of where the gradient dominance property comes from, it is important to introduce a super important lemma. It's called a lemma, but it is maybe one of the most important results in the theory of reinforcement learning: the so-called performance difference lemma. This lemma characterizes the difference between two policies pi and pi'.
The difference between the values of two policies pi and pi' can be written exactly as the expectation of the advantage function over the state visitation distribution. The intuition is that the difference between their objectives is the advantage I gain, at state s, by taking an action from policy pi instead of policy pi'. This is the performance difference lemma.

I say this result is important because you can use it to derive many, many other theorems you have seen in reinforcement learning. For example, you can use it, with only one line of proof, to prove the policy improvement theorem for policy iteration; you can use it to show linear convergence of the policy iteration algorithm; and you can use it to derive the policy gradient theorem as well. You can use it to prove many things. And the proof of the lemma is actually quite simple; let me show you, since we still have a bit of time.

Look at the difference between the two value functions, V^pi minus V^pi'. By the definition of V^pi, V^pi is the cumulative reward you collect. Now add and subtract V^pi'(s_t) for every state s_t along the trajectory; I'm just adding and subtracting the same thing, so the value does not change. I can then simplify: the sum runs from t = 0 to infinity, and at t = 0 the state is s_0 = s, so one term cancels with the leading V^pi', the first term. So I can rewrite the expression in the following form: the remaining terms only count the sum from t = 1 to infinity, and shifting the index by one makes the sum run from t = 0 to infinity again. These two expressions are equivalent. Then, using the tower property, the first term, once you take the expectation over s_{t+1}, gives exactly the Q value function. So now you see this is the difference between the Q value function and the value function of policy pi', which is the advantage function; that gives you the performance difference lemma.

Now rewrite the performance difference lemma a bit: it characterizes the difference between any given policy pi and the optimal policy pi* in this form, namely how likely you are to visit state s and action a when you follow policy pi*, times the advantage of deviating from policy pi there. And recall the policy gradient theorem we discussed earlier; here is just its specialization in the tabular setting. For the direct or softmax parameterization you can write the policy gradient down exactly: in the direct parameterization case it is the discounted visitation distribution d times the Q value functions, and for the softmax parameterization it takes this form here. There is nothing tricky: you just plug this particular policy parameterization into the theorem, and this is what you get.
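Since the lemma is so central, here is a quick numeric check of it on a random MDP (my own sketch): both sides of V^pi(mu) - V^pi'(mu) = 1/(1-gamma) E_{s ~ d^pi_mu, a ~ pi}[A^pi'(s, a)], computed exactly via linear solves.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a]: next-state dist.
r = rng.uniform(size=(nS, nA))
mu = np.ones(nS) / nS
pi  = rng.dirichlet(np.ones(nA), size=nS)       # two random policies
pi2 = rng.dirichlet(np.ones(nA), size=nS)

def evaluate(policy):
    P_pi = np.einsum('sa,sat->st', policy, P)   # state chain under policy
    r_pi = (policy * r).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return V, r + gamma * P @ V                 # V[s], Q[s, a]

V1, _ = evaluate(pi)
V2, Q2 = evaluate(pi2)
A2 = Q2 - V2[:, None]                           # advantage of pi'
P_pi = np.einsum('sa,sat->st', pi, P)
d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
lhs = mu @ (V1 - V2)
rhs = (d[:, None] * pi * A2).sum() / (1 - gamma)
print(np.allclose(lhs, rhs))                    # True
```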
Now, if you compare these two results, the performance difference lemma and the policy gradient theorem, you see that they have a lot of terms in common: both contain a state-visitation distribution, with the policy being pi-star in one and pi in the other, and that mismatch is the dominating term. This suggests that an inequality of the following type holds: if your gradient is small, then these common terms are small, and therefore the gap in the objective is also small; one standard written form of this bound is given below. A question from the audience: should it be convex-like or concave-like? Yes, I should be more precise: since we are always maximizing the reward, it should really be concave-like. We know what a concave shape has to look like, but you can have a shape that is not concave and is still concave-like. Maybe the sketches I am drawing here are not the best examples, but the point is that for such objectives, even though the function is not concave, the critical points are still globally optimal. Another question from the audience: on slide 38 you said that larger step sizes converge faster to the optimal value, but that is a very simple example; in more complex settings there will be stability problems, so how do you tackle that? Do you use a regularizer between different policies, do you take an expectation over different states, or is there another method? Right, as you see, the step size is indeed very important for policy gradient methods: the whole performance essentially depends on the variance of your gradient estimator and on the choice of step size. The result I showed you is for the case with no variance, where we can compute the gradient exactly, and there you see that the larger the step size, the better, because a larger step size brings you closer to policy iteration, and we know policy iteration converges in a few iterations. That is the behavior in that regime. But of course, in practice you often have variance in your policy gradient, since you can only estimate it, and then you cannot use too large a step size; otherwise it introduces a lot of instability into the algorithm's behavior, and you have to, as you said, introduce some regularization techniques, which I will discuss in a moment.
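Before moving on, for reference, here is one standard form of the gradient-dominance bound mentioned above, for the direct parameterization, as in Agarwal et al. (2021); the exact placement of the distribution-mismatch coefficient and the constants vary across references, so treat this as a sketch of the statement rather than the precise version from the slides.

```latex
% Variational gradient dominance, direct parameterization (constants vary):
V^{\pi^\star}(\rho) - V^{\pi}(\rho)
  \;\le\;
  \left\lVert \frac{d^{\pi^\star}_{\rho}}{d^{\pi}_{\mu}} \right\rVert_{\infty}
  \max_{\bar{\pi} \in \Delta(\mathcal{A})^{|\mathcal{S}|}}
  (\bar{\pi} - \pi)^{\top} \nabla_{\pi} V^{\pi}(\mu)
```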
Yes, but let me get back to the high-level idea here: because we can show this type of gradient dominance property, it guarantees, at a high level, that policy gradient methods, although they are dealing with a non-concave objective, are able to converge to a globally optimal solution. In particular, there is a recent result showing that if you consider the tabular policy gradient method with a constant step size, assume you can compute the gradient exactly, and perform a projected gradient update, because the policy has to satisfy simplex constraints, then policy gradient converges to a globally optimal policy at a rate of order one over the square root of t. This result has been improved extensively in the past two years: one can show that with larger step sizes, or with line search, you can actually prove linear convergence of policy gradient methods. The caveat, however, is that these guarantees usually suffer from very large constants, which depend on the size of the state and action spaces, on a distribution mismatch coefficient that can be very large, and on the effective horizon, one over one minus gamma. On the other hand, as I mentioned, natural policy gradient can be viewed as performing mirror descent in the policy space; this is the case when you use the softmax parameterization in the tabular setting, where the update takes a simple multiplicative form, as in the sketch after this discussion. Let me briefly show the result without going into too much detail; you can find the full proofs and analysis in the supplementary material after these slides. With a constant step size, natural policy gradient actually converges much faster than policy gradient: it converges at a one over t rate, without any dependence on the state-action space or on the distribution mismatch coefficient you saw earlier, so it has a very clean convergence guarantee. And these results can be further improved: in some cases you can even expect local quadratic convergence rather than linear convergence. I still have a few minutes, so I want to touch on the issue mentioned earlier: how do we deal with the step size? As I said, the step size is perhaps one of the biggest challenges for policy gradient methods in practice: these methods can be very sensitive to how you choose it. If the step size is too large, you might get a terrible policy at the next iterate, and that policy will induce even worse data, which is not going to be helpful. If the step size is too small, you end up with a very similar policy, and your data is more or less the same, which does not help with efficient exploration. The common remedies are to do line search, or to introduce a trust-region type of regularization to ensure that your policy does not drift too far from the previous iterate, or ideas like clipping, which is used extensively in, say, proximal policy optimization algorithms.
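Returning for a moment to the mirror-descent view mentioned above: here is a minimal sketch of the tabular natural policy gradient update, written as a multiplicative-weights step with exact Q-values. It reuses nS, nA, and the value_and_q helper from the earlier toy-MDP sketch, and the step size eta = 1.0 is an arbitrary choice.

```python
# Tabular NPG as policy mirror descent:
#   pi_{t+1}(a|s)  propto  pi_t(a|s) * exp(eta * Q^{pi_t}(s,a)).
# Reuses nS, nA, and value_and_q() from the earlier toy-MDP sketch.
import numpy as np

def npg_step(pi, Q, eta):
    # Using A instead of Q yields the same policy: the state-dependent
    # V term cancels in the per-state normalization.
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

pi = np.ones((nS, nA)) / nA                       # uniform initialization
for _ in range(100):
    _, Q = value_and_q(pi)                        # exact policy evaluation
    pi = npg_step(pi, Q, eta=1.0)
```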
There are also, of course, many other challenges, like how you can efficiently leverage the data from previous policies. This is a big difference between policy gradient methods and stochastic gradient descent for training supervised learning tasks: in supervised learning you can always reuse your data, because the data are often i.i.d., so you can reuse them and run stochastic gradient descent more efficiently. Here, the data generated by different policies follow different distributions, so it is hard to reuse them directly; you need additional techniques like importance sampling or experience replay to reuse your data efficiently and improve sample efficiency. We also mentioned the high variance of stochastic gradient estimators, and the remedy of combining policy gradients with temporal difference learning algorithms, which are known to have low variance; as I said, you will probably see more of this in the tutorial this afternoon. And, as I said, one way to deal with the step-size sensitivity is to consider trust-region ideas; this is called trust region policy optimization, TRPO. The idea is to approximate your objective with a surrogate, such that your next iterate does not drift too far from the previous iterate in terms of the KL divergence. You can view this essentially as a linear approximation of the original objective: remember that from the performance difference lemma you can rewrite J(pi) as J(pi_t) plus the expected advantage term. We do not know how to estimate or optimize that term exactly, because the expectation runs over the visitation distribution of the new policy, so you replace that distribution by d^{pi_theta_t}_mu, the visitation distribution of the previous iterate, and then you optimize this surrogate objective. That is the main idea of TRPO, and it has of course become one of the most popular benchmarks among policy gradient methods. Then there is proximal policy optimization, PPO, which you have probably heard of in many places. The idea here is to introduce a clipped objective so that you do not have to deal with the constraints of TRPO, and you do not have to deal with the step-size choice; the algorithm is much more robust to different step sizes, and you just solve a single objective. Again, you want to make sure that the next iterate does not drift too far from the previous one, so you clip the objective to penalize it whenever the ratio between the two policies becomes large, and then you solve it iteratively with stochastic gradient methods. This is PPO, and as some of you know, it is one of the workhorses behind ChatGPT as well, and it has shown tremendous numerical success in practice. The clipped objective itself is simple enough to write down, as in the sketch below.
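For concreteness, here is a minimal sketch of that clipped surrogate objective in the standard form from the PPO paper (Schulman et al., 2017), written with plain NumPy; in practice it is computed on minibatches inside an autodiff framework, and eps = 0.2 is the commonly cited default.

```python
# PPO clipped surrogate:  L = E[min(r * A, clip(r, 1-eps, 1+eps) * A)],
# where r = pi_new(a|s) / pi_old(a|s) is the importance ratio.
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    ratio = np.exp(logp_new - logp_old)              # pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # keep ratio near 1
    return np.minimum(ratio * adv, clipped * adv).mean()
```

The min with the clipped term removes the incentive to push the policy ratio outside the [1-eps, 1+eps] band, which is what makes the method robust to the step-size choice.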
So I think it is about time to wrap up, and here is one summary slide I would like to borrow from Schulman. As you can see, a lot of modifications happen when you move from pure vanilla policy gradient methods to these more advanced policy optimization algorithms like TRPO and PPO. For policy gradient methods and natural policy gradient, we have a relatively good theoretical understanding of how these algorithms perform: what their convergence guarantees are, how you can integrate them with function approximation, with stochastic gradient estimation, and with sample-based approaches, and how to analyze their sample efficiency. In practice, however, people often use the more heuristic approaches like TRPO or PPO, which have less theoretical understanding but show much better performance. So obviously there is a big gap between theory and practice, and I would say there is a lot of room to work in this domain of policy gradient methods. If you are interested in any of these topics, feel free to talk to me afterwards. With that, I will conclude here. Thank you very much for staying with me. Thank you.