All right. So for today's lecture we are going to discuss the last two relevant items in the theory of Markov decision processes and how to optimize decision making. In the first half we will discuss how to put to work all the theoretical results that we obtained in the last couple of lectures, that is, how to construct algorithms that search for optimal strategies in policy space. In the second half we will address the problem of what we can do when the state space or the action space is just too large, so that we cannot apply such a search procedure, or directly the solution of the Bellman equation, simply because the sheer size of the state-action space makes it intractable. There we will discuss how to combine these algorithms with function approximation. So that's the plan for today. But let's go one step at a time.

Let me start by giving you a brief recap of what we have in our hands so far. On one side we have the solution of the Bellman equation. We will use it explicitly tomorrow in the tutorial: in tomorrow's lecture we will see how to explicitly solve the Bellman equation in two examples. One is the traveling salesman problem, in which the problem is explicitly time dependent. The second one will be solving the grid world in one or two instantiations, again by Bellman iteration, that is, by the value iteration algorithm. But now we are focusing on algorithms that take explicitly into account the structure of the policy.

Summarizing our results so far: we have, somewhat painfully but successfully, been able to derive the following expression. Recall that our goal, as always, is to find the policy which maximizes our total gain, that is, the policy which maximizes the expected value of the sum of discounted rewards along a trajectory generated by our Markov decision process. What we showed yesterday is that you can write the gradient of G with respect to any component of the policy pi explicitly: the derivative of G with respect to pi(a|s) is eta(s) times Q(s, a). These eta and Q are, respectively, a vector and a matrix with a very specific interpretation: eta(s) can be read as the discounted occupation time of state s, and Q is the state-action value function.

So let me just recall the basic definitions of these objects. The vector eta was defined as eta = (1 - gamma P)^{-1} rho_0. This rho_0 is the initial distribution over states, that is, where my system is at the initial time; it could be localized on one state, or it could be dispersed across all states. And this capital P here is a matrix whose entries are the transition probabilities from one state to the next given the policy pi. So P explicitly depends on pi, and eta implicitly depends on pi through it. And then what is Q? Q(s, a) is the expected value, under the policy pi, of what we get if we are in state s and pick action a: we will end up in some state s', incurring a reward r(s, a, s'), and from state s' onwards what we will get under the policy is the value V(s') starting from s'. And the value itself can be rewritten in a simple closed form, which I will recall in a moment.
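Before going on, here is a compact restatement of these objects in one place. This is my reading of the board notation, with the convention (consistent with the formulas quoted above and below) that capital P is arranged so that it maps state distributions forward, P_{s's} = sum_a pi(a|s) p(s'|s,a):

```latex
\begin{align*}
G(\pi) &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t,s_{t+1})\Big],
  &P_{s's} &= \sum_{a}\pi(a|s)\,p(s'|s,a),\\
\eta &= (\mathbb{1}-\gamma P)^{-1}\rho_0,
  &R(s) &= \sum_{a,s'}\pi(a|s)\,p(s'|s,a)\,r(s,a,s'),\\
V^{\pi\,\top} &= R^{\top}(\mathbb{1}-\gamma P)^{-1},
  &Q^{\pi}(s,a) &= \sum_{s'}p(s'|s,a)\big[r(s,a,s')+\gamma V^{\pi}(s')\big],\\
\frac{\partial G(\pi)}{\partial \pi(a|s)} &= \eta(s)\,Q^{\pi}(s,a). &&
\end{align*}
```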
The value, written as a row vector, is just V^T = R^T (1 - gamma P)^{-1} — the same capital P as above. And what is R? R is a row vector whose entries are the expected rewards starting from state s: if I'm in state s, I take action a according to my policy, I end up in state s' according to my transition probability, and I collect on average a reward r(s, a, s'). So far, so good. This is just a very short summary of the definitions.

So in practice, let's go back to our example with two states and two actions. In this case the policy space is a square: on this axis we have the probability of taking action 1 given state 1, and here we have the probability of taking action 1 given state 2. These are the only independent variables in the policy. And my G is something that looks like this: these are the level lines of my G function. And the gradient is some vector, so if I take any policy pi here in the middle, this is my gradient of G at that point.

So how does that work? How do I compute this gradient in practice? Well, remember, we start with the things that we know. The known things are the initial distribution over states rho_0, the policy, which is something we choose at any given point in this space of policies, the transition probabilities small p, and the rewards small r. These are parts of the model that we have and assume to be correct. And from these we can construct several things. From pi and p, look at this formula here: a linear combination of pi and p gives me capital P. And a linear combination of pi, p, and r, differently arranged — I combine the three together — gives me my capital R. And then if I combine rho_0 with capital P, I can get eta.

How do I get this in practice? Well, one possibility is that, now that I have my capital P matrix, I could just form this combination and take the inverse. The inverse is relatively expensive as a procedure, so if your matrix is large enough you might resort to other techniques from linear algebra. Whatever you do, the important thing is that this is a linear object; there is nothing nonlinear happening here. You just have to do inverses, or you can use iterative methods, so you can deploy all the linear algebra that you know: take an off-the-shelf solver for linear equations and that will do. The fact that this is a linear equation you realize because it is equivalent to (1 - gamma P) eta = rho_0. So it can be written as a linear equation of the form: some matrix A times eta equals rho_0. You just have to solve a linear problem here.

And the same for V. V is also the solution of a linear problem, which is nothing but the recursion relation, in the sense that V^T (1 - gamma P) = R^T. Again, the same structure, only that now you are multiplying on the left. So this is your unknown, this is the matrix; this matrix you know, and this row vector of entries you know. So it's just an inversion, and you know that the matrix you have to invert is well-behaved. This means that by combining P and R I can get my value function, and then from the value function, by combining it with small r and small p, I get my Q. So these things are already in stock; this object here is also known.
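To make this flow of operations concrete, here is a minimal numerical sketch in Python for the two-state, two-action example. The arrays p, r, rho0 and the specific numbers are made up purely for illustration, and the index layout follows the usual row-stochastic convention, so the closed-form expressions from the board appear here as (transposed) linear solves.

```python
import numpy as np

# Toy model (assumed numbers): two states, two actions.
# p[s, a, s2] = transition probability p(s2 | s, a); r[s, a, s2] = reward.
n_s, n_a, gamma = 2, 2, 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
rho0 = np.array([0.5, 0.5])           # initial distribution over states
pi = np.full((n_s, n_a), 0.5)         # any admissible policy pi(a|s)

def gradient_of_G(pi):
    # Capital P and capital R: linear combinations of pi, p and r.
    P = np.einsum('sa,sat->st', pi, p)            # P[s, s2] under policy pi
    R = np.einsum('sa,sat,sat->s', pi, p, r)      # expected one-step reward
    # eta and V solve linear systems; no explicit inverse is needed.
    eta = np.linalg.solve(np.eye(n_s) - gamma * P.T, rho0)  # discounted occupation
    V = np.linalg.solve(np.eye(n_s) - gamma * P, R)         # policy evaluation
    # Q(s, a) = sum_{s'} p(s'|s, a) [ r(s, a, s') + gamma V(s') ]
    Q = np.einsum('sat,sat->sa', p, r) + gamma * np.einsum('sat,t->sa', p, V)
    grad = eta[:, None] * Q                       # dG/dpi(a|s) = eta(s) Q(s, a)
    return grad, Q, V, eta

grad, Q, V, eta = gradient_of_G(pi)
print(grad)
```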
So the three of them together — small p, small r, and the value — give me the Q. And eta and Q together give me my gradient of G with respect to pi. This is the flow of operations that you have to do: if I give you a policy, you will be able to compute the gradient, given all the other side information, which is important — the initial distribution over states, the transition probabilities small p, and the reward structure small r. And all of this is just linear computation, so you can figure out what the complexity of these operations is. As you can clearly imagine, this complexity grows very, very rapidly with the size of the system that you are considering. But conceptually speaking there is no mystery, no difficult step here.

So now that we have our gradient, the question is: what can we do with it? In this specific case we know that the optimal solution is here, just because I drew it like that; if you try a different combination of rewards and transition probabilities for your two-state, two-action model, you might find different landscapes of your G function. Let me repeat: these are the level lines of G, so G is increasing here, and here is the maximum. This one would be my optimal policy pi star, which according to this sketch would mean: every time you are in state one, pick action one, and every time you are in state two, pick action two, because the probability of picking action one is zero at this point. So the question is: how do we get from any initial policy pi to our optimal policy? How do we search for this maximum of G in policy space? In practice, what we are looking for is some sort of gradient ascent: we want to climb the gradient until we get to the maximum, which we know sits on the boundary.

So what we're going to do now, in the next 20 or 30 minutes, is study different approaches to this problem of optimization when you know a gradient. The things I will describe are in part common to all kinds of optimization problems, and in part specific to this optimization problem of an MDP. So some things will be general, some others more specific, but these are nevertheless very broad concepts and ideas.

Fine. So the first algorithm that we will discuss goes under the name of policy iteration. What is the idea? The idea is very simple, sort of brutal: I rewrite here my gradient, which is this object. Once more, I hand you a policy and you can compute all these components of the gradient: for every pair (a, s) you have a real number, which is the right-hand side here, and this tells you how much your objective function depends on the various components of the policy. Policy iteration does the following: it suggests picking, as an action, the one which maximizes the corresponding component of the gradient. Let's go back to our sketch. You have components for each direction: your component in this direction of the gradient, and your component in this direction. And for each of those, you pick the action which gives you the most. So in this case, for instance, this gradient is telling you that you should move to the right and you should move down. What policy iteration is doing is basically telling you: believe in your current gradient, wherever you are, and go all the way within your policy space in that direction.
OK? So you just compute the gradient and you go full throttle in that direction, which is what is expressed analytically by this expression here: among the many components of your gradient, you pick the largest one. Just to give you a further example: suppose that you are somewhere and you are told that you have to go, say, north-northwest. You have a compass and you say, OK, I have to go north-northwest. What you do instead is go north, because north is the one of your relevant axes which is closest to the direction of the gradient, north-northwest. So basically you ignore the finer variations: you fix on your relevant axes, which are just north, south, east, west, and you say, OK, I'm going to take the direction which is the largest component of my gradient. I hope this makes the idea clearer and not more confusing; if it is not totally clear, just let me know.

So, by the definition of the gradient, if I have to pick the maximum over my a's, this eta does not play any role here, because eta(s) is non-negative and does not depend on a. So this is just the arg max over a of Q(s, a). This choice is also called the greedy optimization. It's greedy because, based on our current knowledge of what we're going to get in the future according to this strategy, we go all the way with it; we do not allow for any other possibility than this being true.

So, graphically speaking, if we go back to our example with two states and two actions, this is my policy space. Now suppose that I have a different form of my G function, so that, for instance, my optimum is here, but locally my gradient points in this direction. It might well have some form like this, so that eventually your maximum is here, but if you start from some point here, maybe your gradient is pointing in that direction. If you apply this algorithm, the first step will send you here. This is what this is telling you: pick as your new policy pi prime, following the gradient like this, the deterministic policy which assigns probability 1 to the action a tilde of s, and probability 0 to any other action. And remember, this is the characteristic function of this condition: it means explicitly that this object is equal to 1 if a is equal to a tilde of s, and 0 otherwise.

So if I do this, my local gradient here would tell me, OK, you have to go here, so my pi prime will be here. And then if I repeat, I evaluate my gradient at this new point; my new gradient will be pointing like this, and if I follow it and repeat the operation, where will I go? I will hop here. So my second guess will be here. And if I compute the gradient here, now the gradient will send me there, and if I follow it, I will end up here, which eventually will be my maximum. So in this sketch, starting from a point pi, I have made three steps to get to the maximum. Because this operation of greedy optimization sends me straight into the corners of my policy space. Remember, in general the policy space is not a square — it's a more complicated object — but it has corners, and these corners correspond to the deterministic policies: those that put a single one on one action and zeros on all the others.
So this idea of policy iteration, in a nutshell, is: compute the gradient, then go to the corner where the largest component of the gradient sends you; then recompute the gradient from that corner, go to the corner you are directed to, and so on and so forth. There are two questions here, of course. First of all, is this ever going to converge? I'm hopping through corners. In general, suppose I have S states and two actions. Then my policy space is a hypercube: policies belong to [0, 1] to the power S, because with just two actions there is a single real variable from 0 to 1 describing the policy in each state — here, for instance, the probability of taking action 1 in state s. So this is an S-dimensional hypercube. My policy improvement algorithm, as it is defined, would start somewhere in the middle of this cube, then jump to one corner, then jump to another. But there are exponentially many of them. So will it converge, eventually, to the optimum? That's the first question. Second: isn't it stupid to just look around generically among the corners? That could take a long time to explore — an exponentially long time. So first we address the first question: we're going to prove that this policy improvement algorithm does indeed converge to the optimal solution, which makes it a sensible option. And then, in a second step, we will address the other question: can we do something smarter than this? Everything clear so far? Any questions?

So let me — this was a very informal description, but now we can describe the algorithm in full. This algorithm is called policy iteration. The pseudo-code is: initialize a certain pi_0; this index 0 is just a counting index, it tells me which iteration I am in. So this is the initial step: I choose one policy, whatever — any admissible policy, anything inside my space of policies. Then I have a loop that goes as follows. From any pi_k, I obtain a value function. How do I do that? Well, let's go up a bit: you remember this flow here — if I have a policy, I can construct all these terms and get my value function. This step of taking a policy and computing its value function is what is called policy evaluation; it's pretty self-explanatory as a term. Then from this value function you can compute the corresponding Q, which is defined, you remember, here: from this line you have the value, and all the other ingredients you knew from the beginning, so you can combine them together and get your Q. And then from your Q, through the greedy choice, you get a new policy pi_{k+1}. What does this arrow mean? In practice it means that pi_{k+1}(a|s) puts probability 1 on the arg max over actions a-bar of Q_k(s, a-bar), and 0 on everything else. So this is exactly what I wrote up here: it is your a tilde, only that now a tilde is computed from Q_k. So you see the flow: policy to value, value to Q function, Q function through the greedy choice to a new policy. Then k+1 goes into k and you close the loop, until some termination condition — for instance, the usual request is that your value function isn't changing much, less than some epsilon which you decide beforehand, or you might look at the relative change in V. So this is the algorithm.
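Here is a minimal sketch of this loop in Python, reusing the toy model (p, r, gamma, n_s, n_a) from the earlier snippet; the tolerance, the iteration cap, and the tie-breaking of the arg max are my own illustrative choices.

```python
def policy_evaluation(pi):
    """Given a policy, solve the linear system for V and assemble Q."""
    P = np.einsum('sa,sat->st', pi, p)
    R = np.einsum('sa,sat,sat->s', pi, p, r)
    V = np.linalg.solve(np.eye(n_s) - gamma * P, R)
    Q = np.einsum('sat,sat->sa', p, r) + gamma * np.einsum('sat,t->sa', p, V)
    return V, Q

def policy_iteration(eps=1e-8, max_iter=100):
    pi = np.full((n_s, n_a), 1.0 / n_a)      # pi_0: any admissible policy
    V_old = np.zeros(n_s)
    for _ in range(max_iter):
        V, Q = policy_evaluation(pi)          # policy evaluation
        a_tilde = Q.argmax(axis=1)            # greedy action in each state
        pi = np.zeros((n_s, n_a))             # policy improvement:
        pi[np.arange(n_s), a_tilde] = 1.0     # probability 1 on a_tilde(s)
        if np.max(np.abs(V - V_old)) < eps:   # stop when V stops changing
            break
        V_old = V
    return pi, V

pi_star, V_star = policy_iteration()
```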
And what we're going to do now, in a few lines, is prove that policy iteration converges to the solution of the Bellman equation. This is not very complicated, so we'll do it right away. The key step is to show that when we perform one step of this loop, we go from one G_k to a new G_{k+1} — this G is my objective function — which means that in my graph, every time I make a jump like this, I'm actually moving to a level line which is higher. So what we're going to show is that this object here is greater than or equal to the previous one; there is an ordering. Actually, the proof goes in two steps. The first step is that the value of the objective function after one iteration is always greater than or equal to the previous one. The second is that if it is equal, then we have the Bellman equation, and that will close the matter. By the way, the first half of each loop is the policy evaluation, and the second half is called policy improvement: the policy iteration algorithm is an alternation of the two. Start with a policy, evaluate it, which means compute the value function; then from the value function, through the greedy assignment, compute a new policy; repeat.

So the proof goes as follows. Remember that for any given policy, my objective function can be written in terms of the value function as the distribution of initial states times the values from those states: G(pi) = sum over s of rho_0(s) V^pi(s). This is just another way of writing the expectation of the sum from t = 0 to infinity of gamma^t times the reward. Now let's concentrate on the value function. The value function for my policy pi is very simply connected to my Q function. This first step is quite straightforward; let's go up a little bit to the definition. This object here is the definition of Q, and if you multiply both sides by the policy and sum over actions, you find back your recursion formula: V^pi(s) = sum over a of pi(a|s) Q^pi(s, a). So this is the connection between the value and the Q function. More directly: what is the value function from state s? It is what you expect to gain starting from that point according to the policy pi, which is nothing but: pick action a according to pi, and then collect what you expect starting from that state-action pair. Quite straightforward.

All right. Now we are going to use the definition of my a tilde. What is a tilde? It is the action which maximizes Q over all possible actions. So by definition, each of the Q^pi(s, a) is smaller than or equal to Q^pi(s, a tilde(s)). This is by definition of a tilde: for any s, for any row s of my Q matrix, I pick the column which has the largest Q. That's the first step. One more step before we go on: notice that this term Q^pi(s, a tilde(s)) does not depend on a, so the sum over a involves only the pi's, which are probabilities and therefore sum to one. So my sum reduces to just Q^pi(s, a tilde(s)). And then I can use the definition of Q^pi, which is a sum over the new states s': Q^pi(s, a tilde(s)) = sum over s' of p(s'|s, a tilde(s)) [ r(s, a tilde(s), s') + gamma V^pi(s') ], where instead of a generic a I now have my a tilde of s. So this means that at my current state I'm picking the greedy action a tilde, and from that point on I'm still sticking to my policy pi. Now look at what we have here.
We have our value at s bounded in terms of the same value function at a new point s', and these are linked by an inequality. So what we're going to do is plug this inequality back into itself: this object here will be smaller than something of the same form, and I can apply the inequality again, iteratively. V^pi(s') is in turn smaller than or equal to some other thing. Let me write it explicitly: iterating the inequality, I get that V^pi(s) is smaller than or equal to the sum over s' of p(s'|s, a tilde(s)) r(s, a tilde(s), s'), plus gamma times another sum inside, now with respect to s'': p(s''|s', a tilde(s')) r(s', a tilde(s'), s''), and so on — I could do this again and again and unroll all these inequalities.

And what do I get? Look at each of these terms. The first term, which comes from this contribution, is the expected reward under the choice of the greedy action. The second term is gamma times the expected reward that I will get at the next step if I again use the greedy action. So with a moment of reflection you realize that, by repeating this unrolling, eventually this is less than or equal to the sum of everything, which is nothing but the expectation of the sum from t = 0 to infinity of gamma^t times the reward you collect if, in state s_t, you pick the action a tilde(s_t) and end up in state s_{t+1}, conditioned on s_0 = s. But this object is the value computed according to the greedy policy — this is V^{pi prime}(s) by definition, because at every step I'm picking a tilde, which is exactly the choice prescribed by the iterated policy pi prime.

So if you look at the final term here and the initial term here, they are connected by a series of inequalities which all go in the same direction. So I can conclude that the value function of the new policy is larger than or equal to the value function of the previous policy, at each step of policy iteration. And this, of course, implies, if I average over the initial distribution, that my G(pi) is smaller than or equal to G(pi prime). So this confirms the not so obvious intuition that every time you make a step according to this greedy strategy, which sends you into the corners, you are actually improving your G: you move up in the level lines at every step.

Final remark: what happens if there is an equality? If G(pi prime) is equal to G(pi), then what do we have? We have that V^{pi prime}(s) is equal to V^{pi}(s). But let's go one step at a time. This common value, by the construction above, is the maximum over a of Q^pi(s, a) — the greedy choice. But since V^pi is equal to V^{pi prime}, this is also equal to the maximum over a of Q^{pi prime}(s, a), which in turn, using the definition of Q, is equal to the max over a of the sum over s' of p(s'|s, a) [ r(s, a, s') + gamma V(s') ]. And now if you look at this: this object being equal to V(s) is just the Bellman equation.
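As a compact restatement of the chain of inequalities just derived (same notation as above):

```latex
\begin{align*}
V^{\pi}(s) &= \sum_{a}\pi(a|s)\,Q^{\pi}(s,a)
  \;\le\; \max_{a} Q^{\pi}(s,a) \;=\; Q^{\pi}\big(s,\tilde a(s)\big)
  \;=\; \sum_{s'} p\big(s'|s,\tilde a(s)\big)\Big[r\big(s,\tilde a(s),s'\big)+\gamma V^{\pi}(s')\Big]\\
 &\le\; \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\,
       r\big(s_t,\tilde a(s_t),s_{t+1}\big)\,\Big|\,s_0=s\Big]
   \;=\; V^{\pi'}(s),
\end{align*}
so that $G(\pi)=\sum_{s}\rho_0(s)\,V^{\pi}(s)\;\le\;\sum_{s}\rho_0(s)\,V^{\pi'}(s)=G(\pi')$,
with equality forcing
$V^{\pi}(s)=\max_a\sum_{s'}p(s'|s,a)\big[r(s,a,s')+\gamma V^{\pi}(s')\big]$,
i.e.\ the Bellman optimality equation.
```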
So this is, I mean, a somewhat lengthy manipulation, nothing particularly deep apart from the bottom line, which is that this process of using the gradient to hop through the corners of the policy space is guaranteed to bring you from one corner to a better corner, and eventually to reach, without exception, the optimal policy. This motivates the use of the policy iteration approach, which you will also find described in the book by Sutton and Barto, but without any reference to gradients — you should realize it is exactly the same thing we are discussing here.

OK, so is this a smart idea after all? Well, there might be something smarter. If you have a gradient, you might want to do, for instance, gradient ascent: rather than jumping between corners, you might look for some way to follow the gradient, which will take you along some sort of trajectory like this — going up the gradient, since this is a maximization problem. So how do we do gradient ascent for this problem? That's what we are going to do after the break.

— Sorry, can I ask a question? — Yes, please. — It's a question about the notation. I didn't understand the difference between a tilde and pi_{k+1}, because in the policy improvement we are using the formula for a tilde. — You're referring to this one? — Yes. Because up there we wrote that a tilde is equal to this formula. — Yes, that's exactly the same thing. There is no difference, except that you repeat this operation at every step, which means that at each step you have a different Q over which to optimize. If you tell me a bit more, I can explain more, but I'm not sure I understand the question. Is it clear, or are you just giving up? — I'll maybe have to watch it again later. — No problem, we can go through it; I think it's helpful for everyone.

So, this choice corresponds to picking, in the Q matrix, the column with the largest entry. Given a row labeled by a state, for each row we pick the column — corresponding to an action — which has the largest entry, and this defines our a tilde of s. So in our Q matrix, which is indexed by states and actions, for every state we run through the values of Q and identify the largest one; for every row, this dot marks the largest entry, and these may coincide or differ from row to row. This defines my a tilde of s: for every s, I have an a tilde of s. So given a policy pi, I compute a Q, and from this Q I compute my a tilde — this is the greedy operation. And I repeat this over different policies: I take one, I use it to compute a new Q function, which gives me a new policy by the greedy choice, and so on and so forth. Is that any clearer? — Yeah, thank you. — Any other questions? OK, then we take a break until ten past ten, and then we resume. Let me share my screen.

So, we have seen one possible way of exploiting the knowledge of the gradient of our objective function in policy space: the policy iteration algorithm, which sends the policy to the corners of policy space and then jumps from one corner to the other according to the direction of the local gradient.
And this is guaranteed to terminate the search in the corner which has the largest value of our objective function, which corresponds to the solution of the Bellman equation. But, as we said, there are other ways of using the gradient, and perhaps the most intuitive and natural one is to use it in a more local way: rather than jumping around in policy space, follow the gradient locally and climb the landscape of the objective function, which is depicted in this picture by this green path. So the basic idea is to move to a different kind of algorithm, inspired by gradient ascent.

Intuitively, how does gradient ascent work? The idea is that, if you have a gradient, you say: I'm going to move from my current policy by some little step, of size beta, in the direction of the gradient. This is a very intuitive idea — take a step — and the length of the step, in the basic version, is proportional to the intensity of the gradient; the proportionality constant is called the step size, beta here. This is a very general approach: it could be policies, but it could be any domain and any real function on it. Two questions arise immediately: first, how do I choose the step size? And second, what should I do if I end up outside of my domain?

Let's address the second question first. Again using the policy space, my square, as an example: if I am considering policy pi here, and my gradient is pointing in this direction, then clearly, if my beta is large enough, or if my pi is close enough to the boundary, my new point pi prime may end up sitting outside the domain. What should I do in this case? What all these kinds of algorithms do is prescribe some way to remap the point back inside the domain, or onto its boundary. So procedures like gradient ascent are usually combined with some projection back into the space of policies. It is almost always necessary to do that, except in some cases: if you design the step smartly enough, you may happen to always stay inside the domain, and we will see an example later on.

Of course, there is an issue here, if you want: the length of the step you take in this process depends on the strength of the gradient, if you keep beta constant. So another option we have is to use a step of constant length. What is the idea? Well, for instance, you could say: let's use the gradient, but in a different way — we normalize the gradient by its length. What is the advantage of that? It is that now your parameter beta can be interpreted as the distance between the previous and the following policy. So, in the first case — let's say this one — what the algorithm does is: it starts from a policy, there is some gradient pointing like this, and it makes a step to a new policy pi prime; at this new point there is a gradient which is, for instance, much larger in magnitude, and the next point will be pi second. So the distance covered in policy space depends on how large the gradient is.
That is the first option. What the other algorithm does is this: it also starts from pi, but it always moves by the same length — all steps have the same length — so it moves more uniformly in the space of policies, which is also a reasonable option. One can perform a lot of analysis on these algorithms and prove whether they converge, at what rate they converge, and which properties must be enforced, typically convexity of G. But this is just an overview, to make you realize that there are actually many options on the table once you have a gradient at hand, and that it also requires a lot of craftsmanship to choose the proper way, depending on the problem and on the space.

— I have a question. How do we know exactly that we are outside the space of policies? — Well, at every step you have to do the check. — But the boundaries of the policy space aren't known to us. — You actually do know them, because of the check you can do at every step: check that all the pi(a|s) are non-negative and that, for every s, the sum over a of pi(a|s) is equal to one. You realize immediately if you are violating these conditions, in which case you have to do some kind of projection. I'm not spending a lot of time on what kind of projection you can do, but there are several, and it is a piece of analysis in itself to define proper projections, because if you define them the wrong way you may get stuck in some bad place in your domain. There are good ways of doing it, but I'm not spending time on this because we will not actually use it. — Thank you.

OK. One thing that you need to realize is that when we reason in terms of the length of the steps we make — the distance between one policy and another — we are using Euclidean distances here. But does it make sense to measure the distance between two probability distributions with the Euclidean distance? It is legitimate, but maybe there is something better we can do here, something more meaningful.

In order to understand what we can do, let's reconsider gradient ascent from a slightly different viewpoint — you might have seen this before; call it gradient ascent revisited. What I'm going to tell you is that you can see one step of gradient ascent as an optimization problem of the following sort: take the maximum over all pi prime of beta times the gradient of G, computed at pi, dotted with (pi prime minus pi), minus one half of the Euclidean norm squared of (pi prime minus pi) — the L2 norm. What am I writing here? Suppose I have a policy pi and I have the gradient of G at pi. The first quantity, the gradient dotted with (pi prime minus pi), is approximately equal to G(pi prime) minus G(pi): it is a local, first-order Taylor expansion of G(pi prime) around pi. So this term inside the bracket means that we are trying to get the largest possible value of G, but we are adding some cost for going too far.
So I'm going to compromise between following the gradient, in order to increase my objective function — the first part here — and the cost for going too far. And the parameter which trades off these two requirements is my beta. If beta is equal to zero, then I don't move: pi prime will be equal to pi, because maximizing minus the squared distance is the same as minimizing the squared distance, which is zero when the two points coincide. On the other hand, if beta goes to infinity, I am ignoring the cost for going too far, and therefore I will almost certainly be pushed far outside my domain. So there is some beta in the middle which does the job of compromising between these two requirements: believing your local gradient, and not trusting it too far.

The important message here is that if we solve this optimization problem for the new pi prime, we are actually doing gradient ascent. Why is that? Because the term in brackets is a quadratic object in pi prime — the first term is linear and this one is quadratic — so we can do the optimization in a very straightforward manner: we just take the derivative of what is inside the bracket with respect to pi prime. So let's introduce an object which we can call the Lagrangian. The dot product and the norm here are just sums over all possible a and s, so explicitly I can write: beta times the sum over a and s of dG/dpi(a|s) times (pi prime(a|s) minus pi(a|s)), minus one half the sum of the squared differences. Of course this quadratic penalty is just one possibility; there are many others. And notice that I have to add something here, because I want to optimize but pi prime must be a probability, so I have to enforce that the sum over a of pi prime(a|s) is equal to one. So I add to my Lagrangian a Lagrange multiplier term which enforces this condition — this is the Lagrange multiplier that imposes the normalization of probabilities. And then I can take the derivatives of my Lagrangian in order to find the maximum.

How does that look? It's very simple. The first term is linear, so its derivative gives me beta times dG/dpi(a|s); the second term is quadratic, which gives minus (pi prime(a|s) minus pi(a|s)); and then I have the Lagrange multiplier which enforces the normalization. Let's ignore the multiplier for a moment. Setting the whole thing to zero, the most important part gives me pi prime = pi + beta grad G(pi) — this is the gradient ascent step — while the job of the multiplier is to enforce normalization: that is the projection. I'm not going through the details; you could also implement, via Lagrange multipliers, the condition that the pi primes must be positive. I'm not doing that, but this is just to tell you that there is a deep connection between gradient ascent with projection and this optimization problem: you can go from one description to the other. Following the gradient also means making some compromise between the gradient itself and some notion of distance.
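As a small illustration, here is one projected gradient-ascent step in Python, reusing gradient_of_G from the earlier snippet. The projection used here — clip negative entries to zero and renormalize each row — is only a crude choice of mine for the sketch, not one of the carefully constructed projections alluded to above.

```python
def projected_gradient_step(pi, beta=0.1):
    """One step pi' = pi + beta * grad G(pi), remapped onto the policy space."""
    grad, _, _, _ = gradient_of_G(pi)                # dG/dpi(a|s) = eta(s) Q(s,a)
    pi_new = pi + beta * grad                        # follow the local gradient
    pi_new = np.clip(pi_new, 0.0, None)              # enforce pi'(a|s) >= 0
    pi_new /= pi_new.sum(axis=1, keepdims=True)      # enforce sum_a pi'(a|s) = 1
    return pi_new

pi_next = projected_gradient_step(pi)
```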
So, why do we care about this connection? Because it suggests different ways in which we can follow the gradient. The basic idea is that we want to replace this object — the Euclidean distance — with some other notion of distance, because the Euclidean distance is probably not the best one for probabilities. This is also a very general idea: when you have to deal with complex spaces, the Euclidean measure is not necessarily the right one. So what is a better choice? We want to repeat the same kind of argument, but now with a different distance. We want to move in policy space according to this: my new policy pi prime is going to be the arg max over pi prime of the same thing, beta times the gradient of G dotted with (pi prime minus pi), but now penalized with some other notion of distance between the two policies. So, do you know of any reasonable definition of distance between probability distributions that you have encountered so far? If I have two probability distributions, what is a way of measuring how far they are from each other? — Exactly. So what we're going to do is repeat this argument using the Kullback-Leibler divergence. If any of you is not familiar with the notion of Kullback-Leibler divergence, this is something you should fix, because it is a very widespread notion in machine learning and in information theory; the level of knowledge you need to follow this kind of calculation is quite basic, and you can find it in any textbook.

Now, we're going to use the KL divergence in a way that is tailored to our problem at hand. Because we don't have a single probability distribution — we have a collection of probability distributions, one for every state — we combine them to make a global distance between two policies, rather than between two single probability distributions. The way I'm going to define this distance between two policies is as the sum over all states, weighted with my occupation distribution — the same eta that we have been using before in the gradient — of the Kullback-Leibler divergence between the policy pi prime at s and the policy pi at s. So it is a linear combination of Kullback-Leibler divergences. Since these eta's are positive, it behaves pretty much like an ordinary Kullback-Leibler divergence. It is not strictly speaking a distance — it is not symmetric, it does not obey all the properties required to be properly called a distance — but it is a divergence, a measure of separation between two policies.

So now let's try to do the same thing we did before with this object. Clearly this quantity is no longer quadratic; it is more complicated. But the bottom line will be that, if we use the Kullback-Leibler divergence, we are still able to do this maximization exactly. Let's see how it works. Remember, as always, that we have to impose the conditions of normalization and positivity on pi prime. In this case as well there will be a Lagrangian, which we can write as beta times the gradient of G dotted with (pi prime minus pi), minus the distance, plus the multiplier terms.
Sorry, I have to put this inside: I have to impose normalization for each s, so there is a set of Lagrange multipliers, one per state, which I also multiply by eta_s for a reason that will become clear in a second. They impose the condition that my policies are normalized to one; these are the constraints. Again, here I should also take care of the positivity condition, but as we will see in a second, in this case it turns out to be enforced automatically — that is a good property of the Kullback-Leibler divergence.

So let's take the derivative of my Lagrangian with respect to the policy pi prime(a|s), and use the explicit expression of the gradient. Let's go one step at a time. From the first term I get beta times dG/dpi(a|s), computed at pi, which by the policy gradient theorem is beta times eta(s) Q(s, a). Then I have to take the derivative, with respect to pi prime(a|s), of the distance term. Since we are differentiating with respect to this particular state s in the policy, the sum over states goes away and we are left with minus eta(s) times the derivative, with respect to pi prime(a|s), of the Kullback-Leibler divergence between pi prime at s and pi at s — I'm going very slowly: when I take the derivative, I kill all the terms except the ones carrying this s. And then, if I take the derivative of the multiplier term, I get minus eta(s) lambda_s. All of this has to be zero. You see that in this step I have artfully introduced these eta's, which are positive, so I can divide them out.

Now, the last step: I have to take the derivative of the Kullback-Leibler divergence. Let's write it explicitly: it is the sum over all possible actions a prime of pi prime(a prime|s) times the logarithm of pi prime(a prime|s) divided by pi(a prime|s). — I think the derivative is with respect to pi prime. — Yes, thank you very much. So this is the term we are considering. Now this is easy, because I just have to differentiate this object, and by a straightforward calculation I get the logarithm of pi prime(a|s) divided by pi(a|s), plus one. And then I can plug this back in. What do I find? I find beta Q(s, a) minus [ log of pi prime(a|s) over pi(a|s), plus one ] minus lambda_s, equal to zero. And here is where the nice properties of the Kullback-Leibler divergence become fully appreciated, because this allows me to write down pi prime explicitly: pi prime(a|s) is proportional to pi(a|s) times e to the beta Q(s, a), and then I can normalize — that is where my Lagrange multiplier comes into the game.

Before we discuss this result, let me make a few remarks. The first remark is that if pi is positive, as it should be, then pi prime is also positive, and pi prime is normalized. So in this case you see that we don't need to make any projection.
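Putting the pieces of this calculation together in one place (same notation as above, with the eta-weighted Kullback-Leibler penalty):

```latex
\begin{align*}
\pi' &= \arg\max_{\pi'}\Big\{\beta\,\nabla G(\pi)\cdot(\pi'-\pi)
        \;-\;\sum_{s}\eta(s)\,\mathrm{KL}\big(\pi'(\cdot|s)\,\big\|\,\pi(\cdot|s)\big)\Big\},\\
0 &= \beta\,\eta(s)\,Q^{\pi}(s,a)
     \;-\;\eta(s)\Big[\log\frac{\pi'(a|s)}{\pi(a|s)}+1\Big]\;-\;\eta(s)\,\lambda_s
 \quad\Longrightarrow\quad
 \pi'(a|s) \;=\; \frac{\pi(a|s)\,e^{\beta Q^{\pi}(s,a)}}
                      {\sum_{a'}\pi(a'|s)\,e^{\beta Q^{\pi}(s,a')}}.
\end{align*}
```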
If we start with a policy which is within the admissible range, the next one will still be inside: there is no risk of going outside the domain. This is a property of the Kullback-Leibler divergence. That was the first remark. Second, I want to recapitulate what we have been doing here: we have been doing a sort of gradient ascent, but with a different notion of distance between points in our space, that is, between policies. As a result, if we start from a policy, we get a new policy obtained by multiplying each entry of the old one by the exponential of beta times the Q function of that policy. Clearly this skews the probability distribution towards the entries which have a larger Q: where my Q matrix is larger, there will be more probability. So this makes sense: from this policy I construct an estimate of what the policy is worth, and then I move to a new policy which puts larger weight on the actions that give the most.

Let's have a look at what happens in the extreme cases. Clearly, if beta equals zero, then pi prime is going to be equal to pi. This is to be expected, because what does beta mean? If you look at the optimization problem here, when beta is zero it means that you pay a cost for moving away from the point without getting any compensation for it. On the contrary, when beta goes to infinity, something interesting happens. You do not get pushed outside the domain — you always stay inside your policy space. So what happens when beta goes to infinity? Look: if beta becomes very large, there are many exponentials competing with each other here, but only the one with the largest entry will win, and the product on the right-hand side concentrates on the arg max over a of Q^pi(s, a). In fact, you can think of this function as a so-called softmax: a way of taking the maximum, but softly. So this is the greedy strategy: in the limit of beta going to infinity, our gradient ascent with Kullback-Leibler divergence becomes policy iteration. In a sense we have forgotten the request to stay close to our initial point, because the weight on the distance is getting very small — the importance of the distance relative to the gradient is one over beta — so we forget about the notion of distance and go full throttle on the gradient, and we recover the original policy iteration that we discussed at the beginning of today's lecture.

If you tune your beta properly, you will be making steps of the right size in the proper direction, and this is a very effective algorithm for policy improvement, because you don't visit all the corners. Let's make a graphical summary of what is happening. This is our policy space; suppose we start with a pi, and this is my gradient of G. If beta is infinity, I just go to this corner, then to that corner, then to that corner — this is what happens when beta is infinity. When beta is zero, I don't move. For intermediate beta, I take a step somewhere in this direction, and then another in this direction; the length of the steps is variable, but I am guaranteed never to go outside my policy space. That is the case I discussed here — this algorithm.
You can clearly see that this can be turned into an algorithm, so let's write it explicitly. It is called mirror ascent — why "mirror" is a complicated story, but this is the name it has in optimization theory — and it works as follows. Initialization: a policy pi_0, as usual. Then you iterate in a loop: from pi_k, compute your Q function Q_k, as we have done several times today; then get your new policy pi_{k+1}(a|s) equal to pi_k(a|s) times the exponential of beta Q_k(s, a), normalized over the actions for each s. And this goes on until, as always, some termination condition: for instance, when the distance between the new policy and the previous one is smaller than some threshold epsilon — this means that your steps are becoming smaller and smaller because you are close to the optimum — or you can use the change in your Q function as a termination criterion, whatever. This is actually an instantiation of a much broader class of algorithms, (online) mirror ascent — or mirror descent, if you want to minimize — which are of enormous importance in convex optimization theory and find an application in this context.

And that is more or less everything I wanted to say about algorithms in policy space. We did not have time to discuss the problems and answers that arise when you mix these algorithms, either value iteration or policy iteration, with function approximation. That will be the subject of the next lecture — not tomorrow, because tomorrow we are going to have a tutorial on value iteration with practical applications. But let me make a short graphical summary of what we have achieved so far. We have, basically, two possible approaches for solving an MDP. The first is value iteration. Graphically, this means that you are working in the space of value functions: abstractly, this is the value for state one, this for state two, this for state three, and then there may be many other axes that cannot be drawn in a picture. You start from a certain V_0 and use the Bellman operator to send it to a V_1, which is sent to a V_2, and this process eventually converges to V star, the solution of the Bellman equation B V star = V star. Tomorrow we will see examples of this procedure of dynamic programming. The second possibility is policy search. Now things happen simultaneously in the space of values and in the space of policies — this picture is symbolic for both, don't take it literally. What is the idea? You start with a policy pi_0 and use it to compute a value V_0. This value allows you to compute a Q_0 function, which, by either algorithm — for instance by the mirror ascent here — sends you to a new pi_1. This pi_1 sends you to a new V_1, and this V_1 gives a Q_1 which sends you to pi_2, and so on and so forth. This going back and forth between policy space and value space will eventually lead you down here, to your optimal policy pi star, which corresponds to an optimal value V star.
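Going back to the mirror ascent loop written out above, here is a minimal sketch of it in Python, again reusing the toy model and the policy_evaluation helper from the earlier snippets; the value of beta, the tolerance, and the stopping rule are my own illustrative choices.

```python
def mirror_ascent(beta=1.0, eps=1e-6, max_iter=1000):
    pi = np.full((n_s, n_a), 1.0 / n_a)               # initialize pi_0
    for _ in range(max_iter):
        _, Q = policy_evaluation(pi)                   # compute Q_k for pi_k
        pi_new = pi * np.exp(beta * Q)                 # pi_{k+1} proportional to pi_k exp(beta Q_k)
        pi_new /= pi_new.sum(axis=1, keepdims=True)    # normalize over actions per state
        if np.max(np.abs(pi_new - pi)) < eps:          # stop when the steps become tiny
            return pi_new
        pi = pi_new
    return pi

pi_opt = mirror_ascent()
```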
So, going back to the picture: back and forth between the two spaces, eventually this procedure stabilizes between the optimal value on one side and the optimal policy on the other. Both kinds of approaches and architectures will be very important when we deal with model-free problems, that is, we will be able to use the same kind of ideas even when we don't know what the transition probabilities and rewards are. That's why I'm insisting a lot on this. And I think that's a good point to stop. Any questions? I suppose you have to digest this. — Yes, can I see what you wrote just before the first example? — Yes, perfect, just one second. Again, if you want to play around with these two approaches in the two-state, two-action model, you can visualize everything clearly, because the value space is two-dimensional and the policy space is two-dimensional, so you can really see these points moving around. It gives, I think, a very neat interpretation. Of course, this kind of exercise could be the basis for your final evaluation, as a sort of exam — this is just a suggestion, there will be many more to come, but it is one possible entry for your final assignment. OK. If there are no further questions, I'll stop sharing and stop recording.