Okay, welcome back. So for today, the plan is to resume the discussion we had in the last class. Since it was apparently too much, too fast, we will go through it again, step by step, with the aim of familiarizing ourselves with the concepts I introduced in the last lecture. So let's start, as always, by re-discussing what we are doing and why we want to do it. First and foremost, we have been solving the problem of decision-making in a Markovian setting: we solve an MDP with the Bellman equation. This is what we did in the previous lectures. We introduced the notion of value function for a policy: in the discounted case, the value function of a policy is the expected value of the sum of discounted rewards along the trajectory, conditioned on the starting state being s, that is V^π(s) = E[Σ_{t≥0} γ^t r_t | s_0 = s]. We then defined the best policy as the one which maximizes this expectation averaged over the initial condition. So we defined G as a sum over s: G(π) = Σ_s ρ0(s) V^π(s), where ρ0 is the initial distribution of states. This means just that we are averaging the term above without conditioning on the initial state, or better, averaging over the distribution of initial states. G implicitly depends on π, and its maximum is attained by the optimal policy, which picks the action maximizing the following quantity, implicitly defined in terms of the optimal value V*, which itself obeys the Bellman equation; as you see here, that is nothing but the maximum of the same argument as the argmax. Then we saw in the previous lectures that this equation can be read as a nonlinear vectorial equation featuring a nonlinear operator, the Bellman operator. This operator is contracting in the L-infinity norm, which gives the equation several nice properties: in particular, there is a unique solution, up to trivial degeneracies in case there are actions which are exactly equivalent in terms of return. The contractivity of B also lets us construct an algorithm, called value iteration: if we start with a guess for our value function, iterating the Bellman operator will eventually converge to the optimal value function, and from the optimal value function, or an approximation thereof, we can compute a policy according to the definition you see up here, one which is arbitrarily close to the optimal policy. So far so good. Then what we wanted to do last time is to solve the same problem, this same equation, but with a different approach. Before, the focus was on the value function: from the very beginning we introduced that object and everything revolved around it. Now we want to take a different path, and this different path starts with the recognition that our target, the goal, the return that we want to optimize, is in itself a function of the policy. So here we want to put the focus on the policy more than on the value function. It will turn out that, while focusing on the policy, the value function emerges naturally from the problem, because of course there is no mystery here.
Even if we don't introduce it explicitly at the beginning, it will pop out, because the Bellman equation is what describes optimality. But nevertheless, we put our focus on the policy. So what is the idea? Well, the idea is that, in abstract terms, there is some set which is the space of policies. All policies live here; a point is a collection of probability distributions over actions given states. Geometrically it is a relatively complicated object, but it has a nice property: it is convex. And our G is a real-valued function that sends any policy to a real value. To fix ideas, let's consider one simple example which I think may help you develop some intuition: a simple two-state, two-action system. From every state there are two possible actions, and from these actions you can, with some probabilities, go to one state or the other. This is a generic two-state, two-action problem. As usual, the squares are actions and the circles are states. I could put numbers on those: this is state one, state two, this is action one, action two. And over these arrows I could put, as usual, the transition probabilities. Let me just write one for you: over this arrow I have the probability of getting into state one given that I started in state two and took action two, p(s1 | s2, a2). And in this process I get a reward which depends on (state, action, new state): this is s, this is a, and this is s′, in the notation we have been using so far. Then you can fill in whatever values you like for this transition problem. So what is the policy space here? The situation is particularly simple because for every state there are just two possible actions. And since probabilities must be non-negative, remember, π(a|s) ≥ 0, and they must be normalized, the policy space for state one, over this axis, and for state two, over this axis, is just a segment from zero to one. For instance, every point here would be the probability of picking action one given that I am in state one. And since π(a2|s1) = 1 − π(a1|s1), I don't need to introduce another object. So the simplex, the space of probability distributions over just two actions, is this segment here, and for the other state it's the same. You can imagine that with three states it would be a cube, and with S states a hypercube; this is just because I have two actions, but it's useful to fix the ideas. So our G, our return, is a function on the square, because to every policy I can associate a numerical value of my goal. It also depends on the initial condition ρ0, but let's just set that aside: for any given initial distribution of states ρ0, I'm focusing on the dependence on the policy, because this is what I want to optimize over. Please stop me if anything is not clear. So I could draw level lines for this problem, something like this: here I have larger values of G, so these are increasing. And this is a real example: if you fill in some probabilities in a very simple situation, you will find a picture like this.
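To make this concrete, here is a minimal sketch in Python of such a two-state, two-action problem. All the numbers are made up for illustration (they are not the values on the board), and G(π) is estimated directly from its definition by truncated Monte Carlo rollouts; the closed form comes later in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                                    # discount factor
rho0 = np.array([0.5, 0.5])                    # initial state distribution

# Hypothetical model: p[s_next, s, a] = p(s'|s,a), r[s, a, s_next] = r(s,a,s')
p = np.array([[[0.8, 0.3], [0.2, 0.9]],
              [[0.2, 0.7], [0.8, 0.1]]])
r = rng.normal(size=(2, 2, 2))

def G_monte_carlo(pi, n_episodes=2000, horizon=200):
    """Estimate G(pi) = E[sum_t gamma^t r_t] by truncated rollouts.
    pi[a, s] is the probability of action a in state s."""
    total = 0.0
    for _ in range(n_episodes):
        s = rng.choice(2, p=rho0)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(2, p=pi[:, s])
            s_next = rng.choice(2, p=p[:, s, a])
            ret += disc * r[s, a, s_next]
            disc *= gamma
            s = s_next
        total += ret
    return total / n_episodes

pi_uniform = np.full((2, 2), 0.5)   # the center of the square: fully random
print(G_monte_carlo(pi_uniform))
```

Scanning a grid of policies with this estimator is exactly how you would trace the level lines of G on the square.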
You can play with this yourself, using any kind of symbolic or numerical tool, to produce graphs of the average return as a function of the policy. How do you compute it exactly, in closed form? That will come in a second. So what is this picture telling you? Well, since I want to find the policy which is optimal, that is, where the return G is maximum: in this case, if the level lines are made like this, here is the point where my function attains its optimum. It's just like a gentle slope going upwards, and in this corner it's maximum. Now, if you try, you will see that it always has this shape, except for the usual trivial degeneracies in which several different actions have the same value. Why is that? You've seen it from the other angle, from the Bellman equation: the optimization always takes the maximum over a single action, so among optimal strategies there is always a deterministic one. This means that these functions generically have their maxima in the corners. Which is, again, a consequence of the fact that this is a problem of optimizing a convex function on a convex set: the convex function is G and the convex set here is the square. Is your intuition clear so far? Any questions about this? So why do we want to do this? Because we want to find algorithms that search for the optimal policy directly in policy space. That's the goal: we would like, in principle, to bypass the notion of value function and work directly in policy space. The bottom line of the whole procedure is this: suppose you start with a guess for your policy, for instance the point in the middle here, which means you're doing everything at random: 50% for either action, independently of the state. This is a state of complete absence of knowledge, a totally random strategy. And you would like to know: how do I go from this point in the middle to my optimum here? The intuition is that I should do some kind of gradient ascent to climb up this function G in the space of policies. If I find algorithms that do that, then I have another way of solving my optimization problem. In the full reinforcement learning setting we will see these different kinds of approaches: algorithms that rely heavily on the value function, algorithms that focus more on the policy, and algorithms whose architecture mixes the two in some way. So it's important to see them from the outset. That's the overarching goal of this. Is the plan clear? Okay, so how do we do it? We are going to break this problem into several steps. First step: we want to write G in another form, as explicitly as possible in terms of π. This takes us to the second step: we take the gradient. We want to find this object: if I am at any point of the square, for any policy, where does the gradient point? Can I compute it? These are the two steps that we will take, slowly, again. Once we have that, two things follow. First, as a byproduct, we get the optimality conditions. What are the optimality conditions?
Well, very loosely speaking, the intuition is this: if there is an extremum of this G, this return, my objective function, inside the square, then the gradient has to be zero at that point. Or, if it sits on the boundary, which we know is true here, then at this point the gradient must be pointing outwards. So we have a geometrical interpretation of optimality in terms of the gradient. And if we use this optimality condition, it will give us back the Bellman equation. In a sense, the first goal is to come full circle with what we did before by this different approach: we re-derive the Bellman optimality equations by focusing on gradients in policy space. The second thing we get is algorithms that work directly in policy space, which are different from value iteration. Value iteration was obsessed with the value function: there is only the value function, and the policy comes at the end. Here we will introduce new algorithms that at every step deal with policies. This is the plan. So now we go back to the notes of the last lecture, one step at a time, in order to clear up everything that was not obvious, working things out with simple examples. Any questions so far? Okay, first step. The first step is to rewrite G in a proper form. I will comment on the same material as last time, adding my remarks where I think it helps. So we start from here, the definition of my G; you should be slightly nauseous by now from the number of times I wrote this on the board. This is, as usual, the cumulative discounted sum of rewards. The first thing we did was rewrite it in terms of explicit averages. So, first, we take the sum over time out, with the discounting factors; that is the first term here, which goes outside, since there is no averaging to be done on it. (Can you see the arrow when I move it on the screen? It's a bit cumbersome to move from one screen to the other, so if I'm referring to something you don't see, please say so.) Then let's look at the things that are random here: we have a state at time t, an action at time t, and a new state s_{t+1}. How on earth did we get there? What has happened that brought us to this particular triplet? First, we started from some initial condition, over which we have to average; that's the first factor. Then, in t steps after the initial condition, we have come from s̄ to s, with a probability that I'm not writing explicitly yet; it's just formal notation for now. Once I am there, I pick an action according to my policy, and this action sends me to a new state s′. In the process I gather a reward which depends on the triplet (s, a, s′). So this expression is just an explicit form of what is in the first line. Everybody good with this? Next, I have to write down explicitly what this blue underlined object is, and now I use the Markov property, because the system is Markovian. In order to be in state s at step t, I have to go through t steps, each governed by a transition probability. So the probability of going from here to there in t steps is obtained by multiplying the single-step probabilities along a path and summing over all possible paths that take me from here to there.
And then I have to trace out all intermediate positions, which is an over-expanded way of saying that the transition probability from one state to another in t steps is the matrix product of the single-step probabilities. This is very basic Markov chain material; is everybody at ease with it? That is what I rewrote here. If I iterate this unfolding over the last step, and repeat it again, it means that this probability, seen as a matrix, is the t-th power of the single-step transition matrix. So my transition probability P is a matrix where columns correspond to the starting state and rows correspond to the arriving state, and P after t steps is the product of my single-step transition matrix with itself t times. And how do I go from state s to state s′ in one step? I pick an action with probability π(a|s), and this sends me, averaging over actions, to the new state s′; that is the matrix I wrote explicitly here, P(s′|s) = Σ_a π(a|s) p(s′|s, a). The fact that I am using matrices is alluding to the fact that I can rewrite G in a much more compact form using matrices and vectors, and that's what I'm about to do here: matrix-vector notation. (After the lecture I will post the revised version of the notes on Slack so you can go through them, and if that's not enough, don't hesitate to ask for one-on-one explanations.) All right. If we want to map everything into vectors and matrices, we have to do something with the rewards. Since this P is a square, state-by-state matrix, we want a vector indexed by states. But the rewards themselves are a function of a triplet (state, action, new state), which is cumbersome: that is not, properly speaking, a matrix, nor a vector. So what we do is replace the small r with another object, the average reward I get if I am in state s: starting in s, I pick my action according to the policy π, I end up in some s′, and I sum, R(s) = Σ_a π(a|s) Σ_{s′} p(s′|s, a) r(s, a, s′), which is what I wrote here below: the expected value of the reward if I am in state s at any given time. Now, if I look back at the expression I had before for G, I realize that this part in yellow, in which I perform the sums over s′ and a, is exactly the R(s) I just introduced. That's pretty useful, because it means I can go from here to here. Let's number the equations, so you can follow: this is equation (1), this is equation (2), and what I'm writing here is that (2) equals (3), where P with upper index t is the product of the P matrix t times. So now that we have it in this form, it's easy to see that this sum is just a sum over the entries of vectors and matrices.
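Continuing the sketch from before, here is how these two objects look in code; the indexing convention (columns = starting state) follows the board.

```python
def policy_matrices(pi, p, r):
    """Policy-averaged transition matrix P[s', s] = sum_a pi(a|s) p(s'|s,a)
    and expected one-step reward R[s] = sum_a pi(a|s) sum_s' p(s'|s,a) r(s,a,s')."""
    n_a, n_s = pi.shape
    P = np.zeros((n_s, n_s))
    R = np.zeros(n_s)
    for s in range(n_s):
        for a in range(n_a):
            for s2 in range(n_s):
                P[s2, s] += pi[a, s] * p[s2, s, a]
                R[s] += pi[a, s] * p[s2, s, a] * r[s, a, s2]
    return P, R

P, R = policy_matrices(pi_uniform, p, r)
print(P.sum(axis=0))   # each column sums to one: P is a stochastic matrix
```

Note that both P and R are linear in the entries of π, a fact we will lean on when we take gradients.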
If R is a column vector, as I wrote here, and this is a row vector, then this is R^T P^t ρ0: that is just how you write the product of matrices and vectors in linear algebra, which is the equation I wrote below here in the middle; I just replaced the two explicit sums. Performing the sums, this is what it becomes. Is it transparent so far? It's a bit lengthy, but it's useful to rewrite things in this form. Why is it useful? Because now we can finally carry the sum over time inside and realize that this object is a geometric series of matrices: a sum of a matrix raised to integer powers. As such, we can rewrite it in terms of the inverse of another matrix, which is the identity minus γP. Why is that? Well, it's the two-line demonstration I put below here, the same way you prove the geometric sum for real numbers: the intermediate terms cancel when you expand. When you do this with real numbers, the condition you have to satisfy is that the quantity you are summing, if it were a real number, must be less than one; otherwise the sum does not converge. Because to do all the manipulations I'm doing here, everything must make sense from the mathematical viewpoint: these objects must converge. So how do I replace the notion of "γP less than one" for a matrix? I have to make a statement about the eigenvalues of this matrix. And here is where the fact that P is a transition probability becomes important. When I say that P is a stochastic matrix, I mean that P_{s′s} is a transition probability. What does that mean? It means Σ_{s′} P_{s′s} = 1: if I start from one state, I must end up somewhere; nothing disappears under my P matrix. And all the entries, whatever s′ and s, must be non-negative: they are probabilities. Any matrix satisfying these conditions is called a stochastic matrix, and stochastic matrices have nice spectral properties. Their eigenvalues can be complex, because the matrix is not necessarily symmetric (typically it isn't). So if you look at the complex plane, where do the eigenvalues of P, my stochastic matrix, sit? They can be complex, but they come in conjugate pairs because the matrix is real: you can have real eigenvalues here, or complex-conjugate pairs. But what is important is that all of them stay inside the ball of radius one. This is a mathematical property of transition matrices, and it is part of the content of the Perron–Frobenius theorem. There is more to that theorem: it also discusses situations where your P matrix is recurrent, non-recurrent, et cetera. But for now we just need this result: all eigenvalues lie within the ball of radius one.
Because this condition, combined with the fact that γ is smaller than one, tells us that the eigenvalues of γP are necessarily strictly inside the ball. And if they are strictly inside the ball, then I − γP is invertible; and if I − γP is invertible, we can manipulate things like we did and explicitly write down this object, (I − γP)^{-1}. It wouldn't be possible otherwise. Again, this is plain linear algebra; perhaps things you are not familiar with, or have forgotten, but nothing serious so far. So we have completed step one: we have rewritten our G in a more compact form, G(π) = R^T (I − γP)^{-1} ρ0. This is my formula (5), and it depends on π. Remember the yellow definition: R has π inside, and it depends on π in a very simple way, because it is linear. So this is good news: we have a piece which is linear in π, and derivatives of it will be easy when we want the gradient. Second, π is also inside my capital P, and P is also linear in π: good news for the second time. What's the bad news? Well, bad but mitigated: there is a nonlinearity, because of the inverse of the matrix. My G(π) is a strongly nonlinear function of π because of this inverse: this object is linear, this object is linear, but there is the matrix inverse in between. If you work it out explicitly with a symbolic manipulator, Mathematica, MATLAB, whatever you like, even just for the two-state, two-action problem, you will see that there are π's in the numerator and in the denominator. So already in the very simple two-state, two-action case you see a nonlinearity which is not obvious. The function still has nice properties, but it is strongly nonlinear. Anyway, we are happy with this, because we know how to take derivatives of inverses, so we can apply the chain rule here, and that's what we are going to do now to derive the gradients. Next step. There are two places where there is a dependence on π, like we said: here and here. So first we take the derivative acting on the R factor, and that one just goes through; I am simply manipulating this. This symbol here means the derivative with respect to some coordinate. If we go back for a second to our example at the bottom (sorry, now comes the usual annoying scrolling): G there is a function of two variables, so my gradient could be, for instance, the derivative with respect to the horizontal coordinate or the vertical one. In order not to overburden the notation, I am not writing with respect to which coordinate I am differentiating at the moment. But the idea is that in this example G would be a function of π(a1|s1), π(a2|s1), π(a1|s2), π(a2|s2): in principle an object living in a four-dimensional space. In fact, as I told you, these components are not independent, because the probabilities must sum to one, so G is really a function of, by choice, π(a1|s1) and π(a1|s2), which are the coordinates that I'm writing here, and with respect to them I'm going to take the gradients. Okay.
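Before the gradient, here is step one in code, continuing the sketch above: the closed form of equation (5), plus a brute-force scan of the policy square, which for generic rewards lands the maximum on a corner, exactly as the level-line picture suggested.

```python
def G_exact(pi):
    """Equation (5): G(pi) = R^T (I - gamma P)^(-1) rho0."""
    P_pi, R_pi = policy_matrices(pi, p, r)
    return R_pi @ np.linalg.solve(np.eye(2) - gamma * P_pi, rho0)

# Scan the square of policies (pi(a1|s1), pi(a1|s2)) in [0,1]^2.
grid = np.linspace(0.0, 1.0, 21)
best = max((G_exact(np.array([[x, y], [1 - x, 1 - y]])), x, y)
           for x in grid for y in grid)
print(best)   # (max G, pi(a1|s1), pi(a1|s2)); the argmax sits at a corner
```

You can also check this against the Monte Carlo estimator from before; the two agree up to sampling noise.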
First differentiate the first object, then the second one; for matrices the product rule is the same as for ordinary products: derivative of one factor times the other, plus the first factor times the derivative of the other, which is what I wrote here. The second line is just a trivial application of the product rule; I haven't really done anything yet. Now I have to treat this term: the derivative of the inverse of a matrix. If it were a real function, you would know how to do it: the derivative of 1/f is −f′/f². (Careful: that is 1/f, the reciprocal, not the inverse function; let's not trip over notation.) Now, the issue is that matrices do not commute in general, so we have to be careful about where things sit when we take inverses, and that is what is done in the few lines below. How do I take the gradient of an inverse? I realize that the inverse is defined by the simple equation M M^{-1} = I: my matrix times its inverse is the identity. If I take the gradient of this equation on both sides, on one side I get zero, because the gradient of the identity is zero; and on the other, by the product rule, (∇M) M^{-1} + M ∇(M^{-1}) = 0. Multiplying by M^{-1} on the left, simple manipulations, I eventually get the formula for the gradient of the inverse, ∇(M^{-1}) = −M^{-1} (∇M) M^{-1}, where you see that all the care is in the fact that the M^{-1} factors must sit on both sides of the gradient. With this I am now equipped with all I need for the second step, which is here below. Let me clean this up a little, because it's getting messy. This is my gradient, which was equation (6); then I applied the product rule, and this is equation (7): this object here I am just copying back, and this one here is the rule we just derived. So I now have the gradient acting on P, which is what I want, because that object is linear, so the derivative in the middle is very easy: it's just −γ ∇P, by linearity. Very good. But the price I pay is the two inverses bracketing my gradient. Now I notice there are two factors on the right-hand side which are exactly the same: my inverse matrix times the initial condition. This object, (I − γP)^{-1} ρ0, I call η. It's a column vector, because ρ0 was a column vector and I multiply it by a square matrix. What it is will come in a second; for the moment this is a definition. The only thing I do is manipulate the equation to arrive at equation (8), regrouping everything and isolating this new η. Remember the signs: the minus here and the minus here combine into a plus, the γ goes outside, and the γ∇P must stay in its place, to the right of the first inverse. So this is basically copy and paste.
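The inverse-gradient identity is easy to sanity-check numerically. Here is a small check with a scalar parameter, so the "gradient" is an ordinary derivative d/dx; the matrix M(x) is just an arbitrary smooth invertible example.

```python
def M(x):                     # any smooth, invertible matrix function of x
    return np.array([[2.0 + x, 0.5], [0.3 * x, 1.5]])

def dM(x):                    # its entrywise derivative dM/dx
    return np.array([[1.0, 0.0], [0.3, 0.0]])

x, eps = 0.7, 1e-6
Minv = np.linalg.inv(M(x))
lhs = (np.linalg.inv(M(x + eps)) - np.linalg.inv(M(x - eps))) / (2 * eps)
rhs = -Minv @ dM(x) @ Minv    # the identity grad(M^-1) = -M^-1 (grad M) M^-1
print(np.max(np.abs(lhs - rhs)))   # ~1e-9: both sides agree
```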
Now, what is η? We just have to work back through our definitions. Take one step back and remember that the inverse is a sum of powers of γP. So this is the definition; and this step here is just unrolling the inverse as the geometric sum again. Then, retracing our steps, we remember that P^t is just the probability of going from the initial state to the state at time t, so I rewrite this object in terms of sums over components. And here there was something missing which has to be corrected, a sum over s̄, so let me take the opportunity to rewrite it in repaired shape: the component s of this column vector is η(s) = Σ_{t=0}^{∞} γ^t Σ_{s̄} P(s_t = s | s_0 = s̄) ρ0(s̄), where the sum over s̄ is the matrix product, and the object inside, you will remember from above, is just the (s, s̄) component of P^t. So now I can interpret this η. Remember the interpretation of discounting: γ is the survival probability at each step (not 1 − γ, I always mix those up): at each step, with probability 1 − γ, the process gets killed. You remember we introduced the augmented Markov process in which discounting corresponds to being sent, with probability 1 − γ, into the coffin state. So what does this sum over t mean? It means that at each time step, if I have survived and I find myself in state s, I add a one, a plus one. That is exactly how you would compute this object by Monte Carlo: run a simulation of your stochastic process, pick a starting state from your initial distribution, let it evolve, and every time it is in state s, mark a one; then kill the process with probability 1 − γ at each step. If you do that, the average count is exactly this η(s). Another way of writing it, if you find it clearer, is as an expectation: η(s) = E[Σ_{t=0}^{∞} γ^t 1{s_t = s}], with the indicator function of being in state s. This is the formula for what I just said in words: every time you step on state s, you mark a one, you have visited the state, and you weight it with γ^t. Which means that in the augmented Markov process, the one with killing, this object is the expected time spent in state s before the process is killed.
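Continuing the sketch, we can compute η both ways for the hypothetical model above: the closed form (I − γP)^{-1} ρ0, and a Monte Carlo run of the augmented process with killing.

```python
# Closed form: eta = (I - gamma P)^(-1) rho0, at the uniform policy.
eta_exact = np.linalg.solve(np.eye(2) - gamma * P, rho0)

# Monte Carlo: count visits to each state until the process is killed.
counts = np.zeros(2)
n_episodes = 20000
for _ in range(n_episodes):
    s = rng.choice(2, p=rho0)
    while True:
        counts[s] += 1.0                  # mark a visit to state s
        if rng.random() > gamma:          # killed with probability 1 - gamma
            break
        a = rng.choice(2, p=pi_uniform[:, s])
        s = rng.choice(2, p=p[:, s, a])

print(eta_exact, counts / n_episodes)     # the two agree up to noise
```

Note that Σ_s η(s) = 1/(1 − γ), the expected lifetime of the killed process.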
So this is not an abstract quantity: it has a very simple interpretation in terms of the Markov process. That was the first remark. Second remark: there is another object here, the one marked in red, which also involves the inverse of my (I − γP) matrix. What we are going to do now is realize that this quantity is just the value function of my policy. So let me first formally define this object, V^T = R^T (I − γP)^{-1}; it's a transpose, so a row vector. And why is this the value function? Here I was just recalling what the value function is; the best way to show it is to start from this equation and multiply both sides, on the right, by (I − γP), so as to remove the inverse from the equation. What do I get? I get R^T = V^T (I − γP), and then I can rewrite this by moving a piece to the other side, so it becomes R^T + γ V^T P = V^T. Very trivial manipulations of the definition. Are you with me? So why do I do this? There's something? Please. It should be on the right side, not on the left, no? It is not a commutative product. Sure, sure, you're totally right, thank you, thank you very much: I said multiply on the right and then I multiplied on the left. There is no commutativity whatsoever; these are vectors and matrices. So let me cancel that, and change this as well: multiply on the right, and this is what it becomes. Okay, now let's write this object in components. This is a row vector times a square matrix, again a row vector, equal to another row vector; writing out the components, we get R(s) + γ Σ_{s′} V(s′) P(s′|s) = V(s). Now, if I substitute the definitions of R and P, the yellow and green objects we defined before, which are very simple and linear in the policy (I'm running short of space here, so let me see if we can do it without writing; if not, I will take some space at the bottom): this quantity turns out to be exactly the recursion relation from the very first lecture, when we introduced Markov decision processes, the equation that connects the value function at a certain state with the value function at the next state. So if you trust me, we can go forward; if you don't trust me, I will write down this further step at the bottom. Any request for an explicit demonstration? This is just to say that this V really is the value function, okay? Okay, no complaints received. Good. Then at this point we have arrived at this equation for the gradient, in which the value function appears, and which I am going to call equation (9).
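In code, continuing the sketch: transposing V^T = R^T (I − γP)^{-1} turns the definition into the linear system (I − γP^T) V = R, and the recursion can be checked directly.

```python
# Value function at the uniform policy: solve (I - gamma P^T) V = R.
V = np.linalg.solve(np.eye(2) - gamma * P.T, R)

# Check the recursion V(s) = R(s) + gamma * sum_s' P(s'|s) V(s').
print(np.max(np.abs(V - (R + gamma * P.T @ V))))   # ~0: it holds
```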
And now we are in a good position, because now we can take the derivatives: these quantities, capital R and capital P, are linear in the policy. So we can take the derivatives, fine. I think we've gone quite a long way, so what I suggest now is that we take a break, maybe ten minutes, and restart at a quarter past ten with the final steps of the derivation; then we will have time to discuss algorithms and wrap up for the day. Any questions so far? Was it the right pace this time? Okay, good. See you in ten minutes. All right, so, last few steps. We are now in a position to take the derivatives with respect to the policy of our two terms, capital R and capital P, which are linear. For the first term, I just reinsert the definition of capital R(s), the expected reward at the current step, which depends linearly on the policy; so it's straightforward, the derivative goes inside the sum, and this is what you get. And similarly below, nothing different: insert the definition of capital P. All right. When we plug everything back in, this is what we get: our equation (10). Just a couple of comments before we move to the explicit calculations. First remark: we started out without caring about the value function; we said we would focus on the policy and do everything in policy space. And nevertheless, here the value function kicks in again. This points to the obvious, trivial fact that it is a very important object in this optimization: even if you don't start with it, it pops out. And second, it is natural, because the value function contains all the relevant information about the future: it encodes what the sum of future rewards will be. If you want to plan, it has to appear somewhere, so it shouldn't be a surprise to see it popping up all over the place. It's useful to take a look at this gradient formula: it contains three different pieces. This part here, η, is basically telling you how much time you spend in state s given your initial condition; it is about how you got to s, so in a sense it is about the past, summing over all trajectories that brought you to state s. This part here, the value, is about everything that will happen in the future from state s onwards, from the viewpoint of rewards. And pinched in the middle is what you are going to do now. So the total variation of the objective function, when I change my decision-making now, is the combination of two contributions: everything that happened before and brought me to the current context, and everything that will happen in the future and will bring me rewards. That is a qualitative interpretation of the gradient formula. In this form, this result is also known as the policy gradient theorem, which you can find derived by another route in Sutton and Barto's book. So if you want a fresh view on this, you can go by a totally different route; in Sutton and Barto it is introduced in connection with rather different algorithms, which we are not seeing now but will see later in the course. Okay, now we do the final step: we take the gradients explicitly, considering the policy components as coordinates.
So, getting down to our example: now we are really going to take derivatives with respect to the horizontal coordinate and the vertical coordinate. In practice we will take derivatives also with respect to the other coordinates, the ones that depend on them, just for ease of notation; it is not necessary, because of course, in this case, the derivative with respect to π(a2|s1) is just minus the derivative with respect to π(a1|s1). But now we treat G as a function of all these components, without caring about the constraints. It doesn't matter; they will be reflected in the gradients. So what I'm going to do is exactly what's written here: turn this abstract derivative operation, this nabla, into an explicit derivative with respect to one policy component. Sorry, can I ask a question? Yes, sure. In this case, the value function depends on the policy? The value function? Yes, of course. And η also depends on the policy. All of this was in fact a way to move around the implicit dependence on the policy and get to explicit derivatives with respect to the policy, because we don't want to be left with gradients of the value function or gradients of η, and we have removed them. Is that clear? Okay, thanks. Sure. So, like I said, in our example, let me expand a little more on that: we are finding out how to compute this vector, my ∇G. As you can imagine, depending on where I am in the square it is going to be a different vector; it depends on the policy, and the way it depends on the policy is through the value function and the occupation time η. All right, the last steps. This part is very easy: if we take the derivative of this with respect to the policy itself, it just selects the indices a and s in the sum. It's a linear function; consider (a, s) as coordinate indices in my space of policies. What I'm saying is that the derivative of π(a|s) with respect to any π(a′|s′) is one if they are equal and zero otherwise, which means I can kill the sums over s and a when I take derivatives with respect to a policy component. To make it extremely clear, let me draw a little box here. Suppose I have a function f = Σ_i α_i x_i, some coefficients times some coordinates. If I take the derivative with respect to x_j, it becomes Σ_i α_i ∂x_i/∂x_j = Σ_i α_i δ_ij, with the Kronecker delta, and therefore it equals α_j. This is all I'm doing here; the x_i are just the policy coordinates. If I do that, I get to equation (11), which deserves to be framed. What it says is this: suppose I am in state s and I change my policy, the probability with which I take action a, a little bit. This results in a change of my objective function given by the right-hand side. And of course, since the left-hand side depends on which state and action I am modifying, so does the right-hand side. In particular, I'm going to introduce this object, which is nothing but the sum over s′ that appears here: I leave η(s) outside, and I consider this object here.
So this object has, again, a very simple interpretation. It's called the state-action value function, or the quality of the state-action pair. In case you've heard of it, this is the same object that appears in Q-learning: if you ever hear the term Q-learning, a famous algorithm in reinforcement learning, this is the Q we're talking about. So what is it? Let's have a look. If we are in a given state s and we select a certain action a, what happens? We end up in some state s′, we get an immediate reward r(s, a, s′), and then, from s′ onwards, our policy collects an expected return V(s′). So this quantity is nothing but what you expect to collect as discounted cumulative reward if you start from a given state and a given action: Q(s, a) = Σ_{s′} p(s′|s, a) [r(s, a, s′) + γ V(s′)]. The difference with the value function is just that the value doesn't condition on the action, only on the state; that's why this one is called the state-action value function. But it's the same kind of animal. As I've written here below, this Q function is exactly equal to that expectation. How do you prove it? You repeat the same kind of tricks I did for the value function: remember what we did above, plug it in, replace, and you see that this object is indeed that expectation value. And its connection with the value function: since Q is what I expect to get starting from state s and taking action a, if I average over my policy, that is, over all the actions I can take, I get my expected return from state s, V(s) = Σ_a π(a|s) Q(s, a). So these objects are intimately connected: you can go from Q to V with this average, and from V to Q with the definition. They are equivalent; only V focuses on the state alone, while Q focuses on state and action, but the content is the same. There will be more on this Q function later. What I want to stress here is that this Q depends implicitly on the policy, because V depends on the policy, so Q does too. Of course: what you will get in the future, starting from a certain state and action, does not depend only on the action you take now; it also depends on the actions you will take in the future, and those are determined by the policy. Okay, so eventually we come up with our equation (12), which tells us: look, there is a very compact expression for this object, ∂G/∂π(a|s) = η(s) Q(s, a). Sorry, can I ask another question? Yeah, please. When we computed the derivative of the G function, shouldn't we take care that it is no longer linear, due to the multiplication between the value function and the policy? So, I'm not sure what you're worried about. G is nonlinear, but we have manipulated it precisely so as to be able to extract its explicit dependence on π. This equation here is exact; it doesn't mean the whole object is linear in the gradient, not at all: it depends nonlinearly on π through V and η. We are just pulling the explicit dependence out in the derivatives. Okay, okay, thanks. Sure, no problem. So in this case, the derivative of G with respect to a single policy component is just a single component of the gradient? Yes; at each point in policy space you have a different gradient, and it points in different directions because it depends on π explicitly, through η and through V. Okay, thanks. Sure.
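Equation (12) can be checked numerically, continuing the sketch: since the four π(a|s) are treated here as free coordinates (no renormalization), a naive finite difference of the closed-form G must reproduce η(s) Q(s, a).

```python
# Q(s,a) = sum_s' p(s'|s,a) [ r(s,a,s') + gamma V(s') ], at the uniform policy.
Q = np.einsum('ksa,sak->sa', p, r) + gamma * np.einsum('ksa,k->sa', p, V)

# Consistency: V(s) = sum_a pi(a|s) Q(s,a).
print(np.max(np.abs(V - np.einsum('as,sa->s', pi_uniform, Q))))

# Equation (12): dG/dpi(a|s) = eta(s) Q(s,a), vs. finite differences.
eps = 1e-6
for s in range(2):
    for a in range(2):
        dpi = pi_uniform.copy()
        dpi[a, s] += eps                         # perturb one free coordinate
        fd = (G_exact(dpi) - G_exact(pi_uniform)) / eps
        print(s, a, fd, eta_exact[s] * Q[s, a])  # the two values match
```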
This is just a manipulation of the gradient into a convenient form, which lets us write it in a way that looks more complicated than before, because along the way we have introduced objects that hide all the complexity, all the complex dependence on π, in their belly. At the end, the gradient depends on two objects which themselves depend on π, but we have definitions for them. So you might think of it as a complicated manipulation, but one with a very straightforward interpretation in terms of the quantities we are using. Okay, now comes the first important part. Now that we have the gradient, the first thing: I promised you that if we look for the optimal points using the gradient, these optimal points coincide with the solutions of the Bellman equation. This is essentially an exercise in consistency: everything we've done so far should agree with what we did by a totally different route. It's a consistency check. The first thing to realize is that we have to use the first-order condition for optimality, which I introduced at the beginning, so let me scroll up for a second to recall it. It's the notion that if you have a function on a convex set, there are two possibilities. Either the maximum is attained strictly inside the domain, in which case it must be a stationary point: the gradient must vanish there. (And if the function has a single extremum, it's a maximum; we don't have multiple maxima or minima here, which is also why there is a double implication between the two sides.) Or the maximum sits on the boundary, and then the gradient must point outwards: the gradient of the function must be directed towards the outside of the domain. This means that if I take any other point inside the domain and form the vector connecting that point to the maximizer, this vector must have a non-negative scalar product with the gradient; that is the definition of a maximum. If this property failed, I would run into a contradiction: there would be some point close to x* where my f is larger than at x*. So this is a way of expressing the condition of being optimal in terms of properties of the gradient. Is it intuitive? I'm not proving it; you can find it in every textbook on analysis, under the name of first-order optimality condition. Now we are going to use this to prove that one particular choice of policy is in fact optimal. And that's where the Q function becomes handy, because we are going to define, or rather assume, that the policy which optimizes my G is the one given by this property of the Q* function at the optimum: remember, here I have to evaluate the gradient at my optimal point, so this object here will be equal to η*(s) Q*(s, a).
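In symbols, the condition being invoked is the following (just a compact restatement of what was said in words, nothing new):

```latex
\[
\pi^{*} \in \arg\max_{\pi \in \Delta} G(\pi)
\quad\Longleftrightarrow\quad
\nabla G(\pi^{*})^{\top}\,(\pi^{*} - \pi) \;\ge\; 0
\quad \text{for all policies } \pi \in \Delta,
\]
% with Delta the (convex) policy set; componentwise, using equation (12):
\[
\sum_{s,a} \eta^{*}(s)\, Q^{*}(s,a)\,\bigl[\pi^{*}(a|s) - \pi(a|s)\bigr] \;\ge\; 0 .
\]
```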
Notice that everything here is very formal, very implicit, because this Q* I don't know yet, and in its belly it has the optimal policy itself. So we are doing a little bit of acrobatics here; it's a highly formal manipulation, but bear with me for a second. We assume that the optimal policy can be read off from Q*. If we knew Q*, and you can think of any Q, in particular Q*, as a matrix with the states here and the actions here, or vice versa, whatever you like, then my suggestion is: look at this matrix, and for every state consider its column and single out the slot with the largest value. That's what I wrote here: I pick a certain column s and I extract the row a with the largest value of Q*, a*(s) = argmax_a Q*(s, a). What is the idea? If Q* expresses what I can get from a state-action pair, then, standing in a state, comparing all the options I have, all possible a's, I should pick the one which gives me the largest Q, and that's what I'm doing here. Okay, so let's define the policy this way. Now I plug this choice of π* into the first-order optimality condition; I'm just substituting. So this is the left-hand side of the optimality condition: I take the gradient, which is this, with a star here too, and here I have this. This π* must be chosen according to my assumption, so I have to pick the particular action which maximizes my Q*. The first sum then collapses: if I select my π* like this, I just pick out of my Q* matrix, for each s, the entry with the best action. And on the right-hand side I have an arbitrary other policy, again with my Q*. Now, since my choice selects the largest entry of each column, by definition this term is larger than or equal to Q*(s, a) for any a: by the definition of a*(s), any other entry in the column must be smaller than or equal to it. And then I combine this with any policy on the right. Operationally, if I multiply both sides by π(a|s) and sum over a (and there should also be a sum over s here), on the left-hand side the sum over the probabilities of the policy gives one, and on the right-hand side I get this thing here. So this all amounts to saying that, by the definition of a* as the best action chosen from Q*, this term is greater than or equal to zero for any choice of π, which is exactly my optimality condition. So what does it mean overall? It means the optimal action must be this one, because this object here is just my Q*. Are you dazed, confused, a little bit dizzy? You look a little confused to me. Okay, the only thing I'm doing here is showing you that if I choose my policy according to this definition, that is, if I have my Q* and for each s I pick the action with the largest Q*, then this policy is optimal, because it satisfies the first-order optimality condition here.
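Here is that check run numerically on the toy model, continuing the sketch above; it is essentially the exercise suggested next. The best corner found by the grid scan is recomputed, the greedy action a*(s) = argmax_a Q*(s, a) is read off, and the first-order condition is tested against random policies.

```python
# Take the best corner policy from the earlier grid scan (for generic
# rewards the optimum is deterministic, i.e. a corner of the square).
x_st, y_st = round(best[1]), round(best[2])
pi_star = np.array([[x_st, y_st], [1 - x_st, 1 - y_st]], dtype=float)

# Recompute eta*, V*, Q* at pi_star.
P_st, R_st = policy_matrices(pi_star, p, r)
eta_st = np.linalg.solve(np.eye(2) - gamma * P_st, rho0)
V_st = np.linalg.solve(np.eye(2) - gamma * P_st.T, R_st)
Q_st = np.einsum('ksa,sak->sa', p, r) + gamma * np.einsum('ksa,k->sa', p, V_st)

print(Q_st.argmax(axis=1))    # a*(s): pi_star puts all its mass here

# First-order condition against a handful of random policies pi.
for _ in range(5):
    pi_rnd = rng.dirichlet(np.ones(2), size=2).T      # pi_rnd[a, s]
    gap = sum(eta_st[s] * Q_st[s, a] * (pi_star[a, s] - pi_rnd[a, s])
              for s in range(2) for a in range(2))
    print(gap >= -1e-9)       # True: the gradient points "outwards"
```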
Okay, so I'm translating the first-order optimality condition into a property of my matrix Q*. Still confused? Then what I suggest is that you do the exercise with the two-state, two-action problem, the one with the square, and compute all these things explicitly, because everything is computable: there are closed expressions you can write down with Mathematica or whatever, and you can look at all these objects, check all these properties explicitly, and write down your Q's, your η's. I think it's a useful exercise. At this level it is very formal, I realize, this last step. But the bottom line of this reasoning, if you just trust me for a second, until you work out by yourself what it all means, is this: this is your optimal action, and this statement is absolutely equivalent to the Bellman equation. It is exactly what the Bellman equation prescribes, which we found by other means; look back at your notes and you'll see this is exactly the best policy prescribed by the Bellman equation. And in fact, if you plug this optimal action into the recursion relation, that is, into the definition of V, you get the Bellman equation itself. Again, strictly speaking, this is not important for practical purposes; it is a check that we have done everything right, because we find once more that the Bellman equation describes optimality. It's just another way of obtaining it, if you wish; you might find it more or less intuitive depending on your background and your familiarity with formal arguments. What matters most is the result about the gradient: this is the important result here, and its most appreciated form is this policy gradient form, because it gives us the possibility of writing algorithms that do optimization in policy space. Now, since we are running late, I'm not going to start that now, otherwise we'd repeat the same mistake as last time, where I rush, you get lost, and I re-explain everything the day after. So we'll postpone the algorithmic part to tomorrow, and let's just redo the summary from the beginning, to see all the points we have cleared up so far. So, professor? Yes, please. Just one observation: this last equation, for the optimal value V*(s), tells me that V* is the maximum over the actions of the optimal Q factor. It tells you that V*(s) is also Q*(s, a*(s)), as it should be; is that what you're saying? Yes. So it is a maximum of an expected value. When we have to implement something, we can compute the expectation with a Monte Carlo simulation, for instance. Okay, now I stop you, because that's very important, but we're going to discuss it later. You are jumping a bit ahead: so far we have perfect knowledge of the transition probabilities and the rewards, so we are not doing any sampling; everything is a problem of planning. When we go to model-free reinforcement learning, in the sense that we do not have access to the transition probabilities, or we lack the computational power to solve the Bellman equation, then we will have to do sampling.
And that is exactly why the remark you made is important: we are not going to do it that way. We are not going to first take Monte Carlo averages and then take the maximum; we will do it differently, and that's a key concept in reinforcement learning: how to not separate but intertwine learning and optimization. Because the way you are describing it, you basically do all the learning first, with Monte Carlo, and then you do the optimization with the max. But you can do these things together, in a much more efficient and less wasteful way. This will come a little later in the course. That's a good point. Any other comments? Okay, so let's see what we did. This was our recap. Remember: whenever something is unclear, go back to this model, fill in whatever transition probabilities and rewards you like, and then, by any technique you like, look at what happens in your policy space, what the gradients look like, what the level lines of G look like. It's very instructive. So what did we do so far? The following: we rewrote G in another form, which allowed us to take the gradient. Are we all on the same page for these first two parts? Good. Then we moved to the exercise which is probably the least intuitive: using the optimality conditions to recover the Bellman optimality equation. A little obscure to some of you, perhaps, but I think you can work it out by yourselves; otherwise, trust me, it is just a sanity check of what we knew already. The most important and interesting thing is that this reasoning in policy space will take us to constructing algorithms in policy space, and there are several of those. This we will be doing tomorrow. Okay, I hope we ironed out some of the wrinkles, if not all. Are there any questions? All right, I'll post the revised notes on Slack for those of you who want to suffer a little longer. Otherwise, I'll see you tomorrow. Bye everybody. Bye. Thank you. Thank you. I will only say one thing: if you can, please put pressure on those who have to post the lessons, because the videos could be useful. I already did, and I'll do it again, sure, of course. Okay, thank you. Thank you very much. They have been uploaded on YouTube, the lectures. Did they? Okay, because I checked, I think yesterday or the day before, with a student. Okay, thanks. Something is moving then, great. See you tomorrow. Goodbye. Bye.