Okay, sounds good. Welcome back. We are now going to take a relatively long ride: I'm going to take a bit of time to re-derive the Bellman equation for the discounted case in a different way, one which from the very beginning takes a completely different path. As I anticipated, this has several advantages, in the sense that it lets us introduce many concepts that we will keep working with in the following lectures. Since many things will be coming in, many interesting ideas, techniques and approaches that are also of interest outside of this specific problem, I urge you to stop me at any time if you can't follow, and I can take a short break in the narration to explain a particular aspect, or refer you to the relevant material. It doesn't matter if we can't close it by the end of today; if needed we can overflow into the next lecture, even though I think we can make it. We'll see.

So, let's start all over again from the beginning. The goal is to find the best policy, which is a collection of probability distributions over actions given states: the best policy in order to maximize the discounted cumulative reward, also called the return. The idea here is that we do not start by introducing the value function, optimizing over the value function, and deriving the Bellman equation that way. We look at the problem from an entirely different angle. What's the angle? Well, this objective, our G function, depends on the policy, of course, because that's what we want to optimize. So why don't we just focus directly on the policy?

Let me elaborate on this. Our G function is something which maps policies into real values. But this is not yet a proper definition of a function: I should specify the domain. So what is the domain of this function? Well, it acts on the following space. What are policies? Policies are probability distributions: there is one probability distribution over actions for each state. So the domain here is, in fact, the simplex over actions, Δ_A (I will explain what it is in a second), and there are |S| copies of it, one per state. This is the space where policies live; policies belong to this set.

Let's be more precise. What is Δ_K? For any integer K, Δ_K is called the simplex: it is the set of real vectors x in R^K whose components are all non-negative (so they live in the positive orthant, as it's called) and sum up to one. These are legitimate probability distributions over a set of K possible values, say K possible events. So Δ_A means that I care about probability distributions over actions, and there are A of them, where A is the cardinality of the set of actions. And then I have one of these distributions for each state, so the domain is the product of all these simplices.

How does a simplex look? For instance, take K = 3, with coordinates x1, x2, x3. I only care about non-negative values, so it lives in the positive orthant, and I have to impose that the sum is one, which defines a plane. The resulting object, this triangle, is Δ_3. Every probability distribution over three possible outcomes sits on this triangle, and this triangle is a simplex. And then I have as many of these as I have states, so in our case, for state one, we have one simplex.
For state two, we have another simplex, and so on up to the simplex for the last state. All of these are triangles, or tetrahedra, or hyper-tetrahedra, or whatever. A point on this first triangle is a policy for state one; this one is a policy for state two; and so on, including possibly points that sit on the vertices of the set, on the boundary. This is allowed. What does it mean to sit on this tip? It means that x1 and x2 are zero and x3 equals one, which means a policy that always chooses action three. If I sit on this vertex I always choose action one, and on this one I always choose action two. Anything in the middle is a random policy which chooses the actions with some fractional probabilities. So far so good.

So our domain for the policies is just this set, the product of all these triangles. The first interesting remark to make is that this is a convex set. Remember the intuitive idea of a convex set: it is a set shaped like a ball, so to speak, in the sense that if you take any two points inside the set and join them by a line, every point on that line stays inside the set. That is the definition of a convex set, for any pair of points. A banana-shaped set, by contrast, is not convex, because if you take points on the tips the line between them goes outside the domain.

Why is our set convex? Well, this is very simple: if I take two policies π1 and π2, and I make a linear combination λπ1 + (1 − λ)π2, which means drawing this line (for any value of λ between zero and one I move along the line between π1 and π2: λ = 1/2 sits in the middle, λ = 1 is at π1, λ = 0 is at π2), then you can clearly see that this linear combination of two probability distributions is still a probability distribution, because it is non-negative and it sums to one. It's straightforward. So this is a convex set, which is good; it's a nice problem, and we will use that in a second.

What we want to do now is, basically, find the maximum of this G function over this space. So again, at a very abstract level: this is my policy space, and everything is mapped onto the real axis, which is where the G function lives; it's a real number. Abstracting away, if this plane here is my space of simplices, the convex set Δ_A to the power |S|, then my function G is some object like this (I'm trying to draw a surface here): from every policy I get a G value, and as I move around in this convex space of all policies, I get different values. You can think of this as a sheet lying over the domain. Are we all on the same page so far? Just to visualize: everything takes place in very high-dimensional abstract spaces, but this is the picture. So what is the optimization problem? Well, we have to find where this sheet of values lying above the domain reaches its maximum.
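Going back to the convexity remark for a moment: as a minimal sketch in Python (the state and action counts, the random seed, and the random policies here are purely illustrative assumptions), one can check numerically that a convex combination of two valid policies is again a point in the same product of simplices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3            # illustrative sizes only

def random_policy(rng, n_states, n_actions):
    """One probability distribution over actions per state (rows sum to 1)."""
    p = rng.random((n_states, n_actions))
    return p / p.sum(axis=1, keepdims=True)

def is_valid_policy(pi, tol=1e-12):
    """Check pi lies in the product of simplices: non-negative rows summing to 1."""
    return np.all(pi >= -tol) and np.allclose(pi.sum(axis=1), 1.0, atol=1e-9)

pi1 = random_policy(rng, n_states, n_actions)
pi2 = random_policy(rng, n_states, n_actions)
for lam in (0.0, 0.25, 0.5, 1.0):
    mix = lam * pi1 + (1 - lam) * pi2  # the convex combination from the lecture
    assert is_valid_policy(mix)        # still non-negative, rows still sum to 1
print("every convex combination of two policies is again a policy")
```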
So there is a point where I have my G*, my maximum, which corresponds in this horribly drawn graph to some point inside the domain where I have my optimal policy. In very abstract terms, this is a problem of optimization of a relatively complicated function over a convex set, and we now want to take on this problem directly.

The first thing we need is to express the condition for a function to be at its optimum over a convex set. This is given by the so-called first-order optimality conditions. What do these conditions say? Consider again a convex set, which we call D, and a function f mapping a point x of D to f(x) on the real axis; think of a function defined on a ball. Then, if x* belongs to the set of maxima of f -- so if x* is a point where f is maximal, and it could be either inside or on the boundary, it doesn't matter -- this is equivalent to asking that the gradient of f at x*, times (x* − x), is non-negative for every x in D.

Where does this condition come from? Let's look at the two cases. If x* belongs to the interior of the domain, so it's not on the boundary, then as usual the maximum is a stationary point, and the gradient must be zero. But here I have an inequality, greater than or equal to zero, and it comes from the fact that sometimes the maximum lies on the boundary and you want to account for that as well.

In that case, what is the geometric intuition? Suppose your x* is on the boundary, so this is where your function f is largest. If you look close to this boundary, you will see the level lines of f, and f must be increasing in this direction, which means the gradient must point somewhere outside the domain. Then, if I take any point x in the domain and the vector that connects x to x*, which is this one, it also points outwards. The fact that the domain is convex makes this condition equivalent: if x* is on the boundary, this inner product must be positive, or zero in degenerate cases, but it cannot be negative. You get a contradiction: if you assume this quantity is negative for some x, then x* cannot be a maximum point. So this is basically a small extension of the usual notion of a stationary point, and it works when x* is on the boundary of the domain as well. Fine; we will use this at the very end, so just keep it in store.

The idea, then, is that this motivates us to compute the gradient. So what we are now going to do is find a way to calculate the gradient of G, that is, the derivative of G: how G varies depending on how we move in this policy space. Pictorially, the gradient of G is a vector which points like this -- sorry, it's a bit blurred, let me draw it again. At any point where there is a policy π, there is a G value and there is a gradient in this space. If we can compute the gradient of G, we can check where this first-order optimality condition is met. But we have to compute this gradient first.
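Before moving on to the gradient, here is a tiny numerical illustration of the first-order condition (the function and the interval are my own toy example, chosen only to make the boundary case visible): f(x) = −(x − 2)^2 has its unconstrained maximum at x = 2, so over the convex set D = [0, 1] the maximizer sits on the boundary at x* = 1, where the derivative is not zero but still satisfies f'(x*)(x* − x) ≥ 0 for every x in D.

```python
import numpy as np

f = lambda x: -(x - 2.0) ** 2          # toy concave function, unconstrained max at x = 2
grad_f = lambda x: -2.0 * (x - 2.0)    # its derivative

D = np.linspace(0.0, 1.0, 101)         # convex set D = [0, 1], discretized
x_star = D[np.argmax(f(D))]            # constrained maximizer: the boundary point x* = 1

assert grad_f(x_star) > 0                            # gradient nonzero, points outside D
assert np.all(grad_f(x_star) * (x_star - D) >= 0)    # first-order condition holds on D
print(f"x* = {x_star:.2f}, f'(x*) = {grad_f(x_star):.2f}, condition holds on all of D")
```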
And that's what we're going to do now. Gradient with respect to what? With respect to all possible components of the policy. It's a relatively complex object, but let's go one step at a time. The first step is that we want to rewrite G in a simpler way. We have to manipulate our G a little bit -- remember, it is this expectation value -- in order to make our goal of taking derivatives of G easier.

How does that work? G is, once again, this expectation, and of course we have to make the expectation explicit in terms of probabilities in order to do any analytical operation or derivation. So what is it? We can rewrite it as a sum over t from zero to infinity of γ^t times the expectation of the reward at step t, which means summing over all possible values taken by the variables s, a, s', each with its own probability. If I am in state s, I pick action a with probability π(a|s); I then land in a new state s' with probability P(s'|s,a), and there will be the reward r(s,a,s'). What is left is to specify my probability of being in state s at time t. And what is this probability? Let me write it explicitly: it is the probability that S_t = s, given that at the initial time S_0 was some state s̄, weighted by the probability ρ0(s̄) of the point I started from, and summed over s̄.

A short recap of this single step. This average is a sum of many terms over time. My process starts from an initial state s̄ with probability ρ0(s̄). Then in t steps it goes to a state s with a probability which I am not writing out yet -- we will unroll it in a second. Once I am in state s, I pick action a with probability π(a|s), I end up in state s' with probability P(s'|s,a), and I receive the reward r(s,a,s'). So this is just the explicit way of writing the expectation. Good.

Next step: what is this probability of being in state s after t steps? Here is where the Markov property of the system comes into play, because the probability of being in some state after t steps is the product of the probabilities of making single steps. This is the Markov property: if you get there in t steps, you have to make t single steps to get there. Explicitly, what is the probability that S_t = s given S_0 = s̄? Let's go back one step at a time. It equals the probability of being in s given that at the previous time I was in some other state, say s̃, times the probability that S_{t−1} = s̃ given that at the initial time I was in s̄, summed over all possible intermediate states s̃. So this is the Markov property: I reach s through an intermediate state s̃ at the previous time, and then there is the probability of everything that happened before that. I could go on like this, unrolling the next step, and so on and so forth.

But what I want to do here is simpler: at this point it is useful to move to a description in terms of matrices. So let me define a matrix P_t with indices s and s̄ -- these are the rows, these are the columns, and it is a square matrix -- whose entry (s, s̄) is exactly this probability, by definition.
The recursion written above then tells me: well, this is again a matrix, and this one is the same matrix but for t − 1. So it is, in fact, [P_t]_{s s̄} = Σ_{s̃} P_{s s̃} [P_{t−1}]_{s̃ s̄}, with a sum over s̃. Then you realize that this is just a product of matrices: I'm taking rows times columns, the usual matrix product. In short, this tells me that the matrix P_t is just the one-step matrix P raised to the power t: P times P times P, t times. So what I'm doing here is simply refreshing your memory about how to describe Markov chains with matrices, for our specific purpose.

What is this P_{s' s} matrix explicitly? It is the one-step probability, the probability of going from state s to state s'. In our case it is P_{s' s} = Σ_a P(s'|s,a) π(a|s), a sum over the policy. You must remember that this capital P matrix we are dealing with, this green one, always depends on π. When we take derivatives, we have to remember which things depend on π and which do not. For instance, in this sum, ρ0 does not depend on π, because it is the initial condition that we choose; it doesn't depend on the decisions we will make afterwards. But these objects do depend on π, and of course π itself is there. So when we take the derivative, we take derivatives with respect to this dependence and this dependence, but not these others.

Now, another little step in definitions. It's useful to consider the following: when we take the sum here, the sum over s' and a only affects these terms, the yellow ones. So it is useful to define another object, called R(s), defined as R(s) = Σ_{s',a} π(a|s) P(s'|s,a) r(s,a,s'). What is it? It is the expected reward from state s: if I am in state s and I use policy π, this is what I expect to collect in the next step. I take action a according to π, I end up in state s' according to P, and I collect the reward r(s,a,s'). Remember that this object, the one appearing here in yellow, orange, whatever, also depends on π; you have to remember this when you take derivatives.

Then we can collect all this information from the formula and rewrite G in a different way. With this notation, I am allowed to rewrite G as follows: there is a sum over time from zero to infinity, there is a γ^t (which I had sort of forgotten, sorry, and should put here as well), there is the sum over s of R(s), the yellow part I just defined, and then there is the transition probability, which I'm going to put in blue -- the same one that is here, and here -- which is the product of these pieces. So what I have is G = Σ_t γ^t Σ_{s, s̄} R(s) [P^t]_{s s̄} ρ0(s̄), with the sum over s̄ as well. For now I am just fumbling with the definitions in order to rearrange this object.
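In code, the two π-dependent ingredients just defined look like this (a sketch on a small random tabular MDP; the sizes, the seed, and the array convention P[s, a, s'] are assumptions made only for illustration): the policy-averaged transition matrix P_{s' s} = Σ_a π(a|s) P(s'|s,a) and the expected one-step reward R(s) = Σ_{s',a} π(a|s) P(s'|s,a) r(s,a,s').

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 5, 3                                  # illustrative sizes

# P[s, a, s'] = probability of landing in s' from s under action a
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA, nS))                   # r[s, a, s'] = reward of that transition

pi = rng.random((nS, nA))                      # pi[s, a] = probability of action a in state s
pi /= pi.sum(axis=1, keepdims=True)

# Policy-averaged transition matrix, with the lecture's convention P_pi[s', s]
P_pi = np.einsum("sa,sap->ps", pi, P)          # P_pi[s', s] = sum_a pi(a|s) P(s'|s,a)
assert np.allclose(P_pi.sum(axis=0), 1.0)      # each column sums to 1 (column-stochastic)

# Expected one-step reward R(s) under policy pi
R = np.einsum("sa,sap,sap->s", pi, P, r)       # R(s) = sum_{a,s'} pi(a|s) P(s'|s,a) r(s,a,s')
print(P_pi.shape, R.shape)
```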
This becomes even friendlier from the viewpoint of calculation, because again, this is a matrix, and these can be seen as vectors. We can introduce R as a column vector, and ρ0 as a column vector as well. Then I can write G = Σ_{t=0}^∞ γ^t R^T P^t ρ0. This is already quite compact: the transpose of a column vector is a row vector, P^t is a square matrix, ρ0 is a column vector, and all of these live over the space of states, in R^{|S|}. You don't need to follow every step I say in real time; just make sure at this stage that if you sit down you can work back through all the steps, and that nothing seems obscure or totally absurd to you. On the other hand, if that is the case, you must absolutely stop me. Very good.

We are almost done here, and we can even go a little step further, because there is an interesting object here, so let me do one more small manipulation. You see that this object in brackets looks like something familiar: it is a geometric sum. So I can formally sum this series, and it equals (I − γP)^{-1}, where I is the identity matrix: the whole thing is the inverse of this matrix, times ρ0.

Let me open a parenthesis here and take a quick detour to prove this, because it's very simple. I'm a little embarrassed to write "proof" for such a simple thing, so I will just write it and that's it. Take (I − γP) times this sum, that is, (I − γP)(I + γP + γ^2 P^2 + ...). When we distribute the product, we get I − γP + γP − γ^2 P^2 + γ^2 P^2 − ..., and you see that everything cancels out, so this gives I, where I is again the identity matrix. It's the usual trick for geometric sums of real numbers, but you can do it for matrices as well.

What is the important point? The important point is that this series must converge. If it converges, then you can write it like this; or, which is the same statement, the matrix I − γP must be invertible. Here I just quote a result from linear algebra. P is stochastic -- what does it mean that P is stochastic? It is a matrix which describes probability transitions and is constructed like one: its entries are non-negative, and the elements of each column sum up to one, which is conservation of probability. That's the definition of a stochastic matrix. For stochastic matrices, the spectral radius, which is a fancy name for the modulus of the largest eigenvalue of P, is smaller than or equal to one. This is a small part of a bigger theorem called the Perron-Frobenius theorem; it's linear algebra. And this is enough for us, because then the eigenvalues of γP cannot be larger in modulus than γ|λ_max|, which is in turn smaller than or equal to γ, which is in turn smaller than one. Therefore this guarantees that I − γP is invertible. So again, many details.
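As a quick numerical sanity check of the matrix geometric series (the matrix, its size, and the discount are arbitrary illustrative choices): the truncated sum Σ_{t=0}^{T} γ^t P^t approaches (I − γP)^{-1} for a column-stochastic P and γ < 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 6, 0.9                              # illustrative size and discount

P = rng.random((n, n))
P /= P.sum(axis=0, keepdims=True)              # column-stochastic: each column sums to 1

direct = np.linalg.inv(np.eye(n) - gamma * P)  # (I - gamma P)^(-1)

# Truncated geometric series sum_{t=0}^{T} gamma^t P^t
series, term = np.zeros((n, n)), np.eye(n)
for _ in range(500):
    series += term
    term = gamma * P @ term

print(np.max(np.abs(series - direct)))         # tiny: the series has converged to the inverse
```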
The basic idea is that with standard tools from linear algebra you can prove that this matrix can be inverted. So this expression finally makes perfect sense, and I summarize it here -- sorry, there's no gradient yet, and I'm running out of time. We can write G as a simple object: G = R^T (I − γP)^{-1} ρ0. Remember the definitions: the yellow one, R, was defined here, and the green one, P, was defined above.

Now this puts us in a position to take gradients, so now we're going to take gradients. For the moment, for the sake of simplicity, I'm just going to write ∇G. Taking a gradient means taking derivatives in all possible directions of the space. My ∇ here means that I'm choosing one particular direction; it's formal, I don't want to overload the notation with the gradient with respect to coordinate ij or whatever -- for the moment it's just a derivative along some direction in space. If we go back to our graphical description, we are taking two neighbouring points, say π and π + dπ, a small increment in policy, and we are looking at the derivative along that direction. We will specify the actual direction later. Are you okay with this notation?

So, we have to take the derivative of the right-hand side with respect to the policy, and the things I highlighted in orange and in green depend on π. Let's take the derivative and treat these matrices as they should be treated, just like functions of real numbers, using the product rule. First I differentiate the first factor: this gives ∇R^T (I − γP)^{-1} ρ0 -- note the transpose; I didn't always write it explicitly, but it's better to mark it down, because these transposes are all over the place. That is the first part of the derivative, and then there is the second part, R^T ∇[(I − γP)^{-1}] ρ0. Again, I'm going very, very slowly. And no derivative is taken with respect to ρ0, because it doesn't depend on π, of course. The first part we have to live with for a while; that's just what it is. But we can work on the second part.

So, what is the gradient of the inverse of a matrix? Here we have to pay attention, because we cannot blindly use the usual rules of differentiation: matrices need not commute. So let's be careful; let me highlight this in pink. In general, what is ∇(M^{-1})? There is a very easy way to derive the expression. You just have to realize that M M^{-1} = I, the identity matrix, by definition of the inverse. Then we take the gradient of both sides: (∇M) M^{-1} + M ∇(M^{-1}) = 0, a matrix of all zeros on the right-hand side. Then we multiply everything by M^{-1} on the left, which gives M^{-1} (∇M) M^{-1} + ∇(M^{-1}) = 0, and therefore ∇(M^{-1}) = −M^{-1} (∇M) M^{-1}: the gradient of the inverse is minus the inverse, times the gradient of the matrix itself, times the inverse.

So let's move forward. We can go back and ask what happens to this derivative here. This doesn't look nice, so it's better if I erase it and report it below. And now we know how to take this gradient.
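And here is a small finite-difference check of the identity ∇(M^{-1}) = −M^{-1} (∇M) M^{-1}, perturbing M along a random direction (matrix size, perturbation direction, and step size ε are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 4, 1e-6

M = rng.random((n, n)) + n * np.eye(n)         # well-conditioned invertible matrix
dM = rng.random((n, n))                        # an arbitrary direction of perturbation

# Directional derivative of M^(-1) by finite differences ...
numeric = (np.linalg.inv(M + eps * dM) - np.linalg.inv(M)) / eps
# ... versus the closed-form identity from the lecture
analytic = -np.linalg.inv(M) @ dM @ np.linalg.inv(M)

print(np.max(np.abs(numeric - analytic)))      # small (of the order of eps): identity checks out
```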
So I'm writing that ∇G is going to be the first part, plus the second part, which is minus -- with my matrix M being I − γP, as above -- R^T (I − γP)^{-1} ∇(I − γP) (I − γP)^{-1} ρ0. Now, ∇(I − γP) is simple, because it is just −γ ∇P. Therefore I can simplify further and write ∇G = ∇R^T (I − γP)^{-1} ρ0 + γ R^T (I − γP)^{-1} ∇P (I − γP)^{-1} ρ0.

Notice that the two terms share the same rightmost part: these two are the same object, (I − γP)^{-1} ρ0. Let's call it η; it's another vector, which I introduce now just to write this in a more compact form. So this becomes ∇G = [∇R^T + γ R^T (I − γP)^{-1} ∇P] η -- sorry, the transpose has to be here. At this level I'm just doing manipulations, rewriting things. Very good. Now we are closing in on the calculation of this gradient; let's not hurry.

Two remarks. Remark number one: η, what is it? Well, it is defined as the inverse of this matrix times ρ0, which, if I unfold these objects again using the transition probabilities -- remember that ρ0 is the initial distribution and P^t is the probability of being in a certain state after t steps -- leads me to the following expression for the component s: η(s) = Σ_{t=0}^∞ γ^t Σ_{s̄} [P^t]_{s s̄} ρ0(s̄), where [P^t]_{s s̄} is my matrix P to the power t and ρ0(s̄) is the initial probability. I have to apologise, I'm going fast here. The point is that this quantity η(s) is in fact nothing but the time spent in state s: it is the probability of being in state s before the process dies. Remember, γ is the survival probability, so γ^t is the probability of having survived for t steps, and after t steps you are in state s with this probability. So this is exactly the probability of being in state s after t steps and still being alive, summed over t. This is just to give an interpretation of this abstract quantity I introduced, to give it a meaning: the time spent in a given state.

It also has the consequence that η(s), if the process is nice enough, is strictly positive. It is non-negative because the probabilities are non-negative, γ is positive, everything on the right-hand side is non-negative. And it is not exactly zero, because that would mean there are states which are never visited, which we assume is never the case. Okay, let's not spend too much time on the details. That was the first remark.

The second remark, which is very important: when we look at this term here in red, let's call it capital V. The important thing is that this is the value: this is the same object that we introduced at the beginning of our lectures. The entry s of this vector is the value of state s. Why is that? Again, it's pretty simple; you just have to do the same kind of unfolding I did here.
Writing it explicitly, V(s) is Σ_{t=0}^∞ γ^t times what R^T contributes, that is, Σ_{s'} R(s') times the probability that S_t = s' given that S_0 = s. But this is exactly the definition of the value function: it is exactly the expectation of what I collect when I start from s. So this is E[Σ_{t=0}^∞ γ^t R(S_t) | S_0 = s]. And remember that, by the definition up there, R(S_t) is the expected value of r(S_t, A_t, S_{t+1}) conditioned on S_t, so this is also equal to E[Σ_t γ^t r(S_t, A_t, S_{t+1}) | S_0 = s]. None of this is by chance: we are essentially heading toward the same final result, so it's not surprising to see the same objects popping up.

When we collect all of this, our almost-final step says that ∇G = [∇R^T + γ V^T ∇P] η. It's a rather compact expression. We're almost done; we just have to compute the remaining objects, these gradients. What is [∇R]_s? Well, I use the definition, R(s) = Σ_{s',a} π(a|s) P(s'|s,a) r(s,a,s'), and of course the only thing I have to differentiate is the policy, so this is just Σ_{s',a} P(s'|s,a) r(s,a,s') ∇π(a|s). It's a linear function of the policy, so it's easy to differentiate. And the same thing here: the object ∇P, which has two indices because P is a matrix, was defined as a sum over all possible transitions, so [∇P]_{s' s} = Σ_a P(s'|s,a) ∇π(a|s). When we make the product explicit, we end up writing the gradient in the following form: ∇G = Σ_s η(s) Σ_{s',a} P(s'|s,a) [r(s,a,s') + γ V(s')] ∇π(a|s).

This is the route to a very neat result, in that we have been able to express the gradient of G in terms of derivatives with respect to the policy itself. Everything in the middle still depends on π -- V depends on π, η depends on π -- but the expression only involves the gradient of π in this simple way. Now we just have five minutes to conclude, and we are going to show that the first-order optimality condition I wrote earlier is satisfied by the choice given by the Bellman solution. What is the connection with the Bellman equation? The connection is given by the term which appears here: this object in front of the gradient of π is the one which appears in the Bellman equation. That's where the two ends meet.

So, let's do the final step, which goes as follows. Now it's time to take the derivative along some specific direction. Here I specialize, and say that this object becomes the partial derivative of G with respect to the policy at a certain state and a certain action, ∂G/∂π(a|s). Let's roll back to our geometric picture from the beginning: this is the space where my independent variables live, and now I'm focusing on one particular state and one particular action, taking the derivative in that direction. More explicitly, if you wish, you have to think of G as a function of all the entries of the policy: π(1|1), π(2|1), up to π of the last available action in state one, then π(1|2), π(2|2), and so on, all the way to the last state. It depends on all these entries, and we are going to take the derivative with respect to one of these entries, any one of them.
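To tie together the pieces derived so far, here is a sketch (again on a random tabular MDP; all sizes, seeds, and the choice of a uniform ρ0 are illustrative assumptions) that evaluates G(π) = R^T (I − γP)^{-1} ρ0 in closed form and compares the gradient formula ∇G with respect to each entry π(a|s), namely η(s) Σ_{s'} P(s'|s,a)[r(s,a,s') + γ V(s')], against a finite-difference derivative of G, treating each entry π(a|s) as a free variable.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 4, 3, 0.9                           # illustrative sizes and discount

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P[s, a, s']
r = rng.random((nS, nA, nS))                                      # r[s, a, s']
rho0 = np.full(nS, 1.0 / nS)                                      # uniform initial distribution

def G(pi):
    """G(pi) = R^T (I - gamma P_pi)^(-1) rho0, with pi[s, a] treated as free variables."""
    P_pi = np.einsum("sa,sap->ps", pi, P)           # P_pi[s', s] = sum_a pi(a|s) P(s'|s,a)
    R = np.einsum("sa,sap,sap->s", pi, P, r)        # R(s)
    return R @ np.linalg.inv(np.eye(nS) - gamma * P_pi) @ rho0

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Analytic gradient: dG/dpi(a|s) = eta(s) * sum_{s'} P(s'|s,a) [r(s,a,s') + gamma V(s')]
P_pi = np.einsum("sa,sap->ps", pi, P)
R = np.einsum("sa,sap,sap->s", pi, P, r)
inv = np.linalg.inv(np.eye(nS) - gamma * P_pi)
eta = inv @ rho0                                    # discounted state-occupancy vector eta(s)
V = inv.T @ R                                       # value function V(s) = [R^T (I - gamma P)^(-1)]_s
grad = eta[:, None] * np.einsum("sap,sap->sa", P, r + gamma * V[None, None, :])

# Finite-difference check, entry by entry
eps, fd = 1e-6, np.zeros_like(pi)
for s in range(nS):
    for a in range(nA):
        pert = pi.copy(); pert[s, a] += eps
        fd[s, a] = (G(pert) - G(pi)) / eps
print(np.max(np.abs(grad - fd)))                    # small (finite-difference error): they agree
```

The two gradients agree up to the accuracy of the finite-difference step, which is the kind of consistency check one would do before trusting the formula in an algorithm.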
If we do that, what we realize is that this derivative just selects the entries a and s (it gives an identity in the space of actions and states), so it becomes ∂G/∂π(a|s) = η(s) Σ_{s'} P(s'|s,a) [r(s,a,s') + γ V(s')], with η(s) multiplying outside because it doesn't depend on s'.

Then the very final step: it's useful to define this object and give it a name, Q(s,a), which is the state-action value function, also called the quality. Looking at this formula you can realize that Q(s,a) is just the expected value E[Σ_t γ^t r(S_t, A_t, S_{t+1})], but now conditioned on the fact that S_0 = s and A_0 = a. It's the value function, but now you also condition on the choice of a. And it's easy to check that the connection between the value function and the state-action value function is simply that if you average over the policy you recover the usual value function: V(s) = Σ_a π(a|s) Q(s,a). This object comes up naturally in this gradient-taking procedure.

So the gradient is just ∂G/∂π(a|s) = η(s) Q(s,a), where η(s) is a positive object. Given this very simple expression, remember that the first-order condition for optimality is that the gradient of G at π*, taken as a vector, times (π* − π), must be greater than or equal to zero for every policy π. Then the statement is: if you take π* to be the policy that, in every state, puts all its probability on argmax_a Q*(s,a) -- so you pick from Q* the action with the largest value -- this condition is satisfied.

Why? Because if you plug this in and take the product, what you end up with is Σ_{s,a} ∂G/∂π(a|s), evaluated at optimality, so at π*, times [π*(a|s) − π(a|s)]. This object here is η*(s) Q*(s,a); I replace it by this expression, then I use the definition of my optimal policy and take the sum. Since I am selecting the best action, the first sum becomes Σ_s η*(s) Q*(s, a*(s)), where a*(s) is the best action: for every state I pick the action which, in this matrix of Q-values, gives me the largest amount. And then I have minus Σ_s η*(s) Σ_a π(a|s) Q*(s,a). Now, clearly, since Q*(s, a*(s)) is the maximum, there is no possible way that, for any π, the second term is larger than the first, because a*(s) is the one that maximizes column by column in this matrix. So we conclude that the whole expression must be non-negative, and the optimality condition holds.

What does it mean? Well, remember how everything is defined. It means that my optimal policy at state s is π*(s) = argmax_a Q*(s,a), where Q* by definition is Q*(s,a) = Σ_{s'} P(s'|s,a) [r(s,a,s') + γ V*(s')]. And V* is given by its own definition, this recursion with Q*: V*(s) = Σ_a π*(a|s) Q*(s,a). So if you plug that back in, you obtain V*(s) = max_a Σ_{s'} P(s'|s,a) [r(s,a,s') + γ V*(s')], and this is Bellman's equation.

Okay. Sorry, I overflowed on time. That was a lot of information, of course. The bottom line is that if you follow these simple steps of algebra, you can get to Bellman's equation by this very different pathway.
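To close the loop, here is a minimal sketch (a random tabular MDP again, with arbitrary sizes and discount) that solves the Bellman optimality equation by value iteration, extracts the greedy policy π*(s) = argmax_a Q*(s,a), and checks that V*(s) = max_a Q*(s,a) holds at the fixed point.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma = 4, 3, 0.9                           # illustrative sizes and discount

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P[s, a, s']
r = rng.random((nS, nA, nS))                                      # r[s, a, s']

# Value iteration: V <- max_a sum_{s'} P(s'|s,a) [r(s,a,s') + gamma V(s')]
V = np.zeros(nS)
for _ in range(2000):
    Q = np.einsum("sap,sap->sa", P, r + gamma * V[None, None, :])  # Q(s, a)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

pi_star = Q.argmax(axis=1)                          # greedy (deterministic) optimal policy
assert np.allclose(V, Q.max(axis=1))                # V* satisfies Bellman's equation
print("optimal action per state:", pi_star)
```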
And we will use this policy-gradient result elsewhere in the following lectures to construct effective algorithms in practice. Okay, with that, if you still have some energy left for questions, I'm available. Yes, please.

I'm sorry, it was almost impossible, I think, to follow. I wasn't really able to follow what you were saying and I hardly managed to write anything down, to be honest. It's very hard to follow like this.

All of this will be available to watch again on the videos, and I can make a recap next lecture to highlight the basic steps.

Yeah, maybe if we could have the steps beforehand for these very computational -- well, not computational, more mathematical -- demonstrations, it would be much easier to follow, because like this, at least I wasn't able to really follow, and I think there are lots of errors in the notes that I tried to copy. It's very hard to follow.

I know, this was basically the most mathematically dense part of the course, so we can relax a bit in what follows, but I understand your concern; there was a lot going on here. I'm going to get back to this next lecture and revise all the steps, and meanwhile, as soon as it is made available on the channel, you can go through the steps yourself. I cannot promise anything, but if I have time I can write down some notes in LaTeX, which might be more readable.

Can I ask a question about one passage: why did we lose the transpose on V? At a certain point we wrote the gradient of R and the gradient of P, and when we wrote the gradient of G again, we wrote V(s') but we lost the transpose.

Well, because that's just a component: it's just the entry of the vector. It doesn't matter whether you think of it as a row or a column; it's just a real number. I could have written it as a component of V^T, but that doesn't change anything.

I just wanted to ask if it's possible for you to put the notes you take during the lectures on Slack so we can have them.

I can put them on Slack, sure. -- Thank you very much.

Sorry, can I ask a question? When using this gradient technique, can we possibly reach just a local optimum, or do we always find the global one?

Yes, that's a good question. In general, these techniques only allow you to reach a local maximum. But, as you know from the previous lecture, since the Bellman operator is contracting, there is a unique optimum. One could go one step further and compute the Hessian of G to study its definiteness, but I think you're overwhelmed, so I'm not going to go through that. In general you are perfectly right; in this particular case the problem is convex, and that's what makes the solution unique. It is both things playing together, the convex domain and the convexity of the objective: when you have both, the solution is unique, because if you have a convex objective on a non-convex domain, you might have multiple maxima. Here, instead, we have a convex objective on a convex domain. I discussed the convexity of the domain when I said that on these simplices you can move along a line from any point to any other.
And as for the convexity of the G function itself as a function of the policy: well, we didn't prove it here, but it was implicitly a consequence of the fixed-point result that we derived yesterday.

We have another question: if we take the gradient of G, do we do this numerically or analytically?

Well, with this expression here in front of you, you can compute it numerically as follows. You have a policy; you compute your value function for that policy using the recursion, so you can compute that object by recursion for the given policy. The gradient of the policy itself you can compute analytically, because you know what the policy is and you choose how to express it. So, if you wish, going one level down in this expression: you compute the value function, you compute η, which also obeys a linear equation, and then you combine these objects linearly and obtain the gradient from these steps. It's just linear algebra to compute the gradient. It also points to the fact that these results produce algorithms for solving the Bellman equation numerically, and I will tell you about this next lecture.

All right. Yes, I know it's late.

Another doubt that I have: when we wrote the gradient of G -- if you can scroll up a little -- I have, from the gradient of R, a sum over s' and a, and from the gradient of P a sum over a, but then in the gradient of G there is a sum over s, a and s'. I don't know where the extra sum came from.

We added it because here I was missing a piece, η(s), which has been around since the beginning: it is this product here, and the sum over s is the sum over the index of η, which was missing there. So you're right, something was missing there. -- Okay. Thank you. -- Sure.

Okay, goodbye everybody, have a relaxing weekend and see you next time. Thank you. Thank you. Happy holidays.