Okay, the recording is on. Thank you. All right. So for this second half of today's class, we are going to derive the optimality equation for the discounted setting, okay? So the good news is that if you were not impressed by all the heuristics we've been giving in the last part, it doesn't really matter, because we start basically from scratch and derive the Bellman equation from its bare assumptions, okay? So the first part was meant as a motivation. Maybe after the fact you will want to go back and understand where it comes from, okay? So let's recap what the problem is. We want to find a policy, a single stationary policy — this will be our optimal policy — which maximizes the expectation, from zero to infinity, of the discounted rewards, which we can write as usual in this form. And just as we did for the time-dependent Bellman equation, we introduce value functions for a given policy pi, okay? So we specify the policy, and we have the expectation of the same discounted sum, now conditioned on the fact that we start at the initial time in a state S, okay? So nothing has changed except for these minor changes: the horizon is now infinite, and there is a discount factor in the definition. All right. And then we want to find the optimal value function, this object, which is the one that maximizes the value function over all possible policies. Once we have obtained this, we know what the best policy is, in a way that we will figure out in a second. So this is just to recap what the task is. The first step in this derivation, step one, is to find a recursion relation, okay? This will be very similar to what we did for the time-dependent Bellman equation, so I'm going to go quite quickly over it. Starting from the value function, what we do, as usual, is split the sum into two terms.
So the first term is going to be the expectation of the reward for the first step, that is r(S_0, A_1, S_1), where S_0 is the initial state — you can write S naught as S if you wish — plus gamma times the sum for t going from one to infinity. I pulled out a gamma explicitly so that the exponent becomes t minus one, and then I have r(S_t, A_t, S_{t+1}), all of this conditioned on S_0 = S. A very easy step: separate what is happening now from what is happening from the next step until infinity. Then, going by the same steps as last time, we see that this first average can be written as a sum over all possible S primes and a's: again, S_0 is S, so I have to pick an action from S, then I have to pick a new state given S and a, and then I have my average reward for the triplet (S, a, S'). So that is the first expectation value. And the second expectation value is plus gamma times something absolutely similar, in the sense that I pick an action... [Student: Excuse me — is S the same as S naught? And in the first equation, should it be r of S zero, A one, and S one?] Yes, they are the same — and no, no, you're right, thank you, thanks for pointing that out. Okay, so the same happens here: we move one step forward, and in that step we recognize the value function evaluated at the new state S'. Putting everything together, we wind up with our recursion relation, which is a sum over S' and a of pi(a|S) times p(S'|S, a) times the bracket — and we write this object here. Now, this is a recursion relation which is genuinely recursive, because there is no longer the feedforward structure that there was in the time-dependent case, where you have a vector at a certain time and then derive the vector at the previous time. Now every state talks to every other state through the same object, okay? But nevertheless you can solve this equation: it's a linear equation, so you can solve it very simply.
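Written out, the recursion relation just described reads as follows (using p(s'|s,a) for the transition probabilities and r(s,a,s') for the average reward of the triplet, as in the rest of the lecture):

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a\mid s)\, \sum_{s'} p(s'\mid s,a)\,\Big[\, r(s,a,s') \;+\; \gamma\, V^{\pi}(s') \,\Big]
```

This is the "red box" equation referred to below: linear in V^pi for a fixed policy pi.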
It requires just one sweep to be solved explicitly. And this is our starting point. Then the second step, step two, is to look for the optimal V: V star, the vector of optimal values — that is, what you get on average if you follow the optimal policy, yet to be discovered. Now, the proof of the optimality equation is extremely straightforward. It is a mathematical proof, very elegant and concise; it doesn't shed a lot of light on what is happening, but we will derive the Bellman equation in other ways tomorrow in order to see from different angles what is happening. This proof requires basically four lines of computation. It starts out as it should, in the sense that the definition of the optimal value is just the maximum over all possible policies of the value of any given policy, okay? Then the next step is nothing but replacing this quantity by the explicit expression in the red box above, okay? So not particularly demanding from the intellectual viewpoint — I just copy. Okay, now in taking the maximum there are of course a couple of problems, because you cannot simply push the maximum inside the sum: it's a nonlinear operation, okay? So in order to manipulate these things we have to do something. The first thing we do is quite obvious, in the sense that by definition V star is optimal, okay? So for any pi — let me write here — by definition of V star, we have that V star of S is greater than or equal to V pi of S for any state S, just because it's a maximum. Using this, I can say that this quantity here, since it has a V pi inside, is smaller than or equal to the same thing with V pi replaced by V star. You see what has happened? I replaced this thing with this one because of the property of the maximum. Okay, quite easy so far. The next step is again relatively simple.
What have we gained going from here to here? That pi has disappeared inside the bracket: the only pi left is the outer maximization. So in the second part of this step, we realize that the term in square brackets is linear in pi. The dependence on the policy of the term over which we maximize is now linear. And therefore, as you remember — we discussed this last time — when you optimize a linear function over a convex set, the maximum takes place on the boundary. If you remember, we had a very simple example; let me go back to it for a second, maybe it refreshes your memory. We discussed this at the end of the last lecture: in the very simple case in which you have a policy over two actions, linearity means the optimum can only be attained at the boundary, which in this case is the zeros and ones, and the value it takes is just the maximum of the two coefficients in front. The equivalent multidimensional argument here leads us to say that this object is equal to the maximum over a, and we drop the pi. So, once again, in the interest of clarity: we have used the optimality property of V star to simplify the dependence on the policy of the right-hand side, of what is under the maximum operator. The first line at the top has become the third line, and the third line has become the fifth line because of linearity in the policy. This is very nice, but we still have an inequality here, and we wish to find an equality, so we have a few more steps to do. The next step is to define, arbitrarily, a new policy — let's call it pi bar — defined as the arg max over a of this last square bracket. Sorry, this sum here was over S prime, of course: sum over S prime of p(S prime | S, a). This is not going to end well, so let me take some more space here.
So pi bar is defined as the arg max over a of the sum — and again I'm copying the last line. Okay, so I am defining a new policy which was not present at the beginning. Now, what is important: if I choose this, then this last line is also equal to the value of my policy pi bar, by definition. Because if my policy is to pick the arg max of this expression, then the maximum is just the value of my policy, okay? Because of the recursion relation, okay? So this is just a tautology, it holds by definition. But then, if this is true — well, pi bar is one possible policy among many, so again it has to be suboptimal by definition: its value has to be smaller than or equal to V star. And therefore, if you look at this in full — I don't know if you can see it all, but we started from V star here — we put a series of less-than-or-equals ending in the same thing we started from. So this chain of inequalities says that our starting object is less than or equal to our final object, which is itself — which means they must all be equal, and everything in between must be the same, right? As a result of this chain of inequalities, this object must also be equal to V star. Putting everything together, we conclude that V star of S is equal to the maximum over a of the sum over S prime, which is Bellman's optimality equation. And this is nothing but exactly the same thing that we derived heuristically in the first half of today's class. So the intuitive solution of the problem is also shown to be valid by this mathematical reasoning. Of course, if you feel a little bit dizzy about all these inequalities going back and forth, it's okay — perfectly understandable, because this is a purely formal proof.
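The four-line chain just described can be summarized as follows; since the first and last entries coincide, every step must hold with equality:

```latex
\begin{aligned}
V^{*}(s) &= \max_{\pi}\, \sum_{a}\pi(a\mid s)\sum_{s'} p(s'\mid s,a)\big[r(s,a,s') + \gamma V^{\pi}(s')\big] \\
&\le \max_{\pi}\, \sum_{a}\pi(a\mid s)\sum_{s'} p(s'\mid s,a)\big[r(s,a,s') + \gamma V^{*}(s')\big]
&& (V^{\pi}\le V^{*}) \\
&= \max_{a}\, \sum_{s'} p(s'\mid s,a)\big[r(s,a,s') + \gamma V^{*}(s')\big]
&& \text{(linearity in }\pi\text{)} \\
&= V^{\bar{\pi}}(s) \;\le\; V^{*}(s)
&& (\bar{\pi}\ \text{suboptimal})
\end{aligned}
```

The last equality uses the recursion relation together with the definition of pi bar as the arg max — the "tautology" mentioned above.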
We will give, like I said, more concrete proofs tomorrow of how this optimality equation emerges, okay? A somewhat more constructive way of deriving it. But that's the bottom line: this is Bellman's optimality equation. Now that we have an equation — and again, this is a nonlinear equation — we cannot use dynamic programming, okay? Because we have the same unknown on the left and on the right. So we have to find some other way of solving this equation. But first of all, if we approach this problem mathematically, the first questions are: are there solutions to this equation? Are they unique? And how do we compute them? What I'm going to do in the minutes that are left is derive one result which answers these three questions altogether: it will tell us that a solution exists, that it is unique, and it will give us a method to find it, okay? So the next step is step three: solving Bellman's equation. The first thing to notice is that, reasoning at a very abstract level, we can write down the Bellman equation in the following form. V star — you can see it as a vector, okay? V star is a vector in R to the cardinality of the state space, a space whose dimension is the number of states. And we can write down the Bellman equation in the following formal way: some operator B acting on V star gives back V star. What this operator is, you read off from here: you take your vector V star, you multiply it by gamma, you make a linear combination with the transition probabilities, and you add the reward on top, okay?
Then you take the maximum. So it's a combination of linear and nonlinear operations that returns another vector. We call this nonlinear operator, without much imagination, the Bellman operator. Now, what we're going to show is that this operator B is contracting. What does it mean to be contracting? It means that if you take any pair of points in the space where the value functions live, this operator brings them closer together. If you apply this operator to a cloud of points, the cloud gets tighter. Intuitively, this means that if you repeat this operation of applying the Bellman operator many times, you get closer and closer to a single point — which is the important intuition, because it tells you that there exists a solution, the center of this contraction, and that this solution is unique. This is one example of a broader concept in mathematics called a fixed point theorem. Okay? So let's move on and try to prove the contractivity of the Bellman operator. Again, the proof is quite compact. It works as follows. Take two vectors, W1 and W2, in the space where the value functions live — vectors with as many entries as there are states. Then you ask yourself: if I apply B to W1 and B to W2, and I take the distance between them according to some norm, what is it? My goal is to show that this distance is smaller than the distance between W1 and W2: if I take two vectors and apply the Bellman operator, after one iteration they will be closer, okay? That's the goal. How do I go about it? Questions, any questions? No? Then let's go about it. First, let's replace B by the definition of the Bellman operator — this is just copying and pasting: max over a of the sum over S prime, minus the same thing for W2.
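In symbols, the operator read off the board is, component by component:

```latex
(B V)(s) \;=\; \max_{a}\, \sum_{s'} p(s'\mid s,a)\,\big[\, r(s,a,s') + \gamma\, V(s') \,\big],
\qquad V \in \mathbb{R}^{\lvert \mathcal{S} \rvert}
```

Note that B maps vectors to vectors: the maximum over actions is what makes it nonlinear.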
So this just follows from the definition, with an absolute value outside, okay? No manipulation so far. Now, remember, we want to show that this is smaller than the initial distance, so we want to bound this expression from above. The basic idea — let me take a short break and use a different color — is to realize that this is the difference of two maxima, okay? A maximum of one vector minus the maximum of another vector. So I'm going to use — let's draw a little box here — a very simple inequality, and I'm just going to play around with it a little bit. Suppose we have two functions of a, f(a) and g(a). Then the maximum over a of f(a) is smaller than or equal to the maximum over a of [f(a) minus g(a)] plus the maximum over a of g(a). This is very straightforward to show, okay? Because the maximum of a sum is less than or equal to the sum of the maxima: for any two functions, say h1(a) plus h2(a), since each of them is smaller than or equal to its own maximum over a, the outer maximum doesn't matter any longer, and we get max over a of h1 plus max over a of h2, okay? Using this with h1 = f minus g and h2 = g gives exactly the boxed inequality. Rearranging, the difference of the maxima is smaller than or equal to the maximum of the difference, taken in the proper order, okay? And by swapping the roles of f and g, you can bound the absolute value of the difference of the maxima by the maximum of the absolute difference, okay? So we're going to use this here: for the difference of maxima, we can take the algebraic difference of the things that are inside.
So take this minus that, component by component, and therefore we can write that this object is smaller than or equal to the maximum over a of the difference. But when we take the difference, you realize that the first terms — the average rewards — are the same, so they cancel each other. And therefore what you get is that this is smaller than or equal to gamma times the sum over S prime of p(S prime | S, a) times [W1(S prime) minus W2(S prime)], with an absolute value outside. Okay, now the next step. [Student: Excuse me, but according to the last orange equation, if I solve it, I obtain that the maximum over a of f(a) minus g(a), plus the maximum of g(a), is greater than or equal to the maximum of f(a)...] Oh — yes, you're perfectly right, sorry. I wrote the direction wrong at the top when I first copied it down; apologies, apologies. The statement derives from the boxed inequality exactly as you say, once you recombine the first line — the two sides were in the opposite direction. You can check it directly, it's a straightforward calculation; I was just wrong in writing down the answer at the top. Thank you for pointing that out — I see you're wide awake, much more than me. Very good. All right, so using this, we go to the next step, and there are a few easy things to do. The first one is to pull out the gamma, which is just a number, a prefactor. The second thing is that every one of these terms is smaller than or equal to its modulus, okay? Again, quite obvious: if it's positive, it's the same; if it's negative, it's less than the absolute value. So we can bound this with the modulus.
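The boxed inequality — the absolute difference of maxima is bounded by the maximum of the absolute difference — is easy to sanity-check numerically. Here is a quick sketch; the array size and random seed are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    f = rng.normal(size=5)   # values f(a) over a small action set
    g = rng.normal(size=5)   # values g(a) over the same action set
    # |max_a f(a) - max_a g(a)| <= max_a |f(a) - g(a)|
    assert abs(f.max() - g.max()) <= np.abs(f - g).max() + 1e-12
```

The small 1e-12 slack only guards against floating-point rounding; the inequality itself is exact.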
And of course, the transition probabilities inside are all positive, so pushing the modulus inside the sum we can bound the whole thing by the sum over S prime of p(S prime | S, a) times the modulus of W1(S prime) minus W2(S prime). Very good. Next step: we introduce the infinity norm, okay? A mathematical concept: it's a distance on the vector space, sometimes called the sup norm, defined as the maximum over all states S in the state space of the modulus of the components of the vector, okay? So it's a norm: if you have a vector, you take the absolute values of the entries and then you take the maximum. It's not the Euclidean norm, but it's a proper norm. If you think of it geometrically, the Euclidean norm has level sets which are spheres; the infinity norm has level sets which are squares — cubes, hypercubes. Okay. So if you define this, then you realize that you can bound each of these terms: they are components of the difference of two vectors, so you can go one step further and bound each by the infinity norm of W1 minus W2, because every component must be smaller than the norm in absolute value, okay? Now we're almost done, because the sum over S prime now acts only on the transition probabilities — and that is a probability, so it sums to one. And the next nice thing is that the maximum over a is now totally vacuous, because the dependence on the action has disappeared. So this object is actually equal to gamma times the infinity norm of W1 minus W2. Let's look at where we started, okay? To write it properly and make it even more transparent: this was done for each fixed S, okay? Component by component, okay?
You see that there was this dependence on S hanging around. So, to recap: we have shown that the modulus of [the Bellman operator applied to W1, component S, minus the Bellman operator applied to W2, component S] is smaller than or equal to gamma times the infinity norm of W1 minus W2. This is valid for all components, so I can take the max over S on both sides; the max over S of the right-hand side is just the same thing, because there is no dependence on the component, while the maximum of the modulus on the left-hand side is, by definition, the infinity norm. And then we're done. We're done because we conclude that if gamma is smaller than one, the Bellman operator is contracting, and then by the fixed point theorem we conclude that B V star equals V star has one unique solution — which is, again, intuitively given by this property of contracting points. Okay, so first take-home message: this apparently nasty nonlinear equation in fact has rather nice mathematical properties, at least when we use the infinity norm. We will discuss a little whether this choice of norm is really crucial or a mathematical artifact; I can anticipate that it's mostly for demonstration purposes — you can make sense of it with other norms too, it's not a necessity, okay? Why do we care about this? Well, it's a good and interesting mathematical property which we have to cherish, because it means that if we look for an optimal solution of our planning problem, we don't risk ending up with spurious solutions or local maxima, for instance, okay? That is not the case here: the problem has a unique optimal solution, which was not obvious at the outset. But it's even more interesting than that, because this provides us with step four: an algorithm for the solution. This algorithm is called value iteration. So what is the idea? Well, the idea is extremely straightforward. The pseudocode for this algorithm is the following.
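What was just proved can be checked numerically on a small example. The sketch below builds a random toy MDP — the names, sizes, and seed are our own assumptions, not from the lecture — and verifies that one application of B shrinks the infinity-norm distance by at least a factor gamma:

```python
import numpy as np

rng = np.random.default_rng(42)
n_states, n_actions, gamma = 3, 2, 0.9
# P[a, s, s'] = p(s'|s, a): random rows, normalized to sum to one.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_actions, n_states, n_states))  # r(s, a, s')

def bellman(V):
    # (B V)(s) = max_a sum_s' p(s'|s,a) [ r(s,a,s') + gamma V(s') ]
    return np.einsum("asn,asn->as", P, R + gamma * V).max(axis=0)

W1, W2 = rng.normal(size=n_states), rng.normal(size=n_states)
lhs = np.abs(bellman(W1) - bellman(W2)).max()   # ||B W1 - B W2||_inf
rhs = gamma * np.abs(W1 - W2).max()             # gamma * ||W1 - W2||_inf
assert lhs <= rhs + 1e-12                       # contraction with modulus gamma
```

The bound holds for any pair W1, W2, not just the random ones drawn here — that is exactly the statement proved on the board.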
Initialization: choose V naught. What is V naught? It's a guess — a guess for your value function, okay? Thanks to the fact that the Bellman operator is contracting everywhere, it doesn't matter where you start, okay? [Student: Sorry, did I miss a part — the initial guess?] The initial guess is arbitrary, okay? Because wherever you start, you will always converge to the optimal solution. But of course, if you start very far away, it will take a long time, okay? That is to be expected. Then there is a loop, which defines your next approximation to the value function as the Bellman operator applied to your previous approximation. So you start with a guess, you apply the Bellman operator, this produces another vector, and then you repeat this again and again until the distance from the new guess to the previous one is smaller than some tolerance, okay? You should measure this distance in the infinity norm, in order to be able to apply the theorem, okay? So when the two vectors — the previous one and the new one — are close enough in this hyper-cubic distance, then you call it off. You say: okay, I'm happy, I'm close enough. And then, after you have exited the loop, you return the approximation to the optimal policy, which is, as usual, a function defined as the arg max over a of the sum over S prime of p(S prime | S, a) times [r(S, a, S prime) plus gamma V_K(S prime)]. It's the usual object that I'm rewriting for the thousandth time, so even if you cannot read it well, it's the same object that has appeared many, many times. And this algorithm guarantees that, given the tolerance, you are as close to the optimal policy as you want: if you decrease the tolerance, you get even closer. So in this sense, these tend to V star and pi star.
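Here is a minimal sketch of the value-iteration pseudocode just described, again on an assumed random toy MDP — P, R, the sizes, and the tolerance are illustrative choices of ours, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, tol = 4, 2, 0.9, 1e-8
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # P[a, s, s'] = p(s'|s, a)
R = rng.normal(size=(n_actions, n_states, n_states))

def q_values(V):
    # Q[a, s] = sum_s' p(s'|s,a) [ r(s,a,s') + gamma V(s') ]
    return np.einsum("asn,asn->as", P, R + gamma * V)

V = np.zeros(n_states)                     # arbitrary initial guess V_0
while True:
    V_new = q_values(V).max(axis=0)        # V_{k+1} = B V_k
    if np.abs(V_new - V).max() < tol:      # stop: ||V_{k+1} - V_k||_inf < tol
        break
    V = V_new

pi_star = q_values(V_new).argmax(axis=0)   # greedy policy from the final V
```

A standard consequence of the contraction property (with modulus gamma) is that when the loop stops, the final iterate is within gamma·tol/(1 − gamma) of V star in the infinity norm, so tightening tol tightens the approximation, as stated in the lecture.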
Graphically speaking, just to give you a final intuition of what is happening. You may think of this as the space of values, okay? It's not a line, it's a possibly high-dimensional space: it stands for R to the power of the number of states, real numbers to the power of the number of states. So this is the space of values. Then there is a function which sends values into values, which is the Bellman operator. The way to think about it is as a function whose graph everywhere has slope less than one, okay? This is the graph of B of V. Be aware that this is just a one-dimensional sketch — these are really maps of vectors into vectors, okay? And this axis would be the outcome. So again, here I drew it really horribly, so please let me try to do it better, otherwise this is going to be messy. It's something like... all right. And it's nonlinear, okay? I'm trying to draw something which is at the same time nonlinear but not too ugly — it has to have the right properties, a curve like this, okay? So what is the solution of the Bellman equation? Well, it is where the two curves cross. This is the line V prime equals V, and this is the curve V prime equals B of V, okay? So this point here, this is V star, the solution of B V star equals V star. The contractivity property is equivalent to saying that this green curve has slope less than one. Why is that? Because if I take two points, W1 and W2, what happens to them? I read off the curve, and these will be the positions of W1 prime and W2 prime. And you see that if the slope is less than one, two points which were far apart become closer. So this is the contractivity property. And what does the value iteration algorithm do?
Well, you start from one guess, okay? Let's use another color. You start with your guess V naught, then you go up and you see: okay, this sends me to a value which is larger than V naught, because the green curve here is above the white one. So this sends me to V1 here. And then if I apply the Bellman operator again, I get here, and so on and so forth. You see that this sequence approaches V star. And the same thing happens from the other side. Of course, again, this is one-dimensional, but you can take my word that this idea of contraction does the same job in an arbitrary number of dimensions. Okay, so I think that was quite a lot for today. In the exercise session you will see how this value iteration works, okay? But tomorrow, for starters, I will go back to this idea of value iteration and show you some intuition about what kind of approximation it produces in some simple example, like the green one, okay? But that's for tomorrow. Fine, any questions? [Student: I just wanted to ask about what you wrote a little above, where the infinity norm of V k plus one minus V k needs to be less than — you said "tol"?] Yes — that is some tolerance that you decide at the beginning. You have to decide when to stop, okay? So you basically decide: maybe I want to stop when the difference of my value functions is less than a given number. Or you can use other criteria, for instance the percentage change, okay? You might want to stop when this object, divided by the infinity norm of V k, becomes smaller than something, okay? So you have different choices, depending on whether you have an idea of the typical values the function takes or not; if you haven't, you may want to use the percentage tolerance. The substance of the algorithm doesn't change — the performance might be different, of course.
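The two stopping rules mentioned in the answer — an absolute gap and a percentage change — can be sketched as below; the function name and the zero-norm guard are our own:

```python
import numpy as np

def converged(V_new, V_old, tol=1e-6, relative=False):
    """Decide when to stop value iteration, measuring in the infinity norm."""
    gap = np.abs(V_new - V_old).max()      # ||V_{k+1} - V_k||_inf
    if relative:
        # Percentage change: gap relative to the size of the current iterate,
        # with a small floor to avoid dividing by a zero norm.
        return gap < tol * max(np.abs(V_old).max(), 1e-12)
    return gap < tol                        # absolute tolerance
```

As noted in the lecture, the choice only affects when the loop exits, not the substance of the algorithm.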
This requires a little bit of craftsmanship to be tuned. Thank you. So, any other questions? Okay, that's good. Thank you very much, and see you tomorrow at nine, okay? Okay, thank you. Thank you, bye. Have a good day.