Thank you. All right. So for this second half of today's class, we are going to derive the optimality equation for the discounted setting. The good news is that even if you were not convinced by the heuristics of the last part, it doesn't really matter: we now start essentially from scratch and derive Bellman's equation from the bare assumptions. The first part was meant as motivation; after the fact, you may want to go back and see where it all comes from.

So let's recap what the problem is. We want to find a single optimal policy pi* which maximizes the expectation of the discounted sum of rewards from t = 0 to infinity. Just as we did for the time-dependent Bellman equation, we introduce a value function for a given policy pi: V^pi(s) is the expectation of the same discounted sum, now conditioned on starting at the initial time in state s, that is, V^pi(s) = E[ sum_{t=0}^infinity gamma^t r(s_t, a_{t+1}, s_{t+1}) | s_0 = s ]. Nothing has changed in this definition except for the infinite horizon and the presence of the discount factor. Then we define the optimal value function V*(s) = max_pi V^pi(s), the maximum over all possible policies. Once we have obtained this, we know what the best policy is. That is the task.

The first step in the derivation, step one, is to find a recursion relation. This will be very similar to what we did for the time-dependent Bellman equation, so I will go quite quickly. Starting from the value function, we split the sum into two terms: first, the expectation of the reward for the first step, involving the initial state s_0 = s, the first action a_1, and the next state s_1.
This first term is the expected reward r(s_0, a_1, s_1), conditioned on s_0 = s, plus gamma times the expectation of the sum from t = 1 to infinity of gamma^{t-1} r(s_t, a_{t+1}, s_{t+1}); I pulled one factor of gamma out explicitly so that the exponent becomes t - 1. A very easy step: separate what is happening now from what is happening from the next step until infinity.

Following the same steps as last time, the first expectation can be written as a sum over actions a and successor states s': given s_0 = s, I pick an action a from pi(a|s), then pick a new state s' from p(s'|s,a), and average the reward r(s,a,s') over the triplet. [A student points out that in the first equation the reward should read r(s_0, a_1, s_1).] You're right, thank you for pointing that out. The second expectation is entirely similar: we move one step forward, and in the new state s' we recognize the value function V^pi(s') again.

Putting everything together, we end up with the recursion relation V^pi(s) = sum_a pi(a|s) sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma V^pi(s') ]. Note that this recursion is now self-referential: there is no longer the feedback structure in time that we had before, where the vector at one time determines the vector at the previous time. Now every state talks to every other state through the same object. Nevertheless, you can solve this equation: it is linear in V^pi, so it can be solved explicitly, in one sweep. This is our starting point.
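Since the recursion is linear, solving it "in one sweep" means one linear solve of (I - gamma P^pi) V^pi = r^pi. Here is a minimal numpy sketch; the 2-state, 2-action MDP and the particular policy are arbitrary toy numbers, not from the lecture.

```python
import numpy as np

gamma = 0.9
n_states = 2

# Toy MDP: P[a, s, s'] = transition probability, R[a, s, s'] = reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 1.0]]])

# A fixed deterministic policy: action 0 in state 0, action 1 in state 1
pi = np.array([0, 1])

# Policy-induced transition matrix and expected one-step reward r^pi
P_pi = np.array([P[pi[s], s] for s in range(n_states)])
r_pi = np.array([(P[pi[s], s] * R[pi[s], s]).sum() for s in range(n_states)])

# One linear solve gives the exact value function V^pi
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Sanity check: V^pi satisfies the recursion relation
assert np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi)
```

This is exactly the "one sweep" the lecture mentions: no iteration is needed for a fixed policy, only for the optimal one.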
The second step is to look for the optimal value function: step two, look for V*, the vector of optimal values, that is, what you get on average if you follow the optimal policy, yet to be discovered. The proof of the optimality equation is extremely straightforward. It is a purely mathematical proof, elegant and concise, though it doesn't shed much light on what is happening; we will derive Bellman's equation in other ways tomorrow, to see it from different angles. But this proof requires basically four lines of computation.

It starts, as it should, from the definition: the optimal value is the maximum over all possible policies of the value of each policy, V*(s) = max_pi V^pi(s). The next step is nothing but replacing V^pi(s) by the explicit expression in the red box above. Not particularly demanding from the intellectual viewpoint: I just copy.

Now, in taking the maximum there is a problem: you cannot simply push the maximum inside the sum, because the maximum is a nonlinear operation, so it does not just go through. We have to manipulate things. The first thing we do is quite obvious: by definition, V* is optimal, so V*(s) >= V^pi(s) for every policy pi and every state s, just because it is a maximum. Using this, the whole expression, since it contains V^pi inside, is less than or equal to the same expression with V^pi replaced by V*. Quite easy so far. The next step is again relatively simple, but first: what have we gained from this replacement? That pi has disappeared from inside the bracket.
There is no pi inside any longer; the only remaining pi is the one outside, under the maximum. So, second part of this second step: we realize that the term in square brackets is linear in pi. The dependence on the policy of the term over which we maximize is now linear, and, as we discussed last time, when you maximize a linear function over a convex set, the maximum is attained on the boundary. Remember the very simple example from the end of the last lecture: with a policy over just two actions, the objective is a linear function of the probabilities, so it attains its maximum at the boundary values, zero and one, and the value it takes is simply the larger of the two coefficients in front. The equivalent multidimensional argument lets us replace the maximum over policies with a maximum over actions: this object equals max over a of the bracket, and the pi drops out.

So, in the interest of clarity: we used the optimality property of V* to simplify the dependence on the policy on the right-hand side (the first line became the third line), and then linearity in the policy (the third line became the fifth line). This is very nice, but we still have an inequality, and we want an equality, so a few more steps are needed. Next, we define a new policy, call it pi-bar, as the argmax of that last square bracket: pi-bar(s) = argmax_a sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma V*(s') ].
So I am defining a new policy, which was not present at the beginning. The important point is that with this choice, the last line is also equal to the value of the policy pi-bar, V^{pi-bar}(s). Why? Because if my policy is to pick exactly the maximizing action, then the maximum is just the value of my policy, by the recursion relation. It's a tautology, true by definition. But pi-bar is one possible policy among many, so by definition of V* it must be suboptimal: V^{pi-bar}(s) <= V*(s).

Now look at the chain in full. We started from V*(s), went through a series of less-than-or-equal steps, and ended with something that is again bounded by V*(s), the very thing we started from. So the first and last members of the chain are equal, which forces every member in between to be equal as well. In particular, the middle expression must equal V*. Putting everything together, we conclude that V*(s) = max_a sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma V*(s') ], which is Bellman's optimality equation, and which is exactly the same thing we derived heuristically in the first half of today's class. So the intuitive solution of the problem is also shown to be valid by this mathematical reasoning. If you feel a little dizzy about these inequalities going back and forth, that's perfectly understandable: this is a purely formal proof. As I said, tomorrow we will give more concrete, more constructive derivations of how this optimality equation emerges.
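Collected in one place, the chain of (in)equalities just described reads:

```latex
\begin{aligned}
V^{*}(s) &= \max_{\pi}\; \sum_{a} \pi(a\mid s) \sum_{s'} p(s'\mid s,a)\,\bigl[\, r(s,a,s') + \gamma\, V^{\pi}(s') \,\bigr] \\
&\le \max_{\pi}\; \sum_{a} \pi(a\mid s) \sum_{s'} p(s'\mid s,a)\,\bigl[\, r(s,a,s') + \gamma\, V^{*}(s') \,\bigr]
&& \text{(since } V^{\pi} \le V^{*}\text{)} \\
&= \max_{a}\; \sum_{s'} p(s'\mid s,a)\,\bigl[\, r(s,a,s') + \gamma\, V^{*}(s') \,\bigr]
&& \text{(linearity in } \pi\text{)} \\
&= V^{\bar{\pi}}(s) \;\le\; V^{*}(s)
&& \text{(definition of } \bar{\pi}\text{; suboptimality)} ,
\end{aligned}
```

and since the first and last members coincide, every inequality holds with equality, which yields the optimality equation.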
But that's the bottom line: this is Bellman's optimality equation. Now that we have an equation — and note it is a nonlinear equation, with the same unknown appearing on the left and on the right — we cannot use dynamic programming, so we have to find some other way of solving it. But first of all, if we approach this problem mathematically, the first questions are: do solutions to this equation exist, and are they unique? What I am going to do in the minutes that are left is derive one result which answers three questions altogether: it will tell us that a solution exists, that it is unique, and it will give us a method to compute it.

So the following step, step three: solving Bellman. The first thing to notice, reasoning at a very abstract level, is that we can write down Bellman's equation in the following form. You can think of V* as a vector in R^{|S|}, a space whose dimension is the number of states. Then the Bellman equation reads, formally, B V* = V*, for some operator B acting on V*. What is this operator? You read it off the equation: take your vector V*, multiply it by gamma, form a linear combination with the transition probabilities, add the expected reward on top, and then take the maximum over actions. So it is a combination of linear and nonlinear operations that returns another vector.
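As a sketch, this operator can be written in a few lines of numpy; the array-shape convention and the toy MDP below are illustrative assumptions, not from the lecture.

```python
import numpy as np

def bellman_operator(V, P, R, gamma):
    """(B V)(s) = max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s')).

    P and R have shape (n_actions, n_states, n_states); V has shape (n_states,).
    """
    # Q[a, s]: expected one-step return of playing a in s, then collecting gamma*V
    Q = np.einsum('ast,ast->as', P, R + gamma * V)
    return Q.max(axis=0)

# Tiny arbitrary 2-state, 2-action MDP to exercise the operator
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.ones_like(P)
gamma = 0.9

# Numerical check of the contraction property proved in what follows:
# applying B once brings any two vectors closer in the infinity norm
W1, W2 = np.array([0.0, 0.0]), np.array([10.0, -3.0])
lhs = np.abs(bellman_operator(W1, P, R, gamma) - bellman_operator(W2, P, R, gamma)).max()
rhs = gamma * np.abs(W1 - W2).max()
assert lhs <= rhs + 1e-12
```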
And we call this nonlinear operator, without much imagination, the Bellman operator. What we are going to show is that this operator B is contracting. What does it mean to be contracting? Take the space where the value functions live, and any pair of points in it: the operator brings them closer together. If you apply it to a cloud of points, the cloud shrinks. Intuitively, this means that if you repeat the operation of applying the Bellman operator many times, points get closer and closer and closer, which is the important intuition: it will tell us that there exists a solution — the center of this contraction — and that this solution is unique. This is one instance of a broader concept in mathematics, the fixed-point theorem.

So let's move on and prove the contractivity of the Bellman operator. Again, it is quite a compact proof. It works as follows. Take two vectors W1 and W2 in the space where value functions live — vectors with as many entries as there are states. Then ask yourself: if I apply B to W1 and B to W2, and take the distance between the results according to some norm, what do I get? My goal is to show that this distance is smaller than the distance between W1 and W2: if I take two vectors and apply the Bellman operator, after one iteration they will be closer. That's the goal. How do I go about it? Any questions? No? Then let's go.

First, replace B by its definition — this is just copying and pasting: the difference of two expressions of the form max over a of a sum over s', with an absolute value outside. No manipulation so far. Now, remember, we want to bound this by the initial distance.
So, we want to bound this expression by something larger. The basic idea — let me switch to a different color and draw a little box — is to realize that this is the difference of two maxima: the maximum of one vector minus the maximum of another. So I am going to use a very simple inequality and play around with it a little.

Suppose we have two functions f(a) and g(a), and write f(a) = (f(a) - g(a)) + g(a). The maximum of a sum is at most the sum of the maxima: for any two functions h1 and h2, each value h1(a) + h2(a) is bounded by max_a h1(a) + max_a h2(a), and then the outer maximum doesn't matter any longer. Applying this with h1 = f - g and h2 = g gives max_a f(a) <= max_a (f(a) - g(a)) + max_a g(a), that is, max_a f(a) - max_a g(a) <= max_a (f(a) - g(a)). Exchanging the roles of f and g and combining the two bounds, the absolute value of the difference of the maxima is bounded by max_a |f(a) - g(a)|, the maximum of the differences taken in the proper order.

We can use this here: the absolute difference of the two maxima is bounded by the maximum over a of the difference of what is inside — take this sum minus that sum, term by term.
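Before applying it, here is the boxed inequality stated cleanly:

```latex
\begin{aligned}
\max_a f(a) \;=\; \max_a \bigl[(f(a)-g(a)) + g(a)\bigr]
\;&\le\; \max_a \bigl[f(a)-g(a)\bigr] \;+\; \max_a g(a) \\
\Longrightarrow\qquad
\max_a f(a) \;-\; \max_a g(a)
\;&\le\; \max_a \bigl[f(a)-g(a)\bigr]
\;\le\; \max_a \bigl|f(a)-g(a)\bigr| ,
\end{aligned}
```

and exchanging the roles of $f$ and $g$ gives the two-sided bound $\bigl|\max_a f(a) - \max_a g(a)\bigr| \le \max_a |f(a) - g(a)|$.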
But when we take the difference, you realize that the first terms — the expected rewards — are the same, so they cancel each other. What you get is that this is bounded by the maximum over a of gamma times | sum_{s'} p(s'|s,a) (W1(s') - W2(s')) |, with the absolute value outside the sum. Now, the next step.

[A student objects: according to the last orange equation, recombining it gives the inequality in the other direction.] You're perfectly right — sorry, I copied the statement wrong at the very beginning. Apologies. The correct statement is the one that derives from the boxed computation; if you recombine it, the first line should have the inequality in the direction we just derived. You can check it directly, it's a straightforward calculation; I was simply wrong in writing down the answer at the top. Thank you for pointing it out — I see you're wide awake, much more so than me. Very good.

So, using this, we go to the next step, and there are a few easy things to do. The first is to pull out the gamma, which is just a number, a pre-factor. The second: the modulus of the sum is bounded by the sum of the moduli — the triangle inequality; each term, if it's positive, is unchanged, and if it's negative, it's less than its absolute value. So we can move the absolute value inside the sum, onto each term p(s'|s,a)(W1(s') - W2(s')); and since everything inside is then non-negative, we can drop the outer absolute value.
And this becomes gamma times the maximum over a of the sum over s' of p(s'|s,a) |W1(s') - W2(s')|. Very good. Next step: we introduce the infinity norm, a distance on this vector space, defined as the maximum over all states s of the modulus of the components: ||W||_inf = max_s |W(s)|. Basically, if you have a vector, you take the absolute values of its entries and then take the maximum. It is not the Euclidean norm, but it is a proper norm. Geometrically, the level sets of the Euclidean norm are spheres; the level sets of the infinity norm are squares — cubes, hypercubes, in higher dimension.

With this definition, you realize you can bound each of those terms: they are components of the difference of two vectors, so each |W1(s') - W2(s')| is bounded by ||W1 - W2||_inf, because every component is at most the infinity norm in absolute value. Now we are almost done: the sum over s' acts only on p(s'|s,a), which is a probability distribution, so it sums to 1. And the next nice thing is that the maximum over a is now totally vacuous, because the dependence on the action has disappeared. So the whole object is bounded by gamma ||W1 - W2||_inf.

Let's look back at where we started. Written properly, to make it fully transparent: the starting quantity was taken for each state s separately, component by component — there was still a dependence on s hanging around.
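In symbols, the chain of bounds just derived, for each component s, is:

```latex
\begin{aligned}
\bigl|(B W_1)(s) - (B W_2)(s)\bigr|
&\le \max_a \Bigl|\, \gamma \sum_{s'} p(s'\mid s,a)\,\bigl[W_1(s') - W_2(s')\bigr] \Bigr| \\
&\le \gamma\, \max_a \sum_{s'} p(s'\mid s,a)\,\bigl|W_1(s') - W_2(s')\bigr| \\
&\le \gamma\, \max_a \sum_{s'} p(s'\mid s,a)\, \lVert W_1 - W_2 \rVert_\infty
\;=\; \gamma\, \lVert W_1 - W_2 \rVert_\infty ,
\end{aligned}
```

using, in order: the inequality for differences of maxima, the triangle inequality, the definition of the infinity norm, and $\sum_{s'} p(s'\mid s,a) = 1$.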
So, to recap, we have shown that |(B W1)(s) - (B W2)(s)| <= gamma ||W1 - W2||_inf. This is valid for every component s, so I can take the maximum over s on both sides. The right-hand side does not depend on s, so it is unchanged; and the maximum over s of the modulus on the left is, by definition, the infinity norm. And then we're done: ||B W1 - B W2||_inf <= gamma ||W1 - W2||_inf, so if gamma is smaller than one, the Bellman operator is contracting. By the fixed-point theorem, we conclude that B V* = V* has one unique solution — which, intuitively, is the point toward which the contraction shrinks everything.

So, first take-home message: this apparently nasty nonlinear equation in fact has rather nice mathematical properties, at least when we use the infinity norm. We will discuss a little whether this choice of norm is really crucial or a mathematical artifact; I can anticipate that it is mostly for demonstration purposes — you can make things sensible with other norms as well; it is not a necessity.

Why do we care about this? Well, it is a good and interesting mathematical property, which we should cherish, because it means that when we look for an optimal solution of our planning problem, we do not risk ending up in spurious optima or local maxima. That is not the case here: the problem has a unique optimal solution, which is not obvious from the outset. But it is even more interesting than that, because it provides us with step four: an algorithm for the solution, called value iteration. The idea is extremely straightforward. The pseudocode is the following. Initialization: choose V0. What is V0?
It's a guess — a guess for your value function. Thanks to the fact that the Bellman operator is contracting everywhere, it doesn't matter where you start from: the initial guess is arbitrary, because you will always converge to the solution. Of course, if you start very far away, it will take longer; that is to be expected.

Then there is a loop which defines your next approximation to the value function as the Bellman operator applied to the previous approximation. So you start with a guess, apply the Bellman operator, get another vector, and repeat again and again until the distance from the new iterate to the previous one is smaller than some tolerance. You should measure this distance in the infinity norm, in order to be able to apply the theorem. When the previous vector and the new one are close enough in this cubic — hypercubic — distance, you call it off: okay, I'm happy, I'm close enough. Then, on exit, you return your approximation to the optimal policy: as usual a function of the state, defined as the argmax over a of the expression inside the Bellman operator — the same object that has appeared many, many times, even if you cannot read my writing well. This algorithm guarantees that, given the tolerance, you are as close to the optimal value and policy as you want; decrease the tolerance and you get even closer. In this sense, the iterates tend to V* and pi*.
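The pseudocode just described, as a minimal numpy sketch (the toy MDP arrays and the default tolerance are illustrative assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V_{k+1} = B V_k until ||V_{k+1} - V_k||_inf < tol.

    P, R: shape (n_actions, n_states, n_states). Returns (V, greedy policy).
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                            # arbitrary initial guess V_0
    while True:
        Q = np.einsum('ast,ast->as', P, R + gamma * V)   # Q[a, s]
        V_new = Q.max(axis=0)                         # one Bellman-operator sweep
        if np.max(np.abs(V_new - V)) < tol:           # infinity-norm stopping rule
            return V_new, Q.argmax(axis=0)            # greedy policy = argmax_a
        V = V_new

# Same style of toy 2-state, 2-action MDP as before
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 1.0]]])

V_star, pi_star = value_iteration(P, R, gamma=0.9)
```

On exit, V_star satisfies the Bellman optimality equation up to the tolerance, and pi_star is the greedy policy extracted from it.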
Graphically now, just to give you the final intuition of what is happening: think of this axis as the space of values. It is not really a line — it is a possibly high-dimensional space, R to the power of the number of states — but a one-dimensional sketch will do. Then there is a map which sends values to values: the Bellman operator. Think of the graph of B(V) as a curve which everywhere has slope less than one. It is nonlinear — I am trying to draw something nonlinear but not too ugly — and be aware that in reality this is a map from vectors to vectors; this is just a one-dimensional sketch.

So what is the solution of the Bellman equation? It is where two curves cross: the line V' = V and the curve V' = B(V). Their intersection is V*, the solution of B V* = V*. The contractivity property is equivalent to saying that this green curve has slope less than one. Why? Because if I take two points W1 and W2 and map them through the curve to W1' and W2', then with slope less than one, two points that were far apart become closer. That is the contractivity property. And what does the value iteration algorithm do? You start from one guess — let's use another color.
You start with your guess V0. Then you go up to the curve: it sends you to a value larger than V0, because the green curve is above the identity line there. That gives you V1. Apply the Bellman operator again and you get the next iterate, and so on and so forth: you see that the sequence approaches V*. The same thing happens from the other side. Of course, again, this is one-dimensional, but you can take my word that this idea of contraction does the same job in an arbitrary number of dimensions.

I think that was quite a lot for today. In the exercise session, you will see how value iteration works in practice. Tomorrow, for starters, I will go back to this idea of value iteration and show you some intuition about the kind of approximation it produces in a simple example like the one we have seen. But that's for tomorrow. Fine. Any questions?

[Student] Just above, where you wrote the infinity norm of V_{k+1} minus V_k — it needs to be less than some tolerance that you decide at the beginning?

Yes, you have to decide when to stop. You may decide to stop when the difference of the value functions is less than a given number, or you can use other criteria — for example the percentage change: stop when that difference divided by ||V_k||_inf becomes smaller than something. So you have different choices, depending on whether you have an idea of the numerical values the function takes or not; if not, you may prefer the relative tolerance. The substance of the algorithm doesn't change; the performance might, of course. This requires a little bit of craftsmanship, it's true. Any other questions? Okay, good.
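The one-dimensional cobweb picture above can also be checked numerically. As a stand-in for the Bellman operator, take f(x) = gamma*x + c, a map with slope gamma < 1 (the particular numbers are illustrative):

```python
# A 1-D contraction f(x) = gamma*x + c has slope gamma < 1 and a unique
# fixed point x* = c / (1 - gamma); iterating f from any starting guess
# converges to it, just as value iteration does in |S| dimensions.
gamma, c = 0.9, 1.0
x_star = c / (1.0 - gamma)        # fixed point: f(x*) = x*

x = -50.0                         # arbitrary, far-away initial guess
for _ in range(500):
    x = gamma * x + c             # one application of the contraction

assert abs(x - x_star) < 1e-6     # the iterates have reached the fixed point
```

Each step shrinks the distance to the fixed point by the factor gamma, which is why a far-away start only costs more iterations, never convergence.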
Thank you very much, and see you tomorrow at nine. Thank you. Bye, have a good day.