All right, so let's resume. Now it's the moment to get into the real stuff, that is, to understand how to solve this decision problem, not this particular one but the whole class of decision problems that I have outlined above. And a key ingredient in constructing the algorithm that solves this planning problem is to introduce the notion of, sorry, the notion of the value of a policy. Okay, so what is the value of a policy? It's a real vector: a vector V which belongs to a space of reals with a dimension as large as the number of states, so each component of this V is V(s), okay? And how is it defined? Well, actually it's a family, because a policy is a sequence of decisions at different times. So in fact what we are interested in is a sequence of these vectors up to time T minus one; this is an object of size number of states by number of time steps, okay? How is it defined? The value of a policy at time t is defined as the expected value of the sum of the rewards from the current time t until the end. Notice the difference with respect to the goal of the problem; let me rewrite this because I need more space on the side. It's a sum of rewards over the triplets of state, action, next state. As such, this is basically the same thing as the objective that we have, the goal, except that it starts from some intermediate time t. If it were starting from t equal to zero, that would be the objective, but now we are also conditioning on the fact that at time t we are in a given state s, and this becomes the entry s of the vector. So at each time you have a vector which has one component for each state of your system, and which expresses that, under the policy that you have at hand, you will get on average this return from that time to the end, okay? Is it clear? Yes, the index t in the sum, the one you put first... Yeah, you're right, I think, perfect. Thanks very much. Okay, so to make the connection with the previous definition: G, in this language, would be the sum over the possible initial states s0, which are distributed according to some distribution rho_0, of rho_0(s0) times V_0(s0). So these V's are conditioned at some starting point; if at time zero you average over the initial distribution, then you get your function G. So this is the connection with our objective, but we need to introduce this sequence of objects at different times in order to be able to optimize G. These are auxiliary objects which are nevertheless extremely important in the following. I repeat the interpretation: if at some intermediate time t you find yourself in a state s, then, according to the policies that you will follow now and at future instants, on average you will collect this quantity, V at time t, depending on the state s you are in. Any questions so far? I'm going extremely slowly because these objects will pop up in different forms many times in the future, so it's important to have a good intuitive grip on them. Sorry, rho zero is the initial distribution of the state? Exactly, exactly. If you remember, when we reviewed the definition of G, we were taking the sum from the initial time zero up to the final time, and the initial states were distributed according to this rho. I said rho, yeah, rho zero, whatever, okay.
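To fix the notation concretely, here is a minimal sketch, not from the lecture, of estimating the value of a policy by simulation. It assumes a hypothetical tabular setup with arrays P[s, a, s'] for the transition probabilities, R[s, a, s'] for the rewards r(s, a, s'), and pi[t, s, a] for the time-dependent policy; all names are illustrative only.

```python
import numpy as np

def mc_value(P, R, pi, t, s, T, n_rollouts=10_000, seed=0):
    """Monte Carlo estimate of V_t^pi(s): average return collected from time t
    to T-1, starting in state s and following the time-dependent policy pi."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    total = 0.0
    for _ in range(n_rollouts):
        state, ret = s, 0.0
        for step in range(t, T):
            a = rng.choice(n_actions, p=pi[step, state])   # A_step ~ pi_step(. | state)
            s_next = rng.choice(n_states, p=P[state, a])   # S_{step+1} ~ P(. | state, A_step)
            ret += R[state, a, s_next]                     # collect r(S, A, S')
            state = s_next
        total += ret
    return total / n_rollouts

# The objective G is then the average of V_0 over the initial distribution rho_0:
#   G = sum(rho_0[s] * mc_value(P, R, pi, 0, s, T) for s in range(n_states))
```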
Okay, thanks. Sure. Okay. Now, the first remark which is in order: remember that V at time t, this vector, looks ahead into the future. So what does it depend on? It depends on the policy at time t, the one which will make me pick the action at the current time, then on the policy at the subsequent time t plus one, and so on until the policy just before the horizon, pi at time capital T minus one. So this is a very complex object which depends on a product of spaces of probabilities: all the decisions that I will take from that time onwards. Just for the sake of clarity, we want to always remember this dependence, so I'm using a rather cumbersome notation, but let's simplify it a little bit. My value at time t, as a vector, depends on the vector of policies from time t to capital T minus one, okay? This is a way to signify that I'm looking at this sequence of probability distributions over actions over the whole future. This is just a definition, nothing is implied, it's just how to read the notation. Okay, the key step in the derivation of the optimality equations, which will be the bottom line for today, is to realize that these objects have a recursive structure. What happens now depends on the present and the future, and we want to unroll this dependence explicitly. The fact that we can express recursive relationships between the values at different times is a direct inheritance of the Markov property: since the system is Markovian, we know that if we start at a given time, everything in the future is determined by the current state alone. So I can unroll this dependence. How does that work in practice? Well, let's start from one vector V_t, which, like I said, depends on the current policy and on all the future ones. What is its definition? Let me rewrite it once more for you: it's the sum, over times t prime that go from t to capital T minus one, of the rewards that I will collect along the way, and this is conditioned on the fact that at the current time I'm in a given state s, okay? Now we are doing something very easy here: we are just splitting this sum into present and future. So this is the expectation of r(S_t, A_t, S_{t+1}) plus the sum, with t prime going from t plus one to capital T minus one, of the same rewards, and all of this is conditioned on S_t equal to s. Okay, now we can take care of the two terms separately, because the average of a sum is the sum of the averages. Let's consider first the average of the first term. The average of this first term becomes: I am in s, I pick A_t according to my policy pi_t at that time, and then the transition probability P gives me the new state S_{t+1}. So I'm writing this down and then we look at it, and if you have doubts you can ask, okay? Are we all on the same page here? Do you agree that the expectation of this object, conditioned on the fact that the current state is s, is this quantity? That is, s is fixed, it's not in the sum, you see; then I pick a according to pi_t; then the pair (s, a) gives me s prime with probability P, and this is the average. Okay, now we're ready for the second step. This object already gives us a glimpse of where we're going: look, this term here looks very much like the value at the subsequent time, because the sum starts at time t plus one and runs until the end. So this is very close to what we are looking for, something connecting the value at time t with the value at time t plus one.
But not quite yet, because here the conditioning is still on the state at time t, whereas the actual value at time t plus one must be conditioned on the state at its own starting time. So we have to manipulate the second part a little bit, and the way to manipulate it is a trick which is very common in Markov processes, namely splitting a conditional average into the average of another conditional average. Let me write it, and then we see if we agree on what I write. This is going to be the expectation of this object: in order to get the actual value, we would like to have this object conditioned on S_{t+1} being equal to something, s prime, and then this is in turn conditioned on S_t equal to s. So what's the trick? We are in state s; this can send us to different states s prime; but from each of these states s prime, from that point on, I have my value function. You can think of it as starting in a state, then the process fans out to several different states s prime, and from each of these states s prime you then go on until the end. So this object here is equivalent to this one, where the external average is taken according to the fact that s prime is drawn from P(. | s, A_t). If I unroll this explicitly, I repeat the first part as it is. Professor, please, can you explain again the expectation of an expectation? Yeah, let me finish writing this out explicitly and then I will explain it again. Let me try to explain it in the following way, graphically; let me make a break here. Suppose these are my states, okay? Here I am at time t, here at time t plus one, and these are the possible states of my system. If I'm in state s here, I can jump to different possible states, okay? And then from each of these possible states onwards I can again jump to different states, okay? So the idea of this recursion is that the average of anything that happens in the future, starting from here, is the sum over everything that happens starting from each of those states, combined with the probability of going from here to there, okay? You just cut the average into two steps: first you condition on this intermediate state, and then you average over all possible outcomes of this layer, okay? You are explicitly unrolling this. Okay, thank you. Is that a bit clearer? Okay. Yes sir, thank you. Okay. So there's really nothing very deep here, okay? We are just splitting the process into two steps, from now to the next step, and then, thanks to the Markov property, from the next step to whatever else follows, okay? Now you see that this term has these two parts in common, so we can group them together. Is there any question in the background? I'm a little bit confused about the s prime: it is drawn from P(. | s, A_t)? Sorry? Is it A_t? I'm a little bit confused about the distribution of the s prime in the expectation. S prime? Yes, and I don't understand why it involves A_t. Why it involves A_t? No, no, in the expectation expression. Here? Yeah, yeah, I mean the expectation of the expectation: s prime is drawn from the distribution P(. | s, A_t), and I don't get why it is the action at time t. Well, because s prime here is meant to be the state at time t plus one. What I'm denoting s prime is the index here, and where you end up depends on the previous state; of course it also depends on the action and the policy, I'm not writing it out, but it does, okay? So it depends on two steps: you take the action, and then the action sends you to a new state.
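For reference, here is a compact way of writing the "expectation of an expectation" step just described, in the notation used above; this is my rendering of the spoken derivation, obtained from the law of total expectation together with the Markov property.

```latex
\mathbb{E}\!\left[\sum_{t'=t+1}^{T-1} r(S_{t'},A_{t'},S_{t'+1}) \,\middle|\, S_t = s\right]
  = \sum_{a}\pi_t(a\mid s)\sum_{s'} P(s'\mid s,a)\,
    \underbrace{\mathbb{E}\!\left[\sum_{t'=t+1}^{T-1} r(S_{t'},A_{t'},S_{t'+1}) \,\middle|\, S_{t+1}=s'\right]}_{=\;V_{t+1}(s')}
```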
Okay, so if I regroup all this into a single expression, I get the following recursion formula: the value at time t at state s, which depends on all the policies from now to the end, is the sum over the actions, picked according to the current policy from state s, and over the next states s prime, given by the transition probability, of the immediate reward, what I get at this step, plus what I will get from the new state onwards. In symbols, V_t(s) = sum over a of pi_t(a|s) times sum over s' of P(s'|s,a) times [ r(s,a,s') + V_{t+1}(s') ]. This is our first important result. It is important because the Markov property tells us that we can look into the future by successively chaining these value functions: it's a way to connect the now with the future in single steps, thanks to the Markov property, and this is the key to predicting what will happen in the future. Any question before I move on? The next key step is to remember that we want to look for optimal solutions. Remember the initial goal: we wanted to find the argmax, over all policies, or sequences of policies, of G, which, I repeat, is the sum over the initial states of the distribution of the initial states times V_0, which depends on all of this. So what we're going to do now is to focus on what happens if we take the maximum over this equation. We are in a current state s at time t and we ask: what is the largest amount of reward that I can collect, on average, from now on? The answer to this question is to define the following object, V star at time t, defined as the maximum over all policies to be taken from now to the end of my V_t(s). So if I know this quantity, I know the best performance I can get from that state at that time, and as a byproduct I will also find the optimal policies. The next step is to find a recursive relation which is valid not for every policy, but for the optimal policy, for the optimal sequence of decisions. How do we do that? We just take the maximum of this equation, the maximum of both sides over all possible policies; that is, we derive an equation for this new object. So we proceed from here, and by using this equation we know that this thing is also equal to the maximum of the right-hand side. I'm going extremely slowly with all the steps, but if it's not clear enough, stop me at any time. Here I'm replacing the recursion relationship, okay? So this equality follows from the recursion, and this object here is what I called the recursion a moment ago. So, like I said earlier, I repeat it: here is the sum over actions, picked according to the policy taken at time t, and over new states, of the current reward plus what will happen from t plus one on, which depends on all the decisions from t plus one to capital T minus one, evaluated at s prime. Okay, this is just what we had a few lines above. Now the next trick is also extremely simple: we split this maximum into two separate steps. One is the maximum over the current policy, which we take last, of the maximum over the policies from t plus one to the end, okay? The maximum over all decisions in the future is the maximum over what I do now of the maximum over what I will do in the future, okay? Because two maxima are interchangeable: if you have two maxima, you can take one first and the other later, and the operation is completely commutative.
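As a side note before finishing the maximization, here is a minimal sketch, under the same hypothetical tabular arrays as before, of how the recursion just derived evaluates a fixed policy: sweeping backward from the horizon, with V_T = 0, each V_t is obtained from V_{t+1} in one pass, so a single backward sweep replaces the many rollouts of the Monte Carlo sketch.

```python
import numpy as np

def evaluate_policy(P, R, pi, T):
    """Backward-recursion policy evaluation:
    V_t(s) = sum_a pi_t(a|s) sum_{s'} P(s'|s,a) * (R(s,a,s') + V_{t+1}(s'))."""
    n_states = P.shape[0]
    V = np.zeros((T + 1, n_states))             # V[T] stays zero: nothing after the horizon
    for t in range(T - 1, -1, -1):              # backward in time: T-1, T-2, ..., 0
        # Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + V[t+1, s'])
        Q = np.einsum('san,san->sa', P, R + V[t + 1][None, None, :])
        V[t] = np.einsum('sa,sa->s', pi[t], Q)  # average Q over the policy at time t
    return V
```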
So, back to the maximization: this joint maximum can be split into a sequence of maxima, one after the other, okay? Let me just not repeat what is inside the parenthesis here. The next step is to realize that the first part does not depend on anything I will do from time t plus one on; it only depends on what I do now. So this inner maximum actually goes across the first term and hits only the second term. This object will therefore be equal to the maximum, over the policy at the current time, of the sum over actions and next states; this first part is not modified, and the other part is: I'm being super explicit here, the inner maximum has just gone through the averages, through all the factors that don't depend on the parameters we want to optimize over. But that inner quantity is nothing but our V star at time t plus one. So let me summarize what we have so far: we have another recursion relationship, which reads V*_t(s) = max over pi_t of sum over a of pi_t(a|s) times sum over s' of P(s'|s,a) times [ r(s,a,s') + V*_{t+1}(s') ]. This again has a rather intuitive meaning: the best I can achieve now, assuming that from the next time on I will stick to the optimal policy, depends only on what I do now. It's like saying: assuming I will do everything right from t plus one on, the only thing left is to optimize what I do now. This is a verbal interpretation of this equation. Now we are almost done. There is one very simple final step: notice that the object in round brackets here, let me highlight it, is linear in pi_t. This expression depends on pi_t only through a linear function. So this is now a very simple optimization problem, because we have to optimize a linear function over a convex domain. The convex domain comes from the fact that pi_t belongs to a simplex, which means that all the pi_t(a|s) are nonnegative and they sum to one. This is the normalization property, which means that each of these pi's lives in a hyper-tetrahedron, the space of normalized probabilities, and this holds for every state. So it's a linear optimization problem over a convex, compact set, okay? It's quite intuitive, and you can prove it: where will such a function attain its maximum? A linear function on a compact convex domain can only attain its maximum on the boundary, and here, in particular, at a vertex of the simplex, which corresponds to a single action. There is one coordinate, one action, along which this quantity is maximal. This is because the simplex is a sort of triangle, and you move as far as possible along the direction of largest variation, okay? As a consequence of this linear optimization, the equation becomes the following one: the maximum over all possible policies is replaced by a maximum directly over actions. This means that the optimal policy pi star at time t is deterministic: there is a single action that maximizes this. So V*_t(s) = max over a of sum over s' of P(s'|s,a) times [ r(s,a,s') + V*_{t+1}(s') ], I repeat for the nth time the same thing. And this equation, which is very important, is one of the many forms of the Bellman optimality equation. So how does it work in practice? What is this? It's a recursion relation, and it's nonlinear. Why is it nonlinear?
Because there is this max operator here. So how do we solve this recursion relation in practice? Well, the simple idea is to start from the end of the process. Let's see what happens when t is equal to capital T minus one. We are at the end of the rope, the end of our process: from that point on there will be no rewards, there will be nothing, it's the end of the world. Therefore the value beyond the horizon is zero, okay? There's nothing to expect from the future from that time onwards. So what does the recursion relation become in this case? It's very simple: the best value at time capital T minus one, for every state, is just the maximum over a of the sum... sorry, here I have messed up the equation, this is confusing. Since there is a maximum, I don't have the policy here anymore, because I have already replaced the optimal policy, which is to take this maximum. Sorry for the confusion. Okay, so this is going to be the max over a of the sum over s prime of P(s'|s,a) times, what? Well, just the immediate reward. So what does that mean? It means that when we are just one step away from the abyss, we don't care about what will happen in the future; we are only interested in maximizing the immediate reward. If your horizon is just one step, there is nothing smarter to do than optimize the average immediate reward, what you will get now as a result of your action, okay? And this is a very simple optimization problem, because you know the transition probabilities and the rewards: you just have to construct this simple matrix and then take the largest entries out of it. So the calculation of the value one step away from the end is very straightforward. But then, given this first step, which starts from the end of time, you move backward in time: you take the quantity that you have just calculated and plug it in here, and taking the maximum of this other sum takes you one step further back in time, and so on and so forth. At every step you produce a new value function, going from the end to the beginning. This procedure, which takes you from capital T minus one to T minus two, producing all the optimal values along the way, and eventually down to time zero, is called dynamic programming, a somewhat bizarre and fashionable name that came up in the fifties for a series of historical reasons; but it has stuck, and this is how the procedure of backing up from the final time to the initial time is now called. By the way, these equations and others are due to the work of Richard Bellman, and this happened around the 1950s: all the formalization of decision processes and optimality equations basically dates back to the fifties. Okay, so... Could you re-show the equation for V star? Because you cancelled the pi, but I think it should also appear in the equation, so I'm a little bit lost. You're asking about the equation I'm showing now, here? Yes, this last piece, we derived it from here to here, and here the pi term is present, so I don't understand why you cancelled it. Okay, let me take a small side remark to explain what is happening here; I'll make it over here. Thanks. Sure, no problem. So let's consider a very simple situation in which you have just two actions, okay? Either zero and one, or one and two, whatever you want.
So what does a function like this one look like if you have just two actions? This kind of problem looks more or less like the maximum, over pi_1, of pi_1 times some coefficient alpha plus the other probability, pi_2, times another coefficient beta, and since the policy is normalized, pi_2 is equal to one minus pi_1. So this is a toy version of the kind of optimization problem we want to do here. How does this function look? Well, it's a linear function, and its derivative is alpha minus beta, okay? So it's either increasing or decreasing; it's flat only if alpha is equal to beta, but in that case every point is good, okay? You don't care, it's a tie. Otherwise, when alpha is larger than beta the function is growing, and since it's growing, to maximize it you take pi_1 going to one; and if pi_1 goes to one, the value of the function is alpha, okay? If it's the opposite, pi_1 goes to zero and the value is beta. So when you have such a linear optimization problem, the result is that the maximum is just the maximum between alpha and beta, and the argmax is either action one or action two, okay? Your optimal policy in this case is simply: take action one if alpha is larger than beta, take action two if alpha is smaller than beta, okay? It's nothing more complex than this, except that it takes place in a high-dimensional space, the space of policies. Is that clear now? Yes, thank you. Okay, so that's the step that brings us from here to here. Once you realize this, you see that the maxima I'm writing here play the role of that alpha and beta: they are the coefficients that appear here and multiply the policy. Instead of having two lines over a segment, a picture in which your pi_1 goes from zero to one and your function is a straight line like this, more generally, if you have three actions, the domain is a triangle and the function you want to optimize is a plane; and with a larger number of actions, you get hyperplanes over simplexes, simplexes being the geometrical name for the spaces on which probability distributions live. Okay, thank you for asking this question, so I could make myself clear. So you see the idea. We now have an algorithm which allows us to back up from the final time to the initial time by solving a sequence of equations which are nonlinear, but in fact the only nonlinearity there is is taking a maximum. And every time we go one step back, the argmax of this function is the optimal decision to be made, okay? So at any time, the best action to take at time t in state s is just the argmax, over all possible actions, of the same thing I wrote before: the sum over s prime of P(s'|s,a) times the reward plus the optimal value at the next time. So in one sweep, starting from the end, you make the first computation, the simple one at time capital T minus one, and when you take the maximum you also record the argument of this maximum, the action that optimizes it, for each s. There is a maximum for each separate s: there are as many optimization problems as there are states, and for each of these states you pick an action; this is your optimal action. If there are ties, you pick whichever you want according to some rule; they will all give you the same result.
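To make the backward sweep concrete, here is a minimal sketch of the full procedure, again under the same hypothetical tabular arrays P and R: starting from zero value at the horizon, it applies the Bellman optimality recursion V*_t(s) = max_a sum_{s'} P(s'|s,a) [ r(s,a,s') + V*_{t+1}(s') ] and records the argmax action as the deterministic optimal policy, with ties broken by taking the first maximizer.

```python
import numpy as np

def backward_induction(P, R, T):
    """Dynamic programming over a finite horizon: returns the optimal values V*
    and a deterministic optimal policy, one action per (time, state)."""
    n_states, n_actions, _ = P.shape
    V = np.zeros((T + 1, n_states))              # V[T] = 0: end of the horizon
    policy = np.zeros((T, n_states), dtype=int)  # optimal action for each (t, s)
    for t in range(T - 1, -1, -1):
        # Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + V[t+1, s'])
        Q = np.einsum('san,san->sa', P, R + V[t + 1][None, None, :])
        V[t] = Q.max(axis=1)                     # best achievable value from (t, s)
        policy[t] = Q.argmax(axis=1)             # greedy action, ties go to the first max
    return V, policy
```

One sweep costs on the order of the horizon times the number of (state, action, next state) triplets, which is where the computational burden discussed next comes from when the state space is large.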
So, as we said, you unroll this procedure one step back in time and you repeat: in one sweep from the end of time to the beginning of time, you solve your problem. This is a very powerful procedure which, however, as you can imagine, can be very demanding from the computational viewpoint if you have a very long time horizon, which often happens to be the case. You can save on memory, because once you have completed one step of the optimization you can forget about the value function you computed for the later times: every time you make a step backward you can discard the old value function, although you do have to keep the actions, which are, of course, discrete here, so that's fine. But your state and action spaces may themselves be very large. So the fact that you have a procedure that solves the problem, but whose computational cost blows up very quickly when the time horizon and, above all, the state space become large, is what is called the curse of dimensionality. It's a very powerful theoretical engine, but it very quickly probes the limits of your computational power if you want to apply it to very complex systems, let alone the fact that you need to know the transition probabilities and the rewards, which is an entirely different matter; we have been assuming that we know everything about the environment from the beginning, okay? That is another kind of problem: even when you know everything, there may be computational limits on what you can do.

Okay. It's been quite a ride, so it's time to let these ideas settle, okay? I'll let you think them over, and we can get back to them when we meet again, to make sure that all the wrinkles are ironed out so we can proceed further. But I would like to close with something lighter, and introduce one decision-making problem which is very popular, very famous for other reasons, and which can be solved by dynamic programming in a smart way. In fact, the first of the tutorial sessions will be about solving this problem with dynamic programming. What kind of problem am I talking about? I'm talking about the traveling salesman problem. Have you ever heard about it? It's a classical problem in planning, which is stated as follows. Suppose you want to travel and there are N cities that you want to visit, under the following conditions. First, each city must be visited exactly once, okay? If I give you this list of cities, you cannot skip any of them and you cannot visit any of them more than once. Second, you must return home at the end of the trip, okay? So you must make a loop. And third, which is the part related to optimization, you want to minimize a cost: it might be flight tickets, it might be time, it might be CO2 emissions, depending on everyone's sensitivity, okay? We are considering a situation in which it is basically possible to design every possible path across the cities, because there are more complex problems, like the seven bridges of Königsberg, in which some of the connections are prohibited, and the problem may even be impossible under certain conditions, because you cannot find a path that avoids passing twice through the same spot. But let's forget about that: we are working in a setting where every city can be reached from any other city, okay?
Just to clear away the topological mess of paths and work only on the idea of finding the best-cost tour. So what does it mean to minimize a cost? You have a table indexed by pairs of cities, and for each pair you have an airfare, okay? In principle you could formulate the problem with different costs for going from A to B and from B to A, so the costs could be two matrices, one per direction, but let's forget about this detail, it doesn't really matter: let's say that if you travel from A to B or from B to A you pay the same price, okay? So that would be... I have a question about the problem. Yeah. So you start from your home city and you have to go to N cities; and every time you go to, say, city A, do you have to come back home first and then go to another city? No, you have to make a single loop: a single loop that goes through all the cities and then comes back. Okay. So these are just one-way fares that I'm writing down, okay? So basically you travel through all the cities with a fixed starting and ending point, which coincide? Yes; and actually what you can find is a solution that is valid starting from any city, okay? It's valid for any starting point: in one shot you solve the problem for every starting point, you don't have to focus on a single one, okay? So on the diagonal, of course, the costs are zero, but then you have the entries: this is the cost of going, say, from two to one, this is the cost from one to two, and they may or may not be symmetric, depending. These are your data, and the goal is to find the optimal way to move around and close the loop. This is not an easy problem, okay? It becomes very challenging when N is large enough. Can you see why? Because there are of the order of N factorial possible paths that you can take: if you start in one place, you have N minus one choices for the next city, then N minus two at the following step, and so on. So the space of paths over which to search is huge, and doing this exhaustively is not possible: you cannot enumerate all possible paths for any reasonable N. So you need to do something different. The last thing I will leave you with is that this problem can be solved efficiently if one maps it into a decision-making problem, in particular a Markov decision problem, and uses backward induction, that is, dynamic programming. And that's what I was thinking: if, for example, your home city is the last city, then you just go from one city to the next, to the next, and in the end it will be a loop, and you just pick the cheapest. That's the basic idea, but there are subtleties, in the sense that you must first define what the states are: defining what the states are, and having them be Markovian, requires some thinking, okay? So everything will be unveiled in the tutorial, or everything will be unveiled if you just Google for the traveling salesman problem and scroll down the page. But okay, I'll leave you the choice of what to do: either be patient and discover it by thinking a little bit about it, or be impatient and look it up straight away. Either way, it's useful.
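Just to make the N factorial explosion concrete, here is a minimal brute-force sketch with a made-up cost matrix; it simply enumerates every loop from a home city and keeps the cheapest, which is exactly what becomes infeasible for moderate N. This is not the dynamic-programming formulation left for the tutorial.

```python
import itertools

def brute_force_tsp(cost):
    """cost[i][j] = fare from city i to city j; returns (best_cost, best_tour)."""
    n = len(cost)
    best_cost, best_tour = float('inf'), None
    # Fix city 0 as home and permute the remaining n-1 cities: (n-1)! candidate loops.
    for perm in itertools.permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        c = sum(cost[a][b] for a, b in zip(tour, tour[1:]))
        if c < best_cost:
            best_cost, best_tour = c, tour
    return best_cost, best_tour

# Example with 4 cities (symmetric, made-up fares):
# cost = [[0, 10, 15, 20], [10, 0, 35, 25], [15, 35, 0, 30], [20, 25, 30, 0]]
# brute_force_tsp(cost)  ->  (80, (0, 1, 3, 2, 0))
```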
Okay, I think we're good for today. Right, I have one question. Yeah, sure, please. Why didn't we try to solve the optimization problem, for instance, using Lagrange multipliers? Why did we introduce all of these values, the value functions and everything? Good point. We will do that for the other formulation, the one with discounts, okay? We could use it also for the time-dependent case, but it's slightly more cumbersome, because there are many more things to optimize over. In the next lectures I will discuss the discounted case, in which there is a unique, time-independent policy that solves the problem independently of the time, okay? Which makes the problem more compact. And we will derive the optimality equation in at least two different ways, and one of those will be through Lagrange multipliers. So we will see an array of techniques. Since this concept of optimality is so central, we will spend a comparatively large amount of time discussing it, approaching it in different ways, some more mathematical, some more operational. This also gives me the opportunity to introduce some tools that will pop up later in the course as well, okay? So it's also a trick for me to prepare the ground for future developments. Okay, any other question? If not, thanks everybody. Have a nice weekend and a good start of the week, and see you next Wednesday. Bye, everybody. Thank you, goodbye. Bye. Thank you. Bye. Thanks, have a nice weekend. You too.