Just wait for it to be connected. It should be recording now. OK. I hope the connection is more stable than yesterday. Let's hope for the best. So welcome back, everybody. We start today's class with a quick recap of what we did yesterday, and as always, we start very gently. The first fraction of this lecture will be devoted to reviewing yesterday's material and taking a look at one simple example of value iteration. Then we move to the bulk of today's class, which is proving the Bellman optimality equation, which you see here in front of you in red, from a different angle, using an entirely different set of techniques and approaches, ones that are somehow more geometrically intuitive about what is happening with this Bellman optimality equation. It will also be useful very far in the future, say some ten lectures from now, when we will have to introduce special kinds of algorithms to deal with very complex situations in robotics, algorithms that leverage what we do today. Of course, in due time I will refresh your memory about what we are going to do today. But keep in mind that the material we cover today is going to be used in the following lectures as well, which is usually the norm.

OK. So one of the key findings of our last class is what we can write down for the discounted objective. I remind you that our objective has become to optimize this quantity here: the expected value of the sum of the rewards from now to infinity, with the geometric discount factor gamma. I repeat that gamma summarizes the notion of a horizon in a sort of stochastic form, in the sense that you can understand gamma as the survival probability of the process at each time step: at every time step, the process gets killed with probability 1 minus gamma. That is the same interpretation as the more economic one of having a discount factor over the value of things that will happen in the future.

All right. Given that objective, we introduced the value function, which is here at the top of the screen, and then we ask for maximization. After a little set of manipulations, most of them of a formal nature, some inequalities, we end up with the Bellman optimality equation, which is written here. Then we face the question: how do we solve this nonlinear equation? We have a vector in a finite-dimensional space, because we are now dealing with a finite, discrete set of states. So our optimal value function v* is a vector, with one component v*(s) for each state of the system. The Bellman optimality equation is therefore an operator, a nonlinear operator, acting on this vector and returning another vector, and the equation is solved precisely when there is an identity: when we apply the Bellman operator to our vector, it returns the vector itself, which is the equation I wrote here. The key property that allows us to know that there is a unique solution to the problem is that this Bellman operator is contracting in some norm. The simplest way to prove it is to use the L-infinity norm, the sup or max norm, which means that we measure the norm of a vector by taking the modulus of each component and then the maximum over these components.
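As a concrete illustration of what was just described, here is a minimal Python sketch of the Bellman optimality operator acting on a value vector, together with a numerical check of the gamma-contraction in the sup norm. The tiny random MDP, the convention that the reward is attached to the landing state s', and all variable names are illustrative assumptions, not anything taken from the lecture.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# Hypothetical toy MDP: P[a, s, s'] = p(s'|s, a), R[s'] = reward on the landing state.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalise rows into probabilities
R = rng.random(n_states)

def bellman_optimality(v):
    # (Tv)(s) = max_a sum_{s'} p(s'|s,a) [ R(s') + gamma * v(s') ]
    return np.max(P @ (R + gamma * v), axis=0)

# Contraction check: ||Tv1 - Tv2||_inf <= gamma * ||v1 - v2||_inf for any pair of vectors.
v1, v2 = rng.random(n_states), rng.random(n_states)
lhs = np.max(np.abs(bellman_optimality(v1) - bellman_optimality(v2)))
rhs = gamma * np.max(np.abs(v1 - v2))
print(f"||Tv1 - Tv2||_inf = {lhs:.4f}  <=  gamma * ||v1 - v2||_inf = {rhs:.4f}")
```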
So after, again, a little tour of inequalities, we were able to prove that the Bellman operator is contracting with a constant which is gamma. This means that if gamma is small, the algorithm I will describe in a second converges very quickly; the contractivity ensures that it converges quickly, as it should. And when gamma approaches 1, it takes a lot of time, because the contraction is very slow. But that is understandable, because you have to reproduce the behavior of the system over a very long horizon. The example I want to take in the following will be instructive in this sense.

OK. So given that the solution is characterized as the fixed point of a contractive operator, there is a very simple algorithm, which is fixed-point iteration, or value iteration in this specific case, and it is very straightforward. You start with a guess, and then you apply the Bellman operator recursively: you map vectors into new vectors, and you keep going until the variation between two successive vectors, defined in absolute or relative terms, is smaller than some tolerance. Then you are happy: the algorithm has converged. From that value function you can construct your policy at the final step, and this policy is arbitrarily close to the true optimal solution, depending on the tolerance.

So let's see how this works in a concrete example, which I hope will be somehow enlightening. The value iteration example we are going to do is GridWorld, and we will take a very, very simple instantiation of GridWorld. You remember, GridWorld is a Markov decision problem which, not surprisingly, takes place on a grid. This is sort of how Picasso would draw a grid, probably, but your imagination will rectify everything. The goal here is to reach a reward which is placed here: there is a single point on the map which gives some reward R, and there are no other rewards around. I am defining the structure of the problem, so let me repeat it, going slower here. The states are the points of the grid: every tile on this map is a state. What are the actions? Well, the actions are the steps that you can make to every accessible neighbor. Let's say that we move along the verticals and horizontals, no diagonal steps, just for simplicity. So these green objects here, these are the actions.

For simplicity, again, we consider in this case a situation where the system is deterministic. What does that mean? It means that the probability of being in state s', given that I start in state s and take action a, an admissible action, is just, sorry, I should write this better. Let me draw this symbol and then explain what it is: p(s' | s, a) = 1[s' = s + a]. Every time I write something like 1[condition], it means that this object is 1 if the condition is true and 0 if it is false, OK? If you are more familiar with the other notation, I could have written this as a Kronecker delta, delta(s', s + a). It is also called the characteristic function of a set, if you wish; different names for the same thing. I tend not to use delta because, as we will discover later, delta has one particular interpretation in reinforcement learning, which is the temporal difference error. So in order not to mix up notations, I tend to use this indicator notation for one.
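Before we specialize to GridWorld, here is a minimal, self-contained sketch of the generic value-iteration loop described a moment ago: start from a guess, sweep with the Bellman operator until successive iterates differ by less than a tolerance in the sup norm, then read off the greedy policy. The small random MDP, the tolerance, and the variable names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P[a, s, s'] = p(s'|s,a); R[s'] = reward on the landing state (an assumed convention)."""
    v = np.zeros(P.shape[1])                            # initial guess: v0 = 0 everywhere
    while True:
        v_new = np.max(P @ (R + gamma * v), axis=0)     # one Bellman sweep over all states
        if np.max(np.abs(v_new - v)) < tol:             # sup-norm stopping criterion
            policy = np.argmax(P @ (R + gamma * v_new), axis=0)
            return v_new, policy
        v = v_new

# Toy usage on a random 4-state, 2-action MDP.
rng = np.random.default_rng(1)
P = rng.random((2, 4, 4))
P /= P.sum(axis=2, keepdims=True)
R = rng.random(4)
v_star, pi_star = value_iteration(P, R, gamma=0.9)
print(v_star, pi_star)
```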
So what does this mean in practice? It means that if I take the action of going east, I will go east by one step, and if I want to go south, I will go south. There are no errors in the implementation of the actions, no execution errors, no stochasticity in this part, OK?

Given that there is a single reward, this is a situation in which we can basically solve the decision problem immediately. What is the solution? Well, suppose you start from a certain state here, any point on this grid. What is the best outcome for your problem? Remember that our goal is to find the best expected discounted sum of rewards; this is the general expression. Now, as you see here, the structure of the problem is such that, as a matter of fact, the reward does not depend on the actions, and it does not depend on the new state. It just depends on where you sit. It is a very simple structure for the reward function: the reward is 1[s_t = s_hat] times R, where s_hat, let me call it that, is the target position. So this reward object is itself a vector over the space of states: it takes value R if I am sitting on the target, and value 0 elsewhere, OK?

But then you easily realize what my return G is. It is just determined by the time I need to get to the target. If I take this path here, suppose I produce this path, what is the return? Well, it is set by the number of steps I take to get there. In short, if I start from a state s_0, my G is nothing but gamma to the distance, in the sense of the so-called Manhattan distance, the distance along a square grid, between my target point s_hat and my initial point s_0. For any sequence of actions this will be my G; there is no stochasticity here, because everything is deterministic. Very good. Sorry, can you repeat the question? There should be an R, yes, thank you very much: there is an R as a prefactor, so G = R times gamma to the power d(s_hat, s_0). Thank you for pointing that out.

Of course, the key thing is that the best strategy is to minimize d. So this GridWorld problem with a single rewarded point is equivalent to finding the shortest path from any point to the target; it is one other way of rephrasing the shortest path problem. So we are casting another classical problem of computation, finding the shortest path, in this framework. Of course, I am writing this for GridWorld, but you could write it on any kind of graph. Any problem of finding the shortest path from point a to point b on a graph, given that you know the transition structure, so all the edges and what you gain or pay by traversing them, can be cast as a problem which has a Bellman equation and can be solved by value iteration. So this example is simple because it is geometrically intuitive, but the same idea lives in abstract graph spaces as well. OK? Very good.
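To make the closed form above concrete, here is a tiny Python check of v*(s_0) = R * gamma^d(s_hat, s_0), with d the Manhattan distance to the rewarded tile, under the convention used so far that the reward is collected on the tile you are sitting on, and treating the reward as collected only once, as in the argument above. The grid size, gamma, R, and the target position are made-up illustrative values.

```python
import numpy as np

n, gamma, R = 5, 0.9, 1.0
s_hat = (2, 3)                                           # assumed position of the rewarded tile
rows, cols = np.indices((n, n))
d = np.abs(rows - s_hat[0]) + np.abs(cols - s_hat[1])    # Manhattan distance of every tile to s_hat
v_star = R * gamma ** d                                  # closed-form value stated in the lecture
print(np.round(v_star, 3))
```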
So let's see how value iteration works here. If you remember, the algorithm starts by choosing a guess for v0. Well, here we are totally agnostic, and let's say that our v0 is 0 everywhere. It is a rather poor choice, because you know where the target is, since you know the structure of the reward, so starting from 0 throws that information away; but we start there because it is very simple to follow. So what does that mean? What is the value function?

Well, the value function, remember, is a vector over the state space, which means in this case that you attach a real number to every tile of the grid. For instance, if this is your state s here, this slot carries the value v(s). Every slot of this array has a real value attached to it, and that is your value function. We start with the guess that everything is 0.

Then we apply the first step of the value iteration algorithm, which says that v1 is going to be the Bellman operator acting on v0. Let's go component by component and write down what this means. My next approximation at any given state s on my grid is v1(s) = max over a of the sum over new states s' of p(s' | s, a) [ r(s) + gamma v0(s') ]. OK, actually let me make one little correction here, because for pedagogical purposes it is best if we make a small modification: let's say that we have r(s') here instead, so v1(s) = max_a sum_{s'} p(s' | s, a) [ r(s') + gamma v0(s') ]. I will explain in a second why I am doing this. So let's consider the version where the reward depends only on the tile you land on, not on the one you just left, just one step ahead. It simplifies the first steps of the iteration, and this will be clear in a second. Sorry for not having realized this earlier. OK, do we all agree on that?

This is the first step of our algorithm, which means that we have an input on the right-hand side that we know well. Inside the square bracket we can do this computation; it is a linear combination, so we can do it very fast for every a. Then we take the maximum over all possible a's: we repeat this for all possible actions, and for every state we sweep through these values and select the a which gives the largest one. That gives us v1(s), OK?

So what do we need to do in practice? At the first step, the value function was 0 everywhere: v0 = 0. What happens at step number 1? I am not drawing all the grid points, but you know that somewhere around here is the tile where the reward sits. So what will happen? Remember, by our choice, the v0 term is 0, so it goes away. What we are left with is the observation that only the target tile contributes to this sum. To which entries does it contribute? It can contribute only to the states s from which the target is accessible. Because, remember, the reward here is r(s') = R * 1[s' = s_hat], while the transition is p(s' | s, a) = 1[s' = s + a]. The only possibility for this product not to be 0 is that both conditions are verified at once, and if you combine these two indicators, this requires s to be s_hat minus a. So the only entries that can be nonzero on the left are these ones, the tiles from which the target can be reached in one step. So at the next step, your v1 will be 0 everywhere, except on these points here.
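Here is a small sketch of this first sweep in code, on the same kind of toy grid as before. I treat the rewarded tile as terminal (no further value is propagated out of it), which matches the remark below that nothing else happens after the reward is collected; that choice, the four-action set with no "stay" action, and all parameters are my assumptions for illustration.

```python
import numpy as np

n, gamma, R = 5, 0.9, 1.0
s_hat = (2, 3)                                   # assumed position of the rewarded tile
r = np.zeros((n, n)); r[s_hat] = R               # r(s'): reward collected on the tile you land on
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # north, south, west, east; no diagonal, no stay

def bellman_sweep(v):
    """One application of the Bellman optimality operator on the deterministic grid."""
    v_new = np.zeros_like(v)
    for i in range(n):
        for j in range(n):
            if (i, j) == s_hat:                  # assumed terminal: value stays 0 at the target
                continue
            best = -np.inf
            for di, dj in actions:
                si, sj = i + di, j + dj
                if 0 <= si < n and 0 <= sj < n:  # only admissible (on-grid) moves
                    best = max(best, r[si, sj] + gamma * v[si, sj])
            v_new[i, j] = best
    return v_new

v1 = bellman_sweep(np.zeros((n, n)))
print(np.round(v1, 3))   # equals R on the four neighbours of s_hat, 0 elsewhere (including s_hat)
```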
Next step? Well, let's repeat: v2(s) is equal to the maximum over a, and so on. Actually, just one thing before we move on. You notice that in this case what happens is that we are just propagating the value of R, so I can even put the actual values here: this will be R, this will be R, this will be R, and this will be R. After my first iteration, I can say that on the tiles around my target I will get R. This is clear because from those tiles you are one step away from the target: you take that step, collect the reward at the next point, and then nothing else happens afterwards, if you think of it in time.

So what happens at the next step? Well, we have to repeat this again. Taking the sums here, this will be just, explicitly, 1[s' = s_hat], sorry, no, that is not correct at this stage, let me write it in full again. We are repeating the same argument as before, with the same ingredients, but now there is v1 here instead of v0. The consequence is that now we have to look at all points from which we could have reached any of the tiles that already carry a value. So the value propagates backwards: it could have come from here, or from here, or from here, or from here. This allows us to fill in new values at the next stage, here and here again, and this expands all over. Every time you make an iteration, you move one step further away from the target and fill in new values for the value function. At some point this spreading of values reaches the boundaries of the system, and the iteration starts to converge towards the final optimal value function.

This is an exercise that you can try to do by yourselves. In some form or another, maybe for GridWorld, maybe for another system, we will do value iteration explicitly in the tutorial session, which will take place on April 9. But this is just to give you an idea of how it progresses. Any question on that?

Yeah, one question, if I can. Sure, please. About the notation in the last equation, for v2: I didn't get why R is a function of s', and then v1 is again evaluated at s' and not maybe at another state, like the state I am in now. OK, so the argument of v must always be s', because that is the recursion: the value now depends on the value at the next state. That was always there; you see in the Bellman equation it is always s' where the next value is evaluated. The only thing that changes is which form you choose for the dependence of the reward on the triplet, and here we are saying that the reward depends on the place you step into, rather than the place from which you left. OK, because we dropped the dependence on s and the action. Yes, we just assume that it depends on the place you land. Just to be clear, if you assume instead that it depends on the place you start from, it does not change much; you just have one additional step in the iteration, which makes it a little more cumbersome, but nothing else changes. You can think about it; you will realize that it is a minor modification.

OK, there was another question. Yeah, I just wanted to clarify something. So v1, in the first step, is R on the tiles adjacent to s_hat, and zero elsewhere, including s_hat itself? Yes, because there is no way to get there: we did not include the action of staying in place. If we had the action of staying in the same place, then we would update that tile as well. OK, OK.
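Continuing the grid sketch from before, and using the conventions just clarified (reward on the tile you land on, no stay action, target treated as terminal), repeated sweeps show the value spreading outward from the target by one ring of tiles per iteration until the whole grid has converged. The tolerance and the iteration cap are arbitrary choices of mine.

```python
# Reuses the grid, rewards, gamma, and bellman_sweep defined in the previous sketch.
v = np.zeros((n, n))
for k in range(1, 200):
    v_new = bellman_sweep(v)
    if np.max(np.abs(v_new - v)) < 1e-10:        # sup-norm stopping rule
        break
    v = v_new
    print(f"after sweep {k}: {np.count_nonzero(v)} tiles have a nonzero value")

# Converged values equal R * gamma**(d-1) for a tile at Manhattan distance d >= 1 from the
# target: one power of gamma less than the r(s) convention discussed earlier, as noted in the Q&A.
print(np.round(v_new, 3))
```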
Any other question? Just a small one. Our objective, we defined it as a sum from t equal to 0 to infinity. But in this case, since we know the environment, since we know the GridWorld, wouldn't it be better to impose a horizon, like the maximum distance from one corner to the other? Because maybe the agent could end up in an infinite loop without ever reaching the reward.

Well, let me try to understand what your concern is, because this is a planning problem, OK? So, first part of the answer: you can define this problem in different ways. This one, using the discounted version, is one of many. You can define it with a finite horizon if it is long enough, because if the horizon is shorter than the time it takes to reach the target along the shortest path, then you will not be able to learn anything there. So the choice of a finite horizon requires some a priori knowledge about the longest of the shortest paths to the target, but you could define it that way as well; that is perfectly fine. You can define this shortest path problem in other ways too. For instance, you could put a cost on every step in which you are not at the target, and make the target an absorbing, terminal state. This is another way of setting up exactly the same problem; the solutions will always be shortest paths. So there are several ways of casting this problem.

But the second part of your question is the one I think it is more important to pay attention to. Remember that this is a planning problem. The agent is not doing anything yet; it is just looking at the map and saying, right before I start, I want to compute what is the best thing to do. It is really like what Google Maps does for you when you ask for an itinerary: nobody is sitting in the car, nobody is switching on the engine. Beforehand, you say, imagine I want to go to Milan now, what route should I take given the current conditions, the map, the roads, et cetera? And this is a way of solving the problem. So there is no such thing, if I understood your concern correctly, and please correct me otherwise, as the agent failing to make it in time or getting caught in an infinite loop. Unless, of course, your map has some positions which are isolated from the target, barriers which separate some states from the target; in that case there is clearly no way of reaching it from those states. Did I answer your question? Yeah, no, I got it, I got it. Thank you. Sure, no problem.

Any second-order question, in the sense that what I said elicited some other doubt or question? I was thinking that, in this case, we are able to solve the problem because the setting is that we know exactly where the reward is, so that we can do this backward propagation and determine the best path. But what about a situation in which we are not certain about where the reward is, where we want to go? How would that be approached? Like, if we don't have a reward in a single known place, or we have, I don't know, I can't quite imagine it. I don't know if I explained myself. Sure you did, sure you did. This basically makes us jump ahead by several lectures, in the sense that this will be the focus when we deal with the model-free setting, because knowing the map means knowing the model of the environment, knowing what happens when you act. Your question is basically about what to do if I don't know where the reward is, or maybe I know where it is, but it is a stochastic object which may be on and off.
There are several variants of this problem, and all of them boil down to finding model-free solutions for the optimal strategy. I can anticipate, without of course giving you any detail because it is too early, what we will do to deal with these kinds of problems where you don't have a map. Qualitatively, the idea is that you have to use experience: you have to go around by trial and error, and as you move, you get feedback about the environment. One option is that you build the map: you collect information, you say, OK, there was a reward there, let's mark it down, and you construct your model of transitions, and then you can solve the Bellman equation approximately using your approximate map and your approximate model. This is one way to go. But there is an even more radical approach, which bypasses the construction of a model, and which in a nutshell amounts to saying that you can solve this equation here, the Bellman equation, without knowing the transition probabilities. How? Well, let me just use this as a teaser: you can. You will have to replace the a priori knowledge about these quantities with some empirical knowledge, so you will have to sample the environment and, at the same time, try to solve the Bellman equation. That is what temporal difference methods do, OK? OK, that is an interesting teaser. It was a sneak peek of something that will come in due time.

OK, any further question? If not, since we are going to take a very long walk in the second half of the lecture, a derivation which does not require any super special knowledge, although lots of things from Markov processes and linear algebra will come up along the way, my suggestion is that we take the break now, a little bit earlier, and then we reconvene at, say, five to ten, 9:55, and make one single long stretch until we are done. OK, so go ahead and prepare your coffees or teas or whatever. See you in 15 minutes, more or less. OK, thank you. Sure. OK.