So, I was at the library a couple of weeks ago, walking around the science section of the public library close to my house, and there was a book by a physicist — I think his last name is Weinberg, I'm not sure — with a chapter on equations. He had an interesting comment. He said that cathedrals are basically our link to the past when we think about architecture. If you want to know something about what life was like three, four, five hundred years ago, architecture is often one of the things left behind from that time. So when you go to a cathedral, or any other public building or palace, you have a link to the era that expressed itself by building it. And he said that equations are kind of like that for scientists, because great equations provide you with a link to the past: at the time some human being wrote that equation, they had some idea about what they were applying it to, and it may turn out in the years and centuries afterward that it no longer applies to that thing at all. Take Maxwell's equations. If you read the original paper, Maxwell is talking about the ether; we don't talk about the ether anymore in physics, but Maxwell's equations were written in a timeframe when the ether was the accepted way of describing how physical things interacted with each other. Nevertheless, Maxwell's equations still apply today, because they describe something physically real that Maxwell himself perhaps wasn't thinking about. What I liked about that chapter was this idea that great equations are a link to the past, and in science great equations are rare things. So today we're going to see another great equation. We've already learned one — the idea that Kalman had in the late 1950s, which led to the concept of state estimation as a way to minimize posterior uncertainty. Today we're going to see an equation that came about ten years or so later, from Bellman, that describes how to find a policy: how do you find commands that, over the long term, minimize some cost, or equivalently maximize some overall value? The idea of a policy is based on the notion that it makes sense to have a way to respond to the states around you, so that if you find yourself in this state you do something different than in that state, and both are the best you could do in order to arrive at whatever goal you have. You have some long-term goal, and you don't want to prescribe your actions from where you are now all the way into the future, because who knows — there might be perturbations, there may be uncertainties. So what you want is not a list of actions but a policy. A policy is a way of thinking about how you respond to a state: if you find yourself here, what's the best command you could give? If you find yourself there, what's the best command you could give? So it's a feedback controller that we're going to be describing. We have a long-term goal, we find ourselves here — what's the best policy from here on in order to arrive at our destination?
A policy is more like — if you want to think about it in colloquial terms, when people say honesty is the best policy, they don't mean that honesty at this particular state is the best thing to do; they mean it's the best policy in some general way, no matter where you find yourself. So the way we're going to be thinking about control, in the framework we've been describing, is as follows. Today's lecture covers chapters 12.1 and 12.2, and in particular I'm going to cover a document I wrote to help you understand the Bellman equation. It's on the webpage — I think it's just called "Bellman equation" or something like that, about four pages — and you should look at it; it's unfortunately not in your book. So: we have some cost that we want to minimize. Let's see if I can find a good pen. No. For example, we might say that from the beginning of the movement, k equal to zero, to the end, k equal to p, we want to minimize some kind of cost. Maybe what I want is to get there as effortlessly as possible — a sum over all the commands of u transpose L u at each time step — plus, since what matters to me is to be as close to the goal as possible, the state at the end of my movement, x at time point p, should be as close to this goal as possible. Something like this: the first term would be my effort cost, call it J_u, and the second my state cost, call it J_x. You know, I want to be in the Bahamas, and the closer I am to the Bahamas the better, and I want to get there with as little effort as possible. So we begin with something we want to minimize — an accumulation of some kind of cost. Then we have a model of how our actions change our state: we believe our state is going to change based on some model, maybe with some kind of noise at the end of it — I could also have different kinds of noise here — and then I have some sensory system that measures these states. This is what we call a forward model: I can predict, if I produce some action, what's going to happen. Then I have a way to estimate my state — where am I? I have a prior belief about where I am, I make a prediction about what should happen if I generate some input u, and then I make some observation y. My posterior belief combines my prior with the difference between the observation I made and my prediction, weighted by the Kalman gain. And finally, what I need to do is actually generate the command u in such a way that, over the long term, I minimize this cost. So my u is going to be some function of my belief about where I am, and something that transforms that belief into my motor commands u. In the case of this particular cost and this particular linear dynamical system, that transformation is going to be linear, and it is my feedback control policy. What we're after is this equation: what's the way to transform your belief about where you are into motor commands u such that you minimize the cost over the long term? How should you do this? So we have the problem of an accumulating cost, we find ourselves in some state, and we want to minimize that cost over the long term. And the idea is to find a feedback control policy.
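To make the spoken description above concrete, here is one way to write the setup in symbols. The specific matrix names (A, B, H, L, T, the Kalman gain K, and the feedback gain G), the exact noise terms, and the sign convention on the policy are my notation and assumptions, not something fixed in the lecture — this is a minimal sketch, not the definitive formulation.

```latex
% A minimal sketch of the setup described above (assumed notation):
% cost over a movement of p steps, forward model, state estimation, linear feedback policy.
\begin{align*}
J &= \underbrace{\sum_{k=0}^{p} \mathbf{u}_k^{\top} L\, \mathbf{u}_k}_{J_u \ \text{(effort)}}
   \;+\; \underbrace{(\mathbf{x}_p - \mathbf{g})^{\top} T\, (\mathbf{x}_p - \mathbf{g})}_{J_x \ \text{(endpoint state)}} \\[4pt]
\mathbf{x}_{k+1} &= A\,\mathbf{x}_k + B\,\mathbf{u}_k + \boldsymbol{\varepsilon}_k
  && \text{(forward model with noise)} \\
\mathbf{y}_k &= H\,\mathbf{x}_k + \boldsymbol{\eta}_k
  && \text{(sensory observation)} \\
\hat{\mathbf{x}}_{k|k} &= \hat{\mathbf{x}}_{k|k-1}
  + K_k\,\bigl(\mathbf{y}_k - H\,\hat{\mathbf{x}}_{k|k-1}\bigr)
  && \text{(posterior = prior + Kalman gain $\times$ prediction error)} \\
\mathbf{u}_k &= -\,G_k\,\hat{\mathbf{x}}_{k|k}
  && \text{(linear feedback control policy)}
\end{align*}
```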
Something that says: you tell me where you think you are, and I will tell you the best commands to produce in order to minimize this cost. So what we're after is this matrix G. Now, that's for linear systems — there you get a linear transformation. Today what I'm going to do is show you the problem in a nonlinear system, but one small enough that we can get our hands around it, so that in principle you understand how the Bellman equation works: how we find the transformation from the state to the commands u. That's the basic idea. Any questions before I get into it? All right, let me give you some examples of why this kind of thing is reasonable. We've been talking about saccades, which are basically rapid movements of your eye. A typical saccade — the position of your eye versus time — might look something like this, where the whole thing is on the order of maybe 100 milliseconds. Now, oftentimes when you move your eyes, you blink as you move them. I'm looking here, I move my eyes over there, and if I do it while I'm blinking, there's something that pushes down on my eye: your eyelids, when they fall on your eye, push it down. So that action is like a perturbation. What one can do is put in contact lenses that have a coil around them; you sit in a magnetic chamber and the position of your eye is sensed through this little wire on the eye, so even when the eye is under the lid you can measure its position. What happens then is that if you have a blink — your eyelid closes for a moment and pushes on your eye — and you look at what happens to the eye, the eye gets deflected; it gets perturbed. But what's interesting is that some mechanism in your brain corrects for that perturbation, and the reason, we think, is that this isn't some unexpected perturbation acting on your eye — you blinked. It wasn't me pushing down on your eye; it was you yourself. You're moving your eyes, some part of your brain says close your eyelid, and so it can somehow compensate for the resulting perturbation. And the point is this: even over the course of 100 milliseconds — which is too short a time to use sensory feedback, because all you'd have is feedback from your eye muscles telling you the eye got perturbed, and 100 milliseconds is too short to respond to it and correct it — if you yourself generate the action, then some part of your brain says: wait, these motor commands being sent are going to have a consequence, and I can respond to it, I can correct it, because I can monitor these commands and anticipate that something is going to act on my eye. So here's an example of feedback control. It's not just that some motor commands are generated for the eye and are blind to everything else that happens; if something perturbs the eye, the nervous system responds to it. Another example is something that we did, sort of like this, but instead of having people blink we used stimulation of the brain. With transcranial magnetic stimulation you can give a pulse pretty much anywhere on the brain: you take this coil, place it on the head someplace, and go zap. And if people are making a saccade right around the time that zap takes place, it appears to engage startle reflexes.
Startle reflexes are complicated: if the startle is given well before a movement, it makes people jump, but if it's given during a movement, it seems to inhibit that movement. So with TMS it looks like the saccade starts, the eye begins moving, then it comes down almost to a stop, and then it corrects itself and finishes. That's with TMS; the other case was with eyelid closure. I give you these examples because they're important for demonstrating that the brain has mechanisms in place that monitor and correct the actions that are being produced, and they tell you that even the fastest of all movements, saccades, are under feedback control. Okay, so this feedback control is what we're going to be studying. Feedback control means that the actions you produce depend on your estimate of the state you're in; they're not just some sequence of actions programmed from beginning to end. The commands depend on where you believe you are, and you generate a different response depending on where your body is. The final example I want to show you concerns the way you move your eyes naturally. Usually you don't just move your eyes; you also move your head. When I look from here to here, I move both my eyes and my head, and the way that works is as follows. If you look at the way most people shift their gaze from one location to another, they move their eyes first, and then the eyes rotate back in the head as the head rotates; the sum of the two — eye-in-head plus head — is called gaze, and that's where you're looking. Usually your eyes end up roughly centered in your head: when I'm looking to the left the eyes are almost centered in my head, and when I'm looking to the right the eyes are centered in my head. But when I shift my gaze from place to place, what happens is that first my eyes get there, then my head gets there, and as my head is moving my eyes rotate back. Okay, so in the 70s my former mentor, Emilio Bizzi, did an experiment in monkeys where he looked at what happens when the animal is about to make a movement and the head is not allowed to rotate. The eyes would rotate, but on random trials the head was held in place — you can imagine a little device that holds the head, the brake is applied, and the head doesn't move. What happens in that scenario is that the eye goes over and stays there until the head is released; only then do the eyes come back. So it's again a feedback system: the commands to the eyes aren't just set in motion and left to run by themselves — they depend on what's happening to the head. If the commands to the head didn't cause any movement of the head, the animal waits until the head actually moves, and only then do the eyes come back. So that's the state here: the state of the head and the eye. And the command to the eye is only going to let the eyes come back if the head has been allowed to move. Okay, so what we're interested in is what these gains look like — how do we divide up these gains?
And the way you can think about it is that you always have redundancy, right? For example, we have many muscles in our body, we have many neurons. One of the redundancies we talked about last time was this: suppose you have two effectors, one arm and the other arm. You can imagine that if you're doing something that depends on both arms, then the feedback gains should be such that when I need to respond to something, if the other arm can help, I should respond with both arms. Because if you push my right arm and my objective is to maintain some location here — say the position of this thing I'm holding, which is the sum of the right and the left arm — then if you push me with one newton and I respond with one newton from the right arm alone, that's fine; but if I respond with half a newton here and half a newton here, so that together they produce one newton, then the effort is lower, right? Because half a newton squared plus half a newton squared is less than one newton squared — 0.25 plus 0.25 is 0.5. Does that make sense? So my feedback gain would be such that, despite the fact that only this arm got perturbed, I'm going to respond with both of them, because that's the best way to minimize that kind of cost. So that's the nature of what we're after. Say I have a state described by the position of my head and the position of my eye, and the objective is to make the sum of these two equal to the gaze — that's the cost: I want the gaze to be on the target. Now, if you hold the head on one trial, the eyes move there and they don't come back down; if you don't hold the head, one moves there, the other moves there later, and the two get aligned. This tells us that if we could build a feedback controller that does that, it would of course be much more complicated, but it would be more interesting than something that just says: generate these commands for the eye, generate these commands for the head, and that's it. Our objective is to go after this feedback control, and what we want to do is minimize some kind of cost that looks like this. Okay, any questions? All right, let me show you how to do it. Typically what we have is what's called a cost per step. So for example, in the cost that I wrote up there, the cost per step at time step k — call it alpha of k — is u at time step k transpose, times L, times u of k. In that case my state cost was only at the endpoint, so it only contributes a cost when I've run out of time, but in principle I could also have a term like x at time point k transpose, times T at time point k, times x at time point k. I took out the goal g because it doesn't really matter — it's just a constant — and you can always incorporate g into the state x. Just to give you a sense of this: my state could be my position, my velocity, whatever you want — say it's a second-order system — and then I could put the goal in there as well; the goal can be part of the state. And then, if I want to represent that endpoint cost, I can just choose the matrix T so that it computes the difference between the current position and the goal. So g itself isn't very interesting; we can always incorporate it into the state.
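As a concrete illustration of that last point — writing the per-step cost and folding the goal into the state — here is one way it can look in symbols. The particular state ordering and the entries of T below are an assumed example, not something written on the board.

```latex
% Cost per step, and an assumed example of absorbing the goal g into the state.
\begin{align*}
\alpha_k &= \mathbf{u}_k^{\top} L\, \mathbf{u}_k \;+\; \mathbf{x}_k^{\top} T^{(k)} \mathbf{x}_k,
  \qquad T^{(k)} = 0 \ \text{for } k < p \ \text{if only the endpoint is penalized.} \\[4pt]
\mathbf{x}_k &= \begin{bmatrix} \mathrm{pos}_k \\ \mathrm{vel}_k \\ g \end{bmatrix},
  \qquad
T^{(p)} = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{bmatrix}
  \;\;\Rightarrow\;\;
\mathbf{x}_p^{\top} T^{(p)} \mathbf{x}_p = (\mathrm{pos}_p - g)^2 .
\end{align*}
```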
All right — and this T has a superscript k because in some scenarios there's only a cost at the end, so all the other T's are zero and only the one at the end is nonzero; but in principle you could have a cost per step for the state too. So this is a more general way of writing that cost: a cost per step. Okay. So, first of all, we're going to assume a finite horizon. What this means is that our total cost is the sum, from k equal to zero to p, of alpha of k — meaning that p is the end; there's some time point at which we're going to be done. That's called a finite horizon, so the kind of problems we're going to be considering are called finite-horizon optimal control problems. Now remember that on Wednesday I suggested to you that time itself is interesting: why should something end at 100 milliseconds, right? I suggested there's another kind of cost, a cost of time. That would be added onto this cost, and you would then get a cost for a particular p. You'd say, all right, suppose my movement should be 300 milliseconds — what's the optimum thing I can do? Then find it for 400 milliseconds, and so on, and you'll also see what the best duration is. For now, let's assume that p is constant; we're not going to change it. And we're going to have two kinds of costs in our cost per step: an effort cost and a state cost. The problem we're considering belongs to the class called finite horizon. So now what I want to do is find the policy: u of k is going to be some policy pi of x of k. You find yourself at some x at time point k, and you produce some action given by pi. This pi could be nonlinear — up there I wrote it linearly, but in principle it could be a nonlinear function of the state. The objective is to find the optimum policy, the one that produces the u's that minimize the sum of alpha of k, from k equal to zero to p. That's the problem statement. So how do we do it? Let me give you some intuition about how this is going to work. We're going to begin at the end of our time, the last time point p. We've run out of time, and we're going to have a value associated with each state at the end — that's basically our cost: the farther away we are from our goal, the worse it is. That's pretty simple. Now we step back one time point, to time point p minus one. At that time point, whatever action you take is going to carry a cost — the cost per step. But the result of that action is also a state that you're going to end up in, and that state has a value associated with it. The question is: how do you find the policy — the motor command — that minimizes the sum of the cost of the step you've taken plus the value of where you ended up? If you find the policy that minimizes that sum, that's the optimum policy at that step. Then you step back two points and find the best motor command that minimizes the same kind of sum, and so forth. So the Bellman equation is a way to take our problem, which spans all of these time points, begin at the end, and find for each step the optimum action — and we're going to see that this produces the best overall policy. So let me see how I'm going to write this down for you. All right, let's begin at time point p, the last time point. If we begin at the last time point, we've run out of time.
So our only cost at that point is the state cost. Any action that we produce is irrelevant, because there's nothing to be gained by acting — whatever we do would only cost us something. So the minimum is achieved by producing no action at all, and whatever state I'm in determines the cost I end up with. Suppose, then, that the optimal policy pi star at x of time point p is zero: I'm not going to do anything, because whatever I do is going to cost me something. And then the value of this policy at x of p is just x of p transpose, T, x of p. Whatever state I'm in, that's the value associated with it. Now, at time point k equal to p minus one, what's the best thing I can do? Well, whatever action I perform is going to carry a cost. The u that I produce contributes to the cost per step — whatever u I produce costs me something — so that's alpha at time point p minus one, which includes the action I take at this time point plus whatever state I'm in. So at time point p minus one I'm in some state, and this term is how much the action I produce is going to cost me. Now, that action results in a change in my state: as a consequence of u at p minus one, I go to some x at p — some state I'm going to end up in. And the place I end up has a value. That value is the goodness of that state: how good is it? The goodness of the state is given by the value function, and the value function is defined by the best actions I could produce from there — basically the minimum achievable cost from that state. So the state is going to have a value, and that value is associated with the optimum policy. Yes — it's a scalar quantity. This value is associated with a policy: here the policy is to do nothing, and at this state that policy has this value — just a scalar. So what is the policy? The policy is: do nothing. What is the value of that policy? For this state, it is this, evaluated exactly there. A policy is something that tells us what we should do at any state. The value of the policy, at a given state, is what it costs. Look at what I wrote: my policy at the last time point is to do nothing, but it could be something else; whatever it is, the value of that policy is a number, and that number tells us how good that policy is at this particular state. An action can take me to many places, and at each of those places I'm going to have a number. We'll do an example so you'll see what this looks like. The one up there? The one up there is only at the endpoint — there's only one location for the cost, so it doesn't have a superscript. Here I could have a different cost at different times. At the end it would be T superscript p. So you could have a cost that says the state at the end has some quadratic cost — I want to be as close as possible to the goal, and the farther I am the worse it is — but I could also have costs associated with states at other times; in principle it doesn't have to be only at the end. Okay, so: as a consequence of generating this action, I'm going to go to some state.
And when I get to that state, what I have is this probability of x at p, given that I was at x at p minus one and I produced action u at p minus one. This is a probability that says where I'm going to go, given that I was here and I did that action. In a stochastic system, you generate some action, but there's some probability distribution over where you're going to end up. All right, so now what we do is multiply this probability by the value of the optimum policy at that position, x of p. You generate action u, that alters your state, but the alteration of the state is probabilistic: you might go here, here, here — all these places — and each one of those places has a value. Think of it as a game board. Each position on the game board has a value, and your actions change your state to one of those positions. Now, if it's probabilistic, then it's maybe 90% likely that when I move my piece from here to here it actually goes there, but it's also possible it goes to some neighboring state, and so the value of each of those states gets weighted by the probability of actually getting to it. So this term asks: how good is the state I get to? Obviously, if I get to the goal, that's great — that's a great state to be in — but there are neighboring states I might have gotten to instead, and how good are those? There's a value associated with each of them, and there's the probability of actually reaching them given this action. So we sum that up — integrate the value of each possible next state times the probability of getting there — over all possible states x of p. And then what I want is the policy that minimizes alpha at p minus one plus this term: the argument u that minimizes that sum. That's my best action, my u star. Look at what I did: this is my cost per step — I'm in some state, I produce some action u, it costs me something — the action takes me to some other state with some probability, each of those states has a value, and I want the action that minimizes the total. Note that the second term is an expected value — the expected value of the value function — because it multiplies the probability by the value. This is the Bellman equation. It says that the best action you can take is the one that minimizes the cost per step plus the expected value of the states you end up in, assuming that from then on everything you do is optimal.
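Written out, the recursion just described takes roughly the following form. The symbols here — V for the value function, alpha_k for the cost per step, pi star for the optimal policy — are my notation for what was said on the board, not verbatim from the lecture.

```latex
% The finite-horizon Bellman recursion described above, in assumed notation.
\begin{align*}
V^{\pi^*}\!\bigl(\mathbf{x}_p\bigr) &= \mathbf{x}_p^{\top} T^{(p)} \mathbf{x}_p
  && \text{(last time point: do nothing)} \\[4pt]
V^{\pi^*}\!\bigl(\mathbf{x}_{k}\bigr) &= \min_{\mathbf{u}_{k}}
  \Bigl[\, \alpha_{k}\bigl(\mathbf{x}_{k},\mathbf{u}_{k}\bigr)
  \;+\; \int p\bigl(\mathbf{x}_{k+1}\mid \mathbf{x}_{k},\mathbf{u}_{k}\bigr)\,
        V^{\pi^*}\!\bigl(\mathbf{x}_{k+1}\bigr)\, d\mathbf{x}_{k+1} \Bigr] \\[4pt]
\pi^*\!\bigl(\mathbf{x}_{k}\bigr) &= \arg\min_{\mathbf{u}_{k}}
  \Bigl[\, \alpha_{k}\bigl(\mathbf{x}_{k},\mathbf{u}_{k}\bigr)
  \;+\; \mathbb{E}\bigl[\,V^{\pi^*}(\mathbf{x}_{k+1}) \mid \mathbf{x}_{k},\mathbf{u}_{k}\bigr] \Bigr]
\end{align*}
```

In the deterministic case, the integral collapses to the value of the single state the action takes you to, which is the form used in the grid example that follows.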
Let me do an example so you can see what this looks like. Suppose we write our cost per step as j_x plus j_u, and suppose I have a world that behaves as follows: a four-by-three grid of positions on a board. At each of these positions I can perform the following actions: I can move down, I can move up, I can move to the right, I can move to the left, I can move diagonally, and so forth. There's a wall here, so the only thing I can do from here is move down; from this point I can move up or down; from this point I can move up or down; from this one I can move to the right, or diagonally over to this position, and so forth. Those are the actions, and these are the states. So what are my costs? Here's my j of x: regardless of the position I'm in, it costs me five — except at this one point, where it costs me nothing. That's my goal. (Yes — it's just today's example.) And what about the cost of an action, j_u? That's equal to one no matter which move I make, and zero if I stay. What I want is a policy pi of x that minimizes the sum, and we're going to use the Bellman equation to find it. So let's begin at the end, at k equal to p, the last time point. At the last time point my best action is to do nothing, since I've run out of time, and for each state I'd like to write down the value function. All right, let's write it like this. My best action at time point p is: do nothing. What's the value of that at time point p? That value is j_x plus j_u; since I didn't move, it's just j_x. Here's my policy — what to do at every state: I'm not going to do anything, I'm just going to sit there, not move — and the value of that policy is j_x plus j_u, where j_u is zero, so it's just j_x. Okay? All right, time point p minus one. Let's consider one of these states — this point here — and let's consider some actions. Suppose I do nothing. Then my cost per step, alpha of k, is the cost of u, which is zero, plus the cost of the state, j_x, which is five — this cost here. So j_x plus j_u, with the do-nothing choice, is five. Now, where does that take me? If I do nothing, I stay where I am. And what is the value of the optimal policy at that point? It's five. So alpha at p minus one plus the value of the optimal policy at x of p is five plus five, which is ten. If I do nothing, that's the cost I incur: cost per step plus the value under the optimal policy. That's if I do nothing. Okay, what if I move down? If I do that, then alpha of p minus one is: it costs me one to move, plus the cost of the state I start from, which is five — so j_x of k plus j_u of k is six. And where do I end up? If I go down, I end up at this state — the goal — whose value is zero. So the total is six plus zero, which is six. Let me go over that again. I chose, at time point p minus one, to move down. What's the cost of that choice? It has a cost associated with the state you start from and the action you take: the state I start from costs five, the action I take costs one — five plus one is six. Where does it take me? It takes me to this point. What's the value of that state under the optimal policy? Zero. So six plus zero is six. That's the value of the policy I chose — v of pi at x of p minus one. Moving down is a better policy than doing nothing, because its cost, six, is less than ten. What if I move to the right instead? Let's do that.
The cost I start with, alpha at p minus one, is one for my movement plus five for my state, and I end up at a new location whose value, v at x of p, is five. So the value of this policy at x of p minus one is — eleven. (Sorry, eleven, not ten. Thank you.) Right, okay. So: if I stay and do nothing, the value of my policy is ten; if I move down, the value of my policy is six; if I move to the right, the value of my policy is eleven. What the Bellman equation says is that pi star is the policy that minimizes alpha at time point p minus one plus the value of the optimum policy at the state you go to at time point p. In this case, if I'm at this state at time point p minus one, the best policy is the one that moves me down to the goal — better than moving to the right, better than staying still. And what is the value of this policy at the state x of p minus one? There's going to be some number here, and that number is six. There are going to be numbers at these other states too; I'll figure those out in a moment, but for now I'll leave them blank — I'll have to compute them. So the best policy at this state at time p minus one has a value of six, which is the sum of the action's cost plus the value, under the optimal policy, of the state it takes you to. This is my pi star at x of p minus one, and this is the value of that policy. All right, let me do another one for you. Suppose I'm interested in this state over here. What's the best action from there? I could do nothing, in which case alpha at time p minus one is zero plus five, the state cost — so that's five — and where do I end up? I stay where I am, so the value under the optimal policy of x at time point p is five (from up there), and the sum is ten. That's if I do nothing. What if I move down from that state? Then alpha at p minus one is one plus five, which is six — five is the state cost to begin with, one is the cost of the movement — and I end up in the state below it, which has a value of five. So what's the value of moving down? Eleven. So what's better: to stay or to move down? To stay — it costs me more to move down. So the best action for this state is to stay, and if I do stay, the value at time point p minus one is ten; that's the optimal policy for it. Why is it better for me to stay? Because it's hopeless — I cannot get to the goal in the remaining time, so I might as well stay. Performing an action does not improve my state: there's no difference between the value of this state and that state, so why should I move? It makes no difference; I'm going to run out of time anyway. So the best policy for this point is to stay, not to move, and it's going to be the same for this one — ten as well. For all of these far-away places, it's best to do nothing. All right, now let's consider this one here, the goal state — what's the best policy there? So let's consider this point here.
Here I'm interested in this state: row two, column two. Let me label these: rows one, two, three; columns one, two, three, four. So I'm interested in the state at row two, column two — the goal. If I decide to stay, alpha at p minus one is zero plus the cost of that state, which is zero, and then I have the value of where I end up under the optimal policy. If I stay, I stay at this location, and the value of that state under the optimal policy is zero. So alpha at p minus one plus the value under the optimal policy at x of p is zero. The best thing to do is stay, obviously — if you're there, you don't want to move, no big deal — and if I do so, this entry is zero. The other states near the goal are going to be very similar, so let me do this one here for you, for example: row one, column three. If I stay, the cost of staying, alpha at p minus one, is zero plus the state cost of five, and then the value under the optimal policy at x of p, where I end up, is also five, so that's ten if I stay. If I move — in particular, if I move to the goal in a single step like this — then alpha at p minus one is one plus five, and the value function at the state I end up in is zero, so that's six. So the optimum thing to do is to move to the goal, and if I do that, the value of that policy is six. And it turns out this is the case here as well: six, six, six, and here are the actions. So, all right: if I'm one step away from the goal and I have one time point left, what I should do is move to the goal. What about out here? Let's pick this one: row four, column three, and see how that turns out. If I were to stay, alpha at p minus one is zero plus five — the state has a cost of five, the action costs zero — and where I end up, v of pi star at x of p, is the location I'm at, which has a value of five. So the total cost of staying is ten. What if I move toward the goal? Say I consider this action — I move from here to this state here. Is that a good thing or a bad thing? Well, alpha at p minus one is one plus five, and the state I end up in has a value, up there, of five: v of pi star at x of p is five. So that's eleven, and it's better for me to stay; if I do, I end up with ten — and ten for this one too. So what did we do? We computed the optimum policy for every one of our states, and my optimum policy is: stay, stay, stay, stay, stay, stay, move toward the goal, move toward the goal, move toward the goal. And the value of that policy is this. Notice that the value of the policy is not as good as the value at the last time point — we're building up our cost now. The other thing you notice is that whereas up there the value of the policy was uniform everywhere except zero at the goal, here we begin to see a gradient: these places near the goal now have something better if you move toward the goal, whereas the far-away places are still irrelevant — you don't want to move. So let's do it for one more time point. What is it with whiteboards? After a little while they become completely unusable.
There's some kind of a technology flaw, it seems to me. It's terrible. All right, so let's begin by writing down our value function. The value of the best policy at time point p is five at every state and zero at the goal, and the policy at this last time point is: do nothing — stay, stay, stay, everywhere. The value of the policy at time point p minus one is, at each state: six, zero, six, six, six, six, ten, ten, and so on, and my policy at each of those states is stay, stay, stay, move down, move to the goal, stay, and so on — those are my actions. You see it? Okay, let's do one more time point: time point p minus two, and see what happens. Let's pick one state — this point here — and pick an action. If I do nothing, then my cost at this time step is just five, because that's the cost of my state, and the value of where I end up at time point p minus one is my current state, which is six. So doing nothing costs me eleven. What about the action of moving down? If I move down, my cost per step is one plus five, and the value of where I end up is zero, because that's the value of that state at p minus one. That's six — this is better. So what this tells me is that at time point p minus two, my best action for this state is to move down, and the value is going to be six. We'll have to fill in the rest of it — the actions and the values of the optimal policy — in the same way. Where should we go next? Pick a state. Which row, which column? All right, row four, column three — this one here. Okay, so give me the action. Diagonal: alpha at p minus two is one plus the state cost of five, which is six, right? And the value of the optimal policy at p minus one where I end up — this state here — is six, so that's twelve. Is that better than the other choices I have? I could move up, I could move left, or I could stay where I am — those are all the other options. Let's take the action: do nothing, stay. Then alpha at p minus two is five, and the value term makes it fifteen. Clearly it's better for me to move now, right? Now, is it better to move to the left, or up to this point here? Either this or this — they're equal, equally good. So this is going to be a best action — say this one here; it's as good as moving up there — and the value ends up being twelve. Do you see it? We break down the problem like this. And it turns out that the best action for this one here is still to stay, because look what happens: if I stay at this point, I have ten plus five — the value at the next step plus the cost per step — so I get fifteen; if I move, I get ten plus five plus one, which is sixteen. So it's better for me to stay, and the value ends up being fifteen. (No, I decided to stay, so I didn't take any action. The other option is even worse — yes, the sideways one is even worse.) So I think this ends up looking like this: I want to stay here as well, and that ends up costing me fifteen. Here, I think I want to move, because moving costs me five plus one, six, plus the value of where I end up.
So if I'm at this point and I move up, it's just like the previous case: one plus five is six, plus the value of six at the state I end up in, which is twelve. So I think this one is going to be twelve, and this is going to look like this. All the others are going to be the same as before: this is going to be a zero, these are going to be six, six, six, six. At this point, if I move up, I have one plus five — yes, that's correct, let me check — yes, okay. And then at this point here, the question is: should I move? Let's consider that point — row four, column one. If I stay, alpha at p minus two is five, and the value at p minus one is ten, so that's fifteen. If I go to this state instead, it's six — one plus five — plus the value of the state I end up at, which is six, giving twelve. So it's better for me to move there. Okay, so look at how the value function changed. The value function at time point p was flat except at the goal. At time point p minus one we have a gradient one step away from the goal, and now the gradient has gotten bigger and is beginning to show a difference out here. So eventually what happens is that at the next time point my best policy would be to move from here to here, stay, move from here to here, stay, and finally, at the last time point, move down. All right, so what did we do? We said that the best action you can take is the one that minimizes a cost, and that cost is composed of two things: the cost per step — the state you're in costs you something, and the action you take costs you something — plus the value of the location that action takes you to, where that value is defined by the optimum policy from that state onward. We did this for a deterministic system; if it were a stochastic system, it would just be the expected value of the next state rather than the value itself.
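To tie the worked example together, here is a small sketch of the same backward recursion in code. The grid layout below — where the goal sits, how many steps are in the horizon, and which moves are allowed — is an assumed stand-in, since the exact board drawn in lecture (including the wall and the diagonal moves) isn't recoverable from the transcript. The structure of the computation follows what was described: state cost of five everywhere except zero at the goal, action cost of one for any move and zero for staying, finite horizon, and backward induction over time.

```python
# Finite-horizon backward induction (the Bellman recursion) on a small grid.
# Assumed setup: 3 rows x 4 columns, goal at row 1, column 1 (0-indexed),
# deterministic transitions, compass moves plus "stay" (no walls or diagonals).

ROWS, COLS = 3, 4
GOAL = (1, 1)                     # assumed goal cell
P = 3                             # assumed number of steps in the horizon

ACTIONS = {                       # action name -> (row delta, col delta)
    "stay":  (0, 0),
    "up":    (-1, 0),
    "down":  (1, 0),
    "left":  (0, -1),
    "right": (0, 1),
}

def j_x(state):
    """State cost per step: 5 everywhere except 0 at the goal."""
    return 0.0 if state == GOAL else 5.0

def j_u(action):
    """Action cost per step: 1 for any move, 0 for staying."""
    return 0.0 if action == "stay" else 1.0

def next_state(state, action):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return (nr, nc) if (0 <= nr < ROWS and 0 <= nc < COLS) else state

states = [(r, c) for r in range(ROWS) for c in range(COLS)]

# At the last time point the best thing to do is nothing, so the value is just j_x.
V = {s: j_x(s) for s in states}
policies = []

# Step backward: V_k(s) = min_u [ j_x(s) + j_u(u) + V_{k+1}(next_state(s, u)) ].
for k in range(P - 1, -1, -1):
    V_new, pi = {}, {}
    for s in states:
        best_u, best_cost = None, float("inf")
        for u in ACTIONS:
            cost = j_x(s) + j_u(u) + V[next_state(s, u)]
            if cost < best_cost:
                best_u, best_cost = u, cost
        V_new[s], pi[s] = best_cost, best_u
    V, policies = V_new, [pi] + policies

# Print the value function at k = 0 and the first-step policy, row by row.
for r in range(ROWS):
    print([round(V[(r, c)], 1) for c in range(COLS)],
          [policies[0][(r, c)] for c in range(COLS)])
```

Under these assumptions the first backward step reproduces the numbers worked out on the board — six for the states one move from the goal, zero at the goal, ten for the far-away states where staying is best — and with each additional backward step the gradient toward the goal spreads out by one more ring of cells.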