Okay, so today we're going to wrap up the Bellman equation. The main concern I want to present today is this: our policy is a gain times an estimate of state, x̂, but in our derivation so far we've been using x as if it were x̂. When we found the value associated with the optimal policy, we showed it to be quadratic in x. But in fact our motor commands u are a function of x̂, our Kalman filter estimate of state. So that's not quite right: our policy depends on x̂, while the value function we derived is quadratic in x. We need to take care of that today. What we're going to see is that when we take the commands u to be a function of x̂ rather than x, the value of the policy is still quadratic, but it becomes a quadratic function of both x and the difference between x and x̂. So the value function is going to depend on x̂ as well, and that's the first component of today's derivation. The second thing we're going to see is what to do when we have signal-dependent noise. Signal-dependent noise makes the policy gain G depend on the noise: the larger the noise, the smaller the gains in our feedback loop. It also makes the u that we generate depend on the Kalman gain K associated with how we estimate the state of the system.

The main thing I want to come back to is what the Bellman equation does. For the last time step we find the optimal action, and then we find the value associated with that policy. If we have a closed-form solution, we get a function; if it's a game we're working on, we get a value for each state, some nonlinear function of state. Then, at the step before that, we find the optimal action that minimizes the sum of the cost per step plus the value of the state we're going to land in, assuming that from then on we take the optimal actions. So the value function is the minimum cost we're going to incur given that we find ourselves in a particular state. That's what the value function refers to, and what the Bellman equation computes, iteratively, is a value for each state: how good is this state? Of course, "value" here is really a cost, because the smaller the value, the better the state is. Why better? Because from that state on, if you produce the optimal commands, you're going to have the minimum accumulated cost, which is that value function. Okay, so do we get a sense of what the Bellman equation is? As applied to linear dynamical systems, it's interesting only because the value function becomes a quadratic function of state. That's a mathematically convenient way of representing the value function. Even if it's not quadratic, whatever function it is, as long as we have a representation of it over the state space, we can use it. In your homework you didn't have a value function that was quadratic; it was something else. But if you have enough memory, you can just represent it as it is. So that's fine.
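To make that backward recursion concrete, here is a minimal sketch of finite-horizon backward induction on a small discrete problem, in Python. Everything in it (the states, actions, cost table, and transition table) is a made-up illustration; only the structure mirrors the lecture: at the last step take the best action, then repeatedly minimize the cost per step plus the value of the state you land in.

```python
import numpy as np

n_states, n_actions, horizon = 4, 2, 5

# Hypothetical cost-per-step alpha[s, a] and deterministic transitions nxt[s, a].
rng = np.random.default_rng(0)
alpha = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
nxt = rng.integers(0, n_states, size=(n_states, n_actions))

# Final time step: best single action, and the value of that policy.
policy = np.zeros((horizon, n_states), dtype=int)
policy[-1] = alpha.argmin(axis=1)
V = alpha.min(axis=1)

# Backward induction: V_k(s) = min_a [ alpha(s, a) + V_{k+1}(next state) ].
for k in range(horizon - 2, -1, -1):
    Q = alpha + V[nxt]          # cost-to-go for every (state, action) pair
    policy[k] = Q.argmin(axis=1)
    V = Q.min(axis=1)           # value of the optimal policy at step k

print(V)   # minimum accumulated cost from each state
```

The array V after the loop is exactly the value function described above: the minimum accumulated cost from each state under the optimal actions.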
With regard to today's lecture, we're going to add the concept that the value function depends not just on the state but also on your estimate of state, because you don't actually know the state. At any given time k you have an estimate of the state, but you don't have the state itself. So how good is a state if what you have is an estimate x̂? We're going to see how to incorporate that into our system.

Okay, so let me start. As a reminder, we begin with our cost per step, α, which depends on the state x and the input u. Our policy at the final time point p is to find the u that minimizes α_p(x_p, u), and the value of this policy at time p is just α_p(x_p, π*(x_p)). We just evaluate the cost per step; this is a general way of writing the policy for the final time step. Then our policy one step before the final time point is the one that minimizes α_{p-1} plus the expected value of the value function at time p, given that we were at x_{p-1} and executed action u_{p-1}. And if we find that policy, then the value of that policy at time p-1 is α_{p-1}(x_{p-1}, π*(x_{p-1})) plus the expected value of the value function after producing that action.

What we saw on Wednesday was that when we produce the optimal policy, we get u(k) equal to some gain times x̂(k), and if we place this policy inside the value function, the value function for the optimal policy at x(k) is a quadratic function plus a constant. But that assumed u = G x. In reality u = G x̂. So when we put u as a function of x̂ into this equation, we're going to have to show that the value function is still a quadratic function and that we can continue using our procedures. And if we allow u to be a function of x̂, then of course we have to know what x̂ is: how does x̂ depend on things we can measure? Our typical estimate of state at time k is our prior estimate plus the Kalman gain times the difference between what we observed and what we predicted. So suppose we have a system of the following form:

x(k+1) = A x(k) + B u(k) + B ε_u + ε_x
y(k) = H x(k) + ε_s + ε_y

where I'm going to introduce signal-dependent noise both in the action and in the sensors. The motor noise is

ε_u = [ c_1 u_1 φ_1 ; c_2 u_2 φ_2 ; … ]

with as many terms as u has components (u is a vector with components u_1, u_2, …), where each φ_i is a scalar random variable with mean zero and variance one. ε_s is similarly another signal-dependent random variable, built the same way from the state x.
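Here is a minimal sketch of that generative model in Python. The matrices A, B, H, the variances Qx and Qy, and the noise slopes are all made-up illustrative values, with a single command channel so there is just one C matrix and one D matrix; the motor noise enters the state through B, consistent with the variance computation later in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state system (say position and velocity, one command channel).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
H = np.array([[1.0, 0.0]])          # we observe position only
Qx = 1e-4 * np.eye(2)               # variance of eps_x
Qy = 1e-2 * np.eye(1)               # variance of eps_y
C1 = np.array([[0.5]])              # motor-noise slope c_1
D1 = np.array([[0.2, 0.0]])         # sensory-noise slope

def step(x, u):
    """x(k+1) = A x + B (u + eps_u) + eps_x,  y(k) = H x + eps_s + eps_y."""
    eps_u = (C1 @ u) * rng.standard_normal()    # signal-dependent motor noise
    eps_x = rng.multivariate_normal(np.zeros(2), Qx)
    eps_s = (D1 @ x) * rng.standard_normal()    # signal-dependent sensory noise
    eps_y = rng.multivariate_normal(np.zeros(1), Qy)
    y = H @ x + eps_s + eps_y
    x_next = A @ x + B @ (u + eps_u) + eps_x
    return x_next, y

x, y = step(np.array([0.0, 0.0]), np.array([1.0]))
```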
Yeah, so remember what we do with signal-dependent noise: we say the standard deviation of the noise grows as a function of the signal u, with a slope equal to c. So if I have a random variable of the form ε_u = c u φ, with φ normally distributed with mean zero and variance one, then ε_u has mean zero and variance c² u². It's a random variable with mean zero and a variance that grows with the size of the signal; the variance grows quadratically, which means the standard deviation grows linearly, c times u, with slope c.

In the vector case, ε_u depends on the vector u, and the slope for each element of the vector is set by the variable c. I can write ε_u as

ε_u = Σ_i C_i u φ_i

where C_i is a matrix with c_i in one element and zeros everywhere else: C_1 has c_1 in the first diagonal element and is zero everywhere else, C_2 has c_2 in the second diagonal element and is zero everywhere else, and so on. So I've written ε_u, a vector, as a matrix times a vector times a scalar, and what that means is

Var(ε_u) = Σ_i C_i u u^T C_i^T.

Its mean is zero, and its variance depends on the square of u. (Yes, the individual slopes: with signal-dependent noise on a vector, you can have an independent slope for each component, and there's no covariance between the elements; φ_1 is independent of φ_2.) Similarly with x: you can have signal-dependent noise on the things that you measure, so

ε_s = Σ_i D_i x φ_i,  with  Var(ε_s) = Σ_i D_i x x^T D_i^T.

What this is saying is that for this system the variance of your observation depends on the state. ε_y and ε_x are the usual random variables, with variances Q_y and Q_x, in our measurement equation and our state-update equation, but ε_s has a variance that depends on x(k). The bigger the state, the bigger the noise; and the bigger the input, the bigger your uncertainty is going to be in estimating state.

Okay, so our problem remains how to incorporate x̂ into our problem. What I have is

x̂(k|k) = x̂(k|k-1) + K(k) [ y(k) - H x̂(k|k-1) ]

where K(k) is the Kalman gain. And what's my ŷ? My ŷ is H x̂(k|k-1), the expected value of the observation equation: the expected value of y(k) is just H times x̂, because the expected values of the noise terms are zero.
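A quick empirical check of that variance formula, with two command channels and made-up slopes:

```python
import numpy as np

rng = np.random.default_rng(1)
u = np.array([2.0, -1.0])
slopes = [0.3, 0.5]

# C_i has slope c_i in the i-th diagonal element, zeros elsewhere.
C = [np.diag([c if j == i else 0.0 for j in range(2)])
     for i, c in enumerate(slopes)]

# Analytic variance: sum_i C_i u u^T C_i^T
V_analytic = sum(Ci @ np.outer(u, u) @ Ci.T for Ci in C)

# Monte Carlo: eps_u = sum_i C_i u phi_i, with phi_i ~ N(0, 1)
samples = np.zeros((200_000, 2))
for Ci in C:
    phi = rng.standard_normal(200_000)
    samples += phi[:, None] * (Ci @ u)

print(V_analytic)
print(np.cov(samples.T))   # should match the analytic variance
```

The sampled covariance comes out diagonal, with each diagonal entry equal to (c_i u_i)², which is exactly the "standard deviation grows linearly with the signal" picture.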
Then x̂(k+1|k), my prior estimate for the next step, is my posterior from the previous step pushed through the dynamics, plus the input I gave to the system:

x̂(k+1|k) = A x̂(k|k) + B u(k)

which, substituting the measurement update, is

x̂(k+1|k) = A x̂(k|k-1) + A K(k) [ y(k) - H x̂(k|k-1) ] + B u(k).

So what I've written is that my estimate of state at time k+1 is A times my estimate at time k, plus A times the Kalman-gain correction, plus the effect of the input u(k). That's the expected value of my state at time k+1. Is that clear? Do you see what I did? If you don't, just raise your hand and I'll go over it again. All I've done is use the Kalman gain to estimate state: here's my posterior at time point k, and my prior at the next step is that posterior plus the input I gave; then I've written the posterior in terms of the prior and the innovation. (No, it's the same, exactly the same; we just hadn't written it out before.) The reason I need to write this is that u is going to be G times x̂(k), so I need to know what x̂(k) is, and this is what it is: x̂(k) is A times x̂(k-1) plus the correction terms, and so on. Why is this important? Because look at the value function: it's written in terms of x(p) given x(p-1) and u(p-1), and you see that x̂ likewise depends on the previous time points. So now we have a way to write what x̂ is.

All right, so let's begin at the last time point and find our optimal command. For time point p, my optimal policy at x(p) is the one that minimizes α_p, and the u that minimizes α_p is simply u(p) = 0. The value function for that policy is

V^{π*}(x_p) = x_p^T T_p x_p.

Now, what's the value function at x(p), given that I'm at x(p-1) and have issued command u(p-1)? Why do I need this? Because to move to time point p-1, I need to know what my value function is for time point p given that I'm at some other state. So let's write that out for the particular dynamics that we have:

E[ V^{π*}(x_p) | x_{p-1}, u_{p-1} ] = E[ (A x_{p-1} + B u_{p-1} + B ε_u + ε_x)^T T_p (A x_{p-1} + B u_{p-1} + B ε_u + ε_x) ].
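And here is that estimator propagation as a function, using the same toy system as before; the Kalman gain K is just a made-up constant for this sketch, since computing it is the standard filter machinery we've already covered.

```python
import numpy as np

def estimate_next(x_hat_prior, y, u, A, B, H, K):
    """x_hat(k+1|k) = A x_hat(k|k-1) + A K [y - H x_hat(k|k-1)] + B u."""
    innovation = y - H @ x_hat_prior
    x_hat_post = x_hat_prior + K @ innovation     # measurement update
    return A @ x_hat_post + B @ u                 # time update

# Illustrative values (same toy system as before).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
H = np.array([[1.0, 0.0]])
K = np.array([[0.5], [0.1]])      # a made-up Kalman gain for this sketch

x_hat = estimate_next(np.zeros(2), np.array([0.3]), np.array([1.0]), A, B, H, K)
```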
Let's evaluate that expectation. The expected value of a quadratic form is the quadratic form of the mean plus a trace term for each noise source. The expected values of ε_u and ε_x are zero, so the mean term is

(A x_{p-1} + B u_{p-1})^T T_p (A x_{p-1} + B u_{p-1})

and then I have to consider the trace of T_p times the variance of each random term. Let's begin with the easy one, ε_x: its variance is Q_x, so it contributes trace(T_p Q_x). The ε_u term contributes

trace( T_p Σ_i B C_i u_{p-1} u_{p-1}^T C_i^T B^T ).

This is a scalar, and because it's the trace of a product I can rewrite it by bringing the u's outside:

u_{p-1}^T ( Σ_i C_i^T B^T T_p B C_i ) u_{p-1}

(and C_i is a symmetric matrix anyway, so the transpose doesn't change it). I'm going to call that middle quantity C^x at time point p:

C^x_p = Σ_i C_i^T B^T T_p B C_i

so my expected value is

E[ V^{π*}(x_p) | x_{p-1}, u_{p-1} ] = (A x_{p-1} + B u_{p-1})^T T_p (A x_{p-1} + B u_{p-1}) + u_{p-1}^T C^x_p u_{p-1} + trace(T_p Q_x).

Okay. So all we've done so far is introduce the concept of signal-dependent noise, and when we introduce it, the expected value of the optimal value function picks up this C^x term, which has the noises in it. You see that if you have signal-dependent noise, the value function depends on your command u, scaled by the noises that influence u. We didn't have that before: without signal-dependent noise, the value function doesn't acquire this command-dependent term. That's the big difference between having signal-dependent noise and not having it: signal-dependent noise makes the value function depend on u. And what that means is that when we put in what u is, if it depends on x̂, it's going to change our value function.

So what's our next step? The optimal policy at x(p-1) is

π*(x_{p-1}) = argmin_u [ α_{p-1}(x_{p-1}, u) + E( V^{π*}(x_p) | x_{p-1}, u ) ]

where α_{p-1} = x_{p-1}^T T_{p-1} x_{p-1} + u_{p-1}^T L u_{p-1}, and the expected value is the expression we just derived. (What's so funny? Yes, it's getting long, but we're going to simplify it; that's what's nice.) We're going to write it as a quadratic function of x and u.
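A quick numerical sanity check of that expectation, reusing the toy matrices from before; T_p here is an arbitrary terminal cost matrix. The Monte Carlo average of the quadratic form should match the closed-form expression with the C^x and trace terms.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Tp = np.diag([1.0, 0.1])            # arbitrary terminal cost matrix T_p
Qx = 1e-4 * np.eye(2)
C1 = np.array([[0.5]])              # motor-noise slope
x, u = np.array([0.1, 0.0]), np.array([2.0])

# C^x_p = sum_i C_i^T B^T T_p B C_i   (one term here)
Cx = C1.T @ B.T @ Tp @ B @ C1

mean = A @ x + B @ u
analytic = mean @ Tp @ mean + u @ Cx @ u + np.trace(Tp @ Qx)

# Monte Carlo over eps_u and eps_x
N = 200_000
phi = rng.standard_normal(N)
eps = phi[:, None] * (B @ (C1 @ u)) \
    + rng.multivariate_normal(np.zeros(2), Qx, size=N)
xp = mean + eps                     # samples of x(p)
mc = np.einsum('ni,ij,nj->n', xp, Tp, xp).mean()
print(analytic, mc)                 # should agree closely
```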
Collecting terms, this is

x_{p-1}^T ( T_{p-1} + A^T T_p A ) x_{p-1}

for the x-squared part: T_{p-1} from the cost per step, and A^T T_p A from the expectation; that's all the x-quadratic I have. Then the u-squared part:

u_{p-1}^T ( C^x_p + B^T T_p B + L ) u_{p-1}

which takes care of all the squared u's. Then there's one interaction term,

2 u_{p-1}^T B^T T_p A x_{p-1},

plus trace(T_p Q_x). Sorry, this whole thing is the argument of the argmin: we're going to minimize it to get the policy, and it's a quadratic function of x, a quadratic function of u, plus this interaction term. So we minimize with respect to u(p-1). Taking the derivative with respect to u(p-1) and setting it to zero, I get

2 ( C^x_p + B^T T_p B + L ) u_{p-1} + 2 B^T T_p A x_{p-1} = 0

so

u_{p-1} = - ( C^x_p + B^T T_p B + L )^{-1} B^T T_p A x_{p-1}

and that matrix is what I call my gain G at time point p-1. And my problem is that I don't know x; all I have is x̂.

So I've found my policy for time point p-1, and here it is; let's look at it for a second. It depends on L, the motor cost: obviously, the greater your motor cost, the smaller the u you're going to produce. And it depends on C^x. What is C^x? It's the signal-dependent noise on the u's: the bigger that noise, the smaller your gain is going to be. That kind of makes sense. (Yes; at this point I'm just finding the policy. I'm saying: at x(p-1), show me your best policy, the one that minimizes this. My value function at x(p) was just in terms of x; there is no x̂ in it. Now it's going to make a difference. At time point p it didn't, because the gain at time point p was zero, so the value function there didn't depend on x̂, only on x. But the value function under the optimal policy for p-1 is going to depend on x̂, because u depends on x̂; at time point p, u was just zero, so it didn't matter.)

So the important thing is

u(p-1) = - G(p-1) x̂(p-1).

Now we have to put this back into the value function. You're going to get G x̂ wherever u appeared, so your value function is no longer quadratic just in x; it's going to have x̂'s in it. That's the problem we face. (Well, no: remember, we're not going to replace x with x̂; we're only going to replace u, because u depends on x̂ while x doesn't. The actual state is independent of what our estimate of it is. We can still compute the value function for any x and any x̂: whatever the true state is, we can find the value for that x and that x̂. You'll see; the value function becomes a function of both.)
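The gain formula in code, with the same illustrative matrices and a scalar motor cost L:

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Tp = np.diag([1.0, 0.1])
L = np.array([[0.01]])            # motor cost on the single command channel
C1 = np.array([[0.5]])

Cx = C1.T @ B.T @ Tp @ B @ C1
# G(p-1) = (C^x + B^T T_p B + L)^{-1} B^T T_p A
G = np.linalg.solve(Cx + B.T @ Tp @ B + L, B.T @ Tp @ A)

x_hat = np.array([0.2, -0.1])
u = -G @ x_hat                    # u(p-1) = -G(p-1) x_hat(p-1)
```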
So the value under the optimal policy is that quantity with u = -G x̂. All I can do right now is write the value under the optimal policy for x(p-1) and x̂(p-1):

V^{π*}(x_{p-1}, x̂_{p-1}) = x_{p-1}^T ( T_{p-1} + A^T T_p A ) x_{p-1}
  + x̂_{p-1}^T G_{p-1}^T ( C^x_p + B^T T_p B + L ) G_{p-1} x̂_{p-1}
  - 2 x̂_{p-1}^T G_{p-1}^T B^T T_p A x_{p-1}
  + trace(T_p Q_x).

The u-squared term picks up G^T(·)G because u = -G x̂, and the interaction term picks up the minus sign. So we have some strangeness going on here: we have x, we have x̂, and we have x̂ times x. But we can do a little bit of simplification. Look at G: it's exactly that bracketed quantity, inverted, times B^T T_p A. So the bracketed quantity times G cancels the inverse, and the x̂-squared term collapses to

x̂_{p-1}^T G_{p-1}^T B^T T_p A x̂_{p-1}.

That's kind of nice, because if I call Z = G_{p-1}^T B^T T_p A, you see that the same Z appears in the interaction term as well: I have x̂^T Z x̂ - 2 x̂^T Z x. Why is that nice? Because that's something I can complete:

x̂^T Z x̂ - 2 x̂^T Z x = (x - x̂)^T Z (x - x̂) - x^T Z x.

So this value function can be written as a quadratic function of x, which is what we've always had, plus a new term that I'm going to call the error in estimating x: how far the true x is from my estimate of x. The value function under the optimal policy at time p-1 depends on both x and x̂:

V^{π*}(x_{p-1}, x̂_{p-1}) = x_{p-1}^T W^x_{p-1} x_{p-1} + (x_{p-1} - x̂_{p-1})^T W^e_{p-1} (x_{p-1} - x̂_{p-1}) + trace(T_p Q_x)

where W^e_{p-1} = Z and W^x_{p-1} = T_{p-1} + A^T T_p A - Z, since the leftover -x^T Z x term just folds into the x-quadratic part. So the value function becomes a quadratic function of x and of the error in estimating x.

In 2005, Todorov's work was based on showing that if you have signal-dependent noise in the state-update equation, like we have now, and you incorporate that into the Bellman equation, then what you end up with is a representation of the value function that is still quadratic, but in terms of x and the error in estimating x.
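Before moving on, a quick numerical check of that completing-the-square step. Note that the identity relies on Z being symmetric, which it is by construction here: Z = A^T T_p B (C^x + B^T T_p B + L)^{-1} B^T T_p A. The matrices are the same illustrative ones as above.

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Tp = np.diag([1.0, 0.1])
L = np.array([[0.01]])
C1 = np.array([[0.5]])

M = C1.T @ B.T @ Tp @ B @ C1 + B.T @ Tp @ B + L
G = np.linalg.solve(M, B.T @ Tp @ A)
Z = G.T @ B.T @ Tp @ A            # symmetric by construction

x = rng.standard_normal(2)
x_hat = rng.standard_normal(2)

lhs = x_hat @ Z @ x_hat - 2 * x_hat @ Z @ x
rhs = (x - x_hat) @ Z @ (x - x_hat) - x @ Z @ x
print(lhs, rhs)                   # identical up to rounding
```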
Remember how we proceeded last time: if we can write the value function as a quadratic, then when we apply the optimal commands, the next step back is just a formula that says we still get a quadratic function. The proof in your book, and in what Todorov did in 2005, shows this for any time point k. If at time point k+1 you have

V^{π*}(x_{k+1}, x̂_{k+1}) = x_{k+1}^T W^x_{k+1} x_{k+1} + e_{k+1}^T W^e_{k+1} e_{k+1} + constant,

and at time point k you apply u(k) = - G(k) x̂(k), then the value function at time k and x̂(k) is once again a quadratic function of x and the error. So at any time point: you can write the value function under the optimal policy as a quadratic function of x and of the error in estimating x; for the next time point back you can find the minimum of the sum of the cost per step plus the value function, which gives a policy that is a linear function of x̂; and when you apply that policy, the value function associated with it is again a quadratic function of x and the error. That's just like before, except that instead of a value function that depends only on x, we now also have a value function that depends on the error in estimating x.

So that's basically it. The one remaining idea is that the value function is now a function of x and x̂, and x̂ of course depends on K, the Kalman gain. Technically this becomes a problem, because remember what happens when you do estimation: you go forward in time and say, I can compute all the Kalman gains I need from beginning to end; first time point, first Kalman gain; second time point, second Kalman gain; and so forth. In this approach, though, the K's are going to influence the e's, which become part of the value function. What Todorov showed is that the K's and the value function converge after a few iterations: you run the system forward in time to get the K's, you compute the value function backward in time, and then you run it again and again until it converges. I did it in your book for the condition where you're moving the eye and the head, just to check, and it works quite well: it converges after four or five runs through the system.
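Here is a schematic of that alternating iteration, under strong simplifying assumptions: the backward pass just reapplies the one-step formulas derived above with a constant per-step state cost, omitting the W^e recursion and its coupling through K that the full Todorov (2005) treatment carries; the forward pass is a standard Kalman covariance recursion whose process noise includes the control-dependent term, approximated from a noise-free rollout under the current gains. So this shows only the shape of the procedure, not the full algorithm.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
H = np.array([[1.0, 0.0]])
Qx, Qy = 1e-4 * np.eye(2), 1e-2 * np.eye(1)
L = np.array([[0.01]])            # motor cost
C1 = np.array([[0.5]])            # motor-noise slope
T = np.diag([1.0, 0.1])           # per-step state cost (held constant here)
p = 20

G = [np.zeros((1, 2))] * p        # initial guess: zero control gains
for it in range(5):
    # Noise-free rollout under current gains, to approximate u u^T.
    x, UU = np.array([1.0, 0.0]), []
    for k in range(p):
        u = -G[k] @ x
        UU.append(np.outer(u, u))
        x = A @ x + B @ u
    # Forward pass: Kalman gains, with control-dependent process noise.
    P, K = Qx.copy(), []
    for k in range(p):
        Kk = P @ H.T @ np.linalg.inv(H @ P @ H.T + Qy)
        K.append(Kk)
        Q_eff = Qx + B @ C1 @ UU[k] @ C1.T @ B.T
        P = A @ (P - Kk @ H @ P) @ A.T + Q_eff
    # Backward pass: control gains, reusing the lecture's one-step formulas
    # (the W^e recursion and its coupling through K are omitted here).
    W = T.copy()
    for k in range(p - 1, -1, -1):
        Cx = C1.T @ B.T @ W @ B @ C1
        G[k] = np.linalg.solve(Cx + B.T @ W @ B + L, B.T @ W @ A)
        W = T + A.T @ W @ A - G[k].T @ B.T @ W @ A
    print(it, np.linalg.norm(G[0]), np.linalg.norm(K[0]))  # watch them settle
```

In the full version the control gains also depend on the Kalman gains through W^e, which is what makes the back-and-forth genuinely iterative rather than settling after a pass or two as this stripped-down sketch does.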
Question: about C^x and the forward pass. C^x depends on the noise, but if you have signal-dependent noise in x, like you said, then you need to know where x is in order to compute the corresponding quantity, which means you have to run a forward iteration of the model to get x before you can run the backward iteration with the Bellman equation. That seems circular.

Yeah, you're right, in the sense that until you get to a state you don't know what the feedback gain is going to be: the noises are going to influence the size of the feedback gain. C^x captures the noise that scales with u, and there's a corresponding quantity that depends on the size of x through the D matrices; that one influences K, the Kalman gain, and that indirectly influences the e's we have there. It's an excellent point. You can see that the optimal u (there's G, up there) depends on C^x, and C^x contains the C_i matrices, whose noise contribution grows with the size of the command. But just to be clear, it isn't the actual value of u that matters; what matters is how the noise grows with u. The C matrices don't depend on u; each one is just the slope of that line. So it's not circular.

All right, this is where we are with the Bellman equation and signal-dependent noise. That's as far as optimal control with signal-dependent noise has gone, and as I said, this derivation is now about nine years old.

Question: the Kalman estimation already gives you the state estimate, so how does this representation of the value function in terms of the error actually help with the control? You never have access to the actual underlying x, so it seems a little odd.

It's important because when we put u into the Bellman equation, when we say the value of the policy is the cost per step plus the expected value of the next value function, that expected value depends on u, and u depends on x̂. And that expected value, which is what we just computed, turns out to be a quadratic function of x and e. The importance of this is that it shows that no matter what step we're at, if we have the value function associated with the optimal policy and we minimize the Bellman equation, we end up with another value function that is still quadratic. So in your recipe you're going to see that W^x at step k is some transformation of W^x at step k+1, W^e is some transformation involving K, and G is a function that depends on W^x and W^e.

Question: so basically the fact that the minimization transforms quadratic value functions into quadratic value functions is what lets you keep that same linear form?

That's right. Before, when we didn't have x̂, all we had on the left side was a quadratic in x, and when we applied the optimal policy we still got a value function quadratic in x, so the scheme was consistent with itself. The only thing that was new today is that u depends on x̂, not on x. To handle that, we had to introduce the concept of an error in state, and as long as we do that, we get a value function that's quadratic in x and in the error in state.
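Finally, a small sweep to illustrate that closing point: the slope c, not the command itself, is what enters the gain, and the larger the slope of the signal-dependent noise, the smaller the feedback gain. Same illustrative matrices as before.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Tp = np.diag([1.0, 0.1])
L = np.array([[0.01]])

for c in [0.0, 0.5, 1.0, 2.0]:
    C1 = np.array([[c]])
    Cx = C1.T @ B.T @ Tp @ B @ C1
    G = np.linalg.solve(Cx + B.T @ Tp @ B + L, B.T @ Tp @ A)
    print(c, np.linalg.norm(G))   # gain magnitude falls as the slope c grows
```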