Welcome back for the second lecture. OK, welcome everybody back. I see that I've scared away some of you — maybe it's a nicer day today than yesterday, so maybe that's the reason some of you are not here. Today and tomorrow I will talk about control theory, and this is a very important theory for understanding the brain. We can think of the brain as a sensorimotor machine: sensory input comes in and motor actions go out. That breaks down into perception — a stimulus comes in and we have to do pattern recognition, and we may want to use a deep neural network for that — and action: we want to reach for a cup, say, or pick something up, and we have to compute a motor program that does this. But these two problems are not in isolation. Perception causes action: based on what we see — a nice piece of food, say — we pick it up. But action also causes perception: if I look this way, I see this; if I look that way, I see something else. So depending on the action I take, I get a different perception. There is an interplay between the two, and much of it is learned. Take for instance this little child here, engaged in play. Play is very important for understanding this whole sensorimotor integration: you see something, you give it a push, something happens, you laugh, and you build it up again, you stack the blocks again. And you learn all kinds of things about physics — for instance, if you have a stack of four blocks and you remove the lowest one, the whole thing falls over. Now, isn't that wonderful? This is something you learn. You not only learn the physics of the blocks in the outside world, but you also learn what your arms are doing, because in the beginning your brain is wired to your muscles, but your brain doesn't know how it's wired to the muscles. You can fire these neurons and something will happen. And so, lo and behold, you do these experiments, which we call play, and out of that play we somehow learn a motor program. OK, so separately we understand perception and action to a certain extent. Perception is something like Bayesian statistics — or, Bayesian statistics or not, a feedforward neural network, like the perceptron that I told you about yesterday. We pretty much understand that, and there's a lot of theory about it: information theory, maximum entropy, all these principles that are very well known and very well studied. They have a solid engineering background, and there's a good match between what the brain does and what we know from engineering. Learning there is parameter estimation. But action — what is action? It could be control theory; that's what I think. Now, there are different types of control theory. There is so-called adaptive control, in which you have a controller that already basically tells you what to do, and you just adjust a bunch of its parameters. That is the kind of control you see in many industrial plants, and it is of limited use for describing the control situation we are facing here with this child.
The richer class of control theory is called optimal control theory, in which you basically model the whole space and define the whole problem as an optimization problem, and this is what we're going to be talking about. But, as we will see, it is intractable, meaning that the computation required scales exponentially with the problem size. Furthermore, optimal control theory has funny features: it has to compute backwards in time — you have to solve equations in the reverse order of time, which seems very counterintuitive for a biological process — and there is the question of how to represent different control strategies. Anyway, there are a lot of problems there. So we have some theory, some idea, about how to do control. But how to integrate control and perception — how to do this whole sensorimotor integration — about that we actually have very, very few ideas. The sensing depends on the action, as I just told you in the example, so setting up a theory that integrates all these things is very difficult.

Also, consider the features you may want to learn for a certain task — again, this child here. In a sense they depend on the perception: you can look for the things that are most dominant in the visual input and say, I'm going to learn these features. But there are also features that may be very pertinent for the control task you're trying to execute. So how do you link these two? How are you going to learn, from the visual system, the features that you need for the motor system? Then there are problems with action hierarchies. Children first learn how to crawl, then to walk, then to run. You have a hierarchy of skills, and you use the basic skills as ingredients for higher-level learning. In cognitive tasks we see the same: you learn some basic mathematics and then you become an expert by combining elementary things over and over. And we have basically no theory of how to do this hierarchical learning — how to break a learning task down into elementary, atomic motor primitives, as they are called, and then combine these in a flexible way. So these are all open problems, and I'm not going to tell you the answer to any of them here. But I hope in these two lectures to give you a framework, this path integral control theory, which addresses some of the issues. For one thing, it addresses how to do learning in these systems, as we'll see. It will also deal with the intractability issue and with uncertainty. The uncertainty issue is the most important one: if you want to do learning in these systems, you have to be able to deal with uncertainty, because your learning system, your controller, is going to be very bad at the beginning. You have to build a controller based on a very poor understanding of the situation, so you have to compute control actions that are adapted to that very uncertain situation. Therefore you need control mechanisms that can deal with uncertainty, and that is what these path integral methods can do, as I hope to show you.

There's also a philosophical bit here, which is my pet theory of consciousness. If you think about it, the neural activity in the brain is actually of two types. There is the neural activity that is driven by the senses.
This is your Bayesian processing, right? I see a chair; I have some preconceived concept of a chair; I do a sort of template matching — Bayesian template matching — of the sensory data that comes in against what I think a chair is, and if the fit is better with the chair than with the dog, I say this is a chair. So that is one kind of neural activity, happening in a recurrent neural network in some complex way that Jim DiCarlo knows much better than me how to explain. There the neural activity depends on the stimulus and the internal model. But the internal model is also responsible for building our actions. We run internal simulations; we have an internal model of the world. We were all children once; we all had this block-world model; we know what happens when we remove the lower block. So we have this internal model of our world, embedded in some neural network structure. In a sense we have a small world in our head, which is a simulation of the world out there. In essentially all the situations we engage in, we know more or less what is going to happen: we have a predictive model of the world outside us, in our head. And that predictive model allows us to set up little what-if experiments. What if I did this — what would happen? If I did that — what would happen? Based on these simulations we get some statistics; some outcomes are good and some are bad, and based on that we make a plan. In fact the story I'm telling you, these kinds of simulations, is very much what path integral control is going to be doing, as you will see tomorrow. So there are these two kinds of activity. The models in our head play two roles. One is a recognition task, where the model is used in a Bayesian sense: I'm reading a text, I see a bunch of letters, and I fill in the rest, because I have preconceived ideas about what the text is going to say. That is Bayesian perception. But at the same time this world model serves as a simulator: it lets us generate and evaluate possible actions that we might take in the future. Keep that in mind — I'm going to come back to it later.

OK, so let's talk about optimal control theory. What is it? Here is the essence of it. There's a big giant; he has a tree; he wants to hit a little guy. We have to make a sequence of actions that hits him right on the head. The task is to find the sequence of actions that minimizes two things. One is the energy consumption of the movement — you want to do it as efficiently as possible. And secondly, of course, you want to hit the target, not land next to it. So the cost is made up of two parts: a path cost, the total integrated cost along the way, which may be energy, or speed if you want to be as fast as possible; and an end cost for reaching the target. In essence, the control computation, given some model of the environment, computes the sequence of actions that gets you to the goal. Now, in some environments there is uncertainty — there's noise. I tend to explain this with a small anecdote about a spider that wants to go home, and there are two ways to go home.
He can go over the bridge here, on the left — I made this drawing myself, I'm very proud of it, by the way — or walk around the lake to go home. These are the two possible plans he can make, the two possible actions. Now, if there's no noise in the world, he just takes the simplest, cheapest, fastest route: go over the bridge and be home, end of story. That is the deterministic optimal control computation. Now suppose the spider has been drinking at night and wants to go home. Then, as you may have experienced yourself, there is a certain uncertainty in your ability to actually go straight. And if you do that on this rickety bridge, there is a real chance that you fall in the water. So if you do the expectation computation, the expected cost of crossing the bridge may be very, very high, and the smart drunken spider decides to walk around the lake instead. What is the moral of the story? The moral is that if you add a little bit of uncertainty to your problem, the solution can change qualitatively. Adding a little noise to the problem does not mean you get a slightly changed solution: you can get a drastically different solution. Here the analogy with physics is very revealing, because noise is, of course, temperature, and so we have a high-temperature solution, the noisy one, and a low-temperature solution, the noiseless one. And we can think of phase transitions: water at high temperature is a liquid, and at low temperature it is a solid — two qualitatively different solutions in the high and low phase. The same thing here: you can imagine there is a critical amount of noise at which you get the transition from going over the bridge to going around the lake. You get a phase transition in the solution space. OK. This is important for exploration — when you are very uncertain about the environment, when you don't know the environment, it's very important to have control in this noisy setting — and also for learning, as I mentioned before.

Now, in control I discern three very hard problems. One is the motor babbling that the child was doing. He's just sitting in front of a pile of blocks, figuring out what the physics of the world is, like a little Newton. He's playing with the blocks and seeing what happens. He learns that you don't put a block on a corner of another one — it falls over; you have to put it flat side on top. All these basic physics things: some things slide, some things don't. So: make a model of the world. This is a very complex task — learning and exploration. And exploration means that you're not going to learn everything. Exploring means you're going to look at certain things, but not at all things, because if you were exhaustive you'd be dead before you had a sensible model of the world. You have to make choices; that's what exploration is about. So these are very hard problems. The second problem: once you have that model of the world — once you know your plant, you know that if you turn this knob, this goes up, and if you turn that knob, that goes up —
now you're in the business of planning. You have to find your optimal plan: you have a certain initial condition, you have the physical model of your plant, and you want to reach a certain end goal. How do you get there? That is the second problem, and it is very hard, because in the stochastic setting it scales exponentially in the dimension of the problem. But then there is a third problem. If you have solved the first two, you have the solution, which is your optimal control: it tells you, if I'm in this position, I do this movement. And it is this very small thing, u(x, t). u is the control — some vector that steers all the joints in your system — and x is the state, the external state of your body, the pose you are in, maybe velocities too. And this mapping, although it looks like a very small thing here on the slide, can be a massive object. x can be 100-dimensional, and u can also be 10- or 100-dimensional. So you have a mapping in very high dimensions, and it's not a Gaussian, it's not something simple, it's not a table — it's a function, a very high-dimensional object. How to represent that in a machine is a very big problem. So these are the three big problems. We're mainly going to talk about the second one, but we're also going to talk a little about the third, how to solve it with neural networks, later on.

Now, the idea of path integral control is essentially to express the control computation as an inference computation — I will put formulas to what that means. Whereas normally a control problem means you have to solve some differential equations, the approach here is that for a certain class of control problems you can find an expression for the solution as a path integral. So the optimal control equals some expression, in closed form, and then you have to evaluate it. But the expression is hard to evaluate: it is a path integral, and it is evaluated using some sort of Monte Carlo sampling approach. Now, the sampling you use can be more or less efficient — let me not dwell on that, because we'll come back to it. We're going to accelerate the sampling by something called importance sampling, and the controls that you compute are actually proxies for the way you sample. If you want to go this way and you're sampling in that direction, you're not doing very well: your sampling is very inefficient. If you can steer your sampling procedure towards the direction you want to go, you're doing much better. So there is an agreement between the optimal control that you want to compute to solve your control problem and the optimal control that you use in your sampling — they go hand in hand, and you get a sort of bootstrapping approach that gets better and better. And in order to get this good controller, the controller can be learned, so you have some neural network that represents it. The overall picture is going to look like this — and it's going to be very hand-wavy at the moment, because we'll come back to this whole theory tomorrow; I just want to give you a forecast of it.
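To make this forecast a little concrete, here is a toy sketch, in code, of the loop about to be described: sample noisy rollouts around the current plan, weight each rollout by how cheap it is, and use the weighted samples to improve the plan, which then steers the next round of sampling. Everything in it — the 1-D dynamics, the cost, the parameter values — is made up for illustration; it is not the lecture's actual algorithm, whose details come tomorrow.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, dt, lam, sigma = 50, 200, 0.1, 1.0, 0.5
u = np.zeros(T)                          # current open-loop plan

def rollout_cost(eps):
    """Cost of one noisy rollout: control path cost + end cost (reach x = 1)."""
    x, c = 0.0, 0.0
    for t in range(T):
        x += (u[t] + eps[t]) * dt        # noisy controlled dynamics
        c += 0.1 * (u[t] + eps[t]) ** 2 * dt
    return c + 10.0 * (x - 1.0) ** 2     # end cost: penalty for missing

for it in range(20):
    eps = sigma * rng.standard_normal((K, T))      # exploration noise
    S = np.array([rollout_cost(e) for e in eps])   # cost of each sample
    w = np.exp(-(S - S.min()) / lam)               # cheap paths weigh more
    u += (w[:, None] * eps).sum(axis=0) / w.sum()  # reweighted plan update

print("cost of the learned plan:", rollout_cost(np.zeros(T)))
```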
You get some samples, which give you data; from the data you learn a controller; that controller is used to get better samples; with the better samples you learn a better controller, which gets better samples again, and so on — a feedback loop. And this theory is about how to learn these controllers for these systems. So that's a bit of looking ahead. The outline: we're going to look at standard control theory today — discrete-time control theory, then continuous-time control — then the stochastic case, and then path integral control tomorrow. On the slide you'll find some material that you can look at.

OK. Control problems are what are also known as delayed reward problems. Normally, if you want two neurons that fire together, the first neuron is in front of the second one, and if there's a positive correlation you strengthen that link — something like that. Or in a feedforward neural network you have an input, you have an output, the difference is backpropagated and gives you a change of the weights — but it's all at the same time. In control you do a sequence of actions, and at the end of the day you hit the target, yes or no, and depending on that you evaluate the whole path that you took. That is called delayed reward: the reward comes later, and that is the main difficulty of control. Control problems arise of course in motor control, in biological systems or in robots, where you have to control a plant. But you can also think of finance, where you have a portfolio, your actions are to buy and sell certain products, and your control task is, say, to double your income within a year, given a model of the environment and this buy-and-sell strategy. That's also a control problem. In fact, learning — our whole life — is a control problem. Think of it: we live for about 80 or 100 years; we are born at the beginning and we die at the end, and we have to make a plan for that. What can we do in this plan? We can work, we can go to school, we can do all kinds of things. And the sensible way to plan that life is to first get your education and then work. You could also plan it the other way around — first work, then get your education — that's no good. You could start working very early and work your whole life: also not very good. You could study your whole life and never work: also not very good. So there is an optimal way to arrange study and work in a life, and you can view this as an optimal control problem, where the only purpose of learning is to exploit it later: if your reward function is, say, to earn the maximum amount of money, then your studying is only there to get you the knowledge that gets you the money. For cognitive systems this is also the case. For instance: you're driving in a car on a narrow road, and your steering wheel is connected to the wheels, as it should be, but you don't know how it's connected — maybe when you turn right the wheels go right, or maybe they go left.
So this is a dangerous situation. Now, learning how your steering wheel is connected to the wheels is not your objective — your objective is just to stay on the road. But in order to fulfill that objective you have to learn this other thing, how the wheel is connected, and to learn that you have to make certain jerks on the steering wheel to figure out how it works and get that information. Once you know it, you can incorporate it in your planner to stay on the road, because now you know how it's connected. So learning the internal model may itself be an action that you have to consider as part of solving the task. This is all very complex, and we're not going to touch on much of it, but I want you to think about it.

Now, the types of control problems we can have. We can have finite horizon control problems, where the dynamics and the environment may explicitly depend on time, and the optimal control also depends on time: if you're close to the horizon, the strategy may be different than when there's still a long time to go. Think of your life: if you're almost at the end, you do different things than when you're just born. You can also consider a moving horizon: you have a finite horizon, but the horizon moves every time you take an action. This is what happens when you're biking: you make a plan for, say, three seconds ahead; you optimize over those three seconds but execute only the first action; and in the next iteration you solve that same problem again, with the horizon moving forward with you. So that is also a finite horizon, but a moving one. You can also have infinite horizon problems. The best known is discounted reward, where you say: the gain that I get immediately, on short notice, is more important to me than the one I get next week. So you have a discount factor, for instance $\gamma^t$, which is what you see in reinforcement learning, and the exponential decay makes sure the sum over infinite time is still finite. You can also drop the discount factor and look at total reward; in that case you need something to keep the total finite, such as an absorbing state that ends the game. Or you can look at average reward over an infinite horizon, where you divide by the time interval and take the limit. The funny thing about average reward is that it's sort of counterintuitive, because it doesn't care what you do on short notice: whether I first do something wonderful and then nothing, or first do something horrible and then nothing, makes no difference to the average over infinite time. So it's a somewhat counterintuitive kind of reward. Other issues: we can have discrete versus continuous state, discrete versus continuous time, the states can be fully or partially observable, and of course there can be noise in the problem. Now, we're going to start with the simplest case — discrete state, discrete time, fully observable, no noise — build our intuition from that, and then extend the idea.
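For reference, here is a compact statement, in formulas, of the finite horizon problem we're about to set up. The symbols ($f$ for the dynamics, $R$ for the path cost, $\phi$ for the end cost) follow the usage in the rest of the lecture, but the notation itself is my reconstruction of what is on the slides:

```latex
\begin{align}
  x_{t+1} &= x_t + f(t, x_t, u_t), \qquad t = 0, \dots, T-1,\\
  C(x_0, u_{0:T-1}) &= \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t),
\end{align}
```

and the control problem is to minimize $C$ over the control sequence $u_0, \dots, u_{T-1}$.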
So here we go. The state x is discrete — think of a grid world where you number the grid points — and time is discrete. In each step we have a dynamics: from the current state we go to the next state via a function which may explicitly depend on time, on where we are, and on a control signal that we want to compute. This is in the finite horizon setting, so we do this from t = 0 to T - 1 steps. If we specify the initial state and the sequence of controls, we can of course compute the whole state sequence: we start in x0, take control u0, arrive in x1, take control u1, and so on. Now the control problem says: I want to optimize a cost function composed of two parts. One is the path cost, the cost I incur while executing the path; and then I end up in an end state, and there is a cost associated with ending up there. This is just a function of my controls, because given my controls the states are determined. The control problem is to find the sequence of u's that minimizes this cost. That is the optimal control problem — everybody with me?

Here's an example of what such a control problem may look like. Say I want to go from A to J, and there's a whole bunch of paths I can take through these intermediate states. Think of the horizontal direction as time — four time steps. In this state I can take three actions, go here, there, or there, and each choice has a cost — two, four, one, whatever the costs on the slide are. I go here, then this path costs one, this path costs three, this path costs four, and I get a total cost for the trajectory. Now I want the sequence of steps: I'm here now, and I want to get there in the cheapest way, and at each state I have to decide where to go next; the choices I make determine my total path. The simplest way to solve this is just to list all the possible trajectories I can take and pick the best one — that solves the problem. But that is of course very complex: if at every step I can make, say, two choices, and I have T steps, then there are 2 to the power T possibilities. This explodes in your face, and it's typically not doable. But I can do something else, which is called dynamic programming. I start at the end and say: well, if I'm here, there's no choice, so I have my cost-to-go — I use this notion of cost-to-go, which is the cost from my current state to solve the rest of the problem optimally. My cost-to-go from H is 3, my cost-to-go from I is 4. From F, I can go this way or that way, right?
So my cost-to-go is 6 plus 3, or it's 3 plus 4; so my cost-to-go from F is actually 7 — that's the best I can do. I can do the same for G and E, and I can compute the cost-to-go in any of the intermediate states in terms of the optimal cost-to-go from the next ones. This is the notion you have to get. Let me give an example with cities. If I want to go back home to Amsterdam from Trieste, I can go either through Milan or through Rome. I need to know how far it is from here to Milan, and then I need to know the optimal way to go from Milan to Amsterdam. There are a million ways to get from Milan to Amsterdam, but suppose I've solved that problem and know the optimal route — and the same for Rome. If I know the optimal cost-to-go from Rome to Amsterdam and from Milan to Amsterdam, I can figure out the optimal way to go from Trieste to Amsterdam, because I can combine them. This recursion is dynamic programming.

In formulas: this J is called the optimal cost-to-go. You take the original cost function that we had, but instead of starting at t = 0 you start at some intermediate time in some intermediate state; J at that intermediate time and state is the cost of solving the remaining problem optimally, where the optimization is over the remaining controls, the tail of the control sequence. If you take t equal to the end time, the sum is empty and you just have the end cost; if t equals 0, you recover the original control problem. So we initialize J at the end and do a recursion backwards in time to get the solution — that is dynamic programming. How does it work? Take the definition of J as before and split off the current time from the rest: the minimization splits into a minimization over the current control and a minimization over the remaining ones, and the sum splits into the term at s = t and the terms from t + 1 to T - 1. The current-time term is not affected by the minimization over future controls, so it can be pulled out; the remaining part, which does depend on that optimization, we recognize as J again, but now evaluated at t + 1, at the state x_{t+1} that we reach — and x_{t+1}, by the dynamics, is x_t plus f. So you get a recursive relation, and this is called the Bellman equation. It is an equation in time and state: for each time and state you solve this minimization, and you get an optimizing value u, depending on x and t, which is your optimal control at time t in state x. You have to do this at all times, in all states.
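Written out — with the same reconstructed notation as above — the Bellman equation and its boundary condition are:

```latex
\begin{align}
  J(T, x) &= \phi(x),\\
  J(t, x) &= \min_{u}\Big[\, R(t, x, u) + J\big(t+1,\; x + f(t, x, u)\big) \Big],
\end{align}
```

and the minimizing $u$ at each $(t, x)$ is the optimal control $u^*(t, x)$.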
As an algorithm it looks as follows. You start by initializing J with the end cost. Then, stepping backwards, at each time and state you look for the u that minimizes the right-hand side — you can evaluate the right-hand side because you already know J at the next time step — and the value you obtain is the optimal cost-to-go at the current time and state. You do that all the way back to time 0. At time 0 you know you are in state x0, and you know what the control in x0 is, because you have computed the control for all times in all states, so also for the current one. Now you can make a forward step where you take that control — this is how you go forward. So in the backward step — going back to this picture — you actually have to visit all states at all times, while in the forward step you take only one trajectory through it, the optimal trajectory. It may seem like a waste to compute the cost-to-go in all states in the backward step, but that's the price you pay for dynamic programming: you don't know in advance where the optimal trajectory is going to be, because you still have to construct it, so you have to compute the cost-to-go everywhere. OK, so that is the Bellman equation.
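Here is a minimal sketch of the backward and forward passes just described, on a made-up little problem in the spirit of the A-to-J graph — a small layered grid where the "control" is simply the choice of next state:

```python
import numpy as np

rng = np.random.default_rng(1)
T, S = 4, 5                             # time steps, states per step
R = rng.integers(1, 7, size=(T, S, S))  # R[t, x, x'] = cost of step x -> x'
phi = rng.integers(0, 5, size=S)        # end cost

J = np.zeros((T + 1, S))
best = np.zeros((T, S), dtype=int)
J[T] = phi                              # initialize with the end cost
for t in range(T - 1, -1, -1):          # backward pass: visit ALL states
    q = R[t] + J[t + 1][None, :]        # q[x, x'] = step cost + cost-to-go
    best[t] = q.argmin(axis=1)          # optimal control at every (t, x)
    J[t] = q.min(axis=1)                # optimal cost-to-go

x = 0                                   # forward pass: ONE trajectory
path = [x]
for t in range(T):
    x = best[t, x]
    path.append(x)
print("optimal cost:", J[0, path[0]], "trajectory:", path)
```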
OK, so what happens if we add noise to the problem? One way to add noise is to say: my next state depends on the three variables I had before, but also on w, a random variable — I don't know its value, though I may know its distribution. An example of such a dynamics is a random walk, Brownian motion: the simplest example would be x_{t+1} = x_t + w_t, where w_t is a bit that is plus or minus 1 with equal probability, say. The dynamics is then the sum of these increments: if you start at 0, you get a bunch of plus ones and a bunch of minus ones, and that's where you are after a certain time. So your state becomes a random variable, and therefore the cost also becomes a random variable. Saying "I want to optimize the cost" is now a problematic thing, because the cost itself is a random variable, and you cannot optimize a random variable. What we can optimize is the expectation of a random variable, or some statistic of it. So that's what we do: we put brackets around the cost, meaning we take the expectation, where x_t is now random because it depends on w. I have also added a random component to the cost R itself, which we are going to use in the next example. Written out, this is an expectation with respect to the whole sequence of random variables w over the interval from t = 0 to T: I take the value of the cost and sum over all the possible realizations of these random variables, and that gives me the expectation value. So we are not optimizing the cost we will actually incur: in any given run we go from state x0 to x1 to x2 and so on, partly determined by the noise in the system, and at time zero we cannot know what w values we will see in the future. The best we can do at time zero is to compute the optimal control in the average case — that is, for the expected cost. Since we don't know the future, we can only optimize expected values; there is nothing else we can know.

Now, what is called closed loop control, to be distinguished from open loop control, is that you find a control which is a function of time and of the state — in fact we already saw this in the previous, deterministic case. This is your control function, as before, and what is referred to as a control policy is this whole collection of functions for all times; you can say the whole function u(x, t) is the policy. The optimal policy is then the following: with the noisy dynamics, and with a particular control function u(x, t) plugged in — call this control function pi — our expected cost has a certain value, and the optimal policy is the particular control function that minimizes this expectation value. That's it. It is a very hard program to carry out, but conceptually it is quite simple. Now, this idea of open loop and closed loop control I should explain a little. In the deterministic case we also got a solution that depends on time and on the state, so in that sense it is also a closed loop control. But since there is no noise — the system is deterministic, we go from x0 to x1 to x2 and so on — we can write u(x_t, t) purely as a function of t, because x_t is determined by the initial condition and the previous controls. Given the initial condition, the whole thing just becomes a sequence in time: there is only one optimal trajectory, and along it only one control sequence. The dependence of the controller on the state is something you are never going to use, because you are never going to deviate from the optimal trajectory. And when there is no dependence on the state, this is called an open loop controller. It is sort of blind: you say, OK guys, here is the plan, we first go right and then left, and we don't look anymore at what is actually happening — there is no deviation from the plan, there is no noise, we know exactly what will happen. That is open loop control. In general, when there is noise, a fixed control sequence is not enough: you may say "I go straight ahead", but if the wind blows from the left, you drift left, and your next action should then depend on the state, not only on time. That is closed loop control: it takes the sensory feedback into account. And that is the control policy.

OK, so let's take this stochasticity into account and put it into the Bellman equation. Remember the Bellman equation from before; now we put the stochastic element in: the cost depends on the noise, and the dynamics also depends on the noise.
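In symbols — same reconstructed notation as before, with angle brackets for the expectation over the noise — putting the noise into the Bellman equation gives:

```latex
\begin{align}
  J(T, x) &= \phi(x),\\
  J(t, x) &= \min_{u}\,\Big\langle R(t, x, u, w_t)
             \;+\; J\big(t+1,\; x + f(t, x, u, w_t)\big) \Big\rangle_{w_t}.
\end{align}
```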
If you repeat the analysis, all that happens is that you have to put expectation values around the whole right-hand side — that is the upshot of dealing with the uncertainty. We now take the minimum of this expectation value, and we still have the boundary condition given by the end cost. This is the stochastic Bellman equation that we have to solve.

So let's look at an example. I have a shop selling Italian ice cream, but it is a very small one: I can have only zero, one, or two ice creams in my shop. t labels the days — day one, day two, and so on. At the beginning of a day I have a number of ice creams in my shop, and I can order more, but I can never hold more than two: if I have zero ice creams I can order at most two, if I have one I can order at most one. So my control u is what I order, and it is at most two minus the number that I have. Then during the day the shop opens, and zero, one, or two customers come. I don't know how many, but from experience I know the probabilities: nobody comes with probability 0.1, one person comes with probability 0.7, two persons come with probability 0.2. At the end of the day, after I first stocked up and then waited for people to come and buy ice creams, I am left with some ice creams, and that leftover is my storage, which of course costs me money. Also, if demand exceeds supply — if more people come than I have ice creams — I cannot sell to them, and that's also a loss. In other words, at the end of the day I want my stock to be as close as possible to zero: if it would go negative, I could have sold more; if it is positive, I have to store, which is also a bad idea. So I have two costs. One is the cost of purchasing ice cream, proportional to the number I purchase, which is u itself. The second term is quadratic in x + u - w, which is x_{t+1}: the amount I have today, plus what I buy, minus what I sell. This is the cost per day, summed over a very short horizon of a couple of days, and the dynamics is that tomorrow's number of ice creams is today's number, plus the ones I buy, minus the ones I sell — and it cannot go below zero; anything below zero is simply lost. So this is a control problem. I start at time 3, where the horizon is empty: there is no cost term beyond it, so I initialize J at zero for all x3. Now I can solve backwards for x2, x1, and so on. Suppose we are at time 2 with stock x2. Then the cost-to-go is determined by what we order and how much we have left at the end. For J(x2), using the formula, we take the minimum over u2, which can be between 0 and 2 - x2 — that was the constraint on what I can buy — of the immediate cost plus the future cost J, which is zero here. So we just get this expectation value. And the expectation is not over x2 — that is a deterministic variable here — it means: 0.1 times the value at w2 = 0, plus 0.7 times the value at w2 = 1, plus 0.2 times the value at w2 = 2. So we get these three terms.
Then we have to minimize this expression over u2, and we can do that: for u2 = 0 we find 1.5, for u2 = 1 we find 1.3, and for u2 = 2 we find 3.1. So the minimum is at u2 = 1: we find that for x2 = 0 the optimal control is u2 = 1. If on day two I have no ice creams, I should order one — that is what this says — and the cost is 1.3. So this was all spelled out for this particular case, but now you can repeat it, and you get the following. At stage two, if we have zero ice creams we should order one; if we have one ice cream we should order zero; if we have two ice creams we should order zero — and we are left with the corresponding costs-to-go. That back-propagates to the previous time, so at the previous stage you can do the same, and you see again: with zero ice creams order one, with one order zero, and so on. It turns out that at all three stages you should do the same thing — but that's a feature of this exercise; it could have been different, because the optimal control is explicitly time dependent. In this way you solve this control problem: at the beginning of your long summer of two days you know exactly that if you have zero stock, you should order one. That is optimal — given, of course, that the model is correct. This is a very simple illustration of optimal control.
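The whole ice cream computation fits in a few lines of code. This sketch reproduces the numbers above (stage cost u + (x + u - w)², demand probabilities 0.1, 0.7, 0.2, stock capped at two); the max(0, ·) in the next state is my reading of "anything below zero is simply lost":

```python
import numpy as np

p = {0: 0.1, 1: 0.7, 2: 0.2}            # demand distribution
T = 3                                    # three stages, J = 0 at the horizon

J = {x: 0.0 for x in range(3)}           # cost-to-go at the horizon
for t in reversed(range(T)):             # backward pass
    Jnew, policy = {}, {}
    for x in range(3):                   # stock at start of the day
        best_u, best_c = None, np.inf
        for u in range(0, 2 - x + 1):    # can't stock more than two
            c = sum(pw * (u + (x + u - w) ** 2 + J[max(0, x + u - w)])
                    for w, pw in p.items())
            if c < best_c:
                best_u, best_c = u, c
        Jnew[x], policy[x] = best_c, best_u
    J = Jnew
    print(f"stage {t}: policy {policy}, cost-to-go {J}")
```

Running it prints the policy {0: 1, 1: 0, 2: 0} at all three stages, with cost-to-go 1.3 at x = 0 on the last day, matching the hand computation above.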
Here's another example, a case with two ovens. I have a certain material with initial temperature x0. I put it through a first oven at temperature u0, after which the temperature of my material is x1; then through a second oven at temperature u1, after which the temperature is x2. The dynamics acts on the temperature: the new temperature is a convex combination of the previous material temperature and the oven temperature, a mixture of the two, and this holds for both ovens. So we have a dynamics that is linear in x. For the cost we care about two things. One is that the final temperature is close to a certain target temperature x*; how much we care is quantified by a parameter r — if r is very large we care a lot, if r is very small we don't care so much. The second is the energy we spend in the two ovens, proportional to the squared oven temperatures. So we get these two terms, and this is the control problem we want to optimize. It is an interesting case, because the dynamics is linear and the cost is quadratic, both in the state and in the controls. We initialize J at time 2, for any temperature, as just the end cost. Now we compute J at time 1, for an arbitrary state x1, via the Bellman equation: the immediate cost, u1 squared — the oven temperature term — plus the cost-to-go J2, into which I substitute the dynamics, x2 = (1 - a) x1 + a u1. This gives a quadratic expression in u, and minimizing a quadratic form is very easy: the solution u is linear in x. I substitute this solution back, and you see that the result is again quadratic in x. So now I can repeat: I take this J, put it into the equation at the previous time, substitute the dynamics for x1 in terms of x0, optimize with respect to u0, again get a control linear in x, and a J that is again quadratic in x. In other words, the form perpetuates: you stay within the same family, and you can do everything in closed form. This is an example of a linear quadratic control problem, one of the best studied and best understood control problems, and it has the feature that you get a closed form solution for the control in terms of the parameters of the model. You don't get that in the tabular case, like the ice cream store, where you have to work through all cases; here you get it much more easily by differentiating and reading off the optimal solution.

Another nice feature of linear quadratic control is certainty equivalence, which means that the optimal control we compute does not depend on the presence of noise: if my problem were noisy, I would still compute the same optimal control. That is easy to see. Suppose the dynamics is the same linear dynamics but with added noise, and the cost is unchanged. Then in the computation of J1, when we substitute the dynamics, we get the noise term inside the square. Expanding the square, we get the term we had before, plus a double product term, plus a w-squared term. The double product term vanishes in expectation if the noise has mean zero, because it is linear in w; and the w-squared term is independent of u, so it doesn't affect the optimization. In other words, when we optimize this expression it doesn't matter whether there is noise: we get the same value of u1 that we would have gotten in the noiseless case. The optimal control doesn't depend on whether the plant is noisy or not. That's great, because often you don't know what the noise is. Now, this certainty equivalence holds in the linear quadratic case — and we saw a wonderful example of non-certainty-equivalence earlier today: the drunken spider, where with no noise you had one control and with noise a very different one. That is a clear violation of certainty equivalence, and it comes from the fact that the spider problem is multimodal — one corridor here, one corridor there, with a big barrier in between — whereas in the linear quadratic case everything is convex, unimodal, and that's why you get certainty equivalence. Any questions?
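Here is the two-ovens recursion in code. J stays quadratic, J(x) = P x² + Q x + C, so one backward step just updates three coefficients, and the optimal control is linear feedback, u = k x + m. The parameter values (a = 0.3, r = 10, target 100 degrees, initial temperature 20) are made up for illustration:

```python
# One backward Bellman step for dynamics x' = (1-a) x + a u and stage
# cost u^2, with J_next(x) = P x^2 + Q x + C. Setting the u-derivative of
# u^2 + J_next((1-a) x + a u) to zero gives u*(x) = k x + m; substituting
# back and collecting powers of x gives the new coefficients.
def backward(P, Q, C, a):
    b = 1.0 - a
    d = 1.0 + P * a * a
    k = -P * a * b / d                    # feedback gain
    m = -Q * a / (2.0 * d)                # feedback offset
    e = b + a * k                         # effective closed-loop factor
    Pn = k * k + P * e * e
    Qn = 2 * k * m + 2 * P * e * a * m + Q * e
    Cn = m * m + P * (a * m) ** 2 + Q * a * m + C
    return Pn, Qn, Cn, (k, m)

a, r, x_star = 0.3, 10.0, 100.0           # illustrative values only
P, Q, C = r, -2 * r * x_star, r * x_star ** 2   # end cost r (x - x*)^2
gains = []
for _ in range(2):                        # two ovens: stage 1, then stage 0
    P, Q, C, km = backward(P, Q, C, a)
    gains.append(km)

x = 20.0                                  # initial material temperature
for k, m in reversed(gains):              # forward pass: oven 0, then oven 1
    u = k * x + m                         # optimal linear feedback
    x = (1 - a) * x + a * u
    print(f"oven at u = {u:.1f}, material temperature -> {x:.1f}")
```

Note that the noise never appears: by the certainty equivalence argument above, running the same recursion on the noisy plant gives exactly the same gains (k, m).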
OK, so now we can take the continuous time limit. Actually, we have done the hardest part: if you understand the discrete case, the rest is sort of easy — the math may look a little scary, but conceptually it's very much the same. We had discrete time steps from t to t + 1; now we go from t to t + dt and send dt to 0. The dynamics takes the form x_{t+dt} = x_t + f dt, and the cost is now an end term plus an integral of terms. Take the Bellman equation from before, with x_{t+dt} in place of x_{t+1}: we get that J is the minimum over u of R dt — the single-step contribution — plus J evaluated at t + dt and x + f dt. If these functions are smooth, we can do a Taylor series expansion: J at (t + dt, x + f dt) is J at (t, x), plus dt times the time derivative J_t, plus the space derivative — the gradient of J with respect to x — times the change in x, which is f dt. Now J(t, x) appears on both sides, and since it doesn't depend on u we can cancel it. What remains is all proportional to dt, so we can divide dt out, and moving the u-independent term to the other side we are left with a partial differential equation for the scalar function J(x, t): a time derivative, some functions multiplying the gradient, and this nasty minimization inside. This PDE has to be solved with the same boundary condition as before: at the end time, J equals phi(x). This is the continuous-time Bellman equation, also known as the Hamilton-Jacobi-Bellman equation. The Hamilton-Jacobi part is there because there is a close tie between classical mechanics and control theory — you can say that classical mechanics is a special case of control theory, in the mathematical sense — and the equations arising in classical mechanics were derived by Hamilton and Jacobi in the 19th century, while the control formulation is due to Bellman, from the 1950s. That's why it carries both names.

This equation can be visualized as follows. Here is space at the end time, here is space at our current time, and here is phi(x), the boundary condition for J at the end time. What the equation does is morph this phi(x), which lives at the end time, into something else at earlier times, because you solve backwards in time. So the picture is: you start with this shape, you solve the equation, and for each time you get a J — at the end time it looks like phi, at earlier times it morphs into something else, and at our current time it looks like something different again. This is called an anticipated potential: there is a potential living at the end time, but that is not your concern — you want to know what to do now, not at the end. To translate that end potential to the current situation, you compute this anticipated potential, which is the solution J at the current time. The control that you compute can then often be viewed as a gradient flow in the anticipated potential: if you were at the end time you would take a gradient of phi, and now instead you have to take the gradient of this morphed shape. That is how you solve the equation — and yes, this is the difficult part: in principle this is a beast of an equation.
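To pin down the derivation just sketched: expanding $J(t+dt,\, x + f\,dt) \approx J(t,x) + J_t\,dt + f^{\top}\nabla_x J\,dt$ inside the Bellman recursion, cancelling $J(t,x)$ and dividing by $dt$ leaves

```latex
\begin{align}
  -\partial_t J(t, x) &= \min_{u}\Big[\, R(t, x, u)
        + f(t, x, u)^{\top}\, \nabla_x J(t, x) \Big],
  \qquad J(T, x) = \phi(x),
\end{align}
```

the Hamilton-Jacobi-Bellman equation, solved backwards from the end condition.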
In general you cannot solve this equation, and this is where you get stuck — particularly if you now add noise to the problem, because this is still not a stochastic problem; with noise it gets even harder. Or, no, I shouldn't say that — indeed, that is the topic of these two lectures. OK, let's do an example. Here we have a mass on a spring. There is a spring force acting in the z direction, equal to minus the position: if the ball is up, the force is downwards; if the ball is down, the force is upwards. That's the spring force. Then there is a control force u, and the total force, -z + u, equals m times a, the acceleration — this is Newton's law. We take the mass equal to 1, because it saves typing. So we have this dynamical system, and the question is: given the initial position and velocity, z(0) and z-dot(0) — the dot means time derivative, for those who don't know — both zero at time 0, find the control path, bounded in the interval [-1, +1], over a time interval from 0 to T, such that at the end time the position of the ball is maximal. We want to get it as high as possible; that is the control task. How do we do that? First of all, the dynamical system involves a second derivative, and we don't want that, because our theory is in terms of first derivatives. That is easily fixed: we introduce two variables, x1 = z and x2 = z-dot, the velocity. Then x1-dot = z-dot = x2, and x2-dot = z-double-dot = -z + u = -x1 + u. So we convert the one-dimensional second-order equation into two first-order equations — a standard trick. Now we have it in standard form: the change of x is a function of x and the control. The end cost: z has to be maximal, but we are in the business of minimizing, so we take phi = -z as the end cost to be minimized. And the path cost — how we get there — we don't care about; we didn't specify it, so R = 0. Now we take the HJB equation. With R = 0 and f in this two-dimensional form, the term f · grad J is f1 dJ/dx1 + f2 dJ/dx2, with f1 = x2 and f2 = -x1 + u. So we get x2 dJ/dx1 + (-x1 + u) dJ/dx2, and we have to minimize this with respect to u. Now, u was bounded in [-1, +1] — and it had better be, because the expression depends linearly on u, so with u unbounded it would go to minus infinity, and we don't want that. The minimal value depends on the sign of dJ/dx2: if it is positive we take u = -1, if it is negative we take u = +1. In other words, u = -sign(dJ/dx2), and the minimization replaces the u-term by minus the absolute value of dJ/dx2. This is the differential equation we now have to solve. It is still not easy to solve, but we have some hindsight — or an Ansatz, as the Germans would say.
Then we can say: OK, let's look for a solution that is linear in x, with time-dependent coefficients — phi1(t) x1 + phi2(t) x2 + alpha(t), say. Make this Ansatz. If you put it in — I don't have time to do it in full — you get x1-dependent terms, x2-dependent terms, and x-independent terms; you collect them all, and essentially this partial differential equation becomes three ordinary differential equations: one for phi1, whose time derivative is phi2; one for phi2; and one for alpha. These are the equations we need to solve. The boundary condition was that J at the end time equals phi(x) = -x1, so phi1 must end at -1, phi2 at 0, and alpha at 0. Now, phi1 and phi2 can be solved self-consistently — the pair of equations defines a cosine — and you can very easily verify the solution: at t = T it indeed gives phi1 = -1 and phi2 = 0. The control is minus the sign of the gradient of J with respect to x2, so in fact we are only interested in phi2; but to get phi2 we have to solve phi1 and phi2 together, so that's what we do. The result is u = -sign(phi2), with phi2 a shifted sine. If you now take T = 2 pi, you find that the optimal u is -1 in the interval from 0 to pi and +1 in the interval from pi to 2 pi. So the solution says: you first pull the mass down for half a period, and then you push it up for the second half, and that gives you the maximum deflection. And here is the solution: this is the position; it starts at the origin, and the control uses the spring force to arrive at the end time at the maximum position. So this is an example of how you solve such a control problem. You may wonder whether this is a very special case that happens to be solvable — and of course it is; in general these systems are very, very hard to solve. Let me see how I'm doing in time — I still have about half an hour. OK, any questions about this? So: for control problems we can set up a Bellman equation in continuous time, and we can solve it if we know the dynamics.
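A quick numerical check of this example — a sketch that integrates z'' = -z + u with the bang-bang control found above and, for comparison, with a constant push; the time step and the comparison control are my choices:

```python
import numpy as np

def final_height(u_of_t, T=2 * np.pi, dt=1e-4):
    """Integrate z'' = -z + u(t) from rest and return z(T)."""
    z, v = 0.0, 0.0
    for t in np.arange(0.0, T, dt):
        v += (-z + u_of_t(t)) * dt      # Newton's law, unit mass
        z += v * dt
    return z

bang_bang = lambda t: -1.0 if t < np.pi else 1.0   # pull down, then push up
push_only = lambda t: 1.0                          # naive: push up throughout

print("bang-bang:", final_height(bang_bang))   # about +4
print("push-only:", final_height(push_only))   # about 0: no pumping
```

Pumping against the spring for the first half period lands the mass at height about 4, while pushing up the whole time brings it right back to 0 at t = 2 pi.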
Now let me go on. One thing we found is that this J has to solve a partial differential equation over all of space, while in the end we only need one trajectory, because we are in a noiseless situation. The picture is this: here again we have the end condition, the φ at the end, and the anticipated potential that we compute for all states; but our initial position is maybe here, and our optimal trajectory is going to be this red line from the current state to the end. So the question is: can we avoid this partial differential equation and compute the red line directly, without solving over all of space?

You can do that, and let me show it briefly. Those who have difficulty with this part may now go to sleep; those who follow can continue to follow, and I will wake you up in about five minutes and we will take it from there.

So what we do is set up an optimization problem: we want to minimize the cost, as before. Normally we would say we have the dynamics and write down the Bellman equation, but now we do something else: we treat the trajectory of controls and the trajectory of states as two independent sets of variables, and we optimize over both. Of course they depend on each other: if I have x(0) and I apply a control for a time dt, that gives me x(dt); another step gives me x(2dt), and so on. So it is clear that once I specify all the controls, all the dynamics is determined; there is a dependence between the two sets of variables. The way I implement that is with a constraint: the dynamics sets up a constraint between these variables.

Who knows about Lagrange multipliers? That is the part of the audience that is not asleep, I guess. Lagrange multipliers allow you to enforce constraints, and I will just present them as a trick. Suppose you want to minimize some function c(x), but subject to g(x) = 0, so x has to lie in a certain set specified by the constraint. This can be written equivalently as a minimization over x and a maximization over λ of c(x) + λg(x). Introducing the Lagrange multiplier makes sure that everywhere where g is non-zero, the maximization over λ pushes the value to infinity; when you then do the minimization over x, all those points become irrelevant, since everything that can be pushed to infinity is excluded, and so you actually minimize over the set where g is zero. That is the idea of a Lagrange multiplier.

Now we put Lagrange multipliers on the constraint ẋ = f(x, u, t). This constraint holds at each time, so we get a multiplier which is a function of time: for each time, the difference ẋ − f has to be zero, and that difference gets a multiplier λ(t), integrated over all times. Then I define, just as an abbreviation, the Hamiltonian H = −R + λᵀf; splitting things up this way, the quantity we want to optimize contains a −H term and a λᵀẋ term. And now we optimize with respect to the whole trajectory x, the whole trajectory u, and the whole trajectory λ: we take derivatives, set them all equal to zero, and look for a stationary point of C, a point where the variation with respect to all these changes vanishes.
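In symbols, here is a compact restatement of the construction; the notation C, R, f, φ, λ is the lecture's, and the sign convention for H is the one that matches the equations used for the spring example below.

```latex
% The multiplier trick for a static constraint:
\min_{x:\; g(x)=0} c(x) \;=\; \min_x \max_\lambda \; \big[\, c(x) + \lambda\, g(x) \,\big]

% The controlled cost with a time-dependent multiplier lambda(t)
% enforcing the dynamics xdot = f at every time:
C = \phi\big(x(T)\big) + \int_0^T dt \,\Big[ R(t,x,u)
      + \lambda(t)^\top \big( \dot x(t) - f(t,x,u) \big) \Big]
  = \phi\big(x(T)\big) + \int_0^T dt \,\Big[ -H(t,x,u,\lambda) + \lambda^\top \dot x \Big],

\qquad H(t,x,u,\lambda) \equiv -R(t,x,u) + \lambda^\top f(t,x,u)
```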
This is called a variational argument, and it will lead to some miraculous simplifications. So we compute δC, the change of C under a variation. Differentiating with respect to x gives φ_x δx at the end time (the subscript x means derivative). Then we differentiate the whole integrand: we get the derivative of H with respect to x, with respect to u, and with respect to λ, and we also have to vary the λᵀẋ term with respect to λ and with respect to x. So we get all these terms, and that is all fine, because we want to set the coefficients of δx, δu and δλ equal to zero. But there is one term, involving δẋ, that we have to get rid of, and we can do that by partial integration. Note that δẋ, the variation of dx/dt, equals d(δx)/dt, so we can write the integral from 0 to T of λ δẋ as the integral from 0 to T of λ (d/dt)δx, and partial integration turns this into minus the integral from 0 to T of λ̇ δx, plus the boundary term λ δx evaluated at 0 and T, that is, λ(T)δx(T) − λ(0)δx(0). Now, the variation we do keeps the initial condition fixed: all trajectories start at the same spot, so δx(0) = 0 and that term is absent. We are left with the two remaining pieces: the λ̇ δx term goes into the integral, and the λ(T)δx(T) term joins the end-cost term.

So now we are essentially done, because δC = 0 when the coefficients of δx, δu and δλ are all zero and the boundary term vanishes. The boundary term gives us a boundary condition: the gradient of the end cost has to equal minus λ at the end time. The other coefficients give us differential equations. First we solve ∂H/∂u = 0: since H depends on x, u and λ, setting this derivative to zero formally gives the optimal u as a function of t, x and λ. With that solution inserted, the remaining two conditions read ẋ = ∂H/∂λ and λ̇ = −∂H/∂x, with the initial condition x(0) = x₀ and the end-time condition λ(T) = −φ_x, which came from the boundary term we picked up, which also has to be zero.

Maybe now is the time to wake up again. The upshot is that the variational argument gives you the solution directly in terms of two coupled ordinary differential equations. These are ordinary differential equations like Newton's law, but they are coupled and they have mixed boundary conditions: there are two variables, x and λ; x has a condition at the initial time, and λ has a condition at the end time, which moreover depends on x. So it is a mixed boundary value problem that you have to solve.
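Collected in one place, these are the stationarity conditions just derived, usually called the Pontryagin equations, in the convention above:

```latex
% Stationarity of C with respect to u, x and lambda:
\frac{\partial H}{\partial u} = 0 \;\;\Rightarrow\;\; u^*(t, x, \lambda),
\qquad
\dot x = \left.\frac{\partial H}{\partial \lambda}\right|_{u^*},
\qquad
\dot \lambda = -\left.\frac{\partial H}{\partial x}\right|_{u^*}

% Mixed boundary conditions: x is pinned at t = 0, lambda at t = T:
x(0) = x_0, \qquad \lambda(T) = -\,\phi_x\big(x(T)\big)
```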
If we do this for the mass on the spring, we will see that we get the solution directly. The dynamics was of this form and the cost was of this form; the Hamiltonian H that I defined takes the form λ₁x₂ + λ₂(−x₁ + u). If we compute H*, where we optimize over u (that was the ∂H/∂u step), with u bounded in the interval from −1 to +1, we find that H* has this value; we recognize it is almost the same expression as in the differential equation we had before, with the absolute value in there. From this we compute the equations of motion, ẋ = ∂H/∂λ and λ̇ = −∂H/∂x. The first equation in fact gives us back the original dynamics, the equation of motion where u is replaced by its optimal value. The new thing is the dynamics of λ, and that is exactly the same equation that gave us the sinusoidal solution before, the one we had for the φ's; so it gives the same sine solutions. That is how you use the Pontryagin principle.

Here is another example. Suppose I have a linear quadratic control problem: the change in position is just the control, ẋ = u; I start from some initial value and I want to be at the origin at the end, so I penalize being away from the origin at the end time, and I have a quadratic control cost. If you solve this in the PMP formalism (I am not going to do it; you can just follow the recipe: construct the Hamiltonian, optimize it with respect to u to get a function of x and λ, write down the equations of motion with the boundary conditions, and solve), you find the optimal control, which here equals λ, is of this form: u* = −αx/(1 + α(T − t)). So you get a feedback controller that always steers towards the origin, with a gain that increases with time: as the current time t approaches the end time T, the gain factor α/(1 + α(T − t)) gets stronger and stronger. The picture is that you start in some initial state and want to reach the origin at the end; you steer towards it, and for the same x the control strength increases, so the urgency is larger near the end than early on. This is still in the absence of noise, a very simple deterministic situation. The relation to classical mechanics I am going to skip.

OK, so let's add some stochasticity to this continuous control formulation. We now use capital letters to denote stochastic variables. Suppose first that we are in discrete time, and the current x goes to the new x plus a random increment of ±1. Then after t time steps, x is just the sum of the increments, so we can compute it. Since x is the sum of independently distributed variables, we know its distribution becomes Gaussian, because a sum of independent random variables is Gaussian distributed in the large-sum limit. So we only have to compute the mean and the variance, and then we are done.
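A tiny simulation of this random walk (the step counts and sample sizes are arbitrary choices, not from the lecture), confirming the mean and variance computed next:

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_walkers = 1000, 10_000
steps = rng.choice([-1.0, 1.0], size=(n_walkers, n_steps))  # +/-1 increments
x = steps.cumsum(axis=1)                                    # x_t after t steps

for t in (10, 100, 1000):
    # mean stays near 0; variance grows like t, i.e. the width like sqrt(t)
    print(f"t={t:4d}  mean={x[:, t-1].mean():+.3f}  var={x[:, t-1].var():.1f}")
```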
Each of these terms has expected value 0, so the mean, the expected value of x, is 0; and the variance is the sum of the variances of the terms, each of which is 1, so we just get t. So we see that the fluctuations grow with the square root of time: the width grows as √t.

This we can use as the starting point for the continuous-time formulation. In continuous time we get something similar: the change in x, the x at the new step minus the last one, is what is called a Wiener process, an infinitesimal Gaussian increment dW, which also has mean 0 and a variance which is now ν dt, with dt our time interval. The key insight is that where we had the variance proportional to t, we take that down to the infinitesimal scale: the variance is proportional to dt, because that is our time step, and ν is just a factor setting the size of the noise (we can also take it to be 1). Then x at time t is x at time t₁, which we call x₁, plus the integral of all these increments; this is a stochastic integral. From this we can compute the expected value of x(t): it is the expectation of x₁, which is x₁, plus the expectation of the integral, which is 0. And the variance of the sum is the sum of the variances: x₁ contributes 0, and each increment contributes ν dt, which adds up to ν times the elapsed time. So this variable at time t is described by a Gaussian distribution with mean x₁ and variance ν(t − t₁). In other words, if we start at time t₁ at position x₁, the probability of seeing position x₂ at time t₂ is a Gaussian in x₂ minus the mean, with variance ν(t₂ − t₁). That is the distribution of this stochastic process.

Now we go to stochastic differential equations: we add a nonlinear term, a deterministic drift, which is just the normal dynamics we have in the system. In this case the conditional distribution can be very complex and we do not really know what it looks like, but there are two evolution equations that are useful to describe it. One: we fix the first argument, the initial time and initial position, and ask what happens to ρ as we evolve forward in time. This is known as the Fokker-Planck (forward) equation: it has a so-called drift term, which describes the deterministic part, a diffusion term, which describes the noise part, and an initial condition at the state where you initialize. Two: we can also describe the process by fixing the end variable. Define ψ(x, t) as the probability of being at a fixed state z at some end time, viewed as a function of the initial variable, and look at it as a dynamical process in that variable. This quantity also has a differential equation associated with it, called the Kolmogorov backward equation, and it has boundary conditions at the end time, at capital T.
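For reference, here are the standard constant-noise forms of these objects; I am assuming these match what is on the slides, for the dynamics dX = f(X, t)dt + dW with ⟨dW²⟩ = ν dt:

```latex
% Transition density of the drift-free Wiener process:
\rho(x_2, t_2 \mid x_1, t_1)
  = \frac{1}{\sqrt{2\pi\nu(t_2 - t_1)}}
    \exp\!\left( -\frac{(x_2 - x_1)^2}{2\nu(t_2 - t_1)} \right)

% Fokker-Planck (forward) equation: rho evolves forward in t,
% and the gradient acts on the whole product f*rho:
\partial_t \rho = -\,\partial_x \big( f(x,t)\,\rho \big)
                  + \tfrac{\nu}{2}\,\partial_x^2 \rho

% Kolmogorov backward equation for psi(x,t), end condition at t = T;
% here the gradient acts on psi alone, the diffusion term is identical:
-\,\partial_t \psi = f(x,t)\,\partial_x \psi + \tfrac{\nu}{2}\,\partial_x^2 \psi
```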
That equation also has a drift term, similar to the forward one but not quite: in the forward equation the gradient acts on both factors, the product of f and ρ, while here the gradient acts on only one of them; and it has a diffusion term which is identical. So these are two descriptions that can be used to characterize this stochastic differential equation. For instance, in the Gaussian case the forward process is given by the Gaussian ρ that we already described, and the backward process is also given by a Gaussian, but with a width that increases with the time remaining to the end.

The picture is quite confusing, in a sense, so let me spell it out. The forward picture is easy: I start at the origin, say, I have some diffusion process, and if I take any time, the distribution is a Gaussian. The backward picture says: I fix an end time and a state z, and I ask, from any earlier time and state, what is the probability to arrive at that state z? You can see that if you are very close to z, right in front of it, this probability is large, and if you are far away, the probability to go from that state to the end state is very small. In this case that distribution is again a Gaussian of a similar form, and the closer you get to the end time, the smaller T − t becomes, the narrower that Gaussian gets, because the harder it is to hit that end state. So the forward equation is a diffusion that spreads out one way, and the backward equation is a diffusion that spreads out the other way; but the backward equation is still describing the same good old process, the one that moves forward in time and gets noisier. It should not be confused with something where the noise is actually getting less; that is not the case, it is just a different picture of the same process.

OK, so now we are ready to go into the stochastic optimal control formulation; actually we are almost there, so do not despair. I realize it is a tough ride this afternoon. We now put a control ingredient into this continuous-time stochastic differential formalism. We have a dynamics that also depends explicitly on some control variable, and we have noise, and the noise can also depend on the control and on the state, which is very complex. And we have a cost, which is the expectation value, as we saw before, of an end cost plus a path cost that also depends on the control and on the state, and this expectation value is over all trajectories that start at the current position x and have the control function in them. You should appreciate that the dependence on the control is not only in the cost terms themselves, but also in the expectation value: the expectation is over trajectories, and the trajectories depend on the control. If I steer this way, I get a different expectation value than if I steer that way. So that is the cost we have, and now we want to optimize this C with respect to the whole control function. It is very similar to what we have seen; just bear with me for this last little bit.
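Written out in the simplest setting (I take the noise level ν constant, although the lecture notes that it may in general depend on x and u), the problem just posed is:

```latex
% Controlled stochastic dynamics:
dX = f(X, u, t)\, dt + dW,
\qquad \langle dW \rangle = 0, \quad \langle dW^2 \rangle = \nu\, dt

% Expected cost: end cost plus path cost, averaged over the trajectories
% generated by the control function u(.) from the current state x at time t:
C\big(x, t, u(\cdot)\big)
  = \Big\langle\, \phi\big(X(T)\big)
      + \int_t^T ds\; R\big(X(s), u(s), s\big) \,\Big\rangle_{X(t)=x}
```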
We have the cost-to-go, with the same old Bellman recursion that we had on the first slide in the discrete-time case, but now with an expectation value. So we write J(x + dx, t + dt) as a Taylor expansion: J at the current t and x, plus dt times the derivative in t, plus dx times the derivative in x, and here the new thing comes: plus dx² times the second derivative in x. Why we have to do that becomes clear immediately, because now we take the expectation value of this expression. The expectation of the J(x, t) term is just J(x, t), because it does not depend on any stochastic variable. The expectation of the dx term gives f dt, because dx = f dt + dW and dW has expectation zero, so only the first piece survives. And then the dx² term: dx² = (f dt + dW)² = f² dt² + 2 f dt dW + dW². The f² dt² piece is higher order in dt, so we can ignore it, since we only keep the lowest order in dt; the cross term has expectation value zero; but the last piece is actually not zero, because the expectation of dW² is ν dt, which is linear, first order in dt. That is why we have to take the second derivative of J: it picks up a term proportional to dt, and that is the whole difference from the situation before. If we put this in, we can do the same as before: the J cancels the J, everything that remains becomes proportional to dt, we divide by dt, we take the ∂J/∂t to the other side, and we are left with a minimization over u. This is the grand master result of stochastic optimal control. We have the same equation as before plus one extra term: in the case that the noise is zero, if ν is zero, this term is absent and we reduce to the previous case.

Now we have to solve this equation, and as was already remarked: how do you solve this? In general it is very, very hard. For the rest of the lectures I am going to talk about two classes. Today I will talk a little about linear quadratic control: there you can make an ansatz for J and actually solve this Bellman equation, and the solution is known as the Riccati equations, which are very well known and well studied; we are going to look at that. The other case is path integral control, in which we make a certain assumption about the dynamics; that is for tomorrow, and in that case we can also make progress. So for today, let's look at the linear quadratic control case. If the dynamics is linear, meaning dx is linear in x and linear in u, and the noise is also linear in x and linear in u and white; and if the cost function is quadratic, meaning the end cost is quadratic and the path cost is quadratic in x and in u; then the optimal cost-to-go is quadratic in x. So we can postulate that the optimal cost-to-go is a quadratic term in x plus a linear term plus a constant, fill this into the equation, and solve for it.
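For reference, this grand master result, the stochastic Hamilton-Jacobi-Bellman equation, in the one-dimensional constant-ν form derived above; this is the equation into which the quadratic ansatz is about to be inserted:

```latex
% Stochastic HJB equation; the nu-term is the only difference from the
% deterministic case, and the end condition is J(x,T) = phi(x):
-\,\partial_t J(x,t)
  = \min_u \Big[\, R(x,u,t) + f(x,u,t)\,\partial_x J(x,t)
                 + \tfrac{\nu}{2}\,\partial_x^2 J(x,t) \,\Big]
```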
Since we made the space dependence of J explicit in the ansatz, we are only left with time dependence, and this gives rise to ordinary differential equations for the coefficients P(t), α(t) and β(t) in terms of the parameters of the model. The result is these three equations: the time derivative of P, which is a matrix between states; this one for α; this one for β; and these other symbols are just definitions that enter here to make it a little more concise. So these are just ordinary differential equations and you can solve them. It is not particularly useful to look at this in general, so let's look at some simple examples.

For instance, take the case where the dynamics is of this form, with noise, and we have the same problem as before: we want to steer towards the origin, but now the system is noisy. This is our end cost and this is our path cost, so all the constants in the general formulation, the a, b, c and d, take particular values, and everything becomes very simple. The Riccati equations in this case reduce to these three equations. We get Ṗ = P², with end condition P(T) = g: the end cost was this quadratic term, and P is the quadratic term in J, so since J at the end time has to equal φ, P has to equal g at the end. The α is the linear term in the cost-to-go, and it has boundary condition zero; and β also has some equation. Now, since α is zero at the end time and every term in its equation is proportional to α, the solution is α = 0 for all times. The P equation we can solve very easily: P(t) = 1/(C − t), with the constant C fixed by the boundary value, C = T + 1/g. And β is again not relevant, for the same reason as before: it affects the optimal cost-to-go but not its derivative, which is what gives the control. So the control is u = −(Px + α) with α = 0, which is just u = −Px, and we can solve it and get this solution. Here again we see the same as in the noiseless case: steering towards the origin with a gain factor that increases with time.

The control picture is now this: we start here and we want to go there; typically the diffusion will pass like this, and asymptotically at the end time you want to be at the origin. You may not want to be completely at the origin, because there is a trade-off: there is a benefit to being in the origin, but there is also a cost you pay for steering. So depending on the size of g, the spread at the end may be wider or narrower; it depends on how much you value this term versus that term. But the overall picture is that the diffusion first expands, because the control is not very strong yet, and at later times the control gets stronger and pushes the state to the origin.

A particularly interesting limit is when you take g to infinity: the cost of not being at the origin becomes infinitely large, and the control law becomes simply u = −x/(T − t).
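Here is a small Euler-Maruyama sketch of this controller. The numbers (ν = 1, horizon T = 1, weight g, initial state 1) are my own choices for illustration; the gain is the Riccati solution P(t) = 1/(T − t + 1/g) just derived.

```python
import numpy as np

rng = np.random.default_rng(1)
nu, T, g = 1.0, 1.0, 50.0         # noise level, horizon, end-cost weight (assumed)
dt, n = 1e-3, 10_000              # time step and number of sample trajectories

x = np.full(n, 1.0)               # all trajectories start at x = 1
for t in np.arange(0.0, T, dt):
    P = 1.0 / (T - t + 1.0 / g)   # Riccati gain, grows as t -> T
    x += -P * x * dt + rng.normal(0.0, np.sqrt(nu * dt), size=n)

# The end spread shrinks as g grows; for g -> infinity the gain becomes
# 1/(T - t) and the process is pinned to the origin (the Brownian bridge).
print(f"std of x(T) = {x.std():.3f}")
```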
So now you see that you really do go from the start to the origin. This problem is known as the Brownian bridge, a well-known diffusion problem. And you see that this Brownian bridge, whose solution in the usual formulation depends not only on the initial state but also on the end state, can be formulated as a control problem where a controller turns it into just a simple Markov process.

Let me step back a bit. One way to think of the Brownian bridge formulation is: I start with the initial condition at zero, and I want a process that also ends up at zero at the end time. You take information from the future and from the past to compute where your solution should be, so this does not look like a causal, forward-progressing kind of thing that you can solve step by step; you need information about the future. Now, what the control theory is actually doing for you (think again of the anticipated potential) is taking this information from the future, transferring it to the current time, and telling you what control to use at the current time. In that language you can use a first-order Markov process, a purely forward process, the one here on the slide: wherever you are, just move with this control, and everything will be fine; you will end up at the origin with probability one. So it takes this two-time object and makes it into a causal, Markov structure.

OK, we are almost out of time, but let me just do one more example, and then I think we will stop. In the previous example there was an end cost, and you saw that the control was increasing with time, getting more and more urgent towards the end. Here is another example: we have the same dynamics, but now the state cost is a path cost, so it is there at each time; it sits in this R (where is my mouse...), inside the integral over all time, and that gives this cost. If you now do the Riccati equations you get these equations (the details do not matter), and you find that the optimal control is of this form; you can solve it and you get a solution that looks like this. The gain of the control, the factor multiplying x, is a feedback gain which is constant at early times (the time horizon here is 10), and towards the end the gain gets smaller. Maybe you can think about why this is happening. Who understands why this gain is getting smaller? You want to get to the origin, but after some time you say: well, I do not care anymore. Nobody?

It is the path cost, right? We have this interval from 0 to T, and if I am at some time and some state, the controller has to decide: shall I steer towards the origin, which is going to cost me u²? And I have to trade that off against what I can gain, namely how close I am going to get to the origin in the future, because that is what the path cost rewards. If I have a long time in front of me, I should steer by a certain amount, and that is what happens all the time early on.
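To make the shape of that gain curve concrete: assuming the path cost is (q/2)x² + (1/2)u² (my guess at the setup on the slide, not stated explicitly in the lecture), the scalar Riccati equation and its solution are:

```latex
% Riccati equation for this path-cost problem; no end cost, so P(T) = 0:
\dot P = P^2 - q, \qquad P(T) = 0

% Solution: flat at sqrt(q) while the remaining horizon is long,
% and falling to zero as t -> T, matching the plotted gain u = -P(t) x:
P(t) = \sqrt{q}\; \tanh\!\big( \sqrt{q}\,(T - t) \big)
```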
But at some point near the end, it does not really benefit you anymore to steer towards the origin, because the expected future gain from that control solution is not going to pay off; it is not worth your while. In particular, if you are right at the end, there is nothing left to gain. That is why it is optimal to stop steering as you approach the end time when you have this path cost, unlike the other case.

OK. Notice that in this last example the optimal control is independent of the noise: you see that this control solution is proportional to P, and the noise does not appear in it. This is again the certainty equivalence feature: for linear quadratic control problems, the optimal control does not depend on the noise. In general, for these kinds of systems, this is true when the c and d coefficients, the state and control dependence of the noise, are zero; then you get certainty equivalence, and otherwise not.

So that is what I wanted to say today. It has been a bit of a rough ride, I realize, but the upshot is that in one and a half hours you have learned everything about classical control theory that you may have wanted to know. The story is: control leads to a Bellman equation; the Bellman equation is very hard to solve, because it is a partial differential equation. You can solve it in the linear quadratic case, because there it yields the Riccati equations, and then it is just polynomial: for an n-dimensional system the Riccati equations are matrix equations for n-by-n matrices, so that is all fine; you can do a thousand-dimensional system. As soon as you get out of this class of linear quadratic control problems, things get hard. If there is no noise, you can still use the PMP formalism, which gives these ordinary differential equations with mixed boundary conditions; you can do that in the noiseless case. But in the noisy, nonlinear case there is really nothing out there, except for these path integral methods, which we are going to study tomorrow. There, the J that we discussed so much, the optimal cost-to-go, the solution of the Bellman equation, is going to get an explicit form: we are going to say, this J is a path integral. That is the trick we are going to use, and then we are going to compute that path integral and do some applications with it. So, with no further ado: see you tomorrow.